Accepted Manuscript — Future Generation Computer Systems
DOI: https://doi.org/10.1016/j.future.2017.12.005
Received 27 September 2017; revised 24 November 2017; accepted 3 December 2017

Link based BPSO for feature selection in big data text clustering

Neetu Kushwaha, Millie Pant
Department of ASE, Indian Institute of Technology Roorkee, Roorkee 247001, India
[email protected], [email protected]

Abstract
Feature selection is a significant task in data mining and machine learning applications; it eliminates irrelevant and redundant features and improves learning performance. This paper proposes a new feature selection method for unsupervised text clustering named link based particle swarm optimization (LBPSO). This method introduces a new neighbour selection strategy in BPSO to select prominent features. The performance of traditional particle swarm optimization (PSO) for feature selection is limited by its global best updating mechanism. Instead of using the global best, LBPSO particles are updated based on the neighbour best position to enhance the exploitation and exploration capability. The selected prominent features are then tested using the k-means clustering algorithm to improve the performance and reduce the computational cost of the proposed algorithm. The performance of LBPSO is investigated on three published datasets, namely Reuters-21578, TDT2 and TR11. Our results based on several evaluation measures show that the performance of LBPSO is superior to other PSO based algorithms.

Keywords: Big Data, Text Clustering, Particle Swarm Optimization, Scale Free Network, k-means, Feature Selection

1. Introduction

In recent years, there has been continuous growth of internet technology, resulting in a tremendous amount of text information. It is very difficult to process this text information manually and to extract valuable information from a document corpus in time. Finding desired information in an age of 'big data' has been a challenge for standard information retrieval technology. Text analysis in the domain of text mining requires complex techniques to deal with numerous text documents. Text clustering (TC) is one of the most efficient techniques used in text mining, machine learning and pattern recognition. With a good text clustering method, computers can automatically organize a document corpus into several hierarchies of semantic clusters [1]. It is the process of organizing documents into meaningful groups in such a way that documents of the same group are more similar to one another than documents belonging to different groups: text documents in the same cluster share the same topic, and different clusters represent different topics. In order to apply a text clustering algorithm, we need to transform the raw text documents into a numerical format that captures each document's characteristics. To extract interesting patterns and insights from them, the most fundamental and crucial step is document representation. In text clustering, documents are represented by their terms, which are either single terms or multi-word terms. The vector space model (VSM) is a very commonly used model for document representation in text clustering [2]. In VSM, each term in the document is considered as a feature/dimension. Text documents contain high dimensional informative and uninformative (irrelevant, redundant and noisy) features [3]. High dimensionality is therefore an ultimate challenge for text documents: text clustering algorithms by themselves do not perform any type of feature selection (FS), and dimension reduction alone struggles because of the vast number of text features and uninformative text features [4]. The efficiency of text clustering is affected by the dimensionality of the text documents. The purpose of FS algorithms is to remove irrelevant or redundant features from the original set of features, without sacrificing prediction accuracy or computational time, and to find a new subset of relevant features [5]. As we decrease the dimension of the text documents, the accuracy of the clustering algorithm increases. Feature selection techniques not only increase clustering accuracy and efficiency but also reduce computational complexity [6].

Feature selection methods can be broadly categorized into three types based on their search strategies: filter, wrapper and embedded methods. A filter method selects the relevant features without using any learning algorithm. On the other hand, a wrapper method uses a learning method to select informative features; generally, wrapper methods outperform filter methods in terms of classification accuracy. Embedded methods integrate feature selection into the learning model so as to achieve high accuracy or good performance with moderate computational cost (e.g. support vector machines and least squares regression). Wrapper methods can themselves be broadly classified into sequential and heuristic algorithms. A sequential method starts with an empty set and adds features at every step until the maximum objective function value is achieved, while in heuristic search, evaluation is performed on different subsets of features to optimize the objective function value. In the literature, several feature selection methods have been introduced, such as mutual information [7] and sequential search algorithms [8]. Five feature selection methods, including the χ2 statistic, document frequency, term strength, information gain (IG) and mutual information, were compared by Yang and Pedersen [9]. Principal component analysis (PCA) has been successfully used in text categorization [10], [11], and neural networks have also been widely applied by many researchers [11]–[14].

Feature selection is a type of discrete optimization problem. Various search techniques, such as sequential forward and backward feature selection (SFS, SBS), have been proposed in the literature to solve feature selection problems. However, these techniques may suffer from premature convergence or high computational complexity. To alleviate these problems, evolutionary computation (EC) techniques, which are population based solvers, produce near-optimal solutions with less computational cost for discrete optimization problems. These techniques have been widely used for finding the global optimum and are gaining popularity day by day. Many meta-heuristic algorithms, such as particle swarm optimization [15], [16], artificial bee colony (ABC) [17], genetic algorithms (GA) [18], [19], ant colony optimization [20] and others [21], [22], have been used for the feature selection problem. PSO is an efficient and robust population based optimization algorithm [23] in which a population of particles moves in a D-dimensional search space. PSO requires less computational time and can converge more quickly than other meta-heuristic algorithms [24]. PSO is an efficient algorithm for feature selection in the text clustering area, as shown by different researchers. Chuang et al. [25] presented a hybrid method based on Pearson correlation and Taguchi chaotic binary PSO (BPSO) for feature selection; the proposed algorithm outperformed other algorithms on ten gene expression datasets. Ghamisi et al. [26] introduced a new feature selection algorithm based on the integration of a genetic algorithm with particle swarm optimization. Banka and Dara [27] proposed a Hamming distance based BPSO in which the Hamming distance is used as a proximity measure to update the velocity of the particles. The use of bare bones PSO for feature selection was introduced by Zhang et al. [28]; in this method, a reinforced memory strategy is used for updating the local leaders of particles, and a uniform combination of mutation and crossover is introduced to balance the local and global searching abilities of the PSO algorithm. Chuang et al. [29] presented an improved version of BPSO for FS named catfish BPSO, which introduces new particles, called catfish, into the search space to replace the worst particles in the population if the global best particle has not changed or improved over the last consecutive iterations. Bharti et al. [4] proposed a new feature selection method based on BPSO with an opposition-based initialization to achieve a better final solution and a new strategy to generate a dynamic inertia weight; the proposed algorithm obtained better computation time and clustering accuracy as compared to other methods on three benchmark text datasets. Abualigah et al. [30] proposed a feature selection algorithm based on a hybridization of PSO with the genetic operators of mutation and crossover (HFSPSOTC).

In this paper, we introduce a new neighbour selection strategy in BPSO for feature selection in text clustering. The proposed algorithm makes full use of the distinguishing ability of each feature and fully considers the relevance and redundancy of each feature. Two genetic operators are used in the end to avoid the stagnation problem and to explore towards the global optimum solution. Moreover, the mean absolute difference (MAD) is used to evaluate the fitness of the particles. The remainder of the paper is organized as follows.
Section 2 presents a brief introduction to text clustering, BPSO, the scale-free Barabasi & Albert model and the k-means algorithm. Section 3 describes the proposed neighbourhood selection strategy and implementation details, and Section 4 presents the resulting LBPSO feature selection method. Section 5 details the experiments that evaluate our method and the analysis of results. Finally, the last section presents the conclusions of this study.

2. Background
This section describes text clustering, the basic binary PSO, scale-free networks and the Barabasi & Albert (BA) model, and gives a brief introduction to the k-means algorithm.

2.1. Text Clustering
In text clustering, a set of large text documents is partitioned into a set of clusters based on their intrinsic properties. From the optimization perspective, text clustering is a particular type of NP-hard grouping problem. Mathematically it can be defined as follows. Let $D = \{d_1, d_2, \ldots, d_n\}$ be a set of text documents, where $n$ denotes the number of all documents that are to be grouped. The objective is to determine a partition $C = \{C_1, C_2, \ldots, C_K\}$ fulfilling the conditions:

$C_k \ne \emptyset$, $k = 1, 2, \ldots, K$;
$C_k \cap C_{k'} = \emptyset$ if $k \ne k'$;
$\bigcup_{k=1}^{K} C_k = D$;

and objects belonging to the same cluster are as similar to each other as possible, while objects belonging to different clusters are as dissimilar as possible.

Before applying a clustering algorithm, standard pre-processing steps are used to pre-process the text documents, which involve tokenization, stop word removal, stemming and term weighting. Fig. 1 shows the phases of the text clustering process. These pre-processing steps convert the text documents into a numerical format or matrix. A brief description of the standard pre-processing steps is as follows:

a) Tokenization
In this process the text data is split into a sequence of basic independent units (words or terms) called tokens.

b) Stop words removal
The stop word removal process removes certain common words that occur most frequently. Examples of such words are 'a', 'an', 'the', 'that', 'are', etc. A list of such words, containing 571 words, can be found at http://www.unine.ch/Info/clef/.

c) Stemming
Stemming refers to converting words to their root form for simplicity. For example, 'introduce' and 'introduction' will be treated as introduce, and 'computation', 'computer', 'computes' and 'computing' will be treated as compute.

d) Term weighting
Term weighting is used to convert the textual information into a numerical format. Many term weighting schemes are available in the literature. TF-IDF (term frequency - inverse document frequency) is the most widely used weighting scheme.
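To make the pre-processing phase concrete, the short Python sketch below (illustrative only, not part of the original MATLAB implementation) chains tokenization, stop-word removal, Porter stemming and TF-IDF weighting using scikit-learn and NLTK; the two sample documents are placeholders.

import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stem_tokenizer(text):
    # Tokenization: split the text into alphabetic tokens, then stem each token.
    return [stemmer.stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]

docs = ["The introduction introduces text clustering.",            # placeholder documents
        "Computation, computer and computing reduce to compute."]

# stop_words='english' drops common function words; TF-IDF weights the remaining terms.
vectorizer = TfidfVectorizer(tokenizer=stem_tokenizer, stop_words='english')
X = vectorizer.fit_transform(docs)        # documents x terms weight matrix (sparse)
print(X.shape, sorted(vectorizer.vocabulary_))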

Fig. 1: Phases of text clustering process

2.2. Binary particle swarm optimization
The PSO algorithm works as follows. Initially, PSO generates a random population consisting of $N$ individuals or particles, each representing a potential solution of the optimization problem. Each particle has a fitness value which can be calculated using the fitness function. Every particle $i$ is represented by the vectors $X_i$, $V_i$ and $P_i$, where $V_i$ represents the velocity of particle $i$, and $X_i$ and $P_i$ denote the position and the personal best position found by particle $i$ in its history, respectively. In PSO, each particle moves to a new position based on its old position and the new velocity of the particle.

$X_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,D})$   (1)

$V_i = (v_{i,1}, v_{i,2}, \ldots, v_{i,D})$   (2)

$P_i = (p_{i,1}, p_{i,2}, \ldots, p_{i,D})$   (3)

$G = (g_1, g_2, \ldots, g_D)$   (4)

where $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, D$. The global best position is represented as $G$.

Based on the current velocity and position, the next velocity and position are calculated by the following equations (5) and (6), respectively:

$v_{i,j}(t+1) = w \, v_{i,j}(t) + c_1 r_1 \left( p_{i,j} - x_{i,j}(t) \right) + c_2 r_2 \left( g_j - x_{i,j}(t) \right)$   (5)

$x_{i,j}(t+1) = x_{i,j}(t) + v_{i,j}(t+1)$   (6)

where $x_{i,j}(t)$ and $v_{i,j}(t)$ represent the current position and velocity of the particle, $x_{i,j}(t+1)$ and $v_{i,j}(t+1)$ represent the next position and velocity of the particle, $w$ is the inertia weight, and $c_1$ and $c_2$ are the acceleration coefficients. $r_1$ and $r_2$ are two random numbers generated uniformly between 0 and 1. Also, $V_{min} \le v_{i,j}(t+1) \le V_{max}$, where $V_{max}$ and $V_{min}$ are the maximum and minimum velocities at which a particle accelerates in a bounded search space. In Eq. (5), the second and third terms represent the cognition and social terms, respectively. The inertia weight is used to control the global exploration and local exploitation of the search area.

Two common social topology models in PSO are the global best (gbest) and the local best (lbest). In the gbest model, each particle learns from the global best position found by the entire population, while in the lbest model the velocity of each particle is modified by considering its own personal experience as well as that of its nearest neighbours.

PSO was initially designed for continuous optimization problems. Later, Eberhart and Kennedy [31] modified it to tackle binary optimization problems as well. Algorithm 1 shows the BPSO algorithm. Binary PSO (BPSO) primarily extended the original PSO by using the sigmoid transfer function (as shown in Fig. 2) to transform the value of the velocity from the continuous space into the binary space. In BPSO, at each iteration, the velocity is modified according to Eq. (5) and the position $x_{i,j}(t+1)$ is modified according to the following:

$x_{i,j}(t+1) = \begin{cases} 1, & \text{if } rand() < S\left(v_{i,j}(t+1)\right) \\ 0, & \text{otherwise} \end{cases}$   (7)

where $S(\cdot)$ is the sigmoid transfer function which denotes the probability that the $j$-th bit takes the value 0 or 1:

$S\left(v_{i,j}\right) = \dfrac{1}{1 + e^{-v_{i,j}}}$   (8)

Fig. 2: Sigmoid transfer function

Algorithm 1: BPSO algorithm
Input: Initialize the population and the PSO parameters $c_1$, $c_2$, $w$. $T_{max}$: maximum number of iterations; $X$: current positions of the particles in the swarm; $D$: dimension of the dataset; $N$: number of particles in the swarm; $pbest$: particle historical personal best positions; $gbest$: global best position.
Output: $gbest$ and its objective function value.
Algorithm:
1. Evaluate each solution based on the objective function.
2. Set $pbest$ of each particle and $gbest$ of the whole population.
3. while the maximum number of iterations ($T_{max}$) is not reached do
4.   for all particles, for each dimension
5.     Update the velocity of the $i$-th particle using Eq. (5).
6.     Update the position of the $i$-th particle using Eq. (6).
       end for
7.   end for
8.   for all particles
9.     if $f(X_i) < f(pbest_i)$ then
10.      $pbest_i = X_i$
11.    end if
12.    if $f(pbest_i) < f(gbest)$ then
13.      $gbest = pbest_i$
14.    end if
       end for
15. end while
16. Return the $gbest$ particle with its objective function value (a new subset of text features).
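As an illustration of Eqs. (5), (7) and (8), the sketch below (ours, not the authors' MATLAB code) performs one BPSO iteration for a swarm of binary positions; the parameter values are placeholders.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    # Eq. (8): transfer function mapping a velocity to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-v))

def bpso_step(X, V, pbest, gbest, w=0.9, c1=2.0, c2=2.0, v_max=6.0):
    """One BPSO iteration: velocity update (Eq. 5) and sigmoid position update (Eq. 7)."""
    N, D = X.shape
    r1, r2 = rng.random((N, D)), rng.random((N, D))
    V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X)     # Eq. (5)
    V = np.clip(V, -v_max, v_max)                                  # bounded velocity
    X = (rng.random((N, D)) < sigmoid(V)).astype(int)              # Eqs. (7)-(8)
    return X, V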

2.3. Scale-free BA model
Networks are ubiquitous in nature and society, describing various complex systems, e.g. genetic networks with genes or proteins connected by chemical interactions, or social networks composed of individuals interacting through social connections. Despite their diversity, most networks appearing in nature follow universal organizing principles. In particular, it has been found that many networks of scientific interest are scale-free in nature. This paper exploits the structural characteristics of scale-free networks within a PSO population. Fig. 3 shows an example of a scale-free network.

Barabasi and Albert proposed a scale-free model in 1999 [33], called the BA model, where the distribution of the number of connections to each node (the degree) follows a power law. The BA model works on the following properties:

Growing size: at every time step, a new node is added and connects with a fixed number of edges to nodes which are already present in the network.

Preferential attachment: the probability of connecting the new node to an existing node $i$ depends on the degree $k_i$ of node $i$ and is given by Eq. (9):

$\Pi(i) = \dfrac{k_i}{\sum_j k_j}$   (9)

Fig. 3: (a) A network with 36 nodes and 44 links; the contribution of the hubs (highly connected nodes, in red) to the overall connectivity of the network is dominant, and (b) illustrates that the hubs follow a power law.

2.4. k-means
k-means is a simple partitional clustering algorithm developed by MacQueen in 1967 [34]. It is a two-phase iterative process in which the whole dataset is divided into K disjoint clusters [35]. Its popularity is due to its simplicity and computational efficiency. Algorithm 2 shows the k-means clustering algorithm.

Algorithm 2: k-means
Input: Dataset $X = \{x_i \in \mathbb{R}^d, i = 1, \ldots, n\}$, K (number of clusters)
Output: $C = \{C_1, C_2, \ldots, C_K\}$, K cluster centroids
Algorithm:
1. Initialize K cluster centroids randomly from the text dataset.
2. Data assignment: assign each data point to the closest cluster centroid based on some distance measure. For every data point $x_i$ in the dataset:
   $\arg\min_j \, \| x_i - c_j \|$
3. Centroid estimation: update each cluster centroid value to the mean value of that cluster:
   $c_j = \dfrac{1}{|C_j|} \sum_{x_i \in C_j} x_i$
   where $|C_j|$ is the number of data vectors in cluster $j$.
4. Repeat steps 2 and 3 until convergence.
5. Return $C = \{C_1, C_2, \ldots, C_K\}$, K centroids.
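A compact NumPy sketch of Algorithm 2, written here for illustration (the paper's experiments use a MATLAB implementation), is given below.

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain k-means on a documents x features matrix X, following Algorithm 2."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]        # step 1: random centroids
    for _ in range(max_iter):
        # Step 2: assign every document to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the documents assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(K)])
        if np.allclose(new_centroids, centroids):                   # step 4: convergence check
            break
        centroids = new_centroids
    return labels, centroids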

3. Proposed methodology
3.1. Neighbourhood selection strategy in PSO
Like other meta-heuristic algorithms, PSO suffers from some inherent drawbacks, such as premature convergence and getting trapped in local minima. Sometimes the suboptimal solutions are near the global optimum, and the neighborhoods of trapped individuals may contain the global optimum. In such situations, searching the neighborhoods of individuals helps to find better solutions [36]. Neighborhood selection is a crucial step in any optimization algorithm as it affects the convergence speed and the quality of the final solution. The proposed neighborhood selection strategy is divided into three phases: in the first phase, a scale-free topology is constructed using the Barabasi-Albert (BA) model, which provides the neighbor information matrix. In the second phase, the Link matrix is computed using Eq. (10). In the third phase, the rank of each particle is calculated through the Link matrix and the neighborhood of each particle is selected based on the ranks. Algorithm 3 describes the steps of the neighbourhood selection strategy.

3.2. Scale-free topology construction
In LBPSO, each particle is represented as a node of a scale-free network. For this, the BA model is adapted to generate the topology of the swarm. Initially, a fully connected network is constructed from an initial set of particles. A new particle is then added to the network at each time step and is connected to existing particles with a predefined probability that is proportional to the degree of the existing particle (i.e. following Eq. (9)).

3.3. Computation of the Link matrix
The link between two particles denotes the common neighbours of the two particles [37], [38]. For example, Link(i, j) is the number of common neighbours of the two particles i and j. The neighbour information about every particle can be represented as an N x N matrix M, where N denotes the number of particles in the population. The value of each M(i, j) is 0 or 1 depending upon whether the particles i and j are neighbours or not. The Link function can be calculated by multiplying the i-th row of the neighbour matrix M with its j-th column:

$\mathrm{Link}(i, j) = \sum_{k=1}^{N} M_{i,k} \, M_{k,j}$   (10)

3.4. Particle importance calculation
After obtaining the common neighbours, the rank values of each particle pair are calculated using Link(i, j). A particle that has a larger number of common neighbours gets a higher rank value; otherwise it gets a smaller rank value. By ranking, we can evaluate which neighbor is influencing the particle the most. The link function uses the knowledge of neighbour particles in evaluating the relationship between two particles and is reported as one of the efficient methods for measuring the closeness of two particles. Based on this concept, we use this approach for searching the neighbourhoods in BPSO to select the relevant features in text clustering. Algorithm 3 shows the pseudo code of the neighbour selection strategy.

Algorithm 3: Neighbour selection strategy
Input: Size of the whole population (N)
Output: Best neighbor of each particle (nbest)
Algorithm:
Step 1: Generate a scale-free network using the BA model.
Step 2: for each pair of particles, find the links using Eq. (10). end for
Step 3: Calculate the rank values of each particle pair.
Step 4: The particle having the highest rank value acts as the neighbor best (nbest) of particle i.
return nbest
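The following sketch illustrates one way to realise Algorithm 3 in Python. It assumes a NetworkX BA graph as the swarm topology and, as our reading of the strategy, restricts each particle's nbest to its highest-ranked direct neighbour.

import numpy as np
import networkx as nx

def select_nbest(num_particles, m_edges=2, seed=0):
    """Neighbour selection (Algorithm 3): BA topology, Link matrix of Eq. (10), rank-based nbest."""
    G = nx.barabasi_albert_graph(num_particles, m_edges, seed=seed)  # step 1: scale-free topology
    M = nx.to_numpy_array(G)                    # neighbour information matrix (0/1 adjacency)
    link = M @ M                                # step 2: Link(i, j) = number of common neighbours
    nbest = np.empty(num_particles, dtype=int)
    for i in range(num_particles):
        neighbours = np.flatnonzero(M[i])       # candidate neighbours of particle i
        nbest[i] = neighbours[np.argmax(link[i, neighbours])]   # steps 3-4: highest-ranked neighbour
    return nbest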

3.5. Solution representation
In LBPSO, each particle in the swarm represents one candidate solution for the problem. Each particle holds a position vector of dimension D whose entries are either 0 or 1, as shown in Fig. 4. Each dimension is treated as one feature. From Fig. 4, we can see that the particle has D (here D = 9) features whose values are either 0 or 1. If the value at position j is 1, the j-th feature is selected; otherwise it is not selected.

Fig. 4: Solution representation of a particle

3.6. Objective function
The objective function for the proposed algorithm is the mean absolute difference (MAD) [30]:

$\mathrm{MAD}(d_i) = \dfrac{1}{a_i} \sum_{j=1}^{t} \left| x_{i,j} - \bar{x}_i \right|, \qquad \bar{x}_i = \dfrac{1}{a_i} \sum_{j=1}^{t} x_{i,j}$   (11)

where $a_i$ is the number of selected features in text document $d_i$, $\bar{x}_i$ is the mean value of the vector, $x_{i,j}$ is the weighting value of feature $j$ in document $d_i$, and $t$ is the number of features in the original text dataset.
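A small sketch of the MAD fitness of Eq. (11) is shown below; averaging the per-document values into a single particle fitness is our assumption, and tfidf and mask are placeholder names for the weighted document-term matrix and a particle's binary position.

import numpy as np

def mad_fitness(tfidf, mask):
    """Mean absolute difference (Eq. 11) of the features selected by a binary mask."""
    selected = tfidf[:, mask.astype(bool)]            # keep only the selected feature columns
    a = selected.shape[1]                             # number of selected features
    if a == 0:
        return 0.0
    mean = selected.mean(axis=1, keepdims=True)       # per-document mean weight of selected terms
    mad_per_doc = np.abs(selected - mean).sum(axis=1) / a
    return float(mad_per_doc.mean())                  # assumed: average over all documents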

3.7. Mutation and crossover operation in LBPSO
The proposed algorithm also takes advantage of the crossover and mutation operations of the genetic evolutionary algorithm to improve the convergence rate. The mutation process of the GA has been added to the proposed algorithm to overcome the problem of stagnation.

Crossover operator
The crossover operation [39], widely used in the GA, is adapted to increase the exploration and exploitation capability of the PSO. It helps the PSO to prevent premature convergence. The crossover operation for PSO can be described as follows: position components are recombined dimension-wise with a pre-determined crossover rate $CR = 0.90$ (Eq. (12)), for $i \in \{1, 2, \cdots, N\}$ and $j \in \{1, 2, \cdots, D\}$.

Mutation operator
Mutation [39] has been introduced into the PSO as a mechanism to increase the diversity of PSO, and by doing so improve the exploration abilities of the algorithm. This operation for PSO can be described as follows: whenever a uniformly generated random number $\mu \in [0, 1]$ falls below the pre-determined mutation rate $MR = 0.20$, the selected bit is flipped, $x_{i,j} \leftarrow 1 - x_{i,j}$ (Eq. (13)), for $i \in \{1, 2, \cdots, N\}$ and $j \in \{1, 2, \cdots, D\}$.

4. Proposed method for feature selection
To preserve swarm diversity in the PSO without slowing down convergence significantly, a neighbourhood selection strategy in PSO (LBPSO) is proposed. Instead of learning from pbest and gbest, LBPSO learns from pbest and nbest (a neighbour selected by the neighbourhood selection strategy) in the early stage to enhance diversity. The velocities and positions of LBPSO are updated according to Eqs. (14) and (15), respectively. Unlike in the global best model, in LBPSO each particle learns from its own nbest in the social learning part. Algorithm 4 shows the proposed LBPSO feature selection algorithm.

$v_{i,j}(t+1) = w \, v_{i,j}(t) + c_1 r_1 \left( pbest_{i,j} - x_{i,j}(t) \right) + c_2 r_2 \left( nbest_{i,j} - x_{i,j}(t) \right)$   (14)

$x_{i,j}(t+1) = \begin{cases} 1, & \text{if } rand() < S\left(v_{i,j}(t+1)\right) \\ 0, & \text{otherwise} \end{cases}$   (15)

Algorithm 4: Proposed Link-BPSO (LBPSO) algorithm
Input: Initialize the population and the PSO parameters $c_1$, $c_2$, $w$. M: scale-free network matrix; X: current positions of the particles in the swarm; $T_{max}$: maximum number of iterations; D: dimension of the dataset; N: number of particles in the swarm; pbest: particle historical personal best positions; gbest: global best position.
Output: gbest and its objective function value (subset of text features).
Algorithm:
1. Set pbest of each particle and nbest (obtained using Algorithm 3) of the whole population.
2. while the maximum number of iterations ($T_{max}$) is not reached do
3.   for all particles, for each dimension
4.     Update the velocity of the i-th particle using Eq. (14).
5.     Update the position of the i-th particle using Eq. (15).
6.     Apply the genetic operators:
7.       Crossover operator
8.       Mutation operator
       end for
9.   end for
10.  for all particles
11.    if f(X_i) < f(pbest_i) then
12.      pbest_i = X_i
13.    end if
14.    if f(pbest_i) < f(gbest) then
15.      gbest = pbest_i
16.    end if
       end for
17. end while
18. Return the gbest particle with its objective function value (a new subset of text features).
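Putting the pieces together, the sketch below outlines one LBPSO iteration in the spirit of Algorithm 4. It reuses the helper functions from the earlier sketches (sigmoid, select_nbest, crossover, mutate, mad_fitness); treating the neighbour's personal best as the social exemplar and maximising the MAD fitness are our assumptions.

import numpy as np

def lbpso_step(X, V, pbest, pbest_fit, tfidf, nbest_idx,
               w=0.9, c1=2.2, c2=2.2, v_max=6.0, rng=np.random.default_rng(2)):
    """One LBPSO iteration: nbest-guided velocity (Eq. 14), sigmoid position (Eq. 15),
    genetic operators, then personal/global best bookkeeping."""
    N, D = X.shape
    nbest = pbest[nbest_idx]                                   # exemplar chosen by Algorithm 3
    r1, r2 = rng.random((N, D)), rng.random((N, D))
    V = np.clip(w * V + c1 * r1 * (pbest - X) + c2 * r2 * (nbest - X), -v_max, v_max)  # Eq. (14)
    X = (rng.random((N, D)) < sigmoid(V)).astype(int)          # Eq. (15)
    X = np.array([mutate(crossover(x, nb)) for x, nb in zip(X, nbest)])   # genetic operators
    fit = np.array([mad_fitness(tfidf, x) for x in X])
    improved = fit > pbest_fit                                 # assumed: larger MAD is better
    pbest[improved], pbest_fit[improved] = X[improved], fit[improved]
    gbest = pbest[np.argmax(pbest_fit)]
    return X, V, pbest, pbest_fit, gbest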

Integration of LBPSO with the k-means text clustering algorithm
After pre-processing, the proposed algorithm is used to select a new subset of informative features from the original feature space. These features not only increase the efficacy of the clustering algorithm but also decrease the computational complexity. In this paper, the modified BPSO (LBPSO) algorithm with the integration of the link neighbour strategy is used for feature selection. Finally, the k-means text clustering algorithm is performed on the informative subset of features. The complete process and flow chart of text clustering are presented in Figs. 5 and 6.

Fig. 5: Process of proposed LBPSO text clustering

Fig. 6: Flow chart of proposed methodology

5. Experimental setup
The performance of the proposed algorithm (LBPSO) is compared with existing feature selection algorithms for text clustering. A series of experiments is conducted on well-known standard text datasets. All the algorithms are implemented and tested on a workstation with an Intel Core i7 3.6 GHz processor and 8 GB RAM running MATLAB 2015B. Final clustering of the text documents is performed by k-means. The parameter settings of all compared algorithms are the same as in their original papers. The parameter settings of the proposed algorithm are shown in Table 1.

Table 1: Parameter setting for the proposed algorithm
Parameter                                 | Value
c1, c2                                    | 2.2, 2.2
w (inertia weight)                        | dynamic
–                                         | 0.2
–                                         | 0.2
T_max (maximum number of iterations)      | 100
N (number of particles in the swarm)      | 80

5.1. Datasets
Three benchmark datasets, Reuters-21578, TDT2 and TR11, were used in our experiments. These datasets differ considerably in sparsity, skewness and number of features. They are standard benchmark text datasets for text clustering that are widely used by researchers and are available at http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html. The characteristics of these datasets are summarised in Table 2.

TDT2
The TDT2 dataset consists of 100 document clusters. The total number of on-topic documents is 11,201, which are classified into 96 semantic categories. We created three subsets randomly from the TDT2 dataset, varying in dimension and number of classes. The first dataset (TDT2A) contains 314 text documents belonging to 3 classes. The second dataset (TDT2B) contains 1319 text documents belonging to 6 classes. The third dataset (TDT2C) contains 1683 text documents belonging to 5 classes.

Reuters-21578
The Reuters-21578 dataset contains a set of news published by the Reuters newswire in 1987. It is a collection of 21578 documents with 135 thematic categories which are distributed non-uniformly. We created 4 random subsets of the Reuters-21578 dataset, varying in dimension and number of classes. The first dataset (Reuters A) contains 121 text documents belonging to 3 classes. The second dataset (Reuters B) contains 394 text documents belonging to 4 classes. The third dataset (Reuters C) contains 160 text documents belonging to 5 classes. The fourth dataset (Reuters D) contains 492 text documents belonging to 5 classes.

TR11
The TR11 dataset is derived from TREC collections (http://trec.nist.gov). It is a collection of 414 documents having 9 classes.

Table 2: Details of text datasets
S.no | Dataset   | No. of documents | No. of classes | No. of features
1    | TDT2A     | 314              | 3              | 5076
2    | TDT2B     | 1319             | 5              | 14964
3    | TDT2C     | 1683             | 5              | 12737
4    | TR11      | 414              | 9              | 6429
5    | Reuters A | 121              | 3              | 973
6    | Reuters B | 394              | 4              | 3018
7    | Reuters C | 160              | 5              | 1309
8    | Reuters D | 492              | 5              | 4131

5.2. Evaluation metrics
The clustering performance is evaluated by using four metrics: accuracy (ACC), purity, Rand index (RI) and normalized mutual information (NMI). Each of these metrics lies between 0 and 1, and a higher value indicates a better clustering result.

Accuracy: the accuracy [40] is applied to compute the percentage of correctly predicted labels, and is defined as:

$\mathrm{Accuracy} = \dfrac{1}{n} \sum_{i=1}^{n} \delta\left(r_i, \mathrm{map}(l_i)\right)$   (16)

where

$\delta(x, y) = \begin{cases} 1, & \text{if } x = y \\ 0, & \text{otherwise} \end{cases}$

and map(l_i) denotes the mapping function that permutes the clustering label l_i to the corresponding label from the data set.

Normalized Mutual Information (NMI): NMI [40], [41] is a popular metric used for evaluating the clustering performance between the label set m and the cluster set g, and is defined as follows:

$\mathrm{NMI} = \dfrac{\sum_{c} \sum_{s} n_{c,s} \log\left( \dfrac{n \cdot n_{c,s}}{n_c \, n_s} \right)}{\sqrt{ \left( \sum_{c} n_c \log \dfrac{n_c}{n} \right) \left( \sum_{s} n_s \log \dfrac{n_s}{n} \right) }}$   (17)

where $n_c$ is the number of objects in cluster $m_c$ ($1 \le c \le K$) obtained by the clustering algorithm, $n_s$ is the number of objects in class $g_s$ ($1 \le s \le K$) of the ground-truth labels, and $n_{c,s}$ is the number of objects that are common to the two clusters $m_c$ and $g_s$.

Purity: for each cluster, purity [42] determines the largest class and attempts to capture how well the groups match the reference on average:

$\mathrm{Purity}(P_i) = \dfrac{1}{n_i} \max_j \left( n_i^j \right)$   (18)

$\mathrm{Purity} = \sum_{i} \dfrac{n_i}{n} \, \mathrm{Purity}(P_i)$   (19)

where $P_i$ is the centroid of the $i$-th cluster, $n_i$ is the size of cluster $i$, and $n_i^j$ is the number of objects of class $j$ assigned to cluster $i$.

Rand Index (RI): the Rand index measures the percentage of decisions that are correctly classified [43] and is defined as:

$\mathrm{RI} = \dfrac{\sum_{i<j} \mathbb{1}\left[ A_{i,j} = B_{i,j} \right]}{\binom{n}{2}}$   (20)

where $A_{i,j}$ indicates whether the instance pair $(i, j)$ is placed in the same cluster by the clustering algorithm and $B_{i,j}$ indicates whether the pair belongs to the same class in the ground-truth labels.
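For reference, the snippet below computes the four measures for a pair of label vectors; it relies on scikit-learn's NMI and Rand index implementations as stand-ins for Eqs. (17) and (20), and a Hungarian assignment realises the map() function of Eq. (16). The toy label vectors are placeholders.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, rand_score
from sklearn.metrics.cluster import contingency_matrix

def clustering_accuracy(y_true, y_pred):
    # Eq. (16): best one-to-one mapping between predicted clusters and true labels.
    C = contingency_matrix(y_true, y_pred)
    rows, cols = linear_sum_assignment(-C)            # Hungarian algorithm, maximising matches
    return C[rows, cols].sum() / C.sum()

def purity(y_true, y_pred):
    # Eqs. (18)-(19): each predicted cluster is credited with its majority class.
    C = contingency_matrix(y_true, y_pred)
    return C.max(axis=0).sum() / C.sum()

y_true = np.array([0, 0, 1, 1, 2, 2])                 # placeholder ground-truth labels
y_pred = np.array([1, 1, 0, 0, 2, 0])                 # placeholder clustering labels
print(clustering_accuracy(y_true, y_pred), purity(y_true, y_pred),
      normalized_mutual_info_score(y_true, y_pred), rand_score(y_true, y_pred))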

5.3. Results and discussion
The results obtained by the proposed algorithm are compared with BPSO and HFSPSOTC. The maximum number of iterations is set to 100 for all algorithms. For a fair comparison, each of the clustering algorithms was run 10 times, and the parameter settings for each compared algorithm are the same as in the original research papers. Table 3 shows the accuracy obtained by the feature selection algorithms BPSO, HFSPSOTC and LBPSO on each of the 8 standard benchmark text datasets. For all datasets except the REUTER C text dataset, LBPSO exhibits a significantly higher accuracy than HFSPSOTC and BPSO. It follows that LBPSO is both effective and efficient for finding the global optimum solution as compared to the other two methods. HFSPSOTC achieves the highest accuracy on the REUTER C dataset. From the results it can be concluded that LBPSO is superior to the other compared algorithms with respect to the accuracy measure. Similar experimental results are observed using the purity measure in Table 4. From the results it can be seen that the proposed algorithm was capable of appropriately clustering the text data in all datasets.

Table 3: Performance comparison using the Accuracy measure
Dataset   | BPSO    | HFSPSOTC | LBPSO
TDT2A     | 60.8280 | 95.8599  | 96.1783
TDT2B     | 75.5118 | 57.6952  | 79.3783
TDT2C     | 60.8280 | 95.8599  | 96.1783
TR11      | 44.2029 | 46.8599  | 62.0773
REUTER A  | 57.0248 | 95.0413  | 96.6942
REUTER B  | 51.5228 | 84.5178  | 89.0863
REUTER C  | 62.6087 | 76.5217  | 66.5217
REUTER D  | 70.0000 | 73.7500  | 80.0000

Table 4: Performance comparison using the Purity measure
Dataset   | BPSO   | HFSPSOTC | LBPSO
TDT2A     | 0.6720 | 0.9586   | 0.9618
TDT2B     | 0.9212 | 0.8901   | 0.9484
TDT2C     | 0.6720 | 0.9586   | 0.9618
TR11      | 0.6425 | 0.6715   | 0.7319
REUTER A  | 0.7025 | 0.9504   | 0.9669
REUTER B  | 0.6980 | 0.8883   | 0.8959
REUTER C  | 0.7913 | 0.7652   | 0.8174
REUTER D  | 0.7375 | 0.7563   | 0.8125

Table 5 shows the results for NMI and RI obtained by the different clustering algorithms. We found that the overall performance of LBPSO text clustering in terms of RI and NMI is better than that of the other algorithms, especially on TR11, REUTER A, REUTER C and REUTER D, while for the TDT2A, TDT2B, TDT2C and REUTER B text datasets LBPSO performs better in terms of NMI and HFSPSOTC in terms of RI. From the results it can be concluded that the proposed algorithm surpasses the other algorithms and was capable of appropriately clustering the text data.

Table 5: Performance comparison using the NMI and RI measures
Dataset   | BPSO (NMI / RI)   | HFSPSOTC (NMI / RI) | LBPSO (NMI / RI)
TDT2A     | 0.4318 / 0.3227   | 0.8386 / 0.8817     | 0.8451 / 0.8806
TDT2B     | 0.6096 / 0.6893   | 0.447 / 0.8314      | 0.6796 / 0.756
TDT2C     | 0.4318 / 0.3227   | 0.8386 / 0.8817     | 0.8451 / 0.8806
TR11      | 0.4503 / 0.2621   | 0.5174 / 0.2937     | 0.6054 / 0.4546
REUTER A  | 0.2895 / 0.2371   | 0.809 / 0.8527      | 0.8686 / 0.8956
REUTER B  | 0.311 / 0.2397    | 0.7296 / 0.8102     | 0.7428 / 0.7795
REUTER C  | 0.5424 / 0.4813   | 0.6313 / 0.495      | 0.66 / 0.5628
REUTER D  | 0.6284 / 0.5836   | 0.667 / 0.6155      | 0.7441 / 0.6981

Table 6 presents the number of features selected by the compared algorithms on the eight datasets. From the results, it can be clearly stated that the proposed algorithm finds a better subset of the relevant feature space regarding the effectiveness of text clustering as compared to the other algorithms. Fig. 7 gives a better overview of this comparison. However, it is worth mentioning that the purpose of feature selection is to remove useless features without sacrificing the clustering accuracy; otherwise, the performance might be degraded even though the feature subset is small. For example, for the text dataset TDT2B, HFSPSOTC obtained the smallest feature subset but achieved a lower accuracy. On the other hand, LBPSO selected a larger feature subset that provides better accuracy compared to HFSPSOTC for the same dataset. Therefore, the results in Table 6 indicate that the smallest or largest feature subset does not guarantee the best or worst accuracy.

Table 6: Number of selected features and their percentage
Dataset   | BPSO selected features (%) | HFSPSOTC selected features (%) | LBPSO selected features (%)
TDT2A     | 2532 (49.8%)               | 2593 (51%)                     | 2571 (50.6%)
TDT2B     | 7511 (50.1%)               | 7507 (50.1%)                   | 7535 (50.3%)
TDT2C     | 6366 (49.9%)               | 6371 (50%)                     | 6408 (50.3%)
TR11      | 3152 (49%)                 | 3290 (51.1%)                   | 3241 (50%)
REUTER A  | 458 (47%)                  | 439 (45.1%)                    | 451 (46.3%)
REUTER B  | 1472 (48.7%)               | 1491 (49.4%)                   | 1491 (49.4%)
REUTER C  | 927 (49.2%)                | 891 (47.2%)                    | 909 (48.2%)
REUTER D  | 644 (49.1%)                | 633 (48.3%)                    | 633 (48.3%)

Fig. 7: Number of selected features and their percentage

6. Conclusion
In this study, a novel binary particle swarm optimization with the integration of a new neighbour selection strategy is proposed to solve the feature selection problem in text clustering. The LBPSO algorithm introduces a new updating strategy that learns from the neighbour best position instead of the global best. LBPSO takes the original text dataset as input and produces a new subset of prominent features. The k-means clustering algorithm takes these features as input to evaluate the feature selection method. The proposed algorithm outperforms the other well-known compared algorithms on eight subsets of three benchmark text datasets in terms of the NMI, RI, purity and accuracy measures. The experimental results also verify that LBPSO is more effective than other BPSO based FS methods for text clustering. The proposed feature selection algorithm enhances the text clustering results by forming a larger number of similar groups. In future, the proposed algorithm can be integrated with other meta-heuristic algorithms to obtain more informative features by improving the search abilities of the algorithm.

References
[1]

L. M. Abualigah, A. T. Khader, and M. A. Al-Betar, "Multi-objectives-based text clustering technique using K-mean algorithm," in 2016 7th International Conference on Computer Science and Information Technology (CSIT), 2016, pp. 1–6.

[2]

L. M. Abualigah, A. T. Khader, M. A. Al-Betar, and E. S. Hanandeh, "Unsupervised Text Feature Selection Technique Based on Particle Swarm Optimization Algorithm for Improving the Text Clustering."

[3]

L. Zheng, R. Diao, and Q. Shen, "Self-adjusting harmony search-based feature selection," Soft Comput., vol. 19, no. 6, pp. 1567–1579, Jun. 2015.

[4]

K. K. Bharti and P. K. Singh, "Opposition chaotic fitness mutation based adaptive inertia weight BPSO for feature selection in text clustering," Appl. Soft Comput. J., vol. 43, pp. 20–34, 2016.

[5]

"Data-driven global-ranking local feature selection methods for text categorization," Expert Syst. Appl., vol. 42, no. 4, pp. 1941–1949, Mar. 2015.

[6]

Luying Liu, Jianchu Kang, Jing Yu, and Zhongliang Wang, "A Comparative Study on Unsupervised Feature Selection Methods for Text Clustering," in 2005 International Conference on Natural Language Processing and Knowledge Engineering, pp. 597–601.

[7]

R. Battiti, "Using Mutual Information for Selecting Features in Supervised Neural Net Learning," IEEE Trans. Neural Networks, vol. 5, no. 4, pp. 537–550, Jul. 1994.

[8]

"Floating search methods in feature selection," Pattern Recognit. Lett., vol. 15, no. 11, pp. 1119–1125, Nov. 1994.

[9]

Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in ICML, 1997, vol. 97, pp. 412–420.

[10]

S. L. Y. Lam and Dik Lun Lee, "Feature reduction for neural network based text categorization," in Proceedings. 6th International Conference on Advanced Systems for Advanced Applications, pp. 195–202.

[11]

"Web page feature selection and classification using neural networks," Inf. Sci. (Ny)., vol. 158, pp. 69–88, Jan. 2004.

[12]

Z.-Q. LIU and Y.-J. ZHANG, “A COMPETITIVE NEURAL NETWORK APPROACH TO WEB-PAGE CATEGORIZATION,” Int. J. Uncertainty, Fuzziness Knowledge-Based Syst., vol. 9, no. 6, pp. 731–741, Dec. 2001.

[13]

J. H. Wang and H. Y. Wang, “Incremental Neural Network Construction for Text Classification,” in 2014 International Symposium on Computer, Consumer and Control, 2014, pp. 970–973.

[14]

Wai Lam, M. Ruiz, and P. Srinivasan, “Automatic text categorization and its application to text retrieval,” IEEE Trans. Knowl. Data Eng., vol. 11, no. 6, pp. 865–879, 1999.

[15]

R. Bello, Y. Gomez, A. Nowe, and M. M. Garcia, “Two-Step Particle Swarm Optimization to Solve the Feature Selection Problem,” in Seventh International Conference on Intelligent Systems Design and Applications (ISDA 2007), 2007, pp. 691–696.

[16]

K. Tanaka, T. Kurita, and T. Kawabe, “Selection of Import Vectors via Binary Particle Swarm Optimization and Cross-Validation for Kernel Logistic Regression,” in 2007 International Joint Conference on Neural Networks, 2007, pp. 1037–1042.

[17]

H. Wang, H. Yu, Q. Zhang, S. Cang, W. Liao, and F. Zhu, “Parameters optimization of classifier and feature selection based on improved artificial bee colony algorithm,” in 2016 International Conference on Advanced Mechatronic Systems (ICAMechS), 2016, pp. 242–247.

[18]

M. L. Raymer, W. F. Punch, E. D. Goodman, L. A. Kuhn, and A. K. Jain, “Dimensionality reduction using genetic algorithms,” IEEE Trans. Evol. Comput., vol. 4, no. 2, pp. 164–171, Jul. 2000.

[19]

Il-Seok Oh, Jin-Seon Lee, and Byung-Ro Moon, “Hybrid genetic algorithms for feature selection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1424–1437, Nov. 2004.

[20]

“Text feature selection using ant colony optimization,” Expert Syst. Appl., vol. 36, no. 3, pp. 6843–6853, Apr. 2009.

[21]

K.-C. Lin, K.-Y. Zhang, Y.-H. Huang, J. C. Hung, and N. Yen, “Feature selection based on an improved cat swarm optimization algorithm for big data classification,” J. Supercomput., vol. 72, no. 8, pp. 3210–3221, Aug. 2016.

[22]

K.-C. Lin, S.-Y. Chen, and J. C. Hung, “Feature Selection and Parameter Optimization of Support Vector Machines Based on Modified Artificial Fish Swarm Algorithms,” Math. Probl. Eng., vol. 2015, pp. 1–9, 2015.

[23]

R. Eberhart and J. Kennedy, “A new optimizer using particle swarm theory,” MHS’95. Proc. Sixth Int. Symp. Micro Mach. Hum. Sci., pp. 39–43, 1995.

[24]

“Improved particle swarm optimization algorithm and its application in text feature selection,” Appl. Soft Comput., vol. 35, pp. 629–636, Oct. 2015.

[25]

“Gene selection and classification using Taguchi chaotic binary particle swarm optimization,” Expert Syst. Appl., vol. 38, no. 10, pp. 13367–13377, Sep. 2011.

[26]

P. Ghamisi, S. Member, and J. A. Benediktsson, “Feature Selection Based on Hybridization of Genetic Algorithm and Particle Swarm Optimization,” vol. 12, no. 2, pp. 309–313, 2015.

[27]

H. Banka and S. Dara, "A Hamming distance based binary particle swarm optimization (HDBPSO) algorithm for high dimensional feature selection, classification and validation," Pattern Recognit. Lett., vol. 52, pp. 94–100, 2015.

[28]

Y. Zhang, D. Gong, Y. Hu, and W. Zhang, "Feature selection algorithm based on bare bones particle swarm optimization," Neurocomputing, vol. 148, pp. 150–157, 2015.

[29]

“Improved binary particle swarm optimization using catfish effect for feature selection,” Expert Syst. Appl., vol. 38, no. 10, pp. 12699–12707, Sep. 2011.

[30]

L. M. Abualigah and A. T. Khader, “Unsupervised text feature selection technique based on hybrid particle swarm optimization algorithm with genetic operators for the text clustering,” J. Supercomput., 2017.

[31]

J. Kennedy and R. C. Eberhart, “A discrete binary version of the particle swarm algorithm,” 1997 IEEE Int. Conf. Syst. Man, Cybern. Comput. Cybern. Simul., vol. 5, pp. 4–8, 1997.

[32]

Barabasi and Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–12, Oct. 1999.

[33]

M. Pósfai, G. Musella, M. Martino, R. Sinatra Acknowledgements, S. Morrison, A. Husseini, and P. Hoevel, “Chapter 5: The Barabási-Albert Model,” Netw. Sci., vol. 0, 2015.

[34]

J. Macqueen, “Some methods for classification and analysis of multivariate observations,” Proc. Fifth Berkeley Symp. Math. Stat. Probab., vol. 1, no. 233, pp. 281–297, 1967.

[35]

N. Kushwaha, M. Pant, S. Kant, and V. Kumar, “Magnetic optimization algorithm for data clustering,” Pattern Recognit. Lett., vol. 0, pp. 1–7, 2017.

[36]

H. Wang, H. Sun, C. Li, S. Rahnamayan, and J.-S. Pan, “Diversity enhanced particle swarm optimization with neighborhood search,” Inf. Sci. (Ny)., vol. 223, pp. 119–135, 2013.

[37]

“Text document clustering based on neighbors,” Data Knowl. Eng., vol. 68, no. 11, pp. 1271–1288, Nov. 2009.

[38]

Y. Li, C. Luo, and S. M. Chung, “A parallel text document clustering algorithm based on neighbors,” Cluster Comput., vol. 18, no. 2, pp. 933–948, Jun. 2015.

[39]

A. La, M. A. Al-betar, M. A. Awadallah, A. Tajudin, and L. Mohammad, "A comprehensive review: Krill Herd algorithm (KH) and its applications," Appl. Soft Comput. J., vol. 49, pp. 437–446, 2016.

[40]

L. Sun and C. Guo, “Incremental affinity propagation clustering based on message passing,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 11, pp. 2731–2744, 2014.

[41]

F. Shang, L. C. Jiao, J. Shi, F. Wang, and M. Gong, “Fast affinity propagation clustering: A multilevel approach,” Pattern Recognit., vol. 45, no. 1, pp. 474–486, 2012.

[42]

Y. Zhao and G. Karypis, “Empirical and theoretical comparisons of selected criterion functions for document clustering,” Mach. Learn., vol. 55, no. 3, pp. 311–331, 2004.

[43]

N. M. Arzeno and H. Vikalo, “Semi-supervised affinity propagation with soft instance-level constraints,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 5, pp. 1041–1052, 2015.

Biographies (Text)

Neetu Kushwaha received the B.Tech degree in Computer Science & Engineering from RGPV in 2009 and the M.Tech degree from NIT Jalandhar in 2012. Currently, she is a Ph.D. student in the Department of Applied Science and Engineering, IIT Roorkee, India. Her research interests include data mining, machine learning, evolutionary algorithms and swarm intelligence techniques.

Dr. Millie Pant is Associate Professor in the Department of Applied Science and Engineering at the Indian Institute of Technology Roorkee. She received her PhD from the Mathematics Department of IIT Roorkee. Her areas of interest include numerical optimization and operations research, evolutionary algorithms and swarm intelligence techniques. She has published more than 200 research papers in various journals and conferences of national and international repute.

Biographies (Photograph)

Author Name: Neetu Kushwaha

Author Name: Dr. Millie Pant

Research Highlights

- The feature selection problem for clustering text data is studied.
- A new neighbourhood strategy in binary PSO is proposed to solve the feature selection problem.
- The proposed algorithm outperforms other state-of-the-art algorithms on benchmark text datasets.
- Experimental results on various text datasets show that the proposed algorithm has a good clustering effect.