Frequent itemset-based feature selection and Rider Moth Search Algorithm for document clustering

Journal of King Saud University – Computer and Information Sciences
Madhulika Yarlagadda a,b,*, K. Gangadhara Rao c, A. Srikrishna b

a Department of Computer Science and Engineering, JNTUK, Kakinada, Andhra Pradesh, India
b Department of Information Technology, RVRJC College of Engineering, Chowdavaram, Guntur (Dt), Andhra Pradesh, India
c Department of Computer Science and Engineering, Acharya Nagarjuna University, Guntur, Andhra Pradesh, India

Article info

Article history: Received 19 May 2019; Revised 10 August 2019; Accepted 4 September 2019; Available online xxxx

Keywords: Document clustering; Moth Search Algorithm; Rider Optimization Algorithm; Frequent itemset; Stop word removal

Abstract

Document clustering has recently received great attention in the retrieval, navigation, and summarization of huge volumes of documents. With a better document clustering approach, computers can automatically organize a document corpus into meaningful clusters, enabling efficient navigation and browsing of the corpus. Document navigation and browsing is a valuable complement to the deficiencies of information retrieval technologies. This paper introduces a Modsup-based frequent itemset and Rider Optimization-based Moth Search Algorithm (Rn-MSA) for clustering documents. At first, the input documents are given to the pre-processing step; then, feature extraction is carried out based on TF-IDF and WordNet features. Once the extraction is done, feature selection is carried out based on frequent itemsets for the establishment of feature knowledge. At last, document clustering is done using the proposed Rn-MSA, which is designed by combining the Rider Optimization Algorithm (ROA) and the Moth Search Algorithm (MSA). The performance of document clustering based on the proposed Modsup + Rn-MSA is evaluated in terms of precision, recall, F-Measure, and accuracy. The developed document clustering method achieves a maximal precision of 95.90%, maximal recall of 96.41%, maximal F-Measure of 96.41%, and maximal accuracy of 95.12%, which indicates its superiority.

© 2019 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

The rapid growth of the World Wide Web (WWW) has increased the number of documents that are accessible online; it is impossible to count the documents available on the web. Commonly available text documents include research articles, technical reports, newspapers, journal papers, blogs, and so on. The documents available online and their categories are distinct, numerous, huge, and highly valuable and useful (Farahat and Mohamed, 2011; Manning et al., 2009; Zhang et al., 2006). With the use of web search engines, an individual can easily locate and surf the documents. Due to the development of the WWW, the

* Corresponding author at: Department of Computer Science and Engineering, JNTUK, Kakinada, Andhra Pradesh, India. E-mail address: [email protected] (M. Yarlagadda). Peer review under responsibility of King Saud University.


complexity of web query processing has increased, and locating relevant documents in a large text repository has become a challenge for the designers of web search engines and for internet users (Forsati, 2013; Pera and Ng, 2010). When searching for a document through the WWW, the search engines return several documents. Most of these documents are relevant to the topic, but some are irrelevant documents of limited quality. Clustering plays a very significant role in organizing the huge number of documents returned by the search engines into clusters (Jensi and Wiselin Jiji, 2013). Document clustering is one of the techniques used in search engines for finding similar documents (Pamba et al., 2019; Gharib and Fouad, 2012). By organizing similar documents together, a bulky collection of documents is easily navigated, browsed, and organized. Document clustering plays a significant role in different fields, such as knowledge discovery, business applications, and so on. Document clustering is a data analysis technique that partitions documents into groups of similar objects using a similarity measure, such that similar objects are placed within the same cluster and dissimilar objects fall outside it. It is used in pattern recognition, machine learning, and statistics (Berkhin, 2006; Onan et al., 2017). Clustering is very useful for the categorization of a group of

https://doi.org/10.1016/j.jksuci.2019.09.002
1319-1578/© 2019 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Please cite this article as: M. Yarlagadda, K. Gangadhara Rao and A. Srikrishna, Frequent itemset-based feature selection and Rider Moth Search Algorithm for document clustering, Journal of King Saud University – Computer and Information Sciences, https://doi.org/10.1016/j.jksuci.2019.09.002


documents, and topic detection (Manning et al., 2009; Sharma, 2019). Document clustering is closely related to data clustering (Xie and Xing, 2013). The principle of document clustering is to meet human interests in information searching and understanding. Document clustering often fails to consider the relationship between words (Kiran and Shankar, 2010). At present, most paper documents are in electronic form because of quick access and smaller storage. Hence, retrieving relevant documents from a large database is a major problem. The challenging problems of document clustering are high dimensionality, big volume, and complex semantics.

1.1. Literature review

Several approaches have been developed for document clustering, but many do not perform well because of the high sparsity of document corpora (Sumathi Rani and China Babu, 2019; Gulnashin et al., 2019). Pei and Chen (2016) developed Concept Factorization With Adaptive Neighbors (CFANs) to improve the performance of document clustering. This framework was utilized for extracting the representation space, which maintained the neighborhood structure of the data. The neighborhood graph weight matrix was established by integrating the ANs regularization with the CF model. The method failed to find the number of clusters automatically. Pamba et al. (2019) developed Frequent Pattern Growth-based Dynamic Fuzzy Particle Swarm Optimization (FPDFPSO) for clustering documents. This framework solved the issues of managing diversity, local hookups, dependency on parameters, and convergence speed. The method is not only suitable for finding better cluster centroids but also finds the best solutions using DFPSO. The robustness of the method was better, but it depended on a large dataset. Gulnashin et al. (2019) presented a deterministic approach for initializing spherical k-means. Spherical k-means is considered one of the best methods for clustering documents, and it uses the cosine similarity between documents to find the most suitable documents. The method took a long time to converge. Karpagam and Saradha (2019) developed a semantic word-based answer generator model for document clustering. Using this model, the system reduced the search gap from the user query and answered the sentences based on WordNet. This framework consists of unambiguous word string distance, word order, sentence similarity, and semantic similarity of words. After that, the knowledge base was established from a large corpus, and clustering is done by grouping by domain context. The method did not consider other techniques to improve the system performance.

Table 1 shows the literature review based on the authors, year, techniques used, and strengths and weaknesses of the existing methods. While analyzing the existing works in document clustering, it can be deduced that the area of document clustering has many issues that need to be solved. Accordingly, this paper develops a document clustering technique using Modsup and Rn-MSA, which tries to solve the problems in the existing document clustering approaches. The overall procedure of the proposed document clustering involves four steps: pre-processing, feature extraction, feature knowledge establishment, and document clustering. At first, the documents are pre-processed based on stop-word removal and stemming techniques. Then, feature extraction is carried out using TF-IDF and WordNet features. Depending on the extracted features, feature selection is performed using frequent itemsets to establish feature knowledge. Then, document clustering is carried out using the proposed Rn-MSA, which is developed by integrating ROA and MSA. The performance of the proposed document clustering scheme is analyzed using two datasets, namely the Reuter database and the 20 Newsgroups database. From the analysis, it is shown that the proposed method obtains a maximal precision, recall, F-measure, and accuracy of 95.90%, 96.4%, 96.4%, and 95%, respectively, which indicates the superiority of the proposed method.

The main contribution of this research is a document clustering approach using the proposed Modsup + Rn-MSA for clustering documents based on their similarity, in a way that is effective for applications related to document retrieval. The organization of the paper is as follows: Section 2 discusses document clustering using Modsup + Rn-MSA, Section 3 presents the results and discussion of the proposed method, and Section 4 concludes the paper.

2. Proposed document clustering using Modsup-based frequent itemset and Moth-Rider Optimization Algorithm

This section presents the document clustering approach using the Moth-Rider Optimization Algorithm. Fig. 1 shows the schematic diagram of the proposed Modsup frequent itemset and Moth-Rider Optimization Algorithm for document clustering. At first, the documents are given to the pre-processing step to remove redundant and unnecessary words using stop-word removal and stemming. After pre-processing, feature extraction is carried out using TF-IDF and WordNet features to find the keywords of each document. Using the extracted features, the feature knowledge is established based on frequent itemsets. At last, document clustering is performed using the proposed Rn-MSA, in which the standard Rider Optimization Algorithm (ROA) (Binu et al., 2018) and the Moth Search Algorithm (MSA) (Wang, 2016) are integrated.

2.1. Pre-processing

The first step involved in document clustering is pre-processing of the text. The input database contains unnecessary words or phrases that may affect the clustering process. Let D be the database comprising n documents, denoted as D = {d_i ; 1 ≤ i ≤ n}. Pre-processing removes the redundant words from the text database. Its two main steps are: (1) stop-word removal and (2) stemming. Stop-word removal: stop words are the unnecessary words present in the text document, like "an", "a", "the", "in", etc. Stemming: the stemming technique converts terms that are not inevitably meaningful words to their root form in the language. Each document in the database D is expressed as,

d_i = {w_{ij} ; 1 ≤ j ≤ m_i}    (1)

where m_i denotes the number of words extracted from the i-th document. After extracting the keywords, the set W of unique keywords is obtained as,

W = {b_l ; 1 ≤ l ≤ k}    (2)

where k represents the total number of words in the dictionary, i.e., the unique keywords from the documents. Thus, the dictionary words are obtained from the pre-processing step, and then the features are extracted from the dictionary words.

2.2. Semantic level feature extraction using TF-IDF and WordNet

Feature extraction follows the pre-processing step and extracts keywords from the documents based on TF-IDF and WordNet features.
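The pre-processing described in Section 2.1 (stop-word removal followed by stemming) can be sketched as follows. This is an illustrative Python sketch (the paper's implementation is in MATLAB); the stop-word list and the toy suffix-stripping stemmer are assumptions for illustration, where a real pipeline would use a full stop-word list and, e.g., a Porter stemmer.

```python
# Minimal sketch of pre-processing: stop-word removal, then stemming.
STOP_WORDS = {"a", "an", "the", "in", "of", "is", "are", "to", "and"}

def toy_stem(word):
    """Strip a few common English suffixes to approximate a root form."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    """Lowercase, drop stop words, and stem the remaining terms."""
    tokens = document.lower().split()
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The clustering of documents in a database"))
# -> ['cluster', 'document', 'database']
```

The surviving stemmed terms form the dictionary W of Eq. (2), from which the features are then extracted.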


Table 1. Literature review.

Authors | Year | Methods | Advantages | Disadvantages
Karpagam and Saradha (2019) | 2019 | Semantic word-based answer generator model | The system reduced the search gap from the user query | Did not consider other techniques to improve the system performance
Gulnashin et al. (2019) | 2019 | Deterministic initialization technique for spherical k-means (Duwairi and Abu-Rahmeh method) | Better performance | Took a long time to converge
Pamba et al. (2019) | 2019 | Frequent pattern growth-based dynamic fuzzy particle swarm optimization (FPDFPSO) | Converges very quickly | Takes more iterations
Pei and Chen (2016) | 2016 | Novel concept factorization (CF) method, called CF with adaptive neighbors (CFANs) | Minimal convergence speed and minimum mean-squared residual error | Cannot determine the number of clusters automatically
Agrawal et al. (1993) | 1993 | Pruning techniques | The pruning techniques reduce a very large fraction of itemsets without measuring them | Excess pruning affects the performance
Hotho et al. (2003) | 2003 | Ontology-based document clustering | Domain-specific ontology improves the clustering | Low accuracy
Krishnapuram et al. (2001) | 2001 | Fuzzy c-medoids (FCMdd) and robust fuzzy c-medoids (RFCMdd) | Useful in Web mining applications | High complexity
Wu et al. (2019) | 2019 | Multiple-optimization particle swarm optimization adapting a density-based method (CMPSO) | Achieves good performance on sparse datasets | Does not always achieve better results
Choi and Park (2019) | 2019 | Topic-tree (TP-Tree) | Superior performance and shorter running time | Prone to sampling errors

Fig. 1. Schematic diagram of document clustering using the proposed Modsup frequent itemset and Moth-Rider Optimization Algorithm.

2.2.1. TF-IDF

TF is utilized to compute the occurrence of each word in a document. IDF is used for identifying significant words that occur rarely in the documents. The IDF is expressed as,

Q(b_l, D) = log( n / |{d ∈ D : b_l ∈ d}| )    (3)

where Q(b_l, D) is the IDF of the word b_l in database D, b_l represents a word in the document, d refers to a document, and n indicates the number of documents in the collection.
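The TF and IDF quantities of Eq. (3) can be computed directly from tokenized documents. The sketch below is illustrative Python (the paper's implementation is in MATLAB), and the tiny corpus is an assumption for demonstration.

```python
import math

# TF counts how often a word occurs in one document; IDF (Eq. (3)) is
# log(n / |{d in D : b_l in d}|), the log-inverse document frequency.
def tf(word, doc_tokens):
    return doc_tokens.count(word)

def idf(word, corpus_tokens):
    n = len(corpus_tokens)
    df = sum(1 for doc in corpus_tokens if word in doc)
    return math.log(n / df) if df else 0.0

corpus = [
    ["document", "cluster", "method"],
    ["cluster", "centroid"],
    ["search", "engine", "document"],
]
print(tf("cluster", corpus[0]))          # frequency in the first document
print(round(idf("cluster", corpus), 4))  # word appears in 2 of 3 documents
```

A word that appears in every document gets an IDF of log(1) = 0, so only rarer, more discriminative words contribute large feature values.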

2.2.2. WordNet features

The next important features extracted from the word documents are the WordNet features. The WordNet ontology (Elberrichi et al., 2008) describes two kinds of semantic relations: synonymy and hyponymy. Using the WordNet ontology to find the semantic relations of words helps the extraction process in three aspects: 1) finding the required amount of lexical paraphrases from the word document, 2) finding a semantic net for lexicalization, and 3) generating the base for the document, such that a domain ontology can be built for it. Creating the WordNet ontology framework for realizing the semantic relations of the word document enables


the usage of synsets for producing paraphrases. It also determines the interchangeability of two synonyms in a particular context. The WordNet ontology scheme can be considered a data-processing source consisting of a collection of synsets. Synsets are sets of synonyms gathering topics that have similar meanings. The WordNet ontology scheme extracts two types of features, synonymy and hyponymy, from the documents. Synonymy identifies the symmetrical relation among words, combining any two similar words such that replacing one combined word with the other does not affect the actual meaning of the context. For example, the synonyms and hyponyms of a keyword b_l in the word document are represented as C_{b_l} and B_{b_l}. After extracting the synonyms and hyponyms for every keyword, the dictionary contains 3 × k keywords. Using these keywords, the feature matrix is created by identifying the TF, IDF, frequency of hyponyms, and synonyms. The extracted features are expressed as,

F = {f_{ij} ; 1 ≤ i ≤ n, 1 ≤ j ≤ g}    (4)

where n signifies the total number of documents, and g refers to the total dimension of the features extracted using TF, IDF, synonymy, and hyponymy.

2.3. Feature selection based on proposed Modsup-based frequent itemset

This section briefly explains the proposed frequent itemset-based feature selection. After extracting the features, feature selection is performed using the proposed frequent itemset to drastically reduce the dimensionality of the documents. The proposed mod-support reduces overfitting, in the sense that it removes redundant data; it also improves the accuracy and reduces the training time. The feature database F, which contains only the TF of the extracted keywords, is given to the frequent itemset mining. This featured database can be indicated as,

D_{input} = {d_{ij} ; 1 ≤ i ≤ n, 1 ≤ j ≤ k}    (5)

where d_{ij} denotes the presence of the j-th word in the i-th document. This database D_{input} is now subjected to the Apriori algorithm to mine the frequent itemsets. The Apriori algorithm mines frequent itemsets of length 1 based on the support threshold. After mining the frequent itemsets, the features are selected from D_{input} using the proposed mod-support designed in this paper. The mod-support for finding the important features is expressed as,

Mod.Support(f_j) = (1/l) ∑_{i=1}^{l} (Z_i × A_i) / T_i    (6)

where l refers to the length of the frequent itemset, Z_i signifies the total itemsets covered by f_j in the i-th length sequence, T_i denotes the total length in the i-th length sequence, and A_i is the average value of support of f_j in the i-th length sequence. After finding the mod-support for every keyword or feature, the features are selected based on the condition Modsup(f_j) > threshold. The selected features (keywords) and their corresponding TF, IDF, synonyms, and hyponyms are represented in the selected database, which is then given to the document clustering process. The feature-selected database for document clustering is represented as,

F^{select} = {f_{ij}^{select} ; 1 ≤ i ≤ n, 1 ≤ j ≤ g − y}    (7)

where unselected entries are set to 0, n signifies the total number of documents, and y denotes the number of removed features, so that g − y features remain.

2.4. Document clustering based on proposed Moth-Rider Optimization Algorithm

This section briefly explains the proposed Moth-Rider Optimization Algorithm for document clustering. Here, the feature-selected database F^{select} is given as input to the proposed Rn-MSA, which is developed by integrating the standard Rider Optimization Algorithm (ROA) (Binu et al., 2018) and the Moth Search Algorithm (MSA) (Wang, 2016). The MSA is inspired by the prey-searching behaviour of moths, and the solution update is carried out based on their position changes. A moth's behaviour depends on two basic characteristics, phototaxis and levy flight, and its movement is influenced by the position of the light source. MSA can search for the best solution effectively with improved accuracy; moreover, the algorithm handles complex operations, so the execution of MSA is easy and flexible. The ROA concept is adopted to update the solution based on past events that occurred with respect to time. Hence, the ROA concept is incorporated into MSA for finding the optimal regions.

a) Solution representation

Let v be the number of centroids to be computed, which determines the size of the solution. Each centroid holds g − y features, and therefore, the solution vector is of size 1 × [v × (g − y)].

b) Fitness evaluation

The fitness function is evaluated based on the distance measure estimated between the selected data points and the centroids. The fitness should be kept minimal, and the solution that affords the minimal value of the distance is selected as the better solution. The fitness is computed as,

Fitness, f = ∑_{i=1}^{n} ∑_{k=1}^{v} ‖F_i^{select} − C_k‖    (8)

where F_i^{select} is the i-th data point and C_k denotes the k-th centroid. The optimization procedure of the MSA is given as follows:

1) Initialization

As the first step, the positions of the moths are randomly initialized. Here, a solution represents a position, and hence, the solution space has f moths. The solution representation of the MS algorithm is given as follows,

Y = {Y_j ; 1 ≤ j ≤ f}    (9)

where Y_j refers to the j-th moth in the solution, j ranges over [1, f], and f represents the total number of moths in the solution space.

2) Fitness evaluation

The fitness function is computed for each individual solution using Eq. (8). The output having the best fitness value is considered the optimal output. The best solution determined at the previous iteration guides the search, as each solution strives to obtain a better position.
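A minimal sketch of the fitness in Eq. (8), summing the distance from every selected feature vector to every centroid, is shown below. This is illustrative Python (the paper's implementation is in MATLAB); the Euclidean norm, the toy data, and the nearest-centroid rule for reading clusters off a solution are assumptions for illustration.

```python
import math

# Eq. (8): summed distance between every data point and every centroid
# (lower is better for the optimizer).
def fitness(points, centroids):
    total = 0.0
    for p in points:
        for c in centroids:
            total += math.dist(p, c)
    return total

def assign_clusters(points, centroids):
    """Read clusters off a candidate solution by nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda k: math.dist(p, centroids[k])) for p in points]

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0)]
centroids = [(0.0, 0.1), (5.0, 5.1)]
print(round(fitness(points, centroids), 3))
print(assign_clusters(points, centroids))  # -> [0, 0, 1]
```

The optimizer only sees the scalar fitness value; the cluster assignment is recovered from the best centroid solution at the end.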


3) Location update using levy flights

After evaluating the fitness, the solution undergoes a position update based on the levy flight, as follows:

Y_j^{i+1} = Y_j^i + e · L(z)    (10)

where Y_j^i specifies the location of the moth at iteration i, and L(z) signifies the step drawn from the levy flight movement. The parameter e indicates the scaling factor and is expressed as,

e = W_max / i^2    (11)

where W_max refers to the maximum walk step. The levy distribution L(z) is represented as,

L(z) = (a − 1) Γ(a − 1) sin(π(a − 1)/2) / (π z^a)    (12)

where a is greater than 0 and Γ(·) is the gamma function.

4) Fly straightly

Y_{j,k}^{i+1} = k · (Y_{j,k}^i + b · (Y_{best}^i − Y_{j,k}^i))    (13)

Y_{j,k}^{i+1} = k Y_{j,k}^i + k b Y_{best}^i − k b Y_{j,k}^i    (14)

The final position for a moth j is expressed as,

Y_{j,k}^{i+1} = Y_{j,k}^i (k − k b) + k b Y_{best}^i    (15)

where Y_{best}^i indicates the best position of the moth, and b and k indicate the acceleration factor and the scaling factor, respectively. During the fly-straightly movement, the location of the moth is influenced by the location of the light source, and the acceleration constant influences the convergence speed of the algorithm. In some cases, the position of the moth goes beyond the position of the light source. Then, Eq. (15) is modified with the ROA for enhancing the effectiveness of the approach and for solving various optimization issues. The follower update equation of ROA is expressed as,

Y_{j,k}^{i+1} = Y_{L,k}^L + [cos(P_{j,k}^i) · Y_{L,k}^L · ∂_{j,k}^i]    (16)

Let us assume Y_{L,k}^L to be Y_{best}^i and substitute it in Eq. (16),

Y_{j,k}^{i+1} = Y_{best}^i [1 + cos(P_{j,k}^i) · ∂_{j,k}^i]    (17)

so that the best solution can be written as,

Y_{best}^i = Y_{j,k}^{i+1} / [1 + cos(P_{j,k}^i) · ∂_{j,k}^i]    (18)

Substituting Eq. (18) in Eq. (15), the solution becomes,

Y_{j,k}^{i+1} = (k − k b) Y_{j,k}^i + k b Y_{j,k}^{i+1} / [1 + cos(P_{j,k}^i) · ∂_{j,k}^i]    (19)

Y_{j,k}^{i+1} − k b Y_{j,k}^{i+1} / [1 + cos(P_{j,k}^i) · ∂_{j,k}^i] = (k − k b) Y_{j,k}^i    (20)

After rearranging the above equation,

Y_{j,k}^{i+1} [1 − k b / (1 + cos(P_{j,k}^i) · ∂_{j,k}^i)] = (k − k b) Y_{j,k}^i    (21)

Y_{j,k}^{i+1} [(1 + cos(P_{j,k}^i) · ∂_{j,k}^i − k b) / (1 + cos(P_{j,k}^i) · ∂_{j,k}^i)] = (k − k b) Y_{j,k}^i    (22)

Y_{j,k}^{i+1} = (k − k b) Y_{j,k}^i (1 + cos(P_{j,k}^i) · ∂_{j,k}^i) / (1 + cos(P_{j,k}^i) · ∂_{j,k}^i − k b)    (23)

where the steering angle of the j-th rider in the k-th coordinate is represented as P_{j,k}^i, and the distance travelled by the j-th rider in the k-th coordinate is denoted as ∂_{j,k}^i. The scaling factor is denoted as k, and the acceleration factor as b. The distance is measured as the product of the off-time and the velocity. Thus, by Eq. (23), the position update of the moths can be obtained using the location of the moths in the preceding iteration, the light absorption coefficient, the attractiveness, and the distance between the moths.

5) Finding the best solution

After updating the positions of the moths, the solution space undergoes fitness evaluation. Here, the optimal solution is obtained by identifying the location providing the minimal fitness. After the fitness evaluation, the best solution replaces the older solution.

6) Termination

After a certain iteration limit, the algorithm terminates, and the optimal solution is retained at the end of the procedure. Algorithm 1 depicts the algorithmic steps of the proposed Rn-MSA.

Algorithm 1: Pseudocode for the proposed Rn-MSA algorithm

Input: F^{select}
Output: Best solution (centroid)
Begin
  Initialize the population Y = {Y_j ; 1 ≤ j ≤ f}
  Evaluate the fitness using Eq. (8)
  While i < MaxGen
    Arrange the moth individuals using Eq. (10)
    for j = 1 to NP/2 do
      Generate Y_{j,k}^{i+1} by performing levy flights
    end for
    for j = NP/2 + 1 to NP do
      If rand > 0.5 then
        Estimate Y_{j,k}^{i+1} using Eq. (13)
      Else
        Generate Y_{j,k}^{i+1} using Eq. (23)
      End if
    End for
    Determine the new solutions and compare the intensities again
    i = i + 1
  end while
  Return the best solution
End
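The two update rules of Algorithm 1 can be sketched in a simplified one-dimensional form. This is an illustrative Python sketch (the paper's implementation is in MATLAB); the parameter values (k, b, a, W_max) and the steering-angle and distance values are assumptions for demonstration, not the paper's tuned settings.

```python
import math
import random

# Simplified 1-D sketch of the two Rn-MSA position updates: a levy-flight
# step (Eqs. (10)-(12)) for the first half of the population, and the
# rider-modified fly-straight rule (Eq. (23)) for the second half.
def levy_step(a=1.5, z=1.0):
    """Step length L(z) of Eq. (12) for a > 0 (Gamma via math.gamma)."""
    return ((a - 1) * math.gamma(a - 1) *
            math.sin(math.pi * (a - 1) / 2)) / (math.pi * z ** a)

def levy_update(y, iteration, w_max=1.0):
    """Eqs. (10)-(11): y + e * L(z), with scaling e = W_max / i^2."""
    e = w_max / iteration ** 2
    return y + e * levy_step()

def rider_fly_straight(y, steering, distance, k=0.9, b=0.4):
    """Eq. (23): fly-straight update blended with the ROA follower rule."""
    c = 1 + math.cos(steering) * distance
    return (k - k * b) * y * c / (c - k * b)

random.seed(0)
pop = [random.uniform(-1, 1) for _ in range(6)]
half = len(pop) // 2
new_pop = ([levy_update(y, iteration=2) for y in pop[:half]] +
           [rider_fly_straight(y, steering=0.5, distance=0.3)
            for y in pop[half:]])
print(len(new_pop))  # population size is preserved across the update
```

In the full algorithm each moth is a flattened vector of v candidate centroids, and the same scalar rules are applied coordinate-wise.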


The overall workflow of the proposed document clustering is described in Fig. 2. Initially, the input documents are subjected to pre-processing to remove unwanted words. The next step is feature extraction, in which features such as TF-IDF and WordNet features are extracted. After that, feature selection is performed based on the proposed Modsup-based frequent itemset. Finally, the selected features are subjected to the proposed Rn-MSA, which clusters the documents.
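As a rough end-to-end illustration of this workflow in Python (the paper's implementation is in MATLAB): documents are pre-processed, turned into term-frequency vectors over a shared vocabulary, and assigned to centroids. The fixed centroids and toy data below are hypothetical stand-ins for the Rn-MSA-optimized centroids.

```python
# End-to-end sketch: pre-process, vectorize, assign to nearest centroid.
STOP = {"a", "an", "the", "of", "in"}

def vectorize(doc, vocab):
    tokens = [t for t in doc.lower().split() if t not in STOP]
    return [tokens.count(term) for term in vocab]

docs = ["the search engine index", "engine index of the web",
        "cluster the documents", "documents in a cluster"]
vocab = sorted({t for d in docs for t in d.lower().split()} - STOP)
vectors = [vectorize(d, vocab) for d in docs]

def nearest(vec, centroids):
    return min(range(len(centroids)),
               key=lambda k: sum((a - b) ** 2
                                 for a, b in zip(vec, centroids[k])))

# Hypothetical fixed centroids: one taken from each topic's first document.
centroids = [vectors[0], vectors[2]]
print([nearest(v, centroids) for v in vectors])  # -> [0, 0, 1, 1]
```

The two web-search documents and the two clustering documents land in separate clusters, which is the grouping behaviour the pipeline aims for at scale.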

3. Results and discussion

The results and discussion of the proposed Modsup + Rn-MSA for document clustering are presented in this section, with a comparative analysis to prove the effectiveness of the proposed method.

3.1. Experimental setup

The experimentation of the proposed technique is performed on a system with 2 GB RAM, an Intel Core i3 processor, and the Windows 10 operating system. The proposed method is implemented in MATLAB.

3.2. Database description

The datasets used are the Reuter database and the 20 Newsgroups database.

3.2.1. 20 Newsgroups database

The 20 Newsgroups dataset (20 Newsgroup database, 2018) contains nearly 20,000 documents that are divided into 20 groups. Ken Lang gathered this dataset, where each newsgroup represents a topic.

3.2.2. Reuter database

The Reuter dataset (Reuter database, 2018) consists of documents associated with newswire stories. The documents are divided into PEOPLES, TOPICS, ORGS, PLACES, and EXCHANGES.

3.3. Performance metrics

The evaluation of the proposed technique is performed using four metrics, namely precision, recall, F-measure, and accuracy.

Precision: Precision is defined as the nearness of two or more measurements to each other, and is distinct from accuracy.

Precision = t_p / (t_p + f_p)    (24)

where t_p denotes the true positives and f_p represents the false positives.

Recall: Recall is defined as the proportion of the actual positives that the system captures by labelling them as true positives.

Recall = t_p / (t_p + f_n)    (25)

where f_n denotes the false negatives.

F-measure: The F-measure refers to the accuracy of a test, computed as the harmonic mean of the precision and recall of the test.

F-measure = 2 × (Precision × Recall) / (Precision + Recall)    (26)

Accuracy: The accuracy denotes the closeness of the clustering results of the Modsup + Rn-MSA approach to the true document classes, and is expressed as,

Accuracy = (t_p + t_n) / (t_p + t_n + f_p + f_n)    (27)
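The metrics in Eqs. (24)-(27) follow directly from the confusion-matrix counts. A small illustrative Python sketch (toy counts, not the paper's results):

```python
# Eqs. (24)-(27) computed from true/false positive and negative counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r):
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

tp, tn, fp, fn = 90, 85, 10, 15
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 3), round(r, 3))            # 0.9 0.857
print(round(f_measure(p, r), 3))           # 0.878
print(round(accuracy(tp, tn, fp, fn), 3))  # 0.875
```

Since the F-measure is a harmonic mean, it is always pulled toward the smaller of precision and recall, which is why both must be high for a high F-measure.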

3.4. Performance analysis

The performance analysis of the developed Modsup + Rn-MSA is carried out with respect to precision, recall, F-measure, and accuracy. Improving these parameters enhances the performance of the technique.

Fig. 2. Flow diagram of the proposed document clustering process.

3.4.1. Using 20 Newsgroups database

Fig. 3 shows the performance analysis of the developed Modsup + Rn-MSA based on precision, accuracy, recall, and F-Measure using the 20 Newsgroups database. The performance analysis based on precision for varying cluster sizes is depicted in Fig. 3a). In this figure, the precision value ranges from 0 to 100, and the cluster size varies between 4 and 20. When the cluster size is 4, the precision values measured by the Modsup + Rn-MSA for population sizes 10, 20, 30, 40, and 50 are 93.620%, 93.8265%, 94.511%, 95.764%, and 96%, respectively. Similarly, when the cluster size is increased to 20, the precision value obtained by the Modsup + Rn-MSA with population size 10 is 63.828%, with population size 20 is 66.588%, with population size 30 is 66.588%, with population size 40 is 81.161%, whereas,


Fig. 3. Performance analysis of the developed Modsup + Rn-MSA based on population size, (a) Precision, (b) Recall, (c) F-Measure, and (d) Accuracy.

the Modsup + Rn-MSA with population size 50 achieved the precision of 81.223%. From the above interpretation, it is clearly shown the maximum precision measure of 96% is attained when the population size is 50 for the cluster size 4. Fig. 3b) depicts the performance analysis in terms of the recall by varying the size of the cluster. When the size of the cluster is 8, the recall values measured by the Modsup + Rn-MSA with population size 10, 20, 30, 40, and 50 are 85.578%, 87.573%, 91.799%, 93.784%, and 94.978%, respectively. Likewise, when the size of the cluster is 16, the recall value obtained by the Modsup + RnMSA with population size 10 is 74.897%, Modsup + Rn-MSA with population size 20 is 74.897%, Modsup + Rn-MSA with population size 30 is 75.199%, Modsup + Rn-MSA with population size 40 is 87.448%, and Modsup + Rn-MSA with population size 50 is 87.448%. Fig. 3c) depicts the performance analysis in terms of F-Measure by varying the cluster size. When the size of the cluster is 12, the FMeasure values measured by the Modsup + Rn-MSA with population size 10, 20, 30, 40, and 50 are 79.703%, 85.578%, 90.014%, 92.117%, and 93.757% respectively. When the size of the cluster is increased to 20, the F-Measure value obtained by the Modsup + Rn-MSA with population size10 is 63.284%, Modsup + Rn-MSA

with population size 20 is 64.317%, with population size 30 is 66.601%, with population size 40 is 67.106%, and with population size 50 is 67.234%. Fig. 3d) depicts the performance analysis in terms of accuracy by varying the cluster size. When the cluster size is 16, the accuracy values measured by the Modsup + Rn-MSA with population sizes 10, 20, 30, 40, and 50 are 74.9%, 74.9%, 75.1%, 87.45%, and 87.45%, respectively. When the cluster size is increased to 20, the accuracy value obtained by the Modsup + Rn-MSA with population size 10 is 70.2%, with population size 20 is 71.5%, with population size 30 is 74.9%, with population size 40 is 75.1%, and with population size 50 is 75.2%. 3.4.2. Using Reuter database Fig. 4 shows the performance analysis of the developed Modsup + Rn-MSA based on precision, recall, F-Measure, and accuracy using the Reuter database. Fig. 4a) shows the performance analysis based on precision by varying the cluster size. In this figure, the precision value ranges from 0 to 100, and the cluster size varies between 4 and 20. When the cluster size is 12, the




Fig. 4. Performance analysis of the proposed Modsup + Rn-MSA based on population size, (a) Precision, (b) Recall, (c) F-Measure, and (d) Accuracy.

precision values measured by the Modsup + Rn-MSA for population sizes 10, 20, 30, 40, and 50 are 80.981%, 91%, 91.9%, 94.6%, and 96%, respectively. Similarly, when the cluster size is increased to 20, the precision value obtained by the Modsup + Rn-MSA with population size 10 is 62.233%, with population size 20 is 87.508%, with population size 30 is 90.612%, with population size 40 is 92.504%, whereas the Modsup + Rn-MSA with population size 50 achieved the precision of 95.5%. Fig. 4b) depicts the performance analysis in terms of recall by varying the cluster size. When the cluster size is 8, the recall values measured by the Modsup + Rn-MSA with population sizes 10, 20, 30, 40, and 50 are 86.602%, 92.8%, 94.6%, 96%, and 96%, respectively. Similarly, when the cluster size is 16, the recall value obtained by the Modsup + Rn-MSA with population size 10 is 83.365%, with population size 20 is 91%, with population size 30 is 91.9%, with population size 40 is 94.6%, and with population size 50 is 96%. Fig. 4c) depicts the performance analysis in terms of F-Measure by varying the cluster size. When the cluster size is 16, the F-Measure values measured by the Modsup + Rn-MSA with population sizes 10, 20, 30, 40, and 50 are 83.367%, 89.2%, 92.355%, 94.6%, and 96%, respectively. Likewise, when the cluster size is

increased to 20, the F-Measure value obtained by the Modsup + Rn-MSA with population size 10 is 75.81%, with population size 20 is 86.5%, with population size 30 is 91%, with population size 40 is 92.331%, whereas the Modsup + Rn-MSA with population size 50 achieved the F-Measure of 95.5%. Fig. 4d) depicts the performance analysis in terms of accuracy by varying the cluster size. When the cluster size is 8, the accuracy values measured by the Modsup + Rn-MSA with population sizes 10, 20, 30, 40, and 50 are 86.6%, 92.8%, 94.6%, 95%, and 95%, respectively. When the cluster size is increased to 16, the accuracy value obtained by the Modsup + Rn-MSA with population size 10 is 83.333%, with population size 20 is 91%, with population size 30 is 91.9%, with population size 40 is 94.6%, and with population size 50 is 95%. 3.5. Comparative analysis The comparative analysis of the proposed Modsup + Rn-MSA against other comparative techniques is elaborated in this section. The comparative analysis is performed by varying the cluster size, and the results are evaluated based on precision, recall, F-Measure, and accuracy.
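The four measures used throughout this evaluation can be computed per cluster from true/false positive and negative counts. The following is a minimal sketch using the standard definitions of these metrics, which are assumed here since the paper does not restate the formulas; the counts are hypothetical, for illustration only.

```python
def precision(tp, fp):
    # Fraction of documents assigned to a cluster that truly belong to it.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of documents of a class that the cluster recovered.
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    # Fraction of all assignment decisions that are correct.
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts for a single cluster.
tp, fp, fn, tn = 90, 10, 6, 94
print(round(precision(tp, fp), 3))          # 0.9
print(round(recall(tp, fn), 3))             # 0.938
print(round(f_measure(tp, fp, fn), 3))      # 0.918
print(round(accuracy(tp, tn, fp, fn), 3))   # 0.92
```

Multiplying these fractions by 100 yields the percentage values reported in the figures and tables.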


M. Yarlagadda et al. / Journal of King Saud University – Computer and Information Sciences xxx (xxxx) xxx

3.6. Competing methods The methods, such as Modsup + MSA, Incremental Construction of GMM Tree (ICGT) (Wan et al., 2018), and Incremental K-Means (Sangaiah and Fakhry, 2018), are utilized for comparison with the proposed Modsup + Rn-MSA. 3.6.1. Comparative analysis using 20 Newsgroups database The comparative analysis of the developed method based on precision, recall, F-Measure, and accuracy using the 20 Newsgroups dataset is depicted in Fig. 5. Fig. 5a) shows the analysis based on precision with different cluster sizes. When the cluster size is 8, the existing techniques, such as Modsup + MSA, ICGT, and Incremental K-Means, achieve the precision of 91.67%, 86.255%, and 64.57%, respectively, which is comparatively lower than that of the method using Modsup + Rn-MSA. For the same data, the Modsup + Rn-MSA acquired the precision of 94.134%. The comparative analysis based on recall is depicted in Fig. 5b). When the cluster size is 20, the recall of the existing methods, like Modsup + MSA, ICGT, and Incremental K-Means, is 75.09%, 60%, and 60%, respectively, whereas the Modsup + Rn-MSA attained the recall of 75.199%. The analysis based on F-Measure is depicted in Fig. 5c). Here, for cluster size 4, the existing techniques, like Modsup + MSA, ICGT, and Incremental K-Means, achieved the F-Measure of 94.55%, 91.61%, and 90.258%, respectively, but the Modsup + Rn-MSA acquired the F-Measure of 95.57%. When the cluster size is 20, the existing techniques achieved the F-Measure of 67.106%, 66.944%, and 60%, respectively. Meanwhile, the Modsup + Rn-MSA attained the F-Measure of 88.838%. The comparative analysis, in terms of accuracy, is depicted in Fig. 5d). When cluster size 16 is considered, the existing methods, like Modsup + MSA, ICGT, and Incremental K-Means, acquire the accuracy of 86.9%, 86.75%, and 74.9%, respectively. Meanwhile, the Modsup + Rn-MSA obtained the accuracy of 87.45%. 3.6.2. Comparative analysis using Reuter database The comparative analysis of the proposed Modsup + Rn-MSA based on precision, recall, and F-Measure using the Reuter dataset is depicted in Fig. 6. Fig. 6a) shows the analysis based on precision with different cluster sizes. When the cluster size is 4, the existing techniques, such as Modsup + MSA, ICGT, and Incremental K-Means, achieve the precision of 92.8%, 91.691%, and 85.6%, respectively, which is comparatively lower than that of the method using Modsup + Rn-MSA. For the same data, the Modsup + Rn-MSA acquired the precision of 96.4%. The comparative analysis based on recall is depicted in Fig. 6b). When

Fig. 5. Comparative analysis by varying the cluster size using 20 Newsgroups database (a) Precision, (b) Recall, (c) F-Measure, and (d) Accuracy.




Fig. 6. Comparative analysis by varying the cluster size using Reuter database (a) Precision, (b) Recall, (c) F-Measure, and (d) Accuracy.

the cluster size is 16, the recall values of the existing methods, like Modsup + MSA, ICGT, and Incremental K-Means, are 91.742%, 87.276%, and 82%, respectively, whereas the Modsup + Rn-MSA attained the recall of 93.751%. The comparative analysis based on F-Measure is depicted in Fig. 6c). Here, for cluster size 16, the existing techniques, like Modsup + MSA, ICGT, and Incremental K-Means, achieved the F-Measure of 88.50%, 84.55%, and 82%, respectively, but the Modsup + Rn-MSA acquired the F-Measure of 94.6%. When the cluster size is increased to 20, the existing techniques achieved the F-Measure of 83.03%, 82.954%, and 69.737%, respectively. Meanwhile, the Modsup + Rn-MSA attained the F-Measure of 94.52%. The comparative analysis, in terms of accuracy, is depicted in Fig. 6d). When cluster size 12 is considered, the existing methods, like Modsup + MSA, ICGT, and Incremental K-Means, acquire the accuracy of 93.725%, 92.85%, and 86.5%, respectively. Meanwhile, the Modsup + Rn-MSA obtained the accuracy of 95%. 3.7. Comparative discussion Table 2 presents the discussion to reveal the best performance attained by the compared methods, based on precision, recall, F-Measure, and accuracy using the 20 Newsgroup database.

Modsup + MSA acquired the precision, recall, F-Measure, and accuracy of 91.69%, 94.14%, 94.55%, and 94.14%, respectively, while ICGT attains the precision, recall, F-Measure, and accuracy of 91.62%, 91.62%, 91.61%, and 91.73%, respectively. The precision, recall, F-Measure, and accuracy values of Incremental K-Means are 82.96%, 91.51%, 90.25%, and 91.62%, respectively. Among all the compared methods, Modsup + Rn-MSA possesses improved performance, with precision, recall, F-Measure, and accuracy of 95.90%, 94.37%, 95.57%, and 94.38%, respectively. Table 3 presents the comparative discussion to reveal the best performance attained by the methods based on precision, recall, F-Measure, and accuracy using the Reuter database. Modsup + MSA acquired precision, recall, F-Measure, and accuracy values of 92.81%, 94.87%, 92.84%, and 94.88%, respectively, while ICGT attains 91.69%, 94.51%, 92.60%, and 94.52%, respectively. Incremental K-Means achieved 91.69%, 90.77%, 89.60%, and 90.76%, respectively. Among all the compared methods, Modsup + Rn-MSA possesses improved performance, with precision, recall, F-Measure, and accuracy of 95.61%, 96.41%, 96.41%, and 95.12%, respectively. From the comparative analysis, it can be seen that the proposed algorithm achieves higher performance than the existing methods. The reasons for the high performance of the proposed



Table 2
Comparative analysis based on 20 Newsgroup database.

Methods                     Precision (%)   Recall (%)   F-Measure (%)   Accuracy (%)
Modsup + MSA                91.69           94.14        94.55           94.14
ICGT                        91.62           91.62        91.61           91.73
Incremental K-Means         82.96           91.51        90.25           91.62
Proposed Modsup + Rn-MSA    95.90           94.37        95.57           94.38

Table 3
Comparative analysis based on Reuter database.

Methods                     Precision (%)   Recall (%)   F-Measure (%)   Accuracy (%)
Modsup + MSA                92.81           94.87        92.84           94.88
ICGT                        91.69           94.51        92.60           94.52
Incremental K-Means         91.69           90.77        89.60           90.76
Proposed Modsup + Rn-MSA    95.61           96.41        96.41           95.12

Table 4
Statistical analysis based on 20 Newsgroup database.

Metric            Modsup + MSA   ICGT    Incremental K-Means   Proposed Modsup + Rn-MSA
Precision         91.69          91.62   82.96                 95.90
  Mean            91.65          91.58   82.92                 95.88
  Variance        0.04           0.04    0.04                  0.02
Recall            94.14          91.62   91.51                 94.37
  Mean            94.1           91.58   91.48                 94.35
  Variance        0.04           0.04    0.03                  0.02
F-Measure         94.55          91.61   90.25                 95.57
  Mean            94.51          91.58   90.21                 95.55
  Variance        0.04           0.03    0.04                  0.02
Accuracy          94.14          91.73   91.62                 94.38
  Mean            94.1           91.7    91.56                 94.37
  Variance        0.04           0.03    0.06                  0.01

Table 5
Statistical analysis based on Reuter database.

Metric            Modsup + MSA   ICGT    Incremental K-Means   Proposed Modsup + Rn-MSA
Precision         92.81          91.69   91.69                 95.61
  Mean            92.79          91.65   91.66                 95.59
  Variance        0.03           0.04    0.03                  0.02
Recall            94.87          94.51   90.77                 96.41
  Mean            94.82          94.48   90.74                 96.39
  Variance        0.05           0.03    0.03                  0.02
F-Measure         92.84          92.60   89.60                 96.41
  Mean            92.8           92.57   89.56                 96.39
  Variance        0.04           0.03    0.04                  0.02
Accuracy          94.88          94.52   90.76                 95.12
  Mean            94.85          94.50   90.73                 95.11
  Variance        0.03           0.02    0.03                  0.01

algorithm are as follows: the Rn-MSA algorithm finds the best solution effectively, since it has the advantages of both ROA and MSA; the MSA negotiates complex operations, and its execution is easy and flexible; hence, the incorporation of the ROA concept into MSA determines the optimal regions. Also, the feature selection is performed using the proposed frequent itemset approach, which reduces the dimensionality of the documents.

3.8. Statistical analysis This section presents the statistical tests comparing the results with those of the existing methods. The statistical analysis is performed based on mean and variance, and the results are shown in Tables 4 and 5. The statistical analysis based on the 20 Newsgroup database is provided in Table 4, and that based on the Reuter database is provided in Table 5. From the analysis, it can be seen that the proposed method achieves better performance than the existing methods. 4. Conclusion This research paper presents an approach for document clustering using Modsup + Rn-MSA. At first, the documents are pre-processed to remove redundant and unnecessary words using two steps, namely, stop word removal and stemming. After pre-processing, feature extraction is carried out using TF-IDF and Wordnet features to find the important keywords of the document. Then, feature selection is performed using frequent itemsets for the establishment of feature knowledge. Finally, document clustering is performed using the proposed Modsup + Rn-MSA, which clusters the documents based on their similarity in a way that is effective for applications related to document retrieval. Thus, the proposed Modsup




+ Rn-MSA method is employed for clustering the documents. Experimentation is carried out using two databases, namely the 20 Newsgroup database and the Reuter database. The performance of the Modsup + Rn-MSA is evaluated based on precision, recall, F-Measure, and accuracy. The proposed method produces the maximal precision of 95.90%, maximal recall of 96.41%, maximal F-Measure of 96.41%, and maximal accuracy of 95.12%, which indicates the superiority of the proposed method. Future work will concentrate on extending the analysis using other standard databases with highly advanced features.
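The pre-processing and TF-IDF steps recapped above can be sketched as follows. This is a minimal illustration only: the stop word list, the crude suffix-stripping "stemmer", and the toy corpus are assumptions, not the exact components used in the paper.

```python
import math

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

def preprocess(text):
    # Stop word removal followed by a crude suffix strip
    # (a stand-in for a real stemmer such as Porter's).
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [t[:-1] if t.endswith("s") else t for t in tokens]

def tf_idf(docs):
    # TF-IDF: term frequency weighted by inverse document frequency.
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)
            idf = math.log(n / df[term])
            w[term] = tf * idf
        weights.append(w)
    return weights

# Toy corpus of two documents.
docs = [preprocess("the clustering of documents"),
        preprocess("documents and frequent itemsets")]
vectors = tf_idf(docs)
```

A term present in every document (here "document") receives zero weight, while terms that discriminate between documents receive higher weights; these weighted vectors are what the subsequent feature selection and clustering stages would operate on.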

Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References 20 Newsgroup database, "http://qwone.com/~jason/20Newsgroups/", accessed on October 2018. Agrawal, Rakesh, Imieliński, Tomasz, Swami, Arun, 1993. Mining association rules between sets of items in large databases. ACM Sigmod Rec. 22 (2), 207–216. Berkhin, Pavel, 2006. Survey of clustering data mining techniques. In: Grouping Multidimensional Data, pp. 25–71. Binu, D., Kariyappa, B.S., 2018. RideNN: a new rider optimization algorithm-based neural network for fault diagnosis in analog circuits. IEEE Trans. Instrum. Measur. Elberrichi, Zakaria, Rahmoun, Abdelattif, Bentaalah, Mohamed Amine, 2008. Using WordNet for text categorization. Int. Arab J. Inf. Technol. 5 (1), 16–24. Farahat, Ahmed K., Kamel, Mohamed S., 2011. Statistical semantics for enhancing document clustering. Knowl. Inf. Syst. 28 (2), 365–393. Forsati, Rana, 2013. Efficient stochastic algorithms for document clustering. Inf. Sci. 220, 269–291. Gharib, Tarek F., Fouad, Mohammed M., Mashat, Abdulfattah, Bidawi, Ibrahim, 2012. Self organizing map-based document clustering using WordNet ontologies. IJCSI Int. J. Comput. Sci. Issues 9 (2). Gulnashin, Fatima, Sharma, Iti, Sharma, Harish, 2019. A new deterministic method of initializing spherical k-means for document clustering. In: Progress in Advanced Computing and Intelligent Engineering, pp. 149–155. Hotho, Andreas, Staab, Steffen, Stumme, Gerd, 2003. Ontologies improve text document clustering. In: Proceedings of the Third IEEE International Conference on Data Mining. Choi, Hyeok-Jun, Park, Cheong Hee, 2019. Emerging topic detection in twitter stream based on high utility pattern mining. Expert Syst. Appl. 115, 27–36. Jensi, R., Wiselin Jiji, G., 2013. A survey on optimization approaches to text document clustering. Int. J. Comput. Sci. Appl. (IJCSA) 3 (6).

Karpagam, K., Saradha, A., 2019. A framework for intelligent question answering system using semantic context-specific document clustering and Wordnet, pp. 44–62. Kiran, G.V.R., Shankar, Ravi, Pudi, Vikram, 2010. Frequent itemset based hierarchical document clustering using Wikipedia as external knowledge. In: International Conference on Knowledge-based and Intelligent Information and Engineering Systems, pp. 11–20. Manning, Christopher D., Raghavan, Prabhakar, Schütze, Hinrich, 2009. An Introduction to Information Retrieval. Onan, Aytug, Bulut, Hasan, Korukoglu, Serdar, 2017. An improved ant algorithm with LDA based representation for text document clustering. J. Inf. Sci. 43 (2), 275–292. Pamba, Raja Varma, Sherly, Elizabeth, Mohan, Kiran, 2019. Self-adaptive frequent pattern growth-based dynamic fuzzy particle swarm optimization for web document clustering. In: Computational Intelligence: Theories, Applications and Future Directions, pp. 15–25. Pei, Xiaobing, Chen, Chuanbo, Gong, Weihua, 2016. Concept factorization with adaptive neighbors for document clustering. IEEE Trans. Neural Networks Learn. Syst. Pera, Maria Soledad, Ng, Yiu-Kai, 2010. Naive Bayes classifier for web document summaries created by using word similarity and significant factors. Int. J. Artif. Intell. Tools 19 (4), 465–486. Krishnapuram, Raghu, Joshi, A., Nasraoui, O., Yi, L., 2001. Low-complexity fuzzy relational clustering algorithms for web mining. IEEE Trans. Fuzzy Syst. 9 (4), 595–607. Reuter database, https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/, accessed on October 2018. Sangaiah, Arun Kumar, Fakhry, Ahmed E., Abdel-Basset, Mohamed, El-henawy, Ibrahim, 2018. Arabic text clustering using improved clustering algorithms with dimensionality reduction. Cluster Comput., 1–15. Sharma, Iti, Jain, Aaditya, Sharma, Harish, 2019. Stream and online clustering for text documents. In: Proceedings of International Conference on Advanced Computing Networking and Informatics, pp. 469–475. Sumathi Rani, Manukonda, China Babu, Geddati, 2019. Efficient query clustering technique and context well-informed document clustering. In: Soft Computing and Signal Processing, pp. 261–271. Wan, Yuchai, Liu, Xiabi, Wu, Yi, Guo, Lunhao, Chen, Qiming, Wang, Murong, 2018. ICGT: A Novel Incremental Clustering Approach based on GMM Tree. Wang, Gai-Ge, 2016. Moth search algorithm: a bio-inspired metaheuristic algorithm for global optimization problems. Memetic Comput. Wu, Jimmy Ming-Tai, Lin, Jerry Chun-Wei, Fournier-Viger, Philippe, Djenouri, Youcef, Chen, Chun-Hao, Li, Zhongcui, 2019. The density-based clustering method for privacy-preserving data mining. Math. Biosci. Eng.: MBE 16 (3), 1718–1728. Xie, Pengtao, Xing, Eric P., 2013. Integrating Document Clustering and Topic Modeling. Zhang, Zhiqiang, Wang, Linan, Xie, Xiaoqin, Pan, Haiwei, 2006. A graph based document retrieval method. In: Proceedings of International Symposium on Communications and Information Technologies.

Further reading Lee, Daniel D., Sebastian Seung, H., 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401 (6755), 788. Salton, G., Wong, A., Yang, C.S., 1975. A vector space model for automatic indexing. Commun. ACM 18 (11), 613–620.
