A similarity assessment technique for effective grouping of documents


Tanmay Basu, C.A. Murthy
Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India
Information Sciences 311 (2015) 149–162

Article history: Received 20 February 2014; Received in revised form 25 December 2014; Accepted 15 March 2015; Available online 21 March 2015.
Keywords: Document clustering; Text mining; Applied data mining

Abstract: Document clustering refers to the task of grouping similar documents and segregating dissimilar ones. It is very useful for finding meaningful categories in a large corpus. In practice, categorizing a corpus is not easy, since it generally contains a huge number of documents and the document vectors are high dimensional. This paper introduces a hybrid document clustering technique by combining a new hierarchical clustering technique with the traditional k-means technique. A distance function is proposed to find the distance between the hierarchical clusters. Initially the algorithm constructs some clusters by the hierarchical clustering technique using the new distance function. Then the k-means algorithm is performed, using the centroids of the hierarchical clusters, to group the documents that are not included in the hierarchical clusters. The major advantage of the proposed distance function is that it is able to capture the nature of a corpus by varying a similarity threshold. Thus the proposed clustering technique does not require the number of clusters prior to executing the algorithm, and the initial random selection of k centroids for the k-means algorithm is not needed. Experimental evaluation using Reuters, Ohsumed and various TREC data sets shows that the proposed method performs significantly better than several other document clustering techniques. F-measure and normalized mutual information are used to show that the proposed method effectively groups the text data sets.
© 2015 Elsevier Inc. All rights reserved.

1. Introduction

Clustering algorithms partition a data set into several groups such that the data points in the same group are close to each other and the points across groups are far from each other [9]. Document clustering algorithms try to identify the inherent grouping of documents to produce good quality clusters for text data sets. In recent years it has been recognized that partitional clustering algorithms, e.g., k-means and buckshot, are advantageous due to their low computational complexity. On the other hand, these algorithms need the number of clusters as input. Document corpora are generally huge, with high dimensionality, so it is not easy to estimate the number of clusters for a real life document corpus. Hierarchical clustering techniques do not need the number of clusters, but a stopping criterion is needed to terminate the algorithms, and finding a specific stopping criterion is difficult for large data sets. The main difficulty of most document clustering techniques is to determine the (content) similarity of a pair of documents for putting them into the same cluster [3]. Generally, cosine similarity is used to determine the content similarity between two documents [24]. Cosine similarity effectively checks the number of common terms present in the documents. If


two documents contain many common terms then they are very likely to be similar. The difficulty is that there is no clear explanation as to how many common terms identify two documents as similar. Text data sets are high dimensional and most terms do not occur in every document. Hence the issue is to define the content similarity in such a way that it can restrict low similarity values. The actual content similarity between two documents may not be captured properly by checking only the individual terms of the documents. A new distance function is proposed to find the distance between two clusters, based on a similarity measure between documents called extensive similarity. Intuitively, the extensive similarity restricts the low (content) similarity values by a predefined threshold and then determines the similarity between two documents from their distances to every other document in the corpus. It assigns a score to each pair of documents to measure their degree of content similarity. A threshold is set on the content similarity of the document vectors to restrict low similarity values, and a histogram thresholding based method is used to estimate this threshold from the similarity matrix of a corpus. A new hybrid document clustering algorithm is proposed, which is a combination of a hierarchical and a k-means clustering technique. The hierarchical clustering technique produces some initial clusters, named baseline clusters, by using the proposed cluster distance function. These clusters are created in such a way that the documents inside a cluster are very similar to each other; the extensive similarity between every pair of documents of a baseline cluster is very high, and the documents of two different baseline clusters are very dissimilar to each other. Thus the baseline clusters intuitively determine the actual categories of the document collection. Generally some singleton clusters remain after constructing the hierarchical clusters, and the distance between a singleton cluster and each baseline cluster is not small. Hence the k-means clustering algorithm is performed to group each of these documents into the baseline cluster with which it has the highest content similarity. If, over several iterations of the k-means algorithm, each of these singleton clusters is grouped into the same baseline cluster, then they are likely to be assigned correctly. A significant property of the proposed technique is that it can automatically identify the number of clusters; the experiments show that the number of clusters found for each corpus is very close to the actual number of categories. Experimental analysis using several well known TREC and Reuters data sets has shown that the proposed method performs significantly better than several existing document clustering algorithms.

The paper is organized as follows. Section 2 describes some related works. The document representation technique is presented in Section 3. The proposed document clustering technique is explained in Section 4. The criteria for evaluating the clusters generated by a particular method are described in Section 5. Section 6 presents the experimental results and a detailed analysis of them. Finally we conclude and discuss the further scope of this work in Section 7.

2. Related works

There are two basic types of document clustering techniques available in the literature: hierarchical and partitional clustering techniques [8,11].
Hierarchical clustering produces a hierarchical tree of clusters in which each level can be viewed as a combination of clusters in the next lower level. This hierarchical structure is also known as a dendrogram. Hierarchical clustering techniques can be divided into two types: agglomerative and divisive. In an Agglomerative Hierarchical Clustering (AHC) method [30], starting with each document as an individual cluster, the most similar clusters are merged at each step until a given termination condition is satisfied. In a divisive method, starting with the whole set of documents as a single cluster, a cluster is split into smaller clusters at each step until a given termination condition is satisfied. Several halting criteria for AHC algorithms have been proposed, but no widely accepted halting criterion is available for these algorithms; as a result some good clusters may be merged, which eventually becomes meaningless to the user. There are mainly three variations of AHC for document clustering: the single-link, complete-link and group-average hierarchical methods [6]. In the single-link method, the similarity between a pair of clusters is the similarity between their two most similar documents, one from each cluster. The complete-link method measures the similarity between a pair of clusters as the similarity between their two least similar documents, one from each cluster. The group-average method merges the two clusters with the highest average similarity, where average similarity means the average of the pairwise similarities between documents, one from each cluster. In a divisive hierarchical clustering technique the method initially treats the whole data set as a single cluster. Then at each step it chooses one of the existing clusters and splits it into two. The process continues until only singleton clusters remain or a given halting criterion is reached. Generally the cluster with the least overall similarity is chosen for splitting [30].

In a recent study, Lai et al. proposed an agglomerative hierarchical clustering algorithm that maintains a dynamic k-nearest-neighbor list for each cluster, named the Dynamic k-Nearest Neighbor Algorithm (DKNNA) [16]. Initially the method treats each document as a cluster and finds the k nearest neighbors of each cluster. The two closest clusters are merged and their nearest neighbors are updated accordingly; then the next two closest clusters are found and merged, and so on, until the desired number of clusters is obtained. In the merging and updating process of each iteration, the k-nearest-neighbor lists of the clusters affected by the merge are updated. If the set of k nearest neighbors becomes empty for some of the clusters being updated, their nearest neighbors are determined by searching all the clusters. Thus the approach can guarantee the exactness of the nearest neighbors of a cluster and can obtain good quality clusters [16]. Although the algorithm has shown good results for some artificial and image data sets, it has two


limitations when applied to text data sets: it needs the desired number of clusters, which is very difficult to predict, and it is problematic to determine a valid k for text data.

In contrast to hierarchical clustering techniques, partitional clustering techniques allocate data into a previously fixed number of clusters. The most commonly used partitional clustering technique is the k-means method, where k is the desired number of clusters [13]. Initially k documents are chosen randomly from the data set; they are called seed points. Each document is assigned to its nearest seed point, thereby creating k clusters. Then the centroids of the clusters are computed, and each document is assigned to its nearest centroid. The process continues until the clustering does not change, i.e., the centroids in two consecutive iterations remain the same, or until a fixed number of iterations set by the user is reached. The k-means algorithm is advantageous for its low computational complexity [23]; it takes linear time to build the clusters. The main disadvantage is that the number of clusters is fixed in advance, and it is very difficult to select a valid k for an unknown text data set. Also, there is no universally accepted way of choosing the initial seed points. Recently, Chiang et al. proposed a time efficient k-means algorithm that compresses and removes, at each iteration, the patterns that are unlikely to change their membership thereafter [22], but the limitations of the k-means clustering technique are not addressed there.

Bisecting k-means [30] is a variation of the basic k-means algorithm that tries to improve the quality of the clusters. In each iteration it selects the largest existing cluster (the whole data set in the first iteration) and divides it into two subsets using k-means with k = 2. This process is continued until k clusters are formed. Bisecting k-means generally produces clusters of almost uniform size. Thus it can perform better than k-means when the actual groups of a data set are of similar size, i.e., when the numbers of documents in the categories of a corpus are close to each other. On the contrary, the method produces poor clusters for corpora in which the category sizes differ widely. This method also faces the same difficulties as k-means in choosing the initial seed points and a proper value of the parameter k.

The buckshot algorithm is a combination of the basic k-means and hierarchical clustering methods. It tries to improve the performance of k-means by choosing better initial centroids [26]. It runs a hierarchical clustering algorithm on a sample of documents from the corpus in order to find robust initial centroids, and then k-means is performed using these centroids as the initial seeds [3]. However, repeated calls to this algorithm may produce different partitions; if the initial random sample does not represent the whole data set properly, the resulting clusters may be of poor quality. Note that an appropriate value of k is necessary for this method too.
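Both bisecting k-means and buckshot rely on repeatedly running k-means on subsets of the data. As a concrete reference point, here is a minimal sketch of the bisecting variant described above. It assumes scikit-learn and a precomputed document-term matrix `X` (both are our own choices for illustration; the paper does not tie the method to any library), and it simply keeps splitting the largest cluster with 2-means until k clusters remain.

```python
# A minimal sketch of bisecting k-means, assuming scikit-learn is available.
# X is any (n_documents x n_terms) tf-idf matrix; k is the desired number of clusters.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, random_state=0):
    clusters = [np.arange(X.shape[0])]           # start with one cluster holding every document
    while len(clusters) < k:
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)              # select the largest existing cluster
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=random_state).fit_predict(X[idx])
        clusters.append(idx[labels == 0])        # split it into two sub-clusters
        clusters.append(idx[labels == 1])
    return clusters                              # list of index arrays, one per cluster
```

The repeated 2-means splits are precisely what make the method sensitive to uneven category sizes, as discussed above.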
Spectral clustering is a very popular clustering method which works on the similarity matrix rather than on the original term-document matrix, using the idea of graph cut. It uses the top eigenvectors of the similarity matrix derived from the similarity between documents [25]. The basic idea is to construct a weighted graph from the corpus, where each node represents a document and each weighted edge represents the similarity between two documents; the clustering problem is then formulated as a graph cut problem. The core of this theory is the eigenvalue decomposition of the Laplacian matrix of the weighted graph obtained from the data [10]. Let X = {d_1, d_2, ..., d_N} be the set of N documents to cluster and let S be the N × N similarity matrix, where S_ij represents the similarity between documents d_i and d_j. Ng et al. [25] proposed a spectral clustering algorithm which partitions the data into k subsets using the k largest eigenvectors of the Laplacian matrix L, with a Gaussian kernel applied to the similarity matrix:

S_{ij} = \exp\left(-\frac{\rho(d_i, d_j)}{2\sigma^2}\right)

Here ρ(d_i, d_j) denotes the similarity between d_i and d_j and σ is the scaling parameter. The Gaussian kernel is used to get rid of the curse of dimensionality. The main difficulty of using a Gaussian kernel is that it is sensitive to the parameter σ [21]: a wrong value of σ may severely degrade the quality of the clusters, and it is extremely difficult to select a proper value of σ for a document collection, since text data sets are generally sparse and high dimensional. The method also suffers from the limitations of the k-means method discussed above.

Non-negative Matrix Factorization (NMF) has previously been shown to be a useful decomposition for multivariate data; it finds a positive factorization of a given positive matrix [19]. Xu et al. [33] have demonstrated that NMF performs well for text clustering compared to similar methods such as singular value decomposition and latent semantic indexing. The technique factorizes the original term-document matrix D approximately as D ≈ UV^T, where U is a non-negative matrix of size n × m and V^T is an m × N non-negative matrix. The basis vectors in U can be interpreted as sets of terms from the vocabulary of the corpus, while V describes the contribution of the documents to these terms. The matrices U and V are randomly initialized, and their contents iteratively estimated [1]. The Non-negative Matrix Factorization method attempts to determine U and V which minimize the objective function

J = \frac{1}{2} \left\| D - UV^T \right\|^2    (1)

where ‖·‖ denotes the Frobenius norm, i.e., ‖·‖^2 is the squared sum of all the elements of the matrix. This is an optimization problem with respect to the matrices U = [u_ik] and V = [v_jk], ∀i = 1, 2, ..., n, ∀j = 1, 2, ..., N and k = 1, 2, ..., m, and as the matrices U and V are non-negative we have u_ik ≥ 0, v_jk ≥ 0. This is a typical constrained non-linear optimization problem and can be solved using the Lagrange method [3]. An interesting property of the NMF technique is that it can also be used to find word clusters instead of document clusters: the columns of U can be used to discover a basis which corresponds to word clusters. The NMF algorithm has its disadvantages too. The optimization problem of Eq. (1) is convex in either U or V, but not in both U and V, which means that the algorithm can only guarantee convergence to a local minimum.
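As a concrete illustration of the factorization in Eq. (1), the sketch below clusters a toy corpus with scikit-learn's NMF implementation and reads cluster memberships off the document-factor matrix. The library choice, the documents-as-rows orientation (the transpose of the term-document D above), and all variable names are our own assumptions for illustration, not part of the original paper.

```python
# Hedged sketch: NMF-based document clustering in the spirit of Eq. (1),
# using scikit-learn (our choice of library) with documents as matrix rows.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stock market falls", "market rises on earnings", "team wins the final match"]
D = TfidfVectorizer().fit_transform(docs)        # tf-idf weights, one row per document

m = 2                                            # number of clusters / latent factors
model = NMF(n_components=m, init="nndsvd", max_iter=500, random_state=0)
V = model.fit_transform(D)                       # document-factor matrix (N x m)
U = model.components_.T                          # term-factor matrix (n x m)

labels = np.argmax(V, axis=1)                    # assign each document to its dominant factor
print(labels)
```

Unless a deterministic initialization such as "nndsvd" is used, the factors are initialized randomly, so different runs can yield different clusterings, which is exactly the sensitivity discussed next.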


In practice, NMF users often compare the local minima obtained from several different starting points and use the result of the best local minimum found; on large corpora this may be problematic [17]. Another problem with the NMF algorithm is that it relies on random initialization, and as a result the same data might produce different results across runs [1]. Xu et al. [34] proposed a Concept Factorization (CF) based document clustering technique, which models each cluster as a linear combination of the documents, and each document as a linear combination of the cluster centers. The document clustering is then accomplished by computing the two sets of linear coefficients, which is carried out by finding the non-negative solution that minimizes the reconstruction error of the documents. The major advantage of CF over NMF is that it can be applied to data containing negative values and can be implemented in kernel space. The method has to select k concepts (cluster centers) initially, and it is very difficult to predict a value of k in practice. Dasgupta et al. [7] proposed a simple active clustering algorithm which is capable of producing multiple clusterings of the same data according to user interest. The advantage of this algorithm is that the user feedback it requires is minimal compared to other feedback-oriented clustering techniques, but the algorithm may suffer when the topics are sensitive or the users' perceptions vary. Carpineto et al. have presented a thorough survey of search results clustering techniques, elaborately discussing various issues related to web clustering engines [5]. Wang et al. [32] proposed an efficient soft-constraint algorithm that seeks a satisfactory clustering in which as many constraints as possible are respected. The algorithm is essentially an optimization procedure that starts from randomly chosen initial cluster centroids, and it can produce insignificant clusters if the initial centroids are not properly selected. Zhu et al. [35] proposed a semi-supervised Non-negative Matrix Factorization method based on the pairwise constraints must-link and cannot-link. In this method must-link constraints are used to control the distance of the data in the compressed form, and cannot-link constraints are used to control the encoding factor; the method has shown very good performance on some real life text corpora. The algorithm is a new variety of the NMF method, which again relies on random initialization and may produce different clusters across runs on corpora whose category sizes vary widely.

3. Vector space model for document representation

Throughout this article the number of documents in the corpus is denoted by N and the number of terms by n. The i-th term is represented by t_i. The number of times term t_i occurs in the j-th document is denoted by tf_ij, i = 1, 2, ..., n, j = 1, 2, ..., N. The document frequency df_i is the number of documents in which the term t_i occurs. The inverse document frequency, idf_i = log(N/df_i), reflects how frequently a word occurs in the document collection. The weight of the i-th term in the j-th document, denoted by w_ij, is determined by combining the term frequency with the inverse document frequency [29] as follows:

w_{ij} = tf_{ij} \times idf_i = tf_{ij} \times \log\left(\frac{N}{df_i}\right), \quad \forall i = 1, 2, \ldots, n \text{ and } \forall j = 1, 2, \ldots, N

In most clustering algorithms the documents are represented using the vector space model [29]. In this model each document d_j is considered to be a vector d̃_j = (w_1j, w_2j, ..., w_nj), whose i-th component is w_ij. A key factor in the success of any clustering algorithm is the selection of a good similarity measure. The similarity between two documents is computed through some similarity or distance function: given two document vectors d̃_i and d̃_j, the degree of similarity (or dissimilarity) between them has to be determined. Various similarity measures are available in the literature, but the most commonly used one is the cosine similarity between two document vectors [30], which is given by

\cos(\tilde{d}_i, \tilde{d}_j) = \frac{\tilde{d}_i \cdot \tilde{d}_j}{\|\tilde{d}_i\| \, \|\tilde{d}_j\|} = \frac{\sum_{k=1}^{n} w_{ik} \, w_{jk}}{\sqrt{\sum_{k=1}^{n} w_{ik}^2} \, \sqrt{\sum_{k=1}^{n} w_{jk}^2}}, \quad \forall i, j    (2)
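The tf-idf weighting and the cosine measure of Eq. (2) can be reproduced directly; the snippet below is a small numpy-only illustration with made-up term counts (the counts and names are ours, purely for illustration).

```python
# Small illustration of tf-idf weighting and the cosine similarity of Eq. (2).
import numpy as np

tf = np.array([[3, 0, 1],        # rows: documents, columns: terms (toy counts)
               [2, 1, 0],
               [0, 4, 1]], dtype=float)
N = tf.shape[0]                                  # number of documents
df = np.count_nonzero(tf, axis=0)                # document frequency of each term
w = tf * np.log(N / df)                          # w_ij = tf_ij * log(N / df_i)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cosine(w[0], w[1]), 3))              # content similarity of documents 1 and 2
```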

The weight of each term in a document is non-negative, so the cosine similarity is non-negative and bounded between 0 and 1. cos(d̃_i, d̃_j) = 1 means the documents are exactly similar, and the similarity decreases as the value decreases towards 0. An important property of the cosine similarity is its independence of document length, which has made it popular as a similarity measure in the vector space model [14]. Let D = {d_1, d_2, ..., d_r} be a set of r documents, each represented over the n terms. The centroid of D, D_cn, can be calculated as D_cn = (1/r) Σ_{j=1}^{r} d̃_j, where d̃_j is the vector corresponding to document d_j.

4. Proposed clustering technique for effective grouping of documents

A combination of hierarchical and k-means clustering methods, based on a similarity assessment technique, is introduced to effectively group the documents. The existing document clustering algorithms discussed so far determine the (content) similarity of a pair of documents in order to put them into the same cluster. Generally the content similarity is determined by the cosine of the angle between two document vectors, which effectively checks the number of common terms present in the documents. If two documents contain many common terms then the documents


are very likely to be similar, but the difficulty is that there is no clear explanation as to how many common terms identify two documents as similar. Text data sets are high dimensional and most terms do not occur in every document. Hence the issue is to define the content similarity in such a way that it can restrict low similarity values. The actual content similarity between two documents may not be captured properly by checking only the individual terms of the documents. Intuitively, if two documents are similar content wise then they should have a similar type of relation with most of the other documents, i.e., if two documents x and y have similar content and x is similar to any other document z, then y must be similar or somehow related to z. This important characteristic is not captured by the cosine similarity measure.

4.1. A similarity assessment technique

A similarity measure, Extensive Similarity (ES), is used in the proposed work to find the similarity between two documents. The measure extensively checks all the documents in the corpus to determine the similarity: the extensive similarity between two documents is determined depending on their distances from every other document in the corpus. Intuitively, two documents are exactly similar if they have sufficient content similarity and they have almost the same distance from every other document in the corpus (i.e., both are either similar or dissimilar to all the other documents) [18]. The content similarity is defined as a binary valued distance function. The distance between two documents is minimum, i.e., 0, when they have sufficient content similarity; otherwise the distance is 1, i.e., they have very low content similarity. The distance between two documents d_i and d_j, ∀i, j, is determined by putting a threshold θ ∈ (0, 1) on their content similarity as follows:

dis(d_i, d_j) = \begin{cases} 1 & \text{if } \rho(d_i, d_j) \le \theta \\ 0 & \text{otherwise} \end{cases}    (3)

where ρ is the similarity measure used to find the content similarity between d_i and d_j. Here θ is a threshold on the content similarity, used to restrict low similarity values; a data dependent method for estimating the value of θ is discussed later. In the context of document clustering, ρ is taken as the cosine similarity, i.e., ρ(d_i, d_j) = cos(d̃_i, d̃_j), where d̃_i and d̃_j are the vectors corresponding to documents d_i and d_j respectively. If dis(d_i, d_j) = 1 then we can strictly say that the documents are dissimilar. On the other hand, if the distance is 0, i.e., cos(d̃_i, d̃_j) > θ, then they have sufficient content similarity and the documents are somehow related to each other. Let us assume that d_i and d_j have cosine similarity 0.52, that d_j and d′ (another document) have cosine similarity 0.44, and that θ = 0.1. Then both dis(d_i, d_j) = 0 and dis(d_j, d′) = 0, and the task is to distinguish these two distances of the same value. The extensive similarity is thus designed to find the grade of similarity of pairs of documents which are similar content wise [18]. If dis(d_i, d_j) = 0 then the extensive similarity finds the individual content similarities of d_i and d_j with every other document, and assigns a score μ to denote the extensive similarity between the documents as below:

\mu_{i,j} = \sum_{k=1}^{N} \left| dis(d_i, d_k) - dis(d_j, d_k) \right|

Thus the extensive similarity between documents d_i and d_j, ∀i, j, is defined as

ES(d_i, d_j) = \begin{cases} N - \mu_{i,j} & \text{if } dis(d_i, d_j) = 0 \\ -1 & \text{otherwise} \end{cases}    (4)

Two documents d_i, d_j have the maximum extensive similarity N if the distance between them is zero and the distance between d_i and d_k is the same as the distance between d_j and d_k for every k. In general, if the above said distances differ for μ_{i,j} documents then the extensive similarity is N − μ_{i,j}. Unlike other similarity measures, ES takes into account the distances of the two documents d_i, d_j with respect to all the other documents in the corpus when measuring the distance between them [18]. μ_{i,j} indicates the number of documents with which the similarity of d_i is not the same as the similarity of d_j; as μ_{i,j} increases, the similarity between the documents d_i and d_j decreases. If μ_{i,j} = 0 then d_i and d_j are exactly similar. In effect μ_{i,j} denotes a grade of dissimilarity and indicates that d_i and d_j have different distances with μ_{i,j} documents. The extensive similarity is used to define the distance between two clusters in the first stage of the proposed document clustering method. A distance function is proposed to create the baseline clusters. It finds the distance between two clusters, say C_x and C_y. Let T_xy be a multiset consisting of the extensive similarities between each pair of documents, one from C_x and the other from C_y, defined as

T_{xy} = \{ ES(d_i, d_j) : ES(d_i, d_j) \ge 0, \; \forall d_i \in C_x \text{ and } d_j \in C_y \}

Note that T_xy contains every occurrence of the same extensive similarity value (if any) arising from different pairs of documents. The proposed distance between two clusters C_x and C_y is defined as

dist\_cluster(C_x, C_y) = \begin{cases} \infty & \text{if } T_{xy} = \emptyset \\ N - avg(T_{xy}) & \text{otherwise} \end{cases}    (5)
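A direct transcription of Eqs. (3)-(5) is given below. It assumes a precomputed cosine-similarity matrix `S` and is written for clarity rather than efficiency; the function and variable names are our own and are not prescribed by the paper.

```python
# Sketch of the extensive similarity (Eqs. (3)-(4)) and the cluster distance (Eq. (5)),
# assuming S is an N x N cosine-similarity matrix and theta the content-similarity threshold.
import numpy as np

def extensive_similarity(S, theta):
    N = S.shape[0]
    dis = (S <= theta).astype(int)               # Eq. (3): 1 = low content similarity, 0 = sufficient
    ES = np.full((N, N), -1, dtype=float)        # Eq. (4): -1 wherever dis(d_i, d_j) = 1
    for i in range(N):
        for j in range(N):
            if dis[i, j] == 0:
                mu = np.sum(np.abs(dis[i] - dis[j]))
                ES[i, j] = N - mu                # N - mu_{i,j}
    return ES

def dist_cluster(ES, Cx, Cy, N):
    # Eq. (5): infinite if no pair has a non-negative ES, otherwise N minus their average.
    T = [ES[i, j] for i in Cx for j in Cy if ES[i, j] >= 0]
    return float("inf") if not T else N - float(np.mean(T))
```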


The function dist_cluster finds the distance between two clusters C_x and C_y from the average of the multiset of non-negative ES values. The distance between C_x and C_y is infinite if no two documents have a non-negative ES value, i.e., no similar documents are present in C_x and C_y. Intuitively, an infinite distance between clusters denotes that every pair of documents, one from C_x and the other from C_y, either shares very few terms or has no term in common, i.e., they have very low content similarity. Later we shall observe that any two clusters with infinite distance between them remain segregated from each other. Thus a significant characteristic of the function dist_cluster is that it never merges two clusters with infinite distance between them. The proposed document clustering algorithm initially assumes each document to be a singleton cluster. Then it merges the two clusters whose distance is minimum, provided that distance is within a previously fixed limit α. The merging continues as long as there exist two clusters whose distance is at most α. The clusters which are not singletons are named Baseline Clusters (BC). The selection of the value of α is discussed in Section 6.2 of this article.

4.2. Properties of dist_cluster

The important properties of the function dist_cluster are described below.

- The minimum distance between any two clusters C_x and C_y is 0, attained when avg(T_xy) = N, i.e., when the extensive similarity between every pair of documents, one from C_x and the other from C_y, is N. In practice this minimum value is rarely observed between two different document clusters. The maximum value of dist_cluster is infinite.
- If C_x = C_y then dist_cluster(C_x, C_y) = N − avg(T_xx) = 0. The converse, however, does not hold:

dist_cluster(C_x, C_y) = 0 ⇒ avg(T_xy) = N ⇒ ES(d_i, d_j) = N, ∀d_i ∈ C_x and ∀d_j ∈ C_y.

Now ES(d_i, d_j) = N implies that the two documents d_i and d_j are exactly similar. Note that ES(d_i, d_j) = N ⇒ dis(d_i, d_j) = 0 and μ_{i,j} = 0. Here dis(d_i, d_j) = 0 implies that d_i and d_j are similar in terms of content, but they are not necessarily the same, i.e., we cannot say d_i = d_j if dis(d_i, d_j) = 0. Thus dist_cluster(C_x, C_y) = 0 does not imply C_x = C_y, and hence dist_cluster is not a metric.
- It is symmetric: for every pair of clusters C_x and C_y, dist_cluster(C_x, C_y) = dist_cluster(C_y, C_x).
- dist_cluster(C_x, C_y) ≥ 0 for any pair of clusters C_x and C_y.
- For any three clusters C_x, C_y and C′, we may have

dist_cluster(C_x, C_y) + dist_cluster(C_y, C′) − dist_cluster(C_x, C′) < 0

when 0 ≤ dist_cluster(C_x, C_y) < N, 0 ≤ dist_cluster(C_y, C′) < N and dist_cluster(C_x, C′) = ∞. Thus dist_cluster does not satisfy the triangle inequality.

4.3. A method for estimation of θ

There are several types of document collections in real life. The similarities or dissimilarities between documents in one corpus may not be the same as those in other corpora, since the characteristics of the corpora are different [18]. Additionally, one may view the clusters present in a corpus (or in different corpora) at different scales, and different scales produce different partitions; similarities corresponding to one scale in one corpus may not be the same as the similarities corresponding to the same scale in a different corpus. This is the reason for making the threshold on similarities data dependent [18]. In fact, we feel that a fixed threshold on similarities will not give satisfactory results on several data sets. There are several methods available in the literature for finding a threshold for a two-class classification problem (one class corresponds to similar points and the other to dissimilar points). A popular method for such classification is histogram thresholding [12]. Let, for a given corpus, the number of distinct similarity values be p, and let the similarity values be s_0, s_1, ..., s_{p-1}. Without loss of generality, assume that (a) s_i < s_j if i < j and (b) (s_{i+1} − s_i) = (s_1 − s_0), ∀i = 1, 2, ..., (p − 2). Let g(s_i) denote the number of occurrences of s_i, ∀i = 0, 1, ..., (p − 1). Our aim is to find a threshold θ on the similarity values such that a similarity value s < θ implies the corresponding documents are practically dissimilar, and otherwise they are similar; the choice of threshold should be data dependent. The basic steps of the histogram thresholding technique are as follows:

- Obtain the histogram corresponding to the given problem.
- Reduce the ambiguity in the histogram. Usually this step is carried out using a window. One of the earliest such techniques is the moving average technique in time series analysis [2], which is used to reduce the local variations in a histogram: it is convolved with the histogram, resulting in a less ambiguous histogram. We have used the weighted moving averages of the g(s_i) values with window length 5:

f(s_i) = \frac{g(s_i)}{\sum_{j=0}^{p-1} g(s_j)} \times \frac{g(s_{i-2}) + g(s_{i-1}) + g(s_i) + g(s_{i+1}) + g(s_{i+2})}{5}, \quad \forall i = 2, 3, \ldots, p-3    (6)


- Find the valley points in the modified histogram. A point s_i corresponding to the weight function f(s_i) is said to be a valley point if f(s_{i-1}) > f(s_i) and f(s_i) < f(s_{i+1}).
- The first valley point of the modified histogram is taken as the required threshold on the similarity values.

In the modified histogram corresponding to f, there are three possibilities regarding the valley points, stated below.
(i) There is no valley point in the histogram. In that case the histogram is either a constant function or an increasing or decreasing function of the similarity values. These three types of histograms impose strong conditions on the similarity values which are unnatural for a document collection. Another possible histogram without a valley point is a unimodal histogram: there is a single mode, and the number of occurrences of a similarity value increases as the similarity values increase towards the mode and decreases as they move away from it. This is also an unnatural setup, since there is no reason for such a strong property to be satisfied by a histogram of similarity values.
(ii) There exists exactly one valley point in the histogram, i.e., the number of occurrences at the valley point is smaller than the number of occurrences of the other similarity values in its neighborhood. In practice this case is also rare.
(iii) The third and most usual possibility is that there is more than one valley point, i.e., there are several variations in the number of occurrences of the similarity values. Here the task is to choose a threshold from a particular valley. In the proposed technique the threshold is selected from the first valley point. The threshold could be selected from the second, third or a higher valley, but then some genuinely similar documents, whose similarity values lie between the first valley point and the higher one, would be treated as dissimilar. In practice text data sets are sparse and high dimensional, so high similarities between documents are observed in very few cases. It is true that for a high θ value the extensive similarity between every two documents in a cluster will be high, but the number of documents in each cluster will be too few due to the sparsity of the data. Hence θ is selected from the first valley point, as the similarity values in the other valleys are higher than the similarity values in the first valley point.

Generally similarity values do not satisfy the property that (s_{i+1} − s_i) = (s_1 − s_0), ∀i = 1, 2, ..., (p − 2). In reality there are (p + 1) distinct class intervals of similarity values, where the i-th class interval is [v_{i-1}, v_i), a semi-closed interval, for i = 1, 2, ..., p. The (p + 1)-th class interval corresponds to the set where each similarity value is greater than or equal to v_p. g(s_i) corresponds to the number of similarity values falling in the i-th class interval. The v_i's are taken in such a way that (v_i − v_{i-1}) = (v_1 − v_0), ∀i = 2, 3, ..., p. Note that v_0 = 0 and the value of v_p is decided on the basis of the observations. The last interval, i.e., the (p + 1)-th interval, is not considered for the valley point selection, since we assume that if any similarity value is greater than or equal to v_p then the corresponding documents are actually similar. Under this setup, we have

taken s_i = (v_i + v_{i+1})/2, ∀i = 0, 1, ..., (p − 1). Note that the s_i's so defined satisfy the properties (a) s_i < s_j if i < j and (b) (s_{i+1} − s_i) = (s_1 − s_0), ∀i = 1, 2, ..., (p − 2). The proposed method finds the valley point and its corresponding class interval, and the minimum value of that class interval is taken as the threshold.

Example. Consider an example of histogram thresholding for the selection of θ for a corpus; the similarity values and the values of g and f are shown in Table 1. Initially the similarity values are divided into class intervals of length 0.001. Assume that there are 80 such intervals of equal length and that s_i represents the middle point of the i-th class interval for i = 0, 1, ..., 79. The values of the g(s_i)'s and the corresponding f(s_i)'s are then found; the moving averages are used to remove the ambiguities in the g(s_i) values. Valleys among the similarity values corresponding to the 76 f(s_i)'s are then found. Let s_40, which is equal to 0.0405, be the first valley point, i.e., f(s_39) > f(s_40) and f(s_40) < f(s_41). The minimum similarity value of the class interval [0.040, 0.041), i.e., 0.040, is taken as the threshold θ.

Table 1. An example of θ estimation by the histogram thresholding technique.

Class interval (v_i's)    s_i       No. of elements in the interval    Moving average
[0.000, 0.001)            0.0005    g(s_0)                             -
[0.001, 0.002)            0.0015    g(s_1)                             -
[0.002, 0.003)            0.0025    g(s_2)                             f(s_2)
...                       ...       ...                                ...
[0.040, 0.041)            0.0405    g(s_40)                            f(s_40)
...                       ...       ...                                ...
[0.077, 0.078)            0.0775    g(s_77)                            f(s_77)
[0.078, 0.079)            0.0785    g(s_78)                            -
[0.079, 0.080)            0.0795    g(s_79)                            -
>= 0.080                  -         g(s_80)                            -
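The thresholding procedure of this subsection can be prototyped as below. The binning, the length-5 weighted moving average of Eq. (6), and the first-valley rule follow the description above, while the function name, the default interval width and the fallback when no valley is found are our own choices.

```python
# Hedged sketch of the data-dependent estimation of theta by histogram thresholding:
# bin the pairwise similarities, smooth the counts with a length-5 weighted moving
# average (Eq. (6)), and return the lower edge of the bin at the first valley point.
import numpy as np

def estimate_theta(similarities, width=0.005, upper=0.5):
    edges = np.arange(0.0, upper + width, width)          # class intervals [v_{i-1}, v_i)
    g, _ = np.histogram(similarities, bins=edges)
    total = g.sum()
    p = len(g)
    f = np.full(p, np.nan)
    for i in range(2, p - 2):                             # weighted moving average, window 5
        f[i] = (g[i] / total) * (g[i - 2] + g[i - 1] + g[i] + g[i + 1] + g[i + 2]) / 5.0
    for i in range(3, p - 2):                             # first valley point of the smoothed histogram
        if f[i - 1] > f[i] < f[i + 1]:
            return edges[i]                               # minimum value of that class interval
    return edges[1]                                       # fallback if no valley is found (our choice)
```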


4.4. Procedure of the proposed document clustering technique

The proposed document clustering technique is described in Algorithm 1. Initially each document is taken as a cluster, so Algorithm 1 starts with N individual clusters. In the first stage of Algorithm 1 a distance matrix is developed whose ij-th entry is the dist_cluster(C_i, C_j) value, where C_i and C_j are the i-th and j-th clusters respectively. It is a square matrix with N rows and N columns for the N documents in the corpus, and each row or column of the distance matrix is treated as a cluster. Then the Baseline Clusters (BC) are generated by merging the clusters whose distance is less than a fixed threshold α; the value of α is constant throughout Algorithm 1. The merge operation stated in step 3 of Algorithm 1 merges two rows, say i and j, and the corresponding columns of the distance matrix by following a convention regarding numbering: it merges the two rows into one, the resultant row is numbered as the minimum of i and j, and the other row is removed. The same numbering is followed for the columns, and the index structure of the distance matrix is then updated accordingly.

Algorithm 1. Iterative document clustering by baseline clusters

Input:
(a) A set of clusters C = {C_1, C_2, ..., C_N}, where N is the number of documents and C_i = {d_i}, i = 1, 2, ..., N, d_i being the i-th document of the corpus.
(b) A distance matrix DM[i][j] = dist_cluster(C_i, C_j), ∀i, j.
(c) α, the desired threshold on dist_cluster, and iter, the maximum number of iterations.

Steps of the algorithm:

1:  for each pair of clusters C_i, C_j ∈ C where C_i ≠ C_j and N > 1 do
2:    if dist_cluster(C_i, C_j) ≤ α then
3:      DM ← merge(DM, i, j)
4:      C_i ← C_i ∪ C_j
5:      N ← N − 1
6:    end if
7:  end for
8:  nbc ← 0; BC ← ∅                      // Baseline clusters are initialized to the empty set
9:  nsc ← 0; SC ← ∅                      // Singleton clusters are initialized to the empty set
10: for i = 1 to N do
11:   if |C_i| > 1 then
12:     nbc ← nbc + 1                    // No. of baseline clusters
13:     BC_nbc ← C_i                     // Baseline clusters
14:   else
15:     nsc ← nsc + 1                    // No. of singleton clusters
16:     SC_nsc ← C_i                     // Singleton clusters
17:   end if
18: end for
19: if nsc = 0 or nbc = 0 then
20:   return BC                          // No singleton cluster exists, or no baseline cluster is generated
21: else
22:   EBC_k ← BC_k, ∀k = 1, 2, ..., nbc                 // Initialization of extended baseline clusters
23:   ebct_k ← centroid of BC_k, ∀k = 1, 2, ..., nbc    // Extended base centroids
24:   nct_k ← (0), ∀k = 1, 2, ..., nbc; it ← 0
25:   while ebct_k ≠ nct_k, ∀k = 1, 2, ..., nbc and it ≤ iter do
26:     ebct_k ← centroid of EBC_k, ∀k = 1, 2, ..., nbc
27:     NCL_k ← BC_k, ∀k = 1, 2, ..., nbc               // New set of clusters at each iteration
28:     for j = 1 to nsc do
29:       if ebct_k is the nearest centroid of SC_j, k ∈ {1, 2, ..., nbc} then
30:         NCL_k ← NCL_k ∪ SC_j                        // Merge the singleton cluster into that baseline cluster
31:       end if
32:     end for
33:     nct_k ← centroid of NCL_k, ∀k = 1, 2, ..., nbc
34:     EBC_k ← NCL_k, ∀k = 1, 2, ..., nbc
35:     it ← it + 1
36:   end while
37:   return EBC
38: end if

Output: A set of extended baseline clusters EBC = {EBC_1, EBC_2, ..., EBC_nbc}
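Putting the pieces together, the condensed sketch below mirrors the two stages of Algorithm 1: baseline clusters by merging under dist_cluster, followed by k-means-style assignment of the leftover singletons. It reuses the extensive_similarity and dist_cluster functions sketched earlier, assumes a dense tf-idf matrix with documents as rows, and simplifies the merge bookkeeping, so it is illustrative rather than a faithful re-implementation of the paper's procedure.

```python
# Condensed sketch of Algorithm 1, reusing extensive_similarity and dist_cluster from above.
# X: dense tf-idf matrix (documents as rows); theta, alpha: thresholds; iters: assignment iterations.
import numpy as np

def cluster_documents(X, theta, alpha, iters=20):
    N = X.shape[0]
    norms = np.linalg.norm(X, axis=1)
    S = (X @ X.T) / (norms[:, None] * norms[None, :])     # cosine-similarity matrix
    ES = extensive_similarity(S, theta)

    clusters = [[i] for i in range(N)]                    # stage 1: start from singletons
    while True:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist_cluster(ES, clusters[a], clusters[b], N)
                if d <= alpha and (best is None or d < best):
                    best, pair = d, (a, b)
        if pair is None:
            break                                         # no two clusters are within alpha of each other
        a, b = pair
        clusters[a] += clusters[b]
        del clusters[b]

    baseline = [c for c in clusters if len(c) > 1]
    singletons = [c[0] for c in clusters if len(c) == 1]
    if not baseline or not singletons:
        return clusters

    extended = [list(c) for c in baseline]                # stage 2: assign singletons to nearest centroid
    for _ in range(iters):
        centroids = np.array([X[c].mean(axis=0) for c in extended])
        new = [list(c) for c in baseline]
        for d in singletons:
            sims = centroids @ X[d] / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(X[d]) + 1e-12)
            new[int(np.argmax(sims))].append(d)
        if new == extended:
            break                                         # assignments stable across iterations
        extended = new
    return extended
```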


After constructing the baseline clusters, some clusters may remain as singleton clusters. Every such singleton cluster (i.e., a single document) is merged with one of the baseline clusters using the k-means algorithm in the second stage. In this stage the centroids of the baseline clusters (i.e., the non-singleton clusters) are calculated; they are named base centroids. The value of k for the k-means algorithm is taken as the number of baseline clusters. The documents which are not included in the baseline clusters are then grouped by the iterative steps of the k-means algorithm, using the base centroids as the initial seed points. Note that only those documents which are not included in the baseline clusters are considered for clustering in this stage, but for the calculation of a cluster centroid every document in the cluster, including the documents of the baseline clusters, is considered. A document is put into the cluster for which the content similarity between the document and the base centroid is maximum. The newly formed clusters are named Extended Baseline Clusters (EBC). The processing in the second stage is not needed if no singleton cluster is produced in the first stage; we believe such a possibility is remote in real life, and none of our experiments yielded such an outcome. Such a clustering is nevertheless desirable, as it produces compact clusters.

4.5. Impact of extensive similarity on the document clustering technique

The extensive similarity plays a significant role in constructing the baseline clusters. The documents in the baseline clusters are very similar to each other, as their extensive similarity is very high (above a threshold θ). It may be observed that whenever two baseline clusters are merged in the first stage, the similarity between any two documents in the baseline clusters is at least equal to θ. Note that the distance between two different baseline clusters is greater than or equal to α, and the distance between a baseline cluster and a singleton cluster (or between two singleton clusters) may be infinite, in which case they never merge to form a new baseline cluster. An infinite distance between two clusters indicates that the extensive similarity between at least one document of the baseline cluster and the document of the singleton cluster (or between the documents of two different singleton clusters) is −1. Thus the baseline clusters intuitively determine the categories of the document collection by measuring the extensive similarity between documents.

4.6. Discussion

The proposed clustering method is a combination of the baseline clustering and k-means clustering methods. Initially it creates some baseline clusters; the documents which do not have much similarity with any of the baseline clusters remain as singleton clusters. The k-means method is then used to group these documents into the corresponding baseline clusters. The k-means algorithm has been used due to its low computational complexity and because it can be easily implemented. However, the performance of the k-means algorithm suffers from the selection of the initial seed points, and there is no established method for selecting a valid k; it is very difficult to select a proper k for a sparse, high dimensional text data set. In various other clustering techniques, such as spectral clustering and buckshot, k-means is used as an intermediate stage, and those algorithms also suffer from the said limitations of the k-means method.
The proposed clustering method overcomes these two major limitations of the k-means clustering algorithm and utilizes the effectiveness of the k-means method by introducing the idea of baseline clusters. The effectiveness of the proposed technique in terms of clustering quality may be observed in the experimental results section later.

The proposed technique is designed like the buckshot clustering algorithm. The main difference between buckshot and the proposed method lies in how the hierarchical clusters of the first stage are built. Buckshot uses the traditional single-link clustering technique on a random sample of documents to create the initial centroids for the k-means clustering of the second stage. Thus buckshot may suffer from the limitations of both the single-link clustering technique (e.g., the chaining effect [8]) and the k-means clustering technique. In practice text data sets contain many categories of uneven sizes; in such data sets an initial random selection of √(kN) documents may not be adequate, i.e., no documents may be selected from an original cluster if its size is small. Note that no random sampling is required for the proposed clustering technique. In the proposed method the hierarchical clusters are created using the extensive similarity between documents, and these baseline clusters are not re-clustered in the second stage: the k-means algorithm is performed only to group those documents that have not been included in the baseline clusters, and the initial centroids are generated from the baseline clusters. In the buckshot algorithm, on the other hand, all the documents are considered for clustering by the k-means algorithm, and the single-link clustering technique is used only to create the initial seed points. It can be seen from the experiments that the proposed method performs significantly better than the buckshot clustering technique.

The process of creating baseline clusters in the first stage of the proposed technique is quite similar to the group-average hierarchical document clustering technique [30]: both techniques merge two clusters based on an average of similarities over the documents of the two individual clusters. The proposed method, however, finds the distance between two clusters using the extensive similarity, whereas the group-average hierarchical document clustering technique generally uses the cosine similarity, and the group-average technique cannot explicitly distinguish two dissimilar clusters, unlike the proposed method. This is the main difference between the two techniques.


5. Evaluation criteria

If the documents within a cluster are similar to each other and dissimilar to the documents in the other clusters, then the clustering algorithm is considered to perform well. The data sets under consideration have labeled documents, so quality measures based on labeled data are used here for comparison. Normalized mutual information and f-measure are very popular and are used by a number of researchers [30,31] to measure the quality of a clustering using the information of the actual categories of the document collection. Let us assume that R is the set of categories and S is the set of clusters, with I categories in R and J clusters in S. There are a total of N documents in the corpus, i.e., both R and S individually contain all N documents. Let n_i be the number of documents belonging to category i, m_j the number of documents belonging to cluster j, and n_ij the number of documents belonging to both category i and cluster j, for all i = 1, 2, ..., I and j = 1, 2, ..., J. Mutual information is a symmetric measure that quantifies the statistical information shared between two distributions, and it thus provides an indication of the shared information between a set of categories and a set of clusters. Let I(R, S) denote the mutual information between R and S, and E(R) and E(S) the entropies of R and S respectively. I(R, S) and E(R) are defined as

I(R, S) = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{n_{ij}}{N} \log\left(\frac{N \, n_{ij}}{n_i \, m_j}\right), \qquad E(R) = -\sum_{i=1}^{I} \frac{n_i}{N} \log\left(\frac{n_i}{N}\right)

There is no upper bound for I(R, S), so for easier interpretation and comparison a normalized mutual information that ranges from 0 to 1 is desirable. The normalized mutual information (NMI) is defined by Strehl et al. [31] as

NMI(R, S) = \frac{I(R, S)}{\sqrt{E(R) \, E(S)}}

The f-measure determines the recall and precision of each cluster with respect to a corresponding category. Let, for a query, the set of relevant documents be from category i and the set of retrieved documents be from cluster j. Then recall, precision and f-measure are given as follows:

Recall_{ij} = \frac{n_{ij}}{n_i}, \qquad Precision_{ij} = \frac{n_{ij}}{m_j}, \qquad F_{ij} = \frac{2 \times Recall_{ij} \times Precision_{ij}}{Recall_{ij} + Precision_{ij}}, \qquad \forall i, j
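For completeness, the two quality measures can be computed as follows. The sketch uses scikit-learn for NMI (with the geometric normalization, matching the definition above) and implements the category-weighted f-measure directly from the definitions; the function name, label arrays and toy example are our own.

```python
# Hedged sketch: NMI via scikit-learn and the weighted f-measure from the definitions above.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def weighted_f_measure(true_labels, cluster_labels):
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    N = len(true_labels)
    F = 0.0
    for cat in np.unique(true_labels):
        in_cat = true_labels == cat
        best = 0.0
        for cl in np.unique(cluster_labels):
            in_cl = cluster_labels == cl
            n_ij = np.sum(in_cat & in_cl)
            if n_ij == 0:
                continue                              # F_ij = 0 when category and cluster share nothing
            recall = n_ij / np.sum(in_cat)
            precision = n_ij / np.sum(in_cl)
            best = max(best, 2 * recall * precision / (recall + precision))
        F += np.sum(in_cat) / N * best                # weighted by category size n_i / N
    return F

true = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 2]                             # a perfect clustering up to label permutation
print(round(weighted_f_measure(true, pred), 3),
      round(normalized_mutual_info_score(true, pred, average_method="geometric"), 3))
```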

If there is no common instance between a category and a cluster (i.e., n_ij = 0) then we take F_ij = 0. The value of F_ij is maximum when Precision_ij = Recall_ij = 1 for a category i and a cluster j; thus F_ij lies between 0 and 1. The best f-measure over all the clusters is selected as the f-measure for the query of a particular category, i.e., F_i = max_j F_ij, ∀i. The overall f-measure is the weighted average of the f-measures of the categories, F = \sum_{i=1}^{I} \frac{n_i}{N} F_i. We would like to maximize both the f-measure and the normalized mutual information to achieve good quality clusters.

6. Experimental evaluation

6.1. Document collections

Reuters-21578 is a collection of documents that appeared on the Reuters newswire in 1987. The documents were originally assembled and indexed with categories by Carnegie Group, Inc. and Reuters, Ltd. The corpus contains 21,578 documents in 135 categories. Here we consider the ModApte version used in [4], in which there are 30 categories and 8067 documents; we have divided this corpus into four groups named rcv1, rcv2, rcv3 and rcv4. The 20-Newsgroups corpus is a collection of news articles collected from 20 different news sources, each news source constituting a different category. In this data set articles with multiple topics are cross posted to multiple newsgroups, i.e., there are overlaps between several categories. The data set is named 20ns here. The rest of the corpora were developed in the Karypis lab [15]. The corpora tr31, tr41 and tr45 are derived from the TREC-5, TREC-6 and TREC-7 collections (http://trec.nist.gov); their categories were generated from the relevance judgments provided in these collections. The corpus fbis was collected from the Foreign Broadcast Information Service data of TREC-5. The corpora la1 and la2 are from the Los Angeles Times data of TREC-5; their category labels were generated according to the names of the newspaper sections where the articles appeared, such as Entertainment, Financial, Foreign, Metro, National, and Sports, and only documents with a single label were selected. The corpora oh10 and oh15 were created from the OHSUMED collection, a subset of the MEDLINE database, which contains 233,445 documents indexed using 14,321 unique categories [15]; different subsets of categories have been taken to construct these data sets.



Table 2. Data sets overview.

Data set    No. of documents    No. of terms    No. of categories
20ns        18,000              35,218          20
fbis        2463                2000            17
la1         3204                31,472          6
la2         3075                31,472          6
oh10        1050                3238            10
oh15        913                 3100            10
rcv1        2017                12,906          30
rcv2        2017                12,912          30
rcv3        2017                12,820          30
rcv4        2016                13,181          30
tr31        927                 10,128          7
tr41        878                 7454            10
tr45        690                 8261            10
The number of documents, number of terms and number of categories of these corpora can be found in Table 2. For each of the above corpora, the stop words have been extracted using the standard English stop word list.2 Then, by applying the standard porter stemmer algorithm [27] for stemming, the inverted index is developed. 6.2. Experimental setup Single-Link Hierarchical Clustering (SLHC) [30], Average-Link Hierarchical Clustering (ALHC) [30], Dynamic k-Nearest Neighbor Algorithm (DKNNA) [16], k-means clustering [13], bisecting k-means clustering [30], buckshot clustering [6], spectral clustering [25] and clustering by Non-negative Matrix Factorization (NMF) [33] techniques are selected for comparison with the proposed clustering technique. k-means and bisecting k-means algorithms have been executed 10 times to reduce the effect of random initialization of seed points and for each execution they have been iterated 100 times to reach a solution (if they are not converged automatically). Buckshot algorithm has also been executed 10 times to reduce the effect of random pffiffiffiffiffiffi initialization of initial kN documents. The f-measure and NMI values of k-means, bisecting k-means and buckshot clustering techniques shown here are the average of 10 different results. Note that the proposed method finds the number of clusters automatically from the data sets. The proposed clustering algorithm has been executed first and then all the other algorithms have been executed to produce the same number of clusters as the proposed one. Tables 3 and 4 show the f-measure and NMI values respectively for all the data sets. Number of Clusters (NCL) developed by the proposed method is also pffiffiffiffiffiffiffi pffiffiffiffiffi shown. The f-measure and NMI are calculated using these NCL values. The value of a is chosen as, a ¼ N for N number of documents in the corpus. The NMF based clustering algorithm has been executed 10 times to reduce the effect of random initialization and for each time it has been iterated 100 times to reach a solution. The values of k for DKNNA is taken as k ¼ 10. The value of r of the spectral clustering technique is set by search over values from 10 to 20 percent of the total range of the similarity values and the one that gives the tightest clusters is picked, as suggested by Ng et al. [25]. The proposed histogram thresholding based technique for estimating a value of h has been followed in the experiments. We have considered class intervals of length 0.005 for similarity values. We have also assumed that content similarity (here cosine similarity) value greater than 0.5 means that the corresponding documents are similar. Thus, the issue here is to find a h; 0 < h < 0:5 such that a similarity value grater than h denotes that the corresponding documents are similar. In the experiments we have used the method of moving averages with the window length of 5 for convolution. The text data sets are generally sparse and the number of high similarity values is practically very low and there are fluctuations in the heights of the histogram for two successive similarity values. Hence it is not desirable to take the window length of 3 as the method considers the heights of just the previous and the next value for calculating f ðsi Þ’s. We have tried with window length of 7 or 9 on some of the corpora in the experiments, but the values of h remain more or less same as they are selected by considering window length of 5. 
The proposed histogram thresholding based technique for estimating a value of θ has been followed in the experiments. We have considered class intervals of length 0.005 for the similarity values. We have also assumed that a content similarity (here cosine similarity) value greater than 0.5 means that the corresponding documents are similar. Thus, the issue is to find a θ, 0 < θ < 0.5, such that a similarity value greater than θ denotes that the corresponding documents are similar. In the experiments we have used the method of moving averages with a window length of 5 for the convolution. Text data sets are generally sparse, the number of high similarity values is very low in practice, and there are fluctuations in the heights of the histogram for two successive similarity values. Hence it is not desirable to take a window length of 3, as that method considers only the heights of the immediately previous and next values when calculating the f(s_i)'s. We have also tried window lengths of 7 and 9 on some of the corpora, but the values of θ remain more or less the same as those selected with a window length of 5, while requiring more computation. These are the reasons for the choice of a window of length 5. It has been found that several local peaks and local valleys are removed by this smoothing. The number of valley regions after smoothing the histogram by the method of moving averages is always found to be greater than three. A sketch of this smoothing and valley selection is given below.
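// A minimal sketch, not the authors' implementation: histogram of pairwise
// cosine similarities with class intervals of 0.005, smoothed by a moving
// average of window length 5; candidate values of theta are the valleys of
// the smoothed histogram below 0.5. Returning the deepest such valley is an
// assumption made only for illustration; the paper's own selection rule is
// given in an earlier section.
#include <algorithm>
#include <cstddef>
#include <vector>

double estimate_theta(const std::vector<double>& similarities) {
    const double bin_width = 0.005;                       // class interval length
    const std::size_t bins = static_cast<std::size_t>(1.0 / bin_width);
    std::vector<double> hist(bins, 0.0);
    for (double s : similarities) {
        std::size_t b = static_cast<std::size_t>(s / bin_width);
        if (b >= bins) b = bins - 1;
        hist[b] += 1.0;
    }
    std::vector<double> smooth(bins, 0.0);                // moving average, window 5
    for (std::size_t i = 0; i < bins; ++i) {
        double sum = 0.0; int cnt = 0;
        for (int d = -2; d <= 2; ++d) {
            long j = static_cast<long>(i) + d;
            if (j >= 0 && j < static_cast<long>(bins)) { sum += hist[j]; ++cnt; }
        }
        smooth[i] = sum / cnt;
    }
    double theta = -1.0, depth = -1.0;                    // -1.0 if no valley found
    for (std::size_t i = 1; i + 1 < bins && i * bin_width < 0.5; ++i) {
        if (smooth[i] <= smooth[i - 1] && smooth[i] <= smooth[i + 1]) {
            double d = std::min(smooth[i - 1], smooth[i + 1]) - smooth[i];
            if (d > depth) { depth = d; theta = i * bin_width; }
        }
    }
    return theta;
}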



Table 3
Comparison of various clustering methods using f-measure.

Data set  NCT(a)  NCL(b)  BKM(c)  KM     BS     SLHC   ALHC   DKNNA  SC     NMF    Proposed
20ns      20      23      0.357   0.449  0.436  0.367  0.385  0.408  0.428  0.445  0.474
fbis      17      19      0.423   0.534  0.516  0.192  0.192  0.288  0.535  0.435  0.584
la1       6       8       0.506   0.531  0.504  0.327  0.325  0.393  0.536  0.544  0.570
la2       6       6       0.484   0.550  0.553  0.330  0.328  0.405  0.541  0.542  0.563
oh10      10      12      0.304   0.465  0.461  0.205  0.206  0.381  0.527  0.481  0.500
oh15      10      10      0.363   0.485  0.482  0.206  0.202  0.366  0.516  0.478  0.532
rcv1      30      31      0.231   0.247  0.307  0.411  0.360  0.431  0.298  0.516  0.553
rcv2      30      30      0.233   0.281  0.324  0.404  0.353  0.438  0.312  0.489  0.517
rcv3      30      32      0.188   0.271  0.351  0.408  0.376  0.436  0.338  0.511  0.294
rcv4      30      32      0.247   0.322  0.289  0.405  0.381  0.440  0.401  0.509  0.289
tr31      7       7       0.558   0.665  0.646  0.388  0.387  0.457  0.589  0.545  0.678
tr41      10      10      0.564   0.607  0.593  0.286  0.280  0.416  0.557  0.537  0.698
tr45      10      11      0.556   0.673  0.681  0.243  0.248  0.444  0.605  0.596  0.750

(a) NCT stands for number of categories.
(b) NCL stands for number of clusters.
(c) BKM, KM, BS, SLHC, ALHC, DKNNA, SC and NMF stand for Bisecting k-Means, k-Means, BuckShot, Single-Link Hierarchical Clustering, Average-Link Hierarchical Clustering, Dynamic k-Nearest Neighbor Algorithm, spectral clustering and Non-negative Matrix Factorization, respectively.

Table 4
Comparison of various clustering methods using normalized mutual information.

Data set  NCT  NCL  BKM    KM     BS     SLHC    ALHC   DKNNA  SC     NMF    Proposed
20ns      20   23   0.417  0.428  0.437  0.270   0.286  0.325  0.451  0.432  0.433
fbis      17   19   0.443  0.525  0.524  0.051   0.362  0.405  0.520  0.446  0.544
la1       6    8    0.266  0.299  0.295  0.021   0.218  0.241  0.285  0.296  0.308
la2       6    6    0.249  0.312  0.323  0.021   0.215  0.252  0.335  0.360  0.386
oh10      10   12   0.226  0.352  0.333  0.050   0.157  0.239  0.417  0.410  0.406
oh15      10   10   0.213  0.352  0.357  0.067   0.155  0.236  0.358  0.357  0.380
rcv1      30   31   0.302  0.409  0.407  0.0871  0.108  0.213  0.429  0.434  0.495
rcv2      30   30   0.296  0.411  0.399  0.053   0.150  0.218  0.426  0.420  0.465
rcv3      30   32   0.316  0.416  0.408  0.049   0.162  0.215  0.404  0.476  0.448
rcv4      30   32   0.317  0.414  0.416  0.048   0.175  0.220  0.414  0.507  0.452
tr31      7    7    0.478  0.463  0.471  0.065   0.212  0.414  0.436  0.197  0.509
tr41      10   10   0.470  0.550  0.553  0.054   0.237  0.456  0.479  0.506  0.619
tr45      10   11   0.492  0.599  0.591  0.084   0.354  0.512  0.503  0.488  0.694

All the symbols in this table are the same symbols used in Table 3.

6.3. Analysis of results

Tables 3 and 4 show the comparison of the proposed document clustering method with the other methods using f-measure and NMI respectively, for all data sets. There are 104 comparisons for the proposed method using f-measure in Table 3. The proposed method performs better than the other methods in 91 of these cases; in the remaining 13 cases other methods (e.g., the buckshot and spectral clustering algorithms) have an edge over the proposed method. One such exception is rcv3, where SLHC and NMF achieve f-measures of 0.408 and 0.511 respectively, while the f-measure of the proposed method is 0.294. Similarly, Table 4 shows that the proposed method performs better than the other methods using NMI in 98 out of 104 cases.

A statistical significance test has been performed to check whether these differences are significant, both in the cases where other clustering algorithms beat the proposed algorithm and in the cases where the proposed algorithm performs better, in Tables 3 and 4. A generalized version of the paired t-test is suitable for testing the equality of means when the variances are unknown. This is the classical Behrens–Fisher problem in hypothesis testing, and a suitable test statistic³ is described and tabled in [20,28], respectively. It has been found that out of the 91 cases in Table 3 where the proposed algorithm performed better than the other algorithms, the differences are statistically significant in 86 cases at the 0.05 level of significance. For all of the remaining 13 cases in Table 3, the differences are statistically significant at the same level. Hence, out of the 99 statistically significant differences, the performance of the proposed method is significantly better than that of the other methods in 86.86% (86/99) of the cases using f-measure. Similarly, in Table 4 the results are significant in 89 of the 98 cases where the proposed method performed better than the other methods, and all the results of the remaining 6 cases are also significant. Thus, in 93.68% (89/95) of the significant cases the proposed method performs significantly better than the other methods using NMI. Clearly, these results show the effectiveness of the proposed document clustering technique.

Remark. It is to be noted that the number of clusters produced by the proposed method for each corpus is close to the actual number of categories of that corpus. It may be observed from Tables 3 and 4 that the number of clusters is equal to the actual number of categories for the la2, oh15, rcv2, tr31 and tr41 corpora. The difference between the number of clusters and the actual number of categories is at most 2 for the rest of the corpora. Since the text data sets used here are very sparse and high dimensional, it may be implied that the method proposed here for estimating the value of θ is able to detect the actual grouping of the corpus.

³ The test statistic is of the form $t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$, where $\bar{x}_1, \bar{x}_2$ are the means, $s_1, s_2$ are the standard deviations and $n_1, n_2$ are the numbers of observations.
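For illustration, the statistic in footnote 3 can be computed as in the following sketch for two sets of scores, e.g., the f-measure values of two methods over repeated runs. This is not the authors' code; the degrees of freedom and the critical values tabled in [28] are not reproduced here, and the use of sample variances for s_1^2 and s_2^2 is an assumption.

// A minimal sketch: the Behrens-Fisher type statistic of footnote 3, computed
// for two sets of scores (e.g., the f-measure values of two methods over
// repeated runs). Sample variances are used for s1^2 and s2^2 (assumption),
// and the critical values tabled in [28] are not reproduced here.
#include <cmath>
#include <vector>

double test_statistic(const std::vector<double>& a, const std::vector<double>& b) {
    auto mean = [](const std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x;
        return s / v.size();
    };
    auto var = [](const std::vector<double>& v, double m) {
        double s = 0.0;                       // sample variance (needs >= 2 values)
        for (double x : v) s += (x - m) * (x - m);
        return s / (v.size() - 1);
    };
    const double m1 = mean(a), m2 = mean(b);
    return (m1 - m2) / std::sqrt(var(a, m1) / a.size() + var(b, m2) / b.size());
}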


Table 5
Processing time (in seconds) of different clustering methods.

Data set  BKM      KM       BS       SLHC     ALHC     DKNNA    SC       NMF      Proposed
20ns      1582.25  1594.54  1578.12  1618.50  1664.31  1601.23  1583.62  1587.36  1595.23
fbis      94.17    91.52    92.46    112.15   129.58   104.32   100.90   93.19    90.05
la1       159.23   153.22   142.36   160.12   179.62   162.25   146.68   153.50   140.31
la2       149.41   144.12   139.34   163.47   182.50   164.33   142.33   144.46   140.29
oh10      18.57    18.32    17.32    26.31    33.51    23.26    24.82    22.48    18.06
oh15      18.12    20.02    18.26    24.15    31.46    23.18    20.94    17.61    16.22
rcv1      89.41    91.15    86.37    87.47    103.37   88.37    94.62    87.80    86.18
rcv2      98.32    106.08   103.16   104.17   120.45   99.27    97.81    93.51    92.53
rcv3      97.11    106.29   98.35    98.47    124.72   98.47    108.24   94.96    93.54
rcv4      100.17   95.28    96.49    109.32   126.30   99.49    113.81   98.70    93.32
tr31      29.41    30.23    30.34    33.15    40.13    32.35    37.98    29.33    29.38
tr41      27.29    28.16    27.46    33.96    39.58    30.52    25.85    27.54    26.49
tr45      25.45    25.01    26.06    31.17    38.23    28.50    29.72    26.51    24.65

All the symbols in this table are the same symbols used in Table 3.

6.4. Processing time

The similarity matrix requires N × N memory locations, and to store the initial N clusters, N memory locations are needed for the proposed method. Thus the space complexity of the proposed document clustering algorithm is O(N²). O(N²) time is required to build the extensive similarity matrix, and to construct (say) m (≪ N) baseline clusters the proposed method takes O(mN²) time. In the final stage of the proposed technique, the k-means algorithm takes O((N − b)mt) time to merge (say) b singleton clusters into the baseline clusters, where t is the number of iterations of the k-means algorithm. Thus the time complexity of the proposed algorithm is O(N²), as m is very small compared to N.

The processing time of each algorithm used in the experiments has been measured on a quad core Linux workstation. The time (in seconds) taken by the different clustering algorithms to cluster each text data set is reported in Table 5. The time shown for the proposed algorithm is the sum of the times taken to estimate the value of θ, to build the baseline clusters, and to perform the k-means clustering algorithm that merges the remaining singleton clusters into the baseline clusters. The times shown for the bisecting k-means, buckshot, k-means and NMF clustering techniques are the averages of the processing times of 10 runs. It is to be mentioned that the codes for all the algorithms are written in C++ and the data structures for all the algorithms are developed by the authors. Hence the processing time can be reduced by incorporating more efficient data structures for the proposed algorithm as well as for the other algorithms.

Note that the processing time of the proposed algorithm is less than that of KM, SLHC, ALHC and DKNNA for each data set. The execution time of BKM is less than that of the proposed one for 20ns, the execution time of SC is less than that of the proposed one for tr41, and the execution time of NMF is less than that of the proposed algorithm for tr31. The execution time of the proposed algorithm is less than that of BKM, SC and NMF for each of the other data sets. The processing time of the proposed algorithm is comparable with that of the buckshot algorithm (although for most of the data sets the processing time of the proposed algorithm is lower). The dimensionality of the data sets used in the experiments varies from 2000 (fbis) to 35,218 (20ns). Hence the proposed clustering algorithm may be useful in terms of processing time for real life high dimensional data sets.

7. Conclusions

A hybrid document clustering algorithm is introduced by combining a new hierarchical technique with the traditional k-means clustering technique. The baseline clusters produced by the new hierarchical technique are the clusters whose documents possess high similarity among themselves; the extensive similarity between documents ensures this quality of the baseline clusters. It is developed on the basis of the similarity between two documents and their distances from every other document in the collection. Thus the documents with high extensive similarity are grouped in the same cluster. Most of the singleton clusters are simply the documents which have low content similarity with every other document. In practice the number of such singleton clusters is substantial, and they cannot be ignored as outliers.
Therefore the k-means algorithm is performed iteratively to assign these singleton clusters to one of the baseline clusters. In this way the proposed method reduces the error of the k-means algorithm due to random seed selection. Moreover, the method is not as expensive as the hierarchical clustering algorithms, which can be observed from Table 5. A significant characteristic of the proposed clustering technique is that the algorithm automatically decides the number of clusters in the data. The automatic detection of the number of clusters for such sparse and high dimensional text data is very important. The proposed method is able to determine the number of clusters prior to executing the algorithm by applying a threshold θ on the similarity values between documents. An estimation technique is introduced to determine a value of θ from a corpus.


The experimental results show the value and validity of the proposed estimation of θ. In the experiments the threshold on the distance between two clusters is taken as α = √N, which indicates that the distance between two different clusters must be greater than α. It is very difficult to fix a lower bound on the distance between two clusters in practice, as the corpora are sparse in nature and high dimensional. If a value larger than √N is selected for α, then some really different clusters may be merged into one, which is surely not desirable. On the other hand, if a value smaller than √N is selected, say α = N^{1/5}, then we may get some very compact clusters, but a large number of small sized clusters would be created, which is also not expected in practice. It may be observed from the experiments that the number of clusters produced by the proposed technique is very close to the actual number of categories for each corpus and that the proposed method outperforms the other methods. Hence we may claim that the selection of α = √N is proper, though it has been made heuristically.

The proposed hybrid clustering technique addresses some issues of several well known partitional and hierarchical clustering techniques. Hence it may be useful in many real life unsupervised applications. Note that any similarity measure can be used instead of cosine similarity to design the extensive similarity for data sets other than text. It is to be mentioned that the value of α should be chosen carefully whenever the method is applied to other types of applications. In future we shall apply the proposed method to social network data to find different types of communities or topics. In that case we may have to incorporate ideas from graph theory into the proposed distance function to find relations between different sets of nodes of a social site.

Acknowledgment

The authors would like to thank the reviewers and the editor for their valuable comments and suggestions.

References

[1] N.O. Andrews, E.A. Fox, Recent Developments in Document Clustering, Technical Report, Virginia Tech, USA, 2007.
[2] R.G. Brown, Smoothing, Forecasting and Prediction of Discrete Time Series, Prentice-Hall, Englewood Cliffs, NJ, 1962.
[3] C. Aggarwal, C. Zhai, A survey of text clustering algorithms, Mining Text Data (2012) 77–128.
[4] D. Cai, X. He, J. Han, Document clustering using locality preserving indexing, IEEE Trans. Knowl. Data Eng. 17 (12) (2005) 1624–1637.
[5] C. Carpineto, S. Osinski, G. Romano, D. Weiss, A survey of web clustering engines, ACM Comput. Surveys 41 (3) (2009).
[6] D.R. Cutting, D.R. Karger, J.O. Pedersen, J.W. Tukey, Scatter/gather: a cluster-based approach to browsing large document collections, in: Proceedings of the International Conference on Research and Development in Information Retrieval, SIGIR'93, 1993, pp. 126–135.
[7] S. Dasgupta, V. Ng, Towards subjectifying text clustering, in: Proceedings of the International Conference on Research and Development in Information Retrieval, SIGIR'10, NY, USA, 2010, pp. 483–490.
[8] R.C. Dubes, A.K. Jain, Algorithms for Clustering Data, Prentice Hall, 1988.
[9] R. Duda, P. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, 1973.
[10] M. Filippone, F. Camastra, F. Masulli, S. Rovetta, A survey of kernel and spectral methods for clustering, Pattern Recognit. 41 (1) (2008) 176–190.
[11] R. Forsati, M. Mahdavi, M. Shamsfard, M.R. Meybodi, Efficient stochastic algorithms for document clustering, Inform. Sci. 220 (2013) 269–291.
[12] C.A. Glasbey, An analysis of histogram-based thresholding algorithms, Graph. Models Image Process. 55 (6) (1993) 532–537.
[13] J.A. Hartigan, M.A. Wong, A k-means clustering algorithm, J. Roy. Statist. Soc. (Appl. Statist.) 28 (1) (1979) 100–108.
[14] A. Huang, Similarity measures for text document clustering, in: Proceedings of the New Zealand Computer Science Research Student Conference, Christchurch, New Zealand, 2008, pp. 49–56.
[15] G. Karypis, E.H. Han, Centroid-based document classification: analysis and experimental results, in: Proceedings of the Fourth European Conference on the Principles of Data Mining and Knowledge Discovery, PKDD'00, Lyon, France, 2000, pp. 424–431.
[16] J.Z.C. Lai, T.J. Huang, An agglomerative clustering algorithm using a dynamic k-nearest neighbor list, Inform. Sci. 217 (2012) 31–38.
[17] A.N. Langville, C.D. Meyer, R. Albright, Initializations for the Non-negative Matrix Factorization, in: Proceedings of the Conference on Knowledge Discovery from Data, KDD'06, 2006.
[18] T. Basu, C.A. Murthy, CUES: a new hierarchical approach for document clustering, J. Pattern Recognit. Res. 8 (1) (2013) 66–84.
[19] D.D. Lee, H.S. Seung, Algorithms for Non-negative Matrix Factorization, in: Advances in Neural Information Processing Systems, vol. 13, 2001, pp. 556–562.
[20] E.L. Lehmann, Testing of Statistical Hypotheses, John Wiley, New York, 1976.
[21] X. Liu, X. Yong, H. Lin, An improved spectral clustering algorithm based on local neighbors in kernel space, Comput. Sci. Inform. Syst. 8 (4) (2011) 1143–1157.
[22] C.S. Yang, M.C. Chiang, C.W. Tsai, A time efficient pattern reduction algorithm for k-means clustering, Inform. Sci. 181 (2011) 716–731.
[23] M.I. Malinen, P. Franti, Clustering by analytic functions, Inform. Sci. 217 (2012) 31–38.
[24] C.D. Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, New York, 2008.
[25] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, in: Proceedings of Neural Information Processing Systems, NIPS'01, 2001, pp. 849–856.
[26] P. Pantel, D. Lin, Document clustering with committees, in: Proceedings of the International Conference on Research and Development in Information Retrieval, SIGIR'02, 2002, pp. 199–206.
[27] M.F. Porter, An algorithm for suffix stripping, Program 14 (3) (1980) 130–137.
[28] C.R. Rao, S.K. Mitra, A. Matthai, K.G. Ramamurthy (Eds.), Formulae and Tables for Statistical Work, Statistical Publishing Society, Calcutta, 1966.
[29] G. Salton, M.J. McGill, Introduction to Modern Information Retrieval, McGraw Hill, 1983.
[30] M. Steinbach, G. Karypis, V. Kumar, A comparison of document clustering techniques, in: Proceedings of the Text Mining Workshop, ACM International Conference on Knowledge Discovery and Data Mining, KDD'00, 2000.
[31] A. Strehl, J. Ghosh, Cluster ensembles – a knowledge reuse framework for combining multiple partitions, J. Machine Learn. Res. 3 (2003) 583–617.
[32] J. Wang, S. Wu, H.Q. Vu, G. Li, Text document clustering with metric learning, in: Proceedings of the 33rd International Conference on Research and Development in Information Retrieval, SIGIR'10, 2010, pp. 783–784.
[33] W. Xu, X. Liu, Y. Gong, Document clustering based on Non-negative Matrix Factorization, in: Proceedings of the International Conference on Research and Development in Information Retrieval, SIGIR'03, Toronto, Canada, 2003, pp. 267–273.
[34] W. Xu, Y. Gong, Document clustering by concept factorization, in: Proceedings of the International Conference on Research and Development in Information Retrieval, SIGIR'04, 2004, pp. 202–210.
[35] Y. Zhu, L. Jing, J. Yu, Text clustering via constrained nonnegative matrix factorization, in: Proceedings of the IEEE International Conference on Data Mining, ICDM'11, 2011, pp. 1278–1283.