Applied Soft Computing Journal 73 (2018) 623–634
Application of a density based clustering technique on biomedical datasets
Md Anisur Rahman, Md Zahidul Islam
School of Computing and Mathematics, Charles Sturt University, Panorama Avenue, Bathurst, NSW 2795, Australia
Highlights
• Evaluation of a density based clustering technique (DenClust) on biomedical datasets.
• DenClust produces the number of clusters and initial seeds using a deterministic process.
• DenClust performs better than six existing techniques on twenty biomedical datasets.
• An empirical analysis to evaluate the quality of initial seeds was also performed.
• Sign test results indicate the superiority of DenClust over the existing techniques.
Article info
Article history: Received 1 August 2017; Received in revised form 3 September 2018; Accepted 8 September 2018; Available online xxxx.
Keywords: Clustering; Cluster evaluation; K-means; Data mining; Machine learning; Biomedical datasets
Abstract
The detection of the number of clusters in a biomedical dataset is very important for generating high quality clusters from it. In this paper, we aim to evaluate the performance of a density based K-Means clustering technique called DenClust on biomedical datasets. DenClust produces the number of clusters and high quality initial seeds from a dataset through a density based seed selection approach, without requiring any user input on the number of clusters or the radius of the clusters. High quality initial seeds for K-Means result in high quality clusters from a dataset. The performance of DenClust is compared with six existing clustering techniques, namely CRUDAW-F, CRUDAW-H, AGCUK, GAGR, K-Means, and K-Means++, on twenty biomedical datasets in terms of two external cluster evaluation criteria, namely Entropy and Purity, and one internal cluster evaluation criterion called Sum of Squared Error (SSE). We also perform a statistical non-parametric sign test on the cluster evaluation results of the techniques. Both the cluster evaluation results and the sign test results indicate the superiority of DenClust over the existing techniques on the biomedical datasets. The complexity of DenClust is O(n²), but its overall execution time on the datasets is less than the execution times of AGCUK and GAGR, which have O(n) complexity.
1. Introduction

Clustering is an unsupervised learning technique. It groups similar records in a cluster and dissimilar records in different clusters. It extracts hidden patterns from large datasets that help decision making processes in various fields including medical research, crime detection/prevention, social network analysis and market research [1–7]. Therefore, it is very important to produce good quality clusters from a dataset.
K-Means is one of the top ten data mining techniques because of its simplicity [8]. It is a widely used clustering technique, where the number of clusters (k) needs to be provided by a user even before the clustering process starts [9–12]. Based on the user defined number of clusters, K-Means first randomly selects k records as the initial seeds. It then goes through the clustering process and produces k clusters. However, one of the limitations of K-Means is its requirement of the user input on k. It can be difficult for a user (data miner) to estimate the correct value of k [13,14]. Another limitation of K-Means is the possibility of selecting poor quality initial seeds due to its random seed selection strategy. A set of poor quality initial seeds may produce poor quality clusters [9,15,16].
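To make the process just described concrete, the following minimal Python sketch implements standard K-Means with random initial seeds on numerical data; the function name, parameters and termination thresholds are illustrative choices, not code from the paper.

import numpy as np

def kmeans(X, k, max_iter=50, eps=0.005, seed=None):
    """Minimal K-Means: random initial seeds, then iterative refinement."""
    rng = np.random.default_rng(seed)
    # Randomly pick k records as the initial seeds (the step criticised above).
    seeds = X[rng.choice(len(X), size=k, replace=False)]
    prev_sse = None
    for _ in range(max_iter):
        # Assign every record to its nearest seed.
        dist = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        sse = float((dist[np.arange(len(X)), labels] ** 2).sum())
        # Recompute each seed as the mean of the records assigned to it.
        seeds = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else seeds[j]
                          for j in range(k)])
        if prev_sse is not None and abs(prev_sse - sse) <= eps:
            break  # the objective barely changed between two consecutive iterations
        prev_sse = sse
    return labels, seeds

Because the initial seeds are random, repeated runs on the same dataset can converge to clusters of noticeably different quality, which is exactly the limitation that motivates a deterministic seed selection.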
Some existing clustering techniques called CRUDAW-H and CRUDAW-F [1,17] obtain the initial seeds and the number of clusters automatically through a deterministic process. The deterministic process of CRUDAW-H and CRUDAW-F requires a user defined threshold called r (the radius of a cluster), which is then used for producing the initial seeds and the number of clusters for a dataset. However, it can be difficult for a user to correctly estimate a suitable value of r, especially when the user does not have a good understanding of the dataset.

Moreover, some existing clustering techniques work only on datasets that have numerical attributes or only on datasets that have categorical attributes, while in reality many datasets have both numerical and categorical attributes [1,15,18–20]. There are techniques that can work on both numerical and categorical attributes, but some of them do not consider any similarity between categorical values while clustering the records, in the sense that if two categorical values (of an attribute) belonging to two records are different then the distance between the two records in terms of the attribute is considered to be 1 (regardless of the similarity of the values), and otherwise 0 [1,15,19–21].

K-Means is one of the popular clustering algorithms [22]. A recently proposed modified version of K-Means that selects initial seeds deterministically has been applied successfully on biomedical datasets [23]. However, the technique requires the number of clusters k as an input, and it can be hard for a user to estimate the correct number of clusters for a dataset [23]. Moreover, the technique does not work on a dataset having any categorical attributes.

In this study, we present a novel clustering technique called DenClust that works on biomedical datasets having categorical and/or numerical attributes. DenClust produces the number of clusters k and high quality initial seeds through a deterministic process, without requiring a user input on any parameter such as k or r. The density based seed selection approach of DenClust works well on biomedical datasets, although it also works well on other kinds of datasets. The initial seeds produced by DenClust are the centers of dense regions and they are expected to represent the natural clusters. Therefore, the initial seeds are expected to be of high quality. Our empirical analysis presented in Section 4.8 to evaluate the quality of the initial seeds also supports this expectation. Our experimental results also indicate the superiority of DenClust.

In this study, we mainly aim to evaluate the performance of DenClust on biomedical datasets. We collect twenty (20) biomedical datasets from the UCI Machine Learning repository and Bioinformatics [6,7,24]. In the experimental evaluation of this paper, we apply DenClust on these biomedical datasets, whereas in the conference version of the DenClust paper [25] we used only three datasets. To evaluate the performance of DenClust, we implement DenClust and six (6) existing clustering techniques, namely CRUDAW-F and CRUDAW-H [1,17], K-Means [10], K-Means++ [26], AGCUK [19] and GAGR [20]. We compare DenClust with the existing techniques using two external cluster evaluation criteria, namely Entropy and Purity, and one internal cluster evaluation criterion called Sum of Squared Error (SSE) [1,27,10].
The experimental results indicate that the performance of DenClust is better than that of the existing techniques in terms of the three cluster evaluation criteria on the biomedical datasets used in this study. Moreover, we perform a statistical non-parametric sign test on the cluster evaluation results of the techniques, which suggests that the improvement of DenClust over the existing techniques on these biomedical datasets is statistically significant. We also present an empirical analysis on the quality of the initial seeds produced by DenClust, as well as an empirical analysis on the T value (a user defined parameter) for the considered datasets. The complexities and the execution times of the techniques are also presented in this study.
The structure of the paper is as follows. In Section 2, we discuss some existing clustering techniques. DenClust is presented in Section 3. The experimental results on biomedical datasets and discussions are presented in Section 4. We provide some concluding remarks in Section 5.

2. Literature review

2.1. Dataset

In this study, we consider a dataset D having n records D = {R1, R2, ..., Rn} and m attributes A = {A1, A2, ..., Am}. The attributes of a dataset can be categorical and/or numerical.

2.2. K-Means and K-Means++

K-Means requires a user to input the number of clusters k. It then randomly selects k records as the initial seeds from a dataset [10,11]. All other records are assigned to the nearest seeds to form the initial set of clusters. Based on the records in each cluster, K-Means re-calculates the seed of each cluster [10,28]. All records of the dataset are assigned again to different clusters in such a way that a record is assigned to the cluster whose seed has the minimum distance with the record. The process continues until one of the termination conditions (a user defined number of iterations, or a minimum difference between the values of the objective function in two consecutive iterations) is satisfied. K-Means++ is a modified version of K-Means, where the initial seeds of the clusters are selected using a probabilistic approach [26]. However, the number of clusters k in K-Means++ is a user defined parameter. K-Means and K-Means++ also do not work on a dataset that has both categorical and numerical attributes.

2.3. Bisecting-K-Means (BKM)

Bisecting K-Means (BKM) is a variation of K-Means which also selects initial seeds randomly [29]. At the beginning, it considers the whole dataset as one cluster and then divides it into two partitions using K-Means. From the two partitions, BKM picks one partition as a cluster and considers the remaining partition as a dataset to be partitioned with K-Means once again into two sub-partitions. The process of partitioning the records continues until the desired number of clusters is reached. From the two partitions, the one selected for further division is chosen based on the size of the cluster (partition) or the similarity of the records within the cluster. In some cases, BKM requires a refinement of the initial clusters to produce the final clusters from a dataset. Moreover, another limitation of BKM is that it requires a user input on the number of clusters.

2.4. Basic Farthest Point Heuristic (BFPH) and New Farthest Point Heuristic (NFPH)

The Basic Farthest Point Heuristic (BFPH) [30] requires the number of clusters k as a user input. It then randomly selects a record as the first initial seed. However, unlike K-Means, the other seeds are selected deterministically. The record having the maximum distance with the first seed is selected as the second seed. For the selection of the third seed, the distance between a record and its nearest seed is used: the record having the maximum distance with its nearest seed is considered as the third seed. The seed selection process continues until BFPH produces the user defined number of initial seeds or runs out of records. The initial seeds are given to K-Means to produce the final clusters. The New Farthest Point Heuristic (NFPH) [30] also requires the number of clusters as an input. However, it selects all seeds (including
the first seed) deterministically. It first calculates the score of each record based on the frequency of each attribute value of the record. The frequencies of the attribute values appearing in a record are added together to obtain the score of the record. The record having the highest score is considered as the first seed. The other seeds are selected using the same approach as BFPH, where the record having the maximum distance with the first seed is selected as the second seed, and so on. Both NFPH and BFPH work only on datasets that have categorical attributes. Moreover, BFPH and NFPH consider the distance between two domain values of a categorical attribute to be either zero (if the two values are the same) or one (if the two values are different).

2.5. Distance calculation for categorical attributes

A number of distance calculation functions have been proposed for numerical attributes. However, distance calculation for the values of categorical attributes has received less attention [31]. Typically, the distance between two domain values of a categorical attribute is considered to be either zero (if the two values are the same) or one (if the two values are different) [11,21]. The distance between two domain values of a categorical attribute can instead depend on their similarity [1]. The similarity between two domain values of a categorical attribute is generally measured based on their co-appearance with the domain values of other categorical attributes among the records of the dataset [1,32]. To calculate the similarity between two domain values of a categorical attribute, an existing technique called VICUS [32] can be used. VICUS calculates the similarity based on the co-appearances of the domain values with the domain values of other attributes. In DenClust, we use VICUS to calculate the similarity between two domain values of a categorical attribute. This similarity is used in the distance function shown in (2).

2.6. A modified version of K-Means that works on mixed datasets

A modified version of K-Means [33] works on datasets that have categorical and/or numerical attributes. The technique randomly selects records of a dataset as initial seeds to produce the final clusters, and it requires an input on the number of clusters k. It next calculates the seeds based on the records belonging to each cluster. The seed value of a categorical attribute is the proportional distribution of all domain values (a fuzzy seed) of the attribute, whereas the seed value of a numerical attribute is computed from the (normalized) attribute values of the records belonging to the cluster. It then partitions the records into clusters using the same approach as K-Means. In the modified version of K-Means [33], the distance between a record and the seed of a cluster is calculated using a weighted distance function. The distance between two categorical attribute values is calculated with respect to other attributes [34]. To calculate the distance between two values of a categorical attribute, the records having these two values for that attribute are separated. The separated records are then partitioned into two groups based on the two values of that categorical attribute. In each group, the probability of other attribute values is calculated.
From each group, the maximum probability of an attribute value is taken to calculate the distance between the two categorical attribute values with respect to that attribute. Similarly, the distance between two categorical attribute values is calculated with respect to the other attributes of the dataset. However, a limitation of the modified version of K-Means [33] is its requirement for a user input on the number of clusters. Another limitation of the technique is the random selection of the initial seeds.
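As an illustration of the co-appearance idea used in Sections 2.5 and 2.6 (a simplified stand-in, not the actual VICUS algorithm or the weighted function of [33]), the following Python sketch scores the similarity of two values of a categorical attribute by comparing how they co-appear with the values of one other attribute; all names and data are hypothetical.

from collections import Counter

def coappearance_similarity(records, attr, v1, v2, other_attr):
    """Illustrative similarity between two values v1 and v2 of a categorical
    attribute, based on how similarly they co-appear with the values of
    another categorical attribute (a simplified stand-in for VICUS)."""
    group1 = [r[other_attr] for r in records if r[attr] == v1]
    group2 = [r[other_attr] for r in records if r[attr] == v2]
    if not group1 or not group2:
        return 0.0
    p1, p2 = Counter(group1), Counter(group2)
    domain = set(p1) | set(p2)
    # Overlap of the two conditional distributions: 1.0 means identical
    # co-appearance profiles, 0.0 means completely disjoint ones.
    return sum(min(p1[v] / len(group1), p2[v] / len(group2)) for v in domain)

# Hypothetical example: how similar are 'urban' and 'suburban' w.r.t. 'income'?
records = [{"area": "urban", "income": "high"},
           {"area": "urban", "income": "high"},
           {"area": "suburban", "income": "high"},
           {"area": "suburban", "income": "low"},
           {"area": "rural", "income": "low"}]
print(coappearance_similarity(records, "area", "urban", "suburban", "income"))  # 0.5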
2.7. Clustering technique for categorical datasets

There is an existing technique that selects initial seeds deterministically for a dataset that has only categorical attributes [35]. The technique calculates the density of the records of the dataset to select initial seeds for clustering, using the frequency of the domain values of each categorical attribute to calculate the density of a record. The record that has the maximum average density is considered as the first initial seed. A record is considered as the second seed if the density of the record times the distance of the record from the first seed is maximal compared to any other record of the dataset. A record is considered to be the third seed if the density of the record times the distance between the record and a selected seed is maximal compared to other records. Similarly, k seeds are selected from the dataset. However, one limitation of the seed selection technique [35] is that it requires a user to provide the number of clusters k as an input. There are a few other existing deterministic seed selection techniques. However, these techniques require the user to provide the number of clusters as an input, which again may be difficult for the user, and some of them do not work on datasets that have both categorical and numerical attributes [9,36].

For clustering categorical data, an existing clustering technique can be used that selects the initial seeds and the number of clusters from a dataset through the clustering process [9]. For each categorical attribute of a dataset, the technique divides the dataset into subsets based on the domain values of the categorical attribute. The technique next calculates the seeds for each subset. The domain value of a categorical attribute that has the maximum frequency among the records of a subset is considered as the seed value of that categorical attribute. The number of subsets can be up to the total number of domain values of all categorical attributes of a dataset. Then, the density of each potential seed is calculated and the seed that has the highest density is considered as the first seed. From the remaining seeds, the possibility of each seed is calculated. The seed that has the highest possibility is considered as the second seed. The seed selection process is stopped if the value of the possibility changes rapidly. However, the technique does not work on a dataset that has both categorical and numerical attributes.

2.8. Deterministic seed selection for density based clustering

An improved version of K-Means selects the initial seeds deterministically (inspired by K-Means++) from the dense regions of a dataset [23]. The technique uses a probabilistic method to select the initial seeds from a dataset. The initial seeds are given as input to K-Means to produce the final clusters of a dataset. One of the limitations of the technique is that it requires the number of clusters k as an input for a dataset [23]. Moreover, the technique does not work on a dataset having any categorical attributes.

A density based clustering technique assumes that the seeds of the clusters are surrounded by neighbor records, where the neighbor records have lower local density and are relatively far from any records having higher local density [37]. For each record of a dataset, the technique calculates the local density and its distance from the records of higher density. The local density of a record is the number of neighbor records whose distance from that particular record is less than a cutoff distance. After calculating the seeds, the technique assigns the remaining records to the closest seeds. One of the limitations of the technique is that it requires a user defined cutoff distance, which can be difficult for a user to estimate correctly.
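The following Python sketch illustrates the density-and-distance idea of the technique in [37] described above: the local density of a record is the number of records within a cutoff distance, and each record is also scored by its distance to the nearest record of higher density. The cutoff value, function names and the way seed candidates are ranked are illustrative assumptions, not the authors' code.

import numpy as np

def density_peak_scores(X, cutoff):
    """For each record: local density (number of neighbours within `cutoff`)
    and the distance to the nearest record with a higher local density."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    density = (dist < cutoff).sum(axis=1) - 1           # exclude the record itself
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(density > density[i])[0]
        # The highest-density record has no denser neighbour; use its max distance.
        delta[i] = dist[i, higher].min() if len(higher) else dist[i].max()
    return density, delta

# Records with a large density * delta product are seed candidates.
X = np.random.default_rng(0).normal(size=(100, 2))
density, delta = density_peak_scores(X, cutoff=0.5)
seed_candidates = np.argsort(density * delta)[::-1][:3]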
2.9. Ensemble clustering techniques

Ensemble clustering techniques use the results of different clustering techniques in order to produce high quality clusters [38–40]. An existing ensemble clustering technique called complementary ensemble clustering (CEC) combines the clustering results of multiple runs of K-Means and has been applied on biomedical datasets [41]. However, one of the limitations of the technique is that it requires a user defined number of clusters as an input.

2.10. CRUDAW-F and CRUDAW-H

CRUDAW-F [1,17] and CRUDAW-H [1,17] solve the issues caused by the random seed selection approach of K-Means, K-Means++, the modified version of K-Means, BKM, BFPH and NFPH by using a deterministic initial seed selection process based on a user defined radius r. The initial seed selection process of CRUDAW-F and CRUDAW-H is the same. They select the record which has the densest region (the maximum number of records) within its r radius as the first seed, and then remove the first seed and all records within its r radius. From the remaining records of the dataset, they then select the record which has the densest region within its r radius as the second seed. The seed selection process of CRUDAW-F and CRUDAW-H [1,17] continues as long as the following two conditions are satisfied: the first condition is that the dataset still contains at least a user defined number of records after the removal of the seeds and the records within their r radius, whereas the second condition is that a seed has at least a user defined number of records within its r radius. For CRUDAW-F, the initial seeds are used to determine the initial fuzzy membership degrees for clustering. For CRUDAW-H, the initial seeds are given to K-Means to produce the final clusters from the dataset.

2.11. Hierarchical clustering techniques

Hierarchical clustering techniques partition the records of a dataset into a hierarchy [1,10,27]. They produce a set of nested clusters organized in a tree structure, also known as a dendrogram. There are two types of hierarchical clustering, namely agglomerative hierarchical clustering and divisive hierarchical clustering. Agglomerative hierarchical clustering is a bottom-up approach, whereas divisive hierarchical clustering is a top-down approach. However, one of the limitations of many hierarchical clustering techniques is their high computational complexity. For example, an existing hierarchical clustering technique has a complexity of O(n³), whereas K-Means has a complexity of O(n) [1]. In hierarchical clustering techniques, a record belonging to a cluster cannot move into another cluster. They may also fail to separate overlapping clusters [1,10,27].

2.12. AGCUK and GAGR

AGCUK is a genetic algorithm based clustering technique that obtains the number of clusters from a dataset through the clustering process [19]. In AGCUK, the population size (i.e. the number of chromosomes) and the number of generations are user defined parameters. For the initial population, the number of genes (i.e. the number of clusters) k for a chromosome is randomly chosen in the range 2 to √n, where n is the number of records in the dataset. To form a chromosome, k records are randomly selected from the dataset as genes (i.e. seeds). AGCUK uses a noising selection method to maintain diversity in the population. In the mutation process, AGCUK uses absorption and division approaches to resolve the under-partition and over-partition issues of the genetic algorithm.
However, due to the random gene selection process, the chromosomes of a population may not represent the actual clusters of a dataset. Another limitation of AGCUK is that it does not work on a dataset that has both categorical and numerical attributes. GAGR is also a genetic algorithm based clustering technique, where the population size and the number of generations are user defined parameters [20]. Each chromosome in the initial population of GAGR is produced by selecting k non-identical records from a dataset. GAGR performs a gene rearrangement in a chromosome in order to produce better quality offspring. However, GAGR cannot work on a dataset that has both categorical and numerical attributes.

2.13. Complexities of the techniques

We now present the complexities of the techniques used in this study. The complexity of DenClust, CRUDAW-F (CF) and CRUDAW-H (CH) is O(n²), whereas the complexity of AGCUK, K-Means (KM), K-Means++ (KM++) and GAGR is O(n) [1,19,20]. In the literature there are also some existing clustering techniques, including ACCA, HC and ACAD, that have O(n³) complexity [42–44].

3. DenClust: A density based K-Means clustering technique

In this study, we present a density based K-Means clustering technique called DenClust that automatically determines the number of clusters k and the initial seeds without requiring a user input on parameters such as the number of clusters k or the radius r. It uses a deterministic process to produce initial seeds that represent the densest regions of a dataset. The initial seeds are then fed into K-Means to produce the final clusters of the dataset. The initial seeds are conceptually similar to the cluster centers, since each seed represents a dense region and no duplicate seeds are chosen from the same region. Therefore, the initial seeds selected by DenClust are expected to be of high quality, and high quality initial seeds are expected to produce high quality clusters. Our experimental results also indicate the superiority of DenClust. The DenClust algorithm is presented in Algorithm 1. The main steps of DenClust are as follows (see Algorithm 1).

Step 1: Automatic selection of high quality initial seeds.
Step 2: The initial seeds are fed into K-Means to produce the final set of clusters.

Step 1: Automatic selection of high quality initial seeds

DenClust uses a density based approach for the selection of a set of initial seeds from the densest regions of a dataset. The attributes A = {A1, A2, ..., Am} of a dataset can be numerical and/or categorical. Let us assume that there are h numerical attributes An = {A1, A2, ..., Ah} and (m − h) categorical attributes Ac = {Ah+1, Ah+2, ..., Am}. If the dataset has any numerical attributes, then DenClust first normalizes each numerical attribute into the range 0 to 1 in order to give equal emphasis to each numerical attribute while calculating the distance between two records. The Normalize(D) method normalizes the records of a dataset D and produces a normalized dataset called D′. However, if a dataset does not have any numerical attributes then Normalize(D) produces D′, which is the same as D. If the ath numerical attribute Aa has the domain [l, u], where l is the lower and u is the upper limit of the domain, and Ri,a is the ath attribute value of the ith record, then the normalized value of Ri,a is as follows:

n(Ri,a) = (Ri,a − l) / (u − l).   (1)
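A minimal Python sketch of the min–max normalization in (1), assuming the numerical attributes are stored column-wise in a NumPy array:

import numpy as np

def normalize(X):
    """Min-max normalize every numerical column of X into [0, 1], as in (1)."""
    lower, upper = X.min(axis=0), X.max(axis=0)
    span = np.where(upper > lower, upper - lower, 1.0)   # avoid division by zero
    return (X - lower) / span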
Let a categorical attribute Ah+b = {ab1, ab2, ..., abq} have q domain values. DenClust computes the similarity of each pair of values (abi, abj), sim(abi, abj), ∀i, j, using an existing technique called
VICUS [32]. If the dataset has any numerical attributes, VICUS does not consider the domain values of the numerical attributes in the similarity calculation. Note that in DenClust we calculate the similarity of each pair of categorical values considering the domain values of all numerical attributes as well, by categorizing the domain values of the numerical attributes using an existing technique [1,45]. However, DenClust is not limited to the technique in [32], and any other suitable technique can be used. The similarity of a value pair varies between 0 and 1, where 0 means no similarity and 1 means complete similarity. The distance between two values of an attribute, σ(abi, abj), is calculated as follows:

σ(abi, abj) = 1 − sim(abi, abj).   (2)

Therefore, σ(abi, abj) also varies between 0 and 1, where a lower value means a smaller distance. From (2), we can see that if the similarity sim(abi, abj) of two domain values of an attribute is high then the distance σ(abi, abj) between them is low. While many existing techniques [11,21] consider the distance between two categorical values to be either 0 or 1 (and nothing in between), DenClust allows it to be anything between 0 and 1, based on the similarity between the values. DenClust then calculates the distance between every pair of records using (3); the distance between a record and a seed is also calculated using (3).
δ(Ri, Rj) = Σ_{a=1}^{h} |n(Ri,a) − n(Rj,a)| + Σ_{b=h+1}^{m} σ(Ri,b, Rj,b).   (3)
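A hedged Python sketch of the record distance in (3), assuming the numerical attribute values are already normalized with (1) and that a precomputed table of categorical value dissimilarities (1 − sim, as in (2)) is available; the record and table layouts are illustrative, not the authors' data structures.

def record_distance(r1, r2, numerical_attrs, categorical_attrs, cat_dist):
    """Distance between two records as in (3): L1 distance over the normalized
    numerical attributes plus the dissimilarity (2) of the categorical values.
    cat_dist[attr][(v1, v2)] is assumed to hold 1 - sim(v1, v2) for both
    orderings of the value pair."""
    d = sum(abs(r1[a] - r2[a]) for a in numerical_attrs)
    for a in categorical_attrs:
        v1, v2 = r1[a], r2[a]
        d += 0.0 if v1 == v2 else cat_dist[a].get((v1, v2), 1.0)
    return d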
Algorithm 1: DenClust: A Density Based K-Means Clustering
Input: A dataset D, a threshold T, the number of iterations N for K-Means and a threshold ϵ for K-Means
Output: A set of clusters C

Set S ← ∅  /* S is a set of initial seeds. Initially S is set to null */
/* Step 1: Automatic selection of initial seeds */
D′ ← Normalize(D)  /* Numerical attributes of D are normalized and the normalized dataset is stored in D′ */
count ← 0
while |D′| ≥ T do
    Score ← CalculateScore(D′)  /* Score contains the density of each record */
    R ← FindMax(Score)  /* the record R has the maximum density */
    count ← count + 1
    if Score(R) ≥ T OR count ≤ 2 then
        d ← Neighbor(D′, R)  /* d is the set of neighbors of R */
        S ← S ∪ {R}  /* add R to the set S */
        D′ ← D′ − d  /* records in d are removed from D′ */
    else
        break
/* Step 2: Feed the initial seeds to K-Means to produce the final clusters */
Set Ocur ← 0, Oprev ← 0
for t ← 1 to N do
    C ← PartitionRecord(D′, S)  /* Partition the records into clusters */
    S ← CalculateSeed(C)  /* Seed calculation */
    Ocur ← SSE(D′, S)  /* SSE is the K-Means objective function */
    if t > 1 and |Ocur − Oprev| ≤ ϵ then
        break
    Oprev ← Ocur
C ← PartitionRecord(D′, S)  /* Produce the final clusters */
C ← Denormalize(C, D)  /* The records in C are denormalized */
return C

Based on the computed distances (using (3)), DenClust calculates the density (score) of each record. The score of a record Ri is the number of records that have their minimum distance with Ri, i.e. the number of records for which Ri is the nearest record.
Note that if a record has the same minimum distance to more than one record, then in the score calculation DenClust randomly adds a score of 1 to one of the records with which the record shares that minimum distance. The CalculateScore(D′) method of Algorithm 1 calculates the score of each record and stores it in Score. The record Ri having the highest score is selected as the first seed. The record having the highest score is determined using the FindMax(Score) method. From the dataset, DenClust next removes the first seed Ri and all other records for which Ri is the nearest neighbor. The neighbors of the densest record are determined using the Neighbor(D′, R) method. From the remaining records of the dataset, the record having the highest score is considered as the second seed. The second seed and all records having their minimum distance with the second seed are removed. Note that during seed selection, if there are two or more records having the same highest score then DenClust randomly selects one of them as the seed. DenClust continues the seed selection process while the number of remaining records in the dataset is greater than or equal to T. Additionally, it only accepts a seed if the seed has at least T records for which it is the nearest neighbor (see Algorithm 1). However, DenClust chooses at least two seeds even if they do not have at least T records associated with them. DenClust also allows a user to input any other value for T if desired. CRUDAW-H and CRUDAW-F [1,17] require a user defined radius r since they select the record Ri as a seed, where Ri has the maximum number of records within its r radius. In contrast, DenClust does not require an input on r since it selects a record Ri as a seed where Ri has the maximum number of records for which Ri is the nearest neighbor; to identify the nearest neighbor of a record we do not need r.

Step 2: The initial seeds are fed into K-Means to produce the final set of clusters

The initial seeds produced in Step 1 are given as input to K-Means to produce the final clusters. With high quality initial seeds, K-Means is expected to produce high quality clusters [1,17,28]. Based on the initial seeds, the records of a dataset are partitioned into different clusters using PartitionRecord(D′, S). A record is assigned to the cluster whose seed has the minimum distance with the record. Note that while partitioning the records, the distance between a record and a seed is calculated using (3). Based on the records of each cluster, DenClust calculates the seed of each cluster using CalculateSeed(C). During seed calculation, the value of a numerical attribute in a seed is the average value of the attribute over all records that belong to the cluster. However, for a categorical attribute, the value that has the maximum frequency among the records of a cluster is considered as the seed value of that categorical attribute. DenClust next calculates the value of the objective function Sum of Squared Error (SSE) for the clusters using (9). In Step 2, for every iteration of the for loop, the values of C, S and Ocur are updated. The partitioning of the records and the seed calculation process continue until the termination conditions are satisfied. A user defined maximum number of iterations N is the first termination condition. A user defined absolute difference ϵ between the values of SSE in two consecutive iterations (i.e. |Ocur − Oprev| ≤ ϵ, with t > 1) is the second termination condition. The break statement under the if statement is executed when t > 1 and |Ocur − Oprev| ≤ ϵ. In our experimentation, we use ϵ = 0.005 (see Section 4.3). Based on the empirical analysis on the maximum number of iterations (N) shown in Section 4.3, we set N to 50. The Denormalize(C, D) method converts the normalized records back into the original records for the clusters of a dataset. For the records that belong to a cluster, we find their indices (positions of the records in the dataset) and, based on these indices, we collect the corresponding records from the original dataset.
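Putting Steps 1 and 2 together, the following Python sketch follows a simplified reading of Algorithm 1 for purely numerical data: the score of a record is the number of records for which it is the nearest neighbor, seeds are harvested from the densest regions until fewer than T records remain or a candidate's score drops below T (keeping at least two seeds), and the selected seeds are then handed to a K-Means style refinement. Here T is used as an absolute record count, whereas the experiments express it as a percentage of n; this is an illustrative sketch, not the authors' implementation.

import numpy as np

def denclust_seeds(X, T):
    """Step 1 of Algorithm 1 on numerical data: density based seed selection.
    T is an absolute record count here (the paper uses a percentage of n)."""
    remaining = np.arange(len(X))
    seeds, count = [], 0
    while len(remaining) >= T:
        D = X[remaining]
        dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
        np.fill_diagonal(dist, np.inf)
        nearest = dist.argmin(axis=1)                    # nearest neighbour of each record
        score = np.bincount(nearest, minlength=len(D))   # how many records call it "nearest"
        best = int(score.argmax())
        count += 1
        if score[best] >= T or count <= 2:
            seeds.append(X[remaining[best]])
            drop = np.union1d(np.where(nearest == best)[0], [best])
            remaining = np.delete(remaining, drop)       # remove the seed and its neighbours
        else:
            break
    return np.array(seeds)

# Step 2 then feeds these seeds into K-Means (e.g. the earlier kmeans sketch,
# adapted to start from denclust_seeds(X, T) instead of random seeds).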
4. Experimental results and discussion
4.1. Datasets, techniques and the cluster evaluation criteria

We implement DenClust and six existing clustering techniques, namely CRUDAW-F [1,17], CRUDAW-H [1,17], K-Means [10], K-Means++ [26], AGCUK [19] and GAGR [20]. In the experiments, we use twenty biomedical datasets that we obtain from the UCI Machine Learning repository and Bioinformatics [6,7,24]. We provide a brief introduction to the datasets in Table 1. We compare the performance of the techniques in terms of two external cluster evaluation criteria (Entropy and Purity) and one internal cluster evaluation criterion named Sum of Squared Error (SSE) [1,10,27]. Note that higher Purity values indicate better clustering results, whereas lower Entropy and SSE values indicate better results. Note that we remove the class attribute from a dataset before applying any clustering technique on it, since the datasets on which clustering techniques are applied do not have the class attribute, i.e. labels for the records. The class attribute values are used again for cluster evaluation purposes with the Entropy and Purity measures [1,10]. Moreover, if a dataset has any records with missing values then we remove those records from the dataset before applying any clustering technique on it.

4.2. Definitions of the cluster evaluation criteria

The definitions of Purity, Entropy and Sum of Squared Error (SSE) are given below. The purity of a cluster is calculated to evaluate whether the records of the cluster have the same class value. If pi is the purity of the ith cluster and j is a class value, then pi is calculated as follows:

pi = max_j (pi,j).   (4)

Here pi,j is the probability that a record of the ith cluster has the class value j. If the number of records in the ith cluster is Qi and the number of records of the ith cluster that have the class value j is Qi,j, then the probability pi,j is calculated as follows:

pi,j = Qi,j / Qi.   (5)

If the number of records in the dataset is n, then the overall purity for k clusters is calculated as follows:

Purity = Σ_{i=1}^{k} (Qi / n) pi.   (6)
A higher purity value indicates better clustering results, and the value of purity varies from 0 to 1. Similar to purity, the entropy of a cluster is also calculated to evaluate whether the records of the cluster have the same class value. If ei is the entropy of the ith cluster, then ei is calculated as follows:

ei = − Σ_{j=1}^{z} pi,j log2 pi,j.   (7)

Here z is the domain size of the class attribute and pi,j is calculated using (5). The overall entropy for k clusters is calculated as follows:

Entropy = Σ_{i=1}^{k} (Qi / n) ei.   (8)
A lower entropy value indicates better clustering results, and the value of entropy varies from 0 to log2 n. If k is the number of clusters, Sj is the seed of the jth cluster Cj, Ri is the ith record of the jth cluster Cj, and σ(Ri, Sj) is the distance between Ri and Sj, then SSE is calculated as follows:

SSE = Σ_{j=1}^{k} Σ_{i=1}^{|Cj|} σ(Ri, Sj)².   (9)
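A small Python sketch of the three evaluation criteria in (4)–(9); labels and classes are assumed to be NumPy arrays holding the cluster assignments and the withheld class values, and the distance used in the SSE is plain Euclidean for brevity (illustrative names, not the authors' code).

import numpy as np

def purity_entropy(labels, classes):
    """Overall Purity (6) and Entropy (8) of a clustering against class values."""
    n, purity, entropy = len(labels), 0.0, 0.0
    for c in np.unique(labels):
        members = classes[labels == c]
        probs = np.unique(members, return_counts=True)[1] / len(members)   # p_ij as in (5)
        purity += len(members) / n * probs.max()                           # (4) and (6)
        entropy += len(members) / n * -(probs * np.log2(probs)).sum()      # (7) and (8)
    return purity, entropy

def sse(X, labels, seeds):
    """Sum of Squared Error (9): squared distance of every record to its cluster seed."""
    return float(sum((np.linalg.norm(X[labels == j] - s, axis=1) ** 2).sum()
                     for j, s in enumerate(seeds)))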
4.3. The parameters used in the experiments

In the experimental evaluation of DenClust we use ten T values, namely 0.1%, 0.2%, 0.3%, 0.4%, 0.5%, 0.6%, 0.7%, 0.8%, 0.9% and 1% of the total number of records of a dataset D. The number of iterations for DenClust, CRUDAW-F, CRUDAW-H, K-Means and K-Means++ is 50 and ϵ = 0.005. In the experiments on AGCUK, the number of chromosomes in the initial population and the number of generations are 20 and 50, respectively, as suggested in the study [19]. The values of rmax and rmin are 1 and 0, respectively, based on the recommendation of the study [19]. For CRUDAW-F and CRUDAW-H, we use T = 1.0%, and for CRUDAW-F the fuzzy coefficient β = 2.2, as recommended in the papers [1,17]. Moreover, for CRUDAW-F and CRUDAW-H, we use 30 different r values, namely 0.2, 0.18, 0.17, 0.16, 0.15, 0.14, 0.13, 0.12, 0.11, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0.009, 0.008, 0.007, 0.006, 0.005, 0.004, 0.003, 0.002, 0.001, 0.0005 and 0.0001.

As mentioned above, in the experimentation we consider the maximum number of iterations of K-Means (N) to be 50 as a termination condition. We perform an empirical analysis to justify the selection of N = 50. We run K-Means on the BT dataset 50 times with just one termination condition, ϵ = 0.005. From the 50 runs we aim to find out the number of iterations required by K-Means. We find that K-Means typically terminates in fewer than 50 iterations when it only uses ϵ as the termination condition. We consider ϵ the natural termination condition and therefore prefer to choose an N value that lets K-Means terminate based on the value of ϵ, while taking care of unusual situations where ϵ fails to terminate the process even after a large number of iterations. In Fig. 1, we present the number of times (i.e. the frequency) K-Means terminates in a particular number of iterations, out of the 50 runs of K-Means. K-Means terminates 7 times in the 18th iteration, and the maximum number of iterations required by K-Means is 33. Therefore, we find that N = 50 is a suitable option. Similarly, we run K-Means on the LD dataset and present the frequency versus iteration in Fig. 2. On both the BT and LD datasets, K-Means terminates in fewer than 50 iterations. Therefore, we use 50 as the number of iterations for K-Means in DenClust, as a safe upper bound.

4.4. The experimental results

We evaluate DenClust by comparing its clustering results with the results of the six existing techniques in terms of SSE (lower the better), as shown in Table 2. In Table 2, the SSE values of DenClust on each dataset are the average of 10 SSE values, as we run DenClust with ten different T values on each individual dataset. In Table 2, the SSE values of CRUDAW-F on each dataset are the average of the 30 SSE values that CRUDAW-F produces for the 30 different r values. Similarly, on each dataset the SSE values of CRUDAW-H are the average of the 30 SSE values that CRUDAW-H produces for the 30 different r values. In the experimentation of K-Means, we randomly generate 10 values of k (the number of clusters) in the range 2 to √n (where n is the number of records) and run K-Means 10 times for each k value, i.e. we run K-Means 100 times in total. Note that for each run of K-Means we use 50 iterations. Therefore, for each dataset the SSE values of K-Means in Table 2 are the average of the 100 SSE values. For K-Means++, we follow the same approach as for K-Means. Therefore, the SSE values of K-Means++ on each dataset are the average of the 100 SSE values.
Table 1
A brief introduction to the datasets.

Datasets | No. of records with missing values | No. of records without missing values | No. of numerical attributes | No. of categorical attributes | Class size
Blood Transfusion (BT) | 748 | 748 | 4 | 0 | 2
Dermatology (DL) | 366 | 358 | 34 | 0 | 6
Liver Disorder (LD) | 345 | 345 | 6 | 0 | 2
Mammographic Mass (MM) | 961 | 830 | 5 | 0 | 2
Pima Indian Diabetes (PID) | 768 | 768 | 8 | 0 | 2
Yeast (YS) | 1484 | 1484 | 18 | 0 | 10
Haberman (HM) | 306 | 306 | 3 | 0 | 2
Ecoli (EC) | 336 | 336 | 8 | 0 | 8
Cardictography (CG) | 2126 | 35 | 35 | 0 | 3
Cho (CO) | 386 | 386 | 16 | 0 | 5
Iyer (IR) | 517 | 517 | 11 | 0 | 11
Heart Disease (HD) | 303 | 297 | 13 | 0 | 5
Hepatitis (HT) | 80 | 80 | 19 | 0 | 2
Seeds (SD) | 210 | 210 | 7 | 0 | 3
Wisconsin Breast Cancer (WBC) | 699 | 683 | 9 | 0 | 2
Spectf Heart (SFH) | 267 | 267 | 44 | 0 | 2
Breast Cancer (BC) | 286 | 277 | 0 | 9 | 2
Spect Heart (SH) | 267 | 267 | 0 | 22 | 2
Post Operative Patient (POP) | 90 | 87 | 1 | 7 | 3
Thoracic Surgery (TS) | 470 | 470 | 3 | 13 | 2
Fig. 3. Average Entropy of the techniques on the 16 biomedical datasets having only numerical attributes.
Fig. 1. The frequency versus iteration of K-Means to the BT dataset.
Fig. 2. The frequency versus iteration of K-Means to the LD dataset.
For each dataset, the SSE values of AGCUK in Table 2 are the average of 10 SSE values, as we run AGCUK 10 times. For GAGR, we follow the same approach as for AGCUK. From Table 2, we can see that in terms of SSE, DenClust performs better than all the existing techniques on the sixteen datasets having only numerical attributes used in this experimentation. In the table, the best results are shown in boldface. Note that the datasets used in Table 2 have only numerical attributes, as K-Means, K-Means++, AGCUK and GAGR cannot work on a dataset that has categorical attributes. For each dataset in Table 2, the SSE value of DenClust is lower than the SSE values of the existing techniques, i.e. the performance of DenClust is better than that of the existing techniques on each dataset in terms of SSE.

We also present the average Entropy (lower the better) and Purity (higher the better) of the techniques on each individual numerical dataset in Tables 3 and 4, respectively. From Tables 3 and 4, it is clear that DenClust performs better than the six (6) existing techniques in terms of Entropy and Purity on the biomedical datasets having only numerical attributes. In terms of Entropy, DenClust performs better than the existing techniques on all datasets except LD. However, on the LD dataset, in terms of Entropy, DenClust performs better than K-Means, K-Means++, AGCUK and GAGR. In terms of Purity, DenClust performs better than the existing techniques on all datasets except Haberman, MM and LD. However, on the LD dataset, in terms of Purity, DenClust performs better than K-Means, K-Means++, AGCUK and GAGR. The average SSE, Entropy and Purity of the techniques on all biomedical datasets having only numerical attributes are presented in Table 5 and Figs. 3 and 4. The average SSE, Entropy and Purity results also indicate that DenClust produces high quality clusters on the biomedical datasets having only numerical attributes.
Table 2
Average SSE (lower the better) of the techniques on each individual biomedical dataset having only numerical attributes.

DS | DenClust | CF | CH | KM | KM++ | AGCUK | GAGR
BT | 84.744 | 135.510 | 138.666 | 122.596 | 133.980 | 246.002 | 159.661
DL | 874.923 | 1326.845 | 1205.305 | 1299.849 | 1363.627 | 2076.708 | 1509.863
LD | 85.118 | 135.510 | 138.666 | 12302.231 | 139.642 | 185.216 | 175.971
MM | 127.473 | 185.384 | 183.458 | 235.964 | 188.740 | 393.690 | 319.340
PID | 320.803 | 502.835 | 473.793 | 436.852 | 485.051 | 577.918 | 452.565
YS | 406.524 | 638.962 | 608.414 | 504.314 | 566.184 | 793.077 | 492.687
HM | 26.690 | 58.367 | 57.355 | 224.818 | 235.201 | 75.214 | 2245.405
EC | 77.356 | 114.837 | 112.396 | 110.045 | 110.760 | 150.031 | 126.201
CG | 2723.075 | 7260.013 | 5440.417 | 762834.684 | 4214.958 | 5560.037 | 790664.499
CO | 303.402 | 439.299 | 432.002 | 1710.922 | 428.039 | 638.858 | 1660.461
IR | 95.511 | 199.965 | 210.629 | 2448.024 | 168.191 | 280.804 | 1962.814
HD | 91.255 | 162.902 | 160.716 | 12231.006 | 155.457 | 172.582 | 12660.082
HT | 141.361 | 162.902 | 160.716 | 6133.450 | 302.991 | 162.324 | 5426.527
SD | 48.652 | 90.799 | 90.707 | 411.386 | 102.526 | 184.053 | 431.938
WBC | 404.294 | 613.400 | 570.040 | 4833.182 | 586.339 | 873.758 | 5302.404
SFH | 746.495 | 1022.129 | 1005.078 | 61680.203 | 1012.796 | 1166.933 | 53727.999
Table 3
Average Entropy (lower the better) of the techniques on each individual biomedical dataset having only numerical attributes.

DS | DenClust | CF | CH | KM | KM++ | AGCUK | GAGR
BT | 0.650 | 0.724 | 0.723 | 0.711 | 0.718 | 0.783 | 0.730
DL | 0.102 | 0.661 | 0.279 | 0.361 | 0.539 | 1.726 | 1.201
LD | 0.787 | 0.724 | 0.723 | 0.939 | 0.952 | 0.978 | 0.939
MM | 0.469 | 0.659 | 0.664 | 0.671 | 0.662 | 0.738 | 0.742
PID | 0.609 | 0.796 | 0.770 | 0.746 | 0.772 | 0.831 | 0.806
YS | 1.433 | 2.172 | 1.886 | 1.683 | 1.856 | 2.354 | 1.910
HM | 0.534 | 0.760 | 0.760 | 0.802 | 0.812 | 0.778 | 0.768
EC | 0.473 | 0.763 | 0.684 | 0.653 | 0.623 | 1.062 | 0.901
CG | 0.056 | 0.723 | 0.214 | 0.893 | 0.079 | 0.173 | 0.867
CO | 0.704 | 1.105 | 1.096 | 1.099 | 1.102 | 2.042 | 1.222
IR | 0.987 | 2.285 | 2.404 | 2.014 | 2.117 | 3.006 | 1.781
HD | 1.027 | 1.483 | 1.475 | 1.633 | 1.371 | 1.618 | 1.711
HT | 0.467 | 1.483 | 1.475 | 0.901 | 0.772 | 0.590 | 0.892
SD | 0.168 | 0.310 | 0.308 | 0.393 | 0.388 | 0.984 | 0.466
WBC | 0.094 | 0.194 | 0.124 | 0.125 | 0.147 | 0.378 | 0.178
SFH | 0.438 | 0.563 | 0.595 | 0.594 | 0.636 | 0.711 | 0.598
Table 4
Average Purity (higher the better) of the techniques on each individual biomedical dataset having only numerical attributes.

DS | DenClust | CF | CH | KM | KM++ | AGCUK | GAGR
BT | 0.786 | 0.768 | 0.769 | 0.773 | 0.772 | 0.762 | 0.767
DL | 0.969 | 0.786 | 0.898 | 0.875 | 0.801 | 0.507 | 0.638
LD | 0.696 | 0.768 | 0.769 | 0.614 | 0.592 | 0.580 | 0.609
MM | 0.654 | 0.799 | 0.802 | 0.797 | 0.803 | 0.767 | 0.754
PID | 0.779 | 0.696 | 0.714 | 0.726 | 0.712 | 0.693 | 0.715
YS | 0.554 | 0.363 | 0.468 | 0.539 | 0.469 | 0.344 | 0.468
HM | 0.706 | 0.750 | 0.751 | 0.738 | 0.736 | 0.755 | 0.753
EC | 0.855 | 0.795 | 0.814 | 0.825 | 0.829 | 0.729 | 0.761
CG | 0.987 | 0.807 | 0.943 | 0.780 | 0.982 | 0.965 | 0.784
CO | 0.771 | 0.676 | 0.664 | 0.669 | 0.659 | 0.401 | 0.638
IR | 0.701 | 0.426 | 0.406 | 0.478 | 0.455 | 0.284 | 0.506
HD | 0.665 | 0.579 | 0.581 | 0.555 | 0.605 | 0.561 | 0.546
HT | 0.837 | 0.579 | 0.581 | 0.653 | 0.740 | 0.798 | 0.650
SD | 0.940 | 0.903 | 0.906 | 0.873 | 0.873 | 0.608 | 0.849
WBC | 0.975 | 0.956 | 0.975 | 0.973 | 0.970 | 0.898 | 0.961
SFH | 0.845 | 0.794 | 0.796 | 0.795 | 0.794 | 0.794 | 0.796
Table 5
Average SSE (lower the better) of the techniques on the 16 biomedical datasets having only numerical attributes.

DS | DenClust | CF | CH | KM | KM++ | AGCUK | GAGR
All | 431.50 | 860.94 | 723.31 | 57014.49 | 670.32 | 890.13 | 58476.16
Moreover, we also compare the performance of DenClust on the four biomedical datasets having categorical attributes. The experimental results of the techniques are presented in Tables 6–8. From Tables 6–8, we can see that the overall performance of DenClust is better than that of the existing techniques on the four datasets having categorical attributes. The experimental results therefore indicate that DenClust performs better than the existing techniques on datasets having numerical and/or categorical attributes.
Table 6
Average SSE (lower the better) of the techniques on the four biomedical datasets having categorical attributes.

Datasets | DenClust | CRUDAW-H | CRUDAW-F
BC | 12.679 | 21.739 | 21.978
SH | 17.882 | 29.297 | 29.949
POP | 1.642 | 2.578 | 2.590
TS | 59.161 | 77.670 | 82.879
Table 7
Average Entropy (lower the better) of the techniques on the four biomedical datasets having categorical attributes.

Datasets | DenClust | CRUDAW-H | CRUDAW-F
BC | 0.552 | 0.721 | 0.723
SH | 0.533 | 0.596 | 0.583
POP | 1.061 | 0.935 | 0.932
TS | 0.531 | 0.589 | 0.593
Table 8
Average Purity (higher the better) of the techniques on the four biomedical datasets having categorical attributes.

Datasets | DenClust | CRUDAW-H | CRUDAW-F
BC | 0.799 | 0.765 | 0.764
SH | 0.848 | 0.807 | 0.808
POP | 0.724 | 0.761 | 0.762
TS | 0.858 | 0.851 | 0.851
Fig. 4. Average Purity of the techniques on the 16 biomedical datasets having only numerical attributes.
Fig. 6. Sign test of DenClust on the 20 biomedical datasets.
Fig. 5. Sign test of DenClust on the 16 biomedical datasets having only numerical attributes.
4.5. Statistical analysis on the clustering results of the techniques

We perform a statistical non-parametric sign test on the clustering results to demonstrate the statistical significance of DenClust over the existing techniques [46]. The right-tailed sign test is carried out at the significance level alpha = 0.05 (i.e. 95% significance level). In Fig. 5, we present the sign test results of DenClust (compared with the existing techniques) based on the sixteen (16) biomedical datasets that have only numerical attributes. The six bars in Fig. 5 show the z-values (test statistic values) for DenClust compared with the six existing techniques, while the seventh bar shows the z-ref value. The results of DenClust are considered to be significantly better than the results obtained by an existing technique if the corresponding z-value is greater than the z-ref value. For alpha = 0.05, the z-ref value is 1.96. From Fig. 5 we can see that the results of DenClust are significantly better than those of the six existing techniques in terms of all three cluster evaluation criteria based on the sixteen (16) biomedical datasets. In Fig. 6, we report the sign test results of DenClust (compared with CRUDAW-F and CRUDAW-H) based on the twenty (20) biomedical datasets having categorical and/or numerical attributes. Fig. 6 also shows that the results of DenClust are significantly better than those of CRUDAW-F and CRUDAW-H in terms of all three cluster evaluation criteria on the twenty biomedical datasets.
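A hedged Python sketch of the kind of right-tailed sign test described above, using the usual normal approximation: for paired per-dataset results, count the datasets where DenClust wins and compare the resulting z-value against the reference value; the function name and the tie handling are illustrative assumptions, not the exact procedure of [46].

import math

def right_tailed_sign_test(denclust_scores, other_scores, higher_is_better=True):
    """Normal-approximation sign test on paired per-dataset results."""
    pairs = [(a, b) for a, b in zip(denclust_scores, other_scores) if a != b]  # drop ties
    n = len(pairs)
    if n == 0:
        return 0.0
    wins = sum((a > b) == higher_is_better for a, b in pairs)   # datasets where DenClust wins
    # Under the null hypothesis, wins ~ Binomial(n, 0.5); z is its normal approximation.
    return (wins - 0.5 * n) / math.sqrt(0.25 * n)

# Compare the returned z against the reference value used in the paper
# (1.96 at alpha = 0.05); a larger z means DenClust's results are significantly better.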
4.6. Execution time of the techniques

We present the execution times of the techniques on the twenty biomedical datasets in Table 9. From Table 9, we can see that the overall execution time of DenClust is higher than the execution times of CRUDAW-F, CRUDAW-H, K-Means and K-Means++ on the datasets. However, the execution time of DenClust is lower than the execution times of AGCUK and GAGR on the datasets. For all techniques, we carry out the experimentation using an HP Envy Notebook that has an Intel(R) Core(TM) i7-4700MQ CPU @ 2.40 GHz, 16 GB of RAM and a 64-bit operating system on an x64-based processor.

4.7. An analysis on the T value of DenClust

In the experimentation of DenClust, we use ten different T values (0.1%, 0.2%, 0.3%, 0.4%, 0.5%, 0.6%, 0.7%, 0.8%, 0.9% and 1%). We now analyse the results to identify a possibly suitable T value based on the datasets used in these experiments. For each T value, DenClust produces a set of seeds from a dataset and then applies K-Means to produce the final clusters of the dataset. We next evaluate the clusters using an internal cluster evaluation criterion called COSEC (higher the better) [1,18]. For each dataset, we identify the T value for which DenClust has the highest COSEC. For each T value, we then count the number of datasets for which DenClust has the highest COSEC value. In Fig. 7, we present the frequency (i.e. count) versus T value. From Fig. 7, we can see that for T = 0.9% the frequency is 4, i.e. for T = 0.9% DenClust has the highest COSEC value in four datasets out of 16. The next highest frequency is 3, for T = 0.8%. However, a user can use different T values and check for which T value DenClust produces good quality clusters from a dataset.

Fig. 7. Analysis on T value of DenClust using 20 biomedical datasets.

4.8. An analysis on the initial seed selection process of DenClust

We also perform an empirical analysis to evaluate the quality of the initial seeds that are produced by the seed selection process of DenClust. The quality of the initial seeds produced by DenClust is compared with the random seed selection process of K-Means [10]. We use fifteen biomedical datasets, namely Dermatology, Spectf Heart, Seeds, Hepatitis, Heart Disease, Ecoli, Haberman, Cho, Blood Transfusion, Liver Disorder, WBC, Breast Cancer, Post Operative Patient, Thoracic Surgery and Spect Heart, that we obtain from the UCI Machine Learning repository and Bioinformatics [6,7,24]. From Fig. 7, we see that for T = 0.9% (out of the ten T values) DenClust has the highest frequency. Therefore, we consider T = 0.9% to evaluate the quality of the initial seeds produced by DenClust. On the above fifteen datasets, for T = 0.9% we produce clusters using DenClust and evaluate them in terms of SSE. For each dataset, we also randomly select k seeds, with k in the range 2 to √n (where n is the number of records), and then apply K-Means to produce the final clusters and evaluate them in terms of SSE. In Table 10, we present the SSE values for both seed selection approaches. The results shown in Table 10 clearly indicate that the density based seed selection process of DenClust produces higher quality seeds than the random seed selection process. For the remaining five datasets (Cardictography, Iyer, Mammographic Mass, Pima Indian Diabetes and Yeast), for T = 0.9% the quality of DenClust's seeds is not better than that of the randomly selected seeds. However, for T = 0.1% the quality of DenClust's seed selection process is better than the random seed selection process on the Cardictography, Iyer, Mammographic Mass, Pima Indian Diabetes and Yeast datasets.

5. Conclusion

In this paper, we present a clustering technique called DenClust that uses a density based deterministic process for the selection of initial seeds without requiring any user input on the number of clusters and/or the radius of the cluster seeds. An empirical analysis on the initial seeds indicates that DenClust produces high quality initial seeds. We evaluate the performance of DenClust on twenty biomedical datasets, compared against six existing techniques. Our experimental results indicate that DenClust performs better than the six existing techniques in terms of all three evaluation criteria on the biomedical datasets used in this study. We also carry out some empirical analysis on the values of T and N. The computational complexities and execution times of the techniques are also presented in this paper. The time complexity of DenClust is the same as that of CRUDAW-F and CRUDAW-H. However, the time complexity of DenClust is higher than that of K-Means, K-Means++, AGCUK and GAGR. Therefore, DenClust is suitable for applications where a better quality clustering result is appreciated even if it takes a longer time, for example, a scenario where medical research is carried out on patient datasets in order to discover treatment/prevention strategies and disease patterns. A better clustering result can be appreciated as it is more likely to lead to a better decision/conclusion, even if it takes more time. Our experimental results indicate the superiority of DenClust over the existing techniques on biomedical datasets. We also perform a statistical non-parametric sign test on the clustering results of the techniques, which also indicates the statistical significance of DenClust over the existing techniques on the biomedical datasets.
Table 9
Execution time (in seconds) of the techniques on twenty biomedical datasets.

DS | DenClust | CF | CH | KM | KM++ | AGCUK | GAGR
BT | 3.971 | 2.260 | 1.977 | 0.592 | 2.740 | 48.957 | 83.041
DL | 1.423 | 1.958 | 1.681 | 0.827 | 1.396 | 156.682 | 169.092
LD | 1.162 | 1.155 | 0.850 | 0.563 | 0.574 | 50.158 | 92.371
MM | 9.896 | 2.869 | 1.758 | 1.444 | 1.377 | 50.506 | 73.573
PID | 10.596 | 6.089 | 3.986 | 3.957 | 4.503 | 437.327 | 745.212
YS | 57.197 | 11.913 | 8.748 | 10.976 | 12.328 | 528.781 | 810.636
HM | 0.992 | 0.505 | 0.456 | 0.147 | 0.288 | 75.810 | 58.581
EC | 1.038 | 0.246 | 0.217 | 0.362 | 0.363 | 15.362 | 47.776
CG | 295.055 | 71.194 | 51.374 | 60.205 | 61.883 | 1221.264 | 1693.655
CO | 3.961 | 3.666 | 3.874 | 2.035 | 3.241 | 58.026 | 92.090
IR | 9.240 | 2.822 | 1.831 | 4.472 | 4.522 | 63.585 | 66.675
HD | 0.782 | 0.661 | 0.644 | 0.426 | 0.446 | 69.163 | 129.342
HT | 0.217 | 0.615 | 0.345 | 0.138 | 0.085 | 41.194 | 38.731
SD | 0.500 | 0.418 | 0.318 | 0.212 | 0.250 | 45.926 | 67.152
WBC | 17.667 | 4.714 | 2.945 | 2.893 | 3.330 | 47.600 | 48.457
SFH | 6.926 | 2.889 | 2.774 | 2.575 | 2.708 | 52.936 | 80.228
BC | 0.664 | 0.487 | 0.647 | NA | NA | NA | NA
SH | 1.867 | 2.027 | 1.856 | NA | NA | NA | NA
POP | 0.441 | 0.343 | 0.251 | NA | NA | NA | NA
TS | 17.606 | 6.325 | 5.963 | NA | NA | NA | NA
Table 10
The quality of the initial seeds of DenClust, measured by SSE (the lower the better).

Datasets                 T      k    SSE: DenClust's seed   SSE: Random seed
Blood Transfusion        0.9%   10   116.690                119.075
Dermatology              0.9%   18   1142.229               1333.092
Liver Disorder           0.9%   14   118.480                136.689
Haberman                 0.9%   20   50.920                 71.314
Ecoli                    0.9%   4    142.998                149.663
Cho                      0.9%   24   370.555                405.523
Heart Disease            0.9%   32   109.570                159.468
Hepatitis                0.9%   24   141.361                308.775
Seeds                    0.9%   53   47.949                 112.654
WBC                      0.9%   16   509.251                541.168
Spectf Heart             0.9%   36   813.980                998.898
Breast Cancer            0.9%   24   19.485                 26.950
Post Operative Patient   0.9%   28   1.642                  4.100
Spect Heart              0.9%   32   21.991                 36.272
Thoracic Surgery         0.9%   2    129.192                81.948
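As a rough illustration of the comparison reported in Table 10, the sketch below computes the final K-Means SSE once from a supplied set of seeds (for example, seeds produced by DenClust) and once from k randomly chosen records, with k drawn between 2 and √n as in the seed-quality experiment described above. This is a minimal sketch under the assumptions that the data are purely numerical, that scikit-learn's KMeans is used for the refinement step, and that the seeds are supplied externally; it is not the authors' implementation, and the file name in the usage comment is hypothetical.

import numpy as np
from sklearn.cluster import KMeans

def sse_from_seeds(X, seeds):
    # Refine the supplied seeds with K-Means and return the final SSE (inertia_).
    km = KMeans(n_clusters=len(seeds), init=np.asarray(seeds), n_init=1).fit(X)
    return km.inertia_

def sse_random_baseline(X, random_state=0):
    # Random baseline: pick k uniformly in [2, sqrt(n)], seed K-Means with
    # k randomly chosen records, and return the resulting SSE.
    rng = np.random.default_rng(random_state)
    n = len(X)
    k = int(rng.integers(2, int(np.sqrt(n)) + 1))
    seeds = X[rng.choice(n, size=k, replace=False)]
    return KMeans(n_clusters=k, init=seeds, n_init=1).fit(X).inertia_

# Hypothetical usage:
# X = np.loadtxt("dataset.csv", delimiter=",")
# print(sse_from_seeds(X, denclust_seeds), sse_random_baseline(X))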
[8] X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, Z.-H. Zhou, M. Steinbach, D.J. Hand, D. Steinberg, Top 10 algorithms in data mining, Knowl. Inf. Syst. 14 (2008) 1–37.
[9] L. Bai, J. Liang, C. Dang, An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data, Knowl.-Based Syst. 24 (6) (2011) 785–795.
[10] P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, first ed., Pearson Addison Wesley, 2005.
[11] Z. Huang, Clustering large data sets with mixed numeric and categorical values, in: Proceedings of the First Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore, 1997, pp. 21–34.
[12] F. Khan, An initial seed selection algorithm for k-means clustering of georeferenced data to improve replicability of cluster assignments for mapping application, Appl. Soft Comput. 12 (11) (2012) 3698–3700.
[13] S. Chuan Tan, K. Ming Ting, S. Wei Teng, A general stochastic clustering method for automatic cluster discovery, Pattern Recognit. 44 (2011) 2786–2799.
[14] A.K. Jain, Data clustering: 50 years beyond K-Means, Pattern Recognit. Lett. 31 (8) (2010) 651–666.
[15] A.M. Bagirov, Modified global k-means algorithm for minimum sum-of-squares clustering problems, Pattern Recognit. 41 (10) (2008) 3192–3199.
[16] R. Maitra, A. Peterson, A. Ghosh, A systematic evaluation of different methods for initializing the K-means clustering algorithm, IEEE Trans. Knowl. Data Eng. (2010).
[17] M.A. Rahman, M.Z. Islam, CRUDAW: a novel fuzzy technique for clustering records following user defined attribute weights, in: Proceedings of the 10th Australasian Data Mining Conference (AusDM'12), in: CRPIT Series, vol. 134, ACS, Sydney, Australia, 2012, pp. 27–42.
[18] M.A. Rahman, M.Z. Islam, A hybrid clustering technique combining a novel genetic algorithm with K-Means, Knowl.-Based Syst. 71 (2014) 345–365.
[19] Y. Liu, X. Wu, Y. Shen, Automatic clustering using genetic algorithms, Appl. Math. Comput. 218 (4) (2011) 1267–1279.
[20] D.X. Chang, X.D. Zhang, C.W. Zheng, A genetic algorithm with gene rearrangement for K-means clustering, Pattern Recognit. 42 (7) (2009) 1210–1222.
[21] J. Ji, W. Pang, C. Zhou, X. Han, Z. Wang, A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data, Knowl.-Based Syst. 30 (2012) 129–135.
[22] M.C. de Souto, I.G. Costa, D.S. de Araujo, T.B. Ludermir, A. Schliep, Clustering cancer gene expression data: a comparative study, BMC Bioinformatics 9 (1) (2008).
[23] N. Nidheesh, K.A.A. Nazeer, P.M. Ameer, An enhanced deterministic K-Means clustering algorithm for cancer subtype prediction from gene expression data, Comput. Biol. Med. 91 (2017) 213–221.
[24] K. Bache, M. Lichman, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2013. http://archive.ics.uci.edu/ml/.
[25] M.A. Rahman, M.Z. Islam, T. Bossomaier, DenClust: a density based seed selection approach for K-Means, in: Proceedings of the 13th International Conference, ICAISC 2014, Zakopane, Poland, 2014, Part II, pp. 784–795.
[26] D. Arthur, S. Vassilvitskii, K-Means++: the advantages of careful seeding, in: Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, 2007, pp. 1027–1035.
[27] J. Han, M. Kamber, Data Mining: Concepts and Techniques, second ed., Morgan Kaufmann, San Francisco, 2006.
[28] M.A. Rahman, M.Z. Islam, Seed-Detective: a novel clustering technique using high quality seed for K-Means on categorical and numerical attributes, in: Proceedings of the 9th Australasian Data Mining Conference (AusDM'11), in: CRPIT Series, vol. 121, ACS, Ballarat, Australia, 2011, pp. 211–220.
[29] S.M. Savaresi, D. Boley, On the performance of bisecting k-means and PDDP, in: Proceedings of the 1st SIAM International Conference on Data Mining, Chicago, IL, USA, 2001.
[30] Z. He, Farthest-point heuristic based initialization methods for K-Modes clustering, arXiv preprint cs/0610043, 2006.
[31] C. Wang, L. Cao, M. Wang, Coupled nominal similarity in unsupervised learning, in: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, Glasgow, Scotland, UK, 2011.
[32] H. Giggins, L. Brankovic, VICUS - a noise addition technique for categorical data, in: Proceedings of the 10th Australasian Data Mining Conference (AusDM 2012), in: CRPIT Series, vol. 134, CRPIT, 2012, pp. 139–148.
[33] A. Ahmad, L. Dey, A K-Mean clustering algorithm for mixed numeric and categorical data, Data Knowl. Eng. 63 (2) (2007) 503–527.
[34] A. Ahmad, L. Dey, A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set, Pattern Recognit. Lett. 28 (1) (2007) 110–118.
[35] F. Cao, J. Liang, L. Bai, A new initialization method for categorical data clustering, Expert Syst. Appl. 36 (7) (2009) 10223–10228.
[36] R. Cordeiro de Amorim, B. Mirkin, Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering, Pattern Recognit. 45 (3) (2012) 1061–1075.
[37] A. Rodriguez, A. Laio, Clustering by fast search and find of density peaks, Science 344 (6191) (2014) 1492–1496.
[38] G. Forestier, P. Gançarski, C. Wemmert, Collaborative clustering with background knowledge, Data Knowl. Eng. 69 (2010) 211–228.
[39] R. Kashef, M.S. Kamel, Enhanced bisecting k-means clustering using intermediate cooperation, Pattern Recognit. 42 (2009) 2557–2569.
[40] R. Kashef, M.S. Kamel, Cooperative clustering, Pattern Recognit. 43 (2010) 2315–2329.
[41] S.J. Fodeh, C. Brandt, T.B. Luong, A. Haddad, M. Schultz, T. Murphy, M. Krauthammer, Complementary ensemble clustering of biomedical data, J. Biomed. Inf. 46 (2013) 436–443.
[42] H. Pirim, B. Eksioglu, A.D. Perkins, C. Yuceer, Clustering of high throughput gene expression data, Comput. Oper. Res. 39 (2012) 3046–3061.
[43] A. Bhattacharya, R.K. De, Average correlation clustering algorithm (ACCA) for grouping of co-regulated genes with similar pattern of variation in their expression values, J. Biomed. Inf. 43 (2010) 560–568.
[44] A. Chowdhury, S. Das, Automatic shape independent clustering inspired by ant dynamics, Swarm Evol. Comput. 3 (2012) 33–45.
[45] M.A. Rahman, M.Z. Islam, AWST: a novel attribute weight selection technique for data clustering, in: Proceedings of the 13th Australasian Data Mining Conference (AusDM'15), in: CRPIT Series, vol. 168, ACS, Sydney, Australia, 2015, pp. 51–58.
[46] M.F. Triola, Elementary Statistics, eighth ed., Addison Wesley Longman, Inc., 2001.