7 Evaluation of cluster validation metrics

7.1 Introduction

Unsupervised learning, also known as clustering (see Chapter 3), is increasingly popular because it efficiently organizes an overwhelming amount of data (Jain, 2010; Xu & Wunsch, 2009). Cluster analysis has been applied to a wide range of applications as an exploratory tool to enhance knowledge discovery. For example, in biomedical applications, cluster analysis aids in disease subtyping, i.e. the task of identifying homogeneous patient subgroups that can guide prognosis, treatment decisions and possibly predict outcomes or recurrence risks (Saria & Goldenberg, 2015). This naturally translates to clustering: finding meaningful subgroups in a given dataset. Here, the definition of "meaningful" is problem-dependent. For a given clustering algorithm, multiple results can be obtained from the same dataset by varying parameters. Ultimately, the validity of any subgrouping depends on whether the computationally discovered subgroups actually expose a domain-specified variation that is meaningful and significant in the domain application. To guide a successful cluster analysis, a quality measure is used to quantitatively evaluate how well the resulting set of partitions fits the input data. Cluster validity refers to the formal procedures used to evaluate clustering results in a quantitative and objective fashion (Jain, 2010). It determines which set of clusters is optimal for approximating the underlying subgroups in the dataset, as well as how many clusters exist in the data. Broadly, two main types of cluster validation indices (CVIs) have been proposed and studied extensively in the literature: external validation indices and internal indices (Kovács, Legány, & Babos, 2005; Liu, Li, Xiong, Gao, & Wu, 2010; Xu & Wunsch, 2009). Relative CVIs (Brun et al., 2007) have also been discussed as a third category, but they are not considered in the context of this work, as their applicability is limited and they have proven to be approximately as effective as internal CVIs (Brun et al., 2007). Internal CVIs are very important in the context of clustering since there is no ground truth (or labeled data) available. They are used to determine the best partitioning for the data based on the inherent structural properties (compactness and separability) of that data. They are also used to assess the general effectiveness of a clustering method. Internal CVIs have been proposed and evaluated in (Arbelaitz, Gurrutxaga, Muguerza, Pérez, & Perona, 2013; Brun et al., 2007; Dubes, 1987; Halkidi, Batistakis, & Vazirgiannis, 2001; Kovács et al., 2005; Liu et al., 2010, 2013; Maulik & Bandyopadhyay, 2002; Milligan & Cooper, 1985; Vendramin, Campello, & Hruschka, 2010).
The majority of prior CVI evaluation analyses and surveys focus on which set of indices can determine the optimal number of clusters in the data. However, that is not enough to guarantee that it is the optimal set of clusters for the data. When considering applications in which clustering is highly beneficial, it is important to discover meaningful subgroups, not just the optimal number of clusters. Domain experts are very interested in understanding the features that define the resulting subgroups generated by a clustering algorithm, not just the number of clusters found. The question then remains: which set of indices is most reliable? No one-size-fits-all solution is available. In light of this, is there a way to combine the results generated from multiple indices to leverage their individual benefits jointly? It is imperative to develop quality measures capable of identifying optimal partitions for a given dataset. The benefits of processing and inferring information from unlabeled data for a variety of domain problems are undeniable. Thus, improving the quality measures for unsupervised learning remains a vital task. In this work, the authors investigate a statistics-based evaluation framework to empirically assess the performance of five highly cited internal CVIs on real datasets using six common clustering algorithms. The objective is threefold: i) to assess the consistency/reliability of the internal CVIs in accurately determining the optimal partitioning of the data (not just the optimal number of clusters) using rigorous statistical analysis; ii) to assess the performance of CVIs in relation to diverse clustering algorithms as well as diverse distributions/complexities of the datasets; iii) to provide a guide for combining the results from multiple CVIs, using an ensemble validation paradigm, to accurately determine the optimal clustering scheme/configuration for a given dataset.
7.2 Related works

Multiple papers have investigated the performance of various internal CVIs, usually based on their success in identifying the optimal number of partitions in the dataset, using one or two clustering algorithms. The empirical evaluations conducted in (Kovács et al., 2005; Liu et al., 2010; Milligan & Cooper, 1985) were specifically performed on a wide range of synthetic datasets. The analysis carried out by Maulik and Bandyopadhyay (2002) employed three artificial datasets and two real datasets. A model-based evaluation of internal CVIs in (Brun et al., 2007), again utilizing synthetic datasets, demonstrated that the performance of validity indices is highly variable. They observed that for the complex models on which the clustering algorithms performed poorly, internal indices failed to predict the error of the algorithm. They concluded that not much faith should be placed in a validity score unless there is evidence, either in terms of sufficient data for model estimation or prior model knowledge, that a validity measure is well-correlated to the error rate of the clustering algorithm. It is well known that biological/biomedical datasets are usually complex. Hence the question remains: how can the clustering results of those datasets be evaluated to obtain reliable/meaningful results? A detailed study on the validation properties (monotonicity, noise, density, subclusters, and skewed distribution) of 11 widely-used internal CVIs is presented in
(Liu et al., 2010) and extended to 12 in (Liu et al., 2013) to include a new metric based on the notion of nearest neighbors, CVNN. In (Arbelaitz et al., 2013), a general context-independent cluster evaluation process is presented for 30 different CVIs using 3 clustering algorithms on both synthetic and real datasets. Their results showed that the "best-performing" indices on the synthetic data were not the same as for the real data. Their results ranked the indices in three groups by overall performance. The first group included Silhouette, Davies-Bouldin, Calinski-Harabasz, generalized Dunn, COP and SDbw. This chapter evaluates Silhouette, Davies-Bouldin, Calinski-Harabasz, Dunn's and Xie-Beni, as described in detail in Section 7.3. Another approach that has been developed for validating the number of clusters present in a dataset is to view clustering as a supervised classification problem, in which the 'true' class labels must also be estimated (Tibshirani & Walther, 2005). The output labels from the clustering algorithm are used to train and build classification models to assess the quality of the clustering result. The basic idea is that 'true' class labels will improve the prediction strength of the classification models. Hence, the resulting "prediction strength" measure assesses the quality of the clustering results. Bailey (2013) reviews recent work on alternative clusterings, arguing that multiple clusterings can be reasonable for a given dataset and using a set of constraints to determine the multiple partitions suited to the data. Vendramin et al. (2010) conduct their evaluation of the CVIs on the basis that even though an algorithm might correctly estimate the number of clusters contained in a given dataset, that still does not guarantee the quality of the clustering. They propose assessing the ability of internal CVIs to correctly determine the quality of the clustering by applying external CVIs. Thus, good cluster validation indices would correlate well with external indices. It is expected that a good relative clustering validity measure will rank the partitions according to an ordering that is similar to those established by an external criterion, since external criteria rely on supervised information about the underlying structure in the data. They compute the Pearson correlation coefficient between the various clustering indices and the Jaccard index (an external CVI) on the K-means clustering output from 972 synthetic datasets with varying numbers of clusters, sizes and distributions. A statistical test is employed to determine statistically significant differences among the average correlation values for the different CVIs. This chapter builds on Vendramin et al.'s approach (Vendramin et al., 2010) of finding the correlation between internal and external CVIs to assess the performance of the internal CVIs, specifically for biomedical data analysis. The approach is significantly extended by using 6 varied clustering algorithms (to ensure robustness of the results beyond K-means) on 14 real biological datasets. Additionally, three different external CVIs are considered. The Spearman (rather than the Pearson) correlation coefficient is employed to measure the correlation of the ranks of the partitions between internal and external indices. It is less sensitive to outliers and can quantify the strength of any monotonic relationship (not just linear ones).
This current work provides a more extensive evaluation of internal validation indices across real biological data on a variety of widely-used algorithms.
7.3 Background

Internal CVIs usually employ varied measures to compute the degree of separateness and compactness of the clusters to evaluate and determine an optimal clustering scheme (Kovács et al., 2005). "Compactness" measures how close members of each cluster are to each other, i.e. cluster homogeneity/similarity, while "separateness" measures how separated the clusters are from each other. The assumption is that a good clustering result should yield clusters that are compact and well separated. However, a key consideration is that the level of separateness and compactness desired may vary by domain application, since clustering is data-driven. As an illustrative example, consider the commonly used iris dataset (Clifford, Wessely, Pendurthi, & Emes, 2011). It comprises 3 classes of 50 instances, each represented by a set of 4 features, where each class refers to a type of iris plant. It is known that one class is linearly separable from the other two, as illustrated in Fig. 7.1. The iris dataset is often employed to demonstrate how effective a clustering algorithm is at uncovering these three subgroups, even though two of them are not linearly separable. The spectral clustering algorithm (Von Luxburg, 2007) was applied to cluster the dataset to reveal meaningful underlying subgroups. Since it is assumed that the number of clusters is not known a priori, the number of clusters k was varied from 2 to 7, and 5 different CVIs were applied to determine the optimal clustering configuration. According to Table 7.1, 4 of the CVIs selected the k = 2 result as the optimal scheme for the data. A visual comparison of the results (Fig. 7.1) demonstrates that at k = 2, the greatest separation is observed between the two clusters; at k = 3, the level of separation is weaker between clusters 2 and 3, though the compactness is greater.

FIG. 7.1 Visualization of Iris dataset using Principal Component Analysis (PCA). The red, yellow and green overlays are the partitions obtained for the spectral clustering 3-cluster result, while the rectangular overlay denotes its 2-cluster result. The 2-cluster result was selected as optimal by four of the indices.
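This iris illustration is straightforward to reproduce. The sketch below uses scikit-learn's spectral clustering and PCA; note that the chapter's experiments used Matlab's Shi & Malik implementation, so the partitions obtained here may differ slightly.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import SpectralClustering

X, y = load_iris(return_X_y=True)
X2 = PCA(n_components=2).fit_transform(X)  # 2-D projection for plotting

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, k in zip(axes, (2, 3)):
    labels = SpectralClustering(n_clusters=k, random_state=0).fit_predict(X)
    ax.scatter(X2[:, 0], X2[:, 1], c=labels)
    ax.set_title(f"Spectral clustering, k = {k}")
plt.show()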
Table 7.1 Evaluation of Spectral algorithm on Iris data.

Internal CVIs                  k=2     k=3     k=4     k=5     k=6     k=7
Silhouette (a)                 0.687   0.555   0.496   0.371   0.334   0.347
Davies-Bouldin (b)             0.766   1.962   3.155   4.612   5.983   6.900
Xie-Beni (b)                   0.065   0.160   0.278   0.558   0.475   0.447
Dunn (a)                       0.339   0.133   0.137   0.062   0.074   0.083
Calinski-Harabasz (CH) (a)     502.8   556.1   526.0   454.8   429.4   438.4

(a) The higher the value, the better. (b) The lower the value, the better.
Thus, the k = 3 result selected by the Calinski-Harabasz (CH) index is actually more meaningful and representative of the underlying complex iris structure in this context. This illustration exposes the potential flaw in using a CVI alone to determine the optimal clustering scheme and/or making a final decision based on majority voting over a combination of CVIs. The issue with the traditional approach of simply applying one or more internal CVIs and deciding based only on a numeric value is a lack of understanding of the underlying structure of the real labeled dataset being used to assess the algorithm, and of which set of CVI(s) would be most appropriate to follow. In this case, CH was rightly aligned with the optimal scheme; however, that is not always the case, as is illustrated further below. As previously mentioned, in clustering applications for biomedical data analysis, discovering the optimal partitions in the data is as important, if not more so, than the optimal number of subgroups. This is because the subgroups (clusters) identified are subsequently analyzed further to determine the discriminant features and potential biomarkers. In this setting, internal CVIs are also used to compare the best results obtained across multiple algorithms to determine the optimal solution. For the iris dataset, consider the k = 3 clustering results of two other clustering algorithms (affinity propagation and K-means (Bailey, 2013)) in addition to spectral clustering. Their outcomes were evaluated using the CH index, given that it appeared to select the most reliable result in the case of spectral clustering applied to the iris data. Since the ground truth information is known for iris data, the various k = 3 results can also be evaluated using the percentage accuracy (an external validation index). Accuracy specifies the percentage of data points correctly assigned to their proper partition. It is commonly used in the context of supervised learning. Table 7.2 reveals a conflict in the optimal clustering scheme between the CH index and Accuracy. The K-means k = 3 result would be most desirable going by the internal CVI (CH). However, according to the Accuracy index, the affinity propagation clustering result is optimal, as it attains an accuracy of 0.95.
Table 7.2 Evaluation of Calinski-Harabasz index across multiple algorithms on Iris data (results at the k = 3 scheme).

Validation indices          Affinity propagation   K-means   Spectral
Calinski-Harabasz (CH)      555.03                 561.63    556.12
Accuracy                    0.95                   0.92      0.90
This example raises the key question that this chapter attempts to address: how can the reliability of internal validation measures be improved, given that clustering is supposed to be an autonomous way to discover meaningful subgroups?
7.3.1 Commonly used internal validation indices
A brief description of the commonly used internal validation metrics assessed in this work is provided here for context. The following describes 5 of the 11 widely used indices in the literature, as rated in (Bailey, 2013). The notations and definitions employed are similar to those presented in (Liu et al., 2010). Let D denote the dataset; N: number of objects in D; c: center of D; k: number of clusters; C_i: the i-th cluster; n_i: number of objects in C_i; c_i: center of C_i; d(x, y): distance between x and y. The validation indices can be defined mathematically as follows:

(1) Silhouette index (SI) (Rousseeuw, 1987). This is a composite index that measures both the compactness (using the distance between all the points in the same cluster) and the separation of the clusters (based on the nearest neighbor distance). It computes the pairwise difference of the between- and within-cluster distances. A larger average SI value indicates a better overall quality of the clustering result. Some variations have been proposed in the literature (Vendramin et al., 2010). The standard definition, given by (7.1), is:

SI = \frac{1}{k} \sum_i \left\{ \frac{1}{n_i} \sum_{x \in C_i} \frac{b(x) - a(x)}{\max[b(x), a(x)]} \right\}   (7.1)

where a(x) and b(x) are defined as:

a(x) = \frac{1}{n_i - 1} \sum_{y \in C_i,\, y \neq x} d(x, y), \qquad b(x) = \min_{j,\, j \neq i} \left[ \frac{1}{n_j} \sum_{y \in C_j} d(x, y) \right].
(2) Calinski-Harabasz index (CH) (Calinski & Harabasz, 1974). This measures between-cluster isolation and within-cluster coherence, based on the average between- and within-cluster sums of squares. The maximum value determines the optimal clustering configuration.

CH = \frac{\sum_i n_i\, d^2(c_i, c) / (k - 1)}{\sum_i \sum_{x \in C_i} d^2(x, c_i) / (N - k)}   (7.2)
(3) Dunn's index (Liu et al., 2010). This defines inter-cluster separation as the minimum pairwise distance between objects in different clusters, and intra-cluster compactness as the largest distance between a pair of objects in the same cluster. Multiple variations have been proposed in the literature. A maximum value is optimal. The standard definition was utilized:

DI = \min_i \left\{ \min_j \left( \frac{\min_{x \in C_i,\, y \in C_j} d(x, y)}{\max_k \left( \max_{x, y \in C_k} d(x, y) \right)} \right) \right\}   (7.3)
(4) Xie-Beni index (XB) (Liu et al., 2010). This defines the inter-cluster separation as the minimum square distance between cluster centers, and the intra-cluster compactness as the mean square distance between each data object and its cluster center. A minimum value is an optimal result.

XB = \left[ \sum_i \sum_{x \in C_i} d^2(x, c_i) \right] \Big/ \left[ N \cdot \min_{i,\, j \neq i} d^2(c_i, c_j) \right]   (7.4)
(5) Davies-Bouldin index (DB) (Bolshakova & Azuaje, 2003). This measures the average value of the similarity between each cluster and its most similar cluster. The index computes the dispersion of a cluster and a dissimilarity measure between pairs of clusters. A lower DB index implies a better cluster configuration.

DB = \frac{1}{k} \sum_i \max_{j,\, j \neq i} \left\{ \left[ \frac{1}{n_i} \sum_{x \in C_i} d(x, c_i) + \frac{1}{n_j} \sum_{x \in C_j} d(x, c_j) \right] \Big/ d(c_i, c_j) \right\}   (7.5)
7.3.2 External validation indices
As mentioned in Section 7.2, the external validation indices (Jain, 2010) are used for evaluating the quality of the clustering results as ranked by the internal CVIs. A brief description of the three external CVIs is presented as follows. Note that all three indices range between 0 and 1; a higher value denotes a better result.

(1) Clustering Accuracy: This metric is also known as classification accuracy or the inverse of clustering error. Since it is an external metric, it is assumed that the ground truth label of each point is known. Clustering accuracy can be defined as the percentage of correctly assigned data points in each partition over the entire sample size N. According to (Brun et al., 2007), the error of a clustering algorithm is the expected difference between its labels and the labels generated by the labeled point process (in this context, the known ground truth). Using this approach, clustering accuracy can be formally defined as follows. Let D^{A_i} denote the labeling of a dataset D given by a clustering algorithm A_i, while D^P is the ground truth labeling. Let L^{A_i}(D, x) and L^P(D, x) denote the label of x \in D for D^{A_i} and D^P, respectively. The label accuracy (7.6) between the clustering label and the ground truth label is the proportion of points that have the same label:

\varepsilon(D^P, D^{A_i}) = \frac{\left| \{ x : L^P(D, x) = L^{A_i}(D, x) \} \right|}{N}.   (7.6)

Since the agreement or disagreement between two partitions is independent of the indices used to label their clusters, the partition accuracy (7.7) is defined as:

\varepsilon^*(D^P, D^{A_i}) = \max_{\pi} \varepsilon(D^P, \pi D^{A_i})   (7.7)

where the maximum is taken over all possible permutations \pi of the k clusters in D^{A_i}. The clustering accuracy is the inverse of the partition error defined in (Brun et al., 2007). A greedy approach is utilized in this work to maximize the label accuracy over all the partitions.
(2) Adjusted Rand Index (ARI): This is an extension of the Rand index to account for a correction of the results due to chance. The Rand index measures the agreement between the true clustering and the predicted clustering; ARI normalizes it so that the expected value is 0 when the clusters are selected by chance and 1 when a perfect match is achieved. Let a denote the number of pairs belonging to the same class in the true clustering (PT) and to the same cluster in the predicted clustering (PP); b: the number of pairs belonging to the same class in PT and to different clusters in PP; c: the number of pairs belonging to different classes in PT and to the same cluster in PP; d: the number of pairs belonging to different classes in both PT and PP. Then, ARI is given by (7.8).
ARI = \frac{a - \frac{(a+c)(a+b)}{M}}{\frac{(a+c) + (a+b)}{2} - \frac{(a+c)(a+b)}{M}},   (7.8)

where M = a + b + c + d = N(N - 1)/2. Analogous adjustments can be made for other clustering indices, but due to its simplicity, the Adjusted Rand Index is the most popular among them as of this writing.

(3) Jaccard Index: This was introduced as an improvement over the original Rand index. It eliminates the term d to better distinguish between good and bad partitions. It is defined by:
\mathrm{Jaccard\ Index} = \frac{a}{a + b + c}.   (7.9)
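The sketch below illustrates all three external CVIs, assuming ground truth labels are available. ARI is taken directly from scikit-learn; the accuracy computation uses the optimal (Hungarian) matching over cluster permutations rather than the greedy matching used in the chapter, and the Jaccard index is derived from scikit-learn's pair-counting matrix. Function names are illustrative.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import pair_confusion_matrix

def clustering_accuracy(y_true, y_pred):
    # Contingency table: rows = predicted clusters, columns = true classes
    true_ids, pred_ids = np.unique(y_true), np.unique(y_pred)
    counts = np.array([[np.sum((y_pred == p) & (y_true == t)) for t in true_ids]
                       for p in pred_ids])
    # Eq. (7.7): maximize the number of matched points over permutations
    rows, cols = linear_sum_assignment(-counts)
    return counts[rows, cols].sum() / len(y_true)

def jaccard_index(y_true, y_pred):
    # pair_confusion_matrix counts ordered pairs, hence the division by 2:
    # a = same class & same cluster, b = same class & different clusters,
    # c = different classes & same cluster (Eq. 7.9)
    (_, c2), (b2, a2) = pair_confusion_matrix(y_true, y_pred)
    a, b, c = a2 / 2, b2 / 2, c2 / 2
    return a / (a + b + c)

# ARI (Eq. 7.8) is available directly:
# ari = adjusted_rand_score(y_true, y_pred)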
7.3.3 Statistical methods
An overview of the statistical methods employed in the assessment is presented here. Section 7.4 demonstrates how they are utilized to provide a meaningful evaluation.

Spearman's rank correlation coefficient (Conover, 1999). This is a measure of association between two variables. It is calculated by ranking the data within two different variables and computing the Pearson correlation coefficient on the ranks for the two variables. Spearman's correlation ranges in value from -1 to 1, with values near 1 indicating similarity in ranks for the two variables and values near -1 indicating that the ranks are dissimilar for the two variables. Spearman's correlation will be used to assess agreement in ranks between internal and external validation indices. Since it is based on ranks, it is less sensitive to outliers than the Pearson correlation coefficient and can also measure the strength of any monotonic relationship, whereas Pearson is utilized only for linear relationships. This robustness to outliers, ability to capture the strength of more general types of relationships, and natural interpretation as a rank correlation make it a more desirable metric than the Pearson correlation for evaluating many clustering validation indices.

Three-Factor Analysis of Variance (ANOVA). ANOVA modeling is employed to test for significant differences in average Spearman correlation values among different datasets, algorithms and internal CVIs. For overall significant effects, pairwise comparisons are made to identify significant differences between groups within each of the factors, using Tukey's method to control the type I error for the multiple comparisons made.
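As a small illustration of the first of these tools, the snippet below compares the ranking an internal CVI induces over a set of clustering configurations against the ranking given by an external CVI; the score values are hypothetical.

from scipy.stats import spearmanr

# Hypothetical scores for m = 9 configurations (k = 2, ..., 10) of one
# algorithm on one dataset: an internal CVI oriented so higher is better,
# and an external CVI (e.g., Accuracy) computed from ground truth labels.
internal_cvi = [0.69, 0.56, 0.50, 0.37, 0.33, 0.35, 0.31, 0.30, 0.28]
external_cvi = [0.67, 0.89, 0.80, 0.71, 0.66, 0.64, 0.60, 0.58, 0.55]

rho, p_value = spearmanr(internal_cvi, external_cvi)
print(f"Spearman correlation: {rho:.3f} (p = {p_value:.3f})")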
7.4 Evaluation framework

The statistics-based evaluation framework applied for assessing the performance of internal CVIs is described as follows:

1. Perform clustering on t real datasets (D = {D_1, D_2, ..., D_t}) using a varied set of r clustering algorithms (A = {A_1, A_2, ..., A_r}) with varying parameters, to generate a set of m clustering configurations for each dataset D_i per algorithm A_j.
2. For each dataset D_i, compute the values of the p internal and q external CVIs on the r × m set of clustering results.
3. Obtain the Spearman correlation value for each set of m clustering configurations per dataset D_i per algorithm A_j by comparing each internal CVI's results with the corresponding external CVI's results (per index). This yields a q × p set of correlation values per algorithm A_j per dataset D_i.
4. Conduct the 3-factor ANOVA to test for significant differences in the average Spearman correlation between different internal CVIs, algorithms and datasets. Follow up significant effects with Tukey's multiple comparison procedure to determine which groups are significantly different. Additionally, test the 2-way interactions to determine whether the effect of one factor depends on another factor. Follow up significant interactions with an interaction plot and an overall summary of Tukey pairwise comparisons.

For the results presented in this chapter, r = 6 algorithms were applied to cluster t = 14 datasets drawn from the UCI machine learning repository (Dua & Graff, 2017) and Kaggle datasets (Kaggle Inc, 2016), as described in Table 7.3. The datasets are listed in increasing order of assumed complexity, based first on the number of clusters and then on the number of features.
Table 7.3 Overview of real biological datasets.

Dataset                                 Tag    Clusters (k)   Features (Dim)   Sample size (N)
Haberman                                D-01   2              3                306
Vertebral_2                             D-02   2              6                310
Pima Indians diabetes*                  D-03   2              8                768
Indian liver patient database (ILPD)    D-04   2              10               597
Physical spine data*                    D-05   2              12               310
Parkinson disease                       D-06   2              22               195
Breast cancer (Wisconsin)               D-07   2              30               569
Iris                                    D-08   3              4                150
Vertebral_3                             D-09   3              6                310
Seeds                                   D-10   3              7                210
Wine                                    D-11   3              13               178
Breast tissue                           D-12   6              9                106
Ecoli                                   D-13   8              7                336
Yeast                                   D-14   10             8                1484
The number of clusters, k, was selected as the parameter to vary. By varying k from 2 to 10, a set of m = 9 clustering configurations was generated per algorithm per dataset: thus, a total of 756 results. For the internal CVIs where a maximum value indicates an optimal clustering result (SI, CH, Dunn's), positive values of Spearman's correlation are expected if the internal and external CVIs are in agreement, and the closer that value is to 1, the stronger the agreement. However, for internal indices where a minimum value implies optimal clustering (XB, DB), values closer to -1 would indicate stronger agreement. To ensure the Spearman's correlation values were comparable across all internal indices, the XB and DB index values were negated so that the maximum is best. To assess whether there were statistically significant differences in the average Spearman correlations based on the different factors explored (datasets, algorithms and internal validation indices), a 3-factor ANOVA model (7.10) with all of the main effects and second-order interactions was conducted. The three-way interaction was not considered to be of direct interest for testing and was assumed to be insignificant. The model is given by:

Y_{ijk} = \mu + I_i + A_j + D_k + (IA)_{ij} + (ID)_{ik} + (AD)_{jk} + \varepsilon_{ijk},   (7.10)

where Y_{ijk} is the Spearman correlation between one of the external validation indices and internal validation index i, for algorithm j in dataset k. Let \mu denote the overall average Spearman's correlation; I_i is the effect for the internal CVI (i = 1, ..., 5), A_j is the effect for the algorithm (j = 1, ..., 6), and D_k is the effect for the dataset (k = 1, ..., 14), while (IA)_{ij}, (ID)_{ik}, (AD)_{jk} are the interaction terms. The \varepsilon_{ijk} \sim N(0, \sigma^2) are independent and identically distributed error terms. F-tests are conducted for the overall mean differences between indices, algorithms and datasets. There is also testing to determine whether the effect of one factor depends on another factor (interactions). Pairwise comparisons are made to identify overall significant differences between groups within each of the factors, using Tukey's method to control the type I error for the multiple comparisons made. The effectiveness of applying majority voting across the values of the internal CVIs for each dataset, across all the clustering configurations obtained for the six algorithms, was also assessed.
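A sketch of how model (7.10) can be fit in Python is shown below, assuming the correlations have been collected into a table with one row per (internal CVI, algorithm, dataset) combination; the file and column names are hypothetical, and the chapter's own analysis may have used different software.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical table with columns: cvi, algorithm, dataset, spearman
df = pd.read_csv("spearman_correlations.csv")

# Model (7.10): main effects plus all second-order interactions;
# the three-way interaction is omitted, as in the chapter.
model = ols(
    "spearman ~ C(cvi) + C(algorithm) + C(dataset)"
    " + C(cvi):C(algorithm) + C(cvi):C(dataset) + C(algorithm):C(dataset)",
    data=df,
).fit()
print(sm.stats.anova_lm(model, typ=2))

# Tukey's pairwise comparisons for the internal-CVI main effect
print(pairwise_tukeyhsd(df["spearman"], df["cvi"], alpha=0.05))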
7.5 Experimental results and analysis

This section presents the results of the evaluation of the 5 internal CVIs on the 14 real datasets. Six different algorithms (Jain, 2010; Xu & Wunsch, 2009) (affinity propagation (AF), K-means, spectral clustering, and agglomerative hierarchical clustering (HC) using 3 different linkage methods (complete, average and ward)) were applied to the real datasets to conduct a detailed and robust comparison of the varied internal CVIs. The K-means and hierarchical algorithms were implemented using the Python scikit-learn library (Pedregosa et al., 2012), while spectral clustering was carried out in Matlab using the Shi & Malik algorithm (Von Luxburg, 2007). The AF algorithm was implemented using the Matlab CVAP toolbox (Wang, Wang, & Peng, 2009).
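For illustration, a scikit-learn-only approximation of this setup is sketched below. The chapter ran spectral clustering and AF in Matlab, so results will not match exactly; affinity propagation does not take k directly and is run here with default parameters. The helper name is ours.

from sklearn.cluster import (KMeans, SpectralClustering,
                             AgglomerativeClustering, AffinityPropagation)

def clustering_configurations(X, k_range=range(2, 11)):
    """Generate the set of candidate partitions to be scored by the CVIs."""
    results = {}
    for k in k_range:
        results[("kmeans", k)] = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        results[("spectral", k)] = SpectralClustering(n_clusters=k).fit_predict(X)
        for linkage in ("ward", "complete", "average"):
            results[(f"h-{linkage}", k)] = AgglomerativeClustering(
                n_clusters=k, linkage=linkage).fit_predict(X)
    # AF selects the number of clusters itself via its preference parameter
    results[("affinity", None)] = AffinityPropagation().fit_predict(X)
    return results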
Fig. 7.2 provides the average Spearman's correlation values by algorithm for each of the internal and external CVIs. Notice that CH provides positive correlations on average with Accuracy and ARI, but results are varied for Jaccard. DB, SI and XB perform reasonably well with Accuracy and ARI in all but two of the hierarchical methods (Average and Complete). Dunn's provides the worst performance, with many negative correlations, especially with Accuracy and ARI. Note that while Jaccard has been used in previous studies, it is not as effective in capturing the variation in results that appears to exist in the internal CVIs as illustrated by Accuracy and ARI. Hence, the Jaccard index was excluded from subsequent results analyses. Fig. 7.3 illustrates the average Spearman's correlation values by dataset for each of the internal CVIs with Accuracy and ARI. Both Accuracy and ARI indicate that there is substantial variation in the average correlations by internal CVI and dataset. Dunn's performs the worst again, with varying results for the other internal CVIs. All indices have low average correlations on dataset D14, which has the highest perceived degree of complexity among all the datasets explored, as indicated by its large number of clusters and sample size (Table 7.3). The 3-factor ANOVA using Spearman's correlation between internal CVIs and Accuracy was fit to further explore the differences in means for the factors of interest (internal CVIs, algorithms, datasets and their second-order interactions). All factor effects are significant at the α = 0.05 significance level with a p-value < 0.0001, except the Internal CVI*Algorithm interaction (p = 0.3317). This indicates there are differences in the average Spearman's correlation between the internal CVIs and Accuracy by algorithm, dataset and internal CVI. All further comparison p-values are calculated using Tukey's method for pairwise comparisons. Additionally, the effect of internal CVIs depends on the dataset used, but not the algorithm used. Although the Algorithm*Dataset interaction was included in the model for completeness, and it does explain a significant portion of the variation in Spearman's correlation, it is not of direct interest for testing and will not be discussed further. Fig. 7.4A shows the main effect plot for internal CVI. When averaged over dataset and algorithm, DB has the highest average Spearman correlation (r̄_s) with Accuracy (r̄_s = 0.463), followed by SI (r̄_s = 0.379) and XB (r̄_s = 0.335), although there is no significant difference between any of the three (p > 0.05). CH has the second to lowest average (r̄_s = 0.238), and it is significantly different from DB (p = 0.0002), but not SI or XB (p > 0.05). Dunn's has the lowest average (r̄_s = -0.169), and it is significantly different from all the other internal CVIs (p < 0.0001 for all comparisons). It is the only CVI with a negative average. It can be observed from the main effect plot for the algorithm (Fig. 7.4B) that, when averaged over the dataset and internal CVI, the Spectral algorithm has the highest average Spearman correlation (r̄_s = 0.442), followed by AF (r̄_s = 0.415) and K-means (r̄_s = 0.336). There is no significant difference between these algorithms (p > 0.05). Among the hierarchical methods, Ward's had the next highest average (r̄_s = 0.224), followed by complete linkage (r̄_s = 0.120), with no significant difference between them (p > 0.05).
FIG. 7.2 Results by Algorithm. Bar plots of the average Spearman's correlation between each of the 5 internal CVIs (CH, DB, Dunn's, SI, XB) and 3 external CVIs (Accuracy, Adjusted Rand Index, Jaccard) for each of the 6 algorithms (AF, H-Average, H-Complete, H-Ward, K-means, Spectral). Averages were taken over the 14 datasets tested. Error bars represent one standard error above or below the mean. The Jaccard index conveyed the least information about the variations of the indices; thus, those results were excluded from the remaining analyses.
FIG. 7.3 Results by Dataset. Bar plots of the average Spearman's correlation between each of the 5 internal CVIs (CH, DB, Dunn's, SI, XB) and 2 external CVIs (Accuracy, Adjusted Rand Index) for each of the 14 datasets (see Table 7.3 for tag descriptions). Averages were taken over the 6 algorithms tested. Error bars represent one standard error above or below the mean. As can be observed, all the indices performed poorly at the tail end of difficulty of the datasets (to the right).
FIG. 7.4 Main Effect Plots. Plot of means for the main effects (A) Internal CVI, (B) Algorithm, (C) Dataset of the 3-factor ANOVA. The average Spearman's correlation along with standard error bars for each level within each factor is given.
The Average method had the lowest average Spearman correlation (r̄_s = -0.043). It is significantly different (p < 0.0001) from all of the other methods except for complete linkage (p = 0.0553). For the dataset main effect (Fig. 7.4C), averaging is conducted over internal CVIs and algorithms. The dataset with the highest average Spearman correlation is D02 (r̄_s = 0.572). Several other datasets have a similar average Spearman correlation and were not statistically different from D02. The most complex dataset (D14) had a significantly lower average (r̄_s = -0.406) than all of the others (p < 0.05 for all comparisons). The interaction plot for the Internal CVI*Algorithm is given in Fig. 7.5A. Since this interaction is not significant, no pairwise comparisons were conducted. It is evident from the plot that the lines are roughly parallel (with some small but insignificant exceptions). Additionally, Dunn's has the lowest average across all algorithms, and the other four internal CVIs are all similar and not statistically different for each algorithm. Fig. 7.5B illustrates the interaction plot for the Internal CVI*Dataset. Note that the lines intersect in many places and are not parallel, which is indicative of a significant interaction. This implies that comparisons between the internal CVIs depend on the dataset. Although pairwise comparisons were performed, there are too many to discuss efficiently, so this work focuses on a few noteworthy observations.
FIG. 7.5 Interaction Plots. (A) Plot of means for the Internal CVI*Algorithm and (B) Internal CVI*Dataset interactions of the 3-factor ANOVA. The average Spearman correlation for each Algorithm and Dataset is given, with separate lines for the 5 internal CVIs (CH, DB, Dunn's, SI, XB).
Dunn's has the lowest average Spearman correlation across most datasets, but there are some exceptions, most notably in datasets with more clusters (D12, D13, D14). For most of the datasets, the other 4 internal CVIs are similar and not significantly different, with a few exceptions. In D04 and D13, there is no significant difference among any of the internal CVIs. Datasets D06 and D07 exhibit a different pattern than the others. It is interesting that these two datasets have the largest numbers of features. For the datasets with the largest numbers of features (D06, D07, D11) and the largest numbers of clusters (D12, D13, D14), the internal CVIs vary more in the ordering of the average Spearman correlation. An evaluation of the majority voting across all the indices is provided in Table 7.4. Usually three of the five indices (excluding CH and Dunn's) agreed on the same configuration. The voting fell short of the optimal result across the 6 algorithms, as indicated by the highest accuracy value, in all cases except for D-14 (yeast), the most complex dataset with 10 clusters and 8 features. A visualization of the ground truth and the clustering configuration selected as optimal is given in Fig. 7.6. As can be observed from the visualization, the H-complete 2-cluster result (Fig. 7.6A) is clearly a suboptimal scheme when compared to the ground truth clusters, but this is not detected by the CVIs. A more viable path for designing a robust cluster validation paradigm should include visualization of the resulting clusters for visual validation by the human user. See Chapters 8 and 9 for further discussion on visualization methods and tools.
Table 7.4 Assessment of majority voting scheme per dataset.

Data   Algorithm(s) chosen by majority of internal CVIs   Choice vs actual k   Votes   Accuracy @ choice   Highest accuracy across all algorithms
D-01   H-average                                          2/2                  3/5     0.843               0.935
D-02   H-complete/H-average                               2/2                  3/5     0.887               0.968
D-03   H-complete/H-average                               2/2                  3/5     0.846               0.940
D-04   H-complete                                         2/2                  3/5     0.801               0.955
D-05   Affinity                                           2/2                  4/5     0.952               0.977
D-06   H-ward                                             2/2                  3/5     0.810               0.892
D-07   H-complete/H-average                               2/2                  2/5     0.822               0.902
D-08   Spectral/H-ward/H-average                          2/3                  3/5     0.900               0.953
D-09   H-complete/H-average                               2/3                  3/5     0.877               0.965
D-10   Affinity                                           2/3                  2/5     0.943               0.957
D-11   H-ward/H-average                                   2/3                  2/5     0.669               0.994
D-12   Affinity/k-means/all HC methods                    2/6                  4/5     0.689               0.981
D-13   No consensus                                       -/8                  n/a     n/a                 0.929
D-14   H-complete/H-average                               2/10                 3/5     0.877               0.877
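The majority voting scheme assessed in Table 7.4 can be sketched as follows, assuming each internal CVI's scores over the candidate configurations have been oriented so that higher is better (negate DB and XB first); the helper name is ours.

import numpy as np
from collections import Counter

def majority_vote(cvi_scores):
    """Each internal CVI votes for its top-scoring configuration; the
    configuration with the most votes wins. Returns (winner, votes), or
    (None, votes) when the top vote count is tied (no consensus)."""
    picks = [int(np.argmax(scores)) for scores in cvi_scores.values()]
    (winner, votes), *rest = Counter(picks).most_common()
    if rest and rest[0][1] == votes:
        return None, votes   # no consensus, as for D-13 in Table 7.4
    return winner, votes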
7.6 Ensemble validation paradigm

In varied cluster analysis applications, one metric is usually selected for identifying the optimal set of partitions. Sometimes more than one metric is used, and the results are combined by simple majority voting (i.e., the clustering output selected is the most popular among the metrics). Applying basic majority voting across multiple internal CVIs assumes that these metrics can be evaluated in the same domain value space. In reality, each metric views the task of determining the optimal clustering configuration from a different perspective (see Section 7.3.1). To effectively combine the strengths of each metric, an ensemble cluster validation paradigm, as discussed in (Nguyen, Nowell, Bodner, & Obafemi-Ajayi, 2018), is proposed. The objective of the ensemble cluster validation method is to leverage the strengths of the diverse metrics by utilizing aggregated ranks to determine the optimal clustering for a given dataset. The ensemble method selects the top result with the highest aggregated rank for further domain-specific analysis. The framework can be applied to agglomerate any number m of internal CVIs deemed useful for the specific cluster analysis. By applying one or multiple clustering algorithms to a given dataset and varying the set of parameters, a set of n possible clustering configuration outputs (D_i, i = 1, ..., n) is obtained. Each CVI ranks each obtained clustering output D_i based on its performance criteria. For each CVI, the top r most optimal clustering outputs from the set of D_i are selected and assigned an adjusted score based on the rank. For a given CVI, C_j, its best (i.e., highest ranked) output is assigned a score of r.
FIG. 7.6 Visualization of yeast data using PCA to illustrate the poor clustering configuration obtained from the highly ranked method. (A) Clustering result of the H-complete 2-cluster configuration of yeast data. (B) Actual 10-cluster configuration of yeast data.
Likewise, the second highest ranked output is assigned a score of r - 1, and the third a score of r - 2. The last of the r best performing outputs, i.e. the r-th best, is assigned a score of 1. Note that any D_i that is not part of the top r is assigned a score of 0. The final weighted score W_i of each clustering output D_i is the sum of the scores from each CVI. The final rank of the weighted ensemble validation score is subsequently applied to determine the optimal scheme, with the maximum value being the most optimal. For further discussion of this ensemble validation method, including its application in a biomedical data analysis of phenotype data, see (Nguyen et al., 2018).
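A minimal sketch of this weighted rank aggregation, with illustrative names, is given below. As in the majority voting sketch above, each CVI's scores are assumed to be oriented so that higher is better.

import numpy as np

def ensemble_validation_score(cvi_scores, r=5):
    """Weighted rank aggregation over internal CVIs (Section 7.6).

    cvi_scores maps each CVI name to an array of n scores, one per
    clustering output D_i, oriented so that higher is better (negate
    DB and XB first). Returns the index of the winning output and the
    weighted scores W_i.
    """
    n = len(next(iter(cvi_scores.values())))
    weights = np.zeros(n)
    for scores in cvi_scores.values():
        top_r = np.argsort(scores)[::-1][:r]  # the r best outputs, best first
        for rank, idx in enumerate(top_r):
            weights[idx] += r - rank          # best gets r, r-th best gets 1
        # outputs outside the top r contribute a score of 0
    return int(np.argmax(weights)), weights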
7.7 Summary

This chapter presents the preliminary results of a detailed evaluation of 5 commonly used internal CVIs by assessing their performance on real biological datasets with available ground truth, using rigorous statistical methods. An evaluation framework is presented for evaluating the quality of the partitions selected by the CVIs, not just the optimal number of clusters. Studying how well CVIs determine the underlying structure of known benchmark datasets currently used to evaluate clustering methods will advance knowledge of the performance of CVIs and aid in understanding their selection and use. As demonstrated in this work, no single internal CVI works best across real datasets, and a majority voting approach was also demonstrated to be not entirely effective. Thus, an open area of research is to identify effective metrics to assess clustering results, including investigating how different metrics deal with clusters that are outliers.
References

Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J. M., & Perona, I. (2013). An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1), 243-256. https://doi.org/10.1016/j.patcog.2012.07.021
Bailey, J. (2013). Alternative clustering analysis: A review. In C. C. Aggarwal, & C. K. Reddy (Eds.), Data clustering: Algorithms and applications (1st ed., pp. 533-548). Taylor & Francis.
Bolshakova, N., & Azuaje, F. (2003). Cluster validation techniques for genome expression data. Signal Processing, 83(4), 825-833. https://doi.org/10.1016/S0165-1684(02)00475-9
Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., et al. (2007). Model-based evaluation of clustering validation measures. Pattern Recognition, 40(3), 807-824. https://doi.org/10.1016/j.patcog.2006.06.026
Calinski, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3(1), 1-27. https://doi.org/10.1080/03610927408827101
Clifford, H., Wessely, F., Pendurthi, S., & Emes, R. D. (2011). Comparison of clustering methods for investigation of genome-wide methylation array data. Frontiers in Genetics, 2, 88. https://doi.org/10.3389/fgene.2011.00088
Conover, W. J. (1999). Practical nonparametric statistics (3rd ed.). New York, NY: Wiley.
Dua, D., & Graff, C. (2017). Machine learning repository. Retrieved from University of California, School of Information and Computer Science website: http://archive.ics.uci.edu/ml
Dubes, R. C. (1987). How many clusters are best? - An experiment. Pattern Recognition, 20(6), 645-663. https://doi.org/10.1016/0031-3203(87)90034-3
Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(2-3), 107-145. https://doi.org/10.1023/A:1012801612483
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651-666. https://doi.org/10.1016/j.patrec.2009.09.011
Kaggle Inc. (2016). Kaggle - your home for data science. Retrieved March 8, 2018, from https://www.kaggle.com/
Kovács, F., Legány, C., & Babos, A. (2005). Cluster validity measurement techniques. Proceedings of the 6th International Symposium of Hungarian Researchers on Computational Intelligence, 2006, 1-11. https://doi.org/10.7547/87507315-91-9-465
Liu, Y., Li, Z., Xiong, H., Gao, X., & Wu, J. (2010). Understanding of internal clustering validation measures. In IEEE International Conference on Data Mining (pp. 911-916). https://doi.org/10.1109/ICDM.2010.35
Liu, Y., Li, Z., Xiong, H., Gao, X., Wu, J., & Wu, S. (2013). Understanding and enhancement of internal clustering validation measures. IEEE Transactions on Cybernetics, 43(3), 982-994. https://doi.org/10.1109/TSMCB.2012.2220543
Maulik, U., & Bandyopadhyay, S. (2002). Performance evaluation of some clustering algorithms and validity indices. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(12), 1650-1654. https://doi.org/10.1109/TPAMI.2002.1114856
Milligan, G. W., & Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a dataset. Psychometrika, 50, 159-179. https://doi.org/10.1007/BF02294245
Nguyen, T., Nowell, K., Bodner, K. E., & Obafemi-Ajayi, T. (2018). Ensemble validation paradigm for intelligent data analysis in autism spectrum disorders. In 2018 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB) (pp. 1-8). https://doi.org/10.1109/CIBCB.2018.8404960
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2012). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(C), 53-65. https://doi.org/10.1016/0377-0427(87)90125-7
Saria, S., & Goldenberg, A. (2015). Subtyping: What it is and its role in precision medicine. IEEE Intelligent Systems, 30(4), 70-75. https://doi.org/10.1109/MIS.2015.60
Tibshirani, R., & Walther, G. (2005). Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3), 511-528. https://doi.org/10.1198/106186005X59243
Vendramin, L., Campello, R. J. G. B., & Hruschka, E. R. (2010). Relative clustering validity criteria: A comparative overview. Statistical Analysis and Data Mining, 3(4), 209-235. https://doi.org/10.1002/sam.10080
Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395-416. https://doi.org/10.1007/s11222-007-9033-z
Wang, K., Wang, B., & Peng, L. (2009). CVAP: Validation for cluster analyses. Data Science Journal, 8, 88-93. https://doi.org/10.2481/dsj.007-020
Xu, R., & Wunsch, D. C., II. (2009). Clustering. IEEE Press/Wiley. https://doi.org/10.1002/9780470382776