Fuzzy Sets and Systems 193 (2012) 1 – 32 www.elsevier.com/locate/fss
Positional and confidence voting-based consensus functions for fuzzy cluster ensembles

Xavier Sevillano∗, Francesc Alías, Joan Claudi Socoró
GTM – Grup de Recerca en Tecnologies Mèdia, La Salle, Universitat Ramon Llull, Quatre Camins, 2, 08022 Barcelona, Spain

Received 17 June 2010; received in revised form 9 September 2011; accepted 17 September 2011; available online 22 September 2011
Abstract

Consensus clustering, i.e. the task of combining the outcomes of several clustering systems into a single partition, has lately attracted the attention of researchers in the unsupervised classification field, as it allows the creation of clustering committees that can be applied for multiple interesting purposes, such as knowledge reuse or distributed clustering. However, little attention has been paid to the development of algorithms, known as consensus functions, especially designed for consolidating the outcomes of multiple fuzzy (or soft) clustering systems into a single fuzzy partition—despite the fact that fuzzy clustering is far more informative than its crisp counterpart, as it provides information regarding the degree of association between objects and clusters that can be helpful for deriving richer descriptive data models. For this reason, this paper presents a set of fuzzy consensus functions capable of creating soft consensus partitions by fusing a collection of fuzzy clusterings. Our proposals base clustering combination on a cluster disambiguation process followed by the application of positional and confidence voting techniques. The modular design of these algorithms makes it possible to sequence their constituting steps in different manners, which allows us to derive computationally optimized versions of the proposed consensus functions. The proposed consensus functions have been evaluated in terms of the quality of the consensus partitions they deliver and of their running time on multiple benchmark data sets. A comparison against several representative state-of-the-art consensus functions reveals that our proposals constitute an appealing alternative for conducting fuzzy consensus clustering, as they are capable of yielding high-quality consensus partitions at a low computational cost.
© 2011 Elsevier B.V. All rights reserved.
Keywords: Information sciences; Fuzzy clustering; Group decision-making; Pattern recognition
1. Introduction

The development of information and communication technologies entails two intrinsically contradictory consequences: while facilitating knowledge acquisition by making information easier to share and access, it has also fostered the generation of ever growing amounts of digital information, sometimes making it almost impossible to separate the wheat from the chaff and thus giving rise to the so-called data overload effect—a problem that started to attract the attention of researchers more than a decade ago [1]. This fact has motivated interest in the development of automatic tools that allow knowledge extraction from large data repositories, regardless of their domain [2].

∗ Corresponding author. Tel.: +34 93 290 24 52; fax: +34 93 290 24 70.
E-mail addresses: [email protected] (X. Sevillano), [email protected] (F. Alías), [email protected] (J. Claudi Socoró).
0165-0114/$ - see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.fss.2011.09.007
The techniques supporting these tools belong to the fields of knowledge discovery and data mining, which are largely based on machine learning and pattern recognition techniques [3]. When it comes to extracting knowledge from a given data collection, one of the primary tasks one thinks of is organization: clearly, arranging the contents of a data repository according to some meaningful structure helps to gain some perspective on it—in fact, organizing information is one of the most innate activities involved in human learning [4]. In general terms, the structures according to which objects are classified are known as taxonomies, and, although their shape can vary widely (e.g. from parent–child hierarchical trees to network schemes or simple group structures), they share the common goal of facilitating knowledge extraction by imposing a structure on an unstructured world of information.

When dealing with digital data, the manual creation of a taxonomy can become a very challenging and burdensome task, as it requires previous domain knowledge or labeled data (which is not always available) and/or careful inspection of the whole data collection before designing the most suitable taxonomic structure. For this reason, it would be very useful to develop systems capable of organizing data in a fully automatic manner, so that neither expert supervision nor domain knowledge is required. If this goal were accomplished, the role of expert taxonomists would be minimized—good news given the dramatic pace at which digital data is generated.

Regardless of the taxonomic scheme's layout, data organization criteria are typically based on analyzing the similarities between objects and grouping them according to their degree of similarity—i.e. the goal is to place dissimilar instances in separate and distant groups (or clusters), while placing similar objects in the same group (or in different but closely located clusters).
This task, known as unsupervised classification or clustering, is an important process underlying many automated knowledge discovery systems [1–3]. Among the myriad of approaches proposed to solve the unsupervised classification problem, a separating line can be drawn between hard (or crisp) and soft (or fuzzy) clustering techniques. The main difference between them lies in the fact that the former partition the data collection subject to clustering into a number of non-overlapping groups, while the latter associate each object in the collection to each cluster to a certain degree. By doing so, fuzzy clustering techniques are able to model the ambiguity regarding whether one object belongs to one or more classes [6]. Indeed, when an expert tries to manually derive a taxonomy that describes a data collection, membership ambiguity usually constitutes a beneficial factor, as it makes it possible to derive richer and more complex descriptive data models, rarely affecting human analytic capabilities negatively [5].

For instance, imagine that a taxonomy had to be derived from a collection of news articles. An article entitled 'Obama will order BP to hand over control of oil spill damage claims' could be assigned to a 'politics' cluster, to an 'economy' cluster, or to an 'environment' cluster. But common sense would probably recommend indicating that this article belongs to all three clusters to a certain extent. For this reason, it would be interesting to devise automatic data analysis and organization techniques capable of creating data models that allow multiple levels of class membership. Clearly, fuzzy clustering algorithms make this possible, as they estimate a distribution over the clusters that specifies the probability that an object is assigned to each cluster.
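For illustration purposes only, the following NumPy sketch (not part of the original formulation; cluster names and all membership values are invented) shows such a distribution over clusters for a handful of hypothetical news articles, together with the per-object maximum that reduces the fuzzy assignment to a crisp one:

```python
import numpy as np

# Hypothetical 3 x 4 fuzzy clustering matrix: rows = clusters
# ('politics', 'economy', 'environment'), columns = news articles.
# Each column is a distribution over the clusters (made-up values).
K = np.array([
    [0.90, 0.05, 0.05, 0.40],   # politics
    [0.05, 0.90, 0.05, 0.35],   # economy
    [0.05, 0.05, 0.90, 0.25],   # environment
])

# The last article (e.g. the oil-spill piece) is spread over all three
# clusters, which a crisp labeling cannot express: its column is
# [0.40, 0.35, 0.25] and still sums to one.
print(K[:, 3])

# Hardening: assign each object to its most strongly associated cluster.
crisp = K.argmax(axis=0) + 1    # 1-based cluster labels
print(crisp)                    # -> [1 2 3 1]
```

Note how the hardened labeling keeps only the dominant cluster of each column, discarding the graded membership information.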
Moreover, it is easy to see that crisp clustering can be regarded as a simplification of fuzzy clustering, as a hard partition is always obtainable from a soft one by simply assigning each instance to the cluster it is most strongly associated with—a simplification that, as mentioned earlier, may give rise to the loss of valuable information, as an object may naturally be associated with more than one cluster [6]. Resorting to the previous news article example once more, assigning the aforementioned article regarding the Gulf of Mexico oil spill to the 'environment' cluster alone would be equivalent to ignoring the political and economic aspects of that piece of news, which would lead to an impoverished data model of the news article collection subject to clustering.

A recent trend in the field of unsupervised classification is the combination of the outcomes of multiple clustering systems into a single consolidated partition. This task, known as consensus clustering, is formally defined as 'the problem of combining multiple partitionings of a set of objects (compiled in a cluster ensemble) into a single consolidated clustering by means of the application of a consensus function, without accessing the features or algorithms that determined the combined partitionings' [7]. Consensus clustering can be regarded as the unsupervised counterpart of supervised classifier committees [8]. However, the purely symbolic nature of the labels returned by unsupervised classifiers makes consensus clustering a more challenging task. Possibly due to this fact, consensus clustering has historically been far less popular than classifier committees, and it has only begun to draw considerable attention from researchers during the last decade [7,14].
Nevertheless, consensus clustering has multiple practical applications, such as [7]: knowledge reuse (in scenarios where access to a data collection may be restricted, it is possible to obtain a new partition of the data by means of consensus clustering if a set of legacy partitions exists, thus reconciling the knowledge contained in them), distributed clustering (obtaining a consolidated partition of data scattered across different locations by combining, via consensus clustering, partitions generated at the local level), or robust clustering (if distinct clustering systems disagree, combining their outcomes may offer additional information and discriminatory power, yielding a better combined clustering, closer to a hypothetical true classification [9]). However, despite the objective interest and practical applicability of consensus clustering, and the greater generality of soft clustering, relatively few approaches to consensus clustering are specifically oriented to combining the outcomes of multiple fuzzy unsupervised classifiers into a single soft consensus partition. In light of this, this paper puts forward a set of consensus functions for obtaining soft consensus clusterings upon fuzzy cluster ensembles. The proposed consensus functions are devised as the concatenation of a cluster disambiguation step and the application of a voting procedure [10]. In particular, multiple consensus functions are constructed by combining (i) several confidence and positional voting strategies, and (ii) distinct ways of sequencing the cluster disambiguation and voting steps. The main contributions of this work lie in the use of positional voting methods in the consensus clustering problem, and in the introduction of different cluster disambiguation and voting sequencing schemes that make it possible to optimize the resulting consensus functions from a computational perspective.
This paper is organized as follows: in Section 2, we briefly describe the consensus clustering problem, formally define soft cluster ensembles and survey the most relevant related work on fuzzy clustering consolidation techniques and voting-based consensus functions. In Section 3, the proposed fuzzy consensus functions are described in terms of their constituting elements: the cluster disambiguation and the voting processes. Moreover, we also propose sequencing these procedures in two distinct manners, which gives rise to a total of eight novel soft consensus functions. Next, Section 4 describes the design of the experiments we have conducted in order to evaluate our proposals. These experiments aim to identify which of the consensus functions proposed in this work offer the best performance, for a subsequent comparison against multiple state-of-the-art alternatives (see Section 5). Finally, the properties of our proposals are discussed in Section 6, and the conclusions of this work are presented in Section 7.

2. Soft consensus clustering

The consensus clustering problem consists of creating a single partition of a data set through the combination of the outcomes of multiple unsupervised classifiers [7], as depicted in Fig. 1. The two key elements of this process are the cluster ensemble and the consensus function. A cluster ensemble is the compilation of the partitions created by the l clustering systems subject to combination. From a notational standpoint, we mathematically represent a cluster ensemble by means of a matrix E, the size of which depends on (i) the number of objects contained in the data collection, (ii) the number of combined clustering systems and (iii) whether they are crisp or fuzzy classifiers—depending on this last factor, E will be either a hard or a soft cluster ensemble. A consensus function is an algorithm that yields a consensus partition resulting from the combination of the cluster ensemble components.
Regardless of whether the cluster ensemble E is fuzzy or not, this consensus partition can be either crisp or fuzzy. In any case, the general approach to deriving the consensus partition from the cluster ensemble is to obtain a clustering that shares the most information with the cluster ensemble components, i.e. that agrees with the combined clusterings as much as possible [7,11]—in other words, if an object is assigned to a particular cluster by many of the partitions contained in the cluster ensemble, the consensus clustering generated by the consensus function should also place that object in that cluster. In general terms, consensus clustering can be posed as an optimization problem whose goal is to minimize a cost function measuring the dissimilarity between the consensus clustering solution and the partitions in the cluster ensemble. Unfortunately, finding the partition that minimizes the symmetric difference distance metric (i.e. the so-called median partition) is an NP-hard problem [12]. This is why it is necessary to resort to distinct heuristics in order to conduct clustering combination, such as those based on (i) a hypergraph representation of the cluster ensemble [7,13], (ii) object co-association measures [12,14], (iii) categorical clustering [15,16], (iv) probabilistic approaches [17–19], (v) reinforcement learning [20], (vi) search techniques [12,21], or (vii) correlation clustering [11], to name a few.
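As a toy illustration of this optimization view (the helper names below are hypothetical and not the paper's notation), the symmetric difference distance between two crisp labelings counts the object pairs on which they disagree about co-membership, and the median partition is the labeling that minimizes the sum of this distance over the ensemble:

```python
from itertools import combinations

def sym_diff_distance(a, b):
    """Symmetric-difference distance between two labelings: the number of
    object pairs on which they disagree about co-membership. Because it only
    looks at whether two objects share a label, it is invariant to cluster
    relabelings."""
    return sum(
        (a[i] == a[j]) != (b[i] == b[j])
        for i, j in combinations(range(len(a)), 2)
    )

def ensemble_cost(candidate, ensemble):
    """Cost a consensus function would try to minimize: the total distance
    between a candidate consensus labeling and all ensemble partitions."""
    return sum(sym_diff_distance(candidate, lab) for lab in ensemble)

ensemble = [[1, 1, 1, 2, 2, 3, 3],   # three partitions of n = 7 objects
            [2, 2, 2, 1, 1, 3, 3],   # a relabeling of the first one
            [1, 1, 2, 2, 2, 3, 3]]   # disagrees on object number three
print(ensemble_cost([1, 1, 1, 2, 2, 3, 3], ensemble))  # -> 4
```

Exhaustively minimizing this cost over all candidate partitions is precisely what becomes intractable as n grows, motivating the heuristics listed above.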
Fig. 1. Block diagram of the consensus clustering process conducted on a cluster ensemble created as the compilation of the outcomes of l unsupervised classifiers.
In this work, we find a solution to the consensus clustering problem by making use of possibly the most intuitive heuristic employed in this field: voting [10,22,23,26]. Indeed, voting methods provide a natural way of combining different visions of the same event (e.g. the votes cast by multiple voters in a presidential election, or several partitions of a single data collection). Furthermore, they make it possible to obtain high-quality consensus partitions despite their conceptual simplicity (see Section 5.2). A review of previous approaches to consensus clustering based on voting strategies is presented in Section 2.2. Prior to this, however, the next section is devoted to several basic concepts related to consensus clustering that must be introduced as a preamble to the presentation of our proposals.

2.1. Soft cluster ensembles

In general terms, a cluster ensemble E is defined as the compilation of the outcomes of multiple clustering systems. Thus, before introducing the formal definition of cluster ensembles, the reader should become familiar with the way the results of a clustering process are codified. To this end, let us consider an unsupervised classifier that partitions into k clusters a data collection containing n objects. If such a classifier is a crisp one, its outcome can be represented as an n-dimensional integer-valued row vector of labels (or labeling) λ_k, the ith component of which identifies which of the k clusters the ith object in the data set is assigned to:

λ_k = [λ_1 λ_2 . . . λ_n], where λ_i ∈ {1, . . . , k}, ∀i ∈ {1, . . . , n}   (1)
Let us assume for a moment that we partition a data collection formed by n = 7 objects into k = 3 clusters. In this case, the labeling λ_k of Eq. (2) represents a partition in which the first three objects are assigned to the same cluster, the next two objects to another cluster, and the final two objects to a third cluster.

λ_k = [1 1 1 2 2 3 3]   (2)
However, it is of paramount importance to notice that the label vector presented in Eq. (2) is not unique in defining such a partition. Indeed, due to the unsupervised nature of the clustering problem, clusters are not identifiable in advance, so the labels they are assigned are merely symbolic. Therefore, the labelings λ_k = [2 2 2 1 1 3 3] and λ_k = [3 3 3 2 2 1 1] (or any of the k! = 3! = 6 possible label permutations that can be applied on the label vector of Eq. (2)) represent exactly the same partition.
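This label-permutation invariance can be checked mechanically, e.g. by renaming clusters in order of first appearance so that all k! relabelings of one partition map to a single representative (a sketch with a hypothetical helper, not part of the paper's method):

```python
def canonical(labeling):
    """Rename the symbolic cluster labels in order of first appearance, so
    that every relabeling of the same partition yields the same list."""
    mapping = {}
    out = []
    for lab in labeling:
        if lab not in mapping:
            mapping[lab] = len(mapping) + 1   # next unused label
        out.append(mapping[lab])
    return out

# The three labelings from the text describe one and the same partition.
assert canonical([1, 1, 1, 2, 2, 3, 3]) \
    == canonical([2, 2, 2, 1, 1, 3, 3]) \
    == canonical([3, 3, 3, 2, 2, 1, 1]) == [1, 1, 1, 2, 2, 3, 3]
```

Such a renaming suffices for comparing two labelings, but not for combining many of them, which is why consensus functions need the explicit disambiguation step discussed later.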
For its part, the outcome of a fuzzy unsupervised classifier can be formally represented by means of a k × n real-valued clustering matrix K – see Eq. (3) – the (i, j)th entry of which indicates the degree of association between the jth object and the ith cluster.

    ⎡ μ_11 μ_12 . . . μ_1n ⎤
K = ⎢ μ_21 μ_22 . . . μ_2n ⎥ , where μ_ij ∈ R, ∀i ∈ {1, . . . , k} and ∀j ∈ {1, . . . , n}   (3)
    ⎢  ⋮    ⋮    ⋱    ⋮   ⎥
    ⎣ μ_k1 μ_k2 . . . μ_kn ⎦

Assuming again that n = 7 objects and k = 3 clusters, the application of a hypothetical fuzzy clustering algorithm could yield the clustering matrix K presented as

    ⎡ 0.054 0.026 0.057 0.969 0.976 0.011 0.009 ⎤
K = ⎢ 0.921 0.932 0.905 0.025 0.019 0.030 0.014 ⎥   (4)
    ⎣ 0.025 0.042 0.038 0.006 0.005 0.959 0.976 ⎦

Consider momentarily that the scalar entries of matrix K represent the probability that each object belongs to each cluster. In this case, the observation of the first column of K suggests that the first object is highly likely to belong to cluster number two, as its corresponding membership probability (0.921, found on the second row) is much larger than those of clusters number one and three (0.054 and 0.025, respectively). However, recall that cluster identification is ambiguous, so any matrix resulting from one of the k! = 3! = 6 possible permutations of the rows of K would yield an equivalent fuzzy partition.

At this point, let us consider that a data set containing n objects is subject to clustering by l fuzzy unsupervised classifiers, each of which creates a partition with k_i clusters (∀i ∈ {1, . . . , l}). The compilation of their outcomes gives rise to a soft cluster ensemble E, which is mathematically expressed as a (∑_{i=1}^{l} k_i) × n matrix, as presented in [16,28]:

    ⎡ K_1 ⎤
E = ⎢ K_2 ⎥   (5)
    ⎢  ⋮  ⎥
    ⎣ K_l ⎦

where K_i is the k_i × n real-valued clustering matrix resulting from the ith soft clustering process (∀i ∈ {1, . . . , l}).
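In code, assembling the soft cluster ensemble of Eq. (5) is a simple row-wise stack of the individual clustering matrices. A NumPy sketch with randomly generated, column-normalized memberships (the helper name and all values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_soft_partition(k, n):
    """A hypothetical k x n fuzzy clustering matrix whose columns sum to
    one, standing in for the output of one fuzzy unsupervised classifier."""
    K = rng.random((k, n))
    return K / K.sum(axis=0)

n = 7                                       # number of objects
Ks = [random_soft_partition(k_i, n) for k_i in (3, 3, 4)]  # l = 3, k_i may differ

# Eq. (5): stack the K_i row-wise into a (sum of k_i) x n ensemble matrix E.
E = np.vstack(Ks)
print(E.shape)  # -> (10, 7)
```

Note that each block of rows of E keeps the column-stochastic property of its original K_i, but E as a whole does not; consensus functions must therefore keep track of the block boundaries.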
The (a, b)th entry of K_i represents the degree of association between the bth object and the ath cluster according to the ith fuzzy unsupervised classifier whose output is compiled in the soft cluster ensemble E.

2.2. Related work

Despite the relatively recent emergence of consensus clustering, multiple approaches to the combination of several partitions can be found in the literature; the interested reader is referred to the recent and comprehensive survey in [10]. In this section, we review the approaches based on voting strategies and those focused on combining the outcomes of fuzzy unsupervised classifiers.

2.2.1. Consensus functions based on voting

The application of voting strategies for constructing consensus functions is based on the notion that the assignment of an object to a cluster can be interpreted as a decision of an unsupervised classifier, regardless of whether it is a crisp or a fuzzy one. For this reason, voting methods lend themselves to consolidating multiple clusterings, inasmuch as voting is a formal way of combining the opinions of several voters into a single consolidated decision, such as the election of a president. Therefore, if an object is placed in the same cluster by a certain number of the cluster ensemble components, the consensus clustering process should respect that decision. According to this analogy, the l unsupervised classifiers whose outputs are compiled in a cluster ensemble can be regarded as voters, while the assignment of an object to a cluster can be interpreted as an election in which clusters play the role of candidates.
However, the cluster identification ambiguity problem introduced in Section 2.1 is very likely to occur among a set of different clusterings compiled in a cluster ensemble. Therefore, as candidates must be univocally identified for the result of an election to be correct, consensus functions based on voting strategies must include a cluster disambiguation process prior to voting proper. In the following paragraphs, we briefly survey some of the most relevant works in the literature devoted to the design of consensus functions based on voting techniques.

One of the pioneering works in voting-based consensus clustering was the voting-merging algorithm (VMA) [22]. In that work, the authors combined the output of hard clustering systems so as to reduce the variance of the error between them. To do so, clusters were first disambiguated by matching those clusters sharing the highest percentage of objects. Subsequently, depending on the level of agreement between the disambiguated clusters, each object was assigned to a consensus partition cluster with a certain weight, which made it possible to obtain a fuzzy consensus partition as a result. The BagClust1 consensus function proposed in [23] was based on applying plurality voting on the components of the cluster ensemble after a cluster disambiguation process based on measuring the overlap between clusters. A related and contemporary proposal was presented in [24]: the consensus clustering solution was obtained through a maximum likelihood mapping in which the label permutation problem was solved by means of the Hungarian method [25], which amounted to applying plurality voting on the disambiguated individual partitions in the cluster ensemble. Similarly, a majority voting strategy was applied in [26] after disambiguating the clusters, also using the Hungarian algorithm.
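The plurality-voting step shared by approaches such as [23,24] can be sketched as follows, assuming crisp labelings whose clusters have already been disambiguated (the helper name is hypothetical, not taken from those works):

```python
import numpy as np

def plurality_vote(aligned_labelings):
    """Given crisp labelings whose cluster labels have already been
    disambiguated, assign each object to the cluster receiving the most
    votes across the l labelings (ties broken by lowest cluster label)."""
    L = np.asarray(aligned_labelings)                 # shape (l, n)
    k = L.max()
    votes = np.stack([(L == c).sum(axis=0)            # votes per cluster
                      for c in range(1, k + 1)])      # shape (k, n)
    return votes.argmax(axis=0) + 1                   # 1-based consensus labels

aligned = [[1, 1, 1, 2, 2, 3, 3],
           [1, 1, 2, 2, 2, 3, 3],
           [1, 1, 1, 2, 3, 3, 3]]
print(plurality_vote(aligned))  # -> [1 1 1 2 2 3 3]
```

Without the prior disambiguation step, the vote counts would be meaningless, since identical clusters could carry different symbolic labels in different ensemble components.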
Recently, a set of consensus functions based on cumulative voting – named Un-normalized Cumulative Voting (URCV), Reference-based Cumulative Voting (RCV) and Adaptive Cumulative Voting (ACV) – whose time complexity scales linearly with the size of the data collection, was presented in [10]. Those proposals solve the cluster correspondence problem by computing a probabilistic mapping derived from the idea of cumulative voting, which is regarded as a means of maximizing the information transfer between the cluster ensemble components and the consensus partition.
2.2.2. Consensus functions for soft cluster ensembles

As mentioned earlier, relatively little research has been conducted on consensus functions specifically designed to create soft consensus partitions from the combination of multiple fuzzy unsupervised classifiers. One of the pioneering consensus functions for soft cluster ensembles was the fuzzy extension of VMA presented in [27]. In that work, the authors defined the consensus clustering solution as the one minimizing the average square distance with respect to all the partitions in the cluster ensemble. They demonstrated that obtaining such a consensus clustering boils down to finding the optimal cluster permutations of all the cluster ensemble components, which becomes an unfeasible problem if approached directly. For this reason, they resorted to an approximation algorithm that averages partitions on a pairwise sequential basis after conducting cluster disambiguation based on matching those clusters sharing the highest percentage of objects.

The Probabilistic Label Aggregation (PLA) consensus function constituted a neat approach to the soft consensus clustering problem [18]. It was based on factorizing the joint probability matrix of finding two objects in the same cluster – computed upon the soft cluster ensemble – thus obtaining estimates for class-likelihoods and class-posteriors, upon which the consensus clustering solution was based.

A recent contribution in this area was the Information Theoretic K-means (ITK) algorithm [16]. In that case, each object in the data set was represented by means of the concatenated posterior cluster membership probability distributions corresponding to each one of the l fuzzy partitions in the soft cluster ensemble. Thus, using the Kullback–Leibler divergence between those probability distributions as a measure of the distance between objects, the k-means algorithm was applied so as to obtain the consensus clustering solution.
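The object representation used by ITK can be sketched as follows (the helper names and all values are hypothetical, and the subsequent k-means step with a KL-based distortion is omitted):

```python
import numpy as np

def concat_posteriors(ensemble_matrices):
    """Represent each object by the concatenation of its posterior cluster
    membership distributions across the l fuzzy partitions: object j becomes
    the jth column of the stacked ensemble matrix."""
    return np.vstack(ensemble_matrices).T              # shape (n, sum of k_i)

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) between discrete
    distributions; eps guards against log-of-zero."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

# Two hypothetical fuzzy partitions of n = 3 objects into k = 2 clusters.
K1 = np.array([[0.9, 0.2, 0.5],
               [0.1, 0.8, 0.5]])
K2 = np.array([[0.8, 0.1, 0.6],
               [0.2, 0.9, 0.4]])
X = concat_posteriors([K1, K2])                        # (3, 4) representations

# Objects with similar membership profiles (first and third) end up closer,
# in the KL sense, than objects with opposed profiles (first and second).
print(round(kl(X[0], X[1]), 3), round(kl(X[0], X[2]), 3))
```

This makes the distance between objects a function of how similarly the l classifiers see them, which is the intuition behind using k-means on this representation.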
Last but not least, it must be noted that some consensus functions originally posed for hard cluster ensembles can be employed for combining soft clusterings after introducing minor modifications on them. This is the case of the hypergraph partition-based consensus functions Cluster-based Similarity Partitioning Algorithm (CSPA), HyperGraph Partitioning Algorithm (HGPA), Meta-CLustering Algorithm (MCLA) [7] and Hybrid Bipartite Graph Formulation (HBGF) [13], the soft versions of which were presented in [16,28]—similarly, the soft version of the Evidence ACcumulation (EAC) consensus function [14] was derived in [28].
One of the contributions of this work is an exhaustive experimental comparison between the voting-based consensus functions put forward in this work (described in Section 3) and several of the previously proposed fuzzy consensus functions—see Section 5.2.

3. Voting-based consensus functions

As pointed out in Section 2.2.1, voting-based consensus functions can be described as the concatenation of a cluster disambiguation process and a voting process. This modularity makes it possible to implement multiple consensus functions by combining different cluster disambiguation techniques and voting strategies. In Sections 3.1 and 3.2, the cluster disambiguation and voting methods that make up our consensus functions are described. Next, in Section 3.3, we introduce two distinct ways of sequencing the cluster disambiguation and voting processes that, with the aim of optimizing the computational aspect of our proposals, give rise to the eight consensus functions put forward in this work.

3.1. Cluster disambiguation

In this work, the cluster disambiguation step is responsible for solving the cluster re-labeling problem – an instance of the cluster correspondence problem in which a one-to-one correspondence between clusters is assumed, that is, k_i = k (∀i ∈ {1, . . . , l}). 1 So, given a pair of clustering solutions with k clusters each, cluster disambiguation methods must find, among the k! possible cluster permutations, the one that maximizes some cluster correspondence criterion. As mentioned earlier, cluster permutations amount to row order rearrangements in a fuzzy clustering scenario. Thus, the outcome of a cluster disambiguation process that aims to align the clusters of two k-way fuzzy partitions K_1 and K_2 can be represented by means of a k × k permutation matrix P. Therefore, if the clusters in K_1 are taken as a reference, the premultiplication of K_2 by P yields the cluster disambiguated version of the latter partition [29].
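Such a permutation matrix can be derived, for instance, by casting cluster matching as an assignment problem. A sketch using SciPy's implementation of the Hungarian method (linear_sum_assignment), with the negated inner product of membership rows as a hypothetical stand-in for the cluster dissimilarity measure described in Appendix A.2:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def disambiguate(K1, K2):
    """Align the clusters (rows) of K2 with those of the reference partition
    K1. Cluster dissimilarity is approximated here by negated inner products
    of membership rows; the Hungarian method then finds the minimum-cost
    one-to-one cluster matching."""
    cost = -K1 @ K2.T                         # (k, k) pairwise dissimilarities
    rows, cols = linear_sum_assignment(cost)  # row i of K1 <-> row cols[i] of K2
    k = K1.shape[0]
    P = np.zeros((k, k))
    P[rows, cols] = 1                         # k x k permutation matrix
    return P, P @ K2                          # premultiplying K2 by P aligns it

# A row-permuted copy of a partition should be mapped back onto itself.
K1 = np.array([[0.9, 0.8, 0.1, 0.0],
               [0.1, 0.1, 0.8, 0.1],
               [0.0, 0.1, 0.1, 0.9]])
K2 = K1[[2, 0, 1]]                            # same partition, clusters relabeled
P, K2_aligned = disambiguate(K1, K2)
print(np.allclose(K2_aligned, K1))  # -> True
```

Applying this pairwise alignment sequentially against a fixed reference, as described below, disambiguates the whole ensemble.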
The derivation of such a permutation matrix P has been tackled in the consensus clustering literature by several approaches, such as (i) cluster correspondence estimation based on common space cluster representation by Singular Value Decomposition [30], (ii) the Soft Correspondence Ensemble Clustering algorithm, which is based on establishing a weighted correspondence between clusters [31], (iii) the cumulative voting approach, which, unlike common one-to-one voting schemes, computes a probabilistic mapping between clusters [10], or (iv) the FullSearch, Greedy and LargeKGreedy cluster alignment algorithms [32].

In this work, we have chosen a simple, well-established and reliable technique for solving the cluster re-labeling problem: the Hungarian method (also known as the Kuhn–Munkres or Munkres assignment algorithm) [25]. The reason for this choice is that, besides being a thoroughly studied and tested method (see [17]), the Hungarian algorithm is capable of solving the cluster correspondence problem in O(k³) time. To find the optimal cluster permutation that yields the largest probability mass over all cluster assignment probabilities, the Hungarian algorithm poses the cluster correspondence problem as a weighted bipartite matching problem [24]. In particular, we have employed the implementation of [33], which bases the derivation of the cluster permutation matrix P upon a measure of the dissimilarity between the clusters of the clustering solutions under consideration. The interested reader is referred to Appendix A.2 for a description of how cluster dissimilarity is computed. In any case, given a soft cluster ensemble E containing a set of l soft clustering solutions, the cluster disambiguation process consists in taking one of them as a reference and sequentially applying the Hungarian method to the remaining l − 1 clustering solutions [17]. As a result, a cluster aligned version of the cluster ensemble is obtained, and voting procedures can be readily applied on it.

3.2. Voting procedure

Once the correspondence between the k clusters of each one of the l soft clustering solutions compiled in the soft cluster ensemble E has been resolved and the corresponding cluster permutations have been applied, it is time to derive the consensus clustering solution, a task we tackle by means of voting procedures. In this section, we describe four voting methods that constitute the core of the consensus functions put forward in this work.

1 However, it is also possible to use these methods to match partitions with different numbers of clusters, simply by equalizing them through the addition of the necessary number of ‘dummy’ clusters to the clusterings with fewer clusters [27]. See Appendix A.1 for further insight on this issue.
It is important to notice that the scalar elements that constitute the soft cluster ensemble E can be interpreted as the preference of each object for each cluster. Following this analogy, the process of fuzzily classifying an object can be regarded as an election in which each one of the l classifiers, regarded as a voter, casts its preference for each and every one of the clusters (or candidates). For this reason, the soft consensus functions proposed in this work make use of confidence and positional voting methods, which are applicable in voting scenarios in which voters grade all candidates according to their degree of liking [34]. In a nutshell, confidence voting methods directly conduct mathematical operations on the specific values of the preference scores the voters emit, whereas positional voting techniques are based on ranking the candidates according to the degree of confidence expressed by the voters.

However, an important issue must be taken into account when applying voting procedures to derive a consensus partition upon a soft cluster ensemble: since the l soft cluster ensemble components may have fairly distinct natures, we may find that they differ in terms of proportionality and scale. By proportionality we refer to the relationship between the values of the constituting scalars of each soft clustering matrix K_i and the strength of association between objects and clusters. For instance, if such scalars represented membership probabilities, their values would be directly proportional to the strength of association between objects and clusters, whereas the opposite would happen if these scalars represented distances to cluster centroids. Quite obviously, any voting method applied on such preference scores should take their proportionality into account when combining them so as to obtain coherent results.
That is, if confidence voting were employed, a suitable transformation should be designed so as to make all these scalars either directly or inversely proportional to the strength of association between objects and clusters. If positional voting strategies were applied, candidates should be ranked in ascending or descending order depending on the proportionality of the object-to-cluster association scores. Moreover, if the voting method is based on conducting mathematical operations on the scalar values that make up each soft clustering matrix Ki (as in confidence voting), it would also be necessary to consider their scale: if they had different dynamic ranges, the result of the voting process could be totally biased. Quite obviously, positional voting techniques are immune to scaling issues. Throughout the following paragraphs, the four voting techniques employed for creating the consensus functions proposed in this work are described. 2

3.2.1. Confidence voting
Consensus functions based on confidence voting methods derive the consolidated clustering solution by conducting simple mathematical operations on the confidence scores each clusterer assigns to each cluster. For this reason, a prerequisite for using these voting methods is that these confidence values are comparable in magnitude [34]. Assuming this is true, we propose the derivation of fuzzy consensus functions based on the sum and product confidence voting rules, which are described next:
• Sum voting rule: the confidence voting sum rule simply consists of adding the confidence values that all the voters cast for each candidate. As a result, a k × n sum matrix RE is obtained, the (i,j)th entry of which equals the sum of the preference scores of assigning the jth object to the ith cluster across the l cluster ensemble components:

R_E = \sum_{i=1}^{l} K_i    (6)
where Ki refers to the ith clustering contained in the soft cluster ensemble E. The complexity of this voting process is O(nkl).
• Product voting rule: in this case, preference values per candidate are multiplied instead of added (see Eq. (7)). It is important to notice that matrix products are computed entrywise, so the complexity is also O(nkl). As a result, the k × n product matrix PE is obtained as

P_E = \prod_{i=1}^{l} K_i    (7)

2 For notational simplicity, in the following we refer by E and Ki to the cluster disambiguated versions of the cluster ensemble and the ith soft partition contained in E, assuming that the cluster disambiguation process has already been conducted.
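The two confidence rules can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions (a toy ensemble of two already disambiguated k × n matrices; the column normalization is our stand-in for the proportionality and scale corrections discussed above), not the authors' implementation:

```python
import numpy as np

def normalize_columns(K):
    # Make components comparable in scale: each object's scores sum to one.
    s = K.sum(axis=0, keepdims=True)
    return K / np.where(s == 0, 1, s)

def sum_consensus(ensemble):
    # Eq. (6): entrywise sum of the l disambiguated k x n matrices.
    return np.sum(ensemble, axis=0)

def product_consensus(ensemble):
    # Eq. (7): entrywise product of the l disambiguated k x n matrices.
    return np.prod(ensemble, axis=0)

# Toy ensemble: l = 2 clusterings, k = 2 clusters, n = 3 objects.
# If a component held distances instead of similarities, it should first be
# flipped (e.g. S = 1 / (1 + D)) so that larger always means stronger association.
K1 = normalize_columns(np.array([[0.9, 0.2, 0.6],
                                 [0.1, 0.8, 0.4]]))
K2 = normalize_columns(np.array([[0.7, 0.3, 0.5],
                                 [0.3, 0.7, 0.5]]))
R = sum_consensus([K1, K2])      # k x n sum matrix R_E
P = product_consensus([K1, K2])  # k x n product matrix P_E
```

Any monotonic transformation that makes all components directly proportional to association strength and comparable in range would serve the same normalizing purpose; the choices above are merely illustrative.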
Quite obviously, the product rule is highly sensitive to low values, which could ruin the chances of a candidate of winning the election, no matter what its other confidence values are [34].

3.2.2. Positional voting
Positional voting methods rank the candidates according to the confidence scores emitted by the voters. Thus, fine-grain information regarding preference differences between candidates is ignored, although problems in scaling the voters' confidence scores are avoided—that is, positional voting is useful in situations where confidence values are hard to scale correctly [34]. As an aid for describing the positional voting methods that constitute the core of our consensus functions, Eq. (8) defines Ki (the ith component of the soft cluster ensemble E) in terms of its columns, represented by vectors k_ij (∀j = {1, . . . , n}):

E = \begin{pmatrix} K_1 \\ K_2 \\ \vdots \\ K_l \end{pmatrix}   where   K_i = (k_{i1} k_{i2} . . . k_{in})    (8)

In this work, we propose employing two positional voting strategies for deriving the consensus clustering solution, namely the Borda and the Copeland voting methods, which are described next.
• Borda voting rule: Borda voting computes the mean rank of each candidate over all voters, re-ranking them according to their mean rank [35]. This process results in a grading of all the n objects with respect to each of the k clusters, which is embodied in a k × n Borda voting matrix BE. Such grading, described in Algorithm 1, is conducted as follows: firstly, for each object, clusters are ranked according to their degree of association with respect to it (from the most to the least strongly associated). Then, the top ranked cluster receives k points, the second ranked cluster receives k − 1 points, and so on. After iterating this procedure across the l cluster ensemble components, the grading matrix BE is obtained. According to Borda voting, the higher the value of the (i,j)th entry of BE, the more likely the jth object belongs to the ith cluster. Assuming that the Rank procedure is implemented by means of an efficient sorting algorithm such as MergeSort [36], the complexity of the Borda voting process is O(nlk log k).

Algorithm 1: Algorithmic description of the Borda voting rule.
Input: Soft cluster ensemble E containing l fuzzy clusterings Ki (∀i = {1, . . . , l})
Output: Borda voting matrix BE
Data: k clusters, n objects
BE = 0_{k×n}
for a = 1 . . . l do
  for b = 1 . . . n do
    r = Rank(k_ab);
    for c = 1 . . . k do
      BE(r(c), b) = BE(r(c), b) + (k − c + 1);
    end
  end
end
Rank is the symbolic representation of the cluster ordering procedure, the vector k_ab represents the bth column of the ath cluster ensemble component Ka, r is a cluster ranking vector and 0_{k×n} represents a k × n zero matrix.

• Copeland voting rule: the Copeland rule [37] is a pairwise aggregation implementation of the Condorcet voting method [38]. The Copeland rule can be regarded as a positional voting strategy, as it employs the voters' preference choices between any given pair of candidates [34]. In particular, this voting method performs an exhaustive pairwise candidate ranking comparison across voters, and the winner of each one of these one-to-one confrontations scores a point. The result of this process is the Copeland score matrix CE, the (i,j)th element of which indicates how many
candidates the ith candidate beats in one-to-one comparisons in the jth election (where candidates are clusters and an election corresponds to the clusterization of an object)—see Algorithm 2. Its complexity is also O(nlk log k).

Algorithm 2: Algorithmic description of the Copeland voting rule.
Input: Soft cluster ensemble E containing l fuzzy clusterings Ki (∀i = {1, . . . , l})
Output: Copeland voting matrix CE
Data: k clusters, n objects
for b = 1 . . . n do
  M = 0_{k×k}
  for a = 1 . . . l do
    r = Rank(k_ab);
    for c = 1 . . . k do
      M(r(c), r(c + 1 ÷ k)) = M(r(c), r(c + 1 ÷ k)) + 1;
    end
  end
  for c = 1 . . . k do
    CE(c, b) = Count(M(c, 1 ÷ k) ≥ l/2);
  end
end
Rank is the symbolic representation of the cluster ordering procedure, the vector k_ab represents the bth column of the ath cluster ensemble component Ka, r is a cluster ranking vector, 0_{k×k} represents a k × k zero matrix, and Count is the symbolic representation of a procedure for counting how many elements of the cth row of matrix M are greater than or equal to l/2. The obelus symbol (÷) is employed to indicate a range, i.e. 1÷10 represents the range of integer numbers between 1 and 10.

3.3. Disambiguation and voting sequencing
As mentioned earlier, the consensus functions proposed in this work are conceived as the concatenation of a cluster disambiguation and a voting procedure. However, these two processes can be sequenced in different ways, and depending on how this is done, different consensus functions can be generated. Throughout the following paragraphs, we describe two manners of sequencing cluster disambiguation and voting that, combined with the four voting methods described in Section 3.2, give rise to eight consensus functions. We refer to these two sequencing strategies as direct and iterative, respectively. Direct sequencing consists of applying the cluster disambiguation and voting procedures on the cluster ensemble E as a whole (see Algorithm 3).
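Before detailing the two sequencing strategies, the positional rules of Section 3.2.2 can be made concrete. The following NumPy sketch of Algorithms 1 and 2 is our own illustration (argsort stands in for the Rank procedure, and ties are broken arbitrarily), not the authors' implementation:

```python
import numpy as np

def borda_matrix(ensemble):
    # Algorithm 1: each voter awards k, k-1, ..., 1 points down its ranking.
    k, n = ensemble[0].shape
    B = np.zeros((k, n))
    for Ka in ensemble:
        for b in range(n):
            order = np.argsort(-Ka[:, b])       # most to least associated cluster
            B[order, b] += np.arange(k, 0, -1)  # k points, k-1 points, ...
    return B

def copeland_matrix(ensemble):
    # Algorithm 2: pairwise wins across voters; cluster c scores one point per
    # rival it outranks in at least l/2 of the individual rankings.
    l = len(ensemble)
    k, n = ensemble[0].shape
    C = np.zeros((k, n))
    for b in range(n):
        M = np.zeros((k, k))                    # M[i, j]: times i outranks j
        for Ka in ensemble:
            order = np.argsort(-Ka[:, b])
            for c in range(k):
                M[order[c], order[c + 1:]] += 1
        C[:, b] = (M >= l / 2).sum(axis=1)
    return C

# Two voters agree: object 0 belongs to cluster 0, object 1 to cluster 1.
ens = [np.array([[0.9, 0.2], [0.1, 0.8]]),
       np.array([[0.7, 0.3], [0.3, 0.7]])]
B = borda_matrix(ens)
C = copeland_matrix(ens)
```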
Firstly, the clusters corresponding to the l cluster ensemble components are aligned by means of the Hungarian method, thus obtaining the disambiguated cluster ensemble Edis. Subsequently, the voting procedure symbolically represented by the function Vote is applied on Edis. As a result, we obtain a voting results matrix V that, optionally, can be subjected to a normalization procedure, which finally yields the fuzzy consensus partition Kc. Depending on how the Vote step is implemented (i.e. depending on whether the sum, product, Borda or Copeland voting rule is applied), the voting results matrix V corresponds to the RE, PE, BE or CE matrix described in Section 3.2, and a distinct consensus function is obtained. Therefore, from this point onwards, we will refer to the four consensus functions that use the direct sequencing strategy as DSC (Direct Sum Consensus), DPC (Direct Product Consensus), DBC (Direct Borda Consensus) and DCC (Direct Copeland Consensus), depending on the voting rule employed.

In turn, iterative sequencing is based on conducting the cluster disambiguation and voting procedures on pairs of clusterings following an iterative approach—see Algorithm 4. The whole process is started by picking one of the l soft cluster ensemble components as a reference (denoted as Kref, possibly chosen at random). Subsequently, the cluster disambiguation plus voting procedures are iteratively run on pairs of clusterings compiled in a duplex cluster ensemble E2, which is updated at each iteration by including the reference partition Kref (recalculated at each iteration by means of the voting process) plus one of the remaining cluster ensemble components, across l − 1 iterations (again, this latter selection can also be conducted at random).
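The direct and iterative sequencing schemes just described (and formalized in Algorithms 3 and 4 below) can be sketched as follows. This is a minimal sketch under our own assumptions: SciPy's linear_sum_assignment plays the role of the Hungarian step, the cluster-overlap matrix Kref·Kᵀ is our own stand-in for the disambiguation cost, and vote can be any of the four rules (here, the sum rule):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def normalize(V):
    s = V.sum(axis=0, keepdims=True)
    return V / np.where(s == 0, 1, s)

def align(K, Kref):
    # Hungarian step: permute the rows (clusters) of K so that they best
    # overlap with the corresponding rows of the reference partition.
    overlap = Kref @ K.T                        # k x k cluster agreement matrix
    _, cols = linear_sum_assignment(-overlap)   # maximize total agreement
    return K[cols]

def direct_consensus(ensemble, vote):
    # Algorithm 3: disambiguate the whole ensemble, then vote once.
    Kref = ensemble[0]
    aligned = [Kref] + [align(K, Kref) for K in ensemble[1:]]
    return normalize(vote(aligned))

def iterative_consensus(ensemble, vote):
    # Algorithm 4: fold one component at a time into the running reference.
    Kref = ensemble[0]
    for K in ensemble[1:]:
        Kref = normalize(vote([Kref, align(K, Kref)]))
    return Kref

# Second component has its cluster labels swapped; alignment undoes the swap.
ens = [np.array([[0.9, 0.1], [0.1, 0.9]]),
       np.array([[0.2, 0.8], [0.8, 0.2]])]
sum_rule = lambda e: np.sum(e, axis=0)
Kc = direct_consensus(ens, sum_rule)
```

With only two components, both schemes coincide; for l > 2 they generally differ, which is precisely the point analyzed in the experiments.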
Algorithm 3: Algorithmic description of the direct sequencing of the cluster disambiguation and the voting procedures.
Input: Soft cluster ensemble E containing l fuzzy clusterings Ki (∀i = {1, . . . , l})
Output: Fuzzy consensus clustering Kc
Data: k clusters, n objects
Edis = Hungarian(E);
V = Vote(Edis);
Kc = Normalize(V);

Algorithm 4: Algorithmic description of the iterative sequencing of the cluster disambiguation and the voting procedures.
Input: Soft cluster ensemble E containing l fuzzy clusterings Ki (∀i = {1, . . . , l})
Output: Fuzzy consensus clustering Kc
Data: k clusters, n objects
Kref = K1;
for r = 2, . . . , l do
  E2 = (Kref; Kr);
  E2dis = Hungarian(E2);
  V = Vote(E2dis);
  Kref = Normalize(V);
end
Kc = Kref

The randomness in the selection of Kref at the beginning of the algorithm and of Kr at each iteration (see Algorithm 4) can cast a shadow of doubt as regards the associativity of iterative sequencing. In other words, how important is the order in which the cluster ensemble components are combined? How does changing this order affect the quality of the consensus partition? We will analyze this important issue in the experimental section of this work. The four consensus functions resulting from applying the four voting methods presented in Section 3.2 using the iterative sequencing approach are named ISC (Iterative Sum Consensus), IPC (Iterative Product Consensus), IBC (Iterative Borda Consensus) and ICC (Iterative Copeland Consensus).

4. Experimental setup
The experimental section of this paper—Section 5—aims to evaluate the performance of the fuzzy consensus functions introduced in this work in comparison to several representative state-of-the-art consensus functions. However, prior to that, it is necessary to describe the conditions under which all the experimentation reported in this paper has been carried out. In particular, we describe the data collections over which our experiments have been conducted, the procedure employed for generating the soft cluster ensembles, and the benchmark of state-of-the-art consensus functions our proposals will be compared to. Moreover, we also describe the indices used for measuring the performance of our proposals, together with the statistical tests employed for analyzing the significance of the results obtained.

4.1.
Data collections
The experiments presented in this work have been conducted on 11 publicly available data collections obtained from the UCI Machine Learning Repository [39] and from the on-line repository of the Machine Learning Group of the University College Dublin [40], which are commonly employed as benchmarks in the pattern recognition and machine learning fields. The names of these data sets are: Zoo, Iris, Wine, Glass Identification (Glass for short), Ionosphere, Wisconsin Diagnostic Breast Cancer (WDBC), Balance Scale (Balance for short), Multiple Features
Table 1
Summary of the data sets employed in the experiments. The 'Class imbalance' column presents the percentage of objects in the data set belonging to the most and least populated categories, respectively.

Data set name   Number of objects (n)   Number of attributes (d)   Number of classes (c)   Class imbalance (%)
Zoo                    101                    17                        7                   40.6–3.9
Iris                   150                     4                        3                   33.3–33.3
Wine                   178                    13                        3                   39.9–26.9
Glass                  214                    10                        6                   35.5–4.2
Ionosphere             351                    34                        2                   64.1–35.9
WDBC                   569                    32                        2                   62.7–37.3
Balance                625                     4                        3                   46.1–7.8
MFeat                 2000                   649                       10                   10–10
miniNG                2000                  6679                       20                   5–5
BBC                   2225                  6767                        5                   22.9–17.3
PenDigits             7494                    16                       10                   10.4–9.6
(MFeat), miniNewsgroups (miniNG), BBC 3 and Pen-Based Recognition of Handwritten Digits (PenDigits). Table 1 summarizes the main characteristics of these data collections: the number of objects they contain (n), the number of attributes used for representing each object (d), the number of classes (c) and their class imbalance. As can be observed, these data sets are quite diverse as far as these parameters are concerned, thus constituting a representative set of data collections for evaluating the performance of our proposals in different clustering scenarios.

4.2. Creation of soft cluster ensembles
For a given data collection, consensus functions are run on a soft cluster ensemble E, which is generated by compiling the outcomes of l fuzzy clustering processes applied on the data collection at hand. In order to create highly diverse cluster ensembles, we have employed different data representation techniques and clustering algorithms during their generation. Firstly, multiple data views have been created through the application of dimensionality reduction processes based on the Principal Component Analysis, Independent Component Analysis, Non-Negative Matrix Factorization and Random Projection feature extraction methods—thus, our soft cluster ensembles will exhibit data representation diversity. Subsequently, on each distinct representation of the data set, we have applied the fuzzy c-means and the k-means clustering algorithms 4 for creating l algorithmically diverse components that make up the soft cluster ensemble E. For simplicity, the number of clusters to be found by these clustering algorithms, k, has been set equal to the real number of categories in each collection, 5 c. As a result of this process, one soft cluster ensemble E per data collection is obtained. Its number of components, l, is presented in Table 2.
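A sketch of this ensemble-generation pipeline follows, under our own assumptions: scikit-learn stands in for the feature extraction and clustering steps, fuzzy c-means is omitted (it is not available in scikit-learn), and k-means centroid distances are turned into soft memberships with the ad hoc transformation 1/(1 + d). None of these choices are the authors'; they only illustrate the structure of the process:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, FastICA, NMF
from sklearn.random_projection import GaussianRandomProjection
from sklearn.cluster import KMeans

X = load_iris().data          # stand-in data collection (n = 150, d = 4)
k = 3                         # number of clusters = number of classes

# Multiple data views for representation diversity.
views = [
    PCA(n_components=2).fit_transform(X),
    FastICA(n_components=2, random_state=0).fit_transform(X),
    NMF(n_components=2, max_iter=1000).fit_transform(X),  # X is non-negative
    GaussianRandomProjection(n_components=2, random_state=0).fit_transform(X),
]

ensemble = []
for V in views:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(V)
    D = km.transform(V).T                 # k x n object-to-centroid distances
    S = 1.0 / (1.0 + D)                   # distances -> association strengths
    ensemble.append(S / S.sum(axis=0, keepdims=True))   # k x n soft partition
```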
Differences between the sizes of the soft cluster ensembles of distinct data sets are caused by disparities in the number of dimensionally reduced data representations we have been able to generate, a factor that is ultimately limited by the number of features d used for representing each object in origin (see Table 1). Finally, in order to obtain statistically reliable results from the analysis of the consensus functions' performance, we have run all our algorithms on 250 bootstrap samples of the cluster ensemble of sizes ⌊l/20⌋, ⌊l/10⌋, ⌊l/5⌋, ⌊l/2⌋ and l (where ⌊x⌋ denotes the application of the floor function on x).
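The bootstrap sampling of the ensemble can be sketched as follows (a minimal sketch; sampling components with replacement is our reading of 'bootstrap samples', and the helper name is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ensembles(ensemble, n_samples=250):
    # For each size floor(l/20), floor(l/10), floor(l/5), floor(l/2) and l,
    # draw n_samples ensembles by sampling components with replacement.
    l = len(ensemble)
    sizes = sorted({max(1, l // f) for f in (20, 10, 5, 2, 1)})
    return {s: [[ensemble[i] for i in rng.integers(0, l, size=s)]
                for _ in range(n_samples)]
            for s in sizes}

toy = [np.zeros((3, 5)) for _ in range(42)]   # stand-in for a 42-component ensemble
samples = bootstrap_ensembles(toy)
```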
3 Available for download at http://mlg.ucd.ie/datasets/bbc.html (last accessed on September 2011).
4 Whereas fuzzy c-means is fuzzy by nature, the k-means clustering algorithm is not. However, some implementations of the latter are capable of returning object to cluster centroid distances, which can be used as an indicator of the degree of association between objects and clusters. For the sake of greater algorithmic diversity, variants of k-means using the Euclidean, city block, cosine and correlation distances have been employed.
5 The number of clusters k to be found can be estimated by means of well-known statistical methods (see [41]), or even be estimated from the consensus clustering process itself [22]. In any case, this issue is left for future work.
Table 2
Number of components (l) of the soft cluster ensemble E of each data set.

Data set      Number of cluster ensemble components (l)
Zoo           342
Iris          54
Wine          270
Glass         174
Ionosphere    582
WDBC          678
Balance       42
MFeat         36
miniNG        219
BBC           342
PenDigits     171
4.3. Consensus functions benchmark
We have compared the consensus functions proposed in this work to five state-of-the-art consensus functions based on fairly distinct approaches. More specifically, this comparison is conducted against (i) the VMA consensus function [27], and (ii) the soft versions of four consensus functions: the hypergraph-based CSPA, HGPA and MCLA [7], and the EAC consensus function [14] (see [16,28] for details on the derivation of the soft versions of these last four consensus functions). The reason for using these consensus functions is that, besides being based on fairly distinct approaches, they constitute a classic standard in the field.

4.4. Performance evaluation measures
The performance of consensus functions is analyzed in terms of two aspects: their computational cost and the quality of the consensus partitions they deliver. As regards the computational cost, we measure the CPU time required for executing each consensus function under Matlab 7.0.4 on Dual Pentium 4 3 GHz/1 GB RAM computers. As far as the evaluation of the consensus clusterings' quality is concerned, we have employed an external cluster validity index that measures their degree of resemblance to a predefined and allegedly correct cluster structure of the data set, defined by the ground truth of each data collection (i.e. the class each object belongs to), which is only employed for evaluation purposes. Although the proposed consensus functions output fuzzy consensus clustering solutions, we have not evaluated their quality directly. Instead, we have created the crisp version of the obtained fuzzy consensus clusterings (assigning each object to the cluster it is most strongly associated with), measuring their similarity to the ground truth of each data set. The reason for this is twofold: firstly, a soft ground truth is not available for these data sets, so fuzzy consensus clusterings cannot be directly evaluated using an external cluster validity index.
And secondly, given that most of the state-of-the-art consensus functions that form our benchmark deliver hard consensus clustering solutions, a fair inter-consensus function comparison requires converting the soft consensus clustering matrix Kc output by our consensus functions to a crisp consensus labeling kc—recall that this simply boils down to assigning each object to the cluster it is most strongly associated with. In particular, we have measured the similarity between any data set's ground truth c and a given consensus clustering kc in terms of their normalized mutual information φ(NMI), which is defined as

\phi^{(NMI)}(c, k_c) = \frac{2}{n} \sum_{l=1}^{k} \sum_{h=1}^{k} n_{h,l} \log_{k^2} \left( \frac{n_{h,l} \, n}{n_h^{(c)} n_l^{(k_c)}} \right)    (9)

where n_h^{(c)} is the number of objects in cluster h according to c, n_l^{(k_c)} is the number of objects in cluster l according to the evaluated consensus partition kc, n_{h,l} denotes the number of objects in cluster h according to c as well as in group
l according to kc, n is the number of objects in the collection, and k is the number of clusters into which objects are clustered according to kc and c [7]. The reason for choosing normalized mutual information as the evaluation measure is that φ(NMI) is theoretically well-founded, unbiased and symmetric, besides being normalized to the [0, 1] interval—therefore, the higher the value of φ(NMI)(c, kc), the more similar kc and c are, and hence, the greater the quality of the evaluated consensus partition kc [7]. The CPU times and φ(NMI) scores were obtained by running the consensus functions on the 250 randomly generated bootstrap sample soft cluster ensembles of each data collection mentioned in Section 4.2.

4.5. Statistical tests
In order to determine whether there exist significant differences between the performances of the consensus functions subject to comparison, we have conducted non-parametric statistical significance tests following the recommendations in [42,43]. In particular:
• When two consensus functions have to be compared one against the other (i.e. pairwise comparison), we have employed the non-parametric Wilcoxon paired signed-rank test [44].
• When multiple comparisons are to be conducted, we use the Iman–Davenport test in order to detect statistically significant differences between consensus functions, and the Shaffer post hoc test so as to discover which consensus functions attain better performance than the others. 6
In all cases, we have set the significance level at α = 0.05. However, the use of statistical tests does not only indicate whether differences between compared algorithms are statistically significant, but also how important those differences are. To that effect, we also provide the reader with the p-value corresponding to each comparison, as the smaller it is, the more significant the differences.

5.
Experiments
The objective of the experimental section of this work is to answer the following questions:
• Among our proposals, which is the most suitable sequencing strategy (i.e. direct or iterative) for each voting method (Borda, Copeland, product and sum)?
• How do our proposals compare to the state-of-the-art?
In order to answer the first question, the goal of the first part of this evaluation section is to come up with the optimal cluster disambiguation-voting sequencing approach for each voting technique. Moreover, we will also conduct an experimental analysis of the computational performance of the proposed consensus functions, plus an empirical evaluation of the associativity property of the iterative sequencing scheme and its impact on the quality of consensus partitions (see Section 5.1). Subsequently, we will compare the performance of the optimal versions of our consensus functions against several representative state-of-the-art consensus functions (see Section 5.2).

5.1. Which is the most appropriate sequencing strategy for each voting method?
In this section, we are interested in analyzing whether, for a given voting rule, sequencing the cluster disambiguation and voting processes following either the direct or the iterative strategy has a significant impact on the performance of the corresponding version of the consensus function. The aim of this study is to choose the best performing version of each one of the proposed consensus functions (that is, the one with the lowest CPU time and the highest φ(NMI)(c, kc)). To that effect, we have conducted a pairwise comparison between the direct and the iterative versions of the consensus functions based on each voting rule, that is: DBC vs. IBC, DCC vs. ICC, DPC vs. IPC, and DSC vs. ISC.

6 We have employed the software for conducting non-parametric statistical tests available at http://sci2s.ugr.es/sicidm/.
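The φ(NMI) score of Eq. (9), used for all quality comparisons below, can be computed with a direct transcription of the formula (a minimal sketch; the function name is ours):

```python
import numpy as np

def phi_nmi(c, kc, k):
    # Eq. (9): (2/n) * sum_{h,l} n_hl * log_{k^2}(n * n_hl / (n_h^(c) * n_l^(kc)))
    c, kc = np.asarray(c), np.asarray(kc)
    n = len(c)
    total = 0.0
    for h in range(k):
        n_h = np.sum(c == h)            # objects in cluster h of the ground truth
        for l in range(k):
            n_l = np.sum(kc == l)       # objects in cluster l of the consensus
            n_hl = np.sum((c == h) & (kc == l))
            if n_hl > 0:
                total += n_hl * np.log(n * n_hl / (n_h * n_l)) / np.log(k ** 2)
    return 2.0 * total / n

# In this balanced example, identical partitions score 1; independent ones score 0.
perfect = phi_nmi([0, 0, 1, 1], [0, 0, 1, 1], k=2)
independent = phi_nmi([0, 0, 1, 1], [0, 1, 0, 1], k=2)
```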
Table 3
Average CPU running time (measured in seconds) and normalized mutual information between the consensus clustering kc and the ground truth c of each data collection, for each consensus function. The CPU time and φ(NMI) scores of the best performing consensus function in each pair are marked with an asterisk (*). The p-value resulting from the Wilcoxon pairwise comparison statistical test is also presented (p < 0.05 indicates statistically significant differences).

Voting method   Consensus function   CPU time (s)       p_CPU time   φ(NMI)          p_φ(NMI)
Borda           DBC                  3.19 ± 9.17*       8.85e−17     0.44 ± 0.23*    1.32e−7
                IBC                  5.29 ± 12.39                    0.39 ± 0.24
Copeland        DCC                  25.29 ± 94.22*     1.65e−34     0.44 ± 0.23*    1.63e−221
                ICC                  55.68 ± 205.18                  0.18 ± 0.22
Product         DPC                  1.53 ± 4.86        9.25e−6      0.45 ± 0.23     0.0802
                IPC                  0.63 ± 1.64*                    0.46 ± 0.23*
Sum             DSC                  1.76 ± 5.46        5.26e−5      0.44 ± 0.23*    6.72e−4
                ISC                  0.81 ± 2.26*                    0.43 ± 0.23
The performance of each consensus function is described in Table 3 in terms of its average CPU running time (measured in seconds) and the φ(NMI) scores obtained after the execution of each consensus function on all the soft cluster ensemble bootstrap samples created on each data collection. In all cases, we present the mean ± standard deviation values of these performance indices, and the best global result in each case is marked in Table 3. Moreover, for each pairwise comparison, we provide the associated p-value returned by the Wilcoxon paired signed-rank test in order to indicate whether the two compared consensus functions differ significantly in terms of running time and φ(NMI) score, and to which degree. In the following sections, we analyze the results of each pairwise comparison, determining the optimal sequencing strategy for each distinct voting scheme. Moreover, besides the global analysis represented by the results presented in Table 3, Appendix B provides the reader with the results of these pairwise comparisons detailed on each distinct data collection.

5.1.1. Borda voting (DBC vs. IBC)
It can be observed from Table 3 that Direct Borda Consensus (DBC) outperforms Iterative Borda Consensus (IBC) both in terms of CPU running time and φ(NMI) score. Moreover, the Wilcoxon signed-rank test reveals that the differences in favor of DBC are statistically significant, as p-values well below the significance level (α = 0.05) are obtained. In order to provide the reader with a more comprehensive comparison between DBC and IBC, we quantified the differences in the performance of both consensus functions. In an individual analysis of all the experiments conducted, we observed that DBC is faster than IBC by a statistically significant margin in 82% of the experiments, with relative CPU time differences between 2% and 80%.
We believe the reason for this is that the main difference between IBC and DBC lies in the fact that the former requires executing the Borda voting process on duplex cluster ensembles l − 1 times, while the latter consists in running Borda voting on an l-way cluster ensemble, but just once. That is, both sequencing strategies differ in the number of voters l considered in each election. However, the strongest dependency of the time complexity of the Borda voting method is on k (the number of clusters), in which it is superlinear, O(k log k), while it is linear in the number of voters l and elections n. Therefore, the reduction of l inherent to IBC has no effect on the overall time complexity of this consensus function when compared to DBC. As regards the quality of the consensus clusterings delivered by DBC and IBC, it was observed that, in 54% of the experiments conducted, the former attains higher φ(NMI)(c, kc) scores than the latter. This suggests that voting sequentially on partial views of the cluster ensemble harms the results of the overall voting process. Therefore, we conclude that the optimal Borda voting based consensus function is DBC, not only from a computational standpoint, but also as far as the quality of the consensus clustering obtained is concerned. For this reason, DBC will be the Borda voting based consensus function subject to comparison against the state-of-the-art in Section 5.2.
5.1.2. Copeland voting (DCC vs. ICC)
The observation of Table 3 reveals that Direct Copeland Consensus (DCC) is clearly superior to Iterative Copeland Consensus (ICC). The Wilcoxon statistical test supports this claim, as p-values far below the significance level α = 0.05 indicate. From a quantitative perspective, the comparison of both consensus functions in terms of the CPU time required for their execution revealed that DCC is dramatically faster than ICC. In fact, differences between the running times of both consensus functions were statistically significant in nearly 86% of all the experiments conducted. Moreover, the relative differences between the CPU running times ranged from 3% to 85% in favor of DCC. That is, the inherent reduction of l in ICC is not sufficient for making this consensus function faster than DCC, given the superlinear dependence on k of the time complexity of Copeland voting (a trend that has also been observed in the consensus functions based on Borda voting). If the quality of the consensus clusterings delivered by both consensus functions is compared, it can be observed that ICC performs dramatically worse than DCC. In fact, the latter attained higher φ(NMI)(c, kc) scores than the former, and the differences between them were statistically significant in 75% of the experiments conducted. We conjecture that the poor quality of the consensus clusterings delivered by ICC is due to the fact that the Copeland voting method requires a global view of the votes cast by the voters in order to conduct the exhaustive pairwise candidate ranking comparison across voters successfully. These results allow us to draw the conclusion that DCC is clearly superior to ICC in terms of the two comparison parameters evaluated in this work, so we select DCC as the representative of Copeland voting based consensus functions.

5.1.3. Product voting (DPC vs.
IPC)
The performance comparison of the two consensus functions based on the product voting rule (i.e. Direct Product Consensus—DPC and Iterative Product Consensus—IPC) presented in Table 3 reveals that the latter is the optimal option in this case. Indeed, as far as the CPU time required for running each consensus function is concerned, it can be observed that sequencing the cluster disambiguation and voting processes following an iterative approach is beneficial, and statistically significant differences between the running times of DPC and IPC—always in favor of the latter—are found according to the Wilcoxon significance test. Here, the superiority of IPC in terms of CPU running time is due to the linear dependency of the product voting rule complexity on the number of candidates (k), voters (l) and elections (n). From a quantitative perspective, IPC was faster than DPC in 76% of the conducted experiments, and the relative differences between them ranged from 7% to 124% in favor of the former. When DPC and IPC were compared in terms of the quality of the consensus clusterings they deliver, we also found that IPC yields higher quality consensus partitions, although the Wilcoxon significance test indicated that no statistically significant differences exist between them as far as this evaluation parameter is concerned. However, although the differences between both consensus functions are only statistically significant as regards their running time, the existing differences are always in favor of IPC (i.e. lower CPU time and higher φ(NMI)). For this reason, we choose IPC as the best performing consensus function based on the product voting rule.

5.1.4. Sum voting (DSC vs. ISC)
The results presented in Table 3 allow us to evaluate the performance of the two consensus functions based on the sum voting scheme (i.e. DSC and ISC).
Analogously to what has been observed for the consensus functions based on the product voting rule, the iterative approach is more computationally efficient than the direct one (that is, ISC is faster than DSC). Moreover, the differences in computational cost between both consensus functions are statistically significant, as p_CPU time < 0.05. From a quantitative standpoint, ISC is from 16% to 144% faster than DSC. However, DSC yields, in general, higher quality consensus clusterings than ISC, and statistically significant differences in favor of the former are found according to the Wilcoxon signed-rank test. Therefore, we face a dilemma as regards the selection of the best performing consensus function based on the sum voting rule: ISC outperforms DSC from a computational perspective, whereas DSC yields higher quality consensus partitions than ISC. In both respects, moreover, the differences between both consensus functions are statistically significant.
X. Sevillano et al. / Fuzzy Sets and Systems 193 (2012) 1 – 32
In order to choose one of them for the subsequent comparison against the state-of-the-art, we have measured the relative differences between their CPU running times and (NMI) scores. This analysis reveals that, in relative average terms, ISC is 53.2% faster than DSC. However, the (NMI) scores of the consensus clusterings delivered by DSC are only 4.7% higher (on average) than those yielded by ISC. Consequently, we conclude that ISC is the best performing sum voting rule based consensus function, as the lower quality of its consensus partitions is largely compensated by its much greater computational efficiency.

5.1.5. Considerations on computational complexity
As a result of the analysis conducted in the previous sections, we have come up with the optimal proposed consensus functions for each voting scheme (i.e. DBC, DCC, IPC and ISC), which will be subject to comparison against several state-of-the-art consensus functions in the upcoming section. Prior to that, however, it is worth conducting a deeper analysis of the characteristics of our proposals from the computational complexity perspective. Throughout the following paragraphs, we analyze, by means of experimentation, how the computational complexity of the proposed consensus functions depends on the three main parameters that define the computational complexity of the consensus clustering problem: the number of clusters k, the number of objects n and the cluster ensemble size l. In general terms, the reader may have noticed that consensus functions based on confidence and positional voting methods show, in computational terms, quite opposite behaviors depending on whether they use the direct or the iterative sequencing scheme. Recall that the difference between the two schemes lies in the fact that the former executes the voting process once on an l-way cluster ensemble, whereas the latter runs the voting process l − 1 times on a duplex (i.e. two-component) cluster ensemble.
The differences observed in the computational behavior of confidence and positional voting based consensus functions with respect to the use of the direct or the iterative sequencing schemes are caused by the fact that the time complexity of confidence (i.e. sum and product) voting rules is linear in n, l and k, while the dependence of the time complexity of Borda and Copeland voting is linear in n and l but superlinear in k. Thus, the reduction of l inherent to the iterative sequencing approach gives rise to more computationally efficient consensus functions when these are based on confidence voting methods, in contrast to what is observed in those consensus functions based on positional voting techniques, as the time complexity of these is dominated by a superlinear term depending on k. On a more detailed scale, it is worth observing how the difference between the running times of the iterative and direct versions of the consensus functions grows as the size of the cluster ensemble l increases. This can be observed in Fig. 2, which depicts the CPU time boxplots corresponding to the execution of the eight consensus functions proposed in this work over soft cluster ensemble bootstrap samples of sizes l/20, l/10, l/5, l/2 and l on the Ionosphere data collection7 (with l = 582). For instance, it can be observed that, as l grows large, the execution of l − 1 runs of the Borda and Copeland voting methods becomes more costly, although voting only involves two cluster ensemble components per run (IBC and ICC). In contrast, the increase of l has a smaller impact on the running time of the DBC and DCC consensus functions, as the time complexity of a single run of Borda and Copeland voting is less dependent on l than it is on k. As mentioned earlier, the opposite behavior is observed in the confidence voting based consensus functions (DPC vs. IPC, and DSC vs. ISC).
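To make the k-dependent cost concrete, here is a minimal sketch of Borda positional voting over an aligned soft ensemble (our own illustrative code, reduced to crisp label extraction; the paper's DBC additionally derives a soft partition from the scores). The per-object, per-voter sort over the k candidates is what introduces the superlinear dependence on k.

```python
import numpy as np

def borda_points(ensemble):
    """Sketch of Borda positional voting: each voter ranks the k clusters
    per object by membership degree; the cluster in rank position r earns
    k - 1 - r points. The sort costs O(k log k) per object and voter,
    hence the superlinear dependence on k."""
    k, n = ensemble[0].shape
    points = np.zeros((k, n))
    cols = np.arange(n)
    for K in ensemble:                  # l voters
        order = np.argsort(-K, axis=0)  # position 0 = most preferred cluster
        for pos in range(k):
            points[order[pos], cols] += k - 1 - pos
    return points

K1 = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.4]])
K2 = np.array([[0.8, 0.3, 0.7], [0.2, 0.7, 0.3]])
labels = borda_points([K1, K2]).argmax(axis=0)   # crisp consensus labels
```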
This evidence should be taken into consideration by users when having to combine a large number of clusterings, depending on the voting method underlying the consensus function employed. Secondly, it is also interesting to analyze how the running time of the consensus functions depends on the size of the data set n as well as on the number of clusters k, considering whether the direct or the iterative sequencing scheme is applied. To that effect, we conducted an experiment in which the eight proposed consensus functions were run on cluster ensembles of size l = 30 on each data collection. The average running times of each consensus function are depicted in Fig. 3. On the horizontal axis, data sets are arranged in increasing order according to the number of objects they contain (see Table 1 for a reference of the size and number of classes of the data sets). In general terms, it can be observed how the average execution time of the consensus functions grows with the data set size, as expected. It is also noticeable that this monotonic growth is interrupted at some points, e.g. the Glass and

7 The same behavior was observed on the remaining data collections, although we only present the results on the Ionosphere data set for illustration purposes.
Fig. 2. CPU time boxplots of the proposed consensus functions for different cluster ensemble sizes on the Ionosphere data collection (where l = 582). (a) Borda voting consensus functions. (b) Copeland voting consensus functions. (c) Product voting consensus functions. (d) Sum voting consensus functions.
Fig. 3. CPU time mean values of the proposed consensus functions (with a cluster ensemble size of l = 30) on the 11 data sets, sorted by the number of objects they contain (see Table 1).
miniNG data sets. This is due to the fact that these two data sets, despite being smaller than the ones following them in the succession (i.e. Ionosphere and BBC, respectively), have a comparatively larger number of clusters. For instance, miniNG has 10% fewer objects than BBC (n = 2000 vs. n = 2225), but its number of classes is four times as large (k = 20 vs. k = 5), which increases the running time of the consensus functions accordingly.
These results illustrate the computational behavior of our proposals from a practical viewpoint, supporting the claims made in the theoretical computational complexity analysis presented in Section 3.2.

5.1.6. Considerations on the associativity of the iterative sequencing scheme
As mentioned in Section 3.3, it is reasonable to ask whether the random ordering of the cluster ensemble components that is inherent to the iterative sequencing scheme affects the quality of the consensus partitions we generate. The following paragraphs are devoted to the experimental analysis of this issue. In each of our experimental runs, we have ordered the cluster ensemble components in a fully random manner. If the ordering of the cluster ensemble components affected the quality of the consensus partition, the (NMI) scores of the iterative sequencing version of each consensus function should attain significantly different values, exhibiting a much larger standard deviation than its direct sequencing counterpart (in which all cluster ensemble components are combined at once). A glimpse at Table 3 shows that this is not the case. On the contrary, the average standard deviations of the (NMI) scores of IPC and ISC are exactly the same as those of DPC and DSC, whereas this parameter is very similar between IBC and DBC, and between ICC and DCC. Thus, it seems that the random arrangement of the cluster ensemble components in the iterative sequencing scheme has little effect on the quality of the consensus clustering. A more detailed analysis can be conducted if we compare the standard deviations of the (NMI) scores of the iterative and direct versions of each consensus function on each separate data collection (presented in Tables B2, B4, B6 and B8 in Appendix B). In global terms, the iterative sequencing version of the consensus functions exhibits an equal or smaller standard deviation in (NMI) score than its direct sequencing counterpart in more than half of the experiments conducted.
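The iterative sequencing scheme under study can be sketched as a pairwise fold over a randomly ordered ensemble. The two-way combiner below uses the sum rule for brevity; the helper names and toy data are ours, not the paper's.

```python
import random
import numpy as np

def combine_pair(Ka, Kb):
    """Hypothetical two-way combiner (sum rule on aligned partitions)."""
    votes = Ka + Kb
    return votes / votes.sum(axis=0, keepdims=True)

def iterative_consensus(ensemble, seed=0):
    """Sketch of the iterative sequencing scheme: randomly order the
    l ensemble components, then perform l - 1 two-way combinations."""
    parts = list(ensemble)
    random.Random(seed).shuffle(parts)   # the random ordering under study
    result = parts[0]
    for K in parts[1:]:
        result = combine_pair(result, K)
    return result

# Toy ensemble: three aligned fuzzy partitions (k = 2 clusters, n = 2 objects)
K1 = np.array([[0.9, 0.2], [0.1, 0.8]])
K2 = np.array([[0.8, 0.3], [0.2, 0.7]])
K3 = np.array([[0.7, 0.4], [0.3, 0.6]])
out_a = iterative_consensus([K1, K2, K3], seed=0)
out_b = iterative_consensus([K1, K2, K3], seed=1)  # a different ordering
```

Different seeds weight the components slightly differently (the renormalized fold is not exactly associative), yet the resulting partitions typically agree, in line with the observations above.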
Therefore, we conclude that the ordering of the cluster ensemble components in the iterative sequencing scheme has little influence (if any) on the quality of the resulting consensus clustering. In other words, it can be conjectured that the consensus partitions obtained on distinctly ordered cluster ensembles tend to converge to a similar solution regardless of their ordering when combined using the iterative sequencing approach.

5.2. Comparison vs. state-of-the-art consensus functions
The experiments described in the previous section indicate that, among our proposals, the best performing consensus functions associated with each voting method are DBC, DCC, IPC and ISC. This section presents several experiments intended to compare these four consensus functions with several state-of-the-art alternatives in terms of the quality of the consensus clustering solutions obtained and their time complexity. To that effect, we have executed the CSPA, EAC, HGPA, MCLA and VMA consensus functions on the same data collections and soft cluster ensemble bootstrap samples as our four proposals, measuring the CPU time (in seconds) required for their execution and computing the quality of the consensus clusterings they yield in terms of the normalized mutual information with respect to the ground truth of each data set. Table 4 describes the performance of the nine compared consensus functions in terms of their average CPU running time and (NMI) scores (± for standard deviations). The best global result (i.e. lowest CPU time and highest (NMI) score) in each case is highlighted in boldface. Moreover, in order to show how well each consensus function performs compared to the remaining ones, we also provide the ranking of each one of them averaged across all the data collections employed in our experiments (the lower the rank value, the better the performance). Furthermore, as we are conducting a two-fold evaluation of each consensus function (i.e.
we have two evaluation parameters), we have averaged the CPU time and (NMI) score average rankings, obtaining as a result the mean average ranking of each algorithm (presented in the rightmost column of Table 4). This constitutes an indicator of their relative overall performance. A first glance at Table 4 reveals that IPC excels over the remaining consensus functions, both in terms of running time and consensus clustering quality. As regards time complexity, notice how, on average, IPC attains the lowest running time, followed by VMA, HGPA and ISC. Accordingly, these four consensus functions occupy the first four spots of the CPU time average ranking. On the other hand, it can be observed that our proposals based on positional voting (i.e. DBC and DCC) occupy the last positions of the CPU time ranking, although DBC's average ranking is not far from that of CSPA.
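The ranking scheme behind Table 4 can be sketched as follows; the scores below are illustrative placeholders, not the paper's measurements.

```python
import numpy as np

def avg_rank(scores):
    """Rank the methods on each data set (rank 1 = best, i.e. smallest
    value) and average the ranks across data sets. Ties are broken by
    order here; the paper's tables use standard average ranks."""
    ranks = scores.argsort(axis=1).argsort(axis=1) + 1
    return ranks.mean(axis=0)

# rows = data sets, columns = methods A, B, C (illustrative numbers)
cpu = np.array([[1.2, 0.4, 3.0],
                [0.9, 0.5, 2.1]])          # lower CPU time is better
nmi = np.array([[0.40, 0.45, 0.20],
                [0.38, 0.44, 0.25]])       # higher NMI is better

cpu_rank = avg_rank(cpu)
nmi_rank = avg_rank(-nmi)                  # negate so that smaller = better
mean_avg_rank = (cpu_rank + nmi_rank) / 2  # the rightmost column of Table 4
```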
Table 4 Average ± standard deviation running CPU time, normalized mutual information (and associated average rankings), and mean average ranking of each consensus function. The CPU times, (NMI) scores and average rankings of the best performing consensus functions are highlighted in boldface.
Method   CPU time           Avg. rank   (NMI)          Avg. rank   Mean avg. rank
CSPA     6.91 ± 6.24        6.55 (7)    0.37 ± 0.20    5.09 (6)    5.82 (7)
EAC      405.70 ± 415.68    6.36 (6)    0.06 ± 0.06    8.36 (8)    7.36 (9)
HGPA     0.81 ± 4.53        2.45 (1)    0.02 ± 0.04    8.55 (9)    5.50 (5)
MCLA     4.13 ± 2.17        6.18 (5)    0.32 ± 0.16    6.73 (7)    6.45 (8)
VMA      0.73 ± 5.17        2.73 (3)    0.45 ± 0.23    3.00 (2)    2.86 (2)
DBC      3.19 ± 9.17        6.73 (8)    0.44 ± 0.23    3.73 (4)    5.23 (4)
DCC      25.29 ± 94.22      8.00 (9)    0.44 ± 0.22    3.36 (3)    5.68 (6)
IPC      0.63 ± 9.17        2.55 (2)    0.46 ± 0.23    2.09 (1)    2.32 (1)
ISC      0.81 ± 2.26        3.45 (4)    0.44 ± 0.21    4.09 (5)    3.77 (3)
Table 5 Pairs of consensus functions considered statistically different in terms of CPU time according to the Shaffer post hoc test, and their corresponding adjusted p-values.

Pair   Hypothesis      Adjusted p-value
1      HGPA vs. DCC    4.0922e−4
2      IPC vs. DCC     6.9244e−4
3      VMA vs. DCC     0.0014
4      ISC vs. DCC     0.0084
5      HGPA vs. DBC    0.011
6      IPC vs. DBC     0.0221
7      VMA vs. DBC     0.0406
In terms of the quality of the consensus partitions obtained, our four proposals perform similarly, and they are comparable to VMA, while being clearly superior to the remaining state-of-the-art consensus functions. In fact, IPC, DCC, DBC and ISC are placed in the first, third, fourth and fifth places of the (NMI) score average ranking, respectively. When the mean average ranking is computed by averaging the CPU time and (NMI) score rankings, we observe that three of our proposals (IPC, ISC and DBC) are found among the four highest ranked consensus functions, with IPC ranked in first place. If we analyze whether the differences between the compared consensus functions are statistically significant, the Iman-Davenport test indicates they are, both in terms of CPU time and (NMI) score. In particular, this test returns p-values far below the α = 0.05 significance level (p = 1.30e−10 in the CPU time comparison, and p = 9.42e−16 in the (NMI) score comparison). For this reason, we employ the Shaffer post hoc test in order to determine between which pairs of consensus functions the differences are statistically significant. The results of running the Shaffer test on the CPU time and the (NMI) score results are presented in Tables 5 and 6, respectively. The data are presented in these two tables according to the following convention: we only present the pairs of consensus functions between which the Shaffer test finds statistically significant differences. In those cases, the first consensus function is statistically better than the second one. Moreover, we show the adjusted p-value associated with each comparison. As Table 5 reveals, the largest differences found between consensus functions in terms of running time are with respect to those based on positional voting (i.e. DCC and DBC). The fastest consensus functions are HGPA, IPC, VMA and ISC (as already observed in Table 4).
DCC and DBC are the most computationally costly algorithms mainly because of the cluster ranking process they must conduct prior to voting, which penalizes them from a computational perspective. According to the results presented in Tables 4 and 6, HGPA and EAC are the consensus functions that yield the lowest quality consensus partitions. Moreover, notice that in nine of the twelve pairwise comparisons in which statistically
Table 6 Pairs of consensus functions considered statistically different in terms of (NMI) score according to the Shaffer post hoc test, and their corresponding adjusted p-values.

Pair   Hypothesis       Adjusted p-value
1      IPC vs. HGPA     1.3738e−5
2      IPC vs. EAC      9.4039e−5
3      VMA vs. HGPA     4.7113e−4
4      DCC vs. HGPA     0.0010
5      DBC vs. HGPA     0.0014
6      VMA vs. EAC      0.0030
7      DCC vs. EAC      0.0060
8      DBC vs. EAC      0.0084
9      ISC vs. HGPA     0.0084
10     IPC vs. MCLA     0.0092
11     CSPA vs. HGPA    0.0319
12     ISC vs. EAC      0.0319
significant differences in terms of consensus clustering quality are found, the winning consensus function is one of our proposals. This is a clear indicator of the good performance of our voting based consensus functions as regards the quality of the consensus partitions they deliver. Finally, besides the global analysis represented by the results presented in this section, Appendix C provides the reader with the results of these pairwise comparisons detailed on each distinct data collection.

6. Discussion and lessons learned
We start this section with a digression on the origin and motivations of this work. Next, we enumerate the main lessons learned from the results of the empirical evaluation of our proposals, including some usage recommendations. The main motivation behind the proposals put forward in this paper lies in the fact that most of the literature on consensus clustering is focused on hard cluster ensembles, i.e. the combination of the outcomes of multiple crisp clustering systems into a consolidated crisp partition. In our opinion, however, fuzzy clustering processes are much more informative than their crisp counterparts, as they provide an indication of the strength of association between objects and clusters, information that can be highly valuable due to the fact that objects may naturally belong to more than one cluster. For this reason, our goal has been the design of consensus functions that create fuzzy consensus clusterings upon fuzzy cluster ensembles. The initial source of inspiration for the soft consensus functions just presented was metasearch (also known as information fusion) systems, the main purpose of which is to obtain improved search results by combining the ranked lists of documents returned by multiple search engines in response to a given query.
Although the resemblance between metasearch and consensus clustering was already reported in [11,28], direct inspiration came from the works of Aslam and Montague [46,47], where metasearch algorithms based on positional voting were devised—notice that this type of voting technique lends itself naturally to this context, as search engines return lists of ranked documents. From that point on, the analogy between object-to-cluster association scores in a soft cluster ensemble and voters' preferences for candidates – i.e. considering clusters as candidates, cluster ensemble components as voters, and the association of each object to the clusters as an election – became the key issue for deriving consensus functions based on voting techniques. We have structured the performance study of our proposals in two sections. The first part of our analysis deals with the comparison between consensus functions based on the same voting method (i.e. confidence voting—sum and product rules—and positional voting—Borda and Copeland rules) but using different strategies for sequencing the cluster disambiguation and voting processes (i.e. direct or iterative). From this study, we have learnt the following lessons: • When the cluster disambiguation and voting processes are sequenced following the iterative scheme, the computationally cheap confidence voting based consensus functions are the least time consuming option, as the voting process is conducted repeatedly on partial views of the cluster ensemble. In contrast, positional voting methods benefit from
the direct sequencing approach due to their higher computational complexity. However, confidence voting consensus functions are faster than their positional voting counterparts, as the candidate ranking process involved in the latter increases their running time, making their time complexity have a superlinear dependence on the number of clusters. Among them, the consensus function based on Copeland voting is the slowest one, due to the exhaustive pairwise candidate confrontation implicit in this voting method. • The quality of the clusterings delivered by consensus functions based on positional voting is negatively affected by the use of the iterative sequencing strategy, as they require a global view of the votes cast by the voters. On the other hand, confidence voting consensus functions are rather insensitive to this issue, which allows us to draw a univocal optimality correspondence between direct sequencing and positional voting, and between iterative sequencing and confidence voting. Therefore, the proposed consensus functions that show an optimal performance are Direct Borda Consensus (DBC), Direct Copeland Consensus (DCC), Iterative Product Consensus (IPC) and Iterative Sum Consensus (ISC). • When the performance of the DBC, DCC, IPC and ISC consensus functions is compared in terms of the quality of the consensus clusterings they deliver, we find small – but seldom statistically significant – differences. Therefore, the main criterion for preferring one over the others must be based on computational grounds. • In this sense, IPC and ISC show the best performance among our proposals, as they are capable of obtaining high quality consensus clusterings while being computationally efficient. However, our proposals based on positional voting (DBC and DCC) are the most suitable option when we have to combine fuzzy clusterings of such different natures that it is impossible to scale object-to-cluster associations properly.
In the second part of our experimental analysis, we have confronted our four optimal consensus functions with multiple representative state-of-the-art alternatives. From this comparison, we have drawn several significant conclusions: • The confidence voting based consensus functions (IPC and ISC) show a highly competitive performance: they deliver consensus clusterings of similar (or higher) quality to those yielded by the fastest state-of-the-art consensus functions, while being as fast as (or faster than) those state-of-the-art consensus functions that obtain the highest quality consensus partitions. • The positional voting based consensus functions (DBC and DCC) rank among the best performing ones in terms of consensus clustering quality, outperforming hypergraph and evidence accumulation based consensus functions by a large margin. • When compared in terms of the quality of the consensus clusterings delivered, the proposed consensus functions find their most consistent rival in VMA, as it yields consensus clusterings of comparable quality, while being as fast to execute as IPC and ISC. To conclude, the consensus functions based on voting methods introduced in this work have shown good performance when compared against several state-of-the-art alternatives in terms of time complexity and the quality of the consensus clusterings they deliver. In particular, the use of confidence voting methods together with the iterative sequencing of the cluster disambiguation and voting procedures gives rise to our most competitive proposals (i.e. IPC and ISC). Interestingly, those consensus functions based on positional voting achieve their best performance when the cluster disambiguation and voting procedures are sequenced according to the direct scheme (that is, DBC and DCC).
However, despite yielding consensus partitions of comparable quality, the computational complexity of the latter is much larger than that of the former, so their use is only advisable in case we have to combine fuzzy clusterings of such different natures that it becomes difficult to scale object-to-cluster associations properly.

7. Conclusions and further work
This paper has introduced the application of voting procedures to the fuzzy consensus clustering problem. More specifically, we have devised several soft consensus functions that allow the consolidation of a set of fuzzy clusterings into a single soft consensus partition. Multiple aspects have been taken into account when designing these consensus functions, such as (i) the need to preserve the fuzziness of the data along the whole process, (ii) the ability to combine fuzzy clusterings of different natures, and (iii) the computational complexity of the resulting consensus function. In order to meet the requirements derived from the first two points of the preceding enumeration, we have decided to implement the core of the clustering combination process by means of confidence and positional voting methods.
Moreover, due to the largely different time complexities of these two types of voting strategies, we have also devised direct and iterative ways of sequencing the cluster disambiguation and voting processes that constitute the consensus function. As a result, we have come up with four computationally optimized voting based fuzzy consensus functions, which have been subject to comparison against several representative state-of-the-art consensus functions. The conclusions drawn from this experimental comparison indicate that the proposed voting based consensus functions constitute a highly competitive alternative for tackling the fuzzy consensus clustering problem, as they are capable of yielding consensus partitions of comparable or superior quality to those obtained by state-of-the-art clustering combiners at a reasonable computational cost. In particular, our proposals are among the top performers as far as the quality of the obtained consensus clustering is concerned. Meanwhile, confidence voting based consensus functions show small time complexities, comparable to those of the fastest state-of-the-art consensus functions. An additional appealing feature of our proposals is that they naturally deliver fuzzy consensus clusterings, which is entirely natural in a soft clustering scenario. However, the lack of a fuzzy ground truth has prevented the evaluation of the soft consensus clusterings obtained as such, which constitutes one of the future directions of research of the work presented in this paper. As mentioned earlier, this would probably make the differences between the proposed consensus functions more evident, as it would highlight the differences between the distinct voting methods employed. Despite the good results attained, we consider that there exist several other issues to be addressed in this area of work, such as the following: • As mentioned in Section 4.2, the components of the fuzzy cluster ensembles employed in this work all have the same number of clusters k.
We believe that it would be of great interest to adapt our consensus functions so that they are able to combine clusterings with different numbers of clusters, which could possibly be done by completing those clusterings that have fewer clusters with dummy clusters [22]. • We are also interested in exploring cluster disambiguation techniques other than the Hungarian algorithm, analyzing whether they are worthwhile as regards their impact both on the quality of the consensus clusterings obtained and on the overall computational complexity of the resulting consensus function.
Appendix A. Dummy clusters and cluster dissimilarity measures
This appendix illustrates, by means of an ongoing toy example, two important procedures related to the construction of the fuzzy consensus functions proposed in this work: firstly, the addition of dummy clusters for combining fuzzy clusterings that contain a different number of clusters (see Appendix A.1), and secondly, the derivation of cluster dissimilarity measures between fuzzy partitions (see Appendix A.2).

A.1. Introduction of dummy clusters for combining fuzzy clusterings with different number of clusters
The goal of this section is to illustrate how consensus clustering can be conducted on fuzzy partitions with different numbers of clusters. As mentioned in Section 3.1, this can possibly be done by completing those clusterings that have fewer clusters with dummy clusters. According to the description in Section 2.1, a fuzzy clustering is usually represented by means of a k × n real-valued clustering matrix, where n is the number of objects in the data set and k is the number of clusters objects are associated to. That is, the number of rows of a fuzzy partition indicates its corresponding number of clusters. Therefore, given a soft cluster ensemble composed of the outcome of l fuzzy unsupervised classifiers, each one of which creates partitions with ki clusters (∀i ∈ {1, . . . , l}), it is straightforward to determine the maximum number of clusters as expressed in

kmax = max_{i ∈ {1,...,l}} ki    (A.1)
Once kmax is known, all we have to do is add kmax − ki (∀i ∈ {1, . . . , l}) rows to each cluster ensemble component. Each one of these rows represents a dummy cluster. As such, their contents must be such that they represent a null degree of association between the objects and that cluster.
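The padding step just described can be sketched as follows (a minimal illustration assuming membership-probability matrices; the function name is ours):

```python
import numpy as np

def pad_with_dummy_clusters(ensemble):
    """Complete each ki x n membership matrix with kmax - ki all-zero
    rows (dummy clusters), so all components have kmax rows."""
    kmax = max(K.shape[0] for K in ensemble)
    return [np.vstack([K, np.zeros((kmax - K.shape[0], K.shape[1]))])
            for K in ensemble]

rng = np.random.default_rng(0)
K1 = rng.random((3, 9))   # k1 = 3 clusters, n = 9 objects
K2 = rng.random((4, 9))   # k2 = 4 clusters -> kmax = 4
padded = pad_with_dummy_clusters([K1, K2])
```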
Let us illustrate the dummy cluster introduction process by means of the following toy example. In this case, we assume that the soft cluster ensemble contains two fuzzy partitions, namely K1 and K2 (see Eq. (A.2)). Notice that n = 9 objects, and k1 = 3 and k2 = 4 clusters (therefore, kmax = 4):

K1 = [ 0.054 0.026 0.057 0.969 0.976 0.959 0.010 0.016 0.011
       0.921 0.932 0.905 0.025 0.019 0.030 0.014 0.055 0.017
       0.025 0.042 0.038 0.006 0.005 0.011 0.976 0.929 0.972 ]

K2 = [ 0.832 0.901 0.019 0.030 0.014 0.025 0.057 0.017 0.055
       0.042 0.025 0.052 0.011 0.010 0.006 0.038 0.903 0.829
       0.026 0.054 0.829 0.909 0.876 0.919 0.705 0.010 0.016
       0.100 0.020 0.100 0.050 0.100 0.050 0.200 0.070 0.100 ]    (A.2)
As kmax = 4, a fourth dummy cluster must be added to K1 as an additional row, obtaining K1′ as a result. If we consider, without loss of generality, that the scalars that constitute our fuzzy partitions represent membership probabilities, the added dummy cluster should be an all-zero row, as presented in

K1′ = [ 0.054 0.026 0.057 0.969 0.976 0.959 0.010 0.016 0.011
        0.921 0.932 0.905 0.025 0.019 0.030 0.014 0.055 0.017
        0.025 0.042 0.038 0.006 0.005 0.011 0.976 0.929 0.972
        0     0     0     0     0     0     0     0     0     ]    (A.3)
If other object-to-cluster association metrics were employed, suitable values should be given to the elements of the rows representing dummy clusters (e.g. if object-to-cluster association was expressed in terms of object to cluster centroid distances, the added rows should contain extremely high values in order to represent that the degree of association between the objects and that cluster is extremely low). In any case, once the dummy clusters are added, the cluster disambiguation and voting processes can be conducted with no further modification.

A.2. Computing cluster dissimilarity
The example just introduced in Appendix A.1 will serve as a guideline for describing how cluster dissimilarity measures can be obtained. In order to measure cluster dissimilarity between the pair of fuzzy partitions K1′ and K2 (see Eq. (A.4)), first we compute the cluster similarity matrix SK1′,K2 upon the fuzzy partitions themselves once the dummy clusters have been added. As we will shortly see, the introduction of dummy clusters does not affect the way cluster similarity is computed:

K1′ = [ 0.054 0.026 0.057 0.969 0.976 0.959 0.010 0.016 0.011
        0.921 0.932 0.905 0.025 0.019 0.030 0.014 0.055 0.017
        0.025 0.042 0.038 0.006 0.005 0.011 0.976 0.929 0.972
        0     0     0     0     0     0     0     0     0     ]

K2 = [ 0.832 0.901 0.019 0.030 0.014 0.025 0.057 0.017 0.055
       0.042 0.025 0.052 0.011 0.010 0.006 0.038 0.903 0.829
       0.026 0.054 0.829 0.909 0.876 0.919 0.705 0.010 0.016
       0.100 0.020 0.100 0.050 0.100 0.050 0.200 0.070 0.100 ]    (A.4)
As in this example object-to-cluster associations are expressed in terms of membership probabilities, the cluster similarity matrix is computed as the simple matrix product between the fuzzy partitions, as described by Eq. (A.5). In case object-to-cluster associations were expressed by means of other metrics, an ad hoc but analogous procedure should be devised:

SK1′,K2 = K1′ K2^T = [ 0.138 0.056 2.675 0.210
                       1.628 0.174 0.902 0.214
                       0.185 1.686 0.767 0.366
                       0     0     0     0     ]    (A.5)

The interpretation of the contents of SK1′,K2 is that its (i,j)th element is proportional to the similarity between the ith cluster of K1′ and the jth cluster of K2. Therefore, if analyzed columnwise, the SK1′,K2 matrix in Eq. (A.5) reveals that the first cluster of K2 presents the highest degree of similarity with respect to the second cluster of K1′, whereas the second cluster of K2 is most similar to the third cluster of K1′. On the other hand, the third cluster of K2 is highly similar to the first cluster of K1′, while the fourth cluster of K2 resembles the third cluster of K1′ in contents. As expected, none of the clusters in K2 exhibits the least degree of similarity with respect to the fourth cluster of K1′, which is the added dummy cluster. Finally, in order to solve the weighted bipartite matching problem using the Hungarian method implementation of [33], it is necessary to transform SK1′,K2 into a cluster dissimilarity matrix DK1′,K2. In our case, as the Hungarian method implementation employed in this work does not require that the cluster dissimilarity measures verify any special property as regards their scale, DK1′,K2 was obtained by subtracting SK1′,K2 from a k × k constant matrix, as expressed in

DK1′,K2 = f · 1k×k − SK1′,K2    (A.6)
where $\mathbf{1}_{k \times k}$ is a $k \times k$ ones matrix, and $f$ is any scalar factor large enough to ensure that $\mathbf{D}_{\mathbf{K}_1,\mathbf{K}_2}$ is a nonnegative matrix.

Appendix B. Pairwise comparison between our proposals across all data collections

The goal of this appendix is to complement the global results presented in Section 5.1, reinforcing the decisions regarding which of the proposed consensus functions perform best by presenting a more detailed dissection of the obtained results. To this end, we present the results of the pairwise comparison of the eight proposed consensus functions grouped by voting method. That is, we compare the consensus functions based on the same voting technique (i.e. DBC vs. IBC, DCC vs. ICC, etc.), just as in Section 5.1, but with one difference: in the following paragraphs, we detail the performance attained by each consensus function on each of the 11 data collections employed in our experiments. Again, the performance of the consensus functions is evaluated in terms of their running time and the quality of the consensus partitions they deliver. We report the mean ± standard deviation of these two evaluation parameters, together with the p-value of the pairwise Wilcoxon statistical test, which indicates whether the differences between each pair of consensus functions are significant (if p < 0.05) or not, and to what extent.

B.1. Borda voting (DBC vs. IBC)

The performance indices corresponding to the execution of the Direct Borda Consensus (DBC) and Iterative Borda Consensus (IBC) functions on the 11 data collections employed in this work are summarized in Tables B1 and B2. As regards the computational performance of DBC and IBC, it can be observed from Table B1 that the former is faster than the latter on all the data collections, attaining statistically significant differences (i.e. p-values below 0.05) on eight of the 11 data collections employed in our experiments.
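The cluster disambiguation step illustrated by Eqs. (A.5) and (A.6) above can be sketched in code. This is a minimal stdlib-only sketch: a brute-force search over permutations stands in for the Hungarian method implementation of [33] (which solves the same matching in O(k³) rather than O(k!)), and the value f = 10 is an arbitrary choice that keeps D nonnegative.

```python
import itertools

# Cluster similarity matrix of Eq. (A.5): S[i][j] is proportional to the
# similarity between the i-th cluster of K1 and the j-th cluster of K2.
# The fourth (all-zero) row corresponds to the dummy cluster added to K1.
S = [[0.138, 0.056, 2.675, 0.210],
     [1.628, 0.174, 0.902, 0.214],
     [0.185, 1.686, 0.767, 0.366],
     [0.000, 0.000, 0.000, 0.000]]

# Eq. (A.6): turn similarities into nonnegative dissimilarities.
f = 10.0  # any scalar large enough to keep D nonnegative
D = [[f - s for s in row] for row in S]

# Minimum-cost one-to-one matching between clusters; brute force over the
# k! permutations stands in for the Hungarian method of [33].
k = len(D)
match = min(itertools.permutations(range(k)),
            key=lambda p: sum(D[i][p[i]] for i in range(k)))
# match[i] is the K2 cluster assigned to the i-th K1 cluster; the dummy
# cluster absorbs whichever K2 cluster is left unassigned.
print(match)  # -> (2, 0, 1, 3)
```

The resulting matching agrees with the columnwise reading of Eq. (A.5): clusters 1, 2 and 3 of K2 are mapped to clusters 2, 3 and 1 of K1, respectively, and the remaining cluster of K2 is absorbed by the dummy cluster.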
As far as the quality of the consensus partitions delivered by DBC and IBC is concerned, the inspection of Table B2 reveals that DBC tends to yield better partitions than IBC in seven of the 11 data sets, reaching statistically significant
Table B1
Average ± standard deviation running CPU times (measured in seconds) of the direct and iterative Borda consensus functions (DBC and IBC) on the 11 data collections employed in this work. The CPU time of the best performing consensus function is highlighted in boldface. The p-value returned by the Wilcoxon paired signed-rank test corresponding to each pairwise comparison is presented, with values lower than α = 0.05 indicating the existence of statistically significant differences between DBC and IBC.

CPU time     DBC             IBC             p-value
Zoo          0.53 ± 0.68     0.81 ± 1.10     0.1553
Iris         0.24 ± 0.13     0.36 ± 0.27     1.3197e−4
Wine         0.68 ± 0.56     1.34 ± 1.22     1.4844e−7
Glass        0.71 ± 0.74     1.14 ± 1.21     0.0115
Ionosphere   2.15 ± 2.43     4.61 ± 4.58     6.4072e−11
WDBC         3.96 ± 4.44     9.30 ± 9.50     2.9993e−12
Balance      0.37 ± 0.25     0.63 ± 0.54     4.5630e−5
MFeat        1.06 ± 0.92     2.22 ± 2.38     9.1098e−5
miniNG       2.80 ± 4.48     3.54 ± 5.99     0.4418
BBC          8.66 ± 9.85     12.20 ± 12.95   0.4145
PenDigits    19.47 ± 23.49   26.41 ± 29.92   0.0125
Table B2
Average ± standard deviation φ(NMI)(γ, λc) scores of the direct and iterative Borda consensus functions (DBC and IBC) on the 11 data collections employed in this work. The φ(NMI)(γ, λc) score of the best performing consensus function is highlighted in boldface. The p-value returned by the Wilcoxon paired signed-rank test corresponding to each pairwise comparison is presented, with values lower than α = 0.05 indicating the existence of statistically significant differences between DBC and IBC.

φ(NMI)(γ, λc)   DBC           IBC           p-value
Zoo             0.68 ± 0.07   0.68 ± 0.07   0.9926
Iris            0.66 ± 0.09   0.65 ± 0.09   0.095
Wine            0.54 ± 0.12   0.48 ± 0.12   1.7656e−9
Glass           0.30 ± 0.06   0.29 ± 0.05   0.0015
Ionosphere      0.06 ± 0.02   0.06 ± 0.05   4.3556e−4
WDBC            0.46 ± 0.13   0.25 ± 0.18   5.4166e−36
Balance         0.12 ± 0.10   0.12 ± 0.10   0.8447
MFeat           0.57 ± 0.08   0.56 ± 0.07   0.4779
miniNG          0.26 ± 0.08   0.26 ± 0.08   1
BBC             0.54 ± 0.19   0.41 ± 0.18   0.0030
PenDigits       0.61 ± 0.04   0.62 ± 0.04   0.1471
differences on five of them. It is important to note that IBC performs better than DBC only on the PenDigits data collection, and even there the difference in favor of IBC is not statistically significant. This detailed view of the comparison between DBC and IBC reinforces the conclusions drawn in Section 5.1, where DBC was chosen as the best performing consensus function based on the Borda voting scheme.

B.2. Copeland voting (DCC vs. ICC)

The performance analysis of the Direct Copeland Consensus (DCC) and Iterative Copeland Consensus (ICC) functions on the 11 data sets is presented in Tables B3 and B4. From a computational perspective, DCC outperforms ICC by a wide margin on all the collections, with statistically significant differences on 10 of the 11 data sets (attaining, moreover, p-values well below the α = 0.05 significance level in most cases), as shown in Table B3.

When compared in terms of the quality of the generated consensus partitions, DCC is clearly superior to ICC on all data sets, as it always reaches higher φ(NMI)(γ, λc) scores (see Table B4). Moreover, the differences between both consensus functions are statistically significant on all data sets, by a very wide margin.
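The paired Wilcoxon signed-rank test that produces the p-values reported throughout Tables B1–B8 [44] can be sketched as follows. This is a stdlib-only sketch using the large-sample normal approximation; a production analysis would typically use e.g. scipy.stats.wilcoxon, which also offers an exact small-sample p-value. The CPU-time samples below are hypothetical, not taken from the tables.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided paired Wilcoxon signed-rank test, normal approximation."""
    # Signed differences; zero differences are discarded, as usual.
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    # Normal approximation to the null distribution of W+.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical CPU times (seconds) of a "direct" and an "iterative" variant
# over ten paired runs; the iterative one is consistently slower here.
direct    = [0.5, 0.3, 0.7, 0.6, 2.1, 3.9, 0.4, 1.1, 2.8, 8.7]
iterative = [0.8, 0.4, 1.3, 1.1, 4.6, 9.3, 0.6, 2.2, 3.5, 12.2]
p = wilcoxon_signed_rank(direct, iterative)
print(p < 0.05)  # -> True: the difference is statistically significant
```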
Table B3
Average ± standard deviation running CPU times (measured in seconds) of the direct and iterative Copeland consensus functions (DCC and ICC) on the 11 data collections employed in this work. The CPU time of the best performing consensus function is highlighted in boldface. The p-value returned by the Wilcoxon paired signed-rank test corresponding to each pairwise comparison is presented, with values lower than α = 0.05 indicating the existence of statistically significant differences between DCC and ICC.

CPU time     DCC               ICC               p-value
Zoo          2.25 ± 2.47       8.34 ± 9.42       3.5754e−5
Iris         0.33 ± 0.20       0.69 ± 0.62       4.3800e−10
Wine         0.86 ± 0.68       2.84 ± 2.82       3.2832e−14
Glass        1.97 ± 1.77       6.63 ± 6.75       6.4808e−14
Ionosphere   2.13 ± 2.45       7.37 ± 7.44       6.5306e−18
WDBC         3.97 ± 4.42       14.30 ± 15.17     4.5350e−17
Balance      0.54 ± 0.36       1.36 ± 1.29       2.3355e−10
MFeat        10.84 ± 10.07     21.53 ± 23.59     0.0173
miniNG       315.88 ± 335.73   665.05 ± 719.81   0.1679
BBC          30.01 ± 30.79     66.89 ± 68.91     0.0394
PenDigits    230.11 ± 208.60   492.07 ± 467.16   7.0735e−4
Table B4
Average ± standard deviation φ(NMI)(γ, λc) scores of the direct and iterative Copeland consensus functions (DCC and ICC) on the 11 data collections employed in this work. The φ(NMI)(γ, λc) score of the best performing consensus function is highlighted in boldface. The p-value returned by the Wilcoxon paired signed-rank test corresponding to each pairwise comparison is presented, with values lower than α = 0.05 indicating the existence of statistically significant differences between DCC and ICC.

φ(NMI)(γ, λc)   DCC           ICC           p-value
Zoo             0.68 ± 0.07   0.27 ± 0.27   3.3007e−30
Iris            0.66 ± 0.09   0.52 ± 0.21   5.2133e−12
Wine            0.53 ± 0.12   0.10 ± 0.16   3.3327e−72
Glass           0.31 ± 0.06   0.09 ± 0.11   5.7541e−54
Ionosphere      0.06 ± 0.02   0.03 ± 0.06   5.7137e−33
WDBC            0.46 ± 0.13   0.03 ± 0.08   3.4152e−84
Balance         0.13 ± 0.10   0.08 ± 0.08   3.5137e−11
MFeat           0.55 ± 0.08   0.38 ± 0.11   1.4399e−11
miniNG          0.36 ± 0.12   0.03 ± 0.06   9.5307e−5
BBC             0.53 ± 0.21   0.07 ± 0.15   6.5786e−14
PenDigits       0.64 ± 0.05   0.29 ± 0.21   1.1075e−14
These results confirm the conclusion drawn in Section 5.1 as regards the selection of the optimal consensus function based on the Copeland voting scheme: DCC is, by far, the best performing consensus function of this family.

B.3. Product voting (DPC vs. IPC)

Tables B5 and B6 present the summary statistics corresponding to the execution of the consensus functions based on the product voting rule, Direct Product Consensus (DPC) and Iterative Product Consensus (IPC), on the 11 data collections employed in this work. As far as the CPU time required for running each consensus function is concerned, Table B5 shows that sequencing the cluster disambiguation and voting processes following an iterative approach is beneficial. It is important to note that statistically significant differences in favor of IPC are obtained on five of the 11 data sets, whereas in the cases in which DPC is faster than IPC (i.e. on the Iris and Balance collections), no statistically significant differences are found.

When DPC and IPC are compared in terms of the quality of the consensus clusterings they deliver (see Table B6), we find that IPC yields consensus partitions with a higher φ(NMI)(γ, λc) score on 10 of the 11 data sets, although statistically significant differences between both consensus functions are found on only three of them.
Table B5
Average ± standard deviation running CPU times (measured in seconds) of the direct and iterative Product consensus functions (DPC and IPC) on the 11 data collections employed in this work. The CPU time of the best performing consensus function is highlighted in boldface. The p-value returned by the Wilcoxon paired signed-rank test corresponding to each pairwise comparison is presented, with values lower than α = 0.05 indicating the existence of statistically significant differences between DPC and IPC.

CPU time     DPC            IPC           p-value
Zoo          0.79 ± 0.96    0.51 ± 0.59   0.0139
Iris         0.14 ± 0.07    0.16 ± 0.08   0.0663
Wine         0.32 ± 0.26    0.27 ± 0.22   0.3034
Glass        0.54 ± 0.59    0.39 ± 0.35   0.0684
Ionosphere   1.18 ± 1.79    0.39 ± 0.57   1.1739e−5
WDBC         1.98 ± 2.62    0.49 ± 0.44   2.2311e−6
Balance      0.15 ± 0.10    0.17 ± 0.14   0.4908
MFeat        0.35 ± 0.36    0.28 ± 0.20   0.6191
miniNG       6.23 ± 10.27   3.41 ± 4.35   0.7149
BBC          2.78 ± 3.58    0.29 ± 0.29   2.0894e−6
PenDigits    9.08 ± 13.39   3.54 ± 4.30   0.0015
Table B6
Average ± standard deviation φ(NMI)(γ, λc) scores of the direct and iterative Product consensus functions (DPC and IPC) on the 11 data collections employed in this work. The φ(NMI)(γ, λc) score of the best performing consensus function is highlighted in boldface. The p-value returned by the Wilcoxon paired signed-rank test corresponding to each pairwise comparison is presented, with values lower than α = 0.05 indicating the existence of statistically significant differences between DPC and IPC.

φ(NMI)(γ, λc)   DPC           IPC           p-value
Zoo             0.71 ± 0.05   0.73 ± 0.04   1.8573e−4
Iris            0.69 ± 0.09   0.69 ± 0.08   0.610
Wine            0.47 ± 0.08   0.48 ± 0.08   0.0932
Glass           0.35 ± 0.05   0.36 ± 0.05   0.0046
Ionosphere      0.04 ± 0.02   0.04 ± 0.02   0.9820
WDBC            0.50 ± 0.09   0.51 ± 0.08   0.0972
Balance         0.15 ± 0.11   0.16 ± 0.11   0.4174
MFeat           0.53 ± 0.07   0.54 ± 0.08   0.2358
miniNG          0.32 ± 0.11   0.35 ± 0.13   0.3940
BBC             0.58 ± 0.17   0.63 ± 0.16   0.0637
PenDigits       0.64 ± 0.04   0.66 ± 0.04   1.1448e−9
All these results lead us to choose IPC as the best performing consensus function based on the product voting rule, just as reported in Section 5.1.

B.4. Sum voting (DSC vs. ISC)

The results of the pairwise comparison between the Direct Sum Consensus (DSC) and Iterative Sum Consensus (ISC) functions on each of the 11 data collections employed in this work are presented in Tables B7 and B8. As regards the computational performance of DSC and ISC, the CPU time values presented in Table B7 reveal that the latter is faster than the former on all the data collections but one, attaining statistically significant differences on four of the 11 data collections employed in our experiments. Therefore, from a computational perspective, ISC seems to be the optimal consensus function based on the sum voting rule.

When DSC and ISC are compared in terms of the quality of the consensus clusterings they deliver (see Table B8), it can be observed that the former yields better quality consensus partitions than the latter on eight of the 11 data collections employed in our experiments (attaining statistically significant differences on seven of those data sets). In summary, ISC is faster than DSC, whereas DSC yields better consensus partitions than ISC. As described in Section 5.1, we decide to select ISC as the optimal consensus function based on the sum voting method. Indeed,
Table B7
Average ± standard deviation running CPU times (measured in seconds) of the direct and iterative Sum consensus functions (DSC and ISC) on the 11 data collections employed in this work. The CPU time of the best performing consensus function is highlighted in boldface. The p-value returned by the Wilcoxon paired signed-rank test corresponding to each pairwise comparison is presented, with values lower than α = 0.05 indicating the existence of statistically significant differences between DSC and ISC.

CPU time     DSC             ISC           p-value
Zoo          0.77 ± 0.87     0.59 ± 0.55   0.1781
Iris         0.14 ± 0.08     0.14 ± 0.08   0.6379
Wine         0.32 ± 0.31     0.27 ± 0.23   0.6953
Glass        0.55 ± 0.63     0.49 ± 0.52   0.9191
Ionosphere   1.19 ± 1.85     0.39 ± 0.57   1.9934e−4
WDBC         1.96 ± 2.59     0.51 ± 0.44   9.7446e−7
Balance      0.16 ± 0.13     0.14 ± 0.12   0.3674
MFeat        0.43 ± 0.41     0.35 ± 0.28   0.4372
miniNG       10.59 ± 12.16   8.90 ± 8.74   0.8218
BBC          2.85 ± 3.66     0.32 ± 0.35   5.3330e−6
PenDigits    10.23 ± 14.11   3.94 ± 4.53   9.2466e−4
Table B8
Average ± standard deviation φ(NMI)(γ, λc) scores of the direct and iterative Sum consensus functions (DSC and ISC) on the 11 data collections employed in this work. The φ(NMI)(γ, λc) score of the best performing consensus function is highlighted in boldface. The p-value returned by the Wilcoxon paired signed-rank test corresponding to each pairwise comparison is presented, with values lower than α = 0.05 indicating the existence of statistically significant differences between DSC and ISC.

φ(NMI)(γ, λc)   DSC           ISC           p-value
Zoo             0.72 ± 0.05   0.69 ± 0.07   7.5420e−10
Iris            0.69 ± 0.10   0.68 ± 0.09   0.0641
Wine            0.49 ± 0.09   0.45 ± 0.13   1.1001e−8
Glass           0.35 ± 0.05   0.32 ± 0.06   3.5873e−9
Ionosphere      0.04 ± 0.02   0.05 ± 0.04   0.2091
WDBC            0.49 ± 0.10   0.40 ± 0.16   2.5639e−22
Balance         0.15 ± 0.11   0.17 ± 0.12   0.0925
MFeat           0.50 ± 0.07   0.53 ± 0.08   0.0040
miniNG          0.35 ± 0.09   0.29 ± 0.10   0.0062
BBC             0.60 ± 0.19   0.48 ± 0.23   0.0079
PenDigits       0.65 ± 0.04   0.63 ± 0.06   1.1543e−5
an analysis of the figures presented in Tables B7 and B8 reveals that the gains in time complexity of ISC outweigh the advantages derived from the higher quality of the consensus clusterings yielded by DSC. This becomes particularly evident as the size of the data collections grows (e.g. see the BBC and PenDigits collections in Tables B7 and B8).

Appendix C. Comparison against the state-of-the-art across data collections

For each of the 11 data collections employed in this work, we have created a quality vs. time complexity diagram that makes it possible to compare the performances of the nine soft consensus functions at a glance. Quite obviously, the higher φ(NMI)(γ, λc) and the lower the CPU time, the better a consensus function performs. On these diagrams, we have plotted, for each consensus function, a rectangular region delimited by the mean and the mean plus the standard deviation of the two evaluated magnitudes (i.e. φ(NMI)(γ, λc) and CPU time) computed throughout all the experiments conducted on each data collection. That is, for each consensus function, the coordinates of the lower left corner of the associated rectangle (signaled with an identifying marker) correspond to the mean values μφ(NMI)(γ,λc) and μCPU time. The coordinates of the upper right corner of the rectangle are located at the point of the plane defined by μφ(NMI)(γ,λc) + σφ(NMI)(γ,λc) and μCPU time + σCPU time (where μ and σ symbolize mean and standard deviation, respectively).
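The rectangle construction just described can be sketched as follows. The per-run measurements below are hypothetical, and the names phi_nmi and cpu_time are illustrative; only the corner arithmetic reflects the construction used in Fig. C.1.

```python
import statistics

# Hypothetical per-run scores of one consensus function on one data
# collection: quality (phi_nmi) and CPU time in seconds.
phi_nmi  = [0.62, 0.58, 0.65, 0.60, 0.64, 0.59]
cpu_time = [1.9, 2.4, 2.1, 2.6, 2.0, 2.2]

def rectangle(xs, ys):
    """Corners of the simplified bagplot rectangle: lower-left at
    (mean_x, mean_y), upper-right at (mean_x + std_x, mean_y + std_y)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    return (mx, my), (mx + sx, my + sy)

lower_left, upper_right = rectangle(cpu_time, phi_nmi)
print(lower_left, upper_right)
```

A small rectangle close to the upper left corner of the diagram (low CPU time, high quality, low dispersion) thus identifies a fast, accurate and stable consensus function.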
[Fig. C.1 comprises eleven scatter diagrams, one per data collection, plotting φ(NMI)(γ, λc) (vertical axis, 0 to 1) against CPU time in seconds (horizontal axis, logarithmic scale, roughly 10−1 to 102), with one marker and rectangle per consensus function: CSPA, EAC, HGPA, MCLA, VMA, DBC, DCC, IPC and ISC.]

Fig. C.1. φ(NMI)(γ, λc) vs. CPU time diagrams comparing the proposed consensus functions against the state-of-the-art across the 11 data collections employed in this work. (a) Zoo, (b) Iris, (c) Wine, (d) Glass, (e) Ionosphere, (f) WDBC, (g) Balance, (h) MFeat, (i) miniNG, (j) BBC, (k) PenDigits.
By means of this simplified version of a bivariate boxplot (or bagplot) [45], the reader is provided with a compact visual summary of the performance of the consensus functions subject to comparison that makes it easy to grasp the relative differences between them.
A global inspection of Fig. C.1 reveals that the global results presented in Section 5.2 are confirmed by a data set by data set performance analysis. That is, it can be observed that the four proposed consensus functions (i.e. DBC, DCC, IPC and ISC) attain higher φ(NMI)(γ, λc) scores than CSPA, EAC, HGPA and MCLA on almost every data collection, while yielding consensus clusterings of similar quality to those delivered by the VMA consensus function. Moreover, we have noticed that the soft versions of the EAC and HGPA consensus functions perform very poorly from a consensus clustering quality standpoint.

In computational terms, it can be observed that the consensus functions based on confidence voting (that is, IPC and ISC) are usually among the fastest (together with VMA and HGPA), whereas the consensus functions based on the Borda and Copeland positional voting rules (DBC and DCC) are comparatively slower than most of the state-of-the-art consensus functions subject to comparison. Last but not least, it is important to highlight that the CSPA and EAC consensus functions could not be executed on the PenDigits data collection (the one containing the largest number of instances, n). This is due to the fact that their space complexity is quadratic in n, which hinders their application on large data sets. This is not an issue for our consensus functions, whose space complexity is linear in n.

References
[1] U. Fayyad, Data mining and knowledge discovery: making sense out of data, IEEE Expert Mag. 11 (5) (1996) 20–25.
[2] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers, 2005.
[3] W. Klosgen, J.M. Zytkow, J. Zyt, Handbook of Data Mining and Knowledge Discovery, Oxford University Press, 2002.
[4] M.R. Anderberg, Cluster Analysis for Applications, Monographs and Textbooks on Probability and Mathematical Statistics, Academic Press Inc., 1973.
[5] F. Höppner, F. Klawonn, R. Kruse, T. Runkler, Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition, Wiley, 1999.
[6] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a survey, ACM Comput. Surv. 31 (3) (1999) 264–323.
[7] A. Strehl, J. Ghosh, Cluster ensembles—a knowledge reuse framework for combining multiple partitions, J. Mach. Learn. Res. 3 (2002) 583–617.
[8] T.G. Dietterich, Ensemble methods in machine learning, in: Multiple Classifier Systems, Lecture Notes in Computer Science, vol. 1857, 2000, pp. 1–15.
[9] F.R. Pinto, J.A. Carriço, M. Ramirez, J.S. Almeida, Ranked adjusted rand: integrating distance and partition information in a measure of clustering agreement, BMC Bioinformatics 8 (44) (2007) 1–13.
[10] H.G. Ayad, M.S. Kamel, Cumulative voting consensus method for partitions with variable number of clusters, IEEE Trans. Pattern Anal. Mach. Intell. 30 (1) (2008) 160–173.
[11] A. Gionis, H. Mannila, P. Tsaparas, Clustering aggregation, ACM Trans. Knowl. Discov. Data 1 (1) (2007) 1–30.
[12] A. Goder, V. Filkov, Consensus clustering algorithms: comparison and refinement, in: Proceedings of the 2008 SIAM Workshop on Algorithm Engineering and Experiments (ALENEX), 2008, pp. 109–117.
[13] X.Z. Fern, C.E. Brodley, Solving cluster ensemble problems by bipartite graph partitioning, in: Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 281–288.
[14] A. Fred, A.K. Jain, Combining multiple clusterings using evidence accumulation, IEEE Trans. Pattern Anal. Mach. Intell. 27 (6) (2005) 835–850.
[15] A. Topchy, A.K. Jain, W. Punch, Combining multiple weak clusterings, in: Proceedings of the 3rd IEEE International Conference on Data Mining, 2003, pp. 331–338.
[16] K. Punera, J. Ghosh, Soft consensus clustering, in: Advances in Fuzzy Clustering and its Applications, 2007, pp. 69–92.
[17] A. Topchy, M. Law, A.K. Jain, A. Fred, Analysis of consensus partition in clustering ensemble, in: Proceedings of the 4th International Conference on Data Mining, 2004, pp. 225–232.
[18] T. Lange, J.M. Buhmann, Combining partitions by probabilistic label aggregation, in: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005, pp. 147–156.
[19] T. Li, C. Ding, M.I. Jordan, Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization, in: Proceedings of the 7th IEEE International Conference on Data Mining, 2007, pp. 577–582.
[20] A. Agogino, K. Tumer, Efficient agent-based cluster ensembles, in: Proceedings of the 5th International Joint Conference on Autonomous Agents and Multi-Agent Systems, 2006.
[21] V. Filkov, S. Skiena, Integrating microarray data by consensus clustering, Int. J. Artif. Intell. Tools 13 (4) (2004) 863–880.
[22] E. Dimitriadou, A. Weingessel, K. Hornik, Voting-merging: an ensemble method for clustering, in: G. Dorffner, H. Bischof, K. Hornik (Eds.), Artificial Neural Networks—ICANN 2001, Lecture Notes in Computer Science, vol. 2130, Springer, 2001, pp. 217–224.
[23] S. Dudoit, J. Fridlyand, Bagging to improve the accuracy of a clustering procedure, Bioinformatics 19 (9) (2003) 1090–1099.
[24] B. Fischer, J.M. Buhmann, Bagging for path-based clustering, IEEE Trans. Pattern Anal. Mach. Intell. 25 (11) (2003) 1411–1415.
[25] H. Kuhn, The Hungarian method for the assignment problem, Nav. Res. Log. 2 (1955) 83–97.
[26] D. Greene, P. Cunningham, Efficient Ensemble Methods for Document Clustering, Technical Report CS-2006-48, Trinity College Dublin, 2006.
[27] E. Dimitriadou, A. Weingessel, K. Hornik, A combination scheme for fuzzy clustering, Int. J. Pattern Recognition Artif. Intell. 16 (7) (2002) 901–912.
[28] X. Sevillano, F. Alías, J.C. Socoró, BordaConsensus: a new consensus function for soft cluster ensembles, in: Proceedings of the 30th ACM SIGIR Conference, 2007, pp. 743–744.
[29] B. Noble, J.W. Daniel, Applied Linear Algebra, Prentice Hall, 1988.
[30] C. Boulis, M. Ostendorf, Combining multiple clustering systems, in: Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol. 3202, 2004, pp. 63–74.
[31] B. Long, Z.M. Zhang, P.S. Yu, Combining multiple clusterings by soft correspondence, in: Proceedings of the 5th IEEE International Conference on Data Mining, 2005, pp. 282–289.
[32] M. Jakobsson, N.A. Rosenberg, CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure, Bioinformatics 23 (2007) 1801–1806.
[33] M. Buehren, Functions for the Rectangular Assignment Problem, http://www.mathworks.com/matlabcentral/fileexchange/6543, last accessed September 2011.
[34] M. van Erp, L. Vuurpijl, L. Schomaker, An overview and comparison of voting methods for pattern recognition, in: Proceedings of the 8th International Workshop on Frontiers in Handwriting Recognition, 2002, pp. 195–200.
[35] J.C. de Borda, Mémoire sur les élections au scrutin, Histoire de l'Académie Royale des Sciences, Paris, 1781.
[36] D. Knuth, The Art of Computer Programming, vol. 3, Addison-Wesley, 1998.
[37] A.H. Copeland, A 'reasonable' social welfare function, in: Notes from a Seminar on Applications of Mathematics to the Social Sciences, University of Michigan, 1951.
[38] M. de Condorcet, Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix, 1785.
[39] A. Frank, A. Asuncion, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, University of California, Irvine, School of Information and Computer Sciences, last accessed September 2011.
[40] D. Greene, P. Cunningham, Practical solutions to the problem of diagonal dominance in kernel document clustering, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 377–384.
[41] T. Lange, V. Roth, M.L. Braun, J.M. Buhmann, Stability-based validation of clustering solutions, Neural Comput. 16 (6) (2004) 1299–1323.
[42] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[43] S. García, F. Herrera, An extension on "Statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons, J. Mach. Learn. Res. 9 (2008) 2677–2694.
[44] F. Wilcoxon, Individual comparisons by ranking methods, Biom. Bull. 1 (6) (1945) 80–83.
[45] P.J. Rousseeuw, I. Ruts, J.W. Tukey, The bagplot: a bivariate boxplot, Am. Stat. 53 (1999) 382–387.
[46] J.A. Aslam, M. Montague, Models for metasearch, in: Proceedings of the 24th ACM SIGIR Conference, 2001, pp. 276–284.
[47] M. Montague, J.A. Aslam, Condorcet fusion for improved retrieval, in: Proceedings of the 24th ACM CIKM Conference, 2002, pp. 538–548.