Computer Methods and Programs in Biomedicine 189 (2020) 105337

Bregmannian consensus clustering for cancer subtypes analysis

Jianqiang Li a,∗, Liyang Xie a, Yunshen Xie a, Fei Wang b

a School of Software, Beijing University of Technology, China
b Weill Cornell Medical College, Cornell University, USA


Article history: Received 29 July 2019 Revised 9 January 2020 Accepted 9 January 2020

Keywords: Cancer subtypes analysis; Consensus clustering; Bregman divergence

Abstract

Cancer subtype analysis, as an extension of cancer diagnosis, can be regarded as a consensus clustering problem. This analysis is beneficial for providing patients with more accurate treatment. Consensus clustering refers to a situation in which several different clusterings have been obtained for a particular data set, and it is desired to aggregate those clustering results to get a better clustering solution. In this paper, we propose to generalize traditional consensus clustering methods in three ways: (1) we provide Bregmannian consensus clustering (BCC), where the loss between the consensus clustering result and all the input clusterings is generalized from the traditional Euclidean distance to a general Bregman loss; (2) we generalize BCC to a weighted case, where each input clustering has a different weight, providing a better solution for the final clustering result; and (3) we propose a novel semi-supervised consensus clustering, which adds must-link and cannot-link constraints to the first two methods. We then obtain three cancer (breast, lung, colorectal) data sets from The Cancer Genome Atlas (TCGA). Each data set has three data types (mRNA, microRNA, methylation), and each is used to test the clustering accuracy of the proposed algorithms. The experimental results demonstrate that the highest aggregation accuracy of the weighted BCC (WBCC) on the cancer data sets is 90.2%. Moreover, although the lowest accuracy is 62.3%, it is still higher than that of the other methods on the same data set. Therefore, we conclude that, compared with the competition, our method is more effective. © 2020 Elsevier B.V. All rights reserved.

1. Introduction

Cancer is a major public health problem in the world [1]. The International Agency for Research on Cancer (IARC) estimated that 18.1 million new cases of cancer were discovered and that the number of deaths caused by cancer reached 9.6 million worldwide in 2018 [2]. In terms of incidence, the top three types of cancer are lung, female breast, and colorectal [2]. Variation in molecular profiles leads to a phenomenon where each cancer includes several subtypes, posing great challenges for medical researchers [3]. Thus, significant attention has been focused on using clustering methods to aggregate all the molecules in a sample into multiple clusters. This solution can assist clinicians in developing precise treatments when combined with methods for analyzing the differences in molecular profiles between cancer patients and healthy subjects. However, the robustness of data clustering is still not well solved, as (1) even given the same data set, different structures may be found by off-the-shelf clustering methods; (2) no ground truth can be used for validating the clustering result,



Corresponding author. E-mail address: [email protected] (J. Li).

https://doi.org/10.1016/j.cmpb.2020.105337

making it impossible to perform cross-validation to optimize the parameters of a single clustering method; and (3) for the same iterative method (such as k-means [4]), different initialization processes or parameters may generate different clustering results. Therefore, a new concept for improving the effectiveness of clustering is necessary. We treat this problem as a consensus clustering problem, which refers to determining how to utilize a set of different input clusterings originating from the same data set to produce a uniform clustering result that is superior to any of them. Nowadays, consensus clustering has become an important means of solving classical clustering problems. Moreover, several additional problems can be addressed as consensus clustering problems, e.g., clustering with multiple criteria, distributed clustering, and the clustering of heterogeneous data sources [5, 6, 7]. In addition, many algorithms have been proposed to address the consensus clustering problem, such as the graph cut method, information-theoretical methods, the matrix factorization-based method, and the hierarchical method [8, 9, 10, 11]. Most of the traditional methods treat every input clustering as equally important. However, in real-world cases, we may find that:

• different input clusterings may have significant differences, such that simply averaging them amounts to brute-force voting to achieve consensus, which is meaningless; and
• subsets of the input clusterings may be highly correlated, such that the redundant clusterings bias the final solution towards the correlated clusterings.

To solve the above problems, a few researchers have recently proposed the concept of a weighted cluster ensemble (or weighted consensus clustering) [12]. For example, [13] proposed a weighted cluster ensemble scheme in which different dimensions of the data vectors are weighted differently to obtain the data clustering; however, the final consensus clustering result is obtained by treating all of the data partition results equally. [14] proposed assigning different weights to different data partition results, where the weights are pre-computed according to a few empirically-defined diversity measures. [15] applied a similar scheme, and the authors also proposed a novel kernel consensus function to determine the final data partition results. In addition, [16] proposed cluster ensemble selection, which uses only a subset of the data partition results to obtain consensus data clusters; it is a special case of weighted consensus clustering. Two key points for the success of weighted consensus clustering are as follows:

• How to determine the weights used for weighting different data partition results. The existing methods usually compute these weights empirically beforehand (according to prior knowledge); this is generally heuristic and may bias the final consensus clustering.
• How to define a proper consensus function. Traditional consensus clustering methods usually construct a data similarity matrix based on each data partition result, and average them to obtain the consensus function. However, there is no theoretical guarantee regarding the optimality of this approach.

In this study, a novel consensus clustering framework is provided to determine an optimal clustering for a cancer dataset (e.g., the lung, breast, and colorectal cancer data sets), to assist in the analysis of cancer subtypes. We show that the traditional averaging consensus function is just a special case of our framework under the matrix Frobenius norm distortion measure. Moreover, we generalize this Frobenius norm-based distortion measure to the more general class of Bregman divergences [17], and we expect to capture the different characteristics contained in the matrix when we measure the distortion. We also propose a weighted consensus clustering scheme that can learn both the data partition weights and the final consensus cluster connectivity matrix jointly under a unified optimization framework. Moreover, we extend our algorithm to a semi-supervised case by incorporating a few must-link and cannot-link constraints, which are widely applied in semi-supervised clustering methods [16, 18, 10, 20, 21], but rarely in consensus clustering problems [22, 29]. The highlights of our framework include the following:

• We utilize a much more general criterion, i.e., Bregman divergence, instead of the Euclidean distance or Kullback-Leibler (KL) divergence, to measure the matrix consensus.
• We prove that both the weighted and unweighted problems are convex, and thus our algorithm is extended to deal with the weighted consensus clustering problem. Moreover, we generalize our framework to semi-supervised cases by incorporating pairwise constraints.
• We apply the new consensus clustering methods to the analysis of molecular profiles for cancer subtypes, aiming to support the development of effective treatments for these aggressive types of cancer.

The rest of this paper is organized as follows. In Section 2, we briefly review related work. Section 3 formally states the problem and introduces a few preliminaries. The detailed algorithm for generalized consensus clustering is introduced in Section 4, and the extension to semi-supervised cases is presented in Section 5. Section 6 presents the experimental results on the lung, breast, and colorectal datasets, followed by conclusions in Section 7.

2. Related works

The clustering analysis of gene expression profiles is a crucial research topic for cancer subtype diagnosis, which is beneficial for providing more precise treatments for cancer patients [23, 24, 25, 26]. In fact, even decades ago, a tremendous amount of research already existed on the clustering analysis of gene expression data [27, 28]. For instance, Ying Xu et al. constructed a minimum spanning tree (MST) to represent multi-dimensional gene expression profiles, and then provided an MST-based clustering algorithm to cluster these gene data, applied to a yeast dataset, a human serum dataset, and an Arabidopsis dataset [27]. Nathan Cunningham et al. conducted a brief analysis to identify colon cancer subtypes, where hierarchical clustering, K-means clustering, and model-based clustering were used. The three methods were compared in pairs under the weighted average discrepant pairs (WADP) method, i.e., an evaluation criterion that measures the robustness of a clustering algorithm [30].

In the following decades, to improve the robustness and stability of clustering approaches, researchers started to focus on consensus clustering, as it can find an optimal solution from multiple single clusterings with different clustering results [31]. As can be concluded from previous efforts, consensus clustering algorithms are composed of two steps: 1) choosing a suitable clustering mechanism for generating diverse partitions; and 2) constructing a consensus function for generating an optimal clustering from the different partitions [32, 33, 34, 35].

For the first step, researchers usually adopt two methods: using different clustering algorithms to generate different data partitions, or using different parameter settings in the same algorithm for different partitions [36]. In fact, most researchers apply k-means clustering with random initializations or with a random number of clusters to generate the different clusterings. For example, [37] introduced two weak clustering algorithms to generate data partitions: randomly selecting one dimension of multidimensional data for clustering, and splitting the data using random hyperplanes. In addition, in [38], a new resampling technique was proposed for generating data subset partitions, which are then combined to obtain results for the entire data set. [39] proposed using the Locality Adaptive Clustering algorithm to generate data partition results.

For the second step, a popular method for constructing a consensus function is to create a co-association matrix, which is also applicable to any clustering algorithm that directly operates on similarities (e.g., hierarchical clustering and spectral clustering), because its entries behave like a similarity matrix [40]. [33] derived a consensus function based on the information bottleneck principle. [15] utilized a kernel to compute a consensus function. [8] proposed constructing a consensus function by using a graph partitioning strategy.

Because of the robustness and stability of consensus clustering algorithms, they have been introduced into the medical field and have gradually been used for gene expression data.
For example, Junjie Wu et al. applied a K-means-based consensus clustering (KCC) method to the breast_w dataset from the UC Irvine Machine Learning Repository, and obtained good results under the validation of a normalized Rand index [41]. Zhiwen Yu proposed a new method named "graph-based consensus clustering" (GCC) and applied it to breast data, central nervous system (CNS) tumor data, leukemia data, and lung cancer data, helping to approximately estimate the underlying classes of the inputs and to automatically categorize samples. The experimental results showed that GCC is better than standard consensus clustering [32]. In [42], an integrated approach based on consensus clustering was provided for managing heterogeneous data sets, and was applied to two groups of microarray data, named "yeast" and "cell cycling gene". Eric F. Lock proposed a scalable Bayesian framework for the identification of clinical disease subtypes, which was applied to breast cancer tumor samples from The Cancer Genome Atlas [43]. Yu et al. utilized a new method called the random double clustering based fuzzy cluster ensemble framework (RDCFCE) to cluster tumors based on gene expression data, by adding a fuzzy extension model into an integration framework [44].

Nevertheless, the above studies each focus on a specific method, and thus lack universality. As a consequence, they may still have great limitations in practical applications. Accordingly, this paper introduces a novel consensus function construction method for classifying gene expression data, where we take lung, breast, and colorectal cancer as examples, and aim to produce a co-association matrix from a matrix approximation perspective. Our framework can easily be generalized to weighted and semi-supervised versions, which makes it much more powerful in more complicated situations.

3. Preliminaries and problem statement

In this section, we will illustrate some basic information with respect to our research, including two parts: a few notations and preliminary knowledge, and the problem definition.

3.1. Notations and preliminaries

A few frequently used notations are described as follows. Suppose the entire dataset with n data points is χ = {x_1, x_2, ..., x_n}, and the m partitions are represented as P = {P^1, P^2, ..., P^m}. Moreover, each partition P^i (i = 1, 2, ..., m) consists of k clusters {π^i_1, π^i_2, ..., π^i_k}, and satisfies χ = ∪_{j=1}^k π^i_j. Note that different partitions may have different values of k. The connectivity matrix is defined as follows.

Definition 1. (Connectivity Matrix) The connectivity matrix M^i for partition P^i is an n by n symmetric square matrix, with its (u, v)-th entry given by

M^i_{uv} = { 1, if x_u and x_v belong to the same cluster in P^i; 0, otherwise }    (1)

Thus, M^i can be used to represent partition P^i. For soft clustering algorithms (such as clustering derived from the Gaussian mixture model), M^i is defined such that M^i_{uv} denotes the probability that x_u and x_v belong to the same cluster; in this case M^i_{uv} ∈ [0, 1].
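For illustration, a connectivity matrix can be built directly from a vector of cluster labels. The following is a minimal NumPy sketch; the helper name is ours, not from the paper:

```python
import numpy as np

def connectivity_matrix(labels):
    """Connectivity matrix of Definition 1: M[u, v] = 1 iff points u and v
    share a cluster in the given partition (represented as a label vector)."""
    labels = np.asarray(labels)
    # Outer comparison of labels yields the 0/1 co-membership indicator.
    return (labels[:, None] == labels[None, :]).astype(float)

# Example: partition {pi_1 = {x1, x2}, pi_2 = {x3, x4, x5}} -> labels [0, 0, 1, 1, 1]
M1 = connectivity_matrix([0, 0, 1, 1, 1])
```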

Another definition that will be frequently used in this paper is the Bregman divergence, which is defined as follows [17].

Definition 2. (Bregman Divergence) Let φ: S ⊆ R → R be a strictly convex function defined on a convex set S and differentiable on ri(S), where S is the effective domain of φ, i.e., the set of all x such that φ(x) < +∞, and ri(S) is the relative interior of S, which is assumed to be non-empty. The Bregman divergence D_φ: S × ri(S) → [0, ∞) is defined as:

D_φ(x, y) = φ(x) − φ(y) − ∇φ(y)(x − y)    (2)

In the above, ∇φ(x) is the gradient of φ evaluated at x. The Bregman divergence has many convenient properties. For example, D_φ(x, y) is non-negative, and it is convex with respect to x. Many commonly-used distance metrics (such as the Euclidean distance and the Kullback-Leibler divergence) can be regarded as special cases of the Bregman divergence [17]. In this study, we will consider the following separable Bregman divergence [17].

Definition 3. (Separable Bregman Divergence) For any symmetric n × n matrices X and Y, the separable Bregman matrix divergence is defined as:

D_φ(X, Y) = Σ_{u,v} D_φ(X_{uv}, Y_{uv})    (3)
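As a quick illustration of Definition 3, the sketch below (NumPy, with our own function names) evaluates the separable matrix divergence for a user-supplied generator φ and its derivative:

```python
import numpy as np

def bregman(x, y, phi, grad_phi):
    """Pointwise Bregman divergence D_phi(x, y) of Eq. (2)."""
    return phi(x) - phi(y) - grad_phi(y) * (x - y)

def separable_bregman(X, Y, phi, grad_phi):
    """Separable Bregman matrix divergence of Eq. (3): the element-wise
    divergences summed over all entries (u, v)."""
    return np.sum(bregman(X, Y, phi, grad_phi))

# Example with the Euclidean generator phi(x) = x^2 / 2, for which the
# divergence reduces to half the squared Frobenius distance.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[1.0, 0.5], [0.5, 1.0]])
d = separable_bregman(X, Y, phi=lambda t: 0.5 * t ** 2, grad_phi=lambda t: t)
```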

Here, X_{uv} denotes the (u, v)-th element of X, and D_φ(X_{uv}, Y_{uv}) is the Bregman divergence defined in Eq. (2).

3.2. Problem statement

As previously stated, consensus clustering refers to the problem of finding an optimal partition of a dataset χ. In this study, we relate each data partition P^i to a connectivity matrix M^i. Thus, there is also a connectivity matrix M corresponding to the final optimal consensus partition P*. As the connectivity matrix can also be seen as a data similarity matrix, the final data clusters can be discovered by running a few similarity-based clustering algorithms on M. In this sense, finding an optimal M is of key importance for obtaining good clustering results. In this study, we formulate this goal as the optimization problem:

min_M D(M, ∪_{i=1}^m M^i)
s.t. M = M^T, M ≥ 0    (4)

In the above, D(M, ∪_{i=1}^m M^i) = D(M, {M^1, M^2, ..., M^m}) is some distortion measure computing the consensus between P* and P, and M ≥ 0 indicates that M is element-wise non-negative.

4. Generalized consensus clustering with Bregman divergence

In this section, we will illustrate our generalized consensus clustering framework with an arbitrary (weighted) additive Bregman divergence measure.

4.1. Unweighted Bregmannian consensus clustering

D will be considered to be the following additive Bregman divergence:

D(M, ∪_{i=1}^m M^i) = Σ_{i=1}^m D_φ(M, M^i)    (5)

Then, our goal is to find an optimal M by minimizing:

J_1 = Σ_{i=1}^m D_φ(M, M^i)    (6)

In the above, D_φ(M, M^i) is the separable Bregman divergence, as defined in Definition 3. Then, Problem (4) becomes:

min_M Σ_i Σ_{u,v} [ φ(M_{uv}) − φ(M^i_{uv}) − ∇φ(M^i_{uv})(M_{uv} − M^i_{uv}) ]
s.t. M = M^T, M ≥ 0    (7)

To solve the above problem, we first ignore the symmetry and non-negativity constraints on M, and let ∇_M J_1 = ∂J_1/∂M = 0, where 0 is an n × n zero matrix. Then, we can obtain

∇φ(M_{pq}) = (1/m) Σ_i ∇φ(M^i_{pq})    (8)

As J_1 is convex in M, the solution to Eq. (8) is the global optimum for minimizing J_1. A few typical Bregman divergence measures and their corresponding optimal connectivity matrices can be found in Table 1. Note that as long as 0 ≤ M^i_{pq} ≤ 1 for all 1 ≤ i ≤ m and 1 ≤ p, q ≤ n, the non-negativity and symmetry constraints are automatically satisfied,


Table 1
Some Bregman divergences and the corresponding optimal M. S is the domain of φ, ∇φ its derivative, and m̄_pq = (1/m) Σ_i log(M^i_pq / (1 − M^i_pq)).

Divergence                  S       φ(x)                        ∇φ(x)          Optimal M_pq
Euclidean distance          R       (1/2)x²                     x              (1/m) Σ_i M^i_pq
Exponential distance        R       exp(x)                      exp(x)         log((1/m) Σ_i exp(M^i_pq))
Kullback-Leibler distance   R++     x log x − x                 log x          exp((1/m) Σ_i log M^i_pq)
Itakura-Saito distance      R++     −log x                      −1/x           m / Σ_i (1/M^i_pq)
Logistic distance           [0,1]   x log x + (1−x) log(1−x)    log(x/(1−x))   exp(m̄_pq) / (exp(m̄_pq) + 1)

as shown in Table 1. Therefore, although there are n² equations in Eq. (8), there are only n(n − 1)/2 variables to optimize, because of the symmetry of {M^i}_{i=1}^m and M. We can also directly set M_ii = 1. Finally, Algorithm 1 summarizes the entire procedure of the unweighted Bregmannian consensus clustering method.
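For illustration, the closed forms of Eq. (8) and Table 1 take only a few lines. The sketch below assumes NumPy, uses our own function name, and adds a small clipping constant for the KL (geometric-mean) case, which is a practical detail not taken from the paper:

```python
import numpy as np

def unweighted_bcc(base_Ms, phi="euclidean", eps=1e-6):
    """Closed-form consensus matrix of Eq. (8) for a few generators phi
    (cf. Table 1). base_Ms: list of n-by-n connectivity matrices."""
    Ms = np.stack(base_Ms)                     # shape (m, n, n)
    if phi == "euclidean":                     # phi(x) = x^2/2 -> arithmetic mean
        M = Ms.mean(axis=0)
    elif phi == "exponential":                 # phi(x) = exp(x)
        M = np.log(np.exp(Ms).mean(axis=0))
    elif phi == "kl":                          # phi(x) = x log x - x -> geometric mean
        M = np.exp(np.log(np.clip(Ms, eps, None)).mean(axis=0))
    else:
        raise ValueError(phi)
    np.fill_diagonal(M, 1.0)                   # M_ii = 1, as noted in the text
    return M
```

As described in Section 3.2, any similarity-based clustering algorithm run on the returned matrix then yields the final consensus clusters.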

4.2. Weighted Bregmannian consensus clustering

In Eq. (6), the impact of each partition is the same. However, different partitions deserve different importance in real-world cases. Therefore, it is necessary to find a set of weighting factors {w_i}_{i=1}^m subject to w_i ≥ 0 for all i = 1, 2, ..., m and Σ_i w_i = 1, and to define D to be the following weighted additive Bregman divergence:

D(M, ∪_{i=1}^m M^i) = Σ_{i=1}^m w_i D_φ(M, M^i)    (9)

In this case, we need to minimize the following objective to get the optimal M:

J_2 = Σ_i w_i D_φ(M, M^i)    (10)

The problem is not convex with respect to w = (w_1, w_2, ..., w_m)^T and M jointly. We propose a block coordinate descent (BCD) algorithm to minimize the above objective. When we fix one variable, the optimization over the other can be treated as a convex problem with a unique solution.

By fixing w, the resultant problem is essentially minimizing J_1, and the solution can be obtained by setting:

∇_M J_2 = 0    (11)

Following the same procedure as minimizing J_1 in Eq. (6), we can obtain:

∇φ(M_pq) = Σ_i w_i ∇φ(M^i_pq)    (12)

Then, with the different choices of φ, we can solve for the optimal M.

By fixing M, the problem is transformed into a linear program:

min_w Σ_i w_i D_φ(M, M^i)
s.t. w_i ≥ 0, Σ_i w_i = 1    (13)

It is not difficult to solve this linear programming problem, but its solution is always of the form:

w_i = { 1, if i = argmin_j D_φ(M, M^j); 0, otherwise }    (14)

In this case, the optimal M will only be relevant to the nearest M^i according to the Bregman divergence. We avoid such trivial solutions by adding a regularizer to J_2, and the problem is described as follows:

min_{M,w} Σ_{i=1}^m w_i D_φ(M, M^i) + λ||w||²
s.t. Σ_{i=1}^m w_i = 1, w_i ≥ 0 (∀ i = 1, 2, ..., m)    (15)

In the above, λ > 0 is a tradeoff parameter. When λ → 0, w will converge to the solution shown in Eq. (14); when λ → ∞, w will converge to [1/m, 1/m, ..., 1/m]^T, which is a simple average of all of the partitions. We hope to obtain an optimal leverage of the information in all the partitions by determining a specific λ; thus, we need to solve a quadratic programming problem with respect to w when M is fixed.

One issue that should be mentioned here is that the authors in [13] also proposed a weighted consensus clustering algorithm, where instead of weighting the pairwise matrix distortions, they directly assumed that the optimal M is a weighted sum of the M^i, i.e.:

M = Σ_i w_i M^i,  Σ_i w_i = 1, w_i ≥ 0 (∀ i = 1, 2, ..., m)    (16)

Note that in our case, if we choose φ(x) = (1/2)x², then Eq. (12) becomes:

M_pq = Σ_i w_i M^i_pq    (17)

This is the same as Eq. (16). Therefore, the method in [13] is a special case of our framework with φ(x) = (1/2)x². Finally, the weighted Bregmannian consensus clustering (WBCC) algorithm is summarized in Algorithm 2 (where Ĵ₂ is defined in Eq. (18)), and we have the following theorem to prove its convergence.

Theorem. Algorithm 2 will converge.

Proof. Let

Ĵ₂(M, w) = Σ_{i=1}^m w_i D_φ(M, M^i) + λ||w||²    (18)

Then, for all w and M, Ĵ₂(M, w) ≥ 0. By fixing w = w^t, the minimization of Ĵ₂(M, w) is convex with respect to M and M^{t+1} is the optimal solution, so we have Ĵ₂(M^t, w^t) ≥ Ĵ₂(M^{t+1}, w^t). Similarly, by fixing M = M^{t+1}, the optimization of Problem (15) over w is also convex, and w^{t+1} is the optimal solution; thus, we have Ĵ₂(M^{t+1}, w^t) ≥ Ĵ₂(M^{t+1}, w^{t+1}). Therefore, we have a monotonically decreasing sequence Ĵ₂(M^0, w^0) ≥ Ĵ₂(M^1, w^0) ≥ Ĵ₂(M^1, w^1) ≥ ... ≥ 0, indicating that Algorithm 2 converges.

5. Semi-supervised consensus clustering with Bregman divergence

In this section, we add a new setting for consensus clustering, where two sets of pairwise constraints, ℳ and 𝒞, are given as constraint conditions. (x_p, x_q) ∈ ℳ means that x_p and x_q are in the same cluster, and (x_p, x_q) ∈ 𝒞 denotes that they are in different clusters. Although these constraint conditions have been widely used in semi-supervised clustering [18, 10, 20, 21], few studies have applied them to consensus clustering [22, 29, 50]. In addition, no previous work has applied the Bregman divergence to semi-supervised consensus clustering, which here includes unweighted Bregmannian consensus clustering with constraints and a generalized WBCC with constraints.


5.1. Unweighted Bregmannian consensus clustering with constraints

To implement unweighted Bregmannian consensus clustering with the constraints in ℳ and 𝒞, where we still aim to minimize the objective J_1 in Eq. (6), we solve the problem as follows:

min_M J_1 = Σ_i D_φ(M, M^i)
s.t. M_pq = s_max, if (x_p, x_q) ∈ ℳ
     M_pq = s_min, if (x_p, x_q) ∈ 𝒞    (19)

Here, s_max and s_min are the maximum and minimum similarity (correlation) between all pairs of data points, respectively. The approach to setting them in practice will be discussed later. Clearly, solving Eq. (19) can be viewed as a convex optimization problem with linear constraints. This formulation can be transformed as follows:

min_M J_1 = Σ_i D_φ(M, M^i)
s.t. (e_{p_k})^T M e_{q_k} = b_k,  k = 1, 2, ..., K    (20)

Here, e_{p_k} ∈ R^{n×1} is an indicator vector with only the p_k-th element being one and all other elements being zero, k is the index of the constraint, and K is the total number of constraints. In that regard, b_k = s_min if (x_{p_k}, x_{q_k}) ∈ 𝒞 and b_k = s_max if (x_{p_k}, x_{q_k}) ∈ ℳ. Clearly, Problem (20) is a convex optimization problem whose objective is convex and whose constraints are affine. These are sufficient conditions for Slater's condition [19], and the problem can be solved via the Karush-Kuhn-Tucker conditions. Through a few simple derivations, we can obtain the solution for M as follows.

• For regular elements:

∇φ(M_pq) = (1/m) Σ_i ∇φ(M^i_pq)    (21)

• For constrained elements:

α_k = m ∇φ(M_{p_k q_k}) − Σ_i ∇φ(M^i_{p_k q_k})    (22)

M_{p_k q_k} = b_k    (23)

In the above, we call M_pq a regular element if (x_p, x_q) ∉ ℳ and (x_p, x_q) ∉ 𝒞; we call M_pq a constrained element if either (x_p, x_q) ∈ ℳ or (x_p, x_q) ∈ 𝒞. Comparing with Eq. (8), it should be noted that the solutions for the regular elements of M are the same in the semi-supervised and unsupervised Bregmannian consensus clustering; for the constrained elements of M, according to Eq. (23), the semi-supervised algorithm simply sets them to the exact values given by their constraints. Algorithm 3 summarizes the entire procedure of the unweighted Bregmannian consensus clustering with constraints algorithm.

5.2. Generalized weighted cluster aggregation with constraints

We define the semi-supervised weighted consensus clustering problem as the following Eq. (24):

min_{w,M} Σ_i w_i D_φ(M, M^i) + λ||w||²
s.t. M_pq = s_max, if (x_p, x_q) ∈ ℳ
     M_pq = s_min, if (x_p, x_q) ∈ 𝒞
     Σ_{i=1}^m w_i = 1, w_i ≥ 0 (∀ i = 1, ..., m)    (24)

This is very similar to Problem (15). When w and M are considered jointly, Eq. (24) is not guaranteed to be convex, but when one of them is fixed, the problem in the other is convex. Thus, the BCD algorithm can be applied to solve this problem. The specific method is as follows.

If w is fixed, we can solve for the optimal M by:

min_M Σ_i w_i D_φ(M, M^i)
s.t. M_pq = s_max, if (x_p, x_q) ∈ ℳ
     M_pq = s_min, if (x_p, x_q) ∈ 𝒞    (25)

Similar to solving Problem (19), we can obtain the solution to the above problem as follows.

• If M_pq is a constrained element, then:

M_pq = { s_max, if (x_p, x_q) ∈ ℳ; s_min, if (x_p, x_q) ∈ 𝒞 }    (26)

• If M_pq is a regular element, then M_pq can be solved from:

∇φ(M_pq) = Σ_i w_i ∇φ(M^i_pq)    (27)

If M is fixed, then we can solve for the optimal w as follows:

min_w Σ_{i=1}^m w_i D_φ(M^t, M^i) + λ||w||²
s.t. Σ_{i=1}^m w_i = 1, w_i ≥ 0 (∀ i = 1, ..., m)    (28)

The above is a convex quadratic programming problem. Algorithm 4 summarizes the WBCC with constraints (where Ĵ₂ is defined in Eq. (18)). The convergence of Algorithm 4 can be analyzed in the same way as that of Algorithm 2, as the objective is lower-bounded by zero and each iteration guarantees a monotonic decrease of the objective function value.

6. Experiments

In this section, a few experiments are elaborated to test the effectiveness of the Bregmannian consensus clustering (BCC) method, including an implementation on two different synthetic data sets and an application to real cancer data sets.

6.1. Synthetic examples

For the construction of synthetic data sets, we adopt the following two approaches:

• Clusters with different shapes. The data set consists of a line with 50 points, a two-dimensional plane with 300 points, and a sphere in three-dimensional space with 400 points. It is shown in Fig. 1(a).
• Clusters lying in different subspaces. This data set consists of four clusters, which are generated in a manner similar to [45]. The first two clusters exist in dimensions x and y. The data follow a normal distribution N(μ, 0.1), where the value of μ is 0.6 or -0.6 in dimension x, and 0.5 in dimension y. In dimension z, these clusters carry noise drawn from N(0, 1). The last two clusters exist in dimensions y and z, with noise in dimension x. It is shown in Fig. 1(d).

Fig. 1. Toy examples. (a), (d) are the original data sets, where we use different colors to denote different clusters. (b), (e) are the clustering results of K-means. (c), (f) are the results of our weighted Bregmannian consensus clustering (WBCC) method with φ(x) = (1/2)x².

The K-means method is run on these synthetic data sets, where the cluster centers are obtained by random initialization and the number of clusters is set to the true number. According to the experimental results, the K-means method finds incorrect clusterings on these data sets, as shown in Fig. 1(b) and (e). Moreover, for the same data sets, the weighted Bregmannian consensus clustering (WBCC) algorithm under the Euclidean distance is implemented, where the 30 base cluster connectivity matrices are obtained by clustering the data set into 20 clusters using K-means with randomly initialized cluster centers. The weight vector is initialized to w = [1/30, 1/30, ..., 1/30]^T ∈ R^{30×1}. Finally, the clustering result is obtained by running K-means on the aggregated cluster connectivity matrix (i.e., treating each column of the aggregation matrix as the n-dimensional embedding of the data points). The results show that our WBCC algorithm performs better on the synthetic data sets, as shown in Fig. 1(c) and (f).

We also conducted an experiment to demonstrate the differences between the original K-means, unweighted BCC, and WBCC methods. The data set is shown in Fig. 2 and is composed of two classes, with each class distributed as a half moon. The data set contains 3000 points in total, with 1500 in each class.

Fig. 2. The original data set, which is composed of two classes with each class distributed as a half moon.

The results of the different methods are shown in Fig. 3. The first row corresponds to the cluster connectivity matrix (CCM) M constructed by the different methods. In all the figures, the data points

from the upper half moon are indexed first, followed by the data points from the lower half moon. Fig. 3(a) is the CCM obtained by grouping the data set into two clusters with standard K-means, where M_ij = 1 (white pixel) if x_i and x_j are grouped into the same cluster, and M_ij = 0 (black pixel) otherwise. Fig. 3(b) shows the CCM constructed by UBCC, where we run K-means 50 times on the data set: 40 runs cluster the data into 50 clusters and 10 runs cluster them into 5 clusters. We then construct a CCM for each run and finally average them. Fig. 3(b) shows that there are still some data points whose cluster memberships are confused with the others (where the lighter the pixel color, the larger the corresponding value). Fig. 3(c) plots the CCM obtained from WBCC, which clearly shows that the connectivity value will be very low (actually zero) if the corresponding data points belong to different clusters.

Fig. 3. A synthetic example illustrating the differences among the original K-means, unweighted Bregmannian consensus clustering (UBCC), and weighted Bregmannian consensus clustering (WBCC). We use φ(x) = (1/2)x² when dealing with the Bregman divergence.

Another measure of the quality of a constructed CCM is the Fiedler vector [45] of its corresponding normalized Laplacian. If the constructed CCM is M, then its corresponding Laplacian is

L = I − D^{-1/2} M D^{-1/2}    (29)

where D is a diagonal matrix with the same size as M and D_ii = Σ_j M_ij. Here, the Fiedler vector of L is the eigenvector whose eigenvalue is the second smallest (as the smallest eigenvalue of L is zero, with a constant vector as its corresponding eigenvector). Ideally, for two-class clustering, the Fiedler vector of L should be well separated into two pieces, and the values on each piece are the same

constant. We plot the Fiedler vectors of the Laplacian matrices constructed from the different CCMs in the second row of Fig. 3, where the y-axis represents the values on each dimension of the Fiedler vector, and the x-axis corresponds to the dimension index. From these figures, we can see that the Fiedler vector of the Laplacian derived from the CCM of WBCC is well separated into two constant pieces (Fig. 3(f)), which clearly indicates the cluster structure contained in the data set. If we perform K-means on this vector [46], we can obtain the data clusters shown in Fig. 3(i). As comparisons, we also show the Fiedler vectors of the Laplacians derived from the CCMs of K-means and UBCC in Fig. 3(d) and (e), as well as the clustering results of performing K-means on these vectors in Fig. 3(g) and (h), which show that some data points are wrongly clustered. The difference between UBCC and WBCC can be easily analyzed. When constructing their CCMs, we first perform K-means 50 times on the data set independently, where 40 runs cluster the data set into 50 clusters and 10 runs into 5 clusters. When we

Table 2
Description of the data sets.

Data set   Type       # Samples   # Dimensions   # Classes   Positive/Negative
BREAST     mRNA       105         23094          2           87/18
           DNA        105         354
           miRNA      105         17814
           Survival   105         -
COLON      mRNA       92          17814          2           83/9
           DNA        92          23088
           miRNA      92          312
           Survival   92          -
LUNG       mRNA       106         12042          2           66/40
           DNA        106         23074
           miRNA      106         352
           Survival   106         -

cluster the data set into 5 clusters with K-means, some data points lying on the boundary between the two half moons can be mis-clustered, which makes the resultant CCM incorrect. For UBCC, all 50 CCMs are simply averaged, which biases the final CCM towards those CCMs constructed when the data set is clustered into 5 clusters by K-means. However, for WBCC, as the combination coefficients of different base CCMs are different, the CCMs constructed when the data set is clustered into 5 clusters receive very low weights; thus, they do not affect the final results. From Figs. 1 and 3, we can see that our weighted BCC performs very well on the simulated data. Next, we discuss the weighted BCC algorithm further, in different scenarios, on a few cancer data sets, including the breast cancer dataset, lung cancer dataset, and colorectal cancer dataset.
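For readers who want to reproduce the CCM quality check described above, the following is a minimal NumPy sketch of the normalized Laplacian of Eq. (29) and its Fiedler vector; the helper name and the small numerical guard on the degrees are our own choices:

```python
import numpy as np

def fiedler_vector(M):
    """Fiedler vector of L = I - D^{-1/2} M D^{-1/2} (Eq. (29)):
    the eigenvector associated with the second-smallest eigenvalue."""
    d = M.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))   # guard against zero degree
    L = np.eye(len(M)) - d_inv_sqrt[:, None] * M * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)                # eigenvalues in ascending order
    return eigvecs[:, 1]
    # Running K-means on this vector (as in the text), or a simple sign split,
    # recovers the two clusters when the CCM is of good quality.
```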

6.2. Real application

6.2.1. Cancer data sets

To add significance to our study, we chose lung cancer, breast cancer, and colorectal cancer as our real research objects. These cancers are major health concerns worldwide, as they are the top three cancer types in terms of incidence [2]. The original cancer data are from "The Cancer Genome Atlas" (TCGA) database, launched in 2006 by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). The data in the TCGA can only be used after a series of complex pre-processing operations, but fortunately, the preprocessed data in [47] are suitable for our studies. In that regard, we take advantage of three gene expression data types (mRNA, microRNA, and methylation) for each cancer. The mRNA data refer to the amount of RNA expression measured by an RNA chip or RNA sequencing; the microRNA data refer to the amount of RNA expression measured by microRNA chips or microRNA sequencing; and methylation refers to the degree of DNA methylation, as measured by methylation chips. The detailed preprocessing is discussed in [47]. We use a total of 9 datasets to represent the above 3 cancer types, so as to evaluate the performance of our proposed method for cancer subtype analysis. The basic information of these data sets is shown in Table 2.

The breast cancer dataset: This dataset includes 105 samples, and the dimensions of the mRNA expression, DNA methylation, and microRNA (miRNA) expression data are 23094, 354, and 17814, respectively. The lung cancer dataset: The number of samples is 106, and the dimensions of the three types of expression data are 12042, 23074, and 352, respectively. The colorectal cancer dataset: The dataset includes 92 samples, and the dimensions of the three types of expression data are


17814, 23088, and 312, respectively. All of the data sets are directly available at http://compbio.cs.toronto.edu/SNF/SNF/Software.html.

6.2.2. Evaluation measure

The criterion for evaluating the effectiveness of our methods is the Clustering Accuracy (CA), which refers to the overall degree of matching between all pairs of classes and clusters. The purpose of CA is to discover the one-to-one relationship between clusters and classes, and to measure the extent to which each cluster contains data points from the corresponding class. The CA can be computed as follows:

CA = (1/N) max Σ_{(C_k, L_m)} T(C_k, L_m)    (30)
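A minimal sketch of how CA in Eq. (30) can be computed is given below, using SciPy's linear_sum_assignment to search over the one-to-one cluster-to-class assignments (the function name is illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(cluster_labels, class_labels):
    """Clustering accuracy (CA) of Eq. (30): find the one-to-one
    cluster-to-class assignment that maximizes the total overlap,
    then divide by the number of samples N."""
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    clusters = np.unique(cluster_labels)
    classes = np.unique(class_labels)
    # T[k, m] = number of points in cluster k that belong to class m.
    T = np.array([[np.sum((cluster_labels == ck) & (class_labels == lm))
                   for lm in classes] for ck in clusters])
    row, col = linear_sum_assignment(-T)   # maximize the matched overlap
    return T[row, col].sum() / len(class_labels)
```

With two clusters and two classes, as in the experiments below, this reduces to checking both possible matchings and keeping the better one.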

Here, C_k denotes the k-th cluster in the final results, and L_m is the true m-th class. T(C_k, L_m) is the number of entities that belong to class m and are assigned to cluster k. The CA is computed as the maximum sum of T(C_k, L_m) over all pairs of clusters and classes, where these pairs have no overlaps, i.e., each cluster is assigned to exactly one class. The maximum is taken over the assignments of clusters to the possible classes. A higher CA indicates a better performance of the algorithm.

6.2.3. Comparison with unsupervised methods

In this section, we present the results of the unsupervised WBCC algorithm under the Euclidean distance, exponential distance, and KL divergence, respectively named EWBCC, eWBCC, and KLWBCC. In addition, we also show the experimental results of a few other approaches for comparison, including:

K-means is run 50 times with randomly-initialized cluster centers, and the number of clusters k is set to 2. The final partitions are obtained by averaging the 50 independent clustering results. In this regard, the "K-means" mentioned in the following portions is always implemented in the same manner.

Spectral Clustering (SC) is also run 50 times, and the number of clusters k is set to the true number. The implementation process can be seen in [45]. The results are averaged.

The Cluster-Based Similarity Partitioning Algorithm (CSPA) requires a parameter λ, where each λ is the result of running K-means once. Here, K-means is also run 50 times with randomly-initialized cluster centers. The CSPA is described in [33].

The Hyper-Graph-Partitioning Algorithm (HGPA) is implemented as discussed in [51]. In that regard, 100 super-edges in the hypergraph are obtained by running K-means 50 times.

Non-negative Matrix Factorization-based Consensus Clustering (NMFC) uses a parameter setting similar to that of CSPA, and the details can be seen in [10].

The experimental results are listed in Table 3, where for our set of WBCC methods, we run a simple K-means on the final combined cluster aggregation matrix. For all of the ensemble methods,

Table 3
Experimental results in average clustering accuracy.

Dataset        K-means   SC      CSPA    HPGA    NMFC    EWBCC   eWBCC   KLWBCC
Breast_Methy   0.629     0.569   0.512   0.521   0.619   0.629   0.629   0.829
Breast_Mirna   0.667     0.807   0.533   0.520   0.667   0.667   0.667   0.829
Breast_Gene    0.646     0.567   0.575   0.532   0.644   0.648   0.648   0.829
Colon_Methy    0.595     0.582   0.554   0.523   0.592   0.598   0.598   0.902
Colon_Mirna    0.559     0.889   0.517   0.526   0.539   0.565   0.565   0.902
Colon_Gene     0.593     0.582   0.572   0.528   0.551   0.620   0.626   0.902
Lung_Methy     0.678     0.492   0.526   0.532   0.654   0.679   0.679   0.623
Lung_Mirna     0.612     0.384   0.557   0.555   0.577   0.614   0.613   0.623
Lung_Gene      0.614     0.497   0.589   0.525   0.540   0.612   0.618   0.623

Table 4
Experimental results in standard deviation of clustering accuracy.

Dataset        K-means   SC      CSPA    HPGA    NMFC    EWBCC   eWBCC   KLWBCC
Breast_Methy   0.105     0.098   0.163   0.107   0.012   0.002   0.001   0.001
Breast_Mirna   0.065     0.121   0.172   0.059   0.041   0.001   0.004   0.005
Breast_Gene    0.037     0.049   0.098   0.083   0.038   0.001   0.003   0.001
Colon_Methy    0.109     0.092   0.201   0.106   0.027   0.005   0.008   0.003
Colon_Mirna    0.113     0.107   0.182   0.072   0.039   0.010   0.003   0.004
Colon_Gene     0.127     0.175   0.118   0.152   0.055   0.007   0.003   0.002
Lung_Methy     0.065     0.114   0.123   0.139   0.009   0.001   0.005   0.003
Lung_Mirna     0.183     0.107   0.159   0.046   0.004   0.002   0.004   0.001
Lung_Gene      0.208     0.112   0.131   0.094   0.106   0.008   0.004   0.003

Algorithm 1 Unweighted Bregmannian Consensus Clustering.
Require: Partitions {P^i}_{i=1}^m
1: Construct the m cluster aggregation matrices {M^i}_{i=1}^m
2: Solve Eq. (8).
3: Output M.

Algorithm 3 Unweighted Bregmannian Consensus Clustering with Constraints.
Require: Partitions {P^i}_{i=1}^m, constraint sets ℳ and 𝒞
1: Construct the m cluster aggregation matrices {M^i}_{i=1}^m
2: Set the regular elements of M using Eq. (21).
3: Set the constrained elements of M using Eq. (23).
4: Output M.
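The following is a minimal NumPy sketch of Algorithm 3 for the special case of the Euclidean generator φ(x) = (1/2)x², in which the regular elements of Eq. (21) reduce to a plain average; the function name and the default values of s_max and s_min are our own illustrative choices:

```python
import numpy as np

def unweighted_bcc_constrained(base_Ms, must_link, cannot_link,
                               s_max=1.0, s_min=0.0):
    """Sketch of Algorithm 3 with phi(x) = x^2/2: regular elements follow
    Eq. (21) (here the average of the base matrices), while constrained
    elements are pinned to s_max / s_min as in Eq. (23)."""
    M = np.mean(np.stack(base_Ms), axis=0)     # regular elements, Eq. (21)
    for p, q in must_link:                     # (x_p, x_q) in the must-link set
        M[p, q] = M[q, p] = s_max
    for p, q in cannot_link:                   # (x_p, x_q) in the cannot-link set
        M[p, q] = M[q, p] = s_min
    np.fill_diagonal(M, 1.0)
    return M
```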

Algorithm 2 Weighted Bregmannian Consensus Clustering.
Require: Partitions {P^i}_{i=1}^m, precision E, tradeoff parameter λ > 0
1: Construct the m cluster aggregation matrices {M^i}_{i=1}^m
2: Initialize w^0 = [1/m, 1/m, ..., 1/m]^T ∈ R^{m×1}, M^0 = I_{n×n}, Δ = +∞, t = 0
3: while Δ > E do
4:   t = t + 1
5:   Solve M^t by minimizing Σ_i w_i^{t−1} D_φ(M^t, M^i)
6:   Solve w^t by min_{w^t} Σ_{i=1}^m w_i^t D_φ(M^t, M^i) + λ||w^t||², s.t. Σ_{i=1}^m w_i^t = 1, w_i^t ≥ 0 (∀ i = 1, 2, ..., m)
7:   Compute Δ = |Ĵ₂(M^t, w^t) − Ĵ₂(M^{t−1}, w^{t−1})|
8: end while
9: Output M.
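A minimal NumPy sketch of Algorithm 2 for the Euclidean generator φ(x) = (1/2)x² is shown below. Under this choice, the M-step of Eq. (12) reduces to a weighted average and the w-step of Eq. (15) reduces to a Euclidean projection onto the probability simplex; this reformulation and the helper names are ours, not taken from the paper:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ind = np.arange(1, len(v) + 1)
    cond = u - css / ind > 0
    theta = css[cond][-1] / ind[cond][-1]
    return np.maximum(v - theta, 0.0)

def wbcc(base_Ms, lam=1.0, tol=1e-6, max_iter=100):
    """Sketch of Algorithm 2 with phi(x) = x^2/2 (block coordinate descent)."""
    Ms = np.stack(base_Ms)                          # shape (m, n, n)
    m = len(Ms)
    w = np.full(m, 1.0 / m)
    prev_obj = np.inf
    for _ in range(max_iter):
        M = np.tensordot(w, Ms, axes=1)             # M-step: Eq. (12)/(17)
        d = 0.5 * ((Ms - M) ** 2).sum(axis=(1, 2))  # d_i = D_phi(M, M^i)
        w = project_simplex(-d / (2.0 * lam))       # w-step: minimizes d.w + lam*||w||^2
        obj = d @ w + lam * (w @ w)                 # J_hat_2 of Eq. (18)
        if abs(prev_obj - obj) < tol:
            break
        prev_obj = obj
    return M, w
```

The constrained variant of Algorithm 4 differs only in that, after each M-step, the constrained entries are pinned to s_max or s_min, exactly as in the Algorithm 3 sketch above.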

we use 50 base clusterings, which are generated by K-means clustering. For a given dataset, it is evident from Table 3 that the WBCC method is superior to the other methods. For the breast (colorectal) cancer data, the accuracy of the KLWBCC method is 82.9% (90.2%), which is much higher than that of the other clustering methods. For the lung cancer dataset, although the clustering accuracies of our EWBCC and eWBCC methods are only 67.9%, they are still superior to the other methods. In addition, the standard deviation of the clustering accuracy is used to represent the robustness of these algorithms, as shown in Table 4. It is obvious that our consensus methods are more stable.

6.2.4. Comparison with semi-supervised methods

We also conduct a set of experiments to compare the performance of semi-supervised WBCC (SSWBCC) with other competitors on the breast, lung, and colorectal data sets.

Algorithm 4 Weighted Bregmannian Consensus Clustering with Constraints.
Require: Partitions {P^i}_{i=1}^m, constraint sets ℳ and 𝒞, precision E, tradeoff parameter λ > 0
1: Construct the m cluster aggregation matrices {M^i}_{i=1}^m
2: Initialize w^0 = [1/m, 1/m, ..., 1/m]^T ∈ R^{m×1}, M^0 = I_{n×n}, Δ = +∞, t = 0
3: while Δ > E do
4:   t = t + 1
5:   Obtain M^t by solving Problem (25) with w = w^t
6:   Obtain w^t by solving Problem (28) with M = M^t
7:   Compute Δ = |Ĵ₂(M^t, w^t) − Ĵ₂(M^{t−1}, w^{t−1})|
8: end while
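The semi-supervised experiments below generate must-link and cannot-link constraints by sampling pairs of points and comparing their held-out labels; a minimal sketch of that sampling procedure, as we understand it from the description in this subsection, is the following (the helper name is illustrative, and we assume the requested number of constraints does not exceed the number of distinct pairs):

```python
import numpy as np

def sample_constraints(labels, n_constraints, seed=0):
    """Sample unordered pairs without repetition; equal labels give a
    must-link, different labels give a cannot-link. Labels are used only
    here, never by the clustering itself."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    must_link, cannot_link, seen = [], [], set()
    while len(must_link) + len(cannot_link) < n_constraints:
        p, q = rng.choice(n, size=2, replace=False)
        pair = (min(p, q), max(p, q))      # store pairs in increasing order
        if pair in seen:
            continue                       # skip repeated pairwise constraints
        seen.add(pair)
        if labels[pair[0]] == labels[pair[1]]:
            must_link.append(pair)
        else:
            cannot_link.append(pair)
    return must_link, cannot_link
```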

We also implemented SSWBCC under the Euclidean distance, exponential distance, and KL divergence metrics (denoted as SSEWBCC, SSeWBCC, and SSKLWBCC, respectively). The comparison methods include:

Pairwise Constrained K-means (PCK-means) is an improved K-means algorithm with pairwise constraints, which include a few must-links and cannot-links. This method is implemented with a fixed setting of the number of clusters k after obtaining some constraints. The implementation can be seen in [48].

Metric Pairwise Constrained K-means (MPCK-means) is implemented as in [49], and the number of clusters k is also set to 2.

In our experiments, the constraints were generated by randomly selecting pairs of data points from the data sets, where the labels were available for evaluation purposes but unavailable for clustering. In order to avoid repeated pairwise constraints during the random selection, our approach is as follows: first, the samples are numbered (0, 1, 2, and so on), and each randomly selected pair of samples is sorted in increasing order. For example, if the pair (105, 9) is selected, it is transformed into (9, 105). We then search for this pairwise constraint in the sets of must-links and cannot-links. If it is not in ℳ or 𝒞, we put the constraint into the corresponding set and increase the number of constraints by one; otherwise, the sets remain unchanged. Note that if the labels of the two points are the same, a must-link is generated; otherwise, a cannot-link is generated. Although the constraint information comes from the labels, it is not the same as directly assigning the randomly selected samples to their labels: a constraint can only reflect whether a pair of samples belongs to the same cluster or not, and it cannot indicate which cluster the samples belong to. Thus, clustering these cancer datasets with some constraints still depends on the semi-supervised clustering algorithm.

The results in terms of clustering accuracy averaged over 50 independent runs are shown in Fig. 4, where the x-axis corresponds to the number of constraints generated and the y-axis represents the averaged clustering accuracy.

Fig. 4. The x-axis corresponds to the number of constraints generated, and the y-axis represents the averaged clustering accuracy, where each algorithm is run 50 times under the same number of constraints. The error bars refer to the standard deviation of the clustering accuracy under each number of constraints. The accuracies of SSEWBCC, SSeWBCC, SSKLWBCC, MPCK-means, and PCK-means are depicted in black, red, blue, purple, and green. In (a), (c), (d), (f), and (g), the black line and red line coincide completely, which shows that SSeWBCC and SSEWBCC have highly similar effects for cancer subtype analysis.

From Fig. 4(a)-(f), it can be noted that the clustering accuracy of the proposed SSEWBCC, SSeWBCC, and SSKLWBCC is higher than those of the other methods (MPCK-means and PCK-means). In addition, for colon cancer, although the performance of the semi-supervised WBCC is comparable to the other

approaches, given a higher number of constraints (in this case, more than 2500) it improves in performance, as presented in Fig. 4(g)-(i). According to the error bars in Fig. 4, our algorithms are more robust, because the standard deviation of our methods is smaller than that of the others. Moreover, in Fig. 4(a), (c), (d), and (g), the black lines and red lines coincide completely, indicating that SSEWBCC and SSeWBCC have highly similar effects for cancer subtype analysis. From these figures, we can also see that our semi-supervised ensemble methods generally outperform the other semi-supervised clustering algorithms as the number of constraints increases for cancer subtype analysis.

7. Conclusions

In this paper, we propose a general framework for consensus clustering based on Bregman divergence, and then derive a series of clustering aggregation algorithms under this framework, including WBCC and SSWBCC. The core code is available at https://github.com/XieYunshen/consensus_clustering.


In the clinical domain, these algorithms are used to analyze the similarity between molecules in mRNA expression data, DNA methylation data, or microRNA expression data, aiming to further study subtypes of cancer (e.g., breast cancer, colorectal cancer, and lung cancer). Moreover, we compare the weighted clustering aggregation algorithms with K-means, SC, CSPA, HPGA, and NMFC, and compare the SSWBCC methods with PCK-means and MPCK-means. Finally, the results of the experiments show the effectiveness of our methods. In the next step, we plan to apply our consensus clustering framework to large-scale data, which not only refers to data of high dimensionality but also to data with many samples. It may be useful in more general fields, such as finding relevant but easily-overlooked factors of environmental pollution, analyzing other influencing causes of education levels, and searching for hidden but still important causes of mental disease.

Declaration of Competing Interest

We declare that we have no conflicts of interest with respect to this work. We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Acknowledgment

This study is supported by the National Key R&D Program of China with project no. 2017YFB1400803.

References

[1] A. Jemal, F. Bray, M.M. Center, et al., Global cancer statistics, CA: Cancer J. Clin. 61 (2) (2011) 69–90.
[2] F. Bray, J. Ferlay, I. Soerjomataram, et al., Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries, CA: Cancer J. Clin. 68 (6) (2018) 394–424.
[3] K. Tomczak, P. Czerwińska, M. Wiznerowicz, The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge, Contemp. Oncol. 19 (1A) (2015) A68.
[4] H. Xiong, J. Wu, J. Chen, K-means clustering versus validation measures: a data-distribution perspective, IEEE Trans. Syst. Man Cybernet. Part B (Cybernetics) 39 (2) (2009) 318–331.
[5] B. Malakooti, Z. Yang, Clustering and group selection of multiple criteria alternatives with application to space-based networks, IEEE Trans. Syst. Man Cybernet. Part B (Cybernetics) 34 (1) (2004) 40–51.
[6] O. Younis, S. Fahmy, HEED: a hybrid, energy-efficient, distributed clustering approach for ad hoc sensor networks, IEEE Trans. Mobile Comput. 3 (4) (2004) 366–379.
[7] S. Chen, F. Wang, C. Zhang, Simultaneous heterogeneous data clustering based on higher order relationships, in: Data Mining Workshops, ICDM Workshops 2007, Seventh IEEE International Conference on, IEEE, 2007, pp. 387–392.
[8] X.Z. Fern, C.E. Brodley, Solving cluster ensemble problems by bipartite graph partitioning, in: Proceedings of the Twenty-First International Conference on Machine Learning, ACM, 2004, p. 36.
[9] I.S. Dhillon, S. Mallela, D.S. Modha, Information-theoretic co-clustering, in: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2003, pp. 89–98.
[10] T. Li, C. Ding, M.I. Jordan, Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization, in: ICDM, IEEE, 2007, pp. 577–582.
[11] D.S. Wilks, Cluster analysis, in: International Geophysics, 100, Academic Press, 2011, pp. 603–616.
[12] C. Domeniconi, M. Al-Razgan, Weighted cluster ensembles: methods and analysis, ACM Trans. Knowl. Discov. Data (TKDD) 2 (4) (2009) 17.
[13] M. Al-Razgan, C. Domeniconi, Weighted clustering ensembles, in: Proceedings of the 2006 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, 2006, pp. 258–269.
[14] F. Gullo, A. Tagarelli, S. Greco, Diversity-based weighting schemes for clustering ensembles, in: Proceedings of the 2009 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, 2009, pp. 437–448.
[15] S. Vega-Pons, J. Correa-Morris, J. Ruiz-Shulcloper, Weighted cluster ensemble using a kernel consensus function, in: Iberoamerican Congress on Pattern Recognition, Springer, Berlin, Heidelberg, 2008, pp. 195–202.
[16] X.Z. Fern, W. Lin, Cluster ensemble selection, Statist. Anal. Data Min. 1 (3) (2008) 128–141.
[17] A. Banerjee, S. Merugu, I.S. Dhillon, et al., Clustering with Bregman divergences, J. Mach. Learn. Res. 6 (Oct) (2005) 1705–1749.


[18] S. Basu, I. Davidson, K. Wagstaff (Eds.), Constrained Clustering: Advances in Algorithms, Theory, and Applications, CRC Press, 2008.
[19] S. Dempe, Directional differentiability of optimal solutions under Slater's condition, Math. Programm. 59 (1–3) (1993) 49–69.
[20] S. Basu, M. Bilenko, R.J. Mooney, A probabilistic framework for semi-supervised clustering, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2004.
[21] F. Wang, T. Li, C. Zhang, Semi-supervised clustering via matrix factorization, in: Proceedings of the 2008 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, 2008, pp. 1–12.
[22] H. Aidos, A. Lourenço, D. Batista, et al., Semi-supervised consensus clustering for ECG pathology classification, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, Cham, 2015, pp. 150–164.
[23] R. Shamir, R. Sharan, Algorithmic approaches to clustering gene expression data, Curr. Topic. Comput. Mol. Biol. (2002) 269.
[24] M.B. Eisen, P.T. Spellman, P.O. Brown, et al., Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. 95 (25) (1998) 14863–14868.
[25] T. Kohonen, The self-organizing map, Proc. IEEE 78 (9) (1990) 1464–1480.
[26] P. Tamayo, D. Slonim, J. Mesirov, et al., Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation, Proc. Natl. Acad. Sci. 96 (6) (1999) 2907–2912.
[27] Y. Xu, V. Olman, D. Xu, Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees, Bioinformatics 18 (4) (2002) 536–545.
[28] G. Kerr, H.J. Ruskin, M. Crane, et al., Techniques for clustering gene expression data, Comput. Biol. Med. 38 (3) (2008) 283–293.
[29] W. Xiao, Y. Yang, H. Wang, et al., Semi-supervised hierarchical clustering ensemble and its application, Neurocomputing 173 (2016) 1362–1376.
[30] M.B. Eisen, P.T. Spellman, P.O. Brown, et al., Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. 95 (25) (1998) 14863–14868.
[31] A. Topchy, A.K. Jain, W. Punch, Clustering ensembles: models of consensus and weak partitions, IEEE Trans. Pattern Anal. Mach. Intell. 27 (12) (2005) 1866–1881.
[32] Z. Yu, H.S. Wong, H. Wang, Graph-based consensus clustering for class discovery from gene expression data, Bioinformatics 23 (21) (2007) 2888–2896.
[33] A. Strehl, J. Ghosh, Cluster ensembles—a knowledge reuse framework for combining multiple partitions, J. Mach. Learn. Res. 3 (Dec) (2002) 583–617.
[34] A. Goder, V. Filkov, Consensus clustering algorithms: comparison and refinement, in: Proceedings of the Meeting on Algorithm Engineering & Experiments, Society for Industrial and Applied Mathematics, 2008, pp. 109–117.
[35] S. Vega-Pons, J. Ruiz-Shulcloper, A survey of clustering ensemble algorithms, Int. J. Pattern Recognit. Artif. Intell. 25 (3) (2011) 337–372.
[36] L.I. Kuncheva, D.P. Vetrov, Evaluation of stability of k-means cluster ensembles with respect to random initialization, IEEE Trans. Pattern Anal. Mach. Intell. 28 (11) (2006) 1798–1808.
[37] A. Topchy, A.K. Jain, W. Punch, Combining multiple weak clusterings, in: Third IEEE International Conference on Data Mining, IEEE, 2003, pp. 331–338.
[38] B. Minaei-Bidgoli, A.P. Topchy, W.F. Punch, A comparison of resampling methods for clustering ensembles, in: Proceedings of the International Conference on Artificial Intelligence, IC-AI'04, CSREA Press, Las Vegas, NV, United States, 2004, pp. 939–945.
[39] C. Domeniconi, D. Gunopulos, S. Ma, et al., Locally adaptive metrics for clustering high dimensional data, Data Mining Knowl. Disc. 14 (1) (2007) 63–97.
[40] J. Azimi, X. Fern, Adaptive cluster ensemble selection, in: IJCAI-09 - Proceedings of the 21st International Joint Conference on Artificial Intelligence, Pasadena, California, USA, 2009, pp. 992–997.
[41] J. Wu, H. Liu, H. Xiong, et al., K-means-based consensus clustering: a unified view, IEEE Trans. Knowl. Data Eng. 27 (1) (2014) 155–169.
[42] V. Filkov, S. Skiena, Integrating microarray data by consensus clustering, Int. J. Artif. Intell. Tools 13 (4) (2004) 863–880.
[43] E.F. Lock, D.B. Dunson, Bayesian consensus clustering, Bioinformatics 29 (20) (2013) 2610–2616.
[44] Z. Yu, H. Chen, J. You, et al., Adaptive fuzzy consensus clustering framework for clustering analysis of cancer data, IEEE/ACM Trans. Comput. Biol. Bioinformat. (TCBB) 12 (4) (2015) 887–901.
[45] B. Malakooti, Z. Yang, Clustering and group selection of multiple criteria alternatives with application to space-based networks, IEEE Trans. Syst. Man Cybernet. Part B (Cybernetics) 34 (1) (2004) 40–51.
[46] H. Xiong, J. Wu, J. Chen, K-means clustering versus validation measures: a data-distribution perspective, IEEE Trans. Syst. Man Cybernet. Part B (Cybernetics) 39 (2) (2008) 318–331.
[47] B. Wang, A.M. Mezlini, F. Demir, et al., Similarity network fusion for aggregating data types on a genomic scale, Nature Methods 11 (3) (2014) 333.
[48] K. Wagstaff, C. Cardie, S. Rogers, et al., Constrained k-means clustering with background knowledge, in: ICML, 2001, pp. 577–584.
[49] M. Bilenko, S. Basu, R.J. Mooney, Integrating constraints and metric learning in semi-supervised clustering, in: Proceedings of the Twenty-First International Conference on Machine Learning, ACM, 2004, p. 11.
[50] Y. Wang, Y. Pan, Semi-supervised consensus clustering for gene expression data analysis, BioData Min. 7 (1) (2014) 7.
[51] G. Karypis, hMETIS 1.5: a hypergraph partitioning package, http://www.cs.umn.edu/~metis, 1998.