Expert Systems with Applications 30 (2006) 137–141 www.elsevier.com/locate/eswa
An efficient greedy K-means algorithm for global gene trajectory clustering

Zeke S.H. Chan a,*, Lesley Collins b, N. Kasabov a

a Knowledge Engineering and Discovery Research Institute (KEDRI), Auckland University of Technology, Auckland, New Zealand
b Allan Wilson Center for Molecular Ecology and Evolution, Massey University, Palmerston North, New Zealand
Abstract

Optimal clustering of co-regulated genes is critical for reliable inference of the underlying biological processes in gene expression analysis, for which the K-means algorithm has been widely employed for its efficiency. However, given that the solution space is large and multimodal, as is typical of gene expression data, K-means is prone to producing inconsistent and sub-optimal cluster solutions that may be unreliable and misleading for biological interpretation. This paper applies a novel global clustering method called the greedy elimination method (GEM) to alleviate these problems. GEM is simple to implement, yet very effective in improving the global optimality of the solutions. Experiments over two sets of gene expression data show that GEM scores significantly lower clustering errors than the standard K-means and the greedy incremental method. © 2005 Elsevier Ltd. All rights reserved.

Keywords: K-means clustering; Gene expression data; Greedy elimination method
1. Introduction

The K-means algorithm is a simple and efficient clustering method that has been applied to many engineering problems (Chang & Lai, 2005; Hsieh, 2005; Kuo, Kuo, & Chen, 2005; Shin & Sohn, 2004; Tsai, Chiu, & Chen, 2005). In gene expression analysis, the clustering of co-regulated genes is an important task, for which the K-means algorithm has been widely used either as a stand-alone clustering method, or as a fast method for computing the optimal initial cluster centers for more expensive clustering methods. It employs a simple iterative scheme that performs hill climbing from initial centers, whose values are usually randomly picked from the training data. Although the algorithm is very efficient, it suffers from two well-known problems that hinder its application to gene expression data: (i) the solutions are only locally optimal and (ii) their quality is sensitive to the initial conditions (i.e. the values of the initial centers). This paper applies an efficient global clustering method called the greedy elimination method (GEM) (Chan & Kasabov, 2004) to alleviate these problems. Experiments conducted on real gene expression data show that GEM is effective in enhancing the global optimality and consistency of the clustering solutions.
* Corresponding author. E-mail address: [email protected] (Z.S.H. Chan).
2. Problem definition and the greedy elimination algorithm (GEM)

With the conventional K-means algorithm, the clustering task is to cluster N samples of training data X = {x_1, ..., x_N} into K Voronoi partitions defined by the cluster centers M = {m_1, ..., m_K}. The most common clustering criterion E(X, M) is the minimization of the clustering error, defined as the sum of squared Euclidean distances between each data point and its nearest cluster center. Let C_k, k ∈ {1, 2, ..., K}, represent K disjoint subsets such that x_n ∈ C_k if k = arg min_i ||x_n − m_i||². E(X, M) is given by

$$E(X, M) = \sum_{n=1}^{N} \sum_{k=1}^{K} I(x_n \in C_k)\, \|x_n - m_k\|^2 \qquad (1)$$

where I(·) = 1 if its argument is true and 0 otherwise.
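As a concrete reference for Eq. (1), the clustering error can be evaluated in a few lines. The sketch below is an illustrative helper (ours, not the paper's, which uses MATLAB), written in Python with NumPy, assuming X is an N×d data matrix and M a K×d matrix of centers:

```python
import numpy as np

def clustering_error(X, M):
    """Eq. (1): sum of squared Euclidean distances from each point
    in X (N x d) to its nearest center in M (K x d)."""
    # D[n, k] = ||x_n - m_k||^2 for every point/center pair
    D = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    # Taking the minimum over centers realises the indicator I(x_n in C_k)
    return D.min(axis=1).sum()
```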
GEM is implemented as follows. Let α > 1 represent the enlargement factor and K the desired number of centers. GEM initializes with αK centers, and then eliminates them one-by-one until K centers remain. Empirically, we find that α = 2 gives very cost-effective solutions, and we use this value in all subsequent experiments. Let M*(J) denote the optimal solution for J centers. The kernel operation of GEM is the greedy elimination of centers to obtain M*(J−1) given M*(J), and it proceeds as follows. From M*(J), we extract J reduced solutions M_{-j}, j ∈ {1, 2, ..., J}, and evaluate their upper bound clustering errors (described later), where M_{-j} is given as M*(J) minus the jth center.
Fig. 1. Demonstration of GEM on clustering an artificial dataset into four clusters. (a) We use an enlargement factor of α = 2 to begin with eight initial centers, which provide knowledge of the global distribution of the data. (b) The centers are eliminated one-by-one until only four centers remain.
The reduced solution that yields the lowest upper bound error, denoted M*_{-j}, is selected as the optimal initial condition for the next-stage solution, to which we apply the K-means algorithm to obtain the corresponding optimal solution M*(J−1). Thus, for GEM to compute the optimal solution for K centers, we set the initial solution to M*(αK) and repeat the greedy elimination process to obtain the intermediate solutions M*(αK−1), M*(αK−2) and so on, until M*(K) is obtained. An illustration of GEM is shown in Fig. 1 and its pseudo code is as follows.

Step 1. Choose the value of the enlargement factor α > 1. Apply K-means to find the optimal solution with αK centers (the initial center values are randomly sampled from the training data), which becomes the current solution M*(αK) = {m_1, ..., m_{αK}}.

Step 2. Let J denote the number of centers in the current solution. If J = K, exit and output M*(J) as the final solution; otherwise move on to Step 3.

Step 3. FOR j = 1 TO J:
i. Extract the reduced solution M_{-j} = {m_1, ..., m_{j−1}, m_{j+1}, ..., m_J} from the current solution M*(J) by keeping all the centers except the jth one.
ii. Use the Euclidean distance table computed from the last run of K-means to calculate the upper bound clustering error of M_{-j}.
END
Apply K-means to optimise the reduced solution M*_{-j} that yields the lowest upper bound clustering error, retrieving M*(J−1), the optimal solution for (J−1) centers. Return to Step 2.

The upper bound clustering error is a very efficient means of comparing the relative optimality of the reduced solutions M_{-j}, because it can be evaluated without re-computing the Euclidean distances between the data points and M_{-j}. It is computed by: (i) storing the table that records the Euclidean distances between the data points and the centers from
computing M*(J); and (ii) summing the minimum distance from each data point to all except the jth center in the table. Let U_{-j}(X, M*(J)) denote the upper bound clustering error for the reduced solution M_{-j}. It is formally given as

$$U_{-j}(X, M^*(J)) = \sum_{n=1}^{N} \sum_{\substack{k=1 \\ k \neq j}}^{J} I(x_n \in C_k)\, \|x_n - m^*_k\|^2 \qquad (2)$$
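Because the distance table from the last K-means run is cached, all J upper bounds can be computed in a single pass. The following sketch (our illustration, assuming a cached table D with D[n, j] = ||x_n − m_j||²) exploits the fact that removing center j only affects the points currently assigned to it, each of which falls back to its second-nearest center:

```python
import numpy as np

def upper_bound_errors(D):
    """Eq. (2) for every j: clustering error of the reduced solution
    M_{-j}, computed from the cached squared-distance table D (N x J)
    without revisiting the data points."""
    N, J = D.shape
    order = np.argsort(D, axis=1)
    nearest = order[:, 0]                 # index of each point's nearest center
    d1 = D[np.arange(N), order[:, 0]]     # distance to nearest center
    d2 = D[np.arange(N), order[:, 1]]     # distance to second-nearest center
    U = np.full(J, d1.sum())
    # Dropping center j re-routes only its own points to their runner-up
    np.add.at(U, nearest, d2 - d1)
    return U
```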
U_{-j}(X, M*(J)) is called the 'upper-bound' error of M_{-j} because we can apply K-means to further refine M_{-j} to achieve a lower error. Overall, GEM requires only one run of K-means for each stage of center reduction, and therefore a total of (αK − K) = (α − 1)K runs of K-means.¹

GEM achieves two important objectives: global clustering and efficiency. It achieves global clustering because, by using an initial solution of more than K centers, GEM covers a larger portion of the search space and therefore gains more knowledge of the global distribution of the data at the beginning of the clustering process. By comparing the optimality of different solutions and eliminating sub-optimal centers during the greedy elimination process, GEM uses this knowledge to guide the search towards the globally optimal region. One can draw an analogy between the use of more than K centers in GEM for global clustering and the use of multiple search points in Evolutionary Algorithms (such as the Genetic Algorithm) for global optimisation. GEM is also more efficient than many global clustering methods proposed in the literature, for two reasons. First, empirical tests show that a small enlargement factor of α = 2 is often sufficient. Second, the required number of K-means runs grows linearly in K but not in N. Since typically K ≪ N, this requirement is much smaller than that of some global clustering methods such as the Genetic Algorithm of Maulik and Bandyopadhyay (2000), which scales as (population size × max. generation), where typical values of the population size and the maximum generation are [10, 100] and 1000, respectively.

¹ Note that the number of K-means runs alone is not an accurate measure of GEM's computational complexity, because each run operates on a different number of centers. Here we use this measure only as a very rough approximation.
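Putting the pieces together, the whole GEM loop is short. The sketch below is a minimal rendering of Steps 1–3 under stated assumptions: `kmeans(X, M0)` is a hypothetical helper that runs standard Lloyd iterations from initial centers M0 and returns the refined centers (one such helper is sketched in Section 3), and `upper_bound_errors` is the function above:

```python
import numpy as np

def gem(X, K, kmeans, alpha=2, seed=0):
    """Greedy elimination method: start from alpha*K optimised centers
    and eliminate one center per stage until K remain."""
    rng = np.random.default_rng(seed)
    # Step 1: optimise alpha*K centers sampled randomly from the data
    M = kmeans(X, X[rng.choice(len(X), alpha * K, replace=False)])
    while len(M) > K:                                        # Step 2
        # Step 3: score every reduced solution M_{-j} via Eq. (2),
        # using the distance table of the current optimised solution
        D = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        j = int(upper_bound_errors(D).argmin())
        # Refine the best reduced solution with K-means to get M*(J-1)
        M = kmeans(X, np.delete(M, j, axis=0))
    return M
```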
GEM's efficiency is similar to that of the fast greedy incremental method (GInc) proposed by Likas, Vlassis, and Verbeek (2003), which uses a k-d tree to reduce the original complexity that scales in NK; however, GEM involves no ad-hoc heuristics and is simpler and more robust to implement. In addition to its global clustering property and efficiency, GEM has the advantage that the optimal clustering errors of all intermediate solutions between αK and K centers are available as a side-product of the greedy elimination process, which can be used to determine the optimal number of centers using criteria such as the Akaike or Bayesian Information Criterion.
3. Experiments

Fig. 2. The log-normalized gene expression trajectories of the fibroblasts (517 genes) and the yeast data (612 genes).

To validate the effectiveness of GEM, we compare its performance against two control algorithms on clustering gene expression data: the standard K-means, and the fast greedy incremental method (GInc) of Likas et al. (2003). GInc starts with one center and proceeds incrementally until the model contains K centers, and it uses a k-d tree to approximate the optimal center insertion point in order to reduce the original requirement of NK K-means runs. Note that, unlike the standard K-means and GEM, which use random initial conditions, GInc is a deterministic algorithm, i.e. it yields the same result over different runs; therefore we cannot measure its consistency in reproducing similar solutions over different runs. GEM uses the enlargement factor α = 2 in all cases to demonstrate its robustness against parameter tuning. Both the standard K-means and GEM initialise the center values by sampling randomly from the training data. For all three methods, the stopping criterion for the K-means iterations is that the clustering error decreases by less than 0.01%. Clustering errors are measured as the sum-squared errors (SSEs) defined in (1). All algorithms are coded in MATLAB and tested on a P4 2.4 GHz PC.

The experiments are performed on the fibroblasts and the yeast time-series gene expression data. Both datasets are log-normalized (Fig. 2) before clustering. The fibroblasts data was reported in Iyer, Eisen, Ross, and Schuler (1999). It contains time-series gene expression data for the response of human fibroblast cells to fetal bovine serum (FBS), measured with cDNA microarrays at 12 time points during the physiological response of fibroblasts to serum. It consists of 8618 genes sampled at {0, 0.25, 0.5, 1, 2, 4, 6, 8, 12, 16, 20, 24} hours, of which 517 genes were identified to respond significantly to the serum and are used in our analysis.
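The K-means refinement shared by all three methods (and assumed by the GEM sketch in Section 2) can be realised as a plain Lloyd iteration with the 0.01% stopping rule above. This sketch is ours, and the re-seeding of empty clusters is one common convention that the paper does not specify:

```python
import numpy as np

def kmeans(X, M0, tol=1e-4):
    """Lloyd iterations from initial centers M0, stopping when the
    clustering error improves by less than 0.01% (tol = 1e-4)."""
    M, prev = M0.astype(float).copy(), np.inf
    while True:
        D = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        labels, sse = D.argmin(axis=1), D.min(axis=1).sum()
        if prev - sse < tol * prev:       # relative improvement below 0.01%
            return M
        prev = sse
        for k in range(len(M)):
            members = X[labels == k]
            # Re-seed an emptied cluster from a random data point
            M[k] = members.mean(axis=0) if len(members) else X[np.random.randint(len(X))]
```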
Table 1
Comparison of clustering performance between the standard K-means, GInc and GEM over 20 runs for the human fibroblasts and the yeast data (SSE: sum-squared error; SD: standard deviation; times in seconds; GInc is deterministic, so no SD applies)

Human fibroblasts data

K  | Std. K-means: Mean SSE (SD) | Time | GInc: SSE | Time | GEM: Mean SSE (SD) | Time
5  | 77.0 (3.2)  | 0.1 | 76.1 | 0.3 | 74.8 (0.9) | 0.2
7  | 65.8 (2.8)  | 0.1 | 62.5 | 0.4 | 61.0 (0.3) | 0.3
9  | 55.5 (1.7)  | 0.1 | 53.0 | 0.5 | 52.0 (0.3) | 0.4
11 | 53.1 (2.5)  | 0.1 | 48.5 | 0.6 | 47.5 (0.6) | 0.6
13 | 47.4 (1.9)  | 0.1 | 45.0 | 0.7 | 43.9 (0.5) | 0.8
15 | 45.9 (1.7)  | 0.1 | 42.4 | 0.8 | 41.0 (0.6) | 1.0
17 | 43.9 (1.9)  | 0.1 | 40.6 | 0.9 | 38.6 (0.4) | 1.3
19 | 41.8 (1.3)  | 0.1 | 38.9 | 1.0 | 36.6 (0.4) | 1.68

Yeast data

K  | Std. K-means: Mean SSE (SD) | Time | GInc: SSE | Time | GEM: Mean SSE (SD) | Time
5  | 159.6 (10.0) | 0.1 | 152.6 | 0.4 | 151.8 (0.7) | 0.2
7  | 136.8 (4.7)  | 0.1 | 133.1 | 0.6 | 132.3 (0.6) | 0.3
9  | 128.7 (4.3)  | 0.1 | 125.6 | 0.7 | 119.8 (0.6) | 0.5
11 | 120.0 (5.6)  | 0.1 | 115.6 | 0.8 | 111.1 (0.4) | 0.6
13 | 115.1 (4.2)  | 0.1 | 111.0 | 0.9 | 105.2 (0.7) | 0.9
15 | 112.4 (3.6)  | 0.1 | 107.0 | 1.1 | 100.2 (0.6) | 1.3
17 | 110.4 (2.7)  | 0.1 | 101.6 | 1.3 | 96.1 (0.5)  | 1.6
19 | 107.4 (4.4)  | 0.1 | 98.7  | 1.3 | 92.65 (0.9) | 2.08
Table 2
Frequencies of clustering together members that belong to the same previously reported gene groups over 20 runs

Functional group | Genes/transcription factors | No. genes | Std. K-means (%) | GInc (%) | GEM (%)

Fibroblasts data
Transcription factors | c-FOS, JunB, MAPKP1 | 3 | 80 | 0 | 100
Cell cycle progression | CyclinA, CyclinB1, Cdc2, Cdc28 | 4 | 100 | 100 | 100
Cell cycle progression | CyclinA, CyclinB1, Cdc2, Cdc28, Madp2, CENP-f | 6 | 100 | 100 | 100
Cholesterol biosynthesis | HMG CoA reductase, IPP isomerase, farnesylDFT, squalene epoxidase | 4 | 80 | 0 | 100
Cholesterol biosynthesis | HMG CoA reductase, IPP isomerase, farnesylDFT, squalene epoxidase, CYP51 | 5 | 75 | 0 | 100
Cell cycle inhibitors | Kip1, Kip2 | 2 | 80 | 100 | 100
Cell cycle inhibitors | P18, DDB2, DP2(E2F2) | 3 | 70 | 100 | 100

Yeast data
Histone proteins/chromatin | HHO1, HTB2, HTA2, HHF1, HHT2, HTB1, HTA1 | 7 | 95 | 100 | 100
DNA replication | CDC2, POL1, POL2, CDC21, CDC45, PMS1, MSH2 | 7 | 85 | 100 | 100
Methionine biosynthesis | MET1, MET3, SAM1 | 3 | 100 | 100 | 100
Methionine biosynthesis | MET14, MET17, MUP1 | 3 | 95 | 100 | 100
Cell cycle progression | PCL9, TEC1, ASH1, SIC1, STS1 | 5 | 85 | 100 | 100
DNA repair-G1 phase | DUN1, MSH2, MSH6, RAD51, RAD54 | 5 | 90 | 100 | 100
Cell cycle-M phase | CDC5, CDC20, PMP1, PMA1, CLB1, BUD4 | 6 | 85 | 100 | 100

All clustering methods use five clusters. Gene names printed in bold in the original indicate genes added to expand the previous group.
The yeast data, first reported in Spellman, Sherlock, Zhang, and Iyer (1998), records the expression of cell-cycle-regulated genes from the yeast Saccharomyces cerevisiae. In their paper, Spellman et al. analyzed transcript levels in yeast cell cultures whose cell cycles had been synchronized by three independent methods, as each synchronization method can produce different side effects, such as activating some heat shock proteins. The yeast cells were sampled at 7 min intervals for 119 min, giving a total of 18 time points. This dataset contains 800 genes whose transcript levels vary during the yeast cell cycle, of which we use only the subset of 612 genes with no missing gene expression values across all of the 18 time points.

Each gene expression dataset is clustered into {5, 7, ..., 19} groups 20 times using the three tested algorithms, and the results are shown in Table 1. GEM scores the lowest clustering error in every case, followed by GInc, and then the standard K-means. GEM also scores a much lower standard deviation of errors than the standard K-means, suggesting higher consistency in reproducing similar solutions over different runs (GInc produces deterministic solutions, hence this measure of standard deviation is not applicable). Interestingly, GEM requires less than or equal computation time compared with GInc when K < 13, and more computation time only when K ≥ 13. The computation time, ranging from 0.2 to 2 s, is very efficient for a global clustering method.

Finally, we compare the consistency in grouping together genes that have previously been found to be similarly expressed
over the 20 runs. Consistency of the clustering algorithm is extremely important for making confident interpretations of the underlying biological process in gene expression analysis. Table 2 shows some of the gene groups identified in (Iyer et al., 1999; Swamy, Tan, Zhu, Lu, Achuth, & Moochhala, 2004) for the fibroblasts data and in (Spellman et al., 1998) for the yeast data, together with the frequencies with which members of the same group are clustered together. These results are taken from grouping each dataset into five clusters, as in the original papers of Iyer et al. (1999) and Spellman et al. (1998). GEM generally scores a much higher percentage in reproducing the same gene groups over the 20 runs. Note that some gene groups are expanded from the previously reported groups to include more gene members. In many of these cases, the scores of the standard K-means and GInc deteriorate, while GEM still maintains a high score. GEM thus presents a robust and practical method for reliable clustering of gene expression data.
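The consistency figures in Table 2 reduce to a simple per-group statistic. The following sketch shows how such frequencies can be computed (our illustration, assuming `label_runs` holds one cluster-label array per run, indexed by gene, and `group_idx` the indices of one previously reported gene group):

```python
def grouping_frequency(label_runs, group_idx):
    """Percentage of runs in which every gene of the group receives
    the same cluster label, i.e. the group stays intact."""
    intact = sum(len({labels[i] for i in group_idx}) == 1
                 for labels in label_runs)
    return 100.0 * intact / len(label_runs)
```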
4. Conclusions

This paper applies a fast and effective global clustering method called GEM to the clustering of gene expression data. The experimental results demonstrate GEM's superior clustering performance over the standard K-means and GInc. Highly reliable clustering algorithms are essential to microarray expression analysis, as inconsistent clustering may cause misleading assumptions about genes having similar expression patterns. The improvements by GEM in clustering reliability are therefore of practical importance.

Acknowledgements

This research is supported by the KEDRI postdoctoral fellow research fund and by FRST New Zealand (NERF/AUT X0201).

References

Chan, Z. S. H., & Kasabov, N. (2004). Efficient global clustering using the greedy elimination method. Electronics Letters, 40(25), 1611–1612.
Chang, P., & Lai, C. (2005). A hybrid system combining self-organizing maps with case-based reasoning in wholesaler's new-release book forecasting. Expert Systems with Applications, 29(1), 183–192.
Hsieh, N. (2005). Hybrid mining approach in the design of credit scoring models. Expert Systems with Applications, 28(4), 655–665.
Iyer, V. R., Eisen, M. B., Ross, D. T., & Schuler, G. (1999). The transcriptional program in the response of human fibroblasts to serum. Science, 283, 83–87.
Kuo, R., Kuo, Y., & Chen, K. (2005). Developing a diagnostic system through integration of fuzzy case-based reasoning and fuzzy ant colony system. Expert Systems with Applications, 28(4), 783–797.
Likas, A., Vlassis, N., & Verbeek, J. (2003). The global k-means clustering algorithm. Pattern Recognition, 36, 451–461.
Maulik, U., & Bandyopadhyay, S. (2000). Genetic algorithm-based clustering technique. Pattern Recognition, 33, 1455–1465.
Shin, H., & Sohn, S. (2004). Segmentation of stock trading customers according to potential value. Expert Systems with Applications, 27(1), 27–33.
Spellman, P. T., Sherlock, G., Zhang, M. Q., & Iyer, V. R. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9(12), 3273–3297.
Swamy, S. M., Tan, P., Zhu, Y. Z., Lu, J., Achuth, H. N., & Moochhala, S. (2004). Role of phenytoin in wound healing: Microarray analysis of early transcriptional responses in human dermal fibroblasts. Biochemical and Biophysical Research Communications, 314(3), 661–666.
Tsai, C., Chiu, C., & Chen, J. (2005). A case-based reasoning system for PCB defect prediction. Expert Systems with Applications, 28(4), 813–822.