Gene expression data clustering using a multiobjective symmetry based clustering technique

Gene expression data clustering using a multiobjective symmetry based clustering technique

Computers in Biology and Medicine ∎ (∎∎∎∎) ∎∎∎–∎∎∎ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Q1 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35...

4MB Sizes 0 Downloads 52 Views

Computers in Biology and Medicine ∎ (∎∎∎∎) ∎∎∎–∎∎∎

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Q1 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62Q4 Q3 63 64 65 66

Contents lists available at ScienceDirect

Computers in Biology and Medicine journal homepage: www.elsevier.com/locate/cbm

Gene expression data clustering using a multiobjective symmetry based clustering technique Sriparna Saha a,n, Asif Ekbal a, Kshitija Gupta a, Sanghamitra Bandyopadhyay b a b

Department of Computer Science and Engineering, Indian Institute of Technology, Patna, India Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India

art ic l e i nf o

a b s t r a c t

Article history: Received 3 April 2013 Accepted 17 July 2013

The invention of microarrays has rapidly changed the state of biological and biomedical research. Clustering algorithms play an important role in clustering microarray data sets where identifying groups of co-expressed genes are a very difficult task. Here we have posed the problem of clustering the microarray data as a multiobjective clustering problem. A new symmetry based fuzzy clustering technique is developed to solve this problem. The effectiveness of the proposed technique is demonstrated on five publicly available benchmark data sets. Results are compared with some widely used microarray clustering techniques. Statistical and biological significance tests have also been carried out. & 2013 Elsevier Ltd. All rights reserved.

Keywords: Microarray data Gene expression data clustering Clustering Multiobjective optimization (MOO) Symmetry Archived multiobjective simulated annealing based technique (AMOSA) Automatic determination of number of clusters

1. Introduction Microarray technology helps to study the expression levels of a huge number of genes across different experimental conditions. Applications of these techniques include fields of medical diagnosis, bio-medicine, gene expression profiling, etc [1–3]. Gene is the fundamental unit of storage of hereditary information in living beings. For the technical point of view, it can be treated as a distinct sequence of nucleotides constituting part of a chromosome. Information from a gene is used in the synthesis of functional gene products like proteins and functional RNAs for non-protein coding genes. This process of synthesis is called Gene Expression, by which genotype (a particular set of genes in a genome) gives rise to phenotype (the physical characteristic of the genotype). This Gene Expression Data is generally very huge in size and to search for useful patterns within this data, genes have to be grouped into “clusters” on the basis of similar features. The Gene Expression data is in the form of a 2-Dimensional matrix of Gene Names and the corresponding expression levels for features exhibited by the genes. Clustering [4,5] is a widely used unsupervised pattern classification technique which divides the input space into K regions depending on

n

Corresponding author. Tel.: þ 91 8809559190. E-mail addresses: [email protected] (S. Saha), [email protected] (A. Ekbal), [email protected] (K. Gupta).

some similarity/dissimilarity criterion. Here value of number of clusters (K) may or may not be known a priori. Another important aspect of clustering is to automatically determining the appropriate number of clusters from a given data set [6,7]. In order to achieve this and to check the validity of the obtained partitioning, several cluster validity indices have been developed in the literature. Validity indices try to measure the quality of the clustering solutions; their values used to assign an ordering of the clusters in terms of their goodness. Valid partitioning corresponds to the global optimum of these validity functions. For this reason, Genetic Algorithms (GAs) have been popularly used to determine global optimum values of validity indices, which in turn help to simultaneously detect the appropriate number of clusters and the appropriate partitioning of a data set [8–10]. A variable string length GA (VGA) based clustering method (named VGAPS-clustering) is developed in [11]. In this algorithm a recently developed point symmetry based distance [12] is used for the allocation of cluster number to different points. It uses a point symmetry (PS) based recently developed cluster validity index, Sym-index [13,11] as the objective function. The proposed VGAPS-clustering is able to identify clusters having any shape and size as long as they contain the symmetry property with the help of this PS-distance. Clustering is regarded as a complex task as for many data sets no unambiguous partitioning exists [6,7]. A variety of clustering algorithms exist in the literature; most of them optimized a single cluster quality measure which mirrors a single measure of goodness of a partitioning solution. But sometimes a single cluster validity index is not applicable for determining appropriate

0010-4825/$ - see front matter & 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.compbiomed.2013.07.021

Please cite this article as: S. Saha, et al., Gene expression data clustering using a multiobjective symmetry based clustering technique, Comput. Biol. Med. (2013), http://dx.doi.org/10.1016/j.compbiomed.2013.07.021i

67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

S. Saha et al. / Computers in Biology and Medicine ∎ (∎∎∎∎) ∎∎∎–∎∎∎

2

partitioning for different types of data sets having different properties. Hence, it is required to optimize simultaneously multiple cluster quality measures which are responsible for capturing different cluster properties. Thus here the clustering problem is defined as a multiobjective optimization problem where multiple number of cluster quality measures are optimized simultaneously. A multiobjective clustering technique named MOCK is devised in [14]. Results shown in [14] prove that MOCK performs better than several single-objective clustering algorithms, a recently developed ensemble technique, and two other recent methods of model selection. In this paper we have developed a novel fuzzy multiobjective clustering technique. Here cluster centers are encoded in the form of a string instead of data points. The technique is thereafter able to determine the appropriate number of clusters and the appropriate partitioning from different kinds of data sets having clusters of different shapes, size and convexity as long as the property of symmetry exists in the clusters. We have used AMOSA, a recently developed multiobjective optimization technique based on simulated annealing, [15], as the underlying optimization technique. Here points are allocated to different clusters based on the point symmetry based distance instead of using the Euclidean distance. AMOSA is utilized to optimize two cluster validity indices simultaneously. By using the first cluster validity index, named Xie-Beni-index [16] we can measure the total compactness and separation of a particular partitioning in terms of Euclidean distance, by using the second cluster validity index, named FSym-index [17] we can measure the total amount of symmetry contained in a particular clustering result. Any multiobjective optimization technique produces a set of non-dominated solutions on its final Pareto optimal front. Each of these nondominated solutions corresponds to a style of clustering the particular data set. All these solutions are identically relevant from the execution point of view. But sometimes depending on the application domain the user may want a single solution. Thus in this paper a novel method to select a single solution from the final Pareto optimal front is developed. The performance of the multiobjective SA (MOSA) based fuzzy clustering technique (MO-fuzzy) has been demonstrated on five real-life gene expression data sets, viz., Yeast Sporulation, Yeast Cell Cycle, Arabidopsis Thaliana, Human Fibroblasts Serum and Rat CNS data. The effectiveness of the proposed technique, as compared to MOGA clustering [18], FCM algorithm [19], single objective GA based clustering technique [20], hierarchical average linkage clustering [21], Self Organizing Map (SOM) clustering [22] and Chinese Restaurant Clustering (CRC) [23], is established both quantitatively and visually. Statistical significance tests have been also performed to prove the effectiveness of the MO-fuzzy clustering technique. To prove that the proposed technique produces biologically significant clusters a biological significance test has also been carried out.

2. Gene expression data clustering Traditional approaches of genomic research concentrate on local examination and collection of data for a single gene. However, use of microarray technology enables monitoring of expression levels of tens of thousands of genes simultaneously. There are mainly two different types of microarray experiments [24]. These are cDNA microarray [25] and oligonucleotide arrays (abbreviated oligo chip) [26]. During microarray experiment, a large number DNA sequences (genes, cDNA clones, or co-expressed sequence tags) are judged under multiple conditions. Examples of conditions are timeseries during a biological process (e.g., the yeast cell cycle) or a collection of different tissue samples (e.g., normal versus cancerous tissues).

In general no distinction is made among DNA sequences; which are uniformly termed as “genes”. Similarly all kinds of experimental conditions are termed as “samples”. Microarray technology has been successfully applied for solving problems from many areas like medical diagnosis, bio-medicine, gene expression profiling, etc. In general, the gene expression values during a biological experiment are determined at different time points [27]. A microarray gene expression data, consisting of x genes and y time points, is generally arranged in a 2D matrix G ¼ ½g ij  of size x  y. Each element g ij tells the expression level of the ith gene at the jth time point. Some noise and missing values can be present in the original gene expression matrix determined from the scanning process. There may be some systematic variations arising from the experimental procedure. Data pre-processing is a necessary step before application of any clustering technique. Missing value can be estimated using various methods [28]. After this data normalization is done. Thereafter any clustering technique can be applied on the processed gene expression data. Clustering is a widely used microarray analysis tool. The genes with similar expression profiles can be detected by using clustering techniques. Clustering methods divide a set of n objects into K partitions depending on some similarity/dissimilarity metric. Here the value of number of clusters, K, is not known a priori. Application of clustering techniques aid in the understanding of gene function, gene regulation, cellular processes, and subtypes of cells. Genes with similar expression patterns (co-expressed genes) should be put in a single cluster with similar cellular functions. This helps to predict the functionality of some unknown genes which information was not previously known [2,29]. Regulatory motifs specific to each gene cluster can be recognized based on the searching for common DNA sequences at the promoter regions of genes within the same cluster [24]. Cis-regulatory elements can also be proposed based on this information [29,30]. Hypotheses regarding the mechanism of the transcriptional regulatory network can be revealed based on the inference of regulation through the clustering of gene expression data [31]. At the end sub-cell types can be identified based on the clustering results on expression profiles. These information were difficult to be identified by using traditional morphology-based approaches [1]. Clustering methods have been widely used for the discovery of cancer subtypes [1–3]. Many novel clustering techniques have been developed by the bioinformaticians which are suitable for clustering gene expression data. But medical scientists prefer to use some traditional clustering techniques for solving this particular problem [32]. Using modern days microarray technologies, molecular signatures of cancer cells can be measured [33]. One most important and useful exploratory analysis on the microarray data is to apply clustering techniques on the cancer/patient samples (tissues) [33]. The basic goal is to determine groups of samples sharing similar expression patterns, which can help to discover novel cancer subtypes. Clustering methods have been used extensively for solving these kind of problems. Several clustering techniques have been proposed by bioinformaticians taking into account different internal characteristics of gene expression data, for example presence of noise and missing values, and high-dimensional nature of data sets [33–35]. These clustering techniques are tested with the help of available public data in clinical studies which were previously published. In [36,37] authors have used K-means clustering technique for solving the gene expression data clustering problem. Recently, many new clustering algorithms [38,39] have been proposed for solving the gene expression data clustering which can overcome the drawbacks of K-means algorithm. Self-organizing maps are used to cluster gene expression data in [40]. These algorithms perform better than K-means algorithm. Eisen et al. [2] developed an

Please cite this article as: S. Saha, et al., Gene expression data clustering using a multiobjective symmetry based clustering technique, Comput. Biol. Med. (2013), http://dx.doi.org/10.1016/j.compbiomed.2013.07.021i

67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132

S. Saha et al. / Computers in Biology and Medicine ∎ (∎∎∎∎) ∎∎∎–∎∎∎

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

agglomerative algorithm called UPGMA (Unweighted Pair Group Method with Arithmetic Mean) for gene expression data clustering and proposed a method to graphically represent the clustered data set. Alon et al. [41] devised an algorithm to partition the genes using a divisive approach which is called the deterministicannealing algorithm (DAA) [42]. Eisen's method was very much popular among biologists and has become the most widely used tool in gene expression data analysis [2,41,43]. However, because of the sensitivity of the data on the structure of the hierarchical dendogram, these conventional agglomerative approaches suffer most. High computational complexity is another drawback of the hierarchical clustering techniques. In [44], the authors used CLICK, a graph theory based clustering technique, for clustering publicly available gene expression data and compared their approach with SOM based [40] and Eisen's hierarchical method [2]. In [45], a new clustering technique named DHC (a density-based, hierarchical clustering method), is proposed to identify the co-expressed gene groups from gene expression data. Several model based clustering approaches for gene expression data have been proposed in [46–48] which provide some statistical frameworks to model the gene expression data.

are dominated by some new solutions then those were removed from the archive. If during this process, the archive size crosses the soft limit then single linkage clustering is invoked to reduce the size of the archive to HL. At the beginning of the AMOSA process, the temperature is first fixed at T max . All the archive members are initialized randomly and non-dominated solutions are finally stored. After that a solution is selected from the archive which is called the current-pt, or the initial solution. Objective functional values are calculated for current-pt. Then this solution is mutated to produce a new solution which is called new-pt (please note that perturbations are task specific and for gene expression data clustering we have developed three different types of perturbation operators which are described in Section 4.2.3 of the current paper), and its objective functions are computed. Goodness of the newly generated solution is checked after calculating the domination status between the new-pt, current-pt and the solutions in the archive. A novel quantity called the amount of domination, Δdomða; bÞ, between two solutions a and b is introduced in [15]. It is calculated as follows:

Δdoma;b ¼ ∏M i ¼ 1;f i ðaÞ a f i ðbÞ 3. The SA Based MOO Algorithm: AMOSA Archived multiobjective simulated annealing (AMOSA) [15,7] is a recently developed MO version of the simulated annealing (SA) algorithm. MOO is used to solve those real-world problems where there are multiple number of objective functions to be optimized simultaneously. Simulated annealing (SA) is a popular search technique for solving difficult single objective based optimization problems [6,7]. In [49] it has been proved that SA can converge to global optimum if annealed sufficiently slowly. Simulated annealing (SA) [50] is developed inspired by the principles of statistical mechanics which discusses about the characteristics of a large number of atoms at low temperature in order to produce minimal cost solutions to large optimization problems by minimizing the associated energy. In statistical mechanics, it is fundamentally required to investigate the ground states or low-energy states of matter. These states are attained at very low temperature. But it is not good to lower the temperature repeatedly. Then it will result in an unstable state. Rather it is required to spend some sufficient amount of time at each temperature. Thus during the annealing process, the temperature is first increased to a high value and then gradually decreased to a very low value (T min ) while at each temperature one has to be spend sufficient amount of time. This process results in a low-energy state. Single objective version of SA has been widely used to solve various optimization problems. But the multiobjective version of SA is not so popular. It is because SA provides a single solution after a single execution. MOO requires multiple solutions. In order to obtain multiple solutions we have to run SA multiple times. Recently Bandyopadhyay et al. developed an efficient multiobjective version of SA called AMOSA [15] that overcomes this limitation. The below description of AMOSA is based on the paper [15,51]. AMOSA (archived multiobjective simulated annealing) [15] is a newly developed multiobjective version of SA. Several concepts have been newly invented in AMOSA. The concept of an archive which stores the non-dominated solutions found so far is introduced in AMOSA. There are two limits associated with archive. These are hard or strict limit represented by HL, and a larger, soft limit represented by SL, where SL 4 HL. Archive is used to store all the non-dominated solutions generated so far during the annealing process. During the SA process if some members of the archive

3

jf i ðaÞf i ðbÞj Ri

ð1Þ

where, f i ðaÞ and f i ðbÞ denote the ith objective values of the two solutions, Ri denotes the corresponding range of the objective function calculated using the solutions of the archive and M is the number of objectives. There are three different cases depending on the domination status between the new-pt, current-pt and the points in the archive. These cases are discussed in details in [15], and are briefly mentioned here for the sake of completeness. Case 1: In this case either current-pt dominates the new-pt or current-pt and new-pt are non-dominating to each other but some points in the archive dominate new-pt. Let us assume there are total k number of points in the archive which dominate the new-pt. This case is demonstrated in Fig. 1. Then in [15] a quantity Δdomavg is determined as follows: ð∑ki ¼ 1 ðΔdomi;newpt Þ þ Δdomcurrentpt;newpt Þ=ðk þ 1Þ. Then new-pt is chosen as the current-pt with the acceptance probability of qqs ¼

1 : 1 þ eΔdomavg =temp

ð2Þ

Note that Δdomavg calculates the average amount of domination of the new-pt by ðk þ1Þ points, namely, the current-pt and k points of the archive. Case 2: In this case new-pt is non-dominating with respect to current-pt and the points in the archive. This can be demonstrated with different examples shown in Fig. 2. In all these cases new-pt is accepted as the current-pt with probability 1. If some points in the archive are dominated by new-pt, then those points are removed from the archive. new-pt is then added in the archive. In between if archive size exceeds the SL, single linkage clustering technique is invoked to reduce its size to HL. Case3: In this case current-pt and k where k r 0 points in the archive are dominated by new-pt. This case can be demonstrated using Fig. 3. Here the minimum of the differences of domination amounts between the new-pt and the k points, denoted by Δdommin of the archive is calculated. The point from the archive which has the minimum difference is chosen as the current-pt with acceptance probability probability ¼

1 : 1 þ expðΔdommin Þ

ð3Þ

Otherwise the new-pt is chosen as the current-pt. This is also called the informed reseeding of the annealer when the archive point is accepted as the current-pt.

Please cite this article as: S. Saha, et al., Gene expression data clustering using a multiobjective symmetry based clustering technique, Comput. Biol. Med. (2013), http://dx.doi.org/10.1016/j.compbiomed.2013.07.021i

67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

S. Saha et al. / Computers in Biology and Medicine ∎ (∎∎∎∎) ∎∎∎–∎∎∎

4

f2(maximize)

f2 (maximize)

Points in the archive

Points in the archive

Current

Current New

New

f1(maximize)

f1 (maximize)

Fig. 1. Different cases when New is dominated by Current: (a) New is non-dominating with respect to the solutions of Archive except Current if it is in the Archive; (b) some solutions in the Archive dominate New (figure taken from [7]).

f2(maximize)

f2(maximize)

f2(maximize)

Points in the archive

Points in the archive

Current

Points in the archieve

Current

Current New

New New

f1(maximize)

f1(maximize)

f1(maximize)

Fig. 2. Different cases when New and Current are non-dominating: (a) Some solutions in Archive dominate New; (b) New is non-dominating with respect to all the solutions of Archive; (c) New dominates k (k Z 1) solutions in the Archive (figure taken from [7]).

f2(maximize)

f2(maximize)

f2(maximize)

Points in the archive

Points in the archive

Points in the archive New

New

Current

Current New Current

f1(maximize)

f1(maximize)

f1(maximize)

Fig. 3. Different cases when New dominates Current: (a) New is dominated by some solutions in Archive; (b) New is non-dominating with respect to the solutions in the Archive except Current, if it is in the Archive; (c) New dominates some solutions of Archive other than Current (figure taken from [7]).

The above steps are repeated iter times for each temperature (temp). Temperature is decremented using the equation α  temp where α is called the cooling rate, till the minimum temperature, T min , is reached. The process stops after that and the final archive will contain all the non-dominated solutions. Results shown in [15] illustrate the effectiveness of AMOSA in comparison to NSGA-II [52] and some other well-known MOO algorithms. The pseudocode of the AMOSA algorithm is shown in Fig. 4.

AMOSA is used as the underlying MOO technique in this article because of its improved performance over some other well-known MOO algorithms especially for 3 or more objectives [15].

4. Proposed method of multiobjective clustering This section describes the recently developed multiobjective clustering technique, MO-fuzzy, in detail.

Please cite this article as: S. Saha, et al., Gene expression data clustering using a multiobjective symmetry based clustering technique, Comput. Biol. Med. (2013), http://dx.doi.org/10.1016/j.compbiomed.2013.07.021i

67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132

S. Saha et al. / Computers in Biology and Medicine ∎ (∎∎∎∎) ∎∎∎–∎∎∎

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

5

Fig. 4. The AMOSA Algorithm [15] (source code is available at: www.isical.ac.in/sriparna_r).

4.1. String representation and population initialization In MO-fuzzy clustering, center based real-encoding is used. Here the coordinates of the centers of the partitions are encoded in the form of a string (member of the archive). Each string is having different length. Let us assume string i represent the centers of K i clusters where the dimension of the data space is d, then the string will have the length of li where li ¼ d  K i . For example, in case of two space, the string 〈12:31:422:10:010:015:310:27:5〉 represents four cluster centers: (12.3, 1.4), (22.1, 0.01), (0.0, 15.3), and (10.2, 7.5). Another important point of string encoding is that each center is regarded to be indivisible. This means at the time of mutation if we will insert a new center all the dimensional values have to be inserted and if we want to delete a center all the dimensional values have to be deleted. The number of centers, K i , encoded in a string i is chosen randomly between two limits K min and K max . The value is determined using the following equation K i ¼ ðrandðÞmodðK max 1ÞÞ þ 2. Here, randðÞ is a function returning a random i, and K max is the upperlimit of the number of clusters. Minimum number of clusters is assumed to be equal to 2. The number of whole clusters present in a particular string of archive can therefore vary in the range two to

K max . The K i cluster centers represented in a string are some randomly selected distinct points from the data set. 4.2. Membership value calculations and objective function computations Objective function computation is composed of two steps. Firstly, membership values of n points (where n is the total number of points in the data set) to different clusters are computed. Next, two cluster validity indices, namely Xie-Beniindex [16] and FSym-index [17], are computed and used as two objective functions of the string. These two objective functions are simultaneously optimized using the search capability of AMOSA. Here our aim is to get the clusters which are biologically relevant. In order to get biologically relevant clusters we need to put genes which are having similar gene expression profiles (similar types of gene expression values over different time points) in a single cluster. FSym-index and Xie-Beni-index try to minimize compactness and maximize cluster separation. Thus optimum values of these indices will correspond to those solutions where genes having similar gene expression patterns will be in same cluster

Please cite this article as: S. Saha, et al., Gene expression data clustering using a multiobjective symmetry based clustering technique, Comput. Biol. Med. (2013), http://dx.doi.org/10.1016/j.compbiomed.2013.07.021i

67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

S. Saha et al. / Computers in Biology and Medicine ∎ (∎∎∎∎) ∎∎∎–∎∎∎

6

so that distance between genes of a particular cluster in terms of Euclidean distance or symmetry can be minimized (so that it minimizes compactness)and genes having dissimilar patterns will be in different clusters so that distance between two genes in two different clusters can be maximized (in order to maximize cluster separation). Thus we will finally get some biologically relevant clusters. 4.2.1. Computing the membership values In MO-fuzzy, if a particular point is indeed symmetrical with respect to a cluster center, then it is allocated to the corresponding cluster center with membership value equals to 1. Otherwise Euclidean distance is utilized to compute the membership values for points whose symmetrical measures to all the cluster centers are greater than some threshold (i.e., points which are not symmetrical with respect to any cluster center). The membership value calculation is based on a recently developed point symmetry based distance [12], dps ðx; cÞ, which is calculated as follows. Let a point be x. The symmetrical (reflected) point of x with respect to a particular center c is 2  cx [12,7]. Let us denote this by x n . Let knear unique nearest neighbors of x n be at Euclidean distances of di s, i ¼ 1; 2; …knear. Then dps ðx; cÞ ¼ dsym ðx; cÞ  de ðx; cÞ; dps ðx; cÞ ¼

∑knear i ¼ 1 di  de ðx; cÞ; knear

ð4Þ ð5Þ

where de ðx; cÞ is the Euclidean distance between the point x and c, and dsym ðx; cÞ is a symmetry measure of x with respect to c and is defined as ∑knear i ¼ 1 di =knear. Note that knear cannot be chosen equal to 1, since in this case if x n exists in the data set then dps ðx; cÞ ¼ 0 and hence there will be no impact of the Euclidean distance. On the contrary, large values of knear may not be suitable because it may underestimate the amount of symmetry of a point with respect to a particular cluster center. Here knear is chosen equal to 2. It may be noted that the proper value of knear largely depends on the distribution of the data set. The properties of dps ðx; cÞ are thoroughly described in [11]. For each point x j ; j ¼ 1; 2; …n (here n is the number of points in the data set), the membership values to K different clusters (where K is the number of clusters encoded in this particular string) are calculated by using the following equation: Find the cluster center nearest to xj by using the point symmetry based distance; That is, we choose the cluster center k which is the closest to the data point x j by using the minimum-symmetric-distance criterion: k ¼ argmini ¼ 1;…K dps ðx j ; c i Þ. Here point symmetry-based distance dps ðx j ; c i Þ is calculated by Eq. (5). Here, c i represents the center of the ith cluster. If the corresponding symmetry measurement, denoted by dsym ðx j ; c k Þ (as defined in Eq. (4)) is smaller than a user-defined predetermined parameter θ, then the membership function uij is calculated using the following equations: uij ¼ 1;

if i ¼ k

uij ¼ 0;

if i a k:

On the other hand, the membership uij is calculated by using the rule based on normal fuzzy C-means [19] algorithm: uij ¼

1

 2=m1 d ∑Kk ¼ 1 d ij kj

:

ð6Þ

Here m A ½1; 1Þ is called the weighting exponent which has the name “fuzzifier”. Inspired by [19], we have kept the value of this parameter equals to 2, m ¼ 2. The Euclidean distance between a pattern x j and the cluster center c i is denoted by dij . Note that, in Eq. (6), the Euclidean distance has been used to calculate the

membership degrees in place of the symmetry-based distance. Eq. (6) is derived in [19] under the assumption that the distance measure used should be a norm. Note that the dps measure is not a norm [12]. Thus, the point symmetry-based distance cannot be used in Eq. (6). The value of the threshold or the prespecified parameter θ is max kept equal to dNN , the maximum nearest neighbor distance in the data set, as done in [12]. Updating the centers: Once the membership values of all the points to different clusters are determined, the cluster centers represented in a particular string of the archive are updated using the following equation as in FCM [19]: ci ¼

∑nj¼ 1 ðuij Þm x j ∑nj¼ 1 ðuij Þm

;

1 r i rK:

ð7Þ

These steps are carried out to introduce some limited amount of local search (provided by FCM-like center update) in the proposed algorithm to speed up the convergence of MO-fuzzy. 4.2.2. Fitness calculation Two objective functions are computed for optimization. The objective of MO-fuzzy clustering technique is to optimize both the objective functions simultaneously. Objective functions are FSymindex [17] and Xie-Beni index [16]. The fuzzy symmetry-based cluster validity index FSym-index [17] is an internal cluster validity index which quantifies the goodness of a partitioning in terms of “symmetry” and the separation present in the clusters. The index is described below. Suppose K cluster centers are represented by c i where 1 r i r K and UðXÞ ¼ ½uij Kn is the membership matrix of the whole data set. Then, the FSym-index is calculated as follows:   1 1 FSymðU; KÞ ¼   DK ; ð8Þ K EK where K is the number of clusters encoded in a particular string. Here, K

E K ¼ ∑ Ei

ð9Þ

i¼1

such that n

Ei ¼ ∑ ðum ij  dps ðx j ; c i ÞÞ

ð10Þ

j¼1

and DK ¼ maxKi;j ¼ 1 J c i c j J :

ð11Þ

DK denotes the maximum Euclidean distance between two cluster centers among all centers. dps ðxj ; ci Þ is again the point symmetry based distance [12] between the point xj and the cluster center ci , calculated using the Eq. (5). This index is inspired by the PBMF-index developed in [53]. In order to determine actual number of clusters from data sets having symmetrical shaped clusters, the objective is to maximize this index. Xie-Beni Index: The second objective function used here in the proposed clustering algorithm is the Xie-Beni index [16]. This is an old internal cluster validity index, widely used in the literature. In 1991, Xie and Beni [16] developed this cluster validity index (Xie-Beni) which is again based on two properties: compactness and separation. As per the definitions of the Xie-Beni-index the numerator quantifies the compactness of the partitioning measured using the matrix U while the denominator quantifies the separation between clusters which is measured using the Euclidean distance between cluster centers. In principle, in order to attain a good partitioning, the compactness value should be minimum and the separation should be maximum. Thus the

Please cite this article as: S. Saha, et al., Gene expression data clustering using a multiobjective symmetry based clustering technique, Comput. Biol. Med. (2013), http://dx.doi.org/10.1016/j.compbiomed.2013.07.021i

67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132

S. Saha et al. / Computers in Biology and Medicine ∎ (∎∎∎∎) ∎∎∎–∎∎∎

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

Xie-Beni-index value should be minimized after varying the number of clusters in the range, k ¼ 1; …; K max , in order to obtain the desirable partitioning. Let K cluster centers be represented by c i where 1 ri r K and UðXÞ ¼ ½uij Kn denote the membership matrix for the data. Then Xie-Beni-index is defined as follows: XieBeniðU; KÞ ¼

∑Ki¼ 1 ∑nj¼ 1 u2ij J x j c i J 2 nðmini a k J c i c k J 2 Þ

:

ð12Þ

Thus the two objective functions for AMOSA based clustering techniques are f 1 ¼ FSym and f 2 ¼ 1=Xie  Beni. The objective is to maximize these two indices.

4.3. Mutation In the proposed clustering technique we have developed three types of mutation operations. (1) Mutation 1: This is about changing each cluster center by some small amount. In order to generate some random numbers Laplacian distribution is used. Here each cluster center represented in a string is modified with a random variable which is drawn using a Laplacian distribution, pðϵÞ p ejϵμj=δ . Here the magnitude of perturbation is measured using the scaling factor δ and μ is the old value at the position which is to be mutated. The scaling factor δ is generally set equal to 1.0. By using the Laplacian distribution a value near the old value is generated and old value is replaced with the newly generated value. This is applied individually to all the dimensions of a particular center if it is selected for mutation. (2) Mutation 2: This mutation operation is used to reduce the size of the string. Here a randomly generated cluster center is deleted from the string in order to decrease the number of cluster centers encoded in the string by 1. Here cluster centers are considered to be indivisible. This means while deleting a particular cluster center, all the dimensional values are removed. (2) Mutation 3: This mutation is for incrementing the number of clusters by 1. Here a new center is inserted in the string. Thus the number of cluster centers encoded in the string is incremented by 1. Here again cluster centers are considered to be indivisible. After choosing a new center from the data set randomly, all the dimensional values of that center are inserted in the string.

7

4.4. Selection of the best solution Any multiobjective optimization technique produces a set of non-dominated solutions on its final Pareto optimal front [54]. Each of these non-dominated solutions corresponds to a style of clustering the particular data set. All these solutions are identically relevant from the execution point of view. But sometimes depending on the application domain the user may want a single solution. Thus in this article a novel method to select a single solution from the final Pareto optimal front is developed. Here assumption is that the class labels of some of the points which are called test patterns are known to us. The proposed MO-fuzzy produces a set of Pareto optimal solutions. Thereafter cluster labels are assigned to test patterns based on the nearest center criterion and depending on the clustering results encoded in different solutions of the final Pareto front. The goodness of the clustering result is measured by calculating Silhouette index [55] on the test patterns. Silhouette index is an internal cluster validity index which is used to quantify the quality of any clustering solution C. Let a denote the average distance of a point from the other points of the cluster to which the point is assigned, and b denote the minimum of the average distances of the point from the points of the other clusters. The silhouette width s of the point is calculated by using the following equation: s¼

ðbaÞ : maxfa; bg

ð13Þ

Silhouette index sðCÞ is determined as the average Silhouette width of all the data points (genes). It fundamentally quantifies the compactness and separation of clusters in terms of Euclidean distance. Silhouette index values can vary between  1 to 1. Higher value means better partitioning. The Silhouette index values are calculated for all the solutions. The solution with the maximum Silhouette index value calculated over the test patterns is selected as the best solution.

5. Experimental results A description of the performance metrics and the data sets used for experiments are provided in this section. Thereafter, we have demonstrated the obtained experimental results both visually and numerically.

5.1. Performance metrics Example 4.2: Let a string look like 〈3:51:52:14:91:61:2〉, representing three cluster centers in a 2-d plane (3.5, 1.5), (2.1, 4.9), and (1.6, 1.2). (i) If mutation type 1 is selected, then first one position from the string has to be selected for perturbation. Let position 2 be selected randomly. Then, each dimension of (2.1, 4.9) will be changed by some value generated using the Laplacian distribution. (ii) If mutation type 2 is selected, a center will be removed from the string. Let center 3 be selected for deletion. Then, after deletion, the string will look like 〈3:51:52:14:9〉. (iii) If mutation type 3 is selected, a new center will be added to the string. Let the randomly chosen point from the data set to be added to the string be (9.7, 2.5). After addition of this center, the string will look like 〈3:51:52:14:91:61:29:72:5〉. Any one of the above-mentioned types of mutation is applied to each string to generate a new string.

In order to judge the performance of the clustering algorithms, Silhouette index [55] is calculated for gene expression data sets. In order to visually evaluate the clusters, two cluster visualization tools namely Eisen plot and cluster profile plot, have been used.

5.2. Eisen plot In case of Eisen plot [2,27], (see Fig. 5(a) for example) by coloring the corresponding cell of the data matrix with a color similar to the original color of its spot on the microarray, the expression value of a gene at a specific time point is represented. Higher expression levels are denoted by using the shades of red color, low expression levels are denoted by the shades of green color and absence of differential expression values are denoted by the colors towards black. In this paper before plotting the Eisen plot we have ordered the genes so that the genes that belong to the same cluster are placed one after another. White colored blank rows are used to identify the cluster boundaries.

Please cite this article as: S. Saha, et al., Gene expression data clustering using a multiobjective symmetry based clustering technique, Comput. Biol. Med. (2013), http://dx.doi.org/10.1016/j.compbiomed.2013.07.021i

67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

S. Saha et al. / Computers in Biology and Medicine ∎ (∎∎∎∎) ∎∎∎–∎∎∎

8

Fig. 5. Yeast Sporulation data clustered using MO-fuzzy clustering method. (a) Eisen plot and (b) cluster profile plots.

5.3. Cluster profile plot The normalized gene expression values (light green) of the genes of that cluster with respect to the time points [27] is illustrated using the cluster profile plot (see Fig. 5(b) for example). A black line is used to show the average expression values of the genes of the cluster over different time points together with the standard deviations within the cluster at each time point [27].





5.4. Data sets used Pre-processed datasets have been downloaded from the site.1 The number of genes after pre-processing the dataset and the number of features in each dataset have been mentioned in Table 1. The short description of the data sets are given below:

 Yeast Sporulation: In this data set [56,27], there are total 6118



genes where 7 time points (0, 0.5, 2, 5, 7, 9 and 11.5 h) are measured for each gene during the sporulation process of budding yeast. Log-transformation is then applied on this data set. The Sporulation data set can be down-loadable from the website.2 Among these 6118 genes, some genes are ignored from further analysis whose expression levels did not change significantly during the harvesting. A threshold level of 1.6 for the root mean squares of the log2-transformed ratios is used to calculate this. At the end, there are 474 genes in the final set. Yeast Cell Cycle: The Yeast Cell Cycle data set was determined from a data set which shows the variation of expression patterns of approximately 6000 genes over two cell cycles (17 time points). 384 genes out of total 6000 genes have been



chosen to be cell-cycle regulated [57,27]. This data set can be down-loadable from the following website.3 Arabidopsis Thaliana: Gene expression values of 138 genes of Arabidopsis Thaliana are contained in this data set. It consists of expression levels of the genes over 8 time points viz., 15 min, 30 min, 60 min, 90 min, 3 h, 6 h, 9 h, and 24 h [58,27]. This data set can be down-loadable from.4 Human Fibroblasts Serum: Expression levels of 8613 human genes are contained in this dataset [59,27]. There are total 13 dimensions in this data set corresponding to 12 time points (0, 0.25, 0.5, 1, 2, 4, 6, 8, 12, 16, 20 and 24 h) and one unsynchronized sample. Thereafter total 517 genes have been selected for consideration where expression levels of these genes changed drastically across the time points. Log2-transformation is then applied on these 517 genes. This data set is available at.5 Rat CNS: This data set is calculated using reverse transcriptioncoupled PCR which is executed to determine the expression levels of a set of 112 genes during rat central nervous system development over 9 time points [60,27]. This data set can be down-loadable from.6

5.5. Discussion of results The evaluation results of the proposed MO-fuzzy clustering have been shown for five publicly available real life gene expression data sets, viz., Yeast Sporulation, Yeast Cell Cycle, Arabidopsis Thaliana, Human Fibroblasts Serum and Rat CNS data. Results obtained by the proposed technique are also compared with those obtained by fuzzy MOGA clustering [18], FCM [19], single objective 3 4

1 2

http://anirbanmukhopadhyay.50webs.com/mogasvm.html. http://cmgm.stanford.edu/pbrown/sporulation.

5 6

http://faculty.washington.edu/kayee/cluster. http://homes.esat.kuleuven.be/thijs/Work/Clustering.html. http://www.sciencemag.org/feature/data/984559.shl. http://faculty.washington.edu/kayee/cluster.

Please cite this article as: S. Saha, et al., Gene expression data clustering using a multiobjective symmetry based clustering technique, Comput. Biol. Med. (2013), http://dx.doi.org/10.1016/j.compbiomed.2013.07.021i

67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132

S. Saha et al. / Computers in Biology and Medicine ∎ (∎∎∎∎) ∎∎∎–∎∎∎

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

Table 1 Description of data sets. Data set

Yeast Sporulation Yeast Cell Cycle Rat Central Nervous System (CNS) Human Fibroblasts Serum Arabidopsis Thaliana

9

Table 3 Median values of Silhouette index scores over 20 consecutive runs of different algorithms. No. of genes in pre-processed dataset

No. of features

474 384 112 517 138

7 17 9 13 8

Table 2 The three most significant GO terms and the corresponding p-values for each of the 6 clusters of Yeast Sporulation data as found by MO-fuzzy clustering technique. Clusters

Significant GO term

p-value

Cluster 1

Sporulation GO:0043934 Sporulation resulting in formation of a cellular spore GO:0030435 Anatomical structure formation involved in morphogenesis GO:0048646

4.63e  38 3.17e 37

Cluster 2

M phase of meiotic cell cycle GO:0051327 Meiosis GO:0007126 Meiotic cell cycle GO:0051321

1.05e  23 1.05e  23 1.38e  23

Cluster 3

Preribosome GO:0030684 Ribosome biogenesis GO:0042254 Ribonucleoprotein complex biogenesis GO:0022613

4.45e  27 4.44e 18 1.76e 16

Cluster 4

Carboxylic acid metabolic process GO:0019752 Glutamate dehydrogenase (NADP þ) activity GO:0004354 Oxoacid metabolic process GO:0043436

0.00022

3.74e 37

0.00026 0.00028

Cluster 5

Cytosolic ribosome GO:0022626 Cytoplasmic translation GO:0002181 Structural constituent of ribosome GO:0003735

5.51e  56 1.24e  55 3.36e  51

Cluster 6

Glycolysis GO:0006096 Glucose catabolic process GO:0006007 Hexose catabolic process GO:0019320

2.91e 14 7.14e 14 6.86e  13

genetic clustering scheme which minimizes Xie-Beni-index validity measure (SGA) [20], average linkage method [21], SOM [22] and CRC [23]. Obtained clustering results are verified after conducting several statistical and biological significance tests. Mean Silhouette Indices values for various data-sets obtained by different clustering techniques are shown in Table 3. Results shown in Table 3 reveal that for all the data sets MO-fuzzy attains the maximum Silhouette index values. MO-fuzzy has determined 6, 5, 6, 5 and 6 number of clusters for the Sporulation, Cell Cycle, Arabidopsis, Serum and Rat CNS data sets, respectively. This conforms to the findings in the literature [61,20,56,60]. In order to visually demonstrate the results of MO-fuzzy clustering, Figs. 5–7 show the Eisen plots provided by MO-fuzzy clustering technique on the four data sets. For example, the 6 clusters of the Yeast Sporulation data are very prominent from the corresponding Eisen plot (Fig. 5(a))). From the color representations of the Eisen plot, we can see that similar colors are grouped together, denoting that expression profiles of the genes of a cluster are similar to each other as they produce similar color patterns. Here eisen plots obtained by the proposed algorithm are shown in Figs. 5(a), 6(a), 7(a) and 8(a), for four data sets Yeast Sporulation, Yeast Cell Cycle, Arabidopsis Thaliana and Rat CNS, respectively.

Algorithm

Sporulation

Cell cycle

Arabidopsis

Serum

Rat CNS

MO-fuzzy MOGA FCM SGA Average linkage SOM CRC

0.5888 0.5766 0.4686 0.5698 0.5007 0.5786 0.5619

0.4348 0.4221 0.3812 0.4315 0.4388 0.3823 0.4271

0.4129 0.4024 0.3656 0.3837 0.3151 0.2334 0.3955

0.4008 0.3844 0.3152 0.3672 0.3562 0.3352 0.3246

0.4958 0.4822 0.4113 0.4563 0.4122 0.4340 0.4561

The cluster profile plots (Fig. 5(b)) are also drawn based on the obtained clustering results by the proposed algorithm. These plots also demonstrate how the expression profiles for the different groups of genes differ from each other, while the profiles within a group are reasonably similar. Here cluster profile plots obtained by the proposed algorithm are shown in Figs. 5(b), 6(b), 7(b) and 8(b), for four data sets Yeast Sporulation, Yeast Cell Cycle, Arabidopsis Thaliana and Rat CNS, respectively. The proposed technique performs better compared to the other clustering methods mainly because of the following reasons: first of all, this is a multiobjective clustering method. Simultaneous optimization of multiple cluster validity measures helps to handle clusters with different characteristics. This in turn produces higher quality solutions representing different possible partitionings. Secondly, the strength of symmetry based distance is utilized along with multiobjective clustering efficiently. It has been already shown in [12] that symmetry based distances are useful for detecting clusters having any shape and size as long as they satisfy the symmetry property. This work shows that symmetry based distances are also useful for detecting clusters from gene expression data. Finally, incorporation of fuzziness makes the proposed technique better equipped in handling overlapping clusters. 5.6. Statistical test Here we have carried out some statistical tests guided by [62,63] to establish the superiority of the proposed clustering technique, MO-fuzzy compared to the existing techniques. We have done Friedman statistical test [64] to detect whether the four clustering techniques, MO-fuzzy, MOGA, SGA and CRC used here for experiment perform similarly or not. Ranks are assigned to each algorithm for each data set. Thereafter a test is conducted to check whether the measured average ranks are significantly different from the mean rank. In our case after conducting Friedman test, we found that measured average ranks and mean rank are different with a p value of 0.0166. The corresponding results are shown in Table 4. Thereafter Nemenyi's test [65] is carried out in order to compare the clustering techniques pairwise. In each case we have kept α ¼ 0:05. Results reveal that for all the pairs of algorithms, the null hypotheses which states that the pairing algorithms perform similarly are rejected as the corresponding p values are smaller than the α. 5.7. Biological significance The statistically significant Gene Ontology (GO) annotation database http://db.yeastgenome.org/cgi-bin/GO/goTermFinder is used to establish the biological relevance of a cluster. The functional enrichment of a group of genes is measured in terms of three structured, controlled vocabularies (ontologies), viz., associated biological processes, molecular functions and biological components. A cumulative hypergeometric distribution is

Please cite this article as: S. Saha, et al., Gene expression data clustering using a multiobjective symmetry based clustering technique, Comput. Biol. Med. (2013), http://dx.doi.org/10.1016/j.compbiomed.2013.07.021i

67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

10

S. Saha et al. / Computers in Biology and Medicine ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Fig. 6. Yeast Cell Cycle data clustered using MO-fuzzy clustering method. (a) Eisen plot and (b) cluster profile plots.

Fig. 7. Arabidopsis Thaliana data clustered using MO-fuzzy clustering method. (a) Eisen plot and (b) cluster profile plots.

used to determine the degree of functional enrichment (p-value). The probability of calculating the number of genes involved in a given GO term (i.e., function, process, component) within a cluster can be determined then. For a particular GO category, the probability p of producing k or more genes within a cluster of size n, can be calculated by using the following equation as

shown in [27]    f gf p ¼ 1 ∑ k1 i¼1

i

ni   g n

Please cite this article as: S. Saha, et al., Gene expression data clustering using a multiobjective symmetry based clustering technique, Comput. Biol. Med. (2013), http://dx.doi.org/10.1016/j.compbiomed.2013.07.021i

67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132

S. Saha et al. / Computers in Biology and Medicine ∎ (∎∎∎∎) ∎∎∎–∎∎∎

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

11

Fig. 8. Rat CNS data clustered using MO-fuzzy clustering method. (a) Eisen plot and (b) cluster profile plots.

Table 4 Computation of the rankings for the four algorithms considered in the study over 5 gene expression data sets, based on the Silhouette index values obtained. Data set

MO-fuzzy

MOGA

SGA

CRC

Sporulation Cell cycle Arabidopsis Serum Rat CNS Average rank

0.589 (1) 0.43 (1) 0.41 (1) 0.40 (1) 0.50 (1) 1

0.577 (2) 0.42 (2) 0.40 (2) 0.38 (2) 0.48 (2) 2

0.569 (3) 0.43 (1) 0.38 (3) 0.37 (3) 0.46 (3) 2.6

0.561 (4) 0.43 (1) 0.40 (2) 0.32 (4) 0.46 (3) 2.8

where the total number of genes within a category and within the genome are represented by f and g, respectively. By computing the p-value for each GO category, we can verify statistical significance for the genes in a cluster. The degree of how well the genes in the cluster match with the different GO categories can be quantified by using this test. The p value of the obtained cluster for a particular GO-category will be equal to 0 if the majority of genes in a cluster possess the same biological function. In this paper we have performed the biological significance test for Yeast Sporulation data at the 1% significance level. Here we have tested the clustering results obtained by all the algorithms used in this paper biologically. For different algorithms, we have reported the number of clusters for which the most significant GO terms have a p-value less than 0.01 (1% significance level). For MO-fuzzy, all the 6 clusters obtained are biologically significant whereas for MOGA, FCM, SGA, Average linkage, SOM and CRC the number of biologically significant clusters are 6, 4, 6, 4, 4 and 6, respectively. Table 2 illustrates the three most significant GO terms (along with the corresponding p-values) shared by the genes of each of the 6 clusters identified by MO-fuzzy clustering technique. Results prove that since all the p-values are less than 0.01, all the clusters produced by MOfuzzy clustering technique are significantly relevant with respect to some GO categories. Biologically significant test reveals that the proposed MO-fuzzy clustering technique generates biologically relevant and functionally enriched clusters in case of gene expression data clustering.

6. Conclusion In this paper a novel fuzzy multiobjective (MO) clustering technique (MO-fuzzy) is proposed which can automatically partition the data into an appropriate number of clusters as long as the clusters possess the point symmetry property. Cluster centers are encoded in the string. Two objective functions are utilized for optimization. The first one reflects the total compactness of the partitioning based on the Euclidean distance, the other quantifies the total amount of symmetry present in the clusters. AMOSA, a recently developed simulated annealing based multiobjective optimization method, is utilized to optimize these indices so that it can enable the proposed algorithm to automatically detect the appropriate number of clusters as well as the appropriate partitioning from different data sets. The proposed technique is applied for extracting clusters from gene expression data. Results on five real life gene expression data sets have been demonstrated. The performance of the proposed technique has been compared with those of MOGA, FCM, SGA, Average linkage, SOM and CRC clustering methods. The obtained results by the used algorithms are measured both quantitatively and visually using cluster visualization tools. Results prove the effectiveness of the proposed MO-fuzzy clustering technique which outperforms the other algorithms used here for comparison purpose. This is because in MO-fuzzy we have integrated the effectiveness of multiobjective optimization, fuzzy clustering and symmetry based distance. Results of biological significance tests show that proposed technique is able to identify biologically significant clusters. The biological significance tests have been carried out on the clustering results on Yeast Sporulation data at the 1% significance level. Below we have provided the number of clusters attained by different algorithms for which the most significant GO terms have a p-value less than 0.01 (1% significance level): MO-fuzzy-6, MOGA - 6, FCM - 4, SGA - 6, Average linkage - 4, SOM - 4 and CRC - 6. Results show that three most significant GO terms (along with the corresponding p-values) shared by the genes of each of the 6 clusters identified by MO-fuzzy technique are smaller than the corresponding p values shared by the genes of the corresponding clusters of all the

Please cite this article as: S. Saha, et al., Gene expression data clustering using a multiobjective symmetry based clustering technique, Comput. Biol. Med. (2013), http://dx.doi.org/10.1016/j.compbiomed.2013.07.021i

67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

S. Saha et al. / Computers in Biology and Medicine ∎ (∎∎∎∎) ∎∎∎–∎∎∎

12

other techniques. This again proves the effectiveness of the proposed MO-fuzzy clustering technique in determining biologically relevant and functionally enriched clusters. In this paper we have utilized only two cluster validity indices as the objective functions. As AMOSA is well-known for optimizing more than three objective functions, so in future we would like to add some more objective functions in our proposed algorithm. For selecting a single solution a novel method is developed which is a semi-supervised based approach. In future we would like to develop some unsupervised technique for the selection of a single solution.

Conflict of interest statement Authors do not have any conflict of interest. References [1] A.A. Alizadeh, M.B. Eisen, R.E. Davis, C. Ma, I.S. Lossos, A. Rosenwald, J.C. Boldrick, H. Sabet, T. Tran, X. Yu, J.I. Powell, L. Yang, G.E. Marti, T. Moore, J.J. Hudson, L. Lu, D.B. Lewis, R. Tibshirani, G. Sherlock, W.C. Chan, T.C. Greiner, D.D. Weisenburger, J.O. Armitage, R. Warnke, R. Levy, W. Wilson, M.R. Grever, J.C. Byrd, D. Botstein, P.O. Brown, L.M. Staudt, Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 403 (6769) (2000) 503–511. [2] M.B. Eisen, P.T. Spellman, P.O. Brown, D. Botstein, Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. 95 (25) (1998) 14863–14868, http://dx.doi.org/10.1073/pnas.95.25.14863. [3] D. Lockhart, E. Winzeler, Genomics gene expression and DNA arrays, Nature 405 (June (6788)) (2000) 827–836. [4] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ, 1988. [5] J.T. Tou, R.C. Gonzalez, Pattern Recognition Principles, Addison-Wesley, Reading, 1974. [6] S. Bandyopadhyay, S.K. Pal, Classification and Learning Using Genetic Algorithms Applications in Bioinformatics and Web Intelligence, Springer-Verlag, Hiedelberg, Germany, 2007. [7] S. Bandyopadhyay, S. Saha, Unsupervised Classification – Similarity Measures, Classical and Metaheuristic Approaches, and Applications, Springer, 2013. [8] S. Bandyopadhyay, U. Maulik, Nonparametric genetic clustering: comparison of validity indices, IEEE Trans. Syst. Man Cybern. Part C 31 (1) (2001) 120–125. [9] R.H. Eduardo, F.F.E. Nelson, A genetic algorithm for cluster analysis, Intelligent Data Anal. 7 (2003) 15–25. [10] S. Bandyopadhyay, U. Maulik, Genetic clustering for automatic evolution of clusters and application to image classification, Pattern Recognition 2 (2002) 1197–1208. [11] S. Bandyopadhyay, S. Saha, A point symmetry based clustering technique for automatic evolution of clusters, IEEE Trans. Knowl. Data Eng. 20 (November (11)) (2008) 1–17. [12] S. Bandyopadhyay, S. Saha, GAPS: a clustering method using a new point symmetry based distance measure, Pattern Recog. 40 (2007) 3430–3451. [13] S. Saha, S. Bandyopadhyay, Application of a new symmetry based cluster validity index for satellite image segmentation, IEEE Geosci. Remote Sensing Lett. 5 (April (2)) (2008) 166–170. [14] J. Handl, J. Knowles, An evolutionary approach to multiobjective clustering, IEEE Trans. Evol. Comput. 11 (1) (2007) 56–76. [15] S. Bandyopadhyay, S. Saha, U. Maulik, K. Deb, A simulated annealing based multi-objective optimization algorithm: AMOSA, IEEE Trans. Evol. Comput. 12 (June (3)) (2008) 269–283. [16] X.L. Xie, G. Beni, A validity measure for fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 841–847. [17] S. Saha, S. Bandyopadhyay, A new point symmetry based fuzzy genetic clustering technique for automatic evolution of clusters, Inf. Sci. 179 (19) (2009) 3230–3246. [18] U. Maulik, A. Mukhopadhyay, S. Bandyopadhyay, Combining Pareto-optimal clusters using supervised learning for identifying co-expressed genes, BMC Bioinformatics 10 (1) (2009) 27, http://dx.doi.org/10.1186/1471-2105-10-27 〈http://www.biomedcentral.com/1471-2105/10/27〉. [19] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum, New York, 1981. [20] U. Maulik, S. Bandyopadhyay, Fuzzy partitioning using a real-coded variablelength genetic algorithm for pixel classification, IEEE Trans. Geosci. Remote Sensing 41 (5) (2003) 1075–1081 〈http://dblp.uni-trier.de/db/journals/tgrs/ tgrs41.html#MaulikB03〉. Q2 [21] A.K. Jain, M. Murthy, P. Flynn, Data clustering: a review, ACM Comput. Rev. [22] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, T.R. Golub, Interpreting patterns of gene expression with selforganizing maps: methods and application to hematopoietic differentiation, Proc. Natl. Acad. Sci. USA 96 (1999) 2907–2912.

[23] Z.S. Qin, Clustering microarray gene expression data using weighted chinese restaurant process, Bioinformatics 22 (16) (2006) 1988–1997, http://dx.doi. org/10.1093/bioinformatics/btl284 〈http://dx.doi.org/10.1093/bioinformatics/ btl284〉. [24] D. Jiang, C. Tang, A. Zhang, Cluster analysis for gene expression data: a survey, IEEE Trans. Knowl. Data Eng. 16 (11) (2004) 1370–1386, http://dx.doi.org/ 10.1109/TKDE.2004.68. [25] M. Schena, D. Shalon, R.W. Davis, P.O. Brown, Quantitative monitoring of gene expression patterns with a complementary DNA microarray, Science 270 (523) (1995) 467–470, URL http://dx.doi.org/10.1126/science.270.5235.467. [26] D.J. Lockhart, H. Dong, M.C. Byrne, M.T. Follettie, M.V. Gallo, M.S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Horton, E.L. Brown, Expression monitoring by hybridization to high-density oligonucleotide arrays, Nat. Biotechnol. 15 (1997) 1359–1367. [27] U. Maulik, A. Mukhopadhyay, S. Bandyopadhyay, Combining Pareto-optimal clusters using supervised learning for identifying co-expressed genes, BMC Bioinformatics 27 (2009) 1197–1208. [28] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, R.B. Altman, Missing value estimation methods for DNA microarrays, Bioinformatics (Oxford, England) 17 (6) (2001) 520–525 〈http://dx.doi. org/10.1093/bioinformatics/17.6.520〉. [29] S. Tavazoie, D. Hughes, M. Campbell, R. Cho, G. Church, Systematic determination of genetic network architecture, Nature Genet. 22 (5) (1999) 281–285. [30] A. Brazma, J. Vilo, Gene expression data analysis, 〈citeseer.nj.nec.com/braz ma00gene.html〉, 2000. [31] P. D'Haeseleer, X. Wen, S. Fuhrman, R. Somogyi, Mining the gene expression matrix: inferring gene relationships from large scale gene expression data, in: Second International Workshop on Information Processing in Cells and Tissues, 1998, pp. 203–212. [32] S. Bandyopadhyay, U. Maulik, J.T.-L. Wang (Eds.), Analysis of Biological Data: A Soft Computing Approach, World Scientific, 2007. [33] M.C.P. de Souto, I.G. Costa, D.A.S. de Araujo, T.B. Ludermir, A. Schliep, Clustering cancer gene expression data: a comparative study, BMC Bioinformatics 9 (2008) 497 þ 〈http://dx.doi.org/10.1186/1471-2105-9-497〉. [34] J.-P. Brunet, P. Tamayo, T.R. Golub, J.P. Mesirov, Metagenes and molecular pattern discovery using matrix factorization, Proc. Natl. Acad. Sci. USA 101 (12) (2004) 4164–4169 〈http://www.pubmedcentral.nih.gov/articlerender.fcgi? artid=384712&tool=pmcentrez&rendertype=abstract〉. [35] L. Liu, D.M. Hawkins, S. Ghosh, S.S. Young, Robust singular value decomposition analysis of microarray data, Proc. Natl. Acad. Sci. USA 100 (23) (2003) 13167–13172 〈http://www.pnas.org/content/100/23/13167.abstract〉. [36] G. Sherlock, Analysis of large-scale gene expression data, Curr. Opinion Immunol. 12 (2) (2000) 201–205 〈http://view.ncbi.nlm.nih.gov/pubmed/10712947〉. [37] F.D. Smet, J. Mathys, K. Marchal, G. Thijs, B.D. Moor, Y. Moreau, Adaptive quality-based clustering of gene expression profiles, Bioinformatics 18 (2002) 735–746. [38] R. Herwig, A.J. Poustka, C. Müller, C. Bull, H. Lehrach, J. O'Brien, Large-scale clustering of cDNA-fingerprinting data, Genome Res. 9 (11) (1999) 1093–1105, http://dx.doi.org/10.1101/gr.9.11.1093. [39] L.J. Heyer, S. Kruglyak, S. Yooseph, Exploring expression data: identification and analysis of coexpressed genes, Genome Res. 9 (11) (1999) 1106–1115 http://dx.doi.org/10.1101/gr.9.11.1106. [40] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, T.R. Golub, Interpreting patterns of gene expression with selforganizing maps: methods and application to hematopoietic differentiation, Proc. Natl. Acad. Sci. 96 (6) (1999) 2907–2912, http://dx.doi.org/10.1073/ pnas.96.6.2907. [41] U. Alon, N. Barkai, D.A. Notterman, K. Gishdagger, S. Ybarradagger, D. Mackdagger, A.J. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad. Sci. USA 96 (12) (1999) 6745–6750. [42] K. Rose, Deterministic annealing for clustering, compression, classification, regression, and related optimization problems, in: Proceedings of the IEEE, 1998, pp. 2210–2239. [43] V.R. Iyer, M.B. Eisen, D.T. Ross, G. Schuler, T. Moore, J.C. Lee, J.M. Trent, L.M. Staudt, J. Hudson, M.S. Boguski, D. Lashkari, D. Shalon, D. Botstein, P.O. Brown, The transcriptional program in the response of human fibroblasts to serum, Science 283 (5398) (1999) 83–87 〈http://www.biomedsearch.com/ nih/transcriptional-program-in-response-human/9872747.html〉. [44] R. Sharan, R. Shamir, Center click: a clustering algorithm with applications to gene expression analysis, in: ISMB, 2000, pp. 307–316. [45] D. Jiang, J. Pei, A. Zhang, Dhc: a density-based hierarchical clustering method for time series gene expression data, in: Proceedings of the Third IEEE Symposium on Bio-Informatics and Bio-Engineering (BIBE 03), 2003. [46] C. Fraley, A.E. Raftery, How many clusters? Which clustering method? Answers via model-based cluster analysis, Comput. J. 41 (8) (1998) 578–588. [47] K.Y. Yeung, C. Fraley, A. Murua, A.E. Raftery, W.L. Ruzzo, Model-based clustering and data transformations for gene expression data, Bioinformatics 17 (10) (2001) 977–987. [48] D. Ghosh, A.M. Chinnaiyan, Mixture modelling of gene expression data from microarray experiments, Bioinformatics 18 (2) (2002) 275–286. [49] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (6) (1984) 721–741. [50] S. Kirkpatrick, C. Gelatt, M. Vecchi, Optimization by simulated annealing, Science 220 (1983) 671–680.

Please cite this article as: S. Saha, et al., Gene expression data clustering using a multiobjective symmetry based clustering technique, Comput. Biol. Med. (2013), http://dx.doi.org/10.1016/j.compbiomed.2013.07.021i

67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132

S. Saha et al. / Computers in Biology and Medicine ∎ (∎∎∎∎) ∎∎∎–∎∎∎

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

[51] S. Bandyopadhyay, Multiobjective simulated annealing for fuzzy clustering with stability and validity, Trans. Syst. Man Cyber. Part C 41 (5) (2011) 682–691, http://dx.doi.org/10.1109/TSMCC.2010.2088390. [52] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput. 6 (2) (2002) 182–197. [53] M.K. Pakhira, U. Maulik, S. Bandyopadhyay, Validity index for crisp and fuzzy clusters, Pattern Recog. 37 (3) (2004) 487–501. [54] K. Deb, Multi-objective Optimization Using Evolutionary Algorithms, John Wiley and Sons, Ltd, England, 2001. [55] P. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J. Comput. Appl. Math. 20 (1) (1987) 53–65, http://dx.doi. org/10.1016/0377-0427(87)90125-7. [56] S. Chu, J. DeRisi, M. Eisen, J. Mulholland, D. Botstein, P. Brown, I. Herskowitz, The transcriptional program of sporulation in budding yeast, Science 282 (5389) (1998) 699–705. [57] R. Cho, M. Campbell, E. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. Wolfsberg, A. Gabrielian, D. Landsman, D. Lockhart, R. Davis, A genome-wide transcriptional analysis of the mitotic cell cycle, Mol. Cell 2 (1) (1998) 65–73. [58] P. Reymond, H. Weber, M. Damond, E. Farmer, Differential gene expression in response to mechanical wounding and insect feeding in Arabidopsis, Plant Cell 12 (5) (2000) 707–720.

13

[59] V.R. Iyer, M.B. Eisen, D.T. Ross, G. Schuler, T. Moore, J.C.F. Lee, J.M. Trent, L.M. Staudt, J. Hudson, M.S. Boguski, D. Lashkari, D. Shalon, D. Botstein, P.O. Brown, The transcriptional program in the response of human fibroblasts to serum, Science 283 (5398) (1999) 83–87. [60] X. Wen, S. Fuhrman, G.S. Michaels, D.B. Carr, S. Smith, J.L. Barker, R. Somogyi, Large-scale temporal gene expression mapping of central nervous system development, Proc. Natl. Acad. Sci. 95 (1) (1998) 334–339 〈http://www.pnas. org/cgi/content/abstract/95/1/334〉. [61] R. Sharan, A. Maron-Katz, R. Shamir, Click and expander: a system for clustering and visualizing gene expression data, Bioinformatics 19 (14) (2003) 1787–1799. [62] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30 〈http://dl.acm.org/citation.cfm?id=1248547.1248548〉. [63] S. García, F. Herrera, An Extension on “Statistical Comparisons of Classifiers over Multiple Data Sets” for all Pairwise Comparisons, J. Mach. Learning Res. 9 (2008) 2677–2694 〈http://www.jmlr.org/papers/volume9/ garcia08a/garcia08a.pdf〉. [64] M. Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance, J. Am. Stat. Assoc. 32 (200) (1937) 675–701 〈http:// www.jstor.org/stable/2279372〉. [65] P. Nemenyi, Distribution-Free Multiple Comparisons, Ph.D. Thesis, 1963.

Please cite this article as: S. Saha, et al., Gene expression data clustering using a multiobjective symmetry based clustering technique, Comput. Biol. Med. (2013), http://dx.doi.org/10.1016/j.compbiomed.2013.07.021i

19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35