Expert Systems With Applications 91 (2018) 198–210
FWCMR: A scalable and robust fuzzy weighted clustering based on MapReduce with application to microarray gene expression
Behrooz Hosseini, Kourosh Kiani∗
Electrical and Computer Engineering Department, Semnan University, Semnan, Iran
Article history: Received 28 May 2017; Revised 7 August 2017; Accepted 31 August 2017; Available online 8 September 2017.
Data clustering is a very useful data mining technique to find groups of similar objects present in a dataset. Scalability to handle immense volumes, robustness to intrinsic outlier data and validity of clustering results are the main challenges of any data clustering approach. In order to address these challenges, a fuzzy weighted clustering approach which is completely parallel and distributed in every phase is proposed in this research. Although the proposed method can be used for various data clustering purposes, it has been applied to gene expression clustering to reveal functional relationships of genes in a biological process. Conforming to MapReduce, the proposed method also presents a novel similarity measure which benefits from combining ordered weighted averaging and the Spearman correlation coefficient. In the proposed method, density-reachable genes are joined to establish subclusters. Afterwards, final cluster results are obtained by merging these subclusters. A voting system detects the best weights and consequently the most valid clusters among all possible results for each distinct dataset. The whole algorithm is implemented on a distributed processing platform and is scalable to process any size of data stored in cloud infrastructures. The precision of the resulting clusters was evaluated using some of the well-known cluster validity indexes in the literature. The efficiency of the proposed method in scalability and robustness was also compared with recently published similar studies. In all the mentioned comparisons, the proposed method outperformed recent works on the same datasets. © 2017 Elsevier Ltd. All rights reserved.
Keywords: MapReduce; Distributed density based clustering; Gene expression microarray; Fuzzy weighted clustering; Decision making; Big data
1. Introduction

Clustering is an unsupervised approach of data mining which attempts to group similar objects according to similar characteristics. Generally speaking, clustering analysis has three main areas to address: similarity measurement(s), the clustering algorithm, and clustering validation. There has been a rich literature on clustering analysis over the past decades (Aggarwal & Chandan, 2014; Fahad et al., 2014; Kim, 2009). Genes are fundamental biological information storage units found in every living creature. The process by which information from a gene is used in the synthesis of functional gene products is called gene expression. Elucidating the patterns laid in gene expression data offers unique possibilities for a rich understanding of functional genomics (Allison, Cui, Page, & Sabripour, 2006; Sateesh Babu & Suresh, 2013). The quality of gene expression clustering plays a substantial role in the refinement of valuable biological data to find natural groups existing
∗ Corresponding author.
E-mail addresses: [email protected] (B. Hosseini), kourosh.kiani@semnan.ac.ir (K. Kiani).
http://dx.doi.org/10.1016/j.eswa.2017.08.051
0957-4174/© 2017 Elsevier Ltd. All rights reserved.
in the gene set. Microarray technology (Castellanos-Garzón, García, Novais, & Díaz, 2013), across collections of related samples, is one of the biotechnological means used to facilitate simultaneous monitoring of the expression levels of thousands of genes during important biological processes. With recent advances in gene microarray technology, high-density DNA arrays can be monitored all in one place (Castellanos-Garzón et al., 2013). However, the growing need for storing, retrieving and processing large-scale genome data presents considerable challenges to biologists (Wong, 2016). Various clustering algorithms have been proposed to more precisely group similar genes according to similar expression patterns (Kerr, Ruskin, Crane, & Doolan, 2008; Yoo et al., 2012). However, there is no single “best” clustering algorithm which is the “winner” in every case (Xu & Wunsch, 2005). Microarray datasets are very complex and contain intrinsic outliers or missing values (Moorthy, Saberi Mohamad, & Deris, 2014). Robustness keeps clustering results valid in the presence of outlier data. With such undesirable data present within any dataset, including gene expression, the robustness of the clustering algorithm is very important. Recently, microarray datasets have grown in both volume and complexity. Hence, scalable algorithms are inevitably needed to handle them. To address these challenges, this research proposes a distributed
density based clustering algorithm that tries to group the genes with a novel fuzzy weighted similarity metric. In the rest of the paper, the proposed method is named FWCMR, which is an acronym for Fuzzy Weighted Clustering approach based on MapReduce. By distributing the total data through the nodes of Hadoop (the most popular big data processing framework) (White, 2012) in a way that does not affect their possible correlations, FWCMR finds the major density centers from which the subclusters originate. Consequently, the proposed clustering algorithm merges the intermediate subclusters into an integrated and unified final cluster result. Unlike partially distributed algorithms such as the parallel k-means addressed by Zhao, Ma, and He (2009), FWCMR is a completely distributed algorithm that uses MapReduce (MR) (Shim, 2012) in every step. The proposed method is also robust to intrinsic outliers within datasets in comparison with recently published similar approaches. Finally, the quality of the resulting clusters is measured by widespread clustering validity measurements from the literature. Experiments demonstrate the effectiveness and efficiency of FWCMR in comparison with similar studies. The approach described in this research is motivated by the need for a data clustering method, and especially a gene expression clustering method, that is reliable, scalable and robust to outliers at the same time. The contributions of this paper are: (1) benefiting from a parallel and distributed algorithm which correctly discovers valid gene clusters; (2) a scalable, flexible and yet robust similarity measure for gene expression clustering on the basis of the Ordered Weighted Averaging (OWA) operator; (3) compatibility with recent cloud computing frameworks and complete integrity with the MapReduce paradigm and, especially, Hadoop; (4) addressing a new density based clustering which introduces gene density connectivity to prune the exploration space for similarity assessment.
FWCMR gives considerable results on scientific case studies and easily scales with any size of dataset. All the steps were implemented in MapReduce and no serial bottleneck was observed in the whole algorithm. The roadmap of this paper is organized as follows: In Section 2, preliminaries, including the background on similarity measurement, density based clustering steps and the theoretical concepts of the approach, are introduced, together with an overview of previous works on microarray gene expression clustering. In Section 3, the proposed method is presented by introducing a novel fuzzy weighted similarity measurement and a distributed clustering procedure benefiting from MapReduce. Section 4 explains and demonstrates the practical details of the experiments, evaluations and results. Finally, the results are reviewed and discussed in Section 5 and the paper is concluded in Section 6.
2. Preliminaries and related works

2.1. Clustering definition

A good clustering method for gene expression should have characteristics such as: 1) simply coping with noisy high dimensional data, 2) being unaffected by the order of input, 3) having amenable computational complexity (that is, allowing increased data load without breakdown or requiring major changes), 4) requiring few input parameters, 5) being independent of any metadata or a priori knowledge of cluster numbers or result structure (Kerr et al., 2008). Clustering approaches for big data can be grouped into various methods (Jiang, Tang, & Zhang, 2004; Shim, 2012), but generally they are divided into partition based, hierarchical and density based algorithms. A good deal of the clustering algorithms applied so far produce hard clusters, that is, each gene is assigned to only one cluster. Hard clustering is suitable in the case where clusters are well separated. However, in microarray data clustering, where
gene clusters might frequently overlap, soft clustering is more useful (Futschik & Carlisle, 2005). Cluster analysis is an exploratory procedure. Hence, having no a priori information on gene groups, such as their actual number or possible outlier pattern, is common. Arbitrary selection of this number may undesirably bias the search. Partition based clustering algorithms like k-means (Azimi, Ghayekhloo, Ghofrani, & Sajedi, 2017), fuzzy c-means (Nayak, Naik, Behera, & Abraham, 2017), and similar algorithms suffer from the dependency on a priori knowledge of the number of clusters and the possible pattern of data distribution. Partition based algorithms are also sensitive to outliers, so that if one of the outliers is accidentally selected as the centroid or a simple member of a cluster, it can affect the accuracy of the final results. Results are also dependent on the initial cluster centers (Aggarwal & Chandan, 2014). In hierarchical clustering (Gao, Jiang, She, & Fu, 2010), if two nodes are incorrectly joined to a cluster, there is no correction step in the later stages to solve this problem, and this wrong assignment will remain in the final results. More than that, hierarchical clustering in each step relies on the results of the previous step to decide on the rest of the procedure. This greedy characteristic is not suitable for distributed and parallel computation. However, density based algorithms (Kriegel, Kröger, Sander, & Zimek, 2011) require no special prior knowledge of the data or the number of clusters. They are robust to outliers’ side effects and flexible enough to take any shape that fits the data distribution. Density based clustering approaches have shown better performance on average in comparison with other clustering methodologies (Aggarwal & Chandan, 2014). In order to benefit from the mentioned advantages, FWCMR is a density based clustering algorithm.
Extremely high dimensional gene expression data make the conventional clustering process time consuming, with heavy computational cost. FWCMR benefits from a distributed processing schema inspired by MapReduce in every step, which enables the parallel execution of gene expression clustering for any arbitrary size of data.

2.2. Literature review

Gene clustering has been proven to be helpful in understanding gene function, regulation and involvement in cellular processes (Jiang et al., 2004). Clustering is also beneficial in revealing subcell types and better understanding the transcriptional regulatory network. Clustering gene expression data has been an area of interest in the recent decade (Kerr et al., 2008). Therefore, many profitable reviews on gene expression clustering have also been published (Kotlyar, Fuhrman, Ableson, & Somogyi, 2002; Perdew, Vanden Heuvel, & Peters, 2014). One of the earliest approaches to gene expression clustering was addressed by Sharp, Tuohy, and Mosurski (1986), in which two predefined distinct clusters are established: one for extremely expressed genes and the other for ordinarily expressing ones. Later, after the introduction of microarray technologies, many clustering approaches were used to clarify this mass of data, which have been reviewed by Sherlock (2000). A good research on microarray clustering was conducted by Salem, Jack, and Nandi (2008), in which self-organizing oscillator networks are customized for microarray clustering. Their method does not require knowledge of the number of clusters and indicates constructive achievements in comparison with hierarchical, k-means and Self Organizing Map (SOM) clustering using the Mahalanobis distance. However, the algorithm suffers from the calculation of a large distance matrix that records the distance of every point.
It also imposes heavy computation costs on intensive datasets and is dependent on serial oscillation phases, which makes it very difficult to adopt in a parallel implementation. Maji and Paul (2013) addressed a rough
fuzzy clustering which benefits from rough sets and fuzzy sets as a new c-means algorithm. Despite judicious handling of uncertainties, the main problems of partition based clustering remain unsolved and the final clustering results are very sensitive to the initial parameters. The method presented by Maji and Paul (2014) benefits from mutual information and is a semi-supervised clustering approach. Partial information gained from the probabilistic distribution of the data involves prior knowledge to keep the algorithm accurate. It also suffers from the shortcomings of Euclidean distance in similarity calculation. Many bioinformatics algorithms have been addressed and inspired by the MapReduce processing model (Maji & Paul, 2013, 2014). One of the studies that tried to utilize MapReduce is that of Boeva, Tsiporkova, and Kostadinova (2014), which focused on a MapReduce approach for clustering gene expression datasets obtained from multiple experiments. In the map phase, k-means clustering is simultaneously applied for all the experiments, then the different lists of clustering solutions are merged into a single cluster through reduction. Ultimately, the cluster results are refined by applying formal concept analysis. Despite their innovations, the problems and weaknesses of k-means remain in their approach, and the similarity assessment is not measured in a distributed way. Hence, serial processing bottlenecks seriously remain within their algorithm. With regard to clustering algorithms inspired by MapReduce, the parallel DBSCAN addressed by Li and Xi (2011) is one of the first clustering methods using the MapReduce framework. It partitions the data to find the clusters in each partition and merges the clusters from every partition using MapReduce. Using MapReduce in clustering was a good beginning, but their approach was just an ensemble of singular algorithms run in parallel and suffered from being only an intuitively parallelized algorithm.
Another MapReduce inspired clustering approach, presented by He et al. (2011), proposed dividing the data space into a grid, running the traditional DBSCAN algorithm in each grid cell in parallel with spatial indices such as the similarity join tree (Shim, Srikant, & Agrawal, 1997), and merging every found cluster on a single machine. Though their innovative idea lies in presenting a novel clustering algorithm, their approach suffers from a memory-based spatial index in each reduce function and from merging the clusters of all grid cells on a single machine, which imposes a significant in-memory computation overhead instead of distributed batch processing. Regarding the remaining shortcomings mentioned, FWCMR is introduced. The experimental results show improvements in comparison with recently published approaches.

2.3. Definitions of density clustering

Let τ and ξ be the maximum acceptable distance (measured dissimilarity) of genes within a cluster and the minimum number of cluster members, respectively. They are the two main thresholds controlling cluster homogeneity and the number of clusters. Both depend on the biological experiment and may vary accordingly. Let Nτ(gi) be the genes with distance less than τ to the ith gene (gi). We call gi a Core Gene (CG) if (|Nτ(gi)| + 1) ≥ ξ, where |Nτ(gi)| is the count of genes with distance less than τ to gi; otherwise gi is called a Border Gene (BG). Given the set of genes G = {g1, g2, …, gn} used in the experiment, gj ∈ G is directly density-reachable from gi ∈ G if (1) gi is a core gene and (2) gj ∈ Nτ(gi). A core gene is also directly density-reachable from itself. If gj is directly density-reachable from gi, we denote it as gi →(τ,ξ) gj. A gene gk is density-reachable from another gene g1 if there exists a chain of genes g1, g2, …, gk where gi →(τ,ξ) gi+1 holds for every i with 1 ≤ i < k; then it is concluded that gk is density-reachable from g1, which is denoted as g1 ⇒(τ,ξ) gk. Such a chain of genes shares a common pattern of expression. Accordingly, if a gene gi is neither a core gene nor density-reachable from any core gene, we call it an Outlier Gene (OG). Note that although density-reachability is transitive, it is not symmetric. On this basis, a gene gi is density-connected to a gene gj if there is a core gene gk ∈ G such that gk ⇒(τ,ξ) gi and gk ⇒(τ,ξ) gj. We denote density-connectedness as gi ⇔(τ,ξ) gj. A cluster is then a set of density-connected genes with the maximum possible members regarding the density-reachability definition. Outliers are the genes which cannot be members of any cluster. With respect to these definitions, a cluster C is a subset of microarray genes satisfying the conditions depicted in Fig. 1.
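As an illustration of these definitions, a compact single-machine sketch is given below; the distributed FWCMR realization follows in Section 3. All function names and the toy similarity are ours, and, following the later sections, "within distance τ" is read here as "similarity of at least τ":

```python
from collections import deque

# Illustrative single-machine sketch of the density definitions above.
# `sim(g, h)` is a symmetric similarity; xi is the minimum cluster size.

def neighbors(g, genes, sim, tau):
    """N_tau(g): the genes (other than g) with similarity >= tau to g."""
    return [h for h in genes if h != g and sim(g, h) >= tau]

def core_genes(genes, sim, tau, xi):
    """Core genes: |N_tau(g)| + 1 >= xi; the rest are border or outlier."""
    return {g for g in genes if len(neighbors(g, genes, sim, tau)) + 1 >= xi}

def density_connected_set(seed, genes, sim, tau, core):
    """All genes density-reachable from a core seed (breadth-first walk)."""
    cluster, queue = {seed}, deque([seed])
    while queue:
        g = queue.popleft()
        if g not in core:            # border genes are absorbed, not expanded
            continue
        for h in neighbors(g, genes, sim, tau):
            if h not in cluster:
                cluster.add(h)
                queue.append(h)
    return cluster
```

With τ = 0.8 and ξ = 3, three mutually similar genes form one density-connected set, while an isolated gene remains an outlier.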
2.4. Similarity measure

In gene expression clustering, similarity can be assessed in terms of similar patterns of expression across identical time or condition points. Many studies use the absolute value of the correlation as an unsigned co-expression similarity measure (Horvath, 2011). However, using the absolute value of the correlation may obfuscate biologically relevant information, since no distinction is made between gene repression and activation (Horvath, 2011). In addition to scaling, shifting and shape problems, a robust similarity measure must allow for non-uniformity. Several popular and commonly used similarity (distance) measures are reviewed in Horvath (2011) and Kriegel et al. (2011). Different similarity measures consequently produce clusters of various shapes (e.g. Euclidean distance yields spherical clusters, Mahalanobis distance (Mahalanobis, 1930) ellipsoidal ones). In particular, correlation based similarity measurements indicate the degree of similarity of changes in expression across samples for distinct gene expression profiles, regardless of their scale. The Pearson correlation is one of the famous ones, but it is sensitive to outliers. It is also not robust for non-Gaussian distributions (Bickel, 2001). Non-parametric correlation coefficients, such as Spearman’s rank, show better achievements when outliers are present (Romesburg, 2004). However, Spearman treats all ranks as equally important, whereas the expression of a gene in some of the conditions or time points might have higher or lower importance and impression on final decisions (Roden et al., 2006). Benefiting from the positive aspects of Spearman’s rank, the proposed similarity measure is specially suited for case studies where not all of the expression conditions are equally important to biologists.
For instance, a scientist might be interested in identifying a subset of genes that share coherent expression across some preferred conditions, rather than across all or most conditions as required in conventional clustering approaches (Roden et al., 2006). The biological significance of conditions or annotated genes can be assessed either by some measures (Speer, Spiet, & Zell, 2005) or with the aid of biologists (Yousri, 2010). In these situations, the proposed similarity measure compensates for the lack of such a similarity measure in the literature.
2.4.1. Proposed similarity measure

In this article, a fuzzy weighted aggregation of Spearman’s rank, which is an expansion of the conventional Spearman’s rank, is addressed. Ordered Weighted Averaging (OWA) (Yager, 1988) is a fuzzy aggregation operator that suitably models linguistically expressed aggregation instructions. By benefiting from the OWA concepts, a new similarity measure is proposed that we call the OWA-Spearman rank. Here, we describe the construction of this novel weighted rank measure of correlation and then use an example to illustrate the advantages of the new measure. The general OWA operator to find the weighted mean of n values (a1, …, an) is defined in Eq. (1):
Fig. 1. Summary of membership conditions in the proposed method.
F(a_1, \ldots, a_n) = \sum_{j=1}^{n} w_j b_j, \quad \text{s.t.} \quad 0 \le w_j \le 1, \; \sum_{j=1}^{n} w_j = 1    (1)
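A minimal sketch of the OWA operator in Eq. (1): the inputs are sorted in descending order to obtain the b_j and then combined with a normalized weight vector (the function name is ours):

```python
import numpy as np

# Minimal sketch of the OWA operator in Eq. (1): inputs are sorted in
# descending order (b_j) and combined with a normalized weight vector.
def owa(values, weights):
    w = np.asarray(weights, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0), "Eq. (1) constraints"
    b = np.sort(np.asarray(values, dtype=float))[::-1]  # b_1 >= b_2 >= ...
    return float(np.dot(w, b))
```

The weight vector selects the aggregation behavior: the vector (1, 0, …, 0) returns the maximum, while uniform weights return the arithmetic mean.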
where b_j is the jth most prominent of the a_i and w_j is the respective jth weight. The inputs of the OWA should be ranked in descending order so that the first input is the most prominent one. The weight values actually model different aggregation operations. Two essential parameters characterizing the weights of OWA are: (1) the Attitudinal Character (AC) and (2) the Dispersion Rate (DR). AC (also called the Orness Factor), denoted in Eq. (2), characterizes the similarity of the aggregation to the logical OR operation:

AC = \frac{1}{n-1} \sum_{j=1}^{n} (n - j) w_j    (2)

It is assumed that AC ∈ [0, 1]. For instance, AC = 1 results in {w1 = 1; wi = 0, i ≠ 1}, which returns the most prominent input, and AC = 0 leads to {wn = 1; wi = 0, i ≠ n}, which returns the least prominent one. In the same way, AC = 1/2 brings about {wi = 1/n}, which returns the standard arithmetic mean of the input values. Accordingly, other values of AC affect the weight vector respectively. Inspired by the definition of entropy in information theory (Stone, 2015), DR, denoted in Eq. (3), characterizes how uniformly the arguments are used in the aggregation assessment:

DR = -\sum_{j=1}^{n} w_j \ln w_j    (3)

The monotony of the weights plays a substantial role in the involvement of a subset of the whole arguments (a1, …, an) in the aggregation procedure. Higher values of DR indicate more monotone weights and, subsequently, the association of more arguments in the aggregation procedure. Pearson’s correlation coefficient (Lee Rodgers & Nicewander, 1988) is the covariance of two variables divided by the product of their standard deviations. The Spearman correlation (rSpn) (Kotlyar et al., 2002) in its original form is the Pearson correlation between the ranked expression values of two distinct genes (gi, gj), computed by Eq. (4):

r_{Spn}(g_i, g_j) = 1 - \frac{6 \sum_{k=1}^{n} (r_k(g_i) - r_k(g_j))^2}{n(n^2 - 1)}    (4)

Here n is the number of observed conditions per gene, and rk(gi) and rk(gj) are the ranked observations of genes gi and gj, respectively, assuming 1 ≤ k ≤ n. If gi and gj tend to similarly increase or decrease expression in similar conditions (Cok), rSpn is positive. On the other hand, rSpn is negative when, in the same conditions, one of the genes tends to decrease while the other increases, or vice versa. An rSpn of zero indicates that there is no particular relation between the two genes. Higher absolute values of rSpn show higher correlations between the two genes, and vice versa.

As is obvious, Spearman treats all ranks as equally effective in the aggregation procedure. It is known that in some biological experiments, different conditions might have different impressions on gene expression (Roden et al., 2006). A relative weight (wk) can be presumed for each sample, representing the importance of that condition (Cok), assuming 1 ≤ k ≤ n. Together, these weights construct the weight vector (Wk). According to OWA (Yager, 1988), wk ∈ [0, 1] and \sum_{k=1}^{n} w_k = 1. The weighted mean (μw) is calculated using Eq. (5), and the weighted covariance (covw) is calculated by Eq. (6):

\mu_w(g_i) = \sum_{k=1}^{n} w_k \, g_{i,k}    (5)

cov_w(g_i, g_j) = \sum_{k=1}^{n} w_k (g_{i,k} - \mu_w(g_i))(g_{j,k} - \mu_w(g_j))    (6)
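The quantities in Eqs. (2)-(6) translate directly into code. The following minimal numpy sketch (all function names are ours) computes AC, DR, the conventional Spearman of Eq. (4) and the weighted mean and covariance of Eqs. (5)-(6):

```python
import numpy as np

# Sketch of Eqs. (2)-(6); names are ours, not the paper's.
def attitudinal_character(w):          # Eq. (2)
    n = len(w)
    j = np.arange(1, n + 1)
    return float(np.sum((n - j) * np.asarray(w, float)) / (n - 1))

def dispersion_rate(w):                # Eq. (3), with 0*ln(0) taken as 0
    w = np.asarray(w, dtype=float)
    nz = w[w > 0]
    return float(-np.sum(nz * np.log(nz)))

def spearman(r_i, r_j):                # Eq. (4), conventional Spearman
    n = len(r_i)
    d = np.asarray(r_i) - np.asarray(r_j)
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

def weighted_mean(w, g):               # Eq. (5)
    return float(np.dot(w, g))

def weighted_cov(w, g_i, g_j):         # Eq. (6)
    g_i = np.asarray(g_i, float)
    g_j = np.asarray(g_j, float)
    return float(np.dot(w, (g_i - weighted_mean(w, g_i))
                          * (g_j - weighted_mean(w, g_j))))
```

For a uniform weight vector, AC is 1/2 and DR reaches its maximum ln n, consistent with the discussion above.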
Following the definition of the conventional Spearman, the proposed similarity measure (rSim), namely the OWA-Spearman similarity, can be calculated using Eq. (7):

r_{Sim}(g_i, g_j) = \frac{cov_w(g_i, g_j)}{\sqrt{cov_w(g_i, g_i)\, cov_w(g_j, g_j)}}    (7)

where wi ∈ [0, 1] and \sum_{i=1}^{n} w_i = 1. Higher weight values indicate a higher impression of that condition in the similarity assessment between two distinct genes. By tuning DR and AC, the flexibility of the proposed similarity measure can be adjusted. The proposed measure is an extension of the conventional Spearman: by assuming all of the conditions equally important (AC = 1/2 and wi = 1/n), the conventional Spearman value is obtained. An expert can also manually assign the weights in a way that better fits any particular experiment. In FWCMR, more important conditions are ranked higher and subsequently get higher weights for further decisions. Finding an appropriate weight vector relies on the experiment, and the weights can change according to the experimental data. Either an expert biologist can manually set the weights, or they can be determined by tuning the DR and AC values. When the conditions are not equally important, they should be reordered so that the conditions more impressive in decision making get higher orders than the less impressive ones. In a particular experiment, higher values of DR lead to more monotone weights, and vice versa. The AC magnitude can also enhance the effect of higher or lower order conditions on the final decision. The proposed equation is also a solution to the tie problem in the conventional Spearman. The tie problem happens when a gene has identical expression values in some conditions, which leads to identical rank values. Regarding FWCMR, if the tie problem happens between two distinct conditions, say Coa and Cob, the condition Coa gets the higher rank if its associated weight value wa is greater than wb (wa > wb). When the weights are equal
Table 1
The gene expression rank matrix in 4 different conditions (Cok).

gi    Co1   Co2   Co3   Co4
g1     1     4     3     2
g2     2     1     4     3
g3     2     3     4     1
g4     3     2     1     4
g5     1     2     3     4
Table 2
The similarity matrix obtained by the classic Spearman (rspn(gi, gj)).

gi    rspn(gi,g1)  rspn(gi,g2)  rspn(gi,g3)  rspn(gi,g4)  rspn(gi,g5)
g1      +1.00        −0.20        +0.60        −0.60        +0.20
g2      −0.20        +1.00        +0.20        −0.20        +0.60
g3      +0.60        +0.20        +1.00        −1.00        −0.20
g4      −0.60        −0.20        −1.00        +1.00        +0.20
g5      +0.20        +0.60        −0.20        +0.20        +1.00
Table 3
The similarity matrix obtained by the proposed OWA-Spearman (rsim(gi, gj)).

gi    rsim(gi,g1)  rsim(gi,g2)  rsim(gi,g3)  rsim(gi,g4)  rsim(gi,g5)
g1      +1.00        −0.17        +0.60        −0.46        +0.47
g2      −0.17        +1.00        +0.39        −0.29        +0.54
g3      +0.60        +0.39        +1.00        −1.00        +0.66
g4      −0.46        −0.29        −1.00        +1.00        −0.32
g5      +0.47        +0.54        +0.66        −0.32        +1.00
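Eq. (7) amounts to a weighted Pearson correlation of the per-gene rank vectors. The sketch below (names are ours) takes the ranks as given, so the paper's weighted tie-breaking rule is not reproduced and the values need not match Table 3 exactly; with uniform weights it reduces to the conventional Spearman, as stated in Section 2.4.1:

```python
import numpy as np

# Hedged sketch of Eq. (7): OWA-Spearman similarity as the weighted
# Pearson correlation of per-gene rank vectors, via Eqs. (5) and (6).
def owa_spearman(w, r_i, r_j):
    w = np.asarray(w, dtype=float)
    r_i = np.asarray(r_i, dtype=float)
    r_j = np.asarray(r_j, dtype=float)
    mu_i = np.dot(w, r_i)                              # Eq. (5)
    mu_j = np.dot(w, r_j)
    cov_ij = np.dot(w, (r_i - mu_i) * (r_j - mu_j))    # Eq. (6)
    var_i = np.dot(w, (r_i - mu_i) ** 2)
    var_j = np.dot(w, (r_j - mu_j) ** 2)
    return float(cov_ij / np.sqrt(var_i * var_j))      # Eq. (7)
```

For the reversed rank vectors g3 and g4 of Table 1 this returns −1 for any valid weight vector, matching both tables.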
(wa = wb and a ≠ b), the condition Coa gets the higher rank if it is sorted as more important (a < b). In order to better explain Eq. (5), a visual example is depicted in Table 1, in which the columns are the conditions under which the expression values of 5 distinct genes have been measured. The respective weight vector Wk = {w1 = 0.4; w2 = 0.35; w3 = 0.2; w4 = 0.05} is assumed according to the importance of the conditions (Cok, 1 ≤ k ≤ 4). The conditions are ranked according to their importance, where Cok is more important than Cok+1. In the rows of Table 1, the expression values of the five sample genes have been normalized and then substituted with their associated ranks. In Table 2, the similarity matrix is indicated using the conventional Spearman rSpn(gi, gj) values between gi and gj on the basis of their ranks in Table 1. Table 3 shows the proposed OWA-Spearman (rsim(gi, gj)) values, which give the similarity of expression between the ith and jth genes with respect to the mentioned weight vector Wk. By comparing the similarities obtained from the conventional Spearman and the proposed OWA-Spearman in this simple example, their different behavior is recognized. In some cases, both show equal similarity values, like (rspn(g3, g4) = rsim(g3, g4) = −1) or (rspn(g1, g3) = rsim(g1, g3)). However, in comparing the genes g3 and g5, they give contradictory results. This is due to the variable importance of the conditions in the proposed similarity measure. The proposed similarity measure is symmetric (rsim(gi, gj) = rsim(gj, gi)), so there is no need to calculate the distances twice. The upper-triangle data of the similarity matrix of Table 3 can easily be calculated using MapReduce: the genes are partitioned between the processing nodes and the final results are reduced to construct the similarity matrix. A special list prunes the calculation to prevent duplicate similarity computations.

3. Fuzzy weighted clustering based on MapReduce (FWCMR)

FWCMR is a density based clustering method that tries to find genes having similar expression patterns. FWCMR consists of various MapReduce (MR) based steps, which are depicted in the simple flowchart
Fig. 2. A general view on the main steps of FWCMR.
overviewed in Fig. 2. In this section, the proposed approach is described in detail.

3.1. Data preprocessing

The accuracy of gene expression data considerably relies on the experimental procedure, and frequent image corruption such as slide impurities may lead to incomplete data (Kerr et al., 2008). The original gene expression data naturally come with missing values, outliers and systematic variations. Accordingly, a preprocessing step is indispensable before beginning the clustering. The preprocessing performed on our dataset includes data cleansing (removing corrupt or inaccurate records), normalization (Schuchhardt, 2000) (putting values in range by logarithmic transformation of each expression level), missing value imputation (Tang et al., 2014) (substituting values when possible) and biased value transformation. A gene expression dataset is an (s × n) matrix M where M = {mij | (1 ≤ i ≤ s) and (1 ≤ j ≤ n)}. The rows of matrix M represent the expression patterns of a set of s genes (G = {g1, …, gs}), while the columns represent the expression profiles of a set of n condition samples (Co = {Co1, …, Con}) at different conditions or time points. Accordingly, each cell mi,j of matrix M is the expression level of the gene gi (1 ≤ i ≤ s) under the condition Coj (1 ≤ j ≤ n). The genes whose expression levels did not change significantly during the harvesting are ignored in further analysis. Standardization of each row of the matrix is done using MapReduce: the mean and variance of each row are calculated in a parallel run. Then the old values of each row of the matrix are replaced with their new values using Eq. (8):
g^{new}_{i,j} = \frac{g^{old}_{i,j} - \mu_{g_i}}{\sigma_{g_i}}    (8)
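A single-machine equivalent of this standardization can be sketched as follows (in FWCMR the per-row mean and variance are computed in a parallel MapReduce pass; this vectorized version is ours):

```python
import numpy as np

# Sketch of the row-wise standardization in Eq. (8): each gene's
# expression profile is shifted by its mean and scaled by its
# standard deviation.
def standardize_rows(M):
    M = np.asarray(M, dtype=float)
    mu = M.mean(axis=1, keepdims=True)      # per-gene mean
    sigma = M.std(axis=1, keepdims=True)    # per-gene std deviation
    return (M - mu) / sigma
```

After this step, every row of the matrix has zero mean and unit standard deviation.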
Here μgi is the mean and σgi is the standard deviation of all values of the ith gene, and gi,j is the expression value of gi under the condition Coj.

3.2. MapReduce based similarity calculation (Similarity-MR)

The proposed similarity measure (rsim) indicates the closeness of two distinct genes: higher values indicate higher similarity between the respective genes. In order to efficiently calculate the similarity between two distinct genes, first the values in the normalized expression matrix are distributed among different processing nodes in the map phase and then ranked according to their magnitudes (Ranking-MR). The final results are aggregated into a unified rank matrix in the reduce phase. Each row of the rank matrix now consists of the ranks of the expression values instead
of their magnitudes for the respective gene. Afterwards, assuming that we have q distinct processing nodes, the rank matrix is divided into q different groups. Each of these groups is submitted to one of the processing nodes. Due to the symmetric characteristic of FWCMR, a controlling list helps to avoid calculating the same distance values twice. If we have s distinct rows of data in the rank matrix, s/q of them are handled by each processing node. The distances between all of the genes are calculated in the map phase, followed by the aggregation of the final results in the reduce phase.
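The division of the pairwise work can be sketched as follows: only the s(s−1)/2 unordered gene pairs (the upper triangle of the similarity matrix) are generated, dealt round-robin to q nodes, mirroring the role of the controlling list. The scheme and names are illustrative, not the paper's exact partitioning:

```python
from itertools import combinations

# Sketch of the Similarity-MR work split: only the upper triangle of
# the similarity matrix is computed, and the s*(s-1)/2 gene pairs are
# dealt round-robin to q processing nodes.
def partition_pairs(num_genes, q):
    """Assign each unordered gene pair (i, j), i < j, to one of q nodes."""
    buckets = [[] for _ in range(q)]
    for idx, pair in enumerate(combinations(range(num_genes), 2)):
        buckets[idx % q].append(pair)
    return buckets
```

Each bucket becomes the map-phase input of one node; the reduce phase concatenates the partial results into the similarity matrix.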
3.3. Data division on the basis of neighbor similarity (Neighboring-MR)

The neighboring genes Nτ(gi) are those genes with similarity greater than τ to gene gi. In a simple parallel MapReduce algorithm, namely Neighboring-MR, the neighbors of each gene can be marked out. Accordingly, the core genes, those with |Nτ(gi)| + 1 ≥ ξ, and their respective direct neighbors are obtained. In order to split the data among the processing nodes so that related data remain on the same node, core genes with their associated neighbors are saved in a tree data structure described in Shim et al. (1997), which we call the similarity join tree. On the basis of the similarity join tree, the rank matrix is partitioned among the q processing nodes again. The partitioning splits the core genes and their associated neighbors among the processing nodes so that each core gene and all of its direct neighbors remain accessible to its processing node. Naturally, there may be cases in which an overlapping gene is simultaneously a direct neighbor of two or more core genes; such a gene remains accessible to all processing nodes. According to the soft clustering characteristics of our approach, a gene can be a member of more than one cluster according to its membership value. The remaining genes, which are not direct neighbors of any particular gene, are randomly distributed among the processing nodes.

3.4. MapReduce based subclustering (Subclustering-MR)

The gene set, due to its large size and complexity, cannot easily be processed by a single machine; distributing it among the processing nodes of the MapReduce algorithm is therefore inevitable. According to the similarity join tree, core genes and their associated neighbors are located in each local processing node. At the beginning of this step, no gene is assigned to any cluster yet. Each gene has five components. The first shows whether it has been visited or not; by default it is set to unvisited and changes to visited once the gene is visited. The second shows whether the gene is a core gene. The third contains the direct neighbors. The fourth indicates whether the gene is a border gene. The last stores the subcluster Unique Identifier (UID); all genes start with a default UID showing that they have not been assigned to any subcluster yet. The init() function sets all of the mentioned components to their default values. Inside each processing node, the map and reduce algorithms begin constructing the subclusters. The main loop of the map phase ends when no core gene remains unvisited. In the map phase, a subcluster originates from a randomly chosen unvisited core gene, which we call an initial gene (gi). This initial gene is given the unique identifier (UIDi) of its respective subcluster; a function called Next-UID generates unique UIDs inside each processing node so that two distinct processing nodes never contain equal UIDs. The directly density connected genes {gj | gi →(τ,ξ) gj} of the initial gene gi are given the same unique identifier (UIDi). The inner loop of the map phase assigns UIDi to all density reachable genes of gi; the key-value pair assigns the value UIDi to the key of each gene that is density reachable from gi. When a unique identifier is assigned to a gene, it is marked as visited and its visiting component becomes true. Subclusters are distinguished from each other by their assigned unique identifiers. Afterwards, for any unvisited gene (gu) and derived initial gene (gdi), the density reachability (gdi ⇒(τ,ξ) gu) is evaluated: all unvisited genes that are density connected {gu | gdi ⇔(τ,ξ) gu} to the derived initial genes, including their not yet visited neighbor genes, are checked. If these genes are not core genes themselves (|Nτ(gu)| + 1 < ξ), they become the subcluster border genes, their border status component becomes true, and the algorithm does not continue from them. On the other hand, if the genes recently added to the subcluster are core genes themselves, all of their direct neighbors are assigned the same unique identifier and the algorithm follows them again until reaching the border genes. Finally, all genes with density reachability to gi are assigned the value UIDi (write.context(key, UID)). If any unvisited core gene remains, a new subcluster (SUBC(UIDj)) is established with a new unique identifier (UIDj), and the inner loop of the map phase continues from this new initial gene (gj) and its density reachable nodes (gj ⇒(τ,ξ) gu). The map phase stops when no new initial gene can be developed, i.e. none of the remaining unvisited genes are core genes. Eventually, in the reduce phase, all genes with identical unique identifiers {gi, gj ∈ SUBC(UIDk) | (gi, UIDk) and (gj, UIDk) and (i ≠ j)} are aggregated into a unique subcluster SUBC(UIDk). Each subcluster has a distinct representative, called its prototype, for further comparison between subclusters. The prototype is calculated in the reduce step by averaging the rank values of all genes within the subcluster: Pr = {Pr1, Pr2, …, Prn} is a vector containing the average rank value for each condition. Its elements Prj for a subcluster SUBC(UIDk), (1 ≤ j ≤ n), are given by Eq. (9):
Pr_j(SUBC(UID_k)) = (1/n) Σ_{i=1..n} rk(g_{i,j})    (9)
In Eq. (9), Prj is the jth element of the prototype, n is the number of genes in the subcluster (those with the identical unique identifier) and rk(gi,j) is the rank value of the expression of gene i in condition j. The prototype can easily be obtained for each unique identifier. If any gene remains unvisited and is neither a core gene nor density connected to any core gene, it is marked as a partial outlier. It is presumed only partially outlier because, although it has not yet become a member of any cluster, it may still become a member of clusters inside other processing nodes in the upcoming processing steps. Fig. 3 denotes the MapReduce functions proposed for this step.

3.5. MapReduce based final merge (Final Merge-MR)

We next develop the MapReduce algorithm for discovering the final gene clusters. In this step, the final clusters are generated by merging the subclusters obtained in the previous steps.
Fig. 3. The Subclustering Map and Reduce algorithms.
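The core-gene expansion at the heart of the subclustering map phase can be sketched on a single node as follows (all names are illustrative; the real implementation runs this per mapper over its partition of the similarity join tree):

```python
from collections import deque

def subcluster(genes, neighbors, is_core):
    """genes: list of gene ids; neighbors[g]: set of direct neighbors
    of g (similarity > tau); is_core[g]: True when |N_tau(g)| + 1 >= xi.
    Returns {gene: uid}; genes left unassigned are partial outliers."""
    uid_of, visited, next_uid = {}, set(), 0
    for g in genes:
        if g in visited or not is_core[g]:
            continue
        next_uid += 1                      # Next-UID for a new subcluster
        queue = deque([g])
        while queue:                       # expand density-reachable genes
            cur = queue.popleft()
            if cur in visited:
                continue
            visited.add(cur)
            uid_of[cur] = next_uid
            if is_core[cur]:               # border genes stop the expansion
                queue.extend(n for n in neighbors[cur] if n not in visited)
    return uid_of
```

Non-core genes reached from a core gene receive the UID but are not expanded further, which matches the border-gene rule described above; genes never reached stay out of the mapping, matching the partial-outlier rule.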
Like the other steps, this step also includes map and reduce phases. The map phase of the final merge considers three substantial conditions: (I) two clusters sharing common gene(s); (II) membership of partial outliers in initial clusters obtained from other computing nodes; (III) similarity of two subclusters on the basis of their prototypes. In the map phase of this step, all genes are aggregated in a matrix which we call the integrated matrix. Most of the genes have been assigned a unique identifier showing their relation to an initial subcluster. The final merge detects any of the three mentioned conditions through the integrated matrix. The first condition requires checking two types of genes: (a) genes overlapping between two clusters and (b) border genes. In both situations, the possibility of membership in other subclusters is evaluated. Benefiting from the distributed file system, simultaneous parallel access to all genes is possible for every processing node. Here again, all genes with their components and attributes are divided among the processing nodes: with s distinct genes, q processing nodes and p initial genes, (s − p)/q genes are assigned to each processing node. Every processing node also receives a list of all initial genes and their respective UIDs. This redundancy policy keeps a list of all initial genes in every processing node; it does not impose a memory or I/O bottleneck on the system, due to the small and limited size of the initial
genes list. Initial genes represent their respective initial subclusters. Each processing node accesses its assigned border and overlapping genes to evaluate their similarity (density reachability) to all initial subclusters, and assigns each such gene to a similar subcluster whenever density connectivity exists. Multiple assignment of a gene to more than one subcluster is allowed in this step; a gene can be a member of several subclusters at the same time. For the second condition, the reachability of partial outlier genes to the existing subclusters is assessed in all processing nodes. For the last condition, the similarity between any pair of obtained subclusters is evaluated, and if the distance between them is within the acceptable threshold, they are joined. For this similarity assessment, the prototype of each subcluster is first updated. If rsim between the prototypes of two distinct subclusters is more than the acceptable threshold (τ), they can presumably be considered similar subclusters. To be certain of their similarity, their reachability is also evaluated: if the two subclusters are reachable (density connected), SUBC(UIDi) ⇔(τ,ξ) SUBC(UIDj), and their similarity is above the threshold, they are merged to build a partially merged subcluster. This partial merge at the end of the map section simply lets any gene have multiple
Fig. 4. The Final Merge, Map and Reduce algorithms.
unique identifiers (UID1, UID2, …, UIDC) showing its intercommunication with subclusters SUBC(UID1), SUBC(UID2), …, SUBC(UIDC). The partial merge is followed by an integrated merge in the reduce section. The integrated merge function detects genes shared between subclusters: if a gene is marked with more than one UID, there is an intersection between subclusters. If the number of intersections (common genes between two distinct subclusters) is more than the threshold (ξ), these subclusters are combined into a new integrated subcluster with a new UID. Otherwise, the common gene can still retain the unique identifiers of both subclusters; in biology, a single gene may participate in multiple pathways. Such genes receive a fuzzy membership (FM) assignment to each of the subclusters. The fuzzy membership to a subcluster SUBC(UIDk) shows how similarly a particular gene can express relative to the other genes located in SUBC(UIDk). The similarity rsim of a gene with the prototype P(UIDk) of each subcluster is calculated, and the results are obtained using max(rsim) according to Eq. (10):
FM(g_i, SUBC(UID_k)) = [r_sim(g_i, P(UID_k)) / Σ_{q=1..z} r_sim(g_i, P(UID_q))] × 100    (10)
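Eq. (10) simply normalizes a gene's similarity to one prototype by its summed similarity to all z prototypes. A minimal sketch (assuming, as the equation implies, non-negative rsim values with a non-zero sum):

```python
def fuzzy_memberships(sims):
    """sims: list of r_sim values between one gene and the z subcluster
    prototypes. Returns percentage memberships following Eq. (10);
    they sum to 100 by construction."""
    total = sum(sims)
    return [100.0 * s / total for s in sims]
```

For example, a gene with similarities 0.6, 0.3 and 0.1 to three prototypes receives memberships of 60%, 30% and 10%.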
Finally, the unvisited genes and outliers are unified together. The pseudocode in Fig. 4 shows the map and reduce functions utilized for the final merge phase.

3.6. Global post-processing

After the final merge, the resulting clusters contain the genes with the most similar expression patterns under identical conditions. The outliers still remaining from the previous steps are the genes that were not expressed in the same patterns as most of the other genes in the experiment. From a biological perspective, the interconnections between two major clusters can be interpretable (Futschik & Carlisle, 2005); hence their associated fuzzy memberships remain accessible for further biological interpretation, such as gene regulatory expression mechanisms.

4. Experimental results

Several recent algorithms have been implemented and compared with FWCMR on the basis of pervasive clustering validation metrics. The proposed method was run on our local Hadoop framework with four commercial computing nodes, and also tested on EC2 (Amazon Elastic Compute Cloud, 2017).

4.1. Dataset

Inspired by the era of big data, FWCMR is an approach that is most beneficial when a huge amount of data is available for clustering. In the proposed research, multiple microarray datasets are used, obtained from the global cell cycle control of gene expression in the fission yeast Schizosaccharomyces pombe (Rustici et al., 2004). This is a dataset on periodic transcription and its regulation during the cell cycle, and the entire raw dataset is available for download under accession numbers E-MEXP-54 to E-MEXP-64 in Rustici et al. (2004). Here, gene expression is a function of time in cells synchronized through different experiments, respectively in various elutriations, cdc25 and cdc10 block release. Also
Arabidopsis Thaliana, Human Fibroblasts Serum and Rat CNS are three well-known datasets utilized to better compare the results of FWCMR with recent works. Arabidopsis Thaliana contains the gene expression values of 138 genes of Arabidopsis Thaliana over 8 time points (Maulik, Mukhopadhyay, & Bandyopadhyay, 2009). Human Fibroblasts Serum contains the expression levels of 8613 human genes, with 13 dimensions corresponding to 12 time points and one unsynchronized sample (Iyer et al., 1999). Rat CNS consists of the expression levels of 112 genes during rat central nervous system development over 9 time points (Wen et al., 1998). Evaluating the scalability of methods using large synthetic datasets is common in the literature (Katal, Wazid, & Goudar, 2013), and a similar synthetic dataset has been generated for this experiment. To produce realistic datasets, we constrained them to follow the same conditions as microarray gene expression. The synthetic data (SD) contain from 1000 to one million genes expressed under more than 30 distinct conditions. In addition, some outliers are randomly scattered in the expression matrix for noise robustness assessment. In biology, one idea is to test a cross-species biological pathway in numerous types of living creatures to find common pathways in their metabolism, sickness, etc.; scalable clustering on a large-scale dataset can be useful in those areas of research (Le, Oltvai, & Bar-Joseph, 2010). To perform cross-species analysis using very large databases, we need methods that can match experiments in one species with experiments in another species; FWCMR can then be useful.
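The paper does not publish its generator, but a plausible sketch following the stated constraints (1000 to 1,000,000 genes, more than 30 conditions, randomly scattered outliers) could be, with all function names, parameter values and the pattern model being our illustrative assumptions:

```python
import numpy as np

def make_synthetic(n_genes, n_conditions=32, n_patterns=20,
                   outlier_frac=0.01, seed=0):
    """Simulate a log-ratio expression matrix: each gene follows one of a
    few shared temporal patterns plus Gaussian noise, and a small fraction
    of entries are replaced by outliers."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 2 * np.pi, n_conditions)
    patterns = np.stack([np.sin(t + rng.uniform(0, 2 * np.pi))
                         for _ in range(n_patterns)])
    labels = rng.integers(0, n_patterns, size=n_genes)
    data = patterns[labels] + rng.normal(0, 0.2, (n_genes, n_conditions))
    # scatter outliers over random cells of the matrix
    n_out = int(outlier_frac * data.size)
    idx = rng.integers(0, data.size, size=n_out)
    data.flat[idx] = rng.uniform(-8, 8, size=n_out)
    return data, labels
```

Returning the ground-truth labels alongside the matrix makes it possible to score the resulting clusters against the planted structure as well as with the validity indexes of Section 4.2.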
4.2. Cluster validity

Validation of resulting clusters has been discussed in many papers (Roden et al., 2006; Romesburg, 2004). Since clustering is an unsupervised method, no single gold standard exists; consequently, a number of cluster validity index measures are available in the literature. However, none of these measures performs equally well on all kinds of datasets (Mukhopadhyay, Maulik, & Bandyopadhyay, 2013). In order to evaluate the clustering results, the Dunn index (DI) (Dunn, 1974), Davies-Bouldin index (DBI) (Davies & Bouldin, 1979), Silhouette index (SI) (Rousseeuw, 1987) and Beta index (β) (Pal, Ghosh, & Shankar, 2000) have been used. According to these indexes, a good clustering is obtained when the means of different clusters are sufficiently far apart compared to the within-cluster variance. DI evaluates how compact and well separated a cluster is, as calculated in Eq. (11):

DI = min_{i≠j} δ(C_i, C_j) / max_k Δ_k,  (1 ≤ i, j, k ≤ m)    (11)

Here m is the number of clusters, δ(Ci, Cj) is the inter-cluster distance between clusters Ci and Cj, and Δk is the maximum distance between two points in the same cluster. DBI is the ratio of within-cluster distance to between-cluster separation, shown in Eq. (12):

DBI = (1/m) Σ_{i=1..m} max_{j≠i} (S_i + S_j) / M_{i,j},  (1 ≤ i, j ≤ m)    (12)

In this equation Sk is the within-cluster distance of cluster k and Mi,j is the between-cluster separation. SI provides a succinct representation of how well each object lies within its cluster. The silhouette ranges from −1 to 1; a higher SI indicates that a member is well matched to its own cluster and poorly matched to neighboring clusters. SI is calculated as demonstrated in Eq. (13):

SI(i) = (b(i) − a(i)) / max{a(i), b(i)}    (13)

Here, a(i) is the average dissimilarity of i with all other data within the same cluster; smaller values of a(i) mean that i has been assigned to the right cluster. Also, b(i) is the lowest average dissimilarity of i to any other cluster of which it is not a member. The Beta index (β) is defined as the ratio of total variation to within-cluster variation, as indicated in Eq. (14):

β = N / M,  where  N = Σ_{i=1..c} Σ_{j=1..n_i} (x_{ij} − v̄)²  and  M = Σ_{i=1..c} Σ_{j=1..n_i} (x_{ij} − v_i)²    (14)

In this equation, ni is the number of objects in the ith cluster (i = 1, 2, …, c), while N represents the total variation and M the within-cluster variation. FWCMR checks various AC and DR values; each combination of AC and DR values is called a scenario. The mentioned cluster validity indexes of the final results are calculated for each of these scenarios. The winner clustering among all possible scenarios is the one which gains the optimum values of DI, SI, β and DBI. In order to find the winner, the ranks of DI, SI, β and DBI of all scenarios are calculated. The first scenario is the present winner by default; the next scenario replaces it if it has a positive rank distance from the present winner, otherwise the present winner is kept. The distance of ranks is calculated as shown in Eq. (15):

D = Σ_{i=1..4} [R_new(clust-valid)_i − R_old(clust-valid)_i]    (15)

Here D is the distance of ranks, R is the rank of a scenario and (clust-valid)i is the ith cluster validity index, where i ranges over DI, SI, β and DBI. If the distance D equals zero, the scenario with the higher SI rank becomes the present winner. This checking continues for the rest of the scenarios to find the final winner among all. In this way, the best clustering is obtained with optimum values of DI, SI, β and DBI.

5. Results and discussion

It was mentioned that, regarding the importance of the conditions in the similarity assessment of two genes' behavior, the columns of the gene expression array should be rearranged. Weights are then assigned to each condition according to its order. It was mentioned in Section 2.4.1 that AC and DR characterize the OWA operators, where (0 ≤ AC ≤ 1) and (DR > 0); AC = 0 returns the min value while AC = 1 returns the max value, and other AC values behave between these two extremes. In the present research, among all possible values for AC, four values (0.0, 0.25, 0.50, 1.0) are assumed, which approximately represent the different possible behaviors caused by AC. On the other hand, DR is always non-negative, where near-zero values mean least entropy and vice versa. DR is bounded, but the upper bound depends on the number of conditions; DR reaches its maximum when all weights are equal (Wi = 1/n, 1 ≤ i ≤ n). In order to better visualize DR, it is normalized by dividing all DR values by the maximum, DRnormal = DR / Max(DR), so that DR also belongs to the interval [0, 1]. Just as with AC, four different values (0.0, 0.25, 0.50, 1.0) have been considered for DR. Finally, sixteen (4 × 4 = 16) possible combinations of AC and DR are obtained.
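The scenario voting of Eq. (15) can be sketched as follows, assuming each scenario's four validity indexes have already been converted to ranks where a higher rank is better (so DBI, where lower raw values are better, must be ranked in reverse; all helper names are ours):

```python
def pick_winner(scenarios):
    """scenarios: list of dicts holding precomputed ranks for the four
    validity indexes, e.g. {"DI": 3, "DBI": 1, "beta": 2, "SI": 4}.
    Implements the running comparison of Eq. (15) with the SI tie-break."""
    keys = ("DI", "DBI", "beta", "SI")
    winner = scenarios[0]          # the first scenario is the default winner
    for cand in scenarios[1:]:
        d = sum(cand[k] - winner[k] for k in keys)   # Eq. (15)
        if d > 0 or (d == 0 and cand["SI"] > winner["SI"]):
            winner = cand
    return winner
```

Running this over the 16 (AC, DR) scenarios yields the optimum weight vector for a given dataset.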
Among these 16 possible conditions, five of the more distinct ones are depicted in Table 4 and Table 5 to illustrate how different values of AC and DR affect the weight vectors and consequently the clustering results. Table 4 shows the validity indexes of results from the Human Fibroblasts Serum dataset, while Table 5 contains the validity indexes obtained from the Rat CNS dataset. It is obvious that different weight vectors affect the final clustering quality. In our approach
Table 4. Validity index of clustering results obtained from Human Fibroblasts Serum (Iyer et al., 1999) by varying DR and AC.

DR    AC    DI     DBI    β       SI
0     0     0.385  1.661  32.951  0.305
0     1     0.391  1.409  37.206  0.314
1     1/2   0.405  1.521  38.107  0.330
1     1/4   0.472  0.882  46.871  0.442
1/2   1/2   0.410  0.904  45.533  0.426

Table 7. Validity index of different clustering methods obtained from Human Fibroblasts Serum (Iyer et al., 1999).

Method  DI     DBI    β       SI
RFC     0.450  0.903  40.784  0.364
MBC     0.432  0.911  40.221  0.359
MSBC    0.464  0.885  43.367  0.397
HRSC    0.458  0.897  42.424  0.381
FWCMR   0.483  0.869  46.879  0.452
Table 5. Validity index of clustering results obtained from the Rat CNS dataset (Wen et al., 1998) by varying DR and AC.

DR    AC    DI     DBI    β      SI
0     0     0.210  0.822  4.615  0.417
0     1     0.273  0.794  4.336  0.422
1     1/2   0.295  0.804  4.510  0.430
1     1/4   0.308  0.824  4.884  0.471
1/2   1/2   0.315  0.793  4.879  0.480
Table 6. Validity index of different clustering methods achieved from the Arabidopsis Thaliana dataset (Maulik et al., 2009).

Method  DI     DBI    β      SI
RFC     0.579  0.945  7.612  0.359
MBC     0.586  0.937  7.620  0.368
MSBC    0.610  0.918  7.688  0.399
HRSC    0.592  0.926  7.680  0.395
FWCMR   0.604  0.904  7.854  0.412
we achieve the optimum weight vectors by modifying AC and DR for each dataset through a voting mechanism. Our algorithm begins with equal default weights for all conditions. The dataset is partitioned into chunks, each assigned to a processing node; FWCMR is then run and the final results are attained. The four validity indexes DI, DBI, β and SI are calculated and recorded in the history. By recursively modifying AC and DR, the winner among the records is chosen as the optimum weight vector. In the present research we assumed four values for AC and four for DR, so 16 distinct conditions have been checked to reach the optimum results. Five sample conditions of AC and DR out of the total 16 are presented in Table 4 and Table 5. In order to evaluate the proposed method, FWCMR is compared with four recent gene expression clustering algorithms: Rough-Fuzzy Clustering (RFC) (Maji & Paul, 2013), Modelling Based Clustering (MBC) (Blomstedt, Dutta, Seth, Brazma, & Kaski, 2016), Multiobjective Symmetry Based Clustering (MSBC) (Saha, Ekbal, Gupta, & Bandyopadhyay, 2013) and Hessian Regularization Based Symmetric Clustering (HRSC) (Ma, Hu, He, & Jiang, 2016). The four validity indexes DI, DBI, β and SI have been calculated for each of these methods in clustering four different gene expression datasets. Table 6, Table 7 and Table 8 respectively demonstrate the validity indexes achieved from the Arabidopsis Thaliana, Human Fibroblasts Serum and Rat CNS datasets. For all of these datasets, our approach shows relatively superior results. In order to test the scalability of the algorithm when the data scales up to high volume, we examined it with the Schizosaccharomyces pombe dataset (E-MEXP-54 to E-MEXP-64) (Rustici et al., 2004). This dataset includes 10 different expression matrices of the same genes under various conditions.
In order to compare the results of other works with ours, we ran them one by one on each of these distinct
Table 8. Validity index of different clustering methods achieved from the Rat CNS dataset (Wen et al., 1998).

Method  DI     DBI    β      SI
RFC     0.291  0.814  4.351  0.439
MBC     0.288  0.822  4.211  0.425
MSBC    0.302  0.803  4.721  0.477
HRSC    0.313  0.794  4.845  0.482
FWCMR   0.320  0.782  4.933  0.496
Table 9. Average validity index on Schizosaccharomyces pombe (E-MEXP-54 to E-MEXP-64) (Rustici et al., 2004).

Method  DI     DBI    β       SI
RFC     0.271  0.713  10.529  0.300
MBC     0.286  0.722  10.227  0.296
MSBC    0.295  0.697  11.362  0.311
HRSC    0.314  0.683  14.227  0.324
FWCMR   0.501  0.313  25.631  0.503
datasets addressed in Rustici et al. (2004) in a sequential manner. Table 9 indicates a drastic improvement in the validity indexes achieved by FWCMR in comparison with recent similar methods (RFC, MBC, MSBC and HRSC). For testing the robustness of FWCMR in the presence of outliers, some of the expression values of the Rat CNS and Arabidopsis Thaliana datasets were randomly biased to a value between −8% and +8% of their real content. The clustering algorithms (RFC, MBC, MSBC, HRSC and FWCMR) were then run again on the artificially manipulated datasets, and the validity indexes of the resulting clusters (DI, DBI, β and SI) were obtained after the injection of the imposed outliers. The change of the validity indexes in response to the injection of artificial outliers is depicted in Fig. 5; the closer the change remains to zero, the more robust the approach. Fig. 5 shows that FWCMR exhibits relatively superior robustness to the effects of outliers in comparison with other recent approaches. The small-scale experiments show the efficiency of the proposed approach, and since FWCMR works on Hadoop, it can run on any data that Hadoop can handle. In order to evaluate the scalability of the proposed algorithm, another experiment has been carried out. A synthetic dataset (SD) was generated as a simulation of a real gene expression microarray, but for a large number of genes. The dataset has been generated for several numbers of genes: the initial number is 1000 genes, and each time the number of genes becomes ten times larger, up to 1,000,000 genes at the maximum level. The cluster validity indexes for each of these SDs have been calculated and depicted in Fig. 6(a). As depicted in Fig. 6(a), the quality of the clustering results decreases as the size of the data increases. This is predictable and caused by the nature of the noise propagated into the data when it is synthetically generated.
It is also caused by the natural overlap of clusters when the dataset scales up. However, the contribution of FWCMR is shown when the size has become
Fig. 5. Trend of validity index changes after injection of outlier data into the datasets. (a) Change in Rat CNS dataset (Wen et al., 1998); (b) Change in Arabidopsis Thaliana dataset (Maulik et al., 2009).
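The perturbation used for the robustness experiment of Fig. 5 (randomly biasing expression values by between −8% and +8% of their real content) can be reproduced with a sketch like this; the affected fraction and the seed are our illustrative choices, not stated in the paper:

```python
import numpy as np

def inject_outliers(matrix, frac=0.05, seed=0):
    """Return a copy of `matrix` in which `frac` of the cells are biased
    by a random factor in [-8%, +8%] of their original value."""
    rng = np.random.default_rng(seed)
    noisy = matrix.astype(float).copy()
    n = int(frac * noisy.size)
    idx = rng.choice(noisy.size, size=n, replace=False)
    bias = rng.uniform(-0.08, 0.08, size=n)
    noisy.flat[idx] *= (1.0 + bias)
    return noisy
```

Comparing the validity indexes before and after such an injection gives the per-method change plotted in Fig. 5.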
Fig. 6. (a) Cluster validity index for various synthetic data (SD). SD1 = 1000 genes, SD2 = 10,000 genes, SD3 = 100,000 genes, SD4 = 1,000,000 genes. (b) The speedup ratio for various numbers of processing nodes on the synthetic dataset SD4.
N times larger while the precision has not decreased even N/2 times; in other words, the decrease rate in cluster quality is much less than the increase rate in data volume. This means that the proposed method keeps the precision and quality of clustering reasonable when the data size scales up, and resists possible outlier data. In Fig. 6(b), FWCMR has been applied to the SD4 dataset, which consists of one million expression profiles of various genes. As expected, speedup is obtained by increasing the number of processing nodes for the same dataset. However, it is not n times faster when the number of nodes becomes n times larger; the main reason is that in large-scale processing, because of the infrastructure and the intrinsic serial procedures for communication between nodes, the speedup is not drastic or near ideal. Still, convenience, reliability, adaptation to large volumes and low-cost processing of such large data are the positive characteristics of MapReduce in comparison with serial or classic parallel approaches. With the help of FWCMR, a biologist can refine large cross-species datasets in reasonable time and at reasonable cost. The precision of FWCMR regarding the cluster validity indexes has been compared with two well-known MapReduce-based clustering approaches, the K-means addressed in Boeva et al. (2014) and the DBSCAN-MR introduced in He et al. (2011), in clustering the SD4 dataset. As depicted in Fig. 7(a), FWCMR achieved higher cluster validity than these recent approaches. Fig. 7(b) indicates that the speedup of these approaches does not differ drastically; this is caused by the characteristics of MapReduce, which impose some serial bottlenecks and infrastructure barriers. The scalability of the proposed approach is depicted in Fig. 8: as the number of computing machines increases, the computation time decreases.
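This sub-linear scaling is exactly what Amdahl's law predicts; a toy calculation with an assumed serial fraction illustrates the bound:

```python
def amdahl_speedup(serial_fraction, n_nodes):
    """Upper bound on speedup when `serial_fraction` of the work cannot be
    parallelized (e.g. the MapReduce shuffle and job tracking)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_nodes)
```

For instance, with 10% serial work, 8 nodes give at most about 4.7x, and no number of nodes can exceed 10x.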
Although it is less than the ideal speedup, it is much better than current serial approaches. Ideal speedup means that by increasing the number of processing nodes N times, the running time becomes 1/N. According to Amdahl's law, the speedup is always limited by the serial parts of the algorithm regardless of the increase in processing nodes, and MapReduce has its own intrinsic serial bottlenecks. It is up to the user of the algorithm to trade off the cost of adding processing nodes against the speedup, given these limitations. Fig. 8(b) shows the consequences of increasing the processing nodes on the time share of the different steps of the proposed algorithm. Similarity-MR and Subclustering-MR show drastic changes in response to increased processing power, indicating higher adaptation to the distributed essence of MapReduce. One of the main bottlenecks of algorithms developed for Hadoop is the I/O bottleneck; FWCMR needs the least possible reads and writes of data to the main job tracker in order to avoid it.

5.1. Discussion

FWCMR is scalable and easily adapts to recent cloud computing frameworks. By its characteristics, FWCMR considers the whole data as an integrated object instead of a serial discrete sequence of objects. In a serial single-machine algorithm, only part of the data is present at a time, and the rest is not accessible for a comprehensive analysis; clusters formed in this way only partially contain the globally optimal biological clusters and cannot be reliable representatives of a real group of co-functioning genes. Hence, as shown, FWCMR improves various validity indexes and the quality of clustering, and it especially outperforms serial single-machine approaches when the volume of input data increases drastically. The scalability of a distributed algorithm like FWCMR is much higher compared to parallel algorithms that must run on a single machine with high memory and computation power. FWCMR
Fig. 7. (a) The cluster validity of K-means-MR, DBSCAN-MR and FWCMR in clustering SD4 (b) Speedup in clustering SD4 for various number of processing nodes.
Fig. 8. (a) Relation of speedup and number of processing nodes. (b) Effect of increasing processing nodes on different phases of proposed method.
can be run either on a network of commodity hardware or on recent cloud processing infrastructures; accordingly, speedup can be achieved simply by adding distributed processing nodes. As a density-based approach, and owing to the effect of the weights, FWCMR is robust to random outliers: even after deliberate noise injection, its results remain more satisfactory than those of recent similar works. Being density based, FWCMR also inherits the general advantages of density-based clustering over partition-based and hierarchical approaches.

6. Conclusion

Given the recent growth in the computing power of cloud infrastructures, the need for novel parallel and distributed algorithms is greater than ever, and the proposed clustering method is a direct response to this need. The contribution of this paper lies in a new soft clustering algorithm with an innovative fuzzy weighted similarity measure designed specifically for gene expression. FWCMR can run on any of the latest cloud computing infrastructures; it operates fully distributed, free of serial-processing bottlenecks, requires no prior knowledge of the number of clusters or the data distribution, and its final results do not depend on the starting conditions. With the proposed method, the expression conditions can affect the clustering results either equally or unequally. A voting mechanism finds the best weight vector for each dataset; in other words, the establishment of the weights is data driven and the unsupervised character of the clustering is preserved, although the weights can also be set manually by an expert. Building on the weighted Spearman measure and density-based clustering, three practical experiments were carried out. The validity indexes of the clusters attained by FWCMR were superior to those of several recently published clustering algorithms (RFC, MBC, MSBC and HRSC). In the second experiment, the scalability of FWCMR was evaluated on a considerably larger chunk of data from the same experiment; FWCMR showed better efficiency not only in computation time and implementation complexity but also in final cluster validity, compared with similar methods. Finally, robustness to the injection of artificial outliers was assessed, and FWCMR showed the fewest side effects among comparable works. In future work, further investigation of FWCMR in other settings is recommended. For instance, Gene Ontology (GO) could be used to establish the weights according to the biological relevance of genes, and other novel big data frameworks such as Apache Spark could enhance the distributed iterative processing capabilities of FWCMR.
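The weighted Spearman idea can be illustrated with a minimal sketch. The paper's actual measure combines ordered weighted averaging with the Spearman coefficient as defined earlier in the article; here, purely as an illustrative assumption, the weights enter as a weighted Pearson correlation computed on rank vectors, so that some expression conditions count more than others (function names are hypothetical):

```python
import numpy as np

def ranks(x):
    """Ranks 1..n of the values in x (ties broken by position)."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(1, len(x) + 1)
    return r

def weighted_spearman(x, y, w):
    """Weighted correlation of the rank vectors of x and y; the
    weight vector w emphasizes some expression conditions over others."""
    rx, ry = ranks(x), ranks(y)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                      # normalize weights to sum to 1
    mx, my = np.dot(w, rx), np.dot(w, ry)  # weighted mean ranks
    cov = np.dot(w, (rx - mx) * (ry - my))
    return cov / np.sqrt(np.dot(w, (rx - mx) ** 2) *
                         np.dot(w, (ry - my) ** 2))

# Two genes with identical expression ordering are perfectly correlated:
g1 = [0.1, 0.4, 0.9, 1.6]
g2 = [2.0, 3.5, 5.0, 8.0]
print(weighted_spearman(g1, g2, [1, 1, 1, 1]))   # close to 1.0
```

With uniform weights this reduces to the ordinary Spearman coefficient on tie-free data; a non-uniform w is where a data-driven or expert-chosen weight vector would plug in.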
Acknowledgements

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.eswa.2017.08.051.