
Accepted Manuscript

Title: New Heuristics for Clustering Large Biological Networks
Authors: Md. Kishwar Shafin, Kazi Lutful Kabir, Iffatur Ridwan, Tasmiah Tamzid Anannya, Rashid Saadman Karim, Mohammad Mozammel Hoque, M. Sohel Rahman

PII:       S1476-9271(15)30015-3
DOI:       http://dx.doi.org/10.1016/j.compbiolchem.2015.05.007
Reference: CBAC 6432

To appear in: Computational Biology and Chemistry

Received date: 12-4-2015
Revised date:  17-5-2015
Accepted date: 28-5-2015

Please cite this article as: Md. Kishwar Shafin, Kazi Lutful Kabir, Iffatur Ridwan, Tasmiah Tamzid Anannya, Rashid Saadman Karim, Mohammad Mozammel Hoque, M. Sohel Rahman, New Heuristics for Clustering Large Biological Networks, Computational Biology and Chemistry (2015), http://dx.doi.org/10.1016/j.compbiolchem.2015.05.007

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


New Heuristics for Clustering Large Biological Networks

Md. Kishwar Shafin1, Kazi Lutful Kabir1, Iffatur Ridwan1, Tasmiah Tamzid Anannya1, Rashid Saadman Karim1, Mohammad Mozammel Hoque1 and M. Sohel Rahman2

1 Department of CSE, MIST, Mirpur Cantonment, Dhaka-1216, Bangladesh
2 AℓEDA Group, Department of CSE, BUET, Dhaka-1215, Bangladesh


Abstract. Traditional clustering algorithms exhibit certain limitations when used to uncover functional modules in large biological networks: they are either slow in execution or unable to cluster such networks at all. As a result, faster methodologies are always in demand. In this context, a number of more efficient approaches have been introduced, most of which are based on greedy techniques. The clusters produced by any such approach depend heavily on the underlying heuristics, and better heuristics can be expected to yield better results. In this paper, we propose two new heuristics and incorporate them into a recent celebrated greedy clustering algorithm named SPICi. We have implemented three new variants and conducted extensive experiments to analyze their performance. The results are found to be promising.


Keywords: Algorithms, Biological Networks, Clustering, Heuristics.

1 Introduction

Clustering is an important tool in biological network analysis. However, traditional clustering algorithms do not perform well in the analysis of large biological networks, being either extremely slow or simply unable to cluster them [24]. On the other hand, recent advances in state-of-the-art technologies, along with computational predictions, have resulted in large-scale biological networks for numerous organisms [10]. As a result, faster clustering algorithms are of tremendous interest. A number of clustering algorithms work well on small to moderate biological networks. For instance, several algorithms in the literature can guarantee that the clusters they generate have specific properties (e.g., Cfinder [5], [22], [12], [15]). They are, however, computationally very intensive and hence do not scale well as the size of the biological network increases. Algorithms like WPNCA [23] and ClusterONE [21] are newer approaches that handle weighted biological networks of moderate size, but in many cases they fail or require a large amount of time to cluster.



To this end, some more efficient approaches have been introduced, most of which are based on greedy techniques (e.g., SPICi [17], DPClus [7]). Algorithms like MGclus [14] are also relatively well suited to clustering large biological networks with dense neighborhoods. In most cases, the clusters produced by greedy approaches depend heavily on the heuristic(s) employed, and a better heuristic can be expected to yield improved results. This motivates us to search for better heuristics for a well-performing greedy approach, so as to devise an even better clustering algorithm that not only runs faster but also provides quality solutions. SPICi [17] is a relatively recent greedy technique that can cluster large biological networks. After carefully studying and analyzing the implementation of SPICi, we have found that some essential modifications in the heuristics employed can bring a drastic change in cluster quality. In this paper, we propose a couple of new heuristics for SPICi with the aim of devising an even better clustering algorithm. We have implemented three new versions of SPICi and conducted extensive experiments to analyze the performance of our new implementations. The experimental results are promising with respect to both speed and accuracy.

2 Background


We start this section with preliminaries on some related notions, followed by a discussion of the algorithmic framework of SPICi. We also briefly review the heuristics used in SPICi.

A biological network is modeled as an undirected graph G = (V, E) where each edge (u, v) ∈ E has a confidence score 0 < w_{u,v} ≤ 1, also called the weight of the edge. We set w_{u,v} = 0 if the two vertices u, v have no edge between them. The weighted degree of a vertex u, denoted d_w(u), is the sum of the confidence scores of all of its incident edges, i.e.,

d_w(u) = \sum_{(u,v) \in E} w_{u,v}.

Based on the confidence scores (weights) of the edges, we can define the density of a set of vertices S ⊆ V. The density D(S) of a set S ⊆ V is the sum of the weights of the edges that have both end vertices in S, divided by the total number of possible edges in S. Formally,

D(S) = \frac{\sum_{u,v \in S} w_{u,v}}{|S| \times (|S| - 1) / 2}.

For a vertex u and a set S ⊆ V, the support of u by S, denoted S(u, S), is the sum of the confidence scores of the edges of u that are incident to vertices in S. Formally,

S(u, S) = \sum_{v \in S} w_{u,v}.
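To make these definitions concrete, the following small Python sketch computes the weighted degree, density and support on a toy weighted graph. This is illustrative only (the paper's implementation is in C++), and the edge weights below are hypothetical.

```python
from itertools import combinations

# toy weighted, undirected network: w[(u, v)] is the confidence score of edge (u, v)
# (hypothetical weights, for illustration only)
w = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.7, ("c", "d"): 0.3}

def weight(u, v):
    """Confidence score of edge (u, v); 0 if the edge is absent."""
    return w.get((u, v), w.get((v, u), 0.0))

def weighted_degree(u, vertices):
    """d_w(u): sum of confidence scores of all edges incident to u."""
    return sum(weight(u, v) for v in vertices if v != u)

def density(S):
    """D(S): total edge weight inside S divided by the number of possible edges in S."""
    total = sum(weight(u, v) for u, v in combinations(S, 2))
    possible = len(S) * (len(S) - 1) / 2
    return total / possible if possible else 0.0

def support(u, S):
    """S(u, S): sum of confidence scores of edges from u to vertices in S."""
    return sum(weight(u, v) for v in S)

vertices = {"a", "b", "c", "d"}
print(weighted_degree("c", vertices))   # 0.8 + 0.7 + 0.3 = 1.8
print(density({"a", "b", "c"}))         # (0.9 + 0.8 + 0.7) / 3 = 0.8
print(support("d", {"a", "b", "c"}))    # only edge (c, d) counts: 0.3
```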


Given a weighted network, the goal of SPICi is to output a set of disjoint dense subgraphs. SPICi uses a greedy heuristic approach that builds one cluster at a time, expanding each cluster from an original protein seed pair. SPICi depends on two parameters, namely the support threshold Ts and the density threshold Td; their use will become clear shortly. We now briefly review how SPICi employs its heuristic strategies: it first selects two seed nodes and then attempts to expand the cluster.


Seed Selection. While selecting the seed vertices, SPICi uses a heuristic. Very briefly, it first chooses a vertex u in the network that has the highest weighted degree. It then divides the neighboring vertices of u into five bins according to their edge weights, namely (0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8] and (0.8, 1.0], and the vertex with the highest weighted degree belonging to the highest non-empty bin is chosen as the second seed, v. The edge (u, v) is referred to as the seed edge.
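As an illustration, here is a minimal Python sketch of this seed-pair selection, reusing the weight and weighted_degree helpers from the sketch above. It is a simplification of the actual Search procedure (which also maintains and decrements the degrees as clusters are removed), and it assumes u has at least one neighbor.

```python
import math

def first_seed(vertices):
    """First seed: the vertex with the highest weighted degree."""
    return max(vertices, key=lambda u: weighted_degree(u, vertices))

def second_seed(u, vertices):
    """Second seed: bin u's neighbours by edge weight into (0,0.2], ..., (0.8,1.0]
    and pick, from the highest non-empty bin, the neighbour with the highest weighted degree."""
    def bin_index(x):                      # 0..4 for the five weight ranges
        return min(4, math.ceil(x / 0.2) - 1)
    neighbours = [v for v in vertices if weight(u, v) > 0]
    best = max(bin_index(weight(u, v)) for v in neighbours)
    in_best_bin = [v for v in neighbours if bin_index(weight(u, v)) == best]
    return max(in_best_bin, key=lambda v: weighted_degree(v, vertices))

u = first_seed(vertices)
v = second_seed(u, vertices)   # (u, v) is the seed edge
```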

Cluster Expansion. For cluster expansion, SPICi follows a procedure similar to that of [6]. It maintains a vertex set S for the cluster, initially containing the two selected seed vertices, and builds one cluster at a time. In each expansion step, SPICi searches for the vertex u with the maximum S(u, S) among all unclustered vertices that are adjacent to a vertex in S. If S(u, S) is smaller than a support threshold (namely Ts × |S| × D(S)), u is not added and S is output as a cluster. Otherwise, the density of S ∪ {u} is computed; if it turns out to be smaller than the density threshold Td, SPICi again does not include u in the cluster and outputs S. Otherwise, u is added to S and the expansion continues.

3 Proposed Heuristics


The heuristics employed by SPICi are based on the observation that two vertices are more likely to be in the same cluster if the weight of the edge between them is higher [17]. The two heuristics SPICi employs are implemented in the form of two procedures, namely Search and Expand: in the Search procedure, the node with the highest weighted degree is chosen as the seed, and in Expand, the node with the highest support is selected as the candidate to be added to the cluster. In this paper, we propose two new heuristics and combine them with the heuristics of SPICi to obtain three new versions of SPICi, which we refer to as SPICi1+, SPICi2+ and SPICi12+. To be specific, we employ one new heuristic to modify the Expand procedure of SPICi, obtaining Expand+, and another new heuristic to modify the Search procedure, obtaining Search+. In SPICi1+ we combine Expand+ with Search, while in SPICi2+ we combine Expand with Search+; in SPICi12+ both procedures are replaced, i.e., Search+ and Expand+ are used together. In essence, the heuristic in Expand+ chooses, as the candidate to join the cluster, the node with the highest average edge weight with respect to the current cluster, and the heuristic in Search+ chooses, as the first seed, the node with the highest weighted degree of neighbors.


Fig. 1. An example to illustrate the necessity of a new measure

In what follows, we describe our heuristic strategies along with the motivation and rationale behind them.

3.1 Average Edge Weight

Consider Figure 1 and assume that the current cluster set is S = {1, 2, 3} and the set of candidate nodes is {4, 5}. The goal at this point is to expand the current cluster. SPICi calculates S(4, S) = 1.4 and S(5, S) = 1.5 and, since S(5, S) > S(4, S), it will include node 5 in the cluster. However, two vertices are more likely to be in the same module if the weight of the edge between them is higher [17]. In Figure 1, the average weight with which node 4 is connected to nodes 2 and 3 is higher than that with which node 5 is connected to nodes 1, 2 and 3. So, a cluster comprising the set {4, 2, 3} seems more meaningful than a cluster comprising the set {5, 1, 2, 3}; hence, although node 4 is not connected to node 1, it seems more useful to include node 4 in the current cluster. To make a better decision, we introduce a new heuristic measure which we refer to as the Average Edge Weight. For each node u and a set S ⊆ V, let Q ⊆ S be the set of vertices u is connected with. The average edge weight of u by S is defined as

AverageEdgeWeight(u, S) = \frac{\sum_{v \in Q} w_{u,v}}{|Q|}.


Fig. 2. Another illustration denoting the necessity of a new measure


Let us now refer back to the scenario illustrated in Figure 1. Using our new heuristic measure, we calculate AverageEdgeWeight(4, S) = 0.7 and AverageEdgeWeight(5, S) = 0.5. Since AverageEdgeWeight(4, S) > AverageEdgeWeight(5, S), in contrast to SPICi, we choose node 4 as desired. In the Expand procedure, as we move from a dense region to a less dense region, it is more sensible to select nodes having higher average edge weight. This is supported by the illustrations of biological networks in Figures 3 and 4, where we observe that nodes in less dense areas have more edge connectivity than nodes in denser areas; despite having less connectivity, the edges of nodes in dense areas carry higher weights. This encourages us to identify dense-area nodes using the Average Edge Weight, as it focuses on the edge weights. The modified Expand procedure, i.e., Expand+, is presented as Algorithm 1. To observe how this heuristic affects clustering, we have searched for candidate nodes that are selected into the cluster with a higher AverageEdgeWeight but a lower support value than other candidate nodes. The results are shown in Table 1, where we can see that this case comes up frequently.
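To make the comparison concrete, the following Python sketch contrasts the two selection criteria on the Figure 1 scenario. Since the individual edge weights are only shown in the figure, the weights below are assumed values chosen to be consistent with the support and average-edge-weight values quoted above.

```python
# Cluster S = {1, 2, 3}; candidates 4 and 5.
# Assumed edge weights, consistent with S(4,S)=1.4, S(5,S)=1.5,
# AverageEdgeWeight(4,S)=0.7 and AverageEdgeWeight(5,S)=0.5.
w = {(4, 2): 0.7, (4, 3): 0.7,
     (5, 1): 0.5, (5, 2): 0.5, (5, 3): 0.5}

def weight(u, v):
    return w.get((u, v), w.get((v, u), 0.0))

def support(u, S):
    return sum(weight(u, v) for v in S)

def average_edge_weight(u, S):
    Q = [v for v in S if weight(u, v) > 0]      # neighbours of u inside S
    return sum(weight(u, v) for v in Q) / len(Q) if Q else 0.0

S = {1, 2, 3}
print(max([4, 5], key=lambda u: support(u, S)))             # 5  (Expand's choice)
print(max([4, 5], key=lambda u: average_edge_weight(u, S))) # 4  (Expand+'s preference)
```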

3.2 Weighted Degree of Neighbors

For each vertex u, the weighted degree of its neighbors, denoted by A_w(u), is simply the sum of the weighted degrees of all of its neighbors:

A_w(u) = \sum_{(u,v) \in E} d_w(v).


Table 1. Candidate nodes affected by the AverageEdgeWeight heuristic

                      Biogrid Yeast  STRING Yeast  Biogrid Human  STRING Human
Vertices                   5361          6371           7498          18670
Affected Candidates        2421          1752           1907           5134

Algorithm 1 Expand+(u, v)

initialize the cluster S = {u, v}
initialize CandidateQ to contain vertices neighboring u or v
initialize AverageEdgeWeightHeap to contain vertices neighboring u or v
while CandidateQ is not empty do
    find the largest non-empty bin of AverageEdgeWeightHeap, where the nodes in CandidateQ are binned by AverageEdgeWeight(t, S) into the ranges (0.8, 1], (0.6, 0.8], (0.4, 0.6], (0.2, 0.4], (0.0, 0.2]
    extract t from CandidateQ with the highest support(t, S) such that t belongs to the largest non-empty bin of AverageEdgeWeightHeap
    if support(t, S) ≥ Ts × |S| × density(S) and density(S ∪ {t}) > Td then
        S = S ∪ {t}
        increase the support for vertices connected to t in CandidateQ
        for all unclustered vertices adjacent to t, insert them into CandidateQ if not present
        for all unclustered vertices adjacent to t, update AverageEdgeWeightHeap
    else
        break from loop
    end if
end while
return S
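The key modification relative to Expand is the two-level candidate choice: candidates are first grouped into AverageEdgeWeight bins, and support is only used to choose within the highest non-empty bin. A minimal Python sketch of that selection step, reusing the average_edge_weight and support helpers from the toy example above (not the paper's heap-based C++ implementation):

```python
import math

def aew_bin(x):
    """Bin index 0..4 for a value in (0,0.2], (0.2,0.4], (0.4,0.6], (0.6,0.8], (0.8,1]."""
    return min(4, math.ceil(x / 0.2) - 1)

def pick_candidate(candidates, S):
    """Expand+ selection step: among the candidates falling in the highest non-empty
    AverageEdgeWeight bin, pick the one with the largest support."""
    best_bin = max(aew_bin(average_edge_weight(u, S)) for u in candidates)
    in_bin = [u for u in candidates if aew_bin(average_edge_weight(u, S)) == best_bin]
    return max(in_bin, key=lambda u: support(u, S))

print(pick_candidate([4, 5], {1, 2, 3}))   # 4: the 0.7 bin beats the 0.5 bin
```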


Fig. 3. Visualization of STRING Human Network [4]


To illustrate the usefulness of this heuristic measure, let us consider Figure 2. While selecting the first seed, SPICi effectively groups together the nodes with similar (rounded) weighted degrees. In Figure 2, SPICi would have two groups of nodes, {1, 2, 3, 5} and {4}: node 1 has a weighted degree of 2.4, which is rounded off to 2, and similarly the weighted degrees of nodes 2, 3 and 5 are 1.5, 2 and 1.5, which also round off to 2; since these nodes have the same rounded degree, they fall in the same group, while node 4 has a weighted degree of 1.4, which rounds off to 1 and creates a new group. SPICi will then select a node from the highest-weight group. Which node is selected cannot be predicted, because there is no sequencing or sorting in SPICi; it depends on how the nodes are traversed and entered into the group. Suppose node 5 is selected. Clearly, node 5 is in a weak neighborhood, i.e., it does not have a dense group around it. If, instead, we use the weighted degree of all the neighbors of each node, then node 1 has the highest value (A_w(1) = 2 + 1.4 + 1.5 + 1.6 = 6.5). Choosing the node with the highest weighted degree of neighbors is likely to enhance the probability of selecting the most promising node as the first seed: it ensures that we select a node from a dense neighborhood, so the expansion process always starts in a dense population. From the visualization of significant portions of the STRING Human and STRING Yeast networks (Figures 3 and 4) we can infer that nodes in less dense areas tend to be connected to many nodes from different dense areas, which may help them get selected as the seed node; however, the seed node should be selected from a dense area to uncover a quality functional module. The modified Search procedure, i.e., Search+, is provided as Algorithm 2.
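A small Python sketch of this seed-scoring idea, reusing the weight and weighted_degree helpers introduced earlier (hypothetical weights; not the exact Figure 2 network, whose weights are only shown graphically):

```python
def weighted_degree_of_neighbors(u, vertices):
    """A_w(u): the sum of the weighted degrees of all neighbours of u."""
    return sum(weighted_degree(v, vertices) for v in vertices if weight(u, v) > 0)

def first_seed_plus(vertices):
    """Search+ seed scoring: pick the vertex whose neighbourhood is heaviest overall."""
    return max(vertices, key=lambda u: weighted_degree_of_neighbors(u, vertices))

# With the toy graph from the first sketch, A_w("c") = d_w("a") + d_w("b") + d_w("d") = 3.6,
# the largest value, so "c" would be chosen as the first seed.
print(first_seed_plus(vertices))   # "c"
```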


Fig. 4. Visualization of STRING Yeast Network [4]


To analyze this a bit further, we have examined the topology of the input biological networks and calculated how many nodes lie in a weak neighborhood. We have searched for nodes that have strictly higher weighted degrees but lower neighbor weights than the nodes adjacent to them. Any node satisfying this condition is in a weak neighborhood despite having a high weighted degree, which may encourage SPICi to make a wrong selection.

Table 2. Weak neighborhood topology analysis

                        Biogrid Yeast  STRING Yeast  Biogrid Human  STRING Human
Vertices                     5361          6371           7498          18670
In weak neighborhood         1797          3650            994           9320

From these results we see that almost half of the nodes of the STRING networks are in weak neighborhoods. This suggests that the heuristics we have proposed will make a better choice in many situations. The presence of such topology also indicates the necessity of improved heuristics for greedy clustering algorithms.

4 Computational Complexity Analysis

In this section we analyze the time complexity of our new implementations based on the newly proposed heuristics. In particular, we implement three improved versions of SPICi, namely SPICi1+, SPICi2+ and SPICi12+. More specifically, in SPICi1+ we plug Expand+ into SPICi instead of Expand; in SPICi2+ we plug in Search+ instead of Search; and in SPICi12+ we plug in both Expand+ and Search+ instead of Expand and Search, respectively. Table 3 summarizes these three implementations along with the original version, i.e., SPICi.


Algorithm 2 Search+

initialize DegreeQ to be V
while DegreeQ is not empty do
    extract u from DegreeQ with the largest weighted degree of neighbors
    if u has adjacent vertices in DegreeQ then
        find from u's adjacent vertices the second seed protein v
        S = Expand(u, v)
    else
        S = {u}
    end if
    V = V − S
    delete all vertices in S from DegreeQ
    for each vertex t in DegreeQ that is adjacent to a vertex in S, decrement its weighted degree by support(t, S)
end while
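A compact Python rendering of this outer loop, for illustration only: it uses a plain max() scan instead of the Fibonacci heap of the actual implementation, it recomputes the neighbor scores over the remaining unclustered vertices rather than decrementing them, and it assumes an expand(u, v) function implementing Expand or Expand+ (reusing the helpers from the earlier sketches).

```python
def search_plus(vertices, expand):
    """Search+ outer loop (sketch): repeatedly seed from the vertex with the largest
    weighted degree of neighbours, grow a cluster, and remove it from the network."""
    clusters = []
    unclustered = set(vertices)
    while unclustered:
        u = max(unclustered, key=lambda x: weighted_degree_of_neighbors(x, unclustered))
        if any(weight(u, v) > 0 for v in unclustered if v != u):
            v = second_seed(u, unclustered)   # as in SPICi's seed selection
            S = set(expand(u, v))             # Expand or Expand+, depending on the variant
        else:
            S = {u}
        clusters.append(S)
        unclustered -= S
    return clusters
```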


In what follows, we assume that the graph has a total of n vertices and m edges.

Table 3. Definition of SPICi, SPICi1+, SPICi2+ and SPICi12+

Algorithm   Component 1  Component 2  Reference
SPICi       Expand       Search       [17]
SPICi1+     Expand+      Search       Current Paper
SPICi2+     Expand       Search+      Current Paper
SPICi12+    Expand+      Search+      Current Paper


– SPICi1+: In SPICi1+, we use three binary heap data structures, namely DegreeQ (for the Search procedure), CandidateQ and AverageEdgeWeightHeap (for the Expand+ procedure). Each of the operations Insert, ExtractMax, DecreaseKey, IncreaseKey and Delete on CandidateQ, DegreeQ and AverageEdgeWeightHeap runs in O(log n) time. For initial seed selection, each vertex is inserted in DegreeQ once in the Search procedure, and hence there are n Insert operations. In Expand+, the neighbors of the nodes that are already in a cluster are the candidate nodes for further cluster formation and are inserted in CandidateQ using the Insert operation, based on the AverageEdgeWeight (described in Section 3.1) of each candidate node stored in the AverageEdgeWeightHeap. Subsequently, the values are updated using IncreaseKey and DecreaseKey operations. These operations together yield a total of n Insert and update (i.e., IncreaseKey and DecreaseKey) operations. Furthermore, the nodes that are inserted in CandidateQ and DegreeQ are extracted, needing n ExtractMax or Delete operations. Hence, the running time of Expand+ is O(2n log n), depending on CandidateQ and AverageEdgeWeightHeap.



Now, reading the graph takes O(m) time. So the total time complexity of SPICi1+ is O(3n log n + m) = O(n log n + m).

– SPICi2+: In Search+, we have implemented the Weighted Degree of Neighbors heuristic (described in Section 3.2) using a Fibonacci heap [13], on which we need to apply Insert, DecreaseKey, Delete and ExtractMax operations. Insert and DecreaseKey take (amortized) constant time, whereas Delete and ExtractMax run in O(log n) amortized time. Note that the initial calculation of the Weighted Degree of Neighbors takes O(m) time. There are a total of n ExtractMax operations and, in Expand, n IncreaseKey operations. Hence the running time of Search+ is O(n log n + m), and after including the time to read the graph, the total running time of SPICi2+ becomes O(n log n + 2m) = O(n log n + m).

– SPICi12+: The running time of SPICi12+ can be deduced by summing the running times of the Expand+ and Search+ procedures. The running time of Expand+ is O(2n log n) and that of Search+ is O(n log n + m). Therefore, considering the time to read the graph, the total running time of SPICi12+ becomes O(3n log n + 2m) = O(n log n + m).


5 Experiments and Results

We have conducted our experiments on a PC with an Intel 2.40 GHz Core i5 processor and 4 GB of memory. The coding has been done in C++ using the Codeblocks 10.05 IDE, and all experiments have been run in a Linux (Ubuntu 12.04) environment. The source code of our proposed heuristics is freely available at http://goo.gl/e9du1Y. For all the experiments we set both Ts and Td to 0.5, the same values used in SPICi. For our experiments we had to convert the gene names, and we used various sources [4], [2] and [1] to convert and extract them. The conversion of gene names affects the analysis, but as all the experiments were done using the same procedure, the overall penalty borne by the clusters is the same. The reported runtimes of SPICi, SPICi1+ and SPICi2+ are wall clock times, as was done in [17].

5.1 Network Datasets

We have used four networks: two for yeast and two for human. These are the same networks used by Jiang and Singh [17] to experimentally evaluate SPICi. The properties of these networks are reported in Table 4. The two Biogrid networks [11] contain experimentally determined physical and genetic interactions, while the two STRING networks [16] contain functional associations between proteins derived from data integration. These datasets are available at [3]. Jiang and Singh [17] conducted another analysis on a human Bayesian network, which we could not obtain from the authors. For the Biogrid networks, all non-redundant interaction pairs, including the protein genetic and physical interactions, were extracted, and for the STRING networks all weighted interactions were used, as reported in [17].


Table 4. Test set of biological networks

            Biogrid Yeast  STRING Yeast  Biogrid Human  STRING Human
Vertices         5361           6371          7498          18670
Edges           85866         311765        23730        1432538

5.2 GO analysis


The GO analysis conducted here is based on the analysis done in [17], which used a framework described in [24] to evaluate the obtained clusters. We have used the same framework to compare the clusters obtained using the different heuristics. To construct the reference set of functional modules we have used the Gene Ontology (GO) [8] in the same way as in [17]; GO serves as an external measure from which functional modules are derived. For a GO biological process (BP) or cellular component (CC) term, a module contains all the proteins annotated with that term. Clustering algorithms are evaluated by judging how well the clusters correspond to the functional modules derived from either GO BP or GO CC annotations. Following [17], we consider the GO terms that annotate at most 1000 proteins for each organism. For a particular GO annotation A, G_A is the functional module consisting of all genes annotated with A. To measure the similarity between GO functional modules and derived clusters, [24] uses the following three measures:


– Jaccard: Consider a cluster C. For each GO-derived functional module G_A, the Jaccard value of C is computed as |C ∩ G_A| / |C ∪ G_A|. The maximum Jaccard value over all GO terms A is taken as the Jaccard value of C.
– PR (Precision Recall): Consider a cluster C. For each GO-derived functional module G_A, its PR value is computed as (|C ∩ G_A| / |G_A|) × (|C ∩ G_A| / |C|). The maximum PR value over all GO terms A is taken as the PR value of C.
– Semantic density: For each cluster, the average semantic similarity over all pairs of annotated proteins is computed. For two proteins p1 with annotations A(p1) and p2 with annotations A(p2), the semantic similarity of their GO annotations is defined as

\frac{2 \times \min_{a \in A(p1) \cap A(p2)} \log(p(a))}{\min_{a \in A(p1)} \log(p(a)) + \min_{a \in A(p2)} \log(p(a))}

where p(a) is the fraction of annotated proteins in the organism with annotation a [18], [24]. For semantic density calculations, GO terms annotating more than 1000 proteins are also considered [17].

The measures above range between 0 and 1; the higher the values, the better the clusters correspond to the functional modules derived from GO.
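For concreteness, here is a short Python sketch of the Jaccard and PR scores of a single cluster against a collection of GO-derived modules (toy sets for illustration; this is not the evaluation framework of [24] itself):

```python
def jaccard(C, modules):
    """Best Jaccard overlap of cluster C with any GO-derived module."""
    return max(len(C & G) / len(C | G) for G in modules)

def pr(C, modules):
    """Best precision x recall of cluster C against any GO-derived module."""
    return max((len(C & G) / len(G)) * (len(C & G) / len(C)) for G in modules)

# toy example: one cluster, two hypothetical GO modules
C = {"p1", "p2", "p3", "p4"}
modules = [{"p1", "p2", "p3"}, {"p3", "p4", "p5", "p6"}]
print(jaccard(C, modules))   # 3/4 = 0.75, achieved by the first module
print(pr(C, modules))        # (3/3) * (3/4) = 0.75
```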


We calculate Jaccard, PR and semantic density for each cluster, for both the BP and CC ontologies; these measures are attributed to all proteins in the cluster. Genes in singleton clusters are penalized by assigning 0 to each of the three measures. Lastly, we compute the average value of each of the six measures (three CC and three BP) over all proteins of the network. The results of the GO analysis are presented in Table 5. Here we have analyzed the three new variants, namely SPICi1+, SPICi2+ and SPICi12+, along with SPICi as well as some other celebrated algorithms from the literature, namely MGclus [14], ClusterONE [21] and WPNCA [23].

Table 5. GO analysis of clusters output by different algorithms (a "-" entry indicates that the algorithm was unable to cluster the network)

Network        Algorithm   BP sDensity  BP Jaccard  BP PR   CC sDensity  CC Jaccard  CC PR
Biogrid Yeast  SPICi       0.351        0.189       0.158   0.291        0.156       0.129
               SPICi1+     0.358        0.205       0.171   0.271        0.176       0.145
               SPICi2+     0.365        0.191       0.158   0.302        0.155       0.126
               SPICi12+    0.379        0.203       0.167   0.285        0.159       0.127
               MGclus      0.126        0.176       0.136   0.117        0.171       0.131
               ClusterOne  0.498        0.167       0.143   0.425        0.159       0.138
               WPNCA       0.379        0.219       0.143   0.312        0.247       0.209
Biogrid Human  SPICi       0.191        0.117       0.093   0.097        0.059       0.038
               SPICi1+     0.182        0.135       0.107   0.086        0.067       0.042
               SPICi2+     0.181        0.122       0.097   0.093        0.061       0.039
               SPICi12+    0.203        0.132       0.102   0.102        0.065       0.041
               MGclus      0.061        0.126       0.092   0.026        0.079       0.046
               ClusterOne  0.202        0.051       0.04    0.125        0.032       0.022
               WPNCA       0.173        0.102       0.073   0.071        0.089       0.059
STRING Yeast   SPICi       0.431        0.225       0.196   0.333        0.169       0.149
               SPICi1+     0.455        0.246       0.213   0.334        0.183       0.158
               SPICi2+     0.531        0.224       0.195   0.429        0.175       0.153
               SPICi12+    0.511        0.234       0.202   0.401        0.177       0.154
               MGclus      0.076        0.138       0.098   0.079        0.118       0.081
               ClusterOne  0.279        0.232       0.191   0.199        0.174       0.143
               WPNCA       0.162        0.147       0.104   0.151        0.136       0.102
STRING Human   SPICi       0.364        0.099       0.074   0.377        0.048       0.031
               SPICi1+     0.321        0.109       0.084   0.441        0.048       0.031
               SPICi2+     0.376        0.115       0.089   0.383        0.048       0.031
               SPICi12+    0.366        0.117       0.090   0.382        0.051       0.032
               MGclus      -            -           -       -            -           -
               ClusterOne  0.232        0.187       0.151   0.438        0.075       0.046
               WPNCA       -            -           -       -            -           -

From Table 5, we observe that the changes in the heuristics of the algorithm affect the quality of the clusters. For moderately sized networks like Biogrid Human and Biogrid Yeast, SPICi1+ has higher values in most of the cases, while for large networks its values are close to those of SPICi. The same holds for SPICi2+, but when we use both heuristics together in SPICi12+, it dominates SPICi in most of the cases, increasing the quality of the functional modules. Algorithms like WPNCA [23] and MGclus [14] are unable to cluster large biological networks, while ClusterONE [21] performs well on large networks but at a high running-time cost, which is a major concern for fast clustering algorithms.


5.3 Cluster-Size analysis

Table 6. Cluster Size Analysis

SPICi12 +

ClusterOne

MGclus

50∼150 4 0 9 12 0 2 7 7 3 0 8 12 3 0 5 7 0 4 18 62 9 8 18 164 3 903 -

150∼ 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 4 4 0 0 4 0 0 0 -


15∼50 28 7 51 94 4 25 28 37 26 4 38 73 24 9 35 45 5 33 166 392 88 90 92 549 191 1552 -


WPNCA

5∼15 105 98 145 351 83 92 110 208 114 108 92 185 134 92 170 342 111 172 360 1239 190 302 167 514 657 767 -


SPICi2+

0∼5 2059 3996 2106 8987 3722 1955 2389 9872 1982 3835 2265 8590 1658 3524 2070 8946 7950 2955 1107 6430 171 607 64 5585 5182 6600 -


SPICi1+

Biogrid Yeast Biogrid Human String Yeast String Human Biogrid Yeast Biogrid Human String Yeast String Human Biogrid Yeast Biogrid Human String Yeast String Human Biogrid Yeast Biogrid Human String Yeast String Human Biogrid Yeast Biogrid Human String Yeast String Human Biogrid Yeast Biogrid Human String Yeast String Human Biogrid Yeast Biogrid Human String Yeast String Human

M

SPICi


To analyze the different sizes of the functional modules uncovered by the algorithms, we have also performed a cluster-size analysis. The clusters are divided into five different bins depending on the number of proteins in the cluster: the first four bins are [0, 5), [5, 15), [15, 50) and [50, 150), which give the lower and upper limits of cluster size, and the last bin counts the clusters having more than 150 proteins. The result of this analysis is presented in Table 6. Observing the values of Table 6, we notice that greedy algorithms like SPICi mostly uncover modules of medium size; one reason for this is that these algorithms are non-overlapping. On the other hand, ClusterOne and MGclus uncover large functional modules for large networks. Comparing SPICi and SPICi12+, we can see that for large networks the number of small clusters of size 5 to 15 decreased, and the number of clusters of size 15 to 150 also decreased. As both of these counts decreased, this can only happen if the clusters uncovered by SPICi12+ are mostly medium-sized, which means that the average cluster size increased. So, while clustering with SPICi12+, clusters that are on average larger in size are uncovered while the density threshold is maintained. Such clusters are better in size, and thus, when we calculate the intersection values between clusters and derived functional modules in the GO analysis, we obtain better GO analysis results. Hence, uncovering clusters of larger average size is a major reason for SPICi12+ having better GO analysis results than SPICi.

5.4 Practical Running Time

Table 7. Run time analysis

Algorithm   Biogrid Yeast  STRING Yeast  Biogrid Human     STRING Human
SPICi       1s             1s            1s                9s
SPICi2+     1s             3s            1s                11s
SPICi1+     1s             2s            2s                9s
SPICi12+    1s             1s            2s                10s
MGclus      32s            5s            85s               Unable to Cluster
ClusterOne  546s           683s          3603s             25240s
WPNCA       1620s          180s          21600s (Approx.)  Unable to Cluster (process ran over 12 hrs)


The running times of the algorithms in our experiments are reported in Table 7. Note that an ideal clustering algorithm should cluster large biological networks reasonably fast while maintaining cluster quality. MGclus [14], ClusterOne [21] and WPNCA [23] are not time efficient compared to SPICi and the three new variants, as clustering is very slow with these algorithms. In our experiments, MGclus gave a runtime error when clustering the STRING Human network, and the WPNCA process could not provide any output for the same network even after 12 hours. ClusterOne took about one hour to cluster the STRING Yeast network and about seven hours to cluster the STRING Human network. Although ClusterOne achieves good GO analysis results for large biological networks, its high running time makes it less suitable where fast clustering of large networks is required.

5.5 Robustness Analysis

The robustness of the improved algorithms is analyzed using the process of Brohee and van Helden [9], which tests an algorithm's ability to recover the MIPS complexes from synthetic test network data. To perform the robustness analysis, we created networks from the MIPS complexes of Saccharomyces cerevisiae available at [19]: the nodes of the created network correspond to the proteins of these complexes, and an edge is created between any two proteins in the same complex. Suppose the network created this way has |E| edges. The networks are then modified using two factors, the edge addition rate pa and the edge deletion rate pd. The values of pa and pd are each chosen over 10 different rates, and for each combination pa × |E| randomly selected edges are added before pd × |E| randomly selected edges are removed.
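A minimal Python sketch of this perturbation scheme (illustrative; the node and complex data would come from the MIPS set at [19], which is not reproduced here):

```python
import random

def perturb(edges, nodes, pa, pd, seed=0):
    """Add pa*|E| random non-edges, then delete pd*|E| random edges,
    with both counts taken relative to the original edge set."""
    rng = random.Random(seed)
    edges = set(edges)
    n_add = int(pa * len(edges))
    n_del = int(pd * len(edges))
    non_edges = [(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]
                 if (u, v) not in edges and (v, u) not in edges]
    edges |= set(rng.sample(non_edges, min(n_add, len(non_edges))))
    edges -= set(rng.sample(list(edges), min(n_del, len(edges))))
    return edges
```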


Table 8. Accuracy : SPICi Vs SPICi12 + ( Addition rate (row-wise) and Deletion rate (column-wise) 0.3 1.00 1.03 1.03 1.02 1.01 1.03 1.01 1.00 1.00 1.00

0.4 0.98 1.00 1.03 1.03 1.02 1.01 1.00 1.00 1.00 1.00

0.5 1.03 1.04 0.97 1.02 1.02 1.02 1.00 1.00 1.00 1.00

0.6 1.04 0.91 0.99 1.02 1.01 1.02 1.01 1.00 1.00 1.00

0.7 1.02 1.03 1.01 1.01 1.00 0.98 1.02 1.00 1.00 1.00

0.8 1.04 1.04 0.94 1.03 0.91 0.97 1.09 1.02 1.00 1.00

0.9 1.11 1.04 0.91 0.98 0.94 0.96 1.01 0.99 0.99 1.00

ip t

0.2 0.99 1.08 0.98 1.02 1.02 1.02 1.00 1.00 1.00 1.00

cr

0.1 0.99 0.99 1.01 1.02 1.02 1.02 1.00 1.00 1.00 1.00

us

0.0 0.99 1.04 1.07 1.06 1.06 1.01 1.00 1.00 1.00 1.00

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Table 9. Separation : SPICi Vs SPICi+12 ( Addition rate (row-wise) and Deletion rate (column-wise) 0.2 1.36 1.32 1.90 0.92 1.04 1.81 1.41 0.92 0.82 0.71

0.3 1.97 1.60 2.14 1.87 0.92 1.62 1.21 1.73 1.73 1.73

0.4 1.81 1.68 1.66 1.49 0.89 1.04 0.45 1.41 1.41 0.88

0.5 1.87 1.58 1.83 1.28 0.98 1.21 1.13 1.41 0.95 1.41

0.6 1.38 1.91 1.24 1.93 1.55 1.57 1.56 0.95 0.97 1.41

0.7 1.78 1.26 1.48 1.13 1.48 1.08 1.30 1.41 1.41 1.41

an

0.1 1.35 1.61 1.04 1.79 1.88 1.02 1.73 1.73 1.72 1.41

M

0.0 1.99 1.32 1.32 1.49 1.00 1.97 1.73 1.41 1.41 1.41

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

0.8 1.93 1.39 1.54 1.75 1.23 1.55 1.17 1.63 1.73 1.73

0.9 1.46 1.46 1.73 1.47 1.50 1.31 1.11 1.04 1.13 0.90

0 1.01 1.00 1.01 1.02 1.03 1.01 1.02 1.00 1.01 1.03

0.1 1.02 1.01 1.02 1.01 1.04 1.01 1.02 1.01 1.02 1.07

0.2 1.00 1.00 1.01 1.01 1.01 1.02 1.05 1.00 1.05 1.06

0.3 1.01 1.01 1.01 1.00 1.00 1.00 1.05 1.04 1.02 1.05

Ac ce p

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.90

te

d

Table 10. Accuracy : SPICi Vs SPICi+2 ( Addition rate (row-wise) and Deletion rate (column-wise) ) 0.4 1.01 1.05 1.04 1.01 1.00 1.01 1.05 0.98 0.98 1.05

0.5 1.01 1.01 1.01 1.01 1.00 1.00 1.06 1.01 1.01 1.05

0.6 1.00 1.00 1.00 1.01 1.01 1.00 1.01 1.01 1.01 1.07

0.7 1.00 1.01 1.01 1.01 1.01 1.01 1.01 1.01 1.01 1.00

0.8 1.00 1.01 1.01 1.01 1.01 0.94 1.01 1.01 1.00 1.01

0.9 1.00 1.01 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

Table 11. Separation : SPICi Vs SPICi2+ ( Addition rate (row-wise) and Deletion rate (column-wise) ) 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

0 0.1 1.39 0.98 1.96 0.96 1.33 1.52 1.11 1.50 0.82 1.34 0.72 1.13 1.48 1.26 1.31 1.35 1.34 1.49 1.12 1.46

0.2 1.76 1.90 1.74 1.57 1.51 1.53 1.69 0.82 0.89 1.25

0.3 1.61 1.53 1.45 1.35 1.36 1.39 1.49 1.40 1.60 1.23

0.4 1.41 1.77 1.46 1.42 1.33 1.44 1.44 1.38 1.21 1.45

0.5 1.52 1.38 1.42 1.39 1.31 1.26 1.54 1.50 1.90 1.23

0.6 1.41 1.41 1.41 1.21 1.05 1.08 1.57 1.34 1.49 1.22

0.7 1.42 1.42 1.41 1.43 1.32 1.41 1.22 1.41 1.26 1.23

0.8 1.41 1.41 1.41 1.41 1.41 0.87 1.41 1.41 1.41 1.55

0.9 1.41 1.41 1.42 1.31 1.14 1.21 1.61 1.45 1.21 1.11


Table 12. Accuracy : SPICi Vs SPICi1+ ( Addition rate (row-wise) and Deletion rate (column-wise) ) 0.2 0.98 0.95 0.97 0.99 0.99 0.99 0.98 0.97 0.94 0.75

0.3 0.99 0.97 0.99 0.99 0.99 0.99 0.99 0.98 0.99 0.94

0.4 1.01 0.96 0.99 0.99 0.99 0.99 0.99 0.99 0.87 0.87

0.5 0.98 0.98 0.98 0.99 0.99 0.99 0.75 0.95 0.95 0.92

0.6 1 1.00 1.00 1.01 1.00 1.01 1.01 0.99 1.06 0.97

0.7 1 1.00 1.00 1.00 1.00 1.01 1.00 1.01 1.00 0.97

0.8 1 1.00 1.00 1.00 1.00 0.94 0.99 0.99 0.99 0.96

0.9 1 1.00 1.00 0.99 0.99 1.00 1.00 1.00 1.00 1.00

ip t

0.1 0.98 0.93 0.98 0.98 0.98 0.99 0.85 0.89 0.91 0.90

cr

0 0.96 0.95 0.96 0.96 0.95 0.97 0.97 0.92 0.95 0.97

us

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Table 13. Separation : SPICi Vs SPICi1+ ( Addition rate (row-wise) and Deletion rate (column-wise) ) 0.3 1.08 1.18 1.53 1.55 1.38 1.46 1.42 1.60 1.48 1.28

0.4 1.01 1.28 1.18 1.22 1.22 1.44 1.49 1.29 0.95 1.23

0.5 1.05 0.81 1.59 1.07 1.23 1.17 0.81 1.13 1.18 1.16

0.6 1.41 1.41 1.41 1.09 1.02 1.04 1.57 1.08 1.08 1.03

0.7 1.41 1.41 1.41 1.41 1.41 1.41 1.41 1.41 1.72 0.99

an

0.2 1.11 1.23 1.32 1.24 1.22 1.17 1.71 1.18 1.20 1.68

M

0 0.1 1.54 1.08 1.47 1.72 1.35 1.89 1.62 1.22 1.58 1.15 1.18 0.95 1.26 1.32 1.14 1.46 1.49 1.34 0.9 1.12

0.8 1.41 1.41 1.41 1.41 1.41 0.70 1.41 1.41 1.41 2.80

0.9 1.41 1.41 1.41 1.41 1.41 1.41 1.41 1.41 1.41 1.41

d

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Ac ce p

te

The two measures used by Brohee and van Helden [9] are Accuracy and Separation. Accuracy indicates how well the algorithm can recover the gold standard MIPS complexes, while the Separation measure captures how specific the mapping is between the clusters and the MIPS gold standard set; for precise definitions of the measures and the process, please refer to [9]. The robustness of each variant (i.e., SPICi1+, SPICi2+ and SPICi12+) has been compared with SPICi. The combination of 10 different addition rates and 10 different deletion rates was used to generate the test networks, and each test provides the separation and accuracy measures [9], which help us assess the robustness of the algorithms. The output of the robustness analysis is presented in Tables 8 to 13. Among these tables, Table 8 (Table 9), Table 10 (Table 11) and Table 12 (Table 13) present the accuracy (separation) measure, respectively, for SPICi12+, SPICi2+ and SPICi1+ against SPICi. In each table, addition and deletion rates are placed row-wise and column-wise, respectively, and each cell contains the measure obtained from the test networks generated using the addition and deletion rates of the corresponding row and column. A value higher than 1.00 in a cell indicates that the corresponding variant (i.e., SPICi12+ in Tables 8 and 9, SPICi2+ in Tables 10 and 11, or SPICi1+ in Tables 12 and 13) performed better than SPICi.


At this point, a brief discussion of the robustness analysis is in order. In Tables 8 and 9, we compare SPICi12+ against SPICi. Observing the values in Table 8 (accuracy) and Table 9 (separation), we can see that the level of accuracy of the two algorithms is similar: the values are not much larger than 1.00, where a value of 1.00 means that both algorithms handled the noise in the network in a similar manner. In most cases, however, the separation values of SPICi12+ reflect better performance. Tables 10 and 11 present the accuracy and separation values of SPICi2+ against SPICi. For higher addition and deletion rates the accuracy of SPICi2+ is almost 1.00, whereas for lower addition and deletion rates it performs slightly better in comparison; with respect to the separation values, SPICi2+ performs better than SPICi. Finally, Tables 12 and 13 report the accuracy and separation measures of SPICi1+ against SPICi. Unlike the accuracy values of SPICi12+ and SPICi2+ (Tables 8 and 10), the accuracy values of SPICi1+ are mostly less than 1.00, and in a few cases the value is near 0.75; this indicates slightly superior robustness of SPICi against SPICi1+. For the separation measure, however, SPICi1+ still performs well. As SPICi1+, SPICi2+ and SPICi12+ are all similar to the original algorithm (i.e., SPICi), they are all expected to have almost the same level of robustness. To summarize the observations of Tables 8 to 13, the robustness of SPICi1+ and SPICi2+ is similar to that of SPICi, although for low addition and deletion rates SPICi1+ and SPICi2+ perform better. The separation values in Tables 9, 11 and 13 suggest that the new variants uncover clusters that are more accurately mapped to the MIPS gold standard, as in most cases the values for the new variants are higher than 1.00, indicating a better mapping. So, the overall results of the robustness analysis show that the improved heuristics handle the noise in the network in a fashion similar to SPICi but give a better mapping to the gold standards.

6 Conclusion

In bioinformatics, clustering algorithms are considered to be among the most important tools. Although there are a number of clustering algorithms that can cluster biological networks, most of them fail on large biological networks. In this paper we have proposed two new greedy heuristics and presented three new implementations of a celebrated greedy clustering algorithm called SPICi [17]. Our experimental results and analyses are promising. From our investigation we find that, in greedy algorithms, the heuristics are the key that helps the algorithm exploit the structure of the input networks: SPICi1+, SPICi2+ and SPICi12+ use different sets of heuristics and produce outputs of different quality, which suggests that better heuristics yield better results and opens up scope for designing heuristics that evaluate biological networks in a better way. As future work, we plan to investigate whether our heuristic approaches are useful for other relevant clustering problems such as multi-network clustering, overlapping cluster identification and active module identification, as discussed in a recent review [20].


References


1. BioGRID 3.2 Help and Support Resources, Last accessed on April 26, 2014, at 08:05:00PM. [Online]. Available: http://thebiogrid.org
2. THE SYNERGIZER, Last accessed on May 11, 2014, at 05:10:00PM. [Online]. Available: http://llama.mshri.on.ca/synergizer/translate/
3. SPICi: Speed and Performance In Clustering, Last accessed on May 21, 2014, at 10:15:00PM. [Online]. Available: http://compbio.cs.princeton.edu/spici/
4. STRING - Known and Predicted Protein-Protein Interactions, Last accessed on March 16, 2015, at 02:05:00PM. [Online]. Available: http://www.string-db.org
5. B. Adamcsek, G. Palla, I. J. Farkas, I. Derényi, and T. Vicsek, CFinder: locating cliques and overlapping modules in biological networks, Bioinformatics, vol. 22, pp. 1021-1023, 2006.
6. M. Altaf-Ul-Amin, Y. Shinbo, K. Mihara, K. Kurokawa, and S. Kanaya, Development and implementation of an algorithm for detection of protein complexes in large interaction networks, BMC Bioinformatics, vol. 7, p. 207, 2006.
7. M. Altaf-Ul-Amin, H. Tsuji, K. Kurokawa, H. Asahi, Y. Shinbo, and S. Kanaya, DPClus: A density-periphery based graph clustering software mainly focused on detection of protein complexes in interaction networks, BMC Bioinformatics, vol. 7, pp. 150-156, 2006.
8. M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin and G. Sherlock, Gene Ontology: tool for the unification of biology, Nature Genetics, vol. 25, pp. 25-29, 2000.
9. S. Brohee and J. van Helden, Evaluation of clustering algorithms for protein-protein interaction networks, BMC Bioinformatics, vol. 7, no. 1, p. 488, 2006.
10. M.-C. Brun, C. Herrmann, and A. Guénoche, Clustering proteins from interaction networks for the prediction of cellular functions, BMC Bioinformatics, vol. 5, p. 95, 2004.
11. A. Chatr-aryamontri, B.-J. Breitkreutz, S. Heinicke, L. Boucher, A. G. Winter, C. Stark, J. Nixon, L. Ramage, N. Kolas, L. O'Donnell, T. Reguly, A. Breitkreutz, A. Sellam, D. Chen, C. Chang, J. M. Rust, M. S. Livstone, R. Oughtred, K. Dolinski, and M. Tyers, The BioGRID interaction database: 2013 update, Nucleic Acids Research, vol. 41, pp. 816-823, 2013.
12. R. Colak, F. Hormozdiari, F. Moser, A. Schönhuth, J. Holman, M. Ester, and S. C. Sahinalp, Dense Graphlet Statistics of Protein Interaction and Random Networks, Pacific Symposium on Biocomputing, vol. 14, pp. 178-189, 2009.
13. M. L. Fredman and R. E. Tarjan, Fibonacci heaps and their uses in improved network optimization algorithms, Journal of the ACM (JACM), vol. 34, no. 3, pp. 596-615, 1987.
14. O. Frings, A. Alexeyenko and E. L. L. Sonnhammer, MGclus: network clustering employing shared neighbors, Molecular BioSystems, vol. 9, pp. 1670-1675, 2013.
15. E. Georgii, S. Dietmann, T. Uno, P. Pagel, and K. Tsuda, Enumeration of condition-dependent dense modules in protein interaction networks, Bioinformatics, vol. 25, pp. 933-940, 2009.



16. L. J. Jensen, M. Kuhn, M. Stark, S. Chaffron, C. J. Creevey, J. Muller, T. Doerks, P. Julien, A. Roth, M. Simonovic, P. Bork, and C. von Mering, STRING 8 - a global view on proteins and their functional interactions in 630 organisms, Nucleic Acids Research, vol. 37, pp. 412-416, 2009.
17. P. Jiang and M. Singh, SPICi: a fast clustering algorithm for large biological networks, Bioinformatics, vol. 26, pp. 1105-1111, 2010.
18. P. W. Lord, R. D. Stevens, A. Brass, and C. A. Goble, Semantic Similarity Measures as Tools for Exploring the Gene Ontology, Pacific Symposium on Biocomputing, pp. 601-612, 2003.
19. H. W. Mewes, D. Frishman, U. Güldener, G. Mannhaupt, K. Mayer, M. Mokrejs, B. Morgenstern, M. Münsterkötter, S. Rudd, and B. Weil, MIPS: a database for genomes and protein sequences, Nucleic Acids Research, vol. 30, no. 1, pp. 31-34, 2002.
20. K. Mitra, A.-R. Carvunis, S. K. Ramesh and T. Ideker, Integrative approaches for finding modular structure in biological networks, Nature Reviews Genetics, vol. 14, pp. 719-732, 2013.
21. T. Nepusz, H. Yu and A. Paccanaro, Detecting overlapping protein complexes in protein-protein interaction networks, Nature Methods, vol. 9, no. 5, pp. 471-472, 2012.
22. G. Palla, I. Derényi, I. Farkas, and T. Vicsek, Uncovering the overlapping community structure of complex networks in nature and society, Nature, vol. 435, pp. 814-818, 2005.
23. W. Peng, J. Wang, B. Zhao and L. Wang, Identification of protein complexes using weighted PageRank-Nibble algorithm and core-attachment structure, vol. 12, no. 1, pp. 179-192, 2014, IEEE.
24. J. Song and M. Singh, How and when should interactome-derived clusters be used to predict functional modules and protein function? Bioinformatics, vol. 25, pp. 3143-3150, 2009.
