Accepted Manuscript Title: Predicting overlapping protein complexes from weighted protein interaction graphs by gradually expanding dense neighborhoods Author: Christos Dimitrakopoulos Konstantinos Theofilatos Andreas Pegkas Spiros Likothanassis Seferina Mavroudi PII: DOI: Reference:
S0933-3657(15)30049-X http://dx.doi.org/doi:10.1016/j.artmed.2016.05.006 ARTMED 1469
To appear in:
ARTMED
Received date: Revised date: Accepted date:
17-10-2015 30-5-2016 30-5-2016
Please cite this article as: Dimitrakopoulos Christos, Theofilatos Konstantinos, Pegkas Andreas, Likothanassis Spiros, Mavroudi Seferina.Predicting overlapping protein complexes from weighted protein interaction graphs by gradually expanding dense neighborhoods.Artificial Intelligence in Medicine http://dx.doi.org/10.1016/j.artmed.2016.05.006 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Predicting overlapping protein complexes from weighted protein interaction graphs by gradually expanding dense neighborhoods
Christos
Dimitrakopoulosa,
Konstantinos
Theofilatosa,
Andreas
Pegkasb,
Spiros
Likothanassisa,b, Seferina Mavroudia,b,c a
InSyBio Ltd, 109 Uxbridge Road, W5 5TL London, UK
b
Department of Computer Engineering and Informatics, University of Patras, Building B
University Campus, Rio, 26500, Greece c
Department of Social Work, School of Sciences of Health and Care, Technological
Educational Institute of Western Greece, Megalou Aleksandrou 1, Koukouli, 26334 Patra, Greece {c.dimitrakopoulos, mavroudi}@ceid.upatras.gr
k.theofilatos}@insybio.com,
{pegkas,
likothan,
Highlights
Gradually Expanding Dense Neighborhoods (GENA) is proposed for the computational prediction of protein complexes from weighted protein interaction networks.
GENA permits the participation of proteins to multiple complexes in agreement with the underlying cell mechanisms.
GENA outperformed three of the state of the art algorithms for predicting protein complexes in experiments with datasets from yeast and human organisms.
Downstream analysis of the resulted clusters revealed functional homogeneity between the proteins of the same cluster.
Significantly altered network modules were detected when GENA was applied to two co-expression networks: one generated from Parkinson patients and one from healthy individuals.
Abstract Objective Proteins are vital biological molecules driving many fundamental cellular processes. They rarely act alone, but form interacting groups called protein complexes. The study of protein complexes is a key goal in systems biology. Recently, large protein-protein interaction (PPI) datasets have been published and a plethora of computational methods that provide new ideas for the prediction of protein complexes have been implemented. However, most of the methods suffer from two major limitations:
First, they do not account for proteins
participating in multiple functions and second, they are unable to handle weighted PPI graphs. Moreover, the problem remains open as existing algorithms and tools are insufficient in terms of predictive metrics. Method In the present paper, we propose gradually expanding neighborhoods with adjustment (GENA), a new algorithm that gradually expands neighborhoods in a graph starting from highly informative “seed” nodes. GENA considers proteins as multifunctional molecules allowing them to participate in more than one protein complex. In addition, GENA accepts weighted PPI graphs by using a weighted evaluation function for each cluster. Results In experiments with datasets from saccharomyces cerevisiae and human, GENA outperformed Markov
clustering,
restricted
neighborhood
search
and
clustering
with overlapping neighborhood expansion, three state-of-the-art methods for computationally predicting protein complexes. Seven PPI networks and seven evaluation datasets were used in total. GENA outperformed existing methods in 16 out of 18 experiments achieving an average
improvement of 5.5% when the maximum matching ratio metric was used. Our method was able to discover functionally homogeneous protein clusters and uncover important network modules in a Parkinson expression dataset. When used on the human networks, around 47% of the detected clusters were enriched in gene ontology (GO) terms with depth higher than five in the GO hierarchy. Conclusions In the present manuscript, we introduce a new method for the computational prediction of protein complexes by making the realistic assumption that proteins participate in multiple protein complexes and cellular functions. Our method can detect accurate and functionally homogeneous clusters. Keywords: computational prediction of protein complexes; clustering of protein-protein interaction networks; functionally homogeneous protein clusters; Parkinson differentially expressed network modules 1 Introduction Proteins are considered the most important players in molecular interactions. They play a significant role in all cellular functions (e.g. transmission of regulatory signals in the cell) and they catalyze a huge number of chemical reactions. They rarely act isolated, but they are combined in functional modules with one example of them being the protein complexes. The prediction of protein complexes is crucial for understanding the cellular mechanisms and for predicting the functions of uncharacterized proteins. The experimental prediction of protein complexes is mainly limited to tandem affinity purification (TAP) [1], which provides erroneous data and is not cost-effective and time-efficient. TAP results have raised the human interactome’s coverage, but also include many false positives and false negatives [2]. In virtue of the experimental approaches’ limitations, researchers have recently emphasized in the computational prediction of protein complexes from protein-protein interaction (PPI) data by using unsupervised clustering algorithms [3, 4]. The assumption behind most of the methods is the detection of strongly connected components of proteins that are sparsely connected to the rest of the graph [5, 6]. These algorithms are based on very different approaches. Most of them require the specification of a considerable number of parameters, some of which drastically affect the results. Clustering as a modeling approach can address the problem of detecting protein complexes in PPI graphs, however, standard clustering is not ideal for PPI networks. Proteins may have multiple functions, and therefore their corresponding graph nodes may belong to more than
one cluster. For instance, 17 pairs of complexes overlap in the Aloy dataset [7], 40 pairs in the BT_409 dataset [8] and 215 pairs in the Pu dataset [9]. Such nodes present a challenge to traditional PPI clustering algorithms and recently, algorithms that detect overlapping clusters have been proposed [10, 11]. Moreover, the state of the art methods for clustering PPI graphs are usually applied to weighted PPI graphs only after 'binarizing' them by removing weighted edges below a given threshold. The idea of using the original weighted PPI graphs was introduced recently [12] and demonstrated a significant improvement in the detection of protein complexes. In the present paper, we propose gradually expanding neighborhoods with adjustment (GENA), a fully unsupervised clustering algorithm, which consists of two steps. In the first step, a greedy approach is used to initialize the clusters. We used initial “seed” vertices with a high potential for cluster formation based on the clustering coefficient metric. The seed nodes are absorbing neighboring nodes (and gradually forming a growing cluster) based on an evaluation function, which is defined as a generalized version of the connectivity in the weak sense (section 2.2). Each cluster grows independently from the other clusters and as a result they can overlap. The clusters stop growing when no neighboring node can improve their evaluation function. In the second step of “adjustment”, random moves are performed between the clusters to optimize the clustering solution of the initialization step. In experiments using public datasets of protein complexes from saccharomyces cerevisiae and human, GENA outperformed restricted neighborhood search (RNSC), Markov clustering (MCL) and clustering with overlapping neighborhood expansion (ClusterONE), which are the state of the art algorithms for predicting protein complexes. We performed an extensive analysis by using several input networks and evaluation datasets. In specific, we used three weighted PPI graphs and four evaluation datasets for saccharomyces cerevisiae as well as two weighted PPI graphs and three evaluation datasets for human (section 2.1). 2 Materials and methods 2.1 Datasets 2.1.1 Protein-protein interactions datasets In the present paper, six PPI datasets have been used as inputs for the prediction algorithms. They originate from different organisms (saccharomyces cerevisiae and human). Based on the PPI datasets, we created weighted PPI graphs. For saccharomyces cerevisiae, we used Krogan et al core and extended datasets [13], Collins et al [14] dataset as described in [12]. For
human, we built a network by using the interactions reported in the human protein reference database (HPRD). Krogan et al [13] have combined results from matrix-assisted laser desorption/ionization – time of flight (MALDI-TOF) mass spectrometry and liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) experiments to identify protein-protein interactions. The reason for using data from two independent experimental settings was based on the observations that a single mass spectrometry method often fails to identify all proteins. Hence, using data from two independent methods was expected to increase the coverage and confidence of the obtained interactome. The results of the two methods were combined by using a supervised machine learning approach using hand-curated protein complexes from the Munich information center for protein sequences (MIPS) reference database [15] as a gold standard dataset. A two round learning phase framework was encountered coupling the output of Bayesian networks and decision trees with the stacked generalization algorithm [16]. In the first round, Bayesian inference networks and 28 different kinds of decision trees were tested finally settling on three methods: Bayesian networks and C4.5-based and boosted decision stumps. The output of these three methods was used as the input for a second round of learning with the stacked generalization algorithm. The output of the stacked generalization algorithm (i.e. a probability value between 0 and 1) was then thresholded at two different levels to obtain the core and extended datasets. The Krogan core dataset included all interactions with posterior probability higher than 0.273, while the extended dataset included all interactions with posterior probability higher than 0.101 [12]. Collins et al [14] have combined the experimentally derived PPI networks of Krogan et al [13] and Gavin et al [17] by re-analyzing the raw primary affinity purification data of these experiments using a novel scoring technique called purification enrichment (PE). The PE scores are motivated by the probabilistic socio-affinity scoring framework of Gavin et al [17], but also take into account negative evidence (i.e. pairs of proteins where one of them fails to appear as a prey when the other one is used as a bait). The first PPI dataset for the human organism consists of the protein interactions from the HPRD database [18]. These protein interactions were filtered using the evolutionary Kalman mathematical modelling (EVOKALMAMODEL) method proposed in the human interactome knowledge base (HINT-KB) [19]. EVOKALMAMODEL predicts protein-protein interactions (PPIs) by fusing sequential, functional and structural PPI data. The extracted PPI graph consists of 7450 proteins and 21.475 interactions. The main idea of EVOKALMAMODEL is to construct an optimal mathematical predictor equation by exploring a pool of given
mathematical terms. It combines Kalman filtering, an adaptive filtering technique with a genetic algorithm, a heuristic method based on the process of natural selection. The genetic algorithm detects the optimal subset of terms for the predictor’s mathematical equation and then applies extended Kalman filters to compute its optimal parameters. The final equation is used to score and filter the protein interactions. The second human PPI network is the entire HINT-KB network, which is constructed based on protein interactions included in the IrefIndex database and predicted as positive by the EVOKALMAMODEL method [19]. It contains 20845 unique proteins and 211367 unique interactions and was selected because it provides the highest coverage of the human interactome, while at the same time it is comprised of only confidently predicted interactions. Consequently, it enables the prediction of a high number of high quality, not previously reported protein complexes. 2.1.2
Evaluation datasets
For Saccharomyces Cerevisiae, four well-studied datasets of protein complexes were used. The first is the one proposed and described in [8], which is named BT_409 and consists of 409 protein complexes. The second dataset named Aloy [7] contains 101 protein complexes derived using structure-based protein matching with known structures and screened with the electron microscopy method. The third dataset is the Pu dataset [9], which contains 408 protein complexes. The fourth dataset is the MIPS catalog of protein complexes [15] as used in [12]. To avoid selection bias, we decided to consider all MIPS categories containing at least three and at most 100 proteins as protein complexes. MIPS category 550 and all its descendants were excluded because they correspond to unconfirmed protein complexes that were predicted by computational methods. For human, we made an effort to gain an insight about the quality of the clustering results by building three different protein complexes’ datasets using protein complexes published in the comprehensive resource of mammalian protein complexes (CORUM) [20]. The first human dataset (566 protein complexes) was created by filtering out all protein complexes, which include a protein that is not present in any of the HPRD PPIs. The second human evaluation dataset (1433 protein complexes) consists of protein complexes in CORUM when filtering out all complexes with more than half of their proteins not included in any of the HPRD PPIs. The third dataset is the full CORUM protein complex dataset of the human organism containing (1724) protein complexes.
2.2 Gradually expanding neighborhoods with adjustment GENA is able to overcome the current limitations of the state of the art methods. The first improvement is the use of a generalized measure of connectivity in the weak sense, which permits the processing of weighted PPI networks. The second is an independent “seed node” augmenting strategy for each cluster that enables the prediction of overlapping protein complexes. In the first step (initialization), the algorithm selects some initial “seed” nodes based on the highest clustering coefficient. Clustering coefficient on undirected networks is defined as: CC (n)
2en kn (kn 1)
(2.1)
where kn is the number of neighbors of node n and en is the number of connected pairs between all neighbors of node n. Then each of the seed nodes starts augmenting a cluster by using an l-forward r-backward searching algorithm. In the forward step, a number of nodes (equal to l) are placed inside the cluster. In order for a node to be included into the cluster it should be a neighboring node and its absorption into the cluster should improve the clusters' evaluation function. A neighboring node to a cluster is a node that does not belong to the cluster and is connected to one of its nodes. In the backward step, we exclude from the cluster a set of nodes (equal to r) such that their removal improves the evaluation function. We experimented by using several values for l and r and surprisingly found that l=1 and r=0 were the ones performing the best. Hence, in the initialization step the clusters are only expanding. As evaluation function, a generalized version of the cluster connectivity in the weak sense has been used [21]:
aWin _ c (1 a)Wout | Cl N cl |
,
(2.2)
where Win_c is the complement of Win and Win is the sum of the intra-cluster edges, meaning the edges between nodes of the same cluster. When an edge between two nodes of the cluster does not exist, it will be counted as zero in Win and as one in Win_c. Wout is the sum of edges connecting nodes of the clusters with nodes outside of it.
Win
W
a ,bCl
Win _ c
a ,b
, if there is no connection between a and b, then Wa,b=0
1W
a ,bCl
a ,b
(2.4)
(2.3)
Wout
aCl ,bCl ,bNcl
(2.5)
Wa ,b
Cl is the set that contains the nodes of the cluster and Ncl is the set of nodes that are neighbors to the cluster and do not belong in set Cl. For an optimal cluster, the sums Win_c and Wout must be low and consequently the evaluation function in (2.2) must be minimized. The parameter α in equation (2.2) determines the importance of participation of the factors Win_c and Wout in the evaluation function. We empirically set this parameter to 0.6, which means that internal cluster connectivity is of higher importance than external connectivity. We made this decision by observing the local network topology of the known protein complexes in the input PPI networks. The algorithm grows clusters independently and is therefore able to discover overlapping clusters. When choosing the seed nodes, nodes that are part of the so far uncovered clusters are excluded. Nevertheless, nodes from already uncovered clusters are not excluded in the expansion phase of the clusters. In the second step (adjustment), we are exploring the local space of the cluster solution given in the initialization step. We perform local search inspired by the RNSC algorithm by extending its different types of moves. The evaluation function for each cluster is the one defined at (2.2) and for the initial random partitioning of the clusters, we used the clustering solution given in the initialization step. To simulate for the case of a protein participating in several protein complexes, we defined three types of moves. First, a node can be moved to another cluster with probability 0.8. Second, a node can be copied to another cluster with probability 0.1 (node participates in both clusters). Third, a node can be deleted from a cluster with probability 0.1. Based on the observed participation of proteins in complexes from the evaluation datasets, we assumed that there are a few proteins participating in several complexes and many, which participate in one, or few complexes (see figure s1). As a result, we gave a low probability for the moves that copy or delete nodes. The evaluation function for a clustering solution is computed as:
Eval _ cl i cl1 Eval (i) , where Ncl the number of clusters and the evaluation of each cluster N
is computed as:
Eval (i )
aWin _ c (i) (1 a)Wout (i)
and a = 0.6.
| Cli Ncl (i) |
, where Win_c, Wout , Ncl and Cli are now defined on clusters
The adjustment step terminates when we reach the maximum number of one million moves or when the evaluation function did not improve in the last 1000 moves. To avoid getting trapped in local optimal solutions, we used a tabu list of length 50 and tolerance equal to 1 as proposed in [5]. Tabu tolerance is the maximum amount of times that one node can appear in the tabu list. Because of the stochastic nature of the adjustment step, we ran the procedure 100 times and chose the best clustering solution. After the execution of both steps, clusters with only one protein are discarded. Finally, highly overlapping pairs of clusters are merged. The overlap score of two clusters A and B is defined with the Meet/Min index: Meet / Min index
| A B | min(| A |, | B |)
(2.6)
which in case it is greater than 80%, the two clusters are merged. In contrast with the Jaccard metric, Meet/Min metric implies that if cluster A is part (at least 80%) of cluster B, the two clusters should be merged in one, independently of the size of B (or the other way around). GENA was built in Matlab R2015b. The algorithm's pseudocode follows: Input: PPI network (an undirected weighted graph) G(V,E) Clustering coefficient vector Be (equal to the number of nodes) Output: The predicted protein clusters: All_Clusters Algorithm: While Be is not empty, Choose node x with maximum Clustering coefficient; Call Routine Cluster -> Init_Cluster (node x); Be = Be – { Cluster }; All_Clusters = All_Clusters + Cluster; End Call Routine All_Clusters -> Adjustment(All_Clusters) Merge Highly Overlapping Clusters Store All_Clusters
Routine Init_Cluster(node x): Cluster = {x}; While Cluster Evaluation not improved, nn = {neighboring nodes of the Cluster}; if there a node y in nn, that improves the evaluation function the best, Cluster = Cluster U {y}; Update evaluation function; else break and return Cluster; end end Routine Adjustment (All_Clusters): Tabu_list = {}; Clustering_Cost = Cost(All_Clusters); Choose randomly a node and its move type to a neighboring cluster; (with probabilities 0.8, 0.1 and 0.1 for move, copy or delete) if maximum times of iterations reached or not changed in the last N iterations, exit(); else if cluster evaluation is improved, Update Tabu_list; Update All_Clusters; Update Clustering_Cost; end end 2.3 Evaluation metrics Sensitivity, positive predictive value (PPV) and geometric accuracy are classically used to measure the correspondence of a PPI clustering solution on a set of reference protein complexes. Considering the annotated complexes as a reference classification, sensitivity is defined as the fraction of proteins of complex i which are found in cluster j. To characterize
the general sensitivity of a clustering result, the clustering-wise sensitivity is computed as the weighted average of Sncoi overall complexes.
N Sn Sn N n
i 1
i
coi
(2.7)
n
i 1
i
The positive predictive value is the proportion of members of cluster j that belong to complex i, relative to the total number of members of this cluster assigned to all complexes. To characterize the general PPV of a clustering result as a whole, the clustering-wise PPV is computed as the weighted average of the individual PPVcl j of all clusters.
PPV
m
T PPVcl j
j 1 . j m
(2.8)
j 1T. j
The geometric accuracy (Acc) indicates the tradeoff between sensitivity and positive predictive value. It is obtained by computing the geometrical mean of the Sn and the PPV.
Acc Sn * PPV
(2.9)
The advantage of taking the geometric rather than arithmetic mean is that it yields a low score when either the Sn or the PPV metric is low and as a result it balances better the tradeoff between the two metrics. The sensitivity and PPV individually give a false idea of quality in the trivial cases where all proteins are assigned to a single cluster (Sn = 1 ⇒ Acc > 0.5) or where, on the contrary, each protein is assigned to a single-element cluster (PPV = 1 ⇒ Acc > 0.5). To avoid these erroneous interpretations, the separation metric has been proposed [11], which takes into consideration the fact that clustering predictions with fewer known complexes should be regarded as the highest quality ones. For the computation of the separation metric, we consider the contingency table. The contingency table indicates the absolute frequency of intersections between complexes and clusters. From these values, we derive relative frequencies with respect to the marginal sums, either per row ( Frowi , j ) or per column ( Fcoli , j ). Fcoli , j Frowi , j
Ti , j
i1Ti, j n
Ti , j
m
T
j 1 i , j
PPVi , j
(2.10)
Note that the frequency per column is identical to the PPV defined above. The frequency per row, on the contrary, differs from the sensitivity in the case that the algorithm can assign a protein to multiple clusters or leave some proteins unassigned. In such cases, the frequency per row provides a more drastic criterion than the sensitivity defined above. The separation is defined as the product of column-wise and row-wise frequencies:
Sepi , j Fcoli , j * Frowi , j
(2.11)
It takes values between 0 and 1. The maximal value Sepi , j = 1 indicates a perfect and exclusive correspondence between complex j and cluster i: it indicates that the cluster contains all the members of the complex and only them. In addition, it deals efficiently with multiple assignments. It penalizes cases where proteins of a given complex are assigned to multiple clusters, by using row sums rather than complex sizes. The complex-wise separation is calculated as the sum of separation values for a given complex i. m
Sepcoi Sepi , j
(2.12)
j 1
Reciprocally, a cluster-wise separation can be calculated, which reflects the concentration of one or several complexes within a given cluster.
n
Sep Sepco * Sepcl Sepcl j Sepi , j
(2.13)
i 1
To estimate a clustering result as a whole, clustering-wise Sepco and Sepcl values are computed as the averages of Sepcoi overall complexes, and of Sepcl j overall clusters, respectively.
n
Sepco
Sep
coi
i 1
n m
Sepcl
Sep
cl j
j 1
m
(2.14)
Finally, the geometrical separation (Sep) is computed as the geometrical mean of Sepco and
Sepcl . Sep Sepco * Sepcl
(2.15)
Separation yields a problem in cases when the reference set of protein complexes and/or the set of predicted complexes contains overlaps. Although the measure is correct since the MIPS complexes are not well separated from each other, it gives a misleading result because they match themselves perfectly [12]. As a more reliable metric, we used the maximum matching ratio (MMR) measure as described in [12]. MMR measure has been designed to find a maximum matching between reference and predicted complexes in a way that does not penalize overlaps explicitly.
3 Experimental results and discussion 3.1
Comparison in terms of performance
The performance of GENA was compared to the performance of three state of the art methods, which include MCL [22], RNSC [5] and ClusterONE [12]. MCL and RNSC are among the most well-established and frequently used methods for the prediction of protein complexes and ClusterONE [12] is one of the most promising, recently proposed algorithms for the detection of overlapping protein complexes from weighted protein interaction networks. The results for the MCL and RNSC applications were produced using the BioNets Tool from InSyBio Suite [23], which uses the optimized values for their parameters as described in [22, 5]. In Figures 3.1-3.6, we present the results based on the MMR measure, which is the most reliable evaluation metric as described in section 2.3. GENA has always a higher performance compared to MCL and RNSC, and additionally, in most cases, a higher performance to ClusterONE (except for using Collins2007 network with BT_409 and Pu evaluation datasets). We can mainly attribute the improvements of GENA compared to MCL and RNSC to its handling of the edges’ weights. All methods for predicting human protein-protein interactions are known to yield a non-negligible amount of noise (false positives) and to miss a considerable fraction of existing interactions (false negatives) [24]. Therefore, the available protein interaction data are extremely noisy. This problem is alleviated when weighted PPI networks are binarized to unweighted PPI networks.
GENA’s framework that incorporates weighted PPI graphs definitely directly faces that problem by evaluating the confidence of each PPI. Moreover, GENA does not need to tune the number of initial clusters. For example, RNSC’s performance significantly changes with the change of the initial random clusters [5]. In contrast to MCL and RNSC, ClusterONE can handle weighted PPI networks. GENA had on average better performance than ClusterONE and in this case, it is attributed to the 1) clustering coefficient metric for defining the initial “seed” nodes and to 2) counting the internal edge weights still missing to complete the cluster’s interactions (supplementary figure s2). We additionally evaluated GENA and ClusterONE based on the number of true overlapping proteins they recovered. In principle, we counted how many of the proteins involved in more than one true protein complex are also involved in more than one predicted cluster. The results are presented in supplementary Tables s2-s9. GENA has always a higher true positive rate, but both algorithms exhibit high false positive rates. The formulas used for true and false positive rates are: 𝑇𝑃
𝑇𝑃𝑅 = 𝑇𝑃+𝐹𝑁 (3.1) 𝐹𝑃
𝐹𝑃𝑅 = 𝐹𝑃+𝑇𝑁 (3.2) where TP is the number of proteins that belong to several complexes and are present in more than one cluster, FP is the number of proteins that belong to only one protein complex and are present in more than one cluster, FN is the number of proteins that belong to more than one protein complex and to only one cluster and TN is the number of proteins that belong to only one protein complex and to only one cluster. The choice of seed nodes plays an important role in our algorithmic procedure and determines the final clusters. We used the clustering coefficient to choose the seed nodes, in order to have initial seeds that have the highest tendency in forming a cluster in their direct neighborhood. We observed that when we used the degree centrality to choose the initial “seed” nodes, as conducted in ClusterONE [12], we always got a lower MMR (largest difference 2%). In contrast to ClusterONE, GENA generalizes the intra-connectivity between the nodes of a cluster by counting the edges that do not exist. Based on equation (2.2), every possible edge between the nodes of a cluster has a weight, with the non-existing edges having a weight equal to one. This metric allows down weighting of the clusters that have a high weighted summary of intra-edges, and at the same time are missing a large number of them (supplementary figure s2).
Finally, the running times of the GENA algorithm on the different network sizes are given in supplementary table 1. The running time is highly dependent on the number of nodes in the network.
3.2 Functional homogeneity of clusters As a case study, we analyzed GENA’s clustering on the full human PPI weighted graph obtained from HINT-KB to functionally characterize the predicted protein complexes. The procedure for the functional characterization of protein complexes is described in detail in [25]. In specific, we estimated the percentage of predicted clusters, which were characterized as functionally enriched, using the hypergeometric distribution [26] and the functional annotation of the gene ontology (GO) database [27]. To evaluate the performance of the proposed algorithm we compared it to the methodologies used in [25] and [28]. The comparison was performed in order to validate GENA’s ability to detect functionally homogeneous protein clusters. For a fair comparison, we adopted the same filtering procedure as followed in [25] to include only specific GO-terms (depth higher than five in the GO hierarchy). The experimental results presented in Table 1 indicate the higher functional enrichment rate in GENA’s clusters than in the complexes predicted by the other approaches. 3.3
GENA implementation on expression datasets
As an additional validation step, GENA was used to cluster gene co-expression networks. As a case study, we analyzed a dataset (GDS2519) from gene expression omnibus (GEO) repository. This dataset has been constructed by performing microarray experiments on blood samples of 50 early stage Parkinson disease (PD) patients, 22 control healthy subjects and 33 other neurodegenerative disease control subjects in order to measure the expression levels of more than 20.000 human genes. The gene expression dataset was parsed and analyzed using InSyBio BioNets [23] in order to construct two different gene expression networks: the first by using the control samples and the second by using Parkinson patients. The initial soft file was parsed and two gene expression datasets were constructed (one for every different experimental state: early stage PD patients and the other for healthy control subjects). Logarithmic normalization on the expression values was performed and the datasets were filtered to reduce insignificant genes using a minimum average expression value (2.5) and a minimum expression variance (0.001). Gene expression datasets were used to construct PD and control gene co-expression networks
by using a variation of the mutual information method [23]. In specific, for every two nodes an edge was added to the gene co-expression network, if the mutual information among the expression profiles of the two nodes of the edge exceeded a threshold. The thresholds for adding edges were dynamically generated to alleviate problems occurring by using the same threshold for all nodes. In particular, for every single node we computed its mutual information with all other nodes. We then assumed that the mutual information values followed a normal distribution, and the threshold for adding edges was selected in a predefined confidence interval (99% for our case study). In order to force nodes to a minimum number of edges, we specified a minimum threshold value (0.7 for both networks). Both networks were then clustered using the proposed approach to uncover sets of genes with similar expression profiles. The experimental results uncovered a set of 1011 clusters in the control co-expression network and a set of 676 clusters in the PD co-expression network. A comparison between them indicated that approximately 35% of the initial control clusters were preserved with similarity over 50% in the PD co-expression network. In contrast, 312 clusters are not present in the PD co-expression network and their further analysis could shed light on the molecular mechanisms of PD. In Table 3.1, we present the five largest clusters that existed in the control co-expression network, but not in the Parkinson one. In this table, we provide a list of specific GO terms, which functionally characterize the clusters.
4
Conclusion and future challenges
In the post-genomic era, an important challenge is to analyze biological systems at the network level, in order to understand the topological organization of PPI networks, identify protein complexes and functional modules, discover functions of uncharacterized proteins and obtain networks that are more exact. For the sake of these goals, a series of clustering approaches have been proposed. In the context of the present paper, we proposed an unsupervised clustering algorithm, GENA, which permits the participation of a protein to many protein complexes and can handle weighted PPI graphs. Compared to other methods for PPI clustering, GENA achieved a significant improvement in the prediction of protein complexes (on average 5.5% improvement when the MMR metric was used). As a future work, we intend to apply GENA to more available PPI networks and extend the analysis to different organisms. GENA could significantly contribute to the discovery of novel protein complexes for organisms that have not been extensively studied so far. Current clustering approaches mainly focus on detecting clusters in static PPI networks. However, both the PPIs and protein complexes are dynamically organized when performing specific
functions. Dynamic modules generally correspond to the sequential ordering of molecular events in cellular systems. The way to explore dynamic modules from static PPI networks is a very difficult task and should definitely be addressed by the GENA algorithm in a future research work. Acknowledgements InSyBio participates in the NBG Business Seeds program by the National Bank of Greece. References [1]
K. Theofilatos, C. Dimitrakopoulos, A. Tsakalidis, S. Likothanassis, S. Papadimitriou, S. Mavroudi, Computational Approaches for the Prediction of Protein-Protein Interactions: A Survey. Current Bioinformatics 6(4), 398—414 (2011).
[2]
C. Von Mering, R. Krause, B. Snel, Comparative assessment of large data sets of protein–protein interactions. Nature 417(6887):399–403 (2002).
[3]
V. Spirin and L. Mirny, Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci USA, 100(21):12123-8 (2003).
[4]
G. Bader and C. Hogue, An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4:2 (2003).
[5]
A. King, N. Przulj, I. Jurisica, Protein complex prediction via cost-based clustering. Bioinformatics, 20(17):3013-20 (2014).
[6]
A. Enright, S. van Dongen, and C. Ouzounis, An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30, 1575–1584 (2002).
[7]
P. Aloy, B. Bottcher, H. Ceulemans, C. Leutwein, C. Mellwig, S. Fischer et al., Structure-based assembly of protein complexes in yeast, Science 303 2026-2029 (2004).
[8]
C. Friedel, J. Krumsiek and R. Zimmer, Bootstrapping the Interactome: Unsupervised Identification of Protein Complexes in Yeast, Journal of Computational Biology 16(8) 971-987 (2009).
[9]
S. Pu, J. Vlasblom, A. Emili, J. Greenblatt, SJ Wodak, Identifying functional modules in the physical interactome of Saccharomyces cerevisiae. Proteomics 7 944–960 (2007).
[10]
G. Bader and C. Hogue, An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 2 (2003).
[11]
G. Liu, L. Wong, H. Chua, Complex discovery from weighted PPI networks,
Bioinformatics 25, 1891–1897 (2009).
[12]
T. Neputz, H. Yu and A. Paccanaro, Detecting overlapping protein complexes in protein-protein interaction networks. Nature Methods 9, 471-472 (2012).
[13]
N.J. Krogan, G Cagney, H Yu, G Zhong, X Guo, A Ignatchenko, et al. Global landscape of protein complexes in the yeast. Nature 440, 637–643 (2006).
[14]
S.R. Collins, P Kemmeren, XC Zhao, JF Greenblatt, F Spencer, FC Holstege et al., Towards a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol. Cell. Proteomics 6, 439–450 (2007).
[15]
H.W. Mewes, C Amid, R Arnold, D Frishman, U Guldener, G Mannhaupt et al. MIPS: analysis and annotation of proteins from whole genomes. Nucl. Acids Res. 32, D41– 44 (2004).
[16]
D. Wolpert, Stacked generalization. Neural Networks 5, 241–259 (1992).
[17]
A. Gavin, P. Aloy, P. Grandi, R. Krause, M. Brosche, B. Dumpelfeld et al. Proteome survey reveals modularity of the yeast. Nature 440, 631–636 (2006).
[18]
T.S. Keshava Prasad, R. Goel, K. Kandasamy, S. Keerthikumar, S. Kumar, S. Mathivanan, et al., Human Protein Reference Database—2009 update, Nucleic Acids Research 37 (2009) D767- D772.
[19]
K., Theofilatos, C., Dimitrakopoulos, D., Kleftogiannis, C., Moschopoulos, S., Papadimitriou, S., Likothanassis et al., The Human Interactome Knowledge Base (HINT-KB): an integrative human protein interaction database enriched with predicted protein–protein interaction scores using a novel hybrid technique, Artificial Intelligence Review, Springer Netherlands, 1-17 (2013).
[20]
A. Ruepp, B. Waegele, M. Lechner, B. Brauner, I. Dunger-Kaltenbach, G. Fogo et al., CORUM: the comprehensive resource of mammalian protein complexes—2009, Nucleic Acids Research 38 (database issue) D497-501 (2009).
[21]
F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, D. Parisi, Defining and identifying communities in networks. D. Proc. Natl. Acad. Sci. USA101, 2658–2663 (2004).
[22]
S. Van Dongen, Graph clustering by flow simulation. In PhD thesis Centers for mathematics and computer science (CWI), University of Utrecht; (2000).
[23]
InSyBio
BioNets
White
Paper.
Online
available
at:
http://www.insybio.com/files/BioNets_WhitePaper.pdf (Accessed 1 May 2015) [24]
S. Brohee, J. van Helden, Evaluation of clustering algorithms for protein-protein interaction networks, BMC Bioinformatics 7, 488 (2006).
[25]
K. Theofilatos, N. Pavlopoulou, C. Papasavvas, S. Likothanassis, C. Dimitrakopoulos, E. Georgopoulos, et al., Evolutionary Enhanced Markov Clustering - EEMC: A Novel
Unsupervised Methodology for Predicting Protein Complexes From Weighted ProteinProtein Interaction Graphs, Artificial Intelligence in Medicine, Mar;63(3):181-9 (2015). [26]
I. Rivals, L. Personnaz, L Taing, & MC Potier. Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics,23(4), 401-407 (2007).
[27]
M Ashburner, CA Ball, JA Blake, D. Botstein, H. Butler, JM. Cherry, & G. Sherlock. Gene Ontology: tool for the unification of biology. Nature genetics, 25(1), 25-29 (2000)
[28]
S. Kikugawa, K. Nishikata, K. Murakami, Y. Sato, M. Suzuki, M. Altaf-Ul-Amin & T. Imanishi. PCDq: human protein complex database with quality index which summarizes different levels of evidences of protein complexes predicted from HInvitational
protein-protein
interactions
integrative
dataset. BMC
systems
biology, 6(Suppl 2), S7 (2012)
0.35 0.3 0.25 MCL
0.2
RNSC 0.15
Clusterone GENA
0.1 0.05 0 Krogan2006_core
Krogan2006_extended
Collins2007
Figure 3.1 Maximum matching ratio (MMR) metric for Markov clustering (MCL), restricted neighborhood
search
(RNSC),
clustering
with overlapping neighborhood expansion
(ClusterONE) and gradually expanding neighborhoods with adjustment (GENA) on the three saccharomyces cerevisiae input networks and the evaluation dataset of Munich information center for protein sequences (MIPS)
0.7 0.6 0.5 MCL
0.4
RNSC 0.3
Clusterone GENA
0.2 0.1 0 Krogan2006_core
Krogan2006_extended
Collins2007
Figure 3.2 Maximum matching ratio (MMR) metric for Markov clustering (MCL), restricted neighborhood
search
(RNSC),
clustering
with overlapping neighborhood expansion
(ClusterONE) and gradually expanding neighborhoods with adjustment (GENA) on the three saccharomyces cerevisiae input networks and the evaluation dataset of BT_409
0.9 0.8 0.7 0.6 MCL
0.5
RNSC
0.4
Clusterone
0.3
GENA
0.2 0.1 0 Krogan2006_core
Krogan2006_extended
Collins2007
Figure 3.3 Maximum matching ratio (MMR) metric for Markov clustering (MCL), restricted neighborhood
search
(RNSC),
clustering
with overlapping neighborhood expansion
(ClusterONE) and gradually expanding neighborhoods with adjustment (GENA) on the three saccharomyces cerevisiae input networks and the evaluation dataset of Aloy.
0.6
0.5
0.4 MCL RNSC
0.3
Clusterone GENA
0.2
0.1
0 Krogan2006_core
Krogan2006_extended
Collins2007
Figure 3.4 Maximum matching ratio (MMR) metric for Markov clustering (MCL), restricted neighborhood
search
(RNSC),
clustering
with overlapping neighborhood expansion
(ClusterONE) and gradually expanding neighborhoods with adjustment (GENA) on the three saccharomyces cerevisiae input networks and the evaluation dataset of Pu.
0.35 0.3 0.25 MCL
0.2
RNSC
0.15
Clusterone
0.1
GENA
0.05 0 Corum Complexes
Corum Complexes HPRD
Corum Complexes HPRD 50%
Figure 3.5 Maximum matching ratio (MMR) metric for Markov clustering (MCL), restricted neighborhood
search
(RNSC),
clustering
with overlapping neighborhood expansion
(ClusterONE) and gradually expanding neighborhoods with adjustment (GENA) on the three human evaluation datasets (section 2.1) and the human protein reference database (HPRD) input network.
0.3 0.25 0.2 MCL 0.15
RNSC Clusterone
0.1
GENA 0.05 0 Corum Complexes
Corum Complexes HPRD
Corum Complexes HPRD 50%
Figure 3.6 Maximum matching ratio (MMR) metric for Markov clustering (MCL), restricted neighborhood
search
(RNSC),
clustering
with overlapping neighborhood expansion
(ClusterONE) and gradually expanding neighborhoods with adjustment (GENA) on the three human evaluation datasets (section 2.1) and the human interactome knowledge base (HINTKB) input network.
Table 3.1 Comparative results in the functional characterization of human protein complexes Method
Percentage of functionally characterized clusters using GO-terms with depth higher than five in the GO hierarchy
PCD-q
35.6%
EE-MC
43.21%
GENA
47.03%
Table 3.1: List of five most significantly altered gene clusters between control and Parkinson gene co-expression networks and their functional annotations. Cluster Number
Cluster’s Genes
Clusters specific GO terms
1
AU146646, USP18, TMEM140, RAC1, S100A11, CXCR2, MSRB1, LILRB3, TALDO1. NCF4, ADAR. PAK1. RAB31, FPR1, IL1B, PYGL, LIMK2, FRAT2, CDC42EP3, CMTM6, LILRA6, USP10 CDA, IFNGR2, IMPA2, NCF4, TALDO1, ADAR, RAC1, LILRB3, USP10, LST1, PAK1, MSRB1, CDC42EP3, PYGL, IL1B, LCP1, LIMK2, FPR1, RAB31, FRAT2, CMTM6, LILRA6
intermediate filament binding
3
RPS6, EIF3F, RPS29, RPS16, RPS10
PTGES3, RPS13, RPS20, RPS28, RPL19, UXT, RPS27A, RPL14, SSRP1, RPS5, TRIM44, FBL, MIF, RPS15, EIF3K, RPL7A, COX4I1,
telomerase activity, dopachrome isomerase activity, cytokine receptor binding, prostaglandin-E synthase activity
4
RPL24, MYL12B, RPL13A, DAP3, RPL32, RPS2, EEF1G, RPL11, RPL38, EDF1, RPL37, RPL13, RPS6, RPL29, RPS13, RPS20, RPL28, HTT, RPL10A, RPL5, RPS21, SNRPD3
5S rRNA binding, dynactin binding, dynein intermediate chain binding, diazepam binding, histone pre-mRNA DCP binding
5
RNF14, TESC, LGALS3, XK, UBE2H, PRDX2, SELENBP1, RPIA, PSMF1, BCL2L1, SNCA, GSPT1, EPB42, ASCC2, DMTN, SLC6A8, MARCH8, ANK1, KRT1, MPP1, GYPA, MYL4
translation release factor activity, guanylate kinase activity, ribose-5phosphate isomerase activity, protein kinase inhibitor activity, creatine transmembrane transporter activity, creatine:sodium symporter activity, ferrous iron binding, thioredoxin peroxidase activity, choline transmembrane transporter activity, phosphatase inhibitor activity, ubiquitin-like protein transferase activity, IgE binding, spectrin binding, Hsp70 protein binding, myosin II heavy chain binding, MHC class II protein binding, dynein binding, monosaccharide binding, tau protein binding, arachidonic acid binding, BH3 domain binding, phospholipase D inhibitor activity
2
vitamin binding, GTP-dependent protein binding, thioesterase binding, bile acid binding, peptide-methionine (R)-S-oxide reductase activity, inositol monophosphate 3-phosphatase activity, inositol monophosphate 4phosphatase activity, methionine-R-sulfoxide reductase activity