Accepted Manuscript Title: Multiobjective optimization to reconstruct biological networks Author: Ahmed Naef Rosni Abdullah Nuraini Abdul Rashid PII: DOI: Reference:
S0303-2647(18)30131-X https://doi.org/doi:10.1016/j.biosystems.2018.09.003 BIO 3877
To appear in:
BioSystems
Received date: Revised date: Accepted date:
8-4-2018 16-7-2018 11-9-2018
Please cite this article as: Ahmed Naef, Rosni Abdullah, Nuraini Abdul Rashid, Multiobjective optimization to reconstruct biological networks, (2018), https://doi.org/10.1016/j.biosystems.2018.09.003 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
ip t
Multiobjective Optimization to Reconstruct Biological Networks Ahmed Naefa,∗, Rosni Abdullaha,b , Nuraini Abdul Rashidc a School
of Computer Sciences, Universiti Sains Malaysia, 11800 USM, Penang Advanced IPv6 Centre (Nav6) 6th Floor, School of Computer Sciences, Universiti Sains Malaysia, 11800 USM, Penang c College of Computer and information sciences, Princess nourah bint Abdulrahman university, KSA
us
cr
b National
an
Abstract
Automated methods for reconstructing biological networks are becoming increasingly important in computational systems biology. Public databases con-
M
taining information on biological processes for hundreds of organisms are assisting in the inference of such networks. This paper proposes a multiobjective
d
genetic algorithm method to reconstruct networks related to metabolism and protein interaction. Such a method utilizes structural properties of scale-free
te
networks and known biological information about individual genes and proteins to reconstruct metabolic networks represented as enzyme graph and protein
Ac ce p
interaction networks. We test our method on four commonly-used protein networks in yeast. Two are networks related to the metabolism of the yeast: KEGG and BioCyc. The other two datasets are networks from protein-protein interaction: Krogan and BioGrid. Experimental results show that the proposed method is capable of reconstructing biological networks by combining different omics data and structural characteristics of scale-free networks. However, the proposed method to reconstruct the network is time-consuming because several evaluations must be performed. We parallelized this method on GPU to overcome this limitation by parallelizing the objective functions of the presented method. The parallel method shows a significant reduction in the execution ∗ Corresponding
author Email address:
[email protected] (Ahmed Naef)
Preprint submitted to Journal of LATEX Templates
July 16, 2018
Page 1 of 43
time over the GPU card which yields a 492-fold speedup. Keywords: Biological Network Reconstruction, Topological Properties,
ip t
Metaheuristic, Optimization.
cr
1. Introduction
Recent advances in high-throughput techniques have produced a huge amount
us
of genomic, proteomic and metabolic data. A lot of work has been done to organize, study and analyze such data in order to increase the knowledge of living systems [1, 2]. Among the presented works, several studies have shown that
an
5
biological activities are often carried out in the form of interacting networks [3]. Reconstructing such biological networks has been a priority in system biology,
M
where various biological networks such as gene regulatory networks, protein interaction networks and metabolic networks are reconstructed from genomic, 10
proteomic and metabolic data in an effort to increase the knowledge of living
d
systems at the cellular level [4]. In this work, we focus on algorithms to reconstruct protein interaction networks and metabolic networks.
te
Lacroix et al. [5] defined metabolic networks as “A collection of objects and the relations among them. The objects represent chemical compounds, biochemical reactions, enzymes, and genes.” Reconstructing a metabolic network,
Ac ce p
15
therefore, involves detecting the relations among genes, proteins (enzymes), and reactions in a given metabolic system. Here, we represent the metabolic network as a graph, G = (V, E), where nodes V represent objects (chemical compounds, biochemical reactions, enzymes, or genes) and edges E represent the relations
20
among those objects. Such representation provides a platform for systematic integration of omics data with metabolic networks in order to explore new biological patterns and hypotheses [6]. In particular, the enzyme graph view of cellular metabolism is used to construct the theoretical graph representation in which the nodes represent enzymes and the edges represent enzyme-enzyme
25
relations (i.e., there is an edge between two enzymes if they catalyze two consecutive reactions in a particular pathway). In other words, there is an edge
2
Page 2 of 43
between two enzymes if they catalyze two consecutive reactions that have a compound in common which is considered as the product of one reaction and
30
ip t
the substrate of the other reactions.
According to the studies published in the context of biological networks
cr
analysis literature, recent expansion of databases involving a massive biological
information have facilitated many researchers to study the current knowledge
us
of biological networks. However, most organism-specific metabolic networks remain incompletely characterized and contain many missing enzymes and reac35
tions in known pathways [7, 8, 9, 10, 11]. Therefore, one of the major challenges
an
in computational system biology is to gain better models of metabolic networks and identify candidate interactions for completing the pathways [9]. In addition, reconstructing a high-confidence protein-protein interaction network improves
40
M
the prediction of protein complex - a group of proteins that achieve a particular biological function within the cells [12].
Previous research in systems biology at the molecular level has already pro-
d
vided a partial view of metabolic networks for different organisms [13]. Many
te
methods have been developed to predict and analyze the metabolic networks. Most of such methods attempt to infer the network de novo in which the data about individual genes, proteins, metabolites, biochemical reactions and
Ac ce p
45
pathways are given, and edges are inferred from these data only, using a variety of inference approaches. Examples include Karp et al. [14, 15], Tsoka et al. [16], and M. Chen and Hofest¨adt [17] developed methods based on genome annotations to construct metabolic pathways for an organism. They infer the
50
reactions involved in the genome annotation and then they infer the metabolic pathways from these reactions. Yousofshahi et al. [18], Meitalovs [19], and McShan et al. [20] utilized the known reactions retrieved from KEGG databases. Compound chemical structure retrieved from KEGG database is also used by Moryia et al. [21] and K. Tanaka et al. [22] in the process of metabolic network
55
reconstruction. Moryia et al. [23] used BLAST to compare amino acid pairs and then they used KEGG Ortholog (KO) identifiers to group such amino acid sequences. On the other hand, based on the idea that ’omics’ datasets may 3
Page 3 of 43
contain errors which lead to incorrectly metabolic pathways prediction, other approaches developed to reconstruct biological networks used prior information about the individual genes, proteins (enzymes) and reactions associated with
ip t
60
supervised machine learning algorithms (Yamanishi et al. [11], Nakamura et
cr
al. [24], Bleakley et al. [25], Yip and Gerstein [26], Kotera et al. [27],Vert et al. [28],Yamanishi et al. [29], and Dale et al. [30]).
65
us
In addition to biological network reconstruction, analyzing the structural features of the biological networks has become a focus of attention in an effort to reveal the fundamental design principles for the large-scale organization of in-
an
teractions in all cells and microorganisms [31]. Many studies have analyzed the topological characteristics of biological networks, including metabolic and protein interaction networks. Such studies uncovered several significant topological characteristics of the networks including scale-free property in these networks
M
70
of many organisms where few nodes have lots of interactions, and many nodes have few interactions with others. Therefore, the probability p(k) that a node
d
has k links to other nodes is proportional to k −γ , where γ is the degree ex-
75
te
ponent taking values between 2 and 3 and this is regarded as the definition of the power-law degree distribution property. Small average path length is also a
Ac ce p
primary characteristic of scale-free networks defined as the average of the shortest distance d between each pair of nodes in the network, and high average clustering coefficient is also a common property to many real complex networks including metabolic networks [31, 32, 33].
80
Despite the importance of topological characteristics in providing essential
principles of the organization of interactions in metabolic networks, the existing methods whether direct de novo or supervised machine learning methods have
relied only on the known biological information. Furthermore, to the best of our knowledge, there is no evolutionary algorithm in the literature proposed to
85
reconstruct metabolic networks although evolutionary algorithms have been successfully applied to many problems such as inferring gene regulatory networks [34, 35, 36]. The terms structural and topological are used interchangeably in the literature. In this study, we hypothesize that the integration of prior bio4
Page 4 of 43
logical information and structural characteristics of scale-free networks in the 90
process of metabolic network reconstruction improves the identification of sig-
ip t
nificant interactions and, thus, completing pathways. For this purpose, we develop a multiobjective genetic algorithm based method considering both known
cr
biological information (expression, localization, phylogenetic profiles) and three topological characteristics of the scale-free network (degree distribution, average shortest path length and clustering coefficient) for network reconstruction.
us
95
The remainder of this paper is organized as follows. Section 2 describes the method and materials. The GPU-based solution of the developed method is
an
presented in section 3. Section 4 provides the experimental setup, results and
100
2. Method and Materials
M
discussion. Section 5 presents the conclusion and future work.
In this study, biological network reconstruction is considered as a multiob-
d
jective optimization problem. We use genetic algorithm (GA) which has been the most popular heuristic approach for multiobjective optimization problems
105
te
[37]. In this study, two fundamental bases considered are biological data and the topological characteristics of the scale-free network. The steps of th pro-
Ac ce p
posed GA flowchart are illustrated in Figure 1 and detailed in the following subsections.
2.1. Generate Initial Population In the first step of the present method, we generate an initial population.
110
We encoded the chromosome as an adjacency matrix of size N × N , where N is the network size, the rows and columns represent proteins. Each element, aij , in the matrix equals 1 if its corresponding proteins interact with each other, otherwise it equals 0. A typical property of many real networks, including biological networks, is
115
that the vertex connectivities follows the scale-free power law distribution. Such property is the result of the contentious growth of the network by the addition of
5
Page 5 of 43
Start
ip t
Generate initial population
Evaluate fitness
cr
Rank the population using NSGA-II Sort the population using nondominant sorting approach
us
Compute crowding distance between individuals of each front
Select elite individuals
an
Select highest ranked individuals
Perform mutation
M
New generation
Max. Iteration is reached
No
Yes
Return optimal solution found, 1𝑠𝑡 𝑓𝑟𝑜𝑛𝑡.
d
End
te
Figure 1: Flow Chart of GA-Based Network Inference Method
Ac ce p
new nodes that attach preferentially to already well-connected nodes [38]. On the basis of the mentioned property, we started the GA run from an informed non-random solutions. Each solution of the population is initialized with a
120
different small seed network derived from gold standard datasets. The use of this strategy avoids using random initialization, which led to very poor initial solutions (random graph). However, it is necessary first to obtain gold standard network data set in order to initialize the solutions. 2.1.1. Metabolic network
125
In the case of applying our method to reconstruct metabolic network, we first assembled the gold standard metabolic network (KEGG and BioCyc) as described below: a. Obtain genome annotation. Genomic information is important to pre6
Page 6 of 43
Table 1: Data Sources Used for Metabolic Network Reconstructions Name Genome annotation databases: https://www.ncbi.nlm.nih.gov/genbank/
SGD
http://www.yeastgenome.org/
Enzyme databases: ExPASy
https://www.expasy.org/
cr
GenBank
ip t
Link
us
http://downloads.yeastgenome.org/curation/
SGD
chromosomal_feature/dbxref.tab UniProt
http://www.uniprot.org/
an
Metabolic databases:
http://www.kegg.jp/
BioCyc
https://biocyc.org/
Protein localization databases:
M
KEGG
UniProt
http://www.uniprot.org/
huh et al.
http://yeastgfp.yeastgenome.org/
Gene Expression Omnibus
https://www.ncbi.nlm.nih.gov/geo/ http://www-genome.stanford.edu/yeast_stress/index.
te
Gasch et al.
d
Gene expressionist databases:
shtml
Ac ce p
Spellman et al.
http://genome-www.stanford.edu/cellcycle/
Phylogenetic protein profile: KEGG Ortholog Cluster
http://www.genome.jp/tools/oc/
cisely provide the gene properties concerning the organisms genome. As
130
the network reconstruction depends primarily on the genome annotation, it is important to download the most recent version available to ensure the accurateness. We use two database resources to retrieve the genome annotation for Saccharomyces cerevisiae and then extract all genes and gene products.
135
b. Identify candidate metabolic genes (enzyme proteins). In the case of a metabolic network draft reconstruction, it is particularly important to be able to determine the genes encoded to enzyme proteins. This is a straight-
7
Page 7 of 43
EC 3.1.3.11
YMR205C
EC 2.7.1.11
YKL060C
EC 4.1.2.13
YCR012W
EC 2.7.2.3 (a)
R00771: D-glucose 6-phosphate = D-fructose 6-phosphate
6-phosphofructokinase E.C. 2.7.1.11 Fructose-bisphosphatase E.C. 3.1.3.11
R00756: ATP + D-fructose 6-phosphate = ADP + D-fructose 1,6-bisphosphate
ip t
EC 5.3.1.9
YLR377C
KEGG Reaction ID: Reaction
Glucose-6-phosphate isomerase E.C. 5.3.1.9
R00762: D-fructose 1,6-bisphosphate + H(2)O = D-fructose 6-phosphate + phosphate R01068: D-fructose 1,6-bisphosphate = glycerone phosphate + D-glyceraldehyde3phosphate R01512: ATP + 3-phospho-D-glycerate = ADP + 3-phospho-D-glyceroyl phosphate
Fructose-bisphosphate aldolase E.C. 4.1.2.13 Phosphoglycerate kinase E.C. 2.7.2.3
(c)
EC 5.3.1.9
E.C. 2.7.1.1
EC 2.7.1.1
E.C. 4.1.2.13
E.C. 3.1.3.11
EC 4.1.2.13
E.C. 2.7.2.3
EC 5.4.2.11 (b)
KEGG Reaction ID
KEGG Reference Pathway ID
R00771
path:rn00500
R00756
path:rn00680, path:rn01120, path:rn01200
R00762
path:rn00680, path:rn00710, path:rn01100, path:rn01120, path:rn01200
R01068
path:rn00680, path:rn00710, path:rn01100, path:rn01120, path:rn01200
R01512
path:rn00010, path:rn00710, path:rn01100, path:rn01110, path:rn01120, path:rn01130 path:rn01200, path:rn01230
us
E.C. 3.1.3.11
an
EC 5.3.1.9
cr
YBR196C
Enzyme Name EC-Number
(d)
M
Figure 2: Using KEGG API, various extracted information from KEGG database used in the reconstruction of draft metabolic network (a) enzyme proteins with their EC numbers, (b) list of enzyme-enzyme relation catalyzing successive reactions (c) list of enzymes with reactions
d
(d) list of pathways involving particular reactions.
te
forward step once the genome annotation is retrieved. We retrieved such metabolic genes from the annotation information (e.g., those genes having EC-number field as per represented in the Genbank file), as shown in
Ac ce p
140
Figure 2.a.
c. Obtain metabolic pathways including all biochemical reactions catalyzed by enzymes identified in step b. We use KEGG and BioCyc databases, currently the most widely used, to retrieve all pathways of the organism
145
of interest (See Figures 2c and 2d).
d. Assemble draft reconstruction. Once enzyme functions and biochemical reactions catalysed by such enzymes have been tentatively defined, we derive the enzyme-enzyme relation. Two enzymes are linked when they catalyze successive reactions in known pathways (See Figures 2b, 2c, and
150
2d). Draft reconstruction is the first step in the process of reconstructing metabolic networks [39]. It is based on the genome annotation of the target organism and biochemical databases. The entire draft reconstruction is 8
Page 8 of 43
A candidate solution S
...
0 1
...
...
...
...
...
...
0
Fit(Degree_distribution( S))
v2 & v3
Average_clustering_coefficient( S)
v4
ip t
1
v1
v5
∑g(S, K genexpr)
cr
...
0
Average_path_length( S)
v6
∑g(S, K protloc) ∑g(S, K protphy)
v7
us
0
Fitness values
Objective Functions
Adjacency matrix of S
an
Figure 3: An illustration of candidate solution evaluation.
followed by manual or computational approaches to refine such network
155
M
by inferring missing reactions. This step is also used in the literature as a method to reconstruct a gold standard metabolic network [29] [9].
d
2.1.2. Protein interaction network
We also use the proposed method to reconstruct protein interaction network
te
using high-quality datasets BioGrid used by Yip and Gerstein [40] and Krogan et al. datasets [41]. Therefore, part of those datasets will be used to generate the initial population.
Ac ce p
160
2.2. Fitness Evaluation
In the second step, the fitness of the individuals are evaluated. We consider
two fundamental aspects of which we calculate seven fitness values for each individual (see Figure 3):
165
2.2.1. Topological Characteristics. As mentioned earlier, despite network diversity, most biological networks share three prominent structural features: power-law degree distribution, small average path length distance and high average clustering coefficient. Thus, the topological-based objective functions used to evaluate the reconstructed network
170
were designed using these three features. The objective functions are as follows. 9
Page 9 of 43
1. Degree distribution. We provided an objective function that calculates the degree distribution values of each individual. Then, the powerlaw
ip t
Python package developed by Alstott et al. [42] was used to examine the distributions. With the Fit and Distribution objects included in this
package, one can determine whether the degree distribution of a candidate
cr
175
solution is considered a good fit to power-law degree distribution. The
law distribution. (n)
us
larger the likelihood fit, the better the distribution is fitted to the power-
2. Average path length. Let D = {duv } be the shortest path between each
M
an
pair (u, v) in undirected graphs. the shortest path connecting u and v duv = 0, u=v ∞, otherwise
We calculated the average shortest path length L of each individual using Floyd-Warshall algorithm [43]- a simple and widely used algorithm to
d
180
compute shortest paths between all pairs of vertices in a graph [44]. We
te
then evaluated L on the basis of Barabasi and Oltvai's [33] finding: scalefree networks have short average path lengths L goes as L ∼ log(log(N ))
Ac ce p
where N is the network size. In this objective function, we minimize the 185
difference between the average path lengths L and Barabasi and Oltvai's findings log(log(N )).
3. Clustering coefficient. An important feature of the scale-free network structure also considered in this work is that the candidate solution should have a high average clustering coefficient value. We provide an objective function that evaluates the fitness values of the individuals of each generation in terms of the average clustering coefficient. The local clustering coefficient Ci for a node vi in undirected graphs is calculated as: Ci =
2E ki (ki − 1)
(1)
Where E is the number of edges connecting the neighbours of the node vi and ki is the degree of vi . The average of the local clustering coefficients 10
Page 10 of 43
of all the vertices n is therefore given by: n
190
1X Ci n i=1
(2)
ip t
hCi =
The larger the average clustering coefficient hCi is, the better the solution.
cr
As a matter of fact, an entire network cannot be perfectly described by a small set of statistical measures [45]. Furthermore, all the existing reconstruction
us
methods based on available biological data resulted in identifying new interactions for completing pathways. In this study, we also consider prior knowledge 195
about individual genes or proteins such as the level of gene expression, protein
an
localization, and phylogenetic profiles to support the process of the network reconstruction.
M
2.2.2. Prior Biological Information
1. Gene expression profile. Early studies demonstrated that many genes 200
with similar expression profiles tend to have similar function [46], and
d
many recent machine learning classification algorithms underlined that
te
gene function could be predicted from microarray data [47]. Gene expression profiles are represented as a real-valued matrix in which the rows are
Ac ce p
genes, and the columns represent experiment conditions or time series. In 205
this study, we use two microarray datasets: Spellman et al. [48] data set containing 77 experiments which focus on the identification of cell-cycleregulated genes and Gasch et al. [49] datasets involving 173 experiments which studied the responses of yeast cell to environmental changes. First, each data set was pre-processed by estimating the missing expression val-
210
ues using K-nearest neighbours (KNNimpute) method [50]. Then, we use the self-organizing map (SOM) method to cluster the genes with similar expression profiles. It is implemented using the Kohonen package [51] in the R environment for statistical computing [52]. Finally, we constructed a kernel matrix Kgenexpr of size N × N in which rows and columns rep-
215
resent genes and each entry equals to 1 if the corresponding genes belong to the same cluster, otherwise it equals 0. 11
Page 11 of 43
2. Protein subcellular localization data. Information about the protein locations in the cell is also widely used in the reconstruction of biological
220
ip t
networks. Cells are organized into membranes and compartments, which
are specialized for various biological functions. Comprehensive knowledge
cr
of protein subcellular localization is critical for understanding their functions and interactions [53]. As such, proteins carrying out their tasks in
us
the same subcellular localization can form a protein complex (functionally related/linked to each other). We utilize such prior biological information 225
and use the localization information of the large-scale budding yeast local-
an
ization experiment [53]. This dataset describes the location of proteins in 22 intracellular locations such as the nucleus, Golgi, mitochondrion, etc. We build protein subcellular localization profiles represented as a binary
230
M
matrix, where the rows represent proteins and columns represent intracellular locations and the elements are either 1 or 0 representing the presence and absence of the protein in a certain intracellular location, respectively.
d
Because these profiles in the mentioned dataset describe properties of indi-
te
vidual protein and not pairs of proteins, we constructed the kernel matrix, Kprotloc as a measure of similarity between proteins with respect to the protein subcellular localization data by using linear kernel.
Ac ce p
235
3. Phylogenetic profiles. The phylogenetic profile of an orthologous protein is a vector of length n which represents the names of different organisms that contain that protein [54]. The comparisons of protein phylogenetic profiles are used for identifying functionally related and physically linked
240
proteins [55][56][57].We built protein phylogenetic profiles from KEGG Orthology (KO) and KEGG Orthology Cluster (KEGG OC) [58] databases. However, the constructed profiles provide information on individual genes rather than a pair of genes. We, therefore, built a kernel matrix representing the similarity between each pair of genes in which rows and columns
245
represent enzyme proteins and the entries are the similarity between the corresponding proteins. We applied the Jaccard similarity measure to the phylogenetic profile data which resulted in a kernel matrix Kprotphy . As 12
Page 12 of 43
we are interested in a binary similarity matrix, each entry of the kernel matrix Kprotphy is converted to 1 if it is greater than a threshold, τ , oth-
ip t
erwise it equals to 0. We further amend the Kprotphy by utilizing KEGG
250
OC tools and assigning 1 for each pair of enzyme proteins having the same
cr
OCs number, OCs includes 1176030 OCs, obtained by clustering 8357175 genes in 2112 fully sequenced genomes (153 eukaryotes, 1830 bacteria and
us
129 archaea).
Given Kgenexpr , Kprotloc and Kprotphy , we evaluated each individual S of each generation as follows.
255
XX
g(aij , kij )
an
f (S, K) =
(3)
Where aij is an element of the adjacency matrix representing the individual
M
S and its value is either 1 (there is an interaction between the corresponding proteins) or 0 (there is no interaction between the corresponding proteins); kij is an element from either gene expression profile kernel Kgenexpr , protein
and g(aij , kij ) returns 1 if aij = kij , otherwise it returns 0. The term g(.)
te
260
d
subcellular localization kernel Kprotloc or protein phylogenetic kernel Kprotphy
determines the proportion of edges in a candidate solution that has evidence
Ac ce p
based on the considered omics data sets. The maximum value is reached when each edge between a pair of proteins inferred has similar profiles in the considered omics data sets.
265
2.3. Parent Selection
In this step a set of parent individuals is selected from the population for reproduction. We used the crowded binary tournament selection method [59] used in NSGA-II, which is one of the most popular Pareto dominance based MOEAs [60]. The characteristic features of NSGA-II are its fast nondominated sorting
270
approach for ranking solutions. At any given generation:
Let P : n−dimensional candidate solutions (population). P = [s1 , s2 , ..., sn ] 13
Page 13 of 43
Let F : m−dimensional objective functions.
ip t
f (s) = [f1 (s), f2 (s), ...fm (s)]
In NSGA-II, all solutions of the population, P , are first grouped into different
cr
nondominated groups called Pareto fronts. For each solution, one has to check dominance relationship by determining how many solutions dominate it and the set of solutions to which it dominates. Given any two solutions si and sj , there
si dominates sj ,
si is being dominated by sj ,
Neither solution dominates.
M
an
are three cases of dominance relationship:
us
275
si dominates sj if the solution si is no worse than sj in all fitness values and the solution si is strictly better than sj in at least one value. This dominance
d
relationship is used to attribute rank in which all nondominated individuals had
te
a rank of 1 and assigned to Pareto front 1, all individuals that being dominated by other individuals of front 1 had a rank 2 and assigned to Pareto front 2,
Ac ce p
and so on. Then, the density of individuals within each front is estimated. Here, a specific crowding distance is used which calculates the average distance of two neighbouring individuals on either side of this individual along each of the fitness values. Every individual i in the population has two attributes: a nondomination rank (irank ) and a crowding distance to the closest neighbours (idist ). The partial order ≺ c is based on these two attributes and defined as: i ≺ j : irank < jrank ∨ (irank = jrank ∧ idist > jdist )
280
During the selection operation, the individuals are selected based on their rank and crowding distance values. NSGA-II selects individual with lesser rank when both lie on different fronts. However, the one with a higher crowding distance value is selected when both individuals belong to the same front.
14
Page 14 of 43
ip t cr us
Figure 4: Mutation Operator
285
an
2.4. Crossover
Crossover operator is an important step in most traditional GA implementations. It can be used to produce a single solution from two highly fit separate
M
individuals. In this work, general crossover operations between two individuals resulted in rewiring the edges of the network and changes the structural properties and the functional form of the candidate solutions. Thus, we excluded
ability. 2.5. Mutation
d
crossover operator and instead mutation operator is performed with high prob-
te
290
Ac ce p
In this study, we examined three different mutation operators that play
important roles in the process of network reconstruction. We utilized the prior biological information as well as the growth and preferential attachment feature leading to a scale-free network. Therefore, the fourth step, mutation operator,
can be described as follows: at each time, a new vertex, vi , representing a protein is added to the network (the selected solution). Such node connects with m edges where m ≤ the number of existing vertices V . The added vertex is connected with m existing vertices V based on three different criteria: (1) growth and preferential attachment: each vj in V is connected to the added vertex vi with a probability proportional to the degree of the node p(kj ) according to the
equation (4): P (kj ) = P
kj
l∈V,l6=j
kl
(4)
15
Page 15 of 43
generate initial population (init pop) of size (popSize);
2
while max iteration is not reached do
ip t
1
calculate fitness values for each individual by using multiobjective
3
cr
function; select parent using NSGA-II ;
5
/* perform mutation by expanding each individual in the
us
4
6
vi ← a new vertex;
7
for i ← 1 to m do
*/
an
population by using one of the following cases:
vj ← a vertex in the current individual; P Pj = kj / l∈V,l6=j kl ;
8 9
M
/* where kj and kl are the degrees of the vertices vj
10
and vl // Case 1:
growth and preferential attachment
12
add an edge between vi and vj with probability = Pj ;
13
// Case 2:
14
s = Kgenexpr (vi , vj ) + Kprotloc (vi , vj ) + Kprotphy (vi , vj );
15
if s > 2 then
d
11
*/
Ac ce p
te
prior biological data
add an edge between vi and vj ;
16
17
end
18
// Case 3:
combination of growth and preferential
attachment and prior biological data
if rand ≤ Pj and s > 1 then
19
add an edge between vi and vj ;
20
end
21
end
22 23
end
Algorithm 1: Pseducode for GA used for biological network reconstruction
16
Page 16 of 43
where kj and kl are the degrees of the vertices vj and vl so that vertices with higher degrees have more chances to be connected to the newly added vertex. (2) prior biological information: each vj in V is connected to the added vertex vi
ip t
295
if the following three conditions are satisfied (i ) similar gene expression profiles,
cr
(ii ) similar subcellular localization profiles, and (iii ) similar phylogenetic profiles. (3) In the third case, we combined the growth and preferential attachment
300
us
and prior biological information conditions, see Figure 4. In this case, we only consider two of the conditions presented in criteria (2).
an
The steps 2 to 4 are repeated until a maximum number of iteration is reached. Overall, Algorithm 1 summarizes the steps of our method proposed for recon-
305
M
structing biological network.
3. Design and Implementation of the Proposed GA-based Method
d
Using GPU
This section focuses on the design and implementation of the proposed recon-
te
struction GA-based method using GPU. In this study, we concentrated mainly on parallelizing the objective functions of the presented GA-based method not to parallelize the complete method. First of all, at the initialization phase, mem-
Ac ce p
310
ory on the device was allocated, and the data was transferred to the GPU to calculate the fitness values for each individual in the population. We represent an individual as an array of size N × N . Let L be the size of the population, L arrays are allocated to L multiple processors (MP) at the same time.
315
Then, three different kernels are executed to calculate the fitness values for the individuals.
We designed a kernel that executes a number of threads on GPU and re-
turns sorted degree distribution and the average clustering coefficient for each individual. We utilized built-in functions (e.g. atomicAdd ) so that the degree 320
distribution and the clustering coefficient of each node of an individual were simultaneously calculated and stored in shared memory and global memory, re-
17
Page 17 of 43
7
Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Thread 6 Thread 7
Node degrees:
k0=2 0 0
Shared memory:
k1=2
k2=1
1 1
k3=4
2 3
3 2
k4=5
k5=3
4 1
k6=3 6 0
7 0
8 0 P(8)
P(0)
P(1)
P(2)
P(3)
P(4)
P(5)
P(6)
P(7)
0
0.125
0.375
0.25
0.125
0.125
0
0
Clustering coefficient:
C(0) 0
C(2) 0
C(3) 0
C(4) 0.2
C(5) 0.67
0
us
Degree distribution:
C(1) 0
k7=2
5 1
cr
Threads:
5
2
ip t
6
4
3
1
0
C(6) 0.67
C(7) 1
an
Figure 5: Calculate Degree Distribution and Clustering Coefficient on GPU.
spectively. We only have 1024 threads per block. We allocated each individual to a particular block. Each thread calculates the degree ki for a particular node
325
M
vi in an individual and increases the entry of a shared memory allocated in ki by 1. Each thread, simultaneously, calculates the clustering coefficient of the
d
node vi by calculating the number of neighbours |V | and the number of edges Ei among the vertices V and stores such information in global memory. We
te
proposed (see Figure 6) a methodology to adopt GA on GPU in a generic way, and we detail this process in the example shown in Figure 5. In the host, we used the powerlaw Python package [42] to evaluate the fitness
Ac ce p
330
of the individuals using the degree distribution values calculated on GPU. Another kernel function was designed to calculate the average path length for each individual of a generation. As mentioned above, we used the Floyd-Warshall algorithm which depends on the principle of dynamic programming. We store
335
the best values (shortest path between each pair of protein) found so far in a matrix dist in global memory so that it can be accessed via all threads. Let L be the number of individuals, the elements of the L dist matrices are updated
at the same time. For each node w in the individual, each thread only examines the distance between a pair of vertices (v, u), in a particular individual, 340
passing through the node w and saves the value returned from the function min(dist(u, v), dist(u, w) + dist(w, v)) in the array dist (see Figure 6)
18
Page 18 of 43
𝐶𝑃𝑈
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑏𝑙𝑜𝑐𝑘𝑠 = 𝑝𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛 𝑠𝑖𝑧𝑒, 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡ℎ𝑟𝑒𝑎𝑑𝑠 = 𝑛𝑒𝑡𝑤𝑜𝑟𝑘 𝑠𝑖𝑧𝑒
𝐺𝑟𝑖𝑑
Copy population CPU → GPU
Block 0
Block 𝑛
Block 1
Each thread: • Calculates degree 𝑘(𝑣) of vertex 𝑣 and increases the item of a vector in a shared memory at index 𝑘 𝑣 by 1. • Calculates clustering coefficient 𝐶 𝑣 of vertex 𝑣 Save
us
• Degree distribution values of an individual to a shared memory. • Degree distribution values of population to a global memory. • Clustering coefficient of each node of each individual to a global memory
cr
Copy back degree distribution & clustering coefficient GPU→CPU
ip t
Launch 𝑘𝑒𝑟𝑛𝑒𝑙1
𝐺𝑟𝑖𝑑(𝑛𝑒𝑡𝑤𝑜𝑟𝑘 𝑠𝑖𝑧𝑒, 𝑛𝑒𝑡𝑤𝑜𝑟𝑘 𝑠𝑖𝑧𝑒) Block 1,0
Block 0,1
Block 1,1
Block 𝑛, 0
Block 𝑛, 1
an
Launch 𝑘𝑒𝑟𝑛𝑒𝑙2
Block 0,0
Copy back shortest path length GPU→CPU
Block 0, 𝑛
Block 1, 𝑛
Block 𝑛, 𝑛
M
Each thread: • Calculates the shortest path distance between a pair of vertices and stores the result in a global memory.
Figure 6: Calculate fitness values of the population on GPU with respect to degree distribu-
d
tion, clustering coefficient and average path length
te
The third kernel was designed to calculate the fitness values of each individual on the basis of omics data. Each thread is responsible for examining
Ac ce p
the presence or absence of links of a particular pair of nodes in a particular 345
individual by utilizing gene expression data, protein subcellular localisation and protein phylogenetic data as shown in Figure 7. 𝐶𝑃𝑈
Copy input data: kernel matrices CPU → GPU
𝐺𝑟𝑖𝑑
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑏𝑙𝑜𝑐𝑘𝑠 = 𝑝𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛 𝑠𝑖𝑧𝑒, 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡ℎ𝑟𝑒𝑎𝑑𝑠 = 𝑛𝑒𝑡𝑤𝑜𝑟𝑘 𝑠𝑖𝑧𝑒
Block 0
Block 1
Block 𝑛
Launch 𝑘𝑒𝑟𝑛𝑒𝑙3 Copy results GPU→CPU
• Results of the comparison (global memory)
Each thread: Save • Compares the entries of an individual with the
corresponding entries in kernel matrices: 𝑘 𝑔𝑒𝑥𝑝𝑟 , 𝑘 𝑙𝑜𝑐 𝑎𝑛𝑑 𝑘 𝑝ℎ𝑦
Figure 7: Calculate fitness values of the population on GPU with respect to prior biological information (gene expression kernel matrix kgexpr , localisation kernel matrix kloc and phylogenetic kernel matrix kphy )
19
Page 19 of 43
4. Results and Discussion
ip t
The results given in this section were collected over a system with six cores of Intelr Xeonr processor E5-2620 15M Cache, 2.00 GHz as the host CPU. 350
We are using NVIDIA Tesla K10.G2.8GB GPU mounted on the host for results
cr
of GPU implementation. K10.G2.8GB GPU comprises two identical GK104 GPUs, each with 4 GB of memory. The GPUs are connected via a PCI Express
and the compute capability of the device is 3.0. 4.1. Parameter Tuning
an
355
us
switch; the total number of multiprocessors is eight. The clock rate is 475MHz,
In this paper, we used the method introduced by Taguchi et al. [61] for tuning the parameters of the GA algorithm. The Taguchi method is based on
M
statistical and sensitivity analysis for determining the optimal set of parameters to achieve robust performance [62].
d
Four basic parameters of the proposed GA are considered for tuning, namely population size, mutation probability, elitism rate and a maximum number of
te
generations. The different levels of these parameters for the tuning process are shown in Table 2. The L16 orthogonal array (a table balanced to ensure that
Ac ce p
all levels of all factors are considered equally) is used. Sixteen experiments are required to be conducted based on the L16 orthogonal array (see Table 3). After conducting the experiments, the optimal combination of parameters can be determined using the Signal-to-noise (SN) ratio. The higher the SN ratio, the better the quality as shown in Figure 8. For this problem, we consider the '
smaller the better' category to compute the SN ratio. Therefore, the SN ratio
for each experiment can be computed using the following equation: SN = −10log(
1X 2 y ) n
(5)
Where y is a weighted combination of the multiple criteria. We used one of the first and most used methods that transform the multiobjective problem into a mono objective problem [63]. Given m values of the different objective
20
Page 20 of 43
functions, the response y is defined by:
i=1 360
λi
fi (x) − fimin fimax − fimin
(6)
ip t
y=
m X
Where m is the number of objective functions used and fimin and fimax are the
cr
lower and upper values of the different objective functions. The GA parameters used are shown in Table 4.
us
The proposed mutation operator has the parameter (m) which indicates the number of nodes that will probably connect each newly added node. In 365
order to significantly gain the advantage of this parameter, an experimental
an
test evaluation is performed to identify the best m parameter value. For this experiment, we run the proposed algorithm for each changed value of the m parameter. The starting value was set to the number of existing nodes in the
370
M
seed network N . On the basis of the fact that each new node will connect with m vertices where m is less than or equal to the number of the nodes in the
d
considered network [38]. In subsequent runs, each new node will connect to the interval of between 0.01 and 1. The parameter values reported in Table 4
te
were used, and the best accuracy of the proposed method was when setting the parameter m to the value 0.06.
The threshold τ used in the process of building the kernel matrix Kprotphy
Ac ce p
375
was similarly tuned. We used the parameter values reported in Table 4 and performed experimental tests to evaluate the best value for this parameter. We run the proposed method for each changed value of τ parameter, and the starting value was set to 0.5 and until 0.9. We recorded the best accuracy when
380
the parameter τ equals to 0.5. We studied the performance of the proposed method in three aspects. The
first one, outlined in section 4.2, studies the performance of the method when using three different mechanisms for network growth. Section 4.3 presents the GPU performance. Finally, section 4.4 analyzes the topological characteristics
385
of the biological networks.
21
Page 21 of 43
Table 2: Parameters of the proposed GA and their experimental levels.
Code
Parameter
1
2
3
4 50
Population size
P
20
30
40
2
Mutation rate
M
0.2
0.4
0.6
0.8
3
Elitism rate
E
0.05
0.10
0.15
0.20
4
No. of generation
G
200
300
400
500
us
cr
1
ip t
Level No.
an
Table 3: The experimental layout of the parameters.
Parameter Experiment
y
M
20
0.2
2
20
3
20
4
20
5
E
G
0.01
200
2.5000
M
P 1
0.02
300
2.0079
0.6
0.03
400
2.43204
0.8
0.04
500
1.68784
d
0.4
0.02
7
30
0.6
0.04
200
2.25537
8
30
0.8
0.03
300
2.32549
0.2
400
1.98763
30
0.4
0.01
500
2.43353
Ac ce p
te
6
30
9
40
0.2
0.03
500
2.58457
10
40
0.4
0.04
400
2.01768
11
40
0.6
0.01
300
2.15318
12
40
0.8
0.02
200
1.85902
13
50
0.2
0.04
300
2.24021
14
50
0.4
0.03
200
2.36538
15
50
0.6
0.02
500
2.45703
16
50
0.8
0.01
400
2.15148
4.2. Validity of the Resulted Biological Network For evaluating the performance of the proposed algorithm of using three different types for network growth: prior biological data, topological characteristics and both (see Figure 9), we used five widely-known measures: Precision, 22
Page 22 of 43
ip t cr us
an
Figure 8: SN ration at different levels of the GA parameters. Table 4: Genetic algorithm parameters.
Values
Population size
M
Genetic parameter
20 400
Mutation probability
0.8
d
Maximum iterations
0.20
te
Elitism probability
Recall, F-measure, Specificity and Accuracy to compare the findings against
Ac ce p
gold standard networks. For each edge e in the reconstructed biological net-
work N , let true positive (T P ) be the set of edges shared between the inferred
network N and the gold standard network G while false positive (F P ) is defined as the set of edges existing only in network N . False negative (F N ) is defined
as the edges that are members of the reference network G but not found in the
network N , and true negative (T N ) is the edges not available in both networks N and G. Hence, recall, precision, f-measure, specificity and accuracy scores are calculated according to the following equations: Recall(R) =
TP TP + FN
P recision(P ) =
TP TP + FP
(7)
(8)
23
Page 23 of 43
P recision × Recall P recision + Recall
Accuracy(ACC) =
ip t
Specif icity(SP ) =
(9)
TN FP + TN
(10)
TP + TN TP + FN + FP + TN
cr
F − measure(F ) = 2 ×
(11)
us
Once we selected the values of GA parameters, as shown in Table 4, we run our algorithm 20 times and calculate recall, precision, f-measure, specificity and accuracy values in each experiment. Indeed, before applying the reconstruction procedure depicted in Figure 9, we carried out a series of preliminary exper-
an
390
iments using the data set of KEGG. The experiments are conducted in two groups. The 1st -group experiment is to verify the suitability of reconstruction
M
metabolic networks by using prior biological data while the 2nd -group uses the topological characteristics. Within each group the evaluation will be based 395
on objective function for genomic data and objective functions for topological
d
characteristics (see Table 5 for the results of the 1st -group and Table 6 for the
te
2nd -group).
Table 5: The average, standard deviation, max and min of the recall, precision and f-measure
Ac ce p
validation scores used to validate 20 independent runs of the 1st experiment reconstruction methods.
Network growth Obj. functions Measure
expanding based on prior biological data
genomic data
topological characteristics
R
P
F
SP
ACC
R
P
F
SP
ACC
Min
0.29
0.69
0.41
0.76
0.46
0.27
0.72
0.39
0.78
0.43
Max
0.36
0.76
0.48
0.81
0.52
0.36
0.82
0.50
0.84
0.53
Mean
0.32
0.74
0.45
0.79
0.49
0.33
0.78
0.46
0.82
0.49
Std. Dev.
0.01
0.02
0.01
0.01
0.01
0.02
0.02
0.03
0.02
0.02
Generally speaking, the best accuracy obtained from the preliminary experiments was on average of 75% by using the genomic data in the population
24
Page 24 of 43
Table 6: The average, standard deviation, max and min of the recall, precision and f-measure validation scores used to validate 20 independent runs of the 2nd experiment reconstruction
expanding based on topological characteristics
Obj. functions
topological characteristics
P
F
SP
ACC
R
P
Min
0.59
0.69
0.64
0.76
0.68
0.60
Max
0.90
0.75
0.80
0.80
0.83
0.68
Mean
0.72
0.71
0.72
0.78
0.75
0.63
Std. Dev.
0.10
0.02
0.04
0.01
0.03
0.02
On the basis of scale-free network structural characteristics
SP
ACC
0.72
0.66
0.78
0.69
0.80
0.72
0.83
0.74
0.76
0.69
0.81
0.72
0.03
0.01
0.02
0.01
an
Expand each candidate network
A Set of seed networks
F
us
R
Evaluate the new networks
M
Measure
genomic data
cr
Network growth
ip t
method.
On the basis of prior biological information
gene expression profiles
Small average shortest path distance
Protein subcellular localization profiles
High average clustering coefficient
protein phylogentic profiles
Return an inferred network
Ac ce p
te
d
Powerlaw degree distribution
Select the best fit
Figure 9: The Proposed Reconstruction Method
400
evaluation and topological characteristics in the mutation operation (see Table 6).
Thereafter, we used the proposed method depicted in Figure 9. As stated
in section 2, we started the GA run from an informed non-random population initialized with different small seed networks with different random sizes derived
405
from gold standard databases while four objective functions are used to evaluate the fitness of each individual. The first objective function uses prior biological information (Expression, Localization, and Phylogenetic profile) to evaluate the fitness of each individual of the population. The other three objective func-
25
Page 25 of 43
Table 7: The average, standard deviation, max and min of the recall, precision and f-measure
Measure
R
P
F
SP
ACC
Min
0.44
0.66
0.53
0.75
0.59
Max
0.70
0.94
0.80
0.95
0.81
Mean
0.55
0.78
0.64
0.82
Std. Dev.
0.08
0.09
0.08
0.06
Min
0.26
0.44
0.33
0.68
Max
0.60
0.79
0.68
0.85
0.73
Mean
0.45
0.63
0.52
0.76
0.61
Std. Dev.
0.09
0.07
0.09
0.07
0.04
Min
0.87
0.45
0.59
0.65
0.70
Max
0.95
0.85
0.90
0.87
0.91
Mean
0.92
0.66
0.76
0.75
0.81
0.47
0.02
0.09
0.07
0.05
0.05
0.94
0.73
0.82
0.79
0.84
0.96
0.84
0.90
0.86
0.90
Min Max Mean
0.95
0.80
0.86
0.83
0.88
Std. Dev.
0.005
0.04
0.02
0.03
0.02
te
BioGrid
0.06
d
Std. Dev.
M
Krogan
an
YeastCyc
0.68
us
KEGG
cr
Dataset
ip t
validation scores used to validate the resulted network of 20 independent runs.
Ac ce p
tions utilize topological properties (average shortest path length, average clus-
410
tering coefficient, and degree distribution) of the reconstructed network to be used as quantities measures (see Figure 9). We applied our method to four commonly-used protein networks in yeast. Two of them are networks related to the metabolism of the yeast: KEGG and BioCyc. We used KEGG API to extract 116 known pathways of budding yeast (Saccharomyces cerevisiae) that
415
involves 1341 genes having EC numbers and 832 enzyme-enzyme interactions. Such information was used as a gold standard database. We also used the computationally reconstructed network (YeastCyc - version 19.5) [8] which contains 1461 reactions in the database, and among them, 1168 are catalysed by enzymes. We extracted 779 EC numbers from YeastCyc assigned to 598 genes in yeast or-
420
ganism. The other two datasets are networks from protein-protein interaction:
26
Page 26 of 43
Krogan et al. [41] and BioGrid used by Yip and Gerstein [40]. The properties
ip t
of the four datasets used in the experimental work are outlined in Table 8. We Table 8: Topological properties of the datasets used in the experimental study.
Proteins
Interactions
Network
Avg. Clustering
Avg.
density
Coefficient
length
path
cr
Dataset KEGG-Used
387
808
0.011
0.42
KEGG-All
502
1053
0.008
0.40
YeastCyc-All
517
825
0.006
0.21
12.5
YeastCyc-USED
464
692
0.006
0.20
6.13
Krogan
1200
2333
0.003
0.45
5.22
BioGrid
1865
6436
0.004
0.37
4.33
6.90
M
an
us
6.73
Table 9: The number of common proteins among the omics data and the considered gold standard networks used in the experimental study.
Gene expression Gasch et al.
[49]
te
(6152)
Gene expression
d
Dataset (Size)
Spellman et al. [48] (6178)
Localization
Phylogenetic
(4156)
(3943)
385
385
313
385
KEGG-All
496
496
396
500
Ac ce p
KEGG-Used
YeastCyc-All
512
511
415
503
YeastCyc-USED
462
462
402
451
Krogan
1192
1190
1084
987
BioGrid
1863
1863
1863
1863
run our algorithm 20 times and calculate recall, precision, f-measure, specificity and accuracy values in each experiment. Then, we evaluated the distribution of
425
the results as depicted in Table 7. The standard deviation of the resulting Recall (R), Precision (P), F-measure (F), Specificity (SP) and Accuracy (ACC) was small (< 0.1) indicating good stability. As shown in Table 10, it is clear that the reconstruction of different networks did not differ from one another. The proposed method usually discovered highest interactions when relying only on the 27
Page 27 of 43
ip t cr us
Table 10: Result summary: Number of inferred interactions (#), Recall (R), Precision (P), F-measure (F), Specificity (SP) and Accuracy (ACC) for our method using different inference
an
criteria.
Metabolic networks
Network growth
KEGG R
P
F
SP
M
#
ACC #
R
YeastCyc P
F
SP
ACC
784 0.57 0.53 0.55 0.64 0.68
782 0.52 0.58 0.55 0.74 0.65
Topological properties
2289 0.23 0.64 0.34 0.74 0.40
4162 0.11 0.67 0.20 0.79 0.25
Prior
biological
data
+
1091 0.69 0.93 0.79 0.94 0.80
te
Topological properties
d
Prior biological data
Ac ce p
Network growth
#
R
909 0.60 0.79 0.68 0.85 0.73
Protein interaction networks Krogan P
F
BioGrid SP
ACC #
R
P
F
SP
ACC
Prior biological data
1657 0.83 0.60 0.69 0.72 0.75
4882 0.87 0.66 0.75 0.74 0.79
Topological properties
4243 0.31 0.57 0.41 0.70 0.48
9147 0.66 0.46 0.55 0.75 0.60
2075 0.95 0.84 0.95 0.90 0.87
5668 0.96 0.84 0.90 0.86 0.90
Prior
biological
data
+
Topological properties
28
Page 28 of 43
430
growth and preferential attachment feature in the mutation operator. However, many of the inferred interactions have no biological meaning compared with the
ip t
gold standard datasets. The structural properties of biological networks provide
essential principles of the organization of interactions in a particular network,
435
cr
but an entire biological network cannot be completely described and inferred us-
ing a small collection of statistical measures that solely takes into consideration
us
the structural properties of the network [45].
Taking into account the prior biological information (Expression, Localisation and Phylogenetic profile) of proteins involved in the experiments, our
440
an
method was capable of inferring the known interactions (recall values in Table 10). However, precision value and the number of inferred interaction were significantly lower compared with the results obtained when either using only
M
the topological properties of the network or combining prior biological information and topological properties. This is particularly the case when inferring the metabolic networks due to the lack of prior biological information available on the considered enzymes (see Table9).
d
445
te
We hypothesized that the performance of the method can be increased when utilizing both prior biological data and structural features of scale-free networks.
Ac ce p
Here, we consider the similarity of a pair of protein in at least two omics data sets as well as the preferential attachment way of network growth. By applying this mechanism to the mutation operator, we can predict more known interactions as shown in Table 10. To statistically compare these results, we used the McNemar test [64] which has been used by Roche-Lima et al. [9] to evaluate the performance of different methods used to predict metabolic network. McNemar's test defined a z score, calculated as: z=
(|Nsf − Nf s | − 1) Nsf + Nf s
(12)
Where (Nsf and Nf s ) is the number of times when one algorithm failed, and the other succeed. Assume that Nsf is the number of times algorithm A failed, and algorithm B succeed and Nf s is the number of instances algorithm B succeeded and algorithm A failed. z scores are interpreted as follows: when z equals to 29
Page 29 of 43
450
0, the two algorithms have the same performance. However, when z value is greater than 0, the performance of one algorithm is significantly better than the
ip t
other based on the value (Nsf and Nf s ). For example, if the value Nsf is greater than Nf s , the algorithm A performs better than algorithm B. We computed
455
cr
the z scores considering our method when using both criteria as algorithm A and the two remaining methods as Algorithm Bs. In all cases, we obtained z
us
scores greater than 0 and based on the (Nsf and Nf s ) method when using both criteria performed better with p-values < 0.05.
an
4.3. GPU Performance
Table 11: Comparison between GPU and CPU computational time
Time (h)
Sequential
M
Implementation
522
20 different runs
Mean
1.06
Worst
1.07
Best
1.05
Ac ce p
te
d
Parallel
To evaluate the GPU performance, we measured the speedup achieved by
460
parallelizing the objective functions of the entire population on a GPU chip. Therefore, we compared the results of GPU computation and CPU computation for performing the same tasks - reconstructing KEGG network. The CPU was an Intel Core i7 4930K (3.40GHz) processor running Windows 10. For CPU computation, we used the same evolutionary model as illustrated in Algorithm
465
1. Other conditions, such as the population size, mutation probability, and elitism rate, are also the same. A single run was performed on the CPU while 20 different runs were performed on the GPU. The results are reported in Table 11. It is obvious that we get a significant reduction in the execution time when we implement the method over the GPU card which yields 492× speedup. We also
470
executed the proposed parallel method for reconstructing biological networks 30
Page 30 of 43
ip t
2
1
cr
Time (h)
1.5
0.5
KEGG
BioCyc
us
0 Krogan Dataset
BioGrid
an
Figure 10: The average execution time of the GPU implementation for reconstructing four different biological networks.
M
(BioCyc, Krogan and BioGrid) with different networks sizes. Figure 10 shows the average execution time for 20 different runs.
d
4.4. Topological Analysis of Networks
475
te
Structural network analysis takes into consideration the topological and not the functional form of the reconstructed network. These are a set of essential topological characteristics of biological networks, namely power-law degree
Ac ce p
distribution, small average shortest path length and high average clustering coefficient. We analyzed the topological properties of the reconstructed networks in order to verify if the structure of such networks still has the characteristics of
480
scale-free networks. As mentioned in the method section, we used an enzymatic view graph-theoretical representation for the metabolic network in which nodes are enzymes, and edges represent the enzyme-enzyme relation (two enzymes catalyse two consecutive reactions). The output degree distributions of 12 reconstructed networks using four gold
standard datasets and three different ways for network growth are shown in Figure 11. The data sets are KEGG (see Figure 11a), YeastCyc (see see Figure 11b), BioGrid used by Yip and Gerstein [40] (see see Figure 11c) and Krogan et al. [41] (see see Figure 11d) while the different mechanisms for network growth
31
Page 31 of 43
Table 12: Topological properties of the reconstructed metabolic networks
Average
Coefficient
path length
Prior biological data
0.04
0.52
3.7
Topological properties
0.07
0.31
2.3
0.01
0.33
5.9
Prior biological data
0.02
0.21
4.7
Topological properties
0.06
0.21
0.01
0.19
Topological properties
YeastCyc
Prior biological data + Topological properties
Krogan
Prior biological data
0.01
Topological properties
0.03
Topological properties Prior biological data Topological properties Prior biological data +
1.5
1.5
2.4
2.3
5.4
1.6 2.8
0.25
2.07
2.2
0.46
5.07
2.5
0.01
0.37
3.7
2.2
0.02
0.24
2.6
2.0
0.36
4.18
2.1
0.01
0.005
te
Topological properties
1.3
4.71
d
BioGrid
1.4
0.44
M
Prior biological data +
(γ)
cr
Prior biological data +
exponent
ip t
Clustering
density
us
KEGG
Network
Network growth
an
Dataset
Ac ce p
are prior biological data and topological characteristics. In these figures, P (k) is the probability of nodes which have a k degree. It was calculated by dividing
the number of proteins which had k links with the total number of proteins in the resulted network. As shown in Figure 11, the scale-free property in which a few nodes have a lot of interactions and many nodes have few interaction exists clearly when (1) using the prior biological data and topological characteristics to build the four studied networks, (2) only using the prior biological data for inferring the Krogan networks. Using the topological properties and the seven fitness values for reconstructing the network did not result in a clear power-law degree distribution except with the Krogan dataset. This is because of taking into account the prior biological information when evaluating the fitness of the candidate solutions. We further used the powerlaw Python package to determine
32
Page 32 of 43
whether the degree distribution of each network reconstructed is considered a good fit to power-law degree distribution. Experimental results show that
ip t
networks reconstructed follow the power-law degree distribution with p-values
between 1e−1 and 1e−20. We also calculated the degree exponent of each degree
X γ = 1 + N[ ln
(13)
us
i
ki ]−1 kmin − 12
cr
distribution (see TABLE 12) using the following functions [65]:
Where N is the network size, ki is the degree of node i and kmin is the minimum 485
degree in the network. The degree exponent of power-law degree distribution
an
lies between the values 2 and 3. In our reconstructions, using only topological properties obtains values between 2 and 3 for all resulted networks except the KEGG network. When using only the omics data, the degree exponent is be-
490
M
tween the values 2 and 3 for protein interaction networks and less than 2 for the reconstructed metabolic networks. Also, when using prior both mechanisms, the degree exponent is between the values 2 and 3 for protein interaction networks
d
and less than 2 for the reconstructed metabolic networks.
te
Table 12 also shows other important features of the biological networks, namely the clustering coefficient and network density. The high values of the clustering coefficients imply that a pair of proteins tends to be densely clustered
Ac ce p
495
in the protein interaction networks. While the low values of the network density represent the sparsity of the network. In biological networks, the average path length scales logarithmically with the number of nodes [33]. We calculated the logarithm of the reconstructed networks size and all results are close to the
500
number 2. As shown in Table 12, the average path lengths range from 2 to 5. This is due to using seven different criteria for selecting the best solution with one criterion being the average path length. Overall, the experimental finding is that the protein-protein interaction network is better than the metabolic network. This is because of the lack of the
505
omics datasets for the enzyme proteins while there is abundant prior biological information on the proteins involved in the protein interaction dataset used in this study (see Table 9). For the reconstruction of the protein interaction 33
Page 33 of 43
network, the most important biological information is the phylogenetic profiles
510
Table 9).
cr
4.5. Comparison to State-of-the-art Algorithm in Literature
ip t
followed by the expression data and finally the subcellular localisation data (see
As presented previously in the related work section, the methods proposed in the literature for reconstructing metabolic networks and protein-protein interac-
515
us
tion networks belong to two classes: de novo and supervised machine learning methods. For comparison, we only consider the supervised machine learning
an
based methods because these methods proposed to reconstruct metabolic networks represented as enzyme graph and protein-protein interaction networks which are similar to our work. However, the supervised machine learning based
520
M
methods depend on the AUC measure - a well-known metric used to evaluate the performance of machine learning based methods - except Vert et al. [66] who used the accuracy measure as well. For this reason, we compared our pro-
d
posed method to the-state-of-the-art method in the literature proposed by Vert
te
et al. [66]. Vert et al. [66] used two pairwise kernels: metric learning pairwise kernel (MLPK) and tensor product pairwise kernel (TPPK) to infer similarities 525
between two pairs of enzymes. The mapping specifies two pairs of proteins as
Ac ce p
similar to each other when each protein in a pair is similar to one corresponding protein in the other pair using four genomic datasets: expression, localization, phylogenetic profile and yeast two-hybrid. Such data have been used individually and together to train SVM in order to reconstruct enzyme networks. In
530
their method, the best results of the inference have been yielded from the combination of the four datasets and the MLPK approach performs better than the TPPK approach on three out of four datasets. It reached an overall accuracy of 83.9 ± 0.9 when using a combination of genomics data as well as a combination
of kernels. It outperformed another state-of-the-art method proposed by Ben535
Hur and Noble [67] in terms of AUC score. For comparison, we used the same data set which is derived from KEGG database by Yamanishi et al. [4]. We found that the performance of our method is better in terms of the accuracy 34
Page 34 of 43
15
20
10
15
0.04 0.02
Frequency p(k) 5
0.00
0.10
25
Degree k
ip t
10
20
0
10 20 30 40 50 60
cr
5
Topological characteristics
0.00
0.2
Frequency p(k)
Omics data
0.0
Frequency p(k)
Omics data & Topological characteristics
Degree k
Degree k
15
20
0.04
Frequency p(k)
0.30
5
10
0.02 0.00
0.15
an
10
Topological characteristics
15
0
20
M
5
0.00
0.2
0.4
Frequency p(k)
Omics data
0.0
Frequency p(k)
Omics data & Topological characteristics
us
(a) Reconstruction KEGG Network on the Basis of Three Different Mechanisms
Degree k
Degree k
40
60
80
Degree k
20
60
100
140
0
40
0.02
Frequency p(k) 20
Degree k
0.00
0.04
Frequency p(k)
te 0
0.04
Topological characteristics
0.08
Omics data
0.00
0.00 0.04 0.08
Ac ce p
Frequency p(k)
Omics data & Topological characteristics
d
(b) Reconstruction YeastCyc Network on the Basis of Three Different Mechanisms
60
0
20
Degree k
40
60
80
Degree k
(c) Reconstruction BioGrid Network on the Basis of Three Different Mechanisms
0
10
30 Degree k
50
0
10
20
30
Degree k
40
0.06 0.03 0.00
Frequency p(k)
Topological characteristics
0.10
Frequency p(k)
Omics data
0.00
0.20 0.10 0.00
Frequency p(k)
Omics data & Topological characteristics
0
20
40
60
80
Degree k
(d) Reconstruction Krogan Network on the Basis of Three Different Mechanisms Figure 11: Degree distribution results for the reconstructed networks.
35
Page 35 of 43
score (91%).
540
ip t
5. Conclusion
It appears that metabolic networks represented as enzyme graph and protein
cr
interaction networks can be reconstructed using the prior biological data about
individual gene or protein (expression, subcellular localisation and phylogenetic
us
profile) and the topological characteristics of the scale-free network. Previous publications used biological information for reconstructing protein interaction 545
networks [29][68][4][25]. However, to the best of our knowledge, this is the first
an
paper that includes the topological characteristics of the scale-free networks in the process of metabolic network reconstruction. We propose a method based on metaheuristic algorithm (GA) for inferring different networks. Such a method
550
M
utilizes the growth and preferential attachment mechanism as well as the available biological data about individual genes and proteins. We found that our
d
method identified a high percentage of the known considered networks: KEGG, YeasCyc, BioGrid used by Yip and Gerstein [40] and Krogan et al. [41] network.
te
Furthermore, the resulted networks satisfied the common structural properties of the scale-free network (power-law degree distribution with exponent values between 2 and 3 in Krogan and BioGrid networks and less than 2 in KEGG and
Ac ce p
555
YeastCyc networks, low-density rate, small average shortest path). We understand that we need larger biological data to test the general perfor-
mance and stability of our method on organisms. This is designated for future work. Also, we plan to test our method on different biological networks for
560
different organisms when sufficient prior biological data is available.
References
[1] C. St. Clair, J. E. Visick, Exploring Bioinformatics: A Project-Based Approach, 2nd Edition, Jones and Bartlett Publishers, Inc., 2013. [2] M. Zvelebil, J. Baum, Understanding Bioinformatics, Garland Science, 565
2008. 36
Page 36 of 43
[3] R. Sharan, T. Ideker, Modeling cellular machinery through biological net-
ip t
work comparison, Nature biotechnology 24 (4) (2006) 427–433. [4] Y. Yamanishi, J.-P. Vert, M. Kanehisa, Protein network inference from
multiple genomic data: a supervised approach, Bioinformatics 20 (suppl 1) (2004) i363–i370.
cr
570
[5] V. Lacroix, L. Cottret, P. Th´ebault, M.-F. Sagot, An introduction to
us
metabolic networks and their structural analysis, IEEE/ACM Trans. Comput. Biol. Bioinformatics 5 (4) (2008) 594–617.
575
an
[6] C. Smolke, The Metabolic Pathway Engineering Handbook: Fundamentals, The Metabolic Pathway Engineering Handbook, CRC Press, 2009.
M
[7] Y. Yamanishi, Supervised Inference of Metabolic Networks from the Integration of Genomic Data and Chemical Information, John Wiley & Sons, Inc., 2010, pp. 189–211.
T. J. Lee, P. Kaipa, F. Gilham, A. Spaulding, L. Popescu, et al., Pathway
te
580
d
[8] P. D. Karp, S. M. Paley, M. Krummenacker, M. Latendresse, J. M. Dale,
tools version 13.0: integrated software for pathway/genome informatics and
Ac ce p
systems biology, Briefings in bioinformatics 11 (1) (2010) 40–79. [9] A. Roche-Lima, M. Domaratzki, B. Fristensky, Metabolic network prediction through pairwise rational kernels., BMC bioinformatics 15 (1) (2014)
585
318.
[10] M. Kotera, Y. Tabei, Y. Yamanishi, A. Muto, Y. Moriya, T. Tokimatsu, S. Goto, Metabolome-scale prediction of intermediate compounds in multistep metabolic pathways with a recursive supervised approach, Bioinformatics 30 (12) (2014) i165–i174.
590
[11] Y. Yamanishi, Y. Tabei, M. Kotera, Metabolome-scale de novo pathway reconstruction using regioisomer-sensitive graph alignments, Bioinformatics 31 (12) (2015) i161–i170.
37
Page 37 of 43
[12] S. Taghipour, P. Zarrineh, M. Ganjtabesh, A. Nowzari-Dalini, Improving protein complex prediction by reconstructing a high-confidence proteinprotein interaction network of escherichia coli from different physical in-
ip t
595
teraction data sources, BMC Bioinformatics 18 (1) (2017) 10.
URL https://doi.org/10.1186/s12859-016-1422-x
cr
10.1186/s12859-016-1422-x.
doi:
600
us
[13] J.-P. Vert, Reconstruction of Biological Networks by Supervised Machine Learning Approaches, John Wiley & Sons, Inc., 2010, pp. 163–188.
an
[14] P. D. Karp, M. Latendresse, R. Caspi, The pathway tools pathway prediction algorithm, Standards in genomic sciences 5 (3) (2011) 424. [15] P. D. Karp, S. Paley, P. Romero, The pathway tools software, Bioinformat-
605
M
ics 18 (suppl 1) (2002) S225–S232.
[16] S. Tsoka, D. Simon, C. A. Ouzounis, Automated metabolic reconstruction
d
for methanococcus jannaschii, Archaea 1 (4) (2004) 223–229.
te
[17] M. Chen, R. Hofest¨ adt, Web-based information retrieval system for the prediction of metabolic pathways, NanoBioscience, IEEE Transactions on
Ac ce p
3 (3) (2004) 192–199. 610
[18] M. Yousofshahi, K. Lee, S. Hassoun, Probabilistic pathway construction, Metabolic engineering 13 (4) (2011) 435–444.
[19] J. Meitalovs, Software tool for probabilistic metabolic pathways construction, in: Computational Intelligence and Informatics (CINTI), 2012 IEEE 13th International Symposium on, IEEE, 2012, pp. 405–408.
615
[20] D. C. McShan, S. Rao, I. Shah, Pathminer: predicting metabolic pathways by heuristic search, Bioinformatics 19 (13) (2003) 1692–1698. [21] Y. Moriya, D. Shigemizu, M. Hattori, T. Tokimatsu, M. Kotera, S. Goto, M. Kanehisa, Pathpred: an enzyme-catalyzed metabolic pathway prediction server, Nucleic acids research (2010) gkq318. 38
Page 38 of 43
620
[22] K. Tanaka, K. Nakamura, T. Saito, H. Osada, A. Hirai, H. Takahashi, S. Kanaya, M. Altaf-Ul-Amin, Metabolic pathway prediction based on in-
ip t
clusive relation between cyclic substructures, Plant biotechnology 26 (5) (2009) 459–468.
625
cr
[23] Y. Moriya, M. Itoh, S. Okuda, A. C. Yoshizawa, M. Kanehisa, Kaas: an automatic genome annotation and pathway reconstruction server, Nucleic
us
acids research 35 (suppl 2) (2007) W182–W185.
[24] M. Nakamura, T. Hachiya, Y. Saito, K. Sato, Y. Sakakibara, An efficient al-
an
gorithm for de novo predictions of biochemical pathways between chemical compounds, BMC bioinformatics 13 (17) (2012) S8. 630
[25] K. Bleakley, G. Biau, J.-P. Vert, Supervised reconstruction of biological
M
networks with local models, Bioinformatics 23 (13) (2007) i57–i65. [26] K. Y. Yip, M. Gerstein, Training set expansion: an approach to improving
d
the reconstruction of biological networks from limited and uneven reliable interactions, Bioinformatics 25 (2) (2009) 243–250. [27] M. Kotera, Y. Yamanishi, Y. Moriya, M. Kanehisa, S. Goto, Genies: gene
te
635
network inference engine based on supervised analysis, Nucleic acids re-
Ac ce p
search (2012) gks459.
[28] J.-P. Vert, Y. Yamanishi, Supervised graph inference, in: Advances in Neural Information Processing Systems, 2004, pp. 1433–1440.
640
[29] Y. Yamanishi, J.-P. Vert, M. Kanehisa, Supervised enzyme network inference from the integration of genomic data and chemical information, Bioinformatics 21 (suppl 1) (2005) i468–i477.
[30] J. M. Dale, L. Popescu, P. D. Karp, Machine learning methods for metabolic pathway prediction, BMC bioinformatics 11 (1) (2010) 15.
645
[31] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, A.-L. Barab´asi, The largescale organization of metabolic networks, Nature 407 (6804) (2000) 651– 654. 39
Page 39 of 43
[32] X. F. Wang, G. Chen, Complex networks: small-world, scale-free and be-
650
ip t
yond, Circuits and Systems Magazine, IEEE 3 (1) (2003) 6–20. [33] A.-L. Barabasi, Z. N. Oltvai, Network biology: understanding the cell’s
cr
functional organization, Nature reviews genetics 5 (2) (2004) 101–113.
[34] S. Kikuchi, D. Tominaga, M. Arita, K. Takahashi, M. Tomita, Dynamic modeling of genetic networks using genetic algorithm and s-system, Bioin-
655
us
formatics 19 (5) (2003) 643–650.
[35] P. Koduru, S. Das, S. Welch, J. L. Roe, Fuzzy Dominance Based Multi-
an
objective GA-Simplex Hybrid Algorithms Applied to Gene Network Models, Springer Berlin Heidelberg, 2004, pp. 356–367.
M
[36] E. Keedwell, A. Narayanan, Discovering gene networks with a neuralgenetic hybrid, IEEE/ACM Transactions on Computational Biology and 660
Bioinformatics 2 (3) (2005) 231–242.
d
[37] A. Konak, D. W. Coit, A. E. Smith, Multi-objective optimization using
te
genetic algorithms: A tutorial, Reliability Engineering & System Safety 91 (9) (2006) 992–1007.
Ac ce p
[38] A.-L. Barab´ asi, R. Albert, Emergence of scaling in random networks, sci-
665
ence 286 (5439) (1999) 509–512.
[39] I. Thiele, B. . Palsson, A protocol for generating a high-quality genomescale metabolic reconstruction, Nature protocols 5 (1) (2010) 93121.
[40] K. Y. Yip, M. Gerstein, Training set expansion: an approach to improving the reconstruction of biological networks from limited and uneven reliable
670
interactions, Bioinformatics 25 (2) (2008) 243–250.
[41] N. J. Krogan, G. Cagney, H. Yu, G. Zhong, X. Guo, A. Ignatchenko, J. Li, S. Pu, N. Datta, A. P. Tikuisis, et al., Global landscape of protein complexes in the yeast saccharomyces cerevisiae, Nature 440 (7084) (2006) 637.
40
Page 40 of 43
[42] J. Alstott, E. Bullmore, D. Plenz, powerlaw: a python package for analysis of heavy-tailed distributions, PloS one 9 (1) (2014) e85777.
ip t
675
[43] R. W. Floyd, Algorithm 97: shortest path, Communications of the ACM
cr
5 (6) (1962) 345.
[44] S. Hougardy, The floyd–warshall algorithm on graphs with negative cycles,
680
us
Information Processing Letters 110 (8-9) (2010) 279–281.
[45] A. Bailey, M. Ventresca, B. Ombuki-Berman, Genetic programming for the automatic inference of graph models for complex networks, Evolutionary
an
Computation, IEEE Transactions on 18 (3) (2014) 405–419.
[46] M. B. Eisen, P. T. Spellman, P. O. Brown, D. Botstein, Cluster analysis and
685
M
display of genome-wide expression patterns, Proceedings of the National Academy of Sciences 95 (25) (1998) 14863–14868. [47] A. Kimmig, F. Costa, Link and Node Prediction in Metabolic Networks
d
with Probabilistic Logic, Springer Berlin Heidelberg, 2012, pp. 407–426.
te
[48] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, B. Futcher, Comprehensive identification of cell cycle–regulated genes of the yeast saccharomyces cerevisiae by
Ac ce p
690
microarray hybridization, Molecular biology of the cell 9 (12) (1998) 3273– 3297.
[49] A. P. Gasch, P. T. Spellman, C. M. Kao, O. Carmel-Harel, M. B. Eisen, G. Storz, D. Botstein, P. O. Brown, Genomic expression programs in the
695
response of yeast cells to environmental changes, Molecular biology of the cell 11 (12) (2000) 4241–4257.
[50] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, R. B. Altman, Missing value estimation methods for dna microarrays, Bioinformatics 17 (6) (2001) 520–525.
41
Page 41 of 43
700
[51] R. Wehrens, L. M. Buydens, et al., Self-and super-organizing maps in r:
ip t
the kohonen package, J Stat Softw 21 (5) (2007) 1–19. [52] R. C. Team, et al., R: A language and environment for statistical comput-
cr
ing.
[53] W.-K. Huh, J. V. Falvo, L. C. Gerke, A. S. Carroll, R. W. Howson, J. S. Weissman, E. K. O’Shea, Global analysis of protein localization in budding yeast, Nature 425 (6959) (2003) 686–691.
us
705
[54] R. Jothi, T. M. Przytycka, L. Aravind, Discovering functional linkages and
an
uncharacterized cellular pathways using phylogenetic profile comparisons: a comprehensive assessment, BMC bioinformatics 8 (1) (2007) 173. [55] M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, T. O. Yeates,
M
710
Assigning protein functions by comparative genome analysis: protein phylogenetic profiles, Proceedings of the National Academy of Sciences 96 (8)
d
(1999) 4285–4288.
715
te
[56] J. De Las Rivas, J. J. Lozano, A. R. Ortiz, Comparative analysis of chloroplast genomes: functional annotation, genome-based phylogeny, and de-
Ac ce p
duced evolutionary patterns, Genome research 12 (4) (2002) 567–583. [57] J. C. Mellor, I. Yanai, K. H. Clodfelter, J. Mintseris, C. DeLisi, Predictome: a database of putative functional links between proteins, Nucleic acids research 30 (1) (2002) 306–309.
720
[58] A. Nakaya, T. Katayama, M. Itoh, K. Hiranuka, S. Kawashima, Y. Moriya, S. Okuda, M. Tanaka, T. Tokimatsu, Y. Yamanishi, et al., Kegg oc: a largescale automatic construction of taxonomy-based ortholog clusters, Nucleic acids research 41 (D1) (2013) D353–D357. [59] B. L. Miller, D. E. Goldberg, Genetic algorithms, tournament selection,
725
and the effects of noise, Complex systems 9 (3) (1995) 193–212.
42
Page 42 of 43
[60] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: Nsga-ii, IEEE transactions on evolutionary
ip t
computation 6 (2) (2002) 182–197.
[61] G. Taguchi, S. Chowdhury, S. Taguchi, Robust engineering, McGraw-Hill Professional, 2000.
cr
730
¨ urk, N. Kaya, F. Ozt¨ ¨ urk, Hybrid multi-objective shape [62] A. R. Yıldız, N. Ozt¨
us
design optimization using taguchi’s method and genetic algorithm, Structural and Multidisciplinary Optimization 34 (4) (2007) 317–332.
735
an
[63] E.-G. Talbi, Metaheuristics: from design to implementation, Wiley, 2009. [64] Q. McNemar, Note on the sampling error of the difference between corre-
M
lated proportions or percentages, Psychometrika 12 (2) (1947) 153–157. [65] M. Newman, Networks: An Introduction, Oxford University Press, Inc.,
d
2010.
[66] J.-P. Vert, J. Qiu, W. S. Noble, A new pairwise kernel for biological network inference with support vector machines, BMC Bioinformatics 8 (10) (2007)
te
740
Ac ce p
S8. doi:10.1186/1471-2105-8-S10-S8. [67] A. Ben-Hur, W. S. Noble, Kernel methods for predicting proteinprotein interactions, Bioinformatics 21 (2005) i38–i46.
doi:10.1093/
bioinformatics/bti1016.
745
[68] M. Kotera, Y. Yamanishi, Y. Moriya, M. Kanehisa, S. Goto, Genies: gene network inference engine based on supervised analysis, Nucleic Acids Research 40 (W1) (2012) W162–W167.
43
Page 43 of 43