Multiobjective optimization to reconstruct biological networks

Multiobjective optimization to reconstruct biological networks

Accepted Manuscript Title: Multiobjective optimization to reconstruct biological networks Author: Ahmed Naef Rosni Abdullah Nuraini Abdul Rashid PII: ...

2MB Sizes 0 Downloads 135 Views

Accepted Manuscript Title: Multiobjective optimization to reconstruct biological networks Author: Ahmed Naef Rosni Abdullah Nuraini Abdul Rashid PII: DOI: Reference:

S0303-2647(18)30131-X https://doi.org/doi:10.1016/j.biosystems.2018.09.003 BIO 3877

To appear in:

BioSystems

Received date: Revised date: Accepted date:

8-4-2018 16-7-2018 11-9-2018

Please cite this article as: Ahmed Naef, Rosni Abdullah, Nuraini Abdul Rashid, Multiobjective optimization to reconstruct biological networks, (2018), https://doi.org/10.1016/j.biosystems.2018.09.003 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ip t

Multiobjective Optimization to Reconstruct Biological Networks Ahmed Naefa,∗, Rosni Abdullaha,b , Nuraini Abdul Rashidc a School

of Computer Sciences, Universiti Sains Malaysia, 11800 USM, Penang Advanced IPv6 Centre (Nav6) 6th Floor, School of Computer Sciences, Universiti Sains Malaysia, 11800 USM, Penang c College of Computer and information sciences, Princess nourah bint Abdulrahman university, KSA

us

cr

b National

an

Abstract

Automated methods for reconstructing biological networks are becoming increasingly important in computational systems biology. Public databases con-

M

taining information on biological processes for hundreds of organisms are assisting in the inference of such networks. This paper proposes a multiobjective

d

genetic algorithm method to reconstruct networks related to metabolism and protein interaction. Such a method utilizes structural properties of scale-free

te

networks and known biological information about individual genes and proteins to reconstruct metabolic networks represented as enzyme graph and protein

Ac ce p

interaction networks. We test our method on four commonly-used protein networks in yeast. Two are networks related to the metabolism of the yeast: KEGG and BioCyc. The other two datasets are networks from protein-protein interaction: Krogan and BioGrid. Experimental results show that the proposed method is capable of reconstructing biological networks by combining different omics data and structural characteristics of scale-free networks. However, the proposed method to reconstruct the network is time-consuming because several evaluations must be performed. We parallelized this method on GPU to overcome this limitation by parallelizing the objective functions of the presented method. The parallel method shows a significant reduction in the execution ∗ Corresponding

author Email address: [email protected] (Ahmed Naef)

Preprint submitted to Journal of LATEX Templates

July 16, 2018

Page 1 of 43

time over the GPU card which yields a 492-fold speedup. Keywords: Biological Network Reconstruction, Topological Properties,

ip t

Metaheuristic, Optimization.

cr

1. Introduction

Recent advances in high-throughput techniques have produced a huge amount

us

of genomic, proteomic and metabolic data. A lot of work has been done to organize, study and analyze such data in order to increase the knowledge of living systems [1, 2]. Among the presented works, several studies have shown that

an

5

biological activities are often carried out in the form of interacting networks [3]. Reconstructing such biological networks has been a priority in system biology,

M

where various biological networks such as gene regulatory networks, protein interaction networks and metabolic networks are reconstructed from genomic, 10

proteomic and metabolic data in an effort to increase the knowledge of living

d

systems at the cellular level [4]. In this work, we focus on algorithms to reconstruct protein interaction networks and metabolic networks.

te

Lacroix et al. [5] defined metabolic networks as “A collection of objects and the relations among them. The objects represent chemical compounds, biochemical reactions, enzymes, and genes.” Reconstructing a metabolic network,

Ac ce p

15

therefore, involves detecting the relations among genes, proteins (enzymes), and reactions in a given metabolic system. Here, we represent the metabolic network as a graph, G = (V, E), where nodes V represent objects (chemical compounds, biochemical reactions, enzymes, or genes) and edges E represent the relations

20

among those objects. Such representation provides a platform for systematic integration of omics data with metabolic networks in order to explore new biological patterns and hypotheses [6]. In particular, the enzyme graph view of cellular metabolism is used to construct the theoretical graph representation in which the nodes represent enzymes and the edges represent enzyme-enzyme

25

relations (i.e., there is an edge between two enzymes if they catalyze two consecutive reactions in a particular pathway). In other words, there is an edge

2

Page 2 of 43

between two enzymes if they catalyze two consecutive reactions that have a compound in common which is considered as the product of one reaction and

30

ip t

the substrate of the other reactions.

According to the studies published in the context of biological networks

cr

analysis literature, recent expansion of databases involving a massive biological

information have facilitated many researchers to study the current knowledge

us

of biological networks. However, most organism-specific metabolic networks remain incompletely characterized and contain many missing enzymes and reac35

tions in known pathways [7, 8, 9, 10, 11]. Therefore, one of the major challenges

an

in computational system biology is to gain better models of metabolic networks and identify candidate interactions for completing the pathways [9]. In addition, reconstructing a high-confidence protein-protein interaction network improves

40

M

the prediction of protein complex - a group of proteins that achieve a particular biological function within the cells [12].

Previous research in systems biology at the molecular level has already pro-

d

vided a partial view of metabolic networks for different organisms [13]. Many

te

methods have been developed to predict and analyze the metabolic networks. Most of such methods attempt to infer the network de novo in which the data about individual genes, proteins, metabolites, biochemical reactions and

Ac ce p

45

pathways are given, and edges are inferred from these data only, using a variety of inference approaches. Examples include Karp et al. [14, 15], Tsoka et al. [16], and M. Chen and Hofest¨adt [17] developed methods based on genome annotations to construct metabolic pathways for an organism. They infer the

50

reactions involved in the genome annotation and then they infer the metabolic pathways from these reactions. Yousofshahi et al. [18], Meitalovs [19], and McShan et al. [20] utilized the known reactions retrieved from KEGG databases. Compound chemical structure retrieved from KEGG database is also used by Moryia et al. [21] and K. Tanaka et al. [22] in the process of metabolic network

55

reconstruction. Moryia et al. [23] used BLAST to compare amino acid pairs and then they used KEGG Ortholog (KO) identifiers to group such amino acid sequences. On the other hand, based on the idea that ’omics’ datasets may 3

Page 3 of 43

contain errors which lead to incorrectly metabolic pathways prediction, other approaches developed to reconstruct biological networks used prior information about the individual genes, proteins (enzymes) and reactions associated with

ip t

60

supervised machine learning algorithms (Yamanishi et al. [11], Nakamura et

cr

al. [24], Bleakley et al. [25], Yip and Gerstein [26], Kotera et al. [27],Vert et al. [28],Yamanishi et al. [29], and Dale et al. [30]).

65

us

In addition to biological network reconstruction, analyzing the structural features of the biological networks has become a focus of attention in an effort to reveal the fundamental design principles for the large-scale organization of in-

an

teractions in all cells and microorganisms [31]. Many studies have analyzed the topological characteristics of biological networks, including metabolic and protein interaction networks. Such studies uncovered several significant topological characteristics of the networks including scale-free property in these networks

M

70

of many organisms where few nodes have lots of interactions, and many nodes have few interactions with others. Therefore, the probability p(k) that a node

d

has k links to other nodes is proportional to k −γ , where γ is the degree ex-

75

te

ponent taking values between 2 and 3 and this is regarded as the definition of the power-law degree distribution property. Small average path length is also a

Ac ce p

primary characteristic of scale-free networks defined as the average of the shortest distance d between each pair of nodes in the network, and high average clustering coefficient is also a common property to many real complex networks including metabolic networks [31, 32, 33].

80

Despite the importance of topological characteristics in providing essential

principles of the organization of interactions in metabolic networks, the existing methods whether direct de novo or supervised machine learning methods have

relied only on the known biological information. Furthermore, to the best of our knowledge, there is no evolutionary algorithm in the literature proposed to

85

reconstruct metabolic networks although evolutionary algorithms have been successfully applied to many problems such as inferring gene regulatory networks [34, 35, 36]. The terms structural and topological are used interchangeably in the literature. In this study, we hypothesize that the integration of prior bio4

Page 4 of 43

logical information and structural characteristics of scale-free networks in the 90

process of metabolic network reconstruction improves the identification of sig-

ip t

nificant interactions and, thus, completing pathways. For this purpose, we develop a multiobjective genetic algorithm based method considering both known

cr

biological information (expression, localization, phylogenetic profiles) and three topological characteristics of the scale-free network (degree distribution, average shortest path length and clustering coefficient) for network reconstruction.

us

95

The remainder of this paper is organized as follows. Section 2 describes the method and materials. The GPU-based solution of the developed method is

an

presented in section 3. Section 4 provides the experimental setup, results and

100

2. Method and Materials

M

discussion. Section 5 presents the conclusion and future work.

In this study, biological network reconstruction is considered as a multiob-

d

jective optimization problem. We use genetic algorithm (GA) which has been the most popular heuristic approach for multiobjective optimization problems

105

te

[37]. In this study, two fundamental bases considered are biological data and the topological characteristics of the scale-free network. The steps of th pro-

Ac ce p

posed GA flowchart are illustrated in Figure 1 and detailed in the following subsections.

2.1. Generate Initial Population In the first step of the present method, we generate an initial population.

110

We encoded the chromosome as an adjacency matrix of size N × N , where N is the network size, the rows and columns represent proteins. Each element, aij , in the matrix equals 1 if its corresponding proteins interact with each other, otherwise it equals 0. A typical property of many real networks, including biological networks, is

115

that the vertex connectivities follows the scale-free power law distribution. Such property is the result of the contentious growth of the network by the addition of

5

Page 5 of 43

Start

ip t

Generate initial population

Evaluate fitness

cr

Rank the population using NSGA-II Sort the population using nondominant sorting approach

us

Compute crowding distance between individuals of each front

Select elite individuals

an

Select highest ranked individuals

Perform mutation

M

New generation

Max. Iteration is reached

No

Yes

Return optimal solution found, 1𝑠𝑡 𝑓𝑟𝑜𝑛𝑡.

d

End

te

Figure 1: Flow Chart of GA-Based Network Inference Method

Ac ce p

new nodes that attach preferentially to already well-connected nodes [38]. On the basis of the mentioned property, we started the GA run from an informed non-random solutions. Each solution of the population is initialized with a

120

different small seed network derived from gold standard datasets. The use of this strategy avoids using random initialization, which led to very poor initial solutions (random graph). However, it is necessary first to obtain gold standard network data set in order to initialize the solutions. 2.1.1. Metabolic network

125

In the case of applying our method to reconstruct metabolic network, we first assembled the gold standard metabolic network (KEGG and BioCyc) as described below: a. Obtain genome annotation. Genomic information is important to pre6

Page 6 of 43

Table 1: Data Sources Used for Metabolic Network Reconstructions Name Genome annotation databases: https://www.ncbi.nlm.nih.gov/genbank/

SGD

http://www.yeastgenome.org/

Enzyme databases: ExPASy

https://www.expasy.org/

cr

GenBank

ip t

Link

us

http://downloads.yeastgenome.org/curation/

SGD

chromosomal_feature/dbxref.tab UniProt

http://www.uniprot.org/

an

Metabolic databases:

http://www.kegg.jp/

BioCyc

https://biocyc.org/

Protein localization databases:

M

KEGG

UniProt

http://www.uniprot.org/

huh et al.

http://yeastgfp.yeastgenome.org/

Gene Expression Omnibus

https://www.ncbi.nlm.nih.gov/geo/ http://www-genome.stanford.edu/yeast_stress/index.

te

Gasch et al.

d

Gene expressionist databases:

shtml

Ac ce p

Spellman et al.

http://genome-www.stanford.edu/cellcycle/

Phylogenetic protein profile: KEGG Ortholog Cluster

http://www.genome.jp/tools/oc/

cisely provide the gene properties concerning the organisms genome. As

130

the network reconstruction depends primarily on the genome annotation, it is important to download the most recent version available to ensure the accurateness. We use two database resources to retrieve the genome annotation for Saccharomyces cerevisiae and then extract all genes and gene products.

135

b. Identify candidate metabolic genes (enzyme proteins). In the case of a metabolic network draft reconstruction, it is particularly important to be able to determine the genes encoded to enzyme proteins. This is a straight-

7

Page 7 of 43

EC 3.1.3.11

YMR205C

EC 2.7.1.11

YKL060C

EC 4.1.2.13

YCR012W

EC 2.7.2.3 (a)

R00771: D-glucose 6-phosphate = D-fructose 6-phosphate

6-phosphofructokinase E.C. 2.7.1.11 Fructose-bisphosphatase E.C. 3.1.3.11

R00756: ATP + D-fructose 6-phosphate = ADP + D-fructose 1,6-bisphosphate

ip t

EC 5.3.1.9

YLR377C

KEGG Reaction ID: Reaction

Glucose-6-phosphate isomerase E.C. 5.3.1.9

R00762: D-fructose 1,6-bisphosphate + H(2)O = D-fructose 6-phosphate + phosphate R01068: D-fructose 1,6-bisphosphate = glycerone phosphate + D-glyceraldehyde3phosphate R01512: ATP + 3-phospho-D-glycerate = ADP + 3-phospho-D-glyceroyl phosphate

Fructose-bisphosphate aldolase E.C. 4.1.2.13 Phosphoglycerate kinase E.C. 2.7.2.3

(c)

EC 5.3.1.9

E.C. 2.7.1.1

EC 2.7.1.1

E.C. 4.1.2.13

E.C. 3.1.3.11

EC 4.1.2.13

E.C. 2.7.2.3

EC 5.4.2.11 (b)

KEGG Reaction ID

KEGG Reference Pathway ID

R00771

path:rn00500

R00756

path:rn00680, path:rn01120, path:rn01200

R00762

path:rn00680, path:rn00710, path:rn01100, path:rn01120, path:rn01200

R01068

path:rn00680, path:rn00710, path:rn01100, path:rn01120, path:rn01200

R01512

path:rn00010, path:rn00710, path:rn01100, path:rn01110, path:rn01120, path:rn01130 path:rn01200, path:rn01230

us

E.C. 3.1.3.11

an

EC 5.3.1.9

cr

YBR196C

Enzyme Name EC-Number

(d)

M

Figure 2: Using KEGG API, various extracted information from KEGG database used in the reconstruction of draft metabolic network (a) enzyme proteins with their EC numbers, (b) list of enzyme-enzyme relation catalyzing successive reactions (c) list of enzymes with reactions

d

(d) list of pathways involving particular reactions.

te

forward step once the genome annotation is retrieved. We retrieved such metabolic genes from the annotation information (e.g., those genes having EC-number field as per represented in the Genbank file), as shown in

Ac ce p

140

Figure 2.a.

c. Obtain metabolic pathways including all biochemical reactions catalyzed by enzymes identified in step b. We use KEGG and BioCyc databases, currently the most widely used, to retrieve all pathways of the organism

145

of interest (See Figures 2c and 2d).

d. Assemble draft reconstruction. Once enzyme functions and biochemical reactions catalysed by such enzymes have been tentatively defined, we derive the enzyme-enzyme relation. Two enzymes are linked when they catalyze successive reactions in known pathways (See Figures 2b, 2c, and

150

2d). Draft reconstruction is the first step in the process of reconstructing metabolic networks [39]. It is based on the genome annotation of the target organism and biochemical databases. The entire draft reconstruction is 8

Page 8 of 43

A candidate solution S

...

0 1

...

...

...

...

...

...

0

Fit(Degree_distribution( S))

v2 & v3

Average_clustering_coefficient( S)

v4

ip t

1

v1

v5

∑g(S, K genexpr)

cr

...

0

Average_path_length( S)

v6

∑g(S, K protloc) ∑g(S, K protphy)

v7

us

0

Fitness values

Objective Functions

Adjacency matrix of S

an

Figure 3: An illustration of candidate solution evaluation.

followed by manual or computational approaches to refine such network

155

M

by inferring missing reactions. This step is also used in the literature as a method to reconstruct a gold standard metabolic network [29] [9].

d

2.1.2. Protein interaction network

We also use the proposed method to reconstruct protein interaction network

te

using high-quality datasets BioGrid used by Yip and Gerstein [40] and Krogan et al. datasets [41]. Therefore, part of those datasets will be used to generate the initial population.

Ac ce p

160

2.2. Fitness Evaluation

In the second step, the fitness of the individuals are evaluated. We consider

two fundamental aspects of which we calculate seven fitness values for each individual (see Figure 3):

165

2.2.1. Topological Characteristics. As mentioned earlier, despite network diversity, most biological networks share three prominent structural features: power-law degree distribution, small average path length distance and high average clustering coefficient. Thus, the topological-based objective functions used to evaluate the reconstructed network

170

were designed using these three features. The objective functions are as follows. 9

Page 9 of 43

1. Degree distribution. We provided an objective function that calculates the degree distribution values of each individual. Then, the powerlaw

ip t

Python package developed by Alstott et al. [42] was used to examine the distributions. With the Fit and Distribution objects included in this

package, one can determine whether the degree distribution of a candidate

cr

175

solution is considered a good fit to power-law degree distribution. The

law distribution. (n)

us

larger the likelihood fit, the better the distribution is fitted to the power-

2. Average path length. Let D = {duv } be the shortest path between each

M

an

pair (u, v) in undirected graphs.    the shortest path connecting u and v   duv = 0, u=v     ∞, otherwise

We calculated the average shortest path length L of each individual using Floyd-Warshall algorithm [43]- a simple and widely used algorithm to

d

180

compute shortest paths between all pairs of vertices in a graph [44]. We

te

then evaluated L on the basis of Barabasi and Oltvai's [33] finding: scalefree networks have short average path lengths L goes as L ∼ log(log(N ))

Ac ce p

where N is the network size. In this objective function, we minimize the 185

difference between the average path lengths L and Barabasi and Oltvai's findings log(log(N )).

3. Clustering coefficient. An important feature of the scale-free network structure also considered in this work is that the candidate solution should have a high average clustering coefficient value. We provide an objective function that evaluates the fitness values of the individuals of each generation in terms of the average clustering coefficient. The local clustering coefficient Ci for a node vi in undirected graphs is calculated as: Ci =

2E ki (ki − 1)

(1)

Where E is the number of edges connecting the neighbours of the node vi and ki is the degree of vi . The average of the local clustering coefficients 10

Page 10 of 43

of all the vertices n is therefore given by: n

190

1X Ci n i=1

(2)

ip t

hCi =

The larger the average clustering coefficient hCi is, the better the solution.

cr

As a matter of fact, an entire network cannot be perfectly described by a small set of statistical measures [45]. Furthermore, all the existing reconstruction

us

methods based on available biological data resulted in identifying new interactions for completing pathways. In this study, we also consider prior knowledge 195

about individual genes or proteins such as the level of gene expression, protein

an

localization, and phylogenetic profiles to support the process of the network reconstruction.

M

2.2.2. Prior Biological Information

1. Gene expression profile. Early studies demonstrated that many genes 200

with similar expression profiles tend to have similar function [46], and

d

many recent machine learning classification algorithms underlined that

te

gene function could be predicted from microarray data [47]. Gene expression profiles are represented as a real-valued matrix in which the rows are

Ac ce p

genes, and the columns represent experiment conditions or time series. In 205

this study, we use two microarray datasets: Spellman et al. [48] data set containing 77 experiments which focus on the identification of cell-cycleregulated genes and Gasch et al. [49] datasets involving 173 experiments which studied the responses of yeast cell to environmental changes. First, each data set was pre-processed by estimating the missing expression val-

210

ues using K-nearest neighbours (KNNimpute) method [50]. Then, we use the self-organizing map (SOM) method to cluster the genes with similar expression profiles. It is implemented using the Kohonen package [51] in the R environment for statistical computing [52]. Finally, we constructed a kernel matrix Kgenexpr of size N × N in which rows and columns rep-

215

resent genes and each entry equals to 1 if the corresponding genes belong to the same cluster, otherwise it equals 0. 11

Page 11 of 43

2. Protein subcellular localization data. Information about the protein locations in the cell is also widely used in the reconstruction of biological

220

ip t

networks. Cells are organized into membranes and compartments, which

are specialized for various biological functions. Comprehensive knowledge

cr

of protein subcellular localization is critical for understanding their functions and interactions [53]. As such, proteins carrying out their tasks in

us

the same subcellular localization can form a protein complex (functionally related/linked to each other). We utilize such prior biological information 225

and use the localization information of the large-scale budding yeast local-

an

ization experiment [53]. This dataset describes the location of proteins in 22 intracellular locations such as the nucleus, Golgi, mitochondrion, etc. We build protein subcellular localization profiles represented as a binary

230

M

matrix, where the rows represent proteins and columns represent intracellular locations and the elements are either 1 or 0 representing the presence and absence of the protein in a certain intracellular location, respectively.

d

Because these profiles in the mentioned dataset describe properties of indi-

te

vidual protein and not pairs of proteins, we constructed the kernel matrix, Kprotloc as a measure of similarity between proteins with respect to the protein subcellular localization data by using linear kernel.

Ac ce p

235

3. Phylogenetic profiles. The phylogenetic profile of an orthologous protein is a vector of length n which represents the names of different organisms that contain that protein [54]. The comparisons of protein phylogenetic profiles are used for identifying functionally related and physically linked

240

proteins [55][56][57].We built protein phylogenetic profiles from KEGG Orthology (KO) and KEGG Orthology Cluster (KEGG OC) [58] databases. However, the constructed profiles provide information on individual genes rather than a pair of genes. We, therefore, built a kernel matrix representing the similarity between each pair of genes in which rows and columns

245

represent enzyme proteins and the entries are the similarity between the corresponding proteins. We applied the Jaccard similarity measure to the phylogenetic profile data which resulted in a kernel matrix Kprotphy . As 12

Page 12 of 43

we are interested in a binary similarity matrix, each entry of the kernel matrix Kprotphy is converted to 1 if it is greater than a threshold, τ , oth-

ip t

erwise it equals to 0. We further amend the Kprotphy by utilizing KEGG

250

OC tools and assigning 1 for each pair of enzyme proteins having the same

cr

OCs number, OCs includes 1176030 OCs, obtained by clustering 8357175 genes in 2112 fully sequenced genomes (153 eukaryotes, 1830 bacteria and

us

129 archaea).

Given Kgenexpr , Kprotloc and Kprotphy , we evaluated each individual S of each generation as follows.

255

XX

g(aij , kij )

an

f (S, K) =

(3)

Where aij is an element of the adjacency matrix representing the individual

M

S and its value is either 1 (there is an interaction between the corresponding proteins) or 0 (there is no interaction between the corresponding proteins); kij is an element from either gene expression profile kernel Kgenexpr , protein

and g(aij , kij ) returns 1 if aij = kij , otherwise it returns 0. The term g(.)

te

260

d

subcellular localization kernel Kprotloc or protein phylogenetic kernel Kprotphy

determines the proportion of edges in a candidate solution that has evidence

Ac ce p

based on the considered omics data sets. The maximum value is reached when each edge between a pair of proteins inferred has similar profiles in the considered omics data sets.

265

2.3. Parent Selection

In this step a set of parent individuals is selected from the population for reproduction. We used the crowded binary tournament selection method [59] used in NSGA-II, which is one of the most popular Pareto dominance based MOEAs [60]. The characteristic features of NSGA-II are its fast nondominated sorting

270

approach for ranking solutions. At any given generation: ˆ

Let P : n−dimensional candidate solutions (population). P = [s1 , s2 , ..., sn ] 13

Page 13 of 43

ˆ

Let F : m−dimensional objective functions.

ip t

f (s) = [f1 (s), f2 (s), ...fm (s)]

In NSGA-II, all solutions of the population, P , are first grouped into different

cr

nondominated groups called Pareto fronts. For each solution, one has to check dominance relationship by determining how many solutions dominate it and the set of solutions to which it dominates. Given any two solutions si and sj , there

si dominates sj ,

ˆ

si is being dominated by sj ,

ˆ

Neither solution dominates.

M

ˆ

an

are three cases of dominance relationship:

us

275

si dominates sj if the solution si is no worse than sj in all fitness values and the solution si is strictly better than sj in at least one value. This dominance

d

relationship is used to attribute rank in which all nondominated individuals had

te

a rank of 1 and assigned to Pareto front 1, all individuals that being dominated by other individuals of front 1 had a rank 2 and assigned to Pareto front 2,

Ac ce p

and so on. Then, the density of individuals within each front is estimated. Here, a specific crowding distance is used which calculates the average distance of two neighbouring individuals on either side of this individual along each of the fitness values. Every individual i in the population has two attributes: a nondomination rank (irank ) and a crowding distance to the closest neighbours (idist ). The partial order ≺ c is based on these two attributes and defined as: i ≺ j : irank < jrank ∨ (irank = jrank ∧ idist > jdist )

280

During the selection operation, the individuals are selected based on their rank and crowding distance values. NSGA-II selects individual with lesser rank when both lie on different fronts. However, the one with a higher crowding distance value is selected when both individuals belong to the same front.

14

Page 14 of 43

ip t cr us

Figure 4: Mutation Operator

285

an

2.4. Crossover

Crossover operator is an important step in most traditional GA implementations. It can be used to produce a single solution from two highly fit separate

M

individuals. In this work, general crossover operations between two individuals resulted in rewiring the edges of the network and changes the structural properties and the functional form of the candidate solutions. Thus, we excluded

ability. 2.5. Mutation

d

crossover operator and instead mutation operator is performed with high prob-

te

290

Ac ce p

In this study, we examined three different mutation operators that play

important roles in the process of network reconstruction. We utilized the prior biological information as well as the growth and preferential attachment feature leading to a scale-free network. Therefore, the fourth step, mutation operator,

can be described as follows: at each time, a new vertex, vi , representing a protein is added to the network (the selected solution). Such node connects with m edges where m ≤ the number of existing vertices V . The added vertex is connected with m existing vertices V based on three different criteria: (1) growth and preferential attachment: each vj in V is connected to the added vertex vi with a probability proportional to the degree of the node p(kj ) according to the

equation (4): P (kj ) = P

kj

l∈V,l6=j

kl

(4)

15

Page 15 of 43

generate initial population (init pop) of size (popSize);

2

while max iteration is not reached do

ip t

1

calculate fitness values for each individual by using multiobjective

3

cr

function; select parent using NSGA-II ;

5

/* perform mutation by expanding each individual in the

us

4

6

vi ← a new vertex;

7

for i ← 1 to m do

*/

an

population by using one of the following cases:

vj ← a vertex in the current individual; P Pj = kj / l∈V,l6=j kl ;

8 9

M

/* where kj and kl are the degrees of the vertices vj

10

and vl // Case 1:

growth and preferential attachment

12

add an edge between vi and vj with probability = Pj ;

13

// Case 2:

14

s = Kgenexpr (vi , vj ) + Kprotloc (vi , vj ) + Kprotphy (vi , vj );

15

if s > 2 then

d

11

*/

Ac ce p

te

prior biological data

add an edge between vi and vj ;

16

17

end

18

// Case 3:

combination of growth and preferential

attachment and prior biological data

if rand ≤ Pj and s > 1 then

19

add an edge between vi and vj ;

20

end

21

end

22 23

end

Algorithm 1: Pseducode for GA used for biological network reconstruction

16

Page 16 of 43

where kj and kl are the degrees of the vertices vj and vl so that vertices with higher degrees have more chances to be connected to the newly added vertex. (2) prior biological information: each vj in V is connected to the added vertex vi

ip t

295

if the following three conditions are satisfied (i ) similar gene expression profiles,

cr

(ii ) similar subcellular localization profiles, and (iii ) similar phylogenetic profiles. (3) In the third case, we combined the growth and preferential attachment

300

us

and prior biological information conditions, see Figure 4. In this case, we only consider two of the conditions presented in criteria (2).

an

The steps 2 to 4 are repeated until a maximum number of iteration is reached. Overall, Algorithm 1 summarizes the steps of our method proposed for recon-

305

M

structing biological network.

3. Design and Implementation of the Proposed GA-based Method

d

Using GPU

This section focuses on the design and implementation of the proposed recon-

te

struction GA-based method using GPU. In this study, we concentrated mainly on parallelizing the objective functions of the presented GA-based method not to parallelize the complete method. First of all, at the initialization phase, mem-

Ac ce p

310

ory on the device was allocated, and the data was transferred to the GPU to calculate the fitness values for each individual in the population. We represent an individual as an array of size N × N . Let L be the size of the population, L arrays are allocated to L multiple processors (MP) at the same time.

315

Then, three different kernels are executed to calculate the fitness values for the individuals.

We designed a kernel that executes a number of threads on GPU and re-

turns sorted degree distribution and the average clustering coefficient for each individual. We utilized built-in functions (e.g. atomicAdd ) so that the degree 320

distribution and the clustering coefficient of each node of an individual were simultaneously calculated and stored in shared memory and global memory, re-

17

Page 17 of 43

7

Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Thread 6 Thread 7

Node degrees:

k0=2 0 0

Shared memory:

k1=2

k2=1

1 1

k3=4

2 3

3 2

k4=5

k5=3

4 1

k6=3 6 0

7 0

8 0 P(8)

P(0)

P(1)

P(2)

P(3)

P(4)

P(5)

P(6)

P(7)

0

0.125

0.375

0.25

0.125

0.125

0

0

Clustering coefficient:

C(0) 0

C(2) 0

C(3) 0

C(4) 0.2

C(5) 0.67

0

us

Degree distribution:

C(1) 0

k7=2

5 1

cr

Threads:

5

2

ip t

6

4

3

1

0

C(6) 0.67

C(7) 1

an

Figure 5: Calculate Degree Distribution and Clustering Coefficient on GPU.

spectively. We only have 1024 threads per block. We allocated each individual to a particular block. Each thread calculates the degree ki for a particular node

325

M

vi in an individual and increases the entry of a shared memory allocated in ki by 1. Each thread, simultaneously, calculates the clustering coefficient of the

d

node vi by calculating the number of neighbours |V | and the number of edges Ei among the vertices V and stores such information in global memory. We

te

proposed (see Figure 6) a methodology to adopt GA on GPU in a generic way, and we detail this process in the example shown in Figure 5. In the host, we used the powerlaw Python package [42] to evaluate the fitness

Ac ce p

330

of the individuals using the degree distribution values calculated on GPU. Another kernel function was designed to calculate the average path length for each individual of a generation. As mentioned above, we used the Floyd-Warshall algorithm which depends on the principle of dynamic programming. We store

335

the best values (shortest path between each pair of protein) found so far in a matrix dist in global memory so that it can be accessed via all threads. Let L be the number of individuals, the elements of the L dist matrices are updated

at the same time. For each node w in the individual, each thread only examines the distance between a pair of vertices (v, u), in a particular individual, 340

passing through the node w and saves the value returned from the function min(dist(u, v), dist(u, w) + dist(w, v)) in the array dist (see Figure 6)

18

Page 18 of 43

𝐶𝑃𝑈

𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑏𝑙𝑜𝑐𝑘𝑠 = 𝑝𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛 𝑠𝑖𝑧𝑒, 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡ℎ𝑟𝑒𝑎𝑑𝑠 = 𝑛𝑒𝑡𝑤𝑜𝑟𝑘 𝑠𝑖𝑧𝑒

𝐺𝑟𝑖𝑑

Copy population CPU → GPU

Block 0

Block 𝑛

Block 1

Each thread: • Calculates degree 𝑘(𝑣) of vertex 𝑣 and increases the item of a vector in a shared memory at index 𝑘 𝑣 by 1. • Calculates clustering coefficient 𝐶 𝑣 of vertex 𝑣 Save

us

• Degree distribution values of an individual to a shared memory. • Degree distribution values of population to a global memory. • Clustering coefficient of each node of each individual to a global memory

cr

Copy back degree distribution & clustering coefficient GPU→CPU

ip t

Launch 𝑘𝑒𝑟𝑛𝑒𝑙1

𝐺𝑟𝑖𝑑(𝑛𝑒𝑡𝑤𝑜𝑟𝑘 𝑠𝑖𝑧𝑒, 𝑛𝑒𝑡𝑤𝑜𝑟𝑘 𝑠𝑖𝑧𝑒) Block 1,0

Block 0,1

Block 1,1

Block 𝑛, 0

Block 𝑛, 1

an

Launch 𝑘𝑒𝑟𝑛𝑒𝑙2

Block 0,0

Copy back shortest path length GPU→CPU

Block 0, 𝑛

Block 1, 𝑛

Block 𝑛, 𝑛

M

Each thread: • Calculates the shortest path distance between a pair of vertices and stores the result in a global memory.

Figure 6: Calculate fitness values of the population on GPU with respect to degree distribu-

d

tion, clustering coefficient and average path length

te

The third kernel was designed to calculate the fitness values of each individual on the basis of omics data. Each thread is responsible for examining

Ac ce p

the presence or absence of links of a particular pair of nodes in a particular 345

individual by utilizing gene expression data, protein subcellular localisation and protein phylogenetic data as shown in Figure 7. 𝐶𝑃𝑈

Copy input data: kernel matrices CPU → GPU

𝐺𝑟𝑖𝑑

𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑏𝑙𝑜𝑐𝑘𝑠 = 𝑝𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛 𝑠𝑖𝑧𝑒, 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡ℎ𝑟𝑒𝑎𝑑𝑠 = 𝑛𝑒𝑡𝑤𝑜𝑟𝑘 𝑠𝑖𝑧𝑒

Block 0

Block 1

Block 𝑛

Launch 𝑘𝑒𝑟𝑛𝑒𝑙3 Copy results GPU→CPU

• Results of the comparison (global memory)

Each thread: Save • Compares the entries of an individual with the

corresponding entries in kernel matrices: 𝑘 𝑔𝑒𝑥𝑝𝑟 , 𝑘 𝑙𝑜𝑐 𝑎𝑛𝑑 𝑘 𝑝ℎ𝑦

Figure 7: Calculate fitness values of the population on GPU with respect to prior biological information (gene expression kernel matrix kgexpr , localisation kernel matrix kloc and phylogenetic kernel matrix kphy )

19

Page 19 of 43

4. Results and Discussion

ip t

The results given in this section were collected over a system with six cores of Intelr Xeonr processor E5-2620 15M Cache, 2.00 GHz as the host CPU. 350

We are using NVIDIA Tesla K10.G2.8GB GPU mounted on the host for results

cr

of GPU implementation. K10.G2.8GB GPU comprises two identical GK104 GPUs, each with 4 GB of memory. The GPUs are connected via a PCI Express

and the compute capability of the device is 3.0. 4.1. Parameter Tuning

an

355

us

switch; the total number of multiprocessors is eight. The clock rate is 475MHz,

In this paper, we used the method introduced by Taguchi et al. [61] for tuning the parameters of the GA algorithm. The Taguchi method is based on

M

statistical and sensitivity analysis for determining the optimal set of parameters to achieve robust performance [62].

d

Four basic parameters of the proposed GA are considered for tuning, namely population size, mutation probability, elitism rate and a maximum number of

te

generations. The different levels of these parameters for the tuning process are shown in Table 2. The L16 orthogonal array (a table balanced to ensure that

Ac ce p

all levels of all factors are considered equally) is used. Sixteen experiments are required to be conducted based on the L16 orthogonal array (see Table 3). After conducting the experiments, the optimal combination of parameters can be determined using the Signal-to-noise (SN) ratio. The higher the SN ratio, the better the quality as shown in Figure 8. For this problem, we consider the '

smaller the better' category to compute the SN ratio. Therefore, the SN ratio

for each experiment can be computed using the following equation: SN = −10log(

1X 2 y ) n

(5)

Where y is a weighted combination of the multiple criteria. We used one of the first and most used methods that transform the multiobjective problem into a mono objective problem [63]. Given m values of the different objective

20

Page 20 of 43

functions, the response y is defined by:

i=1 360

λi

fi (x) − fimin fimax − fimin

(6)

ip t

y=

m X

Where m is the number of objective functions used and fimin and fimax are the

cr

lower and upper values of the different objective functions. The GA parameters used are shown in Table 4.

us

The proposed mutation operator has the parameter (m) which indicates the number of nodes that will probably connect each newly added node. In 365

order to significantly gain the advantage of this parameter, an experimental

an

test evaluation is performed to identify the best m parameter value. For this experiment, we run the proposed algorithm for each changed value of the m parameter. The starting value was set to the number of existing nodes in the

370

M

seed network N . On the basis of the fact that each new node will connect with m vertices where m is less than or equal to the number of the nodes in the

d

considered network [38]. In subsequent runs, each new node will connect to the interval of between 0.01 and 1. The parameter values reported in Table 4

te

were used, and the best accuracy of the proposed method was when setting the parameter m to the value 0.06.

The threshold τ used in the process of building the kernel matrix Kprotphy

Ac ce p

375

was similarly tuned. We used the parameter values reported in Table 4 and performed experimental tests to evaluate the best value for this parameter. We run the proposed method for each changed value of τ parameter, and the starting value was set to 0.5 and until 0.9. We recorded the best accuracy when

380

the parameter τ equals to 0.5. We studied the performance of the proposed method in three aspects. The

first one, outlined in section 4.2, studies the performance of the method when using three different mechanisms for network growth. Section 4.3 presents the GPU performance. Finally, section 4.4 analyzes the topological characteristics

385

of the biological networks.

21

Page 21 of 43

Table 2: Parameters of the proposed GA and their experimental levels.

Code

Parameter

1

2

3

4 50

Population size

P

20

30

40

2

Mutation rate

M

0.2

0.4

0.6

0.8

3

Elitism rate

E

0.05

0.10

0.15

0.20

4

No. of generation

G

200

300

400

500

us

cr

1

ip t

Level No.

an

Table 3: The experimental layout of the parameters.

Parameter Experiment

y

M

20

0.2

2

20

3

20

4

20

5

E

G

0.01

200

2.5000

M

P 1

0.02

300

2.0079

0.6

0.03

400

2.43204

0.8

0.04

500

1.68784

d

0.4

0.02

7

30

0.6

0.04

200

2.25537

8

30

0.8

0.03

300

2.32549

0.2

400

1.98763

30

0.4

0.01

500

2.43353

Ac ce p

te

6

30

9

40

0.2

0.03

500

2.58457

10

40

0.4

0.04

400

2.01768

11

40

0.6

0.01

300

2.15318

12

40

0.8

0.02

200

1.85902

13

50

0.2

0.04

300

2.24021

14

50

0.4

0.03

200

2.36538

15

50

0.6

0.02

500

2.45703

16

50

0.8

0.01

400

2.15148

4.2. Validity of the Resulted Biological Network For evaluating the performance of the proposed algorithm of using three different types for network growth: prior biological data, topological characteristics and both (see Figure 9), we used five widely-known measures: Precision, 22

Page 22 of 43

ip t cr us

an

Figure 8: SN ration at different levels of the GA parameters. Table 4: Genetic algorithm parameters.

Values

Population size

M

Genetic parameter

20 400

Mutation probability

0.8

d

Maximum iterations

0.20

te

Elitism probability

Recall, F-measure, Specificity and Accuracy to compare the findings against

Ac ce p

gold standard networks. For each edge e in the reconstructed biological net-

work N , let true positive (T P ) be the set of edges shared between the inferred

network N and the gold standard network G while false positive (F P ) is defined as the set of edges existing only in network N . False negative (F N ) is defined

as the edges that are members of the reference network G but not found in the

network N , and true negative (T N ) is the edges not available in both networks N and G. Hence, recall, precision, f-measure, specificity and accuracy scores are calculated according to the following equations: Recall(R) =

TP TP + FN

P recision(P ) =

TP TP + FP

(7)

(8)

23

Page 23 of 43

P recision × Recall P recision + Recall

Accuracy(ACC) =

ip t

Specif icity(SP ) =

(9)

TN FP + TN

(10)

TP + TN TP + FN + FP + TN

cr

F − measure(F ) = 2 ×

(11)

us

Once we selected the values of GA parameters, as shown in Table 4, we run our algorithm 20 times and calculate recall, precision, f-measure, specificity and accuracy values in each experiment. Indeed, before applying the reconstruction procedure depicted in Figure 9, we carried out a series of preliminary exper-

an

390

iments using the data set of KEGG. The experiments are conducted in two groups. The 1st -group experiment is to verify the suitability of reconstruction

M

metabolic networks by using prior biological data while the 2nd -group uses the topological characteristics. Within each group the evaluation will be based 395

on objective function for genomic data and objective functions for topological

d

characteristics (see Table 5 for the results of the 1st -group and Table 6 for the

te

2nd -group).

Table 5: The average, standard deviation, max and min of the recall, precision and f-measure

Ac ce p

validation scores used to validate 20 independent runs of the 1st experiment reconstruction methods.

Network growth Obj. functions Measure

expanding based on prior biological data

genomic data

topological characteristics

R

P

F

SP

ACC

R

P

F

SP

ACC

Min

0.29

0.69

0.41

0.76

0.46

0.27

0.72

0.39

0.78

0.43

Max

0.36

0.76

0.48

0.81

0.52

0.36

0.82

0.50

0.84

0.53

Mean

0.32

0.74

0.45

0.79

0.49

0.33

0.78

0.46

0.82

0.49

Std. Dev.

0.01

0.02

0.01

0.01

0.01

0.02

0.02

0.03

0.02

0.02

Generally speaking, the best accuracy obtained from the preliminary experiments was on average of 75% by using the genomic data in the population

24

Page 24 of 43

Table 6: The average, standard deviation, max and min of the recall, precision and f-measure validation scores used to validate 20 independent runs of the 2nd experiment reconstruction

expanding based on topological characteristics

Obj. functions

topological characteristics

P

F

SP

ACC

R

P

Min

0.59

0.69

0.64

0.76

0.68

0.60

Max

0.90

0.75

0.80

0.80

0.83

0.68

Mean

0.72

0.71

0.72

0.78

0.75

0.63

Std. Dev.

0.10

0.02

0.04

0.01

0.03

0.02

On the basis of scale-free network structural characteristics

SP

ACC

0.72

0.66

0.78

0.69

0.80

0.72

0.83

0.74

0.76

0.69

0.81

0.72

0.03

0.01

0.02

0.01

an

Expand each candidate network

A Set of seed networks

F

us

R

Evaluate the new networks

M

Measure

genomic data

cr

Network growth

ip t

method.

On the basis of prior biological information

gene expression profiles

Small average shortest path distance

Protein subcellular localization profiles

High average clustering coefficient

protein phylogentic profiles

Return an inferred network

Ac ce p

te

d

Powerlaw degree distribution

Select the best fit

Figure 9: The Proposed Reconstruction Method

400

evaluation and topological characteristics in the mutation operation (see Table 6).

Thereafter, we used the proposed method depicted in Figure 9. As stated

in section 2, we started the GA run from an informed non-random population initialized with different small seed networks with different random sizes derived

405

from gold standard databases while four objective functions are used to evaluate the fitness of each individual. The first objective function uses prior biological information (Expression, Localization, and Phylogenetic profile) to evaluate the fitness of each individual of the population. The other three objective func-

25

Page 25 of 43

Table 7: The average, standard deviation, max and min of the recall, precision and f-measure

Measure

R

P

F

SP

ACC

Min

0.44

0.66

0.53

0.75

0.59

Max

0.70

0.94

0.80

0.95

0.81

Mean

0.55

0.78

0.64

0.82

Std. Dev.

0.08

0.09

0.08

0.06

Min

0.26

0.44

0.33

0.68

Max

0.60

0.79

0.68

0.85

0.73

Mean

0.45

0.63

0.52

0.76

0.61

Std. Dev.

0.09

0.07

0.09

0.07

0.04

Min

0.87

0.45

0.59

0.65

0.70

Max

0.95

0.85

0.90

0.87

0.91

Mean

0.92

0.66

0.76

0.75

0.81

0.47

0.02

0.09

0.07

0.05

0.05

0.94

0.73

0.82

0.79

0.84

0.96

0.84

0.90

0.86

0.90

Min Max Mean

0.95

0.80

0.86

0.83

0.88

Std. Dev.

0.005

0.04

0.02

0.03

0.02

te

BioGrid

0.06

d

Std. Dev.

M

Krogan

an

YeastCyc

0.68

us

KEGG

cr

Dataset

ip t

validation scores used to validate the resulted network of 20 independent runs.

Ac ce p

tions utilize topological properties (average shortest path length, average clus-

410

tering coefficient, and degree distribution) of the reconstructed network to be used as quantities measures (see Figure 9). We applied our method to four commonly-used protein networks in yeast. Two of them are networks related to the metabolism of the yeast: KEGG and BioCyc. We used KEGG API to extract 116 known pathways of budding yeast (Saccharomyces cerevisiae) that

415

involves 1341 genes having EC numbers and 832 enzyme-enzyme interactions. Such information was used as a gold standard database. We also used the computationally reconstructed network (YeastCyc - version 19.5) [8] which contains 1461 reactions in the database, and among them, 1168 are catalysed by enzymes. We extracted 779 EC numbers from YeastCyc assigned to 598 genes in yeast or-

420

ganism. The other two datasets are networks from protein-protein interaction:

26

Page 26 of 43

Krogan et al. [41] and BioGrid used by Yip and Gerstein [40]. The properties

ip t

of the four datasets used in the experimental work are outlined in Table 8. We Table 8: Topological properties of the datasets used in the experimental study.

Proteins

Interactions

Network

Avg. Clustering

Avg.

density

Coefficient

length

path

cr

Dataset KEGG-Used

387

808

0.011

0.42

KEGG-All

502

1053

0.008

0.40

YeastCyc-All

517

825

0.006

0.21

12.5

YeastCyc-USED

464

692

0.006

0.20

6.13

Krogan

1200

2333

0.003

0.45

5.22

BioGrid

1865

6436

0.004

0.37

4.33

6.90

M

an

us

6.73

Table 9: The number of common proteins among the omics data and the considered gold standard networks used in the experimental study.

Gene expression Gasch et al.

[49]

te

(6152)

Gene expression

d

Dataset (Size)

Spellman et al. [48] (6178)

Localization

Phylogenetic

(4156)

(3943)

385

385

313

385

KEGG-All

496

496

396

500

Ac ce p

KEGG-Used

YeastCyc-All

512

511

415

503

YeastCyc-USED

462

462

402

451

Krogan

1192

1190

1084

987

BioGrid

1863

1863

1863

1863

run our algorithm 20 times and calculate recall, precision, f-measure, specificity and accuracy values in each experiment. Then, we evaluated the distribution of

425

the results as depicted in Table 7. The standard deviation of the resulting Recall (R), Precision (P), F-measure (F), Specificity (SP) and Accuracy (ACC) was small (< 0.1) indicating good stability. As shown in Table 10, it is clear that the reconstruction of different networks did not differ from one another. The proposed method usually discovered highest interactions when relying only on the 27

Page 27 of 43

ip t cr us

Table 10: Result summary: Number of inferred interactions (#), Recall (R), Precision (P), F-measure (F), Specificity (SP) and Accuracy (ACC) for our method using different inference

an

criteria.

Metabolic networks

Network growth

KEGG R

P

F

SP

M

#

ACC #

R

YeastCyc P

F

SP

ACC

784 0.57 0.53 0.55 0.64 0.68

782 0.52 0.58 0.55 0.74 0.65

Topological properties

2289 0.23 0.64 0.34 0.74 0.40

4162 0.11 0.67 0.20 0.79 0.25

Prior

biological

data

+

1091 0.69 0.93 0.79 0.94 0.80

te

Topological properties

d

Prior biological data

Ac ce p

Network growth

#

R

909 0.60 0.79 0.68 0.85 0.73

Protein interaction networks Krogan P

F

BioGrid SP

ACC #

R

P

F

SP

ACC

Prior biological data

1657 0.83 0.60 0.69 0.72 0.75

4882 0.87 0.66 0.75 0.74 0.79

Topological properties

4243 0.31 0.57 0.41 0.70 0.48

9147 0.66 0.46 0.55 0.75 0.60

2075 0.95 0.84 0.95 0.90 0.87

5668 0.96 0.84 0.90 0.86 0.90

Prior

biological

data

+

Topological properties

28

Page 28 of 43

430

growth and preferential attachment feature in the mutation operator. However, many of the inferred interactions have no biological meaning compared with the

ip t

gold standard datasets. The structural properties of biological networks provide

essential principles of the organization of interactions in a particular network,

435

cr

but an entire biological network cannot be completely described and inferred us-

ing a small collection of statistical measures that solely takes into consideration

us

the structural properties of the network [45].

Taking into account the prior biological information (Expression, Localisation and Phylogenetic profile) of proteins involved in the experiments, our

440

an

method was capable of inferring the known interactions (recall values in Table 10). However, precision value and the number of inferred interaction were significantly lower compared with the results obtained when either using only

M

the topological properties of the network or combining prior biological information and topological properties. This is particularly the case when inferring the metabolic networks due to the lack of prior biological information available on the considered enzymes (see Table9).

d

445

te

We hypothesized that the performance of the method can be increased when utilizing both prior biological data and structural features of scale-free networks.

Ac ce p

Here, we consider the similarity of a pair of protein in at least two omics data sets as well as the preferential attachment way of network growth. By applying this mechanism to the mutation operator, we can predict more known interactions as shown in Table 10. To statistically compare these results, we used the McNemar test [64] which has been used by Roche-Lima et al. [9] to evaluate the performance of different methods used to predict metabolic network. McNemar's test defined a z score, calculated as: z=

(|Nsf − Nf s | − 1) Nsf + Nf s

(12)

Where (Nsf and Nf s ) is the number of times when one algorithm failed, and the other succeed. Assume that Nsf is the number of times algorithm A failed, and algorithm B succeed and Nf s is the number of instances algorithm B succeeded and algorithm A failed. z scores are interpreted as follows: when z equals to 29

Page 29 of 43

450

0, the two algorithms have the same performance. However, when z value is greater than 0, the performance of one algorithm is significantly better than the

ip t

other based on the value (Nsf and Nf s ). For example, if the value Nsf is greater than Nf s , the algorithm A performs better than algorithm B. We computed

455

cr

the z scores considering our method when using both criteria as algorithm A and the two remaining methods as Algorithm Bs. In all cases, we obtained z

us

scores greater than 0 and based on the (Nsf and Nf s ) method when using both criteria performed better with p-values < 0.05.

an

4.3. GPU Performance

Table 11: Comparison between GPU and CPU computational time

Time (h)

Sequential

M

Implementation

522

20 different runs

Mean

1.06

Worst

1.07

Best

1.05

Ac ce p

te

d

Parallel

To evaluate the GPU performance, we measured the speedup achieved by

460

parallelizing the objective functions of the entire population on a GPU chip. Therefore, we compared the results of GPU computation and CPU computation for performing the same tasks - reconstructing KEGG network. The CPU was an Intel Core i7 4930K (3.40GHz) processor running Windows 10. For CPU computation, we used the same evolutionary model as illustrated in Algorithm

465

1. Other conditions, such as the population size, mutation probability, and elitism rate, are also the same. A single run was performed on the CPU while 20 different runs were performed on the GPU. The results are reported in Table 11. It is obvious that we get a significant reduction in the execution time when we implement the method over the GPU card which yields 492× speedup. We also

470

executed the proposed parallel method for reconstructing biological networks 30

Page 30 of 43

ip t

2

1

cr

Time (h)

1.5

0.5

KEGG

BioCyc

us

0 Krogan Dataset

BioGrid

an

Figure 10: The average execution time of the GPU implementation for reconstructing four different biological networks.

M

(BioCyc, Krogan and BioGrid) with different networks sizes. Figure 10 shows the average execution time for 20 different runs.

d

4.4. Topological Analysis of Networks

475

te

Structural network analysis takes into consideration the topological and not the functional form of the reconstructed network. These are a set of essential topological characteristics of biological networks, namely power-law degree

Ac ce p

distribution, small average shortest path length and high average clustering coefficient. We analyzed the topological properties of the reconstructed networks in order to verify if the structure of such networks still has the characteristics of

480

scale-free networks. As mentioned in the method section, we used an enzymatic view graph-theoretical representation for the metabolic network in which nodes are enzymes, and edges represent the enzyme-enzyme relation (two enzymes catalyse two consecutive reactions). The output degree distributions of 12 reconstructed networks using four gold

standard datasets and three different ways for network growth are shown in Figure 11. The data sets are KEGG (see Figure 11a), YeastCyc (see see Figure 11b), BioGrid used by Yip and Gerstein [40] (see see Figure 11c) and Krogan et al. [41] (see see Figure 11d) while the different mechanisms for network growth

31

Page 31 of 43

Table 12: Topological properties of the reconstructed metabolic networks

Average

Coefficient

path length

Prior biological data

0.04

0.52

3.7

Topological properties

0.07

0.31

2.3

0.01

0.33

5.9

Prior biological data

0.02

0.21

4.7

Topological properties

0.06

0.21

0.01

0.19

Topological properties

YeastCyc

Prior biological data + Topological properties

Krogan

Prior biological data

0.01

Topological properties

0.03

Topological properties Prior biological data Topological properties Prior biological data +

1.5

1.5

2.4

2.3

5.4

1.6 2.8

0.25

2.07

2.2

0.46

5.07

2.5

0.01

0.37

3.7

2.2

0.02

0.24

2.6

2.0

0.36

4.18

2.1

0.01

0.005

te

Topological properties

1.3

4.71

d

BioGrid

1.4

0.44

M

Prior biological data +

(γ)

cr

Prior biological data +

exponent

ip t

Clustering

density

us

KEGG

Network

Network growth

an

Dataset

Ac ce p

are prior biological data and topological characteristics. In these figures, P (k) is the probability of nodes which have a k degree. It was calculated by dividing

the number of proteins which had k links with the total number of proteins in the resulted network. As shown in Figure 11, the scale-free property in which a few nodes have a lot of interactions and many nodes have few interaction exists clearly when (1) using the prior biological data and topological characteristics to build the four studied networks, (2) only using the prior biological data for inferring the Krogan networks. Using the topological properties and the seven fitness values for reconstructing the network did not result in a clear power-law degree distribution except with the Krogan dataset. This is because of taking into account the prior biological information when evaluating the fitness of the candidate solutions. We further used the powerlaw Python package to determine

32

Page 32 of 43

whether the degree distribution of each network reconstructed is considered a good fit to power-law degree distribution. Experimental results show that

ip t

networks reconstructed follow the power-law degree distribution with p-values

between 1e−1 and 1e−20. We also calculated the degree exponent of each degree

X γ = 1 + N[ ln

(13)

us

i

ki ]−1 kmin − 12

cr

distribution (see TABLE 12) using the following functions [65]:

Where N is the network size, ki is the degree of node i and kmin is the minimum 485

degree in the network. The degree exponent of power-law degree distribution

an

lies between the values 2 and 3. In our reconstructions, using only topological properties obtains values between 2 and 3 for all resulted networks except the KEGG network. When using only the omics data, the degree exponent is be-

490

M

tween the values 2 and 3 for protein interaction networks and less than 2 for the reconstructed metabolic networks. Also, when using prior both mechanisms, the degree exponent is between the values 2 and 3 for protein interaction networks

d

and less than 2 for the reconstructed metabolic networks.

te

Table 12 also shows other important features of the biological networks, namely the clustering coefficient and network density. The high values of the clustering coefficients imply that a pair of proteins tends to be densely clustered

Ac ce p

495

in the protein interaction networks. While the low values of the network density represent the sparsity of the network. In biological networks, the average path length scales logarithmically with the number of nodes [33]. We calculated the logarithm of the reconstructed networks size and all results are close to the

500

number 2. As shown in Table 12, the average path lengths range from 2 to 5. This is due to using seven different criteria for selecting the best solution with one criterion being the average path length. Overall, the experimental finding is that the protein-protein interaction network is better than the metabolic network. This is because of the lack of the

505

omics datasets for the enzyme proteins while there is abundant prior biological information on the proteins involved in the protein interaction dataset used in this study (see Table 9). For the reconstruction of the protein interaction 33

Page 33 of 43

network, the most important biological information is the phylogenetic profiles

510

Table 9).

cr

4.5. Comparison to State-of-the-art Algorithm in Literature

ip t

followed by the expression data and finally the subcellular localisation data (see

As presented previously in the related work section, the methods proposed in the literature for reconstructing metabolic networks and protein-protein interac-

515

us

tion networks belong to two classes: de novo and supervised machine learning methods. For comparison, we only consider the supervised machine learning

an

based methods because these methods proposed to reconstruct metabolic networks represented as enzyme graph and protein-protein interaction networks which are similar to our work. However, the supervised machine learning based

520

M

methods depend on the AUC measure - a well-known metric used to evaluate the performance of machine learning based methods - except Vert et al. [66] who used the accuracy measure as well. For this reason, we compared our pro-

d

posed method to the-state-of-the-art method in the literature proposed by Vert

te

et al. [66]. Vert et al. [66] used two pairwise kernels: metric learning pairwise kernel (MLPK) and tensor product pairwise kernel (TPPK) to infer similarities 525

between two pairs of enzymes. The mapping specifies two pairs of proteins as

Ac ce p

similar to each other when each protein in a pair is similar to one corresponding protein in the other pair using four genomic datasets: expression, localization, phylogenetic profile and yeast two-hybrid. Such data have been used individually and together to train SVM in order to reconstruct enzyme networks. In

530

their method, the best results of the inference have been yielded from the combination of the four datasets and the MLPK approach performs better than the TPPK approach on three out of four datasets. It reached an overall accuracy of 83.9 ± 0.9 when using a combination of genomics data as well as a combination

of kernels. It outperformed another state-of-the-art method proposed by Ben535

Hur and Noble [67] in terms of AUC score. For comparison, we used the same data set which is derived from KEGG database by Yamanishi et al. [4]. We found that the performance of our method is better in terms of the accuracy 34

Page 34 of 43

15

20

10

15

0.04 0.02

Frequency p(k) 5

0.00

0.10

25

Degree k

ip t

10

20

0

10 20 30 40 50 60

cr

5

Topological characteristics

0.00

0.2

Frequency p(k)

Omics data

0.0

Frequency p(k)

Omics data & Topological characteristics

Degree k

Degree k

15

20

0.04

Frequency p(k)

0.30

5

10

0.02 0.00

0.15

an

10

Topological characteristics

15

0

20

M

5

0.00

0.2

0.4

Frequency p(k)

Omics data

0.0

Frequency p(k)

Omics data & Topological characteristics

us

(a) Reconstruction KEGG Network on the Basis of Three Different Mechanisms

Degree k

Degree k

40

60

80

Degree k

20

60

100

140

0

40

0.02

Frequency p(k) 20

Degree k

0.00

0.04

Frequency p(k)

te 0

0.04

Topological characteristics

0.08

Omics data

0.00

0.00 0.04 0.08

Ac ce p

Frequency p(k)

Omics data & Topological characteristics

d

(b) Reconstruction YeastCyc Network on the Basis of Three Different Mechanisms

60

0

20

Degree k

40

60

80

Degree k

(c) Reconstruction BioGrid Network on the Basis of Three Different Mechanisms

0

10

30 Degree k

50

0

10

20

30

Degree k

40

0.06 0.03 0.00

Frequency p(k)

Topological characteristics

0.10

Frequency p(k)

Omics data

0.00

0.20 0.10 0.00

Frequency p(k)

Omics data & Topological characteristics

0

20

40

60

80

Degree k

(d) Reconstruction Krogan Network on the Basis of Three Different Mechanisms Figure 11: Degree distribution results for the reconstructed networks.

35

Page 35 of 43

score (91%).

540

ip t

5. Conclusion

It appears that metabolic networks represented as enzyme graph and protein

cr

interaction networks can be reconstructed using the prior biological data about

individual gene or protein (expression, subcellular localisation and phylogenetic

us

profile) and the topological characteristics of the scale-free network. Previous publications used biological information for reconstructing protein interaction 545

networks [29][68][4][25]. However, to the best of our knowledge, this is the first

an

paper that includes the topological characteristics of the scale-free networks in the process of metabolic network reconstruction. We propose a method based on metaheuristic algorithm (GA) for inferring different networks. Such a method

550

M

utilizes the growth and preferential attachment mechanism as well as the available biological data about individual genes and proteins. We found that our

d

method identified a high percentage of the known considered networks: KEGG, YeasCyc, BioGrid used by Yip and Gerstein [40] and Krogan et al. [41] network.

te

Furthermore, the resulted networks satisfied the common structural properties of the scale-free network (power-law degree distribution with exponent values between 2 and 3 in Krogan and BioGrid networks and less than 2 in KEGG and

Ac ce p

555

YeastCyc networks, low-density rate, small average shortest path). We understand that we need larger biological data to test the general perfor-

mance and stability of our method on organisms. This is designated for future work. Also, we plan to test our method on different biological networks for

560

different organisms when sufficient prior biological data is available.

References

[1] C. St. Clair, J. E. Visick, Exploring Bioinformatics: A Project-Based Approach, 2nd Edition, Jones and Bartlett Publishers, Inc., 2013. [2] M. Zvelebil, J. Baum, Understanding Bioinformatics, Garland Science, 565

2008. 36

Page 36 of 43

[3] R. Sharan, T. Ideker, Modeling cellular machinery through biological net-

ip t

work comparison, Nature biotechnology 24 (4) (2006) 427–433. [4] Y. Yamanishi, J.-P. Vert, M. Kanehisa, Protein network inference from

multiple genomic data: a supervised approach, Bioinformatics 20 (suppl 1) (2004) i363–i370.

cr

570

[5] V. Lacroix, L. Cottret, P. Th´ebault, M.-F. Sagot, An introduction to

us

metabolic networks and their structural analysis, IEEE/ACM Trans. Comput. Biol. Bioinformatics 5 (4) (2008) 594–617.

575

an

[6] C. Smolke, The Metabolic Pathway Engineering Handbook: Fundamentals, The Metabolic Pathway Engineering Handbook, CRC Press, 2009.

M

[7] Y. Yamanishi, Supervised Inference of Metabolic Networks from the Integration of Genomic Data and Chemical Information, John Wiley & Sons, Inc., 2010, pp. 189–211.

T. J. Lee, P. Kaipa, F. Gilham, A. Spaulding, L. Popescu, et al., Pathway

te

580

d

[8] P. D. Karp, S. M. Paley, M. Krummenacker, M. Latendresse, J. M. Dale,

tools version 13.0: integrated software for pathway/genome informatics and

Ac ce p

systems biology, Briefings in bioinformatics 11 (1) (2010) 40–79. [9] A. Roche-Lima, M. Domaratzki, B. Fristensky, Metabolic network prediction through pairwise rational kernels., BMC bioinformatics 15 (1) (2014)

585

318.

[10] M. Kotera, Y. Tabei, Y. Yamanishi, A. Muto, Y. Moriya, T. Tokimatsu, S. Goto, Metabolome-scale prediction of intermediate compounds in multistep metabolic pathways with a recursive supervised approach, Bioinformatics 30 (12) (2014) i165–i174.

590

[11] Y. Yamanishi, Y. Tabei, M. Kotera, Metabolome-scale de novo pathway reconstruction using regioisomer-sensitive graph alignments, Bioinformatics 31 (12) (2015) i161–i170.

37

Page 37 of 43

[12] S. Taghipour, P. Zarrineh, M. Ganjtabesh, A. Nowzari-Dalini, Improving protein complex prediction by reconstructing a high-confidence proteinprotein interaction network of escherichia coli from different physical in-

ip t

595

teraction data sources, BMC Bioinformatics 18 (1) (2017) 10.

URL https://doi.org/10.1186/s12859-016-1422-x

cr

10.1186/s12859-016-1422-x.

doi:

600

us

[13] J.-P. Vert, Reconstruction of Biological Networks by Supervised Machine Learning Approaches, John Wiley & Sons, Inc., 2010, pp. 163–188.

an

[14] P. D. Karp, M. Latendresse, R. Caspi, The pathway tools pathway prediction algorithm, Standards in genomic sciences 5 (3) (2011) 424. [15] P. D. Karp, S. Paley, P. Romero, The pathway tools software, Bioinformat-

605

M

ics 18 (suppl 1) (2002) S225–S232.

[16] S. Tsoka, D. Simon, C. A. Ouzounis, Automated metabolic reconstruction

d

for methanococcus jannaschii, Archaea 1 (4) (2004) 223–229.

te

[17] M. Chen, R. Hofest¨ adt, Web-based information retrieval system for the prediction of metabolic pathways, NanoBioscience, IEEE Transactions on

Ac ce p

3 (3) (2004) 192–199. 610

[18] M. Yousofshahi, K. Lee, S. Hassoun, Probabilistic pathway construction, Metabolic engineering 13 (4) (2011) 435–444.

[19] J. Meitalovs, Software tool for probabilistic metabolic pathways construction, in: Computational Intelligence and Informatics (CINTI), 2012 IEEE 13th International Symposium on, IEEE, 2012, pp. 405–408.

615

[20] D. C. McShan, S. Rao, I. Shah, Pathminer: predicting metabolic pathways by heuristic search, Bioinformatics 19 (13) (2003) 1692–1698. [21] Y. Moriya, D. Shigemizu, M. Hattori, T. Tokimatsu, M. Kotera, S. Goto, M. Kanehisa, Pathpred: an enzyme-catalyzed metabolic pathway prediction server, Nucleic acids research (2010) gkq318. 38

Page 38 of 43

620

[22] K. Tanaka, K. Nakamura, T. Saito, H. Osada, A. Hirai, H. Takahashi, S. Kanaya, M. Altaf-Ul-Amin, Metabolic pathway prediction based on in-

ip t

clusive relation between cyclic substructures, Plant biotechnology 26 (5) (2009) 459–468.

625

cr

[23] Y. Moriya, M. Itoh, S. Okuda, A. C. Yoshizawa, M. Kanehisa, Kaas: an automatic genome annotation and pathway reconstruction server, Nucleic

us

acids research 35 (suppl 2) (2007) W182–W185.

[24] M. Nakamura, T. Hachiya, Y. Saito, K. Sato, Y. Sakakibara, An efficient al-

an

gorithm for de novo predictions of biochemical pathways between chemical compounds, BMC bioinformatics 13 (17) (2012) S8. 630

[25] K. Bleakley, G. Biau, J.-P. Vert, Supervised reconstruction of biological

M

networks with local models, Bioinformatics 23 (13) (2007) i57–i65. [26] K. Y. Yip, M. Gerstein, Training set expansion: an approach to improving

d

the reconstruction of biological networks from limited and uneven reliable interactions, Bioinformatics 25 (2) (2009) 243–250. [27] M. Kotera, Y. Yamanishi, Y. Moriya, M. Kanehisa, S. Goto, Genies: gene

te

635

network inference engine based on supervised analysis, Nucleic acids re-

Ac ce p

search (2012) gks459.

[28] J.-P. Vert, Y. Yamanishi, Supervised graph inference, in: Advances in Neural Information Processing Systems, 2004, pp. 1433–1440.

640

[29] Y. Yamanishi, J.-P. Vert, M. Kanehisa, Supervised enzyme network inference from the integration of genomic data and chemical information, Bioinformatics 21 (suppl 1) (2005) i468–i477.

[30] J. M. Dale, L. Popescu, P. D. Karp, Machine learning methods for metabolic pathway prediction, BMC bioinformatics 11 (1) (2010) 15.

645

[31] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, A.-L. Barab´asi, The largescale organization of metabolic networks, Nature 407 (6804) (2000) 651– 654. 39

Page 39 of 43

[32] X. F. Wang, G. Chen, Complex networks: small-world, scale-free and be-

650

ip t

yond, Circuits and Systems Magazine, IEEE 3 (1) (2003) 6–20. [33] A.-L. Barabasi, Z. N. Oltvai, Network biology: understanding the cell’s

cr

functional organization, Nature reviews genetics 5 (2) (2004) 101–113.

[34] S. Kikuchi, D. Tominaga, M. Arita, K. Takahashi, M. Tomita, Dynamic modeling of genetic networks using genetic algorithm and s-system, Bioin-

655

us

formatics 19 (5) (2003) 643–650.

[35] P. Koduru, S. Das, S. Welch, J. L. Roe, Fuzzy Dominance Based Multi-

an

objective GA-Simplex Hybrid Algorithms Applied to Gene Network Models, Springer Berlin Heidelberg, 2004, pp. 356–367.

M

[36] E. Keedwell, A. Narayanan, Discovering gene networks with a neuralgenetic hybrid, IEEE/ACM Transactions on Computational Biology and 660

Bioinformatics 2 (3) (2005) 231–242.

d

[37] A. Konak, D. W. Coit, A. E. Smith, Multi-objective optimization using

te

genetic algorithms: A tutorial, Reliability Engineering & System Safety 91 (9) (2006) 992–1007.

Ac ce p

[38] A.-L. Barab´ asi, R. Albert, Emergence of scaling in random networks, sci-

665

ence 286 (5439) (1999) 509–512.

[39] I. Thiele, B. . Palsson, A protocol for generating a high-quality genomescale metabolic reconstruction, Nature protocols 5 (1) (2010) 93121.

[40] K. Y. Yip, M. Gerstein, Training set expansion: an approach to improving the reconstruction of biological networks from limited and uneven reliable

670

interactions, Bioinformatics 25 (2) (2008) 243–250.

[41] N. J. Krogan, G. Cagney, H. Yu, G. Zhong, X. Guo, A. Ignatchenko, J. Li, S. Pu, N. Datta, A. P. Tikuisis, et al., Global landscape of protein complexes in the yeast saccharomyces cerevisiae, Nature 440 (7084) (2006) 637.

40

Page 40 of 43

[42] J. Alstott, E. Bullmore, D. Plenz, powerlaw: a python package for analysis of heavy-tailed distributions, PloS one 9 (1) (2014) e85777.

ip t

675

[43] R. W. Floyd, Algorithm 97: shortest path, Communications of the ACM

cr

5 (6) (1962) 345.

[44] S. Hougardy, The floyd–warshall algorithm on graphs with negative cycles,

680

us

Information Processing Letters 110 (8-9) (2010) 279–281.

[45] A. Bailey, M. Ventresca, B. Ombuki-Berman, Genetic programming for the automatic inference of graph models for complex networks, Evolutionary

an

Computation, IEEE Transactions on 18 (3) (2014) 405–419.

[46] M. B. Eisen, P. T. Spellman, P. O. Brown, D. Botstein, Cluster analysis and

685

M

display of genome-wide expression patterns, Proceedings of the National Academy of Sciences 95 (25) (1998) 14863–14868. [47] A. Kimmig, F. Costa, Link and Node Prediction in Metabolic Networks

d

with Probabilistic Logic, Springer Berlin Heidelberg, 2012, pp. 407–426.

te

[48] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, B. Futcher, Comprehensive identification of cell cycle–regulated genes of the yeast saccharomyces cerevisiae by

Ac ce p

690

microarray hybridization, Molecular biology of the cell 9 (12) (1998) 3273– 3297.

[49] A. P. Gasch, P. T. Spellman, C. M. Kao, O. Carmel-Harel, M. B. Eisen, G. Storz, D. Botstein, P. O. Brown, Genomic expression programs in the

695

response of yeast cells to environmental changes, Molecular biology of the cell 11 (12) (2000) 4241–4257.

[50] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, R. B. Altman, Missing value estimation methods for dna microarrays, Bioinformatics 17 (6) (2001) 520–525.

41

Page 41 of 43

700

[51] R. Wehrens, L. M. Buydens, et al., Self-and super-organizing maps in r:

ip t

the kohonen package, J Stat Softw 21 (5) (2007) 1–19. [52] R. C. Team, et al., R: A language and environment for statistical comput-

cr

ing.

[53] W.-K. Huh, J. V. Falvo, L. C. Gerke, A. S. Carroll, R. W. Howson, J. S. Weissman, E. K. O’Shea, Global analysis of protein localization in budding yeast, Nature 425 (6959) (2003) 686–691.

us

705

[54] R. Jothi, T. M. Przytycka, L. Aravind, Discovering functional linkages and

an

uncharacterized cellular pathways using phylogenetic profile comparisons: a comprehensive assessment, BMC bioinformatics 8 (1) (2007) 173. [55] M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, T. O. Yeates,

M

710

Assigning protein functions by comparative genome analysis: protein phylogenetic profiles, Proceedings of the National Academy of Sciences 96 (8)

d

(1999) 4285–4288.

715

te

[56] J. De Las Rivas, J. J. Lozano, A. R. Ortiz, Comparative analysis of chloroplast genomes: functional annotation, genome-based phylogeny, and de-

Ac ce p

duced evolutionary patterns, Genome research 12 (4) (2002) 567–583. [57] J. C. Mellor, I. Yanai, K. H. Clodfelter, J. Mintseris, C. DeLisi, Predictome: a database of putative functional links between proteins, Nucleic acids research 30 (1) (2002) 306–309.

720

[58] A. Nakaya, T. Katayama, M. Itoh, K. Hiranuka, S. Kawashima, Y. Moriya, S. Okuda, M. Tanaka, T. Tokimatsu, Y. Yamanishi, et al., Kegg oc: a largescale automatic construction of taxonomy-based ortholog clusters, Nucleic acids research 41 (D1) (2013) D353–D357. [59] B. L. Miller, D. E. Goldberg, Genetic algorithms, tournament selection,

725

and the effects of noise, Complex systems 9 (3) (1995) 193–212.

42

Page 42 of 43

[60] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: Nsga-ii, IEEE transactions on evolutionary

ip t

computation 6 (2) (2002) 182–197.

[61] G. Taguchi, S. Chowdhury, S. Taguchi, Robust engineering, McGraw-Hill Professional, 2000.

cr

730

¨ urk, N. Kaya, F. Ozt¨ ¨ urk, Hybrid multi-objective shape [62] A. R. Yıldız, N. Ozt¨

us

design optimization using taguchi’s method and genetic algorithm, Structural and Multidisciplinary Optimization 34 (4) (2007) 317–332.

735

an

[63] E.-G. Talbi, Metaheuristics: from design to implementation, Wiley, 2009. [64] Q. McNemar, Note on the sampling error of the difference between corre-

M

lated proportions or percentages, Psychometrika 12 (2) (1947) 153–157. [65] M. Newman, Networks: An Introduction, Oxford University Press, Inc.,

d

2010.

[66] J.-P. Vert, J. Qiu, W. S. Noble, A new pairwise kernel for biological network inference with support vector machines, BMC Bioinformatics 8 (10) (2007)

te

740

Ac ce p

S8. doi:10.1186/1471-2105-8-S10-S8. [67] A. Ben-Hur, W. S. Noble, Kernel methods for predicting proteinprotein interactions, Bioinformatics 21 (2005) i38–i46.

doi:10.1093/

bioinformatics/bti1016.

745

[68] M. Kotera, Y. Yamanishi, Y. Moriya, M. Kanehisa, S. Goto, Genies: gene network inference engine based on supervised analysis, Nucleic Acids Research 40 (W1) (2012) W162–W167.

43

Page 43 of 43