Prediction of metabolic pathways from genome-scale metabolic networks

Prediction of metabolic pathways from genome-scale metabolic networks

BioSystems 105 (2011) 109–121 Contents lists available at ScienceDirect BioSystems journal homepage: www.elsevier.com/locate/biosystems Prediction ...

2MB Sizes 1 Downloads 113 Views

BioSystems 105 (2011) 109–121

Contents lists available at ScienceDirect

BioSystems journal homepage: www.elsevier.com/locate/biosystems

Prediction of metabolic pathways from genome-scale metabolic networks Karoline Faust a,∗ , Didier Croes b , Jacques van Helden b a b

Research Group of Bioinformatics and (Eco-)Systems Biology (BSB), VIB – Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussels, Belgium Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe), Université Libre de Bruxelles, Boulevard du Triomphe, B-1050 Brussels, Belgium

a r t i c l e

i n f o

Article history: Received 22 December 2010 Received in revised form 23 March 2011 Accepted 5 May 2011 Keywords: Metabolic pathway definition Metabolic pathway prediction Metabolic network representation Subgraph extraction

a b s t r a c t The analysis of a variety of data sets (transcriptome arrays, phylogenetic profiles, etc.) yields groups of functionally related genes. In order to determine their biological function, associated gene groups are often projected onto known pathways or tested for enrichment of known functions. However, these approaches are not flexible enough to deal with variations or novel pathways. During the last decade, we developed and refined an approach that predicts metabolic pathways from a global metabolic network encompassing all known reactions and their substrates/products, by extracting a subgraph connecting at best a set of seed nodes (compounds, reactions, enzymes or enzyme-coding genes). In this review, we summarize this work, while discussing the problems and pitfalls but also the advantages and applications of network-based metabolic pathway prediction. © 2011 Elsevier Ireland Ltd. All rights reserved.

1. Introduction 1.1. Interpretation of functionally associated genes A variety of experimental and in silico approaches identifies groups of genes likely to participate in a common biological process: transcriptome arrays, prediction of operons and regulons, groups of synteny, gene fusions, phylogenetic profiles, etc. Researchers are then faced with the challenge of understanding the biological function of a group of associated genes. Usually, this task is tackled by gene set enrichment analysis (GSEA) (e.g. Subramanian et al., 2005; Backes et al., 2007), which detects overrepresented functional categories in the gene set of interest or by pathway projection (also known as pathway mapping) e.g. (Dahlquist et al., 2002; Paley and Karp, 2006; Adler et al., 2008), which maps genes of interest to a set of known pathways. Our group spent a decade of research to develop an alternative approach to interpret associated enzyme-coding genes. This review will summarize this approach and discuss its drawbacks as well as its benefits. 1.2. Pathway-based interpretation of functionally associated genes Both GSEA and pathway mapping approaches rely on predefined functional groups (e.g. Gene Ontology classes) or pathways

∗ Corresponding author. E-mail addresses: [email protected] (K. Faust), [email protected] (D. Croes), [email protected] (J. van Helden). 0303-2647/$ – see front matter © 2011 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.biosystems.2011.05.004

(e.g. KEGG reference maps) and are therefore unable to deal with variants or combinations of the reference pathways or to detect novel pathways. GSEA returns a list of functional classes showing a significant intersection with the query gene set. Such associations provide a first clue to the possible functions of the gene set of interest, but processes are considered as “bags” of genes, without any indication about internal relationships and interconnections of their individual activities. Pathway mapping provides a more interpretable result, by highlighting the relationships between the mapped enzymes, but fails to identify “transversal” pathways, i.e. pathways that combine fragments of several reference pathways (e.g. in yeast, methionine biosynthesis combines the synthesis of the carbon skeleton from aspartate and the incorporation of sulfide resulting from the sulfur assimilation pathway). Biological processes are highly interconnected and are therefore better described as a network rather than as a set of separated pathways. Taking a network perspective can overcome the restrictions of GSEA and pathway projection. For these reasons, ab initio pathway discovery methods are required to annotate the metabolism of the thousands of organisms for which we dispose of a fully sequenced genome but whose pathways have never been characterized experimentally. In addition, ab initio pathway discovery can be employed to integrate data from high-throughput experiments with knowledge on biological networks. The scores from these experiments are converted into network weights, which allow favoring pathways that incorporate a maximal number of high-scoring (i.e. highly differentially expressed) genes. We will discuss this application of pathway discovery in more detail in Section 10.3.

110

K. Faust et al. / BioSystems 105 (2011) 109–121

1.3. Network-based interpretation of functionally associated genes Zien et al. (2000) were among the first to analyze co-expressed genes by extracting pathways from a network instead of mapping the genes to pre-defined pathways. They first constructed a metabolic network (reactions and compounds) collected from various databases, and enumerated all possible pathways between glucose and pyruvate (two-end path finding). These pathways were subsequently scored according to the expression ratios of their associated genes. The expression data (DeRisi et al., 1997) monitored the difference in gene expression between a start time (reference time point) and several time points during the transition of yeast cells from fermentation to respiration. The canonical glycolysis pathway ranked at the top of the list of expression-based scored pathways. This two-end path finding approach however requires to give as input the start and the end of the pathway, an information which is usually missing when dealing with groups of functionally related genes. In 2000, one of us proposed an approach based on subgraph extraction, to assemble metabolic pathways from a set of “seed” reactions (van Helden et al., 2000). As a proof of concept, this approach was applied to a set of 20 genes co-expressed during the yeast cell cycle, resulting in a linear pathway from sulfate to lmethionine that combined the annotated pathways for methionine biosynthesis and sulfur assimilation (van Helden et al., 2000). Fig. 1 summarizes this procedure. Starting from a set of coexpressed genes, a first step is to identify enzyme-coding genes and the reactions associated to them. In some cases, these seed reactions are directly connected by their substrates and products, so that they can be immediately interlinked to form a pathway. However, the main idea of the approach is to fill gaps between the seed reactions by extracting intermediate compounds and reactions from a metabolic network. In the following, we will present various metabolic pathway definitions and introduce a more functionally oriented definition of a metabolic pathway (Sections 2–4). Next, we will enumerate different ways to represent metabolic networks as graphs and highlight the problems of graph traversal in metabolic networks (Section 5). The main problem is the presence of highly connected (hub) compounds, which leads to biochemically infeasible pathways when traversing the graph naively. The impact of this problem on the measurement of network properties such as the average path length (“small-worldness”) will be discussed in Section 7. We will expound our solutions to the hub compound problem in the context of two-end path finding (between two compounds or reactions) in Section 6 and of multiple-end pathway prediction (between a set of compounds or reactions) in Section 8. In addition, we will demonstrate multiple-end pathway prediction on an example (Section 9). Finally, we will discuss the strengths and weaknesses of topology-based pathway prediction and its applications in microarray analysis and metabolic network reconstruction.

2. From networks to pathways A major weakness of GSEA and pathway projection is their reliance on fixed pathway boundaries. In many cases, textbooks and databases agree on the boundary of a pathway (e.g. glycolysis starts with glucose and ends with pyruvate), but there are also some “fuzzy” pathways with less well defined boundaries (for instance, KEGG module M00176 sulfur reduction starts from sulfate (Kanehisa et al., 2008), whereas the MetaCyc sulfur reduction pathways start all from elemental sulfur (Caspi et al., 2008)). In addition, pathway databases deal differently with pathway variants (for example, KEGG maps merge all known variants of a pathway,

whereas MetaCyc stores them as separate entities). Thus, pathway projection results depend on the specific reference pathway source selected (KEGG maps, KEGG modules, SEED subsystems (Overbeek et al., 2004), MetaCyc pathways, etc.). One may define pathway boundaries by partitioning the metabolic network with different algorithms (Gagneur et al., 2003; Gerlee et al., 2009; Guimerà and Amaral, 2005). However, the performance of these algorithms is assessed by comparing their output to a reference pathway set. Thus, metabolic network partitioning does not solve the underlying problem: What is a metabolic pathway? What does distinguish current reference pathways from random reaction sequences? This question is not easily answered, as the variety of existing metabolic pathway definitions underlines (e.g. Lacroix et al., 2008).

3. What is a metabolic pathway? Since the metabolic pathway definition of choice determines the prediction procedure, we will briefly summarize some current definitions. In the words of the classical biochemistry textbook Nelson and Cox (2005), a metabolic pathway is a “sequence of enzyme-catalyzed reactions by which a living organism transforms an initial source compound into a final target compound”. This definition has some shortcomings: (1) it excludes branched pathways (e.g. aromatic amino acid biosynthesis) and cycles (e.g. TCA cycle), (2) it does not take into account spontaneous reactions, i.e. reactions that occur without being catalyzed by an enzyme (e.g. the conversion from l-glutamate gamma-semialdehyde into water and (S)-1-pyrroline5-carboxylate). A generic definition that also covers branched and cyclic pathways considers a pathway to be a sub-network of the metabolic network (e.g. Forst and Schulten, 1999). Since this definition can be expressed entirely in a graph theoretical form, without the need of additional concepts, it could be also termed the topological definition of metabolic pathways. However, this definition suffers from a severe problem: It does not differentiate between biochemically valid and invalid pathways. To exemplify this problem, Fig. 2 shows a biochemically irrelevant pathway that perfectly conforms to the topological definition. Since the topological definition does not distinguish between biochemically relevant and irrelevant pathways, alternatives have been proposed in the literature, which reduce the number of permissible pathways by imposing constraints. Schuster et al. (1999) introduced the elementary modes, which are defined as a “minimal set of enzymes that could operate at steady state with all irreversible reactions proceeding in the appropriate direction”. To qualify as an elementary mode, a pathway has to stoichiometrically balance all its compounds and in addition satisfy a non-decomposability constraint, which roughly says that it cannot be further divided into stoichiometrically balanced sub-pathways. How elementary modes can be applied to discover novel pathways is briefly reviewed in Schuster et al. (2010). Interestingly, the elementary modes correspond to the minimal T-invariants known from Petri net theory (Voss et al., 2003). A related concept are the extreme pathways, introduced by Schilling et al. In elementary mode as well as in flux balance analysis, the metabolic system is assumed to be at steady state. Thus, the left-hand site of equation dx/dt = S × v, where x is the compound concentration, S is the stoichiometric matrix and v is the flux vector, can be set to zero. The set of solutions satisfying this simplified equation plus a number of additional constraints (e.g. a non-negative constraint) forms the flux cone. The extreme pathways are the edges of the flux cone, that is they are the basis from which all other fluxes can be obtained by linear combination (Schilling et al., 2000). The concept

K. Faust et al. / BioSystems 105 (2011) 109–121

111

Fig. 1. Flow chart of pathway prediction by subgraph extraction. The approach takes two inputs: (1) a set of seed reactions, which are obtained from a set of associated enzyme-coding genes and (2) a metabolic network constructed from a metabolic database, such as KEGG (Kanehisa et al., 2008) or MetaCyc (Caspi et al., 2008). A metabolic pathway is computed by connecting the seed reactions (in blue) with intermediate reactions and compounds (in yellow) from the metabolic network. Image sources: the global KEGG map image and KEGG symbol were obtained from the KEGG database homepage (Kanehisa et al., 2008), the MetaCyc symbol from the MetaCyc home page (Caspi et al., 2008). The operon image was taken from RegulonDB (Gama-Castro et al., 2008) and the genome image from the Comprehensive Microbial Resources (Davidsen et al., 2010).

of extreme currents is closely related to that of extreme pathways (Clarke, 1988). Another definition was advanced by Pitkänen et al. (2005), which states that pathways should be minimal feasible (that is selfmaintaining) systems. More precisely, the reactions contained in a pathway should be necessary and sufficient to synthesize all its compounds. Chemical organization theory (e.g. Centler et al., 2005) refines this definition. A chemical organization has two properties: self-maintenance (for each compound of the pathway there exists a non-negative flux vector for its synthesis) and closure (each com-

pound that can be generated by the reactions of the system is a part of the system). Both properties guarantee that chemical organizations are stable reaction systems. These constraint-based definitions divide the pathway’s compounds in two sets: internal compounds that have to satisfy the constraints imposed by the definition (i.e. of stoichiometrically balance or self-maintenance) and external compounds that are excluded from the constraint. It is not always easy to know which compound belongs to which set, a problem that will be further discussed in Section 5.2.2.

112

K. Faust et al. / BioSystems 105 (2011) 109–121

consumes l-cysteine, whose sulfur is transferred to cystathioinine. The resulting depletion in l-cysteine may eventually lead to the activation of the cystein biosynthesis pathway. This case is far from exceptional: a representative fraction of the canonical metabolic pathways are not balanced (Planes and Beasley, 2008). When inferring pathways from sets of functionally related enzymes, we can thus relax the self-maintenance and stoichiometric balance constraints. Nevertheless, it is important to differentiate between biochemically relevant and irrelevant connections of the seed reactions. In the absence of further constraints, an atom flow-based definition is most helpful for this purpose. 4. Functional definition of a metabolic pathway

Fig. 2. Example of a metabolic pathway that conforms to the topological pathway definition but is biochemically invalid. This pathway suggests that ADP can be synthesized from d-glucose within one step, which would require an unrealistically large number of atomic re-arrangements, additions and eliminations. Such unrealistic pathways often result when a naive path finding algorithm traverses metabolic networks without distinguishing between main and side compounds.

Other definitions emphasize the matter flow through the pathway. For instance, Arita (2004) defines a pathway as follows: “A metabolic pathway (pathway for short) from metabolite X to Y is defined as a sequence of biochemical reactions through which at least one carbon atom in X reaches Y”. This definition allows distinguishing main compounds that carry matter through the pathway from side compounds that are energy or electron donors/acceptors (e.g. ATP/ADP or NADPH/NADP+ ) or small inorganic molecules (e.g. CO2 or water). However, Arita’s definition has the weakness to exclude pathways involving the transfer of other atom types, e.g. the sulfur reduction pathway. Rather than associating the “main” and “side” property to compounds as such, Karp defines those qualifiers in the context of a pathway: “The main compounds lie along the backbone of the pathway—these compounds are shared between consecutive steps of a pathway” (Karp and Paley, 1994). However, this distinction between main and side compounds pre-supposes a pathway definition and is thus not helpful in a context of pathway discovery. The definitions listed so far emphasize different aspects of pathways, such as their stoichiometric balance, self-maintenance or atom flow. Not all of these aspects are equally relevant for the interpretation of sets of functionally-related genes. To interpret such a gene set, we need to obtain the part (or sub-network) of the metabolic network in which their products are involved. We can enforce this sub-network to be balanced and/or self-maintained, but we may also assume that the pathway catalyzed by those genes does not have to be balanced. For instance, changes in external conditions (availability or depletion of some metabolites) will generally activate the expression of a handful of specific enzymes catalyzing the degradation or the biosynthesis of the concerned metabolites. In many cases, those pathways are not balanced, and their activation will decrease internal concentrations of several other metabolites. For example, in Escherichia coli, a lack in methionine inactivates the MetJ repressor, thereby triggering the transcription of the enzymes ensuring a linear path from l-aspartate to l-methionine. This pathway however

At this point, we would like to introduce a definition that emphasizes the functional aspect of a pathway. We mentioned the problem of arbitrary boundaries between pathways in the metabolic network. We note that such boundaries are naturally introduced if we consider gene regulation. Genes are switched on and off together in response to a particular condition, up- or down-regulating the pathway that is needed/unnecessary in that condition. Genes involved in a common function tend to be co-expressed, conserved in phylogenetic profiles or transferred together in horizontal gene transfer events. Different organisms may respond to the same metabolic requirement by expressing different sets of enzymes and transporters. The “boundaries” of a metabolic pathway should thus not be defined in terms of absolute rules, such as key compounds or stoichiometry, but be considered as organism- and even context-dependent. Thus, we can define a metabolic pathway as a set of interconnected reactions that can be activated coordinately to ensure a particular cellular function. This definition encompasses the topological constraint whilst ensuring biochemical relevance, but presents the weakness of requiring a definition of the cellular function. However, the lack of a rigorous definition of cellular function gives the freedom to define multi-scale functional modules (processes) ranging from very specific pathways to “super-pathways” (e.g. biosynthesis of methionine from homoserine, biosynthesis of methionine from aspartate, biosynthesis of aspartate-derivative amino acids, etc.). 5. Graph representation and traversal in metabolic networks 5.1. Representation of metabolic networks In order to connect query reactions (seeds) in the metabolic network, we need to represent the metabolic network as a graph. A graph is a mathematical abstraction of connected objects. It consists of nodes (also called vertices) representing the objects and of edges, which connect the nodes and represent links between objects. The link between two objects may be directional and is then represented by a directed edge called arc, which points from the source node to the target node. A graph containing arcs is also called directed graph or digraph. It is not trivial to map metabolism onto a graph in a meaningful way. Some network representations suffer from important drawbacks, and are therefore less suited for the prediction of metabolic pathways than others (van Helden et al., 2002). A commonly employed network representation is the compound graph (e.g. in Chou et al., 2009; Goesmann et al., 2002; Ma and Zeng, 2003), where nodes stand for compounds and edges/arcs for reactions (see Fig. 3A). Two compound nodes are connected if they participate as substrate and product in a common reaction. In the

K. Faust et al. / BioSystems 105 (2011) 109–121

113

Fig. 3. Alternative network representations of metabolism. (A and B) In the compound (A) and reaction (B) graph, nodes represent resp. compounds and reactions, whereas edges represent resp. reactions and compounds. These graphs represent the same reactions (e.g. 2.7.1.2 in A) or compounds (e.g. d-glucose in B) multiple times. (C and D) Bipartite graphs (C) and hypergraphs (D) avoid this shortcoming by representing reactions and compounds as two separate node types (bipartite graph) or by representing reactions with hyper-edges, which connect multiple compounds (compound-centric hypergraph). (E) Undirected graphs enable graph algorithms to go from one substrate to another substrate or from one product to another product. To prevent this, metabolic networks should be represented by directed graphs. (F) In bipartite graphs, forward and reverse direction of a reaction can then be represented by two separate nodes, which have to be mutually exclusive to prevent a graph algorithm to go from one substrate to another substrate (or from one product to another product) via the forward and reverse reaction direction. (G) The atom mapping graph traces atoms through the metabolic network by matching the structures of substrates and products. Corresponding atom groups are encircled in matching colors. The compound structures were downloaded from KEGG and drawn with Jmol (Herréz, 2006).

reaction graph (Fig. 3B), the counterpart of the compound graph, nodes represent reactions and edges/arcs compounds (e.g. in Forst and Schulten, 2001; Wagner and Fell, 2001). Two reaction nodes are connected if they share a compound that acts as product of the first and as substrate of the second reaction. In both network representations, a compound (reaction graph) or a reaction (compound graph) can occur multiple times. For instance, in the reaction graph, a compound edge/arc occurs as often as there are reactions which produce or consume it. This problem has recently been noted by Klamt et al. (2009), but was already recognized by van Helden et al. (2002). An algorithm that connects seed reactions in a metabolic network represented as compound or reaction graph may predict a pathway containing the same reaction or compound multiple times. This may be positive in some cases (fatty acid biosynthesis, cyclic pathways), but will in most cases result in problematic pathways containing futile cycles or invalid shortcuts. For instance, in the compound graph, one could reach one substrate of a reaction from another in case the reaction is reversible. To avoid this behavior, the algorithm would need to keep track of the labels of arcs that were already traversed. In addition, both the compound and the reaction graph make it more difficult to deal with mixed input comprising compounds as well as reactions. For these reasons, these graph representations are unsuitable for our pathway prediction approach. The bipartite graph (Fig. 3C) is a representation that avoids the multiple occurrence problem of the compound and reaction graphs. Bipartite graphs are made of two node types, where edges/arcs

never connect two nodes of the same type. For metabolic networks, the two node types represent compounds and reactions, respectively (e.g. in Sirava et al., 2002; van Helden et al., 2001), and arcs connect either a compound to a reaction (substrate link) or vice versa (product link). Petri nets, which are sometimes employed to describe metabolic networks e.g. Küffner et al. (2000) are a special case of bipartite graphs, where compounds are place nodes and reactions transition nodes. Bipartite graphs representing metabolic networks have also been described as AND–OR graphs, where compound nodes have the role of OR nodes and reaction nodes the role of AND nodes (Pitkänen et al., 2005). Hypergraphs generalize the concept of a graph. Whereas in a graph, each arc/edge connects only two nodes, an arc/edge in a hypergraph (termed hyper-arc/hyper-edge) may connect more than two nodes and can thus easily represent a reaction that involves more than two compounds. In principle, hypergraphs can be compound-centered (nodes represent compounds) or reaction-centered (nodes represent reactions), but so far, only the compound-centered hypergraph has been mentioned in the literature (e.g. Mithani et al., 2009) (see Fig. 3D). The stoichiometric matrix employed in flux balance analysis is mathematically equivalent to a (compound-centered) directed hypergraph (excluding catalytic compounds) (Klamt et al., 2009). Despite of their recent recommendation in Klamt et al. (2009), hypergraphs have a disadvantage for pathway prediction: It is not as straightforward as in bipartite graphs to predict pathways when the “seeds” combine compounds and reactions. Furthermore,

114

K. Faust et al. / BioSystems 105 (2011) 109–121

hypergraphs can be mapped onto bipartite graphs by transforming each hyper-arc/hyper-edge into a node. Thus, bipartite and hypergraphs are mathematically equivalent graph representations. Arita introduced the atom mapping graph (Arita, 2000), where nodes represent atoms and arcs mappings between atoms in the substrate and product of a reaction. Atom mappings are computed by representing the chemical structure of a compound as a graph and then using a graph matching algorithm to find the most similar substrate-product structures (see Fig. 3G). The atom mapping graph has been recently employed in Boyer and Viari (2003), Heath et al. (2010) and Pitkänen et al. (2009). 5.2. Graph traversal in metabolic networks A simple approach towards the prediction of a pathway is to find the shortest path(s) between a given start and end node in the network. In this context, we would like to clarify the difference between a path and a (metabolic) pathway. In graph theory, a path is defined as a linear sequence of nodes connected by edges such that each node pair is connected by only one edge. In addition, each path node, including the start and the end node, can be found at most once in a path. In contrast, according to most definitions, a metabolic pathway (as defined above) may contain branches, cycles, and multiple instances of the same compound. K-shortest path algorithms such as (Jimenez and Marzal, 1999; Eppstein, 1999; Yen, 1971) enumerate all shortest paths between two nodes in a network in the order of their length and are commonly applied to predict pathways in metabolic networks (Sirava et al., 2002; Arita, 2003; Blum and Kohlbacher, 2008; Faust et al., 2009b; Pitkänen et al., 2009; Heath et al., 2010). However, when using them to predict metabolic pathways, two issues specific to metabolic networks need to be considered. 5.2.1. Reaction directionality A specific problem of metabolic networks is reaction reversibility. This is the fact that the chemical reactions in these networks must be considered as reversible, unless specific information is given to the contrary. Indeed, even if a chemical reaction has a strong preference for one direction in one organism, its direction may be the opposite in another organism, as reaction directionality depends not only on the standard Gibbs free energy change of the reaction but also on reactant concentrations and the temperature. One is faced with several choices for representing reaction reversibility in the metabolic graph. One possibility is to link reversible reactions and their reactants with an edge that can be traversed in both directions during path finding. This however would make it cumbersome to distinguish between substrates and products. Indeed, straightforward navigation through the graph would result in connecting two substrates (or two products) of the same chemical reaction to each other. In the context of path finding, this would mean that two substrates of the same reaction can be interconverted in one step, thereby violating the laws of chemistry (see Fig. 3E). Another solution is to represent a reversible reaction as two separate nodes in the network, one for each direction, as illustrated in Fig. 3F. However, inclusion of both direct and reverse reactions in the same path would lead to the chemically meaningless situation where two substrates of a reversible reaction could be transformed into each other in two steps. For instance, a reversible reaction (A + B ⇔ C) would be converted to a direct reaction (A + B → C) and a reverse (C → A + B), which would open the possibility for an irrelevant two-steps path converting one substrate into the other one (A → C → B). Path finding algorithms should thus be adapted to prevent including both the direct and the reverse reactions in the same path. This constraint of mutual exclusion increases the complexity of path finding algorithms. Despite of this difficulty, we opted

for this solution, as it allows to easily represent both reversible and physiologically irreversible reactions. For the latter, the reverse reaction can simply be omitted. 5.2.2. The hub compound problem As illustrated in Fig. 2, not all reaction sequences are biochemically relevant. The pathway shown in Fig. 2 suggests that ADP could be produced from d-glucose within one step, which is biochemically impossible. ADP is indeed a product of reaction 2.7.1.2 and a substrate of reaction 2.7.1.40, but is in both cases a side-compound and cannot be considered as “produced” from d-glucose, or “producing” pyruvate. It should thus never be used as intermediate node in a path connecting d-glucose and pyruvate. This problem has a tremendous effect on path finding: side compounds may be involved in hundreds of reactions, thus acting as hubs in the network. If a path finding algorithm connects two nodes via such a short-cut, this will result in an invalid pathway in most cases. The problem is illustrated in Fig. 4B: in the raw metabolic network, all the shortest paths from l-arginine to succinate use irrelevant shortcut through metabolites that serve as side-compounds in the reactions included in the paths (H2 O, NADP+ , NAD+ , ADP). A first attempt to circumvent the shortcut effect of “metabolic hubs” was to search paths in a “filtered” network, from which a subset of the most connected compounds (“hubs”) had been removed. Our early results showed that this filtering improved the relevance of the subgraph extraction, but was only able to infer short paths (typically, 2 reaction steps) between seed reactions (van Helden et al., 2001), because of the remaining highly connected compounds (which do not act as side compounds) such as alanine or aspartate. As shown in Fig. 4C, the shortest paths from l-arginine to succinate in the filtered graph are much shorter than the reference arginine catabolic pathway and none of the intermediate steps corresponds to the annotations (Fig. 4A). The problem is to distinguish main compounds from side compounds. For instance, ADP is a side compound in the glycolysis pathway, but a perfectly valid intermediate in nucleotide biosynthesis. In addition, it is unclear what defines a side compound. Compounds such as water, ADP/ATP, NADPH/NADP+ act as side compound in most cases, but what about acetyl-CoA? Besides, some other highly connected compounds such as pyruvate act as main substrate/product in many reactions, and will have the same shortcut effect in a basic path finding approach. For these reasons, simply removing a list of “hub” compounds from the network does not solve the problem. 6. Path finding in weighted networks As discussed in Section 5.2.2, the removal of hub compounds is problematic because there is no clear-cut distinction between main and side compounds. However, we observe that typical side compounds tend to be involved in more reactions than the average compound. Motivated by this observation, we thought about a path finding strategy that takes the number of connections of a compound into account and came up with the idea of weighting the metabolic network. In a node-weighted graph, each node is assigned a real number, called its weight or cost. We employed a weight policy that assigns to each compound node a weight corresponding to its degree (i.e. the number of edges connecting this node) and to each reaction node a weight of 1. We adapted the path finding algorithm to enumerate the lightest paths (i.e. the path minimizing the sum of node weights) instead of the shortest paths (i.e. the path with the minimal number of nodes) (Croes et al., 2005, 2006). Since there may be more than one lightest path, we employ a Kshortest path finding algorithm (backtracking) to collect all paths of

K. Faust et al. / BioSystems 105 (2011) 109–121

115

Fig. 4. Evaluation of paths computed for E. coli Arginine degradation II (AST) pathway. (A) The AST pathway as annotated in MetaCyc. Ellipses represent compounds and rectangles reactions. Nodes are labeled with their KEGG identifier, compounds in addition with their name and reactions with their enzyme classification (EC) number. The seed nodes are colored in blue, whereas intermediate nodes, which should be found by path finding, have a green border. (B-D) Paths computed by path finding in a metabolic network constructed from KEGG data, where (B) shows the raw graph, (C) the filtered graph and (D) the weighted graph. (E) Paths computed by path finding in the RPairs network constructed from KEGG data. Correctly predicted intermediate nodes have a green border. The paths are labeled with their rank, where the lightest path(s) has rank one, the second-lightest path(s) rank two and so on.

equivalent weight. By penalizing the highly connected compounds, we reduce the probability of selecting meaningless shortcuts when navigating in the graph, without requiring to make somewhat arbitrary choices of the compounds to exclude. Fig. 4D displays the five lightest paths from l-arginine to succinate in the compound-weighted metabolic network. The

third-ranking path perfectly matches the annotated pathway. Since in general we cannot know which among the top-ranking paths to select, we base our evaluation on a comparison of the first-ranking paths (i.e. the lightest paths) to the annotated pathway. We systematically evaluated the performance of this approach by comparing the intermediate reactions in the computed

116

K. Faust et al. / BioSystems 105 (2011) 109–121

Fig. 5. The decomposition of reaction R02724 into five reactant pairs (RPairs) is shown. Each RPair links a substrate to a product with similar compound structure. In addition, each RPair is assigned to a class that describes its role in the reaction. For instance, “main” refers to “main changes on substrates” (Kotera et al., 2004a) (e.g. cholesterol and pregnenolone) whereas “leave” describes addition or elimination of inorganic compounds (e.g. oxygen and 4-methylpentanal). Importantly, the rare side compound ferredoxin is not involved in any RPair. The reason is that ferredoxin does not contribute atoms, but electrons to the reaction. A path finding algorithm that is unaware of the RPair annotation may falsely connect cholesterol with oxidized adrenal ferredoxin. Thus, RPairs improve the accuracy of path finding by preventing the traversal of a reaction via its side products.

pathways1 to those in linearized annotated pathways. In addition, we compared the performance of the weighted graph with that of the un-weighted raw (all compounds and reactions) and filtered (top 30 highly connected metabolites removed) graphs and found the correspondence between the computed and annotated pathways to be very poor (<30%) in the raw graph, increasing to approximately 65% in the filtered graph and reaching approximately 85% in the weighted graph. Considering the best-matching path among the five lightest paths increases the correspondence to 92%.

pounds either by excluding them (Ma and Zeng, 2003) or by tracing atoms (Arita, 2004). Both concluded that the average path length is around eight. Thus, the small world property claimed for metabolic networks is mainly due to biochemically irrelevant paths traversing side compounds and disappears when measuring metabolic distance in a more realistic way (Croes et al., 2006; Lima-Mendez and van Helden, 2009).

7. The illusion of small-worldness

8.1. The benefits of RPairs

In a weighted metabolic network, we can measure a metabolic distance between two enzymes s and t, which we define as: Ds,t = Wp − (Ws + Wt )/2, where Wp is the sum of node weights in the lightest path between any reaction associated to enzyme s and any reaction associated to enzyme t, Ws is the weight of the start node and Wt the weight of the end node. In the same way, we can also compute the metabolic distance between two compounds. It was previously claimed that metabolic networks are “small world” networks (Jeong et al., 2000; Fell and Wagner, 2000). Small world networks are characterized by a small average path length compared to random networks (Watts and Strogatz, 1998). The average path length (AL) is defined as the average length of shortest paths computed between all node pairs. In metabolic networks, a small AL means that any compound can be synthesized from any other compound within a few enzymatic steps (around three as stated in Jeong et al., 2000). We computed the distribution of metabolic distances between random pairs of reactions in the raw, the filtered and the weighted network and found that in the raw network, the most frequent metabolic distance is two, whereas the most frequent metabolic distances in the weighted network comprise five to eight reaction steps. This result is supported by two other studies, which measured the average shortest path length between compounds in the E. coli network and treated hub com-

In the previous section, we have shown that penalizing hub compounds by weighting them substantially improves metabolic path finding. However, a degree-dependent weight policy does not penalize rare side compounds. For example, consider KEGG reaction R02724 depicted in Fig. 5. In this reaction, reduced adrenal ferredoxin does not contribute any atoms to the products, but acts as an electron acceptor. However, since it has a low degree (it is involved only in a few reactions), the path finding algorithm may traverse it even in a weighted graph, thus predicting an invalid pathway. One solution to this problem is to trace the atoms of the compound(s) of interest through the metabolic network, an approach first introduced by Arita (2000, 2003) and then applied in various path finding tools (Rahman et al., 2004; Blum and Kohlbacher, 2008; Pitkänen et al., 2009). In our case, there is an important drawback to this approach: it is not suited for path finding between reactions. Indeed, all tools based on atom tracing predict pathways between compounds only. Atoms are usually traced by first computing substrate-product mappings and then enumerating paths through a graph constructed from these mappings (Arita, 2004). However, in the case of a reaction with multiple products, it is not clear which of the multiple substrate-product mappings to select. If we decide to trace the atoms of all products of such a reaction, we will frequently include highly connected side compounds such as ADP, water or orthophosphate, thus predicting irrelevant pathways. The evaluation of weighted path finding (see Section 6) has shown that pathway prediction improves when we force the path

1 In cases where several paths have the same weight, we merge all of them, thus we can predict branched pathways by enumerating linear paths.

8. Extending pathway prediction with RPairs and multiple seed nodes

K. Faust et al. / BioSystems 105 (2011) 109–121

finding algorithm at a branching point (that is when reaching a reaction with multiple products) to continue with the least connected compound. Likewise, atom tracing applied to reactions with multiple products may benefit when tracing the least connected compounds only. To test this idea, we employed the substrate-product mappings (called RPairs) stored in the KEGG RPAIR database (Kotera et al., 2004b,a). The RPAIR database has the additional benefit of assigning a role to the RPairs, which include main (for major carbon atom transfer), cofac (for cofactors of oxidoreductases), trans (for functional groups transferred by transferases), ligase (for triphosphates involved in ligase reactions) and leave (for addition or elimination of small inorganic compounds). These roles are reaction-specific, thus allowing us to select for each reaction its relevant substrate-product mappings and to neglect irrelevant cofac, ligase or leave mappings. For instance, KEGG reaction R02724 (Fig. 5) is divided into 5 reactant pairs. None of these reactant pairs involves reduced or oxidized adrenal ferredoxin, as it neither contributes nor receives any atoms during the reaction (see Fig. 5). An RPairs-aware path finding algorithm will therefore avoid ferredoxin as intermediate compound. To test the impact of RPair annotations on path finding accuracy, we compared the performance of the RPair network (with RPairs instead of reactions) with the default metabolic network. We measured the path finding accuracy in both networks for different values of various parameters, among them the directionality of the network, the weight policy, removal of hub compounds and RPair class filtering (e.g. by keeping only main RPairs). The most important finding was that the weighted RPair network yields higher pathway prediction accuracies than either the unweighted RPair network or the weighted reaction network (Faust et al., 2009b). The main reason for the better performance of the weighted compared to the unweighted RPair network is that the RPairs do not always allow us to avoid side compounds (e.g. in KEGG reaction R00299, ADP and ATP form a main RPair). However, they allow us to do so in many cases, which is why the unweighted RPair network performs better than the unweighted network without RPairs. Thus, the combination of a hub-node penalizing weight policy with RPairs performs better than either strategy alone, a result in agreement with the findings by Blum and Kohlbacher (2008). Another interesting observation is that the removal of all non-main RPairs decreases the path finding performance. This counter-intuitive result is due to the fact that we measured path finding accuracy with reference pathways containing some substrate-product pairs not classified as main pairs in the RPAIR database. The presence of trans RPairs in reference pathways reflects the ambiguity of some RPair class assignments. Tools such as (Antonov et al., 2008) that only rely on main RPairs may therefore miss some pathways of interest. 8.2. From two-end to multi-end pathway prediction So far, we have focussed on predicting pathways between two seed nodes (which could be compounds or reactions). In order to interpret associated enzyme-coding gene sets, we need first to map genes to reactions and second to extend pathway prediction to a set of seed nodes. 8.2.1. Gene-to-reaction mapping and seed node sets There is a many-to-many relationship between enzyme-coding genes, their EC numbers and associated reactions. For example, genes coding for different sub-units of the same enzyme share the same enzyme classification (EC) number whereas genes coding for multifunctional enzymes (e.g. pentafunctional ARO1 gene in Saccharomyces cerevisiae) are associated to more than one EC number. In turn, an EC number can be linked to more than one reaction. For example, EC number 1.1.1.1 (conversion of an alcohol into an aldehyde or ketone) is associated to 18 reactions in KEGG, out of

117

which only one may be relevant for the pathway to be predicted. To deal with this complex relationship, we devised a strategy to cope with groups of seed nodes (by introducing pseudo-nodes, see Faust et al., 2010). Only one of the members of a seed node group needs to be included into the predicted pathway, but possibly more may be added. Thus, an (inclusive) OR relationship between seed nodes can be expressed by grouping them into the same group, whereas an AND relationship can be expressed by instantiating a different group for each seed. Seed reactions can be grouped on several levels: gene-wise (all reactions associated to one gene form one seed node group), EC-number wise (all reactions associated to an EC number form one seed node group) or reaction-wise (each reaction forms a seed node group of its own). In our experience, EC number-wise groups are most suited for pathway prediction: They consider all catalytic functions of an enzyme while avoiding to include all reactions associated to an EC number. Thus, EC number groups are robust with respect to imprecise EC number-reaction mappings. 8.2.2. Multi-end pathway prediction Pathway prediction given multiple seeds in a weighted network can be regarded as an instance of the Steiner tree problem, where a set of seed nodes has to be connected within a weighted graph such that the resulting subgraph is of minimal weight. Because of the minimal weight constraint, the solution subgraph will always be a tree, called Steiner tree. The Steiner tree problem is known to be NP-hard (Karp, 1972). We tested three different heuristics that are all based on repetitively executed path finding (details see Faust et al. (2010)). In addition, we also evaluated a random-walk based algorithm called kWalks (Dupont et al., 2006). kWalks computes the relevance of each edge and node in the network with respect to the seed node set. The relevance of an edge or node is defined as the expected number of times it appears in random walks between the seed nodes, where a random walk starts from each seed node and ends as soon as it hits another seed node. The network is then built by adding edges in the order of their relevance to a sub-network initially consisting of the seed nodes only, until either all seed nodes are connected or all edges have been added. The random walks are computed efficiently using absorbing Markov chain theory. Thus, kWalks is designed to extract rapidly the part of the input network most relevant for connecting the given seed nodes and is indeed much faster than the three Steiner tree heuristics. However, our evaluation showed that MetaCyc reference pathways are more accurately predicted with a Steiner tree heuristic than with kWalks. So we decided to combine kWalks with the best-performing Steiner tree heuristic in a hybrid approach that unites the strengths of both approaches: kWalks is launched first to reduce the input network size and thus the run-time of the subsequently executed Steiner tree heuristic, which predicts the pathway within the sub-network extracted by kWalks. This hybrid approach yielded the highest prediction accuracy in our evaluation. An interesting feature of the kWalks algorithm is its good performance in the absence of a weight policy. Indeed, in the hybrid approach, a subgraph with weights computed by kWalks yields a higher accuracy than an unweighted subgraph. This suggests that kWalks can be applied to discover weights in unweighted networks. 9. Pathway prediction example In the following, we will demonstrate pathway prediction on an operon from Pseudomonas aeruginosa PAO1. The aruCFGDB operon contains five genes: PA0895 (aruC), PA0896 (aruF), PA0897 (aruG), PA0898 (aruD) and PA0899 (aruB). When entering these gene identifiers into the pathway extraction web server (at http://rsat.bigre.ulb.ac.be/neat/, Brohée et al., 2008) and selecting P. aeruginosa (pae) as the organism of interest, the genes are

118

K. Faust et al. / BioSystems 105 (2011) 109–121

Fig. 6. Reaction mapping output of the pathway extraction tool for the five genes in the aruCFGDB operon of P. aeruginosa PAO1. The genes are mapped to their EC numbers, reactions and main RPairs using the KEGG database (version 55.0). The many-to-many relationship between enzyme-coding genes and EC numbers is exemplified by PA0895 (associated to two EC numbers) and PA0896 and PA0897 (associated to the same EC number).

mapped to their respective EC numbers, reactions and main RPairs (see Fig. 6). Notably, genes PA0896 and PA0897 share the same EC number, whereas PA0895 is associated to two EC numbers, thus illustrating the aforementioned many-to-many relationship between enzyme-coding genes and EC numbers. When grouping seed reactions EC number-wise, the pathway shown in Fig. 7 results. This pathway has been predicted from the weighted RPair network (KEGG version 55.0), thus benefiting from the weights as well as the RPair annotation. The network includes all reactant pairs and small molecule compounds in KEGG. We could have selected an organism-specific network of P. aeruginosa for higher specificity, but at the cost of possibly decreasing the sensitivity (i.e. the capability of predicting pathways that occur in P. aeruginosa, but have not yet been included in the organism-specific network). Interestingly, the predicted pathway proposes several intermediate reactions for which P. aeruginosa genes are known in KEGG (PA0901 and PA1162) but also contains a reaction (R04217 with EC number 2.6.1.81) that is not associated to any gene in P. aeruginosa. Such a gap may be interpreted in several ways: (1) the reaction is spontaneous, (2) the reaction is a wrong prediction, i.e. P. aeruginosa

has no enzyme that catalyses this reaction, (3) the reaction is carried out by an enzyme in P. aeruginosa, but this enzyme is not known or not annotated in the source database or the enzyme is present in the database but not linked to this reaction. In general, the substrate range of many enzymes is unknown, so we may expect many missing links between reactions and their catalyzing enzymes in current metabolic databases. However, in MetaCyc the EC number of the gap reaction (2.6.1.81) is linked to a P. aeruginosa gene (aruC). This gene is contained in KEGG (version 55.0), but not linked to EC number 2.6.1.81. Indeed, the P. aeruginosa-specific KEGG map of Arginine and proline metabolism, which contains the predicted pathway, does not link EC number 2.6.1.81 to any P. aeruginosa gene. Interestingly, the predicted pathway reproduces the AST (arginine succinyl-transferase) pathway annotated in MetaCyc for E. coli and P. aeruginosa. Thus, without knowledge of either the metabolic network or the pathways of P. aeruginosa, we could correctly predict the arginine degradation pathway (AST pathway) from the operon encoding it. It is clear that not in all cases a complete pathway will be encoded in an operon. Enzyme-coding genes involved in the same path-

Fig. 7. Pathway prediction result for the enzyme-coding genes in the aruCFGDB operon of P. aeruginosa PAO1. Seed reactant pairs (RPairs) obtained from the genes were connected by a subgraph extraction algorithm (combining kWalks and a Steiner tree heuristic) in a network built from KEGG RPAIR (version 55.0). RPairs are represented as rectangles and compounds as ellipses. Seed RPairs have blue borders, whereas the borders of intermediate RPairs and compounds are colored according to their membership to different KEGG maps, where beige stands for arginine and proline metabolism and violet for lysine biosynthesis. Intermediates that are not part of any KEGG map have orange borders. If a compound of a seed RPair did not already form a part of the pathway, it was added after prediction and colored in magenta. The predicted pathway covers the complete arginine succinyl-transferase (AST) pathway, which is annotated in MetaCyc for P. aeruginosa.

K. Faust et al. / BioSystems 105 (2011) 109–121

ways may belong to several operons or even be scattered over the whole genome (e.g. methionine biosynthesis in E. coli). However, if these genes are co-regulated, and thus detectable as a group of associated genes (i.e. co-expressed in a micro-array experiment or sharing common transcription factor binding sites), our pathway prediction procedure can still assemble their pathway. 10. Discussion 10.1. Strengths of the prediction approach Pathway prediction by subgraph extraction is a generic pathway prediction approach that can be applied to any biological network and that can handle networks of realistic sizes (with thousands of nodes and tens of thousands of edges). It does not require any other input apart from the network and the seed nodes, although additional information in form of RPairs or a well defined weight policy increases prediction accuracy. In addition, weights can be finetuned to favor organism-specific reactions or to integrate scores from a high-throughput experiment. To our knowledge, the prediction approach summarized here is the only one that can predict metabolic pathways from a mixture of reactions and compounds. 10.2. Weaknesses of the prediction approach We decided not to impose constraints on the stoichiometry or feasibility of the extracted pathway, i.e. compounds in the pathway are expected to be available in sufficient numbers. This may in some cases lead to wrong predictions (de Figueiredo et al., 2009). A more in-depth discussion of non-stoichiometric versus stoichiometric approaches in path finding can also be found in Faust et al. (2009a) and Planes and Beasley (2008). Another disadvantage of our metabolic pathway prediction approach is its restriction to small molecule metabolism, which excludes polymers (e.g. starch, DNA, RNA). In addition, our evaluations have shown that the prediction approach does not perform well for the highly connected central part of the metabolic network, e.g. glycolysis. A distinction between these pathways from alternatives may be possible when Gibbs free energy changes are taken into account. However, the computation of these energy changes requires data on compound concentrations in the cell. Furthermore, subgraph extraction with Steiner tree heuristics cannot predict spiral-shaped (e.g. fatty acid biosynthesis, which uses the same enzymes in several elongation rounds) or cyclic pathways, as the solution to the Steiner tree problem is a tree. It may however find parts of these pathways. In general, the extraction of the lightest sub-network assumes that the pathway to be extracted is as short as biochemically possible. This parsimony assumption makes sense in many cases, as the synthesis of an enzyme is costly. However, not all pathways have developed to synthesize or degrade a compound with the minimal number of enzymatic steps. For example, the TCA cycle has not been optimized to synthesize oxaloacetate with the smallest possible number of enzymes but to produce energy and precursors for anabolic metabolic pathways such as amino acid biosynthesis. 10.3. Applications 10.3.1. Interpretation of associated genes from high throughput data The main application of pathway prediction by subgraph extraction is the interpretation of co-expressed gene sets. Several network-based tools to interpret gene (or protein) sets obtained from high-throughput data have been developed. They all rely on the same principle: First, gene scores are obtained from the highthroughput data. For instance in the case of micro-arrays, the scores could consist in the log-ratio of measured gene expression levels.

119

Next, the scores are converted into node weights and the subnetwork is extracted from the weighted graph. The weights cause the extraction algorithm to favor sub-networks containing highscoring genes over others. Roughly speaking, we may distinguish “global” and “local” subgraph extraction strategies. Tools following the global strategy incorporate nodes in such a way that the weight of the extracted sub-network is optimized, without the need of specifying seed nodes. To our knowledge, Ideker et al. (2002) were the first to propose this strategy, which they implemented using a simulated annealing based algorithm. A recently published tool, MetaPath, is based on the same principle (Liu and Pop, 2010). An interesting study by Dittrich et al. (2008) identifies sub-networks in protein–protein interaction networks by applying an algorithm that solves the prize collecting Steiner tree problem exactly (Ljubic´ et al., 2006). These authors also present a statistic to aggregate multiple p-values on nodes derived from several experiments into one node weight and in addition report sub-optimal solutions in the order of their weight. The prize-collecting Steiner tree problem is a variant of the Steiner tree problem that does not consider seed nodes but instead searches for the maximum weight sub-network in a network where nodes have positive weights (prizes) and edges negative weights (costs). A prize-collecting Steiner tree solving algorithm has also been applied to detect novel pathways in integrated protein–protein and protein–DNA networks (Huang and Fraenkel, 2009). In contrast to the global strategy, the local strategy connects specific nodes of interest (i.e. seed nodes) in the network. This strategy has been applied to protein–protein interaction networks (Scott et al., 2005) as well as to metabolic networks (Antonov et al., 2008, 2009; Noirel et al., 2008). In the former case, Scott and coworkers employ a Steiner tree exact solution (small seed node sets) and a heuristic (large seed node sets), whereas in the latter case, the authors rely on custom algorithms to solve the same problem. The distinction between global and local strategies is somewhat artificial: In fact, the global strategy may be tuned by assigning weights such that specific nodes of interest are favored. In turn, the local strategy may be “globalized” by repeatedly running the extraction algorithm and then reporting the optimal solution among a set of solutions. Whether the global or local strategy is more appropriate depends on the size and the expected number of false positives in the gene/enzyme group of interest. In comparison to the metabolic subgraph extraction approaches mentioned above, our approach offers the advantages of a well designed network representation, of an appropriate treatment of hub compounds (RPairs and weights), of being able to handle sets of seed nodes as well as being extensively evaluated on reference pathways. A major weakness of our approach is its lack of a method to compute network weights from high-throughput experiment scores. However, this weakness can be circumvented by first computing network weights independently with a statistically sound procedure (such as the one suggested in Dittrich et al. (2008)) and then applying our prediction approach using these weights. All approaches discussed so far rely only on the network topology. Recently, a metabolic subgraph extraction approach was published that is based on the stoichiometric definition of a metabolic pathway (i.e. the idea that a valid metabolic pathway should stoichiometrically balance all its internal compounds). This approach combines stoichiometric and reaction directionality constraints (as in flux balance analysis) with a constraint with respect to expression data, such that the agreement between the activity of a reaction (its flux) and the expression values of its associated enzyme(s) is maximized Shlomi et al. (2008) and Zur et al. (2010). Since it does not consider seed nodes, this approach can be compared to the “global” topology-based subgraph extraction approaches. Whether this stoichiometric-based approach performs better than topological approaches remains to be evaluated.

120

K. Faust et al. / BioSystems 105 (2011) 109–121

10.3.2. Metabolic network reconstruction Another application of metabolic pathway prediction is metabolic network reconstruction, whose goal is to decipher the metabolic network of an organism from its genome (see e.g. Duarte et al., 2007 as an example and Thiele and Palsson, 2010 for a reconstruction protocol). Currently, manual metabolic network reconstruction (in some cases combined with automated procedures as in Tier 2 BioCyc databases Caspi et al., 2008) still outperforms entirely automated procedures Ginsburg (2009), making automated high-quality metabolic reconstruction an object of intensive research. Automated metabolic reconstruction approaches may be roughly divided into two categories: Pathway-based approaches rely on known metabolic pathways as template (Moriya et al., 2007; Karp et al., 2009), which may be further assembled into a network (DeJongh et al., 2007), whereas constraint-based approaches start from the genome without taking knowledge on pathways into account (Becker et al., 2007; Henry et al., 2010). All of these approaches may result in a reconstructed network with gaps, i.e. missing reactions that are expected to be present because of phenotype data or because they synthesize or degrade important “house-keeping” (primary) compounds. Various approaches have been suggested to fill these gaps, either on the level of the individual reaction (Green and Karp, 2004) or the entire network (Christian et al., 2009). Pathway prediction can fill gaps on the pathway level, by proposing alternative pathways that can carry out the missing function. In addition, pathway prediction may serve as an alternative starting point for pathway-based metabolic network reconstruction. Instead of starting with a set of “template” pathways, pathways can be predicted from known gene associations in the organism (operons/regulons, synteny, etc.), thereby integrating knowledge on gene regulation into the reconstruction procedure. 10.3.3. Other applications Subgraph extraction could also be applied to predict biodegradation pathways. In contrast to other biodegradation prediction approaches (Jaworska et al., 2002; Pazos et al., 2005; Ellis et al., 2008), our pathway prediction approach accepts both compounds and reactions as input. Therefore, it can incorporate knowledge on intermediates as well as on participating enzymes. Pathway prediction by subgraph extraction could also be useful in comparative metagenomics, where it could identify differentially abundant pathways from over- or under-represented orthologous gene groups. 10.4. Outlook In the future, we plan to apply pathway prediction on other biological networks such as protein–protein and protein–gene interaction networks and ultimately on a network combining the former with metabolism. In addition, metabolic networks could be refined to include compartments and transporters. Another improvement concerns the usage of a compound hierarchy that would allow to treat stereoisomers and generic compounds appropriately. For instance, if a generic compound (e.g. a amino acid) is provided, subgraph extraction could treat all its children (i.e. specific amino acids) as a seed node set. Furthermore, the integration of atom tracing (as in Pitkänen et al., 2009; Heath et al., 2010) with RPair annotation may increase the accuracy of pathways predicted for a set of seed reactions. Acknowledgements KF was supported by Actions de Recherches Concertées de la Communauté Franc¸aise de Belgique (ARC grant number 04/09307). The BiGRe Laboratory is supported by the Belgian Program

on Interuniversity Attraction Poles, initiated by the Belgian Federal Science Policy Office, project P6/25 (BioMaGNet). DC is funded by the MICROME Collaborative Project funded by the European Commission within its FP7 Programme, under the thematic area “BIO-INFORMATICS – Microbial genomics and bio-informatics” (contract number 222886-2). References Adler, P., Reimand, J., Jänes, J., Kolde, R., Peterson, H., Vilo, J., 2008. KEGGanim: pathway animations for high-throughput data. Bioinformatics 24, 588–590. Antonov, A., Dietmann, S., Mewes, H., 2008. KEGG spider: interpretation of genomics data in the context of the global gene metabolic network. Genome Biology 9. Antonov, A., Dietmann, S., Wong, P., Mewes, H., 2009. TICL—a web tool for network-based interpretation of compound lists inferred by high-throughput metabolomics. FEBS Journal 276, 2084–2094. Arita, M., 2000. Metabolic reconstruction using shortest paths. Simulation Practice and Theory 8, 109–125. Arita, M., 2003. In silico atomic tracing by substrate–product relationships in Escherichia coli intermediary metabolism. Genome Research 13, 2455–2466. Arita, M., 2004. The metabolic world of Escherichia coli is not small. Proceedings of the National Academy of Sciences of the United States of America 101, 1543–1547. Backes, C., Keller, A., Kuentzer, J., Kneissl, B., Comtesse, N., Elnakady, Y.A., Mller, R., Meese, E., Lenhof, H.-P., 2007. GeneTrail—advanced gene set enrichment analysis. Nucleic Acids Research 102, W186–W192. Becker, S., Feist, A., Mo, M., Hannum, G., Palsson, B., Herrgard, M., 2007. Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox. Nature Protocol 2, 727–738. Blum, T., Kohlbacher, O., 2008. Using atom mapping rules for an improved detection of relevant routes in weighted metabolic networks. Journal of Computational Biology 15, 565–576. Boyer, F., Viari, A., 2003. Ab initio reconstruction of metabolic pathways. Bioinformatics 19, ii26–ii34. Brohée, S., Faust, K., Lima-Mendez, G., Sand, O., Janky, R., Vanderstocken, G., Deville, Y., van Helden, J., 2008. NeAT: a toolbox for the analysis of biological networks, clusters, classes and pathways. Nucleic Acids Research 36, W444–W451. Caspi, R., Foerster, H., Fulcher, C., Kaipa, P., Krummenacker, M., Latendresse, M., Paley, S., Rhee, S., Shearer, A., Tissier, C., Walk, T., Zhang, P., Karp, P., 2008. The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Research 36, D623–D631. Centler, F., di Fenizio, P.S., Matsumaru, N., Dittrich, P., 2005. Chemical organizations in the central sugar metabolism of Escherichia coli. Modeling and Simulation in Science Engineering and Technology, Post-proceedings of ECMTB 2005. Chou, C.-H., Chang, W.-C., Chiu, C.-M., Huang, C.-C., Huang, H.-D., 2009. FMM: a web server for metabolic pathway reconstruction and comparative analysis. Nucleic Acids Research 37, W129–W134. Christian, N., May, P., Kempa, S., Handorf, T., Ebenhöh, O., 2009. An integrative approach towards completing genome-scale metabolic networks. Molecular BioSystems 5, 1889–1903. Clarke, B., 1988. Stoichiometric network analysis. Cell Biophysics 12, 237–253. Croes, D., Couche, F., Wodak, S., van Helden, J., 2005. Metabolic pathfinding: inferring relevant pathways in biochemical networks. Nucleic Acids Research 33, W326–W330. Croes, D., Couche, F., Wodak, S., van Helden, J., 2006. Inferring meaningful pathways in weighted metabolic networks. Journal of Molecular Biology 356, 222–236. Dahlquist, K.D., Salomonis, N., Vranizan, K., Lawlor, S., Conklin, B., 2002. GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nature Genetics 31, 19–20. Davidsen, T., Beck, E., Ganapathy, A., Montgomery, R., Zafar, N., Yang, Q., Madupu, R., Goetz, P., Galinsky, K., White, O., Sutton, G., 2010. The comprehensive microbial resource. Nucleic Acids Research 38, D340–D345. de Figueiredo, L., Podhorski, A., Rubio, A., Kaleta, C., Beasley, J., Schuster, S., Planes, F., 2009. Computing the shortest elementary flux modes in genome-scale metabolic networks. Bioinformatics 25, 3158–3165. DeJongh, M., Formsma, K., Boillot, P., Gould, J., Rycenga, M., Best, A., 2007. Toward the automated generation of genome-scale metabolic networks in the SEED. BMC Bioinformatics 8, 139. DeRisi, J., Iyer, V., Brown, P., 1997. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680–686. Dittrich, M., Klau, G., Rosenwald, A., Dandekar, T., Müller, T., 2008. Identifying functional modules in protein–protein interaction networks: an integrated exact approach. Bioinformatics 24, i223–i231. Duarte, N., Becker, S., Jamshidi, N., Thiele, I., Mo, M., Vo, T., ad, B.Ø., Palsson, R.S., 2007. Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proceedings of the National Academy of Sciences of the United States of America 104, 1777–1782. Dupont, P., Callut, J., Dooms, G., Monette, J.-N., Deville, Y., 2006. Relevant subgraph extraction from random walks in a graph. Research Report UCL/FSA/INGI RR 2006–07. Ellis, L., Gao, J., Fenner, K., Wackett, L., 2008. The University of Minnesota pathway prediction system: predicting metabolic logic. Nucleic Acids Research 36, W427–W432.

K. Faust et al. / BioSystems 105 (2011) 109–121 Eppstein, D., 1999. Finding the k shortest paths. SIAM Journal on Computing 28, 652–673. Faust, K., Croes, D., van Helden, J., 2009a. In response to “Can sugars be produced from fatty acids? A test case for pathway analysis tools”. Bioinformatics 25, 3202–3205. Faust, K., Croes, D., van Helden, J., 2009b. Metabolic pathfinding using RPAIR annotation. Journal of Molecular Biology 388, 390–414. Faust, K., Dupont, P., Callut, J., van Helden, J., 2010. Pathway discovery in metabolic networks by subgraph extraction. Bioinformatics 26, 1211–1218. Fell, D., Wagner, A., 2000. The small world of metabolism. Nature Metabolic Engineering 18, 1121–1122. Forst, C., Schulten, K., 1999. Evolution of metabolisms: a new method for the comparison of metabolic pathways using genomics information. Journal of Computational Biology 6, 343–360. Forst, C., Schulten, K., 2001. Phylogenetic analysis of metabolic pathways. Journal of Molecular Biology 52, 471–489. Gagneur, J., Jackson, D., Casari, G., 2003. Hierarchical analysis of dependency in metabolic networks. Bioinformatics 19, 1027–1034. Gama-Castro, S., Jiménez-Jacinto, V., Peralta-Gil, M., Santos-Zavaleta, A., naloza Spinola, M.P., Contreras-Moreira, B., Segura-Salazar, J., niz Rascado, L.M., Martínez-Flores, I., Salgado, H., Bonavides-Martínez, C., Abreu-Goodger, C., Rodríguez-Penagos, C., Miranda-Ríos, J., Morett, E., Merino, E., Huerta, A., no Quintanilla, L.T., Collado-Vides, J., 2008. RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Research 36. Gerlee, P., Lizana, L., Sneppen, K., 2009. Pathway identification by network pruning in the metabolic network of Escherichia coli. Bioinformatics 25, 3282–3288. Ginsburg, H., 2009. Caveat emptor: limitations of the automated reconstruction of metabolic pathways in Plasmodium. Trends in Parasitology 25, 37–43. Goesmann, A., Haubrock, M., Meyer, F., Kalinowski, J., Giegerich, R., 2002. PathFinder: reconstruction and dynamic visualization of metabolic pathways. Bioinformatics 18, 124–129. Green, M., Karp, P., 2004. A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 5, 76. Guimerà, R., Amaral, L., 2005. Functional cartography of complex metabolic networks. Nature 433, 895–900. Heath, A., Bennett, G., Kavraki, L., 2010. Finding Metabolic Pathways Using Atom Tracking. Bioinformatics 26, 1548–1555. Henry, C., DeJongh, M., Best, A., Frybarger, P., Linsay, B., Stevens, R., 2010. Highthroughput generation, optimization and analysis of genome-scale metabolic models. Nature Biotechnology 2, 977–982. Herréz, A., 2006. Biomolecules in the computer: Jmol to the rescue. Biochemistry and Molecular Biology Education 34, 255–261. Huang, S., Fraenkel, E., 2009. Integrating proteomic, transcriptional, and interactome data reveals hidden components of signaling and regulatory networks. Science Signaling 6053, 101–112. Ideker, T., Ozier, O., Schwikowski, B., Siegel, A., 2002. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18, S233–S240. Jaworska, J., Dimitrov, S., Nikolova, N., Mekenyan, O., 2002. Probabilistic assessment of biodegradatability based on metabolic pathways: CATABOL system. SAR and QSAR in Environmental Research 13, 307–323. Jeong, H., Tombor, B., Albert, R., Oltvai, Z., Barabási, A.-L., 2000. The large-scale organization of metabolic networks. Nature 407, 651–654. Jimenez, V., Marzal, A., 1999. Computing the k shortest paths: a new algorithm and an experimental comparison. In: Lecture Notes in Computer Science—Proceedings of the 3rd International Workshop on Algorithm Engineering 1668 , pp. 15–29. Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., Kawashima, S., Okuda, S., Tokimatsu, T., Yamanishi, Y., 2008. KEGG for linking genomes to life and the environment. Nucleic Acids Research 36, D480–D484. Karp, P., Paley, S., 1994. Representations of metabolic knowledge: pathways. Proceedings International Conference on Intelligent Systems for Molecular Biology 2, 203–211. Karp, P., Paley, S., Krummenacker, M., Latendresse, M., Dale, J., Lee, T., Kaipa, P., Gilham, F., Spaulding, A., Popescu, L., Altman, T., Paulsen, I., Keseler, I., Caspi, R., 2009. Pathway Tools version 13.0: integrated software for pathway/genome informatics and systems biology. Briefings in Bioinformatics 2, 40–79. Karp, R., 1972. Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W. (Eds.), Complexity of Computer Computations. Plenum Press, pp. 85–103. Klamt, S., Haus, U.-U., Theis, F., 2009. Hypergraphs and cellular networks. PLoS Computational Biology 5, 5. Kotera, M., Hattori, M., Oh, M.-A., Yamamoto, R., Komeno, T., Yabuzaki, J., Tonomura, K., Goto, S., Kanehisa, M., 2004a. RPAIR: a reactant-pair database representing chemical changes in enzymatic reactions. Genome Informatics 15, P062. Kotera, M., Okuno, Y., Hattori, M., Goto, S., Kanehisa, M., 2004b. Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions. Journal of the American Chemical Society 126, 16487–16498. Küffner, R., Zimmer, R., Lengauer, T., 2000. Pathway analysis in metabolic databases via differential metabolic display. Bioinformatics 16, 825–836. Lacroix, V., Cottret, L., Thébault, P., Sagot, M., 2008. An introduction to metabolic networks and their structural analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 5, 594–617. Lima-Mendez, G., van Helden, J., 2009. The powerful law of the power law and other myths in network biology. Molecular BioSystems 5, 1482–1493.

121

Liu, B., Pop, M., 2010. Identifying differentially abundant metabolic pathways in metagenomic datasets. Lecture Notes in Computer Science: Bioinformatics Research and Applications 6053, 101–112. ´ I., Weiskircher, R., Pferschy, U., Klau, G.W., Mutzel, P., Fischetti, M., 2006. An Ljubic, algorithmic framework for the exact solution of the prize-collecting Steiner Tree problem. Mathematical Programming Series B 105, 427–449. Ma, H., Zeng, A.-P., 2003. Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics 19, 270–277. Mithani, A., Preston, G.M., Hein, J., 2009. Hypergraph based tool for metabolic pathway prediction and network comparison. Bioinformatics 25, 1831–1832. Moriya, Y., Itoh, M., Okuda, S., Yoshizawa, A., Kanehisa, M., 2007. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Research 35, W182–W185. Nelson, D., Cox, M., 2005. Lehninger Principles of Biochemistry, fourth edition. Noirel, J., Ow, S.Y., Sanguinetti, G., Jaramillo, A., Wright, P.C., 2008. Automated extraction of meaningful pathways from quantitative proteomics data. Briefings in Functional Genomics and Proteomics 7, 136–146. Overbeek, R., Disz, T., Stevens, R., 2004. The seed: a peer-to-peer environment for genome annotation. Communications of the ACM 47, 47–51. Paley, S.M., Karp, P.D., 2006. The pathway tools cellular overview diagram and Omics viewer. Nucleic Acids Research 34, 3771–3778. Pazos, F., Guijas, D., Valencia, A., Lorenzo, V.D., 2005. Metarouter: bioinformatics for bioremediation. Nucleic Acids Research 35, D588–D592. Pitkänen, E., Jouhten, P., Rousu, J., 2009. Inferring branching pathways in genomescale metabolic networks. BMC Systems Biology 3, 103. Pitkänen, E., Rantanen, A., Rousu, J., Ukkonen, E., 2005. Finding feasible pathways in metabolic networks. In: Proceedings of the 10th Panhellenic Conference on Informatics (PCI’2005), Lecture Notes in Computer Science , Springer, pp. 123–133. Planes, F., Beasley, J., 2008. A critical examination of stoichiometric and pathfinding approaches to metabolic pathways. Briefings in Bioinformatics 9, 422–436. Rahman, S., Advani, P., Schunk, R., Schrader, R., Schomburg, D., 2004. Metabolic pathway analysis web service (pathway hunter tool at cubic). Bioinformatics 21, 1189–1193. Schilling, C., Letscher, D., Palsson, B., 2000. Theory for the systemic definition of metabolic pathways and their use in interpreting metabolic function from a pathway-oriented perspective. Journal of theoretical Biology 203, 229–248. Schuster, S., Dandekar, T., Fell, D., 1999. Detection of elementary flux modes in biochemical networks: a promising tool for pathway analysis and metabolic engineering. TIBTECH 17, 53–60. Schuster, S., de Figueiredo, L.F., Kaleta, C., 2010. Predicting novel pathways in genome-scale metabolic networks. Biochemical Society Transactions 38, 1202–1205. Scott, M., Perkins, T., Bunnell, S., Pepin, F., Thomas, D., Hallett, M., 2005. Identifying regulatory subnetworks for a set of genes. Molecular and Cellular Proteomics 4 (5), 683–692. Shlomi, T., Cabili, M., Herrgard, M., Palsson, B., Ruppin, E., 2008. Network-based prediction of human tissue-specic metabolism. Nature Biotechnology 26, 1003–1010. Sirava, M., Schaefer, T., Eiglsperger, M., Kaufmann, M., Kohlbacher, O., BornbergBauer, E., Lenhof, H., 2002. BioMiner—modeling, analyzing, and visualizing biochemical pathways and networks. Bioinformatics 18 (2), S219–S230. Subramanian, A., Tamayoa, P., Mootha, V., Mukherjeed, S., Ebert, B., Gillette, M., Paulovich, A., Pomeroy, S., Golub, T., Lander, E., Mesirov, J., 2005. Gene set enrichment analysis: a knowledge-based approach for interpreting genomewide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102, 15545–15550. Thiele, I., Palsson, B., 2010. A protocol for generating a high-quality genome-scale metabolic reconstruction. Nature Protocol 5, 93–121. van Helden, J., Gilbert, D., Wernisch, L., Schroeder, M., Wodak, S., 2001. Application of regulatory sequence analysis and metabolic network analysis to the interpretation of gene expression data. Lecture Notes in Computer Science 2066, 147–165. van Helden, J., Naim, A., Mancuso, R., Eldridge, M., Wernisch, L., Gilbert, D., Wodak, S., 2000. Representing and analysing molecular and cellular function in the computer. Biological Chemistry 381, 921–935. van Helden, J., Wernisch, L., Gilbert, D., Wodak, S., 2002. Graph-based analysis of metabolic networks. In: Ernst Schering Research Foundation Workshop, vol. 38 , Springer-Verlag, pp. 245–274. Voss, K., Heiner, M., Koch, I., 2003. Steady state analysis of metabolic pathways using Petri nets. In Silico Biology 3, 367–387. Wagner, A., Fell, D., 2001. The small world inside large metabolic networks. Proceedings of the Royal Society of London Series B 268, 1803–1810. Watts, D., Strogatz, S., 1998. Collective dynamics of ‘small-world’ networks. Nature 393, 440–442. Yen, J., 1971. Finding the K shortest loopless paths in a network. Management Science 17, 712–716. Zien, A., Küffner, R., Zimmer, R., Lengauer, T., 2000. Analysis of gene expression data with pathway scores. In: Proceedings of the International Conference of Intelligent Systems Molecular Biology , pp. 407–417. Zur, H., Ruppin, E., Shlomi, T., 2010. iMAT: an integrative metabolic analysis tool. Bioinformatics 26, 3140–3142.