Model selection in systems and synthetic biology

Paul Kirk¹, Thomas Thorne¹ and Michael PH Stumpf

Abstract
Developing mechanistic models has become an integral aspect of systems biology, as has the need to differentiate between alternative models. Parameterizing mathematical models has been widely perceived as a formidable challenge, which has spurred the development of statistical and optimisation routines for parameter inference. But now focus is increasingly shifting to problems that require us to choose from among a set of different models to determine which one offers the best description of a given biological system. We will here provide an overview of recent developments in the area of model selection. We will focus on approaches that are both practical and built on solid statistical principles, and outline the conceptual foundations and the scope for application of such methods in systems biology.

Addresses
Centre for Integrative Systems Biology and Bioinformatics, Department of Life Sciences, Imperial College London, London SW7 2AZ, UK
¹ These authors contributed equally.

Current Opinion in Biotechnology 2013, 24:767–774
This review comes from a themed issue on Systems biology
Edited by Orkun S Soyer and Peter S Swain
Available online 8th April 2013
0958-1669/$ – see front matter, © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.copbio.2013.03.012
Introduction

Biological organisms exhibit diverse types of behaviour and levels of complexity. Models, however simplified, are attempts to impose some structure onto these complicated systems. The very process of abstraction can already provide insights into the processes shaping these systems and their behaviour. Models have been remarkably successful in evolutionary biology, ecology and epidemiology. In the context of cellular and molecular biology their adoption was somewhat delayed, but with the rise of systems biology modelling approaches are becoming all-pervasive [1]. Models are thus finding their use in studies of signal transduction, gene regulation, stress response and cell-fate decision making, and increasingly also in clinical contexts. Developing models, however, remains a challenge [2]; and even if we have a model, deciding whether it is
actually a good one is also far from trivial [3,4]. A wealth of approaches has been developed to critique models quantitatively, or to choose from among a set of candidate models which one(s) is (are) the best. Here we will discuss their relative merits using some illustrative examples. The practice of model selection is encapsulated in Figure 1. Mathematical models represent our knowledge of how nature works, and where competing mechanisms have been proposed we will obtain different mathematical representations. The degree to which these models differ in capturing aspects of the real world seems like the most natural way of ranking such models and choosing the most successful from among them. Ockham's razor, pluralitas non est ponenda sine necessitate, is often invoked as a guiding principle, whereby a more complicated model should only be chosen if it offers a significant improvement over a competing simpler alternative. Quantifying such differences in a model's power or usefulness is therefore at the heart of all the approaches discussed below.
Likelihood based approaches

Likelihood based approaches attempt to determine a single point in parameter space that best explains the observed data, under the assumption that there is a single set of 'true' parameter values that perfectly capture the data generating process. In order to do this, a likelihood function, $\ell(\theta; y) = p(y|\theta)$, must be defined, where $y$ represents the observed data, and $p$ is a probability model with parameters $\theta$. A key problem in the application of likelihood based approaches is to determine the maximum value of the likelihood function of the model. To derive an estimate, we can apply numerical optimisation techniques that search the parameter space for the maximum value of the likelihood. Many such techniques exist (and are implemented in numerous commercial and open source packages [e.g. 5–7]). However, their ability to identify the global maximum will depend crucially upon the complexity of the likelihood surface, and escaping local maxima represents a significant challenge. By themselves, likelihoods and maximum likelihood estimates do not allow us to compare different models [8]; here we have to apply other means [9].
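To make this concrete, here is a minimal sketch of maximum likelihood estimation by numerical optimisation. The model (a simple exponential decay), the data values and the noise level are purely illustrative assumptions, not taken from the examples in this article:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical observed time course (illustrative values only)
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.45, 0.93, 0.55, 0.36, 0.22, 0.14, 0.09])

def model(theta, t):
    # Simple exponential decay: x(t) = x0 * exp(-k * t)
    x0, k = theta
    return x0 * np.exp(-k * t)

def neg_log_likelihood(theta, t, y, sigma=0.05):
    # Gaussian observation noise with (assumed known) standard deviation sigma
    return -np.sum(norm.logpdf(y, loc=model(theta, t), scale=sigma))

# Numerical optimisation from an initial guess; in practice multiple
# restarts help with the local-maxima problem discussed above
result = minimize(neg_log_likelihood, x0=[1.0, 0.5], args=(t, y),
                  method="Nelder-Mead")
theta_hat = result.x          # maximum likelihood estimate
max_log_lik = -result.fun     # maximised log-likelihood
```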
Figure 1. [Schematic: observed time course data for species X lead to hypothesis generation; candidate models M1, M2 and M3 of varying complexity each produce simulated trajectories of species X against time t; the models are then ranked by model complexity and fit, here f(M2) > f(M3) > f(M1), and model M2 is selected.]
Model selection in systems and synthetic biology. The first step in any modelling task is to generate hypotheses (which might be informed by a set of observations, or derived from fundamental biophysical principles) in an attempt to explain a given biological process. These are then formalised and represented as a set of candidate mathematical models, $\{M_i\}_{i=1}^{n}$. The different models may be of varying complexities, and the process of model selection requires us to find models that are of sufficient complexity to explain the observed behaviour, while avoiding over-fitting. There are a variety of ways of ranking models, and in the present article we consider likelihood based approaches and Bayesian methods, as well as touching upon a variety of model (in)validation and checking procedures that can be used to assess whether a model is good (rather than better than a given competitor). On the strength of a ranking, it might be possible to select a single best candidate, which may then be further tested and refined.
Given a pair of nested models (one being a special case of the other), the simplest test that can be performed is the likelihood ratio test. This compares the hypothesis that the null model (model 1) is the true data generating model with the hypothesis that a second model (model 2) generated the data. Under these conditions, the likelihood ratio statistic is approximately chi-squared distributed,

$$-2\log\frac{\ell(\hat{\theta}_1; y)}{\ell(\hat{\theta}_2; y)} \sim \chi^2_{n_2 - n_1}, \qquad (1)$$

with $n_2 - n_1$ degrees of freedom, where $n_1$ and $n_2$ are the numbers of free parameters, and $\hat{\theta}_1$ and $\hat{\theta}_2$ are the maximum likelihood parameter estimates for models 1 and 2, respectively. Of two nested models the more complex one will always explain the observed data at least as well, so a statistical test is required to determine whether the improvement is significant. By calculating p-values under the appropriate chi-squared distribution it is then possible to use classical hypothesis testing to determine if the null model can be rejected.
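In practice the test itself reduces to a few lines (a sketch using scipy; the maximised log-likelihoods and parameter counts would come from fits such as the one above, and the values in the example call are placeholders):

```python
from scipy.stats import chi2

def likelihood_ratio_test(loglik1, loglik2, n1, n2):
    """p-value for rejecting the null (simpler) model 1 in favour of
    the nested alternative model 2 with n2 > n1 free parameters."""
    statistic = -2.0 * (loglik1 - loglik2)   # equation (1)
    return chi2.sf(statistic, df=n2 - n1)    # upper tail probability

# e.g. p = likelihood_ratio_test(-12.3, -9.8, n1=2, n2=4)
```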
The likelihood ratio approach becomes fragile or clumsy when considering multiple, and especially non-nested, models, $\mathcal{M} = \{M_1, \ldots, M_n\}$. One possible approach is then to compare them by assessing the amount of information lost in approximating the true data generating mechanism with each model, using methods from information theory. Of course we do not typically know the true model generating the data, and so cannot make this comparison directly. However, it can be shown (see e.g. Burnham and Anderson [9]) that it is possible to approximate the relative information loss incurred by different models. Such ideas lead to Akaike's Information Criterion (AIC), defined for a given model as

$$\mathrm{AIC}_i = -2\log \ell(\hat{\theta}_i; y, M_i) + 2k_i, \qquad (2)$$

where $\ell(\hat{\theta}_i; y, M_i)$ is the likelihood function associated with model $M_i$, $\hat{\theta}_i$ is the maximum likelihood estimate of the parameters of $M_i$, and $k_i$ is the number of parameters in the model. Calculating the AIC for each model allows us to compare the relative fit of a set of models to the observed data. To better do so it is possible to derive the so-called AIC differences $\Delta_i$ and Akaike weights $w_i$ of the models $M_i \in \mathcal{M}$,

$$\Delta_i = \mathrm{AIC}_i - \min_j(\mathrm{AIC}_j), \qquad (3)$$

and

$$w_i = \frac{e^{-\Delta_i/2}}{\sum_{j=1}^{n} e^{-\Delta_j/2}}, \qquad (4)$$

where the sum runs over the $n$ candidate models. The AIC differences $\Delta_i$ take into account only the relative differences between the AIC values (see Figure 2), whilst the Akaike weights $w_i$ normalize these across all of the models under consideration; they can therefore be interpreted as the probability of $M_i$ being the correct model, conditional on the data and the panel of candidate models being considered. The supposed 'best' model may in fact provide a poor fit to the observed data; it is only better than the rest, and therefore some care needs to be taken when interpreting the results of model selection criteria. In addition to the AIC we can also consider the Bayesian Information Criterion (BIC), which unlike the AIC is unbiased for large sample sizes,

$$\mathrm{BIC}_i = -2\log \ell(\hat{\theta}_i; y, M_i) + k_i \log n, \qquad (5)$$

with $n$ here being the number of samples, and $k_i$ and $\hat{\theta}_i$ again being the number of parameters and maximum likelihood parameter estimate associated with model $M_i$. It is then possible to calculate BIC differences as with the AIC, and to estimate the marginal probability of the data given the models [9]. Other, more complex information criteria exist: Takeuchi's Information Criterion (TIC) [9] is better suited to situations where none of the models under consideration are close to the true data generating mechanism, while the Widely Applicable Information Criterion (WAIC) [10] provides an information criterion that is appropriate for models for which parameter identifiability is an issue (more precisely, for which the Fisher Information matrix is singular; see Watanabe [11]). Although the AIC and BIC are the most commonly used, different information criteria may be suited to specific needs and care should be taken to select one appropriately.
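Given maximised log-likelihoods for each candidate, equations (2)–(5) amount to a few lines of numpy (a minimal sketch; the log-likelihood values, parameter counts and sample size below are placeholders):

```python
import numpy as np

# Maximised log-likelihoods, parameter counts and sample size
# (placeholder values for three candidate models)
log_liks = np.array([-9.8, -10.1, -25.4])
k = np.array([6, 5, 4])   # numbers of free parameters
n = 7                     # number of data points

aic = -2.0 * log_liks + 2.0 * k            # equation (2)
bic = -2.0 * log_liks + k * np.log(n)      # equation (5)

delta = aic - aic.min()                    # AIC differences, equation (3)
weights = np.exp(-delta / 2.0)
weights /= weights.sum()                   # Akaike weights, equation (4)
```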
Bayesian model ranking

A different perspective is afforded by Bayesian approaches, which are gaining traction in computational biology [12], and may also be used in synthetic biology applications (see Box 1) [43]. Underlying all such approaches is Bayes's theorem, which provides a coherent inference framework for updating prior knowledge in light of observed data [44–47]:

$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{marginal likelihood}}.$$

For parameter inference, this is

$$p(\theta|D, M_i) = \frac{p(\theta|M_i)\, p(D|\theta, M_i)}{p(D|M_i)},$$

where $M_i$ denotes the model under consideration, $\theta$ its parameters, and $D$ the data. In addition to likelihood specification and exploration, Bayesian inference procedures thus also face the challenge of eliciting appropriate priors (although, of course, the opportunity to specify prior belief is also a notable benefit). For parameter inference, elucidating the posterior distribution (or obtaining samples from it) is the key Bayesian objective. When comparing models it is usually the marginal likelihood, $p(D|M_i) = \int_{\theta} p(\theta|M_i)\, p(D|\theta, M_i)\, d\theta$, that is of interest. Given any two of our models, say $M_i$ and $M_j$, the Bayes factor [13], $K_{ij}$, in favour of model $M_i$ is defined to be the ratio of the marginal likelihoods associated with the two models,

$$K_{ij} = \frac{p(D|M_i)}{p(D|M_j)}. \qquad (6)$$
Box 1. Model selection for synthetic biology

Instead of observed data D we can also use specifications of the type of data we would like to see produced by a system [43]. This takes us into the area of design in synthetic biology [44,45]. The challenge is to encode the type of behaviour expected from a synthetic system, the design objectives, O. We can then define, e.g., the posterior probability of a model/candidate design in light of the design objectives, $P(M_i|O)$, as before. Compared to the optimisation approaches popular in the engineering-inspired synthetic biology literature [46], statistical approaches automatically strike a compromise between the efficiency with which a model satisfies the design objectives and the robustness of the desired output. Encoding the design objectives is then the principal challenge [47].
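One simple simulation-based reading of this idea (our sketch, not the method of [43]; the simulators, parameter priors and acceptance criterion are all illustrative assumptions) approximates P(Mi|O) by counting how often prior draws from each candidate design produce behaviour that meets the objectives:

```python
import numpy as np

def design_posterior(simulators, prior_samplers, meets_objectives, n=10000):
    """Approximate P(M_i | O) for candidate designs, assuming a uniform
    prior over designs: the rate at which each design satisfies the
    objectives under its parameter prior, normalised across designs."""
    rng = np.random.default_rng(0)
    acceptance = []
    for simulate, sample_prior in zip(simulators, prior_samplers):
        hits = sum(meets_objectives(simulate(sample_prior(rng), rng))
                   for _ in range(n))
        acceptance.append(hits / n)
    acceptance = np.array(acceptance)
    return acceptance / acceptance.sum()
```

Designs that satisfy the objectives over a broad region of parameter space score highly here, which is exactly the robustness/efficiency compromise described above.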
Figure 2
(a)
(b)
Model 1: Equations:
Parameters: k1, k2, k3, k4, V, Km.
dS = –k1S – k2SR + k3CRS dt
Variables (molecular species):
= k1S
= k2SR – k3CRS – k4CRS = k4CRS
1
6
S, Sdeg, R, CRS, Rpp.
VRpp Km + Rpp
3
= –k2SR + k3CRS +
−5
dSdeg dt dR dt dCRS dt dRpp dt
7.5
VRpp − Km + Rpp
0.8 5 7.
Model 2:
dt V RS V2Rpp dR = − 1 + K2 + R dt K3 + Rpp dt
=
V1RS K2 + R
−
8
k2 0.4
7.5
6
S, Sdeg, R, Rpp.
3
0.2
V2Rpp
−2 9 −3 7 −45
K3 + Rpp 0
Model 3:
0
6
−5
1 −2
dRpp
6
Variables (molecular species):
= k1S
3
dSdeg
−13
dS = – k1 S dt
8
0.6 −5
Parameters: k1, k2, k3, V1, V2.
7.5
Equations:
3
−13
−5 −13
−21 −29 −37 −45
0.2
−5 −13
−21 −29 −37 −45
0.4
0.6
1
Km
Equations:
Parameters: k1, k2, V1, V2.
dS =0 dt dR V2Rpp V RS = − 1 + dt K1 + R K2 + Rpp
Variables (molecular species): S,R,Rpp.
V1RS V2Rpp dRpp = − dt K1 + R K2 + Rpp
(c)
(d) 30
1
AIC
AIC BIC
BIC
25
0.8
AIC/BIC weights
20
D
−21
0.8
15
10
0.6
0.4
0.2
5
0
0 1
2
3
Model
1
2
3
Model Current Opinion in Biotechnology
(a) A model of a signal transduction cascade in which a protein S activates a protein R, including degradation of S in models 1 and 2. A synthetic data set consisting of 7 data points was generated by simulating from model 1 and adding Gaussian noise. Throughout, we assume the noise model to be known, which allows us to define a likelihood. (b) A contour plot of the likelihood surface for parameters k1 and Km of model 1, with the other parameters set to the maximum likelihood estimate. Model selection can work even when the parameters are not inferred with high confidence. (c) AIC and BIC differences Δ for the three models calculated from the maximum likelihood estimates and (d) AIC and BIC weights. These show a clear preference for models 1 and 2, but are unable to clearly distinguish the two models given the observed data.
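To make the example reproducible in outline, the following sketch evaluates such a likelihood for model 1 (the initial conditions, observed species and noise level are our illustrative assumptions; the values used to generate the figure are not all recoverable from the text):

```python
import numpy as np
from scipy.integrate import odeint
from scipy.stats import norm

def model1_rhs(x, t, k1, k2, k3, k4, V, Km):
    # States: S, Sdeg, R, CRS, Rpp (model 1 of Figure 2)
    S, Sdeg, R, CRS, Rpp = x
    dS = -k1 * S - k2 * S * R + k3 * CRS
    dSdeg = k1 * S
    dR = -k2 * S * R + k3 * CRS + V * Rpp / (Km + Rpp)
    dCRS = k2 * S * R - k3 * CRS - k4 * CRS
    dRpp = k4 * CRS - V * Rpp / (Km + Rpp)
    return [dS, dSdeg, dR, dCRS, dRpp]

def log_likelihood(theta, t_obs, y_obs, x0=(1.0, 0.0, 1.0, 0.0, 0.0),
                   sigma=0.05):
    # Gaussian noise on the observed species (here assumed to be Rpp)
    traj = odeint(model1_rhs, x0, t_obs, args=tuple(theta))
    return np.sum(norm.logpdf(y_obs, loc=traj[:, 4], scale=sigma))
```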
By Bayes's theorem, the Bayes factor can equivalently be written as

$$K_{ij} = \frac{p(M_i|D)/ p(M_j|D)}{p(M_i)/ p(M_j)} \qquad (7)$$

$$= \frac{\text{posterior odds in favour of } M_i}{\text{prior odds in favour of } M_i}. \qquad (8)$$

The standard reference for interpreting the values taken by $K_{ij}$ is provided by Jeffreys [13, Appendix B]. Estimation of the marginal likelihood, $p(D|M_i)$, of model $M_i$ requires evaluation of the (often high dimensional) integral $\int_{\theta} p(\theta|M_i)\, p(D|\theta, M_i)\, d\theta$, which usually represents a significant challenge. Many approaches for marginal likelihood estimation have been proposed, a full review of which is beyond the scope of the present article. However, numerous reviews, extensions, interpretations and comparisons of these methods are given in the literature [e.g. 14–17]. The most appropriate method for a given situation will depend upon the model being considered, the availability of an appropriate implementation, and the expertise and personal preferences of individual modellers.
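The "simple Monte Carlo" estimates in Figure 3a correspond to the most basic such approach: averaging the likelihood over draws from the prior. A sketch (reusing the log_likelihood function from the previous snippet, and assuming U(0,1) priors on all parameters, as in the figure):

```python
import numpy as np
from scipy.special import logsumexp

def log_marginal_likelihood(log_lik, n_params, t_obs, y_obs,
                            n_samples=100000, seed=0):
    """Simple Monte Carlo estimate: p(D|M) ~ (1/N) sum_s p(D|theta_s, M),
    with theta_s drawn from the U(0,1) prior on each parameter."""
    rng = np.random.default_rng(seed)
    thetas = rng.uniform(0.0, 1.0, size=(n_samples, n_params))
    log_liks = np.array([log_lik(th, t_obs, y_obs) for th in thetas])
    return logsumexp(log_liks) - np.log(n_samples)  # average in log space
```

This estimator is unbiased but can have very high variance when the posterior is concentrated relative to the prior, which is one reason methods such as nested sampling [24], as implemented in MultiNest [22], are often preferred.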
Figure 3. [Figure: (a) log(marginal likelihood) estimates for the three models plotted against the number of live points used by MultiNest, with simple Monte Carlo estimates shown for comparison; (b) histograms of the marginal posterior samples for the parameters k1, k2, k3, k4, V and Km of model 1; (c) heatmaps of the posterior predictive distribution of Rpp against time for each of models 1, 2 and 3 (shaded by posterior probability), overlaid with the true underlying function and the observed data.]
Application of Bayesian model ranking to the example considered in Figure 2. (a) Taking U(0, 1) priors for all of the unknown parameters, MultiNest was used to estimate (log) marginal likelihoods for the three models. The number of "live points" used by MultiNest tunes the accuracy/computational cost, so a range was considered. Simple Monte Carlo estimates of the log marginal likelihood are shown as coloured patches (mean ± 2 standard deviations). On the basis of Bayes factors, we would say that there is decisive evidence in favour of models 1 and 2 relative to model 3, but only weak evidence to favour model 1 over model 2. (b) MultiNest also provides posterior parameter samples. The marginal posterior distributions are represented by histograms, with a black bar indicating the "true" parameter value used to simulate the data. (c) For each model, we plug in the corresponding posterior parameter samples and then simulate. This gives a distribution of model outputs, represented here as a heatmap (effectively, a 2D histogram). Comparing the observed data with this distribution provides another way to assess how well each model performs (see Gelman et al. [53] for a thorough treatment of predictive checks). It is clear that both models 1 and 2 are able to reproduce the observed behaviour, but model 3 is not.
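The predictive check in panel (c) is straightforward once posterior samples are available: simulate from each sampled parameter vector and compare the resulting band of trajectories with the data. A sketch (model1_rhs and the initial condition are carried over from the earlier snippet):

```python
import numpy as np
from scipy.integrate import odeint

def posterior_predictive(posterior_samples, t_grid,
                         x0=(1.0, 0.0, 1.0, 0.0, 0.0)):
    """Simulate the model at each posterior parameter sample, returning
    an array of Rpp trajectories whose spread visualises the posterior
    predictive distribution (cf. Figure 3c)."""
    trajectories = [odeint(model1_rhs, x0, t_grid, args=tuple(theta))[:, 4]
                    for theta in posterior_samples]
    return np.asarray(trajectories)

# Data points lying far outside the simulated band flag model misfit.
```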
But the flexibility of Bayesian approaches is beginning to yield results in systems biology, especially in the areas of model-based gene regulatory network modelling [2,18], signal transduction [19] and kinetic models [20]. Software for performing the calculation of marginal likelihoods includes BioBayes [21] and MultiNest [22]. BioBayes was specifically designed with biological applications in mind, and supports SBML (systems biology markup language; see [23]). MultiNest (which implements a number of nested sampling [24] procedures) has been widely applied in cosmological modelling [e.g. 25], but is equally applicable to computational biology. To demonstrate, we provide an example in Figure 3. In all of the above, we have implicitly assumed that the likelihood function, $p(D|\theta, M_i)$, is tractable and may be evaluated. For situations where this is not the case, approximate Bayesian computation (ABC) has been proposed (see the seminal contributions of Tavaré et al. [26] and Marjoram et al. [27], as well as the recent reviews of Marin et al. [28] and Sunnåker et al. [29]). ABC has been used to perform parameter inference in a wide variety of biological modelling problems [e.g. 30,31,32,33–36]. In the context of model selection in systems and population biology, Toni et al. [31] and Toni and Stumpf [37] describe an ABC approach, implemented in the ABC-SysBio package of Liepe et al. [38], and sketched in outline below. ABC often makes use of summaries of the data, and it is now well established that the loss of information suffered when using insufficient summary statistics raises significant challenges for ABC model selection [39]. If summarising the data truly cannot be avoided, this motivates careful consideration of the choice of summary statistic(s) [40,41]. There is also a host of alternative approaches emerging that approximate models [42] or likelihoods in suitable ways, and that can be employed for model selection and model checking.
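A minimal rejection-ABC model selection scheme in the spirit of Toni et al. [31] (our simplified sketch, not the ABC-SysBio implementation; the sequential Monte Carlo machinery and adaptive thresholds of the real algorithm are omitted) samples a model index, then parameters, and keeps the pair only if the simulation lands close to the data:

```python
import numpy as np

def abc_model_selection(simulators, prior_samplers, y_obs,
                        epsilon, n_accept=1000, seed=0):
    """Rejection ABC over candidate models: returns approximate model
    posterior probabilities under a uniform prior on the model index."""
    rng = np.random.default_rng(seed)
    m = len(simulators)
    counts = np.zeros(m)
    while counts.sum() < n_accept:
        i = rng.integers(m)                   # sample a model index
        theta = prior_samplers[i](rng)        # sample its parameters
        y_sim = simulators[i](theta, rng)     # simulate a data set
        if np.linalg.norm(y_sim - y_obs) < epsilon:   # distance criterion
            counts[i] += 1
    return counts / counts.sum()
```

Here the full data vector is used in the distance; replacing it with summary statistics is where the information-loss issues for model choice [39–41] arise.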
Model choice heuristics and model checking

Likelihood and Bayesian approaches are principled and based on sound statistical criteria. In particular, they provide means of balancing model complexity and predictive power in well defined ways; in the Bayesian framework, robustness of the model is also a determining factor in which model is chosen. The approaches most likely to be successful for model selection will tend to fall into either of these two categories, although information theoretical approaches may have something to offer, too (beyond the AIC, etc.). Sometimes it is also possible to develop computationally affordable approaches around bootstrap procedures [48], but, implicitly at least, these too will incorporate similar statistical foundations. The engineering literature especially has given rise to a number of approaches centred around model invalidation [49,50]. Here we typically try to identify sets of
conditions in light of available data that cannot be met by a given model or set of models [51]. These approaches, and the emerging class of parameter-free approaches based on algebraic geometry [52], are generally only applicable to deterministic models (with added noise), but are related in spirit to predictive checking approaches from statistics (see [53], and also Figure 3c) and (in)validation approaches from the engineering literature. Here we are interested in whether a given model is actually capable of producing behaviour that shares characteristics of experimentally observed data. This question is subtly but non-trivially different from model selection: there we choose which model is better, whereas in model checking we seek to evaluate whether a model is good. Vexingly, the two are not necessarily the same. Such notions are frequently lost when models are calibrated against data in an optimization framework, as are straightforward assessments of parsimony and robustness [50,32]. Robustness by itself has also been used as a heuristic to choose between models [50,54]; ultimately, there seem to be more deeply rooted links between the Bayesian framework (or a more comprehensive exploration of the likelihood surface) and robustness arguments that may be worthy of further exploration.
Conclusion

Model selection is deeply engrained in scientific practice, though often in an intuitive, decidedly non-quantitative way. At a statistical level it differs from hypothesis testing in a number of ways. First of all, whereas hypothesis testing traditionally dealt with whether a single hypothesis could be held up (that is, failed to be rejected) in light of available data, model selection takes a more positive attitude and assigns probabilities to each model from among a set of candidate models. Choosing the right model, that is, identifying the mechanisms at work, is at the very heart of trying to understand nature. We always have to keep in mind, though, that no model will ever be "true" in the sense of being an accurate description of the process that generates data, or indeed of the processes governing how biological systems work. Here the statistical rigour afforded by modern model selection procedures offers invaluable guidance.
References

[1] Nurse P, Hayles J: The cell in an era of systems biology. Cell 2011, 144:850-854.

[2] Marbach D et al.: Wisdom of crowds for robust gene network inference. Nat Methods 2012, 9:796-804.
This paper illustrates how pooling the results of different inference algorithms can vastly improve the overall performance of reverse engineering tasks.

[3] Meyer P et al.: Verification of systems biology research in the age of collaborative competition. Nat Biotechnol 2011, 29:811-815.

[4] Stumpf MPH, Balding DJ, Girolami M: Handbook of Statistical Systems Biology. Wiley; 2011.

[5] R Development Core Team: R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2011. www.R-project.org. ISBN 3-900051-07-0.

[6] Eaton JW, Bateman D, Hauberg S: GNU Octave Manual Version 3. Network Theory Ltd; 2008. ISBN 095461206X.

[7] Gough B: GNU Scientific Library Reference Manual. 3rd edn. Network Theory Ltd; 2009. ISBN 0954612078.

[8] Raue A, Kreutz C, Maiwald T, Bachmann J, Schilling M, Klingmüller U, Timmer J: Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics 2009, 25:1923-1929.

[9] Burnham KP, Anderson DR: Model Selection and Multimodel Inference: A Practical Information-theoretic Approach. Springer Verlag; 2002.

[10] Watanabe S: Algebraic Geometry and Statistical Learning Theory (Cambridge Monographs on Applied and Computational Mathematics). Cambridge University Press; 2009.

[11] Watanabe S: Almost all learning machines are singular. In Proceedings of the 2007 IEEE Symposium on Foundations of Computational Intelligence. 2007:383-388. http://dx.doi.org/10.1109/FOCI.2007.371500.

[12] Wilkinson DJ: Bayesian methods in bioinformatics and computational systems biology. Brief Bioinform 2007, 8:109-116. http://dx.doi.org/10.1093/bib/bbm007.

[13] Jeffreys H: Theory of Probability. Oxford: Clarendon Press; 1961.

[14] Vyshemirsky V, Girolami MA: Bayesian ranking of biochemical system models. Bioinformatics 2008, 24:833-839.

[15] Robert CP, Wraith D: Computational methods for Bayesian model choice. In Bayesian Inference and Maximum Entropy Methods in Science and Engineering. AIP Conference Proceedings. 2009:251-262. http://dx.doi.org/10.1063/1.3275622.

[16] Marin J-M, Robert CP: Importance sampling methods for Bayesian discrimination between embedded models. Front Stat Decision Making Bayesian Anal 2010.

[17] Friel N, Wyse J: Estimating the evidence – a review. Statistica Neerlandica 2012.

[18] Titsias MK, Honkela A, Lawrence ND, Rattray M: Identifying targets of multiple co-regulating transcription factors from expression time-series by Bayesian model comparison. BMC Syst Biol 2012, 6:53.
Model selection is here used to rationally develop mechanistic hypotheses and understanding from high-throughput data.

[19] Xu T-R et al.: Inferring signaling pathway topologies from multiple perturbation measurements of specific biochemical species. Sci Signal 2010, 3:ra20.
Nice illustration of how model selection can improve our understanding of biological mechanisms.

[20] Schmidl D, Hug S, Li WB, Greiter MB, Theis FJ: Bayesian model selection validates a biokinetic model for zirconium processing in humans. BMC Syst Biol 2012, 6:95.
An example of how Bayesian model selection can be used in order to argue in favour of a mechanistic model, rather than failing to simply reject a hypothesis.

[21] Vyshemirsky V, Girolami MA: BioBayes: a software package for Bayesian inference in systems biology. Bioinformatics 2008.

[22] Feroz F, Hobson MP, Bridges M: MultiNest: an efficient and robust Bayesian inference tool for cosmology and particle physics. Mon Not R Astron Soc 2009.

[23] Hucka M et al.: The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003, 19:524-531.

[24] Skilling J: Nested sampling for general Bayesian computation. Bayesian Anal 2006.

[25] Martin J, Ringeval C, Trotta R: Hunting down the best model of inflation with Bayesian evidence. Phys Rev D 2011, 83:063524.

[26] Tavaré S, Balding DJ, Griffiths RC, Donnelly P: Inferring coalescence times from DNA sequence data. Genetics 1997, 145:505-518.

[27] Marjoram P, Molitor J, Plagnol V, Tavaré S: Markov chain Monte Carlo without likelihoods. Proc Natl Acad Sci U S A 2003, 100:15324-15328.

[28] Marin J-M, Pudlo P, Robert CP, Ryder RJ: Approximate Bayesian computational methods. Stat Comput 2012.

[29] Sunnåker M, Busetto AG, Numminen E, Corander J, Foll M, Dessimoz C: Approximate Bayesian computation. PLoS Comput Biol 2012, 9. http://dx.doi.org/10.1371/journal.pcbi.1002803.

[30] Nunes MA, Balding DJ: On optimal selection of summary statistics for approximate Bayesian computation. Stat Appl Genet Mol Biol 2010, 9: Article 34. http://dx.doi.org/10.2202/1544-6115.1576.

[31] Toni T, Welch D, Strelkowa N, Ipsen A, Stumpf MPH: Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J R Soc Interface 2009, 6:187-202.

[32] Toni T, Ozaki Y-I, Kirk P, Kuroda S, Stumpf MPH: Elucidating the in vivo phosphorylation dynamics of the ERK MAP kinase using quantitative proteomics data and Bayesian model selection. Mol Biosyst 2012. http://dx.doi.org/10.1039/c2mb05493k.
Illustration of how ABC methods can be used to choose from different mechanistic models.

[33] Liepe J, Taylor H, Barnes CP, Huvet M, Bugeon L, Thorne T, Lamb JR, Dallman MJ, Stumpf MPH: Calibrating spatio-temporal models of leukocyte dynamics against in vivo live-imaging data using approximate Bayesian computation. Integr Biol 2012, 4:335-345.

[34] Holmes GR, Anderson SR, Dixon G, Robertson AL, Reyes-Aldasoro CC, Billings SA, Renshaw SA, Kadirkamanathan V: Repelled from the wound or randomly dispersed? Reverse migration behaviour of neutrophils characterized by dynamic modelling. J R Soc Interface 2012, 9:3229-3239.

[35] Ratmann O, Andrieu C, Wiuf C, Richardson S: Model criticism based on likelihood-free inference, with an application to protein network evolution. Proc Natl Acad Sci U S A 2009, 106:10576-10581. http://dx.doi.org/10.1073/pnas.0807882106.

[36] Thorne T, Stumpf MPH: Graph spectral analysis of protein interaction network evolution. J R Soc Interface 2012, 9:2653-2666.

[37] Toni T, Stumpf MPH: Simulation-based model selection for dynamical systems in systems and population biology. Bioinformatics 2010.
This paper develops the ABC model selection framework for dynamical systems.

[38] Liepe J, Barnes C, Cule E, Erguler K, Kirk P, Toni T, Stumpf MPH: ABC-SysBio: approximate Bayesian computation in Python with GPU support. Bioinformatics 2010, 26:1797-1799.

[39] Robert CP, Cornuet J-M, Marin J-M, Pillai N: Lack of confidence in ABC model choice. Proc Natl Acad Sci U S A 2011, 108:15112-15117.

[40] Barnes C, Filippi S, Stumpf MPH, Thorne T: Considerate approaches to achieving sufficiency for ABC model selection. Stat Comput 2012, 22:1181-1197.

[41] Marin JM, Pillai N, Robert CP, Rousseau J: Relevant statistics for Bayesian model choice. arXiv 2011.

[42] Liu B, Hagiescu A, Palaniappan SK, Chattopadhyay B, Cui Z, Wong W-F, Thiagarajan PS: Approximate probabilistic analysis of biopathway dynamics. Bioinformatics 2012, 28:1508-1516.

[43] Barnes C, Silk D, Sheng X, Stumpf MPH: Bayesian design of synthetic biological systems. Proc Natl Acad Sci U S A 2011, 108:15190-15195.
The process of model selection in systems biology translates into a system design procedure for synthetic biology.

[44] Ma W, Trusina A, El-Samad H, Lim WA, Tang C: Defining network topologies that can achieve biochemical adaptation. Cell 2009, 138:760-773.

[45] Myers CJ: Engineering Genetic Circuits. Chapman & Hall; 2009.

[46] Lim WA: Designing customized cell signalling circuits. Nat Rev Mol Cell Biol 2010, 11:393-403.

[47] Silk D, Kirk P, Barnes CP, Toni T, Rose A, Moon S, Dallman MJ, Stumpf MPH: Designing attractive models via automated identification of chaotic and oscillatory dynamical regimes. Nat Commun 2011, 2:489.

[48] Timmer J, Müller TG, Swameye I, Sandra O, Klingmüller U: Modeling the nonlinear dynamics of cellular signal transduction. Int J Bifurcation Chaos 2004, 14:2069-2079.

[49] Anderson J, Papachristodoulou A: On validation and invalidation of biological models. BMC Bioinform 2009, 10:132.

[50] Bates DG, Cosentino C: Validation and invalidation of systems biology models using robustness analysis. IET Syst Biol 2011, 5:229-244.
An overview with applications of (in)validation approaches to model selection in systems biology.

[51] Shinar G, Feinberg M: Concordant chemical reaction networks. Math Biosci 2012, 240:92-113.

[52] Harrington HA, Ho KL, Thorne T, Stumpf MPH: Parameter-free model discrimination criterion based on steady-state coplanarity. Proc Natl Acad Sci U S A 2012, 109:15746-15751.

[53] Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis. 2nd edn. Chapman & Hall/CRC; 2003.

[54] Rybiński M, Gambin A: Model-based selection of the robust JAK-STAT activation mechanism. J Theor Biol 2012, 309:34-46.