Parameter estimation in modulated, unbranched reaction chains within biochemical systems

Parameter estimation in modulated, unbranched reaction chains within biochemical systems

Computational Biology and Chemistry 29 (2005) 309–318 Parameter estimation in modulated, unbranched reaction chains within biochemical systems Raman ...

313KB Sizes 0 Downloads 51 Views

Computational Biology and Chemistry 29 (2005) 309–318

Parameter estimation in modulated, unbranched reaction chains within biochemical systems Raman Lall 1 , Eberhard O. Voit ∗ Department of Biostatistics, Bioinformatics and Epidemiology, Medical University of South Carolina, 303K Cannon Place, 135 Cannon Street, Charleston, SC 29425, USA Received 24 May 2005; received in revised form 2 August 2005; accepted 2 August 2005

Abstract Modern biology is increasingly developing techniques for measuring time series of global gene expression and of many simultaneous proteins or metabolites. These data contain valuable information on the dynamics of cells, which has to be extracted with computational means. Given a suitable mathematical model, this extraction is in principle a straightforward regression task, but the complexity and nonlinearity of the differential equations that describe biological systems cause severe difficulties when the systems are of realistic size. We propose a method of stepwise regression that can be applied effectively to linear portions of pathways. The method may be combined with other estimation methods and either directly yields reasonable parameter estimates or at least provides appropriate start values for subsequent nonlinear search algorithms. We illustrate the method with the analysis of in vivo NMR data describing the dynamics of glycolytic metabolites in Lactococcus lactis. © 2005 Elsevier Ltd. All rights reserved. Keywords: Biochemical Systems Theory; Lactococcus lactis; Parameter estimation; S-system

1. Introduction Mathematical models of biochemical systems have traditionally been constructed from the bottom up. Individual enzyme-catalyzed steps were formulated as rate laws and their parameters were estimated from experimental results in vitro, for instance, in the form of Lineweaver-Burk plots (e.g., Segel, 1991). Subsequently all rate functions were integrated into a comprehensive model describing the dynamics of select metabolite pools of interest. The integrative behavior was compared to observed system responses, if these were available, which was rarely the case, and discrepancies were resolved through secondary adjustments of parameter values. ∗ Corresponding author. Present address: The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, 313 Ferst Drive, Suite 4103, Atlanta, GA 30332-0535, USA. Tel.: +1 4043855057; fax: +1 4048944243. E-mail addresses: [email protected] (R. Lall), [email protected] (E.O. Voit). 1 Present address: BACTER Institute, Room 6615 Biochemistry Addition, 433 Babcock Drive, Madison, WI 53706, USA.

1476-9271/$ – see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2005.08.001

Modern biology is generating new classes of data that permit a complementary approach. These data are of high density, capturing simultaneously the expression of hundreds or thousands of genes or proteins, or the concentrations of many metabolites. As a direct extension, the same experimental techniques allow the construction of time series of these measurements. These time series are very interesting, because they illuminate the dynamics of cellular responses in unprecedented ways. It is easy to imagine that these data have the potential of offering truly novel insights into the functioning of biological systems, if efficacious methods are available for their analysis. Indeed, several groups around the world have begun establishing methods that estimate model parameters from high-density time series (e.g., Voit and Savageau, 1982; Voit and Sands, 1996; Voit, 2000; Maki et al., 2002; Almeida and Voit, 2003; Kikuchi et al., 2003; Voit and Almeida, 2003, 2004; Veflingstad et al., 2004; Kimura et al., 2005; Tsai and Wang, 2005; Naval et al., 2005; see also Vance et al., 2002; Torralba et al., 2003). The estimation of system parameters from time series data is in principle a straightforward regression task, which,

310

R. Lall, E.O. Voit / Computational Biology and Chemistry 29 (2005) 309–318

however, is obstructed by severe challenges in implementation. One class of issues pertains to the choice of the mathematical model that is supposed to capture the observed dynamics. Clearly, there are no a priori rules or laws that identify one specific type of model as optimal for a given biological system. Thus, one has to make assumptions and either use functions that have worked well in the past, or one employs generic representations that are supported by mathematical theory. The former approach might use modulated Michaelis-Menten or Hill functions, while the latter might use power-law representations. The second class of issues is of a technical nature and, for instance, includes the lack of convergence of nonlinear estimation algorithms, which is aggravated by the combinatorial explosion in the number of parameters to be estimated if the system grows in size. In this article, we discuss a technique that, in combination with existing methods, helps tame the combinatorial explosion in the size of the parameter search space. This technique deals efficiently with linear portions of complex pathways, which may or may not be regulated by activators or inhibitors. The method does not provide advantages over other methods at branch points, but we show how linear parts may be separated out and thus render the proposed techniques admissible. This is important, because many parts of biochemical systems are indeed linear. As just one example, the conversion of PRPP into IMP consists of an unbranched chain of 10 steps, the first of which is subject to several inhibitory signals (cf. Stanbury et al., 1983: Chapter 50). The method begins with the decoupling of the system equations, as it was proposed in the recent literature (Voit and Almeida, 2004; see also Voit and Savageau, 1982; Volgin et al., 2003; Clements et al., 2004); we will review this method in the next section. The decoupling renders it possible that the parameters of the entire system be estimated one equation at a time. Specifically, one considers the measured data quasi as forcing functions for the equation presently under consideration. Upon decoupling, and using power-law representations for the underlying model, we show that linear segments of pathways may be estimated with linear regression techniques that are extremely efficient. We illustrate the method with time series data that Neves et al. (1999) obtained from Lactococcus lactis with methods of in vivo nuclear magnetic resonance.

2. Methods 2.1. Decoupling Essentially all models of biochemical systems are based on nonlinear ordinary differential equations (ODE’s). If time series measurements are available for all system variables of interest, then it should be possible in principle to use a nonlinear search algorithm, which solves the equations at

each iteration and ultimately yields optimal parameter values. Practical experience shows that such an approach is not feasible for realistically sized problems (cf. CPU time information in Kikuchi et al., 2003) and that other techniques must be developed. If the noise level is fairly low, the data may be interpolated with straight lines, splines, or the so-called threepoint method (Burden and Faires, 1993). If the noise is significant, several smoothing or collocation methods are available. In either case, the result is a (piecewise) smooth set of points that may henceforth be used in lieu of the raw data. This smooth set can also be used to compute slopes (with the possible exception of breakpoints of a piecewise representation), which subsequently allows the replacement of differentials in the ODEs with slopes as well as the decoupling of the ODE system, as we will show below. This procedure has two added benefits. First, the original time points for this procedure need not necessarily be evenly spaced. Secondly, once the smoothing is completed, functional values are available at almost arbitrary time points. The decoupling “trick” consists of considering the slopes as estimates of the true differentials on the left-hand sides of the ODE model describing the biochemical system of interest. Thus, the nonlinear ODE model is replaced by n sets of N algebraic equations each, and each set may be treated as an independent regression task in N equations. To execute the regression with set i ∈ {1, . . ., n}, the values of all variables at each time point tk , k ∈ {1, . . ., N}, are taken as the smoothed values described above and the differentials on the left hand sides are taken as the estimated slopes at each time point k ∈ {1, . . ., N}. Specifically, suppose the nonlinear ordinary differential equation model has the form ˙ = f (X) X

X(t0 ) = X0 ,

(1)

where X is a variable or a vector of variables, the notation ˙ = dX/dt indicates differentiation with respect to time t, X and X0 represents the initial condition(s) at the start time t0 . According to Eq. (1), the instantaneous change (slope) of X at any given time point tk is equal to f(X) at tk , i.e., ˙ tk ) = f (X(tk )). Thus, if one interprets the derivatives X ˙ X(at at all measured time points tk as slopes, one may estimate these slopes from the data as S(tk ) and approximate the differential equations as  ˙  tk = f (X(tk )) S(tk ) ≈ X (2) (Voit and Savageau, 1982; Voit and Almeida, 2004). For clarity, it is useful to make the parameters of the system explicit. Each function fi of the vector f may be characterized by the parameters pi1 , . . . , piMi that need to be optimized. These parameters may differ in number among the n equations and are the only unknowns in these equations. The estimation problem is then reformulated from one involving n ODEs to a larger system of n × N algebraic equations

R. Lall, E.O. Voit / Computational Biology and Chemistry 29 (2005) 309–318

311

of the form

and the generic S-system structure is

S1 (t1 ) ≈ f1 (X1 (t1 ), X2 (t1 ), . . . ., Xn (t1 ); p11 , . . . , p1M1 ) S1 (t2 ) ≈ f1 (X1 (t2 ), X2 (t2 ), . . . ., Xn (t2 ); p11 , . . . , p1M1 )

˙ i = αi X

: S1 (tN ) ≈ f1 (X1 (tN ), X2 (tN ), . . . ., Xn (tN ); p11 , ..., p1M1 ) : Si (tj ) ≈ fi (X1 (tj ), X2 (tj ), . . . ., Xn (tj ); pi1 , . . . , piMi ) : Sn (tN ) ≈ fn (X1 (tN ), X2 (tN ), . . . ., Xn (tN ); pn1 , . . . , pnMn )

n 

j=1

.

(3) The reformulation of the system of differential equations as a system of algebraic equations has the obvious advantage of making numerical integration unnecessary, but it has other very beneficial consequences. First and foremost, this representation constitutes a decoupling of the system of state equations. Clearly, if all X’s and S’s are known, the system may be split up into n groups, each corresponding to one of the differential equations of the system. The groups may be analyzed sequentially or in parallel.

n 

Xjgij − βi

Xjhij ,

i = 1, . . . , n.

(6)

j=1

BST has been reviewed numerous times (e.g., Savageau, 1976; Voit, 1991, 2000; Torres and Voit, 2002) and does not require further general discussion here. Because our method focuses on linear chains, the difference between the GMA and S-system forms is immaterial, and the generic format of a linear chain model is ˙ 1 = Input − β1 X

n 

Xjh1j

j=1

˙ 2 = β1 X

n 

Xjh1j − β2

j=1

n 

Xjh2j

j=1

: n 

βi−1 Xjhi−1,j j=1

˙i = X

n 

(7)

− βi Xjhij j=1

: n 

Xjhn,j − Efflux

2.2. Biochemical Systems Theory (BST)

˙ n = βn−1 X

Recognizing the rapid increase in complexity of biochemical systems models with the number of metabolites, Savageau (1969a,b) proposed representing rate functions with products of power-law functions, which are the immediate consequence of Taylor’s theorem of numerical analysis applied to variables in logarithmic space. Thus, each step Vi involving at most n dependent (state) variables and m independent (control variables or constant enzyme activities) takes the format

Note that the degradation terms are equivalent to the subsequent production terms, which is a consequence of the unbranched nature of the pathway. The proposed estimation of linear chains proceeds in the following fashion. Suppose the term Input is known, either because it represents a controlled input from outside the system or it has been estimated before with any of the available methods. Then, the β1 -term may be obtained from the first equation as

Vi = γi

n+m 

Xjfij ,

j=1

(4)

j=1

˙ 1. Xjh1j = Input − X

(8)

j=1

where γ i is the rate constant and the exponent fij is the kinetic order that quantifies the effect of variable Xj on Vi . For estimation purposes, one typically assumes that the values of the control variables are known and can be subsumed into the rate constants. Thus, we will in the following omit the explicit inclusion of independent variables. BST offers alternative representations, of which the most widely used are the generalized mass action (GMA) and the S-system representations. In GMA models, every step is represented individually by one product of power-law functions, while in the S-system form all incoming fluxes at a given pool are aggregated and collectively represented by one product of power-law functions, and the same is done with all effluxes. Thus, the generic GMA structure is   Pi n   ±γip Xjfijp  , i = 1, . . . , n, ˙i = X (5) p=1

β1

n 

j=1

Substituting numerical values for all variables and slopes at N time points and taking logarithms on both sides yields the linear regression task ln(β1 ) +

n 

˙ 1 (tk )), h1j ln(Xj (tk )) = ln(Input − X

(9)

j=1

where the right-hand side now consists of a numerical value. Once the parameters of the β1 -term are estimated, the method proceeds to the next equation, using the same procedure of substituting values, rearranging the equations, taking logarithms, and performing a linear regression. We illustrate the method with a specific example in the following sections. An obvious question is to what degree this method leads to an undue accumulation of errors, because late steps in the chain are estimated from estimates that were again estimated from estimates. A rigorous statistical analysis of this accumulation is difficult. However, the result of the proposed

312

R. Lall, E.O. Voit / Computational Biology and Chemistry 29 (2005) 309–318

analysis: (a) is obtained fast, because all steps consist of simple algebraic manipulations (including the regression step itself); (b) are to be checked ultimately with the fully integrated ODE model against all observed time series; (c) may merely serve to provide good starting values for a subsequent nonlinear estimation analysis. Furthermore, one should see the issue of error accumulation in comparison with alternatives, such as a nonlinear regression or a genetic algorithm, which are notorious for their computational challenges.

3. Illustration example L. lactis is an industrially important microorganism that plays an essential role in the production of fermented milk, cheese, yogurt, meat, bread, vegetables and wine. Through its production of lactic acid, polysaccharides and CO2 , and the digestion of casein, it provides an effective preservation of foods and improves the flavor, texture, color, and preservation characteristics of fermented products. For our illustration, we use in vivo time series data on some key metabolites of glycolysis in L. lactis. These were measured in 50 mL mini-bioreactors specifically designed to maintain cell suspensions in defined conditions of gas atmosphere, pH and temperature. The circulating water bath maintained a temperature of 30 ◦ C and the suspension was circulated from the bioreactor to a standard 10 mm NMR tube. The flow rate in the outlet (i.d. 0.8 mm) and inlet (i.d. 1.6 mm) tube was maintained at a circulation rate of 32 mL/min and a minimum residence time of 7 s in the NMR tube. Time courses for substrate consumption, product formation and the dynamics of intracellular metabolite pools were thus monitored in vivo under aerobic or anaerobic conditions. Details of the experimental procedures can be found in Neves (2001) and Neves et al. (1999, 2000, 2002). Glycolysis is among the best studied biochemical pathways, and there is no need to review it in detail. The simplified scheme underlying our model analysis is given in Fig. 1. We use the first three steps of this pathway to demonstrate the proposed stepwise estimation. Of note is that phosphoenolpyruvate (PEP) in Neves’ (2001) data remains below the detection level for the first 10 min of the experiment. During this time, almost all glucose is used up and material has moved down the pathway, so that it is almost certain that PEP is being produced during this time. Furthermore, it is certain that without any initial PEP the glycolytic pathway could not even be started, because glucose could not be converted into G6P. We thus reinterpreted the observed data by modeling the initial PEP production as a slow exponential function that stayed below the detection limit as observed. Whether this reflects the “true” process is not known. However, since the focus of the paper is on the estimation method and not so much on the biological system, the assumption on the dynamics of PEP may be sufficient for our illustration. The BST equations for the first steps of glycolysis are readily set up in symbolic form, according to well-documented

Fig. 1. Simplified scheme of glycolysis in Lactococcus lactis.

guidelines (e.g., Voit, 2000). The consumption of extracellular glucose is driven by its own concentration and by PEP, which is used as phosphate donor. It is also potentially affected by glucose 6-phosphate (G6P; X2 ), which has been found to inhibit glucose uptake and conversion into G6P in yeast (Galazzo and Bailey, 1990). Since there is no production of glucose, the first equation captures only the cellular uptake of glucose from the medium and is given as ˙ 1 = −β1 X1h11 X2h12 X8h18 , X

(10)

where the variable names are given in Fig. 1 and their concentrations are in millimolar. For the regression task, we use time course measurements by Neves (2001) and in cases where they are rather noisy, we smooth them with the software WebMetabol (Voit and Almeida, 2004), which is based on an artificial neural network. WebMetabol also automatically computes the slopes S1 (tk ) of the observed time course of glucose consumption, and these S1 (tk ) subsequently become estimators of the true differentials as discussed above; thus ˙ 1 (tk ). S1 (tk ) ≈ X

(11)

Because no glucose is generated, the slopes S1 (tk ) are negative at all times. Multiplying both sides by −1 and taking logarithms yields ln(S1 (tk )) = ln(β1 ) + h11 ln(X1 (tk )) + h12 ln(X2 (tk )) + h18 ln(X8 (tk ))

for k = 1, . . . , N.

(12)

R. Lall, E.O. Voit / Computational Biology and Chemistry 29 (2005) 309–318

There are N equations of this type, reflecting the N time points considered in the regression. Smoothed values of S1 (tk ) and Xi (tk ) are substituted in each equation. Using all N equations simultaneously, the linear regression determines optimal values for the parameters β1 , h11 , h12 , and h18 . One should note that we have limited the regression in this step to the variables X1 , X2 , and X8 , instead of using all variables in the system. This set was determined by our specific knowledge of the glycolytic pathway. If nothing about the contributing variables and the regulation of this step were known, the symbolic power-law representation would include all variables that could possibly affect this step, for instance, as inhibitors or cofactors. The regression would still be linear, but it would contain more variables. This would not cause problems in principle, but could lead to redundancies and possible overfitting. Thus, it is advantageous, though not mandatory, to limit each regression to those variables that are known to affect the process under investigation. Furthermore, the smoothing step is not mandatory, especially if the raw time courses are already fairly smooth (such as in the case of glucose consumption; see next section), but if experimental noise is significant, the corresponding noise in the slopes is magnified, and experience (Almeida and Voit, 2003; Voit and Almeida, 2004) has shown that the estimation procedure benefits from smoothing. We describe in the next section how one can assess the sensitivity of the results to experimental noise. The dynamics of the subsequent metabolite, G6P, is given symbolically as ˙ 2 = α2 X1g21 X2g22 X8g28 − β2 X2h22 X3h23 . X

(13)

Because G6P is the only fate of glucose consumption, the consumption flux and the flux of G6P synthesis are subject to the precursor-product constraints α2 = β1 , g21 = h11 , g22 = h12 , and g28 = h18 . The parameters on the right-hand sides were estimated above, and metabolite concentrations are obtained from the smoothed observations, so that the entire production term of G6P becomes a single numerical value at each of the N time points. Thus, rearranging Eq. (3) results in ˙2 β2 X2h22 X3h23 = α2 X1g21 X2g22 X8g28 − X

(14)

and logarithmic transformation yields the next set of algebraic regression equations at N time points: ln(β2 ) + h22 ln(X2 (tk )) + h23 ln(X3 (tk )) ˙ 2 (tk )). = ln(α2 X1g21 (tk )X2g22 (tk )X8g28 (tk ) − X

(15)

With the right-hand side of the equation completely known, linear regression directly yields estimates of β2 , h22 and h23 . To complete the illustration, the dynamics of FBP is symbolically written as ˙ 4 = α4 X2g42 X3g43 − β4 X4h44 X5h45 X6h46 . X

(16)

313

Because of the precursor-product constraints between G6P and FBP, we know α4 = β2 , g42 = h22 , and g43 = h23 . Thus, rearranging Eq. (16) and taking logarithms results in ln(β4 ) + h44 ln(X4 (tk )) + h45 ln(X5 (tk )) + h46 ln(X6 (tk )) ˙ 4 (tk )) = ln(α4 X2g42 (tk )X3g43 (tk ) − X

(17)

and linear regression on Eq. (17) produces the parameter values of β4 , h44 , h45 and h46 . The procedure is obviously highly structured, which allowed us to automate it in Matlab and run it on a PC.

4. Results In all practical cases, linear regression yields a unique solution. However, in order to test the reliability of results, we subjected the data to random noise, which was added uniformly from a 10% range about each smoothed experimental data point. As an alternative, we could have developed a bootstrapping or jackknife scheme with the measured data, but because of the small number of measurements it appeared that experimental variability was better represented in our approach. It may be confusing that we first smooth the data and now artificially add noise. However, the two manipulations serve very different purposes. The smoothing is used to compute good estimates of the slopes of the time courses at all time points, which subsequently allow us to replace differential with algebraic equations, as discussed before. By contrast, the artificial addition of noise constitutes a sensitivity analysis that is beneficial for the following reason: Given a set of time series data, the (linear) regression result is unique and reflects all idiosyncrasies of the particular data and their experimental noisiness. The same is true for smoothed data. By repeatedly adding random noise, we obtain parameter estimation results for slightly modified time courses, which mimic replicate experiments. A comparison of the results from many of these artificial replicates shows how much each estimated kinetic order or rate constant is affected by experimental noise. Regressions with these artificially altered data were again automated in Matlab and produced results, such as Fig. 2. The figure shows plots of the kinetic orders versus the logarithm of the rate constant for glucose consumption. In each case shown, the residual error is approximately the same, which indicates that there is compensation of error between the rate constant and the kinetic orders, as it has been discussed in other contexts (e.g., Sands and Voit, 1996). Interestingly, the kinetic order h11 is quite reliably estimated between 1.5 and 2, and h18 has a positive value between 0.5 and 2.0. On the other hand, h12 shows significantly different values between 0 and −6, which would be interpreted as “no appreciable effect” of X2 on glucose uptake in the case h12 ≈ 0 or as strong inhibition of glucose uptake by G6P (h12  0). While Ockham’s razor would suggest a system design without inhibition, Galazzo

314

R. Lall, E.O. Voit / Computational Biology and Chemistry 29 (2005) 309–318

The results from estimations of decoupled equations in the space of S versus X do not always hold up when they are integrated in a comprehensive model that describes the system dynamics in the space Xi versus t (e.g., Voit and Sands, 1996). Nonetheless, the estimates can serve as good starting values for a subsequent nonlinear optimization of the entire pathway. In our case, some estimates were appropriate (e.g., β1 , h11 , h12 , h18 ), while others were refined during a nonlinear optimization of all simultaneous differential equations in their original form. Given the initial estimates from the linear regression, this nonlinear regression was greatly simplified. The results indicate that an S-system model captures the dynamics of this pathway quite well (Fig. 3). The numerical characteristics of this model are given in Appendix A. It is noted that good fits can also be obtained with a model consisting of Michaelis-Menten rate laws and their extensions, as proposed by Hoefnagel et al. (2002). For instance, in this model, the hexokinase reaction between glucose and G6P is represented as v[1] =

Vmax1 Gluc PEP[t]  Gluc Km1 Gluc Km1 PEP 1 + Km1 Gluc +  PEP[t] PYR[t] × 1 + Km1 PEP + Km1 PYR

G6P[t] Km1 G6P



(18)

Fig. 2. Plot of estimated kinetic orders of various steps as functions of corresponding rate constants. Each point is the result of a regression in which the smoothed data were subjected to uniform 10% random noise. The residual error at each point is essentially the same. The plots clearly show compensation of errors. Panel A: Estimated kinetic orders of glucose consumption as functions of ln(β1 ); note compensation of errors between β1 and h12 . Panel B: Parameters of the conversion of G6P into FBP. Panel C: Relationships between degradation parameters of FBP (see text for further details).

and Bailey (1990) suggested that such an inhibitory effect exists, at least in yeast. Without further information, the given data cannot reliably distinguish between these two scenarios. Additional data sets could identify the truth, or one could take the ambiguity as motivation for targeted experiments.

which is to be compared to the much simpler representation Eq. (10) in the S-system model. However, while we found the functional forms of these rate equations to be appropriate for fitting the observed data (using the Levenberg-Marquardt algorithm), the optimized parameter values were often drastically different from those obtained from literature information, which was typically based on in vitro enzyme assays. A similar observation was made in other contexts (e.g., Albe et al., 1989). As an example, the fit of the hexokinase reaction with Eq. (18) is shown in Fig. 4 in comparison to a fit with Eq. (10), and Table 1 shows a comparison of the corresponding parameter values with those found in the literature. Thus, either a Michaelis-Menten model or an S-system model is able to capture the observed dynamics of the glycolytic pathway in L. lactis in vivo, but the parameter values in both cases must be re-estimated, because literature information is not always sufficient.

5. Discussion Any parameter estimation in biochemical systems faces the challenge of nonlinearity with all its computational problems, and one must expect that no single method will be sufficient for all estimation tasks. We have demonstrated in the past that decoupling significantly simplifies these estimation tasks, because large networks can be analyzed one metabolite at a time and this analysis can be executed sequentially or in parallel (Voit and Almeida, 2004). Nonetheless,

R. Lall, E.O. Voit / Computational Biology and Chemistry 29 (2005) 309–318

315

Fig. 3. NMR measurements of glycolytic metabolites in Lactococcus lactis (Neves, 2001) and S-system model fit (see Appendix A).

the combinatorial explosion that comes with larger systems remains to be a hurdle, and any additional information that can be brought to bear on the estimation problem provides welcome relief. For instance, if it is known that a kinetic order in a BST model represents an inhibitory effect, then the parameter search space is immediately halved, because positive values of this parameter are no longer admissible.

The present contribution adds another tool to the repertoire of estimation methods. It allows the stepwise estimation of parameters in linear chains. This focus may appear rather limited, but metabolic databases like KEGG (http://www.genome.jp/kegg/) demonstrate that indeed the synthesis of many metabolites includes unbranched sequences of steps. Thus, if a linear chain starts at branch point Xi , and if the production of Xi has been estimated with other methods, then the production term can be converted into a numerical value and the estimation can proceed as we demonstrated above, using only algebraic manipulations, including linear regression steps that are very efficient and computationally cheap. In large systems, this reduction in complexity will be very useful. Table 1 Comparison of parameter values in Eq. (18) from literature (Hoefnagel et al., 2002) and from model fitting

Fig. 4. Comparison of fits of the hexokinase step between glucose and G6P with power-law (Eq. (10)) and Michaelis-Menten type (Eq. (18)) representations.

Parameter

Literature value

Optimized model value

Vmax1 Km1Gluc Km1PEP Km1G6P Km1PYR

160 0.015 0.3 500 2

27.82 8.44 0.59 12.52 1.797

316

R. Lall, E.O. Voit / Computational Biology and Chemistry 29 (2005) 309–318

Because of the stepwise nature of the procedure, there is the potential risk that residual errors from one regression will negatively affect the next regression steps. It seems that this issue is the price to be paid for the simplicity of each regression step. As with other methods that are based on estimations of decoupled algebraic equations, the results are “local” estimates. Because these are obtained via linear regression, all diagnostic tools available in standard regression software packages are directly applicable. Because of the relatively small number of data points, we evaluated the sensitivity of our results to experimental and natural variation by repeatedly adding random noise to the smoothed data points, thereby mimicking replicates of the observations, and comparing the regression results. This method showed very clearly that some estimates were much more stable than others and that there was in many cases compensation of error between parameters within the same process description (cf. Fig. 2). A further issue associated with the decoupling of equations is that the collection of “local” estimates does not always lead to a satisfactory fit of the complete, integrated set of differential equations (see discussion in Voit and Sands, 1996). Again, this is an unavoidable drawback that renders the simplified regression task possible. However, even if the local estimates do not yield a perfect overall fit, much is gained from these estimates, because it is a well-known fact that nonlinear regression algorithms typically work quite well if they are initiated with good start values. Thus, even rough estimates that are obtained with low computational cost may serve as start values from which a full nonlinear regression (based on the differential equations themselves) has a good chance of converging. The proposed method is particularly useful for models within BST, because the stepwise procedure naturally leads to linear regressions in logarithmic coordinates. By contrast, if the procedure were applied to other rate functions, such as Michaelis-Menten rate laws, the estimation would still be somewhat simplified, but not linear. This poses the question of whether BST models are sufficiently accurate to capture the dynamics of natural systems. Indeed, many applications in the past have suggested that BST models are of similar accuracy as other models (e.g., Curto et al., 1998; Alvarez-Vasquez et al., 2004). However, these comparative analyses were not ideal, because they were seldom performed with high-quality dynamic data obtained from experiments in vivo. Here, we have demonstrated with highly accurate in vivo NMR measurements that an S-system model captures dynamic data indeed well. While S-system models have sometimes been criticized for the manner of flux aggregation into one influx and one efflux per pool and for the fact that individual power-law functions do not saturate, the dynamic data and their analysis shown here demonstrate again that these issues are not of particular relevance. This is so because the metabolite pools in vivo usually do not venture into domains where rate functions saturate or where the difference between aggregated and nonaggregated fluxes becomes detrimental. Given the operational tractability of BST models, one

should therefore carefully consider whether it is necessary to evoke complex mechanistic rate functions (cf. Eq. (18) versus Eq. (10)) that may involve dozens of parameters for a single enzyme-catalyzed step (Savageau, 1976; Schulz, 1994). Indeed, extending the argument from the parameter estimation task to the task of identifying network structure from time series data, S-systems within BST have unique features that are unmatched by competing approaches and that have been recognized widely in recent years.

Acknowledgments This work was supported in part by a Quantitative Systems Biotechnology grant (BES-0120288; E.O. Voit, PI) from the National Science Foundation, a National Heart, Lung and Blood Institute Proteomics Initiative (Contract N01-HV28181; D. Knapp, PI), and an endowment from the Georgia Research Alliance. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsoring institutions.

Appendix A. S-system model The parameters of the S-system used to model measured time courses describing the initial steps of glycolysis in L. lactis were estimated with a number of methods, including the methods proposed here, which were used as initial estimates for the unbranched steps of the pathway. The numerical implementation below was used to capture the observed data in Fig. 3. The format of the model is self explanatory and compatible with the BST software PLAS (Ferreira, 1996). X1 X2 X3 X4 X5 X6 X8

= – b1 X1ˆh11 X2ˆh12 X8 = a2 X1ˆg21 X2ˆg22 X8 – b2 X2ˆh22 X3ˆh23 = a3 X2ˆg32 X3ˆg33 X4ˆg34 – b3 X2ˆh32 X3ˆh33 = a4 X2ˆg42 X3ˆg43 – b4 X4ˆh44 X5ˆh45 X6ˆh46 = a5 X3ˆg53 X4ˆg54 – b5 X4ˆh54 X5ˆh55 X6ˆh56 = a6 – b6 X4ˆh64 X5ˆh65 X6ˆh66 = X8 – .001

at t = 0 X1 = 500 X2 = 0.1 X3 = 0.5 X4 = 1 X5 = 14.5 X6 = 5.029 X8 = .005 b1 = .2 h11 = 1.85 h12 = -1 a2 = b1 g21 = h11 g22 = h12

R. Lall, E.O. Voit / Computational Biology and Chemistry 29 (2005) 309–318 b2 = 0.022464 h22 = 3.0703 h23 = 3.2298 a3 = 2.5877 g32 = 0 g33 = 0 g34 = 0 b3 = 0.014909 h32 = 0 h33 = 3.1998 h34 = 0 h35 = 0 a4 = b2 g42 = h22 g43 = h23 b4 = 0.015 h43 = 0 h44 = 2.2 h45 = 0.5 h46 = .1 a5 = 0.12652 g53 = 2.5915 g54 = 0. b5 = 6.633862 h53 = 0 h54 = 0 h55 = 0.12 h56 = 0 a6 = 2.5243 g62 = 0 g63 = 0 b6 = 0.1818 h63 = 0 h64 = 0.3521 h65 = 0.9577543 h66 = 0.1 t0 = 0 tf = 6 hr = .25

References Albe, K.R., Butler, M.H., Wright, B.E., 1989. Cellular concentrations of enzymes and their substrates. J. Theor. Biol. 143, 163–195. Almeida, J.S., Voit, E.O., 2003. Neural-network-based parameter estimation in complex biomedical systems. Genome Inform. 14, 114– 123. Alvarez-Vasquez, F., Sims, K.J., Hannun, Y.A., Voit, E.O., 2004. Integration of kinetic information on yeast sphingolipid metabolism in dynamical pathway models. J. Theor. Biol. 226, 265–291. Burden, R.L., Faires, J.D., 1993. Numerical Analysis, fifth ed. PWS Publishing Co., Boston, MA, pp. 156–167. Clements, J.C., Nenonen, J., Li, P.K., Horacek, B.M., 2004. Activation dynamics in anisotropic cardiac tissue via decoupling. Ann. Biomed. Eng. 32 (7), 984–990. Curto, R., Voit, E.O., Sorribas, A., Cascante, M., 1998. Mathematical models of purine metabolism in man. Math. Biosci. 151, 1–49.

317

Ferreira, A.E.N., 1996. PLAS: http://www.dqb.fc.ul.pt/docentes/aferreira/ plas.html. Galazzo, J.L., Bailey, J.E., 1990. Fermentation pathway kinetics and metabolic flux control in suspended and immobilized Saccharomyces cerevisiae. Enzyme Microb. Technol. 12, 162–172. Hoefnagel, M.H., Starrenburg, M.J., Martens, D.E., Hugenholtz, J., Kleerebezem, M., Van Swam, I.I., Bongers, R., Westerhoff, H.V., Snoep, J.L., 2002. Metabolic engineering of lactic acid bacteria, the combined approach: kinetic modeling, metabolic control and experimental analysis. Microbiology 148, 1003–1013. Kimura, S., Ide, K., Kashihara, A., Kano, M., Hatakeyama, M., Masui, R., Nakagawa, N., Yokoyama, S., Kuramitsu, S., Konagaya, A., 2005. Inference of S-system models of genetic networks using a cooperative coevolutionary algorithm. Bioinformatics 21, 1154–1163. Kikuchi, S., Tominaga, D., Arita, M., Takahashi, K., Tomita, M., 2003. Dynamic modeling of genetic networks using genetic algorithm and S-system. Bioinformatics 19, 643–650. Maki, Y., Ueda, T., Okamoto, M., Uematsu, N., Inamura, Y., Eguchi, Y., 2002. Inference of genetic network using the expression profile time course data of mouse P19 cells. Genome Inform. 13, 382–383. Naval, P.C., Gonzales, O.R., Mendoza, E., Sison, L.G., 2005. Heuristic parameter estimation methods for S-system models of biochemical networks. In: IEEE Conference HNICEM 2005, 17–20 March 2005, Manila, Philippines. Neves, A.R., 2001. Metabolic Strategies to Reroute Carbon Fluxes in Lactococcus lactis: Kinetics of Intracellular Metabolite Pools by in vivo Nuclear Magnetic Resonance. Doctoral Dissertation, Universidade Nova de Lisboa, Instituto de Tecnologia Qu´ımica e Biol´ogica. Neves, A.R., Ramos, A., Nunes, M.C., Kleerebezem, M., Hugenholtz, J., de Vos, W.M., Almeida, J., Santos, H., 1999. In vivo nuclear magnetic resonance studies of glycolytic kinetics in Lactococcus lactis. Biotechnol. Bioeng. 64 (2), 200–212. Neves, A.R., Ramos, A., Shearman, C., Gasson, M.J., Almeida, J.S., Santos, H., 2000. Metabolic characterization of Lactococcus lactis deficient in lactate dehydrogenase using in vivo 13 C NMR. Eur. J. Biochem. 267 (12), 3859–3868. Neves, A.R., Ventura, R., Mansour, N., Shearman, C., Gasson, M.J., Maycock, C., Ramos, A., Santos, H., 2002. Is the glycolytic flux in Lactococcus lactis primarily controlled by the redox charge? Kinetics of NAD(+) and NADH pools determined in vivo by 13 C NMR. J. Biol. Chem. 277 (31), 28088–28098. Sands, P.J., Voit, E.O., 1996. Flux-based estimation of parameters in Ssystems. Ecol. Model. 93, 75–88. Savageau, M.A., 1969a. Biochemical systems analysis. 1. Some mathematical properties of the rate law for the component enzymatic reactions. J. Theor. Biol. 25, 365–369. Savageau, M.A., 1969b. Biochemical systems analysis. 2. The steady-state solutions for an n-pool system using a power law approximation. J. Theor. Biol. 25, 370–379. Savageau, M.A., 1976. Biochemical Systems Analysis: A Study of Function and Design in Molecular Biology. Addison-Wesley, Reading, MA. Segel, L.A., 1991. Biological Kinetics. Cambridge University Press, Cambridge, UK. Schulz, A.R., 1994. Enzyme Kinetics. From Diastase to Multi-enzyme Systems. Cambridge University Press, Cambridge, U.K. Stanbury, J.B., Wyngaarden, J.B., Fredrickson, D.S., Goldstein, J.L., Brown, M.S. (Eds.), 1983. The Metabolic Basis of Inherited Disease, fifth ed. McGraw-Hill Book Company, Auckland, New York. Torralba, A.S., Yu, K., Shen, P., Oefner, P.J., Ross, J., 2003. Experimental test of a method for determining causal connectivities of species in reactions. Proc. Natl. Acad. Sci. U.S.A. 100, 1494–1498. Torres, N.V., Voit, E.O., 2002. Pathway Analysis and Optimization in Metabolic Engineering. Cambridge University Press, Cambridge, UK. Tsai, K.Y., Wang, F.S., 2005. Evolutionary optimization with data collocation for reverse engineering of biological networks. Bioinformatics 21, 1180–1188.

318

R. Lall, E.O. Voit / Computational Biology and Chemistry 29 (2005) 309–318

Vance, W., Arkin, A.P., Ross, J., 2002. Determination of causal connectivities of species in reaction networks. Proc. Natl. Acad. Sci. U.S.A. 99, 5816–5821. Veflingstad, S.R., Almeida, J.S., Voit, E.O., 2004. Priming nonlinear searches for pathway identification. BMC Theoretical Biology and Medical Modelling, PMID 15367330. Voit, E.O. (Ed.), 1991. Canonical Nonlinear Modeling. S-System Approach to Understanding Complexity. Van Nostrand Reinhold, New York, xi + 365 pp. Voit, E.O., 2000. Computational Analysis of Biochemical Systems: A Practical Guide for Biochemists and Molecular Biologists. Cambridge University Press, Cambridge, United Kingdom. Voit, E.O., Almeida, J., 2003. Dynamic profiling and canonical modeling: powerful partners in metabolic pathway identification. In: Goodacre, R., Harrigan, G.G. (Eds.), Metabolite Profiling: Its Role in Biomarker

Discovery and Gene Function Analysis. Kluwer Academic Publishing, Dordrecht, The Netherlands. Voit, E.O., Almeida, J.S., 2004. Decoupling dynamical systems for pathway identification from metabolic profiles. Bioinformatics 20 (11), 1670–1681. Voit, E.O., Sands, P.J., 1996. Modeling forest growth. I. Canonical approach. Ecol. Model. 86, 51–71. Voit, E.O., Savageau, M.A., 1982. Power-law approach to modeling biological systems. III. Methods of analysis. J. Ferment. Technol. 60 (3), 233–241. Volgin, V.M., Volgina, O.V., Davydov, A.D., 2003. Finite difference method of simulation of nonsteady-state ion transfer in electrochemical systems with allowance for migration. Comput. Biol. Chem. 27 (3), 185–196.