Copyright © IFAC Computer Applications in Biotechnology, Osaka, Japan, 1998
MODELLING OF THE IN VIVO KINETICS OF FOLDING AND AGGREGATION OF A RECOMBINANT PROTEIN
Frank HOFFMANN 1), Clemens POSTEN 2), Ursula RINAS 1)
I) GBF - National Research Center/or Biotechnology Mascheroder Weg I , 38124 Braunschweig,
[email protected] 2) University Karlsruhe (TH) Institutfor Mechanical Engineering 76128 Karlsruhe, e-mail.·
[email protected]
ABSTRACT: During the production of basic Fibroblast Growth Factor (bFGF) with recombinant Escherichia coli, the protein partly folds into the active form which is dissolved in the cytoplasma and partly accumulates in inclusion bodies. A simple, three parameters containing model of the kinetic competition between folding and aggregation of nascent protein has been applied to the process. Obtained parameter values vary strongly among different cultivations, but arc located in an ellipsoid region, indicating low parameter estimation accuracy. As shown in detailed analysis by means of the Fisher information matrix, deviations are due to linear dependencies. The influence of sampling pattern on Amill was computed. The information relevant for parameter identification has to be utilized by frequent measurements during the first two hours after induction. Sample intervals of five minutes are recommended. Results of such an improved measurement scheme are shown. Copyright © 1998IFAC Keywords: Parameter estimation, Model reduction, Biotechnology, Sampling frequency, Linear dependence
1. INTRODUCTION
models have been introduced for in vivo aggregation, but for monomeric proteins it is rather impossible to distinguish between intermediates and native protein, so no direct measurements are to be obtained for their kinetics. While in vitro it is possible to follow the course of reaction by measuring the folding intermediates (Utiyama and Baldwin, 1986), only the result of the partitioning into soluble rsp. insoluble fraction can be measured in vivo. Although protein folding is a complex and highly spatially organized process, it is important to fonnulate lumped parameter models for better understanding the dynamics of the folding process and in order to optimize production conditions. Nevertheless, the question remains, whether a model can be complex enough to describe the dominant features of folding and aggregation on the one hand, and to be simple enough to be validated by the data on the other hand. To validate a model goodness-of-fit, parameter estimation accuracy and exclusion of competing models have to be checked. Especially estimation accuracy limits severely the complexity of mathematical models: biological models often contain more parameters than can be identified from available data. Too many
Pharmaceutically important proteins are frequently produced by inserting the related DNA-sequence into the E coli genome and by its expression in a fermentation process. The related induction system can be stimulated, as shown here for the production of basic Fibroblast Growth Factor (bFGF), by a fast increase of temperature. After joining of the amino acids into the polypeptide chain during the translation process, the chain folds to its tertiary structure driven by intramolecular forces. Especially in artificial expression systems the peptide chains partly fail to fold properly and intermolecular interactions occure, leading to the formation of aggregates of several peptide chains and fmally to the the formation of insoluble inclusion bodies. Yield of recombinant protein production is limited by formation of such inclusion bodies. Mechanisms have been proposed for folding and aggregation, based on folding/refolding experiments of purified protein by characterization of kinetic ally trapped intermediates, mutanional analysis and H-exchange-Iabelling. similar
17
parameters allow fitting of faulty models, concealing modeling errors. In case of nearly linearly dependent parameters estimated values may vary considerably on small changes in experimental data due to measurement noise (Posten, 1994). This will disable biological interpretation of the values. It is inIportant to consider parameter estimation accuracy and employ means to improve it, if necessary.
3. A SIMPLE MODEL 3.1 Model Structure Kiefhaber et al (1991) proposed a simple model of the competition between protein folding rsp. aggregation in vitro. They suggested extension to conditions ill vivo by addition of a constant production rate. Experimentally, a steady state level for total protein is achieved after several hours, which camIot be accounted for by simultanous degradation, as inclusion bodies are not successible for proteolysis. These findings advise an exponential decreasing production rate, which is a conmlOn feature of recombinant protein production (Kramer 1996). This leads to the model equations
2. MA TERlAL AND METHODS Escherichia coli TG I (OSM 6056) pAFGFB carrying the gene for basic Fibroblast Growth Factor (bFGF) under the control of the APLP R promoter was used for recombinant protein production. High cell density cultivations (Korz, 1994) were conducted in reactors holding working volumes of 51 (Seeger et aL. 1995) or 501 (KJeppe, unpublished results). After a batch phase on glucose and mineral salts medium at 30°C as specified in Seeger et al. (1995), feeding of a solution containing So = 0.61 g glucose/g feed at a exponentially incresing rate according to eq. I was started to allow population growth with constant specific growth rate of ~l ,et = 0.12 h- 1 On an optical density of 00600 = 100, bFGF production was induced by a temperatur shift to 42°C and the feeding rate was adapted using the values given in Table 1. Maintenance requirements were accounted for by m E= 0.025 g/gh. Index f denotes the values at feeding start of time 1. cell density X and medium volume I . (J.1Jel V ./11. ( t ) = - ,, -+
So
I11 F.
)
J us
(V II ) r·e l',a{l -If) .
'. \'
(2)
dl dt
dN
-
dl
( 1)
'
fl'd h- t
........................ .......................
f33 f37 hdf40 hdf41 hdf42
50 50 5 5 5
Y X/s
gi g
0,08 0,08 0,04 0,08 0,04
0,41 0,41 0,5 0,5 0,5
- JL' Nand
(3)
(4 )
(5)
3.2 Parameter Estimation Identification of the parameter vector P was carried out by minimization ofa cost criterion J, here the sum of squared differences between simulation y and experimental data yM, over N samples at sampling time
X r.md g/l ...........................
.. .....................
= kIN ·/
. /" •
which build up a simple model to describe kinetic competition between folding and aggregation of proteins; it incorporates an exponential decreasing production rate r}" and the material balances for intermediate I. native protein N and aggregate A, with first order folding reaction (rate constant kiN)' nth order aggregation (rate constant kit.), and dilution by cell growth (specific growth rate fl).
after induction (~l"t: preset specific growth rate. Y X/~: yield coefficient, X r.ind : cell dry weight upon induction) Volume I
(k IN + ,lL). / - k lA
dA = k . I" - ,lL ' A dl I1 '
Table 1: Names, volumes, and feeding rate parameters
#
= r" -
40 39 60.2 24 46.6
t,: J
=
±(y(tr' p) - yM (t,)f .
(6)
i ::' 1
The structure of the identification problem allows for a decoupling into two sub-problems. So paranleter estimation was performed ill two steps: ftrstly, the sum of all folding and aggregation species was taken as model output y and used to determine rpo and 1: to identify production kinetics of total protein. Secondly, the folding paranIeters (kiN' kl A and n) are estimated by fitting tile model output y =N+I to tile amount of soluble bFGF. Specific growth rate fl was determined experimentally and was found to be constant during the first hours after induction.
After induction. samples were taken in intervals of appr. one hour. After harvesting of the cells by centrifugation, they were disrupted by sonification. Separation of soluble and aggregated fraction was achieved by centrifugation. Quantitative evaluation was carried out by SOS-PAGE and densiometry. Data were taken from Seeger et al. (1995) and KJeppe (unpublished results) for modeling use.
18
100
5~k:: ··" : '" o
5
10
amount of soluble bFGF.estimated parameter values for five cultivations are listed table 2. Special attention ought to be paid to the order n of the aggregation reaction. It ranges from 0 to 10, what is difficult to interpret physiologically. When it is fixed to n = 2, what is meaningful for an intermolecular reaction as aggregation is, obtained rate constants scatter over four 15 orders of magnitude.
tind [h] -Total bFGF
Production rate rp
• measurements
Table 2: Estimated parameter values for the competition between folding and aggregation; the order n of the aggregation reaction is either estimated or preset to n =2 . Fig. 1: Total protein production rate r" and total protein content from cultivation f 33; symbols are measurements, lines represent simulations. Apart from noise, the model represents the measureme~lts fairly good. Fig. I shows that the heuristic approach for protein production, which in fact is an input for the folding model. holds.
foldin free order n g
#
kIN h· 1
klA
hi
n
folding
fixed order n
kIN h· 1
klA
(gCDW/mg)
(gCDW/mg)
n·1 ............................ .....................
f33 f37 hdf40 hdf 41 hdf42
go .
"' . 70
tobl bFGF
Fig. 2: Comparison of simulation and measurement values of the specific concentrations of total and soluble bFGF after induction by temperature shift to 42 D C, cultivation f33 . Fig. 2 shows results for the folding model. Total bFGF rises sharply immediately after induction and reaches a steady state after four hours. After more than ten hours, experimental values y(tj, P + 3P)
~ y(tj' P) + ay(tj, P)I 3P
= y(tj,p)+ Y p (t j , P). 3P
12.6 3.1 6.5 7.9 3.9
...........................
0.3 0.83 3.9 18.36 0.2 0.031 9.2 13.74 0.0006 9.69
0.1 35.84 0.007 11.3 2.64
To detennine whether this strongly variing parameter values reflect differences between respective cultivations or whether this is due to sensitivity of estimated parameters to measurement errors, parameter estimation accuracy has been calculated.
" soluble bFGF
8.2 7.9 5.6 6.9 3.7
hi
J.J Parameter Estimation Accuracy Calculation of parameter estimation accuracy can be perfonned by evaluation of the Fisher infonnation matrix (Goodwin, 1988). The results are then applied to optimization of sampling as presented by Munack (1991 ). Assuming a given model represents a process adequately, there shall be "true" parameter vector P, which is unknown. Estimates P are obtained by minimization of a cost criterion J, cf. eq.4. Because of the measurement noise, the obtained values even for an identically repeated experiment will differ. It is import to gain a quantitative statement on the variance of the estimation. To this aim. the influence of each parameter PJ on the model outputs y; is calculated and collected into the matrix of output sensitivity functions Yp , the elements of which are Yp(i,j) = ay/ apt. In this work, they were calculated numerically by increasing rsp. decreasing each parameter by 5% and calculating the central differences normalized by the nominal parameters. Now the influence of small parameter changes on the model output y at sampling times t; can approximately be expressed as a linarization.
· 3P p
(7)
decrease, probably due to proteolytic activity. Model simulation ascent more slowly and reaches a maximum at t = 5h, when production rate r p is less than 5% of its initial value. Afterwards model simulation decline as dilution through cell growth is faster than production. Soluble bFGF shows a pattern similar to total bFGF: it soon reaches its highest value, and starts to decline after seven hours of induction. Soluble bFGF is represented quite well by the simulation. Only at the onset of proteolysis, when carbondioxid evolution by the stressed cells stops (data not shown) and measured bFGF values decrease remarkably. model predicts too high amounts. In the first ten hours after induction, however, this simple model is useful to describe the
19
along the course of the valley. So it is not an intrinsic difference
The expectation of the cost criterion J(P+oP) is now given as
E(J(P + op») = opT. F(P)· oP + N . m
(8)
with m being the number of measurement variables. The Fisher information matrix is now calculated as N
F= LYJ(I;, P).C-' .Yp(I;, p)
-z I!. c
0
., .,
(9)
-- J J
= 1500 (m glgCDW f =4000 (m glgCDW)' parameter values
o
;=1
In (k ,,)
Fig. 3: Contour lines of the identification functional J
where Cl IS the inverse of the measurement error covariance matrix. For one measurement item it is C = 2 110 Here 0 = 0.1 is assumed . F is the inverse of the parameter estimation covariance matrix. A derivation is given by Munack and Posten (1989 ).
for tile cultivation f37 as a function of tile logarithms of the two parameters ki N and klA, and tile values obtained by parameter estimation for five cultivations. The reaction order n was fixed n = 2, and y = I + N was set as model output for soluble bFGF.
The cost criterion can have a minimum at the true P only if F(P) is positive defmite, i.e., no Eigenvalue is zero. The closer an Eigenvalue is to zero , the larger is the possible error in parameter estimation. For the linearization the contour lines of the cost criterion are ellipoids, where the length of a half axis is the inverse of the square root of the corresponding Eigenvalue. The direction of the axis gives the corresponding Eigenvector.
in between tile solid contour lines can be interpreted as representations of tile same physical system. Only measurement noise deternlines the exact position of estimated kl N,k lAin the valley. The identified parameters, however, are biologically meaningless. Nevertheless, most of the obtained paranleter pairs for tile different cultivations (squares in Fig. 5) are found along the course of the valley. So it is not an intrinsic difference of protein folding, which leads to the various parameter values, but tile sensitivity to small deviations and measurement noise, resulting in low estimation accuracy . To consider improvements, a quantitative measure of estimation accuracy is needed. This is provided by the Eigcnvalues of tile Fisher information matrix . The smallest one is indicative for the largest possible error. Consequentely, tile smallest Eigenvalue of F was close to zero .
Further infomlatioll is given by evaluation of these Eigenvectors ofF. Each corresponds to an Eigenvalue and indicates for which parameter combination it holds. If the Eigenvector corresponding to Amm is pointing to one single parameter, it cannot be identified properly, possibly because it refers to a very fast or very slow reaction (steady or frozen state, respectively). It may help to eliminate it from the model. I f the Eigenvector corresponding to Amm is pointing to a parameter combination, the parameters probably are linearly dependent. An analytical solution could give a hint how to combine the parameters to only one, which could be estimated better.
4. IMPROVEMENT OF SAMPLING 4.1 Increased Sampling Rate The largest possible error can be minimized by increasing the smallest Eigenvalue Amm (Munack, 1991) by improVed experiment design . Either input trajectories or sampling pattern can be optimized for better paranleter estimation accuracy. The latter case is shown in the following section. In case of persisting low estimation accuracy, parameters, as indicated by the Eigenvector of the smallest Eigenvalue, have to be eliminated from tile model or linearly dependent parameters have to be combined. The anlOunt of total bFGF is used to identify the production parameter, as is the soluble fraction to estimate ki N' n, and k lA . The accuracy of the latter estimate is lower and hence considered here. Since we employed only one measurement item and three parameters, the output sensitivity V r is a vector with three elements. The time courses of VI' for a representative cultivation are shown in Fig. 4A . After four hours they reach a steady state, as does bFGF, for
The cost criterion J for cultivation f 37 is plotted as a function of the logarithms of kIN and kl A and the contour lines for J = 1500 and 4000 (mg/gCDW)2 are shown in Fig. 3. The minimum is J = 630 (mg/gCDWf The contour lines form long stretched valleys, indicating a low estimation accuracy and linear dependence of the rate constants. All estimates klN,k lA in between the solid contour lines can be interpreted as representations of the same physical system. Only measurement noise determines the exact position of estimated k' N,k IA in the valley. The identified parameters, however, are biologically meaningless. Nevertheless, most of the obtained parameter pairs for the different cultivations (squares in Fig. 5) are found
20
the cease of recombinant protein production. While 8ylOk' N and Dy/uk'A alter monotonically, ay/on passes through an extremum at t", I h. The Fisher information matrix is obtained by summation at the sanlple times (each hour in 0 - 12h) according to eq.9. The components of the Eigenvectors arc shown in Fig. 4B. The components oftlle smallest Eigenvalue A = 0.0037 indicate a linear dependency of the rate constants; tlley exhibit components to direction both of k' A and kiN , while the estimation of n is hardly effected by estimation errors of the rate constants.
A)
>-
~2
tirre/h 0.0
C
2
4
6
8
10
12
·5
c
-0.2
3!
:;
%
-OA
0
o' 0
o.
10
02 00 02
+>
·Vi
B)0rt0tr=02 00
0.2
a.
hours does not contribute to estimation accuracy, so consideration was restricted to 12 h (Il). As the output sensitivity functions indicates, frequent sampling in the beginning increases A mill considerably (lIl -VI). The best effect is achieved directly after induction (Ill), when sample intervals of five minutes effect a 25 to 200 fold increase of A mU! compared to sampling pattem 1. After several hours, the improvement is only marginal; even 36 samples taken in the fifth to seventh hour after induction did hardly increase A min (VI). The benefit of expanding frequent measurements to two hours direct from the start is obvious (VII-IX), but double effort ceases to double effect when sampling intervals of five minutes are reached . Since by then A min has increased by onc rsp . two orders of magnitude, we suppose that it is sufficient to sample in intervals of five minutes during the first two hours after induction to enable accurate parameter estimation in models of the kinetic competition between folding and aggregation.
GY/Ck,N GY/m . GY/Ck,A
o. 0
o. 1
4.2 Experimental Results According to the parameter estimation accuracy analysis, the production of bFGF and its partitioning into soluble and insoluble fraction is followed by sampling intervals of five minutes during the fust two hours after induction. Fig. 5 shows the results. Biomass continues to increase after induction for approximately five hours with a specific growth rate of ~l = 0.05h-'. This is lower than the set rate of ~lset = 0.08 h-' , as building blocs and energy is directed to foreign protein synthesis. Afterwards growth stops due to stress caused by enlarged protein synthesis and elevated temperature and cells are diluted by the feeding_ Also bFGF production decreases after five hours and reached a steady state of 60 mg/gCDW after eight hours. Whereas bFGF is detectable in the total protein from the very beginning, it appears in the soluble fraction only after 20 min . Later it rises more moderately than the total bFGF and soon reaches its final value of 15 mg/gCDW .
-0.6
-0.8
Fig. 4: Time course of the output sensitivity functions V" (A); Eigenvalues A and components of the corresponding Eigenvectors of the Fisher information matrix (B); data of cultivation f 33 ). To increase estimation accuracy, the influence of sampling pattem on the smallest Eigenvalue was investigated. As 8y/un has a maximum directly after induction, frequent measurements here may have a favorable effect on parameter estimation accuracy. The results are compiled in Table 3.
Table 3 : Effect ofsampIing mode on the smallest Eigenvalue A nun of the Fisher information matrix, calculated for parameter sets of two example cultivations
70
f33 I
experimental sample times
0.004
hdf 41 0.001
0.004 0.646 0.009 0.007 0.005 0.152 0.495 0.969
0.003 0.116 0.006 0.003 0.004 0.049 0.092 0. 123
..
.
' 0
gDW
IJ
0-12 III 0- 12 IV 0-12 V 0-12 VI 0-12 VII 0-12 VIII 0-12 IX 0-12
h : ev. h. h: ev. h., 0-1 h: every 5 min h: ev. h., 1-2 h: every 5 min h : ev. h., 2-3 h: every 5 min h: ev. h., 4-6 h: every 5 min h: ev. h., 0-2 h: ev. 20 min h: ev. h., 0-2 h: ev. 10 min h: ev. h., 0-2 h: every 5 min
":'
......
oFGF 1nl. !;le,"" •
".
10
"
10 .- ~••• ",. .'
" Fig. 5 Cell dry weight. soluble and total bFGF after induction The simple model eq . 2-5 have been applied to this cultivation, both with the order of aggregation n fixed to n = 2 and as an estimable parameter. The best fit simulation curves are displayed in Fig. 6. The obtained parameter values and the sum of squared modelling errors J appear in Table 4. Both modeling approaches fit
Among the cultivations, actual sampling pattem (I ) differed as listed in table 3, so as a standard, hourly sampling was chosen. Prolonged sampling up to 18
21
5. CONCLUSIONS
the experimental data . With prefixed order n, the slope is steeper in the beginning and more moderate after several minutes, whereas the free order line starts more conservative but increases more stable. Inspection suggest a better fit of the second mode, as is also indicated by the lower J. -
In a simple model of protein folding, parameter estimation failed to gain accurate values, which would be necessary for meaningful physiological interpretation. Analysis can be carried out by means of the Fisher information matrix, which is revealed to be an appropriate tool even in the highly nonlinear system brought up here. Applied to parameter sets obtained for several cultivations. it has been shown that frequent measurements in the very beginning of the induction phase arc appropriate to improve estimation accuracy notably . Sanlpling in intervals of five minutes is recommended to make parameter identification for in vivo folding models possible.
fnen
--·· - - fixedn=2 • m •• sur. m e nl.
.
;f.-'" .
.,.y
,.' .
Fig. 6: The amount of soluble bFGF simulated according to a kinetic competition of first order folding and nUl order aggregation, where n is an estimable paranleter or is fixed to n = 2, is compared to the experimental data.
6. REFERENCES Goodwin , G.c. ( 1988 )ldentification Experiment Design, in: ,))'slems and Co/ltrol Encyclopedia , Vo\. 4.225 7-2264. PerganlOn, Oxford.
Table 4: Estimated parameter values and resulting sum of square deviation l a not estimated, but fixed in advance kI N 1.41 145 .8
kI A 1.67 304
n 2a 0.95
Kiefhaber,T. , R Rudolph, H·H. Kohler. l Buchner.( 1991 ). Protein Aggregation ill vitro and ill vivo : a Quantitative Model of the Kinetic Competition Between Folding and Aggregation, lJio/{ec/lI1ulogy 9, 825-829.
] 87.3 85 .8
Korz, D. l , U. Rinas. K. HcIlmuth, E.A. Sanders, W.D. Deckwer( 1995 ). Simple fed-batch techique for high-cell density cultivation of Escherichia coli. J Riolechnol., 39, 59-65.
4.3 Model Reduction In case of an estimated aggregation order of n~ I it is possible to achieve an analytical solution of the govering differential equations, showing the linear dependency ofthe two folding kinetic parameters. Let kl = kI N+ kI A+ ~l , eq. (3) and (4) can explicitcIy be solved leading for the intermediate to
rpo I . (e - /~ _ e - kl' )
J(I)=
Kramer. W, G . Elmecker, R Weik, K. Bayer (1996 )Anll N Y A cad ;';d 782, 323-333 Munack,A (1991) Optimization of Sampling, in: K. Schiigerl (cd), Measuring Modelling and Control, Hiot echnolofi)l Vo!. 4, 251-267. VCH, Weinheim , .
( 10)
le - I1
I T
I
and to total bFGF
G( () = ~ .
I' - ;Ir
(e-';; - e 1'')
(I 1)
Posten. C. ( 1993) "Basic Omcepts of Computer Modeling and Optinlization in Bioprocess application". in : T.K. Ghose (cd.) , Process Compulations in Biotechnology, Tata Mc Graw-Hill. New Delhi,.
respectively. For the soluble bFGF follows
eO"~
e-;;;
N(t) =I",k
N
·(
e-I'
(1'- ,~XIs -X) + (J1-I~XX -Is) + (I<, - ,uXX -,u)
J( 12)
Seeger. A. B. Schneppe, lE.G. McCarthy, W.-D. Deckwer, U. Rinas, (1995) Comparison of temperature- and isopropyl-[-I-D-thiogalactopyranosid-induced synthesis of basic fibroblast growth factor in high-cell density cultures of recombinant Escherichia m/i" , Enzym e Microh. Techno/. 17. ,947-953 .
If reaction of the intermediate takes place much faster than dilution by growth or the change of production rate, i.e. kl » ~l und k l » I h , we obtain by simplification: N (I) =
'1'0 ' k N kl .
.
(I' - ;/~)
.
(e-.r;; _ e-1" )
Hence the fraction of soluble bFGF is given by N(t) = k N _ kN I
G(l)
kl
k N +kA
(13 )
H. Utiyama,( 1986 ) RL. BaldwiJl. Kinetic Mechanisms of Protein Folding, Methods EllzYl1lol .. 131,51-70.
( 14)
l+k;{N
as a function of only one parameter kl i\/klN and found to be constant trough out the cultivation.
22