68
NUMERICAL COMPUTER METHODS
[4]
[4] U s e o f W e i g h t i n g F u n c t i o n s in D a t a F i t t i n g
By ENRICO DI CERA Introduction Data analysis plays a central role in the description of a number of physical, chemical, and biochemical phenomena. Our interpretation of the experimental facts, phenomenological or mechanistic as it may be, is very often based on a set of physicochemical parameters involved in the particular model being proposed or tested. Best-fit parameter estimates are derived from minimization of a suitable functional, or sum of squares, that can be written as I
¢P = ~ [yj- F(xj,{$})]2
(1)
j=l
Here n is the number of experimental points, yj is the jth experimental observation, while F is the deterministic model, or fitting function, used to interpolate the experimental measurements through the independent variable x and a set of s parameters {$} = ¢1, $2 . . . . . $,. When dealing with repeated experimental observations Eq. (1) is extended to the number of different experiments i = 1, 2 . . . . . m as follows:
cD = ~ ~' [yq- F(xij,{qj})]2
(2)
i=l j=l
where yq is thejth experimental observation in the ith experiment, while xij is the value of the independent variable x coupled to Y0" The significance of Eq. (2) and its simplest form given by Eq. (1) is independent of the particular form of the fitting function F and hinges solely on its computational aspects related to minimization of the difference between experimental data and theoretical predictions. In so doing we make no distinction among different experimental points and implicitly assume that they are perfectly equivalent in the minimization procedure and contribute equally to parameter estimation. This assumption is certainly appealing when considering the computational aspects of least-squares minimization. In many cases of practical interest experimental data points cannot be treated as equivalent. In some cases experimental points carry different t y. Bard, "Nonlinear Parameter Estimation." Academic Press, New York, 1974. METHODS IN ENZYMOLOGY, VOL. 210
Copyright © 1992 by Academic Press, Inc. All rights of reproduction in any form reserved.
[4]
WEIGHTING FUNCTIONS IN DATA FITTING
69
experimental error. In other cases minimization of Eq. (2) involves experimental determinations of different scales. Application of Eqs. (1) and (2) in these cases may lead to unsatisfactory estimates of the parameters. The solution of the problem is one and the same for all cases and hinges on the definition of a weighting factor, w, for each experimental data point. In so doing one takes explicitly into account the exact probability distribution of experimental measurements, and the relevant functionals to be minimized become
= ~ wj[y~- F(xj,{q,})] 2
(3)
j=l
in the case of single determinations, and
dp = ~ ~ wij[Yi,i -F(xij,{~b})] 2
(4)
i=1 j = l
for multiple experimental determinations. The weight wj of pointj in Eq. (3) expresses the relative contribution of that point to the sum of squares. When all weights w are equal in Eq. (3), then the weighting factor becomes a scaling factor that has no effect on the minimization procedure. In this case one deals with a functional such as Eq. (1) that implies uniform weighting of all experimental data points. A differential weighting is introduced as soon as the experimental data points have different weights. In the case of multiple determinations each data point is weighted according to its contribution within the data set to which it belongs and the relative contribution of that data set to the total sum of squares. It is clear from Eqs. (3) and (4) that the definition of weighting factors for each experimental data point is critical for correct data analysis. However, very little attention is often paid to aspects of data analysis that deal with the formulation of a correct weighting scheme for the data set under consideration. The issue of correct data weighting is commonly dismissed with generic assumptions on the distribution of experimental errors, and data analysis is usually concentrated on parameter resolution and tests of different models, interested as we are in the interpretation of the experimental facts. The definition of weighting factors in the functionals given in Eqs. (3) and (4) is conditio sine qua non for correct data analysis and parameter resolution in practical applications. In this chapter we shall discuss the importance of properly casting the fitting problem in terms of weighting functions. We shall also discuss the consequences of improper or incorrect weighting schemes in parameter resolution and show a specific application to the analysis of steady-state enzyme kinetics, which provides an example of general interest.
70
NUMERICAL COMPUTER METHODS
[4]
Weighting Factors and Difference between Errors and Residuals Under a set of assumptions the weighting factor of pointj can uniquely be defined in terms of its actual experimental error. In principle, if we knew the exact model to be used as fitting equation and the true values {to*} of the parameter set {to}, then the difference ej = yj - F(xj,{to*})
(5)
would actually give us the experimental error of pointj. In practice, even if we knew the exact fitting function we could only access the best-fit estimates of {tO}and not the true values {to*}. The difference pj = yj - F(xj,{to})
(6)
is equal to the residual of point j. The fundamental difference between errors and residuals is that the standard deviation of ej is independent of F, while that of pj is not. The only way we can practically obtain a value for the experimental error of point j is by carrying out repeated measurements Ylj, Yzj . . . . . Y,,i of yj at xj. Under the assumptions that m is large and the yj values are randomly scattered around the mean (yj) (Ylj + Yzj + • • • + Ym)/m, the standard deviation =
°'J= ( ~
[ Y 6 - (YJ)]z/m)
(7)
can be taken as the experimental error of pointj. The important definition of weighting factor as w~ = o ' ; 2
(8)
leads to least-variance estimates of the model parameters {to} in Eq. (3). The reader should be able to demonstrate that the same expression in Eq. (7) can be arrived at by computing the standard deviation of 8j according to Eq. (5). In fact, the value of F(xj,{to*}) cannot change in different determinations of yj, since {to*} represents the true parameter values. On the other hand, the best-fit value of F(xj,{to}) is a function of Yu, and therefore the standard deviation of Or is a function of the particular model we have chosen to fit the data. For this reason residuals are not experimental errors and must never be used to assess the distribution of weighting factors for uniform or differential weighting purposes. Experimental errors reflect the property of a distribution of experimental measurements and have nothing to do with the fitting function that is going to be used to interpolate the data. Residuals reflect the difference between experimental determinations and theoretical predictions of a par-
[4]
WEIGHTING FUNCTIONS IN DATA FITTING
71
ticular model that we have selected for our analyses. The weighting factor is defined as the inverse of the experimental error squared and likewise reflects a property of the experimental data alone. It can be proved mathematically that such a definition of weighting factors guarantees least-variance parameter estimates when F is a linear function of {0}. Although in the general case of nonlinear least-squares, where F is a nonlinear function of{0}, a similar proof does not exist, nevertheless there is reason to believe that Eq. (8) still provides an optimal weighting scheme. 1,2 In computing the standard deviation of the distribution of experimental measurements we have implicitly assumed that each experimental point is independent of any other. In general, such a restriction can be dropped and the weighting factor can be calculated as the element of the covariance matrix associated with the measurements. 1-3 Weighting Factors and Data Transformations Once experimental data have been collected and information on the distribution of experimental errors is available, the minimization procedure can be started using the functional
dO= ~ wj[yj- F(xj,{0})] 2 = ~ o-j-Z[yj- F(xj,{0})] 2 j=l
(9)
j=l
The nature of the fitting function F is critical for correct parameter estimation. It is often found more convenient to transform the stretch of original data points y into a new set of values y' to simplify the form of the fitting function. In many cases such a transformation is aimed at linearizing the fitting problem, that is, to yield a new F(y',{0'}) that is linear in the parameters {0'}. If the experimental error of Ys is %, what is the value of o-j' for yj ? It is clear that the distribution of the errors in y' values is not the same as that of the errors in data points y, and therefore if a given weighting scheme is applied in the case of Eq. (9), a different one should be used when minimizing:
= ~ wj[y:- F(xj,{0'})] ~= ~ (rs-Z[yj- F(x:,{0'})] 2 j=l
j=l
(10)
The calculation of wj from wi on the transformation yj --->yj' hinges on the derivation of the error o-j from the error o-i. It is important to stress that such a problem has not a general and exact solution and can only be 2 D. A. Ratkowski, "Nonlinear Regression Modeling." Dekker, New York, 1983. 3 M. E. Magar, "Data Analysis in Biochemistry and Biophysics." Academic Press, New York, 1972.
72
NUMERICAL COMPUTER METHODS
[4]
solved to a first approximation. ~If the errors are small and the transformation yj --~ yj is continuous and bounded and so are all its derivatives, then to a first approximation, the Jacobian transformation o-j 2 = (Oyj/Oyj)Ztr 2
(11)
along with the definition of weighting factor given in Eq. (8) allow calculation of the new set of weights. Equation (11) is the familiar formula for error propagation. 3 Data transformations are widely used in the analysis of many systems of biochemical interest. Typical examples are given in the following sections. Relaxation Kinetics
In the process A ~ B, the concentration of A, c, at time t is given by c = Co e x p ( - t/r)
(12)
where Co is the initial concentration of A and ¢ is the relaxation time for the irreversible conversion of A to B. If data are collected as c values as a function of time using spectroscopic methods, as in the case of stoppedflow kinetics, the exact form to be minimized is = ~
w j [ c j - co e x p ( - t j h ' ) ] 2
(13)
j=l
and the two parameters co and z need to be resolved by nonlinear leastsquares. If the independent variable t is assumed to be errorless and the accuracy of the spectrophometer is the same over the concentration range 0-c0 of A, then one can reasonably conclude that all w values are equal in Eq. (13) and therefore apply uniform weighting in the minimization procedure. Taking the logarithm of both sides of Eq. (12) yields In c = In c o - t/r
(14)
so that = ~ wj[ln cj - In Co + tj/'l'] 2
(15)
j=l
which is a linear function in the parameters In Co and 1/~-. In view of Eq. (11) it is clear that the distribution of the errors in In c is not the same as that of the errors in c, and therefore if uniform weighting is applied in the case of Eq. (13), differential weighting should be used in Eq. (15). Hence, in the transformation c ~ In c we should minimize
[4]
WEIGHTING FUNCTIONS IN DATA FITTING
qb = ~ wj.c][ln cj - In Co + tj/~']2
73 (16)
j=l
since wj = 1/trj 2 = 1/[(Oyj/Oyj)2tr 2] = wi/[(O In cJOcj) 2] = wjc 2. Whatever the value of wi, the value of wj will depend explicitly on cj. In the case of uniform weighting of the c measurements (wg = 1 for all j), the In c values should be weighted according to c 2. Therefore, cl = 10 and cn = 1 receive the same weight in Eq. (13), but In Cl has 100 times more weight than In cn in Eq. (16).
Steady-State Kinetics In the case of steady-state kinetics the velocity, v, of product formation is measured as a function of the substrate concentration, x. For a system obeying a simple Michaelis-Menten equation
v = erk¢atx/(Kn~ + x)
(17)
where eT is the total enzyme concentration, and the values of the catalytic constant kc~t and Michaelis-Menten constant Km are to be resolved from analysis of the experimental data. If the values o f x are again assumed to be errorless and the distribution of the errors on the v values is uniform, then
dp = ~
[vj - erkcatXj/(K m + xj)] 2
(18)
j=l
is the functional to be minimized. There are several possible linearizations of the Michaelis-Menten equation. The familiar double-reciprocal transformation of Lineweaver-Burk, namely,
1/v = 1/(eTkcat) + [Km/(eTkcat)](1/x)
(19)
is the most widely used procedure to analyze steady-state kinetics. It is important to recognize that this transformation is no longer accompanied by a uniform distribution of errors. The correct weighting factor of 1/vi is obtained from Eq. (1 l) and is equal to v] . The correct form to be minimized is therefore
dp = ~
j=l
v 4 { 1 / v j - I/(eTkeat) -- [Km/(eTkcat)](1/xj)} 2
(20)
Interestingly, the weighting factor for the v ---> 1/v transformation is similar to that for the c ~ In c transformation discussed in the case of relaxation kinetics. Consequently, a velocity measurement around Km that yields a value of v = eTkcat/2 is equivalent to a measurement taken under
74
NUMERICAL COMPUTER METHODS
[4]
saturating conditions, that is, v ~ e x k c a t when using the Michaelis-Menten equation, as opposed to the Lineweaver-Burk plot where the measurement around Km should be weighted 16 times less.
Ligand Binding Equilibria In the case of ligand binding studies the fractional saturation, Y, of the macromolecule is measured as a function of the ligand activity, x. Experimental data are fitted to the equation t
I
Y = ~ jAjxJ/t ~ Ajx j = F(x,{A}) j=0
(21)
j=0
where t is the number of sites, and the A's are the overall association constants to be resolved by nonlinear least-squares. If the distribution of experimental errors on Y is uniform, what is the correct weighting scheme for the transformation Y ~ In[ Y/(1 - Y)] that leads to the familiar Hill plot? Application of Eq. (11) yields w' = Y2(1 - y)2. Therefore, when fitting data in the form of the Hill plot, a point collected at half-saturation should be weighted about 8 times more than a point at 90% saturation, and 638 times more than a point at 99% saturation.
Global Data Fitting In a number of cases it is necessary to interpolate globally observations of different scale. For example, such a situation arises when kcat and K m values for an enzyme are collected over a wide pH range and need to be globally interpolated with a thermodynamic scheme to pull out the pK values of the ionizable groups involved in the control of binding and catalytic events, as recently shown in the case of thrombin. 4 In this case F(x,{to}) is the fitting function for kcat, G(x,{to}) is the fitting function for Kin, and the set of parameters {tO}shared in both functions is resolved by global data fitting. The form to be minimized is
a~ = ~ w,j[y U j=l
-
F(x~,{to})l 2 + ~ wzj[y2j - G(xj,{to})l 2
(22)
j=l
where the yl's and yz's refer to kcat and K m values, respectively. Typically, values are 5-8 orders of magnitude bigger than K m values, and so are their respective errors. In this case it would make no sense to minimize a functional such as Eq. (22) using the same weight for both kcat and K m values, because the information relative to the kcat values would overkca t
4 R. De Cristofaro and E. Di Cera, J. Mol. Biol. 216, 1077 (1990).
[4]
WEIGHTING FUNCTIONS IN DATA FITTING
75
whelm that relative to the K m values in the sum of squares. Use of correct weighting factors according to the actual experimental errors carried by the two sets of observations guarantees meaningful parameter estimation in the global data fitting. Consequences of Incorrect Data Weighting In science asking the right question is often more important than giving the right answer. Indeed, a key question arises quite naturally at this point as to the consequences of incorrect data weighting. Why should one worry about computing a correct weighting scheme for each experimental data set and any tranformations thereof? Why should the budding enzymologist be concerned with a correct weighting scheme for his/her beloved doublereciprocal transformation? With perfect data and the correct fitting function the difference yj - F ( x j , { ~ } ) in Eq. (3) is identically zero for all points, and the true values of the parameters are recovered in the minimization procedure independently of any weighting scheme. If the fitting function is incorrect there is very little need to worry about the quality of data or correct weighting schemes, since the best-fit parameter estimates are meaningless anyway. Therefore, the situation where one should worry about correct weighting arises when the fitting function is correct and the data carry a finite experimental error, which is exactly the most common situation encountered in practice. It is clear that there is no magic formula to answer the question of the consequences of incorrect weighting in the general case. As already pointed out, we do not even have a mathematical proof that an optimal weighting scheme should exist when dealing with nonlinear least-squares. Therefore, we shall address the problem by means of Monte Carlo simulations of an a d h o c chosen example of practical interest, namely, the analysis of steady-state kinetics. The case of ligand binding equilibria has been dealt with elsewhere. 5 Consider the set of steady-state measurements reported in Fig. I, and assume that nothing is known about the distribution of experimental errors. A list of the experimental data is given in Table I to allow the interested reader to reproduce the results. How should we proceed to analyze this data set? The simplest choice would be to assume that the substrate concentration is errorless and the experimental error is uniformly distributed among all velocity measurements. This implies that the exact form to be minimized needs to be cast in terms of the Michaelis-Menten (MM) equation [Eq. (17)] with uniform weighting. The best-fit values of the parameters are in this case kcat = 104.7 - 0.6 sec -1 and K m = 2.89 - 0.07 5 E. Di Cera, F. Andreasi Bassi, and S. J. Gill, Biophys. Chem. 34, 19 (1989).
76
NUMERICAL COMPUTER METHODS
[4]
100 80
>
60 40 20 0
I
I
I
I
-6
log
I
-4
-5
[S]
FIG. 1. Steady-state velocity measurements of human t~-thrombin amidase activity as a function of the logarithm of substrate concentration (data obtained by Dr. Raimondo De Cristofaro). Velocity values are expressed per unit thrombin concentration. Experimental conditions are as follows: 1.2 nM thrombin, 50 mM Tris-HCl, 0. l M NaCl, 0.1% polyethylene glycol (PEG) 6000, pH 8.0, 25°C. The substrate used for these determinations is the synthetic tripeptide S-2238.
/zM. The standard error of the fit, or, is given by [d~/(n s)] 1/2, where is given in Eq. (3), n is the number of data points, and s is the number of independent parameters. The value of or is 0.9 sec-1 for the data reported in Fig. I, which shows that the data are very accurate and have less than 1% error. To assess the consequences of incorrect weighting the same data can then be converted to a double-reciprocal plot according to the Lineweaver-Burk (LB) transformation [Eq. (19)] and fitted using uniform weighting. The best-fit values of the parameters are in this case kcat = 103.0 - 2.7 sec -1 and K m -- 2.76 - 0.11/~M. Under the assumption that the error is uniformly distributed among the velocity values there is a definite, although not significant, difference between the values of Km and kcatobtained with the MM and LB equations. When the data are correctly weighted in the LB transformation according to Eq. (20) we obtain kcat = 104.7 - 0.6 sec -1 and Km = 2.88 - 0.07/zM. These values are practically identical to those obtained using the MM equation with uniform weighting. Another linearization of Eq. (18) is given by the Eadie-Hofstee (EH) transformation -
O =
e T k c a t -- K m o / X
(23)
A plot of u versus o/x yields a straight line with slope Km and intercept
[4]
WEIGHTING FUNCTIONS IN DATA FITTING
77
TABLE I EXPERIMENTAL VALUES OF STEADY-STATE VELOCITY MEASUREMENTS OF HUMAN ot-THROMBIN AMIDASE ACTIVITYa x (t~M)
v (sec -I)
98.000 49.000 24.500 12.250 6.125 3.062 1.531 0.766
101.67 98.90 93.45 85.40 70.06 54.90 35.16 22.62
Under the experimental conditions given in the legend to Fig. 1. Data are averages of duplicate determinations. Velocity values have been normalized for the enzyme concentration.
kcateT. This transformation is the kinetic parallel of the Scatchard plot in ligand binding equilibria and introduces an error in the independent variable. The EH transformation is extremely interesting for the following reason. If we assume that the error in the EH plot is uniformly distributed, then the functional ~P = ~ [Vj -- (eTkca t -- Kmoj/xj)] 2 j=l
(24)
is the correct form to resolve the independent parameters. This form can be rearranged to yield
= ~ [(Km + xj)/xj]Z[vj - exkeatXj/(Km + xi)] 2 j=l
(25)
and is therefore identical to fitting data with an error proportional to the velocity in the MM plot. Therefore, if the error is not uniform and is proportional to v, then one should either use differential weighting in the MM plot or uniform weighting in the EH Plot. Transformation of the data in the EH plot has in this case a variance-equalizing effect and corrects for the nonuniform error distribution: When the data in Fig. 1 are analyzed according to Eq. (23) with uniform weighting the best-fit values of the parameters are kcat = 104.3 -+ 1.2 sec-l 6 j. A. Zivin and D. R. Waud, Life Sei. 30, 1407 (1982).
78
NUMERICAL COMPUTER METHODS
[4]
and K m = 2.85 -+ 0.08/zM. Again, very little difference is observed with respect to the MM equation. Therefore, experimental data taken in the substrate concentration range from Kin~4 to 32Kin and with 1% error on the velocity values show no significant difference in the best-fit values of kca t and Km obtained from the MM, LB, and EH plots. Because with perfect data we would see no difference at all in the parameter values obtained in the three plots, one may naturally wonder what experimental error or substrate concentration range would reveal a significant difference with incorrect weighting schemes. A Monte Carlo study has been conducted as follows. Steady-state velocity measurements, normalized by the enzyme concentration eT, were simulated according to the equation O = kcatx/(Km + x) +
e [ - 2 In(RND0] j/2 cos(2¢rRND2)
(26)
where RND~ and RND2 are two random numbers in the range 0-1. A pseudorandom error distribution7N(O,e) with 0 mean and a standard deviation equal to e was assumed for all velocity measurements, independent of x, to simulate a uniform experimental error. For each data set eight points were generated in the substrate concentration range Kin~4 to 32Km using a value of xj as follows
xj = 32Km/2j-~
(j = 1, 2 . . . . .
8)
(27)
to yield points equally spaced in the logarithmic scale, as in the case of the data shown in Fig. 1. The simulated values of kcat and K m w e r e both equal to 1. This choice was made solely for the sake of simplicity and without loss of generality. Ten values of e were used in the simulations, from 0.01 to 0.1, to yield an average standard error of the fit from 1% to 10%. For each value of e, 10,000 data sets were generated according to Eq. (26). Each set was analyzed according to the MM, EH, and LB equations using uniform weighting in all cases. Simulation and fitting of 10,000 data sets took on the average less than 70 sec on our Compaq 486/ 320 computer. Owing to the (known) error distribution, this weighting scheme is correct for the MM transformation and incorrect for the EH and LB equations. The mean, m, and standard deviation, o-, of the distribution of kca t and Km values were computed in the three cases, and the results of the simulation study are shown in Fig. 2. The actual values of the parameter bias and percent error are listed in Table II. The simulation study confirms the result observed in the analysis of real experimental data with 1% error (0.01 in Fig. 2 and Table II), namely, that very little difference exists 7 G. E. P. Box and M. E. Muller, Ann. Math. Stat. 29, 610 (1958).
[4]
WEIGHTING FUNCTIONS IN DATA FITTING 1.5
79 A
/x 1.3 ~J rO LJ
Z~
........-..---.---.-'-.-'-'.',,'. .---.-----.,......._.. ~,¢ ~
1.0
v
v
O.B
0.5
' .00
J
.02
i
•
•
.04
•
i
.06
.OB
.10
erroP 2.5
A
2.0
A E X/
1.5
A A
A .....................
~1=:1" lrl ~ ~~.---~t"-'~g"'.-. w,-..s,~...~ ~ ............. w
1.0
0.5' • O0
~~
tu
O
rn
~v
13
-
. . . . . . . . . . . . . . . • 02
. 04
. 06
. OB
. 10
error
FIG. 2. Results of the Monte Carlo study discussed in the text for 1O,000 simulated data sets of 8 points each carrying a uniform error. The mean values of kca t (top) and Km (bottom) were obtained with the MM (©), EH (C1), or LB (A) equation. The one standard deviation (68%) confidence interval for the parameter values obtained by means of the MM equation is depicted by a dashed line.
a m o n g t h e v a l u e s o f Km a n d kcat o b t a i n e d w i t h t h e M M , E H , a n d L B equations. However, as soon as the error exceeds the 4% level, then a significant d i f f e r e n c e is o b s e r v e d w i t h t h e L B t r a n s f o r m a t i o n . T h e bias o n t h e Km a n d kcat lies w e l l o u t s i d e t h e o n e s t a n d a r d d e v i a t i o n c o n f i d e n c e interval of the values determined with the MM equation, whereas the percent error shows that both parameters are unresolved. Likewise, a s m a l l b u t d e f i n i t e b i a s is o b s e r v e d w i t h t h e E H t r a n s f o r m a t i o n , a l t h o u g h
80
NUMERICAL COMPUTER METHODS
[4]
T A B L E II RESULTS OF THE MONTE CARLO STUDY DISCUSSED IN THE TEXT FOR 10,000 SIMULATED DATA SETS OF 8 OR 6 (REPORTED IN PARENTHESES) POINTS EACH CARRYING A UNIFORM ERRORa MM e
%B
EH %E
Km(simulated value = 1) 0.01 0 (0) 3 (4) 0.02 0 (0) 6 (8) 0.03 0 (1) 9 (12) 0.04 0 (1) 12 (16) 0.05 1 (2) 15 (20) 0.06 2 (3) 18 (24) 0.07 2 (3) 21 (28) 0.08 2 (5) 24(32) 0.09 3 (5) 27 (36) 0.10 4 (6) 30 (38)
%B
-0 (-0) -1 (-1) - 1 (-2) -3 (-4) -4 (-7) -6 (-10) - 9 ( - 15) - 13 ( - 1 9 ) - 15 ( - 2 3 ) - 18 ( - 2 8 )
LB %E
4 8 13 17 20 24 27 31 35 39
%B
%E
(5) (11) (16) (21) (26) (30) (36) (41) (47) (52)
1 (1) 2 (4) 6 (12) 14 (15) 23 (34) 30 (*) 52 (*) 80 (*) * (*) * (*)
8 17 29 73 * * * * * *
(10) (23) (*) (*) (*) (*) (*) (*) (*) (*)
1 (2)
0 (0)
3 (4)
2 (3) 3 (5) 4 (6) 4 (8) 6 (10) 6 (12) 7 (13) 8(15) 9 (16)
1 (1) 2 (5) 4 (5) 6(14) 9 (28) 13 (43) 16 (64) 29(97) 44 (*)
6 (9) 10(71) 25 (*) * (*) * (*) * (*) * (*) * (*) * (*)
kcat (simulated value = 1)
0.01
0(0)
1 (2)
-0
0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10
0(0) 0(0) 0 (0) 0(0) 0 (1) 0 (1) 0 (1) 0(1) 0 (1)
1 (2) 2 (4) 3 (5) 3 (6) 4 (7) 5 (9) 6 (10) 6(11) 7 (12)
-0 (-1) -1 (-1) -1 (-2) -2 (-4) -3 (-5) -4 (-7) -5 (-9) -6(-11) - 7 ( - 13)
(-0)
a The mean, m, and standard deviation, o-, of each parameter obtained by using the MM, EH, or LB transformation are used to calculate the percent bias = 100(m - ~0')/~* (%B) and percent error = lOOtr/m (%E). Values of %B and % E exceeding 100 are indicated by an asterisk (*).
the mean values of K m and kcat are within the confidence interval of the MM determinations. A similar simulation study has been carried out using only 6 velocity values per data set in the substrate concentration range Kin~4to 8Km, in order to investigate the effect of a narrower substrate range on parameter resolution with incorrect weighting schemes. The results are shown in Fig. 3, and the relevant statistics are summarized in Table II (values given in parentheses). As expected, the bias and standard deviation of the distribution of each parameter increase, but qualitatively the results resemble those shown in Fig. 2. The LB transformation yields meaningless results
WEIGHTINGFUNCTIONSIN DATAFITTING
[4] t.5
81
/x &
1.'3
& ---
¢0 u
1.0
.....
"-'"
~ . . . ~ t . . . ~ . . . ,.,
0.8
0.,.5 ,00
'
'
•
•
.02
.04
.06
I
I
I
.08
I
,
.10
e r r o r
2.5
A
2.0
E
t.5
.A....~. .............. • -vi
~--~L~.
w ...........
0.5
•
• 00
•
.02
•
~
/rl
~
-
--
t~l
{2]
I
I
'"
........................
,,t
.04
I
.06
I
.08
|
I
. 10
e r r o r
Fro. 3. Results of the Monte Carlo study discussed in the text for 10,000simulated data sets of 6 points each carrying a uniform error. The mean values of kcat(top) and Km(bottom) were obtained by using the MM (O), EH (rq), or LB (A) equation• The one standard deviation (68%) confidence interval for the parameter values obtained by means of the MM equation is depicted by a dashed line.
as soon as the error exceeds 3%. In both cases the MM equation yields lowest-variance and lowest-bias parameter estimates and performs better than the E H transformation. Therefore, when using incorrect weighting schemes we do observe difficulties in parameter resolution, as shown by large bias and percent error, but this effect strongly depends on the experimental error. In general, however, the L B transformation provides the least reliable parameter estimates for errors less than 3% and yields
82
NUMERICAL COMPUTER METHODS
[4]
meaningless results for bigger errors. This conclusion concurs with the results of previous Monte Carlo studies. 8-1° In many circumstances of practical interest one is not aware of the exact distribution of experimental errors, and therefore a meaningful Monte Carlo study of steady-state kinetics should also consider the consequences of incorrect weighting when the error is not uniformly distributed. A necessary condition for this case to be directly compared to the one already analyzed is that the standard deviation of the fit must be comparable to that observed in the simulation of velocity measurements with uniform error. Simulated values that carry the same average noise, independent of the error distribution, perfectly mimic the real situation where we have a finite noise but we know nothing about the error distribution. Another Monte Carlo study has thus been carried out as follows. Steadystate velocity measurements, normalized by the enzyme concentration eT, were simulated according to the equation V = kcatx/(K m + x)
+
e l - 2 In(RNDI)] 1/2 cos(27rRND2)kcatx/(K m + x) (28)
that is, with a pseudorandom error distributed as N(O,ev) and proportional to the velocity value. This error simulates a nonuniform distribution that necessarily demands differential weighting in the minimization procedure. For each data set eight data points were generated in the substrate concentration range Kin~4 to 32Km, as in the case of uniform error distribution, with K m and kca t both equal to 1. Likewise, 10 values of e were used in the simulations from 0.02 to 0.2. These values yielded the same standard deviation of the fit as e values in the range 0.01-0.1 in Eq. (26). For each value of e, 10,000 data sets were generated according to Eq. (28), and each set was analyzed according to the MM, EH, and LB equations using uniform weighting in all cases. In this case, the weighting scheme is correct for the EH transformation and incorrect for the MM and LB equations. The results of the simulation study are shown in Fig. 4 and Table III. When velocity measurements in the substrate range Km/4to 32Km carry an experimental error proportional to v, then there is no significant difference in the parameter values obtained with the MM, EH, and LB equations using uniform weighting. This drastic difference versus what was observed in the case of uniform error has not been pointed out before. The same conclusion is reached for velocity measurements in the substrate range Kin~4 to 8Km, as shown in Fig. 5 and Table III (values given in parentheses). The parameter bias is approximately the same for all transformations, s j. E. Dowd and D. S. Riggs, J. Biol. Chem. 240, 863 (1965). 9 G. L. Atkins and I. A. Nimmo, Biochem. J. 149, 775 (1975). 10 j. G. W. Raaijmakers, Biometrics 43, 793 (1987).
[4]
83
WEIGHTING FUNCTIONS IN DATA FITTING
t.5
t.3 .4-1 U
1.0
. . . . . .=.
w
~
~
El
0
0.8
|
0.5 .00
*
.02
.04
.06
.08
.10
ePPOP 2.5
2.0
E
1.5
...._.............,---,--,----,----,---, 1.0
0.5 .00
-*.
. . . . . .
.02
.04
.06
.Off
.~0
e l " P 0 r-' FIG. 4. Results of the Monte Carlo study discussed in the text for 10,000 simulated data sets of 8 points each carrying an error proportional to the velocity value. The mean values of kcat(top) and Km (bottom) were obtained by using the MM (©), EH (D), or LB (A) equation. The one standard deviation (68%) confidence interval for the parameter values obtained by means of the EH equation is depicted by a dashed line.
whereas the percent error indicates that the EH equation performs best, as implied by the nature of the distribution of errors. Conclusions Correct data weighting guarantees least-variance estimates of the parameter values obtained in least-squares minimization. Weighting functions can be derived from the distribution of experimental errors, and data
84
NUMERICAL COMPUTER METHODS
[4]
TABLE III RESULTS OF THE MONTE CARLO STUDY DISCUSSED IN THE TEXT FOR 10,000 SIMULATED DATA SETS OF 8 OR 6 (REPORTED IN PARENTHESES) POINTS EACH CARRYING AN ERROR PROPORTIONAL TO THE VELOCITY VALUEa MM e/2
%B
EH %E
K m (simulated value = 1) 0.01 0 (0) 4 (5) 0.02 0 (1) 7 (9) 0,03 l (1) 11 (14) 0.04 1 (2) 15 (19) 0.05 2 (3) 19 (24) 0,06 3 (5) 23 (29) 0.07 4 (7) 27 (36) 0.08 5 (9) 32 (42) 0.09 7 (12) 37 (51) 0.10 10 (14) 43 (61) keat (simulated value = 1) 0.01 0 (0) 1 (2)
2 (4)
LB
%B
%E
%B
%E
-0 (-0) -0 (-1) - 1 ( - 1) -2 (-2) -2 (-3) -3 (-5) -4 (-7) -7 (-9) -8 (-11) - 9 ( - 13)
3 (3) 5 (7) 8 (10) 10 (13) 13 (17) 16 (20) 18 (24) 21 (28) 24 (32) 27 (35)
0 (0) 0 (0) 0 (1) 1 (1) 1 (3) 2 (5) 3 (7) 4 (9) 6 (15) 11 (22)
4 (4) 7 (9) l l (14) 15 (19) 19 (26) 24 (32) 28 (61) 60 (93) 92 (*) * (*)
(-0)
1 (2)
0 (0)
1 (2)
-0 (-0)
-0
2 (3)
0 (0)
3 (5)
o.02
0 (0)
0.03 0.04 0.05 0.06 0.07 0.08 0.09
0 0 0 0 1 1 1
(0) (1) (1) (1) (2) (2) (3)
4 (6) 5 (8) 6 (10) 7 (12) 9 (14) 10 (16) l l (19)
-0 -1 -1 -1 -2 -2 -3
(-1) (-1) (-2) (-2) (-3) (-4) (-5)
3 (5) 4 (7) 6 (8) 7 (10) 8 (12) 9 (13) 10 (15)
0.10
1 (4)
12 (21)
-4
(-6)
11 (16)
0 -0 -0 1 1 1 2
(0) (0) (0) (1) (2) (2) (4)
5 (7) 6 (9) 8 (13) 10 (16) 12 (27) 24 (37) 30 (58)
4 (6)
56 (89)
The mean, m, and standard deviation, tr, of each parameter obtained by using the MM, EH, or LB transformation are used to calculate the percent bias = 100(m - 0*)/qJ* (%B) and percent error = lO0o-/m (%E). Values of %B and %E exceeding 100 are indicated by an asterisk (*).
should be analyzed in the form they are taken experimentally, using a correct weighting scheme. Data transformation leads to changes in the error distribution and consequently demands correct reformulation of the weighting scheme according to Eq. (l 1). The reader should not be encouraged, however, to apply systematically such a procedure in data fitting because Eq. (11) gives only an approximate estimate of the new weighting factor as a function of the old one and strongly depends on the analytical form of the transformation. In general, one should not overlook the fact that different weighting schemes have no influence on parameter estimation when the fitting func-
[4]
WEIGHTING FUNCTIONS IN DATA FITTING
85
t.5
t.3
I0 (J
1.0
0.8
|
0.5
.00
i
• 02
|
i
.04
|
.06
i
|
.08
|
. 10
error 2.5
2.0
E
3£
1.5
t.0
•~----~z.
w
~n
m
D
E1
,
, .04
I
, .06
,
Q .....
0.5
, • 00
, .02
.08
13
G
. .....
.
. 10
error FIG. 5, Results of the Monte Carlo study discussed in the text for 10,000 simulated data sets of 6 points each carrying an error proportional to the velocity value. The mean values ofkca t (top) and Km(bottom) were obtained by using the MM (©), EH (lq), or LB (A) equation. The one standard deviation (68%) confidence interval for the parameter values obtained by means of the EH equation is depicted by a dashed line.
tion is c o r r e c t a n d d a t a points are errorless, and therefore one should e x p e c t v e r y little influence o f different weighting s c h e m e s on p a r a m e t e r d e t e r m i n a t i o n s w h e n the fitting f u n c t i o n is c o r r e c t and experimental data are v e r y a c c u r a t e . W h e n d a t a are not a c c u r a t e and the distribution o f e x p e r i m e n t a l e r r o r s is u n k n o w n , t h e n it is difficult to predict the effect o f i n c o r r e c t weighting s c h e m e s in the general case. E a c h case should be t h o r o u g h l y d i s s e c t e d with the help o f M o n t e Carlo simulations in o r d e r
86
NUMERICAL COMPUTER METHODS
[4]
to assess specifically the influence of weighting functions on parameter determinations for a particular model, under particular experimental conditions. Also, a direct experimental measure of the error distribution is strongly recommended. In some cases, the use of nonparametric procedures of parameter estimation might be useful. 1H2 The specific case of the analysis of steady-state kinetics dealt with in this chapter has revealed the consequences of incorrect weighting schemes as a function of experimental error. Of particular practical interest is the fact that the Lineweaver-Burk transformation always yields the least reliable parameter estimates, regardless of the nature and extent of experimental error. The budding enzymologist should thus be encouraged to abandon such a procedure in practical applications. The Michaelis-Menten equation performs best with uniformly distributed errors and so does the Eadie-Hofstee transformation when errors are proportional to the velocity value. However, when errors are proportional to the velocity value the Michaelis-Menten equation performs better than the Eadie-Hofstee transformation in the case of uniformly distributed errors. This suggests that the Michaelis-Menten equation should be preferred in the analysis of steady-state kinetics whenever the exact distribution of experimental errors is unknown. Although when dealing with errors of 1% or less the values of K m and kca t a r e not significantly affected by the particular transformation, nevertheless the reader should be very cautious in generalizing such a result. The analytical form of the Michaelis-Menten equation is particularly simple and does not comply with intrinsic difficulties in parameter resolution. When the fitting function contains parameters that are difficult to resolve, as in the case of ligand binding equilibria of multimeric proteins such as hemoglobin, then even an experimental error as small as 0.3% may lead to significantly different parameter estimates when using different weighting schemes. 5 A final remark should be made as to the operational aspects of weighting functions. Throughout this chapter we have linked the concept of weighting factor to the experimental error associated with a given experimental determination. Looking back to Figs. 2-5 one sees that when the fitting problem is correctly cast, then it makes very little difference to determine K mand kcatin the substrate range Kin~4to 32Kin or Km/4to 8Km. This means that we could drop two points at high enzyme saturation from the data set in Fig. 1 without affecting the results, and we in fact obtain in this case kcat = 104.7 --- 1.2 sec-1 and K m = 2.89 --- 0.11 /xM. Dropping experimental data is mathematically equivalent to weighting those data 1I A. Cornish-Bowden and R. Eisenthal, Biochem. J. 139, 721 (1974). 12 E. Di Cera and S. J. Gill, Biophys. Chem. 29, 351 (1988).
[5]
ANALYSIS OF RESIDUALS
87
points zero, which implies that those points carry an infinite experimental error. Because we know that this is not the case, then it is clear that the two points we dropped in Fig. 1 have very little, if any, influence on the determination of Km and kcat. This is because the fitting function is not sufficiently sensitive to experimental points collected for substrate concentrations over 8K m. Indeed, one can assign an informational content to any experimental data point by consideration of the information stored in the fitting equation. The informational content depends solely on the particular form of the fitting function and is independent of experimental errors. Consequently, it must not be used to assess correct weighting factors that are set solely by experimental errors, as we have seen. Rather, the informational content of a data point should be seen as an operationally useful concept characterizing the particular fitting problem under consideration that may indicate the range of the independent variable to be explored experimentally in the resolution of model parameters. Acknowledgments I am grateful to Dr. Michael L. Doyle for assistance in preparation of the figures.
[5] A n a l y s i s o f R e s i d u a l s : C r i t e r i a for D e t e r m i n i n g Goodness-of-Fit
By
MARTIN STRAUME and
MICHAEL L. JOHNSON
Introduction Parameter-estimation procedures provide quantitation of experimental data in terms of model parameters characteristic of some mathematical description of the relationship between an observable (the dependent variable) and experimental variables [the independent variable(s)]. Processes such as least-squares minimization procedures 1,2 will produce the maximum likelihood model parameter values based on minimization of the sum of squared residuals (i.e., the sum of the squares of the differences between the observed values and the corresponding theoretical values calculated by the model employed to analyze the data). There are assumptions regarding the properties of experimental uncertainty distributions contained in i M. L. Johnson and L. M. Faunt, this volume [1]. 2 M. L. Johnson and S. G. Frasier, this series, Vol. 117, p. 301.
METHODS IN ENZYMOLOGY, VOL. 210
Copyright © 1992by AcademicPress, Inc. All rights of reproduction in any form reserved.