MICROCHEMICAL JOURNAL 40, 216-225 (1989)
Some Observations on Fitting a Straight Line to Data
A. G. ASUERO¹ AND A. G. GONZALEZ

Department of Analytical Chemistry, Faculty of Pharmacy, The University of Seville, 41012 Seville, Spain

¹ To whom correspondence should be addressed.
Received March 8, 1989; accepted March 15, 1989

Analytical chemists use prediction equations, e.g., straight-line relationships, to predict future values of experimental responses (calibration domain), to compare analytical procedures, to represent expected values of experimental responses, and to estimate parameters (to evaluate constants in a given mathematical model). The present paper is devoted to the study of the basic equations of weighted linear regression (straight-line case), with emphasis on the weighting factors as well as on the normalization (scaling) of the weights. © 1989 Academic Press, Inc.
Analytical chemists are often interested in the fitting of mathematical equations to experimental data. Common situations that may be described by functional relationships include calibration curves (1, 2) relating measured values of response to a property of a material, comparison of analytical procedures (3-5), and relationships in which time is the x-variable (6). In this respect Deming (7) emphasized that some researchers still "fit the data to a model," which suggests a lack of scientific integrity. What is meant, of course, is that they fit a model to the data (8). Models that take into account the possibility of uncertainty are called probabilistic or stochastic models, whereas those that do not allow for uncertainty are known as deterministic models. The mechanistic model is a type of probabilistic model that is based on some known or presumed transformation between factors and responses. In this case there exists an exact mathematical formula (y as a function of x) relating the two variables, and the only reason that the observations do not fit this equation is because of disturbances or errors of measurement in the observed value of one or both variables. Parameters of the approximating function are frequently derived using least-squares methodology. Analytical chemists use prediction equations to predict future values of experimental responses (derivation of calibration curves (1, 2)), to represent expected values of experimental responses (9) (response surfaces), or to evaluate unknown parameters, e.g., constants in a given mathematical model (10, 11). This paper is devoted to the study of the basic equations of weighted linear regression (straight-line case), with emphasis on the topics of weighting factors and the normalization of the "weights."

THEORY
Suppose we choose y to be a linear function of x. The relationship between each observation pair (x_i, y_i) can then be represented as
y_i = \alpha_0 + \alpha_1 x_i + \epsilon_i.    (1)
The signal y_i is composed of a deterministic component predicted by the linear model and a random component, ε_i (12). One must find the estimates a_0 and a_1 of the true values α_0 and α_1. It should be stressed that the underlying assumption is that x is free from error (being assigned); that is, the precision of the measurement (13, 14) of the x values is much better than the precision of the measurement of the y values. In addition, the greater error in y is attributable to the fact that any process involves unobservable variables that are not included in the equation of regression; i.e., the dependent variable y is afflicted by random errors of measurement. Although there are ways of allowing for possible random error in the values of the independent variable (15, 16), these methods are not described here. When x, as well as y, is subject to error, the regression situation is considerably more complicated, even in the simple one-x case. The component ε_i represents the difference between the observed y_i values and the y_i values predicted by the model; the ε_i values are called residual deviations or error disturbances (17). The residual values are often referred to as "error" values. This is somewhat presumptuous (7) because it is assumed that this model is the correct model, that all values predicted by this model are true values, and that any measured values of response not in agreement with this model are therefore in error. It is important to realize that discrepancies between what is observed and what is predicted may exist (a) because the model may not be correct or (b) because of imprecision in the measurement process. The term "residual" is preferred and refers to uncertainties, not necessarily errors (7). The estimation of a_0 and a_1 of the true values α_0 and α_1 is performed by the conventional least-squares method by calculating the values of a_0 and a_1 for which

Q = \sum \epsilon_i^2    (2)
is minimal (18). An assumption that the errors ε are normally distributed is not required to obtain the estimates a_0 and a_1 (13, 19), but it is required later to introduce tests such as t- or F-tests, which depend on the assumption of normality, or to obtain confidence intervals based on the t- and F-distributions. When measurements are obtained over a wide range of a variable x, the assumption that the absolute error of measurement is essentially constant over the range of experimental data is not valid. It sometimes happens that some of the observations used in a regression analysis are more reliable than others. Thus, the so-called "homoscedastic," or uniform variance, condition is not fulfilled and the direct application of conventional least-squares techniques can produce gross errors. Random error (20) is caused by noise, and noise sources may be a function of the signal, the concentration, or other factors (21, 22); it is reasonable to assume that many analytical data will not show constant variance and that variance is not a simple function of the dependent variable (22, 23).
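Heteroscedasticity of this kind is easy to check when replicate measurements are available at each standard. The following is a minimal Python sketch with simulated data; the intercept, slope, and 3% relative noise level are assumed purely for illustration and are not taken from this paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated calibration data: 5 replicates at each standard concentration,
# with noise whose standard deviation is proportional to the signal
# (constant relative error, not constant absolute error).
x = np.array([1.0, 2.0, 5.0, 10.0, 20.0])
true_a0, true_a1 = 0.05, 0.40          # hypothetical intercept and slope
replicates = np.array([true_a0 + true_a1 * xi +
                       rng.normal(0.0, 0.03 * (true_a0 + true_a1 * xi), 5)
                       for xi in x])

# The replicate standard deviations grow with the signal: the uniform-variance
# ("homoscedastic") assumption behind Eq. (2) is not fulfilled here.
for xi, yi in zip(x, replicates):
    print(f"x = {xi:5.1f}   s(y) = {yi.std(ddof=1):.4f}")
```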
The nonuniform variance problem can be solved in two ways (12): (1) by performing a transformation of the variables in such a way that homoscedasticity is obtained, and (2) by using weighting factors. In these cases, the least-squares procedure minimizes the weighted sum of squares of the deviations (24), e_i, or residuals,

Q = \sum w_i e_i^2 = \sum w_i (y_i - \hat{y}_i)^2 = \sum w_i [y_i - (a_0 + a_1 x_i)]^2, \quad (i = 1, \ldots, k).    (3)
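For concreteness, the quantity being minimized in Eq. (3) can be written as a short function. A minimal sketch in Python (NumPy arrays assumed); the closed-form minimizers are derived below.

```python
import numpy as np

def weighted_Q(a0, a1, x, y, w):
    """Weighted sum of squared residuals, Eq. (3)."""
    residuals = y - (a0 + a1 * x)      # e_i = y_i - (a0 + a1 x_i)
    return np.sum(w * residuals ** 2)
```

The weighted least-squares estimates a_0 and a_1 are the pair that minimizes this quantity.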
A Brief Review of Weighting

The w_i are called weighting factors, as they are in one-to-one correspondence with the data points (x_i, y_i) themselves. The more accurately a data point is known, the larger the value of the associated w_i should be. Therefore, the fitted curve should pass close to the more accurately known points (3, 25), and this is expressed by the inclusion of the weighting factors in Eq. (3). Although the general concept of weighting values is mentioned in several of the more complete texts on statistics (25-27), a detailed procedure is not given. For this reason brief notes on this topic are given below; the schemes are also collected in a short code sketch after scheme (f). The weighting factors can be given in a number of different ways, depending on the characteristics of the data set:

(a) Absolute weights. In the absence of more complete information it is commonly assumed that equal weighting of all the points is satisfactory. If w_i = 1 for all i, then Eq. (3) reduces to Eq. (2). These are called absolute weights (28). The weighting factor is frequently ignored by assuming unit weighting.

(b) Statistical weights. Another option is to use relative weighting, where

w_i = 1/y_i.    (4)
Here, we are saying that the points with smaller values of y_i are known relatively accurately (28), and the points with larger values of y_i are less well known. These are called statistical weights (29).

(c) Assumption of constant percentage error. A method used for minimizing relative deviations rather than absolute deviations gives equal experimental weight to all measurements regardless of the range in which the measurements are made. Thus, the deviations of experimental measurements in the high x-range no longer overshadow or swamp (30, 31) those in the low x-range. With this option, we minimize

Q_r = \sum \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2 = \sum \left( 1 - \frac{a_0 + a_1 x_i}{y_i} \right)^2.    (5)

Actually, in many typical chemical applications, the experimental conditions are controlled so that the percentage error is constant. Nonconstant variance occurs when the variance of y_i depends on x_i; a peculiar case of heteroscedasticity,
important in analytical chemistry (12), is that for many analytical methods relative standard deviations are reasonably constant over a considerable dynamic range.

(d) Instrumental weights. The primal conception of a weight is that of a repeated observation (24). A fourth, and often used, weighting scheme is appropriate where the uncertainties in the y_i values can be characterized by real standard deviations. In fact, one approach to data acquisition (32) is to collect a relatively small number of data and to make replicates so that the standard deviation can be calculated and then used to assign individual weights to these data points. In these cases, the weights can normally be considered inversely related to the variance of the points (24, 28),

w_i = 1/s_i^2.    (6)
These are called instrumental weights (29). In such a scheme, no assumptions about the precision of the various data points are necessary; such precision is often found to be nonuniform in a given data set, i.e., heteroscedastic. This procedure, however, increases the cost of analysis and will be worthwhile only if additional data quality is required. At least 10 replicate measurements should be made at each x-value and used to calculate standard deviations (22). It is the task of the analyst to decide whether he must improve his method by using more sophisticated procedures (12). In the absence of a sufficient number of replicates, a functional relationship between variance and the independent variable can be assumed. In fact, regular cases of nonhomogeneity are characterized by a functional relationship between variance and the expected values of the responses (33).

(e) Transformation-dependent weights. Unequal weights may be introduced without being realized (34). A rather distinct approach, and an also very often practiced alternative when standard deviations are not available, is to take a single set of closely spaced data (32), in which insufficient information is available to justify the assignment of separate weights to the individual data points. However, if the form of the nonlinear relationship between two variables is known, it is sometimes possible to make a transformation of one or both variables such that the relationship between the transformed variables can be expressed as a straight line. These nonlinear relationships are said to be intrinsically linear (19, 26). The transformed data will not necessarily satisfy certain assumptions which are theoretically necessary to apply the regression analysis. In general, when experimental data z_i (dependent variable whose values are measured) are converted into transformed data y_i (dependent variable resulting from the linearization) for subsequent use in a least-squares analysis, one should introduce a weighting factor w_i given by (32, 35)

w_i = \left( \frac{1}{\partial y / \partial z} \right)^2.    (7)

These are called transformation-dependent weights (32).
(f) Mixed instrumental transformation-dependent weights. Use of the untransformed weights with the transformed function and points will give unintended results (36). In order to maintain the proper relationship between the weights and the points being fit, we must also transform the weights. The random error propagation law (37), when applied to a function y = f(z), gives

s_y^2 = s_z^2 \left( \frac{\partial y}{\partial z} \right)^2.    (8)

If the weights are transformed using the general error propagation law, we get

w_y = w_z \left( \frac{\partial y}{\partial z} \right)^{-2} = \frac{1}{s_z^2 (\partial y / \partial z)^2}.    (9)
In other words, transformation-dependent weighting should be used in addition to any weighting based on the measured standard deviations s_z of the individual data points (32). Nevertheless, in those cases in which the individual standard deviations remain unknown, the global weighting scheme (e) is the best choice.
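The weighting schemes (a)-(f) above can be collected in one place. A minimal Python sketch under the definitions given; the log-linearization used to illustrate scheme (e) is an assumed example, not one taken from this paper.

```python
import numpy as np

def weights(scheme, y=None, s=None, dydz=None):
    """Weights w_i for the schemes (a)-(f) discussed above.

    y    -- responses (for schemes (a)-(c))
    s    -- replicate standard deviations s_i (schemes (d) and (f))
    dydz -- derivative dy/dz of the linearizing transformation ((e) and (f))
    """
    if scheme == "absolute":          # (a) w_i = 1
        return np.ones_like(y)
    if scheme == "statistical":       # (b) w_i = 1/y_i, Eq. (4)
        return 1.0 / y
    if scheme == "percentage":        # (c) minimizing Eq. (5) amounts to w_i = 1/y_i^2
        return 1.0 / y ** 2
    if scheme == "instrumental":      # (d) w_i = 1/s_i^2, Eq. (6)
        return 1.0 / s ** 2
    if scheme == "transformation":    # (e) w_i = (dy/dz)^-2, Eq. (7)
        return dydz ** -2.0
    if scheme == "mixed":             # (f) w_i = 1/(s_z^2 (dy/dz)^2), Eq. (9)
        return 1.0 / (s ** 2 * dydz ** 2)
    raise ValueError(f"unknown scheme: {scheme}")

# Scheme (e) for the linearization y = ln z (an assumed example):
# dy/dz = 1/z, so w_i = z_i^2 and the largest z values carry the largest weights.
z = np.array([2.0, 5.0, 10.0])
print(weights("transformation", dydz=1.0 / z))   # -> [  4.  25. 100.]
```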
Weighted Linear Regression: Basic Equations
Using the weighted least-squares procedure, we obtain estimates of the parameters α_0 and α_1 (Eq. (1)) by minimizing the weighted sum of squares of residuals, \sum w_i e_i^2 (Eq. (3)). These estimates, a_0 and a_1, are unbiased; i.e., the expected value of the weighted residuals, \sum w_i^{1/2} e_i, is zero. If Q is to be a minimum, the first partial derivatives of Q with respect to a_0 and a_1 must be zero. Omitting the algebra, we obtain the set of normal equations

a_0 \sum w_i + a_1 \sum w_i x_i = \sum w_i y_i    (10)

a_0 \sum w_i x_i + a_1 \sum w_i x_i^2 = \sum w_i x_i y_i.    (11)
Solving by Cramer's rule, we get

a_1 = \frac{\sum w_i \sum w_i x_i y_i - \sum w_i x_i \sum w_i y_i}{\sum w_i \sum w_i x_i^2 - (\sum w_i x_i)^2}    (12)

a_0 = \frac{\sum w_i x_i^2 \sum w_i y_i - \sum w_i x_i \sum w_i x_i y_i}{\sum w_i \sum w_i x_i^2 - (\sum w_i x_i)^2}.    (13)

It is simpler, however, to find a_0 from Eq. (10), once the value of a_1 is known:

a_0 = \frac{\sum w_i y_i - a_1 \sum w_i x_i}{\sum w_i} = \bar{y} - a_1 \bar{x}.    (14)
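Equations (12) and (14) translate directly into code. A minimal sketch of the weighted fit, assuming NumPy arrays of equal length:

```python
import numpy as np

def wls_line(x, y, w):
    """Weighted slope and intercept via Eqs. (12) and (14)."""
    Sw, Swx, Swy = np.sum(w), np.sum(w * x), np.sum(w * y)
    Swxx, Swxy = np.sum(w * x * x), np.sum(w * x * y)
    a1 = (Sw * Swxy - Swx * Swy) / (Sw * Swxx - Swx ** 2)   # Eq. (12)
    a0 = (Swy - a1 * Swx) / Sw                              # Eq. (14)
    return a0, a1
```

Setting w_i = 1 for all i recovers the ordinary (unweighted) least-squares line, in agreement with scheme (a).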
The fitted line passes through the weighted centroidal point (x̄, ȳ) (centroid). An error in the value of ȳ leads to a constant error in y for all points on the line, which is translated up or down without a change in slope (13). Nevertheless, when there is an error in both the x and y coordinates at some or all of the observed points, the line does not pass through the centroid (24) (unless the errors in y and x are equal).
A measure of the overall quality of the fit function is the standard deviation of the fit, s_{y/x},

s_{y/x}^2 = \frac{Q}{k - 2} = \frac{\sum w_i (y_i - \hat{y}_i)^2}{k - 2}.    (15)
On the other hand, the correlation coefficient is given by (13, 14)

r = \sqrt{a_1 a_1'},    (16)
where a_1 and a_1' are the slopes of the regression lines of y on x and of x on y, respectively, and

a_1' = \frac{\sum w_i (x_i - \bar{x})(y_i - \bar{y})}{\sum w_i (y_i - \bar{y})^2}.    (17)

In statistical theory, a correlation is a measure of the association between two random variables. In those cases in which two variables are functionally related, neither is random. Random variability enters because of the random error ε_i. Correlation in its mathematical-statistical sense does not exist. In fitting functional models, values of r close to +1 or -1 do provide (38) an aura of respectability, but not much else. In addition, although the correlation coefficient is conceptually simple and attractive and is frequently used as a measure of how well a model fits a set of data, it is not, by itself, a good measure of the factors as they appear in the model (39), primarily because it does not take into account the degrees of freedom (40-43). We may obtain more compact expressions suitable for computer purposes by taking into account that x̄ and ȳ are the weighted means of the x and y values, respectively,

\bar{x} = \sum w_i x_i / \sum w_i    (18)

\bar{y} = \sum w_i y_i / \sum w_i    (19)
and that S_xx and S_yy are the weighted sums of squares about the weighted means for the two variables, and S_xy is the corresponding weighted sum of cross products:

S_{xx} = \sum w_i x_i^2 - (\sum w_i x_i)^2 / \sum w_i = \sum w_i (x_i - \bar{x})^2    (20)

S_{yy} = \sum w_i y_i^2 - (\sum w_i y_i)^2 / \sum w_i = \sum w_i (y_i - \bar{y})^2    (21)

S_{xy} = \sum w_i x_i y_i - (\sum w_i x_i)(\sum w_i y_i) / \sum w_i = \sum w_i (x_i - \bar{x})(y_i - \bar{y}).    (22)
In fact, this notation is similar to that used in unweighted linear regression (26). Similar shortcuts were due to Gauss (24). In this way, the values of the slope, intercept, correlation coefficient, and standard deviation of the regression line are given respectively by

a_1 = S_{xy} / S_{xx}    (23)
a_0 = \bar{y} - a_1 \bar{x}    (24)

r = S_{xy} / \sqrt{S_{xx} S_{yy}}    (25)

s_{y/x}^2 = \frac{S_{yy} - a_1 S_{xy}}{k - 2}.    (26)
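Using the centered (second) forms of Eqs. (20)-(22), the whole fit summary reduces to a few lines of code. A sketch, assuming Eqs. (24)-(26) as reconstructed above:

```python
import numpy as np

def wls_summary(x, y, w):
    """Slope, intercept, r, and s_y/x from the centered sums, Eqs. (18)-(26)."""
    xbar = np.sum(w * x) / np.sum(w)              # Eq. (18)
    ybar = np.sum(w * y) / np.sum(w)              # Eq. (19)
    Sxx = np.sum(w * (x - xbar) ** 2)             # Eq. (20), second form
    Syy = np.sum(w * (y - ybar) ** 2)             # Eq. (21), second form
    Sxy = np.sum(w * (x - xbar) * (y - ybar))     # Eq. (22), second form
    a1 = Sxy / Sxx                                # Eq. (23)
    a0 = ybar - a1 * xbar                         # Eq. (24)
    r = Sxy / np.sqrt(Sxx * Syy)                  # Eq. (25)
    s_yx = np.sqrt((Syy - a1 * Sxy) / (len(x) - 2))  # Eq. (26)
    return a0, a1, r, s_yx
```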
The second form in Eqs. (20)-(22) requires the computation of (x_i - x̄) and (y_i - ȳ) for all the observations. This may be tedious if x̄ and ȳ involve several decimal places, and the computation of products and squares becomes more laborious than the direct operation with the first form in Eqs. (20)-(22), which is normally performed on pocket calculators (26, 44). It is worth noting that the denominators of both Eqs. (12) and (13), i.e., the first and second parts of the right-hand member of the first form of Eqs. (20)-(22), involve the subtraction of two large positive quantities from each other. Therefore, these equations are potentially inaccurate if only a limited number of significant figures are carried on calculators and computers (45-47). To avoid rounding errors, it is best to carry as many significant figures as possible in these computations. Rounding is best done at the reporting, not the intermediate, stage of a calculation (26). Most digital computers, because of their roundoff characteristics (26), provide more accurate answers using the second form in Eqs. (20)-(22). Thus,

a_1 = \frac{\sum w_i (x_i - \bar{x})(y_i - \bar{y})}{\sum w_i (x_i - \bar{x})^2}    (27)

(this is equivalent to correcting the variable for its mean or centering the data (13, 16)) and

a_0 = \frac{\sum w_i (y_i - a_1 x_i)}{\sum w_i}    (28)

should always be calculated on a computer (44) in preference to Eqs. (12) and (13). Equation (28) is slightly superior to Eq. (14), particularly when a_0 is close to zero. The standard deviations of the intercept, s_{a_0}, and slope, s_{a_1}, and the covariance between the slope and intercept, cov(a_0, a_1), may be evaluated from the weighted linear regression in matrix form (48):
s_{a_0}^2 = \frac{s_{y/x}^2}{\sum w_i} + \frac{\bar{x}^2 s_{y/x}^2}{S_{xx}}    (29)

s_{a_1}^2 = s_{y/x}^2 / S_{xx}    (30)
cov(a_0, a_1) = -\bar{x} s_{y/x}^2 / S_{xx}.    (31)
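A sketch of Eqs. (29)-(31) as given above; s_yx is the standard deviation of the fit from Eq. (15) or (26):

```python
import numpy as np

def wls_uncertainties(x, w, s_yx):
    """Variances of intercept and slope and their covariance, Eqs. (29)-(31)."""
    Sw = np.sum(w)
    xbar = np.sum(w * x) / Sw
    Sxx = np.sum(w * (x - xbar) ** 2)                        # Eq. (20)
    var_a0 = s_yx ** 2 / Sw + xbar ** 2 * s_yx ** 2 / Sxx    # Eq. (29)
    var_a1 = s_yx ** 2 / Sxx                                 # Eq. (30)
    cov_a0a1 = -xbar * s_yx ** 2 / Sxx                       # Eq. (31)
    return var_a0, var_a1, cov_a0a1
```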
Weight Normalization

The weights calculated as indicated in Eqs. (4)-(9) may be normalized (49, 50). This can be accomplished by transforming each "old" weight, w_i, into a "new" weight, w_i^*,

w_i^* = s_0^2 w_i.    (32)
The quantity s_0^2 is simply a proportionality factor chosen for convenience, since it cancels out of the normal equations; its choice does not affect the values of the estimates a_0 and a_1 of α_0 and α_1, and it is evidently the variance of a function of unit weight (24). The variance and covariance formulas for the weighted case show, however, that s_0^2 does not cancel there (Eq. (33)); hence its value is not arbitrary in estimating the variance ratios. Some workers set s_0^2 = 1, giving, from Eq. (32), w_i^* = w_i (22, 25, 28, 29, 51, 52). To some authors this approach seems to be entirely reasonable (52) and is certainly used very often. Another convention is to normalize the weights by setting (49)

\sum w_i^* = 1.    (34)
Then

s_0^2 = 1 / \sum w_i    (35)

and

w_i^* = w_i / \sum w_i.    (36)
Connors (53) makes the reasonable requirement in the choice of s_0^2 that the parameters of the weighted regression be identical to those for unweighted regression when the w_i are equal (homoscedasticity). It is obvious that this criterion means that

\bar{w}^* = 1    (37)

and

\sum w_i^* = k.    (38)

Therefore, with this choice,

s_0^2 = k / \sum w_i    (39)

and

w_i^* = k w_i / \sum w_i.    (40)
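This scaling is one line in code. A minimal sketch:

```python
import numpy as np

def normalize_weights(w):
    """Scale weights per Eq. (40): w*_i = k w_i / sum(w), so that
    sum(w*) = k (Eq. (38)) and the average weight is 1 (Eq. (37))."""
    w = np.asarray(w, dtype=float)
    return len(w) * w / w.sum()

print(normalize_weights([1.0, 2.0, 5.0]))   # -> [0.375 0.75  1.875], sum = 3
```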
In other words, the weights calculated are normalized in such a way (50, 53, 54) that the average weight is 1; that is, the weights have been scaled so that their sum is equal to the number of points on the graph (25). This not only simplifies the subsequent calculations, but also provides unbiased estimates of the parameters (in the absence of replication). In general, nonhomogeneity can cause either too many or too few significant differences. If the weights do not differ greatly, the weighted estimates will not be greatly different from the unweighted estimates (27); e.g., the weighted linear regression line in calibration problems differs very little from the ordinary regression line. The width of the confidence interval for the true concentration of an unknown sample, however, can be very different when weighted regression (22, 52) is used. A further paper in this series will be devoted to replicate observations, including homogeneity and nonhomogeneity of variances, as well as to fit and testing criteria.

ACKNOWLEDGMENT

The authors thank DGICYT (Dirección General de Investigación Científica y Técnica de España) for financial assistance through Project PB-86-0611.
REFERENCES

1. Agterdenbos, J. Anal. Chim. Acta, 1979, 108, 315-323.
2. Schwartz, L. M. Anal. Chem., 1979, 51, 723-727.
3. Thompson, M. Analyst (London), 1982, 107, 1169-1180.
4. de la Guardia, M.; Carreño, A. S.; Navarro, V. B. Ann. Quim. (Madrid), 1981, 77, 129-132.
5. de la Guardia, M.; Carreño, A. S.; Navarro, V. B. Ann. Quim. (Madrid), 1983, 79, 446-447.
6. Schwartz, L. M. Anal. Chem., 1981, 53, 206-213.
7. Deming, S. N. Linear models and matrix least squares in clinical chemistry. In Chemometrics, Mathematics and Statistics in Chemistry (B. R. Kowalski, Ed.), pp. 267-394. Reidel, Dordrecht, 1984.
8. Daniel, C.; Wood, F. S. Fitting Equations to Data: Computer Analysis of Multifactor Data, 2nd ed. Wiley, New York, 1980.
9. Box, G. E. P.; Draper, N. R. Empirical Model-Building and Response Surfaces. Wiley, New York, 1987.
10. Garcia, M. C.; Ramis, G.; Mongay, C. Talanta, 1982, 29, 435-439.
11. Mongay, C.; Ramis, G.; Garcia, M. C. Spectrochim. Acta Part A, 1982, 38, 247-252.
12. Massart, D. L.; Vandeginste, B. G. M.; Deming, S. N.; Michotte, Y.; Kaufman, L. Chemometrics: A Textbook, Chap. 5. Elsevier, Amsterdam, 1988.
13. Kennedy, J. B.; Neville, A. M. Basic Statistical Methods for Engineers and Scientists, 2nd ed., pp. 256, 263, 265, 298. Harper & Row, New York, 1976.
14. Natrella, M. G. Experimental Statistics, Chap. 5, National Bureau of Standards Handbook 91. U.S. Govt. Printing Office, Washington, DC, 1966.
15. Ripley, B. D.; Thompson, M. Analyst (London), 1987, 112, 373-383.
16. Anderson, R. L. Practical Statistics for Analytical Chemists, pp. 108-118, 219. Van Nostrand, Princeton, NJ, 1987.
17. Latorre, G. Analysis of variance and linear models. In Chemometrics, Mathematics and Statistics in Chemistry (B. R. Kowalski, Ed.), pp. 377-391. Reidel, Dordrecht, 1984.
18. Dorn, W. S.; McCracken, D. D. Numerical Methods with Fortran IV Case Studies, Chap. 7. Wiley, New York, 1972.
19. Tomassone, R.; Lesquoy, E.; Miller, C. La Régression: nouveaux regards sur une ancienne méthode statistique, pp. 15, 38. Masson, Paris, 1983.
20. Prudnikov, E. D.; Shapkina, Y. S. Analyst (London), 1984, 109, 305-307.
21. Christian, G. D.; Callis, J. B. (Eds.) Trace Analysis: Spectroscopic Methods for Molecules, pp. 33-36, 87-91. Wiley, New York, 1986.
22. Garden, J. S.; Mitchell, D. G.; Mills, W. N. Anal. Chem., 1980, 52, 2310-2315.
23. Bubert, H.; Klockenkämper, R. Fresenius' Z. Anal. Chem., 1983, 316, 186-193.
24. Deming, W. E. Statistical Adjustment of Data, Chap. 2, p. 175. Dover, New York, 1964.
25. Miller, J. C.; Miller, J. N. Statistics for Analytical Chemistry, 2nd ed., pp. 124-128. Horwood, Chichester, 1988.
26. Draper, N. R.; Smith, H. Applied Regression Analysis, 2nd ed., pp. 108-115, 222, 14. Wiley, New York, 1981.
27. Box, G. E. P.; Hunter, W. G.; Hunter, J. S. Statistics for Experimenters, pp. 505-507. Wiley, New York, 1978.
28. Jurs, P. C. Computer Software Applications in Chemistry, pp. 37-38. Wiley, New York, 1986.
29. Johnson, K. J. Numerical Methods in Chemistry, p. 245. Dekker, New York, 1980.
30. Anderson, K. P.; Snow, R. L. J. Chem. Educ., 1967, 44, 756-757.
31. Smith, E. D.; Mathews, D. M. J. Chem. Educ., 1967, 44, 757-759.
32. de Levie, R. J. Chem. Educ., 1986, 63, 10-15.
33. Bayne, C. K.; Rubin, I. B. Practical Experimental Designs and Optimization Methods for Chemists, pp. 61-62. VCH Publishers, Deerfield Beach, FL, 1986.
34. Brandreth, D. A. J. Chem. Educ., 1968, 45, 657-660.
35. Meites, L. CRC Crit. Rev. Anal. Chem., 1979, 8, 1-53.
36. Jurs, P. C. Anal. Chem., 1970, 42, 747-750.
37. Green, J. R.; Margerison, D. Statistical Treatment of Experimental Data, pp. 86-97. Elsevier, Amsterdam, 1977.
38. Hunter, J. S. J. Assoc. Off. Anal. Chem., 1981, 64, 576-583.
39. Deming, S. N.; Morgan, S. L. Experimental Design: A Chemometrics Approach, p. 145. Elsevier, Amsterdam, 1987.
40. Hancock, C. K. J. Chem. Educ., 1965, 43, 608-609.
41. Van Arendonk, M. D.; Skogerboe, R. K. Anal. Chem., 1981, 52, 2350-2351.
42. Plesch, R. GIT Fachz. Lab., 1982, 132, 1040-1044.
43. Tiley, P. F. Chem. Brit., 1985, 21, 162-163.
44. Lee, J. D.; Lee, T. D. Statistics and Computer Methods in BASIC, pp. 108-109. Van Nostrand-Reinhold, New York, 1982.
45. Wanek, P. M.; Whipple, R. E.; Fickies, T. E.; Grant, P. M. Anal. Chem., 1982, 54, 1877-1878.
46. Solberg, H. E. Anal. Chem., 1983, 55, 1611.
47. Shukla, S. S.; Rusling, J. F. Anal. Chem., 1984, 56, 1347A-1368A.
48. Pattengill, M. D.; Sands, D. E. J. Chem. Educ., 1979, 56, 244-247.
49. Spiridonov, V. P.; Lopatkin, A. A. Tratamiento matemático de datos físico-químicos, pp. 111-112. Mir, Moscow, 1973.
50. Sharaf, M. A.; Illman, D. L.; Kowalski, B. R. Chemometrics, p. 28. Wiley, New York, 1986.
51. Jurs, P. C.; Isenhour, T. L.; Wilkins, C. L. BASIC Programming for Chemists: An Introduction, pp. 229-233. Wiley, New York, 1987.
52. Caulcutt, R.; Boddy, R. Statistics for Analytical Chemists, pp. 100-110. Chapman & Hall, London, 1987.
53. Connors, K. A. Binding Constants: The Measurement of Molecular Complex Stability, p. 115. Wiley, New York, 1987.
54. Mendelbaum, H. G.; Madaule, F.; Desgranges, M. Bull. Soc. Chim. France, 1973, 1619-1628.