
[23] Outliers and Robust Parameter Estimation

By MICHAEL L. JOHNSON

Introduction

Biomedical researchers are frequently faced with the task of estimating a series of parameters from a set of experimental data. This is commonly done by selecting the parameters of an equation such that the equation provides a good description of the data. In most cases the mathematical form of the equation is determined by a theoretical description of the underlying physics, chemistry, physiology, and/or biology. The process of selecting the parameter values is usually done by a least squares parameter estimation procedure.[1-3]

Outliers are data points that simply do not fall near the central tendency of the majority of the data points. These outlying data points are typically caused by weighing or pipetting errors, but can arise for many reasons. The presence of outliers in a data set forces the investigator to make a series of decisions: Should they be included during the fitting process, or should they be eliminated? Which data points are outliers? Are they outliers because the data points are bad, or because the mathematical form of the equation (i.e., the theory) is wrong? If a data point is suspected of being an outlier, can an experimental reason (i.e., an experimental error) be found to justify eliminating that particular data point?

Least squares parameter estimation procedures are particularly sensitive to the presence of outliers in a data set. Consider the example shown in Fig. 1. The solid line in the lower panel of Fig. 1 is the least squares "best fit" quadratic polynomial [Eq. (1)].

[1] M. L. Johnson and S. G. Frasier, Methods Enzymol. 117, 301 (1985).
[2] M. L. Johnson and L. M. Faunt, Methods Enzymol. 210, 1 (1992).
[3] M. L. Johnson, Methods Enzymol. 240, 1 (1994).


FIG. 1. Least squares analysis of the data by a quadratic polynomial. The solid line in the lower panel is the "best fit" to Eq. (1). The dashed line is the corresponding fit of the data without the data point at an independent variable of 7. The upper panel displays the weighted residuals: solid diamonds correspond to the solid-line fit, and open squares correspond to the dashed-line fit. The least squares estimated parameters are shown in Table I.

G(X, answers) = Y = a + bX + cX^2    (1)

The data points are shown in the lower panel of Fig. 1. The dashed line is the corresponding fit of the data without the data point at an independent variable of 7, X = 7. The upper panel of Fig. 1 displays the weighted residuals, i.e., the differences between the fitted curve and the data points normalized by the uncertainties of the particular data points (solid diamonds correspond to the solid-line fit and open squares correspond to the dashed-line fit). The least squares estimated parameters are shown in Table I. This data set was generated with the values a = 1.0, b = 1.0, and c = -0.1, and then the data point at X = 7 was systematically altered. A small amount (SD = 0.1) of Gaussian distributed random noise was also added to the data set. Clearly, the least squares procedure correctly found the parameter values in the absence of the outlier at X = 7, but not when it was present. From an examination of the fit in the presence of the data point at X = 7, it appears that the data point at X = 7 is an outlier. The parameter estimation without this data point provides a better description of the data. However, a better fit is not really a good justification for arbitrarily removing a data point.
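To make the example concrete, the following sketch generates a synthetic data set of the kind described above (a = 1.0, b = 1.0, c = -0.1, Gaussian noise with SD = 0.1, and a systematically displaced point at X = 7) and fits it by weighted least squares with and without the suspect point. This is a minimal sketch, not the code used to produce Fig. 1; the X grid, the random seed, and the size of the displacement are assumptions made only for illustration.

```python
# Minimal sketch: build a quadratic data set with one systematically altered
# point and compare weighted least squares fits with and without it.
import numpy as np

rng = np.random.default_rng(0)                 # assumed seed
a_true, b_true, c_true = 1.0, 1.0, -0.1

x = np.arange(1.0, 13.0)                       # assumed X grid, roughly spanning Fig. 1
sigma = np.full(x.size, 0.1)                   # per-point standard errors
y = a_true + b_true * x + c_true * x**2 + rng.normal(0.0, 0.1, x.size)
y[x == 7.0] -= 3.0                             # hypothetical systematic error at X = 7

# Weighted least squares quadratic fits (np.polyfit weights are 1/sigma).
coef_all = np.polyfit(x, y, deg=2, w=1.0 / sigma)            # returns [c, b, a]
mask = x != 7.0
coef_clean = np.polyfit(x[mask], y[mask], deg=2, w=1.0 / sigma[mask])

print("with the outlier    (c, b, a):", coef_all)
print("without the outlier (c, b, a):", coef_clean)
```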


TABLE I
PARAMETERS FROM LEAST SQUARES ANALYSIS (a)

                   Solid line                       Dashed line
Parameter     Value     Standard error (b)     Value     Standard error (b)
a              0.32     0.87                    1.081     0.104
b              1.42     0.37                    0.988     0.043
c             -0.13     0.14                   -0.100     0.004
s^2          106.52                             0.9877

(a) As shown in Fig. 1.
(b) Standard errors were evaluated by a 3000-cycle bootstrap method.

The concern is that this particular data point was influential enough to cause systematic errors in the estimated parameters and, as a consequence, a major shift in the calculated curve. This systematic shift in the estimated curve makes it difficult to identify a particular data point as an outlier. The purpose of this chapter is to discuss one method of fitting data in the presence of outliers. This method is "robust" in that it is less sensitive than least squares methods to the presence of outliers. The reader is, however, cautioned that this method does not have the rigorous statistical basis that least squares methods have.

Least Squares Parameter Estimation

Least squares parameter estimation procedures automatically adjust the parameters of a fitting equation until the weighted sum of the squared residuals [i.e., the sample variance of fit, Eq. (2)] is a minimum:

s^2 = (1/(n - m)) Σ_{i=1}^{n} [ (Y_i - G(X_i, answers)) / σ_i ]^2    (2)

where X_i and Y_i are the ith of the n data points, σ_i is the standard error of the ith data point, G is the fitting function, and answers are the m parameters of the fitting function. For the present example, the answers are a, b, and c of Eq. (1), with m = 3. The statistical basis of the least squares procedures depends on a series of assumptions about the experimental data and the fitting function. Specifically, the following assumptions are made: All of the experimental uncertainties are Gaussian distributed and on the dependent variables (the Y axis); i.e., no errors exist in the independent variables (the X axis) and no systematic (non-Gaussian) uncertainties exist.


The fitting equation is the correct equation. The data points are "independent observations." A sufficient number of data points is required to provide a good sampling of the experimental uncertainties; this implies that the data should not be smoothed to remove the noise components. With these assumptions it can be shown analytically that the least squares procedures will produce the parameter values that have the highest probability of being correct.[1-3] Least squares parameter estimation is a maximum likelihood (highest probability) method. Conversely, if these assumptions are valid and an analysis method produces answers that differ from the values found by the least squares procedure, then the other analysis method is producing answers with less than the maximum likelihood (i.e., they are wrong).

An examination of these assumptions explains why the least squares procedure is significantly biased by the outlier data point in Fig. 1 at X = 7. That particular data point contains a significant amount of systematic uncertainty. Therefore, the least squares assumptions are not met, and as a consequence the least squares procedure is expected to produce answers that are less than optimal, which is exactly what is seen in Fig. 1 and Table I.

The actual numerical procedure used for the parameter estimations (least squares and others) presented in this work is the Nelder-Mead simplex algorithm.[4] Most available methods for the evaluation of confidence intervals assume that the fitting procedure is a least squares procedure. The bootstrap method[5] was used in this work because it does not assume that the procedure is a least squares procedure.
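As a concrete illustration of the procedure just described, the sketch below minimizes the variance of fit of Eq. (2) for the quadratic model of Eq. (1) with the Nelder-Mead simplex algorithm and estimates parameter standard errors with a bootstrap. This is a minimal sketch, assuming the synthetic arrays x, y, and sigma from the earlier example; the function names (model, s2, fit_least_squares, bootstrap_errors), the starting guesses, and the case-resampling form of the bootstrap are assumptions, with only the 3000-cycle count taken from the table footnotes.

```python
# Minimal sketch: Nelder-Mead minimization of Eq. (2) plus bootstrap errors.
import numpy as np
from scipy.optimize import minimize

def model(params, x):
    """Quadratic fitting function of Eq. (1)."""
    a, b, c = params
    return a + b * x + c * x**2

def s2(params, x, y, sigma):
    """Sample variance of fit, Eq. (2): n data points, m fitted parameters."""
    r = (y - model(params, x)) / sigma
    return np.sum(r**2) / (x.size - len(params))

def fit_least_squares(x, y, sigma, start=(0.0, 0.0, 0.0)):
    """Least squares estimate of (a, b, c) by the Nelder-Mead simplex."""
    result = minimize(s2, start, args=(x, y, sigma), method="Nelder-Mead")
    return result.x

def bootstrap_errors(x, y, sigma, fitter, cycles=3000, seed=1):
    """Standard errors from a case-resampling bootstrap of the fit."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(cycles):
        idx = rng.integers(0, x.size, x.size)  # resample data points with replacement
        draws.append(fitter(x[idx], y[idx], sigma[idx]))
    return np.std(np.asarray(draws), axis=0)

# params = fit_least_squares(x, y, sigma)
# errors = bootstrap_errors(x, y, sigma, fit_least_squares)
```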

Robust Parameter Estimation Procedures

From Eq. (2) it is obvious that the least squares procedure weights each data point with the square of the weighted distance between the data point and the fitted curve. This is correct for Gaussian distributed random noise. However, when a few outliers exist with a significant amount of systematic uncertainty, it is clear that the outlier data points have too much influence on the resulting parameter values. A simple solution is to lower the power in Eq. (2) to something less than two so that these outliers will have less influence. One commonly used method is to minimize the sum of the absolute values, SAV, of the weighted residuals, as in Eq. (3).

[4] J. A. Nelder and R. Mead, Comput. J. 7, 308 (1965).
[5] B. Efron and R. J. Tibshirani, "An Introduction to the Bootstrap." Chapman and Hall, New York, 1993.

FIG. 2. Robust analysis of the data by a quadratic polynomial. The solid line in the lower panel is the "best fit" to Eq. (1). The dashed line is the corresponding fit of the data without the data point at an independent variable of 7. The upper panel displays the weighted residuals: solid diamonds correspond to the solid-line fit, and open squares correspond to the dashed-line fit. The robust estimated parameters are shown in Table II.

SAV = (1/(n - m)) Σ_{i=1}^{n} | (Y_i - G(X_i, answers)) / σ_i |    (3)
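A sketch of how the criterion of Eq. (3) might be minimized in practice is given below. It reuses the quadratic model and the Nelder-Mead call pattern of the earlier least squares sketch; the function names and starting guesses are again assumptions, not part of the original chapter.

```python
# Minimal sketch: robust fit by minimizing the SAV criterion of Eq. (3).
import numpy as np
from scipy.optimize import minimize

def model(params, x):
    """Quadratic fitting function of Eq. (1)."""
    a, b, c = params
    return a + b * x + c * x**2

def sav(params, x, y, sigma):
    """Sum of absolute values of the weighted residuals, Eq. (3)."""
    r = (y - model(params, x)) / sigma
    return np.sum(np.abs(r)) / (x.size - len(params))

def fit_robust(x, y, sigma, start=(0.0, 0.0, 0.0)):
    """Robust estimate of (a, b, c) by the Nelder-Mead simplex."""
    result = minimize(sav, start, args=(x, y, sigma), method="Nelder-Mead")
    return result.x

# robust_params = fit_robust(x, y, sigma)
```

The absolute value makes this objective nondifferentiable wherever a residual passes through zero, which is one reason a derivative-free method such as the Nelder-Mead simplex is convenient here.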

This is one of the procedures in the field of robust statistics.[6,7] The reader should be cautioned that, while this procedure has some specific applications, such as locating outliers, it is not a maximum likelihood procedure. When applied to data that meet the assumptions of least squares fitting, this procedure will produce results that are not the maximum likelihood results. Under these conditions, this robust procedure will provide answers that are less than optimal.

Robust Analysis of Present Example

Figure 2 and Table II present an analogous robust analysis of the data shown in Fig. 1 and Table I. This analysis was performed by minimizing the sum of the weighted absolute values of the residuals, SAV in Eq. (3), instead of minimizing the weighted sum of the squares of the residuals, s^2 in Eq. (2).

[6] P. J. Huber, "Robust Statistics." John Wiley and Sons, New York, 1981.
[7] D. C. Hoaglin, F. Mosteller, and J. W. Tukey, "Understanding Robust and Exploratory Data Analysis." John Wiley and Sons, New York, 1983.


TABLE II
PARAMETERS FROM ROBUST ANALYSIS (a)

                   Solid line                       Dashed line
Parameter     Value     Standard error (b)     Value     Standard error (b)
a             0.978     0.089                  0.986     0.079
b             1.047     0.040                  1.038     0.034
c            -0.104     0.004                 -0.103     0.003

(a) As shown in Fig. 2.
(b) Standard errors were evaluated by a 3000-cycle bootstrap method.

For this analysis, the values calculated while including the data point at X = 7 (solid line and solid diamonds) virtually superimpose on the values calculated in the absence of that data point (dashed line and open squares). This indicates that the outlier had little or no contribution to the analysis, thus making it easier to identify, since the calculated curve is biased toward the outlier to a lesser degree. However, the answers (i.e., the parameter values) do not agree with the values from the least squares analysis. Figure 3 presents the analysis of the data set when the outlier at an independent variable of 7 is removed. The solid line and solid diamonds correspond to the least squares analysis, while the dashed line and open squares correspond to the robust analysis.

Conclusions

This chapter presents an example of robust parameter estimation. The example describes how robust fitting can be employed for data that may contain "outliers." An outlier is simply a data point that for some reason contains a significant amount of systematic (i.e., nonrandom) experimental uncertainty. The example demonstrates how least squares parameter estimation is sensitive to the presence of these outliers, while the robust method is comparatively less sensitive. Thus, it is easier to identify presumptive outliers with the robust method. Note that the robust method presented here is only one of many possible "robust" methods. This robust method minimizes the sum of the first powers of the absolute values of the residuals. In principle, Eq. (3) could be written with a power term, q, as in Eq. (4).

SAV_q = (1/(n - m)) Σ_{i=1}^{n} | (Y_i - G(X_i, answers)) / σ_i |^q    (4)
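The generalized criterion of Eq. (4) is easy to sketch in the same style as the earlier examples: q = 2 recovers the least squares variance of fit of Eq. (2), and q = 1 recovers the SAV criterion of Eq. (3). The function names and defaults below are assumptions made for illustration.

```python
# Minimal sketch: the power-q objective of Eq. (4); q = 1 gives Eq. (3) and
# q = 2 gives the least squares variance of fit of Eq. (2).
import numpy as np
from scipy.optimize import minimize

def sav_q(params, x, y, sigma, q):
    """Eq. (4): mean |weighted residual|**q over n - m degrees of freedom."""
    a, b, c = params
    r = (y - (a + b * x + c * x**2)) / sigma
    return np.sum(np.abs(r)**q) / (x.size - len(params))

def fit_power(x, y, sigma, q=1.0, start=(0.0, 0.0, 0.0)):
    """Estimate (a, b, c) by Nelder-Mead minimization of Eq. (4)."""
    result = minimize(sav_q, start, args=(x, y, sigma, q), method="Nelder-Mead")
    return result.x
```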

FIG. 3. Least squares and robust analysis of the data in Figs. 1 and 2 with the outlier at an independent variable of 7 removed. The solid line and solid diamonds correspond to the least squares analysis (as shown in Fig. 1 and Table I), whereas the dashed line and open squares correspond to the robust analysis (as shown in Fig. 2 and Table II).

For the present robust example, q is unity. However, any power q less than 2 should decrease the sensitivity of the parameter estimation to the presence of outliers. In addition, there are many other "norms" of the data that could be minimized that are less sensitive to outliers than the least squares method. This is expected, since the least squares method is based on the assumption that outliers are not present within the data. Thus, if outliers are present and the least squares method is employed, then it is expected that the results will be less than optimal. If this is the case, why not always use a robust method? The answer is that the robust methods do not have a rigorous statistical basis. A large body of experimental data is consistent with the assumptions outlined above for the least squares method. For data that are consistent with these assumptions, the least squares method provides maximum likelihood parameter values. If outliers are not present (as in Fig. 3), the robust methods do not provide the parameter values with the maximum likelihood of being correct. If the experimental data are not consistent with the least squares assumptions, then clearly least squares will not provide the optimal parameter values.


However, it cannot be demonstrated that the robust method will provide the optimal (i.e., maximum likelihood) parameter values either. In addition, a large number of techniques have been developed that are based on the concepts of least squares parameter estimation. These include goodness-of-fit tests such as autocorrelation, the runs test, the Kolmogorov-Smirnov test, and the Durbin-Watson test.[8] They also include methods for the evaluation of the precision of the estimated parameters[9,10] (i.e., the parameter confidence intervals). In general, these techniques apply to least squares but not to robust parameter estimation procedures. So why use robust parameter estimation procedures? Because in many cases they appear to provide an analysis of the data that is less sensitive to the presence of outliers. This makes it easier to identify those points that may be outliers.
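As a final illustration of how a robust fit can make presumptive outliers easier to spot, the sketch below flags data points whose weighted residuals about the robust curve exceed a cutoff. The function name and the cutoff of 3 are assumed, arbitrary choices, and flagged points are only candidates for closer experimental scrutiny, not for automatic deletion.

```python
# Minimal sketch: flag presumptive outliers from the weighted residuals of a
# robust fit.  The cutoff is an assumed, arbitrary choice.
import numpy as np

def flag_outliers(x, y, sigma, params, cutoff=3.0):
    a, b, c = params
    r = (y - (a + b * x + c * x**2)) / sigma   # weighted residuals about the fitted curve
    return x[np.abs(r) > cutoff]               # X values of the suspect data points

# suspects = flag_outliers(x, y, sigma, robust_params)
```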

Acknowledgments

The author acknowledges the support of the National Science Foundation Science and Technology Center for Biological Timing at the University of Virginia (NSF DIR-8920162), the General Clinical Research Center at the University of Virginia (NIH RR-00847), and the University of Maryland at Baltimore Center for Fluorescence Spectroscopy (NIH RR-08119).

[8] M. Straume and M. L. Johnson, Methods Enzymol. 210, 87 (1992).
[9] D. M. Bates and D. G. Watts, "Nonlinear Regression Analysis and Its Applications." John Wiley and Sons, New York, 1988.
[10] D. G. Watts, Methods Enzymol. 240, 23 (1994).
