Stepwise regression is an alternative to splines for fitting noisy data

J. Biomechanics, Vol. 29, No. 2, pp. 235-238, 1996
Copyright © 1995 Elsevier Science Ltd. Printed in Great Britain. All rights reserved
0021-9290/96 $15.00 + .00
0021-9290(95)00044-5

TECHNICAL NOTE

Thomas J. Burkholder and Richard L. Lieber

Departments of Orthopaedics and Bioengineering, Biomedical Sciences Graduate Group, University of California and Veterans Administration Medical Center, San Diego, U.S.A.

Abstract - In this study we compared numerical methods used to fit noisy data: polynomial regression, stepwise polynomial regression and quintic spline approximation. The advantages and limitations of each method are discussed in terms of curve fit quality, computational speed and ease, and solution compactness. Overall, the spline approximation and stepwise polynomial regression provide the best fits to the data. Stepwise regression offers the added benefit of producing a simple, unconstrained function that can easily be implemented in simulation studies.

Received in final form 15 March 1995. Address for correspondence: Richard L. Lieber, Ph.D., Department of Orthopaedics (9151), V.A. Medical Center and U.C. San Diego School of Medicine, 3350 La Jolla Village Drive, San Diego, CA 92161, U.S.A.

INTRODUCTION

Raw data processing has two main goals: noise reduction and information compression. Factors that affect the utility of a data processing method include simplicity, fidelity, flexibility, operator independence and computational complexity. The majority of previous curve-fitting studies focus on whole-body kinematics, in which accelerations are calculated from position data. For this application noise reduction is the most important function and fidelity the primary concern, since small errors and noise are dramatically magnified by differentiation. When processing larger data sets, compression becomes more important, as does computational simplicity.

Current data processing methods fall into three general categories: polynomial regression, spline interpolation and digital filtering (Wood, 1982). Both polynomial regression and spline interpolation are curve-fitting procedures that minimize a merit function, generally least squares. Polynomial regression attempts to fit a single equation to the entire data set, while spline interpolation fits a sequence of curves to segments of the data. Digital filtering, like analog filtering, reduces the frequency spectrum of the data. Because many curve-fitting exercises lead to simulations or models of the observed effects, we focus our attention on curve-fitting methods capable of providing explicit functions from the data set, i.e. polynomial regression and spline interpolation.

The most critical decision in polynomial regression is the choice of polynomial order, which determines the form of the curve fit as well as its derivatives. A number of schemes exist to assist in this choice, including minimization of total error (Pezzack et al., 1977) and identification of an inflection in the plot of residuals vs polynomial order (Jackson, 1979). Stepwise polynomial regression (Wonnacott and Wonnacott, 1981) permits the data themselves to determine the form of the regression equation by adding polynomial terms to the model only as they significantly improve the curve fit. Thus, a stepwise regression model may take a more generic form than a simple polynomial regression, such as ax³ + b, where a and b are regression constants, as opposed to the full cubic solution ax³ + bx² + cx + d, where a, b, c and d are regression constants.

Splines are strongly influenced both by the choice of order (2m) and by the smoothing parameter (S). A spline of order 2m will have (2m - 1) derivatives which are piecewise linear (Woltring, 1985). The parameter S is chosen either by trial and error or, more often, by a generalized cross-validation technique (Woltring, 1986). In view of the important role of curve-fitting methods in data analysis and the variety of emphases in curve fitting, we examined the strengths and weaknesses of several commonly used methods. The purpose of this paper is to compare simple polynomial regression, stepwise polynomial regression and quintic spline smoothing.
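As a worked illustration of GCV-based smoothing, the sketch below is a minimal Python stand-in assuming SciPy 1.10 or later. Note that scipy.interpolate.make_smoothing_spline fits a cubic (2m = 4) rather than quintic (2m = 6) smoothing spline, so it only approximates what Woltring's GCVSPL Fortran package does; the test function and SNR = 5 value anticipate the Methods section below.

```python
# A minimal sketch of GCV-selected spline smoothing, assuming SciPy >= 1.10.
# make_smoothing_spline fits a CUBIC smoothing spline (a stand-in for the
# quintic GCVSPL routine); with lam=None the smoothing parameter is chosen
# by generalized cross-validation.
import numpy as np
from scipy.interpolate import make_smoothing_spline

x = np.linspace(0.0, 3.0, 101)
signal = x + np.sin(x) - 0.5 * np.sin(2.0 * x)   # the paper's sine test function
rng = np.random.default_rng(0)
# SNR defined as var(signal) / var(noise); here SNR = 5.
y = signal + rng.normal(0.0, np.sqrt(np.var(signal) / 5.0), x.size)

spline = make_smoothing_spline(x, y)             # lam=None -> GCV choice of S
y_fit = spline(x)                                # smoothed estimate
dy_fit = spline.derivative(1)(x)                 # first derivative of the fit
```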

METHODS

Polynomial regression was performed using StatView 4.0 (Abacus Concepts, Inc., Berkeley, CA). Quintic spline smoothing was performed using Woltring's (1986) Fortran code (GCVSPL) with the MODE parameter set to 2 (generalized cross-validation). This program was executed on a Macintosh IIci in the Macintosh Programmer's Workshop (MPW) environment with the SADE symbolic debugger (MPW 3.2.3, Apple Computer, Inc., Cupertino, CA).

The stepwise regression procedure determines the set of independent variables that best predict the dependent variable. This is accomplished by repeated application of a variable-selection algorithm in which, at each step, a single variable is either entered into or removed from the model. For each step, a regression is performed using the previously included independent variables plus one of the excluded variables. Each of these regressions is subjected to an F-test, and the variable with the largest F value exceeding a user-defined threshold is added to the model. This general procedure is easily applied to polynomials by using powers of the independent variable as pseudo-independent variables.

To evaluate the ability of these routines to reduce noise, data sets were created with increasing levels of white noise. Each set was computer-generated from the function x + sin(x) - 0.5 sin(2x) evaluated over the range 0 < x < 3 [Fig. 1(A)]. The signal-to-noise ratio (SNR) was defined as the variance of the test function divided by the variance of the added noise and was varied from infinity (no noise) to 1.0 (noise variance equal to signal variance). Because each curve fit depended strongly on the particular noise sample, the residual calculated against the generating function varied considerably; for this reason, three separate trials were performed at each noise level.
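The selection loop just described is straightforward to implement. The following Python sketch is a forward-only simplification (the full stepwise procedure can also remove previously entered terms whose F value falls below a drop threshold); the entry threshold F_ENTER = 4.0 and the maximum candidate order are illustrative assumptions, not values taken from the paper.

```python
# Forward stepwise polynomial regression via partial F-tests: a simplified
# sketch, with an assumed (illustrative) F-to-enter threshold.
import numpy as np

F_ENTER = 4.0  # user-defined entry threshold; value here is illustrative

def fit_rss(X, y):
    """Least-squares fit of y on the columns of X; returns residual sum of squares."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ coef) ** 2))

def stepwise_poly(x, y, max_order=10):
    """Forward stepwise selection over powers of x (pseudo-independent variables)."""
    n = len(y)
    candidates = list(range(1, max_order + 1))  # powers not yet in the model
    included = []                               # powers entered so far
    X = np.ones((n, 1))                         # intercept-only starting model
    rss_old = fit_rss(X, y)
    while candidates:
        best_f, best_p, best_X, best_rss = -np.inf, None, None, None
        for p in candidates:
            # Regress on the included terms plus this one excluded power,
            # then compute the partial F statistic for the added term.
            X_try = np.hstack([X, (x ** p)[:, None]])
            rss_try = fit_rss(X_try, y)
            f_stat = (rss_old - rss_try) / (rss_try / (n - X_try.shape[1]))
            if f_stat > best_f:
                best_f, best_p, best_X, best_rss = f_stat, p, X_try, rss_try
        if best_f <= F_ENTER:   # the largest F fails the entry test: stop
            break
        X, rss_old = best_X, best_rss
        included.append(best_p)
        candidates.remove(best_p)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return included, coef
```

Applied to the noisy sine data above, the returned `included` list gives the powers retained in the model and `coef` the corresponding regression constants.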

Fig. 1. (A) Test function of the form x + sin(x) - 0.5 sin(2x) evaluated over 0 < x < 3 (solid line), raw data (squares) for a SNR of 5, and the spline smoothing, stepwise regression and simple regression fits. (B) Sample ramp test function and fits at a slopes ratio of 9 and a SNR of 10; at this slopes ratio the spline fit is clearly superior to either regression fit.

To provide a second general type of curve-fitting test, another series of data sets was generated, each starting from 0, increasing linearly to 1 with a slope of 2, then decreasing linearly back to 0 with various slopes [Fig. 1(B)]. Any remaining interval was filled with zeros so that each set spanned 0 to 1 in the independent variable. White noise was also imposed on these ramp functions at a SNR of 10. The ratio of the decreasing slope to the increasing slope (R) was varied from 1 to 10.

Algorithm fidelity was evaluated by the sums of squares of the first and second derivatives of the difference between the test function and the curve fit; the residuals in the derivatives were considered a more sensitive index of fit quality than the residuals in the function itself. The data reduction achieved by each algorithm was quantified by dividing the number of data points (101) by the number of parameters required to describe the model fully. For example, a cubic polynomial, with four parameters, has a reduction factor of about 25. Since this measure depends strongly on the choice of data set length, its numerical value has no intrinsic meaning, but it provides a framework within which to compare the algorithms.
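A short sketch of this second test and of the two evaluation indices follows. The finite-difference (np.gradient) approximation for the derivatives of the sampled functions is an assumption on our part; the paper does not state how the derivatives were computed.

```python
# Ramp test function and the two evaluation metrics: derivative residuals
# and compression factor. Derivatives are approximated by finite differences
# (np.gradient), which is an assumption, not the paper's stated method.
import numpy as np

def ramp(n=101, r=3.0, rise_slope=2.0):
    """Ramp on 0 <= x <= 1: rise to 1 at slope 2, fall at slope 2*r, then zeros."""
    x = np.linspace(0.0, 1.0, n)
    rise_end = 1.0 / rise_slope   # reaches 1 at x = 0.5
    fall = np.clip(1.0 - rise_slope * r * (x - rise_end), 0.0, 1.0)
    y = np.where(x < rise_end, rise_slope * x, fall)
    return x, y

def derivative_residuals(x, y_true, y_fit):
    """Residual sums of squares in the 1st and 2nd derivatives of (fit - truth)."""
    e = y_fit - y_true
    d1 = np.gradient(e, x)
    d2 = np.gradient(d1, x)
    return np.sum(d1 ** 2), np.sum(d2 ** 2)

def compression_factor(n_points, n_params):
    """Data points per model parameter, e.g. 101 points / 4 cubic coefficients ~ 25."""
    return n_points / n_params
```

White noise at a SNR of 10 can be added to the ramp exactly as in the earlier spline sketch.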

RESULTS

Spline fitting and stepwise regression were both very sensitive to noise level, while simple polynomial regression was not [Table 1; Fig. 2(A)]. As the SNR decreased (i.e. as noise increased), the spline and stepwise routines diverged from the test function, as indicated by the increasing residuals. The spline response occasionally demonstrated significant oscillation, as illustrated by the residual sum at the SNR value of 1.67. The residual sum for both derivatives of the polynomial regression remained relatively constant in all cases.

The response of the spline to the ramp functions was dramatically different from that of the global regression routines [Table 2; Fig. 2(B)]. The spline tended to oscillate about the generating function, keeping both the spline and its first derivative relatively close to the "correct" solution but causing the second derivative to be more divergent. Both regression routines tended to smooth the descending segment, diverging noticeably even at a slope ratio of three.

GCVSPL produced as many model coefficients as data points in the original data set, yielding a compression factor of 1. The polynomial regression procedure yielded an average compression factor of 19 over all data sets, slightly less than the stepwise regression routine's average of 27.

Table 1. Residual sum of squares from each routine with decreasing signal-to-noise ratio

                     Simple polynomial        Stepwise polynomial      Quintic spline
                     regression               regression
Signal-to-noise      1st deriv.  2nd deriv.   1st deriv.  2nd deriv.   1st deriv.  2nd deriv.
∞ (no noise)         5.1         150          0.049       5.3          3 x 10^-6   0.69
10.0                 5.7         160          0.95        22           1.0         27
5.00                 3.8         130          1.4         40           13          710
3.33                 11          200          12          260          22          1100
2.50                 5.9         140          24          400          28          710
2.00                 19          270          55          810          17          120
1.67                 16          230          40          600          1100        380000
1.43                 27          180          21          270          28          120
1.25                 85          660          130                      630         50000
1.11                 284         6200         53                       186         30000
1.00                 61          170          76                       700         52000


Table 2. Residual sum of squares from each routine with increasing asymmetry

                  Simple polynomial        Stepwise polynomial       Quintic spline
Ratio of          regression               regression
slopes (R)        1st deriv.  2nd deriv.   1st deriv.  2nd deriv.    1st deriv.  2nd deriv.
1                 120         5600         38          41000         20          17000
2                 220         160000       110         200000        137         980000
3                 370         560000       390         630000        75          120000
4                 580         690000       410         730000        110         270000
5                 870         810000       740         1.4 x 10^6    140         600000
6                 1000        600000       1200        190000        170         790000
7                 1300        790000       1400        190000        270         2.2 x 10^6
8                 1500        700000       1600        200000        353         3.1 x 10^6
9                 1600        610000       16000       660000        300         2.8 x 10^6
10                1900        640000       2100        220000        380         3.6 x 10^6

Fig. 2. (A) Residuals in the first derivative of the sine test functions vs signal-to-noise ratio for simple polynomial regression (squares), stepwise polynomial regression (diamonds) and splines (circles). (B) Residuals in the first derivative of the ramp test functions vs slopes ratio for simple polynomial regression (squares), stepwise polynomial regression (diamonds) and splines (circles).

CONCLUSIONS

The purpose of this note was to compare the benefits and disadvantages of several curve-fitting methods. The local spline routine displayed the greatest accuracy, except for the few cases in which it was distorted by excessive oscillation, as noted by Woltring (1985); this effect accounted for the excessively large residual at a SNR of 1.67 (Table 1). Global regression routines produced smoother functions at the cost of local detail. Stepwise regression and splines both displayed an accuracy that depended on the amount of noise obscuring the signal. Both simple and stepwise polynomial regressions can be performed using standard statistical packages, while the spline routine requires some knowledge of FORTRAN or fairly elaborate mathematical techniques. The time required to evaluate the polynomial curve fits was about half that required by the spline routine. Although splines were the most accurate, comparable accuracy was achieved more simply and quickly using stepwise polynomial regression, especially for well-behaved data. Polynomial regression, on the other hand, is best at suppressing high noise levels but can only be applied to relatively smooth data.

Acknowledgements - This work was supported by the Veterans Administration and NIH grants AR35192 and AR40050. We thank Drs Greg Loren and Scott Shoemaker for helpful discussions.


REFERENCES

Jackson, K. M. (1979) Fitting of mathematical functions to biomechanical data. IEEE Trans. Biomed. Eng. 26, 122-124.

Pezzack, J. C., Norman, R. W. and Winter, D. A. (1977) An assessment of derivative determining techniques used for motion analysis. J. Biomechanics 10, 377-382.

Woltring, H. (1985) On optimal smoothing and derivative estimation from noisy displacement data in biomechanics. J. Hum. Mov. Sci. 4, 229-245.

Woltring, H. (1986) A FORTRAN package for generalized cross-validation spline smoothing and differentiation. Adv. Eng. Soft. 8, 104-113.

Wonnacott, T. H. and Wonnacott, R. J. (1981) Regression: A Second Course in Statistics. Wiley, New York.

Wood, G. A. (1982) Data smoothing and differentiation procedures in biomechanics. In Exercise and Sport Sciences Reviews, Vol. 10, pp. 309-362. Williams and Wilkins, Baltimore.