COMPUTERS AND BIOMEDICAL RESEARCH 25, 384-391 (1992)

LMSMVE: A Program for Least Median of Squares Regression and Robust Distances*

GERARD E. DALLAL† AND PETER J. ROUSSEEUW

Received October 30, 1991

The program LMSMVE performs robust regression analysis by using the method of the least median of squares. It also computes robust distances to locate leverage points, that is, outliers with respect to the set of independent variables. LMSMVE constructs plots of least median of squares residuals against robust distances. Both methods can tolerate up to half the data being outliers before they fail to give results that describe the bulk of the data. A complete system that operates directly on SYSTAT files is available for the IBM PC and compatibles; it includes a utility that converts ASCII files to SYSTAT format. © 1992 Academic Press, Inc.

* Copies of the executable file and machine-readable user's guide can be obtained by sending a formatted disk and self-addressed mailer to Gerard E. Dallal at USDA HNRC. Copies of the PROGRESS and MINVOL source code can be obtained by sending two formatted floppy disks and a self-addressed mailer to Peter J. Rousseeuw. The contents of this publication do not necessarily reflect the views or policies of the U.S. Department of Agriculture, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.
† Also affiliated with Tufts University School of Nutrition, 132 Curtis Street, Medford, MA 02155.

INTRODUCTION

Regression

More computing time is spent on ordinary least-squares regression than on any other applied statistical technique. Least-squares regression minimizes the sum of squared residuals (differences between observed and predicted response) in solving the set of equations

    y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i, \qquad i = 1, \ldots, n,

where Y is the dependent variable, X_1, . . . , X_p is a set of independent variables, β_0, . . . , β_p is a set of unknown regression coefficients, and ε_1, . . . , ε_n is a set of independent, normally distributed errors with mean 0 and common, unknown variance. The least-squares solution is notorious for its capacity to be dramatically distorted by even a single outlier. Sometimes, a large reduction in the residual sum of squares is obtained by making sacrifices in fitting the bulk of the data in order to realize a large gain in fitting the outliers. In these cases, even a single outlying value can result in a regression surface that fails to go through the vast majority of the data. It is even possible for well-behaved observations to have larger residuals than the outliers, as in Fig. 1.

FIG. 1. Robustness of LMS regression and lack of robustness of ordinary least-squares regression with respect to an outlier.

One approach to avoiding the problems arising from outliers was the development of robust regression techniques, involving estimators that are not as easily influenced by outliers. With least absolute values (or L_1) regression, the sum of the absolute residuals is minimized (1). With M-estimators, the square is replaced by another symmetric function that gives outliers less influence over the fitted equation (2). Least median of squares (LMS) regression (3, 4) minimizes the median squared residual. Because the median squared residual is not changed by outlying observations, response outliers do not distort the fitted surface and are easily identified in the corresponding residual plots. Intuition suggests, correctly, that because the objective function involves a median, up to 50% of the data can be outliers before the results fail to describe the bulk of the data.

Leverage Points

Leverage points are observations that are outlying with respect to the independent variables. If a leverage point is also a regression outlier, that is, if it does not follow the same regression surface as the bulk of the data, the least-squares regression surface will be altered greatly. On the other hand, if the leverage point follows the same regression surface as the bulk of the data, the least-squares regression coefficients will have very small standard errors and may appear to be estimated more accurately than the rest of the data suggest.

Many classical methods of outlier detection make use of the sample mean and covariance matrix. The Mahalanobis distance is typical. Essentially, it is the distance between an observation and the sample mean vector, standardized by the sample covariance matrix. The Mahalanobis distance is

    MD_i = \sqrt{(x_i - T(X))\, C(X)^{-1} (x_i - T(X))'},

where x_i is an individual observation, X is the cases-by-variables data matrix, T(X) is the vector of arithmetic means, and C(X) is the sample covariance matrix. Such methods may prove ineffective because a small group of outliers can attract the sample mean vector in their direction and inflate the covariance matrix, thereby masking their presence. One widely used measure of leverage is based on the diagonal elements of the hat matrix

    H = X(X'X)^{-1}X',

which got its name from the expression for the predicted values from a multiple linear regression, ŷ = Hy. The diagonal elements of the hat matrix can be shown to be related directly to the Mahalanobis distance:

    H_{ii} = (MD_i)^2/(n - 1) + 1/n.
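
Both quantities are easy to compute, and the identity is easy to verify numerically. The following minimal Python sketch (ours, not part of the FORTRAN program; the names and simulated data are invented for illustration) computes the classical distances and checks the relation for a design matrix that includes a constant column:

    import numpy as np

    def mahalanobis_distances(X):
        """Classical MD_i for each row x_i of the n-by-p data matrix X."""
        T = X.mean(axis=0)              # T(X): vector of arithmetic means
        C = np.cov(X, rowvar=False)     # C(X): sample covariance matrix
        diff = X - T
        # MD_i^2 = (x_i - T(X)) C(X)^{-1} (x_i - T(X))'
        return np.sqrt(np.einsum("ij,jk,ik->i", diff, np.linalg.inv(C), diff))

    # Check H_ii = (MD_i)^2/(n - 1) + 1/n on simulated data.
    rng = np.random.default_rng(0)
    n, p = 30, 3
    Z = rng.normal(size=(n, p))               # predictor columns only
    X = np.column_stack([np.ones(n), Z])      # design matrix with constant term
    H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix: y-hat = H y
    assert np.allclose(np.diag(H), mahalanobis_distances(Z) ** 2 / (n - 1) + 1 / n)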

Therefore, the diagonal of the hat matrix is no more reliable than the Mahalanobis distance in identifying leverage points. In our program, leverage points (and multivariate outliers in general) are detected by computing distances using a formula similar to that for the Mahalanobis distance, but with location and scale estimates based on the minimum volume ellipsoid covering half the observations (5, 6). That is, we let T(X) be the center of the minimum volume ellipsoid covering half the observations and C(X) be determined by the same ellipsoid, multiplied by a correction factor to obtain consistency at multinormal distributions. Neither T(X) nor C(X) is affected by outlying observations, and multidimensional outliers can now be easily identified.
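
In the terminology of (6), the resulting robust distance of case x_i is simply the Mahalanobis formula with these resistant estimates substituted for the sample mean and covariance,

    RD_i = \sqrt{(x_i - T(X))\, C(X)^{-1} (x_i - T(X))'},

where T(X) and C(X) now denote the minimum volume ellipsoid estimates.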

THE PROGRAM LMSMVE

LMSMVE is a program written for the IBM PC and compatibles running under DOS 2.0 or later versions. A math coprocessor is recommended but not required. The program is written in Microsoft FORTRAN version 5.0 and compiled with all optimization turned off. Double precision calculations are used to obtain the regression estimates and robust distances. LMSMVE standardizes each variable (by subtracting a robust location estimate and dividing by a robust scale estimate) to avoid the problems inherent in analyzing variables that differ by orders of magnitude. Results are transformed back to the original scale before they are reported.

LMSMVE can process up to 300 observations and 10 independent variables stored in SYSTAT/MYSTAT data files. For those who wish to work independently of SYSTAT and MYSTAT, LMSMVE is accompanied by a utility (7) that converts space or comma delimited ASCII files of numerical data into SYSTAT/MYSTAT files.

The LMS regression surface lies in the middle of the two closest (in the Y direction) hyperplanes that contain half the observations. LMSMVE finds the LMS solution by fitting regression surfaces to subsets of p + 1 points. This is merely a matter of solving systems of p + 1 linear equations with p + 1 unknown parameters. The LMS regression is given by those parameters, chosen over all possible sets of p + 1 observations, that yield the minimum median squared residual when applied to the entire data set. Evaluating all possible subsets of p + 1 observations can be computationally infeasible for large data sets. When the number of subsets, \binom{n}{p+1}, is greater than a few hundred, LMSMVE draws random samples of p + 1 observations and determines the best solution obtained from these randomly selected subsets. Along with the LMS regression comes a robust scale estimate,

    1.4826 \left(1 + 5/(n - p - 1)\right) \sqrt{\text{median squared residual}}.
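
A minimal Python sketch of this search (ours; the function and variable names are invented, and exhaustive enumeration stands in for the random sampling used on large problems):

    from itertools import combinations
    import numpy as np

    def lms_regression(X, y):
        """Least median of squares fit; X is n-by-p without the constant column."""
        n, p = X.shape
        Xc = np.column_stack([np.ones(n), X])      # prepend the constant term
        best_med, best_beta = np.inf, None
        for idx in combinations(range(n), p + 1):  # all subsets of p + 1 points
            A, b = Xc[list(idx)], y[list(idx)]
            try:
                beta = np.linalg.solve(A, b)       # exact fit through the subset
            except np.linalg.LinAlgError:
                continue                           # singular subset; skip it
            med = np.median((y - Xc @ beta) ** 2)  # median squared residual, all data
            if med < best_med:
                best_med, best_beta = med, beta
        # robust scale estimate quoted above
        scale = 1.4826 * (1 + 5.0 / (n - p - 1)) * np.sqrt(best_med)
        return best_beta, scale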

The LMS regression and scale estimates are used to identify observations with outlying residuals, defined to be those observations whose residual is greater than 2.5 times the robust scale estimate. These observations are set aside, and ordinary least squares is applied to the remaining observations to produce a reweighted least-squares regression solution. The output from LMSMVE contains three analyses for each set of data: (1) ordinary least squares (LS), (2) least median of squares, and (3) reweighted least squares (RLS).

Rousseeuw's robust distances are obtained in a similar fashion. For smaller problems, all possible samples of p + 1 observations are examined. For each sample, the mean and covariance matrix are calculated and the corresponding ellipsoid is inflated to contain the appropriate number of points. The robust distances are obtained from the ellipsoid with smallest volume. For larger problems, many random samples of p + 1 observations are drawn.
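
Sketches of both steps, continuing the one above (again ours, and simplified: LMSMVE applies the standardization and the multinormal consistency factor that these functions omit; reweighted_ls reuses lms_regression from the previous sketch):

    import numpy as np

    def reweighted_ls(X, y):
        """Set aside points with |residual| > 2.5 robust scales, then run OLS."""
        beta_lms, scale = lms_regression(X, y)
        Xc = np.column_stack([np.ones(len(y)), X])
        keep = np.abs(y - Xc @ beta_lms) <= 2.5 * scale      # zero/one weights
        beta_rls, *_ = np.linalg.lstsq(Xc[keep], y[keep], rcond=None)
        return beta_rls, keep

    def mve_location_scatter(X, n_samples=1000, seed=1):
        """Approximate MVE estimates T(X), C(X) from random (p+1)-subsets."""
        n, p = X.shape
        h = (n + p + 1) // 2                      # ellipsoid must cover half the data
        best = (np.inf, None, None)
        rng = np.random.default_rng(seed)
        for _ in range(n_samples):
            idx = rng.choice(n, size=p + 1, replace=False)
            T = X[idx].mean(axis=0)
            C = np.cov(X[idx], rowvar=False)
            det = np.linalg.det(C)
            if det < 1e-12:                       # degenerate subset; skip it
                continue
            diff = X - T
            d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(C), diff)
            m2 = np.sort(d2)[h - 1]               # inflate to cover h observations
            vol = np.sqrt(det * m2 ** p)          # proportional to ellipsoid volume
            if vol < best[0]:
                best = (vol, T, m2 * C)
        return best[1], best[2]                   # no consistency correction applied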


TABLE 1

ORDINARY AND REWEIGHTED LEAST-SQUARES ANALYSES OF KIM WEIGHT GAIN DATA

Dependent variable: WTGAIN
Independent variables: KCAL, CP-RATIO, PCTFAT, CONSTANT

Least-squares regression

Variable     Coefficient   Standard error    t-Value    P-Value
KCAL             0.28455          0.05817    4.89167    0.00007
CP-RATIO       -75.84943         30.27073   -2.50570    0.02011
PCTFAT           3.80797          2.97770    1.27883    0.21428
CONSTANT       109.85190         96.11108    1.14297    0.26534

Sum of squares = 19244.67000
Degrees of freedom = 22
Scale estimate = 39.57630
Coefficient of determination (R²) = 0.79409

Reweighted least-squares regression based on LMS

Variable     Coefficient   Standard error    t-Value    P-Value
KCAL             0.33632          0.05694    5.90631    0.00001
CP-RATIO       -85.64302         24.58165   -3.48402    0.00234
PCTFAT           7.23812          2.57618    2.80963    0.01082
CONSTANT       -13.13696         90.12605   -0.14576    0.88557

Weighted sum of squares = 11400.28000
Degrees of freedom = 20
Scale estimate = 23.87496
Coefficient of determination (R²) = 0.80961
There are 24 points with nonzero weight
Average weight = 0.92308

Commands

In many ways, LMSMVE behaves like a SYSTAT module. The USE command accesses data files, the SUBMIT command submits files of commands for batch processing, and the OUTPUT command directs program output to the screen, printer, or disk file. The SAVE command creates a SYSTAT system file containing predicted values and residuals from the least squares, least median of squares, and reweighted least-squares regression, along with robust distances and the weights determined by LMS. The model command

    MODEL (Y variable) = CONSTANT + (variable list)

specifies the model and ESTIMATE causes the model to be fitted. ADD and DELETE commands allow variables to be added to or removed from the model without respecifying the entire model.
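
For illustration, a hypothetical command file for the weight gain analysis of the Example below might read as follows; the file names KIMDATA and KIMOUT are invented, and placing SAVE before ESTIMATE follows the usual SYSTAT convention rather than anything documented here:

    USE KIMDATA
    SAVE KIMOUT
    MODEL WTGAIN = CONSTANT + KCAL + CPRATIO + PCTFAT
    ESTIMATE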


FIG. 2. Plots of predicted values, distances, and standardized residuals from ordinary least-squares and least median of squares regression analyses of Kim weight gain data.

LMSMVE allows the user to specify the number of random samples of observations to be examined in the search for the LMS regression and robust distances. The default option gives a probability greater than 95% of analyzing samples that are free of outlying observations. Two random number generators are provided: one to allow LMSMVE to duplicate analyses produced by Rousseeuw and Leroy's original PROGRESS program (4), and another (8), modified according to (9), with a much longer period, for examining many more samples than were possible in PROGRESS.
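
The reasoning behind this default follows the argument in (3, 4); the notation here (f for the fraction of outliers, m for the number of subsets) is ours. A single random subset of p + 1 observations is free of outliers with probability (1 - f)^{p+1}, so the probability that at least one of m independent subsets is outlier-free is

    1 - \left(1 - (1 - f)^{p+1}\right)^m,

and m can be chosen to push this probability above 0.95 for contamination fractions approaching one-half.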


The PLOT command can be used to request plots of standardized residuals versus fitted Y-values and each observation's index (the order in which it appears in the data set). Plots of LS residuals against Mahalanobis distances and LMS and RLS residuals against Rousseeuw's robust distances are produced whenever distances are requested through the DISTANCE command. LMSMVE displays high-resolution plots on screen; they can be printed by pressing the Shift-PrtSc key combination if DOS's GRAPHICS command has been issued before using LMSMVE. At the user's option, low-resolution plots composed of line printer graphics can be sent to output files so that the output is composed solely of ASCII characters. A complete user's guide is contained on the program disk.

SOURCE CODE AND OTHER IMPLEMENTATIONS

The program LMSMVE is based on a combination of two earlier implementations of these methods. The first of these is PROGRESS (for "program for robust regression"), which has been developed since 1983 and is fully described in (4). The second is MINVOL (for "minimum volume ellipsoid"), which has been developed since 1985 and is described in (5). Both are stand-alone programs written in portable FORTRAN, except for a few I/O operations. The source code for PROGRESS and MINVOL can be obtained through the Internet by sending the two-line message

    SEND PROGRESS FROM GENERAL
    SEND MINVOL FROM GENERAL

to STATLIB@LIB.STAT.CMU.EDU.

Both methods have been incorporated into the statistical software package S-PLUS (Statistical Sciences, Inc., 1700 Westlake Ave. N., Suite 500, Seattle, WA 98109, (206) 283-8802). There are versions of S-PLUS for DOS and most Unix machines. The least median of squares regression estimator is given by the function "lmsreg," which is based on (but not identical to) PROGRESS. Robust distances can be obtained from the function "cov.mve," which has been added to S-PLUS recently. The results of lmsreg and cov.mve are not identical to those of LMSMVE because of differences in precision, random number generators, number of replications, and so on, but the statistical conclusions will typically be the same.

EXAMPLE

Reference (10) studied the 9-week weight gains of 26 normal rats allowed to choose their food ad libitum from three isocaloric diets that differed in their percentages of carbohydrate, fat, and protein. We use these data to see whether the global percentage of fat in the self-selected diet (PCTFAT) helps predict weight gain (WTGAIN) after adjusting for daily caloric intake (KCAL) and the global carbohydrate-protein ratio (CPRATIO). Table 1 gives the results of the least-squares regression analysis applied to the entire data set and the reweighted least-squares analysis after two regression outliers identified by LMSMVE have been set aside.


The reweighted analysis, but not the original least-squares analysis, suggests that even after total caloric intake and the carbohydrate-protein ratio are taken into account, those rats that ate diets with a higher percentage of fat had a greater weight gain.

Figure 2 was constructed by using SYGRAPH to create displays based on the SAVEd SYSTAT system file of residuals, predicted values, and distances. The plot of least-squares standardized residuals against fitted values shows all standardized residuals to be less than 2.5. The plot of least median of squares standardized residuals against fitted values contains two regression outliers. The plot of LMS standardized residuals against Rousseeuw's robust distances shows one of the regression outliers to be a leverage point as well, which accounts for the change in the significance of percentage fat in the regression equation.

REFERENCES

1. NARULA, S. C., AND WELLINGTON, J. F. The minimum sum of absolute errors regression: A state of the art survey. Int. Stat. Rev. 50, 317 (1982).
2. HUBER, P. "Robust Statistics." Wiley, New York, 1981.
3. ROUSSEEUW, P. Least median of squares regression. J. Am. Stat. Assoc. 79, 871 (1984).
4. ROUSSEEUW, P., AND LEROY, A. "Robust Regression and Outlier Detection." Wiley, New York, 1987.
5. ROUSSEEUW, P. Multivariate estimation with high breakdown point. In "Mathematical Statistics and Applications, Vol. B" (W. Grossman, G. Pflug, I. Vincze, and W. Wertz, Eds.), pp. 283-297. Reidel, Dordrecht, 1985.
6. ROUSSEEUW, P., AND VAN ZOMEREN, B. C. Unmasking multivariate outliers and leverage points (with discussion). J. Am. Stat. Assoc. 85, 633 (1990).
7. DALLAL, G. E. WUBA-WUBA: From ASCII files to MYSTAT/SYSTAT files. Am. Stat. 41, 328 (1987).
8. WICHMANN, B. A., AND HILL, I. D. Algorithm AS 183: An efficient and portable pseudo-random number generator. Appl. Stat. 31, 188 (1982).
9. MCLEOD, A. I. Remark AS R58: A remark on algorithm AS 183, an efficient and portable pseudo-random number generator. Appl. Stat. 34, 198 (1985).
10. KIM, S. H., MAURON, J., GLEASON, R., AND WURTMAN, R. Selection of carbohydrate to protein ratio and correlations with weight gain and body fat in rats allowed three dietary choices. Int. J. Vitam. Nutr. Res. 61, 166 (1991).