Computational Statistics & Data Analysis 15 (1993) 13-46
North-Holland
Fitting additive models to regression data: Diagnostics and alternative views

Leo Breiman *

University of California, Berkeley, CA, USA

Received November 1990
Revised August 1991

Abstract: A method is given for fitting additive models to multivariate regression-type data. The emphasis is on diagnostics analogous to the concept of influence and on alternative models. These are illustrated in three case studies. Simulations show excellent predictive accuracy as compared with existing methods.

Keywords: Additive models; Splines; Cross-validation; Knot deletion; Stability.

Correspondence to: Leo Breiman, Ph.D., Department of Statistics, University of California, Berkeley, CA 94720, USA.
* Partially supported by NSF grant DMS-8718362.
1. Introduction

Given regression-type data (y_n, x_n), n = 1, ..., N, with x_n = (x_{n1}, ..., x_{nM}), an additive model is a fit of the form

ŷ = Σ_{m=1}^M φ_m(x_m),
where the {φ_m} are in a fairly general class of smooth functions. If x is one-dimensional, then this is the univariate curve fitting problem, with many papers published on the subject. For x multivariate, the literature on additive model fitting is smaller. One general method is the ACE algorithm (Breiman and Friedman, 1985). Following this is work by Stone (1986), Stone and Koo (1986), and Hastie and Tibshirani (1986, 1987) on generalized additive models. Buja et al. (1989) explore the structure and convergence of the iterative 'backfitting' algorithm used in ACE, and Tibshirani (1988) develops an ACE-like algorithm based on a variance stabilization criterion. Wahba and her coworkers
are approaching additive models through the use of smoothing splines, but I am not aware of any published work yet testing the methods involved. Friedman and Silverman (1989) developed an interesting new method called TURBO, and in the discussion, Hastie proposes an alternative method based on ACE-type backfitting. A good general reference is the recent Hastie and Tibshirani book (1990).

Since the publication of the ACE algorithm and with its widespread use, some deficiencies have been noted. For small sample sizes (≤ 100) the results can be noisy, and diagnostics are difficult. A new method for fitting additive models, which remedies these problems, is given in this paper, with emphasis on diagnostics and alternative views.

Fitting additive models is a fairly new statistical endeavor. To understand some of the elements in our approach, it is useful to summarize the salient difficulties.

1.1. Problems in fitting additive models
What is clear, both in the univariate work and in the recent multivariate work, is that the complexity of additive model fitting far exceeds that of fitting linear models. The space of linear models for data with M predictor variables is M-dimensional, and least squares regression chooses one of these. The space of additive models for the same data is a much higher dimensional space; often, the dimensionality is not well defined. The data are then being used to choose among a wider range of alternatives. The result is sensitive both to the class of additive models available for selection and to the selection method. With this in mind, diagnostics and alternatives become important. Here are some of the issues related to this fitting problem:

(i) Controlling the degrees of freedom
Suppose bivariate data (y_n, x_n), n = 1, ..., N is generated from the model

y_n = f(x_n) + ε_n,

with {ε_n} i.i.d. mean-zero noise. Estimates of f(x) are based on local weighted averaging of the y_n corresponding to x_n in the neighborhood of x. The critical question is how big to make this averaging window. If it is too big, local detail in f may be smeared out; too small, and the estimate of f is noisy. If an attempt is made to estimate f by regressing the y_n on a polynomial of the form Σ_{m=0}^M β_m x^m, then the size of the window is proportional to 1/(M + 1). In this sense, window size and number of degrees of freedom in the fit are closely related.

Many papers in the univariate curve fitting situation have focused on methods for window size selection. Some smoothing methods (kernel, smoothing splines) use cross-validation to estimate a global optimal window size. Others advocate a locally adaptive window size: narrow where f(x) is rapidly changing and wider in intervals of slow change. Super-smoother (Friedman and Stuetzle, 1982), the smoother used in ACE, estimates a locally adaptive window
size. Härdle and Bowman (1988) propose a locally adaptive window size selection for kernel smooths. All of this work, both theoretical and through the use of simulated data, has shown that selection of the right number of degrees of freedom is a critical ingredient in accurate univariate curve fitting. The idea in all methods is to have available a large class {f(x)} of potential fitting functions and to select ŷ ∈ {f(x)} with an appropriate number of degrees of freedom. In the multivariate case with x = (x_1, ..., x_M), the fits on different x_m may require different degrees of freedom. Some φ_m may underfit, others may overfit. The question of determining the degrees of freedom for all {x_m} to give an accurate fit is difficult.

(ii) Stability and influence

The concept of influence is important in linear regression, but even more so in fitting additive models. In the additive model context the key concept is more properly called stability. The idea of stability is that the transformations φ_m remain about the same under minor perturbations of the data, say when a small number of cases are removed. In data analytic projects where the shapes of the transformations are of scientific interest, stability is an important issue. If transformations change drastically when one or a few cases are removed, then they do not reflect an overall pattern in the data.

There are a number of mechanisms through which one or a few cases can have a large effect. One occurs when two variables are strongly correlated, leading to inherent instability between their transformations. If one or a few cases are deleted, the two individual transformations may change dramatically even though their sum remains more or less the same. Another mechanism occurs when a case is 'influential' in determining a transformation because of its position in the body of the data. A third mechanism occurs when two or more models fit the data almost equally well. Then deletion of one or several cases may swing the balance from one model to another. All of these mechanisms occur in the fittings of the data sets in the third section, and serve as illustrations of our diagnostics.

(iii) Alternative views

If the only goal were prediction, then the issues in additive modeling would be simpler. Almost every data analyst who has run a best subsets variable selection procedure in linear regression knows that it is possible to have a number of very different linear predictors, using the same number of variables, giving about the same residual sum of squares; or, in simulations, giving about the same prediction accuracy. So, if prediction is the only goal, pick any one. Often, however, the form of the predictor is of scientific interest, leading to conclusions about the effects of the predictor variables on the response. In this situation, the capability of the best subsets procedure to give alternative but equally predic-
tive models is invaluable. Understanding the connection between these different but 'equally good' models can give improved insight into the relationships in the data. In additive fitting, the number of different but equally predictive models may be larger. Being able to view alternative models becomes important. In particular, the analyst should be able to look at models with varying degrees of freedom and with the potential of showing either more or less local detail (at the risk of having it confounded with noise).

1.2. Organization of results

In Section 2 we give a description of the DK/CV (delete-knot/cross-validation) program for constructing additive models and of its diagnostic and alternative view capabilities. Section 3 illustrates the use of DK/CV on 3 data sets, emphasizing the ideas of stability and alternative views. Section 4 gives the results of a number of simulation experiments that compare DK/CV to ACE and the Friedman-Silverman (1989) program TURBO.
2. The structure of DK/CV

The idea of DK/CV is basically simple. For each m, many knots {t_km}, k = 1, ..., k_m, are put down on the range of x_m, and functions g_km(x_m) defined as (x_m − t_km)_+^3, together with g_0m(x_m) = x_m and g_00 = 1. Fit the model

ŷ = Σ_{k,m} β_km g_km(x_m),

where the {β_km} are the OLS coefficients found by minimizing

Σ_n ( y_n − Σ_{k,m} β_km g_km(x_nm) )²,

and define φ_m(x_m) = Σ_k β_km g_km(x_m). Now stepwise deletion of the variables (x_m − t_km)_+^3 is carried out. Deletion of these variables corresponds to knot deletion. This procedure of knot deletion, first suggested in univariate curve fitting by P. Smith (1982), has attractive features. One can see, either analytically or by example, that deleting a knot causes an increased window size only in the neighborhood of the removed knot. Therefore, knot deletion curve fitting is a locally adaptive window size method. The fact that it is an effective univariate smoother was established in a large simulation (Breiman and Peters, 1988). In the multivariate procedure for additive modeling, all knots on all variables are treated equally, so that the knot deleted at any given stage is the one whose deletion causes the least increase in residual sum of squares.
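To make the construction concrete, here is a minimal univariate sketch of the truncated power basis and the greedy deletion step, in Python rather than the author's Fortran; the function names and the brute-force refitting at each stage are illustrative assumptions, and the natural-spline endpoint constraints of Section 2.3 are omitted.

```python
import numpy as np

def power_basis(x, knots):
    """Columns: 1, x, (x - t_k)_+^3 for each knot t_k."""
    cols = [np.ones_like(x), x] + [np.clip(x - t, 0, None) ** 3 for t in knots]
    return np.column_stack(cols)

def rss(X, y):
    """Residual sum of squares of the OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def delete_knots(x, y, knots):
    """Greedy backward knot deletion; returns the (knots, RSS) path."""
    knots = list(knots)
    path = [(list(knots), rss(power_basis(x, knots), y))]
    while knots:
        # try removing each remaining knot; keep the removal with smallest RSS
        trial = [rss(power_basis(x, knots[:i] + knots[i + 1:]), y)
                 for i in range(len(knots))]
        i_best = int(np.argmin(trial))
        del knots[i_best]
        path.append((list(knots), trial[i_best]))
    return path
```

The actual program avoids refitting from scratch at each stage by updating the X^tX matrix with Gaussian sweeps, as described in the Appendix.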
While using a spline basis and knot deletion is an attractive concept, a number of issues had to be resolved in order to make it work well:
(i) How many knots should be deleted?
(ii) How many initial knots should be used and where should they be placed?
(iii) How can end point variance be controlled?
Here are the answers used in DK/CV.

2.1. Cross-validation

Cross-validation is used to select the dimensionality of the model. The cases are randomly divided into V groups, L_1, ..., L_V, of as nearly the same size as possible. Leave out the vth group of cases and construct an OLS model ŷ_J^(v)(x) using the rest of the data, where J is the total number of basis elements {g_km}. As the deletion proceeds, a succession of models ŷ_J^(v), ŷ_{J−1}^(v), ... is generated. The V-fold cross-validated estimate for the prediction error using j basis elements is

PE(j) = (1/N) Σ_{v=1}^V Σ_{n ∈ L_v} ( y_n − ŷ_j^(v)(x_n) )².
The dimensionality selected is that j, 0 ≤ j ≤ J, which minimizes PE(j).

With the deletion procedure and cross-validation, attention has to be paid to problems of numerical accuracy and computational efficiency. In the basic algorithm, X^tX is computed and the inversion and deletion carried out using double precision Gaussian sweeps. Forming X^tX takes about as much time as the deletion sequence. In V-fold cross-validation, the cases are randomly permuted. Then the first N/V of them are used as the group L_1, the second N/V as L_2, etc. Let X^tX^(v) be the sum-of-squares matrix computed for L_v. Then X^tX = Σ_v X^tX^(v), and the X^tX matrix leaving out the vth group is X^tX − X^tX^(v). Using this computational procedure substantially reduces the burden of cross-validation.
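A minimal sketch of this bookkeeping, assuming nearly equal-size groups and generic linear algebra (the helper names are illustrative, not from the DK/CV code):

```python
import numpy as np

def cv_gram_matrices(X, y, V, seed=0):
    """Per-group Gram matrices; leaving one group out costs one subtraction."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    groups = np.array_split(idx, V)
    XtX = [X[g].T @ X[g] for g in groups]          # X^t X^(v)
    Xty = [X[g].T @ y[g] for g in groups]          # X^t y^(v)
    total_XtX, total_Xty = sum(XtX), sum(Xty)
    # coefficients with group v left out:
    # solve (X^tX - X^tX^(v)) b = X^ty - X^ty^(v)
    betas = [np.linalg.solve(total_XtX - XtX[v], total_Xty - Xty[v])
             for v in range(V)]
    return groups, betas

def cv_pe(X, y, groups, betas):
    """V-fold cross-validated prediction error estimate."""
    sse = sum(np.sum((y[g] - X[g] @ b) ** 2) for g, b in zip(groups, betas))
    return sse / len(y)
```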
2.2. Number and placement of initial knots

Our first approach towards setting the initial number of knots was to put down many. In the Breiman and Peters simulation (1988), 9 knots were used for 25 bivariate cases and larger numbers were used for increased sample sizes. However, more careful work on a variety of datasets revealed that using too many initial knots could have an adverse effect on the fit of the model. The problem is that with too many initial knots, there are too many alternative deletion paths. This leads to higher variability in the model selected. The most attractive solution arrived at is: let K be the number of initial knots per variable (assumed the same for each variable). Go through the deletion process and let PE(K) be the minimum cross-validated PE estimate among the deleted models. Now take K to minimize PE(K).

Knot placement is important. Experimentation shows that knot placement appreciably affects the accuracy of the results. Further, poor knot placement can lead to X^tX matrices that cannot be stably inverted. In the univariate situation with data (y_n, x_n), n = 1, ..., N, and {x_n} randomly selected from the density f(x), Agarwal and Studden (1980) show that the asymptotically optimal knot density is proportional to f^{1/9}. The knot placement algorithm used in DK/CV is based on this result. The data between min(x_n) and max(x_n) is binned into L equal intervals, where L = int(5N^{1/3}). If there are c(l) x_n's in the lth bin, then n(l) = [c(l)]^{1/9} is an unnormalized estimate of f^{1/9}. Let I_1 consist of the bins 1, ..., l_1, I_2 of the bins l_1 + 1, ..., l_2, etc., up to I_K, and let N_k be the total of the n(l) for the bins in I_k. The problem then is to split the bins into I_1, ..., I_K so that N_1, ..., N_K are as equal as possible. If 'as equal as possible' is interpreted as var(N_k) = minimum, this leads to Σ_k N_k² = minimum, and the problem can be solved via a dynamic programming algorithm (see the sketch below). Then t_1 = min(x_n), t_2 is the right endpoint of I_1, t_3 the right endpoint of I_2, etc.

When the {x_n} data has many ties or values clustered close together, not enough bins have n(l) > 0 to support the desired number of knots. The algorithm senses this and reduces the number of knots. Under testing with a diversity of both ordinary and strangely gapped x-distributions, this knot placement procedure has been stable and produced better results than either equispaced order statistic knots or knots equispaced in the usual metric. With many x-variables the algorithm is applied individually to the values of each x_m. The values of the knots are not cross-validated, but are taken to remain the same for the reduced learning sets used in the cross-validation process.
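A sketch of that dynamic program, assuming the unnormalized bin weights n(l) have already been computed; the helper below is hypothetical and minimizes Σ_k N_k² over cuts of the L bins into K consecutive groups.

```python
import numpy as np

def split_bins(w, K):
    """Cut weights w[0..L-1] into K consecutive groups minimizing the sum
    of squared group totals; returns group boundaries [0, c_1, ..., L]."""
    L = len(w)
    csum = np.concatenate([[0.0], np.cumsum(w)])
    seg = lambda i, j: (csum[j] - csum[i]) ** 2   # cost of group w[i:j]
    INF = float("inf")
    cost = [[INF] * (L + 1) for _ in range(K + 1)]
    cut = [[0] * (L + 1) for _ in range(K + 1)]
    cost[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(k, L + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + seg(i, j)
                if c < cost[k][j]:
                    cost[k][j], cut[k][j] = c, i
    bounds, j = [L], L                # backtrack the optimal cut points
    for k in range(K, 0, -1):
        j = cut[k][j]
        bounds.append(j)
    return bounds[::-1]
```

The right endpoints of the resulting intervals I_1, ..., I_{K-1} then supply the knots t_2, ..., t_K.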
2.3. End point variance and equivariance

Again, consider the univariate case (y_n, x_n), n = 1, ..., N. Suppose there are knots at x_− = min(x_n) < t_1 < ... < t_K < max(x_n) = x_+. Then the power spline basis for all functions {h(x)} on [x_−, x_+] which are piecewise cubics with at most 3rd derivative discontinuities at t_k, k = 1, ..., K, consists of

1, x, x², x³, {(x − t_k)_+^3}.    (2.1)

If an OLS fit f̂(x) to the y_n is made using the basis in (2.1), then the variance of f̂(x) increases as x approaches the endpoints x_−, x_+. Extrapolation outside of [x_−, x_+] is through cubic polynomials. What is desired is a more moderate variance expansion at the endpoints and a linear extrapolation scheme. This can be done by going to 'natural splines'. Take t_1 = x_−, t_k ∈ (x_−, x_+) for k = 2, ..., K, t_{K+1} = x_+, and {h(x)} the set of piecewise cubic polynomials on (−∞, ∞) such that h″(t_1) = h″(t_{K+1}) = 0 and h‴(t_1 −) = h‴(t_{K+1} +) = 0, with only third derivative discontinuities at
t_1, ..., t_{K+1}. The conditions imposed make h(x) linear at the endpoints x_−, x_+. Enforcing this condition stiffens the ends and has proven, in many simulations, to improve accuracy. A basis for the set {h(x)} consists of the functions

1, x, (x − t_k)_+^3,  k = 1, ..., K + 1.

If the knot at t_1 is deleted, then h(x) is linear to the left of t_2. If the knot at t_{K+1} is deleted, then equivariance requires that h(x) be linear to the right of t_K. These conditions and their extensions lead to solving least squares problems under linear constraints. The sequence of deletions under these constraints can be accomplished using a modified Gaussian sweep procedure. This algorithm is described in the Appendix.

Thus, in DK/CV, for specified K, the knot placement algorithm puts K knots t_{1m} = min_n(x_nm), t_{2m}, ..., t_{Km} down on the range of x_m. The basis elements are

g_km(x_m) = (x_m − t_km)_+^3,    g_0m(x_m) = x_m,
and the deletion process is started. To ensure equivariance, the linear function g_0m(x_m) is not deleted unless all knots on x_m are first deleted. Once g_0m(x_m) is deleted, the entire variable is absent from the model.

2.4. Alternative views and RSS-extremeness

One way of generating alternative models is to vary the number of initial knots, K. Perhaps the optimal cross-validated value of K is 5, but setting K equal to 10 might reveal some interesting local structure. However, K = 10 with, say, 10 variables gives well over 100 models to examine. The concept of RSS-extremeness reduces the number of interesting models to such a relatively small number that each one can be inspected in detail.

Using all of the data and going through the deletion process generates a sequence of models {ŷ_j} using j basis elements and corresponding residual sums of squares RSS(j), j = 0, 1, ..., J. Suppose there is a cost α per variable and define the total cost of using j basis elements as

R(j, α) = RSS(j) + αj,  j = 1, ..., J.

Look at the subset of values j such that for some α > 0,

R(j, α) = min_{j'} R(j', α).

For each such minimizing j there is a range [α⁻(j), α⁺(j)] of α-values for which it is a minimizer. Let σ̂² be the noise variance estimate based on the full model of dimension J. Define a minimizer j to be RSS-extreme if α⁺(j)/σ̂² ≥ 2 and α⁻(j)/σ̂² ≤ 10. Then, cross-validation is used to select only among the RSS-extreme models.
It is not difficult to find the minimizing j. Let C be the lower convex hull of the {RSS(j)}. Then the minimizing j's are the values of j such that the C(j) are extreme points of C. For α = 2σ̂², the minimizer of R(j, α) is the model minimizing Mallows' C_p criterion. The prevalent wisdom in the field is that the dimension j of the best predictive model is the minimizer of R(j, α) for some α in the range 2σ̂² ≤ α ≤ 10σ̂² (Miller, 1990). The DK/CV program prints out the [α⁻, α⁺] range for all RSS-extreme models and allows the inspection of the output of any one of these models.

Besides providing alternative views, the use of RSS-extreme models has advantageous side effects. First, by reducing the number of models to be compared, the cross-validation computations are significantly reduced. Second, the evidence (see Breiman, 1992 and Breiman and Spector, 1989) is that the restriction to RSS-extreme models enables cross-validation to select more accurate predictive equations.
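As an illustration of the convex hull computation, here is a short sketch (not the DK/CV source) that finds the minimizing j's and the α-range over which each one minimizes R(j, α):

```python
def rss_extreme(rss):
    """rss[j] = RSS of the j-term model, j = 0..J (nonincreasing in j).
    Returns [(j, alpha_minus, alpha_plus)] for the lower-hull models."""
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    hull = []
    for p in enumerate(rss):           # points (j, RSS(j)), j ascending
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()                 # not a left turn: interior point
        hull.append(p)
    out = []
    for i, (j, r) in enumerate(hull):
        # the alpha-interval endpoints are minus the slopes of adjacent edges
        a_plus = ((hull[i - 1][1] - r) / (j - hull[i - 1][0])
                  if i > 0 else float("inf"))
        a_minus = ((r - hull[i + 1][1]) / (hull[i + 1][0] - j)
                   if i < len(hull) - 1 else 0.0)
        out.append((j, a_minus, a_plus))
    return out
```

Note that adjacent hull models share an interval endpoint, which is the pattern visible in the [α⁻, α⁺] tables of Section 3.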
2.5. Stability and other diagnostics

In fitting nonlinear transformations to small and/or noisy data sets, stability is an important issue. Deleting only a few or even one case can lead to large changes in the estimated transformations. Often, this is not reflected in the residuals, since a change in a transformation on one variable may be compensated for by changes in the transformations of other variables.

To define a measure of instability, let φ_m be the transformation on x_m estimated using all data, and φ_m^D the corresponding estimated transformation with D ≥ 1 cases deleted. Let {z_km} be 100 equi-spaced points between min x_m and max x_m, and define the resulting instability as

Δ = Σ_{k,m} (φ_m(z_km) − φ_m^D(z_km))² / Σ_{k,m} φ_m²(z_km).

In V-fold cross-validation, a number of cases ≈ N/V is left out in each cross-validation. For the vth cross-validation, Δ(v) is computed. For small data sets, leave-one-out cross-validation is computationally feasible. Then Δ(x_n) gives a measure of the instability that results from leaving out the nth case x_n. An overall measure of instability is given by Δ̄ = Av_v(Δ(v)).

Another diagnostic is 'stability intervals'. Using V-fold cross-validation, if φ̂(x) is the estimated transformation on any variable x and φ̂_v(x) the estimated transformation on the vth cross-validation, then the stability interval is

φ̂(x) ± 2(1 − 1/V)√V · SD_v(φ̂_v(x)).

The interpretation of this band is that its size relative to the size of φ̂(x) gives an indication of the location of the instability of φ̂; regions where the band is wider are indicative of larger local instability. By definition, the stability interval is the 95% jackknifed confidence interval, but simulations have not given strong support for their inferential validity in the present context.
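A sketch of the instability computation under these definitions, assuming the fitted transformations have already been evaluated on the 100-point grids (the array layout is an illustrative choice):

```python
import numpy as np

def instability(phi_full, phi_del):
    """phi_full, phi_del: arrays of shape (M, 100) holding each variable's
    transformation on its grid, fitted with all cases and with D cases
    deleted. Returns the instability Delta."""
    num = np.sum((phi_full - phi_del) ** 2)
    den = np.sum(phi_full ** 2)
    return num / den

def overall_instability(phi_full, phi_cv):
    """phi_cv: shape (V, M, 100), one refit per cross-validation group.
    Returns (Delta(v) for each v, their average Delta-bar)."""
    deltas = np.array([instability(phi_full, p) for p in phi_cv])
    return deltas, deltas.mean()
```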
Once a model is fitted, if φ_m(x_m) is the transformation on the mth variable, then the 'importance' of the mth variable is the sample standard deviation of φ_m, normalized so that the sum of the variable importances is 100.
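Under this definition, the importances can be computed as in the following sketch (a hypothetical helper, with each φ_m evaluated at the data points):

```python
import numpy as np

def importances(phi_values):
    """phi_values: shape (M, N), phi_m evaluated at the N data points.
    Returns importances scaled to sum to 100."""
    sd = phi_values.std(axis=1, ddof=1)
    return 100 * sd / sd.sum()
```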
3. Three case studies

The case studies in this section illustrate the use of DK/CV in fitting additive models.

3.1. The Wright-Gambino data
This is a dataset from organic chemistry (Wright and Gambino, 1984) analyzed in the Friedman-Silverman (1989) paper. To quote from them: "The observations are 37 compounds that were collected to examine the structure activity relationship of 6-anilinouracils as inhibitors of Bacillus DNA polymerase III. The four structural variables measured on each compound are: x1: meta substituent hydrophobic constant; x2: para substituent hydrophobic constant; x3: group size of substituent in meta position; x4: group size of substituent in para position. The response variable is the logarithm of the inverse concentration of 6-anilinouracil needed to achieve 50% inhibition of enzyme activity."

These data were supplied to us by J. Friedman and contain 37 cases. Each predictor variable has at most 11 distinct values, but 3 of the variables have only 8 'separated' distinct values and the other has 10. An initial linear fit gave an R² of 0.56. Wright and Gambino fit a quadratic model with 7 degrees of freedom and get an R² of 0.89.

We started with 6 initial knots per variable, but the knot placement algorithm sensed too few distinct values and decreased the number of initial knots to 4 per variable. With this few knots per variable we did not try to decrease the number of initial knots any further. With only 37 cases, we did leave-one-out cross-validation. There are 3 RSS-extreme models:

d.f.   α⁻    α⁺     MRSS    PE
7      2.5   74.1   0.083   0.199
8      2.3   2.5    0.076   0.275
10     1.4   2.3    0.064   0.368
PE is the cross-validated estimate of the mean squared prediction error. Here, the ranges of α are normalized by dividing by σ̂². Focusing on the minimum PE model with 7 d.f., R² = 0.94. Figure 3.1 gives the graphs of the transformations for this model. In this figure and subsequent graphs the range of each x-variable is mapped onto [0, 1] and the transformation is graphed at equi-spaced points over this interval. Interesting nonlinearities appear in variables 2 and 3.
Fig. 3.1. Wright-Gambino data.
The overall instability Δ̄ is 1.38. This can be broken down into the individual variable contributions 0.60, 0.01, 0.77, 0.0. Variables 2 and 4 have a correlation of 0.61. Variables 1 and 3 have a correlation of 0.52, going up to 0.68 when one outlier (case 31) is removed. The other correlations are much smaller. The instability results suggest that there is a switching back and forth between variables 1 and 3. Our first move to increase stability is to delete both variable 3 and variable 4 and look at models based only on variables 1 and 2. Running DK/CV with variables 1 and 2 only gave 2 RSS-extreme models:

d.f.   α⁻    α⁺     MRSS    PE
7      2.8   40.2   0.101   0.158
8      0.9   2.8    0.093   0.155
The relative importances of variables 1 and 2 were 49 and 51. The minimum PE model had R² = 0.94 and the transformations are graphed in Figure 3.2.
Fig. 3.2. Wright-Gambino data (variables 1 and 2 only).
The instability dropped to 0.035, with 0.020 due to variable 1 and 0.015 to variable 2. The fact that R² remains about the same and PE drops in going from 4 variables to 2 is indicative of excess baggage. The four largest instabilities by case are:

case 34: 0.33;  case 28: 0.18;  case 15: 0.12;  case 16: 0.08.
To illustrate our approach, we deleted case 34 from the data set and reran DK/CV. There were again two RSS-extreme models:

d.f.   α⁻    α⁺     MRSS    PE
7      2.8   10.8   0.105   0.181
8      0.9   2.8    0.095   0.173
The minimum PE model had R² = 0.93. The instability is 0.059, with 0.044 in variable 1 and 0.015 in variable 2. Deleting case 34 does not give any improvement, but perhaps persistence is needed. The largest instabilities by case are:

case 31: 0.94;  case 13: 0.33;  case 28: 0.16;  case 15: 0.11.
We deleted 31 and reran. No help! The instability increased to 0.065 with no increase in R². The instabilities by case are now:

case 13: 1.20;  case 26: 0.19;  case 15: 0.14;  case 28: 0.14.
We delete 13 and rerun. There are 3 RSS-extreme models:

d.f.   α⁻    α⁺     MRSS    PE
6      4.3   11.9   0.105   0.172
7      3.2   4.3    0.092   0.176
8      0.7   3.2    0.081   0.136
The minimum PE model has R² = 0.95. The instability has dropped to 0.025, with 0.009 in variable 1 and 0.016 in variable 2. The 4 largest instabilities by case are:

case 14: 0.12;  case 27: 0.11;  case 25: 0.07;  case 32: 0.07.

The transformations are graphed in Figure 3.3.
Fig. 3.3. Wright-Gambino data (cases 13, 31, 34 deleted).
Comparing Figure 3.3 with 3.2, the major change is in the transformation of variable 1. Figure 3.4 shows the scatterplot of y versus x1. The two extreme points on the right are cases 13 and 34. The effect of deleting these two cases is clear, and explains the difference in the right ends of the transformations in Figures 3.2 and 3.3. Case 31 is one of the four points on the left of this scatterplot and partially governs how much the transformation curves up near the left end of the range.

We explored alternative models with other case deletions and also tried using variable 3 instead of 1. Nothing was as satisfactory as using variables 1 and 2, either deleting no cases or deleting 13, 31, 34. Through these various models the shape of the transformation on variable 2 was relatively invariant. The stability intervals for variables 1 and 2, with all cases and with cases 13, 31, 34 deleted, using leave-one-out cross-validation, are plotted in Figures 3.5-3.8.
Fig. 3.4. Scatterplot of y versus x1.
Fig. 3.5. Wright-Gambino data (stability interval for x1).

Fig. 3.6. Wright-Gambino data (stability interval for x2).
Fig. 3.7. Wright-Gambino data (stability interval for x1, cases 13, 31, 34 deleted).
Fig. 3.8. Wright-Gambino data (stability interval for x2, cases 13, 31, 34 deleted).
Wright and Gambino fit a model with 7 d.f. that is quadratic in the 3rd and 4th variables and has R² = 0.89. The 7 d.f. model produced by DK/CV on all data, but using variables 1 and 2 only, has R² = 0.93. Wright and Gambino give a number of chemical interpretations based on their quadratic model. The two-variable DK/CV model might lead to different interpretations.

3.2. Life cycle savings data

This is a 50-case, 5-variable dataset used by Belsley, Kuh and Welsch in their book Regression Diagnostics (1980) as an example to illustrate their methods of finding influential observations and outliers. The data are from 50 countries. The response is the savings ratio = aggregate personal saving divided by disposable income. The four predictor variables are:

x1: percent of population < 15 years old;
x2: percent of population > 75 years old;
x3: per-capita disposable income;
x4: percent rate of change of per-capita disposable income.

The data are noisy and a linear fit gives R² = 0.33. The correlation matrix is

      x1     x2     x3     x4
x1    1      -      -      -
x2   -0.91   1      -      -
x3   -0.75   0.79   1      -
x4   -0.20   0.08   0.04   1
A first run was done with 6 initial knots per variable and leave-one-out cross-validation. There were three RSS-extreme models:

d.f.   α⁻    α⁺     MRSS   PE
2      6.3   16.2   15.6   16.9
3      3.5   6.3    14.0   18.1
7      1.5   3.5    10.5   19.8
Fig. 3.9. Savings data (7 deg. of freedom).
The minimum PE model had R² = 0.21. This was not an encouraging result. Still, an interesting question was: what does the 7 d.f. model look like? We ran DK/CV forcing it to select the model with 7 d.f. The resulting model had R² = 0.47 and used variables 1 and 4 only. The transformations are graphed in Figure 3.9, and both are nonlinear.

The instability is 12.68, distributed among the four variables in the amounts 4.63, 4.59, 0.15, 3.31. This indicates that switching between variables is occurring, so we restricted attention to models based on variables 1 and 4 only. With this, the instability dropped to 1.73, with 0.44 and 1.29 due to variables 1 and 4 respectively, and PE = 17.8. The four largest instabilities by case are:

case 49: 19.1;  case 23: 18.3;  case 19: 15.6;  case 50: 12.5.
Now case 49 is deleted and DK/CV rerun. Three RSS-extreme models emerge:

d.f.   α⁻    α⁺     MRSS   PE
2      8.7   15.2   15.9   21.0
3      3.3   8.7    13.5   15.7
6      0.9   3.3    10.8   15.0
The minimum PE model has R² = 0.46, with instability 0.305. Of this, 0.247 is due to variable 1 and 0.068 to variable 4. The transformations are graphed in Figure 3.10, and the transformation on x4 is nonlinear. Case 49 (LIBYA) has a large effect on the fit, and its deletion is stabilizing. With the deletion of case 49 and the preference swinging to a nonlinear model, we wanted to explore how many initial knots to use. The data allowed
Fig. 3.10. Savings data (case 49 deleted).
the use of up to 9 knots per variable. We did runs on the data (case 49 deleted) with 4 to 9 initial knots, using leave-one-out cross-validation:

Initial knots   d.f.   R²     PE     Δ̄
4               3      0.33   16.4   0.93
5               5      0.40   15.3   0.28
6               6      0.46   15.0   0.29
7               3      0.33   16.4   0.92
8               6      0.46   15.5   0.36
9               3      0.33   16.7   1.58
The models with 5 and 6 d.f. were all nonlinear in x1 and linear in x4. The models with 3 d.f. were linear in both x1 and x4. Given these results, we continued our exploration using 6 initial knots. The largest instabilities by case are now:

case 46: 1.47;  case 26: 0.91;  case 10: 0.90;  case 7: 0.71.
We tried deleting 46, after which the deletion of 47 was indicated. This only worsened matters. Then we tried deleting 26 together with 49, after which deletion of 46 was indicated. Again, not much help. However, when we deleted 10 and 49, improvement occurred. There were 3 RSS-extreme models:

d.f.   α⁻    α⁺     MRSS   PE
2      9.3   17.3   15.9   20.3
3      3.9   9.3    13.4   17.3
6      1.1   3.9    10.3   13.7
The minimum PE model has R² = 0.50 and the PE has dropped to 13.7. The instability is 0.156, with 0.110 due to variable 1 and 0.046 to variable 4.
Fig. 3.11. Savings data (cases 10, 20, 49 deleted).
The only cases with instabilities above 0.5 are:

case 46: 1.55;  case 26: 1.01;  case 20: 0.54.
Deletion of 46 or 26 does not lead to improvement. When 10, 20, 49 are deleted, the instability drops to 0.088, with 0.065 and 0.023 due to variables 1 and 4, respectively. There are only 2 RSS-extreme models, with 3 and 6 d.f. The latter has the minimum PE of 12.7 and R² = 0.53. These transformations appear in Figure 3.11. The stability intervals for x1 and x4, for case 49 deleted and for cases 10, 20, 49 deleted, are in Figures 3.12-3.15.

In this data, two models are competing. One is the 6 or 7 d.f. model with a nonlinear transformation on x1. The other is the 3 d.f. model with linear transformations on both variables. However, the 6 d.f. model has the highest R² and the lowest PE and Δ̄, so we prefer it on these grounds.
Fig. 3.12. Savings data (stability interval for x1, case 49 deleted).
Fig. 3.13. Savings data (stability interval for x4, case 49 deleted).

Fig. 3.14. Savings data (stability interval for x1, cases 10, 20, 49 deleted).

Fig. 3.15. Savings data (stability interval for x4, cases 10, 20, 49 deleted).
Comparing this analysis with that in Belsley, Kuh and Welsch shows the large differences that can occur between a linear and a nonlinear analysis. The original 5 d.f. linear model had R² = 0.33. With case 49 deleted, the best model is nonlinear in two variables with 6 d.f. and R² = 0.46. Deleting cases 10 and 20 decreases the instability, increases R² to 0.53 and decreases PE to 12.7. Belsley, Kuh and Welsch note that leave-one-out statistics only measure the influence of single cases and may not reveal the influence of small groups of cases acting in concert. The same is true of our stability analysis using leave-one-out cross-validation, although stepwise methods can sometimes uncover synergistic effects, as with cases 13 and 34 in the Wright-Gambino data.

3.3. Air pollution data

This data consists of daily measurements of 8 meteorological variables in the Los Angeles Basin during 1976. The response variable is the daily maximum one-hour ozone level at Upland, at that time an ozone 'hot spot'. This data was analyzed in the Breiman-Friedman ACE paper (1985), and since then in the Hastie-Tibshirani paper (1986), in Buja et al. (1989), and in Friedman and Silverman (1989). The meteorological variables are:

x1: 500 millibar pressure height measured at Vandenberg AFB (500MPHT);
x2: wind speed at LAX (WS);
x3: humidity at LAX (HMDTY);
x4: temperature at Sandberg (TMPSD);
x5: inversion base height at LAX (INVHT);
x6: pressure gradient, LAX to Daggett (PG);
x7: inversion top temperature at LAX (INVTMP);
x8: visibility at LAX (VSBLTY).

There are 331 cases with no missing data. The correlation matrix of the data is
      x1     x2     x3     x4     x5     x6     x7     x8
x1    1      -      -      -      -      -      -      -
x2    0.23   1      -      -      -      -      -      -
x3    0.07   0.22   1      -      -      -      -      -
x4    0.81   0.34  -0.13   1      -      -      -      -
x5   -0.50   0.19  -0.42  -0.53   1      -      -      -
x6   -0.15   0.34  -0.01   0.19   0.03   1      -      -
x7    0.85  -0.16  -0.26   0.86  -0.78  -0.10   1      -
x8   -0.35   0.13   0.65  -0.40   0.39   0.20  -0.38   1
On the first run, 16 knots per variable were used along with ten-fold cross-validation. The minimum PE model had 10 d.f. with R² = 0.74, PE = 18.4 and Δ̄ = 2.87. The first three variables were completely deleted. Keeping these
Table 1
Results for the minimum PE models

No. initial knots   d.f.   PE     R²     Δ̄
5                   18     17.4   0.77   0.28
6                   18     17.3   0.77   0.35
7                   16     17.5   0.77   0.71
8                   16     17.9   0.77   0.41
9                   8      18.2   0.73   6.11
10                  8      18.7   0.73   3.43
12                  8      18.7   0.73   4.00
14                  9      18.3   0.74   5.00
16                  10     18.4   0.74   2.87
variables deleted, we reran, changing the number of initial knots. The results for the minimum PE models can be found in Table 1. The most promising results are obtained using 6 initial knots. There are six RSS-extreme models:

d.f.   α⁻    α⁺     R²     PE
7      8.6   12.8   0.73   17.7
8      5.4   8.6    0.74   18.0
13     4.9   5.4    0.76   17.9
14     2.8   4.9    0.76   17.5
16     2.4   2.8    0.76   17.4
18     0.8   2.4    0.77   17.3
A check of cross-validated residuals did not show any evidence of outliers. None of the Δ(v), v = 1, ..., 10, were inordinately large. The results were consistent under changes of the random number seed. The transformations for the five undeleted variables are shown in Figure 3.16. The variable importances are:

x4: 32.3;  x5: 15.9;  x6: 17.3;  x7: 27.7;  x8: 6.8.
The interesting aspect of Table 1 is that for 5 to 8 initial knots, the d.f. selected is 16-18. For 9-16 initial knots, the d.f. selected is 8-10. Our interpretation is this: for many initial knots, there is more inherent instability. To reduce this instability and the accompanying higher PE, fewer d.f. are selected. (The referee disagrees with this explanation and offers this alternative: "more initial knots allow the procedure to pinpoint more precisely any changepoints right off and throw away redundant knots".) At any rate, given these results, we worked mainly with six initial knots per variable (but with some backwards looks at 5, 7, and 8 initial knots).

The first thing we did was to run DK/CV with 6 initial knots on all 8 variables. This did
Fig. 3.16. Ozone data (all variables, 6 initial knots).
not improve matters. A model with 23 d.f. was selected with R² = 0.78, but Δ̄ increased to 0.83 and PE to 17.5. After some experimentation, the initial 5-variable model proved superior to models containing more or other variables. These five variables included x4 and x7, two temperature variables with a correlation of 0.86, with x7 contributing a large share of the instability. We deleted x7. The minimum PE dropped to 17.0 and Δ̄ to 0.03, while R² = 0.76 and d.f. = 15. There were 4 RSS-extreme models with d.f. = 7, 8, 13, 15. In particular, the model with d.f. = 13 was very close to the d.f. = 15 model, having PE = 17.0 and R² = 0.76. The transformations for d.f. = 15 are shown in Figure 3.17.

With x1, x2, x3, x7 deleted, we went back and checked the effect of the number of initial knots (IK); see Table 2. There are two competing models, one with about 15 d.f. and one with about 7-8 d.f. No matter how many initial knots, both of these models appear on the corresponding RSS-extreme list. For the number of initial knots equal to 10, 13, 14, 16, 18, minimum PE narrowly prefers a low d.f. model to one of about 15 d.f. All the models with d.f. = 15 show almost the same pictures for the transformations of the variables, and all of the models with d.f. = 7, 8 show about the same transformations. For example, Figure 3.18 gives the transformations with 6 initial knots for the 7 d.f. model (note: the lines for VSBLTY and INVHT lie on top of each other). Figures 3.19 and 3.20 give the transformations for 7 d.f. and 15 d.f. with 11 initial knots, and Figures 3.21 and 3.22 are the transformations with 16 initial knots for 8 and 13 d.f.
Fig. 3.17. Ozone data (6 initial knots, 15 d.f.).
Our preferred picture is that with 6 initial knots and 13 d.f., shown in Figure 3.23. The indications are that inversion base height and visibility are important predictors of ozone only in their lower ranges. Furthermore, the effect of temperature seems to be separated into two regimes. The main (but small) difference between the 13 and 15 d.f. pictures for 6 initial knots and the 13 and 15 d.f. pictures for 11 and 16 initial knots is in the PG transformation. To look at the possibility that this difference could be due to the relatively broad spacing of the 6 knot points, we reran with 6 initial knots but with the knots for PG concentrated around the data values corresponding to 0.20, 0.45 on the graphs. The results for 13 d.f. are given in Figure 3.24. The PG transformation is now much closer to that given by using 11 and 16 initial knots.

The real difference in pictures occurs when one of the temperature variables is deleted. Then the transformation on the remaining temperature variable changes considerably, but the transformations on the other variables are rela-
Table 2

IK     d.f.   R²     PE     Δ̄
5      15     0.76   17.5   0.14
6      15     0.76   17.0   0.03
7      15     0.76   17.6   0.23
8      15     0.76   17.2   0.10
9      15     0.76   18.0   0.13
10     7      0.73   18.1   0.13
11     15     0.76   18.2   0.11
12     15     0.76   18.0   0.18
13     7      0.73   18.3   0.24
14     8      0.74   18.5   0.24
15     15     0.76   18.4   0.18
16     8      0.73   18.1   0.17
18     8      0.74   18.5   0.31
Fig. 3.18. Ozone data (6 initial knots, 7 d.f.).
tively unaffected. We note also that other variable deletions were attempted but resulted in poorer fits. The transformations graphed differ 'somewhat' from those given in the ACE paper. In data analysis there is no hard and fast 'truth'.
Fig. 3.19. Ozone data (11 initial knots, 7 d.f.).
Fig. 3.20. Ozone data (11 initial knots, 15 d.f.).

Fig. 3.21. Ozone data (16 initial knots, 8 d.f.).
4. Simulation examples

There are two ways of looking at the performance of a new procedure. The first is to run it on simulated data, where we know what the answer should look like. If we don't get reasonable results here, something is wrong.
Fig. 3.22. Ozone data (16 initial knots, 13 d.f.).
The second is to run it on real data. The problem with this is that we don't know what is inside the black box. The results can be strongly procedure dependent. DK/CV will usually tell a different story than a linear regression. It may tell a somewhat different story than ACE (see Section 3.2). It will certainly tell a
Fig. 3.23. Ozone data (6 initial knots, 13 d.f.).
Fig. 3.24. Ozone data (knots concentrated on PG).
different story than CART. The way to proceed is to ask what new clues to the mystery a procedure can give.

On the three case studies in Section 3, DK/CV tells a somewhat different story than the Friedman-Silverman TURBO program and than ACE. The broad outlines are similar, but the pictures differ to some extent. It is useful, therefore, to work on simulated data, where the correct answers are known. The Friedman and Silverman (1989) paper contains a number of simulation experiments comparing their program to ACE. We ran DK/CV on similar data sets and report the results below.

4.1. Comparisons with the F-S method

A complex bivariate example given in F-S consists of generating 50 points (x_n, y_n) with x_n uniform over [-0.2, 1.0] and y_n given by

y_n = ε_n,                       x_n ≤ 0,
y_n = sin[2π(1 − x_n)²] + ε_n,   x_n > 0,

with ε_n selected from the distribution N(0, max²(0.05, x_n)). To see what this data looks like, see Figure 3a in F-S. For the number of initial knots ranging up from 3, we ran DK/CV on 10 data sets generated as above and averaged the resulting PE's to get:

Initial knots   3     4     5     6     7     8     9     10    11    12    15    20    25
PE              0.45  0.44  0.42  0.37  0.35  0.34  0.34  0.33  0.34  0.34  0.33  0.35  0.35
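For reference, a direct transcription of this generator into Python (a sketch; the seed handling is an assumption):

```python
import numpy as np

def fs_bivariate(n=50, rng=None):
    """Generate the Friedman-Silverman complex bivariate example."""
    rng = rng or np.random.default_rng()
    x = rng.uniform(-0.2, 1.0, n)
    sigma = np.maximum(0.05, x)            # sd = max(0.05, x_n)
    eps = rng.normal(0.0, sigma)
    y = np.where(x <= 0, eps, np.sin(2 * np.pi * (1 - x) ** 2) + eps)
    return x, y
```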
Fig. 4.1. Sample size 50 motor-cycle data.
Using 10 initial knots and 10-fold cross-validation, DK/CV was run on 500 data sets generated as described. The average of |E(Y|x) − ŷ(x)| over the 500 runs is plotted in Figure 4.1 and should be compared with Figure 3e in F-S. This indicates that DK/CV is comparable and perhaps slightly superior to TURBO on this data.

Next, we look at a pure noise example, with 10 predictor variables x_1, ..., x_10 and sample size 50. Each x_m, m = 1, ..., 10 is uniform on [0, 1] and y is N(10, 1). We ran one data set of this type using 3 initial knots and 4 initial knots. The resulting minimum PE's were 0.85 and 0.89, respectively. More initial knots gave degrees of freedom larger than the sample size. We used 3 initial knots. To cut down on computing time we used 5-fold cross-validation (the results in Breiman and Spector (1989) indicate that 5-fold is generally as accurate as 10-fold cross-validation). Note that each fitting started with 31 d.f. The best thing the program could do was to try to fit the response by a constant, decreasing the degrees of freedom to 1. We ran 100 replicated data sets, and DK/CV fit by a constant 82 times (TURBO did this 67 times). In each run, the root mean squared distance of ŷ(x_n) from 10 was computed. The 5th, 50th and 95th percentiles are:

         5th    50th   95th
DK/CV    0.01   0.12   0.54
TURBO    0.02   0.18   0.50
ACE      0.68   0.85   1.00
The next example is what F-S refer to as a highly structured additive model. Data of sample size 100 are generated from

y = f(x_1, ..., x_10) + ε,

where f(x_1, ..., x_10) is an additive function of x_1, ..., x_5 only and nonlinear in x_1, x_2. That is, f = Σ_{m=1}^5 φ_m(x_m), with φ_3, φ_4, φ_5 linear, and

φ_1(x) = 0.1 e^{4x},
φ_2(x) = 4/(1 + e^{−(x−0.5)/0.05}).

The {ε_n} are N(0, 1). The signal to noise ratio is high, with R² typically in the 0.8-0.9 range. We did one run each with 10-fold cross-validation, varying the number of initial knots. This gave:

Initial knots   3     4     5     6     7     8     9
PE              1.25  1.25  1.19  1.26  1.32  1.41  1.49
Δ̄               0.03  0.05  0.03  0.05  0.18  0.25  0.33
R²              0.86  0.85  0.86  0.86  0.86  0.86  0.86

This suggested using 5 initial knots. We did runs on 100 replicate data sets using 5-fold cross-validation. In each run the root mean squared distance between {f(x_n)} and {ŷ(x_n)} was computed. The 5th, 50th and 95th percentiles of these distances, compared to the F-S results, are:

         5th    50th   95th
DK/CV    0.32   0.43   0.65
TURBO    0.31   0.48   0.62
ACE      0.60   0.72   0.85
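A sketch of this generator: the distribution of the predictors and the slopes of the linear φ_3, φ_4, φ_5 are not stated above, so the uniform [0, 1] design and slopes 3, 2, 1 below are assumptions patterned on the Friedman-Silverman setup.

```python
import numpy as np

def structured_additive(n=100, rng=None):
    """Additive signal in x1..x5 of ten predictors, nonlinear in x1, x2."""
    rng = rng or np.random.default_rng()
    X = rng.uniform(0.0, 1.0, (n, 10))                       # assumed design
    f = (0.1 * np.exp(4.0 * X[:, 0])                         # phi_1
         + 4.0 / (1.0 + np.exp(-(X[:, 1] - 0.5) / 0.05))     # phi_2
         + 3.0 * X[:, 2] + 2.0 * X[:, 3] + X[:, 4])          # assumed slopes
    y = f + rng.normal(size=n)                               # N(0, 1) noise
    return X, y, f
```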
4.2. Hastie’s example Another interesting example is provided in Hastie’s discussion of the F-S paper. Hastie claims “variable selection techniques tend to have high variance with regard to the subset selected.” To illustrate TURBO’s high variability he generated 50 data sets of sample size 100 from the model y = 0.667 sin(l.3xi)
- 0.456~; + E,
where E, xi, x2 are N(0, 1) and xi, x2 have correlation 0.4. He ran TURBO on the 50 data sets and plotted the transformations. The superimposed transformations are fairly bushy and support Hastie’s assertion. Hastie obligingly provided us with the 50 data sets. We ran on the first four, using lo-fold cross-validation with the following average results Initial Knots
3
4
5
6
7
8
9
10
15
E
0.449
0.469
0.461
0.492
0.468
0.463
0.461
0.481
0.520
d
0.119
0.175
0.091
0.401
0.253
0.261
0.242
0.480
0.703.
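A sketch of one such data set (the correlation 0.4 is induced by mixing independent normals; Hastie's actual files, of course, are fixed):

```python
import numpy as np

def hastie_example(n=100, rho=0.4, rng=None):
    rng = rng or np.random.default_rng()
    x1 = rng.normal(size=n)
    # mix an independent normal into x1 so that corr(x1, x2) = rho
    x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
    y = 0.667 * np.sin(1.3 * x1) - 0.456 * x2 ** 2 + rng.normal(size=n)
    return np.column_stack([x1, x2]), y
```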
Fig. 4.2.
The results led to the use of 3 initial knots. We then ran on all 50 data sets using 10-fold cross-validation and plotted the transformations in Figure 4.2. The plots have been adjusted up or down so that each curve has the same average over the middle 40% of the data. They show much less variability than TURBO.
5. Discussion
DK/CV combines knot deletion and cross-validation to give a method for accurately fitting additive models and overcomes the difficulties encountered by ACE on small noisy data sets. The use of stability diagnostics and inspection of alternative RSS-extreme models gives insight into the difficult and sometimes ambiguous job of fitting these nonlinear models. The Fortran code for DK/CV is available from the author ([email protected]).
References

Agarwal, G.G. and W.J. Studden, Asymptotic integrated mean square error using least squares and minimizing splines, Ann. Statist., 8 (1980) 1307-1325.
Belsley, D.A., E. Kuh and R.E. Welsch, Regression Diagnostics (Wiley, New York, 1980).
Breiman, L., The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error, J. Amer. Statist. Assoc., 87 (1992) 738-754.
Breiman, L. and J.H. Friedman, Estimating optimal transformations for multiple regression and correlation (with discussion), J. Amer. Statist. Assoc., 80 (1985) 580-619.
Breiman, L. and S. Peters, Comparing automatic smoothers, Technical report no. 161 (Statistics Dept., University of California, Berkeley, CA, 1988) (accepted for publication, Internat. Statist. Review).
Breiman, L. and P. Spector, Submodel selection and evaluation in regression: X-random case, Technical report no. 197 (Statistics Dept., University of California, Berkeley, CA, 1989).
Breiman, L., J.H. Friedman, R. Olshen and C.J. Stone, Classification and Regression Trees (Wadsworth, Belmont, CA, 1984).
Buja, A., T. Hastie and R. Tibshirani, Linear smoothers and additive models (with discussion), Ann. Statist., 17 (1989) 453-555.
Burman, P., Estimation of optimal transformations using v-fold cross-validation and repeated learning testing methods, Sankhyā A (1989), forthcoming.
Friedman, J.H. and B.W. Silverman, Flexible parsimonious smoothing and additive modeling (with discussion), Technometrics, 31 (1989) 3-39.
Friedman, J.H. and W. Stuetzle, Smoothing of scatterplots, Technical report ORION 003 (Statistics Dept., Stanford University, Stanford, CA, 1982).
Härdle, W. and A.W. Bowman, Bootstrapping in nonparametric regression: Local adaptive smoothing and confidence bands, J. Amer. Statist. Assoc., 83 (1988) 102-110.
Hastie, T. and R. Tibshirani, Generalized additive models (with discussion), Statist. Science, 1 (1986) 297-318.
Hastie, T. and R. Tibshirani, Generalized additive models: some applications, J. Amer. Statist. Assoc., 82 (1987) 371-386.
Hastie, T. and R. Tibshirani, Generalized Additive Models (Chapman and Hall, London, 1990).
Jennrich, R., Stepwise regression, in: K. Enslein et al. (Eds.), Statistical Methods for Digital Computers (Wiley-Interscience, New York, 1977).
Miller, A.J., Subset Selection in Regression (Chapman and Hall, London, 1990).
Ramsay, J.O., Monotone regression splines in action (with discussion), Statist. Science, 3 (1988) 425-461.
Smith, P.L., Curve fitting and modeling with splines using statistical variable selection techniques, NASA report no. 166034 (NASA, Langley Research Center, Hampton, VA, 1982).
Stone, C.J., The dimensionality reduction principle for generalized additive models, Ann. Statist., 14 (1986) 590-606.
Stone, C.J. and Cha-Yong Koo, Additive splines in statistics, Proceedings, 1985 Annual Amer. Statist. Assoc. Meeting, Computing Section (1986).
Tibshirani, R., Estimating transformations for regression via additivity and variance stabilization, J. Amer. Statist. Assoc., 83 (1988) 394-405.
Wright, G.E. and J.J. Gambino, Quantitative structure-activity relationships of 6-anilinouracils as inhibitors of Bacillus subtilis DNA polymerase III, J. Med. Chem., 27 (1984) 181-185.
Appendix
If there is no knot deletion, it is simple to set up a basis for the natural splines. With knot deletion, DK/CV does the more complex computations involved in keeping the transformation linear to the right of the rightmost undeleted knot, and similarly on the left. This appendix describes the algorithm.

Given {x_n}, n = 1, ..., N, with x_− = min(x_n) and x_+ = max(x_n), consider knots x_− = t_1 < t_2 < ... < t_K < t_{K+1} = x_+ and fitting functions

h(x) = Σ_{k=1}^{K+1} β_k (x − t_k)_+^3 + α + γx.

The condition that h(x) be linear to the right of x_+ imposes the constraint

β_{K+1} = −Σ_{k=1}^K β_k.    (A.1)

The knot at x_+ is seen as a 'phantom' knot whose purpose is to cancel out an x³ term to the right of x_+. To get the OLS estimates for the other coefficients, least squares minimization is carried out under the linear constraint (A.1). If the knot t_{K+1} is deleted, and the condition of linearity to the right of t_K imposed, the constraint (A.1) has to be augmented by

Σ_k β_k t_k = 0.    (A.2)

Thus, the requirement that the deletion process keep the fitting functions linear to the right of the rightmost undeleted knot leads to solving least squares problems under either one or two linear constraints. Doing least squares minimization under linear constraints can be carried out using well known procedures. However, incorporating a sequence of such solutions into a process involving the deletion of perhaps hundreds of knots has to be done with an eye on computational efficiency. The algorithm used, based on the sweep operator, does not appear in the literature we have searched, and is outlined below.

Given data (y_n, x_n), x_n = (x_{1n}, ..., x_{Mn}), consider the problem of minimizing Σ_n (y_n − Σ_m β_m x_{mn})² over the {β_m}, subject to the constraint Σ_m z_m β_m = 0. Let S = X^tX. Using a Lagrangian formulation, the solution is

β = S^{−1}X^ty + λS^{−1}z,
where λ = −z^tS^{−1}X^ty / z^tS^{−1}z. Let X^ty = v. Then

RSS = y^ty − v^tS^{−1}v + (v^tS^{−1}z)² / z^tS^{−1}z.

Now consider a similar minimization under the constraints z^tβ = 0, u^tβ = 0. The solution is

β = S^{−1}v + (λ_1, λ_2)(S^{−1}z, S^{−1}u)^t,

where, defining

H = ( z^tS^{−1}z   z^tS^{−1}u )        W = ( z^tS^{−1}v )
    ( z^tS^{−1}u   u^tS^{−1}u ),           ( u^tS^{−1}v ),

then λ = −H^{−1}W, and

RSS = y^ty − v^tS^{−1}v + W^tH^{−1}W.

Given a square matrix A = (a_jk), the result of sweeping A on its kth diagonal element is a new matrix Ã = (ã_jk) with

ã_kk = −1/a_kk,
ã_jk = a_jk/a_kk,  ã_kj = a_kj/a_kk,  j ≠ k,
ã_ij = a_ij − a_ik a_kj/a_kk,  for i ≠ k and j ≠ k.
Consider the matrix

T = ( S    z   u   v    )
    ( z^t  0   0   0    )
    ( u^t  0   0   0    )
    ( v^t  0   0   y^ty ).

If this matrix is swept on the diagonal elements of S, the result is (see Jennrich, 1977)

T = ( −S^{−1}      S^{−1}z       S^{−1}u       S^{−1}v           )
    ( z^tS^{−1}   −z^tS^{−1}z   −z^tS^{−1}u   −z^tS^{−1}v        )
    ( u^tS^{−1}   −u^tS^{−1}z   −u^tS^{−1}u   −u^tS^{−1}v        )
    ( v^tS^{−1}   −v^tS^{−1}z   −v^tS^{−1}u    y^ty − v^tS^{−1}v ).
Given a square matrix partitioned as

( A    D    E )
( D^t  B    F )
( E^t  F^t  C ),

where A, B, C are square, sweeping on the diagonal elements of B produces

( A − DB^{−1}D^t      DB^{−1}     E − DB^{−1}F   )
( B^{−1}D^t          −B^{−1}      B^{−1}F        )
( E^t − F^tB^{−1}D^t  F^tB^{−1}   C − F^tB^{−1}F ).
Sweep T on the single diagonal element −z^tS^{−1}z. Then

S^{−1}v → S^{−1}v − (z^tS^{−1}v / z^tS^{−1}z) S^{−1}z,
y^ty − v^tS^{−1}v → y^ty − v^tS^{−1}v + (v^tS^{−1}z)² / (z^tS^{−1}z).

Denoting the new matrix again by T, the OLS coefficients under the z^tβ = 0 constraint are in the right hand column of T, and the RSS is in the lower right corner. Now consider sweeping the original T matrix on the two diagonal elements −z^tS^{−1}z and −u^tS^{−1}u. Then

S^{−1}v → S^{−1}v − (S^{−1}z, S^{−1}u)H^{−1}W,
y^ty − v^tS^{−1}v → y^ty − v^tS^{−1}v + W^tH^{−1}W,

so the coefficients under u^tβ = 0, z^tβ = 0 are in the right column of the swept T and the RSS in the lower right corner.

Consider knots t_1 < t_2 < ... < t_K with t_1 = min(x_n) and t_K < x_+ = max(x_n). For a linear combination Σ_{k=1}^K β_k(x − t_k)_+^3 to have zero 2nd derivative at x_+, we need the constraint (A.1). For the combination to be linear to the right of t_K, the additional constraint (A.2) is needed. Let z_k = x_+ − t_k, k = 1, ..., K, and z_{K+1} = z_{K+2} = 0. Let u_k = 1, k = 1, ..., K, and u_{K+1} = u_{K+2} = 0. For n = 1, ..., N, let

f_kn = (x_n − t_k)_+^3,  k = 1, ..., K,    f_{K+1,n} = x_n,    f_{K+2,n} = 1,

form s_ik = Σ_n f_in f_kn, and the augmented matrix

T = ( S    z   u   v    )
    ( z^t  0   0   0    )
    ( u^t  0   0   0    )
    ( v^t  0   0   y^ty ),

where v_k = Σ_n f_kn y_n. Sweep T on the K + 2 diagonal elements of S and on the (K+3, K+3) diagonal element, getting the coefficients and RSS under the z^tβ = 0 constraint. Denote the new matrix again by T. Deletion of any of the K + 2 variables is done by operating on the corresponding diagonal element of T by the inverse sweep operator. Deleting the end 'phantom' knot is done by sweeping on the (K+4, K+4) diagonal element. Since T is symmetric with elements (t_jk), sweeping or inverse sweeping on t_kk increases the lower right hand corner element by

−t²_{k,K+5} / t_kk.    (A.3)

Thus, the deletion strategy is as follows: for the set of undeleted t_k, k ≤ K, compute (A.3). If t_{K+4,K+4} has not yet been swept on, compute (A.3) for k = K + 4. Find the minimum of these (A.3) values. If the minimum occurs for a k ≤ K, then inverse sweep T on t_kk. If the minimum is at K + 4, sweep T on
t_{K+4,K+4}. If all knots t_k, k ≤ K, have been deleted, the next deletions are done by inverse sweeps on t_{K+1,K+1} and on t_{K+2,K+2}.

In the general case of M variables x_m, with K knots t(1, m), ..., t(K, m) on the mth variable, let x_{+m} = max_n(x_nm), and for k = 1, ..., K,

f(n, k + (m−1)K) = (x_nm − t(k, m))_+^3.

For k = MK+1, ..., MK+M,

f(n, k) = x_{n,k−MK},

and

f(n, MK+M+1) = 1.

Let J = MK + M + 1 and

s(i, k) = Σ_n f(n, i) f(n, k),  i, k = 1, ..., J.

This J × J matrix will be augmented by M rows and columns for the (A.1) constraints, an additional M rows and columns for the (A.2) constraints, and a final row and column containing v_k = Σ_n f(n, k)y_n. The J+m row of T, m = 1, ..., M, is defined as being all zero except for

t(k + (m−1)K, J+m) = x_{+m} − t(k, m),  k = 1, ..., K.

The J+M+m row, m = 1, ..., M, is all zeroes except for

t(k + (m−1)K, J+M+m) = 1,  k = 1, ..., K.

The J+2M+1 row is all zeroes except for v_k, k = 1, ..., J, and

t(J+2M+1, J+2M+1) = y^ty.

Now, the first J + M diagonal elements of T are swept on, giving coefficients and RSS under the (A.1) constraints. An inverse sweep on one of the first J − M − 1 diagonal elements will delete an ordinary knot. A sweep on one of the J+M+m diagonal elements, m = 1, ..., M, removes one of the phantom right endpoint knots. The deletion which causes the least rise in RSS is computed as that unswept k which minimizes

−t²_{k,J+2M+1} / t_kk.

A sweep on the diagonal element at J+M+m, m = 1, ..., M, which corresponds to the linear term in x_m, is allowed only if all the knots on x_m have been deleted.