Computational Statistics & Data Analysis 16 (1993) 379-403
North-Holland

Projection pursuit regression for moderate non-linearities

Magne Aldrin
Norwegian Computing Center, Oslo, Norway

Erik Bolviken and Tore Schweder
University of Oslo, Norway

Received November 1991
Revised July 1992
Abstract: We present methods specially designed to be effective with moderately non-linear regression relationships. The model fitted is of the Projection Pursuit Regression (PPR) type with a smooth, non-parametric link function connecting the mean response to a linear combination of the regressors. New algorithms, close to ordinary linear regression, are developed. Considerable numerical evidence is given to substantiate the claims that the new methods outperform standard PPR when the non-linearity is moderate and the signal to noise ratio small.

Keywords: Nonparametric regression; Smoothing; Projection pursuit regression.
1. Introduction
Consider the regression model

y = f(x) + ε,    (1)
where y is the univariate response, x a p-dimensional predictor and ε independent additive noise. The 1980's saw the emergence of a number of non-parametric techniques for fitting this model, notably the PPR, ACE, AVAS, TURBO and MARS methods presented in Friedman and Stuetzle (1981), Breiman and Friedman (1985), Tibshirani (1988), Friedman and Silverman (1989) and Friedman (1991). The great strength of all these methods is their ability to identify non-linear relationships without the need for a specific parametric form in advance.

Correspondence to: M. Aldrin, Norwegian Computing Center, P.O. Box 114 Blindern, N-0314 Oslo 3, Norway. This work was supported by the Norwegian Research Council (NAVF) and by the Royal Norwegian Council for Scientific and Industrial Research (NTNF).
However, it may be that most non-linearities encountered by statisticians in practice are only moderately deviant from a linear structure. Typical examples are S-shapes and other modest curvatures, slight jumps in derivatives, or local dips or bumps. It may be very important to uncover such effects, and they might be completely unsuspected by the researchers. We shall in this paper propose methods specifically designed to be effective with non-linear, monotone relationships. They will be less general than the methods cited earlier, but they will work with fewer observations and more noise. Substantial numerical evidence will be presented in support of this statement. The performance will be judged on synthetic and real data by comparing with ordinary linear regression and projection pursuit regression (PPR). Our approach has much in common with the latter.
2. Methods

2.1. Linear and projection pursuit regression

Linear regression (method L) assumes f(x) in (1) to be a linear function, i.e.

f(x) = β'x.    (2)

The fitting is traditionally by least squares. PPR, on the other hand, is rooted in the expansion of f(x) as a sum of non-linear functions of unidimensional projections of x, i.e.

f(x) = Σ_{j=1}^{k} φ_j(β_j'x),    (3)
where both β_j and φ_j are unknowns. It has been proved by Diaconis and Shahshahani (1984) that any function can be represented in this way. In the sample situation the φ_j's must be subjected to some loose restriction of smoothness. The fitting is again by least squares, and is stepwise in k, possibly with so-called backfitting (which means that the preceding pairs (β_j, φ_j) are adjusted to improve fit whenever a new term is added).

Figure 1 indicates what can be achieved by the PPR technique (details about the data are given in Section 3.4). The scatter plot of the response against the estimated best linear predictor ẑ = β̂'_ols x shows a marked curvature. Only a non-linear element in the model can capture this. The solid line of the figure is a non-parametric smooth obtained from the so-called supersmoother of Friedman (1984a) (see Appendix B), implemented in for example the BLSS package (Abrahams and Rizzardi, 1988). We shall actually justify this extremely simple non-linear technique in the next subsection. The standard PPR approach due to Friedman and Stuetzle (1981) is more ambitious in that the predictor and the non-parametric link function φ are determined in a joint operation.
Fig. 1. The response y and the smooth φ̂ against the estimated best linear predictor ẑ, for the Boston housing tract data with all 506 observations. ẑ is scaled to cover the range -1 to 1.
Suppose k = 1 in (3), let z = β'x be the linear predictor and write φ_β = φ for the link. By definition

φ_β(z) = E(y | z),    (4)
which must be computed by a smoother in the sample situation. The determination of β to fit φ_β to y is then reduced to a non-linear least squares problem that can be solved numerically.

2.2. New proposals

Linear regression and PPR can from our point of view be regarded as two extremes, one being totally rigid in its adherence to an assumed structure, the other being completely flexible. We shall now put forward a middle-of-the-road approach which, while non-parametric, aims at preserving as much as possible of the stability of linear methods. Consider the model

f(x) = φ(β'x).    (5)

Assume x to have zero mean, and let without loss of generality β be standardized so that var(β'x) = β'Σ_x β = 1, writing Σ_x for the covariance matrix of x. Let

z = β'x    (6)

be the predictor and introduce

u = x - (Σ_x β)z.    (7)
It is easy to check that z and u are uncorrelated linear transformations of x. The vector u, which is to play a prominent role below, is in case of Σ_x = I (the identity) simply the component of x perpendicular to the line through the origin in the direction β. (There is a less natural geometric interpretation for general Σ_x. Introduce (x_1, x_2) = x_1'Σ_x^{-1}x_2 as the scalar product between vectors x_1 and x_2 in x-space. Then z is the coordinate of x along the vector Σ_x β with respect to this skewed geometry, and u becomes the difference between x and its projection along Σ_x β.) On combining (5) and (7) with (1) it follows that

xy = zφ(z)(Σ_x β) + φ(z)u + xε    (8)

or, by taking expectations,

Σ_xy = E(zφ(z))(Σ_x β) + E(φ(z)u),    (9)

where Σ_xy = E(xy) is the cross covariance matrix between x and y. Introduce

β_ols = Σ_x^{-1} Σ_xy    (10)

as the linear regression of y on x under an assumed linear relationship. It follows by premultiplying (9) with Σ_x^{-1} that

β_ols = λβ + B,    (11)

where

λ = E(zφ(z)),    (12)

and

B = Σ_x^{-1} E(φ(z)u).    (13)
The decomposition (11) of a quantity that can be recovered even from small sets of noisy data provides the motivation for our proposals. (11) shows that β_ols is the sum of one term proportional to the true direction vector β and another term B serving as bias. The latter is under model (5) often quite small. We have, for instance, the following result, which, to our knowledge, was first presented in Schweder (1987).

Proposition. Suppose x follows an elliptically contoured distribution (see Muirhead (1982) for definition and properties). Then, under model (5), B = 0 and hence

β_ols = λβ.

Proof. Suppose x is Gaussian. Then u and z, as uncorrelated linear transformations of x, are also stochastically independent. Hence u and φ(z) are independent too, and B, as given by (13), must be zero. The extension to elliptically contoured distributions in general follows from the symmetry properties of that family. □
The proposition suggests a very simple non-linear method, which we shall refer to as method S. Take the ordinary linear least squares estimate β̂_ols for β and obtain φ̂ by smoothing y in this direction. This will not work if λ = cov(z, φ(z)) is close to zero, φ being, for example, a symmetric quadratic function. However, as argued in the introduction, there are countless situations where φ is monotone or close to monotone. More generally, suppose λ ≠ 0. Recall that β̂_ols converges to β_ols in probability as the sample size n → ∞ under suitable regularity conditions. Hence, the proposition guarantees that β̂_ols will consistently estimate the right direction vector β under model (5) as long as the predictor x is generated by an elliptically contoured distribution satisfying the necessary regularity conditions. The method may be useful even outside the elliptic family. Indeed, by (13) the bias B is determined by Eφ(z)u. Since u and z are uncorrelated, we might expect this quantity to be quite small, whatever the distribution of x, if φ does not deviate too much from linearity. For noisy data, the gain in stability from first estimating β by linear regression and then φ by smoothing could well outweigh the loss caused by the bias B.

The simple S method may be improved. One approach, which we shall not pursue in this paper, is to start out with a robust estimator of the linear regression parameter. Another possibility is to make use of the decomposition (11) and to go through an alternating iteration between φ and β to estimate the direction correction B. This means that φ and β are estimated simultaneously, as in PPR, but according to another criterion. Stability is hopefully enhanced in cases where B is a relatively small adjustment term.
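Before turning to the sample version of (11), here is a minimal sketch of method S as just described. The sketch is in Python, which is not used in the paper; all function and variable names are our own, and a simple k-nearest-neighbour local-linear smoother stands in for the supersmoother of Friedman (1984a).

```python
import numpy as np

def local_linear_smooth(z, y, span=0.3):
    """Running local-linear smoother: for each z_i, fit a least squares line
    to the span*n nearest neighbours (in z-order) and evaluate it at z_i.
    A simple stand-in for the supersmoother."""
    n = len(z)
    k = max(2, int(span * n))
    order = np.argsort(z)
    z_s, y_s = z[order], y[order]
    fit = np.empty(n)
    for i in range(n):
        lo = max(0, min(i - k // 2, n - k))
        zz, yy = z_s[lo:lo + k], y_s[lo:lo + k]
        b1, b0 = np.polyfit(zz, yy, 1)       # slope, intercept of local line
        fit[i] = b0 + b1 * z_s[i]
    out = np.empty(n)
    out[order] = fit                         # map back to the original ordering
    return out

def method_S(x, y):
    """Method S: beta by ordinary least squares, phi by smoothing y
    against the estimated linear predictor (assumes p >= 2 predictors)."""
    x = x - x.mean(axis=0)                   # centre predictors
    y = y - y.mean()                         # centre response
    Sxx = np.cov(x, rowvar=False)
    Sxy = x.T @ y / (len(y) - 1)
    beta = np.linalg.solve(Sxx, Sxy)                 # beta_ols
    beta = beta / np.sqrt(beta @ Sxx @ beta)         # standardize: var(beta'x) = 1
    z = x @ beta
    phi_hat = local_linear_smooth(z, y)              # smoothed values at the z_i
    return beta, z, phi_hat
```

Prediction at new x then requires interpolating the pairs (z_i, φ̂_i), as discussed in Appendix A.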
It is easy to prove from (8) that the sample version of (11) is

β̂_ols = λ̂β + B̂ + Σ̂_x^{-1} Σ̂_xε,    (14)

where

λ̂ = (1/n) Σ_{i=1}^{n} z_i φ(z_i)    (15)

and

B̂ = Σ̂_x^{-1} (1/n) Σ_{i=1}^{n} φ(z_i) u_i    (16)

are sample analogues of (12) and (13), Σ̂_x is the sample covariance matrix of x, and Σ̂_xε is the sample cross covariance of x and ε. Note that β in (14) is still the true direction vector. Method S removed both B̂ and Σ̂_x^{-1}Σ̂_xε from the right hand side of (14), but suppose B̂ was kept, and the resulting equation β̂_ols = λβ + B solved for β. Seemingly this might improve accuracy, and the bias outside the family of elliptically contoured distributions would vanish. The numerical problem might appear formidable since B̂ depends on both φ and β, and for strong non-linearities it might be suspected that the solution would be unstable. However, in the situations we have in mind B̂ is a relatively small adjustment term, so that β̂_ols is already close to the final direction. Indeed, one way to
implement the idea is to compute B̂ as an estimate of B, using the method S estimates of β and φ, and then use β̂_ols - B̂ as a revised direction. This could be repeated. We have implemented the following iterative scheme, which will be called method A:

(0) Estimate (β, φ) by method S; denote the estimates (β̂, φ̂).
(1) Calculate B̂ from the current (β̂, φ̂).
(2) Use β̂ ∝ (β̂_ols - B̂) as a revised direction, and standardize so that β̂'Σ̂_x β̂ = 1.
(3) Calculate φ̂ by smoothing y against ẑ = β̂'x and go to (1).

If the algorithm converges, the limit will satisfy the equation

β̂_ols = λ̂β̂ + B̂.    (17)

After each iteration it is possible to judge how far the current values are from the solution of (17) by measuring the distance between the left and right sides of (17). Practical experience shows that each iteration does not always reduce this distance. To handle this, our implementation, given in Appendix A, is a slightly modified version of the scheme above.
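A minimal sketch of the method A iteration, continuing the Python sketch above (again, the names are ours, and a plain fixed number of iterations is used here instead of the modified stopping scheme of Appendix A):

```python
def method_A(x, y, n_iter=10):
    """Method A: alternate between estimating phi by smoothing and
    correcting the direction by the bias term B-hat of equation (16)."""
    x = x - x.mean(axis=0)
    y = y - y.mean()
    n = len(y)
    Sxx = np.cov(x, rowvar=False)
    Sxx_inv = np.linalg.inv(Sxx)
    beta_ols = Sxx_inv @ (x.T @ y / (n - 1))

    beta, z, phi_hat = method_S(x, y)            # step (0): start from method S
    for _ in range(n_iter):
        u = x - np.outer(z, Sxx @ beta)          # u_i = x_i - (Sxx beta) z_i, eq. (7)
        B_hat = Sxx_inv @ (phi_hat[:, None] * u).mean(axis=0)   # eq. (16)
        beta = beta_ols - B_hat                  # step (2): revised direction
        beta = beta / np.sqrt(beta @ Sxx @ beta)
        z = x @ beta
        phi_hat = local_linear_smooth(z, y)      # step (3): re-smooth and repeat
    return beta, z, phi_hat
```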
3. Examples

In this section the four regression methods are compared on simulated and real data, with emphasis on prediction performance. Implementation details are given in Appendix A and Appendix B. The three non-linear methods employ the supersmoother due to Friedman (1984a).

3.1. Simulations with normally distributed x

Consider a model of the form (5), where x is distributed as N_p(0, I), and ε is N(0, σ_ε²) and independent of x. The β-vector has length 1, with all elements equal to p^{-1/2}. The φ function is

φ(z) = a · atan(bz),    (18)

where a is a scaling constant, b is used to vary the degree of non-linearity in the function, and atan means the arcus tangens (arctangent) function. The conditions of the proposition in Section 2.2 are thus fulfilled. The arcus tangens function has an S-form, and bz will cover different parts of the S-function when b varies. A typical sample of 1000 z's will cover the range from -3.2 to 3.2. Figure 2 shows how the function a · atan(bz) looks in this range of z, for different values of b. The S-function approaches a linear function as b → 0 and a step function as b → ∞. Thus, for a given distribution of z, the degree of non-linearity in φ can be characterised by the parameter b or equivalently by corr(φ(z), z). In this and other figures y is scaled (through the value of a) to give φ(-3.2) = -1 and φ(3.2) = 1, while in all computations y is scaled such that var(y) = 1.

The simulation study reported was carried out under variation of several factors. The degree of non-linearity ranged from complete linearity to a step function, in the steps shown in Figure 2.
Fig. 2. Plot of the function φ = a·atan(bz) against z, for 5 different values of b. When z is N(0, 1), the five different φ's are such that corr(φ(z), z) is 1.00 (linear function), 0.99, 0.93, 0.87 and 0.80 (step function).
The number of observations in the training set was 50, 150 and 450, and the error variance was 5% and 50% of the total variance of y. The number of regressors was 1, 5 and 10. Figure 3 shows 150 simulated y-values against z (with the φ-functions as solid lines), for three different degrees of non-linearity and two levels of error variance. (A sharp eye may discover that the simulations are dependent. They were not in the simulation study itself.)

For each combination of the factors mentioned, a training set was simulated, together with a test set of 1000 observations. To cover the general situation, y and x from the training set were centred (equivalent to estimating a constant term), even if their expectations in the true model are zero. The observations in the test set were adjusted according to the centring values. Each of the four regression methods was used to estimate the model from the training set, and to predict the observations in the test set. PPR was used with one term, k = 1. The test set was used for evaluating the lack of accuracy, by computing the root mean squared error RMSE(φ), based on the prediction errors of φ, that is, φ(β'x) - φ̂(β̂'x). A closely related measure is RMSE(y), based on the prediction errors y - φ̂(β̂'x), which includes the noise term. Since y is scaled to have unit variance, both measures are given relative to 1, but they should also be compared to their lower bounds. With a perfect prediction, RMSE(φ) will be 0 regardless of σ_ε², while RMSE(y) will be √0.05 = 0.224 if σ_ε² = 0.05, and √0.50 = 0.707 if σ_ε² = 0.50. The two measures produce the same ordering between methods. RMSE(φ) is chosen as the principal measure here.
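To make the design concrete, the following sketch (our own illustration, not the authors' code; it continues the Python sketches of Section 2) generates one training set and one test set from model (5) with the arctangent link (18) and evaluates RMSE(φ) for method S:

```python
rng = np.random.default_rng(0)

def make_model(p, b, noise_var, rng):
    """Fix the model: direction beta and scaling a such that var(phi(z)) = 1 - noise_var,
    so that var(y) = 1 as in the paper. a is calibrated on a large reference sample."""
    beta = np.full(p, p ** -0.5)                       # all elements equal, length 1
    z_ref = rng.standard_normal(100_000)
    a = np.sqrt((1.0 - noise_var) / np.arctan(b * z_ref).var())
    def phi(z):
        return a * np.arctan(b * z)
    return beta, phi

def simulate(n, p, beta, phi, noise_var, rng):
    """Draw one data set from model (5) with x ~ N_p(0, I)."""
    x = rng.standard_normal((n, p))
    f = phi(x @ beta)
    y = f + rng.normal(scale=np.sqrt(noise_var), size=n)
    return x, y, f

p, b, noise_var = 5, 1.0, 0.50                         # moderate non-linearity, 50% noise
beta_true, phi_true = make_model(p, b, noise_var, rng)
x_tr, y_tr, _ = simulate(150, p, beta_true, phi_true, noise_var, rng)
x_te, _, f_te = simulate(1000, p, beta_true, phi_true, noise_var, rng)

beta_hat, z_tr, phi_hat_tr = method_S(x_tr, y_tr)
# adjust the test set by the training centring values, then interpolate the smooth
z_te = (x_te - x_tr.mean(axis=0)) @ beta_hat
order = np.argsort(z_tr)
f_pred = np.interp(z_te, z_tr[order], phi_hat_tr[order]) + y_tr.mean()
print("RMSE(phi):", np.sqrt(np.mean((f_te - f_pred) ** 2)))
```

Note that np.interp keeps the end values constant outside the training range, whereas the paper uses linear extrapolation there (Appendix A).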
For each combination, 50 simulations were carried out. All simulations were independent, both within and between the different combinations. The standard deviations of the RMSE's of each method are at most 0.06 in the case with 50 observations and large error variance, and at most 0.02 with low error variance. When the number of observations increases to 450, the standard deviations are reduced by a factor of about 3. The four methods were tested on the same sets of simulations, which causes their RMSE's in a specific combination to be dependent. As a result of this, the differences between the RMSE's have slightly smaller uncertainty than the RMSE's themselves. The results presented below show a very stable pattern. This makes formal significance tests uninteresting.

The results are shown in Figure 4. In a), where there is only one predictor variable, PPR, S and A are identical. Each panel shows RMSE(φ) versus decreasing correlation between z and φ(z), or equivalently, versus increasing degree of non-linearity. In the three upper panels the error variance is low, while it is high in the lower panels. The number of observations is low in the two leftmost panels, medium in the panels in the middle, and high in the right panels. The results for higher x-dimensions are shown in b) and c). When φ is exactly linear, linear regression outperforms the non-linear methods, but for remarkably small deviations from linearity this changes to the advantage of the latter methods. With exact linearity, the methods may be ordered from best to worst as follows: linear regression, S, A and PPR. With maximum degree of non-linearity, and with low error variance and/or a high number of observations, the ordering is the complete opposite. Note that methods S and A are never very far from the best among the four, while PPR is in real trouble with high dimension on the predictor, much noise and few observations.

3.2. Simulations with log-normal x
Until now we have operated within the conditions of the proposition in Section 2.2. In the next example, the predictor variable has a non-elliptic distribution, and the elements of β are unequal. Apart from this, the experimental design has remained unchanged. The elements of x are log-normal(0, 1), but adjusted to have 0 mean, and independent of each other. The distribution of z thus has 0 mean and a long tail to the right. One element of β is 0.9; the others are of an equal small value, determined by the standardization of β to length 1. These values were chosen to give approximately maximum skewness of the linear regression/method A estimate of β. Note that the model is invariant to linear transformations of x, so the fact that Σ_x is the identity matrix does not limit the generality of the experiment.

The results are given in Figure 5 for 5- and 10-dimensional x. The maximum degree of non-linearity (step function) is now reached for corr(φ(z), z) = 0.67. Note that the scale of the y-axis is increased in the lower panels of b). The main impression is similar to the situation with normally distributed x, although method S now does a less good job (the reason is that the assumptions underlying the synthetic data yield skewed distributions for the estimates of β).
Fig. 4. RMSE(φ) on simulated data, against corr(φ(z), z), for 50, 150 and 450 observations and error variance σ_ε² = 0.05 or 0.50. Notation: L for linear regression, P for PPR, A and S for methods A and S. x is normally distributed; panel 4a) x ~ N_p(0, I) with one predictor.
Fig. 5. RMSE(φ) on simulated data, against corr(φ(z), z). Notation as in Figure 4. Each element of x is log-normally distributed, with unequal elements of β. Five predictors in a), with β(1) = 0.9 and β(i) = 0.218 for i = 2, ..., 5; ten predictors in b), with β(1) = 0.9 and β(i) = 0.145 for i = 2, ..., 10.
Observe that the curves for PPR and method A are non-monotone, which may appear peculiar. The explanation can be found in Figure 2, having in mind the distribution of z. When the correlation is about 0.93, the non-linear part of the curve is outside the central range covered by most of the observations in the training sets. But in the test sets some observations will be located also in the non-linear part, and since the squared errors are included in the RMSE, even a few observations in this part will have a large impact. When the correlation decreases from 0.93, the non-linear part of the curve will be more and more in the middle of the training set observations. This causes the RMSE to first decrease, and then increase again when the curve becomes nearly a step function, since φ̂ is restricted to be smooth. (The fact that PPR gives a better result for a correlation of about 0.85 than for exact linearity in the lower left corner of Figure 5a) is due to simulation uncertainty.)

3.3. Detailed look at some specific cases

To gain more insight into what is going on we shall now consider a specific example in more detail. Suppose x ~ N_5(0, I), corr(φ(z), z) = 0.93, and that there are 150 observations in the training set. Consider first the estimation of β. β̂ is standardized to the same length as β (unit length), since it is the direction and not the length of β̂ that is of interest. To evaluate the estimation bias, one may compare the true β-vector with Eβ̂, where the expectation operator E here denotes the mean over 50 simulations. This gives a 5-dimensional measure. A one-dimensional measure of bias is ||Eβ̂ - β||, where ||a|| means the Euclidean norm. A one-dimensional measure of root mean squared error, including both bias and variance, is E||β̂ - β||. The results are shown in Table 1, for low and high error variance. Since the elements of β enter into the model and the methods in a completely symmetric way, there should be no bias at all, which is confirmed in the table. The fact that
Table 1
Statistics for estimated β̂, in the situation with x ~ N_5(0, I), 150 observations, and corr(φ(z), z) = 0.93. The expectation operator E denotes the mean over 50 simulations

Error var.  Method     Eβ̂(1)  Eβ̂(2)  Eβ̂(3)  Eβ̂(4)  Eβ̂(5)   ||Eβ̂ - β||   E||β̂ - β||
low         method S   0.44   0.44   0.46   0.45   0.45    0.015        0.068
            method A   0.45   0.44   0.44   0.45   0.44    0.006        0.039
            PPR        0.45   0.45   0.45   0.45   0.44    0.003        0.033
high        method S   0.45   0.43   0.48   0.43   0.45    0.039        0.174
            method A   0.45   0.42   0.47   0.43   0.45    0.040        0.172
            PPR        0.45   0.44   0.45   0.45   0.44    0.015        0.156
true β                 0.45   0.45   0.45   0.45   0.45    0.000        0.000
||E(β̂ - β)|| is higher when the error variance is high is only due to larger simulation uncertainty. The mean squared error indicates interesting differences between the methods. With low error variance, PPR and method A are substantially better than method S, and PPR slightly better than method A. With high error variance, methods S and A are about equally good, and PPR slightly better.

Consider next the estimation of φ. The results from two specific simulations are shown in Figure 6 a). For the 1000 observations in the test set, φ̂ is plotted against ẑ, together with the true function (φ against z). With low error variance, all three methods do a good job, and PPR is the best method in the central part of the data. With high error variance, all methods approximately fit the overall S-curve, but PPR produces more local bumps than the other methods, which is a consequence of overfitting. The total effect of estimation errors is visualized in Figure 6 b). It corresponds to Figure 6 a), but the x-axis is now the true z's. The φ̂'s look like point clouds, because the error in the estimate of β is taken into account. The width of a point cloud gives an idea of the error in the estimate of β, and this agrees with Table 1. The visual picture of the prediction performance in these particular simulations agrees with the mean RMSE(φ) over 50 simulations, displayed in each panel.

We have investigated the effect of varying the distribution of the predictors x. A multivariate t-distribution (elliptic) changed the results little from those shown above, but consider log-normally (non-elliptic) distributed predictors, as in Table 2. Note that the bias has increased substantially. As expected, this applies to method S in particular, but even methods A and PPR display signs of bias when the error variance is high. An experiment with a uniform distribution of each element of x (non-elliptic) resulted in an overestimation of the largest element of β, but apart from that, the results were close to the results from the log-normal experiment.
Table 2
Statistics for estimated β̂, in the situation with each element of x log-normal(0, 1), 150 observations, and corr(φ(z), z) = 0.85. The expectation operator E denotes the mean over 50 simulations

Error var.  Method     Eβ̂(1)  Eβ̂(2)  Eβ̂(3)  Eβ̂(4)  Eβ̂(5)   ||Eβ̂ - β||   E||β̂ - β||
low         method S   0.78   0.31   0.32   0.31   0.32    0.223        0.257
            method A   0.89   0.22   0.23   0.22   0.23    0.014        0.046
            PPR        0.90   0.22   0.22   0.22   0.21    0.006        0.036
high        method S   0.79   0.32   0.31   0.33   0.28    0.210        0.290
            method A   0.87   0.24   0.24   0.26   0.23    0.058        0.186
            PPR        0.88   0.25   0.25   0.27   -       0.076        0.195
true β                 0.90   0.22   0.22   0.22   0.22    0.000        0.000
3.4. Example with real data
The purpose of this section is to compare the performance of the regression methods on the Boston housing tract data of Harrison and Rubinfeld (1978). This is one of the data sets accompanying the BLSS package (Abrahams and Rizzardi, 1988). The response variable is a house price index, named MV. This is to be related to 13 predictor variables. The data set contains 506 observations. In the original analysis, the main purpose was to quantify the effect of one specific predictor variable, namely a pollution indicator called NOX. In the analysis here, most attention will be given to prediction. Harrison and Rubinfeld settled on the following equation:

log(MV) = β0 + β1 CRIME + β2 ZN + β3 INDUS + β4 CHAS + β5 NOX² + β6 RM² + β7 AGE
        + β8 log(DIS) + β9 log(RAD) + β10 TAX + β11 PTRATIO + β12 (B - 0.63)² + β13 log(LSTAT).    (19)
In the analysis given here, the response variable was kept untransformed, while the predictor variables were used in three different ways. First we entered them untransformed. This gives a simple analysis, and may be put into an automatic procedure. A more clever treatment would be to look at the data and try to find transformations of each predictor variable that explain the response variable well. This may result in the transformations of Harrison and Rubinfeld, which were our next choice. Finally, linear and quadratic terms of the predictor variables were used. This gives a more flexible analysis than just using the untransformed variables, and may be treated automatically, but it involves as many as 26 predictor variables.

The intention here is to compare the methods, not to analyse the data, so a cross-validation experiment similar to the simulation experiments was executed. A sample of 50 observations was drawn from the data without replacement and used for fitting, while the rest of the data were used for evaluating the predictions from the estimated model. This was repeated 50 times, and the RMSE(y) was computed. (RMSE(φ) could of course not be computed, since the true model is unknown.) The same procedure was carried out with 150 and 450 observations in the training set. Following the proposed modelling strategy of Friedman (1984b), a PPR analysis of the whole data set using untransformed variables resulted in a PPR model with k = 3 non-linear terms, while using the transformed variables resulted in 2 terms. PPR was thus fitted with 1, 2, and 3 terms in these cases, but was not used for the quadratic predictor set. Using only 50 or 150 observations to decide k would result in fewer terms in a PPR model. Note that Harrison and Rubinfeld used all observations to find their transformations, and fewer observations would probably lead to transformations of lower quality.
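A minimal sketch of this cross-validation scheme (our own illustration; `fit` and `predict` stand for any of the four methods, and the data-loading step is assumed):

```python
def cv_rmse(x, y, n_train, n_rep, fit, predict, rng):
    """Repeatedly draw n_train observations without replacement for fitting,
    predict the remaining observations, and average RMSE(y) over repetitions."""
    n = len(y)
    rmse = []
    for _ in range(n_rep):
        idx = rng.permutation(n)
        tr, te = idx[:n_train], idx[n_train:]
        model = fit(x[tr], y[tr])
        resid = y[te] - predict(model, x[te])
        rmse.append(np.sqrt(np.mean(resid ** 2)))
    return np.mean(rmse)
```

For example, `fit` could wrap `method_S` from Section 2.2 together with the interpolation and extrapolation step of Appendix A.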
Table 3
Crossvalidation results from the Boston data. Stars symbolize that method A is significantly better than PPR with k = 1. The three columns in the middle are predicted RMSE(y) for untransformed, transformed and quadratic predictor variables. The last three columns are the ratios of predicted RMSE(y) to fitted RMSE(y) for the same three treatments of the predictor variables

                               Predicted RMSE(y)              Predicted/fitted ratio
No. obs.  Method     k      untr.    tr.      quad.        untr.    tr.      quad.
50        lin. reg.  0      0.67     0.58     1.32         1.97     1.93     6.05
          method S   1      0.64     0.57     1.34         2.18     2.11     6.51
          method A   1      0.61*    0.57*    1.49         2.37     2.27     8.33
          PPR        1      0.63     0.61     -            2.76     2.55     -
          PPR        2      0.76     0.67     -            6.52     5.24     -
          PPR        3      0.81     0.68     -            13.87    9.52     -
150       lin. reg.  0      0.56     0.48     0.53         1.20     1.18     1.51
          method S   1      0.52     0.47     0.52         1.26     1.24     1.56
          method A   1      0.49*    0.46*    0.52         1.32     1.29     1.62
          PPR        1      0.50     0.48     -            1.42     1.37     -
          PPR        2      0.53     0.51     -            2.17     2.49     -
          PPR        3      0.54     0.52     -            2.71     2.49     -
450       lin. reg.  0      0.52     0.45     0.43         1.03     1.03     1.04
          method S   1      0.47     0.43     0.42         1.05     1.05     1.05
          method A   1      0.45     0.43     0.41         1.07     1.06     1.06
          PPR        1      0.45     0.43     -            1.09     1.08     -
          PPR        2      0.39     0.36     -            1.24     1.14     -
          PPR        3      0.39     0.36     -            1.35     1.23     -
The results are shown in Table 3, as RMSE(y) from the test sets, and the ratio of RMSE(y) from the test sets to RMSE(y) from the training sets. The reader should keep in mind that in a realistic analysis with 50 observations, 2- or 3-term PPR models would not be used, good transformations would be difficult to find, and a full quadratic model would not be fitted uncritically. The most important comparisons are to be made within each column, for a specific number of observations. The differences between the methods are not very large, but the pattern is very clear. Significance tests applied to the difference between method A and PPR with 1 term show that method A is significantly better (at the 95% level) than PPR for 50 and 150 observations with both untransformed and transformed predictor variables, while the methods are equally good with 450 observations. The results for 450 observations make clear that a 1-term model is too simple to describe the data satisfactorily. The predicted/fitted ratio of RMSE demonstrates nicely the ordering of the methods from simple to complex.

An analysis with all 506 observations gave very similar results for method A and PPR with 1 term. The estimated β̂-vectors were about the same. Figure 7 shows the response values and the estimated curves against the estimated directions, and should be compared to Figure 1, which shows the same for method S. For both method A and PPR, corr(φ̂(ẑ), ẑ) is about 0.9, and the
error variance is about 20% of the total variance of y. Together with the results of the simulation experiments, this explains the results in Table 3. Neglecting the question about additivity of the error term, the curvatures of the functions correspond to the log-transformation of Harrison and Rubinfeld, which is opposite to the results of the ACE-analysis of Breiman and Friedman (1985). In an analysis of the whole data set, the goodness of fit may be evaluated by crossvalidation or by bootstrapping. The uncertainty in the estimates of β and φ may be evaluated by bootstrapping.
4. Concluding remarks
The empirical experiments in this paper clearly show that PPR may give poor results due to overfitting when the noise level is high relative to the number of observations. This is not surprising, since PPR estimates β and φ simultaneously by least squares. The empirical results indicate that method A is a very good candidate when the non-linearities are moderate. But it is not well understood why this is so. Method A is related to ordinary linear regression, which may lead to more stable estimates, but as with PPR, β and φ are estimated simultaneously, by solving the moment equation (17). Until more research is done, the most important motivation for method A is the empirical results, which are not fully general. Other types of non-linearities or non-Gaussian noise may lead to other conclusions.
Appendix A. Implementation details
PPR, method S, and method A all need a smoother to estimate the φ-functions. For all three methods smoothing has been performed by the supersmoother of Friedman (1984a), see Appendix B. This gives a smoothed y-value for each pair (y_i, z_i) in the training set, with z_i = β̂'x_i. When predicting in a test set, interpolation and extrapolation have to be done from these points. For interpolation we have used the NAG routines E01BEF/E01BFF (NAG Fortran Library, Mark 14, 1990), which use a piecewise cubic Hermite interpolant that preserves monotonicity over ranges where the data points are monotonic. For extrapolation we used linear regression y = b_0 + b_1 z, with the coefficients estimated from the 10 points with lowest z-values for extrapolating to the left, and the 10 points with highest z-values for extrapolating to the right.

The computer program used for PPR is written by Friedman (1984b). Estimation of a PPR model with k terms is done by first fitting a model with k + 2 terms, and then eliminating two terms by a backward stepwise procedure. Implementation of method S is straightforward, since it just smooths the linear regression solution. Concerning method A, we have implemented a modified version of the algorithm given in Section 2, explained below.
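Before describing that modification, the prediction step above can be sketched as follows (our own Python illustration; scipy's PchipInterpolator serves as a stand-in for the NAG routines, being likewise a monotonicity-preserving piecewise cubic Hermite interpolant):

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def make_predictor(z, phi_hat, n_edge=10):
    """Build a prediction function from the smoothed training pairs (z_i, phi_hat_i):
    monotone cubic interpolation inside the training range, linear extrapolation
    outside, fitted to the n_edge points nearest each end."""
    order = np.argsort(z)
    z_s, f_s = z[order], phi_hat[order]
    z_s, keep = np.unique(z_s, return_index=True)     # interpolator needs distinct z
    f_s = f_s[keep]
    interp = PchipInterpolator(z_s, f_s, extrapolate=False)
    left = np.polyfit(z_s[:n_edge], f_s[:n_edge], 1)        # slope, intercept
    right = np.polyfit(z_s[-n_edge:], f_s[-n_edge:], 1)

    def predict(z_new):
        out = interp(z_new)                                  # NaN outside the range
        lo, hi = z_new < z_s[0], z_new > z_s[-1]
        out[lo] = np.polyval(left, z_new[lo])
        out[hi] = np.polyval(right, z_new[hi])
        return out

    return predict
```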
Denote the left and right sides of equation (17) by L and R. The goal of the method is to find β̂ and φ̂ such that L = R. The difference between L and R may be measured by the following scalar distance:

D = || L/||L||_{Σ̂_x} - R/||R||_{Σ̂_x} ||_{Σ̂_x},    (20)

where ||a||²_{Σ̂_x} = a'Σ̂_x a. The norm is taken with regard to Σ̂_x, to take into account the scales and dependencies in the predictor variables. Owing to the standardization, the distance measure may be seen relative to 1.

The algorithm for method A given in Section 2 does not always reduce D in each iteration, so we have used the following extra steps in addition to the original steps (0)-(3). After iteration number j, calculate D^j, and let D^{min,j} be the least D up to and including iteration j, with corresponding direction vectors β̂^{min,j} and β̂^j. Calculate also the (relative) improvement from iteration j-1 to j as

I^j = (D^{min,j-1} - D^j)/D^{min,j-1}.    (21)

This may result in one of the following four situations: (a) If D^j is less than a specified small number, then the solution is found with the desired accuracy, and the iteration is stopped. (b) Else, if the absolute value of the improvement is small in two subsequent iterations, then this signals that a (global or local) minimum is approximately reached, and the iteration is stopped. (c) Else, if the improvement is positive or zero, then do an ordinary iteration. (d) Else, if the improvement is negative, then try again with a new direction vector between β̂^{min,j-1} and β̂^j. This may be written as:

(a) if D^j < δ1, then stop; the solution is found.
(b) elseif abs(I^j) < δ2 and abs(I^{j-1}) < δ2, then stop.
(c) elseif I^j ≥ 0, go to step (1).
(d) elseif I^j < 0, replace steps (1) and (2) with β̂^{j+1} = ½(β̂^{min,j-1} + β̂^j).
endif

If the algorithm stops in (a), then the solution of (17) is found with satisfactory accuracy. If it stops in (b), then, measured by D, the end point will at least be as good as the starting point given by method S. It is also possible to judge how far from the solution the end point is, since D may be seen relative to 1. It is reasonable to let the required accuracy depend on the complexity of the problem. δ1 is set according to (22), where the constant c is chosen so that if n = 50 observations and the noise variation is 50% of the total variation, that is σ_ε²/σ_y² = 0.50, then δ1 = 0.025. When
σ_ε² and σ_y² are unknown, they can be replaced by the empirical values. δ2 is set to 0.005.

The ability of the algorithm to find the solution of (17) with the desired accuracy will depend on the situation. This is illustrated by the following examples from the simulation study in Section 3, with x distributed as log-normal(0, 1) and with dimension 10. With a linear function, small error variance, and 450 observations, the algorithm always converged to a solution with D < δ1 = 0.0025. In this linear case, the ideal is to stop after step (0). On average over the 50 simulations, 0.7 iterations were made after step (0). With a step function, large error variance and 50 observations, the algorithm converged to a solution with D < δ1 = 0.025 in about 50% of the simulations, the largest D at the end point was 0.18, and the average number of iterations was 16. However, when the algorithm stopped with a large D, the values of the resulting (β̂, φ̂) were usually not far from the starting values from method S. This tended to occur in situations with high error variance and few observations, where it is difficult to get precise estimates. It is reasonable that the results from method A are close to those from method S in such situations.

To test how sensitive the algorithm is to the starting values from method S, we repeated the simulation experiment in the cases with log-normal x, dimension 10, 50 observations, and either low or high error variance (compare the two panels on the left in Figure 5b). First, the starting values of β from method S were changed slightly, by multiplying each element by a factor drawn independently from a uniform distribution between 0.9 and 1.1. This had only a minor effect on the results. Then the starting values of β were drawn at random, with each element drawn independently from a uniform distribution with mean 0. In most simulations (about 95%) this gave approximately the same results as before, but in a few simulations the algorithm stopped with a D of the order of 0.5, with correspondingly bad predictions.
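The distance D of (20) and the step-control rules (a)-(d) can be sketched as follows (our own illustration, continuing the Python sketches of Section 2; delta1 and delta2 are the tolerances described above, and details of the control flow are simplified):

```python
def distance_D(beta, phi_hat, z, x, y, Sxx, Sxx_inv, beta_ols):
    """Scalar distance (20) between the two sides of equation (17),
    each normalized in the Sigma_x-hat norm."""
    lam = np.mean(z * phi_hat)                              # lambda-hat, eq. (15)
    u = x - np.outer(z, Sxx @ beta)
    B_hat = Sxx_inv @ (phi_hat[:, None] * u).mean(axis=0)   # eq. (16)
    L, R = beta_ols, lam * beta + B_hat
    def norm(a):
        return np.sqrt(a @ Sxx @ a)
    return norm(L / norm(L) - R / norm(R))

def method_A_controlled(x, y, delta1, delta2=0.005, max_iter=50):
    """Method A with the step-control rules (a)-(d) of Appendix A (sketch)."""
    x = x - x.mean(axis=0)
    y = y - y.mean()
    Sxx = np.cov(x, rowvar=False)
    Sxx_inv = np.linalg.inv(Sxx)
    beta_ols = Sxx_inv @ (x.T @ y / (len(y) - 1))

    beta, z, phi_hat = method_S(x, y)                          # step (0)
    D_min = distance_D(beta, phi_hat, z, x, y, Sxx, Sxx_inv, beta_ols)
    beta_min, I_prev, back_off = beta.copy(), np.inf, False
    for _ in range(max_iter):
        if back_off:                                           # rule (d): halfway back
            beta_new = 0.5 * (beta_min + beta)
        else:                                                  # steps (1)-(2)
            u = x - np.outer(z, Sxx @ beta)
            B_hat = Sxx_inv @ (phi_hat[:, None] * u).mean(axis=0)
            beta_new = beta_ols - B_hat
        beta_new /= np.sqrt(beta_new @ Sxx @ beta_new)
        z = x @ beta_new                                       # step (3)
        phi_hat = local_linear_smooth(z, y)
        D = distance_D(beta_new, phi_hat, z, x, y, Sxx, Sxx_inv, beta_ols)
        I = (D_min - D) / D_min                                # improvement, eq. (21)
        beta = beta_new
        if D < delta1:                                         # rule (a)
            break
        if abs(I) < delta2 and abs(I_prev) < delta2:           # rule (b)
            break
        back_off = I < 0                                       # rules (c)/(d) for next pass
        if D < D_min:
            D_min, beta_min = D, beta.copy()
        I_prev = I
    return beta, z, phi_hat
```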
Appendix B. The supersmoother
We here give a short description of the supersmoother of Friedman (1984a), and investigate the effect of varying the degree of smoothness restrictions. The supersmoother tries to estimate E(y | z); denote the estimate by Ê(y | z) = φ̂(z). The input of the smoother is n pairs of data values (y_i, z_i), i = 1, ..., n. To estimate E(y_i | z_i), it uses those observations with predictor values z in a neighbourhood of z_i to fit a least squares straight line, and takes the value of the line at z_i as the smoothed value. A critical parameter is the size of the neighbourhood: how many observations should be used to fit the line? This is measured by the span value, defined as the share of the n data points which take part in the fitting of the line. The bigger the span, the smoother φ̂ will be. The span value may vary over the range of predictor z values. The optimal span value for each z_i is automatically chosen by cross validation, but it is constrained to vary smoothly from one observation to the next (as ordered on z_i).
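A toy version of the span-selection idea (our own sketch, not Friedman's algorithm: a single global span is chosen by leave-one-out cross validation for the local-linear smoother of Section 2.2, whereas the supersmoother chooses a local span varying over z):

```python
def loo_cv_span(z, y, spans=(0.2, 0.3, 0.5)):
    """Pick the span minimizing leave-one-out prediction error for the
    k-nearest-neighbour local-linear smoother (global span only)."""
    n = len(z)
    best_span, best_err = None, np.inf
    for span in spans:
        k = max(2, int(span * n))
        err = 0.0
        for i in range(n):
            # k nearest neighbours of z[i], excluding observation i itself
            d = np.abs(z - z[i])
            d[i] = np.inf
            nb = np.argsort(d)[:k]
            b1, b0 = np.polyfit(z[nb], y[nb], 1)
            err += (y[i] - (b0 + b1 * z[i])) ** 2
        if err < best_err:
            best_span, best_err = span, err
    return best_span
```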
In the ordinary simulations we used a variable span value chosen by cross validation. To test how sensitive the results are to the choice of the smoother, we repeated the simulation experiment in the cases with log-normal x, dimension 10, 50 observations, and low and high error variance, as at the end of Appendix A. These repeated experiments were performed with a constant high span value of 0.5 (smooth curve) and a constant low span value of 0.2 (not so smooth curve). As one could expect, use of the large span reduced the RMSE for the three non-linear methods in cases with corr(φ(z), z) near 1, and increased the RMSE in cases with lower correlation. This is a natural effect of fitting a smoother curve. The effect of using the low span value was the opposite: the RMSE was increased in the situations with correlation near 1, and otherwise decreased. However, the change in the span value did not affect the ordering between the three non-linear methods. This indicates that the choice of smoother is not critical to our general conclusions.

One option of the supersmoother is the so-called 'bass control'. This forces the supersmoother to use a larger span than what is chosen by cross validation, and thus gives smoother curves. We did not use this in our experiments.
References

Abrahams, D.M. and F. Rizzardi, BLSS - The Berkeley Interactive Statistical System (W.W. Norton & Company, New York, 1988).
Breiman, L. and J.H. Friedman, Estimating optimal transformations for multiple regression and correlation, with discussion, Journal of the American Statistical Association, 80 (1985) 580-619.
Diaconis, P. and M. Shahshahani, On non-linear functions of linear combinations, SIAM Journal on Scientific and Statistical Computing, 5 (1984) 175-191.
Friedman, J.H., A variable span smoother (Stanford University, Department of Statistics, Report LCS 05, 1984a).
Friedman, J.H., SMART user's guide (Stanford University, Department of Statistics, Report LCS 01, 1984b).
Friedman, J.H., Multivariate adaptive regression splines, The Annals of Statistics, 19 (1) (1991) 1-141.
Friedman, J.H. and W. Stuetzle, Projection pursuit regression, Journal of the American Statistical Association, 76 (1981) 817-823.
Friedman, J.H. and B.W. Silverman, Flexible parsimonious smoothing and additive modeling, with discussion, Technometrics, 31 (1989) 3-40.
Harrison, D. and D.L. Rubinfeld, Hedonic housing prices and the demand for clean air, Journal of Environmental Economics and Management, 5 (1978) 81-102.
Muirhead, R.J., Aspects of Multivariate Statistical Theory (Wiley, New York, 1982).
NAG Fortran Library, Mark 14 (NAG Ltd., Oxford, UK, 1990).
Schweder, T., Canonical regression, NR Report no. 804, Norwegian Computing Center (1987, revised version 1989).
Tibshirani, R., Estimating transformations for regression via additivity and variance stabilization, Journal of the American Statistical Association, 83 (1988) 394-405.