3.13 Other Methods in Nonlinear Regression
B. Li, B. R. Bakshi, and P. K. Goel, Ohio State University, Columbus, OH, USA
© 2009 Elsevier B.V. All rights reserved.
Contents

3.13.1 Introduction
3.13.2 Classification and Regression Trees
3.13.2.1 Tree Building
3.13.2.2 Tree Pruning
3.13.2.3 Advantages and Disadvantages of CART
3.13.2.4 Other Tree Methods
3.13.2.5 Other Issues
3.13.3 Multivariate Adaptive Regression Splines
3.13.3.1 Forward Stepwise Part
3.13.3.2 Backward Stepwise Part
3.13.3.3 Polishing
3.13.3.4 Model Selection in MARS
3.13.3.5 ANOVA Interpretation
3.13.3.6 Advantages and Disadvantages of MARS
3.13.3.7 Other Issues
3.13.4 Projection Pursuit Regression
3.13.4.1 Steps in PPR
3.13.4.1.1 Searching for the next term
3.13.4.1.2 Termination
3.13.4.2 Advantages and Disadvantages of PPR
3.13.5 Illustrative Example
References

Symbols

C(M): cost complexity measure of the MARS model
M: number of basis functions
P_m: number of partitions or splits
s_pm = ±1: right or left of the associated step function
|T|: number of terminal nodes in tree T (CART)
t_pm: location of the split (or knot) in the corresponding input space
x = x1, x2, ..., xJ: inputs or predictor variables
ŷ_k: kth predicted output or response variable
Θ: matrix of basis function parameters
α|T|: term to penalize the overfitting of trees
β_mk: output weight or regression coefficient relating the mth basis function to the kth output
φ_m: mth basis function
φ_m(ω_m^T x): ridge function in PPR
τ_m: input transformation
ν(p, m): selected input variables in each partition
3.13.1 Introduction

Besides all the nonlinear regression methods discussed in the previous chapters, classification and regression trees (CART) (see also Chapter 3.17), multivariate adaptive regression splines (MARS), and projection pursuit regression (PPR) have also been introduced into chemometrics recently,1,2,3 although they have not
received as much attention as traditional linear statistical methods and artificial neural networks for empirical modeling. An insight into the relationship between the various methods is possible via a common framework4 that brings out the similarities and differences between empirical modeling methods, including CART, MARS, and PPR. The framework is based on the fact that all empirical modeling methods can be represented as a weighted sum of basis functions,

$$\hat{y}_k = \sum_{m=1}^{M} \beta_{mk}\,\varphi_m\big(\tau_m(\Theta;\mathbf{x})\big) \qquad (1)$$

where ŷ_k is the kth predicted output or response variable, φ_m the mth basis function, β_mk the output weight or regression coefficient relating the mth basis function to the kth output, Θ the matrix of basis function parameters, τ_m the input transformation, and x = x1, x2, ..., xJ the inputs or predictor variables. The latent variables or transformed inputs are represented as

$$\mathbf{z}_m = \tau_m(\Theta;\mathbf{x})$$
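To make the common framework of Equation (1) concrete, the following minimal Python sketch evaluates a weighted sum of basis functions applied to transformed inputs. The particular transformations, basis functions, and weights are illustrative placeholders and do not correspond to any specific method discussed below.

```python
# A minimal sketch of Equation (1): predictions as a weighted sum of basis
# functions of transformed inputs. All choices below are illustrative only.
import numpy as np

def predict(X, transforms, basis_funcs, beta):
    """X: (n, J) inputs; transforms/basis_funcs: length-M lists of callables;
    beta: (M,) output weights. Returns the (n,) predictions."""
    Z = np.column_stack([tau(X) for tau in transforms])               # z_m = tau_m(Theta; x)
    Phi = np.column_stack([phi(Z[:, m]) for m, phi in enumerate(basis_funcs)])
    return Phi @ beta                                                 # y_hat = sum_m beta_m * phi_m(z_m)

# Example: two projection-type transformations and sigmoidal basis functions.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
transforms = [lambda X: X @ np.array([1.0, 0.0, 0.5]),
              lambda X: X @ np.array([0.0, 1.0, -0.5])]
basis_funcs = [np.tanh, np.tanh]
print(predict(X, transforms, basis_funcs, beta=np.array([0.7, -0.3])))
```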
Different empirical modeling methods may be derived depending on the nature of the input transformation, the type of activation functions, and the optimization criteria for determining the adjustable parameters. Both CART and MARS belong to the category of partition-based methods, since they partition the input space by selecting the most relevant variables. These methods may also be considered a special case of projection-based methods, which include backpropagation networks and projection pursuit regression, with the projection hyperplane restricted to be perpendicular to at least one of the input axes; that is, the elements of the projection directions are either zero or one. In general, partition-based methods can be represented as a special version of Equation (1):

$$\hat{y} = \sum_{m=1}^{M} \beta_m\,\varphi_m\big(\mathbf{x}_{R_m}\big) \qquad (2)$$
where R_m is the set of selected inputs. Restricting the inputs in the basis functions to a subset results in partitioning of the input space into hyperrectangular regions. Both CART and MARS aim to derive a good set of regions in the input space, with corresponding basis functions, that minimize the specified objective function. By explicitly selecting the most relevant set of input variables, CART and MARS models tend to be physically interpretable.

PPR is a nonlinear multivariate statistical modeling technique developed by Friedman and Stuetzle5 for analyzing high-dimensional data (i.e., data with a large number of input variables). The central idea of PPR is to extract linear combinations of the input variables as derived features, and then model the target function as a nonlinear additive function of these features. Thus, PPR projects the input variables onto a linear hyperplane, making it a projection-based method. Hence, like CART and MARS, the PPR model can also be represented by Equation (1). PPR is also similar to backpropagation networks (BPN) because of the linear projection. However, in PPR the basis functions are not prespecified and can adapt their shape to the training data, unlike the fixed-shape sigmoid basis functions used in BPN. Thus, in PPR, the input/output parameters and the shape of the basis functions are determined to minimize the mean square error of the approximation,

$$\min_{\Theta,\,\beta_m,\,\varphi_m} \;\|\mathbf{y} - \hat{\mathbf{y}}\|^2 \qquad (3)$$
In fact, one can view linear regression, the generalized additive model (GAM),6 and PPR as a group of regression procedures ordered by increasing generality. Specifically, linear regression models the regression surface by a sum of linear functions of the input variables, whereas GAM allows for nonlinearity by modeling with a general smooth function of each input variable. PPR allows both nonlinearity and interactions by modeling with adaptive smooth functions of linear combinations of the input variables. In this chapter, we describe CART, MARS, and PPR in detail.
3.13.2 Classification and Regression Trees

CART or recursive partition regression7 is a form of binary recursive partitioning (Chapter 3.17). The term 'binary' implies that we first split the space into two regions and model the response by a constant for each region. Then, we choose the variable and split point to achieve the best fit again on one or both of these regions. Thus, each node can be split into two child nodes, in which case the original node is called a parent node. The term 'recursive' refers to the fact that the binary partitioning process can be applied over and over again. Thus, each parent node can give rise to two child nodes and, in turn, each of these child nodes may themselves be split, forming additional children. The general empirical model given by Equation (1) may be specialized to describe the CART-fitted model as

$$\hat{y} = \sum_{m=1}^{M} \beta_m \prod_{p=1}^{P_m} H\big[s_{pm}\,(x_{\nu(p,m)} - t_{pm})\big] \qquad (4)$$

where H is the Heaviside or unit step function,

$$H[\eta] = \begin{cases} 1 & \text{if } \eta \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
P_m is the number of partitions or splits, s_pm = ±1 indicates the right or left side of the associated step function, ν(p, m) indicates the input variable selected in each partition, and t_pm represents the location of the split (or knot) in the corresponding input space. The indices p and m are used for the split and the node or basis function, respectively. CART analysis consists of two basic steps. The first step is tree building, during which a tree is built using recursive binary splitting. Each resulting node is assigned a value based on the distribution of the training observations that fall in that node, and a predicted value is assigned to each node whether or not that node is subsequently split into child nodes. At the point where the tree building process stops, a 'maximal' tree is generated, which probably greatly overfits the information contained within the training data set. The second step is tree 'pruning', which creates a sequence of simpler and simpler trees through the cutting off of increasingly important nodes; an optimal tree is then selected from among this sequence of pruned trees. Both steps are discussed in more detail below. A discussion of classification based on CART is available in Chapter 3.17.
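The following short sketch evaluates a fitted tree written in the form of Equation (4), where each terminal node contributes its constant multiplied by a product of Heaviside step functions along its path; the splits used here are hypothetical and chosen only for illustration.

```python
# A sketch of Equation (4): a CART fit as a sum over terminal nodes of
# beta_m times a product of Heaviside step functions. Splits are hypothetical.
import numpy as np

H = lambda eta: (eta >= 0).astype(float)        # Heaviside / unit step function

def cart_predict(x, leaves):
    """leaves: list of (beta_m, [(s_pm, var_index, t_pm), ...]) per terminal node."""
    y_hat = 0.0
    for beta_m, splits in leaves:
        indicator = 1.0
        for s_pm, j, t_pm in splits:            # product over the P_m splits of node m
            indicator *= H(s_pm * (x[j] - t_pm))
        y_hat += beta_m * indicator             # only the leaf containing x contributes
    return y_hat

# Two-leaf tree splitting on x_0 at t = 1.0:
leaves = [(0.5, [(-1, 0, 1.0)]),    # x_0 <= 1.0 -> 0.5
          (1.8, [(+1, 0, 1.0)])]    # x_0 >= 1.0 -> 1.8
print(cart_predict(np.array([0.3, 2.0]), leaves),
      cart_predict(np.array([1.4, 2.0]), leaves))
```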
3.13.2.1 Tree Building
It is known that finding the best binary partition in terms of the minimum mean square error defined in Equation (3) is generally computationally infeasible. Hence, CART proceeds with a greedy algorithm that starts the tree building process at the root node, which includes the full training data set. Beginning with this node, the CART algorithm finds the best possible variable to split the node into two child nodes. To find the best variable, it checks all possible splitting variables (called splitters), as well as all possible values of each variable to be used to split the node. Note that, for every candidate splitter and split point, the value that minimizes the mean square error at each node is simply the average of all the responses that fall into that node. The process of node splitting, followed by the assignment of a predicted value to each node, is repeated for each child node and continued recursively until it is impossible to continue or some user-specified condition is met. More specifically, the process is stopped when (1) there is only one observation in each of the child nodes; (2) all observations within each child node have an identical distribution of input variables, making splitting impossible; or (3) an external or prespecified condition on the number of nodes in the maximal tree, set by the user, has been reached.
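As a rough illustration of this greedy search, the sketch below scans every candidate variable and split point, predicts each child node by the mean response falling into it, and returns the split with the smallest summed squared error; it is a simplification of the actual CART implementation.

```python
# A minimal sketch of the greedy best-split search described above.
import numpy as np

def best_split(X, y):
    n, p = X.shape
    best = (np.inf, None, None)                  # (sse, variable index, split value)
    for j in range(p):
        for t in np.unique(X[:, j])[:-1]:        # candidate split points for variable j
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            # each child is predicted by its mean response
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, t)
    return best

rng = np.random.default_rng(1)
X = rng.uniform(-1.5, 1.5, size=(50, 2))
y = X[:, 0] ** 2 + 0.05 * rng.normal(size=50)
print(best_split(X, y))    # should select a split on variable 0
```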
3.13.2.2 Tree Pruning
A very large tree might overfit the data, whereas a small tree might not capture the important structure from the data. Therefore, tree size is an important tuning parameter governing the model’s complexity, and the optimal tree size should be adaptively chosen from the data. In CART, the generated ‘maximal’ tree is pruned using a
'cost-complexity' strategy to generate a sequence of simpler and simpler trees, each of which is a candidate for the appropriately fit final tree. The cost-complexity strategy relies on a complexity parameter, denoted α. The idea is to optimize not the mean square error in Equation (3) but the cost complexity criterion

$$C_\alpha(T) = \|\mathbf{y} - \hat{\mathbf{y}}\|^2 + \alpha\,|T| \qquad (5)$$
where |T| is the number of terminal nodes (nodes without any child nodes) in tree T. Hence, a biased resubstitution measure is used in the pruning process, and the α|T| term serves to penalize overfitted trees that partition the input space too finely. In particular, beginning at the terminal nodes, nodes are pruned away if doing so reduces the cost complexity C_α(T). Thus, α is a measure of the additional accuracy a split must add to the entire tree to warrant the additional complexity. As α is increased, more nodes are pruned away, resulting in simpler trees. The goal in selecting the optimal tree, defined with respect to expected performance on an independent set of data, from the sequence of pruned trees is to find the correct complexity parameter α so that the information in the training data set is fit but not overfit. In general, finding this value of α would require an independent test set, but this requirement can be avoided using K-fold cross-validation. In K-fold cross-validation, the training data set is partitioned into K subsets. Of the K subsets, a single subset is retained as the validation data for testing the model, and the remaining K − 1 subsets are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsets used exactly once as the validation data. The K results from the folds are then averaged to produce a single estimate.
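A hedged sketch of this pruning-plus-cross-validation procedure is given below, using scikit-learn's tree implementation, which exposes a cost-complexity pruning path; the data are synthetic placeholders, and the details differ somewhat from the original CART software.

```python
# Cost-complexity pruning with K-fold cross-validation, sketched with
# scikit-learn's CART-style regression tree. Synthetic data only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(-1.5, 1.5, size=(200, 2))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)

# The sequence of pruned subtrees is indexed by the complexity parameter alpha.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# K-fold cross-validation (K = 5) to pick the alpha with the best average fit.
scores = [cross_val_score(DecisionTreeRegressor(random_state=0, ccp_alpha=a),
                          X, y, cv=5, scoring="neg_mean_squared_error").mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
print("selected alpha:", best_alpha)
```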
3.13.2.3 Advantages and Disadvantages of CART
CART has a number of advantages over traditional linear and nonlinear regression methods. First, CART is inherently nonparametric and can naturally handle mixed types of input variables. In other words, no assumptions are made regarding the underlying distribution of the input variable values, for example, numerical data that are highly skewed or multimodal, or categorical predictors with either ordinal or nonordinal structure. This is an important feature, as it eliminates the researcher time that would otherwise be spent determining whether variables are normally distributed and transforming them if they are not.

Second, CART has sophisticated methods for dealing with missing values. Instead of discarding all observations with some missing values, or trying to fill in (impute) the missing values with, say, the mean of that predictor over the nonmissing observations, CART offers two better approaches. The first is to construct surrogate variables during the model fitting procedure. Specifically, when considering a predictor for a split, CART uses only the observations for which that predictor is not missing. Having chosen the best (primary) predictor and split point, CART forms a list of surrogate predictors and split points. The first surrogate is the predictor and corresponding split point that best mimics the split of the training data achieved by the primary split; the second surrogate is the predictor and corresponding split point that does second best; and so on. When sending observations down the tree, either in training or in prediction, CART uses the surrogate splits in order whenever the primary splitting predictor is missing. The second approach is applicable only to categorical predictors: CART simply makes a new category for 'missing'. This may help the analyst discover that observations with missing values for some measurement behave differently from those with nonmissing values.

Third, CART is adept at capturing nonadditive behavior; that is, complex interactions among predictors are routinely and automatically handled with relatively little input required from the analyst. This is in marked contrast to some other multivariate nonlinear modeling methods, in which extensive input from the analyst, analysis of interim results, and subsequent modification of the method are required.

Finally, the CART-fitted model is often relatively easy to interpret. The input space partition is fully described by a single tree and can be easily visualized by a tree diagram. Because of the hierarchical structure of its rules and their inherent logic, the tree presentation is often considered to mimic the way a person thinks.

Despite its many advantages, CART has a number of disadvantages. One major problem with this type of approach is the instability of tree structures; that is, a small change in the input data can change the entire tree structure, which is a crucial obstacle to sound interpretability. Another problem is
that the fitted response surface from CART lacks smoothness, which may severely degrade predictive performance when the underlying target function is expected to be smooth. Furthermore, CART has difficulty capturing additive structure (structure without interactions among the input variables). It is possible for an accurate predictive CART model to contain substantial interactions that are not actually present in the target function, especially when there is a high degree of collinearity among some (or all) of the input variables in the training data. Hence, it is difficult for CART to fit certain types of simple functions, for example, additive linear functions with more than a few nonzero coefficients.8
3.13.2.4 Other Tree Methods
Besides CART, ID3 (which received its name because it was the third in a series of 'interactive dichotomizer' procedures) and its successors and refinements, C4.5 and C5.0,9 are the other popular tree building procedures. ID3 was limited to categorical predictors and used a top-down rule with no pruning. With more recent developments, C5.0 has become quite similar to CART. However, C5.0 is able to derive a set of rules after a tree is grown. For example, the splitting rules that define the terminal nodes can sometimes be simplified without affecting the subset of observations that fall into those nodes. In addition, C4.5 and C5.0 do not use surrogate splits to handle missing values. Since they do not compute surrogate splits, and hence do not need to store them, they may be preferred to CART if space complexity (storage) is a major concern.
3.13.2.5 Other Issues
Suppose we have n training observations in p dimensions (input variables). Then CART requires O(pn log n) operations for an initial sort of each predictor and typically another O(pn log n) operations for the split computations. If the splits occur near the edges of the predictor ranges, this number can increase to O(n²p). In practice, searching for surrogate splits approximately doubles the computational time relative to not searching for them. It should be noted that the models fitted by CART are not guaranteed to be optimal. At each stage of the tree building process, the split selected is the one that immediately reduces the variation the most; that is, CART grows trees using a greedy algorithm. It is possible that some other split would better set things up for further splitting to be effective. However, a 'long-sighted' tree-growing program would require much more computational time. Thus, CART sacrifices optimality for computational efficiency, and the 'greediness' of building a tree is alleviated by the pruning step.

Recently, many tree-based methods have been proposed to overcome the drawbacks of trees, for example, instability and poor predictive performance. Among them, multiple additive regression trees (MART) and random forests are two popular tree-based methods that substantially improve both the stability and the predictive power of CART. MART, which was developed by Friedman,10 is an implementation of gradient tree boosting for predictive data mining, while the random forest, proposed by Breiman,11 is a refinement of the bagging strategy12 applied to CART. Instead of constructing a single tree, both MART and random forests build many trees, and the final fitted model and prediction are based on an ensemble of trees. Additional details about such extensions are available in Chapter 2.31.
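For readers who wish to experiment with these ensemble ideas, the sketch below uses scikit-learn's random forest and gradient-boosted tree regressors, which are closely related to, though not identical with, Breiman's random forest and Friedman's MART; the data are synthetic placeholders.

```python
# A brief sketch of tree ensembles as a remedy for single-tree instability.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-1.5, 1.5, size=(300, 2))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=300)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)      # bagged trees on bootstrap samples
gb = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X, y)  # stagewise boosted trees
print(rf.predict(X[:3]), gb.predict(X[:3]))
```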
3.13.3 Multivariate Adaptive Regression Splines

MARS8 is an adaptive regression procedure well suited for high-dimensional problems, and it overcomes the disadvantages of CART. MARS attempts to approximate complex relationships by a series of linear regressions on different intervals of the input variable ranges (i.e., subregions of the independent variable space). Specifically, MARS divides the input space into overlapping partitions by keeping both the parent and daughter nodes after splitting a region. This prevents discontinuities in the approximation at the partition boundaries, producing continuous approximations with continuous derivatives. MARS is highly
flexible, as it can adapt to very complex functional forms, including nonlinear functions with complicated interactions. As in CART, the MARS model can also be represented by Equation (2). However, instead of using the step function, MARS uses multivariate spline basis functions obtained by multiplying univariate splines represented by a two-sided truncated power basis,

$$\hat{y} = \sum_{m=1}^{M} \beta_m \prod_{p=1}^{P_m} \big[s_{pm}\,(x_{\nu(p,m)} - t_{pm})\big]_+^q \qquad (6)$$

where q is the order of the splines and the subscript + indicates the positive part of the argument. Note that the basis functions of Equation (4) produced by recursive partitioning are a subset of a complete tensor product q = 0 spline basis with knots {t_pm} at every distinct marginal data value. Hence, CART can be viewed as a special case of MARS, in which a (relatively very small) subset of regressor functions is selected from this (very large) complete spline basis. The MARS algorithm consists of three basic steps. The first step is the forward stepwise part of MARS, during which an adaptive regression model is built using the truncated power basis in Equation (6) with q = 1 (piecewise linear basis). The second step is the backward stepwise part of MARS, similar to the 'pruning' process in CART; an optimal model in terms of a lack-of-fit criterion is selected. The last step is to 'polish' the optimal fitted response surface by replacing the piecewise linear basis with a piecewise cubic basis. Each of these steps is described in more detail below.
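The sketch below builds the kind of basis matrix implied by Equation (6) with q = 1, in which each basis function is a product of hinge (truncated linear) functions over distinct predictors; the knots and signs used here are illustrative only.

```python
# A minimal sketch of the truncated power basis of Equation (6) with q = 1.
import numpy as np

def hinge(x, t, s):
    """Truncated linear function [s * (x - t)]_+ with knot t and sign s = +/-1."""
    return np.maximum(s * (x - t), 0.0)

def mars_basis(X, terms):
    """terms: one list of factors [(var_index, knot, sign), ...] per basis function.
    Returns the (n, M) basis matrix, including a constant column."""
    cols = [np.ones(X.shape[0])]
    for term in terms:
        b = np.ones(X.shape[0])
        for j, t, s in term:                   # product over the factors of this term
            b *= hinge(X[:, j], t, s)
        cols.append(b)
    return np.column_stack(cols)

rng = np.random.default_rng(4)
X = rng.uniform(-1.5, 1.5, size=(6, 2))
B = mars_basis(X, [[(0, 0.0, +1)], [(0, 0.0, -1)], [(0, 0.0, +1), (1, 0.5, -1)]])
print(B.shape)    # (6, 4): constant + two univariate hinges + one interaction term
```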
3.13.3.1 Forward Stepwise Part
The forward stepwise part of MARS is similar to the tree building step in CART, but involves the following three modifications.

1. The Heaviside or unit step function H[s(x − t)] is replaced by a truncated power spline function [s(x − t)]_+^q with q = 1.
2. Both the parent and child nodes are kept after splitting a region, thereby making both parent and child nodes eligible for further splitting.
3. The product associated with each basis function, ∏_{p=1}^{P_m} [s_pm(x_ν(p,m) − t_pm)]_+^q, is restricted to distinct predictor variables.

The first modification overcomes the 'lack of smoothness' of CART; that is, the fitted MARS response surface is continuous. For the second modification, since there is no restriction on the choice of a parent term, MARS is able to produce models involving high-order interactions, low-order interactions, or both. Furthermore, MARS can produce purely additive models by always choosing the constant basis function as the parent. The restriction of the third modification avoids powers higher than q (here q = 1) on any individual variable. Note that the forward stepwise part starts with only the constant basis function in the model. At each split, a reflected pair of basis functions ±(x_ν(p,m) − t_pm)_+ is added to the selected basis set: the candidate pair that improves the model fit (decreases the lack of fit) the most, or degrades it the least, is chosen. The forward stepwise part stops when the number of selected basis functions reaches the user-specified maximum Mmax.
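A simplified sketch of this forward search follows; it assumes q = 1 and uses the residual sum of squares as the lack-of-fit measure, whereas the actual MARS algorithm uses the GCV criterion of Section 3.13.3.4 and considerably more bookkeeping.

```python
# A simplified forward stepwise search: greedily add reflected pairs of hinge
# functions, choosing the parent term, variable, and knot that most reduce the
# residual sum of squares.
import numpy as np

def rss_fit(B, y):
    beta, *_ = np.linalg.lstsq(B, y, rcond=None)
    return ((y - B @ beta) ** 2).sum()

def forward_stepwise(X, y, max_terms=5):
    n, p = X.shape
    B = np.ones((n, 1))                                   # start from the constant basis function
    terms = [[]]                                          # factors of each selected basis function
    while B.shape[1] < max_terms:
        best = (np.inf, None)
        for m, parent in enumerate(terms):                # candidate parent basis functions
            used = {j for j, _, _ in parent}
            for j in set(range(p)) - used:                # only distinct predictors per product
                for t in np.unique(X[:, j])[1:-1]:        # candidate knots
                    pair = np.column_stack([B[:, m] * np.maximum(X[:, j] - t, 0),
                                            B[:, m] * np.maximum(t - X[:, j], 0)])
                    fit = rss_fit(np.column_stack([B, pair]), y)
                    if fit < best[0]:
                        best = (fit, (m, j, t, pair))
        m, j, t, pair = best[1]
        B = np.column_stack([B, pair])                    # add the reflected pair of hinges
        terms += [terms[m] + [(j, t, +1)], terms[m] + [(j, t, -1)]]
    return B, terms

rng = np.random.default_rng(5)
X = rng.uniform(-1.5, 1.5, size=(100, 2))
y = X[:, 0] ** 2 + 0.05 * rng.normal(size=100)
B, terms = forward_stepwise(X, y, max_terms=7)
print(len(terms))
```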
3.13.3.2 Backward Stepwise Part
As with recursive partitioning in CART, the basis set created by the forward stepwise part is then subjected to a backward stepwise deletion strategy to produce the final set of basis functions. However, unlike in CART, the basis functions produced by the forward stepwise part do not have zero pairwise product expectations; that is, the corresponding regions are not disjoint but overlap. Thus, removing a basis function does not produce a hole in the input space. The backward stepwise part involves Mmax − 1 iterations. At each iteration, MARS deletes the basis function whose removal either improves the fit the most or degrades it the least. As a result, the backward stepwise part
finally constructs a sequence of Mmax − 1 models, each having one fewer basis function than the previous one in the sequence. MARS then picks the model with the smallest lack-of-fit value as the optimal model.
3.13.3.3 Polishing
A piecewise linear approximation, of course, does not possess continuous derivatives. To overcome this, the final MARS model is obtained by fitting the following piecewise cubic basis functions to the data:

$$C(x \mid s = +1;\, t_-, t, t_+) = \begin{cases} 0 & x \le t_- \\ p_+(x - t_-)^2 + r_+(x - t_-)^3 & t_- < x < t_+ \\ x - t & x \ge t_+ \end{cases}$$

$$C(x \mid s = -1;\, t_-, t, t_+) = \begin{cases} -(x - t) & x \le t_- \\ p_-(x - t_+)^2 + r_-(x - t_+)^3 & t_- < x < t_+ \\ 0 & x \ge t_+ \end{cases}$$

with t_- < t < t_+. Setting

$$p_+ = \frac{2t_+ + t_- - 3t}{(t_+ - t_-)^2}, \quad r_+ = \frac{2t - t_+ - t_-}{(t_+ - t_-)^3}, \quad p_- = \frac{3t - 2t_- - t_+}{(t_- - t_+)^2}, \quad r_- = \frac{t_- + t_+ - 2t}{(t_- - t_+)^3} \qquad (7)$$

causes C(x | s; t_-, t, t_+) to be continuous and to have continuous first-order derivatives. Compared with the truncated linear basis function (q = 1), which is characterized by a single split (or knot) location t, each corresponding truncated cubic function is characterized by three location parameters: a central knot t and upper/lower side knots t_±. During the polishing process, the central knot t is placed at the same location as the (single) knot of its corresponding truncated linear basis function. The side knots t_± of each cubic basis function are placed at the midpoints between its central knot t and the two adjacent central knots in the same projection. The polishing process can thus be viewed as replacing the angular hinge of a truncated linear basis function by the corresponding smoothed hinge of the truncated cubic function. After the polishing step, the final model is continuous in its first derivatives.
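The following sketch implements the s = +1 cubic basis function with the p_+ and r_+ coefficients of Equation (7); the s = −1 case is symmetric and omitted.

```python
# A sketch of the piecewise cubic "polishing" basis for s = +1 (Equation (7)).
import numpy as np

def cubic_hinge_plus(x, t_minus, t, t_plus):
    """Smoothed replacement for (x - t)_+ with side knots t_minus < t < t_plus."""
    p_plus = (2 * t_plus + t_minus - 3 * t) / (t_plus - t_minus) ** 2
    r_plus = (2 * t - t_plus - t_minus) / (t_plus - t_minus) ** 3
    return np.where(x <= t_minus, 0.0,
           np.where(x >= t_plus, x - t,
                    p_plus * (x - t_minus) ** 2 + r_plus * (x - t_minus) ** 3))

x = np.linspace(-1.0, 1.0, 9)
print(cubic_hinge_plus(x, t_minus=-0.25, t=0.0, t_plus=0.25))
```

At x = t_+ the cubic piece equals t_+ − t with slope 1, and at x = t_- it equals 0 with slope 0, so the value and first derivative match the outer linear pieces, as the text states.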
3.13.3.4 Model Selection in MARS
The lack-of-fit criterion used in the forward and backward stepwise parts of MARS is a modified form of the generalized cross-validation (GCV) criterion originally proposed by Craven and Wahba:13

$$\mathrm{GCV}(M) = \frac{\frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{f}(\mathbf{x}_i)\big)^2}{\big(1 - C(M)/n\big)^2} \qquad (8)$$

where C(M) is the cost complexity measure of a model containing M basis functions. The cost complexity measure C(M) used in MARS is

$$C(M) = \mathrm{trace}\Big[\mathbf{B}\big(\mathbf{B}^{\mathrm T}\mathbf{B}\big)^{-1}\mathbf{B}^{\mathrm T}\Big] + 1 + dM \qquad (9)$$

where M is the number of basis functions selected in the MARS model, B is the data matrix of the M basis functions evaluated at the n observations, and the quantity d represents a cost for each basis function added into the optimization. The quantity d can be viewed as a smoothing parameter of the MARS procedure, since a larger value of d leads to fewer basis functions (fewer knots or splits being placed) and therefore to smoother function estimates. Note that using only the first term of Equation (9) gives the GCV criterion proposed by Craven and Wahba.13 In MARS, the value of the smoothing parameter d is often set between 2 and 4, which is fairly effective in most situations.
3.13.3.5 ANOVA Interpretation
The ability of MARS to provide significant insight into the input–output relationship comes from rearranging Equation (6) as

$$\hat{y} = \beta_0 + \sum_{P_m = 1} f_i(x_i) + \sum_{P_m = 2} f_{ij}(x_i, x_j) + \sum_{P_m = 3} f_{ijk}(x_i, x_j, x_k) + \cdots \qquad (10)$$
where each term indicates the input variables that are relevant to the model and shows how they interact with other inputs in approximating the output. Specifically, the first sum is over all basis functions that involve only a single variable, the second sum is over all basis functions that involve exactly two variables (representing two-variable interactions), and so on. Interpretation of the MARS model is greatly facilitated by its ANOVA decomposition in Equation (10). This representation identifies the particular variables that enter the model, whether they enter purely additively (without interacting with other input variables) or are involved in interactions with other input variables, the level of those interactions, and the other input variables that participate in them.
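The grouping behind Equation (10) amounts to simple bookkeeping: basis functions are collected according to the set of distinct variables entering each product. The `terms` representation below is the hypothetical one used in the forward-stepwise sketch above.

```python
# Group MARS-style terms by the distinct variables they involve (Equation (10)).
from collections import defaultdict

def anova_groups(terms):
    """terms: list of factor lists [(var_index, knot, sign), ...] per basis function."""
    groups = defaultdict(list)
    for m, term in enumerate(terms):
        variables = tuple(sorted({j for j, _, _ in term}))
        groups[variables].append(m)     # e.g. (0,): main effect of x_0; (0, 1): two-variable interaction
    return dict(groups)

print(anova_groups([[], [(0, 0.0, 1)], [(0, 0.0, -1)], [(0, 0.0, 1), (1, 0.5, -1)]]))
```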
3.13.3.6 Advantages and Disadvantages of MARS
The advantages of MARS include its ability to provide a continuous approximation of the target function to an arbitrary order of derivatives by increasing the value of q for each basis function. Second, in the forward stepwise part of MARS, since only the reflected pair of basis functions ±(x_ν(p,m) − t_pm)_+ is entered at each step, relatively few counterproductive basis functions are introduced that must later be removed by the backward stepwise strategy. Thus, each basis function entered is more likely to make a positive contribution to the model, and the available degrees of freedom are used more judiciously. Third, by not removing the 'parent' basis functions, MARS is able to enter basis functions involving a variety of interaction levels at any stage; it can therefore produce anything from purely additive models (without any interactions) to models involving highly complex interactions. Fourth, since the forward stepwise part enters one knot (split) at a time, MARS is able to operate locally. In other words, MARS can build up the regression surface parsimoniously, using nonzero components only locally, where they are needed, without affecting regions far away. Thus, MARS is well suited to approximating sharp local structure in the underlying function without disregarding the fit elsewhere. Finally, since no numerical optimization is required, the implementation of MARS is relatively easy and fast, and the solutions are unique; there are no problems with being trapped in suboptimal local minima, nor any issue of obtaining good starting points.

A major disadvantage of MARS is that if the underlying function involves a pure high-order interaction effect with absolutely no corresponding main or lower-order interaction effects, then at best MARS will have to enter several more or less randomly selected low-order basis functions before it can capture this effect, and at worst it will fail to detect it. This is particularly likely if the function has a strong dependence on other variables for which adding basis functions is profitable and the sample size is small.
3.13.3.7 Other Issues
The computational complexity of fitting a MARS model is quite high. For n observations with p input variables, MARS requires approximately nm² + pmn operations to add a basis function to a model with m basis functions already selected. Therefore, building a MARS model with M selected basis functions requires about nM³ + pM²n operations, which can be prohibitive if M is a reasonable fraction of the sample size n. However, the MARS algorithm admits a high degree of parallelization, so it can run fast on computers with parallel architectures.8 Like CART, MARS can also handle 'mixed' types of input variables, that is, quantitative and qualitative, in a natural way. For qualitative input variables, MARS considers all possible binary partitions of the categories, much as CART does. Each such partition generates a pair of piecewise constant basis functions for the two sets of categories. This basis pair is then treated like any other, and is used in forming tensor products with other basis functions already in the model.
3.13.4 Projection Pursuit Regression

The PPR model is of the form

$$\hat{y} = \sum_{m=1}^{M} \varphi_m\big(\boldsymbol{\omega}_m^{\mathrm T}\mathbf{x}\big) \qquad (11)$$

where the function φ_m(ω_m^T x) is often called a ridge function, which varies only in the direction defined by the vector ω_m, and M is the number of terms added to the model. Equation (11) indicates that PPR is an additive model, but in the derived features z_m = ω_m^T x rather than in the original input variables themselves. The basic idea of PPR, proposed by Friedman and Stuetzle,5 is not a new one. In multivariate data analysis, it has long been common practice to use lower-dimensional projections of the data for interpretation and visualization (Section 2.28.2.3). Since there are infinitely many projections from a high dimension to a lower dimension, it is important to select the projection(s) that reveal the most interesting structure of the data. The choice of a projection is usually based on an appropriate numerical measure of merit for a specific goal. In PPR, the figure of merit is the fraction of variance explained by a smooth of y versus a linear combination ω^T x of the predictors. As the multivariate structure often will not be completely reflected in a single projection, in PPR the already discovered structure is removed in order to allow the algorithm to find additional interesting projections. Hence, the PPR model (11) has an additive form that facilitates interpretation and visualization in practice. The algorithm also has strong similarities with other statistical and chemometric methods, such as the nonlinear iterative partial least squares (NIPALS) algorithm used for partial least squares (PLS) and principal component analysis (PCA), and this insight has been used to unify these methods.14 It is known that PPR is a universal approximator; that is, PPR approximations are dense in the sense that any function of p variables can be approximated arbitrarily closely by the ridge function expansion in Equation (11) for sufficiently large M.15 In addition, the PPR algorithm has been extended to multiple-input–multiple-output modeling by a technique called the smooth multiple additive regression technique (SMART).16
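A minimal sketch of evaluating the PPR model of Equation (11) is shown below; the projection directions and ridge functions are illustrative placeholders rather than fitted quantities.

```python
# A minimal sketch of Equation (11): a sum of ridge functions, each acting on
# a one-dimensional projection of the inputs.
import numpy as np

def ppr_predict(X, directions, ridge_funcs):
    """directions: (M, p) array of rows omega_m; ridge_funcs: length-M list of callables."""
    return sum(phi(X @ w) for w, phi in zip(directions, ridge_funcs))

rng = np.random.default_rng(7)
X = rng.normal(size=(4, 3))
directions = np.array([[1.0, 0.0, 0.0], [0.0, 0.7, 0.7]])
ridge_funcs = [np.square, np.tanh]
print(ppr_predict(X, directions, ridge_funcs))
```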
3.13.4.1 Steps in PPR
PPR consists of three basic steps. In the first step, the current residuals r are initialized to the responses y, which have been centered (ȳ = 0). The second step searches for the next term to be added to the model. The third step decides whether this term (from the second step) should be added to the model or the fitting process should be terminated. The final PPR model is constructed by applying the second and third steps iteratively. These two steps are described in more detail below.
3.13.4.1.1 Searching for the next term

Given the current residuals r_m and the direction vector ω_m, finding the best ridge function φ_m becomes a one-dimensional smoothing problem, and we can estimate φ_m by applying any univariate smoothing method, such as the smoothing spline. Conversely, given the ridge function φ_m, we can improve the current direction estimate ω_old toward the optimal ω_m with a quasi-Newton step. Specifically, we have

$$\varphi_m(\boldsymbol{\omega}^{\mathrm T}\mathbf{x}_i) \approx \varphi_m(\boldsymbol{\omega}_{\mathrm{old}}^{\mathrm T}\mathbf{x}_i) + \varphi_m'(\boldsymbol{\omega}_{\mathrm{old}}^{\mathrm T}\mathbf{x}_i)\,(\boldsymbol{\omega} - \boldsymbol{\omega}_{\mathrm{old}})^{\mathrm T}\mathbf{x}_i \qquad (12)$$

Equation (12) gives

$$\sum_{i=1}^{n}\big[r_i - \varphi_m(\boldsymbol{\omega}^{\mathrm T}\mathbf{x}_i)\big]^2 \approx \sum_{i=1}^{n}\varphi_m'(\boldsymbol{\omega}_{\mathrm{old}}^{\mathrm T}\mathbf{x}_i)^2 \left[\left(\boldsymbol{\omega}_{\mathrm{old}}^{\mathrm T}\mathbf{x}_i + \frac{r_i - \varphi_m(\boldsymbol{\omega}_{\mathrm{old}}^{\mathrm T}\mathbf{x}_i)}{\varphi_m'(\boldsymbol{\omega}_{\mathrm{old}}^{\mathrm T}\mathbf{x}_i)}\right) - \boldsymbol{\omega}^{\mathrm T}\mathbf{x}_i\right]^2 \qquad (13)$$

Thus, to minimize the right-hand side of Equation (13), we can carry out a least squares regression of the adjusted response ω_old^T x_i + (r_i − φ_m(ω_old^T x_i))/φ_m'(ω_old^T x_i) on the inputs x_i, with weights φ_m'(ω_old^T x_i)² and no intercept term. This gives the updated direction vector ω_new. The two steps of estimating φ_m and ω_m in Equations (12) and (13) are iterated until convergence.
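The sketch below illustrates one cycle of this alternating procedure, using a low-order polynomial fit as a simple stand-in for the nonparametric smoother and the weighted least squares step of Equation (13) for the direction update; it is a simplification of the actual PPR/SMART implementation.

```python
# One cycle of the PPR search: smooth the residuals against the current
# projection, then update the direction by the weighted least squares step
# of Equation (13). A cubic polynomial stands in for the smoother.
import numpy as np

def update_direction(X, r, w_old, degree=3):
    z = X @ w_old                                     # current projections omega_old^T x_i
    phi = np.poly1d(np.polyfit(z, r, degree))         # smooth of residuals vs projection
    dphi = np.polyder(phi)
    g = dphi(z)
    g = np.where(np.abs(g) < 1e-8, 1e-8, g)           # guard against zero derivatives
    target = z + (r - phi(z)) / g                     # adjusted response of Equation (13)
    W = g ** 2                                        # weights phi'(omega_old^T x_i)^2
    A = X.T * W                                       # X^T diag(W)
    w_new = np.linalg.solve(A @ X, A @ target)        # weighted LS, no intercept
    return w_new / np.linalg.norm(w_new), phi         # keep directions at unit length

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 3))
r = (X @ np.array([1.0, 0.5, 0.0])) ** 2              # residuals with a single ridge structure
w = np.array([1.0, 0.0, 0.0])
for _ in range(5):
    w, phi = update_direction(X, r, w)
print(np.round(w, 2))   # w should move toward (1, 0.5, 0) up to sign and normalization
```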
3.13.4.1.2 Termination

After the searching step, the figure of merit I(ω_m) is measured:

$$I(\boldsymbol{\omega}_m) = 1 - \frac{\sum_{i=1}^{n}\big[r_i - \varphi_m(\boldsymbol{\omega}_m^{\mathrm T}\mathbf{x}_i)\big]^2}{\sum_{i=1}^{n} r_i^2} \qquad (14)$$

If the figure of merit from Equation (14) is smaller than a user-specified threshold, the fitting process stops without adding the last term to the PPR model. Otherwise, the current residuals r and the term counter m are updated as

$$r_i \leftarrow r_i - \varphi_m(\boldsymbol{\omega}_m^{\mathrm T}\mathbf{x}_i), \quad i = 1, \ldots, n; \qquad m \leftarrow m + 1 \qquad (15)$$

and the searching step is iterated.
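The termination test and residual update of Equations (14) and (15) amount to only a few lines, sketched below; `phi` and `w` are assumed to come from a searching step such as the one sketched above.

```python
# Figure of merit (Equation (14)) and residual update (Equation (15)).
import numpy as np

def figure_of_merit(r, X, w, phi):
    fitted = phi(X @ w)
    return 1.0 - ((r - fitted) ** 2).sum() / (r ** 2).sum()   # Equation (14)

def accept_term(r, X, w, phi, threshold=0.01):
    if figure_of_merit(r, X, w, phi) < threshold:
        return r, False                    # stop: the candidate term is not added
    return r - phi(X @ w), True            # Equation (15): remove the fitted structure

rng = np.random.default_rng(9)
X = rng.normal(size=(50, 2)); w = np.array([1.0, 0.0]); phi = np.square
r = phi(X @ w) + 0.1 * rng.normal(size=50)
print(figure_of_merit(r, X, w, phi))
```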
3.13.4.2 Advantages and Disadvantages of PPR
PPR overcomes several limitations of other nonparametric regression procedures. First, it is known that kernel and nearest-neighbor techniques do not perform well for high-dimensional data because of the inherent sparsity of samples. In PPR, since each added term is estimated in a univariate setting, the high-dimensionality problem is circumvented. In addition, PPR does not require specification of a metric in the input space. Second, unlike partition-based methods such as CART, PPR does not split the sample; thus, PPR allows more complex models when necessary. Third, in contrast to GAM, PPR is able to approximate a much richer class of functions; for example, interactions among input variables are directly considered in PPR. Finally, PPR is, in general, well suited to situations where there are substantial nonlinearities in the dependence of the responses on the input variables, especially if the nonlinearities are approximated reasonably well by a few ridge functions.

Inevitably, PPR also has some disadvantages. First, since a PPR model is a sum of a few ridge functions, each varying only along a single direction vector ω, PPR has difficulty modeling regression surfaces that vary with equal strength along all possible (or many) directions. Second, for each term in the model, PPR has to examine the p-dimensional parameter space to estimate the direction vector ω. Third, because it uses nonparametric smoothers, PPR has to select a smoothing parameter for each term. Finally, although we can easily visualize the relationship between the response and the transformed variable ω_m^T x in each term, the interpretation may not be easy.
3.13.5 Illustrative Example

As an illustration, a simulated example from Bakshi and Utojo4 is given here. The purpose of this example is not to compare the performance of CART, MARS, and PPR, but to provide an insight into the similarities and differences between them. This insight should be useful for understanding the application of these methods to chemometric data such as those in Deconinck et al.,1 Xu et al.,2 Put et al.,3 and others. The example consists of two inputs x1 and x2 and an output y related as

$$y = (x_1 + 0 \cdot x_2)^2 = x_1^2 \qquad (16)$$
Equation (16) indicates that the output is related only to x1; that is, the optimum projection for this example is in the direction parallel to the x1 axis. The training set contains 189 observations, shown in Figure 1, and the testing data consist of 95 points. The mean square errors on the training and testing data for each method are summarized in Table 1. The CART model can be represented by a binary tree and is physically interpretable. Figure 2 shows the fitted binary tree structure (right) and its fitted values against x1 (left). We see that CART approximates the underlying smooth function by a crude step function. MARS overcomes this disadvantage of CART by providing continuous approximations with continuous derivatives.
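The simulated example can be approximated as follows; the exact sampling design of the 189 training and 95 test points is not specified above, so uniform inputs are assumed purely for illustration, and a scikit-learn regression tree stands in for the CART fit. The resulting errors will therefore not reproduce Table 1 exactly.

```python
# A hedged reconstruction of the quadratic example of Equation (16),
# with a CART-style tree as one of the fitted models.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(10)
X_train = rng.uniform(-1.5, 1.5, size=(189, 2))       # assumed sampling design
X_test = rng.uniform(-1.5, 1.5, size=(95, 2))
y_train = (X_train[:, 0] + 0.0 * X_train[:, 1]) ** 2   # Equation (16): y depends on x1 only
y_test = X_test[:, 0] ** 2

tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X_train, y_train)
print("CART-style training MSE:", mean_squared_error(y_train, tree.predict(X_train)))
print("CART-style testing MSE:", mean_squared_error(y_test, tree.predict(X_test)))
```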
Figure 1 Training samples for the illustrative example (axes: x1, x2, y).
Table 1 Comparison of mean square errors (MSEs) on the quadratic example

Method    Training MSE    Testing MSE
CART      2.09 × 10⁻²     3.06 × 10⁻²
MARS      2.32 × 10⁻³     1.89 × 10⁻³
PPR       1.61 × 10⁻⁶     5.30 × 10⁻⁴
Figure 2 Result of CART. Left: fitted values versus x1 (solid); the dashed line is the true quadratic function. Right: binary tree diagram.
The performance of MARS is depicted in Figure 3. In this example, PPR provides the best approximation of the target function. Figure 4 shows the adaptive basis functions for the first (left) and second (right) terms. Note that PPR is able to approximate the quadratic surface with only one basis function, as shown in the left panel of Figure 4. Some additional details and examples of extensions to the methods described in this chapter are provided in Chapter 2.31.
Figure 3 Result of MARS. Left: fitted values versus x1 (solid). Right: fitted values versus x2 (solid). The dashed line is the true function.
Figure 4 Result of PPR. Fitted adaptive basis functions for the first (left) and second (right) terms.
References

1. Deconinck, E.; Hancock, T.; Coomans, D.; Massart, D. L.; Vander Heyden, Y. Classification of Drugs in Absorption Classes Using the Classification and Regression Trees (CART) Methodology. J. Pharm. Biomed. Anal. 2005, 39 (1–2), 91–103.
2. Xu, Q. S.; Daszykowski, M.; Walczak, B.; Daeyaert, F.; de Jonge, M. R.; Heeres, J.; Koymans, L. M. H.; Lewi, P. J.; Vinkers, H. M.; Janssen, P. A.; Massart, D. L. Multivariate Adaptive Regression Splines – Studies of HIV Reverse Transcriptase Inhibitors. Chemom. Intell. Lab. Syst. 2004, 72 (1), 27–34.
3. Put, R.; Xu, Q. S.; Massart, D. L.; Vander Heyden, Y. Multivariate Adaptive Regression Splines (MARS) in Chromatographic Quantitative Structure–Retention Relationship Studies. J. Chromatogr. A 2004, 1055 (1–2), 11–19.
4. Bakshi, B. R.; Utojo, U. A Common Framework for the Unification of Neural, Chemometric and Statistical Modeling Methods. Anal. Chim. Acta 1999, 384, 227–247.
5. Friedman, J.; Stuetzle, W. Projection Pursuit Regression. J. Am. Stat. Assoc. 1981, 76, 817–823.
6. Hastie, T.; Tibshirani, R. Generalized Additive Models; Chapman and Hall: London, 1990.
7. Breiman, L.; Friedman, J.; Olshen, R.; Stone, C. Classification and Regression Trees; Wadsworth: Belmont, CA, 1984.
8. Friedman, J. Multivariate Adaptive Regression Splines (with Discussion). Ann. Stat. 1991, 19, 1–141.
9. Quinlan, R. C4.5: Programs for Machine Learning; Morgan Kaufmann: San Mateo, CA, 1993.
10. Friedman, J. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232.
11. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
12. Breiman, L. Bagging Predictors. Mach. Learn. 1996, 26, 123–140.
13. Craven, P.; Wahba, G. Smoothing Noisy Data with Spline Functions. Numer. Math. 1979, 31, 377–403.
14. Bakshi, B. R.; Utojo, U. Unification of Neural and Statistical Modeling Methods that Combine Inputs by Linear Projection. Comput. Chem. Eng. 1998, 22, 1859.
15. Diaconis, P.; Shahshahani, M. On Nonlinear Functions of Linear Combinations. SIAM J. Sci. Comput. 1984, 5, 171–191.
16. Friedman, J. A Variable Span Smoother; Technical Report 5; Department of Statistics, Stanford University: Palo Alto, CA, 1984.
Biographical Sketches
Bhavik R. Bakshi is Professor of Chemical and Biomolecular Engineering and co-director of the Center for Resilience at the Ohio State University. He received his B. Chem. Eng. degree from the Institute of Chemical Technology, University of Mumbai, India, and M.S. in Chemical Engineering Practice and Ph.D. in Chemical Engineering from the Massachusetts Institute of Technology. He worked at Aware Inc. before joining Ohio State in 1993. His awards include the Ted Peterson Award from the CAST division of AIChE and the National Science Foundation CAREER award. His current research interests include Process Modeling and Operation; Decision Making under Uncertainty; Complex Industrial– Ecological Systems; and Green Engineering.
Prem K. Goel has been Professor of Statistics at the Ohio State University since 1983. He served as the Chairperson of the Department of Statistics during 1988–92 and as the Director of the Statistical Consulting Service during 1983–88 and 1992–93. Prior to coming to OSU, he served as the Director of the Statistics & Probability Program at the NSF (1982–83) and as Associate Professor of Statistics (1977–83) and Assistant Professor of Statistics (1971–77) at Purdue University. Goel completed his Ph.D. in Statistics in 1971 at Carnegie Mellon University. He earned a B.Sc. in Mathematics and Statistics (1962) and M. Stat. (1964) from Agra University, India, and an M.S. (1969) in Statistics from Carnegie Mellon University. Goel is a Fellow of the American Statistical Association, AAAS, and the Institute of Mathematical Statistics, an Elected Member of the International Statistical Institute and Sigma Xi, and is listed in Who's Who in Science and Engineering 2005. In 1983, the National Science Foundation gave him the Outstanding Performance Award as a Program Director. His research interests include the following: Applied Stochastic Modeling; Bayesian Modeling and Analysis; Combining Information and Data Fusion; Information Theory, Hierarchical Models & Bayesian Networks; Statistical Machine Learning; and Interdisciplinary Research in Engineering and Medicine.