Computers and Chemical Engineering Supplement (1999) S327-S330 © 1999 Elsevier Science Ltd. All rights reserved
Pergamon
PII: S0098-1354(99)00235-5
Regression Diagnostic Using an Orthogonalized Variables Based Stepwise Regression Procedure

Neima Brauner and Mordechai Shacham
School of Engineering, Tel-Aviv University, Tel Aviv 69978, Israel
Department of Chemical Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel

Abstract
Regression diagnostics for identifying the causes that limit the precision and/or stability of a regression model are considered. The diagnostic procedure uses the indicators generated in the SROV process that was proposed recently by Shacham and Brauner (1998b). These indicators consider the signal-to-noise ratio in the independent and dependent variable data. It is shown that routine use of the SROV procedure for regression of experimental data enables identification of the optimal model when the precision of the model is limited only by the accuracy of the experimental data. Otherwise, the indicators generated by the SROV procedure and the guidelines given in the paper can help to pinpoint the potential causes, so that remedial actions can be taken.

Keywords: collinearity, regression, stepwise, optimal, precision, noise.
Introduction
Mathematical model based simulation, design, control and optimization of chemical processes are becoming increasingly widespread, and the requirements for more precise regression models are becoming increasingly severe. Regression models of chemical processes can be partially theory based or completely empirical. In both cases, it is not known a priori how many explanatory variables (independent variables and/or their functions) and parameters should be included in the model to obtain an optimal regression model. An insufficient number of explanatory variables results in an inaccurate model characterized by a large variance. Some independent variables, which may have critical effects on the dependent variable under certain circumstances, may be omitted. On the other hand, the inclusion of too many explanatory terms renders the model unstable. The instability is characterized by typical ill effects: adding or removing an experimental point from the data set may drastically change the parameter values, the derivatives of the dependent variable are not represented correctly, and extrapolation outside the region where the measurements were taken yields absurd results even for a small range of extrapolation.

The causes of ill-conditioning in regression can be grouped into the following four categories:
1. Presence of collinearity among the explanatory variables.
2. Presence of variables in the model which are nearly orthogonal to the dependent variable.
3. An inappropriate model (i.e., linear vs. nonlinear).
4. Excessive error in the dependent variable data (as in the presence of outlying observations).

If the causes of the imprecision and instability of the model are diagnosed correctly, remedial measures can be taken to improve the model (Brauner and Shacham, 1998a,
1998b, Shacham and Brauner, 1997). There are many available statistical techniques for regression diagnostics (for a detailed description and references see, for example, Neter et al., 1990). Some of these techniques can give misleading results (for details see, for example, Brauner and Shacham, 1998a and Shacham and Brauner, 1998a) and most of them are too complex to be used on a routine basis. Therefore, many of the regression models being developed and used are far from optimal.

Shacham and Brauner (1998b) have developed a new stepwise regression procedure based on orthogonalized variables (SROV). This procedure finds the optimal set of explanatory variables which should be included in a regression model and at the same time provides numerical indicators for the extent of collinearity among the explanatory variables and for the significance of the correlation between each of the explanatory variables and the dependent variable. These indicators can be used for regression diagnostics. They can pinpoint potentially problematic models and data that require additional statistical tests.

In this paper, the use of the various numerical indicators generated by the SROV procedure for regression diagnostics is described and demonstrated. The calculations were carried out using the regression program of the POLYMATH 4.0 package (Shacham and Cutlip, 1996) and the MATLAB (Math Works, 1992) program.

Basic Concepts
A standard linear regression model can be written:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon$    (1)

where $y$ is an N-vector of the dependent variable, $x_j$ ($j = 1, 2, \ldots, n$) are N-vectors of explanatory variables, $\beta_0, \beta_1, \ldots, \beta_n$ are the model parameters to be
estimated and $\varepsilon$ is an N-vector of stochastic terms (measurement errors). It should be noted that an explanatory variable can represent an independent variable or a function of independent variables.

A certain error (imprecision, noise) in the explanatory variables is also considered. Thus, a vector of an explanatory variable can be represented by $x_j = \hat{x}_j + \delta x_j$, where $\hat{x}_j$ is an N-vector of the expected value of $x_j$ and $\delta x_j$ is an N-vector of stochastic terms due to noise. The errors in the dependent variable ($\varepsilon$) and in the explanatory variables ($\delta x_j$) cannot be measured but can be estimated. If estimates of the experimental errors are available, these can be used for $\delta x_j$ and $\varepsilon$. Otherwise, it is usually assumed that the data are correct up to the last decimal digit reported. In such cases, the average rounding error can be used (approximately $3 \times 10^{-t}$, where $t$ is the number of reported digits after the decimal point; see Stewart, 1987). If functions of the independent variables are used or a data transformation is carried out, the error propagation formula is used to calculate the resultant $\delta x_j$.

The vector of estimated parameters, $\hat{\beta}$, is usually calculated via the least squares error approach, by solving the normal equation $X^T X \hat{\beta} = X^T y$, where $X = [1, x_1, x_2, \ldots, x_n]$ is an $N \times (n+1)$ data matrix and $X^T X = A$ is the normal matrix. An alternative option is solving the over-determined system of equations, $X \hat{\beta} = y$, using QR decomposition (Press et al., 1992). The QR decomposition requires more arithmetic operations than the solution of the normal equation, but is less sensitive to numerical error propagation (see Brauner and Shacham, 1998b).

From among the widely used statistical tests and criteria, the variance and the confidence intervals on parameter estimates will be used in this work for comparison between various regression models. The model variance $s^2 = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 / (N - n - 1)$ is a measure of the variability of the $y$ values predicted by the regression model.
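As a minimal illustration of the two solution routes just described, the sketch below fits a linear model by QR decomposition and by the normal equation, and then computes the model variance $s^2$. It uses NumPy rather than the POLYMATH or MATLAB programs used in this work; the function names and the small data set are hypothetical and chosen only for illustration.

```python
import numpy as np

def design_matrix(x_columns, n_points):
    """Build X = [1, x1, ..., xn], an N x (n+1) data matrix."""
    ones = np.ones(n_points)
    return np.column_stack([ones] + [np.asarray(c, dtype=float) for c in x_columns])

def fit_qr(x_columns, y):
    """Solve the over-determined system X beta = y by QR decomposition."""
    y = np.asarray(y, dtype=float)
    X = design_matrix(x_columns, len(y))
    Q, R = np.linalg.qr(X)              # reduced QR: X = Q R
    return np.linalg.solve(R, Q.T @ y)  # solve R beta = Q^T y

def fit_normal_equation(x_columns, y):
    """Solve the normal equation (X^T X) beta = X^T y."""
    y = np.asarray(y, dtype=float)
    X = design_matrix(x_columns, len(y))
    A = X.T @ X                         # normal matrix
    return np.linalg.solve(A, X.T @ y)

def model_variance(x_columns, y, beta):
    """s^2 = sum (y_i - yhat_i)^2 / (N - n - 1)."""
    y = np.asarray(y, dtype=float)
    X = design_matrix(x_columns, len(y))
    residuals = y - X @ beta
    return float(residuals @ residuals) / (len(y) - X.shape[1])

# Hypothetical data: roughly y = 1 + 2*x1 with a small disturbance.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.0, 7.1, 8.9]

beta_qr = fit_qr([x1], y)
beta_ne = fit_normal_equation([x1], y)
s2 = model_variance([x1], y, beta_qr)
```

For a well-conditioned problem such as this one the two routes give the same estimates to within rounding; the difference in sensitivity noted above matters when the columns of $X$ are nearly collinear.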
The confidence interval ($\Delta\beta_j$) on a parameter estimate is defined by:

$\Delta\beta_j = t(\nu, \alpha)\, s \sqrt{a_{jj}}$    (2)

where $a_{jj}$ is the $j$th diagonal element of $A^{-1}$, $t(\nu, \alpha)$ is the statistical t distribution corresponding to $\nu$ degrees of freedom ($\nu = N - (n + 1)$) and a desired confidence level, $\alpha$, and $s$ is the standard error of the estimate. Clearly, if $|\beta_j| < \Delta\beta_j$ then the zero value is included inside the confidence interval. Thus, there is no statistical justification to include the associated term in the regression model. Note that, when the explanatory variables are strongly correlated, the individual confidence intervals will usually underestimate the uncertainty in the parameter estimates as indicated by the confidence regions.

In this work, confidence intervals on parameter estimates of orthogonalized variables (no correlation between the variables) will be used as indicators. A statistically valid model is defined as a model where these confidence intervals are smaller than the respective parameter values. The optimal model is a statistically valid model which yields a minimal variance.

Regression diagnostics using the indicators generated in the SROV procedure
The Stepwise Regression using Orthogonalized Variables (SROV) procedure is used for selecting the explanatory variables that should be included in the regression model, which is optimal in the sense described in the previous section. The same procedure also yields various indicators that can identify the dominant cause preventing the addition of more variables to the model, thus limiting its precision.

The SROV procedure is described in detail in Shacham and Brauner (1998b). In this procedure, the selection of a new variable to enter the model is based on three indicators: a correlation indicator ($YX_j$), a collinearity indicator ($TNR_j$), and an indicator which measures the signal-to-noise ratio in the correlation ($CNR_j$). The SROV procedure consists of successive phases. In the first phase an initial (nearly optimal) solution is found. In the subsequent phases the variables are rotated in an attempt to improve the model. Every phase of the procedure consists of successive stages, where at each stage one of the explanatory variables is selected to enter the regression model (basic variables). The remaining explanatory variables (non-basic variables) and the dependent variable are updated by subtracting the information which is collinear with the variables already included in the model. This updating generates non-basic variables and a residual of the dependent variable which are orthogonal to the basic variables set.

The strength of the linear correlation between an explanatory variable $x_j$ and the dependent variable $y$ is measured by $YX_j = y^T x_j$, where $y$ and $x_j$ are centered and normalized to unit length. The value of $|YX_j|$ is in the range [0, 1]. In the case of a perfect correlation between $y$ and $x_j$ ($y$ is aligned in the $x_j$ direction), $|YX_j| = 1$.
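The correlation indicator and the orthogonal update described above can be sketched as follows. This is only an illustration under stated assumptions, not the SROV implementation of Shacham and Brauner (1998b): the function names and data are hypothetical, and the update is written as a plain projection onto a single basic variable.

```python
import numpy as np

def standardize(v):
    """Center a vector and scale it to unit Euclidean length."""
    centered = np.asarray(v, dtype=float) - np.mean(v)
    return centered / np.linalg.norm(centered)

def correlation_indicator(y, x):
    """YX_j = y^T x_j for centered, unit-length y and x_j."""
    return float(standardize(y) @ standardize(x))

def remove_component(v, basic):
    """Subtract from v the part that is collinear with a basic variable."""
    v = np.asarray(v, dtype=float)
    basic = np.asarray(basic, dtype=float)
    return v - (v @ basic) / (basic @ basic) * basic

# Hypothetical data, for illustration only.
y = [2.0, 4.1, 5.9, 8.2, 9.8]
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0]

ys, x1s, x2s = standardize(y), standardize(x1), standardize(x2)
yx1 = correlation_indicator(y, x1)   # close to 1: y is nearly aligned with x1

# After x1 enters the basic set, update y and the non-basic x2.
y_res = remove_component(ys, x1s)
x2_res = remove_component(x2s, x1s)
```

Both updated vectors are orthogonal to `x1s` by construction, which is the property the stage-by-stage updating relies on.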
In case $y$ is unaffected by $x_j$ (the two vectors are orthogonal), $YX_j = 0$. The inclusion in the basic set of the variable $x_p$ which has the highest level of correlation with $y$ (its $|YX_p|$ value is the closest to one) will effect the maximal reduction of the variance of the regression model. Therefore, the criterion $x_p = x_j\{\max |YX_j|\}$ is used to determine which of the non-basic variables should preferably be included in the regression model at the next stage, provided that the following $CNR$ and $TNR$ tests are both satisfied.

The $CNR_j$ measures the signal-to-noise ratio of $YX_j$, and is defined by:

$CNR_j = \dfrac{|y^T x_j|}{\sum_{i=1}^{N} \left( |x_{ij}\,\varepsilon_i| + |y_i\,\delta x_{ij}| \right)}$    (3)

The denominator of eq. (3) represents the error in $YX_j$ as estimated via the error propagation formula. A value of $CNR_j \gg 1$ signals that the correlation between $x_j$ and $y$ is significantly larger than the noise level. But when $CNR_j \le 1$, the noise in $YX_j$, as affected by $\delta x_j$ and $\varepsilon$, is as large as, or even larger than, $|YX_j|$. If this is the case, no reliable value
for $YX_j$ can be obtained and the respective variable should not be included in the model.

The $TNR_j$ measures the signal-to-noise ratio in an explanatory variable $x_j$. It is defined in terms of the corresponding Euclidean norms (Brauner and Shacham, 1998a):

$TNR_j = \dfrac{\| x_j \|}{\| \delta x_j \|} = \left\{ \dfrac{x_j^T x_j}{\delta x_j^T \delta x_j} \right\}^{1/2}$    (4)

A value of $TNR_j \gg 1$ indicates that the (non-basic)
explanatory variable $x_j$ contains valuable information. On the other hand, a value of $TNR_j \le 1$ implies that the information included in $x_j$ is mostly noise, and therefore it should not be added to the basic variables. The addition of new variables stops when, for all the non-basic variables, either $CNR_j \le 1$ or $TNR_j \le 1$. The various steps of the SROV procedure are demonstrated in the example.

After the optimal model has been found, several indicators for the various explanatory variables are reported for diagnostic purposes. For the basic variables, the parameter estimates $\beta_j$, the confidence intervals $\Delta\beta_j$ and the ratios $\Delta\beta_j / |\beta_j|$ are reported, all based on standardized (centered and normalized to unit length) and orthogonalized variables. For the variables that are not included in the regression model, the indicators $YX_j$, $CNR_j$ and $TNR_j$ are reported. These indicators can be used for regression diagnostics, where several different cases can be considered:

1. All $\Delta\beta_j / |\beta_j| < 1$ for the basic variables and all $CNR_j \le 1$ (or very close to 1) for the non-basic variables. In this case, a statistically valid model has been obtained. The inclusion of additional explanatory variables in the model is prevented by the level of the noise (i.e., experimental error). The model can be improved in this case by providing more precise data for $y$ and $X$.

2. One or more $\Delta\beta_j / |\beta_j| > 1$, and there are still non-basic variables for which $TNR_j > 1$ and $CNR_j > 1$. In this case, the variance is being inflated because of excessive error in $y$ (for example, due to the presence of outlying observations), an inappropriate structure of the model and/or the omission of important explanatory variables from the model. Residual plots and other statistical tools can be used to pinpoint the exact cause of the problem. Then, remedial measures can be taken.

3. All $\Delta\beta_j / |\beta_j| < 1$, but there are still non-basic variables for which $CNR_j > 1$ but $TNR_j < 1$. In this case, collinearity among the explanatory variables prevents the inclusion of additional variables for increasing the model precision. Data transformations can often alleviate the ill effects of collinearity (see Brauner and Shacham, 1998a,b).

Example. A simulated collinear system
This example is used to demonstrate the SROV procedure for the case when collinearity among the independent variables is the dominant cause that limits the number of variables that can be included in the model. The data (Table 1) were used by Belsley (1991) to demonstrate the effects of collinearity.
There are two observers (A and B) who took the readings of the independent variable data. The difference between their readings can be used for estimating the vector of errors $\delta x_j$ (half of the absolute value of the differences). At stage 0 of the SROV procedure the variables are centered and normalized to unit length. In Table 2 the indicators $|YX_j|$, $TNR_j$ and $CNR_j$ are shown. It can be seen that all values of $TNR_j$ and $CNR_j$ are much larger than one; thus, there is no restriction on the inclusion of any of the variables in the model. The indicator $|YX_j|$ is the closest to one for the variable $x_1$ ($|YX_1| = 0.96105$); thus, variable $x_1$ is selected as the first to enter the regression. The respective parameter value is $\beta_1 = -0.96105$ and the variance of the regression model which includes only $x_1$ is 0.010911.

All variables are updated to obtain the components which are orthogonal to $x_1$, and the indicators $YX_j$, $TNR_j$ and $CNR_j$ are recalculated for stage 1. It can be seen that the values of both $TNR_j$ and $CNR_j$ are still larger than one. Thus, there is no restriction on any of the remaining non-basic variables. At this point, the $|YX_j|$ value closest to one is that of the variable $x_2$, so this variable is added next to the basic set with $\beta_2 = 0.001305$. The regression model with the two variables ($x_1$, $x_2$) yields a variance of $1.305 \times 10^{-3}$, almost an order of magnitude smaller than that of the model which includes only $x_1$.

At stage 2, only $x_3$ remains a non-basic variable, but both $TNR_3$ and $CNR_3$ are smaller than one, signaling a case of collinearity. The orthogonal component of $x_3$ contains mainly noise, so it should not be included in the model. Thus, at the end of phase 1, an initial solution consisting of variables $x_1$ and $x_2$ has been found with a variance of $1.305 \times 10^{-3}$. The stepwise regression proceeds by carrying out phase 2, the variable rotation, to search for solutions which yield a smaller variance.
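The noise estimate used in this example (half of the absolute difference between the two observers' readings) and the $TNR_j$ of eq. (4) can be sketched as below. The function names and the readings are hypothetical, not the data of Table 1, and the sketch applies eq. (4) to a single vector as given; how the noise vector is carried through the orthogonal updates of later stages is not covered here.

```python
import numpy as np

def noise_from_two_observers(readings_a, readings_b):
    """Estimate delta x_j as half the absolute difference of two readings."""
    a = np.asarray(readings_a, dtype=float)
    b = np.asarray(readings_b, dtype=float)
    return 0.5 * np.abs(a - b)

def tnr(x, delta_x):
    """TNR_j = ||x_j|| / ||delta x_j|| using Euclidean norms, as in eq. (4)."""
    x = np.asarray(x, dtype=float)
    delta_x = np.asarray(delta_x, dtype=float)
    return float(np.linalg.norm(x) / np.linalg.norm(delta_x))

# Hypothetical readings of one variable by two observers.
obs_a = [1.00, 2.00, 3.00, 4.00]
obs_b = [1.02, 1.98, 3.02, 3.98]

delta_x = noise_from_two_observers(obs_a, obs_b)  # about 0.01 per point
ratio = tnr(obs_a, delta_x)                       # much larger than 1 here
```

A ratio far above one, as for these readings, corresponds to a variable that carries useful information; a ratio near or below one would indicate that the vector is mostly noise.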
At stage 0 of the first rotation, variable $x_2$ (which was the last to enter the basic set in phase 1) is included in the regression model. At stage 1, $|YX_3|$ has the largest absolute value (see Table 2), so $x_3$ enters the basic set (replacing $x_1$). The variance for a model including $x_2$, $x_3$ and a free parameter is $1.2862 \times 10^{-3}$, slightly smaller than that obtained for the model which includes $x_1$ and $x_2$. In the next (second) rotation, $x_3$ enters the model first and $x_2$ second, verifying that no other solution with a lower variance value can be found.

In Table 3, the parameter estimates $\beta_j$, the associated confidence intervals $\Delta\beta_j$ and the ratios $\Delta\beta_j / |\beta_j|$ are shown for the cases where two orthogonalized variables ($x_2$ and $x_3$) or three orthogonalized variables ($x_2$, $x_3$ and $x_1$) are included in the regression model (plus a free parameter). For the model with two explanatory variables, the confidence intervals are much smaller than the parameter values. Adding the third variable results in a parameter value which is an order of magnitude smaller than the respective confidence interval. Similar results are obtained by regression of the original untransformed data (Table 4). In the model containing $x_2$ and $x_3$ all $\Delta\beta_j / |\beta_j|$ are smaller
than 1, while in the model containing all three variables $\Delta\beta_j / |\beta_j|$ is larger than one for 3 out of the 4 parameters, indicating that the model is ill-conditioned. The variance of the three-variable model is larger than that of the two-variable model. In Figure 1, the residual plot for the two-variable model shows that the error is randomly distributed; thus, this model represents the data appropriately. In this example, the SROV procedure has identified correctly that only two of the three independent variables can be included in a statistically valid, stable model, and has diagnosed collinearity as the dominant cause preventing the inclusion of the third variable.
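The validity check applied to the two models above amounts to comparing each confidence interval with the magnitude of its parameter. A minimal sketch follows, using the parameter values and confidence intervals as read from Table 4; the function names are hypothetical.

```python
def ci_ratios(betas, deltas):
    """Return delta_beta_j / |beta_j| for each parameter."""
    return [d / abs(b) for b, d in zip(betas, deltas)]

def is_statistically_valid(ratios):
    """True when every confidence interval is smaller than |beta_j|."""
    return all(r < 1.0 for r in ratios)

# Values as read from Table 4 (original, untransformed data).
# Model with x2 and x3; parameter order j = 0, 2, 3.
two_var = ci_ratios([1.26239, 2.99993, -10.6368],
                    [0.1984, 0.3180, 1.324])

# Model with all three variables; parameter order j = 0, 2, 3, 1.
three_var = ci_ratios([1.25465, 9.02192, -38.44, 0.974148],
                      [0.2521, 65.52, 302.5, 10.5985])

valid_two = is_statistically_valid(two_var)
valid_three = is_statistically_valid(three_var)
n_offending = sum(r >= 1.0 for r in three_var)
```

With these inputs the two-variable model passes the check, while three of the four parameters of the three-variable model have confidence intervals larger than the parameter values, in line with the discussion above.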
Conclusions
The SROV procedure, which generates and uses the indicators $YX_j$, $CNR_j$, $TNR_j$ and $\Delta\beta_j / |\beta_j|$ (for orthogonalized variables), can help to pinpoint cases where the precision and/or stability of the regression model are limited by causes other than the accuracy of the experimental data. Due to space limitations, only one example is presented in the paper. In the example it is demonstrated that the SROV procedure selects the variables which should be included in a statistically valid, optimal model, correctly identifying collinearity as the dominant cause which prevents the addition of one more variable. The guidelines given in this paper relate the values of the various indicators to the causes of imprecision, such as collinearity among the explanatory variables, inflated variance caused by an inappropriate model, presence of outlying observations, or omission of important explanatory variables. Once the problem and its potential cause have been identified, remedial measures can be taken. These are demonstrated in additional examples, which are available from the authors and will be presented at the conference.

References
Belsley, D.A., 1991, Conditioning Diagnostics: Collinearity and Weak Data in Regression, John Wiley, NY.
Brauner, N. and Shacham, M., 1998a, AIChE J., 44, 603-610.
Brauner, N. and Shacham, M., 1998b, Math. and Comp. in Simulation, 48, 75-91.
Math Works, Inc., 1992, The Student Edition of MATLAB, Prentice Hall, Englewood Cliffs, N.J.
Neter, J., Wasserman, W. and Kutner, M.H., 1990, Applied Linear Statistical Models, Irwin, Burr Ridge.
Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P., 1992, Numerical Recipes in FORTRAN, 2nd ed., Cambridge Univ. Press.
Shacham, M. and Brauner, N., 1997, Ind. & Eng. Chem. Res., 36, 4405-4412.
Shacham, M. and Brauner, N., 1998a, A collinearity diagnostic based on truncation error to noise ratio, Internal report, Ben-Gurion Univ. of the Negev.
Shacham, M. and Brauner, N., 1998b, Stepwise regression based on data precision and collinearity considerations, submitted for publication.
Shacham, M. and Cutlip, M.B., 1996, POLYMATH 4.0 User's Manual, CACHE Corp., Austin, TX.
Stewart, G.W., 1987, Statistical Sci., 2, 68-100.

Table 1. Data for the Example (Belsley, 1991)

                  Observer A                     Observer B
  y          x1        x2        x3         x1        x2        x3
  3.3979    -3.138     1.286     0.169     -3.136     1.288     0.17
  1.6094    -0.297     0.25      0.044     -0.296     0.251     0.043
  3.7131    -4.582     1.247     0.109     -4.581     1.246     0.108
  1.6767     0.301     0.498     0.117      0.3       0.498     0.118
  0.0419     2.729    -0.28      0.035      2.73     -0.281     0.036
  3.3768    -4.836     0.35     -0.094     -4.834     0.349    -0.093
  1.1661     0.065     0.208     0.047      0.064     0.206     0.048
  0.4701     4.102     1.069     0.375      4.103     1.069     0.376

Table 2. Steps of the SROV Procedure for the Example.

Phase 1
            Var. No.: j   |YX_j|     TNR_j      CNR_j
  Stage 0   1             0.96105    39922.0    4637.5
            2             0.57787    1286.9     637.05
            3             0.29049    90.907     82.339
            Selected variable 1, $\beta_1 = -0.96105$, $s^2 = 0.010911$
  Stage 1   2             0.94736    766.46     530.38
            3             0.94707    208.78     185.80
            Selected variable 2, $\beta_2 = 0.001305$, $s^2 = 1.305 \times 10^{-3}$
  Stage 2   3             0.1737     0.82891    0.17835
End of phase 1. Starting phase 2, first rotation.

Phase 2, first rotation
  Stage 0   Selected variable 2, $\beta_2 = 0.57787$, $s^2 = 0.095152$
            Var. No.: j   |YX_j|     TNR_j      CNR_j
  Stage 1   1             0.9941     1574.9     1397.7
            3             0.99419    174.53     223.94
            Selected variable 3, $\beta_3 = -1.0239$, $s^2 = 1.2862 \times 10^{-3}$
  Stage 2   1             0.12655    0.82892    0.12365
End of first rotation. Starting phase 2, second rotation.

Phase 2, second rotation
  Stage 0   Selected variable 3, $\beta_3 = -0.29049$, $s^2 = 0.1308$
            Var. No.: j   |YX_j|     TNR_j      CNR_j
  Stage 1   1             0.99569    366.85     292.41
            2             0.99578    231.75     203.06
            Selected variable 2, $\beta_2 = 1.2024$, $s^2 = 1.2862 \times 10^{-3}$
  Stage 2   1             0.12655    0.82892    0.12365

Table 3. Regression Results, Normalized and Orthogonalized Variables for the Example.

  Var. No.: j   $\beta_j$
  2              0.57787
  3             -1.0239

Table 4. Regression Results, Original Variables and Data for the Example.

                Model with x2 and x3                                 Model with x1, x2 and x3
  Var. No.: j   $\beta_j$   $\Delta\beta_j$   $\Delta\beta_j/|\beta_j|$    $\beta_j$   $\Delta\beta_j$   $\Delta\beta_j/|\beta_j|$
  0              1.26239    0.1984            0.1572                        1.25465    0.2521            0.2009
  2              2.99993    0.3180            0.106                         9.02192    65.52             7.262
  3            -10.6368     1.324             0.124                       -38.44       302.5             7.869
  1                                                                         0.974148   10.5985           10.88
  $s^2$          0.02141                                                    0.02634
Figure 1. Residual plot for 2 variables model (residuals plotted against the response variable, y).