3 MULTIPLE REGRESSION

This section sums up Douglas Montgomery's [13] work on multiple regression together with basic definitions and deductions from the fundamentals of statistics [14]. The aim of curve fitting is to express the relationship between two or more variables in mathematical form by determining the equation connecting the variables. If we assume a linear relationship between the dependent variable y and the independent variables x_i (i = 1, 2, ..., m), then we seek an equation connecting the variables of the form:

y = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_m x_m + \varepsilon ,    (3.1)

where the unknown parameters {a_i} are the regression coefficients and \varepsilon represents the random error. The regression coefficients {a_i} are obtained by the least squares method. Equation (3.1) represents a plane in the m-dimensional rectangular coordinate system. We assume that the number of observations is greater than the number of regressor variables, n > m. The data x_{ij} are shown in Table 3.1. The estimation procedure requires the mathematical expectation of the random errors to be E[\varepsilon] = 0, the variance to be E[\varepsilon^2] = \sigma^2, and the errors {\varepsilon_j} to be mutually uncorrelated.

Table 3.1. Data for linear multiple regression [13]

y      x_1     x_2     ...    x_m
y_1    x_11    x_21    ...    x_m1
y_2    x_12    x_22    ...    x_m2
...    ...     ...            ...
y_n    x_1n    x_2n    ...    x_mn



In this case, we can write the model for each observation in the form:

y_j = a_0 + a_1 x_{1j} + a_2 x_{2j} + \cdots + a_m x_{mj} + \varepsilon_j    (3.2)

Equation (3.2) can also be written in matrix notation:

y = x a + \varepsilon ,    (3.3)

where y is an (n x 1) vector, x is an (n x (m+1)) matrix, a is an ((m+1) x 1) vector of regression coefficients and \varepsilon is an (n x 1) vector of random errors:

y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} , \quad
x = \begin{bmatrix} 1 & x_{11} & x_{21} & \cdots & x_{m1} \\ 1 & x_{12} & x_{22} & \cdots & x_{m2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{mn} \end{bmatrix} , \quad
a = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_m \end{bmatrix} , \quad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}

The vector of regression coefficients a is found by the least squares method, i.e. by minimizing

L = \sum_{j=1}^{n} \varepsilon_j^2 = \varepsilon^T \varepsilon = ( y - x a )^T ( y - x a )    (3.4)

The superscript T denotes the transpose of a matrix or a vector; for example, the vector a and its transpose a^T are:

a = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_m \end{bmatrix} \quad \text{and} \quad a^T = [\, a_0 , a_1 , a_2 , \ldots , a_{m-1} , a_m \,]

The least squares estimators must satisfy the equation:

\frac{\partial L}{\partial a} = -2 x^T y + 2 x^T x a = 0 ,    (3.5)

which simplifies to

x^T x a = x^T y ,    (3.6)



To solve equation (3.6), we multiply both sides by the inverse of the matrix x^T x. The least squares estimator of a is:

a = ( x^T x )^{-1} x^T y ,    (3.7)

or in the explicit matrix form:

\begin{bmatrix} n & 0 & 0 & \cdots & 0 \\ 0 & S_{11} & S_{12} & \cdots & S_{1m} \\ 0 & S_{12} & S_{22} & \cdots & S_{2m} \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & S_{1m} & S_{2m} & \cdots & S_{mm} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} y_i \\ S_{1y} \\ S_{2y} \\ \vdots \\ S_{my} \end{bmatrix}    (3.8)
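A minimal computational sketch of the estimator in equations (3.3)-(3.7), assuming Python with numpy; the data arrays are hypothetical and stand in for a data set laid out as in Table 3.1:

    import numpy as np

    # Hypothetical data: n observations of m regressor variables (layout of Table 3.1).
    rng = np.random.default_rng(0)
    n, m = 36, 3
    X_raw = rng.random((n, m))                       # columns x_1 ... x_m
    y = 1.0 + X_raw @ np.array([0.5, -2.0, 1.5]) + 0.1 * rng.standard_normal(n)

    # Model matrix x of equation (3.3): a leading column of ones plus the regressors.
    x = np.column_stack([np.ones(n), X_raw])

    # Least squares estimator a = (x^T x)^(-1) x^T y, equation (3.7).
    # Solving the normal equations (3.6) is numerically preferable to forming the inverse.
    a = np.linalg.solve(x.T @ x, x.T @ y)
    print("estimated coefficients a_0 ... a_m:", a)

In practice the same fit is obtained with numpy.linalg.lstsq, which is better conditioned when the regressors are nearly collinear.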

3.1 Hypothesis testing in multiple linear regression

In multiple linear regression, we wish to test hypotheses about the model parameters a. In particular, it is important to test for the significance of regression, which is accomplished by testing [13]:

H_0 : a_1 = a_2 = \cdots = a_m = 0
H_1 : a_i \neq 0 \quad \text{for at least one } i    (3.9)

The rejection of H_0 in equation (3.9) implies that at least one variable in the model contributes significantly to the fit. The total sum of squares is partitioned into regression and error sums of squares, S_{yy} = SS_R + SS_E, where SS_R and SS_E are:

SS_E = \sum_{i=1}^{n} ( y_i - y_{est,i} )^2 ; \qquad SS_R = \sum_{i=1}^{n} ( y_{est,i} - \bar{y} )^2 ,    (3.10)

The test procedure for H_0 in equation (3.9) is to compute:

F_0 = \frac{ SS_R / m }{ SS_E / ( n - m - 1 ) } = \frac{ MS_R }{ MS_E }    (3.11)

and to reject H_0 if F_0 > F_{\alpha, m, n-m-1}. The procedure is usually summarized in an analysis of variance table, such as Table 3.2 [13]:

Table 3.2. Analysis of variance table

Source of Variation    Sum of Squares    Degrees of Freedom    Mean Square    Fisher Statistic F_0
Regression             SS_R              m                     MS_R           MS_R / MS_E
Error or residual      SS_E              n - m - 1             MS_E
Total                  S_yy              n - 1
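As a sketch of how the quantities in Table 3.2 are computed, assuming numpy and scipy and the x, y, a arrays from the sketch above:

    import numpy as np
    from scipy import stats

    def regression_anova(x, y, a):
        """Analysis of variance for the significance of regression (Table 3.2)."""
        n, m = x.shape[0], x.shape[1] - 1          # n observations, m regressors
        y_est = x @ a
        ss_e = np.sum((y - y_est) ** 2)            # error sum of squares SS_E
        ss_r = np.sum((y_est - y.mean()) ** 2)     # regression sum of squares SS_R
        ms_r, ms_e = ss_r / m, ss_e / (n - m - 1)  # mean squares
        f0 = ms_r / ms_e                           # Fisher statistic F_0, equation (3.11)
        p_value = stats.f.sf(f0, m, n - m - 1)     # reject H_0 if F_0 > F_{alpha,m,n-m-1}
        return f0, p_value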

We are frequently interested in testing hypotheses on individual regression coefficients. The hypotheses for testing the significance of any individual coefficient, for example a_i, are:

H_0 : a_i = 0
H_1 : a_i \neq 0    (3.12)

The appropriate test statistic is:

t_0 = \frac{ a_i }{ \sqrt{ MS_E \, C_{ii} } } ,    (3.13)

where C_{ii} is the diagonal element of ( x^T x )^{-1} corresponding to a_i.
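A sketch of the statistic (3.13) for all coefficients at once, again assuming numpy, scipy and the x, y, a arrays used above:

    import numpy as np
    from scipy import stats

    def coefficient_t_tests(x, y, a):
        """t_0 = a_i / sqrt(MS_E * C_ii) for every coefficient, equation (3.13)."""
        n, p = x.shape                                   # p = m + 1 fitted parameters
        ms_e = np.sum((y - x @ a) ** 2) / (n - p)        # residual mean square MS_E
        C = np.linalg.inv(x.T @ x)                       # C_ii are the diagonal elements
        t0 = a / np.sqrt(ms_e * np.diag(C))
        p_values = 2.0 * stats.t.sf(np.abs(t0), n - p)   # two-sided significance levels
        return t0, p_values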

The hypothesis H_0 : a_i = 0 is rejected if |t_0| > t_{\alpha/2, n-m-1}.

3.1.1 Coefficient of determination

The coefficient of determination is the square of the correlation coefficient r and is defined as [13, 14]:

r^2 = 1 - \frac{ \sum_{i=1}^{n} ( y_i - y_{est,i} )^2 }{ \sum_{i=1}^{n} ( y_i - \bar{y} )^2 } = \frac{ \sum_{i=1}^{n} ( y_{est,i} - \bar{y} )^2 }{ \sum_{i=1}^{n} ( y_i - \bar{y} )^2 } = \frac{ \text{Explained variation} }{ \text{Total variation} } .    (3.14)

The coefficient of determination r^2 can be interpreted as the fraction of the total variation which is explained by the least squares regression line; it therefore measures how well the least squares regression line fits the sample data. r^2 lies between 0 and 1, i.e. 0 \le r^2 \le 1. The definition (3.14) holds for non-linear correlation as well. An important related statistic is the adjusted r^2:

r_{adj}^2 = 1 - \left( \frac{ n - 1 }{ n - m - 1 } \right) ( 1 - r^2 )

The advantage of the adjusted r^2 is that it does not automatically increase when a new variable is added to the model.
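A short sketch of both statistics, assuming numpy and fitted values y_est = x @ a from the model above:

    import numpy as np

    def coefficient_of_determination(y, y_est, m):
        """r^2 of equation (3.14) and the adjusted r^2; m is the number of regressors."""
        n = y.size
        unexplained = np.sum((y - y_est) ** 2)     # error sum of squares SS_E
        total = np.sum((y - y.mean()) ** 2)        # total variation S_yy
        r2 = 1.0 - unexplained / total
        r2_adj = 1.0 - (n - 1) / (n - m - 1) * (1.0 - r2)
        return r2, r2_adj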

3.1.2 Other linear and non-linear models

The linear model y = x a + \varepsilon is a general model. It can be used to fit any relation that is linear in the unknown parameters a. By transforming the equations, it is in many cases possible to form a linear model. The most frequent expression in empirical correlations is a non-linear relation formed as a product of power-law terms:

Y = a_0 X_1^{a_1} X_2^{a_2} \cdots X_{m-1}^{a_{m-1}} X_m^{a_m} ,    (3.15)

Equation (3.15) can easily be transformed into a linear model. If the variables X_1, X_2, \ldots, X_m are independent of each other, we can take the logarithm of equation (3.15):

y = \log Y = \log ( a_0 X_1^{a_1} X_2^{a_2} \cdots X_{m-1}^{a_{m-1}} X_m^{a_m} )
  = \log a_0 + a_1 \log X_1 + a_2 \log X_2 + \cdots + a_{m-1} \log X_{m-1} + a_m \log X_m    (3.16)
  = a_0 + a_1 x_1 + a_2 x_2 + \ldots + a_{m-1} x_{m-1} + a_m x_m

The final form of equation (3.16) is simple and represents a linear relation between the transformed value y and the transformed values x_1, x_2, \ldots, x_m, where x_i = \log X_i and the constant \log a_0 is re-labelled a_0. Equation (3.15) can, of course, also be solved as a non-linear problem. In this case, we use gradient methods, most frequently the Levenberg-Marquardt method. In order to successfully solve a non-linear system of equations, properly chosen initial values of the parameters a_i are essential. These can more easily be determined from the solution of the linear problem.
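A sketch of this linearization, assuming numpy and hypothetical positive data arrays X (n x m) and Y; natural logarithms are used, but any base works if applied consistently:

    import numpy as np

    def fit_power_law(X, Y):
        """Fit Y = a0 * X_1^a1 * ... * X_m^am by taking logarithms, equation (3.16)."""
        n = X.shape[0]
        x = np.column_stack([np.ones(n), np.log(X)])   # transformed regressors x_i = log X_i
        y = np.log(Y)                                  # transformed response y = log Y
        coeffs = np.linalg.solve(x.T @ x, x.T @ y)     # ordinary linear least squares
        a0 = np.exp(coeffs[0])                         # undo the re-labelled constant log a0
        return a0, coeffs[1:]                          # a0 and the exponents a1 ... am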

3.1.3 Computer printout

Computer programs are used extensively in regression analysis. The output of one such program, SPSS [15], is shown below. First, we have to determine the input variables. A computer printout can appropriately be presented with a practical example with six independent variables. The regression equation can be expressed as:

dV = a_0 \Pi_1^{a_1} \Pi_2^{a_2} \Pi_3^{a_3} \Pi_4^{a_4} \Pi_5^{a_5} \Pi_6^{a_6} .    (3.17)

The regression equation (3.17) can be transformed into the form of equation (3.16). In this manner, a linear model can be acquired:

y = \log dV = \log ( a_0 \Pi_1^{a_1} \Pi_2^{a_2} \Pi_3^{a_3} \Pi_4^{a_4} \Pi_5^{a_5} \Pi_6^{a_6} )
  = \log a_0 + a_1 \log \Pi_1 + a_2 \log \Pi_2 + a_3 \log \Pi_3 + a_4 \log \Pi_4 + a_5 \log \Pi_5 + a_6 \log \Pi_6    (3.18)
  = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 + a_5 x_5 + a_6 x_6

We can designate the equation (3.18) as Model 1.

Table 3.3.

Model    Variables Entered          Variables Removed    Method
1        X6, X1, X5, X4, X3, X2     -                    Enter

All requested variables entered. Dependent Variable: Y.

Let us take a look at the computer printout for the six independent variables. From the printout we first obtain the correlation coefficient r, the coefficient of determination r^2, the adjusted r^2 and the standard error of the estimate (SEE). This is presented in the table below. The value r^2 = 0.878 shows that the model explains almost 88 % of the joint variance.

Table 3.4. Model summary

Model    R        R Square    Adjusted R Square    Std. Error of the Estimate
1        0.938    0.878       0.852                2.267E-02

a Predictors: (Constant), X6, X1, X5, X3, X2, X4
b Dependent Variable: Y

The analysis of variance helps us to ascertain whether the regression model is statistically significant.

Table 3.5.

Model         Sum of Squares    Degrees of Freedom    Mean Square    F         Sig.
Regression    0.107             6                     1.781E-02      34.647    0.000
Residual      1.491E-02         29                    5.141E-04
Total         0.122             35

a Predictors: (Constant), X6, X1, X5, X3, X2, X4
b Dependent Variable: Y
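Such printouts are not specific to SPSS; as an illustrative sketch, the statsmodels package in Python produces an equivalent summary (r, r^2, adjusted r^2, the analysis of variance and the coefficient table) for a fitted linear model. The transformed arrays X_log and y_log below are hypothetical:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical transformed data: columns x_i = log Pi_i and response y = log dV.
    rng = np.random.default_rng(2)
    X_log = rng.standard_normal((36, 6))
    y_log = 0.8 + X_log @ (0.1 * rng.standard_normal(6)) + 0.02 * rng.standard_normal(36)

    model = sm.OLS(y_log, sm.add_constant(X_log))   # constant term plus six predictors
    results = model.fit()
    print(results.summary())                        # R^2, adjusted R^2, F, t and Sig. values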


Finally, we check the importance of the individual predictors for explaining the regression criterion. For this purpose, we use the t-test. Since the t-test can be misleading in individual cases, we perform the regression analysis in stages, adding the variables one after another as long as they contribute to explaining the criterion. The data in the table below lead us to the conclusion that all predictors are statistically relevant and very important. The least important are the coefficients a_4 and a_6, but their importance is still considerable.

Table 3.6. Coefficients

Variable    Nonstandard Coefficient a    Std. Error    Standard Coefficient β    t         Sig.
Constant    a0 = 145.789                 73.085                                  1.995     0.056
X1          a1 = -0.353                  0.167         -0.532                    -2.106    0.044
X2          a2 = -0.266                  0.162         -0.422                    -1.641    0.112
X3          a3 = 211.113                 104.867       0.384                     2.013     0.053
X4          a4 = -9.697E-02              0.099         -0.181                    -0.983    0.334
X5          a5 = -25.781                 7.754         -0.404                    -3.325    0.002
X6          a6 = 0.103                   0.093         0.147                     1.108     0.277

a Dependent Variable: Y

The importance can most easily be presented graphically, Fig. 3.1. Equation (3.15) can also be tackled by solving the non-linear equations directly. In this case, the system of equations is solved with the Levenberg-Marquardt method. It is necessary to specify the initial values a_1, a_2, \ldots, a_m. Because this is an iterative procedure, a convergence tolerance is also required; it is set to 1 \times 10^{-10}. First, the analysis of variance helps us to ascertain whether the regression model is statistically significant. The table below clarifies this, since the determination coefficient is very high.
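A sketch of such a non-linear fit, assuming scipy's Levenberg-Marquardt implementation; the data arrays and exponents below are hypothetical:

    import numpy as np
    from scipy.optimize import least_squares

    # Hypothetical positive data: 36 observations of six dimensionless groups.
    rng = np.random.default_rng(1)
    X = rng.uniform(0.5, 2.0, size=(36, 6))
    Y = 1.0 * np.prod(X ** np.array([0.5, -1.0, 2.0, -0.3, 1.5, -0.7]), axis=1)

    def residuals(params, X, Y):
        """Residuals of the model Y = a0 * prod_i X_i^a_i, equations (3.15)/(3.17)."""
        a0, exponents = params[0], params[1:]
        return a0 * np.prod(X ** exponents, axis=1) - Y

    # Initial values (e.g. from the linearized fit) and the 1e-10 tolerance from the text.
    a_init = np.ones(1 + X.shape[1])
    fit = least_squares(residuals, a_init, args=(X, Y),
                        method="lm", xtol=1e-10, ftol=1e-10, gtol=1e-10)
    print("non-linear estimates a_0 ... a_6:", fit.x)

Starting from the coefficients of the linearized fit usually keeps the iteration well away from poor local minima.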


The following table leads us to conclude that all parameters are of great importance. A special requirement was made that this time the coefficient a_0 = 1. Caution is required when interpreting r^2 for the linear and the non-linear model: in the linear model, the value of r^2 refers to the transformed dependent variable log Y.

[Scatterplot: regression standardized predicted value versus log Y]

Fig. 3.1. Comparison between the measured and the regression model (SPSS graph).

Table 3.7.

Source               Degrees of Freedom    Sum of Squares    Mean Square    F          Sig.
Regression           6                     1206.33906        201.05651      2211.11    0.000
Residual             30                    2.72794           0.09093
Uncorrected Total    36                    1209.06700
(Corrected Total)    35                    21.80512

r^2 = 1 - Residual SS / Corrected SS = 0.87489


Table 3.8.

Variable    a_i             Std. Error    t         Sig.       95 % CI Lower Bound    95 % CI Upper Bound
Constant    a0 = 1
X1          a1 = -0.358     0.158         -2.272    0.03039    -0.680                 -0.036
X2          a2 = -0.245     0.155         -1.585    0.12342    -0.561                 0.071
X3          a3 = 2.189      1.050         2.085     0.04568    0.045                  4.334
X4          a4 = -0.255     0.073         -3.492    0.00151    -0.404                 -0.106
X5          a5 = -36.753    5.672         -6.480    3.7E-07    -48.337                -25.170
X6          a6 = 0.144      0.096         1.490     0.14662    -0.053                 0.340
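The confidence bounds in Table 3.8 follow directly from the standard errors; a minimal sketch, assuming scipy, of the two-sided interval a_i +/- t_{alpha/2, df} * SE:

    from scipy import stats

    def confidence_interval(a_i, std_error, df, level=0.95):
        """Two-sided confidence interval for a coefficient, as listed in Table 3.8."""
        t_crit = stats.t.ppf(0.5 + level / 2.0, df)
        return a_i - t_crit * std_error, a_i + t_crit * std_error

    # X1 row of Table 3.8 with 30 residual degrees of freedom (Table 3.7):
    print(confidence_interval(-0.358, 0.158, 30))   # approximately (-0.68, -0.04)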
