Computer Programs in Biomedicine 15 (1982) 141-150
Elsevier Biomedical Press
A Pascal program for weighted least squares regression on a microcomputer

H. Tyson, H. Henderson and M. McKenna *

Biology and * Computer Science Departments, McGill University, Montreal, Quebec H3A 1B1, Canada
Weighted least-squares regression has been programmed in Pascal for a microcomputer. A double precision Pascal compiler and the Motorola 6809 assembler produce a fast machine-code program occupying 22000 bytes of memory when appended to the Pascal run-time module. Large data sets fit in the remaining memory. A regression with 72 observations and 24 parameters runs in 7 min, excluding optional printout of large matrices. The maximum dimensions of the design matrix, X, can be altered by modifying two Pascal constants. Minor changes to the Pascal source program will make it compatible with other Pascal compilers. The program optionally orthogonalises the X matrix to detect linearly dependent columns in X, and/or to generate orthogonal parameter estimates. After orthogonalising X and fitting the model, the parameter estimates for the original X can be retrieved by the program. Regressions on a repeatedly reduced model are performed through elimination of columns in X until the minimum adequate model is obtained.
1. INTRODUCTION

Fitting linear additive models with weighted least-squares methods using matrix procedures has many applications in biology and medicine. The ability to use large data sets, and to alter the parameters in the model easily and quickly, facilitates exploitation of these techniques. A Pascal program for an 8-bit microcomputer is described here. It allows the fitting of selectively reduced models to the data until the minimum adequate model is achieved. This single Pascal program condenses the set of BASIC programs outlined by Tyson and Fieldes (1982) [1]. It is 6 times faster than the BASIC implementation, and it offers significantly extended capabilities. Both the Pascal and the BASIC programs use double precision calculations. The weighted least squares solution in matrix notation is:
b = (X'V⁻¹X)⁻¹X'V⁻¹y

where
b = the vector of parameter estimates; X = the design matrix; V = the diagonal matrix of weights; y = the vector of observed means. An apostrophe (') indicates the transpose of a matrix or a vector. (A sketch of this computation is given at the end of this section.) Draper and Smith (1968) [2] describe the underlying theory, and its application to regression and the analysis of variance. The design matrix (X) specifies the parameters to be fitted in the appropriate linear additive model for the data. Each observed mean (y) has a corresponding variance (V). The reciprocals of these variances supply the weights. In most biological applications V is a diagonal matrix since covariance between observed means is absent, or is impossible to ascertain. The weights can, therefore, be regarded as a vector for practical purposes. This matrix solution is completely general, and applicable to analysis of variance as well as linear, curvilinear, multiple linear and multiple curvilinear regressions and covariance. It is used, for example, in fitting genetical parameters to data from crosses (Mather and Jinks, 1971) [3]. It can also be used in the combined analysis of regressions resulting from
electrophoresis in a range of gel concentrations, where information on relative molecular mass and/or net charge differences can be extracted from regression slopes and intercepts (Fieldes and Tyson, 1982) [4]. All such applications hinge on the specification of the requisite design matrix.
To improve execution time over the interpreted BASIC version, a Pascal compiler supplying the required double precision (16.8 significant figures) in the calculations was used. The resultant Pascal program was implemented on a Motorola 6809 microcomputer with 56000 bytes of RAM memory and 5 in. floppy disc drives. The program is used here to fit weighted and unweighted curvilinear regressions as an example. An unweighted fit is obtained from the program by omitting the vector (V) of weights: the program sets all weights = 1. Details of data input are dealt with in the example below.
Specification of a model to be fitted to a data set may either:
(i) Start from the simplest possible model, test its adequacy, and then add further parameters if the original model does not explain the variation satisfactorily (Mather and Jinks, 1971) [3].
(ii) Start with the most complex model, which, if adequate, may be reduced in size by eliminating those parameters contributing little to the explanation of observed variability (Tyson, 1973) [5].
MacNair's (1979) [6] genetical modelling and analysis embody both approaches. In this Pascal program, approach (ii) is made possible through the choice of parameters for elimination by the user, followed by refitting of the reduced model.
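To make the core computation concrete, the accumulation of the weighted cross products X'V⁻¹X and X'V⁻¹y can be sketched in Pascal as follows. This is an illustration added here, not the published listing; the identifiers, type declarations and the use of two constants for the maximum dimensions of X (mentioned in the summary above) are assumptions:

    const
      MaxRows = 72;    { maximum dimensions of the X matrix; the      }
      MaxCols = 24;    { summary notes these are two Pascal constants }
    type
      DataMat = array[1..MaxRows, 1..MaxCols] of real;
      SqrMat  = array[1..MaxCols, 1..MaxCols] of real;
      RowVec  = array[1..MaxRows] of real;
      ColVec  = array[1..MaxCols] of real;

    procedure NormalEquations(var X: DataMat; var W, Y: RowVec;
                              NRows, NCols: integer;
                              var A: SqrMat; var C: ColVec);
    { Forms A = X'V^-1 X and C = X'V^-1 y, where W holds the }
    { reciprocals of the variances (the diagonal of V^-1).   }
    var
      i, j, k: integer;
      S: real;
    begin
      for j := 1 to NCols do
      begin
        for k := 1 to NCols do
        begin
          S := 0.0;
          for i := 1 to NRows do
            S := S + X[i, j] * W[i] * X[i, k];
          A[j, k] := S
        end;
        S := 0.0;
        for i := 1 to NRows do
          S := S + X[i, j] * W[i] * Y[i];
        C[j] := S
      end
    end;

The estimates b are then obtained by inverting A and multiplying the inverse into C, as described in the next section.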
2. METHODS

For the weighted least squares solution:
b = (X'V⁻¹X)⁻¹X'V⁻¹y

a series of matrix operations is involved. In this Pascal program the operational sequence is similar to the chained BASIC programs (Tyson and Fieldes, 1982) [1]. Partial pivoting is again used in inversion. The inverse replaces the original (X'V⁻¹X) matrix. Figure 1 summarises the steps in the program.
The program asks whether transformation of the original X matrix to obtain orthogonal columns is wanted. If the design matrix X is orthogonal, that is, the product of its transpose multiplied into the original (X'X) is a diagonal matrix, problems at the inversion stage are reduced. Transformation to orthogonal columns, as described by Draper and Smith (1968) [2] (p. 156), locates those column vectors within an X matrix which are not linearly independent. This is useful when a model is to be extended by additional non-orthogonal parameters, a situation frequently encountered in fitting genetical models to generation means from crosses between genotypes. In matrix terms the transformation is:

Z_iT = Z_i - Z(Z'Z)⁻¹Z'Z_i

where Z = the matrix of column vectors already transformed; Z_i = the next column vector of X to be transformed; Z_iT = the transformed vector orthogonal to the vectors already in Z. This matrix procedure can be reduced to steps analogous to the Gram-Schmidt process (Anton, 1977) [7] for obtaining orthonormal bases, a more efficient approach resulting in far more compact program code. At the bottom of each column of the orthogonalised X, the norm of that column is printed out. The norm is defined as the square root of the sum of squares of the column elements. A norm approaching zero pinpoints a column dependent on the ones preceding it. For certain problems, e.g., linear and curvilinear regression, comparison of parameter estimates before and after orthogonalising X detects problems with the inversion stage. Another important use for this procedure is in calculating orthogonal polynomials where increments for the independent variable are unequal. Retrieval of parameter estimates for the non-orthogonal X following the use of orthogonal polynomials is included in this program. It allows curvilinear regressions to be fitted, for example, with output of b coefficients from weighted or unweighted fits, coefficients which can then be used to generate predicted y_i values for any x_i values the user desires.
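Because the columns already transformed are mutually orthogonal, (Z'Z) is diagonal and the expression above reduces to subtracting one projection per preceding column. A minimal sketch of this step (added here for illustration, reusing the declarations from the sketch in section 1; the loop structure and output layout are assumptions, not the program's actual source):

    procedure Orthogonalise(var X: DataMat; NRows, NCols: integer);
    { Gram-Schmidt: replace each column of X, left to right, by its }
    { component orthogonal to all preceding columns, then print the }
    { column norm; a norm near zero flags a column that is linearly }
    { dependent on the columns before it.                           }
    var
      i, j, k: integer;
      Dot, Len2, Norm: real;
    begin
      for j := 1 to NCols do
      begin
        for k := 1 to j - 1 do
        begin
          Dot := 0.0;
          Len2 := 0.0;
          for i := 1 to NRows do
          begin
            Dot := Dot + X[i, k] * X[i, j];
            Len2 := Len2 + X[i, k] * X[i, k]
          end;
          if Len2 > 0.0 then
            for i := 1 to NRows do
              X[i, j] := X[i, j] - (Dot / Len2) * X[i, k]
        end;
        Norm := 0.0;
        for i := 1 to NRows do
          Norm := Norm + X[i, j] * X[i, j];
        WriteLn('Column ', j: 2, ' norm = ', Sqrt(Norm): 12: 6)
      end
    end;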
1. Prompt user for orthogonalisation of X matrix, conversion of variances to weights, and choice of outputs - observed y, variances, weights, X matrix, orthogonalised X, the (X'V⁻¹X) inverse, X'V⁻¹y, parameter estimates, their standard errors, predicted y, total, regression and residual sums of squares, recalculated b values.
2. Read in observed means y and (optionally) the vector of their variances.
3. Read in X matrix.
4. Calculate parameter estimates and their standard errors.
5. Calculate total, regression (all parameters) and residual sums of squares.
6. If needed, recalculate b estimates for unorthogonalised X.
7. Remove (another) column in X and go to (3), or stop.

Fig. 1. Summary of steps in the Pascal program.
The recalculated b values are printed out at the end of all other results to emphasise the fact that results such as parameter standard errors and t values are for the parameter estimates obtained using the orthogonalised X. If X has been orthogonalised and columns within X are to be removed, then the column removal must be performed on the original X, which has to be reread from the disc. After column elimination the reduced X matrix is reorthogonalised. In order to reread the X matrix from the disc, the Pascal compiler used here requires modification at execution time to the file control block of the disc operating system (Technical Systems Consultants Flex 9). This was achieved by calls to 'PEEKW' and 'POKEW' in conjunction with the compiler's 'RESET'. The statements are gathered together in a procedure, so that for other compilers they can be readily removed.
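The isolation of these compiler-specific statements might look like the sketch below (an assumption added here: the actual PEEKW/POKEW calls and file-control-block addresses belong to the TSC run-time and are not given in the paper, so they appear only as a comment; with a standard compiler the plain RESET is all that is needed):

    procedure ReopenXFile(var XF: Text);
    { Gathers everything needed to reread XFILE in one place, so }
    { that it can be kept or replaced as a unit when porting.    }
    begin
      { For TSC Pascal under Flex 9: patch the file control      }
      { block here with PEEKW/POKEW before reopening the file.   }
      Reset(XF)
    end;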
If the weights corresponding to each of the observed means (vector y) are the reciprocals of the variances of these observed means, then the residual sum of squares, after fitting the model in the design matrix, is distributed as a χ². A significant residual sum of squares would imply that the model fitted did not adequately explain the variability among the observed means in the vector y. If so, the model being fitted to the data would have to be modified and/or expanded, or discarded in favour of a different model. If the residual sum of squares is not significant, the model explains the variability adequately. Nevertheless, some of the parameters in the model may be redundant. This leads to the question of parameter elimination.
Orthogonality of the parameters in the design matrix allows the ready removal of a parameter contributing little to explanation of the variability among observed means; the reduced model can be refitted to the data. A check of the new residual term will show whether the parameter is redundant. This process may be continued; parameters can be removed one by one until the minimum adequate model is obtained. The choice of parameter for removal at each cycle is made by ranking the t values, calculated by dividing each parameter estimate by its standard error. By discarding the parameter corresponding to the lowest calculated 't' value, refitting the model and checking the residual term for significance, redundant parameters can be sequentially eliminated one at a time from the model (a sketch of this criterion appears at the end of this section). After each fitting, the program asks which column is to be eliminated from X. Thus, parameters closest to zero and/or having the largest standard errors are removed. The minimum adequate model is the one for which any further elimination will generate a significant χ². Any column in X may be eliminated, not necessarily the final (right-hand) column. When column elimination is used, the program retains the original column numbers for the X matrix throughout all subsequent output, to remove confusion about which columns (parameters) remain in the model, and which are eliminated. Predicted values, given parameter estimates (b), are found from:
p = Xb

where p is the vector of predicted (y) values. This
yields predictions for the input values used. The program produces the regression (all parameters in model) and residual sums of squares from the appropriate expressions detailed by Draper and Smith (1968) [2] for weighted least squares fitting. The degrees of freedom (DF) for the regression and residual items are, respectively, equal to the number of columns (parameters) in the model, and the number of rows minus the number of columns. If X is orthogonalised, the b coefficients for the originally non-orthogonal model must be retrieved if new values for the independent variable are to be used for prediction. An apparently adequate model does not preclude the existence of another adequate model based on different premises. A minimum adequate model does provide one with: (i) location of the relevant parameters; (ii) predictions. If the parameters have a biological meaning, progress in basic understanding may be made.
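The ranking criterion can be sketched as follows (illustrative only; the identifiers are assumptions): each estimate is divided by its standard error, and the column with the smallest |t| is reported as the candidate for elimination.

    function WeakestColumn(var B, SE: ColVec; NCols: integer): integer;
    { Returns the index of the parameter with the smallest |t|   }
    { value, i.e. the strongest candidate for removal from X.    }
    var
      j, JMin: integer;
      T, TMin: real;
    begin
      JMin := 1;
      TMin := Abs(B[1] / SE[1]);
      for j := 2 to NCols do
      begin
        T := Abs(B[j] / SE[j]);
        if T < TMin then
        begin
          TMin := T;
          JMin := j
        end
      end;
      WeakestColumn := JMin
    end;

The program itself leaves the final choice to the user, who is asked after each fit which column to eliminate.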
3. HARDWARE AND SOFTWARE

A Motorola 6809 microcomputer with 56000 bytes of RAM memory plus two 5 in. floppy disc drives (98000 bytes/disc) was used. The machine code generated by the double precision Pascal compiler (Technical Systems Consultants) and 6809 assembler for this program, and the corresponding run-time system, occupy 22000 bytes. The user can select output; intermediate results may be obtained in addition to parameters, as summarised in fig. 1.
4. MODE OF AVAILABILITY

A listing of the Pascal program will be supplied on request. The program is 555 lines long. A printout from the example described below is included.
5. EXAMPLE (Mather, 1966) [8] (p. 129)

In fitting regressions to curved lines, orthogonal polynomials are needed for the choice of model (order of regression). The data are growth rate measurements of a tomato genotype. The plants were grown in 4 replicates; 2 samples were taken from each replicate, and the log dry weights were recorded at each of 6 equally-spaced sampling times (8 values at each sampling - see table 1). The variation between duplicates within replicates supplies the variance of the mean value at each sampling; the sum of the 6 corresponding sums of squares provides the error sum of squares in Mather's analysis of variance. Transformation of the X matrix to orthogonal columns by this program generates the orthogonal polynomials needed:
(i) Mather [8] uses an analysis of variance to test the linearity of the regression of log dry weight on sampling time. The residual term after fitting a linear regression is significant when tested against the error mean square in the analysis of variance, showing that a higher order regression is needed.
(ii) He constructs orthogonal polynomials for a fifth order regression; these allow independent regression sums of squares for each term in the model to be extracted.
(iii) He shows that a quadratic (second order) regression is required, using the error mean square from the analysis of variance to check the significance of successively higher order regression terms in the equation.
Parameter estimates, the appropriate polynomial (minimum adequate model), and b coefficients for the original non-orthogonal model are obtained from this single Pascal program. The 6 equally spaced sampling times are coded 0-5. A fifth order polynomial is the most complex model fitted, and generates predicted points which fit those observed exactly. It allows no degrees of freedom for the residual term. The model for this case would be:

y_i = b0 + b1x_i + b2x_i² + b3x_i³ + b4x_i⁴ + b5x_i⁵

The X matrix, and the vectors of observed means and their variances, are:
X matrix                             Mean log    Variance
                                     dry wt      of mean
1   0    0     0     0      0        0.742125    0.002891188
1   1    1     1     1      1        1.514875    0.030276938
1   2    4     8    16     32        2.221125    0.010409438
1   3    9    27    81    243        2.877875    0.004374438
1   4   16    64   256   1024        3.306500    0.008027625
1   5   25   125   625   3125        3.449750    0.005050250

Reciprocals of these variances provide the vector of weights. The X matrix has successive columns that are clearly not orthogonal. Transformation to orthogonal columns by this program generates the orthogonalised X matrix below:

1.000000   -2.500000    3.333333   -3.000000    1.714286   -0.476190
1.000000   -1.500000   -0.666667    4.200000   -5.142857    2.380952
1.000000   -0.500000   -2.666667    2.400000    3.428571   -4.761905
1.000000    0.500000   -2.666667   -2.400000    3.428571    4.761905
1.000000    1.500000   -0.666667   -4.200000   -5.142857   -2.380952
1.000000    2.500000    3.333333    3.000000    1.714286    0.476190
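As an illustration of the norm printout described in section 2 (a check added here, not part of the original output): the linear column of the orthogonalised X has norm √((-2.5)² + (-1.5)² + (-0.5)² + 0.5² + 1.5² + 2.5²) = √17.5 ≈ 4.1833, comfortably far from zero; a column that was linearly dependent on those preceding it would instead give a norm near zero.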
TABLE 1
Growth of tomato plants (log dry weight values). Data from Mather (1966) [8]. Each replicate supplied two duplicate samples at each sampling time.

Sampling   Replicate
(x)        1                2                3                4                Means
0          0.748   0.763    0.845   0.672    0.881   0.785    0.580   0.663    0.742125
1          1.279   1.732    1.398   1.591    1.342   1.799    1.580   1.398    1.514875
2          2.117   2.290    2.452   2.097    2.170   2.236    2.243   2.164    2.221125
3          2.857   3.004    2.839   2.806    2.964   2.886    2.935   2.732    2.877875
4          3.230   3.390    3.487   3.167    3.367   3.358    3.236   3.217    3.306500
5          3.602   3.477    3.204   3.447    3.556   3.519    3.362   3.431    3.449750

Mather's analysis of variance
Source                 Degrees of    Mean square    F
                       freedom
Replicates              3            0.027583
Sampling:
  Linear regression     1            43.768585
  Residual              4            0.462402      22.73 a
Interaction            15            0.005863
Error (duplicates)     24            0.020343

a Significant at probability <0.01
Given the polynomial's parameter estimates in the output from the unweighted fitting of the model:

y_i = b0 + b1x_i + b2x_i² + b3x_i³ + b4x_i⁴ + b5x_i⁵
or any reduced version of it, the sums of squares accounted for by fitting any individual terms can be quickly calculated from the general matrix expression (Harvey, 1960) [9]:

Sum of squares = b'Z⁻¹b

where b is the vector of parameter estimates involved and Z⁻¹ is the inverse of that portion of (X'V⁻¹X)⁻¹ which corresponds to this vector. In this case, the calculation of the sum of squares accounted for by fitting parameter b_n (n = 0-5 here) is:

(b_n)² / c_nn

where c_nn is the element from the leading diagonal of the (X'V⁻¹X) inverse corresponding to b_n.
The standard errors of the parameter estimates in the weighted fit are found as the square roots of the elements in the leading diagonal of (X'V⁻¹X)⁻¹. In an unweighted fit, these elements are multiplied by the residual mean square before taking square roots (Mather and Jinks, 1971) [3].
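As a numerical check (added here) using the unweighted results in table 2: with orthogonal columns, c_nn reduces to 1/(z_n'z_n), so the sum of squares removed by the linear term is b1²·(z1'z1) = (0.559136)² × 17.5 ≈ 5.4711, in agreement with the 5.471073 listed in the table.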
With the particular compiler used here, to run the program the user must execute the PRUN command as shown below. This command runs the machine code program stored in the LEAST file. The 3 character strings following LEAST are sent as parameters to the LEAST program. For the LEAST program, these parameters must be file names:

PRUN LEAST XFILE YFILE VFILE

XFILE is the file holding the X matrix. The first 2 numbers in this file state the row and column dimensions of the X matrix. The row dimension corresponds to the number of observed means, while the column dimension corresponds to the number of parameters in the model being fitted. The rest of XFILE contains the X matrix stated in row-by-row order. When using this Pascal, a blank line must follow the X matrix, and nothing must follow the blank line. The requirement is related to the problem of rereading X with this particular compiler. YFILE contains the column vector of observed means. VFILE may contain the variances of the observed means, or it may contain the reciprocals of the variances. The program asks the user whether the variances are already reciprocals. All data in these files are read in free format, with one or more blanks separating each number. The files are most conveniently set up using a text editor. The user can assign arbitrary weights to the observed means by simply stating these weights in the VFILE file. At execution time, the user would answer 'yes' to the question: 'Are the variances already reciprocals?' If an unweighted solution is desired, the user replaces the VFILE parameter of the LEAST program by a single period (.). This will cause the program to assign 1's to the variances. The title of the output will indicate:

UNWEIGHTED LEAST SQUARES REGRESSION
instead of: WEIGHTED LEAST SQUARES REGRESSION
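As an illustration (a rendering added here, based on the format rules just given, and not reproduced from the paper), the three files for the Mather example above could be laid out like this, in free format with one or more blanks between numbers:

    XFILE (the 6 x 6 design matrix, preceded by its dimensions
    and followed by the required blank line):

    6 6
    1 0 0 0 0 0
    1 1 1 1 1 1
    1 2 4 8 16 32
    1 3 9 27 81 243
    1 4 16 64 256 1024
    1 5 25 125 625 3125

    YFILE (the observed means):

    0.742125 1.514875 2.221125 2.877875 3.306500 3.449750

    VFILE (the variances of the means):

    0.002891188 0.030276938 0.010409438 0.004374438 0.008027625 0.005050250

Running PRUN LEAST XFILE YFILE VFILE with these files, answering 'no' to the reciprocals question, gives the weighted fit; replacing VFILE by a single period gives the unweighted fit.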
The following runs were made with these data:
(i) Unweighted least squares fits of fifth, fourth, third, second and first order polynomials;
(ii) Weighted least squares fits of fifth, fourth, third, second and first order polynomials.
5.1. Results

5.1.1. Unweighted least-squares fitting
The parameter estimates for the first to the fifth order regressions are shown in table 2. Predicted y_i values for the second and fifth order regressions are also shown. For the fifth order regression, the sums of squares removed by fitting the individual linear and quadratic terms are given. These are independent sums of squares, agreeing with the values given by Mather [8] and clearly showing the decreasing portion of the total variation accounted for by successively higher order terms in the regression model.
TABLE 2
Results from unweighted fitting of first to fifth order polynomials to data from Mather (1966) [8]

Regression   Parameter estimates
order        b0          b1          b2          b3          b4          b5
First        2.352042    0.559136
Second       2.352042    0.559136   -0.076036
Third        2.352042    0.559136   -0.076036   -0.015095
Fourth       2.352042    0.559136   -0.076036   -0.015095   -0.001547
Fifth        2.352042    0.559136   -0.076036   -0.015095   -0.001547    0.002642

Regression   Predicted y_i (= log dry weight) values
order        y0          y1          y2          y3          y4          y5
Second       0.700750    1.564029    2.275236    2.834371    3.241436    3.496290
Fifth        0.742125    1.514875    2.221125    2.877875    3.306500    3.449750
Obsvd.       0.742125    1.514875    2.221125    2.877875    3.306500    3.449750

Sums of squares individually removed by each term (from b'Z⁻¹b)
b0           b1          b2          b3          b4          b5          Total
33.192600    5.471073    0.215840    0.014765    0.000107    0.000390    38.894874

Original b coefficients for complete (fifth order) non-orthogonal model
b0           b1          b2          b3          b4          b5
0.74212      0.92397    -0.26349     0.14420    -0.03457     0.00264

Parameters were removed one at a time, beginning with the fifth order term, from the X matrix. This meant the successive elimination of columns at the right side of the orthogonalised X matrix shown above. In refitting the reduced models, the residual term increases while the parameter estimates remain constant. The predicted y values for the fifth order polynomial agree exactly with those observed. The parameters (b coefficients) in the model:

y_i = b0 + b1x_i + b2x_i² + b3x_i³ + b4x_i⁴ + b5x_i⁵

allow the prediction of y_i for any x_i. The recalculated b coefficients for the original non-orthogonal X matrix are shown at the bottom of table 2 and may be compared with those from the orthogonal polynomials. The estimates and predicted values for the second order regression model agree with Mather's results [8].
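As an illustration (added here, not part of the program) of using these recalculated coefficients, the fitted fifth order polynomial can be evaluated at any x by Horner's rule; the coefficients are rounded to five decimals in the table, so the predictions carry that rounding:

    function PredictY(X: real): real;
    { Fifth order polynomial with the recalculated b coefficients }
    { for the original non-orthogonal model (bottom of table 2).  }
    const
      B0 = 0.74212;   B1 = 0.92397;   B2 = -0.26349;
      B3 = 0.14420;   B4 = -0.03457;  B5 = 0.00264;
    begin
      PredictY := ((((B5 * X + B4) * X + B3) * X + B2) * X + B1) * X + B0
    end;

At X = 0 this returns 0.74212, matching the observed mean 0.742125 to the precision of the rounded coefficients.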
5.1.2. Weighted least squares fitting
Of more interest is the set of results from weighted fitting, shown in table 3. Again, elimination of columns from the orthogonalised X matrix shown above corresponded to the removal of specific parameters. Choice for removal was by ranking of 't' values for each parameter. Starting from the fifth order polynomial, yielding a perfect fit between predicted and observed means, the fifth order term (right-hand column) is eliminated.
TABLE 3
Results from weighted fitting of first to fifth order polynomials to data from Mather (1966) [8]

Regression   Parameter estimates
order        b0          b1          b2          b3          b4          b5

First        2.308484    0.577424
   SE        0.031327    0.016205
   t         73.7 a      35.6 a

Second       2.369177    0.548916   -0.079599
   SE        0.032584    0.016743    0.011751
   t         72.7 a      32.8 a      6.8 a

Third        2.350359    0.560743   -0.076309   -0.016158
   SE        0.035729    0.019110    0.012024    0.012587
   t         65.8 a      29.3 a      6.3 b       1.3 NS

Fourth       2.350895    0.560570   -0.076362   -0.016043   -0.000325
   SE        0.040783    0.020138    0.012180    0.013261    0.011934
   t         57.6 b      27.8 b      6.3 NS      1.2 NS      0.03 NS

Fifth        2.352042    0.559136   -0.076036   -0.015095   -0.001547    0.002642

Regression   Predicted y_i (= log dry weight) values
order        y0          y1          y2          y3          y4          y5
Second       0.731619    1.598856    2.306932    2.855849    3.245605    3.476202
Fifth        0.742125    1.514875    2.221125    2.877875    3.306500    3.449750
Obsvd.       0.742125    1.514875    2.221125    2.877875    3.306500    3.449750

a Significant at probability <0.01; b at <0.05
SE = standard error of parameter estimate
NS = not significant

Original b coefficients for complete (fifth order) non-orthogonal model
b0           b1          b2          b3          b4          b5
0.74212      0.92397    -0.26349     0.14420    -0.03457     0.00264

The successively reduced models have non-significant residual sums of squares until the elimination of the quadratic term (table 4). The parameter values for weighted fitting change with successive eliminations, in contrast to the unweighted case. The χ² for the residual of the weighted linear regression is highly significant. Thus, the minimum adequate model is the one containing both linear and quadratic terms. The parameter estimates, except for the fifth order (perfect fit) polynomial, are
slightly divergent from those found by unweighted fitting. A crucial point is the comparison of the F ratios for the quadratic regression model. Given the model:

y_i = b0 + b1x_i + b2x_i²

the weighted and unweighted F ratios for regression mean square against residual mean square are 3757.9 and 2531.1, respectively, using the respective residual mean squares for both F values.
TABLE 4
Regression and residual sums of squares from weighted fitting of first to fifth order regressions

Order     Regression            Residual (= χ²)        F ratio      χ² (P = 0.05)
          DF    SS              DF    SS
First     2     6304.364966     4     47.554655 a      265.14       9.49
Second    3     6350.229796     3     1.689824         3757.92      7.82
Third     4     6351.877763     2     0.041858         75874.12     5.99
Fourth    5     6351.878503     1     0.041118         30895.85     3.84
Fifth     6     6351.919620     0     0.000000

a Significant at probability <0.01
SS = sums of squares
DF = degrees of freedom
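As a check on the table (added here): for the second order weighted fit, F = (6350.229796/3)/(1.689824/3) = 6350.229796/1.689824 ≈ 3757.9, the value quoted in the comparison above.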
Clearly, the use of weights increases the sensitivity of the F test, as expected, since the variances of the six sampling means are distinctly heterogeneous.
5.2. Summary

This example demonstrates:
(i) The transformation to, and use of, an X matrix with orthogonal columns;
(ii) Extraction of the minimum adequate model by using a weighted least squares fit;
(iii) The increase in sensitivity given by a weighted least squares fit.
Point (iii) is particularly relevant in situations where, as here, variances for observed means show considerable heterogeneity. In many biological experiments, variances within groups show heterogeneity; weighted least squares regressions provide a way to cope with this. The program provides a way to exploit the advantages of these techniques on inexpensive computing equipment.
ACKNOWLEDGEMENT

This work was supported by an NSERC (Canada) scholarship awarded to M.M.
REFERENCES
[1] H. Tyson and M.A. Fieldes, A BASIC program package for weighted least squares solutions using a microprocessor with disc memory, Comput. Prog. Biomed. 14 (1982) 77-84.
[2] N.R. Draper and H. Smith, Applied Regression Analysis (Wiley, New York, NY, 1968).
[3] K. Mather and J.L. Jinks, Biometrical Genetics (Cornell University Press, NY, 1971).
[4] M.A. Fieldes and H. Tyson, Isozyme comparisons using weighted analysis of electrophoretic mobility in a range of polyacrylamide gel concentrations, Electrophoresis 2 (1982) 296-303.
[5] H. Tyson, Cytoplasmic effects on plant weight in crosses between flax genotypes and genotrophs, Heredity 30 (1973) 327-340.
[6] M.R. MacNair, The genetics of copper tolerance in the yellow monkey flower, Mimulus guttatus. I. Crosses to non-tolerants, Genetics 91 (1979) 553-563.
[7] H. Anton, Elementary Linear Algebra (Wiley, New York, NY, 1977).
[8] K. Mather, Statistical Analysis in Biology (University Paperbacks, Methuen, London, 1966).
[9] W.R. Harvey, Least Squares Analysis of Data with Unequal Subclass Numbers (Bulletin ARS-20-8, USDA, 1960).