Computer Progm~ in Biome~'is~ 14 (1982) 77-84 Elsevier Biomedical Press
A BASIC program package for weighted least squares solutions using a microprocessor with disc memory Hugh Tyson and M" "y Ann Fieldes Biologr Department. McGill University. 1205 Au. Dr. Penfield. Montreal. Quebe'."H3A I BI. Canada
The fitting of linear additive models to data by weighted least squares methods has been programmed in BASIC. By ~fitting up the matrix solution into several steps, the RAM memory space needed in a microprocessor by the corre~¢om:ling programs was tedta:ed, allowing room for larger matric :s. Storage of intermediate products, and of subsequent programJ in the sequence, was on a mini floppy dual-disc drive. Using an extended BASIC interpreter permitting l¢~,etal files to be open during a program, the program package compromi~d between computing time and probknn size. Problems involving 1600 element m:,trices were run on a micro having 32000 bytes of memory, allowing. for example, the fitting of a model with 22 parameters to 72 observations.
I. INTRODUCTION
The 8 bit microprocessor in its various forms is becoming increasingly used for data analysis in biological research. The rising costs of time-sharing on larger installations, along with the improving versatility of the third generation micro and the continuous reduction in the cost of its memory and mass storage peripherals, have prompted the tackling of ever larger problems. Software, particularly the extended BASIC interpreters permitting calculations with up to 17 significant figures, as well as disc operations with data and program files, have made these small machines extremely powerful and cost effective. In this study, the fitting of linear additive models to data, using matrix methods for weighted least squares solutions, has been programmed in BASIC. Specifically, the problems involved a set of parameters or comparisons which were completely orthogonal to one another. Planned orthogonal comparisons are very useful in many experimental designs, leading to the partitioning of. for example, a treatment sum of squares into single degree, of freedom mean squares and independent significance tests of the comparisons against an 0010-~8X/82/0000-0000/$02.75 © 1982 Elsevier Biomedical Press
error mean square. With planned comparisons and such an orthogonal breakdown of a sum of squares, the extraction of information in a set of experimental results is straightforward, complete and unequivocal. Where the size of the problem to be tackled is small, the matrix solution of the weighted least squares fit can be accommodated within a single program, given a reasonable amount of useable RAM memory, say 48000. Be)ond a relatively small number of observations and parameters, the lengthy BASIC interpreters used, and the limitations of present microprocessor RAM memory sizes, forestall this. An escape from the impasse lies through the subdivision of the weighted least squares solution into a number of separate, sequential operations which can each be handlcd by short programs, with storage of internw.diate results and subsequent programs on disc memory. These programs are chained together, and once started, complete the fitting and the output of the parameter estimates, plus their standard errors, t-tests and sums of squares. The predicted values, using the parameter estimates, are also calculated. There are many applications for weighted least squares procedures in biological research. One ex-
7g
ample is the fitting of models containing genetical parameters for additive and dominance effects to data from crosses between inbred fines [11. A second example concerns the estimation of sion slopes and intercepts from electrophoresis data where a number of gel concentrations are used. In th"s context, calculation of regressions of Iog(Rm) c.n concentration may be coupled with comt~..-',~ons amongst the result~mt slopes and intercepts. Finally, the analysis of variance of. for example, a randomised block d.-,si~,q~ can be cartied out by fitting the corresponding linear additive model with these procedures. Where plot (cell) means and their variances are available, and where ~hese variances are heterogeneous, the weighted fit may be more appropriate than :he unweighted solution implicit in the use of standard analysis of variance formulae. Examples will be described briefly. The versatility and generality of the matrix methods for fitting models by weighted least squares procedures make them very attractive. Their appeal is not that they offer short computing times on small machines, but that the whole approach can be used for so many different problems. Attention is entirely focussed on the model which is to be fitted; once the appropriate design matrix has been "ormulated, the same general solution and programming is utilised. Essentially, this generalised approach requires specification of the parameters in a design, or X, matrix. Given observed means (y) and their variances (v), the reciprocals of these being used as the weights in most instances, then the procedure is as outlined in [2]. Starting with an orthogonai (X), difficulties at the inversion stage are reduced.
2. METHODS In matrix terms, the weighted least squares solution is: !; = (X'V -~X)- ~X'V -*y. and for programming purposes, this solution can be subdivided into 6 operations. These are: (I) Production of the transpose, X':
(2) Multiplication of the transpose into V -t, the matrix of weights; (3) Multipiication of this product into the design matrix X: (4) Multiplication of the product from (2) into y, the column vector of observed means, (5) Calculation of the inverse (X'V-tX); (6) Multiplication of the inverse into the product at (5). Two additional steps (7) and (8) yield the standard errors, sums of squares, and the predicted values plus residuals. Programs N I-N8 correspond to these separate operations. The design matrix (X) here will have a row number equal to the number of observations, and a column number equal to the number of parmneters to be fitted. These constants are requested as input in the first program (N I), and stored as integer values in a random access file for use in loop operations by subsequent programs. The actual construction of the design matrix is described clearly for various problems 12]. Later discussion of the examples here will take this up. The observations form a column vector ( ~ ) whose row dimension is. of course, equal to that of X. The weights also form a column vector (t,) of the same size: they may be stored either as variances of the observed means, or their reciprocals. The alternatives are signalled to N2. which deals appropriately with ©. The standard matrix solution above indicates that the weights form a (diagonal? matrix V - ~: in N2 they are utilised as a columr, vector for space saving. By setting the values in :, equal to !, the unweighted solution is obtained. Both the weighted and up.weighted fitting prc'suppose independent columns in (X). This may be verified with the matri~ procedure detailed in [2l, which has also been programmed in extended BASIC (ORTHOG.BAS) and forms an accessory program to this package. The procedure produces orthogonal columns from (X) input, but any column dependence will result in columns of zeros, thus locating such dependence. Given orthogonal parameters (columns) in X, the X'X product will be a diagonal matrix. Orthogonal polynomials can, for example, be readily calculated for a curvilinear regression p;'oblem where the inccenlents on the x-axis are unequal. Thus, (X) may be checked for
79
non-independence, or adjusted for orthogonal colunms, with O R T H O G prior to running. A valuable attribute of the weighted fit ira which the reciprocals of the variances of the observed means are used as the weights, is that the residual mean square is distributed as a X2. This allows a test of the significance of the residual term. If non-significaat, it would imply that the model fitted was adequate in the seine of satisfactorily explaining the variability among the observed means. There would be the possibility that some of the parameters included in the model were, in fact.
N1- I ~ z l l n o f
( X ) , c r I=rm¢l~ to.
F(3RM
Tnm~)ole of (X) Itor'e¢l on diM: I n ~ t of mw, ¢ol. d=wm,,am~ oml ~-=nt out r~lUmRt, ~ore¢l dllC
I N2
-
l:~'o(luct
X'V "l l~rme¢l and stored
m Oa¢. ~ ~ r t
I~-
out or V
I~om~__~_t X ' V ' I X .~orm~ anti stored on di~.
I N4- ~_~,ct
ona=:
X ' V " I y kx'med ~
om¢~
Invtq"sK~ of X'V "~ X, ~ B r ' s e storeO on ~sc .C]ptgx~ prr~ out
N~ - FeocJuct (X'Y"~ X~X'V "t y stcrml cn case ¢~tgnal ~.-.~ out o9 ~ : x a ~ 4 a r estrnotes
l N?-
Calc~
(a) A design (X) matrix specifying the parameters operating for each observed mean is required; this is m o s t onveniently set up as a random access file on disc (X.DAT), as are y (Y.DAT) and v (V or V I.DAT). N I stores the row and columr, dimensions of X as a rapdom access file (CONST.DAT), and puts requests for printed output of various calculations in FLAG.DAT. X, y and v are set up with an accessory program, FORM.BAS; a branch to this from N I is provided. The disc file operations may require, apart from file opening and closing, a 'read' from data within the program, an output of this data to the file, or an input from the file. FORM.BAS contains all these main elements, to allow random access file creation and checking of the data in the file.
p r ~ out oq .v
t MS-
redundant. If the residual term were significant, it would imply that the model fitted was not adequate, and would require either more parameters, or a completely different set of parameters to achieve adequacy. The flowchart in fig. 1 outlines the sequence of programs. We next consider: (a) Data input; (b) Data output; (c) Memory requirements for each program and maximum matrix sizes; (d) Error situations.
of stflnclorcl errors.
totoJ sums of saum-es, regress~n sum of squares, r e s g ~
sum c~
- ~ n r v s , t yOkeS for e o ~ laa~ameCer Results stored on chsc C ~ t ~ W~nt out of r e s e t s
1 NB - C o l o J ~ m of Ix~u:tecl volues extol resKluols Pr~.t out of results
Fig. I. Flowchart of programs N I to N8.
(b) The output of results from N I - N 8 can be sent either to the terminal on the control I / O port, or to a printer at another I / O port. in the second case, an appropriate driver routine must be provided by the user. Hard copy output is invoked by the 'PRINT O,' statements, with suitable 'INPUT "TERMINAL OR P R I N T E R . . . " .... ' and : I F . . . ' statements to divert the output to the printer as :equired. All print choices are made in N 1. At completion, a message is sent to the control port terminal by each of N I-NS. The sequence of N I - N 8 can be restarted at any point if prerequisite data or intermediate product files have been set up. (c) The extended BASIC interpreter used here, plus any one of N I - N S , occupy between 19000 and 21000 bytes, The interpreter allocates 8 bytes for the storage of each floating point number in dimensioned arrays. Given the maximum size of
go available RAM, and the way in which mattic~ are handled by a specific program, the useable size of the matrices can be calculated. In operations 1, 2 and 3 (N I-N3), the maximum size requirement is equivalent to a matrix of n by (m + I) elements. where n and m ate the numbers of rows and columns if X, respectively. N I leaves sfightly less space than N2 or N3; with 48000 bytes of RAM available, 3450 elements could be handled by N i. This implies a matrix of the order of 58 × 59. (X) is the largest matrix which must be dealt with, and N i, N2 and N3 are written to hold only one such matrix at a time. In N3, the multiplication procedure inputs the second matrix row by row, with output of the individual products to a disc file. While this results in a considerable increase in computing time due to numerous disc transfers, it compromises on the RAM space requirement. It would be feasible to split the X. or matrices of equivalent size, into 2 o." more portions to increase the size of problem which could be tackled, but the resultant slowdown through excessive disc transfers means that this approach should be avoided if possible. (d) Incorrect dimensions for the number of observations (rows in X) or parameters (columns in X) would, if too large, generate error messages from the interpreter, such as 'array reference out of range', if too small, a printout of (X) would display incorrect placement of the values. For under/overflow in N5 during calculation of the determinant, the error is interpreted and a message printed out before the program halts. Essentially, reliance is placed on inbuilt error detection and messages by the BASIC interpreter rather than the BASIC program.
3. HARDWARE This set of programs has been run on both the SWTP 6800 and 6809 computers. In each case, mini floppy :5.25 inch dual disc drives were used with single-sided, single density operation and a capacity of 85 000 bytes/disc. The 6800 had 40000 bytes of RAM available, and the 6809 had 56000 bytes. In both cases, 8000 bytes of this total was
occupied by the resla~ive disc operating system software, 4. SOFTWARE BASIC programs are transportable, but inevitably some problems of translation from one 'dialect" into another could arise, such as alr,,,-ration of subscripts if the zero subscript were not allowed. The critical characteristic of the interpreter is the abifity to open and close several files while a program is running. The abifity to load the next program in the sequence automatically through the "chain" statement is convenient, but not vital. The errors introduced into matrix calculations by rounding off are widely appreciated. This is important during inversion, and can be amplified by attempts to invert ill-conditioned matrices. The inversion subroutine in N5 uses partial pivoting. and is slightly modified from the BASIC subroutine published in [3]. Lack of symmetry in the inverse of a large (X'V-iX) matrix would pinpoint problems with N5. in addition, a comparison of parameter estimates before and after adjusting a non-orthogonal (X) to orthogonal columns should also pick up problems at inversion. Such checks have indicated the absence of rounding off errors in using an extended BASIC interpreter, producing up to 17 significant digits, for a variety of numerical examples requiring inversion of matrices of up to 24 × 24. The comparison of pre- and postorthogonalisation results is illustrated in the first example. 5. MODE OF A~,AILABILITY FOR GRAM PACKAGE
PRO-
A listing of these BASIC programs (N ! to Ng. plus ORTHOG) will be supplied on request. 6. SAMPLE RUNS
Example ! The first example is of trivial size, but allows all results to be shown. T'ne data are from [1]; the X,y
81
and v values are shown below. The parameters for overall mean, additive genetic effects and dominance effects are specified for each of the 6 generations in (X); the parameters in this case are not orthogonal to one another. I
2
3
•
y
1.0 1.0 1.0 1.0 1.0 1.0
!.0 0.5 0.0 0.0 -0.5 - 1.0
0.0 0.5 1.0 0.5 0.5 0.0
( - 5.0) (1.0) (7.0) (1.0) (1.0) ( -- 5.0)
9.125 7.500 5.500 6.595 5.575 5.475
© 120.7584
100.0000 135.2082 71.8184 244.1406 307.7870
• Figures in brackets represent col. 3 of (X) after-adjustment with O R T H O G X ha~ dimensions (6 x 3): y and v are (6 x I).
The output'for this example is shown in table I; the values agree with those in [I ]. If (X) is adjusted with ORTHOG so that all columns are orthogonal, tl,.e new estimates for parameters I and 3 change, but the residual sum of squares is exactly the same as before. The t-tests for parameters2 and 3 are also exactly the same as before, but for parameter I there is a change. The new results are:
TABLE I
Output for example I Weight values follow: i 2 3 4 5 6 Y ! 2 3 4 5 6
120.7584
100 135.2082 71.8184 244.1406 307.787
values now follow: 9.125 7.5 5.5 6.595 5.575 5.475
INVERSE ( I , I ) = 2.482344383303E-03 (1,3) = - 3.40~. ! %,397437E-03 (2,1)= 1.011157115758E.03 (2,2) = 2.375934890384E-03 (3. I ) = - 3.409196697437E-03 (J,3)= 8.90732784156 E-03
(I,2)=
1.011157115758E-03
(2,3) = - 1.092816701525E-03 (3,2) = - 1.0928 i 6701525E-03
D E T E R M I N A N T = 4.906 i 71821022E + 07
Parameter estimates: ( i ) = 7.34076847286 (2) = 1.847047427698 (3) = - 1.742049297743
Total SS (uncorrected) = 39707.8931128 I Regression S S = 39703.5229 ! 329 Residual SS =4.370199518171
Parameter
Standard errors
I=
0.0344~3 ~68312232
Standard errors of parameters for weighted soln.only:
0.048743562553269
( I ) = 0.049823131006623 (2) - 0.048743562553269 (3) = 0.0943786408 ! 2212
6.6149145988f,00
[ t = 191.94 ( 147.-4)] 2=
t.8470474276~80 [ t = 37.89 (37.t9)] 3 = -0.1451707748119
7.864886734351E-03
It= 1~.46 (18.46)1 Figures in brackets are t-values from table I. i.e.. fit prior to adjustment with O R T H O G .
t-Values (weighted soln.) ( I ) = 147.34 ( 2 ) = 37.89 (3)= 18.46
Predicted ), values
Example 2 Electrophoresis data from protein separations carried out at 6 increasing acrylamide gel concentrations required the calculation of the linear regressions of log (relative mobility) on gel percentage. Three proteins were measured in 4 types of plant material, namely 2 parents and their reciprocal FI hybrid progeny. For each of the 12 combinations, at each of the 6 concentrations,
I 9.1878 2 7.3933 3 5.5987 4 6.4697 5 5.5462 6 5.4937 (Printout of observed - predicted = residuals optional)
8 values were obtained, from which means and their variances were calculated. There were (12 x 6 ) - 7 2 means in all. Orthogonal comparisons
g2
8zesuons, and their comparimm with one another in the ways described above, can now be
among parents and FI reciprocals were:
~ t ~ . . , m t ~ in a sinSk design matrix. Amnning
F 1's-- Fcnude listed I st. PI
Pl x P 2
I 0 I
0 I -I
P2xPI
P2
0 -I -I"
-I
0 I
Among the 3 proteitls, which migrated to different positions in the gels, comparisons were: Protein I
Protein 2
Protein 3
I l
-I
0 -2
l
These comparisons among the two sets of main factors generate 6 interactions• A total of I I degrees of freedom (DF) are thus accounted for. There were 6 gel concentrations. Linear regressions of io8~ R,,) on concentration were fitted to each of the 12 sets of means and concentrations• Comparisons among the regression intercepts and slopes of t~te kind enumerated above for plant types, prot,:ins and their interactions were required. it is thus possible to ask whether the intercepts and slopes differ between the 2 parents, between the 2 reciprocals, or whether the average intercept and slope in the progeny differ from those in the parents• Also, the 2 comparisons among proteins across all plant types can be evaluated for both intercepts and slopes, with the interactions between the two sets of main factors additionally being checked. Fitting a linear regression to a set of 6 points leaves 4 DF for the residual sum of squares. In total, there would be ( 4 × 12)=48 DF for the residual in a combined analysis of all 12 regressions. The remaining DF are accounted for as follows. The intercept and slope of the single line th:ough all 72 points account for 2 DF. Twelve individual lines cam be fitted, so that there are 1 I DF for comparisons among the 12 intercepts, with a further !1 DF for the corresponding comparisons among the 12 slopes. The calculation of the 12 individual linear re-
that the column vector of observationg has been set up with the order: Protein
Plant type
Obser.-ed (y) mean
I I ! i I I
Pl Pl Pl Pl Pl Pt
I 2 3 4 5 6
where I, 2, 3... 6 represent the 6 ~ ! concentrations
! I 1 i I I
Pl Pl Pl Pl P: PI
I 2 3 4 5 6
~ert 1,2,3...6 represent the 6 ~1 com:entratiom
3
P2
xP2 ×P2 xP2 xP2 xP2 "
6
whem !, 2, 3... 6 repot the 6 &el txmcgntrations
The systematic arrmgement of the observed means in y implies that there will be a systematic pattern in the corresponding (X). The calculation of the intercept and slope i~ each of the 12 sets of observed means and gel concentrations will require, then, the following columns in (X): Row Row Row Row Row Row Row Row Row Row Row Row
I-1 2=1 3= I 4=1 5=1 6=1 7=1 8=1 9=i 10= I II = i 12= I
I 2 3 4 5 6 I 2 3 4 5 6
Row 72 = I
6
While only the first 12 rows of (X) are shown, continuation of the two first columns in (X) over the remaining 60 rows is straightforward. The resalt would be estimation of the overall intercept and overall slope across all 72 points.
1¢3
seen, the set up of the second comparison between the 3 proteins, and of the 3 comparisons among the 4 plant types, follows quickly. In table 2, the first 24 rows of the completed (X) are shown. This has 24 columns corresponding to the 24 parameters which are to be fitted to the observed means. It can probably be seen that the final 12 columns of (X) deal with the 6 interactions for intercepts a~d the 6 for slopes. It may also be seen that the sir.lplest way to produce the interaction columns in a large (X) such as this is to program their construction. Again, Draper and Smith [2] supply the methodology. In summary, one has to specify the model to be fitted. With a systematic arrangement of the observations, there will be a pattern in the emerging
The comparisons of the intercept and slope for protein I with those for protein 2 are next considered. There are 48 rows in y correslmnding to proteins 1 and 2. The third and fourth columns in the (X) which is being built up are readily obtained. Column 3 will consist of 24 ,'.'s followed by 24-l's. The fmal 24 rows of column 3 will be O's. Column4 will consist of 24 rows in which the pattern 1,2,3,4,5,6, 1,2,3,4,5,6, 1,2,3,4,5,6, 1,2,3,4,5,6 occurs. This will be followed by 24 rows in which the above values are negative. Again, the final 24 rows for column 4 will be O's. Draper and Smith..[2l provide a very clear exposition of this methodology for a small example. The 48 rows of columns 3 and 4 are thus looking at the difference between protein I and protein 2. Once this can be
TABLE 2 (X)--The
first 24 rows for example
2. in
which 24 parameters are to be fitted to 72 observations
Column n u m b e r = par::meter number: 00 12
II 12 13 14 15 16 I I 12 13 14 15 16 I I 12 13 14 15 16 II 12 13 14 15 16
0000 3456
0 7
0 8
0 9
I 0
i i
Proton
~anttypes
IIii 1212 1313 1414 1515 1616 I I i I 1212 1313 1414 1515 1616 I I I I 1212 1313 1414 1515 1616 i111 1212 1313 1414 1515 1616
I I 0 0 I ! 2 0 0 I I 3 0 0 I I 4 0 0 i I 5 0 0 I ! 6 0 0 I 0 0 i I -I 0 0 i 2-1-2 0 0 i 3-1 0 0 I 4-1 0 0 I 5-1 0 0 I 6-1 0 0-1 -I -I 0 0-1 -2-1 0 0-1 -3-1 0 0-1 -4-1 0 0-1-5-1 0 0-1-6-I -I-I 0 0 I -I -2 0 0 I -I-3 0 0 I -I -d 0 0 I -I -c 0 0 I -I -~ 0 0 I
R o w s 25-72 are constructed in a similar fashion
I 2
I 3
I 4
I 5
I 6
i 7
-I -3 -4 -5 -6 -I -2 -3 -4 -5 -6 I 2 3 4 5 6
I I I I I I 0 0 0 0 0 0 0 0 0 0 0 0 -1 -I -I -I -I -I
I 2 3 4 5 6 0 0 0 0 0 0 0-1 0-1 0-1 0-1 0-1 0-1 -I -2 -3 -4 -5 -6
0 0 0 0 0 0 I I I I I I
0 0 0 0 0 0
2 0
2 I
2 2
2 3
2 4
O 0 0 0 0 0 I 2 3 4 5 6
I I I I I I
I 2 3 4 5 6
-I -I -! -I -I -I -I -I -I - I -I -I I i I I I 1
-I -2 -3 -4 --5 -6 -I -2 -3 -4 -5 -6 I 2 3 4 5 6
Interactions
Interactions I 2 3 4 5 6
I 9
I 8
0 0 0 0 0 0 I 2 3 4 5 6 -I -2 -3 -4 -5 -6 0 0 0 0 0 0
I I I i I I
I 2 3 4 5 6
-I -I -I -! -I -I -I -I -I -I -I -I I I I I I I
-i -2 -3 -4 -5 -6 -I -2 -3 -4 -5 -6 I 2 3 4 5 6
I I I I I I 0 0 0 0 0 0 0 0 0 0 0 0 -I --
-! -I -I -!
I 0 2 0 3 0 4 0 5 0 6 0 I 0 I 0 I 0 I 0 I 0 I 0 0 --I 0--I 0 --I 0 - I 0 -I 0 -I 0 -I 0 --2 0 -3 0 -4 0 -5 0 -6
-I -2 -3 -4 -5 -6 0 0 0 0 0 0
84
columns of (X). Once the construction of (X) has b~'n finalised, the problem of obtaining a weighted least squares solution is virtually complete. There are no scalar formulae to be altered, no programmin8 changes to be made. These are the principal attractio,~ of this approach, thus leaving the researcher free to concentrate on the model. One final point stems from the fact that the residttal sum of squares of the weil0tted fit gives a test (as a X2) of the adequacy, of the model in explaining the variability among the obs.'rved means. Where the residual sum of squares is not significant, one or more of the parameters in the model may be redundant. Once located, they may be dropped from the model, which can be then retitted to the data. A prolFession of elimination and refitting will result in the minimum adequate model. Location of possibly redundant parameters can be made by ranking the ;-test values for all parameters, then eliminating the parameter(s) cor-
t'espoe~lin6 to the lowest t-values. One proceeds until the point at which further eli,'nination results in a sisnificant X2 for the residual term.
ACKNOWLEDGEMENT This work was made possible through the award of an operating grant from the National Research Council of Canada, to whom thanks are extender4
REFERENCES
"
[I] K. Mather and J.L. Jinks, Biometrical Genetics (Comell University Press,Ithaca NY. 1971). [2] N.R. Draper and H. Smith, Applied reKression analyJ,is (Wiley. New York NY, 1968). [3J V. Milkovitch,BASI: subroutinelibraFy(VM Professional Application Software, Inc..New York NY, 1979).