An Application of Linear Regression Analysis to Biometric Data*
G. A. BAKER
Assistant Statistician in the Experiment Station, University of California, Davis (Received for publication September 13, 1948)
* Presented under the title "A mathematical model of the relation between white and yolk weights of birds' eggs" at the twenty-eighth meeting of the Institute of Mathematical Statistics, June 17-19, San Diego, California.
Downloaded from http://ps.oxfordjournals.org/ by guest on April 9, 2015

A complete theory and practice of regression analyses, based on definite assumptions or mathematical models, has been developed by Galton (1889), Pearson (1896), Yule (1932), Fisher (1938), Deming (1943), David and Neyman (1938), and many others. In some respects a more practical approach is given by Ezekiel (1941). Eisenhart (1939) has discussed the use of regression methods in biological and industrial research. The usual assumptions are that one variable, say x, is without error and that the variance of the other variable, say y, measured from the regression curve, is uniform with respect to x. The case of errors in both variables, however, has also been treated by numerous authors, for instance Wald (1939), Deming (1943), and Kavanagh (1941). Nonuniform variance has been discussed by Brady (1938), Salmon (1938), Baker (1941, 1945), and others. Dr. Benjamin Epstein has noted errors in the expressions for the variances and correlation coefficient of estimates for the constants of a straight line given by Baker (1941).

Recently Neyman and Scott (1948) have developed a method of consistent estimates based on partially consistent observations which applies to cases of linear regression when there are errors in both variables. The general principle used in this development is that of maximum likelihood.

Any mathematical model, such as those used in engineering, physics, and biology, is only an approximation to reality. Sometimes models which are obviously faulty have great practical utility in predicting what may be expected. Our present purpose is to apply the methods found in the literature and to develop other mathematical models that will give us a rational method of estimating a "best line" in some sense that will represent the relation between white and yolk weights for some or all species of birds. This problem has already been considered by Asmundson et al. (1943). The results of the various methods will be compared. For some purposes simpler techniques may give results as acceptable as those involving more formidable mathematics. All the methods will be seen to deviate from reality in some important respects.

As a preliminary to any determination of the relationship between white and yolk weights of birds' eggs, a comprehensive survey of the character of the data should be made. From the data at hand it appears that birds within species may differ in means and variances of weights and that the yolk and white weights are positively correlated. Yolk and white weights within a species are functions of egg number. The standard deviations of yolk and white weights for different species are approximately proportional to mean values.
LINEAR RELATIONSHIPS BETWEEN WEIGHT OF YOLK (x) AND WEIGHT OF WHITE (y).
The data are taken from a paper by Asmundson, Baker, and Emlen (1943) (text Figure 1) on the relation of weight of egg white, y, to the weight of yolk, x, for 24 species of birds whose average egg sizes range from 1.5 to 161.0 grams. Because of technical difficulties, more or less beside the point of the paper in The Auk, a straight line was simply drawn to represent the relationship between x and y. The equation of this line as read from the graph is (a) y = 0.7+1.56x. Eighteen of the residuals from (a) are negative.
If a conventional least-square line under the assumptions of x free of error and equal weights for each observed point is fitted to these data we obtain (b) y = 1.929 + 1.563x. If the residuals from this line are computed they appear peculiar. They are large for large values of x, and all negative for small values of x. This straight line cannot be considered a good representation of the relation between x and y. Eighteen of the residuals are positive. We shall try first to do something about the systematic change in the size of the residuals from (b). Let us proceed along the lines indicated by Baker (1941) by grouping the species in small groups of 5 or 4, computing smoothed weights, and

Relation of weight of white, y, to weight of yolk, x, for 24 species of birds whose average egg sizes range from 1.5 to 161.0 grams

Name*                    x (grams)   y (grams)   Weights for (c)   Weights for (d)

Bank Swallow                0.3         1.1          232.20            136.03
Redstart                    0.4         1.5          130.61             73.69
Nightingale                 0.4         1.4          130.61             82.69
Barn Swallow                0.5         1.2           79.01             98.71
Sparrow                     0.6         2.0           55.38             39.84

Redbreast                   0.7         2.2           40.96             32.23
Mockingbird                 0.7         3.0           40.96             19.03
Tricolored Redwing          0.8         2.6           31.52             23.36
Brewer's Blackbird          0.9         3.3           25.00             15.01
Fieldfare                   1.7         3.8            6.99              9.51

European Blackbird          2.2         4.2            4.15              7.23
Crow                        2.5        13.4            3.23              1.00
Quail                       3.2         6.2            1.97              3.35
Raven                       3.6        13.9            1.56              0.86
Corn Crake (Rail)           4.5         7.1            1.00              2.30

Ring-necked Pheasant        8.7        15.2            0.37              0.53
Lapwing                     9.3        13.3            0.37              0.62
Gull                       10.8        25.7            0.37              0.22
Guinea                     16.0        20.3            0.37              0.25
Chicken                    17.2        33.0            0.37              0.17

Silver Pheasant            17.4        21.5            0.37              0.21
Turkey                     26.0        47.2            0.37              0.17
Duck (Pekin)               28.0        46.4            0.37              0.17
Goose                      57.3        90.8            0.37              0.17

* Names assigned by Professors T. I. Storer and V. S. Asmundson.
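As a rough check on the table, the following sketch fits a straight line to the 24 species points twice: once with equal weights, and once with the tabulated weights for (c). The pairing of x, y, and weight values is read from the table's column order and should be checked against the original paper. The helper name `fit_line` and the variable names are illustrative assumptions, not the author's notation. The equal-weight fit should agree closely with equation (b). The weighted fit is a textbook weighted least-squares calculation and is not guaranteed to match equation (c), which follows the grouping and smoothing procedure of Baker (1941).

```python
# Yolk weight x and white weight y (grams) for the 24 species, in table order.
x = [0.3, 0.4, 0.4, 0.5, 0.6, 0.7, 0.7, 0.8, 0.9, 1.7,
     2.2, 2.5, 3.2, 3.6, 4.5, 8.7, 9.3, 10.8, 16.0, 17.2,
     17.4, 26.0, 28.0, 57.3]
y = [1.1, 1.5, 1.4, 1.2, 2.0, 2.2, 3.0, 2.6, 3.3, 3.8,
     4.2, 13.4, 6.2, 13.9, 7.1, 15.2, 13.3, 25.7, 20.3, 33.0,
     21.5, 47.2, 46.4, 90.8]
# Tabulated "Weights for (c)"; the last nine species share the weight 0.37.
w_c = [232.20, 130.61, 130.61, 79.01, 55.38, 40.96, 40.96, 31.52, 25.00, 6.99,
       4.15, 3.23, 1.97, 1.56, 1.00] + [0.37] * 9


def fit_line(xs, ys, ws=None):
    """Return (intercept, slope) minimizing sum of w_i * (y_i - a - b*x_i)**2.

    x is treated as free of error; ws=None means equal weights.
    """
    if ws is None:
        ws = [1.0] * len(xs)
    total = sum(ws)
    x_bar = sum(w * xi for w, xi in zip(ws, xs)) / total
    y_bar = sum(w * yi for w, yi in zip(ws, ys)) / total
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(ws, xs))
    sxy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(ws, xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


a_ols, b_ols = fit_line(x, y)          # equal weights, compare with (b)
a_wls, b_wls = fit_line(x, y, w_c)     # tabulated weights for (c)

print(f"equal weights:   y = {a_ols:.3f} + {b_ols:.3f} x")
print(f"weights for (c): y = {a_wls:.3f} + {b_wls:.3f} x")
```

The weighted fit gives the small species far more influence than the large ones, which is the intended effect when the spread of the residuals grows with egg size.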
The "true" means for yolk and white weights for different species do not lie on a line because of biological differences between species with the same egg size. The standard deviations of species deviations from a straight line depend on the size of the egg (perhaps proportional to a weighted sum of the yolk and white weights). The differences between species are large compared with differences within species.

From an inspection of the data of text Figure 1 given by Asmundson et al. (1943) it appears that a line will represent adequately the relationship between white weight and yolk weight for the range under consideration. Also, there is a marked relationship between these two variables.

This paper will present in some detail one example of the application of regression methods in biometric research. Comparative results are given for fitting a straight regression line under three progressively more realistic sets of assumptions: first, x free of error and uniform variance; second, x free of error and nonuniform variance; third, both x and y subject to error and nonuniform variance.
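One compact way to write the three sets of assumptions is as three quantities to be minimized over the intercept and slope. This is only a sketch. The symbols a, b, w_i, and S_1 to S_3 are notation assumed here, not the paper's, and the third expression assumes that deviations are measured perpendicular to the line, as the paper describes for its line (d).

```latex
% Assumed notation: a = intercept, b = slope, w_i = weight of species i, n = 24.
\begin{align*}
\text{(1) $x$ free of error, uniform variance:}\quad
  & S_1(a,b) = \sum_{i=1}^{n} (y_i - a - b x_i)^2 \\
\text{(2) $x$ free of error, nonuniform variance:}\quad
  & S_2(a,b) = \sum_{i=1}^{n} w_i \, (y_i - a - b x_i)^2 \\
\text{(3) $x$ and $y$ subject to error, nonuniform variance:}\quad
  & S_3(a,b) = \sum_{i=1}^{n} w_i \, \frac{(y_i - a - b x_i)^2}{1 + b^2}
\end{align*}
```

The factor 1/(1 + b^2) in the third expression converts a squared vertical deviation into a squared perpendicular distance from the point to the line.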
then fitting a straight line. By this method of least-squares with unequal weights and x assumed free of error we obtain (c) y = 0.955 + 1.634x. Eleven of the residuals from (c) are negative and 13 are positive.

Let us now consider that both variables x and y are subject to error. We shall follow the methods detailed by Baker (1945) and proceed as for equation (c). Thus we obtain (d) y = 0.762 + 1.715x for a least-square line with unequal weights and both x and y subject to error. We now find that 12 residuals are negative.

In obtaining (d) it was assumed that the sum of the squares of the perpendicular distances from the observed species points to the fitted line should be a minimum. Let us relax this condition and determine a least-square line with the distances from the observed points to the fitted line in some direction weighted inversely as the squares of the distances of the points of intersections of lines through the points with this direction with the fitted line from the y-intercept of the fitted line. The least-square equations are not simple, and can be solved only approximately. However, in this case we find (e) y = 0.968 + 1.560x. The angle between the direction along which deviations are measured and the fitted line is now 87° 50' instead of 90° as for equation (d).

It is necessary to estimate the weights separately in applying the methods of Neyman and Scott (1948) in making maximum likelihood estimates of a structural relationship similar to those above. Also, in order to include the data for all 24 species, it is necessary to count each species as a single observation since the individual white and yolk weights are not known for some species. On this basis we obtain (f) y = 0.874 + 1.650x as the maximum likelihood estimate of the structural relationship using the same weights as (c). If we use the same weights as for (d) and count each species as one observation we obtain (g) y = 0.019 + 1.626x as the maximum likelihood estimate of the structural relationship between white and yolk weights.

Complete data are available for 8 species with x + y varying from 1.7 to 74 grams. Thus, taking account of the variation within species (covariation seems to be not important) and with weights inversely proportional to the square of x + y we find that the maximum likelihood estimate of the desired structural relationship is (h) y = 0.294 + 2.411x.

It is possible that some transformation of the variables, weight of yolk (x) and weight of white (y), would give pairs of values that fulfill the requirements of uniform variance along a linear regression line. One such possibility is to consider log x and log y. If such values are plotted a first casual glance indicates some degree of conformity to the assumptions of linear regression and uniform variance. However, closer inspection reveals that neither assumption is valid. The double log plot implies that y = 0 when x = 0. This results in a serious distortion of the regression curve, for it is obvious from the (x, y) plot that y does not approach 0 as x approaches 0. If a least-square line is fitted to the (log x, log y) values by assuming the log x-values fixed and free of error and each species point of equal weight we obtain (i) $\log_{10} y = 0.764 \log_{10} x + 0.4959$ or (i1) $y = 3.133\,x^{0.764}$.
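The step from (i) to (i1), and the behaviour of the curve near the origin and for large eggs, can be checked numerically. The sketch below assumes that (i) reads log10 y = 0.764 log10 x + 0.4959, so that (i1) is y = 10^0.4959 x^0.764, and it compares the curve with line (c) at the smallest and largest species in the table. The function and variable names are hypothetical and are introduced only for illustration.

```python
# Equation (i) as read here: log10(y) = 0.764 * log10(x) + 0.4959
LOG_SLOPE = 0.764
LOG_INTERCEPT = 0.4959


def power_law_white(x):
    """Back-transformed curve (i1): y = 10**0.4959 * x**0.764."""
    return 10 ** LOG_INTERCEPT * x ** LOG_SLOPE


def line_c_white(x):
    """Weighted least-square line (c) from the text: y = 0.955 + 1.634 x."""
    return 0.955 + 1.634 * x


# Multiplier of (i1); it should be close to the printed value 3.133.
multiplier = 10 ** LOG_INTERCEPT

# Smallest and largest species in the table: (yolk x, white y) in grams.
observed = {"Bank Swallow": (0.3, 1.1), "Goose": (57.3, 90.8)}

print(f"10**{LOG_INTERCEPT} = {multiplier:.3f}")
print(f"at x = 0: (i1) gives {power_law_white(0.0):.3f}, "
      f"(c) gives {line_c_white(0.0):.3f}")
for name, (x, y) in observed.items():
    print(f"{name}: observed {y:.1f}, "
          f"(i1) {power_law_white(x):.2f}, (c) {line_c_white(x):.2f}")
```

With these coefficients the curve is forced through the origin, while line (c) keeps a positive intercept. For the goose the curve falls well below the observed white weight, which is one way to see the distortion described in the text.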
The first six observed log y-values are considerably below the regression line. Also, the points seem more widely scattered for medium and large values of log x. A transformation of the form (log (x+k), log y) might be found that would overcome the distortion near the origin but the constant, k, would be troublesome to determine.

DISCUSSION

The drawing of freehand curves to represent data is hard to defend because of its subjective nature. However, if the person drawing the curve is experienced and is well acquainted with the vagaries of his data the result may not be too bad. This method has been recommended in certain cases by Yule (1932) and used by many practicing statisticians. The equation (a) seems somewhat better than (b).

The principle of least-square is well established in theory and practice. Its use is widespread and the results in general have been satisfactory. The usual, simplest assumptions made in applying the least-square principle are that one variable, x in this case, is free of error and that the other variable, y, has uniform spread or variance for all values of x. These assumptions are not sufficiently realistic in this case. Methods that are adequate to approach reality more closely are available in the literature or have been developed in this paper. The results are given as equations (c), (d), and (e). The weights are still somewhat arbitrary but can hardly be improved without additional data.

The principle of maximum likelihood is another well-developed method of obtaining estimates of population parameters from limited data. Such estimates can be modified for consistency, and the asymptotic variance can be computed. The matter of weights to be assigned to each observed value again enters in and the data are not sufficient to indicate clearly just what these weights should be. Equations (f), (g), and (h) are the results of applying the method of maximum likelihood estimation with various sets of weights to the problem of finding the linear relationship between white and yolk weights of birds' eggs. It is dangerous to extrapolate (h) to eggs as big as that of the goose.

It is common practice to try to rectify data by means of proper transformations. Some biologists have recommended the method leading to equation (i1). However, if (i1) is plotted in terms of x and y on text Figure 1 of Asmundson et al. (1943) very serious defections are apparent.

CONCLUSION

We have considered in some detail the problem of finding an estimate of the relationship between white and yolk weights of birds' eggs for all species based on partial data for 24 species and more adequate data for eight species. Of course, the most effective way to improve the estimate of this relationship is to obtain more data. For the given data, however, the freehand curve is subject to criticism because it lacks uniqueness or objectiveness. The simplest least-square lines are not adequate because the assumptions on which they are based are not sufficiently realistic. However, with more realistic bases the least-square lines are fairly stable and are not so dependent on the selection of weights, which must remain more or less arbitrary with only the present limited data, as are the maximum likelihood lines. The maximum likelihood lines have certain theoretical advantages that certainly should be considered in situations where the data are sufficient to accurately determine the weights to be assigned to each observed value. In the present case they do not seem to be any "better" than the least-square lines.

In general, in estimating a relationship such as apparently exists between white and yolk weights of birds' eggs it is necessary to consider carefully the actual character of the data with which we are dealing. If this is done by an experienced person, then a freehand curve may give a good result. A well-established principle based on assumptions not sufficiently close to reality may give a poor result. If the assumptions are modified towards reality, however, the results may be vastly improved. Any of these methods will give improved estimates as additional data become available. Transformations should be applied to data only after full consideration has been given to all the implications of the transformation.

REFERENCES

Asmundson, V. S., G. A. Baker, and J. T. Emlen, 1943. Certain relations between the parts of birds' eggs. The Auk 60:34-44.
Baker, G. A., 1941. Linear regression when the standard deviations of arrays are not all equal. Jour. Amer. Stat. Assoc. 36:500-506.
Baker, G. A., 1945. Test of the significance of the differences of per cents of emergence of seedlings in multiple field trials. Jour. Amer. Stat. Assoc. 40:93-97.
Brady, Dorothy S., 1938. Variations in family living expenditures. Jour. Amer. Stat. Assoc. 33:385-389.
David, F. N., and J. Neyman, 1938. Extension of the Markoff theorem on least-squares. Stat. Research Mem. 2:105-116.
Deming, W. Edwards, 1943. Statistical adjustment of data. John Wiley & Sons, New York.
Eisenhart, C., 1939. The interpretation of certain regression methods and their use in biological and industrial research. Ann. of Math. Stat. 10:162-186.
Ezekiel, Mordecai, 1941. Methods of correlation analysis. Ed. 2. John Wiley & Sons, New York.
Fisher, R. A., 1938. Statistical methods for research workers. Ed. 7. Oliver & Boyd, Edinburgh and London.
Galton, Francis, 1889. Natural inheritance. The Macmillan Co.
Kavanagh, Arthur M., 1941. Note on the adjustment of observations. Ann. of Math. Stat. 12:111-114.
Neyman, Jerzy, and Elizabeth Scott, 1948. Consistent estimates based on partially consistent observations. Econometrica 16:1-31.
Pearson, Karl, 1896. Mathematical contributions to the theory of evolution III. Regression, heredity, and panmixia. Phil. Trans. Roy. Soc. London, Ser. A. 187:243-318.
Salmon, S. C., 1938. Generalized standard errors for evaluating bunt experiments with wheat. Jour. Amer. Soc. Agron. 30:647-663.
Wald, Abraham, 1939. The fitting of straight lines if both variables are subject to error. Cowles Commission for Research in Economics Report of the Fifth Annual Research Conference on Economics and Statistics held at Colorado Springs, p. 25-28.
Yule, G. Udny, 1932. An Introduction to the theory of statistics. Ed. 10. C. Griffin & Co. Ltd., London.