CHAPTER 9
Regression Analysis
Contents
9.1 Uses of Anthropological Regression Analysis
9.2 Linear Regression Analysis
9.2.1 Exercise in SPSS: Descriptive Statistics and Data Visualization
9.2.2 Exercise in SPSS
Regression Output
9.3 Anthropological Challenges: Reporting Results
9.3.1 Exercise
Supplementary Online Resources
Data Source for Exercises
List of Anthropological Research Articles That Use Linear Regression
References
This chapter builds on the previous chapter to distinguish linear regression from correlation in the context of anthropological research. Simple regression techniques for scale variables are covered, and students are prompted to report the results from the analysis in meaningful ways. Other commonly used techniques in anthropology, such as logistic regression (nonparametric) and multiple regression, are not included in this workbook.
9.1 USES OF ANTHROPOLOGICAL REGRESSION ANALYSIS
Some anthropological questions go beyond testing covariance/correlation (as in Chapter 8) and instead assess the predictive capacity of one variable on another. Linear regression measures how well cases fit a model describing the linear pattern of the data, calculating how an independent variable (X) predicts a dependent variable (Y), and how close to the model the data points actually fall.
Assumptions of regression are similar to those of correlation (Hanneman, Kposowa, & Riddle, 2013):
1. Data are normally distributed.
2. Data are patterned linearly.
Quantitative Anthropology ISBN 978-0-12-812775-9 https://doi.org/10.1016/B978-0-12-812775-9.00009-8
Copyright © 2019 Elsevier Inc. All rights reserved.
3. Data are homoscedastic (have the same general deviation from the best fit line across the range of data points).
4. Dependent variable (Y) values are not influenced by each other (i.e., are independent).
5. Residuals are normal, homoscedastic, and not associated with the independent variable.
A residual is the vertical distance between an observed data point and the regression line, that is, the observed Y value minus the Y value predicted by the model. The best-fitting regression line is the one that makes the residuals, taken together, as small as possible.
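The definition of a residual can be made concrete with a minimal sketch in Python, offered as a supplement to the SPSS exercises rather than as part of them. The stature values are invented for illustration and are not drawn from the Boas dataset, and the function names are hypothetical. The sketch fits a least-squares line to the points and then computes each residual as observed Y minus predicted Y.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares fit for one predictor; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residuals(x, y, intercept, slope):
    """Observed Y minus the Y predicted by the fitted line."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Hypothetical statures in cm (invented for illustration only)
maternal = [150.0, 155.0, 160.0, 165.0, 170.0]
daughter = [152.0, 156.0, 159.0, 166.0, 168.0]

b0, b1 = fit_line(maternal, daughter)
res = residuals(maternal, daughter, b0, b1)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
print("residuals:", [round(e, 2) for e in res])
```

Residuals from a least-squares line with an intercept always sum to zero, so it is their spread, not their total, that indicates how closely the points follow the line.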
Linear regression can be used to address some of the following questions:
• Can we predict an individual’s stature given a single long bone length?
• Does maternal income predict infant birthweight, and does this vary among different populations?
• Is distance to water predictive of site size for archeological sites on the Giza plateau in Egypt?
9.2 LINEAR REGRESSION ANALYSIS
Stature is a continuous trait influenced by both genetics and environment. In this chapter, you will investigate the link between an individual’s stature and their mother’s stature, assessing the strength of any association.
Blending Inheritance, Eugenics, and Statistics
The idea of blending inheritance, long discredited in biology, posits that individuals will inherit the average of their parents’ traits. Correlation and regression analyses on anthropometric traits such as stature were the hallmark of eugenics research at the turn of the 20th century (Louçã, 2009). Indeed, Karl Pearson (for whom Pearson’s r is named) was the Galton Chair of Eugenics at University College London. The close history between anthropology, eugenics, and statistics is a constant reminder that scholars must ground their work in solid theory and must always consider the ethical implications of the questions asked.
You want to know whether or not the stature of an individual’s mother is predictive of their own adult stature. Deviations from a predictive model may indicate major influence from environmental factors, paternal genetics,
and/or epigenetics. To test this, we will use Franz Boas’s anthropometric dataset from 1910 (also in Chapter 11), as compiled by C. Gravlee (gravlee.org).
The Research Question
Is there a predictive relationship between maternal stature and an offspring’s stature? If so, what is the strength of that relationship?
9.2.1 Exercise in SPSS: Descriptive Statistics and Data Visualization
The first step in inferential statistical testing is to calculate descriptive statistics and generate exploratory visualizations. This involves calculating (with SPSS) descriptive statistics, making tables and charts (histograms, scatterplots, box plots, etc.), and evaluating the central tendencies, dispersion, and distribution of your data. This is crucial to ensure that you do not violate the assumptions of regression analysis. More importantly, this process allows you to determine the patterns in the data and understand what you are testing for.
Step 1: Open Exercise 9.2.sav in SPSS. In regression analysis, the most important exploratory graph to make is the scatterplot, as it allows you to check for violations of the assumptions of regression (linearity, homoscedasticity). Review Exercise 8.2.1 for instructions on how to create a scatterplot using Legacy Dialogs in SPSS, placing maternal_stat in the X-axis box and stature in the Y-axis box.
Step 2: In addition, generate data on frequencies, central tendency, dispersion, and distribution for each variable in the dataset. Make sure you generate histograms and tables that help you to understand your data. You previously learned to do this in SPSS using the Explore functions.
Step 3: Summarize your results by answering the following questions:
1. Describe your results. You do not need to detail all the descriptive statistics for each variable, but discuss any you think might be important to our understanding of the results. In general, you are looking for patterns.
2. Are any assumptions of regression violated? If so, explain those below and discuss how you will address the violation (e.g., data transformation, outlier culling). Be sure to complete these before continuing with Exercise 9.2.2.
3. Based on the research question discussed earlier and a preliminary look at the data in SPSS, write a null hypothesis and an alternative hypothesis. Discuss the methods you will use to test the hypotheses (i.e., which
statistical tests you will use). Why is regression a better test for this question than correlation?
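If you want to cross-check the descriptive statistics from Step 2 outside SPSS, the following Python sketch shows one way to do it. It is an illustration under stated assumptions: the values are invented (not taken from Exercise 9.2.sav), the describe function is hypothetical, and the skewness formula is the adjusted sample form, which should be compared against whatever your SPSS output reports. A skewness near 0 and a mean close to the median are consistent with a roughly symmetric distribution.

```python
from statistics import mean, median, stdev

def describe(values):
    """Basic descriptive statistics for one scale variable (needs n >= 3)."""
    n = len(values)
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    # Adjusted sample skewness: 0 for a perfectly symmetric sample
    skew = (n / ((n - 1) * (n - 2))) * sum(((v - m) / s) ** 3 for v in values)
    return {"n": n, "mean": m, "median": median(values), "sd": s, "skewness": skew}

# Hypothetical maternal statures in cm (invented for illustration only)
maternal_stat = [150.0, 155.0, 160.0, 165.0, 170.0]
summary = describe(maternal_stat)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```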
9.2.2 Exercise in SPSS
After completing Exercise 9.2.1 and assessing the characteristics of your data, proceed with linear regression if your data do not violate assumptions of normality, linearity, and homoscedasticity.
Step 1: Click on ANALYZE-REGRESSION-LINEAR.
Step 2: Drag and drop the dependent variable (stature) into the Dependent box, and the independent variable (maternal_stat) into the central box Block 1 of 1.
Step 3: Click on the Statistics button (Fig. 9.1). Select Estimates and Confidence Intervals on the left part of the window, and Model Fit and Descriptives on the right. Select Continue.
Step 4: We want to assess the residuals as well as test the model. To do this, click on Plots. Move “ZRESID” (z-scores of the residuals) to the Y box and move “ZPRED” (z-scores of the predicted values of the dependent variable) to the X box. Select Histogram and Normal probability plot. Click Continue.
Step 5: Select OK.
Step 6: Review the results in the data output window.
Figure 9.1 Linear regression menu with statistics submenu choices.
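To see what the two quantities plotted in Step 4 represent, the sketch below computes them by hand in Python. It assumes that ZRESID is each residual divided by the standard error of the estimate and that ZPRED is the z-score of the predicted values; check these definitions against the SPSS documentation before relying on them. The data are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical statures in cm (invented for illustration only)
x = [150.0, 155.0, 160.0, 165.0, 170.0]   # maternal stature
y = [152.0, 156.0, 159.0, 166.0, 168.0]   # daughter's stature

# Least-squares slope and intercept
x_bar, y_bar = mean(x), mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

predicted = [intercept + slope * xi for xi in x]
resid = [yi - pi for yi, pi in zip(y, predicted)]

# Standard error of the estimate: residual spread with n - 2 degrees of freedom
n = len(x)
see = sqrt(sum(e ** 2 for e in resid) / (n - 2))

# Assumed definitions of the plotted quantities (see the note above)
zresid = [e / see for e in resid]
pred_mean, pred_sd = mean(predicted), stdev(predicted)
zpred = [(p - pred_mean) / pred_sd for p in predicted]

for zp, zr in zip(zpred, zresid):
    print(f"ZPRED = {zp:6.3f}   ZRESID = {zr:6.3f}")
```

Plotting zresid against zpred should give an even band of points around zero when the linearity and homoscedasticity assumptions hold.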
Regression Output
The regression output has multiple components that allow you to interpret the goodness of fit of the model, the predictive quality of the equation, and the relationship between the variables.
The first box, Descriptives, gives you the mean, standard deviation, and sample size for both variables. The Correlations box yields r (Pearson correlation), which measures the linear association between the variables, along with its statistical significance. Interpret this as you did in Chapter 8.
More information on the regression can be found in the Model box, which provides R, R², and the standard error of the estimate. Together these indicate how well X predicts Y; that is, they measure the goodness of fit of the regression equation (Hanneman et al., 2013). The ANOVA box gives the results of a test of significance of the model as a whole, comparing the variance explained by the regression with the residual (unexplained) variance.
The Coefficients box tells us more about the model itself. The first B (for Constant) indicates the Y-intercept (where the regression line meets the Y axis). The second B (for Maternal_Stat) indicates the slope of the regression line (which tells us how much Y changes for each one-unit change in X). Each coefficient has a t-test that assesses whether the value is likely to reflect a population-level association.
The final boxes give information on the residuals, which are important to evaluate in relation to the assumptions outlined above.
• The normality of residuals can be assessed in the first two plots: the histogram (which should resemble a normal distribution) and the normal P-P plot (in which the dots should be close to the diagonal line).
• Linearity and homoscedasticity can be examined via the final scatterplot of residuals against predicted values, which should show a consistent spread around a horizontal (not diagonal) line (see https://datascience.ibm.com/docs/content/analyze-data/spss-viz-linear.html for more information on how SPSS calculates this).
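The quantities in these output boxes can be reproduced by hand, which may help when interpreting them. The following Python sketch is a minimal illustration using invented data and a hypothetical function name; it computes the intercept and slope (the two B values), r, R², the standard error of the estimate, the ANOVA F statistic, and the t statistic for the slope. It does not compute P-values, which require the t and F distributions and are left to SPSS.

```python
from math import sqrt
from statistics import mean

def simple_regression(x, y):
    """Hand computation of the main quantities in a one-predictor regression."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)          # total sum of squares
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx                                 # B for the predictor
    intercept = y_bar - slope * x_bar                 # B for the constant
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_reg = syy - ss_res                             # explained sum of squares

    see = sqrt(ss_res / (n - 2))                      # standard error of estimate
    return {
        "intercept": intercept,
        "slope": slope,
        "r": sxy / sqrt(sxx * syy),
        "r_squared": 1 - ss_res / syy,
        "see": see,
        "f": ss_reg / (ss_res / (n - 2)),             # ANOVA F with 1 and n - 2 df
        "t_slope": slope / (see / sqrt(sxx)),         # t-test of slope against 0
    }

# Hypothetical statures in cm (invented for illustration only)
maternal = [150.0, 155.0, 160.0, 165.0, 170.0]
daughter = [152.0, 156.0, 159.0, 166.0, 168.0]

out = simple_regression(maternal, daughter)
for name, value in out.items():
    print(f"{name}: {value:.4f}")
```

With a single predictor, R² equals the square of Pearson's r and the ANOVA F equals the square of the slope's t statistic, so the Correlations, Model, ANOVA, and Coefficients boxes are telling a consistent story from different angles.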
Read through your results and answer the following questions:
1. What are your r and R² statistics? Is the model statistically significant?
2. What does your Y-intercept mean relative to the relationship between maternal and child stature? What are the slope results? Are these measures statistically significant?
3. Are any assumptions related to the residuals violated?
4. Overall, interpret what your results mean in the context of the research question. Are we looking at a weak or strong predictive relationship and what might the variation present in the model indicate?
9.3 ANTHROPOLOGICAL CHALLENGES: REPORTING RESULTS
Regression analyses are used across anthropological subfields, in some cases to create predictive models for further data collection and analysis and, in other cases, to evaluate the relationship between variables to better understand the deviation from a predictive model. As with other statistical tests, there is no one way to describe regression outputs, but you should provide the unstandardized B, the t-test results, and the P-value (note that the β statistic, as a standardized coefficient that falls between −1 and 1 in simple regression, makes it possible to compare between two regression models). Some examples of how the results of regression analyses are presented in the anthropological literature are given below:
The results show that, for the two largest immigrant groups in this analysis, cephalic index changes as a linear function of the time elapsed between arrival and birth, controlling for maternal stature (Hebrews: β = .141, p = .000; Bohemians: β = .099, p = .004). Although this association is highly statistically significant, the magnitude of the relationship is notably small (Gravlee, Bernard, & Leonard, 2003, p. 133; describing the results of a reanalysis of Franz Boas’s immigrant anthropometric study from 1910, using Boas’s language for the immigrant groups; see Chapter 11 for more).
Between months two and six, rate of weight gain and infant δ15N values are inversely related (linear regression R²: two months: R² = 0.676, P = 0.0242; five months: R² = 0.432, P = 0.285; six months: R² = 0.420, P = 0.082; Fig. 5 and Table 2) (Reitsema and Muir, 2015, p. 353, discussing dietary isotope differences with age in rhesus macaques).
Removing this site from the computations and regressing population on site size, an R² value of 0.24 is obtained (Pearson’s r = .49), indicating a distressingly weak relationship between these two variables (Schreiber and Kintigh, 1996, p. 577, on the association between population size and site size in the Andes).
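When reading reported results such as these, a quick consistency check is often possible. In a regression with a single predictor, R² is the square of Pearson's r, so the two values quoted from Schreiber and Kintigh (1996) should agree after rounding. The short Python sketch below performs that check; the variable names are hypothetical and the tolerance simply reflects rounding to two decimal places.

```python
# Values as quoted from Schreiber and Kintigh (1996, p. 577)
reported_r = 0.49
reported_r_squared = 0.24

# With one predictor, R-squared is the square of Pearson's r
implied_r_squared = reported_r ** 2

# Allow for rounding of the published figures to two decimal places
consistent = abs(implied_r_squared - reported_r_squared) < 0.005

print(f"implied R-squared = {implied_r_squared:.4f}; consistent: {consistent}")
```

An R² of 0.24 means that site size accounts for roughly a quarter of the variance in population in that sample, which is why the authors describe the relationship as weak.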
9.3.1 Exercise
Return to the results from Exercise 9.2 and prepare a succinct report listing your findings and interpreting your results. Choose the tabular and
graphical output that will best express the relationships between variables, and make sure you clearly report the relevant descriptive statistics. You should follow the standard format of scientific papers: brief background which introduces the research question, hypotheses, and the sample that will be tested; a methods section discussing the statistical tests you will do, along with their assumptions; a results section reporting the results of your tests, including relevant tabular and graphical output; and a discussion of what those results mean, regardless of statistical significance. Finally, consider what new data or research questions would allow you to better understand your original research question.
SUPPLEMENTARY ONLINE RESOURCES
Data Source for Exercises
The exercises use Boas’s dataset of anthropometric measurements from multiple immigrant groups in New York City in 1910 (Gravlee et al., 2003; Boas, 1910). The data are available in full on Clarence Gravlee’s personal website (http://www.gravlee.org/research/boas/). The subsample used in this exercise pools the immigrant groups, includes only daughters over age 20/25, and excludes outliers.
List of Anthropological Research Articles That Use Linear Regression
• Gravlee, C. C., Bernard, H. R., & Leonard, W. R. (2003). Heredity, environment, and cranial form: A reanalysis of Boas’s immigrant data. American Anthropologist, 105, 125–138.
• Reitsema, L. J., & Muir, A. B. (2015). Growth velocity and weaning δ15N “dips” during ontogeny in Macaca mulatta. American Journal of Physical Anthropology, 157, 347–357.
• Schreiber, K. J., & Kintigh, K. W. (1996). A test of the relationship between site size and population. American Antiquity, 61, 573–579.
• Thayer, Z. M., Blair, I. V., Buchwald, D. S., & Manson, S. M. (2017). Racial discrimination associated with higher diastolic blood pressure in a sample of American Indian adults. American Journal of Physical Anthropology, 163, 122–128.
• Thompson, R. C., Allam, A. H., Lombardi, G. P., Wann, L. S., Sutherland, M. L., Sutherland, J. D., et al. (2013). Atherosclerosis across 4000 years of human history: The Horus study of four ancient populations. The Lancet, 381, 1211–1222.
REFERENCES
Gravlee, C. C., Bernard, H. R., & Leonard, W. R. (2003). Heredity, environment, and cranial form: A reanalysis of Boas’s immigrant data. American Anthropologist, 105, 125–138.
Hanneman, R. A., Kposowa, A. J., & Riddle, M. D. (2013). Basic statistics for social research. San Francisco: Jossey-Bass.
Louçã, F. (2009). Emancipation through interaction – how eugenics and statistics converged and diverged. Journal of the History of Biology, 42, 649–684.