CHAPTER 11
Comparing Three or More Groups: Analysis of Variance

Contents
11.1 Anthropological Questions Comparing More Than Two Groups
11.2 Comparing Three or More Groups With Parametric Techniques
    11.2.1 Descriptive Statistics and Data Visualization for Comparing Multiple Groups: Exercise in SPSS
    11.2.2 Analysis of Variance: Exercise in SPSS
        Interpreting the SPSS ANOVA Output
11.3 Nonparametric Alternatives to ANOVA
    11.3.1 Exercise in SPSS
11.4 Post Hoc Tests of Comparison and Strength; Effect Size: Eta Squared
    11.4.1 Exercise in SPSS
        Identifying the Differences: Post Hoc Tests
        Effect Size
11.5 Anthropological Challenges: Reporting Results
    11.5.1 Exercise
Supplementary Online Resources
    Boas Immigrant Dataset
    Johnstown Flood Paleodemographic Data File for Exercises
    List of Anthropological Research Articles That Use ANOVA
References
This chapter covers statistical methods involving three or more groups: analysis of variance (ANOVA). Analysis of parametric and nonparametric datasets allows students to practice examining variation within and between groups. This chapter also discusses applying post hoc tests to ANOVA, distinguishing between these tests, and reporting outcomes with reference to anthropological questions.
Quantitative Anthropology. ISBN 978-0-12-812775-9. https://doi.org/10.1016/B978-0-12-812775-9.00011-6
Copyright © 2019 Elsevier Inc. All rights reserved.

11.1 ANTHROPOLOGICAL QUESTIONS COMPARING MORE THAN TWO GROUPS

Examination of differences in means between three or more groups is common in anthropological research. ANOVA tests are used to compare a scale variable across three or more independent categorical groups (such as gender, year, or household). ANOVA is appropriate when the data meet parametric assumptions; for nonnormal datasets, a nonparametric counterpart must be used instead (Table 11.1).

Table 11.1 Parametric techniques and their nonparametric counterparts: comparing groups

| Parametric technique | Nonparametric technique |
| --- | --- |
| Independent-samples t-test | Mann-Whitney U test |
| Paired-samples t-test | Wilcoxon Signed Rank test |
| One-way between-groups ANOVA | Kruskal-Wallis test |
| One-way repeated-measures ANOVA | Friedman test |

Table adapted from Pallant (2013, p. 212).

ANOVA is a powerful set of statistical tests for comparing within-group and between-group variation. An important example of its use is the work of evolutionary biologist Richard Lewontin, who in 1972 published a seminal paper on human race categories. In it, he argued that there is greater variation within racial categories than between them, and that social categories of race thus do not correspond to biological variation.
Anthropological questions that use ANOVA include some of the following:
• Are there differences in mean femur length between skeletal samples from European populations who lived before, during, and after the Black Death?
• Are there differences in mean income among four US communities with large refugee populations?
• Is there a difference in the mean size of archaeological sites in three periods of early imperial China?
11.2 COMPARING THREE OR MORE GROUPS WITH PARAMETRIC TECHNIQUES

The Research Question

In the early 20th century, American anthropologist Franz Boas conducted a massive study to understand human variation and the metrics that many physical anthropologists of the time thought defined “biological” races. This study, much cited and recently replicated (Gravlee, Bernard, & Leonard, 2003a, 2003b; Sparks & Jantz, 2002), was a foundational work in biological anthropology.
We will work with Boas’s dataset of osteometric measurements from multiple immigrant groups in New York City in 1910 (Boas, 1910; Gravlee et al., 2003a). The data are reproduced in full on Clarence Gravlee’s personal website (http://www.gravlee.org/research/boas/). As with all one-way ANOVA tests, there is a categorical independent variable and a continuous dependent variable.

Taking our cue from Boas’s original study, we will investigate the link between environment and the cephalic index. What we are studying is the between-group variance by birth location: are there differences between (1) individuals born outside the United States, (2) individuals born in the United States to recently arrived mothers, and (3) individuals born in the United States to mothers who have been in the United States for over a decade? We could collapse the locations into two groups, foreign-born and US-born individuals, but we would then miss any variation that exists within US-born individuals. Specifically, we will ask: does the cephalic index differ significantly between groups defined by the location and timing of birth? How does this metric compare between different immigrant groups?

The cephalic index is the ratio of head breadth to head length, and it has a long history of misuse by scientists attempting to show the biological stability of race categories. Boas’s study of cranial morphometrics debunked this racist methodology, although its influence persisted well into the 20th century (Gravlee et al., 2003a).
The independent variable, birth_loc_3, identifies the location where the individual was born, with the US-born individuals separated into two groups based on when their mother immigrated to the United States. The subsample here includes individuals from only one immigrant group (Sicilians, as opposed to Boas’s seven immigrant groups). Only females between 4 and 19 years old are included. The dependent variable is the cephalic index (Cephalic_index). It is calculated as (cranial breadth / cranial length) × 100. To control for the effects of age, we have created a new variable for analysis called Zcephalic, which contains z-scores of the cephalic index computed within age groups.
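The two calculations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `cephalic_index` and `zscores_within_age` are hypothetical, the measurements are invented rather than taken from Boas’s data, and we assume the z-scores use the sample standard deviation of each age group, which may differ in detail from how Zcephalic was built.

```python
from collections import defaultdict
from statistics import mean, stdev


def cephalic_index(head_breadth_mm, head_length_mm):
    """Cephalic index: head breadth expressed as a percentage of head length."""
    return head_breadth_mm / head_length_mm * 100


def zscores_within_age(records):
    """Standardize each cephalic index against the cases of the same age.

    records: list of (age_in_years, cephalic_index) tuples.
    Assumes the sample standard deviation (n - 1), so every age group
    needs at least two cases with some variation.
    """
    by_age = defaultdict(list)
    for age, ci in records:
        by_age[age].append(ci)
    stats = {age: (mean(vals), stdev(vals)) for age, vals in by_age.items()}
    return [(ci - stats[age][0]) / stats[age][1] for age, ci in records]


# Invented values for illustration only (not Boas's measurements).
records = [(5, 80.0), (5, 82.0), (5, 84.0), (12, 78.0), (12, 79.0), (12, 80.0)]
z = zscores_within_age(records)
print([round(v, 2) for v in z])
```

Because each case is compared only with cases of the same age, a z-score of 0 means "typical for that age", which is what lets children of different ages be pooled in one ANOVA.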
11.2.1 Descriptive Statistics and Data Visualization for Comparing Multiple Groups: Exercise in SPSS

In this exercise, we will work with both z-scores and the unstandardized data. Your tests will be on z-scores for categories broken down by age (variable Zcephalic), but you should calculate the differences in unstandardized means first (variable Cephalic_index).

Step 1: Open the file “Exercise 11.2” in SPSS and check that the variables you are using are read properly in Variable View. We will be working with the variable Zcephalic for the inferential statistics, but for descriptive statistics, use the variable Cephalic_index first. To turn off the Split File function, go to DATA - SPLIT FILE and select “Analyze all cases, do not create groups.”
Step 2: Select ANALYZE - DESCRIPTIVE STATISTICS - EXPLORE. Move Cephalic_index to the Dependent List box and age to the Factor List box. Click on Plots and select Histogram and Normality plots with tests, and deselect Stem-and-leaf. Click on Continue, then Ok.

Step 3: Calculate descriptive statistics and visualize the data with tables and figures. Run descriptive tests on the variable Zcephalic, assessing central tendency and distribution.

Answer the following questions:
1. Are any of the assumptions of ANOVA violated? If so, what are they and how will you address them?
2. Based on the research question above and a preliminary look at the data in SPSS, write a null hypothesis and alternative hypothesis, bearing in mind that we will be comparing means from three samples of interest: one non-US-born and two US-born samples.
3. Decide the alpha level for rejecting the null hypothesis. Determine what your required P-value will be. Why are you choosing this value?

ANOVA Assumptions

Be sure that you check for any violations of the assumptions of one-way ANOVA: the distribution of the response (dependent) variable is normal, the standard deviation of the population distribution is the same for each group, and the samples from the populations are independent random samples. Some ways of checking for a normal distribution are histograms, stem-and-leaf plots, and the Shapiro-Wilk test (if the P-value is larger than the chosen alpha level, such as 0.05, the test gives no evidence of a departure from normality). The equal-variance assumption can be tested with “Levene’s test for homogeneity of variances,” a part of the ANOVA output that we generate.
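To make the homogeneity-of-variance check less of a black box, here is a minimal sketch of the Levene statistic. It rests on one idea: take each value's absolute deviation from its own group mean, then run an ordinary one-way F test on those deviations. The function names and the data are hypothetical, and we assume the mean-centered version of the test; software may also report median-centered variants, and the P-value (which needs the F distribution) is left to the statistics package.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    pooled = [x for g in groups for x in g]
    grand_mean = mean(pooled)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(pooled) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within


def levene_w(groups):
    """Mean-centered Levene statistic: an F test on absolute deviations."""
    deviations = [[abs(x - mean(g)) for x in g] for g in groups]
    return one_way_f(deviations)


# Invented values: three groups whose spread grows from left to right.
groups = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
w, df1, df2 = levene_w(groups)
print(round(w, 3), df1, df2)
```

A large statistic means the typical distance from the group mean differs between groups, which is exactly what unequal variances look like.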
11.2.2 Analysis of Variance: Exercise in SPSS

Step 1: In SPSS, choose ANALYZE - COMPARE MEANS - ONE-WAY ANOVA.

Step 2: Move the variable Zcephalic to the Dependent List. Move the variable birth_loc_3 to the Factor box.

Step 3: Under the Post Hoc tab, choose the post hoc tests most appropriate to your research question and your desired alpha level (Fig. 11.1). You can perform them all initially, but you will not include all the results in your report (Exercise 11.5.1). The recommended tests differ depending on the characteristics of your sample and whether they violate any of the assumptions of ANOVA (Table 11.2); you should include at least one test from “Equal variances assumed” and one from “Equal variances not assumed.” Determine your desired alpha level for these tests, or keep the default 0.05.

Step 4: From the Options menu, choose Descriptive statistics, Homogeneity of variance test, Brown-Forsythe, Welch, and Means plot. Click Continue and Ok.

Step 5: Review the results in the data output window.

Figure 11.1 One-way ANOVA window with post hoc options.

Table 11.2 Post hoc test choices
Post hoc test: LSD; SNK, REGWF, REGWQ, and Duncan; Bonferroni/Sidak; Tukey’s; Scheffe; Hochberg’s GT2; Gabriel’s; Waller-Duncan, Dunnett
Equal variances: X X X X X X X X
Equal sample sizes: X X X
Tests for unequal variances and unequal sample sizes: Games-Howell test; Tamhane’s T2; Dunnett’s T3, Dunnett’s C
Table derived from IBM’s Rational Insight Guide at https://www.ibm.com/support/knowledgecenter/en/SSRL5J_1.1.0/com.ibm.swg.ba.cognos.ug_cr_rptstd.10.1.1.doc/c_id_obj_anova.html. Ranked by conservativeness (least conservative at the top), with the chance of a Type I error decreasing, and the chance of a Type II error increasing, from the top of the table.

Interpreting the SPSS ANOVA Output

Assumption of Homogeneity of Variance

The assumption of homogeneity of variance for each of the groups being compared is tested in SPSS with Levene’s test (the Test of Homogeneity of Variances). To interpret the test results, review the Levene statistic and the P-value. A large statistic and small P-value (less than 0.05, for example) indicate that the variances are not equal and the homogeneity of variances assumption is violated. Based on the result of Levene’s test, decide how to proceed with interpreting the test statistics:
• If equality of variances is not violated (your Levene’s significance level is >0.05), examine the F statistic and P-value (“Sig.”) from the ANOVA output box.
• If equality of variances is violated, use the results from the “Robust Tests of Equality of Means” box, which gives you two tests, “Welch” and “Brown-Forsythe.” These two tests do not assume equality of variances, and you should use them instead of the normal ANOVA output.

The ANOVA Output
If your significance value for the F statistic is greater than the chosen alpha level, you fail to reject the null hypothesis; interpret that result and finish the analysis. If your significance value for the F statistic is less than the chosen alpha level, then you can reject the null hypothesis. You should also examine your post hoc test results and calculate the effect size, which we will do in Exercise 11.4.1. (For now, you are only determining whether any differences exist between the groups; you are not yet assessing where those differences lie, if there are any.)

Interpret the results:
1. Report your F-statistic, degrees of freedom, P-value, and any other relevant data.
2. Interpret what the results mean: go beyond a statement of statistical significance and return to the research question.
3. Bonus: Read Gravlee et al. (2003a, 2003b) and Sparks and Jantz (2002). How do our methods differ from theirs? How might our choices affect the results (looking at the ANOVA results for Sicilians alone rather than as seven immigrant groups; looking at individuals between 4 and 19 years of age; using age- and region-standardized z-scores)?
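For readers who want to see where the numbers in the ANOVA output box come from, the sketch below rebuilds the table by hand: the between-groups and within-groups sums of squares, their degrees of freedom, and the F statistic as the ratio of the two mean squares. The function name `one_way_anova` and the z-scores are hypothetical. The P-value is deliberately omitted because it requires the F distribution, which Python's standard library does not provide; SPSS reports it as “Sig.”

```python
from statistics import mean


def one_way_anova(groups):
    """Return the pieces of a one-way ANOVA table for a list of numeric groups."""
    pooled = [x for g in groups for x in g]
    grand_mean = mean(pooled)
    # Between groups: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within groups: how far each value sits from its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(pooled) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ss_total": ss_between + ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": f_stat,
    }


# Invented z-scores for three birth-location groups (illustration only).
groups = [[-1.0, 0.0, 1.0], [0.0, 1.0, 2.0], [1.0, 2.0, 3.0]]
table = one_way_anova(groups)
print(f"F({table['df_between']}, {table['df_within']}) = {table['F']:.2f}")
# prints "F(2, 6) = 3.00"
```

The printed string follows the reporting pattern F(df between, df within) = value, to which the P-value from the software output would be appended.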
11.3 NONPARAMETRIC ALTERNATIVES TO ANOVA

A central assumption of ANOVA is normality. As there are still interesting questions to ask of nonnormal data, sometimes nonparametric tests are more appropriate for a particular research question. For this exercise, you will apply a nonparametric alternative to ANOVA to answer a research question on burial behaviors in 19th-century Pennsylvania following a natural disaster.

The Research Question

Returning to the data set from the Johnstown Flood, you want to examine the distribution of Flood victims across various burial treatments to understand the construction of mortuary contexts following disasters. To do this, you will test whether there is a difference in the age of Flood victims across three burial conditions:
1 = victims buried in Johnstown
2 = victims buried outside of Johnstown
3 = victims whose bodies were not recovered

Your goal is to assess the following research question: Is there a difference in average age of Johnstown Flood victims with different burial treatments?
11.3.1 Exercise in SPSS

Part 1:

Step 1: Open the dataset 11.3 in SPSS and check that the variables you are using are read properly in Variable View. Perform descriptive statistical tests using the EXPLORE menu on each of the three burial treatments.

Answer the following questions:
1. List the measures of central tendency and dispersion for each group. Are any of the assumptions of parametric ANOVA tests violated?
2. Based on the research question above and a preliminary look at the data in SPSS, write a null hypothesis and as many alternative hypotheses as needed, bearing in mind that we will be comparing means from three samples of interest: 1, 2, and 3.
3. Decide the alpha level for rejecting the null hypothesis. Why are you choosing this value?

Part 2:

Step 1: In SPSS, click on the following: ANALYZE - NONPARAMETRIC TESTS - LEGACY DIALOGS - K INDEPENDENT SAMPLES.

Step 2: In the Test Variable List field, put Age. Move BurialLocation to the Grouping Variable field.

Step 3: Click Define Range. Type “1” into Minimum and “3” into Maximum (Fig. 11.2). This will test all three burial conditions. Click Continue.

Figure 11.2 Independent samples dialogues, with grouping variables range selected.

Step 4: Under Test Type select the box for Kruskal-Wallis. The Kruskal-Wallis test, similar to the Mann-Whitney U test (see Chapter 7), examines the variable as a whole: it ranks each value in comparison to all other values, regardless of group, and then sums these ranks within each group (Kruskal and Wallis, 1952). The resulting H statistic has a distribution similar to the chi-square statistic, with degrees of freedom equal to the number of groups being compared minus one (Kruskal and Wallis, 1952). Note that the Kruskal-Wallis test may also be used with ordinal data.

Step 5: Click Ok and go to the Data Output window. Review your results and answer the following questions:
4. Report your test statistic, df, and P-value. In addition, report the groups that yield the higher and lower Mean Ranks provided in the output.
5. Interpret what the results mean: go beyond a statement of statistical significance and return to the research question.
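The ranking procedure described in Step 4 can be sketched as follows. This is a minimal illustration, not the SPSS implementation: the function names and ages are invented, tied values receive the average of their ranks with the usual tie correction applied to H, and the P-value uses a closed form of the chi-square distribution that holds only for 2 degrees of freedom (three groups, as in this exercise). With very small groups the chi-square approximation itself is rough.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return (H, df, p, mean_ranks); p is None unless df == 2."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start, mean_ranks = 0.0, 0, []
    for g in groups:
        group_ranks = ranks[start:start + len(g)]
        start += len(g)
        mean_ranks.append(sum(group_ranks) / len(g))
        total += sum(group_ranks) ** 2 / len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: shrinks toward 0 as more values share a rank.
    counts = {}
    for x in pooled:
        counts[x] = counts.get(x, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    h /= correction
    df = len(groups) - 1
    # For 2 degrees of freedom the chi-square upper tail is exactly exp(-H/2).
    p = math.exp(-h / 2) if df == 2 else None
    return h, df, p, mean_ranks


# Invented ages for the three burial conditions (illustration only).
ages = [[4, 10, 16], [22, 30, 35], [41, 50, 63]]
h, df, p, mean_ranks = kruskal_wallis(ages)
print(round(h, 2), df, round(p, 4), mean_ranks)
```

The mean ranks returned here correspond to the Mean Rank column that question 4 asks you to report: the group with the highest mean rank holds the oldest victims overall.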
11.4 POST HOC TESTS OF COMPARISON AND STRENGTH; EFFECT SIZE: ETA SQUARED

Statistical significance is not the only end product of quantitative analyses. Significance testing tells us the likelihood of incorrectly rejecting the null hypothesis but does not indicate the degree of difference between groups (Nakagawa and Cuthill, 2007). The work done in Exercise 11.2.2 to understand the association between birth location and cephalic measurement is a good start, but we also need to calculate effect size to understand the magnitude of difference between the means. Effect size allows us to state the “degree to which the null hypothesis is false” (Cohen, 1988, pp. 9–10). Do not calculate effect size unless you have statistically significant results.

In Exercise 11.2.2, we asked you to select several output options that provided post hoc tests of the associations being tested. In this exercise, you will examine the results of those tests, do an additional calculation of effect size, and interpret the result. Only proceed with post hoc tests of comparison and strength if you have rejected the null hypothesis for the overall ANOVA.
11.4.1 Exercise in SPSS

Identifying the Differences: Post Hoc Tests

We had you complete multiple post hoc tests when calculating the ANOVA. Post hoc tests provide multiple comparisons between groups to assess where any significant differences might lie. The output for these tests (Post Hoc Tests: Multiple Comparisons) shows you the group-by-group comparisons of means, including the mean difference, the standard error, and the P-value (“Sig.”) for each comparison. The Homogeneous Subsets table also helps you see where the differences lie, as the more similar categories are grouped together. Your choice of test matters: post hoc tests should be chosen based on sample attributes to most accurately understand between-group differences. Table 11.2 shows the various post hoc tests available in ANOVA, their requirements, and their conservatism (from top to bottom).
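One reason the choice of post hoc test matters is that every extra pairwise comparison is another chance of a false positive. As an illustration, the sketch below shows the arithmetic behind the Bonferroni correction, one of the tests in Table 11.2: it counts the pairwise comparisons among the groups and either divides alpha by that count or, equivalently, multiplies each raw P-value by it. The group labels, function names, and P-values are hypothetical, and the sketch makes no claim about how any particular software displays its adjusted values.

```python
from itertools import combinations


def bonferroni_alpha(n_groups, alpha=0.05):
    """Return (number of pairwise comparisons, per-comparison alpha)."""
    n_comparisons = n_groups * (n_groups - 1) // 2
    return n_comparisons, alpha / n_comparisons


def bonferroni_adjust(p_values):
    """Multiply each raw P-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical labels for the three birth-location groups.
labels = ["foreign-born", "US-born, mother recently arrived", "US-born, mother 10+ years"]
pairs = list(combinations(labels, 2))
n_comparisons, per_test_alpha = bonferroni_alpha(len(labels))

# Invented raw P-values, one per pair, for illustration only.
adjusted = bonferroni_adjust([0.01, 0.20, 0.50])
for pair, p in zip(pairs, adjusted):
    print(pair, round(p, 3))
```

With three groups there are three unique pairs, so a comparison must reach roughly P < 0.0167 unadjusted to count as significant at an overall alpha of 0.05.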
1. Review the post hoc test output tables from Exercise 11.2.2. Statistically significant comparisons will have a Sig. value less than the alpha level you determined for your overall ANOVA (the F test). You should also notice the redundancy in the table, as each pair of groups appears twice. These results tell you where the differences are found. Where are the significant differences between groups in your data, if any?

Effect Size

To understand the magnitude of any significant differences, you should calculate effect size. This is done by hand, using the outputs of the ANOVA test: the sum of squares between groups and the sum of squares total.

\[ \eta^2 = \frac{\text{Sum of squares between groups}}{\text{Sum of squares total}} \]

Eta squared tells us how much of the variation in the dependent variable can be accounted for by the groups in the independent variable (Ellis, 2010). As with t-tests, 0.01 is a small effect, 0.06 is a medium effect, and 0.14 is a large effect (Cohen, 1988, as cited in Ellis, 2010).

2. Calculate the effect size for the comparisons made in Exercise 11.2.2, should it be relevant (i.e., if you found a significant difference between the birth locations). Return to the output from the exercise to find the between-groups sum of squares and total sum of squares. Report and interpret the results.
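The hand calculation above is a single division, but a small sketch makes the interpretation step explicit. The function names are hypothetical, the sums of squares are invented stand-ins for the values you would copy from the ANOVA output box, and the label "negligible" for values below 0.01 is our own addition rather than one of Cohen's benchmarks.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variation accounted for by group membership."""
    if ss_total <= 0 or not 0 <= ss_between <= ss_total:
        raise ValueError("expected 0 <= ss_between <= ss_total and ss_total > 0")
    return ss_between / ss_total


def effect_label(eta_sq):
    """Label an eta squared value using the 0.01 / 0.06 / 0.14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # our own label; not part of the cited benchmarks


# Invented sums of squares, as if read from an ANOVA output table.
eta_sq = eta_squared(ss_between=5.0, ss_total=50.0)
print(round(eta_sq, 2), effect_label(eta_sq))
```

In this invented case, group membership accounts for 10% of the variation in the dependent variable, which falls in the medium range.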
11.5 ANTHROPOLOGICAL CHALLENGES: REPORTING RESULTS

There is no completely standardized way that ANOVA and its nonparametric counterparts are reported in the anthropological literature. You should provide the test statistic, degrees of freedom, and P-value, as well as any needed post hoc test results and effect size statistics. Here are some ways that the results of ANOVA have been discussed in the results sections of peer-reviewed articles:

ANOVA

“ANOVA indicates that the difference in stress duration between all groups at all perikymata periodicities is significantly different (F = 6.54; P = .0001).”
Temple, McGroarty, Guatelli-Steinberg, Nakatsukasa, and Matsumura (2013, p. 235); comparing enamel defect timing differences between Japanese and Alaskan archaeological populations.

“In addition, differences in the total current psychosocial burden score emerged in relation to race/ethnicity (F (3, 192) = 3.65, P = .01). Post hoc analyses indicated that black men in our sample had higher total burden scores than Latino men (1.45 vs. 0.96, P < .05) and white men (1.45 vs. 1.00, P < .05).”
Halkitis et al. (2012, p. 373); discussing HIV risk and psychosocial burdens in US communities of gay, bisexual, and other men who have sex with men.

Kruskal-Wallis

“No statistically significant differences were found between the medians of these three sites for either δ13Cco (Kruskal-Wallis; H = 2.08, df = 2, p = 0.35) or δ15N (H = 1.06, df = 2, p = 0.59), and were thus compared to human skeletal values as a single group.”
Gregoricka et al. (2017, p. 153), describing isotopic ratios from human remains at three monastic sites in Jerusalem.

“One might expect lasting structural inequalities in this community, but the data suggest otherwise. A Kruskal-Wallis test finds no statistical differences in median individual incomes among the descendants of Andrevola kings, Masevohe commoners, and Makoa slaves (χ2 = .521, df = 2, p = .771).”
Tucker et al. (2015, p. 334), discussing economic inequality in communities in Madagascar.
11.5.1 Exercise

Choose from among the results from Exercises 11.2 and 11.4 and prepare a succinct report listing your findings and interpreting your results. Choose the tabular and graphical output that will best express the relationships between variables, and make sure you clearly report the relevant descriptive statistics. You should follow the standard format of scientific papers: a brief background that introduces the research question, hypotheses, and the sample that will be tested; a methods section discussing the statistical tests you will do, along with their assumptions; a results section reporting the results of your tests, including relevant tabular and graphical output; and a discussion of what those results mean, regardless of statistical significance. Finally, consider what new data or research questions would allow you to better understand your original research question.
SUPPLEMENTARY ONLINE RESOURCES

Boas Immigrant Dataset

Boas’s data set of osteometric measurements from multiple immigrant groups in New York City in 1910 (Boas, 1910; Gravlee et al., 2003a). The data are reproduced in full on C. Gravlee’s website (http://www.gravlee.org/research/boas/). The subsample here only includes individuals from four immigrant groups (as opposed to Boas’s seven) and individuals under age 25.
Johnstown Flood Paleodemographic Data File for Exercises

Data from McGough, M. R. (2002). The 1889 Flood in Johnstown, Pennsylvania. Gettysburg: Thomas. Data compiled by L. Williams.
List of Anthropological Research Articles That Use ANOVA

• Cheverko, C. M., & Hubbe, M. (2017). Comparisons of statistical techniques to assess age-related skeletal markers in bioarchaeology. American Journal of Physical Anthropology, 163, 407–416.
• De Vleeschouwer, F., Piotrowska, N., Sikorski, J., Pawlyta, J., Cheburkin, A., Le Roux, G., et al. (2009). Multiproxy evidence of ‘Little Ice Age’ palaeoenvironmental changes in a peat bog from northern Poland. The Holocene, 19, 625–637.
• Gravlee, C. C., Bernard, H. R., & Leonard, W. R. (2003b). Heredity, environment, and cranial form: A reanalysis of Boas’s immigrant data. American Anthropologist, 105, 125–138.
• Luque, J. S., Castaneda, H., Tyson, D. M., Vargas, N., Proctor, S., & Meade, C. D. (2010). HPV Awareness among Latina Immigrants and Anglo American Women in the Southern U.S.: Cultural Models of Cervical Cancer Risk Factors and Beliefs. NAPA Bull, 34, 84–104.
• Williams, J. P., & Andrefsky, W. (2011). Debitage variability among multiple flint knappers. Journal of Archaeological Science, 38, 865–872.
REFERENCES

Boas, F. (1910). Changes in bodily form of descendants of immigrants. United States Immigration Commission, Senate Document 208, 61st Congress. Washington, DC: Government Printing Office.
Cohen, J. W. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum.
Ellis, P. D. (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge: Cambridge University Press.
Gravlee, C. C., Bernard, H. R., & Leonard, W. R. (2003a). Boas’s Changes in Bodily Form: The immigrant study, cranial plasticity, and Boas’s physical anthropology. American Anthropologist, 105, 326–332.
Gravlee, C. C., Bernard, H. R., & Leonard, W. R. (2003b). Heredity, environment, and cranial form: A reanalysis of Boas’s immigrant data. American Anthropologist, 105, 125–138.
Gregoricka, L. A., Sheridan, S. G., & Schirtzinger, M. (2017). Reconstructing life histories using multi-tissue isotope analysis of commingled remains from St Stephen’s monastery in Jerusalem: Limitations and potential. Archaeometry, 59, 148–163.
Halkitis, P. N., Kupprat, S. A., Hampton, M. B., Perez-Figueroa, R., Kingdon, M., Eddy, J. A., et al. (2012). Evidence for a syndemic in aging HIV-positive gay, bisexual, and other MSM: Implications for a holistic approach to prevention and healthcare. Annals of Anthropological Practice, 36, 365–386.
Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47, 583–621.
Lewontin, R. C. (1972). The apportionment of human diversity. In T. Dobzhansky, M. K. Hecht, & W. C. Steere (Eds.), Evolutionary biology (pp. 381–398). Boston, MA: Springer.
Nakagawa, S., & Cuthill, I. C. (2007). Effect size, confidence interval and statistical significance: A practical guide for biologists. Biological Reviews of the Cambridge Philosophical Society, 82, 591–605.
Pallant, J. (2013). SPSS Survival Manual: A step by step guide to data analysis using IBM SPSS. Maidenhead: McGraw Hill.
Sparks, C. S., & Jantz, R. L. (2002). A reassessment of human cranial plasticity: Boas revisited. Proceedings of the National Academy of Sciences of the United States of America, 99, 14636–14639.
Temple, D. H., McGroarty, J. N., Guatelli-Steinberg, D., Nakatsukasa, M., & Matsumura, H. (2013). A comparative study of stress episode prevalence and duration among Jomon period foragers from Hokkaido. American Journal of Physical Anthropology, 152, 230–238.
Tucker, B., Lill, E., Tsiazonera, Tombo, J., Lahiniriko, R., Rasoanomenjanahary, L., et al. (2015). Inequalities beyond the Gini: Subsistence, social structure, gender, and markets in southwestern Madagascar. Economic Anthropology, 2, 326–342.