Testing Spearman's hypothesis with advanced placement examination data


INTELL-01132; No of Pages 9 Intelligence xxx (2016) xxx–xxx


Testing Spearman's hypothesis with advanced placement examination data Russell T. Warne ⁎ Department of Behavioral Science, Utah Valley University, United States

Article history: Received 8 March 2016; Received in revised form 4 May 2016; Accepted 5 May 2016; Available online xxxx

Keywords: Spearman's hypothesis; Group differences; Advanced Placement; Academic achievement

Abstract: The nature, source, and meaning of average group score differences between demographic groups on cognitive tests have been a source of controversy for decades. One possible explanation is "Spearman's hypothesis," which states that the magnitude of score differences across demographic groups is a direct function of how strongly the test measures g. To test this hypothesis, Jensen (1985, 1998) developed the method of correlated vectors. In this study I used the method of correlated vectors to examine the relationship between racial/ethnic group differences of Advanced Placement (AP) exam scores and the correlation between those AP exam scores and a test of general cognitive ability, the PSAT. Results are consistent with Spearman's hypothesis for White-Black and White-Hispanic comparisons, but not for White-Asian comparisons. Comparisons of White examinees and Native Americans are inconclusive. This study shows that academic achievement tests can be used to test Spearman's hypothesis. Additionally, Spearman's hypothesis is not a unique characteristic of White-Black differences in cognitive test scores, but it may not be universal either. © 2016 Elsevier Inc. All rights reserved.

One of the oldest and most consistent findings in intelligence research is the existence of mean score differences among racial and ethnic groups. Generally speaking, examinees descended from East Asians outscore examinees from European populations. These groups' mean scores, in turn, exceed those of populations originating from Latin America, and examinees descended from Sub-Saharan Africans score lowest of all major racial groups (Warne, 2016a; Giessman, Gambrell, & Stebbins, 2013; Gottfredson, 1997; Rushton, 2000). Although not as robust as research on these four racial groups, some researchers have also found an IQ advantage for populations descended from European Jews (Nisbett et al., 2012; te Nijenhuis, David, Metzen, & Armstrong, 2014; Terman, 1926) and lower mean IQ scores for individuals descended from Native Americans (te Nijenhuis, van den Hoek & Armstrong, 2015) and South Asian populations (Lynn & Owen, 1994; Rushton, Čvorović, & Bons, 2007). There is some disagreement over the size of these mean group differences and whether the score gaps are narrowing (Nisbett et al., 2012; Rushton, 2012), but no scholars in the field of human intelligence deny the existence of these score gaps. However, the cause of these differences is a matter of much discussion. Many members of the public and others unacquainted with modern psychometrics blame the existence of these score differences on test bias (see Warne et al., 2014, for examples), though this explanation is discounted among experts on the topic (Gottfredson, 1997, 2009;

⁎ Department of Behavioral Science, Utah Valley University, 800 W. University Parkway MC 115, Orem, UT 84058, United States. E-mail address: [email protected].

Jensen, 1980a, 1980b; Reynolds, 2000). If the tests are not the cause of mean score differences among groups, then the cause possibly arises from differences in ability among groups. Scholars of human intelligence have long recognized that mean score differences among groups vary from test to test or from subtest to subtest (see Sunne, 1917, for a very early example). To explain this finding, Jensen (1980b, 1985) coined the term "Spearman's hypothesis," which states that the magnitude of group differences on a mental test is a function of the ability of that test to measure g. The hypothesis is attributed to Spearman, who theorized (1927, pp. 379–380) that differences in mean subtest scores between Black and White examinees in two prior studies (Derrick, 1920; Pressey & Teter, 1919) were due to the subtests' varying saturation with the g factor. According to Spearman—and Jensen—the subtests that were the best measures of g would also exhibit the largest Black-White mean score differences. Several studies have shown this to be the case (e.g., Jensen, 1985, 1998; Kane, 2007; te Nijenhuis et al., 2016). To test Spearman's hypothesis, Jensen (1985, 1998) developed the method of correlated vectors. In its simplest form, the method of correlated vectors "… is one way of testing whether the g factor extracted from a battery of diverse tests is related to some variable, X, which is external to the battery of tests" (Jensen, 1998, p. 589). In other words, the method of correlated vectors can help researchers determine whether score differences between groups (which is X in this case) on cognitive tests are related to the degree that each test measures g (as measured by factor loadings or correlation coefficients). If the correlation between g saturation and mean score differences is positive, then it is an indication that the data support Spearman's hypothesis. A correlation close to zero suggests the possibility that the group score differences and the tests' g saturation are unrelated and that group score differences are not caused by differences in g.

http://dx.doi.org/10.1016/j.intell.2016.05.002 0160-2896/© 2016 Elsevier Inc. All rights reserved.

Please cite this article as: Warne, R.T., Testing Spearman's hypothesis with advanced placement examination data, Intelligence (2016), http://dx.doi.org/10.1016/j.intell.2016.05.002

Since Jensen's creation of the method of correlated vectors to examine score differences between White and Black American samples, others have used the procedure to examine whether g is at the root of IQ differences among other racial and ethnic groups (e.g., Hartmann, Hye Sun Kruuse, & Nyborg, 2007; Lynn & Owen, 1994; te Nijenhuis et al., 2014; te Nijenhuis, Al-Shahomee, van den Hoek, Grigoriev & Repko, 2015; te Nijenhuis, Willigers, Dragt & van der Flier, 2016; Rushton et al., 2007) and even across species of primates (Fernandes, Woodley, & te Nijenhuis, 2014). Data concerning Black-White differences on cognitive tests generally support Spearman's hypothesis. However, differences among other groups are either inconsistent or there are not enough data to support strong conclusions about whether patterns of test score differences are due to differences in the g saturation of the tests (Hartmann et al., 2007). More data concerning racial and ethnic differences on cognitive test scores are needed to determine the nature of g differences among groups and the applicability of Spearman's hypothesis to more demographic groups. Past research into Spearman's hypothesis has been almost exclusively conducted with cognitive ability/intelligence tests. For example, in Jensen's (1985) original test of Spearman's hypothesis, he used data from the Wechsler scales, the Armed Services Vocational Aptitude Battery, SAT, General Aptitude Test Battery, and other tests of aptitude. In recent years many researchers studying Spearman's hypothesis have used the Raven's matrices tests (te Nijenhuis, Al-Shahomee, et al., 2015).
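The method of correlated vectors described above reduces to a small computation: correlate a vector of tests' g loadings with a vector of standardized group differences. The sketch below uses hypothetical vectors (not data from any study) and writes out Pearson's r and Spearman's ρ in plain Python for transparency; a real analysis would typically use a statistics package.

```python
# Minimal sketch of the method of correlated vectors, with hypothetical numbers.
# Vector 1: each test's correlation with g. Vector 2: the group difference
# (Cohen's d) on that test. A positive r/rho is read as support for
# Spearman's hypothesis.

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def ranks(x):
    """1-based average ranks; ties share the mean of their positions."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

g_loadings = [0.54, 0.65, 0.52, 0.74, 0.61, 0.60]  # hypothetical g vector
group_d    = [0.62, 0.91, 0.74, 0.88, 0.81, 0.67]  # hypothetical d vector
r = pearson_r(g_loadings, group_d)
rho = spearman_rho(g_loadings, group_d)
```

Because the vectors are short in practice (one element per test), the nonparametric ρ is a useful companion to r: it is insensitive to a single extreme element dominating the correlation.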
However, academic achievement tests also show substantial mean score differences among racial and ethnic groups (e.g., ACT, Inc., 2014; Warne, Anderson, & Johnson, 2013), and these achievement tests are sometimes used as measures of g in research (e.g., Kura, 2013; Lynn & Meisenberg, 2010). The purpose of this study is to apply Jensen's (1985, 1998) method of correlated vectors to determine whether Spearman's hypothesis is supported with publicly available academic achievement test data. The use of educational data in a study of cognitive ability differences is a new characteristic of research on Spearman's hypothesis. There are two advantages of opening up tests of Spearman's hypothesis to educational achievement data. The first advantage is pragmatic: if Spearman's hypothesis can be shown to apply to achievement tests, then more measures and datasets could be used to test the hypothesis in the future. Experts in human intelligence and education would not be confined to explicitly cognitive tests of g, as has been the case with studies of Spearman's hypothesis thus far. Additionally, showing the applicability of Spearman's hypothesis to achievement data would connect the theoretical research on intellectual abilities to new practical, real-world measures and outcomes, especially in education. The second advantage is theoretical: for decades there has been a debate in the psychometric and education communities about the relationship between academic achievement tests and academic aptitude tests (Merwin & Gardner, 1962; Schmeiser & Welch, 2006). Some theorists claim that there is no difference between the two types of tests (e.g., Corno et al., 2002; Schmeiser & Welch, 2006), while others claim that these are two separate families of tests, though with some overlap in content (e.g., Glaser, 1963; Zwick, 2006). A finding that Spearman's hypothesis applies to achievement tests would indicate that both types of tests measure the same construct.
Testing Spearman's hypothesis with achievement data can also make contributions to the theory of education policy and curriculum planning. Understanding the degree to which g theory applies to educational achievement can help policy makers and school personnel make informed decisions about the likely success of educational interventions. If the degree of success on academic achievement tests is a product of g, then courses and services could be targeted to students based on their likelihood of benefiting from remedial or advanced services. At the individual level, academic programs with relatively low g loadings could be opened to a high proportion of students in a school, while programs that require high levels of g could be targeted to the most academically prepared students. At the school level, a school's demographics and mean test scores could help personnel decide the most cost-effective use of new resources, and school personnel would maximize the probability of success for an intervention (Warne, 2016a).

In sum, in this study I examined whether Spearman's hypothesis is consistent with the pattern of differences in educational test scores among five large racial and ethnic groups in the United States: Whites, Blacks, Hispanics, Asian Americans, and Native Americans. My goals are to (a) add to the data concerning White-Black differences in scores on tests of g, (b) test the applicability of Spearman's hypothesis to other racial and ethnic groups, and (c) determine whether academic achievement tests can be used to test Spearman's hypothesis.

1. Methods

1.1. Data sources

1.1.1. Measure of g

The measure of g in this study is the PSAT, a standardized test of academic aptitude created and administered by the College Board for students in Grades 9 and 10. PSAT scores have correlations above 0.80 with other recognized measures of g, such as the SAT (Proctor & Kim, 2010). The PSAT has three sections: math, critical reading, and writing. For this study, I used the combined math and critical reading score as a measure of g.

1.1.2. Academic achievement measures

The academic achievement measures in this study were the Advanced Placement (AP) exams, which are standardized tests that students take at the end of an AP course. AP courses are high school classes with an introductory college curriculum taught by high school teachers. AP exams are scored on a scale of 1 to 5, with a score of 3 generally recognized as a passing score. AP exams are created by the College Board, but the College Board does not grant college credit for passing AP exams.
Rather, the student reports AP scores to the college they later attend, and the college decides the minimum score necessary to grant college credit for performance on an AP exam (Lichten, 2000, 2010). The College Board offers 35 AP exams in 26 subjects, ranging from foreign languages to physical sciences and from the arts to mathematics.

1.1.3. Vectors

The vector data in this study come from two public data sources. The first source was a College Board study by Zhang, Patel, and Ewing (2014), which provided the vector representing the correlation of AP exam scores and PSAT scores. In this study about 30% of PSAT examinees later took an AP exam. This is in line with independent data which show that about 36% of high school graduates completed at least one AP course (see Warne et al., 2015). Zhang et al. (2014) did not report a correlation between the PSAT and the AP Italian Language and Culture test, so this test was eliminated from all analyses. Content on the PSAT changes from year to year, yet a comparison of the Zhang et al. (2014) data with similar College Board studies (Camara & Millsap, 1998; Ewing, Camara, & Millsap, 2006) shows that there is little variation from year to year in the PSAT-AP exam score correlations. The source of the vector containing the demographic group score gaps is College Board data from the 2013–2015 administration years (College Board, 2013, 2014a, 2015a). Data from each year were analyzed separately so that each year of data could serve as a replication for the other years. The College Board only reports the numbers of students who earned each score (i.e., from 1 to 5). These ordinal scores were converted to effect size estimates (i.e., Cohen's d) through a method that can recover group differences from percentages of examinees that fall into ordinal categories (Ho & Reardon, 2012; Reardon & Ho, 2015). In this procedure, the area under the receiver operating characteristic curve is converted into an estimate of group differences.
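One way to implement this conversion is to compute, from the category proportions, the probability that a randomly chosen member of one group outscores a randomly chosen member of the other (counting half of ties), and then map that area-under-the-curve statistic to a d-like gap under a normality assumption, V = √2 Φ⁻¹(AUC). The sketch below uses made-up score distributions, not College Board data:

```python
from statistics import NormalDist

def ordinal_d(p_a, p_b):
    """Estimate a standardized gap (Ho & Reardon's V, a d analogue) from two
    groups' proportions in each ordered score category (here, AP scores 1-5)."""
    # P(random member of A scores above random member of B), plus half of ties.
    above = sum(p_a[i] * sum(p_b[:i]) for i in range(len(p_a)))
    ties = sum(p_a[i] * p_b[i] for i in range(len(p_a)))
    auc = above + 0.5 * ties
    # Map the AUC to a Cohen's-d-like gap assuming underlying normality.
    return 2 ** 0.5 * NormalDist().inv_cdf(auc)

# Hypothetical proportions of each group earning AP scores 1 through 5:
group_a = [0.10, 0.15, 0.25, 0.30, 0.20]
group_b = [0.25, 0.25, 0.25, 0.15, 0.10]
d = ordinal_d(group_a, group_b)  # positive: group_a scores higher
```

By construction the estimate is zero when the two distributions are identical and changes sign when the groups are swapped, matching the sign convention for Cohen's d.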
Simulations and real data have shown this method to produce accurate effect size estimates as long as there are at least four ordered categories and the thresholds between adjacent groups are not located in the extremes of the hypothesized underlying continuous distribution of ability. This method was used to calculate an effect size difference between majority White examinees and four racial/ethnic minority groups (i.e., Black, Hispanic, Asian, and Native American students) based on the percentages of each demographic group who obtained each score on the AP exams. The College Board does not regularly release reliability data for AP exams or the PSAT. Based on data from Richardson, Gonzalez, Leal, Castillo, and Carman (in press), PSAT internal consistency reliability values ranged from 0.84 to 0.87 on each subtest. Using a conservative estimate of 0.84 for each subtest and the Spearman-Brown formula, this produces an estimated reliability of 0.913 for combined math and critical reading PSAT scores. AP exam reliability data were taken from a College Board report of 27 AP exam reliability coefficients (Bridgeman, Morgan, & Wang, 1996). For AP exams that were not included in the report, the median reported reliability coefficient (0.89) was assigned. Like the PSAT, content on most AP exams changes annually. However, other studies reporting reliability coefficients for AP exams (e.g., Ewing, Huff, & Kaliski, 2010; Lukhele, Thissen, & Wainer, 1994; Wainer & Thissen, 1993) are highly similar to the Bridgeman et al. (1996) data, indicating that reliability coefficients in the latter study can be used in the current analysis. Readers should be aware that in 2015 the College Board replaced the AP Physics B examination with two exams: AP Physics 1 and AP Physics 2. The reliability coefficient for the AP Physics B examination was applied to these exams in the 2015 data.

1.1.4. Sample sizes

The sample sizes for the correlations used to produce these vectors vary greatly across vector elements.
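The reliability arithmetic above can be reproduced in a few lines: the Spearman-Brown formula gives the reliability of the lengthened (here, doubled) PSAT composite, and the correction applied to the group differences later in the analysis divides each observed d by the square root of the test's reliability. The example values are from the text (0.84 subtest reliability) and the Art history row of Table 1 (d = 0.618, reliability = 0.88):

```python
# Sketch of the two reliability steps described in the text.

def spearman_brown(rel, factor=2.0):
    """Reliability of a test lengthened by `factor` (factor=2 for a composite
    of two subtests with equal reliability)."""
    return factor * rel / (1 + (factor - 1) * rel)

def correct_d(d, reliability):
    """Disattenuate an observed standardized gap for measurement error."""
    return d / reliability ** 0.5

composite_rel = spearman_brown(0.84)  # ~0.913, as reported for the PSAT composite
corrected = correct_d(0.618, 0.88)    # ~0.659, the corrected Art history W-B gap
```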
For the PSAT-AP correlations from Zhang et al. (2014), the sample sizes ranged from 2296 for the AP Japanese Language and Culture exam to 500,972 for the AP English Literature and Composition exam. The median sample size was 48,928 (for AP Physics C: Mechanics), with a mean sample size of 112,682.1 (SD = 138,152.2). For the vector of 2015 AP score gap data, there were 2,416,329 American examinees who took 4,343,547 AP exams. The most popular exam was the AP English Language and Composition exam (519,338 examinees), while the least popular exam was AP Japanese Language and Culture (2139 examinees). Overall, 1,320,989 White examinees took 2,360,536 (54.3%) AP exams; 191,366 Black examinees took 304,677 (7.0%) AP exams; 297,967 Asian examinees took 650,898 (15.0%) AP exams; 438,561 Hispanic examinees took 756,267 (17.4%) AP exams; and 12,983 Native American examinees took 21,632 (0.5%) AP exams. All 154,463 examinees who were of "Other" or "Not stated" race/ethnicity were eliminated from analysis. These examinees took 249,537 (5.7%) AP exams. All groups had an increase in the number of examinees and exams from year to year between 2013 and 2015. Examinees of other/not stated race/ethnicity increased from 4.8% to 5.7%, and Hispanic examinees increased from 16.4% to 17.4% in that time. White examinees decreased from 56.5% to 54.3% of examinees, and Asian examinees increased from 14.9% to 15.0% between 2013 and 2015. Finally, both Black and Native American examinees' percentages stayed constant at 7.0% and 0.5%, respectively. Generally, the percentage of each demographic group which took each test was similar to the percentages listed above. However, foreign language tests sometimes diverged markedly from these percentages. Asian students were a majority of examinees for the AP Chinese Language and Culture (84.4%) and AP Japanese Language and Culture (64.0%) exams. White examinees were 79.8% of examinees for the AP German Language and Culture exam.
Hispanic examinees were a majority of examinees for the AP Spanish Language and Culture exam (64.3%) and the AP Spanish Literature and Culture exam (81.9%). The disproportionate numbers of examinees consequently led to other racial and ethnic groups being a much smaller percentage of examinees for these foreign language exams. The percentages listed in this paragraph were similar from 2013 to 2015.

1.1.5. Statistical procedures

The procedure of correlating vectors in this study followed Jensen's (1998, Appendix B) guidelines. In this study, the first vector consisted of each AP exam's correlation with the PSAT. The second vector was the estimated Cohen's d between White examinees and a minority group, which was corrected for low AP score reliability by dividing by the square root of the reliability coefficient. The correlations between these two vectors were calculated using Pearson's r and Spearman's ρ, providing both parametric and nonparametric measures of association between vectors. The correlated vector analysis was performed twice: once for all AP tests and again for the subset of non-foreign language AP tests. The non-foreign language tests were examined separately because language proficiency seems to have an influence on the results of analyses of Spearman's hypothesis (te Nijenhuis, Willigers, et al., 2016), and many students who are native speakers of non-English languages take AP exams in those languages (Warne, 2016b), presumably for easy college credit.

2. Results

The weighted mean group difference of all 2015 AP scores combined was 0.774 d for White-Black differences, 0.484 for White-Hispanic differences, −0.156 for White-Asian differences, and 0.436 for White-Native American differences. (Positive effect sizes in this article indicate higher scores for White examinees.) These effect sizes were similar to the effect sizes calculated from the 2013 and 2014 data. All estimated effect sizes are displayed in Table 1. To check whether these estimates were realistic, I compared these values to traditional Cohen's d values calculated from the AP scores themselves, a procedure that assumes that the AP exam scores are interval data.
The two Cohen's d estimates were highly correlated across all three years of data. Nearly all r values were between 0.991 and 0.999, and nearly all ρ values were between 0.952 and 0.994. The only exception was for 2013 White-Hispanic standardized differences, which had a correlation of r = 0.756 and ρ = 0.912 between d values estimated from ordinal data and the traditionally calculated values. The high correlations between most effect size estimates indicate that (a) College Board psychometricians tend to produce cutoff points that are well spaced and produce interval-level data, and (b) Ho and Reardon's (2012) method of estimating effect sizes from ordinal data produces realistic results. Table 1 shows the estimated effect sizes between mean group scores, the effect sizes after correction for AP exam reliability, and the correlations between PSAT and AP exam scores. All results in Table 1 are for the 2015 AP data. These results support Spearman's hypothesis for the White-Black, White-Hispanic, and White-Native American score gaps. However, the results are mixed for the pattern of White-Asian score differences on AP tests. In this racial group comparison, Spearman's hypothesis was supported when all AP exams were analyzed (r = 0.575, ρ = 0.250), but when foreign language exams were removed, the correlations weakened (r = 0.093, ρ = 0.232). To determine whether the results of the analysis are robust and replicable, the same analysis was repeated with data from 2013 and 2014. The results are shown in Table 2. These results indicate that both the Pearson's r and Spearman's ρ correlations between vectors are strong and stable for the White-Black and White-Hispanic comparisons for all three years of data. This was true whether foreign language AP exams were included in the analysis or not. For the White-Black comparison, all correlations were at least 0.725, with r values ranging between 0.725 and 0.816, and ρ values between 0.865 and 0.901.
For the White-Hispanic comparison, all correlations between the vectors were


Table 1
Group differences and PSAT-AP exam correlations (2015 data). W-B = White-Black; W-H = White-Hispanic; W-A = White-Asian; W-NA = White-Native American; est. = estimated; corr. = corrected for AP exam reliability.

| AP exam | r with PSAT (a) | Reliability (b) | W-B est. d | W-B corr. d | W-H est. d | W-H corr. d | W-A est. d | W-A corr. d | W-NA est. d | W-NA corr. d |
| Art history | 0.537 | 0.88 | 0.618 | 0.659 | 0.499 | 0.532 | −0.067 | −0.072 | 0.388 | 0.413 |
| Biology | 0.647 | 0.93 | 0.905 | 0.939 | 0.742 | 0.769 | −0.128 | −0.133 | 0.469 | 0.487 |
| Calculus AB | 0.523 | 0.92 | 0.742 | 0.773 | 0.579 | 0.604 | −0.103 | −0.107 | 0.399 | 0.416 |
| Calculus BC | 0.478 | 0.90 | 0.533 | 0.562 | 0.451 | 0.475 | −0.163 | −0.172 | 0.286 | 0.301 |
| Chemistry | 0.611 | 0.91 | 0.812 | 0.851 | 0.681 | 0.714 | −0.239 | −0.250 | 0.496 | 0.520 |
| Chinese language & culture | −0.028 | 0.89 | 0.579 | 0.614 | 0.199 | 0.211 | −1.788 | −1.896 | −0.053 | −0.056 |
| Computer science A | 0.594 | 0.93 | 0.622 | 0.645 | 0.556 | 0.577 | −0.167 | −0.173 | 0.318 | 0.330 |
| Economics: macro | 0.595 | 0.89 | 0.778 | 0.825 | 0.766 | 0.812 | −0.142 | −0.151 | 0.380 | 0.403 |
| Economics: micro | 0.633 | 0.90 | 0.787 | 0.829 | 0.697 | 0.735 | −0.192 | −0.202 | 0.318 | 0.335 |
| English language & composition | 0.736 | 0.84 | 0.875 | 0.955 | 0.762 | 0.831 | −0.114 | −0.124 | 0.492 | 0.537 |
| English literature & composition | 0.711 | 0.82 | 0.936 | 1.034 | 0.758 | 0.837 | −0.050 | −0.055 | 0.545 | 0.602 |
| Environmental science | 0.668 | 0.89 | 0.897 | 0.951 | 0.709 | 0.752 | −0.039 | −0.041 | 0.399 | 0.423 |
| European history | 0.598 | 0.84 | 0.591 | 0.645 | 0.618 | 0.675 | −0.117 | −0.128 | 0.344 | 0.375 |
| French language & culture | 0.424 | 0.92 | 0.399 | 0.416 | 0.499 | 0.521 | −0.007 | −0.007 | 0.477 | 0.497 |
| German language & culture | 0.321 | 0.94 | 0.380 | 0.392 | 0.665 | 0.686 | −0.014 | −0.015 | 0.511 | 0.527 |
| Government: comparative | 0.595 | 0.85 | 0.697 | 0.756 | 0.466 | 0.505 | −0.085 | −0.092 | 0.503 | 0.546 |
| Government: U.S. | 0.646 | 0.84 | 0.766 | 0.836 | 0.738 | 0.805 | −0.007 | −0.008 | 0.522 | 0.570 |
| Human geography | 0.642 | 0.89 | 0.709 | 0.752 | 0.599 | 0.635 | −0.149 | −0.158 | 0.458 | 0.486 |
| Japanese language & culture | 0.025 | 0.89 | −0.021 | −0.023 | 0.325 | 0.345 | −0.981 | −1.040 | — (c) | — (c) |
| Latin | 0.478 | 0.89 | 0.304 | 0.322 | 0.462 | 0.490 | −0.221 | −0.234 | 0.518 | 0.549 |
| Music theory | 0.512 | 0.93 | 0.599 | 0.621 | 0.568 | 0.589 | −0.300 | −0.311 | 0.289 | 0.300 |
| Physics 1 | 0.583 | 0.93 | 0.812 | 0.842 | 0.762 | 0.790 | −0.057 | −0.059 | 0.380 | 0.394 |
| Physics 2 | 0.583 | 0.93 | 0.697 | 0.723 | 0.564 | 0.585 | −0.206 | −0.214 | 0.167 | 0.173 |
| Physics C: electricity & magnetism | 0.465 | 0.92 | 0.579 | 0.604 | 0.428 | 0.447 | −0.188 | −0.196 | 0.436 | 0.454 |
| Physics C: mechanics | 0.566 | 0.90 | 0.705 | 0.743 | 0.579 | 0.611 | −0.138 | −0.146 | 0.300 | 0.316 |
| Psychology | 0.608 | 0.87 | 0.665 | 0.713 | 0.576 | 0.617 | −0.131 | −0.141 | 0.329 | 0.353 |
| Spanish language & culture | 0.030 | 0.89 | 0.541 | 0.574 | −0.307 | −0.326 | −0.131 | −0.139 | 0.347 | 0.368 |
| Spanish literature & culture | 0.395 | 0.85 | 0.373 | 0.405 | 0.458 | 0.497 | −0.278 | −0.302 | 0.380 | 0.412 |
| Statistics | 0.651 | 0.89 | 0.871 | 0.923 | 0.738 | 0.782 | −0.221 | −0.234 | 0.473 | 0.501 |
| Studio art: 2-D design | 0.201 | 0.89 | 0.537 | 0.570 | 0.347 | 0.368 | −0.185 | −0.196 | 0.333 | 0.353 |
| Studio art: 3-D design | 0.192 | 0.89 | 0.560 | 0.594 | 0.399 | 0.423 | 0.053 | 0.056 | 0.064 | 0.068 |
| Studio art: drawing | 0.262 | 0.91 | 0.466 | 0.488 | 0.366 | 0.383 | −0.296 | −0.311 | 0.282 | 0.296 |
| U.S. history | 0.653 | 0.83 | 0.762 | 0.836 | 0.661 | 0.726 | −0.153 | −0.168 | 0.447 | 0.491 |
| World history | 0.643 | 0.89 | 0.754 | 0.799 | 0.681 | 0.722 | −0.199 | −0.211 | 0.451 | 0.478 |

All tests: W-B r = 0.723, ρ = 0.881 (p < 0.001); W-H r = 0.833, ρ = 0.871 (p < 0.001); W-A r = 0.575, ρ = 0.250 (p = 0.154); W-NA r = 0.597, ρ = 0.526 (p = 0.002).
Foreign language tests eliminated: W-B r = 0.727, ρ = 0.868 (p < 0.001); W-H r = 0.842, ρ = 0.862 (p < 0.001); W-A r = 0.093, ρ = 0.232 (p = 0.226); W-NA r = 0.602, ρ = 0.633 (p < 0.001).

(a) Values taken from Zhang et al. (2014).
(b) Values taken from Bridgeman et al. (1996).
(c) Native American n for this test was too small to include in analysis.

between 0.809 and 0.841 for Pearson's r and between 0.682 and 0.873 for Spearman's ρ. The similarly strong and stable results across (a) two measures of correlation and (b) two minority groups for (c) both non-foreign language AP exams and all AP exams support the generalizability of Spearman's hypothesis as an explanation for AP exam racial/ethnic group score differences.

For the White-Native American comparisons, however, the 2013 correlations are inconsistent and very weak (ρ = − 0.217 for all AP exams and ρ = 0.180 for non-foreign language AP exams). In 2014 the correlations are positive and larger in magnitude (ρ = 0.354 for all AP exams and ρ = 0.365 for non-foreign language exams). The trend towards strengthening correlations between PSAT scores and

Table 2
Correlation of AP score gaps and AP-PSAT correlations across three years.

All AP exams
| Comparison | 2013: d (r, ρ) | 2014: d (r, ρ) | 2015: d (r, ρ) |
| White-Black | 0.837 (0.775, 0.895) | 0.808 (0.741, 0.897) | 0.774 (0.723, 0.881) |
| White-Hispanic | 0.380 (0.809, 0.754) | 0.484 (0.838, 0.880) | 0.484 (0.833, 0.871) |
| White-Asian | −0.188 (0.593, 0.282) | −0.149 (0.589, 0.271) | −0.157 (0.575, 0.250) |
| White-Native American | 0.466 (−0.279, −0.217) | 0.447 (0.535, 0.354) | 0.436 (0.597, 0.526) |

Foreign language exams eliminated
| Comparison | 2013: d (r, ρ) | 2014: d (r, ρ) | 2015: d (r, ρ) |
| White-Black | 0.850 (0.795, 0.877) | 0.812 (0.817, 0.901) | 0.778 (0.727, 0.865) |
| White-Hispanic | 0.454 (0.811, 0.685) | 0.669 (0.861, 0.857) | 0.677 (0.841, 0.859) |
| White-Asian | −0.181 (0.018, −0.004) | −0.149 (0.205, 0.230) | −0.142 (0.093, 0.232) |
| White-Native American | 0.466 (−0.319, 0.180) | 0.447 (0.284, 0.365) | 0.439 (0.602, 0.620) |

Note. Cohen's d values are estimated from ordinal AP data using the Ho and Reardon (2012) method with all AP scores combined.


group mean differences between White and Native American examinees continued in 2015, when the values became ρ = 0.526 (for all AP exams) and ρ = 0.620 (for non-foreign language exams). It is interesting that the estimated mean score differences between White and Native American examinees are similar across years (d2013 = 0.466, d2014 = 0.447, d2015 = 0.436). Thus, it is the pattern of score differences among AP exams that changed across years for the White-Native American comparison. In the White-Asian comparison, the results of the 2015 data were replicated; in all three years the magnitude of the correlation between vectors was reduced when foreign language exams were removed from the analysis. This indicates that the foreign language exams (especially the AP Chinese Language and Culture and AP Japanese Language and Culture exams, both of which exhibited the largest White-Asian mean score gaps in all three years) were inflating the correlation between vectors—and therefore provided spurious evidence in favor of Spearman's hypothesis. Like the White-Native American comparisons, the mean score differences between White and Asian examinees did not change much from year to year (d2013 = −0.181, d2014 = −0.149, d2015 = −0.142). Table 2 shows that in all three years the White-Asian score gaps were the smallest differences analyzed in this study.

3. Discussion

In this study I used the method of correlated vectors to examine whether the pattern of score gaps among racial and ethnic groups on AP exams was consistent with Spearman's hypothesis. Using the PSAT as a measure of g, I found that score gaps between White examinees and Black and Hispanic examinees are largely consistent with Spearman's hypothesis. In other words, the AP exams with the largest score gaps are also the tests that correlate most strongly with g, as measured by the PSAT.
Data for White-Native American score differences were less consistent, with the results shifting from a negative to a positive correlation and increasing in magnitude. This may be due to the small number of Native American students who took some AP exams. In 2013 the median AP exam had 210 Native American examinees; this number rose to 220 in 2014 and then to 248 in 2015, which is in accordance with the work of previous researchers who have documented increases in minority participation in the AP program (e.g., Warne, 2015). The small group n values for many tests may result in unstable d estimates. It is possible that the instability of this vector's values contributed to the negative correlation between PSAT-AP correlations and White-Native American differences in Table 2 in 2013. As more AP exams had a larger population of Native American examinees, these results stabilized and became consistent with a prior meta-analysis on Spearman's hypothesis and White-Native American score differences (te Nijenhuis, van den Hoek et al., 2015). Yet, because there is so little literature on the applicability of Spearman's hypothesis to White-Native American score differences, I recommend that research continue in this area. On the other hand, the pattern of score gaps between White and Asian American examinees is inconsistent, and when foreign language AP exams are eliminated from analysis, this study does not support Spearman's hypothesis for these two groups. Although Spearman's hypothesis seems to be supported when all AP exams are analyzed, this result is driven by the high numbers of examinees who are native Chinese and Japanese speakers taking those languages' AP tests (79.0% for AP Chinese Language & Culture and 44.8% for AP Japanese Language & Culture, according to the College Board, 2015b). Without these foreign languages in the analysis, the evidence in favor of Spearman's hypothesis for White-Asian comparisons evaporates.
Previous researchers testing Spearman's hypothesis have all used aptitude or intelligence test batteries to examine the relationship between tests' g loadings and group mean score differences. This is the first study to show that—for some racial and ethnic groups—Spearman's
hypothesis also applies to educational data. Therefore, Spearman's hypothesis is not some esoteric quirk of intelligence tests. Because academic tests also show evidence of Spearman's hypothesis for some racial and ethnic groups, it is realistic to expect that the magnitude of some racial group differences on academic tests will be in proportion to the degree to which the test measures g. This fact has practical implications in education, such as in accountability testing, measuring individual educational progress, and admissions to special academic programs. For example, many school personnel who select children for gifted programs aim (sometimes because of legal mandates) to identify a gifted student population that is racially representative of the student population at large (e.g., Naglieri & Ford, 2003). However, this study indicates that using tests that are better measures of g will result in a gifted program that has proportionally fewer Black and Hispanic students and a greater proportion of White and Asian students. Indeed, assuming the same identification criteria are used for all demographic groups, the only way to obtain proportionality would be to use tests that do not measure g or any other cognitive ability. Of course, this would defeat the purpose of having a special academic program for "gifted" learners.

Conversely, school personnel who work with struggling students would also see racial disproportionality when highly g-loaded tests are used to identify students for remedial or compensatory education programs. For these programs the proportion of Black and Hispanic students will be greater than in the general population, and White and Asian students will be underrepresented. These are precisely the patterns of representation that are typically seen in gifted and special education programs in the United States (Morgan, Farkas, Hillemeier, & Maczuga, 2012; Yoon & Gentry, 2009).
It is interesting that although Asian American students outscored their White counterparts on almost every AP exam (with the exception of the 2013 AP Art History, 2014 AP Government: U.S., and 2015 AP Studio Art: 3-D Design exams), these differences may not be due to differences in g. As a result, researchers may need to look elsewhere for the cause of higher rates of advanced academic achievement among Asian students—at least among elite high school students. Researchers frequently find that the mean IQ score of people descended from Asian populations exceeds the mean IQ score of populations descended from Europeans (e.g., Giessman et al., 2013; Herrnstein & Murray, 1996; Kane, 2007; Olszewski-Kubilius & Lee, 2011). As an explanation for mean White-Asian score differences, Spearman's hypothesis has been supported in the past (e.g., Kane, 2007), but it was not in this study.

The effect sizes in Table 2 indicate that standardized score differences among racial/ethnic groups are smaller than what is usually found in the research on group differences in intelligence. For example, in one recent meta-analysis of aptitude tests of general intelligence, there was an average difference of d = 1.10 between White and Black samples and d = 0.72 between White and Hispanic samples (Roth, Bevier, Bobko, Switzer, & Tyler, 2001, pp. 311, 318). Yet, academic achievement score gaps among racial and ethnic groups in the United States tend to be smaller than aptitude test score gaps, especially when tests measure mastery of an explicitly taught curriculum (e.g., Warne et al., 2013) or a narrower range of abilities than a general intelligence test (e.g., Hedges & Nowell, 1999). Both of these conditions apply to AP exams. The narrower mean score differences could also be a consequence of restriction of range (discussed below).

One surprise in this study was that almost all AP exam scores positively correlated with this measure of g (i.e., PSAT scores).
The fact that some AP exams have few or no items that contain explicitly cognitive content (e.g., the AP Studio Art exams or the AP Music Theory exam) and yet still correlate positively with PSAT scores supports the beliefs of intelligence theorists who claim that many life tasks require g—or are at least correlated with it (e.g., Gottfredson, 1997; Lubinski, 2000). These correlations with the PSAT show that g truly is what Spearman (1904, 1927) called it: a general mental ability.
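The standardized differences (d) reported throughout this discussion are mean differences divided by a pooled standard deviation. As a reminder of the computation, here is a minimal sketch using invented summary statistics (not the study's data):

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference with a pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Hypothetical AP-style score summaries (1-5 scale), not actual study values.
d = cohens_d(mean1=3.1, sd1=1.2, n1=5000,   # group A
             mean2=2.6, sd2=1.2, n2=1500)   # group B
print(round(d, 3))
```

With equal group standard deviations the pooled standard deviation equals that common value, so d here is simply (3.1 − 2.6) / 1.2 ≈ 0.417.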

Please cite this article as: Warne, R.T., Testing Spearman's hypothesis with advanced placement examination data, Intelligence (2016), http://dx.doi.org/10.1016/j.intell.2016.05.002


R.T. Warne / Intelligence xxx (2016) xxx–xxx

The only exceptions to this pattern of positive PSAT-AP correlations were the AP Chinese Language & Culture, AP Japanese Language & Culture, and AP Spanish Language & Culture exams (Zhang et al., 2014), which had correlations with the PSAT close to zero. This may be because many native speakers of these languages take these AP exams in order to easily earn college credit (Warne, 2016b; Kanno & Kangas, 2014). For these native speakers, performance on an AP foreign language exam may have little to do with their general mental ability.

4. Limitations

4.1. Use of PSAT-AP correlations

There are limitations to this study that researchers should be aware of. First, when using the method of correlated vectors, one set of vectors should ideally be factor loadings from a series of mental tests (Jensen, 1985, 1998). In this study, however, there were no factor loadings available; rather, the vectors were (a) a series of correlations between the PSAT and the AP exam scores and (b) a series of AP exam score gaps between groups. The former set of vectors assumes that the PSAT is a pure measure of g—an assumption that some readers may disagree with because factor loadings do not always correspond to a test's observed correlation with intelligence (Ashton & Lee, 2005). Additionally, the PSAT scores in this study consist of only a math and a critical reading section; therefore, the scores may lack construct validity as a measure of g because the PSAT may not measure the full span of cognitive abilities. However, general cognitive ability tests tend to correlate highly with one another (Gottfredson, 1997), as do the g factors extracted from them (Johnson, Bouchard, Krueger, McGue, & Gottesman, 2004; Johnson, te Nijenhuis, & Bouchard, 2008). The high correlation that the PSAT has with the SAT (Proctor & Kim, 2010) lends support to the argument that the PSAT is a strong measure of g.
Moreover, the method of substituting correlation coefficients for factor loadings has been used in past studies of Spearman's hypothesis (e.g., te Nijenhuis & Hartmann, 2006).

4.2. Sample characteristics

Another limitation is the non-representative nature of the sample data, which adds to the sampling error of this study. AP students tend to be more academically elite than the average high school student; as a result, the non-representativeness of the samples in this study likely produces distorted vector values. According to Zhang et al.'s (2014, Table 1) study, students who took AP exams had PSAT scores that were an average of 1.29 standard deviations above the scores of the general PSAT examinee population. This restriction of range likely attenuates the values of the elements in both vectors (see, for example, Roth et al., 2001), which would weaken the evidence for Spearman's hypothesis. Another possible consequence of using an academically elite sample is that Spearman's (1927, pp. 217–221) law of diminishing returns may be in force. This "law" states that higher-g groups have weaker correlations among cognitive abilities and therefore weaker g loadings. Spearman's law of diminishing returns has been observed many times (e.g., Lohman, Gambrell, & Lakin, 2008; te Nijenhuis & Hartmann, 2006; Tommasi et al., 2015). However, its applicability to academic achievement is unclear (Coyle, 2015; Coyle, Snyder, Pillow, & Kochunov, 2011), which means that the law's impact on this study is ambiguous.

It is also important to note that the examinee population is not similar across AP tests. Data from Table 3 in Zhang et al.'s (2014) study can be used to compute effect sizes that quantify the difference between AP examinees and the total PSAT examinee population. Although the overall effect size is d = 1.29 (favoring AP examinees), these effect sizes range from d = 1.10 (for AP Studio Art: 2-D Design) to d = 3.54 (for AP Physics C: Electricity & Magnetism).
Thus, the estimates of the score differences between groups are not the same across AP tests.

4.3. Further research

4.3.1. Academic achievement tests and Spearman's hypothesis

Although this study is supportive of Spearman's hypothesis, there should be further research on the degree to which academic achievement tests can be used to study Spearman's hypothesis. I also recommend that future researchers incorporate the data from this study into meta-analyses on Spearman's hypothesis (following the model of studies like te Nijenhuis, van den Hoek et al., 2015) and determine the degree to which restriction of range, sampling error, and the use of achievement tests function as moderators in Spearman's hypothesis. A researcher with access to individual student scores from a government-mandated battery of academic achievement tests could extract a g factor through factor analysis and then examine Spearman's hypothesis across tests of different academic topics.

4.3.2. Use of other analysis procedures

It is important to note that the results of the method of correlated vectors in this study do not prove that Spearman's hypothesis is the cause of group mean score differences on AP exams. This is because the method of correlated vectors is sometimes insensitive to model misspecification (Lubke, Dolan, & Kelderman, 2001). For example, tests of correlated vectors can be supportive of Spearman's hypothesis even if there are no actual differences in g or the hierarchical Cattell-Horn-Carroll model does not fit the data at all (Ashton & Lee, 2005; Dolan, 2000; Lubke et al., 2001). Indeed, the method of correlated vectors relies strongly on the assumption that g exists for all groups and that the g model of cognitive abilities fits all groups well—something that Jensen (1998, pp. 373–374) himself was explicit about.
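The analysis proposed in Section 4.3.1 (extracting a g factor from a battery of achievement tests) could be sketched as follows. The sketch simulates five tests from a one-factor model and then recovers a g proxy from the first principal component of their correlation matrix; a proper factor analysis would be preferable in practice, and all names and numbers here are illustrative assumptions, not real test data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 5 achievement tests that all load on a single general factor.
n_students = 2000
loadings = np.array([0.8, 0.7, 0.6, 0.5, 0.4])   # true (invented) g loadings
g = rng.standard_normal(n_students)
noise = rng.standard_normal((n_students, 5)) * np.sqrt(1 - loadings**2)
scores = np.outer(g, loadings) + noise

# First principal component of the correlation matrix as a simple g proxy
# (a stand-in for a full factor analysis).
corr = np.corrcoef(scores, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)      # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                         # eigenvector of largest eigenvalue
pc1 = pc1 * np.sign(pc1.sum())               # fix the arbitrary sign
est_loadings = pc1 * np.sqrt(eigvals[-1])    # rescale to a loading-like metric

print(np.round(est_loadings, 2))
```

With a sample this large, the estimated loadings should preserve the rank order of the true loadings, which is what the method of correlated vectors ultimately depends on.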
Critics of the method of correlated vectors advocate using multigroup confirmatory factor analysis (MCFA; Meredith, 1993) to test these assumptions and to execute a test of Spearman's hypothesis (see Frisby & Beaujean, 2015, for an example of this sort of test). MCFA has its advantages in establishing the assumptions needed to test Spearman's hypothesis, including the ability to test competing factor structure models of mental abilities (Dolan & Hamaker, 2001). However, the data in this study were not reported as raw data or a covariance matrix, which precluded the use of MCFA. The method of correlated vectors was the only available alternative, which is why I chose to use it for this study.

Yet, I did make a diligent effort to examine whether the assumptions of the method of correlated vectors were realistic for Advanced Placement data. I found that the assumption of measurement invariance across groups is plausible because Ewing et al. (2006) observed that PSAT scores were better predictors (as measured by ΔR² values) of scores on 11 AP exams for minority students than for White students, after controlling for high school course grades. Additionally, the rank order of the ΔR² values is highly similar across racial and ethnic groups (with the exception of the AP Chemistry and AP Macroeconomics exams for Black students), indicating that the relationships between AP exams and PSAT scores are comparable across groups. In a related study, Camara and Millsap (1998) found that within-group correlations between AP exam scores and PSAT scores are similar across racial and ethnic groups for three AP exams. Additionally, investigations of item and test bias on Advanced Placement exams support the assumption of measurement invariance. Studies of item or test bias in AP exams (e.g., Richardson et al., in press; Stricker & Emmerich, 1999) show little differential item functioning and few differences in predictive validity across racial or ethnic groups.
The same is usually true for tests of g (Warne et al., 2014; Jensen, 1980a, 1998), making it unlikely that the correlation between the PSAT and AP exam scores will be very different across racial and ethnic groups.

Readers should not take this section of the article to indicate that the method of correlated vectors should only be used as a last resort. The method of correlated vectors has unique advantages over MCFA, such as the ability to be combined with meta-analytic methods to produce generalizable results about the plausibility of Spearman's hypothesis. For example, previous researchers have applied corrections for low reliability or restriction of range to generate estimates of error-free correlations between differences on the latent g variable and subtests' g loadings (Woodley, te Nijenhuis, Must, & Must, 2014). The method of correlated vectors is also flexible enough to be incorporated into a moderator analysis framework, an advantage that researchers have exploited in trying to better understand phenomena such as the Flynn effect (e.g., Flynn, te Nijenhuis, & Metzen, 2014; Woodley of Menie & Dunkel, 2015; Woodley of Menie, Figueredo, Dunkel, & Madison, 2015). These characteristics of the method of correlated vectors show that the procedure is a viable method for studies of Spearman's hypothesis.

4.3.3. Further racial/ethnic group comparisons

Although White-Black and White-Hispanic results were consistent with Spearman's hypothesis, results from White-Native American comparisons were less consistent. Data from later years seemed to tentatively support Spearman's hypothesis and are in line with results from a recent meta-analysis (te Nijenhuis, van den Hoek et al., 2015). However, this study's results do not support Spearman's hypothesis as an explanation for the score differences between White and Asian AP examinees. Future researchers should continue to investigate the nature of cognitive and academic test score differences across demographic groups so that scholars can understand the limits of Spearman's hypothesis. The unsupportive results of the analysis of White-Asian group differences in this study do not conclusively prove that White-Asian test score differences are unrelated to differences in g.
Yet, the low correlation between vectors does mean that researchers should entertain the possibility that non-g influences determine White-Asian score differences on AP exams. Possibilities include the effects of differential selection into AP courses (an issue noted by the College Board, 2014b), higher motivation among Asian students for academic success (a possibility that Nisbett et al., 2012, raised), and other factors that could negate or overshadow the impact of g in determining score differences between White and Asian students on AP exams.

It is also important to consider that the White-Asian AP score differences were smaller than the differences between White examinees and other groups (see Table 2). When another group of researchers found that Spearman's hypothesis was not supported in a White-East Asian comparison, they suggested that small differences in test scores may make finding a strong correlation between vectors difficult (Nagoshi, Johnson, DeFries, Wilson, & Vandenberg, 1984). Indeed, Spearman's hypothesis was strongly supported in a different White-Asian comparison study in which the average group difference across subtests of the Universal Nonverbal Intelligence Test was d = 0.352 in favor of examinees of Asian descent (Kane, 2007). Perhaps there is a threshold of mean score differences below which the method of correlated vectors has insufficient statistical sensitivity to test Spearman's hypothesis.

Another possibility is that the term "Asian" is too inexact because it encompasses a genetically, culturally, and geographically diverse population. According to Lynn and Meisenberg's (2010) estimates of national IQs, Asia includes the countries with the highest IQs (e.g., Singapore, China, and Japan) and many countries with IQs about one standard deviation below the mean (e.g., the Philippines, India, Saudi Arabia, and Iran). Tishkoff et al.
(2009) found that, when separated from Europeans and Africans, people descended from indigenous Asian populations actually represent three branches (i.e., clades) of genetic descent—one corresponding to East Asians, another representing central and south Asians, and a third, small clade consisting of Middle Easterners (see also Underhill et al., 2000, for similar results). The latter clade is closely related to Europeans, while the Central/South Asian and East Asian clades are distinct from one another and from non-Asians. If within-group and between-group differences in intelligence are due to genetic differences in g (as Rushton and Jensen, 2005, suggested), then combining
these disparate Asian clades may mask the influence of differential selection pressures and genetic history—and therefore the relationship between g and test score differences. Indeed, when other researchers examined White-Central/South Asian score differences and White-East Asian score differences, they have often found that Spearman's hypothesis is supported (Lynn & Owen, 1994; Rushton et al., 2007; te Nijenhuis, Grigoriev, & van den Hoek, 2016). I recommend that future researchers who examine Spearman's hypothesis with Asian populations make efforts to use more homogeneous subgroups of Asians and to identify the racial/ethnic group membership of participants with more exactitude than I did in this study.

5. Conclusion

The results of this study are interesting because they strengthen the argument that Spearman's hypothesis is not a universal phenomenon. Yet, this study also supports the claims of individuals who say that Spearman's hypothesis applies across several comparisons of racial/ethnic groups and is more generalizable than Spearman (1927) suspected or than Jensen (1985) originally claimed (see te Nijenhuis et al., 2014, for an example of the generality of Spearman's hypothesis). This study also shows the viability of testing Spearman's hypothesis with academic achievement data, which may lead to new sources of data that could test Spearman's hypothesis with larger sample sizes than before.

References

ACT, Inc. (2014). The ACT technical manual. Retrieved from http://www.act.org/aap/pdf/ACT_Technical_Manual.pdf
Ashton, M. C., & Lee, K. (2005). Problems with the method of correlated vectors. Intelligence, 33, 431–444. http://dx.doi.org/10.1016/j.intell.2004.12.004
Bridgeman, B., Morgan, R., & Wang, M.-M. (1996). Reliability of advanced placement examinations. ETS Research Report Series, 1996(1), i–19. http://dx.doi.org/10.1002/j.2333-8504.1996.tb01681.x
Camara, W. J., & Millsap, R. (1998). Using the PSAT/NMSQT and course grades in predicting success in the advanced placement program (Report No. 98-4). New York, NY: College Board. Retrieved from http://research.collegeboard.org/sites/default/files/publications/2012/7/researchreport-1998-4-using-psat-nmsqt-course-grades-predicting-success-ap.pdf
College Board (2013). National report. Retrieved from http://media.collegeboard.com/digitalServices/pdf/research/2013/National_Summary_13.xls
College Board (2014a). National report. Retrieved from http://media.collegeboard.com/digitalServices/pdf/research/2014/National_Summary.xlsx
College Board (2014b). The 10th annual AP report to the nation. New York, NY: College Board. Retrieved from http://media.collegeboard.com/digitalServices/pdf/ap/rtn/10th-annual/10th-annual-ap-report-to-the-nation-single-page.pdf
College Board (2015a). National report. Retrieved from http://media.collegeboard.com/digitalServices/misc/ap/national-summary-2015.xlsx
College Board (2015b). Student score distributions: AP exams – May 2015. Retrieved from https://secure-media.collegeboard.org/digitalServices/pdf/research/2015/Student-Score-Distributions-2015.pdf
Corno, L., Cronbach, L. J., Kupermintz, H., Lohman, D. F., Mandinach, E. B., Porteus, A. W., & Talbert, J. E. (2002). Remaking the concept of aptitude: Extending the legacy of Richard E. Snow. Mahwah, NJ: Lawrence Erlbaum Associates.
Coyle, T. R. (2015). Relations among general intelligence (g), aptitude tests, and GPA: Linear effects dominate. Intelligence, 53, 16–22. http://dx.doi.org/10.1016/j.intell.2015.08.005
Coyle, T., Snyder, A., Pillow, D., & Kochunov, P. (2011). SAT predicts GPA better for high ability subjects: Implications for Spearman's law of diminishing returns. Personality and Individual Differences, 50, 470–474. http://dx.doi.org/10.1016/j.paid.2010.11.009
Derrick, S. M. (1920). A comparative study of the intelligence of seventy-five whites and fifty-five colored college students by the Stanford revision of the Binet-Simon scale. Journal of Applied Psychology, 4, 316–329. http://dx.doi.org/10.1037/h0071332
Dolan, C. V. (2000). Investigating Spearman's hypothesis by means of multi-group confirmatory factor analysis. Multivariate Behavioral Research, 35, 21–50. http://dx.doi.org/10.1207/S15327906MBR3501_2
Dolan, C. V., & Hamaker, E. L. (2001). Investigating Black-White differences in psychometric IQ: Multi-group confirmatory factor analyses of the WISC-R and K-ABC and a critique of the method of correlated vectors. In F. H. Columbus (Ed.), Advances in psychology research. Vol. 6 (pp. 31–59). Huntington, NY: Nova Science Publishers.
Ewing, M., Camara, W. J., & Millsap, R. E. (2006). The relationship between PSAT/NMSQT® scores and AP® examination grades: A follow-up study. New York, NY: College Board. Retrieved from http://research.collegeboard.org/sites/default/files/publications/2012/7/researchreport-2006-1-psat-nmsqt-scores-ap-examination-grades-followup.pdf
Ewing, M., Huff, K., & Kaliski, P. (2010). Validating AP exam scores. In P. M. Sadler, G. Sonnert, R. H. Tai, & K. Klopfenstein (Eds.), AP: A critical examination of the advanced placement program (pp. 85–105). Cambridge, MA: Harvard Education Press.


Fernandes, H. B. F., Woodley, M. A., & te Nijenhuis, J. (2014). Differences in cognitive abilities among primates are concentrated on G: Phenotypic and phylogenetic comparisons with two meta-analytical databases. Intelligence, 46, 311–322. http://dx.doi.org/10.1016/j.intell.2014.07.007
Flynn, J. R., te Nijenhuis, J., & Metzen, D. (2014). The g beyond Spearman's g: Flynn's paradoxes resolved using four exploratory meta-analyses. Intelligence, 44, 1–10. http://dx.doi.org/10.1016/j.intell.2014.01.009
Frisby, C. L., & Beaujean, A. A. (2015). Testing Spearman's hypotheses using a bi-factor model with WAIS-IV/WMS-IV standardization data. Intelligence, 51, 79–97. http://dx.doi.org/10.1016/j.intell.2015.04.007
Giessman, J. A., Gambrell, J. L., & Stebbins, M. S. (2013). Minority performance on the Naglieri Nonverbal Ability Test, second edition, versus the Cognitive Abilities Test, Form 6: One gifted program's experience. Gifted Child Quarterly, 57, 101–109. http://dx.doi.org/10.1177/0016986213477190
Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: Some questions. American Psychologist, 18, 519–521. http://dx.doi.org/10.1037/h0049294
Gottfredson, L. S. (1997). Mainstream science on intelligence: An editorial with 52 signatories, history, and bibliography. Intelligence, 24, 13–23. http://dx.doi.org/10.1016/S0160-2896(97)90011-8
Gottfredson, L. S. (2009). Logical fallacies used to dismiss the evidence on intelligence testing. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 11–65). Washington, DC: American Psychological Association.
Hartmann, P., Hye Sun Kruuse, N., & Nyborg, H. (2007). Testing the cross-racial generality of Spearman's hypothesis in two samples. Intelligence, 35, 47–57. http://dx.doi.org/10.1016/j.intell.2006.04.004
Hedges, L. V., & Nowell, A. (1999). Changes in the Black-White gap in achievement test scores. Sociology of Education, 72, 111–135. http://dx.doi.org/10.2307/2673179
Herrnstein, R. J., & Murray, C. (1996). The bell curve: Intelligence and class structure in American life (2nd ed.). New York, NY: Free Press.
Ho, A. D., & Reardon, S. F. (2012). Estimating achievement gaps from test scores reported in ordinal "proficiency" categories. Journal of Educational and Behavioral Statistics, 37, 489–517. http://dx.doi.org/10.3102/1076998611411918
Jensen, A. R. (1980a). Bias in mental testing. New York, NY: The Free Press.
Jensen, A. R. (1980b). Précis of Bias in mental testing. Behavioral and Brain Sciences, 3, 325–333. http://dx.doi.org/10.1017/S0140525X00005161
Jensen, A. R. (1985). The nature of the Black–White difference on various psychometric tests: Spearman's hypothesis. Behavioral and Brain Sciences, 8, 193–219. http://dx.doi.org/10.1017/S0140525X00020392
Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger.
Johnson, W., Bouchard, T. J., Jr., Krueger, R. F., McGue, M., & Gottesman, I. I. (2004). Just one g: Consistent results from three test batteries. Intelligence, 32, 95–107. http://dx.doi.org/10.1016/S0160-2896(03)00062-X
Johnson, W., te Nijenhuis, J., & Bouchard, T. J., Jr. (2008). Still just 1 g: Consistent results from five test batteries. Intelligence, 36, 81–95. http://dx.doi.org/10.1016/j.intell.2007.06.001
Kane, H. (2007). Group differences in nonverbal intelligence: Support for the influence of Spearman's g. Mankind Quarterly, 48, 65–82.
Kanno, Y., & Kangas, S. E. N. (2014). "I'm not going to be, like, for the AP": English language learners' limited access to advanced college-preparatory courses in high school. American Educational Research Journal, 51, 848–878. http://dx.doi.org/10.3102/0002831214544716
Kura, K. (2013). Japanese north–south gradient in IQ predicts differences in stature, skin color, income, and homicide rate. Intelligence, 41, 512–516. http://dx.doi.org/10.1016/j.intell.2013.07.001
Lichten, W. (2000). Whither advanced placement? Education Policy Analysis Archives, 8(29). http://dx.doi.org/10.14507/epaa.v8n29.2000
Lichten, W. (2010). Whither advanced placement—now? In P. M. Sadler, G. Sonnert, R. H. Tai, & K. Klopfenstein (Eds.), AP: A critical examination of the advanced placement program (pp. 233–243). Cambridge, MA: Harvard Education Press.
Lohman, D. F., Gambrell, J., & Lakin, J. (2008). The commonality of extreme discrepancies in the ability profiles of academically gifted students. Psychology Science Quarterly, 50, 269–282.
Lubinski, D. (2000). Scientific and social significance of assessing individual differences: "Sinking shafts at a few critical points". Annual Review of Psychology, 51, 405–444. http://dx.doi.org/10.1146/annurev.psych.51.1.405
Lubke, G. H., Dolan, C. V., & Kelderman, H. (2001). Investigating group differences on cognitive tests using Spearman's hypothesis: An evaluation of Jensen's method. Multivariate Behavioral Research, 36, 299–324.
Lukhele, R., Thissen, D., & Wainer, H. (1994). On the relative value of multiple-choice, constructed response, and examinee-selected items on two achievement tests. Journal of Educational Measurement, 31, 234–250. http://dx.doi.org/10.1111/j.1745-3984.1994.tb00445.x
Lynn, R., & Meisenberg, G. (2010). National IQs calculated and validated for 108 nations. Intelligence, 38, 353–360. http://dx.doi.org/10.1016/j.intell.2010.04.007
Lynn, R., & Owen, K. (1994). Spearman's hypothesis and test score differences between whites, Indians, and blacks in South Africa. The Journal of General Psychology, 121, 27–36. http://dx.doi.org/10.1080/00221309.1994.9711170
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525–543. http://dx.doi.org/10.1007/BF02294825
Merwin, J. C., & Gardner, E. F. (1962). Development and application of tests of educational achievement. Review of Educational Research, 32, 40–50. http://dx.doi.org/10.2307/1169202
Morgan, P. L., Farkas, G., Hillemeier, M. M., & Maczuga, S. (2012). Are minority children disproportionately represented in early intervention and early childhood special education? Educational Researcher, 41, 339–351. http://dx.doi.org/10.3102/0013189x12459678
Naglieri, J. A., & Ford, D. Y. (2003). Addressing underrepresentation of gifted minority children using the Naglieri Nonverbal Ability Test (NNAT). Gifted Child Quarterly, 47, 155–160. http://dx.doi.org/10.1177/001698620304700206
Nagoshi, C. T., Johnson, R. C., DeFries, J. C., Wilson, J. R., & Vandenberg, S. G. (1984). Group differences and first principal-component loadings in the Hawaii family study of cognition: A test of the generality of 'Spearman's hypothesis'. Personality and Individual Differences, 5, 751–753. http://dx.doi.org/10.1016/0191-8869(84)90125-9
Nisbett, R. E., Aronson, J., Blair, C., Dickens, W., Flynn, J., Halpern, D. F., & Turkheimer, E. (2012). Intelligence: New findings and theoretical developments. American Psychologist, 67, 130–159. http://dx.doi.org/10.1037/a0026699
Olszewski-Kubilius, P., & Lee, S.-Y. (2011). Gender and other group differences in performance on off-level tests: Changes in the 21st century. Gifted Child Quarterly, 55, 54–73. http://dx.doi.org/10.1177/0016986210382574
Pressey, S. L., & Teter, G. F. (1919). Minor studies from the psychological laboratory of Indiana University. I. A comparison of colored and white children by means of a group scale of intelligence. Journal of Applied Psychology, 3, 277–282. http://dx.doi.org/10.1037/h0075831
Proctor, T. P., & Kim, Y. R. (2010). Score change for 2007 PSAT/NMSQT test-takers: An analysis of score changes for PSAT/NMSQT test-takers who also took the 2008 PSAT/NMSQT test or a spring 2008 SAT test (College Board Research Note No. RN-41). New York, NY: College Board. Retrieved from https://research.collegeboard.org/sites/default/files/publications/2012/7/researchnote-2010-41-score-change-2007psat.pdf
Reardon, S. F., & Ho, A. D. (2015). Practical issues in estimating achievement gaps from coarsened data. Journal of Educational and Behavioral Statistics, 40, 158–189. http://dx.doi.org/10.3102/1076998615570944
Reynolds, C. R. (2000). Why is psychometric research on bias in mental testing so often ignored? Psychology, Public Policy, and Law, 6, 144–150. http://dx.doi.org/10.1037/1076-8971.6.1.144
Richardson, C. C., Gonzalez, A., Leal, L., Castillo, M. Z., & Carman, C. A. (2016). PSAT component scores as a predictor of success on AP exam performance for diverse students. Education and Urban Society. http://dx.doi.org/10.1177/0013124514533796 (in press).
Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S., III, & Tyler, P. (2001). Ethnic group differences in cognitive ability in employment and educational settings: A meta-analysis. Personnel Psychology, 54, 297–330. http://dx.doi.org/10.1111/j.1744-6570.2001.tb00094.x
Rushton, J. P. (2000). Race, evolution, and behavior: A life history perspective (3rd ed.). Port Huron, MI: Charles Darwin Research Institute.
Rushton, J. P. (2012). No narrowing in mean Black–White IQ differences—Predicted by heritable g. American Psychologist, 67, 500–501. http://dx.doi.org/10.1037/a0029614
Rushton, J. P., & Jensen, A. R. (2005). Thirty years of research on race differences in cognitive ability. Psychology, Public Policy, and Law, 11, 235–294. http://dx.doi.org/10.1037/1076-8971.11.2.235
Rushton, J. P., Čvorović, J., & Bons, T. A. (2007). General mental ability in South Asians: Data from three Roma (Gypsy) communities in Serbia. Intelligence, 35, 1–12. http://dx.doi.org/10.1016/j.intell.2006.09.002
Schmeiser, C. B., & Welch, C. J. (2006). Test development. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 307–353). Westport, CT: Praeger Publishers.
Spearman, C. (1904). "General intelligence," objectively determined and measured. American Journal of Psychology, 15, 201–293. http://dx.doi.org/10.2307/1412107
Spearman, C. (1927). The abilities of man: Their nature and measurement. New York, NY: The Macmillan Company.
Stricker, L. J., & Emmerich, W. (1999). Possible determinants of differential item functioning: Familiarity, interest, and emotional reaction. Journal of Educational Measurement, 36, 347–366. http://dx.doi.org/10.1111/j.1745-3984.1999.tb00561.x
Sunne, D. (1917). A comparative study of white and negro children. Journal of Applied Psychology, 1, 71–83. http://dx.doi.org/10.1037/h0073489
te Nijenhuis, J., & Hartmann, P. (2006). Spearman's "law of diminishing returns" in samples of Dutch and immigrant children and adults. Intelligence, 34, 437–447. http://dx.doi.org/10.1016/j.intell.2006.02.002
te Nijenhuis, J., David, H., Metzen, D., & Armstrong, E. L. (2014). Spearman's hypothesis tested on European Jews vs non-Jewish Whites and vs Oriental Jews: Two meta-analyses. Intelligence, 44, 15–18. http://dx.doi.org/10.1016/j.intell.2014.02.002
te Nijenhuis, J., Al-Shahomee, A. A., van den Hoek, M., Grigoriev, A., & Repko, J. (2015). Spearman's hypothesis tested comparing Libyan adults with various other groups of adults on the items of the standard progressive matrices. Intelligence, 50, 114–117. http://dx.doi.org/10.1016/j.intell.2015.03.001
te Nijenhuis, J., van den Hoek, M., & Armstrong, E. L. (2015). Spearman's hypothesis and Amerindians: A meta-analysis. Intelligence, 50, 87–92. http://dx.doi.org/10.1016/j.intell.2015.02.006
te Nijenhuis, J., Bakhiet, S. F., van den Hoek, M., Repko, J., Allik, J., Žebec, M. S., ... Abduljabbar, A. S. (2016). Spearman's hypothesis tested comparing Sudanese children and adolescents with various other groups of children and adolescents on the items of the standard progressive matrices. Intelligence, 56, 46–57. http://dx.doi.org/10.1016/j.intell.2016.02.010
te Nijenhuis, J., Grigoriev, A., & van den Hoek, M. (2016). Spearman's hypothesis tested in Kazakhstan on the items of the standard progressive matrices plus. Personality and Individual Differences, 92, 191–193. http://dx.doi.org/10.1016/j.paid.2015.12.048
te Nijenhuis, J., Willigers, D., Dragt, J., & van der Flier, H. (2016). The effects of language bias and cultural bias estimated using the method of correlated vectors on a large database of IQ comparisons between native Dutch and ethnic minority immigrants from non-Western countries. Intelligence, 54, 117–135. http://dx.doi.org/10.1016/j. intell.2015.12.003.

Please cite this article as: Warne, R.T., Testing Spearman's hypothesis with advanced placement examination data, Intelligence (2016), http:// dx.doi.org/10.1016/j.intell.2016.05.002

Terman, L. M. (1926). Genetic studies of genius: Vol. I. Mental and physical traits of a thousand gifted children (2nd ed.). Stanford, CA: Stanford University Press.
Tishkoff, S. A., Reed, F. A., Friedlaender, F. R., Ehret, C., Ranciaro, A., Froment, A., ... Williams, S. M. (2009). The genetic structure and history of Africans and African Americans. Science, 324(5930), 1035–1044. http://dx.doi.org/10.1126/science.1172257.
Tommasi, M., Pezzuti, L., Colom, R., Abad, F. J., Saggino, A., & Orsini, A. (2015). Increased educational level is related with higher IQ scores but lower g-variance: Evidence from the standardization of the WAIS-R for Italy. Intelligence, 50, 68–74. http://dx.doi.org/10.1016/j.intell.2015.02.005.
Underhill, P. A., Shen, P., Lin, A. A., Jin, L., Passarino, G., Yang, W. H., ... Oefner, P. J. (2000). Y chromosome sequence variation and the history of human populations. Nature Genetics, 26, 358–361. http://dx.doi.org/10.1038/81685.
Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed-response test scores: Toward a Marxist theory of test construction. Applied Measurement in Education, 6, 103–118. http://dx.doi.org/10.1207/s15324818ame0602_1.
Warne, R. T., Anderson, B., & Johnson, A. O. (2013). The impact of race and ethnicity on the identification process for giftedness in Utah. Journal for the Education of the Gifted, 36, 487–508. http://dx.doi.org/10.1177/0162353213506065.
Warne, R. T., Yoon, M., & Price, C. J. (2014). Exploring the various interpretations of "test bias". Cultural Diversity and Ethnic Minority Psychology, 20, 570–582. http://dx.doi.org/10.1037/a0036503.
Warne, R. T., Larsen, R., Anderson, B., & Odasso, A. J. (2015). The impact of participation in the Advanced Placement program on students' college admissions test scores. The Journal of Educational Research, 108, 400–416. http://dx.doi.org/10.1080/00220671.2014.917253.
Warne, R. T., & Anderson, B. (2015). The Advanced Placement program's impact on academic achievement. New Educational Foundations, 4, 32–54.
Warne, R. T. (2016a). Five reasons to put the g back into giftedness: An argument for applying the Cattell–Horn–Carroll theory of intelligence to gifted education research and practice. Gifted Child Quarterly, 60, 3–15. http://dx.doi.org/10.1177/0016986215605360.
Warne, R. T. (2016b). Research on the academic benefits of the Advanced Placement program: Taking stock and looking forward. Manuscript submitted for publication.
Woodley of Menie, M. A., & Dunkel, C. S. (2015). In France, are secular IQ losses biologically caused? A comment on Dutton and Lynn (2015). Intelligence, 53, 81–85. http://dx.doi.org/10.1016/j.intell.2015.08.009.
Woodley of Menie, M. A., Figueredo, A. J., Dunkel, C. S., & Madison, G. (2015). Estimating the strength of genetic selection against heritable g in a sample of 3520 Americans, sourced from MIDUS II. Personality and Individual Differences, 86, 266–270. http://dx.doi.org/10.1016/j.paid.2015.05.032.
Woodley, M. A., te Nijenhuis, J., Must, O., & Must, A. (2014). Controlling for increased guessing enhances the independence of the Flynn effect from g: The return of the Brand effect. Intelligence, 43, 27–34. http://dx.doi.org/10.1016/j.intell.2013.12.004.
Yoon, S. Y., & Gentry, M. (2009). Racial and ethnic representation in gifted programs: Current status of and implications for gifted Asian American students. Gifted Child Quarterly, 53, 121–136. http://dx.doi.org/10.1177/0016986208330564.
Zhang, X., Patel, P., & Ewing, M. (2014). AP Potential predicted by PSAT/NMSQT scores using logistic regression. New York, NY: College Board. Retrieved from http://research.collegeboard.org/sites/default/files/publications/2014/10/ap-potential-predicted-bypsat-nmsqt-scores-logistic-regression.pdf
Zwick, R. (2006). Higher education admissions testing. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 647–679). Westport, CT: Praeger Publishers.