Intelligence 36 (2008) 653 – 663
Recruitment modeling: An analysis and an application to the study of male–female differences in intelligence☆

Earl Hunt ⁎, Tara Madhyastha
The University of Washington

Received 1 December 2006; received in revised form 26 February 2008; accepted 20 March 2008. Available online 16 May 2008.
Abstract

Studies of group differences in intelligence often invite conclusions about groups in general from studies of group differences in selected populations. The same design is used in the study of group differences in other traits as well. Investigators observe samples from two groups (e.g. men and women) in some accessible population, but seek to conclude something about a wider, general population. The most frequent case is probably a study contrasting undergraduate men and women. The investigator will know the method by which people are recruited from the accessible population. However the methods by which the members of each group enter the accessible population from the general population are not under the control of the investigator. Call this the recruitment process. The recruitment process may introduce differences between groups in the accessible population that do not occur in the general population. Continuing the example, the recruitment processes that draw men and women from the college-age population into the population of students may not be the same. Therefore, in order to draw an inference about group differences in the general population from group differences in samples from the accessible population it is necessary to have a model of the recruitment process. We develop such a model, and present data showing that it appears to be valid for the case of recruitment into the population of youth who consider post-secondary education. We illustrate use of the model by analyzing findings from a widely publicized study of differences in intelligence between men and women, and show that the conclusions from that study are modified by consideration of the recruitment process.
© 2008 Elsevier Inc. All rights reserved.

Keywords: Intelligence; Male–female differences; Sampling; Population estimates; Gender differences; Mathematical modeling
☆ We thank Paul Irwing and Wendy Johnson for helpful, constructive comments on earlier drafts of this paper.
⁎ Corresponding author. Department of Psychology, Box 351525, University of Washington, Seattle, WA 98195-1525, United States. E-mail address: [email protected] (E. Hunt).

0160-2896/$ - see front matter © 2008 Elsevier Inc. All rights reserved. doi:10.1016/j.intell.2008.03.002

1. Introduction

Many studies of intelligence, and indeed, many studies in psychology in general, use the following paradigm. The researcher wishes to draw an inference about differences between two or more groups along some trait (here
intelligence). Because it is often infeasible to construct a true random sample of the general population, the researcher observes the difference in a sample from some population that can readily be tested (e.g., college students, military personnel, or public school students). We will refer to this as the accessible population. The problem is that recruitment from the general population to the accessible population may be different for each group within the overall population. To take an obvious case, compared to their frequencies in the general population, men are overrepresented and women are underrepresented in the United States Armed Services. It is highly likely that
the psychological factors that influence a man's decision to enlist are not the same as those that influence a woman's. It would be inappropriate to draw an unqualified conclusion about a psychological difference between men and women based upon observations of differences between service men and women. More abstractly, when there are inter-group differences in recruitment from the general to the accessible population, it is inappropriate to generalize results about group differences observed in the sample to the general population, even if the sample is a random sample of the accessible population.

Once stated, the problem is obvious. But what can we do about it? In this paper we approach the problem by constructing mathematical models of the process by which members of different groups are recruited from the general population to the accessible population. We will refer to them as recruitment models. These models can be used to generalize results based upon a sample of the accessible population to form a conclusion about the general population.

We will illustrate our methods by considering a much-studied, much-debated topic: differences in intelligence between men and women. More particularly, we will examine a highly publicized recent finding claiming that there are such differences, in favor of men, and will show that when recruitment effects are considered the conclusion of the original authors must be substantially modified. We regard this finding as interesting in itself. However, we believe that in the long run the methodology behind our reasoning is likely to be of more importance than the specific finding, because it provides a potential solution to the recruitment problem.

We will first describe the example. We then briefly consider a simple recruitment model that has previously been applied to the study of group differences in intelligence. We then consider a more realistic recruitment model, provide data showing its validity, and then apply it to the example.
2. The example

Jackson and Rushton (2006) claimed that there is approximately a 4 IQ point (.24 SD units) male–female difference in general intelligence, in favor of males. They based this conclusion on an analysis of approximately 100,000 examinees in the Scholastic Assessment Test (SAT) validation study sample of 1991. Jackson and Rushton extracted a general factor from individual item scores on the math and verbal subsections of the SAT, and then compared the scores of men and women on that general factor. They also found that there was a higher percentage of men than women in all score ranges above the mean, and a higher percentage of women than men in ranges below the mean. Jackson and Rushton argued that this rules out the possibility that male scores are higher because males have higher variability in intelligence (e.g. Jensen, 1998, pg. 537), on the grounds that they did not observe an excess of males in both the high and low scoring categories of the SAT. They included in their paper the caution that it would be interesting to see if this finding could be repeated in a more general population.

Their paper generated a good deal of attention in the popular press. Rushton was interviewed on television. Jackson and Rushton's conclusion was also mentioned, usually with skepticism, in a number of editorials in the popular press. To the best of our knowledge, none of these interviews or commentaries questioned whether the validity study sample was representative of the general population, or included the original authors' caution that the finding should be repeated in a more general population.

There is reason to believe that Jackson and Rushton's study was influenced by substantial recruitment effects. The accessible population, students who take the SAT in a given year, is constructed by self-selection from the general population of students in their last two years of high school. In Jackson and Rushton's data set, which we assume was an appropriate sample of the accessible population, approximately 55% of the examinees were women. However, according to the US census approximately 49% of people aged 15–19 are female. This suggests that for some reason women are more likely to choose to take the SAT than are men. We shall show, by applying recruitment models, that the Jackson and Rushton results could be produced even if the mean intelligence of women were equal to the mean intelligence of men in the general population.

3. The recruitment issue illustrated by the Jackson and Rushton dataset

Jackson and Rushton examined a 1991 validity study sample of the Scholastic Assessment Test (SAT). This set contained scores for 46,509 men and 56,007 women. They describe the sample by saying "about 50% of the US population now go to college and SAT test takers are representative of those who aspire to do so." (Jackson & Rushton, 2006, pg. 480).¹

¹ We accept Jackson and Rushton's statement. We have been unable to reanalyze their data directly, as the College Board does not have this data in their current archives. According to Rushton (personal communication, April 8, 2007), Jackson, who is now deceased, had physical custody of the data, and it is no longer available.

There is a disparity between
ratio of men to women in Jackson and Rushton's sample and in the population. The ratio in the sample was .83. According to the US census data for 1990, the ratio of men to women in the United States was 1.04:1, i.e. slightly more men than women. In addition, a higher percentage of women than men complete high school. Estimates of dropout rates vary significantly depending on how they are calculated.² However, a recent study (Losen, Orfield, Wald, & Swanson, 2004) estimated dropout rates at the turn of the century to be 32%. The National Center for Education Statistics (2006) estimated the 1991–1992 graduation rate to have been 73.2% of the population of 17 year olds. Currently, there is an 8% difference in dropout rates between men (36%) and women (28%). This is somewhat controversial, so we will assume that men and women both drop out at the 32% rate. (Were we to use differential dropout rates the effects we describe would increase.)

Finally, in the high school population, more females than males take the SAT. In 1991–1992, 52.45% of SAT test takers were women, and that percentage has been increasing. This is close to the percentage in Jackson and Rushton's sample. A two-stage recruitment process applied: recruitment into the class of high school juniors and seniors, and then into the subset of the class who decide to take the SAT. The recruitment process was in all likelihood related to whatever trait causes people to have high SAT scores, for a massive literature has shown that people with high test scores are less likely to drop out of high school and more likely to enter college than people with low test scores. Jackson and Rushton refer to this trait as intelligence (g), and we shall follow their lead.
Because the recruitment process has been more selective for men than for women, and has selected against low intelligence, it is virtually certain that an estimate of the differences between male and female test scores in the (differentially recruited) sample will be biased toward overestimating the difference in intelligence between men and women in the population. But what is the extent of the bias?

² A related statistic, the percentage of the population aged 16–24 who are not high school graduates, has varied between 12.5% and 9.4%, with most figures close to 11%. A very small time trend is discernible (National Center for Educational Statistics, 2006, Table 104).

4. The left censored model

Faced with an analogous problem in another article in this journal (estimation of differences in intelligence scores across states in the United States from SAT scores), Kanazawa (2006) applied a recruitment model that he referred to as the left censored model. He
assumed that anyone who completes high school would, if tested, have had a higher test score than everyone who does not, and that anyone who takes the SAT would have a higher test score than everyone who does not. As the name implies, the left censored model amounts to observing the distribution of scores above a certain known percentile, and estimating the distribution of all scores from these observations.

We now apply the left censored model to calculate the expected mean differences in Jackson and Rushton's sample, assuming that there is no difference between the means in the population. Assume a standard normal distribution, and let P be the percentile at which left censoring occurs. Let pm and pf be the percentages of men and women, respectively, taking the SAT. Therefore Pm = 1 − pm and Pf = 1 − pf. We will use the National Center for Education Statistics (NCES) (2006) estimate that .4167 of all high school students take the SAT, and Losen et al.'s (2004) estimate of a .32 dropout rate.

The percentiles of men and women who took the SAT can be calculated by treating the problem in a manner analogous to a Bayes' law problem in probability theory. Let S be the descriptive statement "took the SAT", M and F be the statements "Male" and "Female", and let D be the descriptive statement "Dropped out of High School." Let f(S), f(D), f(M), and f(F) be the relative frequencies of individuals described in this way. Using a frequency analog to Bayes' law we have

f(M & S) = f(S) f(M|S)
f(M & S) = f(M) f(S|M)
f(S|M) = f(S) f(M|S) / f(M).    (1)

From the census statistics we know that f(M) ≈ .5098. Multiplying the percentage of students who took the SAT in 1991 (from the NCES statistics) by the estimated percentage of high school graduates (Losen et al., 2004), we calculate f(S) = relative frequency of individuals in the relevant age range who take the SAT = .4167(1 − .32), so that f(S) = .2834. Assuming that Jackson and Rushton's sample is representative of the SAT test takers in that year, f(M|S) = .4537. Four digit accuracy is appropriate given the very large sample and population sizes involved. Substituting these numbers into Eq. (1) produces

f(S|M) = .2522.    (2a)

An analogous set of calculations for women produces

f(S|F) = .3158.    (2b)
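The arithmetic behind Eqs. (1)–(2b) is easy to check. The following is a minimal Python sketch (ours, not the authors' code) that reproduces the three frequencies from the inputs quoted in the text:

```python
# Frequency-analog Bayes calculation of Eq. (1), using the values quoted in
# the text (1990 census, NCES SAT-taking rate, Losen et al. dropout rate).
f_M = 0.5098                 # relative frequency of males in the age cohort
f_F = 1 - f_M                # relative frequency of females
f_S = 0.4167 * (1 - 0.32)    # fraction taking the SAT x fraction not dropping out
f_M_given_S = 0.4537         # fraction of SAT takers who are male (J&R sample)
f_F_given_S = 1 - f_M_given_S

f_S_given_M = f_S * f_M_given_S / f_M   # Eq. (1): f(S|M)
f_S_given_F = f_S * f_F_given_S / f_F   # analogous calculation for women

print(round(f_S, 4))          # 0.2834
print(round(f_S_given_M, 4))  # 0.2522, Eq. (2a)
print(round(f_S_given_F, 4))  # 0.3158, Eq. (2b)
```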
According to the left censored model, Jackson and Rushton were comparing roughly the top quarter of men
to slightly less than the top third of women. In this case we would expect men to score higher than women, on average, even though the population means were identical. The expected difference is easily computed. The percentile scores for men and women are PM = 1 − f(S|M), and similarly for PF. Let zm and zf be the standard deviation scores corresponding to the percentile scores for men and women, respectively. Then, with g and G representing the standard normal density function and cumulative distribution function respectively,

Em = g(zm) / (1 − G(zm))
Ef = g(zf) / (1 − G(zf))    (3)
(Hunt, 1995, pg. 151). This leads to an expected sample mean score of 1.26 standard deviation units for men and 1.13 for women, a difference of .13 standard deviation units in favor of men. Jackson and Rushton report a value of .24. A straightforward acceptance of the left censored model implies that the difference between population means is about .11, less than half the difference reported by Jackson and Rushton, and just over 1.5 points in the normal IQ metric.

However this is not the only source of bias in their data. The Jackson and Rushton calculation of differences in standard deviation units was based on the standard deviation of the sample. Our calculation is based on the standard deviation of the population, which is the normal reference point for IQ scores. Because of left truncation the expectation of the sample standard deviation will be smaller than the standard deviation of the population, thus amplifying the difference between the two groups.

In addition, Jackson and Rushton did not properly consider the effect of differences in variance between men and women.³ If the population variance for men is greater than the variance for women, as is often reported (Hedges & Nowell, 1995), left truncation cuts off a larger proportion of male low scores in the population than it does female scores, thus further increasing the expected difference between sample means when the population means are identical.

³ Jackson and Rushton claim that variance effects could not be responsible for their data, on the grounds that they did not observe an excess of males with low scores on the SAT (see their Fig. 2 and accompanying text). This argument is not correct. The excess of males would apply to the upper and lower tails of the distribution of the general population. Under left truncation the lower tail of the distribution, where an excess of males is observed, is not represented at all in the accessible population. Under the recruitment model to be described, the probability of observing SAT scores of people from the lower tail of the distribution is very small.

In order to calculate the possible effects we require an estimate of male/female variances in the general population, rather than in the self-selected population who take the SAT. To obtain some idea of the effect, we calculated the expected difference in the sample, using the left-truncated model, and assuming that the male/female variance ratio is 1.18. This value was chosen because it is a compromise between values for the mathematics and reading scores of men and women on the National Assessment of Educational Progress (NAEP tests in 1990; Hedges & Nowell, 1995, Table 3), and because the NAEP program attempts to obtain a representative sample of students enrolled in American schools. When this correction is applied, the expected value of the difference between the SAT scores of men and women, assuming the left truncation model, is .27, slightly higher than the .24 that Jackson and Rushton observed. According to the left truncation model, Jackson and Rushton observed less of a difference in SAT scores than would be expected on the assumption that there is no difference in population means.

We believe that the above argument is sufficient to lead to questions about the Jackson and Rushton analysis in particular and, in general, about the fairly common practice of extrapolating from observations in self-selected samples to populations, without considering the possibility of recruitment effects. Such extrapolations are not validated by correcting for range restriction, for the issue is one of systematic bias, not restriction in variance.

Although the left censored model has been used in the past, it is not realistic. It is unlikely that all people who complete high school are more intelligent than all drop-outs, or that all students who take the SAT are more intelligent than all students who do not take the test.
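The left censored predictions quoted above can be reproduced numerically. The sketch below (our illustration, using only the Python standard library) applies Eq. (3) to the recruitment rates f(S|M) = .2522 and f(S|F) = .3158; it yields expected means close to the 1.26 and 1.13 reported in the text, with small discrepancies attributable to rounding of the inputs:

```python
# Expected mean of a standard normal distribution left censored so that only
# the top p_take fraction is observed (Eq. (3)).
from statistics import NormalDist

nd = NormalDist()  # standard normal

def truncated_mean(p_take):
    """Expected score of the top p_take fraction of a standard normal."""
    z = nd.inv_cdf(1 - p_take)           # left censoring point
    return nd.pdf(z) / (1 - nd.cdf(z))   # Eq. (3)

E_m = truncated_mean(0.2522)  # men: roughly the top quarter
E_f = truncated_mean(0.3158)  # women: slightly less than the top third
print(E_m, E_f, E_m - E_f)
```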
It is reasonable to believe a weaker assertion: that on average people who graduate from high school are more intelligent than those who do not, and that people who attempt to go to college are, in general, more intelligent than those who do not. In the next section we describe a realistic recruitment model, and present data showing that it is valid in two representative cases. We then apply it to the Jackson and Rushton results.

5. The logistic recruitment model

We assume that the probability that a high school student will take the SAT increases with his/her intelligence. Accordingly we treat "taking the SAT" as if it were a test question, and apply the formalisms of
Item Response Theory to model the result. We use a two-parameter logistic model (Birnbaum, 1968),

P(θ) = 1 / (1 + exp(−D·a·(θ − b))),    (4)

where P(θ) is the probability of taking the SAT given an underlying intelligence of θ, a is proportional to the slope of the tangent at the point where a student has a 50% chance of taking the test, and b is the value of intelligence at which there is a 50% chance of taking the SAT. As in item response theory, b will be referred to as the threshold parameter, and a as the sensitivity parameter, for it indicates how a change in ability affects the probability of taking the SAT. D is a scaling factor (= 1.7) that ensures that the absolute difference between the scaled logistic cumulative distribution function (cdf) and the normal cdf is less than .01. The left censored model is subsumed by the logistic recruitment model, for the two are identical in the limit, when a = ∞.

If intelligence in the population is normally distributed, the density function of intelligence in the test-taking population can be expressed as the product of the density function of the normal distribution and P(θ):

f(θ|a, b) = [1/(σ√(2π))] exp(−(θ − μ)²/2σ²) · [1/(1 + exp(−D·a·(θ − b)))].    (5)
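To make the model concrete, the recruited density of Eq. (5) can be integrated numerically. The sketch below (our illustration; the parameter values are arbitrary, not fitted to any data) computes both the size of the accessible population as a fraction of the general population and the mean ability of the recruited group, the kind of normalized integral the model requires:

```python
# Numerical integration of the recruited density (Eq. (5)) for given logistic
# parameters a (sensitivity) and b (threshold).
import math

D = 1.7  # scaling factor matching the logistic to the normal cdf

def p_take(theta, a, b):
    """Probability of entering the accessible population at ability theta (Eq. (4))."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def recruited_fraction_and_mean(a, b, mu=0.0, sigma=1.0, lo=-6.0, hi=6.0, n=20000):
    """Trapezoidal integration of Eq. (5): total mass and mean of the recruited group."""
    h = (hi - lo) / n
    mass = first_moment = 0.0
    for i in range(n + 1):
        theta = lo + i * h
        w = h if 0 < i < n else h / 2  # trapezoid endpoint weights
        dens = (math.exp(-(theta - mu) ** 2 / (2 * sigma ** 2))
                / (sigma * math.sqrt(2 * math.pi))) * p_take(theta, a, b)
        mass += w * dens
        first_moment += w * dens * theta
    return mass, first_moment / mass

# Illustrative parameters only: raising b shrinks the accessible population
# and raises the mean ability of those recruited.
frac, mean = recruited_fraction_and_mean(a=0.5, b=0.5)
print(frac, mean)
```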
Fig. 1 illustrates Eq. (5) for two different values of the sensitivity parameter, a, with μ = 0 and σ = 1. Fig. 2 illustrates the effects of altering the threshold parameter, b. Note that the integral of Eq. (5) is the size of the accessible population (here, people taking the SAT) expressed as a fraction of the general population. The values of a and b control this fraction. As the threshold is increased, the size of the accessible population decreases. The effect of a threshold change depends upon the sensitivity parameter. Increases in sensitivity increase the percentage of the accessible population lying above threshold and, concomitantly, decrease the percentage lying below threshold.

Fig. 1. The effect of varying the sensitivity (a) parameter in the logistic recruitment model. The top panel shows the standard normal distribution, with the probability density function (pdf) on the ordinate and the standard normal variable on the abscissa. The middle panel shows the probability of taking the SAT as determined by the logistic recruitment model, with b = 0 and a either 2 (solid line) or .5 (dashed line). The bottom panel shows the result of multiplying the two distributions to produce a density function for the accessible population. The integral of this density function is the fraction of the general population that falls into the accessible population. Raising a with b fixed increases the fraction of the recruited population that lies above the threshold value and decreases the fraction below the threshold.

To determine whether this model is realistic we examined data from an urban school district in Washington state.⁴ Students in Washington are required to take an evaluation of their mathematical skill, the Mathematics section of the Washington State Assessment of Student Learning (Math WASL), in 10th grade. This test has been found to be highly correlated with the equivalent reading test, and is believed to be highly g-loaded (Rayborn, 1997). This gives us a sample of students with known ability, using the Math WASL as a proxy for intelligence, or g. In this school district 1430 students took the WASL in the 2002–2003 school year. Subsequently 608 (42.5%) took the SAT as juniors or seniors (in the years 2004–2006). Fig. 3 shows the fit of the logistic recruitment model to this data. The Hosmer and Lemeshow test indicates good fit (Chi square = 5.921,

⁴ We thank the district's director of research, Pat Cummings, for providing the data.
df = 8, p < .65). For reference, a score of 400 on the WASL indicates proficiency as established by the state standards. The parameters are a = .669, b = .339 for all students.

Fig. 3. The logistic recruitment model fitted to data from an urban school district in Washington State. The relative frequency of taking the SAT (P(SAT)) is shown as a function of a student's Mathematics WASL score. The points represent the fraction of students taking the SAT in bins of ten students whose WASL scores were within a ten point span. The line is the logistic regression of P(SAT) on the WASL. The best fitting parameters are a = .669 and b = .339.

We then examined the populations of males and females separately. The Math WASL was taken by 800 (55.94%) females and 630 (44.06%) males. Fig. 4 shows the logistic regression of the probability of taking the SAT on Math WASL separately for females and males. For females, the Hosmer and Lemeshow goodness of fit is Chi square = 5.02, df = 8, and p < .75. For males, it is Chi square = 8.56, df = 8, and p < .38. The parameter values are a = .751, b = .172 for females and a = .598, b = .591 for males. This means that the 50% threshold for deciding to take the SAT was lower for women than for men, and that at practically all but the lowest levels of the Math WASL women were more likely to take the SAT than men.

Fig. 4. The logistic recruitment model for the data shown in Fig. 3, fit separately for men and women. The parameter values are a = .751, b = .172 for females and a = .598, b = .591 for males.

We then used the logistic recruitment model to calculate the expected difference in SAT scores in the school in question. As before, and as in Jackson and Rushton's analysis, the calculations are in terms of population standard deviation units, assuming a population mean of zero and no mean differences between men and women. The expected value of the sample mean for a group can be calculated by taking the integral of the product of the recruited density (Eq. (5)) and θ over all levels of intelligence, and normalizing by dividing by the expected fraction of the group that appears in the accessible population. The a and b parameters used are those determined by fitting the logistic regression model to the group's data:

E(θ) = [∫ f(θ|a, b) · θ dθ] / [∫ f(θ|a, b) dθ],    (6)

where f(θ|a, b) is the recruited density defined in Eq. (5).

Fig. 2. The effect of varying the threshold (b) parameter in the logistic recruitment model. The top panel shows the standard normal distribution, with the probability density function (pdf) on the ordinate and the standard normal variable on the abscissa. The middle panel shows the probability of taking the SAT as determined by the logistic recruitment model, with a = .5 and b either 2 (solid line) or .5 (dashed line). The bottom panel shows the result of multiplying the two distributions to produce a density function for the accessible population. The integral of this density function is the fraction of the general population that falls into the accessible population. Raising b with a fixed decreases the size of the accessible population, as a fraction of the general population, and increases the values within the accessible population of the variable on which recruitment takes place.

We applied Eq. (6) to calculate the expected sample means for men and women, using the parameter estimates and population frequencies for men to calculate the expectation for men, and the parameter estimates and population frequencies for women to calculate the expectation for women. We then subtracted the estimate for men from the estimate for women to calculate the expected difference in sample means. In making this calculation we have also assumed that nobody two standard deviations below the mean (e.g., with an IQ of less than 70) is likely to take the test. This was done to avoid contaminating our analyses due to possible male–female imbalances in the population of severely mentally retarded individuals. We do not believe that these individuals are relevant to the discussion of group differences in cognitive skills within the normal range.

The resulting expected difference was .179 standard deviation units. We were unable to calculate the derived g score used by Jackson and Rushton, because we did not have access to individual item data. However, using the total SAT scores (Math plus Verbal) in the High School sample, and using a standard deviation value of 110, which is the value reported by the College Entrance Examination Board (2004), we determined that the actual difference in sample means was .185, only .006 standard deviation units above the expected value. This provides an independent test of the logistic recruitment model. The parameters of the model are determined solely by whether or not a student takes the SAT, independently of the scores obtained.

6. Application of the logistic recruitment model to national data
The calculations in the preceding section showed that the logistic recruitment model provided an accurate picture of the process of deciding to take the SAT within a single high school district. We would not expect the parameters estimated from these data to necessarily apply to recruitment to the SAT on a nationwide basis. However it is possible to estimate nationwide parameters using a representative sample, the US Department of Education's National Educational Longitudinal Survey of 1988 (NELS:88). We examined the subset of students from NELS:88 who were interviewed in grades 8, 10, and 12, and who graduated with their cohort (N = 13,903). This data therefore does not consider drop outs, so the general population here is the population of students who complete high school. Fortuitously, these students were in 12th grade in 1991–1992, the year in which the data for Jackson and Rushton's analysis were gathered. Students were given a cognitive test developed by the Educational Testing Service that was designed to measure ability in several subject areas across that wide grade range. In addition, in their senior year the students reported whether they had taken or planned to take the SAT. A composite math/verbal score was created for each student; we used the score created in 8th grade as our measure of general intelligence.

Applying the methods described above, we calculated a = .41 for women and a = .39 for men, with associated b values of .11 for women and .35 for men. This suggests that the preponderance of women in the population of SAT examinees may be due to differences in threshold rather than differences in the rate parameter. However other combinations of parameters might be found in local situations, as our analysis of the High School sample shows.

7. Sensitivity analysis

The analyses described above require estimates of the probability that a person will be recruited from the general to the accessible population, as a function of some variable of interest. In order to calculate the parameters of the model an investigator has to have access to the relative frequencies of recruitment into the accessible population as a function of that variable, across the general population. There may be situations in which the only statistics available are the relative frequencies of each group in the sample and in the general population. To take a common case, study participants are often described as "undergraduates in
university X.” In fact the methods used to solicit participation are likely to define an accessible population that is quite different from the general undergraduate student body. Psychology students, for instance, are far more likely to be asked to participate in an experiment than are physics students. In such cases the relative frequencies of, say, men and women in both the sample and the university can readily be obtained. This information is not sufficient to identify the parameters of the logistic recruitment model. However we may explore the implications of various combinations of assumed values, in order to determine what values could or could not reproduce the observed sample differences under various hypotheses about the parameter values in the population, including the population difference in means. We illustrate this technique using the Jackson and Rushton data. In our analyses we assumed that there is no difference in mean intelligence between men and women in the population, the normal “null hypothesis” in such studies. We then consider variations in either a (sensitivity) or b (threshold) values that could produce our estimates of the relative frequencies that we estimated above for SAT test takers in the 1991 high school population. The analyses were repeated for variance ratios of 1 (no difference in variation between men and women) and 1.18, the value that has been observed empirically for tests of this type (see above for discussion of this point). Once the parameter values were obtained we used Eq. (6) to determine the expected difference in sample means. The question we ask is whether any of the various parameter values and assumptions about variance could produce Jackson
Table 1
bm \ bf    .5    .6    .7    .8    .9   1.0   1.1   1.2   1.3   1.4
 .7       .24   .35
 .8       .15   .23   .31
 .9       .10   .16   .22   .28
1.0       .07   .11   .16   .21   .26
1.1       .07   .08   .12   .16   .20   .23
1.2       .07   .06   .09   .12   .15   .19   .22
1.3       .09   .06   .07   .09   .12   .15   .18   .20
1.4       .11   .06   .06   .08   .10   .12   .14   .17   .19
1.5       .13   .07   .06   .07   .08   .10   .12   .14   .16   .17
Expected differences in sample means, varying the threshold (b) required to produce a sample constituting 41.67% of the population with a 55:45 male–female ratio, given a population ratio of 51:49. Sensitivity (a) parameters were adjusted as appropriate for each pair of b values. We assume that the male and female populations have identical means and variances.
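As a concrete illustration of how entries like those in Table 1 can be generated, the sketch below is our own reconstruction, not the authors' code. It assumes the two-parameter logistic recruitment function p(x) = 1/(1 + exp(−a(x − b))) and evaluates the recruited fraction and recruited mean by direct numerical integration (the paper's Eq. (6) is not reproduced in this excerpt). The thresholds and recruitment fractions in the example at the bottom are hypothetical values chosen only to show the mechanics.

```python
import math

def recruit_prob(x, a, b):
    """Logistic probability of recruitment at ability x (in s.d. units)."""
    t = -a * (x - b)
    if t > 50.0:  # far below threshold: probability is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(t))

def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def recruited_fraction_and_mean(a, b, mu=0.0, sigma=1.0, lo=-8.0, hi=8.0, n=2000):
    """Fraction of a N(mu, sigma^2) group that is recruited, and the mean
    ability of the recruits, by trapezoidal integration."""
    h = (hi - lo) / n
    frac = mean_num = 0.0
    for i in range(n + 1):
        x = lo + i * h
        w = 0.5 if i in (0, n) else 1.0
        f = recruit_prob(x, a, b) * normal_pdf(x, mu, sigma)
        frac += w * f * h
        mean_num += w * x * f * h
    return frac, mean_num / frac

def solve_a_for_fraction(target, b, mu=0.0, sigma=1.0):
    """Bisection for the sensitivity a that recruits `target` of the group.
    Assumes b > mu, so the recruited fraction falls as a rises."""
    a_lo, a_hi = 1e-3, 30.0
    for _ in range(50):
        a_mid = 0.5 * (a_lo + a_hi)
        frac, _ = recruited_fraction_and_mean(a_mid, b, mu, sigma)
        if frac > target:
            a_lo = a_mid  # recruiting too many: sharpen the cutoff
        else:
            a_hi = a_mid
    return 0.5 * (a_lo + a_hi)

# Hypothetical illustration: equal population means and variances, a higher
# threshold for men (b = .7) than for women (b = .5), with 37% of men and
# 47% of women recruited into the accessible population.
a_m = solve_a_for_fraction(0.37, b=0.7)
a_f = solve_a_for_fraction(0.47, b=0.5)
_, mean_m = recruited_fraction_and_mean(a_m, 0.7)
_, mean_f = recruited_fraction_and_mean(a_f, 0.5)
print(f"expected sample mean difference (men - women): {mean_m - mean_f:+.2f} s.d.")
```

Solving a for a target recruitment fraction at a fixed threshold b mirrors the adjustment described in the notes to Tables 1 and 2; the same machinery, run over a grid of (b_m, b_f) pairs, produces tables of expected sample mean differences.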
Table 2
bm \ bf    .7    .8    .9   1.0   1.1   1.2   1.3   1.4
 .8       .73
 .9       .60   .65
1.0       .49   .55   .59
1.1       .42   .46   .50   .54
1.2       .37   .40   .43   .46   .49
1.3       .33   .35   .38   .41   .43   .46
1.4       .30   .31   .34   .36   .38   .40   .42
1.5       .28   .29   .30   .32   .34   .36   .38   .40
Expected differences in sample means, varying b and selecting the a values that generate the percentages of male and female SAT takers described for Table 1. We assume that the male and female populations have identical means and a male/female variance ratio of 1.18.
and Rushton's observed .24 s.d. unit difference in sample means.

Tables 1 and 2 show the analyses when we assumed various threshold values for men and women, with the obvious restriction that the threshold value for women always had to be lower than the threshold value for men. If we assume no difference in variances between men and women (Table 1) the Jackson and Rushton data can be replicated if the men's threshold is between .7 and 1.1 standard deviation units and the women's threshold is .1 or .2 standard deviation units lower. If we assume a 1.18 variance ratio the model predicts larger expected mean differences than were observed in the Jackson and Rushton data, showing that their data are, in some circumstances, compatible with an assumed population mean difference in favor of women.

Tables 3 and 4 repeat the analysis varying the sensitivity (a) parameters. Table 3 shows the case for equal variances and Table 4 the case for a variance ratio of 1.18. As can be seen, there are a number of pairs of sensitivity parameters that can approximate Jackson and
Table 3
am \ af    .3     .4     .5     .6     .7     .8
.3        .03   −.06   −.14   −.21   −.27   −.32
.4        .13    .04   −.04   −.11   −.17   −.22
.5        .22    .13    .05   −.02   −.08   −.13
.6        .30    .21    .13    .06    .00   −.05
.7        .37    .28    .20    .13    .07    .02
.8        .43    .34    .27    .19    .13    .08
Expected differences in sample means under varying combinations of the sensitivity (a) parameter, adjusting the threshold (b) parameter and assuming equal population variances. Parameter combinations were chosen to generate a sample that conformed to the restrictions described for Table 1.
Table 4
am \ af    .3     .4     .5     .6     .7     .8
.3        .19    .10    .02   −.05   −.11   −.17
.4        .31    .22    .14    .07    .01   −.04
.5        .42    .33    .25    .18    .12    .07
.6        .51    .43    .35    .28    .21    .16
.7        .59    .51    .43    .36    .30    .24
.8        .66    .58    .50    .43    .37    .31
The analysis of Table 3 repeated assuming a variance ratio of 1.18.
Rushton's result and, if the 1.18 variance ratio is accepted, a number of pairs for which the expected mean difference exceeds that found by Jackson and Rushton.

Tables 1–4, together with the different assumptions about variance ratios, can be thought of as analogous to the conventional procedure of evaluating empirical results by comparing them to what would be expected under the null hypothesis that there is no difference between groups. The only logical difference is that in the conventional procedure one determines the region of a one-dimensional space (the population mean) that could give rise to the observed effects, whereas in the analysis here we consider what region of a five-dimensional space (the parameter values and the variance ratio) might have produced the observed results. In terms of familiar statistics, we are dealing with confidence regions instead of confidence intervals.

The sensitivity analysis presented here considers only the case of the “null hypothesis,” that there are no differences in population means. It is also possible to determine confidence regions under alternative hypotheses about population means. For instance, Jackson and Rushton state (and we agree) that it would be interesting to know whether their results could be replicated in more general populations. Suppose we take as an alternative hypothesis the assumption that the population means are indeed approximately those observed in Jackson and Rushton's sample. This hypothesis is also interesting because it approximates a hypothesis about differences in g between men and women that has been put forward by other investigators (Lynn & Irwing, 2004a,b). We have explored this possibility, and found that if the alternative hypothesis is true there are many parameter combinations for which the predicted difference in sample means is greater than that observed by Jackson and Rushton.
This finding is not surprising, for, as Tables 1–4 show, there are combinations of parameter values that lead to a greater-than-observed difference in sample means without assuming a difference in population means. This is particularly true for an assumed variance ratio of 1.18, which we regard as the most likely value for a test similar to the SAT. For
studies using different types of tests, such as the Raven Matrices reports analyzed by Lynn and Irwing (2004a,b), a different assumption might be appropriate. What this means in terms of generalizing their results (or anyone else's) can only be answered by conducting analyses similar to the ones we have reported here for the Jackson and Rushton data.

We repeat our major conclusion. By combining an observed difference with a recruitment model we may determine what the characteristics of the general population would have to be to give rise to observations in a sample from the accessible population. While the procedures we describe here are, insofar as we know, original, our use of the logistic recruitment model is in principle no different from the use of the normal distribution in the common procedures of placing confidence limits around sample results or investigating the compatibility of observed results with various alternative hypotheses.

8. Discussion

Our results have two implications: one for the interpretation of sample results in general, and one for the specific case that we used as an example, male–female differences in intelligence. We regard the first implication as by far the more important. However, we would like to dispose of the issue concerning male–female differences first.

We believe that the Jackson and Rushton study cannot be used to draw the conclusion that men exceed women in general intelligence in the population as a whole. However, conditional upon accepting their identification of a general factor on the SAT with “general intelligence,” which is not unreasonable, the Jackson and Rushton study does provide evidence that in the population of university applicants men have higher average scores than women. This more limited conclusion is not trivial. For instance, Jackson and Rushton's data could validly be used to argue that, given SAT scores, women should be expected to have lower grades in college than men.
To the extent that this is not true (and it generally is not, as indicated by a number of studies) other factors, not evaluated by the SAT, must be determining grades. Speculating about what these factors are would take us far beyond the present discussion.

We now turn to the issue of recruitment more generally. We believe that the logistic recruitment model is a good model of situations in which the probability of recruitment can reasonably be ordered along some variable of interest. Ordering the variable “probability of
preparing to continue education beyond high school” by intelligence, grades, or socioeconomic status is an obvious application of the logistic recruitment model. On the other hand, we do not believe that the logistic recruitment model applies to all situations. Our more general point is that some recruitment process will always apply, and before interpreting a finding consideration has to be given to the recruitment process. We can think of recruitment processes in which the probability of recruitment might be negatively related to intelligence, or even nonmonotonically related, but still reflect an orderly process.

As an example of what is likely to be a non-monotonic case, consider the many studies in which the participants are drawn from enlistees in the US Armed Services. Enlistment in the military is related to intelligence in two separate ways. People with low scores on cognitive aptitude tests (such as the Armed Forces Qualification Test, AFQT) are either disqualified or, depending upon the need for enlistees at the time, unlikely to be recruited. Those individuals in the general population who, if they took the AFQT, would obtain very high scores are not likely to volunteer for enlistment, preferring to pursue other options, including enrollment in colleges and universities. We conjecture that a model of recruitment into the US Armed Services would have to establish some point on an intelligence test scale that represents the individuals most likely to enlist, with the probability of enlistment falling off from that point, possibly in an asymmetric fashion.

Readers may think of other models, for other situations. The point is that recruitment effects can be substantial. Furthermore, it will often be the case that the presence of substantial recruitment effects can be detected by a rather superficial analysis of the data. To illustrate this we consider two more studies of male–female differences.
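Such a conjectured curve is easy to write down. The sketch below is purely illustrative — the peak location, maximum enlistment rate, spread parameters, and disqualification cutoff are all assumed values, not estimates from enlistment data:

```python
import math

def enlistment_prob(x, peak=0.0, p_max=0.25, s_low=0.5, s_high=1.2, cutoff=-1.5):
    """Hypothetical non-monotonic recruitment probability on a standardized
    test scale: zero below a disqualification cutoff, peaking at `peak`,
    and falling off asymmetrically on either side of the peak."""
    if x < cutoff:  # below the qualification floor
        return 0.0
    spread = s_low if x < peak else s_high
    return p_max * math.exp(-0.5 * ((x - peak) / spread) ** 2)

# Probability rises to the peak, then declines more slowly on the high side:
for score in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(score, round(enlistment_prob(score), 3))
```

An asymmetric, truncated bell of this kind captures both mechanisms described above: the hard cutoff models disqualification at the low end, and the decline above the peak models high scorers opting for college instead.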
Lynn and Irwing (2004b) report a difference between men and women, in favor of men, on the Raven Matrices test. From this study they generalize to the population of college students. Their sample consisted of 1807 men and 415 women enrolled at Texas Agricultural and Mechanical University (Texas A&M) during the 1990s. The sample was not collected by Lynn and Irwing themselves, and they provide no further information concerning the sampling procedure. Examination of the enrollment statistics provided on the Texas A&M website shows that over this period the enrollments of men and women were approximately equal, so the sample was clearly not a random sample of that university's student body, let alone of American university students in general. Given how
little we know about the construction of the sample, we have no way to construct a recruitment model. For instance, in this case there is no compelling reason to believe that the recruitment process for women would necessarily have been more selective with respect to intelligence than the process for men. In fact, it might have been the reverse. We simply do not know. What we do know, however, is that substantial and differential recruitment must have occurred, because the male:female ratio in the sample was so disparate from that in the university at the time.

The same authors report a meta-analysis of studies of male–female differences in intelligence in several countries (Lynn & Irwing, 2004a). It would not be appropriate here to go into detail about this meta-analysis, as our purpose is only to show how suspicious some (not all) of their data points are. The ratio of men to women in their Brazilian sample was 2.35:1. Either Brazil has an unusual population, or a recruitment model should be investigated, or the data point should be regarded as of low quality indeed. Similarly, their male–female ratios for studies in France, England, and the US (two studies combined) are, respectively, .70, .82, and .80. In these developed countries we would not expect such ratios unless sampling was restricted to senior citizens, a population in which there is a preponderance of women. Some consideration of recruitment effects is obviously required.

We do not believe that the problems discussed here occur solely in the study of male–female differences, or for that matter solely in the study of intelligence. Elsewhere one of us has pointed out that recruitment effects are common in the study of differences in traits between groups, and that failing to consider them can lead to erroneous conclusions (Hunt & Carlson, 2007). Here we have illustrated one possible solution to the recruitment problem. We think it has widespread, but not universal, applicability.
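The superficial check we have in mind can be made explicit. The function below is our own illustration, not a published procedure; it compares a sample's male share with the male share of the population the sample is claimed to represent, using the Texas A&M figures from the text and assuming an approximately 50:50 student body:

```python
def recruitment_ratio_flag(n_male, n_female, pop_male_share):
    """Ratio of the sample's male share to the population's male share.
    Values far from 1.0 are a red flag for differential recruitment."""
    sample_male_share = n_male / (n_male + n_female)
    return sample_male_share / pop_male_share

# Lynn and Irwing's sample: 1807 men and 415 women, drawn from a student
# body with approximately equal numbers of men and women.
flag = recruitment_ratio_flag(1807, 415, 0.50)
print(f"sample/population male-share ratio: {flag:.2f}")
```

A value this far above 1.0 does not say how the recruitment occurred, only that it cannot have been anything close to random sampling — which is the point made in the text.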
We encourage others to develop models for other situations. We also believe that, in the absence of modeling, authors should provide, and editors and reviewers should insist upon, much more information about sampling than statements like “participants were university students…”. This is especially the case when, as in the studies cited here, examination of statistics on group participation rates indicates that the accessible population was unlikely to be a representative sample of the general population.

References

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
College Entrance Examination Board (2004). College-bound seniors: A profile of SAT program test-takers. Retrieved November 14, 2007, from http://www.collegeboard.com/prod_downloads/about/news_info/cbsenior/yr2004/2004_CBSNR_total_group.pdf
Hedges, L. V., & Nowell, A. (1995). Sex differences in mental test scores, variability, and numbers of high-scoring individuals. Science, 270, 365−367.
Hunt, E. (1995). Will we be smart enough? A cognitive analysis of the coming workforce. New York: Russell Sage Foundation.
Hunt, E., & Carlson, J. (2007). Considerations relating to the study of group differences in intelligence. Perspectives on Psychological Science, 2(2), 194−213.
Jackson, D. N., & Rushton, J. P. (2006). Males have greater g: Sex differences in mental ability from 100,000 17–18-year-olds on the Scholastic Assessment Test. Intelligence, 34(5), 479−486.
Jensen, A. R. (1998). The g factor: The science of mental ability. Human evolution, behavior, and intelligence. Westport, CT: Praeger.
Kanazawa, S. (2006). IQ and the wealth of states. Intelligence, 34(6), 593−600.
Losen, D. J., Orfield, G., Wald, J., & Swanson, C. B. (2004). Losing our future: How minority youth are being left behind by the graduation rate crisis. Cambridge, MA: Civil Rights Project, Harvard University.
Lynn, R., & Irwing, P. (2004a). Sex differences on the progressive matrices: A meta-analysis. Intelligence, 32(5), 481−498.
Lynn, R., & Irwing, P. (2004b). Sex differences on the advanced progressive matrices in college students. Personality and Individual Differences, 37(1), 219−223.
National Center for Education Statistics (2006). Digest of education statistics. Retrieved August 7, 2007, from http://nces.ed.gov/programs/digest/
Rayborn, R. R. (1997). An examination of content and construct validity for the Washington State assessment of student learning grade 4 math test. Washington Educational Research Association, Winter Conference, 1997.