Journal of Vocational Behavior 33, 272-292 (1988)

The Problem of Group Differences in Ability Test Scores in Employment Selection

FRANK L. SCHMIDT
University of Iowa

Group differences on employment tests of cognitive abilities often lead to lower job selection rates for blacks and some other minorities, such as Hispanics. These lower selection rates are referred to as “adverse impact.” This paper reviews measurement-based research efforts to solve or circumvent this problem. Cumulative research findings have disconfirmed three early theories that postulated that average score differences between groups were due to deficiencies in the tests: the theory of single group validity, the theory of differential validity, and the theory of test underprediction. At the same time, research has shown that such tests are valid for virtually all jobs and that failure to employ them in selection can lead to substantial economic losses. These findings have increased the percentage of jobs for which such tests will probably be used. Attempts to solve or circumvent the problem of group differences by redefining test (or selection) fairness, by relying on job experience to eliminate ability-related job performance differences, by use of specific aptitudes rather than general ability measures, and by searching for bias in job performance measures appear to have been unsuccessful. It is possible that use of certain nontest selection methods in combination with general mental ability measures may lead to some reduction in hiring rate differences. However, the reduction appears small, and the picture as a whole indicates that the problem of group differences and adverse impact is far more intractable than initially believed. Political, legal, and social policy decisions will determine how this problem is addressed. Measurement-based research has not solved the problem. Its major contribution has been to clarify the nature of the problem, and in this respect it has been very successful. © 1988 Academic Press, Inc.

An earlier revised version of this paper was presented at the Conference on Fairness in Employment Testing, October 1-2, 1987, Balboa Bay Club, Newport Beach, CA, sponsored by the Personnel Testing Council of Southern California. I acknowledge the helpfulness in preparing this paper of discussions with John Hunter and Mark Larson. Any errors of commission or omission are the author's. Requests for reprints should be addressed to Frank L. Schmidt, Department of IRHR, Phillips Hall, University of Iowa, Iowa City, IA 52242.

Blacks and Hispanics typically have lower average scores on employment tests of cognitive abilities (e.g., verbal, quantitative, spatial, or mechanical abilities).



Measurement technology alone cannot solve the problem of group differences in mean levels of abilities that have been found to predict work performance and educational success. Only a few people ever believed it could. A more interesting and relevant question is whether measurement-based substantive research can solve this problem, or at least reveal how to circumvent it. For example, can such research identify special abilities that are just as predictive of job performance but have less adverse impact (i.e., smaller group differences) than general mental ability? Can measurement-based research reveal a way that we can substitute job experience or special training for general mental ability with no loss in performance? There have been a number of attempts to find research-based solutions to the problem of group differences, and I will describe some of them. The purpose is not to present an exhaustive review, but rather to provide illustrative and typical examples.

Although very few people ever believed that measurement models and methods per se could solve the problem of group differences, many researchers in the late 1960s and early 1970s (myself included) believed that measurement-based substantive research could find a solution. So did the foundations and government agencies that funded their research. It now appears that these research efforts have produced at best only very partial solutions, as we will see. That leaves us with another question: Is there anything that measurement and measurement-based research can contribute or has contributed in this area? The answer is yes: measurement and measurement-based research has greatly clarified the nature of the problem. As a result, even though we do not have a solution, we do have a much clearer understanding of just what the problem is.

FAILED THEORIES ABOUT TEST INAPPROPRIATENESS

In the years immediately following passage of the 1964 Civil Rights Act, personnel and industrial psychologists focused their attention strongly for the first time on the large mean score differences between blacks and whites that are usually observed on cognitive ability tests. Two broad theories were advanced that it was hoped would solve the problems of adverse impact that this score difference would otherwise create if scores were used in selecting and promoting workers. These were the theory of subgroup validity differences and the theory of underprediction (test unfairness). Both these theories led to the conclusion that employment tests of cognitive abilities were inappropriate for use with blacks, Hispanics, and other minorities, but for different reasons. The important element that these theories had in common is the hypothesis that the average test score differences between groups are caused by defects or deficiencies in the tests themselves and do not represent real differences in developed abilities.

The Theory of Subgroup Validity Differences

This theory holds that because of cultural differences, cognitive tests have lower validity (test-criterion correlations) for minority than for majority groups. This theory takes two forms.

The subtheory of single-group validity. This theory holds that tests may be valid for the majority but "invalid" (that is, have zero validity) for minorities. Although it was an erroneous procedure (Humphreys, 1973), the procedure adopted by psychologists to test for single-group validity in samples of data was to test black and white validity coefficients for statistical significance separately by race. Because minority sample sizes were usually smaller than those for the majority, and thus sizeable relationships would more often go undetected, small-sample single-group validity studies often showed that the tests significantly predicted performance among whites but not among blacks. This outcome was taken as support for the existence of single-group validity. Four different quantitative reviews of such studies have now demonstrated that evidence for single-group validity by race (blacks vs. whites) does not occur any more frequently in samples than would be expected solely on the basis of chance (Boehm, 1977; Katzell & Dyer, 1977; O'Connor, Wexley, & Alexander, 1975; Schmidt, Berner, & Hunter, 1973). Thus it appears that the evidence for single-group validity was a statistical artifact.

The subtheory of differential validity. This theory holds that the tests are predictive for all groups but to a different degree. It was usually hypothesized that tests had higher validities for whites than for blacks. This theory was tested by applying a statistical test of the difference between observed validities for blacks and whites. Individual studies obtained varying and conflicting results. Later reviews quantitatively combined the findings across studies. The first such review was done by Ruch (1972), who found that differential validity occurred in samples at only chance levels of frequency. Two later reviews claimed to have found somewhat higher frequencies: Boehm (1977) found a frequency of 8%, and Katzell and Dyer (1977) reported frequencies in the 20-30% range. But Hunter and Schmidt (1978) showed that the data preselection technique used in these studies resulted in a Type I statistical bias which created the false appearance of a higher incidence of differential validity. Two later reviews that avoided this Type I bias found differential validity to be at chance levels (Bartlett, Bobko, Mosier, & Hannan, 1978; Hunter, Schmidt, & Hunter, 1979). Bartlett et al. (1978) analyzed 1190 pairs of validity coefficients for blacks and whites (using a significance level of .05), and found significant black-white differences in 6.8% of the pairs of coefficients. Hunter et al. (1979) found the frequency of differential validity among the 712 validity pairs with a positive average validity to be 6%. Similar results have been obtained for Hispanic Americans (Schmidt, Pearlman, & Hunter, 1980). Thus the evidence taken as a whole indicates that employment tests are equally valid for blacks, whites, and (English-speaking) Hispanics (Linn, 1978). The earlier belief in differential validity apparently resulted from excessive faith in individual small-sample studies in which the significant difference observed actually occurred by chance.
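
The statistical artifact described above is easy to demonstrate directly. The short simulation below is an editorial illustration rather than a reanalysis of any study cited: the common population validity (.30) and the two sample sizes (100 majority, 30 minority) are assumed values chosen only to mimic the typical size imbalance. Even though the true validity is identical in both populations, the separate-significance-test procedure frequently "finds" validity for the larger group but not the smaller one.

```python
# Sketch: why testing validity for significance separately by race manufactures
# "single-group validity."  Both groups are sampled from populations with the
# SAME true validity; only the sample sizes differ.  All numbers are assumed
# for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho, n_majority, n_minority, reps, alpha = 0.30, 100, 30, 10_000, 0.05
cov = [[1.0, rho], [rho, 1.0]]

def significant(n):
    """One simulated validity study of size n: is r significantly nonzero (two-tailed)?"""
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    r = np.corrcoef(x, y)[0, 1]
    t = r * np.sqrt((n - 2) / (1 - r**2))
    return 2 * stats.t.sf(abs(t), df=n - 2) < alpha

hits = sum(significant(n_majority) and not significant(n_minority) for _ in range(reps))
print(f"'valid for the majority but not the minority' in {hits / reps:.0%} of study pairs,")
print("even though the true validity is identical in the two populations.")
```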


The Theory of Test Underprediction

This theory holds that even if validity coefficients are equal for minority and majority groups, a test is likely to be unfair if the average test score is lower for minorities. There are numerous statistical models of test fairness, and these differ significantly in their properties (Jensen, 1980, chap. 10; Hunter & Schmidt, 1976; Hunter et al., 1977). However, the most commonly accepted model of test fairness is the regression model (Cleary & Hilton, 1968). This model defines a test as unfair to a minority group if it predicts lower average levels of job performance than the minority group in fact achieves. This is the concept of test fairness embedded in the federal government's Uniform Guidelines on Employee Selection Procedures (Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice, 1978; Ledvinka, 1979). This theory of test unfairness is based on the assumption that the factors causing lower test scores do not also cause lower job performance.

The accumulated evidence on this theory is clear: Lower test scores among minorities (blacks and Hispanics) are accompanied by lower job performance, exactly as in the case of the majority (Bartlett et al., 1978; Campbell, Crooks, Mahoney, & Rock, 1973; Gael & Grant, 1972; Gael, Grant, & Ritchie, 1975a, 1975b; Grant & Bray, 1970; Jensen, 1980, chap. 10; Schmidt, Pearlman, & Hunter, 1980; Ruch, 1972; Tenopyr, 1967). It is important to note that this finding holds true whether supervisory ratings of job performance or objective job sample measures of performance are used. Tests predict job performance of minority and majority persons in the same way (Wigdor & Garner, 1982). The small departures from perfect fairness that exist actually favor minority groups. That is, they often predict slightly higher levels of job performance among minorities than actually occur. The lower average test scores of minorities often translate into lower hiring rates, an outcome referred to as “adverse impact.” These findings show that adverse impact against minorities is not the result of defects in employment tests. The cumulative research on test fairness shows that the average ability and cognitive skill differences between groups are directly reflected in job performance, indicating that they are not artifacts. The differences are not created by the tests, but are preexisting, and thus the problem is not a defect or deficiency in the tests.
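
The regression definition of fairness can be made concrete with a small amount of simulated data. The sketch below is illustrative only; the common validity of .50 and the one-standard-deviation group difference are assumptions, not parameters from the studies cited. Because the same test-performance relationship holds in both simulated groups, the prediction line fitted in the majority sample neither systematically under- nor over-predicts minority performance, which is the pattern the accumulated evidence shows for real tests.

```python
# Sketch of the regression (Cleary) fairness check: fit the prediction line in
# the majority sample and ask whether it under- or over-predicts the minority
# group's actual performance.  Data and parameter values are assumed.
import numpy as np

rng = np.random.default_rng(1)
n, validity = 5_000, 0.5

# Same test-performance relationship in both groups; the minority group simply
# has a lower mean on the ability the test measures.
x_maj = rng.normal(0.0, 1.0, n)
x_min = rng.normal(-1.0, 1.0, n)
noise_sd = np.sqrt(1 - validity**2)
y_maj = validity * x_maj + rng.normal(0.0, noise_sd, n)
y_min = validity * x_min + rng.normal(0.0, noise_sd, n)

slope, intercept = np.polyfit(x_maj, y_maj, 1)          # majority-group prediction line
mean_error = np.mean(y_min - (slope * x_min + intercept))

# Near zero: fair by the regression definition.  A positive value would mean the
# line under-predicts minority performance (unfairness); a negative value would
# mean over-prediction (bias in the minority group's favor).
print(f"mean (actual - predicted) minority performance: {mean_error:+.3f}")
```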


We do not know what all the causes of these group differences in ability are, how long they will persist, or how best to eliminate them. For many other groups in the past, such differences have declined or disappeared over time. But at the present time, the differences exist, are substantial, and are reflected in job performance.

FAILED THEORIES ABOUT LACK OF TEST USEFULNESS

Thus research findings clearly refute the two initial theories that were advanced to solve or circumvent the problem of group differences on employment tests of cognitive abilities. However, these are not the only research findings that define the social problem we face. Even if employment tests were equally valid and predictively fair for minorities, the problem of group differences would be less pressing if (a) employment tests were invalid for most jobs and therefore should rarely be used and/or (b) even valid employment tests were typically of little practical or economic value. If the first of these were true, then employment tests could and should be used for only a few jobs in the economy. And if the second were true, then organizational or even societal decisions to suspend use of ability tests in employment could be made with little resulting economic loss. In fact, both these propositions were widely believed just a few years ago. The first is referred to as the theory of test invalidity; and the second, the theory of low utility. From the point of view of our desire to find a simple and immediate solution to the problem of adverse impact, it is unfortunate that both of these theories have been shown by research findings to be false. We now examine this evidence.

The Theory of Test Invalidity

This theory holds that cognitive employment tests are frequently invalid. This theory takes two forms. The first subtheory holds that test validity is situationally specific: a test that is valid for a job in one organization or setting may be invalid for the same job in another organization or setting. The conclusion is that a separate validity study is necessary in each setting to determine validity. The second form of this theory holds that test validity is job specific: a cognitive test valid for one job may be invalid for another job. The conclusion is that a separate validity study is necessary for every job because validities cannot be generalized across jobs or settings.

The subtheory of situation-specific validity. The empirical basis for the subtheory that validity is situation specific was the considerable variability in observed validity coefficients from study to study even when jobs and tests appeared to be similar or identical (Ghiselli, 1966). The older explanation for this variation was that the nature and definition of job performance is different from job to job and that the human observer or job analyst is too poor an information receiver and processor to detect these subtle but important differences. If this were true, empirical validation would be required in each situation, and generalization (transportability


or applicability) of validities across settings or organizations would be impossible (Albright, Glennon, & Smith, 1963; Ghiselli, 1966; Guion, 1965).

A different hypothesis was investigated starting in the 1970s. This hypothesis is that the variance in the outcomes of validity studies within job-test combinations is due to statistical artifacts. Schmidt, Hunter, and Urry (1976) showed that under typical and realistic validation conditions, a valid test will show a statistically significant validity in only about 50% of studies. For example, they showed that if true validity for a given test is constant at .45 in a series of jobs, if criterion reliability is .70, if the prior selection ratio on the test is .60 (6 out of 10 applicants are hired), and if sample size is 68 (the median over 406 published validity studies; Lent, Aurbach, & Levin, 1971b), then the test will be reported to be valid 54% of the time and invalid 46% of the time (two-tailed test, p = .05). This is the kind of variability in validity findings that was the basis for the theory of situation-specific validity (Ghiselli, 1966; Lent, Aurbach, & Levin, 1971a).

If all, or even most, of the variance in validity coefficients across situations for job-test combinations is due to statistical artifacts, then the theory of situational specificity is false, and validities are generalizable. We developed a method for testing this hypothesis (Pearlman, Schmidt, & Hunter, 1980; Schmidt & Hunter, 1977; Schmidt, Gast-Rosenberg, & Hunter, 1980; Schmidt, Hunter, Pearlman, & Shane, 1979). One starts with a fairly large number of validity coefficients for a given test-job combination and computes the variance of this distribution. From this variance in the validity coefficients, one then subtracts variance due to various sources of error. There are at least seven sources of error variance: (1) sampling error (i.e., variance due to N < ∞); (2) differences between studies in criterion reliability; (3) differences between studies in test reliability; (4) differences between studies in range restriction; (5) differences between studies in amount and kind of criterion contamination and deficiency (Brogden & Taylor, 1950); (6) computational and typographical errors (Wolins, 1962); and (7) slight differences in factor structure between tests of a given type (e.g., arithmetic reasoning tests). Using conventional statistical and measurement principles, Schmidt et al. (1979) showed that the first four sources alone are capable of producing as much variation in validities as is typically observed from study to study.
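
The core of the procedure can be sketched in a few lines. The example below applies only the first and largest correction, sampling error, using the standard "bare-bones" formulas; the validity coefficients and sample sizes are invented for illustration and are not data from any of the studies cited.

```python
# Bare-bones sketch of the variance-decomposition logic described above: how much
# of the study-to-study variation in observed validities is what sampling error
# alone would produce?  The r and n values are invented illustrations.
import numpy as np

r = np.array([0.12, 0.36, 0.08, 0.43, 0.28, 0.20, 0.42, 0.11, 0.33, 0.26])
n = np.array([  45,   80,   38,  120,   60,   55,   95,   40,   70,   65])

r_bar = np.average(r, weights=n)                                 # mean observed validity
var_obs = np.average((r - r_bar) ** 2, weights=n)                # observed variance
var_err = np.average((1 - r_bar**2) ** 2 / (n - 1), weights=n)   # expected sampling-error variance
var_res = max(var_obs - var_err, 0.0)                            # variance left after the artifact

print(f"mean validity: {r_bar:.3f}")
print(f"observed variance: {var_obs:.4f}, sampling-error variance: {var_err:.4f}")
print(f"residual variance: {var_res:.4f} "
      f"({var_err / var_obs:.0%} of the variation is attributable to sampling error alone)")
```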


Results from application of the method to empirical data bear out this prediction. To date, distributions of validity coefficients have been examined for almost 500 test-job combinations (e.g., Lilienthal & Pearlman, 1983; Linn, Harnisch, & Dunbar, 1981; Pearlman et al., 1980; Schmidt, Gast-Rosenberg, & Hunter, 1980; Schmidt & Hunter, 1977; Schmidt, Hunter, & Caplan, 1981; Schmidt, Hunter, Pearlman, & Shane, 1979; Schmidt, Hunter, Pearlman, & Caplan, 1981). The first four artifacts listed above accounted for most of the observed variance of validity coefficients. About 85% of the variance in validities that is accounted for by artifacts is accounted for by simple sampling error. Corrections for sampling error alone lead to the same conclusions about validity generalizability as corrections for the first four artifacts (Pearlman et al., 1980; Schmidt, Gast-Rosenberg, & Hunter, 1980). Where it has been possible to control for the other three sources of variation in observed validities, all observed variation has been accounted for (cf. Schmidt et al., 1985). These findings are quite robust. Callender and Osburn (1980) and Raju and Burke (1983) have derived alternative equations for testing validity generalizability and have shown that these produce identical conclusions and virtually identical numerical results. These findings show that the theory of situational specificity is false: cognitive tests that are valid for predicting performance in a job in one context are valid for predicting performance in that job in other settings.

The subtheory of job-specific validity. This subtheory holds that a given test will be valid for some jobs but invalid for others. But here too, sampling error and other artifacts can falsely cause a test to appear to be invalid. Just as sampling error can produce the appearance of inconsistency in the validity of a test for the same job in different settings, sampling error in validity coefficients can cause tests to appear to be valid for one job but invalid for another job when they actually are valid in all of them. Based on an analysis of data from almost 370,000 clerical workers, Schmidt, Hunter, and Pearlman (1981) showed that the validities of seven cognitive abilities are essentially constant across five different task-defined clerical job families. All seven abilities were highly valid in all five job families. This study also examined patterns of validity coefficients for five cognitive tests which were obtained from a sample of 23,000 people in 35 highly heterogeneous jobs (for example, welders, cooks, clerks, administrators). Validities for each test varied reliably from job to job. But the variation was small, and all tests were valid at substantial levels for all these varied jobs. This finding has been replicated and extended to the least complex, lowest-skill jobs. The U.S. Employment Service has conducted over 500 criterion-related validity studies on jobs that constitute a representative sample of jobs in the Dictionary of Occupational Titles (U.S. Department of Labor, 1977). In a cumulative analysis of these studies, Hunter (1980) showed that cognitive abilities are valid for all jobs and job groupings studied. When jobs were grouped according to complexity of information-processing requirements, the estimated true validity of a composite of verbal and quantitative abilities for predicting on-the-job performance varied from .56 for the highest-level job grouping to .23 for the lowest.


These values were larger for measures of performance in job training. Thus, even for the lowest-skill jobs, validity is still large enough to be of practical value in selection. (A validity of .23 has 23% as much value as perfect validity, other things equal.) These studies disconfirm the hypothesis of job-specific test validity for cognitive tests. These cumulative analyses of existing studies show that the most frequently used cognitive ability tests are valid for all jobs and job families, although at a higher level for some than others.

In conclusion, the evidence shows that the validity of cognitive tests is neither specific to situations nor specific to jobs. These findings rule out the possibility that there will be large numbers of jobs for which tests should not be used because of invalidity. In fact, they probably mean that ability tests will be used for more rather than fewer jobs as time goes on.

The Theory of Low Utility

This theory holds that employee selection methods, even when valid, have little impact on the performance and productivity of the resultant workforce because differences in performance do not have much, if any, economic impact. From this theory, it follows that selection procedures can be manipulated at little economic cost to achieve a racially representative workforce. The basic equation for determining the impact of selection on workforce productivity had been available for years (Brogden, 1949; Cronbach & Gleser, 1957), but it had not been employed because there were no feasible methods for estimating one critical equation parameter: the standard deviation of employee job performance in dollars (SDy), which indicates the dollar impact to the firm of differences in job performance. The greater SDy is, the greater is the payoff in improved productivity from selecting high-performing employees. During the 1970s, we devised a method for estimating SDy based on careful estimates by supervisors of employee output (Hunter & Schmidt, 1982; Schmidt, Hunter, McKenzie, & Muldrow, 1979). Applications of this method showed that SDy was larger than expected. For example, for entry-level budget analysts and computer programmers, SDy was $11,327 and $10,413, respectively. This means that a computer programmer at the 85th percentile in performance is worth $20,800 more per year to the employing organization than a computer programmer at the 15th percentile. (These figures are in 1978 dollars and would be considerably larger today.)
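
The arithmetic behind these dollar figures is simple. The sketch below uses the Brogden-Cronbach-Gleser utility formula that the cited work is based on; the SDy value is the one given above, but the remaining inputs (a validity gain of roughly .76 over the replaced procedure, 618 hires, 10 years of tenure, a 30% selection ratio) are assumptions for illustration, and the result only approximately reproduces the $54.7 million programmer example reported in the next paragraph.

```python
# Sketch of the selection-utility arithmetic.  SDy is taken from the text; the
# other inputs are assumed values, so the output is an order-of-magnitude check,
# not a reproduction of the original analysis.
from scipy.stats import norm

sd_y = 10_413   # SDy for entry-level computer programmers, 1978 dollars (from the text)

# The 85th and 15th percentiles of performance are roughly +1 and -1 SD, so the
# annual dollar difference between two such programmers is about 2 * SDy.
print(f"85th vs. 15th percentile programmer: about ${2 * sd_y:,.0f} per year")

def selection_utility(n_hired, years, validity_gain, sd_y, selection_ratio):
    """Brogden-Cronbach-Gleser gain: N * T * (gain in r) * SDy * mean z of those hired."""
    z_cut = norm.ppf(1 - selection_ratio)
    mean_z_hired = norm.pdf(z_cut) / selection_ratio      # top-down hiring
    return n_hired * years * validity_gain * sd_y * mean_z_hired

# Hiring 618 programmers for 10 years from the top 30% of applicants, assuming the
# test adds roughly .76 in validity over the procedure it replaces, gives a gain
# in the same tens-of-millions range as the figure reported in the text.
print(f"estimated 10-year gain: ${selection_utility(618, 10, 0.76, sd_y, 0.30):,.0f}")
```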


When there is a larger pool of applicants than positions to be filled, use of valid selection tests substantially increases the average performance level of the resultant workforce and therefore substantially improves productivity. For example, use of the Programmer Aptitude Test in place of an invalid selection method to hire 618 entry-level computer programmers leads to an estimated productivity improvement of $54.7 million (in 1978 dollars) over a 10-year period if the top 30% of applicants rather than a random 30% of them are hired (Schmidt, Hunter, McKenzie, & Muldrow, 1979). Many other utility studies have been published in the industrial/organizational psychology literature in recent years. These findings mean that selecting high performers is more important for organizational productivity than had been thought. Research has established that mental skills and abilities are important determinants of performance on the job. If tests measuring these abilities are dropped and replaced by less valid procedures, the proportion of low-performing people hired increases, and the result is a serious decline in productivity in the individual firm and in the economy as a whole.

Consider an empirical example. Some years ago, U.S. Steel selected applicants into their skilled trades apprentice programs from the top down in rankings based on the applicants' (total) scores on a valid battery of cognitive aptitude tests. They then lowered their testing standards dramatically, requiring only minimum scores on the tests equal to about the seventh-grade level and relying heavily on seniority. Because their apprentice training center kept excellent records, they were able to show that (a) scores on mastery tests given during training declined markedly, (b) the flunk-out and drop-out rates increased dramatically, (c) average training time and training cost for those who did make it through the program increased substantially, and (d) average ratings of later performance on the job declined (Braithwaite, 1976).

The theory that selection procedures are not important is sometimes presented in a more subtle form. In this form, the theory holds that all that is important is that the people hired be "qualified." This theory results in pressure on employers to set low minimum qualification levels and then to hire on the basis of other factors (or randomly) from among those who meet these minimum levels. This is the system U.S. Steel adopted. In our experience, minimum levels on cognitive ability tests are typically set near the 15th percentile for applicants. Such "minimum competency" selection systems result in productivity losses 80 to 90% as great as complete abandonment of valid selection procedures (Schmidt, Mack, & Hunter, 1984). For example, if an organization the size of the federal government were to move from ranking on valid tests to such a minimum competency selection system with the cutoff at the 20th percentile, yearly productivity gains from selection would be reduced from $15.6 billion to $2.5 billion (Hunter, 1981). This represents an increase in labor costs of $13.1 billion per year required to maintain the same level of output. In a smaller organization such as the Philadelphia police department, the loss would be $12 million per year, a drop in the gain from using tests from $18 million to $6 million (Hunter, 1979).
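
Because expected job performance under a linear prediction model is proportional to the mean standardized test score of the people hired, the cost of a low fixed cutoff can be sketched directly. In the illustration below, the 15th-percentile cutoff and the selection ratios are assumed values; the exact share of the gain that is forfeited depends on how selective the employer could otherwise have been, and approaches the 80-90% range cited above when hiring is highly selective.

```python
# Sketch: gain from "minimum competency" hiring (random choice among everyone
# above a low cutoff) versus strict top-down hiring, expressed as the mean
# standardized test score of those hired.  Cutoff and selection ratios are
# assumed for illustration.
from scipy.stats import norm

def mean_z_top_down(selection_ratio):
    """Mean test z of hires when the top fraction of applicants is hired."""
    z_cut = norm.ppf(1 - selection_ratio)
    return norm.pdf(z_cut) / selection_ratio

def mean_z_min_competency(cutoff_percentile):
    """Mean test z of hires drawn at random from everyone above a low cutoff."""
    z_cut = norm.ppf(cutoff_percentile)
    return norm.pdf(z_cut) / (1 - cutoff_percentile)

floor = mean_z_min_competency(0.15)          # cutoff near the 15th percentile
for sr in (0.10, 0.30, 0.50):
    top = mean_z_top_down(sr)
    print(f"selection ratio {sr:.0%}: minimum-competency hiring keeps "
          f"{floor / top:.0%} of the top-down gain (forfeits {1 - floor / top:.0%})")
```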


The problem is that there is no real dividing line between the qualified and the unqualified. Employee productivity falls on a continuum from very high to very low, and the relation between ability test scores and employee job performance and output is almost invariably linear (Society for Industrial and Organizational Psychology, 1987; Hunter & Schmidt, 1982; Schmidt, Hunter, McKenzie, & Muldrow, 1979), meaning that the same degree of difference in test scores among able people translates into the same degree of advantage in performance as it does among less able people in the same job. Thus a reduction in minimum acceptable test scores at any point in the test score range results in a reduction in the productivity of employees selected. A decline from superior to average performance may be less visible than a decline from average to poor performance, but it can be very costly in terms of lost productivity. The finding that test score/job performance relationships are linear means that ranking applicants on test scores and selecting from the top down maximizes the productivity of employees selected. This finding also means that any minimum test score requirement (test cutoff score) is arbitrary: No matter where it is set, a higher cutoff score will yield more productive employees, and a lower score will yield less productive employees. On the other hand, it means that if the test is valid, all cutoff scores are "valid" by definition. The concept of "validating" a cutoff score on a valid test is therefore not meaningful.

Thus with this research, two more pieces of the picture fall into place. We now know not only that cognitive employment tests are equally valid and predictively fair for minorities, but also (1) that they are valid for virtually all jobs, and (2) that failure to use them in selection will typically result in substantial economic loss to individual organizations and the economy as a whole. These findings obviously do not point to a solution for the problem of group differences and adverse impact. However, the research I have described thus far is not all of the research relevant to the problem.

OTHER RESEARCH RELEVANT TO THE PROBLEM OF ADVERSE IMPACT

Redefinition of Test Fairness

One approach to reducing the problem of adverse impact has been to advocate statistical definitions of a fair selection test that are different from the regression model of fairness that was discussed earlier. The regression model, the most widely accepted model, holds that a test is fair to the minority group if the prediction line for the majority group does not predict lower average performance for minorities than their actual performance. Note that test scores are used to predict job performance, which is logical, since the purpose in selection is to predict later job performance. Darlington (1971) advocated exactly the opposite definition of a fair test. He defined a test as fair only if job performance


scores predicted test scores equally accurately for both minority and majority groups. Cole (1973) advanced a mathematically identical definition. Thorndike (1971) advanced a different definition, one that holds that a test can be fair only when it yields predictions of minority job performance that are higher than actual performance. These definitions lead to the conclusion that existing tests are often unfair to blacks and should be modified to reduce group differences or dropped from use. Hunter and Schmidt (1976) critiqued these and other alternative concepts of a fair test and showed that all are essentially methods of producing disguised quota hiring systems. However, because the quotas are lower for blacks than for whites, adverse impact would be merely reduced; it would not be eliminated. In recent years, all such alternative models of fairness have fallen into disfavor, and are rarely used or even mentioned. The 1985 revision of the APA Test Standards, for example, employs only the regression definition.

Use of Specific Aptitude Theory to Reduce Adverse Impact

Differential (or specific) aptitude theory holds that specific aptitudes measured by particular tests can increase the validity of employment selection over and above the validity of general mental ability. For example, this theory holds that for some jobs, the specifically numerical factor in a numerical test, independent of general ability, can contribute to the prediction of job performance. This theory holds that optimal selection is attained by identifying and measuring the specific aptitudes that are used in job performance, not by selection based on general mental ability. The black-white difference is sometimes smaller for such aptitudes than for general mental ability. For example, the difference is often smaller for perceptual speed and for numerical computations. By deliberately choosing aptitudes with smaller differences, it might be possible to reduce adverse impact. For example, it has been stated that one reason that the Department of Defense dropped the spatial aptitude subtest of the ASVAB (Armed Services Vocational Aptitude Battery) from ASVAB forms 8/9/10 was that the black-white difference was particularly large on that subtest.

The problem with this approach is that research in recent years suggests that specific aptitude theory is erroneous for all but a small minority of jobs. For most jobs, specific aptitudes appear to contribute nothing to the prediction of job performance over and above general mental ability (Hunter, 1983, 1984, 1985, 1986; Thorndike, 1986). That is, their validity is due almost entirely to the fact that they are measures (in part) of general mental ability; the specific factors in the tests appear to contribute little or nothing to validity. The major exceptions appear to be clerical jobs, where perceptual speed increases validity by about .02,


and a small group of highly technical jobs, where spatial ability may make an independent contribution. Thus for most jobs, the highest validity (and therefore utility) is obtained when general mental ability tests are used. Use of a specific aptitude, or even two or three, in place of a general ability test leads to losses in validity and utility. Unfortunately, general mental ability tests also show large group differences.

Use of Alternative Selection Procedures

It has long been hoped that selection procedures could be found that would have validity equal to that of ability tests but less adverse impact. Hunter and Hunter (1984) quantitatively reviewed the research evidence on alternatives, and found that for entry-level jobs, where most hiring is done, no alternative has validity equal to measures of general mental ability. For jobs where the applicants are expected to know the job prior to hire, several procedures, including work sample tests and job knowledge tests, have validity comparable to general mental ability. However, they found that little information was available on the adverse impact of alternative selection methods; thus the question of whether they would reduce adverse impact could not be answered. They also found that virtually no correlations were reported between alternatives and ability measures, making it difficult to estimate what the combined validity of alternatives and ability would be. Alternatives are actually misnamed: if they are valid, they should be used in combination with ability measures to maximize overall validity and utility. Thus they are more appropriately referred to as "supplements," not alternatives.

Supervisory Profile Record (SPR). Relatively complete information is available on one nontest selection procedure: the Supervisory Profile Record published by Richardson, Bellows and Henry (1981). This instrument is used in selecting first-line supervisors and consists of 128 biographical data items and 33 "judgment" items, all empirically keyed across numerous organizations. Its mean validity is approximately .40, and validity generalizes across organizations, races, sexes, job experience levels, and other variables (Haymaker & Erwin, 1986; Hirsh, Schmidt, Erwin, Owens, & Sparks, 1987). Although this validity is less than that for general ability (Schmidt, Hunter, Pearlman, & Shane, 1979), the combined validity is as high as or higher than ability alone. When ability and the SPR are equally weighted, the black-white difference is reduced approximately 15% in comparison to ability tests used alone. The SPR correlates about .50 with general mental ability, and so is in part a measure of ability. However, other factors are also measured. The black-white difference on the SPR alone is about one-half of a standard deviation, a difference about half as large as that for general mental ability. However, used alone, the SPR tends to overpredict later job performance of blacks (that is, it is biased in favor of them).
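
The "approximately 15%" figure follows from the standard formula for an equally weighted composite of two standardized predictors. In the sketch below, the group differences (1.0 SD on ability, 0.5 SD on the SPR), the .50 ability-SPR correlation, and the two validities are rounded values taken from or assumed to be consistent with the description above, not exact parameters from the cited studies.

```python
# Sketch of the unit-weighted composite arithmetic: the same formula gives both
# the group difference and the validity of an ability + SPR composite.  Input
# values are rounded assumptions based on the text.
import math

d_ability, d_spr = 1.0, 0.5        # black-white differences, in SD units
r_ability_spr = 0.5                # correlation between the two predictors
r_ability_y, r_spr_y = 0.5, 0.4    # assumed validities against job performance

def composite(a, b, r_ab):
    """Standardized value of an equally weighted sum of two standardized predictors."""
    return (a + b) / math.sqrt(2 + 2 * r_ab)

d_comp = composite(d_ability, d_spr, r_ability_spr)
r_comp = composite(r_ability_y, r_spr_y, r_ability_spr)
print(f"group difference on the composite: {d_comp:.2f} SD "
      f"({1 - d_comp / d_ability:.0%} smaller than ability alone)")
print(f"validity of the composite: {r_comp:.2f} (vs. {r_ability_y:.2f} for ability alone)")
```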


Also, use of the SPR alone results in a utility loss of at least 20% in comparison to use of ability and the SPR in combination. The SPR illustrates a point made by Hunter and Hunter (1984): it should be possible to identify nontest procedures that will add to the validity of general ability and at the same time reduce adverse impact. Although research along these lines cannot realistically be expected to eliminate adverse impact, it may have the potential for producing selection systems with reduced adverse impact and enhanced utility.

Job sample tests. Job (or work) sample tests are measures that require performance of a sample of tasks taken from the actual job. Obviously they are appropriate only for applicants with previous job experience or training. In the early 1970s I hypothesized that the use of job sample tests in the hiring of experienced workers would reduce or possibly eliminate adverse impact. This hypothesis was tested in research with metal trades apprentices at a large corporation. The job sample test consisted of cutting metal products on various machines (lathes, milling machines, etc.) to specific tolerance and finish requirements. Our research team found that there was little difference between minority and majority apprentices on tolerance or finish scores, but there was a large difference (.89 SD) in the time required to complete the work (Schmidt, Greenthal, Hunter, Berner, & Seaton, 1977). There was a large difference (1.32 SDs) between minority and majority apprentices on the written job knowledge test; according to Hunter's (1983) path analysis findings, this knowledge difference would be expected to produce a difference on the job sample test because knowledge contributes to performance. But because the apprentices could take all the time they wanted, the effect of knowledge differences showed up in the amount of time taken, not in the quality scores. The overall group difference on the total job sample score (tolerance plus finish plus time taken) was .70 SD, about what would be expected from the group difference in job knowledge (Hunter, 1983). If a general mental ability measure is not to be used, then optimal selection would be based on a combination of the job sample and job knowledge tests (see Hunter, 1983), which would have had adverse impact almost equal to that of the job knowledge test. Recently Bernardin (1984) conducted a meta-analytic review of black-white differences on job sample tests and found an average difference of .54 SDs. This is very close to the value of .50 that Hunter, Schmidt, and Rauschenberger (1977) estimated in the mid-1970s based on a general reading of the literature. Thus job sample tests do not appear to have the potential to eliminate adverse impact. Although the black-white difference appears to be smaller than that on ability tests, the use of job sample and job knowledge tests in combination to hire experienced workers may typically yield black-white differences in predictor scores similar to those on general mental ability tests.

The employment interview. In recent years, more validity data have


become available for the employment interview. McDaniel, Whetzel, Schmidt, Hunter, Maurer, and Russell (1988) have now completed a meta-analysis of 74 validity coefficients, a much larger number than was included in previous reviews. The employment interview has traditionally been believed to have very low validity for predicting job performance, and this belief has been supported by previous quantitative reviews. Hunter and Hunter (1984), for example, reported average validities of .14 for predicting supervisory ratings of job performance (based on 10 coefficients) and .10 for predicting training performance (based on 9 coefficients). The more comprehensive meta-analysis by McDaniel et al. indicates that the validity of even the unstructured employment interview may be considerably higher. Corrected for range restriction, the mean value was about .40 for job performance criteria (mostly supervisory ratings). The uncorrected average (uncorrected for measurement and statistical artifacts) was about .30, still over twice as large as the uncorrected value of .11 from Hunter and Hunter (1984). Corresponding corrected and uncorrected validities for training performance were .39 and .28, again much larger than earlier figures. Mean validities for structured interviews were higher. Based on available data, the (corrected) correlation between interview ratings and measures of general mental ability was approximately .30. This leads to an estimated correlation of .55 to .60, for typical jobs, between the equally weighted sum of ability and interview scores and job performance. This represents an increase of at least .04 in predictive validity beyond that of ability tests alone, or of .15 beyond that of interviews alone. The multiple correlation is slightly higher. Unfortunately, there are few data on the extent of black-white differences on interview scores, so we do not know whether the interview has less adverse impact on average than do ability tests. However, if it does, then the weighted combination of interview scores and ability scores would have both slightly higher validity (and utility) and less adverse impact than ability scores alone. However, the extent of reduction in adverse impact is probably small. But, in any event, since avenues for reducing adverse impact without reducing validity appear to be limited in number, the interview deserves further research in this connection.
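
The ".55 to .60" estimate comes from the same unit-weighted composite formula used in the SPR sketch above. The ability validity of .50 and interview validity of .40 assumed below are round values consistent with the figures just cited, not estimates from the McDaniel et al. data themselves.

```python
# Sketch: validity of an equally weighted ability + interview composite, using
# assumed round values for the two validities and their intercorrelation.
import math

r_ability_y, r_interview_y, r_ability_interview = 0.50, 0.40, 0.30
r_composite = (r_ability_y + r_interview_y) / math.sqrt(2 + 2 * r_ability_interview)
print(f"validity of equally weighted ability + interview composite: {r_composite:.2f}")
```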


Theory of Fadeout with Job Experience

Suppose the superiority of higher ability people in job performance is short-lived. For example, suppose higher general mental ability leads to higher job performance in the early months of job tenure, with this advantage being eroded over time by learning that takes place as people gain experience on the job. This is precisely what is posited by the convergence theory. If this theory were correct, it would greatly reduce the cost of eliminating adverse impact. The cost would exist only in the early phases of job tenure and then would disappear. That is, after a period on the job, any difference between blacks and whites in job performance would tend to disappear. In a recent study (Schmidt, Hunter, Outerbridge, & Goff, 1988), we tested this theory using a unique set of data gathered by Army researchers (N = 1474). We found that for all three measures of job performance (work sample tests, job knowledge tests, and supervisory ratings), the difference in performance between the top and bottom ability halves remained constant out to at least 5 years on the job. (Beyond that point the data were inadequate to draw conclusions.) Thus our findings disconfirmed convergence theory. It appears that initial ability differences produce job performance differences that are quite lasting. This study did not break out the data by race, but because we know that ability test scores relate to job performance in the same way for whites and blacks (as shown in test fairness studies), the findings can safely be generalized to the white-black difference. Ackerman (1987) has recently summarized the vast body of research evidence showing that when tasks are complex enough to require continued information processing over time (i.e., tasks that are not repetitive enough to be made automatic or habitual), mental ability continues to correlate with performance indefinitely over time. The implication, of course, is that the elimination of adverse impact, to the extent that it requires lowering ability standards, will continue to exact a cost in performance for a long time, perhaps for as long as people remain on the job. Thus we find again that the problem of group differences is more intractable than we hoped would be the case.

Recent Studies of Criterion Bias

Recently some researchers have attempted to resurrect the hypothesis that bias in measures of job performance may be responsible for the finding that ability tests are predictively fair for blacks (Bernardin, 1984; Ford, Kraiger, & Schechtman, 1986; Kraiger & Ford, 1985). If both the tests and the job performance measures were racially biased in the same direction, then test fairness analyses might indicate that tests are fair when they are not. These recent studies are flawed both because they ignore the existing evidence on this question and because they analyze the data inappropriately. In the 1970s researchers recognized the possibility that supervisory ratings, the most frequently used measure of job performance, might be biased against blacks, because most of the raters were white. They therefore constructed and used content-valid work sample tests to measure job performance. These studies yielded the same conclusions about test fairness as the studies that used supervisory ratings (e.g., see Campbell et al., 1973; Gael and Grant, 1972; Gael et al., 1975a, 1975b). Also, many studies were conducted in which the criterion was performance in training, assessed by objectively scored content-valid


measures of amount learned. These studies also confirmed the finding of test fairness. Thus we can be reasonably certain that the conclusions of test fairness are not an artifact of racial bias in criterion measures. The approach taken by Bernardin (1984) and Ford et al. (1986) is to use meta-analysis to compare the black-white differences on criterion measures classified as "objective" with those classified as "subjective" or "less direct" measures of job performance. Hunter and Hirsh (1987) have pointed out serious flaws in this research. For example, Bernardin (1984) classified content-valid work sample measures as "indirect measures." He considered measures of turnover and absenteeism to be objective measures of job performance. He interpreted his finding that blacks and whites differed less on turnover and absenteeism than on work sample tests as indicating racial bias in the work sample tests, despite the fact that the work sample tests were not only objective measures, but also had higher construct validity as measures of job performance. Ford et al. (1986) compared black-white differences on supervisory ratings to those for single, partial, and low-reliability measures of job performance: accidents, customer complaints, stock shortages, etc. The difference averaged about .32 SD on these individual fragmentary indices, while it was about .45 SD on supervisory ratings. They interpreted the fact that the difference was larger for the subjective supervisory ratings than for the objective measures as indicating bias in the supervisory ratings. Hunter and Hirsh (1987) pointed out that the difference on the sum of the fragmentary measures, a better but still very incomplete measure of job performance, would probably have been at least as large as the difference on supervisory ratings. For all of these reasons, this research effort appears to be a blind alley; again, there appears to be no simple solution.

CONCLUSION

This overview of measurement-based research points to the conclusion that there is no magic bullet that will solve the problem created by group differences in average measured mental ability. In the long run, increases in educational opportunities and educational achievement by minority students may eliminate or at least reduce the problem. History shows that this has indeed happened for impoverished immigrants from Eastern Europe who entered this country in the late 1800s and early 1900s. Recent findings show that ability test scores of each succeeding generation are higher than those of the preceding generation (Flynn, 1987). This finding has been replicated in 14 different countries, and minority scores apparently increase at about the same rate as those of the majority. However, until recently at least, the black-white difference has remained essentially constant. But findings from the National Assessment of Educational Progress and from the College Entrance Examination Board indicate that


in recent years the gap in achievement and ability test scores for blacks and whites has narrowed somewhat (Jones, 1984). In 1987, the ACT (American College Test) and SAT (Scholastic Aptitude Test) scores of blacks again increased slightly relative to those of whites. This is an encouraging trend. If it continues, the problem of group differences may disappear over the long run. However, the long-term solution will by definition be a long time in coming. The time required may be on the order of 50 to 100 years.

Many people would like to take steps now that would produce an immediate decrease in adverse impact in employment. Earlier we examined one attempt to do this, the minimum competency or low-cutoff method of selection, and saw that it leads to very large losses in employee productivity, losses that may be too large to be acceptable in a highly competitive international economy. Although it comes as a surprise to many people, most of the productivity loss in minimum competency selection systems comes from majority group members, not minority group members. Minimum competency systems usually reduce standards for all applicants. Because most people hired are majority group members, the cumulative productivity loss from hiring less productive members of the majority group is much greater than the loss due to less productive minority workers. However, despite these productivity losses, such systems typically merely reduce the discrepancy in employment rates; they do not eliminate adverse impact.

On the other hand, selection systems based on top-down hiring within each group completely eliminate adverse impact at a much smaller price in lowered productivity. Such systems typically yield 85% or more of the productivity gains attainable with optimal nonpreferential use of selection tests (Cronbach, Yalow, & Schaeffer, 1980; Hunter et al., 1977; Schmidt et al., 1984). This is the system currently used by the U.S. Employment Service in its validity generalization-based GATB (General Aptitude Test Battery) job testing program, a program now used by thousands of employers in nearly all states.
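
The "85% or more" figure can be checked with a short calculation. In the sketch below, the applicant-pool proportions (80% majority, 20% minority), the one-standard-deviation group difference, and the 30% selection ratio are assumptions chosen for illustration; under them, within-group top-down hiring retains roughly nine-tenths of the standardized performance gain of strict top-down selection, consistent with the range cited.

```python
# Sketch: expected standardized test score of the workforce hired under (a) one
# common top-down cutoff and (b) top-down hiring within each group at equal
# selection ratios (which eliminates adverse impact).  Pool proportions, the
# group difference d, and the selection ratio are assumed values.
from scipy.stats import norm
from scipy.optimize import brentq

w_maj, w_min = 0.80, 0.20   # assumed applicant-pool proportions
d = 1.0                     # assumed majority-minority test-score gap, in SD units
sr = 0.30                   # overall selection ratio (hire 30% of applicants)

def mean_z_above(cutoff, mu=0.0):
    """Mean test score of a N(mu, 1) group truncated below at the cutoff, and the share above it."""
    p = 1.0 - norm.cdf(cutoff - mu)
    return mu + norm.pdf(cutoff - mu) / p, p

# (a) One common cutoff chosen so that the whole pool is hired at rate sr.
def hired_fraction_error(c):
    return w_maj * (1 - norm.cdf(c)) + w_min * (1 - norm.cdf(c + d)) - sr

c = brentq(hired_fraction_error, -5, 5)
z_maj, p_maj = mean_z_above(c, 0.0)
z_min, p_min = mean_z_above(c, -d)
top_down = (w_maj * p_maj * z_maj + w_min * p_min * z_min) / sr

# (b) Separate lists: hire the top sr fraction of each group.
z_maj_w, _ = mean_z_above(norm.ppf(1 - sr), 0.0)
within_group = w_maj * z_maj_w + w_min * (z_maj_w - d)   # same truncation, shifted mean

print(f"common cutoff:      mean test z of hires = {top_down:.3f}")
print(f"within-group lists: mean test z of hires = {within_group:.3f}")
print(f"share of the selection gain retained: {within_group / top_down:.0%}")
```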


However, such selection systems raise a host of legal, social, and moral questions, as noted by Lerner (1977, 1979). In particular, they are open to the charge of reverse discrimination. However, a special committee appointed by the National Research Council to examine these questions and to evaluate the legal, social, and economic desirability of such within-group selection systems has recently concluded that it is difficult to identify an alternative selection system that yields a better balanced tradeoff between the objectives of economic efficiency and elimination of adverse impact (Wigdor & Hartigan, 1988). Even this august group could discover no really good solution to this conundrum. Discussion and debate will undoubtedly continue (as it should); there is room for disagreement about what should be done about the problem. But as a result of the research discussed in this article, we at least have a far clearer concept of what exactly the problem is that we are discussing and debating.

REFERENCES

Ackerman, P. L. (1987). Individual differences in skill learning: An integration of psychometric and information processing perspectives. Psychological Bulletin, 102, 3-27.
Albright, L. E., Glennon, J. R., & Smith, W. J. (1963). The uses of psychological tests in industry. Cleveland, OH: Allen.
Bartlett, C. J., Bobko, P., Mosier, S. B., & Hannan, R. (1978). Testing for fairness with a moderated multiple regression strategy: An alternative to differential analysis. Personnel Psychology, 31, 233-241.
Bernardin, H. J. (1984). An analysis of black-white differences in job performance. Paper presented at the 44th Annual Meeting of the Academy of Management, Boston.
Boehm, V. R. (1977). Differential prediction: A methodological artifact? Journal of Applied Psychology, 62, 146-154.
Braithwaite, D. (1976). Personal communication.
Brogden, H. E. (1949). When testing pays off. Personnel Psychology, 2, 171-183.
Brogden, H. E., & Taylor, E. K. (1950). A theory and classification of criterion bias. Educational and Psychological Measurement, 10, 159-186.
Callender, J. C., & Osburn, H. G. (1980). Development and test of a new model of validity generalization. Journal of Applied Psychology, 65, 543-558.
Campbell, J. T., Crooks, L. A., Mahoney, M. H., & Rock, D. A. (1973). An investigation of sources of bias in the prediction of job performance: A six year study (Final Project Report No. PR-73-37). Princeton, NJ: Educational Testing Service.
Cleary, T. A., & Hilton, T. L. (1968). Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement, 5, 115-124.
Cole, N. S. (1973). Bias in selection. Journal of Educational Measurement, 10, 237-255.
Cronbach, L. J., & Gleser, G. (1957). Psychological tests and personnel decisions. Urbana: University of Illinois Press.
Cronbach, L. J., Yalow, E., & Schaeffer, G. (1980). A mathematical structure for analyzing fairness in selection. Personnel Psychology, 33, 693-704.
Darlington, R. B. (1971). Another look at "cultural fairness." Journal of Educational Measurement, 8, 71-82.
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice (1978). Adoption by four agencies of Uniform Guidelines on Employee Selection Procedures. Federal Register, 43, 38290-38315.
Flynn, J. R. (1987). Massive IQ gains in 14 nations: What IQ tests really measure. Psychological Bulletin, 101, 171-191.
Ford, J. K., Kraiger, K., & Schechtman, S. L. (1986). Study of race effects in objective indices and subjective evaluations of performance: A meta-analysis of performance criteria. Psychological Bulletin, 99, 330-337.
Gael, S., & Grant, D. L. (1972). Employment test validation for minority and nonminority telephone company service representatives. Journal of Applied Psychology, 56, 135-139.
Gael, S., Grant, D. L., & Ritchie, R. J. (1975a). Employment test validation for minority and nonminority clerks with work sample criteria. Journal of Applied Psychology, 60, 420-426.
Gael, S., Grant, D. L., & Ritchie, R. J. (1975b). Employment test validation for minority and nonminority telephone operators. Journal of Applied Psychology, 60, 411-419.
Ghiselli, E. E. (1966). The validity of occupational aptitude tests. New York: Wiley.


Grant, D. L., & Bray, D. W. (1970). Validation of employment tests for telephone company installation and repair occupations. Journal of Applied Psychology, 54, 7-14.
Guion, R. M. (1965). Personnel testing. New York: McGraw-Hill.
Haymaker, J. M., & Erwin, F. W. (1986). The generalizability of total score validity of the Supervisory Profile Record. Washington, DC: Richardson, Bellows, and Henry, Inc.
Hirsh, H. R., Schmidt, F. L., Erwin, F. W., Owens, W. A., & Sparks, P. P. (1987). Biographical data in employment selection: Can validities be made generalizable? Unpublished manuscript.
Humphreys, L. G. (1973). Statistical definitions of test validity for minority groups. Journal of Applied Psychology, 58, 1-4.
Hunter, J. E. (1979). An analysis of validity, differential validity, test fairness, and utility for the Philadelphia police officers selection examination prepared by the Educational Testing Service. Report to the Philadelphia Federal District Court, Alvarez vs. City of Philadelphia.
Hunter, J. E. (1980). Validity generalization for 12,000 jobs: An application of synthetic validity and validity generalization to the General Aptitude Test Battery (GATB). Washington, DC: U.S. Employment Service, U.S. Department of Labor.
Hunter, J. E. (1981). The economic benefits of personnel selection using ability tests: A state of the art review including a detailed analysis of the dollar benefit of U.S. Employment Service placements and a critique of the low cutoff method of test use. Report prepared for the U.S. Employment Service, U.S. Department of Labor, Washington, DC.
Hunter, J. E. (1983). The prediction of job performance in the military using ability composites: The dominance of general cognitive ability over specific aptitudes. Report for Research Applications, Inc., in partial fulfillment of DOD Contract F41689-83-C-0025.
Hunter, J. E. (1984). The validity of the Armed Services Vocational Aptitude Battery (ASVAB) high school composites. Report for Research Applications, Inc., in partial fulfillment of DOD Contract F41689-83-C-0025.
Hunter, J. E. (1985). Differential validity across jobs in the military. Report for Research Applications, Inc., in partial fulfillment of DOD Contract F41689-83-C-0025.
Hunter, J. E. (1986). Cognitive ability, cognitive aptitudes, job knowledge, and job performance. Journal of Vocational Behavior, 29, 340-362.
Hunter, J. E., & Hirsh, H. R. (1987). Applications of meta-analysis. In C. L. Cooper and I. T. Robertson (Eds.), Review of industrial psychology (Vol. 2, pp. 321-357). New York: Wiley.
Hunter, J. E., & Hunter, R. F. (1984). The validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72-99.
Hunter, J. E., & Schmidt, F. L. (1976). A critical analysis of the statistical and ethical implications of various definitions of "test bias." Psychological Bulletin, 83, 1053-1071.
Hunter, J. E., & Schmidt, F. L. (1978). Differential and single group validity of employment tests by race: A critical analysis of three recent studies. Journal of Applied Psychology, 63, 1-11.
Hunter, J. E., & Schmidt, F. L. (1982). Fitting people to jobs: Implications of personnel selection for national productivity. In E. A. Fleishman (Ed.), Human performance and productivity. Hillsdale, NJ: Erlbaum.
Hunter, J. E., Schmidt, F. L., & Hunter, R. (1979). Differential validity of employment tests by race: A comprehensive review and analysis. Psychological Bulletin, 86, 721-735.
Hunter, J. E., Schmidt, F. L., & Rauschenberger, J. M. (1977). Fairness of psychological tests: Implications of four definitions for selection utility and minority hiring. Journal of Applied Psychology, 62, 245-260.

Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.
Jones, L. V. (1984). White-black achievement differences. American Psychologist, 39, 1207-1213.
Katzell, R. A., & Dyer, F. J. (1977). Differential validity revived. Journal of Applied Psychology, 62, 137-145.
Kraiger, K., & Ford, J. K. (1985). A meta-analysis of ratee race effects in performance ratings. Journal of Applied Psychology, 70, 56-65.
Ledvinka, J. (1979). The statistical definition of fairness in the federal selection guidelines and its implications for minority employment. Personnel Psychology, 32, 551-562.
Lent, R. H., Aurbach, H. A., & Levin, L. S. (1971a). Predictors, criteria, and significant results. Personnel Psychology, 24, 519-533.
Lent, R. H., Aurbach, H. A., & Levin, L. S. (1971b). Research design and validity assessment. Personnel Psychology, 24, 247-274.
Lerner, B. (1977). Washington v. Davis: Quantity, quality and equality in employment testing. In P. Kurland (Ed.), The 1976 Supreme Court Review. Chicago: University of Chicago Press.
Lerner, B. (1979). Employment discrimination: Adverse impact, validity, and equality. In P. Kurland & G. Casper (Eds.), The 1979 Supreme Court Review. Chicago: University of Chicago Press.
Lilienthal, R. A., & Pearlman, K. (1983). The validity of federal selection tests for aid/technicians in the health, science, and engineering fields. Washington, DC: U.S. Office of Personnel Management, Personnel Research and Development Center.
Linn, R. L. (1978). Single-group validity, differential validity, and differential predictions. Journal of Applied Psychology, 63, 507-514.
Linn, R. L., Harnisch, D. L., & Dunbar, S. B. (1981). Validity generalization and situational specificity: An analysis of the prediction of first year grades in law school. Applied Psychological Measurement, 5, 281-289.
McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., Hunter, J. E., Maurer, S., & Russell, J. (1988). The validity of employment interviews: A review and meta-analysis. Unpublished paper.
O'Connor, E. J., Wexley, K. N., & Alexander, R. A. (1975). Single group validity: Fact or fallacy? Journal of Applied Psychology, 60, 352-355.
Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980). Validity generalization results for tests used to predict training success and job proficiency in clerical occupations. Journal of Applied Psychology, 65, 373-406.
Raju, N. S., & Burke, M. J. (1983). Two new procedures for studying validity generalization. Journal of Applied Psychology, 68, 382-395.

Richardson, Bellows, and Henry. (1981). Supervisory Profile Record technical reports (Vols. 1, 2, and 3). Washington, DC: Author.
Ruch, W. W. (1972). A re-analysis of published differential validity studies. Paper presented at the meeting of the American Psychological Association, Honolulu.
Schmidt, F. L., Berner, J. G., & Hunter, J. E. (1973). Racial differences in validity of employment tests: Reality or illusion? Journal of Applied Psychology, 58, 5-9.
Schmidt, F. L., Gast-Rosenberg, I., & Hunter, J. E. (1980). Validity generalization results for computer programmers. Journal of Applied Psychology, 65, 643-661.
Schmidt, F. L., Greenthal, A. L., Hunter, J. E., Berner, J. G., & Seaton, F. W. (1977). Job sample vs. paper-and-pencil trades and technical tests: Adverse impact and examinee attitudes. Personnel Psychology, 30, 187-197.
Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62, 529-540.

Schmidt, F. L., Hunter, J. E., & Caplan, J. R. (1981). Validity generalization results for two jobs in the petroleum industry. Journal of Applied Psychology, 66, 261-273.
Schmidt, F. L., Hunter, J. E., McKenzie, R., & Muldrow, T. (1979). The impact of valid selection procedures on workforce productivity. Journal of Applied Psychology, 64, 609-626.
Schmidt, F. L., Hunter, J. E., Outerbridge, A. N., & Goff, S. (1988). The joint relation of experience and ability with job performance: A test of three hypotheses. Journal of Applied Psychology, 73, 46-57.

Schmidt, F. L., Hunter, J. E., & Pearlman, K. (1981). Task differences and validity of aptitude tests in selection: A red herring. Journal of Applied Psychology, 66, 166-185.
Schmidt, F. L., Hunter, J. E., Pearlman, K., & Shane, G. S. (1979). Further tests of the Schmidt-Hunter Bayesian validity generalization procedure. Personnel Psychology, 32, 257-281.
Schmidt, F. L., Hunter, J. E., Pearlman, K., & Caplan, J. R. (1981). Validity generalization results for three occupations in the Sears Roebuck Company. Chicago: Sears Roebuck Company.
Schmidt, F. L., Hunter, J. E., Pearlman, K., & Hirsh, H. R. (1985). Forty questions about validity generalization and meta-analysis. Personnel Psychology, 38, 697-798.
Schmidt, F. L., Hunter, J. E., & Urry, V. M. (1976). Statistical power in criterion-related validity studies. Journal of Applied Psychology, 61, 473-485.
Schmidt, F. L., Mack, M. J., & Hunter, J. E. (1984). Selection utility in the occupation of U.S. Park Ranger for three modes of test use. Journal of Applied Psychology, 69, 490-497.
Schmidt, F. L., Pearlman, K., & Hunter, J. E. (1980). The validity and fairness of employment and educational tests for Hispanic Americans: A review and analysis. Personnel Psychology, 33, 705-724.

Society for Industrial and Organizational Psychology, Inc. (1987). Principles for the validation and use of personnel selection procedures (3rd ed.). College Park, MD: Author.
Tenopyr, M. L. (1967). Race and socioeconomic status as moderators in predicting machine-shop training success. Paper presented at the annual meeting of the American Psychological Association, Washington, DC.
Thorndike, R. L. (1971). Concepts of culture fairness. Journal of Educational Measurement, 8, 63-70.
Thorndike, R. L. (1986). The role of general ability in prediction. Journal of Vocational Behavior, 29, 332-339.

U.S. Department of Labor. (1977). Dictionary of occupational titles (4th ed.). Washington, DC: U.S. Government Printing Office.
Wigdor, A. K., & Garner, W. R. (1982). Ability testing: Uses, consequences, and controversies. Washington, DC: National Academy Press.
Wigdor, A. K., & Hartigan, J. A. (1988). Interim report: Within-group scoring of the General Aptitude Test Battery. Washington, DC: National Academy Press.
Wolins, L. (1962). Responsibility for raw data. American Psychologist, 17, 657-658.

Received: July 26, 1988