Comparability of personality test scores for immigrants and majority group members: Some dutch findings


Pergamon

Person. individ. Diff. Vol. 23, No. 5, pp. 849-859, 1997. © 1997 Elsevier Science Ltd. All rights reserved. Printed in Great Britain. PII: S0191-8869(97)00109-8. 0191-8869/97 $17.00+0.00

COMPARABILITY OF PERSONALITY TEST SCORES FOR IMMIGRANTS AND MAJORITY GROUP MEMBERS: SOME DUTCH FINDINGS

Jan te Nijenhuis,1,2* Henk van der Flier1,2 and Liesbeth van Leeuwen1,2

1Department of Work and Organizational Psychology, Vrije Universiteit, Amsterdam, The Netherlands and 2Dutch Railways, Utrecht, The Netherlands

(Received 23 November 1996)

Summary: The questions addressed are whether a taxonomy with the dimensions Neuroticism and Extraversion can be generalized to immigrant groups in the Netherlands, and whether scores on an Eysenck-style personality questionnaire, the ABV (Amsterdam Biographical Questionnaire), can be used for assessment within these groups. Use was made of test data on first-generation immigrants (N = 1302) and majority group members (N = 796). The scales appear to measure the same constructs in the various groups and the majority of the items do not show differential item functioning; therefore the taxonomy can be generalized to immigrant populations and the questionnaire can be put to good use for comparisons within culturally homogeneous groups of non-native born, non-native language immigrants. © 1997 Elsevier Science Ltd

Keywords: personality questionnaires, personality tests, group differences, immigrants, comparability, construct validity.

INTRODUCTION

One of the goals of personality theories is the development of a taxonomy of personality characteristics. An important test of such a taxonomy or structure is its generalizability to other groups (Eysenck, 1991; Costa & McCrae, 1992a; Zuckerman, 1992). Most of the personality models that are based on factor analytical research are hierarchical (Eysenck, 1947; Zuckerman, 1991; Costa & McCrae, 1992b): primary traits correlate and thus form higher-order factors in a factor analysis. How the primary traits cluster into higher-order factors is an empirical question; the ways people behave, feel, and think do not necessarily covary equally in different groups. In the study of group differences, a distinction can be made between culture-specific and culture-general or universal orientations; both result in the personality factors Extraversion (E) and Neuroticism (N) being recovered in most of the groups studied (Katigbak, Church & Akamine, 1996).

Applying a personality questionnaire that has been developed in one group to another group usually results in a number of the items showing different statistical characteristics in this second group. A practical question is therefore whether and under what conditions personality questionnaires can also be used in other groups; for instance, can personality tests be used to make assessments in all groups in countries with immigrants and in countries with a heterogeneous population? (Bernstein, Teng, Grannemann & Garbin, 1987; Green & Kelley, 1988; Takeuchi, Kuo, Kim & Leaf, 1989; Choca, Shanley, Peterson & Van Denburg, 1990). The use of personality tests to assess immigrants is being criticized in both the popular and the scientific press in the Netherlands. These critics assume that these tests are of limited use for assessing persons with a limited knowledge of the Dutch language and culture.
It goes without saying that strict requirements regarding the quality of tests have to be met in applied settings; a conservative strategy is to use tests for assessment only if their validity has been convincingly demonstrated in empirical research. Eysenck and Eysenck (1983) report a great number of studies with translated and adapted versions of the EPQ (Eysenck & Eysenck, 1975) in different groups; in these studies the majority of the items cluster into the same four factors as in the British sample, namely Extraversion, Neuroticism, Psychoticism, and Social Conformity. A small number of the items load differently on the same factors, or even on different factors. The nonequivalent items do not form a fifth factor, neither statistically nor conceptually, which implies that they are probably connected with other traits or are influenced by differences in semantic interpretation. In the various studies, examples were given of items that did not load on the same factors as in the British sample and for which plausible post hoc explanations could be given. On the other hand, there were items that did not load on the right factors and for which such explanations could not be given.

The assessment of the use of a test for comparisons between groups focuses on two different questions. The first is whether in the two groups the same scale score reflects the same position on a latent trait; the second question is whether members of different groups with the same scale score have the same probability of showing specific criterion behavior in the future. These two questions pertain to the traditional distinction between construct validity and predictive validity. We will deal here only with the first question; one way to assess construct validity is to look at the occurrence of quantitative DIF (differential item functioning). An item is said to show quantitative DIF when a group, given a position on a latent trait, obtains a systematically higher or lower score on that item than another group. Eysenck (1986) maintains that quantitative DIF can hardly have played a role in the studies reported by Eysenck and Eysenck (1983) because the factor comparison coefficients would never have reached their reported high values if quantitative DIF had been present. Bijnen, Van der Net and Poortinga (1986) stress the lack of proof for the absence of quantitative DIF in the Eysenck studies.

*To whom all correspondence should be addressed at: Department of Work and Organizational Psychology, Vrije Universiteit, Van der Boechorststraat 1, 1081 BT Amsterdam, The Netherlands.
The occurrence of some quantitative DIF does not always have to be disturbing, because in applied settings and research settings one is working with scale scores that can be obtained by different combinations of item scores; a higher score on one item can be compensated by a lower score on another item. The two questions addressed in this paper are whether a taxonomy with the dimensions Neuroticism and Extraversion can be generalized to four immigrant groups in the Netherlands, and whether the Amsterdamse Biografische Vragenlijst (Amsterdam Biographical Questionnaire) can be used for assessment within these groups.

METHOD

Research participants

This project used test data on first-generation immigrants and majority group members who applied for blue collar jobs at the Dutch Railways and regional bus companies in the Netherlands from 1988 until 1992. The application process included a psychological examination which took place at the Work Conditions Service Unit of the Dutch Railways in ten centers throughout the Netherlands. The immigrant sample constituted the complete population of first-generation immigrant job applicants. A subsample was selected from the majority group in such a way that the distribution with respect to the jobs and regions in this subsample was as close to that in the immigrant group as possible; Table 1 shows the distributions of the groups in terms of demographic variables. The immigrant group from countries classified as other consists of persons from Asia (with the exception of Turkey, Israel, Japan, and the countries of the former Soviet Union), Africa (with the exception of South Africa and the North African countries), South America (with the exception of Surinam), and Central America (with the exception of the Netherlands Antilles). In view of the overly large heterogeneity of this group, its test scores are only reported in analyses of all the immigrants, treated as a single group. The mean number of years that the immigrants had been residing in the Netherlands at the time of their application was 11.2 yr (SD = 6.9 yr).

Table 1. Distribution of the immigrant group members and the majority group members with respect to native country, size of subgroup, percentage of males, and age

Native country          n      % males   Mean age
Turkey                  271    96.4      24.0
North Africa            165    97.4      27.4
Netherlands Antilles    126    83.5      31.2
Surinam                 524    82.5      30.3
Other                   216    85.3      30.6
Netherlands             796    87.7      28.4

Note. The group of North Africans consists of persons from Morocco, Algeria, Tunisia, Libya, and Egypt.

Test

The Amsterdam Biographical Questionnaire (Amsterdamse Biografische Vragenlijst, ABV) (Wilde, 1970), one of the two personality questionnaires used most frequently in the Netherlands, is mainly based on the MPI (Maudsley Personality Inventory, Eysenck, 1959); it consists of the following four scales: Neuroticism (N) as manifested by the presence of psycho-neurotic complaints (30 items), Neurosomatism (NS) as manifested by the presence of functional somatic complaints (17 items), Extraversion (E) (21 items), and Social Conformity or Lie (L) (23 items). In contrast to the later Eysenck questionnaires, the E scales of the MPI and the ABV consist of the dimensions Sociability and Impulsivity. The L scale had originally been included in questionnaires to control for social desirability as an answering tendency during the completion of the questionnaire, but research into the construct validity of the scale showed that it has the character of a stable personality trait rather than that of a situation-dependent answering tendency which disturbs the validity of the other scales. This personality factor may be described as Social Conformity. Consistent, stable individual differences on the L scale have been found that correlate with other measures, such as adjustment (Eysenck & Eysenck, 1975; McCrae & Costa, 1983). The mean scores on the L scale of the EPQ in West European countries are lower than those in Southern European and Third World countries (Barrett & Eysenck, 1984). Nyborg, Eysenck and Kroll (1982) suppose that in societies where many kinds of behavior are acceptable, people will obtain lower L scores and that in restrictive societies people will obtain higher L scores. The ABV consists of 91 items that can be answered with yes, ?, or no and that are scored with a key for each item. Half of the items of the N, E, and L scales of the ABV are the same as those of the corresponding scales of the EPQ.
A review of test research in the Netherlands by Evers, Van Vliet-Mulder and Ter Laak (1992) showed that the ABV is one of the best personality tests in the Netherlands: it has good predictive validity, content validity, and construct validity. Scale scores of the majority group members are therefore generally considered to provide a good indication of their personality.

Statistical analyses

Means and reliabilities. The deviation of the mean scale scores of the immigrants from the mean scale scores of the majority group members was calculated in terms of the standard deviation of the majority group. The mean scale score by sex was computed for groups having a sufficient number of women. 99% of the majority group women and 95% of the Surinamese and Antillian women had applied for the positions of bus driver and train guard. These groups were proportionally matched with men who had applied for the same positions. To assess the reliability, Cronbach's alpha was chosen.

Dimensional comparability. There are many procedures and measures to compare factor solutions in two groups, between which it is difficult to choose. A conclusion from a comparative study (Barrett, 1986) is that there is no single solution to the problem of comparison of factor solutions that is acceptable to every researcher. The suggested strategy (Barrett, 1986) is to use different procedures and measures in the same study so as to reduce the influence of the weaknesses of each. The dimensional comparability was therefore examined by means of two combinations of a procedure with a measure. Principal Component Analysis (PCA) solutions with Varimax rotation of the majority group and the various immigrant groups were compared with the congruence coefficient (Tucker, 1951). A value of the congruence coefficient greater than 0.85 is generally considered to be high. The Pecon procedure (Ten Berge, 1986) was developed for component comparisons across populations, using (rotation to perfect) congruence. Ten Berge states that for any given component in a first population a parallel component in any second population can be defined which has, up to a constant of proportionality, the same component scores coefficients. When using the Pecon procedure, one begins by deriving a theoretically sound factor structure for one group.
Then the weight matrix of this first group is applied to the correlation matrix of the second group. In this way, the second group's factor structure has by definition the same meaning as the original group's factor structure. The question is then whether the same factors explain as much variance in the second group as in the first group. In this study, the weight matrix of the PCA with Varimax rotation of the majority group was used as a starting point for Pecon, because the test scores of majority group members are generally considered to provide a good indication of their personality.

To reduce the number of comparisons in the main analyses and to diminish the risk of accidental deviations or deviations of little practical significance, we checked whether the data from the Turks and the North Africans on the one hand, and the data from the Surinamese and the Antillians on the other, could be combined. These combinations seemed obvious, considering the similarity between the Surinamese and Antillians with respect to their proficiency in Dutch. Through education and exposure to the media in their native countries, the Surinamese and Antillians came into contact with the Dutch language and culture, which is not true of the Turks and North Africans. With the aid of LISREL-7 (Jöreskog & Sörbom, 1988), the hypothesis that the correlation matrices of two groups are equal was tested, both at the level of the scales and the items. When working with large samples, even small differences between the correlation matrices can lead to large chi-square values; these chi-square values will make the small differences significant (Jöreskog, 1969). For that reason, various researchers have suggested additional goodness-of-fit measures, such as the Goodness-of-Fit Index (GFI) (Jöreskog & Sörbom, 1988). The GFI indicates to what extent the correlation matrix of one of two groups deviates from a common matrix that is estimated by the LISREL program.
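As an illustration of the congruence coefficient used for the first comparison, the following minimal sketch computes Tucker's coefficient for two loading vectors: the sum of cross-products of the loadings divided by the square root of the product of their sums of squares. The function name is hypothetical and the code is not the software used in the study; the example values are the Neuroticism loadings of the four scales for the majority group and for the combined Turkish and North African group as reported in Table 7.

```python
import math


def congruence_coefficient(loadings_a, loadings_b):
    """Tucker's congruence coefficient between two factor loading vectors."""
    if len(loadings_a) != len(loadings_b):
        raise ValueError("loading vectors must have the same length")
    cross_products = sum(a * b for a, b in zip(loadings_a, loadings_b))
    norm_a = math.sqrt(sum(a * a for a in loadings_a))
    norm_b = math.sqrt(sum(b * b for b in loadings_b))
    return cross_products / (norm_a * norm_b)


# Loadings of the N, NS, E, and L scales on the Neuroticism factor (Table 7).
majority = [0.72, 0.94, -0.13, -0.15]
turks_north_africans = [0.69, 0.94, -0.20, -0.13]

phi = congruence_coefficient(majority, turks_north_africans)
print(round(phi, 3))  # about 0.998, in line with the value reported in Table 8
```

Unlike a Pearson correlation, the coefficient does not center the loadings, so it is sensitive to the sign and overall level of the loadings as well as to their pattern.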

Qualitative and quantitative DIF

DIF (differential item functioning) research starts with a definition of what constitutes differentially functioning and nondifferentially functioning items. Then, on the basis of statistical procedures that are operationalizations of the definition of DIF, the question of which items show DIF can be addressed. The term DIF is used because it pertains only to statistical deviance; the items that show DIF deviate only in a statistical sense from the other items. Finally, based on the statistical results and other information, hypotheses are formulated about the qualities of the tested persons or the items (or both) that might be responsible for the statistical deviance. An item is said to show qualitative DIF when its loading changes in a factor analysis. An item is said to show quantitative DIF when a group, given a position on a latent trait, obtains a systematically higher or lower score on that item than another group. Stricker's (1982) partial correlation index was chosen as a check for quantitative DIF. Unlike other DIF techniques, it is also applicable to nondichotomous items. The partial correlation index corrects the correlation between item score and subgroup for the scale score, minus the item score. Because the samples were large, the partial correlation index was already significant (p < 0.001, two-tailed) at a value around 0.10. In this study an item will be considered as showing quantitative DIF when the absolute value of the partial correlation index is larger than 0.20; a value below 0.20 means that less than 4% of the variance in the item is explained by DIF, which can be considered a small effect. A serious weakness of conditional DIF indices like the partial correlation index is that if the scale contains a relatively large number of DIF items the scale score may be biased. This can mean that some DIF items are not classified as such, and that some non-DIF items are erroneously classified as showing DIF.
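The partial correlation index can be sketched as a first-order partial correlation between item score and group membership, controlling for the rest score (the scale score minus the item score). The sketch below is an illustration under stated assumptions, not Stricker's (1982) or the authors' implementation: responses are assumed to be stored as rows of numeric item scores, group membership is assumed to be coded 0/1, and the function names and toy data are hypothetical. The 4% figure in the text follows from squaring the cut-off: 0.20 × 0.20 = 0.04.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def partial_correlation_index(responses, group, item):
    """Correlation of item score with group, partialling out the rest score."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]  # scale minus item
    r_ig = pearson(item_scores, group)
    r_is = pearson(item_scores, rest_scores)
    r_gs = pearson(group, rest_scores)
    return (r_ig - r_is * r_gs) / math.sqrt((1 - r_is ** 2) * (1 - r_gs ** 2))


# Hypothetical toy data: three items, four respondents per group.
group = [0, 0, 0, 0, 1, 1, 1, 1]
base = [[0, 0, 0], [0, 1, 0], [1, 1, 1], [1, 2, 2]]
no_dif = base + base                          # identical responses in both groups
shifted = [[row[0] + 1, row[1], row[2]] for row in base]
with_dif = base + shifted                     # group 1 scores one point higher on item 0

index_no_dif = partial_correlation_index(no_dif, group, item=0)
index_dif = partial_correlation_index(with_dif, group, item=0)
flagged = abs(index_dif) > 0.20               # cut-off used in the study
```

In the toy data the two groups have identical rest scores, so the index for the shifted item reflects only the group difference on that item. The sketch computes a single pass and leaves the rest score uncorrected for other DIF items.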
Therefore an iterative procedure was used for the computation of the partial correlation index (Van der Flier, Mellenbergh, Adèr & Wijn, 1984). Briefly, this method involves removing DIF items from the scale (or, more accurately, not counting them in determining the scale score) so that the number of items removed increases by one with each successive iteration. After the first iteration, the item that showed the most DIF was not included for the computation of the scale score, and the partial correlation index was computed again. Next, the two items that showed the most DIF were identified and not included for the computation of the scale score, the partial correlation index was computed again, and so forth. The procedure was ended when all items with an absolute value of the partial correlation index larger than 0.20 had been excluded from the computation of the scale score, or when less than half of the items were left over. This was done to prevent there being too few items left to reliably estimate the position on the trait.

After the items that showed DIF had been identified, the next task was to explain the statistical deviance. We checked whether the quantitative DIF could be explained by means of a lower level of proficiency in Dutch. For an estimate of proficiency in Dutch, use was made of the job applicants' scores on the Dutch version of the General Aptitude Test Battery (Te Nijenhuis & Van der Flier, in press). A PCA solution with Varimax rotation of the eight GATB subtests showed a clear Verbal Abilities factor; the factor score of the Verbal Abilities factor was computed by means of the regression method for every individual. The unrotated factor solution showed a clear g factor that explained around half of the variance; the factor scores were computed here as well. For a good estimate of proficiency in Dutch, the factor score of the g factor was partialed out of the factor score of the Verbal Abilities factor. The resulting language proficiency score was correlated with the item scores. Again, a correlation higher than 0.20 was chosen as indicating a relevant relation between proficiency in Dutch and the item score. After that, the agreement measure Kappa was used to check whether quantitative DIF (partial correlation index larger than 0.20) coincided with a relatively high correlation (higher than 0.20) with a measure of proficiency in Dutch. Post hoc inspection of the DIF items was conducted, with the emphasis on identifying striking features that could be related to well-known differences between majority and immigrant groups.

Next, the effect of the biased items on the mean scores of the immigrants was estimated. To this end, the effect of the removal of items that showed DIF on the scale scores was computed. This was done by first computing the deciles for the majority group. After this, the distribution of the scores of the complete immigrant group over the deciles was computed. Then the items that showed quantitative DIF for the complete immigrant group were removed and the procedure was repeated to find out whether the distributions became more alike. Finally, the relationship between qualitative and quantitative DIF was also checked using Kappa.

RESULTS

Means

Table 2 shows the mean scores of the majority group and the immigrant groups. The immigrants showed higher mean scores on the Neuroticism, Neurosomatism and Social Conformity or Lie scales and lower mean scores on the Extraversion scale. The differences were larger for the Turks and North Africans than for the Surinamese and Antillians. Table 3 shows the sex differences of the mean scale scores. The usual sex differences were only found for the immigrant group on the N and NS scales. Female immigrant group members scored 0.30 SD lower on L. In all other cases the sex differences were small.
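The deviation scores reported in the tables express a group mean as a distance from the majority mean in units of the majority group's standard deviation. The short sketch below illustrates that arithmetic for the N scale of the complete immigrant group, using the majority mean (26.51), the majority SD (13.47), and the reported deviation (0.62) from Table 2. The function name is hypothetical, and the implied immigrant mean is only approximate because the published deviation is rounded to two decimals.

```python
def standardized_deviation(group_mean, majority_mean, majority_sd):
    """Deviation of a group mean from the majority mean in majority SD units."""
    return (group_mean - majority_mean) / majority_sd


# N scale values from Table 2.
majority_mean_n = 26.51
majority_sd_n = 13.47
dev_all_immigrants_n = 0.62

# Raw-score mean implied by the reported deviation (approximate, see lead-in).
implied_mean_n = majority_mean_n + dev_all_immigrants_n * majority_sd_n

# Converting the implied mean back reproduces the reported deviation.
round_trip = standardized_deviation(implied_mean_n, majority_mean_n, majority_sd_n)
```

Expressing differences this way puts all four scales on a common metric, even though the scales differ in number of items and raw-score range.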

Table 2. Means and standard deviations of the scales for the majority group; deviations from the mean of the majority group expressed in standard deviations of the majority group (Dev.) and standard deviations (SD) on the scales by immigrant group

Scale   Majority        Turks          North Africans  Antillians     Surinamese     All immigrants
        M      SD       Dev.   SD      Dev.   SD       Dev.   SD      Dev.   SD      Dev.   SD
N       26.51  13.47     0.79  17.40    0.93  18.13     0.43  16.40    0.50  16.34    0.62  17.01
NS      13.39   2.20     1.05   3.75    1.19   3.97     0.37   3.04    0.28   2.62    0.63   3.35
E       66.78  13.19    -0.36  11.49   -0.50  11.61    -0.08   9.94   -0.05  11.65   -0.19  11.60
L       42.53   9.94     0.78   9.60    0.69   9.24     0.56   9.83    0.52   9.71    0.56   9.69

Note. The minimum number of subjects was 790 for the majority group, 255 for the Turks, 146 for the North Africans, 123 for the Antillians, 512 for the Surinamese, and 1245 for all the immigrants. N = Neuroticism, NS = Neurosomatism, E = Extraversion, L = Social Conformity or Lie.

Table 3. Sex differences in mean scale scores for the majority group and for Antillians and Surinamese; deviations of the mean of the female group from the mean of the male group in standard deviations of the total subgroup (Dev.)

Scale   Majority                        Antillians + Surinamese
        Males M   Females M   Dev.      Males M   Females M   Dev.
N       25.08     26.91        0.11     31.84     35.89        0.30
NS      13.13     13.18        0.02     13.84     14.36        0.50
E       67.86     68.50       -0.06     67.18     65.96       -0.09
L       43.16     42.62       -0.06     48.47     45.52       -0.30

Note. The minimum number of subjects was 403 for majority men, 91 for majority women, 409 for Antillian and Surinamese men, and 105 for Antillian and Surinamese women. The sample consists of job applicants for the position of bus driver and train guard.

Table 4. Values of reliability coefficients (Alpha) for majority group members and immigrants

Group             N      NS     E      L
Majority          0.82   0.46   0.78   0.79
Turks             0.82   0.59   0.65   0.74
North Africans    0.83   0.65   0.67   0.69
Antillians        0.84   0.62   0.56   0.76
Surinamese        0.82   0.47   0.69   0.77
All immigrants    0.83   0.60   0.67   0.75

Reliabilities

Table 4 shows the reliability coefficients for the groups; they are satisfactory for the majority group for the N, E, and L scales; the NS scale has an alpha of only 0.46, but also has the smallest number of items. The values of alpha for the N scale were virtually the same for the immigrants and the majority group. Generally, the reliability of the NS scale is higher in the immigrant group than in the majority group. The reliability of scale E is lower for the immigrants, with too low a value for the Antillian group. The L scale shows a somewhat lower value for alpha in the immigrant groups.
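For readers who want to see how a coefficient such as those in Table 4 is obtained, the following minimal sketch computes Cronbach's alpha from a respondents-by-items matrix as k/(k - 1) times one minus the ratio of the summed item variances to the variance of the total score. The function names and the small response matrix are hypothetical illustrations; they are not the ABV data, and the sketch assumes complete responses with no missing values.

```python
def sample_variance(values):
    """Unbiased sample variance (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])
    item_variances = [
        sample_variance([row[j] for row in responses]) for j in range(k)
    ]
    total_variance = sample_variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses of five persons to three items scored 0, 1, or 2.
toy_responses = [
    [2, 1, 2],
    [0, 0, 1],
    [1, 1, 1],
    [2, 2, 2],
    [0, 1, 0],
]
alpha = cronbach_alpha(toy_responses)
print(round(alpha, 3))  # 0.865
```

Because alpha depends on the number of items as well as on their intercorrelations, a short scale such as NS (17 items) can be expected to yield a lower value than a longer scale with comparable item intercorrelations.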

Dimensional comparability

LISREL was used to test the hypothesis that the correlation matrices of the immigrant groups were similar. Table 5 shows the results of comparing the correlation matrix of the Turks with the correlation matrix of the North Africans, and of comparing the correlation matrix of the Antillians with the correlation matrix of the Surinamese. The correlation matrices of the scale scores for the Turks and North Africans and those for the Antillians and Surinamese did not differ significantly at the 0.05 level; the lowest value of the GFI was 0.991 for the Turks and North Africans and 0.981 for the Antillians and Surinamese. These two comparisons indicated that there were no systematic differences in the correlation matrices of the scale scores for the compared groups. The a priori division of the groups in the Method section was consistent with the outcomes of the LISREL analyses.

Table 5. Comparison of the correlation matrices of the Turks with the correlation matrices of the North Africans and of the correlation matrices of the Antillians with the correlation matrices of the Surinamese with LISREL

Level    GFI Turks        GFI North Africans   χ²        df     p
Scales   0.997            0.991                4.02      6      0.673
Items    0.874            0.624                4450.41   4095   0.000

Level    GFI Antillians   GFI Surinamese       χ²        df     p
Scales   0.981            0.999                5.25      6      0.513
Items    0.578            0.960                4673.14   4095   0.000

Note. GFI = Goodness of Fit Index, χ² = Chi-square, df = degrees of freedom, p = probability.

Table 6 shows the eigenvalues for the majority group; the Turks and North Africans; and the Antillians and Surinamese. The eigenvalues did not drop off sharply; a three factor solution was chosen because three factors could be expected on theoretical grounds. Table 7 shows the factor solutions; the factors could be interpreted as Neuroticism, Extraversion and Social Conformity (SC). Table 8 shows that the congruence coefficients for the comparison of the solutions after Varimax rotation were high, ranging from 0.995 to 0.999.

Table 6. Eigenvalues for the majority group (Maj.); the Turks and North Africans (T + N); and the Antillians and Surinamese (A + S)

Factor   Maj.   T + N   A + S
1        1.87   1.88    1.76
2        1.01   1.07    1.09
3        0.70   0.58    0.70
4        0.42   0.47    0.45

Table 7. Varimax rotated factor matrices for the majority group (Maj.); the Turks and North Africans (T + N); and the Antillians and Surinamese (A + S)

Scale   Neuroticism              Extraversion             Social Conformity
        Maj.    T + N   A + S    Maj.    T + N   A + S    Maj.    T + N   A + S
N        0.72    0.69    0.78    -0.36   -0.34   -0.27    -0.31    0.34   -0.25
NS       0.94    0.94    0.92    -0.01   -0.10    0.00    -0.04   -0.02   -0.01
E       -0.13   -0.20   -0.13     0.97    0.96    0.98    -0.03    0.07   -0.07
L       -0.15   -0.13    0.13    -0.02   -0.06   -0.07     0.98    0.97    0.98

Table 8. Congruence coefficients for every factor after Varimax rotation at the level of the scales for Turks and North Africans (T + N); and Antillians and Surinamese (A + S)

Group    Neuroticism   Extraversion   Social Conformity
T + N    0.998         0.995          0.999
A + S    0.999         0.995          0.997

The Pecon procedure was used for the comparison of the percentage of explained variance of the Varimax rotated solution of the majority group and the solutions after rotation to perfect congruence of the immigrant groups. The weight matrix of the three factor Varimax solution of the majority group applied to the correlation matrix of the Turkish and North African group and the Antillian and Surinamese group yielded the following percentages of explained variance: for the majority group 89.01%; for the Turks and North Africans 88.25%; and for the Antillians and Surinamese 88.81%. This means that the factors in the three groups, which have by definition the same interpretation, explained virtually the same amount of variance in the immigrant groups. Taken as a whole, the deviations in the factor structures of the immigrant groups are small.

Qualitative and quantitative DIF

Table 5 shows that at the level of individual items the differences between Turks and North Africans and between Antillians and Surinamese were statistically significant (p < 0.01). The analyses at the level of individual items were therefore carried out separately for every immigrant group. A factor analysis at the level of the individual items for the majority group yielded a three factor solution that could be interpreted well and had only a few items that did not load well. The factors could be interpreted as Neuroticism, Extraversion, and Social Conformity. The factor solution of the four immigrant groups yielded three factor solutions with the same interpretation as in the majority group. A comparison of these solutions with the solution of the majority group yielded some items that showed qualitative DIF. The presence of qualitative DIF was assessed mainly by changes in the loading of an item on the factor on which it would be theoretically expected to load. The items that showed qualitative DIF were not the same for all immigrant groups. Table 9 shows the congruence coefficients between the factor solutions of the majority group and immigrant groups for all three factors. The values of the congruence coefficients were high (> 0.85) for the factor N and, in general, low (< 0.85) for factors E and SC. The factor N turned out to be comparable to that of the majority group; the factors E and SC were less comparable. It should be taken into account that the size of the groups, especially the Antillian and North African groups, was small in proportion to the number of items, so that the factor solution was not always stable, which means that items may switch scales.

Table 9. Congruence coefficients at the level of individual items between the factor solutions of the majority group and the immigrant groups for every factor

Factor   Turks   North Africans   Antillians   Surinamese
N        0.916   0.873            0.869        0.940
E        0.659   0.689            0.551        0.885
SC       0.762   0.766            0.806        0.888

Note. N = Neuroticism, E = Extraversion, SC = Social Conformity.

Table 10. Percentage of items showing quantitative DIF for the minority groups for every scale

Scale   Turks   North Africans   Antillians   Surinamese   All immigrants
N       20      10               7            10           10
NS      18      12               6            0            6
E       14      10               0            5            10
L       17      22               4            26           17

Table 10 shows the percentage of items that show quantitative DIF, for every scale. The number of items that showed quantitative DIF was the largest for the Turks (16), followed by the North Africans (12), the Surinamese (10), and the Antillians (4). Ten items showed quantitative DIF for the complete immigrant group, about 11% of the 91 items. The L scale contained the largest percentage of items that showed quantitative DIF. In this study, quantitative DIF is to be interpreted as answering yes or no more often than majority group members with the same scale score. In the majority group, none of the items correlated higher than 0.20 with the measure of proficiency in Dutch. In the immigrant groups, a small number of items correlated higher than 0.20 with the measure of proficiency in Dutch; for the Turks 6 items, for the North Africans 10 items, for the Antillians 7 items, and for the Surinamese 4 items. For three out of four immigrant groups there is some agreement between the items with respect to a relatively high correlation with the measure of proficiency in Dutch and quantitative DIF; for the Turks Kappa = 0.30 (p = 0.001); for the North Africans Kappa = -0.02 (p = 0.83); for the Antillians Kappa = 0.26 (p = 0.015); and for the Surinamese Kappa = 0.26 (p = 0.006). Post hoc inspection of items that showed quantitative DIF resulted in explanations in terms of the importance of the relationship with acquaintances, a traditional way of living, a relatively unfavourable socio-economic position, and a low level of proficiency in Dutch.
An item on the E scale showing quantitative DIF was: "Do you prefer to limit your contacts with other people to a small number of good friends and acquaintances?". Immigrants responded yes relatively more often, which may be explained by the fact that Turks and North Africans, in particular, prefer to limit their social intercourse to their so-called ingroup. An item on the L scale showing quantitative DIF was: "As a child, did you do as you were told immediately and without grumbling?". Immigrants responded yes relatively more often, which may be explained by the fact that in traditional societies parents are often authority figures to their children. An item on the N scale showing quantitative DIF was: "Do you worry about awful things that might happen?". Immigrants responded yes more often, which may be explained by the fact that immigrants often find themselves in a relatively unfavourable socio-economic position, which may lead to worries about possible setbacks. An item on the NS scale showing quantitative DIF was: "Do you have bowel movements every day?". Immigrants responded no more often, which may simply have indicated that the expression 'bowel movement' was unfamiliar. It was not possible to find a post hoc explanation for some other items that showed quantitative DIF; in addition, the expectation that certain items would show quantitative DIF was not borne out. The effect of the quantitative DIF on the mean scores of the immigrants was estimated. For reasons of clarity, Table 11 distinguishes between a low score (deciles 1 to 3), an average score (deciles 4 to 7), and a high score (deciles 8 to 10). After removing the DIF items, the distribution of the scores of the immigrants came closer to the distribution of the majority group for all the scales. The quantitative DIF therefore influenced the decile into which immigrants were classified.
Experienced selection psychologists at the Work Conditions Service Unit of the Dutch Railways, however, regard the influence of a one-decile difference on the chance of being selected for a position as small. We checked whether the same items showed qualitative and quantitative DIF. No clear relation was found; for the Turks Kappa = 0.047 (p = 0.655); for the North Africans Kappa = 0.104 (p = 0.322); for the Antillians Kappa = 0.113 (p = 0.266); and for the Surinamese Kappa = 0.104 (p = 0.322).

Table 11. Distribution over deciles of the Dutch group for the complete immigrant group before and after biased items are removed

Scale   Decile   % Before   % After
N       1-3      14.5       16.7
N       4-7      35.4       39.7
N       8-10     50.2       43.6
NS      1-7      57.7       61.6
NS      8-10     42.4       38.5
E       1-3      39.4       35.3
E       4-7      45.8       46.6
E       8-10     14.8       18.1
L       1-3      14.4       16.9
L       4-7      36.3       44.8
L       8-10     49.4       38.4

DISCUSSION

The results show that a taxonomy with the dimensions Neuroticism and Extraversion can be generalized to immigrant populations in the Netherlands. The results provide important indications that the ABV can be used for assessment within immigrant groups.

Means and reliabilities

The immigrants showed higher mean scores on the Neuroticism, Neurosomatism, and Social Conformity scales and lower scores on the Extraversion scale. This profile is inconsistent with the regular job requirements of emotional stability, sociability, and flexibility, so this would imply a lower suitability for many positions. The usual sex differences on personality tests, the males scoring higher than the females on E and lower on N and L, only manifested themselves on the N and NS scales for the immigrant group. The scores on the L scale clearly went in the opposite direction for the immigrant group. The other differences were small. The values of coefficient alpha were lower for the immigrants on scale E and slightly lower on scale L; the scales N and NS had an equal or even higher value. On average, the alphas had a comparable value.
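For readers less familiar with coefficient alpha, the following minimal sketch shows the standard computation from an item-by-respondent score matrix. The function name and the four-respondent, three-item response matrix are hypothetical and serve only to illustrate the formula; they are not ABV data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Coefficient alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    # Sample variance of the total (scale) score.
    total_var = variance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical yes (1) / no (0) answers of four respondents to three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 0],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

Computing alpha separately per group and per scale, as in this sketch, is what allows the kind of comparison of reliabilities summarized above.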

Dimensional comparability

Group differences in factor loadings may reflect changes in the construct validity of scales. The factor analyses showed that the factor solution at the level of the scale scores of the majority group was also clearly evident in the immigrant groups; there were only small changes in factor loadings, and the values of the congruence coefficients were high; the percentages of explained variance in the majority and immigrant groups were quite similar. The results of the analyses of the dimensional comparability turned out to be consistent with one another. More generally, the dimensional structures were found to be the same to a large extent. The scales therefore appear to measure the same constructs in the various groups.
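A congruence coefficient summarizes how similar two columns of factor loadings are. The sketch below assumes Tucker's coefficient, a common choice for this purpose; this passage does not spell out the formula, so treat that choice as an assumption. The function name and the loadings are hypothetical and are not values from the study.

```python
from math import sqrt


def tucker_congruence(x, y):
    """Tucker's congruence coefficient between two loading vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("loading vectors must be non-empty and of equal length")
    cross = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if norm == 0:
        raise ValueError("congruence is undefined for an all-zero loading vector")
    return cross / norm


# Hypothetical loadings of four scales on one factor in two groups.
majority_loadings = [0.80, 0.70, 0.10, -0.20]
immigrant_loadings = [0.75, 0.65, 0.20, -0.10]
phi = tucker_congruence(majority_loadings, immigrant_loadings)
print(round(phi, 3))
```

Values close to 1 indicate that the factor has nearly the same loading pattern in both groups, which is the sense in which high congruence supports dimensional comparability.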

Qualitative and quantitative DIF

The majority of the items showed neither qualitative nor quantitative DIF. The items showed the most DIF for the Turks and North Africans and the least DIF for the Antillians and Surinamese. Qualitative and quantitative DIF were apparently unrelated. The items that showed quantitative DIF turned out to partially compensate for each other. This means that if the items that showed quantitative DIF were left out, then the score profile of the immigrants could be expected to improve only slightly. Post hoc explanations could be found for several items that showed quantitative DIF. These post hoc explanations were based on well-known differences between the immigrant and the majority group, such as the importance of the relationships with acquaintances, the traditional way of living, the unfavourable socio-economic position, and a low level of proficiency in Dutch. There was some support for the hypothesis that the quantitative DIF is partially caused by the immigrants' lower level of Dutch language proficiency.

The high L scores of the immigrants are comparable to the high L scores found in many nonwestern countries. These scores may reflect the high level of social conformity and the lack of flexibility in the value systems of a more traditional society.

A conclusion is that a taxonomy with the basic dimensions N and E can be generalized to immigrant populations in the Netherlands that originate from Turkey, Northern Africa, the Netherlands Antilles, and Surinam. Important indications that the ABV can be used for assessment within immigrant groups are that the scales appear to measure the same constructs in the various groups, that the reliabilities are comparable to those in the majority group, and that most of the items are equivalent; the sex differences do not yield any important indications. More validity data are of course required to allow strong conclusions.

The data allow for some comments about the use of the scales for comparison between groups. The test has construct validity in the immigrant groups; comparability of dimensional structures and a partial compensation of the items showing quantitative DIF would seem to suggest that immigrants and majority group members with the same scale scores do not obtain widely different positions on the latent trait. Whether the same position on a personality scale for persons from different groups leads to the same social behavior, or whether cultural influences affect the predictive validity, is a matter for empirical research. Until this research has been carried out, selection psychologists have to take the possibility of differing test-criterion relationships into account in the assessment of immigrants.

Practical consequences

There seem to be two options for dealing with DIF: (a) retain all items, that is, both the core of universal items and the items showing DIF, or (b) drop the items showing DIF and retain only the universal items. For comparisons within groups using subgroup norms, it seems preferable to retain all items and if necessary adjust the scoring key. By doing this, the homogeneity of the scales is optimized.

CONCLUSION

A taxonomy with the dimensions Neuroticism and Extraversion can be generalized to immigrant populations in the Netherlands. The test can be put to good use for comparisons within culturally homogeneous groups of non-native born, non-native language minorities.

Acknowledgement: The Training and Recruitment Foundation of the Dutch Railways provided financial support to this research.

REFERENCES

Barrett, P. (1986). Factor comparison: an examination of three methods. Personality and Individual Differences, 7, 327-340.
Barrett, P. & Eysenck, S. B. G. (1984). The assessment of personality factors across 25 countries. Personality and Individual Differences, 5, 615-632.
Bernstein, I. H., Teng, G., Grannemann, B. D. & Garbin, C. P. (1987). Invariance in the MMPI's component structure. Journal of Personality Assessment, 51, 522-531.
Bijnen, E. J., Van der Net, T. Z. J. & Poortinga, Y. H. (1986). On cross-cultural comparative studies with the Eysenck Personality Questionnaire. Journal of Cross-Cultural Psychology, 17, 3-16.
Choca, J. P., Shanley, L. A., Peterson, C. A. & Van Denburg, E. (1990). Racial bias and the MCMI. Journal of Personality Assessment, 54, 479-490.
Costa, P. T., Jr. & McCrae, R. R. (1992a). Four ways five factors are basic. Personality and Individual Differences, 13, 653-665.
Costa, P. T., Jr. & McCrae, R. R. (1992b). Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources.
Evers, A., Van Vliet-Mulder, J. C. & Ter Laak, J. (1992). Documentatie van tests en testresearch in Nederland. [Documentation of tests and test research in the Netherlands] Assen: Van Gorcum.
Eysenck, H. J. (1947). Dimensions of personality. New York: Praeger.
Eysenck, H. J. (1959). Manual for the Maudsley Personality Inventory. London: University of London Press.
Eysenck, H. J. (1986). Cross-cultural comparisons. The validity of assessment by indices of factor comparison. Journal of Cross-Cultural Psychology, 17, 506-515.
Eysenck, H. J. (1991). Dimensions of personality: 16, 5 or 3? Criteria for a taxonomic paradigm. Personality and Individual Differences, 12, 773-790.
Eysenck, H. J. & Eysenck, S. B. G. (1975). Manual of the Eysenck Personality Questionnaire (Junior and Adult). London: Hodder & Stoughton.
Eysenck, H. J. & Eysenck, S. B. G. (1983). Recent advances in the cross-cultural study of personality. In J. N. Butcher & C. D. Spielberger (Eds.), Advances in personality assessment, 2. Hillsdale, NJ: Lawrence Erlbaum Associates.
Green, S. B. & Kelley, C. K. (1988). Racial bias in prediction with the MMPI for a juvenile delinquent population. Journal of Personality Assessment, 52, 263-275.
Jöreskog, K. G. (1969). A general approach to the confirmatory maximum likelihood factor analysis. Psychometrika, 34, 183-202.
Jöreskog, K. G. & Sörbom, D. (1988). Lisrel-7 (computer manual). Mooresville, IN: Scientific Software.
Katigbak, M. C., Church, A. T. & Akamine, T. X. (1996). Cross-cultural generalizability of personality dimensions: Relating indigenous and imported dimensions in two cultures. Journal of Personality and Social Psychology, 70, 99-114.
McCrae, R. R. & Costa, P. T., Jr. (1983). Social desirability scales: more substance than style. Journal of Consulting and Clinical Psychology, 51, 882-888.
Nyborg, H., Eysenck, S. B. G. & Kroll, N. (1982). Cross-cultural comparison of personality in Danish and English children. Scandinavian Journal of Psychology, 23, 291-297.
Stricker, L. J. (1982). Identifying test items that perform differentially in population subgroups: a partial correlation index. Applied Psychological Measurement, 6, 261-273.
Takeuchi, D. T., Kuo, H.-S., Kim, K. & Leaf, P. J. (1989). Psychiatric symptom dimensions among Asian Americans and Native Hawaiians: An analysis of the Symptom Checklist. Journal of Community Psychology, 17, 319-329.
Te Nijenhuis, J. & Van der Flier, H. (in press). Comparability of GATB scores for immigrants and majority group members: Some Dutch findings. Journal of Applied Psychology.
Ten Berge, J. M. F. (1986). Rotation to perfect congruence and the cross-validation of component weights across populations. Multivariate Behavioral Research, 21, 41-54.
Tucker, L. R. (1951). A method for synthesis of factor analysis studies (Personnel Research Section Report No. 984). Washington, DC: Department of the Army.
Van der Flier, H., Mellenbergh, G. J., Adèr, H. J. & Wijn, M. (1984). An iterative item bias detection method. Journal of Educational Measurement, 21, 131-145.
Wilde, G. J. S. (1970). Neurotische labiliteit gemeten volgens de vragenlijstmethode (2nd ed.). [Neurotic lability measured by the questionnaire method] Amsterdam: Van Rossen.
Zuckerman, M. (1991). Psychobiology of personality. Cambridge: Cambridge University Press.
Zuckerman, M. (1992). What is a basic factor and which factors are basic? Turtles all the way down. Personality and Individual Differences, 13, 675-681.