Is the Neuroticism Scale of the Eysenck Personality Inventory contaminated by response bias?

Personality and Individual Differences 36 (2004) 743–755 www.elsevier.com/locate/paid

Stuart J. McKelvie*
Department of Psychology, Bishop's University, Lennoxville, Quebec, J1M 1Z7, Canada

Received 18 March 2002; received in revised form 1 October 2002; accepted 12 November 2002

Abstract

Neuroticism (N) is perceived to be socially undesirable and the items measuring it in the Eysenck Personality Inventory (EPI) are all positively keyed. To investigate whether N is contaminated by social desirability or acquiescence response bias, undergraduates completed the EPI Form A or B [measuring N, extraversion (E) and lie (L)] in its original version, or in a balanced version where items were positively or negatively keyed. A measure of self-deceptive enhancement (SDE) and various criterion measures for N and E were also administered. Reliability, convergent validity and discriminant validity of the original and balanced EPIs were similar, indicating that acquiescence was not a problem. N and SDE were negatively related, but convergent validity coefficients corrected for SDE were lower than the raw ones, indicating that SDE represented content, not error, variance. In contrast, some L-corrected validity coefficients increased, indicating that N may be distorted by faking. Finally, some psychometric properties were weaker for Form B than for Form A. © 2002 Elsevier Ltd. All rights reserved.

Keywords: Neuroticism; Response bias; Acquiescence; Social desirability; Eysenck Personality Inventory (EPI)

* Tel.: +1-819-822-9600x2402; fax: +1-819-822-9661. E-mail address: [email protected] (S. J. McKelvie).
0191-8869/03/$ - see front matter © 2002 Elsevier Ltd. All rights reserved. doi:10.1016/S0191-8869(02)00348-3

1. Introduction

Hans Eysenck's inventories have been widely employed to measure the major dimensions of personality, one of which is neuroticism (N). N means emotional lability, and high scorers are worriers who exhibit overly strong emotional reactions that do not dissipate quickly (Eysenck & Eysenck, 1968). N is scored by awarding one point for each item that is endorsed.

Scores on self-report inventories may be contaminated by response biases, particularly acquiescence (the tendency to agree or disagree), and social desirability (the tendency to convey
favourable or unfavourable impressions) (Webster, 1958). Although Eysenck's N has good psychometric properties, it is open to both distortions (Ferrando, 2001). First, all items are worded in the neurotic direction (e.g. "My feelings are easily hurt"), so that a yeasayer will score high and appear neurotic, whereas a naysayer will score low and appear stable. Second, neuroticism is perceived as socially undesirable (Dunnett, Koun, & Barber, 1981; Edwards & Walsh, 1964; Francis, 1993), so that a person who paints themselves positively will score low and appear stable, whereas a person who paints themselves negatively will score high and appear neurotic. Because acquiescence and social desirability are independent (Greenwald & Clausen, 1970), a naysayer could also answer in the socially desirable direction, and appear extremely stable. Similarly, a yeasayer answering in the socially undesirable direction would appear extremely neurotic.

1.1. Acquiescence

1.1.1. Past research

According to Paulhus (1991), agreement acquiescence means giving a positive response to all items (e.g. "yes" to both happy and not happy) and acceptance acquiescence means endorsing all qualities (e.g. "yes" to happy and sad and "no" to not happy and not sad). Most research has been conducted on agreement acquiescence, with arguments both for (Ray, 1983) and against (Grimm & Church, 1999; Rorer, 1965) its importance. With Eysenck's N Scales, scores may be slightly affected by this bias because Eysenck and Eysenck (1964) found that N/E correlations were somewhat higher for congruent scales (both N and E positively keyed) than for incongruent scales (N positively keyed, E both positively and negatively keyed; see also Martin & Stanley, 1963). The first purpose of the present study was to examine this problem with a new approach: comparing the psychometric properties of the original N Scale with those of a redesigned scale that controlled for agreement response bias.
The standard recommendation for minimizing agreement acquiescence is to balance the scale so that a score is obtained by answering "yes" on half the items and "no" on half the items (Anastasi & Urbina, 1997; Couch & Keniston, 1960). That is, items are positively or negatively keyed. However, this may not remove the effect (Holden & Fekken, 1985; Jackson, 1967). First, negative items may be more prone to response bias than positive items (Ibrahim, 2001), so that acquiescence effects may not be cancelled out.¹ Second, positively and negatively keyed items may load on separate factors (Barnette, 2000; Ibrahim, 2001; Miller & Cleary, 1993), perhaps because reversed scoring is often created by using negatively worded items (e.g. a point for happy is given for responding "no" to "not happy"). The important distinction here is between positive and negative wording and positive and negative keying, because Holden and Fekken (1985) found that criterion validity was only lower with negative wording. Moreover, the effect of negative wording was greater for items containing the modifier "not" than for items with a negative prefix (e.g. un-) or a negative frequency (e.g. seldom).

¹ I am grateful to a reviewer for this observation.

1.1.2. Present research

Acquiescence was investigated by creating a balanced N Scale in which the items were positively or negatively keyed. To avoid negative wording, the negatively keyed items were positively worded
or contained a negative frequency; the modifier "not" was omitted. For example, on the original scale, a point for N was given for responding "yes" to "Would you call yourself a nervous person?" On the balanced scale, a point was given for responding "no" to "Would you call yourself a relaxed person?" To ascertain if agreement response bias was a factor in N scores, the psychometric properties of N were compared for the original and balanced versions of the Eysenck Personality Inventory (EPI).²

If some people are yeasayers and others are naysayers, individual variation in N will be confounded with individual variation in acquiescence. This has implications for reliability and validity, which are estimated by correlations reflecting such variation. Usually, a scale is better if it is more reliable, which implies that reliability would be higher for the balanced than for the original version of N if response bias constitutes random error in the original. However, the bias might instead enhance score consistency, artificially raising test reliability. In that case, reliability would be very high on the original version and lower on the balanced version.

The most important evidence for the effect of acquiescence response bias would be if it lowered test validity. This was examined by comparing convergent and discriminant validity for the original and balanced forms of the test. Convergent validity coefficients between EPI-N and three criteria were calculated: N from another inventory (Costa and McCrae's NEO Personality Inventory-Revised, NEO-PI-R), a global self-rating on N (Personality Judgment Self-N, PJSN), and a global rating on N given by another person who knew the participant well (Personality Judgment Other-N, PJON). None of these criteria is likely to suffer from any acquiescence effect. The NEO-PI-R N Scale is itself balanced.
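The contrast between the two keying schemes can be sketched as follows. This is a minimal illustration of the scoring logic only; the item wordings and helper names are hypothetical stand-ins, not the actual EPI items.

```python
# Sketch of the two N-scoring schemes. Item texts are hypothetical
# illustrations; only the keying logic follows the description above.

def score_n(responses, keys):
    """Award one point whenever a response matches the keyed direction."""
    return sum(1 for response, key in zip(responses, keys) if response == key)

# Original scale: every N item is positively keyed ("yes" scores a point),
# so an indiscriminate yeasayer earns the maximum N score.
original_keys = ["yes", "yes", "yes", "yes"]

# Balanced scale: half the items are reverse-keyed (e.g. a point for
# answering "no" to "Would you call yourself a relaxed person?").
balanced_keys = ["yes", "no", "yes", "no"]

yeasayer = ["yes", "yes", "yes", "yes"]
print(score_n(yeasayer, original_keys))  # 4: appears maximally neurotic
print(score_n(yeasayer, balanced_keys))  # 2: yeasaying pushes toward the midpoint
```

The balanced key thus cancels indiscriminate agreement: a pure yeasayer lands at the scale midpoint rather than the neurotic extreme.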
The global self-rating (PJSN) was obtained by asking the participant to read the characteristics of a neurotic person and of a stable person, then to rate which one was more like them on a Likert scale. The global other-rating (PJON) was obtained in the same way. It is the most objective criterion, because it was made by someone else. If acquiescence response bias was present, removing it on the balanced version of the EPI should result in higher convergent validity coefficients than on the original version.

Discriminant validity was examined by calculating correlations between N and E, which are conceptually independent (Eysenck & Eysenck, 1968). E items are both positively and negatively keyed, but more of them (15 out of 24) require "yes" than "no" responses. This implies that the score might be slightly contaminated by agreement acquiescence, and N would be positively correlated with E on the original test. Although the correlation between N and E is usually close to zero or slightly negative (Eysenck & Eysenck, 1968), there is one surprisingly high value of 0.62 (Rahim, 1984). However, this positive correlation should not occur in the balanced version.

Also, two additional criteria were used: a global self-rating on E (PJSE) and a global other-rating on E (PJOE). The PJSE rating was less likely than EPI-E to be influenced by acquiescence, because it required participants to rate which of two descriptions (extraverted or introverted) was more like them. The PJOE rating was obtained in the same way, and was unlikely to be contaminated at all because it was made by another person. In both cases, the discriminant validity coefficients with N should be close to zero for the original and balanced EPIs.

² Although published in 1968, the EPI is still widely used. A search of PsycINFO produced 56 studies between 1999 and December 2001.


1.2. Social desirability

1.2.1. Past research

There are two kinds of social desirability (Paulhus, 1991). Self-deceptive enhancement (SDE) means that the person answers honestly but has an overly positive view of themselves. Because SDE has been linked to personal adjustment, it may represent variation in its own right rather than response bias, so that removing it may lower the validity of a test score with which it is associated (Paulhus, 1991). Indeed, using the unbiased external criterion of spousal ratings, this occurred when N from the Big Five NEO-PI was corrected for social desirability and other "validity" scales (McCrae & Costa, 1985; Piedmont, McCrae, Riemann, & Angleitner, 2000). On the other hand, impression management (IM) means that the person deliberately tailors their responses to create a good impression. This source of error variance should be controlled. Paulhus (1991) states that the EPI Lie (L) Scale measures IM more than SDE.

Although the correlation between Eysenck's N and measures of social desirability has sometimes been nonsignificant (Dunnett et al., 1981; Rahim, 1984), many other correlations have been negative (e.g. Davies, French, & Keogh, 1998; Farley, 1966; Ferrando, 2001; Francis, 1993; Furnham & Henderson, 1982). However, the relationship between N and SDE is stronger than the relationship with IM. Why is this? According to Farley (1966), SDE and N share two sources of overlap (approval motivation and pathological content), whereas IM and N share only one (approval motivation). However, writers disagree on the implications for social desirability response bias in Eysenck's N Scale. It has been argued that N is (Helmes, 1980) and is not (Paulhus, 1991) contaminated by SDE, and that N is (Paulhus, 1991) and is not (Ferrando, 2001) contaminated by IM. Notably, Eysenck and Eysenck (1968) interpret the N/L relationship as interesting in its own right. L may reflect conformity (Davies et al., 1998; Massey, 1980) or low impulsivity (Loo, 1980).

1.2.2. Present research

Although there is a stronger relationship between N and SDE than between N and IM, it is not clear whether or not N is compromised. This was investigated by adopting McCrae and Costa's (1985) approach of comparing raw criterion validity coefficients with validity coefficients corrected for social desirability. It was predicted that if relationships between N and social desirability scales indicate response bias in N, then convergent validity would be higher when the original correlations were corrected. This was examined for both the SDE factor (Paulhus' SDE Scale) and the IM factor (Eysenck's L Scale).

Discriminant validity with the three measures of E was also investigated. A social desirability effect predicts a negative correlation between N and E as measured by the EPI, and perhaps also by the PJSE rating, because N is perceived as socially undesirable and E is perceived as socially desirable (Dunnett et al., 1981; Edwards & Walsh, 1964; Francis, 1993). Indeed, a negative correlation between Eysenck's E and N has occurred (Francis, 1993; Furnham & Henderson, 1982). However, the PJOE rating is less likely to be influenced by social desirability because it was given by another person. This correlation should be close to zero. If these effects are present, then the corrected discriminant validity coefficients between N and E would change from negative to zero for the EPI-E and the PJSE. The nonsignificant value for the PJOE rating would remain the same.

Finally, although social desirability response bias predicts a negative correlation between N and E on the EPI, acquiescence response bias predicts a positive correlation. If both biases were
present, they might cancel each other out, yielding a correlation near zero, which has occurred (e.g. Eysenck & Eysenck, 1968; Francis, 1993). This would cast doubt on Eysenck's claim that N and E are factorially independent.
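The correction applied in McCrae and Costa's (1985) approach is the first-order partial correlation, which removes from both the scale and the criterion whatever variance each shares with the bias measure. A minimal sketch, using illustrative numbers rather than the study's data:

```python
from math import sqrt

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation r_xy.z: the x-y correlation
    with the bias measure z partialled out of both x and y."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Illustrative values only: a scale-criterion correlation of 0.60, with
# both variables modestly (negatively) related to a desirability measure.
print(round(partial_corr(0.60, -0.45, -0.40), 3))  # 0.513: correction lowers it
```

When the bias measure correlates with scale and criterion in the same direction, the corrected coefficient drops; this is the pattern the SDE corrections below are read against.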

2. Method

2.1. Participants

Participants were two samples (n=147, n=126) representing the Bishop's University undergraduate population (n=1850). Each sample was stratified by gender, academic division (business, humanities, natural sciences, social sciences) and degree progress (less or more than 50% of courses completed). Although selection was not strictly random, the numbers in each gender/division/progress stratum were proportional to their numbers in the population. In particular, 74 men and 73 women in Sample 1 initially completed Form A of the EPI, and 62 men and 64 women in Sample 2 initially completed Form B.

Within each sample, half initially completed the original version of the EPI and half completed the balanced version. Numbers were 74 (Form A, original), 73 (Form A, balanced), 62 (Form B, original) and 64 (Form B, balanced). This allocation was semi-random, care being taken that each group remained representative of the population. Before the second session, approximately one half of each of the four groups was allocated by the same means to one of two subgroups that completed either the EPI Form A or Form B. These data were used to calculate test–retest and alternate-form reliability. Some participants did not complete all tests, so that sample sizes varied for the different reliability and validity estimates (see Tables 1, 2 and 3).

2.2. Materials

Forms A and B each consist of 57 items (24 for N, 24 for E, 9 for L) that are answered with "yes" or "no". The manual (Eysenck & Eysenck, 1968) cites test–retest reliability coefficients (9 months to 1 year) of 0.81 to 0.91 for N and 0.82 to 0.97 for E. The correlations between the two forms (termed "split-half" reliability) for a normal sample are 0.80 for N and 0.75 for E. The correlations between N and E for a normal sample are 0.00 for Form A and 0.09 for Form B.

Table 1
Reliability coefficients for the original and balanced versions of the EPI N Scale

                                              Form A                        Form B
Type of reliability                     n   Original   n   Balanced   n   Original   n   Balanced
Split-half (with Spearman-Brown correction)
  Yes–no                               74    0.799    73    0.752    62    0.737    64    0.711
  Odd–even                             74    0.817    73    0.773    62    0.752    64    0.789
  Corrected odd–even                   74    0.792    73    0.765    62    0.734    64    0.813
Test–retest                            33    0.806    31    0.861    26    0.929    27    0.762
Alternate-form                         33    0.627    31    0.684    27    0.810    27    0.817


Table 2
Validity coefficients for the EPI-N Scale (Form A): raw, corrected for SDE and corrected for L

                                Original                                  Balanced
Type of validity      n     Raw      Corrected          n     Raw      Corrected
                                     SDE       L                       SDE       L
Desirability
  SDE                74   −0.445**    –       0.473**   73   −0.617**    –       0.612
  L                  74    0.127     0.248     –        74    0.289*    0.275*    –
Convergent
  NEO-PI-R N         74    0.644**   0.585**  0.703**   73    0.796**   0.615**  0.785**
  PJSN               66    0.594**   0.556**  0.642**   62    0.574**   0.530**  0.537**
  PJON               44    0.457**   0.363*   0.499**   47    0.317*    0.285†   0.285†
Discriminant
  EPI-E              74    0.093     0.104    0.093     74   −0.348*   −0.222   −0.335*
  PJSE               65    0.012     0.212    0.271     63   −0.196     0.116    0.177
  PJOE               44    0.088     0.103    0.106     47   −0.215     0.201    0.274†

† P<0.07. * P<0.05. ** P<0.01.

Table 3
Validity coefficients for the EPI-N Scale (Form B): raw, corrected for SDE and corrected for L

                                Original                                  Balanced
Type of validity      n     Raw      Corrected          n     Raw      Corrected
                                     SDE       L                       SDE       L
Desirability
  SDE                62   −0.640**    –       0.573**   73   −0.407**    –       0.429**
  L                  62    0.062     0.154     –        74    0.043     0.107     –
Convergent
  NEO-PI-R N         62    0.776**   0.667**  0.786**   73    0.716**   0.632**  0.709**
  PJSN               53    0.493*    0.344*   0.485**   62    0.333*    0.234    0.282†
  PJON               51    0.057     0.107    0.059     47    0.216     0.110    0.218
Discriminant
  EPI-E              62   −0.296*   −0.234   −0.295*    74    0.026     0.137    0.053
  PJSE               53    0.089     0.078    0.067     63    0.199     0.216    0.161
  PJOE               51   −0.292*   −0.340*  −0.306*    47    0.207     0.233    0.201

† P<0.07. * P<0.05. ** P<0.01.


The manual refers to a variety of evidence for factorial validity, construct validity and concurrent validity. In the original EPI, the score for N is the number of positive ("yes") responses. For both Forms A and B, a balanced version of the EPI was created by changing half of the N items to be negatively keyed. One of the 12 changed items contained the prefix "un-", but none contained the word "not". For the remaining 11, changes were made by using a word with the opposite meaning (e.g. nervous became relaxed) or by using a different modifier (e.g. often became rarely or hardly ever, hard became easy).

The second questionnaire, the "NSD Personality Inventory" (NSDPI), consisted of 44 items: 24 from the NEO-PI-R (Costa & McCrae, 1992) to measure N and 20 from the BIDR (Paulhus, 1991) to measure the SDE component of social desirability. Each item was answered on a five-point scale from strongly disagree to strongly agree. In the NEO-PI-R manual (Costa & McCrae, 1992), internal consistency for N is 0.92 (coefficient alpha). Test–retest reliability over a "short" (unspecified) interval is 0.87. It is stated that the test has good long-term test–retest reliability over 6 years (0.68–0.83), but specific reliability coefficients for N are not presented. A variety of factor-analytic, convergent and discriminant validity evidence for N is presented. In particular, it is "strongly correlated" with N from the EPI.

The 20 SDE items were the first 20 from the BIDR (Paulhus, 1991). Alpha ranges from 0.68 to 0.80, and 5-week test–retest reliability is 0.69. SDE scores have been validated by their convergent validity with similar measures of social desirability and by their discriminant validity with measures of IM. Correlations between SDE and IM for the BIDR range from 0.05 to 0.40.

The third questionnaire ("Personality Judgments"; PJ) obtained global ratings of E and N by the participant (Form S; PJS) and by another person (Form O; PJO). Eysenck and Eysenck's (1968) descriptions of the typical extravert and typical introvert were provided (labelled "Person A" and "Person B") along with an 11-point rating scale from "I am just like A" to "I am just like B". Then their descriptions of the typical neurotic and stable person were provided (labelled "Person C" and "Person D") with the same scale.

2.3. Procedure

After signing a consent form, participants were tested individually or in groups of two or three. During the first session, they completed the EPI and then the NSDPI. Approximately 2 weeks later (range 10–17 days), they completed the EPI again, followed by the PJS.

Participants in Sample 1 filled out the EPI Form A (original or balanced) in the first session. Later, they completed the same version (original, balanced) of the test. However, for both groups (original, balanced), half completed the EPI Form A (test–retest) and half completed the EPI Form B (alternate-form). Participants were told that they might remember how they had responded to particular or similar items in the first session, and that they should not simply repeat these answers. Rather, they should reconsider each question and answer honestly. Participants in Sample 2 initially filled out the EPI Form B (original or balanced). In each group, half later completed Form B (test–retest) and half completed Form A (alternate-form) under the same memory instructions as Sample 1.

When the PJS was finished, participants were asked to give the PJO to the friend who knew them best and to return the completed questionnaire within 5 days. Later, they received a written summary of the results.


3. Results and discussion

All statistical analyses were conducted using SPSS Version 10.0.

3.1. Acquiescence response bias

3.1.1. Reliability

Table 1 shows the reliability coefficients for Forms A and B of the EPI N Scale. The "yes–no" coefficients refer to the split between the positively and negatively keyed items. The balanced coefficients for Form A and for Form B are smaller than the original ones (0.752 vs. 0.799; 0.711 vs. 0.737), but only slightly. The odd–even reliability is the traditional split-half estimate, where scores from odd and even items are compared. For Form A, the balanced version was slightly less reliable than the original version (0.773 vs. 0.817), whereas for Form B it was slightly more reliable (0.789 vs. 0.752). However, this estimate might confound odd–even items with positive–negative keying. The corrected odd–even split-half estimate (Table 1, third row) overcame this problem by combining scores from the first, third, fifth, etc. positively keyed items and from the first, third, fifth, etc. negatively keyed items. Scores from the even items were similarly combined. Again, the balanced coefficient for Form A was slightly smaller than the original (0.765 vs. 0.792) and the balanced coefficient for Form B was slightly greater than the original (0.813 vs. 0.734).

Given that split-half estimates are usually expected to exceed 0.85 for a unidimensional trait, the numbers for the original scales are slightly low. This contradicts the idea that reliability is artificially inflated by acquiescence response bias. Furthermore, the balanced estimates are not systematically different from the original ones, as would be expected if this version of the test removed any bias.

For test–retest reliability, the present estimates for the original test (0.806, Form A; 0.929, Form B) are similar to the values in the manual (0.81–0.91), perhaps because the shorter delay here (2 weeks vs. 9 months to 1 year) was offset by the warning to avoid relying on memory.
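The split-half estimates of this kind can be sketched as follows: correlate two half-scores (e.g. odd vs. even items) and step the half-length correlation up to full test length with the Spearman-Brown formula. The data below are illustrative, not the study's:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_brown(r_half):
    """Project a half-test correlation to full-test reliability."""
    return 2 * r_half / (1 + r_half)

# Illustrative half-scores for five respondents (e.g. odd vs. even items).
odd = [5, 8, 3, 9, 6]
even = [6, 7, 4, 9, 5]
print(round(spearman_brown(pearson(odd, even)), 3))  # 0.955
```

Because the correction always raises the half-test correlation, a split-half coefficient that still falls below the conventional 0.85 benchmark, as several of the original-scale values here do, cannot be attributed to the halving of the test.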
On Form A, test–retest reliability was somewhat higher for the balanced than for the original version (0.861 vs. 0.806), but on Form B it was considerably smaller (0.762 vs. 0.929). This suggests that test–retest reliability on the original Form B might be artificially inflated by acquiescence. Finally, alternate-form reliability for the balanced version improved somewhat over the original version for Form A (0.627 to 0.684) and improved marginally for Form B (0.810 to 0.817). Neither result suggests that the reliability of the original versions of the test is inflated by acquiescence.

With one exception (test–retest, Form B), these results indicate that the reliability of N in the EPI is not much affected by balancing the scoring key. Together with the split-half estimates, which did not vary much from the original to the balanced versions, there is little evidence that N is contaminated by acquiescence response bias.

3.2. Validity

3.2.1. Convergent and discriminant validity

Tables 2 and 3 show the validity coefficients between N and the various criteria for Forms A and B, respectively. The first column contains the raw correlations for the original version of the test. In both cases, the values are highest with N from the NEO-PI-R, lowest with the PJON, and
intermediate with the PJSN, probably reflecting the similarity between EPI-N and the three criterion measures: another self-report inventory, another self-evaluation, and an evaluation by someone else.

For Form A, the discriminant validity coefficients between N and E (0.093, 0.012, 0.088) are not statistically significant, which supports Eysenck's claim that N and E are conceptually independent. In particular, the first one, which represents the correlation between N and E from the EPI, is not positive, as would be predicted if there were acquiescence response bias in N and, perhaps, in E. For Form B, the three discriminant validity coefficients are also lower than their convergent counterparts. However, two of them (−0.296 with EPI-E and −0.292 with the other-rating) are negative and significant. The first of these, the correlation between E and N from the EPI, again contradicts the presence of acquiescence response bias in N.

3.2.2. Original vs. balanced validity coefficients

The most important comparisons here are between the raw correlations for the original and balanced versions of the test (the first and fourth columns of Tables 2 and 3, respectively). If acquiescence is a problem in the original versions, the convergent validity coefficients will be higher for the balanced versions. However, for each form, this occurred on only one of the three criteria: on the NEO-PI-R for Form A (0.644 increasing to 0.796) and on the other-rating for Form B (0.057 increasing to 0.216). Because balancing the scale did not systematically increase convergent validity, there is little evidence of acquiescent responding for N on either form of the test. This casts doubt on the claim that Form B is more contaminated by acquiescence response bias than Form A (Stones, 1977).

3.3. Social desirability response bias

The correlations between SDE and L were 0.070 for Form A and 0.175 for Form B, supporting Paulhus' (1991) claim that SDE and L measure different kinds of social desirability.
On the original EPI, the correlations between N and SDE were −0.445 and −0.640 for Forms A and B, respectively (column 1 of Tables 2 and 3), which replicates previous findings (e.g. Davies et al., 1998). For the balanced EPI, the correlations were similar (−0.617, −0.407; column 4 of Tables 2 and 3), which shows that N shares variance with SDE. In contrast, the correlations between N and L were lower. In fact, for the original versions of Forms A and B, and for the balanced version of Form B, they were not significant (columns 1 and 4 of Tables 2 and 3). The value for Form A, balanced version, was significant, but low (0.289). These results confirm that N shares less variance with L than with SDE (e.g. Davies et al., 1998).

3.4. Validity

3.4.1. Convergent validity

The first columns of Tables 2 and 3 contain the raw validity coefficients for the original versions of the test, and the next two columns contain the partial correlations obtained by correcting the raw correlations for SDE and L.

For convergent validity on Form A, the SDE-corrected validity coefficients are lower than the raw validity coefficients on all three criteria, whereas the L-corrected validity coefficients are higher. On Form B, the three SDE-corrected validity coefficients are again lower than the raw
validity coefficients, but neither value for the PJON rating is significant. The L-corrected validity coefficient is slightly higher than the raw value for the NEO-PI-R and very slightly higher for the PJON rating, but again the latter value is not significant. For the PJSN rating, the L-corrected validity coefficient is slightly lower than the raw value.

The fourth column of Tables 2 and 3 contains the raw validity coefficients for the balanced version of the scale, and the last two columns contain the balanced-version coefficients corrected for SDE and for L. The latter show what happens when the original version of N is corrected for both acquiescence and social desirability. For both Forms A and B, the SDE-corrected validity coefficients are lower than the balanced raw values on all three criteria. In five out of six cases, they are also lower than the original raw values. In the sixth case (Form B, PJON criterion), the SDE-corrected coefficient (0.110) is marginally higher than the original (0.057), but neither is significant. The L-corrected coefficients are lower than both the balanced and original raw values in the same five cases. For the exception (Form B, PJON), the corrected value is higher than the raw original value (0.218 vs. 0.057) and only marginally higher than the raw balanced value (0.218 vs. 0.216).

Because the raw criterion validity coefficients decreased consistently when they were corrected for SDE, the SDE/N relationship seems to represent content variance, not response bias. The shared content variance may reflect personal adjustment (Paulhus, 1991) or pathological content (Farley, 1966). This argument holds particularly for Form A, where the SDE-corrected validity coefficients are lower than the original raw ones in all six cases. For Form B, the same pattern holds in the four cases involving the NEO-PI-R N and PJSN as criteria.
In the other two cases, with PJON as the criterion, the data are more difficult to interpret, because the original raw validity coefficients were not significant. Because PJON is the most objective of the three criteria, these findings raise questions about Form B as a valid measure of N. The results for L are less clear than those for SDE. Although most of the raw validity coefficients decreased or remained the same when they were corrected, all three increased for Form A (even though N and L were not significantly correlated). This suggests that the measurement of N on Form A is somewhat contaminated by response bias. 3.4.2. Discriminant validity If social desirability contaminated N and E, there would be a negative correlation between them on the EPI and possibly on the PJSE, but not on the PJOE (ratings from others). However, none of the three discriminant validity coefficients for Form A on the original version were significant. This could have occurred because the social desirability effect leading to a negative correlation between N and E was cancelled out by an acquiescence effect leading to a positive correlation between them. If so, the negative correlation might occur on the balanced version. In fact, this relationship was negative and significant (r=0.348) with E from the EPI as the criterion (Table 2). Both of the other correlations (with PJSE and PJOE) were negative but not significant. When the balanced discriminant validity coefficient for EPI–E was corrected for SDE, it became less negative and nonsignificant (0.222), suggesting some contaminating effect from SDE. When the balanced discriminant validity coefficient was corrected for L, it also became slightly less negative, but remained significant (0.335), suggesting a very small contaminating effect from L. For Form B’s balanced version, with acquiescence minimized, none of the three raw N/E correlations was significant. However, on the original version, there was a significant negative


correlation (-0.296) between N and E on the EPI. This became less negative and nonsignificant (-0.234) with SDE partialled out, again suggesting a contaminating effect from SDE. In contrast, with L partialled out, it did not change (-0.295), suggesting no contamination from L. Notably, the discriminant validity coefficient between N and the PJOE was also negative and significant (-0.292). It did not change (-0.306) with L partialled out, but it became more negative (-0.340) with SDE removed. Because another person's rating is not likely to be contaminated by social desirability, this suggests that the negative relationship between N and E is substantive. If so, it raises a question about Eysenck and Eysenck's (1968) claim that the two traits are independent, at least on Form B. Further work should be done to find out whether these negative correlations between N and PJOE can be replicated.

4. General discussion and conclusion

This is the first study to examine simultaneously acquiescence and social desirability response biases for Eysenck's N. It is also more complete than past work because reliability (split-half, test-retest, alternate-form) and construct validity (convergent and discriminant validity) were investigated with three criteria. On the other hand, undergraduate students do not represent the general population, and sample sizes were reduced for some estimates. Moreover, the rater was chosen by the participant as the friend who knew them best. These judgments may be more valid than self-ratings, but they may be less valid than McCrae and Costa's (1985) spousal ratings. Finally, although the inflationary effect of memory on test-retest reliability was controlled by instruction, participants may have recalled previous responses after only 2 weeks.

Nevertheless, the evidence pertains to the problem of response bias in Eysenck's N. Although the items are all positively keyed, leaving the test open to yeasaying or naysaying, the results clearly indicate that acquiescence response bias was not a problem on either form of the test: reliability was not inflated on the original versions and was similar to that on the balanced versions, convergent validity coefficients were not systematically higher on the balanced than on the original versions, and discriminant validity coefficients between E and N on the original versions were not positive. Using different methods, these results confirm previous conclusions that acquiescence does not seriously contaminate Eysenck's N (Eysenck & Eysenck, 1964; Martin & Stanley, 1963).

Although N is socially undesirable, convergent validity declined when the effect of SDE was removed, indicating that SDE is a meaningful component of N. This agrees with McCrae and Costa's (1985) finding that the correlation between N from the NEO-PI-R and spousal ratings decreased when corrected for social desirability.
In contrast, there was evidence, particularly for Form A, that convergent validity increased when the effect of L was removed, indicating that impression management may contaminate N. This disagrees with McCrae and Costa’s (1985) finding that convergent validity for their N decreased when corrected for L. However, it supports Eysenck and Eysenck’s (1968) advice, which has been followed (Rahim, 1984) and debated (Dunnett et al., 1981; Furnham & Henderson, 1982), that elevated scores on L may indicate faking, and that low scores on N (particularly on Form A) should be interpreted with caution. The results also support Paulhus’ (1991) claim that SDE represents content more than error variance whereas IM represents error more than content variance.


Generally, E and N were uncorrelated, and this relationship was not systematically affected by removing SDE or L variance, supporting the discriminant validity of N. However, this was clearer for Form A than for Form B, where there was evidence of a negative relationship between N and E. There were also two other contrasts between the two forms. For Form A, there was a healthy convergent validity coefficient when the criterion was another person's rating, whereas it was not significant for Form B. Unfortunately, this criterion is the most objective of the three. There were also healthy zero-order discriminant validity coefficients for Form A when the E criteria were from the EPI and from ratings by another person, whereas these coefficients were both negative for Form B. Although there was no systematic evidence that N in Form B was more contaminated by social desirability response bias than N in Form A, these findings raise questions about the construct validity of Form B.

In conclusion, although N is measured by a positively keyed scale and is perceived as socially undesirable, EPI scores were not seriously distorted by acquiescence or self-deceptive enhancement tendencies. However, if a person has an elevated score on L, a low score on N should be interpreted with caution.

References

Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice Hall.

Barnette, J. J. (2000). Effects of stem and Likert response option reversals on survey internal consistency: if you feel the need, there is a better alternative to using those negatively worded items. Educational and Psychological Measurement, 60, 361-370.

Costa, P. T., Jr., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO PI-R) and NEO Five-Factor Inventory professional manual. Odessa, FL: Psychological Assessment Resources.

Couch, A., & Keniston, K. (1960). Yeasayers and naysayers: agreeing response set as a personality variable. Journal of Abnormal and Social Psychology, 60, 151-174.

Davies, M., French, C. C., & Keogh, E. (1998). Self-deceptive enhancement and impression management correlates of EPQ-R dimensions. The Journal of Psychology, 132, 401-406.

Dunnett, S., Koun, S., & Barber, P. J. (1981). Social desirability in the Eysenck Personality Inventory. British Journal of Psychology, 72, 19-26.

Edwards, A., & Walsh, J. A. (1964). Response sets in standard and experimental personality scales. American Educational Research Journal, 1, 52-61.

Eysenck, H. J., & Eysenck, S. B. G. (1968). Manual of the Eysenck Personality Inventory. San Diego, CA: Educational Testing Service.

Eysenck, S. B. G., & Eysenck, H. J. (1964). "Acquiescence" response set in personality inventory items. Psychological Reports, 14, 513-514.

Farley, F. (1966). Social desirability, extraversion, and neuroticism: a learning analysis. The Journal of Psychology, 64, 113-118.

Ferrando, P. J. (2001). The measurement of neuroticism using MMQ, MPI, EPI and EPQ items: a psychometric analysis based on item response theory. Personality and Individual Differences, 30, 641-656.

Francis, L. J. (1993). The dual nature of the Eysenckian Neuroticism scales: a question of sex differences? Personality and Individual Differences, 15, 43-59.

Furnham, A., & Henderson, M. (1982). The good, the bad and the mad: response bias in self-report measures. Personality and Individual Differences, 3, 311-320.

Greenwald, H. J., & Clausen, J. D. (1970). Test of relationship between yeasaying and social desirability. Psychological Reports, 27, 139-141.

Grimm, S. D., & Church, A. T. (1999). A cross-cultural study of response biases in personality measures. Journal of Research in Personality, 33, 415-441.


Helmes, E. (1980). A psychometric investigation of the Eysenck Personality Questionnaire. Applied Psychological Measurement, 4, 43-55.

Holden, R. R., & Fekken, C. G. (1985). Structured personality test item characteristics and validity. Journal of Research in Personality, 19, 386-394.

Ibrahim, A. M. (2001). Differential responding to positive and negative items: the case of a negative item in a questionnaire for course and faculty evaluation. Psychological Reports, 88, 497-500.

Jackson, D. N. (1967). Acquiescence response styles: problems of identification and control. In I. A. Berg (Ed.), Response set in personality assessment (pp. 71-114). Chicago: Aldine.

Loo, R. (1980). Characteristics of the Eysenck Personality Questionnaire Lie scale and of extreme lie scorers. Psychology, 17, 5-10.

Martin, J., & Stanley, G. (1963). Social desirability and the Maudsley Personality Inventory. Acta Psychologica, 21, 260-264.

Massey, A. (1980). The Eysenck Personality Inventory Lie Scales: lack of insight or...? The Irish Journal of Psychology, iv(3), 172-174.

McCrae, R. R., & Costa, P. T., Jr. (1985). Social desirability scales: more substance than style. Journal of Consulting and Clinical Psychology, 51, 882-888.

Miller, T. R., & Cleary, T. A. (1993). Direction of wording effects in balanced scales. Educational and Psychological Measurement, 53, 51-60.

Paulhus, D. L. (1991). Measurement and control of response bias. In J. P. Robinson, P. R. Shaver, & L. S. Wrightsman (Eds.), Measures of personality and social psychological attitudes (Vol. 1 of Measures of social psychological attitudes, pp. 17-60). San Diego, CA: Academic Press.

Piedmont, R. L., McCrae, R. R., Riemann, R., & Angleitner, A. (2000). On the validity of validity scales: evidence from self-reports and observer ratings in volunteer samples. Journal of Personality and Social Psychology, 78, 582-593.

Rahim, M. A. (1984). Social desirability response set and the Eysenck Personality Inventory. The Journal of Psychology, 116, 149-153.

Ray, J. J. (1983). Reviving the problem of acquiescent response bias. The Journal of Social Psychology, 121, 81-96.

Rorer, L. G. (1965). The great response-style myth. Psychological Bulletin, 63, 129-156.

Stones, M. J. (1977). A further study of response set and the Eysenck Personality Inventory (EPI). Journal of Clinical Psychology, 33, 147-150.

Webster, H. (1958). Correcting personality scales for response sets or suppression effects. Psychological Bulletin, 55, 62-64.