Personality and Individual Differences 49 (2010) 669–676
Review
How similar are personality scales of the "same" construct? A meta-analytic investigation ☆

Victoria L. Pace a,*, Michael T. Brannick b

a Department of Psychology, Florida International University, United States
b Department of Psychology, University of South Florida, United States
Article history:
Received 14 January 2010
Received in revised form 10 June 2010
Accepted 15 June 2010
Available online 6 July 2010

Keywords:
Personality scale convergence
Personality tests
Meta-analysis
Five-Factor Model
Abstract

An underlying assumption of meta-analysis is that effect sizes are based on commensurate measures. If measures across studies do not have the same empirical meaning, then our theoretical understanding of relations among variables will be clouded. Two indicators of scale commensurability were examined for personality measures: (1) correlations among different scales with similar labels (e.g., different measures of extraversion) and (2) score reliability for different scales with similar labels. First, meta-analyses of correlations between many commonly-used scales were computed, both including and excluding scales classified as non-Five-Factor Model measures. Second, subgroup meta-analyses of reliability were examined, with specific personality scale as moderator. Results reveal that assumptions of commensurability among personality measures may not be entirely met. Whereas meta-analyzed reliability coefficients did not differ greatly, scales of the "same" construct were only moderately correlated in many cases. Some improvement to this meta-analytic correlation occurred when measures were limited to those based on the Five-Factor Model. Questions remain about the similarity of personality construct conceptualization and operationalization.

© 2010 Elsevier Ltd. All rights reserved.
1. Introduction

Researchers have begun to consider the similarity of personality scales of ostensibly the same construct. In particular, some have complained that scales are not similar enough to be grouped together in the same meta-analysis (e.g., Hogan, 2005). What do we know about convergence among scales and their reliabilities? To date, some researchers have meta-analyzed criterion-related validities of the Big Five (Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) by assuming predictor scales from the included studies were essentially equivalent (e.g., Barrick & Mount, 1991; Clarke & Robertson, 2008; Connor-Smith & Flachsbart, 2007; Judge, Heller, & Mount, 2002; Prinzie, Stams, Dekovic, Reijntjes, & Belsky, 2009; Salgado, 1997; Tett, Jackson, & Rothstein, 1991). Often the determination of equivalence has been based on comparison of definitions of the constructs the scales purport to measure. Some researchers have also examined scales at the item level (e.g., Judge, Bono, Ilies, & Gerhardt, 2002) or have relied on information from previous factor analyses (e.g., Salgado, 1997; Tett et al., 1991).

Although the names of scales frequently differ, the similarity of scores and inferences based upon the scales is an empirical question. That is, if the measures are so highly correlated that their antecedents and consequents are the same, then the difference in names is of trivial importance. However, if the measures are not highly correlated, then distinct names and distinct treatments of the measures are warranted. By the same token, measures with the same name that are not highly correlated should not be included in a single meta-analysis. To date, however, there appears to be no quantitative review that examines the commensurability of a wide range of personality scales.

There are several reasons to expect that scales assigned by researchers to one of the Big Five personality factors may not be as highly correlated as one might expect. Among these reasons are the variety of underlying personality theories; differences in context, breadth, and item content among the scales; and potential unreliability of some scales.

☆ Portions of this research were presented at the 23rd Annual Conference of the Society for Industrial and Organizational Psychology, San Francisco, CA and were based on the first author's doctoral dissertation.
* Corresponding author. Address: Department of Psychology, DM 256, Florida International University, 11200 S.W. 8th Street, Miami, FL 33199, United States. Tel.: +1 305 348 1970; fax: +1 305 348 3879. E-mail address: vpace@fiu.edu (V.L. Pace).

0191-8869/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.paid.2010.06.014
1.1. The Big Five and other taxonomies

An issue that complicates the assignment of scales to factors is the variety of ways in which the personality domain has been divided. Although a five-factor structure may currently be the most widely accepted, there have been and remain many who argue
for a greater or fewer number of personality factors (e.g., Cattell, Eber, & Tatsuoka, 1970; Eysenck & Eysenck, 1985). Accordingly, many personality scales were not developed to measure Big Five factors, but were oriented toward alternative construct sets (Salgado, 1997). Such diversity causes problems because broader scales may measure more than one of the Big Five and narrower scales may measure only part of a Big Five factor. Assigning broader scales to any one of the Big Five introduces construct contamination into that factor, whereas grouping narrower scales together ignores the differences among narrower constructs. One solution might be the specification of inclusion criteria that limit future meta-analyses to tests based on a single theory, such as the Five-Factor Model (FFM), or Big Five. In essence, Salgado (2003) used this approach to separate FFM tests from non-FFM ones.

Even among proponents of a five-factor structure, there are different views concerning the facets that constitute each factor. To illustrate this point, Costa and McCrae (1992) gave the six facets of NEO Openness to Experience as (Openness to) Feelings, Aesthetics, Fantasy, Actions, Values, and Ideas. Others (e.g., Chernyshenko, Stark, Woo, & Conz, 2008; Hough & Ones, 2001) list differently conceptualized facets, often omitting content related to feelings. When comparing measures, it would not be surprising to find a relatively low mean correlation between scales that emphasize distinct aspects of a construct. Even when measures are substantially correlated with one another within a factor, it is mathematically possible for scales to show differential relations with other measures and criteria.

1.2. Systematic differences in personality scale construction

At the scale level, there are several potential reasons that measures of reportedly the same construct may differ markedly even within the same theoretical framework. Among these are differences in the context of the items, scale breadth, and specific item content.

1.2.1. Context

Essentially the same item (e.g., "I speak my mind") may be framed for different contexts such as work, school, home, and so forth. Orientation of a scale toward general or more specific situations may affect its relationships with other constructs (Lievens, De Corte, & Schollaert, 2008). Studies by Schmit, Ryan, Stierwalt, and Powell (1995), Bing, Whanger, Davison, and VanHook (2004), and Hunthausen, Truxillo, Bauer, and Hammer (2003) found that scales initially developed for clinical use exhibited significantly improved predictive validities when items or instructions were altered to target the criterion context rather than the original general context.

1.2.2. Breadth

Although the majority of meta-analytic research to date has examined broad factors, there may be slight differences in breadth among "broad" measures, reflecting different understandings of the components or facets that should be included in a particular personality factor. Specific facets may lead to scale differences, but may also aid in the prediction of certain criteria (Christiansen, 2008; Dudley, Orvis, Lebiecki, & Cortina, 2006; Paunonen & Ashton, 2001).

1.2.3. Item content

Another seemingly subtle, but possibly substantive difference between scales thought to measure the same construct was mentioned by Hogan (2005), who noted that other researchers have grouped the NEO Agreeableness scale (Costa & McCrae, 1992) and the Hogan Personality Inventory (HPI; Hogan & Hogan, 2007) Likeability scale into the same meta-analyses for the factor Agreeableness. Hogan contended that the two scales measure different constructs and predict criteria differently.
1.3. Differences in reliability

All else being equal, scales that produce less reliable scores will prove less useful in prediction. Viswesvaran and Ones (2000) meta-analyzed reliabilities produced by Big Five personality scales and found standard deviations of internal consistency to hover around .10 across measures of a single Big Five construct. It is quite possible that some scales consistently produce scores of lower reliability than others.

1.4. Consequences of heterogeneous scale groupings

If seemingly similar scales are actually substantially different, grouping them in a meta-analysis poses the well-known "apples and oranges" problem (see Cortina, 2003, or Sharpe, 1997, for further discussion), in which different elements are combined in a common group and the group's relationship with other variables, such as work outcomes, is assessed. Clearly, if group elements have differing relationships with the criterion, an overall effect size will obscure these differences and lead to incorrect conclusions. As an illustration, consider a pair of predictor measures (A and B) and an outcome measure C. Assume A is a strong positive predictor of C, and B is a weak positive predictor of C. If A and B are grouped and we examine only their pooled ability to predict C, we will underestimate the predictive ability of A and overestimate the predictive ability of B. Differential predictive ability has been found when considering separate facets, such as those of extraversion for prediction of depression (e.g., Naragon-Gainey, Watson, & Markon, 2009), but such differences could also occur among tests. Researchers have acknowledged the inclusion of diverse predictors as a potential cause of relationship underestimation in meta-analysis, and the isolation of specific personality tests appears to be gaining support (e.g., Steel, Schmidt, & Shultz, 2008). Further consideration of the characteristics of measures will allow us to examine whether an "apples and oranges" problem exists.
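The A-and-B illustration can be made concrete with simple arithmetic; the correlations and sample sizes below are hypothetical, chosen only to show the direction of the bias:

```python
# Hypothetical effect sizes: scale A predicts criterion C strongly,
# scale B predicts it weakly; equal total N for simplicity.
r_a, n_a = 0.40, 500  # strong predictor of C
r_b, n_b = 0.10, 500  # weak predictor of C

# Treating A and B as one "construct" pools them into an N-weighted mean:
pooled = (n_a * r_a + n_b * r_b) / (n_a + n_b)

print(pooled)  # 0.25 -- understates A (.40) and overstates B (.10)
```

The pooled estimate is accurate for neither scale, which is the sense in which a heterogeneous grouping "obscures" differential validity.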
1.5. Hypothesis and research question

1.5.1. Hypothesis

Correlations between scores from scales that were developed from the Five-Factor Model will be stronger than those between scales that were developed with different personality theories in mind.

1.5.2. Research question

How similar are reliability estimates from different scales that seemingly measure the same construct?

The hypothesis most directly addresses scale commensurability and the effectiveness of one restriction aimed at improving it. The research question is directly related to the utility of tests. If tests are not commensurable, then meta-analyses of the tests are not meaningful, and the applied researcher is left to consider the merits of each test based on the reliability and validity evidence that is available for the given application. Differences in reliability could result in differences in validity.

2. Method

2.1. Literature review

2.1.1. Types of data collected

Correlations between scores from groups of individuals who were assessed using two or more personality scales were collected. Reliability data for the included personality scales were collected from these and additional studies. Coded variables included
personality test name, stated scale construct, corresponding Big Five construct and facet according to Hough and Ones (2001), test classification as FFM or non-FFM according to Salgado (2003) where possible, test reliability obtained (internal consistency), correlations with other personality scales of the same Big Five construct, number of participants in the sample, published/unpublished status, and other variables not used for this study.

2.1.2. Sources of data

Data were extracted from journal articles, dissertations, test manuals, and unpublished studies. Data were found by searching the PsycInfo and ProQuest Thesis and Dissertation databases and through e-mails to test publishers and personality researchers. Database search terms included the Big Five constructs and combinations of scale names from the Hough and Ones (2001) taxonomy. Also included were Big Five, Five-Factor Model, scale development, valid*, convergen*, and variations on those and similar terms. An extensive list of researchers was generated and contacted based on published literature, presentations at conferences, and recommendations by other researchers.

A total of 59 usable sources was located for the convergent validity meta-analyses. Some sources included effect sizes from multiple studies. Studies that included reliability data were combined with additional studies from industrial–organizational psychology that used at least one included personality scale, often as a predictor of a variety of job-related criteria. This yielded 79 usable sources for the reliability meta-analyses.

2.1.3. Inclusion criteria

Correlations between scales and reliabilities of scales were taken from studies of non-clinical adult populations using English-language versions of the scales. These studies were conducted by researchers from a range of psychological topic areas. Examination of scale similarity was limited to measures that had been grouped by Hough and Ones (2001) into each of the Big Five constructs, with the addition of closely-related scales, such as shorter or earlier versions by the same authors. Scales that were listed as global or facet measures of a Big Five construct were grouped as measures of that Big Five construct. No limitation was placed on publication date for the convergent studies; however, the earliest included study was published in 1970.

All included scales were classified as either FFM or non-FFM measures. We used classifications provided by Salgado (2003) where possible and classified additional scales based on a review of test manuals and scale development literature by the scale authors. A list of categorized measures is available by request from the first author.

If a study included more than one reliability coefficient for a Big Five factor, reliability for the global scale was selected. In the very few cases for which only reliabilities of narrower non-FFM scales were provided, one coefficient was chosen randomly after an attempt was made to retain a variety of facet scales. The choice of one coefficient per sample per analysis was made to avoid interdependence among effect sizes. Global scales were also preferred over facet scales for inclusion in convergent validity analyses. When facet scales were included, only their convergence with global or corresponding facet scales was considered. Analyses that were limited to FFM measures generally did not include narrower facet scales.

2.2. Analyses

2.2.1. Analyses of scale correlations (convergent validity)

First, meta-analyses of product-moment correlations between pairs of scales were calculated across all scales, without regard
for specific test, for each of the Big Five factors. For example, convergent validity coefficients comparing any two agreeableness scales were analyzed to determine the meta-analytic mean convergence and its distribution, along with related statistics. Second, we conducted similar meta-analyses of correlations exclusively between FFM scales of the same construct. Results from these analyses can be compared to those from the larger datasets that included both FFM and non-FFM scales to examine differences in estimated effect size as well as narrowing of credibility intervals. Third, we conducted meta-analyses of each specific scale's correlations with other scales of the same construct when possible. As an illustration, the Intellectance scale from the HPI was analyzed for its convergent validity with other scales categorized as openness to experience (without respect to the specific tests from which they came). Results may provide guidance on the grouping of scales for meta-analysis or research ideas related to incremental criterion-related validity, factor analysis, or scale development. Fourth, we conducted meta-analyses of correlations between particular pairs of scales when at least six correlations could be found. For example, we analyzed correlations between the NEO and CPI (California Psychological Inventory; Gough & Bradley, 1996) scales of conscientiousness. Results provide preliminary information about the likely amount of conceptual overlap between specific tests.
2.2.2. Analyses of reliability

At the factor level, coefficient alpha reliability was meta-analyzed for each Big Five factor, across scales. In other words, five meta-analyses of reliability were conducted (one for each of the five personality factors). At the scale level, coefficient alpha reliabilities of included scales were computed separately by scale when our data included at least six reliability coefficients. Studies using global factors of the most popular personality tests were most available, so these types of meta-analyses were most numerous.
2.3. Meta-analytic procedures

2.3.1. Outlier analysis

Analyses with and without outliers were conducted if these data points were suspected of having a great influence on the mean effect size or the variance of effect sizes. This is most often a concern when studies that are much larger than the rest (large N) produce effect sizes that are highly discrepant from the remaining studies. Studies were considered outliers based on sample size if they had more than twice as many participants as the next largest study.
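The sample-size rule just described can be sketched as a small check; the function name and data here are ours, purely illustrative:

```python
def flag_size_outliers(ns):
    """Flag any study whose N is more than twice the N of the
    next-largest study (the sample-size outlier rule above)."""
    flags = []
    for i, n in enumerate(ns):
        rest = ns[:i] + ns[i + 1:]       # all other studies
        flags.append(n > 2 * max(rest))  # compare against the largest of them
    return flags

# A study of N = 2600 dwarfs the next largest (N = 900), so only it is flagged:
print(flag_size_outliers([350, 900, 2600, 410]))  # [False, False, True, False]
```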
2.3.2. Correcting for statistical artifacts

A "bare-bones" approach was used to correct for sampling error only. More fully corrected meta-analyses adjust for a variety of artifacts, including unreliability of predictor and criterion, range restriction, and so forth. We opted for the more conservative approach suggested by Rosenthal (1991), reasoning that some researchers and practitioners may prefer meta-analytic effect sizes that might be found with large samples over effect sizes that might be obtained in an ideal world where, for example, measures are perfectly reliable. There is, however, some debate concerning the use of "bare-bones" analyses (cf. Schmidt & Hunter, 2003). We followed the method described in Hunter and Schmidt (2004) for all analyses.
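As a sketch of what a bare-bones analysis computes, the following function implements the standard Hunter–Schmidt quantities reported in the Results tables (N-weighted mean correlation, observed and expected sampling-error variances, residual SD, and 95% confidence and credibility intervals). It is our illustrative reconstruction using common textbook formulas, not the authors' code, and estimator variants (e.g., for the sampling-error variance) differ across sources:

```python
from math import sqrt

def bare_bones_meta(rs, ns):
    """Bare-bones (sampling-error-only) meta-analysis of correlations,
    in the spirit of Hunter & Schmidt (2004)."""
    k = len(rs)
    n_total = sum(ns)
    # Sample-size-weighted mean correlation (rho-hat)
    r_bar = sum(n * r for r, n in zip(rs, ns)) / n_total
    # N-weighted variance of the observed correlations (S2r)
    s2_r = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / n_total
    # Expected sampling-error variance (SE2r), using the average N
    n_bar = n_total / k
    se2_r = (1.0 - r_bar ** 2) ** 2 / (n_bar - 1.0)
    # Residual SD of infinite-sample correlations (sigma-hat-rho),
    # truncated at zero when observed variance is below expectation
    sd_rho = sqrt(max(0.0, s2_r - se2_r))
    # 95% confidence interval for the mean, and 95% credibility interval
    half_ci = 1.96 * sqrt(s2_r / k)
    return {
        "k": k, "N": n_total, "mean_r": r_bar, "S2r": s2_r,
        "SE2r": se2_r, "sd_rho": sd_rho,
        "ci95": (r_bar - half_ci, r_bar + half_ci),
        "cred95": (r_bar - 1.96 * sd_rho, r_bar + 1.96 * sd_rho),
    }

# Hypothetical convergent validities from three studies:
result = bare_bones_meta([0.50, 0.40, 0.30], [100, 200, 100])
print(round(result["mean_r"], 2))  # 0.4
```

The zero truncation of the residual SD is what produces the occasional σ̂ρ = 0 entries (and point credibility intervals) seen in the tables that follow.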
3. Results

3.1. Convergent validity

Sample-size-weighted mean correlations between scales (denoted ρ̂), along with the number of correlations on which these means are based (K), the total number of participants involved (N), the weighted variance of the observed correlations (S²r), the sampling error variance (SE²r), and the standard deviation of the infinite-sample correlations (σ̂ρ), as well as 95% confidence and credibility intervals, are given in Tables 1–6. Estimated mean convergent validities among all measures (FFM and non-FFM) were generally below .50, with convergent validity highest among extraversion scales, followed by emotional stability scales. Estimates of convergent validities by test ranged from .31 to .54 for agreeableness (Table 1), .27 to .51 for conscientiousness (Table 2), .26 to .51 for openness to experience (Table 3), .37 to .66 for extraversion (Table 4), and .32 to .66 for emotional stability
Table 1
Bare-bones meta-analytic convergent validities of specific agreeableness scales (with remaining agreeableness scales).

Test                          K    N      ρ̂     S²r    SE²r   σ̂ρ     95% Confidence interval   95% Credibility interval
ACL                           10   1330   .41   .025   .005   .141   .31–.51                   .13–.68
CPI                           10   1389   .31   .008   .006   .048   .26–.37                   .22–.41
Goldberg, Saucier, or IPIP    9    2931   .54   .020   .002   .135   .45–.64                   .28–.81
HPI                           7    985    .48   .017   .004   .113   .39–.58                   .26–.70
NEO                           19   4479   .52   .020   .002   .133   .46–.59                   .26–.78
PRF                           8    1196   .34   .008   .005   .049   .28–.40                   .24–.44
All tests                     36*  8280   .31   .105   .004   .319   .20–.42                   .31–.94
                              35   6320   .47   .028   .003   .156   .42–.53                   .17–.78
FFM global tests              10   2930   .61   .006   .001   .072   .56–.66                   .47–.75

Note: All tests = all tests included in the dataset for this research; FFM global tests = only tests categorized as FFM, excluding facet level measures; ACL = Adjective Check List (Gough); CPI = California Psychological Inventory; Goldberg, Saucier, and IPIP = Goldberg Big Five Factor Markers, Saucier Mini-Markers, and International Personality Item Pool; HPI = Hogan Personality Inventory; NEO = NEO-PI, NEO-PI-R, and NEO-FFI; PRF = Personality Research Form.
* Includes large (outlier N) studies; following row shows results without these studies.
Table 2
Bare-bones meta-analytic convergent validities of specific conscientiousness scales (with remaining conscientiousness scales).

Test                          K    N       ρ̂     S²r    SE²r   σ̂ρ     95% Confidence interval   95% Credibility interval
ACL                           12*  1542    .44   .019   .005   .119   .36–.52                   .20–.67
                              11   1132    .44   .026   .006   .141   .34–.53                   .16–.71
CPI                           27*  5224    .31   .022   .004   .132   .25–.34                   .05–.57
                              26   3685    .27   .027   .006   .143   .21–.34                   .01–.56
Goldberg, Saucier, and IPIP   8    2307    .47   .063   .002   .246   .30–.65                   .01–.96
HPI                           7    1189    .36   .048   .004   .207   .20–.52                   .04–.77
NEO                           31*  8161    .49   .026   .002   .153   .43–.54                   .19–.79
                              30   6622    .51   .029   .002   .162   .45–.57                   .19–.83
PRF                           8    1259    .34   .039   .005   .184   .20–.48                   .02–.70
All tests                     56*  11,407  .42   .044   .003   .202   .37–.48                   .03–.82
                              55   9868    .43   .051   .004   .217   .37–.49                   .004–.85
FFM global tests              10   2573    .63   .005   .001   .057   .59–.67                   .52–.74

Note: All tests = all tests included in the dataset for this research; FFM global tests = only tests categorized as FFM, excluding facet level measures; ACL = Adjective Check List (Gough); CPI = California Psychological Inventory; Goldberg, Saucier, and IPIP = Goldberg Big Five Factor Markers, Saucier Mini-Markers, and International Personality Item Pool; HPI = Hogan Personality Inventory; NEO = NEO-PI, NEO-PI-R, and NEO-FFI; PRF = Personality Research Form.
* Includes large (outlier N) studies; following row shows results without these studies.
Table 3
Bare-bones meta-analytic convergent validities of specific openness scales (with remaining openness scales).

Test                          K    N      ρ̂     S²r    SE²r   σ̂ρ     95% Confidence interval   95% Credibility interval
ACL                           8*   1076   .34   .008   .006   .049   .27–.40                   .24–.43
                              7    666    .28   .006   .009   0      .22–.34                   .28
Goldberg, Saucier, and IPIP   6*   1607   .51   .010   .002   .090   .43–.59                   .33–.69
                              5    1006   .47   .013   .003   .098   .37–.57                   .28–.67
HPI                           6*   1087   .26   .033   .005   .167   .12–.41                   .07–.59
                              5    683    .38   .014   .005   .094   .28–.48                   .20–.56
NEO                           19*  5522   .41   .017   .002   .120   .36–.47                   .18–.65
                              18   3562   .48   .015   .003   .107   .42–.53                   .27–.69
All tests                     26*  6710   .40   .025   .003   .150   .34–.46                   .11–.70
                              25   4750   .45   .029   .003   .161   .38–.51                   .13–.76
FFM global tests              10*  2120   .51   .015   .003   .110   .43–.58                   .29–.72
                              9    1519   .48   .018   .004   .122   .39–.57                   .24–.72

Note: All tests = all tests included in the dataset for this research; FFM global tests = only tests categorized as FFM, excluding facet level measures; ACL = Adjective Check List (Gough); Goldberg, Saucier, and IPIP = Goldberg Big Five Factor Markers, Saucier Mini-Markers, and International Personality Item Pool; HPI = Hogan Personality Inventory; NEO = NEO-PI, NEO-PI-R, and NEO-FFI.
* Includes large (outlier N) studies; following row shows results without these studies.
Table 4
Bare-bones meta-analytic convergent validities of specific extraversion scales (with remaining extraversion scales).

Test                          K     N       ρ̂     S²r    SE²r   σ̂ρ     95% Confidence interval   95% Credibility interval
ACL                           14    2218    .37   .011   .005   .081   .32–.43                   .22–.53
CPI                           32    9365    .57   .016   .002   .121   .53–.61                   .33–.81
EPI                           7*    1017    .66   .010   .002   .090   .59–.74                   .48–.84
                              6     548     .63   .018   .004   .117   .53–.74                   .41–.86
Goldberg, Saucier, and IPIP   8     2307    .60   .004   .001   .050   .56–.65                   .50–.70
HPI                           8*    1298    .41   .065   .004   .246   .23–.59                   .07–.89
                              7     894     .58   .005   .004   .032   .53–.63                   .51–.64
MBTI                          42*   11,577  .63   .014   .001   .111   .59–.66                   .41–.85
                              41    10,359  .65   .013   .001   .107   .61–.68                   .43–.86
MMPI                          19*   7382    .55   .026   .001   .159   .48–.62                   .24–.86
                              17    4080    .58   .025   .002   .152   .50–.65                   .28–.87
NEO                           45    14,780  .58   .017   .001   .124   .54–.62                   .34–.83
PRF                           7     1046    .56   .019   .003   .125   .46–.66                   .32–.81
16PF                          14    3289    .61   .015   .002   .116   .55–.68                   .39–.84
All tests                     103   28,521  .56   .023   .002   .145   .53–.59                   .28–.85
FFM global tests              9     2449    .62   .003   .001   .041   .58–.66                   .54–.70

Note: All tests = all tests included in the dataset for this research; FFM global tests = only tests categorized as FFM, excluding facet level measures; ACL = Adjective Check List (Gough); CPI = California Psychological Inventory; EPI = Eysenck Personality Inventory; Goldberg, Saucier, and IPIP = Goldberg Big Five Factor Markers, Saucier Mini-Markers, and International Personality Item Pool; HPI = Hogan Personality Inventory; MBTI = Myers-Briggs Type Indicator; MMPI = Minnesota Multiphasic Personality Inventory; NEO = NEO-PI, NEO-PI-R, and NEO-FFI; PRF = Personality Research Form; 16PF = Sixteen Personality Factors.
* Includes large (outlier N) studies; following row shows results without these studies.
Table 5
Bare-bones meta-analytic convergent validities of specific emotional stability scales (with remaining emotional stability scales).

Test                          K    N       ρ̂     S²r    SE²r   σ̂ρ     95% Confidence interval   95% Credibility interval
ACL                           8    1888    .46   .041   .003   .196   .32–.60                   .08–.85
Goldberg, Saucier, and IPIP   7    2161    .64   .004   .001   .056   .59–.69                   .53–.75
MMPI                          19*  6544    .35   .030   .002   .167   .29–.43                   .02–.67
                              17   3242    .32   .059   .004   .234   .21–.44                   .14–.78
NEO                           23*  8184    .55   .027   .001   .160   .48–.61                   .23–.86
                              21   4882    .66   .014   .001   .113   .61–.71                   .44–.88
16PF                          7    1333    .66   .010   .002   .092   .58–.73                   .48–.84
All tests                     35*  11,019  .51   .048   .002   .215   .43–.58                   .08–.93
                              33   7717    .51   .059   .002   .239   .43–.59                   .04–.98
FFM global tests              9*   2090    .64   .005   .002   .055   .59–.68                   .53–.75
                              8    1489    .68   .001   .002   0      .66–.70                   .68

Note: All tests = all tests included in the dataset for this research; FFM global tests = only tests categorized as FFM, excluding facet level measures; ACL = Adjective Check List (Gough); Goldberg, Saucier, and IPIP = Goldberg Big Five Factor Markers, Saucier Mini-Markers, and International Personality Item Pool; HPI = Hogan Personality Inventory; MMPI = Minnesota Multiphasic Personality Inventory; NEO = NEO-PI, NEO-PI-R, and NEO-FFI; 16PF = Sixteen Personality Factors.
* Includes large (outlier N) studies; following row shows results without these studies.
Table 6
Bare-bones convergent validities of some specific test pairs.

Test pair       Construct            K    N      ρ̂     S²r    SE²r   σ̂ρ     95% Confidence interval   95% Credibility interval
NEO and MMPI    Emotional stability  6*   4205   .41   .010   .001   .097   .33–.49                   .22–.60
                                     4    903    .54   .026   .002   .155   .38–.70                   .24–.84
NEO and CPI     Conscientiousness    7*   2714   .40   .006   .002   .061   .35–.46                   .28–.52
                                     6    1175   .42   .012   .004   .095   .33–.51                   .23–.60
NEO and MMPI    Extraversion         7    4346   .54   .022   .001   .146   .43–.65                   .25–.82
NEO and MBTI    Extraversion         13   3510   .69   .001   .001   .022   .67–.72                   .65–.74
CPI and MBTI    Extraversion         7    3671   .63   .005   .001   .066   .58–.68                   .50–.76
16PF and MBTI   Extraversion         8    1999   .66   .011   .001   .100   .58–.73                   .46–.85

Note: CPI = California Psychological Inventory; MBTI = Myers-Briggs Type Indicator; MMPI = Minnesota Multiphasic Personality Inventory; NEO = NEO-PI, NEO-PI-R, and NEO-FFI; 16PF = Sixteen Personality Factors.
* Includes large (outlier N) studies; following row shows results without these studies.
(Table 5). Credibility intervals were generally quite wide. These intervals tended to narrow when meta-analyzing the convergence of a specific test to all others, and they narrowed further when examining specific pairs of tests, indicating that specific test name was a moderator of convergence. In other words, some scales showed greater convergence than others. Confirming our hypothesis, estimated mean convergent validities between scores from scales that were developed from the
FFM were stronger than those between scales that represented a variety of personality theories although the degree of improvement varied by construct. Estimated mean convergent validities among FFM measures were in the low to mid .60s, with the exception of openness to experience, which was around .50. These figures indicated a noticeable improvement in convergence among measures of agreeableness, conscientiousness, and emotional stability, but not as much improvement among measures of openness
differences were statistically significant, differences of this magnitude may not be of particular concern in practice, as results indicated satisfactory reliabilities for research purposes (over .70; Nunnally & Bernstein, 1994).
Table 7 Mean correlations among Big Five personality dimensions from the literature. Personality Dimension 1. Agreeableness 2. Conscientiousness
1
2
3
4
.27 .28 *
3. Emotional Stability
.25 .42 *
4. Extraversion
.17 .13 *
5. Openness to Experience
.11 .12 *
.26 .38 .46 .00 .20 .32 .06 .13 .27
4. Discussion
.19 .25 .49 .16 .12 .30
An underlying assumption of some previous meta-analyses using Big Five personality factors is that all scales that ostensibly measure the same factor are similar enough to group into a common meta-analysis. To assess the degree of similarity among scales, two indicators were examined: correlations among personality scales (convergent validity or alternate forms reliability) including a separate analysis among FFM scales, and reliability differences across personality scales (internal consistency reliability). Results indicated that the assumption of commensurability is questionable. Convergent validities were lower than expected, indicating substantial differences among tests. Such a result begs for an explanation of the differences among tests as well as a consideration of the implications of such differences for theory and practice. In consideration of the reasons for differences among tests, it may be helpful to start with specific results and consider potential reasons for these. Results provide support that differences in underlying factor taxonomy, scale breadth, and item content diminish convergent validities. For example, convergent validities increased substantially for measures of some constructs when we limited studies to those using global construct scales classified as FFM measures. For the agreeableness and conscientiousness constructs, convergent validities with a variety of other tests were highest for the NEO and Goldberg families of tests and lowest for the CPI and PRF (Personality Research Form; Jackson, 1999). One explanation for these differences may be that the NEO and Goldberg tests were intended as measures of Big Five factors, whereas the CPI and PRF were based on other models of personality. Salgado (2003) contended that convergent validity should be lower across
.17 .40 .40
Note: Results given in or based on the following articles are provided, in order, from top to bottom: Ones, Viswesvaran, and Reiss (1996) (estimated population correlations based on previous meta-analytic research by Ones); Digman (1997) (unitweighted, uncorrected mean correlations from nine adult studies included in his analyses); Spector, Schneider, Vance, and Hezlett (2000) (based on a single study with N ranging from 332 to 407). * Not provided.
and extraversion when FFM measures were examined exclusively (Tables 1–5). Convergent validities can be compared to results from studies that reported correlations between different factors (e.g., Digman, 1997; Ones, Viswesvaran, & Reiss, 1996; Spector, Schneider, Vance, & Hezlett, 2000). Relevant findings from these studies are included in Table 7. In general, convergent validities are substantially larger than reported discriminant validity correlations. Nevertheless, convergent validities varied by test as well as by construct and were lower than an aspirational minimum of .70.

3.2. Reliability

Meta-analyses of reliabilities (Table 8) indicated that reliability did not differ greatly across commonly-used tests. Although some

Table 8
Bare-bones meta-analysis of reliability.

Test                                           K    N       q̂    S²r   SE²r  σ̂q    95% CI   95% CrI
PCI (Agreeableness)                            9    2,034   .79  .004  .001  .055  .75–.83  .68–.90
PCI (Conscientiousness)                        14   3,595   .81  .004  .000  .061  .77–.84  .69–.93
PCI (Emotional stability)                      9    2,034   .85  .001  .000  .028  .82–.87  .79–.90
PCI (Extraversion)                             9    2,034   .85  .000  .000  .012  .84–.87  .83–.88
PCI (Openness)                                 9    2,034   .81  .003  .001  .045  .78–.84  .72–.90
NEO-PI-R (Agreeableness)                       10   2,116   .86  .005  .000  .072  .82–.91  .72–1.0
NEO-PI-R (Conscientiousness)                   11   2,021   .91  .002  .000  .040  .89–.93  .83–.99
NEO-PI-R (Emotional stability/neuroticism)     10   1,919   .90  .001  .000  .029  .89–.92  .85–.96
NEO-PI-R (Extraversion)                        13   2,416   .86  .003  .000  .051  .83–.89  .76–.96
NEO-PI-R (Openness)                            8    1,480   .85  .005  .000  .068  .80–.90  .75–.98
NEO-FFI (Agreeableness)                        13   2,300   .74  .002  .001  .037  .71–.77  .67–.82
NEO-FFI (Conscientiousness)                    18   2,996   .81  .003  .001  .050  .78–.84  .71–.91
NEO-FFI (Emotional stability/neuroticism)      12   2,014   .80  .003  .001  .043  .77–.83  .71–.88
NEO-FFI (Extraversion)                         15   2,538   .79  .004  .001  .053  .76–.82  .69–.89
NEO-FFI (Openness)                             11   1,789   .75  .005  .001  .058  .71–.79  .64–.87
Goldberg/Saucier/IPIP (Agreeableness)          17   3,112   .79  .006  .001  .073  .76–.83  .65–.94
Goldberg/Saucier/IPIP (Conscientiousness)      25   4,538   .82  .002  .001  .031  .81–.84  .76–.88
Goldberg/Saucier/IPIP (Emotional stability)    18   3,274   .80  .005  .001  .065  .76–.83  .67–.92
Goldberg/Saucier/IPIP (Extraversion)           18   3,369   .82  .004  .001  .057  .79–.85  .71–.93
Goldberg/Saucier/IPIP (Intellect or Openness)  19   3,333   .79  .006  .001  .073  .75–.82  .64–.93
All tests (Agreeableness)                      54*  13,729  .80  .004  .000  .063  .79–.82  .68–.93
All tests (Agreeableness, excl. outliers)      53   9,851   .80  .006  .001  .074  .78–.82  .65–.95
All tests (Conscientiousness)                  77   15,218  .81  .009  .001  .089  .79–.83  .64–.98
All tests (Emotional stability)                53   9,741   .82  .011  .001  .100  .79–.85  .62–1.0
All tests (Extraversion)                       61   16,355  .83  .004  .000  .058  .81–.84  .71–.94
All tests (Openness)                           51   8,911   .78  .006  .001  .070  .76–.80  .65–.92

Note: Goldberg, Saucier, and IPIP = Goldberg Big Five Factor Markers, Saucier Mini-Markers, and International Personality Item Pool; PCI = Personal Characteristics Inventory.
* Includes large (outlier N) studies; the following row shows results without these studies.
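For readers unfamiliar with the bare-bones procedure summarized in Table 8, the core computations (sample-size-weighted mean reliability, observed variance, expected sampling-error variance, and the 95% credibility interval) follow Hunter and Schmidt (2004) and can be sketched as below. This is an illustrative sketch only: the study-level reliabilities and sample sizes are hypothetical and are not values from the studies meta-analyzed here.

```python
import math

def bare_bones(rs, ns):
    """Bare-bones meta-analysis of correlations/reliabilities.

    Returns the sample-size-weighted mean (q-hat), the observed
    variance of coefficients (S2r), the expected sampling-error
    variance (SE2r), and a 95% credibility interval based on the
    residual standard deviation.
    """
    total_n = sum(ns)
    k = len(rs)
    # Sample-size-weighted mean coefficient
    r_bar = sum(n * r for r, n in zip(rs, ns)) / total_n
    # Frequency-weighted observed variance across studies
    s2_r = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / total_n
    # Expected variance due to sampling error alone
    se2 = (1 - r_bar ** 2) ** 2 * k / total_n
    # Residual SD after removing sampling error (floored at zero)
    sd_rho = math.sqrt(max(s2_r - se2, 0.0))
    cred = (r_bar - 1.96 * sd_rho, r_bar + 1.96 * sd_rho)
    return r_bar, s2_r, se2, cred

# Hypothetical alpha coefficients and study Ns
r_bar, s2_r, se2, cred = bare_bones([.74, .81, .85, .79],
                                    [150, 220, 180, 300])
print(round(r_bar, 3), round(cred[0], 3), round(cred[1], 3))
```

The credibility interval describes the spread of underlying coefficients remaining after sampling error is subtracted out, which is why several intervals in Table 8 are much wider than the corresponding confidence intervals around the mean.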
measure types than among FFM measures exclusively. Our results provide empirical support for this contention. Another explanation for relatively low convergent validity is that some scales, such as the PRF scales for agreeableness and conscientiousness, measure narrower facets of the Big Five rather than global factors (Hough & Ones, 2001). Recognition of the facets measured by tests may lead toward understanding similarities and differences among personality tests, and perhaps the nature of any differential prediction by tests.

This recognition brings us to our second topic of discussion: implications for theory and practice. Differences among scales in their focus on facets could partially explain why certain measures may be more useful in specific circumstances or when attempting to predict certain criteria. For example, some facets of agreeableness might be more useful for predicting response to psychotherapy than for predicting performance in retail sales jobs. Clearly, continued development of facet taxonomies and categorization of measures into facets and global factors is needed (Roberts, Chernyshenko, Stark, & Goldberg, 2005). Results from the Roberts et al. study identified constituent facets of conscientiousness. More recent work has begun to clarify facets of openness (Chernyshenko et al., 2008) and agreeableness (Connelly, Davies, Ones, & Birkland, 2008). Perhaps with greater understanding of facets, we will find that scales included in the current study differ in their focus on specific facets. Comparing factor analyses that yield constituent facets of each commonly-used test of a particular Big Five factor may help to define content differences between tests.

Relatively low convergent validity does not mean that tests necessarily differ in their usefulness for prediction, particularly for global criteria. Until we know whether personality tests display similar relations to various criteria, it is uncertain whether they differ in utility.
One interpretation of the convergent validity results from this study is that tests with higher convergent validities are more similar, and are presumably measuring something closer to a generally understood concept of the construct, whereas those with lower convergent validities may be measuring less commonly included aspects of the construct. If these less commonly included aspects add to the criterion-related validity of the test, it could be helpful to identify them and include them in other tests. However, if they are not useful, elimination of the discrepant aspects might be advisable. A closer look at criterion-related validities, especially at the facet and item levels, would clarify this issue. Particularly useful for further test development and validity maximization would be future research using designs that allow a determination of the incremental validity of one personality test over another.

In applied use, variation in the mean level of convergent validity among tests is of practical interest. Less convergence between tests implies that consumers need to be better informed about the exact content of each measure and the differences between tests. According to our results, this is especially true when considering tests that are based on different personality theories.

In summary, differences in content were found among scales, based on convergent validities. The specific nature of these differences is not fully known. Differences among FFM tests appear to be of lesser magnitude than differences observed when non-FFM measures are included. However, mean correlations from meta-analyses in which a wide variety of personality tests are grouped into categories by the Big Five (or other taxonomies) may not be as meaningful as desired. Correlations among scales are small enough that different measures may have different nomological nets and different utility for applied research. Internal consistency reliability did not explain much of the variability in convergence.
Continuing efforts toward the improvement of personality testing for prediction of specific criteria are encouraged.
Acknowledgements

We thank Walter C. Borman and dissertation committee members, Judith Becker Bryant, Bill N. Kinder, and Stephen Stark for their helpful comments during this process.
References

A list of references for studies included in the meta-analyses is available by request from the first author.

Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26.
Bing, M. N., Whanger, J. C., Davison, H. K., & VanHook, J. B. (2004). Incremental validity of the frame-of-reference effect in personality scale scores: A replication and extension. Journal of Applied Psychology, 89, 150–157.
Cattell, R. B., Eber, H. W., & Tatsuoka, M. M. (1970). Handbook for the Sixteen Personality Factor Questionnaire (16PF). Champaign, IL: Institute for Personality and Ability Testing.
Chernyshenko, O. S., Stark, S., Woo, S. E., & Conz, G. (2008). Openness to Experience: Its facet structure, measurement, and validity. Paper presented at the 23rd Annual Conference of the Society for Industrial and Organizational Psychology, San Francisco, CA.
Christiansen, N. D. (2008). Further consideration of the usefulness of narrow trait factors. Paper presented at the 23rd Annual Conference of the Society for Industrial and Organizational Psychology, San Francisco, CA.
Clarke, S., & Robertson, I. (2008). An examination of the role of personality in work accidents using meta-analysis. Applied Psychology: An International Review, 57, 94–108.
Connelly, B. S., Davies, S., Ones, D. S., & Birkland, A. (2008). Agreeableness: A meta-analytic review of structure, convergence, and predictive validity. Paper presented at the 23rd Annual Conference of the Society for Industrial and Organizational Psychology, San Francisco, CA.
Connor-Smith, J. K., & Flachsbart, C. (2007). Relations between personality and coping: A meta-analysis. Journal of Personality and Social Psychology, 93, 1080–1107.
Cortina, J. M. (2003). Apples and oranges (and pears, oh my!): The search for moderators in meta-analysis. Organizational Research Methods, 6, 415–439.
Costa, P. T., Jr., & McCrae, R. R. (1992). Revised NEO Personality Inventory and NEO Five-Factor Inventory professional manual. Odessa, FL: Psychological Assessment Resources.
Digman, J. M. (1997). Higher-order factors of the Big Five. Journal of Personality and Social Psychology, 73, 1246–1256.
Dudley, N. M., Orvis, K. A., Lebiecki, J. E., & Cortina, J. M. (2006). A meta-analytic investigation of conscientiousness in the prediction of job performance: Examining the intercorrelations and the incremental validity of narrow traits. Journal of Applied Psychology, 91, 40–57.
Eysenck, H. J., & Eysenck, M. W. (1985). Personality and individual differences: A natural science approach. New York: Plenum Press.
Gough, H. G., & Bradley, P. (1996). CPI manual (3rd ed.). Mountain View, CA: CPP.
Hogan, R. (2005). In defense of personality measurement: New wine for old whiners. Human Performance, 18, 331–341.
Hogan, R., & Hogan, J. (2007). Hogan Personality Inventory manual (3rd ed.). Tulsa, OK: Hogan Assessment Systems.
Hough, L. M., & Ones, D. S. (2001). The structure, measurement, validity, and use of personality variables in industrial, work, and organizational psychology. In N. Anderson, D. S. Ones, H. Sinangil Kepir, & C. Viswesvaran (Eds.), Handbook of industrial, work, and organizational psychology (Vol. 1, pp. 233–277). London: Sage.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Thousand Oaks, CA: Sage.
Hunthausen, J. M., Truxillo, D. M., Bauer, T. N., & Hammer, L. B. (2003). A field study of frame-of-reference effects on personality test validity. Journal of Applied Psychology, 88, 545–551.
Jackson, D. N. (1999). Personality Research Form manual (3rd ed.). Port Huron, MI: SIGMA Assessment Systems.
Judge, T. A., Bono, J. E., Ilies, R., & Gerhardt, M. W. (2002). Personality and leadership: A qualitative and quantitative review. Journal of Applied Psychology, 87, 765–780.
Judge, T. A., Heller, D., & Mount, M. K. (2002). Five-factor model of personality and job satisfaction: A meta-analysis. Journal of Applied Psychology, 87, 530–541.
Lievens, F., De Corte, W., & Schollaert, E. (2008). A closer look at the frame-of-reference effect in personality scale scores and validity. Journal of Applied Psychology, 93, 268–279.
Naragon-Gainey, K., Watson, D., & Markon, K. E. (2009). Differential relations of depression and social anxiety symptoms to the facets of Extraversion/Positive Emotionality. Journal of Abnormal Psychology, 118, 299–310.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Ones, D. S., Viswesvaran, C., & Reiss, A. D. (1996). Role of social desirability in personality testing for personnel selection: The red herring. Journal of Applied Psychology, 81, 660–679.
Paunonen, S. V., & Ashton, M. C. (2001). Big Five factors and facets and the prediction of behavior. Journal of Personality and Social Psychology, 81, 524–539.
Prinzie, P., Stams, G. J. J. M., Dekovic, M., Reijntjes, A. H. A., & Belsky, J. (2009). The relations between parents' Big Five personality factors and parenting: A meta-analytic review. Journal of Personality and Social Psychology, 97, 351–362.
Roberts, B. W., Chernyshenko, O. S., Stark, S., & Goldberg, L. R. (2005). The structure of conscientiousness: An empirical investigation based on seven major personality questionnaires. Personnel Psychology, 58, 103–139.
Rosenthal, R. (1991). Meta-analytic procedures for social research. Newbury Park, CA: Sage.
Salgado, J. F. (1997). The five factor model of personality and job performance in the European Community. Journal of Applied Psychology, 82, 30–43.
Salgado, J. F. (2003). Predicting job performance using FFM and non-FFM personality measures. Journal of Occupational and Organizational Psychology, 76, 323–346.
Schmidt, F. L., & Hunter, J. E. (2003). Meta-analysis. In I. B. Weiner (Series Ed.) & J. A. Schinka & W. F. Velicer (Vol. Eds.), Handbook of psychology: Vol. 2. Research methods in psychology (pp. 533–554). Hoboken, NJ: Wiley.
Schmit, M. J., Ryan, A. M., Stierwalt, S. L., & Powell, A. B. (1995). Frame-of-reference effects on personality scale scores and criterion-related validity. Journal of Applied Psychology, 80, 607–620.
Sharpe, D. (1997). Of apples and oranges, file drawers and garbage: Why validity issues in meta-analysis will not go away. Clinical Psychology Review, 17, 881–901.
Spector, P. E., Schneider, J. R., Vance, C. A., & Hezlett, S. A. (2000). The relation of cognitive ability and personality traits to assessment center performance. Journal of Applied Social Psychology, 30, 1474–1491.
Steel, P., Schmidt, J., & Shultz, J. (2008). Refining the relationship between personality and subjective well-being. Psychological Bulletin, 134, 138–161.
Tett, R. P., Jackson, D. N., & Rothstein, M. (1991). Personality measures as predictors of job performance: A meta-analytic review. Personnel Psychology, 44, 703–742.
Viswesvaran, C., & Ones, D. S. (2000). Measurement error in the "Big Five Factors" personality assessment: Reliability generalization across studies and measures. Educational and Psychological Measurement, 60, 224–235.