
ORGANIZATIONAL BEHAVIOR AND HUMAN DECISION PROCESSES

Vol. 70, No. 2, May 1997, pp. 95–103
Article No. OB972698

Differences in Accuracy of Absolute and Comparative Performance Appraisal Methods

Stephen H. Wagner and Richard D. Goffin
Northern Illinois University

The primary question of this experiment was whether absolute and comparative performance appraisal ratings differ in terms of four components of accuracy: differential elevation (DE), differential accuracy (DA), elevation (EL), and stereotype accuracy (SA). Because comparative performance appraisal methods often use global items (overall performance dimensions), whereas certain absolute performance appraisal methods utilize specific items (critical incidents), the effect of specific versus global items was also investigated. Eighty participants viewed four videotaped lecturers and rated their performance 24 h later with both absolute and comparative performance appraisal methods using both specific and global item types. No advantages were associated with the absolute rating method; comparative ratings were more accurate than absolute ratings with respect to DA and SA. Specific items resulted in greater DE and EL accuracy than did global items; the converse was true with respect to DA and SA accuracy. Implications for the practice of performance appraisal are discussed. © 1997 Academic Press

The second author is now at the Department of Psychology, The University of Western Ontario. This research was carried out as the first author's M.A. thesis. The authors are indebted to Peter Villanova for his input during the planning of this project and the writing of this article. An earlier version of this paper was presented at the Eleventh Annual Conference of the Society for Industrial and Organizational Psychology, April 1996, San Diego, CA. Address correspondence and reprint requests to Richard D. Goffin, Department of Psychology, The University of Western Ontario, Canada N6A 5C2.

Performance appraisal typically involves one of two types of judgment tasks. The first and most common type of judgment, termed absolute, involves rating each ratee against some absolute standard. The absolute standard can be operationalized in vague, trait-like terms, as with graphic rating scales, or in behavior-based terms, as with behaviorally anchored rating scales (BARS; Smith & Kendall, 1963), behavioral observation scales (BOS; Latham & Wexley, 1977), and appraisal systems used in management by objectives (MBO; Carroll & Schneier, 1982; Drucker, 1954). The second kind of judgment task, termed comparative, requires raters to rate the ratees relative to one another. Traditional forms of comparative procedures include straight ranking and paired comparison (Murphy & Cleveland, 1995). Numerous comparative and absolute rating approaches have been developed in an attempt to optimize the validity of performance judgments (Cardy & Dobbins, 1994). However, decades of rating-format research have not produced unequivocal evidence of the superiority of one method over others (Landy & Farr, 1980). Consequently, the focus of performance appraisal research has shifted from the question of measurement to the consideration of the social–psychological processes underlying performance appraisal (Murphy & Cleveland, 1995). However, the possibility of format-related differences may be worthy of reconsideration based on recent evidence that comparative approaches to performance appraisal may possess greater criterion-related validity than absolute approaches (Goffin, Gellatly, Paunonen, Jackson, & Meyer, 1996; Nathan & Alexander, 1988). The purpose of the present research was to extend this line of research by comparing the accuracy of absolute and comparative performance appraisal methods. The judgmental accuracy of students' ratings of the performance of videotaped lecturers under absolute and comparative performance appraisal methods was assessed with four distinct forms of accuracy developed by Cronbach (1955): elevation (EL), the accuracy of ratings, averaging across all ratees and performance dimensions; differential elevation (DE), accuracy in discriminating among ratees, averaging across performance dimensions; stereotype accuracy (SA), accuracy in discriminating among performance dimensions, averaging across ratees; and differential accuracy (DA), accuracy in discriminating among ratees within each performance dimension.


Validity of Comparative and Absolute Methods

Miner (1988) compared a traditional straight-ranking method and a rated-ranking technique. The rated-ranking procedure involved first ranking employees on a dimension of performance, then giving each employee an absolute rating on that dimension that was consistent with the initial rank order. The rated-ranking procedure did not demonstrate superiority over straight ranking with regard to reliability or construct validity, suggesting that the addition of the absolute rating procedure was no better than the ranking method by itself. Goffin et al. (1996a) compared the validity of the BOS and a comparative rating method, the Relative Percentile Method (RPM). Although new, the RPM has been used in several field studies (Christiansen, Goffin, Johnston, & Rothstein, 1994; Goffin et al., 1996a; Goffin, Rothstein, & Johnston, 1996b; Meyer, Paunonen, Gellatly, & Goffin, 1989). Goffin et al. (1996a) concluded that the RPM demonstrated a significantly greater relationship with external criteria, such as cognitive ability, personality, vocational interests, and organizational commitment, than did the BOS. Given that these external criteria might reasonably be considered part of the nomological network of variables surrounding the domain of performance, Goffin et al. (1996a) suggested that their results supported the validity of the RPM over that of the BOS. The general inferential strategy used by Goffin et al. (1996a) is described by Murphy and Cleveland (1995) as one of the strategies appropriate for construct validation of performance ratings. Interestingly, Goffin et al.'s (1996a) results also indicated a marked lack of convergence between the BOS and the RPM, suggesting that the two approaches might be measuring different attributes.

Further evidence of the validity of comparative performance methods was provided by Nathan and Alexander's (1988) meta-analysis, which indicated that clerical skills tests tended to have higher average (corrected) criterion validity with comparative (r̄ = .51) than with absolute performance criteria (r̄ = .34). Additionally, the potential value of comparative information has been highlighted by Farh and Dobbins' (1989) finding that self-ratings may have more validity when they incorporate comparative information. Bommer, Johnson, Rich, Podsakoff, and MacKenzie (1995) used meta-analysis to study the relations of absolute and comparative performance appraisal methods with various objective criteria (e.g., sales output, production records). Although Bommer et al. failed to find a statistically significant difference in the sizes of these relations, those involving comparative methods were, on average, higher than those of absolute methods.

Heneman (1986) suggested that it may be simpler for raters to compare an employee with other employees than to evaluate an employee against absolute rating anchors. Although comparative judgments play a central role in many theories of social judgment, social comparison theory (Festinger, 1954) may best support Heneman's hypothesis about the relative simplicity of comparative performance appraisal methods.¹ Festinger proposed that under conditions of uncertainty, people evaluate their own abilities by making comparisons with similar others. More recently, Kruglanski and Mayseless (1990) defined the social comparison process more broadly as "comparative judgments of social stimuli on particular content dimensions" (p. 196). Because judgmental measures of performance are inherently subjective, uncertainty may, to some extent, be a common element of the performance rating experience. Moreover, raters often do not have comprehensive knowledge of the ratee's performance (Murphy & Cleveland, 1995), which may increase the uncertainty involved in rating performance. Thus, the uncertainty involved in rating performance may promote comparison of the employee being rated with other, similar employees. By facilitating this social comparison process, comparative methods of performance appraisal may result in ratings with greater validity and accuracy. On the other hand, by considering each ratee in isolation, absolute appraisal methods may tend to hamper social comparisons, thus detracting from accuracy and validity.

¹ We thank an anonymous reviewer for suggesting social comparison theory as a possible rationale for the superiority of relative methods.

Global and Specific Performance Appraisal Methods

Traditionally, comparative performance appraisal methods, such as straight ranking and paired comparison, have used global items that refer to an overall dimension of performance. Although global items have also been used in many absolute rating formats (e.g., Smith & Kendall, 1963), recent years have witnessed a tendency for absolute formats to incorporate specific items that may refer to critical incidents of behavior (e.g., Latham & Wexley, 1977). To avoid the possible confounding of rating method (absolute vs comparative) with item type (global vs specific), the current study used comparative and absolute methods that incorporated both specific and global items. By including both global and specific items in both rating methods, it was possible to compare the effects of comparative and absolute rating methods independently of the effects of global and specific item type.



Basic psychometric considerations would suggest that specific formats, by virtue of using multi-item scales, should surpass global formats owing to greater reliability and greater domain representativeness. However, research comparing the quality of ratings from global and specific items has yielded mixed findings. Fay and Latham (1982) found that raters using scales describing specific behaviors committed fewer errors than raters using global trait descriptions. However, Heneman (1988) found that global traits were rated more accurately than specific behaviors. The studies mentioned above inferred rating accuracy through a variety of methods, but did not directly assess it using Cronbach's (1955) approach. One study that did apply Cronbach's approach compared the accuracy of ratings made immediately after the observation of performance with those made after a 24-h delay, using eight global performance dimensions and 12 specific job behaviors (Murphy & Balzer, 1986). Although the effect of global versus specific items was not evaluated as a main effect by Murphy and Balzer, a significant interaction between time of rating (immediate or delayed) and item type (global vs specific) was observed for one component of accuracy, DE. Ratings that were memory-based (the 24-h delay condition) were more accurate with global than with specific items, whereas ratings made immediately after observation were more accurate with specific than with global items. These findings suggest that memory-based performance ratings may capitalize on raters' general impressions of ratees rather than their recall of specific behaviors. Because performance appraisal is generally an infrequent (i.e., annual), memory-based process, the current experiment incorporated a 24-h delay between observation and rating in order to place memory demands on our raters that more closely resemble the modal criterion setting (see Bernardin & Villanova, 1986). Under such conditions, global items may be more accurate than specific items in terms of DE.

HYPOTHESES

The primary purpose of this experiment was to compare the accuracy of comparative and absolute performance rating approaches. The first hypothesis was that the accuracy of a comparative rating method would exceed that of an absolute approach with respect to Cronbach's (1955) four components of accuracy. It was further predicted that the comparative approach would be more accurate overall, regardless of the item type (global vs specific). A secondary purpose was to determine whether any advantages in accuracy are associated with the type of item (global vs specific). Following Murphy and Balzer (1986), the second hypothesis was that global items would be more accurate than specific items in terms of DE accuracy. No differences between the global and specific item types were expected for DA, SA, or EL.

METHOD

Participants

Thirty-four male and forty-six female (N = 80) undergraduate psychology students viewed videotaped lectures and rated each lecturer's performance the following day. The sample size was determined by conducting a statistical power analysis assuming a medium (.5) effect size (Cohen, 1988). Each test session was conducted in a small-group format.

Materials

Absolute rating scale. The absolute rating method utilized both specific and global items. The specific version of the absolute form contained 12 items that were developed for the videotapes described below by Murphy, Garcia, Kerkar, Martin, and Balzer (1982a). Murphy, Martin, and Garcia (1982b) submitted these items to a principal-components analysis and found two factors, speaking style and clarity, accounting for 75% of the common variance. The global absolute form contained two items with explicit definitions, asking raters to evaluate how often the lecturer demonstrated clarity and effective speaking style. Both global and specific items utilized a 7-point frequency-of-observation scale ranging from "never" (1) to "all of the time" (7). A hallmark of the absolute rating scales was that each ratee was evaluated independently using a separate rating form.

Comparative rating scale. The comparative rating method also required specific and global items. The specific comparative form consisted of the 12 critical-incident items adapted from the absolute scale, whereas the global comparative form required raters to evaluate the lecturers in terms of the two performance factors, clarity and speaking style. On both types of items, the lecturers were evaluated on a 101-point scale corresponding to the percentile rank of all teachers at the university. Following standard RPM instructions (Goffin et al., 1996a) for the global and specific formats, raters were asked to place a mark across a line with anchors ranging from 0 to 100 (in increments of 10) for each lecturer. This mark represented the ratee's relative performance on the dimension being rated.


Raters were asked to avoid exact ties. In contrast to the absolute format, in which each lecturer was evaluated independently and on a separate form, when using the comparative scales participants rated all four lecturers conjointly on each item.

Videotapes. The videotapes used in this study were developed by K. R. Murphy and have been used in numerous studies (Murphy et al., 1982a, 1982b; Murphy & Balzer, 1986). Each videotape vignette depicted a different lecturer, who was, in reality, a drama student, delivering a lecture on either self-fulfilling prophecy or crowding and stress. A total of four vignettes (5–7 min each) depicting different lecturers were used.

Procedure

Before viewing the videotapes, participants were instructed that they would later be evaluating four lecturers' performance using two different performance appraisal methods. To enhance recall of each particular lecturer, each was identified by a fictitious name before the participants viewed the respective lecture. The four lecture vignettes, presented in a constant order, were observed by all participants in the first session of the experiment. Participants returned 24 h later to evaluate the ratees using the comparative and absolute approaches with both specific and global items. Thus, both the method of rating (absolute vs comparative) and the type of item (specific vs global) were measured as within-subjects variables. The sequence of rating method (absolute then comparative, or vice versa) and the sequence of item type (global then specific, or vice versa) were counterbalanced across participants; that is, they were treated as between-subjects variables. Prior to evaluating the ratees on the second day of the experiment, participants were trained in how to use the comparative and absolute scales. Ratings on sample items were explained in the context of hypothetical lecturers, and participants were encouraged to ask questions about how to use the rating scales.

Dependent Variables

Accuracy measures. Generally, judgmental accuracy refers to how congruent ratings of performance are with "true" ratings of performance (Murphy & Cleveland, 1995). Cronbach's (1955) four components of accuracy were used as the dependent variables representing rating accuracy: (1) EL, or accuracy in the overall level of rating; (2) DE, or accuracy in distinguishing among ratees; (3) SA, or accuracy in distinguishing among dimensions of performance; and (4) DA, or accuracy in distinguishing among ratees within each dimension of performance (see Murphy & Cleveland, 1995; Cardy & Dobbins, 1994, for a discussion).
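For concreteness, the sketch below shows one standard way of computing the four components for a single rater, given a ratee × dimension matrix of observed ratings and the corresponding true scores. This is an illustrative implementation of the usual Cronbach (1955) squared-deviation decomposition, not the authors' own code; the function name and the root-mean-square form of the distance measures are our assumptions.

```python
import numpy as np

def cronbach_accuracy(ratings, true_scores):
    """Cronbach's (1955) four accuracy components for one rater.

    ratings, true_scores: arrays of shape (n_ratees, n_dimensions).
    Returns (EL, DE, SA, DA) as distance scores; lower = more accurate.
    """
    x = np.asarray(ratings, dtype=float)
    t = np.asarray(true_scores, dtype=float)

    x_gm, t_gm = x.mean(), t.mean()              # grand means
    x_r, t_r = x.mean(axis=1), t.mean(axis=1)    # per-ratee means
    x_d, t_d = x.mean(axis=0), t.mean(axis=0)    # per-dimension means

    # EL: error in the overall level of rating.
    el = abs(x_gm - t_gm)
    # DE: error in ordering ratees, averaging across dimensions.
    de = np.sqrt(np.mean(((x_r - x_gm) - (t_r - t_gm)) ** 2))
    # SA: error in ordering dimensions, averaging across ratees.
    sa = np.sqrt(np.mean(((x_d - x_gm) - (t_d - t_gm)) ** 2))
    # DA: error in the ratee-by-dimension residual (interaction) term.
    x_res = x - x_r[:, None] - x_d[None, :] + x_gm
    t_res = t - t_r[:, None] - t_d[None, :] + t_gm
    da = np.sqrt(np.mean((x_res - t_res) ** 2))
    return el, de, sa, da

# Example: 4 ratees rated on 2 dimensions on the study's 101-point scale.
rng = np.random.default_rng(0)
true = rng.uniform(30, 80, size=(4, 2))
observed = np.clip(true + rng.normal(0, 10, size=(4, 2)), 0, 100)
print(cronbach_accuracy(observed, true))
```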

True scores. The true-score measures of performance for the absolute and comparative rating methods were derived from the judgments of expert raters who had an enhanced opportunity to evaluate ratee behaviors, a procedure developed by Borman (1977). Fourteen graduate students, with training in social and industrial/organizational psychology, served as expert raters. Prior to rating, the expert raters were provided with a synopsis explaining the nature of the tapes and the rating formats. All of the experts were familiar with the job being rated: 8 of the 14 experts had actually performed a similar job (i.e., teaching introductory psychology), and all had extensive experience in observing and rating this job (i.e., all had taken part in teacher evaluations for many years). In the process of making their ratings, experts were thoroughly trained in using each of the performance rating methods, given multiple opportunities to view the videotapes, and encouraged to take notes on the performance of the lecturers in the vignettes. Experts rated the videotaped lecturers using both comparative and absolute rating methods containing both global and specific items. Smither, Barry, and Reilly (1989) demonstrated that expert ratings such as these may serve as valid estimates of true scores in appraisal research. The experts' mean rating for each item served as the estimate of the true score of performance. In total, 28 true scores were derived for each of the four videotaped lecturers: 12 for the specific comparative items, 2 for the global comparative items, 12 for the specific absolute items, and 2 for the global absolute items. An intraclass index of rater convergence, described by Kavanagh, MacKinney, and Wolins (1971), was used to assess agreement among the expert raters. Rater agreement was evident for the specific comparative items (r = .70), the global comparative items (r = .97), the specific absolute items (r = .76), and the global absolute items (r = .96). Convergence between the true scores for the corresponding absolute and comparative items was very high (r = .97).
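The agreement index below is a stand-in sketch: it computes ICC(1,k), the one-way random-effects reliability of the mean of k raters, which is in the same family as, though not necessarily identical to, the Kavanagh et al. (1971) index actually used. The function name and data layout are our assumptions.

```python
import numpy as np

def icc_1k(ratings):
    """One-way random-effects intraclass correlation for the mean of k
    raters (ICC(1,k)). Here the targets would be lecturer-by-item
    combinations and the k columns the 14 expert raters.

    ratings: array of shape (n_targets, k_raters).
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    target_means = x.mean(axis=1)
    # One-way ANOVA mean squares: between targets and within targets.
    ms_between = k * np.sum((target_means - x.mean()) ** 2) / (n - 1)
    ms_within = np.sum((x - target_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / ms_between
```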

RESULTS

Preliminary Analyses

In order to meaningfully assess the levels of accuracy associated with the comparative versus the absolute rating method, it was necessary that all ratings be on a common metric. Therefore, all absolute ratings (global and specific) were transformed to correspond to the comparative method's metric. Consequently, all four of Cronbach's (1955) accuracy components were computed on a 101-point scale.
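The exact transformation is not reported; the simplest candidate, shown here purely as an assumption-laden sketch, is a linear rescaling of the 7-point absolute ratings onto the 0–100 comparative metric.

```python
import numpy as np

def to_percentile_metric(rating_1_to_7):
    """Hypothetical linear map from the 7-point absolute scale to the
    101-point (0-100) comparative metric; 1 -> 0 and 7 -> 100. The paper
    says only that absolute ratings were transformed to the comparative
    metric, so the linear form is an assumption."""
    r = np.asarray(rating_1_to_7, dtype=float)
    return (r - 1.0) * (100.0 / 6.0)
```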


Table 1 contains descriptive statistics on the accuracy levels associated with the absolute versus comparative rating methods and the global versus specific item types. Lower scores indicate greater levels of accuracy. As shown in Table 1, there was a general tendency for the comparative rating format to produce more favorable accuracy values. With regard to item type, global items appear to result in considerably greater accuracy with respect to the DA and SA components, whereas specific items were moderately more accurate vis-à-vis the DE and EL measures.

Multivariate Analysis of Variance of Rating Method and Item Type

Fundamental assumptions underlying MANOVA were assessed prior to undertaking the analyses (see Tabachnick & Fidell, 1989). No threats to the veracity of the results on the basis of heterogeneity of variance or covariance were evident. With respect to normality, the distributions of all the accuracy measures were substantially skewed and/or kurtotic. Following Tabachnick and Fidell, a square-root transformation was performed on the accuracy measures, which normalized their distributions. All analyses were therefore conducted on the normalized accuracy measures.

A 2 (absolute vs comparative) × 2 (global vs specific) × 2 (absolute–comparative sequence vs comparative–absolute sequence) × 2 (global–specific sequence vs specific–global sequence) multivariate analysis of variance was conducted using the four accuracy measures as dependent variables: EL, DE, SA, and DA. The within-subjects variables (rating method and type of item) had a total N of 80 for each cell, and the between-subjects variables (sequence of rating method and sequence of type of item) had a total N of 40 for each cell.
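As a simplified stand-in for the within-subjects multivariate test of rating method, the sketch below runs a one-sample Hotelling's T² on comparative-minus-absolute difference scores across the four accuracy measures. It ignores the two between-subjects sequence factors (which is why the paper's multivariate error df is 73 rather than the 76 this test would give), so it illustrates the logic rather than reproduces the reported analysis; the function name is hypothetical.

```python
import numpy as np
from scipy import stats

def hotelling_one_sample(diff):
    """One-sample Hotelling's T^2 test that the mean vector of difference
    scores is zero.

    diff: array of shape (n_subjects, p), e.g., comparative-minus-absolute
    scores on the p = 4 normalized accuracy measures (EL, DE, SA, DA).
    """
    d = np.asarray(diff, dtype=float)
    n, p = d.shape
    d_bar = d.mean(axis=0)
    s = np.cov(d, rowvar=False)                  # covariance of differences
    t2 = n * d_bar @ np.linalg.solve(s, d_bar)   # Hotelling's T^2
    f = (n - p) / (p * (n - 1)) * t2             # exact F transformation
    return t2, f, stats.f.sf(f, p, n - p)
```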

TABLE 1
Means and Standard Deviations of Accuracy Measures on a 101-Point Scale

                      Global              Specific
                   M        SD          M        SD
Absolute
  DE              9.76     4.46        8.05     3.62
  DA              5.21     2.29       14.98     4.24
  SA              3.21     2.27        9.71     3.27
  EL              6.98     5.33        5.28     4.22
Comparative
  DE              9.52     4.69        8.88     4.25
  DA              3.52     2.32       12.52     3.17
  SA              2.92     2.44        8.01     2.77
  EL              7.07     5.55        5.10     4.84

Note. DE, differential elevation; DA, differential accuracy; SA, stereotype accuracy; EL, elevation. Lower values indicate greater accuracy. N = 80.

TABLE 2
Multivariate Analysis of Variance of Accuracy Measures

Source              Wilks's Λ   Hypoth. df   Error df        F
Between subjects
  Sequence A           .92          4           73         1.56
  Sequence B           .94          4           73         1.13
Within subjects
  Rating method        .53          4           73        15.90*
  Type of item         .05          4           73       305.72*

Note. Sequence A, comparative/absolute or absolute/comparative; Sequence B, global/specific or specific/global; Rating method, comparative or absolute; Type of item, global or specific. All interactions were nonsignificant (p > .05). * p < .01. N = 80.

A summary of findings for the full MANOVA is presented in Table 2. The main effect for rating method (absolute vs comparative) was significant (η² = .47), as was the main effect for type of item (specific vs global) (η² = .95). No interactions involving method of rating or type of item attained significance. As discussed earlier, the means shown in Table 1 suggest that the comparative rating method was more accurate than the absolute rating method, and that the global item type resulted in considerably greater accuracy with respect to the DA and SA components whereas specific items were more advantageous in terms of DE and EL accuracy.
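As a quick consistency check, the reported multivariate η² values can be recovered from Table 2 as 1 − Λ, and the univariate η² values reported below can be recovered from the mean squares in Tables 3 and 4. These are the standard formulas, assumed here since the paper does not state which variant it used.

```python
# Multivariate effect size from Wilks's lambda (Table 2): eta^2 = 1 - lambda.
print(1 - 0.53, 1 - 0.05)  # rating method -> 0.47; item type -> 0.95

# Univariate eta^2 from mean squares: eta^2 = SS_h / (SS_h + SS_e),
# with df_h = 1 and df_e = 76 (Tables 3 and 4).
def eta_squared(hypoth_ms, error_ms, df_h=1, df_e=76):
    ss_h, ss_e = hypoth_ms * df_h, error_ms * df_e
    return ss_h / (ss_h + ss_e)

print(eta_squared(11.41, 0.18))   # DA, rating method: ~.45 with rounded MS
                                  # values (reported as .44)
print(eta_squared(221.04, 0.20))  # DA, item type: ~.94 (reported as .93)
```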


Univariate Analyses

The pooled within-cell correlations were small (|r̄| = .16); therefore, it was concluded that the dependent variables were sufficiently orthogonal to justify univariate ANOVAs as a follow-up to the MANOVA. The ANOVAs presented in Table 3 test the effect of rating method (absolute vs comparative) on each of the normalized accuracy measures. The effects were significant for two accuracy components, DA (η² = .44) and SA (η² = .10). The means presented in Fig. 1 indicate that the comparative method resulted in more accurate ratings than did the absolute method with respect to both DA and SA. For the sake of interpretability, the means presented in Fig. 1 were not subjected to the normalizing square-root transformation.


TABLE 3
Univariate F Tests for the Main Effect of Rating Method (Absolute or Comparative)

DV      df      Hypoth. MS   Error MS        F
DE      1,76        .10         .42         .25
DA      1,76      11.41         .18       60.93*
SA      1,76       3.16         .34        9.09*
EL      1,76        .12        1.13         .11

Note. DE, differential elevation; DA, differential accuracy; SA, stereotype accuracy; EL, elevation. * p < .01. N = 80.

FIG. 1. Mean differential accuracy and stereotype accuracy (on the 101-point scale) for the absolute and comparative rating methods. Lower values indicate greater accuracy.


Univariate ANOVAs were also conducted to follow up on the significant multivariate effect of global versus specific item type (see Table 4). The effect of item type was significant for all four measures of accuracy (DE, η² = .14; DA, η² = .93; SA, η² = .84; EL, η² = .16).

TABLE 4
Univariate F Tests for the Main Effect of Type of Item (Global or Specific)

DV      df      Hypoth. MS   Error MS        F
DE      1,76       2.92         .22        12.78*
DA      1,76     221.04         .20      1073.93*
SA      1,76     137.21         .31       430.25*
EL      1,76      10.07         .66        15.12*

Note. DE, differential elevation; DA, differential accuracy; SA, stereotype accuracy; EL, elevation. * p < .01. N = 80.

FIG. 2. Mean differential elevation, differential accuracy, stereotype accuracy, and elevation (on the 101-point scale) for specific and global items. Lower values indicate greater accuracy.

The non-normalized means presented in Fig. 2 illustrate that specific items were more accurate with regard to DE and EL, whereas global items were more accurate with regard to DA and SA.

DISCUSSION

The primary hypothesis of this experiment, that comparative ratings would be more accurate than absolute ratings, received partial support. Although no differences in accuracy between the two rating methods were evident for DE and EL, the comparative method outperformed the absolute method with respect to the DA and SA measures. Additionally, the nonsignificant interaction between rating method and item type suggests that for both the global and the specific items, the comparative method was more accurate than the absolute method in terms of DA and SA. No advantages were associated with the absolute rating method. The current study adds a unique perspective to the research comparing absolute and comparative performance appraisal methods in that rating quality was assessed in terms of accuracy. The fact that the comparative method was more accurate for DA and SA has a number of implications. Both DA and SA are associated with how accurately the rater evaluates performance in the context of the group. Raters who accurately evaluate the group of ratees, as a whole, with regard to each dimension of performance are accurate in terms of SA.


Raters who accurately evaluate each ratee's pattern of strengths and weaknesses within the group are accurate in terms of DA. Additionally, Cardy and Dobbins (1994) have suggested that DA may be the most appropriate measure of the degree to which ratings are affected by random error; thus, our findings suggest that, overall, comparative ratings may contain less random error than do absolute ratings.

The current laboratory findings, considered in concert with the generally higher validity of comparative ratings seen in the field studies of Goffin et al. (1996a) and Nathan and Alexander (1988), are grounds for optimism that comparative approaches may have important advantages in the practice of performance appraisal. A number of organizations have recently adopted comparative methods of performance appraisal (see Villanova, Dahmus, & Bernardin, 1992); however, it is likely that many of these companies have adopted a conventional form of comparative ratings that differs from the currently used RPM (Goffin et al., 1996a) in two main respects. First, conventional comparative approaches such as straight ranking or paired comparisons do not facilitate meaningful comparisons between ratees who have been rated by different raters. Second, conventional comparative approaches tend to rely largely on a single overall rating of performance rather than rating dimensions of performance as is typically done with the RPM. Thus, organizations that already use a conventional comparative approach might find it beneficial to switch to the RPM.

A secondary purpose of the current research was to determine whether any advantages that were uncovered were due to item type (specific vs global). As discussed previously, comparative performance appraisal methods often adopt a global item type, whereas certain absolute performance appraisal methods (e.g., Latham & Wexley, 1977) use items that are specific in content. In this experiment, the type of item did have an effect on rating accuracy. However, the hypotheses that global items would be more accurate than specific items for DE, and that there would be no differences between the item types for DA, SA, and EL, were not supported. Rather, specific items were rated more accurately with regard to DE and EL, whereas global items were rated more accurately with regard to DA and SA. Accordingly, these results suggest that the practice of performance appraisal would benefit from the routine inclusion of both types of items. The purpose to which the appraisal is directed should then dictate which type of item is relied upon (see Murphy & Cleveland, 1995, for a discussion of the purposes of appraisal as they relate to DA, SA, DE, and EL). These results are not consistent with Murphy and Balzer's (1986) finding that memory-based ratings had greater DE accuracy with a global item type than with a specific item type.


However, our study used two global items that were empirically derived from ratings of the specific items through exploratory factor analysis (Murphy et al., 1982b), whereas Murphy and Balzer used eight global items that were not directly derived from their specific items. Hence, comparisons involving their specific and global items may have confounded item type with item content. This may explain why Murphy and Balzer chose not to report main effects of the global and specific item types in their study. Other research comparing specific and global items has not capitalized on Cronbach's four components of accuracy and is therefore less immediately relevant here. Nonetheless, Fay and Latham's (1982) study supported the superiority of specific over global items, whereas Heneman's (1988) work suggested that global items were rated more accurately than were specific items. One possible explanation for these mixed findings may be drawn from our own results: global items may engender ratings that are more accurate in some respects (e.g., DA and SA), whereas specific items may facilitate other forms of accuracy (e.g., EL and DE).

Any conclusions drawn from the current study must be limited by the fact that the affective, social, and political influences on performance ratings that are present in organizational settings were not likely to be active in this experiment. As emphasized by Ilgen and Favero (1985), laboratory research such as ours often fails to reflect the true nature and duration of rater–ratee relationships. Unfortunately, the logistics of conducting research on the accuracy of performance appraisal typically require such compromises (Murphy & Cleveland, 1995). In particular, the necessity of establishing known true scores typically precludes complete fidelity in the representation of modal performance appraisal scenarios. Nonetheless, existing field studies that have compared absolute and comparative methods are much less susceptible to the above criticisms, and they too have demonstrated support for comparative methods over absolute ones (Goffin et al., 1996a; Nathan & Alexander, 1988). In fact, it is conceivable that socio-political factors that are present in field but not laboratory settings may accentuate the advantages of the comparative method of rating.

Although there will always be limitations to the generalizability of laboratory research, a number of steps were taken in the design of this study to reduce such limitations. First, the rating task involved teacher evaluation, a familiar activity for the participants. Second, participants rated multiple ratees, whereas past research has often involved rating only a single ratee (see Bernardin & Villanova, 1986). Finally, this study incorporated a delay between observation and rating to make the ratings memory-based, as would typically be the case in an organizational setting.


Collectively, these design features may serve to attenuate, although certainly not eliminate, the barriers to generalizability that plague many laboratory studies of performance appraisal.

Having found some evidence that a comparative performance appraisal approach may be advantageous, the question of why this would be the case remains. Social comparison theory (Festinger, 1954) provides one possible answer. Judgmental ratings of performance typically involve some degree of uncertainty. In such conditions, social comparison theory suggests that there may be a tendency to compare the ratee to other persons with similar job responsibilities. Comparative methods of performance appraisal may facilitate this social comparison process and thereby result in greater accuracy. Goffin et al. (1996a) suggested that comparative methods may encourage raters to be more discriminating because all ratees are considered jointly. On the other hand, because each ratee is considered one at a time when using absolute formats, raters may be unaware that several ratees, who may differ widely in performance, are receiving very similar scores. Future research should investigate the cognitive processes involved in making absolute and comparative ratings to assess the plausibility of these (and other) explanations of the differences between absolute and comparative performance appraisal methods.

REFERENCES

Bernardin, H. J., & Villanova, P. (1986). Performance appraisal. In E. Locke (Ed.), Generalizing from laboratory to field settings. Lexington, MA: Lexington Books.

Bommer, W. H., Johnson, J. L., Rich, G. A., Podsakoff, P., & MacKenzie, S. B. (1995). On the interchangeability of objective and subjective measures of employee performance: A meta-analysis. Personnel Psychology, 48, 587–605.

Borman, W. C. (1977). Consistency of rating accuracy and rating errors in the judgment of human performance. Organizational Behavior and Human Performance, 20, 238–252.

Cardy, R. L., & Dobbins, G. H. (1994). Performance appraisal: Alternative perspectives. Cincinnati, OH: South-Western.

Carroll, S. J., & Schneier, C. E. (1982). Performance appraisal and review systems: The identification, measurement, and development of performance in organizations. Glenview, IL: Scott, Foresman.

Christiansen, N. D., Goffin, R. D., Johnston, N. G., & Rothstein, M. G. (1994). Correcting the Sixteen Personality Factors test for faking: Effects on criterion-related validity and individual hiring decisions. Personnel Psychology, 47, 847–860.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Academic Press.

Cronbach, L. J. (1955). Processes affecting scores on "understanding of others" and "assumed similarity." Psychological Bulletin, 52, 177–193.

Drucker, P. F. (1954). The practice of management. New York: Harper & Row.

Farh, J. L., & Dobbins, G. H. (1989). Effects of comparative performance information on the accuracy of self-ratings and agreement between self- and supervisor ratings. Journal of Applied Psychology, 74, 606–610.

Fay, C. H., & Latham, G. P. (1982). Effects of training and rating scales on rating errors. Personnel Psychology, 35, 105–116.

Festinger, L. (1954). A theory of social comparison processes. Human Relations, 7, 117–140.

Goffin, R. D., Gellatly, I. R., Paunonen, S. V., Jackson, D. N., & Meyer, J. P. (1996a). Criterion validation of two approaches to performance appraisal: The Behavioral Observation Scale and the Relative Percentile Method. Journal of Business and Psychology, 11, 23–33.

Goffin, R. D., Rothstein, M. G., & Johnston, N. G. (1996b). Personality testing and the assessment center: Incremental validity for managerial selection. Journal of Applied Psychology, 81, 746–756.

Heneman, R. L. (1986). The relationship between supervisory ratings and results-oriented measures of performance: A meta-analysis. Personnel Psychology, 39, 811–826.

Heneman, R. L. (1988). Traits, behaviors, and rater training: Some unexpected results. Human Performance, 1, 85–98.

Ilgen, D. R., & Favero, J. M. (1985). Limits in generalization from psychological research to performance appraisal processes. Academy of Management Review, 10, 311–321.

Kavanagh, M. J., MacKinney, A. C., & Wolins, L. (1971). Issues in managerial performance: Multitrait-multimethod analysis of ratings. Psychological Bulletin, 75, 34–49.

Kruglanski, A. W., & Mayseless, O. (1990). Classic and current social comparison research: Expanding the perspective. Psychological Bulletin, 108, 195–208.

Landy, F. J., & Farr, J. L. (1980). Performance rating. Psychological Bulletin, 87, 72–107.

Latham, G. P., & Wexley, K. N. (1977). Behavioral observation scales for performance appraisal purposes. Personnel Psychology, 30, 255–268.

Meyer, J. P., Paunonen, S. V., Gellatly, I. R., Goffin, R. D., & Jackson, D. N. (1989). Organizational commitment and job performance: It's the nature of the commitment that counts. Journal of Applied Psychology, 74, 152–156.

Miner, J. B. (1988). Development and application of the rated ranking technique in performance appraisal. Journal of Occupational Psychology, 61, 291–305.

Murphy, K. R., & Balzer, W. K. (1986). Systematic distortions in memory-based behavior ratings and performance evaluations: Consequences for rating accuracy. Journal of Applied Psychology, 71, 39–44.

Murphy, K. R., & Cleveland, J. N. (1995). Understanding performance appraisal: Social, organizational, and goal-based perspectives. Thousand Oaks, CA: Sage.

Murphy, K. R., Garcia, M., Kerkar, S., Martin, C., & Balzer, W. K. (1982a). Relationship between observational accuracy and accuracy in evaluating performance. Journal of Applied Psychology, 67, 320–325.

Murphy, K. R., Martin, C., & Garcia, M. (1982b). Do behavioral observation scales measure observation? Journal of Applied Psychology, 67, 562–567.

Nathan, B. R., & Alexander, R. A. (1988). A comparison of criteria for test validation: A meta-analytic investigation. Personnel Psychology, 41, 517–535.

Smith, P. C., & Kendall, L. M. (1963). Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47, 149–155.

Smither, J. W., Barry, S. R., & Reilly, R. R. (1989). An investigation of the validity of expert true score estimates in appraisal research. Journal of Applied Psychology, 74, 143–151.

Tabachnick, B. G., & Fidell, L. S. (1989). Using multivariate statistics. New York: Harper & Row.

Villanova, P. J., Dahmus, S. A., & Bernardin, H. J. (1992). Rater sources of leniency revisited: A multiple observation and double cross-validation study. Paper presented at the 1992 meeting of the Academy of Management, Las Vegas, NV.

Received: July 8, 1996