ORGANIZATIONAL BEHAVIOR AND HUMAN DECISION PROCESSES 44, 232-260 (1989)

The Impact of Ratee Performance Characteristics on Rater Cognitive Processes and Alternative Measures of Rater Accuracy

MARGARET YOUTZ PADGETT

Wayne State University

AND

DANIEL R. ILGEN

Michigan State University

In a repeated measures design, the effects of two ratee performance characteristics (performance level and performance consistency) on rater cognitive processes and rating accuracy were examined. Thirty-seven raters evaluated the performance of four subordinates who differed from each other in their performance level and consistency. It was hypothesized that these performance characteristics would influence the amount of performance information collected by raters as well as the way in which they processed this information. In addition, it was suggested that performance characteristics would differentially affect several types of rating accuracy measures. Cronbach's elevation accuracy and two indices of accuracy based on signal detection theory, labeled behavioral and classification accuracy by R. G. Lord (1985, Journal of Applied Psychology, 70, 66-71), were compared. Results showed that characteristics of ratee performance influenced both rater cognitive processes and rating accuracy. © 1989 Academic Press, Inc.

A number of factors thought to influence performance ratings have been identified. These can be clustered as follows: (1) the appraisal instrument, (2) the rater, (3) the ratee, and (4) the performance rating context. Until recently, most research focused on either the rating instrument or the rater. In both cases, the most frequently used measures of rating quality were the reliability and validity of the ratings and the absence of rating errors, such as halo and leniency (e.g., Latham & Wexley, 1977; Smith & Kendall, 1963). As interest in the process of rating others increased, the criteria for assessing rating quality shifted from rating errors to rating accuracy (e.g., Bernardin & Pence, 1980; Borman, 1978). However, defining rating accuracy and either identifying or creating situations in which it could be assessed has proved to be difficult. Given the complexity of assessing and defining accuracy, the shift in emphasis from rating errors to rating accuracy raised two major issues: (1) the conceptualization and measurement of rating accuracy and (2) the identification of rating conditions that reduce the ability of raters to make accurate ratings. Both of these issues will be considered here.

Address correspondence and requests for reprints to Margaret Youtz Padgett, Department of Management, 328 Prentis Building, Wayne State University, Detroit, MI 48202.

The Conceptualization of Rating Accuracy

At first glance, rating accuracy as a construct appears to be relatively straightforward; raters are accurate to the extent that their ratings of performance for an individual match the individual's true performance. However, closer examination of rating accuracy reveals that it is more complex, both conceptually and operationally. Cronbach (1955) recognized this complexity when he demonstrated that the overall accuracy of an individual's rating of another person is really a function of four distinct components, which he labeled elevation, differential elevation, stereotype accuracy, and differential accuracy. Until recently, these four measures of rating accuracy dominated performance appraisal research, with some researchers (e.g., Borman, 1978) arguing that differential accuracy was the most appropriate of the four.

Unfortunately, because of the typical way in which performance is measured, a problem exists in the assessment of rating accuracy with any of the Cronbach measures (Lord, 1985; Lord, Binning, Rush, & Thomas, 1978; Rush, Phillips, & Lord, 1981). With a typical rating scale, raters are presented with a number of trait or behavioral dimensions and are required to provide a rating of some individual on each of the dimensions. The problem stems from the fact that raters are believed to simplify the rating process by summarizing specific observations of ratee performance and storing this information in memory in categories or as general impressions of people (Rosch, 1978). To the extent that this occurs, the original observed behaviors are forgotten and the general categories become the basis for subsequent evaluations of performance. Thus, individuals are recalled as possessing those attributes which are most representative (or prototypical) of the category to which they were assigned (Feldman, 1981; Lord, Foti, & Phillips, 1982). When the actual behavior displayed by ratees is consistent with the characteristics that are seen as prototypical of their category in memory, evaluations based on this category will be judged to be accurate. For example, if initial observations of an individual result in his or her being placed into the category of "good performer," using this general impression to provide performance ratings results in the individual being given high ratings on all dimensions. If the true performance for the individual was, in fact, high, then the rating is accurate. However, as noted by Lord (1985), to infer from this case that the rater is able to make accurate judgments is misleading, because the accuracy was produced by a cognitive simplification in the rating process rather than by actual recall of effective behaviors relevant to each of the dimensions on which ratings were provided. Because the Cronbach measures of accuracy are derived from rating scales that do not require raters to recall actual behavioral observations, they are subject to this problem.

Lord (1985) suggested a solution to this problem using signal detection theory to distinguish between classification accuracy and behavioral accuracy. Classification accuracy was defined the same as described in the preceding paragraph; it is the apparent accuracy that results when raters evaluate people based on category membership rather than actual behaviors observed. Classification accuracy is likely to be high when a ratee's actual behavior is consistent with the behavior expected on the basis of the category to which the ratee was allocated by the rater. Since ratings can appear accurate according to classification accuracy without the rater actually recalling specific behaviors, Lord (1985) defined a second form of accuracy that reflects the ability of raters to identify actual behaviors exhibited by a ratee. Such identification requires both the recognition of behaviors that did occur and the recognition of behavior that did not. An index that accounted for both these conditions was labeled behavioral accuracy by Lord (1985).

Conceptually these two types of accuracy are not necessarily related and may even covary negatively. For example, classification "accuracy" is likely to be highest when a rater is able to simplify the rating process by integrating ratee behavior into a single cognitive category and then evaluating the ratee in accordance with the categorization. In this situation, however, behavioral accuracy is likely to be low since integrating ratee behaviors into a cognitive category tends to result in specific behaviors being forgotten. Other examples might be given where the two forms covary positively. The important point, however, is that although behavioral accuracy is what is typically meant when the term "accuracy" is used, we believe that most common operational definitions of accuracy (such as those of Cronbach) only provide an indication of the rater's ability to categorize and form general impressions of ratees (i.e., classification "accuracy"). One purpose of the present research is to demonstrate empirically the importance of distinguishing between classification and behavioral accuracy, particularly when examining the cognitive processes of raters.


Ratee Performance Characteristics

A second purpose of this research is to identify ratee characteristics that influence rater cognitive processes and, through this, affect the ability of raters to provide accurate ratings of performance. In this study, two ratee performance characteristics, performance level (good vs poor performer) and performance consistency (consistent vs inconsistent performer), were examined. While both characteristics have been addressed in previous research (e.g., DeNisi & Stevens, 1981; Gordon, 1970; Scott & Hamner, 1975), typically they have been related to such dependent variables as attributions of causality, allocation of organizational rewards, and ratings of motivation and ability rather than to rating accuracy. We selected these two ratee characteristics because, in contrast to typical research on ratee characteristics which has generally focused on static attributes such as race or sex (e.g., Hamner, Kim, Baird, & Bigoness, 1974), level and consistency are dynamic. It is suggested that the nature of the performance information provided to raters (i.e., consistent or inconsistent, effective or ineffective) may potentially introduce errors into performance ratings by influencing the way in which this information is processed.

Researchers studying cognitive processes have suggested two possible ways in which raters may process performance information (e.g., Murphy, Martin, & Garcia, 1982; Nathan & Lord, 1983). One approach is derived from theories of cognitive categorization (Rosch, 1978; Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976). According to this model, raters integrate behavioral information into general categories or impressions of people. After this occurs, specific behaviors tend to be forgotten, and the category prototype (an abstraction based upon the most typical or representative features of the category) becomes the basis for subsequent evaluations of performance. In contrast to the categorization model is a three-step model proposed by Borman (1978). In this model, raters are believed to observe incidents of work performance, evaluate the incidents in terms of their effectiveness, and then store this information in dimensionally independent schemata. When it is time to provide evaluations of performance, this information is recalled and weighted in order to arrive at an overall evaluation for each dimension. In essence, this model proposes that raters are able to recall actual behaviors observed rather than just general category characteristics and, hence, is a much more behaviorally oriented approach to processing performance information.

These two approaches to processing information have implications for the ability of raters to provide accurate performance ratings. Clearly, if information is processed according to the categorization model, then true accuracy, in a behavioral sense, is likely to be low because raters are thought to recall only category prototypes rather than actual behavioral observations. Only accuracy at the level of general categories (i.e., classification "accuracy") is likely to be high. In contrast, the latter approach to processing information is likely to facilitate behavioral accuracy since specific incidents of behavior are stored and recalled.

In this study, an attempt was made to identify conditions that would influence which approach to processing information raters used. The approach is consistent with that used in Nathan and Lord (1983), where it was hypothesized that, under certain rating conditions (e.g., minimal demands on short and long term memory), the behavioral model would be more likely to be used but that as memory demands become greater, reliance on general category prototypes should increase. They found some support for these hypotheses. Because ratee performance characteristics (i.e., level and consistency) affect the type of performance information provided to raters, it is suggested here that these characteristics may also influence how raters process information. These characteristics are also likely to affect rating accuracy. The present study was designed to test these hypotheses, which are described in greater detail below.

Hypotheses

Performance consistency and rating process variables. Two rating process variables were examined in this study: the amount of performance information sampled by raters and the general approach used by them to process this information. The cognitive processing view of performance appraisal (e.g., Feldman, 1981) suggests that there will be an inverse relationship between performance consistency and sampling. According to this view, during a rater's initial exposure to ratees, she or he attempts to form some kind of impression of them. The impression places the ratee into categories that facilitate "sense-making" (Weick, 1979) for the rater. For consistent performers, this categorization process should not be problematic since the ratee can easily be categorized as either a good or poor performer. Furthermore, since the subsequent behaviors of consistent performers should match this impression, the initial categorization should not be questioned. On the other hand, when a ratee's performance is inconsistent, initial categorization becomes more difficult. Moreover, since some later observations will fail to fit the category into which the individual is eventually placed, a controlled recategorization process, in which raters search for additional information, is called into play according to Feldman (1981). Consistent with this notion, research indicates that disconfirmed expectations about a stimulus person trigger the perceiver's search for causal information (Pyszczynski & Greenberg, 1981; Wong & Weiner, 1981). This suggests our first hypothesis:

H1. More sampling of information about ratee performance will occur for inconsistent performers than for consistent performers.

The second rating process variable to be examined is the approach used by raters to process performance information. Because of the difficulty in directly measuring how raters process information, it will be assessed indirectly by inferring the process from the types of behaviors that are correctly and incorrectly recognized by raters. As noted above, it is expected to be more difficult to categorize the behaviors of an inconsistent, than a consistent, performer. This should force raters to focus more on specific ratee behaviors when evaluating the performance of those who perform inconsistently than those who perform consistently. Hence, the number of actual ratee behaviors, both prototypical (i.e., category-consistent) and nonprototypical (i.e., category-inconsistent), correctly identified by raters should be greater for inconsistent performers. In contrast, because of the ease with which consistent performers can be categorized, raters should rely heavily on the category prototype when providing performance ratings of these people. The result is that raters should be more likely to attribute to consistent ratees prototypical behaviors that they did not actually exhibit but which would be expected on the basis of the ratee's category assignment. This leads us to conclude that raters will tend to use the category-based approach to processing information when evaluating consistent performers but the behaviorally based approach when rating inconsistent performers. Stated more formally:

H2. Raters will correctly identify more behaviors (both category-consistent and category-inconsistent) for inconsistent performers than for consistent performers.

H3. Raters will be more likely to attribute nonpresent behaviors consistent with the ratee's category to consistent performers than to inconsistent performers.

Performance consistency and rating accuracy. When conceptualized as classification accuracy, the primary requirement for accuracy is that raters be able to place ratees into global categories. Thus, classification accuracy is hypothesized to be higher for consistent performers because of the greater ease with which raters can form an impression about these ratees. Similar results are expected for Cronbach's measure of elevation accuracy, which was also used in this study. Although Lord's measure of classification accuracy and Cronbach's measure of elevation accuracy are operationally distinct, they are believed to be conceptually similar; both assess the ability of raters to globally categorize ratees. Hence, results for these two measures of accuracy should be similar.

On the other hand, an inverse relationship is hypothesized to exist between behavioral accuracy and performance consistency, with behavioral accuracy being greater for inconsistent performers. One reason for this is the expectation that raters will tend to use the more behaviorally based approach to processing information when evaluating inconsistent performers which, as noted above, should increase behavioral accuracy. Moreover, previous research implies that information inconsistent with expectations (i.e., nonprototypical, or category-inconsistent, information) is stored in memory in a unique way and thus, is more likely to be recalled (Graesser, Gordon, & Sawyer, 1979; Graesser, Woll, Kowalski, & Smith, 1980; Woll & Graesser, 1982). Research in the leadership area is consistent with this notion (e.g., Phillips, 1984; Phillips & Lord, 1982). In addition, Hastie (1980) found that memory for behavior that is inconsistent with general impressions is greater than is memory for behavior consistent with these impressions. Other things being equal, this should increase behavioral accuracy for inconsistent performers. This leads to the following hypotheses:

H4. Classification accuracy will be greater for consistent performers than for inconsistent performers.

H5. Elevation accuracy will be greater for consistent performers than for inconsistent performers.

H6. Behavioral accuracy will be greater for inconsistent performers than for consistent performers.

Performance level and rating process variables. Research by Gordon (1970) indicated that there may be differences in the ability of raters to accurately evaluate good and poor performers. He found that raters were more accurate (measured in terms of the number of effective and ineffective behaviors correctly identified by raters) in their evaluations of good performance than bad performance (a phenomenon which he labeled the "differential accuracy phenomena"). While Gordon did not speculate as to the reason for this difference, it is possible that raters may find it more difficult to evaluate ineffective behaviors than effective ones. If this is the case, it is likely that raters will have more difficulty evaluating poor performers because they exhibit more ineffective behaviors and hence will feel a greater need to observe their behavior. This leads to the following hypotheses:

H7. Raters will find it more difficult to evaluate poor performers than good performers.

H8. More sampling of information about ratee performance will occur for poor performers than good performers.

Performance level and rating accuracy. Based on the differential accuracy phenomenon identified by Gordon (1970), behavioral accuracy should be greater when evaluating good performers than poor performers. However, there is no reason to expect a difference between good and poor performers on classification accuracy. Hence, it is hypothesized that:

H9. Behavioral accuracy will be greater for good performers than for poor performers.

METHOD

Overview

Undergraduates were hired to play the role of an office manager who supervised four secretaries. The four secretaries worked as a team in order to complete work assignments for the faculty members in the department. Four videotapes, one of each secretary, were available so that the supervisors could observe each secretary at work. In order to enhance the realism of the setting, several conditions described by Favero and Ilgen (1985) as highly desirable for performance appraisal research were incorporated into the study. First, the participants observed the secretaries over time. Second, participants had a variety of different tasks to perform, only one of which was evaluating performance. Third, the job (secretary in a department at a university) was one familiar to the participants and thus, the social categories used to judge people in this position should have been relatively accessible to the subject population (Fiske & Kinder, 1981). Finally, multiple ratees (four secretaries) were evaluated.

Subjects and Design

A sample of 37 individuals (14 males and 23 females), ranging in age from 17 to 38 years (mean 22 years), participated in the study. Sample size was based on a power analysis assuming a small effect size and power of .85 (Cohen & Cohen, 1983). Participants were recruited through a newspaper advertisement offering to pay approximately $18 for 4 hr of participation.

The design was a 2 x 2 analysis of variance with repeated measures on both independent variables. The independent variables were performance level (good or poor performer) and performance consistency (consistent or inconsistent). The manipulation of the independent variables was done through the creation of four videotaped stimulus persons, each representing a combination of the two independent variables (i.e., a consistently good performer, a consistently poor performer, an inconsistently good performer, and an inconsistently poor performer).1 The design was a repeated measures one because all participants viewed each of the four videotapes and hence, were exposed to all possible combinations of the independent variables.
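The analyses reported below are standard 2 x 2 within-subjects analyses of variance of this design. As a minimal sketch of how such an analysis can be run (this is not the authors' analysis code; the column names and the simulated data are hypothetical), one could use Python with statsmodels:

    # 2 x 2 repeated measures ANOVA sketch; data and column names are invented.
    import numpy as np
    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    rng = np.random.default_rng(0)
    rows = []
    for rater in range(1, 38):                       # 37 raters
        for level in ("high", "low"):                # performance level
            for consistency in ("consistent", "inconsistent"):
                rows.append({
                    "rater": rater,
                    "level": level,
                    "consistency": consistency,
                    "dv": rng.normal(8.5, 3.0),      # placeholder dependent variable
                })
    data = pd.DataFrame(rows)

    # F tests for the two main effects and their interaction, with raters
    # serving as the repeated measures factor.
    print(AnovaRM(data, depvar="dv", subject="rater",
                  within=["level", "consistency"]).fit())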

Development of Stimulus Materials

Construction of videotapes. Four videotapes were developed, one for each of the four secretaries. Each tape contained 17 1- to 2-min incidents of secretarial performance, with each incident representing some level of performance on one or more of four performance dimensions. The performance dimensions were (1) job knowledge and skill, (2) organizational ability, (3) dealing with faculty/students, and (4) working cooperatively with other secretaries. The videotapes depicted between 8 and 12 examples of ratee job behavior on each of the four performance dimensions for each secretary.

To develop behavioral incidents for the videotapes, five secretaries were interviewed and asked to describe incidents of good and poor secretarial performance. One hundred and one incidents were collected from these interviews and were then converted into short descriptions suitable for filming. Prior to filming, the incident descriptions were evaluated by six organizational behavior graduate students on the basis of the dimensions judged to be relevant to the incident and the level of performance represented by the incident on each of those dimensions. These initial ratings were used only to decide which incidents to film for each secretary in order to create the desired manipulations. They were not used in developing the performance standards. Four individuals with secretarial experience were hired to play the roles of the four secretaries in the videotapes. Volunteers filled other roles.

1 At first glance it may appear illogical to speak in terms of an "inconsistently good performer" since being inconsistent requires that the individual exhibit both good and bad behaviors. However, defining "inconsistent" as the presence of a high degree of variance around some mean level of performance means that it would be possible for someone to be inconsistent but yet, overall, a good performer (i.e., by having a large amount of variance around a relatively high mean level of performance). More concretely, using a 7-point rating scale, an inconsistently good performer might exhibit behaviors ranging in effectiveness from 7 to 3, with a mean of about 5.3; in contrast, all the behaviors displayed by a consistently good performer might be rated between 5.0 and 5.5 with a mean of about 5.3. Both performers have a relatively high mean performance level but the former has much more variance (or inconsistency) in her performance. Operationalizing performance inconsistency in this way seems appropriate since it is quite likely that a person could be considered an overall good performer while at the same time exhibiting a substantial amount of variance in his/her performance over time.
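To make the footnote's operationalization concrete, the short sketch below contrasts two hypothetical rating sequences with nearly identical means but very different variability; the specific ratings are invented for illustration:

    # Consistency operationalized as variance around a mean performance level.
    import numpy as np

    consistently_good = np.array([5.0, 5.2, 5.3, 5.4, 5.5, 5.4, 5.3, 5.0])
    inconsistently_good = np.array([7.0, 3.0, 6.5, 4.0, 7.0, 3.5, 6.0, 5.4])

    for name, ratings in [("consistent", consistently_good),
                          ("inconsistent", inconsistently_good)]:
        print(name, "mean =", round(ratings.mean(), 2),
              "SD =", round(ratings.std(ddof=1), 2))
    # Both means are near 5.3; only the variability differs.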


Development of performance standards. In order to make a judgment about the accuracy of the ratings provided by participants, it is necessary to have some standards to which to compare these ratings. These standards are often called "true score ratings" and are typically derived by having a number of "expert raters" evaluate the videotapes. The averages of these ratings then serve as the performance standards, or true scores, for the videotapes. The procedures used to develop the performance standards were similar to those used by other researchers (e.g., Borman, 1978, 1979; Murphy, Garcia, Kerkar, Martin, & Balzer, 1982).

Ten persons working full time as secretaries served as the expert raters. These individuals rated the videotaped incidents. Several procedures were followed to enhance the ability of the secretaries serving as raters to provide accurate performance ratings. First, they were given an hour of training on common rating errors (halo, leniency/strictness, central tendency, first impression, and contrast effect) and the use of the rating scale. Next, they practiced rating and discussed their ratings among themselves in order to establish a common frame of reference for each performance dimension (Bernardin & Pence, 1980). In addition, the secretaries were (1) given a detailed written description of each incident to read prior to each rating session, (2) shown each incident twice before making their rating, and (3) encouraged to take notes. In making the actual ratings, the raters were first asked to indicate which of the four performance dimensions were represented in the incident. Next, they rated the level of performance on the selected dimension(s) using a 7-point behaviorally anchored rating scale (BARS) developed by the authors specifically for clerical workers.

As noted above, the secretaries on the videotapes were shown to be interacting and working together in the same office. As a result, in some of the videotaped incidents, one or more of the other secretaries, besides the focal secretary for the videotape, played a role. Since it was believed that study participants would use this additional information (as well as the information on each secretary's main videotape) in forming an impression of each secretary, it was necessary to incorporate this information into the development of the performance standards. This was done by having the expert raters evaluate any of the secretaries who played an active role in the incident. These additional incident ratings were then included when determining the performance standards for each secretary.

The initial criteria used to evaluate agreement among the expert raters were (1) 70% agreement on which dimensions were relevant to a videotaped incident (i.e., at least 7 of the 10 raters had to agree on which of the four performance dimensions were displayed in the incident) and (2) a standard deviation of less than 1.25 across ratings of the level of performance represented on each of the relevant dimensions. Using these criteria, 45 incidents were acceptable and 3 clearly were not. The remaining 35 incidents did not meet the criteria but showed enough agreement to warrant further consideration. As a result, the raters met again to reevaluate these 35 incidents. The second rating of these incidents resulted in 21 which clearly met the criteria and 14 which did so except for one outlier. It was decided to include all 35 of these incidents in the set of incidents from which those included on the videotapes were chosen.

The incidents actually included on the videotapes were selected in order to maximize the desired performance level and performance consistency manipulations. In order to do this, it was necessary that several conditions be met: (1) the average performance level of the two secretaries within the high consistency condition should be approximately equal to the average performance level of the two secretaries within the low consistency condition; (2) the average performance level of the two secretaries within the high performance condition should be greater than the average performance level of the two secretaries within the low performance condition; (3) the average standard deviation across dimension means for the two secretaries within the high consistency condition should be lower than the average standard deviation across dimension means for the two secretaries within the low consistency condition; and (4) the mean standard deviation in performance level across incidents within a performance dimension should be lower for the two secretaries in the high consistency condition than for the two secretaries in the low consistency condition. The first two conditions were necessary to create the performance level manipulation and the last two conditions to establish the performance consistency manipulation. Table 1 presents descriptive statistics for the incidents that comprised the final four videotapes.

Performance standards for each secretary were calculated both at the level of individual incidents and at the level of performance dimensions. For each incident included on the videotapes, the ratings given by the expert raters were averaged, yielding the performance standard for each dimension relevant to that incident. The average of the performance standards for each incident relevant to a particular dimension became the performance standard for that dimension. This approach to developing performance standards was used (instead of the more typical procedure of having the expert raters make a single rating on each dimension after viewing the entire videotape for a given ratee) because it allowed for a more precise definition of the performance standards than is usually possible. In addition, it allowed a participant's accuracy to be determined in relation to just those incidents which he or she observed rather than in relation to the entire collection of incidents on the videotape. This was important since participants chose how many videotaped incidents to observe for each secretary and hence, all participants did not see the same number of incidents.
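A minimal sketch of this per-participant standard-setting logic is shown below; the data structures and incident names are hypothetical, but the computation follows the averaging procedure just described (expert ratings averaged within an incident, then averaged across the incidents a participant actually watched):

    # Per-participant performance standards from expert incident ratings.
    from statistics import mean

    # expert_ratings[incident][dimension] -> the 10 experts' ratings (invented)
    expert_ratings = {
        "incident_01": {"job_knowledge": [5, 6, 5, 5, 6, 5, 6, 5, 5, 6]},
        "incident_02": {"job_knowledge": [3, 4, 3, 3, 4, 4, 3, 3, 4, 3]},
    }

    def dimension_standard(watched_incidents, dimension):
        """Mean of the per-incident expert means, restricted to the
        incidents this participant observed."""
        incident_means = [mean(expert_ratings[inc][dimension])
                          for inc in watched_incidents
                          if dimension in expert_ratings[inc]]
        return mean(incident_means)

    print(dimension_standard(["incident_01", "incident_02"], "job_knowledge"))  # 4.4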

TABLE 1
CHARACTERISTICS OF THE VIDEOTAPES USED AS STIMULUS MATERIALS

1. Mean performance rating (by expert raters) within consistency conditions: consistent performers = 4.64; inconsistent performers = 4.39.
2. Mean performance rating (by expert raters) within performance level conditions: high performers = 5.56; low performers = 3.47.
3. Mean standard deviation of performance ratings across dimension means within consistency conditions: consistent performers = .37; inconsistent performers = .70.
4. Mean standard deviation in performance level across incidents within a dimension: consistent performers = .61; inconsistent performers = 1.29.
5. Number of behavioral incidents in the total set: consistent high = 38; consistent low = 44; inconsistent high = 40; inconsistent low = 38.

Psychometric properties of expert ratings. In order to be confident that the performance standards developed for each secretary (ratee) on each performance dimension were indicative of the secretary's actual performance, it was necessary to establish more precisely the extent of agreement between the expert raters on the ratings given to the secretaries. Agreement in two areas was assessed: (1) agreement on the incidents determined to be relevant to each performance dimension and (2) agreement on the level of performance represented in the incident on the relevant performance dimensions.

The extent of agreement for the incidents judged to be relevant to each performance dimension was assessed using coefficient alpha. Data were coded "0" for dimensions judged not to be relevant to the incident and "1" for relevant ones.2 Alpha coefficients were calculated for each dimension both across incidents for each ratee and across all incidents relevant to the dimension regardless of ratee. Since the alpha coefficients for each ratee did not differ substantially from the alphas across all four ratees, only the overall alpha coefficients based on all four ratees are reported. Coefficient alphas were .90 for Job Knowledge and Skill, .96 for Dealing with Professors, .97 for Working Cooperatively with Other Secretaries, and .91 for Organizational Ability.

2 In order to make this determination, behavioral incidents served as the cases (or replications) and raters served as the items. The data points were whether or not a particular rater judged the incident in question to be relevant to a given performance dimension. From this information a correlation could be calculated for each pair of raters. This correlation reflects the amount of agreement between any two raters on which incidents they believed were relevant to a particular performance dimension. From this correlation matrix (where the elements were the correlations between each pair of raters) a coefficient alpha was calculated. In this situation, the coefficient alpha represents the overall amount of agreement across the expert raters on which incidents were relevant to a particular performance dimension. This procedure was repeated for each of the four dimensions and for each of the four ratees.

Intraclass correlations were used to determine the degree of agreement among the 10 expert raters on the overall level of performance of each ratee on each of the four performance dimensions. These were .88 for Job Knowledge, .81 for Dealing with Professors, .87 for Working Cooperatively with Other Secretaries, .93 for Organization of Work, and .91 for the overall ratings.
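Footnote 2 describes the alpha computation in enough detail to sketch it. The fragment below treats incidents as cases and raters as items and derives a standardized coefficient alpha from the rater-by-rater correlation matrix; the binary relevance codes are simulated, not the study's data:

    # Interrater agreement on dimension relevance, per footnote 2's procedure.
    import numpy as np

    rng = np.random.default_rng(0)
    true_relevance = rng.random(101) > 0.5          # latent relevance per incident
    disagreement = rng.random((101, 10)) < 0.10     # ~10% rater disagreement
    codes = (true_relevance[:, None] ^ disagreement).astype(float)

    k = codes.shape[1]                              # number of raters
    r = np.corrcoef(codes, rowvar=False)            # rater x rater correlations
    mean_r = (r.sum() - k) / (k * (k - 1))          # mean off-diagonal correlation
    alpha = k * mean_r / (1 + (k - 1) * mean_r)     # standardized alpha
    print(round(alpha, 2))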

Procedures

Each person participated in three separate experimental sessions. The first lasted 1½ hr, the second 1 hr, and the final 2 hr. Performance ratings were made at the end of the first session and during the final session. Sessions were separated by approximately 10 to 14 days, with the first and last sessions between 23 and 27 days apart. All sessions were run in a small office on the campus of a large state university. The room was furnished with a table and chair where participants worked on in-basket tasks. A bookcase served as a partition in the room. Behind it was a videorecorder and monitor. The monitor was placed so that it could not be seen from the table.

Session One. When participants arrived for the first session, they read and signed a consent form and then filled out an initial questionnaire designed to gather background information. After completing this questionnaire, participants watched a 15-min videotaped set of instructions describing the study. They were also provided with a written copy of the information presented in the videotape (labeled as the Management Department Office Manual) to read as they listened to the videotape. The study was described as one that dealt with how managers manage their time. Participants were told that they would be playing the role of an office manager in the Department of Management at a fictitious university. A brief description of the organization, the job of office manager, and the nature and number of subordinates associated with the position was provided.

Most of the office manager's tasks were presented in an in-basket. The in-basket included such things as filling out a variety of university forms, updating a departmental account, writing several memos, making changes in the course schedule book, scheduling rooms and times for the classes of departmental faculty, and making a schedule for the visit of a prospective job candidate. In all, 22 tasks were included in the in-basket. No participants completed all of the tasks during the course of the study.

In addition to working on the in-basket, participants were told that they would be evaluating the performance of each of the four secretaries at the end of the first and third sessions and that they could "watch" the secretaries doing their job by viewing portions of a videotape of each secretary. The amount of time they spent viewing the videotapes was up to them after the initial introductory session. The evaluation form and its use were then explained. Participants were informed that they would be evaluated on two major criteria, the quality and quantity of the in-basket work completed and the accuracy of their performance evaluations of each secretary. As an incentive to perform well, a $25 reward was offered for being the highest performer. This was awarded in addition to the $18 paid to everyone for participating in the study (in reality, the winner of the $25 bonus was determined through a random drawing).

After completion of the videotaped instructions, the experimenter explained how to operate the videorecorder. Then, each participant was shown the first five behavioral incidents on the videotape for each secretary. The purposes of this initial viewing were to "introduce" participants to each secretary and to establish the performance level and performance consistency manipulations. Since there was the possibility that some participants would choose to watch few or no additional performance incidents, it was important that these initial incidents clearly display the intended characteristics of each secretary. Thus, for the consistently good performer, all five incidents were examples of effective behavior, while all were examples of ineffective performance for the consistently poor performer. For the inconsistently good performer, participants were shown three incidents of effective behavior, one incident of average performance, and one incident of below average performance.


Finally, for the inconsistently poor performer, participants viewed three incidents of ineffective behavior, one of average performance, and one of above average performance. No check on the manipulation was done after viewing the initial five incidents, but the manipulations were checked at the end of the third session. The order in which participants initially watched the four secretary videotapes was counterbalanced to control for any possible effects due to the order of viewing the videotapes. Participants were permitted to take notes while watching the videotapes. After watching the videotape of each secretary, the experimenter briefly reviewed the instructions. Participants were then given up to 30 min to complete a performance evaluation on each of the four secretaries and to begin work on the in-basket tasks. During this first session, participants were not allowed to view any additional videotape incidents prior to completing the rating forms.

Session Two. Session Two was held approximately 12 days after the first. When participants arrived for the second session, the instructions, rules, financial incentives, and operation of the video equipment were briefly reviewed. Participants were then shown a behavioral incident viewed in the first session to be sure that they could identify each secretary. The experimenter also checked in on each participant two times during the session to answer any questions and to announce the amount of time remaining in the session.

Session Three. Initial procedures for Session Three were the same as those for Session Two. Then, participants were given 50 min to work on the in-basket tasks and to watch the videotapes of the secretaries. After 50 min, participants completed a performance evaluation on each of the secretaries and filled out the final questionnaires. When all questionnaires were completed, participants were debriefed and paid for their participation.

Measures

Sampling. The amount of information search or sampling done by each rater for each secretary was measured as the total number of behavioral incidents watched for that secretary. At the end of each session, the experimenter recorded the number of incidents observed. Prior to beginning the next session, the videotapes were set at the point where that person had stopped watching each secretary during the previous session.

Information processing variables. The approach used by raters to process information, either categorically or behaviorally, was inferred from measures of the types of behaviors correctly and incorrectly identified by raters. The questionnaire from which these measures were derived will be described below. The number of behaviors for any given secretary correctly identified by raters as having been exhibited by that secretary was measured as a proportion of the total number of behaviors, using the formula

proportion of behaviors correctly identified = n_i / t_i,

where n_i was the total number of behaviors, both prototypical and nonprototypical, correctly identified by the rater as having been exhibited by ratee i, and t_i was the total number of behaviors actually exhibited by ratee i in the behavioral incidents which the rater observed. The number of prototypical behaviors incorrectly identified by the rater as having been exhibited by a given ratee (Proto. Behaviors) was measured as a proportion of the total with the formula

proportion of prototypical behaviors incorrectly identified = pb_i / tpb_i,

where pb_i is the number of prototypical behaviors incorrectly identified by the rater as having been exhibited by ratee i, while tpb_i is the total number of prototypical behaviors on the questionnaire which were not exhibited by ratee i.

Rating difficulty was measured by a single item on the questionnaire completed by participants in the final session. The item asked participants how much difficulty they had in evaluating a secretary's performance; responses were obtained on a 5-point scale, ranging from "extremely difficult" to "no difficulty."
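The two proportions just defined can be illustrated with a toy example for one rater-ratee pair; the behavior sets below are hypothetical:

    # Information processing measures for one rater-ratee pair (invented data).
    exhibited = {"b01", "b02", "b03", "b04"}       # behaviors the ratee showed
    attributed = {"b01", "b02", "b07", "b09"}      # behaviors the rater checked
    proto_not_exhibited = {"b07", "b08", "b09"}    # prototypical items the
                                                   # ratee never exhibited

    # proportion of behaviors correctly identified = n_i / t_i
    n_i = len(exhibited & attributed)              # correctly identified
    t_i = len(exhibited)                           # actually exhibited
    print("correctly identified:", n_i / t_i)      # 2/4 = 0.50

    # proportion of prototypical behaviors incorrectly identified = pb_i / tpb_i
    pb_i = len(proto_not_exhibited & attributed)   # falsely attributed
    tpb_i = len(proto_not_exhibited)               # all such items on the list
    print("incorrectly attributed:", pb_i / tpb_i) # 2/3, approximately 0.67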

Rating accuracy. Three measures of rating accuracy were used in this study: Cronbach's elevation accuracy and Lord's measures of classification and behavioral accuracy. Before describing these measures in greater detail, a general comment should be made about how the accuracy measures were calculated. Because study participants were able to choose how many behavioral incidents they wanted to watch for each secretary, it was necessary to take the number of incidents observed into account when assessing each participant's degree of rating accuracy with each measure. For Cronbach's measure of elevation accuracy, this involved developing a unique set of performance standards for each participant that included only those behavioral incidents actually observed by that participant. Because of the way in which Lord's measures of accuracy were assessed, different procedures were required to take into account the number of behavioral incidents observed by each participant. These procedures will be described below.

The first measure of rating accuracy used was Cronbach's elevation accuracy. Because a within-subjects design was used, a slight modification of Cronbach's formula was necessary in order to determine how accurately a rater evaluated each of several ratees. Computation of Cronbach's elevation accuracy typically involves comparing the observed mean rating, summed across dimensions and ratees, to the true mean rating, also summed across dimensions and ratees. Hence, it reflects the extent to which raters are able to accurately identify the overall level of performance exhibited by ratees. In order to calculate a measure of elevation accuracy for each ratee, rather than across ratees, the same formula was used except that the observed and true means were only summed across dimensions. Specifically, elevation accuracy was calculated as the square root of

(elevation accuracy)^2 = (ΣX_i/n - ΣS_i/n)^2,

where X_i is the observed rating on dimension i, S_i is the standard (or true score) rating on dimension i derived from the expert ratings of the videotapes, and n is the number of performance dimensions (in this case n = 4). Calculated in this way, elevation accuracy indicates a rater's ability to accurately identify the overall level of performance for a particular ratee. This Cronbach accuracy measure was selected because it most nearly represented the accuracy of a rater's general impression (or categorization) of a ratee.

In order to determine a rater's level of classification and behavioral accuracy, Lord (1985) advocated using signal detection theory (SDT; Swets & Pickett, 1982), which has been widely used by memory researchers to develop indices of discriminability. Measuring accuracy in this way required determining both a prototypical and a nonprototypical hit and false alarm rate, which will be described below. (For a more thorough discussion of the use of SDT to assess accuracy in behavioral measurement, see Lord, 1985.)

The information needed to calculate Lord's classification and behavioral accuracy measures (as well as the two information processing variables described above) was derived from a checklist completed by participants during the last session. This checklist consisted of 55 behaviors. Each of the behaviors listed was a major behavior displayed in one of the incidents on the videotape. Sample behavioral statements included (1) does not get a professor's presentation notes and overheads typed on time and (2) agrees to stay after regular working hours to finish typing an important paper for a professor in the department. Between 15 and 17 of the behaviors were exhibited by each secretary (there was some overlap since some of the behaviors were exhibited by more than one secretary). For consistent performers, all of the behaviors represented either good or poor performance, while for inconsistent performers, approximately half of the behaviors on the checklist were consistent with the category and half were inconsistent with it.


Participants were asked to read each statement and to indicate which secretary (or secretaries) exhibited that behavior. They were also told to indicate which of the behaviors on the checklist they did not observe. The latter was included as an option because participants did not watch all of the incidents for each secretary and thus, would not have seen some of the behaviors on the checklist. It was felt that in order to get an appropriate indication of behavioral accuracy, raters should be able to identify not only behaviors viewed but also behaviors not viewed. From these responses, it was possible to calculate the prototypical and nonprototypical hit rate and false alarm rate for each of the four secretaries. These rates were used to calculate classification and behavioral accuracy indices.

Prototypical hit rate was defined as the proportion of category-consistent (i.e., prototypical) items correctly identified as having been exhibited by the ratee. Prototypical false alarm rate was the proportion of category-consistent items that were falsely recognized as having been exhibited by the ratee. Nonprototypical hit rate was the proportion of category-inconsistent (i.e., nonprototypical) items correctly attributed to the ratee. Finally, nonprototypical false alarm rate was the proportion of category-inconsistent items incorrectly identified as having been exhibited by the ratee. Note that, for consistent performers, the nonprototypical hit rate was zero since, due to the nature of the manipulation, these performers only exhibited behaviors consistent with the category of either good or poor performer.

In calculating these hit rates and false alarm rates on the basis of the incidents actually observed by participants, it was necessary to distinguish between three types of errors which raters could make: (1) attributing to Ratee A a behavior which she did not exhibit and which the rater had not observed in the portion of incidents that he or she watched; (2) attributing to Ratee A a behavior which she did not exhibit but which the rater had observed (exhibited by another secretary) in the watched portion of the videotapes; and (3) attributing to Ratee A a behavior that she exhibited but in incidents that the rater had not watched. Errors of the first type were what Lord considered false alarms. However, the other two types of errors, which occurred because there were multiple ratees and because participants were allowed to choose how many incidents to watch for each ratee, should also be considered false alarms because they clearly constitute a reduction in behavioral accuracy. As a result, all three types of errors were included in the measurement of the prototypical and nonprototypical false alarm rates. The measures of classification and behavioral accuracy were then calculated from these hit rates and false alarm rates. Classification accuracy was calculated as

CA = (PHR + PFAR) - (NPHR + NPFAR),

where CA is classification accuracy, PHR is the prototypical hit rate, PFAR is the prototypical false alarm rate, NPHR is the nonprototypical hit rate, and NPFAR is the nonprototypical false alarm rate. High classification accuracy meant that raters tended to attribute prototypical behaviors to the ratee regardless of whether or not the ratee actually exhibited them. Behavioral accuracy was calculated using the formula

BA = (PHR + NPHR) - (PFAR + NPFAR),

where BA is behavioral accuracy and the other terms are as defined above. The greater the degree of behavioral accuracy, the more able raters are to identify the actual behaviors exhibited by the ratee, regardless of the prototypicality of the behavior.
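The three accuracy indices can be sketched end to end for a single rater-ratee pair. The fragment below implements the basic Lord definitions of the hit and false alarm rates (it ignores the study's refinement that also counts the two additional error types as false alarms) together with the per-ratee elevation index; all data are invented:

    # Classification, behavioral, and elevation accuracy for one rater-ratee pair.
    import math

    def rates(attributed, actual, pool):
        """Hit rate and false alarm rate within one checklist item pool
        (prototypical or nonprototypical items)."""
        hr = len(pool & actual & attributed) / max(len(pool & actual), 1)
        far = len((pool - actual) & attributed) / max(len(pool - actual), 1)
        return hr, far

    proto = {"p1", "p2", "p3", "p4"}       # category-consistent items
    nonproto = {"n1", "n2", "n3", "n4"}    # category-inconsistent items
    actual = {"p1", "p2", "n1"}            # behaviors the ratee exhibited
    attributed = {"p1", "p2", "p3", "n1"}  # behaviors the rater checked

    phr, pfar = rates(attributed, actual, proto)
    nphr, npfar = rates(attributed, actual, nonproto)

    ca = (phr + pfar) - (nphr + npfar)     # classification accuracy
    ba = (phr + nphr) - (pfar + npfar)     # behavioral accuracy
    print("CA =", round(ca, 2), "BA =", round(ba, 2))

    # Elevation accuracy (lower = more accurate): per-ratee version of
    # Cronbach's index, i.e., |mean observed rating - mean true score|.
    observed = [6, 5, 6, 5]                # ratings on the four dimensions
    true = [5.4, 4.8, 5.6, 5.0]            # per-participant standards
    print("elevation =", math.sqrt((sum(observed) / 4 - sum(true) / 4) ** 2))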

RESULTS

Manipulation Checks

Two types of manipulation checks were done to ensure that raters perceived secretaries to have the performance characteristics that they were intended to have. Information for the first check was derived from the final questionnaire given to participants. The questionnaire contained two questions assessing the perceived variability (or consistency) of each secretary's performance (average alpha = .68) and four items assessing the perceived overall effectiveness of each secretary's performance (average alpha = .76). Each of these scales was used as the dependent variable in a repeated measures analysis of variance, with the manipulated performance characteristics serving as the independent variables. Results for perceived performance effectiveness indicated a highly significant performance level main effect (F(1,36) = 520.95, p < .01), with the mean difference in the expected direction (X̄_high = 25.03 vs X̄_low = 11.20). When the dependent variable was the perceived variability of the secretary's performance, a significant main effect for consistency was found (F(1,36) = 75.65, p < .01). Again, the mean difference was in the direction expected (X̄_incon = 9.70 vs X̄_con = 7.30). However, there was also a significant performance level main effect on perceived performance variability (F(1,36) = 86.43, p < .01), with high performers being perceived as having less variable performance than low performers (X̄_low = 9.93 vs X̄_high = 7.07). This effect may have been due to the fact that consistency of performance was perceived to be a positive characteristic while inconsistency was perceived negatively.

Post hoc questioning of each participant concerning the performance characteristics of the four secretaries provided the second check on the intended manipulations. Ninety-eight percent of the participants correctly identified the intended performance level of all four secretaries and 77% correctly indicated the intended degree of consistency for all four. The remaining 23% of the participants correctly identified the degree of consistency for two of the four secretaries. There was no apparent pattern of misidentification, suggesting that the misidentification resulted from individual differences in perceptions of the four secretaries rather than from the ineffectiveness of the intended manipulations. In combination, these two manipulation checks provide support for the effectiveness of the performance level and performance consistency manipulations.

Performance Consistency

Cell means and standard deviations for all the dependent variables are reported in Table 2. Marginal means and the results for the repeated measures analyses of variance used to test the hypotheses are presented in Tables 3 and 4 and will be described more completely below. All hypotheses were tested using repeated measures analyses of variance. Rating process variables. The first hypothesis stated that greater sampling of behavioral information would occur for inconsistent performers than for consistent ones (Hypothesis 1). This hypothesis was not supported (F(1,36) = 1.14, p > .OS). The second and third hypotheses were that raters would correctly identify more behaviors for inconsistent performers than for consistent perTABLE CELL

MEANS

AND STANDARD

Consistent high performer Variable” 1. Rating difficulty 2. Sampling 3. Proportion of behaviors correctly identified 4. Proportion of prototypical behaviors incorrectly identified 5. Elevation accuracyb 6. Classification accuracy 7. Behavioral accuracy

2

DEVIATIONS

FOR DEPENDENT

Inconsistent high performer

VARIABLES

Consistent low performer

Inconsistent low performer

M

SD

M

SD

M

SD

M

SD

1.43 1.62

.65 1.85

2.39 8.30

.86 2.72

2.16 9.59

1.07 2.97

3.00 9.54

1.05 2.86

.65

.24

.50

.16

.56

.23

.51

.20

.34 .I2

.26 .33

.17 .93

.14 .56

.34 .47

.18 .34

.20 .74

.20 .57

.96 .51

.40 .24

.ll .39

.30 .19

.76 .42

.45 .23

.22 .39

.35 .27

o N = 37 for all cells. b Higher numbers indicate less accuracy.

252

PADGETT

AND ILGEN

TABLE 3 MARGINALMEANSANDANALYSISOFVARIANCETABLESFOR RATING PROCESSVARIABLES Marginal means Variable 1. Rating diffkulty Performance level (L)

High 1.91

Low 2.58

df

MS

F

36 1

1.04 16.89

16.17*

.72 29.43

40.65*

Performance consistency (C)

1.79

2.69

36 1

LXC

-

-

36 1

2. Sampling Performance level (L)

.57 .108

.19

7.96

9.56

36 1

4.97 95.68

19.23*

Performance consistency (C)

8.61

8.92

36 1

3.14 3.57

1.14

LXC

-

-

36 1

5.38 4.92

.91

36 1

.03 .08

2.40

36 1 36 1

.03 .36 .04 .lO

3. Proportion of behaviors correctly identified Performance level (L)

.58 .61

Performance consistency (C) LXC

-

4. Proportion of prototypical behaviors incorrectly identified Performance level (L) Performance consistency (C) LXC

.54

-

.51 -

13.23* 2.94

.25

.26

36 1

.Ol .Ol

39

.34

.18

36 1

.02 .89

47.83*

36 1

.02 .Ol

.29

-

*p s .Ol.

formers and that they would be more likely to attribute nonpresent prototypical behaviors to consistent performers. Results of a repeated measures analysis of variance using the proportion of behaviors correctly identified (Hypothesis 2) as the dependent variable found a significant

253

RATER ACCURACY TABLE MARGINAL

MEANS AND ANALYSIS RATING Accurucy

4 OF VARIANCE MEASURES

TABLES

FOR

df

MS

Marginal means Variable” 1. Elevation accuracy Performance level (L)

High

Low

.83

.61

Performance consistency (C)

59

.84

LXC

-

-

2. Classification accuracy Performance level (L)

54

Performance consistency (C)

.86

LXC

-

3. Behavioral accuracy Performance level (L) Performance consistency (C) LXC

.45 .47 -

.49 .16 -

.41 .39 -

F

36 1

.22 1.80

8.01*

36 1

.20 2.15

10.72*

36 1

.22 .03

.12

36 1

.06 .lO

1.77

36 1

.15 18.15

118.20*

36 1

.15 .88

5.86

36 1

.03 .08

2.26

36 1

.03 .22

7.33*

36 1

.03 .08

2.69

a Note that for elevation accuracy low scores indicate greater accuracy while the reverse is true for classification and behavioral accuracy. * p s .Ol.

performance consistency main effect (F(1,36) = 13.23, p < .Ol) with means in the opposite direction of that hypothesized (see Table 3). Specifically, raters correctly identified more behaviors for consistent performers than for inconsistent performers. However, a similar analysis using the proportion of prototypical behaviors incorrectly attributed to a ratee as the dependent variable (Hypothesis 3) found support for the hypothesis. The performance consistency main effect was significant (F(1,36) = 47.83, p < .Ol) with raters incorrectly attributing more prototypical behaviors to consistent performers. Rating accurucy. The fourth and fifth hypotheses were that classification accuracy (Hypothesis 4) and elevation accuracy (Hypothesis 5) would be greater for consistent performers than for inconsistent perform-

Rating accuracy. The fourth and fifth hypotheses were that classification accuracy (Hypothesis 4) and elevation accuracy (Hypothesis 5) would be greater for consistent performers than for inconsistent performers. Repeated measures analyses of variance found support for both of these hypotheses (see Table 4). Results revealed a significant main effect of performance consistency on both classification accuracy (F(1,36) = 118.20, p < .01) and elevation accuracy (F(1,36) = 10.72, p < .01). In both cases raters were more accurate when evaluating consistent performers. No significant interactions were found for either of these variables.

Hypothesis 6 was that raters would have a higher degree of behavioral accuracy when evaluating inconsistent performers than consistent performers. This was tested with a repeated measures analysis of variance using Lord's behavioral accuracy as the dependent variable. Results for the relationship between consistency and behavioral accuracy did not support the hypothesis. Although a significant performance consistency main effect was found for behavioral accuracy (F(1,36) = 7.33, p < .01), the means were in the opposite direction of that hypothesized; behavioral accuracy was greater for the consistent performers than for the inconsistent ones.

Performance Level

Rating process variables. The seventh hypothesis was that raters would find it more difficult to evaluate poor performers than good performers. This hypothesis was supported (F(1,36) = 16.17, p < .01). It is interesting to note that the main effect for performance consistency on rating difficulty was also significant (F(1,36) = 40.65, p < .01), with raters finding it more difficult to evaluate inconsistent performers than consistent performers. It was also hypothesized that sampling of information would be greater for poor performers than good performers (Hypothesis 8). Results of the repeated measures analysis of variance using sampling as the dependent variable revealed a significant main effect for performance level (F(1,36) = 19.23, p < .01), with means in the direction hypothesized. The final hypothesis was that behavioral accuracy would be greater for good performers than for poor performers. The repeated measures analysis of variance using behavioral accuracy as the dependent variable (see Table 4) found that although the means were in the hypothesized direction, the performance level main effect was not significant (F(1,36) = 2.26, p > .05).
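To make the form of these tests concrete, the sketch below computes an F ratio for a single two-level within-subject factor in Python. It is a minimal illustration rather than the analysis code used in the study, and the two input arrays are simulated stand-ins for one score per rater (N = 37) in each condition.

```python
import numpy as np

# Minimal sketch of a repeated measures F test for one two-level
# within-subject factor (e.g., performance consistency). The data are
# simulated stand-ins, one score per rater (N = 37) per condition.
rng = np.random.default_rng(1989)
consistent = rng.normal(0.47, 0.15, 37)    # hypothetical rater scores
inconsistent = rng.normal(0.39, 0.15, 37)  # hypothetical rater scores

scores = np.column_stack([consistent, inconsistent])  # 37 raters x 2 levels
n, k = scores.shape

grand = scores.mean()
rater_means = scores.mean(axis=1, keepdims=True)
level_means = scores.mean(axis=0, keepdims=True)

# Effect sum of squares, and the rater-by-condition residual that
# serves as the error term in a repeated measures design.
ss_effect = n * ((level_means - grand) ** 2).sum()
ss_error = ((scores - rater_means - level_means + grand) ** 2).sum()

ms_effect = ss_effect / (k - 1)             # df = k - 1 = 1
ms_error = ss_error / ((n - 1) * (k - 1))   # df = 36
print(f"F(1, {(n - 1) * (k - 1)}) = {ms_effect / ms_error:.2f}")
```

With k = 2 levels this F is equivalent to the square of a paired t statistic, which is why every test above is reported on (1, 36) degrees of freedom.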

DISCUSSION

Previous researchers have argued that the appropriate criterion in evaluating the quality of performance appraisals is the accuracy of those ratings (Bernardin & Pence, 1980; Borman, 1978). Unfortunately, research using rating accuracy as the criterion has proceeded without first developing an understanding of how accuracy should be conceptualized and measured. Adopting Lord's (1985) position, we argued that the Cronbach (1955) measures of rating accuracy, which are commonly used by researchers, do not provide an appropriate measure of accuracy. Specifically, these measures are flawed because raters could appear accurate (with accuracy typically interpreted as the ability to correctly identify ratee behaviors) when, in fact, they were accurate only at the level of a general category or impression (i.e., they were able only to correctly identify the general category into which the person should be placed). The results of the present study provide some support for this position and suggest the importance of distinguishing between these two conceptualizations of accuracy, which Lord (1985) called behavioral and classification accuracy. Second, this study identified two ratee performance characteristics (performance level and performance consistency) which influenced rater cognitive processes and both forms of rating accuracy.

Researchers studying cognitive processes have suggested two possible ways in which raters may process performance information (e.g., Murphy, Martin, & Garcia, 1982; Nathan & Lord, 1983). In one approach, raters integrate behavioral information into general categories or impressions of people. In this case, specific behaviors tend to be forgotten and general impressions become the basis for subsequent evaluations. In the second approach, it is thought either that actual behavioral observations are recalled or that behavioral observations are integrated into behavioral dimensions which are recalled. In either case, the focus is on a more behaviorally oriented approach to processing and storing information, which should then facilitate accuracy in rating.

The results of this study suggest that the consistency of a ratee's performance may influence the approach used by raters to process information. Specifically, for consistent performers, raters were more likely to integrate specific observed behaviors into general cognitive categories and then use these categories in making evaluations. Support for this comes from the finding that raters were significantly more likely to incorrectly attribute nonpresent prototypical behaviors to consistent performers than to inconsistent performers. This indicates that raters tended to attribute any prototypical behaviors to consistently performing ratees regardless of whether or not they actually exhibited them. Clearly, the categorization model of information processing (Rosch, 1978) best described the process by which raters evaluated consistent performers in the present situation.

In contrast, it was hypothesized that raters would use a more behavioral approach to processing information when evaluating inconsistent performers. Support for this hypothesis is equivocal. While the overall proportion of behaviors correctly identified was found to be greater for consistent performers than for inconsistent ones, it seems clear that raters were more sensitive to the actual behaviors exhibited by inconsistent ratees. If they had been responding only to their general impression of these ratees, then the total number of behaviors correctly identified for inconsistent performers would have been much lower than that actually observed, because the number of nonprototypical behaviors correctly identified would have been quite low. However, the fact that raters were quite willing to attribute (correctly) nonprototypical behaviors to inconsistent performers supports the position that raters were evaluating them on the basis of observed behaviors rather than general category membership.

Performance consistency also affected classification and behavioral accuracy, presumably because of its effect on how information was processed. For consistently performing ratees, raters seemed to rely heavily on their general impression when providing performance ratings. As hypothesized, this resulted in a much higher level of classification accuracy for consistent performers than for inconsistent performers. However, although raters appeared to process and store specific ratee behaviors when evaluating inconsistently performing ratees, the overall level of behavioral accuracy was still greater for consistent performers than for inconsistent performers.

The finding that the greatest degree of behavioral accuracy occurred for consistent performers was contrary to our hypothesis. At first glance, this finding appears to contradict the notion that for consistent performers raters process information using general categories; instead, it implies that raters stored and recalled specific behaviors exhibited by consistent performers. However, since in the present situation consistent performers did not exhibit any nonprototypical behaviors, relying on a general impression of these ratees would actually increase the rater's probability of correctly attributing prototypical behaviors to them, leading to the higher hit rate we observed for consistent performers and, other things being equal, increasing behavioral accuracy. This would also explain the significant positive correlation observed between classification and behavioral accuracy for the consistent performers (r(74) = .46, p < .01). On the other hand, for inconsistent performers this correlation was negative (r(74) = -.29, p < .01) since, in this case, relying on a general impression would tend to reduce behavioral accuracy by increasing the likelihood that raters would forget nonprototypical behaviors (i.e., by lowering the nonprototypical hit rate).
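The hit-rate reasoning in the preceding paragraph can be illustrated with a toy computation. The sketch below is a simplified stand-in for Lord's (1985) signal detection index, not his exact formula: "accuracy" is scored as the proportion of present/absent judgments that a purely impression-driven rater gets right, and the behavior lists are hypothetical.

```python
# Toy illustration (not Lord's exact signal detection index) of why an
# impression-driven rater can still score well on behavioral accuracy
# for a consistent ratee. The strategy: endorse every prototypical item
# and reject every nonprototypical item, ignoring what was actually seen.
from dataclasses import dataclass

@dataclass
class Ratee:
    exhibited: set[str]  # behaviors the ratee actually displayed

PROTOTYPICAL = {"meets deadlines", "helps peers", "plans ahead"}
NONPROTOTYPICAL = {"arrives late", "misses details"}
ITEMS = PROTOTYPICAL | NONPROTOTYPICAL

consistent = Ratee(exhibited={"meets deadlines", "helps peers"})
inconsistent = Ratee(exhibited={"meets deadlines", "arrives late"})

def impression_based_accuracy(ratee: Ratee) -> float:
    # Endorse an item iff it is prototypical, then score each
    # present/absent judgment against what the ratee actually did.
    correct = sum((item in PROTOTYPICAL) == (item in ratee.exhibited)
                  for item in ITEMS)
    return correct / len(ITEMS)

print(impression_based_accuracy(consistent))    # 0.8: one false alarm, no misses
print(impression_based_accuracy(inconsistent))  # 0.4: false alarms plus a miss
```

Because the consistent ratee exhibits only prototypical behaviors, the endorse-the-prototype strategy costs at most a few false alarms, whereas for the inconsistent ratee it also misses every nonprototypical behavior that actually occurred; other things being equal, this is one way to read the opposite-signed correlations reported above.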

Several additional hypotheses were made concerning the effect of ratee performance level. Performance level was found to influence the amount of information about ratee performance collected by raters. Specifically, raters spent more time observing the videotapes of poor performers than good performers, perhaps because they felt it was more difficult to evaluate poor performers, as was also found. Good behavior, on the other hand, may have been less ambiguous and, therefore, easier to identify. This explanation would be consistent with the finding that raters found it more difficult to evaluate low performers than high performers. The reason the effect was found only for consistent performers is not clear.

Implications

The results of this study have several implications. First, the distinction between classification and behavioral accuracy is important, as is the empirical demonstration of the differences in the way these accuracy measures function. Consistent with Lord (1985), we have argued that although accuracy, in a conceptual sense, really implies behavioral accuracy, common operationalizations of accuracy (such as Cronbach's) tap only a rater's ability to place ratees into global categories (i.e., classification accuracy). We also agree with Lord (1985) that researchers in the area of performance appraisal looking at rating accuracy, particularly those studying cognitive processes, should begin to use a more behavioral approach to assessing accuracy. Although this is a more rigorous accuracy criterion than that typically assessed, we believe it is more conceptually correct and thus avoids the problem of raters appearing to be accurate with Cronbach's measures of accuracy when, in fact, in a true behavioral sense, they may not be. Although a behavioral accuracy criterion also has a great deal of practical relevance for such functions as providing developmental feedback, in other situations classification accuracy may be all that is required of raters. For example, when raters have to select a subordinate to receive an award or determine which subordinates should be given the largest or smallest pay increases, all that is necessary is that raters be able to assess, in a general way, the overall performance level of the ratee. Given this, we suggest that classification accuracy can perhaps best be seen as a necessary but not sufficient condition for true behavioral accuracy.

The data from this study also suggest the need to identify factors that might increase the tendency of raters to rely on a general category or impression in evaluating performance, rather than on specific ratee behaviors. This is because the present research clearly suggests that rating based on a general impression reduces accuracy, while focusing on behaviors increases it. This study suggested one such factor, the consistency of a ratee's performance, but other factors, such as the race and sex of the ratee, might also produce a similar tendency. Identifying these factors could suggest new approaches to training raters designed to alleviate this source of inaccuracy in ratings. From a practical point of view, this study also supports the suggestions of previous researchers (e.g., Bernardin & Pence, 1980) on the importance of training raters to focus on ratee behaviors, perhaps using such things as behavioral diaries.

A final implication of this study is that raters need to learn the distinction between prototypical and nonprototypical behaviors and to be aware of the common tendency toward false positive prototypical behavior errors. For example, when observing multiple ratees, the rater may recall observing a particular behavior but attribute that behavior to the ratee for whom the behavior is most prototypical. One potential outcome of this tendency would be for the ratings of consistent performers to be more extreme. In other words, good performers would generally be rated higher than they deserve, while poor performers would tend to be rated lower than they ought to be. If raters can be taught to eliminate this type of error, perhaps through procedures such as reality monitoring (Johnson & Raye, 1982), the accuracy of evaluations might be increased. Since some evidence suggests that observational accuracy is positively related to rating accuracy (Murphy, Garcia, Kerkar, Martin, & Balzer, 1982), future research should examine ways to increase the accuracy of behavioral observations.

REFERENCES

Bernardin, H. J., & Pence, E. C. (1980). Effects of rater training: Creating new response sets and decreasing accuracy. Journal of Applied Psychology, 65, 60-66.
Borman, W. C. (1978). Exploring the upper limits of reliability and validity in job performance ratings. Journal of Applied Psychology, 63, 135-144.
Borman, W. C. (1979). Individual differences correlates of accuracy in evaluating others' performance effectiveness. Applied Psychological Measurement, 3, 103-115.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.
Cronbach, L. J. (1955). Processes affecting scores on "understanding of others" and "assumed similarity." Psychological Bulletin, 52, 177-193.
DeNisi, A. S., & Stevens, G. E. (1981). Profiles of performance, performance evaluations, and personnel decisions. Academy of Management Journal, 24, 592-602.
Favero, J. L., & Ilgen, D. R. (1985). The effects of ratee characteristics on rater performance appraisal behavior (Tech. Rep. No. 83-5). Office of Naval Research.
Feldman, J. (1981). Beyond attribution theory: Cognitive processes in performance appraisal. Journal of Applied Psychology, 66, 127-148.
Fiske, S. T., & Kinder, D. R. (1981). Involvement, expertise, and schema use: Evidence from political cognition. In N. Cantor & J. Kihlstrom (Eds.), Personality, cognition and social interaction. Hillsdale, NJ: Erlbaum.
Gordon, M. E. (1970). The effect of the correctness of the behavior observed on the accuracy of ratings. Organizational Behavior and Human Performance, 5, 366-377.
Graesser, A. C., Gordon, S. E., & Sawyer, J. D. (1979). Recognition memory for typical and atypical actions in scripted activities: A test of a script pointer + tag hypothesis. Journal of Verbal Learning and Verbal Behavior, 18, 319-332.


Graesser, A. C., Woll, S. B., Kowalski, D. J., & Smith, D. A. (1980). Memory for typical and atypical actions in scripted activities. Journal of Experimental Psychology: Human Learning and Memory, 6, 503-515.
Hamner, W. C., Kim, J. S., Baird, L., & Bigoness, W. J. (1974). Race and sex as determinants of ratings by potential employers in a simulated work sampling task. Journal of Applied Psychology, 59, 705-711.
Hastie, R. (1980). Memory for behavioral information that confirms or contradicts a personality impression. In R. Hastie, T. M. Ostrom, E. B. Ebbesen, R. S. Wyer, D. L. Hamilton, & D. E. Carlston (Eds.), Person memory: The cognitive basis of social perception. Hillsdale, NJ: Erlbaum.
Johnson, M. K., & Raye, C. L. (1982). Reality monitoring. Psychological Review, 88, 67-85.
Latham, G. P., & Wexley, K. N. (1977). Behavioral observation scales for performance appraisal purposes. Personnel Psychology, 30, 255-268.
Lord, R. G. (1985). Accuracy in behavioral measurement: An alternative definition based on raters' cognitive schema and signal detection theory. Journal of Applied Psychology, 70, 66-71.
Lord, R. G., Binning, J. F., Rush, M. C., & Thomas, J. C. (1978). The effect of performance cues and leader behavior on questionnaire ratings of leadership behavior. Organizational Behavior and Human Performance, 21, 27-39.
Lord, R. G., Foti, R. J., & Phillips, J. S. (1982). A theory of leadership categorization. In J. G. Hunt, U. Sekaran, & C. Schriesheim (Eds.), Leadership: Beyond establishment views (pp. 104-121). Carbondale, IL: Southern Illinois Univ. Press.
Murphy, K. R., Garcia, M., Kerkar, S., Martin, C., & Balzer, W. K. (1982). Relationship between observational accuracy and accuracy in evaluating performance. Journal of Applied Psychology, 67, 320-325.

Murphy, K. R., Martin, C., & Garcia, M. (1982). Do behavioral observation scales measure observation? Journal of Applied Psychology, 67, 562-567.
Nathan, B. R., & Lord, R. G. (1983). Cognitive categorization and dimensional schemata: A process approach to the study of halo in performance ratings. Journal of Applied Psychology, 68, 102-114.
Phillips, J. S. (1984). The accuracy of leadership ratings: A cognitive categorization perspective. Organizational Behavior and Human Performance, 33, 125-138.
Phillips, J. S., & Lord, R. G. (1982). Schematic information processing and perceptions of leadership in problem-solving groups. Journal of Applied Psychology, 67, 486-492.
Pyszczynski, T. A., & Greenberg, J. (1981). The role of disconfirmed expectations in the instigation of attributional processing. Journal of Personality and Social Psychology, 40, 32-38.
Rosch, E. (1978). Principles of categorization. In E. Rosch & B. Lloyd (Eds.), Cognition and categorization (pp. 28-49). Hillsdale, NJ: Erlbaum.
Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382-439.
Rush, M. C., Phillips, J. S., & Lord, R. G. (1981). Effects of a temporal delay in rating on leader behavior descriptions: A laboratory investigation. Journal of Applied Psychology, 66, 442-450.
Scott, W. E., & Hamner, W. C. (1975). The influence of variations in performance profiles on the performance evaluation process: An examination of the validity of the criterion. Organizational Behavior and Human Performance, 14, 360-370.
Smith, P., & Kendall, L. M. (1963). Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47, 149-155.


Swets, J. A., & Pickett, R. M. (1982). Evaluation of diagnostic systems. New York: Academic Press.
Weick, K. E. (1979). The social psychology of organizing. Reading, MA: Addison-Wesley.
Woll, S. B., & Graesser, A. C. (1982). Memory discrimination for information typical or atypical of person schemata. Social Cognition, 1, 287-310.
Wong, P. T. P., & Weiner, B. (1981). When people ask "why" questions, and the heuristics of attributional search. Journal of Personality and Social Psychology, 40, 650-663.

RECEIVED: April 17, 1986