ORGANIZATIONAL BEHAVIOR AND HUMAN DECISION PROCESSES 51, 22-50 (1992)

Accuracy in Performance Evaluations

MARY D. ZALESNY* AND SCOTT HIGHHOUSE†

*Kent State University, and †University of Missouri-St. Louis
Rating accuracy, measured as agreement with expert ratings and as ability to predict behavior, was investigated in a sample of student teachers and school administrators evaluating the classroom performance of a focal teacher. Predictive accuracy was significantly related to accuracy as agreement with experts on the performance dimension most closely associated with the behavior predicted. Also, the two accuracy measures were more strongly related when the ratee's behavior was predicted first, followed by performance evaluation, than when the ratee's performance was evaluated first, followed by behavioral prediction. An examination of various rater characteristics revealed that raters' perceived similarity to the ratee in attitudes about teaching was related to both accuracy measures. Implications for performance accuracy research and rater accuracy training are discussed. © 1992 Academic Press, Inc.

The authors thank James Breaugh and Michael Harris for comments on earlier drafts, and Joseph Colihan, Kevin Hupp, MaryEllen Kinnaly, Liz Lane, and Matt Paese for help in data collection. Address all correspondence and reprint requests to Mary D. Zalesny, College of Business Administration, Kent State University, Kent, OH 44242.
Accuracy is an important issue in human social judgment. Across social contexts and events, frequently it is important to know when people are not making good judgments or decisions (Evans, 1984). Much of the research on performance judgments has highlighted the process involved in performance appraisal, that is, the cognitive operations of appraisers in observing, encoding, storing, and subsequently retrieving and evaluating performance information (see, for example, DeNisi, Cafferty, & Meglino, 1984; Feldman, 1981; Ilgen & Feldman, 1983; Wexley & Klimoski, 1984). However, although the integrity of the appraisal process partly determines the accuracy of the appraisal, ultimately the correctness of rater judgments about the performance of others rests on the criterion against which rater accuracy is measured.

Several recent papers have questioned the conceptual and operational definitions of accuracy and inaccuracy used in research on human judgment (Funder, 1987; Kenny & Albright, 1987; Lord, 1985; Sulsky & Balzer, 1988). For example, Funder (1987) questions the applicability to real-life situations of much laboratory research on human judgment by noting that what may constitute an error of judgment in the laboratory
may not constitute a mistake in real life. This is because in the laboratory, as in real life, people are likely to "go beyond the information given" (p. 82) and to rely on their broader knowledge and past experiences in similar situations when making judgments.

Funder (1987) advocates two approaches to accuracy that have research and practical significance for the study of accuracy in performance appraisals. One approach is to incorporate into laboratory research representations of real life by faithfully reconstructing all of the important components and sources of information that actually are found in a particular real-life situation. The second is to examine whether real-life judgments of raters are, in fact, right or wrong when compared with external criteria. For the latter, Funder suggests two external criteria: agreement with others and the ability to predict behavior. The first criterion is recognizable as the consensus or expert agreement frequently used in appraisal research. The second criterion, the ability to predict behavior, has not been used as a measure of rater accuracy in performance appraisals, although it has appeared in the form of inferences that raters are asked to make when they use a behaviorally anchored rating scale (Nathan & Alexander, 1985). However, whereas behaviorally anchored rating scales (BARS) require a determination of the level of performance that could be expected of each ratee, the correctness of this judgment is not evaluated against future behavior. Accuracy as ability to predict behavior could be measured with BARS ratings if it could be determined whether each ratee later behaved as expected on each performance dimension.

Lord (1985) suggests that differences in how information is processed reflect different types of accuracy. For example, raters who faithfully encode and recall specific observed behaviors (e.g., have a good memory for detail) should have high behavioral accuracy. Classification accuracy is evident when raters simplify and integrate behavioral observations into appropriate categories or overall impressions of an individual. For example, a rater might observe, classify, and evaluate patterns of behaviors as indicative of good interpersonal relationships with superiors. Nathan and Alexander (1985) have referred to this ability to infer behavior patterns or traits from limited observations as inferential accuracy. Finally, raters can exhibit accuracy (or inaccuracy) in how they apply decision criteria within and across ratees (e.g., consistent application of performance standards).

The present study addresses issues raised by Funder (1987) and Lord (1985) by examining two different measures of accuracy in performance appraisal, i.e., accuracy as agreement with expert ratings and as the ability to predict behavior. Funder's (1987) arguments regarding errors and mistakes suggest that if people typically go beyond the information available to them and use their broader knowledge and experiences when
making judgments, then they may use the same or a highly similar knowledge/experience base from which to make judgments about a particular person or event. Raters who make correct judgments, then, should be "accurate" regardless of whether accuracy in a judgment task is determined by agreement with experts or by verifying that a predicted behavior actually occurred. Following Funder, we would expect the two measures of accuracy to be related. Sulsky and Balzer's (1988) comparison of different conceptual and psychometric definitions of differential accuracy also provides empirical support for a positive relationship between the two accuracy measures.

Other researchers have argued persuasively that differences in the nature of the information encoded into memory (and subsequently available for later use) may affect the correctness of judgments when people observe the behavior of others (Foti & Lord, 1987; Lord, 1985; Murphy, Balzer, Lockhart, & Eisenman, 1985). If raters process different information and engage in different processing depending upon the requirements of the evaluation situation (e.g., recall of behavior versus impression formation; see, for example, Balzer, 1986; Lord, 1985; Pulakos, 1986; for a more general discussion see Higgins & Bargh, 1987; Tulving & Thomson, 1973), then the two measures may not necessarily be related. In their encoding specificity principle, Tulving and Thomson (1973) proposed that what information a person encoded and stored would determine what retrieval cues would provide access to the stored information. Retrieval of stored information is facilitated when there is similarity in the features of the context or environment existing when both encoding and retrieval occur. In Lord's (1985) accuracy nomenclature, the information attended to, processed, and recalled that would lead to behavioral accuracy may not facilitate classification accuracy. The latter may require that different information be encoded or be organized differently in memory.

We hypothesized that raters who matched experts in their evaluations of another's performance also would be correct in predicting the ratee's behavior in a specific situation. Because characteristics of the rater (e.g., experience in conducting evaluations or perceived similarity to the ratee) may facilitate rating accuracy on one or both methods of determining accuracy, the covariation between each accuracy measure and rater characteristics likely to be related to performance appraisal accuracy also was assessed. The central issue was not which method of assessing accuracy is best; rather, it was the extent of the relationship between the two measures and the extent to which the accuracy measures are similarly related to rater characteristics.

Following Funder's recommendation of maximizing realism, participants and a situation were selected that would readily generalize to real evaluation tasks while maintaining experimental control. The situation
chosen was the evaluation of a teacher's classroom performance. The participants who served as appraisers were student teachers who had completed at least one student teaching assignment and school administrators who regularly evaluated teacher performance. As a part of their student teaching, each student's classroom performance had been observed and evaluated several times by their teaching supervisor and by other teachers and student teachers. Also, as a part of their training, they observed and frequently evaluated other student and full-time teachers.

Typically, multiple ratees are presented to raters who often receive only minimal exposure (generally 5-10 min each) to the ratees before making their evaluations (e.g., Borman, 1977; Cardy & Kehoe, 1984; Pulakos, 1984; Sulsky & Balzer, 1988). In the present study a single stimulus person was presented to raters for a longer period of time. This provided the raters with a greater sampling of the ratee's behavior on which to base their judgments and prediction (see also Balzer, 1986). The use of a single target ratee also parallels several real-life evaluation situations, including subordinates rating their supervisor; peers rating a colleague; or a supervisor making a promotion, transfer, or termination decision on a single employee based on that individual's recent performance record.

EXPERIMENT 1

Method

Overview. Ninety-six student teachers were recruited from the teacher education programs of three universities located in a large midwestern city. Prior to their participation, they completed a background survey on their previous and current teaching experience and their experience with classroom teaching evaluations. In groups of 6 to 80, the student teachers were shown a videotape of an elementary school teacher conducting a mathematics lesson in a sixth-grade class.¹ After viewing the tape, the student teachers individually evaluated the teacher's classroom performance and responded to items about the evaluation. A slightly modified version of the performance evaluation form used by the state and local school districts was used for the evaluations. The entire procedure was completed in about 2 h. No course credit or payment was given for participation.

¹ Group sizes were 6, 10, and 80. Scheduling constraints prevented us from dividing the largest group into smaller ones for viewing the videotape. However, three monitors were positioned in the room to ensure that participants in this group could readily observe the focal teacher's behaviors on the stimulus tape.
Development of the performance videotape. A recent graduate of the teacher education program at one of the participating universities was contacted and agreed to have her classroom performance videotaped.² The class consisted of eight sixth-graders, seven of whom were males. The teacher presented a 30-min mathematics lesson that introduced a new topic to the class. Two videotapes were made on two separate occasions. The first taping provided an extended observation of the students and their interactions with the teacher and also minimized any "taping" effects on the students' or teacher's behaviors when the second videotape was made.

Just prior to the second taping (i.e., the one presented to the raters), the investigator met privately with the students and gave several students written instructions on specific behaviors to emit at specified times throughout the lesson. The behavioral instructions were based on observations made in the first tape of the students' behaviors in class. Examples of the instructions are: (Student 1) Ask to sharpen your pencil at 3:07; ask to leave the room for a drink of water at 3:18; ask to use the restroom at 3:28. (Student 2) Talk with your neighbor at the following times: 3:05, 3:09, 3:16, 3:20, 3:24. (Student 3) Look out the window every few minutes. Students were told to emit the behaviors as close as possible to the indicated times without watching the clock too frequently. Although all students received written instructions, some of the instructions asked the student to "Act like you usually do in class." The instructions to the students served as an experimental manipulation to ensure that the teacher would be confronted with situations requiring behaviors other than those directly associated with presentation of the lesson (e.g., disciplining students, answering questions unrelated to the lesson). Once the students indicated that they understood their instructions, the second taping was begun.

The teacher had been instructed to use her usual style of teaching rather than try to portray a particular level of classroom effectiveness. The intent of the study was not to present several examples of varied teacher performance, but rather to present a realistic example of this teacher's typical performance.

Performance dimensions and evaluation. Teacher evaluation forms and a description of each evaluated dimension were obtained from the local and state school boards and two of the universities participating in the study. Because the student teachers were all familiar with the local and state evaluation forms and because they were being (and would continue to be) evaluated on these dimensions, the local and state performance dimensions were used as the criteria in the present study.³

² With the exception of one individual, the participants were not acquainted with the focal teacher in the videotape, nor did they know who she was.

³ Teacher performance dimensions were developed from job analysis information collected from elementary and secondary school teachers, administrators and supervisors (the individuals responsible for conducting teacher evaluations in their schools), and school board members.
Additional items were added to each dimension to reflect aspects of classroom performance that were emphasized in the universities' evaluation forms. Teacher performance was assessed on four dimensions: (1) Instructional methods, materials, and techniques (Instructional Methods), (2) Classroom management (Classroom Management), (3) Interpersonal relationships with students (Interpersonal Relationships), and (4) Personal and professional qualities (Professional Qualities). Each dimension contained several items on which the focal teacher was to be evaluated for her level of effectiveness. Effectiveness was measured on a 5-point scale which ranged from 1 = very poor (indicates substandard teacher behavior) to 5 = very good (indicates clearly superior teacher behavior).

Instructional Methods contained 17 items that described critical and basic requirements for effective classroom instruction (α = .94). Examples of items in this dimension are: "Teacher displays knowledge of subject matter." "Lesson is designed in a clear, logical, and sequential format." "Expectations for students are made clear." Classroom Management contained 9 items related to the learning environment created and maintained by the teacher (α = .89). Example items are: "Teacher is consistent and fair in dealing with classroom behavior." "Teacher reinforces appropriate behavior." Interpersonal Relationships consisted of 7 items (α = .93) including: "Does not interrupt students when they are speaking (e.g., answering a question)." "Gives constructive criticism and praise when appropriate." Professional Qualities contained 5 items (α = .79) that reflected personal grooming and professional behavior. Examples are: "Is neat and clean in dress." "Teacher uses appropriate vocabulary."

Student teachers provided overall effectiveness ratings for each dimension and overall performance. They also rated on a 5-point scale the importance of each dimension to effective teacher performance (1 = not important to effective teacher performance; 5 = crucial to effective teacher performance).
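The internal-consistency (α) values reported above are Cronbach's alpha. A minimal Python sketch of the computation, assuming item responses are held in a pandas DataFrame with one column per scale item (the data layout and names are illustrative, not taken from the study):

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of items scored in the same direction.

    items: one row per rater, one column per scale item (e.g., the 17
    Instructional Methods items rated on the 5-point effectiveness scale).
    """
    k = items.shape[1]                          # number of items
    item_var = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the scale total
    return (k / (k - 1)) * (1 - item_var / total_var)

# Illustrative usage (hypothetical data, 83 raters x 17 items):
# ratings = pd.read_csv("instructional_methods_items.csv")
# print(cronbach_alpha(ratings))
```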
Personal characteristics and similarity measures. Research has indicated that rater characteristics and rater-ratee demographic and attitude similarity are related to performance ratings (Landy & Farr, 1980; Pulakos & Wexley, 1983; Zalesny & Kirsch, 1989) and may be related to judgment accuracy (Funder, 1987; see also Dreher, Ash, & Hancock, 1988). For example, the impact of biographic and perceived attitude similarity and rater experience on hiring decisions and performance evaluations has been demonstrated consistently in laboratory and field investigations (Frank & Hackman, 1975; Huber, Neale, & Northcraft, 1987; Latham, Wexley, & Pursell, 1975; Pulakos & Wexley, 1983; Rand & Wexley, 1975; Schmitt, 1976; Wexley, Alexander, Greenawalt, & Couch, 1980).

Background information collected before the presentation of the videotape was obtained from student teachers on age, sex, teaching experience (in months), grade levels taught, number of times own performance had been evaluated, and number of observations made of other teachers for the purpose of evaluating their performance. The latter two measures were used as indicants of rating experience.

After the focal teacher's performance had been evaluated, student teachers responded to items that assessed their perceived similarity to the focal teacher in attitudes about teaching. The scale used to measure this construct has been used to assess attitude similarity in both long-term relationships (e.g., between supervisors and their subordinates) and relatively brief interactions (e.g., between job recruiters and applicants) (see Graves & Powell, 1988; Pulakos & Wexley, 1983; Wexley et al., 1980). For the present study, the scale was modified to reflect teacher behaviors. On five of the six scale items, student teachers were asked to indicate on a 7-point scale (1 = extremely similar; 7 = not at all similar; reverse scored for analysis) how similar they were to the focal teacher in: "attitudes toward teaching," "approaches for dealing with problems," "beliefs about how students should be treated in the classroom," "attitudes toward work," and "beliefs about how people should be treated at work." On a separate item, they were asked for their agreement with the item, "Overall, the teacher and I are similar kinds of people" (1 = strongly agree; 7 = strongly disagree; reverse scored for analysis). The six items (α = .94) were summed and averaged for analysis.

Accuracy measures. Accuracy as agreement with experts was operationalized as the distance (specifically, the absolute value of the difference) between the averaged expert ratings for the teacher's performance and the rater's rating of the teacher's performance. This operationalization corresponds to the distance accuracy measure (using expert ratings) described by Sulsky and Balzer (1988) as being most consistent with conceptual and psychometric definitions of accuracy.⁴ Separate distance accuracy measures were computed for overall performance and for each dimension rating.

⁴ The data also were analyzed with accuracy as agreement with experts operationalized as the squared deviation between rater and expert ratings. Both methods yielded equivalent findings. Correlational accuracy also has been suggested as an important measure of the direction or pattern of rater inaccuracy across ratees. However, because there was only one ratee, this measure could not be computed.
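A minimal sketch of the distance measure just described, together with the squared-deviation variant mentioned in footnote 4; the array values are made up for illustration:

```python
import numpy as np

def distance_accuracy(rater_ratings: np.ndarray, expert_mean: float) -> np.ndarray:
    """Absolute deviation from the averaged expert rating; lower scores
    mean closer agreement with the experts (the measure used in the analyses)."""
    return np.abs(rater_ratings - expert_mean)

def squared_deviation_accuracy(rater_ratings: np.ndarray, expert_mean: float) -> np.ndarray:
    """Squared-deviation variant (footnote 4), reported to yield equivalent findings."""
    return (rater_ratings - expert_mean) ** 2

# Illustrative usage for one dimension (hypothetical rater values):
raters = np.array([3.0, 4.0, 2.5, 3.5])
print(distance_accuracy(raters, expert_mean=3.63))  # e.g., Classroom Management
```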
Accuracy as ability to predict behavior was assessed by having the student teachers predict the teacher's behavior toward one of her students. This student had been instructed to talk to his neighbor five times during the lesson. Each time the student talked to his neighbor, the teacher in the videotape disciplined the student. The final talking incident came very near the end of the tape and included the teacher's final discipline response. However, just after the final talking episode occurred, the videotape was stopped so that the teacher's final response was not seen. At this point, the student teachers had seen all but the last few minutes of the tape. They also had seen several instances of the teacher responding to the behaviors the students had been instructed to emit and had seen four talking-disciplining episodes with one particular student. Table 1 presents the teacher's responses to the student for each of the five talking episodes and responses considered acceptable by teacher educators.

TABLE 1
TEACHER RESPONSES TO TALKING EPISODES

Instance  Teacher response
First     Indicate awareness of student's behavior (e.g., "John, up here please.")
Second    Indicate awareness of student's behavior (e.g., "John, look up here please.")
Third     Assess student's knowledge of classroom rules (e.g., "John, what is rule number three?" [John] "Work quietly." "Are you working quietly?" [John] "No." "No? That's your warning.")
Fourth    Indicate consequences for continued behavior (e.g., ... "Do you know what happens next?" [John] "Yes." "Check mark will be up in a minute.")
Fifth     Initiate consequences; remove student from situation to counsel/discipline (e.g., puts name on board; puts check mark next to name (rules specify that three check marks require after-school work); calls student outside to hallway after providing in-class work for others; counsels/disciplines student privately)

Note. The progressive discipline taught at the participating institutions provides for teachers' establishing classroom policy and communicating the policy to the students. Discipline is initiated with calling attention to the student's behavior, indicating the consequences for the continued behavior, and finally, following through with the consequences stipulated in the classroom policy. The focal teacher followed this sequence in her responses.
When the student teachers received their rating forms, the first item asked them to recall the talking episodes and the teacher's responses to them. Student teachers then indicated how they believed the teacher responded to the last talking incident and their confidence in their prediction. They also indicated what they believed the teacher should have done after the last talking incident. This item served to measure the correctness of the student teachers' perceptions of appropriate behavior from any teacher in the situation depicted in the tape. Accuracy in prediction of the focal teacher's response was measured by checking the predicted teacher response against the actual teacher response to the last talking incident. As shown in Table 1, the focal teacher's response to the talking incident involved multiple behaviors (e.g., verbal reprimand, state consequences for continuation of behavior, talk to student in private). The authors coded all of the student teacher responses, reaching 92% agreement; all disagreements were discussed and resolved. A correct prediction was given to each student teacher who predicted at least one of the four responses that the teacher actually made (i.e., the teacher carried through with any of the consequences).
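A hedged sketch of the two scoring steps just described: crediting a prediction that matches any of the teacher's actual responses, and computing the authors' percent agreement (the paper reports 92%). The behavior codes are hypothetical labels, not the study's actual coding scheme:

```python
# Hypothetical codes for the four responses the teacher actually made.
ACTUAL_RESPONSES = {"verbal_reprimand", "state_consequences",
                    "check_mark", "private_counseling"}

def correct_prediction(predicted: set[str]) -> int:
    """1 if the rater predicted at least one response the teacher made, else 0."""
    return int(bool(predicted & ACTUAL_RESPONSES))

def percent_agreement(coder_a: list[str], coder_b: list[str]) -> float:
    """Simple percent agreement between two coders."""
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

# Illustrative usage:
print(correct_prediction({"check_mark"}))       # 1 (credited as correct)
print(correct_prediction({"ignore_behavior"}))  # 0
```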
Expert raters. Eight experienced teachers served as the experts in the study. Two of the experts directed the teacher education programs at their respective universities and had extensive experience training elementary and secondary school teachers. The remaining six experts were faculty in the teacher education program at one of the universities who also had considerable teaching experience and experience in training elementary and secondary school teachers. The experts viewed the videotape and responded to the same evaluation items as the student teachers, including predicting the focal teacher's response to the last talking incident in the tape. Their ratings for teacher effectiveness and importance on each of the performance dimensions and overall were averaged for the data analysis and are shown in Table 2.

TABLE 2
EXPERTS' RATINGS OF TEACHER EFFECTIVENESS AND DIMENSION IMPORTANCE

                              Teacher effectiveness    Dimension importance
Performance dimension             M        SD              M        SD
Instructional methods            3.25     .71             4.88     .35
Classroom management             3.63     .74             4.88     .35
Interpersonal relationships      3.25     .89             5.00     .00
Professional qualities           3.75     .71             4.29     .49
Overall performance              3.75     .71              --       --

Note. All ratings are based on a 5-point response scale.
Generally, the effectiveness ratings given by the experts were consistent across raters and were slightly above the midpoint of the 5-point scale. The dimension importance ratings also were consistent across raters, but were higher. All experts correctly predicted the teacher's response to the last talking incident.

Results

Sample description. Of 96 student teachers participating in the research, 83 (86%) provided complete and usable responses. Demographic and other descriptive information on the sample are shown in Table 3. The average age was 26 years and ranged from 21 to 41 years; the majority of the sample was female (94%), with experience in teaching primarily (70%) at the elementary level (i.e., kindergarten through sixth grade), with the remaining representing teaching experience at either the secondary level (i.e., grades 7 through 12) or both levels.⁵ Teaching experience in the sample averaged 15 months and varied from slightly more than 1 month to more than 20 years (when substitute and part-time teaching was included). All of the student teachers' classroom performances had been evaluated within the previous 6 months, and on the average the evaluations had occurred within the last month. On average, the student teachers had observed and informally evaluated other teachers' classroom performances 29 times. Overall, the sample was very similar to the focal teacher in age, sex, and teaching experience.

Table 3 also shows the correlations for the accuracy and rater characteristics measures. As expected, the ability to predict teacher disciplining behavior was related to agreement with experts, but only for ratings of performance in Classroom Management. Student teachers who correctly predicted the teacher's response to the last talking incident were in closer agreement with expert ratings of the teacher's classroom management performance than were student teachers who did not correctly predict her behavior. Accuracy in predicting teacher behavior also was correlated with knowledge of the appropriate action the teacher should have taken, with attitude similarity, and with one measure of rating experience (number of evaluation observations made). Student teachers who were correct in their predictions believed that the teacher's expected response was the correct response and saw the teacher as holding attitudes about teaching similar to their own. They also had made relatively few previous evaluation observations of other teachers as a part of their student teaching experiences.

⁵ The selection of a sixth-grade class for the videotape was based on the opinions of teachers and school administrators who indicated that the type of students and situations encountered at this grade level would be equally relevant to both elementary and secondary school teachers.
TABLE 3
EXPERIMENT 1 MEANS, STANDARD DEVIATIONS, AND CORRELATIONS

[The printed matrix reported means, standard deviations, and intercorrelations for 19 variables: 1. Age; 2. Grade taught; 3. Experience; 4. Times evaluated; 5. Observations made; 6. Attitude similarity; 7. Instructional methods; 8. Instructional methods accuracy; 9. Classroom management; 10. Classroom management accuracy; 11. Interpersonal relationships; 12. Interpersonal relationships accuracy; 13. Professional qualities; 14. Professional qualities accuracy; 15. Overall rating; 16. Overall accuracy; 17. Prediction; 18. Confidence; 19. Appropriate action. The individual values are not recoverable from the scanned original.]

Note. n = 83. Significant correlations are underlined (r ≥ .19, p ≤ .05; r ≥ .23, p ≤ .01). For grades taught, 1 = elementary, 2 = secondary, 3 = both. Experience is given in months teaching. Median experience = 4 months. Prediction coded 0 = incorrect, 1 = correct; appropriate action = knowledge of correct teacher action (0, 1). With the exception of accuracy scores, higher means represent a more positive response.
Agreement with experts' overall performance rating was positively correlated with confidence in prediction of teacher behavior in the disciplining situation (but not with accuracy in prediction), age, attitude similarity, accuracy as expert agreement on Classroom Management and Professional Qualities, and the dimension and overall performance effectiveness ratings (negative correlations indicate positive relationships). Agreement with experts concerning the teacher's effectiveness on Instructional Methods was negatively related to grade level taught and number of evaluation observations made, and positively related to confidence in prediction of teacher disciplining behavior, agreement with experts on the teacher's Interpersonal Relationships with students, and perceived similarity in attitudes about teaching. Interpersonal Relationship accuracy also was positively related to perceived teaching attitude similarity with the focal teacher.

Other significant correlations revealed that older student teachers gave lower performance ratings than their younger colleagues. Also, teachers with experience teaching at both the secondary and the elementary level were less likely than teachers with elementary level teaching experience to have indicated the appropriate response that the teacher should have taken in response to the last talking incident. Teaching experience and number of times one's own performance was evaluated were not related to either of the accuracy measures.

Comparisons of the accuracy measures. Only 29% of the student teachers accurately predicted what the focal teacher's response would be to the last talking incident. However, more than half (57%) of the student teachers correctly identified the appropriate action they thought the teacher should take, indicating that many student teachers did not believe that the focal teacher would take the appropriate action. Twenty percent of the students matched the expert ratings of the teacher's Classroom Management performance. When both correct prediction and agreement with expert ratings were jointly considered, only 11% of the student teachers were accurate in both their prediction and their Classroom Management performance ratings, while 13% were accurate in prediction and overall performance ratings.⁶ Agreement with expert ratings was considerably better for the overall performance rating, with 37% of the student teachers matching the experts' overall rating.

⁶ Similar percentages were found for accuracy in predicting teacher disciplining behavior and agreement with experts on the other performance rating dimensions (Instructional Methods = 13%, Interpersonal Relationships = 10%, Professional Qualities = 10%).

Examination of the correlations for the two accuracy measures (prediction and expert agreement) shows that, with the exception of agreement with experts on Professional Qualities, the dimension, overall, and
prediction accuracy measures are correlated significantly with only one other variable, i.e., perceived similarity of attitudes about teaching with the focal teacher. Moreover, perceived teaching attitude similarity is positively related to accuracy for only Classroom Management and overall accuracy and negatively related to accuracy for the remaining three performance dimensions.

To more closely examine this relationship between teaching attitude similarity and accuracy, accuracy as agreement with experts on the performance dimensions, as agreement with experts on overall performance, and as prediction of teacher behavior were regressed in three separate equations on rater characteristics: teaching attitude similarity, teaching experience, grade level taught, and two measures of rating experience (i.e., number of times own performance had been evaluated and number of evaluations made of other teachers). To control for the moderate correlations among the four performance dimensions, a multivariate multiple regression was performed regressing the set of performance dimensions on rater characteristics. A significant multivariate F (F(24,186) = 3.31, p ≤ .01) revealed that rater characteristics contributed significantly in explaining variance only for Instructional Methods (F(6,51) = 3.77, p ≤ .01) and Classroom Management (F(6,51) = 3.01, p ≤ .01) accuracy. Univariate analyses showed perceived teaching attitude similarity as the sole rater characteristic significantly related to Instructional Methods accuracy (overall F(6,76) = 3.07, p ≤ .01; β = .30, p ≤ .01) and perceived teaching attitude similarity and teaching experience as significantly related to Classroom Management accuracy (overall F(6,76) = 4.64, p ≤ .01; β = -.48, p ≤ .01, and β = .22, p ≤ .05, respectively).

Similar results were found when agreement with experts on overall performance and prediction of teacher behavior were separately regressed on rater characteristics. Perceived teaching attitude similarity accounted for a significant amount of the variance explained in expert agreement accuracy (overall F(6,76) = 3.27, p ≤ .01; β = -.46, p ≤ .01). Both number of evaluation observations made (β = -.20, p ≤ .10) and perceived teaching attitude similarity (β = .19, p ≤ .10) contributed to explaining predictive accuracy (F(6,76) = 1.91, p ≤ .10). Reflective of the correlations in Table 3, the regression weights for the rater characteristics reveal that, with the exception of Instructional Methods, greater perceived similarity to the focal teacher in attitudes about teaching was associated with greater accuracy. Greater teaching experience and less evaluative experience also were associated with rater accuracy on Classroom Management and Overall Performance, respectively.
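A hedged sketch of the analyses just reported: a multivariate test across the four dimension-accuracy scores followed by a univariate regression for one outcome. It uses statsmodels; the data file and column names are illustrative assumptions, not the study's variables:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.multivariate.manova import MANOVA

df = pd.read_csv("raters.csv")  # hypothetical file: one row per rater

predictors = ("att_sim + teach_exp + grade_taught "
              "+ times_evaluated + observations_made")

# Multivariate test across the four dimension-accuracy scores, which
# controls for their intercorrelations.
mv = MANOVA.from_formula(
    f"im_acc + cm_acc + ir_acc + pq_acc ~ {predictors}", data=df)
print(mv.mv_test())

# Univariate follow-up for one accuracy measure (standardize the
# variables beforehand if standardized betas are wanted).
fit = smf.ols(f"cm_acc ~ {predictors}", data=df).fit()
print(fit.summary())
```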
The addition of an interaction term (in post hoc analyses) to assess the joint effects of teaching experience and teaching attitude similarity resulted in a significant interaction only for accuracy in evaluating Professional Qualities. Examination of the interaction revealed that while rater accuracy on this performance dimension increased with teaching experience, the increase was consistently larger with greater attitude similarity.

Discussion
In Experiment 1, accuracy as agreement with expert ratings and accuracy as ability to predict behavior were related, but only for expert agreement on the same performance dimension as the behavioral prediction (i.e., classroom management). The two accuracy measures also were significantly related to perceived similarity to the teacher on attitudes about teaching. In both cases, raters who saw themselves as having teaching attitudes similar to the teacher's were more accurate than their counterparts in their evaluations of her performance. The small, but significant, relationships between the two accuracy measures and between each accuracy measure and perceived teaching attitude similarity with the ratee suggest that the influence of rater characteristics on rating accuracy and the interaction of rater characteristics with accuracy measures (e.g., as suggested by Lord, 1985) remain important issues in performance appraisal research. Whereas previous research has demonstrated the impact of perceived similarity between raters and ratees on the level of performance ratings given (Pulakos & Wexley, 1983; Wexley et al., 1980), these findings point to the potential importance of perceived similarity between raters and ratees for rating accuracy.

One explanation for the impact of perceived similarity on performance rating accuracy is the tendency for people to develop and use prototypes as organizing structures for perceiving and evaluating the behavior of others and for generally simplifying the processing of information about others (Cantor & Mischel, 1979; Dipboye, 1985; Feldman, 1981; Lord, Foti, & Phillips, 1982; Markus & Smith, 1981). The prototype may represent an ideal example of a particular type of individual or of a category of behaviors and may serve as a standard against which behavior is evaluated. People are likely to rely on the prototype when judging a specific ratee's behavior, especially when the behavior is important to an individual (Markus & Wurf, 1987). Use of a prototype would be an example of what Funder (1987) referred to as raters going beyond the information given when faced with any judgment task. From the number of raters who correctly stated the appropriate action the focal teacher should have taken in the talking incident on which the prediction was based, it appears that a well-articulated prototype of an effective teacher probably was available for evaluating the focal teacher's performance for the majority of student teachers in this sample.
In addition to their having a prototype to facilitate perceiving, organizing, and evaluating the behaviors of other teachers, the student teachers also were likely to have well-elaborated self-schemata for themselves as teachers. That is, effective teaching should be an important dimension on which student (and other) teachers perceive and evaluate themselves (Markus & Wurf, 1987). To the extent that the student teachers' self-schemata for themselves as teachers were equivalent or similar to their effective teacher prototypes, student teachers also should have used their self-schemata to evaluate the performance of the videotaped teacher. The more similar the focal teacher was perceived to be to the student teacher, and the more similar the self-schema to an effective teacher prototype, the more accurate the student teachers should be in their assessment of the focal teacher's performance.

The likely use of self-schemata follows from research in social cognition showing that people have a tendency to use their self-schemata as a basis for judging others when they have considerable information about themselves relevant to the dimension being judged and have limited information about the persons being judged (Holyoak & Gordon, 1983; Markus, Smith, & Moreland, 1985; Srull & Gaelick, 1983). Certainly, the experimental methods used in this study and in most investigations of rating accuracy (i.e., minimal information about the ratee relative to information about the self) would appear to maximize the likelihood that self-schemata would be used by raters in their evaluation task.

A related explanation, also from social cognition research, suggests that people who have well-developed self-schemata which they use in evaluating others are likely to process the information available about the others more deeply (Kuiper & Rogers, 1979). This would suggest that raters who use their relevant "self" categories in judging a ratee and who perceive themselves as similar to a ratee may be more accurate because they attend more closely to the ratee's behaviors, or simply attend to the ratee's behaviors for a longer time, than raters who do not use their self-schemata and who do not see themselves as similar to the ratee.

However, an alternative explanation for the relationship found between both accuracy measures and perceived similarity concerns the order in which the accuracy measures were taken. Recall that prediction was not related to overall performance accuracy (measured against expert raters) but only to accuracy on one dimension (i.e., classroom management) and that student teachers first predicted the teacher's behavior in the disciplining situation and then rated her performance across all four performance dimensions. The positive relationship between accuracy on the classroom management performance dimension and accuracy in prediction may have occurred because the prediction request forced retrieval and processing of the teacher's performance on that dimension first.
The order of rating may have primed the processing of classroom management performance information more deeply than other performance information. As a result, classroom management performance information may have been more readily accessible or salient when judgments were requested on all performance dimensions (see, for example, Srull & Wyer, 1979; Zadny & Gerard, 1974). Because experts also would have been affected by order effects, changing the order of accuracy assessment may reveal a different relationship between accuracy measures.

EXPERIMENT 2
Two additional groups were run to assess the effect of order of rating on the two measures of rating accuracy and to examine the role of self-schemata in performance rating accuracy. One group consisted of student teachers; the other group consisted of school administrators who conducted yearly teacher performance reviews. It was hypothesized that raters who first predicted the teacher's behavior would be more accurate in their later evaluation of her Classroom Management performance than would raters who first evaluated the teacher on all performance dimensions and later predicted her disciplining behavior. The rationale was that the request for a behavioral prediction would lead to additional and different processing and organization of specific information about a particular aspect of the teacher's classroom performance, which in turn would lead to better retention of behaviors representative of the teacher's performance and to greater accuracy.

We also hypothesized that raters whose self-schemata closely resembled an effective teacher prototype would be more accurate in their ratings on both measures of accuracy and more confident of their ratings. This expectation was based on two assumptions. First, people tend to use self-schemata (when they exist) as a standard for evaluating others (Markus et al., 1985). The existence of a self-schema based on one's own experience should also increase rater confidence in the accuracy of their evaluations. Second, the more one's self-schema matches the prototype of an effective teacher, the more accurate the standard used in evaluating others will be, and the greater the likelihood of a more accurate evaluation.

Method

Overview. The Experiment 2 procedures were identical to those in Experiment 1, with the following exceptions. First, after viewing the videotape of the focal teacher's classroom performance, one-half of the individuals in each group (students and administrators) evaluated the teacher's performance and then predicted her behavior in the disciplining episode. The other half first predicted her behavior and then evaluated her performance (as in Experiment 1).
Second, 2 to 4 weeks before they participated in the study, the school administrators completed a self-description questionnaire composed of items that represented an effective teacher prototype and returned it by mail to the researchers.

Effective teacher prototype and self-schema measures. Procedures utilized by Foti (personal communication) and Rosch (1978) were used to develop the prototype descriptive items. First, examples of effective teacher classroom performance were obtained from student teachers and the experts (in Experiment 1) whose ratings were used to establish the focal teacher's level of performance. Students and teachers were asked to think of teachers they had known or heard about whose behaviors exceeded the performance requirements for each of the four performance dimensions on which teacher performance was evaluated. Elimination of redundant and ambiguously worded behaviors or traits resulted in 83 descriptive items. Next, the 83 descriptive items were sent to 100 elementary and secondary school teachers who were asked to judge each item as to how well it represented their idea of the best example of an effective teacher (regardless of grade level taught). Responses ranged from "fits image very well" (scored as "1") to "fits image very poorly" (scored as "7"). The means and standard deviations for each item were calculated on the 77 surveys returned. Items with a mean less than 4 (on the 7-point scale) and a standard deviation less than 1.0 were retained, yielding 37 items descriptive of an effective teacher (i.e., an effective teacher prototype); a sketch of this retention rule follows the sample description below. The 37 items were then incorporated into a survey sent to all of the school administrators. The survey instructed the administrators to rate the extent to which each item was descriptive of how they saw themselves as teachers. Responses were given on a 7-point scale that ranged from "very descriptive of me" (1) to "not at all descriptive of me" (7).

Sample description. Participants included 116 (of 120) student teachers enrolled in the teacher education program of a large midwestern university. Students in Experiment 2 were similar to their counterparts in Experiment 1 (see Table 4) in age, sex (87% female), and teaching and evaluation experience. The administrator sample included 75 of the 83 administrators (i.e., principals, assistant principals, department chairpersons, and consultants) of a large school district located in the same metropolitan area as the university. Not surprisingly, the school administrators were older and had more teaching experience than the student teachers and were overrepresented (as a group) by males (72%). All administrators regularly observed teacher classroom performance for evaluation purposes, and all were either directly or indirectly (i.e., provided input to others) involved in the formal review of school teachers (who numbered approximately 1100 in the district). The administrators performed or assisted in an average of 20 appraisals yearly.
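A minimal sketch of the prototype item-retention rule described above (keep items with mean < 4 and SD < 1.0 on the 7-point fit scale); the data file is a hypothetical stand-in for the returned surveys:

```python
import pandas as pd

# responses: 77 returned surveys x 83 candidate items, each scored
# 1 ("fits image very well") to 7 ("fits image very poorly").
responses = pd.read_csv("prototype_survey.csv")  # hypothetical file

# Retain items that fit the effective-teacher image well (low mean)
# and on which raters agreed (low standard deviation).
keep = (responses.mean() < 4) & (responses.std(ddof=1) < 1.0)
prototype_items = responses.columns[keep]
print(len(prototype_items))  # the study retained 37 items
```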
TABLE 4
MEANS AND STANDARD DEVIATIONS FOR EXPERIMENT 2

[The printed table reported means and standard deviations for 19 variables (1. Age; 2. Grade taught; 3. Experience; 4. Times evaluated; 5. Observations made; 6. Attitude similarity; 7. Instructional methods rating; 8. Instructional methods accuracy; 9. Classroom management rating; 10. Classroom management accuracy; 11. Interpersonal relationships rating; 12. Interpersonal relationships accuracy; 13. Professional qualities rating; 14. Professional qualities accuracy; 15. Overall rating; 16. Overall accuracy; 17. Correct prediction; 18. Confidence; 19. Appropriate action) in four column pairs: Students, Administrators, P,E, and E,P. The column alignment of the individual values is not reliably recoverable from the scanned original.]

Note. Students n = 116; Administrators n = 75. P,E = predict, evaluate; E,P = evaluate, predict. Experience is in months for students, in years for administrators. Correct prediction = 0, 1; appropriate action = knowledge of correct teacher action = 0, 1.
Compared with the students in Experiment 1, the majority of students and administrators in Experiment 2 correctly predicted the focal teacher's final response to the talkative student (58% combined, 67% students, 47% administrators) and correctly identified the appropriate response to the situation (63% combined, 72% students, 53% administrators). They also were better in matching expert ratings of the teacher's Classroom Management (39% combined, 38% students, 52% administrators), but equally poor in matching both expert ratings of Classroom Management and correctly predicting the teacher's disciplining behavior (11% combined, 10% students, 13% administrators) and in matching expert ratings on overall performance (31% combined, 29% students, 32% administrators). Somewhat surprisingly, a significantly greater number of student teachers than school administrators in Experiment 2 correctly predicted the teacher's behavior and correctly identified the appropriate disciplining response. However, a significantly larger number of administrators than students matched expert ratings of the teacher's Classroom Management performance. These differences may indicate the greater salience of the appropriate sequence of teacher disciplining behaviors among the students, who were personally involved in acquiring these competencies.

Results

Order of rating.
Table 5 shows the correlations among the accuracy measures and teaching attitude similarity for the combined sample of
students and administrators, classified by the order of rating they performed (i.e., prediction, then evaluation (P,E) versus evaluation, then prediction (E,P)). First, comparing the correlations for the Experiment 2 combined student and administrator group who first predicted, then evaluated, the focal teacher (P,E) with the corresponding Experiment 1 correlations (shown in Table 3) reveals considerable similarity in both magnitude and direction for the correlations between accuracy in evaluating Classroom Management and correct prediction, Overall Performance accuracy, and teaching attitude similarity. In each case, closer agreement with expert ratings of Classroom Management was significantly related to closer agreement with experts on the overall performance rating and greater perceived similarity to the focal teacher in teaching attitudes, and marginally related (p = .07) to correct prediction of the teacher's disciplining behavior. Unlike Experiment 1, attitude similarity was significantly correlated with overall performance accuracy (as expert agreement) but not with accuracy as correct prediction. Similar to the Experiment 1 findings, the two accuracy measures did not jointly covary with any other variable examined in this study.

Second, comparisons of accuracy between the two rating order groups in Experiment 2 (i.e., P,E versus E,P) provide some support for the hypothesized relationship of accuracy on Classroom Management ratings with both Overall Performance rating accuracy and correct prediction for the P,E group that is not evident for the E,P group. That is, Classroom Management rating accuracy was more strongly related to overall performance rating accuracy (.60 versus .36) and to correct prediction (-.15 versus .30) for the P,E rater group than for the E,P group.

TABLE 5
CORRELATIONS AMONG STUDY VARIABLES FOR TWO ORDERS OF RATING: P,E VS E,P

[The printed matrix reported intercorrelations among nine variables: 1. Attitude similarity; 2. Instructional methods accuracy; 3. Classroom management accuracy; 4. Interpersonal relationships accuracy; 5. Professional qualities accuracy; 6. Overall accuracy; 7. Correct prediction; 8. Confidence in prediction; 9. Appropriate action. The individual values are not recoverable from the scanned original.]

Note. n = 80-96. Predict, evaluate is shown below the diagonal; Evaluate, predict above the diagonal. Significant correlations are underlined (r ≥ .17, p < .05; r ≥ .22, p < .01). Correct prediction = 0, 1; Appropriate action = knowledge of correct teacher action = 0, 1. Full correlation matrix is available from the first author.
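The comparison discussed next tests whether a correlation differs between the two independent order groups. The paper does not name its test; Fisher's r-to-z transformation is the conventional procedure for such a comparison, so the sketch below assumes it:

```python
import numpy as np
from scipy.stats import norm

def fisher_z_test(r1: float, n1: int, r2: float, n2: int) -> tuple[float, float]:
    """Two-tailed test of the difference between correlations from
    two independent samples, via Fisher's r-to-z transformation."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)       # r-to-z
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))     # SE of the difference
    z = (z1 - z2) / se
    return z, 2 * norm.sf(abs(z))

# Illustrative values from the text; the group ns are approximate (n = 80-96).
print(fisher_z_test(.60, 96, .36, 95))
```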
However, the difference between the correlations for the two groups was significant only for overall performance rating accuracy. The pattern of these correlations suggests that correct prediction of the teacher's behavior may be more likely to lead to more accurate assessment of the teacher's Classroom Management and overall performance than an accurate assessment of overall or Classroom Management performance is to lead to a correct prediction of disciplining behavior. The processing and organization of information associated with making a behavioral prediction may have made more accessible the information needed for more accurate Classroom Management and overall performance evaluation.

Although perceived similarity in attitudes about teaching was associated with greater overall performance accuracy for both order groups, it was associated with greater accuracy on Classroom Management assessment for the P,E group and unrelated to accuracy on Classroom Management for the E,P group. It seems reasonable to expect that people who see themselves as attitudinally similar to a ratee on a personally relevant issue might pay greater attention to the behavior they will evaluate. However, in Experiment 2, more classroom management-specific information should have been available to raters in the P,E group than to those in the E,P group because the P,E raters were required to process information related to the classroom management performance dimension before they processed information related to the other performance dimensions. These data reinforce the interpretation that order of rating affects the extent and nature of information processing occurring when performance judgments are made.

Self-schema/effective prototype match. It was hypothesized that a rater's self-schema as an effective teacher would be positively related to accuracy and confidence in performance ratings of a focal teacher. Given that all of the administrators had taught previously and almost half still were teaching, it is reasonable to assume that they had self-schemata for effectiveness as a teacher and would be likely to use them when evaluating others. Self-descriptions relative to an effective teacher prototype were received from approximately one-half (48%) of the school administrators. The items were summed into a self-schema index of teacher effectiveness. This index was then correlated with performance dimension and overall performance rating accuracy, correct prediction accuracy, self-reported confidence in ratings and prediction, and attitude similarity.
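Several of these index correlations pair a continuous score with a binary (0/1) variable such as correct prediction; the point-biserial correlation is the standard statistic for that pairing, although the paper does not name its procedure. A minimal sketch with illustrative column names:

```python
import pandas as pd
from scipy.stats import pointbiserialr

df = pd.read_csv("administrators.csv")  # hypothetical file: one row per rater

# Point-biserial r between the binary correct-prediction score (0/1)
# and the continuous self-schema index.
r, p = pointbiserialr(df["correct_prediction"], df["self_schema_index"])
print(f"r = {r:.2f}, p = {p:.3f}")
```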
Similarity between administrators' self-schemata as effective teachers and an effective teacher prototype was positively related to overall performance and Classroom Management rating accuracy and to confidence in behavioral prediction. It was not related to correct prediction or teaching attitude similarity, although the correlations were in the expected direction. The hypothesis for self-schema match with the effective prototype was partially supported; raters whose self-schemata more closely resembled the prototype of an effective teacher were more accurate (in agreement with experts on overall performance and classroom management ratings) and more confident in their prediction than were raters whose self-schemata less closely resembled the effective teacher prototype.

GENERAL DISCUSSION

Two experiments examined the relationship between different methods of measuring rating accuracy and the influence of (a) the order in which the two different judgments were made and (b) two rater characteristics: perceived similarity in attitudes about teaching and the existence of a self-schema relevant to the judgment. Although not conclusive, the results from both experiments suggest that the two measures of accuracy may be related under certain conditions, such as when the performance dimensions on which an individual is evaluated correspond to the behaviors being predicted and when prediction precedes evaluation. However, the overall lack of a strong or consistent relationship between the two accuracy measures fits more closely Lord's (1985) propositions of differing cognitive processes required when different kinds of accuracy are requested (e.g., evaluation of performance level versus prediction of future behavior based on past actions) than the accuracy consistency implied by Funder's (1987) work. Also, as Lord (1985) suggested, different rater characteristics may facilitate one type of accuracy more than another. If we define accuracy broadly to include evaluation and prediction, then the results of this study support the position that these two different types of accuracy are not necessarily related, and we must investigate the conditions under which each is most likely to occur and the conditions under which they may be related.

The results regarding prediction preceding evaluation suggest that the processing and organization of information associated with the behavioral prediction may enhance later evaluation on corresponding performance dimensions. Because raters were not primed before they viewed the videotape to attend to disciplining behaviors, they could not have selectively encoded those behaviors, making them more available or accessible when the prediction and evaluation were requested. Selective recall would have occurred when the ratings were requested. For prediction followed by evaluation, recall and organization of information would be guided by the disciplining episode and prediction, which should have led to better accessibility of disciplining behaviors in memory (Tulving & Thomson, 1973). For evaluation followed by prediction, recall and organization
Information processing research may provide some answers. For example, research generally supports an "anchoring and adjustment" or "on-line" processing model of judgment, whereby people form judgments about other people or things as the relevant evidence is encountered (Hastie & Park, 1986). This contrasts with a memory-based model of judgment (e.g., Tversky & Kahneman, 1973), which proposes that judgments are formed following the retrieval of information stored in working or long-term memory. We can probably safely assume that Experiment 1 and 2 participants engaged in on-line processing of the information they encountered in the videotape of the teacher conducting her class. When asked to evaluate the teacher's performance, a task familiar to both samples, the participants could draw upon judgments they likely had already formed. However, the request for a specific behavioral prediction should have required a search of long-term memory for relevant encoded information from which the prediction could be made. That is, the prediction request should have initiated more effortful, memory-based processing. Following the prediction, participants would have had available to them the products of both types of processing. Further anchoring and adjustment of their on-line judgments, stimulated by this memory-based processing, should have led to more accurate ratings overall.

Although teaching attitude similarity and self-schema/effective prototype match appear to be related conceptually and are similarly related to other variables measured in the two experiments, the absence of a correlation between them indicates that they are not identical constructs. These rater characteristics, however, do appear to be related to rating accuracy. Their relationship to rating accuracy may be a function of their importance to, and frequent use by, the study participants. Because the student teachers and administrators in both studies were highly familiar with the task (teaching), they may have frequently used teaching attitude similarity and self-schema/prototype match to process teaching-related information. The salience and relevance of these constructs should have fostered their use as internal sources of information that directed the organization and storage in memory of teaching-related information (Higgins & King, 1981; Srull & Wyer, 1979). However, if this is the case, it also may indicate that the relationships found for attitude similarity and self-schema/prototype match with rating accuracy are specific to these samples and this particular rating task (i.e., rating teacher performance). Other rater characteristics might have emerged as important correlates of rating accuracy if the same rating task were used with different (nonteaching) samples or if a different performance rating task (e.g., rating supervisory skills) were used with the same samples. As Higgins and Bargh (1987) and others (e.g., Markus & Smith, 1981) suggest, there are many potential sources or structures of information and knowledge that can serve as reference points for social perceptions and evaluations (such as performance appraisals). Which of several sources will be used is likely to be a function of, among other things, the nature of the task, the personal relevance of the situation, and the extent to which the source aids information organization or recall (e.g., serves as a heuristic). Clearly, other investigations that vary the task and the raters should help clarify the role of perceived similarity to the ratee and the use of self-schemata in rating accuracy.
& King, 1981; Srull & Wyer, 1979). However, if this is the case, it also may indicate that the relationships found for attitude similarity and selfschema/prototype match with rating accuracy may be specific to these samples and this particular rating task (i.e., teacher performance). Other rating characteristics may have emerged as important constructs related to rating accuracy if the same rating task were used with different (nonteaching) samples or if a different performance rating task (e.g., supervisory skills) were sued with the same sample. As Higgins and Bargh (1987) and others (e.g., Markus & Smith, 1981) suggest, there are many potential sources or structures of information and knowledge that can serve as a reference point for social perceptions and evaluations (such as performance appraisal). Which of several sources will be utilized is likely to be a function of the nature of the task, the personal relevance of the situation, and the extent to which the source aids information organization or recall (e.g., a heuristic), among others. Clearly, other investigations which vary the task and raters should help clarify the role of perceived similarity to the ratee and use of self-schema in rating accuracy. One consideration for future research is the relationship of perceived similarity to the ratee with rating accuracy when different processing strategies are cued by the situation or the rating instructions.7 Because most teacher education programs promote a particular discipline strategy, most of the student teachers in these two investigations probably had access to a stored disciplining script that could be used to evaluate the teacher’s performance. Foti and Lord (1987) showed that by varying instructions, a person (prototype) or a script (behavior) schema could be induced to guide information processing. Instructions given in the present studies should have induced a teacher prototype schema. This cueing would have made salient the raters’ impressions of the focal teacher and their perceived similarity to her. If the instructions had cued a script schema (e.g., the teacher’s goal in the tape was to use effective classroom management), then the disciplining script schema would have been most salient. Under these instructions, perceived similarity may never have been cued and the findings reported here may have been quite different. There are other issues related to the design and findings for these two experiments that require comment and discussion. First is the differences in rating accuracy found across the experiments. The student teachers and administrators in Experiment 2 were better than the student teachers in Experiment 1 in their prediction of the teacher’s behavior and in their agreement with expert ratings of the teacher’s Classroom Management performance. Because predictive accuracy was dependent upon attention ’ The authors are grateful to one of the reviewers
who raised this issue.
Because predictive accuracy was dependent upon attention to and evaluation of components of the Classroom Management performance dimension, it is possible that something interfered with the Experiment 1 participants' observation of classroom management behaviors. One possibility is the size of the groups viewing the tape. Although multiple monitors were used to present the videotape to the group of 80 participants in Experiment 1, the size of the room and the number of people present may have been sufficiently distracting to affect attention. A second possibility is differences between the two student groups. Information from the director of the teacher program at the university that provided the largest number of student teachers for Experiment 1 and all of the Experiment 2 students suggested that the Experiment 2 participants were better students than their Experiment 1 counterparts (e.g., in their motivation, course grades, and involvement in the program).

A second design issue is the differences in findings for the various dimension and overall performance ratings, including the relationship of behavioral prediction accuracy with only Classroom Management rating accuracy. In hindsight, we believe that Classroom Management behaviors were probably the most prominent noninstructional behaviors in the videotape. Certainly, the behavioral instructions given to the students maximized the likelihood of eliciting disciplining responses from the teacher. This performance dimension's strong correlation with overall accuracy suggests that Classroom Management largely determined the overall performance evaluation. Although the participants' self-reported dimension importance ratings did not identify Classroom Management as significantly more important than the other performance dimensions (the exception was Professional Qualities, which was seen as significantly less important than any of the other three dimensions), the raters may have given this dimension greater weight because their observations yielded more information on Classroom Management than on the other performance dimensions. Postexperiment discussions with teacher educators and school administrators also suggested that teacher effectiveness in classroom management is an increasingly relevant concern among educators. Increased emphasis in teacher education programs and in teaching evaluations may have made this dimension especially salient to the raters in these studies.

Ease of observation of Professional Qualities may also explain this performance dimension's positive relationship with overall performance accuracy. Educators and school administrators noted that unless the performance of a ratee was decidedly poor on this dimension (e.g., an unclean or unkempt appearance, poor grammar), any rater (including the experts against whose ratings the other ratings were judged) was likely to give an average to above-average rating. The clear and generally agreed-upon criteria for acceptable performance on Professional Qualities made it one of the easiest dimensions to rate accurately.
The relative difficulty of observing a sufficient sample of behaviors related to the use of Instructional Methods and (in this stimulus tape) Interpersonal Relationships from a single observation is a likely explanation for the weak relationships of these two dimensions with overall performance accuracy. Although the lesson shown in the tape could be evaluated readily on its substantive content, the teacher's use of various instructional methods and techniques was (in hindsight) constrained within a single 30-min lesson. Observation of the teacher conducting several lessons would provide a better sampling of these behaviors. Also, because of the high frequency of disciplining behaviors elicited by the students, there were few opportunities for the teacher to engage in interpersonal relationship behaviors. Again, the opportunities for the occurrence of these behaviors were constrained by the study design.

Differences in raters' ability to observe a sufficient sampling of behaviors also may explain the differences in rating accuracy across the performance dimensions. Improvements in study design, including the use of multiple lesson observations, instructions to students designed to elicit teacher behaviors other than disciplining, and prediction of other teacher behaviors, might lead to greater consistency in findings across dimensions. However, it may be presumptuous to conclude that study design considerations are the sole explanation for the differences in accuracy found for the various performance dimensions. Because research on performance rating accuracy has generally focused on overall performance accuracy rather than on performance dimension rating accuracy or on different types of accuracy, there is little empirical evidence showing that rating accuracy generalizes across performance dimensions or definitions of accuracy. The present study's limitations notwithstanding, there may not be a consistent relationship among performance dimension accuracies or among definitions of accuracy. This may be especially true when the performance dimensions themselves represent relatively diverse and uncorrelated competencies or other characteristics. Research on rating accuracy appears to be working from a model that identifies the antecedents and correlates of overall performance rating accuracy defined as agreement with experts. This may be sufficient for practical management decisions regarding administrative concerns, such as pay increases or merit awards. However, such a model may not be particularly helpful for advancing a theoretical model of human judgment accuracy or for practical management decisions regarding employee development or promotions. These latter purposes require additional research on the inconsistencies among performance dimension accuracies and between definitions of accuracy. We recommend that this research also examine perceived similarity to the ratee on issues relevant to the rater, the existence and use of self-schemata and effectiveness prototypes when evaluating performance, and other rater competencies and experiences (e.g., trend analysis).

The findings for self-schema match with an effective prototype may have specific implications for rater training. The two most common types of rater training have been rater error training and frame-of-reference (F-O-R) training. Of these, the more effective type of rater accuracy training appears to be F-O-R training, which is meant to provide raters with a common frame of reference for interpreting and evaluating ratee performance (Hauenstein & Foti, 1989). Rater error training, by contrast, has focused almost exclusively on minimizing halo, leniency, and contrast effects in performance evaluation (Smith, 1986). Rater accuracy might be enhanced if F-O-R training were combined with rater error training that included stereotype accuracy training, especially where self-schemata do not exist or do not match effective prototypes. As Cronbach (1955) recognized in his analysis of the components of rater accuracy, raters may hold accurate or inaccurate stereotypes about ratees. To the extent that the stereotypes are accurate (see, for example, the discussion in Kenny & Albright, 1987), rater comparisons of ratee stereotypes with an effective performance prototype should result in an accurate appraisal of ratee performance. Conversely, to the extent that the rater holds inaccurate ratee stereotypes, the stereotype-prototype comparisons will yield inaccurate evaluations. Future research on rating accuracy might also consider improving accuracy through rater training that involves not only establishing across raters a common frame of reference for organizationally relevant prototypes of effective performance, but also improving rater interpretations of relevant ratee characteristics (i.e., improving stereotype accuracy). Stereotype accuracy training would include stereotype error training, so that perceptions of similarity or dissimilarity between raters and ratees are based on performance-relevant attributes and characteristics and not on performance-irrelevant attributes such as the sex, age, or race of the ratee. That is, when raters compare ratees to themselves, their self-perceptions should reflect primarily performance-relevant attributes and characteristics. Such a focus would give new meaning to the "similar to me" effect (Rand & Wexley, 1975) in performance evaluations.
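Cronbach's (1955) component view of rater accuracy, invoked above, can be made concrete. The sketch below is a minimal illustration, not the scoring procedure used in these studies: it decomposes a rater's mean squared deviation from expert (target) scores into Cronbach's four components, with lower values indicating greater accuracy on each component.

```python
import numpy as np

def cronbach_components(ratings, targets):
    """Decompose mean squared rater-target disagreement into Cronbach's
    (1955) components: elevation, differential elevation, stereotype
    accuracy, and differential accuracy."""
    d = np.asarray(ratings, float) - np.asarray(targets, float)
    grand = d.mean()              # overall mean difference
    ratee_means = d.mean(axis=1)  # mean difference per ratee (rows)
    dim_means = d.mean(axis=0)    # mean difference per dimension (columns)

    elevation = grand ** 2
    differential_elevation = ((ratee_means - grand) ** 2).mean()
    stereotype_accuracy = ((dim_means - grand) ** 2).mean()
    residual = d - ratee_means[:, None] - dim_means[None, :] + grand
    differential_accuracy = (residual ** 2).mean()

    total = (d ** 2).mean()  # equals the sum of the four components
    return (elevation, differential_elevation, stereotype_accuracy,
            differential_accuracy, total)

# Hypothetical example: one rater judging 3 ratees on 4 dimensions.
rater = [[4, 3, 5, 4], [2, 3, 3, 2], [5, 4, 4, 5]]
expert = [[4, 4, 5, 3], [3, 3, 2, 2], [5, 5, 4, 4]]
print(cronbach_components(rater, expert))
```

With a single ratee, as in the present design, only the components defined across dimensions can be computed; the example uses multiple ratees to show the full decomposition.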
REFERENCES

Balzer, W. K. (1986). Biases in the recording of performance-related information: The effects of initial impression and centrality of the appraisal task. Organizational Behavior and Human Decision Processes, 37, 329-347.
Blalock, H. M., Jr. (1972). Social statistics (2nd ed.). New York: McGraw-Hill.
Borman, W. C. (1977). Consistency of rating accuracy and rating errors in the judgment of human performance. Organizational Behavior and Human Performance, 20, 238-252.
Cantor, N., & Mischel, W. (1979). Prototypes in person perception. In L. Berkowitz (Ed.), Advances in experimental social psychology (Vol. 12, pp. 3-52). New York: Academic Press.
Cardy, R. L., & Kehoe, J. F. (1984). Rater selective attention ability and appraisal effectiveness: The effect of cognitive style on accuracy of differentiation among ratees. Journal of Applied Psychology, 69, 589-594.
Cronbach, L. J. (1955). Processes affecting scores on "understanding of others" and "assumed similarity." Psychological Bulletin, 52, 177-193.
DeNisi, A. S., Cafferty, T. P., & Meglino, B. M. (1984). A cognitive view of the performance appraisal process: A model and research propositions. Organizational Behavior and Human Performance, 33, 360-396.
Dipboye, R. L. (1985). Some neglected variables in research on discrimination in appraisals. Academy of Management Review, 10, 116-127.
Dreher, G. F., Ash, R. A., & Hancock, P. (1988). The role of the traditional research design in underestimating the validity of the employment interview. Personnel Psychology, 41, 315-327.
Evans, J. St. B. T. (1984). In defense of the citation bias in the judgment literature. American Psychologist, 39, 217-236.
Feldman, J. M. (1981). Beyond attribution theory: Cognitive processes in performance appraisal. Journal of Applied Psychology, 66, 127-148.
Foti, R. J. (1989, March). Personal communication.
Foti, R. J., & Lord, R. G. (1987). Prototypes and scripts: The effects of alternative methods of processing information on rater accuracy. Organizational Behavior and Human Decision Processes, 39, 318-340.
Frank, L. L., & Hackman, J. R. (1975). Effects of interviewer-interviewee similarity on interviewer objectivity in college admission interviews. Journal of Applied Psychology, 60, 356-360.
Funder, D. C. (1987). Errors and mistakes: Evaluating the accuracy of social judgment. Psychological Bulletin, 101, 75-90.
Graves, L. M., & Powell, G. N. (1988). An investigation of sex discrimination in recruiters' evaluations of actual applicants. Journal of Applied Psychology, 73, 20-29.
Hauenstein, N. M. A., & Foti, R. J. (1989). From laboratory to practice: Neglected issues in implementing frame-of-reference training. Personnel Psychology, 42, 359-378.
Higgins, E. T., & Bargh, J. A. (1987). Social cognition and social perception. Annual Review of Psychology, 38, 369-425.
Holyoak, K. J., & Gordon, P. C. (1983). Social reference points. Journal of Personality and Social Psychology, 44, 881-887.
Huber, V. L., Neale, M. A., & Northcraft, G. B. (1987). Judgment by heuristics: Effects of ratee and rater characteristics and performance standards on performance-related judgments. Organizational Behavior and Human Decision Processes, 40, 149-169.
Ilgen, D. R., & Feldman, J. M. (1983). Performance appraisal: A process focus. In Research in organizational behavior (Vol. 5, pp. 141-197). Greenwich, CT: JAI Press.
Kenny, D. A., & Albright, L. (1987). Accuracy in interpersonal perception: A social relations analysis. Psychological Bulletin, 102, 390-402.
Kuiper, N. A., & Rogers, T. B. (1979). Encoding of personal information: Self-other differences. Journal of Personality and Social Psychology, 37, 499-514.
Landy, F. J., & Farr, J. L. (1980). Performance rating. Psychological Bulletin, 87, 72-107.
Landy, F. J., & Farr, J. L. (1983). The measurement of work performance. New York: Academic Press.
Latham, G. P., Wexley, K. N., & Pursell, E. D. (1975). Training managers to minimize rating errors in the observation of behavior. Journal of Applied Psychology, 60, 550-555.
Lord, R. G. (1985). Accuracy in behavioral measurement: An alternative definition based on raters' cognitive schema and signal detection theory. Journal of Applied Psychology, 70, 66-71.
Lord, R. G., Foti, R. J., & Phillips, J. S. (1982). A theory of leadership categorization. In J. G. Hunt, V. Sekaran, & C. Schriesheim (Eds.), Leadership: Beyond established views (pp. 104-121). Carbondale, IL: SIU Press.
Markus, H., & Smith, J. (1981). The influence of self-schema on the perception of others. In N. Cantor & J. F. Kihlstrom (Eds.), Personality, cognition, and social interaction (pp. 233-262). Hillsdale, NJ: Erlbaum.
Markus, H., Smith, J., & Moreland, R. L. (1985). Role of the self-concept in the perception of others. Journal of Personality and Social Psychology, 49, 1494-1512.
Markus, H., & Wurf, E. (1987). The dynamic self-concept: A social psychological perspective. Annual Review of Psychology, 38, 299-337.
Murphy, K. R., Balzer, W. K., Lockhart, M. C., & Eisenman, E. J. (1985). Effects of previous performance on evaluations of present performance. Journal of Applied Psychology, 70, 72-84.
Nathan, B. R., & Alexander, R. A. (1985). The role of inferential accuracy in performance rating. Academy of Management Review, 10, 109-115.
Pulakos, E. D. (1984). A comparison of rater training programs: Error training and accuracy training. Journal of Applied Psychology, 69, 581-588.
Pulakos, E. D. (1986). The development of training programs to increase accuracy with differential rating tasks. Organizational Behavior and Human Decision Processes, 38, 76-91.
Pulakos, E. D., & Wexley, K. N. (1983). The relationship among perceptual similarity, sex, and performance ratings in manager-subordinate dyads. Academy of Management Journal, 26, 129-139.
Rand, T. M., & Wexley, K. N. (1975). Demonstration of the effect "similar to me" in simulated employment interviews. Psychological Reports, 36, 535-544.
Rosch, E. (1978). Principles of categorization. In E. Rosch & B. B. Lloyd (Eds.), Cognition and categorization. Hillsdale, NJ: Erlbaum.
Schmitt, N. (1976). Social and situational determinants of interview decisions: Implications for the employment interview. Personnel Psychology, 29, 79-101.
Smith, D. E. (1986). Training programs for performance appraisal: A review. Academy of Management Review, 11, 22-40.
Srull, T. K., & Gaelick, L. (1983). General principles and individual differences in the self as a habitual reference point: An examination of self-other judgments of similarity. Social Cognition, 2, 108-121.
Srull, T. K., & Wyer, R. S. (1979). The role of category accessibility in the interpretation of information about persons: Some determinants and implications. Journal of Personality and Social Psychology, 37, 1660-1672.
Sulsky, L. M., & Balzer, W. K. (1988). Meaning and measurement of performance rating accuracy: Some methodological and theoretical concerns. Journal of Applied Psychology, 73, 497-506.
Tulving, E., & Thomson, D. M. (1973). Encoding specificity and retrieval processes in episodic memory. Psychological Review, 80, 352-373.
Wexley, K. N., Alexander, R. A., Greenawalt, J. P., & Couch, M. A. (1980). Attitudinal congruence and similarity as related to interpersonal evaluations in manager-subordinate dyads. Academy of Management Journal, 23, 320-330.
Wexley, K. N., & Klimoski, R. (1984). Performance appraisal: An update. In K. M. Rowland & G. D. Ferris (Eds.), Research in personnel and human resource management (Vol. 2). Greenwich, CT: JAI Press.
Zadny, J., & Gerard, H. B. (1974). Attributed intentions and informational selectivity. Journal of Experimental Social Psychology, 10, 34-52.
Zalesny, M. D., & Kirsch, M. P. (1989). The effect of similarity on performance ratings and interrater agreement. Human Relations, 42, 81-96.

Received September 25, 1989