Behaviorally anchored rating scales: An application for evaluating teaching practice


Teaching and Teacher Education 59 (2016) 414–419


Michelle Martin-Raugh, Richard J. Tannenbaum, Cynthia M. Tocci, Clyde Reese
Educational Testing Service (ETS), 660 Rosedale Rd., Princeton, NJ 08541, USA

Highlights

- We compared Behaviorally Anchored Rating Scales (BARS) to the Framework for Teaching (FfT).
- BARS provide behavioral anchors delineating levels of performance.
- Nineteen raters, users of the FfT trained to use BARS, evaluated teacher lessons.
- We report rater agreement, usability judgments, and preferences for both tools.
- Raters, although familiar with the FfT, reported favorable reactions to BARS.

Article history: Received 25 January 2016; Received in revised form 18 July 2016; Accepted 22 July 2016

Keywords: Behaviorally anchored rating scales (BARS); Framework for Teaching (FfT); Appraisal; Evaluation

Abstract

We developed Behaviorally Anchored Rating Scales (BARS) for measuring teaching practice and compared them to the well-established Framework for Teaching (FfT; Danielson, 2013). BARS provide behavioral anchors delineating levels of performance via a set of behaviors. Our BARS focused on two dimensions of teaching: leading a classroom discussion and making content and practices explicit. We examined a) how rater agreement for BARS compares to rater agreement using the FfT, and b) how BARS and the FfT compare regarding perceived ease of use, perceived accuracy, and perceived advantages and disadvantages. Nineteen raters, who are users of the FfT and were trained to use BARS, independently evaluated video-taped teacher lessons using both methods. Rater agreement was higher for the FfT, which may, in part, reflect the raters' greater experience with that rating system. Nonetheless, raters reported that many aspects of BARS are desirable.

There is a great deal of agreement among both researchers and educators that teachers have a large effect on the lives of elementary school students (Harris & Rutledge, 2010; McCartney, Dearing, Taylor, & Bub, 2007), and that effectively measuring teaching performance is an important area of inquiry. Recent education reforms aiming to improve student performance have focused, in part, on improving teacher selection, preparation, and evaluation (Goe, 2007). However, there is not yet consensus among educational researchers about the specific indicators that define quality teaching nor, consequently, about the best ways to measure teaching practice (cf. Ball & Hill, 2008). Prior research suggests that traditional principal evaluations of teachers insufficiently differentiate between more and less effective teachers and provide an inadequate foundation for highlighting teacher needs for training and improvement (Danielson, 1996; Medley & Coker, 1987).


More rigorous teacher evaluation tools may inform teacher development at crucial junctures, such as certification and selection, and may be used to shape educator training and professional development (Glazerman et al., 2010; Jamil, Sabol, Hamre, & Pianta, 2015; Bill and Melinda Gates Foundation, 2010). In this study, we report our efforts to develop and evaluate Behaviorally Anchored Rating Scales (BARS), a type of performance rating scale featuring narrative behavioral anchors at scale points (Smith & Kendall, 1963), for use in measuring observed teaching practice for elementary school teachers teaching kindergarten through sixth grade. The primary purpose of this study is to evaluate the viability of BARS for evaluating teaching practice. Specifically, we describe the development steps and potential benefits of this measure, and compare some of its properties to those of the commonly used Framework for Teaching (Danielson, 1996, 2007, 2013). Our goal is to consider the possibility of using BARS in the service of teacher evaluation, preparation, or development via an exploratory, preliminary study. We have included the Framework for Teaching (FfT) as one reasonable point of reference for facilitating an evaluation of the potential merits of BARS.


1. Assessing teaching quality

There is no single agreed-upon definition of teaching quality and effectiveness. One relatively narrow view of teaching quality defines it as the ability to produce gains in student achievement scores on standardized assessments (Little, Goe, & Bell, 2009). However, teachers are not solely responsible for students' learning and test scores, as factors outside a teacher's control, such as peers, family members, student abilities, and the school environment, affect student learning (Goe, Bell, & Little, 2008; Little et al., 2009). Moreover, standardized assessments are limited in the information they impart, and learning may not be fully captured via scores on assessments (Goe et al., 2008; Little et al., 2009). Thus, an approach to evaluating teaching quality that focuses on specific observable teacher behaviors, likely related to student learning and development, offers a broader view of teaching quality (McCloy, 2013).

Ball and colleagues have proposed what they call high-leverage practices (HLPs) for teachers, which are "practices at the heart of the work of teaching that are most likely to affect student learning" (Ball & Forzani, 2010, p. 43). One example of a high-leverage practice is making content and practices (e.g., specific texts, problems, ideas, theories, processes) explicit to students through explanation, modeling, representations, and examples. Another is effectively leading a group or class discussion. High-leverage practices describe critical characteristics of effective teaching (Ball & Forzani, 2010). However, empirical research on these practices is limited; the current study is one of the first to examine their utility as descriptors of teaching to be used in evaluating practice. These high-leverage practices can be conceptualized as dimensions of effective teaching that may be evaluated via classroom observation.

High-leverage practices are expected to apply across grade levels and subjects. It is reasonable to propose that making content explicit to students, for example, generalizes to all classroom teachers. What may vary by grade level and subject is the specific nature of the evidence that a teacher demonstrates in order to make content explicit; the fundamental construct remains the same. In this study, we focus specifically on elementary school teachers and on English language arts (ELA) and mathematics because the elementary grades serve as the foundation for subsequent learning, and ELA and mathematics are prominent among the essential common core subject areas (e.g., Entwisle & Hayduk, 1988; Porter, McMaken, Hwang, & Yang, 2011).

Many current approaches to teaching evaluation involve holistic rubrics or performance appraisal instruments, such as the Teacher Performance Evaluation Rubric or the Compass rubric. Tools such as these require raters to make an overall judgment about the quality of performance, resulting in scoring that is assumed to be easy, cost-effective, and accurate (Jonsson & Svingby, 2007). However, low levels of rater reliability appear to be a persistent issue in this type of observation system (Casabianca, Lockwood, & McCaffrey, 2015).
McCaffrey and colleagues (McCaffrey, Yuan, Savitsky, Lockwood, & Edelen, 2014) reported that correlations among rater errors may substantially distort teacher observation ratings. Moreover, prior research has shown that rater effects accounted for 25% to 70% of the variance in observation ratings (Bill and Melinda Gates Foundation, 2012; Casabianca et al., 2013; Hill, Charalambous, & Kraft, 2012). Rater error contributing to unreliability can take many forms.


For instance, raters can differ in the extent to which they make severe or lenient ratings (Kingsbury, 1922; Landy & Farr, 1980). Ratings may also be subject to "halo error," a bias in which a rater evaluates a behavior based on positive or negative impressions of the individual being assessed (Thorndike, 1920). Additionally, raters may tend to assign scores in the middle of the score range rather than using the full scale, resulting in a central tendency bias (Saal, Downey, & Lahey, 1980).

The measurement of teaching quality may be improved through an increased emphasis on teaching behavior and more clearly defined behavioral scale anchors intended to reduce rater biases and, consequently, produce more reliable and accurate ratings. Our study explored the potential value of applying such a rating scale to the observation of teaching practice in elementary school. We collected rater judgments about the efficacy of behaviorally anchored rating scales (BARS) and how their use compared to the FfT, and also compared some basic measurement properties across the two rating approaches.

2. Behaviorally anchored rating scales (BARS)

BARS may afford several advantages over traditional evaluation methods. One advantage stems from the fact that subject matter experts (SMEs) who are familiar with a job and its demands (teachers, in this case) provide information at each step in the development process used to build the scales (Schwab, Heneman, & DeCotiis, 1975). To build BARS, first, critical incidents (Flanagan, 1954) depicting highly effective or ineffective behaviors performed on the job are collected from SMEs. Second, the critical incidents are edited so that redundancies across incidents are removed and they conform to a common format. Third, in a step often referred to as retranslation (Schwab et al., 1975), SMEs are asked to evaluate which performance dimension each behavior should be classified into. Fourth, SMEs rate each incident for effectiveness so that the means of these ratings can be used to position the critical incidents as scale anchors. Incidents that do not meet a predetermined criterion of agreement among SMEs are discarded.

The input provided by SMEs at each stage of the development process is likely to result in anchor terminology that is specialized and relevant to the job in question, which may positively affect the reliability of the ratings collected (Schwab et al., 1975). Additionally, the retention of only incidents that reach a high level of expert agreement may reduce central tendency and leniency errors (Smith & Kendall, 1963). Finally, those who are evaluated using BARS may be more likely to react favorably to their evaluations knowing that SMEs with backgrounds similar to their own contributed to the development of the scales (Schwab et al., 1975).

Although the use of ratings to gauge performance is widespread across a variety of professions, educators and researchers alike have expressed some dissatisfaction with these measures as a result of their vulnerability to intentional and unintentional bias (Landy & Farr, 1980). BARS are intended to reduce bias and subjectivity in rater judgments, thereby improving judgment reliability and accuracy, by providing examples of observable behaviors at different points of a rating scale; a separate BARS is developed for each targeted job dimension (Bernardin & Smith, 1981).
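To make the structure of such a scale concrete, the sketch below represents a single BARS dimension as a mapping from scale points to behavioral anchors. This is our own illustration rather than the authors' instrument; the dimension name and anchor texts are hypothetical placeholders.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class BARSDimension:
    """A behaviorally anchored rating scale for one job (teaching) dimension."""
    name: str
    # Maps scale points (e.g., 1-7) to the behavioral anchors that describe
    # observable performance at that level; a point may carry several anchors.
    anchors: dict[int, list[str]] = field(default_factory=dict)

    def describe(self, score: int) -> list[str]:
        """Return the behavioral anchors attached to a given scale point."""
        return self.anchors.get(score, [])

# Hypothetical example for one teaching dimension (illustrative anchors only).
leading_discussion = BARSDimension(
    name="Leading a classroom discussion",
    anchors={
        1: ["Teacher asks only yes/no questions and moves on without follow-up."],
        4: ["Teacher poses open questions but evaluates every answer personally."],
        7: ["Teacher prompts students to respond to and build on peers' ideas."],
    },
)

print(leading_discussion.describe(7))
```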
These behavioral anchors help to ensure that raters have a more standardized and uniform understanding of performance, which should result in more consistent and accurate interpretations and evaluations (Bernardin & Beatty, 1984; Schultz & Zedeck, 2011). BARS were initially proposed by Smith and Kendall (1963) as a methodology to support more objective supervisory ratings of employee job performance than those produced using commonly used Likert scales anchored by adjectives. Likert scales are more prone to rater errors, such as leniency or halo error, than scales that have specific, behavioral anchors (Borman & Dunnette, 1975).


For instance, one previous study compared the amount of leniency error present in students' Likert ratings and ratings made using BARS when evaluating their teacher's course performance (Cook, 1989). Multivariate and univariate analyses revealed that raters exhibited significantly less leniency on two performance dimensions, "subject coverage" and "workload," when using BARS than when using a Likert scale.

In summary, BARS may afford several advantages compared to other, more holistic teaching evaluation tools, such as the Framework for Teaching (FfT; Danielson, 1996, 2007, 2013) described in more detail below. Perhaps most importantly, the scales are anchored by concrete, behavioral examples rather than generic descriptors, and these behavioral examples are assigned to scale points through a systematic, teacher-driven process. The behavioral examples are generated by teachers, and the assignment of examples to specific scale points is carried out by teachers. This process helps ensure that the scales produced a) are meaningful to the educators ultimately using them for evaluation and reflect actual practice, b) are likely to yield buy-in from teachers as a result of the process used for development, and c) support more accurate and less biased ratings, based on findings from studies conducted in other workplace contexts (cf. Borman & Dunnette, 1975; Cook, 1989).

3. The Framework for Teaching (FfT)

Existing teaching evaluation tools often require evaluators to make holistic judgments about dimensions of teachers' practice. For instance, one popular tool, the Framework for Teaching (FfT; Danielson, 1996, 2007, 2013; see scales in Appendix A), assesses teachers in four domains: 1) planning and preparation, 2) the classroom environment, 3) instruction, and 4) professional responsibilities. The four domains comprise a total of 22 components, with performance on each component scored using holistic four-level rubrics: unsatisfactory, basic, proficient, or distinguished. Each level is further illustrated by so-called critical attributes and possible examples, which are intended to better operationalize each level. The FfT is meant to generalize across both subjects and grade levels.

It takes approximately 20 hours to train raters to use the FfT on the eight teaching components in the domains of classroom environment and instruction that can be observed in the classroom. In a commercial product for the FfT, raters spend approximately 2 hours learning how to evaluate and record evidence acquired via observation, approximately 2 hours in bias training, and the remainder of the training time scoring and discussing segments of teachers' lessons for practice. For this study, we focused on two of the eight observable components: 3a) communicating with students and 3b) using questioning and discussion techniques.

4. Development of BARS for the current study

Our BARS focused on two high-leverage practices (Ball & Forzani, 2011): leading a classroom discussion and making content and practices explicit, which represent two dimensions of practice in the larger domain of teaching. We followed established procedures to develop the BARS (e.g., Pounder, 2000). For each dimension, we collected approximately 10 critical incidents from each of 57 elementary teachers and faculty members who prepare elementary teachers (subject matter experts, SMEs) from across the United States.
Each critical incident is a short, behavioral episode that describes a situation, a person's response to that situation, and the outcome of the behavior (Flanagan, 1954). Critical incidents are often used in role delineation or job analysis applications (e.g.,

Latham, Saari, Pursell, & Campion, 1980). These critical incidents were then edited so that they were more concise and grammatically correct. In addition, the outcome of the focal behavior was removed so that we could collect judgments of incident effectiveness based on the behavior described in the incident rather than on its results (Motowidlo, Crook, Kell, & Naemi, 2009). Next, for each incident, we collected a) ratings of effectiveness on a 7-point scale (1 = very ineffective to 7 = very effective) and b) confirmation of the categorization of each incident into the targeted dimensions, from the same group of 57 SMEs. The average rating assigned to an incident by the SMEs signals the degree to which the incident describes effective or ineffective performance. The standard deviation of the ratings for each incident represents the amount of agreement among the SMEs. Incidents with high agreement (e.g., a standard deviation of 1.5 or less) were retained (Schwab et al., 1975).

The retained incidents, associated with each score point (1–7), defined the BARS for the dimension. For example, the behavioral incident reliably considered to depict the least effective performance is assigned a 1; conversely, a 7 is assigned to the behavioral incident depicting the most effective performance. Anchors between these extremes depict a progression of increasingly more effective performance. Incidents receiving a mean expert effectiveness rating between 1 and 2.5 were assigned to the lowest score band; incidents with a mean rating between 2.5 and 5.5 were assigned to the medium score band; and incidents with a mean rating between 5.5 and 7 were assigned to the highest score band. The medium score band is the largest because incidents tend to cluster at the ends of the effectiveness distribution (Martin-Raugh, Kell, & Motowidlo, 2016) as a result of the way in which they are collected (i.e., by asking SMEs for especially effective and ineffective examples of performance). We included multiple behavioral anchors for selected scale-point ranges to provide more information to help guide raters' decisions; in this regard, our rating scale also resembles what are known as behavioral summary scales (Borman, Hough, & Dunnette, 1976; see the scales in Appendix B).

The current study is a preliminary, exploratory investigation of the utility of BARS for evaluating teaching practice. In this investigation we plan to answer the following research questions:

Research Question 1: How does rater agreement for BARS compare to rater agreement using the FfT?

Research Question 2: How do BARS and the FfT compare regarding perceived ease of use, perceived accuracy, and perceived advantages and disadvantages?

We compared two dimensions of teaching behavior (leading a classroom discussion and making content and practices explicit) separately for mathematics and English language arts (ELA) across three different 15-min segments of video-taped teacher lessons spanning grades one through six. To facilitate the comparison of BARS to the FfT, we chose to focus on leading a classroom discussion because it is very similar to the FfT dimension using questioning and discussion techniques; similarly, making content and practices explicit is very similar to the FfT dimension communicating with students.
It is worth noting that although we chose to focus on two specific high-leverage teaching practices in this study, BARS may be developed to measure as many practices or dimensions as necessary, such that they cover the full range of the domain of interest, in the same way that the FfT measures multiple dimensions of teaching performance. The two practices examined in this study represent a subset of all of the dimensions of teaching that would need to be evaluated by an operational teacher evaluation tool.
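The retention and band-assignment logic described above can be summarized in a few lines. The sketch below is a minimal illustration, assuming each incident's SME effectiveness ratings are available as a list of numbers; the 1.5 standard-deviation cutoff and the band boundaries (1 to 2.5, 2.5 to 5.5, 5.5 to 7) follow the procedure reported in this section, while the function name and example ratings are our own.

```python
from statistics import mean, stdev

def assign_band(ratings, sd_cutoff=1.5):
    """Discard an incident when SME agreement is too low (SD above the cutoff);
    otherwise place it in a score band based on its mean effectiveness rating."""
    m, sd = mean(ratings), stdev(ratings)
    if sd > sd_cutoff:
        return None            # insufficient agreement: incident is discarded
    if m < 2.5:
        return "low"           # mean between 1 and 2.5: lowest score band
    if m <= 5.5:
        return "medium"        # mean between 2.5 and 5.5: medium score band
    return "high"              # mean between 5.5 and 7: highest score band

# Hypothetical SME ratings (1 = very ineffective, 7 = very effective).
print(assign_band([6, 7, 6, 7, 6]))   # high agreement, highly effective -> 'high'
print(assign_band([1, 7, 4, 2, 6]))   # ratings disagree too much -> None
```

Whether the boundaries are inclusive or exclusive at exactly 2.5 and 5.5 is not specified in the text, so the comparisons above reflect one plausible convention.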


5. Method

5.1. Sample

Nineteen raters (nine K-6 ELA teachers and 10 K-6 mathematics teachers), previously trained in the use of the FfT and who qualify as "master coders," participated in the study. The nine ELA raters evaluated three videos of ELA lessons; the 10 mathematics raters evaluated three videos of mathematics lessons.

5.2. Procedure

Raters participated in about 2.5 hours of interactive training that covered a) a short refresher on the FfT methodology, b) a longer explanation of how to use BARS, and c) an opportunity to practice rating two videos for each of the two teaching dimensions using both evaluation methods, followed by d) a group discussion of the practice videos, during which participants were given the opportunity to ask questions. Appendix C displays master scores using BARS and the FfT for the two video examples we used in training. For BARS, raters were instructed to select a score band according to where the preponderance of evidence for teaching behavior lies and to calibrate within a given score band according to how well the teaching behavior they just observed exemplifies the behaviors depicted within that band. Prior to this study, all raters had received approximately 20 hours of training regarding use of the FfT and had had opportunities to use the FfT in evaluating teachers in practice.

Raters then evaluated three pre-selected 15-min video-taped segments of teacher lessons; each segment portrayed a teacher both making content and practices explicit and leading a class discussion, which reduced the number of videos each rater needed to watch. Video-taped teacher lessons were used to achieve a greater degree of control than would have been obtained with in-person observations, where we could not ensure sufficient variability in teacher behaviors and evidence to produce variability in scores across observations. Lesson-length videos depicting unedited, authentic ranges of teaching performance were selected from ETS' Classroom Video Library. A different teacher appeared in each video clip. The ELA videos featured clips from kindergarten, third, and sixth grade; the mathematics videos featured clips from first, second, and third grade. Video-taped segments were selected by the research team such that there was variance in the type of evidence and in scores.

The raters first recorded evidence they considered relevant to each of the two aspects of teaching practice we focused on in this study (leading a classroom discussion and making content and practices explicit). Then, the raters independently evaluated the video-taped teacher performances pre-selected from an existing repository, using BARS first and the FfT second. The BARS for each of the two subject areas and performance dimensions are displayed in Appendix B. Descriptions of the levels of proficiency for the two FfT components, communicating with students and using questioning and discussion techniques, are presented in Appendix A. Raters were asked to record and submit their ratings (BARS and FfT), notes, and rationales to the research team. Finally, raters completed a brief survey after rating each of the three videos and a final reaction survey after completing their ratings for all three videos. See Appendix D for the survey questions.

5.3. Analysis plan

We compared rater agreement using both BARS and the FfT by reporting the average percentage of exact and adjacent agreement across raters across the three videos for each performance dimension, separately for mathematics and ELA.
We then summarized results from the survey administered to raters after they completed their ratings of the videos to explore ease of use, perceived accuracy, and perceived advantages and disadvantages of the two rating scales. Because raters had received much more training with, and were more familiar with, the FfT than BARS, we expected that results would favor the FfT. However, we anticipated that some raters would respond positively to BARS and that the results of our survey would provide useful information about the utility of BARS for teacher observation. We also anticipated that the feedback would identify improvements that could be made to BARS, making them more suitable for evaluating teaching practice.

6. Results

We present the average percentage of exact and adjacent agreement for the two dimensions of BARS and the FfT for mathematics and ELA in Table 1. To compare inter-rater agreement across the FfT and BARS methodologies, we first converted BARS scores from a 7-point scale to a 4-point scale via a linear transformation so that they could be directly compared to FfT scores. We then computed average percentages of exact and adjacent agreement across the three videos to index rater agreement at the performance dimension level for both the BARS and the FfT; these values are displayed in Table 1.

For both teaching dimensions, and for both mathematics and ELA, the average percentages of both exact and adjacent agreement were higher when raters used the FfT to evaluate videos than when they used BARS. As average exact agreement is under 75% for both BARS and the FfT in this study, agreement across raters can be considered generally poor (cf. Hartmann, 1977), although FfT scores yielded levels of agreement that were much closer to the acceptable 75% threshold than BARS scores. Additionally, on a four-point scale, 25% exact agreement and 62.5% adjacent agreement are considered levels of agreement that may be obtained by chance (Hayes & Hatch, 1999); both the BARS and the FfT performed at better than chance levels of rater agreement in this study. Across the board, agreement between raters was higher when using the FfT, which is not entirely surprising, as all raters in this study had received extensive training and calibration in the FfT evaluation system.

6.1. Quantitative survey results

We first compared judgments regarding usability and rater preference across the two rating methodologies. Responses to the question "Overall, which of the following methods did you find easier to use to evaluate the teaching practices depicted in the videos you saw today?" are presented in Table 2.

Table 1
Average exact and adjacent percentage of rater agreement.

                               Exact agreement    Adjacent agreement
English language arts (ELA)
  BARS HLP A                   51.87%             85.17%
  BARS HLP B                   51.87%             85.17%
  FfT 3A                       70%                96.67%
  FfT 3B                       63.3%              96.67%
Mathematics
  BARS HLP A                   46.67%             83.33%
  BARS HLP B                   30%                66.67%
  FfT 3A                       70%                96.67%
  FfT 3B                       63.33%             96.67%

Note. BARS scores were recoded to a 4-point scale. One rater did not provide a BARS rating for HLP B for Video 3. Another rater did not provide an FfT rating for FfT 3A for Video 2.
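As an illustration of how figures like those in Table 1 can be computed, the sketch below rescales 7-point BARS scores to the 4-point FfT range via a linear transformation and then computes exact and adjacent agreement over all rater pairs for a single video. The particular rescaling formula and the pairwise definition of agreement are our assumptions, since the paper does not spell them out, and the sample ratings are hypothetical.

```python
from itertools import combinations

def rescale_7_to_4(score: int) -> int:
    """Linearly map a 1-7 score onto the 1-4 range, rounding to the nearest
    whole score point (one plausible form of the linear transformation)."""
    return round(1 + (score - 1) * 3 / 6)

def agreement(scores):
    """Percentage of rater pairs in exact and adjacent (within one point)
    agreement for one video and one performance dimension."""
    pairs = list(combinations(scores, 2))
    exact = 100 * sum(a == b for a, b in pairs) / len(pairs)
    adjacent = 100 * sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, adjacent

# Hypothetical 7-point BARS ratings from five raters on one video.
bars_raw = [5, 6, 5, 7, 4]
bars_on_4 = [rescale_7_to_4(s) for s in bars_raw]   # -> [3, 4, 3, 4, 2]
print(agreement(bars_on_4))
```

Averaging such percentages across the three videos would yield dimension-level values like those in Table 1, although other agreement definitions (for example, agreement with a master score) would produce different numbers.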


Table 2
Ease of use of BARS vs. the FfT by subject (N = 19).

         Mathematics    ELA
BARS     3              1
FfT      5              3
Equal    2              5

Note. Tallies reflect responses to the question "Overall, which of the following methods did you find easier to use to evaluate the teaching practices depicted in the videos you saw today?"

Table 4
Contamination of ratings by preferences for BARS vs. the FfT by subject (N = 19).

         Mathematics    ELA
BARS     2              3
FfT      6              1
Equal    2              5

Note. Tallies reflect responses to the question "Overall, which of the following do you feel required you to rely LESS on your own preferences about the quality of the teaching practices depicted in the videos you saw today?"

The majority of ELA raters reported both rating methodologies to be equally easy to use. For mathematics, five of the raters judged the FfT easier to use, three judged the BARS easier to use, and two judged both methods to be equally easy to use.

Responses to the question "Overall, which of the following do you feel allowed you to make more accurate judgments about the teaching practices depicted in the videos you saw today?" are presented in Table 3. For ELA, only one rater out of nine thought the BARS resulted in more accurate judgments, while the rest of the sample was split between indicating that both methods resulted in equally accurate judgments or that the FfT resulted in more accurate judgments. For mathematics, five out of 10 raters thought the FfT resulted in more accurate judgments, while three thought the BARS were more accurate. The remaining two individuals thought the two methods were equal.

Table 3
Perceived accuracy of BARS vs. the FfT by subject (N = 19).

         Mathematics    ELA
BARS     3              1
FfT      5              4
Equal    2              4

Note. Tallies reflect responses to the question "Overall, which of the following do you feel allowed you to make more accurate judgments about the teaching practices depicted in the videos you saw today?"

Table 4 summarizes responses to the question "Overall, which of the following do you feel required you to rely LESS on your own preferences about the quality of the teaching practices depicted in the videos you saw today?" For ELA, five raters reported the two methods were equal, three preferred the BARS, and only one reported that the FfT resulted in less reliance on one's preferences. For mathematics, six out of 10 raters reported the FfT resulted in less reliance on preferences, and the remaining four raters were split between indicating the two methods were equal or that the BARS resulted in less reliance on preferences.

6.2. Qualitative survey results

The qualitative evidence collected via the open-ended survey questions administered to raters reveals several advantages of BARS relative to the FfT. First, some raters suggested in their open-ended responses that the BARS was clearer, faster, and easier to use and understand than the FfT. Raters also liked the scale's visual representation and its greater number of scale values compared to the FfT scale. Finally, raters reported that BARS having fewer negative descriptors than the FfT was an advantage.

Raters also described several disadvantages of the BARS relative to the FfT. First, raters reported that it was more difficult to assign an exact score within the medium score band, because the middle set of anchors represented all three score points. This made it more difficult for raters to consider what differentiated, for example, a 5 from a 4, because both values are attached to the same anchors. Second, some raters indicated that they preferred the inclusion of some examples of student-led behaviors, the bullets, and the fewer scale levels of the FfT. Interestingly, the clear emphasis on teacher behavior in the BARS as compared to the FfT was reported to be an advantage by some and a disadvantage by others.

The survey data collected in the study also highlighted several suggestions for how BARS could be improved. Some raters indicated that the scale could benefit from additional scale levels and additional detail describing the medium and low score bands.

7. Discussion

The use of BARS is commonplace in applications of industrial psychology, such as performance appraisal rater training. BARS are not often considered in educational applications, although the goal of anchoring scale points more clearly is a concern in education. Although the two performance dimensions examined in this study are specific in nature, BARS may still be of use in measuring the quality of teaching practice more broadly; a separate BARS would simply need to be constructed for each teaching dimension of interest.

Results of this study suggest that although raters seemed to prefer the FfT over BARS, several raters did react more favorably to BARS, citing, for example, ease of use and perceived resistance to personal preferences. We are optimistic about the potential value of using BARS for teacher observations that support, for example, professional development, as the behavioral anchors may be used to provide concrete and specific feedback. But we also recognize that our study is a first foray into the application of BARS in the teaching context, and we acknowledge the importance of continued investigations to understand and account for sources of variance, moderator effects, and unintended consequences.

Results regarding inter-rater agreement for the BARS and the FfT suggest that raters tend to be in greater agreement when using the FfT, although it should be noted that the average percentages of exact and adjacent agreement produced in this study for both the FfT and BARS are generally relatively low, which may be due to the expedited training that all the raters experienced. Nonetheless, the comparatively higher levels of agreement for the FfT were not surprising, given that the raters had much more previous experience with that rating approach (approximately 10 times more than with the BARS). Still, the finding that rater agreement for the BARS was reasonably close to that of the FfT is promising. Future research may benefit from comparing rater agreement on the FfT and BARS using a design in which raters are unfamiliar with both rating scales and are provided with an equal amount of training.

One of the chief criticisms of BARS was that the middle score band is not detailed enough. The largest difference in rater agreement between the BARS and the FfT was found in mathematics for dimension B (leading a group discussion; using questioning and discussion techniques). The relatively low level of rater agreement for the BARS on this dimension likely occurred because the anchors for the middle band were not detailed enough. However, further investigation is needed to determine what observers need to differentiate between surface-level classroom talk and true classroom discussion that supports learning.


Prior research (e.g., Harari & Zedeck, 1973; Landy & Guion, 1970; Smith & Kendall, 1963) has reported similar difficulty in identifying anchors for the central portions of the scales. This particular drawback may occur in part as a result of the critical incident technique (Flanagan, 1954) used to gather the behavioral examples that comprise the BARS anchors. Because the majority of the behavioral examples collected tend to cluster at the poles of the effectiveness distribution, there are fewer behavioral examples describing the mid-level of effectiveness, which may be characterized as mediocre and non-critical for describing performance in a particular domain. Future iterations of BARS for use in teaching may be improved by explicitly asking experts to provide better differentiated examples of mid-level performance. A related area to explore is the value of having anchored score bands, as we had in this study, versus having more individually anchored score points (each point may still be anchored with multiple examples). As there are implementation and measurement trade-offs associated with each approach, future research on modifications to the existing BARS format is warranted.

Finally, participants' open-ended responses to survey questions suggest that BARS have many attributes raters find desirable. This, as previously noted, is encouraging given that the raters who participated in this study may have had a preexisting affinity for the FfT due to their experience and familiarity with that rating approach. It must be noted that because the current study required that raters be familiar with the FfT and was exploratory by design, we utilized a relatively small sample. Future research would greatly benefit from exploring the merit of BARS in other contexts and with a larger sample of raters.

Appendix A. Supplementary data

Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.tate.2016.07.026.

References

Ball, D. L., & Forzani, F. M. (2010). Teaching skillful teaching. Educational Leadership, 68, 40–45.
Ball, D. L., & Forzani, F. (2011). Identifying high-leverage practices for teacher education. Paper presented at the American Educational Research Association annual meeting, Philadelphia, Pennsylvania.
Ball, D. L., & Hill, H. C. (2008). Measuring teacher quality in practice. In D. H. Gitomer (Ed.), Measurement issues and assessment for teaching quality (pp. 80–98). Thousand Oaks, CA: Sage Publications.
Bernardin, H. J., & Beatty, R. W. (1984). Performance appraisal: Assessing human behavior at work. Boston, MA: Kent.
Bernardin, H. J., & Smith, P. C. (1981). A clarification of some issues regarding the development and use of behaviorally anchored ratings scales (BARS). Journal of Applied Psychology, 66, 458–463.
Bill and Melinda Gates Foundation. (2012). Gathering feedback for teaching: Combining high-quality observations with student surveys and achievement gains. Seattle, WA: Author. Retrieved from http://www.metproject.org/downloads/MET_Gathering_Feedback_Research_Paper.pdf.
Bill & Melinda Gates Foundation. (2010). Learning about teaching: Initial findings from the Measures of Effective Teaching project. MET Project Research Paper.
Borman, W. C., & Dunnette, M. D. (1975). Behavior-based versus trait-oriented performance ratings: An empirical study. Journal of Applied Psychology, 60, 561–565.
Borman, W. C., Hough, L. M., & Dunnette, M. D. (1976). Development of behaviorally based rating scales for evaluating the performance of US Navy recruits. Minneapolis, MN: Personnel Decisions Research Institute.
Casabianca, J. M., Lockwood, J. R., & McCaffrey, D. F. (2015). Trends in classroom observation scores. Educational and Psychological Measurement, 75, 311–337.
Casabianca, J. M., McCaffrey, D. F., Gitomer, D., Bell, C., Hamre, B. K., & Pianta, R. C. (2013). Effect of observation mode on measures of secondary mathematics teaching. Educational and Psychological Measurement, 73, 757–783.
Cook, S. S. (1989). Improving the quality of student ratings of instruction: A look at two strategies. Research in Higher Education, 30, 31–45.
Danielson, C. (1996). Enhancing professional practice: A framework for teaching. Alexandria, VA: Association for Supervision and Curriculum Development.
Danielson, C. (2007). Enhancing professional practice: A framework for teaching. Alexandria, VA: Association for Supervision and Curriculum Development.
Danielson, C. (2013). The 2013 framework for teaching evaluation instrument. The Charlotte Danielson Group.
Entwisle, D. R., & Hayduk, L. A. (1988). Lasting effects of elementary school. Sociology of Education, 61, 147–159.
Flanagan, J. C. (1954). The critical incident technique. Psychological Bulletin, 51, 327–358.
Glazerman, S., Loeb, S., Goldhaber, D., Staiger, D., Raudenbush, S., & Whitehurst, G. (2010). Evaluating teachers: The important role of value-added. Brown Center on Education Policy. Washington, DC: Brookings Institution.
Goe, L. (2007). The link between teacher quality and student outcomes: A research synthesis. National Comprehensive Center for Teacher Quality. Retrieved March 3, 2009 from http://www.tqsource.org/publications/LinkBetweenTQandStudentOutcomes.pdf.
Goe, L., Bell, C., & Little, O. (2008). Approaches to evaluating teacher effectiveness: A research synthesis. National Comprehensive Center for Teacher Quality. Retrieved March 3, 2009 from http://www.tqsource.org/publications/EvaluatingTeachEffectiveness.pdf.
Harari, O., & Zedeck, S. (1973). Development of behaviorally anchored scales for the evaluation of faculty teaching. Journal of Applied Psychology, 58, 261–265.
Harris, D., & Rutledge, S. (2010). Models and predictors of teacher effectiveness: A comparison of research about teaching and other occupations. The Teachers College Record, 112, 914–960.
Hartmann, D. P. (1977). Considerations in the choice of interobserver reliability measures. Journal of Applied Behavior Analysis, 10, 103–116.
Hayes, J. R., & Hatch, J. A. (1999). Issues in measuring reliability: Correlation versus percentage of agreement. Written Communication, 16, 354–367.
Hill, H. C., Charalambous, C. Y., & Kraft, M. A. (2012). When rater reliability is not enough: Teacher observation systems and a case for the generalizability study. Educational Researcher, 41, 56–64.
Jamil, F. M., Sabol, T. J., Hamre, B. K., & Pianta, R. C. (2015). Assessing teachers' skills in detecting and identifying effective interactions in the classroom. The Elementary School Journal, 115, 407–432.
Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2, 130–144.
Kingsbury, F. A. (1922). Analyzing ratings and training raters. Journal of Personnel Research, 1, 377–383.
Landy, F. J., & Farr, J. L. (1980). Performance rating. Psychological Bulletin, 87, 72–107.
Landy, F. J., & Guion, R. M. (1970). Development of scales for the measurement of work motivation. Organizational Behavior and Human Performance, 5, 95–103.
Latham, G. P., Saari, L. M., Pursell, E. D., & Campion, M. A. (1980). The situational interview. Journal of Applied Psychology, 65, 422–427.
Little, O., Goe, L., & Bell, C. (2009). A practical guide to evaluating teacher effectiveness. National Comprehensive Center for Teacher Quality. Retrieved July 18, 2016 from http://files.eric.ed.gov/fulltext/ED543776.pdf.
Martin-Raugh, M. P., Kell, H. J., & Motowidlo, S. J. (2016). Prosocial knowledge mediates effects of agreeableness and emotional intelligence on prosocial behavior. Personality and Individual Differences, 90, 41–49.
McCaffrey, D. F., Yuan, K., Savitsky, T. D., Lockwood, J. R., & Edelen, M. O. (2014). Uncovering multivariate structure in classroom observations in the presence of rater errors. Educational Measurement: Issues and Practice, 34, 34–46.
McCartney, K., Dearing, E., Taylor, B. A., & Bub, K. L. (2007). Quality child care supports the achievement of low-income children: Direct and indirect pathways through caregiving and the home environment. Journal of Applied Developmental Psychology, 28, 411–426.
McCloy, R. A. (2013, August 20). Ramifications of the performance/effectiveness distinction for teacher evaluation. Retrieved from https://www.humrro.org/corpsite/blog/2013-08-20/teacher-evaluation-part-2.
Medley, D. M., & Coker, H. (1987). The accuracy of principals' judgments of teacher performance. The Journal of Educational Research, 80, 242–247.
Motowidlo, S. J., Crook, A. E., Kell, H. J., & Naemi, B. (2009). Measuring procedural knowledge more simply with a single-response situational judgment test. Journal of Business and Psychology, 24, 281–288.
Porter, A., McMaken, J., Hwang, J., & Yang, R. (2011). Common core standards: The new U.S. intended curriculum. Educational Researcher, 40, 103–116.
Pounder, J. S. (2000). A behaviourally anchored rating scales approach to institutional self-assessment in higher education. Assessment & Evaluation in Higher Education, 25, 171–182.
Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88, 413–428.
Schultz, M. M., & Zedeck, S. (2011). Predicting lawyer effectiveness: Broadening the basis for law school admission decisions. Law & Social Inquiry, 36, 620–661.
Schwab, D. P., Heneman, H. G., & DeCotiis, T. A. (1975). Behaviorally anchored rating scales: A review of the literature. Personnel Psychology, 28, 549–562.
Smith, P. C., & Kendall, L. M. (1963). Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47, 149–155.
Thorndike, E. L. (1920). A constant error in psychological ratings. Journal of Applied Psychology, 4, 25–29.
Uncovering multivariate structure in classroom observations in the presence of rater errors. Educational Measurement: Issues and Practice, 34, 34e46. McCartney, K., Dearing, E., Taylor, B. A., & Bub, K. L. (2007). Quality child care supports the achievement of low-income children: Direct and indirect pathways through caregiving and the home environment. Journal of Applied Developmental Psychology, 28, 411e426. McCloy, R. A. (2013, August 20). Ramifications of the performance/effectiveness distinction for teacher evaluation. Retrieved from https://www.humrro.org/ corpsite/blog/2013-08-20/teacher-evaluation-part-2. Medley, D. M., & Coker, H. (1987). The accuracy of principals' judgments of teacher performance. The Journal of Educational Research, 80, 242e247. Motowidlo, S. J., Crook, A. E., Kell, H. J., & Naemi, B. (2009). Measuring procedural Knowledge more simply with a single-response situational judgment test. Journal of Business and Psychology, 24, 281e288. Porter, A., McMaken, J., Hwang, J., & Yang, R. (2011). Common core standards: The new U.S. intended curriculum. Educational Researcher, 40, 103e116. Pounder, J. S. (2000). A behaviourally anchored rating scales approach to institutional self- assessment in higher education. Assessment & Evaluation in Higher Education, 25, 171e182. Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88, 413e428. Schultz, M. M., & Zedeck, S. (2011). Predicting lawyer effectiveness: Broadening the basis for law school admission decisions. Law & Social Inquiry, 36, 620e661. Schwab, D. P., Heneman, H. G., & DeCotiis, T. A. (1975). Behaviorally anchored rating scales: A review of the literature. Personnel Psychology, 28, 549e562. Smith, P. C., & Kendall, L. M. (1963). Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47, 9e155. Thorndike, E. L. (1920). A constant error in psychological ratings. Journal of Applied Psychology, 4, 25e29.