Measurement of teacher performance: A study in instrument development


Teaching & Teacher Education, Vol. 1, No. 1, pp. 63-77, 1985
0742-051X/85 $3.00+0.00 © 1985 Pergamon Press Ltd

MEASUREMENT OF TEACHER PERFORMANCE: A STUDY IN INSTRUMENT DEVELOPMENT

DONOVAN PETERSON, THEODORE MICCERI, AND B. OTHANEL SMITH

University of South Florida

Abstract - The procedures used to produce a research-based teaching evaluation system, containing low-inference indicators of effective and ineffective teacher behavior, included instrument development, content validation, and field testing. An extensive reliability study produced estimates for three types of consistency: intercoder agreement (r = 0.85), stability across teaching situations (r = 0.86), and discriminant reliability among teachers (r = 0.79). The norming component of the study, conducted in 45 schools, generated over 1200 observations of current teacher practice at all grade levels and in various subject areas. The results indicated the substantially generic nature of these behaviors; only grade level and instructional method (interactive, lecture, and independent seatwork) produced meaningful differences among groups of teachers. In addition, the average teacher used most of the effective behaviors (70%) and a few of the ineffective behaviors (8%) during observation. These results support the instrument’s value in teacher training, remediation, and evaluation.

Evaluation of teachers has a varied history consisting of three approaches, each in turn dominating its time. These are evaluation by inspection, by test results, and by observation and judgment. The use of inspectors to evaluate teachers was common practice in America and Europe during the nineteenth century and the early years of the current century. This mode of evaluation lingers even today in the practice of school principals who believe that they can tell the effectiveness of teachers by simply looking at their performance.

Although there is considerable interest today, especially in the United States, in student achievement as an indicator of teacher effectiveness, the approach is not new. In the closing years of the nineteenth century, the British government instituted a merit system in which teachers were evaluated by the achievement of their students as measured by inspector-administered tests (Small, 1972; Travers, 1981). The tests were given in basic subjects, and teachers were paid by the number of pupils who exceeded the national average. The system was soon abandoned. Despite its flaws, this principle is still being advocated in the United States. Among the problems of this method are how to single out the influence of a single teacher when student achievement is influenced by collective teacher impact, how to adjust for student variability in background and aptitude, how to adjust for regression effects, and how to overcome the limitations of achievement tests.

By 1915, the use of rating scales had become so common that the U.S.A.’s National Society for the Study of Education published a yearbook on this subject. The development and use of these instruments became extensive as more and more attention was given to improvement of instruction and as an increasing number of states mandated periodic teacher evaluation. Generally speaking, these scales were seldom tested for reliability, and their validity at best was dubious. The item content of these scales often had little relevance to teacher performance. A recent study of a sample of rating scales found that classroom-performance items ranged from 5% to 40% of the total items (Florida Coalition, 1983a). Most of the items concerned criteria such as personal attributes, interpersonal relations, and compliance with school policies - dimensions only remotely related, if at all, to classroom performance.

The inadequacies of rating scales have been debated for some time. Critics have pointed out that these scales include presage and context variables as well as process variables; that they use high- rather than low-inference variables; that they depend upon consensus about the quality of teacher performance as their knowledge base; and that they combine measurement and evaluation in a single operation, requiring the user of the rating instrument to identify a teacher’s performance and judge its quality at the same time (Medley, Coker, & Soar, 1984). Furthermore, studies of the relation between teacher ratings and student achievement have consistently yielded nonsignificant results, leading one researcher (Medley, 1983) to observe that “no evidence has yet been published that ratings of teacher effectiveness made by supervisors have any relationship to teacher effectiveness, and no one seems even to be asking the question.”

Today there is a trend toward measurement of teachers’ classroom behavior as an alternative to teacher ratings. A recent step in this trend is the Florida Performance Measurement System (FPMS), the development of which is set forth here.

The research reported in this article was conducted under the auspices of the Florida Coalition for the Development of a Performance Measurement System and the Office of Teacher Education of the Florida State Department of Education. Funds were provided by the Coalition and the Florida State Legislature. The authors are solely responsible for the content of this article.

Content Validity: First Approximation

The initial condition of a justifiable performance measurement system is a valid content base. Failure to satisfy this condition increases the probability that reliability estimates will simply describe the consistency of irrelevant measures of teacher effectiveness.

Two purposes guided the development of the FPMS: (a) to develop a set of formative instruments for diagnosing classroom teacher behavior, and (b) to develop an instrument for measuring and evaluating that behavior. With these objectives in mind, the Florida Coalition undertook the first task: to seek an initial approximation to a valid content base for instrument development.

Consensus among educators has been commonly relied upon for valid content in the development of rating instruments. In sharp contrast, the research literature on teaching effectiveness provided the source of the FPMS knowledge base. The literature on research using student achievement and conduct as the primary dependent variables and teaching behavior as the independent variable was surveyed (Florida Coalition, 1983b). A set of 31 concepts of effective teaching behavior was identified and classified into six broad domains of instruction: planning; management of student conduct; instructional organization and development; presentation of subject matter; communication: verbal and nonverbal; and testing: student preparation, administration, and feedback. These 31 concepts embrace 124 indicators of teaching behavior, each of which is defined and exemplified by one or more instances of teacher performance.

The following excerpt from the domain Presentation of Subject Matter illustrates both concepts and indicators. The concept is “presentation of interpretative (conceptual) knowledge,” and it is defined as “teacher performance involved in analyzing and presenting information to facilitate the acquisition of concepts.” The indicators are: “gives definition only,” “gives example(s) only,” “tests example (rule-example),” “identifies attributes,” “distinguishes related concepts,” and “concept induction.” This concept and its indicators were followed by a summary of supportive research and by research on extensions of and exceptions to the concept.

The selection and organization of the knowledge base of the FPMS can be summarily described as follows: independent researchers identified numerous discrete behaviors, but these behaviors were not organized or related to one another in a manner applicable to measuring teacher behavior. The FPMS research team organized the literature into generalizable and logically related components, that is, domains, concepts, and indicators.

When the research findings for the domains were collected and organized, three authorities on teacher effectiveness research were assembled to review the summaries, to identify misinterpretations and omissions, to judge organization and clarity, and to make such criticism as they deemed relevant. Several corrections of research summaries, concepts, and indicators resulted. The research findings on effective teaching and the consensus of the review team constitute the first approximation to establishing the content validity of the FPMS.

Instrumentation

The problem of instrumentation lies primarily in the selection and arrangement of items amenable to systematic observation. Based on the concepts and indicators for each domain, formative (preliminary) observation instruments were constructed, one for each of the six domains. Figure 1 shows an example of one instrument. The formative instruments, as suggested above, were designed to diagnose teacher performance as a basis for remediation. They were subsequently synthesized into a single summative instrument, making it possible to observe a sample of teacher performance in each domain during a single class period. Both the formative and summative instruments are divided into two categories, effective and ineffective behaviors, according to the findings of teacher effectiveness research.

Neither the planning nor the testing domain was included in the summative instrument, because both comprise behaviors that rarely occur in a class situation. These domains can best be assessed by product examination, interviews, or formative instruments applied at appropriate times, for example, while a test is being explained or administered. Because the summative instrument is designed primarily to collect data with which to evaluate teachers for promotion or certification rather than for remediation, major emphasis was placed on the development of this instrument.

Initial Field Test

Following initial development, both formative and summative instruments were refined by field tests. Teachers and administrators (N = 24) participated in the initial test. The primary purposes of this study were to test coding methods, examine item clarity, and test intercoder agreement. Participant training included reading, discussion, knowledge tests, and practice with videotaped lessons.

Three groups of observers each used one of three methods of observation to test the accuracy of the methods: (a) one group coded behaviors by frequency of occurrence, but not on a timed basis as in a category system; (b) a second group observed for two minutes, then coded the frequency of each observed behavior for two minutes, and repeated this cycle for the duration of the session; (c) a third group observed for two minutes, then coded using the sign system, and repeated the same cycle. The codes resulting from each method were then correlated with ‘master observer’ scores. Of the three methods, continuous coding by frequency of observed behavior (Group A) correlated highest with the criterion scores obtained from ‘master observers’. It was subsequently selected as the coding method.

Item analysis and observer feedback were used for item deletion and refinement. For example, abnormally large variance in coding ‘question overload’ was detected. The wording of this item was changed to the expression ‘states long/multiple questions’. Several such modifications were made during an iterative development process.

Figure 1. Management of student conduct (formative instrument; each indicator has a Frequency column).

Effective indicators:
Specifies a rule
Clarifies a rule
Practices rule
Reprimands rule infraction
Stops deviant behavior
Corrects worse deviancy
Desists student causing disruption
Suggests alternative behavior
Attends task and deviancy simultaneously
Attends to two instructional tasks simultaneously
Poses question - selects reciter
Alerts class/calls on one reciter
Alerts non-performers
Ignores irrelevancies/continues on task
Gives short, clear nonacademic directions
Moves whole/subgroup
Praises specific conduct
Praises non-deviant, on-task behavior
Gives low-key, quiet praise
Uses contingency praise
Uses authentic, varied, warm praise
Controls class reaction to misconduct

Ineffective indicators:
Does not specify when rule needed
Does not clarify rule
Does not correct rule infraction
Does not stop deviancy/deviancy spreads
Corrects lesser deviancy
Desists onlooker or wrong student
Uses rough, angry, punitive desists
Uses approval-focused desist
Ignores deviancy, continues task/ignores task, engrosses in deviancy
Ignores other students needing help/drops task, engrosses in intrusion
Selects reciter - poses question
Alerts group - unison response
Ignores nonperformers
Reacts to or interjects irrelevancies/flip-flops/dangles
Overdwells or fragments nonacademic directions
Fragments group movement
Uses general conduct praise
Allows class to reinforce misconduct

By the end of this initial field test, intercoder agreement coefficients on the formative instruments, using the Spearman procedure, ranged for various items from r = 0.60 to r = 0.90, with a mean r = 0.76.

Figure 2. Summative observation instrument (each indicator has a Frequency column).

Effective indicators:
1. Begins instruction promptly
2. Handles materials in an orderly manner
3. Orients students to classwork/maintains academic focus
4. Conducts beginning/ending review
5. Questions: academic comprehension/lesson development - single factual (Dom. 5.0); requires analysis/reasons
6. Recognizes response/amplifies/gives corrective feedback
7. Gives specific academic praise
8. Provides for practice
9. Gives directions/assigns/checks comprehension of homework, seatwork assignment/gives feedback
10. Circulates and assists students
11. Treats concept - definition/attributes/examples/non-examples
12. Discusses cause-effect/uses linking words/applies law or principle
13. States and applies academic rule
14. Develops criteria and evidence for value judgment
15. Emphasizes important points
16. Expresses enthusiasm verbally/challenges students
17. (none listed)
18. (none listed)
19. Uses body behavior that shows interest - smiles, gestures
20. Stops misconduct
21. Maintains instructional momentum

Ineffective indicators (in instrument order):
Delays
Does not organize or handle materials systematically
Allows talk/activity unrelated to subject
Poses multiple questions asked as one, unison response
Poses non-academic questions/non-academic procedural questions
Ignores student or response/expresses sarcasm, disgust, harshness
Uses general, non-specific praise
Extends discourse, changes topic with no practice
Gives inadequate directions/no homework/no feedback
Remains at desk/circulates inadequately
Gives definition or examples only
Discusses either cause or effect only/uses no linking word(s)
Does not state or does not apply academic rule
States value judgment with no criteria or evidence
Uses vague/scrambled discourse
Uses loud-grating, high pitched, monotone, inaudible talk
Frowns, deadpan or lethargic
Delays desist/doesn’t stop misconduct/desists punitively
Loses momentum - fragments non-academic directions, overdwells

Second Field Test

A second field study was conducted involving 16 teachers and administrators. The training procedure remained substantially the same, although the training materials were revised. The major purpose of this study was to further examine items where few, if any, behaviors were being coded. This study revealed a number of items that, although research-based and indicative of effective teacher behavior, were not being coded accurately. The term ‘connected discourse’ was one such item. Connected discourse is defined as “thematically connected discourse that leads to at least one point.” Although observers reliably coded related behaviors such as ‘emphasis,’ ‘verbal challenge,’ and ‘use of vague terms,’ they made unacceptable levels of error in coding behaviors designated as ‘connected discourse’ and ‘use of correct terminology.’ The antithesis of connected discourse is ‘scrambled discourse.’ It is defined as “discontinuous or garbled verbal behavior in which ideas are loosely associated.” Scrambled discourse appears easier to code than connected discourse. ‘Vagueness terms,’ the antithesis of ‘the use of correct terminology,’ were also accurately coded by observers.

The second study removed items with low coefficients of agreement and reworded others to facilitate the correct coding of behaviors. Instrument complexity was thereby reduced, and intercoder agreement coefficients modestly improved.
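
The Spearman coefficients reported for the field tests summarize how closely two observers' frequency counts agree in rank order. A minimal Python sketch of that calculation is given here; the counts, the function names, and the decision to average tied ranks are illustrative assumptions, not data or procedures taken from the FPMS study.

```python
from statistics import mean

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1      # position of the first tied value
        ties = ordered.count(v)           # how many values share this score
        ranks.append(first + (ties - 1) / 2)
    return ranks

def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical counts of one indicator, coded by two observers over six lessons
observer_a = [3, 0, 5, 2, 7, 1]
observer_b = [2, 0, 6, 3, 7, 1]
rho = spearman(observer_a, observer_b)
print(round(rho, 2))
```

Because only rank order enters the coefficient, two observers who differ in how often they code, but who order the lessons alike, still show high agreement.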

Reliability Study

The criteria used in the FPMS reliability study were: (1) Intercoder agreement - the degree to which two or more observers working independently agree upon the recording of an indicator of teacher behavior. Intercoder agreement is viewed as evidence of objectivity. (2) Stability - the degree to which a subject (teacher) exhibits similar behaviors on two or more occasions. (3) Discriminant reliability - the degree to which the instrument consistently ranks teachers on a scale of effectiveness. Consistent with Cronbach (1951), Nunnally (1978), Medley and Mitzel (1963), and Medley, Coker, & Soar (1984), we consider discriminant reliability the most important of these estimates, since without discriminant reliability, high coefficients or values on the other indicators may have nothing to do with behavior differences among teachers.

The tests for reliability were conducted with videotapes of regular classroom instruction. The videotapes were made of nine teachers who prepared and taught two lessons, similar in method and context but different in content, to the same or similar groups of students. Teaching method was held constant for each teacher in order to reduce errors unattributable to the instrument or observers when measuring stability between each teacher’s two lessons. The teachers were selected from a variety of subject areas and levels for purposes of preliminary testing of generalizability. The participating teachers apparently varied reliably in overall level of effectiveness, as determined by the resulting discriminant reliability coefficient.

Observers for the reliability study were arbitrarily assigned to teams of three. Data from these observations were processed on an IBM 370 using on-line (direct access) systems. All data were independently verified and corrected. Reliability estimates were produced using the analysis of variance technique (Medley & Mitzel, 1963; Shrout & Fleiss, 1979; Mitchell, 1979). A three-way ANOVA was conducted using teachers, situations, and teams of observers as independent variables.

Final scales were developed within the instrument: a composite score on the 20 effective items and separate subscale scores for each of the four domains represented in the instrument. As often occurs with observation data, the item scores were predominantly distributed in “J” curves. To avoid overweighting more frequently occurring behaviors in total scores, each item was individually standardized to a mean of 5.0 and standard deviation of 1.0 using the percentile-rank based Normal Standardized Technique (Blom, 1958). Transformed data were then summed into five separate scores for the positive indicators on the summative instrument, and reliability coefficients were computed for the total instrument and for each domain separately.

As can be seen in Table 1, the reliability estimates for the Total scale and for Domains 3 and 4 are relatively high for this type of instrument. The reliability estimates for Domains 2 and 5 are of more questionable adequacy for individual decisions.

The indicators on the summative instrument were also tested separately to determine their ability to discriminate between ‘high scoring’ and ‘low scoring’ teachers. High and low scoring teachers were differentiated by total scale scores on the transformed standardized data. The highest scoring three teachers were placed in Group 1, the lowest scoring in Group 2. Two videotapes were made of each teacher. Group A consisted of the first lesson for each teacher, and Group B of each teacher’s second lesson.
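
The rank-based normal standardization applied to each item before summing can be sketched roughly as follows. The study cites Blom (1958) but does not print the formula; this sketch assumes the commonly used Blom proportion (r - 3/8)/(n + 1/4), averages tied ranks, and uses invented item counts, so it illustrates the idea rather than reproducing the authors' computation.

```python
from statistics import NormalDist

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1      # position of the first tied value
        ties = ordered.count(v)           # how many values share this score
        ranks.append(first + (ties - 1) / 2)
    return ranks

def blom_standardize(values, target_mean=5.0, target_sd=1.0):
    """Map raw counts to normal scores using Blom's proportion (assumed formula)."""
    n = len(values)
    z = NormalDist()  # standard normal, used only for its inverse CDF
    return [
        target_mean + target_sd * z.inv_cdf((r - 0.375) / (n + 0.25))
        for r in average_ranks(values)
    ]

# Hypothetical J-shaped frequency counts for one item across nine lessons
raw_counts = [0, 0, 0, 1, 1, 2, 3, 6, 14]
scores = blom_standardize(raw_counts)
print([round(s, 2) for s in scores])
```

Only the rank of a count matters here: the extreme count of 14 would receive the same score if it were 7 or 70, which is how such a transformation keeps a few very frequent behaviors from dominating a summed total. With small samples the resulting scores have a mean of 5.0 only when ranks are symmetric, and a standard deviation somewhat below 1.0.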

Table 1
Reliability Estimates: Summative Instrument, Effective Indicators
One Team of Three Observers, Nine Teachers, Two Lessons
[Table body not recoverable; surviving column header: Type of reliability.]
Note. Based on logical analysis, items 1, 10, 20, and 21 were combined for the Domain 2 subscale, and items 1 and 10 were excluded from the Domain 3 subscale, to stabilize reliability estimates.

An item’s tendency to accurately distinguish between the high and the low scoring teachers was estimated by making t-tests of the differences in mean scores on the item between Groups 1 and 2, and an item’s tendency to erroneously discriminate between two lessons by the same teacher was estimated by t-tests of the differences in mean scores on the item between Groups A and B. These tests indicated which items contribute most to identifying high and low scoring teachers - that is, to discriminant reliability. Results showed that the majority of individual items discriminated between high and low scoring teachers, while not discriminating between the same teacher on two separate occasions.
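
The item-level comparison described above can be illustrated with a two-sample t statistic. The article does not say which form of the t-test was used, so the pooled-variance, independent-samples version below is an assumption, and all scores are invented for illustration.

```python
from statistics import mean, variance

def pooled_t(sample_1, sample_2):
    """Two-sample t statistic with pooled variance (assumed form of the test)."""
    n1, n2 = len(sample_1), len(sample_2)
    pooled_var = (
        (n1 - 1) * variance(sample_1) + (n2 - 1) * variance(sample_2)
    ) / (n1 + n2 - 2)
    standard_error = (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
    return (mean(sample_1) - mean(sample_2)) / standard_error

# Hypothetical standardized scores on one item
high_scoring = [6.1, 5.8, 6.4, 5.9]   # Group 1: high scoring teachers
low_scoring = [4.2, 4.6, 3.9, 4.4]    # Group 2: low scoring teachers
first_lesson = [5.2, 4.9, 5.4, 4.7]   # Group A: each teacher's first lesson
second_lesson = [5.1, 5.0, 5.3, 4.8]  # Group B: each teacher's second lesson

t_between_teachers = pooled_t(high_scoring, low_scoring)
t_between_lessons = pooled_t(first_lesson, second_lesson)
print(round(t_between_teachers, 2), round(t_between_lessons, 2))
```

An item behaving as the study hoped shows a large t between Groups 1 and 2 and a t near zero between Groups A and B.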

Number of Observers

Another question addressed the effect on reliability estimates when multiple observers code the same teacher. Reliability estimates were based upon randomly selected teams of three observers observing nine teachers teaching two lessons each; teams of two observers observing nine teachers teaching two lessons each; and single observers observing nine teachers teaching two lessons each. The data in Table 2 show that there is little difference between reliability estimates based upon teams of two or three observers. However, a considerable drop in reliability is evident, across all scores, when only one observer is used. We therefore concluded that summative evaluations of teachers using the FPMS should be based upon scores from at least two different observers.

Table 2
Levels of Reliability Obtained for One, Two, and Three Observers

Type of        Number of    Total scale   Domain 2   Domain 3   Domain 4   Domain 5
reliability    observers    20 items      4 items    9 items    4 items    3 items

Intercoder     Three        0.85          0.66       0.87       0.85       0.63
               Two          0.82          0.65       0.85       0.81       0.54
               One          0.64          0.50       0.67       0.71       0.30

Stability      Three        0.86          0.64       0.85       0.88       0.42
               Two          0.81          0.55       0.80       0.85       0.54
               One          0.70          0.26       0.77       0.72       0.37

Discriminant   Three        0.79          0.61       0.80       0.81       0.42
               Two          0.75          0.49       0.76       0.77       0.38
               One          0.55          0.25       0.58       0.64       0.15

Development of Norms

Observing teachers in practice is fundamental to converting a measurement instrument, such as the FPMS, to an evaluation system. A study adequate for norm development may:

(1) provide a basis for setting preliminary standards based upon observed behavior of teachers in practice;
(2) identify factors - grade level, subject, classroom conditions, student variables, teacher variables, and instruction format variables - associated with different frequencies of teacher behaviors in classroom instruction;
(3) determine whether variables that significantly affect student achievement also alter teacher behavior in varying contexts;
(4) identify clusters of teacher behavior common to specific contexts of teaching.

Selection of Sample

Factors contributing to sample selection of schools for the norming study included size (number of students), grade level, socioeconomic status (SES) of students, subject area, and school location (urban or rural). A total of 45 schools in 13 districts within the state of Florida participated. These districts varied in size, urban/rural location, and socioeconomic status. The schools observed ranged in size from 3000 students in a high school to 370 students in an elementary school. The socioeconomic status of the students ranged from 2% to 84% low SES, as determined by data from the state’s free-lunch program.

Observations of 468 elementary teachers (grades K-5) were conducted in 17 schools, 226 middle school teachers (grades 6-8 or 9) in nine schools, 528 high school teachers (grades 9 or 10-12) in 14 schools, and 11 teachers in a center for adjustment. Cluster sampling - observation of all teachers in each school - was used to assure that all subject areas would be represented.

A total of 1223 observations were obtained from 117 observers. All observers were trained and certified by an observation-reliability check, using videotapes of regular classroom instruction. Observations were conducted during a three-month period. All data were collected using standard FPMS instruments and forms. These included the FPMS summative instrument and the frame-factor and instructional-format questionnaires used to collect information on variables such as grade level, subject, class size, SES, and type of lesson.

Sociodemographics of the Norming Sample

As a result of the cluster sampling and the large number of schools involved, the sample closely matched the general distribution of various sociodemographic, subject, and lesson factors in Florida public schools. Each grade level (K-12) was represented by at least 50 observations. Although several subject areas (e.g. agriculture, n = 7) were represented by very few cases, most had at least 20 lessons included in the sample, and major subject areas (language, math, science, etc.) were each represented by 100 or more observations.

Self-contained classrooms, defined as one teacher in one room with one class, were by far the most common (n = 989). However, pods or open classes, found in elementary schools, were also fairly common (n = 84). All other classroom types were rare by comparison. Of all teachers, 93% perceived their classrooms as adequate or excellent for instruction, indicating that classroom conditions did not generally represent a barrier to instruction from the teacher’s perspective. The average class size was 23.

All academic and socioeconomic levels of students were represented in the sample. However, only about one student in each class was perceived to come from the top 25% in SES, while almost five students in each class were perceived to come from the lowest 25%. Of all teachers, 75% believed their classes were void of gifted students, while 54% reported no learning-disabled students in their classes.

The teacher sample was largely composed of females (n = 838, 72%), particularly at the elementary level (n = 410, 88%). The majority of teachers were white-nonHispanic (87%), with 10.5% black-nonHispanic and 0.5% Hispanic. Only six teachers defined themselves as members of some ethnic group other than these. The median level of experience among teachers was 11.36 years, with 30 teachers having less than one year of experience, and an equal number having 30 or more years of experience. The majority of teachers held bachelor’s (56.7%) or master’s (39.2%) degrees. Only ten lacked a four-year degree.

Instructional Format

Teachers have little control over sociodemographic variables. We call these frame factors, or the context within which teachers function. Instructional-format variables differ from the frame factors in that teachers have some choice in their selection. These include grouping of students, instructional method, lesson type, and length of lesson.

Three methods of class grouping for instruction were common: total group instruction, subgroup or independent instruction, and a combination of the two. Of the observations, 60% were made of total group instruction, 16% of subgroup or independent instruction, 16% of a combination of the two, and 7% of miscellaneous grouping patterns.

Teaching method was recorded as lecture, discussion or interaction, recitation, independent work, or any combination of these. Over 80% of all classes observed involved some form of discussion or interaction or recitation. Most teachers used combinations of methods.

The initial hypothesis was that the FPMS summative instrument would yield different frequencies of codes depending upon the type of lesson being taught. Data were therefore gathered to determine whether observations were conducted during an introductory or developmental lesson, a review or practice lesson, or a combination of the two. The most frequent lesson type observed included both introductory or developmental and review or practice, that is, diversified lessons (413 observations). Introductory or developmental lessons constituted 26.6% of the observations and review or practice constituted 30.8%. Observations during administration of tests were conducted on 13 occasions but were considered inappropriate for analysis.

Observers were instructed to code an entire lesson from start to finish.
Variations in length

of the classes observed included 15-29 minute classes (118 observations), 30-39 minute classes (548 observations), 40-49 minute classes (228 observations), and 5G60 minute classes (329 observations). Because the FPMS is a frequency-based system, variations in observation time were standardized as follows: (Item Score/ Length of Observation) x (Median Length of Observation). All of the sociodemographic and instructional format variables were used in analyses of scores on the summative instrument. Scaling Two methods of scaling the instrument were tested: (a) standardizing observations, as in the reliability study, and (b) quartile-based scores. In standardizing, absolute raw scores by item were transformed to become ‘normal standardized scores.’ This procedure reduces the effects of long-tailed distributions on summed-item scores. Standardizing scores, however, posed two serious problems that we were unable to resolve. First, when applied to teacher evaluation, this method requires each observer to code a minimum of ten teachers, preferably in different subject areas and at different levels. It is impractical, however, to expect all observers in school settings to code ten teachers. Second, this procedure orders subgroups of teachers (those observed by one or more persons) independently of all other groups of observed teachers. Within any given group, some will score high while others will score low. This phenomenon leaves open the possibility that a teacher might score considerably lower against one group than against another group. If this possibility occurs, the scaling method would reduce confidence in the system’s validity. For these and other reasons, a second scaling technique was tested, namely, quartile-based scores, described below. The major advantage of quartile-based scaling is the direct relationship it allows between individual evaluation scores and a normative data-base. 
This relationship provides a basis for judging the quality of teaching, as defined by the observation instru-

72

DONOVAN

PETERSON

ment, over a number of years. The base year for the FPMS summative instrument is 1983-1984. Base years serve as the standard against which subsequent years may be compared. The quartile-based scores were obtained as follows. The normative study indicated that several items on the summative instrument occurred infrequently (50% of the effective items had median occurrence below 1 SO). This fact eliminates the use of multipoint scales. In addition, some observers coded more frequently than others, a matter of pacing. Given these factors, frequency-based scaling would tend to overweight the more frequently occurring items. For these reasons, a three-point scale for each item based upon the semi-interquartile range was tested. An individual in the top 25% of frequency for an item receives a score of three points; in the 25% to 75% range, a score of two; and in the bottom 25%) a score of one point. Using this method, one finds that teachers scoring high on the FPMS tend to be those exhibiting multiple behaviors in many categories, while teachers scoring low exhibit few behaviors in few categories. This scaling was then applied to the reliability data (19821983 report) with positive results. It appeared to

increase the variance among teachers (ranking them in almost exactly the same order as the original standard normalized method), without simultaneously increasing the variance among either teams of observers or lessons. This scaling satisfies the purposes of norming: it directly relates to the normative data; it appears valid for the instrument’s purposes (many behaviors in many categories yield high scores, few behaviors in few categories yield low scores); and it shows a reliability level equal to or greater than that obtained from the standard normalized scores (see Tables 3 and 4). Given these advantages of the quartile-based method, and the relatively high estimates of reliability, the norming data were subsequently analyzed using this method.

Data Analysis

Two procedures (regression analysis and analysis of variance) were used to determine (a) the degree to which the FPMS summative instrument is generic and generalizable to K-12 teaching contexts, and (b) the number and nature of possible norm groups required when using this instrument.

Table 3
Reliability Estimates for Five Separate Scales Using Quartile-based Scores: Effective Indicators
(3 Teams of Observers, Two Lessons, Nine Teachers)

Type of reliability   Total scale   Domain 2    Domain 3    Domain 4    Domain 5
                      (20 items)    (4 items)   (9 items)   (4 items)   (3 items)
Intercoder            0.98          0.94        0.98        0.98        0.83
Stability             0.92          0.77        0.90        0.95        0.56
Discriminant          0.91          0.77        0.90        0.94        0.52

Table 4
Reliability Estimates for Five Negative Scales (Ineffective Indicators)
(3 Teams of Observers, Two Lessons, Nine Teachers)

Type of reliability   Total scale   Domain 2    Domain 3    Domain 4    Domain 5
                      (20 items)    (4 items)   (9 items)   (4 items)   (3 items)
Intercoder            0.98          0.98        0.98        0.97        0.98
Stability             0.85          0.75        0.87        0.18        0.91
Discriminant          0.84          0.74        0.86        0.17        0.91
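The time standardization and the three-point quartile-based item scale described in the Scaling section can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' scoring program: the function names are hypothetical, the quartile cut points are computed with Python's `statistics.quantiles` (one of several quartile conventions), and the treatment of frequencies falling exactly on a cut point (assigned to the middle category) is an assumption, since the text does not specify it.

```python
from statistics import quantiles


def standardize_for_time(item_count, minutes, median_minutes):
    """(Item Score / Length of Observation) x (Median Length of Observation)."""
    # Multiply before dividing to limit floating-point rounding.
    return item_count * median_minutes / minutes


def quartile_cutoffs(norm_frequencies):
    """Return (Q1, Q3) for one item, taken from the normative observations."""
    q1, _median, q3 = quantiles(norm_frequencies, n=4, method="inclusive")
    return q1, q3


def quartile_score(frequency, q1, q3):
    """Three-point item score: 3 = top 25%, 2 = middle 50%, 1 = bottom 25%.

    Assumption: values exactly equal to Q1 or Q3 fall in the middle category.
    """
    if frequency > q3:
        return 3
    if frequency < q1:
        return 1
    return 2


def total_score(frequencies, cutoffs):
    """Sum the item scores; `cutoffs` holds one (Q1, Q3) pair per item."""
    return sum(quartile_score(f, q1, q3) for f, (q1, q3) in zip(frequencies, cutoffs))


if __name__ == "__main__":
    # Hypothetical normative frequencies for a single item.
    norm = [0, 1, 2, 3, 4, 5, 6, 7, 8]
    q1, q3 = quartile_cutoffs(norm)
    # A 20-minute observation rescaled to a hypothetical 30-minute median.
    rescaled = standardize_for_time(6, minutes=20, median_minutes=30)
    print(rescaled, quartile_score(rescaled, q1, q3))
```

Because each item contributes only one to three points, a frequently occurring item cannot dominate the total, which is the property the text cites as the reason for preferring this scale to raw frequencies.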


Regressions were run separately from all continuous independent variables (all student variables, most school and classroom variables, and teacher experience) to each item on the summative instrument, to a total score of each domain (effective and ineffective), and to each effective and ineffective total score. These analyses were conducted first upon a randomly selected subsample (n = 670) of the total group. Significant and meaningful results were then cross-validated upon the second subsample (n = 553). The regression analyses indicated no important relationship to item, domain, or total scores for any of the continuous independent variables.

ANOVAs were then run on total scores for all the categorical independent variables (grade level, lesson format, subject area, classroom type, and educational level) to determine if main effects existed separately for these variables. The ANOVAs were run first on sample 2 (n = 553) and then tested in a factorial design on sample 1 (n = 670) against the hypothesized most important factor (lesson format: instructional method). In these analyses, three considerations were of importance:

(1) Since several ANOVAs were being run, the significance level acceptable for continued consideration was set at 0.01.

(2) Owing to the large ns (500 or more in some analyses), the meaningfulness of differences determined to be significant is important. ANOVAs based on 500 or more cases tend to have significant results for means that are only minimally different. A criterion of at least one-third standard deviation, or in this case 2.15 points difference between groups, for the effective total score, was set as a minimum for further consideration. This assures that a mean difference of only a single point on a total score will not put individuals in different norm categories.

(3) At the other end of the continuum, some of the subgroups in the analysis have very small ns. Unusually high- or low-frequency observations tend to skew distributions when this is the case. Therefore, at least 75 observations had to be obtained for any given subgroup in order to accept results as valid for norming purposes.
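The three considerations above amount to a screening rule that can be sketched as follows. This is an illustrative sketch only: the function and constant names are hypothetical, the p value is assumed to come from an ANOVA run separately in a statistical package, and applying the sample-size minimum to both groups being compared is an assumption.

```python
from statistics import mean

ALPHA = 0.01            # significance level stated in the text
MIN_DIFFERENCE = 2.15   # one-third standard deviation on the effective total score
MIN_OBSERVATIONS = 75   # minimum subgroup size for norming purposes


def warrants_further_consideration(p_value, group_a, group_b):
    """Apply the three screening criteria to a pair of subgroups.

    p_value -- result of an ANOVA computed elsewhere (assumption)
    group_a, group_b -- lists of effective total scores for two subgroups
    """
    significant = p_value < ALPHA
    meaningful = abs(mean(group_a) - mean(group_b)) >= MIN_DIFFERENCE
    enough_data = min(len(group_a), len(group_b)) >= MIN_OBSERVATIONS
    return significant and meaningful and enough_data


if __name__ == "__main__":
    # Hypothetical scores: a 3-point mean difference with 80 observations each.
    print(warrants_further_consideration(0.001, [40] * 80, [37] * 80))
```

Requiring all three conditions at once guards against the two failure modes the text names: trivially small differences that reach significance in large samples, and distributions distorted by a handful of outlying observations in small subgroups.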


Norming Study Results

Norming studies of this nature examine the relationship of independent variables (e.g. grade level, subject area, sex, ethnicity) to one or more dependent variables - in this case the effective and ineffective scores on the FPMS summative instrument. This relationship is determined by similarities or differences of dependent variable means and distributions across subgroups of independent variables. Differences in mean scores, or distribution shapes, are the basis upon which normed groups are created.

Applying quartile-based scores to the norming observation scores on the effective side of the FPMS summative instrument allows a range from 20 to 59. Scores on the norming observations actually ranged from 20 to 57. Scores on the ineffective side of the instrument can range from 19 to 38. The actual range in the norming study spanned 24-38.

As already noted, a criterion of 0.33 standard deviation from the grand mean score (s = 6.86 × 0.33 = 2.15) was set as the acceptable level of variation within norm groups, and a minimum sample size of approximately 75 had to be obtained for consideration as a separate norm group. These criteria were set to avoid false effects of outlying observations on any given distribution caused by a small n in the subgroup. The findings of this study were, therefore, analyzed by first considering sample size (75 or more observations) and then differences in subgroup mean scores to see whether they exceeded 0.33 standard deviation (2.15 points). Then, if other factors were similar, either separate norm groups were created or subgroup scores were adjusted using a standard factor.

The first finding was that teachers on the average were coded in far fewer ineffective categories (1.5) than in effective categories (14.3). The range of both effective and ineffective scores allows for a high level of discrimination. Some teachers, although very few, demonstrated high frequencies of ineffective behavior while most demonstrated little or no ineffective behavior. As already noted, analyses were conducted


on the relationships between independent variables (frame factors and instructional methods) and dependent variables (effective and ineffective scores) using regression analysis, analysis of variance, and the criteria set forth above.

For the ineffective side of the FPMS summative instrument, there was no evidence that separate norm groups were needed or that standard factors were required to adjust subgroup scores. All teachers appeared to be part of the same norm group for the ineffective side of the FPMS summative instrument.

Two factors emerged as being of primary import for norming on the effective side: grade level and instructional method. Analysis of grade level showed significant differences between elementary (K-5) and post-elementary (6-12) teachers as they were coded on the effective side of the instrument. The second factor of importance in norming proved to be instructional method. Observations coded on the effective side of the summative instrument at the 6-12 level varied significantly depending upon teaching method, namely, lecture, interaction, or independent seat or lab work. Rather than creating separate norm groups, score adjustments for the two lower-scoring methods proved to control adequately for this factor. Thus, only two norm groups result, elementary grades K-5 and post-elementary grades 6-12, based upon adjusted scores.

This outcome is interpreted as meaning that the ranges and mean scores on the FPMS summative instrument are not significantly different across subject areas, such as language arts, math, science, social studies, and practical arts; nor do they vary significantly by the type of classroom facility. Teacher characteristics, such as level of degree held, experience, sex, or race, do not influence scores in an important fashion; nor do student characteristics, such as number of boys or girls in class or even the socioeconomic status of students.

One potentially interesting trend emerged. Although statistically nonsignificant, the data indicate that beginning teachers demonstrated a wider range of scores than experienced teachers, and the more experience, the narrower the range became.
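The resulting two-group scheme with method adjustments at the 6-12 level might be sketched as follows. Several details here are assumptions made only for illustration: the text does not report the size of the adjustments, whether they are additive, or which two methods scored lower, so the adjustment table is supplied by the caller and the values in the usage example are placeholders, not findings of the study.

```python
def norm_group(grade):
    """Map a grade (0 = kindergarten, 1-12) to one of the two norm groups."""
    if not 0 <= grade <= 12:
        raise ValueError("grade must be between 0 (K) and 12")
    return "K-5" if grade <= 5 else "6-12"


def adjusted_effective_score(raw_score, grade, method, adjustments):
    """Return the effective total score used for norming.

    adjustments -- mapping from instructional method to an additive
                   correction (hypothetical form; supplied by the caller)
    Only 6-12 observations are adjusted, following the text.
    """
    if norm_group(grade) == "K-5":
        return raw_score
    return raw_score + adjustments.get(method, 0.0)


if __name__ == "__main__":
    # Placeholder adjustment values; the paper does not report them.
    placeholder = {"lecture": 2.0, "seatwork": 3.0}
    print(norm_group(9), adjusted_effective_score(30, 9, "lecture", placeholder))
```

Adjusting scores rather than splitting the 6-12 group by method keeps each norm group large enough to satisfy the 75-observation minimum.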

Converting Measurement to Evaluation Scores

This norming study provides the basis upon which data can be converted to scores for evaluating teachers. This claim is based on norming study results that support the following conclusions:

(1) The majority of teachers regularly use several of the indicators of the FPMS summative instrument in their daily classes (75% use 12 or more in a lesson).

(2) All groups and subgroups of teachers show considerable variation in the use of both positive and negative indicators. This variation occurs in both the number of indicators and the frequency of specific indicators.

(3) The generic nature of these behaviors across all forms of teaching appears strongly supported.

(4) It is possible to compare the frequency and distribution of behaviors exhibited by an individual teacher to those found among the normed population of teachers.

Considering the preceding propositions, we find sufficient basis for the evaluation of individual teacher performance within the limits of measurement error for the FPMS summative instrument. The semi-interquartile range appears to be an appropriate and reliable method for scaling items on the FPMS summative instrument. Total scores created by summing these item scores appear to create considerable variation among teachers in all subpopulations. Teacher scores are then converted to normed scores to determine each individual’s comparative score. Standards can then be set to describe variations in performance. The example in Table 5 is offered as it applies to the FPMS. If, for example, the standard is set at 30% on the norm scale, then 49% of the teachers observed in the norming study would receive average classroom performance evaluations. This standard applies only to classroom instruction. Some purposes of the evaluation may require that other criteria also be considered, for example, experience, degrees, attendance, and subject area examination. Each criterion will screen out a number of teachers. This


Table 5. Probability [...] and ineffective [...]. Percentile standards: 20%, 30%, 40%, 50%.

implies that if the standard observation is too high, for whatever example, retention, if the standard

on the performance few any teachers the reward for or merit is too may qualify.
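The conversion of a total score to a normed score, and the effect of a chosen standard, can be sketched as follows. The percentile-rank definition used here (percentage of norm-group scores strictly below the teacher's score) is one common convention and is an assumption. The second function reflects one reading that is consistent with the 49% figure quoted for a 30% standard, namely that the standard must be met on both the effective and ineffective sides and that the two are treated as independent (0.70 × 0.70 = 0.49); the text does not state this, so it should be read as an interpretation only.

```python
from bisect import bisect_left


def percentile_rank(score, norm_scores):
    """Percentage of norm-group scores strictly below `score`."""
    ordered = sorted(norm_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)


def meets_standard(score, norm_scores, standard_pct):
    """True when the score is at or above the chosen percentile standard."""
    return percentile_rank(score, norm_scores) >= standard_pct


def share_meeting_both(standard_pct):
    """Expected share meeting the standard on two independent scales.

    Assumption: independence of the effective and ineffective sides.
    """
    p = 1.0 - standard_pct / 100.0
    return p * p


if __name__ == "__main__":
    norm = list(range(1, 11))  # hypothetical norm-group total scores
    print(percentile_rank(5, norm), meets_standard(5, norm, 30))
    print(round(share_meeting_both(30), 2))
```

For the ineffective side, where lower scores are preferable, the comparison in `meets_standard` would be reversed; that variant is omitted for brevity.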

Discussion A common practice in noted earlier, is evaluation. observation is an example

as and an of

to individual perceptions of quality of The FPMS requires that observation and coding be done first, and that evaluations then be made by to norms similar to for standardized achieveThe development

of

FPMS has produced for which and reported and for which estimates of and reliability In this developmental and indicators of and ineffective teacher behavior have of training persons to make observations using this system has been developed tested, and the system has and found to be within generally acceptaranges of and professional tenets of teacher evaluation should be satisfied through research of this nature. They include:

(2) Content validity - use of of items to clear and representative that connect (3) Reliability can made objectively, with consistency across and varying contexts, within of error, and the results of different qualities. Teacher to this of scrutiny should satisfy most standards for measuring teacher behavior. ultimate of a teacher evaluation system is substantiate the claim if a teacher receives a high score on evaluation system, the teacher’s of teachers in same context that receive a low score. Evidence on this claim To make the system must be by correlating teacher scores

The FPMS is in the instruments, and summative, may traced to literature on and there found to the exception of concepts) to for achievement or In one sense, then, has already on item-by-item basis, for this system. of experiments designed to test the causal of processproduct relationships in classroom settings in

(1) Job relatedness - a legal obligation to on the basis of what do.

deportment

achievement

and


Mitman, 1978; Emmer, Sanford, Clements, & Martin, 1982; Evertson, Emmer, Sanford, & Clements, 1983; Good & Grouws, 1979; 1981; Stallings, Needels, & Stayrook, 1978). Nevertheless, the ultimate test for the FPMS, as for any other system, depends upon a significant connection between evaluations of teachers and measures of student outcomes recognized as goals of the schools in which it is applied.

The main ingredients required for validation studies are valid and reliable measures of both teacher performance and student outcomes. This paper has described one attempt to develop the former. Lack of space and the needed experience preclude treatment in the present paper of the development of student outcome measures. It seems appropriate, however, to mention some of the problems encountered in selecting these measures. Student achievement examinations are the most widely used measures of student outcome. We have encountered the following problems, not all of which are unique to our work, with student achievement examinations:

(1) Common use. Although some achievement tests are more popular than others, each school district (at least in Florida) selects its own battery of examinations. Some tests are comparable, while others are not. Predictive validity studies require comparable product measures.

(2) Content validity. If the outcome measure does not test what is being taught, any connection between how well the teacher performs and student outcomes will be an accident. Content validity must, therefore, be established for an achievement test to be used in a predictive validity study.

(3) Extraneous effects. Although regression analysis may control for some of the previous growth in achievement (e.g. previous experience and the effects of former teachers), increase in achievement attributable to a single teacher, using current methods of measurement, may be erratic and unreliable. Measures of student outcome used to evaluate individual teachers must be sufficiently reliable for that purpose.

(4) Practicality. Commercially available achievement tests can be expensive. School districts in Florida administer these tests only in selected grade levels. The estimation of predictive validity should be conducted at several grade levels.

Even when the adequate predictive validity of a teacher evaluation system has been established, the teacher evaluation system should be monitored and validated frequently against student outcome measures. Yet this monitoring must be cost effective. These are our concerns regarding the design of studies and the student-outcome measures needed to determine the predictive validity of the FPMS.

References

Anderson, L. M., Evertson, C. M., & Brophy, J. E. (1979). An experimental study of effective teaching in first-grade reading groups. Elementary School Journal, 79, 193-223.

Blom, G. (1958). Statistical estimates and transformed beta variables. New York: John Wiley and Sons.

Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: The American College Testing Program.

Coker, H., Medley, D. M., & Soar, R. (1980). How valid are expert opinions about effective teaching? Phi Delta Kappan, 62, 131-134, 149.

Crawford, J., Gage, N. L., Corno, L., Stayrook, N., & Mitman, A. (1978). An experiment on teacher effectiveness and parent-assisted instruction in the third grade (Vols. 1-3). Stanford, CA: Stanford University, Program on Teaching Effectiveness, Center for Educational Research at Stanford.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Emmer, E. T., Sanford, J. P., Clements, B., & Martin, J. (1982). Improving classroom management and organization in junior high schools: an experimental investigation. Austin, TX: University of Texas, Research and Development Center for Teacher Education.

Evertson, C. M., Emmer, E. T., Sanford, J. P., & Clements, B. S. (1983). Improving classroom management: an experiment in elementary school classrooms. Elementary School Journal, 84, 173-188.

Florida Coalition for the Development of a Performance Measurement System (1983a). A study of measurement and training components specified in the Management Training Act. Tallahassee, FL: Office of Teacher Education, Certification, and Inservice Staff Development.

Florida Coalition for the Development of a Performance Measurement System (1983b). Domains: knowledge base of the Florida Performance Measurement System. Chipley, FL: Panhandle Area Educational Cooperative.

Good, T. L., & Grouws, D. A. (1981). Experimental research in secondary mathematics classrooms: Working with teachers. Columbia, MO: University of Missouri.

Good, T. L., & Grouws, D. A. (1979). The Missouri mathematics effectiveness project: an experimental study in fourth-grade classrooms. Journal of Educational Psychology, 71, 355-362.

Medley, D. M. (1983). Teacher effectiveness. In H. E. Mitzel (Ed.), Encyclopedia of educational research (5th ed.), pp. 1894-1903. New York: Macmillan.

Medley, D. M., Coker, H., & Soar, R. S. (1984). Measurement-based evaluation of teacher performance. New York: Longmans.

Medley, D. M., & Mitzel, H. E. (1963). Measuring classroom behavior by systematic observation. In N. L. Gage (Ed.), Handbook of research on teaching, pp. 247-328. Chicago, IL: Rand McNally.

Mitchell, S. K. (1979). Interobserver agreement, reliability, and generalizability of data collected in observational studies. Psychological Bulletin, 86, 376-390.

National Society for the Study of Education. (1915). Methods for measuring teachers’ efficiency (14th Yearbook, Part II). Bloomington, IL: Public School Publishing Company.

Nunnally, J. C. (1978). Psychometric theory. New York: McGraw-Hill.

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.

Small, A. A. (1972). Accountability in Victorian England. Phi Delta Kappan, 53, 438-439.

Stallings, J., Needels, M., & Stayrook, N. (1978). How to change the process of teaching basic reading skills in secondary schools: phase II and phase III. Menlo Park, CA: SRI International.

Travers, R. M. W. (1981). Criteria of good teaching. In J. Millman (Ed.), Handbook of teacher evaluation, pp. 14-22. Beverly Hills, CA: Sage Publications.

Received June 27, 1984