Observation as a tool for evaluation of implementation

STUDIES IN EDUCATIONAL EVALUATION

Volume 1, No. 2

Summer 1975

EVALUATION STUDIES

OBSERVATION AS A TOOL FOR EVALUATION OF IMPLEMENTATION¹

GAEA LEINHARDT

Learning Research and Development Center, University of Pittsburgh

In the United States in the past 15 years there has been a marked effort to improve education at all levels, with special emphasis at the preschool and elementary grades. This effort has resulted in a proliferation of new subject matter sequences as well as more global educational alternatives such as the "open classroom." The question now being raised is whether or not the programs have been successful; that is, requests are being made to evaluate the effects of educational innovation. The form which the evaluations take has been largely a function of the nature of the innovations. Innovations which focus on the process of education tend to emphasize classroom descriptions of attitude, climate, and interaction patterns, while innovations which focus on academic improvement tend to emphasize positive changes in standardized subject matter tests.

The purpose of this paper is to demonstrate the need for and means of evaluating implementation of educational innovations. It presents information which can broaden the interpretation and use of outcome measures on standardized tests. Measures of implementation can both clarify the nature of the educational process and demonstrate the relationship of that process to observed achievement. Several assumptions are made. First, educational innovations need to be evaluated not only for the obvious economic reasons, but also in order to provide clearer insight into areas which need improvement. Second, the reporting of educational outcomes without relating the outcomes to the innovative process does not constitute a meaningful evaluation.
Third, an appropriate way to evaluate educational innovations is to measure both input or antecedent variables and process or transaction variables, and to use those measures to explain or predict outcomes (Stufflebeam, 1971; Stake, 1967; Cooley, 1971). The main body of this paper is concerned with indicating what type of information is useful for measuring implementation variables and how to assure that those measures will be credible.

SETTING

The Learning Research and Development Center (LRDC) is currently involved in evaluating its educational program in the Follow Through schools. LRDC is one

¹ The research reported herein was supported in part by a grant from the United States Office of Education to the Learning Research and Development Center. The opinions expressed do not necessarily reflect the position or policy of the Office of Education and no official endorsement should be inferred.


of 22 sponsors in the nationwide Follow Through program. The Learning Research and Development Center's Instructional Model² is present in seven Follow Through sites in kindergarten through third grades; each site consists of from two to seven schools. The evaluation effort described here focuses on the second grade classrooms at four established sites, those sites which have had the program at least one year. The input data, which describe the entering aptitude of students, consist of the Lorge-Thorndike Cognitive Abilities Test, a general abilities test. The output data consist of measures on the Wide Range Achievement Test (WRAT). The data for implementation consist of descriptions and measures of the various dimensions of the classroom obtained from an instrument which was specifically designed for that purpose.

All investigations which take place in a natural setting have some unique restrictions and advantages associated with them. The advantages are the tremendously increased credibility and generalizability of the information obtained. Clearly, if one can demonstrate that a program can be implemented and that the implementation improves performance in such widely differing settings as Follow Through, one has built a very strong case for the program. The disadvantages, however, are also very great; they focus on the following three areas: the geographic location of the classrooms (they are widely dispersed nationally); the staffing at each site (it varies in terms of the availability and willingness of its members to engage in evaluation activities); and the record keeping process (no permanent or consistent records of testing and prescription are normally kept).

THE DEVELOPMENT OF AN IMPLEMENTATION INSTRUMENT

Figure 1 shows the sequence by which an instrument for measuring the implementation of the LRDC Instructional Program was developed and tried out. I view the steps as necessary and sufficient for the development of an implementation instrument; however, I do not view this as a unique solution to the problem of such development. The diagram is read in the traditional manner and will not be discussed in detail, but it will be referred to throughout the paper. The first four steps in the figure generated specific information about the LRDC program and about the underlying theories of education on which the model was built.

² The Learning Research and Development Center's Instructional Model (LRDC IM) is designed as an individualized early learning curriculum, which focuses on the development of skills associated with intelligence and the learning of formal subject matter content. The major goals of the LRDC IM can be collapsed into three broad areas. They are: (1) the learning and retention of subject matter, (2) self-direction in learning, and (3) social skills. The desired outcomes are performance competencies for all students in each of these areas. The majority of work in designing the educational system has been towards implementing the curriculum goals. This work includes breaking down each subject matter area into units or levels and further into objectives or skills which are usually hierarchically sequenced in the curriculum. Further, each objective or skill has a criterion referenced test so that a child can be appropriately placed, and so that progress can be monitored. A prescription or assignment system was developed which, with the aid of individual tests, matches each child with instructional material by unit or level. Because of the individualized nature of the LRDC IM, the children in a room are frequently engaged in different activities. Each child is working on a particular assignment to reach competency in a particular curriculum skill. Some children are tutoring or being tutored; some are taking one-to-one tests at their desks; still others are working in the exploratory area. The teacher circulates or "travels" around the room interacting with the children on a one-to-one or small group basis. The interactions, therefore, vary from child to child, depending on what the child is doing.


The discussions with developers and implementors brought out the concern that while the measures might be restricted to the unique aspects of LRDC's program, the domains of the variables tapped should be generalizable to a variety of educational settings. Stated somewhat differently, the specific measures are nested within variables (potentially measurable in diverse ways) which in turn are nested within fundamental domains of concern. The following list of the variables that need to be measured emerged from observing classrooms, discussing the program with developers and implementors, and examining the literature: the context variables of each classroom, the allocation of time, the allocation of space, the assignment and measurement procedures, classroom management, and student independence. Measures of these variables should serve the following functions: (1) provide descriptive information about the field sites using the LRDC program; (2) provide a basis for comparing the laboratory and field schools; (3) provide a basis for comparing the model and the field; and (4) provide an explanation of output variance not accounted for by input measures.

THE INSTRUMENTS

After six major revisions of trial instruments, a field version was developed. That is, steps 1 to 15 in Figure 1 were cycled approximately six times before an instrument which could be used in the field was developed. Following the field test of this instrument, a final version was developed incorporating minor revisions which arose from the feedback of the field tester (Step 19). The instrument itself, the nature of instruction, content validation, and the training program for its use are presented elsewhere (Leinhardt, 1972). The instrument covers four major areas: background, prescription, testing, and teacher interactions, each of which will be discussed briefly.

Background. The first four questions on the instrument provide background information on the classroom: the number of boys and girls involved; the number of children present on the day of observation; the number of years the teacher has been using either the Primary Education Program (PEP) or Individually Prescribed Instruction (IPI)³; the age range of the class; the size of the class in square feet (transformed into the number of square feet per child); and the allocation of time and space for exploratory.

Prescription. The next area on the instrument concerns prescription information. Question 5a on the instrument asks the observer to list all of the IPI Math or Quantification assignments written on each child's prescription sheet on the day of the observation. The information is obtained by looking at each child's ticket (sheet) or folder and recording the most recent list of assignments. The question was coded in a manner which would yield information about the uniqueness of the list of assignments

³ The LRDC Individualized Instructional Programs include the Individually Prescribed Instruction (IPI) Program for children of elementary grades (grades one through six), and the Primary Education Project (PEP) designed for children of early childhood age (ages three through seven). Both IPI and PEP were developed to provide educational experiences that are adaptive to the learning needs of the individual student. The programs were designed with the basic assumptions that: (1) children display a wide range of differences in their entering abilities and the ways in which they learn and acquire competencies, and (2) to provide educational experiences that are adaptive to the individual differences means to provide learning situations (e.g., classroom organization, learning materials, etc.) that can accommodate the needs of the individual student and, when needed, teach the prerequisite abilities demanded by the learning situations.

[Figure 1, a flowchart, appears at this point; its boxes were scrambled in extraction. Its roughly two dozen steps run from reading all available literature (studies, goals, budgets, etc.), observing two or three classrooms at different grade levels, and interviewing developers and practitioners to generate a list of domains and classroom characteristics; through deciding on the mode of the instrument (i.e., questionnaire, interview), generating a first set of measures to tap the domains of interest, trying out the measures in the lab classroom (for time estimates, degree of intrusion, completeness of information, and the logical order of measures), examining lab classroom record keeping practices and their generalizability from lab to site, and developing coding procedures and revising the instrument with feedback from developers and practitioners (including teachers, if possible); to dividing the skills needed for administering the instrument into those trainable outside the classroom (using videotape, etc.) and those needing a classroom, training individuals to administer the instrument, checking interobserver agreement, and finally administering the instrument in the field and checking whether the results are comparable to the lab trials (if yes, stop; if no, reloop).]

Figure 1. A Diagrammatic Representation of the Development of an Implementation Instrument

obtained. A ratio of unique assignments (different by units and levels) over total assignments was formed, giving a single measure of uniqueness for the classroom.
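The uniqueness ratio just described can be sketched in code. This is an illustrative reconstruction, not code from the study, and the assignment labels below are invented for the example.

```python
# Sketch of the classroom uniqueness measure described above: the ratio
# of unique assignments (distinct unit/level pairs) to total assignments
# recorded on the children's prescription sheets.

def uniqueness_ratio(assignments):
    """Return unique assignments / total assignments for a classroom."""
    if not assignments:
        return 0.0
    return len(set(assignments)) / len(assignments)

# Four children, two of whom share the same (hypothetical) unit/level.
prescriptions = [("numeration", 2), ("numeration", 2),
                 ("addition", 1), ("place_value", 3)]
ratio = uniqueness_ratio(prescriptions)  # 3 unique / 4 total = 0.75
```

A ratio near 1 indicates highly individualized prescriptions; a ratio near 1/n (for n children) indicates everyone received the same assignment.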

Testing. The next domain is testing. To gather some information about the procedures being used,

Teacher Interactions. In any individualized program, a great deal of the success or failure is dependent upon the teacher's interactions with the students. The teacher is the one who sets the tone or atmosphere of individualization. If his/her actions convey a sense of group rather than individual treatment, then the effort to individualize will have failed in a fundamental way. Question 6 attempts to tap some of the relevant information about teacher and student interactions. The teacher's and aides' interactions were observed

The coding of the observation section is complex; therefore, the measure will be followed by a brief description of the coding procedures. Total frequency was coded by counting the total number of contacts made for the entire observation time (15 to 20 minutes), and includes: management, cognitive, checkoff, cognitive-management, unattached positives or negatives, and any uncodeable X's. The frequency for cognitives was obtained by counting all cognitives plus cognitive-managements for the entire observation period. The frequency of managements was obtained by counting all other contacts made. Thus, frequency of management plus frequency of cognitives equals total frequency. The percentage of negative contacts was obtained by adding all negatively coded contacts (negatives, negative management, negative cognitive, negative cognitive-management) and dividing that by the total number of contacts made. (The percentage of positive contacts was so small and unvaried that it was not coded after the field trial was examined.) There are three distribution measures: total, management, and cognitive. They do not add up to total distribution. The system for coding the total distribution will be explained to provide an example; the other two codes are similar. A distribution was calculated for each cell by the formula (O − E)²/E, where:

O = the observed total number of contacts for that cell (or observed managements or cognitives);

E = (Tco/Tch)Ch = the expected total number of contacts for that cell;

Tco = the total contacts made over all cells (or total cognitives, etc.);

Tch = the total number of children in the system, obtained by Σ(Ch);

Ch = the average number of children per cell, based on 2 counts or 4 counts depending on the form.

The distribution measure is then obtained by summing over all cells. If the observed frequency equalled the expected frequency in each cell, the total score would be zero. Thus, the smaller the measure, the more evenly the teacher and aide are distributing their attention.

The three variables frequency, content, and distribution are each important indicators of the teacher's style of interaction. The frequency of contact is a good measure of the "travel" rate in the room. It is also a reasonable indicator of how long children wait before they are able to get the teacher's attention. The major difficulty for observation of the content of a teacher's remarks lies with the decision of which aspects are most relevant. I have chosen to focus on a rather simple distinction between negative statements and all others, and cognitive versus management statements. The collapsing of the existing categories was done for several reasons: first, to increase interobserver agreement; second, to increase the frequency of the observation of the categories within the limited time of observation; and finally, to focus on the most relevant parts of the teacher's speech. Again, there is the problem of incomplete rather than inappropriate measures. Those parts of a teacher's interaction which would seem most relevant to the student's advancement are those which concern the affective dimension and subject matter content of the communication.
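The distribution measure defined above can be sketched in code. This is an illustrative reconstruction of the formula, with invented cell counts; it simply sums (O − E)²/E over cells with E = (Tco/Tch)·Ch.

```python
# Minimal sketch of the attention-distribution measure described above.
# Smaller totals mean the teacher's attention is spread more evenly.

def distribution_measure(observed, children_per_cell):
    """observed[i]: contacts seen in cell i; children_per_cell[i]: Ch for cell i."""
    tco = sum(observed)            # total contacts over all cells
    tch = sum(children_per_cell)   # total children in the system
    total = 0.0
    for o, ch in zip(observed, children_per_cell):
        e = (tco / tch) * ch       # expected contacts for this cell
        total += (o - e) ** 2 / e
    return total

# Perfectly even attention: observed matches expected, score is 0.
even = distribution_measure([4, 4, 4], [2, 2, 2])     # -> 0.0
# Attention piled on one cell scores higher (less even).
uneven = distribution_measure([10, 1, 1], [2, 2, 2])  # -> larger than 0
```

The statistic has the familiar chi-square goodness-of-fit form, with expected counts proportional to the number of children in each cell.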


Other Measures Obtained. Several other measures were obtained which have not been included in this general discussion. Some were obtained by directly questioning Educational Specialists⁴ after the instrument had been administered; others were obtained from the instrument itself. Additional measures include: the number of adults observed traveling; whether or not the children get their own work (information was obtained by interviewing the child; it provided useful anecdotal information, but was not very generalizable at the classroom level); hours assigned in math and reading (regardless of whether the reading is an LRDC program), which was obtained from the site specialists; and finally, the number of days the teacher was absent during the year, also obtained. To some extent, signals such as flag raising are fairly easy for the observer to see and record, but it is hard to record both together. Other kinds of signals used are almost impossible to observe accurately if they occur in conjunction with another system.

PROCEDURE

The instrument was administered twice to all second grade classrooms in the four established sites. One classro… knowledge of the instructions, ability to unitize, and ability to categorize teachers' verbal interactions. In-class training focused on recording the distribution of teachers' interactions.

The next section examines the problem of observational reliability, followed by discussions of the inter-observer reliability, short-term teacher stability, and long-term reliability of the instrument.

RELIABILITY

The concepts of reliability and validity involve procedures by which confidence in a measuring device may be established. They lend support to the assertion that the measure consistently reports the same situation the same way, and that the measure actually represents that which it is supposed to be measuring. In our case, the major challenges to reliability were that different observers regarded the same event differently, and the lack of stability or representativeness of the behavior observed. The domains of particular interest here are the inter-observer reliability and the stability of teacher behaviors over time.

Several procedures are available for calculating either the overall reliability of data or some specific aspect of reliability, such as inter-observer agreement. No one of them is completely appropriate to the problem of calculating the reliability of our data. In keeping with the spirit, if not the specific method, of Cronbach's "Theory of Generalizability: A Liberalization of Reliability Theory" (Cronbach, Rajaratnam, & Gleser, 1963), where "'Reliability Theory' is interpreted as a theory regarding the adequacy with which one can generalize from one observation to a universe of observations" (p. 137), I will present a variety of evidence, some of which used traditional estimations of reliability coefficients and some of which did not, to support the generalizability, or lack of it, of this data set. Three aspects of reliability were estimated: inter-observer, short-term stability of the teacher behavior, and long-term reliability. The results are reported in Tables 1, 2, 3, and 4.

The reasons for seeking more complete information about the reliability of the instrument go back to the initial points mentioned in this paper: there is a very definite need that the data which represent measures of implementation be credible. One way of establishing such credibility is to show that the data are reliable and valid.

Inter-Observer Reliability. Whenever human observation is a basis for measurement, one is faced with the problem of individual differences in observers producing differing results when, in fact, they should have produced the same results. If one is dealing with several observers, the problem is to get all the observers to code the same event in the same way. Table 1 presents the in-class inter-observer agreement for nine observers. Agreement was checked by taking a ratio for each category between two recordings of one situation at one time. All of the observers had an agreement check with me, and in those cases where two people were trained together, the ratio of agreement between them is also given. This table presents the range over individuals and categories, and how much each individual agrees on each category. The overall reliability for observers is 82 percent. The range across observers is 66 to 100 percent and across categories is 29 to 98 percent. Because the category distribution of management contacts had such a low reliability, it was dropped.

Short-term Stability. Short-term stability refers to the stability of a teacher's observed behavior over approximately 48 hours. This was checked by having one randomly selected classroom at each site observed twice in two days. The reason for calculating

[Table 1 appears at this point but its columns were scrambled in extraction. Rows are the seven coding categories (total, cognitive, management, percent negative, total distribution, cognitive distribution, management distribution); columns pair each observer (A through H) with the developer, and in some cases pair two observers who were trained together. Entries are percent agreement, with an average by individual and an average (mean) by category; an asterisk marks checks in which no negatives were observed.]

Table 1. The Ratio of Inter-observer Agreements by Category
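The agreement check described in the text can be sketched in code. The paper does not spell out the exact ratio; the sketch below assumes the common convention of the smaller count divided by the larger, expressed as a percentage, and the observer counts are invented for illustration.

```python
# Sketch of a per-category agreement ratio between two observers'
# recordings of the same classroom session, plus a simple average.
# The smaller-over-larger convention is an assumption, not stated
# explicitly in the paper.

def agreement_ratio(count_a, count_b):
    """Percent agreement between two observers' counts for one category."""
    if count_a == count_b:
        return 100.0
    lo, hi = sorted((count_a, count_b))
    return 100.0 * lo / hi

def overall_agreement(ratios):
    """Mean agreement across categories (or across observers)."""
    return sum(ratios) / len(ratios)

# Hypothetical counts for one category from two observers.
r = agreement_ratio(41, 50)  # 82.0
```

Averaging such ratios by row gives the per-category means and averaging by column gives the per-individual means reported in Table 1.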


short-term stability is to show that the observed teacher characteristics remain relatively stable over a short period of time. A summary of short-term stability estimates for several variables is given in Table 2. The variables chosen were ones which would vary over a 48-hour period (or less). The observers were not instructed to recount enrollment, recheck the number of years of experience, or record other context data. Therefore, these variables are not included in the estimations. The estimates given in Table 2 were obtained from a two-way mixed-model (rows [teachers] random, columns [time] fixed) repeated measures ANOVA of teachers by time for each category for each pass. (That is, separate ANOVAs were calculated for each pass and each category.) From this, two estimates can be calculated which account for the variance due to teachers. One is Cronbach's estimate (which is not presented) (Cronbach, 1971); the other is an eta squared. The eta is obtained by dividing the sums of squares due to teachers by the total sums of squares (i.e., η² = SS_Teachers/SS_Total). This estimate gives the amount of variance explained by having the same teacher versus different teachers observed. Table 2 shows the result of estimating short-term reliability by an eta squared. (Both estimates were very close; for further discussion of a comparison of the two estimates, see Leinhardt, 1972.) The average stability for pass one is .73, and .78 for pass two.

Table 2. Estimates of Short-Term Stability of Teacher Behaviors

Variable                                 1st Pass   2nd Pass
Percent of children present                .80        .78
Number of days since the last test         .99        .94
Number of cognitive statements             .82        .87
Number of management statements            .80        .96
Distribution of cognitive statements       .83        .065
Percent of unique assignments              .50        .85
Percent of negative statements             .37        .98
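The eta-squared stability estimate used for Table 2 can be sketched as follows. This is an illustrative reconstruction with invented scores: it forms SS_Teachers/SS_Total from a teachers-by-occasions table of a single variable.

```python
# Sketch of the eta-squared estimate described in the text: the sum of
# squares due to teachers divided by the total sum of squares. Values
# near 1 mean most variance is between teachers, i.e., each teacher's
# behavior is stable across occasions.

def eta_squared(scores):
    """scores[t] is the list of one teacher's scores across occasions."""
    all_vals = [v for row in scores for v in row]
    grand = sum(all_vals) / len(all_vals)
    ss_total = sum((v - grand) ** 2 for v in all_vals)
    # Between-teacher SS: n_occasions * (row mean - grand mean)^2, summed.
    ss_teachers = sum(len(row) * ((sum(row) / len(row)) - grand) ** 2
                      for row in scores)
    return ss_teachers / ss_total

# Two observations of each of three (hypothetical) teachers; row means
# far apart relative to within-teacher variation give eta near 1.
stable = eta_squared([[10, 11], [20, 21], [30, 29]])  # close to 1
```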

The extremely low stability (.065) associated with the distribution of cognitive contacts on the second pass is due largely to one classroom. As shown in Table 3, a classroom at Site 1 had an extremely low distribution score on the first observation and an extremely high score on the second. On the second observation of this classroom, the teacher was planning a field trip for the day but kept the students in school one extra hour so that the observation could be made, a fact unknown to us at the time. Almost no classroom work went on during the observational period; there was a high degree of disruptive behavior. Because of the low stability of the distribution of cognitives for the second pass, that measure was replaced in the final data set by the measure from the first pass.

Long-Term Reliability. Long-term reliability refers, in this case, to the consistency over one and one-half months of some of the measures which remained unchanged from the field test to the final version. The purpose of estimating this is in part the same as estimating short-term stability. But it serves the additional function of estimating


the reliability of the instrument in recording some events which presumably do not change. The reliability estimates are reported in Table 4. The stability of the following measures is reasonably high: teacher experience, enrollment, ratio of boys to girls, and sequence of exploratory, while the stability of the remaining measures is low. The consequence of the low reliability is that the two variables will not be used to predict performance, although the attendance variable will be averaged across times and used for descriptive purposes.

Table 3. Distribution of Cognitive Contacts: Raw Scores for Doublechecks

Site     1st pass             2nd pass
         Time 1   Time 2      Time 1   Time 2
1         5.33     4.03        2.43    20.79
2         8.06     6.94        4.04     7.27
3        25.36    15.14        3.4      4.7
4         6.08     8.45        7.7      2.4

Note: The classrooms for each site differ from the first pass to the second.

Table 4. Estimation of the Long-Term Reliability

Variable                        r12
Years of experience             .92
Enrollment                      .70
Ratio of boys to girls          .80
Percent of children present     .34
Square feet per pupil           .13
Sequence of exploratory         .84

In addition to the concern about the long-term stability of the variables, there is a concern about the type of information gained between the first pass and the second, or the difference between sending either one or eight people to collect data. One expectation concerning the difference between the groups would be a greater overall variance for each variable on the second pass than on the first, attributable to individual differences between raters. A test was made for the assumption of equality of overall dispersion. The variance-covariance matrices from the two passes were found to come from similar populations. That is, the null hypothesis H0: D1 = D2 = Δ is retained (F = .921). This is not, therefore, an estimate of the reliability between one observer and eight observers, but rather, evidence for the assertion that the two situations were comparable in variance and covariance.

VALIDITY

There are two challenges to the validity of this instrument. First, it is possible that the characteristics examined are measured accurately, but are not the relevant ones in terms of the final performance of a class. Second, the characteristics selected may be the most important ones, but the manner of measuring them is not sensitive enough to reveal significant (in the sense of useful) differences. The first challenge has been discussed in the presentation of the instrument. The second challenge was answered in two steps. First, intercorrelations between the input, process, and output variables were examined to determine if the relationships among the variables were consistent with the theoretical basis of the model (they were). Then, the relationships between the output and process variables were examined controlling for the input variables (e.g., a partial correlation) (Leinhardt, 1972). A selected group of the residual process variables (to be discussed later) accounted for 46 percent of the residual output variance (Cooley and Leinhardt, 1974). The instrument appears to be both sensitive to differences between classrooms and useful in explaining outcomes in terms of achievement scores.
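The two-step logic described here — remove the input variance first, then ask how much of the remaining output variance the process measures explain — can be sketched with simulated data. The variable names, coefficients, and data below are illustrative only; the 46 percent figure comes from the study's own classrooms:

```python
import numpy as np

def residuals(y, X):
    """Residuals of y after a least-squares fit on X (intercept included)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return y - X1 @ beta

rng = np.random.default_rng(1)
n = 30                                       # classrooms
iq = rng.normal(100, 10, n)                  # input: entering ability
proc = rng.normal(0, 1, (n, 3))              # three process measures
ach = 0.5 * iq + proc @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 3, n)

y_res = residuals(ach, iq)                   # output free of input
p_res = np.column_stack([residuals(proc[:, j], iq) for j in range(3)])
e = residuals(y_res, p_res)
r2 = 1.0 - (e @ e) / (y_res @ y_res)         # share of residual output
                                             # variance explained by process
```

Because both the output and the process measures are residualized on input before the final fit, r2 cannot be inflated by process measures that are merely surrogates for entering ability.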

FINDINGS

The purpose of developing the implementation instrument was to provide a source of information which could broaden the interpretation of achievement data. The specific information obtained can be examined in terms of three questions. First, which changes in variables in the classroom appear to affect the achievement of students? Second, what differences are observed between the implementation of the program in the field versus its implementation in laboratory classrooms? Third, how is the educational model transformed when it is implemented?

In considering the relationship between classroom variables and achievement variables, it is crucial first to consider how initial input, process measures, and output measures relate to each other. It is obvious that one could obtain strong relationships between what appeared to be measures of process and output by merely having the process measures be surrogate input measures. For example, teacher experience with our program correlates .47 with student achievement, but it also correlates .58 with IQ. When IQ is partialed out of both achievement and teacher experience, the partial correlation between experience and achievement goes to .05. That is, in this case teacher experience is confounded with input variables. In this data set the IQ means (considered input) correlate .68 with arithmetic means, which leaves 54 percent of the variance in achievement to be explained by measures of the classroom which are not themselves accidental measures of input. Table 5 shows the partial correlation between five classroom residuals and Wide Range Achievement Math residuals.
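For a single control variable, the partialing operation used here has a closed form. The sketch below applies it to the two coefficients quoted above; the third coefficient, a classroom-level IQ-achievement correlation, is a hypothetical stand-in, since the partial of .05 reported in the text was computed from the raw data rather than from summary coefficients:

```python
from math import sqrt

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Quoted in the text: teacher experience vs. achievement (.47),
# teacher experience vs. IQ (.58).  The .77 is a hypothetical
# classroom-level IQ-achievement correlation, not the study's value.
confounded = partial_r(0.47, 0.58, 0.77)
```

The drop from .47 to near zero is the signature of a process measure acting as a surrogate for input.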
The variables which are negatively related to achievement when IQ is controlled are a large class, more boys than girls, and a larger number of negative contacts. Only the last variable is one which relates specifically to the LRDC program. That is, in the LRDC model, the emphasis is on having the teacher reinforce learning behaviors rather than punish inappropriate behavior. The variables which are positively correlated with achievement are the number of days between tests and the amount of class time spent on mathematics. The second finding is predictable; although it can influence decisions about amounts of time devoted to any one curriculum area, it does not significantly influence the model. The finding that higher achievement is associated with greater time between tests is both startling and intriguing. This would not be expected by the model and it poses some interesting possibilities. Perhaps teachers who frequently test spend less class time tutoring or teaching or


adapting assignments to meet individual needs. Or perhaps frequent testing of a child in and of itself is dysfunctional to increases in learning. In either case, it is the type of information which is important to verify and feed back to developers and implementors.

In addition to its analytic function, the instrument was to provide information about the implementation of the program in the field and to permit a comparison of the implementation between the field and the laboratory sites. It would be impossible within the scope of this paper to present a detailed description of each classroom on each variable. Instead, two descriptions will be provided: first, a comparison between the field and laboratory sites and, second, a general discussion of how the model looks in all the classrooms examined.

Table 5. Partial Correlations Controlling for Entering Abilities

                                     1      2      3      4      5      6
1. Class enrollment                1.00    .27    .34    .17    .11   -.30
2. Ratio of boys to girls           .27   1.00    .22    .22    .36   -.30
3. Time between tests               .34    .22   1.00    .00    .00    .32
4. Percentage of negative
   contacts                         .17    .22    .00   1.00    .19   -.37
5. Amount of time spent on
   mathematics                      .11    .36    .00    .19   1.00    .47
6. WRAT math means                 -.30   -.30    .32   -.37    .47   1.00

Table 6 shows the site averages on seventeen variables. In comparing the laboratory and field sites, those variables which appear similar in both field sites and developmental sites will be discussed first, followed by those variables for which there is a difference between the developmental and field sites. The variables on which there is similarity are teacher experience with the LRDC Model, sex ratio in the class, percentage of unique assignments, percentage of negative contacts, access to play following work, the number of adults traveling, and the number of minutes of math or reading per day. For most of the variables, the field and developmental sites look similar. The variables on which there is a difference, however, are quite interesting. The number of pupils per class, especially when considered in light of the percentage of children present, is smaller in the developmental schools than in the field site schools. This is an important difference when examining outcome measures and per-pupil expenditures. There are other differences; in general, more time between tests elapses for those in developmental schools, and they make fewer cognitive contacts.
Table 6. Means and Standard Deviations by Sites of 30 Classrooms on 17 Process Variables

[The body of Table 6 could not be recovered from the scan. It reports the mean and standard deviation for Sites 1-4 and Developmental Sites 1 and 2 on seventeen process variables: experience, class size, ratio of boys to girls, percent present, number of minutes of math per day, number of minutes of reading per day, percent unique assignments, percent negative contacts, play follows work, number of adults traveling, number of cognitive contacts, number of management contacts, child obtains own work, average number of days since last test, number of days the teacher was absent, distribution (cognitive), and distribution total.]

92

LEINHARDT

A more relevant question than the comparison of field and developmental schools is how much do the schools look like the model. The answer cannot be given a precise value, but some general observations can be made. If one observes the Follow Through classrooms using the LRDC program to determine if they operate in a manner similar to traditional classrooms, the answer is clearly no. All of the classrooms visited are individualized to some extent. Some sites have modified the model in a specific way, such as having two teachers travel; others have extended the management system from the LRDC program curriculum area to all curriculum areas. Some classrooms test frequently, but tend to give children similar assignments rather than individual ones. No classroom assigns just those pages needed by the child as indicated by the test; rather, most of them start at the first page a child needs as indicated by a test and assign a block of pages after that. There are at least four different styles of traveling, all of which are compatible with the LRDC Instructional Model. What is clear from Table 6 is that the program can be implemented in diverse settings. It is also clear that the program undergoes a certain amount of modification in the field. One of the questions raised by a study like this is what modifications in the model, made in the field, improve the model. For example, perhaps the model should place less emphasis on frequent testing and unique prescription and more emphasis on differing modes of transmitting information. In a well sequenced curriculum it may not be necessary to continuously monitor progress, and the same effort may be better expended to provide a diversity of curriculum objectives and a means to meet them.

LIMITATIONS

The instrument does not provide information on all of the domains initially identified. Some variables were very difficult to measure without extensive clinical data or without developing separate measurement instruments for them (e.g., student independence). Some variables were lost because of the inability to obtain the information in a reliable fashion. Still other variables were omitted because there appeared to be no measured difference between classrooms. However, a good start has been made in the construction of a reliable and valid instrument for measuring classroom process and the implementation of the LRDC Instructional Model.

IMPLICATIONS

An implementation instrument provides information about the educational processes that occur in classrooms using a particular innovation. The instrument can be a valuable tool to evaluators examining the overall results of an innovation in explaining those results, but it is also a useful tool for implementors and developers. For implementors, it provides information about the success of the implementation relative to the model. For developers, the instrument can provide information on the consequences, both positive and negative, of unintended changes in the model. This information in turn can become the basis for change in specific programs and overall assumptions of an educational model.


REFERENCES

COOLEY, W. Methods of evaluating school innovations. Invited address at the 79th Annual Convention of the American Psychological Association, Washington, D.C., September 1971.

COOLEY, W. and LEINHARDT, G. Evaluating individualized education in the elementary school. In P. O. Davidson, F. W. Clark, and L. A. Hamerlynck (Eds.), Evaluation of behavioral programs in community, residential and school settings. Champaign, Ill.: Research Press, 1974.

CRONBACH, L. J. Test validation. In R. Thorndike (Ed.), Educational Measurement. (2nd ed.) Washington, D.C.: American Council on Education, 1971.

CRONBACH, L. J., RAJARATNAM, N. and GLESER, G. C. Theory of generalizability: A liberalization of reliability theory. The British Journal of Statistical Psychology, 1963, 16 (2).

GLASER, R. Evaluation of instruction and changing educational models. Los Angeles: Center for the Study of Evaluation of Instructional Programs, 1968. (Also LRDC Reprint 46.)

LEINHARDT, G. The boojum of evaluation: Implementation, some measures. Unpublished manuscript, University of Pittsburgh, Learning Research and Development Center, 1972.

REYNOLDS, G., LIGHT, J. and MUELLER, F. The effects of reinforcing quality or quantity of academic performance. Paper presented at the Annual Meeting of the American Educational Research Association, New York, 1971.

STAKE, R. E. Toward a technology for the evaluation of educational programs. In R. W. Tyler, R. M. Gagne, and M. Scriven (Eds.), Perspectives of Curriculum Evaluation. AERA Series on Curriculum Evaluation, No. 1. Chicago: Rand McNally, 1967.