Reliability and generalizability of ratings of compositions


Studies in Educational Evaluation. Vol. 16, pp. 501-512, 1990 Printed in Great Britain. All rights reserved.

0191-491X/90 $0.00 + .50 © 1990 Pergamon Press plc

EVALUATION STUDIES

RELIABILITY AND GENERALIZABILITY OF RATINGS OF COMPOSITIONS

Rainer H. Lehmann
School of Education, University of Hamburg, F.R.G.

Introduction

One of the aims of the IEA International Study of Achievement in Written Composition has been "to make a contribution toward solving problems related to the assessment of essay-type answers" (Pelgrum & Warries 1986, p. 18). There are, of course, as anyone seriously engaged in the field knows, many such problems, only a few of which can be dealt with in this paper. Here, only such questions are treated which focus on the following four sources of variation in the assessment of student writing:

1. Between-rater factors
2. Within-rater factors
3. Between-assignment factors
4. Within-student factors

(cf. Wesdorp et al. 1982, p. 299f.). Since in the IEA study no students were asked to work repeatedly on the same assignment, within-student and between-assignment factors are confounded, so that they can only be analyzed conjointly here. However, a suggestion will be elaborated which is based on the notion of variance components and which allows for a simultaneous evaluation of those effects.

The data used for exemplification come from the West German component of the IEA Study, involving a total of 1487 11th-grade students from 71 classrooms in eight different tracks of the school system of the City of Hamburg. Each of these students had been asked to complete four writing assignments, i.e. one more than internationally obligatory, in a partly rotated design:

1. One of the four short assignments in the format of a letter (the international pragmatic/functional tasks - a letter to an uncle describing a bicycle; a self-description in a letter to a penpal; a formal note to the head of the school; and a reply to an advertisement for a summer job);
2. One of three longer international assignments (narrative personal story; persuasive/argumentative essay; reflective essay);
3. The assignment of a letter of advice to a younger student;
4. The assignment of paraphrasing, analyzing rhetorically, and evaluating a newspaper commentary.

Two independent sets of scores on a five-point scale were awarded to each essay in accordance with the International Scoring Guides (Gorman, Purves & Degenhart, 1988). In each case, the raters were two from a jury of five fully trained and certified mother tongue teachers. The essays were distributed among the raters in such a way as to allocate approximately equal portions to all possible combinations of raters.

Methodological Considerations

The study of between-rater effects, within-rater effects, and between-assignment/within-student effects entails, for a study of this size and complex design, an impressive array of very specific statistical analyses, far beyond what can be reported here. To give just an indication of what is involved, it should be mentioned that the international reporting requirements for scoring included, for the case of the Hamburg Study, the completion of 144 form sheets, containing a total of several thousand statistics referring to the 9 different tasks, 5 different raters, and 10 possible combinations of raters. When these forms were conceptualized, little was known with respect to the actual rater performance. Thus it had seemed appropriate not to rely on parametric test models alone, but also to include more "intuitive" statistics.

Rater                    A      B      C      D      E

1st round package no.    1      3      5      7      9
                         2      4      6      8      10

2nd round package no.    8      10     2      4      6
                         9      1      3      5      7

Figure 1: Scoring scheme of the Hamburg Study ("star design")
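The allocation in Figure 1 can be checked mechanically. The following is a minimal sketch, assuming that the figure is read as assigning each rater (A to E) two packages in the first round and two in the second; the dictionary names and the helper `rater_pairs` are hypothetical and not part of the original study. It pairs the first and second rater of every package and compares the result with the ten possible combinations of two raters out of five.

```python
from itertools import combinations

# Package allocation as read from Figure 1 (an assumed reading of the scheme):
# each rater scores two packages in the first round and two in the second.
FIRST_ROUND = {"A": (1, 2), "B": (3, 4), "C": (5, 6), "D": (7, 8), "E": (9, 10)}
SECOND_ROUND = {"A": (8, 9), "B": (10, 1), "C": (2, 3), "D": (4, 5), "E": (6, 7)}


def rater_pairs(first_round, second_round):
    """Map each package number to its (first rater, second rater) pair."""
    first = {pkg: rater for rater, pkgs in first_round.items() for pkg in pkgs}
    second = {pkg: rater for rater, pkgs in second_round.items() for pkg in pkgs}
    return {pkg: (first[pkg], second[pkg]) for pkg in sorted(first)}


pairs = rater_pairs(FIRST_ROUND, SECOND_ROUND)

# Unordered rater pairs actually used, and all pairs that are possible.
covered = {frozenset(p) for p in pairs.values()}
possible = {frozenset(c) for c in combinations("ABCDE", 2)}

for pkg, (r1, r2) in pairs.items():
    print(f"package {pkg:2d}: first rating by {r1}, second rating by {r2}")
print("all rater combinations covered exactly once:",
      covered == possible and len(pairs) == len(possible))
```

Under this reading, every package is scored by a different pair of raters, which is consistent with the stated aim of allocating approximately equal portions to all possible combinations of raters.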

Two such intuitive concepts are percentage of perfect agreement between two independent ratings and percentage of loose agreement, defined as the percentage of differences between two ratings not greater than one scale point. While these concepts have the advantage of not being based on assumptions as to the homogeneity of means and variances, their major drawback is that they cannot be converted into well-defined reliability coefficients. Moreover, percentage of loose agreement does not discriminate well between quality levels, if the agreement between two independent ratings is generally high. In the Hamburg data, there is more than 97 percent loose agreement between independent ratings. Perfect agreement was achieved on 73.2 percent of all 5362 rated compositions.

If, however, the hypotheses of homogeneous variances (and means) between independent ratings can be maintained, correlations between ratings and associated measures such as regression coefficients, correlation ratios, and variance components can be used. They are clearly superior in that they can be related to the classical reliability coefficient which is defined as the (estimated) true variance divided by the observed variance:

$$\text{Reliability} = \frac{\sigma_t^2}{\sigma_o^2} = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_e^2}$$

with

$\sigma_t^2$ = true variance
$\sigma_o^2$ = observed variance
$\sigma_e^2$ = error variance

(cf. Thorndike 1982). It is assumed that an appropriate treatment of problems involving the reliability and generalizability of essay ratings should be based on this concept.

Testing the Assumptions - Homogeneity of Score Variances and Means

There are a number of ways to ascertain that scores from different raters do, indeed, display variances sufficiently similar to be compatible with the hypothesis of homogeneity. One of these is to look at the extremes within tasks, i.e. that pair of raters for which in a given task the observed difference in standard deviations is largest.

A statistical problem lies in the partial overlap of the sets of compositions scored by these rater pairs. So, it is necessary to apply two different tests: (1) the conventional F-test for comparing the sub-sets which were unique to either one of the raters in the pair, (2) a t-test for paired observations (Ferguson, 1966), applying only to that portion which was scored conjointly (but independently!) by the two. The results from the Hamburg data clearly suggest retaining the hypothesis of homogeneous variances: while for five of the nine tasks, not even the largest observed differences were significant on the first criterion, none of the pairs investigated showed significant differences on the second. Conversely, the t-test identified only four overlapping sets for which there were significant differences in any of the ten possible combinations of raters, but these findings could not be reproduced on the basis of the F-criterion for independent samples.

Thus it appeared reasonable to proceed to check for possible mean differences between raters on the assumption of homogeneous variances. The rationale guiding this investigation was basically identical to that used in the previous tests: while some of the selected extreme mean differences were significant in the independent sample portion, none of these findings could be confirmed on the basis of the respective sub-set with paired observations. In terms of measurement theory, then, the obtained ratings got very close to the ideal of "equivalent forms", which can legitimately be summed or averaged with a corresponding increase in "true variance".
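The two variance checks described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the paired statistic shown is one common t-test for the difference between two correlated variances (with n - 2 degrees of freedom); whether it is exactly the variant taken from Ferguson (1966) in the study is an assumption. Only the test statistics are computed; critical values would have to be looked up separately.

```python
import math
from statistics import mean, variance


def variance_ratio(x, y):
    """F statistic for two independent samples: larger variance over smaller.

    Returns (F, df_numerator, df_denominator).
    """
    vx, vy = variance(x), variance(y)
    if vx >= vy:
        return vx / vy, len(x) - 1, len(y) - 1
    return vy / vx, len(y) - 1, len(x) - 1


def correlated_variance_t(x, y):
    """t statistic for the difference between two correlated variances.

    x and y are two ratings of the same compositions, in the same order.
    Returns (t, degrees_of_freedom) with n - 2 degrees of freedom.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two paired samples with at least 3 cases")
    vx, vy = variance(x), variance(y)
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    r = cov / math.sqrt(vx * vy)
    t = (vx - vy) * math.sqrt(n - 2) / (2 * math.sqrt(vx * vy * (1 - r * r)))
    return t, n - 2


# Invented scores, only to show the calls.
unique_to_rater_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
unique_to_rater_2 = [2.0, 4.0, 6.0, 8.0, 10.0]
f_stat, df1, df2 = variance_ratio(unique_to_rater_1, unique_to_rater_2)

shared_rater_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
shared_rater_2 = [2.0, 1.0, 4.0, 3.0, 5.0]
t_stat, df = correlated_variance_t(shared_rater_1, shared_rater_2)
print(f_stat, (df1, df2), t_stat, df)
```

The first function applies to the sub-sets scored by only one rater of a pair, the second to the portion both raters scored.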

Estimates of Inter-Rater Reliability

Insofar as the independent ratings can be regarded as equivalent, it is justified to employ Cronbach's Alpha as an estimate of the achieved inter-rater reliability. For the special case of two such ratings, the well-known Spearman-Brown formula may be used:

$$\text{Cronbach's Alpha}\ (K=2) = \frac{2\,r_{ij}}{1 + r_{ij}}$$

Since it can be shown that this statistic fits the above stated definition of reliability, the resulting numerical values give a direct indication of the proportion of true variance in the observed average scores. There are different values for each pair of raters, task, and rating dimension. In the Hamburg data, there were no consistent differences between pairs of raters or rating dimensions, but there were differences between tasks: generally, writing achievement was measured less accurately for the four pragmatic/functional tasks and the persuasive/argumentative task than it was for the remaining tasks. The best values for inter-rater agreement were obtained for the letter of advice to a younger student.
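As a small illustration of the formula above, the following sketch computes Alpha for the mean of two equivalent ratings from their correlation, and inverts the relation. The function names are hypothetical; the general form for k ratings is the standard Spearman-Brown formula, of which the case k = 2 is the one used in the text.

```python
def spearman_brown(r, k=2):
    """Reliability of the sum or mean of k equivalent ratings whose
    pairwise correlation is r (k = 2 gives 2r / (1 + r))."""
    return k * r / (1 + (k - 1) * r)


def implied_correlation(alpha):
    """Correlation between two equivalent ratings implied by a given Alpha (k = 2)."""
    return alpha / (2 - alpha)


# A correlation of 0.5 between two ratings yields Alpha = 2/3 for their mean.
alpha_example = spearman_brown(0.5)

# Conversely, an Alpha of 0.885 corresponds to a correlation of about 0.794.
r_example = implied_correlation(0.885)
print(round(alpha_example, 3), round(r_example, 3))
```

The inverse function is included only to show that Alpha for averaged scores is always higher than the correlation between the single ratings.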

Averaging the Alphas between the first and second rating over all tasks and rating dimensions, a mean Alpha of 0.885 was obtained. This amounts to saying that, on the average, 11.5 percent of the variance in the outcome variables (arithmetic means from two independent scores on the same essay and rating dimension) must be attributed to error.

Estimates of Intra-Rater Reliability

The above estimates of inter-rater reliability do not contain any reference to the fact that there may also be a certain amount of instability within the ratings of one and the same rater over time. In order to assess this source of error, a corpus of 138 compositions from all tasks was rated twice by all raters in the Hamburg jury. Assuming again that the two ratings from a given rater were equivalent in the statistical sense of the word, Alpha estimates the proportions of true and error variance in the obtained averages over time. The resulting average Alpha was 0.939; so 6.1 percent of the variance of within-rater average scores can be associated with intra-rater instability.

When trying to separate inter-rater from intra-rater effects, a correlation-based approach is more appropriate. Assuming hypothetically that perfect intra-rater agreement could be obtained, one could correct for attenuation on the basis of the usual formula:

$$\text{true score correlation} = r_{i_t j_t} = \frac{r_{ij}}{\sqrt{r_{ii}\, r_{jj}}}$$

Using again data aggregated over tasks, dimensions, and raters from the Hamburg study, the corrected estimate for inter-rater agreement would be r = 0.843 or Alpha = 0.915.

This means that an average of only 8.5 percent of the variance of outcome scores can be attributed to inter-rater differences, whereas an additional 3.0 percent out of the total error component of 11.5 percent is estimated to be due to intra-rater instability.
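The arithmetic behind this decomposition can be retraced in a few lines. The sketch below is illustrative only: the function names are hypothetical, and it starts from the reported corrected correlation of 0.843 rather than re-deriving that value from the raw data, which are not available here.

```python
import math


def correct_for_attenuation(r_ij, r_ii, r_jj):
    """Estimated true score correlation between ratings i and j, given
    their observed correlation r_ij and their reliabilities r_ii, r_jj."""
    return r_ij / math.sqrt(r_ii * r_jj)


def alpha_two_ratings(r):
    """Spearman-Brown / Cronbach's Alpha for the mean of two equivalent ratings."""
    return 2 * r / (1 + r)


# Generic example of the correction: 0.6 / sqrt(0.8 * 0.8) = 0.75.
example = correct_for_attenuation(0.6, 0.8, 0.8)

# Values reported in the text.
corrected_r = 0.843        # corrected inter-rater correlation
total_error = 1 - 0.885    # error share implied by the mean Alpha of 0.885

corrected_alpha = alpha_two_ratings(corrected_r)
inter_rater_share = 1 - corrected_alpha
intra_rater_share = total_error - inter_rater_share

print(round(corrected_alpha, 3))    # 0.915
print(round(inter_rater_share, 3))  # 0.085
print(round(intra_rater_share, 3))  # 0.03
```

The two shares add up to the total error component of 11.5 percent by construction.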

Towards Generalizability - The Variance Components Model

An obvious drawback of considerations so far has been that these were only concerned with the measurement accuracy for single tasks and rating dimensions. No reference was made to existing relationships between tasks/within students. It may be reiterated that in the Hamburg study, all students were asked to complete four assignments (one more than internationally obligatory). There are 1,073 students for whom two valid independent scores exist for all four assignments. Without going into details here, it may be added that in Hamburg the data also allow combining "overall impression marks" and analytical scores (except mechanics and handwriting) into a single general merit score for each composition/rater. From now on, considerations will only refer to these general merit scores.

In the conceptualization of the IEA Study, it was attempted to have a sample of tasks from the domain of school writing (Vähäpassi, 1982). Pragmatic constraints led to the rotation of 4 plus 3 of the 8 international tasks for Population B (modal grade before leaving compulsory school). In spite of the existence of acceptable measures for each constituent task it is necessary to ask whether - and if so, to what extent - the outcome variables measure a stable individual trait which can then be called "general writing ability". Statistically speaking, this question is closely related to the identification of within-student/across-task variation, correcting for possible mitigating influences of rater performance. An appropriate technique is given by the analysis of variance components (cf. Thorndike, 1982, pp. 156 ff).
The structure of the IEA rating data makes it difficult to conduct such an analysis for all tasks simultaneously: the fact that, for instance, no student completed both the personal/narrative task and the persuasive/argumentative task has left "empty cells" in the overall design which should not be filled with estimated values, as long as virtually nothing is known about empirical relationships between achievement in these tasks. Thus, the following incomplete matrix of inter-task correlations (based on averages from two independent ratings) was obtained (Table 1). Therefore, it seems advisable at least at this stage to disregard possible differences between rotation forms and include in the analysis those students who have completed exactly four assignments (i.e. any of the functional tasks, any of the narrative, argumentative and reflective tasks, the letter of advice and the rhetorical analysis). Also, instead of looking at five individual raters, only the two independent ratings (first vs. second) are distinguished.

Table 1: Correlation matrix for task-specific scales for nine tasks (total N = 1340 students; pairwise numbers of cases in parentheses)

Task             Bicycle     Self        Formal      Job         Narr.       Argu.       Refl.       Advice      Rhet.
                 descr.      descr.      note        appl.                                                       anal.

Self descript.   * (0)
Formal note      * (0)       * (0)
Job appl.        * (0)       * (0)       * (0)
Narrative        .12 (102)   .24 (96)    .32 (108)   .10 (110)
Argument         .19 (92)    .29 (103)   .40 (98)    .37 (98)    * (0)
Reflective       .43 (103)   .26 (95)    .22 (107)   .39 (101)   * (0)       * (0)
Advice letter    .28 (313)   .34 (312)   .32 (327)   .35 (308)   .25 (403)   .31 (376)   .37 (394)
Rhetorical       .23 (290)   .37 (299)   .35 (311)   .34 (304)   .27 (389)   .34 (379)   .46 (379)   .37 (1168)
analysis

Total            (323)       (324)       (337)       (328)       (424)       (401)       (411)       (1270)      (1231)

                      first rating                          second rating
Task         1A,B,C,E   5,6,7    9      0          1A,B,C,E   5,6,7    9      0

Student
1              4.5       2.2    3.0    3.5           4.8       3.0    3.0    3.8
2              2.0       2.8    1.5    3.8           3.0       3.2    1.0    3.5
...
1073           5.0       3.5    2.2    4.2           5.0       3.0    2.0    4.0

Figure 2: Analytic design and file structure of ratings

Given the high inter-rater reliabilities achieved, little appears to be lost as a consequence of that simplification, although there are certain implications for the subsequent analysis. With these modifications, a completely balanced factorial design - or design with three 'facets', to use the appropriate term - emerges. Figure 2 depicts the resulting analytic design and file structure. This is very much like a conventional three-way ANOVA design with a single observation per cell, except that obviously each case (student) is spread over eight cells. It can also be viewed as a MANOVA with a two-factor within-subject design and student as the breakdown variable.

Given the three facets "rating", "task", and "student", it is now possible to define and evaluate the respective main effects as well as the interaction terms. It should be noted again that only "rounds of scoring" and not the individual raters are considered; therefore, the rater-by-student interaction effect is confounded with the rater main effect and similarly the rater-by-task interaction term with the rater-by-task-by-student term. So, the following variance components are defined:

Table 2: Defined variance components with numerical results from Hamburg data

                                          Notation                                                          Results
Variance component                        $\sigma^2$            df                                          $\sigma^2$   df

Between students effect                   $\sigma_s^2$          $n_s-1$                                     .180         1072
Between tasks effect                      $\sigma_t^2$          $n_t-1$                                     .015         3
Between raters effect /                   $\sigma_{r,rs}^2$     $(n_r-1)+(n_r-1)(n_s-1)$                    -.005        1073
  rater-by-student interaction
Student-by-task interaction               $\sigma_{st}^2$       $(n_s-1)(n_t-1)$                            .347         3216
Rater-by-task interaction /               $\sigma_{rt,rst}^2$   $(n_r-1)(n_t-1)+(n_r-1)(n_s-1)(n_t-1)$      .065         3219
  rater-by-student-by-task interaction

$n_r$ = number of ratings, $n_s$ = number of students, $n_t$ = number of tasks
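The text does not spell out how the components in Table 2 were estimated. One standard route, sketched below, is to equate ANOVA mean squares with their expectations for a design in which ratings are nested within students and crossed with tasks (all effects random); this reading is an assumption, chosen because it yields exactly the degrees of freedom listed in Table 2. The function names and the small data set are hypothetical.

```python
from statistics import mean


def degrees_of_freedom(ns, nt, nr):
    """Degrees of freedom for the five terms of Table 2."""
    return {"s": ns - 1, "t": nt - 1, "r,rs": ns * (nr - 1),
            "st": (ns - 1) * (nt - 1), "rt,rst": ns * (nr - 1) * (nt - 1)}


def variance_components(x):
    """Estimate variance components from scores x[student][task][rating]."""
    ns, nt, nr = len(x), len(x[0]), len(x[0][0])
    grand = mean(v for s in x for t in s for v in t)
    m_s = [mean(v for t in s for v in t) for s in x]              # student means
    m_t = [mean(x[i][j][k] for i in range(ns) for k in range(nr))
           for j in range(nt)]                                    # task means
    m_st = [[mean(x[i][j]) for j in range(nt)] for i in range(ns)]
    m_sr = [[mean(x[i][j][k] for j in range(nt)) for k in range(nr)]
            for i in range(ns)]                                   # rating within student

    # Sums of squares; the last term is obtained by subtraction.
    ss = {"s": nt * nr * sum((m - grand) ** 2 for m in m_s),
          "t": ns * nr * sum((m - grand) ** 2 for m in m_t),
          "st": nr * sum((m_st[i][j] - m_s[i] - m_t[j] + grand) ** 2
                         for i in range(ns) for j in range(nt)),
          "r,rs": nt * sum((m_sr[i][k] - m_s[i]) ** 2
                           for i in range(ns) for k in range(nr))}
    ss_total = sum((v - grand) ** 2 for s in x for t in s for v in t)
    ss["rt,rst"] = ss_total - sum(ss.values())

    df = degrees_of_freedom(ns, nt, nr)
    ms = {term: ss[term] / df[term] for term in df}

    # Solve the expected mean square equations for the components.
    return {"rt,rst": ms["rt,rst"],
            "st": (ms["st"] - ms["rt,rst"]) / nr,
            "r,rs": (ms["r,rs"] - ms["rt,rst"]) / nt,
            "t": (ms["t"] - ms["st"]) / (nr * ns),
            "s": (ms["s"] - ms["st"] - ms["r,rs"] + ms["rt,rst"]) / (nr * nt)}


# Invented, purely additive scores: student effect plus task effect,
# identical in both ratings, so only "s" and "t" should be non-zero.
student_effect = [1, 2, 4]
task_effect = [0, 1]
scores = [[[a + b, a + b] for b in task_effect] for a in student_effect]
components = variance_components(scores)
print(components)
print(degrees_of_freedom(1073, 4, 2))
```

Estimates obtained by subtraction in this way can turn out slightly negative, as the value of -.005 in Table 2 illustrates.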

It can now be seen that the combined rater-effect/rater-by-student interaction term is virtually zero, as was, indeed, expected when the scoring design for the Hamburg Study was planned. Since most, if not all, raters were likely to be involved with each student in the sample, this term was likely to disappear as a consequence of the scoring scheme. Similarly, this scheme would cancel out rater-by-task interaction effects except for a possible time-related factor. Therefore the last variance component is almost exclusively related to what was labelled "inter-rater disagreement" above. Fortunately, this contribution to overall variance is minor.

The fact that there is no strong between-tasks effect in the data may be undesirable from a theoretical point of view, since it leaves little room for explanations referring to differential achievement over the tasks assigned. Methodologically, it may be a consequence of a tendency among raters to score to a normal curve, but it may, of course, also reflect a more fundamental difficulty, namely that the classical concept of "item difficulty" is not easily applied to tasks of school writing.

The remaining two variance components are those which are of primary interest for later multivariate analyses. Clearly, the relatively small amount of between-students variance (as compared with the student-by-task interaction, i.e. the "within-students" component) will impose limitations on the attempt to find a single overall explanation for differences between students in terms of writing achievement. Within-students variation, on the other hand, may be related to many factors which were only partially controlled in this study - e.g., fluctuations in achievement over time, varying levels of motivation, familiarity with the tasks, etc. It remains to be seen whether some of the background data of the Study will help to explain this source of variation.

Conclusions: General Writing Achievement Across Tasks

It is now possible to return to the guiding question of this paper: what can be said about the reliability of measuring general writing achievement across the tasks used, or in other words, about the generalizability of composition ratings in the Study of Achievement in Written Composition? It will be seen immediately that there is no single and simple answer; instead, the solution depends on the kind of assumptions with respect to the tasks one is prepared to make. Statistically, the answer is a function of whether within-students variation is considered as true variance or error.

Assuming that the pragmatic/functional tasks and the essay-type tasks are strictly equivalent statistically as well as theoretically and that the set of four assignments per student represents exactly the domain to which one wishes to generalize (fixed effects model with randomly chosen raters), the following formula can be applied to estimate the achieved generalizability:

$$\text{Generalizability I} = \frac{\sigma_t^2}{\sigma_o^2} = \frac{\sigma_s^2 + \dfrac{\sigma_{st}^2}{n_t}}{\sigma_s^2 + \dfrac{\sigma_{st}^2}{n_t} + \dfrac{\sigma_t^2}{n_t} + \dfrac{\sigma_{r,rs}^2}{n_r} + \dfrac{\sigma_{rt,rst}^2}{n_r n_t}}$$

Here $\sigma_t^2$ and $\sigma_o^2$ in the first ratio denote true and observed variance as in the definition of reliability, whereas $\sigma_t^2$ in the term $\sigma_t^2/n_t$ is the between-tasks component of Table 2.

On the basis of this formula, a generalizability coefficient of 0.957 would be obtained for the Hamburg data. This value appears quite appealing, but it is not a very plausible one, given the doubts about the validity of the strong underlying assumptions. In fact, the already quoted specification of task types within the domain of school writing (Vähäpassi, 1982) does not treat the two groups of tasks as equivalent, and it would be difficult to find an expert/teacher in the City of Hamburg who would consider the tasks used in the Study as representative for all 11th-grade school writing there. These objections alone are really sufficient to reject the model as leading to grossly inflated estimates of the achieved generalizability. So, the generalizability formula must be changed to a random-effect model, deleting the within-student component $\sigma_{st}^2/n_t$ from the numerator:

$$\text{Generalizability II} = \frac{\sigma_t^2}{\sigma_o^2} = \frac{\sigma_s^2}{\sigma_s^2 + \dfrac{\sigma_{st}^2}{n_t} + \dfrac{\sigma_t^2}{n_t} + \dfrac{\sigma_{r,rs}^2}{n_r} + \dfrac{\sigma_{rt,rst}^2}{n_r n_t}}$$

Here, it is only assumed that any four tasks from the universe of school writing were completed and scored by any two raters from the jury assumed to produce valid ratings. On the basis of this formula, the generalizability estimate for the obtained ratings on the Hamburg sample is reduced rather drastically to 0.646. This means that, in spite of all efforts undertaken by students as well as raters, the measurement of general writing achievement was not very good. But perhaps it is comforting to see how much more effort it would have taken to achieve a satisfactory generalizability of above 0.85: everything else being equal, a minimum of 13 writing assignments would have been required.
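The two coefficients and the required number of assignments can be retraced from the variance components reported in Table 2. The following is a sketch under one explicit assumption that the text does not state: the slightly negative estimate for the combined rater term (-.005) is set to zero before it enters the formulas. With that assumption the reported values are reproduced; the function and variable names are hypothetical.

```python
# Variance components as reported in Table 2.
COMPONENTS = {"s": 0.180, "t": 0.015, "r,rs": -0.005, "st": 0.347, "rt,rst": 0.065}


def generalizability(components, n_tasks, n_ratings, tasks_fixed):
    """Generalizability of a student's mean score over tasks and ratings.

    tasks_fixed=True counts the student-by-task component as true variance
    (Generalizability I); tasks_fixed=False counts it as error (II).
    """
    # Assumption: negative variance estimates are truncated to zero.
    v = {name: max(value, 0.0) for name, value in components.items()}
    within_students = v["st"] / n_tasks
    other_error = (v["t"] / n_tasks
                   + v["r,rs"] / n_ratings
                   + v["rt,rst"] / (n_ratings * n_tasks))
    observed = v["s"] + within_students + other_error
    true = v["s"] + (within_students if tasks_fixed else 0.0)
    return true / observed


def minimum_tasks(components, target, n_ratings=2, max_tasks=1000):
    """Smallest number of tasks for which Generalizability II exceeds target."""
    for n_tasks in range(1, max_tasks + 1):
        if generalizability(components, n_tasks, n_ratings, tasks_fixed=False) > target:
            return n_tasks
    raise ValueError("target not reached within max_tasks")


g1 = generalizability(COMPONENTS, n_tasks=4, n_ratings=2, tasks_fixed=True)
g2 = generalizability(COMPONENTS, n_tasks=4, n_ratings=2, tasks_fixed=False)
needed = minimum_tasks(COMPONENTS, target=0.85)

print(round(g1, 3))  # 0.957
print(round(g2, 3))  # 0.646
print(needed)        # 13
```

Varying `n_tasks` in this sketch shows how slowly the coefficient rises with additional assignments once the student-by-task component is treated as error.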

References

Ferguson, G.A. (1966). Statistical analysis in psychology and education (2nd ed.). New York: McGraw-Hill.

Gorman, T.P., Purves, A.C., & Degenhart, R.E. (1988). The IEA study of written composition I: The international writing tasks and scoring scales.

Pelgrum, H., & Warries, E. (1986). IEA: Activities, institutions, people. Oxford: Pergamon.

Thorndike, R.L. (1982). Applied psychometrics. Boston: Houghton Mifflin.

Vähäpassi, A. (1982). On the specification of the domain of school writing. In A.C. Purves & S. Takala (Eds.), An international perspective on the evaluation of written composition. Evaluation in Education: An International Review Series, 5(3), pp. 265-289.

Wesdorp, H., Bauer, B.A., & Purves, A.C. (1982). Toward a conceptualization of the scoring of written composition. In A.C. Purves & S. Takala (Eds.), Evaluation in Education, 5(3), pp. 299-315.

Note

The author acknowledges gratefully that the present research has been funded by the German Research Association (DFG).

The Author

RAINER H. LEHMANN teaches in the field of data analysis and research methodology at the University of Hamburg. His interests include large-scale assessment, namely internationally comparative studies.
