Studies in Educational Evaluation. Vol. 16, pp. 501-512, 1990 Printed in Great Britain. All rights reserved.
0191-491X/90 $0.00 + .50 © 1990 Pergamon Press plc
EVALUATION STUDIES
RELIABILITY AND GENERALIZABILITY OF RATINGS OF COMPOSITIONS
Rainer H. Lehmann
School of Education, University of Hamburg, F.R.G.
Introduction
One of the aims of the IEA International Study of Achievement in Written Composition has been "to make a contribution toward solving problems related to the assessment of essay-type answers" (Pelgrum & Warries 1986, p. 18). There are, of course, as anyone seriously engaged in the field knows, many such problems, only a few of which can be dealt with in this paper. Here, only those questions are treated which focus on the following four sources of variation in the assessment of student writing:

1. Between-rater factors
2. Within-rater factors
3. Between-assignment factors
4. Within-student factors

(cf. Wesdorp et al. 1982, p. 299f). Since in the IEA study no students were asked to work repeatedly on the same assignment, within-student and between-assignment factors are confounded, so that they can only be analyzed conjointly here. However, a suggestion will be elaborated which is based on the notion of variance components and which allows for a simultaneous
evaluation of those effects.

The data used for exemplification come from the West German component of the IEA Study, involving a total of 1487 11th-grade students from 71 classrooms in eight different tracks of the school system of the City of Hamburg. Each of these students had been asked to complete four writing assignments, i.e. one more than internationally obligatory, in a partly rotated design:

1. One of the four short assignments in the format of a letter (the international pragmatic/functional tasks - a letter to an uncle describing a bicycle; a self-description in a letter to a penpal; a formal note to the head of the school; and a reply to an advertisement for a summer job);
2. One of three longer international assignments (narrative personal story; persuasive/argumentative essay; reflective essay);
3. The assignment of a letter of advice to a younger student;
4. The assignment of paraphrasing, analyzing rhetorically, and evaluating a newspaper commentary.

Two independent sets of scores on a five-point scale were awarded to each essay in accordance with the International Scoring Guides (Gorman, Purves & Degenhart, 1988). In each case, the raters were two from a jury of five fully trained and certified mother tongue teachers. The essays were distributed among the raters in such a way as to allocate approximately equal portions to all possible combinations of raters.
Methodological Considerations

The study of between-rater effects, within-rater effects, and between-assignment/within-student effects entails, for a study of this size and complex design, an impressive array of very specific statistical analyses, far beyond what can be reported here. To give just an indication of what is involved, it should be mentioned that the international reporting requirements for scoring included, for the case of the Hamburg Study, the completion of 144 form sheets, containing a total of several
thousand statistics referring to the 9 different tasks, 5 different raters, and 10 possible combinations of raters. When these forms were conceptualized, little was known with respect to the actual rater performance. Thus it had seemed appropriate not to rely on parametric test models alone, but also to include more "intuitive" statistics.
Rater                     A       B       C       D       E
1st round, package no.    1, 2    3, 4    5, 6    7, 8    9, 10
2nd round, package no.    8, 9    10, 1   2, 3    4, 5    6, 7

Figure 1: Scoring scheme of the Hamburg Study ("star design")
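
The allocation shown in Figure 1 can be checked mechanically. The following minimal Python sketch assumes the package assignment as read from the figure (raters A to E, two packages per rater in each round) and tests whether every package is scored by two different raters and whether each of the ten possible rater pairs occurs exactly once. The variable and function names are illustrative only and are not taken from the Study's documentation.

```python
from itertools import combinations

# Package numbers per rater, as read from Figure 1 (an assumption of this sketch).
first_round = {"A": [1, 2], "B": [3, 4], "C": [5, 6], "D": [7, 8], "E": [9, 10]}
second_round = {"A": [8, 9], "B": [10, 1], "C": [2, 3], "D": [4, 5], "E": [6, 7]}


def raters_per_package(first, second):
    """Map each package number to its (first-round, second-round) rater pair."""
    first_rater = {p: r for r, packs in first.items() for p in packs}
    second_rater = {p: r for r, packs in second.items() for p in packs}
    return {p: (first_rater[p], second_rater[p]) for p in sorted(first_rater)}


pairs = raters_per_package(first_round, second_round)
unordered = {frozenset(pair) for pair in pairs.values()}
all_pairs = {frozenset(c) for c in combinations("ABCDE", 2)}

# Ten packages, two different raters each, and every rater pair used exactly once.
balanced = (
    len(pairs) == 10
    and all(len(set(pair)) == 2 for pair in pairs.values())
    and unordered == all_pairs
)
print(pairs)
print("balanced star design:", balanced)
```

Under the assumed reading of the figure, the check succeeds, which is consistent with the statement that approximately equal portions were allocated to all possible combinations of raters.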
Two such intuitive concepts are percentage of perfect agreement between two independent ratings and percentage of loose agreement, defined as the percentage of differences between two ratings not greater than one scale point. While these concepts have the advantage of not being based on assumptions as to the homogeneity of means and variances, their major drawback is that they cannot be converted into well-defined reliability coefficients. Moreover, percentage of loose agreement does not discriminate well between quality levels, if the
agreement between two independent ratings is generally high. In the Hamburg data, there is more than 97 percent loose agreement between independent ratings. Perfect agreement was achieved on 73.2 percent of all 5362 rated compositions.

If, however, the hypotheses of homogeneous variances (and means) between independent ratings can be maintained, correlations between ratings and associated measures such as regression coefficients, correlation ratios, and variance components can be used. They are clearly superior in that they can be related to the classical reliability coefficient which is defined as the (estimated) true variance divided by the observed variance:

\[ \text{Reliability} = \frac{\sigma_t^2}{\sigma_o^2} \quad \text{with} \quad \sigma_o^2 = \sigma_t^2 + \sigma_e^2 \qquad \text{(cf. Thorndike 1982)} \]

where $\sigma_t^2$ = true variance, $\sigma_o^2$ = observed variance, and $\sigma_e^2$ = error variance.

It is assumed that an appropriate treatment of problems involving the reliability and generalizability of essay ratings should be based on this concept.

Testing the Assumptions - Homogeneity of Score Variances and Means

There are a number of ways to ascertain that scores from different raters do, indeed, display variances sufficiently similar to be compatible with the hypothesis of homogeneity. One of these is to look at the extremes within tasks, i.e. that pair of raters for which in a given task the observed difference in standard deviations is largest.
A statistical problem lies in the partial overlap of the sets of compositions scored by these rater pairs. So, it is necessary to apply two different tests: (1) the conventional F-test for comparing the sub-sets which were unique to either one of the raters in the pair, (2) a t-test for paired observations (Ferguson, 1966), applying only to that portion which was scored conjointly (but independently!) by the two. The results from the
Hamburg data clearly suggest retaining the hypothesis of homogeneous variances: while for five of the nine tasks, not even the largest observed differences were significant on the first criterion, none of the pairs investigated showed significant differences on the second. Conversely, the t-test identified only four overlapping sets for which there were significant differences in any of the ten possible combinations of raters, but these findings could not be reproduced on the basis of the F-criterion for independent samples. Thus it appeared reasonable to proceed to check for possible mean differences between raters on the assumption of homogeneous variances.

The rationale guiding this investigation was basically identical to that used in the previous tests: while some of the selected extreme mean differences were significant in the independent sample portion, none of these findings could be confirmed on the basis of the respective sub-set with paired observations. In terms of measurement theory, then, the obtained ratings got very close to the ideal of "equivalent forms", which can legitimately be summed or averaged with a corresponding increase in "true variance".
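
The two variance criteria described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper cites Ferguson (1966) for the paired test without giving a formula, and the sketch assumes that the test for the difference between two correlated variances, t = (s1^2 - s2^2) * sqrt(n - 2) / (2 * s1 * s2 * sqrt(1 - r^2)) with n - 2 degrees of freedom, is meant. All scores below are invented, and only the test statistics are computed; significance would still have to be read from F and t tables.

```python
import math
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation of two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def f_ratio(x, y):
    """F statistic (larger variance over smaller) with its degrees of freedom."""
    vx, vy = variance(x), variance(y)
    if vx >= vy:
        return vx / vy, len(x) - 1, len(y) - 1
    return vy / vx, len(y) - 1, len(x) - 1


def paired_variance_t(x, y):
    """t statistic for the difference between two correlated variances (df = n - 2)."""
    n = len(x)
    vx, vy = variance(x), variance(y)
    r = pearson_r(x, y)
    t = (vx - vy) * math.sqrt(n - 2) / (2 * math.sqrt(vx * vy * (1 - r * r)))
    return t, n - 2


# Hypothetical five-point scores, invented purely for illustration.
unique_to_a = [3, 4, 2, 5, 3, 4, 3, 2]   # essays scored by rater A but not B
unique_to_b = [3, 3, 4, 2, 3, 4, 3]      # essays scored by rater B but not A
shared_a = [2, 3, 4, 4, 5, 3, 1, 3]      # essays scored by both raters: A's scores
shared_b = [2, 4, 4, 3, 5, 3, 2, 3]      # the same essays: B's scores

print("F criterion (independent sub-sets):", f_ratio(unique_to_a, unique_to_b))
print("t criterion (shared sub-set):", paired_variance_t(shared_a, shared_b))
```

Splitting the data in this way mirrors the partial overlap of the rater pairs: the F ratio uses only essays that one rater scored and the other did not, while the paired statistic uses only the jointly scored essays.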

Estimates of Inter-Rater Reliability

Insofar as the independent ratings can be regarded as equivalent, it is justified to employ Cronbach's Alpha as an estimate of the achieved inter-rater reliability. For the special case of two such ratings, the well-known Spearman-Brown formula may be used:

\[ \text{Cronbach's Alpha}\ (K=2) = \frac{2\, r_{ij}}{1 + r_{ij}} \]

Since it can be shown that this statistic fits the above stated definition of reliability, the resulting numerical values give a direct indication of the proportion of true variance in the observed average scores. There are different values for each pair of raters, task, and rating dimension. In the Hamburg data, there were no consistent differences between pairs of raters or rating dimensions, but there were differences between tasks: generally, writing achievement was measured less accurately for the four pragmatic/functional tasks and the persuasive/argumentative task than it was for the remaining tasks. The best values for inter-rater agreement were obtained for the letter of advice to a younger student.
Averaging the Alphas between the first and second rating over all tasks and rating dimensions, a mean Alpha of 0.885 was obtained. This amounts to saying that, on the average, 11.5 percent of the variance in the outcome variables (arithmetic means from two independent scores on the same essay and rating dimension) must be attributed to error.
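
The relation between the correlation of two ratings and the Alpha of their average can be illustrated with a short Python sketch. The function names are hypothetical. The implied correlation is shown only for orientation: the reported value is a mean of Alphas over tasks and rating dimensions, which need not coincide with the Alpha of an average correlation.

```python
def alpha_two_ratings(r):
    """Spearman-Brown reliability of the mean of two equivalent ratings."""
    return 2 * r / (1 + r)


def correlation_from_alpha(alpha):
    """Inverse of the two-rating formula: r = alpha / (2 - alpha)."""
    return alpha / (2 - alpha)


mean_alpha = 0.885                      # mean Alpha reported for the Hamburg data
r_implied = correlation_from_alpha(mean_alpha)
error_share = 1 - mean_alpha            # proportion of variance attributed to error

print(round(r_implied, 3))              # 0.794
print(round(error_share, 3))            # 0.115
```

An Alpha of 0.885 thus corresponds to a correlation of roughly 0.79 between two single ratings, and the error share of 11.5 percent is simply the complement of Alpha.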

Estimates of Intra-Rater Reliability

The above estimates of inter-rater reliability do not contain any reference to the fact that there may also be a certain amount of instability within the ratings of one and the same rater over time. In order to assess this source of error, a corpus of 138 compositions from all tasks was rated twice by all raters in the Hamburg jury. Assuming again that the two ratings from a given rater were equivalent in the statistical sense of the word, Alpha estimates the proportions of true and error variance in the obtained averages over time. The resulting average Alpha was 0.939; so 6.1 percent of the variance of within-rater average scores can be associated with intra-rater instability.
When trying to separate inter-rater from intra-rater effects, a correlation-based approach is more appropriate. Assuming hypothetically that perfect intra-rater agreement could be obtained, one could correct for attenuation on the basis of the usual formula:

\[ \text{true score correlation} = r_{i_t j_t} = \frac{r_{ij}}{\sqrt{r_{ii}\, r_{jj}}} \]

Using again data aggregated over tasks, dimensions, and raters from the Hamburg study, the corrected estimate for inter-rater agreement would be r = 0.843 or Alpha = 0.915. This means that an average of only 8.5 percent of the variance of outcome scores can be attributed to inter-rater differences, whereas an additional 3.0 percent out of the total error component of 11.5 percent is estimated to be due to intra-rater instability.
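
The arithmetic behind this decomposition can be retraced with a small Python sketch. It is a hedged illustration, not the author's computation: the paper does not report the aggregated coefficients that entered the correction, so `disattenuated` is shown only as a generic helper, and the decomposition starts from the reported corrected value r = 0.843. All names are hypothetical.

```python
import math


def disattenuated(r_ij, r_ii, r_jj):
    """Correction for attenuation: r_ij / sqrt(r_ii * r_jj)."""
    return r_ij / math.sqrt(r_ii * r_jj)


def alpha_two_ratings(r):
    """Spearman-Brown reliability of the mean of two equivalent ratings."""
    return 2 * r / (1 + r)


corrected_r = 0.843          # corrected inter-rater correlation reported in the text
total_error = 1 - 0.885      # error share implied by the uncorrected mean Alpha

alpha_corrected = alpha_two_ratings(corrected_r)
inter_rater_share = 1 - alpha_corrected
intra_rater_share = total_error - inter_rater_share

print(round(alpha_corrected, 3))     # 0.915
print(round(inter_rater_share, 3))   # 0.085
print(round(intra_rater_share, 3))   # 0.03
```

The sketch reproduces the reported Alpha of 0.915 and the split of the 11.5 percent error component into about 8.5 percent inter-rater and 3.0 percent intra-rater variance.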

Towards Generalizability - The Variance Components Model

An obvious drawback of considerations so far has been that these were only concerned with the measurement accuracy for single tasks and rating dimensions. No reference was made to existing relationships between tasks/within students. It may be reiterated that in the Hamburg study, all students were asked to complete four assignments (one more than internationally obligatory). There are 1,073 students for whom two valid independent scores exist for all four assignments. Without going into details here, it may be added that in Hamburg the data also allow combining "overall impression marks" and analytical scores (except mechanics and handwriting) into a single general merit score for each composition/rater. From now on, considerations will only refer to these general merit scores.

In the conceptualization of the IEA Study, it was attempted to have a sample of tasks from the domain of school writing (Vähäpassi, 1982). Pragmatic constraints led to the rotation of 4 plus 3 of the 8 international tasks for Population B (modal grade before leaving compulsory school). In spite of the existence of acceptable measures for each constituent task it is necessary to ask whether - and if so, to what extent - the outcome variables measure a stable individual trait which can then be called "general writing ability". Statistically speaking, this question is closely related to the identification of within-student/across-task variation, correcting for possible mitigating influences of rater performance. An appropriate technique is given by the analysis of variance components (cf. Thorndike, 1982, pp. 156 ff).
The structure of the IEA rating data makes it difficult to conduct such an analysis for all tasks simultaneously: the fact that, for instance, no student completed both the personal/narrative task and the persuasive/argumentative task has left "empty cells" in the overall design which should not be filled with estimated values, as long as virtually nothing is known about empirical relationships between achievement in these tasks. Thus, the following incomplete matrix of inter-task correlations (based on averages from two independent ratings) was obtained (Table 1). Therefore, it seems advisable at least at this stage to disregard possible differences between rotation forms and include in the analysis those students who have completed exactly four assignments (i.e. any of the functional tasks, any of the narrative, argumentative and reflective tasks, the letter of advice and the rhetorical analysis). Also, instead of looking at five individual raters, only the two independent ratings (first vs. second) are distinguished.

Table 1: Correlation matrix for task-specific scales for nine tasks (total N = 1340 students; pairwise numbers of cases in parentheses)

Task                 Bicycle    Self       Formal     Job        Narr.      Argu.      Refl.      Advice      Rhet.
                     descr.     descr.     note       appl.                                                   anal.
Self descript.       * (0)
Formal note          * (0)      * (0)
Job appl.            * (0)      * (0)      * (0)
Narrative            .12 (102)  .24 (96)   .32 (108)  .10 (110)
Argument             .19 (92)   .29 (103)  .40 (98)   .37 (98)   * (0)
Reflective           .43 (103)  .26 (95)   .22 (107)  .39 (101)  * (0)      * (0)
Advice letter        .28 (313)  .34 (312)  .32 (327)  .35 (308)  .25 (403)  .31 (376)  .37 (394)
Rhetorical analysis  .23 (290)  .37 (299)  .35 (311)  .34 (304)  .27 (389)  .34 (379)  .46 (379)  .37 (1168)
Total                (323)      (324)      (337)      (328)      (424)      (401)      (411)      (1270)      (1231)

             first rating                           second rating
Task         1A,B,C,E   5,6,7   9      0            1A,B,C,E   5,6,7   9      0
Student      4.5        2.2     3.0    3.5          4.8        3.0     3.0    3.8
             2.0        2.8     1.5    3.8          3.0        3.2     1.0    3.5
             ...
1073         5.0        3.5     2.2    4.2          5.0        3.0     2.0    4.0

Figure 2: Analytic design and file structure of ratings

Given the high inter-rater reliabilities achieved, little appears to be lost as a consequence of that simplification, although there are certain implications for the subsequent analysis.

With these modifications, a completely balanced factorial design - or design with three 'facets', to use the appropriate term - emerges. Figure 2 depicts the resulting analytic design and file structure. This is very much like a conventional three-way ANOVA design with a single observation per cell, except that obviously each case (student) is spread over eight cells. It can also be viewed as a MANOVA with a two-factor within-subject design and student as the breakdown variable.

Given the three facets "student", "task", and "rating", it is now possible to define and evaluate the respective main effects as well as the interaction terms. It should be noted again that only "rounds of scoring" and not the individual raters are considered; therefore, the rater main effect is confounded with the rater-by-student interaction term and similarly the rater-by-task interaction effect with the rater-by-task-by-student term. So, the following variance components are defined:

Table 2. Defined variance components with numerical results from Hamburg data

                                          Notation                                                  Results
Variance component                        Component             df                                  Estimate   df
Between students effect                   $\sigma_s^2$          ns-1                                .180       1072
Between tasks effect                      $\sigma_t^2$          nt-1                                .015       3
Between raters effect/                    $\sigma_{r,rs}^2$     (nr-1)+(nr-1)(ns-1)                 -.005      1073
  rater-by-student interaction
Student-by-task interaction               $\sigma_{st}^2$       (ns-1)(nt-1)                        .347       3216
Rater-by-task interaction/                $\sigma_{rt,rst}^2$   (nr-1)(nt-1)+(nr-1)(ns-1)(nt-1)     .065       3219
  rater-by-student-by-task interaction

nr = number of ratings, ns = number of students, nt = number of tasks
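
The degrees of freedom in Table 2 follow directly from the design sizes stated in the text (1,073 students, four tasks, two ratings). A minimal Python sketch, with hypothetical dictionary keys, evaluates the df expressions of the table and confirms that they add up to the total number of observations minus one.

```python
ns, nt, nr = 1073, 4, 2  # students, tasks, ratings (rounds of scoring)

# Degrees of freedom per component, using the expressions given in Table 2.
df = {
    "between students": ns - 1,
    "between tasks": nt - 1,
    "raters / rater-by-student": (nr - 1) + (nr - 1) * (ns - 1),
    "student-by-task": (ns - 1) * (nt - 1),
    "rater-by-task / rater-by-student-by-task":
        (nr - 1) * (nt - 1) + (nr - 1) * (ns - 1) * (nt - 1),
}

for name, value in df.items():
    print(f"{name}: {value}")

# In a balanced design the component df sum to (number of cells - 1).
total_df = sum(df.values())
print(total_df, ns * nt * nr - 1)  # 8583 8583
```

The five values (1072, 3, 1073, 3216, 3219) agree with the results columns of the table, which supports the pairing of each confounded term with its df expression.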

It can now be seen that the combined rater-effect/rater-by-student interaction term is virtually zero, as was, indeed, expected when the scoring design for the Hamburg Study was planned. Since most, if not all, raters were likely to be involved with each student in the sample, this term was likely to disappear as a consequence of the scoring scheme. Similarly, this scheme would cancel out rater-by-task interaction effects except for a possible time-related factor. Therefore the last variance component is almost exclusively related to what was labelled "inter-rater disagreement" above. Fortunately, this contribution to overall variance is minor.

The fact that there is no strong between-tasks effect in the data may be undesirable from a theoretical point of view, since it leaves little room for explanations referring to differential achievement over the tasks assigned. Methodologically, it may be a consequence of a tendency among raters to score to a normal curve, but it may, of course, also reflect a more fundamental difficulty, namely that the classical concept of "item difficulty" is not easily applied to tasks of school writing.

The remaining two variance components are those which are of primary interest for later multivariate analyses. Clearly, the relatively small amount of between-students variance (as compared with the student-by-task interaction, i.e. the "within-students" component) will impose limitations on the attempt to find a single overall explanation for differences between students in terms of writing achievement. Within-students variation, on the other hand, may be related to many factors which were only partially controlled in this study - e.g., fluctuations in achievement over time, varying levels of motivation, familiarity with the tasks, etc. It remains to be seen whether some of the background data of the Study will help to explain this source of variation.

Conclusions: General Writing Achievement Across Tasks

It is now possible to return to the guiding question of this paper: what can be said about the reliability of measuring general writing achievement across the tasks used, or in other words, about the generalizability of composition ratings in the Study of Achievement in Written Composition? It will be seen immediately that there is no single and simple answer; instead, the solution depends on the kind of assumptions with respect to the tasks one is prepared to make. Statistically, the answer is a function of whether within-students variation is considered as true variance or error.

Assuming that the pragmatic/functional tasks and the essay-type tasks are strictly equivalent statistically as well as theoretically and that the set of four assignments per student represents exactly the domain to which one wishes to generalize (fixed effects model with randomly chosen raters), the following formula can be applied to estimate the achieved generalizability:

\[ \frac{\sigma_t^2}{\sigma_o^2} = \text{Generalizability I} = \frac{\sigma_s^2 + \dfrac{\sigma_{st}^2}{n_t}}{\sigma_s^2 + \dfrac{\sigma_{st}^2}{n_t} + \dfrac{\sigma_t^2}{n_t} + \dfrac{\sigma_{r,rs}^2}{n_r} + \dfrac{\sigma_{rt,rst}^2}{n_r\, n_t}} \]

Here the ratio on the left denotes true over observed variance in the sense of the reliability definition, whereas $\sigma_t^2$ on the right-hand side is the between-tasks component of Table 2; $n_t$ and $n_r$ are the numbers of tasks and ratings.

On the basis of this formula, a generalizability coefficient of 0.957 would be obtained for the Hamburg data. This value appears quite appealing, but it is not a very plausible one, given the doubts about the validity of the strong underlying assumptions. In fact, the already quoted specification of task types within the domain of school writing (Vähäpassi, 1982) does not treat the two groups of tasks as equivalent, and it would be difficult to find an expert/teacher in the City of Hamburg who would consider the tasks used in the Study as representative for all 11th-grade school writing there. These objections alone are really sufficient to reject the model as leading to grossly inflated estimates of the achieved generalizability. So, the generalizability formula must be changed to a random-effect model, deleting the within-student component $\sigma_{st}^2 / n_t$ from the numerator:

\[ \frac{\sigma_t^2}{\sigma_o^2} = \text{Generalizability II} = \frac{\sigma_s^2}{\sigma_s^2 + \dfrac{\sigma_{st}^2}{n_t} + \dfrac{\sigma_t^2}{n_t} + \dfrac{\sigma_{r,rs}^2}{n_r} + \dfrac{\sigma_{rt,rst}^2}{n_r\, n_t}} \]

Here, it is only assumed that any four tasks from the universe of school writing were completed and scored by any two raters from the jury
assumed to produce valid ratings. On the basis of this formula, the generalizability estimate for the obtained ratings on the Hamburg sample is reduced rather drastically to 0.646. This means that, in spite of all efforts undertaken by students as well as raters, the measurement of general writing achievement was not very good. But perhaps it is comforting to see how much more effort it would have taken to achieve a satisfactory generalizability of above 0.85: everything else being equal, a minimum of 13 writing assignments would have been required.
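
Both coefficients, and the number of assignments needed for a generalizability above 0.85, can be retraced from the variance components of Table 2. The Python sketch below is an illustration under one explicit assumption: the slightly negative rater term is set to zero before the coefficients are computed. The paper does not state this step, but with it the sketch reproduces the reported values. Function and key names are hypothetical.

```python
# Variance component estimates from Table 2 (Hamburg data).
components = {"s": 0.180, "t": 0.015, "r_rs": -0.005, "st": 0.347, "rt_rst": 0.065}


def generalizability(c, nt, nr, tasks_fixed):
    """Generalizability I (tasks_fixed=True) or II (tasks_fixed=False)."""
    # Assumption of this sketch: negative variance estimates are truncated at zero.
    c = {name: max(value, 0.0) for name, value in c.items()}
    within_student = c["st"] / nt
    observed = (c["s"] + within_student + c["t"] / nt
                + c["r_rs"] / nr + c["rt_rst"] / (nr * nt))
    true = c["s"] + within_student if tasks_fixed else c["s"]
    return true / observed


def min_tasks(c, nr, target):
    """Smallest number of tasks for which Generalizability II exceeds the target."""
    nt = 1
    while generalizability(c, nt, nr, tasks_fixed=False) <= target:
        nt += 1
    return nt


g1 = generalizability(components, nt=4, nr=2, tasks_fixed=True)
g2 = generalizability(components, nt=4, nr=2, tasks_fixed=False)
needed = min_tasks(components, nr=2, target=0.85)

print(round(g1, 3))  # 0.957
print(round(g2, 3))  # 0.646
print(needed)        # 13
```

The only difference between the two coefficients is whether the student-by-task component counts as true variance; moving it out of the numerator accounts for the entire drop from 0.957 to 0.646.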

References

Ferguson, G.A. (1966). Statistical analysis in psychology and education (2nd ed.). New York: McGraw-Hill.

Gorman, T.P., Purves, A.C. and Degenhart, R.E. (1988). The IEA study of written composition I: The international writing tasks and scoring scales.

Pelgrum, H. & Warries, E. (1986). IEA: Activities, institutions, people. Oxford: Pergamon.

Thorndike, R.L. (1982). Applied psychometrics. Boston: Houghton & Mifflin.

Vähäpassi, A. (1982). On the specification of the domain of school writing. In A.C. Purves & S. Takala (Eds.) An international perspective on the evaluation of written composition, Evaluation in Education: An International Review Series 5, (3), pp. 265-289.

Wesdorp, H., Bauer, B.A., & Purves, A.C. (1982). Toward a conceptualization of the scoring of written composition. In A.C. Purves & S. Takala (Eds.) Evaluation in Education, 5, (3), pp. 299-315.

Note

The author acknowledges gratefully that the present research has been funded by the German Research Association (DFG).

The Author

RAINER H. LEHMANN teaches in the field of data analysis and research methodology at the University of Hamburg. His interests include large-scale assessment, namely internationally comparative studies.