Studies in Educational Evaluation. Vol. 15, pp. 285-293, 1989
Printed in Great Britain. All rights reserved.
0191-491X/89 $0.00 + .50
© 1989 Pergamon Press plc

APPLICATION OF ITEM RESPONSE THEORY IN STANDARD SETTING
Toh Poh Guan and Saminanthan Gopal
Ministry of Education, Research Branch, Kay Siang Road, Singapore 1024
In traditional examinations, achievement is reported in the form of grades. Each grade spans a certain range of marks. For example, examinees obtaining between seventy-five and one hundred marks may be awarded grade 'A'. Such grade boundaries are usually determined in a standard setting exercise. An important issue in standard setting is whether a grade in a particular year's examination is equivalent to the same grade in the previous year's examination. The issue of equivalent grades can be resolved through test-equating based on Item Response Theory (IRT). This paper describes the application of an IRT method of equating two such traditional examinations and the problems encountered.
Method
Equating two parallel tests involves a procedure for determining empirically a transformation which can bring the two tests onto a common scale so that they can be used interchangeably. For tests to be considered as parallel, they must be developed from the same specification in terms of content and item characteristics. In this study, we propose a test equating design that can be applied to the traditional examination system.

Simulation Study
We carried out a research project over the last few years on a simulated traditional examination system. A large group of primary schools who shared a common curriculum in English, Mathematics and Science participated in the project.
The research study was conducted at an upper primary level. For each of the three subjects, a common examination was administered to the selected level at the end of the school year. The populations of the selected level for a particular year and the subsequent year will be referred to as Population 1 and Population 2 respectively. Similarly, Examination 1 and Examination 2 were the examinations administered respectively to Population 1 and Population 2 for each subject.

Design for Data Collection
The data collection design for equating Examination 1 and Examination 2 is depicted in Figure 1.
Pupils of Population 1 took Examination 1. A sample of about 500 pupils from Population 1 took the Anchor Test 1 within a week after Examination 1. Examination 2 was administered a year later to Population 2 which comprised the subsequent cohort of pupils at the same level as Population 1. A sample of about 500 pupils from Population 2 took the Anchor Test 2 within a week after Examination 2. Examination 2 and Examination 1 may not be strictly parallel tests as they have to reflect the changes in the curriculum over the years. The purpose of using two Anchor tests was to make allowance for the differences between Examination 2 and Examination 1. Anchor Test 1 and Anchor Test 2 were parallel forms of Examination 1 and Examination 2 respectively. The two Anchor Tests had about 80% of common items.

Model Used in this Study
Researchers have proposed many models in IRT.
The appropriateness of a model depends usually on the purpose of testing and the types of items included in a test. Among the more popular models are the logistic models. The one-parameter (Rasch, 1960), two-parameter and three-parameter (Birnbaum, 1968) logistic models are frequently used as measurement models for tests with dichotomously scored items. However, traditional examinations are made up of both multiple-choice and open-ended items, and open-ended items may not be scored dichotomously because a partial answer to an item is awarded partial credit accordingly. One model that can be used to score open-ended items is the Partial Credit Model (Wright and Masters, 1981) which is an extension of the one-parameter logistic model. The computer program CREDIT (MESA, University of Chicago) based on the Partial Credit model was used to estimate item parameters and examinee abilities in this study.
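As an illustration only (this is not the CREDIT program, whose internals are not described here), the category probabilities of the Partial Credit Model for a single item can be sketched in Python as follows. The function name and the example ability and step difficulties are hypothetical; the formula is the standard Partial Credit form, in which an item with a single step reduces to the one-parameter logistic model.

```python
import math


def pcm_category_probabilities(ability, step_difficulties):
    """Return P(score = 0..m) for one item under the Partial Credit Model.

    ability: examinee ability in logits.
    step_difficulties: difficulties of the m steps of the item, in logits.
    """
    # Exponent for score x is the sum over the first x steps of
    # (ability - step difficulty); score 0 has exponent 0.
    exponents = [0.0]
    for delta in step_difficulties:
        exponents.append(exponents[-1] + (ability - delta))
    # Subtract the largest exponent before exponentiating (numerical stability).
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


# A dichotomous item (one step) behaves like the one-parameter logistic model.
p_dichotomous = pcm_category_probabilities(0.5, [0.0])

# A hypothetical open-ended item worth up to 2 marks (two steps).
p_partial = pcm_category_probabilities(0.5, [-0.5, 1.0])
```

Here `p_partial` holds the probabilities of scoring 0, 1 and 2 marks on the open-ended item, which is how partial credit enters the model.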
Figure 1: Schematic Representation of Data Collection in Equating Examination 2 and Examination 1
[Diagram labels: Population 1, Examination 1, Anchor Test 1, Common Item, Anchor Test 2, Examination 2, Population 2]
Procedure
The approach used in this study for equating Examination 2 and Examination 1 is depicted in Figure 2. The detailed equating procedure is outlined as follows:
a) Separately calibrate each test with the respective samples. That is, run the CREDIT program with the input data from Examination 1 with Sample 1 and then follow by a second run on the data from Anchor Test 1 with Sample 1. Repeat these CREDIT program runs on Examination 2 and Anchor Test 2 with Sample 2.
b) Link Examination 2 to Anchor Test 2 through common person equating, as both tests have been taken by Sample 2. The calibration of each test with the CREDIT program gives two mean ability estimates of Sample 2, namely $\bar{\beta}_1$ and $\bar{\beta}_2$. $\bar{\beta}_1$ is the mean ability estimate of Sample 2 by Examination 2 while $\bar{\beta}_2$ is the estimate by Anchor Test 2. According to IRT, the difference in the two ability means, $\bar{\beta}_1 - \bar{\beta}_2$, estimates the shift required to bring Examination 2 onto a common scale with Anchor Test 2.
c) Link the two Anchor tests together through the common items, as about 80% of the items in the tests are common items. Calibration of these Anchor tests using the CREDIT program produces a pair of item difficulties $d_{i1}$ and $d_{i2}$ and a pair of associated standard errors $s_{i1}$ and $s_{i2}$ for each common item $i$. In theory, the difficulty estimates in each pair are statistically equivalent and the difference for each pair is a constant. However, empirically, not all common items have such equivalent estimates. Some criteria need to be set for selecting good link items.
Our method of selecting good link items was accomplished through two steps. The first step was to take all the common items as link items and use them to shift Anchor Test 2 onto the scale of Anchor Test 1. The second step was to select items that satisfied the statistical criterion
$$|d_{i1} - d'_{i2}| \le 3.0 \times \max(s_{i1}, s_{i2}),$$
where $d'_{i2}$ was the adjusted value of $d_{i2}$ when all common items were treated as link items. This is an arbitrary criterion derived from some empirical studies. The choice of the factor of 3.0 in the equation was a compromise between a more stringent criterion (and hence fewer link items) and having a reasonable number of link items. If there are $k$ such good link items, then the shift needed to put items on Anchor Test 2 onto the same scale as those of Anchor Test 1 is
$$\frac{1}{k}\sum_{i=1}^{k} (d_{i2} - d_{i1}) = \bar{d}_2 - \bar{d}_1.$$
Figure 2: Test Equating Design
Legend:
CP - common person
CI - common item
$\bar{\beta}_1$ - mean ability of Sample 2 as measured by Examination 2
$\bar{\beta}_2$ - mean ability of Sample 2 as measured by Anchor Test 2
$\bar{\beta}_3$ - mean ability of Sample 1 as measured by Anchor Test 1
$\bar{\beta}_4$ - mean ability of Sample 1 as measured by Examination 1
$\bar{d}_2$ - mean difficulty of Anchor Test 2 link items
$\bar{d}_1$ - mean difficulty of Anchor Test 1 link items
d) Link the Anchor Test 1 to Examination 1 through common person equating as in b) but with Sample 1. If $\bar{\beta}_3$ and $\bar{\beta}_4$ were the mean ability estimates of Sample 1 by Anchor Test 1 and Examination 1 respectively, then the shift required to bring Anchor Test 1 onto a common scale with Examination 1 is $\bar{\beta}_3 - \bar{\beta}_4$.
e) To put Examination 2 onto the same scale as Examination 1, we sum up all the shifts, that is
$$(\bar{\beta}_1 - \bar{\beta}_2) + (\bar{d}_2 - \bar{d}_1) + (\bar{\beta}_3 - \bar{\beta}_4),$$
and subtract it from all item difficulties in Examination 2. This shift is also an indication of the relative difficulty of the two tests.
f) Once Examination 1 and Examination 2 are calibrated and put on a common scale, it is possible to derive ability estimates. However, grade boundaries are based on test scores rather than on ability estimates. Hence, we need to map ability estimates to test scores. The method used in this study for mapping ability estimates to test scores will be referred to as the "Guttman Pattern" method. This method requires all the items to be placed on a vertical scale from the easiest to the most difficult; the cumulative test score corresponding to each item defines the mapping from ability estimate to test score. While a regression method is also possible, larger samples would be needed. The "Guttman Pattern" method can be used with samples of about 500. This is an important practical consideration from the test administration point of view.
To illustrate the "Guttman Pattern" method, let us consider a four-item test with item weights of 1, 2, 1, 2 when arranged in ascending order of difficulty. The cumulative sums of 1, 3, 4, 6 (from items arranged in ascending order of difficulty) are associated with IRT raw scores (number of items answered correctly) of 1, 2, 3, 4 respectively. On the other hand, from the CREDIT analysis, the IRT raw scores of 1, 2, 3, 4 are associated with abilities of $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ respectively. Thus, the "Guttman Pattern" method sets up the association of the cumulative scores of 1, 3, 4, 6 with $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ respectively. The abilities associated with the missing scores of 2 and 5 will then be obtained by interpolation.
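Steps b) to e) of the procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the CREDIT workflow itself: all numbers are made up for illustration, the function names are hypothetical, and the adjusted difficulty of a common item on Anchor Test 2 is taken to be its difficulty minus the mean shift computed from all common items.

```python
def mean(values):
    return sum(values) / len(values)


def select_link_items(d1, d2, s1, s2, factor=3.0):
    """Return indices of common items that pass the link-item criterion.

    d1, d2: difficulties of the common items on Anchor Test 1 and 2.
    s1, s2: the associated standard errors.
    """
    # First step: shift Anchor Test 2 onto Anchor Test 1 using all common items.
    shift_all = mean([b - a for a, b in zip(d1, d2)])
    adjusted_d2 = [b - shift_all for b in d2]  # assumed form of the adjustment
    # Second step: keep items with |d_i1 - d'_i2| <= factor * max(s_i1, s_i2).
    return [
        i for i in range(len(d1))
        if abs(d1[i] - adjusted_d2[i]) <= factor * max(s1[i], s2[i])
    ]


def total_shift(b1, b2, b3, b4, d1, d2, link):
    """Sum the three shifts described in steps b), c) and d)."""
    person_shift_2 = b1 - b2                          # Examination 2 to Anchor Test 2
    item_shift = mean([d2[i] - d1[i] for i in link])  # Anchor Test 2 to Anchor Test 1
    person_shift_1 = b3 - b4                          # Anchor Test 1 to Examination 1
    return person_shift_2 + item_shift + person_shift_1


# Hypothetical calibration output for five common items; the last one has drifted.
d1 = [-1.0, -0.5, 0.0, 0.5, 1.0]
d2 = [-0.8, -0.3, 0.2, 0.7, 2.7]
s1 = [0.15] * 5
s2 = [0.15] * 5

link = select_link_items(d1, d2, s1, s2)
shift = total_shift(0.40, 0.10, 0.30, 0.25, d1, d2, link)

# Step e): subtract the total shift from the Examination 2 item difficulties.
exam2_difficulties = [-0.2, 0.6, 1.4]
equated = [d - shift for d in exam2_difficulties]
```

With these made-up values the fifth common item fails the criterion and is dropped, so the item shift is computed from the remaining four link items only.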
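The four-item illustration of the "Guttman Pattern" method can be sketched in Python as below. The ability values are hypothetical, the function name is illustrative, and linear interpolation is an assumption, since the text only states that the abilities for the missing scores are obtained by interpolation.

```python
from itertools import accumulate


def guttman_pattern_map(weights, abilities):
    """Map each test score to an ability estimate.

    weights: item weights (marks) in ascending order of item difficulty.
    abilities: ability estimates for IRT raw scores 1..n, in the same order.
    """
    cumulative = list(accumulate(weights))    # e.g. [1, 3, 4, 6]
    known = dict(zip(cumulative, abilities))  # cumulative score -> ability
    mapping = {}
    for score in range(cumulative[0], cumulative[-1] + 1):
        if score in known:
            mapping[score] = known[score]
            continue
        # Assumed: linear interpolation between the neighbouring known scores.
        lower = max(c for c in cumulative if c < score)
        upper = min(c for c in cumulative if c > score)
        fraction = (score - lower) / (upper - lower)
        mapping[score] = known[lower] + fraction * (known[upper] - known[lower])
    return mapping


# Item weights 1, 2, 1, 2 with hypothetical abilities for raw scores 1 to 4.
score_to_ability = guttman_pattern_map([1, 2, 1, 2], [-1.5, -0.5, 0.5, 1.5])
```

In this sketch the scores 1, 3, 4 and 6 take the abilities supplied for raw scores 1 to 4, while the scores 2 and 5 receive interpolated values.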
Results
Simulation studies were conducted for English, Mathematics and Science. For purposes of comparison, two other "methods" for standard setting were considered: the "Same Boundaries" method and the "Same Percent" method. In the "Same Boundaries" method, the grade boundaries of the previous year's examination were used. This method assumed that the two tests were of precisely equal standard. In the "Same Percent" method, the grade boundaries were determined so that the proportion of pupils obtaining each grade remained the same as in the previous year. This latter method assumed that the distribution of pupils by ability in the two populations was the same. The method in which IRT test equating was used will be denoted by "IRT" method.
In English and Mathematics, the grade boundaries obtained by the three methods were reasonably close for all the grades. However, for the boundary between grade 'B' and 'C' in Science, the result obtained by the IRT method differed quite considerably from the "Same Percent" method and the "Same Boundaries" method. Table 1 shows the results for Science.
We are unable to offer any concrete explanation for this large deviation of the IRT method from the other two methods at the grade B/C boundary. However, we are uncomfortable with our method of mapping of ability estimate to test score by "Guttman Pattern". We suspect our method of mapping can be improved.

Table 1: Grade Boundary and Distribution for Science

                                  Test 2
                                  ------------------------------------------
Grade              Test 1         Same Boundary   Same Percent   IRT
                                  Method          Method         Method
A        Percent   36.3           37.1            37.1           41.1
         Score     75-100         75-100          75-100         73-100
B        Percent   31.1           27.3            30.8           34.8
         Score     60-74          60-74           58-74          53-72
C        Percent   17.1           16.3            17.1           11.7
         Score     50-59          50-59           47-57          45-52
Passed (A to C)    84.5%          80.7%           85.0%          87.6%
F        Percent   15.5           19.3            15.0           12.4
         Score     0-49           0-49            0-46           0-44
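The "Same Percent" method can be sketched in Python as below; the "Same Boundaries" method needs no computation, since it simply reuses the previous year's cut scores. This is a hedged sketch rather than the procedure actually used in the study: the function name is hypothetical, the data are made up, and choosing the cut score whose cumulative percentage is closest to the target is an assumption about how ties between discrete scores are resolved.

```python
def same_percent_boundaries(scores, target_percents, max_score=100):
    """Return the lowest score of each grade, best grade first.

    scores: this year's test scores.
    target_percents: last year's percentage of pupils in each grade,
        best grade first (the lowest grade is implied by the remainder).
    """
    n = len(scores)
    cuts = []
    cumulative_target = 0.0
    upper = max_score + 1
    for percent in target_percents:
        if upper <= 0:
            break  # no room left for a lower grade
        cumulative_target += percent
        # Assumed rule: pick the cut whose percentage of pupils at or above it
        # is closest to the cumulative target percentage.
        best_cut = min(
            range(0, upper),
            key=lambda cut: abs(
                100.0 * sum(s >= cut for s in scores) / n - cumulative_target
            ),
        )
        cuts.append(best_cut)
        upper = best_cut  # the next grade must start below this one
    return cuts


# Hypothetical cohort of 100 pupils scoring 1 to 100, with 25% in each of A, B, C.
boundaries = same_percent_boundaries(list(range(1, 101)), [25, 25, 25])
```

For this made-up cohort the lowest scores for grades A, B and C come out as 76, 51 and 26, with the remaining pupils falling into the lowest grade.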
Discussion
This paper discusses a method of equating two tests so that decision makers in a standard setting exercise can be provided with better information on the test outcomes. The fact that the grade boundaries based on the IRT method do not deviate markedly from other methods suggests that IRT can be applied to tests in which not all items are scored dichotomously.
Our test equating design was chosen out of consideration for the practical constraints within a traditional examination system. The IRT method of equating was chosen because of its capability in handling changes in the curriculum over time. The Partial Credit model was used because the examinations are made up of both multiple-choice and open-ended items.
Although empirical studies showed that this method of awarding grades is appropriate, there are still some problems that have not been resolved to our satisfaction.
Two main unresolved problems are:
a) Criterion for choosing good link items
The criterion adopted was a compromise between imposing a stringent criterion (and hence fewer link items) and having a reasonable number of link items. So far, the criterion used is based on our experience derived from empirical studies.
b) Mapping ability onto test scores
There is a lack of one-to-one mapping between ability estimates and test scores. This arises when items are deleted because everyone gets the item correct or none gets it correct. In the case of open-ended items, item steps may be collapsed in the process of estimating the parameters in the Partial Credit model.
Thus whilst IRT can shed additional light in a standard setting exercise, we have not reached the stage where it can be used exclusively in the decision making process.
References
Brennan, R.L. & Kolen, M.J. (1986). Practical Issues in Linear Equating Using the Common Item Nonequivalent Population Design. ACT Technical Bulletin No. 53. The American College Testing Program.
Birnbaum, A. (1968). Some Latent Trait Models and Their Use in Inferring an Examinee's Ability. In F.M. Lord & M.R. Novick, Statistical Theories of Mental Test Scores. Reading, Mass: Addison-Wesley Publishing.
Hambleton, R.K. (1983). Applications of Item Response Theory. Educational Research Institute of British Columbia.
Holland, P.W. & Rubin, D.B. (1982). Test Equating. Academic Press Inc.
Hulin, C.L., Drasgow, F. & Parsons, C.K. (1983). Item Response Theory: Application to Psychological Measurement. The Dorsey Professional Series, Dow Jones, Irwin.
Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Danish Institute for Educational Research, Copenhagen.
Weiss, D.J. (1983). New Horizons in Testing: Latent Trait Test Theory and Computerised Adaptive Testing. Academic Press.
Wright, B.D. & Masters, G.N. (1982). Rating Scale Analysis: Rasch Measurement. MESA Press, Chicago.
Wright, B.D. & Stone, M.H. (1979). Best Test Design: Rasch Measurement. MESA Press, Chicago.
The Authors
TOH POH GUAN is a Research Officer and SAMINANTHAN GOPAL is an Evaluation Research Officer with the Research and Testing Division, Ministry of Education, Singapore. They have researched into test equating methods for public examinations at grade six level for the past 6 years. They have written papers mainly for their examination staff and award committee.