Application of item response theory in standard setting


Studies in Educational Evaluation, Vol. 15, pp. 285-293, 1989. Printed in Great Britain. All rights reserved.
0191-491X/89 $0.00 + .50 © 1989 Pergamon Press plc

APPLICATION OF ITEM RESPONSE THEORY IN STANDARD SETTING

Toh Poh Guan and Saminanthan Gopal

Ministry of Education, Research Branch, Kay Siang Road, Singapore 1024

In traditional examinations, achievement is reported in the form of grades. Each grade spans a certain range of marks. For example, examinees obtaining between seventy-five and one hundred marks may be awarded grade 'A'. Such grade boundaries are usually determined in a standard setting exercise. An important issue in standard setting is whether a grade in a particular year's examination is equivalent to the same grade in the previous year's examination. The issue of equivalent grades can be resolved through test equating based on Item Response Theory (IRT). This paper describes the application of an IRT method of equating two such traditional examinations and the problems encountered.

Method

Equating two parallel tests involves a procedure for determining empirically a transformation which can bring the two tests onto a common scale so that they can be used interchangeably. For tests to be considered parallel, they must be developed from the same specification in terms of content and item characteristics. In this study, we propose a test equating design that can be applied to the traditional examination system.

Simulation Study

We carried out a research project over the last few years on a simulated traditional examination system. A large group of primary schools that shared a common curriculum in English, Mathematics and Science participated in the project. The research study was conducted at an upper primary level. For each of the three subjects, a common examination was administered to the selected level at the end of the school year. The populations of the selected level for a particular year and the subsequent year will be referred to as Population 1 and Population 2 respectively.

Similarly, Examination 1 and Examination 2 were the examinations administered respectively to Population 1 and Population 2 for each subject.

Design for Data Collection

The data collection design for equating Examination 1 and Examination 2 is depicted in Figure 1. Pupils of Population 1 took Examination 1. A sample of about 500 pupils from Population 1 took Anchor Test 1 within a week after Examination 1. Examination 2 was administered a year later to Population 2, which comprised the subsequent cohort of pupils at the same level as Population 1. A sample of about 500 pupils from Population 2 took Anchor Test 2 within a week after Examination 2.

Examination 2 and Examination 1 may not be strictly parallel tests as they have to reflect the changes in the curriculum over the years. The purpose of using two Anchor tests was to make allowance for the differences between Examination 2 and Examination 1. Anchor Test 1 and Anchor Test 2 were parallel forms of Examination 1 and Examination 2 respectively. About 80% of the items in the two Anchor Tests were common items.

Model Used in this Study

Researchers have proposed many models in IRT.

The appropriateness of a model usually depends on the purpose of testing and the types of items included in a test. Among the more popular models are the logistic models. The one-parameter (Rasch, 1960), two-parameter and three-parameter (Birnbaum, 1968) logistic models are frequently used as measurement models for tests with dichotomously scored items. However, traditional examinations are made up of both multiple-choice and open-ended items, and open-ended items may not be scored dichotomously because a partial answer to an item is awarded partial credit accordingly. One model that can be used to score open-ended items is the Partial Credit Model (Wright and Masters, 1982), which is an extension of the one-parameter logistic model. The computer program CREDIT (MESA, University of Chicago), based on the Partial Credit model, was used to estimate item parameters and examinee abilities in this study.
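As an illustration of how the Partial Credit Model awards partial credit, the following is a minimal sketch of its category probabilities in the usual step-difficulty form, where the probability of score $x$ on an item with steps $\delta_1, \dots, \delta_m$ is proportional to $\exp\sum_{j=1}^{x}(\theta - \delta_j)$. It is not the CREDIT program; the function names and the example values are hypothetical and chosen only for illustration.

```python
import math


def pcm_probabilities(theta, step_difficulties):
    """Probability of each score 0..m on one item under the Partial Credit Model.

    theta: examinee ability (logits).
    step_difficulties: difficulties of steps 1..m (logits); hypothetical values.
    """
    # Cumulative sums of (theta - delta_j); the empty sum for score 0 is 0.
    cumulative = [0.0]
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(theta, step_difficulties):
    """Expected item score: sum of score times probability."""
    probs = pcm_probabilities(theta, step_difficulties)
    return sum(score * p for score, p in enumerate(probs))


# A two-step open-ended item (scores 0, 1, 2) and a one-step item.
probs_two_step = pcm_probabilities(0.0, [0.0, 0.0])
probs_one_step = pcm_probabilities(1.0, [0.0])
```

With a single step the model reduces to the one-parameter logistic (Rasch) model, which is why dichotomous multiple-choice items and multi-step open-ended items can be calibrated on one scale.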

[Figure 1: Schematic Representation of Data Collection in Equating Examination 2 and Examination 1. Diagram labels: Population 1, Examination 1, Anchor Test 1, Common Items, Anchor Test 2, Examination 2, Population 2.]


Procedure

The approach used in this study for equating Examination 2 and Examination 1 is depicted in Figure 2. The detailed equating procedure is outlined as follows:

a) Separately calibrate each test with the respective samples. That is, run the CREDIT program with the input data from Examination 1 with Sample 1, followed by a second run on the data from Anchor Test 1 with Sample 1. Repeat these CREDIT program runs on Examination 2 and Anchor Test 2 with Sample 2.

b) Link Examination 2 to Anchor Test 2 through common person equating, as both tests have been taken by Sample 2. The calibration of each test with the CREDIT program gives two mean ability estimates of Sample 2, namely $\bar{\beta}_1$ and $\bar{\beta}_2$. $\bar{\beta}_1$ is the mean ability estimate of Sample 2 by Examination 2 while $\bar{\beta}_2$ is the estimate by Anchor Test 2. According to IRT, the difference in the two ability means, $\bar{\beta}_1 - \bar{\beta}_2$, estimates the shift required to bring Examination 2 onto a common scale with Anchor Test 2.

c) Link the two Anchor tests together through the common items, as about 80% of the items in the tests are common items. Calibration of these Anchor tests using the CREDIT program produces a pair of item difficulties $d_{i1}$ and $d_{i2}$ and a pair of associated standard errors $s_{i1}$ and $s_{i2}$ for each common item $i$. In theory, the difficulty estimates in each pair are statistically equivalent and the difference for each pair is a constant. However, empirically, not all common items have such equivalent estimates. Some criteria need to be set for selecting good link items.

Our method of selecting good link items was accomplished through two steps. The first step was to take all the common items as link items and use them to shift Anchor Test 2 onto the scale of Anchor Test 1. The second step was to select items that satisfied the statistical criterion

$$|d_{i1} - d'_{i2}| \le 3.0 \times \max(s_{i1}, s_{i2}),$$

where $d'_{i2}$ was the adjusted value of $d_{i2}$ when all common items were treated as link items. This is an arbitrary criterion derived from some empirical studies. The choice of the factor of 3.0 in the criterion was a compromise between a more stringent criterion (and hence fewer link items) and having a reasonable number of link items. If there are $k$ such good link items, then the shift needed to put items on Anchor Test 2 onto the same scale as those of Anchor Test 1 is

$$\frac{1}{k}\sum_{i=1}^{k} (d_{i2} - d_{i1}) = \bar{d}_2 - \bar{d}_1.$$

[Figure 2: Test Equating Design. Legend:
CP - common person
CI - common item
$\bar{\beta}_1$ - mean ability of Sample 2 as measured by Examination 2
$\bar{\beta}_2$ - mean ability of Sample 2 as measured by Anchor Test 2
$\bar{\beta}_3$ - mean ability of Sample 1 as measured by Anchor Test 1
$\bar{\beta}_4$ - mean ability of Sample 1 as measured by Examination 1
$\bar{d}_2$ - mean difficulty of Anchor Test 2 link items
$\bar{d}_1$ - mean difficulty of Anchor Test 1 link items]

d) Link Anchor Test 1 to Examination 1 through common person equating as in b) but with Sample 1. If $\bar{\beta}_3$ and $\bar{\beta}_4$ are the mean ability estimates of Sample 1 by Anchor Test 1 and Examination 1 respectively, then the shift required to bring Anchor Test 1 onto a common scale with Examination 1 is $\bar{\beta}_3 - \bar{\beta}_4$.

e) To put Examination 2 onto the same scale as Examination 1, we sum up all the shifts, that is

$$(\bar{\beta}_1 - \bar{\beta}_2) + (\bar{d}_2 - \bar{d}_1) + (\bar{\beta}_3 - \bar{\beta}_4),$$

and subtract it from all item difficulties in Examination 2. This shift is also an indication of the relative difficulty of the two tests.

f) Once Examination 1 and Examination 2 are calibrated and put on a common scale, it is possible to derive ability estimates. However, grade boundaries are based on test scores rather than on ability estimates. Hence, we need to map ability estimates to test scores. The method used in this study for mapping ability estimates to test scores will be referred to as the "Guttman Pattern" method. This method requires all the items to be placed on a vertical scale from the easiest to the most difficult; the cumulative test score corresponding to each item defines the mapping from ability estimate to test score. While a regression method is also possible, larger samples would be needed. The "Guttman Pattern" method can be used with samples of about 500. This is an important practical consideration from the test administration point of view.

To illustrate the "Guttman Pattern" method, let us consider a four-item test with item weights of 1, 2, 1, 2 when arranged in ascending order of difficulty. The cumulative sums of 1, 3, 4, 6 (from items arranged in ascending order of difficulty) are associated with IRT raw scores (number of items answered correctly) of 1, 2, 3, 4 respectively. On the other hand, from the CREDIT analysis, the IRT raw scores of 1, 2, 3, 4 are associated with abilities of $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ respectively. Thus, the "Guttman Pattern" method sets up the association of the cumulative scores of 1, 3, 4, 6 with $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ respectively. The abilities associated with the missing scores of 2 and 5 will then be obtained by interpolation.
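The linking arithmetic in steps b) to e) and the "Guttman Pattern" mapping in step f) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure as implemented with CREDIT: the adjustment in the first selection step is assumed to be the mean difference over all common items, the interpolation is assumed to be linear (the text says only "interpolation"), and all function names and numerical values are hypothetical.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def select_link_items(d1, d2, se1, se2, factor=3.0):
    """Indices of common items kept as link items (two-step selection in c)."""
    # Step 1: shift Anchor Test 2 difficulties using ALL common items
    # (assumed here to be the mean difference d_i2 - d_i1).
    initial_shift = mean(b - a for a, b in zip(d1, d2))
    adjusted_d2 = [b - initial_shift for b in d2]
    # Step 2: keep items with |d_i1 - d'_i2| <= factor * max(s_i1, s_i2).
    return [
        i
        for i, (a, b_adj, s1, s2) in enumerate(zip(d1, adjusted_d2, se1, se2))
        if abs(a - b_adj) <= factor * max(s1, s2)
    ]


def link_shift(d1, d2, keep):
    """Mean of (d_i2 - d_i1) over the retained link items."""
    return mean(d2[i] - d1[i] for i in keep)


def total_shift(beta1, beta2, anchor_shift, beta3, beta4):
    """Sum of the three shifts in e); subtract it from Examination 2 difficulties."""
    return (beta1 - beta2) + anchor_shift + (beta3 - beta4)


def guttman_mapping(weights, abilities):
    """Map test scores to abilities from item weights in ascending difficulty.

    abilities[j] is the ability estimate for an IRT raw score of j + 1.
    Scores between two cumulative sums are filled by linear interpolation
    (an assumption).
    """
    cumulative, running = [], 0
    for w in weights:
        running += w
        cumulative.append(running)
    points = list(zip(cumulative, abilities))
    mapping = dict(points)
    for (s_lo, b_lo), (s_hi, b_hi) in zip(points, points[1:]):
        for s in range(s_lo + 1, s_hi):
            mapping[s] = b_lo + (b_hi - b_lo) * (s - s_lo) / (s_hi - s_lo)
    return mapping


# Hypothetical common-item difficulties and standard errors (logits).
d1 = [0.0, 1.0, -1.0, 0.5]
d2 = [0.2, 1.2, -0.8, 2.7]
se1 = se2 = [0.2, 0.2, 0.2, 0.2]
keep = select_link_items(d1, d2, se1, se2)
anchor_shift = link_shift(d1, d2, keep)
shift = total_shift(0.3, 0.1, anchor_shift, -0.1, 0.05)

# The four-item example from the text, with hypothetical ability values.
mapping = guttman_mapping([1, 2, 1, 2], [-1.0, 0.0, 0.5, 1.5])
```

In this made-up example the fourth common item drifts far more than the others and is dropped by the 3.0 criterion, so the anchor shift is computed from the remaining three items. The mapping contains entries for every score from 1 to 6, with the values at 2 and 5 interpolated.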


Results

Simulation studies were conducted for English, Mathematics and Science. For purposes of comparison, two other "methods" for standard setting were considered: the "Same Boundaries" method and the "Same Percent" method. In the "Same Boundaries" method, the grade boundaries of the previous year's examination were used. This method assumed that the two tests were of precisely equal standard. In the "Same Percent" method, the grade boundaries were determined so that the proportion of pupils obtaining each grade remained the same as in the previous year. This latter method assumed that the distribution of pupils by ability in the two populations was the same. The method in which IRT test equating was used will be denoted the "IRT" method.

In English and Mathematics, the grade boundaries obtained by the three methods were reasonably close for all the grades. However, for the boundary between grade 'B' and 'C' in Science, the result obtained by the IRT method differed quite considerably from the "Same Percent" method and the "Same Boundaries" method. Table 1 shows the results for Science. We are unable to offer any concrete explanation for this large deviation of the IRT method from the other two methods at the grade B/C boundary. However, we are uncomfortable with our method of mapping ability estimates to test scores by "Guttman Pattern". We suspect our method of mapping can be improved.

Table 1: Grade Boundary and Distribution for Science

                                              Test 2
                              ----------------------------------------
Grade              Test 1     Same Boundary  Same Percent  IRT
                              Method         Method        Method
A      Percent     36.3       37.1           37.1          41.1
       Score       75-100     75-100         75-100        73-100
B      Percent     31.1       27.3           30.8          34.8
       Score       60-74      60-74          58-74         53-72
C      Percent     17.1       16.3           17.1          11.7
       Score       50-59      50-59          47-57         45-52
Passed (A to C)    84.5%      80.7%          85.0%         87.6%
F      Percent     15.5       19.3           15.0          12.4
       Score       0-49       0-49           0-46          0-44
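The "Same Percent" method described earlier in this section can be sketched as a search for cut scores that reproduce the previous year's cumulative proportions. The sketch below is an assumption about how such boundaries might be computed; the paper does not state how ties or rounding were handled, and the function name and data are hypothetical.

```python
def same_percent_boundaries(scores, target_cumulative):
    """Lowest mark for each grade, best grade first.

    scores: marks of this year's pupils.
    target_cumulative: for each grade, the proportion of last year's pupils
        at or above that grade (for example A, then A+B, then A+B+C).
    """
    n = len(scores)
    candidates = sorted(set(scores), reverse=True)
    boundaries = []
    for target in target_cumulative:
        # Choose the cut score whose "at or above" proportion is closest
        # to last year's cumulative proportion (assumed tie rule: higher cut).
        best = min(
            candidates,
            key=lambda cut: abs(sum(s >= cut for s in scores) / n - target),
        )
        boundaries.append(best)
    return boundaries


# Ten hypothetical pupils; targets of 20%, 50% and 80% at or above each grade.
scores = [90, 80, 70, 60, 50, 40, 30, 20, 10, 0]
boundaries = same_percent_boundaries(scores, [0.2, 0.5, 0.8])
```

For the Science data in Table 1, the Test 1 column would give cumulative targets of 0.363, 0.674 and 0.845 for grades A, B and C. Because marks are whole numbers, the achieved percentages can only approximate these targets, which is consistent with the small differences between the Test 1 and "Same Percent" columns.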


Discussion

This paper discusses a method of equating two tests so that decision makers in a standard setting exercise can be provided with better information on the test outcomes. The fact that the grade boundaries based on the IRT method do not deviate markedly from other methods suggests that IRT can be applied to tests in which not all items are scored dichotomously.

Our test equating design was chosen out of consideration for the practical constraints within a traditional examination system. The IRT method of equating was chosen because of its capability in handling changes in the curriculum over time. The Partial Credit model was used because the examinations are made up of both multiple-choice and open-ended items.

Although empirical studies showed that this method of awarding grades is appropriate, there are still some problems that have not been resolved to our satisfaction.

Two main unresolved problems are:

a) Criterion for choosing good link items

The criterion adopted was a compromise between imposing a stringent criterion (and hence fewer link items) and having a reasonable number of link items. So far, the criterion used is based on our experience derived from empirical studies.

b) Mapping ability onto test scores

There is a lack of one-to-one mapping between ability estimates and test scores. This arises when items are deleted because everyone gets the item correct or no one gets it correct. In the case of open-ended items, item steps may be collapsed in the process of estimating the parameters in the Partial Credit model.

Thus, whilst IRT can shed additional light in a standard setting exercise, we have not reached the stage where it can be used exclusively in the decision making process.


References

Brennan, R.L. & Kolen, M.J. (1986). Practical Issues in Linear Equating Using the Common Item Nonequivalent Population Design. ACT Technical Bulletin No. 53. The American College Testing Program.

Birnbaum, A. (1968). Some Latent Trait Models and Their Use in Inferring an Examinee's Ability. In F.M. Lord & M.R. Novick, Statistical Theories of Mental Test Scores. Reading, Mass: Addison-Wesley Publishing.

Hambleton, R.K. (1983). Applications of Item Response Theory. Educational Research Institute of British Columbia.

Holland, P.W. & Rubin, D.B. (1982). Test Equating. Academic Press Inc.

Hulin, C.L., Drasgow, F. & Parsons, C.K. (1983). Item Response Theory: Application to Psychological Measurement. The Dorsey Professional Series, Dow Jones-Irwin.

Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Danish Institute for Educational Research, Copenhagen.

Weiss, D.J. (1983). New Horizons in Testing: Latent Trait Test Theory and Computerised Adaptive Testing. Academic Press.

Wright, B.D. & Masters, G.N. (1982). Rating Scale Analysis: Rasch Measurement. MESA Press, Chicago.

Wright, B.D. & Stone, M.H. (1979). Best Test Design: Rasch Measurement. MESA Press, Chicago.

The Authors

TOH POH GUAN is a Research Officer and SAMINANTHAN GOPAL is an Evaluation Research Officer with the Research and Testing Division, Ministry of Education, Singapore. They have researched into test equating methods for public examinations at grade six level for the past 6 years. They have written papers mainly for their examination staff and award committee.