Selecting items for criterion-referenced tests


Evaluation in Education. 1982, Vol. 5, pp. 177-190

0191-765X/82/020177-14$7.00/0 Copyright © 1982 Pergamon Press Ltd.

Printed in Great Britain. All rights reserved

CHAPTER 6

SELECTING ITEMS FOR CRITERION-REFERENCED TESTS

Gideon J. Mellenbergh and Wim J. van der Linden

Vakgroep Methodenleer, Psychologisch Laboratorium, Universiteit van Amsterdam, Weesperplein 8, 1018 XA Amsterdam, The Netherlands

ABSTRACT

The concern of this paper is with the choice of optimal item selection methods for criterion-referenced tests. Three classes of methods are examined. It is first shown how methods based on classical parameters, such as item difficulty and item-test correlation, have been used for this purpose, and some criticism of these methods is formulated. Next, methods based on item characteristic curve theory are elucidated. These methods have important advantages inasmuch as they are well-suited to analyze the local properties of items at the mastery score. Finally, attention is called to the fact that selecting items for mastery decisions should also allow for the utility of the decision outcomes and the distribution of the true scores. It is shown how this can be taken into consideration by approaching the item selection from a decision-theoretic point of view.

Educational tests usually consist of a series of separate items. It often occurs that only a limited amount of time for administering the test is available but that a large collection of items measuring the intended achievements can be contrived or is already on hand. In such a case, the necessity of item selection arises, and we carefully select items from the collection available until the test is composed best for our purposes.

Recently, it has become customary to distinguish between two major types of educational measurement. One is known as norm-referenced measurement, the other as criterion-referenced measurement. The need for criterion-referenced measurements has manifested itself chiefly in mastery learning or individualized instruction programs. For the implementation of these programs, measurements with a behavioral interpretation are required. Criterion-referenced testing procedures are developed to provide these measurements since they place students on a continuum indicating the specific knowledge and skills they are able to perform (Glaser, 1963; Glaser & Klaus, 1962; Glaser & Nitko, 1971; Nitko, 1980). In norm-referenced measurement the concern is not with a behavioral interpretation of test scores but with the relative standing of students on the continuum represented by the test scores. Test scores are norm-referenced when they indicate how much better or worse the performances of students are compared with those of other students in the population or norm group.

From this distinction it should not be inferred that norm-referenced testing has nothing to do with the measurement of behaviorally defined variables. In intelligence testing, for instance, which is a branch of testing typically dominated by norm-referenced procedures, subjects are tested to measure their relative position on well-defined and validated psychological constructs, and it would certainly be wrong to state that these constructs do not have a behavioral interpretation. The point is, however, that in norm-referenced measurement the continuum represented by the test scores allows only a global behavioral interpretation: it is only the "continuum as a whole" that is behaviorally defined. Criterion-referenced measurement, on the other hand (and in our opinion this is the critical difference), involves the possibility of local behavioral interpretations: for points along the continuum specific behaviors can be indicated so that test scores corresponding with these points can be clearly interpreted.

In most applications of criterion-referenced testing procedures we know of, test scores are ultimately used for making mastery decisions, that is, to determine whether a student masters an instructional objective and may proceed to the next objective or needs extra learning time and instruction (Hambleton, 1974; Glaser & Nitko, 1973). In all these instances, the interest is exclusively at one point of the criterion-referenced continuum dividing it into a "mastery" and a "nonmastery" region. It is customary to call this point the mastery score. In view of the popularity of mastery testing, we will restrict this paper to the issue of item selection for mastery tests.

Generally, the necessity of item analysis arises from the fact that items are not universally good but measure some performance levels better than others. Intuitively, an item yields the most accurate results when its difficulty matches the performance level of the student. It also seems that some items cover a wide range of performance levels, whereas others are more "critical" and discriminate only within a small range. In addition, some items allow students more success in guessing than others, and, though the possibility of guessing may influence the accuracy of all performance levels measured, it seems obvious to assume this problem to be more serious for levels for which the item is difficult. These properties of "differential sensitivity" of items are important when selecting items for criterion-referenced tests. In most instances of criterion-referenced measurement, the interest is in certain regions of the continuum, notably the region around the mastery score. It is the purpose of criterion-referenced item analysis to select items that are maximally sensitive to small differences between test scores corresponding with this region.
In this paper it will be shown how psychometric methods can be used for this analysis. We shall first briefly review item selection methods using classical parameters such as item difficulty and item-test correlation. Then, it will be pointed out how the use of item characteristic curve theory can improve the selection of items for criterion-referenced tests. Finally, it is indicated that the quality of mastery decisions depends not only on the properties of the items but also on the utility of the decision outcomes and the distribution of the true scores, and that these can be taken into consideration by approaching item selection from a decision-theoretic point of view.

Before dealing with these methods, we note that according to some authors (e.g., Millman, 1974) item analysis should be omitted in criterion-referenced measurement, especially when domain-referenced testing procedures (e.g., Hively, 1974) are used. Item writing often involves some subjectivity, though, and it is a common experience that this subjectivity may leave items with unwanted properties undetected until item analysis indicates that something is wrong. Furthermore, even the use of item generation rules occasionally leads to unexpected results needing revision. We therefore advocate item analysis using empirical data when screening items for the test. If domain-referenced testing procedures are adopted, the proper moment for item analysis is when admitting items into the domain from which the test will be sampled.

METHODS BASED ON CLASSICAL TEST THEORY

In classical item analysis two item parameters are considered: item difficulty and item discriminating power. The former is defined as the expected proportion of subjects answering the item correctly. The latter is commonly conceived as the biserial or point-biserial correlation of the item scores with the total score or the rest score. For the case of no guessing being possible, a classical recommendation is to select items with difficulties close to .50 and as large a discriminating power as possible. The recommendation is based on the idea that the test score distribution should have maximal variance. This can be shown by the following formula from classical test theory, which, for an n-item test, relates the test standard deviation, σ_X, the item difficulties, π_i, and the item-total (point-biserial) correlations, ρ_iX, to each other:

\sigma_X = \sum_{i=1}^{n} \rho_{iX} \sqrt{\pi_i (1 - \pi_i)},    (1)

(Lord & Novick, 1968, p. 330). The test score variance is maximal for π_i = .50 and ρ_iX = 1.00, i = 1, 2, ..., n.

Tests with uniform item difficulty are called peaked tests. The properties of peaked tests with π_i = .50 and high item-total correlations have been extensively examined, notably in the early literature on the attenuation paradox (e.g., Brogden, 1946; Cronbach & Warrington, 1952; Loevinger, 1954; Tucker, 1946).
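As a quick numerical check of formula (1) (an illustration of ours, not part of the original paper), the following Python sketch computes item difficulties, item-total point-biserial correlations, and the test standard deviation from a small invented 0/1 response matrix; the value obtained from formula (1) coincides with the standard deviation computed directly from the total scores.

```python
import numpy as np

# Invented 0/1 response matrix: rows are subjects, columns are items.
X = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
])

total = X.sum(axis=1)           # total test scores
pi = X.mean(axis=0)             # item difficulties (proportions correct)

# Item-total point-biserial correlations rho_iX.
rho = np.array([np.corrcoef(X[:, i], total)[0, 1] for i in range(X.shape[1])])

# Formula (1): sigma_X = sum_i rho_iX * sqrt(pi_i * (1 - pi_i)).
sigma_from_items = np.sum(rho * np.sqrt(pi * (1.0 - pi)))
sigma_direct = total.std()      # standard deviation of the total scores

print(pi, rho)
print(sigma_from_items, sigma_direct)   # the two values agree
```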


The classical item selection procedure seems valid only for norm-referenced measurement. As noted earlier, in this type of measurement there is no concern with particular points of the continuum measured by the test, and if the test variance is large it is certain that a broad range of the continuum is measured as accurately as possible. In mastery testing, however, interest is at one particular point of the latent continuum, namely the point separating masters from nonmasters. It is not certain that this point is measured optimally when the test scores show large variability.

For mastery testing it has been proposed to base item selection on a pretest-posttest method. This method still remains within the framework of classical test theory but approaches item selection from a different point of view. Cox and Vargas (1966) were the first to introduce a pretest-posttest coefficient, which is simply the difference between the proportions of subjects expected to give a correct response to item i before and after instruction:

\Delta_i = \pi_{i1} - \pi_{i0},    (2)

π_i0 and π_i1 denoting the proportions before and after instruction, respectively. The more subjects profit from the instruction, the larger the difference between the proportions of correct responses and the larger coefficient Δ_i. Several modifications of the coefficient have followed; for a review, see Berk (1980). All these modifications are based on the same rationale as Δ_i, and it will, therefore, suffice to present this coefficient only.

Pretest-posttest coefficients and their rationale have been exposed to serious criticism (van der Linden, 1981). For the present paper one point of criticism is of particular interest. The coefficients have been developed starting from the assumption that the difference between the proportions of correct item responses reflects the sensitivity of the item to subjects' transition from the nonmastery to the mastery state. However, a high value of Δ_i does not indicate that the item has high discriminating power at a particular point of the continuum. Coefficients like Δ_i can thus not be indicative of the discriminating properties of the item at the mastery score.

This point is illustrated further by considering the ultimate consequences of item selection based on classical item parameters and pretest-posttest coefficients. As noted above, the total score variance is maximal when all items have difficulties equal to .50 (π_i = .50, i = 1, 2, ..., n) and item-total correlations equal to 1.00 (ρ_iX = 1.00, i = 1, 2, ..., n). In this extreme case, 50 percent of all subjects obtain a total score of n and the remaining percentage a total score of 0. For all mastery scores on the continuum and cut-off scores on the test, c (0 < c < n), this implies 50 percent of passes. Similarly, the highest possible value of Δ_i is reached when π_i1 = 1.00 and π_i0 = .00. In this extreme case, all subjects have a pretest score of 0 and a posttest score of n. For all mastery scores and cut-off scores on the test (c < n), the percentage of passes is 100 percent. In each of the two cases, a fixed percentage of subjects passes. These consequences contradict the mastery testing idea that the proportion of passes should not be predetermined but depend on the proportion of subjects above the mastery score after instruction. Thus, both the classical and the pretest-posttest item selection procedures contradict the basic philosophy of mastery testing.
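As an illustration (ours; the response matrices are invented), coefficient Δ_i of formula (2) is simply the per-item difference between posttest and pretest proportions correct:

```python
import numpy as np

# Invented 0/1 responses of the same group before and after instruction
# (rows: subjects, columns: items).
pretest  = np.array([[0, 0, 1, 0],
                     [0, 1, 0, 0],
                     [1, 0, 0, 0],
                     [0, 0, 0, 1]])
posttest = np.array([[1, 1, 1, 0],
                     [1, 1, 0, 1],
                     [1, 0, 1, 1],
                     [1, 1, 0, 1]])

pi_0 = pretest.mean(axis=0)    # proportion correct before instruction
pi_1 = posttest.mean(axis=0)   # proportion correct after instruction
delta = pi_1 - pi_0            # formula (2): Delta_i = pi_i1 - pi_i0

print(delta)   # one coefficient per item; larger values suggest more "gain"
```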

METHODS BASED ON LATENT TRAIT THEORY

A well-known model in latent trait theory is the three-parameter logistic model. According to this model the probability of a subject with score θ on the latent continuum giving a correct response to item i is written as:

P_i(\theta) = g_i + (1 - g_i)\left[1 + \exp\{-D a_i(\theta - b_i)\}\right]^{-1},    (3)

where D denotes a constant, g_i the item guessing, b_i the item difficulty, and a_i the item discriminating power parameter (Birnbaum, 1968, p. 405; Lord, 1980, p. 12). The special case of the two-parameter model is obtained by setting g_i = 0 (i = 1, 2, ..., n), which means that it is supposed to be impossible to obtain a correct response by random guessing. The special case of the one-parameter logistic or Rasch model is obtained by setting g_i = 0 and a_i = 1 (i = 1, 2, ..., n). The latter condition implies that all items are assumed to have the same discriminating power.

In latent trait theory, a subject's score on the continuum is estimated from the item responses. A usual measure of the precision of the estimate is the asymptotic variance of the estimate; the smaller the variance, the smaller the confidence interval for the subject's latent score. When maximum likelihood estimation is involved, the inverse of the asymptotic variance is called the information function I(θ). The information function is used in latent trait theory as a measure of precision; the larger the information function value, the smaller the asymptotic variance and the smaller the confidence interval for the subject's score on the latent continuum. For the model given in (3), the test information is equal to

I(\theta) = \sum_{i=1}^{n} I_i(\theta),    (4)

where

I_i(\theta) = \frac{[P_i'(\theta)]^2}{P_i(\theta)\{1 - P_i(\theta)\}}    (5)

is the item information function and P_i'(θ) denotes the first derivative of P_i(θ) with respect to θ (Birnbaum, 1968, p. 454; Lord, 1980, chap. 5).


Formula 4 shows that the test information function is defined using the item information functions. Formula 5 shows that each item information function is a function of the latent score θ. The consequence of these properties is that information functions can be used for item selection in mastery testing (Lord, 1980, chap. 11). In mastery testing, a cut-off score on the latent continuum, θ_c say, is defined. Subjects with latent scores above θ_c are considered masters, the others nonmasters. It is obvious that a mastery test can only be of good quality when it discriminates optimally at the point θ_c. A natural item selection strategy based on latent trait theory is to select items with values for I_i(θ_c) such that the final test shows an optimal value for I(θ_c). An application of the procedure can be found in van der Linden (1981). As opposed to methods based on classical test theory, latent trait theory allows an item selection strategy explicitly taking account of the mastery score θ_c. The final test has optimal discriminating power at θ_c, which implies, for instance, that the proportion of passes is not predetermined, as in the classical methods, but an accurate estimate of the true proportion of masters (θ ≥ θ_c). It should be noted that at other points of the continuum the test may have poor properties. However, this is in perfect harmony with the idea of mastery testing. A mastery test must have favorable properties at the mastery score, whereas the properties at other points are of no interest. In norm-referenced testing, on the contrary, there is no local interest and all points of the continuum are equally important.
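The following Python sketch (our illustration; the item parameter values and the mastery score are invented) implements formulas (3) through (5) and the selection strategy just described: each item's information is evaluated at θ_c, and the items contributing most to I(θ_c) are retained. The scaling constant D is taken as 1.7, the value commonly used with the logistic model.

```python
import numpy as np

D = 1.7  # usual scaling constant for the logistic model

def p_correct(theta, a, b, g):
    """Three-parameter logistic model, formula (3)."""
    return g + (1.0 - g) / (1.0 + np.exp(-D * a * (theta - b)))

def item_information(theta, a, b, g):
    """Item information function, formula (5): [P'(theta)]^2 / [P(theta){1 - P(theta)}]."""
    p = p_correct(theta, a, b, g)
    # First derivative of formula (3) with respect to theta.
    z = np.exp(-D * a * (theta - b))
    p_prime = (1.0 - g) * D * a * z / (1.0 + z) ** 2
    return p_prime ** 2 / (p * (1.0 - p))

# Hypothetical item pool: (a, b, g) per item.
pool = [(1.2, -0.5, 0.20), (0.8, 0.0, 0.25), (1.5, 0.3, 0.20),
        (0.6, 1.0, 0.20), (1.8, 0.1, 0.15), (1.0, -1.0, 0.25)]

theta_c = 0.2     # mastery score on the latent continuum
n_items = 3       # desired test length

info_at_cutoff = [item_information(theta_c, *item) for item in pool]
selected = sorted(range(len(pool)), key=lambda i: info_at_cutoff[i], reverse=True)[:n_items]

# Test information at theta_c, formula (4), for the selected items.
test_info = sum(info_at_cutoff[i] for i in selected)
print(selected, test_info)
```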

METHODS BASED ON DECISION THEORY

Recently, methods from statistical decision theory have been applied in achievement testing, in particular to solve mastery testing problems. For a review, see van der Linden (1980). The basic ideas can be explained using the dichotomous decision situation considered above. The latent continuum measured by the test is divided by a point, θ_c, into a mastery and a nonmastery region. A cut-off score on the test, c, is chosen, and students scoring below c fail the test whereas the others pass. The four possible outcomes of a test administration are: a master passes, a master fails, a nonmaster passes, and a nonmaster fails. The problem is to choose a value of c such that the decisions are optimal in some sense.

In the decision-theoretic approach to the problem, utility functions are needed representing the pay-offs of the various decision outcomes. Both for the pass and the fail decision a separate function is required relating the utility of (for example) the educational, psychological, and economic consequences to the student's position on the latent continuum. In the psychometric literature, three classes of utility functions have dominated: the threshold utility function (Hambleton & Novick, 1973), the linear utility function (van der Linden & Mellenbergh, 1977), and the normal-ogive utility function (Novick & Lindley, 1978). In this paper only the first two functions are used. An example of the threshold function is given in Fig. 6.1a and an example of the linear function in Fig. 6.1b.


The threshold utility function, for example, asserts that for passed students, utility is a constant for nonmasters as well as masters, but that this constant is lower for the former than for the latter. The linear utility function does not show a jump at θ_c and relates linearly to θ for the two decisions.

Fig. 6.1. Examples of threshold (a) and linear (b) utility functions, plotted against the latent score θ with the mastery score θ_c marked; separate curves are shown for the pass (X ≥ c) and fail (X < c) decisions.
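To make the two types of utility functions concrete, the sketch below (ours; the particular constants and the slope are arbitrary choices, not values from the paper) defines a threshold and a linear utility function of the latent score θ for the pass and fail decisions.

```python
def threshold_utility(theta, decision, theta_c=0.0,
                      u_pass_master=1.0, u_pass_nonmaster=0.0,
                      u_fail_master=0.0, u_fail_nonmaster=1.0):
    """Threshold utility: constant on each side of theta_c, with a jump at theta_c.
    The constant for passing a nonmaster is lower than for passing a master."""
    if decision == "pass":
        return u_pass_master if theta >= theta_c else u_pass_nonmaster
    return u_fail_master if theta >= theta_c else u_fail_nonmaster

def linear_utility(theta, decision, theta_c=0.0, slope=1.0):
    """Linear utility: increases with theta for a pass, decreases for a fail,
    with the two lines crossing at theta_c (no jump)."""
    if decision == "pass":
        return slope * (theta - theta_c)
    return -slope * (theta - theta_c)

# Example: utilities of passing versus failing a student slightly above theta_c.
print(threshold_utility(0.3, "pass"), threshold_utility(0.3, "fail"))
print(linear_utility(0.3, "pass"), linear_utility(0.3, "fail"))
```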


The expected utility of a decision for a randomly selected student is equal to

E(U) = \sum_{x=0}^{n} \int_{-\infty}^{\infty} u(\theta)\, k(x, \theta)\, d\theta,    (6)

where k(x, θ) is the probability function of the joint distribution of the latent score, θ, and the test score, X. In Bayesian decision theory, the expected utility defined in (6) is the criterion in deriving optimal decision rules. For example, using a threshold or a linear utility function and taking certain conditions on k(x, θ) for granted, it is possible to determine optimal cut-off scores yielding the highest value of the expected utility among all possible cut-off scores (Huynh, 1976; van der Linden & Mellenbergh, 1977).

The expected utility has also been used for deriving coefficients for tests. Suppose that a cut-off score on the test, c, has been chosen (preferably optimally, but this is not necessary for the following). The expected utility can serve as an index of the quality of the decision procedure: the higher the expected utility, the better the decision procedure. This index is hardly interpretable, though, because it is not standardized (for example, on the unit interval). Van der Linden and Mellenbergh (1978) standardized the expected utility by considering two hypothetical situations: the test contains complete information about the latent continuum and the test contains no information about the continuum. The notion of complete information is formalized by stating that the latent ability is an increasing function of the observed score, while "no information" refers to the situation where the observed and latent scores are independently distributed. For the chosen cut-off score three values of the expected utility can be determined: the expected utility for the actual decision procedure (U) and the expected utilities for the hypothetical cases of complete (U_c) and no information (U_n), respectively. The standardized index of the quality of the decision procedure is then

\delta = (U - U_n)/(U_c - U_n).    (7)

Van der Linden and Mellenbergh (1978) indicate that the coefficient is not necessarily in the interval from 0 to 1, but show that for some important special cases the coefficient does have this property. See also Mellenbergh and van der Linden (1979). Van der Linden and Mellenbergh report two important cases. For the threshold utility function, coefficient δ equals Loevinger's coefficient H computed between the dichotomized latent continuum and the test score variable. Because the latent continuum is unobserved, a psychometric model, such as the beta-binomial model (Lord & Novick, 1968, chap. 23) or Lord's Method 20 (Lord, 1969), is needed to estimate coefficient δ in this case. For the linear utility function and a linear regression function of the latent scores on the observed scores, coefficient δ equals the reliability coefficient as defined in classical test theory. In these two cases, coefficient δ is in the unit interval because Loevinger's H and the reliability coefficient are in this interval.
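A numerical sketch of formula (6) and of the choice of an optimal cut-off score (our illustration, not the authors' procedure): it assumes a beta density for the latent proportion-correct score, a binomial conditional distribution for the observed score, and a linear utility, and simply evaluates the expected utility for every possible cut-off score. Standardizing the result as in formula (7) would additionally require the expected utilities under complete and no information.

```python
import numpy as np
from scipy.stats import beta, binom

n = 10                 # test length
theta_c = 0.6          # mastery score on the latent (proportion-correct) scale
slope = 1.0            # slope of the linear utility function

theta = np.linspace(0.0005, 0.9995, 1000)     # grid over the latent continuum
dtheta = theta[1] - theta[0]
g = beta.pdf(theta, a=5, b=3)                 # assumed true-score density g(theta)
g /= (g * dtheta).sum()                       # renormalize on the grid

def utility(theta, decision):
    """Linear utility for the pass and fail decisions, crossing at theta_c."""
    return slope * (theta - theta_c) if decision == "pass" else -slope * (theta - theta_c)

def expected_utility(c):
    """Formula (6): E(U) = sum over x of the integral of u(theta) k(x, theta) dtheta,
    with k(x, theta) = P(X = x | theta) g(theta) and 'pass' iff X >= c."""
    total = 0.0
    for x in range(n + 1):
        decision = "pass" if x >= c else "fail"
        total += (utility(theta, decision) * binom.pmf(x, n, theta) * g * dtheta).sum()
    return total

eu = {c: expected_utility(c) for c in range(n + 2)}   # c = n + 1 means nobody passes
best_c = max(eu, key=eu.get)
print(best_c, eu[best_c])                             # optimal cut-off score
```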

Wilcox (1978) has proposed another standardization such that the coefficient is always in the unit interval.

An attempt is made to use the decision-theoretic approach in item analysis. An obvious procedure is the following: first, the expected utility defined in (6) is computed for the test. Next, item i is removed from the test and the expected utility is computed for the remaining (n-1)-item test. Finally, an index is computed which is a suitable function of these two expected utilities. This index, then, can be taken as a measure of the contribution of item i to the expected utility of the mastery testing procedure. The larger this index, the larger the contribution of item i and the more appropriate the item is for mastery testing. Suppose the difference between the two expected utilities is taken as the measure of the contribution of item i. It can be shown that this difference is equal to

\int_{-\infty}^{\infty} \left\{ P(X \ge c \mid \theta, n) - P_i(X \ge c' \mid \theta, n-1) \right\} g(\theta)\, u(\theta)\, d\theta,    (8)

g(θ) denoting the probability function of the latent score θ. In this expression, P(X ≥ c | θ, n) is the probability of a mastery decision for an examinee with latent score θ on the n-item test, whereas P_i(X ≥ c' | θ, n-1) is the same probability for the test without item i. Notice that in the latter probability the cut-off score on the (n-1)-item test is represented by c', indicating that its value may be taken differently from the one on the n-item test. At least three options are possible. First, c' can be set equal to c. Second, c' can be chosen one point lower than c. Third, and this option seems the most preferable one, both c and c' can be chosen optimally by maximizing the expected utilities associated with the two test lengths.

Instead of using the expected utility in the above procedure, its standardization, coefficient δ, could be used as well. This choice even has some advantage since, as was indicated above, coefficient δ has simple interpretations for the cases of a threshold and a linear utility function. For these two utility functions we examine the item contribution to the standardized expected utility. Let δ and δ_i denote the value of coefficient δ for the n-item test and for the (n-1)-item test after item i has been removed, respectively. Analogously to formula 8, we can now define

\delta - \delta_i    (9)

as the difference between the standardized expected utilities before and after the removal of item i.

In the event of the threshold utility function, coefficient δ equals Loevinger's (1947) H computed over the fourfold table with proportions of masters who pass, masters who fail, nonmasters who pass, and nonmasters who fail. Thus, for the threshold utility function, formula 9 can be replaced by the difference H - H_i (H_i representing Loevinger's H after the removal of item i). It has already been noted that since the latent continuum is unobserved, a psychometric model is needed for estimating the proportions from which Loevinger's coefficient must be computed. One possibility is to use Lord's Method 20, which is based on a weak model assuming only that the conditional distribution of observed test scores given the latent ability is compound binomial (Lord, 1969). The procedure is then to estimate the four classification proportions both for the test with and without item i, and to evaluate H - H_i using these proportions. It is observed that, although Method 20 is attractive from a theoretical viewpoint, it uses a lot of computer time. This prohibits its use for longer tests. A second possibility is to assume the beta-binomial model for the test data. This assumption is usually made to estimate classification proportions in mastery testing (Mellenbergh, Koppelaar, & van der Linden, 1977; Subkoviak & Wilcox, 1978); for this estimation a computer program is available (Koppelaar, van der Linden, & Mellenbergh, 1977). The beta-binomial model assumes, however, that the conditional distribution of observed test scores given the latent ability is binomial. This is a stronger assumption than the assumption of a compound binomial distribution in Method 20; it implies that the conditional distribution of X given θ is independent of which particular item is removed. (The second probability in formula 8 no longer depends on i.) As a consequence, the index δ - δ_i is the same for each item. The conclusion is that the assumption of the beta-binomial model changes the item analysis problem. The problem is no longer to select items with large contributions to the (standardized) expected utility of the decision procedure but rather how many items must be selected, for example, until a predetermined value of the expected utility is reached. It is possible, albeit tedious, to construct tables showing how, for a given mean observed score and standard deviation, the (standardized) expected utility decreases with test length.
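Purely as an illustration (ours, with invented classification proportions), Loevinger's H for the fourfold table of mastery status against the pass/fail decision can be written as the covariance between the two dichotomies divided by its maximum given the marginals; this is one common form of the coefficient and is assumed here, not taken from the paper. H and H_i would be computed from the tables estimated for the test with and without item i.

```python
def loevinger_h(p_master_pass, p_master_fail, p_nonmaster_pass, p_nonmaster_fail):
    """Loevinger's H for the fourfold table of mastery status versus pass/fail,
    written as covariance over its maximum given the marginals (one common form)."""
    p_master = p_master_pass + p_master_fail
    p_pass = p_master_pass + p_nonmaster_pass
    cov = p_master_pass - p_master * p_pass
    cov_max = min(p_master, p_pass) - p_master * p_pass
    return cov / cov_max

# Invented classification proportions for the n-item and the (n-1)-item test.
h_full    = loevinger_h(0.55, 0.05, 0.10, 0.30)
h_without = loevinger_h(0.52, 0.08, 0.12, 0.28)
print(h_full - h_without)   # contribution of item i under the threshold utility
```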
For the linear utility function and a linear regression of latent score on observed score, coefficient δ equals the classical reliability coefficient, ρ. Let ρ_i denote the reliability of the (n-1)-item test after removal of item i. Substituting the two values of the reliability coefficient in formula 9 yields the item index ρ - ρ_i. Two comments on the index can be made. In the first place, it is comparatively easy to estimate ρ - ρ_i. Unlike the estimation of the index H - H_i, no further psychometric model is needed to estimate latent quantities, and the reliability coefficients in ρ - ρ_i can be estimated in the usual fashion. The computations are also simplified by the fact that the reliability coefficient is not a function of the cut-off score on the test, c, so that it is not necessary to determine this again each time an item is removed from the test. The second comment concerns the event of parallel test items. In this case, just as in the previous case of the beta-binomial model, the problem of item selection is replaced by the problem of test length determination, and the index ρ - ρ_i reminds us of the well-known Spearman-Brown formula for test lengthening from classical test theory (Lord & Novick, 1968, p. 90). In fact, ρ - ρ_i is then equal to the increase of the value of this formula associated with an increase in test length from n-1 to n.
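A sketch of the index ρ - ρ_i (ours): coefficient alpha is used here as a working estimate of the reliability, which is an assumption on our part rather than a prescription of the paper, and the index is computed for every item of an invented 0/1 response matrix.

```python
import numpy as np

def cronbach_alpha(X):
    """Coefficient alpha as a working estimate of the reliability rho
    (rows: subjects, columns: items)."""
    n_items = X.shape[1]
    item_var = X.var(axis=0, ddof=1).sum()
    total_var = X.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1.0 - item_var / total_var)

# Invented 0/1 response matrix.
X = np.array([[1, 1, 0, 1, 1],
              [1, 0, 0, 0, 1],
              [1, 1, 1, 1, 1],
              [0, 1, 0, 1, 0],
              [1, 0, 1, 0, 1],
              [0, 0, 0, 1, 0],
              [1, 1, 1, 1, 0],
              [0, 0, 1, 0, 0]])

rho = cronbach_alpha(X)
for i in range(X.shape[1]):
    rho_i = cronbach_alpha(np.delete(X, i, axis=1))
    print(i, rho - rho_i)    # item index rho - rho_i for each item
```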


DISCUSSION

In the preceding sections, several methods of selecting items for criterion-referenced tests have been examined. First, classical methods based on parameters such as item difficulty and item-test correlation, along with a class of methods using pretest-posttest coefficients, were investigated in view of their suitability for criterion-referenced testing procedures. It appeared that these methods are less appropriate, mainly because they do not show the local properties of the items at the mastery score and have implicit fixed percentages of passes contradicting the basic philosophy of mastery testing. Second, an item selection method derived from latent trait theory was reviewed. This method uses the notions of item and test information and has an important advantage over classical methods inasmuch as it evaluates the properties of items and test at the mastery score. Finally, an attempt was made to approach item selection from a decision-theoretic point of view. The idea of item contribution to the (standardized) expected utility of the mastery testing procedure was formulated, and it was shown how this led to item indices for the case of a threshold and a linear utility function.

It is observed that the last section presented first results and that further attempts along this line should follow. For example, these attempts could be aimed at finding reasonable approximations simplifying the computations in the case of a threshold utility function. In doing so, it will certainly be worth considering other functions than the simple difference between the (standardized) expected utilities. To illustrate the possibilities, an index proposed by van Naerssen (1967) can be used:

f_i = \rho/(1 - \rho) - \rho_i/(1 - \rho_i).    (10)

Within the framework of classical test theory f_i has a nice interpretation because ρ/(1 - ρ) equals the signal-noise ratio of true score variance to error variance (Cronbach & Gleser, 1964). Van Naerssen proposed f_i as a measure of the contribution of item i to the signal-noise ratio of the test. But formula 10 has a decision-theoretic interpretation as well. Assuming a linear utility function and a linear regression of latent score on observed score, f_i is a function of the standardized expected utilities for the test with and without item i. It, therefore, deserves new attention from those engaged in criterion-referenced testing.
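Formula (10) is immediate to evaluate once ρ and ρ_i are available; a minimal sketch (ours, with invented reliability values):

```python
def van_naerssen_f(rho, rho_i):
    """Formula (10): contribution of item i to the signal-noise ratio of the test."""
    return rho / (1.0 - rho) - rho_i / (1.0 - rho_i)

# Example with invented reliabilities for the test with and without item i.
print(van_naerssen_f(0.84, 0.81))   # about 0.99
```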

REFERENCES

Berk, R.A. Item analysis. In R.A. Berk (Ed.), Criterion-referenced measurement: The state of the art. Baltimore, MD: The Johns Hopkins University Press, 1980.

Birnbaum, A. Some latent trait models and their use in inferring an examinee's ability. In F.M. Lord & M.R. Novick, Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

Brogden, H.E. Variation in test validity with variation in the distribution of item difficulties, number of items, and degree of their intercorrelation. Psychometrika, 11, 197-214, 1946.

Cox, R.C. & Vargas, J.S. A comparison of item selection techniques for norm-referenced and criterion-referenced tests. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, Illinois, February 1966. (EDRS No. ED 010 517)

Cronbach, L.J. & Gleser, G.C. The signal/noise ratio in the comparison of reliability coefficients. Educational and Psychological Measurement, 24, 467-480, 1964.

Cronbach, L.J. & Warrington, W.G. Efficiency of multiple-choice tests as a function of spread of item difficulties. Psychometrika, 17, 127-148, 1952.

Glaser, R. Instructional technology and the measurement of learning outcomes: Some questions. American Psychologist, 18, 519-521, 1963.

Glaser, R. & Klaus, D.H. Proficiency measurement: Assessing human performance. In R. Gagne (Ed.), Psychological principles in system development. New York: Holt, Rinehart & Winston, 1962.

Glaser, R. & Nitko, A.J. Measurement in learning and instruction. In R.L. Thorndike (Ed.), Educational Measurement. Washington, D.C.: American Council on Education, 1971.

Hambleton, R.K. Testing and decision-making procedures for selected individualized instructional programs. Review of Educational Research, 44, 371-400, 1974.

Hambleton, R.K. & Novick, M.R. Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10, 159-170, 1973.

Hively, W. Introduction to domain-referenced testing. Educational Technology, 14, 5-10, 1974.

Huynh, H. Statistical considerations of mastery scores. Psychometrika, 41, 65-79, 1976.

Koppelaar, H., van der Linden, W.J., & Mellenbergh, G.J. A computer program for classification proportions in dichotomous decisions based on dichotomously scored items. Tijdschrift voor Onderwijsresearch, 2, 32-37, 1977.

Loevinger, J. A systematic approach to the construction and evaluation of tests of ability. Psychological Monographs, 61 (Whole No. 285), 1947.

Loevinger, J. The attenuation paradox in test theory. Psychological Bulletin, 51, 493-504, 1954.

Lord, F.M. Estimating true-score distributions in psychological testing (an empirical Bayes estimation problem). Psychometrika, 34, 259-299, 1969.

Lord, F.M. Applications of item response theory to practical testing problems. Hillsdale, New Jersey: Erlbaum, 1980.

Lord, F.M. & Novick, M.R. Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

Mellenbergh, G.J., Koppelaar, H., & van der Linden, W.J. Dichotomous decisions based on dichotomously scored items: A case study. Statistica Neerlandica, 31, 161-169, 1977.

Mellenbergh, G.J. & van der Linden, W.J. The internal and external optimality of decisions based on tests. Applied Psychological Measurement, 3, 257-273, 1979.

Millman, J. Criterion-referenced measurement. In W.J. Popham (Ed.), Evaluation in Education. Berkeley, California: McCutchan, 1974.

Nitko, A.J. Distinguishing the many varieties of criterion-referenced tests. Review of Educational Research, 50, 461-485, 1980.

Novick, M.R. & Lindley, D.V. The use of more realistic utility functions in educational applications. Journal of Educational Measurement, 15, 181-191, 1978.

Subkoviak, M.J. & Wilcox, R. Estimating the probability of correct classification in mastery testing. Paper presented at the Annual Meeting of the American Educational Research Association, Toronto, March 1978.

Tucker, L.R. Maximum validity of a test with equivalent items. Psychometrika, 11, 1-13, 1946.

van der Linden, W.J. Decision models for use with criterion-referenced tests. Applied Psychological Measurement, 4, 469-492, 1980.

van der Linden, W.J. A latent trait look at pretest-posttest validation of criterion-referenced test items. Review of Educational Research, 51, 379-402, 1981.

van der Linden, W.J. & Mellenbergh, G.J. Optimal cutting scores using a linear loss function. Applied Psychological Measurement, 1, 593-599, 1977.

van der Linden, W.J. & Mellenbergh, G.J. Coefficients for tests from a decision-theoretic point of view. Applied Psychological Measurement, 2, 119-134, 1978.

van Naerssen, R.F. Itemselectie bij studietoetsen, een nieuwe benadering [Item selection for achievement tests, a new approach]. Ned. Tijdschrift voor de Psychologie, 22, 345-359, 1967.

Wilcox, R.R. A note on decision theoretic coefficients for tests. Applied Psychological Measurement, 2, 609-613, 1978.