© 1989 Pergamon Press
A standard (also known as a passing score, cut score, or pass-fail point) is a form of decision rule applied to test scores. People who score at or above the standard are classified in one way, and people who score below the standard are classified in a different way. Scoring distributions on most tests are reasonably continuous; there are people at almost all levels of the score distribution. The standard or passing score splits the continuous score distribution into two parts. Providing a justification for splitting the score distribution at some particular point rather than at a higher or lower one is a difficult task. If all the people who deserved to pass obtained perfect scores and if all of the people who deserved to fail obtained scores below the chance level, setting standards would be a trivial task. The fact that the scores of people who deserve to pass and the scores of people who deserve to fail often overlap makes standard setting a difficult, frustrating, and ultimately subjective task.
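To make the overlap concrete, the following Python sketch counts classification errors at several candidate standards. The two score lists, and the labels saying who "deserves" to pass, are invented purely for illustration; in practice that information is exactly what is not available.

```python
# Hypothetical scores (0-20) for people who, by assumption, deserve to pass or fail.
DESERVE_PASS = [11, 13, 14, 15, 16, 17, 18, 19]
DESERVE_FAIL = [4, 6, 8, 9, 10, 12, 13, 15]


def classification_errors(standard, deserve_pass, deserve_fail):
    """Return (false_fails, false_passes) for a given standard.

    People scoring at or above the standard pass; those below it fail.
    """
    false_fails = sum(1 for s in deserve_pass if s < standard)
    false_passes = sum(1 for s in deserve_fail if s >= standard)
    return false_fails, false_passes


if __name__ == "__main__":
    for standard in (10, 12, 14, 16):
        ff, fp = classification_errors(standard, DESERVE_PASS, DESERVE_FAIL)
        print(f"standard={standard}: {ff} wrongly failed, {fp} wrongly passed")
```

With these made-up numbers no standard produces zero errors. Raising the standard trades people who are wrongly passed for people who are wrongly failed, and which trade is preferable is a value judgment.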
There is no rigorous, objective way to set standards. Every method of setting standards depends on subjective, human judgment at some stage of the process. There is no way to prove that any particular standard is correct because the "correctness" of a standard depends on one's values. No matter where a standard has been set, some people will argue that it is absurdly low, making a mockery of education and endangering the public. Others will argue that the same standard is unfairly high, depriving students and society of what rightfully belongs to them. Because all standards depend on subjective value judgments and because different people have different values, it is impossible to avoid controversy when setting standards on an important test. There is no "true" standard that a good method will find. Standards are constructed, not discovered, and the resulting standard will cause errors of classification.

Popular Methods of Setting Standards

Unfortunately, the most widely used methods of setting standards are not the most appropriate ones. The most common methods of setting standards are those based on tradition ("The passing score is 70 percent because the passing score has always been 70 percent") or on power ("The passing score is 80 percent correct because I say so"). More recently, a method of setting standards based on negotiation has become popular. A committee of judges is asked to set a standard and the result is usually a compromise between those who want to set a high standard to maintain "quality" and those who are concerned that such standards are "unrealistically" high. The problem with such popular methods is that they fail to take into account the difficulty of a particular set of test questions used for a particular purpose.

Defensible Methods of Setting Standards

The methods that follow are not perfect. They are based on the subjective judgments of fallible human beings. However, they focus the judges' attention on the appropriate factors and pool the judgment across a number of people to obtain more reliable estimates of the resulting standard. Judgments can be made about the questions in tests or about the people who take the tests. Portions of the descriptions of standard setting methods have been adapted from the manual Passing Scores by Livingston and Zieky (ETS, 1982).

Methods Based on Judgments of Questions
The three methods to be described are known as the Angoff (1971), Ebel (1972), and Nedelsky (1954) methods after the people who first described them in print. The three methods are based on the concept of the "borderline" test-taker. This test-taker is the one whose knowledge and skills are on the borderline between the people who deserve to pass and the people who deserve to fail.
These methods are based on the idea that, since the test-takers who deserve to pass tend to earn higher scores than those who deserve to fail, the standard should be the score that would be expected from a person whose skills are on the borderline.
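One way to read "the score that would be expected from a person whose skills are on the borderline" is as a sum of per-question probabilities, which is how the Angoff procedure is commonly described: a judge estimates, for each question, the probability that a borderline test-taker answers it correctly, and the estimates are added up. The sketch below assumes that reading. The function name and the probabilities are hypothetical and do not come from the article.

```python
def expected_borderline_score(item_probabilities):
    """Sum one judge's per-question probabilities of a correct answer.

    Each value is the judge's estimate (0.0 to 1.0) that a borderline
    test-taker answers that question correctly. With one point per
    question, the sum is the expected number-right score.
    """
    for p in item_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return sum(item_probabilities)


if __name__ == "__main__":
    # Hypothetical judgments for a five-question test (assumed values).
    judgments = [0.9, 0.75, 0.5, 0.5, 0.25]
    print(expected_borderline_score(judgments))
```

Because the judgments refer to the actual questions, a harder set of questions draws lower probabilities and therefore a lower standard, which a fixed "70 percent" rule cannot do.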
The judgments these methods require are made in terms of the specific questions on the test. The methods are relatively convenient and can be applied either before or after the test is administered. In addition, the process of making judgments about test questions focuses the judges' attention closely on the content of the test. Most important, the necessary data (judgments about test questions) can nearly always be obtained.

Each of these methods consists of five basic steps:

1. Select the judges;
2. Define "borderline" knowledge and skills;
3. Train the judges in the use of the method;
4. Collect judgments;
5. Combine the judgments to choose a passing score.
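Step 5 can be sketched as follows, assuming each judge's work has already been reduced to a single recommended score. Averaging is only one possible combination rule (a median is another), and the judges' scores here are invented for illustration. The standard error is included to show why pooling judgments across more people gives a more reliable estimate.

```python
from statistics import mean, median, stdev


def combine_judgments(judge_scores):
    """Pool the judges' recommended passing scores.

    Returns the mean, the median, and the standard error of the mean,
    which shrinks as more judges are pooled.
    """
    if len(judge_scores) < 2:
        raise ValueError("need at least two judges to estimate spread")
    n = len(judge_scores)
    return {
        "mean": mean(judge_scores),
        "median": median(judge_scores),
        "standard_error": stdev(judge_scores) / n ** 0.5,
    }


if __name__ == "__main__":
    # Hypothetical recommended scores from five judges on a 50-question test.
    scores = [31.0, 33.5, 35.0, 36.5, 39.0]
    print(combine_judgments(scores))
```

A large standard error relative to the score scale suggests the judges disagree about what "borderline" means, which is a reason to revisit step 2 before adopting the pooled value.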
The first two steps are the same for all methods. The remaining steps differ. A detailed description of the three methods is presented in the articles referred to above.
The second step is to have the judges discuss and define what is meant by "borderline" performance. It is crucial that all the judges understand what the test measures and how the test scores are to be used. Then the judges should be asked to describe a person whose knowledge and skills would represent the borderline between acceptable and unacceptable levels of the knowledge and skills the test measures. The judges may find it convenient to describe the performance of specific people they have worked with, whom they would classify as "borderline".
References

Angoff, W.H. (1971). Scales, norms, and equivalent scores. In: R.L. Thorndike (Ed.), Educational Measurement. Washington, D.C., American Council on Education, pp. 514-515. (Source document for Angoff's method.)
Ebel, R.L. (1972). Essentials of Educational Measurement. Englewood Cliffs, N.J., Prentice-Hall, pp. 492-494. (Source document for Ebel's method.)
Livingston, S.A. and Zieky, M.J. (1982). Passing Scores. Princeton, N.J., Educational Testing Service. (Portions of this paper were adapted from the manual.)
Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, Vol. 14, No. 1, pp. 3-19. (Source document for Nedelsky's method.)
The Author
MICHAEL ZIEKY is a Principal Measurement Specialist at Educational Testing Service in Princeton, New Jersey. His major responsibilities include test development, quality assurance, and education of professional staff members.