Measurement as an element of curricular evaluation




STUDIES IN EDUCATIONAL EVALUATION

Volume 2, No. 2

Summer 1976

MEASUREMENT AS AN ELEMENT OF CURRICULAR EVALUATION¹

HANS SPADA

Institute for Science Education (IPN) at the University of Kiel

1. INTRODUCTION

The use of test models (classical test theory, the binomial model, the RASCH model etc.) to measure curriculum effects is an important part of empirical evaluation. The point of departure is usually formed by hypotheses about the possible effects of the curriculum on the learner. On the basis of empirical data, collected by means of a systematic assessment of behavior in specific characteristic situations (realized through test items, behavior observation situations etc.), an attempt is made to estimate the quantitative extent of the effects. The psychometric basis of this method is formed in each case by a test model. On the one hand, it guides the development of the means which are necessary for the collection of data (tests, questionnaires, behavior observation records etc.) and, on the other hand, it permits inferences from the data as to the effects of the curriculum. An example of a curricular evaluation in which a corresponding measurement of curriculum effects is an essential element is the empirical examination of the question of whether the learning goals formulated in an instructional unit are achieved and what the extent of the additional undesired effects is.

The most important task of a test model is the theoretical foundation of the measurement method. It is of decisive importance for the accomplishment of this task that a (properly chosen) test model defines the kind of dependence of the observable behavior on underlying abilities of the persons tested in the form of a data-adequate theory. It is precisely this definition which permits the inference from the data to the degree of individual abilities and thus also to the extent of curricular effects.
Although the most commonly used model is still classical test theory, a series of test models is presently being used in curriculum evaluation. For the majority of them, introductions to their assumptions and use are available in the form of textbooks. I will refer to this literature in the appropriate places. In this contribution, I try to point out the limits and potential of individual models from the viewpoint of empirical evaluation and to compare the models critically. I shall emphasize those test models that appear to me to be trend-setting, even though they are not used in curricular evaluation today to the same extent as other models which are already partially outdated due to scientific development. I hope that even readers who, as yet, have seldom come into contact with questions of empirical evaluation are provided with a useful introduction to the concepts and arguments used in connection with test models and curricular evaluation.

¹ Revised and translated version of a contribution to the "Curriculum Handbuch" (Spada, H.: Quantifizierung als Element curricularer Evaluation. In: Frey, K. u. a. (Hrsg.): Curriculum Handbuch. München: Piper, 1975).

2. MEASUREMENT, TEST MODEL AND CURRICULAR EVALUATION

2.1. An Example from Education Research for the Illustration of the Relevance of Measurement

Let us take the following example as the point of departure of the considerations: At the Institute for Science Education (IPN), science educators in cooperation with psychologists and teachers are developing an instructional unit entitled "Nuclear power - dream or nightmare?" for the 9th and 10th grades. Different instructional methods are tested (e.g. group instruction on the basis of sociometric assessments and, based on theories of social learning, the introduction of a fictitious person behaving in a way which corresponds to the desired learning goals). It is to be evaluated whether the learning goals which were set up within the framework of the instructional unit are attainable and what combination of instructional methods proves to be the best². Learning goals of the instructional unit are, for example: "The pupil should be able to establish connections between technical/scientific and economic policy facts in the area of power supply and nuclear power plants", and "The pupil should be interested in 'power supply and nuclear power plants' and he should feel involved in problems in this area".

What problems does such an investigation raise for empirical evaluation in the stage of measurement of instructional effects? It is primarily a question of what combinations of instructional methods can best achieve the learning goals and keep the undesired side effects to a minimum. From the viewpoint of the measurement of the effects, the answer to this question requires several things:

a) Specification of the learning goals and of the hypotheses about undesired side effects by means of systematic relation to observable behavior.
b) The setting-up of learning goal-oriented tests and the development of tests for the control of the side effects (I formulate the concept "test" very broadly and understand it to include questionnaire procedures, standardized behavior observation protocols and other procedures developed with the aid of test models).
c) The setting-up of rules which reflect the functional relation between observable behavior, test data and underlying abilities and allow conclusions from the data as to the attainment of the learning goals.
d) The use of a test model (or perhaps of several test models for different behavior and learning goal areas) is the prerequisite for the mentioned activities.

2.2. The Measurement of an Ability with the Aid of a Test Model: An Outline

Concepts such as "learning goal-oriented tests" and "abilities" come up in empirical evaluation again and again. They are explained in more detail in Diagram 1. This diagram sketches the basic concept of a formally satisfying measurement with the aid of a test model. The subsequent comments on the individual test models will show that only some of these models ensure measurements on this level. One can also view the graph as a flow diagram which shows how, on the basis of an appropriate test model, qualitative data and from these, in turn, quantitative conclusions can be derived from the variety and complexity of the occurrences which take place in the course of teaching and learning processes.

² A survey of the planning, theoretical basis and some of the materials of this investigation is provided by two work reports (Hoffmann et al. 1973, 1975).

Diagram 1: An Outline of the Basic Concept of the Measurement of an Ability with the Aid of a Suitable Test Model.

Empirical sphere:
  1. Ability (latent, non-observable)
  2. Behavior which can be traced back to the ability, defined by means of a class of situations and a class of reactions (manifest, observable)
  3. That part of the behavior mentioned in 2 which is recorded in the test (manifest, observable)
Numerical sphere:
  4. Test score, defined as frequency of certain reactions (dependent on the sample of test items)
  5. Estimate for the degree of the ability mentioned in 6, e.g. person parameter estimate ("independent" of the sample of test items)
  6. Quantitative degree of the ability, e.g. person parameter
Dotted line: Control of the measurement procedure by means of the test model (the model can only be applied for the measurement if the model assumptions are adequate to the data; in other words, if the model is valid).

Let us first consider the process, characterized by the succession of points 1 to 6, of the measurement of an ability on the basis of a test model. The assumption that observable behavior (point 2) can be traced back to an ability (point 1) is of great significance. Formulated more precisely, this assumption says that different degrees of an ability cause systematic differences in observable behavior. However, one ability according to this concept does not, of course, steer the entire behavior but only a small facet of behavior. Such a facet of observable behavior can be defined, for example, through the definition of the type of situations to which this behavior relates and of the type of reactions according to which the behavior is categorized (cf. Diagram 1). This is simpler than it sounds in this formulation. A concrete example would be correct and incorrect answers (type of reaction) to addition problems with addends having several decimal places (type of situation). The ability which is defined by this type of reactions and situations might be called "the ability to solve correctly addition problems with addends having several decimal places". One ability thus characterizes the tendency of a person to behave in a certain manner in certain situations (cf. Klauer 1972a; Fricke 1972, 1974; Niedderer 1973; Spada/Frey 1973).

It is generally not possible to include in a test the entire behavior which can be traced back to a certain ability. From the usually very large class of all possible situations only a sample of situations can be presented in the form of test items.
Thus, for example, it is not the pupil reactions to all conceivable addition problems with addends having several decimal places on which the measurement can be based but only the answers to a sample of these problems. Data collection must therefore limit itself to a part of the behavior of interest. The transition from point 2 to point 3 in Diagram 1 corresponds to this limitation. The restriction of the data collection to only a part of the behavior which is dependent on the ability has the result that the quantitative value (point 6) which characterizes the magnitude of the ability cannot be determined exactly but can only be estimated with a certain error distribution (point 5).

The next step in the measurement procedure leads from point 3 to point 4, i.e. from the observation and classification of behavior in a test to the characterization of this behavior by means of numbers. A typical example of such a numerical characterization is the test score, defined as the frequency of correct answers, in cognitive achievement tests. This step from a classification of behavior to a counting of behavioral outcomes represents the transition from the empirical to the numerical sphere. Let it be emphasized that this transition presupposes the validity of a series of theoretical assumptions and cannot be justified only pragmatically by the didactic use(fulness) of the resulting numerical values. The use of test scores, defined as the frequency of correct answers, thus requires, among other things, the following: The frequency of the correct solutions must exhaust the entire information which the pupil's reactions contain about the ability. If it were essential for the measurement of the ability which problems are solved correctly or in what order correct and incorrect answers are given, then the number of correct answers would not sufficiently represent the test behavior numerically (cf. Fischer 1974).


It is somewhat difficult to describe the next part of the measurement procedure. However, it is important for the understanding of a measurement and will be of use to us in the comparison of classical test theory and probabilistic test models. On the basis of the test score (point 4 in the diagram) an estimate (point 5) for the degree of the ability is derived. At first glance, such a derivation may seem superfluous. Why isn't the test score itself used as the estimate for the degree of the ability? This can be explained as follows: The quantitative value (point 6) for an ability (of a pupil at a certain time during instruction) should enable a description and prognosis of the entire behavior (point 2) which can be traced back to the ability. It should not only be valid for the behavior (point 3) which was evoked by the test. Consequently, the estimated value (point 5) for the degree of the ability should not characterize only the test behavior, because the test behavior is to a certain extent conditioned by the item sample, which itself is the result of a more or less arbitrary choice of items from the class of possible situations. In this sense, the estimate should be "independent"³ of the sample of test items, i.e. without systematic bias due to the selection of items. The test score itself does not represent an independent estimate of the extent of an ability. Indeed, it is dependent not only on this ability but also on the characteristics of the items which were chosen from the class of the possible items.
On the basis of the example of the test with addition problems one can easily imagine how the test score of a pupil changes when a sample of relatively simple problems and a sample of more difficult problems are alternately given, even when both samples allow the measurement of the same ability. This dependence of the test score on the item sample is the reason why this figure is, in some test models, only used for the derivation of an estimate for the extent of the ability. In the RASCH model (1966), for example (cf. the comments in section 3.2.3. of this paper), it is possible to take the difficulty of the items of the presented test successfully into account so that the resulting (parameter) estimates are independent of the sample. In the binomial model (cf. section 3.2.2.), a special case of the RASCH model, the resulting estimates are also "sample-free". In classical test theory (cf. section 3.1.), on the other hand, the test scores themselves are used as estimates for the extent of an ability. This test theory thus offers no sample-free estimation of the extent of an ability⁴. Thus, in an outline of the basic concept of classical test theory, points 5 and 4 as well as points 2 and 3 coincide. This has the result that the extent of the ability as well as its content and its didactic significance are only defined through the behavior which is evoked by the test itself.

³ The word "independent" has been placed in quotation marks to point out that the accuracy of the estimation is, of course, dependent on the sample of items. Thus, for example, a test with many items generally offers a more precise estimate of the extent of a pupil's ability than a test with a small sample of items.

⁴ This problem and the negative consequences which have been derived from it are somewhat less serious if the sampling of items is carried out with the aid of a method such as, for example, item form technology (see section 3.3.1.), since by means of it a certain representativeness of the selected items is ensured for the entire class of tasks. However, a fundamental change in the problem does not follow.

Let us once again summarize the concept, represented in Diagram 1, of the measurement of an ability. Fundamental is the notion that observable behavior can be explained by means of abilities and their quantitative degree. For the estimation of the extent of an ability in individual pupils, behavior which can be traced back to this ability is registered by means of presenting properly chosen test items or by means of the observation of pupils in certain situations. The behavior is classified in such a way that it is assigned to reaction categories such as "correct solution" and "incorrect solution". The next step is the first numerical one. It is counted how often answers have come up in certain reaction categories. The test scores which have been derived in this manner are dependent on the sample of items of the given test. Therefore, on the basis of these test scores, an attempt is made to derive an estimate of the extent of the ability for the individual pupils which is independent of the item sample.

I have pointed out several times that the validity of different theoretical assumptions is a prerequisite for the measurement of an ability. How can these assumptions be tested with respect to a certain class of behavior in order to ensure that a measurement based on these assumptions is at all possible? A large part of the assumptions is incorporated in the formulation of test models. Another part comprises the framework conditions for the use of the models. An empirical examination of the validity of a test model makes the testing of those assumptions which are components of the model in question possible.
The specific tests which are to be carried out for the examination of the validity of a model (and thus the validity of the assumptions as well) differ slightly from test model to test model and can be read about in detail in the literature on the different models. When reviewing the different test models I will return briefly to this point. Since differences between individual test models are generally due to diverging assumptions about the measurement procedure, the carrying out of model controls can provide one with information about what form of measurement is best suited for a certain class of behavior. On the basis of model controls it is also possible to analyze which items correspond to the assumptions of the applied model and can therefore be included in the sample of test items for the measurement of the ability under study. A prerequisite for this is that all items, or the behavior evoked by them, can be related to this ability (item homogeneity). In summing up, let it be emphasized that only a positive result of model validity tests meets the requirements for the use of the model for the measurement of an ability.
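The difference between a test score (point 4) and a sample-free estimate (point 5) described in this section can be sketched in Python. The sketch assumes the usual logistic form of the dichotomous RASCH model, which is not spelled out in this section, and it simplifies in two ways: the item difficulties are treated as known, and expected scores are used in place of observed ones so that no sampling error enters. All names (`p_correct`, `expected_score`, `estimate_theta`) and all numerical values are hypothetical.

```python
import math


def p_correct(theta, beta):
    """Assumed dichotomous RASCH model: probability of a correct answer
    for ability theta and item difficulty beta."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def expected_score(theta, betas):
    """Expected test score (point 4) on a given sample of items."""
    return sum(p_correct(theta, b) for b in betas)


def estimate_theta(score, betas, lo=-10.0, hi=10.0):
    """Find theta with expected_score(theta) == score by bisection.
    Requires 0 < score < len(betas); expected_score increases with theta."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if expected_score(mid, betas) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


theta_true = 0.5                           # hypothetical ability of one pupil
easy_items = [-2.0, -1.5, -1.0, -0.5, 0.0]  # hypothetical easy item sample
hard_items = [0.5, 1.0, 1.5, 2.0, 2.5]      # hypothetical hard item sample

# The test score depends strongly on which item sample is presented ...
score_easy = expected_score(theta_true, easy_items)
score_hard = expected_score(theta_true, hard_items)

# ... while the ability estimate, which takes item difficulty into account,
# is the same for both samples.
theta_from_easy = estimate_theta(score_easy, easy_items)
theta_from_hard = estimate_theta(score_hard, hard_items)
print(score_easy, score_hard, theta_from_easy, theta_from_hard)
```

With real data the item difficulties must themselves be estimated and the validity of the model must be checked by model controls; the sketch only illustrates why an estimate that accounts for item difficulty need not change with the item sample, whereas the raw score does.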

2.3. The Formulation of Learning Goals and the Development of Criterion-Referenced Tests

After the theoretical discussion of the basic concept for the measurement of an ability, I would like to consider the question of how the formulation of learning goals and the development of learning goal (or criterion) referenced tests is possible on the basis of this concept. The following procedure has proved advantageous in the formulation of learning goals. A learning goal is defined as the degree of the ability aimed at by instruction or as the degree of change of the ability aimed at by instruction. Such a formulation of a learning goal ensures two things:

a) The specification of the learning goal through reference to a defined (by means of a class of situations and a class of reactions), observable behavior, and
b) The possibility to evaluate instruction empirically with respect to the degree of reaching the learning goal.


The prerequisite is, of course, that an empirical analysis of the ability aimed at by the learning goal as well as the development of a test for this ability have been successfully carried out. A learning goal formulated in this manner refers not only to the test behavior but to the entire behavior which can be traced back to the ability. In this sense, the formulation of the learning goal as well as the definition of the ability itself is independent of the special sample of test problems used.

The test which was developed for the measurement of the ability can be used as a criterion-referenced test as well. On the condition that the concept of measurement as discussed by me is realized with the aid of a suitable test model, a corresponding test permits the decision whether the degree of the ability aimed at by the learning goal could be reached through instruction as well as the measurement of the distance from the learning goal for each individual pupil.

The concept of criterion-referenced tests, which is easily accommodated within the framework of this concept, has for more than a decade led to controversial discussions at many relevant conferences. Their main point of departure is the question whether and how criterion-referenced tests can be developed. If one takes the form, described by me, of specification of learning goals by means of measurable abilities as a basis, then this question can be answered simply: every suitable test model for the measurement of an ability also makes the development of learning goal-referenced tests possible. The discussion of different test models in the sections to follow will show which models in particular can be considered for a learning goal-oriented test construction.

Does, however, my definition of a criterion-referenced test (a test for the measurement of the difference between the extent of an ability of a pupil and the degree of the ability which is demanded by the learning goal) do justice to the state of the discussion in the literature? I think that this question can be answered in the affirmative. In the literature, criterion-referenced tests are compared with the so-called norm-referenced tests (cf. Glaser 1963, 1971; Popham & Husek 1971; Klauer 1972a). One characterizes those tests as norm-referenced which describe an individual pupil only in comparison to others. The performance of a pupil is not measured per se but only in relation to the performance of the pupils of the sample taken or of the school class. On the other hand, one speaks of criterion-referenced tests if statements of the following form, which are independent of the test performance of other pupils, can be made: A pupil has (has not) achieved the learning goal. His distance from the learning goal is ... . The demand for a description of the individual pupil which is independent of the test performance of other pupils corresponds entirely to the concept presented by me for the measurement of an ability. The consideration of individual test models will show how closely related the concept "independence from the sample of pupils" and the concept "independence from the sample of test items" introduced earlier are.
This relationship was first pointed out by Rasch (1960), who developed test models which ensure this twofold psychometric independence. Of historical interest is the fact that in educational research the discussion about criterion-referenced tests (Glaser 1963) began at about the same time that Rasch (1960) analyzed important psychometric preconditions for the development of tests with appropriate characteristics, but that both research trends only began to take note of each other at the end of the 60's.

The examples which were presented for the illustration of measurement as an element of curricular evaluation were mainly taken from the sphere of cognitive achievement tests. On the one hand, tests from this sphere predominate in didactics. On the other hand, an ability which can be measured by an achievement test can, in general, be described more easily and in fewer words than, for example, behavioral dispositions from the emotional sphere. Despite this, the conclusion that the measurement of attitudes, of motivation and of other non-cognitive variables is impossible would, of course, be false. In fact, it is possible to realize measurements on the basis of the discussed concepts even in this sphere. Examples of corresponding test developments are, amongst others: the preparation of a test for the measurement of subject-matter directed motivation (Kempf & Lehrke, 1975) and the construction of a situation test for the measurement of dispositions such as those mentioned in connection with the learning goals which were described at the beginning of this section to illustrate an instructional unit on nuclear plants (Spada 1973a; Lucht & Spada, 1975).

The development of tests for dispositions from the non-cognitive sphere doubtless presents a number of special problems. Among them is the problem of definition of that facet of behavior which can be traced back to a certain disposition. Of course, it is easier with respect to a cognitive ability to define the reaction classes of correct and incorrect solutions and to set up hypotheses about the items which might belong to a certain item or situation pool than, for example, to define reaction categories for behavior which is marked by an attitude and to define the class of situations for which this attitude might be relevant. Add to this that the development of a test in the non-cognitive sphere often forces one to measure only a related disposition instead of the disposition which is really of interest.
An example of that kind would be the measurement of an attitude which describes the verbal behavior of pupils in response to certain everyday situations presented to them in written form instead of the measurement of that attitude to which the "real-life" behavior of the pupils in the corresponding situations is traced back. It is often also necessary in the non-cognitive sphere to allow more than two reaction categories, which necessitates more complicated test models (cf. Fischer & Spada, 1973, for the use of a test model for reactions classified into multiple categories). In the following comparison of test models, I restrict myself to test answers coded in two categories (such as correct and incorrect solutions).

In ending this section, I sum up the considerations on the formulation of learning goals and on the construction of criterion-referenced tests in the form of a thesis (cf. also: Rasch 1966; Fricke 1972; Klauer 1972a; Niedderer 1973). On the basis of this thesis the usefulness of individual test models for educational measurements will be tested.

Thesis: A test model is suited for measurements within the frame of empirical curricular evaluation if the following is true: A test which has been developed and evaluated with the aid of the model allows the estimation of the degree of an ability which a pupil has attained in the course of instruction. The ability can be described by means of a reaction and situation class. If this ability was aimed at in a learning goal, then it is possible to analyze whether the pupil has attained the desired degree of the ability. The resulting estimate of the degree of the ability is not only independent of the test results of other pupils but also of the special item sample of the given test; in other words, the estimate is "sample-free". The assumptions underlying the measurement procedure can be tested by means of model controls.
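The contrast between criterion-referenced and norm-referenced statements drawn in this section can be illustrated by a small Python sketch. It assumes that ability estimates on a common scale are already available from a suitable test model; the function names and all numbers are hypothetical.

```python
def criterion_statement(ability_estimate, goal):
    """Criterion-referenced statement: uses only the pupil's estimate and the
    degree of the ability demanded by the learning goal."""
    distance = max(goal - ability_estimate, 0.0)
    return {"goal_attained": ability_estimate >= goal, "distance": distance}


def relative_position(ability_estimate, other_estimates):
    """Norm-referenced statement: share of the other pupils with a lower estimate."""
    below = sum(1 for x in other_estimates if x < ability_estimate)
    return below / len(other_estimates)


goal = 1.0                            # hypothetical degree demanded by the goal
pupil = 0.4                           # hypothetical ability estimate of one pupil
weak_class = [-1.0, -0.5, 0.0, 0.2]   # hypothetical comparison groups
strong_class = [0.8, 1.0, 1.5, 2.0]

statement = criterion_statement(pupil, goal)
rank_weak = relative_position(pupil, weak_class)
rank_strong = relative_position(pupil, strong_class)
print(statement, rank_weak, rank_strong)
```

The criterion-referenced statement is the same whichever class the pupil is compared with, while the relative position moves from the top of one group to the bottom of the other.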


3. A CRITICAL COMPARISON OF DIFFERENT TEST MODELS

3.1. The Classical Test Theory

The majority of the tests used in educational measurement are based on classical test theory (cf. Lord & Novick, 1968; Lienert, 1969; Fischer, 1974). But exactly this test theory shows how the dependence of the test scores on the sample of the given test items leads to norm-referenced statements and thus contradicts the basic principle of criterion-referenced tests. The use of classical test theory faces the following dilemma. The test scores are dependent on the sample of items. In order to overcome this dependency and thus to better suit the concept of item-sample-free measurements, test scores are often transformed into normalized scores⁵. This is done by ascertaining the relative position of the test performance of each pupil in the overall distribution of test scores of a normalization sample. The characterization of the test performance of a pupil, however, is then dependent on a normalization sample, in other words, dependent on a sample of pupils. The relative independence of the normalized scores from the sample of test problems is paid for by the dependency on the test scores of a normalization sample.

The transformation of test scores into normalized scores within the framework of classical test theory must be judged still more unfavorably if, in addition, the scale of measurement is taken into consideration. By means of the standardization, a linear transformation of the test scores is performed whereby the zero point of the scale and its unit of measurement are fixed by the mean and the standard deviation of the test scores of the normalization sample. Only in the case of an interval scale would this form of standardization be consistent and sufficient for the fixation of the scale. An interval scale, however, is not ensured within the framework of classical test theory. "... we shall treat a measurement as having interval scale properties although it is clear that the measurement procedure and the theory underlying it yield only a nominal scale, or at best, an ordinal scale ..." (Lord & Novick, 1968, 22; cf. also Kristof 1968).

If the learning gain of pupils is to be estimated through repeated measurements of the ability of interest at different times during the course of instruction, then the dependency of the results on the sample is especially problematic. None of the methods for the "measurement of change" suggested within the framework of classical test theory is really satisfying. Among the simplest possibilities are test score differences and normalized score differences. Test score differences, just as the test scores themselves, are dependent on the sample of items. If the same test is given at different times during the course of instruction, then undesired test repetition effects are unavoidable, too. A further problem is the distortion of score differences by ceiling effects (solving of no or all problems). Normalized score differences, on the other hand, permit only the determination of the change in relative position in a normalization sample.
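The dependence of normalized scores on the normalization sample can be shown with a short Python sketch of the linear transformation described above. The function name and the scores are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption made for the illustration, since the text does not specify which variant is meant.

```python
from statistics import mean, pstdev


def normalized_score(raw_score, normalization_sample):
    """Linear transformation of a test score: zero point and unit are fixed by
    the mean and standard deviation of the normalization sample."""
    return (raw_score - mean(normalization_sample)) / pstdev(normalization_sample)


raw = 12                        # hypothetical test score of one pupil
norm_a = [8, 10, 12, 14, 16]    # hypothetical normalization sample A
norm_b = [12, 14, 16, 18, 20]   # hypothetical normalization sample B

z_a = normalized_score(raw, norm_a)
z_b = normalized_score(raw, norm_b)
print(z_a, z_b)
```

The same test score lies exactly at the mean of sample A but clearly below the mean of sample B, so the normalized value describes the pupil only relative to the pupils who happen to form the normalization sample.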

5 Another reason for the introduction of normalized scores might be the desire to compare the relative position of the test performance of pupils. An example would be the comparison of the normalized score of a 12-year-old pupil (derived by means of a normalization sample of 12-year-old pupils) to that of a 16-year-old schoolgirl (derived by means of a normalization sample of 16-year-old schoolgirls).

Table 1. Characteristics of some test models

Level of measurement
    Classical test theory: interval scale (hoped for but not reached; see text)
    GUTTMAN scalogram analysis: ordinal scale
    Binomial model: absolute scale (the results can be directly interpreted as item solution probabilities)
    RASCH-model: depending on the model variant either ratio or difference scale (cf. footnote 6)

Type of statements about pupil reactions
    Classical test theory: the statements refer to test scores and not to individual pupil reactions to individual items
    GUTTMAN scalogram analysis: deterministic
    Binomial model: probabilistic
    RASCH-model: probabilistic

Sample-free or sample-dependent measures
    Classical test theory: test scores dependent on item sample (normalized scores dependent on sample of pupils; see text)
    GUTTMAN scalogram analysis: "independent" of item and pupil sample
    Binomial model: "independent" of item and pupil sample
    RASCH-model: "independent" of item and pupil sample

Possibility of testing assumptions underlying the model with the aid of model controls
    Classical test theory: only partly given
    GUTTMAN scalogram analysis: given
    Binomial model: given
    RASCH-model: given

Chances for model validity in different fields
    Classical test theory: therefore difficult to evaluate
    GUTTMAN scalogram analysis: unfavorable: exact model validity is never given; frequent occurrence of many deviations
    Binomial model: only moderately favorable due to the demand for equal difficulty of all items
    RASCH-model: favorable in the cognitive sphere; only moderately favorable in the non-cognitive sphere (cf. Fischer 1974)

Handling of test models with respect to: 1. test development, 2. test evaluation and interpretation
    Classical test theory: 1. average; 2. simple
    GUTTMAN scalogram analysis: 1. simple; 2. simple
    Binomial model: 1. average (testing of all assumptions: complicated); 2. simple
    RASCH-model: 1. complicated, with respect to the necessity to use computer programs; 2. simple


If a learning goal is formulated on the basis of an ability which, due to the classical test theory, is not defined in a manner which goes beyond the item sample, there is the danger that mastery of the test itself will be made into a learning goal. Further difficulties in the use of classical test theory arise from the dependency of the test criteria (reliability, validity) on the distribution of the latent variable (cf. Fischer, 1974) as well as from the fact that many assumptions about the measurement procedure contained in this test theory cannot be satisfactorily tested for validity.

In the face of these disadvantages, the aspects which speak for the further use of classical test theory in curricular evaluation appear to me to be not important enough. Corresponding advantages are, among other things, the uncomplicated test development and test analysis, and the availability of a great number of tests constructed with the aid of this theory. Table 1 outlines the characteristics of the classical test theory as well as of the test models discussed in the following.

3.2. Probabilistic Test Models

In probabilistic test models an attempt is made to describe, in the form of a mathematical equation, the probability that a pupil will give a certain reaction to an item. The factors which are viewed as determining the reaction probabilities, and the special functional connection between these factors and the probabilities, differ from model to model.
Most frequent is the assumption that the degree of the individual ability under study as well as the difficulty of the items are relevant for the probability of correct solutions. It is generally not the corresponding reaction frequencies themselves which are used as estimates for the individual ability and for the difficulty of the items. Instead, other estimates in the sense of the concept discussed are derived from the reaction frequencies. These are often referred to as person parameters and item parameters. Model tests for the control of the model assumptions are possible in nearly all cases. In the following, three especially interesting models have been selected from the great number of probabilistic test models (cf. Fischer 1974).

3.2.1. Guttman's Scalogram Analysis

Guttman's (1950) scalogram analysis is a forerunner as well as a special case of probabilistic test models. In this model, only reaction probabilities equal to zero and equal to one are distinguished, i.e. the prognosis of pupil reactions is deterministic. It is postulated that an ordered series of the pupils according to the extent of the analyzed ability can be determined on the basis of test scores and that, similarly, the problem difficulties can be arranged according to the frequency of correct answers. The basis for this measurement on an ordinal scale is the assumption that a pupil solves all the items correctly if their difficulty is less than his "ability" to do problems of this kind and that he solves all the more difficult problems wrongly. An example from the non-cognitive sphere would be a social distance scale with questions which aim at the readiness for certain forms of contact.
When the deterministic model assumptions about the reactions of pupils are valid, only certain vectors of correct and incorrect solutions occur. The incorrect response to a simpler problem when the pupil has already solved more difficult ones correctly would, for instance, contradict the theory and make it impossible to estimate the degree of ability of the pupil and to determine an ordered series of the corresponding problems. Divergent answer vectors can, from the viewpoint of the model, be due, among other things, to the fact that individual items measure different abilities. The checking of answer vectors for deviations of the mentioned nature constitutes a simple model control.

It is interesting that, given model validity, the ordered series of pupils set up according to the number of correct answers is not dependent on the special selection of given items. The sample of items only has an influence on this ordered series inasmuch as the number of items of differing difficulty determines how many different points of the ordinal scale can be distinguished at all. The determination of a pupil's place in the ordered series by means of the number of correctly answered problems is independent of the test performance of other pupils.

Aside from the low level of measurement, the great disadvantage of scalogram analysis is that the model assumptions are nearly always falsified due to their deterministic character (cf. Fischer 1974).
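The simple model control mentioned above, the checking of answer vectors for deviations, might be sketched as follows. This is only an illustration under stated assumptions: answers are coded 1 (correct) and 0 (incorrect), items are ordered by the frequency of correct answers, and the function names and the small response matrix are hypothetical.

```python
def item_order_by_difficulty(responses):
    """Return item indices ordered from most to least frequently solved."""
    n_items = len(responses[0])
    solved = [sum(row[j] for row in responses) for j in range(n_items)]
    return sorted(range(n_items), key=lambda j: -solved[j])

def is_consistent(answer_vector, order):
    """True if, going from easy to hard items, no correct answer
    follows an incorrect one (the deterministic model assumption)."""
    ordered = [answer_vector[j] for j in order]
    return all(not (a == 0 and b == 1) for a, b in zip(ordered, ordered[1:]))

# Hypothetical answer vectors of four pupils to four items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 0],  # solves a harder item after failing an easier one
]

order = item_order_by_difficulty(responses)
flags = [is_consistent(row, order) for row in responses]
print(flags)
```

In this sketch the last pupil's answer vector contradicts the model, since a more difficult item is solved although an easier one is not.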

3.2.2. The Binomial Model

The binomial model (cf. Ferguson 1971; Klauer 1972b; Niedderer 1973) rests on the following fundamental assumption: the probability of a correct answer from a pupil is the same for all problems which measure the same ability. This probability depends only on the extent of the ability and thus differs from pupil to pupil. Using the binomial model, the measurement of a pupil's ability is done in the following manner: from the class of equally difficult items which measure the same ability, a sample of items is taken and given as a test. If the model assumptions are also valid for a sequence of problems (e.g. stochastic independence of reactions), the relative frequency of solved problems is an estimate for the solution probability and thus for the degree of the analyzed ability. The estimation process is so simple within the framework of the binomial model because no differing item difficulties are accepted and thus differences in the reaction probabilities can be traced back exclusively to differing degrees of the ability.

The resulting estimate for each pupil is independent of the special sample of given items since all items of one class of items are, from the viewpoint of the model, completely equal. The sample of items influences the test result only inasmuch as the measurement gains in precision with increasing size of the sample of items. The estimation of the solution probability of a pupil is also independent of the sample of other tested pupils. In other words, it is not norm-referenced. Consequently, a test developed with the aid of the binomial model can function as a learning goal-referenced test. Of course, all these advantageous properties of the binomial model are only guaranteed for use in those areas in which the model assumptions were empirically tested for validity with positive results.
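The estimation step of the binomial model can be sketched in a few lines. The sketch assumes that all model assumptions hold (equally difficult items, stochastically independent reactions); the standard error formula is the usual one for a binomial proportion, and the function name and answer data are hypothetical.

```python
from math import sqrt

def estimate_solution_probability(answers):
    """Relative frequency of solved items as the estimate of the solution
    probability, together with its binomial standard error."""
    n = len(answers)
    p_hat = sum(answers) / n
    standard_error = sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, standard_error

# Hypothetical answers (1 = solved) to a short and a long item sample
# drawn from the same class of equally difficult items.
short_test = [1] * 7 + [0] * 3
long_test = [1] * 28 + [0] * 12

p_short, se_short = estimate_solution_probability(short_test)
p_long, se_long = estimate_solution_probability(long_test)
print(p_short, round(se_short, 3))
print(p_long, round(se_long, 3))
```

Both tests yield the same estimate of the solution probability, but the longer test yields it with a smaller standard error, which corresponds to the statement that the item sample only affects the precision of the measurement.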
In this connection, the assumption of equal item difficulty must be viewed as problematic. It is very restrictive and is not made within the framework of other test models. Furthermore, it leads to an "atomization" of abilities and thus of learning goals as well, since only equally difficult items are combined into a "homogeneous" class of items.

As a special case of the RASCH-model to be described in the next section, the binomial model has many similarities with this model. The binomial model can be derived from the RASCH-model with the restriction of equal problem difficulty. From this it follows that the validity of the binomial model for one area also implies the validity of the RASCH-model, but that analogous conclusions in the opposite direction are not permissible. From the formal similarity of both models it also results that model controls for the RASCH-model can also be applied to the binomial model after the necessary modifications. This is especially interesting for the testing of the assumption that all problems measure the same ability, since the methods previously suggested for this test (Klauer 1972b) are, for many reasons, not adequate (for a more exact discussion of this question cf. Spada & Frey, 1973).

If one, in using the binomial model, dispenses with the testing of the validity of all model assumptions, then the test development, the test evaluation, the formulation of learning goals and the testing of the attainment of learning goals pose no great difficulties. In more encompassing analyses, the binomial model should only be used if sufficient model controls have been carried out and it is ensured that the measurement and definition of an ability is possible (only) by means of a class of equally difficult items. If this is not the case and the ability of interest can be defined by means of items of differing difficulty, then the RASCH-model should be used.

3.2.3. The RASCH-Model

In the RASCH-model (Rasch 1960, 1966; cf. also Fricke 1972 and Fischer 1974) the probability of a correct solution of a pupil faced with an item is viewed as a function of the degree of his ability to solve items of this type and of the difficulty of the problem. For the measurement of an ability, a sample of items is taken from the class of items which measure this ability, a class defined by means of model controls and theoretical analyses of the items, and is given as a test. Taking the test and item scores as a basis, estimates for the degree of the ability of the tested pupils are derived with the aid of computerized algorithms (Fischer 1974). These estimates are independent of the item and pupil sample.

In the RASCH-model, the difficulty of the items included in a test can be taken into consideration in the proper manner in the determination of the person parameter estimates. If, in the phase of test development, it was computed which person parameter estimates correspond to the individual test scores, then, in practical use of the test, the parameters can simply be looked up in a corresponding table. Because of this, the use of a RASCH-scaled test is unproblematic. The formulation of learning goals and the control of the attainment of a learning goal are possible in the manner which has been described so often already.
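How a table of person parameter estimates per test score might be computed can be sketched as follows. The sketch assumes the logistic form of the model on the difference scale, P(correct) = exp(theta - b) / (1 + exp(theta - b)), and item difficulties that are already known from test development. It uses a plain maximum-likelihood estimate found by bisection, which is not necessarily the algorithm referred to in the text; all names and difficulty values are hypothetical.

```python
from math import exp

def p_correct(theta, b):
    """Logistic form of the model: probability of a correct solution for
    person parameter theta and item difficulty b."""
    return 1.0 / (1.0 + exp(-(theta - b)))

def expected_score(theta, difficulties):
    """Expected number of solved items for a given person parameter."""
    return sum(p_correct(theta, b) for b in difficulties)

def theta_for_score(score, difficulties, lo=-10.0, hi=10.0, tol=1e-9):
    """Maximum-likelihood person parameter for a raw score: the theta whose
    expected score equals the observed score, found by bisection.
    Only defined for scores strictly between 0 and the number of items."""
    if not 0 < score < len(difficulties):
        raise ValueError("no finite estimate for zero or perfect scores")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical item difficulties, assumed known from test development.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Look-up table: test score -> person parameter estimate.
score_table = {r: theta_for_score(r, difficulties)
               for r in range(1, len(difficulties))}
for r, theta in score_table.items():
    print(r, round(theta, 3))
```

Once such a table exists, scoring a pupil only requires counting the solved items and reading off the corresponding parameter. Zero and perfect scores receive no finite estimate in this sketch.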
The measurements with RASCH-scaled tests lie either on a ratio or a difference scale depending on the model variant (footnote 6). Since the learning goal formulation within the framework of the RASCH-model does not refer to a certain sample of items and since the person parameter estimates are also independent of this sample, different samples of items can be given to measure an ability at different times during the course of instruction. Bothersome repetition effects can thus be avoided and the difficulty of the items can be adjusted to the respective performance level of the pupils. By means of the presentation of different samples of problems, ceiling effects can be avoided, too. In order to estimate the effect of instruction with respect to an ability between two testings, the pupil parameters are to be compared, not the test scores. It should be kept in mind that the model is still usable if all pupils attain the learning goal, even if the improbable occurs and all pupils show the same degree of ability at the time of testing. For a precise determination of the quantitative extent of an ability, it is of course necessary that problems whose solution probability (despite increasing pupil ability) is not too close to 1 are available for every test.

The testing of the model assumptions is possible by means of different model controls (Andersen 1973; Fischer 1974). An essential aspect of this model control is the fixation of the class of situations or items which measure the same ability. A prerequisite for such a statistical item analysis is a set of detailed hypotheses based on psychological and didactic studies of the items. Without such studies, it is often difficult to interpret the results of model tests in such a manner that one knows why some items were excluded and others not. A precise description of the ability which is to be focused on by the test is also impossible without detailed considerations of the content. In the cognitive sphere, structural models of learning and thought often comprise a valuable analysis of the item structure and thus of the ability under study. Several such models will be outlined briefly in the following.

6 In statements concerning the degree of the ability of a pupil which are based on the scale level of the parameters of the RASCH-model (ratio or difference scale), a certain amount of caution is necessary. Two kinds of over-interpretation of such statements are: even though the statements refer to parameters and thus to a disposition, they are understood as statements on the data level or as statements about reaction probabilities. Contrary to measurements of the classical natural sciences such as the centimeter-gram-second system, the latent traits (pupil ability and problem difficulty) formalized in the RASCH-model are not directly interpretable on the level of observation data since they are not experimentally independent of each other (cf. Scheiblechner 1974; Spada 1976b). This is one of the reasons why different model variants exist with different scale properties (ratio or difference scale) for the latent traits. Formally, these variants are completely equivalent, and, for psychological-didactic reasons, a decision for one of them seems to be impossible. Therefore, it is advisable to include in statements on the degree of pupil ability the model variant on which the statement is based, as well as to be generally careful about scale interpretations.

3.3. Curricular Evaluation and the Analysis of the Item Structure with the Aid of Test Models

Empirical curricular evaluation received decisive impulses from the development of models which make an analysis of task structure possible. This analysis provides access to the question why pupils react to items in a certain way. In the area of achievement tests, a theoretically founded task structure analysis can be understood as an examination of the cognitive operations with rules which make correct solutions possible for pupils of the age under study. In a more content-oriented way of looking at things, the concepts "rules", "regularity" and "sequence of rules", among others, are used for the description of a task structure. From the viewpoint of psychology, one speaks more readily of cognitive operations with rules and characterizes a sequence of such operations as a problem-solving process.

The analysis of the structure of items makes two things possible:

a) A detailed description and definition of the class of items by means of the specification of the rules which make the solving of the items possible. This allows at the same time a better evaluation of the psychological and didactic relevance of the ability (and thus perhaps also of the learning goal formulated on the basis of the ability) measured by the items.

b) Indications as to what rules in what combinations and in what sequence are to be taught on the basis of what tasks in order to make the solving of problems easier for the pupil.


Test models which, besides the measurement of an ability, also make an analysis of the task structure possible are thus interesting not only for curricular evaluation and the formulation of empirically testable learning goals but also provide valuable information for instruction. As a limitation, it is to be noted that these attempts have not yet come past the development stage. They raise problems in practical application since they demand a great deal of knowledge about each analyzed class of tasks from the psychological as well as the content point of view. In the following, different approaches will be outlined briefly (footnote 7).

3.3.1. "Item Forms" According to Hively, Patterson & Page (1968)

The attempt to define item forms (Hively, Patterson & Page, 1968) was motivated mainly by the desire to objectify the choice of the items to be given as a test for an ability. The partial sets of items (of a certain class of items such as addition problems with several decimal places) which are characterized by specific properties are called item forms. An arbitrarily chosen example of an item form could be all those addition problems which consist of only two to, at most, five decimal places and which contain no column sums larger than 9 (it is not necessary to carry any amount over). The fixation of an item sample is done by the drawing or by the computerized generation of one or possibly of several items of each item form.
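The computerized generation of items of one item form might look like the following sketch. It rests on one possible reading of the example above, which is an assumption: two summands with the same number of digits (between two and five) and no column sum larger than 9. The function names are hypothetical.

```python
import random

def generate_item(rng, min_digits=2, max_digits=5):
    """Draw one addition problem (a, b) of the assumed item form:
    both summands have the same number of digits and no column
    requires carrying."""
    n_digits = rng.randint(min_digits, max_digits)
    a_digits, b_digits = [], []
    for position in range(n_digits):
        low = 1 if position == 0 else 0   # leading digits must not be zero
        a = rng.randint(low, 9 - low)
        b = rng.randint(low, 9 - a)       # guarantees a + b <= 9
        a_digits.append(a)
        b_digits.append(b)
    to_int = lambda digits: int("".join(str(d) for d in digits))
    return to_int(a_digits), to_int(b_digits)

def has_no_carry(a, b):
    """Check the defining property of the item form: all column sums <= 9."""
    while a > 0 or b > 0:
        if a % 10 + b % 10 > 9:
            return False
        a //= 10
        b //= 10
    return True

rng = random.Random(0)  # fixed seed so that the item sample is reproducible
items = [generate_item(rng) for _ in range(50)]
print(items[:3])
```

The check function expresses the defining property of the item form independently of the generator, so it can also be used to decide whether a given problem belongs to this item form.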
If the item forms encompass all items of a certain class of items and the partial sets of items defined by them do not overlap, the resulting item sample can, to a certain extent, be viewed as representative for the entire class of items. It must, however, be kept in mind that the methods by which different authors define item forms are only slightly theoretically founded and are oriented toward external characteristics of items and not toward the characteristics of the respective problem-solving processes.

3.3.2. Individualized Testing According to Ferguson (1971)

Of course, item form technology does not replace a test model. It is rather an interesting approach to the generation of hypotheses about homogeneous classes of items whose proper analysis must be done within the framework of a test model. In the majority of cases, item form technology was applied within the framework of variants of classical test theory. Ferguson (1971; cf. also Brennan 1973), on the other hand, developed a test concept which proceeds from hierarchically arranged item forms and which can be traced back to the binomial model. The concept implies that all problems of one item form are equally difficult and measure the same ability. Ferguson counters the relatively large number of abilities (one ability and one learning goal is assigned to each item form) with the assumption that item forms, abilities and learning goals can each be arranged into congruent hierarchies. Those learning goals which, in the sense of Gagné & Paradise (1961), presuppose other learning goals have a high position in the learning goal hierarchy. According to the Ferguson (1971) concept, the testing is individualized and computerized.
Beginning with an item form of intermediate difficulty, items are given to a pupil until the reactions permit a statistically ensured decision on whether the pupil has reached the learning goal formulated for this item form. The generation of items is done by computer on the basis of properties of the individual item forms. If a pupil has reached the learning goal for one item form, he is given problems of an item form which is higher up in the hierarchy. An analysis of the simpler item forms is dispensed with. In such an individualized procedure, the number of items given per item form as well as the number of given item forms themselves can be kept to a minimum. Given the validity of the assumptions underlying the concept (above all, the validity of the binomial model and of the learning goal hierarchy), it is possible to state which item forms are not yet sufficiently mastered by the pupil. An adequate testing of all assumptions was not done by Ferguson (1971) in his study about arithmetical problems. His test concept did very well, however, in comparison to a conventional procedure.

7 I do not consider task structure analysis with the aid of factor and cluster analyses in this context. These methods are hardly suited for the development of learning goal-oriented tests for various reasons (for a critical discussion of the factor analytical approach cf. Kempf 1972).

3.3.3. Models of Learning and Thought of the Rasch Theory

Models of learning and thought of the RASCH theory represent a generalization of the RASCH-model discussed in section 3.2.3. and have, test-theoretically with respect to the measurement of an ability, analogous properties. They go back to works by Scheiblechner (1972) and Fischer (1973) (cf. also Spada, 1973b, 1976b; Kempf, 1974; Kempf, Niehusen & Mach, 1976; and for a detailed presentation Spada, 1976a). The theory of thinking outlined in the models makes possible an analysis of the task structure in the sense that hypotheses about the type, number and difficulty of cognitive operations, which constitute the problem-solving process of pupils in different tasks, can be tested. The results of such a task structure analysis can, in the sense of the remarks made at the beginning of section 3.3., be applied for the definition and description of the ability measured by the items as well as for the planning of teaching (cf. Spada 1976b).
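The kind of task structure hypothesis described above can be made concrete with a small sketch. It assumes a linear decomposition in which the difficulty of an item is the sum of the difficulties of the cognitive operations it requires, each weighted by how often the operation occurs, and a logistic solution probability as in section 3.2.3. A normalization constant is omitted, and the operation names, difficulty values and item structures are hypothetical.

```python
from math import exp

# Hypothetical difficulties of three cognitive operations.
operation_difficulty = {"op_a": 0.4, "op_b": 0.9, "op_c": 1.5}

# Hypothesized task structure: how often each item requires each operation.
item_operations = {
    "item_1": {"op_a": 1},
    "item_2": {"op_a": 1, "op_b": 1},
    "item_3": {"op_a": 2, "op_c": 1},
}

def item_difficulty(operations):
    """Item difficulty as the weighted sum of operation difficulties."""
    return sum(count * operation_difficulty[op]
               for op, count in operations.items())

def p_correct(theta, operations):
    """Solution probability for a pupil with person parameter theta."""
    return 1.0 / (1.0 + exp(-(theta - item_difficulty(operations))))

theta = 1.0  # hypothetical person parameter
for name, operations in item_operations.items():
    print(name, round(item_difficulty(operations), 2),
          round(p_correct(theta, operations), 3))
```

Under such a decomposition, a hypothesis about the operations involved in each item predicts the item difficulties, and these predictions can be compared with difficulties estimated without the structural hypothesis.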
An arrangement of the items into a hierarchy according to the type and number of necessary cognitive operations is possible. By means of these models, different forms of practice effects which result from the treatment of the test problems themselves and from teaching can be analyzed and measured (Spada, 1975). For a comparison to the test concept of Ferguson it must be kept in mind that in use of the models of the Rasch theory, in general, only one ability for all items of one class is postulated even if the problems are structurally different (cf., however, Hilke, Kempf & Scandura, 1976). This presupposes that the assumption of item homogeneity need not be rejected.

3.3.4. The Deterministic Approach of Scandura (1973)

An approach to the testing of knowledge discussed by Durnin & Scandura (1973) and compared by these authors to the item form technology by Hively, Patterson & Page (1968) and to the method of Ferguson (1971) goes back to a simple deterministic model of thought by Scandura (1973). This model postulates that all items of a given type can be solved by means of a small number of rules (operation rules, branching rules etc.). Individual behavioral differences are assumed to result from the mastery or non-mastery of individual rules. The rules are assumed to be applied in a deterministic fashion: a pupil who has mastered a certain rule will apply it correctly to all tasks which require this rule. Given the validity of the model assumptions (problematic due to their deterministic character) and suitable hypotheses about the rules necessary for solving individual problems, the pupils' knowledge of rules can be clearly determined from the reactions. The testing of the model assumptions is possible by means of an analysis of the answer vectors.

With this somewhat atypical approach, the short summary of test-statistical methods of task structure analysis is closed.

REFERENCES

ANDERSEN, E. B. Conditional inference and models for measuring. Copenhagen: Diss. Mentalhygiejnisk Forskningsinstitut, 1973.
BRENNAN, R. L. Computerunterstützte Leistungsprüfung im Unterricht. In B. Rollet, K. Weltner (Hrsg.) Fortschritte und Ergebnisse der Bildungstechnologie 2. München: Ehrenwirth, 1972.
DURNIN, J., SCANDURA, J. M. An algorithmic approach to assessing behavior potential. Journal of Educational Psychology, 1973, 65, 262-272.
FERGUSON, R. L. Computer assistance for individualizing measurement. University of Pittsburgh, 1971.
FISCHER, G. H. The linear logistic test model as an instrument in educational research. Acta Psychologica, 1973, 37, 359-374.
FISCHER, G. H. Einführung in die Theorie psychologischer Tests. Bern: Huber, 1974.
FISCHER, G. H., SPADA, H. Die psychometrischen Grundlagen des Rorschachtests und der Holtzman Inkblot Technique. Bern: Huber, 1973.
FRICKE, R. Lehrzielorientierte Messung mit Hilfe stochastischer Meßmodelle. In K. J. Klauer, R. Fricke, M. Herbig, H. Rupprecht, F. Schott (Hrsg.) Lehrzielorientierte Tests. Beiträge zur Theorie, Konstruktion und Anwendung. Düsseldorf: Schwann, 1972.
FRICKE, R. Kriteriumsorientierte Leistungsmessung. Stuttgart: Kohlhammer, 1974.
GAGNÉ, R. M., PARADISE, N. E. Abilities and learning sets in knowledge acquisition. Psychological Monographs, 1961, 75, 14, No. 518.
GLASER, R. Instructional technology and the measurement of learning outcomes: Some questions. In W. J. Popham (ed.) Criterion-referenced measurement: An introduction. Englewood Cliffs, N.J.: Educational Technology Publications, 1971. See also: Amer. Psychol., 1963, 18, 519-521.
GLASER, R. A criterion-referenced test. In W. J. Popham (ed.) Criterion-referenced measurement: An introduction. Englewood Cliffs, N.J.: Educational Technology Publications, 1971.
GUTTMAN, L. The basis for scalogram analysis. In S. A. Stouffer et al. (eds.) Studies in social psychology in World War II. Princeton, N.J.: Princeton University Press, 1950.
HILKE, R., KEMPF, W. F., SCANDURA, J. M. Deterministic and probabilistic theorizing in structural learning. In H. Spada, W. F. Kempf (eds.) Structural models of thinking and learning. Proceedings of the IPN-Symposium 7, Kiel, 1975. Bern: Huber, 1976 (in print).
HIVELY, W., PATTERSON, H. L., PAGE, S. H. A "universe-defined" system of arithmetic achievement tests. Journal of Educational Measurement, 1968, 5, 275-290.
HOFFMANN, L., KATTMANN, U., LUCHT, H., SPADA, H., STICH, S. Die Wirkung einstellungsverändernder Maßnahmen im naturwissenschaftlichen Unterricht auf das Verhalten von Schülern im Problemfeld Technik, Energie und Umweltschutz: Theoretische Grundlagen und Versuchsplanung. IPN-Arbeitsbericht 1. Kiel, 1975.
HOFFMANN, L., KATTMANN, U., LUCHT, H., SPADA, H. Materialien zum Unterrichtsversuch: Kernkraftwerke in der Einstellung von Ju

KLAUER, K. J. Einführung in die Theorie lehrzielorientierter Tests. In K. J. Klauer, R. Fricke, M. Herbig, H. Rupprecht, F. Schott (Hrsg.) Lehrzielorientierte Tests. Beiträge zur Theorie, Konstruktion und Anwendung. Düsseldorf: Schwann, 1972a.
KLAUER, K. J. Zur Theorie und Praxis des binomialen Modells lehrzielorientierter Tests. In K. J. Klauer, R. Fricke, M. Herbig, H. Rupprecht, F. Schott (Hrsg.) Lehrzielorientierte Tests. Beiträge zur Theorie, Konstruktion und Anwendung. Düsseldorf: Schwann, 1972b.
KRISTOF, W. Einige Skalenfragen. In G. H. Fischer (Hrsg.) Psychologische Testtheorie. Bern: Huber, 1968.
LIENERT, G. A. Testaufbau und Analyse. Weinheim: Beltz, 1969.
LORD, F. M., NOVICK, M. R. (eds.) Statistical theories of mental test scores. Reading/Mass.: Addison-Wesley, 1968.
LUCHT, H., SPADA, H. Der Situationstest (ST) - ein Rasch-skaliertes Verfahren zur Einstellungsmessung. Unveröff. Manuskript eines Referates an der ITEM-Tagung, Landau, 1975.
NIEDDERER, H. Dispositionsziele und lernzielorientierte Tests auf der Basis probabilistischer Modelle des Verhaltens. Die Schulwarte, 1973, 26, 1-21.
POPHAM, W. J. (ed.) Criterion-referenced measurement: An introduction. Englewood Cliffs, N.J.: Educational Technology Publications, 1971.
POPHAM, W. J., HUSEK, T. R. Implications of criterion-referenced measurement. In W. J. Popham (ed.) Criterion-referenced measurement: An introduction. Englewood Cliffs, N.J.: Educational Technology Publications, 1971.
RASCH, G. Probabilistic models for some intelligence and attainment tests. Copenhagen: Nielsen/Lydiche, 1960.
RASCH, G. An individualistic approach to item analysis. In P. F. Lazarsfeld, N. W. Henry (eds.) Readings in mathematical social sciences. Chicago: Science Research Associates, 1966.
SCANDURA, J. M. Structural learning: I. Theory and research. New York: Gordon & Breach, 1973.
SCHEIBLECHNER, H. Das Lernen und Lösen komplexer Denkaufgaben. Zeitschrift für experimentelle und angewandte Psychologie, 1972, 19, 476-506.
SCHEIBLECHNER, H. Die Sozialstruktur großer Gruppen. In W. F. Kempf (Hrsg.) Probabilistische Modelle in der Sozialpsychologie. Bern: Huber, 1974.
SPADA, H. Die Messung von Verhaltenstendenzen bei Schülern im Problemfeld Technik, Energie und Umweltschutz zur Prüfung der Wirkung einstellungsverändernder Maßnahmen im Unterricht. In L. Hoffmann, U. Kattmann, H. Lucht, H. Spada, S. Stich (Hrsg.) Die Wirkung einstellungsverändernder Maßnahmen im naturwissenschaftlichen Unterricht auf das Verhalten von Schülern im Problemfeld Technik, Energie und Umweltschutz, Arbeitsbericht 1. Kiel: IPN, 1973a.
SPADA, H. Denk- und Lernmodelle der RASCH-Meßtheorie unter psychologischen, formalen und didaktischen Aspekten. In H. Spada, P. Häussler, W. Heyner (Hrsg.) Denkoperationen und Lernprozesse als Grundlage für lernerorientierten Unterricht: Versuchsplanung und erste Ergebnisse. Arbeitsbericht 5. Kiel: IPN, 1973b.
SPADA, H. Didaktische Informationen aus Testdaten mit Hilfe der RASCH-Meßtheorie. In W. Royl (Hrsg.) Didaktische Informationen aus Schulversuchen. Bericht über die Tagung der Organisation ITEM, Kiel, 1973. Braunschweig: Westermann, 1975.
SPADA, H. Modelle des Denkens und Lernens. Bern: Huber, 1976a.
SPADA, H. A model of learning and thought, and its application in the planning and evaluation of science teaching. Instructional Science, 1976b (in print).
SPADA, H., FREY, K. Die Formulierung von Lernzielen und die Konstruktion von lernzielorientierten Tests mit Hilfe formaler Modelle. In H. Spada, P. Häussler, W. Heyner (Hrsg.) Denkoperationen und Lernprozesse als Grundlage für lernerorientierten Unterricht: Versuchsplanung und erste Ergebnisse. Arbeitsbericht 5. Kiel: IPN, 1973.