
Studies in Educational Evaluation, Vol. 9, pp. 195-207, 1983
Printed in Great Britain. All rights reserved.
Copyright 1983 Pergamon Press Ltd
0191-491X/83 $0.00 + .50

CRITERION-REFERENCED TESTS — THEORY AND APPLICATION

Reiner Fricke
University of Hanover, Germany

Reinhold Lühmann
University of Aachen, Germany

INTRODUCTION

The concept of criterion-referenced measurement developed by Ebel (1962) and Glaser (1963) has received wide interest. It has been described and further developed in Germany in the publications of Ingenkamp and Marsolek (1968), Ingenkamp (1970), Klauer et al. (1972, 1977), Fricke (1974b), Herbig (1976), Pawlik (1976) and Klauer (1978a) as a meaningful new concept for psychological and pedagogical diagnosis.

The majority of tests in the pedagogical field, up until the sixties, were characterized by an individual test score being related to the test scores of a (preferably) large group of other individuals as a sample for gaining normative data. This was in accordance with the classical test theory. This classical approach was adequate for institutional decisions, i.e. for selecting applicants. In the field of pedagogics, however, it proved to be highly unsatisfactory in many cases. For example, when rating a student's progress in a particular subject, it is irrelevant how other students scored on the same test. In addition, the classical approach led to tests in which the test items were mainly of an average level of difficulty, with the result that students correctly solved 50% of the items on the average, whereas pedagogues sought to achieve a higher level of ability.

One solution was the concept of criterion-referenced measurement which was developed in the USA. Criterion-referenced measurement is characterized by the fact that the individual scores are related to a criterion which is set prior to testing. This method of test interpretation alone does not change a classical into a criterion-referenced test. The criterion-reference of a test is only given when the test is related to the criterium in at least three ways:

1. The scores obtained are related to a well-defined criterium, i.e. to an operationalized instructional objective. "Criterium" is, in this case, the set of items representing the instructional objective. It is therefore necessary that the content-validity of such tests be particularly high.

2. The test analysis is carried out in a criterion-referenced manner. Thus the equations stemming from the classical test theory used to calculate the test criteria become inapplicable; whenever students master the instructional objective, new equations for test analysis must be employed.

3. The interpretation of the achieved scores, i.e. the evaluation, is made in a criterion-referenced way: the difference between a single student's score and the criterium is determined. "Criterium" in this case is a certain percentage of correct solutions, established prior to testing, which must be achieved.

Considering these three characteristics, a criterion-referenced test may be defined in the following manner:

"A criterion-referenced test is a scientific routine procedure which tests the question whether and possibly how well a certain instructional goal has been reached. The test items employed for this purpose are not identical to the instructional objective, but rather only represent it, serving to compare the individual ability level of a student with a desired level of ability. For this comparison the following prerequisites are substantial: 1. quantification of the instructional objective; 2. quantitative assessment; 3. a statistical test of significance, for the decision as to whether the instructional objective has been realized or A special not. criterion-referenced test analysis is necessary in order to compute the test criteria" (Fricke, 1972a, 1973). A test defined in this way will not necessarily be the optimal one in all pedagogical decision-making situations. Based on the ideas of Cronbach and Krapp (1978) developed 32 different types of decisions for Gleser (1965), pedagogical situations. It should be clear from this that the concept "criterion-referenced testing" cannot account for the diversity of application and interpretation possiA classification of test in the pedagogical field, which is still to bilities. must take into consideration both the construction characteristics be created, of interpreting and applying tests (cf. Fricke, as well as the possibilities 1981). as mentioned of the three characteristics of such tests, On the basis the contribution of German research to the development above, we shall discuss and application of criterion-referenced test in the following section.

CRITERION-REFERENCED TEST CONSTRUCTION

A criterion-referenced test, as compared to a classical test, should contain an especially high degree of content-validity. This is the case when the universe of items representing a subject-matter is known, and the test items are randomly chosen from this universe. The main problem in creating a high content-validity lies in the construction or description of the test universe. According to Ebel (1962), the rules for test construction should be stated in such a way that test-writers working independently of each other would create very similar tests.

Klauer (1974) and Schott, Neeb and Wieberg (1981, 1983) pointed out that not only the test, but also the corresponding curriculum must be constructed in a criterion-referenced manner.


For this claim, the latter group of authors created the term "parallel content-validity." In addition, Küffner (1982) proposed extending criterion-referenced measurement to error-oriented measurement: by analyzing typical errors it becomes possible to gain information as to the reasons for a certain ability level, thus enabling teachers to direct instruction more efficiently.

In this paper we shall only consider procedures which ensure the content-validity of a criterion-referenced test. According to Wieberg (1983), the procedures may be divided into colloquially-oriented and formally-oriented ones: the first group starts with colloquially-formulated instructional objectives and attempts to make them more precise as well as to state an algorithm for creating sets of items. The second group of procedures, on the other hand, is limited to the development of an algorithm.

The first group includes the following methods:

the method of the behavioristic operationalization of instructional objectives (Mager, 1961)
the method of Tyler's table of specifications (Tyler, 1971)
the method of taxonomies (Bloom, 1956)
the method of "Amplified Objectives" (Popham, 1980)
the method of "Test Specifications" (Popham, 1980)
the PLANA method (Schott, 1975; Schott, Neeb and Wieberg, 1981, 1983)

The formally-oriented approaches include the following methods:

the method of item-forms (Hively et al., 1968)
the method of item-transformation (Bormuth, 1970)
the method of mapping-sentences (Berk, 1978)
the method of algorithmization (Scandura, 1973, 1977)
the method of concept analysis (Tiemann and Markle, 1978)
the integrative method (Klauer, 1978b)

COLLOQUIALLY-ORIENTED METHODS OF TEST CONSTRUCTION

The following methods have been depicted in German publications as representatives of colloquially-oriented procedures: Mager's method (cf. Klauer, 1972a; Möller, 1974), Tyler's (cf. Schott, 1972) and Bloom's (cf. Herbig, 1976). It was realized that all three procedures create a greater degree of precision than previous methods, but that the problem of generating content-valid sets of items still remained to be solved: the relation between universe and set of items is not explicitly defined in Mager's method, and with the methods of Tyler and Bloom, different item-writers are likely to produce very different item universes.

Using Schott's (1975) rather formally-oriented basic study as their forerunner, Schott, Neeb and Wieberg (1981, 1983) developed the procedure entitled PLANA, which constitutes a method for analyzing subject-matter and for constructing instructional objectives and test items. This method also entails decomposing subject-matter into a domain and a behavior specification. Each item can be represented by stating an initial state, a final state, and an operator which transforms the initial into the final state.


The operator, which represents the behavior aspect, is defined by a precise statement of initial and final states, i.e. by precisely defined domain specifications, because the authors believe, as opposed to Mager, that it is much easier to define the domain precisely than to define the behavior directly. The domain specification itself is constructed by stating a formally represented basic subject matter which is more or less differentiated according to the requirement level. The training program PLANA, developed for teachers and students of education, was superior to five other procedures used both for mathematical as well as social studies curricula. Compared to other procedures, e.g. Mager's approach, PLANA-trained subjects produced the most content-valid items related to the instructional objective (Schott, Wieberg and Neeb, 1981).
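To make the item representation described above concrete, the following minimal Python sketch encodes an item as an initial state, an operator and a final state, and samples a small universe of such items from a formally described domain. The concrete domain (two-digit addition) and all names are our own illustrative assumptions, not taken from PLANA itself.

```python
# Illustrative sketch of the item representation described above: an item is
# characterized by an initial state, a final state and an operator that
# transforms the one into the other. The domain (two-digit addition) and the
# function names are assumptions for illustration only.
import random

def make_addition_item(rng):
    """Generate one item from a simple, formally described domain."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)   # domain specification
    initial_state = f"{a} + {b} = ?"                  # task as presented
    operator = "add the two numbers"                  # behavior aspect
    final_state = a + b                               # required end state
    return {"initial": initial_state, "operator": operator, "final": final_state}

# A universe of items can then be sampled, e.g. for parallel test forms
rng = random.Random(1)
universe = [make_addition_item(rng) for _ in range(5)]
for item in universe:
    print(item["initial"], "->", item["final"])
```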

FORMALLY-ORIENTED METHODS OF TEST CONSTRUCTION

The formally-oriented methods have likewise been demonstrated and discussed in German publications: Fricke (1974) the method of item-forms, Rupprecht (1972) the method of item transformation, Nußbaum (1982) the method of mapping sentences, and Schott and Kretschmer (1977) the method of algorithmization. Klauer's (1978b) approach was integrative: it made possible the use of item-forms, item transformations, algorithms and mapping sentences, and provided for the simultaneous generation of content-valid test items and content-valid instructional tasks. This method did almost as well as the PLANA program in the comparative study carried out by Schott, Wieberg and Neeb (1981). In addition, Klauer and Dänecke (1981) were able to demonstrate that parallel tests created with the help of this method had the same mean values and variances and correlated with one another with values between r = 0.67 and r = 0.76.

However, it is too soon to make any final statements about the efficiency of the diverse methods of test construction, as only the results from a single comparative study are available. Due to the relatively large amount of effort required in generating content-valid items, it would be of great help if authors of curricula and textbooks would simultaneously develop the corresponding tests and thereby relieve teachers of this tremendous amount of work. In doing so, they should take not only special, but also general instructional objectives into account (cf. Neeb, 1983).

CRITERION-REFERENCED TEST ANALYSIS

Criterion-referenced tests should also be able to meet the test criteria such as reliability, validity, etc. which have been established by the American Psychological Association (1966). In certain cases it is not possible to calculate the test statistics by means of the usual equations as found in the classical test theory, as most of them are defined in relation to the population. For example, the relation of error variance to the variance between individuals is crucial to the reliability. The reliability of a test changes, however, according to the variance of the test scores of the given population, and is no longer defined in the extreme case in which all individuals have reached the instructional goal and therefore have achieved the same score. As most of the equations of the classical test theory contain correlation coefficients quantifying the most important test criteria objectivity, reliability, and validity, it follows that these equations cannot be applied. Popham and Husek (1969) were the first to point out this fact.


An undesired dependence on the population has led many researchers to suggest equations which stand apart from the population. Carver (1970), Cronbach et al. (1972), Livingston (1972) and Rasch (1960) use the absolute error of measurement and do not relate this to the variance of test scores.

In Germany, Fricke (1972a, 1974b) developed the agreement coefficient A, which is independent of the population and thereby suitable for criterion-referenced test analysis. Due to the fact that A functions as a population-independent equivalent to the correlation coefficient, it is possible to replace the correlation coefficient in numerous equations stemming from the classical test theory whenever the test variance is not available. The agreement coefficient is defined by

A = 1 - SS / max SS

where SS is the sum of squares within individuals which results whenever subjects are rated by several judges or by means of several test items, and max SS is the greatest possible sum of squares within subjects, which would result given a minimal amount of agreement between judges or test items, respectively. In the case of binary data, A can also be expressed in the following manner:

A = (1 / (n·k²)) · Σ_{j=1}^{n} d_j²,   with d_j = | x_j − (k − x_j) |

x_j: number of ones in the j-th column
n: number of objects to be rated (e.g. persons)
k: number of raters (e.g. test items)

The coefficient can be applied both to binary data as well as to quantitative data in order to compute the objectivity, reliability and validity of criterion-referenced tests. The theoretical distribution specified by Fricke (1972a) for the statistical test against A = 0 was too conservative. Lindner (1980a) worked out the exact distribution of A. For large n the variable n·k²·A is approximately normally distributed; for n = 2, Lindner has listed the values in a table. Additional methods of criterion-referenced test analysis, that is, the procedures for validating a subject-matter hierarchy, have been depicted in great detail by Klauer (1974) and Fricke (1974b).
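As an illustration of how the binary form of A can be evaluated in practice, the following minimal Python sketch applies the formula above to a persons-by-items matrix of 0/1 scores. The function name and the example data are ours, not part of the original treatment.

```python
# Sketch of the agreement coefficient A for binary (0/1) data, following the
# formula above: A = (1/(n*k^2)) * sum_j d_j^2 with d_j = |x_j - (k - x_j)|.
# Names and example data are illustrative only.

def agreement_coefficient(score_matrix):
    """score_matrix: list of per-person lists of 0/1 item scores."""
    n = len(score_matrix)            # number of persons (objects rated)
    k = len(score_matrix[0])         # number of items (raters)
    total = 0
    for person in score_matrix:
        x_j = sum(person)            # number of ones for this person
        d_j = abs(x_j - (k - x_j))   # |x_j - (k - x_j)|
        total += d_j ** 2
    return total / (n * k ** 2)

# Example: three examinees, four items each
scores = [
    [1, 1, 1, 1],   # perfect within-person agreement contributes k^2
    [1, 0, 1, 0],   # maximal within-person disagreement contributes 0
    [1, 1, 1, 0],
]
print(agreement_coefficient(scores))   # value lies between 0 and 1
```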

CRITERION-REFERENCED TEST INTERPRETATION

In the following section we shall describe some models of criterion-referenced test interpretation, giving special consideration to developments in the German-speaking countries. The interpretation of a test result, independent of a population, can be made in at least two different ways: either by stating the percentage of items which a person has correctly solved or by making the dichotomous judgment "mastery" or "non-mastery." The criterium-referenced test interpretation models can therefore be classified accordingly:

1. Dichotomous Statements: The problem of assigning persons to two possible states can be transformed into a corresponding probabilistic problem.

2. Quantitative Statements: Methods of estimation are indicated which describe the relation between the test result and a theoretical parameter on a statistical basis.

Polytomous methods, which classify subjects into several groups, fall between the dichotomous and quantitative procedures, and are very important for the assignment of grades.

DICHOTOMOUS MODELS OF TEST INTERPRETATION

The binomial one-error model

The binomial model suggested by Klauer (1972b) and Millmann (1973), following the example of Lord and Novick (1968), can be applied whenever three preconditions are given:

1. The test items must be assessed dichotomously (correct/false).
2. The item solution probabilities of a person are constant.
3. The item solutions for each individual person are stochastically independent.

If these three preconditions are met, then the assumption can be made that each individual can be characterized by an ability parameter which states the probability with which this person solves a randomly chosen item of the test in question. In this case the probability distribution of the number of items in a test which are solved by a subject is binomial. If a certain solution probability is stated a priori (e.g., 0.9 or 0.95) as a criterium for reaching a goal, then it is possible to analyse, by means of the binomial distribution of the test statistic "number of items solved," whether a given test score is sufficient.

The formal simplicity and the relatively small number of prior assumptions make the binomial model a test procedure which is of practical value for everyday use. It can be applied in two frequently found situations:

1. Individuals shall solve items with a given degree of probability.
2. Individuals shall solve a given percentage of a certain set of items.
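To illustrate the decision logic sketched above for the first situation, the following Python fragment determines the highest passing score on an n-item test such that a true master (solution probability p0) fails it with probability at most alpha. This is a generic illustration of the binomial reasoning, not a reproduction of Klauer's or Millmann's tables; names and example values are ours.

```python
# Sketch of the binomial one-error decision: given n dichotomous items, a
# criterion solution probability p0 and a tolerated Type I error alpha, find
# the highest passing score that a true master still reaches with probability
# at least 1 - alpha. Names and example values are illustrative.
from math import comb

def binomial_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x + 1))

def passing_score(n_items, p0, alpha):
    """Highest score c such that a true master (p = p0) falls below c
    with probability at most alpha."""
    for c in range(n_items, -1, -1):
        if binomial_cdf(c - 1, n_items, p0) <= alpha:
            return c
    return 0

# Example: 20 items, criterion probability 0.9, alpha = 0.05
print(passing_score(20, 0.9, 0.05))
```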

The assumption of homogeneous solution probabilities can be tested by means of a procedure suggested by Fricke (1974b, 86). The verification of the assumption of the stochastic independence of items is insolvable; thus, one must be content with approximations or methods with a low degree of power. Fricke (1974, 88) and Lind (1977) discuss such procedures. In a simulation study, Bibl (1977) examined the errors made when the binomial model was applied although the homogeneity of the solution probabilities was not given and the "compound binomial model" should have been used instead (Lord and Novick, 1968, 524 ff). He found that the simple binomial model is quite robust against violations of this assumption: in order to achieve congruence between the two models, in most cases one only has to reduce the computed critical value by one or two points, thus using the simple binomial model.


The binomial two-error model

The binomial model, conceived of as a significance test, only controls the Type I error, i.e., the probability with which a "master" is falsely judged to be a "nonmaster." In many cases in pedagogical research and application, however, the Type II error must also be controlled, namely the probability of falsely classifying a "nonmaster" as a "master." This is only possible, to be sure, when the distribution of the solution probabilities of all individuals is known. Lindner (1979) therefore suggests establishing a second limit, e.g. at p = 0.80, below the standard solution probability. By doing so, not only the probability for the Type I but also for the Type II error can be computed according to the binomial model. Lindner (1979) has drawn up tables for looking up the minimal test length and critical values for diverse standard solution probabilities, lower limits and Type I and II errors.
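The two-error reasoning can be illustrated by computing both error probabilities from two binomial distributions and searching for the shortest test that keeps them within given bounds, which is the kind of quantity Lindner tabulated. The following sketch is a brute-force illustration under assumed example values, not a reproduction of Lindner's tables.

```python
# Sketch of the two-error logic described above: with a standard solution
# probability p0 (mastery) and a lower limit p1 (non-mastery), both error
# probabilities of a cutting score c on an n-item test follow from two
# binomial distributions. Names, bounds and example values are illustrative.
from math import comb

def binom_cdf(x, n, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x + 1))

def error_rates(n, c, p0, p1):
    type1 = binom_cdf(c - 1, n, p0)       # master (p0) scores below c
    type2 = 1 - binom_cdf(c - 1, n, p1)   # non-master (p1) reaches c
    return type1, type2

def minimal_test_length(p0, p1, alpha, beta, n_max=200):
    """Smallest n for which some cutting score keeps both errors in bounds."""
    for n in range(1, n_max + 1):
        for c in range(n + 1):
            t1, t2 = error_rates(n, c, p0, p1)
            if t1 <= alpha and t2 <= beta:
                return n, c
    return None

# Example: standard 0.90, lower limit 0.80, both error bounds 0.10
print(minimal_test_length(0.90, 0.80, 0.10, 0.10))
```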

Emrick's cost-model

An interesting further development of the binomial two-error model is the model suggested by Emrick (1971), in which the probability criterium is replaced by a cost-criterium. The model takes the following variables into account:

1. The solution probability of the "masters".
2. The solution probability of the "nonmasters".
3. The resulting costs of erroneous classification by using a certain cutting-off point.
4. The probability with which a "master" can be found in the population.

According to Emrick (1971), the critical value K is determined in such a way that the sum of all expected costs resulting from possible misjudgments is a minimum. As Fricke (1974a) was able to demonstrate, however, the variables (3) and (4) only influence the value of K to a small extent, especially for increasing test length, for which reason it seems justified to pick K from a table in which only the first two variables are taken into account (cf. Fricke, 1974b, 109). If they should be taken into consideration, however, then Fricke's (1974a) correction tables can be employed. Reulecke (1977) and Lühmann (1979) specify procedures with the help of which the a priori unknown parameters (2) and (4) can be estimated from the given test answers. However, these solution procedures are only partially satisfying. An extensive paper has been presented by Van der Linden (1981) which solves the problem in a practical manner.
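The cost criterion itself is easy to illustrate: for each candidate cutting score one can compute the expected misclassification cost from the four variables listed above and take the minimum. The following Python sketch does exactly that by brute force; it is not Emrick's closed-form solution, and all names and example values are ours.

```python
# Brute-force illustration of the cost-minimizing choice of the critical
# value K described above. For each candidate cutting score, the expected
# cost of misclassification is computed from the master/non-master solution
# probabilities, the two error costs and the base rate of masters.
from math import comb

def binom_cdf(x, n, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x + 1))

def optimal_cutting_score(n, p_master, p_nonmaster,
                          cost_false_nonmaster, cost_false_master,
                          base_rate_master):
    best_k, best_cost = None, float("inf")
    for k in range(n + 1):
        # master scores below k -> falsely judged a non-master
        p_type1 = binom_cdf(k - 1, n, p_master)
        # non-master reaches k -> falsely judged a master
        p_type2 = 1 - binom_cdf(k - 1, n, p_nonmaster)
        expected_cost = (base_rate_master * p_type1 * cost_false_nonmaster
                         + (1 - base_rate_master) * p_type2 * cost_false_master)
        if expected_cost < best_cost:
            best_k, best_cost = k, expected_cost
    return best_k, best_cost

# Example: 20 items, masters solve 90%, non-masters 60%,
# equal error costs, 50% masters in the population
print(optimal_cutting_score(20, 0.9, 0.6, 1.0, 1.0, 0.5))
```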

Beta-Binomial models

There have been attempts to improve criterium-oriented tests by including information which is known a priori about the population of all individuals in the statistical decision. This is dangerous, however, as the disadvantage of the classical test theory, namely the dependence of the test evaluation on the scores of other individuals, is reintroduced in this instance. The Beta-Binomial model is of great importance for pedagogical diagnosis. Characteristic of this model is the assumption that the prior distribution of the ability parameters of all subjects of the population is a Beta-distribution. This assumption is not very restrictive, as all unimodal distributions over the interval between zero and one can be approximated by such a distribution (cf. Novick and Jackson, 1974). The reason for the assumption of a Beta-distribution as the prior distribution lies in the fact that the posterior distribution is then also of the same type.

As shown in the works of Novick and Jackson (1974), Hambleton et al. (1978), Huynh (1976a, b), Mellenbergh et al. (1977) and Wilcox (1979a), the Beta-Binomial model is very efficient, but requires a greater amount of computation. The advantages of this model have also been discussed in the German works of Fricke (1977a) and Klauer (1982b).
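The Beta-Binomial reasoning can be illustrated with a small sketch: a Beta prior over the ability parameter is combined with the observed score, and the posterior probability of exceeding the mastery criterion is obtained from the resulting Beta posterior (conjugacy). The prior parameters and example values below are illustrative assumptions, not taken from the works cited above.

```python
# Sketch of the Beta-Binomial reasoning described above: a Beta(a, b) prior
# over a person's solution probability, combined with x correct answers out
# of n items, yields a Beta(a + x, b + n - x) posterior, from which the
# posterior probability of exceeding the mastery criterion p0 is computed
# by numerical integration. All example values are illustrative.
from math import gamma

def beta_pdf(t, a, b):
    coeff = gamma(a + b) / (gamma(a) * gamma(b))
    return coeff * t**(a - 1) * (1 - t)**(b - 1)

def posterior_mastery_prob(x, n, p0, prior_a=1.0, prior_b=1.0, steps=10000):
    """P(ability >= p0 | x correct of n), Beta prior -> Beta posterior."""
    a_post, b_post = prior_a + x, prior_b + n - x
    # Midpoint-rule integration of the posterior density over [p0, 1]
    width = (1.0 - p0) / steps
    total = 0.0
    for i in range(steps):
        t = p0 + (i + 0.5) * width
        total += beta_pdf(t, a_post, b_post) * width
    return total

# Example: 17 of 20 items correct, criterion p0 = 0.8, uniform prior
print(posterior_mastery_prob(17, 20, 0.8))
```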

POLYTOMOUS MODELS OF TEST INTERPRETATION

Polytomous models of evaluation divide a group of persons into more than two ability groups and can therefore be used to evaluate school performance, which is carried out in Germany by means of the grades "1" (very good) through "6" (unsatisfactory). Diverse models of grading have been proposed in order to eliminate the deficiencies of the previous grading system.

Herbig (1974, 1976) assigned different solution probabilities (standards) to the grades and applied the binomial model successively to the individual standards. As the binomial model only controls the Type I error, an undefendably large number of false decisions occur, which are reinforced by the multiple application. Lindner (1980b) has demonstrated that grades assigned using this model always tend to be too high. In addition, Fricke (1974, 96-97) pointed out that students receive different grades for the same percentual performance on tests which differ in their lengths.

Lindner (1980) therefore developed a model which consists of a simultaneous application of the two-error model. The standard is represented by the solution probability ascribed to the grade and the lower limit by the standard of the next lower grade. By doing so, Lindner developed a procedure for selecting norms so that it is possible to consider the resulting grades to be an interval scale. An additional model drawn up by Lühmann (1980) represents a simultaneous application of Emrick's model (1971). Klauer (1982a) developed a model which does not employ solution probabilities directly, but rather solution probabilities transformed by the arcsine transformation. This transformation has the effect of stabilizing the variance and leads to several improvements.
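For illustration, the successive-standards idea underlying Herbig's proposal can be sketched as follows: each grade carries its own standard, a binomial passing score is derived per standard, and the best grade whose passing score is reached is assigned. The standards and names below are illustrative assumptions; as the text notes, this simple scheme tends to grade too high, which is precisely what the later models of Lindner, Lühmann and Klauer correct.

```python
# Sketch of the successive-standards idea described above (Herbig-style):
# each grade is tied to a standard solution probability, a binomial passing
# score is derived for each standard, and a student receives the best grade
# whose passing score is reached. Purely illustrative.
from math import comb

def binom_cdf(x, n, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x + 1))

def passing_score(n, p0, alpha):
    """Highest score c such that a student with true probability p0
    falls below c with probability at most alpha."""
    for c in range(n, -1, -1):
        if binom_cdf(c - 1, n, p0) <= alpha:
            return c
    return 0

def assign_grade(score, n, standards, alpha=0.05):
    """standards: dict grade -> required solution probability,
    ordered from best grade to worst."""
    for grade, p0 in standards.items():
        if score >= passing_score(n, p0, alpha):
            return grade
    return max(standards) + 1      # below the lowest standard

# Example with German grades 1 (best) to 5; standards are illustrative only
standards = {1: 0.95, 2: 0.85, 3: 0.75, 4: 0.65, 5: 0.55}
print(assign_grade(16, 20, standards))
```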

QUANTITATIVE MODELS OF TEST INTERPRETATION

The aim of the quantitative criterium tests is to rate the level of the ability parameters. Rasch's (1960) model is appropriate for measuring these parameters. It classifies the diverse test items on a scale according to their level of difficulty, and individuals according to their ability level; it thus becomes possible to infer the item solution probability from both of the parameters. The model displays several advantages in terms of its measurement properties: it can be further differentiated according to the object sphere to be measured by decomposing both the ability as well as the difficulty parameters into a series of individual components, thereby creating the possibility of representing complicated cognitive and learning processes (cf. Spada, 1976). If the empirically testable model is valid, then it is possible to measure the parameters of the model at a difference scale level. Furthermore, one is then certain that a single dimension is sufficient to describe the ability parameter and that the number of items solved allows a sufficient statement about the ability level of a person (cf. Fricke, 1972c; Fischer, 1974).

In contrast to the binomial model, the Rasch model has the additional advantage of allowing the difficulty parameters of the items, and therefore the item solution probabilities, to differ from each other. Fricke (1972b) discussed the advantages of the Rasch model in connection with criterion-referenced measurement, using the example of an examination.
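The central relation of the Rasch model referred to above, namely that the solution probability follows from the ability and difficulty parameters alone, can be written down in a few lines; the example values are illustrative.

```python
# Sketch of the core Rasch relation: the probability that a person with
# ability theta solves an item with difficulty sigma depends only on the
# difference of the two parameters. Example values are illustrative.
from math import exp

def rasch_probability(theta, sigma):
    """P(item solved) = exp(theta - sigma) / (1 + exp(theta - sigma))."""
    return 1.0 / (1.0 + exp(-(theta - sigma)))

# A higher ability and a lower item difficulty both raise the probability
print(rasch_probability(1.0, 0.0))   # ability above item difficulty
print(rasch_probability(0.0, 1.0))   # item harder than the person's ability
```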

MULTIOPERATIONAL MODELS OF TEST INTERPRETATION

One disadvantage of all test interpretation models is that the required number of items, approximately n = 20, is still quite large. This is due to the fact that the prevailing models of test interpretation are "blind" towards the complexity level of an item. They do not take into account that persons have to perform certain individual operations with different frequencies in order to solve items with diverse complexity levels. As such models regard the solution of each item as a single operation, they can therefore be considered unioperational. Schott (1977) and Fricke (1977a, b) have therefore proposed a multioperational procedure: by exactly analyzing the required individual operations, it is easier to separate those who are able to solve the item(s) from those who cannot do so. This is based on the simple assumption that a person has a certain probability rate for performing elementary arithmetical operations, but that the probability rate of "non-solvers" tends to be zero for complex items. Fricke (1977a, b) has demonstrated the economical advantages of multioperational testing with the example of Wald's sequential test (1947) and a Bayes model.
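As a rough illustration of the sequential idea mentioned above, the following sketch applies Wald's sequential probability ratio test to an item-by-item mastery decision between an assumed mastery probability p0 and a non-mastery probability p1. It is a generic SPRT illustration of how item-level information can shorten testing, not Fricke's multioperational procedure; names and values are ours.

```python
# Sketch of a sequential (Wald-type) mastery decision: items are scored one
# at a time and the log-likelihood ratio between a mastery hypothesis (p0)
# and a non-mastery hypothesis (p1) is accumulated until a Wald boundary is
# crossed. Generic SPRT illustration; all names and values are illustrative.
from math import log

def sequential_mastery_test(responses, p0=0.9, p1=0.6, alpha=0.05, beta=0.05):
    """responses: iterable of 0/1 item scores in presentation order."""
    upper = log((1 - beta) / alpha)      # accept mastery above this
    lower = log(beta / (1 - alpha))      # accept non-mastery below this
    llr = 0.0
    for i, r in enumerate(responses, start=1):
        if r == 1:
            llr += log(p0 / p1)
        else:
            llr += log((1 - p0) / (1 - p1))
        if llr >= upper:
            return "mastery", i          # decided after i items
        if llr <= lower:
            return "non-mastery", i
    return "undecided", len(responses)

# Example: the decision is typically reached before all 20 items are shown
answers = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(sequential_mastery_test(answers))
```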

THE APPLICATION OF CRITERION-REFERENCED TESTS

Although the theoretical works on criterion-referenced test construction, test analysis and test interpretation are quite numerous in Germany as well, their application in pedagogical and psychological fields has nevertheless been limited. There is a lack of criterion-referenced tests which have been created in accordance with the theory; at most, only individual aspects have been considered. For example, content-valid tests are constructed to accompany textbooks, which, in turn, are then evaluated in a norm-oriented manner. This is due to the fact that the grades may be criterion-referenced, but are still very vaguely defined, because the requirement level has not been operationalized. Furthermore, teachers' knowledge of pedagogical diagnosis is still quite limited, although Haase (1978) claims there is a great amount of interest in using tests in school.

Nauck (1981) states, in one of the first research works on the application of criterium-oriented tests, that teachers may take Klauer et al.'s (1972) recommendations into account in their test construction, but that they do not initiate any further steps after giving the test in terms of a test analysis, a procedure mostly unknown to them. University courses for teachers on pedagogical diagnosis are very scarce (cf. Küffner, 1980, p. 25), so it is not surprising that students of education rate the value of rules for criterium-oriented test construction, test analysis and test interpretation lower than teachers do.

In order to convince even the practitioners of the value of criterium-oriented tests, however, future research should consider aspects of daily instruction to a greater extent, as Schott (1983) claims.


He has worked out a number of requirements which must be met in order to make tests more applicable in everyday school instruction.

All together, progress in the field of criterium-oriented (criterion-referenced) measurement has indeed been considerable. Some of the problems have become more obvious and call for further research.

REFERENCES

AMERICAN PSYCHOLOGICAL ASSOCIATION. Standards for educational and psychological tests and manuals (1966). In: Jackson, D.M. and Messick, S. (Eds.), Problems in human assessment. New York: McGraw-Hill, 1967, pp. 169-189.

BERK, R.A. The application of structural facet theory to achievement test construction. Educational Research Quarterly, 3, 1978, pp. 62-72.

BIBL, W. Verallgemeinertes und einfaches Binomialmodell. In: Garten, H.-K. (Ed.) Diagnose von Lernprozessen. Braunschweig: Westermann, 1977, pp. 209-217.

BLOOM, B.S. (Ed.) Taxonomy of educational objectives. Handbook 1: Cognitive domain. New York: David McKay, 1956.

BORMUTH, J.R. On the theory of achievement test items. Chicago: University of Chicago Press, 1970.

CARVER, R.P. Special problems in measuring change with psychometric devices. In: Evaluative research: Strategies and methods. Pittsburgh: American Institute for Research, 1970, pp. 48-66.

CRONBACH, L.J. and GLESER, G.C. Psychological tests and personnel decisions. Urbana: University of Illinois Press, 1965.

CRONBACH, L.J.; GLESER, G.C.; NANDA, H. and RAJARATNAM, N. The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley, 1972.

EBEL, R.L. Content standard test scores. Educational and Psychological Measurement, 22, 1962, pp. 15-25.

EMRICK, J.A. An evaluation model for mastery testing. Journal of Educational Measurement, 8, 1971, pp. 321-326.

FISCHER, G. Einführung in die Theorie psychologischer Tests. Bern: Huber, 1974.

FRICKE, R. Testgütekriterien bei lehrzielorientierten Tests. Zeitschrift für erziehungswissenschaftliche Forschung, 6, 1972a, pp. 150-175. Abdruck in: Strittmatter, P. (Ed.) Lernzielorientierte Leistungsmessung. Weinheim: Beltz, 1973, pp. 115-135.

FRICKE, R. Lehrzielorientierte Messung mit Hilfe stochastischer Meßmodelle. In Klauer, K.J., Fricke, R., Herbig, M., Rupprecht, H. und Schott, F. Lehrzielorientierte Tests. Düsseldorf: Schwann, 1972b, pp. 126-160.

FRICKE, R. Über Meßmodelle in der Schulleistungsdiagnostik. Düsseldorf: Schwann, 1972c.

FRICKE, R. Zur Theorie lehrzielorientierter Tests. Lernzielorientierter Unterricht, 2, 1973, pp. 18-28.

FRICKE, R. Zum Problem von Cut-off-Formeln bei lehrzielorientierten Tests. Unterrichtswissenschaft, 3, 1974a, pp. 43-56.

FRICKE, R. Kriteriumsorientierte Leistungsmessung. Stuttgart: Kohlhammer, 1974b.

FRICKE, R. Uni- versus multioperationale Testung. Versuche zur Reduzierung der Aufgabenzahl bei kriteriumsorientierten Messungen. Teil II: Kriteriumsorientierte Auswertungsmodelle. In Garten, H.-K. (Ed.) Diagnose von Lernprozessen. Braunschweig: Westermann, 1977a, pp. 189-208.

FRICKE, R. Testökonomie durch multioperationale Testung. In: Klauer, K.J., Fricke, R., Herbig, M., Rupprecht, H. und Schott, F. Lehrzielorientierte Leistungsmessung. Düsseldorf: Schwann, 1977b, pp. 101-110.

FRICKE, R. Stichwort: Test. In: Schiefele, H. und Krapp, A. (Eds.), Handlexikon zur Pädagogischen Psychologie, 1981, pp. 367-372.

GLASER, R. Instructional technology and the measurement of learning outcomes. American Psychologist, 18, 1963, pp. 519-521.

HAASE, H. Tests im Bildungswesen. Urteile und Vorurteile. Göttingen: Hogrefe, 1978.

HAMBLETON, R.K.; SWAMINATHAN, H.; ALGINA, J. and COULSON, D.B. Criterion-referenced testing and measurement: A review of technical issues and developments. Review of Educational Research, 48, 1978, pp. 1-47.

HERBIG, M. Ein lehrzielorientiertes Zensierungsmodell. Zeitschrift für erziehungswissenschaftliche Forschung, 8, 1974, pp. 129-142.

HIVELY, W.; PATTERSON, H.L. and PAGE, S.H. A "universe defined" system of arithmetic achievement tests. Journal of Educational Measurement, 5, 1968, pp. 275-290.

HUYNH, H. On the reliability of decisions in domain-referenced testing. Journal of Educational Measurement, 13, 1976a, pp. 253-264.

HUYNH, H. Statistical consideration of mastery scores. Psychometrika, 41, 1976b, pp. 65-78.

INGENKAMP, K. Normbezogene und kriterienorientierte Tests. Didacta Medica, 3, 1970, pp. 65-70.

INGENKAMP, K. Die Fragwürdigkeit der Zensurengebung. Texte und Untersuchungsberichte, 1971.

INGENKAMP, K. und MARSOLEK, Th. (Eds.) Möglichkeiten und Grenzen der Testanwendung in der Schule. Weinheim: Beltz, 1968.

KLAUER, K.J. Einführung in die Theorie lehrzielorientierter Tests. In Klauer, K.J., Fricke, R., Herbig, M., Rupprecht, H. und Schott, F. Lehrzielorientierte Tests. Düsseldorf: Schwann, 1972a.

KLAUER, K.J. Zur Theorie und Praxis des binomialen Modells lehrzielorientierter Tests. In Klauer, K.J., Fricke, R., Herbig, M., Rupprecht, H. und Schott, F. Lehrzielorientierte Tests. Düsseldorf: Schwann, 1972b.

KLAUER, K.J. Methodik der Lehrzieldefinition und Lehrstoffanalyse. Düsseldorf: Schwann, 1974.

KLAUER, K.J. (Ed.) Handbuch der Pädagogischen Diagnostik. Düsseldorf: Schwann, 1978a.

KLAUER, K.J. Kontentvalidität. In Klauer, K.J. (Ed.) Handbuch der Pädagogischen Diagnostik, Band 1. Düsseldorf: Schwann, 1978b, pp. 225-255.

KLAUER, K.J. Ein kriteriumsorientiertes Zensierungsmodell. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 14, 1982a, pp. 65-79.

KLAUER, K.J. Kriteriumsorientierte Tests. In Bredenkamp, J. und Feger, H. (Eds.) Enzyklopädie der Psychologie, Forschungsmethoden der Psychologie, Band 3: Messen und Testen. Göttingen: Verlag für Psychologie, 1983, pp. 693-726.

KLAUER, K.J. und DÄNECKE, K. Wie parallel sind lehrzielvalide Paralleltests? Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 13, 1981, pp. 181-189.

KLAUER, K.J., FRICKE, R., HERBIG, M., RUPPRECHT, H. und SCHOTT, F. Lehrzielorientierte Tests. Düsseldorf: Schwann, 1972.

KLAUER, K.J., FRICKE, R., HERBIG, M., RUPPRECHT, H. und SCHOTT, F. Lehrzielorientierte Leistungsmessung. Düsseldorf: Schwann, 1977.

KRAPP, A. Zur Abhängigkeit der pädagogisch-psychologischen Diagnostik von Handlungs- und Entscheidungssituationen. In Mandl, H. und Krapp, A. (Eds.) Schuleingangsdiagnose. Göttingen: Verlag für Psychologie, 1978, pp. 43-65.

KÜFFNER, H. Fehlerorientierte Tests: Konzept und Bewährungskontrolle. Weinheim: Beltz, 1980.

LIND, D. Zur Anwendbarkeit des binomialen Testmodells bei Lernerfolgsprüfungen. Lernzielorientierter Unterricht, 6 (1), 1977, pp. 25-33.

LINDNER, K. Ein zweiseitig orientiertes binomiales Testmodell. Lernzielorientierter Unterricht, 8 (3), 1979, pp. 17-29.

LINDNER, K. Die Überprüfbarkeit des Konkordanzmaßes "Ü". Zeitschrift für empirische Pädagogik, 4, 1980a, pp. 45-58.

LINDNER, K. Parameterwahl bei kriteriumsorientierten Zensierungsmodellen. Lernzielorientierter Unterricht, 9 (2), 1980b, pp. 25-37.

LIVINGSTON, S.A. Criterion-referenced applications of classical test theory. Journal of Educational Measurement, 9, 1972, pp. 13-26.

LORD, F.M. and NOVICK, M.R. Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

LÜHMANN, R. Zur praktischen Anwendung des Testmodells von Emrick. Lernzielorientierter Unterricht, 8 (1), 1979, pp. 21-30.

LÜHMANN, R. Ein lehrzielorientiertes Zensierungsmodell. Lernzielorientierter Unterricht, 9 (3), 1980, pp. 17-28.

MAGER, R.F. Preparing objectives for programmed instruction. San Francisco: Fearon Publishers, 1961.

MELLENBERGH, G.J., KOPPELAAR, H. and VAN DER LINDEN, W.J. Dichotomous decisions based on dichotomously scored items: A case study. Statistica Neerlandica, 31, 1977, pp. 161-169.

MILLMANN, J. Passing scores and test lengths for domain-referenced measures. Review of Educational Research, 43, 1973, pp. 205-216.

MÖLLER, Ch. Technik der Lernplanung. Weinheim: Beltz, 1974.

NAUCK, J. Erfolgskontrollen in der Orientierungsstufe. Anwendungsprobleme moderner Schulleistungsdiagnostik. Dissertation, Hochschule Hildesheim, 1981.

NEEB, K.E. Probleme kriteriumsorientierter Leistungsmessung: Überprüfung allgemeiner Lehrziele. In Jäger, R.S., Ingenkamp, K. und Horn, R. Tests und Trends 1983. Jahrbuch der Pädagogischen Diagnostik. Weinheim: Beltz, 1983.

NOVICK, M.R. and JACKSON, P.H. Statistical methods for educational and psychological research. New York: McGraw-Hill, 1974.

NUSSBAUM, A. Kriteriumsorientierte Messung im Rahmen der Generalisierbarkeitstheorie. Zeitschrift für Empirische Pädagogik, 6, 1982, pp. 75-89.

PAWLIK, K. (Ed.) Diagnose der Diagnostik. Stuttgart: Klett, 1976.

POPHAM, W.J. Domain specification strategies. In Berk, R.A. (Ed.) Criterion-referenced measurement: The state of the art. Baltimore: Johns Hopkins University Press, 1980, pp. 15-31.

POPHAM, W.J. and HUSEK, T.R. Implications of criterion-referenced measurement. Journal of Educational Measurement, 6, 1969, pp. 1-9.

RASCH, G. Probabilistic models for some intelligence and attainment tests. Copenhagen: The Danish Institute for Educational Research, 1960.

REULECKE, W. Ein Modell für die kriteriumsorientierte Testauswertung. Zeitschrift für Empirische Pädagogik, 1, 1977, pp. 49-72.

RUPPRECHT, H. Konstruktion von Testaufgaben nach einem Verfahren von Bormuth. In Klauer, K.J., Fricke, R., Herbig, M., Rupprecht, H. und Schott, F. Lehrzielorientierte Tests. Düsseldorf: Schwann, 1972, pp. 101-115.

SCANDURA, J.M. (Ed.) Structural learning 1: Theory and research. New York: Gordon and Breach, 1973.

SCANDURA, J.M. (Ed.) Problem solving. New York: Academic Press, 1977.

SCHOTT, F. Zur Präzisierung von Lehrzielen durch zweidimensionale Aufgabenklassen. In Klauer, K.J., Fricke, R., Herbig, M., Rupprecht, H. und Schott, F. Lehrzielorientierte Tests. Düsseldorf: Schwann, 1972.

SCHOTT, F. Lehrstoffanalyse. Ein Beschreibungssystem zur Analyse von Inhalt und Verhalten bei Lehrzielen. Düsseldorf: Schwann, 1975.

SCHOTT, F. Uni- versus multioperationale Testung. Versuche zur Reduzierung der Aufgabenzahl bei kriteriumsorientierten Messungen. Teil I: Der multioperationale Ansatz. In: Garten, H.-K. (Ed.) Diagnose von Lernprozessen. Braunschweig: Westermann, 1977, pp. 175-188.

SCHOTT, F. Probleme kriteriumsorientierter Leistungsmessung: Zum praktischen Nutzen lehrzielorientierter Tests im Unterricht. In Jäger, R.S., Ingenkamp, K. und Horn, R. Tests und Trends 1983. Jahrbuch der Pädagogischen Diagnostik. Weinheim: Beltz, 1983.

SCHOTT, F. und KRETSCHMER, I. Konstruktion lehrzielvalider Testaufgaben aufgrund einer normierten Lehrstoffanalyse. In Klauer, K.J., Fricke, R., Herbig, M., Rupprecht, H. und Schott, F. Lehrzielorientierte Leistungsmessung. Düsseldorf: Schwann, 1977, pp. 26-64.

SCHOTT, F., NEEB, K.E. und WIEBERG, H.J.W. Lehrstoffanalyse und Unterrichtsplanung. Braunschweig: Westermann, 1981.

SCHOTT, F., WIEBERG, H.J.W. und NEEB, K.E. Empirischer Vergleich verschiedener praxisnaher Handlungsanweisungen zur Erstellung lehrzielvalider Testaufgaben. Zeitschrift für Empirische Pädagogik, 5, 1981, pp. 137-147.

SCHOTT, F., NEEB, K.E. and WIEBERG, H.J.W. A general procedure for the construction of content-valid items for goal-oriented teaching and testing. To appear in: Studies in Educational Evaluation. Oxford/New York: Pergamon Press, 1983.

SPADA, H. Modelle des Denkens und Lernens. Bern: Huber, 1976.

TIEMANN, P.W. and MARKLE, S.W. Analyzing instructional content: A guide to instruction and evaluation. Champaign: Stipes Publications, 1978.

TYLER, R.W. Basic principles of curriculum and instruction. Chicago: The University of Chicago Press, 1971.

VAN DER LINDEN, W.J. Estimating the parameters of Emrick's mastery testing model. Applied Psychological Measurement, 5, 1981, pp. 517-530.

WALD, A. Sequential analysis. New York: Wiley, 1947.

WIEBERG, H.J.W. Probleme kriteriumsorientierter Leistungsmessung: Sicherung der Kontentvalidität. In Jäger, R.S., Ingenkamp, K. und Horn, R. Tests und Trends 1983. Jahrbuch der Pädagogischen Diagnostik. Weinheim: Beltz, 1983.

WILCOX, R.R. Comparing examinees to a control, 1979.

THE AUTHORS

REINER FRICKE is Professor at the University of Hannover. He received his Ph.D. from the University of Braunschweig.

REINHOLD LÜHMANN is a Research Assistant at the University of Aachen. He received his Ph.D. from the University of Braunschweig.
