CHAPTER 2

SOLVING CRITERION-REFERENCED MEASUREMENT PROBLEMS WITH ITEM RESPONSE MODELS*

RONALD K. HAMBLETON and H. JANE ROGERS
University of Massachusetts, Hills South, Room 152, Amherst, MA 01003, U.S.A.

*A presentation at an invited symposium at the annual meetings of AERA and NCME, New Orleans, April, 1988.
Abstract

Item response theory models appear to provide an excellent measurement framework for addressing technical matters associated with criterion-referenced testing. The primary purpose of this chapter is to describe our efforts to apply IRT models to the problems of determining test lengths, selecting items, and reporting scores. In addition, because of the importance of model fit to the viability of results, approaches to assessing the fit of IRT models to criterion-referenced test results are considered.
Introduction

A major change in the direction of achievement testing in the last 20 years in the United States has been the movement toward the use of criterion-referenced tests in education, industry and the armed services (see, for example, Jaeger & Tittle, 1980). For many educational purposes, the determination of an examinee's level of skills is substantially more important than the determination of the examinee's standing in relation to a norm group, and so the long-used and well-known norm-referenced testing framework is less appropriate. With this change in focus, measurement specialists have faced new challenges in test design and construction. Topics such as determination of test length, assessment of validity and reliability, and standard setting for criterion-referenced tests have been extensively studied and the results have been widely reported in the measurement literature (see, for example, Berk, 1984; Hambleton, 1980; Jaeger, 1989). Substantial progress has been made in solving the technical problems, enabling the successful implementation of criterion-referenced testing programs at all levels of education (see, for example, Berk, 1984; Hambleton, 1980). In the middle 1970s many measurement specialists thought that criterion-referenced
testing was just another 'fad' with a short life expectancy. One representative of a large test publishing company said his company "would keep an eye on criterion-referenced testing until it died". Today, criterion-referenced tests are being used, in some form, in nearly every school district in the country, and some 48 of the 50 states require high school students to pass criterion-referenced reading and mathematics tests (sometimes called 'basic skills' or competency tests) to receive high school diplomas. Over 900 professional organizations, such as those in medicine, allied health and teaching, use criterion-referenced tests (sometimes called 'credentialing', 'licensing', or 'certification' exams) to credential their members, and the armed services regularly administer criterion-referenced tests (sometimes called 'skills qualification' or 'job proficiency' tests) to determine competency levels of military personnel and to identify persons in need of additional training. Clearly, criterion-referenced testing is not just a 'fad'.

Concurrent with the national interest in criterion-referenced testing has been the emergence of item response theory (IRT) (see, for example, Hambleton, 1989; Hambleton & Swaminathan, 1985; Lord, 1980). This theory postulates the existence of a set of unobservable traits, or abilities, that influence item performance, and introduces monotonic expressions, called item characteristic curves, that link item performance to the ability or abilities that the test measures. IRT overcomes a number of shortcomings with classical test theory and, therefore, IRT is potentially more useful for providing a measurement theory for criterion-referenced tests. Interestingly, applications of IRT models to test score equating, adaptive testing, and the study of item bias are well known. Far less seems to be known about applications of IRT to CRT measurement problems.

Although classical true-score theory has provided an adequate framework for obtaining many important results for criterion-referenced testing (see, for example, Hambleton, 1980; van der Linden, 1982), it has a number of shortcomings. For example, classical item parameters, such as difficulty and discrimination, are dependent on the sample of examinees in which they are obtained. An item which appears to be easy when administered to a very able group of examinees will have a higher level of difficulty in a group of less able examinees. If the group is relatively homogeneous with respect to ability, item-total score correlations and hence item discrimination indices will be lower than those estimates obtained in a more heterogeneous group. Such problems arise regularly when large sets of criterion-referenced test items are being field-tested by large school districts, state departments of education, and test publishers. It is nearly impossible to insure that (near) equivalent samples of examinees are administered the various forms of the tests. Similarly, examinee scores depend on the sample of items which constitute the test. Given an easy test, an examinee will gain a higher score than if the same examinee is given a more difficult test.
If the test is very easy or very difficult for an examinee, test scores are imprecise and biased estimates of examinee ability result; yet if the test is ‘tailored’ to the examinee’s ability level, the score obtained is not comparable to the scores of other examinees taking different items nor does the score provide an unbiased estimate of an examinee’s level of proficiency in the content domain of interest. In addition, the degree of precision in a given ability estimate cannot be calculated; errors of measurement can be estimated only for the group as a whole. Item response theory, an increasingly popular alternative to classical test theory, offers the potential to overcome the shortcomings of classical measurement procedures, if, of course, specific IRT models can be found to fit the
test data. IRT models are based upon strong assumptions, and therefore one of the keys to successful applications of IRT models involves the extent of model-test data fit. The various features that make IRT models attractive for solving a number of measurement problems are also important with criterion-referenced tests:

(1) Invariant item statistics are especially useful in item banks, which are common with criterion-referenced tests. The characteristics of the various samples of examinees can be taken into account in item parameter estimation.
(2) Invariant ability statistics permit, for example, examinees to be administered nonrepresentative sets of test items which optimally discriminate in the region of the cut-off score while, at the same time, using the 'test characteristic curve' obtained for the item bank (or any part of the bank which is of interest), criterion-referenced test scores can be obtained.
(3) Item information functions, which provide data concerning the utility of items for assessing ability along the ability continuum, can be produced from the item statistics. Item information functions provide a new tool for test developers.
(4) Estimates of measurement error for each examinee are available. In addition, predictions can be made of examinee performance on items, subtests, or tests that were not administered but which are a part of the item pool. This feature, for example, is used in 'customized testing'.
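For reference in what follows, the item characteristic curve of the three-parameter logistic model (the most general of the models compared later in this chapter) has the familiar form below, where a_i, b_i, and c_i are the item discrimination, difficulty, and pseudo-guessing parameters and D is the usual scaling constant. Setting c_i = 0 yields the two-parameter model, and additionally fixing a_i at a common value yields the one-parameter model.

```latex
P_i(\theta) \;=\; c_i \;+\; (1 - c_i)\,
  \frac{1}{1 + \exp\!\bigl[-D a_i(\theta - b_i)\bigr]},
\qquad D = 1.7 .
```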
Since about 1975, faculty members and graduate students in the Laboratory of Psychometric and Evaluative Research at the University of Massachusetts at Amherst have addressed a number of technical problems in the areas of criterion-referenced testing and item response theory separately, and in recent years, the use of IRT models to address criterion-referenced measurement problems. The purpose of the remainder of this chapter is to focus attention on some of our research at the University of Massachusetts to investigate the usefulness of IRT models for solving criterion-referenced measurement problems. Attention will be focused on four areas: goodness-of-fit, determination of test length, item selection, and score reporting.
Goodness-of-Fit Studies
It is well known that the advantages derived from the applications of IRT models can only be obtained when there is a reasonable fit between the model and the test data. We have conducted numerous studies concerning the methodology of goodness-of-fit studies (see, for example, Hambleton, 1989; Hambleton & Murray, 1983), as well as goodness-of-fit studies with various criterion-referenced tests. Our view has been that no single piece of evidence is ever sufficient to justify the use of a particular model. Rather, multiple pieces of evidence are required for a comprehensive review of model suitability, and then a judgment about the suitability of the chosen model for a particular application can be made. Fitting several IRT models to the same test data and then judging the practical significance of the difference in the results is especially helpful information when choosing a model. For example, Table 2.1 shows the standardized residuals (see Hambleton & Swaminathan, 1985) that were obtained from fitting the one-, two-, and three-parameter logistic test models to 400 9th grade reading items.
Table 2.1
Distribution of Absolute-Valued Standardized Residuals*

                           Standardized Residuals
Model                 0 to 1    1 to 2    2 to 3    Over 3
One-parameter          68.4%     16.6%      8.0%      7.0%
Two-parameter          69.9%     23.3%      6.3%      0.4%
Three-parameter        68.6%     26.0%      4.8%      0.7%

*Results from fitting each model to 400 9th grade reading items.
In this particular example, the differences in fit among the models were modest, but the three-parameter model did fit the data better than the other two.

The determination of how well an IRT model fits a set of test data can be addressed by collecting three types of evidence:

(a) Determine if the test data satisfy the assumptions of the test model of interest.
(b) Determine if the expected advantages derived from the use of an item response model (for example, invariant item and ability estimates) are obtained.
(c) Determine the closeness of the fit between predictions of observable outcomes (for example, test score distributions) utilizing model parameter estimates and the test data.
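As an illustration of evidence type (c), the sketch below shows one common way standardized residuals of the kind summarized in Table 2.1 are computed: examinees are sorted into ability intervals, and within each interval the observed proportion correct on an item is compared with the proportion predicted by the fitted model. The function names, the use of ten equal-count intervals, and the illustrative item parameters are our own assumptions, not the exact procedure behind Table 2.1.

```python
import numpy as np

def icc_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def standardized_residuals(theta_hat, responses, a, b, c, n_groups=10):
    """Standardized residuals for one item across ability groups.

    theta_hat : ability estimates for N examinees
    responses : 0/1 item responses for the same N examinees
    """
    # Form equal-count ability groups from the estimated abilities.
    order = np.argsort(theta_hat)
    groups = np.array_split(order, n_groups)
    residuals = []
    for g in groups:
        observed = responses[g].mean()                    # observed proportion correct
        expected = icc_3pl(theta_hat[g], a, b, c).mean()  # model-predicted proportion
        se = np.sqrt(expected * (1.0 - expected) / len(g))  # binomial standard error
        residuals.append((observed - expected) / se)
    return np.array(residuals)

# Example: percent of |residuals| in the intervals reported in Table 2.1.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = rng.normal(size=2000)
    a, b, c = 1.1, 0.2, 0.18                 # illustrative item parameters
    y = (rng.random(2000) < icc_3pl(theta, a, b, c)).astype(int)
    sr = np.abs(standardized_residuals(theta, y, a, b, c))
    counts, _ = np.histogram(sr, bins=[0, 1, 2, 3, np.inf])
    print(dict(zip(["0-1", "1-2", "2-3", ">3"], counts / len(sr))))
```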
Hambleton and Swaminathan (1985) and Hambleton (1989) have organized the current goodness-of-fit literature to highlight methods that could be used to address each type of evidence. Table 2.2 provides an abridged version of their work. We have analyzed, to name only a few, criterion-referenced tests intended for classroom use such as the Individualized Criterion-Referenced Tests; year-end elementary and high school tests; state proficiency tests such as those used in New Mexico, New York, and Maryland; and certification and licensing exams in areas such as Family Practice, Nursing and Environmental Health. Some general findings can be noted: The one-parameter model has rarely provided a satisfactory fit to the test data; the three-parameter model nearly always has. The two-parameter model was appropriate when the test was on the easy side. In many of the studies, we have conducted an intensive study of model-data fit: looking for violations of model assumptions, checking invariance of parameter estimates, and studying residuals and practical consequences of model misfit (see, for example, Hambleton, 1989).

Figure 2.1 highlights one of the analyses we regularly carry out to assess the unidimensionality assumption (see Drasgow & Lissak, 1983). One hundred 9th grade reading items were factor analyzed (tetrachoric correlations in the diagonal) and then a plot of the resulting eigenvalues is produced. To determine the number of significant factors needed to account for the intercorrelations, the plot is compared to a plot obtained from factor analyzing test data derived from some random process. Eigenvalues obtained with the real data which exceed those obtained from the random data are assumed to be associated with significant factors. The data in Figure 2.1 highlight the fact that there is one dominant factor and possibly a second factor affecting test performance.

[Figure 2.1. Plot of eigenvalues for real and random test data.]
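The eigenvalue comparison in Figure 2.1 can be sketched roughly as follows. For simplicity this sketch uses ordinary Pearson correlations of the 0/1 item scores and a baseline of independent items with matched difficulties; our own analyses used tetrachoric correlations, and Drasgow and Lissak's (1983) modified parallel analysis generates the comparison data from a fitted unidimensional model. The logic of comparing real-data eigenvalues against random-data eigenvalues is the same.

```python
import numpy as np

def eigenvalues_of_item_correlations(data):
    """Eigenvalues (descending) of the inter-item correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

def random_baseline(data, rng):
    """Random 0/1 data with the same item difficulties but no common factor."""
    n_persons, n_items = data.shape
    p = data.mean(axis=0)                       # observed proportion correct per item
    return (rng.random((n_persons, n_items)) < p).astype(int)

def parallel_analysis(data, seed=0):
    rng = np.random.default_rng(seed)
    real = eigenvalues_of_item_correlations(data)
    rand = eigenvalues_of_item_correlations(random_baseline(data, rng))
    # Factors whose real-data eigenvalues exceed the random baseline are taken
    # to be meaningful; one dominant factor suggests unidimensionality.
    return real, rand, int(np.sum(real > rand))

# Usage: item_scores is an (examinees x items) 0/1 matrix.
# real, rand, n_factors = parallel_analysis(item_scores)
```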
Table 2.2
Approaches for Assessing Model Fit*

1. Determine if the model assumptions are met.
   a. Unidimensionality
      - eigenvalue plots with baseline plots obtained from random data (e.g., Drasgow & Lissak, 1983)
      - non-linear factor analysis (McDonald, 1981)
      - IRT factor analysis (Bock, Gibbons, & Muraki, 1988)
      - Bejar's method (Bejar, 1980)
   b. Equal Discrimination
      - compare item point-biserial correlations
   c. Minimal Guessing
      - consider item format and test difficulty
      - investigate performance of low-ability candidates on the hardest questions
2. Determine if the expected advantages are obtained.
   a. Invariance of Abilities
      - compare performance of examinees on different sets of items from the pool of items over which the ability scale is defined (e.g., Wright, 1968)
   b. Invariance of Item Parameters
      - compare item statistics obtained in two or more groups (e.g., high and low performing groups). When racial, sex, or ethnic groups are used, studies are called item bias investigations (e.g., Shepard, Camilli, & Williams, 1984)
3. Determine the closeness of the fit between the IRT model and the test data.
   - investigations of residuals and standardized residuals (Hambleton & Murray, 1983; Ludlow, 1986)
   - comparisons between actual and predicted test score distributions
   - investigations of threats to validity such as item placement (e.g., Kingston & Dorans, 1984), practice effects, speededness, recency of instruction (Cook, Eignor, & Taft, 1988), and cognitive processing variables (Tatsuoka, 1987)
   - applications of statistical tests (see Hambleton & Rogers, in press, for a review)
   - investigations of model robustness (e.g., Ansley & Forsyth, 1985)

*This table highlights only a sample of the approaches that are in current use.
Figure 2.2 provides an example of the type of item parameter invariance study we have looked at: item parameter invariance over ability levels, in this case. The plots in Figure 2.2 are different, suggesting that item parameter invariance is not present. This result normally highlights a problem with the model, or with the model parameter estimates. In this instance, a closer look at our results revealed that the item parameters were poorly estimated in the two extreme groups. As Lord (1980) stressed, in theory the item and ability parameters are invariant over samples; however, the estimates are not. Good parameter estimates can only be obtained with a heterogeneous sample of examinees.

[Figure 2.2. Example of a study of item parameter invariance over groups of different ability levels: item difficulty (b) values estimated in a random sample plotted against b values estimated in a low-ability group.]
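A minimal sketch of the kind of invariance check summarized in Figure 2.2: item difficulty estimates obtained in two groups are placed on a common metric with a simple mean-sigma transformation and then compared. The linking step and the summary statistics are our own illustrative choices; in practice we also examine the scatterplots directly, as in Figure 2.2.

```python
import numpy as np

def mean_sigma_link(b_from, b_to):
    """Linearly rescale difficulty estimates b_from onto the metric of b_to."""
    slope = np.std(b_to) / np.std(b_from)
    intercept = np.mean(b_to) - slope * np.mean(b_from)
    return slope * np.asarray(b_from) + intercept

def invariance_check(b_group1, b_group2):
    """Compare item difficulty estimates from two independent calibrations."""
    b1 = np.asarray(b_group1)
    b2_linked = mean_sigma_link(b_group2, b1)
    r = np.corrcoef(b1, b2_linked)[0, 1]            # near 1.0 if invariance holds
    rmsd = np.sqrt(np.mean((b1 - b2_linked) ** 2))  # typical size of discrepancies
    return r, rmsd

# Usage: b_low and b_high are difficulty estimates for the same items,
# calibrated separately in low- and high-ability samples.
# r, rmsd = invariance_check(b_low, b_high)
```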
Determination of Test Length
The technical problem of determining the length of a criterion-referenced test or of determining the number of test items to measure each objective in the test, is important because the answer directly relates to the usefulness of the criterion-referenced test scores obtained from the test. Short tests often result in imprecise domain score estimates and lead to mastery classifications that are (1) inconsistent across parallel-form administration
(or retest administrations) and (2) not indicative of the true mastery states of examinees. In sum, mastery classifications based on scores obtained from short tests are often both unreliable and invalid. Therefore, criterion-referenced test scores obtained from short tests often have limited value. Classical methods suffer, for example, because they cannot provide a connection between test length and decision-consistency. In view of the shortcomings of several of the classical test length determination methods, Hambleton, Mills, and Simon (1983) offered an approach using computer simulation
methods, and concepts and models from the field of item response theory. There are three principal advantages of their approach besides the fact that there is substantial evidence to show that IRT models fit CRT data well: (1) a variety of item pools to reflect item pools obtained in practice can be described quickly and easily; (2) one of two item selection methods can be used to build tests; and (3) examinee performance consistent with the compound binomial test model can be simulated. Of course, the appropriateness of the test lengths identified by the Hambleton et al. method will depend upon the match between the situations simulated and the actual performance of the desired examinee population. One way that reality can be attended to is to use the actual item parameter estimates for the item pool being used in test development. Instead of simulating the parameter values of the items, the actual item parameter estimates, if they are known, can be used in the computer simulation program.

The Hambleton method utilized five factors which impact on the reliability and validity of binary decisions (e.g., identifying examinees as masters and nonmasters) emanating from the use of criterion-referenced test scores:

(1) Test length.
(2) Statistical characteristics of the item pool.
(3) Method of item selection.
(4) Choice of cut-off score.
(5) Domain score distribution.
Their method is implemented with TESTLEN (Mills & Simon, 1981), a computer program prepared to simulate examinee criterion-referenced test scores via the use of item response theory. In this program, examinee item response data are simulated using the logistic test model selected by the user. TESTLEN provides the user with the option to manipulate the five factors mentioned above (and several others):

(1) Test length. Test lengths of any size can be requested.
(2) Statistical characteristics of the item pool. The user must describe the statistical characteristics of the items in the pool from which items will be drawn. The statistical information is often obtained (and updated) from the item statistics obtained in the field test administrations.
(3) Method of item selection. It is common to build parallel forms of a criterion-referenced test by drawing two random samples of test items from a pool. This 'randomly parallel' method is unlikely to work well when the test of interest is short and the item pool is heterogeneous. In a second method, the first form of a test is formed by drawing items randomly from the pool, and the second form is constructed by matching items statistically to the corresponding items in the first form. Both methods for selecting test items are available in the simulation program.
(4) Choice of cut-off score. Any cut-off score can be selected.
(5) Domain score distribution. Even short tests can be effective when the distribution of examinee domain scores is not too close to the chosen cut-off score. The shape and characteristics of the specified domain score distribution in relation to the chosen cut-off score and advancement score have a substantial influence on the test lengths needed to achieve desirable levels for reliability and validity indices.
The computer simulation program requires the user to provide the following information:

(1) the test length of interest and the item pool size;
(2) the statistical characteristics of the item pool;
(3) the preferred method of item selection;
(4) the cut-off score of interest in relation to the item pool;
(5) an expected distribution of domain scores for the examinee population of interest in the content domain of interest;
(6) the number of examinees; and
(7) the number of replications of interest to determine the standard errors associated with the reliability estimates for the particular set of conditions.
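To make the logic of such a simulation concrete, the sketch below mimics, in highly simplified form, what a program of this kind does: item responses are generated from a logistic model for an assumed item pool and ability (domain score) distribution, two forms are built by random selection, and decision consistency is the proportion of simulated examinees classified the same way (master or nonmaster) by both forms. All parameter values and function names here are illustrative assumptions, not the actual TESTLEN implementation.

```python
import numpy as np

def icc_3pl(theta, a, b, c, D=1.7):
    """3PL probabilities for every examinee (rows) and item (columns)."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * np.subtract.outer(theta, b)))

def decision_consistency(test_length, pool_size=200, n_examinees=2000,
                         cutoff=0.70, n_replications=10, seed=0):
    """Estimate decision consistency for randomly parallel forms of a given length."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_replications):
        # Illustrative, moderately heterogeneous item pool.
        a = rng.uniform(0.6, 1.6, pool_size)
        b = rng.normal(0.0, 1.0, pool_size)
        c = rng.uniform(0.0, 0.25, pool_size)
        theta = rng.normal(0.0, 1.0, n_examinees)   # assumed ability distribution

        decisions = []
        for _form in range(2):                      # two randomly parallel forms
            items = rng.choice(pool_size, test_length, replace=False)
            p = icc_3pl(theta, a[items], b[items], c[items])
            scores = (rng.random(p.shape) < p).sum(axis=1) / test_length
            decisions.append(scores >= cutoff)      # master / nonmaster decision
        rates.append(np.mean(decisions[0] == decisions[1]))
    return float(np.mean(rates)), float(np.std(rates))

# Example: how does decision consistency grow with test length?
# for n in (5, 10, 20, 30):
#     print(n, decision_consistency(n))
```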
Hambleton et al. (1983) addressed these two questions using TESTLEN:

(1) What is the impact of item pool heterogeneity on the choice of desirable test lengths?
(2) What is the impact of the choice of item selection method on decision consistency for tests of different lengths drawn from item pools which differ substantially in item heterogeneity?
Many other relevant questions can be addressed to aid criterion-referenced test developers in selecting their test lengths. Figure 2.3 provides an example of the output from TESTLEN that can be obtained to guide test developers in the selection of test lengths. For the particular situation simulated, Figure 2.3 highlights the increase in reliability (as assessed by decision consistency) due to increasing test length from 5 to 30 test items. Two methods of constructing parallel forms were compared: strictly parallel and random. However, any of the relevant variables may be changed (e.g., ability score distributions, cut-off scores, nature of the item pool) and new results can be produced. Graphs like those in Figure 2.3 are invaluable to test developers. Certainly other psychometric models could be used to simulate the test lengths and produce results (see, for example, Wilcox, 1980). On the other hand, IRT provides a general framework for solving the test length problem, and there is substantial evidence that logistic models fit CRT data.

[Figure 2.3. Decision consistency as a function of test length.]

Item Selection

The common method for selecting test items for criterion-referenced tests is straightforward: First, a test length is determined and then a random (or stratified random) sample of test items is drawn from the pool of acceptable (valid) test items measuring the content domain of interest. Random (or stratified random) selection of test items is a satisfactory method when an item pool is statistically homogeneous because, for all practical purposes, the test items are interchangeable. When items are similar statistically, the particular choice of these items will have only a minimal impact on the decision validity of the test scores. But, when item pools are statistically heterogeneous (as they often are), a random selection of test items may be far from optimal for separating examinees into mastery states.
When constructing tests to separate examinees into two or more mastery states in relation to a content domain of interest, it is desirable to select items that are most discriminating within the region of the cut-off score (Hambleton & de Gruijter, 1983). Classical methods do not provide a solution; that is, there is no way classical item statistics can be used to optimally select test items. Item response models, in theory, provide a useful solution because these models can be used to place items, persons, and cut-off scores on the same measuring scale.

Hambleton and de Gruijter (1983) and de Gruijter and Hambleton (1983) conducted investigations of the effects of optimal item selection on the decision-making accuracy of a test when the intended cut-off score for the test is known in advance of the test development. To provide a baseline for interpreting the results, tests were also constructed by selecting test items on a random basis. Figure 2.4 shows a comparison of decision accuracy with two (similar) optimally constructed criterion-referenced tests and a randomly constructed test for tests of length 8 to 20 items. The results were obtained using computer simulation methods. Error rates (probabilities of misclassification) were nearly double with the randomly constructed test. In passing, it should be noted that the use of random item selection procedures to date has been common in criterion-referenced test construction. Another way to highlight the advantage of optimal item selection based upon item response model principles and methods is in terms of how long the test constructed using random item selection would need to be to produce similar results to those obtained through optimal item selection. Using an acceptable error rate of (say) .20, the optimal item selection method produced this result with 8 items or so. The random item selection method needed somewhere between 15 and 20 items (or about twice as many).

[Figure 2.4. Error probabilities as a function of test length.]

Figure 2.5 (from a paper by Hambleton, Arrasmith, & Smith, 1987) provides an example of test information functions corresponding to tests constructed using four different item selection methods.
Random and classical were based on standard procedures: Random involved the random selection of items from the bank, and classical involved the selection of items with moderate p values (.40 to .90) and as high r values as possible. Optimal and content-optimal methods involved the use of item information functions. With optimal item selection, items were drawn to provide maximum information at a particular cut-off score. Three cut-off scores on the test score scale were studied: 65%, 70% and 75%. Corresponding cut-off scores on the ability scale were set using the test characteristic curve defined over the total item bank. The content-optimal method involved optimal item selection with a restriction that the final test should match a set of content specifications.

[Figure 2.5. Test information functions for the 20-item tests (n = 249). Key: 1 = Random, 2 = Optimal, 3 = Optimal-Content, 4 = Classical; cut-off score = 70%.]

The results in Figure 2.5 highlight clearly the advantage of optimal and content-optimal item selection methods. These tests provided nearly twice as much information in the region of interest near the cut-off score. Table 2.3 provides a summary of the decision accuracy results for the total sample of candidates near the cut-off score. Results in the table highlight the actual decision accuracy results for the various exams and cut-off scores. Though the gains in decision accuracy with optimal and content-optimal item selection methods with the real data were modest in size over the other two methods (they ranged from 1 to 16%), nevertheless they are of practical significance. These improved results were obtained without any increase in exam length. Any increases, however slight, as long as they do not involve major new test development expenses or an excessive amount of time, would seem worthy of serious consideration by certification boards in view of the desirability of increasing the decision accuracy (i.e., validity) of their exams. Improved decision accuracy resulted with the nonmasters groups, in part because on the average these groups were closer to the cut-off scores. Second, rather sizeable increases in exam length with the random and classical methods would be required to obtain even 3% to 4% increases in decision accuracy. Using the real data and the random method, the levels of decision accuracy as a function of exam length for the three cut-off scores were calculated. Even a gain in decision accuracy of 4% would require an exam constructed with the random method which would be nearly double in length! Thus, small gains in decision accuracy through the use of optimally selected items based upon the use of IRT models corresponded to rather large changes in the effective exam length.
Table 2.3
Decision Accuracy Results (Criterion Test = 249 items; Constrained Sample)

                                      Nonmasters                   Masters
Cut-off Score  Method              N     Fail     Pass          N     Fail     Pass     Overall Accuracy*
65%            Random             79    54.6%    45.4%        178    40.0%    60.0%         58.4%
               Optimal            79    68.5%    31.5%        178    35.8%    64.2%         65.6%
               Content-Optimal    79    70.4%    29.6%        178    33.7%    66.2%         67.5%
               Classical          79    59.3%    40.7%        178    34.2%    65.8%         63.8%
70%            Random            437    62.2%    37.8%        268    38.6%    61.4%         65.3%
               Optimal           437    69.3%    30.7%        268    30.2%    69.8%         69.0%
               Content-Optimal   437    67.1%    32.9%        268    30.5%    69.5%         67.9%
               Classical         437    61.6%    38.4%        268    31.9%    68.1%         66.7%
75%            Random            507    60.2%    39.8%        307    40.4%    59.6%         59.8%
               Optimal           507    73.7%    26.3%        307    35.7%    64.3%         68.2%
               Content-Optimal   507    73.1%    26.9%        307    24.6%    65.4%         68.6%
               Classical         507    67.0%    33.0%        307    36.3%    63.7%         65.2%

*Overall Accuracy is the percent of masters who pass and nonmasters who fail in the constrained samples for the 20-item exams.
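A sketch of the optimal selection idea described above: compute each item's information at the ability value corresponding to the cut-off score and take the most informative items (subject, in the content-optimal variant, to content constraints, which are omitted here). The item information expression is the standard one for the three-parameter logistic model; the greedy selection rule and function names are our own simplification, not the exact procedure used in the studies cited.

```python
import numpy as np

def item_information_3pl(theta, a, b, c, D=1.7):
    """Fisher information of a 3PL item at ability theta."""
    L = D * a * (theta - b)
    p2 = 1.0 / (1.0 + np.exp(-L))          # 2PL part of the curve
    p = c + (1.0 - c) * p2                 # full 3PL probability of success
    dp = (1.0 - c) * D * a * p2 * (1.0 - p2)   # slope of the ICC
    return dp ** 2 / (p * (1.0 - p))

def select_optimal_items(a, b, c, theta_cut, n_items):
    """Pick the items with the largest information at the cut-off ability."""
    info = item_information_3pl(theta_cut, np.asarray(a), np.asarray(b), np.asarray(c))
    return np.argsort(info)[::-1][:n_items]

def test_information(theta, a, b, c, items):
    """Test information function: sum of item informations for the selected items."""
    a, b, c = np.asarray(a), np.asarray(b), np.asarray(c)
    return sum(item_information_3pl(theta, a[i], b[i], c[i]) for i in items)

# Usage: given pool parameters a, b, c and a cut-off score mapped to theta_cut
# via the pool's test characteristic curve:
# chosen = select_optimal_items(a, b, c, theta_cut=0.4, n_items=20)
```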
Score Reporting

Improvements in the reporting of scores are another of the important advantages of IRT models. The IRT framework provides a basis for estimating errors of measurement for each test score, so that confidence bands used in reporting scores will be more accurate. Second, IRT-produced scales and associated concepts, such as test characteristic curves, provide a basis for predicting performance on any set of items of interest from the bank that was used in producing the original scale. This feature is currently being used in 'customized testing'. Here, both norm-referenced and criterion-referenced test items appear in the same item bank and are referenced to a common ability scale. School districts can build criterion-referenced tests to address their instructional needs as well as use the overall ability score obtained from the test to predict performance on those items in the norm-referenced test. Hambleton and Martois (1983) found that predictions of norm-referenced test score performances were quite accurate even when the criterion-referenced test was relatively harder or easier than the norm-referenced test. This study was carried out with reasonably large samples at two grade levels. On the other hand, there are some cautions about using customized tests (see, for example, Yen, Green, & Burket, 1987) that center on the viability of the invariance assumption over examinees from different instructional groups.

A third advantage of IRT reporting systems is that meaningful anchors can be placed on the scale to help in the score interpretations. The National Assessment of Educational Progress, the California Assessment Program, and Degrees of Reading Power are three excellent examples of tests with meaningful scales aided by IRT. Figure 2.6 comes from some of our research where the test characteristic curves for sets of items measuring several objectives are reported. With an ability level for an examinee in hand, the approximate performance levels on the objectives are easily seen. Such information can be invaluable to teachers. If cut-off scores are set on the test score scale, and with knowledge of a student's ability level and the characteristic curves for each objective, an estimate of the mastery state on each objective can be made for a student.

[Figure 2.6. Characteristic curves for selected objectives.]

With scores obtained using item response theory, a number of reporting options become available. First, the scores can be scaled to a metric which is easily interpreted. For instance, both NAEP and the California Assessment Program (CAP) report scores on a scale from 0 to 500, with the mean set at 250 and a standard deviation of 50 (Bock & Mislevy, 1988). Each school or district (or any subgroups of students of interest) will have a scaled score on each curricular objective assessed. Along with these scores, an associated standard error can be computed, another advantage of scores derived using IRT. The standard error can be used to establish a confidence interval around the estimate of attainment or proficiency. The report for a given school (or subgroups of interest), then, might graphically display the school's score and confidence band for each objective, along with comparison bands based on state or national scores (Bock & Mislevy, 1988). The relative strengths and weaknesses of the school could be determined by identifying those objectives for which the school's scores were significantly above and below its overall mean. This kind of reporting combines both norm-referenced and criterion-referenced information.
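Two of the reporting devices just described are easy to make concrete: the linear rescaling of ability estimates to a reporting metric with mean 250 and standard deviation 50, and the use of a test characteristic curve to predict an examinee's expected proportion-correct score on any set of items (an objective, a norm-referenced form, and so on) calibrated to the same scale. The sketch below assumes the item parameters are already on a common scale; the reference population mean and standard deviation used in the rescaling, and the function names, are illustrative.

```python
import numpy as np

def icc_3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def scaled_score(theta, ref_mean=0.0, ref_sd=1.0, center=250.0, spread=50.0):
    """Rescale an ability estimate to a reporting scale (e.g., mean 250, SD 50)."""
    return center + spread * (theta - ref_mean) / ref_sd

def expected_proportion_correct(theta, a, b, c):
    """Test characteristic curve: predicted proportion correct on a set of items."""
    return float(np.mean(icc_3pl(theta, np.asarray(a), np.asarray(b), np.asarray(c))))

# Usage: predict performance on the items measuring one objective
# for a student with ability estimate theta_hat:
# pc = expected_proportion_correct(theta_hat, a_obj, b_obj, c_obj)
# print(scaled_score(theta_hat), round(100 * pc, 1), "% expected correct")
```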
Another method of reporting and interpreting scores is to reference them to test content. One of the most useful features of item response theory is that the ability or attainment estimates and the difficulty of test items are reported on the same scale. That
is, estimates of attainment can be interpreted in terms of the level of difficulty of items that can be answered correctly with a specified probability. Thus, a report of the relative performance of schools within a district, for instance, could display a continuum on which each school's scaled score was marked, along with the locations of various typical items (Bock & Mislevy, 1981). Bock and Mislevy (1981) suggest defining a 'mastery threshold' for items, representing the level of attainment required in order to have a specified probability of answering the item correctly. 'Benchmark' items would be those which have mastery thresholds at certain levels of attainment. For example, benchmarks such as 'basic', 'intermediate', and 'advanced' could be used as reference points (Bock & Mislevy, 1981). From this report it could be seen what level of attainment each school had reached in terms of specific content. These reports could then be used to guide future curriculum development.

Figure 2.6 could show, for example, the distributions of abilities for the three groups of interest (students from three different schools, or, say, for high, middle, and low ability groups) and the location of the 25th, 50th, and 75th percentiles in each distribution. Shown over the same ability scale could be the item characteristic curves for a set of exercises (or test items). These items may be chosen to serve as 'benchmarks'. They are items with some special characteristics, such as covering content that should be mastered by all examinees. It is then possible to determine the percent of examinees at each quartile point in each distribution who could answer the test items correctly. The same information (and other variations) could also be reported in easy-to-read tables along with standard errors, sample sizes, detailed content specifications for the test items, etc. Alternatively, points along the scale could be selected (e.g., 150, 200, 250, 300, 350) and given meaning through descriptions of the types of items which candidates can 'handle' (i.e., answer correctly with a specified probability, say 80%). The percent of candidates in various subgroups performing above the point (i.e., candidates who have 'mastered' the content
at that point) could easily be calculated and reported. Clearly the use of item response theory opens up numerous possibilities with regard to the reporting of scores. Schools or school districts can be described by comparison with themselves over time, with other schools and districts, and in terms of their level of mastery of objectives. In one of our current projects we are attempting to catalog the unique features of IRT score reporting for individuals and groups, and for items and test scores, as well as provide sample reports to guide the reporting process.
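The 'mastery threshold' idea can be written down directly: for a 3PL item, the ability at which the probability of a correct response equals a chosen mastery level p (say .80) is obtained by inverting the item characteristic curve given earlier. The threshold formula below follows from that form; the choice of p = .80 and the example item parameters are only illustrations.

```python
import math

def mastery_threshold(a, b, c, p=0.80, D=1.7):
    """Ability at which a 3PL item is answered correctly with probability p."""
    if p <= c:
        raise ValueError("Target probability must exceed the pseudo-guessing level c.")
    # Invert p = c + (1 - c) / (1 + exp(-D a (theta - b))) for theta.
    return b + math.log((p - c) / (1.0 - p)) / (D * a)

# Example: a benchmark item with a = 1.2, b = 0.5, c = 0.2 requires
# theta of about 0.5 + log(0.6/0.2)/(1.7*1.2) = 1.04 to be answered
# correctly 80% of the time.
# print(mastery_threshold(1.2, 0.5, 0.2))
```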
Conclusions

Based upon our research to date and the research of others, we believe that IRT models have contributed to the development of sound measurement practices for criterion-referenced tests. Our concluding points are as follows:

(1) There is substantial evidence to show that one-dimensional IRT models (with 2 or 3 parameters) do fit a wide range of CRTs. And there are methods to detect nonfit when such is the case.
(2) IRT models have provided a basis for addressing the CRT length problem, a problem that does not have an analytic solution because of the loss functions that are important. There are other non-IRT models that can be used, but IRT models have been tried and useful results have been obtained.
(3) IRT models provide an excellent basis for item selection, especially when cut-off scores are used and known prior to test development. Classical methods simply cannot do the job without a great deal of difficulty. The IRT feature of placing items, persons, and cut-off scores on a common scale is invaluable.
(4) IRT models have contributed valuable new information to the reporting of test scores and related information.
(5) Other technical problems such as item bias and test equating can also be addressed nicely by IRT models, though we did not address these areas in our chapter.
Having made some very positive comments about IRT models in our CRT research, we would be remiss if we did not conclude with some cautionary notes. Computer programs are still not readily available, user friendly, or always going to provide the desired output. Bugs in the computer programs are an on-going problem, and computer time can be excessive. Sample sizes are sometimes prohibitive, too. No one will give you a lower bound estimate for a sample-size to jointly estimate ability and three item parameters, but it must surely be several hundred or more, unless Bayesian procedures are used. And Bayesian procedures are not without their own problems. Our personal worry centers on the viability of the unidimensionality assumption for calibrating items at different points in instruction. This invariance property must not only hold for variables such as race and sex, but also different approaches to instruction and at different times in the instructional process. If the property does not hold to a reasonable degree, item banks and associated IRT item statistics will have limited usefulness in achievement test construction and test score reporting. All in all, we believe it has been demonstrated that IRT models can be helpful in building successful criterion-referenced testing programs. It now remains to see how
successful the applications are when they are carried out away from a research environment and away from persons with a high level of IRT and measurement experience.
References

Ansley, T. N., & Forsyth, R. A. (1985). An examination of the characteristics of unidimensional IRT parameter estimates derived from two-dimensional data. Applied Psychological Measurement, 9, 37-48.
Bejar, I. I. (1980). A procedure for investigating the unidimensionality of achievement tests based on item parameter estimates. Journal of Educational Measurement, 17, 283-296.
Berk, R. A. (1984). A guide to criterion-referenced test construction. Baltimore, MD: The Johns Hopkins University Press.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of an EM algorithm. Psychometrika, 46, 443-459.
Bock, R. D., Gibbons, R. D., & Muraki, E. (1988). Full-information item factor analysis. Applied Psychological Measurement, 12, 261-280.
Bock, R. D., & Mislevy, R. J. (1981). An item response curve model for matrix-sampling data: The California grade three assessment. In D. Carlson (Ed.), New directions for testing and measurement: Testing in the States - beyond accountability (No. 10, pp. 65-90). San Francisco, CA: Jossey-Bass.
Bock, R. D., & Mislevy, R. J. (1988). Comprehensive educational assessment for the States: The duplex design. Educational Evaluation and Policy Analysis, 10.
Cook, L. L., Eignor, D. R., & Taft, H. L. (1988). A comparative study of the effects of recency of instruction on the stability of IRT and conventional item parameter estimates. Journal of Educational Measurement, 25, 31-45.
de Gruijter, D. N. M., & Hambleton, R. K. (1983). Using item response models in criterion-referenced test item selection. In R. K. Hambleton (Ed.), Applications of item response theory. Vancouver, BC: Educational Research Institute of British Columbia.
Drasgow, F., & Lissak, R. I. (1983). Modified parallel analysis: A procedure for examining the latent dimensionality of dichotomously scored item responses. Journal of Applied Psychology, 68, 363-373.
Hambleton, R. K. (Ed.) (1980). Contributions to criterion-referenced testing technology. Applied Psychological Measurement, 4, 421-581.
Hambleton, R. K. (1982). Advances in criterion-referenced testing technology. In C. Reynolds & T. Gutkin (Eds.), Handbook of school psychology. New York: Wiley.
Hambleton, R. K. (1984). Determining suitable test lengths. In R. Berk (Ed.), A guide to criterion-referenced test construction. Baltimore, MD: The Johns Hopkins University Press.
Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: Macmillan.
Hambleton, R. K., & de Gruijter, D. N. M. (1983). Application of item response models to criterion-referenced test item selection. Journal of Educational Measurement, 20, 355-367.
Hambleton, R. K., & Martois, J. (1983). Evaluation of a test score prediction system based upon item response model principles and procedures. In R. K. Hambleton (Ed.), Applications of item response theory. Vancouver, BC: Educational Research Institute of British Columbia.
Hambleton, R. K., Mills, C. N., & Simon, R. (1983). Determining the lengths for criterion-referenced tests. Journal of Educational Measurement, 20, 27-38.
Hambleton, R. K., & Murray, L. (1983). Some goodness of fit investigations for item response models. In R. K. Hambleton (Ed.), Applications of item response theory. Vancouver, BC: Educational Research Institute of British Columbia.
Hambleton, R. K., & Rogers, H. J. (in press). Promising directions for assessing item response model fit to test data. Applied Psychological Measurement.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer Academic Publishers.
Jaeger, R. (1989). Certification of student competence. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: Macmillan.
Jaeger, R., & Tittle, C. (Eds.) (1980). Minimum competency achievement testing: Motives, models, measures, and consequences. Berkeley, CA: McCutchan.
Kingston, N. M., & Dorans, N. J. (1984). Item location effects and their implications for IRT equating and adaptive testing. Applied Psychological Measurement, 8, 147-154.
Linn, R. L., Levine, M. V., Hastings, C. N., & Wardrop, J. L. (1981). An investigation of item bias in a test of reading comprehension. Applied Psychological Measurement, 5, 159-173.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Ludlow, L. H. (1986). On the graphical analysis of item response theory residuals. Applied Psychological Measurement, 10, 217-220.
McDonald, R. P. (1981). The dimensionality of tests and items. British Journal of Mathematical and Statistical Psychology, 34, 100-117.
Mills, C. N., & Simon, R. (1981). A method for determining the length of criterion-referenced tests using reliability and validity indices (Laboratory of Psychometric and Evaluative Research Report No. 110). Amherst, MA: School of Education, University of Massachusetts.
Shepard, L. A., Camilli, G., & Averill, M. (1981). Comparison of procedures for detecting test item bias with both internal and external ability criteria. Journal of Educational Statistics, 6, 317-375.
Shepard, L. A., Camilli, G., & Williams, D. M. (1984). Accounting for statistical artifacts in item bias research. Journal of Educational Statistics, 9, 83-138.
Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.
Swaminathan, H., & Gifford, J. A. (1983). Estimation of parameters in the three-parameter latent trait model. In D. Weiss (Ed.), New horizons in testing. New York: Academic Press.
Swaminathan, H., & Gifford, J. A. (1985). Bayesian estimation in the two-parameter logistic model. Psychometrika, 50, 349-364.
Swaminathan, H., & Gifford, J. A. (1986). Bayesian estimation in the three-parameter logistic model. Psychometrika, 51, 589-601.
Tatsuoka, K. K. (1987). Validation of cognitive sensitivity for item response curves. Journal of Educational Measurement, 25, 31-45.
van der Linden, W. J. (Ed.) (1982). Aspects of criterion-referenced measurement. Evaluation in Education: An International Review Series, 5, 95-203.
Wilcox, R. (1980). Determining the length of a criterion-referenced test. Applied Psychological Measurement, 4, 425-446.
Wright, B. D. (1968). Sample-free test calibration and person measurement. Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, NJ: Educational Testing Service.
Yen, W. M., Green, D. R., & Burket, G. R. (1987). Valid normative information from customized achievement tests. Educational Measurement: Issues and Practice, 6, 7-13.
Biographies

Ronald K. Hambleton is Professor of Education and Psychology and Chairman of the Laboratory of Psychometric and Evaluative Research at the University of Massachusetts, Amherst. He received his Ph.D. in psychometric methods from the University of Toronto in 1969. His principal research interests are in the areas of criterion-referenced measurement and item response theory. His most recent publications are a book (co-authored with H. Swaminathan) entitled Item Response Theory: Principles and Applications and a book (co-edited with Jac Zaal) entitled New Developments in Testing: Theory and Applications. Currently, he serves as an Associate Editor for the Journal of Educational Statistics, is on the Editorial Boards of Applied Psychological Measurement and Evaluation and the Health Professions, and serves as the Vice-President of the International Test Commission and the National Council on Measurement in Education.

H. Jane Rogers is a fourth-year doctoral student at the University of Massachusetts, Amherst, where she is studying psychometric and statistical methods. She received a Master's degree from the University of New England, Australia, in 1983. Her Master's dissertation, a study of goodness-of-fit statistics for item response models, was recently published in Applied Psychological Measurement. She is currently studying applications of item response theory to criterion-referenced measurement problems and to the identification of biased test items. Other areas of research interest include Bayesian methods and multivariate statistics.