Classical Test Theory Wim J. van der Linden University of Twente, Enschede, The Netherlands
Glossary
Classical Model
classical test model Two-level model for observed scores on a test, one level for the score of a fixed person and another for a random person from a population, underlying all derivations in classical test theory.
measurement error Difference between an observed score and a true score.
observed score Random variable representing the test score of a person.
reliability Extent to which the true score agrees with the observed score in a population of persons, measured by the squared linear correlation coefficient between the two.
true score Expected value of an observed score over replications of the test for a fixed person.
validity The degree of success in using the observed scores on a test to predict the score on an independent criterion, measured by the linear correlation coefficient between the two.
Charles Spearman’s 1904 model of an observed score as a sum of a true score and an error seems simple, but it initially generated much controversy and confusion. Spearman viewed observed scores as random, whereas others rejected this interpretation. But even among those who accepted the idea of randomness, there was much confusion between the notions of randomness of an observed score over replications and randomness due to sampling of persons from a population. Another debate was on the nature of the true score. According to some, it represented a property existing independently of the observed score (the so-called platonic true score), whereas others had difficulty with such obscure properties and wanted a more formal definition. In 1966, Melvin Novick published an axiomatization of classical test theory (CTT) based on a minimal set of assumptions, in which CTT was treated as a theory for measurement error in a hierarchical experiment with two different levels: one level at which a random person is sampled from a population and another at which his/her observed score is obtained. Both levels involve separate model assumptions. Novick’s assumptions were used in collaboration with Frederic Lord in their 1968 treatment of CTT.
Classical test theory is a theory of measurement error. It is customary to consider the work of Charles Spearman, published in 1904, as the origin of this theory. In his paper, Spearman adopted the model of an observed score as a sum of a true score and an error, and showed how to correct the linear correlation between observed scores for their errors. This correction is now known as Spearman’s correction for attenuation. The name "classical test theory" is somewhat misleading in that it suggests a theory for test scores only. The theory applies to any type of measurement for which the notion of random variation across replications is meaningful. In fact, it shares its formal structure with most of the theory of error for physical measurements commonly used in the natural sciences, but lacks its notion of systematic error as well as the platonic interpretation of the true score associated with it.
Encyclopedia of Social Measurement, Volume 1 ©2005, Elsevier Inc. All Rights Reserved.
Model for a Fixed Person
Let $X_{jt}$ be the observed score of person $j$ on test $t$. This score is viewed as a random variable with a distribution over replications of the test for $j$ (sometimes referred to as the propensity distribution). For most physical variables, measurements can be replicated, even though in practice it is seldom possible to maintain exactly the same conditions. For most psychological properties, however, replications are usually hypothetical, because humans are able to remember and learn between measurements. For the majority of variables in the social sciences, the only realization of $X_{jt}$ is the single observed score at hand.
An observed score $X_{jt}$ can be used to define the following two new quantities:
$$\tau_{jt} = E(X_{jt}), \tag{1}$$
$$E_{jt} = X_{jt} - \tau_{jt}, \tag{2}$$
where $E(X_{jt})$ is the expected value of $X_{jt}$. The first definition is of the true score for examinee $j$ on test $t$, the second is of the error. Observe that $\tau_{jt}$ is fixed but $E_{jt}$ is random. The definitions of $\tau_{jt}$ and $E_{jt}$ are based on a convention. Nothing in the distribution of $X_{jt}$ forces us to have an interest in its mean instead of, for example, its median. The distribution of $E_{jt}$ follows directly from the arbitrary choice of true score.
The definitions in Eqs. (1) and (2) imply the following model for the score of a fixed examinee:
$$X_{jt} = \tau_{jt} + E_{jt}. \tag{3}$$
This model is nothing else than a representation of the two definitions in Eqs. (1) and (2). No separate assumption of linearity is made; the linearity in Eq. (3) follows entirely from the definition of $E_{jt}$ as a difference. Typically, $X_{jt}$ is defined on the item scores $U_i$, $i = 1, \ldots, n$, for the $n$ items in the test, often as a (weighted) sum of these scores. Eqs. (1) and (2) do not depend on this choice, however: the definitions of a true score and error in CTT hold for any definition of observed score.
Model for a Random Person
At this level, the test is still considered as fixed, but the person is obtained by random sampling from a population. An extension of CTT to the case of random sampling of a test from a pool of items also exists (see below). Let $J$ denote a random person sampled from the population and $T_{Jt}$ be his/her true score. This true score is also random. The observed score and error of a random person, $X_{Jt}$ and $E_{Jt}$, now are random due to two different sources: one source is the sampling of the person from the population, the other is the "sampling" of an observed score from his/her distribution. Due to sampling of persons, the model in Eq. (3) therefore becomes
$$X_{Jt} = T_{Jt} + E_{Jt}. \tag{4}$$
Again, the only additional assumption needed to extend the model in Eq. (3) to this model for a random person concerns the random status of its variables.
Summary
Let $F_{X_{jt}}(\cdot)$ be the distribution function of observed score $X_{jt}$ for a fixed person and $F_{T_{Jt}}(\cdot)$ be the distribution function of the true score for the population of persons. The two models in Eqs. (3) and (4) can be summarized as the following two-level model:
$$X_{jt} \sim F_{X_{jt}}(x; \tau_{jt}), \tag{5}$$
$$\tau_{Jt} \sim F_{T_{Jt}}(\tau). \tag{6}$$
It is important to note that both distribution functions are unknown. Also, the presence of the true score $\tau_{jt}$ in the argument of the function for $X_{jt}$ in Eq. (5) should not be taken to suggest that the observed-score distribution is a member of a parametric family indexed by it. It is necessary to know the distribution to identify this parameter, not the other way around. Additional assumptions that lead to known parametric families of distributions for Eqs. (5) and (6) are discussed in the context of stronger true-score theory (see later). The model in Eqs. (5) and (6) is equivalent to a one-way analysis-of-variance model with random person effects, although the latter is usually accompanied by the extra assumption of a normal distribution for $X_{Jt}$.
Observed scores are usually discrete and have a bounded range. The definition of the true score as an expected observed score implies the same range as for the observed score, but with real values. The error is also real valued, but its range runs between the maximum possible value of $X_{jt}$ and its negative.
Now that the score definitions in CTT have been made precise, for notational convenience, their indices, wherever possible, will be omitted.
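The two sampling levels can be made concrete with a small simulation. The sketch below is illustrative only: the classical model makes no distributional assumptions, so the beta population of persons, the binomial number-correct score, and the function names are all assumptions introduced here. It draws a random person (level 2), then one observed score for that person (level 1), and prints a few summary quantities.

```python
import random

def simulate_scores(n_persons=5000, n_items=20, seed=1):
    """Return parallel lists of true scores, observed scores, and errors."""
    rng = random.Random(seed)
    true_scores, observed, errors = [], [], []
    for _ in range(n_persons):
        # Level 2: sample a random person, represented by a success probability.
        # Beta(6, 4) is an arbitrary illustrative population (assumption).
        p = rng.betavariate(6, 4)
        tau = n_items * p  # true score: expected number-correct score
        # Level 1: one draw from this person's observed-score distribution.
        x = sum(rng.random() < p for _ in range(n_items))
        true_scores.append(tau)
        observed.append(x)
        errors.append(x - tau)  # error defined as a difference, as in Eq. (2)
    return true_scores, observed, errors

def mean(values):
    return sum(values) / len(values)

def covariance(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

T, X, E = simulate_scores()
print("mean error:", round(mean(E), 3))
print("Cov(T, E):", round(covariance(T, E), 3))
print("Var(X):", round(covariance(X, X), 3))
print("Var(T) + Var(E):", round(covariance(T, T) + covariance(E, E), 3))
```

In repeated runs with different seeds, the mean error and the covariance between true scores and errors should stay close to zero, and the observed-score variance close to the sum of the true-score and error variances.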
Classical Theory
In spite of its almost tautological nature, it is amazing how many results relevant to measurement in the social sciences can be derived from the classical model in Eqs. (5) and (6). The goal of these derivations is to express unobservable quantities, defined in terms of true scores and measurement error, as functions of observables. The following equations are examples of results derived directly from the model in Eqs. (5) and (6):
$$E(E_{jt}) = 0, \tag{7}$$
$$\sigma^2_{E_{jt}} = \sigma^2_{X_{jt}}, \tag{8}$$
$$\mathrm{Cov}(X_{jt}, X'_{jt}) = 0, \tag{9}$$
$$\mathrm{Cov}(T, E) = 0, \tag{10}$$
$$\sigma^2_X = \sigma^2_T + \sigma^2_E, \tag{11}$$
and
$$\Pr\{X'_{jt} \le x_o\} \ge \Pr\{X'_{jt} \ge x_o\} \quad \text{for } x_o \text{ large}. \tag{12}$$
The first three results show that, for a fixed person $j$, the expected error is equal to zero, the variance of the error is equal to the variance of the observed score, and the covariance between the observed score and a replication $X'_{jt}$ is equal to zero. In all three equations, the left-hand side contains an unobservable quantity and the right-hand side contains an observable quantity. The result in Eq. (10) shows that the covariance between true scores and errors in any population is always equal to zero. The derivation of this result is less obvious, but it follows directly from the fact that Eq. (7) implies a horizontal regression line of $E$ on $\tau$. The zero covariance in Eq. (10) leads directly to the equality of the observed-score variance with the sum of the true-score and error variances in Eq. (11). The property in Eq. (12) shows that if there is a large test score $x_o$ and the test is replicated, it is more likely that a smaller (rather than larger) second score will be observed. (By a large test score is meant a score larger than the median; a comparable statement is true for a score below the median.) This simple property, which is a trivial consequence of the assumption of a random observed score, explains the often misunderstood phenomena of regression to the mean and capitalization on chance due to measurement error.
More useful results are possible if the observed score, true score, or error is used to define new test and item parameters. Such parameters are often defined with a practical application in mind (for example, item and test analysis, choice of test length, or prediction of success in a validity study). Some results for these parameters can be derived with the model in Eqs. (5) and (6) as the only assumption; for others, an auxiliary assumption is needed. A commonly used auxiliary assumption is the one of parallel scores on two measurement instruments. Scores $X_t$ on test $t$ and $X_r$ on a second test $r$ are strictly parallel if
$$\tau_{jt} = \tau_{jr}, \tag{13}$$
$$\sigma^2_{X_{jt}} = \sigma^2_{X_{jr}}, \tag{14}$$
for each person $j$ in the population. This definition equates the first two moments of the score distributions of all persons. Weaker definitions exist but are not reviewed here.
Reliability
A key parameter in CTT is the reliability coefficient of observed score $X_{Jt}$. This parameter is defined as the squared (linear) correlation coefficient between the observed and true score on the test,
$$\rho^2_{XT}. \tag{15}$$
Though this parameter is often referred to as the reliability of a test, it actually represents a property of the observed score $X$ defined on the test. If the definition of $X$ is changed, the value of the coefficient changes. The use of the correlation coefficient between $X$ and $T$ in Eq. (15) can be motivated as follows: If $\rho_{XT} = 1$, it holds that $X = T$ for each person in the population and $X$ does not contain any error. If $\rho_{XT} = 0$, it holds that $X = E$ for each person and $X$ contains only error. A squared correlation coefficient is used because of the standard interpretation of this square as a proportion of explained variance. From Eq. (11) it follows that
$$\rho^2_{XT} = \frac{\sigma^2_T}{\sigma^2_X}, \tag{16}$$
which allows interpretation of the reliability coefficient as the proportion of observed-score variance explained by the differences between the true scores in the population.
One of the major uses of the reliability coefficient is for calculating the standard error of measurement, which is the standard deviation of errors in the population, $\sigma_E$. From Eqs. (11) and (16), it follows that
$$\sigma_E = (1 - \rho^2_{XT})^{1/2}\,\sigma_X. \tag{17}$$
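As a worked illustration of Eqs. (16) and (17), the following minimal sketch computes a reliability coefficient from variances and the corresponding standard error of measurement. The function names are hypothetical and the numbers in the usage lines are made up for illustration.

```python
import math

def reliability(var_true, var_observed):
    """Reliability as the ratio of true-score to observed-score variance, Eq. (16)."""
    return var_true / var_observed

def standard_error_of_measurement(sd_observed, rel):
    """Standard deviation of the errors in the population, Eq. (17)."""
    if not 0.0 <= rel <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return math.sqrt(1.0 - rel) * sd_observed

# Illustrative values: observed-score SD of 10 and reliability of 0.84.
sem = standard_error_of_measurement(10.0, 0.84)
print(round(sem, 3))
```

With these illustrative values, the error standard deviation is 40% of the observed-score standard deviation, because the square root of 1 - 0.84 is 0.4.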
Internal Consistency
The internal consistency of a test is the degree to which all of its item scores correlate. If the correlations are high, the test is taken to measure a common factor. Index $i = 1, \ldots, n$ denotes the items in the test; a second index $k$ is used to denote the same items. A parameter for the internal consistency of a test is coefficient $\alpha$, which is defined as
$$\alpha = \frac{n}{n-1}\left[\frac{\sum_{i \neq k}^{n} \sigma_{ik}}{\sigma^2_X}\right]. \tag{18}$$
This parameter is thus equal to the sum of the item covariances, $\sigma_{ik}$, as a proportion of the total observed-score variance for the test (corrected by a factor slightly larger than 1 for technical reasons). As will become clear later, a convenient formulation of coefficient $\alpha$ is
$$\alpha = \frac{n}{n-1}\left[1 - \frac{\sum_{i=1}^{n} \sigma^2_i}{\sigma^2_X}\right]. \tag{19}$$
For the special case of dichotomous item scores, coefficient $\alpha$ is known as Kuder-Richardson formula 20 (KR20). If all items are equally difficult, this formula reduces to a version known as Kuder-Richardson formula 21 (KR21).
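The two formulations of coefficient alpha in Eqs. (18) and (19) can be compared numerically. The sketch below is a plug-in computation on a persons-by-items score matrix; the function names and the toy data are hypothetical. Population-style moments (division by the number of persons) are used throughout, which does not affect the ratio as long as the same convention is used for variances and covariances.

```python
def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def covariance(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def alpha_from_variances(scores):
    """Coefficient alpha via item variances, as in Eq. (19)."""
    n = len(scores[0])
    items = list(zip(*scores))                  # one tuple of scores per item
    totals = [sum(person) for person in scores]  # observed score X per person
    item_var_sum = sum(variance(item) for item in items)
    return n / (n - 1) * (1 - item_var_sum / variance(totals))

def alpha_from_covariances(scores):
    """Coefficient alpha via the sum of item covariances, as in Eq. (18)."""
    n = len(scores[0])
    items = list(zip(*scores))
    totals = [sum(person) for person in scores]
    cov_sum = sum(covariance(items[i], items[k])
                  for i in range(n) for k in range(n) if i != k)
    return n / (n - 1) * cov_sum / variance(totals)

# Toy data: five persons, three dichotomous items (illustrative only).
data = [[1, 0, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
print(alpha_from_variances(data), alpha_from_covariances(data))
```

The two functions agree because the observed-score variance equals the sum of all item variances and covariances.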
Validity
If a test score is used to predict the score on another instrument, e.g., for the measurement of future success in a therapy or training program, it is important to have a parameter to represent its predictive power. Let $Y$ denote this other score. We define the validity coefficient for the observed test scores, $X$, as the correlation coefficient
$$\rho_{XY}. \tag{20}$$
The reliability coefficient remains an important parameter in predictive validity studies, but the correlation of observed score X with Y, instead of with its true score T in Eq. (15), becomes the ultimate criterion of success for it in prediction studies.
Item Parameters
Well-known item parameters in CTT are the item-difficulty or item p value, the item-discrimination coefficient, and the item-validity coefficient. Suppose that the items are scored dichotomously, where $U_i = 1$ is the value for a correct response to item $i$ and $U_i = 0$ is for an incorrect response. The classical parameter for the difficulty of item $i$ is defined as the expected value or mean of $U_i$ in the population of examinees,
$$p_i = E(U_i). \tag{21}$$
The discrimination parameter of item $i$ is defined as the correlation between the item score and the observed test score,
$$\rho_{iX} = \mathrm{Cor}(U_i, X) = \frac{\sigma_{iX}}{\sigma_i \sigma_X}, \tag{22}$$
where $\sigma_{iX}$, $\sigma_i$, and $\sigma_X$ are the covariance between $U_i$ and $X$, and the standard deviations of $U_i$ and $X$, respectively. Analogously, the correlation between the item score and observed score $Y$ defines the validity parameter for item $i$:
$$\rho_{iY} = \mathrm{Cor}(U_i, Y) = \frac{\sigma_{iY}}{\sigma_i \sigma_Y}. \tag{23}$$
Obviously, a large value for $p_i$ implies an easier item. Likewise, a large value for $\rho_{iX}$ or $\rho_{iY}$ implies an item score that discriminates well between persons with a high and a low observed score $X$ and criterion score $Y$, respectively. Because $X$ is a function defined on the item-score vector $(U_1, \ldots, U_n)$, the scores on all items other than $i$ have an impact on the correlation between $U_i$ and $X$ too. It is therefore misleading to view $\rho_{iX}$ as an exclusive property of item $i$. This interpretation problem does not exist for the item-validity coefficient, $\rho_{iY}$.
It is helpful to know that the following relation holds for the standard deviation of observed score $X$:
$$\sigma_X = \sum_{i=1}^{n} \sigma_i \rho_{iX}. \tag{24}$$
Replacing $\sigma^2_X$ in Eq. (19) by the square of this sum of products of item parameters leads to
$$\alpha = \frac{n}{n-1}\left[1 - \frac{\sum_{i=1}^{n} \sigma^2_i}{\left(\sum_{i=1}^{n} \sigma_i \rho_{iX}\right)^2}\right]. \tag{25}$$
Using comparable relations, the validity coefficient can be written as
$$\rho_{XY} = \frac{\sum_{i=1}^{n} \sigma_i \rho_{iY}}{\sum_{i=1}^{n} \sigma_i \rho_{iX}}. \tag{26}$$
Except for the (known) test length $n$, these expressions for $\alpha$ and $\rho_{XY}$ are entirely based on the three item parameters $\sigma_i$, $\rho_{iX}$, and $\rho_{iY}$. They allow evaluation of the effect of the removal or addition of an item to the test on the value of the internal consistency and validity coefficients. The application of CTT to test construction relies heavily on these two relations.
Key Theorems on Reliability and Validity
As examples of more theoretical derivations possible from the classical model in Eqs. (5) and (6), the following three theorems are presented:
$$\rho^2_{XT} = \rho_{XX'}, \tag{27}$$
$$\rho^2_{XT} \ge \alpha, \tag{28}$$
and
$$\rho_{XT} \ge \rho_{XY}. \tag{29}$$
The first theorem shows that the reliability coefficient of a test is equal to the correlation between its observed score $X$ and the observed score $X'$ on a replication of the test. This theorem is important in that it demonstrates again how CTT allows expression of an unobservable quantity, such as the squared correlation between an observed and a true score, as a quantity that, in principle, can be observed. Though $\rho_{XX'}$ is observable, for most psychological or social variables it is seldom possible to replicate an administration of the measurement instrument. For such cases, the second theorem is convenient. It shows that the reliability of an observed score can never be smaller than the internal consistency of the item scores on which it is calculated. Because coefficient $\alpha$ can be estimated from a single administration of a test, we often resort to this relation and estimate $\rho^2_{XT}$ by an estimate of this lower bound. The bias in this estimate is only negligible if the test is unidimensional in the sense that it measures a single factor. The third theorem shows that the predictive validity coefficient of a test can never exceed the square root of the reliability coefficient. This inequality makes sense; it implies that an observed score always correlates at least as well with its own true score as with any other observed score. It also illustrates the previous claim that, if the test is used to predict another score, the reliability coefficient remains important; high reliability is a necessary condition for high validity.
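The first theorem, Eq. (27), can be checked by simulation when replications are available by construction. The sketch below is an illustration under assumptions that are not part of the classical model: normally distributed true scores and errors, with standard deviations 10 and 5, chosen so that the reliability is 100/125 = 0.8. The function names are hypothetical.

```python
import math
import random

def pearson(u, v):
    """Product-moment correlation between two equally long lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def simulate_replications(n_persons=20000, sd_true=10.0, sd_error=5.0, seed=7):
    """True scores plus two independent replications with equal error variance."""
    rng = random.Random(seed)
    true_scores, first, second = [], [], []
    for _ in range(n_persons):
        t = rng.gauss(50.0, sd_true)                 # normality is an assumption
        true_scores.append(t)
        first.append(t + rng.gauss(0.0, sd_error))   # observed score X
        second.append(t + rng.gauss(0.0, sd_error))  # replication X'
    return true_scores, first, second

T, X, X_rep = simulate_replications()
print("squared correlation of X and T:", round(pearson(X, T) ** 2, 3))
print("correlation of X and X':", round(pearson(X, X_rep), 3))
```

Both printed values should be close to the reliability of 0.8 implied by the chosen variances, although only the second could be computed from real data.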
Test Length
If the length of a test is increased, its reliability is expected to increase too. A well-known result in CTT is the Spearman-Brown prophecy formula, which shows that this expectation is correct. Also, if the lengthening of the test is based on the addition of new parts with parallel scores, the formula allows calculation of its reliability in advance. Suppose the test is lengthened by a factor $k$. If the scores on the $k - 1$ new parts are strictly parallel to the score on the original test according to the definition in Eqs. (13) and (14), the Spearman-Brown formula for the new reliability is
$$\rho^2_{ZT_Z} = \frac{k\rho^2_{XT_X}}{1 + (k - 1)\rho^2_{XT_X}}, \tag{30}$$
where $Z$ is the observed score on the new test and $T_Z$ its associated true score.
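Eq. (30) is easily turned into a small calculator. The sketch below uses hypothetical function names; the second function is simply the algebraic inversion of Eq. (30) for $k$, added here for illustration, and is not a result stated in the text.

```python
def spearman_brown(rel, k):
    """Reliability of a test lengthened by a factor k with parallel parts, Eq. (30)."""
    return k * rel / (1.0 + (k - 1.0) * rel)

def lengthening_factor(rel, target):
    """Factor k needed to reach a target reliability (Eq. (30) solved for k)."""
    if not (0.0 < rel < 1.0 and 0.0 < target < 1.0):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return target * (1.0 - rel) / (rel * (1.0 - target))

# Doubling a test with reliability 0.5 gives a reliability of 2/3.
print(spearman_brown(0.5, 2))
# Lengthening factor needed to go from 0.5 to 0.8.
print(lengthening_factor(0.5, 0.8))
```

A factor $k$ smaller than 1 corresponds to shortening the test, in which case the formula predicts a lower reliability.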
Attenuation Corrections
As already discussed, attenuation corrections were the first results for CTT, obtained by Spearman in 1904. He showed that if we are interested in the correlation between the true scores $T_X$ and $T_Y$ and want to calculate it from their unreliable observed scores $X$ and $Y$, the following relation can be used:
$$\rho_{T_X T_Y} = \frac{\rho_{XY}}{\rho_{XT_X}\,\rho_{YT_Y}}, \tag{31}$$
where $\rho_{XT_X}$ and $\rho_{YT_Y}$ are the square roots of the reliability coefficients of $X$ and $Y$ (also known as their reliability indices). If we want to correct the validity coefficient in Eq. (20) only for the unreliability of one score, $Y$, say, the correction is given by
$$\rho_{XT_Y} = \frac{\rho_{XY}}{\rho_{YT_Y}}. \tag{32}$$
This correction makes sense in validity studies, where we want to predict the true criterion score $T_Y$ but always have to predict from the unreliable observed score $X$.
Parameter Estimation
The statistical treatment of CTT is not well developed. One of the reasons for this is the fact that its model is not based on the assumption of parametric families for the distributions of $X_{jt}$ and $T_{Jt}$ in Eqs. (5) and (6). Direct application of standard likelihood or Bayesian theory to the estimation of classical item and test parameters is therefore less straightforward. Fortunately, nearly all classical parameters are defined in terms of first-order and second-order (product) moments of score distributions. Such moments are well estimated by their sample equivalents (with the usual correction for the variance estimator if we are interested in unbiased estimation). CTT item and test parameters are therefore often estimated using "plug-in estimators," that is, with sample moments substituted for population moments in the definition of the parameter.
A famous plug-in estimator for the true score of a person is the one based on Kelley’s regression line. Kelley showed that, under the classical model, the least-squares regression line for the true score on the observed score is equal to
$$E(T \mid X = x) = \rho^2_{XT}\,x + (1 - \rho^2_{XT})\,\mu_X. \tag{33}$$
An estimate of a true score is obtained if estimates of the reliability coefficient and the population mean are plugged into this expression. This estimator is interesting because it is based on a linear combination of the person’s observed score and the population mean with weights based on the reliability coefficient. If $\rho^2_{XT} = 1$, the true-score estimate is equal to observed $x$; if $\rho^2_{XT} = 0$, it is equal to the population mean, $\mu_X$. Precision-weighted estimators of this type are typical of Bayesian statistics. For this reason, Kelley’s result has been hailed as the first Bayesian estimator known in the statistical literature.
Strong True-Score Models
To allow stronger statistical inference, versions of the classical test model with the additional assumption of parametric families for Eqs. (5) and (6) can be used. These models remain entirely within the framework of CTT; their main parameter and the classical true score are one and the same quantity. For this reason, they are generally known as "strong true-score models."
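Before turning to specific strong true-score models, the plug-in use of the attenuation corrections in Eqs. (31) and (32) and of Kelley's regression line in Eq. (33) can be sketched in a few lines. The function names are hypothetical and the input values are invented for illustration; in practice the reliabilities, correlation, and mean would be sample estimates.

```python
import math

def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Correlation between true scores, Eq. (31); rel_x and rel_y are reliabilities."""
    return r_xy / (math.sqrt(rel_x) * math.sqrt(rel_y))

def correct_for_criterion_unreliability(r_xy, rel_y):
    """Validity coefficient corrected for unreliability of Y only, Eq. (32)."""
    return r_xy / math.sqrt(rel_y)

def kelley_true_score(x, rel_x, mean_x):
    """Kelley's regression estimate of the true score, Eq. (33)."""
    return rel_x * x + (1.0 - rel_x) * mean_x

# Illustrative values only.
print(correct_for_attenuation(0.42, 0.49, 0.64))
print(correct_for_criterion_unreliability(0.42, 0.64))
print(kelley_true_score(70.0, 0.8, 50.0))
```

Kelley's estimate pulls an observed score toward the population mean, and the pull is stronger the lower the reliability.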
Binomial Model
If the item scores are dichotomous, and observed score $X_{jt}$ is defined as the number-correct score, the observed-score distribution can sometimes be approximated by the binomial with probability function
$$f(x) = \binom{n}{x} p_{jt}^{x} (1 - p_{jt})^{n - x}, \tag{34}$$
where $p_{jt}$ is the binomial success parameter. For the binomial distribution it holds that $E(X_{jt}) = n p_{jt}$, so that $p_{jt}$ is the classical true score on the proportion-correct scale, which shows that this distribution remains within the classical model. The assumption of a binomial distribution is strong in that it only holds exactly for a fixed test if $p_{jt}$ is the common probability of success for $j$ on all items. The assumption thus requires items of equal difficulty for a fixed test or items randomly sampled from a pool. In either case, $p_{jt}$ can be estimated in the usual way from the number of correct responses in the test.
The assumption of a beta distribution for the binomial true score, which has density function
$$f(p) = \frac{p^{a - 1}(1 - p)^{b - n}}{B(a, b - n + 1)}, \tag{35}$$
with $B(a, b - n + 1)$ being the complete beta function with parameters $a$ and $b - n + 1$, is natural. The beta distribution is both flexible and the conjugate of the binomial. In addition, convenient closed-form estimators for its parameters exist. Finally, the validity of the two distributional assumptions can be checked by fitting a predicted observed-score population distribution, which is known to be negative hypergeometric, to a sample estimate.
Examples of stronger statistical inferences for the beta-binomial model are with respect to the standard error of measurement for a person, which for the number-correct score is the binomial standard deviation $[n p_{jt}(1 - p_{jt})]^{1/2}$, and true-score population parameters, such as the population median, variance, and interquartile range.
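The binomial probability function in Eq. (34) and the observed-score distribution implied by a beta population of true scores can be evaluated directly. The sketch below uses hypothetical function names and writes the beta parameters generically as `alpha` and `beta`; in the parameterization of Eq. (35) these correspond to $a$ and $b - n + 1$. The marginal distribution is computed on the log scale with `math.lgamma` for numerical stability.

```python
import math

def binomial_pmf(x, n, p):
    """Probability of x correct responses out of n for success parameter p, Eq. (34)."""
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)

def log_beta(u, v):
    """Logarithm of the complete beta function B(u, v)."""
    return math.lgamma(u) + math.lgamma(v) - math.lgamma(u + v)

def beta_binomial_pmf(x, n, alpha, beta):
    """Population distribution of number-correct scores (negative hypergeometric)
    when true scores follow a Beta(alpha, beta) distribution."""
    return math.comb(n, x) * math.exp(
        log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta)
    )

def binomial_sem(n, p):
    """Binomial standard deviation of the number-correct score for a fixed person."""
    return math.sqrt(n * p * (1.0 - p))

n = 10
predicted = [beta_binomial_pmf(x, n, 6.0, 4.0) for x in range(n + 1)]
print([round(v, 4) for v in predicted])
print(round(sum(predicted), 6))
```

Comparing such predicted probabilities with the observed relative frequencies of the scores in a sample is one simple way to carry out the fit check described above.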
Normal-Normal Model
Another strong true-score model is based on the assumptions of normal distributions for the observed score of a fixed person and the true scores in the population:
$$X_{jt} \sim N(\mu_{jt}, \sigma_{jt}), \tag{36}$$
$$\mu_{jt} \sim N(\mu_T, \sigma_T). \tag{37}$$
Parameter $\mu_{jt}$ in this model is also the classical true score. Further, the model allows for inference with respect to the same quantities as the beta-binomial model. The normal-normal model has mainly been used to study sampling distributions of classical item and test parameters. The results from these studies provide only a rough approximation of the empirical distributions. Observed scores are discrete and often have a bounded range; the assumption of a normal distribution cannot hold exactly for such scores.
Applications
The major applications of CTT are item and test analyses, test assembly from larger sets of pretested items, and observed-score equating.
Item and Test Analyses
Item and test analyses are based on a pretest of a set of items or a test. Such pretests yield empirical estimates of relevant item and test parameters, which can be inspected to detect possibly dysfunctional items or an undesirable feature of the test. Typical parameters estimated in item and test analyses are item difficulty and discrimination, test reliability or internal consistency, standard error of measurement, and validity coefficients.
Test Construction
If a larger set of items exists and a test of length $n$ has to be assembled from this set, the usual goal is maximization of the reliability or validity of the test. The optimization problem involved in this goal can be formulated as an instance of combinatorial programming. Instead of optimizing the reliability coefficient, it is more convenient to optimize coefficient $\alpha$; because of Eq. (28), this also maximizes a lower bound on $\rho^2_{XT}$. Using the facts that $n$ is fixed and
$$\sigma_X = \sum_{i=1}^{n} \sigma_i \rho_{iX}, \tag{38}$$
maximizing Eq. (19) is equivalent to minimizing
$$\frac{\sum_{i=1}^{I} \sigma^2_i x_i}{\left(\sum_{i=1}^{I} \sigma_i \rho_{iX} x_i\right)^2}. \tag{39}$$
Here, the sums run over the $I$ items in the larger set, and $x_i$ is a 0-1 decision variable indicating whether item $i$ is selected for the test. Because the denominator and numerator are based on expressions linear in the decision variables, optimization is possible by solving a linear problem consisting of the maximization of
$$\sum_{i=1}^{I} \sigma_i \rho_{iX} x_i,$$
subject to the constraint
$$\sum_{i=1}^{I} \sigma^2_i x_i \le c, \tag{40}$$
with $c$ a well-chosen constant. A comparable approach is available for the problem of optimizing the validity coefficient of the test in Eq. (20).
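For a very small item pool, the linear selection problem described above can be solved by plain enumeration. The sketch below is only an illustration of the objective and the constraint in Eq. (40), together with the fixed test length: the function name and the item parameters are invented, the item-test correlations are treated as fixed pool values, and operational test assembly would rely on an integer-programming solver rather than on enumerating all subsets.

```python
from itertools import combinations

def select_items(sd, rho, n, c):
    """Choose n items maximizing sum(sd[i] * rho[i]) subject to sum(sd[i]**2) <= c.

    sd  : item standard deviations
    rho : item discrimination coefficients (treated as fixed, an assumption)
    Returns (selected item indices, objective value), or (None, None) if infeasible.
    """
    best_set, best_value = None, None
    for subset in combinations(range(len(sd)), n):
        if sum(sd[i] ** 2 for i in subset) > c:
            continue  # violates the variance constraint in Eq. (40)
        value = sum(sd[i] * rho[i] for i in subset)
        if best_value is None or value > best_value:
            best_set, best_value = subset, value
    return best_set, best_value

# Hypothetical pool of four items; select a test of two items.
sd = [0.5, 0.4, 0.5, 0.3]
rho = [0.6, 0.5, 0.2, 0.4]
print(select_items(sd, rho, n=2, c=0.5))
print(select_items(sd, rho, n=2, c=0.3))
```

Tightening the constant $c$ shrinks the feasible set, which is why the two calls can return different tests from the same pool.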
Test Equating
If a new version of an existing standardized test is constructed, the scales of the two versions need to be equated. The equating transformation, which maps the scores on the new version to equivalent scores on the old version, is estimated in an empirical study, often with a randomly-equivalent-groups design in which the two versions of the test are administered to separate random samples of persons from the same population. Let $X$ be the observed score on the old version and $Y$ be the observed score on the new version. In classical equipercentile equating, a study with a randomly-equivalent-groups design is used to estimate the following transformation:
$$x = \varphi(y) = F_X^{-1}[F_Y(y)]. \tag{41}$$
This transformation gives the equated score $X = \varphi(Y)$ in the population of persons the same distribution as the observed score on the old version of the test, $F_X(x)$. Estimation of this transformation, which actually is a compromise between the set of conditional transformations needed to give each person an identical observed-score distribution on the two versions of the test, is sometimes more efficient under a strong true-score model, such as the beta-binomial model in Eqs. (34) and (35), because of the implicit smoothing in these models. Using CTT, linear approximations to the transformation in Eq. (41), estimated from equating studies with different types of sampling designs, have been proposed.
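A bare-bones version of the transformation in Eq. (41) can be written with empirical distribution functions. The sketch below is a simplification under stated assumptions: the function names are hypothetical, the inverse is taken as the smallest old-version score whose cumulative proportion reaches the required level, and no smoothing or continuization of the discrete score distributions is applied, although practical equating methods typically include such steps.

```python
def ecdf(sample, value):
    """Empirical distribution function: proportion of scores not exceeding value."""
    return sum(s <= value for s in sample) / len(sample)

def inverse_ecdf(sample, prob):
    """Smallest observed score whose cumulative proportion is at least prob."""
    for score in sorted(sample):
        if ecdf(sample, score) >= prob:
            return score
    return max(sample)

def equipercentile(y, new_scores, old_scores):
    """Map score y on the new version to the old-version scale, as in Eq. (41)."""
    return inverse_ecdf(old_scores, ecdf(new_scores, y))

# Hypothetical samples from a randomly-equivalent-groups design.
new_scores = [12, 15, 15, 18, 20, 22, 25, 27, 30, 31]
old_scores = [14, 16, 19, 19, 23, 24, 28, 29, 33, 35]
print([equipercentile(y, new_scores, old_scores) for y in (15, 22, 31)])
```

Each new-version score is mapped to the old-version score with the same percentile rank in its own sample.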
Current Developments
The main theoretical results for the classical test model were already available when Lord and Novick published their standard text in 1968. In fact, one of the few problems for which newer results have been found is the one of finding an approximation to the conditional standard error of measurement $\sigma_{E \mid T = \tau}$. This standard error is the one that should be used instead of the marginal error $\sigma_E$ when reporting the accuracy of an observed score. Other developments in test theory have been predominantly in item-response theory (IRT). Results from IRT are not in disagreement with the classical test model, but should be viewed as applying at a deeper level of parameterization for the classical true score in Eq. (1). For example, it holds that
$$E(U_i) = 1 \cdot p_i(\theta) + 0 \cdot [1 - p_i(\theta)] = p_i(\theta), \tag{42}$$
which shows that the IRT probability of success for a dichotomous model, $p_i(\theta)$, is the classical true score at item level. Likewise, it holds for the true score at test level that
$$\tau = E\left(\sum_{i=1}^{n} U_i\right) = \sum_{i=1}^{n} E(U_i) = \sum_{i=1}^{n} p_i(\theta), \tag{43}$$
which is the test characteristic curve (TCC) in IRT. The IRT probability of success for a dichotomous model, $p_i(\theta)$, has separate parameters for the person and the item properties. Due to this feature, IRT is more effective in solving incomplete-data problems, that is, problems in which not all persons respond to all items. Such problems are common in test-item banking, computerized adaptive testing, and observed-score equating designs that are more complicated than the randomly-equivalent-groups design.
See Also the Following Articles
Item Response Theory; Item and Test Bias; Measurement Error, Issues and Solutions; Reliability Assessment; Validity Assessment
Further Reading
Gulliksen, H. (1950). Theory of Mental Tests. Wiley, New York.
Kolen, M. J., and Brennan, R. L. (1995). Test Equating: Methods and Practices. Springer-Verlag, New York.
Novick, M. R. (1966). The axioms and principal results of classical test theory. J. Math. Psychol. 3, 1-18.
Lord, F. M., and Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Addison-Wesley, Reading, Massachusetts.
Spearman, C. (1904). The proof and measurement of association between two things. Am. J. Psychol. 15, 72-101.
Traub, R. E. (1997). Classical test theory in historical perspective. Educat. Psychol. Measure. Issues Pract. 16(4), 8-14.
van der Linden, W. J. (1986). The changing conception of testing in education and psychology. Appl. Psychol. Measure. 10, 325-332.
van der Linden, W. J. (2004). Linear Models for Optimal Test Design. Springer-Verlag, New York.
van der Linden, W. J. (2004). Evaluating equating error in observed-score equating. Appl. Psychol. Measure.
von Davier, A. A., Holland, P. W., and Thayer, D. T. (2004). The Kernel Method of Test Equating. Springer-Verlag, New York.