Model Selection


S. J. Haberman, Educational Testing Service, Princeton, NJ, USA. © 2010 Elsevier Ltd. All rights reserved.

Glossary

Bayesian inference – A method of statistical inference in which conclusions about populations are made by combining sample data with prior knowledge concerning population measures.
Continuous variable – A variable, such as height or weight, which can assume any value in an interval of positive length.
Discrete variable – A variable which assumes a finite or countable number of values.
Hypothesis test – A method of statistical inference in which a statistic based on observed data is used to assess consistency of observed data with a hypothesis concerning a population measure.
Item-response model – A model employed to describe the probability that an examinee in an assessment has any specific pattern of responses.
Likelihood-ratio test – A test of a hypothesis which compares the compatibility of data with model parameters consistent with a specified hypothesis to the compatibility of data with model parameters consistent with an alternative hypothesis.
Linear model – A model which uses a linear function of one or more explanatory variables to predict a response variable.
Multiple-choice test – An assessment in which examinees select an answer to each test item from a list of alternatives.
Pearson chi-square test – A hypothesis test used with discrete data to assess the compatibility of a probability model with the data.
Penalty function – A function used to measure prediction error.

Model selection is a common activity in statistical analysis of data. Two somewhat different approaches can be found. In the first approach, a model is regarded as appropriate for the data if, according to hypothesis tests, the model is consistent with the data. Among models consistent with the data, comparisons may be made in terms of parsimony or in terms of relative consistency with the data. In the second approach, models are compared in terms of their ability to predict data. It is not regarded as likely that any model under study is consistent with data. Instead, prediction measures are estimated and compared for the models under consideration. The second approach becomes increasingly important as sample sizes become large, for inconsistency of models with data becomes an increasingly serious issue in such cases (Gilula and Haberman, 2001). In this article, the sections on likelihood-ratio tests, Pearson chi-square tests, and Bayesian inference emphasize the approach based on hypothesis tests. The sections on penalty criteria and linear models emphasize prediction measures.

To illustrate some common issues in model selection, consider selection of a model for the responses on a multiple-choice test. For illustrative purposes, data will be used from a writing test with $q = 45$ items numbered from 1 to $q$. Data are available for $n = 8686$ examinees $i$ numbered from 1 to $n$ (Haberman, 2005, 2007b). For each examinee $i$ and item $j$, the observation $X_{ij} = 1$ if the response is correct and $X_{ij} = 0$ if the response is not correct.

In psychometrics, several common item-response models are often applied to data of this kind. In each item-response model, it is assumed that there are independently distributed unobserved random variables $\theta_i$ which represent the abilities of the examinees $i$, $1 \le i \le n$. As is typically the case in applications of these models, the unobserved variables are assumed to have a standard normal distribution. It is assumed that, given the $\theta_i$, the $X_{ij}$ are conditionally independent and that, for a particular examinee $i$ and item $j$, the conditional probability that $X_{ij} = 1$ (the response is correct) depends only on the ability variable $\theta_i$ associated with the examinee. The conditional probability that $X_{ij} = 1$ given $\theta_i = \omega$ is $P_j(\omega)$ for some unknown function $P_j$ defined for all real numbers. The models considered are models for $P_j$.

In the one-parameter logistic (1PL) model, it is assumed that, for some unknown constants $a$ and $\gamma_j$,

$$P_j(\omega) = \frac{\exp(a\omega - \gamma_j)}{1 + \exp(a\omega - \gamma_j)}.$$

In the special case of $a = 0$, one obtains the independence model (Ind), in which the $X_{ij}$, $1 \le j \le q$, are mutually independent and the probability is positive that $X_{ij} = k$ for $k$ equal to 0 or 1. In the two-parameter logistic (2PL) model, it is assumed that, for some unknown slope parameter $a_j$ and intercept parameter $\gamma_j$,

$$P_j(\omega) = \frac{\exp(a_j\omega - \gamma_j)}{1 + \exp(a_j\omega - \gamma_j)}.$$

In the three-parameter logistic (3PL) model, it is assumed that, for some unknown slope parameter $a_j$, some unknown intercept parameter $\gamma_j$, and some unknown lower asymptote $c_j$, $0 \le c_j < 1$,

$$P_j(\omega) = c_j + (1 - c_j)\,\frac{\exp(a_j\omega - \gamma_j)}{1 + \exp(a_j\omega - \gamma_j)}.$$

The 1PL model is a special case of the 2PL model in which the slope parameters $a_j$ have a common value $a$, and the 2PL model is a special case of the 3PL model in which the lower asymptotes $c_j$ are all 0. The basic question to consider is how to decide which of these models is most appropriate for the data under study. One approach considered will be based on tests of hypotheses. The other approach will be based on the ability of the models to describe the joint distribution of the item responses $X_{ij}$, $1 \le j \le q$, associated with an examinee $i$ (Hambleton et al., 1991).

Likelihood-Ratio Tests

The models under study may be compared by the use of likelihood-ratio tests (Wilks, 1938). For any vector $x$ of possible responses $x_j$ equal to 0 or 1, $1 \le j \le q$, let $p(x)$ be the joint probability that, for some examinee $i$, the response $X_{ij}$ to item $j$ is $x_j$ for $1 \le j \le q$. The distribution of the vector $X_i$ of responses $X_{ij}$, $1 \le j \le q$, is determined by the $q$-dimensional array $p$ of probabilities $p(x)$, $0 \le x_j \le 1$, $1 \le j \le q$. To each model corresponds a set $S$ of possible arrays of probabilities, so that the model holds if, and only if, $p$ is in $S$. For example, the independence model holds if, and only if, $p$ is in the set $S(0)$ of arrays such that

$$p(x) = \prod_{j=1}^{q} \frac{\exp(-x_j\gamma_j)}{1 + \exp(-\gamma_j)}$$

for some real $\gamma_j$, $1 \le j \le q$. In like fashion, if $\theta$ is a random variable with a standard normal distribution, then the 1PL model holds if, and only if, $p$ is in the set $S(1)$ of arrays such that $p(x)$ is the expected value of

$$\prod_{j=1}^{q} \frac{\exp[x_j(a\theta - \gamma_j)]}{1 + \exp(a\theta - \gamma_j)}$$

for some real $a$ and real $\gamma_j$, $1 \le j \le q$. In like manner, a set $S(2)$ may be defined to correspond to the 2PL model, and a set $S(3)$ may be defined to correspond to the 3PL model. The saturated, or unrestricted, model holds for $p$ in the set $T$ of arrays with $p(x)$ positive and the sum of all $p(x)$ equal to 1. The log-likelihood function

$$\ell(p) = \sum_{i=1}^{n} \log p(X_i)$$

is the logarithm of the probability that the observed vectors $X_i$ are obtained for $1 \le i \le n$. For a model that $p$ is in $S$, likelihood-ratio tests are based on the maximum $\ell(S)$ of the log likelihood for probability arrays in $S$. For a subset $S'$ of $S$, the model that $p$ is in $S'$ may be compared to a model that $p$ is in $S$ by the likelihood-ratio chi-square statistic

$$L^2(S, S') = 2[\ell(S) - \ell(S')].$$

In typical cases, if $p$ is in $S'$, then $L^2(S, S')$ has an approximate chi-square distribution with $k(S) - k(S')$ degrees of freedom, where $k(S)$ is the dimension of the space $S$ and $k(S')$ is the dimension of $S'$. Thus, the approximate significance level is the probability $P(S, S')$ that a chi-square variable with $k(S) - k(S')$ degrees of freedom exceeds the observed value of $L^2(S, S')$. The approximation improves as the sample size $n$ increases. If the model that $p$ is in $S'$ does not hold, then the typical result is that $P(S, S')$ converges to 0 with probability 1 as the sample size increases.

In the comparison of the 1PL model to the 2PL model for the data under study, the set $S(1)$ corresponding to the 1PL model has dimension $q + 1$, for there are $q$ unrestricted parameters $\gamma_j$ and one unrestricted parameter $a$. The set $S(2)$ corresponding to the 2PL model has dimension $2q$, for there are $q$ unrestricted parameters $\gamma_j$ and $q$ unrestricted parameters $a_j$. The resulting chi-square approximation has $2q - (q + 1) = q - 1 = 44$ degrees of freedom in the case under study. The statistic $L^2(S(2), S(1)) = 3772$, so that the approximate significance level is extremely close to 0. In other words, either the 1PL model does not hold, or a very rare event has taken place. In like fashion, one may readily compare the independence model to the 1PL model, for $S(0)$ has $q$ unrestricted parameters. Thus, $k(S(1)) - k(S(0))$ is $q + 1 - q = 1$, and $L^2(S(1), S(0)) = 22\,109$, so that the approximate significance level is extremely small.

One might expect that the comparison of the 2PL model to the 3PL model would proceed just as the comparison of the 1PL model to the 2PL model and the comparison of the independence model to the 1PL model, but this expectation is not met. The likelihood-ratio chi-square statistic $L^2(S(3), S(2)) = 646$ is readily found, but the chi-square approximation does not hold due to the constraints on the lower asymptote parameters $c_j$. Under the 2PL model, $c_j$ is at the boundary value 0 of its range. This constraint typically results in a large-sample distribution that is less likely to exceed any given value $x > 0$ than is the chi-square distribution with $k(S(3)) - k(S(2)) = q$ degrees of freedom. Because $q$ is 45, it remains evident that the 2PL model does not hold.

In principle, one could compare the model that $p$ is in $S(m)$, $m$ equal to 0, 1, 2, or 3, to the saturated model that $p$ is in $T$, but the chi-square approximation again fails. The problem arises because $T$ has dimension $2^q - 1 \approx 3.5 \times 10^{13}$, a very large number relative to the sample size $n$. For use of $L^2(T, S(m))$, it is commonly recommended that the expected number $np(x)$ of $X_i$ equal to $x$ be at least 1 for all possible values $x$ of $X_i$ and that $np(x)$ be at least 5 for at least 80% of the possible $x$ (Cochran, 1954). Use of $T$ can be illustrated if one confines attention to the first four responses $X_{ij}$, $1 \le j \le 4$, for in this case, the conditions for the chi-square approximation hold. For the independence model, $L^2(T, S(0))$ is 410.7, and there are $2^4 - 1 - 4 = 11$ degrees of freedom, so that $P(T, S(0))$ is very small. Similarly, $L^2(T, S(1)) = 289.1$, and there are $2^4 - 4 = 12$ degrees of freedom for the chi-square approximation, so that $P(T, S(1))$ is very close to 0. Thus the independence and 1PL models for the first four responses do not appear to be compatible with the data. In the case of the 2PL model, $L^2(T, S(2))$ is 20.6, and $2^4 - 2(4) - 1 = 7$ degrees of freedom are associated with the chi-square approximation, so that $P(T, S(2))$ is about 0.0044. This probability is sufficiently low that evidence exists that the 2PL model does not hold for the first four responses; however, the evidence is much less substantial than in the case of the independence or 1PL model.

A convenient feature of likelihood-ratio chi-square statistics is the existence of simple decompositions. For example,

$$L^2(S, S') = L^2(T, S') - L^2(T, S).$$

For instance, in the example with the four initial responses, $L^2(S(2), S(1))$, the comparison statistic for the 1PL model against the 2PL model, is $L^2(T, S(1)) - L^2(T, S(2)) = 268.5$, and the chi-square approximation has $12 - 7 = 5$ degrees of freedom. In this particular example, this comparison provides very strong evidence against the 1PL model. This result is consistent with the previous finding for the test statistic $L^2(T, S(1))$ comparing the 1PL model to the saturated model.

Likelihood-ratio tests may be employed for comparisons of more than two models. Let $\mathcal{S}$ be a family of subsets of $T$. For each $S$ in this family, one may consider the model that $p$ is in $S$. In the simplest case, consider a base model that $p$ is in $U$, where $U$ includes every set $S$ in the family $\mathcal{S}$ but $U$ itself is not a member of the family of models to be compared. One may then consider comparison of the model that $p$ is in $S$ to the model that $p$ is in $U$ by means of the likelihood-ratio chi-square $L^2(U, S)$ and the corresponding approximate significance level $P(U, S)$. For instance, in the four-response case, let $U$ be $T$, so that the base model is the saturated model, and let $\mathcal{S}$ contain $S(m)$ for $m$ from 0 to 2, so that the independence model, the 1PL model, and the 2PL model are compared to the saturated model. Preference is given to the 2PL model that $p$ is in $S(2)$, for $P(U, S(2))$ is larger than $P(U, S(0))$ or $P(U, S(1))$. A related criterion based on ratios of likelihood-ratio chi-square statistics to approximate expected values is the $F$ statistic $F(U, S) = L^2(U, S)/[k(U) - k(S)]$. The smallest $F(U, S)$ is sought for $S$ in $\mathcal{S}$ (Goodman, 1971; Haberman, 1974). For the four-response example, the best result is again for the 2PL model, for $F(T, S(2)) = 2.9$ is much smaller than $F(T, S(0))$ or $F(T, S(1))$.
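The approximate significance levels and $F$ criteria just described are simple to compute from the reported statistics. The sketch below uses SciPy's chi-square survival function together with the $L^2$ values and degrees of freedom quoted above for the four-response example:

```python
from scipy.stats import chi2

# Likelihood-ratio statistics and degrees of freedom for the
# four-response example quoted in the text.
tests = {
    "Ind vs saturated": (410.7, 11),
    "1PL vs saturated": (289.1, 12),
    "2PL vs saturated": (20.6, 7),
}

for name, (l2, df) in tests.items():
    p_value = chi2.sf(l2, df)   # P(T, S): upper-tail chi-square probability
    f_stat = l2 / df            # F(T, S) = L^2(T, S) / [k(T) - k(S)]
    print(f"{name}: P = {p_value:.4g}, F = {f_stat:.1f}")
```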


Pearson Chi-Square Tests

Although less convenient in the application under study, it should be noted that the Pearson chi-square statistic can be used instead of the likelihood-ratio chi-square (Pearson, 1900; Fisher, 1924). This option typically uses maximum-likelihood estimates to compare the model that $p$ is in $S'$ to the model that $p$ is in $S$ under the condition that $S'$ is included in $S$. The array $\hat{p}_S$ of probabilities $\hat{p}_S(x)$ is the maximum-likelihood estimate of $p$ under the model that $p$ is in $S$ if $\hat{p}_S$ is in $S$ and $\ell(\hat{p}_S) = \ell(S)$. In the case of the unrestricted model that $p$ is in $T$, $\hat{p}_T = \bar{p}$, the array of fractions $\bar{p}(x)$ of examinees $i$ with $X_i = x$. One has

$$X^2(S, S') = n \sum_{x} \frac{[\hat{p}_S(x) - \hat{p}_{S'}(x)]^2}{\hat{p}_{S'}(x)}$$

(Haberman, 1977). The large-sample approximations for $X^2(S, S')$ and $L^2(S, S')$ are the same. The most common applications use $S = T$. For instance, in the four-question case, to test the independence model that $p$ is in $S(0)$, one can use $X^2(T, S(0)) = 409.0$ in place of $L^2(T, S(0)) = 410.7$ without appreciable effect. Whether Pearson or likelihood-ratio chi-square statistics are employed, a typical problem in model selection is that no model of any interest is compatible with the data if the sample size $n$ is large (Gilula and Haberman, 2001).
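For a concrete illustration of the two statistics, the sketch below computes $X^2(T, S)$ and $L^2(T, S)$ from a table of observed response-pattern counts and fitted model probabilities. The counts and independence-model probabilities shown are hypothetical stand-ins, not the writing-test data:

```python
import numpy as np

def pearson_and_lr(counts, model_probs):
    """Pearson X^2 and likelihood-ratio L^2 statistics comparing observed
    response-pattern counts with fitted model probabilities (comparison
    against the saturated model T, whose estimate is the sample fractions)."""
    n = counts.sum()
    observed = counts / n
    x2 = n * np.sum((observed - model_probs) ** 2 / model_probs)
    l2 = 2.0 * np.sum(counts * np.log(observed / model_probs))
    return x2, l2

# Hypothetical counts for the 2^2 = 4 patterns of two items, with fitted
# independence-model probabilities (marginal correct rates 0.55 and 0.55).
counts = np.array([30.0, 20.0, 25.0, 25.0])
probs = np.array([0.3025, 0.2475, 0.2475, 0.2025])
print(pearson_and_lr(counts, probs))
```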

Continuous Data

Although the working example involves discrete variables, it should be emphasized that likelihood-ratio chi-square statistics are also commonly applied to continuous data to examine compatibility of models with data. The basic difference is that, in the log likelihood, probabilities are replaced by probability densities. The likelihood function continues to be maximized for each model compared, chi-square approximations remain, and computations of degrees of freedom are unchanged.

Penalty Criteria

An alternative approach to model selection compares models in terms of predictive power rather than in terms of fitting data. This approach often employs information theory (Gilula and Haberman, 2001). One may illustrate arguments by using the results of the 45-item multiple-choice test previously employed to illustrate use of tests of significance. In the approach under study, probability prediction is employed. A probability array $u$ is used to predict the observation $X_i$. The value of the array as a predictor is judged by use of the penalty $-\log u(X_i)$ (Savage, 1971). This penalty depends only on the probability assigned to the observed value of the responses $X_i$, and the penalty is non-negative. The penalty is 0 if, and only if, the observed responses $X_i$ have assigned probability 1. The expected penalty is then

$$G(u) = E(-\log u(X_i)).$$

The expected penalty is never less than the entropy

$$H = G(p) = E(-\log p(X_i)),$$

and $G(u) = H$ if, and only if, the probability prediction $u$ is the actual probability array $p$.

The penalty criterion may be employed to assess the quality of a model that $p$ is in $S$. If the probability array $p$ is known, then model error may be evaluated by use of the minimum $G(S)$ of $G$ for probability arrays in $S$. The difference $G(S) - H$ is a measure of model error. The measure is non-negative, and, if $p$ is in $S$, then $G(S) = H$ and the error $G(S) - H = 0$. To compare models that $p$ is in $S$ for $S$ in a family $\mathcal{S}$ of subsets of $T$, the smallest $G(S)$ is sought for $S$ in $\mathcal{S}$.

In practice, $p$ is not known, so that $G(S)$ and $H$ must be estimated. Estimation of $G(S)$ is generally feasible given the log-likelihood maximum $\ell(S)$, for $G(S)$ may be estimated by $\hat{G}(S) = -\ell(S)/n$. In principle, $H$ has estimate $\hat{H} = -\ell(T)/n$; however, this estimate is not satisfactory unless conditions hold for the chi-square approximation to the likelihood-ratio chi-square statistic $L^2(T, S)$. For $S'$ contained in $S$, $\hat{G}(S') - \hat{G}(S)$ is $L^2(S, S')/(2n)$, and $\hat{G}(S') - \hat{G}(S)$ can be employed to measure the improvement in the model that $p$ is in $S$ relative to the model that $p$ is in $S'$.

For example, for the case of the data with 45 items, $\hat{G}(S(0))$ is 28.110, $\hat{G}(S(1))$ is 26.838, $\hat{G}(S(2))$ is 26.621, and $\hat{G}(S(3))$ is 26.583. Despite the very strong evidence that the 1PL model does not hold, the measured difference $\hat{G}(S(1)) - \hat{G}(S(2))$ is relatively small, only 0.217. In contrast, $\hat{G}(S(0)) - \hat{G}(S(1))$ is 1.273. The relative reduction in estimated expected penalty

$$\frac{\hat{G}(S(1)) - \hat{G}(S(2))}{\hat{G}(S(1))} = 0.0081$$

for the 2PL model relative to the 1PL model is quite small compared to the relative reduction

$$\frac{\hat{G}(S(0)) - \hat{G}(S(1))}{\hat{G}(S(0))} = 0.0453$$

for the 1PL model relative to the independence model. The contrast between the 2PL and the 3PL model demonstrates even less difference, for $\hat{G}(S(2)) - \hat{G}(S(3))$ is only 0.037, and

$$\frac{\hat{G}(S(2)) - \hat{G}(S(3))}{\hat{G}(S(2))} = 0.0014$$

is very modest in size.

The estimate $\hat{G}(S)$ is biased. In addition, if one wishes to consider prediction of a new observation $X_{n+1}$ with the same distribution as the $X_i$, $1 \le i \le n$, and one wishes to base the prediction on estimates computed from $X_i$, $1 \le i \le n$, then further correction is needed. One measure for prediction of $X_{n+1}$ is the Akaike estimate

$$G_a(S) = \hat{G}(S) + k(S)/n.$$

This correction is only satisfactory if the model that $p$ is in $S$ is valid. An alternative approach (Gilula and Haberman, 2001) uses a criterion $G_g(S)$ similar to the Akaike criterion that does not assume that the model holds. It is also possible to consider $G_c(S) = \hat{G}(S) + bk(S)$ for a positive real constant $b$, so that a penalty for model complexity is imposed without regard to sample size. The Bayesian analysis in the next section leads to the Schwarz criterion

$$G_s(S) = \hat{G}(S) + (2n)^{-1} k(S) \log n.$$

In large samples, $\hat{G}(S)$ and the criteria of Akaike, Schwarz, and Gilula and Haberman are very similar. For instance, for the 2PL model for 45 items, $G_a(S(2))$ and $G_g(S(2))$ are 26.631 to five significant figures, while $G_s(S(2))$ is 26.668. The data suggest that the 3PL model is slightly preferable to the 2PL model unless $G_c(S(2))$ is compared to $G_c(S(3))$ with the coefficient $b$ at least 0.00083. The advantage of the 2PL model relative to the 1PL model is larger under all criteria, and $G_c(S(1))$ is greater than $G_c(S(2))$ unless $b$ is at least 0.00482. In the case of continuous data, arguments similar to those employed for discrete data can often be used (Gilula and Haberman, 2000); however, as in the case of likelihood-ratio tests, arguments are based on density functions rather than on probabilities.
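Given the maximized log likelihoods, the criteria are immediate to compute. The sketch below reproduces the Akaike and Schwarz corrections from the estimated expected penalties $\hat{G}(S)$ reported above; the dimensions follow the parameter counts given earlier, with $k(S(3)) = 3q$ an assumption consistent with three parameters per item:

```python
import math

n, q = 8686, 45                       # examinees and items from the text
g_hat = {"Ind": 28.110, "1PL": 26.838, "2PL": 26.621, "3PL": 26.583}
k = {"Ind": q, "1PL": q + 1, "2PL": 2 * q, "3PL": 3 * q}   # 3q is assumed

for model in g_hat:
    g_a = g_hat[model] + k[model] / n                       # Akaike estimate
    g_s = g_hat[model] + k[model] * math.log(n) / (2 * n)   # Schwarz criterion
    print(f"{model}: G_a = {g_a:.3f}, G_s = {g_s:.3f}")
```

For the 2PL model, this yields $G_a(S(2)) = 26.631$ and $G_s(S(2)) = 26.668$, matching the values quoted in the text.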

Bayesian Inference

Bayesian inference can be employed to select models (Kass and Raftery, 1995). For example, consider the data with 45 items and $\mathcal{S}$ equal to the set of $S(m)$ for $m$ from 0 to 3. Assume that the 3PL model does hold and that comparison is between the independence model, the 1PL model, the 2PL model, and the 3PL model. One may divide the set $S(3)$ corresponding to the 3PL model into four disjoint sets $U(0)$, $U(1)$, $U(2)$, and $U(3)$. Here $U(0)$ is $S(0)$ and, for $k$ equal to 1, 2, or 3, $U(k)$ consists of members of $S(k)$ that are not in $S(k - 1)$. Thus, $U(0)$ corresponds to the independence model, while $U(1)$ corresponds to the case in which the 1PL model holds but the independence model does not hold.

Let the prior probability that $p$ is in $U(k)$ be $\pi_k > 0$ for $0 \le k \le 3$, and let $P_k$ be the prior distribution of $p$ given that $p$ is in $U(k)$. In typical cases, each $P_k$ is associated with some positive probability densities on the set of model parameters which correspond to $p$ in $U(k)$. For $p$ in $U(0)$, a positive density is assigned to the $q$-dimensional vector with coordinates $\gamma_j$ for $1 \le j \le q$. Similarly, for $p$ in $U(1)$, a positive density is assigned to the $(q + 1)$-dimensional vector with coordinates $\gamma_j$, $1 \le j \le q$, and $a$, $a \ne 0$. Similar requirements are imposed for $p$ in $U(2)$ or for $p$ in $U(3)$. Given these prior distributions, Bayes's theorem can be employed to compute the conditional probability $p_k$ that $p$ is in $U(k)$ given the observed $X_i$, $1 \le i \le n$. This conditional probability depends on the log-likelihood function $\ell(u)$ for $u$ in $S(3)$. In this case, large values of $p_k$ suggest use of the model that $p$ is in $S(k)$.

Two practical obstacles exist. The value $p_k$ depends on prior distributions that may be difficult to specify in a realistic manner. Computation of $p_k$ is very difficult, although numerical quadrature, simulation methods, and approaches based on Laplace transforms can be considered. A very crude approximation can often be based on the Schwarz criterion $G_s(S(k))$ for $0 \le k \le 3$. Under relatively general conditions,

$$\frac{n[G_s(S(g)) - G_s(S(k))] - \log(p_k/p_g)}{\log(p_k/p_g)}$$

converges in probability to 0 as the sample size $n$ increases. This result may be used to show that $p_k$ converges in probability to 1 if $p$ is in $U(k)$. In the example under study, the result favors the 3PL model to the extent that $G_s(S(k))$ is smallest for $k = 3$. The Bayesian approach described here does not have any obvious basis if $p$ is not contained in $S(3)$. This restriction is severe if one takes the position that nontrivial models are rarely valid.
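As a rough illustration of how the Schwarz criterion can stand in for the posterior probabilities $p_k$, the sketch below converts the criteria into approximate posterior model probabilities under the assumption of equal prior probabilities $\pi_k$. The $G_s$ values are those computed in the previous sketch (rounded), and the whole computation is a crude approximation rather than a full Bayesian analysis:

```python
import math

n = 8686
# Schwarz criteria G_s(S(k)) from the previous sketch (3PL uses k = 3q).
g_s = {"Ind": 28.133, "1PL": 26.862, "2PL": 26.668, "3PL": 26.653}

# Approximate log posterior odds of each model against the best model:
# log(p_k / p_best) is approximated by n [G_s(best) - G_s(k)] when the
# prior probabilities are equal.
best = min(g_s, key=g_s.get)
log_odds = {m: n * (g_s[best] - g_s[m]) for m in g_s}
total = sum(math.exp(v) for v in log_odds.values())
posterior = {m: math.exp(v) / total for m, v in log_odds.items()}
print(posterior)   # essentially all posterior mass on the 3PL model
```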


Linear Models

Model selection need not be based on complete probability models. A common case involves linear models. For example, consider a sample of $n$ essays, for each of which both holistic scores generated by two human raters and computer-generated essay features are available (Haberman, 2007a). For essay $i$, let $Y_i$ be the average of the two holistic scores, and let $X_{ij}$ be the numerical value of essay feature $j$, $1 \le j \le k$. For example, $X_{i1}$ might be the square root of the number of characters in essay $i$. For each essay $i$, the computer-generated features $X_{ij}$, $1 \le j \le k$, are to be employed to predict the average holistic score $Y_i$. A simple approach considers a linear predictor

$$\tilde{Y}_i = \alpha + \sum_{j=1}^{k} \beta_j X_{ij}$$

for some real constants $\alpha$ and $\beta_j$, $1 \le j \le k$. The prediction error is $Y_i - \tilde{Y}_i$. If $X_i$ is the vector with coordinates $X_{ij}$, $1 \le j \le k$, then one may consider the case of pairs $(Y_i, X_i)$ that are independently and identically distributed. Let $Y_i$ have finite and positive variance $V(Y)$, and let $X_i$ have finite and positive-definite covariance matrix $\mathrm{Cov}(X)$. The expected mean-squared error is then the expected value $E([Y_i - \tilde{Y}_i]^2)$. If the mean $E(Y)$ of the $Y_i$, the mean $E(X)$ of the $X_i$, the variance $V(Y)$ of the $Y_i$, the covariance matrix $\mathrm{Cov}(X)$, and the covariance vector $\mathrm{Cov}(X, Y)$ with elements $\mathrm{Cov}(X_{ij}, Y_i)$, $1 \le j \le k$, are all known, then the expected mean-squared error $E([Y_i - \tilde{Y}_i]^2)$ is minimized if

$$\tilde{Y}_i = E(Y) + \beta'[X_i - E(X)],$$

where $\beta$, the vector of $\beta_j$, $1 \le j \le k$, is $[\mathrm{Cov}(X)]^{-1}\mathrm{Cov}(X, Y)$. The resulting mean-squared error, denoted by $V(Y|X)$, is the residual variance of $Y_i$ given $X_i$. In principle, $V(Y|X)$ can form the basis for evaluation of the value of $X_i$ in prediction of $Y_i$.

In practice, means, variances, and covariances must be estimated. Let

$$\bar{X} = n^{-1} \sum_{i=1}^{n} X_i$$

denote the sample mean of the $X_i$, and let

$$\bar{Y} = n^{-1} \sum_{i=1}^{n} Y_i$$

denote the sample mean of the $Y_i$. The least-squares estimates $a$ of $\alpha$ and $b$ of $\beta$ satisfy the equations

$$\hat{Y}_i = a + \sum_{j=1}^{k} b_j X_{ij},$$

$$a = \bar{Y} - b'\bar{X},$$

and

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i) X_i = 0.$$

Here $b$ has coordinates $b_j$, $1 \le j \le k$, and $0$ is the $k$-dimensional vector with all coordinates 0. As the sample size $n$ becomes large,

$$\hat{V}(Y|X) = n^{-1} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

approaches $V(Y|X)$ with probability 1, so that $\hat{V}(Y|X)$ may be employed to evaluate the effectiveness of $X_i$ in linear prediction of $Y_i$. Thus, $\hat{V}(Y|X)$ can provide a basis for selection of different predictors $X$ of $Y$. Corrections are commonly made to treat biases due to the finite sample size (Draper and Smith, 1998). In standard regression analysis, the assumption is added that, given $X_i$, the conditional mean of $Y_i$ is $\tilde{Y}_i$ and the conditional variance of $Y_i$ is $V(Y|X)$. In this case, if $X$ is continuous and $n > k + 1$, then $V(Y|X)$ has the unbiased estimate

$$\bar{V}(Y|X) = \frac{n}{n - k - 1}\,\hat{V}(Y|X).$$
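A minimal sketch of these least-squares computations, using NumPy on simulated data, is given below. The feature matrix and scores are hypothetical stand-ins for the computer-generated essay features $X$ and average holistic scores $Y$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
features = rng.normal(size=(n, k))                      # hypothetical X
scores = (3.5 + features @ np.array([0.5, -0.2, 0.3])
          + rng.normal(scale=0.4, size=n))              # hypothetical Y

design = np.column_stack([np.ones(n), features])        # intercept column
coef, *_ = np.linalg.lstsq(design, scores, rcond=None)  # (a, b_1, ..., b_k)
fitted = design @ coef
v_hat = np.mean((scores - fitted) ** 2)                 # V-hat(Y|X)
v_bar = v_hat * n / (n - k - 1)                         # unbiased estimate
print(v_hat, v_bar)
```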

These assumptions do not hold exactly in the essay example, for the $X_{ij}$ are discrete variables, and the traditional assumptions of regression are not likely to hold exactly given that holistic scores have a limited range, which in the data to be examined is from 1 to 6. An alternative approach with fewer assumptions uses PRESS residuals (Draper and Smith, 1981: 325–327). The idea is to predict a new average holistic score $Y_{n+1}$ by using the observed $(Y_i, X_i)$, $1 \le i \le n$, and the new computer-generated features $X_{(n+1)j}$, $1 \le j \le k$. Under traditional regression analysis, the mean-squared error from prediction of $Y_{n+1}$ by $\hat{Y}_{n+1} = a + b'X_{n+1}$ is $V(Y|X)(n + k + 1)/n$, and this mean-squared error has unbiased estimate $\bar{V}(Y|X)(n + k + 1)/n$. Without traditional assumptions, one may employ PRESS residuals. For each essay $i$, least-squares estimates $a_i$ for $\alpha$ and $b_{ji}$, $1 \le j \le k$, for $\beta_j$ are obtained by minimizing the sum of the $(Y_m - \hat{Y}_{mi})^2$ for $1 \le m \le n$, $m \ne i$, where

$$\hat{Y}_{mi} = a_i + \sum_{j=1}^{k} b_{ji} X_{mj}.$$

This approach estimates the mean-squared error for prediction of $Y_{n+1}$ by

$$\hat{V}_p(Y|X) = n^{-1} \sum_{i=1}^{n} (Y_i - \hat{Y}_{ii})^2,$$

where

$$\hat{Y}_{ii} = a_i + \sum_{j=1}^{k} b_{ji} X_{ij}.$$

This approach might appear tedious; however,

$$Y_i - \hat{Y}_{ii} = \frac{Y_i - \hat{Y}_i}{1 - H_{ii}}$$

for

$$H_{ii} = n^{-1} + (X_i - \bar{X})' \left[ \sum_{m=1}^{n} (X_m - \bar{X})(X_m - \bar{X})' \right]^{-1} (X_i - \bar{X})$$

(Draper and Smith, 1998: 207–208). It is clearly desirable to have $\hat{V}_p(Y|X)$ smaller. For one case with 4895 essays and $k = 6$ predictors, one has $\hat{V}_p(Y|X) = 0.238$. An alternative predictor was considered with $k = 56$ predictors. This case had $\hat{V}_p(Y|X) = 0.235$, so that a very modest gain was achieved with a much larger set of predictors. A further very modest gain was achieved with 206 predictors, for $\hat{V}_p(Y|X)$ went to 0.232. Whether the modest gain is really worth the cost in terms of predictors needed is a matter of judgment.
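The shortcut formula means that $\hat{V}_p(Y|X)$ requires only a single least-squares fit rather than $n$ refits. Below is a minimal sketch under the same hypothetical-data assumptions as the previous sketch; the hat-matrix diagonal is computed directly from a design matrix with an intercept column, which is equivalent to the centered form of $H_{ii}$ given above:

```python
import numpy as np

def press_estimate(design, scores):
    """PRESS estimate of the mean-squared prediction error, computed from a
    single least-squares fit via the shortcut
    (Y_i - Y_ii) = (Y_i - Y_hat_i) / (1 - H_ii)."""
    coef, *_ = np.linalg.lstsq(design, scores, rcond=None)
    residuals = scores - design @ coef
    # Diagonal of the hat matrix H = X (X'X)^{-1} X'.
    hat_diag = np.einsum(
        "ij,jk,ik->i", design, np.linalg.inv(design.T @ design), design
    )
    press_residuals = residuals / (1.0 - hat_diag)
    return np.mean(press_residuals ** 2)

# Hypothetical data standing in for essay features and holistic scores.
rng = np.random.default_rng(1)
n, k = 300, 6
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ rng.normal(size=k + 1) + rng.normal(scale=0.5, size=n)
print(press_estimate(X, y))
```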

Concluding Remarks

Model selection issues are much more varied than can be indicated in a brief survey. The emphasis in this article has been on simple random sampling. Modifications for more complex sampling are readily accomplished. Although the emphasis here has been on independent observations, model selection is also an issue in the analysis of time series in which observations are dependent. Except for the discussion of linear models, the ability of independent variables to predict dependent variables has not been explored. Use of models which involve such prediction arises for polytomous dependent variables as well as for linear models (Gilula and Haberman, 2001).

The emphasis in this article has involved situations in which the sample sizes are relatively large and the number of models to consider is relatively small. When sample sizes are not very large and the number of models is large, issues of selection bias must be considered. In these cases, a model is much more likely to appear to be the most appropriate due to random variation rather than due to actual superiority of the model. This issue is frequently encountered in regression analysis (Draper and Smith, 1998). In general, model selection depends on the nature of the data, the sample size, and the intended application of the results. The approach must be adapted to the problem at hand.

See also: Categorical Data Analysis; Educational Data Modeling; Goodness-of-Fit Testing; Univariate Linear Regression.

Bibliography

Cochran, W. G. (1954). Some methods for strengthening the common χ² tests. Biometrics 10, 417–451.
Draper, N. R. and Smith, H. (1981). Applied Regression Analysis, 2nd edn. New York: Wiley.
Draper, N. R. and Smith, H. (1998). Applied Regression Analysis, 3rd edn. New York: Wiley.
Fisher, R. A. (1924). The conditions under which χ² measures the discrepancy between observations and hypothesis. Journal of the Royal Statistical Society 87, 442–450.
Gilula, Z. and Haberman, S. J. (2000). Density approximation by summary statistics: An information-theoretic approach. Scandinavian Journal of Statistics 27, 521–534.
Gilula, Z. and Haberman, S. J. (2001). Analysis of categorical response profiles by informative summaries. Sociological Methodology 31, 129–187.
Goodman, L. A. (1971). The analysis of multidimensional contingency tables: Stepwise procedures and direct estimation methods for building models for multiple classifications. Technometrics 13, 33–61.
Haberman, S. J. (1974). Log-linear models for frequency tables with ordered classifications. Biometrics 30, 589–600.
Haberman, S. J. (1977). Log-linear models and frequency tables with small expected cell counts. Annals of Statistics 5, 1148–1169.
Haberman, S. J. (2005). Latent-Class Item Response Models. (Research Report RR-05-28). Princeton, NJ: Educational Testing Service.
Haberman, S. J. (2007a). Electronic essay grading. In Rao, C. R. and Sinharay, S. (eds.) Handbook of Statistics, vol. 26, pp 205–233. Amsterdam: North-Holland.
Haberman, S. J. (2007b). The Information a Test Provides on an Ability Parameter. (Research Report RR-07-18). Princeton, NJ: Educational Testing Service.
Hambleton, R. K., Swaminathan, H., and Rogers, H. J. (1991). Fundamentals of Item Response Theory. Newbury Park, CA: Sage.
Kass, R. and Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association 90, 773–795.
Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine 50, 157–175.
Savage, L. (1971). Elicitation of personal probabilities and expectations. Journal of the American Statistical Association 66, 783–801.
Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. Annals of Mathematical Statistics 9, 60–62.