Sequential estimation in variable length computerized adaptive testing


Journal of Statistical Planning and Inference 121 (2004) 249 – 264


Yuan-chin Ivan Chang^{a,∗,1}, Zhiliang Ying^{b,2}

^a Institute of Statistical Science, Academia Sinica, 128, Sec. 2 Academia Road, Taipei 11529, Taiwan
^b Department of Statistics, Columbia University, New York, NY 10027, USA

Received 28 July 2001; accepted 3 December 2002

Abstract

With the advent of modern computer technology, there have been growing efforts in recent years to computerize standardized tests, including the popular Graduate Record Examination (GRE), the Graduate Management Admission Test (GMAT) and the Test of English as a Foreign Language (TOEFL). Many such computer-based tests are known as computerized adaptive tests, a major feature of which is that, depending on their performance in the course of testing, different examinees may be given different sets of items (questions). In doing so, items can be efficiently utilized to yield maximum accuracy for estimation of examinees' ability traits. We consider, in this article, one type of such tests in which test lengths vary with examinees so as to yield approximately the same predetermined accuracy for all ability traits. A comprehensive large sample theory is developed for the expected test length and for the sequential point and interval estimates of the latent trait. Extensive simulations are conducted, with results showing that the large sample approximations are adequate for realistic sample sizes. © 2003 Elsevier B.V. All rights reserved.

MSC: primary 62F12; secondary 62J05

Keywords: Computerized testing; Stopping rule; Confidence interval; Rasch model; Logistic regression; Tailored test; Adaptive testing; Variable length test



∗ Corresponding author. Tel.: +886-2-7835611; fax: +886-2-7831523. E-mail address: [email protected] (Yuan-chin Ivan Chang).
1 Supported in part by NSC87-2118-M-001-010.
2 Supported in part by NSF grant DMS 9971791.

0378-3758/$ - see front matter © 2003 Elsevier B.V. All rights reserved.
doi:10.1016/S0378-3758(03)00119-8


1. Introduction

Computerized adaptive testing (CAT) has become increasingly important to standardized testing, as evidenced by the transition of several large scale tests, including the Graduate Record Examination (GRE), the Graduate Management Admission Test (GMAT) and the Test of English as a Foreign Language (TOEFL), from the traditional paper-and-pencil (P&P) version to CAT. A crucial ingredient in CAT is that different examinees can be assigned different sets of items (questions) from a large item bank. For each examinee, the assignment is done sequentially according to his/her performance on the preceding items. This is very different from the traditional P&P test, where all examinees are given the same set of items.

The CAT was conceptualized initially by Lord (1970, 1971a,b), who was motivated by the development of sequential methodology in statistics, particularly the stochastic approximation procedure of Robbins and Monro (1951). He argued that "an examinee is measured effectively if the item is neither too difficult nor too easy for him". In terms of designing a test, this means that optimal item selection for an examinee should be based upon, or adapted to, his/her ability trait level. Thus adaptive tests may have a substantial advantage over P&P tests. Obviously, implementation of such an idea requires sophisticated and powerful computing technology.

A centerpiece of CAT is examinee-specific, response-dependent item selection to improve efficiency. To this end, there have been extensive discussions in the psychometrics literature with various proposals; see, in addition to the work by Lord, Owen (1975), Weiss (1976, 1982), Sympson and Hetter (1985), Lewis and Sheehan (1990), van der Linden (1998) and Chang and Ying (1999), among many others. For a comprehensive survey of the subject, including discussions of key issues, we refer to Wainer (2000).

We consider, in this article, adaptive testing which achieves a pre-specified accuracy for estimating an examinee's ability trait. As pointed out by Thissen (2000), if implementable, tests of this kind are highly desirable since they lead to uniformly high reliability across different latent trait levels. Their statistical aspect is closely related to the classic sequential fixed-width estimation problem, initiated by Stein's (1945) two-stage procedure and subsequently developed by Anscombe (1952), Chow and Robbins (1965) and Chang and Martinsek (1992), among many others. See also Siegmund (1985) and Woodroofe (1982). For more recent discussion of sequential estimation, we refer readers to Ghosh et al. (1997). We propose a stopping rule analogous to that of Chang and Martinsek (1992) for estimation of the examinee's latent trait in logistic item response theory (IRT) model-based adaptive testing. Theoretical justification is then provided by establishing some key large sample properties.

The rest of this paper is organized as follows. In Section 2, a mathematical description of logistic IRT model-based adaptive testing is presented and the corresponding variable length CAT is defined. Some key asymptotic properties of the variable length CAT are established in Section 3. Results from extensive simulation studies are summarized in Section 4. Section 5 contains some discussion. All proofs are relegated to Appendix A.


2. Logistic IRT models and variable length CAT

Logistic models are among the most popular statistical models for IRT-based standardized tests. The item characteristic curve (ICC) of a three-parameter logistic (3-PL) model is defined as

\[ P(\theta) \equiv P(Y = 1 \mid \theta; a, b, c) = c + (1-c)\,\frac{\exp(a(\theta-b))}{1+\exp(a(\theta-b))}. \tag{2.1} \]

Eq. (2.1) models the probability of a correct answer by an examinee with latent ability $\theta$ to a given item with parameters $a$, $b$ and $c$. Here $Y = 1$ ($0$) denotes whether the item is answered correctly (incorrectly), and $a$, $b$ and $c$ are the item parameters of discrimination, difficulty and guessing, respectively. The special case of $c = 0$ (no guessing) is called the two-parameter logistic (2-PL) model. If, in addition to $c = 0$, the discrimination parameters of all items in the bank are the same, then it is called the Rasch model (see Rasch, 1960).

Let the vector $(a, b, c)$ denote an item selected from an item pool $\mathcal{B}$ with item parameters $a$, $b$ and $c$. Suppose that we are at the $(n-1)$th step of an adaptive test, so that $n-1$ items $(a_i, b_i, c_i)$, $i = 1, \ldots, n-1$, have been administered to an examinee. Let $Y_1, \ldots, Y_{n-1}$ be the corresponding responses of the examinee, and let $\mathcal{F}_{n-1}$ denote the $\sigma$-field generated by $(a_i, b_i, c_i)$ and $Y_i$, $i = 1, \ldots, n-1$. The selection of the $n$th item of the adaptive test is thus based on the knowledge in $\mathcal{F}_{n-1}$.

Suppose that $\theta_0$ is the true (but unknown) value of the examinee's latent trait level. At the $n$th step of the test, the maximum likelihood estimate $\hat\theta_n$ of $\theta_0$ can be obtained from the observations $(a_i, b_i, c_i)$ and $Y_i$, $i = 1, \ldots, n$. Under suitable regularity conditions, $\hat\theta_n$ can be shown to be asymptotically normal with mean $\theta_0$ and variance $I_n^{-1}(\theta_0)$, where $I_n(\theta)$ is the Fisher information

\[ I_n(\theta) = \sum_{i=1}^{n} \frac{\big(\partial P_i(\theta)/\partial\theta\big)^2}{P_i(\theta)\,Q_i(\theta)}. \tag{2.2} \]

Here $P_i$ is the ICC of the $i$th item as defined in (2.1), and $Q_i = 1 - P_i$. Thus, as $n \to \infty$, we expect

\[ \sqrt{I_n(\hat\theta_n)}\,(\hat\theta_n - \theta_0) \to_{\mathcal{L}} N(0, 1). \tag{2.3} \]

This suggests the following stopping rule for constructing a confidence interval for $\theta_0$ with prescribed width $2d$ and coverage probability $1-\alpha$:

\[ T_d = \inf\{\, n \geq 1 : I_n(\hat\theta_n) \geq C_d \,\}, \tag{2.4} \]

where $C_d = (z_{\alpha/2}/d)^2$ and $z_{\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution. An asymptotically equivalent stopping rule that can reduce the occurrence of undesirable early stopping is the modification

\[ T_d^* = \inf\{\, n \geq 1 : I_n(\hat\theta_n) \geq C_d + k_n \,\}, \tag{2.5} \]

where $k_n$ is a sequence of positive constants and $k_n \to 0$ as $n \to \infty$. It is not difficult to see that all asymptotic results to be developed in Section 3 for $T_d$ also hold for $T_d^*$.
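To make (2.1)–(2.4) concrete, here is a minimal Python sketch (ours, not the authors'; all function names are hypothetical) of the 3-PL ICC, the accumulated Fisher information, and the stopping-rule check. Note that at stopping the interval $\hat\theta \pm z_{\alpha/2}/\sqrt{I_n(\hat\theta)}$ automatically has half-width at most $d$, since $I_n(\hat\theta) \geq C_d = (z_{\alpha/2}/d)^2$.

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """Eq. (2.1): probability of a correct response under the 3-PL model."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """One item's Fisher information (dP/dtheta)^2 / (P*Q), cf. Eq. (2.2)."""
    p = icc_3pl(theta, a, b, c)
    g = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # logistic cdf G
    dp = (1.0 - c) * a * g * (1.0 - g)           # dP/dtheta
    return dp ** 2 / (p * (1.0 - p))

def should_stop(theta_hat, items, d, z_alpha2=1.96):
    """Stopping rule (2.4): stop once I_n(theta_hat) >= C_d = (z/d)^2."""
    info = sum(item_information(theta_hat, a, b, c) for (a, b, c) in items)
    return info >= (z_alpha2 / d) ** 2
```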


In the case of the Rasch model (i.e. $c_i \equiv 0$ and $a_i \equiv a$), the information function of an item with difficulty parameter $b_i$ is simply $a^2 e^{a(\theta-b_i)}/[1+e^{a(\theta-b_i)}]^2$, which is bounded above by $a^2/4$. Thus, at $\theta = \theta_0$,

\[ I_n(\theta_0) = \sum_{i=1}^{n} \frac{a^2\, e^{a(\theta_0-b_i)}}{[1+e^{a(\theta_0-b_i)}]^2} \leq \frac{n a^2}{4}. \tag{2.6} \]

Clearly, the information is maximized when $b_i \equiv \theta_0$, in which case the above inequality becomes an equality. This means that if the true $\theta_0$ were known, then in order to have the most efficient test (i.e. the one with the shortest test length), one would choose items whose difficulty parameters match the examinee's latent trait level. In doing so, one achieves the upper bound of the Fisher information, $na^2/4$. Since $\theta_0$ is unknown in practice, at the $n$th stage an adaptive version of this optimal item selection is to choose an item with $b_n$ that most closely approximates the current estimate $\hat\theta_{n-1}$, subject to availability and other constraints.

The principle of matching difficulty $b$ with ability $\theta_0$ in order to maximize the Fisher information continues to hold under the 2-PL model, although consideration of $a_i$ in the item selection process may cause some difficulties; cf. Chang and Ying (1999). For the 3-PL model, a slight shift to the left of $\theta_0$ is needed in choosing $b_i$; cf. Lord (1980, p. 152).

In a CAT system, different examinees respond to different sets of items. When the maximum information rule described above is used for item selection, each examinee gives correct answers to about half of the items administered to him/her. Therefore, the number-correct score is not a useful measurement for predicting the latent trait level. Moreover, a flat information function over the range of latent trait levels of interest is usually more desirable, as it results in more uniformly accurate estimation. However, it is very unlikely for a fixed-length CAT to achieve a flat information function; in other words, there will be different precisions for different examinees; see also Wainer (2000). On the other hand, the variable-length CAT given by (2.4) can achieve the goal of uniformly accurate estimation, as will be proved in the following section.
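A quick numerical illustration of the bound in (2.6), as a sketch of our own (the values in the comments are approximate):

```python
import numpy as np

def rasch_item_info(theta, b, a=1.0):
    """Item information a^2 * P * Q for the Rasch/2-PL model (c = 0)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

theta0 = 0.5
print(rasch_item_info(theta0, b=theta0))        # 0.25  = a^2/4, the upper bound
print(rasch_item_info(theta0, b=theta0 + 1.0))  # ~0.197, information lost to mismatch
print(rasch_item_info(theta0, b=theta0 + 2.0))  # ~0.105
```

Each unit of mismatch between $b$ and $\theta$ costs a sizable fraction of the attainable information, which is why the adaptive rule tracks $\hat\theta_{n-1}$.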

3. Asymptotic properties

In this section, we establish asymptotic properties of stopping rule (2.4) and the associated inference procedures. To do so, we first establish consistency and asymptotic normality of the maximum likelihood estimate $\hat\theta_n$ under adaptive designs. All items are assumed to follow the general 3-PL model, of which the 2-PL and Rasch models are special cases. In addition, the following regularity conditions are needed throughout this section.

(C1) The number of items in the bank cannot be exhausted, i.e., the size of the item bank is infinite.


(C2) The parameters of all selected items satisfy the following constraints:

\[ \sup_n |b_n| < \infty, \qquad 0 < \inf_n a_n \leq \sup_n a_n < \infty \qquad \text{and} \qquad \sup_n c_n < 1 \quad a.s. \]

(C3) There exists a nonrandom sequence $v_n$ such that $I_n(\theta_0)/v_n \to 1$ a.s.

Remark 3.1. The above conditions are quite minimal. Condition (C1) is necessary since, without it, asymptotics become meaningless. Condition (C2) ensures that each item's information is bounded and that the cumulative Fisher information goes to infinity; it can be interpreted as the Feller condition in the Lindeberg–Feller central limit theorem. Condition (C3) is equivalent to requiring the normalizing constant in the asymptotic normality to be nonrandom, a condition often found in the martingale central limit theorem, which will be used in establishing the asymptotic normality of our estimate.

The likelihood function after $n$ items have been administered is

\[ L(\theta) = \prod_{i=1}^{n} P_i^{Y_i}(\theta)\, Q_i^{1-Y_i}(\theta), \tag{3.1} \]

recalling that $P_i(\theta) = 1 - Q_i(\theta) = c_i + (1-c_i)G(a_i(\theta-b_i))$, where $G(t) = e^t/(1+e^t)$. Therefore the likelihood estimating function can be written as

\[ \sum_{i=1}^{n} w_i(\theta)\,\big[\, Y_i - c_i - (1-c_i)G(a_i(\theta-b_i)) \,\big], \tag{3.2} \]

where $w_i(\theta) = \partial \log[P_i(\theta)/Q_i(\theta)]/\partial\theta$. The weight $w_i(\theta)$ may be replaced by any $\mathcal{F}_{i-1}$-measurable weight and the resulting estimating function is still valid; this is easily seen from $E[Y_i - P_i(\theta_0) \mid \mathcal{F}_{i-1}] = 0$.

Define $\hat\theta_n$ as a maximizer of (3.1) or a root of (3.2). We shall assume that the maximization or root finding is over a compact interval containing the true parameter $\theta_0$ as an interior point. The following theorem shows that the maximum likelihood estimate $\hat\theta_n$ is consistent and asymptotically normal. The proofs of this and subsequent theorems are given in Appendix A.

Theorem 3.1. Under conditions (C1) and (C2), the maximum likelihood estimate $\hat\theta_n$ is strongly consistent:

\[ \hat\theta_n \to \theta_0 \quad a.s. \text{ as } n \to \infty. \]

In addition, if condition (C3) is also satisfied, then asymptotic normality holds:

\[ \sqrt{I_n(\hat\theta_n)}\,(\hat\theta_n - \theta_0) \to_{\mathcal{L}} N(0,1) \quad \text{as } n \to \infty. \]

It follows from the above theorem that, for large $n$, $\hat\theta_n \pm z_{\alpha/2}/\sqrt{I_n(\hat\theta_n)}$ is a confidence interval for $\theta_0$ with asymptotic coverage probability $1-\alpha$. Hence, we can use the stopping rule $T_d$ defined by (2.4) to construct a confidence interval with a prescribed accuracy, say $d$ ($d > 0$). As $d \to 0$, $T_d \to \infty$. So, if we can show that the asymptotic normality continues to hold when $n$ is replaced by $T_d$, then we obtain the following result.
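For readers who wish to compute $\hat\theta_n$ in practice, the following Fisher-scoring sketch finds a root of (3.2) over a compact interval (our illustration with hypothetical names; Newton–Raphson or a grid search would serve equally well):

```python
import numpy as np

def score_and_info(theta, items, y):
    """Score (3.2) with w_i = P_i'/(P_i Q_i), plus Fisher information (2.2)."""
    s = info = 0.0
    for (a, b, c), yi in zip(items, y):
        g = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # logistic cdf G
        p = c + (1.0 - c) * g                        # P_i(theta)
        dp = (1.0 - c) * a * g * (1.0 - g)           # P_i'(theta)
        w = dp / (p * (1.0 - p))                     # weight w_i(theta)
        s += w * (yi - p)
        info += dp ** 2 / (p * (1.0 - p))
    return s, info

def mle(items, y, lo=-4.0, hi=4.0, tol=1e-8, max_iter=100):
    """Fisher scoring for the MLE, constrained to the compact interval [lo, hi]."""
    theta = 0.0
    for _ in range(max_iter):
        s, info = score_and_info(theta, items, y)
        step = s / max(info, 1e-12)
        theta = min(hi, max(lo, theta + step))
        if abs(step) < tol:
            break
    return theta
```

As the text notes, the root is an interior point only once both response values have been observed; the simulation steps in Section 4 make this explicit.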


Theorem 3.2. Suppose that conditions (C1), (C2) and (C3) are satisfied. Let $T_d$ be the stopping time defined by (2.4). Then

\[ P\big( \theta_0 \in \hat\theta_{T_d} \pm z_{\alpha/2}\, I_{T_d}^{-1/2}(\hat\theta_{T_d}) \big) \to 1 - \alpha \quad \text{as } d \to 0. \]

The Fisher information used in setting the stopping rule $T_d$ makes use of the estimate $\hat\theta_n$. Of course, one would use the true parameter value $\theta_0$ if it were available. In other words, instead of $T_d$, ideally one would like to use the stopping rule

\[ n_0 = \inf\{\, n \geq 1 : I_n(\theta_0) \geq C_d \,\}. \tag{3.3} \]

In the theorem below, using the asymptotic normality of $\hat\theta_n$, we show that $n_0$ and $T_d$ are asymptotically equivalent.

Theorem 3.3. Suppose that conditions (C1), (C2) and (C3) are satisfied. In addition, assume that there exist constants $0 < m_a < M_a$ and $m_b < M_b$ such that $m_a \leq a_i \leq M_a$ and $m_b < b_i < M_b$ for all $i$. Then, as $d \to 0$, $E(T_d/n_0) \to 1$. Moreover, for the Rasch model, $T_d - n_0 = O(\log(n))$ as $d \to 0$.

The results of Theorems 3.2 and 3.3 are called asymptotic consistency and efficiency, respectively, in the sense of Chow and Robbins (1965). They entail that the coverage frequency converges to the required coverage probability and that the ratio of the expected sample size to the "optimal" sample size approaches 1 as $d \to 0$. The sharper result $T_d - n_0 = O(\log(n))$, as compared with the usual $O(\sqrt{n})$ order, is due to the adaptive design procedure of the CAT.

4. Simulations

Simulation studies were conducted to assess the performance of the proposed method under various scenarios. All simulation results are based on the general 3-PL model defined in (2.1). The nominal coverage probabilities in all simulations reported here are set at the usual 95%. Six precision levels, ranging from $d = 0.5$ to $1.0$, are used. There are two sets of simulation studies. In Study 1, the item parameters are generated by computer. In Study 2, the item parameters were taken from a National Assessment of Educational Progress (NAEP) sample, which was supplied by Professor Hua-Hua Chang. All simulations were performed on personal computers using Digital Visual Fortran.

4.1. Study 1

In Study 1, with the exception of the initial items, the item selection rule of matching difficulty level $b$ and latent trait $\theta$ is used, and the discrimination parameter $a$ and the guessing parameter $c$ are chosen randomly from uniform distributions on $[0.5, 2.5]$ and $[0, 0.2]$, respectively. Tables 1–5 contain summary results of the simulations, corresponding to five latent trait values: $\theta = 0$, $\pm 1$ and $\pm 2$, respectively.


Table 1. Variable-length CAT, θ = 0

Precision (d)   Coverage freq. (%)   Mean θ̂ (Var)       Stopping time: mean (Var)
0.5             94.6                  0.008 (0.067)      119.0 (17.0)
0.6             95.2                  0.009 (0.091)       89.5 (12.7)
0.7             96.2                 −0.004 (0.116)       72.4 (10.5)
0.8             96.8                  0.027 (0.135)       58.7 (8.91)
0.9             97.9                  0.016 (0.140)       52.9 (7.83)
1.0             97.5                  0.004 (0.195)       42.9 (6.03)

Table 2. Variable-length CAT, θ = 1

Precision (d)   Coverage freq. (%)   Mean θ̂ (Var)       Stopping time: mean (Var)
0.5             94.4                  1.004 (0.072)      118.5 (16.8)
0.6             95.4                  1.010 (0.093)       89.6 (12.3)
0.7             94.9                  1.000 (0.118)       72.6 (10.8)
0.8             97.1                  1.021 (0.135)       58.8 (8.66)
0.9             98.7                  1.006 (0.129)       52.8 (7.60)
1.0             97.3                  1.049 (0.195)       43.0 (6.28)

Table 3. Variable-length CAT, θ = −1

Precision (d)   Coverage freq. (%)   Mean θ̂ (Var)       Stopping time: mean (Var)
0.5             94.8                 −0.986 (0.067)      119.0 (18.1)
0.6             93.1                 −0.993 (0.111)       89.8 (13.8)
0.7             96.3                 −1.000 (0.113)       72.6 (11.0)
0.8             95.5                 −0.991 (0.154)       58.8 (8.87)
0.9             98.1                 −0.986 (0.133)       53.0 (7.49)
1.0             97.4                 −0.958 (0.184)       42.9 (5.78)

The first item's difficulty $b_1$ is chosen randomly from $[-3.6, 3.6]$. A summary description of the key steps in the simulation is given below.

Step 1 (Initialization): Let $b_1$ be randomly chosen from $[-3.6, 3.6]$. Suppose the first response is correct ($Y_1 = 1$), and let $k$ be the smallest integer at which the first incorrect response occurs, i.e. $Y_1 = \cdots = Y_{k-1} = 1$ and $Y_k = 0$. Then the 2nd through $k$th items are chosen with $b_i$ drawn randomly from $[b_{i-1}, 3.6]$ for $i = 2, \ldots, k$ (so that $b_1 \leq b_2 \leq \cdots \leq b_k$). Similarly, if the first response is incorrect ($Y_1 = 0$), let $h$ be the smallest integer at which the first correct response occurs ($Y_1 = \cdots = Y_{h-1} = 0$ and $Y_h = 1$); the 2nd through $h$th items are then chosen with $b_i$ drawn from $[-3.6, b_{i-1}]$ for $i = 2, \ldots, h$ (so that $b_1 \geq b_2 \geq \cdots \geq b_h$). A code sketch of this staircase is given below.

Step 2 (Stopping rule): Once both a 0 and a 1 have been observed among the responses, the MLE $\hat\theta$ of $\theta$ can be obtained and the stopping rule checked.
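A sketch of the Step 1 staircase (ours; the `answer` callback, which administers an item of difficulty $b$ and returns the 0/1 response, and the safety cap on the loop are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def initialize(answer, b_min=-3.6, b_max=3.6, max_items=50):
    """Step 1: keep moving difficulty in the same direction until both a
    correct and an incorrect response have been observed."""
    b = rng.uniform(b_min, b_max)
    bs, ys = [b], [answer(b)]
    first = ys[0]
    while ys[-1] == first and len(ys) < max_items:
        if first == 1:
            b = rng.uniform(b, b_max)     # harder: b_i from [b_{i-1}, 3.6]
        else:
            b = rng.uniform(b_min, b)     # easier: b_i from [-3.6, b_{i-1}]
        bs.append(b)
        ys.append(answer(b))
    return bs, ys
```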


Table 4. Variable-length CAT, θ = 2

Precision (d)   Coverage freq. (%)   Mean θ̂ (Var)       Stopping time: mean (Var)
0.5             94.9                  2.014 (0.072)      119.2 (17.2)
0.6             94.1                  2.010 (0.100)       90.1 (12.8)
0.7             94.6                  2.018 (0.128)       72.9 (10.4)
0.8             95.2                  2.030 (0.158)       59.1 (9.38)
0.9             96.4                  2.029 (0.154)       53.5 (8.51)
1.0             96.0                  1.999 (0.219)       43.3 (7.66)

Table 5. Variable-length CAT, θ = −2

Precision (d)   Coverage freq. (%)   Mean θ̂ (Var)       Stopping time: mean (Var)
0.5             91.7                 −1.994 (0.081)      119.8 (18.2)
0.6             92.3                 −2.009 (0.119)       90.1 (15.1)
0.7             93.8                 −1.982 (0.143)       73.3 (13.2)
0.8             94.3                 −2.006 (0.168)       59.7 (10.3)
0.9             96.7                 −2.011 (0.164)       53.8 (8.96)
1.0             93.7                 −1.987 (0.258)       43.6 (8.01)

If stopping rule (2.4) is not satisfied, the next item is administered with $b = \hat\theta - a^{-1}\log\big((1+\sqrt{1+8c}\,)/2\big)$; see Lord (1980) or Chang and Ying (1999). Step 2 is repeated until the stopping rule is satisfied.

Step 3 (Confidence interval): Once the stopping rule is satisfied, no further items are assigned. The confidence interval for $\theta$ is constructed from the MLE obtained at stopping and the corresponding estimate of the Fisher information.

As expected, the empirical coverage probabilities are quite close to the target 95% confidence level, especially as $d$ gets smaller. The exception is $\theta = -2$, where the coverage probabilities for $d = 0.7$ to $1.0$ are nevertheless still very close to the 95% target. The biases are negligible in all cases, i.e., the estimates are essentially unbiased; indeed, compared with the corresponding variances, the biases are of much smaller order and have no real impact. In addition, for the same precision level, the expected sample sizes are quite close across the different values of $\theta$.

4.2. Study 2

In Study 2, the item selection rules are basically the same as in Study 1, except that we now use items whose parameters are taken from a NAEP sample. There are 252 items in this pool, with parameters $a \in [0.420, 2.502]$, $b \in [-2.325, 3.061]$ and $c \in [0, 0.425]$; 124 of these items have $c = 0$. Fig. 1 shows the distributions of all three parameters in the item pool.

[Fig. 1. Distributions of parameters: histograms of a, b and c over the 252 NAEP items.]

Table 6. NAEP items, θ = 0

Precision (d)   Coverage freq. (%)   Mean θ̂ (Var)       Stopping time: mean (Var)
0.5             94.9                 −0.004 (0.063)       60.7 (16.16)
0.6             96.3                  0.003 (0.085)       43.6 (13.99)
0.7             94.7                  0.006 (0.132)       33.4 (12.53)
0.8             93.7                 −0.001 (0.170)       26.5 (9.73)
0.9             95.2                 −0.014 (0.198)       21.6 (8.47)
1.0             95.3                 −0.028 (0.243)       18.6 (6.92)

In this study, items are selected without replacement for each examinee; i.e., no item is administered to the same examinee twice. Because the NAEP items are limited, it is unlikely that an item whose $b$ exactly matches the theoretical value can be selected at each step of an adaptive test. In that case, the item whose $b$ parameter is closest to the theoretical value is chosen instead; a code sketch is given below. Tables 7–10 summarize the simulation results of Study 2.

Comparing the results obtained here with those of Study 1, we find that the means and the variances of the estimates are similar between the two studies, except in the case $\theta = -2$. Again, with the exception of $\theta = -2$, the coverage frequencies in all other cases are very close to the target 95%. In general, the average test lengths (stopping times) of Study 2 are shorter than those of Study 1, but with larger variations. From the distributions of the parameters in Fig. 1, it is obvious that they are very different from the uniform distributions from which the item parameters in Study 1 were generated. In addition, the parameter $b$ in the NAEP item pool has a shorter range ($-2.325$ to $3.061$) than in Study 1. We suspect this is why the performance at $\theta = -2$ is not as good as we expected.
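Schematically, Steps 1–3 combined with Study 2's finite-pool selection might look as follows. This is our sketch: the `answer` callback is hypothetical, the per-item target $b^* = \hat\theta - a^{-1}\log((1+\sqrt{1+8c}\,)/2)$ is the offset from Step 2, comparing each unused item's $b$ with its own target is our reading of the closest-$b$ rule, and `mle`/`score_and_info` are the helpers sketched in Section 3.

```python
import numpy as np

def target_b(theta_hat, a, c):
    """Maximum-information difficulty for a 3-PL item (Lord, 1980):
    b* = theta_hat - (1/a)*log((1 + sqrt(1 + 8c))/2); b* = theta_hat when c = 0."""
    return theta_hat - np.log((1.0 + np.sqrt(1.0 + 8.0 * c)) / 2.0) / a

def select_item(theta_hat, pool, used):
    """Unused item whose b is closest to its own target b* (without replacement)."""
    best, best_gap = None, np.inf
    for j, (a, b, c) in enumerate(pool):
        if j not in used and abs(b - target_b(theta_hat, a, c)) < best_gap:
            best, best_gap = j, abs(b - target_b(theta_hat, a, c))
    return best                                  # None once the pool is exhausted

def cat_run(pool, answer, d=0.5, z=1.96, theta_start=0.0):
    """Administer items until I_n(theta_hat) >= (z/d)^2, then report the
    fixed-width interval theta_hat +/- z / sqrt(I_n)."""
    used, items, ys = set(), [], []
    theta_hat, info = theta_start, 0.0
    while True:
        j = select_item(theta_hat, pool, used)
        if j is None:
            break                                # finite pool exhausted
        used.add(j)
        items.append(pool[j])
        ys.append(answer(pool[j]))
        if 0 in ys and 1 in ys:                  # MLE exists in the interior
            theta_hat = mle(items, ys)
            _, info = score_and_info(theta_hat, items, ys)
            if info >= (z / d) ** 2:             # stopping rule (2.4)
                break
    half = z / np.sqrt(max(info, 1e-12))
    return theta_hat, (theta_hat - half, theta_hat + half), len(items)
```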


Table 7. NAEP items, θ = 1

Precision (d)   Coverage freq. (%)   Mean θ̂ (Var)       Stopping time: mean (Var)
0.5             96.3                  1.000 (0.057)       71.4 (59.13)
0.6             96.4                  0.979 (0.082)       52.1 (48.30)
0.7             96.4                  0.967 (0.118)       40.4 (37.58)
0.8             96.5                  0.971 (0.148)       32.0 (38.44)
0.9             94.9                  0.951 (0.197)       27.5 (33.29)
1.0             97.0                  0.962 (0.210)       23.5 (26.94)

Table 8. NAEP items, θ = −1

Precision (d)   Coverage freq. (%)   Mean θ̂ (Var)       Stopping time: mean (Var)
0.5             94.4                 −0.979 (0.072)       64.7 (23.04)
0.6             94.4                 −0.981 (0.099)       46.2 (17.72)
0.7             94.4                 −0.974 (0.137)       35.0 (14.98)
0.8             94.0                 −0.965 (0.177)       28.1 (14.52)
0.9             93.9                 −0.965 (0.233)       23.2 (12.32)
1.0             95.0                 −0.966 (0.274)       19.6 (10.11)

Table 9. NAEP items, θ = 2

Precision (d)   Coverage freq. (%)   Mean θ̂ (Var)       Stopping time: mean (Var)
0.5             95.5                  1.981 (0.064)      148.4 (37.20)
0.6             96.9                  1.959 (0.077)       89.6 (23.80)
0.7             96.8                  1.940 (0.100)       63.7 (14.20)
0.8             97.2                  1.937 (0.128)       50.4 (10.30)
0.9             95.6                  1.879 (0.173)       40.8 (8.15)
1.0             95.8                  1.881 (0.208)       34.9 (7.64)

Table 10. NAEP items, θ = −2

Precision (d)   Coverage freq. (%)   Mean θ̂ (Var)       Stopping time: mean (Var)
0.5             86.7                 −1.988 (0.114)       79.5 (56.7)
0.6             89.2                 −1.902 (0.144)       55.8 (45.0)
0.7             90.8                 −1.918 (0.164)       42.2 (22.5)
0.8             90.2                 −1.856 (0.228)       33.8 (20.5)
0.9             87.2                 −1.814 (0.326)       27.5 (19.2)
1.0             87.2                 −1.763 (0.362)       23.2 (15.9)


5. Conclusions

We have developed basic large sample results for fixed precision/variable length computerized adaptive tests. They are in many ways similar to those for fixed width sequential estimation, especially in the context of logistic regression. Simulation results show that the performance of the large-sample-based procedure is satisfactory in terms of accuracy of coverage probability, bias and test length. All results here can easily be extended to the probit link, i.e., the case with $G$ replaced by $\Phi$, the standard normal distribution function.

We believe this investigation is only a beginning, and many issues remain to be studied. For example, it would be of great interest to look at the situation of a finite item pool with suitable control of the item exposure rate, which is certainly more realistic. Another issue is the role of the discrimination parameter $a$ and the guessing parameter $c$ in item selection. We refer to Chang and Ying (1999) for related aspects.

Acknowledgements

We are grateful to an Editorial Board Member and the referees whose suggestions and encouragement have improved the paper. We also thank Professor Hua-Hua Chang, University of Texas at Austin, for providing us with the NAEP item pool.

Appendix A

Proof of Theorem 3.1. Under conditions (C1) and (C2), it is easily seen that, with probability 1, the set of functions (with respect to $\theta$) $\{ n^{-1} \sum_{i=1}^n w_i(\theta)[Y_i - P_i(\theta_0)] : n \geq 1 \}$ is equicontinuous. Furthermore, for any fixed $\theta$, the $\{w_i(\theta)\}$ are predictable and the $\{Y_i - P_i(\theta_0)\}$ are martingale differences with respect to $\{\mathcal{F}_i\}$. Write $\varepsilon_i = Y_i - P_i(\theta_0)$. Applying Theorem 2 of Chow (1965), we get $n^{-1} \sum_{i=1}^n w_i(\theta)\varepsilon_i \to 0$ a.s. as $n \to \infty$, since $\sum_n n^{-2} w_n^2(\theta)\,\mathrm{Var}(\varepsilon_n \mid \mathcal{F}_{n-1}) < \infty$ a.s. The equicontinuity further entails

\[ \sup_{\theta\in\Theta} \Big| n^{-1} \sum_{i=1}^n w_i(\theta)\varepsilon_i \Big| \to 0 \quad a.s., \tag{A.1} \]

where $\Theta$ is the compact interval over which the likelihood function is maximized to obtain $\hat\theta_n$. For any $\theta_1 > \theta_0$, we can apply the mean-value theorem to get

\[ \liminf_{n\to\infty}\; \inf_{\theta\in\Theta,\,\theta\geq\theta_1} n^{-1} \sum_{i=1}^n w_i(\theta)\,[P_i(\theta) - P_i(\theta_0)] > 0. \tag{A.2} \]

Likewise, for any $\theta_2 < \theta_0$,

\[ \limsup_{n\to\infty}\; \sup_{\theta\in\Theta,\,\theta\leq\theta_2} n^{-1} \sum_{i=1}^n w_i(\theta)\,[P_i(\theta) - P_i(\theta_0)] < 0. \tag{A.3} \]

Combining (A.1), (A.2) and (A.3), we conclude that $\hat\theta_n \to \theta_0$ a.s. as $n \to \infty$.


By the definition of $\hat\theta_n$ and the mean-value theorem,

\[ \sum_{i=1}^n w_i(\theta_0)[Y_i - P_i(\theta_0)] - I_n(\theta_n^*)(\hat\theta_n - \theta_0) = 0, \]

where $\theta_n^*$ lies between $\hat\theta_n$ and $\theta_0$. We apply the martingale central limit theorem (Pollard, 1984) to get

\[ v_n^{-1/2} \sum_{i=1}^n w_i(\theta_0)\varepsilon_i \to_{\mathcal{L}} N(0,1). \]

Hence,

\[ I_n^{1/2}(\hat\theta_n)(\hat\theta_n - \theta_0) = \frac{I_n^{1/2}(\hat\theta_n)\, v_n^{1/2}}{I_n(\theta_n^*)}\; v_n^{-1/2} \sum_{i=1}^n w_i(\theta_0)\varepsilon_i \tag{A.4} \]

converges in distribution to $N(0,1)$.

Proof of Theorem 3.2. A commonly used method for proving asymptotic consistency of a fixed-width confidence interval procedure is via Anscombe's theorem (Anscombe, 1952; Woodroofe, 1982). From Theorem 3.1 and Lemma 1 of Chow and Robbins (1965), we know that $T_d/n_0 \to 1$ a.s. as $d \to 0$. Therefore, to verify the conditions of Anscombe's theorem, it suffices to show that $\{ I_n^{1/2}(\hat\theta_n)(\hat\theta_n - \theta_0) : n \geq 1 \}$ is uniformly continuous in probability (u.c.i.p.). A sequence of random variables $\{z_n : n \geq 1\}$ is u.c.i.p. if, for any $\varepsilon > 0$, there exists $\delta > 0$ such that

\[ P\Big( \max_{0\leq k\leq n\varepsilon} |z_{n+k} - z_n| > \delta \Big) \leq \delta. \]

We refer to Woodroofe (1982) for further discussion of the u.c.i.p. property. Once we know that $\{ I_n^{1/2}(\hat\theta_n)(\hat\theta_n - \theta_0) : n \geq 1 \}$ is u.c.i.p., Theorem 3.2 follows easily by the same arguments as in Woodroofe (1982). Thus it remains to show the u.c.i.p. property. From (A.4) and the strong consistency of $\hat\theta_n$,

\[ I_n^{1/2}(\hat\theta_n)(\hat\theta_n - \theta_0) = [1 + o(1)]\, v_n^{-1/2} \sum_{i=1}^n w_i(\theta_0)\varepsilon_i \quad a.s. \tag{A.5} \]

Thus, it suffices to show that the right-hand side of (A.5) is u.c.i.p. Since $w_i = w_i(\theta_0)$ is $\mathcal{F}_{i-1}$-measurable and $\{\varepsilon_i\}$ is a martingale difference sequence, $\{w_i\varepsilon_i\}$ is again a martingale difference sequence. Let $S_n = \sum_{i=1}^n w_i\varepsilon_i$ and $S_n^* = v_n^{-1/2} S_n$. We have

\[ |S_{n+k}^* - S_n^*| \leq \frac{1}{\sqrt{v_n}}\,|S_{n+k} - S_n| + \Big( 1 - \Big(\frac{v_n}{v_{n+k}}\Big)^{1/2} \Big)\, |S_n^*|. \]


Let $\delta > 0$, $\varepsilon > 0$ and $k \leq n\varepsilon$, and suppose, without loss of generality, that $v_n$ is monotone increasing. Then

\[ 1 - \Big(\frac{v_n}{v_{n+k}}\Big)^{1/2} \leq 1 - \Big(\frac{v_n}{v_{n+[n\varepsilon]}}\Big)^{1/2} = C(\varepsilon), \quad \text{say}. \]

By Theorem 3.1, $S_n^* \to_{\mathcal{L}} N(0,1)$. Therefore, as $\varepsilon \to 0$,

\[ P\Big( \Big( 1 - \Big(\frac{v_n}{v_{n+k}}\Big)^{1/2} \Big) |S_n^*| > \frac{\delta}{2} \Big) \leq P\Big( |S_n^*| > \frac{\delta}{2C(\varepsilon)} \Big) \to 0. \]

Moreover, it follows from the definitions of $\varepsilon_i$ and $w_i$ that $\sup_{i\geq 1} E[w_i\varepsilon_i]^2 < \infty$. Therefore, the Hájek–Rényi inequality (Chow and Teicher, 1988, p. 247) implies

\[ P\Big( \max_{1\leq k\leq n\varepsilon} \frac{|S_{n+k} - S_n|}{\sqrt{v_n}} > \frac{\delta}{2} \Big) \leq \frac{4}{\delta^2} \sum_{i=1}^{[n\varepsilon]} \frac{E[w_{n+i}\varepsilon_{n+i}]^2}{v_n} \leq \frac{4 n\varepsilon \sup_{i\geq 1} E[w_i\varepsilon_i]^2}{\delta^2 v_n}, \]

which clearly converges to 0 as $\varepsilon \to 0$. Thus $\{ I_n^{1/2}(\hat\theta_n)(\hat\theta_n - \theta_0) : n \geq 1 \}$ is u.c.i.p. and the proof is complete.

Proof of Theorem 3.3. As shown in the proof of Theorem 3.2, $T_d/n_0 \to 1$ a.s. Thus, it follows from (2.4), (3.3) and the definition of $C_d$ that, to prove $\lim_{d\to 0} E[T_d/n_0] = 1$, it suffices to prove that $\{ d^2 T_d : d \in (0,1) \}$ is uniformly integrable. Let $D = [\theta_1, \theta_2]$, where $\theta_1$ and $\theta_2$ are chosen so that $\theta_0$ and the $b_i$ lie in the interior of $D$. Since $a_i \in [m_a, M_a]$, condition (C2) implies that there exists a constant $C_+ > 0$ such that, for all $i$,

\[ \inf_{\theta\in D,\, b_i\in D} \frac{\big(\partial P_i(\theta)/\partial\theta\big)^2}{P_i(\theta)\,Q_i(\theta)} \geq C_+. \tag{A.6} \]

By the definition of $T_d$, $T_d \leq C_d/C_+$ provided $\hat\theta_n$ and $b_n$ lie in $D$ for all $n$. Let $\ell_n(\theta) = \sum_{i=1}^n w_i(\theta)[Y_i - P_i(\theta)]$, and define the last times $L_1 = \sup\{ n \geq 1 : \ell_n(\theta_1) \leq 0 \}$ and $L_2 = \sup\{ n \geq 1 : \ell_n(\theta_2) \geq 0 \}$. Clearly, $\hat\theta_i \in D$ for all $n/2 \leq i \leq n$, provided $n/2 > L_{\max} = \max(L_1, L_2)$. Let $I_{[A]}$ be the indicator function of the set $A$. For $d \in (0,1)$, it follows from the previous arguments that

\[ d^2 T_d = d^2 T_d I_{[T_d/2 > L_{\max}]} + d^2 T_d I_{[T_d/2 \leq L_{\max}]} \leq d^2 T_d I_{[T_d/2 > L_{\max}]} + 2 L_{\max}. \]

Note that $\hat\theta_{T_d/2} \in D$ whenever $T_d/2 > L_{\max}$. Thus

\[ d^2 T_d I_{[T_d/2 > L_{\max}]} \leq \frac{z_{\alpha/2}^2}{C_+}. \tag{A.7} \]


Therefore, by (A.6), to show that $\{ d^2 T_d : d \in (0,1) \}$ is uniformly integrable, it suffices to show $E L_{\max} < \infty$. By the mean-value theorem, for $j = 1, 2$ and $i \geq 1$,

\[ P_i(\theta_j) - P_i(\theta_0) = (1-c_i)\, a_i\, G\big(a_i(\theta_j^* - b_i)\big)\, \bar G\big(a_i(\theta_j^* - b_i)\big)\,(\theta_j - \theta_0), \]

where $\bar G = 1 - G$ and $\theta_j^*$ lies between $\theta_j$ and $\theta_0$. By condition (C2), and in view of $a_i \in [m_a, M_a]$, the factor $(1-c_i)\, a_i\, G(a_i(\theta_j^* - b_i))\, \bar G(a_i(\theta_j^* - b_i))$ is bounded away from 0 and $\infty$. Consequently, there exist two positive constants $C_1$ and $C_2$ such that

\[ \ell_n(\theta_1) \geq \sum_{i=1}^n w_i(\theta_1)\varepsilon_i + nC_1 \quad \text{and} \quad \ell_n(\theta_2) \leq \sum_{i=1}^n w_i(\theta_2)\varepsilon_i - nC_2. \]

Let $L_1' = \sup\{ n \geq 1 : \sum_{i=1}^n w_i(\theta_1)\varepsilon_i \leq -nC_1 \}$ and $L_2' = \sup\{ n \geq 1 : \sum_{i=1}^n w_i(\theta_2)\varepsilon_i \geq nC_2 \}$, so that $L_1 \leq L_1'$ and $L_2 \leq L_2'$. By Chang (1999), it can be shown that $EL_1' < \infty$ and $EL_2' < \infty$, which together imply $EL_{\max} < \infty$. Thus $\lim_{d\to 0} E[T_d/n_0] = 1$.

Now let us turn to the Rasch model, i.e. $a_i \equiv a$ and $c_i \equiv 0$; without loss of generality we may take $a \equiv 1$. Let $\hat\theta_k$ be the MLE based on a sample of size $k$. For the Rasch model, it follows from the proof of Theorem 3.1 that

\[ \hat\theta_k - \theta_0 = I_k^{-1}(\theta_k^*) \sum_{i=1}^k \varepsilon_i, \tag{A.8} \]

where $\theta_k^*$ lies between $\theta_0$ and $\hat\theta_k$. By the mean-value theorem and the definition of the MLE,

\[ 0 = \sum_{i=1}^n \big( Y_i - G(\hat\theta_n - b_i) \big) = \sum_{i=1}^n \big( Y_i - G(\hat\theta_k - b_i) \big) - \sum_{i=1}^n G'\big(\theta_{nk}^* - b_i\big)(\hat\theta_n - \hat\theta_k) \]

and

\[ \sum_{i=k+1}^n \big( Y_i - G(\hat\theta_k - b_i) \big) = \sum_{i=k+1}^n \varepsilon_i - \sum_{i=k+1}^n G'\big(\theta_k^* - b_i\big)(\hat\theta_k - \theta_0), \]

where $\theta_{nk}^*$ lies between $\hat\theta_n$ and $\theta_0$. It implies that

\[ \hat\theta_n - \hat\theta_k = O(1)\, I_n^{-1}(\theta_{nk}^*) \left[ \sum_{i=k+1}^n \varepsilon_i - I_{k+1,n}\, I_k^{-1}(\theta_k^*) \sum_{i=1}^k \varepsilon_i \right], \]

263

n ∗ ∗ where Ik+1; n = i=k+1 G  (k+1; n − bi ). When n is large enough, n will approach to ∗ 0 . Thus, we have In (n ) = O(n), Ik+1; n = O((n − k)), and n  (A.9) (ˆn − ˆk )2 = O(log(n)): k=1

By deCnition of stopping time, ITd (ˆTd ) ¿ Cd ¿ ITd −1 (ˆTd −1 ). In addition, by (3.3), n0 = [4Cd ] + 1 and n0 =4 ¿ Cd ¿ (n0 − 1)=4: Hence, by the Taylor expansion theorem, ITd (ˆTd ) =

Td 

G  (ˆTd )

i=1 T



d  Td (ˆTd − bi )2 − O(1) 4

i=1

¿ Cd ¿

n0 − 1 : 4

(A.10)

Similarly, ITd −1 (ˆTd −1 ) =

Td 

G  (ˆTd )

i=1 T −1



d  Td − 1 (ˆTd −1 − bi )2 − O(1) 4

i=1

n0 : 4 It follows from (A.9) and (A.10), ¡ Cd 6

Td − n0 = O(1)

Td 

(ˆTd − bi )2 :

(A.11)

(A.12)

i=1

Note that for the Rasch model, when the maximum information item selection scheme is used, bi = ˆi−1 . Hence, by (A.8) and (A.11), Td − n0 = O(log(n)). References Anscombe, F.J., 1952. Large sample theory of sequential estimation. Proc. Cambridge Philos. Soc. 48, 600–607. Chang, Y.-C.I., 1999. Strong consistency of maximum quasi-likelihood estimate of generalized linear models via a last time. Technical Report c-98-4, Institute of Statistical Science, Academia Sinica, Taipei, Taiwan. Chang, Y.-C.I., Martinsek, A.T., 1992. Fixed size conCdence regions for parameters of a logistic regression model. Ann. Statist. 20, 1953–1969. Chang, H.-H., Ying, Z., 1999. a-StratiCed multistage computerized adaptive testing. Appl. Psychol. Meas. 23, 263–278.

264

Yuan-chin Ivan Chang, Z. Ying / Journal of Statistical Planning and Inference 121 (2004) 249 – 264

Chow, Y.S., 1965. Local convergence of martingales and the law of large numbers. Ann. Math. Statist. 36, 552–558. Chow, Y.S., Robbins, H., 1965. On the asymptotic theory of Cxed width sequential conCdence intervals for the mean. Ann. Math. Statist. 36, 457–462. Chow, Y.S., Teicher, H., 1988. Probability Theory: Independence, Interchangeability Martingales. Springer, New York. Ghosh, M., Mukhopadhyay, N., Sen, P.K., 1997. Sequential Estimation. Wiley, New York. Lewis, C., Sheehan, K., 1990. Using Bayesian decision theory to design a computerized mastery test. Appl. Psychol. Meas. 14, 367–386. Lord, F., 1970. Some test theory for tailored testing. In: Holzman, W.H. (Ed.), Computer-Assisted Instruction, Testing, and Guidance. Harper and Row, New York, pp. 139–183. Lord, F., 1971a. Robbins-Monro procedures for tailored testing. Educ. Psychol. Meas. 31, 3–31. Lord, F., 1971b. A theoretical study of two-stage testing. Psychometrika 36, 227–242. Lord, F., 1980. Applications of Item Response Theory to Practical Testing Problems. Lawrence Erlbaum, Hillsdale, New Jersey. Owen, R.J., 1975. A bayesian sequential procedure for quantal response in the context of adaptive mental testing. J. Amer. Statist. Assoc. 70, 351–356. Rasch, G., 1960. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: The Danish Institute of Educational Research. (Extended edition, 1980). University of Chicago Press, Chicago. Robbins, H., Monro, S., 1951. A stochastic approximation method. Ann. Math. Statist. 22, 400–407. Stein, C., 1945. A two-sample test for a linear hypothesis whose power is independent of the variance. Ann. Math. Statist. 16, 243–258. Sympson, J.B., Hetter, R.D., 1985. Controlling item exposure rates in computerized adaptive testing. Proceedings of the 27th annual meeting of the Military Testing Association, Navy Personnel Research and Development Center, San Diego, CA. Thissen, D., 2000. Reliability and measurement precision. In: Wainer, H. (Ed.), Computerized Adaptive Testing: A Primer, 2nd Edition. Lawrence Erlbaum, Hillsdale, New Jersey, pp. 159–184. van der Linden, W.J., 1998. Optimal assembly of psychological and educational tests. Appl. Psychol. Meas. 22, 195–211. Wainer, H., 2000. Computerized Adaptive Testing: A Primer, 2nd Edition. Lawrence Erlbaum, Hillsdale, New Jersey. Weiss, D.J., 1976. Adaptive testing research in Minnesota: overview, recent results, and future directions. In: Clark, C.L. (Ed.), Proceedings of the First Conference on Computerized Adaptive Testing, United States Civil Service Commission, Washington, DC, pp. 24 –35. Weiss, D.J., 1982. Improving measurement equality and e>ciency with adaptive testing. Appl. Psychol. Meas. 6, 379–396.