Power analysis in randomized clinical trials based on item response theory

Controlled Clinical Trials 24 (2003) 390–410

Rebecca Holman, M.Math.a,*, Cees A.W. Glas, Ph.D.b, Rob J. de Haan, Ph.D.a

a Department of Clinical Epidemiology and Biostatistics, Academic Medical Center, Amsterdam, The Netherlands
b Department of Research Methodology, Measurement and Data Analysis, University of Twente, Enschede, The Netherlands

Manuscript received May 20, 2002; manuscript accepted March 17, 2003

Abstract

Patient relevant outcomes, measured using questionnaires, are becoming increasingly popular endpoints in randomized clinical trials (RCTs). Recently, interest in the use of item response theory (IRT) to analyze the responses to such questionnaires has increased. In this paper, we used a simulation study to examine the small sample behavior of a test statistic designed to examine the difference in average latent trait level between two groups when the two-parameter logistic IRT model for binary data is used. The simulation study was extended to examine the relationship between the number of patients required in each arm of an RCT, the number of items used to assess them, and the power to detect minimal, moderate, and substantial treatment effects. The results show that the number of patients required in each arm of an RCT varies with the number of items used to assess the patients. However, as long as at least 20 items are used, the number of items barely affects the number of patients required in each arm of an RCT to detect effect sizes of 0.5 and 0.8 with a power of 80%. In addition, the number of items used has more effect on the number of patients required to detect an effect size of 0.2 with a power of 80%. For instance, if only five randomly selected items are used, it is necessary to include 950 patients in each arm, but if 50 items are used, only 450 are required in each arm. These results indicate that if an RCT is to be designed to detect small effects, it is inadvisable to use very short instruments analyzed using IRT. Finally, the SF-36, SF-12, and SF-8 instruments were considered in the same framework. Since these instruments consist of items scored in more than two categories, slightly different results were obtained. © 2003 Elsevier Inc. All rights reserved.

Keywords: Item response theory; Sample size; Power; Latent trait; IRT; Two-parameter model; SF-36; SF-12; SF-8

* Corresponding author: Rebecca Holman, M.Math, Department of Clinical Epidemiology and Biostatistics, Academic Medical Center, PO Box 22700, 1100 DE Amsterdam, The Netherlands. Tel.: +31-20-566-6947; fax: +31-20-691-2683. E-mail address: [email protected]

0197-2456/03/$—see front matter © 2003 Elsevier Inc. All rights reserved. doi:10.1016/S0197-2456(03)00061-8

Introduction

In recent years, there has been an enormous increase in the use of patient relevant outcomes, such as functional status and quality of life, as endpoints in medical research, including controlled randomized clinical trials (RCTs). Many patient relevant outcomes are measured using questionnaires designed to quantify a theoretical construct, often modeled as a latent variable. When a questionnaire is administered to a patient, responses to individual items are recorded. Often the scores on each item are added together to obtain a single score for each patient. The reliability and validity of sum scores are usually examined in the framework of classical test theory. This framework is widely accepted and applied in many areas of medical assessment [1]. However, following dissatisfaction with these methods, interest in the use of an alternative paradigm, known as item response theory (IRT), has grown [2].

IRT was developed as an alternative to the use of classical test theory when analyzing data resulting from school examinations. An overview of IRT methods is given in this paper, while in-depth descriptions in general [3,4] and in medical applications [5] are given elsewhere. Advantages of using IRT to analyze an RCT include proper modeling of ceiling and floor effects, solutions to the problem of missing data, and straightforward ways of dealing with heteroscedasticity between groups. However, the main advantage is that it is not essential to assess all patients with exactly the same items. For instance, if sufficient information on the measurement characteristics of the Medical Outcomes Study Short-Form Health Survey with 36 items (SF-36) [6] in a particular patient population were available, it would be possible to obtain completely comparable estimates of health status using only the items most appropriate to each individual patient. An extension of this is computerized adaptive testing, in which each patient potentially receives a different computer-administered questionnaire in which the questions offered to each patient depend on the responses given to previous questions [7,8]. A prerequisite of this type of testing is access to a large item bank that has been calibrated using responses from comparable patients [9,10].

Ethical considerations require that as few patients as possible are exposed to the "risk" of a novel treatment during an RCT, but it is important to ensure that enough patients are included to have a reasonable power of detecting the effect of interest. For this reason, calculation of the minimal sample size required to demonstrate a clinically relevant effect has become integral to the RCT literature [11]. However, since IRT was developed as a tool for analyzing data resulting from examinations, most technical work has concentrated on the statistical challenges found in that field. For example, when assessing the effects of an educational intervention, in some ways similar to an RCT, thousands of pupils, even whole cohorts, are often included. This means that minimal sample size and power calculations in relation to questionnaires analyzed with IRT have received very little attention. In addition, IRT offers a framework in which the number of items used to assess patients can be easily varied. Thus, sample size calculations need to consider not only the number of patients, but also the number of items used. Some work has touched on these issues [12] or considered them as a sideline of another aspect [13], but no guidance on sample size calculations for RCTs in the context of IRT has been published.

In this paper, the relationship between the number of patients in each treatment arm, the number of items used to assess the patients, and the power to detect given effect sizes will be examined using a simulation study. The results will be used to develop guidelines for the number of patients to be used in each arm of an RCT when a questionnaire analyzed using IRT is used as the primary outcome. The methods are illustrated in a population of patients with end-stage renal disease 12 months after starting dialysis [14,15] using the SF-36 [6], the SF-12 [16], and the SF-8 [17] health-status surveys. In addition, to provide a framework in which data from RCTs can be analyzed using IRT, the behavior of asymptotic methods developed to compare the mean level of the latent trait in two groups will be considered in the relatively small samples encountered in RCTs.

IRT in an RCT

IRT is used to model the probability that a patient will respond to a number of items related to a latent trait in a certain way. The two-parameter logistic model [18] is for data resulting from items with two response categories, "0" and "1," and is one of many IRT models developed [19]. In this model, the probability, p_ik(θ), that patient k, with latent trait equal to θk, will respond to item i in category 1 is given by

p_{ik}(\theta) = \frac{\exp(a_i(\theta_k - b_i))}{1 + \exp(a_i(\theta_k - b_i))}    (1)

where a_i and b_i are known as item parameters. The more widely known Rasch model [20] is similar to the two-parameter logistic model, but with the parameters a_i assumed equal to 1. An important assumption of IRT models is that of local independence, meaning that the probability of a patient scoring 1 on a given item is independent of them scoring 1 on another item, given their value of θ. This means that the correlations between items and over patients are fully explained by θ. Models have also been developed for data resulting from items with more than two response categories [4,21]. The generalized partial credit model [22] is a fairly well-known example in which the probability p_ijk(θ) that a patient with latent trait equal to θk will respond to an item i, with (J_i + 1) response categories, in category j, j > 0, is given by

p_{ijk}(\theta) = \frac{\exp\left(\sum_{u=0}^{j} a_i(\theta_k - b_{iu})\right)}{1 + \sum_{j=0}^{J_i} \exp\left(\sum_{u=0}^{j} a_i(\theta_k - b_{iu})\right)}    (2)

where a_i is the discrimination parameter of item i and b_iu indicates the point at which the probability of choosing category j or category j−1 is equal. In IRT, the item and patient parameters are usually estimated in a two-stage procedure. First, the item parameters are estimated, often by assuming that the θk follow a normal distribution and integrating them out of the likelihood. Second, maximum likelihood estimates of θk are obtained using the previously estimated item parameters. In this study, it will be
assumed that the items under consideration form part of a calibrated item bank [9], meaning that the item parameters have been previously estimated [23] from responses given by comparable patients to the items and are assumed known for all items. It is theoretically possible to estimate the item parameters from the responses given to the items by patients included in an RCT, but accurate estimates can only be obtained from large samples of patients, say over 500. Since this figure is rarely attained in RCTs, it will often not be practical to estimate the item parameters in an RCT.

In a straightforward RCT, the patient sample is randomly divided into two groups, say A and B. Each group receives a different treatment regimen, and primary and secondary outcomes are assessed once at the end of the study. The main interest is in the null hypothesis, H0, that the mean level of the primary outcome, say θ, is equal in both groups. This can be written as

H_0: \mu_A - \mu_B = 0    (3)

where μA and μB denote the mean of the distribution of θ in groups A and B, respectively. Clinically, the two groups are said to differ if the ratio of the difference μA − μB and the standard deviation of θ is larger than a given effect size. A great deal of work has been carried out examining the clinical relevance of particular effect sizes in given situations. However, in practice, interest is often in examining the arbitrarily defined minimal, moderate, and substantial effect sizes of 0.2, 0.5, and 0.8 on continuous variables [11]. The number of patients required to detect a given effect size with a particular power depends on the values of the effect size and the standard errors of μA and μB.

Now consider an RCT in which the primary outcome is θ, measured at the end of the study using a questionnaire with n items, each with two response categories, and analyzed using IRT. Let us assume that patients 1, 2, …, K are in group A and patients K + 1, K + 2, …, 2K are in group B, meaning that the total sample size is 2K. For group A, we can rewrite Eq. (1) as

p_{ik}(\theta) = \frac{\exp(a_i(\mu_A + e_k - b_i))}{1 + \exp(a_i(\mu_A + e_k - b_i))}    (4)

where θk = μA + ek and μA is the mean of θk in group A. Hence, the ek have mean 0 and standard deviation σA. For group B, Eq. (1) can be rewritten in a similar way, with θk = μB + ek and the standard deviation of ek equal to σB. For RCTs carried out using a questionnaire consisting of items with more than two response categories, Eq. (2) can be rewritten in a similar way.

The main interest is in examining the null hypothesis in Eq. (3) by obtaining μ̂A and μ̂B and testing whether they are significantly different from each other. When using IRT-based techniques, it is inadvisable to estimate the values of θk for all patients and then perform standard analyses, such as t tests, on these estimates [13], since this ignores the measurement error inherent to θk. This means that using standard methods for calculating sample size may lead to inaccurate conclusions. The estimation equations for μA and μB are complex, and the values of S.E.(μA) and S.E.(μB) depend not only on the number of patients in each arm of the RCT, but also on the number of items used to assess the patients and the relationship between a_i and b_i and the distribution of θ. This means that it is not
possible to write down a straightforward equation for either μ̂A and μ̂B or the number of patients required to detect a given effect size with a particular power. Consistent estimates of μA and μB are obtained by maximizing a likelihood function that has been marginalized with respect to θ [24,25]. These estimation methods are described in more detail in the appendix. The marginal maximum likelihood estimates of μA and μB can be combined with their standard errors to obtain the test statistic

Z_L = \frac{\hat{\mu}_A - \hat{\mu}_B}{S.E.(\hat{\mu}_A - \hat{\mu}_B)}    (5)

where μ̂A and μ̂B denote the estimates described above. It has been proven that, under the hypothesis μA − μB = 0 and for large sample sizes, say 2K > 500, ZL follows an asymptotic standard Normal distribution [24]. However, since these methods were developed in the field of educational measurement, where interventions are assessed using very large samples, the behavior of ZL for the smaller samples common in RCTs is as yet unknown.
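For concreteness, the two response models above can be written down in a few lines of code. The sketch below is illustrative only: the function names are ours, and the generalized partial credit probabilities are computed with the common convention that the category-0 term in the exponent is fixed at zero, which may differ superficially from the indexing used in Eq. (2).

```python
import numpy as np

def p_2pl(theta, a, b):
    """Eq. (1): probability of a response in category 1 under the
    two-parameter logistic model, for latent trait theta and item (a, b)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def p_gpcm(theta, a, b_steps):
    """Generalized partial credit model: probabilities of responding in
    categories 0, 1, ..., J of one item with discrimination a and step
    parameters b_steps = (b_1, ..., b_J)."""
    # cumulative "scores" for categories 0..J; category 0 contributes 0
    scores = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b_steps)))))
    scores -= scores.max()                 # numerical stability
    num = np.exp(scores)
    return num / num.sum()

print(p_2pl(0.5, a=1.7, b=-0.3))                        # a single probability
print(p_gpcm(0.5, a=1.7, b_steps=[-1.0, 0.2, 1.1]))     # four category probabilities
```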

A simulation study

In this study, we simulated data from RCTs to examine the behavior of ZL and the power to detect a number of effect sizes with a given number of patients and items. We were particularly interested in RCTs where there were 30, 40, 50, 100, 200, 300, 500, or 1000, denoted K, patients in each arm and where these patients were assessed using 5, 10, 15, 20, 30, 50, 70, or 100, denoted n, items, each with two response categories. These values were chosen to reflect the range of sample sizes often encountered in clinical research and the number of items with which it is acceptable to assess patients in a variety of situations. In total, there were 64 (= 8 × 8) different combinations of sample size and number of items in the study.

The study was carried out by simulating 1000 RCTs at each of the 64 different combinations. In each RCT, a group of K values of θ was sampled from each of N(0,1) and N(μB,1) distributions to represent the latent trait levels of patients in groups A and B, respectively. Since the standard deviation of θ is equal to one in both groups, the effect size in an RCT is equal to μA − μB. In addition, for each RCT n values of b and of a were generated from N(0,1) and log N(0.2,1) distributions, respectively, to represent the items. These values were chosen as they give a reasonably high level of statistical information on the values of θ used in this study and thus represent a carefully chosen questionnaire in terms of item characteristics.

Each RCT was "conducted" by calculating the probability, p_ik, that patient k would respond to item i in category 1, given their value of the latent trait, θk, and the item parameters a_i and b_i, using the formula in Eq. (1). The response, x_ik, made by patient k on item i was obtained by taking an observation on a Bi(1, p_ik) distribution. This was repeated for all 2K patients in an RCT and resulted in a data matrix with 2K rows and n columns. The responses, x_ik, were used, together with the two-parameter logistic model, the "known" values of a and b, and marginal maximum likelihood estimation methods to obtain estimates of μA, μB, and the associated standard errors. These estimates were combined to obtain a value
of ZL for each RCT. The simulations were carried out in a program adapted by the authors from OPLM, a commercially available program for estimating parameters in IRT models [26].

The distribution of ZL under the null hypothesis

In order to provide a framework for examining the small sample behavior of ZL under the null hypothesis H0: μA − μB = 0, the values of μA and μB were set equal to 0 and 1000 RCTs were carried out at each of the 64 combinations of K and n. The distribution of the values of ZL obtained for each combination of K and n was examined by calculating the mean, Z̄L, and standard deviation, S.D.(ZL), of the values of ZL obtained from the 1000 RCTs conducted with each combination of K and n. In addition, the values of ZL were tested to see whether there was evidence that they did not form a sample from a normal distribution, with mean Z̄L and standard deviation S.D.(ZL), using the Kolmogorov-Smirnov statistic.

Sample size, number of items, and power

The main objective of the simulation study was to examine the power of an RCT using a questionnaire with n items as a primary endpoint and K patients in each arm of the RCT to detect minimal (0.2), moderate (0.5), or substantial (0.8) effect sizes. Hence, the whole simulation process with 64 combinations of K and n was repeated three times. The value of μA was set to 0 and σA = σB = 1. Hence, the values of μB were set at 0.2, 0.5, and 0.8 for the first, second, and third repetitions, respectively. The 1000 values of ZL obtained for each of the 192 combinations of sample size, number of items, and effect size were compared to the critical values of the appropriate normal distribution to determine how many of the test statistics were significant at the (two-sided) 90%, 95%, and 99% levels. For RCTs with more than 100 patients in each arm, a standard normal distribution was used, meaning that the critical values were ±1.64, ±1.96, and ±2.58 for the 90%, 95%, and 99% levels, respectively. The critical values for RCTs with up to 100 patients in each arm were obtained in the first part of this simulation study and are given in the following section.
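As a rough illustration of the data-generating step just described, the sketch below simulates the item responses for one trial under the two-parameter logistic model. It is a simplified stand-in under the stated distributional assumptions (θ ~ N(0,1) or N(effect, 1), b ~ N(0,1), log a ~ N(0.2,1)); the function name is ours, and the subsequent marginal maximum likelihood estimation of μ̂A, μ̂B, and ZL, which was done in the adapted OPLM program, is not reproduced here.

```python
import numpy as np

def simulate_rct_responses(K, n, effect_size, rng):
    """Simulate the 2K-by-n matrix of dichotomous responses for one RCT:
    K patients per arm, theta ~ N(0,1) in arm A and N(effect_size,1) in arm B,
    item parameters b ~ N(0,1) and a with log a ~ N(0.2,1)."""
    b = rng.normal(0.0, 1.0, size=n)
    a = rng.lognormal(mean=0.2, sigma=1.0, size=n)
    theta = np.concatenate([rng.normal(0.0, 1.0, size=K),
                            rng.normal(effect_size, 1.0, size=K)])
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))   # Eq. (1), 2K x n
    x = rng.binomial(1, p)                                 # Bi(1, p_ik) draws
    group = np.repeat(["A", "B"], K)
    return x, group, a, b

rng = np.random.default_rng(12345)
x, group, a, b = simulate_rct_responses(K=100, n=20, effect_size=0.5, rng=rng)
print(x.shape)   # (200, 20)
```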

Results of the simulation study

The distribution of ZL under the null hypothesis

The mean, Z̄L, the standard deviation, S.D.(ZL), and the p-value of the Kolmogorov-Smirnov test for Normality, P(KS), of the 1000 values of ZL produced at each combination of n and K for μA = μB = 0 are given in Table 1. The mean value of ZL over all 64,000 replications is 0.0004, and only two of the 64 values of P(KS) are less than 0.10. This indicates that there is no reason to suspect that ZL does not attain its asymptotic normal distribution, with mean 0, for samples consisting of two groups, each of 30 to 1000 patients.
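The summary reported for each (K, n) cell of Table 1 can be computed along the following lines. This is a sketch using SciPy's one-sample Kolmogorov-Smirnov test against a normal distribution whose mean and standard deviation are taken from the simulated values themselves, mirroring the comparison described above; the variable and function names are ours.

```python
import numpy as np
from scipy import stats

def summarize_zl(zl_values):
    """Mean, standard deviation, and Kolmogorov-Smirnov p-value of the
    simulated ZL values for one (K, n) combination."""
    zbar = np.mean(zl_values)
    sd = np.std(zl_values, ddof=1)
    ks = stats.kstest(zl_values, "norm", args=(zbar, sd))
    return zbar, sd, ks.pvalue

# zl_values would hold the 1000 ZL statistics from one combination of K and n
zl_values = np.random.default_rng(0).normal(0.0, 1.1, size=1000)   # placeholder values
print(summarize_zl(zl_values))
```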

Table 1. The mean, Z̄L, standard deviation, S.D.(ZL), and Kolmogorov-Smirnov p-value, P(KS), of ZL when μA = μB = 0, for each number of items (n = 5, 10, 15, 20, 30, 50, 70, or 100) and each number of patients in each of group A and group B (K = 30, 40, 50, 100, 200, 300, 500, or 1000).
Fig. 1. Standard deviation of ZL against the number of patients in each arm of a randomized clinical trial.

The variation of the standard deviation of the 1000 values of ZL produced at each combination of conditions is illustrated with respect to K and n in Figs. 1 and 2, respectively. It does not appear that n has a substantial effect on the standard deviation of ZL. However, S.D.(ZL) increases as the sample size decreases. We modeled the relationship between S.D.(ZL) and K, under the null hypothesis H0: μA = μB, using

S.D.(Z_L) = 0.97 + 6.67\,\frac{1}{K}    (6)

Eq. (6) was obtained using regression analysis, since the variance of a statistic is often related to the reciprocal of the number of patients used to calculate the statistic. The usual assumptions were tested and the data did not violate them. The asymptote was not forced down to 1, since Eq. (6) gave the best fit to data obtained from 30, 40, 50, and 100 patients, and primary interest was in studies with these numbers of patients, rather than in obtaining an accurate representation of smaller deviations from a standard Normal distribution. The correlation between S.D.(ZL) and (1/K) was 0.8843, meaning that R² = 0.7820. The modeled distribution for ZL, along with critical values and type I error rates based on the 8000 "trials" with an effect size of 0.0 carried out using the eight test lengths, for the 90%, 95%, and 99% levels for RCTs using 30, 40, 50, and 100 patients per arm, are given in Table 2.
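Under the fitted model in Eq. (6), small-sample critical values such as those in Table 2 can be approximated as follows. This is a sketch under the assumptions stated above (ZL treated as N(0, S.D.(ZL)²), with S.D.(ZL) taken from Eq. (6) for K ≤ 100 and equal to 1 otherwise); the function name is ours.

```python
from scipy import stats

def zl_critical_value(K, two_sided_alpha=0.05):
    """Approximate critical value for |ZL| in a trial with K patients per arm,
    using S.D.(ZL) = 0.97 + 6.67/K for K <= 100 and S.D.(ZL) = 1 otherwise."""
    sd = 0.97 + 6.67 / K if K <= 100 else 1.0
    return stats.norm.ppf(1.0 - two_sided_alpha / 2.0) * sd

print(round(zl_critical_value(30), 2))    # about 2.34; Table 2 reports +/-2.33 using S.D. = 1.19
print(round(zl_critical_value(500), 2))   # 1.96, the usual large-sample value
```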

Fig. 2. Standard deviation of ZL against the number of items used to assess the patients.

Sample size, number of items, and power

Table 3 contains the number of the 1000 RCTs carried out under each of the 192 combinations of K, n, and effect size that resulted in a value of the ZL statistic beyond the appropriate critical values given in Table 2. In order to facilitate comparisons with the situation where θ is known, similar to obtaining an estimate using an infinite number of items, the theoretical number of 1000 RCTs that would result in a value of the T statistic beyond the appropriate critical values has been calculated using standard procedures [27]. These are presented in rows labeled ∞.

Table 2. Modeled distribution for ZL and critical values for randomized clinical trials with small sample sizes

Number of patients   Modeled distribution   Critical values           Type I error rate
per arm (K)          of ZL                  (90%, 95%, 99%)           (90%, 95%, 99%)
30                   N(0, 1.19²)            ±1.96, ±2.33, ±3.07       0.1016, 0.0565, 0.0139
40                   N(0, 1.14²)            ±1.88, ±2.23, ±2.94       0.1005, 0.0545, 0.0139
50                   N(0, 1.10²)            ±1.81, ±2.16, ±2.83       0.1025, 0.0558, 0.0158
100                  N(0, 1.04²)            ±1.71, ±2.04, ±2.68       0.0933, 0.0470, 0.0086
≥200                 N(0, 1)                ±1.64, ±1.96, ±2.58       —

Table 3. The number of 1000 randomized clinical trials in which ZL was greater than the appropriate critical value for the two-sided significance level given. Each row lists the counts for n = 5, 10, 15, 20, 30, 50, 70, 100, and ∞ items, first for an effect size of 0.2, then 0.5, then 0.8; K is the number of patients in each of group A and group B.

K     Significance level   Effect size 0.2 (n = 5, ..., ∞)          Effect size 0.5 (n = 5, ..., ∞)              Effect size 0.8 (n = 5, ..., ∞)
30    0.10                 126 145 141 155 158 170 157 139 190     174 267 280 303 294 291 305 317 600         389 497 489 475 538 573 535 571 920
30    0.05                 64 74 87 91 86 109 78 96 110            104 181 170 185 193 208 208 208 470         262 354 384 371 416 461 428 427 860
30    0.01                 24 15 24 28 24 25 33 37 30              30 69 91 70 88 89 67 88 240                 98 163 179 194 205 217 229 214 660
40    0.10                 136 139 133 163 186 158 181 165 220     344 438 492 467 550 559 549 557 710         655 774 846 872 856 875 862 891 970
40    0.05                 76 72 81 99 109 85 117 100 140          247 297 364 349 416 426 428 415 590         538 669 747 774 789 804 773 822 940
40    0.01                 18 31 30 29 36 28 44 33 40              90 132 145 140 180 208 213 197 340          297 425 482 497 581 574 549 618 820
50    0.10                 149 188 208 180 190 210 215 211 260     433 575 579 606 650 622 663 674 790         805 900 911 931 950 933 952 940 1000
50    0.05                 77 108 117 126 130 126 120 138 160      314 445 463 488 510 492 536 550 690         689 852 817 892 897 903 919 899 970
50    0.01                 26 28 27 37 33 41 36 48 50              126 208 225 267 283 274 287 296 450         422 639 649 690 750 771 799 775 910
100   0.10                 216 276 305 339 344 336 345 365 400     724 837 876 905 927 939 947 935 960         973 995 1000 997 999 1000 998 1000 1000
100   0.05                 139 174 223 215 236 227 247 258 290     610 742 800 854 862 885 900 889 940         949 991 997 994 999 1000 998 1000 1000
100   0.01                 42 63 74 66 87 87 98 100 120            361 542 611 702 704 706 735 727 820         815 956 984 985 993 992 994 995 1000
200   0.10                 372 492 536 539 561 598 585 612 630     960 994 998 995 997 995 1000 996 1000       1000 1000 1000 1000 1000 1000 1000 1000 1000
200   0.05                 268 350 395 424 414 462 473 489 510     924 985 992 990 991 994 997 995 1000        999 1000 1000 1000 1000 1000 1000 1000 1000
200   0.01                 112 170 189 209 211 267 246 262 270     802 923 957 965 967 982 991 986 1000        995 1000 1000 1000 1000 1000 1000 1000 1000
300   0.10                 515 625 675 688 736 773 750 765 780     985 1000 1000 1000 1000 1000 1000 1000 1000   — — — — — — — — —
300   0.05                 375 508 556 565 619 657 659 659 680     976 998 1000 1000 1000 1000 1000 1000 1000    — — — — — — — — —
300   0.01                 164 266 317 334 394 409 401 414 440     931 986 997 999 1000 1000 1000 1000 1000      — — — — — — — — —
500   0.10                 649 810 849 880 901 904 926 920 930     — — — — — — — — —                            — — — — — — — — —
500   0.05                 523 712 763 802 827 853 871 879 880     — — — — — — — — —                            — — — — — — — — —
500   0.01                 328 474 534 564 625 673 718 697 710     — — — — — — — — —                            — — — — — — — — —
1000  0.10                 900 970 982 992 995 994 998 994 1000    — — — — — — — — —                            — — — — — — — — —
1000  0.05                 822 949 963 980 990 989 995 991 1000    — — — — — — — — —                            — — — — — — — — —
1000  0.01                 610 825 896 923 929 943 962 966 970     — — — — — — — — —                            — — — — — — — — —

If the numbers in Table 3 are divided by 1000, then an estimate of the power under the given combination of factors is obtained. It can be seen that the power to detect the given effect size increases with the effect size and the number of items used to assess the patients. It is also apparent that, while increasing the number of items used to assess the patients from five to ten and from ten to twenty results in substantial increases in the power, increasing the number of items beyond 30 increases this figure only minimally. In addition, the power obtained using 100 items does not even approach the theoretical maximum, using an infinite number of items, for K ≤ 40, indicating that the correction introduced in Table 2 may not be sufficient for very small RCTs.

Using the results in Table 3 and linear interpolation, the values of K required to detect each effect size at the two-sided 5% level and with a power of 80% using a given n were calculated and are displayed in Table 4. This significance level and power were chosen as these values are regularly used when analyzing RCTs. It can be seen that, as long as at least 20 items are used, the number of items barely affects the number of patients required in an RCT to detect effect sizes of 0.5 and 0.8. However, the number of items used to assess patients has more effect on the number of patients required to detect an effect size of 0.2. For instance, if only five randomly selected items are used, it is necessary to include 950 patients in each arm, but if 50 items are used, only 450 are required in each arm.
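The interpolation used to produce Table 4 from the simulated power estimates can be sketched as follows; the function name and the worked example are ours, with the power estimates taken as the Table 3 counts divided by 1000 (here for an effect size of 0.5, n = 50 items, and the 5% two-sided level).

```python
import numpy as np

def required_k(k_values, power_estimates, target_power=0.80):
    """Linearly interpolate the simulated power curve to find the approximate
    number of patients per arm giving the target power."""
    return float(np.interp(target_power,
                           np.asarray(power_estimates, dtype=float),
                           np.asarray(k_values, dtype=float)))

# Effect size 0.5, n = 50, 5% level: counts 208, 426, 492, and 885 out of 1000
print(round(required_k([30, 40, 50, 100], [0.208, 0.426, 0.492, 0.885])))  # about 89, i.e. roughly 90
```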

Table 4. Approximate number of patients required in each arm (K) of a randomized clinical trial to demonstrate a given effect size at the 5% level with 80% power

               Number of items used
Effect size    5      10     20     50     100    ∞
0.2            950    680    500    450    440    394
0.5            160    125    95     90     87     64
0.8            70     50     45     40     39     26

An illustration using the short form instruments from the Medical Outcomes Study

In order to illustrate the methods described in this paper, data were used from the second phase of the Netherlands Co-operative Study on Dialysis (NECOSAD). NECOSAD is a multicenter prospective cohort study that includes all new end-stage renal disease patients who started chronic hemodialysis or peritoneal dialysis. Treatment is not randomized but chosen after discussion with the patient and taking clinical, psychological, and social factors into account.

The SF-36 was developed from the original Medical Outcomes Study survey [28] to measure health status in a variety of research situations [6]. The SF-36 is a reliable measure of health status in a wide range of patient groups [29] and has been used with classical [30] and IRT-based [31] psychometric models. In addition, the SF-12, with 12 items [16], and a preliminary version of the SF-8, with 9 items [17], have been developed. In this article, the Dutch language version of the SF-36 [32] was used. The SF-36 uses between two and six response categories per item and, on the majority of items, a higher
score denotes a better health state. The items that are scored so that a higher score denotes a worse health state have been rescored so that a higher score denotes a better health state. A brief description of each item is given in Table 5, together with an indication of whether an item is used in the SF-12 or SF-8, while full details are available elsewhere [1]. The measurement properties of the SF-36 in a sample of patients with chronic kidney failure 12 months after the start of dialysis [14,15] have been examined using the generalized partial credit model described in Eq. (2) and are summarized in Table 5. Questionnaires were sent to 1046 patients, of whom 978 completed at least one question on the SF-36. Of the 978 patients, 583 were male and 395 were female, while 615 were on hemodialysis and 363 were on peritoneal dialysis. The patients were between 18 and 90 years of age, with a median of 61 years. When IRT techniques were used to examine the quality of life, the mean, μ, was 0 and the standard deviation was σ = 0.692. It should be emphasized that the model parameters given in this article are for illustration purposes only and should not be regarded as a definitive calibration of these items, as the fit of the model has not been extensively tested, particularly with respect to differential item characteristics between subgroups of patients.

In order to examine the power with which studies using the short form instruments could detect treatment effects in a population of patients with chronic kidney failure 12 months after starting dialysis, a simulation study was carried out. The simulations were carried out in a similar way to those described earlier in the section "A simulation study." We were interested in detecting effect sizes of 0.2, 0.5, and 0.8, denoted x, in RCTs with 30, 40, 50, 100, 200, 300, 500, or 1000 patients in each arm (K) using the SF-36, the SF-12, or the SF-8 as the primary endpoint, meaning that there were 24 different combinations of x and K. In each run, K values of θ were sampled from each of N(μ, σ²) and N(μ + xσ, σ²), where x represents the effect size expected in the study and σ the standard deviation of quality of life, to denote the control and treatment groups, respectively, meaning that μA = 0 and μB = 0.692x. These values were combined with the item parameters in Table 5 and Eq. (2) to generate "responses" by "patients" to the SF-36. These data were used to obtain the statistic ZL using the items in the SF-36, in the SF-12, and in the SF-8. One thousand runs, or RCTs, were carried out at each of the 24 combinations of K and x. It was assumed that the item parameters were known and equal to those in Table 5.

The number of runs at each combination of K, x, and short form instrument resulting in a value of ZL more extreme than the appropriate critical value in Table 2, for known item parameters, is recorded in Table 6. In general, the SF-36 detected a given treatment effect more often than the SF-12, which in turn detected a treatment effect more often than the SF-8. The number of patients needed in each arm of an RCT using the SF-36, SF-12, and SF-8 to detect standard treatment effects in the underlying latent trait with a power of 80% was obtained using linear interpolation and is given in Table 7. It appears that the number of patients required when using the SF-36 to detect an effect of 0.2 is less than if θ were known, as would be the case if an infinite number of items were used. This is not true, but is a result of the inherent error involved in a simulation study with only 1000 trials at each combination of K and x. It can also be seen that the differences between the numbers of patients required in each arm are less marked than in Table 4.

Table 5. The characteristics of the items in the SF-36 questionnaire: for each item, a brief description of its content, whether it is also used in the SF-12 or the SF-8, the number of response categories, whether the scoring is reversed, and the estimated item parameters (the discrimination α and the category parameters β1 to β5).

Table 6. The number of 1000 randomized clinical trials using the SF-36, SF-12, and SF-8 in which ZL was greater than the appropriate critical value for a test at the two-sided significance level given. Each row lists the counts for the SF-36, SF-12, and SF-8, first for an effect size of 0.2, then 0.5, then 0.8; K is the number of patients per arm.

K     Significance level   Effect size 0.2 (SF-36, SF-12, SF-8)   Effect size 0.5   Effect size 0.8
30    0.10                 155 146 152                            493 495 469       858 834 840
30    0.05                 92 80 84                               352 363 314       755 724 714
30    0.01                 29 25 23                               147 166 142       538 500 451
40    0.10                 210 197 184                            690 618 611       961 945 925
40    0.05                 137 110 114                            571 513 490       914 893 860
40    0.01                 38 34 32                               332 284 260       785 716 663
50    0.10                 252 240 221                            769 765 750       980 986 977
50    0.05                 168 160 138                            695 659 627       955 964 959
50    0.01                 57 70 49                               452 392 391       897 856 858
100   0.10                 430 415 385                            976 960 944       1000 999 999
100   0.05                 312 295 265                            949 933 895       999 999 998
100   0.01                 156 141 116                            838 838 726       997 997 989
200   0.10                 666 636 637                            1000 998 999      — — —
200   0.05                 561 515 516                            998 997 999       — — —
200   0.01                 353 300 288                            988 995 993       — — —
300   0.10                 841 808 772                            — — —             — — —
300   0.05                 782 739 687                            — — —             — — —
300   0.01                 549 525 489                            — — —             — — —
500   0.10                 952 928 936                            — — —             — — —
500   0.05                 917 891 891                            — — —             — — —
500   0.01                 786 751 742                            — — —             — — —
1000  0.10                 998 997 995                            — — —             — — —
1000  0.05                 995 997 990                            — — —             — — —
1000  0.01                 982 972 959                            — — —             — — —

Table 7. Approximate number of patients required in each arm (K) of a randomized clinical trial to demonstrate a given effect size at the 5% level with 80% power

               Instrument used
Effect size    SF-8    SF-12    SF-36    ∞
0.2            410     380      325      394
0.5            82      75       71       64
0.8            36      35       33       26

This is because the items on the SF-36 have up to six response categories, leading to a total of 149, 47, and 40 different possible sum scores on the SF-36, SF-12, and SF-8, respectively. The items used in the first simulation study described in this paper had only two response categories, meaning that 149, 47, and 40 items would have been necessary to obtain the same range of possible sum scores.

Discussion

This paper has described simulation-based studies into the use of IRT to analyze RCTs using, as the primary outcome, a questionnaire consisting of items with two response categories and designed to quantify a theoretical variable. The study into the small sample behavior of asymptotic results provides a framework in which data from RCTs can be analyzed using IRT. It had been proven that the ZL statistic closely approximated an N(0,1) distribution under the null hypothesis for studies in which each arm contained at least 500 patients [24]. The results in this paper show that ZL follows an N(0, σ²) distribution for small RCTs, but that σ ≠ 1. The variance appears to be a function of the reciprocal of the number of patients in each arm of an RCT. This means that, as in many situations in which asymptotic results are used, the procedure for testing whether an effect is significant needs to be adjusted for the relatively small sample sizes often found in RCTs when IRT is used.

The main simulation study examined the relationship between the number of patients in each arm of an RCT, the number of items used to assess the patients, and the power to detect given effect sizes when using IRT. As ever, the smaller the effect size, the larger the number of patients needed in each arm to detect the effect with a given power. In addition, increasing the number of items used to assess the patients can mean that given effects can be detected using fewer patients. The number of patients required to detect a minimal effect using five items is more than twice the number required when 50 items are used. The results suggest that reductions in the number of patients required are minimal for more than 20 carefully chosen items, indicating that a maximum of 20 items with good measurement qualities is sufficient to assess patients. However, if the items have poor measurement properties, such as very low values of a_i or values of b_i more extreme than μθ ± 2σθ, indicating a lack of discrimination and floor or ceiling effects in a questionnaire, respectively, then many more items may be required. These results also indicate that if an RCT is to be designed to detect small effects, it is inadvisable to use very short instruments consisting of items with two response categories analyzed using IRT. Again, it should be noted that these results are based on the assumption that the items used do not have poor measurement properties, as discussed above.

It is common wisdom that to detect effect sizes of 0.2, 0.5, or 0.8 using a t test with a significance level of 0.05 and a power of 80%, it is necessary to include 394, 64, or 26 patients, respectively, in each arm of an RCT. The results presented in this paper show that 450, 90, and 40 patients are required in each arm to detect the same effects using 50 items and IRT. The main reason for the differences between these figures is that when a t test is used, it is assumed that the variable of interest is measured without error, whereas IRT takes into account that a latent variable cannot be measured without error. In addition, the sample sizes obtained using IRT are conservative, as linear interpolation has been used between observations, whereas it is reasonable to assume that the true function would curve toward the top left-hand corner, indicating that the true numbers of patients required are marginally lower than these results suggest. It should also be emphasized that if researchers require very accurate estimates of the number of patients required in each arm of an RCT using a particular set of items, they should carry out their own simulation study.

There are a number of advantages to the use of IRT in the analysis of RCTs. First, the ZL test described in this article can, in contrast to the t test, always be used in the same form, regardless of whether or not the variance of the latent trait is the same in both arms of the RCT. Second, it is not essential that all patients are assessed using exactly the same questionnaire. The questionnaire can be shortened for particular groups of patients, while the estimates of the latent trait remain comparable over all patients.

The values of the standard errors of μA and μB depend on a complex relationship between the values of the latent trait of the patients included in an RCT and the parameters of the items used to assess them. This means that the results obtained in this study are, theoretically, local to the combination of patients and items used. However, we reselected all parameters for each of the 1000 individual "RCTs" carried out at each combination of effect size, number of patients per arm, and number of items in order to give a more general picture of the number of patients required. In addition, item parameters for the SF-36, SF-12, and SF-8 quality of life instruments were estimated from a dataset collected from patients undergoing dialysis for chronic kidney failure and used to illustrate the methods described.

This article has examined the sample sizes required in an RCT when the primary outcome is a latent trait measured by a questionnaire analyzed with IRT. It has been shown that it is possible, following a minor transformation, to use the statistics developed for analyzing large-scale educational interventions in the much smaller sample sizes encountered in RCTs in clinical medicine. In addition, the relationship between the number of items used and the number of patients required to detect particular effects was examined. It is hoped that this article will contribute to the understanding of IRT, particularly in relation to RCTs.

Acknowledgments

This research was partly supported by a grant from the Anton Meelmeijerfonds, a charity supporting innovative research in the Academic Medical Center, Amsterdam, The Netherlands. The authors would like to thank the researchers involved in The Netherlands Cooperative Study on the Adequacy of Dialysis (NECOSAD) for allowing their data to be used in this paper.

Appendix: Obtaining μ̂A, μ̂B, and their standard errors

The values of μA and μB and their standard errors can be estimated using marginal maximum likelihood methods. The likelihood L can be written as

L = \prod_{g=A,B} \prod_{k=1}^{K_g} \int_{-\infty}^{\infty} \prod_{i,j} p_{ijk}(\theta)^{x_{ijk}} \, g(\theta|\mu_g,\sigma_g) \, d\theta    (A.1)

where p_ijk is as defined in Eqs. (1) and (2), and g(θ|μg, σg) is a Normal density function with mean μg and variance σg² for g = A, B, respectively. Furthermore, x_ijk is an indicator variable taking the value 1 if patient k responds in category j of item i and the value 0 otherwise. In the study described in this paper, we have assumed that the item parameters a_i and b_i are known. In practice, this would mean that the items came from a calibrated item bank. It has been shown that μA can be estimated using

\hat{\mu}_A = \frac{1}{K} \sum_{k=1}^{K} E(\theta_k | X_k)    (A.2)

where X_k is a vector with n elements containing the responses, x_ik, of patient k to the items, and E(θk|Xk) is the posterior expected value of θk for patient k given their pattern of responses, Xk [23–25]. It has also been shown that if the item parameters are considered known, as in this study, the asymptotic standard error S.E.(μ̂A − μ̂B) is equal to S.E.(μ̂A) + S.E.(μ̂B), where

S.E.(\hat{\mu}_A) = \frac{1}{\hat{\sigma}_A^{-4} \sum_{k=1}^{K} \left(E(\theta_k|X_k) - \hat{\mu}_A\right)^2}    (A.3)

and σ̂A is the estimated standard deviation of θ in group A. The definition of S.E.(μ̂B) is analogous to that of S.E.(μ̂A). The expectation E(θk|Xk) is as defined in Eq. (A.2) and given by

E(\theta_k|X_k) = \int_{-\infty}^{\infty} \theta \, f(\theta_k|X_k) \, d\theta    (A.4)

where the posterior distribution of θ is

f(\theta_k|X_k) = \frac{P(X_k|\theta_k) \, g(\theta_k|\mu_A,\sigma_A)}{\int_{-\infty}^{\infty} P(X_k|\theta_k) \, g(\theta_k|\mu_A,\sigma_A) \, d\theta_k}    (A.5)

where P(Xk|θk) is the probability of the response pattern made by patient k given θk.
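The posterior expectations in Eqs. (A.2) and (A.4)-(A.5) can be approximated numerically. The sketch below does this on a simple grid for the two-parameter logistic case, with given item parameters and a given N(μ, σ²) population distribution. It is an illustrative stand-in, and the function names are ours: in the paper, μ̂ and σ̂ are themselves obtained by maximizing the marginal likelihood, which would require iterating steps of this kind rather than a single pass.

```python
import numpy as np
from scipy import stats

def posterior_mean_theta(x, a, b, mu, sigma, n_points=121):
    """E(theta_k | X_k) of Eq. (A.4) for one response pattern x under the
    2PL model, using grid quadrature for the integrals in Eqs. (A.4)-(A.5)."""
    grid = np.linspace(mu - 6.0 * sigma, mu + 6.0 * sigma, n_points)
    p = 1.0 / (1.0 + np.exp(-a * (grid[:, None] - b)))             # grid x items
    likelihood = np.prod(np.where(np.asarray(x) == 1, p, 1.0 - p), axis=1)  # P(X_k | theta)
    posterior = likelihood * stats.norm.pdf(grid, mu, sigma)
    posterior /= posterior.sum()
    return float(np.sum(grid * posterior))

def group_mean_estimate(X, a, b, mu, sigma):
    """Eq. (A.2): the group mean as the average posterior expectation."""
    return float(np.mean([posterior_mean_theta(x, a, b, mu, sigma) for x in X]))
```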

References

[1] McDowell I, Newell C. Measuring health: a guide to rating scales and questionnaires. Oxford: Oxford University Press, 1996.

[2] Cella D, Chang CH. A discussion of item response theory and its applications in health status assessment. Med Care 2000;38(Suppl):II66–II72.
[3] Fischer GH, Molenaar IW, editors. Rasch models: foundations, recent developments and applications. New York: Springer-Verlag, 1995.
[4] van der Linden WJ, Hambleton RK, editors. Handbook of modern item response theory. New York: Springer, 1997.
[5] Teresi JA, Kleinman M, Ocepek-Welikson K. Modern psychometric methods for detection of differential item functioning: application to cognitive assessment measures. Stat Med 2000;19:1651–1683.
[6] Ware JE Jr, Sherbourne CD. The MOS 36-item short-form health survey (SF-36). I. Conceptual framework and item selection. Med Care 1992;30:473–483.
[7] Hays RD, Morales LS, Reise SP. Item response theory and health outcomes measurement in the 21st century. Med Care 2000;38(Suppl):II28–II42.
[8] van der Linden WJ, Glas CAW. Computerized adaptive testing. Theory and practice. Dordrecht, the Netherlands: Kluwer Academic Publishers, 2000.
[9] Holman R, Lindeboom R, Vermeulen R, Glas CAW, de Haan RJ. The Amsterdam Linear Disability Score (ALDS) project. The calibration of an item bank to measure functional status using item response theory. Qual Life Newsletter 2001;27:4–5.
[10] Lindeboom R, Vermeulen M, Holman R, de Haan RJ. Activities of daily living instruments in clinical neurology. Optimizing scales for neurologic assessments. Neurology 2003;60:738–742.
[11] Cohen J. Statistical power analysis for the behavioral sciences. Hillsdale, New Jersey: Lawrence Erlbaum Associates, 1988.
[12] Formann AK. Measuring change in latent subgroups using dichotomous data: unconditional, conditional and semiparametric maximum-likelihood estimation. J Am Stat Assoc 1994;89:1027–1034.
[13] May K, Nicewander WA. Measuring change conventionally and adaptively. Educ Psychol Meas 1998;58:882–897.
[14] Korevaar JC, Merkus MP, Jansen MA, et al. Validation of the KDQOL-SF: a dialysis-targeted health measure. Qual Life Res 2002;11:437–447.
[15] van Manen JG, Korevaar JC, Dekker FW, et al. How to adjust for comorbidity in survival studies in ESRD patients: a comparison of different indices. Am J Kidney Dis 2002;40:82–89.
[16] Ware J Jr, Kosinski M, Keller SD. A 12-Item Short-Form Health Survey: construction of scales and preliminary tests of reliability and validity. Med Care 1996;34:220–233.
[17] QualityMetric Incorporated. The SF-8 Health Survey <http://www.sf36.com/tools/sf8.shtml>. Accessed January 2, 2003.
[18] Birnbaum A. Some latent trait models and their use in inferring an examinee's ability. In: Lord FM, Novick MR, editors. Statistical theories of mental test scores. Reading, Massachusetts: Addison-Wesley, 1968.
[19] Thissen D, Steinberg L. A taxonomy of item response models. Psychometrika 1986;51:567–577.
[20] Rasch G. Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research, 1960.
[21] Holman R, Berger MPF. Optimal calibration designs for tests of polytomously scored items described by item response theory models. J Educ Behav Stat 2001;26:361–380.
[22] Muraki E. A generalized partial credit model. In: van der Linden WJ, Hambleton RK, editors. Handbook of modern item response theory. New York: Springer, 1997.
[23] Bock RD, Aitkin M. Marginal maximum likelihood estimation of item parameters: an application of an EM algorithm. Psychometrika 1981;46:443–459.
[24] Glas CAW, Verhelst ND. Extensions of the partial credit model. Psychometrika 1989;54:635–659.
[25] Glas CAW. Modification indices for the 2-PL and the nominal response model. Psychometrika 1999;64:273–294.
[26] Verhelst ND, Glas CAW, Verstralen HHFM. OPLM computer program and manual. Arnhem, the Netherlands: Centraal Instituut voor Toets Ontwikkeling (CITO), 1994.
[27] Cochran WG. Sampling techniques. New York: Wiley, 1977.
[28] McHorney CA, Ware JE Jr, Rogers W, Raczek AE, Lu JF. The validity and relative precision of MOS short- and long-form health status scales and Dartmouth COOP charts. Results from the Medical Outcomes Study. Med Care 1992;30(Suppl):MS253–MS265.

[29] McHorney CA, Ware JE Jr, Lu JF, Sherbourne CD. The MOS 36-item Short-Form Health Survey (SF-36): III. Tests of data quality, scaling assumptions, and reliability across diverse patient groups. Med Care 1994;32:40–66.
[30] McHorney CA, Ware JE Jr, Raczek AE. The MOS 36-Item Short-Form Health Survey (SF-36): II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med Care 1993;31:247–263.
[31] Raczek AE, Ware JE, Bjorner JB, et al. Comparison of Rasch and summated rating scales constructed from SF-36 physical functioning items in seven countries: results from the IQOLA Project. International Quality of Life Assessment. J Clin Epidemiol 1998;51:1203–1214.
[32] Aaronson NK, Muller M, Cohen PD, et al. Translation, validation, and norming of the Dutch language version of the SF-36 Health Survey in community and chronic disease populations. J Clin Epidemiol 1998;51:1055–1068.