Multiattribute health utility scoring for the computerized adaptive measure CAT-5D-QOL was developed and validated


Journal of Clinical Epidemiology 68 (2015) 1213–1220

Jacek A. Kopec (a,b,*), Eric C. Sayre (b), Pamela Rogers (b), Aileen M. Davis (c,d,e), Elizabeth M. Badley (f), Aslam H. Anis (a,b), Michal Abrahamowicz (g), Lara Russell (h), Md Mushfiqur Rahman (b), John M. Esdaile (b,i)

(a) School of Population and Public Health, University of British Columbia, 2206 East Mall, Vancouver, British Columbia, Canada V6T 1Z9
(b) Arthritis Research Canada, 5591 No. 3 Road, Richmond, British Columbia, Canada V6X 2C7
(c) Health Care and Outcomes Research, Toronto Western Research Institute, University Health Network, 399 Bathurst Street, Toronto, Ontario, Canada M5T 2S8
(d) Institute of Health Policy, Management and Evaluation, University of Toronto, 55 College Street, Toronto, Ontario, Canada M5T 3M7
(e) Departments of Physical Therapy and the Rehabilitation Science Institute, University of Toronto, 500 University Avenue, Toronto, Ontario, Canada M5G 1V7
(f) Dalla Lana School of Public Health, University of Toronto, Health Sciences Building, 155 College St., Toronto, Ontario, Canada M5S 1A8
(g) Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Purvis Hall, 1020 Pine Ave. West, Montreal, Quebec, Canada H3A 1A2
(h) Centre for Health Evaluation and Outcome Sciences, University of British Columbia, 1081 Burrard Street, St. Paul's Hospital, Vancouver, British Columbia, Canada V6Z 1Y6
(i) Department of Medicine, University of British Columbia, 1200 Burrard Street, Vancouver, British Columbia, Canada V6Z 2C7

Accepted 14 March 2015; Published online 14 April 2015

Abstract

Objectives: The CAT-5D-QOL is a previously reported item response theory (IRT)-based computerized adaptive tool to measure five domains (attributes) of health-related quality of life. The objective of this study was to develop and validate a multiattribute health utility (MAHU) scoring method for this instrument.

Study Design and Setting: The MAHU scoring system was developed in two phases. In phase I, we obtained standard gamble (SG) utilities for 75 hypothetical health states in which only one domain varied (15 states per domain). In phase II, we obtained SG utilities for 256 multiattribute states. We fit a multiplicative regression model to predict SG utilities from the five IRT domain scores. The prediction model was constrained using data from phase I. We validated MAHU scores by comparing them with the Health Utilities Index Mark 3 (HUI3) and directly measured utilities and by assessing between-group discrimination.

Results: MAHU scores have a theoretical range from −0.842 to 1. In the validation study, the scores were, on average, higher than HUI3 utilities and lower than directly measured SG utilities. MAHU scores correlated strongly with the HUI3 (Spearman r = 0.78) and discriminated well between groups expected to differ in health status.

Conclusion: Results reported here provide initial evidence supporting the validity of the MAHU scoring system for the CAT-5D-QOL. © 2015 Elsevier Inc. All rights reserved.

Keywords: Health-related quality of life; Item response theory; Computerized adaptive testing; Health utilities; Standard gamble; Validity; Discrimination

Funding: This research was supported by a grant from the Canadian Arthritis Network (Grant No. 01-MNO-08R).
* Corresponding author. School of Population and Public Health, University of British Columbia, Milan Ilich Arthritis Research Centre, 5591 No. 3 Road, Richmond, British Columbia V6X 2C7. Tel.: 604-207-4047; fax: 604-207-4059. E-mail address: [email protected] (J.A. Kopec).
http://dx.doi.org/10.1016/j.jclinepi.2015.03.020
0895-4356/© 2015 Elsevier Inc. All rights reserved.

What is new?
• This study provides a formula for converting item response theory-based scores from a computerized adaptive health-related quality of life instrument (CAT-5D-QOL) into standard gamble (SG)-based multiattribute health utilities (MAHUs).
• MAHU scores were, on average, higher than Health Utilities Index Mark 3 (HUI3) scores and lower than directly measured SG utilities, correlated strongly with the HUI3, and discriminated well between groups expected to differ in health status.
• The study provides preliminary evidence to support the validity of the MAHU scores and suggests that the CAT-5D-QOL may be useful for economic evaluation studies.

1. Introduction

Measurement methods based on item response theory (IRT) and computerized adaptive testing (CAT) have been used for many years in educational and psychological testing, and their advantages over conventional measurement methods are well established [1,2]. In measuring health and health-related quality of life (HRQOL), the last 2 decades have seen a rapid increase in the use of IRT [3–5]. IRT-based item banks and CAT systems for various aspects of emotional and physical function and symptoms have been developed. The most extensive CAT system is the Patient-Reported Outcomes Measurement Information System (PROMIS) [6]. It includes item banks for anger, anxiety, depression, fatigue, pain behavior, pain interference, physical function, sleep disturbance, social satisfaction, and other health domains [4,7]. Other examples of CAT measures are item banks for headache impact [8], osteoarthritis [9], back pain [10], and asthma [11].

We have developed a generic CAT system for measuring five domains of HRQOL, the CAT-5D-QOL [12]. The domains measured are walking (Walk), handling objects (Hand), daily activities (Daily), pain or discomfort (Pain), and feelings (Feel). We have also developed a semiadaptive paper version of this instrument (PAT-5D-QOL) [13]. Previous studies discussed details of the methodology [12], validation of the CAT-5D-QOL in back pain [14], and data from a simulation comparing the adaptive, semiadaptive, and fixed versions of the questionnaire for each domain [13].

For many applications of HRQOL instruments, such as burden of disease studies or cost-utility evaluations of health interventions, there is a need for measuring health preferences (utilities) in addition to psychometric scores [15]. Health utilities allow the calculation of quality-adjusted life years or health-adjusted life expectancy [16]. Direct methods of measuring utilities include standard gamble (SG), time trade-off, person trade-off, and willingness to pay [17]. Indirect health utility measures consist of a health questionnaire and an empirically derived formula to convert item responses to utilities. Widely used multiattribute indirect health utility measures are the EQ-5D [18], Health Utilities Index Mark 3 (HUI3) [19], 15D [20], and Assessment of Quality of Life (AQoL) [21]. Rather than using a separate instrument to measure health utilities, utility scores are often derived from generic psychometric instruments [22–26]. A model to derive EQ-5D scores from five PROMIS item banks has been developed [27]. Scores from condition-specific instruments have also been converted to health utilities [28–30].

The purpose of the present study was to derive a multiattribute health utility (MAHU) scoring formula for the CAT-5D-QOL and assess the measurement properties of the MAHU scores.

2. Methods

2.1. Phase I: utilities for single-deficit states

A random sample of 151 individuals from Metro Vancouver, Canada, selected from the telephone directory, provided health utilities via SG for 15 hypothetical health states per domain (75 states total). Details of the measurement protocol are provided in Appendix A at www.jclinepi.com (available online). Each set of 15 states was constructed such that four of the five domains of the CAT-5D-QOL were kept constant at "perfect health" and only one domain varied (single-deficit states). The level of health for the affected domain was described using response options to six items from that domain. The items and response options for each state were selected using IRT methods, such that the IRT scores associated with each resultant health state were approximately evenly distributed between the highest and lowest estimable IRT score for a given domain. IRT scores were obtained via the generalized partial credit model (GPCM) [31], using our adaptive algorithm [14].

SG utilities for all 75 states were obtained through a personal interview. Respondents were presented with state descriptions and asked to choose between a given state and a gamble with specified probabilities of death and perfect health. Two examples of state descriptions are provided in Appendix A at www.jclinepi.com (Fig. S1). Utilities were calculated as the probability of perfect health at the point of indifference. These data were used to estimate the relationships between IRT scores for each domain and SG utilities for single-deficit states and, subsequently, to constrain the utility function, as described under the MAHU model.

2.2. Phase II: utility function for multiattribute states

In phase II, utilities were collected via SG on 256 multiattribute states using a different random population sample of 128 subjects from Metro Vancouver in a fractional factorial design in which each multiattribute state was rated by four subjects and each subject rated eight states (n = 1,024 data records). The multiattribute (five attributes) health states were constructed using only one item from each domain. The items used to construct the states were the "routing" items. The routing items play an important role in the CAT algorithm because they are the first items to be administered. The wording of these items for each of the five domains is given in Appendix A at www.jclinepi.com (Fig. S2). Because each routing item has four options, there were 4^5 = 1,024 possible multiattribute states in phase II.

SG utilities for the multiattribute states were obtained through personal interviews, as described in phase I for single-deficit states. For states identified by the respondents as worse than being dead, negative utilities were obtained by choosing between death and a gamble with specified probabilities of perfect health and the state worse than dead. Utilities for such states were calculated as the negative probability of perfect health at the point of indifference (monotonic, nonlinear transformation of the original utility scale) [32].

Each multiattribute state was assigned five domain-specific IRT scores, one per domain, based on the level of response to the routing item. IRT scores for the two extreme responses were estimated using the mean IRT score from the original item calibration study among those who selected that response option [12]. IRT scores for the two intermediate options were estimated using the GPCM [31]. This allowed us to develop a model linking IRT domain scores with multiattribute health utilities.
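The standard gamble scoring rule and the phase II design arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the study's software: the function name `sg_utility` and its arguments are hypothetical, and only the scoring rule stated in the text (utility equals the indifference probability of perfect health, negated for states judged worse than dead) is assumed.

```python
def sg_utility(p_perfect_health: float, worse_than_dead: bool = False) -> float:
    """Standard gamble utility from the indifference probability of perfect health.

    Hypothetical helper: for states better than dead the utility is p; for
    states judged worse than dead the text scores them as -p, so the scale
    runs from -1 to 1.
    """
    if not 0.0 <= p_perfect_health <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return -p_perfect_health if worse_than_dead else p_perfect_health


# Phase II design arithmetic as reported in the text.
levels_per_routing_item, n_domains = 4, 5
possible_states = levels_per_routing_item ** n_domains   # 4^5 possible states
n_subjects, states_per_subject = 128, 8
states_rated, raters_per_state = 256, 4
records = n_subjects * states_per_subject                 # data records collected

# Both ways of counting the records must agree.
assert records == states_rated * raters_per_state
```

Only 256 of the 1,024 possible states were rated, which is why the design is described as fractional factorial.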

2.3. MAHU model

The final MAHU model was fit on 256 records, each containing five IRT scores and the mean MAHU score for a unique multiattribute health state. We transformed the mean MAHU score from a conceptual range of −1 to 1 to a 0 to 1 scale and took the log, then regressed it by least squares on the five domains' IRT scores, with the following constraints. First, the model was constrained at one when all five domains were at their maximum estimable IRT scores. Second, when only one domain was less than perfect, the MAHU model prediction curve was forced through a point derived empirically from phase I data and corresponding to the lowest IRT score among the 15 states that were evaluated. To this end, a 0.5 window, first-order locally weighted scatterplot smooth was fit to the data with a tricube weighting function to estimate the mean single-deficit utility at the lowest IRT score among the 15 states per domain. These coordinates were used as lower constraints in the MAHU model.

Because of these constraints, the MAHU scores have a theoretically appropriate range, with the maximum score of 1 for perfect health and a score of 0 for dead, and are highly consistent with empirical data for single-deficit states when only one domain is affected. At the same time, the contribution of each attribute to the MAHU score is determined by the multiattribute data from phase II. The MAHU model is multiplicative in the health utility scale, with five factors bounded from 0 to 1, each raised to a power that is a function of the MAHU score curvature as that domain varies (Equation 1 in Appendix B at www.jclinepi.com). Thus, the model is nonlinear and allows for empirically and theoretically appropriate multiplicative interactions between the attributes. For comparison, in Appendix C at www.jclinepi.com, we show three common theoretical forms of multiattribute utility function [33].

Model fit was assessed with residual diagnostics, including residual plots vs. the regressors and predicted MAHU, and a normal Q–Q plot of the residuals. We also compared mean observed utility for each health state to predicted utility.

2.4. Empirical validation

An online survey was carried out among members of the Canadian Association of Retired Persons. The questionnaire included CAT-5D-QOL (five items per domain were administered), HUI3, and SG assessment of one's own health, plus questions about 12 self-reported chronic conditions and use of health services. Validity was assessed by comparing MAHU scores with HUI3 scores and SG for the whole sample and subgroups reporting various chronic conditions and by "known-group" discrimination. Known groups were defined according to the use of medication, hospitalization, and number of visits to a doctor in the past year. Discrimination for known groups was assessed by the chi-squared test from a multinomial logistic model, adjusted for age and sex. A relative discrimination (RD) index (range 0–1) for MAHU and HUI3 was calculated as the ratio of the chi-square value for a given measure to the chi-square value for the most discriminating measure [14].

The study was approved by the University of British Columbia Research Ethics Board, and all subjects provided informed consent to participate in the study.
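The transformation and regression steps described under the MAHU model can be illustrated with a simplified sketch. It is not the authors' Equation 1 (which is given only in Appendix B) and it omits the phase I lower constraints. It assumes a plain log-linear form in which the predictors are deficits from the maximum estimable IRT score and the intercept is dropped, so the prediction is exactly 1 at perfect health. The function names and this parametrization are hypothetical.

```python
import numpy as np


def fit_log_utility(theta, theta_max, mean_utility):
    """Least-squares fit of log((u + 1) / 2) on IRT deficits, without intercept.

    theta: (n_states, 5) IRT scores; theta_max: (5,) maximum estimable scores;
    mean_utility: (n_states,) mean SG utilities on the -1..1 scale (must be > -1).
    """
    deficit = np.asarray(theta, dtype=float) - np.asarray(theta_max, dtype=float)
    # Rescale -1..1 to 0..1, then take the log, as described in the text.
    y = np.log((np.asarray(mean_utility, dtype=float) + 1.0) / 2.0)
    coef, *_ = np.linalg.lstsq(deficit, y, rcond=None)
    return coef


def predict_utility(theta, theta_max, coef):
    """Back-transform to the -1..1 scale; a product of one factor per domain."""
    deficit = np.asarray(theta, dtype=float) - np.asarray(theta_max, dtype=float)
    return 2.0 * np.exp(deficit @ coef) - 1.0


# Synthetic check with made-up coefficients (not the values in Table 2).
rng = np.random.default_rng(0)
theta_max = np.full(5, 3.0)
true_coef = np.array([0.1, 0.2, 0.15, 0.4, 0.25])
theta = theta_max - rng.uniform(0.0, 1.0, size=(256, 5))
utility = 2.0 * np.exp((theta - theta_max) @ true_coef) - 1.0
coef = fit_log_utility(theta, theta_max, utility)
```

Because the exponential of a sum is a product of exponentials, a fit that is linear on the log scale is multiplicative on the utility scale, which is the property the text emphasizes. In this parametrization the fitted coefficients are positive, so they are not directly comparable with the coefficients the paper reports.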

3. Results

Phase I (n = 151) and II (n = 128) sample characteristics are shown in Table 1. About 80% of the subjects in both samples were between 25 and 64 years of age, 64% and 57%, respectively, were female, and 68% and 63%, respectively, had some college or university education. About 60% in both groups reported excellent or very good health.

Table 1. Demographic characteristics of phase I (n = 151) and phase II (n = 128) samples

Variable                            Phase I sample, N (%)   Phase II sample, N (%)
Age
  18–24                             8 (5.3)                 4 (3.1)
  25–44                             79 (52.3)               53 (41.4)
  45–64                             47 (31.1)               49 (38.3)
  65+                               17 (11.3)               22 (17.2)
Gender
  Male                              55 (36.4)               55 (43.0)
  Female                            96 (63.6)               73 (57.0)
Education
  High school or less               13 (8.7)                19 (14.8)
  Professional or trade school      35 (23.2)               29 (22.7)
  University (some or completed)    103 (68.2)              80 (62.5)
Self-rated health
  Excellent or very good            92 (60.9)               77 (60.2)
  Good, fair, or poor               58 (39.4)               51 (39.8)

One subject in the phase I sample did not report self-rated health.

Model constraints and regression parameters with 95% confidence intervals are provided in Table 2. All regression parameters were significant at α = 0.05. The beta coefficients were negative, with absolute magnitude corresponding to the curvature of the MAHU function. Pain was associated with the strongest effect on MAHU curvature, followed by Feel, Hand, Daily, and Walk. Fig. 1 shows model predictions when only a single domain is allowed to vary, holding the other four domains constant. The lower and upper constraints for each domain are also shown. Model fit was good, as indicated by log-scale residuals with no patterns and the normal Q–Q plot. Adjusted R² for comparing mean observed utility for each health state to predicted utility was 0.43. In Appendix D at www.jclinepi.com, we expand on Fig. 1 by showing MAHU scores when pairs of domains are allowed to vary.

Table 2. Regression coefficients and constraints for the multiattribute health utility model

Domain   βd (95% CI)                MinEstθd   MaxEstθd   MinConstrθd   MinConstrUd
Walk     −0.1962 (−0.23, −0.16)     −3.071     3.181      −2.624        0.719
Hand     −0.4848 (−0.64, −0.33)     −3.895     3.468      −2.930        0.575
Daily    −0.3814 (−0.42, −0.35)     −3.783     2.200      −3.301        0.497
Pain     −0.9324 (−1.07, −0.79)     −4.426     3.218      −2.909        0.273
Feel     −0.5422 (−0.90, −0.19)     −4.762     4.056      −2.731        0.675

Abbreviation: CI, confidence interval. Please see Appendix for the definitions of the constraints.

Fig. 1. MAHU functions varying each domain while holding other domains constant at perfect health. PredUtW, PredUtH, PredUtD, PredUtP, and PredUtF, predicted MAHU scores varying Walk, Hand, Daily, Pain, and Feel, respectively; MaxEstW, MaxEstH, MaxEstD, MaxEstP, and MaxEstF, maximum estimable IRT score for each domain; MinConstrW, MinConstrH, MinConstrD, MinConstrP, and MinConstrF, minimum constraint for each domain.

3.1. Validation

There were 551 individuals who completed the online questionnaire. The mean age was 66 years (range, 29–87), 57% were female, and 56% were college or university graduates (Table 3). MAHU was skewed (as expected given population sampling) with a mean of 0.876 (median, 0.919; range, 0.277–0.999). The mean (median) was 0.734 (0.833) for the HUI3 and 0.920 (0.995) for SG. MAHU correlated 0.78 with the HUI3 and 0.32 with SG, whereas the correlation between the HUI3 and SG was 0.26 (Spearman rank correlations). The intraclass correlation coefficient between the three measures was 0.33. Women reported lower MAHU and HUI3 scores than men (Table 3). Neither MAHU nor HUI3 scores differed by age or education.

Table 3. Sample characteristics and mean MAHU, HUI3, and SG utilities in the validation study among members of the Canadian Association of Retired Persons

Variable                        N (%)         MAHU   HUI3   SG
Age
  <60                           93 (17.3)     0.86   0.71   0.90
  60–64                         144 (26.7)    0.88   0.76   0.92
  65–69                         147 (27.3)    0.89   0.75   0.94
  70+                           155 (28.8)    0.88   0.72   0.91
Gender
  Male                          237 (43.3)    0.90   0.76   0.91
  Female                        310 (56.7)    0.86   0.72   0.93
Education (completed)
  Elementary                    16 (3.0)      0.87   0.62   0.92
  High school                   133 (24.6)    0.87   0.75   0.95
  Technical or trade school     90 (16.7)     0.86   0.71   0.91
  University                    213 (39.4)    0.88   0.73   0.91
  Postgraduate study            88 (16.3)     0.91   0.78   0.93

Abbreviations: MAHU, multiattribute health utility; HUI3, Health Utilities Index Mark 3; SG, standard gamble.

MAHU, HUI3, and SG scores for subjects who self-reported each health condition are presented in Fig. 2. The highest MAHU was observed for cancer, hearing problems, eye problems, and diabetes and the lowest for mood and anxiety disorders. Ordering of conditions by utility score was similar for MAHU and the HUI3 with a few exceptions. In particular, hearing disorders were ranked lower according to the HUI3, whereas knee osteoarthritis was ranked higher. In the age- and sex-adjusted model, differences in MAHU scores between persons with and without 11 of the 12 conditions (except cancer) were significant (data not shown), with differences ranging from 0.16 for an anxiety condition to 0.03 for hearing problems (Fig. 2).

Fig. 2. Mean utility scores (SG, MAHU, and HUI3) for persons reporting selected chronic conditions (ordered by MAHU). SG, standard gamble; MAHU, multiattribute health utility; HUI3, Health Utilities Index Mark 3.

With regard to known-group discrimination, both MAHU and the HUI3 showed highly significant differences (Table 4). The RD index for the HUI3 compared with MAHU was 0.81 for medication, 0.80 for visits to doctors, and 0.44 for hospitalization.
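The relative discrimination (RD) index used for the known-group comparisons is defined in the Methods as the ratio of each measure's chi-square value to the chi-square value of the most discriminating measure. A minimal sketch follows; the function name is hypothetical and the chi-square values in the example are made up for illustration, since the paper reports only the resulting ratios.

```python
def relative_discrimination(chi_square_by_measure: dict[str, float]) -> dict[str, float]:
    """RD index: each measure's chi-square divided by the largest chi-square.

    The most discriminating measure gets RD = 1.0 and all others fall in (0, 1].
    """
    best = max(chi_square_by_measure.values())
    if best <= 0:
        raise ValueError("at least one chi-square value must be positive")
    return {name: value / best for name, value in chi_square_by_measure.items()}


# Hypothetical chi-square values for one known-group variable (not from the paper).
rd = relative_discrimination({"MAHU": 50.0, "HUI3": 40.0})
```

An RD of 1.00 for MAHU therefore means only that MAHU had the larger chi-square of the two measures for that variable, not that its discrimination was perfect.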

Table 4. Discrimination according to health services use for MAHU and HUI3

Variable                          N (%)         Mean MAHU   Mean HUI3   RD MAHU   RD HUI3
Number of medications                                                   1.00      0.81
  None                            77 (14.1)     0.95        0.84
  One                             97 (17.7)     0.93        0.82
  Two                             93 (17.0)     0.90        0.79
  Three or four                   147 (26.9)    0.88        0.75
  Five or more                    133 (24.3)    0.77        0.55
Frequency of pain medication                                            1.00      0.80
  Never                           152 (27.9)    0.95        0.85
  Once a week or less             162 (29.7)    0.91        0.79
  Two or three times a week       86 (15.8)     0.86        0.68
  Once a day                      75 (13.8)     0.83        0.69
  More than once a day            70 (12.8)     0.70        0.46
Visits to doctor in last year                                           1.00      0.80
  Never                           12 (2.2)      0.91        0.76
  Once                            65 (11.9)     0.94        0.84
  Twice                           108 (19.7)    0.93        0.82
  3–4 times                       208 (38.0)    0.89        0.76
  5–10 times                      114 (20.8)    0.80        0.63
  >10 times                       40 (7.3)      0.74        0.49
Hospitalization in last year                                            1.00      0.44
  Yes                             59 (10.8)     0.81        0.64
  No                              488 (89.2)    0.88        0.75

Abbreviations: MAHU, multiattribute health utility; HUI3, Health Utilities Index Mark 3; RD, relative discrimination. Discrimination was assessed in a multinomial regression model adjusted for age and sex. RD index for measure X with respect to a categorical variable V is defined as a ratio of the chi-square value for measure X to the chi-square value for the measure with the highest chi-square for discriminating between the categories of variable V.

4. Discussion

In this article, we present the utility scoring function for a previously published IRT-based computerized adaptive test for five domains of HRQOL, the CAT-5D-QOL [12,14]. This function also applies to the semiadaptive paper version of the instrument, PAT-5D-QOL [13]. Our scoring function uses a nonlinear regression model to convert IRT-based scores for five domains of HRQOL into an SG-based multiattribute utility score. The scoring function is constrained so that MAHU is 1 when all 5 domains are at their maximum estimable score. A MAHU score of 0 corresponds to the state of being dead. MAHU has been additionally constrained using data on the relationship between SG utilities for single-deficit states and domain-specific IRT scores based on response profiles to multiple items.

A unique aspect of our study is the application of CAT methodology to the construction of health states. This approach is justified by modern psychometric theory (IRT), whereby all items in an item bank measure the same attribute. The items for each state were carefully selected to provide maximum information at the targeted value of that attribute and to ensure logical consistency and likelihood of occurrence in real data.

The model explained 43% of the variance in mean utilities across the multiattribute health states. Considering parsimony and model structure, as well as using constraints from data other than the multiattribute data on which the model was fit, this proportion of variance explained is acceptable [34]. The highest MAHU score below 1 is >0.999. The lowest estimable MAHU is −0.842. It corresponds to the state in which all domains are at the lowest estimable IRT score with all items being answered. The number of possible MAHU scores is practically infinite (about 10^125) when the full item banks are used, although this is usually impractical. For an adaptive test using five items per domain, there are about 10^17 possible MAHU scores.

In addition to utility-based scoring that combines all domains, IRT scores for each domain can be obtained. These scores can be measured on (1) the original IRT scale (theoretically unbounded), (2) a norm-based scale (mean = 50, SD = 10), and (3) a 0–100 bounded scale. All three scales can be useful, depending on the specific application of the instrument and requirements for score interpretation. The norm-based scale allows the user to interpret the scores as deviations (in standard deviation units) from the population average, whereas the 0–100 scale provides information on how far a given score is from the best and worst possible scores (Appendix E at www.jclinepi.com).

Widely used health utility instruments, such as the HUI3 or EQ-5D, use one or a few items to classify individuals into several (usually three to six) levels of health for a given domain. The amount of detail, accuracy, discrimination, and sensitivity to change of a single item are generally much lower than those of a multi-item instrument [35]. Furthermore, the advantages of adaptive IRT-based instruments compared with traditional generic and disease-specific HRQOL questionnaires are well established. CAT ensures that the items asked are relevant and appropriate for the respondent's level of health. As a result, the assessment is highly efficient, with the same level of precision achieved with fewer items and, consequently, lower respondent burden. In addition, precision of measurement for the extreme levels of health tends to be higher than in traditional instruments and ceiling and floor effects are less common, which helps avoid bias in assessing extreme levels of health [36,37]. The CAT-5D-QOL offers all the aforementioned psychometric advantages of adaptive testing. In contrast, ceiling effects are common for the HUI3 and other utility instruments, especially in younger or relatively healthy populations [38–40], and floor effects are common in the Short Form-6D (SF-6D) [40].

The types of domains covered by CAT-5D-QOL are similar to those of the EQ-5D and somewhat different from the HUI3. We have not included vision, hearing, speech, or cognitive function and included daily activities (Daily). The inclusion of the Daily domain is important because it captures the impact of health problems that are not explicitly mentioned, such as those due to cognitive or sensory impairments, fatigue, and other symptoms.

MAHU scores are strongly correlated with HUI3 scores and, on average, higher than HUI3 utilities but lower than directly measured SG utilities for one's own health. The observed difference between MAHU and SG utilities is expected because SG valuations of one's own health are affected by adaptation to illness [41]. For each self-reported chronic condition except cancer, MAHU scores were significantly lower in persons reporting the condition than in the remainder of the sample. With a few exceptions, especially hearing problems and knee osteoarthritis, the ordering of conditions by utility was similar for MAHU and HUI3. Hearing is one of the domains in the HUI3 but not in the CAT-5D-QOL. The impact of osteoarthritis, on the other hand, is measured in greater detail by the CAT-5D-QOL, as this instrument considers the level of difficulty for a range of activities. Furthermore, MAHU discriminated better than the HUI3 with respect to different levels of health services use, such as use of medication, visits to doctors, or hospitalization. This suggests a high level of sensitivity to multiple health problems associated with common chronic conditions.

A potential limitation of the instrument in some applications is that it requires a CAT system for questionnaire administration and scoring (available from the authors) and Internet access for self-administration. However, the questionnaire can also be administered by an interviewer, either in person or over the phone. A semiadaptive paper-based version of the questionnaire (PAT-5D-QOL) that does not require CAT scoring software is available [13]. Another limitation is that translations of the item banks to languages other than English are not available at this time.

In conclusion, we have developed an SG-based multiattribute utility function for a set of IRT scores from a multidomain CAT instrument. The CAT-5D-QOL with MAHU scoring can be used in research and clinical practice as a generic measure of HRQOL. In addition to generating efficient and precise IRT-based scores for five HRQOL domains, it now offers well-discriminating health utilities for disease burden studies and economic evaluations. Future applications of this instrument, especially its use in clinical trials and longitudinal studies, as well as head-to-head comparisons with other measures, will continue to provide useful information on the clinical meaning, sensitivity to change, and other properties of MAHU scores.
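The two rescaled versions of the domain IRT scores mentioned in the Discussion (a norm-based scale with mean 50 and SD 10, and a 0–100 bounded scale) can be sketched as simple linear transformations. This is an assumption-laden illustration: the norm-based formula follows directly from the stated mean and SD, but the linear mapping between the lowest and highest estimable scores for the 0–100 scale is assumed here, because the actual definition is in Appendix E. Function names and example values are hypothetical.

```python
def norm_based_score(theta: float, pop_mean: float, pop_sd: float) -> float:
    """Norm-based score: population mean maps to 50, one SD maps to 10 points."""
    return 50.0 + 10.0 * (theta - pop_mean) / pop_sd


def bounded_score(theta: float, theta_min: float, theta_max: float) -> float:
    """0-100 score, ASSUMING a linear map from the lowest to highest estimable score."""
    return 100.0 * (theta - theta_min) / (theta_max - theta_min)


# Hypothetical example: a respondent one SD below the population mean.
example_norm = norm_based_score(-1.0, pop_mean=0.0, pop_sd=1.0)
# Hypothetical bounds for one domain; the midpoint of the range maps to 50.
example_bounded = bounded_score(0.0, theta_min=-3.0, theta_max=3.0)
```

The norm-based score answers "how far from the population average", while the bounded score answers "how far from the best and worst possible scores", which matches the two interpretations given in the text.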

References [1] Lord FM. Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers; 1980. [2] Wainer H. Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers; 1990. [3] Revicki DA, Cella DF. Health status assessment for the twenty-first century: item response theory, item banking and computer adaptive testing. Qual Life Res 1997;6:595e600. [4] Carle AC, Cella D, Cai L, Choi SW, Crane PK, Curtis SM, et al. Advancing PROMIS’s methodology: results of the third PROMIS Psychometric Summit. Expert Rev Pharmacoecon Outcomes Res 2011;11:677e84. [5] Tulsky DS, Carlozzi NE, Cella D. Advances in outcomes measurement in rehabilitation medicine: current initiatives from the National Institutes of Health and the National Institute on Disability and Rehabilitation Research. Arch Phys Med Rehabil 2011;92(10 Suppl): S1e6. [6] Fries JF, Bruce B, Cella D. The promise of PROMIS: using item response theory to improve assessment of patient-reported outcomes. Clin Exp Rheumatol 2005;23:S53e7. [7] Patient-Reported Outcome Measurement and Information System (PROMIS) website. Available at: http://www.nihpromis.org/. Accessed on April 3, 2014. [8] Ware JE Jr, Kosinski M, Bjorner JB, Bayliss MS, Batenhorst A, Dahl€ of CG, et al. Applications of computerized adaptive testing (CAT) to the assessment of headache impact. Qual Life Res 2003; 12:935e52. [9] Jette AM, McDonough CM, Ni P, Haley SM, Hambleton RK, Olarsch S, et al. A functional difficulty and functional pain instrument for hip and knee osteoarthritis. Arthritis Res Ther 2009;11: R107. [10] Hart DL, Mioduski JE, Mark W, Werneke MW, Stratford PW. Simulated computerized adaptive test for patients with lumbar spine impairments was efficient and produced valid measures of function. J Clin Epidemiol 2006;59:947e56. [11] Turner-Bowker DM, DeRosa MA, Saris-Baglama RN, Bjorner JB. 
Development of a computerized adaptive test to assess healthrelated quality of life in adults with asthma. J Asthma 2012;49: 190e200. [12] Kopec JA, Sayre EC, Davis AM, Badley EM, Abrahamowicz M, Sherlock L, et al. Assessment of health-related quality of life in arthritis: conceptualization and development of five item banks using item response theory. Health Qual Life Outcomes 2006; 4:33. [13] Kopec JA, Sayre EC, Davis AM, Badley EM, Abrahamowicz M, Pouchot J, et al. Development of a paper-and-pencil semi-adaptive questionnaire for 5 domains of health-related quality of life (PAT5D-QOL). Qual Life Res 2013;22:2829e42. [14] Kopec JA, Badii M, McKenna M, Lima VD, Sayre EC, Dvorak M. Computerized adaptive testing in back pain: validation of the CAT5D-QOL. Spine (Phila Pa 1976) 2008;33(12):1384e90.

1219

[15] Torrance GW. Measurement of health state utilities for economic appraisal: a review. J Health Econ 1986;5:1–30.
[16] Manuel DG, Schultz SE, Kopec JA. Measuring the health burden of chronic disease and injury using health adjusted life expectancy and the Health Utilities Index. J Epidemiol Community Health 2002;56:843–50.
[17] Green C, Brazier J, Deverill M. Valuing health-related quality of life. A review of health state valuation techniques. Pharmacoeconomics 2000;17(2):151–65.
[18] The EuroQol Group. EuroQol - a new facility for the measurement of health-related quality of life. Health Policy 1990;16:199–208.
[19] Feeny D, Furlong W, Torrance GW, Goldsmith CH, Zhu Z, DePauw S, et al. Multiattribute and single-attribute utility functions for the health utilities index mark 3 system. Med Care 2002;40(2):113–28.
[20] Sintonen H. The 15D instrument of health-related quality of life: properties and applications. Ann Med 2001;33:328–36.
[21] Richardson J, Iezzi A, Khan MA, Maxwell A. Validity and reliability of the assessment of quality of life (AQoL)-8D multi-attribute utility instrument. Patient 2014;7(1):85–96.
[22] Revicki DA. Relationship between health utility and psychometric health status measures. Med Care 1992;30:MS274–82.
[23] Nichol MB, Sengupta N, Globe DR. Evaluating quality-adjusted life years: estimation of the health utility index (HUI2) from the SF-36. Med Decis Making 2001;21:105–12.
[24] Brazier J, Roberts J, Deverill M. The estimation of a preference-based measure of health from the SF-36. J Health Econ 2002;21:271–92.
[25] Sengupta N, Nichol MB, Wu J, Globe D. Mapping the SF-12 to the HUI3 and VAS in a managed care population. Med Care 2004;42:927–37.
[26] Sullivan PW, Ghushchyan V. Mapping the EQ-5D index from the SF-12: US general population preferences in a nationally representative sample. Med Decis Making 2006;26:401–9.
[27] Revicki DA, Kawata AK, Harnam N, Chen WH, Hays RD, Cella D. Predicting EuroQol (EQ-5D) scores from the patient-reported outcomes measurement information system (PROMIS) global items and domain item banks in a United States sample. Qual Life Res 2009;18:783–91.
[28] Grootendorst P, Marshall D, Pericak D, Bellamy N, Feeny D, Torrance GW. A model to estimate health utilities index mark 3 utility scores from WOMAC index scores in patients with osteoarthritis of the knee. J Rheumatol 2007;34:534–42.
[29] Stolk EA, Busschbach JJ. Validity and feasibility of the use of condition-specific outcome measures in economic evaluation. Qual Life Res 2003;12:363–71.
[30] Bansback NJ, Brennan A, Ghatnekar O. Cost effectiveness of adalimumab in the treatment of patients with moderate to severe rheumatoid arthritis in Sweden. Ann Rheum Dis 2005;64:995–1002.
[31] Muraki E. A generalized partial credit model. In: van der Linden WJ, Hambleton RK, editors. Handbook of modern item response theory. New York: Springer; 1996:153–64.
[32] Patrick DL, Starks HE, Cain KC, Uhlmann RF, Pearlman RA. Measuring preferences for health states worse than death. Med Decis Making 1994;14:9–18.
[33] Keeney RL, Raiffa H. Decisions with multiple objectives: preferences and value tradeoffs. New York: Cambridge University Press; 1993 (first edition: John Wiley & Sons, 1976).
[34] Dolan P. Modeling valuations for EuroQol health states. Med Care 1997;35:1095–108.
[35] Bowling A. Just one question: If one question works, why ask several? J Epidemiol Community Health 2005;59:342–5.
[36] Fries J, Rose M, Krishnan E. The PROMIS of better outcome assessment: responsiveness, floor and ceiling effects, and Internet administration. J Rheumatol 2011;38:1759–64.
[37] Hays RD, Morales LS, Reise SP. Item response theory and health outcomes measurement in the 21st century. Med Care 2000;38:II28–42.



[38] Fryback DG, Dunham NC, Palta M, Hanmer J, Cherepanov D, Herrington SA, et al. US norms for six generic health-related quality-of-life indexes from the National Health Measurement study. Med Care 2007;45:1162–70.
[39] Luo N, Chew LH, Fong KY, et al. A comparison of the EuroQol-5D and the Health Utilities Index mark 3 in patients with rheumatic disease. J Rheumatol 2003;30:2268–74.
[40] Brazier J, Roberts J, Tsuchiya A, Busschbach J. A comparison of the EQ-5D and SF-6D across seven patient groups. Health Econ 2004;13:873–84.
[41] Arnold D, Girling A, Stevens A, Lilford R. Comparison of direct and indirect methods of estimating health state utilities for resource allocation: review and empirical analysis. BMJ 2009;339:b2688.


Appendixes

Appendix A Details of utility elicitation methodology

Two professional interviewers with extensive training in preference measurement conducted the interviews. The bulk of the interviews were conducted at the Arthritis Research Centre in Vancouver, and a few were conducted at the participant's home. A sample of interviews were tape recorded for quality control. A short questionnaire to obtain basic demographic and overall health status information was self-administered before the interview.

The preference measurement task was preceded by a teaching phase. Health states for this phase were chosen to represent states similar, but not identical, to those used in the study. Cognitive interviewing methods were used during the teaching phase, including "thinking out loud" and a variety of probing questions. After the teaching phase, and again at the conclusion of the interview, the interviewer determined if the participant was able to listen to the instructions and to understand the questions. If he/she was not, the interview was declared invalid.

To measure preferences, we used two props, a Feeling Thermometer for measuring rating scale (RS) values and a Chance Board to measure standard gamble (SG) utilities. These props were obtained from the developers of the Health Utilities Index at McMaster University. RS preferences were measured first, followed by SG preferences. Because the Chance Board was limited to increments of 10%, we additionally used a set of cards displaying the probabilities with increments of 1% to increase precision of measurement.

In phase I, we measured utilities for 15 single-deficit health states within each domain (75 states total). The order of presentation of the health states was randomized. The health states were printed on arrow-shaped cards with six statements each. In each state, only one domain was affected and all statements pertained to that domain. The health state descriptions were based on selected domain-specific items drawn from our item banks. The first statement was always derived from the routing item. Two examples of health state cards are presented in Fig. S1. The levels of the unaffected domains were shown on a separate chart that remained visible to the respondent throughout the interview. For example, if the affected domain was Walk, the other domains were described as "You have no problems doing usual daily activities (e.g., work, school, housework, or leisure)," "You have no problems handling objects (e.g., carrying, grasping, moving things)," "You have no pain or discomfort," and "You are very happy."

Fig. S1. Two examples of cards with descriptions of health states for the affected domain.

In phase II, we measured preferences for multiattribute states. Each subject provided RS values and SG utilities for a sample of eight states, in random order. The health state descriptions were printed on arrow-shaped cards. Each card had five domain headings and, for each domain, one of four possible levels of health (Fig. S2). A chart with the entire range of response possibilities was introduced to the participants at the beginning of each interview and remained visible throughout the interview.

Fig. S2. Routing items in each domain.

We obtained SG utilities for 256 multideficit states, of 1,024 possible states, according to a fractional factorial design. The design had 8 states per block (subject) and 64 subjects in each of the two design replicates for a total of 128 subjects. This design allows a maximal model with all main effects and two-way interactions to be fit without confounding between any of the effects.

In the online validation study, SG utilities for the respondent's own health were elicited using a questionnaire implemented on the ARC online Research Survey System. The questions were preceded with the following preamble and example:

The following questions are hypothetical questions, but we would like you to answer them as best you can. Please imagine there is a new treatment (we call it a magic pill) that can cure all your current health problems, both physical and psychological. In other words, if you take this pill, you will achieve perfect (full) health. However, some people who take this pill may die (quickly and without any pain), and there is no way to tell who will be cured and who will die. We would like to know how much risk of death you would take to cure all your current health problems.

Example: In your current state of health, would you take the magic pill if the risk of death was 0% (no risk) and the chance of perfect health was 100% (certainty)?
• Yes, I would take the pill.
• No, I would not take the pill.

Explanation: In this hypothetical example, the rational choice would be to take the pill because it guarantees perfect health with no risk. So, everyone should select "Yes." In the following questions, the choice is not so obvious because the risk of death and the chance of perfect health change from question to question. Please remember that there is no correct answer, only what you think. Please try to think what you would likely do if this was a real-life decision.

One question per screen was shown and the "ping-pong" method was used to obtain SG utilities. For example, subjects were asked the following question:

In your current state of health, would you take the magic pill if the risk of death was 10% and the chance of perfect health was 90%?
• Yes, I would take the pill.
• No, I would not take the pill.

If the answer was No, this part of the questionnaire ended and the point of indifference was assumed to be the midpoint between 90% and the probability of perfect health in the previous question (95%), to which the respondent answered Yes. If the answer was Yes, the next question was: In your current state of health, would you take the magic pill if the risk of death was 90% and the chance of perfect health was 10%? The questions continued until the point of indifference was reached.

Appendix B MAHU model

Our nonlinear regression model to predict MAHU from the five domains' IRT scores is of the following form:

$$
U = 2\exp\left[-\sum_{d\in\{W,H,D,P,F\}} \beta_d \left(\frac{\mathrm{MaxEst}\theta_d - \theta_d}{\mathrm{MaxEst}\theta_d - \mathrm{MinEst}\theta_d}\right)^{\dfrac{\ln\left(-\ln\left(0.5 + 0.5\,\mathrm{MinConstr}U_d\right)\right) - \ln\left(\beta_d\right)}{\ln\left(\dfrac{\mathrm{MaxEst}\theta_d - \mathrm{MinConstr}\theta_d}{\mathrm{MaxEst}\theta_d - \mathrm{MinEst}\theta_d}\right)}}\right] - 1, \tag{1}
$$

where $\theta_d$ is the IRT score for domain d, $\beta_d$ is the regression coefficient for domain d, $\mathrm{MinEst}\theta_d$ and $\mathrm{MaxEst}\theta_d$ are the minimum and maximum estimable IRT scores from a complete response to the entire domain d, and $\mathrm{MinConstr}\theta_d$ and $\mathrm{MinConstr}U_d$ are the IRT score and single-deficit utility from the lower constraint for domain d.
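As an illustration of how Equation 1 can be evaluated, the following Python sketch computes U from five domain IRT scores. It reads Equation 1 as U = 2 exp(-sum of domain terms) - 1, where each domain term is the regression coefficient times the scaled deficit raised to a constraint-dependent exponent. The function name `mahu`, the argument names, and all numeric inputs are hypothetical and chosen only for illustration; they are not the published coefficients or constraints. The sketch also assumes that a domain at its highest estimable score contributes no disutility and that the lower-constraint score lies strictly between the minimum and maximum estimable scores.

```python
import math

DOMAINS = ("W", "H", "D", "P", "F")  # Walk, Hand, Daily, Pain, Feel


def mahu(theta, beta, min_est, max_est, constr_theta, constr_u):
    """Evaluate Equation 1. Every argument is a dict keyed by domain code."""
    total = 0.0
    for d in DOMAINS:
        span = max_est[d] - min_est[d]
        # Scaled deficit: 0 at the highest estimable score, 1 at the lowest.
        deficit = (max_est[d] - theta[d]) / span
        if deficit <= 0.0:
            continue  # assumption: no deficit is treated as no disutility
        # Scaled deficit at the lower-constraint IRT score.
        anchor = (max_est[d] - constr_theta[d]) / span
        # Exponent that makes a single deficit at the constraint score
        # reproduce the single-deficit utility constr_u[d].
        exponent = (
            math.log(-math.log(0.5 + 0.5 * constr_u[d])) - math.log(beta[d])
        ) / math.log(anchor)
        total += beta[d] * deficit ** exponent
    return 2.0 * math.exp(-total) - 1.0


# Hypothetical inputs for illustration only; not the published values.
beta = {d: 0.5 for d in DOMAINS}
min_est = {d: -4.0 for d in DOMAINS}
max_est = {d: 2.0 for d in DOMAINS}
constr_theta = {d: -3.0 for d in DOMAINS}
constr_u = {d: 0.5 for d in DOMAINS}

full_health = dict(max_est)           # every domain at its highest score
walk_deficit = dict(max_est, W=-3.0)  # only Walk at its constraint score

print(mahu(full_health, beta, min_est, max_est, constr_theta, constr_u))
print(mahu(walk_deficit, beta, min_est, max_est, constr_theta, constr_u))
```

With these inputs, full health evaluates to 1, and a single Walk deficit at the constraint score returns the constraint utility (0.5 here) up to rounding, which is the property the exponent in Equation 1 is built to satisfy. As the domain terms grow, U approaches -1, so the form can represent states valued as worse than death.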

Appendix C Forms of multiattribute utility functions

In their classic text, "Decisions with Multiple Objectives," Keeney and Raiffa [33] discussed three types of multiattribute utility function: additive, multiplicative, and multilinear. The following notation is used: $u_i(x_i)$ is the single-attribute utility function for attribute i, $x_i$ is the level of attribute i, and the k's are scaling constants. Then, the additive multiattribute utility function is defined as:

$$u(x) = \sum_i k_i u_i(x_i), \quad \text{where } \sum_i k_i = 1.$$

The multiplicative utility function is defined as:

$$u(x) = \frac{1}{k}\left[\prod_i \bigl(1 + k k_i u_i(x_i)\bigr) - 1\right], \quad \text{where } (1 + k) = \prod_i (1 + k k_i).$$

Finally, the multilinear utility function is defined as:

$$u(x) = k_1 u_1(x_1) + k_2 u_2(x_2) + \dots + k_{12} u_1(x_1) u_2(x_2) + k_{13} u_1(x_1) u_3(x_3) + \dots + k_{123} u_1(x_1) u_2(x_2) u_3(x_3) + \dots, \quad \text{where } \sum k\text{'s} = 1.$$
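To make the difference between the additive and multiplicative forms concrete, the following Python sketch evaluates both. In the multiplicative form the constant k is not free: it is the nonzero root of (1 + k) = product of (1 + k k_i), which the sketch finds by bisection. The function names, the variable `big_k` (standing in for k), the tolerances, and the example weights are hypothetical. The sketch assumes at least two attributes with each k_i strictly between 0 and 1, and it treats weights summing to within 1e-4 of 1 as the additive case, which is an approximation.

```python
import math


def additive_utility(k, u):
    """u(x) = sum_i k_i * u_i(x_i); the k_i are expected to sum to 1."""
    return sum(ki * ui for ki, ui in zip(k, u))


def solve_master_constant(k, eps=1e-6, iters=200):
    """Nonzero root of 1 + K = prod_i (1 + K * k_i), found by bisection."""
    total = sum(k)
    if abs(total - 1.0) < 1e-4:
        return 0.0  # assumption: close enough to 1 to use the additive form

    def f(big_k):
        return math.prod(1.0 + big_k * ki for ki in k) - (1.0 + big_k)

    if total > 1.0:
        lo, hi = -1.0, -eps   # the root lies in (-1, 0)
    else:
        lo, hi = eps, 1.0     # the root lies in (0, infinity)
        while f(hi) <= 0.0:
            hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (f(lo) > 0.0) == (f(mid) > 0.0):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def multiplicative_utility(k, u, big_k=None):
    """u(x) = (1/K) * [prod_i (1 + K * k_i * u_i(x_i)) - 1]."""
    if big_k is None:
        big_k = solve_master_constant(k)
    if big_k == 0.0:
        return additive_utility(k, u)
    product = math.prod(1.0 + big_k * ki * ui for ki, ui in zip(k, u))
    return (product - 1.0) / big_k


weights = [0.5, 0.4, 0.3]  # hypothetical scaling constants, sum > 1
print(solve_master_constant(weights))
print(multiplicative_utility(weights, [1.0, 0.0, 0.0]))
print(multiplicative_utility(weights, [1.0, 1.0, 1.0]))
```

Because the constraint on k is built in, the multiplicative form returns k_i when only attribute i is at its best level and 1 when all attributes are at their best, regardless of whether the k_i sum to more or less than 1.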


Appendix D Three-dimensional representation of the relationships between IRT scores and MAHU scores

Fig. S3. Predicted multiattribute health utilities (MAHU) for varying IRT scores on two domains while holding the other three domains constant at their highest estimable IRT.
(A) Walking (Walk) and handling objects (Hand). Lowest MAHU is 0.012.
(B) Walking (Walk) and daily activities (Daily). Lowest MAHU is 0.122.
(C) Walking (Walk) and pain or discomfort (Pain). Lowest MAHU is 0.353.
(D) Walking (Walk) and feelings (Feel). Lowest MAHU is 0.044.
(E) Handling objects (Hand) and daily activities (Daily). Lowest MAHU is 0.159.
(F) Handling objects (Hand) and pain or discomfort (Pain). Lowest MAHU is 0.515.
(G) Handling objects (Hand) and feelings (Feel). Lowest MAHU is 0.284.
(H) Daily activities (Daily) and pain or discomfort (Pain). Lowest MAHU is 0.462.
(I) Daily activities (Daily) and feelings (Feel). Lowest MAHU is 0.206.
(J) Pain or discomfort (Pain) and feelings (Feel). Lowest MAHU is 0.542.


Appendix E IRT score transformations

The MAHU formula (Equation 1 in Appendix B at www.jclinepi.com) uses the standardized maximum-likelihood GPCM-based IRT score for each domain that has a mean of 0 and standard deviation (SD) of 1, with the maximum and minimum estimable IRT scores based on the full item banks. For scoring the domains, in most applications of the instrument, it is preferable to use scores transformed to a norm-based scale with a mean of 50 and SD of 10 (Equation 2 below) or a 0–100 scale where 0 and 100 correspond to the lowest and highest estimable scores (Equation 3 below).

Normalizing 50/10 transformation formulas were developed based on IRT scores in a random population sample age 18+ from Metro Vancouver (n = 651) who were administered all items in the CAT-5D-QOL item banks (Kopec et al. [12]). IRT scores on a 50/10 scale in a standard population are obtained by applying the following formula:

$$\theta'_d = \left(\theta_d - \bar{\theta}_d\right)\frac{10}{s_d} + 50 \tag{2}$$

where $\theta_d$ is the original-scale IRT score for domain d to be transformed, $\bar{\theta}_d$ is the mean original-scale IRT score for domain d in the analysis data, $s_d$ is the standard deviation of the original-scale IRT score for domain d in the analysis data, and $\theta'_d$ is the resultant 50/10 IRT score, with mean 50 and standard deviation 10 in a standard population (Vancouver, BC in 2000). Table S1 provides the required constants for each domain.

Table S1. Constants for the 50/10 transformation of original-scale IRT scores

Domain                       Mean     SD
Walking (Walk)               1.282    0.918
Handling objects (Hand)      1.323    1.139
Daily activities (Daily)     1.324    0.940
Pain or discomfort (Pain)    1.288    1.114
Feelings (Feel)              0.362    0.974

Mean and SD are $\bar{\theta}_d$ and $s_d$ in Equation 2. Abbreviation: IRT, item response theory.

The 0–100 IRT scores can be obtained by:

$$\theta'_d = 100 \times \frac{\theta_d - \mathrm{MinEst}\theta_d}{\mathrm{MaxEst}\theta_d - \mathrm{MinEst}\theta_d} \tag{3}$$

where $\mathrm{MinEst}\theta_d$ and $\mathrm{MaxEst}\theta_d$ are as described in Appendix B at www.jclinepi.com.
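The two transformations in Equations 2 and 3 are simple linear rescalings, and a short Python sketch may help when implementing them. The means and SDs below are copied from Table S1; the function names and the dictionary layout are hypothetical, and the minimum and maximum estimable scores used in the usage lines (-4 and 2) are placeholder values for illustration, not the published limits.

```python
# Table S1 constants: (mean, SD) of original-scale IRT scores by domain.
TABLE_S1 = {
    "Walk": (1.282, 0.918),
    "Hand": (1.323, 1.139),
    "Daily": (1.324, 0.940),
    "Pain": (1.288, 1.114),
    "Feel": (0.362, 0.974),
}


def to_norm_50_10(theta, domain):
    """Equation 2: norm-based score with mean 50 and SD 10."""
    mean, sd = TABLE_S1[domain]
    return (theta - mean) * 10.0 / sd + 50.0


def to_0_100(theta, min_est, max_est):
    """Equation 3: 0 at the lowest and 100 at the highest estimable score."""
    return 100.0 * (theta - min_est) / (max_est - min_est)


# A Walk score equal to the population mean maps to 50 on the 50/10 scale.
print(to_norm_50_10(1.282, "Walk"))
# Placeholder limits (-4, 2) are illustrative assumptions only.
print(to_0_100(-1.0, -4.0, 2.0))
```

Note that Equation 1 takes the original-scale scores as input, so scores stored on either transformed scale would need to be converted back before computing MAHU.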