Journal of Clinical Epidemiology 65 (2012) 1117–1123
Rasch analysis of the PedsQL: an increased understanding of the properties of a rating scale

Leila Amin (a, *), Peter Rosenbaum (b), Ronald Barr (c), Lillian Sung (d), Robert J. Klaassen (e), David B. Dix (f), Anne Klassen (c)

(a) Department of Psychology, The Hospital for Sick Children, 555 University Avenue, Toronto, Ontario M5G 1X8, Canada
(b) Institute of Applied Health Sciences, McMaster University, 1400 Main Street West, Hamilton, Ontario L8S 1C7, Canada
(c) Department of Pediatrics, McMaster University, 3A 1280 Main Street West, Hamilton, Ontario L8S 4K1, Canada
(d) Division of Haematology/Oncology, Department of Paediatrics, The Hospital for Sick Children, University of Toronto, 555 University Avenue, Toronto, Ontario M5G 1X8, Canada
(e) Division of Haematology/Oncology, Children’s Hospital of Eastern Ontario, 401 Smyth Road, Ottawa, Ontario K1H 8L1, Canada
(f) Department of Pediatrics, University of British Columbia, 4480 Oak Street, Vancouver, British Columbia V6H 3V4, Canada

Accepted 21 April 2012
Abstract

Objective: Traditionally, measures have been developed using classical test theory (CTT). This study examines the psychometrics of the PedsQL 4.0 Generic Core Scales (parent report) using a modern psychometric approach (i.e., Rasch analysis) to determine if further insight regarding the behavior of the measure can be gained using a modern approach.

Study Design and Setting: Data from a Canadian study of 376 parents of children aged 2–17 years on active cancer treatment were analyzed using Rumm2030 software (RUMM Laboratory, Australia).

Results: Overall data did not fit the Rasch model and indicated important limitations in the PedsQL including disordered item thresholds, local dependency, and poor targeting. The internal consistency was higher using CTT (Cronbach’s α = 0.93) than Rasch (Person Separation Index = 0.78). Rasch item curves showed that respondents did not discriminate between the five-point response categories and a three-point scale was easier to discriminate.

Conclusions: The PedsQL is the most widely used generic measure of health-related quality of life in pediatrics. The use of Rasch allows detailed item-level evaluation and assists in identifying limitations of measures that could then be further explored. Studies are needed in other populations to better understand the behavior of the PedsQL from a modern psychometric perspective.

© 2012 Elsevier Inc. All rights reserved.

Keywords: Rasch analysis; Classical test theory; Psychometrics; Quality of life; Pediatrics; Outcome measurement
1. Introduction

Rating scales used in clinical practice often lack validity, making scores difficult to interpret [1]. Nonvalid scores should not be used to evaluate outcomes or to make clinical decisions. Psychometrics is the study of methods to develop and evaluate rating scales [2]. Exploring psychometric theories that underlie the development of rating scales provides a means to investigate limitations that may exist in the use and interpretation of these measures.

This study was funded by Canadian Cancer Society grant #15265 and by LA’s Master’s thesis internal funding from the School of Rehabilitation Sciences and Department of Pediatrics at McMaster University, Hamilton, Ontario, Canada. Anne Klassen and Lillian Sung are recipients of Canadian Institute of Health Research Salary awards.
* Corresponding author. Tel.: +289-937-0585; fax: +416-813-8839. E-mail address: [email protected] (L. Amin).
0895-4356/$ - see front matter © 2012 Elsevier Inc. All rights reserved. doi: 10.1016/j.jclinepi.2012.04.014

Classical test theory (CTT) has served as the main approach to the development and evaluation of rating scales [2]. CTT focuses predominantly on person-level statistics, such as means and standard deviations, and on test-level statistics such as reliability. Modern psychometric approaches, such as Rasch analysis, bring increased attention to item-level statistics and focus on exploring the relationship between item-level scores and clinical outcomes. The Rasch measurement model determines the relationship between a person’s performance on any item in a scale and the level of the trait that person possesses. The model assumes: 1) unidimensionality (refers to the idea that all items on a measure assess the same single construct of interest); and 2) local independence (refers to the idea that a person’s response to one item does not depend on their response to any other items) [3]. Ensuring data collected from a scale fit the Rasch measurement model is a method
What is new?
1. Rasch analysis allows clinicians and researchers to gain insight into the behavior of rating scales. Specifically, Rasch analysis of the PedsQL 4.0 shows that respondents cannot discriminate among the categories of a five-point scale, but can discriminate among those of a three-point scale.
2. Rating scales used in clinical practice must be valid. Rasch analysis provides a means of improving the validity of these scales.
3. Future studies of the development of rating scales should include Rasch analysis to provide insight into the behavior of the rating scale from an item-level perspective.
for improving the validity of a rating scale [1]. Fit to the Rasch measurement model provides empirical evidence to support how well the items on a scale measure the construct of interest. Rasch analysis can be used to determine validity of rating scales that assess constructs that are difficult to measure, such as health-related quality of life (HRQOL). This approach provides evidence to support the inclusion or exclusion of items, improving our knowledge and understanding of the concepts that comprise HRQOL [4].

The PedsQL is currently the most widely used measure of child health and there are more psychometric data published for the PedsQL 4.0 Generic Core Scales than for any other pediatric HRQOL tool [5,6]. The PedsQL was developed using a CTT approach and item generation involved literature review, interviews with patients and families, and discussions with health care professionals [7]. The measure is composed of separate subscales to assess physical, emotional, social, and school health and the scores of these subscales can be summed to produce an overall score to represent HRQOL [7,8].

Two earlier studies examined psychometrics of the PedsQL using modern methods. The first assessed the Korean translation of the PedsQL 4.0 for cross-cultural validity in a sample of healthy school children and found the scale was generally valid; however, there were some limitations [9]. The “Almost Always” response option was problematic, person reliability was low and there was a ceiling effect [9]. In the second study, the validity of the PedsQL 4.0 was examined in a Singaporean sample of preschool children with refractive errors and findings indicated the scale was not valid [10]. Problems included disordered response categories, a ceiling effect, and multidimensionality within scales [10]. A third study used the PedsQL 4.0 to demonstrate practical issues that arise when applying modern psychometrics to a measure and they found items on the
scale to exhibit local dependence [11]. The purpose of our study was to examine the psychometric properties of the PedsQL 4.0 (parent report) in a sample of children with cancer using Rasch analysis.
2. Methods

2.1. Sample

PedsQL 4.0 Generic Core Scales parent-report data were collected as a part of a multicenter Canadian study of parents of children on active cancer treatment between 2004 and 2007 [12–14]. Parents of children were included if the child was receiving treatment for any type of cancer, if they were initially diagnosed more than 2 months before enrollment on this study and if they were not considered palliative. Eligible parents had to be the primary caregiver (i.e., the person most responsible for the day-to-day care of the child) and able to read English. Demographics of eligible parents, their children, and the sites involved in the study are described elsewhere [12]. For the purpose of this psychometric evaluation, data provided by 376 parents of children aged 2 years or older were included (PedsQL parent-proxy report is for children aged 2–18 years). On the PedsQL, parents rank how much of a problem their child has had in four domains: physical function (PF), emotional function (EF), social function (SF), and school function (ScF) using a five-point scale: never (scored 0); almost never (1); sometimes (2); often (3); and almost always (4) [7].

2.2. Analysis

The PedsQL 4.0 data set was analyzed using Rumm2030 software (RUMM Laboratory, Australia) [15]. Rasch analysis was applied to test whether the response pattern observed in the data matches the expected theoretical pattern. The first step of the analysis was to decide which derivation of the Rasch model should be used to best improve fit; we used the partial credit model. Next, unidimensionality was examined using t-test procedures. Reasons for dimensionality were explored and corrections were applied to improve model fit. Specific statistical criteria were used to determine the extent to which model fit is improved and these are discussed elsewhere [16–20].
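For reference, the partial credit model named above is commonly written in the form below. The notation is not taken from this paper and is introduced only for illustration: $\theta_n$ is the location of person $n$ on the trait, $\tau_{ik}$ is the $k$-th threshold of item $i$, and $m_i$ is the maximum score of item $i$ (4 for an unmodified five-point PedsQL item).

```latex
\[
P(X_{ni} = x) =
\frac{\exp\!\left(\sum_{k=1}^{x} (\theta_n - \tau_{ik})\right)}
     {\sum_{j=0}^{m_i} \exp\!\left(\sum_{k=1}^{j} (\theta_n - \tau_{ik})\right)},
\qquad x = 0, 1, \dots, m_i ,
\]
% the empty sum (x = 0 or j = 0) is taken to be 0
\[
\ln \frac{P(X_{ni} = x)}{P(X_{ni} = x - 1)} = \theta_n - \tau_{ix}.
\]
```

The second line follows directly from the first: two adjacent categories are equally likely exactly when the person location equals the threshold between them.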
The importance of item-level robustness in a Rasch analysis is highlighted by the types of issues that are explored as potential sources of misfit. First, item thresholds were examined as a potential cause of misfit. A threshold is the point between two response categories in which the respondent is equally likely to endorse either response option (i.e., the probability of scoring a “0” or a “1” is 50%). When thresholds are disordered, respondents cannot differentiate between the various response options for that item. Secondly, individual items and persons were examined for misfit to the Rasch model (item and person-fit residuals are acceptable if they lie within ±2.5 units). Residual refers
to the portion of the item or person that remains when the portion that fits the Rasch model (the Rasch factor) is taken into consideration. Thirdly, items with residual inter-item correlations greater than 0.2 (known as locally dependent) were examined.

Differential item functioning (DIF) occurs when items display bias for a particular subgroup of the sample and is calculated using analysis of variance (5% alpha). Two types of DIF can occur, uniform and nonuniform. Uniform DIF occurs when, for example, males consistently score higher than females on an item given the same level of HRQOL. Nonuniform DIF occurs when, for example, females score a higher response to items at lower levels of HRQOL, but score a lower response at higher levels of HRQOL. In this study, we examined DIF by gender and by age.

Once the best possible fit to the Rasch model was achieved (representing good construct validity), reliability and targeting of the scale were assessed. Reliability was evaluated using the Person Separation Index (PSI), which indicates the degree to which persons can be differentiated into certain groups. PSI values range from 0 to 1 and a value of 0.8 is considered acceptable [19]. Targeting was evaluated by examining the extent of overlap between graphical displays of person ability and item difficulty. In this study, the set of analyses outlined above was carried out only on the total score (TS), which is the sum of all items, as this is the score most commonly used in clinical practice.

3. Results

Three hundred and seventy-six records were analyzed with the partial credit model. The 23-item PedsQL 4.0 scale did not fit the model (χ² = 381.0, df = 184, P < 0.001). A P-value <0.05 indicates that there was a significant difference between observed data and expected data based on the Rasch model. The proportion of statistically significant t-tests (P < 0.05) was 21.5% (19.3–23.7%), exceeding the critical value of 5%, which indicates a violation of local independence and hence the assumption of unidimensionality.
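The interval reported for the proportion of significant t-tests can be checked with a few lines of arithmetic. The sketch below rests on an assumption, since the paper does not state the formula: that the interval is a normal-approximation binomial interval whose half-width is evaluated at the nominal 5% rate with N = 376 persons. The function and variable names are illustrative only.

```python
import math

def binomial_half_width(nominal_rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval evaluated at the nominal rate."""
    return z * math.sqrt(nominal_rate * (1.0 - nominal_rate) / n)

n_persons = 376        # sample size reported in the paper
observed = 0.215       # reported proportion of significant t-tests
nominal = 0.05         # critical value named in the text

half_width = binomial_half_width(nominal, n_persons)
lower = observed - half_width
upper = observed + half_width

print(f"interval: {lower:.3f} to {upper:.3f}")
print("lower bound exceeds 5%:", lower > nominal)
```

Under this reading the bounds round to the 19.3% and 23.7% quoted in the text, and the lower bound lies well above the 5% critical value, which is what drives the conclusion about unidimensionality.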
Category frequencies for each item indicate that most parents responded using the “Never,” “Almost Never,” and “Sometimes” response categories. Fourteen items (61% of items) on the PedsQL had disordered thresholds: PF1–6, EF4, SF1, SF3–4, ScF1, and ScF3–5. Examination of individual item response curves for disordered items indicates that respondents had difficulty discriminating between the categories “Never” and “Almost Never” (scores of 0 and 1) and between “Often” and “Almost Always” (scores of 3 and 4). The response categories “Never” and “Almost Never” were therefore collapsed into a single category and “Often” and “Almost Always” were also collapsed for each disordered item. Collapsing response categories in this fashion revised the original five-point scale into a three-point scale as displayed in Fig. 1. In Fig. 1 (right panel), each response option of the three-point scale systematically has a point along the x-axis (ability continuum) where it is the most likely response. In Fig. 1 (left panel), the item has disordered thresholds and therefore the response categories do not have a point at which they are the most likely response option. These responses are inconsistent with what was predicted by the Rasch model. Item PF5 (Bathing) was the only item in the scale that remained disordered; therefore, a different correction was applied: scores of “Never” and “Almost Never” were collapsed and scores of “Sometimes” and “Often” were collapsed.

Fit to the Rasch model was reevaluated after rescoring thresholds and model fit was not achieved. Items within each subscale of the PedsQL were subtested to account for residual inter-item correlations (i.e., local dependency) within each scale. In Rasch, subtesting items allows one to parcel out the portion of the internal consistency coefficient (ICC) that has been inflated because of local dependency and thus get a more accurate representation of the ICC.
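The disordered thresholds and the rescoring described above can be illustrated with a small sketch. It computes category probabilities under a standard partial credit model and lists which categories are ever the most likely response along the trait. The threshold values are hypothetical and are not estimates from this study; the two rescoring maps follow the collapsing rules stated in the text, and all function names are illustrative.

```python
import math

def pcm_probabilities(theta, thresholds):
    """Category probabilities for one item under the partial credit model."""
    # Category x has exponent sum_{k<=x}(theta - tau_k); category 0 has exponent 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - tau))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def modal_categories(thresholds, lo=-6.0, hi=6.0, steps=241):
    """Set of categories that are the most likely response somewhere on a grid."""
    modal = set()
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        probs = pcm_probabilities(theta, thresholds)
        modal.add(probs.index(max(probs)))
    return modal

def is_ordered(thresholds):
    """True when every threshold is higher than the one before it."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))

# Hypothetical thresholds: a disordered five-point item and an ordered three-point item.
disordered_five_point = [0.5, -0.8, 1.2, 0.3]
ordered_three_point = [-1.0, 1.0]

# Rescoring rules taken from the text: 0/1 and 3/4 collapsed for most items,
# 0/1 and 2/3 collapsed for item PF5.
COLLAPSE_DEFAULT = {0: 0, 1: 0, 2: 1, 3: 2, 4: 2}
COLLAPSE_PF5 = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2}

def rescore(responses, mapping):
    """Apply a category-collapsing map to a list of raw 0-4 responses."""
    return [mapping[r] for r in responses]

print(is_ordered(disordered_five_point), sorted(modal_categories(disordered_five_point)))
print(is_ordered(ordered_three_point), sorted(modal_categories(ordered_three_point)))
print(rescore([0, 1, 2, 3, 4], COLLAPSE_DEFAULT))
```

With these made-up values, some categories of the five-point item are never the most likely response, which is the pattern the left panel of Fig. 1 depicts, while every category of the three-point item is modal over some interval.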
Subtested items behaved as new items with a maximum score equal to the sum of the maximum score of the individual items. Four subtests were created corresponding to the four subscales of the PedsQL. Item and overall fit to the Rasch model were significantly improved by creating subtests (χ² = 33.01, df = 32, P = 0.42). All
Fig. 1. Category probability curves showing disordered five-point response options and corrected three-point response options.
subtested items fit the Rasch model as indicated by nonsignificant P-values (>0.05). The residual mean value for person-fit was 0.35, indicating no serious misfit among respondents in the sample [15].

We then examined DIF (i.e., item bias) by gender and age for the four subtested items using a P-value of 0.05. Age was divided using the same categories used by developers of the PedsQL: 2–4, 5–7, 8–12, and 13–18 years and there were at least 100 cases in each grouping [7]. DIF was not observed for gender but marginal uniform DIF was apparent for age for subtested items in the PF subscale (Mean square [MS] = 3.44; F = 6.80; df = 3; P = 0.0002) as displayed in Fig. 2. DIF was not adjusted for because of its marginal nature. In Fig. 2, the item curves for the different age groups are very similar but do not coincide perfectly indicating some uniform DIF by age. When items do not display any DIF their curves coincide perfectly.

Person-item threshold distribution maps are used to assess targeting of a scale in Rasch analysis. See Fig. 3 for person-item threshold map of the PedsQL TS score. The items available in the scale do not target some respondents, indicating that this sample has a slightly better HRQOL than the average HRQOL that can be assessed by the items on the scale. We also produced person-item threshold maps that separately display respondents belonging to various subgroups, such as age or gender, to examine if there is a particular gender or age group that is poorly targeted by the scale. In this study, the gender map (Fig. 4) indicates that more males than females lie above the ceiling of the scale (mean of male person ability is 0.31, which has a greater absolute value than female mean ability which is 0.21). Therefore, there may be a systematic difference in the way parents perceive and report HRQOL for their sons when compared with their daughters. The age map (Fig. 5) shows that mostly younger age groups (2–4 and 5–7 years) lie above the ceiling of the scale (mean person ability is 0.45 and 0.36, respectively) when compared with the older groups (8–12 and 13–18 years) with mean person ability of 0.14 and 0.07, respectively. In addition,
Fig. 2. Marginal differential item functioning based on age for subtested physical function items as indicated by lack of overlap between item curves.
because marginal uniform DIF was detected based on age, items may not function equivalently for the different age groups. The PSI of the TS score decreased from 0.92 to 0.78 after correcting for disordered thresholds and local dependency to improve fit to the Rasch model. Similarly, the ICC decreased from 0.93 to 0.80 after obtaining fit to the Rasch model.
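The drop in the internal consistency coefficient after subtesting can be reproduced in spirit with simulated data. The sketch below is not the authors' analysis: it generates synthetic continuous scores (not ordinal PedsQL responses) in which items share a subscale-specific factor, then compares Cronbach's alpha computed on 23 items with alpha computed on four subtest sums. The subscale sizes (8, 5, 5, 5), the variance structure, and all names are assumptions chosen for illustration.

```python
import random
import statistics

def cronbach_alpha(rows):
    """Cronbach's alpha; rows is a list of per-person lists of component scores."""
    k = len(rows[0])
    component_vars = [statistics.variance([r[j] for r in rows]) for j in range(k)]
    total_var = statistics.variance([sum(r) for r in rows])
    return k / (k - 1) * (1.0 - sum(component_vars) / total_var)

def simulate(n_persons, subscale_sizes, seed=0):
    """Synthetic scores: general trait + shared subscale factor + item noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_persons):
        general = rng.gauss(0, 1)
        row = []
        for size in subscale_sizes:
            shared = rng.gauss(0, 1)  # subscale factor that induces local dependency
            row.extend(general + shared + rng.gauss(0, 1) for _ in range(size))
        rows.append(row)
    return rows

def to_subtests(rows, subscale_sizes):
    """Sum the items of each subscale into one 'super item' per subscale."""
    out = []
    for r in rows:
        sums, start = [], 0
        for size in subscale_sizes:
            sums.append(sum(r[start:start + size]))
            start += size
        out.append(sums)
    return out

SUBSCALE_SIZES = [8, 5, 5, 5]  # assumed split of the 23 items, for illustration only

item_rows = simulate(376, SUBSCALE_SIZES)
subtest_rows = to_subtests(item_rows, SUBSCALE_SIZES)

alpha_items = cronbach_alpha(item_rows)
alpha_subtests = cronbach_alpha(subtest_rows)
print(f"alpha on items: {alpha_items:.2f}, alpha on subtests: {alpha_subtests:.2f}")
```

Because the shared subscale factor counts toward item-level consistency but is absorbed into each subtest sum, the item-level coefficient is the larger of the two. That is the mechanism the text describes as inflation due to local dependency.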
4. Discussion

A Rasch analysis of the PedsQL 4.0 Generic Core Scales (parent report) was conducted to determine how the TS score behaves from an item-level perspective. The PedsQL is already established in the literature as a valid and reliable tool from a traditional psychometric perspective and it is the most widely used generic HRQOL tool in pediatric oncology [5,21]. However, a modern psychometric approach uncovered a number of limitations in terms of disordered thresholds, local dependency, and targeting issues.

Findings indicate that 14 (61%) of 23 items on the PedsQL have disordered response categories, which may be the result of too many response categories or unclear label options. These findings suggest that the score obtained from summing the score of each five-point item in the PedsQL is not appropriate as response options are not being used by respondents in the manner intended by the developers of the scale. Our findings regarding the disordered response options of the PedsQL are concordant with another report [10].

In terms of item fit, we identified high local dependency between item residuals on each subscale of the PedsQL. The corresponding items were subtested, once for each of the four subscales of the PedsQL 4.0, to determine how much the reliability coefficient was inflated because of local dependency between items. After achieving fit of data to the Rasch model by creating subtests, the ICC dropped below the level acceptable for individual-level analysis [22]. Before fitting data to the Rasch model, the ICC was appropriate for individual-level analysis. Also, the PSI dropped below the accepted level (set at 0.8) after fitting data to the Rasch model, suggesting that the TS score should be used only to distinguish between two groups [19].

Marginal DIF (i.e., item bias) was detected on the basis of age, for subtested PF items, indicating that these items may not function equivalently for the different age groups.
Parents may be better at reporting PF for older children who can communicate with them and inform them of any health concerns. Therefore, the PF item bias may be that younger children, who might not express their health concerns to their parents, have parents who are not as easily able to act as a proxy respondent for their child. Alternatively, the item bias may be that parents are better at reporting PF for younger children with whom they may have more contact. Because the use of proxy respondents in pediatric research is common, future research could investigate this issue further.
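As a rough illustration of how a uniform DIF test by age group works, the sketch below runs a one-way analysis of variance on standardized residuals grouped by age band. It is a simplification and not the RUMM2030 procedure, which typically also crosses groups with class intervals along the trait in order to separate uniform from nonuniform DIF. The residual values are invented for the example and the function name is hypothetical.

```python
def one_way_anova(groups):
    """Return (ms_between, f_statistic, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between, ms_between / ms_within, df_between, df_within

# Invented standardized residuals for one subtested item, keyed by age band (years).
residuals_by_age = {
    "2-4": [0.6, 0.9, 0.4, 0.7],
    "5-7": [0.2, 0.5, 0.1, 0.4],
    "8-12": [-0.3, 0.0, -0.2, 0.1],
    "13-18": [-0.6, -0.4, -0.7, -0.3],
}

ms, f_stat, df_b, df_w = one_way_anova(list(residuals_by_age.values()))
print(f"MS = {ms:.2f}, F = {f_stat:.2f}, df = {df_b}, {df_w}")
```

With four age bands the between-groups degrees of freedom are 3, matching the df reported for the age analysis. A large F relative to its reference distribution would indicate that residuals differ systematically by age band, which is the signature of uniform DIF.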
Fig. 3. Person-item threshold distribution map of the PedsQL 4.0 (parent report).
Overall, the PedsQL did not demonstrate good targeting as suggested by the lack of overlap of person ability and item difficulty in the person-item threshold maps. Examination of threshold maps provides a visual means of determining how well the items spread over the range of the construct and provides an easy way to determine whether the addition of a particular item improves targeting of the scale. A strong ceiling effect is common in generic HRQOL instruments as these are constructed with the intention of being applicable to a wide range of populations including healthy people. Our sample was composed of children on active treatment for cancer, and one would expect that cancer patients would score lower on the PedsQL given that they are not well; however, even in this unhealthy sample, a ceiling effect was associated with the TS score. To
improve scale targeting, items could be added to the PedsQL that can discriminate at higher levels of HRQOL. Conducting qualitative interviews with children and parents, to better understand their perspectives on what kinds of items could capture the higher levels of HRQOL, may be a useful exercise to improve the targeting of HRQOL measures [4]. Findings from this study that suggest poor targeting are supported by other modern and traditional psychometric studies that have evaluated the properties of the PedsQL [7–10,23–26].

4.1. Strengths and limitations

The strengths of this study include the large sample size, the inclusion of parents of children treated on a variety of treatment protocols, and inclusion of parents of children
Fig. 4. Person-item threshold distribution map based on gender.
Fig. 5. Person-item threshold distribution map based on age.
from hospitals in different Canadian cities, which increase generalizability of findings. Limitations are that the study was conducted using English-speaking parents of cancer patients; therefore, findings may not apply to parents of children with other chronic conditions and those in other cultural groups. Furthermore, parent-reported HRQOL during active treatment may be impacted by high levels of parental stress during this period and this additional stress may have influenced findings [27]. Lastly, child self-report data were not collected as part of this study, limiting applicability to that group.

4.2. Summary

Rasch analysis gives investigators a means of examining item-level statistics in more detail than CTT. Item curves display response thresholds for each item and provide a means of evaluating whether response categories are being used as they were intended. Person-item threshold maps display how well the items on a measure target the sample being assessed and can be used to identify if the addition of a new item improves targeting of the scale for a particular sample. Our analysis of the PedsQL 4.0 highlights some important shortcomings of the questionnaire as it currently stands. The five-point response option is problematic, subtested items from the PF scale display marginal bias for age, and the TS score is not reliable for individual-level decision making after data are fit to the Rasch model. Studies could be conducted in other samples to determine if similar limitations are identified with the PedsQL. In addition, because our study focused only on the parent perspective, future research could use Rasch analysis to examine self-report data.
References

[1] Hobart J, Cano S, Zajicek J, Thompson A. Rating scales as outcome measures for clinical trials in neurology: problems, solutions, and recommendations. Lancet Neurol 2007;6:1094–105.
[2] Hobart J, Cano S. Improving the evaluation of therapeutic interventions in multiple sclerosis: the role of new psychometric methods. Health Technol Assess 2009;13:1–214.
[3] Reise S, Henson J. A discussion of modern versus traditional psychometrics as applied to personality assessment scales. J Pers Assess 2003;81:93–103.
[4] Rajmil L, Herdman M, Fernandez de Sanmamed M, Detmar S, Bruil J, Ravens-Sieberer U, et al. Generic health-related quality of life instruments in children and adolescents: a qualitative analysis of content. J Adolesc Health 2004;34:37–45.
[5] Eiser C, Morse R. Quality-of-life measures in chronic diseases of childhood. Health Technol Assess 2001;5:1–157.
[6] Eiser C, Vance Y, Horne B, Glaser A, Galvin H. The value of the PedsQL™ in assessing quality of life in survivors of childhood cancer. Child Care Health Dev 2003;29:95–102.
[7] Varni JW, Seid M, Rode C. The PedsQL™: measurement model for the Pediatric Quality of Life Inventory. Med Care 1999;37:126–39.
[8] Varni JW, Seid M, Kurtin P. PedsQL™ 4.0: reliability and validity of the Pediatric Quality of Life Inventory version 4.0 Generic Core Scales in healthy and patient populations. Med Care 2001;39:800–12.
[9] Kook S, Varni JW. Validation of the Korean version of the Pediatric Quality of Life Inventory (PedsQL™) 4.0 Generic Core Scales in school children and adolescents using the Rasch model. Health Qual Life Outcomes 2008;6:41.
[10] Lamoureux E, Manjula M, Chang B, Dirani M, Kah-Guan A, Chia A, et al. Is the pediatric quality of life inventory valid for use in preschool children with refractive errors? Optom Vis Sci 2010;87:813–22.
[11] Hill CD, Edwards MC, Thissen D, Langer MM, Wirth RJ, Burwinkle TM, et al. Practical issues in the application of item response theory: a demonstration using items from the Pediatric Quality of Life Inventory (PedsQL™) 4.0 Generic Core Scales. Med Care 2007;45:S39–47.
[12] Sung L, Klaassen RJ, Dix D, Pritchard S, Yanofsky R, Dzolganovski B, et al. Identification of paediatric cancer patients with poor quality of life. Br J Cancer 2009;100:82–8.
[13] Sung L, Klaassen R, Dix D, Pritchard S, Yanofsky R, Ethier M, et al. Parental optimism in poor prognosis pediatric cancers. Psychooncology 2009;18:783–8.
[14] Sung L, Yanofsky R, Klaassen R, Dix D, Pritchard S, Winick N, et al. Quality of life during active treatment for pediatric acute lymphoblastic leukemia. Int J Cancer 2010;128:1213–20.
[15] Andrich D, Lyne A, Sheridan B, Luo G. RUMM2030. Perth, Australia: RUMM Laboratory; 2010.
[16] Tennant A, Conaghan P. The Rasch measurement model in rheumatology: what is it and why use it? When should it be applied, and what should one look for in a Rasch paper? Arthritis Rheum 2007;57:1358–62.
[17] Pallant J, Tennant A. An introduction to the Rasch measurement model: an example using the hospital anxiety and depression scale (HADS). Br J Clin Psychol 2007;46:1–18.
[18] Tennant A, McKenna SP, Hagell P. Application of Rasch analysis in the development and application of quality of life instruments. Value Health 2004;7:S22–6.
[19] Schumacker R, Smith E. A Rasch perspective. Educ Psychol Meas 2007;67:394–409.
[20] Teresi JA. Different approaches to differential item functioning in health applications: advantages, disadvantages and some neglected topics. Med Care 2006;44:S152–70.
[21] Klassen A, Khan A, Anthony S, Klaassen R, Sung L. Conceptualizing quality of life in children with cancer and childhood cancer survivors [Abstract]. International Society for Quality of Life Research Meeting Abstracts. Qual Life Res 2010;1653:76.
[22] Nunnally JC, Bernstein IR. Psychometric theory. 3rd ed. New York, NY: McGraw-Hill; 1994.
[23] Varni JW, Burwinkle TM, Katz E, Meeske K, Dickinson P. The PedsQL™ in pediatric cancer: reliability and validity of the Pediatric Quality of Life Inventory Generic Core Scales, Multidimensional Fatigue Scale, and Cancer Module. Cancer 2002;94:2090–106.
[24] Varni JW, Burwinkle TM, Seid M, Skarr D. The PedsQL™ as a pediatric population health measure: feasibility, reliability, and validity. Ambul Pediatr 2003;3:329–41.
[25] Varni JW, Burwinkle TM, Rapoff M, Kamps J, Olson N. The PedsQL™ in pediatric asthma: reliability and validity of the pediatric quality of life inventory generic core scales and asthma module. J Behav Med 2004;27:297–318.
[26] Varni JW, Seid M, Knight TS, Burwinkle TM, Brown J, Szer I. The PedsQL™ in pediatric rheumatology. Reliability, validity and responsiveness of the Pediatric Quality of Life Inventory Generic Core Scales and Rheumatology Module. Arthritis Rheum 2002;46:714–23.
[27] Johnston C, Steele R, Herrera E, Phipps S. Parent and child reporting of negative life events: discrepancy and agreement across pediatric samples. J Pediatr Psychol 2003;28:579–88.