The Spine Journal 9 (2009) 854–856
Commentary
Differences in reported psychometric properties of the Neck Disability Index: patient population or choice of methods?

Sheilah Hogg-Johnson, PhD a,b,*

a Institute for Work & Health, 481 University Avenue, Suite 800, Toronto, ON M5G 2E9, Canada
b Dalla Lana School of Public Health, University of Toronto, 155 College Street, Toronto, ON M5T 3M7, Canada

* Corresponding author. Institute for Work & Health, 481 University Avenue, Suite 800, Toronto, ON M5G 2E9, Canada. Tel.: (416) 927-2027. E-mail address: [email protected] (S. Hogg-Johnson).

Received 2 July 2009; accepted 2 July 2009
COMMENTARY ON: Young BA, Walker MJ, Strunce JB, et al. Responsiveness of the Neck Disability Index in patients with mechanical neck disorders. Spine J 2009;9:802–808 (in this issue).
Introduction

Patient-reported outcome measures of pain and disability are frequently used in both clinical and research settings. The Neck Disability Index (NDI) [1] is a commonly used patient self-report measure of neck pain symptoms and their impact on function and activities [2]. It was adapted from the Oswestry Low Back Pain Disability Questionnaire, a similarly popular instrument for low back pain disability. It is important to establish the measurement properties of instruments within the population in which they are to be used [3,4]. Young et al. [5] have assessed the psychometric properties of the NDI in a secondary analysis of data from a randomized clinical trial of patients seeking care for neck pain with and without concurrent upper extremity symptoms, adding to a growing body of literature on the performance of the NDI [2]. They argued that most of the previously published work on the properties of the NDI has focused on patients with neck pain only.

The NDI questionnaire includes ten items covering different aspects of symptoms and functioning: pain, personal care, lifting, work, headaches, concentration, sleeping, driving, reading, and recreation. Each item is rated from 0 to 5, and the item ratings are summed to give a total score ranging from 0 to 50.
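As a concrete illustration of this scoring scheme, here is a minimal sketch; the item labels follow the list above, while the function name and example ratings are mine, not part of the published instrument.

```python
# The ten NDI items as listed above, each rated 0-5
NDI_ITEMS = ("pain", "personal care", "lifting", "work", "headaches",
             "concentration", "sleeping", "driving", "reading", "recreation")

def ndi_total(ratings: dict) -> int:
    """Sum the ten 0-5 item ratings into the 0-50 NDI total score."""
    assert set(ratings) == set(NDI_ITEMS), "all ten items must be rated"
    assert all(0 <= r <= 5 for r in ratings.values()), "ratings are 0-5"
    return sum(ratings.values())

# Example: rating every item 2 gives a total of 20 out of 50
print(ndi_total({item: 2 for item in NDI_ITEMS}))
```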
[email protected] (S. Hogg-Johnson) 1529-9430/09/$ – see front matter Ó 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.spinee.2009.07.001
Construct validity was demonstrated by showing that NDI change scores for stable patients were significantly different from those for improving patients. Responsiveness to change was assessed using two techniques. The first approach was based on an external criterion, the global rating of change (GRC): the optimal NDI change score to distinguish between stable patients and improving patients was identified, and from this the minimal clinically important difference (MCID) was estimated to be 7.5 points. The second approach was based on the standard error of measurement, which was used to derive the minimum detectable change (MDC), estimated to be 10.2 points. Finally, the investigators showed that the NDI change scores were similarly responsive for patients presenting with neck pain only and for patients presenting with neck pain and unilateral upper extremity symptoms.

Methods and findings

Test-retest reliability

Adequate reliability of measurement is a prerequisite for establishing the validity and responsiveness of a measure. In this study, the estimated test-retest reliability, ICC2,1 = 0.64, is low and not consistent with most other reports on the NDI [2]. Recommendations about instruments generally suggest that reliability coefficients should be at least 0.70 when the instrument is used to compare groups of patients and 0.90 to 0.95 when it is used for an individual subject or patient [4]. With two exceptions, other reports of test-retest reliability for the NDI have generally been high (a kappa coefficient of 0.90 [6], correlation coefficients ranging from 0.81 to 0.97 [1,7], and ICC2,1 values ranging from 0.89 to 0.94 [8-10]). The two exceptions [11,12] used similar inclusion criteria (neck pain with referred arm pain)
and similar methodology (using the GRC to establish a stable subgroup) for assessing reliability as the current report. So we might ask: are the estimates of test-retest reliability provided by Young et al. [5] and by Cleland et al. [11,12] lower than those reported elsewhere because of the inclusion of patients with arm pain, or could the methodology used to establish reliability play some role?

Young et al. [5] appropriately estimated test-retest reliability using the ICC2,1 statistic [13], but there are two methodological issues worth considering. First, the precision of the estimate of test-retest reliability is poor (95% confidence interval, 0.19-0.84), being based on NDI measures taken 3 weeks apart in a sample of 25 patients described as stable. Generally, samples of at least 40 to 50 are recommended for estimating test-retest reliability [14-16], although many of the other reports of test-retest reliability for the NDI have also relied on small samples. Second, to establish test-retest reliability, it is important to identify a stable sample of people with no expected change between the two measurement time points. For instance, Jaeschke et al. [17], cited by the authors, counted scores of 0 on the GRC as "stable" and scores in the -1 to -3 and +1 to +3 ranges as "small change," and then compared the "stable" and "small change" groups to derive MCIDs. In the present study, what might be the impact of including GRC scores of -2 (a little worse) and +2 (a little better) in the "stable" sample used to assess test-retest reliability? How much higher might the test-retest reliability estimate be if the computation were limited to those with "no change"? Note that the same GRC scale, with scores from -3 to +3, was used to identify a "stable" group of patients in the two other articles reporting lower test-retest reliability [11,12]. If scores of only 0, or even of only -1, 0, and +1, on the GRC had been used to identify the "stable" subgroup for test-retest reliability, then the sample size would have been even lower than the 25 that were used. Perhaps there simply was not an adequate sample with these data to identify a stable group of reasonable size and obtain a precise estimate of test-retest reliability.
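A brief numerical sketch may help make the statistic concrete. The following computes ICC(2,1) from an n-patients-by-2-occasions score matrix using the two-way random-effects mean squares of Shrout and Fleiss [13]; the function and the illustrative scores are hypothetical, not the study's data.

```python
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """ICC(2,1) for an (n subjects x k occasions) score matrix,
    from the two-way random-effects ANOVA mean squares of
    Shrout and Fleiss (1979)."""
    n, k = scores.shape
    grand = scores.mean()
    row_dev = scores.mean(axis=1) - grand   # per-subject deviations
    col_dev = scores.mean(axis=0) - grand   # per-occasion deviations

    ms_rows = k * np.sum(row_dev ** 2) / (n - 1)    # between subjects
    ms_cols = n * np.sum(col_dev ** 2) / (k - 1)    # between occasions
    ss_error = (np.sum((scores - grand) ** 2)
                - k * np.sum(row_dev ** 2) - n * np.sum(col_dev ** 2))
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n)

# Hypothetical NDI totals (0-50) for six "stable" patients, tested twice
ndi = np.array([[12, 14], [30, 27], [8, 10], [22, 22], [40, 35], [17, 19]])
print(f"ICC(2,1) = {icc_2_1(ndi):.2f}")
```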
Responsiveness

Adequate reliability of an instrument is a prerequisite for validity, so where does that leave us with the NDI in the population under study? The estimate of test-retest reliability here is ICC2,1 = 0.64, which some may argue is insufficient to justify going on to consider validity and responsiveness. Nevertheless, 0.64 may be an underestimate, as suggested above, and the authors have gone on to examine validity and responsiveness to change, the primary objective of their article. A variety of methods have been proposed in the literature to assess the responsiveness of health measurement scales [18], and even when narrowing in on the concept of an MCID, different approaches have been used [19].
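To illustrate the anchor-based family of approaches, here is a minimal sketch that selects the NDI change-score cutoff best separating GRC-defined "improved" from "stable" patients by maximizing the Youden index. This is a generic ROC-style procedure under my own assumptions, not necessarily the exact computation of Young et al. [5], and the data are invented.

```python
import numpy as np

def anchor_based_mcid(change, improved):
    """Return the change-score cutoff with the maximum Youden index
    (sensitivity + specificity - 1) for separating improved from
    stable patients. A generic sketch, not the paper's exact method."""
    change = np.asarray(change, dtype=float)
    improved = np.asarray(improved, dtype=bool)
    cuts = np.unique(change)
    cuts = (cuts[:-1] + cuts[1:]) / 2.0   # midpoints between observed scores
    best_cut, best_youden = None, -np.inf
    for c in cuts:
        sens = np.mean(change[improved] >= c)   # improved flagged as changed
        spec = np.mean(change[~improved] < c)   # stable flagged as unchanged
        if sens + spec - 1.0 > best_youden:
            best_cut, best_youden = c, sens + spec - 1.0
    return best_cut

# Invented NDI change scores with GRC-derived improvement labels
change = [2, 5, 8, 1, 12, 9, 0, 7, 11, 3]
improved = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]
print(f"estimated MCID cutoff: {anchor_based_mcid(change, improved)} points")
```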
The method used by Young et al. [5] to establish an MCID for the NDI is classified by Wells et al. [19] as appropriate for understanding important change within a group of patients over time. The MCID is estimated at 7.5 points on the 0 to 50 scale.

The second method used to examine responsiveness is based on the minimum detectable change (MDC). The MDC is derived from the standard error of measurement, and this estimate is described by Wells et al. [19] as appropriate for understanding important change within an individual patient over time. From this study, the MDC is estimated as 10.2 points. We interpret the MDC as saying that 90% of stable patients in the population of interest will have a retest score within ±MDC of their original score, so a change greater than MDC points is required for an individual patient before we can be confident that a "true" change has occurred. But note that the expression used to derive this MDC value,

\[ \mathrm{MDC} = \sqrt{2}\, z_{\alpha}\, \mathrm{SEM}_{\text{test-retest}} = \sqrt{2}\, z_{\alpha}\, s \sqrt{1 - \mathrm{ICC}_{2,1}}, \]

depends on the estimate of test-retest reliability ICC2,1. If ICC2,1 = 0.64 is an underestimate, as was suggested earlier, then the MDC will be overestimated accordingly. In fact, if the calculation of Young et al. [5] were repeated using a test-retest reliability of 0.94 (as reported by Stratford et al. [9]), then the MDC estimate would be 4.2, similar to other MDC estimates reported in the literature [9,10]. The important link between test-retest reliability and MDC must be noted, and perhaps the estimation of MDC was not worth pursuing.
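The dependence of the MDC on the reliability estimate can be checked directly. In the sketch below, the confidence level is assumed to be 90% (z = 1.645), matching the "90% of stable patients" interpretation above, and the baseline standard deviation s is back-solved from the reported 10.2-point MDC rather than taken from the paper.

```python
import math

Z_90 = 1.645  # z for a central 90% interval (assumed confidence level)

def mdc(s: float, icc: float, z: float = Z_90) -> float:
    """Minimum detectable change: sqrt(2) * z * SEM, with
    SEM = s * sqrt(1 - ICC), per the expression above."""
    return math.sqrt(2.0) * z * s * math.sqrt(1.0 - icc)

# Back-solve the SD implied by the reported MDC of 10.2 at ICC = 0.64
# (an inference from the paper's numbers, not a reported value)
s = 10.2 / (math.sqrt(2.0) * Z_90 * math.sqrt(1.0 - 0.64))

print(f"MDC at ICC = 0.64: {mdc(s, 0.64):.1f} points")   # 10.2
print(f"MDC at ICC = 0.94: {mdc(s, 0.94):.1f} points")   # about 4.2
```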
Finally, one other puzzling aspect of the analysis arises in the comparison of responsiveness between patients with and without unilateral upper extremity symptoms. It is unclear how one examines an interaction from a one-way analysis of variance (ANOVA). I suspect that a two-way ANOVA was actually conducted, with one factor being the presence or absence of unilateral upper extremity symptoms and the other factor being stable versus improving over time.

Conclusions

It seems that the study was based on a secondary analysis of an existing data set that was not sufficient to answer the research questions posed. There were insufficient stable patients to adequately estimate test-retest reliability, and without a precise and sufficiently high estimate of reliability, we cannot know what to make of the rest. Are the results presented here inconsistent with other NDI findings because of the different population under study? Or have methodological choices led to reliability and MDC estimates that differ from the rest of the literature? The latter seems likely.

References

[1] Vernon H, Mior S. The Neck Disability Index: a study of reliability and validity. J Manipulative Physiol Ther 1991;14:409–15.
[2] Vernon H. The Neck Disability Index: state-of-the-art, 1991-2008. J Manipulative Physiol Ther 2008;31:491–502. [3] Greenhalgh J, Long AF, Brettle AJ, et al. Reviewing and selecting outcome measures for use in routine practice. J Eval Clin Pract 1998;4:339–50. [4] Scientific Advisory Committee of the Medical Outcomes Trust. Assessing health status and quality-of-life instruments: attributes and review criteria. Qual Life Res 2002;11:193–205. [5] Young BA, Walker MJ, Strunce JB, et al. Responsiveness of the Neck Disability Index in patients with mechanical neck disorders. Spine J 2009;9:802–8. [6] Chok B, Gomez E. The reliability and application of the Neck Disability Index in physiotherapy. Physiother Singapore 2000;3:16–9. [7] Ackelman BH, Lindgren U. Validity and reliability of a modified version of the Neck Disability Index. J Rehabil Med 2002;24:284–7. [8] Knapp SFC, Langworthy JM, Breen AC. The use of the Neck Disability Index in the evaluation of acute and chronic neck pain. Proceedings of the 12th International Conference on Spinal Manipulation 1994; Palm Springs, CA: 10. [9] Stratford PW, Riddle DL, Binkley JM, et al. Using the Neck Disability Index to make decisions concerning individual patients. Physiother Can 1999;51:107–19. [10] Westaway MD, Stratford PW, Binkley JM. The patient-specific functional scale: validation of its use in persons with neck dysfunction. J Orthop Sports Phys Ther 1998;27:331–8.
[11] Cleland JA, Fritz JM, Whitman JM, et al. The reliability and construct validity of the Neck Disability Index and patient specific functional scale in patients with cervical radiculopathy. Spine 2006;31: 598–602. [12] Cleland JA, Childs JD, Whitman JM. Psychometric properties of the Neck Disability Index and Numeric Pain Rating Scale in patients with mechanical neck pain. Arch Phys Med Rehabil 2008;89:69–74. [13] Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86:420–8. [14] Kraemer HC, Korner AF. Statistical alternatives in assessing reliability, consistency, and individual differences for quantitative measures: application to behavioral measures of neonates. Psychol Bull 1976;83:914–21. [15] Donner AP, Eliasziw M. Sample size requirements for reliability studies. Stat Med 1987;6:441–8. [16] Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use. 3rd ed. New York, NY: Oxford University Press, 2003. [17] Jaeschke R, Singer J, Guyatt GH. Measurement of health status. Ascertaining the minimal clinically important difference. Control Clin Trials 1989;10:407–15. [18] Beaton DE, Bombardier C, Katz JN, et al. A taxonomy for responsiveness. J Clin Epidemiol 2001;54:1204–17. [19] Wells G, Beaton D, Shea B, et al. Minimal clinically important differences: review of methods. J Rheumatol 2001;28:406–12.