
Journal of Clinical Epidemiology 57 (2004) 1008–1018

Reliable change and minimum important difference (MID) proportions facilitated group responsiveness comparisons using individual threshold criteria

John S. Schmitt, Richard P. Di Fabio*

Department of Physical Medicine & Rehabilitation, School of Medicine, University of Minnesota, Mayo Mail Code 388, Minneapolis, MN 55455, USA

Accepted 10 February 2004

Abstract

Objective: This study contrasted the use of responsiveness indices at the group level vs. individual patient level.

Study Design and Setting: We followed a cohort of 211 patients (50% male; mean age 47.5 years; SD 14) with musculoskeletal upper extremity problems for a total of 3 months. Outcome measures included the Disabilities of the Arm, Shoulder, and Hand (DASH) questionnaire, Shoulder Pain and Disability Index (SPADI), Patient-Rated Wrist Evaluation (PRWE), and the Medical Outcomes Study 12-Item Short-Form Health Survey (SF-12). We calculated confidence intervals on various group-level responsiveness statistics based on effect size and correlation with global change. The proportion of patients exceeding the minimum detectable change (or reliable change proportion) and minimum important difference (MID proportion) were included as indices applicable to the individual patient.

Results: For the DASH, effect size ranged from 1.06 to 1.67 for various patient subgroups, and the reliable change and MID proportions indicated that 50%–70% of individuals exhibited change based on individual change scores. Only the SRM and reliable change proportion indicated differences among the outcome measures used in this study.

Conclusion: The reliable change and MID proportions have an intuitive interpretation and facilitate quantitative responsiveness comparisons among outcome measures based on individual patient criteria. © 2004 Elsevier Inc. All rights reserved.

Keywords: Responsiveness; Minimum detectable change; Minimal important difference; Reliable change proportion; Individual patient level; Outcome measures, generic and specific

* Corresponding author. Tel.: 612-626-4973; fax: 651-653-0909. E-mail address: [email protected] (R.P. Di Fabio).
0895-4356/04/$ – see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.jclinepi.2004.02.007

1. Introduction

The ability to assess longitudinal changes in health status is critical for outcome measures used in the study of treatment efficacy. This aspect of measurement is termed responsiveness, or sensitivity to change [1]. Responsiveness has been defined as the ability of an instrument to detect small but important changes in health status over time [2–4]. Responsiveness is essential for both group and individual measurement purposes. In group applications, the greater the responsiveness of an outcome measure, the fewer subjects required to detect a significant treatment effect [3,5]. This has obvious implications for power calculations, sample size, and the cost of clinical research. For individual applications, a responsive outcome measure is more likely to show change for a particular patient in response to effective treatment. A clinician may make treatment decisions with greater confidence using scores from an instrument known to be sensitive to small but important changes over time. The identification and selection of the most responsive outcome measure is critical for both group research and clinical decision-making.

There is no agreement in the literature on how to calculate and report responsiveness to clinical change. Various statistics have been used, and often a combination of statistics is reported within a single study. Group-level statistics such as effect size are commonly reported. Effect size is calculated as the mean score change divided by the standard deviation of baseline scores for a group of patients judged to have changed over time [6]. The standardized response mean (SRM) [5] and Guyatt's responsiveness index [3] are variations on effect size and also qualify as group-level statistics.

Statistically significant change at the group level may not be significant at the individual level. Average effects across a group may not be meaningful to the individual patient. Recent authors have emphasized the importance of responsiveness at the level of the individual patient [7–12].


Mean changes for a group may be the result of few individuals with relatively large changes, or numerous individuals with relatively small changes. Liang [13] suggests reserving the term responsiveness to denote important change (importance being represented by some individual change criterion), with the term sensitivity used for indices such as effect size, which reflect change of any magnitude without consideration of the individual patient's perspective. Proponents of evidence-based practice advocate an individual threshold approach in the concepts of control event rate, experimental event rate, and number needed to treat (NNT) [14–18]. This method requires classifying each patient as a success or failure based on some individual-level criterion. Because the definition of treatment success is based on individual criteria, it would be useful to develop similar approaches to comparing the responsiveness of outcome measures intended for clinical decision-making.

Two common threshold approaches to defining individual-level change have been based on either statistically reliable change or clinically important change. Statistically reliable change is calculated using the standard error of measurement (SEM), which reflects the amount of error associated with an individual subject assessment. The SEM is calculated as the square root of the mean square error term from an analysis of variance on test–retest reliability data [19]. Alternatively, it may be estimated by the formula SEM = SD × √(1 − R), where SD is the baseline standard deviation and R is the test–retest reliability coefficient [20]. The minimum detectable change (MDC), also known as reliable change or smallest real difference, may then be calculated by multiplying the SEM by the z-score associated with the desired level of confidence and by √2, reflecting the additional uncertainty introduced by using difference scores from measurements at two points in time [21].
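These two formulas can be sketched as follows (a minimal illustration of the SEM and MDC calculations described above; the function and variable names are ours, not the authors'):

```python
import math

def sem_from_reliability(baseline_sd, reliability):
    """SEM estimated as SD * sqrt(1 - R), where R is the test-retest
    reliability coefficient and SD is the baseline standard deviation."""
    return baseline_sd * math.sqrt(1 - reliability)

def mdc(sem, z=1.64):
    """Minimum detectable change: z * sqrt(2) * SEM.
    z = 1.64 corresponds to the 90% confidence level used in this study;
    sqrt(2) reflects the use of difference scores from two time points."""
    return z * math.sqrt(2) * sem
```

For instance, a measure with a baseline SD of 10 points and test–retest reliability of 0.91 yields an SEM of 3.0 points and a 90% MDC of about 7 points.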
The MDC represents the smallest change in score that likely reflects true change rather than measurement error alone [22]. Authors have previously noted that outcome measures used for individual patient applications require greater reliability than those used only for research purposes at the group level. Minimum reliability standards have been arbitrarily set at 0.90 or 0.95 [23,24]. An advantage of the MDC approach is that it incorporates reliability into calculations of responsiveness. The lower the reliability coefficient, the greater the SEM, and the higher the MDC will be. An outcome measure with a large MDC threshold will have fewer patients with change scores that exceed that threshold (yielding a lower event rate), and thus will appear less responsive to change than more reliable outcome measures. In contrast to arbitrary reliability coefficients, the MDC expresses the reliability threshold in the same units as the outcome measure at the specified confidence level.

Another threshold approach to individual health assessment questionnaire scores is based on the concept of clinically important change [6,10,25,26]. Anchoring refers to a process for connecting score changes with some external criterion, often a global change measure [25,26]. The minimal important difference (MID; also known as the minimal clinically important difference) is defined as the smallest difference in score which patients perceive as beneficial [25]. Some authors advocate using the MID because it allows the patient to determine the level of improvement deemed important and relevant [3,10,13,25–28]. Like the MDC, the MID is expressed in the same units as the outcome measure.

Although the MDC and MID are useful and important benchmarks in clinical applications, they have seldom been used to compare responsiveness across outcome measures. Davidson and Keating [29] have illustrated an approach using an individual patient criterion to compare responsiveness across outcome measures for a cohort of patients. These authors determined the proportion of patients in a sample that exceeded the minimum detectable change (MDC, or reliable change) on each outcome measure. This index of responsiveness represents the proportion of individual patients in a sample that exhibit true change, surpassing the degree of change that could be expected from measurement error alone. We term this the reliable change proportion. The reliable change proportion is analogous to the event rate in evidence-based practice applications. This approach could also be used with the MID as an "important change" benchmark. The more responsive the outcome measure, the greater the proportion of patients who will exceed the minimum change criterion. Although these proportions or event rates are based on individual criteria, they may be used to summarize and compare results across different outcome measures for a group of patients.

The purpose of this study was to contrast the use of group-level vs. individual-level responsiveness indices and to investigate the sensitivity of these methods to differences in responsiveness among outcome measures.

2. Methods

2.1. Subjects

Patients were recruited at the initial visit to local physical and occupational (hand therapy) outpatient clinics in the Minneapolis area. All patients 18 years and older with a physician's referral and a diagnosis of a musculoskeletal upper extremity problem were eligible for the study. The ability to read and understand English was required for eligibility. Patients with primary or coexisting systemic conditions, including etiology in the cervical spine, multiple sclerosis, cancer, rheumatoid arthritis, stroke, or mental or cognitive impairment, were excluded. This study was approved by the Institutional Review Boards of the University of Minnesota and Park Nicollet Clinic in Minneapolis, Minnesota.

2.2. Study design

A cohort of patients with upper extremity diagnoses was followed from the initial clinic visit for a total of 3 months.


No effort was made to alter or modify the course of physical or occupational therapy. The interval between baseline and the first clinic follow-up (typically 2 weeks or less) was used for analysis of test–retest reliability, and the interval between baseline and the 3-month follow-up was used to assess responsiveness to change.

2.3. Outcome measures

2.3.1. DASH

The DASH (Disabilities of the Arm, Shoulder, and Hand) [30] has 30 items rated on a five-point Likert scale. Scores may range from 0 to 100, with 0 reflecting no disability. The DASH is region-specific and so allows comparisons across diagnoses of the upper extremity. The DASH is intended for discriminative and evaluative purposes [31]. Internal consistency [31] using Cronbach's alpha and test–retest reliability [32–34] using intraclass correlation coefficients have been reported as 0.92 and above for the DASH. The construct validity of the DASH has been supported in a number of patient populations, including patients with musculoskeletal diagnoses of the shoulder [35,36] and elbow [33,34] and patients with various problems affecting the upper extremity [32]. The DASH has demonstrated responsiveness to change for patients with shoulder instability [36], distal radius fracture [37], and a range of shoulder or hand and wrist problems [32].

2.3.2. SPADI

The Shoulder Pain and Disability Index (SPADI) [38,39] was given to the subgroup of patients with diagnoses affecting the proximal upper extremity. The SPADI is a 13-item self-administered questionnaire relating to pain and functional status of the shoulder region. The SPADI includes a five-item pain scale and an eight-item disability scale. Each item is scored from 0 to 10, with total scores ranging from 0 to 100 for both the pain and disability sections. The total SPADI is calculated as the mean of the pain and disability scales. Higher scores indicate greater disability.
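The SPADI scoring rule described above can be sketched as follows (an illustrative implementation of the stated rule, not the instrument authors' scoring code; names are ours):

```python
def spadi_total(pain_items, disability_items):
    """SPADI total as described in the text: each item scored 0-10,
    the pain (5 items) and disability (8 items) sections each scaled
    to 0-100, and the total taken as the mean of the two sections."""
    pain = sum(pain_items) / (len(pain_items) * 10) * 100
    disability = sum(disability_items) / (len(disability_items) * 10) * 100
    return (pain + disability) / 2
```

A patient scoring 10 on every item would receive the maximum total of 100 (greatest disability), and one scoring 0 throughout would receive 0.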
The SPADI has demonstrated reliability [40] and validity [38,39,41] among patients with various shoulder diagnoses. The SPADI has shown responsiveness to change in a group of patients following shoulder surgery [40] and among outpatients with a variety of shoulder diagnoses [42].

2.3.3. Patient-Rated Wrist Evaluation

The Patient-Rated Wrist Evaluation (PRWE) [37,43], a joint-specific instrument, was given to patients with conditions affecting the elbow, forearm, and hand in addition to those with wrist problems, because of the overlapping function of these interrelated joints. It contains 5 items regarding pain and 10 items pertaining to function of the wrist and hand [37]. Each item is scored on a 0–10 scale. Item totals are summed and transformed to a 0–50 scale for both pain and function. The total score is the sum of the two subscales and may range from 0 to 100, with higher scores indicating greater pain and disability. Reliability, validity, and responsiveness to change have been established in a group of patients recovering from wrist fractures [37,43].

2.3.4. SF-12

The Medical Outcomes Study 12-Item Short-Form Health Survey (SF-12) is a generic health-related quality of life measure with items derived from the Short Form (SF-36) Health Survey [44,45]. It has been shown to reproduce the Physical and Mental Component Summary Scales of the SF-36 with less respondent burden [45]. Validity testing showed high correlations with the Physical and Mental Component Summary scores of the SF-36 [44]. Two-week test–retest reliability coefficients for the Physical Component and Mental Component scales were 0.89 and 0.76, respectively [45]. Several authors have reported lower responsiveness for the SF-12 Physical Component Summary Scale than for more specific self-assessment instruments in studies that used the measures concurrently [36,40,46–48].

2.3.5. Global Disability Rating

The group-level responsiveness indices used in the present study require an external criterion for change. Because the validity of retrospective global ratings of change has been challenged [49], a prospectively administered global rating scale was created for our study. It consists of a single question about the effect of injury on the subject's daily function over the past week. This single global question was administered at each test occasion to reduce patient bias in the resultant change scores. It provided the patient's current global functional assessment without requiring extensive recall. Patients were instructed to answer the question, "How much has this problem affected your normal daily activities using the arm(s) during the past week?" Items were scored from 1 (no disability) to 7 (maximum disability). Descriptors were used to define each level of disability for the patient. The examining therapist also rated the patient's baseline disability using this scale.

2.4. Protocol

Following informed consent, each patient completed a demographic summary form and the DASH, SF-12, and Global Disability Rating at the initial therapy visit. The examining therapist also completed a rating of disability at the initial clinic visit based on his or her evaluation of the patient's injury. Therapists were instructed not to refer to the baseline ratings when making the follow-up rating. The therapist categorized the patient problem as proximal (primarily shoulder problems), distal (elbow, wrist, or hand), or both, according to the physician's diagnosis and the location of the patient's complaint. Patients with a proximal upper extremity problem also completed the SPADI, and patients with distal upper extremity problems also completed the PRWE. To eliminate any order effects, the sequence of outcome measures was randomized for each packet of questionnaires.


At the second clinic visit, subjects again filled out each applicable outcome measure, along with a rating of change to determine whether any meaningful change had occurred in the interval between clinic visits [22]. These data were used for test–retest reliability purposes. Subjects were not allowed to refer to responses from the first set of questionnaires. Three months after the initial visit, the complete randomized set of outcome measures was mailed to each subject. Subjects were encouraged to fill out all questionnaires and return them promptly. If needed, follow-up postcards and a second copy of the questionnaires were mailed in subsequent weeks.

2.5. Analysis

Statistical analyses were completed using NCSS 2001 (Number Cruncher Statistical Systems; Kaysville, UT, USA; http://www.ncss.com). Dependent variables were baseline, final follow-up, and change scores for the DASH, SPADI Total, PRWE Total, SF-12 Physical Component Summary Scale, and Global Disability Rating. All outcome measures were scored according to the authors' instructions, including instructions for missing data. For the DASH, SPADI, and PRWE, this involved determining the percentage score based on the number of completed items. If more than 25% of the items for a given scale were missing, no scale score was calculated or entered for the subject. The authors of the SF-12 do not offer a scoring alternative when one or more items are omitted from this 12-item questionnaire. To minimize missing SF-12 data, a multiple linear regression formula was used to impute the Physical and Mental Component Summary Scale scores, with completed items as predictors of the final scale scores. To gauge the effect of imputation on this group's scores, the Pearson product-moment correlation was used to compare imputed scores to scores resulting from substitution of sample median responses for the missing items.
Descriptive statistics were used to summarize demographic information and baseline, 3-month follow-up, and change scores on each outcome measure. The interim period from initial evaluation could vary from person to person, depending on the timeliness of the completion and return of the 3-month follow-up packet of questionnaires. For this reason, correlations between the interim period (in days) and change scores on the DASH, SPADI, and PRWE were calculated using Pearson product-moment correlation coefficients.

Test–retest reliability was determined using data from baseline and the first follow-up visit, excluding patients who indicated they had undergone meaningful change since the first questionnaire administration. Intraclass correlation coefficients (ICCs) and 95% confidence intervals were calculated based on a one-way analysis of variance (ANOVA) for the DASH, SPADI, PRWE, SF-12, and Global Disability Rating scores between the first and second clinic visits [19].
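The ICC(1,1) and SEM calculations described here can be sketched from the one-way ANOVA mean squares (a minimal illustration assuming two trials per subject; this is not the authors' NCSS code, and all names are ours):

```python
import math

def icc_1_1(pairs):
    """ICC(1,1) from test-retest pairs via a one-way ANOVA with
    subjects as the grouping factor (k = 2 trials per subject).
    Returns (ICC, SEM), with SEM as the square root of the
    within-subject (error) mean square, as in the text."""
    n = len(pairs)
    k = 2
    grand = sum(a + b for a, b in pairs) / (n * k)
    # Between-subjects and within-subject sums of squares
    ss_between = sum(k * ((a + b) / k - grand) ** 2 for a, b in pairs)
    ss_within = sum((a - (a + b) / 2) ** 2 + (b - (a + b) / 2) ** 2
                    for a, b in pairs)
    bms = ss_between / (n - 1)        # between-subjects mean square
    wms = ss_within / (n * (k - 1))   # within-subject (error) mean square
    icc = (bms - wms) / (bms + (k - 1) * wms)
    return icc, math.sqrt(wms)
```

Perfectly reproduced scores give an ICC of 1.0 and an SEM of 0; test–retest disagreement inflates the error mean square, lowering the ICC and raising the SEM (and hence the MDC).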


ICC (1,1) calculations were based on a subset of stable patients as defined above. The SEM was determined by taking the square root of the mean square error term from the respective ANOVA. The MDC was calculated for each instrument by multiplying the z-score corresponding to the level of significance, the square root of 2, and the SEM [21]. A z-score of 1.64 was chosen to reflect an acceptable 90% confidence level for clinical application to individual patients.

To examine the validity of the Global Disability Rating used in the present study, we compared patient self-ratings to observer-based ratings provided by the therapists. The relationship between baseline therapist and patient ratings of upper extremity function was analyzed using the Spearman rank-order correlation coefficient. Correlations were also determined for the baseline, 3-month follow-up, and change scores between the Global Disability Rating and the DASH, SPADI, and PRWE. Finally, a one-way ANOVA was used to test for significant differences among DASH change scores for patients who were worse, the same, or better according to change scores on the Global Disability Rating.

Change scores of 1 or more on the Global Disability Rating were used to differentiate patients who had improved from those who stayed the same or became worse. Score changes of 1 or more are consistent with previous use of similar instruments [28,40]. This criterion also exceeded the value of 1 SEM calculated from the test–retest reliability data (described above). The MID for each outcome measure was determined using mean change scores for patients with small but meaningful global change [25]. The MID was defined for each functional outcome measure by anchoring change scores to descriptors on the Global Disability Rating. The MID corresponded to change scores of +1 over the 3-month study period. For example, +1 represents the difference between the descriptors "a little bit" vs. "somewhat" or "somewhat" vs. "moderate" with respect to the effect of an injury on daily activities.

Change over the 3-month interval was analyzed using the following array of responsiveness indices. At the individual patient level:

• Reliable change proportion [29]. The proportion of the sample with change scores exceeding the MDC was determined for each outcome measure. All subjects with complete follow-up data were used for this determination.

• MID proportion. Analogous to the reliable change proportion, the proportion of the sample with change scores exceeding the MID was determined for each outcome measure. All subjects with complete follow-up data were used for this determination.

At the group level:

• Effect size [6]. Mean change score of “improved” patients divided by the standard deviation of baseline scores.


• Standardized response mean (SRM) [5]. Mean change score of "improved" patients divided by the standard deviation of change scores.

• Guyatt's responsiveness index [3]. Mean change score of "improved" patients divided by the standard deviation of change scores among stable patients over the 3-month study period. Stable patients were defined as those with no change on the Global Disability Rating.

• Correlation with global change criterion [50]. Spearman rank-order correlation coefficients were calculated between change scores on the Global Disability Rating and change scores on the DASH, SPADI, PRWE, and SF-12 Physical Component Summary Scale.

A critical factor in comparing the responsiveness of outcome measures is to base such comparisons on a common treatment effect. Obviously, the observed responsiveness will depend on the effectiveness of the treatment in addition to the relative responsiveness of the outcome measure. To ensure that each index of change was based on the same treatment effect, only patients with complete data required for a particular set of responsiveness calculations were included. For example, if effect size comparisons were made between the DASH and the SF-12, only subjects who had complete data for both the DASH and the SF-12 were included. Because the treatment effect based on a common sample is fixed, direct comparisons of responsiveness are valid.

Frequently, determination of the most responsive outcome measure has been based on comparison of point estimates for responsiveness indices. This method fails to recognize the inherent uncertainty associated with point estimates. More recent approaches have used confidence intervals to facilitate comparisons among outcome measures. Confidence intervals were calculated for the standardized response mean (SRM) and effect size statistics based on the assumption that difference scores in the numerator are normally distributed [51]. Tuley et al. [52] used similar assumptions to calculate confidence intervals on Guyatt's responsiveness index, in which the numerator reflects mean change in a "changed" group and the denominator reflects the variability in difference scores among stable subjects. As noted earlier, Davidson and Keating placed confidence intervals on the proportion of individual patients exceeding the MDC [29]. This calculation was also used to place a confidence interval on the MID proportion. Finally, confidence intervals were determined for the correlation with change on the Global Disability Rating [53].
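The group-level and individual-level indices above can be sketched together (an illustrative implementation with a simple normal-approximation confidence interval on the proportions; the exact CI methods of the cited papers may differ, and all names are ours):

```python
import math
import statistics

def effect_size(improved_changes, baseline_sd):
    """Effect size: mean change of 'improved' patients divided by
    the standard deviation of baseline scores."""
    return statistics.mean(improved_changes) / baseline_sd

def srm(improved_changes):
    """Standardized response mean: mean change of 'improved' patients
    divided by the standard deviation of the change scores."""
    return statistics.mean(improved_changes) / statistics.stdev(improved_changes)

def threshold_proportion(changes, threshold, z=1.96):
    """Reliable change proportion (threshold = MDC) or MID proportion
    (threshold = MID): fraction of patients whose change exceeds the
    individual threshold, with a normal-approximation 95% CI."""
    n = len(changes)
    p = sum(c > threshold for c in changes) / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, (max(0.0, p - half), min(1.0, p + half))
```

The same change-score array feeds both levels of analysis: the group indices summarize the mean change, while the threshold proportions count how many individual patients exceeded the MDC or MID.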

3. Results

A total of 211 eligible subjects were enrolled, and 155 (73.5%) returned the 3-month follow-up packet of questionnaires. The subset of patients at 3-month follow-up was similar to the baseline sample in age, sex, location of symptoms, occupation, education, and race. A demographic summary is given in Table 1. The sample represented a mixture of diagnoses involving the upper extremity. The most common diagnoses included shoulder pain (n = 45 patients, 21% of the sample), shoulder tendonitis (n = 33; 16%), shoulder impingement (n = 16; 8%), and elbow tendonitis (n = 24; 11%). The most common distal diagnoses were wrist or hand fracture (n = 8; 4%) and carpal tunnel syndrome (n = 6; 3%). Patients declining participation in the study were similar in age, sex, and diagnosis to participating subjects.

Over the three test occasions, 552 SF-12 questionnaires were returned. Of these, 35 (6.3%) had missing data. For the SF-12 Physical Component Summary Scale, the correlation of the imputed scores with scores using median item replacement was 0.98. By comparison, the rate of missing responses on the baseline DASH questionnaire was 1.4%.

The mean interim period between baseline and completion of the 3-month follow-up was 110 days (SD = 23; median = 102 days; range 67–204 days). There was a skew toward longer response times. There was no significant correlation between the interim period and change scores

Table 1
Demographic summary

                                 Subjects enrolled     Subjects returning the
                                 at baseline           3-month follow-up
                                 (n = 211)             questionnaire (n = 155)
Age, years, mean (SD)            47.5 (14.1)           49.6 (14.3)
Age range, years                 18–88                 18–88
Sex
  Male                           105 (49.8%)           70 (45.2%)
  Female                         106 (50.2%)           85 (54.8%)
Involved arm
  Dominant                       121 (57.3%)           91 (58.7%)
  Nondominant                    63 (29.9%)            43 (27.7%)
  Bilateral                      27 (12.8%)            21 (13.6%)
Symptom location
  Proximal                       133 (63.0%)           98 (63.2%)
  Distal                         69 (32.7%)            52 (33.5%)
  Both                           9 (4.3%)              5 (3.2%)
Symptom duration, months
  Mean (SD)                      11.1 (23.2)           12.7 (26.0)
  Median                         4                     4
Work injury                      38 (18%)              19 (12%)
Post surgery                     33 (16%)              24 (16%)
Education, highest
  High school                    39 (18.5%)            23 (14.8%)
  Some college                   79 (37.4%)            54 (34.8%)
  College degree                 49 (23.2%)            39 (25.2%)
  Postgraduate                   41 (19.4%)            36 (23.2%)
  Missing                        3 (1.4%)              3 (2%)
Race
  White                          191 (90%)             141 (91%)
  Asian American                 4 (2%)                4 (2.5%)
  African American               8 (4%)                4 (2.5%)
  Native American                4 (2%)                3 (2%)
  Missing                        4 (2%)                3 (2%)


on the DASH, SF-12, SPADI, or PRWE. The highest correlation coefficient found was r = .18 (P = .24) between the interim period and change scores on the PRWE. This suggests that varying interim periods did not account for the variability in change scores on the primary outcome measures used in the present study.

The intraclass correlation coefficient (1,1) and the lower limit of the 95% confidence interval for each outcome measure are listed in Table 2. The reliability coefficients range from 0.75 for the SF-12 Physical Component Summary Scale score to 0.91 for the PRWE, indicating good to excellent reliability. The error associated with an individual patient score on each outcome measure is indicated by the SEM and the 90% MDC, also listed in Table 2.

Thirty-two physical and occupational therapists participated in subject recruitment and therapist ratings of disability. These therapists reported a mean of 9.9 years of experience in orthopedic practice (SD = 5.2 years; range 1–20). The intraclass correlation coefficients (1,1) for the Global Disability Ratings were 0.85 for therapists and 0.75 for patients. The SEM based on these data was 0.41 scale units for therapist ratings and 0.68 scale units for patient ratings. At baseline, the correlation between patient and therapist Global Disability Ratings was significant (P < .05), with r = .59 for the cohort. Table 3 shows significant correlations between the Global Disability Rating and the DASH, SPADI, and PRWE for baseline, 3-month, and change scores.

Table 4 shows the results of the one-way ANOVA on DASH, SPADI, and PRWE change scores over the 3-month study period.
Table 2
Intraclass correlation coefficients, minimum detectable change, and minimal important difference for four outcome measures, by diagnosis

Outcome measures                 ICC (lower 95% CI)   SEM    MDC, 90%   MID(a)
All diagnoses (n = 78)
  DASH                           0.90 (0.84)          5.35   12.5       12.6
  SF-12 PCS                      0.77 (0.67)          4.22   9.85       6.8
Proximal diagnoses (n = 53)(b)
  DASH                           0.91 (0.85)          5.22   12.2       10.2
  SF-12 PCS                      0.75 (0.60)          4.47   10.4       6.5
  SPADI                          0.86 (0.78)          7.75   18.1       13.2
Distal diagnoses (n = 20)(b)
  DASH                           0.81 (0.57)          5.86   13.7       17.1
  SF-12 PCS                      0.86 (0.67)          3.53   8.2        7.3
  PRWE                           0.91 (0.82)          5.22   12.2       24.0

Abbreviations: DASH, Disabilities of the Arm, Shoulder, and Hand; ICC, intraclass correlation coefficients; MDC, minimum detectable change; MID, minimal important difference; PRWE, Patient-Rated Wrist Evaluation; SEM, standard error of measurement; SF-12, Medical Outcomes Study 12-Item Short-Form Health Survey; SPADI, Shoulder Pain and Disability Index.
a Defined by change scores of +1 on the Global Disability Rating; n = 44 for all diagnoses, n = 29 for proximal diagnoses, n = 16 for distal diagnoses.
b Subset does not include data on patients with both proximal and distal diagnoses.

Table 3
Spearman's correlations among DASH, SPADI, PRWE, and the Global Disability Rating

          Global Disability Rating(a)
          Baseline          3-Month           Change
DASH      0.71 (n = 206)    0.67 (n = 143)    0.67 (n = 143)
SPADI     0.69 (n = 138)    0.64 (n = 95)     0.64 (n = 95)
PRWE      0.56 (n = 63)     0.61 (n = 40)     0.61 (n = 40)

Abbreviations: DASH, Disabilities of the Arm, Shoulder, and Hand; PRWE, Patient-Rated Wrist Evaluation; SPADI, Shoulder Pain and Disability Index.
a All values were significant at P < .0001.

There were significant differences for all three outcome measures among patients defined as worse, the same, or better as defined by change scores on the Global Disability Rating (Table 4). In Fig. 1, 3-month change scores of the DASH, SPADI, PRWE, and SF-12 Physical Component Summary Scale were stratified according to degrees of change on the Global Disability Rating. Change scores of +1 on the Global Disability Rating defined the MID for each outcome measure. The MID scores are listed in Table 2.

Summary statistics for baseline, 3-month follow-up, and change scores are listed in Table 5. Responsiveness statistics were calculated for the DASH, SF-12, SPADI, and PRWE based on the subset of "improved" patients as defined by change scores on the Global Disability Rating. Table 6 lists these statistics, along with the responsiveness rank for each outcome measure. Inspection of these point estimates shows that, with some exceptions, the DASH and the joint-specific SPADI and PRWE are generally more responsive than the generic SF-12 Physical Component Summary Scale. The highest responsiveness rank between the DASH and the joint-specific outcome measures appears to depend on the chosen responsiveness index.

Confidence intervals for the various responsiveness indices are illustrated in Fig. 2. In all comparisons involving the effect size, Guyatt's responsiveness index, correlation with global change, and MID proportion, there is considerable overlap of confidence intervals between outcome measures. Only the reliable change proportion and the SRM

Table 4
One-way ANOVA results on change scores, with the DASH, SPADI, and PRWE stratified according to improvement on the Global Disability Rating

                           Change status per Global Disability Rating
Outcome measure            Improved   No change   Worse   P-value from ANOVA
DASH change (n = 143)      +21.1      +1.9        -9.9    0.000001
SPADI change (n = 95)      +26.3      +4.0        -6.2    0.000001
PRWE change (n = 40)       +31.7      +10.8       -2.2    0.005

Abbreviations: ANOVA, analysis of variance; DASH, Disabilities of the Arm, Shoulder, and Hand; PRWE, Patient-Rated Wrist Evaluation; SPADI, Shoulder Pain and Disability Index.
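The P-values in Table 4 come from one-way ANOVAs of change scores across the three global-rating strata. A stdlib sketch of the F statistic; the synthetic change scores below reuse the group means from Table 4, but the stratum sizes and the common SD of 15 are assumptions for illustration, not figures from the study:

```python
import random
from statistics import mean

def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    all_vals = [x for g in groups for x in g]
    grand = mean(all_vals)
    k, n = len(groups), len(all_vals)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

random.seed(0)
improved = [random.gauss(21.1, 15) for _ in range(80)]   # group means from Table 4;
no_change = [random.gauss(1.9, 15) for _ in range(40)]   # sizes and SD are assumed
worse = [random.gauss(-9.9, 15) for _ in range(23)]

f_stat = one_way_anova_f([improved, no_change, worse])
print(f_stat > 10)  # a large F corresponds to the very small P-values in Table 4
```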

Fig. 1. Mean 3-month change scores on the Disabilities of the Arm, Shoulder, and Hand (DASH) questionnaire, SF-12 Physical Component Summary Scale, Shoulder Pain and Disability Index (SPADI), and Patient-Rated Wrist Evaluation (PRWE) stratified by 3-month global change scores on the patient's Global Disability Rating. Global change scores of +1 (arrow) define the minimal important difference (MID) for each outcome measure.

statistics show nonoverlapping confidence intervals, suggesting real differences between outcome measures. In the SRM panel of Fig. 2, the confidence interval for the PRWE lies above that of the SF-12 Physical Component Summary Scale in the subgroup of patients with distal upper extremity diagnoses; no other differences between outcome measures are apparent. Inspection of the confidence intervals in the reliable change proportion panel of Fig. 2 suggests that the DASH is more responsive than the SF-12 Physical Component Summary Scale for the entire

Table 5
Summary descriptive statistics for patients with baseline and 3-month follow-up data

Outcome measure                       Baseline mean (SD)   3-Month mean (SD)   Change score (SD)
DASH (n = 154)                        32.1 (17.2)          18.2 (16.4)         13.9 (19.3)
SF-12 PCS (n = 154)                   40.3 (7.8)           46.7 (9.4)          6.4 (8.9)
SF-12 MCS (n = 154)                   53.2 (9.6)           53.4 (9.2)          0.16 (9.0)
Global Disability Rating (n = 143)    3.6 (1.4)            2.4 (1.1)           1.2 (1.6)
SPADI total (n = 103)                 41.3 (21.1)          24.5 (22.4)         16.8 (25.2)
PRWE total (n = 44)                   46.4 (17.1)          21.0 (20.0)         25.4 (20.9)

Abbreviations: DASH, Disabilities of the Arm, Shoulder, and Hand; MCS, Mental Component Summary scale; PCS, Physical Component Summary scale; PRWE, Patient-Rated Wrist Evaluation; SF-12, Medical Outcomes Study 12-Item Short-Form Health Survey; SPADI, Shoulder Pain and Disability Index.

cohort. Among patients with proximal diagnoses, the DASH and SPADI are more responsive than the SF-12 Physical Component Summary Scale but not different from each other. Among patients with distal diagnoses, the PRWE is more responsive than the SF-12 Physical Component Summary Scale but not different from the DASH.
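These overlap judgments rest on confidence intervals for proportions. The paper's exact interval method (cited to [53]) is not given in this excerpt, so the sketch below uses the common Wilson score interval as a stand-in, with illustrative counts derived from the whole-cohort reliable change proportions (0.54 and 0.31) and n = 154 from Table 5:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score confidence interval for a proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return centre - half, centre + half

def intervals_overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

# Reliable change proportions for the entire cohort: DASH 0.54, SF-12 PCS 0.31.
dash = wilson_ci(round(0.54 * 154), 154)   # 83 of 154 patients
sf12 = wilson_ci(round(0.31 * 154), 154)   # 48 of 154 patients
print(intervals_overlap(dash, sf12))  # -> False: nonoverlapping intervals
```

With these assumed counts the two intervals do not overlap, matching the qualitative conclusion drawn from the reliable change proportion panel of Fig. 2.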

4. Discussion

Our study is consistent with previous research showing that the highest responsiveness rank among outcome measures depends on the responsiveness index chosen [22,46,54,55]. The lack of agreement on a preferred index makes comparisons of responsiveness problematic. Beaton et al. [56] have suggested that the selected index of responsiveness may be decided by the context in which the outcome measure will be used. For comparing responsiveness among measures intended for clinical decision making, the reliable change and MID proportions have a number of advantages. They have an intuitive interpretation: the proportion of individual patients in the sample that has exhibited a "true" or meaningful change. Confidence intervals placed on a proportion demonstrate the degree of uncertainty underlying these point estimates and facilitate comparisons between outcome measures. Without placing undue emphasis on strict decision making based on the magnitude of a P-value, the use of confidence intervals leads

Table 6
Responsiveness statistics and rank for four outcome measures, by diagnosis

Outcome measure     Effect size   SRM       Guyatt's   Spearman's correlation   Reliable change   MID
                                            index      with global change       proportion        proportion
All diagnoses
  DASH              1.21          1.26(a)   1.66(a)    0.67(a)                  0.54(a)           0.54(a)
  SF-12 PCS         1.30(a)       1.15      1.37       0.44                     0.31              0.49
Proximal diagnoses
  DASH              1.06          1.08(a)   1.78(a)    0.66(a)                  0.50(a)           0.58(a)
  SF-12 PCS         1.20          1.07      1.40       0.55                     0.29              0.48
  SPADI total       1.21(a)       1.08(a)   1.53       0.64                     0.48              0.56
Distal diagnoses
  DASH              1.67          1.76      1.16(a)    0.63(a)                  0.70              0.50
  SF-12 PCS         1.51          1.22      0.95       0.50                     0.48              0.55(a)
  PRWE total        1.87(a)       1.94(a)   1.16(a)    0.61                     0.75(a)           0.55(a)

Abbreviations: DASH, Disabilities of the Arm, Shoulder, and Hand; MID, minimal important difference; PCS, Physical Component Summary scale; PRWE, Patient-Rated Wrist Evaluation; SF-12, Medical Outcomes Study 12-Item Short-Form Health Survey; SPADI, Shoulder Pain and Disability Index; SRM, standardized response mean.
(a) Highest responsiveness ranking within this classification.

to more valid conclusions regarding the relative responsiveness of outcome measures. Once established, the MDC or MID can serve as the threshold for change, eliminating the need for a separate external change criterion. Establishing a valid change criterion has presented a formidable challenge for researchers studying responsiveness. A single-item global change or satisfaction measure is a less than satisfactory solution. In the present study, we used change scores on a single global function item applied serially. In each case, we were obligated to use single-item measures to judge true change in multi-item measures of much greater sophistication. The global instrument was used to determine change status for purposes of calculating the effect size, standardized response mean, and Guyatt's responsiveness index, and for correlation with global change. It was also used to establish the MID, a procedure that presents further challenges. Some researchers have used retrospective ratings of change to define the minimal important difference [25], but we avoided this approach because of questions concerning the validity of retrospective ratings. An alternative would have been to select a multi-item instrument as the criterion reference for change, but this is also problematic. First, our purpose was to establish the relative responsiveness of selected outcome measures, so identifying one outcome as the criterion for change would subvert this purpose. Second, how much change on the criterion outcome measure would constitute "true" change? It seems preferable to avoid the application of global measures as criterion standards for meaningful clinical change whenever possible. The use of an MDC or MID threshold facilitates this approach. The MID and reliable change proportions offer more insightful clinical interpretations than group-level statistics. The following example from our study illustrates this point.
The effect size and SRM were calculated on the "improved" subset of the overall sample on the basis of an external change criterion. For the entire cohort, the effect size was 1.21 and the SRM 1.26, indicating a very large treatment effect based on generally accepted benchmarks [57]. The reliable change proportion used the entire sample and indicated that only 54% of individual patients had a true positive change in status, that is, a degree of change exceeding measurement error alone. Similarly, the MID proportion indicated that only 54% of the sample noted an important change in functional status. Even with large treatment effects, there is a sizeable subgroup of patients who do not improve. The reliable change and MID proportions quantify the magnitude of this subgroup in terms of individual patients. The use of proportions in the determination of responsiveness is analogous to the procedure used to calculate the number needed to treat (NNT) in evidence-based practice terminology. The NNT indicates the number of patients who would need to receive an experimental intervention to produce one additional successful outcome compared with a control treatment. It is based on the control event rate (Pc) and the experimental event rate (Pe), where an event is a successful outcome: NNT = 1/(Pe - Pc). This suggests a corresponding approach for comparing responsiveness. If P_DASH is the proportion of patients in a sample improved according to the DASH, and P_SF-12 is the proportion improved according to the SF-12 Physical Component Summary Scale, then 1/(P_DASH - P_SF-12) equals the number of patients one would need to measure to show quantifiable improvement in one additional patient using the more responsive DASH. Here again, "quantifiable" improvement depends on the chosen criterion for improvement. Referring to Table 6, this yields 1/(0.54 - 0.31) = 4.3 patients using the MDC as the criterion and 1/(0.54 - 0.49) = 20 patients using the MID.
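The arithmetic above can be sketched directly. The function name is ours, but the formula and the proportions are taken from the text and Table 6:

```python
def number_needed_to_measure(p_more, p_less):
    """NNT-style index: patients to measure with the more responsive
    instrument to show quantifiable improvement in one additional patient."""
    return 1.0 / (p_more - p_less)

# Reliable change proportions, entire cohort (DASH 0.54 vs. SF-12 PCS 0.31):
print(round(number_needed_to_measure(0.54, 0.31), 1))  # -> 4.3
# MID proportions (DASH 0.54 vs. SF-12 PCS 0.49):
print(round(number_needed_to_measure(0.54, 0.49), 1))  # -> 20.0
```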
One obstacle in calculating NNT is the definition of success required to dichotomize outcomes. If outcome measures such as a disability questionnaire or visual-analog

Fig. 2. 95% Confidence intervals on six responsiveness indices. The Disabilities of the Arm, Shoulder, and Hand (DASH) questionnaire and SF-12 Physical Component Summary Scale are represented in all comparisons; the Shoulder Pain and Disability Index (SPADI) and Patient-Rated Wrist Evaluation (PRWE) are shown for the proximal and distal subgroups, respectively. Only comparisons using the standardized response mean and the reliable change proportion statistics suggest differences between outcome measures based on nonoverlapping confidence intervals. MID, minimal important difference.

pain scores are used, a somewhat arbitrary cutoff score might determine success vs. failure. The MID or MDC offers a more quantifiable threshold for defining measurable change. Recent authors have attempted to classify and evaluate the usefulness of the various indices of responsiveness. Liang suggests that statistically detectable change should be a minimum threshold for change, with minimal important change perhaps a more meaningful criterion [13]. Such distinctions may be academic: Wyrwich et al. [20,58] have presented empirical evidence that important change may be comparable to statistically detectable change as measured by a 1 SEM criterion.
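The 1 SEM criterion connects to the MDC through standard formulas, SEM = SD × √(1 − reliability) and MDC95 = 1.96 × √2 × SEM, which are conventional but not spelled out in this excerpt. A sketch, applying an assumed test-retest ICC of 0.90 (not reported here) to the DASH baseline SD from Table 5:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement from a test-retest reliability coefficient."""
    return sd * math.sqrt(1 - reliability)

def mdc95(sem_value):
    """Minimum detectable change at the 95% confidence level."""
    return 1.96 * math.sqrt(2) * sem_value

def reliable_change_proportion(change_scores, threshold):
    """Proportion of patients whose improvement exceeds the MDC threshold."""
    return sum(1 for c in change_scores if c > threshold) / len(change_scores)

threshold = mdc95(sem(17.2, 0.90))          # DASH baseline SD 17.2; ICC assumed
print(round(threshold, 1))                  # -> 15.1 DASH points
print(reliable_change_proportion([20.0, 5.0, 30.0, -4.0], threshold))  # -> 0.5
```

Once such a threshold is fixed, counting the patients who clear it gives the reliable change proportion directly, which is what makes the index interpretable at the individual patient level.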

5. Limitations

No treatment of known efficacy is available for a sample of patients with multiple diagnoses, so the results of some calculations depended on external criterion measures of change to determine "true" change. The lack of an accepted gold standard for assessing change made it difficult to validate our external criterion. We were able to show a significant correlation of the Global Disability Rating criterion with therapist ratings and with change scores on accepted functional outcome measures. Because of the observational design of the present study, the final health status of each subject included changes due to treatment, the passage of

time, and other factors specific to that subject. This study design did not allow differentiation among these factors. Previous authors have pointed out that both the MDC [8,59] and the MID [9,59] may vary depending on the patient's baseline level of disability; this analysis did not attempt to vary these criterion values according to baseline score. Finally, this study included only a small sample of patients with elbow, wrist, and hand problems, which led to smaller numbers and wider confidence intervals for subgroup comparisons.

6. Conclusion

Minimal change indices provide useful benchmarks for clinical decision making about the progress of individual patients. The reliable change and minimal important difference proportions characterize the ability of an outcome measure to detect small but important changes in patient status. We recommend using these indices to compare responsiveness among outcome measures used for clinical decision making.

Acknowledgments

We would like to thank the therapists at Park Nicollet Medical Center and the Institute for Athletic Medicine in Minneapolis, MN, for their assistance in recruiting patients and gathering data. We also express our appreciation to the Orthopaedic Section of the American Physical Therapy Association for funding this project through the Clinical Research Grant Program.

References

[1] Kirshner B, Guyatt G. A methodological framework for assessing health indices. J Chronic Dis 1985;38:27–36.
[2] Deyo RA, Diehr P, Patrick DL. Reproducibility and responsiveness of health status measures: statistics and strategies for evaluation. Control Clin Trials 1991;12(4 Suppl):142S–158S.
[3] Guyatt G, Walter S, Norman G. Measuring change over time: assessing the usefulness of evaluative instruments. J Chronic Dis 1987;40:171–8.
[4] Deyo RA, Centor RM. Assessing the responsiveness of functional scales to clinical change: an analogy to diagnostic test performance. J Chronic Dis 1986;39:897–906.
[5] Liang MH, Fossel AH, Larson MG. Comparisons of five health status instruments for orthopedic evaluation. Med Care 1990;28:632–42.
[6] Kazis LE, Anderson JJ, Meenan RF. Effect sizes for interpreting changes in health status. Med Care 1989;27(3 Suppl):S178–89.
[7] Jacobson NS, Roberts LJ, Berns SB, McGlinchey JB. Methods for defining and determining the clinical significance of treatment effects: description, application, and alternatives. J Consult Clin Psychol 1999;67:300–7.
[8] Stratford PW, Binkley JM. Applying the results of self-report measures to individual patients: an example using the Roland-Morris Questionnaire. J Orthop Sports Phys Ther 1999;29:232–9.
[9] Testa MA. Interpretation of quality-of-life outcomes: issues that affect magnitude and meaning. Med Care 2000;38(9 Suppl):II166–74.
[10] Wiebe S, Matijevic S, Eliasziw M, Derry PA. Clinically important change in quality of life in epilepsy. J Neurol Neurosurg Psychiatry 2002;73:116–20.
[11] Simpson JM, Valentine J, Worsfold C. The Standardized Three-metre Walking Test for elderly people (WALK3m): repeatability and real change. Clin Rehabil 2002;16:843–50.
[12] Ljungquist T, Nygren A, Jensen I, Harms-Ringdahl K. Physical performance tests for people with spinal pain: sensitivity to change. Disabil Rehabil 2003;25:856–66.
[13] Liang MH. Longitudinal construct validity: establishment of clinical meaning in patient evaluative instruments. Med Care 2000;38(9 Suppl):II84–90.
[14] Sackett DL, Straus SE, Richardson WS, Rosenberg W, Haynes RB. Evidence-based medicine: how to practice and teach EBM. 2nd ed. Edinburgh; New York: Churchill Livingstone; 2000.
[15] Bender R. Calculating confidence intervals for the number needed to treat. Control Clin Trials 2001;22:102–10.
[16] Furukawa TA, Guyatt GH, Griffith LE. Can we individualize the 'number needed to treat'? An empirical study of summary effect measures in meta-analyses. Int J Epidemiol 2002;31:72–6.
[17] Walter SD. Number needed to treat (NNT): estimation of a measure of clinical benefit. Stat Med 2001;20:3947–62.
[18] Dalton GW, Keating JL. Number needed to treat: a statistic relevant for physical therapists. Phys Ther 2000;80:1214–9.
[19] Fleiss JL. The design and analysis of clinical experiments. New York: Wiley; 1986.
[20] Wyrwich KW, Nienaber NA, Tierney WM, Wolinsky FD. Linking clinical relevance and statistical significance in evaluating intra-individual changes in health-related quality of life. Med Care 1999;37:469–78.
[21] Ottenbacher KJ, Johnson MB, Hojem M. The significance of clinical change and clinical change of significance: issues and methods. Am J Occup Ther 1988;42:156–63.
[22] Stratford PW, Binkley FM, Riddle DL. Health status measures: strategies and analytic methods for assessing change scores. Phys Ther 1996;76:1109–23.
[23] Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use. 2nd ed. New York: Oxford University Press; 1995.
[24] Nunnally JC, Bernstein IH. Psychometric theory. 3rd ed. New York: McGraw-Hill; 1994.
[25] Jaeschke R, Singer J, Guyatt GH. Measurement of health status: ascertaining the minimal clinically important difference. Control Clin Trials 1989;10:407–15.
[26] Deyo RA, Patrick DL. The significance of treatment effects: the clinical perspective. Med Care 1995;33(4 Suppl):AS286–91.
[27] Redelmeier DA, Guyatt GH, Goldstein RS. Assessing the minimal important difference in symptoms: a comparison of two techniques. J Clin Epidemiol 1996;49:1215–9.
[28] Santanello NC, Zhang J, Seidenberg B, Reiss TF, Barber BL. What are minimal important changes for asthma measures in a clinical trial? Eur Respir J 1999;14:23–7.
[29] Davidson M, Keating JL. A comparison of five low back disability questionnaires: reliability and responsiveness. Phys Ther 2002;82:8–24.
[30] Hudak PL, Amadio PC, Bombardier C, The Upper Extremity Collaborative Group (UECG). Development of an upper extremity outcome measure: the DASH (disabilities of the arm, shoulder and hand) [corrected]. Am J Ind Med 1996;29:602–8.
[31] Marx RG, Bombardier C, Hogg-Johnson S, Wright JG. Clinimetric and psychometric strategies for development of a health measurement scale. J Clin Epidemiol 1999;52:105–11.
[32] Beaton DE, Katz JN, Fossel AH, Wright JG, Tarasuk V, Bombardier C. Measuring the whole or the parts? Validity, reliability, and responsiveness of the Disabilities of the Arm, Shoulder and Hand outcome measure in different regions of the upper extremity. J Hand Ther 2001;14:128–46.
[33] MacDermid JC. Outcome evaluation in patients with elbow pathology: issues in instrument development and evaluation. J Hand Ther 2001;14:105–14.
[34] Turchin DC, Beaton DE, Richards RR. Validity of observer-based aggregate scoring systems as descriptors of elbow pain, function, and disability. J Bone Joint Surg Am 1998;80:154–62.
[35] Ring D, Perey BH, Jupiter JB. The functional outcome of operative treatment of ununited fractures of the humeral diaphysis in older patients. J Bone Joint Surg Am 1999;81:177–90.
[36] Kirkley A, Griffin S, McLintock H, Ng L. The development and evaluation of a disease-specific quality of life measurement tool for shoulder instability: the Western Ontario Shoulder Instability Index (WOSI). Am J Sports Med 1998;26:764–72.
[37] MacDermid JC, Richards RS, Donner A, Bellamy N, Roth JH. Responsiveness of the short form-36, disability of the arm, shoulder, and hand questionnaire, patient-rated wrist evaluation, and physical impairment measurements in evaluating recovery after a distal radius fracture. J Hand Surg [Am] 2000;25:330–40.
[38] Roach KE, Budiman-Mak E, Songsiridej N, Lertratanakul Y. Development of a shoulder pain and disability index. Arthritis Care Res 1991;4:143–9.
[39] Williams JW Jr, Holleman DR Jr, Simel DL. Measuring shoulder function with the Shoulder Pain and Disability Index. J Rheumatol 1995;22:727–32.
[40] Beaton D, Richards RR. Assessing the reliability and responsiveness of 5 shoulder questionnaires. J Shoulder Elbow Surg 1998;7:565–72.
[41] Beaton DE, Richards RR. Measuring function of the shoulder: a cross-sectional comparison of five questionnaires. J Bone Joint Surg 1996;78:882–90.
[42] Heald SL, Riddle DL, Lamb RL. The shoulder pain and disability index: the construct validity and responsiveness of a region-specific disability measure. Phys Ther 1997;77:1079–89.
[43] MacDermid JC, Turgeon T, Richards RS, Beadle M, Roth JH. Patient rating of wrist pain and disability: a reliable and valid measurement tool. J Orthop Trauma 1998;12:577–86.
[44] Jenkinson C, Layte R, Jenkinson D, Lawrence K, Petersen S, Paice C, Stradling J. A shorter form health survey: can the SF-12 replicate results from the SF-36 in longitudinal studies? J Public Health Med 1997;19:179–86.
[45] Ware J Jr, Kosinski M, Keller SD. A 12-Item Short-Form Health Survey: construction of scales and preliminary tests of reliability and validity. Med Care 1996;34:220–33.
[46] Wright JG, Young NL. A comparison of different indices of responsiveness. J Clin Epidemiol 1997;50:239–46.
[47] Binkley JM, Stratford PW, Lott SA, Riddle DL, North American Orthopaedic Rehabilitation Research Network. The Lower Extremity Functional Scale (LEFS): scale development, measurement properties, and clinical application. Phys Ther 1999;79:371–83.
[48] Enloe LJ, Shields RK. Evaluation of health-related quality of life in individuals with vestibular disease using disease-specific and general outcome measures. Phys Ther 1997;77:890–903.
[49] Norman GR, Stratford P, Regehr G. Methodological problems in the retrospective computation of responsiveness to change: the lesson of Cronbach. J Clin Epidemiol 1997;50:869–79.
[50] Stratford PW, Binkley J, Solomon P, Gill C, Finch E. Assessing change over time in patients with low back pain. Phys Ther 1994;74:528–33.
[51] Beaton DE, Hogg-Johnson S, Bombardier C. Evaluating changes in health status: reliability and responsiveness of five generic health status measures in workers with musculoskeletal disorders. J Clin Epidemiol 1997;50:79–93.
[52] Tuley MR, Mulrow CD, McMahan CA. Estimating and testing an index of responsiveness and the relationship of the index to power. J Clin Epidemiol 1991;44:417–21.
[53] Luus HG, Muller FO, Meyer BH. Statistical significance versus clinical relevance. Part III. Methods for calculating confidence intervals. S Afr Med J 1989;76:681–5.
[54] Taylor SJ, Taylor AE, Foy MA, Fogg AJ. Responsiveness of common outcome measures for patients with low back pain. Spine 1999;24:1805–12.
[55] Walsh TL, Hanscom B, Lurie JD, Weinstein JN. Is a condition-specific instrument for patients with low back pain/leg symptoms really necessary? The responsiveness of the Oswestry Disability Index, MODEMS, and the SF-36. Spine 2003;28:607–15.
[56] Beaton DE, Bombardier C, Katz JN, Wright JG. A taxonomy for responsiveness. J Clin Epidemiol 2001;54:1204–17.
[57] Cohen J. Statistical power analysis for the behavioral sciences. Rev ed. New York: Academic Press; 1977.
[58] Wyrwich KW, Tierney WM, Wolinsky FD. Further evidence supporting an SEM-based criterion for identifying meaningful intraindividual changes in health-related quality of life. J Clin Epidemiol 1999;52:861–73.
[59] Stratford PW, Binkley J, Solomon P, Finch E, Gill C, Moreland J. Defining the minimum level of detectable change for the Roland-Morris questionnaire. Phys Ther 1996;76:359–65; discussion 366–8.