A minimal clinically important difference was derived for the Roland-Morris Disability Questionnaire for low back pain


Journal of Clinical Epidemiology 59 (2006) 45–52

Kelvin Jordan*, Kate M. Dunn, Martyn Lewis, Peter Croft
Primary Care Sciences Research Centre, Keele University, Keele, Staffs ST5 5BG, United Kingdom
Accepted 23 March 2005

Abstract

Objective: To compare methods commonly used to derive minimal important differences and recommend a rule for defining patients as clinically improved on the low back pain-specific Roland-Morris Disability Questionnaire (RMDQ).

Methods: 447 primary care low back pain consulters completed a questionnaire at consultation and 6 months. Patients were classified as having achieved an important change based on methods with the best theoretical qualities, that is, the standard error of measurement, reliability change index (RCI), and modified RCI (RCindiv), and using a 30% reduction in score from baseline. To assess clinical importance, improvements based on these methods were compared with improvements on other back pain-related measures.

Results: The percentage of patients rated as improved ranged from 14 to 51% by method. Using a simple rule it was possible to identify patients who had clinically important improvement (36%), patients not improved (53%), and a group of possible improvers (11%). Clinical improvement is shown if RMDQ score is reduced by 30% from baseline and back pain is rated as better on a global rating scale.

Conclusion: A minimal clinically important difference is derived that is clinically relevant, incorporates the measurement error of the RMDQ, and allows subjects with different grades of severity to improve. © 2006 Elsevier Inc. All rights reserved.

Keywords: Minimal clinically important difference; Low back pain; Disability; Distribution-based methods

* Corresponding author. Tel.: (+44) 1782 583924; fax: (+44) 1782 583911. E-mail address: [email protected] (K. Jordan). 0895-4356/06/$ – see front matter © 2006 Elsevier Inc. All rights reserved. doi: 10.1016/j.jclinepi.2005.03.018

1. Introduction

Self-reported generic and disease-specific health profile instruments are commonly used in clinical trials and in longitudinal observational studies. However, a statistically significant change in group scores on a health profile tool does not necessarily equate to a change in an individual's score that either the patient or the clinician would identify as being an important change in the patient's health. Methods for establishing the smallest meaningful or important change for the individual on health profile instruments have recently been reviewed [1]. This minimal important difference (MID) is derived based on one of two approaches.

MIDs from anchor-based methods (MID-A) compare changes in scores on the instrument with an anchor. The anchor is often a single question on global health status change, that is, whether the patient believes they are better than at baseline. This anchor has been suggested to be the best for measuring change from the patient's perspective but may be affected by recall bias [1]. The anchor-based approach has several other limitations [1,2]. Anchors may come from different perspectives (patient, clinician, family) and may, therefore, lead to different MID-As. The approach relies on the validity and reliability of the anchor chosen. Finally, it does not take into account the variation within the sample nor the measurement precision of the instrument. This last limitation implies that the minimal difference obtained may not lie outside the range of the instrument's random variation, that is, may not exceed the measurement error of the instrument.

Due to the above limitations of anchor-based methods, distribution-based methods (MID-D) are seen as preferable. Distribution-based methods evaluate the minimal difference that is in excess of that expected by either the random variation of the sample or the measurement error of the instrument. Methods based on sample variation include the effect size [based on the baseline standard deviation (SD) of the sample], standardized response mean (SD of change), and the responsiveness statistic (SD of change of stable patients). The derived MID-D will vary across samples if the variability within each sample differs between samples [1,3]. Therefore, MID-Ds based on the measurement error of the instrument are recommended and these are described later.

The terms MID and minimal clinically important difference (MCID) tend to be used interchangeably. However, it


may be difficult for a clinician to interpret what a change of the magnitude of the MID-D means in terms of clinical significance. To qualify as an MCID, the MID-D needs to be shown to relate to associated measures; for example, in back pain sufferers, these might include severity of pain or changes in healthcare use. This implies that both anchor-based and distribution-based methods are needed to establish an MCID for an instrument.

The aim of this study was to derive a rule for determining an MCID for the low back pain-specific Roland-Morris Disability Questionnaire (RMDQ) through a series of related objectives. The objectives were: (1) to compare the MIDs for the RMDQ derived from a range of methods, (2) to establish how the MIDs obtained from the theoretically preferred distribution-based methods relate to other measures of back pain severity, (3) to use the results from objectives (1) and (2) to develop an MCID for the RMDQ by defining a rule for classifying individual patients as clinically improved or not.

2. Study population and questionnaires

The objectives were addressed using a prospective cohort study of low back pain sufferers in primary care. The study population consisted of 447 consecutive patients (Table 1) who consulted primary care for back pain at five general practices in North Staffordshire and who completed a questionnaire at baseline and at 6 months. Two hundred sixty (58%) subjects were female, and the mean age of the sample was 46.8 years (SD 8.12, range 30–59).

The questionnaire included the RMDQ, a 24-item back pain-specific disability scale recommended for use in primary care and community studies [4]. The RMDQ has shown good validity and been recommended for use without further validation [5]. The RMDQ has also shown good test–retest reliability with reported intraclass correlation coefficients of 0.8 or more [6–8]. The scale score ranges from 0 (no disability) to 24 (severe disability). No patient indicated limitation on all items of the RMDQ at either baseline or 6 months. At baseline the mean RMDQ score was 9.1 (SD 6.37, median 8, interquartile range 4, 14). Scores at 6 months were lower (less disability) than at baseline (mean difference −2.5; 95% CI −3.0, −2.0).

Also included were questions on current working status ("working," "not working due to back pain," "not working for other reasons") and bothersomeness of back pain (five categories ranging from "not at all" to "extremely") [9]. Responses on a 0–10 visual analog scale relating to current, least, and usual pain over the last 2 weeks were combined by calculating their mean to give a score for intensity of pain, with 0 indicating no pain and 10 most intense pain.

The generic health profile tool, the Short Form-36 version 2 (SF-36), was included in both questionnaires [10]. This has eight scales relating to different dimensions of health. For the analyses presented here the SF-36 scores were inverted so that 0 represented best health and 100 worst health, putting the SF-36 scores in the same direction as the RMDQ and pain intensity scale.

The 6-month questionnaire included a global question on the patient's perception of how their back pain had improved or deteriorated over the 6 months from baseline. The 6-month questionnaire also asked respondents whether they had seen their GP about their back pain in the previous four weeks.

3. Stage 1: Comparison of methods to determine a MID for improvement

Stage 1 addresses the first objective of a comparison of MIDs obtained from commonly used anchor-based and distribution-based methods.

3.1. Anchor-based methods (MID-A)

Table 1
Baseline and 6-month characteristics of sample

n                            447
Male, n (%)                  187 (42)
Female, n (%)                260 (58)
Age at baseline
  Mean (SD)                  46.8 (8.12)
  Range                      30 to 59
RMDQ at baseline
  Mean (SD)                  9.1 (6.37)
  Range                      0 to 23
RMDQ at 6 months
  Mean (SD)                  6.6 (6.56)
  Range                      0 to 23
Change in RMDQ
  Mean (SD)                  −2.5 (5.26)
  Range                      −22 to 17
Global change at 6 months
  Better, n (%)              214 (48)
  No change, n (%)           165 (37)
  Worse, n (%)               64 (14)

The global rating used as the anchor was the patient’s subjective measure of change in back pain health over 6 months. Patients were grouped into: better (completely recovered, much better, better), no change and worse (worse, much worse). Two hundred fourteen (48%) patients reported their back pain was better at 6 months. Only 64 (14%) rated it as worse. Patients reporting better health at 6 months had improved a mean of 5.1 (SD 4.88) points on the RMDQ while those reporting no change had improved a mean of 0.2 (SD 4.14) points. Two receiver operating characteristic (ROC) curves, which plot sensitivity against one minus the specificity, were constructed. The first ROC curve compared those rating their back pain as better at 6 months to those reporting no change by examining raw score change on the RMDQ from baseline to 6 months. The sensitivity was defined as the number of patients identified by a particular score change on the RMDQ as having improved and with a global


rating of better divided by all patients stating their back pain was better. The curves were developed using the sensitivities of different values of RMDQ change and their corresponding specificities. The point where sensitivity was closest to specificity was taken as the optimal raw score change on the RMDQ. The second ROC curve replaced raw score change with change on the RMDQ as a percentage of baseline score (percentage change score).

3.2. Distribution-based methods (MID-D) based on measurement error

The standard error of measurement (SEM) method is based on the measurement error inherent in the instrument as well as the variation within the sample. The SEM incorporates a measure of the reliability of the instrument and is calculated as the SD at baseline multiplied by the square root of one minus the reliability of the instrument. The MID-D derived from the SEM should not be sample specific, as the relationship between SD and reliability should be consistent across samples; increases in the standard deviation should be matched by increasing reliability [1,3]. However, the MID-D will depend upon the statistic used to summarize the reliability [e.g., the intraclass correlation coefficient (ICC) from test–retest studies, or Cronbach's alpha]. Further, a low reliability can lead to large MID-Ds, although health profile instruments should have high levels of reliability if they are to assess change. There is also no consensus on the cutoff of the SEM used to determine the MID-D. Values of 1 SEM and 1.96 SEM have been recommended for determining the MID-D [11]. It is assumed that in two-thirds of cases the true score on the RMDQ should be within 1 SEM of the observed score [11]. In our study, the MID-Ds based on 1 SEM and 1.96 SEM were calculated using a reliability value (ICC) of 0.83 obtained from a 2-week test–retest study of a similar population of back pain patients [8].

The other two methods considered were the reliability change index (RCI) and a modified version of the RCI, the RCindiv. The RCI [12,13] represents the difference in two scores on an individual that would be expected to be due to measurement error, and is a more conservative estimate of this difference than the SEM. It is equivalent to $\sqrt{2\,\mathrm{SEM}^2}$. A cutoff value of 1.96 RCI is recommended as indicating that change is not just due to measurement error [12,13]. The RCindiv has been suggested to distinguish better between individual and group level analyses and to be less conservative than the RCI [14]. It is based on calculation of the reliability of baseline, 6 months, and change scores as well as the correlation between baseline and 6-month scores, with a recommended level of MID-D of 1.65 RCindiv. An alternative to counting the individuals who exceed the RCindiv MID-D has also been suggested for estimating the true percentage of patients who have improved, and is said to give a higher and more accurate figure [14].
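The SEM and RCI calculations described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and the inputs are the baseline SD (6.37) and test–retest ICC (0.83) quoted in the text. The RCindiv is omitted because its formula is not given here. Results may differ from the published cutoffs in the second decimal place, depending on how intermediate values are rounded.

```python
import math


def sem(sd_baseline: float, reliability: float) -> float:
    """Standard error of measurement: baseline SD x sqrt(1 - reliability)."""
    return sd_baseline * math.sqrt(1.0 - reliability)


def rci(sem_value: float) -> float:
    """Reliability change index: sqrt(2 x SEM^2)."""
    return math.sqrt(2.0 * sem_value ** 2)


# Inputs taken from the text: baseline SD 6.37, ICC 0.83 (hypothetical variable names).
s = sem(6.37, 0.83)
thresholds = {
    "1 SEM": s,
    "1.96 SEM": 1.96 * s,
    "1.96 RCI": 1.96 * rci(s),
}

for name, value in thresholds.items():
    print(f"{name}: {value:.2f}")
```

Because these cutoffs are fixed numbers of points, they apply uniformly across the whole scale regardless of a patient's baseline score.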


3.3. Stage 1 results

ROC curve analysis using the raw score change gave a MID-A of two points (sensitivity 73%, specificity 67%). ROC curve analysis using the percentage change score suggested that a reduction of 30% on baseline score was the optimal cutoff point (sensitivity 75%, specificity 76%). For a patient with a baseline score of 9 (i.e., close to the mean baseline score), a 30% reduction is equivalent to a change of three points on the RMDQ.

The distribution-based methods using the recommended cutoffs gave values of: 1 SEM 2.62, 1.96 SEM 5.14, RCindiv 6.17, RCI 7.26. Table 2 shows the MIDs needed at each level of baseline RMDQ for improvement based on (1) a 30% change, (2) the RCI, (3) the RCindiv, (4) 1.96 SEM and (5) 1 SEM. The 30% reduction method is not a distribution-based method; it is not based on the sample variation or measurement error of the tool. However, Table 2 shows that for the majority of the RMDQ scale (from 7 onwards), it does match or exceed the MID-D determined by 1 SEM.

These distribution-based measures assume that the MID-D is uniform across the whole scale. Therefore, subjects scoring less than 8 at baseline cannot register an improvement if the RCI is used to derive a MID-D. Similarly, patients cannot show an improvement if they scored, at baseline, less than 7 if the RCindiv method is used, 6 if 1.96 SEM is used, or 3 if 1 SEM is used. Only 36% of patients who had identified themselves as completely recovered would be rated as improved based on the MID-D from the RCI. This is partly because the mean baseline score may be lower in our study population than in, for example, patients in secondary care, so that our patients have less room for improvement. Proportional or baseline score-dependent MIDs may be needed [15,16] that would allow those at the lower (better health) end to register improvement.
Table 2
Minimal important differences (MIDs) derived from different methods for different baseline RMDQ scores

Baseline score  MID(a)  MID(b)  MID(c)  MID(d)  MID(e)  MID(f)  MID(g)  MID(h)  MID(i)
1–3             1       8       7       6       3       1       1       1       1
4–6             2       8       7       6       3       2       2       2       2
7–10            3       8       7       6       3       3       3       3       3
11–13           4       8       7       6       3       4       4       4       3
14–16           5       8       7       6       3       5       5       5       3
17–20           6       8       7       6       3       6       6       6       3
21–23           7       8       7       6       3       7       7       6       3
24              8       8       7       6       3       8       7       6       3

(a) Based on 30% reduction.
(b) Based on RCI.
(c) Based on RCindiv.
(d) Based on 1.96 SEM.
(e) Based on 1 SEM.
(f) Based on lower of 30% reduction and RCI.
(g) Based on lower of 30% reduction and RCindiv.
(h) Based on lower of 30% reduction and 1.96 SEM.
(i) Based on lower of 30% reduction and 1 SEM.
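The whole-point thresholds in Table 2 can be reproduced with a short Python sketch. It rests on two assumptions that are inferred from the table rather than stated in the text: the 30% criterion is taken as the smallest whole number of points that is at least 30% of the baseline score, and each fixed cutoff (7.26, 6.17, 5.14, 2.62) is rounded up to the next whole point. The names `mid_30_percent`, `FIXED_MIDS`, and `mid_lower_of` are hypothetical.

```python
def mid_30_percent(baseline: int) -> int:
    """Smallest whole-point drop that is at least 30% of the baseline score.

    Integer form of ceil(0.3 * baseline), which avoids floating-point rounding.
    """
    return (3 * baseline + 9) // 10


# Fixed cutoffs rounded up to whole RMDQ points (assumption inferred from Table 2).
FIXED_MIDS = {"RCI": 8, "RCindiv": 7, "1.96 SEM": 6, "1 SEM": 3}


def mid_lower_of(baseline: int, method: str) -> int:
    """Lower of the 30% reduction and the fixed cutoff for the given method."""
    return min(mid_30_percent(baseline), FIXED_MIDS[method])


# Print one row per baseline score, mirroring the columns of Table 2.
for baseline in range(1, 25):
    row = [mid_30_percent(baseline)]
    row += [FIXED_MIDS[m] for m in FIXED_MIDS]
    row += [mid_lower_of(baseline, m) for m in FIXED_MIDS]
    print(baseline, row)
```

The combined columns only depart from the plain 30% criterion once the baseline score is high enough for the 30% threshold to exceed the fixed cutoff.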


To assess this, the MID was further defined (Table 2, columns f to i) as the lower of a 30% reduction on RMDQ score and (f) the RCI, (g) the RCindiv, (h) 1.96 SEM, and (i) 1 SEM. Table 2 shows that, as no subject scored 24 at baseline, using the criterion of the lower of 30% reduction and either the RCI or RCindiv gives identical MIDs to those obtained from using 30% reduction on its own. Further, only one more patient would be deemed to have improved using the lower of 30% reduction and 1.96 SEM than using 30% reduction on its own. When the lower of 30% reduction and 1 SEM is taken, Table 2 shows that subjects scoring 7 or over at baseline on the RMDQ will need to reduce their score by 1 SEM (three points) while subjects scoring less than 7 need to obtain a 30% reduction.

4. Stage 2: Comparison of derived MIDs with other back pain-related measures

Stage 2 shows how the MIDs derived in stage 1 compare to changes in other back pain-related measures (objective 2). Specifically, the sensitivity and specificity of these methods compared to other back pain measures obtained from the study questionnaire were calculated. By comparing these anchors (or reference standards) with the derived MIDs, it is possible to identify the most clinically relevant MID obtained from these methods and so obtain the minimal clinically important difference (MCID). The sensitivity shows the percentage of patients improving on the reference standard who changed by at least the derived MID for that method on the RMDQ. One minus the specificity shows the percentage of patients not improved on the reference standard who changed by at least this MID. Both are important to show that the RMDQ has detected patients who have clinically changed, and identifies correctly those patients who have not.
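The sensitivity and specificity definitions above can be sketched as follows. This is an illustrative Python example, not the study's analysis code: the function name and the eight-patient data set are hypothetical, and the function assumes that at least one patient improved and at least one did not improve on the reference standard.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    reference_improved: Sequence[bool],
    classified_improved: Sequence[bool],
) -> Tuple[float, float]:
    """Compare an MID-based classification with a reference standard.

    reference_improved[i]  -- patient i improved on the reference standard
    classified_improved[i] -- patient i changed by at least the MID on the RMDQ
    """
    pairs = list(zip(reference_improved, classified_improved))
    true_pos = sum(1 for ref, cls in pairs if ref and cls)
    false_neg = sum(1 for ref, cls in pairs if ref and not cls)
    true_neg = sum(1 for ref, cls in pairs if not ref and not cls)
    false_pos = sum(1 for ref, cls in pairs if not ref and cls)
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical data for eight patients (not taken from the study).
reference = [True, True, True, True, False, False, False, False]
classified = [True, True, True, False, True, False, False, False]
sens, spec = sensitivity_specificity(reference, classified)
print(f"sensitivity {sens:.0%}, 1 - specificity {1 - spec:.0%}")
```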
The reference standards were bothersomeness of pain, visit to a general practitioner (GP) for back pain in the previous four weeks at 6 months, return to work for those not working at baseline due to their back problem, intensity of pain, and the Physical Functioning and Bodily Pain scales of the SF-36. The GP visit and work status measures are "objective" measures of back pain severity while the others are subjective. The changes in pain intensity and SF-36 scores (baseline minus 6 months) were dichotomized around a reduction on each of the scales of 30% or more from baseline. Thirty percent was chosen for consistency, as this was the percentage obtained from the ROC curve analysis of the RMDQ. It is recognized, however, that 30% may not relate to clinically important change on these instruments.

Table 3 shows the sensitivity and one minus the specificity based on each method of determining the MID of the RMDQ for each of the different reference standards. The percentage of patients who could be regarded as improved ranged from 14% based on the RCI to 51% based on the lower of 30% reduction and 1 SEM. The alternative estimate of the true population percentage improving based

on the RCindiv increases the percentage improved from 18 to 27%. The highest sensitivity for all measures was obtained from the lower of 30% reduction and 1 SEM criterion. For example, 76% of those with less pain intensity at 6 months were classified as clinically improved on this criterion. However, the specificities of this criterion were the lowest (e.g., 74% compared to 97% for the RCI criterion on the pain intensity measure). A 30% reduction on its own and a fall in score of 1 SEM on its own were the next most sensitive methods, although the order of these two varied by measure. The RCI, RCindiv and 1.96 SEM criteria lacked sensitivity but were very specific. The higher MIDs ensure that those who have not improved on the reference standards are not classified as improved on the RMDQ, but at the expense of misclassifying those who have improved on these standards.

5. Stage 3: Incorporation of an anchor

Stage 3 extends stage 2 by assessing the effect of incorporating an anchor in the MID methods applied in stage 2. Several authors have recommended combining an anchor with distribution-based methods to obtain an MCID [1,3,17]. The methods used in stage 2 were reevaluated with the additional requirement that patients had to have rated their back pain health as better at 6 months to show improvement.

The inclusion of the anchor reduced the percentage improved to 13% for the RCI and to 37% for the lower of 30% reduction and 1 SEM criterion (Table 4). As would be expected when the criterion for being classified as improved is more stringent, all methods became less sensitive but more specific than without the anchor. For example, adding the anchor to the combined 30%/1 SEM criterion reduced sensitivity from 78 to 72% for return to work but increased specificity from 75 to 95%. For improvement of one category on the bothersomeness question, sensitivity fell from 75 to 62%, but specificity rose from 76 to 90%.

6. Stage 4: Derivation of a rule for determining clinical improvement for an individual on the RMDQ

Based on stages 2 and 3, a rule can be generated that classifies individuals as definitely clinically improved, possibly improved, and not improved based on the RMDQ (objective 3). This rule takes into account the MID related to measurement error of the RMDQ, allows those starting with less severe disability to show improvement by incorporating an MID dependent on the baseline score of the RMDQ, and includes the patient's own perception of change in their own health to give further evidence of clinical change related to back pain.

Table 4 shows that the anchor plus 30% reduction criterion is only slightly less sensitive, and slightly more specific, than the anchor plus combined 30%/1 SEM criterion. Further, only five patients would lose their "improvement" status using the anchor plus 30% reduction criterion.


Table 3
Comparison to reference standards of methods to derive MID

Reference standard       n    Statistic          RCI  RCindiv  1.96 SEM  1 SEM  30% reduction  Lower of 30% and 1 SEM
Number of subjects       447  % improved         14   18       24        43     47             51
Global rating better     214  sensitivity (%)    27   33       42        64     75             77
Not better               229  1−specificity (%)  3    5        7         24     20             27
Less bothered (a)        231  sensitivity (%)    26   32       41        65     69             75
Not less bothered        211  1−specificity (%)  1    2        5         18     20             24
Less bothered (b)        140  sensitivity (%)    37   44       52        76     72             79
Not less bothered        94   1−specificity (%)  2    3        9         21     15             22
No GP visit              316  sensitivity (%)    17   22       28        49     57             59
GP visit (c)             131  1−specificity (%)  9    11       15        27     21             31
Return to work (d)       36   sensitivity (%)    47   53       58        75     72             78
Not return to work       63   1−specificity (%)  6    8        13        25     13             25
Less pain intensity (e)  222  sensitivity (%)    26   32       41        64     74             76
Not less pain intensity  225  1−specificity (%)  3    4        7         22     20             26
Improved function (f)    160  sensitivity (%)    31   38       47        71     80             81
Not improved function    287  1−specificity (%)  5    8        11        28     28             34
Less body pain (g)       192  sensitivity (%)    30   37       45        67     76             79
Not less body pain       255  1−specificity (%)  3    5        8         25     24             30

(a) Based on improvement of one category.
(b) Based on improvement from extremely or very much to not at all, slightly or moderately; analysis restricted to those rating extremely or very much at baseline.
(c) Visited GP for back pain in last 4 weeks at outcome.
(d) Analysis restricted to those not working at baseline due to low back pain.
(e) Based on improvement of 30% from baseline on pain intensity measure.
(f) Based on improvement of 30% from baseline on SF-36 physical functioning scale; scale inverted so 100 indicates worst health.
(g) Based on improvement of 30% from baseline on SF-36 bodily pain scale; scale inverted so 100 indicates worst health.

Therefore, as it is simpler to determine, 30% reduction forms the basis of the rule. The derived rule classifies patients as definitely improved, possibly improved, and not improved using the anchor of a rating of back pain of better at 6 months plus a 30% reduction on RMDQ score from baseline. The rule is shown in Box 1. Table 5 shows how these three groups are spread across the different measures: 36% had definitely improved and 11% had possibly improved.

In clinical research, researchers wishing to obtain the highest sensitivity could combine the possible improvers with the definite improvers. This would give sensitivities of 69 to 80% on all the measures (except GP visit, 57%) with specificities of at least 72%. However, possible improvers will need to be followed for a longer period of time to see which group they eventually fall into, and it may be preferable to recommend that these be treated as not known in terms of clinical improvement.

7. Stage 5: Application of the rule to a more severe population

The rule derived in stage 4 and shown in Box 1 was applied to a more severe and acute back pain population (n = 319, mean baseline RMDQ 13.5, SD 4.85). Participants in this study had consulted a GP at one of 28 general practices in North Staffordshire, UK for the first or second time in an episode of low back pain that was <12 weeks in duration. Further, they had not received treatment from any other health care professional for this episode of back pain. This population had smaller variation in RMDQ scores at baseline but showed a bigger mean difference on the RMDQ over a shorter time period (3 months, mean difference −7.9; 95% CI −8.6, −7.3).

In this population, 69% of patients were defined as definitely improved, 23% as not improved, and only 8% as possibly improved. However, the rule again led to high sensitivities compared to related measures: at least 80% for those definitely improved, with specificities of around 70 to 80%.

8. Discussion

This study has shown that different methods can lead to different levels of MID. The type of change studied, that is, whether it relates simply to change above the measurement error of the tool or to change observed on another related measure, will alter the important difference found [18,19]. The methods used here are often used to assess the responsiveness of health status instruments, but comparisons between instruments can be affected by the methods used, with different ordering of responsiveness possible depending upon the method [20].

The theoretic qualities of distribution-based methods need to be aligned with the relevance of MIDs to clinicians and to patients. We have united anchor and distribution-based approaches to arrive at a definition of MCID for the RMDQ. The rule could potentially be used in both longitudinal cohort studies and trials. In trials, treatments


Table 4
Comparison to reference standards of methods to derive MID with added criteria of global change rating of "better"

Reference standard       n    Statistic          Better + RCI  Better + RCindiv  Better + 1.96 SEM  Better + 1 SEM  Better + 30% reduction  Better + (lower of 30% and 1 SEM)
Number of subjects (a)   443  % improved         13            16                20                 30              36                      37
Less bothered (b)        229  sensitivity (%)    24            29                36                 53              60                      62
Not less bothered        209  1−specificity (%)  <1            1                 2                  5               10                      10
Less bothered (c)        140  sensitivity (%)    34            39                47                 64              63                      66
Not less bothered        92   1−specificity (%)  1             1                 1                  2               2                       2
No GP visit              314  sensitivity (%)    16            20                25                 39              47                      47
GP visit (d)             129  1−specificity (%)  6             7                 9                  12              10                      12
Return to work (e)       36   sensitivity (%)    47            53                58                 69              69                      72
Not return to work       62   1−specificity (%)  3             3                 3                  5               3                       5
Less pain intensity (f)  220  sensitivity (%)    24            30                37                 55              65                      66
Not less pain intensity  223  1−specificity (%)  2             2                 3                  7               7                       9
Improved function (g)    158  sensitivity (%)    30            35                42                 61              68                      69
Not improved function    285  1−specificity (%)  4             5                 8                  14              18                      20
Less body pain (h)       190  sensitivity (%)    28            34                43                 62              71                      72
Not less body pain       255  1−specificity (%)  2             2                 3                  8               10                      11

(a) Four patients gave no global rating at 6 months.
(b) Based on improvement of one category.
(c) Based on improvement from extremely or very much to not at all, slightly, or moderately; analysis restricted to those rating extremely or very much at baseline.
(d) Visited GP for back pain in last 4 weeks at outcome.
(e) Analysis restricted to those not working at baseline due to low back pain.
(f) Based on improvement of 30% from baseline on pain intensity scale.
(g) Based on improvement of 30% from baseline on SF-36 physical functioning scale; scale inverted so 100 indicates worst health.
(h) Based on improvement of 30% from baseline on SF-36 bodily pain scale; scale inverted so 100 indicates worst health.

could be compared on the proportion of subjects who had improved by the MCID. Use of the rule as an outcome would allow treatments to be compared on their ability to produce a clinically meaningful change believed to be real by the patient.

We have shown that, in our population of primary care attenders for back pain, distribution-based methods do not allow subjects with less severe disease to improve. By using a proportional criterion (30% reduction), the MID-D derived using the 1 SEM is exceeded for the majority of the scale, while those patients with less severe disease can also potentially be graded as improved. High specificity is gained by adding in an anchor of global change. This anchor confirms that a change identified by the criterion is actually believed to be real by the patient.

Another advantage of using a baseline score-dependent rule, whereby patients with more severe baseline disability have to show a greater improvement, is that the impact of regression to the mean is reduced [1]. Regression to the mean is the tendency for those with more extreme scores to regress to the mean at follow-up.

Our defined rule shown in Box 1 allows classification of subjects into definite improvement, nonimprovement, and a small group of subjects who cannot currently be classified. We have shown that the rule relates well to a number of different measures that reflect either the subjective opinion of the patient (e.g., bothersomeness, pain intensity) or more objective characteristics (work status, health care use). The comparison with a number of different anchors means that the problems associated with using just one anchor are overcome. A change score of 5 on the RMDQ has previously been recommended as the smallest change that is important to patients following ROC curve analysis for patients with back pain, after 3–6 weeks treatment, using global change [16] or achievement of treatment goals [15] as the reference standard. A change of 4–5 has also been estimated as the minimal important difference using SEM approaches [7,21]. However, these studies recommended further research into baseline score-specific MCIDs. Beurskens et al. [22] recommended a change of 2.5–5 points for signalling improvement after 5 weeks of treatment, again

Box 1. Recommended rule for defining patients as improved on RMDQ

Definitely improved: Patients rating their back pain as at least better at 6 months and with a reduction of at least 30% on their RMDQ score from baseline

Possibly improved: Patients who have a RMDQ score at least 30% reduced at 6 months but who have not rated their back pain as better (including those with missing information on the global health question)

Not improved: Patients with less than a 30% reduction in RMDQ score at 6 months
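The rule in Box 1 can be expressed as a small Python function. This is a sketch under stated assumptions, not an official implementation: the function names are hypothetical, a missing global rating is represented as `None`, and a baseline score of 0 (for which a percentage reduction is undefined and which Box 1 does not address) is treated here as "not improved".

```python
from typing import Optional


def reduced_by_30_percent(baseline: int, followup: int) -> bool:
    """True if the RMDQ score fell by at least 30% of the baseline score."""
    if baseline <= 0:
        # Assumption: with no disability at baseline, no reduction can be shown.
        return False
    # Integer comparison equivalent to (baseline - followup) / baseline >= 0.30.
    return 10 * (baseline - followup) >= 3 * baseline


def classify(baseline: int, followup: int, rated_better: Optional[bool]) -> str:
    """Apply the Box 1 rule to one patient.

    rated_better -- True if back pain was rated as at least better on the
                    global question, False if not, None if the rating is missing.
    """
    if not reduced_by_30_percent(baseline, followup):
        return "not improved"
    if rated_better:
        return "definitely improved"
    # 30% reduction without a rating of better (or with a missing rating).
    return "possibly improved"


print(classify(10, 7, True))   # 30% reduction and rated better
print(classify(10, 5, None))   # 50% reduction, global rating missing
print(classify(10, 8, True))   # only a 20% reduction
```

Checking the score reduction first mirrors Box 1, where a patient with less than a 30% reduction is not improved whatever their global rating.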

Table 5
Improvement at 6 months based on defined rule of global rating of better and 30% reduction on RMDQ

Reference standard       n    Definite improved (%)  Possible improved (%)  Not improved (%)
Number of subjects       447  36                     11                     53
Less bothered (a)        231  59                     10                     31
Not less bothered        211  9                      11                     80
Less bothered (b)        140  63                     9                      28
Not less bothered        94   2                      13                     85
No GP visit              316  47                     11                     43
GP visit (c)             131  10                     11                     79
Return to work (d)       36   69                     3                      28
Not return to work       63   3                      10                     87
Less pain intensity (e)  222  65                     9                      26
Not less pain intensity  225  7                      12                     80
Improved function (f)    160  68                     13                     20
Not improved function    287  18                     10                     72
Less body pain (g)       192  70                     6                      24
Not less body pain       255  10                     15                     76

(a) Based on improvement of one category.
(b) Based on improvement from extremely or very much to not at all, slightly, or moderately; analysis restricted to those rating extremely or very much at baseline.
(c) Visited GP for back pain in last 4 weeks at outcome.
(d) Analysis restricted to those not working at baseline due to low back pain.
(e) Based on improvement of 30% from baseline on pain intensity scale.
(f) Based on improvement of 30% from baseline on SF-36 physical functioning scale; scale inverted so 100 indicates worst health.
(g) Based on improvement of 30% from baseline on SF-36 bodily pain scale; scale inverted so 100 indicates worst health.

based on a ROC curve using global change as the reference standard. However, it is not known how this estimate compares with other anchors or with MIDs derived from measurement error distribution-based approaches. In a study of sciatica patients, a minimal important difference of 2–3 points was suggested on a modified (0–23) RMDQ based on mean change in subjects rating their sciatica better [23]. By contrast, a study of patients with low back pain suggested that the MID-D for the RMDQ was around 8–9 points (using the RCI) after 6 weeks [24]. A similar RCI-based MID-D was found in our study.

Our study has extended this previous work on the RMDQ by comparing different common approaches to deriving an MID, and by recommending a rule for determining an MCID that is baseline score-specific and that is related to a number of other back pain measures rather than a single measure. An MID of 5, as recommended in previous studies, would seem to be less sensitive to real change than the rule derived from our study population.

The RCI and RCindiv methods have been proposed as part of a two-stage process for identifying MCID, the second stage being how the outcome score compares to that of a “functional or normal population” [13,14]. However, this depends on how this range is defined, and would be difficult to apply to disease-specific instruments. While recommended for their statistical properties [12–14], these two methods have poor sensitivity and, hence, appear too


conservative for classifying patients as improved. The explanation for this is that they only signify improvement if the change is greater than a level that is very unlikely to be due to chance [25,26]. The SEM is also related to the variation of the instrument but is more sensitive. A value of 1 SEM has been shown, for example, to relate well to global ratings of change in chronic respiratory disease and congestive heart failure [26].

Distribution-based methods are often preferred to anchor-based methods as the latter have the limitation of not relating to measurement error. However, the more conservative distribution-based methods (1.96 SEM, RCI and RCindiv) have a major disadvantage in that they do not relate well to other back pain measures. These methods are very specific but have very poor sensitivity, and sensitivity may often be considered the more important attribute. Using these more conservative methods will lead to many patients with actual improvement being rated as not improved. Our rule has been shown to have good sensitivity and specificity when compared to other back pain measures. Further, with the addition of the “better” criterion, even if the change is within the measurement error of the RMDQ according to the more conservative methods, it is at least believed to be real by the patient and, hence, is more likely to be real change.

This study has concentrated on improvement rather than deterioration, as this is more likely to be the outcome of interest in clinical practice and research. Further, it is possible that a certain sized improvement has a different meaning than the same sized deterioration [25]. However, further research could assess whether the same MCID indicates deterioration as well as improvement. We have proposed classification into one of three groups: improved, not improved, and a possible improved group.
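To make the ordering of the distribution-based thresholds discussed above concrete, the following sketch computes them from a baseline standard deviation and a reliability coefficient. It assumes the commonly cited textbook forms (SEM = SD × sqrt(1 − reliability); RCI = change / (sqrt(2) × SEM), judged against 1.96), which may differ in detail from the definitions used elsewhere in this paper, and it does not cover RCindiv. The function names and the input values (SD of 6.0, reliability of 0.75) are hypothetical and are not taken from the study data.

```python
import math


def sem(sd_baseline: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd_baseline * math.sqrt(1.0 - reliability)


def rci(change: float, sem_value: float) -> float:
    """Reliable change index: change divided by the standard error of a
    difference between two scores, sqrt(2) * SEM."""
    return change / (math.sqrt(2.0) * sem_value)


def change_thresholds(sd_baseline: float, reliability: float) -> dict:
    """Smallest change counted as improvement under each criterion."""
    s = sem(sd_baseline, reliability)
    return {
        "1 SEM": s,
        "1.96 SEM": 1.96 * s,
        # Change at which the RCI reaches 1.96.
        "RCI": 1.96 * math.sqrt(2.0) * s,
    }


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    for name, value in change_thresholds(6.0, 0.75).items():
        print(f"{name}: change of at least {value:.2f} points")
```

Whatever inputs are used, the three thresholds are fixed multiples of the SEM (1, 1.96, and about 2.77), so the RCI criterion always demands the largest change. That is consistent with its being the most specific and least sensitive of the three.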
The not improved group will include patients who have deteriorated, and in the future it may be possible to reclassify these as deteriorated, possibly deteriorated, and not changed.

Arbitrary improvements of 30% on the continuous reference standard scales (SF-36, pain intensity) have been used. This level may not reflect clinically significant change on these scales. When agreed levels of MCID have been determined for these scales, then these can be used to help determine MCIDs for other scales.

The rule could now usefully be further tested on shorter or longer outcome periods, and for different health status instruments in different populations. This would include assessment of whether the optimal reduction of 30% in questionnaire scores is consistent across samples, conditions, and instruments. Further, the generality of the rule on age and gender subgroups could be tested in bigger samples.

9. Conclusions

An original approach, using anchor and distribution-based methods, has derived a rule for determining the MCID on the RMDQ. The derived MCID is clinically relevant, and allows subjects from mild to severe disease to


show improvement, while allowing for measurement error in the RMDQ. This rule could be used in clinical trials and longitudinal studies in low back pain as a means of detecting clinical changes in health. Further research is now needed to assess the generalizability of the rule to different low back pain populations, and assess the approach taken to determining the MCID on other measurement instruments in other populations.

Acknowledgments

This work is supported by a research grant from the Wellcome Trust. We are grateful also to the referees for their helpful comments.

References

[1] Crosby RD, Kolotkin RL, Williams GR. Defining clinically meaningful change in health-related quality of life. J Clin Epidemiol 2003;56:395–407.
[2] Norman GR, Stratford P, Regehr G. Methodological problems in the retrospective computation of responsiveness to change: the lesson of Cronbach. J Clin Epidemiol 1997;50:869–79.
[3] Guyatt GH, Osoba D, Wu AW, Wyrwich KW, Norman GR. Methods to explain the clinical significance of health status measures. Mayo Clin Proc 2002;77:371–83.
[4] Roland M, Morris R. A study of the natural history of back pain. Part I: development of a reliable and sensitive measure of disability in low-back pain. Spine 1983;8:141–4.
[5] Grotle M, Brox JI, Vøllestad NK. Functional status and disability questionnaires: what do they assess? A systematic review of back-specific outcome questionnaires. Spine 2004;30:130–40.
[6] Kopec JA, Esdaile JM, Abrahamowicz M, Abenhaim L, Wood-Dauphinee S, Lamping DL, et al. The Quebec Back Pain Disability Scale. Measurement properties. Spine 1995;20:341–52.
[7] Stratford PW, Finch E, Solomon P, Binkley J, Gill C, Moreland J. Using the Roland-Morris Questionnaire to make decisions about individual patients. Physiother Can 1996;48:107–10.
[8] Dunn KM, Jordan K, Croft PR. Does questionnaire structure influence response in postal surveys? J Clin Epidemiol 2003;56:10–6.
[9] Dunn KM, Croft PR. Classification of low back pain in primary care: using “bothersomeness” to identify the most severe patients. Spine 2005;30:1887–92.
[10] Ware JE, Kosinski M, Dewey JE. How to score version two of the SF-36 Health Survey. Lincoln, RI: Quality Metric Incorporated; 2000.
[11] Wolinsky FD, Wan GJ, Tierney WM. Changes in the SF-36 in 12 months in a clinical sample of disadvantaged older adults. Med Care 1998;36:1589–98.
[12] Christensen L, Mendoza JL. A method of assessing change in a single subject: an alteration of the RC Index. Behav Ther 1986;17:305–8.
[13] Jacobson NS, Truax P. Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. J Consult Clin Psychol 1991;59:12–9.
[14] Hageman WJJM, Arrindell WA. Establishing clinically significant change: increment of precision and the distinction between individual and group level of analysis. Behav Res Ther 1999;37:1169–93.
[15] Riddle DL, Stratford PW, Binkley JM. Sensitivity to change of the Roland-Morris Back Pain Questionnaire: part 2. Phys Ther 1998;78:1197–207.
[16] Stratford PW, Binkley JM, Riddle DL, Guyatt GH. Sensitivity to change of the Roland-Morris Back Pain Questionnaire: part 1. Phys Ther 1998;78:1186–96.
[17] Rowbotham MC. What is a “clinically meaningful” reduction in pain? Pain 2001;94:131–2.
[18] Beaton DE, Bombardier C, Katz JN, Wright JG, Wells G, Boers M, et al. Looking for important change/differences in studies of responsiveness. OMERACT MCID Working Group. Outcome measures in rheumatology. Minimal clinically important difference. J Rheumatol 2001;28:400–5.
[19] Bombardier C, Hayden J, Beaton DE. Minimal clinically important difference. Low back pain: outcome measures. J Rheumatol 2001;28:431–8.
[20] Wright JG, Young NL. A comparison of different indices of responsiveness. J Clin Epidemiol 1997;50:239–46.
[21] Stratford PW, Binkley J, Solomon P, Finch E, Gill C, Moreland J. Defining the minimum level of detectable change for the Roland-Morris questionnaire. Phys Ther 1996;76:359–65.
[22] Beurskens AJHM, de Vet HCW, Koke AJA. Responsiveness of functional status in low back pain: a comparison of different instruments. Pain 1996;65:71–6.
[23] Patrick DL, Deyo RA, Atlas SJ, Singer DE, Chapin A, Keller RB. Assessing health-related quality of life in patients with sciatica. Spine 1995;20:1899–909.
[24] Davidson M, Keating JL. A comparison of five low back disability questionnaires: reliability and responsiveness. Phys Ther 2002;82:8–24.
[25] Cella D, Bullinger M, Scott C, Barofsky I. Group vs individual approaches to understanding the clinical significance of differences or changes in quality of life. Mayo Clin Proc 2002;77:384–92.
[26] Wyrwich KW. Minimal important difference thresholds and the standard error of measurement: is there a connection? J Biopharm Stat 2004;14:97–110.