Journal of Clinical Epidemiology 54 (2001) 1204–1217
A taxonomy for responsiveness

D.E. Beaton (a,b,c,d,l,*), C. Bombardier (a,c,d,e,m,n), J.N. Katz (c,h,i,j), J.G. Wright (c,d,f,g,k)

a Institute for Work & Health, Toronto, Canada
b Department of Occupational Therapy, University of Toronto, Toronto, Canada
c Institute of Medical Sciences, University of Toronto, Toronto, Canada
d Clinical Epidemiology Health Services Research Program, University of Toronto, Toronto, Canada
e Department of Medicine, University of Toronto, Toronto, Canada
f Department of Surgery, University of Toronto, Toronto, Canada
g Department of Public Health Sciences, University of Toronto, Toronto, Canada
h Brigham and Women’s Hospital, Boston, MA, USA
i Harvard Medical School, Boston, MA, USA
j Robert Brigham Multipurpose Arthritis and Musculoskeletal Diseases Center, Boston, MA, USA
k The Hospital for Sick Children, Toronto, Ontario, Canada
l St. Michael’s Hospital, Toronto, Ontario, Canada
m Mt Sinai Hospital, Toronto, Ontario, Canada
n The University Health Network, Toronto General Hospital, Toronto, Ontario, Canada

* Corresponding author: Dorcas Beaton, B.ScOT, PhD, Institute for Work & Health, 250 Bloor St East, Ste 702, Toronto, Ontario M4W 1E6, Canada. Tel.: 416-927-2027; fax: 416-927-4167. E-mail: [email protected].

Received 29 November 1999; received in revised form 7 June 2001; accepted 7 June 2001
Abstract

Responsiveness is quickly becoming a critical criterion for the selection of outcome measures in studies of treatment effectiveness, economic appraisals, and other program evaluations. Statistical characteristics, specifically “large effect sizes,” are often felt to indicate the relative worth of one instrument over another. However, debates about their meaning led the present authors to propose a taxonomy for responsiveness based on the context of the study concerned. The three axes underlying the classification system relate to: whom the analysis is for (individuals or groups); which scores are being contrasted (over time or at one point in time); and the type of change being quantified (for example, observed change or important change). It is concluded that responsiveness should be considered a highly contextualized attribute of an instrument rather than a static property, and should be described only in that way. A questionnaire could thus be described as being “responsive to” a given category in the new taxonomy. © 2001 Elsevier Science Inc. All rights reserved.

Keywords: Responsiveness; Effect size statistic; Treatment effectiveness
1. Introduction

Clinicians often use indexes, instruments, or questionnaires to evaluate their patients over time, and they must select which one(s) to use. The most appropriate measure must be sensible, reliable, valid and (if used to evaluate change over time) it must also be “responsive” [1,2]. Responsiveness is defined as the ability of an instrument to accurately detect change when it has occurred [3,4] and is usually quantified by a statistical or numeric score, such as an effect size statistic [4–9] or a standardized response mean.
The question arises, however, of whether such scores provide clinicians with enough information about the usefulness of an instrument in its intended application. In many ways, the interpretation of statistics of responsiveness is analogous to the interpretation of P-values used in treatment trials, where too much emphasis is often placed on the magnitude of the numeric estimate (such as a P-value of 0.05 or less), with too little attention paid to the nature and meaning of the change being quantified. Which patients are being compared? How long was the follow-up? Which treatments are involved? The answers to these questions provide the context of the trial and are essential for interpreting the results (the P-value). Similarly, with responsiveness, the magnitude of the effect size statistic alone is unlikely to provide enough information to aid in its interpretation. The context of the study of responsiveness (i.e., the nature of the
change that the study is set up to measure) must be considered before interpreting the magnitude. What had the patients experienced to be considered “changed” in the study of responsiveness? Which change scores are being quantified: the change in the treatment group relative to control, or the change in the treatment group alone? Is the change quantified the change in one patient or the change in a group of patients? To date, attention has focused primarily on finding measures with the largest responsiveness statistics or determining if a measure can produce a “large” effect size statistic (greater than 0.80, according to Cohen [10]) and is therefore “responsive.” But, as with the P-value, these statistics on their own lack any context and therefore often lead to the assumption that responsiveness, once established, is a static, context-free attribute of the questionnaire—an assumption we feel has fueled the numerous discussions about interpretation of these statistics. It is argued here that discussing the context of the measurement of responsiveness (i.e., the nature of the change that the study is set up to measure) rather than the magnitude of the statistic is more likely to advance the already protracted debate [9,11–17] over the interpretability [18] of responsiveness statistics. This approach looks deeper than the statistic’s numeric value to “what” is actually being quantified. As Michell suggests, “protracted controversy suggests that the disagreement lies much deeper than the arguments hitherto presented imply” [19, p. 398]. The debate over the interpretation of responsiveness statistics is not just of methodological or academic interest, but has direct implications for how we assess patients and how we decide if treatments have truly made them better. Without a clear understanding of responsiveness statistics, a meaningless change could be misconstrued as clinically significant when it is merely statistically significant [20,21]; alternatively, a small gain in mobility might be statistically insignificant [22,23] but be considered by the patient as a very important improvement. As Jenkinson [24] suggests, the results of health status measures could be not simply misleading, but actually harmful. We suggest that there are three core topics in the debates over responsiveness, ordered here according to the frequency with which we found them in the literature: first, the interpretation of the statistic (i.e., is the change relevant or important?) [11,13,21,25–33]; second, methodological issues (i.e., how should studies of responsiveness be designed [7,17,29,31,33] and how should the property be quantified and analyzed? [4,7,9,17,27,34–36]); and finally, conceptual and definitional issues (i.e., how is responsiveness conceptualized? what is being quantified? [17,36,37]). Although the conceptual and definitional issues have received the least attention, they may be the most fundamental and are probably critical to making sense of the data (interpretability). It is these issues that are the focus of this article. The literature contains many definitions of responsiveness (Table 1), and the differences between them are instructive. Most authors agree that responsiveness involves the ability of a measure to detect change, but there are wide variations in opinion about the nature of the change that is being
detected. For example, in 1997 Guyatt et al. [38] defined responsiveness as “the ability to detect change, specifically, important change, in the way patients are feeling, even if those changes are small,” thus focusing on individual feelings and fine discrimination. In 1994, Testa and Simonson took a broader view, defining responsiveness as “the ability to detect meaningful treatment effects” [22], whereas Anderson and Chernoff in 1993 [39] said it was the “ability to detect important changes in disease activity over time” (emphasis added). Differences between the definitions are critical, as they each reflect distinct types of change being quantified in a given analysis of responsiveness, and thus distinct categories of responsiveness. We suggest that many parts of the conceptual and definitional debate could be resolved by allowing these to stand as distinct types of responsiveness, each depending on the nature of the change described within the study. Earlier we used deBruin et al.’s [3] definition of responsiveness. This was chosen as our operational definition because it did not specify the nature of the change but did specify that there needed to be some determination that the change had occurred. This definition encompasses all the types of change specified in the others. The purpose of the present article is to propose a classification system that reflects the context of the measurement of responsiveness, where the context is defined by specific attributes of the change designed into a study of responsiveness. This suggests that different categories or types of responsiveness can be defined in terms of the attributes of the particular change being quantified.
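Before turning to methods, it may help to make the statistics at issue concrete. The following sketch, with hypothetical scores and function names of our choosing, illustrates two of the responsiveness statistics mentioned above: the effect size of Kazis et al. [6] (mean change divided by the standard deviation of baseline scores) and the standardized response mean (mean change divided by the standard deviation of the change scores).

```python
# A minimal sketch of two group-level responsiveness statistics.
# All scores below are hypothetical, for illustration only.
import statistics

def effect_size(baseline, follow_up):
    """Mean change divided by the standard deviation of baseline scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return statistics.mean(changes) / statistics.stdev(baseline)

def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the standard deviation of the change scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return statistics.mean(changes) / statistics.stdev(changes)

pre = [40, 55, 60, 45, 50, 65, 35, 70]    # hypothetical baseline scores
post = [55, 60, 75, 50, 70, 80, 50, 85]   # hypothetical follow-up scores
print(f"ES  = {effect_size(pre, post):.2f}")
print(f"SRM = {standardized_response_mean(pre, post):.2f}")
```

As the article argues, the magnitudes these statistics produce cannot be interpreted without knowing what change they were computed over.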
2. Methods

2.1. Literature review

Articles discussing the theory of responsiveness were gathered. The initial source was the personal files of the authors. In addition, Murawski and Miederhoff [9] (who published a review of the literature on responsiveness up to 1994) shared their list of 324 references and their search strategy. Their approach was to seek all articles using health status measures through an electronic search and, through a hand-review of the more than 20,000 abstracts, to find those which dealt with longitudinal data collection. The result was the 324 articles they shared with us. Articles after 1994 were obtained by electronic searches of Medline and CINAHL. As shown in Fig. 1, two approaches were used. First, 238 articles were identified that had text words of responsiveness and of health. Second, 3520 articles were identified that addressed the measurement properties of questionnaires/health status indicators; 1024 of these also had text words of health and were in English. The 238 and 1024 articles were combined and, together with Murawski’s review (n = 324), provided a pool of 1534 potential references on responsiveness. This non-systematic review became our source for information on responsiveness. Additional articles were found from the reference lists, and from reviewing
Table 1
Examples of the varying definitions for the concept of responsiveness as reported in different articles in the literature. Author, year [reference], followed by the definition of responsiveness (direct quotes from articles).

DeBruin et al., 1997 [3]: Accurate detection of change when it has occurred
Guyatt et al., 1987 [1]: Ability to detect minimal clinically important differences
Guyatt et al., 1997 [38]: Ability to detect change, specifically important changes in the way patients are feeling, even if those changes are small
Katz et al., 1992 [8]: Sensitivity to clinical change
Irvine et al., 1996 [120]: Sensitivity to detect important changes in clinical status
Hays et al., 1993 [2]: Ability of a measure to reflect underlying change
Wright and Young, 1998 [4]: Ability of an index to measure clinical change
Liang, 1995 [28]: Ability of an instrument to measure a clinically meaningful or important change in a clinical state
Murawski and Miederhoff, 1998 [9]: Ability to detect change over time
Pfennings et al., 1995 [107]: A property of an instrument for detecting clinically important changes in the status of the patient
Siu et al., 1993 [121]: Validity in longitudinal analysis focusing on within-person changes
Jenkinson, 1995 [24]: Ability to detect change
Testa, 1996 [36]: Measure of the association between the change in the observed score and the true value construct
Testa and Simonson, 1994 [22]: Ability to detect meaningful treatment effects
Anderson and Chernoff, 1993 [39]: Ability to detect important clinical change in disease activity over time
Jacobs et al., 1996 [70]: That an instrument can detect clinically relevant changes in health status over time or clinically relevant treatment effects
related articles (via the Internet Grateful Med database) of the key references.

2.2. Building a taxonomy of responsiveness

Using articles from the literature review, we identified key issues in the debate over the study of responsiveness or the interpretation of such studies. Examples include the disagreement between Wright and Redelmeier [13,33,40], which raised issues of cross-sectional versus longitudinal design and of the perspective bearing on importance, and Lydick and Epstein’s contrast of anchor-based versus distribution-based descriptions of responsiveness [41]. The issues were carefully considered and other articles addressing each were sought. We then considered the uniqueness of each issue,
Fig. 1. Diagram showing how articles were retrieved from Medline and CINAHL. This provided the basis for finding methodologically focused articles. Additional references were found in the reference lists of pertinent articles.
and tried to construct mutually exclusive categories for studies of responsiveness. The structure of the taxonomy of studies of responsiveness was based on the guidelines described by Buchbinder et al. [42] and Feinstein [43]. A taxonomy should: have mutually exclusive yet exhaustive categories; use multiple axes if necessary; be simple to apply; and be easy to interpret. Ultimately, the goal was to have a system that would facilitate accurate communication [44] about the context of measuring responsiveness.
3. Results

3.1. Building a taxonomy of responsiveness

Several articles concerning responsiveness have discussed different aspects of the nature of the change being studied and how they relate to interpretation of the resultant statistics. Three groups of articles were identified: those that discussed individual-level versus group-level assessment of change [13,18,35,41,45–47]; those that considered the contrast of between-person difference versus within-person change [13,14,34,40,48–50]; and those that addressed different types of change (i.e., the importance of the change, and from whose perspective) [7,8,17,27,29,30,33,37,40,45,48,50–55]. These groupings became the defining axes of the system. All possible subheadings (based on examples from the literature) within each of the three axes were then defined, to form a 2 × 3 × 5 matrix of 30 cells, each representing a category of change. Although not every cell had a supporting example from the literature, it was usually possible to hypothesize a study that could be designed to measure responsiveness to that type of change. The three axes were labeled Who (Who are the results of the study presented for? An individual or a group?), Which (Which scores are being contrasted? i.e., over time or at one point in time?) and What (What kind or amount of change is being estimated?). There is no hierarchical relationship between the axes.
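The triaxial structure can be made concrete as a small data structure. The following sketch, with names of our own choosing (not part of the original system), enumerates the 2 × 3 × 5 = 30 cells and classifies one of the studies discussed later (Table 2).

```python
# A minimal sketch of the proposed triaxial taxonomy. Class and member
# names are hypothetical labels chosen for illustration.
from dataclasses import dataclass
from enum import Enum
from itertools import product

class Who(Enum):    # Axis 1: who are the results presented for?
    INDIVIDUAL = "individual"
    GROUP = "group"

class Which(Enum):  # Axis 2: which scores are contrasted?
    WITHIN_PERSON = "within-person change over time"
    BETWEEN_PERSON = "between-person difference at one point in time"
    BOTH = "between-person difference of within-person change"

class What(Enum):   # Axis 3: what kind of change is quantified?
    MINIMUM_POTENTIAL = "minimum change potentially detectable"
    MINIMALLY_DETECTABLE = "minimum change detectable beyond error (MDC)"
    OBSERVED = "observed change"
    ESTIMATED = "observed change in those deemed to have changed"
    IMPORTANT = "observed change in those deemed importantly changed"

@dataclass(frozen=True)
class ResponsivenessCategory:
    who: Who
    which: Which
    what: What

# All 30 cells of the taxonomy:
cells = [ResponsivenessCategory(*c) for c in product(Who, Which, What)]
assert len(cells) == 2 * 3 * 5

# Example classification (see Table 2): Laupacis et al. [53]
laupacis = ResponsivenessCategory(Who.GROUP, Which.WITHIN_PERSON, What.OBSERVED)
```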
3.2. The Who axis

The Who axis considers whether information from the study under consideration will be interpreted at an individual level (a change in the individual patient) or at a group level (the average amount of change for a group) (see Fig. 2). Most summaries of responsiveness use parametric (group-level) summaries based on the distribution of baseline and change scores (means, standard deviations) [4,34,49,56]. However, as Williams and Naylor [57, p. 1347] point out, “probability and chance are not easily related to applications on the individual level.” Therefore, change that has meaning for groups may not have meaning for individuals [58]. Interpreting individual-level versus group-level change requires the consideration of two issues. The first issue is the perceived meaning of change for individuals versus groups. Research has shown that when experts are shown profiles of change for individuals and for groups, they are willing to accept a much smaller amount of change as being “important” for group comparisons than for comparisons of individuals [58–60]. For example, as Deyo and Patrick [61] pointed out, a 5-mm Hg change in diastolic blood pressure is likely to be considered insignificant at an individual level, but significant if it is the average change for a group of patients. Because measurement involves estimating an attribute within a certain margin of error, single, individual measures have a wide margin of error. This may partially explain why we have less confidence in interpreting the meaning of change scores for single patients [60]. In groups of patients, the increased sample size reduces the margin of error, and thus we can be more certain that the mean change is close to the true change for that group, and have more confidence in treating small changes as important in groups. The second, related, issue in interpreting individual and group levels is the degree to which findings can be transferred between the two and retain their meaning. Clinicians often are faced with reports of group-level results and need to apply this information to their individual patients. Using group-level estimates of change (such as average change or median change) as a guide to when an individual has improved is a challenge, not only because of the different standards that are at play, but also because any cut point will have some degree of misclassification error [35]. Mean change is an
estimate of a central tendency in a whole distribution of change scores, where each data point represents a valid experience of change (even of important change) for an individual patient. Accepting the mean as the indicator that an individual has improved would mean that several people who may have considered themselves improved would be erroneously considered unchanged [24,62,63] just because their numeric change score was below the cut point (the average). Improved measurement techniques will lead to greater precision in differentiating measurement noise from change (signal) [46,57,64], but will not help with where on the distribution of change scores the cut-off should be made. Redelmeier and Lorig [65] suggest using the lower bound of the confidence limit of the change scores as a threshold for improvement, in order to be sensitive to 95% of individuals’ change scores. Goldsmith et al. [59] considered the 25th percentile in the distribution of change scores as the threshold for the same reason. Both point to the possible problems with using a mean change score as representative of the distribution of change scores. Similar dilemmas face the establishment of nutrient requirements for the general population. A cut-off at any given point on the distribution of individual requirements will misclassify some of the sample [66], and consequently set a requirement level that may result in a proportion of the population being either deficient or exposed to excessive amounts of a nutrient. The decision about where to set the recommended requirement level must be based on an estimation of the risks of under- or over-estimating the requirement. The aim is to set a level that has the greatest meaning (greatest likelihood of covering the nutrient requirements), while minimizing the risk of misclassification (over- or under-estimation). The same is true of attempts to determine a threshold for meaningful change from a distribution of change scores. Clinicians routinely faced with the challenge of applying the results of a clinical trial or program evaluation to an individual patient find relatively few guidelines in the literature [46] on how to do it [67]. Similarly, they may be faced with trying to relate the results of studies of responsiveness to whether their individual patient should be considered better. This axis emphasizes that the results of a study of responsiveness can be provided in a manner that makes them useful at an individual level (e.g., a threshold for important change) or in a manner useful for group-level interpretation (e.g., a mean change score), and it holds the two as distinct.
Fig. 2. Axis 1: Individual versus group.
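The alternatives to the mean described above can be sketched as follows. The data and function names are hypothetical, and the exact formulation of Redelmeier and Lorig’s lower bound varies in the literature; a normal approximation covering roughly 95% of individual change scores is assumed here.

```python
# A minimal sketch of candidate individual-level thresholds derived from a
# distribution of within-person change scores (hypothetical data).
import statistics

def candidate_thresholds(change_scores):
    """Mean change plus two alternative thresholds discussed in the text."""
    mean = statistics.mean(change_scores)
    sd = statistics.stdev(change_scores)
    # One reading of Redelmeier and Lorig's suggestion [65]: a lower bound
    # below which only ~5% of individual change scores fall (normal approx.).
    lower_bound = mean - 1.645 * sd
    # Goldsmith et al.'s alternative [59]: 25th percentile of change scores.
    p25 = statistics.quantiles(change_scores, n=4)[0]
    return mean, lower_bound, p25

changes = [2, 5, 7, 3, 9, 4, 6, 8, 1, 5]  # hypothetical change scores
mean, lower, p25 = candidate_thresholds(changes)
print(f"mean = {mean:.1f}; lower bound = {lower:.1f}; 25th percentile = {p25:.1f}")
```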
3.3. The Which axis

The Which axis refers to the timing of the data collection that is to be used in the study of responsiveness. Questionnaire scores can be compared using data gathered at one point in time but from different people (between-person differences) [40,45,65] (see Fig. 3). Alternatively, information can be gathered over time in order to assess how a patient changes over time (within-person change) [22,36].
Fig. 3. Axis 2: Groups being compared.
A combination of so-called between- and within-person change is used in randomized controlled trials, where the between-person (group) difference of within-person change is analyzed. Conceptually, within-person change and between-person differences have long been held to be very different, with only the former being considered relevant for discussions of responsiveness, and readers may be surprised to find between-person differences in a discussion of responsiveness. However, our approach was to sift through what was being described in the literature, and indeed we found studies in the literature review that used between-person differences to give estimates for longitudinal change [65,68]. Burke and Nesselroade [69] describe change as “any variation in the quantity or quality of an entity’s attributes.” From this perspective, both between- and within-person change could be used to estimate a measure’s sensitivity to change; however, this does not mean they are interchangeable. Both are included in our taxonomy for two reasons: first, because the same statistics can be applied to each, giving “effect sizes” for each, and thus the difference could be overlooked if not highlighted. Second, advocates of this approach suggest that within-person change and between-person differences yield similar results (having observed similar magnitudes of change using either approach) [13,40,65]. These investigators also suggest that estimates of effect size or minimally clinically important differences observed in the analysis of “between-person differences” could be used to guide the estimation of minimally clinically important differences in the “within-person change” situation [13,45,65]. We would suggest, however, that they are conceptually different types of change, the magnitudes of which could vary a great deal when comparing “between-person differences” and “within-person change” [13,33,36,70]. Hemingway et al. [71] measured and compared both “between-person difference” scores and “within-person change” scores using a 1-year difference in age as a comparable external marker for people in different health states (a difference of 1 year between people, or a 1-year longitudinal follow-up for within-person change). They found that the estimates were not the same, and that between-person differences systematically under-estimated within-person change [71]. This supports the concept that within-person change does not necessarily equate to between-person differences; hence the two are held as different in our taxonomy.
Readers of studies of responsiveness should be concerned when between-person differences are mistakenly substituted as meaningful surrogates of within-person change. Although arguments could be made for excluding between-person differences from the taxonomy entirely, the stratum for between-person differences is kept in the taxonomy in order to provide a place to locate work such as that described above, which appears in the literature under the rubric of “responsiveness.” This allows the taxonomy to be comprehensive, but also holds these studies as different from others, such as those of within-person change. The final category in the second axis is a combination of the first two and is called the “both” or “hybrid” category: between-group differences in within-person change. For example, in clinical trials [34,49] the focus is on between-person differences (treated groups versus control groups) of within-person change over time (measuring improvement over time). The change is no longer what we have labeled an observable change in a single-arm cohort sense, but rather a relative one—the change in health in the treatment group relative to the control group. Thus, it is a combination of both within-person and between-person types of change.

3.4. The “What” axis

The third axis distinguishes the different kinds of change that may be quantified in a study of responsiveness: for example, change observed before and after treatment, and change that was deemed to be clinically relevant or important [4,21–23,25,26,32,45,61,72] (see Fig. 4). There are five subheadings along this axis.

3.4.1. Minimum change potentially detectable by the instrument
The minimum change is the smallest amount of change that the instrument is capable of measuring on an individual level and is dependent on the construction of the instrument (i.e., the number of items and the number of response categories). For example, the DASH questionnaire [73–75] has 30 items, each with five response options (1–5), giving a possible score ranging from 30 to 150 (a 120-point range). A change of one response level on just one item (1/120, or 0.83%) would be the minimal amount of change possible. In another scale, the Roland-Morris Disability Questionnaire [76], with 24 yes/no items, the minimum change would be 1/24, or 4.2%. For a given individual, a change score less than this amount would be impossible. However, at a group level the average change could be lower than this minimum potentially detectable change. As McDowell and Jenkinson [21, p. 241] state, interpretation should be content-based. An average change score that is less than the minimum possible for an individual patient should be seriously scrutinized to determine if it can be interpreted in a group-level analysis. This is the only type of change that is stable in magnitude across the coordinates on the other two axes, because it is an attribute of the questionnaire (its items and scaling) and not of the study design.
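This calculation can be sketched for the two instruments just mentioned; the function name is ours, and the scale parameters are taken from the text.

```python
# A minimal sketch of the "minimum change potentially detectable" implied
# by an instrument's structure (number of items and response levels).
def minimum_potential_change(n_items, levels_per_item):
    """Smallest individual-level change as a fraction of the score range."""
    score_range = n_items * (levels_per_item - 1)
    return 1 / score_range

print(f"DASH (30 items, 5 levels): {minimum_potential_change(30, 5):.2%}")      # ~0.83%
print(f"Roland-Morris (24 yes/no items): {minimum_potential_change(24, 2):.1%}")  # ~4.2%
```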
Fig. 4. Axis 3: Type of change being quantified in comparison.
3.4.2. Minimum change detectable given the measurement error of the instrument (“minimally detectable change”)
The second type of change is defined by the error associated with the measurement. A measured change reflects true change plus error. When the error is great, wider confidence intervals apply to the observed change score; the “true” change could be anywhere within the given range. The upper limit of this confidence interval, when placed around a change score of zero (no change), helps to define another boundary of meaningful change. Theoretically, change scores greater than this upper limit would have less than a 5% chance of being due to chance (error) alone [77] and could therefore be confidently considered as true improvement (or deterioration). Following this logic, the minimally detectable change (MDC) also constitutes the lower boundary of a potentially meaningful change. Similar boundaries for change have been described in the literature. Radiographic changes are often judged by their magnitude compared to the smallest detectable change (SDC), calculated in the same fashion [78,79]. Bland and Altman describe a parallel concept, the limits of agreement, which set a 95% confidence interval around the mean change in a stable (unchanged) group [80]. The application to health status measures is usually found under the rubric of the reliability change index (RCI), based on the work of Christensen et al. [77], Jacobson et al. [81,82], Ottenbacher [83,84] and Stratford et al. [85,86]. The RCI is defined as the change score divided by the standard error of measurement (SEM) of the change. If the SEM is not known,
it can be estimated using information on reliability as:

SEM(change) = baseline standard deviation × √(2 × (1 − r_xx))

where r_xx is the test–retest reliability (although some use the internal consistency rather than test–retest reliability [87]). There is some variability in the calculation of this statistic, specifically over which reliability coefficient to put in the equation and whether the multiplier of two (under the square root) should be included. Christensen and Mendoza [77], McHorney and Tarlov [67] and Stratford et al. [88] clearly describe two distinct goals for this type of work: first, to get an estimate of the error around a single observed score (in which case Cronbach’s alpha would be used), and second, to get an estimate of the precision of change scores (in which case test–retest reliability coefficients would be used). Nunnally and Bernstein [64] suggest that in very large samples (n ≈ 300) the two reliability coefficients would converge. However, health studies often have smaller samples, so the appropriate statistic should be chosen; it would seem most sensible to use the reliability statistic related to the intended purpose [67]. The second point of disagreement concerns the need to adjust the calculation of the RCI when looking at data gathered from two samples (test and retest) versus one. It has been suggested that an adjustment should be made to account for the increased error of calculating the coefficient from two data sets (test and retest) [67,77,88]. Indeed, earlier work by Jacobson had not described this multiplier [89], but more recent work has shown the correction [90]. Wyrwich et al.’s work does not make this adjustment, and therefore their estimates of the MDC will be lower by a factor of √2, or 1.41 [87,91]. Given that our interest is in the precision of change scores, the equation of choice is the one described above, using the test–retest reliability coefficient and the multiplier to correct for the sampling error. The reliability change index is therefore an expression of the change score in standard error units, much like a z score. Setting this equal to 1.96 (the value on a standard normal curve associated with a 95% confidence interval) and then solving for the change score in the numerator gives the minimum change score considered significantly different from no change at all (at P < 0.05) on that instrument in that application. Stratford et al. [85,86] used the term “minimally detectable change” (MDC) for this value. Accordingly, change above this level is confidently considered (at a 95% confidence level) greater than measurement error and therefore likely a true change. Although some describe this measure as sample independent [87], we felt that the minimally detectable change will vary with settings, populations, raters, etc., because reliability and variance can change in different samples. As a standard for a lower boundary on relevant change for an individual patient, it is usually quite large. Stratford et al. [86] reported a 90% minimally detectable change of four (out of a possible maximum of 24) Roland-Morris points, or roughly 16%. Some commentators suggest that this may be an overestimate of the threshold for a meaningful change [3]. Others have suggested that setting the RCI to one (which means the related change score will equal the SEM) produces a change score that could be considered the minimal threshold for meaningful change [87,91]. Despite the variability in both calculation and interpretation, we put forth the concept of minimally detectable change as a data-driven approach to aid in the interpretation of change scores [92], and acknowledge the various approaches being used (varying the level of confidence, for instance). We recommend the use of a subscript to identify the level of confidence chosen: for example, MDC67 if the RCI is set to one, as Wyrwich et al. do [87,91]; MDC90 for Stratford’s work at the 90% confidence level [85]; or MDC95 if the confidence level is set to 95% [88].
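The calculation can be sketched as follows. The baseline standard deviation and reliability below are hypothetical values, and the function names are ours; the formulas follow the equation given above, with the confidence multiplier made explicit so that MDC95, MDC90, and MDC67 can be compared.

```python
# A minimal sketch of the SEM and MDC calculations described in the text.
import math

def sem_change(sd_baseline, r_test_retest):
    """SEM of a change score, with the sqrt(2) two-administration correction."""
    return sd_baseline * math.sqrt(2 * (1 - r_test_retest))

def mdc(sd_baseline, r_test_retest, z=1.96):
    """Minimally detectable change at the confidence level implied by z.

    z = 1.96 -> MDC95; z = 1.645 -> MDC90; z = 1.0 -> MDC67 (RCI set to one).
    """
    return z * sem_change(sd_baseline, r_test_retest)

# Hypothetical example: baseline SD of 5 points, test-retest r = 0.90
sd, r = 5.0, 0.90
print(f"SEM(change) = {sem_change(sd, r):.2f} points")
print(f"MDC95 = {mdc(sd, r):.2f}, MDC90 = {mdc(sd, r, 1.645):.2f}, "
      f"MDC67 = {mdc(sd, r, 1.0):.2f}")
```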
3.4.3. Change observed between measurements in a given population (“observed change”)
Observed change is simply the change in health status that is measured by gathering data on two different occasions, such as before and after a treatment. Treatments of “known efficacy” are often used when testing a measure’s responsiveness because change is expected. For example, because most patients improve in function after a total joint arthroplasty [52,93] or after the onset of uncomplicated, acute low-back pain [7], such patient populations have been used to test the responsiveness of an outcome measure. Observed change can be described for the individual person or for the group. Perhaps its most common and familiar use is in randomized controlled trials, where it is measured and compared between treatment and control groups. It should be noted that observed change includes individuals who get better as well as those who do not.

3.4.4. Observed change measured by the instrument in a population deemed to have improved (“estimated change”)
With “estimated change,” study subjects are stratified into those who improved and those who did not, according to some external standard. Responsiveness is then assessed (i.e., responsiveness statistics are calculated) using, for example, only data from those deemed to have improved according to this external standard. Other studies contrast change scores in those improved and those not improved [54,56,94]. Deyo and Inui [54] coined the term “clinically estimated change” to describe a study of responsiveness in which they used a combination of clinician and patient ratings of improvement (“yes, this patient is better”) to define the sample for the analysis. MacKenzie et al. [55] used a similar approach in studying patients’ ratings of whether or not they had improved. The determination of whether a change has occurred can come from different perspectives, for example that of the clinician versus that of the patient. The perspectives taken are used here to define four subcategories of estimated change. Patients can estimate their own change (by answering “yes, I am much better” when asked). Alternatively, clinicians can estimate change in their patients by asking themselves “Is this patient better?”, perhaps using a variety of criteria (such as test
results, assessment findings, and general impression, etc.). Payers may also be able to estimate when a change has occurred: for instance, change has occurred when a person returns to work, or when a wage-replacement claim is closed. It is suggested here that society can also be used to estimate the presence or absence of change in a person’s health. In this context, by society we mean those aspects outside of the patient–health services interaction, including institutions, communities, culture, government policy, workplace organizations, etc. For example, at a societal level, patients can be considered to have improved from surgery when they have resumed their social or work roles [30]. Estimated change is therefore divided into four subcategories, one for each of the four different perspectives.

3.4.5. Observed change measured by an instrument in a population deemed to have had an important improvement or deterioration (“important change”)
Important change is estimated change that is seen to be valued or important, and it becomes a criterion for stratifying a sample prior to analysis. Subjects who undergo important change (usually separating improvement from deterioration) would be used in the analysis of this type of responsiveness. The distinction between estimated and important change is critical, because many studies claim to be measuring the latter when, in fact, they are usually measuring estimated change (as defined above). Furthermore, it is often assumed that the magnitude and importance of change are correlated, but they are not the same. For instance, some research suggests that the threshold for change to be described as “important” varies according to the severity of the condition at the time of the baseline assessment (test 1) [45,95]. Patients can be asked to discern not only whether they have experienced change, but also whether that change was important to them in their lives [29]. Such highly individualized appraisal could lead to very different experiences of change being considered important. For instance, an indication that pain is reduced is estimated change; it might only become important change when the reduction is sufficient for patients to forget about their arm problem or to do what they want or need to do [96]. For someone else, a reduction in pain could be experienced (estimated change), but nevertheless be entirely unimportant to that individual. Increased variance in change scores could be expected in this type of responsiveness, as the definition of “important change” is so highly individualized. As was the case with estimated change, the perspective also has a bearing on when change is considered important. Clinicians or researchers may determine whether change is important by relating change scores to some “criterion”: clinical findings such as lab work or observed findings, experience, or a consolidation of findings from past clinical trials [29,97]. Juniper et al. [48] and Jaeschke et al. [50] had patients estimate the magnitude of change on a 15-point ordinal scale (−7 to +7). The researchers then set the cut point of what was considered to be “minimally clinically important”
change (i.e., more than 1 in the case of Juniper et al. [48], and 3 for Jaeschke et al.’s work [50]). Although this work is considered seminal in the determination of minimally clinically important differences, there are two concerns relevant to the taxonomy. First, it mixes perspectives, in that the patient estimates the change but the investigators decide on its importance. Second, the selection of a cut point on the external marker has a great deal of influence on the estimation of responsiveness, and various cut points have been used by people using this methodology [37,48,50,98,99]. Different cut points will result in smaller or greater amounts of change being considered the minimally clinically important difference [33]. Redelmeier [45] used similar techniques for estimating important between-patient differences, and recognized the potential conceptual inconsistency in having researchers set the cut point for importance. For the taxonomy presented here, it is suggested that a consistent rater determine both the existence and the importance of the change, in order to have a single perspective reflected in the external standard, and that the rationale for the cut point for importance be clearly articulated. Naylor and Llewellyn-Thomas [29] support the role of broader, albeit differing, perspectives in determining the importance of change. Payers, such as insurance companies and workers’ compensation boards, and other facets of society (as defined above) can also determine the importance of change, often from a productivity or economic point of view. It would be anticipated that this would differ from an individual’s or even a clinician’s rating of important change [71]. The external standard might take the form of either a cut point set at an amount of change that the person has to experience in order to achieve a certain level (population norms, a return to work, a perfect score) or a transitional stage (a certain magnitude of improvement, perhaps accounting for baseline scores). Thus, payers and society provide the final two perspectives on whether a given change would be deemed important.

3.5. Application of the classification system

The proposed classification system describes many different categories of change that can be quantified in studies of responsiveness. Each category is defined by the place it occupies on a triaxial matrix (Fig. 5) in which the three axes define the Who, Which, and What of the change being quantified within a study of responsiveness. Some examples might help to illustrate the practical application of the system. Table 2 briefly describes six different studies and outlines how they would be categorized. As summarized in Table 2, Laupacis et al. [53] followed a group of patients before and after total hip replacement and provided overall summaries of change in health for the whole group, whether or not each individual had actually improved (probably some did not). This is an example of group-level, over-time, observed change.
Fig. 5. Depiction of the recommended taxonomy for responsiveness. Each cell in this three-dimensional structure (as defined by the axes of “which, who and what” of the change being quantified), represents a category or type of responsiveness. A given study usually is looking at one kind of change, and therefore one category of responsiveness. A given measure may have information in terms of its responsiveness to one category of change, but this may not mean it is responsive to another.
In contrast, Redelmeier and Lorig [65] had patients rate themselves as being better/worse/same compared with other patients in the same room, on the same day. The mean difference in health status scores between two groups (those where one of the pair said they were somewhat better versus those where one of the pair said their health was the same) was defined as a minimally important difference. This provides a rare example of a group-level, between-person, important-change (researcher’s perspective) type of responsiveness study. In previous work done by the authors [100], patients undergoing shoulder surgery (rotator cuff debridement or repair, or total shoulder arthroplasty) were measured before and 6 months after surgery. At the second testing, patients were asked if they were better, on a 5-point scale. Those who responded “yes” (that they were somewhat or much better) were deemed to have improved, and their responses were analyzed for responsiveness (to estimated change) [100]. An excellent example of responsiveness in a “both” (between- and within-person difference/change) group comparison is found in Buchbinder et al.’s [34] article on outcome measures in randomized controlled trials for rheumatoid arthritis. Buchbinder describes the adaptation of the effect size statistic for use in a “both” (between-person difference of within-person change) design and ranks traditional outcomes by their responsiveness. Important change at an individual level is shown in the example of Stratford et al.’s work [94]. Using an average of patient and clinician ratings of improvement, and considering anything above 5/7 on that scale to be important, this group used receiver operating characteristic curves to find the most accurate change score for discriminating between those who had and did not have important change [94]. Riddle et al. [99] conducted another study using a similar technique, though using achievement of treatment goals as the indicator of important change.
Table 2
Application of the classification system using published studies of responsiveness from the literature (a)

Laupacis et al. [53]
Brief description: Cohort of patients pre- and post-total hip replacement.
Who? Group. Which scores? Within-person. What? Observed change in a population.

Redelmeier and Lorig [65]
Brief description: Group of patients comparing themselves to others, and rating self as somewhat better than the other.
Who? Group. Which scores? Between-person (one point in time). What? Observed change in those estimated to be different.

Beaton and Richards [100]
Brief description: Shoulder patients who said they were better after surgery.
Who? Group. Which scores? Within-person. What? Observed change in those estimated to have improved.

Buchbinder et al. [34]
Brief description: Randomized controlled trial in rheumatoid arthritis.
Who? Group. Which scores? Both/hybrid. What? Observed change in a population.

Stratford et al. [94]
Brief description: Patients with low-back pain of <6 months duration, undergoing physiotherapy; follow-up 3–6 weeks later; average of patient and clinician rating of change used as external marker.
Who? Individual. Which scores? Within-person. What? Important: average rating >5/7 on the scale used to define responder/nonresponder groups; the most accurate change score for discriminating between the groups was taken as important change to the individual.

Wyrwich et al. [87]
Brief description: Outpatients enrolled in a trial of drug utilization in heart disease, rating their improvement.
Who? Individual. Which scores? Within-person. What? Important: observed change in those estimated to have had important improvement (using one SEM as criterion, and the McMaster MCID).

(a) This table shows some examples of studies of responsiveness from the literature, and how they would be classified using the suggested classification system. Refer to the text for a description of each of the categories under the Who, Which and What headings.
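The receiver operating characteristic approach attributed above to Stratford et al. [94] can be sketched as follows. This is our construction with hypothetical data, not the authors’ code, and Youden’s index is used here as one common criterion for the “most accurate” cut point.

```python
# A minimal sketch: find the change-score cut point that best discriminates
# patients externally rated as importantly improved from those who were not.
def best_cutpoint(change_scores, improved):
    """Return the cut point maximizing sensitivity + specificity - 1 (Youden)."""
    best, best_j = None, -1.0
    for cut in sorted(set(change_scores)):
        tp = sum(1 for c, y in zip(change_scores, improved) if y and c >= cut)
        fn = sum(1 for c, y in zip(change_scores, improved) if y and c < cut)
        tn = sum(1 for c, y in zip(change_scores, improved) if not y and c < cut)
        fp = sum(1 for c, y in zip(change_scores, improved) if not y and c >= cut)
        sens = tp / (tp + fn) if (tp + fn) else 0.0
        spec = tn / (tn + fp) if (tn + fp) else 0.0
        j = sens + spec - 1
        if j > best_j:
            best, best_j = cut, j
    return best, best_j

changes = [1, 2, 3, 4, 5, 6, 7, 8, 2, 9]   # hypothetical change scores
better = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]    # external rating of important change
cut, youden = best_cutpoint(changes, better)
print(f"Most discriminating change score: {cut} (Youden index {youden:.2f})")
```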
Wyrwich et al. [87] provide a rare example of analysis of responsiveness at an individual patient level, using the minimally clinically important difference approach of the McMaster group [50] while also proposing one standard error of measurement unit as reflecting an important difference [91]. Not all of their analysis is at the individual level; however, parts are, such as the proportion of subjects exceeding the one-SEM criterion. Examples can also be found of studies that deliberately contrast different categories of responsiveness and emphasize the uniqueness of each approach. Hemingway et al. [71] examined the same category of change (group-level, within-person observed change) as Laupacis, above, but contrasted it with another category by changing the “Which” component in the second group (group-level, between-person, observed change). In a study of injured workers, some of the present authors demonstrated the difference between observed change and estimated change, holding the other two axes constant (group-level, within-person change) [7]. The differences found in these two studies show how different categories of responsiveness can lead to dissimilar estimates of change and effect size (in terms of magnitude), even within the same study situation.
4. Discussion

This article has revealed the need to consider responsiveness a context-specific attribute, where the nature of the change designed into the study partially defines that context. It has also created a taxonomy of responsiveness to define the nature of the change in the study. We suggest that this proposed taxonomy reconciles many of the debates and discussions in the literature [14,21,36,46,101] by locating the nature of change (the category of change) within a matrix defined by three axes: Who is the focus of the study? Which scores are being compared, and when were they gathered? What is the nature of the change being examined? These three axes define the essential attributes of all the different
types of change that have been or may be studied under the rubric of “responsiveness.” Because any measure may be tested for responsiveness to any of these categories of change, responsiveness is not a unified attribute “possessed” by an outcome measure but, rather, a context-specific characteristic within a situation defined by the patient, the treatment under study and the category of change being assessed. We believe that responsiveness can only be attributed to that particular application (patient, treatment, category of responsiveness, outcome measure) and that a study of responsiveness therefore validates the application, not the instrument [36,64]. Given the contextual nature of responsiveness, we suggest the adoption of the definition of deBruin et al. [3] (the accurate detection of change when it has occurred). This is, in our view, probably the most appropriate because it balances the measurement of change (accurate detection) with the need to define how we know that change has occurred. This definition returns responsiveness to its etymological origins. The Canadian Oxford Dictionary states that the term is derived from the Latin respondere, to return a pledge or promise [102], thereby defining responsiveness as a reflexive term: a response to “x,” where, in our situation, “x” is a category of change in a given setting. A limitation of many other definitions in the literature is that they include what we have called a category of change within their definitions of responsiveness (such as the ability to detect treatment effects, or important change even if it is small), yet do not recognize the potential lack of generalizability to other types of change. A new framework or conceptual model in which several types of responsiveness can be held as meaningful is called for. Testa and Simonson [36] suggest that a study of responsiveness compares fluctuations in a measure’s score against shifts in some criterion (or proxy) for true change in the attribute (given a hypothesized causal link between the proxy and the true change [36]). This comes closest to our context-specific concept of responsiveness.
We have elaborated on the Testa framework by suggesting the different categories of change that could be considered proxies for true change in health status, each of which would offer a valid assessment of a specific type of responsiveness. Thus, the definition and the concept are complementary. A classification system should have the goal of “facilitating accurate communication” [44]. The question of whether a triaxial classification system does indeed facilitate the study of responsiveness remains open to debate. Nevertheless, it does define the type of change being quantified in a study and (because it emphasizes the importance of the context) can be expected to lead to less confusion between contrasts such as observed versus estimated or important change. Results will be easiest to generalize to the same context—the same category of change as well as the same patient group and treatment. Two other approaches that could be considered classification systems were found: first, the distinction between anchor-based and distribution-based studies of responsiveness made by Lydick and Epstein [41]; second, the study designs described in Stratford et al.’s work [37]. Lydick and Epstein differentiate between what we have called observed versus estimated (or important) change on the What axis, and clearly their work fits along that axis [41]. The work of Liang, differentiating between sensitivity (observed) and responsiveness (expected or important), is similar [28,103]. In the same manner, Husted and colleagues [104] differentiate between internal and external responsiveness, the former being our “observed” change and the latter being responsiveness in relation to an external marker of change, either estimated or important in our taxonomy. Stratford et al. describe five common approaches to studies of responsiveness, but they mix what is separated out as different axes in our taxonomy [37]. In Stratford’s work, observed change in a group of patients is study design 1b and observed change in a randomized controlled trial is 2a, with the result that they appear to be totally different according to the alphanumeric code assigned. Our classification would hold these two examples as the same on the What axis (observed change) and differing only on the Which axis (within-person versus both/hybrid—the between-person difference of within-person change). Their system would not be comprehensive enough to include all the studies of responsiveness we have encountered in the literature, such as important change in a randomized controlled trial. Our work supports the work of both these groups in defining distinct studies of responsiveness and asserting that they should be considered as different. Limitations of this research include the fact that it has not been thoroughly tested. We provided examples of studies and where they fit within the taxonomy, but these were only six of many hundreds of studies. The proposed classification system needs to be tested using practical examples of studies of responsiveness to evaluate its comprehensiveness and describe the nature of studies of responsiveness. Recent work at the OMERACT V meeting (May 2000) contributed to this effort by reviewing studies of responsiveness in four clinical areas and using the taxonomy to find those studies addressing
important change [105,106]. Further work is needed to continue the application of the taxonomy to other bodies of responsiveness literature. A second limitation is that this work addresses only some of the challenges inherent in measuring responsiveness. There are also statistical and mathematical debates over how to quantify and model change [5,14,15,25,36,107], as well as concerns about the possibility of violating assumptions by treating our measures as having an interval level of scaling [15,17,19,32,95,108–110]. Some authors have raised issues such as how well respondents can be expected to recall previous health states when asked a transitional question such as “how are you now compared to . . . ?” [111]. Still others suggest that the whole framework within which a patient would respond to a given question could shift over time [112,113]. Finally, our work did not address the other contextual issues that define and influence the interpretation of responsiveness statistics: the type of treatment experienced between measurement times (for example, joint arthroplasty will have a large and dramatic effect compared to some non-operative interventions for easing joint pain [8]), the patient mix, and the timing of the measurement [114]. These matters were beyond the scope of this article, which had the goal of classifying just one component of responsiveness: the construct of change being quantified. Our work reflects a more heuristic approach to responsiveness, which has been called for by others [28,115–117]. The work presented here has two main implications: first, it will change the way we (clinicians and researchers) talk about responsiveness; and second, it will influence how we choose instruments to use in our work. In using this taxonomy, clinicians and researchers will begin to talk about responsiveness in a contextualized way. Instruments will be considered responsive to a type of change (treatment effects, or important change from a patient’s perspective). Conceivably, although outside the scope of this article to demonstrate, a given instrument may be responsive to one category of change but not to another, suggesting a limit to the generalizability of the findings, even when using the same questionnaire. Certain disagreements in the field of responsiveness (exemplified by articles such as Norman et al. [17] and the debate between Redelmeier and Wright [13,33,40]) could be resolved with this new way of thinking. For instance, Norman et al. [17] raised the criticism that measures of what we have called “Who: group-level; Which scores: within-person over time; What: estimated change” are biased indicators of change for clinical trials. Although we would concur, we would say this is because the wrong type of change is being examined for the target application (clinical trials). A statistical result from a study that has used an external marker to estimate the occurrence of change does not reflect the type of change being tested in a clinical trial (where one would need data on the between-group difference of within-group change using observed change data at a group level: that is, group-level, both/hybrid, observed change). It is suggested here that what Norman et al. [17] argued
from a mathematical point of view, about the correlation between the external criterion and the change score, is an error of context: the wrong type of change. Our findings would also caution against many statements found in the concluding lines of responsiveness studies, such as that findings derived from observed change will aid in “choosing more sensitive outcome measures for a clinical trial” [39, p. 537]. This is to mix categories of change, and an instrument that works well in a particular application (in this case, within-person change) may or may not do so in a clinical trial (the hybrid or “both” type of change). Likewise, we would be hesitant to generalize from Redelmeier et al.’s between-person estimates [45,68] to determining clinically important differences in a clinical trial (as the authors suggest). Again, this extrapolates the findings to another situation where the instrument may or may not perform in the same manner. Users of the taxonomy will demand more of definitions or statements such as “the ability to detect meaningful treatment effects.” They will want more clarification of exactly what was considered a “meaningful” versus a non-meaningful effect. Whose perspective defined it as meaningful? Readers will become much more demanding and will be more hesitant to accept a blanket statement that an instrument is “responsive.” The second implication is that the taxonomy described here will change the way we select instruments for either clinical applications or research. Those choosing an instrument based on the criterion of responsiveness will now look not only for evidence of “responsiveness” but for evidence of a particular category of responsiveness that matches how they intend to measure change when using the instrument. In 1985 and again in 1992, Kirshner and Guyatt and Guyatt et al. [101,118] proposed a now well-accepted, though also openly challenged [57,115,119], taxonomy for the purposes of health status measures. The goal was to clarify and simplify the ways in which instruments are used [1,118,119]. As Guyatt et al. suggest, their work addressed only one area in health status measurement: the purpose of the instrument [119]. With a similar goal, we have developed another taxonomy, in this case one for responsiveness. Like Kirshner and Guyatt [101], we hope that it will help the user select the questionnaire most likely to accurately detect the type of change being quantified. In conclusion, the present article puts forward a taxonomy for the change being quantified in studies of responsiveness. By defining responsiveness as a context-specific attribute (with the context being the type of change the study is designed to measure), we are more likely to be able to interpret the meaning of change scores and determine the comparability of estimates of responsiveness across different studies and different instruments. The goal is, like Guyatt’s, to simplify “what appears to be chaos” [119, p. 1353] in the field of health measurement.
Acknowledgments
The authors thank Dr. Matthew Murawski for sharing the results of his literature review on responsiveness up to 1994 and Dr. Valerie Tarasuk for her significant contribution to the writing and review of this manuscript. Dr. Beaton was supported by a PhD fellowship (health research) from the Medical Research Council of Canada and by the Institute for Work & Health while this research was done. Dr. Wright is the R.B. Salter Chair of Surgical Research and a Medical Research Council of Canada Scientist. Dr. Katz is supported by NIH grant #AR36308 and by the US National Arthritis Foundation.
References
[1] Guyatt GH, Walter SD, Norman GR. Measuring change over time: assessing the usefulness of evaluative instruments. J Chronic Dis 1987;40(2):171–8.
[2] Hays RD, Anderson R, Revicki D. Psychometric considerations in evaluating health-related quality of life measures. Qual Life Res 1993;2:441–9.
[3] De Bruin AF, Diederiks JPM, De Witte LP, Stevens FCJ, Philipsen H. Assessing the responsiveness of a functional status measure: the Sickness Impact Profile versus the SIP68. J Clin Epidemiol 1997;50(5):529–40.
[4] Wright JG, Young NL. A comparison of different indices of responsiveness. J Clin Epidemiol 1997;50(3):239–46.
[5] Cohen J. Things I have learned so far. Am Psychol 1990;45(12):1304–12.
[6] Kazis LE, Anderson JJ, Meenan RF. Effect sizes for interpreting changes in health status. Med Care 1989;27(Suppl 3):S178–89.
[7] Beaton DE, Hogg-Johnson S, Bombardier C. Evaluating changes in health status: reliability and responsiveness of five generic health status measures in workers with musculoskeletal disorders. J Clin Epidemiol 1997;50(1):79–93.
[8] Katz JN, Larson MG, Phillips CB, Fossel AH, Liang MH. Comparative measurement sensitivity of short and longer health status instruments. Med Care 1992;30(10):917–25.
[9] Murawski MM, Miederhoff PA. On the generalizability of statistical expressions of health related quality of life instrument responsiveness: a data synthesis. Qual Life Res 1998;7:11–22.
[10] Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates, 1988.
[11] Donovan JL, Frankel SJ, Eyles JD. Assessing the need for health status measures. J Epidemiol Comm Health 1993;47(2):158–62.
[12] Deyo RA, Carter WB. Strategies for improving and expanding the application of health status measures in clinical settings: a researcher-developer viewpoint. Med Care 1992;30(Suppl 5):MS176–86.
[13] Redelmeier DA, Guyatt GH, Goldstein RS. On the debate over methods for estimating the clinically important difference. J Clin Epidemiol 1996;49(11):1223–4.
[14] Norman GR. Issues in the use of change scores in randomized trials. J Clin Epidemiol 1989;42(11):1097–105.
[15] Nunnally J. The study of change in evaluation research: principles concerning measurement, experimental design and analysis. In: Struening E, Guttentag M, editors. Handbook of evaluation research. Beverly Hills, CA: Sage Publications, 1975. p. 101–37.
[16] Leplege A, Hunt S. The problem of quality of life in medicine. JAMA 1997;278(1):47–50.
[17] Norman GR, Stratford P, Regehr G. Methodological problems in the retrospective computation of responsiveness to change: the lesson of Cronbach. J Clin Epidemiol 1997;50(8):869–79.
[18] Guyatt GH, Cook D. Health status, quality of life, and the individual. JAMA 1994;272(8):630–1.
[19] Michell J. Measurement scales and statistics: a clash of paradigms. Psychol Bull 1986;100(3):398–407.
[20] Redelmeier DA, Goldstein R, Min ST, Hyland RH. Spirometry and dyspnea in patients with COPD: when small differences mean little. Chest 1996;109.
[21] McDowell I, Jenkinson C. Development standards for health measures. J Health Serv Res Policy 1996;1(4):238–46.
[22] Testa M, Nackley J. Methods for quality of life studies. Annu Rev Public Health 1994;15:535–59.
[23] Russell E. The ethics of attribution: the case of health care outcome indicators. Soc Sci Med 1998;47(9):1161–9.
[24] Jenkinson C. Evaluating the efficacy of medical treatment: possibilities and limitations. Soc Sci Med 1995;41(10):1395–401.
[25] Cronbach LJ, Furby L. How we should measure “change”—or should we? Psychol Bull 1970;74(1):68–80.
[26] Braitman LE. Confidence intervals assess both clinical significance and statistical significance. Ann Intern Med 1991;114(6):515–7.
[27] Fortin PR, Stucki G, Katz JN. Measuring relevant change: an emerging challenge in rheumatic clinical trials. Arthritis Rheum 1995;38(8):1027–30.
[28] Liang MH. Evaluating measurement responsiveness. J Rheumatol 1995;22:1191–2.
[29] Naylor CD, Llewellyn-Thomas HA. Can there be a more patient-centred approach to determining clinically important effect sizes for randomized treatment trials? J Clin Epidemiol 1994;47(7):787–95.
[30] Drummond M, O’Brien B. Clinical importance, statistical significance and the assessment of economic and quality-of-life outcomes. Health Economics 1993;2(3):205–12.
[31] Deyo RA, Carter WB. Strategies for improving and expanding the application of health status measures in clinical settings: a researcher-developer viewpoint. Med Care 1992;30(Suppl 5):MS176–86.
[32] Feinstein AR. Zeta and delta: critical descriptive boundaries in statistical analysis. J Clin Epidemiol 1998;51(6):527–30.
[33] Wright JG. The minimal important difference: who’s to say what is important? J Clin Epidemiol 1996;49(11):1221–2.
[34] Buchbinder R, Bombardier C, Yeung M, Tugwell P. Which outcome measures should be used in rheumatoid arthritis clinical trials? Arthritis Rheum 1995;38(11):1568–80.
[35] Testa MA. Interpreting quality-of-life clinical trial data for use in the clinical practice of antihypertensive therapy. J Hypertens 1987;5(Suppl 1):S9–13.
[36] Testa MA, Simonson DC. Assessment of quality-of-life outcomes. N Engl J Med 1996;334:835–40.
[37] Stratford PW, Binkley JM, Riddle DL. Health status measures: strategies and analytic methods for assessing change scores. Phys Ther 1996;76(10):1109–23.
[38] Guyatt GH, Naylor CD, Juniper E, Heyland DK, Jaeschke R, Cook DJ. Users’ guides to the medical literature: XII. How to use articles about health-related quality of life. JAMA 1997;277(15):1232–7.
[39] Anderson JJ, Chernoff MC. Sensitivity to change of rheumatoid arthritis clinical trial outcome measures. J Rheumatol 1993;20(3):535–7.
[40] Redelmeier DA, Guyatt GH, Goldstein RS. Assessing the minimal important difference in symptoms: a comparison of two techniques. J Clin Epidemiol 1996;49(11):1215–9.
[41] Lydick E, Epstein RS. Interpretation of quality of life changes. Qual Life Res 1993;2:221–6.
[42] Buchbinder R, Goel V, Bombardier C, Hogg-Johnson S. Classification systems of soft tissue disorders of the neck and upper limb: do they satisfy methodological guidelines? J Clin Epidemiol 1996;49(2):141–9.
[43] Feinstein AR. Clinical epidemiology: Part 2. The identification rates of disease. Ann Intern Med 1968;69(5):1037–61.
[44] Katz JN, Liang MH. Classification criteria revisited. Arthritis Rheum 1991;34(10):1228–30.
[45] Redelmeier DA, Goldstein R, Min ST, Hyland RH. Spirometry and dyspnea in patients with COPD: when small differences mean little. Chest 1996;109.
[46] McHorney CA. Generic health measurement: past accomplishments and a measurement paradigm for the 21st century. Ann Intern Med 1997;127:743–50.
[47] Wood-Dauphinee S. Assessing quality of life in clinical research: from where have we come and where are we going? J Clin Epidemiol 1999;52(4):355–63.
[48] Juniper EF, Guyatt GH, Willan A, Griffith L. Determining a minimal important change in a disease-specific quality of life questionnaire. J Clin Epidemiol 1994;47(1):81–7.
[49] Bombardier C, Raboud J. A comparison of health-related quality-of-life measures for rheumatoid arthritis research. Control Clin Trials 1991;12:243S–56S.
[50] Jaeschke R, Singer J, Guyatt GH. Measurement of health status: ascertaining the minimal clinically important difference. Control Clin Trials 1989;10:407–15.
[51] Liang MH, Fossel AH, Larson MG. Comparisons of five health status instruments for orthopedic evaluation. Med Care 1990;28(7):632–42.
[52] Liang MH, Larson MG, Cullen KE, Schwartz JA. Comparative measurement efficiency and sensitivity of five health status instruments for arthritis research. Arthritis Rheum 1985;28(5):542–7.
[53] Laupacis A, Bourne R, Rorabeck C, Feeny D, Wong C, Tugwell P, et al. The effect of elective total hip replacement on health-related quality of life. J Bone Joint Surg Am 1993;75A(11):1619–26.
[54] Deyo RA, Inui TS. Toward clinical applications of health status measures: sensitivity of scales to clinically important changes. Health Serv Res 1984;19(3):275–89.
[55] MacKenzie CR, Charlson ME, DiGioia D, Kelly K. Can the Sickness Impact Profile measure change? An example of scale assessment. J Chronic Dis 1986;39(6):429–38.
[56] Deyo RA, Centor RM. Assessing the responsiveness of functional scales to clinical change: an analogy to diagnostic test performance. J Chronic Dis 1986;39(11):897–906.
[57] Williams JI, Naylor CD. How should health status measures be assessed? Cautionary notes on procrustean frameworks. J Clin Epidemiol 1992;45(12):1347–51.
[58] Redelmeier DA, Tversky A. Discrepancy between medical decisions for individual patients and for groups. N Engl J Med 1990;322(16):1162–4.
[59] Goldsmith CH, Boers M, Bombardier C, Tugwell P. Criteria for clinically important changes in outcomes: development, scoring and evaluation of rheumatoid arthritis patient and trial profiles. J Rheumatol 1993;20(3):561–5.
[60] Paulus H, Egger MJ, Ward JR, Williams HJ. Reply (letter to the editor). Arthritis Rheum 1990;33(4):502–4.
[61] Deyo RA, Patrick DL. The significance of treatment effects: the clinical perspective. Med Care 1995;33(4):AS286–91.
[62] Feinstein AR. Twentieth century paradigms that threaten both scientific and humane medicine in the twenty-first century. J Clin Epidemiol 1996;49(6):615–7.
[63] Fletcher RH, Fletcher SW, Wagner EH. Clinical epidemiology: the essentials. 3rd ed. Baltimore: Williams & Wilkins, 1996.
[64] Nunnally JC, Bernstein IH. Psychometric theory. 3rd ed. New York: McGraw-Hill, 1994.
[65] Redelmeier DA, Lorig K. Assessing the clinical importance of symptomatic improvements: an illustration in rheumatology. Arch Intern Med 1993;153(11):1337–42.
[66] Beaton GH. Statistical approaches to establish mineral element recommendations. J Nutr 1996;126:2320S–8S.
[67] McHorney CA, Tarlov AR. Individual patient monitoring in clinical practice: are available health status surveys adequate? Qual Life Res 1995;4:293.
[68] Wells GA, Tugwell P, Kraag GR, Baker PRA, Groh J, Redelmeier DA. Minimum important difference between patients with rheumatoid arthritis: the patient’s perspective. J Rheumatol 1993;20(3):557–60.
[69] Burke A, Nesselroade J. Change measurement. In: von Eye A, editor. Statistical methods in longitudinal research, vol. 1: Principles and structuring change. Boston: Academic Press, 1990. p. 3–34.
[70] Jacobs HM, Touw-Otten FWMM, de Melker RA. The evaluation of changes in functional health status in patients with abdominal complaints. J Clin Epidemiol 1996;49(2):163–71.
[71] Hemingway H, Stafford M, Stansfeld S, Marmot M. Is the SF-36 a valid measure of change in population health? Results from the Whitehall II study. Br Med J 1997;315:1273–9.
[72] Lachs MS. The more things change . . . J Clin Epidemiol 1993;46(10):1091–2.
[73] Beaton DE, Katz JN, Fossel AH, Tarasuk V, Wright JG, Bombardier C. Measuring the whole or the parts? Validity, reliability and responsiveness of the DASH Outcome Measure in different regions of the upper extremity. J Hand Ther. In press.
[74] McConnell S, Beaton DE, Bombardier C. The DASH Outcome Measure: a user’s manual. 1st ed. Toronto: Institute for Work & Health, 1999.
[75] Hudak PL, Amadio PC, Bombardier C, The Upper Extremity Collaborative Group (UECG). Development of an upper extremity outcome measure: the DASH (Disabilities of the Arm, Shoulder and Hand). Am J Ind Med 1996;29(6):602–8.
[76] Roland M, Morris R. A study of the natural history of back pain—Part I: development of a reliable and sensitive measure of disability in low-back pain. Spine 1983;8(2):141–4.
[77] Christensen L, Mendoza JL. A method of assessing change in a single subject: an alteration of the RC index. Behav Ther 1986;17:305–8.
[78] Sadler WA, Murray LM, Turner JG. Minimum distinguishable difference in concentration: a clinically oriented translation of assay precision summaries. Clin Chem 1992;38(9):1773–8.
[79] Ravaud P, Giraudeau B, Auleley GR, Edouard-Noel R, Dougados M, Chastang C. Assessing smallest detectable change over time in continuous structural outcome measures: application to radiological change in knee osteoarthritis. J Clin Epidemiol 1999;52(12):1225–30.
[80] Altman DG. Practical statistics for medical research. London: Chapman & Hall, 1991.
[81] Jacobson NS, Truax P. Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. J Consult Clin Psychol 1991;59(1):12–9.
[82] Jacobson NS, Roberts LJ, Berns SB, McGlinchey JB. Methods for defining and determining the clinical significance of treatment effects: description, application, and alternatives. J Consult Clin Psychol 1999;67(3):300–7.
[83] Ottenbacher KJ, Johnson MB, Hojem M. The significance of clinical change and clinical change of significance: issues and methods. Am J Occup Ther 1988;42(3):156–63.
[84] Stratford PW, Binkley JM. Measurement properties of the RM-18: a modified version of the Roland-Morris Disability Scale. Spine 1997;22(20):2416–21.
[85] Stratford PW, Binkley J, Solomon P, Finch E, Gill C, Moreland J. Defining the minimum level of detectable change for the Roland-Morris questionnaire. Phys Ther 1996;76(4):359–68.
[86] Stratford PW, Finch E, Solomon P, Binkley J, Gill C, Moreland J. Using the Roland-Morris questionnaire to make decisions about individual patients. Physiotherapy Can 1996;48(2):107–10.
[87] Wyrwich KW, Nienaber MA, Tierney WM, Wolinsky FD. Linking clinical relevance and statistical significance in evaluating intra-individual changes in health-related quality of life. Med Care 1999;37(5):469–78.
[88] Stratford PW, Riddle DL, Binkley JM, Spadoni G, Westaway MD, Padfield B. Using the neck disability index to make decisions concerning individual patients. Physiotherapy Can 1999;Spring:107–12.
[89] Jacobson NS, Follette WC, Revenstorf D. Psychotherapy outcome research: methods for reporting variability and evaluating clinical significance. Behav Ther 1984;15:336–52.
[90] Wells GA, Beaton DE. Methods for ascertaining MCID. J Rheumatol. In press.
[91] Wyrwich KW, Tierney WM, Wolinsky FD. Further evidence supporting a standard error of measurement-based criterion for identifying meaningful intra-individual change in health-related quality of life. J Clin Epidemiol 1999;52:861–73.
[92] Liang MH, Larson MG, Cullen KE, Schwartz JA. Comparative measurement efficiency and sensitivity of five health status instruments for arthritis research. Arthritis Rheum 1985;28(5):542–7.
[93] Laupacis A, Bourne R, Rorabeck C, Feeny D, Wong C, Tugwell P, et al. The effect of elective total hip replacement on health-related quality of life. J Bone Joint Surg Am 1993;75A(11):1619–26.
[94] Stratford PW, Binkley JM, Riddle DL, Guyatt GH. Sensitivity to change of the Roland-Morris back pain questionnaire: Part 1. Phys Ther 1998;78(11):1186–96.
[95] Stucki G, Daltroy L, Katz JN, Johannesson M, Liang MH. Interpretation of change scores in ordinal clinical scales and health status measures: the whole may not be equal to the sum of the parts. J Clin Epidemiol 1996;49(7):711–7.
[96] Beaton DE, Tarasuk V, Katz JN, Wright JG, Bombardier C. Are you better? A qualitative study on the meaning of recovery. Arthritis Care Res. In press.
[97] van Walraven C, Mahon JL, Moher D, Bohm C, Laupacis A. Surveying physicians to determine the minimal important difference: implications for sample-size calculation. J Clin Epidemiol 1999;52(8):717–23.
[98] Guyatt GH, Juniper EF, Walter SD, Griffith L, Goldstein RS. Interpreting treatment effects in randomized trials. Br Med J 1998;316:690–2.
[99] Riddle DL, Stratford PW, Binkley JM. Sensitivity to change of the Roland-Morris back pain questionnaire: Part 2. Phys Ther 1998;78(11):1197–207.
[100] Beaton DE, Richards RR. Assessing the reliability and responsiveness of five shoulder questionnaires. J Shoulder Elbow Surg 1998;7:565–72.
[101] Kirshner B, Guyatt GH. A methodological framework for assessing health indices. J Chronic Dis 1985;38(1):27–36.
[102] Barber K. The Canadian Oxford dictionary. Toronto: Oxford University Press, 1998.
[103] Liang MH. Longitudinal construct validity: establishment of clinical meaning in patient evaluative instruments. Med Care 2000;38(Suppl 9):II84–90.
[104] Husted JA, Cook RJ, Farewell VT, Gladman DD. Methods for assessing responsiveness: a critical review and recommendations. J Clin Epidemiol 2000;53:459–68.
[105] Beaton DE, Bombardier C, Katz JN, Wright JG. Looking for important change in studies of responsiveness. J Rheumatol 2001;28:400–5.
[106] Wells GA, Beaton DE, Welch V. Methods for ascertaining the minimally clinically important difference. J Rheumatol 2001;28:406–12.
[107] Pfennings L, Cohen L, Van der Ploeg H. Preconditions for sensitivity in measuring change: visual analogue scales compared to rating scales in a Likert format. Psychol Rep 1995;77:475–80.
[108] Nunnally J, Durham R. Validity, reliability, and special problems of measurement in evaluation research. In: Struening E, Guttentag M, editors. Handbook of evaluation research. Beverly Hills, CA: Sage Publications, 1975. p. 289–352.
[109] Stevens SS. Problems and methods of psychophysics. Psychol Bull 1958;55(4):177–96.
[110] Chambers LW, Haight M, Norman GR, MacDonald L. Sensitivity to change and the effect of mode of administration on health status measurement. Med Care 1987;25(6):470–80.
[111] Ross M. Relation of implicit theories to the construction of personal histories. Psychol Rev 1989;96(2):341–57.
[112] Llewellyn-Thomas HA, Schwartz CE. Response shift effects on patients’ evaluation of health states: sources of artifact. 1999.
[113] Schwartz CE. Measuring patient-centred outcome in neurologic disease: extending the Q-TWiST method. Arch Neurol 1995;52:754–62.
[114] Greenfield S, Nelson EC. Recent developments and future issues in the use of health status assessment measures in clinical settings. Med Care 1992;30(Suppl 5):MS23–41.
[115] Hyland ME. The validity of health assessments: resolving some recent differences. J Clin Epidemiol 1993;46(9):1019–23.
[116] Marquis P. Strategies for interpreting quality of life questionnaires. Quality of Life Newsletter 1998:3–4. Lyon, France: Medcom.
[117] Cronbach LJ. Beyond the two disciplines of scientific psychology. Am Psychol 1975;30:116–27.
[118] Guyatt GH, Kirshner B, Jaeschke R. Measuring health status: what are the necessary measurement properties? J Clin Epidemiol 1992;45(12):1341–5.
[119] Guyatt GH, Kirshner B, Jaeschke R. A methodologic framework for health status measures: clarity or oversimplification? J Clin Epidemiol 1992;45(12):1353–5.
[120] Irvine EJ, Zhou Q, Thompson AK. The Short Inflammatory Bowel Disease Questionnaire: a quality of life instrument for community physicians managing inflammatory bowel disease. Am J Gastroenterol 1996;91(8):1571–7.
[121] Siu AL, Ouslander JG, Osterweil D, Reuben DB, Hays RD. Change in self-reported functioning in older persons entering a residential care facility. J Clin Epidemiol 1993;46(10):1093–101.