Further Evidence Supporting an SEM-Based Criterion for Identifying Meaningful Intra-Individual Changes in Health-Related Quality of Life


J Clin Epidemiol Vol. 52, No. 9, pp. 861–873, 1999 Copyright © 1999 Elsevier Science Inc. All rights reserved.

0895-4356/99/$–see front matter PII S0895-4356(99)00071-2

Kathleen W. Wyrwich,1,* William M. Tierney,2,3,4 and Fredric D. Wolinsky1,5

1Saint Louis University School of Public Health, St. Louis, Missouri, USA; 2Regenstrief Institute for Health Care, Indianapolis, Indiana, USA; 3Indiana University School of Medicine, Indianapolis, Indiana, USA; 4Roudebush Veterans Administration Medical Center, Indianapolis, Indiana, USA; and 5Saint Louis University School of Medicine, St. Louis, Missouri, USA

ABSTRACT. This study used the standard error of measurement (SEM) to evaluate intra-individual change on both the Chronic Respiratory Disease Questionnaire (CRQ) and the SF-36. After analyzing the reliability and validity of both instruments at baseline among 471 COPD outpatients, the SEM was compared to established minimal clinically important difference (MCID) standards for three CRQ dimensions. A value of one SEM closely approximated the MCID standards for all CRQ dimensions. This SEM-based criterion was then validated by cross-classifying the change status (improved, stable, or declined) of 393 follow-up outpatients using the one-SEM criterion and the MCID standard. Excellent agreement was achieved for all three CRQ dimensions. Although MCID standards have not been established for the SF-36, the one-SEM criterion was explored in these change scores. Among SF-36 scales demonstrating acceptable reliability and reasonable variance, the percent of individuals within each change category was consistent with those seen in the CRQ dimensions. These results replicate previous findings where a value of one SEM also closely approximated MCIDs for all dimensions of the Chronic Heart Disease Questionnaire among cardiovascular outpatients. The one-SEM criterion should be explored in other health-related quality of life instruments with established MCIDs. J CLIN EPIDEMIOL 52;9:861–873, 1999. © 1999 Elsevier Science Inc.

KEY WORDS. Quality of life, sensitivity, responsiveness, chronic obstructive pulmonary disease, clinical change, measurement

*Address correspondence to: Kathleen W. Wyrwich, Ph.D., Saint Louis University School of Public Health, 3663 Lindell Boulevard, St. Louis, MO 63108-3342. Tel.: (314) 977-3273, Fax: (314) 977-8150, E-mail: [email protected]. Accepted for publication on 30 March 1999.

INTRODUCTION

Health-related quality of life (HRQoL) instruments, both disease-specific and generic, must be reliable, valid, and sensitive to change [1]. Ideally, an HRQoL instrument also needs established standards for identifying clinically important change for each patient population in which it is used [2]. A minimal clinically important difference (MCID) is defined by Jaeschke et al. [3] as “the smallest difference in a score of a domain of interest that patients perceive to be beneficial and that would mandate, in the absence of troublesome side effects and excessive costs, a change in the patient’s management.” Such clinically important change, which recognizes the patient’s perceptions as well as the clinical aspects of case management and outcomes, can only be assessed by practitioners having extensive experience with both the patient population and the assessment instrument [4], and individual patients. It is such clinically important change that reliable, valid, and sensitive health status measures must be able to detect [5,6].

Using a longitudinal study of 563 outpatients with a history of cardiac illness who were interviewed on both a disease-specific HRQoL measure, the Chronic Heart Disease Questionnaire (CHQ) [7], and the Medical Outcomes Study 36-Item Short Form Health Survey (SF-36) [8], we recently found that a change equivalent to one standard error of measurement (SEM) approximated the established MCID standards for each CHQ dimension [9]. Furthermore, when applied to the SF-36 change scores of these same outpatients, the one-SEM criterion showed promise in scales with acceptable reliability and variability by reflecting similar percentages of individuals who improved, remained stable, or declined as indicated by the CHQ dimensions.

The SEM is the standard error in an observed score related to measuring with a particular test that obscures the true score. It is estimated by the standard deviation of the instrument multiplied by the square root of one minus its reliability coefficient [10], or


$\mathrm{SEM} = \sigma_x \sqrt{1 - r_{xx}}$.

Through its distributional ($\sigma_x$) and reliability ($r_{xx}$) components, the SEM takes into consideration the possibility that some of the observed change may be due to random measurement error [11]. Furthermore, according to classical test theory, the SEM possesses the unique attribute of being sample-independent. That is, the SEM is relatively constant across all but the extremes (top and bottom 10%) of a given population’s ability levels [12]. Ironically, this is due to the simultaneous incorporation of the two sample-dependent statistics, $\sigma_x$ and $r_{xx}$, in the SEM’s estimation formula. In samples drawn from a given population, these components vary together around a fixed SEM, as reflected in this rearrangement of the SEM formula:

$r_{xx} = 1 - (\mathrm{SEM}^2 / \sigma_x^2)$.

In addition to its population invariance, a second valuable quality of the SEM is that it is expressed in the original metric of a measure, providing ease in interpretation [10].

In the present study, we explored an SEM-based criterion for evaluating clinically important individual change on the Chronic Respiratory Disease Questionnaire (CRQ) [13] and the SF-36 among adult outpatients with a history of chronic obstructive pulmonary disease (COPD). The study population completed baseline and follow-up interviews on both instruments. Because we wanted to identify a statistical marker of clinically relevant change in reliable and valid dimensions, we appraised the psychometric properties of each of these instruments in our patient population at baseline. By using MCID standards that have been established for the CRQ [3], the number of SEMs that an individual’s score must change to be deemed clinically relevant could be determined. The SEM criterion was validated by cross-classifying trichotomous categorizations of the patients’ change scores (improved, stable, or declined) on the SEM criterion and the MCID standard for 393 outpatients with follow-up interviews. This SEM criterion was then applied to the eight scales of the SF-36 to identify the percentage of individual outpatients who improved, remained stable, or declined. These SF-36 change percentages were compared to those of the CRQ to explore the potential for an SEM-based change criterion for the SF-36, although MCID standards have not been established for those scales [14].
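The SEM formula and its rearrangement given in the Introduction can be illustrated with a minimal sketch. The function names and the example inputs (a standard deviation of 10 and a reliability of 0.84) are illustrative assumptions and are not taken from the study data.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Return SEM = sd * sqrt(1 - reliability)."""
    if sd < 0:
        raise ValueError("sd must be non-negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return sd * math.sqrt(1.0 - reliability)


def reliability_from_sem(sem: float, sd: float) -> float:
    """Return r_xx = 1 - (SEM / sd)^2, the rearranged form of the formula."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return 1.0 - (sem / sd) ** 2


# Illustrative inputs only (not study data).
sem = standard_error_of_measurement(10.0, 0.84)
r_back = reliability_from_sem(sem, 10.0)
print(round(sem, 2), round(r_back, 2))
```

Because the SEM is returned in the same units as the standard deviation, it can be compared directly with a change score on the same scale.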

METHODS

Sampling

This study is a secondary data analysis of baseline and follow-up telephone interviews of outpatients who attended the general medicine clinics of a large academic medical center and participated in a randomized controlled trial (RCT) assessing the effects of computerized reminders to physicians and pharmacists regarding drug utilization review (DUR). The Regenstrief Medical Record System (RMRS) [15,16], an electronic medical chart, identified outpatients as eligible for inclusion if they had a previous diagnosis of chronic obstructive respiratory disease, emphysema, or chronic bronchitis. In addition, outpatients who were at least 45 years old and had either a diagnosis of asthma or had been prescribed an oral or inhaled asthma medication in the past two years were also eligible for inclusion due to the chronic component of their disease and the standards of treatment then in effect at the study institution [17]. Eligibility also required that the outpatient was English-speaking, had access to a telephone, was not deaf, and did not live in a nursing home.

Between August 1994 and April 1996, 579 eligible outpatients were consecutively enrolled after a brief face-to-face screening interview during their scheduled clinic visits, which included informed consent to this IRB-approved study. Subsequently, they were contacted by telephone for their baseline interviews, and 487 enrolled outpatients (84%) completed this process. Of the 92 enrolled outpatients who did not complete baseline interviews, 43 refused to participate (13 for health reasons and 30 for other reasons), 9 had become ineligible since their enrollment, and 40 could not be contacted despite repeated attempts. Notebook computers, which used computer-assisted-interviewing (CAI) software that automatically maintained skip patterns, checked for missing data, signaled improper codes, and clocked administration times for each instrument, were used to record all baseline and follow-up interviews.

Follow-up telephone interviews were conducted between May 1995 and May 1997. Of the 487 outpatients interviewed at baseline, 417 completed re-interviews for a re-interview rate of 86%. Hence, 72% of the 579 outpatients originally enrolled in this study completed re-interviews.

Seventy outpatients interviewed at baseline were not re-interviewed for the following reasons: 19 refused to be re-interviewed, 37 could not be reached within the study time frame, and 14 others had proxies who requested no further contact. These 70 outpatients differed at baseline from the 417 outpatients who were re-interviewed in that they had significantly (P < 0.05) lower scores on the fatigue and emotional function dimensions of the CRQ, and on the general health, vitality, social functioning, role-emotional, and mental health scales of the SF-36. No significant sociodemographic differences, however, were observed.

The total number of interviews that each follow-up outpatient received was either two (n = 95), three (n = 310), or four (n = 21) based upon enrollment dates and the study’s time frame. All re-interviews were also conducted by telephone, and they occurred within two weeks before or six weeks after each participant’s six-month baseline interview anniversary date. Consequently, the time between each outpatient’s baseline interview and last re-interview was roughly six months (n = 58), twelve months (n = 332), or eighteen months (n = 27).


The CRQ

The CRQ is the best known and most completely described disease-specific HRQoL measure for use among patients with COPD [14,18,19]. It was developed in 1987 at McMaster University using the impact method of item selection [20]. The 20 top scoring items were selected and grouped into the following dimensions based on importance and frequency: dyspnea, fatigue, emotional function, and mastery.

The dyspnea dimension is patient-specific, and it requires patients to quantify how shortness of breath has affected five important daily activities. After selecting frequently performed activities from a standardized list, interviewers encouraged outpatients to volunteer any other pertinent activities. From those selected and volunteered activities, outpatients serially selected the five most important activities affected by shortness of breath. Finally, the impact of shortness of breath for each of these five items was rated from one (extremely short of breath) to seven (not at all short of breath).

The other three CRQ dimensions, emotional function, fatigue and mastery, ask the same items of each outpatient. The four fatigue items deal with tiredness and energy level. The emotional function dimension includes seven questions on frustration, depression, anxiety, happiness, tenseness, relaxation, and embarrassment. The four mastery items deal with being panicked, scared, confident, and in control over the disease. All items are rated on a seven-point response scale, with one representing the worst and seven the best health response. Dimensions are scored by summing across equally weighted dimensional items.
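The scoring rule for a CRQ dimension (an equally weighted sum of items rated one to seven) can be sketched as follows. The rescaling to 0 to 100 is assumed here to be a simple linear transformation between the lowest and highest possible summed scores; the article reports transformed scores but does not state the formula, and the function names and example ratings are hypothetical.

```python
from typing import Sequence

# Number of items per CRQ dimension, as described in the text.
CRQ_ITEMS = {"dyspnea": 5, "fatigue": 4, "emotional function": 7, "mastery": 4}


def score_dimension(item_ratings: Sequence[int]) -> int:
    """Sum equally weighted items, each rated 1 (worst) to 7 (best)."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    for rating in item_ratings:
        if not 1 <= rating <= 7:
            raise ValueError("each rating must be between 1 and 7")
    return sum(item_ratings)


def to_0_100(raw: float, n_items: int, low: int = 1, high: int = 7) -> float:
    """Linearly rescale a summed score to 0 (worst) through 100 (best).

    Assumption: min-max rescaling over the possible raw range.
    """
    lowest, highest = n_items * low, n_items * high
    return 100.0 * (raw - lowest) / (highest - lowest)


# Hypothetical ratings for the four fatigue items.
fatigue_raw = score_dimension([3, 4, 2, 5])
fatigue_scaled = to_0_100(fatigue_raw, CRQ_ITEMS["fatigue"])
print(fatigue_raw, round(fatigue_scaled, 1))
```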

MCIDs

Standards for detecting patient changes that reflect MCIDs have been established for the CRQ by two methods: a consensus panel, and within-patient change assessments [3]. Because both the CRQ and the CHQ were developed at McMaster University by many of the same researchers, and because both instruments have three nearly identical dimensions (dyspnea, fatigue, and emotional function), the two patient populations (COPD and cardiac) were combined to generate MCIDs. A panel of individuals experienced with both instruments was surveyed about the score change in each dimension that identified improvements or declines. Using consensus methods [21], this panel defined minimal clinically important change as a shift of three points on the dyspnea, two points on the fatigue, or four points on the emotional function dimension summed scores. Based on the number of items in each of these dimensions (five, four, and seven, respectively), it was then hypothesized that an MCID could be defined as an average per item change of 0.5 in any dimension.

Patients’ change scores on these instruments from three longitudinal respiratory and cardiac studies verified this hypothesis. The 113 change scores came from 54 COPD and 20 cardiac patients, with some patients contributing multiple observations at different time intervals. Individual patient change on each dimension was compared with the patient’s self-reported global rating. These dimensional global change ratings were from a 15-point scale, ranging from −7 (a very great deal worse) to 0 (no change) to 7 (a very great deal better). Patients were classified as having a minimal clinically important change if their absolute global rating was between 1 and 3 (−3 to −1 and 1 to 3), a moderate clinically important change if their absolute global rating was between 4 and 5, and a large clinically important change if their absolute global rating was between 6 and 7. Within each classification (minimal, moderate, and large), the absolute values of patients’ dimensional change scores (|T2 − T1|) were averaged. Results for the minimal classification in these three dimensions corresponded well with the panel’s consensus and the 0.5 per item MCID hypothesis.

A second study compared between-patient differences on the CRQ to seek the minimally important difference (MID), which is defined as “the smallest difference in measured health status that signifies an important rather than trivial difference in patient symptoms” [22]. The results varied among the four CRQ dimensions; however, the investigators concluded the MID was an average per item change of 0.5 within each dimension. Although this subsequent development of corroborating standards produced rather large confidence boundaries on the required per item dimensional change, these were reduced after aggregation to summary measures [23]. It is important to note, however, that the clinical aspect of patient change is noticeably absent from the definition of an MID.
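The two MCID conventions described in this section, the panel's summed-score thresholds and the 0.5 per item hypothesis, together with the grouping of global change ratings, can be sketched as below. The dictionary and function names are illustrative assumptions; the numeric thresholds are those stated in the text.

```python
# Panel consensus thresholds on the summed scores, as stated in the text.
PANEL_MCID = {"dyspnea": 3, "fatigue": 2, "emotional function": 4}
N_ITEMS = {"dyspnea": 5, "fatigue": 4, "emotional function": 7}
MCID_PER_ITEM = 0.5


def per_item_mcid(dimension: str) -> float:
    """Summed-score change implied by an average change of 0.5 per item."""
    return MCID_PER_ITEM * N_ITEMS[dimension]


def global_rating_category(rating: int) -> str:
    """Group a global change rating on the 15-point scale (-7 to 7)."""
    if not isinstance(rating, int) or not -7 <= rating <= 7:
        raise ValueError("rating must be an integer from -7 to 7")
    size = abs(rating)
    if size == 0:
        return "no change"
    if size <= 3:
        return "minimal"
    if size <= 5:
        return "moderate"
    return "large"


for name in PANEL_MCID:
    print(name, PANEL_MCID[name], per_item_mcid(name))
```

The per item values (2.5, 2.0, and 3.5) sit at or just below the panel thresholds of three, two, and four points, which is the correspondence the hypothesis rests on.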

The SF-36

Developed by researchers in the Medical Outcomes Study and first published in 1992, the SF-36 is a widely used generic HRQoL instrument [24]. The eight health concepts assessed by scales in the instrument are: physical functioning (10 items), role limitations due to physical functioning (4 items), bodily pain (2 items), general health perceptions (5 items), vitality (4 items), social functioning (2 items), role limitations due to emotional problems (3 items), and mental health (5 items). A 36th item not incorporated in any of the scales asks respondents about any health changes over the past year.

The number of response choices for the SF-36 items varies considerably. There are two response choices for all of the role-physical and the role-emotional scale items, three response choices for all of the physical functioning scale items, five response choices for all of the social functioning and general health scale items as well as one item from the bodily pain scale and the 36th item, and six response choices for all of the vitality and mental health scale items and the second bodily pain scale item. Within scales, all items are equally weighted. After rescaling the five response-set bodily pain item and recoding all reverse-ordered items, the item scores are summed within scales and transformed, ranging from 0 (worst) to 100 (best) health [25]. Although there are no established MCID standards for the SF-36 [14], recent investigations into the clinical interpretation of these scales using item response theory are promising with regard to the clinical interpretation of individual scores [26,27].

Disease Severity and Illness Burden Measures

The eligibility criteria for this study resulted in the enrollment of a broad spectrum of COPD outpatients. Hence, each participant’s disease-specific severity and overall illness burden were assessed. To address the former, outpatients were asked if they had shortness of breath in the past four weeks. Those with shortness of breath were considered to have greater disease-specific severity. The extensive data stored in the RMRS were used to address overall illness burden. Each patient was categorized into one of the five Ambulatory Care Groups (ACG) [28,29]. This system summarized overall illness burden using gender, age, and ICD-9-CM codes for clinical diagnoses over the past year, and led to 51 classifications reflecting one or more commonly seen comorbid conditions called Ambulatory Diagnostic Groups (ADGs). The number of ADGs determined each outpatient’s final ACG categorization; these categorizations were collapsed into the recommended quintiles of increasing overall illness severity, with I = one ADG, II = two or three ADGs, III = four or five ADGs, IV = six to nine ADGs, and V = ten or more ADGs.

Analytic Approach

There were many phases to the analysis of this project. First, we examined the baseline descriptive and psychometric qualities of the four CRQ dimensions and the eight SF-36 scales among those outpatients with no missing scale data after imputation. This included the ceiling and floor effects, means, standard deviations, internal consistencies, and SEMs.
Next, we assessed the convergent, discriminant, and factorial validity of each instrument. To begin the evaluation of change, we calculated the outpatient change scores (baseline-to-last-re-interview) on all CRQ dimensions and SF-36 scales, and conducted an analysis of variance on these change scores to determine if the change scores of the six-month, twelve-month and eighteen-month groups could be combined [3]. Given that there were no significant (P < 0.05) differences among the three time-to-last-re-interview groups, we calculated the same psychometric properties examined at baseline (ceiling and floor effects, means, standard deviations, internal consistencies, and SEMs) for the last re-interview among the follow-up outpatients.

After identifying the necessary SEM criterion approximating the MCID, we validated that criterion by constructing two sets of patient change outcomes (improved, stable, declined), one defined by the SEM criterion and the other defined by the MCID standards. These two outcomes were then cross-tabulated, and a weighted kappa statistic was calculated for each comparison. Next, we explored the use of the SEM criterion with the SF-36 by constructing similar patient change outcomes (i.e., improved, stable, or declined). The distribution of these SF-36 outcomes was then compared to that of the CRQ outcomes. Finally, we conducted a sensitivity analysis by repeating all of the previous phases of the study using only those outpatients meeting current clinical standards for the diagnosis of COPD [30]. All of our analyses were performed using the SPSS for Windows release 8.0.0 statistical software system [31].

RESULTS

Baseline Descriptive Data

As indicated, 487 COPD outpatients completed baseline interviews. Although the interviews were considered complete because all items were asked of each outpatient, not every item was answered. There were 391 outpatients who had no missing items on the CRQ, 392 who had no missing items on the SF-36, and 316 (65%) who had completely answered all items on both instruments at baseline. The number of usable interviews was increased by applying to both instruments the standard prorated imputation methods recommended for the SF-36, where missing item scores are imputed from the individual’s dimensional or scale mean if at least half of the items on the dimension or scale have been answered [25]. After imputation, there were 480 usable baseline CRQ interviews and 477 usable baseline SF-36 interviews, with 471 outpatients (97%) having usable interviews consisting of both the CRQ and SF-36. We conducted all baseline analyses using the data from these 471 outpatients.
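The prorated imputation rule described above (fill missing items with the respondent's own mean on that dimension or scale, provided at least half of its items were answered) can be sketched as follows. This is a minimal illustration under stated assumptions, not the SF-36 scoring software: the function name is hypothetical, missing items are represented by `None`, and the example responses are invented.

```python
from typing import List, Optional, Sequence


def prorate_impute(items: Sequence[Optional[float]]) -> Optional[List[float]]:
    """Fill missing items with the person's mean of the answered items.

    Returns None when fewer than half of the items were answered, in which
    case the dimension or scale is treated as unusable.
    """
    answered = [float(x) for x in items if x is not None]
    if not items or 2 * len(answered) < len(items):
        return None
    person_mean = sum(answered) / len(answered)
    return [person_mean if x is None else float(x) for x in items]


# Hypothetical four-item responses.
complete = prorate_impute([5, None, 3, 4])        # one item missing
half = prorate_impute([6, None, None, 2])         # exactly half answered
unusable = prorate_impute([None, None, None, 2])  # fewer than half answered
print(complete, half, unusable)
```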
Baseline Scale Analyses

The average time to administer the telephone interviews was similar for each instrument: 9.3 minutes for the 20-item CRQ, and 11.0 minutes for the SF-36. However, these times do not reflect the elicitation process for the five CRQ patient-specific activities impacted by shortness of breath that took place during the face-to-face enrollment interviews. Although there are no recorded data on the duration of this process, study interviewers made post-hoc range estimates of 5 to 30 minutes, with a strong positive association between the length of time and the age of the outpatient. Two-thirds of the 471 outpatients were female, and two-fifths were Black. The mean age was 58 years (range = 29 to 90 years), 31% had no more than a grade school education, and 46% were widowed or divorced.

Baseline ceiling and floor effects, means, standard deviations, internal consistencies, and SEMs for all CRQ dimensions and SF-36 scales are shown in Table 1. Using 15% as the recommended critical value for the largest proportion of the study population that should score at the highest or lowest possible scale levels [32], there are no ceiling or floor effects for the CRQ. However, the role-physical and role-emotional scales on the SF-36 have both ceiling and floor effects. Ceiling effects are also seen in the social functioning scale. These proportions play an important role in the detection of change because an outpatient baseline score at the top (or bottom) of a scale cannot express improvement (or decline).

The means, standard deviations, and SEMs for each CRQ dimension are presented for both transformed scores, ranging from 0 (worst health) to 100 (best health), and raw scores, ranging from 5 to 35 for dyspnea, 4 to 28 for fatigue, 7 to 49 for emotional function, and 4 to 28 for mastery. The transformed scores allowed for better domain comparisons both within the CRQ dimensions, and between the CRQ dimensions and the SF-36 scales. The raw CRQ scores are important because MCIDs are expressed relative to this metric. Only the transformed scores are presented for the SF-36 scales.

Among the means, it is notable that the fatigue dimension was lower than the other CRQ dimensions, reflecting the constraints of COPD. The transformed standard deviations in both instruments were relatively consistent, with the exception of the SF-36’s role-physical and role-emotional scales. As a result, the SEMs for these scales are comparatively large. Cronbach’s alpha, the measure of reliability used in this study, varies substantially among the domains examined. The CRQ mastery dimension and the SF-36’s general health and social functioning scales fall short of the α > 0.80 minimum for basic research [12].
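The 15% critical value for ceiling and floor effects can be applied mechanically to the percentages reported in Table 1, as in the sketch below. The data structure and function name are illustrative assumptions, as is treating a proportion strictly above 15% as an effect; the percentages are those in Table 1.

```python
CRITICAL_PERCENT = 15.0

# (percent at ceiling, percent at floor), copied from Table 1.
TABLE_1 = {
    "CRQ dyspnea": (6.2, 1.9),
    "CRQ fatigue": (0.0, 1.5),
    "CRQ emotional function": (1.1, 0.0),
    "CRQ mastery": (5.9, 0.2),
    "SF-36 physical functioning": (1.1, 6.8),
    "SF-36 role-physical": (19.7, 51.0),
    "SF-36 bodily pain": (8.7, 4.0),
    "SF-36 general health": (0.0, 1.9),
    "SF-36 vitality": (0.4, 5.9),
    "SF-36 social functioning": (26.3, 1.9),
    "SF-36 role-emotional": (52.4, 29.1),
    "SF-36 mental health": (3.8, 0.4),
}


def effects(ceiling: float, floor: float,
            critical: float = CRITICAL_PERCENT) -> dict:
    """Flag a ceiling or floor effect when the percentage exceeds the cutoff."""
    return {"ceiling": ceiling > critical, "floor": floor > critical}


flagged = {name: effects(*pcts) for name, pcts in TABLE_1.items()}
with_any_effect = sorted(n for n, f in flagged.items() if f["ceiling"] or f["floor"])
print(with_any_effect)
```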


Convergent Validity

To assess convergent validity, we correlated the CRQ dimensions and the SF-36 scales, as shown in Table 2. The fatigue dimension of the CRQ and the vitality scale of the SF-36, and the CRQ emotional function dimension and the SF-36 mental health scale had high baseline correlations (r = 0.79 and 0.82, respectively). This was expected inasmuch as these dimension-scale pairs tap relatively equivalent constructs. Although the CRQ dyspnea dimension measures the physical impact of COPD on outpatients’ activities, it does so in a patient-specific manner that differs from the ten items all outpatients are asked on the SF-36 physical functioning scale. Consequently, the correlation between the dyspnea dimension and the physical functioning scale (r = 0.60) was not of the same magnitude as the fatigue-vitality or emotional function-mental health pairings.

Table 2 also contains important divergent validity evidence for the CRQ. Both the dyspnea and fatigue dimensions of the CRQ had their lowest correlations with the SF-36 role-emotional scale, whereas the CRQ’s emotional function dimension was least correlated with the SF-36’s physical functioning and role-physical scales. In contrast, the mastery dimension of the CRQ does not achieve large correlations with any SF-36 scales. This can be interpreted as reflecting either a lack of convergent validity due to lower scale reliability, or the absence of a relatively equivalent SF-36 scale.
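The dimension-by-scale correlations discussed above can be illustrated with a minimal sketch. It assumes the reported r values are Pearson product-moment correlations, which the article does not state explicitly, and the paired scores below are hypothetical rather than study data.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two paired score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical transformed (0 to 100) scores for five respondents.
crq_fatigue = [20, 35, 50, 65, 80]
sf36_vitality = [25, 30, 55, 60, 85]
r = pearson_r(crq_fatigue, sf36_vitality)
print(round(r, 2))
```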

Discriminant Validity

We examined the ability of the CRQ and SF-36 to reflect differences among known contrasting groups using the disease severity and illness burden measures. First, outpatients

TABLE 1. Baseline content, floor and ceiling effects, internal consistency, means, standard deviations and SEMs for the CRQ and the SF-36 for 471 COPD outpatients^a

| Instrument | Scale | Number of items | % at ceiling | % at floor | Mean scale score^a | Standard deviation^a | Cronbach’s alpha | SEM^a |
|---|---|---|---|---|---|---|---|---|
| CRQ | Dyspnea | 5 | 6.2 | 1.9 | 51.2 (20.4) | 27.3 (8.2) | .91 | 8.36 (2.51) |
| CRQ | Fatigue | 4 | 0.0 | 1.5 | 43.1 (14.3) | 22.6 (5.4) | .84 | 9.00 (2.16) |
| CRQ | Emotional function | 7 | 1.1 | 0.0 | 55.2 (30.2) | 22.1 (9.3) | .88 | 7.60 (3.19) |
| CRQ | Mastery | 4 | 5.9 | 0.2 | 60.0 (18.4) | 22.9 (5.5) | .75 | 11.38 (2.73) |
| SF-36 | Physical functioning | 10 | 1.1 | 6.8 | 34.1 | 24.6 | .90 | 7.65 |
| SF-36 | Role-physical | 4 | 19.7 | 51.0 | 33.0 | 40.1 | .88 | 13.96 |
| SF-36 | Bodily pain | 2 | 8.7 | 4.0 | 47.1 | 26.6 | .85 | 10.28 |
| SF-36 | General health | 5 | 0.0 | 1.9 | 34.7 | 20.2 | .67 | 11.59 |
| SF-36 | Vitality | 4 | 0.4 | 5.9 | 37.4 | 22.0 | .82 | 9.43 |
| SF-36 | Social functioning | 2 | 26.3 | 1.9 | 65.1 | 29.5 | .77 | 14.15 |
| SF-36 | Role-emotional | 3 | 52.4 | 29.1 | 61.1 | 44.3 | .90 | 14.10 |
| SF-36 | Mental health | 5 | 3.8 | 0.4 | 61.2 | 23.9 | .86 | 8.88 |

^a Non-transformed mean scores, standard deviations, and SEMs for the CRQ scales given in parentheses.


TABLE 2. Correlations between the CRQ dimensions and the SF-36 scales^a

| SF-36 scales | CRQ Dyspnea | CRQ Fatigue | CRQ Emotional function | CRQ Mastery |
|---|---|---|---|---|
| Physical functioning | 0.60 | 0.53 | 0.45 | 0.37 |
| Role-physical | 0.50 | 0.50 | 0.45 | 0.41 |
| Bodily pain | 0.37 | 0.50 | 0.47 | 0.31 |
| General health | 0.47 | 0.52 | 0.49 | 0.43 |
| Vitality | 0.55 | 0.79 | 0.57 | 0.44 |
| Social functioning | 0.47 | 0.52 | 0.59 | 0.39 |
| Role-emotional | 0.33 | 0.44 | 0.65 | 0.41 |
| Mental health | 0.35 | 0.57 | 0.82 | 0.46 |

^a All correlations significant at the P < 0.001 level.

were dichotomized by whether or not they had experienced shortness of breath symptoms in the past four weeks. The resulting means for these two groups, adjusted for age, gender and race, are shown in Table 3. Despite the unequal group sizes, the recent occurrence of this important COPD symptom allowed both the CRQ and the SF-36 to display significant group differences on all dimensions and scales.

Second, we categorized outpatients by illness burden using the ACG quintile classifications. Table 4 shows these age, gender and race adjusted means and analysis of variance results. Although the ACG V quintile, which contains those outpatients with the greatest disease burden, had the lowest mean scale score in nearly all CRQ dimensions and SF-36 scales, the overall differences were often not statistically significant. Because four scales of the SF-36 (role-physical, bodily pain, vitality, and social functioning) were able to significantly (P < 0.05) differentiate among the ACG quintiles, but only one of the CRQ’s dimensions (fatigue) could do so, the SF-36 may appear to have superior discriminant validity. Nunnally and Bernstein warn, however, that

such evidence “is often impaired because the criteria is ill-defined or a composite of multiple attributes” [12]. Unlike the shortness of breath comparisons, the ACG quintiles reflect multiple attributes of all comorbid conditions rather than just the target condition (COPD). Therefore, it is not surprising that a generic measure would discriminate better with respect to an overall illness burden (as opposed to a disease-specific severity) measure.

TABLE 3. Mean CRQ and SF-36 scores by breathing problems status, adjusted for age, gender and race

| Instrument | Scale | Breathing problems in the past 4 weeks (n = 414) | No breathing problems in the past 4 weeks (n = 57) | Significance level |
|---|---|---|---|---|
| CRQ | Dyspnea | 46.3 | 87.1 | P < 0.001 |
| CRQ | Fatigue | 40.6 | 61.2 | P < 0.001 |
| CRQ | Emotional function | 53.0 | 71.1 | P < 0.001 |
| CRQ | Mastery | 57.5 | 78.2 | P < 0.001 |
| SF-36 | Physical functioning | 31.1 | 56.1 | P < 0.001 |
| SF-36 | Role-physical | 28.7 | 64.7 | P < 0.001 |
| SF-36 | Bodily pain | 46.0 | 55.3 | P = 0.013 |
| SF-36 | General health | 32.9 | 48.0 | P < 0.001 |
| SF-36 | Vitality | 35.7 | 50.1 | P < 0.001 |
| SF-36 | Social functioning | 63.3 | 78.5 | P < 0.001 |
| SF-36 | Role-emotional | 58.1 | 83.3 | P < 0.001 |
| SF-36 | Mental health | 59.7 | 71.9 | P < 0.001 |

TABLE 4. Mean CRQ and SF-36 scale scores by ACG quintiles, adjusted for age, gender and race

| Instrument | Scale | ACG I: 1 ADG (n = 15) | ACG II: 2–3 ADGs (n = 62) | ACG III: 4–5 ADGs (n = 88) | ACG IV: 6–9 ADGs (n = 181) | ACG V: 10+ ADGs (n = 125) | ANOVA significance level |
|---|---|---|---|---|---|---|---|
| CRQ | Dyspnea | 62.3 | 49.7 | 56.0 | 50.4 | 48.4 | P = 0.120 |
| CRQ | Fatigue | 42.7 | 46.1 | 48.7 | 42.6 | 38.4 | P = 0.012 |
| CRQ | Emotional function | 61.1 | 56.2 | 59.4 | 54.0 | 52.7 | P = 0.116 |
| CRQ | Mastery | 69.3 | 60.8 | 62.1 | 57.8 | 60.0 | P = 0.271 |
| SF-36 | Physical functioning | 49.6 | 35.0 | 34.4 | 34.2 | 31.3 | P = 0.098 |
| SF-36 | Role-physical | 60.2 | 35.9 | 38.4 | 32.8 | 24.9 | P = 0.006 |
| SF-36 | Bodily pain | 68.0 | 52.9 | 50.9 | 47.1 | 39.2 | P < 0.001 |
| SF-36 | General health | 45.0 | 35.4 | 35.8 | 34.6 | 32.7 | P = 0.192 |
| SF-36 | Vitality | 38.7 | 40.7 | 42.4 | 37.1 | 32.8 | P = 0.011 |
| SF-36 | Social functioning | 87.0 | 66.8 | 70.7 | 63.4 | 60.2 | P = 0.003 |
| SF-36 | Role-emotional | 80.8 | 64.1 | 65.1 | 58.2 | 58.9 | P = 0.268 |
| SF-36 | Mental health | 69.5 | 64.2 | 64.7 | 60.1 | 57.6 | P = 0.074 |

Significant (P < 0.01) pairwise comparisons, in the order listed in the source column (row alignment is not recoverable from the extracted text): CRQ: a. SF-36: b; b,c; a,b,c,d; a; a,b,c.

^a Significant pairwise comparison between ACG III and V.
^b Significant pairwise comparison between ACG I and V.
^c Significant pairwise comparison between ACG I and IV.
^d Significant pairwise comparison between ACG II and V.

Factorial Validity

Using exploratory factor analysis, we examined the factorial validity of the CRQ and the SF-36. Factor loadings for the CRQ items obtained using principal axis methods with oblique rotation are shown in Table 5. Both the scree and eigenvalue criteria identified the four hypothesized factors (dimensions), which explained 67% of the variance in the 20 items. The “feel embarrassed by coughing” item from the emotional function dimension, and the “feel confident you could deal with your illness” item from the mastery dimension, however, did not have a principal loading (≥0.40) on any factor. Closer examination of the emotional function dimension indicates that the items with principal loadings all involve generic (as opposed to disease-specific) emotions. In contrast, the “coughing” item is disease-specific and is also confounded by physical symptomology. The “feel confident” item has similar problems. All of the other mastery items involve generic elements referring to breathing, but the “feel confident” item is specific to the target disease. These item-specific limitations notwithstanding, the exploratory factor analysis results of the items within each hypothesized CRQ dimension indicated unidimensionality.

The principal axis factor loadings for the 35 SF-36 items obtained using oblique rotation are shown in Table 6. As has been frequently reported, scree and eigenvalue criteria did not identify all eight hypothesized factors [33]. Rather, six factors were found that explained 61% of the variance in these items. All ten of the physical functioning scale items loaded together, as did the role-physical and the bodily pain items. All three role-emotional items loaded onto the same factor as the five mental health items. Three of the four vitality items loaded together, but the “feel full of pep” item had no principal loading. Similarly, three of the five general health items comprised the sixth identified factor, but two of this scale’s items, “get sick a little easier than other people” and “expect health to get worse,” had no principal loadings. Finally, neither of the social functioning items had principal loadings. It should be noted, however, that when the exploratory factor analysis was performed within each hypothesized scale, unidimensionality and acceptable principal loadings were observed.

Follow-Up Descriptive Data

Of the 417 outpatients re-interviewed, 398 answered every item on their last CRQ interview, 413 had no missing items on their last SF-36 interview, and 394 (94%) answered all items on both the CRQ and the SF-36. By applying prorated imputation methods, the number of usable last follow-up interviews was 410 for the CRQ, 408 for the SF-36, and 403 (97%) for both. There were 393 outpatients with complete or imputed scores on both instruments at both baseline and last follow-up interview, which was 81% of the baseline interviewees and 94% of those re-interviewed. We performed all follow-up and change score analyses using the data from these 393 outpatients.
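The prorated imputation step can be illustrated with a minimal sketch. The exact rule used in this study is not stated here, so the sketch assumes a common convention: a scale is scored only when at least half of its items were answered, and each missing item is replaced by the respondent's mean on the answered items. The function name, the half-scale cutoff, and the example responses are all hypothetical.

```python
from typing import Optional, Sequence


def prorated_scale_score(items: Sequence[Optional[float]],
                         min_answered_fraction: float = 0.5) -> Optional[float]:
    """Return a raw scale score, prorating over missing items.

    Assumption (not taken from the paper): the scale is scored only when
    at least `min_answered_fraction` of its items were answered, and each
    missing item is replaced by the mean of the answered items.
    """
    answered = [x for x in items if x is not None]
    if len(answered) < min_answered_fraction * len(items):
        return None  # too many missing items to impute a score
    return sum(answered) / len(answered) * len(items)


# Hypothetical respondent: a 4-item scale with one skipped item.
score = prorated_scale_score([5, None, 4, 6])
print(score)
```

Under these assumptions a respondent with too few answered items yields no score and would be dropped from the usable interviews, which is one way counts such as 410 usable of 417 re-interviewed could arise.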

Follow-Up Scale Analyses

With only a slight increase in educational attainment, these 393 outpatients had demographic attributes nearly identical to those of the 471 analyzed at baseline (67% female, 58% Black, mean age = 58 years, 29% with only a grade school education, and 46% widowed or divorced). Eighty-eight of the 393 outpatients received two interviews (baseline and one follow-up), 286 received three interviews, and 19 were interviewed four times. The length of time between baseline and last re-interview was roughly six months for 56 outpatients, twelve months for 312 outpatients, and eighteen months for 25 outpatients. A one-way analysis of variance among these three time groups for all CRQ dimension and SF-36 scale change scores found no significant (P < 0.05) differences. Consequently, we used the last re-interview scores for these 393 outpatients.


TABLE 5. Factor loadings obtained from the exploratory factor analysis of the 20 CRQ items using principal axis methods with oblique rotationᵃ

Factor  Loading  CRQ items
                 Dyspnea items
I        0.69      The first important activity impacted by shortness of breath
I        0.75      The second important activity impacted by shortness of breath
I        0.70      The third important activity impacted by shortness of breath
I        0.73      The fourth important activity impacted by shortness of breath
I        0.66      The fifth important activity impacted by shortness of breath
                 Emotional function items
II       0.71      Feel frustrated or impatient
II       0.89      Feel upset, worried, or depressed
II       0.45      Feel relaxed and free of tension
II       0.87      Feel discouraged or down in the dumps
II       0.62      Feel restless, tense, or uptight
II       0.63      Feel happy with your personal life
                   Feel embarrassed by coughingᵇ
                 Mastery items
III      0.75      Fear or panic when difficulty getting breath
III      0.55      Feel you have control of your breathing problems
III      0.75      Feel upset or scared when difficulty getting breath
                   Feel confident you could deal with your illnessᵇ
                 Fatigue items
IV       0.64      Have much energy
IV       0.66      Feel low in energy
IV       0.69      Feel worn out or sluggish
IV       0.49      Feel tired

ᵃFactor loadings smaller than ±0.40 are omitted for clarity.
ᵇAll factor loadings for these items are less than 0.40 in absolute value.

TABLE 6. Factor loadings obtained from the exploratory factor analysis of the 35 SF-36 items using principal axis methods with oblique rotationᵃ

Factor  Loading  SF-36 items
                 Physical functioning items
I        0.50      Vigorous activities
I        0.62      Moderate activities
I        0.60      Lifting or carrying groceries
I        0.63      Climbing several flights of stairs
I        0.74      Climbing one flight of stairs
I        0.53      Bending, kneeling, or stooping
I        0.71      Walking more than a mile
I        0.81      Walking several blocks
I        0.83      Walking one block
I        0.53      Bathing or dressing yourself
                 Role-emotional items
II      -0.84      Cut down the amount of time on work and other activities due to emotional problems
II      -0.83      Accomplished less than you would like due to emotional problems
II      -0.76      Didn’t do work or other activities as carefully as usual
                 Mental health items
II      -0.56      Very nervous
II      -0.73      Down in the dumps
II      -0.41      Calm and peaceful
II      -0.67      Downhearted and blue
II      -0.46      Happy
                 Role-physical items
III     -0.57      Cut down the amount of time on work or other activities
III     -0.66      Accomplished less than you would like due to physical problems
III     -0.79      Limited in the kind of work or other activities due to physical problems
III     -0.70      Had difficulty performing the work or other activities due to physical problems
                 Vitality items
IV       0.55      Have lots of energy
IV       0.56      Feel worn out
IV       0.64      Feel tired
                   Feel full of pepᵇ
                 Bodily pain items
V       -0.86      Pain interfering with normal work
V       -0.84      Bodily pain
                 General health items
VI       0.52      General health
VI       0.68      As healthy as anybody
VI       0.71      Excellent health
                   Get sick a little easier than other peopleᵇ
                   Expect health to get worseᵇ
                 Social functioning items
                   Extent that physical health or emotional problems interfered with normal social activitiesᵇ
                   Time that physical health or emotional problems interfered with social activitiesᵇ

ᵃFactor loadings smaller than ±0.40 are omitted for clarity.
ᵇAll factor loadings for these items are less than 0.40 in absolute value.

Ceiling and floor effects, along with means, standard deviations, internal consistencies, and SEMs for all CRQ dimensions and SF-36 scales at the last follow-up interview are shown in Table 7. These aggregated parameters reflected very little change from those found at baseline. There were substantial ceiling and floor effects associated with the role-physical and role-emotional scales (along with high standard deviations). Ceiling effects remained evident in the social functioning scale, and low reliability persisted in the general health and social functioning scales. The reliability of the CRQ mastery dimension at follow-up was marginally improved over baseline. Despite the reduced number of follow-up outpatients (393 vs. 471 at baseline), all dimension and scale SEM values were essentially unchanged from baseline, which is consistent with this statistic’s theoretically fixed quality.

TABLE 7. Floor and ceiling effects, means, standard deviations, internal consistency and SEMs for the CRQ and the SF-36 at follow-up among 393 COPD outpatients at last re-interviewsᵃ

Instrument  Scale                 % at Ceiling  % at Floor  Mean scale scoreᵃ  Standard deviationᵃ  Cronbach’s alpha  SEMᵃ
CRQ         Dyspnea               4.8           1.8         51.7 (20.5)        26.7 (8.0)           0.90              8.55 (2.56)
            Fatigue               1.0           1.3         44.0 (14.6)        22.3 (5.4)           0.85              8.70 (2.09)
            Emotional function    0.5           0.3         56.8 (30.9)        21.4 (9.0)           0.88              7.39 (3.10)
            Mastery               6.6           0.3         61.6 (18.8)        23.6 (5.7)           0.80              10.69 (2.56)
SF-36       Physical functioning  0.5           3.8         35.6               24.2                 0.90              7.62
            Role-physical         18.8          49.9        31.4               39.0                 0.87              14.25
            Bodily pain           7.9           3.1         47.9               26.6                 0.85              10.38
            General health        0.0           1.3         36.0               22.7                 0.73              11.70
            Vitality              0.3           5.1         38.7               21.6                 0.82              9.31
            Social functioning    26.2          1.3         65.7               28.1                 0.76              13.77
            Role-emotional        55.0          24.2        65.3               42.5                 0.87              15.12
            Mental health         2.8           0.0         63.3               22.8                 0.86              8.44

ᵃNon-transformed mean scores, standard deviations, and SEMs for the CRQ scales given in parentheses.

The SEM and MCID Linkage

Previous research on the MCID in each CRQ dimension has concluded that an average per item change of 0.5 in each dimension is a sufficient and straightforward criterion for indicating clinically meaningful change [3]. In these data, the CRQ baseline dimensional raw score analog of a one-SEM change divided by the number of items in the target dimension was 0.50 for dyspnea, 0.54 for fatigue, and 0.46 for emotional function. These results are equivalent to those found in our previous study, supporting the one-SEM criterion as a validated method for defining individual change that is at least a minimal clinically important difference [9].

In order to validate this criterion in these data, we classified each follow-up outpatient’s change score in the three CRQ dimensions by the one-SEM criterion and the established MCID standards. For the one-SEM classification, outpatients were considered to be either improved, declined, or stable on each CRQ dimension if their score went up one or more SEMs, went down one or more SEMs, or remained within one SEM, respectively, between baseline and the last follow-up. Similarly, for the MCID classification, outpatients were considered to be either improved, declined, or stable based on whether their score went up one or more MCIDs, went down one or more MCIDs, or remained within one MCID, respectively, between baseline and the last follow-up. As shown in Tables 8–10, when these classifications were cross-tabulated within each of the three CRQ dimensions, the resulting weighted kappa statistics were 0.99, 0.85, and 0.99 for the dyspnea, fatigue, and emotional function dimensions, respectively, indicating excellent agreement. All 71 outpatients misclassified in the fatigue dimension had change scores that barely missed the one-SEM criterion of ±2.16 points, but met the MCID change standard of ±2 points. Thus, the fatigue dimension’s relatively smaller weighted kappa (0.85) entirely reflects these minor gradation differences between the SEM and MCID gauges.

One-SEM Classifications

To demonstrate the amount of clinically relevant individual change occurring in these data, Table 11 contains the distribution of those who improved, declined, or remained stable on each CRQ dimension and SF-36 scale, with these change classifications being defined by the one-SEM criterion. As indicated, considerable change occurred, and this change occurred in similar percentages for most CRQ dimensions and SF-36 scales. The exceptions were the general health, social functioning, and role-emotional scales, for which the SEMs were quite large.
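The two classification rules and their agreement statistic can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: the function names are hypothetical, the weighted kappa uses quadratic agreement weights (the text does not say which weights were applied), and the cell counts are read from Table 9 with rows taken as the MCID classification, columns as the one-SEM classification, and empty cells as zero.

```python
from typing import List


def classify_change(change: float, threshold: float) -> str:
    """Label a change score as improved, stable, or declined.

    A change of at least `threshold` in either direction counts as
    change; anything strictly inside the threshold counts as stable.
    """
    if change >= threshold:
        return "improved"
    if change <= -threshold:
        return "declined"
    return "stable"


def weighted_kappa(table: List[List[int]]) -> float:
    """Cohen's weighted kappa for a square table of ordered categories.

    Assumption: quadratic agreement weights, w = 1 - (d / (k - 1)) ** 2,
    where d is the distance between the row and column categories.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0
    expected = 0.0
    for i in range(k):
        for j in range(k):
            w = 1.0 - ((i - j) / (k - 1)) ** 2
            observed += w * table[i][j] / n
            expected += w * row_totals[i] * col_totals[j] / (n * n)
    return (observed - expected) / (1.0 - expected)


# Fatigue thresholds in raw points, as quoted in the text.
ONE_SEM_FATIGUE = 2.16
MCID_FATIGUE = 2.0

# A 2-point gain meets the MCID standard but falls short of one SEM.
by_mcid = classify_change(2.0, MCID_FATIGUE)
by_sem = classify_change(2.0, ONE_SEM_FATIGUE)

# Table 9 counts; rows = MCID class, columns = one-SEM class,
# each ordered improved, stable, declined (assumed cell placement).
fatigue_table = [
    [95, 41, 0],
    [0, 120, 0],
    [0, 30, 107],
]
kappa_fatigue = weighted_kappa(fatigue_table)
print(by_mcid, by_sem, round(kappa_fatigue, 2))
```

With quadratic weights the fatigue table yields a kappa of about 0.85, in line with the reported value, which suggests (but does not confirm) that quadratic weights were used. The 2-point example shows why the 71 discordant outpatients all fall in the one-SEM "stable" column.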

TABLE 8. Cross-classification of outpatient change scores in the CRQ dyspnea dimension by the one-SEM criterion and the MCID standard

                          Classification by one-SEM
Classification by MCID    Improved    Stable    Declined
Improved                  109         4         0
Stable                    0           166       0
Declined                  0           0         114

Weighted kappa = 0.99; P < 0.001.

TABLE 9. Cross-classification of outpatient change scores in the CRQ fatigue dimension by the one-SEM criterion and the MCID standard

                          Classification by one-SEM
Classification by MCID    Improved    Stable    Declined
Improved                  95          41        0
Stable                    0           120       0
Declined                  0           30        107

Weighted kappa = 0.85; P < 0.001.

TABLE 10. Cross-classification of outpatient change scores in the CRQ emotional functioning dimension by the one-SEM criterion and the MCID standard

                          Classification by one-SEM
Classification by MCID    Improved    Stable    Declined
Improved                  103         0         0
Stable                    1           183       0
Declined                  0           0         106

Weighted kappa = 0.99; P < 0.001.

Sensitivity Analysis

In contrast to the eligibility criteria used in this study, current standards for the diagnosis of COPD require patients of any age, with or without asthma, to also have a diagnosis of chronic bronchitis, emphysema, or COPD [30]. These criteria were met by 334 of the 471 outpatients at baseline, and by 278 of the 393 outpatients at follow-up. When we examined the baseline ceiling and floor effects, means, standard deviations, internal consistencies, SEMs, and convergent and discriminant validity for this subset of outpatients, no meaningful differences from Tables 1–5 were found. The baseline CRQ dimensional SEMs were 2.57, 2.15, and 3.20 for dyspnea, fatigue, and emotional function, respectively, and the resulting SEM per item ratios were 0.51, 0.54, and 0.46. Similarly, the results obtained from the subset at follow-up were remarkably similar to those shown in Tables 7–11. The only notable difference involved the factorial validity results for the SF-36 (Table 6). All four items from the vitality scale had principal factor loadings (≥0.40) on the same exclusive factor.

DISCUSSION

This study adds to the evidence supporting the contention that a one-SEM criterion can be used to identify meaningful intra-individual change in HRQoL instruments. In all three CRQ dimensions with established MCID standards, one SEM identified clinically relevant change. Furthermore, when we applied the one-SEM criterion to the SF-36, those scales with acceptable psychometric properties (minimal floor and ceiling effects, reasonable variance, and acceptable internal consistency) showed percentages of individuals classified as improved, stable, or declined similar to those of the three CRQ dimensions with established MCIDs. Although clinical change standards have not been established for the SF-36 [14], these results show promise for the use of an SEM-based change criterion with some scales of this generic measure.

Perhaps the most important property of the SEM as a marker for clinically relevant change is its invariance across all but the extremes of the population’s ability levels [12]. For HRQoL instruments that provide accurate and responsive measurements in longitudinal applications, the MCID should be invariant across a heterogeneous population of patients. Likewise, any statistical marker corresponding to the MCID should be effectively invariant. To demonstrate the SEM’s ability to meet this challenge, we calculated the SEM among only those outpatients in the lowest quartile on the dyspnea dimension at baseline. The untransformed dimensional scores of this subset ranged from 5 to 14. The CRQ baseline dimensional raw score analog of a one-SEM change divided by the number of items for this lowest ability quartile was 0.49 for dyspnea, 0.50 for fatigue, and 0.49 for emotional function. Thus, even among outpatients with the most personally disturbing shortness of breath symptoms, the one-SEM criterion continued to identify the established MCID standards.
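The per-item values for the lowest quartile can be approximated from the subset's standard deviations and reliabilities reported in this Discussion. The sketch below assumes the standard classical test theory definition, SEM = SD × sqrt(1 − reliability); the function names are hypothetical, and the item counts (5, 4, and 7) are taken from the Table 5 item lists. Dividing one SEM by the standard deviation gives sqrt(1 − reliability), which is the individual effect size compared against Cohen's standards [34].

```python
import math


def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def one_sem_effect_size(reliability: float) -> float:
    """One SEM expressed in SD units, i.e. sqrt(1 - reliability)."""
    return math.sqrt(1.0 - reliability)


# Lowest-quartile subset: dimension -> (SD, reliability, number of items).
# SDs and reliabilities are the published, rounded values.
lowest_quartile = {
    "dyspnea": (2.83, 0.26, 5),
    "fatigue": (3.90, 0.74, 4),
    "emotional function": (8.40, 0.84, 7),
}

per_item = {
    name: sem(sd, rel) / n_items
    for name, (sd, rel, n_items) in lowest_quartile.items()
}
for name, value in per_item.items():
    print(f"{name}: one SEM per item = {value:.2f}")

# Reliabilities used in the comparison with Cohen's standards.
effect_high = one_sem_effect_size(0.92)
effect_low = one_sem_effect_size(0.75)
print(f"{effect_high:.2f} {effect_low:.2f}")
```

Run on the rounded inputs, the per-item values come out near 0.49, 0.50, and 0.48, matching the reported 0.49, 0.50, and 0.49 to within rounding, and the two effect sizes come out near 0.28 and 0.50.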


TABLE 11. Baseline SEMs for transformed scores and percents improving, remaining stable or declining at least one SEM on each CRQ dimension and SF-36 scale

Instrument  Dimension or scale    Transformed value of one SEM  Percent with one SEM improvement  Percent stable with one SEM criterion  Percent with one SEM decline
CRQ         Dyspnea               8.36                          27.7                              43.3                                   29.0
            Fatigue               9.00                          24.2                              48.6                                   27.2
            Emotional function    7.60                          26.5                              46.6                                   27.0
            Mastery               11.38                         30.3                              41.2                                   28.5
SF-36       Physical functioning  7.65                          31.3                              37.7                                   31.0
            Role-physical         13.96                         26.2                              48.1                                   25.7
            Bodily pain           10.28                         28.5                              43.8                                   27.7
            General health        11.59                         21.6                              57.0                                   21.4
            Vitality              9.44                          32.1                              35.4                                   32.6
            Social functioning    14.15                         22.4                              54.5                                   23.2
            Role-emotional        14.10                         21.1                              57.0                                   21.9
            Mental health         8.88                          29.3                              43.3                                   27.5

It is the SEM’s sample independence that produces the statistic’s consistent and relatively invariant quality within a population. This contrasts with the SEM’s component statistics, reliability and variance, which display considerable sample dependence. For example, within the subset of outpatients in the lowest quartile on the dyspnea dimension, the dimensional reliability estimates were 0.26, 0.74, and 0.84, respectively, for dyspnea, fatigue, and emotional function, with standard deviations of 2.83, 3.90, and 8.40. These values vary markedly from those shown in Table 1 for all outpatients. Thus, any change criteria that rely on the standard deviation without incorporating the sample’s reliability, such as the family of popular effect size measures, are quite sample-specific, and have limited generalizability.

In addition to its other features, the one-SEM criterion is consistent with Cohen’s generally accepted standards for change [34]. This occurs through a reduction in the magnitude of change that a highly reliable domain must demonstrate in order for that change to be considered clinically relevant. For example, in a domain with a reliability estimate of 0.92, a one-SEM change equates to an individual effect size of 0.28. This is considered a small (minimal) change by Cohen’s standards [34]. However, for a domain with a reliability estimate of 0.75, a one-SEM change equates to an individual effect size of 0.50. This signifies a moderate shift [34], and reflects the larger individual change needed to detect an MCID in a less reliable measure.

The extended investigation of the baseline reliability and validity of these instruments was an essential step to evaluating change for two reasons. First, our goal was to identify a statistical marker of clinically relevant change. No matter how much change occurs, if the latent construct being measured is not reliably defined, understood, and measured, the observed change is meaningless. Second, our SEM calculations used Cronbach’s alpha to estimate reliability, which requires unidimensionality within domains [12]. In addition, the presentation of these psychometric properties of the CRQ and the SF-36 should provide physicians and researchers working with a similar population of outpatients the opportunity to evaluate which instrument provides the most useful HRQoL data for their purposes, if time, financial, or other constraints permit using only one.

As indicated above, we used Cronbach’s alpha as the reliability estimate in both the psychometric assessments and the SEM calculations. In proposing the SEM for evaluating individual change on HRQoL measures, however, McHorney and Tarlov advocated test-retest methods [32]. There are two reasons why that approach is unsuitable. The first involves situations where the target condition, like COPD, can result in highly fluid dimensional scores even over very short time periods. For example, a 1994 assessment of the CRQ reported test-retest correlations of only 0.20 for the fatigue dimension when repeated nine days apart, even among patients whose COPD was judged to be well-managed [35]. Minimal changes in the activity levels of COPD patients can simply drain their energy and result in greater reports of tiredness. Under those circumstances, test-retest methods underestimate reliability. The second reason militating against the use of test-retest methods is that when using multiple-item measures, it is possible to achieve the same numeric summary scores at test and retest even though the individual item responses have changed. Under those circumstances, test-retest methods overestimate reliability [12]. Furthermore, only internal consistency methods account for heterogeneity and content sampling error if the items in a domain encompass differing levels of difficulty [10].

This study is not without limitations. The most important of these stems from the development and validation of the MCIDs. These were originally established by researchers at McMaster University by grouping data from COPD and congestive heart failure patients together, even though these patients had different diseases and responded to slightly different instruments. Moreover, they used a consensus panel that identified clinically meaningful change in the abstract rather than in reference to specific patients, and each patient’s global rating of change was never assessed by their individual physician. These limitations minimize the clinical nature of the MCID standards. In addition, the test-retest reliability of the single-item patient global change ratings was never assessed, the clinical change levels were arbitrarily defined, and it was assumed that change related to a minimal improvement was the same as that for a minimal decline [36]. Thus, the “gold standard” against which the one-SEM criterion is compared may be somewhat less than pure. Moreover, these MCID standards were established among patients enrolled in clinical trials at McMaster University, and may not apply to this study’s outpatient population.

These limitations notwithstanding, this study provides additional evidence that the one-SEM criterion holds promise for identifying clinically meaningful intra-individual change in reliable and valid HRQoL instruments. Further research is needed, however, before these results can be generalized. Other HRQoL instruments with established MCID standards need to be tested in different patient populations to demonstrate the equivalence of the SEM and the MCID. Unfortunately, few such HRQoL measures exist. Therefore, either a validated statistic like the SEM needs to become a standard “rule of thumb” for identifying clinically relevant individual HRQoL change, or the process of establishing MCID estimates needs to be systematized. Either or (preferably) both options must occur before HRQoL measures can realize their potential as practical and beneficial evaluation tools routinely used in clinical practice [37].

This research was supported by grants to Dr. Wolinsky from the National Institutes of Health (R37 AG-09692) and to Dr. Tierney from the Agency for Health Care Policy Research (R01 HS-07763). An earlier version of the results in this paper was presented at the International Society for Quality of Life Research Annual Conference on November 17, 1998. The opinions expressed here are those of the authors and do not necessarily reflect those of the funding agencies, academic, research, or governmental institutions involved. The authors would like to thank Elena M. Andersen, Ph.D., James G. Gurney, Ph.D., and three reviewers for their helpful suggestions and assistance.

References

1. Guyatt G, Kirshner B, Jaeschke R. Measuring health status: What are the necessary measurement properties? J Clin Epidemiol 1992; 45: 1341–1345.
2. Juniper E, Guyatt G, Willan A, Griffith L. Determining a minimal important change in a disease-specific quality of life questionnaire. J Clin Epidemiol 1994; 47: 81–87.
3. Jaeschke R, Singer J, Guyatt G. Measurement of health status: Ascertaining the minimal clinically important difference. Con Clin Trial 1989; 10: 400–415.
4. Liang J. Evaluating measurement responsiveness. J Rheum 1995; 22(6): 1191–1192.
5. Feinstein A. Benefits and obstacles for development of health status assessment measures in clinical settings. Med Care 1992; 30: MS50–MS56.
6. Thier S. Forces motivating the use of health status assessment measures in clinical settings and related clinical research. Med Care 1992; 30: MS15–MS22.
7. Guyatt G, Nogradi S, Halcrow S, Singer J, Sullivan M, Fallen E. Development and testing a new measure of health status for clinical trials in heart failure. J Gen Intern Med 1989; 4: 101–107.
8. Ware J, Jr., Sherbourne C. The MOS 36-item short-form health survey (SF-36): I. Conceptual framework and item selection. Med Care 1992; 30: 473–483.
9. Wyrwich K, Nienaber N, Tierney W, Wolinsky F. Linking clinical relevance and statistical significance in evaluating intra-individual changes in health-related quality of life. Med Care 1999; 37: 469–478.
10. Anastasi A, Urbina S. Psychological Testing (7th edition). Upper Saddle River, NJ: Prentice-Hall; 1997.
11. Jacobson N, Follette W, Revenstorf D. Psychotherapy outcome research: Methods for reporting variability and evaluating clinical significance. Behav Ther 1984; 15: 336–352.
12. Nunnally J, Bernstein I. Psychometric Theory. New York, NY, USA: McGraw Hill; 1994.
13. Guyatt G, Berman L, Townsend M, Pugsley S, Chambers L. A measure of quality of life for clinical trials in chronic lung disease. Thorax 1987; 42: 773–778.
14. Shin-Ping T, McDonnell M, Spertus J, Steele B, Fihn S. A new self-administered questionnaire to monitor health-related quality of life in patients with COPD. Chest 1997; 112: 614–622.
15. McDonald C, Tierney W, Overhage J, Martin D, Wilson G. The Regenstrief Medical Record System: 20 years of experience in hospitals, clinics, and neighborhood health centers. MD Computing 1992; 9: 206–217.
16. Tierney W, Miller M, Hui S, McDonald C. Practice randomization and clinical research. The Indiana experience. Med Care 1991; 29: JS57–JS64.
17. American Thoracic Society. Standards for the diagnosis and care of patients with chronic obstructive pulmonary disease (COPD) and asthma. Am Rev Respir Dis 1987; 136: 225–244.
18. McSweeney A, Labuhn K. Quality of life in chronic obstructive pulmonary disease. In: Spilker B, Ed. Quality of Life and Pharmacoeconomics in Clinical Trials. Philadelphia, PA, USA: Lippincott-Raven; 1996: 961–976.
19. Kazis L, Skinner K, Miller D, Clark J, Lee A. Quality of life for the chronically ill elderly. In: Romeis J, Coe R, Morley J, eds. Applying Health Services Research to Long Term Care. New York, NY, USA: Springer; 1996: 43–75.
20. Guyatt G, Berman L, Townsend M, Pugsley S. Quality of life in patients with chronic airflow limitations. Br J Dis Chest 1987; 81: 45–54.
21. Fink A, Kosecoff J, Chassin M, Brook R. Consensus methods: Characteristics and guidelines for use. Am J Pub Health 1984; 74: 979–983.
22. Redelmeier D, Guyatt G, Goldstein R. Assessing the minimal important difference in symptoms: A comparison of two techniques. J Clin Epidemiol 1996; 49(11): 1215–1219.
23. Wright J. The minimal important difference: Who’s to say what is important? J Clin Epidemiol 1996; 49(11): 1221–1222.
24. Ware J, Jr. The SF-36 Health Survey. In: Spilker B, Ed. Quality of Life and Pharmacoeconomics in Clinical Trials. 2nd ed. Philadelphia, PA, USA: Lippincott-Raven; 1996: 337–345.
25. Ware J, Jr., Snow K, Kosinski M, Gandek B. The SF-36 Health Survey Manual and Interpretation Guide. Boston, MA, USA: The Health Institute, New England Medical Center; 1993.
26. McHorney C, Haley S, Ware J, Jr. Evaluation of the MOS SF-36 Physical Functioning Scale (PF-10): II. Comparison of relative precision using Likert and Rasch scoring methods. J Clin Epidemiol 1997; 50(4): 451–461.
27. Ware J, Jr., Kemp J, Buchner D, Singer A, Nolop K, Goss T. The responsiveness of disease-specific and generic health measures to changes in the severity of asthma among adults. Qual Life Res 1998; 7: 235–244.
28. Weiner J, Starfield B, Steinwachs D, Mumford L. Development and application of a population-oriented measure of ambulatory care case-mix. Med Care 1991; 29: 452–472.
29. Starfield B, Weiner J, Mumford L, Steinwachs D. Ambulatory care groups: A categorization of diagnoses for research and management. HSR 1991; 26: 53–74.
30. American Thoracic Society. Standards for the diagnosis and care of patients with chronic obstructive pulmonary disease. Am J Respir Crit Care Med 1995; (152): S78–S116.
31. SPSS. SPSS for Windows, release 8.0.0. Chicago, IL, USA: SPSS, Inc.; 1998.
32. McHorney C, Tarlov A. Individual-patient monitoring in clinical practice: Are available health status surveys adequate? Qual Life Res 1995; 4: 293–307.
33. Wolinsky F, Wyrwich K, Nienaber N, Tierney W. Generic vs. disease-specific health status measures: An example using coronary artery disease and/or congestive heart failure patients. Eval & Health Prof 1998; 21(2): 216–243.
34. Cohen J. Statistical Power Analysis for the Behavioral Sciences. New York, NY, USA: Academic Press; 1977.
35. Martin L. Validity and reliability of a quality-of-life instrument. Clin Nurs Res 1994; 3(2): 146–156.
36. Norman G, Stratford P, Regehr G. Methodological problems in the retrospective computation of responsiveness to change: The lessons of Cronbach. J Clin Epidemiol 1997; 50(8): 869–879.
37. Clancy C, Eisenberg J. Outcomes research: Measuring the end results of health care. Science 1998; 282: 245–246.