A comparison of health-related quality-of-life measures for rheumatoid arthritis research


A Comparison of Health-Related Quality-of-Life Measures for Rheumatoid Arthritis Research Claire Bombardier, MD, Janet Raboud, MSc, and the Auranofin Cooperating Group Rheumatic Disease Unit, Wellesley Hospital, and Division of Community Health, University of Toronto, Canada

ABSTRACT: Twenty-eight instruments measuring pain, clinical, functional, and global characteristics were administered to 303 patients in a six-month randomized clinical trial of auranofin and placebo in the treatment of patients with rheumatoid arthritis. The instruments were compared with respect to their responsiveness in detecting a treatment effect, the time involved in administering the instrument, the need for the presence of an interviewer, and ease of administration. The instruments' ability to detect a treatment effect was the deciding characteristic in the clinical, pain, and global categories in choosing the preferred instrument. The counts of tender and swollen joints were found to be the most responsive clinical measures, the 10-cm pain line was the most responsive and the fastest to administer of the pain instruments, and the categorical self-assessment of arthritis was the most responsive global measure. In the functional ability category, the Health Assessment Questionnaire (HAQ), the Keitel Functional Assessment, and the Quality of Well-Being (QWB) Questionnaire were equally responsive. The HAQ was the shortest and the only self-administered questionnaire. The QWB has had the most extensive validation work but was a complex instrument requiring intensive interviewer training. The Keitel was the most time-consuming instrument, but had the advantage of high interobserver agreement. The design of future trials can be guided by the information obtained in this study on the instruments' relative efficiencies and ease of use.

KEY WORDS: Auranofin in rheumatoid arthritis, randomized clinical trials, sensitivity to change, health status, quality of life

INTRODUCTION

Rheumatoid arthritis (RA) is a chronic inflammatory disease that may lead to significant disabilities. Interest in measuring RA patients' function resulted in the creation in the 1940s of a few simple scales such as the ARA functional classification [1]. In recent years, a large number of arthritis-specific HRQL scales have been proposed [2].

Address reprint requests to: Dr. Claire Bombardier, Wellesley Hospital, 160 Wellesley St. E., Toronto, Ontario M4Y 1J3, Canada. Received July 11, 1990; revised March 25, 1991. A complete list of the participants in this study and their affiliations appears in the Appendix. Controlled Clinical Trials 12:243S-256S (1991) © C. Bombardier


A clinical trial was designed to compare 28 arthritis measures and assess a new therapeutic agent, auranofin. Outcome measures were grouped into four types (clinical, functional, pain, and global measures), and a composite score was calculated for each of the four groups. The primary hypotheses were that auranofin would reduce disease severity in each of the four dimensions. Results of the efficacy analysis based on the composite measures are reported in a previous paper [3]. Three of the four composite scores (clinical, functional and global) reached predetermined statistical significance. The magnitude of the change was comparable across the four groups. In this paper, various analytic approaches for comparing the performance of these twenty-eight outcome measures are discussed. Measurement characteristics to consider in evaluating the usefulness of quality of life instruments have been published [4]. Here, only two characteristics are discussed: feasibility or acceptability and the ability to detect a treatment effect.

PATIENTS AND METHODS

Study Design

Fourteen clinical centres in the United States and Canada participated in a six-month, double-masked, placebo-controlled study of auranofin as a treatment for patients with rheumatoid arthritis. Twenty-eight instruments were used to measure patients' pain, clinical, functional, and global response to treatment.

Patient Selection

Patients who met the following entry criteria were eligible:

1. Presence of adult-onset, classic, or definite rheumatoid arthritis.
2. Unremitting disease for at least six months.
3. Age between 18 and 65 years.
4. Six or more tender joints.
5. A pain assessment greater than 3 cm on a 10-cm visual analogue scale.

EVALUATION OF RESPONSE

The measures used may be divided into clinical and health status measures. The measures and their frequency of administration are shown in Table 1.

Clinical Measures

Traditional assessment of disease activity included:

1. Number of joints tender on pressure and/or painful on motion.
2. Number of swollen joints.
3. Duration of morning stiffness.
4. Time to walk 50 feet.
5. Grip strength.

Table 1. Measures of Outcome and Their Frequency of Assessment

Assessment points (columns): Baseline (Two Weeks Before; Drug Start Day) and Study Month One, Two, Three, Four, Five, Six.

Measures (rows):

Clinical: Tender Joints; Swollen Joints; 50-foot walk time; Duration of Morning Stiffness; Grip Strength
Function: Health Assessment Questionnaire; Keitel Assessment; QWB Questionnaire; Toronto ADL (change)
Pain: McGill Pain Questionnaire; Pain Ladder Scale; 10-cm Pain Line
Global Impression, Arthritis: Categorical Scale; Ladder Scale
Global Impression, Overall Health: Ladder (Current); Ladder (6-day mean); Rand; 10 cm (by patient); 10 cm (by physician)
Utility: Patient Utility Measurement; Standard Gamble Questionnaire; Willingness to pay Questionnaire
Other: NIMH Depression Quest.; Rand General Health Perception; ESR

[The X marks showing which measure was assessed at which visit are not reproduced here.]

Health Status Measures

Most of these measures were in the form of structured questionnaires administered by trained interviewers. Prior to the study, the interviewers received a procedure manual and undertook 30 hours of home study and four days of centralized training. During the study, the coded questionnaires were audited weekly by central staff; all early interviews and a random sample of later ones were recorded on tapes audited centrally for standardized execution. Four categories of measurement may be distinguished.

Function

The Health Assessment Questionnaire (HAQ) [5] is a short self-administered instrument which asks questions about eight areas of daily life including physical activity, mobility, and role functions. The Keitel Assessment [6] involves a series of 23 range-of-motion tasks. The interviewer assigns a variable difficulty score for each exercise. The Bush Quality of Well-Being Questionnaire (QWB) [7-10] examines three areas (mobility, physical activity, and social activity) and 22 symptom-problem complexes. The importance given to each health state depends on the weight given to that health state on a zero to one scale (zero is dead and one is perfectly healthy). Weights are derived from interviews of rheumatoid arthritis patients [11].

Pain

The McGill Pain Questionnaire [12,13] lists 20 groups of words describing pain. From each group, a word may be selected by the patient to describe "present pain." Responses are valued from six to zero, giving a total score from 78 (worst) to zero. The Pain Ladder Scale, designed for this trial, represents ten equal degrees of pain from "least desirable - most severe imaginable pain" (zero, bottom) to "most desirable - no pain" (ten, top). The patient selects the degree of arthritis pain experienced on each of the past six days, and a six-day mean score from zero to ten is calculated. The 10-Centimetre Pain Line is the standard, visual analogue noncalibrated line [14,15] labeled in this trial from "excruciating" (ten, right end) to "none" (zero, left end).
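The six-day mean used by the ladder scales can be sketched in a few lines of Python. This is an illustration only, not the trial's scoring code: the function name `six_day_mean` is hypothetical, and requiring exactly six ratings is an assumption, since the text does not say how missing days were handled.

```python
def six_day_mean(daily_ratings):
    """Average six daily ladder ratings, each on the zero-to-ten ladder.

    Assumption: all six days must be present; the paper does not describe
    any rule for missing days.
    """
    if len(daily_ratings) != 6:
        raise ValueError("expected one rating for each of the past six days")
    if any(not 0 <= rating <= 10 for rating in daily_ratings):
        raise ValueError("ladder ratings must lie between 0 and 10")
    return sum(daily_ratings) / len(daily_ratings)


# Hypothetical ratings for one patient (ten = no pain on the Pain Ladder Scale).
example_ratings = [4, 5, 5, 6, 4, 6]
print(six_day_mean(example_ratings))
```

Note that the Pain Ladder Scale and the 10-cm Pain Line run in opposite directions (ten is "no pain" on the ladder but "excruciating" on the line), so any combined score has to reverse one of them first.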

Global Impression

These measures involve various scales designed to quantify direct impressions of either the patient's arthritis or the patient's overall health. Unless referenced, these measures were developed for the trial.

Arthritis

The Arthritis Categorical Scale cites five responses from "very poor" to "very good." The Arthritis Ladder Scale represents ten equal degrees of arthritis activity. The patient selects the degree of arthritis activity experienced on each of the past six days, and a 6-day mean score from zero to ten is calculated.


Overall Health

The Overall Health Ladder Scale, Current, represents ten equal degrees of health from "least desirable" (zero, bottom) through "most desirable" (ten, top). The Overall Health Ladder Scale, Six-Day Mean, is similar, but the zero-to-ten score averages the responses for each of the past six days. The Rand Current Health Assessment [16] measure presents 19 statements about current health, which the patient classifies as from "definitely true" through "don't know" to "definitely false." Responses are combined to give a score from nine (worst) to 45. The 10-Centimetre Overall Health Scale, by Patient, is the standard visual analogue line labeled "poor" (zero, left end) to "perfect." The 10-Centimetre Overall Health Scale, by Physician, is identical to the above, except the patient's zero-to-ten health status is indicated by the physician.

Utility

Three questionnaires based on this type of measurement were used in the trial, all administered in the fifth month. The Patient Utility Measurement Set (PUMS) elicits the patient's perception of his or her current state of health relative to his or her recollected health state at the beginning of the trial and relative to a state of full health. The formulation and scoring of the types of choices are summarized in the Appendix of Ref. 3. For each patient, combining a relative change score calculated from a visual analogue scale, modified time tradeoff, and lottery with results of the standard time tradeoff gives an overall score representing the difference between the patient's pretrial and current health states. The Standard Gamble Questionnaire developed for this trial [17] asks the patient to choose between his or her current state and a hypothetical treatment with systematically varied chances of causing either complete cure or death. The Willingness to Pay Questionnaire, also developed for this trial [17], elicits the share of the patient's monthly household income he or she would pay for a hypothetical cure of arthritis.

Other

The National Institute of Mental Health (NIMH) Depression Questionnaire [18,19] asks the patient on how many of the last seven days he or she experienced 20 feelings or attitudes indicative of a depressed state. The Rand General Health Perceptions Questionnaire [16] includes 36 statements that may reflect the patient's feelings and attitudes toward his or her past and future health care and outlook.

Composite variables were calculated for each category (clinical, functional, pain, global) by dividing each outcome variable by its standard deviation at baseline, changing the sign if necessary so that a larger score represents improvement, and taking the mean of the standardized variables.

Instruments' responsiveness to treatment effect was measured using two concepts: standardized effect size and relative efficiency. The standardized effect size provides a direct comparison between competing measures. It is the ratio of the treatment effect (the difference between the mean change from baseline to six months in the auranofin group [dt] and in the control group [dc]) to the pooled standard deviation (SD) of these differences: (dt - dc)/SD.

The relative efficiency of each instrument was calculated with respect to tender joints, so that the relative efficiencies of the instruments could be compared among categories. Relative efficiency can be calculated for any pair of instruments. If the relative efficiency of one instrument to another is greater (less) than one, that instrument is more (less) efficient at detecting change. To calculate the relative efficiencies to detect a difference among treatment groups, two methods were used. The first method uses the squared ratio of the standardized effect sizes. The second method uses analyses of covariance to find the treatment effect for each instrument while controlling for age, sex, clinic, and functional status at baseline. With this method, relative efficiency was computed by taking squared ratios of B/SE(B), where B is the coefficient of treatment in the analysis of covariance. This is similar to taking squared ratios of t statistics of the analysis of covariance [20].
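The composite score, the standardized effect size, and the first relative-efficiency method described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the study's analysis code: all function names are hypothetical, the change scores are invented for illustration, and the pooled SD uses the usual two-group pooled-variance formula, which the text does not spell out.

```python
from math import sqrt
from statistics import mean, stdev


def composite_score(values, baseline_sds, larger_is_better):
    """Composite for one patient: divide each measure by its baseline SD,
    flip the sign where a smaller raw value means improvement, then average."""
    standardized = [
        (value / sd) if up else -(value / sd)
        for value, sd, up in zip(values, baseline_sds, larger_is_better)
    ]
    return mean(standardized)


def pooled_sd(changes_a, changes_b):
    """Pooled within-group SD of the change scores (assumed formula)."""
    n_a, n_b = len(changes_a), len(changes_b)
    var_a, var_b = stdev(changes_a) ** 2, stdev(changes_b) ** 2
    return sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))


def standardized_effect_size(changes_treated, changes_control):
    """(dt - dc) / SD, with dt and dc the mean changes in each group."""
    d_t = mean(changes_treated)
    d_c = mean(changes_control)
    return (d_t - d_c) / pooled_sd(changes_treated, changes_control)


def relative_efficiency(ses, ses_reference):
    """Squared ratio of standardized effect sizes (first method in the text)."""
    return (ses / ses_reference) ** 2


# Hypothetical change scores (six months minus baseline), for illustration only.
treated = [3.0, 5.0, 4.0, 6.0]
control = [1.0, 2.0, 3.0, 2.0]
ses = standardized_effect_size(treated, control)
print(round(ses, 3))

# Squared ratio of two effect sizes, here 0.36 against a reference of 0.24.
print(relative_efficiency(0.36, 0.24))

# Two measures: the first improves upward, the second improves downward.
print(composite_score([6.0, 3.0], [2.0, 3.0], [True, False]))
```

Because relative efficiency is a squared ratio, the direction of a scale (and hence the sign of its effect size) does not affect it; only the magnitude relative to the reference measure matters.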

RESULTS

Baseline Characteristics

Of the 330 patients randomized into the study, 19 did not meet eligibility criteria and eight patients were lost to follow-up or withdrawn from the trial before the third month's assessment. The remaining 303 patients are included in this analysis. Health status measures did not differ significantly between treatment groups at baseline when tested with two-sample t tests (Table 2).

Measurement Characteristics

Key questions in selecting HRQL measures for use in a clinical trial include the need for an interviewer and the amount of training required for this interviewer; the time needed to administer the questionnaire; and the clarity, ease of administration, and acceptability of the questionnaire. Monitoring of the interview process during the trial provided the information summarized in Table 3. All of the clinical measures were performed by a physician or a nurse coordinator, while most of the pain, functional, global, utility, and other measures were collected by the independent outcome assessor, a trained lay interviewer. The training required is most intensive for the joint counts, the QWB, the Keitel, and all the utility measures. These same measures have the longest administration time. The experience gained from the independent assessors' training sessions and from the tape audits of interviews indicates that the Quality of Well-Being questionnaire required the most probing and clarification of questions by the interviewers.

Relative Efficiency to Detect a Treatment Effect

The results in Table 4 are based on the analysis of covariance presented in the main efficacy paper [3] where the effect of treatment on each response variable is assessed with the six month outcome as the dependent variable

Table 2. Health Status Measures at Baseline by Treatment Group (n = 303)

                                    Possible Range    Placebo (n = 149)   Auranofin (n = 154)
Measure                             (Worst to Best)   Mean      SEM       Mean      SEM        p Value

Function
  Health Assessment Quest.          3-0               1.39      0.06      1.4       0.05       .88
  Keitel                            98-0              30.00     2.00      30.0      1.0        .72
  Quality of Well-Being             0-1               0.60      0.008     0.60      0.007      .88
Pain
  McGill Pain Questionnaire         78-0              23.0      1.0       22.0      1.0        .68
  Pain Ladder Scale (6-day mean)    0-10              5.5       0.1       5.5       0.1        .79
  10-cm Pain Line                   10-0              5.3       0.1       5.3       1.0        .76
Global Impression: Arthritis
  Categorical Scale                 1-5               2.8       0.07      2.9       0.07       .92
  Ladder Scale (6-day mean)         0-10              5.0       0.1       4.8       0.1        .42
Global Impression: Overall Health
  Ladder Scale (Current)            0-10              6.1       0.1       6.0       0.1        .90
  Ladder Scale (6-day mean)         0-10              5.8       0.1       5.7       0.1        .49
  Rand Current Health Assessment    9-45              22.0      0.5       22.0      0.5        .73
  10-cm line, by patient            0-10              5.6       0.2       5.5       0.2        .47
  10-cm line, by physician          0-10              6.2       0.1       6.2       0.1        .83
Other
  NIMH Depression Quest.            60-0              15.0      1.0       15.0      1.0        .58
  Rand General Health Quest.        0-110             66.0      1.0       67.0      1.0        .52

and using as covariates the baseline value, clinic, age, sex and baseline functional status. Relative efficiencies are calculated with respect to tender joints. Among the clinical measures, the counts of tender and swollen joints were the most efficient. The three functional ability instruments were found to be equally responsive, while the 10-cm pain line and the self-assessment of arthritis were the most efficient measures in the pain and global measures categories, respectively. Of the utility measures, the PUMS best detected a treatment effect. The functional ability composite was found to be the most responsive of the composite scores, the global and clinical composites the next most efficient, and the pain composite the least efficient. In the other measures, the Toronto ADL change score and the erythrocyte sedimentation rate were also responsive to treatment effects. For purposes of comparison we also analyzed these results using the methods proposed by Roberts at the conference (Table 5). Results on relative efficiencies are very similar when comparing those obtained using the analysis of covariance method in Table 4 and the ratio of effect sizes in Table 5. The most efficient measure remains the global assessment of arthritis using a categorical scale (RE = 2.03 and 2.20). It has a large effect size (0.36) with a good relative treatment effect ((dt - dc)/dc = 1.12) without much noise compared with other measures (CV = 3.11). The effect sizes of the following measures are significantly smaller than the effect size of this categorical scale (as measured by the z statistic): morning stiffness, pain ladder scale, overall

Table 3. Characteristics of Health-Related Quality-of-Life Measures

Columns, in order: Administration (a); Training (b); Time, min (c); Need for Probing (d)

Clinical Tender Joints Swollen Joints 50-foot walk time Duration of Morning Stiffness Grip Strength

MD/N MD/N MD/N MD/N MD/N

+ + + + + + + + +

20 20 2 2 4

NA NA NA NA NA

Function Health Assessment Questionnaire Keitel Assessment Quality of Well-Being Quest.

SELF IA IA

0 +++ +++

5 30 20

NA ++ +++

Pain McGill Pain Questionnaire Pain Ladder Scale 10 cm Pain Line

IA IA MD/N

++ + +

5 2 1

++ + NA

+ +

1 2

0 +

IA IA SELF

+ + + +

2 2 2

+ ++ NA

MD/N MD/N

+ +

1 1

NA NA

Global Impression Arthritis: Categorical Scale Ladder Scale Overall Health Ladder (current) Ladder (6-day mean) Rand Current Health Assessment 10 cm (by patient) 10 cm (by physician) Utility Patient Utility Measurement Set Standard Gamble Willingness to pay Other NIMH Depression Questionnaire Rand General Health Perceptions Questionnaire Erythrocyte Sedimentation Rate

IA IA

IA IA IA

+++ +++ +++

30 +++ 30--60 + + + 30--60 + + +

IA SELF

++ + +

5 5

+ NA

MD/N

+ +

NA

NA

(a) Individual administering each measure: physician/nurse, MD/N; patient self-administered, SELF; independent outcome assessor, IA. (b) Amount of interviewer training required: intensive, +++; moderate, ++; some, +; none, 0. (c) Administration time. (d) Amount of probing and clarification needed as judged by the independent outcome assessors' training sessions and the tape audits of interviews: intensive, +++; moderate, ++; some, +; none, 0. Applies only to interviewer-administered measures. NA: Not applicable.

health current ladder scale and 10-cm line by the physician, NIMH depression scale, and the Rand general health perception scale. Among the clinical measures, 50-foot walk time, duration of morning stiffness and grip strength show an important relative treatment effect but they also show a large amount of

Table 4. Efficiency of Measures to Detect a Treatment Effect Controlling for Age, Sex, Clinic, and Baseline Functional Status Using Analysis of Covariance

Measure                                        p Value   t Statistic   Relative Efficiency

Clinical Measures
  Tender joints                                .01        2.49         1.00
  Swollen joints                               .01        2.57         1.07
  50 foot Walk time                            .11        1.60         0.41
  Duration of Morning Stiffness                .10        1.67         0.45
  Grip Strength                                .02       -2.32         0.87
Functional Measures
  Health Assessment Questionnaire              .01        2.60         1.09
  Keitel Assessment                            .003       3.02         1.47
  Quality of Well-Being Questionnaire          .005      -2.82         1.28
Pain Measures
  McGill Pain Questionnaire                    .02        2.30         0.85
  Pain Ladder Scale                            .09       -1.69         0.46
  10 cm Pain Line                              .01        2.61         1.10
Global Measures: Arthritis
  Categorical Scale                            .001      -3.55         2.03
  Ladder Scale                                 .11       -1.62         0.42
Global Measures: Overall Health
  Ladder (current)                             .19       -1.31         0.28
  Ladder (6-day mean)                          .01       -2.70         1.18
  Rand Current Health Assessment               .01       -2.53         1.03
  10 cm (by patient)                           .09       -1.66         0.44
  10 cm (by physician)                         .20       -1.30         0.27
Utility Measures
  Patient Utility Measurement Set              .002      -3.22         1.67
  Standard Gamble                              .05        1.98         0.63
  Willingness to Pay                           .29        1.31         0.28
Other Measures
  NIMH Depression Questionnaire                .54       -0.62         0.06
  Rand General Health Perception Questionnaire .32       -0.99         0.16
  Toronto ADL                                  .02       -2.29         0.85
  Erythrocyte Sedimentation Rate               .02        2.37         0.91
Composites
  Clinical                                     .003       2.99         1.44
  Functional                                   .001       3.44         1.91
  Pain                                         .021       2.32         0.87
  Global                                       .007      -2.74         1.21

noise in this effect; thus their lower effect size. Among the functional measures effect sizes are comparable. The pain measures had comparable relative treatment effects, but the Ladder scale showed more noise, making it less sensitive. Among the global measures, the arthritis categorical scale and the overall health six-day mean ladder scale were the most efficient; the Rand current health assessment had the largest relative treatment effect but also a large coefficient of variation.
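As a worked check of the analysis-of-covariance method, the relative efficiencies in Table 4 can be recomputed as the squared ratio of each measure's t statistic to that of tender joints (t = 2.49). The Python sketch below uses a subset of the t statistics and relative efficiencies printed in Table 4; the function name `relative_efficiency_from_t` is hypothetical, and the small differences from the printed values are assumed to come from rounding of the published t statistics.

```python
T_TENDER_JOINTS = 2.49  # reference measure in Table 4

# measure: (t statistic, relative efficiency as printed in Table 4)
table4_subset = {
    "Swollen joints": (2.57, 1.07),
    "Health Assessment Questionnaire": (2.60, 1.09),
    "Keitel Assessment": (3.02, 1.47),
    "Quality of Well-Being Questionnaire": (-2.82, 1.28),
    "10 cm Pain Line": (2.61, 1.10),
    "Arthritis Categorical Scale": (-3.55, 2.03),
    "Patient Utility Measurement Set": (-3.22, 1.67),
    "Functional composite": (3.44, 1.91),
}


def relative_efficiency_from_t(t, t_reference=T_TENDER_JOINTS):
    """Squared ratio of t statistics; the sign of t drops out when squared."""
    return (t / t_reference) ** 2


recomputed = {
    name: relative_efficiency_from_t(t) for name, (t, _) in table4_subset.items()
}

for name, (t, printed) in table4_subset.items():
    print(f"{name:36s} t = {t:5.2f}  RE = {recomputed[name]:.2f}  (printed {printed:.2f})")
```

The negative t statistics in Table 4 reflect the direction of each scale rather than a harmful effect, which is why squaring them is harmless for this comparison.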

Table 5. Standardized Effect Size (SES) and Relative Efficiency (RE) of Measures to Detect a Treatment Effect

Measure                                  dt-dc    (dt-dc)/dc   Pooled SD(a)   CV(b)      SES(c)   RE(d)   Z

Clinical
  Tender joints                           2.81      0.63        11.61           2.59      0.24    1.00    1.03
  Swollen joints                          1.86      0.52         8.11           2.25      0.23    0.89    1.10
  50-foot walk time                       1.35     -2.21         6.64         -10.88      0.20    0.70    1.06
  Duration of AM stiffness               16.00      1.30       156.12          12.71      0.10    0.18    1.86(e)
  Grip strength                           0.16     -8.00         0.72         -36.25      0.22    0.83    1.04
Function
  Health assessment quest.                0.13      0.76         0.51           3.00      0.25    1.11    0.93
  Keitel assessment                       3.25     -1.89        11.09          -6.45      0.29    1.47    0.49
  Quality of well-being questionnaire     0.02      (f)          0.09           (f)       0.23    0.94    1.01
Pain
  McGill Pain questionnaire               2.44      0.44        12.51           2.23      0.19    0.65    1.42
  Pain Ladder Scale                       0.35      0.57         1.92           3.14      0.18    0.57    1.72(e)
  10 cm pain line                         0.53      0.58         2.17           2.38      0.24    1.02    1.14
Global Impression: Arthritis
  Categorical scale                       0.34      1.12         0.95           3.11      0.36    2.20
  Ladder scale                            0.39      0.65         1.98           3.33      0.20    0.65   -1.62
Global Impression: Overall Health
  Ladder (current)                        0.26      0.67         1.74           4.47      0.15    0.38    1.89(e)
  Ladder (6-day mean)                     0.52      1.48         1.62           4.59      0.32    1.77    0.36
  Rand current health asses.              1.31      2.59         5.89          11.62      0.22    0.84    1.03
  10 cm (by patient)                      0.38      0.74         1.80           3.51      0.21    0.75    1.23
  10 cm (by physician)                    0.12      0.26         1.54           3.34      0.08    0.10    2.1?(e)
Utility
  Patient utility measurement set        11.39      1.24        26.31           2.88     -0.43    3.16    0.62
  Standard gamble                        -6.98     -0.23        25.95           0.87     -0.27    1.24    0.63
  Willingness to pay                     -3.10     -0.12        20.44           0.80     -0.15    0.38    1.53
Other
  NIMH depression quest.                  0.73     -0.18        11.82          -2.92      0.06    0.06    1.67(e)
  Rand general health perception quest.   0.60     -8.05         9.37        -126.16     +0.06    0.06    2.11(e)
  Erythrocyte sediment. rate             -4.88     -8.56        19.90          34.91     -0.25    1.07    0.60
  Toronto ADL                             0.02    377.89         0.08        1487.5       0.25    1.10    1.06
Composites
  Clinical                               -0.18      1.13         0.62          -3.87     -0.29    1.44   -0.64
  Functional                             -0.23      4.60         0.62         -12.44     -0.37    2.33    0.06
  Pain                                   -0.26      0.54         0.97          -2.02     -0.27    1.24    0.42
  Global                                  0.23      0.85         0.74           2.74     +0.31    1.64    0.62

(a) Standard deviation. (b) Coefficient of variation. (c) Standardized effect size. (d) Relative efficiency. (e) Relative effect sizes and relative efficiencies significantly different from the Global Measure of arthritis (categorical scale). (f) Dt = 0.
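The columns of Table 5 are linked by simple arithmetic, which the following Python sketch checks for the five clinical measures. The values are transcribed from Table 5; the variable and function names are hypothetical. One assumption is labeled in the code: the coefficient of variation is read here as the pooled SD divided by the control-group mean change dc, which the paper does not state explicitly but which the printed values are consistent with. Under that reading, SES = (dt - dc)/SD is also equal to the relative treatment effect divided by the CV.

```python
# measure: (dt - dc, (dt - dc)/dc, pooled SD, CV, SES as printed in Table 5)
table5_clinical = {
    "Tender joints": (2.81, 0.63, 11.61, 2.59, 0.24),
    "Swollen joints": (1.86, 0.52, 8.11, 2.25, 0.23),
    "50-foot walk time": (1.35, -2.21, 6.64, -10.88, 0.20),
    "Duration of AM stiffness": (16.00, 1.30, 156.12, 12.71, 0.10),
    "Grip strength": (0.16, -8.00, 0.72, -36.25, 0.22),
}


def ses_from_difference(difference, pooled_sd):
    """Standardized effect size as defined in the text: (dt - dc) / SD."""
    return difference / pooled_sd


def ses_from_relative_effect(relative_effect, cv):
    """Same quantity, ASSUMING CV = SD / dc (an inferred definition)."""
    return relative_effect / cv


checks = {}
for name, (diff, rel, sd, cv, ses_printed) in table5_clinical.items():
    checks[name] = (
        ses_from_difference(diff, sd),
        ses_from_relative_effect(rel, cv),
        ses_printed,
    )
    direct, via_cv, printed = checks[name]
    print(f"{name:26s} direct = {direct:.3f}  via CV = {via_cv:.3f}  printed = {printed:.2f}")
```

This makes the reasoning in the text concrete: a measure can show a large relative treatment effect and still have a small effect size if its coefficient of variation is large, as with duration of morning stiffness.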


The utility measures had the largest relative treatment effects of all the 28 measures. Although their coefficients of variation were also large, they nonetheless had large effect sizes. Of the composite measures, the functional measures had the largest coefficient of variation but also the largest relative treatment effect, making them the most efficient of the four composites.

DISCUSSION

In the past many investigators have shown a reluctance to use the HRQL measures because it was felt that the outcomes were difficult to quantify. A major consensus workshop of rheumatologists and methodologists emphasized the need for a broader perspective [22]; this was reiterated by investigators at a conference in Washington, DC sponsored by the Arthritis Foundation in December 1988. This study demonstrates that the HRQL measures often outperform traditional measures with respect to efficiency and acceptability. In an attempt to minimize measurement error, standardized procedures, trained independent assessors and checks of reliability were used. It is possible that if the clinical measures had been subjected to the same standardization they may have outperformed the HRQL measures. Two approaches to measuring relative efficiencies are contrasted here, one based on analysis of covariance and one based on effect sizes. The results are similar. The ANCOVA approach may be better in situations where the baseline groups may not be comparable. The other approach is useful when groups are comparable at baseline because one can test if effect sizes are significantly different using z statistics. This analysis indicates that morning stiffness, pain ladder scale, overall health current ladder scale, overall health 10-cm line by the physician, NIMH depression scale, and the Rand general health perception scale are the least responsive of the 28 measures. The latter two measures have a less direct relationship to treatment and were expected to be less sensitive.
The functional composite score was at least as effective as the clinical composite score. The self-assessment of arthritis was found to be the most efficient of all the measures. Although they seem to be "soft" measures, questionnaires eliciting patient opinions are quite responsive to changes in the severity of arthritis. A comparison of instruments within categories showed that the instruments' ability to detect a treatment effect was the deciding characteristic in the clinical, pain, and global categories in choosing the preferred instrument. All other characteristics of the instruments were comparable within these categories, with the exception of administration time in the pain category. The counts of tender and swollen joints were found to be the most responsive clinical measures, the 10-cm pain line was the most responsive and the fastest to administer of the pain instruments, and the categorical self-assessment of arthritis was the most responsive global measure. In the functional ability category, the HAQ, the Keitel Functional Assessment, and the QWB Questionnaire were found to be comparable with respect to responsiveness. The HAQ was the shortest and the only self-administered questionnaire. The QWB has had the most extensive validation work but required intensive interviewer training. The Keitel was the most time-consuming instrument, but had the advantage of high interobserver agreement. The Toronto ADL also showed good responsiveness.

These results indicate that patients' improvement in the number of inflamed joints is accompanied by meaningful improvement across a range of outcomes relevant to the patients' quality of life. A remarkable finding in this study is that these improvements can be detected as efficiently as those in the clinical measures. Also striking was the result of a structured utility measure, the Patient Utility Measurement Set (PUMS). The PUMS measures the patients' perceived change between baseline and end of study. This feature may help explain why it performed better than the two other measures of utility, the Standard Gamble and the Willingness to Pay, where patients were only interviewed about their status at the end of the study. This first application of utility measures in the context of a clinical trial in arthritis provides encouraging evidence of their potential usefulness. Overall, this trial has shown that several HRQL measures have now proved sensitive to change in rheumatoid arthritis due to therapy. The design of future trials in this disease can be guided by the information on their relative efficiencies and ease of use.

REFERENCES

1. Steinbrocker O, Traeger CH, Batterman RC: Therapeutic criteria in rheumatoid arthritis. JAMA 140:659-662, 1949
2. American College of Rheumatology: Dictionary of the Rheumatic Diseases. Volume III: Health Status Measurement. American College of Rheumatology, 1988
3. Bombardier C, Ware JE, Russell IJ, Larson MG, Chalmers A, Read JL: Auranofin therapy and quality of life in patients with rheumatoid arthritis. Results of a multicenter trial. Am J Med 81:565-578, 1986
4. Bombardier C, Tugwell PX: Measuring disability: Guidelines for rheumatology studies. J Rheumatol 10(suppl):68-73, 1983
5. Fries JF, Spitz PW, Young DY: The dimensions of health outcomes: The Health Assessment Questionnaire, Disability and Pain Scales. J Rheumatol 9:789-793, 1982
6. Eberl DR, Fasching V, Rahlfs V, Schleyer I, Wolf R: Repeatability and objectivity of various measurements in rheumatoid arthritis. A comparative study. Arthritis Rheum 19:1278-1286, 1976
7. Kaplan RM, Bush JW, Berry CC: Health status: Types of validity for an index of well-being. Health Serv Res 11:478-507, 1976
8. Kaplan RM, Bush JW, Berry CC: The reliability, stability and generalizability of a health status index. In: Proceedings of the American Statistical Association, Social Statistics Section. American Statistical Association 1978, pp 704-709
9. Bush JW, Chen M, Patrick DL: Cost-effectiveness analysis using a health status index: Analysis of the New York State PKU screening program. In: Health Status Indexes, Berg R, Ed. Chicago, Hospital Research and Educational Trust 1973, pp 172-208
10. Kaplan RM, Bush JW: Health-related quality of life measurement for evaluation and policy analysis. Health Psychology 1:61-80, 1982
11. Balaban DJ, Sagi PV, Goldfarb NI, Nettler S: Weights for scoring the quality of well-being instrument among rheumatoid arthritics. A comparison to general population weights. Med Care 24:973-980, 1986
12. Melzack R: The McGill Pain Questionnaire: Major properties and scoring methods. Pain 1:227-229, 1975
13. Melzack R, Torgenson WS: On the language of pain. Anaesthesiology 34:50-59, 1971
14. Scott J, Huskisson EC: Graphic representation of pain. Pain 2:175-184, 1976
15. Dickson JS, Bird HA: Reproducibility along a 10 cm vertical visual analogue scale. Ann Rheum Dis 40:87-89, 1981
16. Brook RH, Ware JE, Davies-Avery A: Overview of adult health status measures fielded in Rand's health insurance study: General health perception battery. Med Care 17 (suppl 7):95-97, 1979
17. Thompson MS: Willingness to pay and accept risks to cure chronic disease. Am J Publ Health 76:392-396, 1986
18. Radloff LS: The CES-D Scale: A Self-Report Depression Scale for Research in the General Population. Appl Psychol Measure 1:385-401, 1977
19. Husaini BA, Neff JA, Harrington JB, Hughes MD, Stone RH: Depression in rural communities: Establishing CES-D cutting points. Mental Health project 1979. Final report, National Institute of Mental Health contract 278-77-0044 (DEB). 1979
20. Liang MH, Larson MG, Cullen KE, Schwartz JA: Comparative measurement efficiency and sensitivity of five health status instruments for arthritis research. Arthritis Rheum 542-547, 1985
21. Reilly MC, Gufler A, Claire M, Corwin J: Final report: Ridaura Quality of Life study. Prepared by Rhode Island Health Services Research, Inc. 1984
22. Bombardier C, Tugwell PX, Sinclair A, Dok C, Anderson G, Buchanan WW: Preferences for endpoint measures in clinical trials: Results of structured workshops. J Rheumatol 9:798-801, 1982

APPENDIX: PARTICIPATING INVESTIGATORS AND INSTITUTIONS

Principal Investigators: Claire Bombardier, MD, James Ware, PhD.

Trial Design Committee: Claire Bombardier, MD, James Bush, MD, University of California, San Diego; J Leighton Read, MD, Institute for Health Research, Harvard School of Public Health; Mark Thompson, PhD, Institute for Health Research, Harvard School of Public Health; James Ware, PhD, Christina Hutchings, RN, MBA, Smith Kline & French Laboratories, Morton Paterson, Smith Kline & French Laboratories.

Participants: Lutheran General Hospital (Park Ridge, Illinois): William Arnold, MD; University of Oregon (Portland, Oregon): Robert Bennett, MD; University of Toronto (Toronto, Canada): Claire Bombardier, MD; University of Florida (Gainesville, Florida): Jacques Caldwell, MD; University of British Columbia (Vancouver, Canada): Andrew Chalmers, MD; Scripps Clinic (La Jolla, California): P. Kahler Hench, MD; Santa Clara Valley Medical Center (San Jose, California): William Lages, MD; Brigham and Women's Hospital (Boston, Massachusetts): Matthew Liang, MD; Peter Schur, MD; St. Lukes Medical Center (Bethlehem, Pennsylvania): Charles Ludivico, MD; Dartmouth-Hitchcock Hospital (Hanover, New Hampshire): James Morgan, MD; Arthritis Associates of Nevada (Las Vegas, Nevada): Michael O'Hanlan, MD; University of Texas (San Antonio, Texas): I. Jon Russell, MD; Toledo Clinic (Toledo, Ohio): Robert Sheon, MD; Dartmouth-White River Junction Veterans Administration Medical Center (White River Junction, Vermont): Thomas Taylor, MD; Harvard School of Public Health (Boston, Massachusetts): Martin Larson, ScD, Gary Rosner, ScD, James Ware, PhD.

Interviewers' Training, Monitoring, and Data Management: Margaret C Reilly, MPH, Rhode Island Health Services Research, Providence, Rhode Island; Christina Hutchings.