0895-4356/90 $3.00 + 0.00 Copyright © 1990 Pergamon Press plc
J Clin Epidemiol Vol. 43, No. 12, pp. 1313-1318, 1990 Printed in Great Britain. All rights reserved
SENSITIVITY OF EFFECT VARIABLES IN RHEUMATOID ARTHRITIS: A META-ANALYSIS OF 130 PLACEBO CONTROLLED NSAID TRIALS
PETER C. GØTZSCHE
Medical Department A, Rigshospitalet, Blegdamsvej 9, DK-2100 Copenhagen Ø, Denmark
(Received in revised form 7 June 1990)
Abstract-In a meta-analysis of placebo controlled NSAID trials, the sensitivity of the effect variables was calculated as the correlation coefficient and as the difference between drug and placebo, divided by the placebo group standard deviation. The patient’s global evaluation was the most sensitive variable overall. Pain was more sensitive than Ritchie’s index. Several variables may be omitted from clinical trials, especially if two active drugs are being compared. For example, the best maximum estimate for the difference in ESR between NSAIDs and placebo was 1.0 mm/hr (95% confidence interval -1.5 to 3.4 mm/hr), and for joint size 0.44% (-1.0 to 1.9%), corresponding to a quarter of a millimeter for each of the 10 joints usually measured. It is suggested to record only the patient’s global evaluation, pain, and morning stiffness.
Meta-analysis; Methods; Arthritis, rheumatoid; Effect variables; Non-steroidal anti-inflammatory drugs
INTRODUCTION
Drug trials should be simple in order to save resources and to maintain investigator and patient compliance, thereby increasing the quality of the collected data. In a review of 196 double-blind trials that compared two or more non-steroidal antiinflammatory drugs (NSAIDs) in rheumatoid arthritis, more than 70 different effect variables were identified [1], reflecting the uncertainty concerning the choice of the most meaningful ones [2-9]. Multiple endpoints may confuse the interpretation of the treatment result [9] and may lead to bias through preferential reporting [1]. It therefore seems of value to choose a few variables. This choice has statistical and philosophical aspects. Statistically, a variable should be sensitive to change on drug treatment. Philosophically, the change should be of value to the patient.
The purpose of this study was to examine the sensitivity of the most common effect variables used in NSAID trials and relate the findings to the patients’ opinions of the value of the drugs.
METHODS
A MEDLINE search, covering 1966 and onwards, was carried out in September 1988. The Medical Subject Headings (MeSH [10]) arthritis, rheumatoid and anti-inflammatory agents, non-steroidal were combined with each other and with placebo (truncated, searched either as a MeSH term or as a text word in the title or abstract). Apart from placebo, the “explode” command was used to access all subcategories assigned under a MeSH. The companies marketing NSAIDs in Denmark were contacted, and the reference lists of the collected articles were scanned. Reports were collected on trials published in one of the west European languages that compared an NSAID with a double-blind placebo in rheumatoid arthritis. The drugs had to be taken orally in repeated doses, and the placebo had to be randomized on equal terms with the active drug; wash-out placebos were therefore not accepted. Trials of NSAID-steroid combination products and trials on the steroid-sparing effect were excluded. When more than one NSAID was used in a trial, the newest was chosen, except in two cases in which the experimental drug was clearly without any symptomatic effect, when phenylbutazone was chosen instead. In trials of several doses of a drug, the highest was selected. Trials with a sequential analysis design were excluded since the variable subjected to sequential analysis might be correlated with other reported variables. The sensitivity of a variable in discriminating between drug and placebo was measured as the correlation coefficient, r, calculated as [11]:

$$ r = \sqrt{\frac{t^2}{t^2 + df}} $$

(see the Appendix for details)
using the value from the t-distribution with its appropriate degrees of freedom (df). When other tests had been used, two-sided p values were converted to corresponding t-values. The correlation coefficient may be regarded as a standardized effect size [11]. Another standardized effect size, d/s, was calculated as the difference, d, between drug and placebo, divided by the placebo group standard deviation, s [12]. This measure also allows a comparison of the sensitivities of the variables (see the Appendix). The patient’s global evaluation of the effect, and, in crossover trials, also patient preference, were compared with other variables. Finally, I studied whether NSAIDs have any effect on the erythrocyte sedimentation rate (ESR) and digital joint size. In this analysis, the reported differences between drug and placebo in crossover trials were weighted by the sample size upon which the results were based (number of recruited patients minus withdrawals, dropouts, exclusions, and missing data) [13]. Since joint size measurements were sometimes given in ring size units and sometimes in millimeters, these data were converted to the percentage change, i.e. the difference between drug and placebo, relative to the placebo value. Estimates are shown with 95% confidence intervals (CI).
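The effect measures described in this section can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and all numbers in the usage lines are hypothetical and are not taken from any of the reviewed trials, and the conversion of p values to t-values is left out because it needs an inverse t-distribution that the standard library does not provide.

```python
import math


def r_from_t(t, df):
    """Correlation coefficient as a standardized effect size: sqrt(t^2 / (t^2 + df))."""
    return math.sqrt(t * t / (t * t + df))


def standardized_difference(difference, sd_placebo):
    """d/s: drug-placebo difference divided by the placebo group standard deviation."""
    return difference / sd_placebo


def weighted_mean_difference(differences, sample_sizes):
    """Mean of per-trial differences, each weighted by its evaluable sample size."""
    total = sum(sample_sizes)
    return sum(d * n for d, n in zip(differences, sample_sizes)) / total


def percent_change(difference, placebo_value):
    """Difference between drug and placebo, relative to the placebo value, in percent."""
    return 100.0 * difference / placebo_value


# Hypothetical inputs, for illustration only.
print(r_from_t(2.0, 12))                                # t = 2.0 with 12 df
print(standardized_difference(6.0, 8.0))                # d = 6.0, s = 8.0
print(weighted_mean_difference([1.0, 3.0], [10, 30]))   # two trials, n = 10 and 30
print(percent_change(2.5, 500.0))                       # 2.5 mm on a 500 mm placebo value
```

Because r and d/s are dimensionless, values computed this way can be compared across variables that were recorded on different scales, which is the point of the standardization.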
RESULTS
Exclusion of reports
Reports from 130 trials were collected. For the analyses of standardized effect measures, 24 trials were excluded due to lack of statistical data or appropriate analyses, or because the result was described for only one variable. Of the remaining trials, 68 were of a crossover design and 38 were group comparative trials (5891 patients in total). In all trials but four it was stated that the patients fulfilled the American Rheumatism Association’s criteria for classic or definite rheumatoid arthritis.
Global evaluation, crossover trials
The most sensitive variable was the patient’s global evaluation, followed by the physician’s global evaluation and pain (Table 1). In Table 1, all correlation coefficients that could be calculated from the reports have been averaged to increase the precision of the estimates. Within-trial comparisons showed closely similar results. In five trials, for example, in which r values for global effect and pain could be calculated, the average values were 0.65 vs 0.55; for global effect and Ritchie’s joint tenderness index [14] (n = 5) the values were 0.71 vs 0.56, and for global effect and duration of morning stiffness, 0.66 vs 0.56 (n = 5). Overall joint pain was as sensitive as Ritchie’s index, 0.61 vs 0.56 (n = 8).
Measured as the standardized difference, d/s, the patient’s global evaluation was far more sensitive than other variables, both within trials and averaged across all trials (Table 1). Within trials, the values for global effect and pain were 1.28 vs 0.96 (n = 12), and for global effect and Ritchie’s index 1.36 vs 0.55 (n = 12). Pain was significantly more sensitive than Ritchie’s index, 0.86 vs 0.53 (p = 0.005, n = 21, t-test). Within-trial comparisons of the patient’s and the physician’s global evaluation showed r values of 0.70 vs 0.68 (n = 5) and d/s values of 1.15 vs 1.09 (n = 7).
Patient’s preference might seem to be less sensitive than the global effect. However, r values for both could be calculated in only four trials, in which the average was 0.65 for both variables. The sensitivity of patient’s preference was unchanged when multiple crossover trials comparing more than one NSAID with placebo were excluded (r = 0.51, n = 15). Within-trial comparisons of preference and pain and of preference and Ritchie’s index gave r values of 0.59 vs 0.57 (n = 4), and 0.43 vs 0.45 (n = 6).

Table 1. Average correlation coefficient, r, and standardized difference, d/s, for effect measures in placebo controlled crossover trials of NSAIDs*

Effect variable                    r       d/s
Patient’s global effect            0.70    1.33
Physician’s global effect          0.68    1.02
Pain                               0.61    0.84
Number of tender joints            0.58    NA
Ritchie’s index                    0.57    0.61
Physician’s preference             0.57    NA
Patient’s preference               0.52    NA
Duration of morning stiffness      0.52    0.66
Number of affected joints          0.50    NA
Rescue drug intake                 0.45    0.36
Number of swollen joints           0.45    0.33
Grip strength                      0.43    0.31
Walking time                       0.40    0.38
Erythrocyte sedimentation rate     0.23    0.13
Digital joint size                 0.03    0.08

*Results are shown for values based on at least 3 trials. NA, not available; d, difference between drug and placebo; s, placebo group standard deviation.
The two “No. of trials” columns could not be aligned with the rows in this copy; as extracted they read: 9; 13, 9, 23; 25; 5, 13, 4; 5, 3, 28, 6; 6, 6; 11, 15; 9, 4, 12, 9, 20, 15, 4; 26.
Global evaluation, group comparative trials
For group comparative trials, several variables were about as sensitive as the global effect. The number of trials with analysable data was small, however, and the variation relatively large, partly because of much more variable treatment times and drop-out rates than in crossover trials. The sensitivity of the global evaluation across all trials was r = 0.30 and d/s = 0.66 (n = 12 in both cases).
Within-trial comparisons of global evaluation vs other variables gave the following sets of values: pain: r = 0.40 vs 0.30 (n = 4), d/s = 0.96 vs 0.71 (n = 3); Ritchie’s index: r = 0.37 vs 0.40 (n = 4), d/s = 0.84 vs 0.87 (n = 4); duration of morning stiffness: r = 0.30 vs 0.27 (n = 9), d/s = 0.67 vs 0.59 (n = 9); grip strength: r = 0.29 vs 0.28 (n = 7), d/s = 0.78 vs 0.68 (n = 7).
ESR and digital joint size
Sixty-six crossover trials were collected in which either the ESR or digital joint size, or both, were recorded. Actual values for the ESR on drug and placebo were shown in only 23 of 48 trials (48%) in which the ESR had been recorded. In all reported cases, the Westergren method had been used. For joint size, values were shown in 31 of 41 trials (76%). The differences between NSAIDs and placebo in ESR and joint size are shown in Fig. 1. There was a tendency towards large differences that favoured the NSAIDs in small trials (positive skew). The results reported in the three most outlying studies (1-3 in the figure) were rather unlikely for various reasons, but their in- or exclusion had little influence on the meta-analysis.
The median ESR on placebo was 43.8 mm/hr. The difference between NSAIDs and placebo weighted by sample size was 1.9 mm/hr (CI: -1.5 to 5.3 mm/hr), and 1.0 mm/hr (CI: -1.5 to 3.4 mm/hr) when the three outlying studies were ignored.
The median joint size on placebo in 23 trials, in which the sum of the circumference of 10 joints (the proximal interphalangeal joints of the four fingers of each hand and the interphalangeal joint of each thumb) [15] had been measured, was 583.0 mm. The difference between NSAIDs and placebo was 0.55% (CI: -3.9 to 5.0%), and 0.44% (CI: -1.0 to 1.9%) when the most outlying study was ignored, corresponding to a difference of 2.5 mm (CI: -5.8 to 11.1 mm) for a typical rheumatic hand.
[Figure 1: scatter plots of sample size against the difference between drug and placebo; panel A, ΔESR (mm/hr); panel B, Δ joint size (%).]
Fig. 1. Effect of NSAIDs on (A) erythrocyte sedimentation rate and (B) digital joint size in crossover trials. Positive values indicate a greater decline on drug than on placebo. For joint size the difference relative to the placebo value is shown. Numbers indicate outlying studies.
DISCUSSION
The most sensitive effect measure in crossover trials was the patient’s global evaluation. In half of the trials it was recorded as clinical change on a ranking scale, usually with five classes, such as much better, better, no change, worse, much worse. In a quarter of the trials, the drug effect was noted, e.g. very good, good, etc.; in the remaining trials, the disease status was recorded similarly. When several measures were reported, this order of preference was used and there was rarely any doubt about which one to choose for the analyses. At a second scanning of the reports 1 year after the first reading, the selected global evaluations were the same in all trials but one. With all of these measures, even the disease status, the patient probably assesses a change, at least indirectly, by comparison with baseline or a previous treatment period. Pain was the second most sensitive effect measure. It is remarkable that Ritchie’s index was less sensitive than a simple pain scale. This is probably caused by the large interobserver variation with the index [14, 16], in which the physician scores the joints 0-3 for tenderness on palpation or passive movement and adds the scores. Thus there is little justification for the widespread use of the index, appearing as it did in almost half of 196 comparative trials [1]. It is rather like crossing a river to collect water to let the physician inflict artificial pain and note the patient’s reaction, instead of letting the patient record her pain. The method increases the sources of variation from 1 to 3: stimulus strength, response, and interpretation of the response. The measurement scale was not important for the results. For pain, for instance, r was 0.57 in four crossover trials in which the visual analogue scale [17] was used, and 0.64 in five trials with ranking scales in which the pain was evaluated as none, slight, moderate, severe, or very severe.
Evaluated as d/s, the analogue scale was somewhat less sensitive than the ranking scale, 0.75 vs 0.97 (p = 0.30, 10 and 11 crossover trials, respectively). For the duration of morning stiffness, r was 0.55 when the duration was
measured in minutes, and 0.49 when ARA decile classes [2] were used. An assumption in these calculations is that the differences between NSAIDs are negligible compared with the effect of NSAIDs over placebo. This assumption is likely to be true [13, 18, 19]. In group comparative trials, there was little difference between several of the variables. Global evaluation is presumably more sensitive to change when a within-patient comparison is involved as in a crossover trial, whereas objective measures such as grip strength would be expected to give similar differences irrespective of the design [13]. NSAIDs seem to lack an effect on ESR and digital joint size. The estimated differences between drug and placebo are probably maximum ones, since many results were not shown, whereas promising results are likely to be reported [1]. Insufficient length of treatment could not explain the negative findings, since there was no correlation between effect and number of treatment days (r² was 0.02 for ESR and 0.00 for joint size). Why is measurement of digital joint size so popular? The failure of this measure to detect differences between drug and placebo was noted as early as 1970 [20] and was substantiated in 1973 [6]. Nonetheless it was used in 50 of 196 trials comparing two or more NSAIDs [1], although only 14 of them had a randomized placebo group. Probably, the dramatic antiinflammatory effect of corticosteroids has led to a search for NSAIDs with similar properties. Further, the first blinded study to demonstrate an effect of an NSAID was carefully done by reputable rheumatologists and published in a highly-ranked journal [15]. Such a starting point is an ideal setting for reproduction of spurious findings which may be caused by a subtle bias that was not recognized in the first trial [21]. Bias may be introduced by loss of blinding, e.g. caused by an effect on the arthritis pain or by the occurrence of side-effects.
A slight increase in pressure when measuring the joint circumference could easily account for the small, nonsignificant difference of a quarter of a millimeter per joint found in the present study [15, 22]. I used the sum of 10 joints, since it was the only reported value in most studies. However, even if joints with irreversible or no swelling had been excluded, the difference between drug and placebo would still have been small. It was argued in several of the reviewed reports that the lack of an effect on joint size might be caused by the fact that patients in NSAID trials are selected on account of pain, and not on account of reversible joint swelling. However, since pain is also the reason for NSAID treatment in practice, the argument is questionable.
Measurement of unresponsive effect variables is a waste of resources and may lead to bias through preferential analysis and reporting. In comparative trials, for example, significant differences in ESR more often favour the new NSAID than the control NSAID (p = 0.06) [1]. Apart from the ESR, which is especially attractive, since an effect on this variable would associate the NSAID in question with second-line drugs [23] and corticosteroids, there is no reason to believe that bias would consistently favour some variables more than others, unless a laborious procedure was regularly recorded in fewer patients than other measures. This would tend to give too low t-values, and, accordingly, too low standardized effect sizes. However, since this reflects what happens in the real world, the effect sizes reported here would still be valid for pragmatic clinical trial planning. The authors’ results were therefore taken at face value.
From the standardized effect sizes one may calculate the necessary sample sizes for planned trials in which one would like not to overlook the average differences between drug and placebo found in this review (see the Appendix). Such calculations would most probably underestimate the required sample sizes, however. For instance, the sample size for a crossover trial of the ESR would be 190, and 2500 for a group comparative trial, rather than the expected infinity based on the weighted analysis in which no effect of NSAIDs was found. For digital joint size, the sample size estimates would exceed 5000 for both designs. These estimated sample sizes refer to control groups treated with placebo. For comparisons of two active drugs, sample sizes of hundreds or thousands would generally be needed. For example, in order not to overlook a difference of one-fifth of the difference between drug and placebo for the patient’s global effect, 25 times as many patients as in a placebo trial would be needed, i.e. at least 300 patients in a crossover trial and 2500 in a group comparative trial.
Which variables are most relevant? In two-thirds of 122 consecutive arthritis patients at a rheumatism clinic, pain was the most important symptom [24], and it correlates well with the patient’s global evaluation [6, 25-28]. However, since the global evaluation comprises everything which, in addition to pain, may be important to the patient (e.g. stiffness, fatigue, functional impairment and side-effects of the drug), this measure seems to be most relevant. A pooled index of several variables [29] cannot replace the global evaluation. The weights applied to positive and negative aspects of drug action may vary from patient to patient, and it would be rather paternalistic to argue that an index construed by the physician should be a better judge than the patient herself in the evaluation of the overall benefit from a symptomatic therapy.
Therefore, global effect alone would suffice in trials of NSAIDs, but to satisfy regulatory authorities [30] and to distinguish NSAIDs from other drugs, pain and morning stiffness could be measured as well. Number of tender joints may be added but would not be expected to give further information than pain alone. In crossover trials, patient preference may also be considered to facilitate inclusion of the dropouts [18]. In long-term group comparative trials the time the patients continue with an NSAID may be a relevant and sensitive effect variable [31]. These simple measures are sensitive, meaningful, ethical, quick to perform, cheap, applicable in all patients, and may be recorded by the patients. Hence, the interobserver variation, which for some variables may be larger than the variation between patients [16], is avoided.
In trials of second-line drugs, other measures may be preferable. Joint tenderness count, ESR, grip strength, and the physician’s assessment of disease activity were recently suggested, based on a secondary analysis of three large trials of gold, penicillamine, and methotrexate [23].
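The factor of 25 quoted earlier in the Discussion for comparisons of two active drugs follows from the fact that the required sample size varies with the inverse square of the difference one wishes not to overlook. A minimal sketch of that arithmetic is given here; the function name is hypothetical, and the placebo-trial sizes of 12 and 100 are not stated in the text but are inferred from the quoted totals (300/25 and 2500/25).

```python
def scaled_sample_size(base_n, fraction):
    """Sample size needed when only `fraction` of the original difference is to be detected.

    The required size grows with the inverse square of the difference,
    so one-fifth of the difference needs (1 / 0.2)^2 = 25 times as many patients.
    """
    return base_n / fraction ** 2


# Inferred placebo-trial sizes for the patient's global effect (assumption, see lead-in).
print(scaled_sample_size(12, 0.2))    # crossover trial
print(scaled_sample_size(100, 0.2))   # group comparative trial
```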
REFERENCES
1. Gøtzsche PC. Methodology and overt and hidden bias in reports of 196 double-blind trials of nonsteroidal antiinflammatory drugs in rheumatoid arthritis. Controlled Clin Trials 1989; 10: 31-56.
2. Cooperating Clinics Committee of the American Rheumatism Association. A seven-day variability study of 499 patients with peripheral rheumatoid arthritis. Arthr Rheum 1965; 8: 302-334.
3. McGuire RJ, Wright V. Statistical approach to indices of disease activity in rheumatoid arthritis. With reference to a trial of indomethacin. Ann Rheum Dis 1971; 30: 574-580.
4. Hart FD, Huskisson EC. Measurement in rheumatoid arthritis. Lancet 1972; i: 28-30.
5. Lee P, Sturrock RD, Kennedy A, Dick WC. The evaluation of antirheumatic drugs. Curr Med Res Opin 1973; 1: 427-443.
6. Deodhar SD, Dick WC, Hodgkinson R, Buchanan WW. Measurement of clinical response to antiinflammatory drug therapy in rheumatoid arthritis. Q J Med 1973; 42: 387-401.
7. Bombardier C, Tugwell P, Sinclair A, Dok C, Anderson G, Buchanan WW. Preference for endpoint measures in clinical trials: results of structured workshops. J Rheumatol 1982; 9: 798-801.
8. Buchanan WW. Assessment of joint tenderness, grip strength, digital joint circumference and morning stiffness in rheumatoid arthritis. J Rheumatol 1982; 9: 763-766.
9. Kirwan JR, Chaput de Saintonge DM, Joyce CRB, Currey LF. Clinical judgement analysis-practical application in rheumatoid arthritis. Br J Rheumatol 1983; 22 (suppl): 18-23.
10. National Library of Medicine. Medical Subject Headings-Annotated Alphabetic List, 1988. Bethesda, Md: National Library of Medicine; July 1987.
11. Rosenthal R. Meta-analytic Procedures for Social Research. Beverly Hills: Sage; 1984.
12. Glass GV, McGaw B, Smith ML. Meta-analysis in Social Research. Beverly Hills: Sage; 1981.
13. Gøtzsche PC. Meta-analysis of grip strength: most common, but superfluous variable in comparative NSAID trials. Dan Med Bull 1989; 36: 493-495.
14. Ritchie DM, Boyle JA, McInnes JM et al. Clinical studies with an articular index for the assessment of joint tenderness in patients with rheumatoid arthritis. Q J Med 1968; 147: 393-406.
15. Boardmann PL, Hart FD. Clinical measurement of the anti-inflammatory effects of salicylates in rheumatoid arthritis. Br Med J 1967; 4: 264-268.
16. Hansen TM, Keiding S, Lauritzen SL, Manthorpe R, Sørensen SF, Wiik A. Clinical assessment of disease activity in rheumatoid arthritis. Scand J Rheumatol 1979; 8: 101-105.
17. Huskisson EC. Measurement of pain. Lancet 1974; ii: 1127-1131.
18. Gøtzsche PC. Patients’ preference in indomethacin trials: an overview. Lancet 1989; i: 88-91.
19. Gøtzsche PC. Bias in double-blind trials. Thesis. København: Lægeforeningens Forlag; 1990.
20. Dick WC, Grayson MF, Woodburn A, Nuki G, Buchanan WW. Indices of antiinflammatory activity. Relationship between isotope studies and clinical methods. Ann Rheum Dis 1970; 29: 643-648.
21. Kohn A. False Prophets. Fraud and Error in Science and Medicine. Oxford: Blackwell; 1986.
22. Webb J, Downie W, Dick WC, Lee P. Evaluation of digital joint circumference measurements in rheumatoid arthritis. Scand J Rheumatol 1973; 2: 127-131.
23. Anderson JJ, Felson DT, Meenan RF, Williams HJ. Which traditional measures should be used in rheumatoid arthritis clinical trials? Arthr Rheum 1989; 32: 1093-1099.
24. Wright V. Measurement of outcome in rheumatic diseases. J R Soc Med 1985; 78: 985-994.
25. Umbenhauer ER. Is the ability to detect drug effect in rheumatic diseases increased by measuring more clinical parameters? In: Paulus HE et al., Eds. Controversies in the Clinical Evaluation of Analgesic-Anti-inflammatory-Antirheumatic Drugs. Stuttgart: Schattauer; 1981: 193-202.
26. Callahan LF, Brooks RH, Summey JA, Pincus T. Quantitative pain assessment for routine care of rheumatoid arthritis patients, using a pain scale based on activities of daily living and a visual analogue pain scale. Arthr Rheum 1987; 30: 630-636.
27. Spiegel JS, Paulus HE, Ward NB, Spiegel TM, Leake B, Kane RL. What are we measuring? An examination of walk time and grip strength. J Rheumatol 1987; 14: 80-86.
28. Pincus T, Callahan LF, Brooks RH, Fuchs HA, Olsen NJ, Kaye JJ. Self-report questionnaire scores in rheumatoid arthritis compared with traditional physical, radiographic, and laboratory measures. Ann Intern Med 1989; 110: 259-266.
29. Smythe HA, Helewa A, Goldsmith CH. “Independent assessor” and “pooled index” as techniques for measuring treatment effects in rheumatoid arthritis. J Rheumatol 1977; 4: 144-152.
30. WHO Regional Office for Europe. Guidelines for the Clinical Investigation of Drugs used in Rheumatic Diseases. Copenhagen: WHO; 1985.
31. Capell HA, Rennie JAN, Rooney PJ et al. Patient compliance: a novel method of testing non-steroidal antiinflammatory analgesics in rheumatoid arthritis. J Rheumatol 1979; 6: 584-593.
32. Armitage P, Berry G. Statistical Methods in Medical Research. 2nd edn. Oxford: Blackwell; 1987.
APPENDIX
Tests of statistical significance have two components: the effect size and the sample size. Any effect, however small, will become significant if the sample size is big enough. The formula for the paired t-test, for example, may be written as:

$$ t = \frac{d}{SE(d)} = \frac{d}{SD(d)} \times \sqrt{n}, $$

where d = difference between drug and placebo and n = sample size. Hence, d/SD(d) may be regarded as a standardized effect size which is independent of sample size, in contrast to the p value. The familiar correlation coefficient, r, may also be regarded as a standardized effect size [11]. The standardization makes the effect measure dimensionless and allows comparison of variables measured on different scales.
The necessary sample size, n, for a planned crossover trial in which one would like not to overlook, with 90% certainty, the average difference, d, between active drug and placebo found in this review may be estimated from the average correlation coefficient, r, from the crossover trials. From

$$ n = (z_{2\alpha} + z_{2\beta})^2 \times \left(\frac{SD(d)}{d}\right)^2 = 10.5 \left(\frac{SD(d)}{d}\right)^2 $$

(for type I and type II errors of 5% and 10%, respectively [32]), and

$$ t = \frac{d \times \sqrt{n}}{SD(d)}, $$

and

$$ r^2 = \frac{t^2}{t^2 + df}, $$

and df = n - 1, one gets

$$ n = 10.5 \times \left(\frac{1}{r^2} - 1\right) + 1. $$

The corresponding formula for the necessary sample size, 2n, for a group comparative trial, with r estimated from a sample of group comparative trials, is:

$$ 2n = 10.5 \times \left(\frac{1}{r^2} - 1\right) + 2. $$

If d/s is used as the standardized effect measure (s = placebo group standard deviation), the formula is:

$$ 2n = 4 \times 10.5 \times \left(\frac{s}{d}\right)^2 \quad [32]. $$
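The sample size formulas of this Appendix can be checked numerically with a short Python sketch. The function names are hypothetical, and the constant 10.5 is recomputed from standard normal quantiles on the assumption that it stands for a two-sided 5% type I error and a 10% type II error, as the text indicates. The inputs are the crossover r values for the ESR (0.23) and the patient's global effect (0.70) from Table 1, and the d/s of 0.66 reported for the global evaluation in group comparative trials.

```python
from statistics import NormalDist


def z_factor(alpha_two_sided=0.05, power=0.90):
    """(z_2alpha + z_2beta)^2, which the Appendix rounds to 10.5."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha_two_sided / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) ** 2


def n_crossover_from_r(r, factor=10.5):
    """Crossover trial: n = factor * (1/r^2 - 1) + 1."""
    return factor * (1.0 / r ** 2 - 1.0) + 1.0


def total_n_group_from_r(r, factor=10.5):
    """Group comparative trial, both groups together: 2n = factor * (1/r^2 - 1) + 2."""
    return factor * (1.0 / r ** 2 - 1.0) + 2.0


def total_n_group_from_ds(ds, factor=10.5):
    """Group comparative trial from d/s: 2n = 4 * factor * (s/d)^2."""
    return 4.0 * factor / ds ** 2


print(round(z_factor(), 2))                   # close to the 10.5 used in the formulas
print(round(n_crossover_from_r(0.23)))        # ESR, crossover: roughly 189
print(round(n_crossover_from_r(0.70)))        # patient's global effect, crossover: roughly 12
print(round(total_n_group_from_ds(0.66)))     # global evaluation, group comparative: roughly 96
```

Under these assumptions the ESR result of roughly 189 is close to the 190 quoted in the Discussion, and multiplying the two results for the global evaluation by 25 gives values near the 300 and 2500 quoted there for comparisons of two active drugs.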