Contemporary Clinical Trials 28 (2007) 348–351 www.elsevier.com/locate/conclintrial
Short communication
Clinical trials and the response rate illusion

Irving Kirsch a,⁎, Joanna Moncrieff b

a Psychology, University of Hull, Hull HU6 7RX, United Kingdom
b Social and Community Psychiatry, Department of Mental Health Sciences, University College London, United Kingdom
Received 22 May 2006; accepted 22 October 2006
Abstract

Clinical trial outcome data can be presented and analyzed as mean change scores or as response rates. These two methods of presenting the data can lead to divergent conclusions. This article explores the reasons for the apparently divergent outcomes produced by these methods and considers their implications for the analysis and reporting of clinical trial data. It is shown that relatively small differences in improvement scores can produce relatively large differences in expected response rates. This is because differences in response rates do not indicate differences in the number of people who have improved; they indicate differences in the number of people whose degree of improvement has pushed them over a specified criterion. Therefore, patients classified as non-responders may have shown substantial and clinically significant improvement, and these are the patients who are most likely to become responders when given medication. Response rates based on continuous data do not add information, and they can create an illusion of clinical effectiveness.
© 2006 Elsevier Inc. All rights reserved.

Keywords: Response rates; Clinical trials; Data analysis; Antidepressants
⁎ Corresponding author. E-mail address: [email protected] (I. Kirsch).
1551-7144/$ - see front matter © 2006 Elsevier Inc. All rights reserved. doi:10.1016/j.cct.2006.10.012

1. Introduction

There are various ways of comparing the performance of drug and placebo in clinical trials and in systematic reviews of those trials. Outcome data for conditions that are assessed on continuous scales are typically presented as mean change scores (or sometimes as post-treatment scores adjusted for baseline scores). Categorical results can also be calculated from continuous data, in the form of response rates or improvement rates (i.e., the percentage of patients deemed to have responded or improved) and statistics derived from them, including odds ratios, relative risk ratios, and the number needed to treat. These categorical outcomes consist of the proportion of people who meet a (usually) predefined level of improvement or fall below a predefined threshold score on a continuous measure. They do not reflect natural categories, but are simply a way of dividing up a continuous distribution.

Ideally, these different methods of assessing outcome should produce similar conclusions, since they are derived from the same data. In practice, they can be divergent. This is particularly striking in reviews of antidepressant medication, where mean improvement scores indicate very small drug-placebo differences [1,2], whereas response rate
data suggest that these differences are more substantial [1,3]. The purpose of this article is to explore the reasons for the apparently divergent outcomes produced by comparisons of mean improvement scores and response rates and to consider their implications for the analysis and reporting of clinical trial data. Although our concern extends to all clinical trials in which data derived from continuous scales are reported as response rates, we illustrate the issue with data from the antidepressant literature.

2. Response rates and continuous distributions

Response rates depend on the criterion used to define a response, as well as on the magnitude of drug and placebo effects. When a response rate of 50% has been found, as has been reported in the published antidepressant trial data [3], it means that the criterion chosen for defining a response has coincidentally turned out to be the median of the distribution of improvement scores. This is true regardless of the shape of that distribution; in a normal distribution, the median is also the mean. In a normal distribution, if the chosen criterion falls 0.5 standard deviations above the mean improvement score, the response rate will be 31%, and a criterion that falls 0.5 standard deviations below the mean gives a response rate of 69%. Response rates corresponding to other criteria, and differences in response rates, can be readily obtained from normal distribution tables.

For normally distributed data, having obtained the response rate for one group, one need know only the standardized mean difference between the groups to obtain the response rate for the second. The response rate for the second group is the proportion of the area under the normal curve above a z score equal to the z score corresponding to the first group's response rate plus the standardized mean difference. The relation between difference scores and response rates is illustrated in Table 1 and Fig. 1.
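Assuming normally distributed improvement scores, this calculation can be sketched with the standard library alone (the function names are ours, and the inverse normal CDF is approximated by bisection rather than taken from a statistics package):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def norm_ppf(p):
    """Inverse of norm_cdf by bisection (ample precision for this purpose)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def placebo_response_rate(drug_rate, smd):
    """Expected placebo-group response rate, given the drug-group response
    rate and the standardized mean difference between groups, assuming
    normally distributed improvement scores."""
    z_criterion = norm_ppf(1.0 - drug_rate)   # criterion's z score in the drug group
    return 1.0 - norm_cdf(z_criterion + smd)  # area above the shifted criterion

# A 2-point HRSD difference with SD = 8 is a standardized difference of 0.25.
print(f"{placebo_response_rate(0.50, 2 / 8):.2f}")   # 0.40
print(f"{placebo_response_rate(0.65, 2 / 8):.2f}")   # 0.55
```

These two values reproduce the first rows of each half of Table 1: a drug response rate of 50% implies a placebo response rate of 40%, and a drug response rate of 65% implies 55%.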
This shows response rate data for four levels of drug-placebo differences in improvement on the Hamilton Rating Scale for Depression (HRSD) and two different response rates in the drug condition. All four drug-placebo difference scores are relatively modest and of questionable clinical significance. The lowest is above that found in a meta-analysis including unpublished clinical trial data [2]; the remaining three are above those found in meta-analyses of published trials only [1,4]. Two different drug response rates were included to illustrate how these data might be affected by differences in the criterion used to define a therapeutic response. Calculations were made using a standard deviation of 8, which was derived from two recent meta-analyses [2,4]. Despite the modest size of the drug-placebo differences, the response rate differences are impressive, ranging from 10% to 25%. Similarly, the odds ratios range from 1.50 to 2.79, and the number needed to treat ranges from 4 to 10. It is also noteworthy that these relations between difference scores and response rate differences are relatively independent of the response criterion that is adopted. In contrast, risk ratios (a.k.a. relative risk) are higher when a lower criterion is chosen.

3. How the response rate illusion is produced

The response rate illusion is not due to response rate statistics, but rather to our interpretation of them. We think of a responder as a person who improves and a non-responder as a person who does not improve, but this is not necessarily

Table 1
Response rate differences on the HRSD as a function of drug response rate and differences in improvement scores

Response rate    HRSD         Response rate in     Response rate     Odds     Relative
in drug group    difference   placebo group (%)    difference (%)    ratio    risk       NNT
50%              2            40                   10                1.50     0.83       10
                 3            35                   15                1.86     0.77        7
                 4            31                   19                2.23     0.72        5
                 5            27                   23                2.70     0.68        4
65%              2            55                   10                1.52     0.78       10
                 3            50                   15                1.86     0.70        7
                 4            45                   20                2.27     0.64        5
                 5            40                   25                2.79     0.58        4
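The derived statistics in Table 1 can be reproduced from a pair of response rates. One caveat: the relative-risk column matches the ratio of non-response risks (drug over placebo), which is the definition assumed in this sketch:

```python
def categorical_stats(drug_rate, placebo_rate):
    """Odds ratio, relative risk of non-response (drug vs. placebo), and
    number needed to treat implied by a pair of response rates."""
    odds_ratio = (drug_rate / (1 - drug_rate)) / (placebo_rate / (1 - placebo_rate))
    relative_risk = (1 - drug_rate) / (1 - placebo_rate)  # risk of non-response
    nnt = 1 / (drug_rate - placebo_rate)
    return odds_ratio, relative_risk, nnt

# First row of Table 1: 50% drug response rate vs. 40% placebo response rate.
or_, rr, nnt = categorical_stats(0.50, 0.40)
print(f"{or_:.2f} {rr:.2f} {nnt:.0f}")  # 1.50 0.83 10
```

The same function reproduces the remaining rows, e.g. `categorical_stats(0.65, 0.55)` yields an odds ratio of 1.52, a relative risk of 0.78, and an NNT of 10.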
Fig. 1. A 3-point drug-placebo difference on the Hamilton Rating Scale for Depression is relatively small in terms of clinical significance, but it corresponds to an impressive 15% difference in expected response rates.
accurate when response to treatment has been derived from continuous scores rather than defined by natural categories (e.g., survival). A patient who is classified as a non-responder on the basis of a criterion of improvement on a continuous scale may have shown substantial and clinically significant improvement. Consider, for example, the most commonly used definition of response in reviews of antidepressant efficacy, which is a 50% decrease in depressive symptoms [1,3]. The mean baseline HRSD score in clinical trials analyzed in two different meta-analyses was 24 [2,4], so for a typical patient in these trials the criterion for response would be a 12-point improvement. This means that patients showing an 11-point improvement in HRSD symptom scores would be classified as non-responders, despite the fact that their improvement is about three times the drug-placebo difference reported in the published literature [1]. Lack of response does not mean that the patient has not improved; it means that the improvement has been less than the criterion chosen for defining a therapeutic response. A one-point difference in improvement can push a person from being a non-responder to being a responder, regardless of the scale that is used, the mean difference in improvement scores, the standard deviation of those scores, or the shape of the distribution.

With this in mind, the meaning of terms like response rate, odds ratio, and number needed to treat needs to be reconsidered. When response rates have been derived from changes in continuous measures, differences in those rates do not indicate differences in the number of people who have improved; they indicate differences in the number of people whose degree of improvement has pushed them over a specified criterion. Odds ratios do not indicate the relative likelihood of improvement; they indicate the relative likelihood of crossing the response criterion (e.g., by improving by 12 HRSD points instead of only 11).
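A minimal sketch of how a one-point difference flips the classification, using the 50% criterion and the typical baseline of 24 from the text (the function name is ours):

```python
def is_responder(baseline, improvement, fraction=0.5):
    """Classify response as improvement of at least `fraction` of the
    baseline score (the 50% HRSD reduction criterion discussed above)."""
    return improvement >= fraction * baseline

# A typical trial patient has a baseline HRSD of 24, so the cut-off is 12 points.
print(is_responder(24, 11))  # False: an 11-point improvement counts as non-response
print(is_responder(24, 12))  # True: one additional point makes a responder
```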
Number needed to treat is not the number of patients one must treat for one extra person to get better; instead, it is the number of patients one must treat to push one extra person over the response criterion.

We have used a normal distribution to estimate actual correspondences between response rates and differences in change scores, but the response rate illusion does not depend on the normality of the distribution. Regardless of the shape of the distribution, those patients whose improvement without medication is closest to the response criterion are most likely to become responders when given medication. That is because even a small additional medication effect will push them over the criterion.

4. Conclusions

Some outcome data (e.g., death and pregnancy) can only be expressed in terms of response rates. Other outcomes do not fall into natural categories and can be assessed meaningfully with continuous scores. Imposing categories on such data is hazardous. It creates the impression of discrete patterns of response where the data do not suggest any, it obscures the arbitrary nature of the criteria used to form the categories, and, as we have shown, it can spuriously inflate the differences between groups in clinical trials. Furthermore, these concerns are not limited to antidepressants. For example, a seminal trial of clozapine showed modest differences in rating scale scores, but a 26% difference in response
rates [5]. Given these problems in the relation between continuous scores and response rate differences, clinical effectiveness should be gauged by differences in mean outcome scores, rather than by differences in response rates or statistics derived from them. Response rates derived from continuous measures do not add information, and they can create an unwarranted illusion of clinical effectiveness.

References

[1] NICE; National Institute for Clinical Excellence, 2004; Vol. 2005.
[2] Kirsch I, Moore TJ, Scoboria A, Nicholls SS. Prev Treat 2002;5 [article 23, available at http://www.journals.apa.org/prevention/volume5/pre0050023a.html, posted July 15, 2002].
[3] Mulrow CD, Williams JWJ, Trivedi M, et al. Agency for Health Care Policy and Research: Rockville, MD, 1999.
[4] Kirsch I, Sapirstein G. Prev Treat 1998;1 [article 0002a, available at http://www.journals.apa.org/prevention/volume1/pre0010002a.html, posted June 26, 1998].
[5] Kane JM, Honigfeld G, Singer J, Meltzer H. Psychopharmacol Bull 1988;24:62–7.