ARTICLE IN PRESS
Controlled Clinical Trials 23 (2002) 703–707
Commentary
Evaluating surrogate endpoints
Michael D. Hughes, Ph.D.
Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts
The papers by Molenberghs et al. [1] and Taylor and Wang [2] provide nice perspectives on methods for evaluating surrogate endpoints. Based on these papers and others, I would like to draw three conclusions that I will then discuss in detail. First, the so-called “proportion of treatment effect explained” is not to be recommended for the evaluation of surrogate endpoints in clinical trials because of conceptual difficulties. Second, greater use should be made of joint models to understand further the extent to which the association between a potential surrogate endpoint and a patient-relevant clinical endpoint varies between treatments at the individual patient level. Third, the evaluation of surrogate endpoints should be undertaken in systematic overviews of relevant trials to provide a more complete understanding of previous experience concerning the concordance or otherwise of treatment effects on the surrogate and clinical endpoints. These understandings, together with careful consideration of knowledge about disease and treatment mechanisms, are necessary to comprehend more fully the potential risks and benefits of using surrogate endpoints.

Both Molenberghs et al. and Taylor and Wang consider the “proportion of treatment effect [on the clinical endpoint] explained” (PTE) by a surrogate endpoint. At the simplest level, the PTE is defined as the ratio $(\beta - \beta_a)/\beta$, where $\beta$ and $\beta_a$ measure the true differences between treatments in a randomized trial without and with adjustment, respectively, for the potential surrogate endpoint. As both groups of authors and others have noted, there are problems with this approach, though some, such as the problem of imprecision in the validation process, are likely to affect any approach that uses data from a single trial. The most fundamental conceptual problem is that the PTE is not a proportion, as its true value can lie outside the range 0 to 1. This has also been found empirically [3,4]. Molenberghs et al.
demonstrate this but there is a more tangible illustration of the problem. Consider $\beta$ to be the sum of a beneficial effect, $\gamma > 0$, that entirely reflects the difference in effect of two randomized treatments on the marker and an adverse effect, $\delta < 0$, that reflects a mechanism of action not captured by the marker. Then adjustment for the marker will give $\beta_a = \delta$, from which it follows that the PTE, $(\beta - \beta_a)/\beta$, equals $\gamma/(\gamma + \delta)$, which is greater than 1 whenever the net effect $\gamma + \delta$ remains positive. For example, $\gamma = 0.5$ and $\delta = -0.2$ give a PTE of $0.5/0.3 \approx 1.67$. Thus, in the worrisome situation in which a therapy might have mechanisms of action that produce beneficial effects on the clinical endpoint as well as other actions that produce adverse effects, the PTE could inappropriately indicate a useful surrogate endpoint. Given such conceptual problems and the apparent lack of arguments in the literature in favor of the PTE, it seems reasonable to recommend against its use in practice as a method for validating a surrogate endpoint for use in clinical trials.

* Corresponding author: Michael D. Hughes, Ph.D., Department of Biostatistics, Harvard School of Public Health, 655 Huntington Ave., Boston, MA 02115, USA.
0197-2456/02/$ - see front matter © 2002 Elsevier Science Inc. All rights reserved. PII: S0197-2456(02)00264-7

Although Taylor and Wang focused on using joint models for repeated measurements of a surrogate marker and time to a clinical event to evaluate the PTE, their modeling approach has other important applications. In a trial with a limited number of clinical events, the model could be used to augment the clinical event information with the surrogate marker information to obtain a more precise estimate of the difference between treatments in the effect on the clinical endpoint. In this context, the surrogate marker is often referred to as an auxiliary variable or auxiliary endpoint [5–7]. Note that use of such models does not answer the question about whether the marker is a good surrogate endpoint for future trials but does have the advantage of avoiding the extrapolation of information about surrogacy from one trial to another trial. Indeed, it is also not necessary to assume that the same association between the marker and the clinical event holds for all randomized treatments. A potential downside of this approach is that evaluation of the difference between treatments can be biased if the model is misspecified. In addition, the gains in precision may not be great: two applications in the literature indicate reductions in the standard error in estimating the difference in effect of two treatments of about 10% or less [5,7].
This is not surprising as a small number of clinical events in a study is likely to be the main motivation for using this approach, but it is this small number that also limits the precision with which the association between the marker and the clinical event can be estimated.

A second important use of the modeling approach of Taylor and Wang is to evaluate the treatment and marker interaction, and hence whether there is evidence that the association between the surrogate marker and the clinical endpoint depends on treatment assignment. If the marker is to be useful as a surrogate endpoint, then the association should be (approximately) the same for all treatments. From a clinical perspective, this would mean that the risk of a clinical event for patients with a particular marker level could be interpreted in the same way independent of the treatment received (all other factors being equal). This might therefore provide useful information for individual patient management.

The issue of the association between surrogate marker and clinical outcome also arises in the paper by Molenberghs et al. Their “adjusted association” is essentially the correlation of the marker and the clinical outcome (adjusted for any treatment effect). I prefer to think of it as one measure of the prognostic value of the marker for the clinical endpoint and, as such, it is more readily interpretable if quantified not as a correlation but as the effect on the clinical outcome (e.g., relative hazard) per unit change in the marker. As Molenberghs et al. and others note [8,9], such an association, particularly if it has a biological justification, is important but not sufficient for a good surrogate endpoint. However, going beyond the model of Molenberghs et al., it seems essential to evaluate systematically how much this association varies from treatment to treatment across studies in a particular therapeutic area.
The implications of such heterogeneity will also be easier to interpret when the clinical outcome is modeled directly in terms of the marker rather than expressed as a correlation.
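
One way to make the treatment and marker interaction discussed above concrete is a proportional hazards model with an interaction term. The notation here is an assumption introduced purely for illustration: $Z$ is a treatment indicator, $S$ is the marker value, and $\theta$, $\phi$, $\psi$ are regression coefficients. The sketch treats the marker as a single fixed measurement, so it is much simpler than the joint longitudinal model of Taylor and Wang.

```latex
\begin{align}
\lambda(t \mid Z, S) &= \lambda_0(t)\,\exp(\theta Z + \phi S + \psi Z S), \\
\frac{\lambda(t \mid Z, S + 1)}{\lambda(t \mid Z, S)} &= \exp(\phi + \psi Z).
\end{align}
```

Under this sketch the relative hazard per unit change in the marker is $\exp(\phi)$ when $Z = 0$ and $\exp(\phi + \psi)$ when $Z = 1$, so $\psi = 0$ corresponds to the same marker and outcome association under both treatments.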
In considering the use of meta-analysis to evaluate surrogate endpoints, it is useful first to step back and note that the principal caveat of using surrogate endpoints in the phase III trial setting concerns the potential for an increased risk of recommending therapies for use in clinical practice that are either ineffective or, worse, have adverse effects with respect to outcomes that are relevant to patients. It is therefore important that the effect of a treatment on the clinical endpoint is predictable given the effect of a treatment on the surrogate endpoint. The aim of a systematic overview is to provide an evaluation of this across previous randomized trials of treatments and hence provide evidence about the risk of errors in decisions about treatments when these decisions are based on a surrogate endpoint. Of note, a systematic overview will include trials that show differences in a marker outcome but no difference in the clinical endpoint, even though they might be well-powered to evaluate the clinical effects. Methods for evaluating surrogacy in an individual trial cannot generally be used in these “counterexamples” to a marker being a good surrogate. Fleming and DeMets list many such studies where decisions based on surrogate endpoints would have led to conclusions about a therapy that are likely incorrect with respect to the clinical outcome [9]. A systematic overview would help to identify further how common such studies are in any particular therapeutic area. This would then contribute to a discussion for the broader patient population about the balance between inappropriate treatment(s) being recommended versus the ability to access therapies earlier when, respectively, incorrect and correct decisions are made based upon the use of surrogate endpoints.

The major challenge in conducting such a systematic overview is likely to be the practical challenge of accessing data from all relevant studies.
This is illustrated by experience in HIV (human immunodeficiency virus) research: evaluation of the level of HIV RNA in a patient’s plasma as a surrogate endpoint for progression to AIDS (acquired immunodeficiency syndrome) or death was accomplished in a meta-analysis using individual patient data for one class of drugs [10], but a meta-analysis that included other classes of drugs [11] was limited to the use of summary statistics from publications, drug product labels, etc., because, in part, of data access issues.

As well as the adjusted association, Molenberghs et al. introduce the “relative effect” (RE), which is the ratio of the differences in effect of two randomized treatments on the clinical and surrogate endpoints, given by $\beta/\alpha$, where $\alpha$ measures the difference in effect on the surrogate and $\beta$ is the corresponding difference on the clinical endpoint. As they note, in the context of an individual trial, this ratio is difficult to interpret and so the more interesting issue concerns the variation in the RE across randomized trials. In essence, this involves modeling $\beta$ in terms of $\alpha$ across randomized trials. Daniels and Hughes [12] presented a framework for such modeling using just the trial-level estimates. Both Molenberghs et al. and Gail et al. [13] present richer models that use individual patient data and incorporate parameters reflecting both the marker and clinical outcome levels for each treatment as well as the differences between treatments.

This type of approach has been used in HIV research to evaluate measures of immune function (CD4 cell count) and viral load (HIV RNA level in plasma) as surrogate endpoints for progression to AIDS or death [4,10–12]. Simple descriptive results from these meta-analyses are important. For example, qualitatively, the proportion of trials in which the marker and clinical outcome treatment differences were discordant was very small.
However, at the quantitative level, it is possible to identify pairs of trials for which, for example, the difference between randomized treatments in mean reduction in HIV RNA was clearly larger, but
the difference in clinical outcome was smaller, in one randomized comparison versus the other. Thus, while incorrectly concluding that a treatment is clinically effective when decisions are based upon the surrogate markers was rare, the heterogeneity in the association of $\beta$ (the treatment difference on the clinical outcome) on $\alpha$ (the treatment difference on the marker) means that the ranking of treatments based on marker effects likely differs from a ranking based on clinical effects.

These models can also be extended to evaluate multiple markers. In the HIV setting, this gave the interesting result that after adjusting for the difference between treatments in effect on CD4 cell count, the difference in effect on HIV RNA level had essentially no value in predicting the difference in effect on progression to AIDS or death [10]. This result fits well with the biology of the disease whereby high viral load leads to poorer immune function and hence to progression to AIDS or death. Thus, the marker that was more proximal to the clinical outcome was the better surrogate endpoint.

An important application of the modeling in these meta-analyses, as Molenberghs et al. and others have described [12,13], is the ability to predict what might be the difference in clinical outcome in terms of an observed difference in marker levels in some future trial. This was done in the HIV meta-analyses to identify what might be minimal marker differences that, based on past clinical trials, might indicate a difference in clinical outcome [4,10,12]. Gail et al. [13] make the important point, however, that approaches taken by Daniels and Hughes [12] and by Molenberghs et al. ignore the imprecision in estimating variance components in the model and hence that the precision of the predictions can be substantially overestimated, particularly in small meta-analyses. Gail et al. provide a bootstrap approach for resolving this problem.
A consequence of this is that in the common setting in which there are few previous trials with both marker data and substantial clinical outcome data, the predictions may be so imprecise that direct evaluation of the effects of a new treatment on clinical outcome is necessary. Indeed, if the heterogeneity in the association between differences in clinical outcome and differences in marker outcome is not small, as in practice may be quite plausible, the imprecision in any prediction may always be larger than that for a direct estimate obtained from an appropriately powered clinical outcome trial [13]. Note that subdividing trial populations according to sites or groupings defined by covariate values, thus increasing the number of observational units in the meta-analysis, is not likely to resolve this problem because the totality of information is not increased. Obviously, the problems of imprecision will also be greatest for those diseases in which markers would be most valuable because of the difficulties of doing clinical outcome trials.

There are other issues with the meta-analytic approach that are important [1,12,13]. Two are concerned with significant definitional problems. First, what defines the set of previous trials that should be included in the meta-analysis? For example, should it include all trials for a particular disease (in which case, how is the disease defined?), or should it be restricted to trials that evaluate a particular class of drugs? In this context, it is obviously critical to appreciate that any use of the results of the meta-analysis when interpreting marker results from a future trial involves an extrapolation from the set of previous trials. Second, the fitted model may not be invariant to whether the difference between treatments A and B in any individual trial is defined as A–B or B–A.
Such issues serve to amplify the complexity of formal evaluation of surrogate endpoints in meta-analyses as for individual trials and the need for further development of both the conceptual framework and statistical methods.
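
The invariance issue raised above can be illustrated with a deliberately simplified sketch. It regresses the estimated treatment difference on the clinical endpoint against the estimated treatment difference on the marker across trials, using weighted least squares. Everything in it is an assumption made for illustration: the trial summaries are made-up numbers, the function names are hypothetical, and the model ignores both the sampling error in the marker differences and between-trial heterogeneity, so it is not the model of Daniels and Hughes [12], Molenberghs et al. [1], or Gail et al. [13].

```python
# Hypothetical trial-level summaries (made-up numbers, for illustration only).
# Each tuple is (alpha_hat, beta_hat, se_beta):
#   alpha_hat: estimated treatment difference on the marker
#   beta_hat:  estimated treatment difference on the clinical endpoint
#   se_beta:   standard error of beta_hat
TRIALS = [
    (0.5, 0.20, 0.10),
    (1.0, 0.35, 0.08),
    (1.5, 0.70, 0.12),
    (2.0, 0.80, 0.15),
]


def fit_with_intercept(trials):
    """Weighted least squares of beta on alpha, with weights 1 / se_beta**2.

    Returns (intercept, slope).
    """
    weights = [1.0 / se ** 2 for _, _, se in trials]
    total = sum(weights)
    mean_a = sum(w * a for w, (a, _, _) in zip(weights, trials)) / total
    mean_b = sum(w * b for w, (_, b, _) in zip(weights, trials)) / total
    sxx = sum(w * (a - mean_a) ** 2 for w, (a, _, _) in zip(weights, trials))
    sxy = sum(w * (a - mean_a) * (b - mean_b)
              for w, (a, b, _) in zip(weights, trials))
    slope = sxy / sxx
    return mean_b - slope * mean_a, slope


def fit_through_origin(trials):
    """Weighted least squares slope of beta on alpha with no intercept."""
    num = sum(a * b / se ** 2 for a, b, se in trials)
    den = sum(a * a / se ** 2 for a, _, se in trials)
    return num / den


def relabel(trial):
    """Define the treatment difference as B-A instead of A-B."""
    a, b, se = trial
    return (-a, -b, se)


# Relabel the arms of the first trial only; the data are otherwise unchanged.
RELABELLED = [relabel(TRIALS[0])] + TRIALS[1:]

intercept, slope = fit_with_intercept(TRIALS)
intercept_r, slope_r = fit_with_intercept(RELABELLED)

if __name__ == "__main__":
    print("with intercept, original labels:  ", intercept, slope)
    print("with intercept, one trial flipped:", intercept_r, slope_r)
    print("through origin, original labels:  ", fit_through_origin(TRIALS))
    print("through origin, one trial flipped:", fit_through_origin(RELABELLED))
```

In this sketch the line fitted with an intercept changes when one trial's arms are relabelled, whereas the through-origin slope does not, because the products and squares entering it are unaffected by the sign change. This is only one simple way to see the issue; it says nothing about which model form is appropriate in a given therapeutic area.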
References
[1] Molenberghs G, Buyse M, Geys H, et al. Statistical challenges in the evaluation of surrogate endpoints in randomized trials. Control Clin Trials 2002;23:607–625.
[2] Taylor JMG, Wang Y. Surrogate markers and joint models for longitudinal and survival data. Control Clin Trials 2002;23:626–634.
[3] Flandre P, Saidi Y. Letters to the editor: estimating the proportion of treatment effect explained by a surrogate marker. Stat Med 1999;18:107–115.
[4] Hughes MD, Daniels MJ, Fischl MA, Kim S, Schooley RT. CD4 cell counts as a surrogate endpoint in HIV clinical trials: a meta-analysis of studies of the AIDS Clinical Trials Group. AIDS 1998;12:1823–1832.
[5] Fleming TR, Prentice RL, Pepe MS, Glidden D. Surrogate and auxiliary endpoints in clinical trials, with potential applications in cancer and AIDS research. Stat Med 1994;13:955–968.
[6] Hogan JW, Laird NM. Increasing the efficiency from censored survival data by using random effects to model longitudinal covariates. Stat Methods Med Res 1998;7:28–48.
[7] Faucett CL, Schenker N, Taylor JMG. Survival analysis using auxiliary variables via multiple imputation, with application to AIDS clinical trial data. Biometrics 2002;58:37–47.
[8] Prentice RL. Surrogate endpoints in clinical trials: definition and operational criteria. Stat Med 1989;8:431–440.
[9] Fleming TR, DeMets DL. Surrogate end points in clinical trials: are we being misled? Ann Intern Med 1996;125:605–613.
[10] HIV Surrogate Marker Collaborative Group. Human immunodeficiency virus type 1 (HIV-1) RNA level and CD4 count as prognostic markers and surrogate endpoints: a meta-analysis. AIDS Res Hum Retroviruses 2000;16:1123–1133.
[11] Hill AM, DeMasi R, Dawson D. Meta-analysis of antiretroviral effects on HIV-1 RNA, CD4 cell count and progression to AIDS or death. Antivir Ther 1998;3:139–145.
[12] Daniels MJ, Hughes MD. Meta-analysis for the evaluation of potential surrogate markers. Stat Med 1997;16:1965–1982.
[13] Gail MH, Pfeiffer R, van Houwelingen HC, Carroll RJ. On meta-analytic assessment of surrogate outcomes. Biostatistics 2000;1:231–246.