A first step to assess harm and benefit in clinical trials in one scale


Journal of Clinical Epidemiology 63 (2010) 627-632

Maarten Boers (a,*), Peter Brooks (b), James F. Fries (c), Lee S. Simon (d), Vibeke Strand (e), Peter Tugwell (f)

a Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, the Netherlands
b Faculty of Health Sciences, University of Queensland, Royal Brisbane and Women’s Hospital, Herston, Brisbane, Queensland 4006, Australia
c Division of Immunology/Rheumatology, Stanford University School of Medicine, Palo Alto, CA, USA
d SDG LLC, Cambridge, MA, USA
e Division of Immunology/Rheumatology, Stanford University School of Medicine, Portola Valley, CA, USA
f Institute of Population Health, University of Ottawa, Ottawa, Ontario, Canada

Accepted 3 July 2009

Abstract

Objective: To develop a simple system to assess benefit and harm of treatment on a single scale. Harm and benefit signals from trials need to be placed in the proper perspective to decide on the value of a treatment. Several systems have been developed for assessment, but few attempt to incorporate both benefit and risk in the same metric while retaining enough simplicity to aid patients and clinicians in their decision making.

Study Design and Setting: We designed a very simple 3 × 3 table (Outcome Measures in Rheumatology [OMERACT] 3 × 3) that comprises three ranks for both benefit and harm outcomes: for benefit, these are “none,” “substantial,” and “(near) remission”; for harm, these are “none,” “severe,” and “(near) death.” Patients are ranked for both benefit and harm and subsequently counted in a 3 × 3 table.

Results: The system was feasible when applied to one trial dataset (patient-level information) and a meta-analysis. To become applicable as a tool, several issues need to be resolved in further development, especially the definitions and cutoffs for the ranks.

Conclusion: A simple 3 × 3 table to rank both benefit and harm outcomes is feasible. For rheumatology this will be further developed in the context of the OMERACT initiative. © 2010 Elsevier Inc. All rights reserved.

Keywords: Benefit; Harm; Risk; Outcome assessment; Benefit-risk assessment; Drug treatment; Clinical trial; Clinical pharmacology; Clinical epidemiology

* Corresponding author. Department of Epidemiology and Biostatistics, VU University Medical Center, PK 6Z 185, PO Box 7057, 1007 MB Amsterdam, the Netherlands. Tel.: +31-20-444-4474; fax: +31-20-444-4475. E-mail address: [email protected] (M. Boers).
0895-4356/$ - see front matter © 2010 Elsevier Inc. All rights reserved. doi: 10.1016/j.jclinepi.2009.07.002

1. Introduction

Data on harm (here used interchangeably with “risk”) and benefit of a medical (often drug) treatment, whether originating from trials or observational studies, need to be placed in the proper perspective to decide on the utility of a treatment. In many settings, standardized measures are available to assess benefit, but in comparison, the assessment of harm is still fairly primitive. Moreover, benefit and harm have not yet been usefully combined into one scale. In 1998, the Council for International Organizations of Medical Sciences (CIOMS) stated, “It is a frustrating aspect of benefit-risk evaluation that there is no defined and tested algorithm or summary metric that combines benefit and risk data and that might permit straightforward quantitative comparisons of different treatment options, which in turn might aid in decision making” [1]. This statement was repeated by a working group of the Committee for Medicinal Products for Human Use of the European Medicines Agency in 2007 [2] and expanded by Mussen et al. [3].

One of the problems is that benefit and risk assessment is mostly driven by the need to make decisions, whereas most research is “truth driven.” More specifically, benefit-risk assessment involves placing value judgments on scientific facts. These values will vary depending on the perspective of the assessor. In addition, comparing benefit and risk resembles comparing apples and pears, and it involves trading off (and discounting) long-term against short-term effects. Finally, in real life multiple benefits and risks need to be assessed simultaneously.

A brief literature review of what is usually called “benefit-risk assessment” (PubMed and Google Scholar; terms: benefit, risk, harm, drug therapy, assessment, and review in various combinations) revealed a relatively sparse body of literature, some of it less accessible (book chapters, etc.).


Broadly speaking, two types of methodology have been proposed: simple qualitative ranking systems and more complicated methods. Honig distinguished a third category, that of models for specific classes of drugs [4]. Within the context of a special meeting held at the ninth Outcome Measures in Rheumatology Conference (OMERACT, see http://www.omeract.org) [5], we have identified the need for simple tools that incorporate both benefit and risk into one scale, to aid patients and clinicians in their decision making. The following were considered sufficiently simple that patients can understand them, clinicians can explain them, and decision makers without technical expertise can follow them:

1. Number needed to treat vs. number needed to harm [6,7]
2. Unqualified success and unmitigated failure [8]
3. Principle of Three and Transparent Uniform Risk/Benefit Overview (TURBO) [1]
4. Grading of Recommendations Assessment, Development and Evaluation (GRADE) [9]

Within a single trial, the number needed to treat/number needed to harm metrics are frequently used; they are also applied in meta-analyses to give the best synthesized estimates [10]. In such reviews, separate numbers are shown for the main outcomes of benefit and harm. Mancini and Schulzer developed an extension to incorporate both numbers in one metric [8]: these authors introduced the concepts of unqualified success (benefit without any harmful event) and unmitigated failure (a harmful event without any benefit). Calculation of these entities is straightforward if the occurrences of benefit and harm are deemed independent, and slightly more complicated in case of a known dependency. However, both the number needed to treat/harm and its extension are limited to one benefit and one risk (unless these are pooled), expressed as binary endpoints.

The “Principle of Three” and “TURBO” models were described in the 1998 CIOMS report [1]. These can be used to summarize all the available evidence (trials and observational studies) on a qualitative scale. Although the “Principle of Three” originally pertained to selecting the three most serious and the three most frequent side effects for analysis, it also applies to the method as a whole, comprising three separate 3 × 3 tables. The first describes the disease, the second the benefit(s), and the third the harm(s) of the treatment. In each, the dimensions “seriousness,” “duration,” and “incidence” are scored on a four-point scale: 0 = absent or no effect; 1 = low; 2 = medium; and 3 = high. Treatment “profiles” as expressed in the tables can then be compared. The “TURBO” model is an expansion of the Principle of Three. In the first step, the most important harms receive a semiquantitative score on a scale of 1 to 7 based on frequency and severity. Likewise, the most important benefits receive a score based on likelihood and degree of effect. Finally, both scores are plotted on a two-dimensional diagram with seven points on each axis.
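The unqualified success and unmitigated failure concepts mentioned above can be illustrated with a minimal sketch for the independent case only. The function names and the probabilities below are hypothetical illustration values, not figures from Mancini and Schulzer or from any trial; the sketch simply applies the product rule for independent events.

```python
def unqualified_success(p_benefit: float, p_harm: float) -> float:
    """Probability of benefit without any harm, ASSUMING independence."""
    return p_benefit * (1.0 - p_harm)


def unmitigated_failure(p_benefit: float, p_harm: float) -> float:
    """Probability of harm without any benefit, ASSUMING independence."""
    return p_harm * (1.0 - p_benefit)


# Hypothetical probabilities, chosen only for illustration.
p_benefit = 0.40
p_harm = 0.10

us = unqualified_success(p_benefit, p_harm)
uf = unmitigated_failure(p_benefit, p_harm)
print(f"unqualified success: {us:.2f}, unmitigated failure: {uf:.2f}")
```

If benefit and harm are known to be dependent, the product rule no longer holds and the joint probabilities would have to be taken from patient-level data instead.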

The “Grading of Recommendations Assessment, Development and Evaluation” (GRADE) system takes a different approach by emphasizing quality of evidence [9]. Explicit judgments are requested for each important setting. Trade-offs between benefit and harm are categorized as:

- Net benefits: the intervention clearly does more good than harm.
- Trade-offs: there are important trade-offs between benefit and harm.
- Uncertain trade-offs: it is not clear whether the intervention does more good than harm.
- No net benefits: the intervention clearly does not do more good than harm.

GRADE thus provides nomenclature for a decision based on implicit or explicit weighting of the evidence for the trade-offs. The implementation of GRADE is still evolving.

In our search, we also identified more complex methods, judged to be substantive but too complex for easy explanation to patients and nontechnical stakeholders. These included minimal clinical efficacy [11], various models based on quality of life measures, for example, Q-TWiST [12], multi-criteria decision analysis [13], and the incremental risk-benefit ratio and risk-benefit acceptability curve (Bayesian probabilistic modeling) [14]. These methods fall outside the scope of this study.

From this review, and inspired by the patient decision aids developed in Ottawa [15,16], we propose a very simple ranking system (OMERACT 3 × 3) that builds on the Principle of Three [1], to be piloted within the context of a trial.

2. OMERACT 3 × 3

We propose to initially devise a very simple system that can categorize the outcome of a patient according to three categories of benefit and three categories of harm, creating a 3 × 3 table (Table 1). For both benefit and harm, the first two categories are “none/minimal” and “major.” The third category is “(near) remission” for benefit and “(near) death” for harm. If necessary, more detail can be assessed later or on a separate “page.” Note that we use the term “harm” because adverse events occur regardless of therapy and often cannot be attributed with great certainty to that treatment.

Table 1. A 3 × 3 table to classify benefit and harm in a trial

Of course, using only three categories grossly simplifies matters and cannot replace more detailed analyses. However, it may avoid complex weighting processes because many events are placed into one category. For example, using only three categories for harm would allow categorization of myocardial infarctions of a certain severity into one group together with gastrointestinal (GI) bleeds of a certain severity, progression of lung carcinoma, and other severe adverse events occurring in other organ systems.

Definitions of the categories will need to be operationalized to allow use in specific situations of potential harm and benefit; this will require a separate exercise. For now, our aim is to define the potential states of benefit and harm that may persist for a clinically relevant period (to be defined) rather than focusing on changes. Thus, patients achieving a minimal disease activity state will (depending on the choice of category definitions) be placed into the rightmost column of benefit and, if they escape adverse events, in the upper right cell. In contrast, worsening of disease, a flare, or onset of a comorbidity will put them in the second row, and if no other (unexpected) benefit occurs, they will be categorized in the leftmost cell of that row. Another example: if patients with rheumatoid arthritis experience an ACR20 response while their disease remains moderately active, and pancytopenia requires them to switch from methotrexate to etanercept, they would be categorized in the middle cell of the table: substantial benefit with substantial harm.

At the end of a trial, the table would be filled with counts of occurrences, from which probabilities could be estimated for the occurrence of each of the nine combinations in a population; subsequently, the overall benefit/harm profiles of occurrences can be compared between treatment groups.

To demonstrate this in practice, we offer two examples: one is the COBRA (COmbinatietherapie Bij Reumatoide Artritis) trial [17], where individual patient data are available, and the other is the VIGOR (Vioxx Gastrointestinal Outcomes Research) trial, with cardiovascular and GI event rates reported at the group level [18]. The COBRA trial compared sulfasalazine monotherapy (SSZ) with combination therapy comprising high-dose prednisolone rapidly tapered to a low maintenance dose, methotrexate, and sulfasalazine in 155 patients with early rheumatoid arthritis over 1 year, with outcome assessments every 3 months.
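The counting step described above can be sketched as follows. This is a minimal illustration, not the procedure used for the trial analyses: the rank coding (0, 1, 2), the function names, and the patient list are all hypothetical. Rows represent harm and columns represent benefit, matching the orientation described in the text (upper right cell = best benefit, no harm).

```python
from collections import Counter

# Rank labels taken from the text; the 0/1/2 coding is an assumption.
BENEFIT_RANKS = ("none/minimal", "major", "(near) remission")
HARM_RANKS = ("none/minimal", "major", "(near) death")


def tabulate_3x3(patients):
    """Count patients into a 3 x 3 grid.

    `patients` is an iterable of (benefit_rank, harm_rank) pairs,
    each rank being 0, 1, or 2. Returns table[harm][benefit].
    """
    counts = Counter(patients)
    return [[counts[(b, h)] for b in range(3)] for h in range(3)]


def to_proportions(table):
    """Convert cell counts to proportions of all patients in the group."""
    total = sum(sum(row) for row in table)
    if total == 0:
        raise ValueError("table contains no patients")
    return [[cell / total for cell in row] for row in table]


# Hypothetical ranked patients for one treatment group (not trial data).
patients = [(2, 0), (1, 0), (1, 1), (0, 0), (0, 1),
            (1, 0), (2, 0), (0, 0), (1, 1), (0, 2)]

table = tabulate_3x3(patients)
proportions = to_proportions(table)
for label, row in zip(HARM_RANKS, table):
    print(f"harm {label:>13}: {row}")
```

Building one such table per treatment group gives the profiles that the text proposes to compare.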
For the current example, we defined major benefit as a disease activity score (DAS) of <3.7 on at least two occasions and (near) remission as a DAS of <2.4 on at least two occasions, corresponding to the original EULAR (European League Against Rheumatism) improvement criteria [19]. Severe harm was defined as treatment discontinuation for severe adverse events, loss of efficacy, or both. The marginal totals in Table 2 show the data as reported in the original trial, expressed as percentage of patients in each group. As there were no deaths or patients with severe incapacitation, the lowest row of the table is empty. Patients in the combination therapy group reported more benefit and fewer adverse events, as illustrated in Tables 2 and 3 (a 2 × 2 table). Although the occurrence of severe harm was relatively low, 5% and 4% of the SSZ group with serious harm also experienced substantial benefit or (near) remission, respectively. In the COBRA group, these percentages were 3 and 0.

Table 2. Results of the COBRA trial, expressed in a 3 × 3 table of benefit and harm
SSZ, sulfasalazine monotherapy; COBRA, combination therapy.

Table 3. Results of the COBRA trial, expressed in a 2 × 2 table of benefit and harm

The VIGOR trial compared rofecoxib and naproxen for serious GI events over a median exposure of 9 months in 8,076 patients. Most controversy revolved around the high incidence of myocardial infarctions in the rofecoxib group. For GI events, we used the data from the primary report. For cardiovascular events, we used analyses of the rates of adjudicated events defined by the Antiplatelet Trialists’ Collaboration (APTC) endpoint, including myocardial infarctions, cerebrovascular accidents, and cardiovascular deaths, reported at Food and Drug Administration meetings and in subsequent reviews [20]. Without access to individual patient data, in order to fill the tables, we assumed occurrence of benefit and harm to be independent. We also assumed (antirheumatic) efficacy measured by improvements in the Health Assessment Questionnaire (HAQ) to be equal between the two treatment groups. We defined major benefit as a decrease in HAQ of more than 0.3 points and (near) remission as a decrease of more than 0.5 points. For major harm, we explored two definitions: (1) any dropout for an adverse event or loss of efficacy, occurrence of a predefined GI event (endpoint of trial), or an APTC event and (2) occurrence of a GI or APTC event only.

Table 4. Results of the VIGOR trial, outline of a 3 × 3 table of benefit and harm
Major harm is defined as dropout for adverse event or loss of efficacy; cardiovascular event or GI event.

With the first definition of harm, the benefit/harm profiles of both treatments appear almost identical (Table 4). This should come as no surprise knowing that >16% of patients in both groups discontinued prematurely for an adverse event and >6% for loss of efficacy. The results with the second definition are more interesting in view of discussions regarding trade-offs between GI and cardiovascular harm (Table 5) [21]. In this model, major harm is less frequent with rofecoxib than with naproxen, but the GI advantage of the COX-2 inhibitor is blunted by an excess of cardiovascular events. Both models indicate that one-third of patients with major harm also experienced major benefit or (near) remission.
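The independence assumption used to fill the VIGOR tables from group-level data amounts to multiplying the marginal proportions. The sketch below shows that step under stated assumptions: the function name is hypothetical and the marginal proportions are invented for illustration; they are not the VIGOR results.

```python
def joint_from_marginals(benefit_marginal, harm_marginal):
    """Fill a 3 x 3 table as the outer product of two marginals.

    Valid only if benefit and harm are ASSUMED independent.
    Returns joint[harm][benefit], each cell a proportion.
    """
    for marginal in (benefit_marginal, harm_marginal):
        if abs(sum(marginal) - 1.0) > 1e-9:
            raise ValueError("marginal proportions must sum to 1")
    return [[h * b for b in benefit_marginal] for h in harm_marginal]


# Hypothetical marginals: (none/minimal, major, near remission or near death).
benefit = [0.50, 0.30, 0.20]
harm = [0.80, 0.19, 0.01]

joint = joint_from_marginals(benefit, harm)
for row in joint:
    print(["%.3f" % cell for cell in row])
```

With patient-level data, as in the COBRA example, the cells are counted directly and no such assumption is needed.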

3. Discussion

Several issues need further discussion.

3.1. Definition of benefit and harm domains

We suggest that benefit is any occurrence or change that results in a patient being in a better state than before treatment. Likewise, harm is any occurrence or change such that a patient is in a worse state than before. This system assumes that both benefit and harm can coexist in the same patient, simultaneously or sequentially during the course of treatment/observation (bottom right four categories of the 3 × 3 table). For example, a patient could initially go into remission and then die. Another possibility for definition of these domains would be to include one domain of specific (intended) drug effect (e.g., pain reduction for an analgesic) and another for all other “off target” effects (Table 6). This system may appear more logical in terms of drug development but comes at a cost of an extra category in each domain: a category of significant harm in the specific effects (e.g., more pain in an analgesic trial) and a category of unexpected significant benefit in the generic domain (e.g., minoxidil; hirsutism in bald men). Usually, the latter will be very rare.

Table 5. Results of the VIGOR trial, expressed in a 2 × 2 table of benefit and harm
Major harm is defined as cardiovascular event, GI event, and death.

Table 6. An alternative system of classification, based on the distinction between intended and unintended effects

3.2. Defining the cutoffs/weights

The OMERACT 3 × 3, like other methods outlined in the introduction, requires weights to be attached to categories of benefit and harm, as well as to the severity of the disease being treated. In other words, operationalizing the categories (e.g., when does a minor harm become major?) requires judgment and consensus, reached through a predefined adjudication process. This is probably the key weak point that prevents wide application of previous methods: all authors call for prospective study and validation, which subsequently is not performed. In any agreed-upon system, there will be difficult situations to resolve. We need to stress that any system that simplifies a very complex reality cannot supplant a more in-depth study of the full results. When applying the system and a specific set of definitions, it will be important to show that the results are robust and stand up to sensitivity analyses in which the cutoff definitions are changed. Within rheumatology, perhaps the OMERACT safety group can lead a process to create weights via consultations and consensus with various stakeholders.

3.3. Time of observation

The simplest model is to select the trial observation period for both benefit and harm. More advanced models could select a longer period but would then need input from external sources to determine the length of observation, for example, the likelihood of future benefit and harm vs. incidence of disease-associated comorbidities over time.


3.4. How to compare the profile of occurrences between trial groups

The statistical method best suited to this requires further discussion: a 3 × 3 chi-square test is probably too simple, as ranking is inherent in both the horizontal and vertical directions. However, the ranking of each individual combination of benefit and harm may not be completely straightforward and likely will differ in various settings. For instance, in oncology, patients almost invariably choose a situation of (near) remission even at the price of severe harm (e.g., by extensive surgery, chemotherapy) over a situation of no or minimal benefit combined with no or minimal harm.

3.5. Interpreting results of trials where prevention of harm is the objective

Prevention of harm is beneficial but, in this system, will be counted as a difference between treatment groups in the occurrence of harm. As shown in the VIGOR trial of rofecoxib vs. naproxen, most patients were included in the middle column for benefit, achieving substantial pain reduction. For harm, naproxen had more patients in the middle row than rofecoxib, indicating a GI benefit of the latter, but this was partially offset by an increase in myocardial infarctions. Application of this principle is less intuitive in situations where a relentless progression of disease is expected (e.g., small cell lung carcinoma). In such a situation, “no deterioration” or “stabilization” would be considered a benefit but would be categorized into the leftmost column, and into the upper left cell if there were no toxicity or other unexpected benefits; yet this would be a much better state than that of most patients with this disease. However, this does not pose a real problem when comparing across treatment groups in a trial: receiving the active agent results in many patients being categorized in this cell, whereas with placebo most patients would be included in the middle or lowest cell of the leftmost column, that is, substantial harm or (near) death.

3.6. Summarizing evidence across trials or observational studies

The problems inherent in systematic review are not solved by our system. Straightforward comparisons between treatment groups in a trial can be extended to a summary of multiple trials addressing the same question, but in most situations, multiple trials use different comparisons, for example, placebo- vs. active-controlled trials. A body of literature has grown around the idea of “indirect comparisons,” contrasting treatments that have not been directly compared in trials [22].

3.7. Assessing “background” efficacy and toxicity

Both in controlled trials and observational studies, we do not know what benefit and harm the studied patients would have experienced without exposure to the treatment of interest. To properly place the risk/benefit profile of a treatment within the population of interest, it is necessary to determine the “background risk” in this population from outside sources.

3.8. Feasibility

The system can be readily applied to patient aids, where 1,000 circles are drawn in nine colors to reflect the various combinations of benefit/harm, in the frequency with which they occur. This would likely require deleting organ-specific categories of potential harm. Obviously, the situation is simplified when the 2 × 2 classification is used. Once this system is initiated, other refinements could be instituted, such as rank ordering the nine cells in order of preference, creating an immensely useful tool: a final common one-dimensional scale of benefit and harm. Imagination could carry us further: to this 3 × 3 two-dimensional framework, a third dimension could be added, that of cost, thus facilitating comparison of the “value” of different treatments. For now, we propose testing the two-dimensional framework in several available trial datasets.
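The patient-aid idea in section 3.8 (1,000 circles in nine colors) requires turning nine cell proportions into whole numbers of circles that add up to exactly 1,000. One possible way to do this, offered here only as a sketch, is largest-remainder rounding; this rounding method, the function name, and the proportions below are assumptions for illustration and are not specified in the text.

```python
def icons_per_cell(proportions, n_icons=1000):
    """Allocate n_icons across the cells of a table of proportions.

    Uses largest-remainder rounding so the counts sum to n_icons exactly.
    `proportions` is a list of rows whose cells sum to 1.
    """
    flat = [p for row in proportions for p in row]
    if abs(sum(flat) - 1.0) > 1e-6:
        raise ValueError("cell proportions must sum to 1")
    raw = [p * n_icons for p in flat]
    counts = [int(r) for r in raw]          # round every cell down first
    shortfall = n_icons - sum(counts)       # icons still to hand out
    by_remainder = sorted(range(len(flat)),
                          key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_remainder[:shortfall]:
        counts[i] += 1
    n_cols = len(proportions[0])
    return [counts[i:i + n_cols] for i in range(0, len(counts), n_cols)]


# Hypothetical 3 x 3 proportions (rows: harm, columns: benefit).
example = [[0.400, 0.240, 0.160],
           [0.095, 0.057, 0.038],
           [0.005, 0.003, 0.002]]

grid = icons_per_cell(example)
print(grid, sum(sum(row) for row in grid))
```

Each of the nine counts would then be drawn in its own color; very rare cells may receive only a handful of circles, which is itself informative for the patient.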

4. Conclusion

A clear need exists for validated and applicable tools to assess benefit and risk simultaneously. Patients are increasingly demanding that clinicians provide them with the evidence for harms and benefits; one important avenue for this is to develop the methodology to assess benefit and risk on a single scale. More active dialogue and engagement are needed between the key methodologic disciplines (clinical epidemiologists, clinical pharmacologists, pharmacoepidemiologists, and psychometricians), the opinion leaders, and patient/consumer groups in each discipline, plus of course the approval agencies, funders, industry, and relevant non-governmental organizations. The OMERACT meetings and working groups may be a suitable venue to assemble researchers interested in the topic in the area of rheumatology.

References

[1] CIOMS Working Group IV. Benefit-risk balance for marketed drugs: evaluating safety signals. Geneva: World Health Organization; 1998.
[2] Committee for Medicinal Products for Human Use (CHMP). Report of the CHMP working group on benefit risk assessment models and methods. European Medicines Agency; 2007. www.emea.europa.eu/pdfs/human/brmethods/1540407en.pdf. Accessed 6-8-09.
[3] Mussen F, Salek S, Walker S. A quantitative approach to benefit-risk assessment of medicines - part 1: the development of a new model using multi-criteria decision analysis. Pharmacoepidemiol Drug Saf 2007;16(Suppl 1):S2-S15.
[4] Honig P. Benefit and risk assessment in drug development and utilization: a role for clinical pharmacology. Clin Pharmacol Ther 2007;82:109-12.
[5] Simon L, Strand CV, Boers M, Brooks PM, Tugwell PS, Bombardier C, et al. How to ascertain drug safety in the context of benefit. Controversies and concerns. In: Proceedings of the OMERACT 9 Drug Safety Summit. J Rheumatol 2009;36:2114-21.
[6] Laupacis A, Sackett DL, Roberts RS. An assessment of clinically useful measures of the consequences of treatment. N Engl J Med 1988;318:1728-33.
[7] Haynes RB, Sackett DL, Guyatt GH, Tugwell P. Clinical epidemiology: how to do clinical practice research. Philadelphia: Lippincott Williams & Wilkins; 2006.
[8] Mancini GB, Schulzer M. Reporting risks and benefits of therapy by use of the concepts of unqualified success and unmitigated failure: applications to highly cited trials in cardiovascular medicine. Circulation 1999;99:377-83.
[9] Atkins D, Best D, Briss PA, Eccles M, Falck-Ytter Y, Flottorp S, et al. Grading quality of evidence and strength of recommendations. BMJ 2004;328:1490.
[10] Maxwell L, Santesso N, Tugwell PS, Wells GA, Judd M, Buchbinder R. Method guidelines for Cochrane Musculoskeletal Group systematic reviews. J Rheumatol 2006;33:2304-11.
[11] Moye LA. Multiple analyses in clinical trials: fundamentals for investigators. New York: Springer; 2003.
[12] Gelber RD, Goldhirsch A, Cole BF. Evaluation of effectiveness: Q-TWiST. The International Breast Cancer Study Group. Cancer Treat Rev 1993;19(Suppl A):73-84.
[13] Mann RD. Retracted: a quantitative approach to benefit-risk assessment of medicines - part 2; the practical application of a new model Filip Mussen, Sam Salek, Stuart Walker. Pharmacoepidemiol Drug Saf 2007;16:1180.
[14] Lynd LD, O’Brien BJ. Advances in risk-benefit evaluation using probabilistic simulation methods: an application to the prophylaxis of deep vein thrombosis. J Clin Epidemiol 2004;57:795-803.
[15] Patient decision aids. Accessed: 1-12-2007. Available at: http://www.ohri.ca/decisionaid
[16] Santesso N, Maxwell L, Tugwell PS, Wells GA, O’Connor AM, Judd M, et al. Knowledge transfer to clinicians and consumers by the Cochrane Musculoskeletal Group. J Rheumatol 2006;33:2312-8.
[17] Boers M, Verhoeven AC, Markusse HM, van de Laar MA, Westhovens R, van Denderen JC, et al. Randomised comparison of combined step-down prednisolone, methotrexate and sulphasalazine with sulphasalazine alone in early rheumatoid arthritis. [Erratum: Lancet 1998;351(9097):220]. Lancet 1997;350:309-18.
[18] Bombardier C, Laine L, Reicin A, Shapiro D, Burgos-Vargas R, Davis B, et al. Comparison of upper gastrointestinal toxicity of Rofecoxib and Naproxen in patients with rheumatoid arthritis. N Engl J Med 2000;343:1520-8.
[19] van Gestel AM, Prevoo ML, van’t Hof MA, van Rijswijk MH, van de Putte LB, van Riel PL. Development and validation of the European League Against Rheumatism response criteria for rheumatoid arthritis. Comparison with the preliminary American College of Rheumatology and the World Health Organization/International League Against Rheumatism Criteria. Arthritis Rheum 1996;39:34-40.
[20] Strand V. Are COX-2 inhibitors preferable to non-selective non-steroidal anti-inflammatory drugs in patients with risk of cardiovascular events taking low-dose aspirin? Lancet 2007;370:2138-51.
[21] Boers M. NSAIDs and selective COX-2 inhibitors: competition between gastroprotection and cardioprotection. Lancet 2001;357:1222-3.
[22] Caldwell DM, Ades AE, Higgins JP. Simultaneous comparison of multiple treatments: combining direct and indirect evidence. BMJ 2005;331:897-900.