Journal of Clinical Epidemiology 66 (2013) 124–131
COMMENTARIES
Introducing GRADE across the NICE clinical guideline program

Judith Thornton (a,*), Philip Alderson (a), Toni Tan (a), Claire Turner (b), Sue Latchem (b), Elizabeth Shaw (a), Francis Ruiz (b), Stefanie Reken (b), Moira A. Mugglestone (c), Jennifer Hill (d), Julie Neilson (d), Maggie Westby (d), Karen Francis (e), Craig Whittington (f), Faisal Siddiqui (a), Tarang Sharma (a), Victoria Kelly (a), Lynda Ayiku (a), Kathryn Chamberlain (a)

(a) Centre for Clinical Practice, National Institute for Health and Clinical Excellence, Level 1A, City Tower, Piccadilly Plaza, Manchester M1 4BD, United Kingdom
(b) Centre for Clinical Practice, National Institute for Health and Clinical Excellence, MidCity Place, 71 High Holborn, London WC1V 6NA, United Kingdom
(c) National Collaborating Centre for Women’s and Children’s Health, Kings Court, Fourth Floor, 2-16 Goodge Street, London W1T 2QA, United Kingdom
(d) National Clinical Guideline Centre (NCGC), Royal College of Physicians, 11 St Andrews Place, Regents Park, London NW1 4LE, United Kingdom
(e) National Collaborating Centre for Cancer, 2nd Floor Park House, Greyfriars Road, Cardiff, Wales CF10 3AF, United Kingdom
(f) National Collaborating Centre for Mental Health, 4th Floor, 21 Mansell Street, London E1 8AA, United Kingdom

Accepted 19 December 2011; Published online 8 March 2012
Abstract

Objectives: Grading of Recommendations Assessment, Development and Evaluation (GRADE) is a system for rating the confidence in estimates of effect and grading guideline recommendations. It promotes evaluation of the quality of the evidence for each outcome and an assessment of balance between desirable and undesirable outcomes leading to a judgment about the strength of the recommendation. In 2007, the National Institute for Health and Clinical Excellence began introducing GRADE across its clinical guideline program to enable separation of judgments about the evidence quality from judgments about the strength of the recommendation.

Study Design and Setting: We describe the process of implementing GRADE across guidelines.

Results: Use of GRADE has been positively received by both technical staff and guideline development group members.

Conclusion: A shift in thinking about confidence in the evidence was required, leading to a more structured and transparent approach to decision making. Practical problems were also encountered; these have largely been resolved, but some areas require further work, including the application of imprecision and presenting results from analyses considering more than two alternative interventions. The use of GRADE for nonrandomized and diagnostic accuracy studies needs to be refined.

Crown Copyright © 2013 Published by Elsevier Inc. All rights reserved.

Keywords: Clinical guidelines; GRADE; Evidence synthesis
1. Introduction

The National Institute for Health and Clinical Excellence (NICE) has published more than 120 clinical guidelines providing recommendations, based on clinical and cost-effectiveness, for health care professionals in England and Wales. NICE commissions four external national collaborating centers (NCCs) to develop guidelines and also has an in-house team that develops guidelines. In 2006, NICE decided to drop its grading system for recommendations, which assigned a grade (A, B, C, D, or good practice point) largely based on rating the quality of studies against a “hierarchy of evidence.” This was motivated by concerns about the sometimes inappropriate direct link between study design and recommendation strength (evidence from a randomized controlled trial [RCT] does not necessarily justify a strong recommendation) and anecdotal evidence that recommendations not based on evidence from trials were being ignored. Also, the World Health Organization evaluated the NICE clinical guidelines program and recommended that NICE should consider using the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach for appraising evidence (http://www.gradeworkinggroup.org/). The GRADE system was described in a series of papers in 2008 [1] and has been further developed and described in greater detail [2–11].

* Corresponding author. Tel.: +44 (0)161 870 3112; fax: +44 (0)845 0037785. E-mail address: [email protected] (J. Thornton).
0895-4356/$ - see front matter Crown Copyright © 2013 Published by Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.jclinepi.2011.12.007
J. Thornton et al. / Journal of Clinical Epidemiology 66 (2013) 124–131
What is new?

- The National Institute for Health and Clinical Excellence has introduced the Grading of Recommendations Assessment, Development and Evaluation (GRADE) system for appraising evidence across its clinical guideline program; this system promotes evaluation of the confidence in estimates of effect for each outcome and an assessment of the trade-off between desirable and undesirable outcomes and costs leading to a judgment about the strength of the recommendation.
- The aim of using GRADE was to remove the sometimes inappropriate direct correspondence between study design (within the hierarchy of evidence) and recommendation strength, thus allowing separation of judgments about the confidence in estimates of effect from judgments about the strength of the recommendation.
- Using the GRADE system involved a substantial shift in thinking from previous methods of evaluating a body of evidence.
- Other conceptual and practical problems were encountered, including defining outcomes, defining minimum important differences and imprecision, integrating health economic evidence, and evaluating evidence from multiple comparisons of interventions, such as network meta-analyses.
- Further work is required to refine the use of GRADE, including application to nonrandomized and diagnostic accuracy studies.
GRADE promotes evaluation of the confidence in estimates of effect for each outcome; for RCTs, this is based on five main quality domains with explicit consideration of risk of bias, directness of evidence, consistency of evidence, precision of the estimated effects (relative to decision making), and publication bias. For non-RCTs, the three additional quality domains are the presence of a dose–response effect, magnitude of effect, and issues around confounding. GRADE also includes assessment of the trade-off between desirable and undesirable outcomes and cost-effectiveness leading to a judgment about the strength of the recommendation (as indicated by “weak” or “strong”). The key differences from other methods are that GRADE evaluates the evidence across studies for confidence in effect for each outcome and separates out judgments about the quality of the evidence from judgments about the strength of the recommendation. Therefore, in 2007, NICE started a pilot study of the use of GRADE within the NICE clinical guidelines program. This article describes our experience, with Table 1 summarizing the challenges encountered and the solutions that we developed.
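The per-outcome rating described above can be sketched in a few lines of Python. This is a minimal illustration, not NICE's or the GRADE working group's tooling: the function name, the domain keys, and the numeric scoring are hypothetical, and the starting levels ("high" for randomized evidence, "low" for non-randomized evidence) and the restriction of rating up to non-randomized evidence are assumptions drawn from general GRADE guidance rather than from this article.

```python
# Quality levels ordered from lowest to highest.
LEVELS = ["very low", "low", "moderate", "high"]

# The five domains the text lists for RCT evidence, and the three it lists
# as additional considerations for non-randomized evidence.
RATE_DOWN = {"risk_of_bias", "directness", "consistency", "precision", "publication_bias"}
RATE_UP = {"dose_response", "magnitude_of_effect", "confounding"}


def rate_outcome(design, rate_down=None, rate_up=None):
    """Return an illustrative quality label for one outcome across studies.

    design    -- "rct" or "non_rct"
    rate_down -- dict mapping a RATE_DOWN domain to levels removed (1 or 2)
    rate_up   -- dict mapping a RATE_UP domain to levels added (1 or 2)
    """
    rate_down = rate_down or {}
    rate_up = rate_up or {}
    if design not in ("rct", "non_rct"):
        raise ValueError(f"unknown design: {design!r}")
    unknown = (set(rate_down) - RATE_DOWN) | (set(rate_up) - RATE_UP)
    if unknown:
        raise ValueError(f"unknown domains: {sorted(unknown)}")

    # Assumption: randomized evidence starts at "high" (index 3) and
    # non-randomized evidence at "low" (index 1).
    score = 3 if design == "rct" else 1
    score -= sum(rate_down.values())
    # Assumption: rating up is applied only to non-randomized evidence.
    if design == "non_rct":
        score += sum(rate_up.values())
    # Clamp to the four available levels.
    return LEVELS[max(0, min(score, len(LEVELS) - 1))]
```

The rating is attached to an outcome across studies, not to an individual study, which is the key difference from hierarchy-based grading that the text emphasizes.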
2. Linking evidence to recommendations project

GRADE was introduced through the linking evidence to recommendations (LETR) project, which aimed to increase transparency in the development of recommendations in NICE clinical guidelines. A pilot study of GRADE focused on intervention questions in three representative guidelines. A steering group was convened with representatives from NICE and the NCCs. This steering group met on a regular basis to discuss problems and provided a link to the GRADE working group. Use of GRADE was then expanded across the entire guideline program.

During the development of a guideline, the evidence is identified and appraised by the methodologists at the NCC, who then present summaries to the guideline development group (GDG) for discussion and development of recommendations. GDGs comprise experts in the topic area, including health care professionals from primary, secondary, and tertiary care, and patient/carer representatives. Having received anecdotal reports about using GRADE from GDG members, NICE decided to examine their opinions further and sent a questionnaire to 80 GDG members of seven short clinical guidelines asking about their experience of using GRADE [12].
3. Learning new methods

The GRADE system involves a substantial shift in thinking from previous methods of evaluating evidence to a more structured and transparent approach to decision making. Apart from the conceptual changes that are required, learning a new system for making sometimes complex judgments is a challenge for those implementing GRADE, both for technical staff working on evidence reviews and for members of GDGs. To facilitate introduction of the GRADE approach, the LETR project ran induction training on the use of GRADE for all technical staff working on guidelines and GDG members (Table 1). Comments from respondents in the GDG survey (Table 2) helped to further refine the approach.
4. Specifying outcomes

GRADE has provided a welcome focus on the need to specify the important outcomes in the guideline review questions used to search for evidence. Many systematic reviews collect information on whatever outcomes are reported, raising the possibility that outcomes of less interest to patients are given higher priority simply because they have been reported more often. GRADE recommends limiting the number of outcomes to seven with a few critical outcomes; it can be challenging to persuade GDGs to limit themselves to this number, particularly for multisystem and chronic diseases. NICE does not impose a rigid limit but recognizes the need to specify a set of outcomes in advance and now does this in the review protocols for guidelines. Most of the respondents in the GDG survey were comfortable with this approach (Table 2).

Table 1. Challenges and solutions with implementing GRADE within NICE clinical guidelines

Challenge: Conceptual changes required from previous methods of evaluating evidence to a more structured and transparent approach to decision making, and learning a new system for making sometimes complex judgments.
Solution: Established induction training in use of GRADE for all technical staff. Practical problems are discussed at ongoing support sessions, and there is a “frequently asked questions” resource. All GDGs receive training in the principles and practice of GRADE at the start of guideline development.

Challenge: Limiting the number of outcomes to seven can be challenging for GDGs.
Solution: We have not imposed a rigid limit but specified a set of outcomes in advance in the review protocols for guidelines.

Challenge: Technical staff and GDGs initially struggle with the concepts of imprecision.
Solution: NCCs and GDGs have used a range of approaches depending on the review question. Ongoing discussions within the LETR group to address this issue.

Challenge: Restriction on the number of outcomes might make it difficult or even impossible to collect all relevant resource implications. Modeled estimates cannot be presented.
Solution: Designed GRADE-inspired economic evidence profiles, identifying the need for further refinement.

Challenge: Cost-effectiveness analyses are increasingly being informed by the results of complex evidence synthesis methods, such as network meta-analysis, and such synthesized evidence cannot be easily incorporated at present.
Solution: Developed a reporting standard for network meta-analyses. Adaptation of existing profiles to fit multiple treatment comparisons is ongoing.

Challenge: GRADE suggests that recommendations are categorized as “strong” or “weak.”
Solution: Reflect the concept of strength in the wording of the recommendation, preferring to use “offer” for recommendations where we are certain about the strength and quality of the evidence and there is little doubt, use “consider” where we are less sure, or write a research recommendation where we are very unsure.

Challenge: Initial GRADEpro software was unreliable and lacked the capacity to evaluate observational studies.
Solution: Found other ways of presenting evidence profiles, such as using Word templates or Excel spreadsheets. The software is now more stable with greater functionality and a better help system.

Abbreviations: GRADE, Grading of Recommendations Assessment, Development and Evaluation; NICE, National Institute for Health and Clinical Excellence; GDGs, guideline development groups; NCCs, national collaborating centers; LETR, linking evidence to recommendations.
5. Imprecision

Imprecision is an aspect of quality considered in GRADE; detailed guidance on imprecision is described in the recently published article from the GRADE working group [8]. For guideline developers, imprecision relates to the ability of the data to support a particular recommendation. This means consideration of how statistical imprecision relates to each clinically defined minimum important difference (MID) or threshold for a particular outcome. Judgments about important effects should be made before reviewing the evidence, but specifying important or meaningful differences in outcomes has been challenging for GDGs and technical staff. We found that GDGs used a range of approaches to imprecision. For example, one GDG used the default suggested by GRADE throughout guideline development, whereas a second GDG used MIDs for some outcomes where values were available from the literature, but otherwise used the default values.

Considering explicitly the importance of any imprecision relative to meaningful effects has not been a feature of past grading systems and certainly not as a criterion to downgrade the quality of evidence. Imprecision is often not addressed clearly in primary or secondary research [13]. Our experience suggests that technical staff and GDGs initially struggle with the concepts involved, and this can become a barrier to the use of GRADE, particularly where the health condition or outcomes of interest are rare, trials are small, and most effect estimates are imprecise. In these situations, most evidence may be rated down as imprecise and labeled as “moderate” or “low” quality. A newly formed GDG faced with condemning all the research evidence in its field as lower quality can feel impelled to reject the system (Table 2).
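To make the idea of judging imprecision against an MID concrete, the sketch below counts how many decision thresholds a confidence interval crosses. It is one possible rule among the "range of approaches" the text mentions, not a rule stated in this article: the function names, the mapping from thresholds crossed to a label, and the relative-risk thresholds of 0.75 and 1.25 used in the usage note are all assumptions for illustration.

```python
def thresholds_crossed(ci_lower, ci_upper, mid_lower, mid_upper):
    """Count how many of the two MID thresholds lie strictly inside the CI."""
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    if mid_lower > mid_upper:
        raise ValueError("mid_lower must not exceed mid_upper")
    return sum(ci_lower < threshold < ci_upper for threshold in (mid_lower, mid_upper))


def imprecision_rating(ci_lower, ci_upper, mid_lower, mid_upper):
    """Map the number of thresholds crossed to a label (hypothetical rule)."""
    labels = {
        0: "no serious imprecision",
        1: "serious imprecision",
        2: "very serious imprecision",
    }
    return labels[thresholds_crossed(ci_lower, ci_upper, mid_lower, mid_upper)]
```

For example, with assumed thresholds of 0.75 and 1.25 on the relative-risk scale, `imprecision_rating(0.60, 0.90, 0.75, 1.25)` flags serious imprecision because the interval spans the lower threshold. With small trials and rare outcomes most intervals are wide, which is why whole evidence bases can end up rated down.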
Table 2. Comments from health care professional members of guideline development groups in the short guideline program (26 replies)

Question: Do you think grading outcomes rather than studies is helpful in clinical guideline development?
Replies:
- 16 respondents found this approach to be helpful; e.g.: “Yes, it provides structure.”
- 4 further respondents also found the approach helpful but commented that they would also have liked to assess the individual studies; e.g.: “Yes, it is a systematic method of managing a large amount of information. I found it quite difficult to interpret some of the findings without being able to read the whole article, but the grading method was useful.”
- 3 respondents did not find the approach helpful.

Question: Was limiting the number of outcomes as specified by GRADE methodology difficult?
Replies:
- 18 respondents had no difficulty limiting the number of outcomes and felt that this approach helped them to focus on the question being asked; e.g.: “Helps to focus on the questions you are asking, which avoids ambiguity about the results you are comparing. It may make you realize that you have forgotten important outcomes, and then have to undertake further analysis to answer them.” “I originally wanted to include more outcomes but the group came to a joint agreement as to which ones were the critical and important outcomes so I felt my views were included. By refining our original questions we were able to ensure that only the most relevant outcomes were included.”
- 5 respondents found difficulties in limiting the number of outcomes; most of their concerns were around having the standard criteria for downgrading evidence; e.g.: “What I would consider reasonable evidence was often downgraded. This is my main criticism of GRADE – in that it seems like a one size fits all approach to assessing evidence. However it also could be argued that this is a good thing as evidence needs to be assessed using the same standards across the board!”

Question: Was presentation of evidence in GRADE tables helpful? If you have seen narrative summaries before, were tables more helpful?
Replies:
- 18 respondents liked the GRADE profile format for summarizing the evidence; e.g.: “Yes, I really liked the table format. Easy to follow, kept it concise.” “Generally yes. However if multiple outcomes are evaluated (many of which are often similar/overlapping), it is a lot of data to digest.”
- 4 respondents said they preferred the table to be accompanied by narrative summaries; e.g.: “I found it very difficult to accept the GRADE tables at face value without having critically appraised each manuscript myself! I think that the GRADE tables would be easier to interpret if some narrative was included too. For example, it would be useful to actually state the study aim and author’s conclusion each time.” “It would have been good to see the narrative tables first. This would have made the member think a bit about the implications of the study before reflecting on the GRADE. I found a tendency to not bother too much with a study with a poor GRADE which may at times be a mistake. I think the GRADE assessment at times needs to be challenged.”
- 2 respondents did not like tables and preferred narrative summaries only.

Question: Did GRADE help the flow of GDG discussion for developing recommendations?
Replies:
- 22 respondents found that use of GRADE supported the discussions around development of recommendations; e.g.: “Yes, everyone was able to follow tables and where necessary issues/queries were raised and discussed.” “I felt it did, in this GDG highlighted the relatively poor evidence in most of the guideline.”
- 1 respondent replied that they were “not sure,” 1 that using GRADE was “no better than anything else,” and 1 said “no.”

Question: Do you think the GRADE methodology increased transparency in developing recommendations in a guideline?
Replies:
- 21 respondents felt that use of GRADE increased transparency, although a few had some additional comments; e.g.: “Would have done if more evidence but was really lacking in this field.” “I think it meant that we often changed our views from ones previously held.”
- 3 respondents replied “no” and one replied “neutral.”

Question: What were your overall thoughts (benefits and limitations) of GRADE as experienced within this guideline?
Replies:
- Overall, comments were generally positive about the use of GRADE and reflected those already made for specific aspects of GRADE; e.g.: “Easy to understand once explained and easier to follow than text.” “A useful new tool for me - and helped the overall working group task.” “I thought it was helpful and summarized the studies nicely. However, without reading the entire studies in full, it may be difficult to say for sure. I would say that we have put a lot of trust into the information presented to us by the NICE team, although they do seem to have been meticulous in their assessment of the literature and findings.” “I think it is a very important part of the process. Once the concept is taken on board it makes analysis of a large number of papers relatively easy (at least for the members of the GDG). I do think at times, though it can lead to some nihilism about the evidence base for what is regularly done clinically. Many times the studies that need to be done to answer the questions being examined by the GDG have not been done.”

Abbreviations: GRADE, Grading of Recommendations Assessment, Development and Evaluation; GDG, guideline development group; NICE, National Institute for Health and Clinical Excellence.

6. Integration with economic analysis

Cost-effectiveness is an integral factor in decision making in NICE guidelines, and it is important to present economic evidence alongside the clinical data. Analytical techniques vary from formal de novo cost-utility analysis through to a less explicit consideration of costs and consequences, depending on a number of factors, including the availability of suitable data on which to undertake any modeling and the likely resource and/or health impact of a change in practice.

The existing GRADE framework allows collection of resource use as outcomes in its evidence profiles, although the restriction on the number of outcomes suggested might make it difficult or even impossible to collect all relevant resource implications. Furthermore, modeled estimates cannot currently be presented within the summary of findings tables and evidence profiles [14]. Finally, cost-effectiveness analyses are increasingly being informed by the results of complex evidence synthesis methods, such as network meta-analysis, and such synthesized evidence cannot be easily incorporated within a GRADE profile at present.

Members from the LETR group designed GRADE-inspired economic evidence profiles to capture and summarize such evidence (Table 3) [14]. These profiles differ from clinical GRADE profiles in terms of quality assessment, outcomes, and style. The quality assessment was simplified from five main domains into two main dimensions for economic evaluations: applicability to the decision context (part of the directness domain in GRADE) and the methodological quality of the evaluation performed (study limitations). Results from two or more cost-effectiveness models presented in the summary of findings in the economic profile can be directly compared and thus consistency assessed. We have identified the need for further refinement, especially in relation to presenting the results arising from a comparison of more than two alternatives [15].

Table 3. NICE GRADE-like profile for economic evidence

Study: Franzosi, 2001
  Quality assessment
    Limitations: Potentially serious limitations (a)
    Applicability: Partially applicable (b)
    Other comments: Based only on measured resource use and survival in 3.5 years follow-up in GISSI-P
  Summary of findings
    Incremental cost (2006 £): £871 (c)
    Incremental effects: 0.0332 LYs
    ICER: £26,243 per LY gained
    Uncertainty: £16,769 to £56,025 per LY gained (best/worst case)

Study: Lamotte, 2006
  Quality assessment
    Limitations: Very serious limitations (d)
    Applicability: Partially applicable (e)
    Other comments: Based on measured resource use and survival over 3.5 years in GISSI-P plus long-term survival benefits attributed to nonfatal events using Canadian database. Belgian results presented
  Summary of findings
    Incremental cost (2006 £): £1,090 (f)
    Incremental effects: 0.282 LYs
    ICER: £3,860 per LY gained
    Uncertainty: >98% probability ICER less than €20,000 per QALY gained

Study: NCC analysis
  Quality assessment
    Limitations: Minor limitations (g)
    Applicability: Directly applicable (h)
    Other comments: Based on morbidity and mortality estimated from Markov model using pooled effectiveness data from GISSI-P and DART. Results were sensitive to the size of treatment effects and over their assumed duration
  Summary of findings
    Incremental cost (2006 £): £1,073
    Incremental effects: 0.09 QALYs
    ICER: £12,480 per QALY gained
    Uncertainty: £3,912 to £130,705 per QALY gained (range in one-way sensitivity analyses)

Abbreviations: ICER, incremental cost-effectiveness ratio; GISSI-P, Gruppo Italiano per lo Studio della Sopravvivenza nell’Infarto Miocardico-Prevenzione; LYs, life-years; QALY, quality-adjusted life-year; NCC, national collaborating center; DART, Diet and Reinfarction Trial.
Review: Omega-3 acid ethyl ester supplements vs. control in people within 3 months of an acute myocardial infarction.
(a) This study is relatively conservative as it does not impute any quality of life or long-term survival benefit to supplements. Conversely, it omits gastrointestinal side effects.
(b) Some uncertainty over the applicability of Italian trial data to the United Kingdom. Perhaps differences in population risk and diet as well as health care use and unit costs.
(c) Converted from 1999 Italian euros using a PPP exchange rate of 0.797 (www.oecd.org/std/ppp), then uprated by inflation factor of 133.8% (www.pssru.ac.uk/pdf/uc/uc2006/uc2006.pdf).
(d) Methods and data used to estimate life expectancy are questionable and were not subjected to sensitivity analysis. This is likely to have biased the results.
(e) Some uncertainty over the applicability of Italian trial data to the United Kingdom. Perhaps differences in population risk and diet as well as health care use. Unit costs may also differ for the United Kingdom.
(f) Converted from 2004 Belgian euros using a PPP conversion rate of 0.706 (www.oecd.org/std/ppp), then uprated by inflation factor of 107.3% (www.pssru.ac.uk/pdf/uc/uc2006/uc2006.pdf).
(g) Some limitations in reporting (e.g., for inputs taken from National Institute for Health and Clinical Excellence [NICE] statins appraisal). However, analysis is based on best-available effectiveness estimates and follows NICE methodological guidance. The robustness of results is also well tested through sensitivity analysis and comparison with other study results.
(h) Some uncertainty over applicability of trial data to the United Kingdom because of differences in population risk and diet. However, resource use and unit costs are UK-specific, and the perspective and discount rates follow the NICE reference case.
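The ICER column in Table 3 is the incremental cost divided by the incremental effect. The short sketch below recomputes it from the rounded values shown in the table; the function and variable names are hypothetical. The recomputed ratios only approximate the published ones, presumably because the published ICERs were derived from unrounded inputs (an assumption; the article does not say).

```python
def icer(incremental_cost, incremental_effect):
    """Incremental cost-effectiveness ratio: cost per unit of effect gained."""
    if incremental_effect <= 0:
        raise ValueError("incremental effect must be positive for this sketch")
    return incremental_cost / incremental_effect


# (incremental cost in 2006 GBP, incremental effect, ICER as reported in Table 3)
TABLE_3_ROWS = {
    "Franzosi, 2001": (871, 0.0332, 26243),   # effect in life-years
    "Lamotte, 2006": (1090, 0.282, 3860),     # effect in life-years
    "NCC analysis": (1073, 0.09, 12480),      # effect in QALYs
}

# Recompute each ratio and record its relative gap from the reported value.
RECOMPUTED = {
    study: (icer(cost, effect), abs(icer(cost, effect) - reported) / reported)
    for study, (cost, effect, reported) in TABLE_3_ROWS.items()
}
```

Because the first two studies report life-years and the NCC analysis reports QALYs, the three ratios are not on a common scale, which is one reason the profile records applicability and limitations alongside the numbers.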
Despite these challenges, guideline developers agree that the consistent approach to presenting relevant information using GRADE style profiles is beneficial when developing guidelines.

Table 4. “Evidence statements” and “From evidence to recommendations,” example from Glaucoma Clinical Guideline 85: PGAs vs. beta-blockers in COAG (see guideline for details of all comparisons studied)

Evidence statements
  Clinical:
  - There were no studies that reported visual field progression.
  - PGAs are more effective than beta-blockers in reducing IOP from baseline at 6 to 36 months follow-up, but the effect size is too small to be clinically effective (moderate quality).
  - PGAs are more effective than beta-blockers in increasing the number of patients with an acceptable IOP at 6 to 12 months follow-up (moderate quality).
  - Significantly more patients using beta-blockers than PGAs experienced a respiratory adverse event at 6-month follow-up (moderate quality).
  - There was no statistically significant difference in patients experiencing cardiovascular adverse events or an allergic reaction at 6 to 12 months follow-up (moderate quality).
  - Significantly more patients using PGAs than beta-blockers experienced hyperemia at 6 to 12 months follow-up (high quality).
  Economic:
  - PGAs are more cost effective than beta-blockers for any stage of COAG. This evidence has minor limitations and direct applicability.

Recommendation
  Offer people newly diagnosed with early or moderate COAG and at risk of significant visual loss in their lifetime, treatment with a PGA.

From evidence to recommendations
  Relative values of different outcomes: Prevention of blindness is the most important outcome. Cosmetic side effects of treatment with PGAs may be unacceptable to some patients who may prefer an alternative treatment.
  Trade-off between clinical benefits and harms: PGAs are effective at lowering IOP. They may affect the pigmentation of the iris and periorbital skin and cause lash growth but rarely have systemic side effects.
  Economic considerations: The cost-effectiveness of trabeculectomy is dependent on a rapid progression in visual field loss. Therefore, in the absence of any evidence of progression, pharmacological treatment is cost effective. Among the pharmacological treatments, PGAs are the most cost effective.
  Quality of evidence: Clinical evidence was generally of moderate quality. The economic evidence has minor limitations but direct applicability.
  Other considerations: Patient preference (see “relative values of different outcomes” mentioned previously in the table).

Abbreviations: PGAs, prostaglandin analogs; IOP, intraocular pressure; COAG, chronic open angle glaucoma.

7. Comparing multiple interventions

Clinical guidelines often identify situations where there are several possible interventions. Analysis in such situations is usually pairwise, resulting in an evidence profile for each pairwise comparison, with no clear framework for synthesizing the overall interpretation of these profiles. Presenting results in this way can lead to lengthy tables and footnotes and misrepresent the synthesized results. As methodologies, such as network meta-analyses, are becoming more frequently reported in the literature and are undertaken for NICE guidance, the challenge to present data in an accessible way to GDGs remains. As well as ongoing work within the GRADE working group, NICE has been involved in developing a reporting standard for network meta-analyses [16,17], which is an early but essential step toward the assessment of quality of such studies. The adaptation of existing profiles to fit multiple treatment comparisons is ongoing [10].
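A small sketch shows why pairwise presentation becomes unwieldy for the multiple-intervention questions discussed in Section 7: the number of pairwise evidence profiles grows quadratically with the number of interventions. The function name and the placeholder intervention labels are hypothetical and are not taken from any guideline.

```python
from itertools import combinations


def pairwise_profiles(interventions):
    """List every pairwise comparison that would need its own evidence profile."""
    unique = sorted(set(interventions))
    return list(combinations(unique, 2))


# Placeholder labels only; four interventions already need six profiles.
EXAMPLE = pairwise_profiles(["A", "B", "C", "D"])
```

With n interventions there are n(n-1)/2 pairs, and each profile is repeated per outcome, so a review with several interventions and outcomes quickly produces the lengthy tables the text describes.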
8. GDG discussion of evidence and development of recommendations

The GRADE profiles, sometimes supplemented with forest plots and evidence statements (brief statements summarizing the key results from the evidence review outcome by outcome), are presented at GDG meetings as a basis for discussion of the evidence by the members. The results of the survey of GDG members have given us an insight into how GRADE contributes to this discussion (Table 2). Most respondents thought that the use of GRADE helped ensure a transparent, consistent, and systematic approach to evaluating evidence and that it clarified the weight of evidence available, highlighted low-quality evidence more easily, and allowed issues and queries to be raised and discussed. Some respondents commented that the GRADE profiles were easy to understand once explained and easier to follow than the text. However, a few respondents felt that the GRADE profiles presented as tables alone did not help them to understand the details of the evidence, and they would have preferred narrative summaries alongside the GRADE profiles or to have appraised the individual studies themselves.

Four key factors determine the strength of a recommendation within GRADE: trade-off between desirable and undesirable effects, quality of evidence, values and preferences, and costs [18]. The NICE guidelines manual suggests that these points are covered in the discussions at the GDG meeting, thus clarifying the background to the recommendation and ensuring that all aspects are addressed. Table 4 displays an example of how this advice was incorporated into the guideline on diagnosis and management of glaucoma (http://guidance.nice.org.uk/CG85); this was one of the earliest guidelines to use GRADE, and subsequent experience of the LETR group is refining this approach.

GRADE suggests that recommendations are categorized as “strong” (where guideline developers are confident that the desirable effects of adherence to a recommendation outweigh the undesirable effects) or “weak” (the desirable effects of adherence to a recommendation probably outweigh the undesirable effects, but the developers are less confident) [3]. Use of other explanatory wording or symbols has also been suggested [19,20]. NICE has chosen to reflect the concept of strength in the wording of the recommendation: it uses “offer” where there is certainty about the strength and quality of the evidence and little doubt that desirable effects outweigh the undesirable effects, uses “consider” where there is less certainty, and writes a research recommendation where there is high uncertainty (http://www.nice.org.uk/guidelinesmanual). The GDG’s view of the strength of a recommendation should be made clear from its discussions in the evidence to recommendations sections (see example provided in Box 1).

9. Practical problems

The technical staff were concerned with two practical issues. First, as with all new systems, an initial phase of learning was required when GRADE was introduced. Second, representing data as evidence profiles with quality assessment and summary of findings tables takes longer than writing study-based narrative summaries of the evidence. The GRADE system forces users to become engaged in making explicit judgments about the quality of the evidence, whereas previous systems allowed these judgments to be implicit or avoided altogether.

The GRADE working group developed GRADEpro software to work alongside the system and produce evidence profiles. In the early stages of this project, the software was unreliable and lacked the capacity to evaluate observational studies, forcing technical staff to find other ways of presenting evidence profiles, such as using Word templates or Excel spreadsheets. The software is now more stable with greater functionality and a better help system. A particularly useful feature is GRADEpro’s ability to interface with Review Manager 5, the Cochrane Collaboration’s software for preparing systematic reviews.

Box 1
From the guideline on peritoneal dialysis (http://guidance.nice.org.uk/CG125/NICEGuidance/pdf/English):

Recommendations: Offer all people with stage 5 CKD a choice of peritoneal dialysis or hemodialysis, if appropriate, but consider peritoneal dialysis as the first choice of treatment modality for:
- children 2 years old or younger
- people with residual renal function
- adults without significant associated comorbidities.

Evidence to recommendations: The GDG agreed that overall there was no evidence of significant differences between the modalities (peritoneal dialysis and hemodialysis) for the critical outcomes; therefore, recommendations were made allowing all patients the option of either peritoneal or hemodialysis. The GDG, therefore, made recommendations, based on the evidence and their clinical expertise, on those patient groups in which peritoneal dialysis was likely to be the preferred option.
10. Summary and conclusions

Overall, the introduction of GRADE has been positively received by both technical staff and GDG members and has helped to clarify the way in which NICE evaluates evidence and develops recommendations. The conceptual and practical problems encountered have largely been resolved (Table 1). However, some areas require further work, including the practical application of imprecision within GDGs and the presentation of results from analyses with more than two alternative interventions. Importantly, GRADE presents a “balance sheet” of outcomes in a way that is more helpful to decision making than previous approaches to assessing evidence quality.

Next, NICE is deciding how to update older guidelines that have not used GRADE. Our experience so far has mostly been with using GRADE for randomized studies of interventions. Many NICE guidelines involve diagnosis, so, alongside the GRADE working group, we need to decide how best to use GRADE in this situation and to refine its use with evidence from nonrandomized studies. In addition, NICE is a partner in the DECIDE project, a 5-year collaborative project that is funded by the European Commission’s 7th Framework Programme and designed to research and improve the way health care evidence and recommendations are presented in clinical guidelines (http://www.decide-collaboration.eu/). Many of the issues encountered by NICE in applying GRADE will be explored in this project. Finally, future clinical research could ideally be designed, conducted, and analyzed with GRADE in mind to maximize the use of research in decision making.

Acknowledgments

NICE thanks all the GDG members who participated in the survey about their experiences of using GRADE.

References

[1] Guyatt GH, Oxman AD, Vist GE, Kunz R, Falck-Ytter Y, Alonso-Coello P, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ 2008;336:924-6.
[2] Guyatt GH, Oxman AD, Schünemann HJ, Tugwell P, Knottnerus A. GRADE guidelines: a new series of articles in the Journal of Clinical Epidemiology. J Clin Epidemiol 2011;64:380-2.
[3] Guyatt G, Oxman AD, Akl EA, Kunz R, Vist G, Brozek J, et al. GRADE guidelines: 1. Introduction - GRADE evidence profiles and summary of findings tables. J Clin Epidemiol 2011;64:383-94.
[4] Guyatt GH, Oxman AD, Kunz R, Atkins D, Brozek J, Vist G, et al. GRADE guidelines: 2. Framing the question and deciding on important outcomes. J Clin Epidemiol 2011;64:395-400.
[5] Balshem H, Helfand M, Schünemann HJ, Oxman AD, Kunz R, Brozek J, et al. GRADE guidelines: 3. Rating the quality of the evidence. J Clin Epidemiol 2011;64:401-6.
[6] Guyatt GH, Oxman AD, Vist G, Kunz R, Brozek J, Alonso-Coello P, et al. GRADE guidelines: 4. Rating the quality of evidence - study limitations (risk of bias). J Clin Epidemiol 2011;64:407-15.
[7] Guyatt GH, Oxman AD, Montori V, Vist G, Kunz R, Brozek J, et al. GRADE guidelines: 5. Rating the quality of evidence - publication bias. J Clin Epidemiol 2011;64:1277-82.
[8] Guyatt G, Oxman AD, Kunz R, Brozek J, Alonso-Coello P, Rind D, et al. GRADE guidelines: 6. Rating the quality of evidence - imprecision. J Clin Epidemiol 2011;64:1283-93.
[9] Guyatt GH, Oxman AD, Kunz R, Woodcock J, Brozek J, Helfand M, et al. GRADE guidelines: 7. Rating the quality of evidence - inconsistency. J Clin Epidemiol 2011;64:1294-302.
[10] Guyatt GH, Oxman AD, Kunz R, Woodcock J, Brozek J, Helfand M, et al. GRADE guidelines: 8. Rating the quality of evidence - indirectness. J Clin Epidemiol 2011;64:1303-10.
[11] Guyatt GH, Oxman AD, Sultan S, Glasziou P, Akl EA, Alonso-Coello P, et al. GRADE guidelines: 9. Rating up the quality of evidence. J Clin Epidemiol 2011;64:1311-6.
[12] Thornton J, Siddiqui F, Nyong J, Kelly V, Tan TPY, Sharma T, et al. Attitudes of guideline development groups to the use of GRADE in evidence evaluation and development of recommendations. Guideline International Annual Conference; 2010; Chicago, IL.
[13] Bland JM. The tyranny of power: is there a better way to calculate sample size? BMJ 2009;339:1133-5.
[14] Brunetti M, Ruiz F, Lord J, Pregno S, Oxman AD. Chapter 10: Grading economic evidence. In: Shemilt I, Mugford M, Vale L, Marsh K, Donaldson C, editors. Evidence-based decisions and economics: health care, social welfare, education and criminal justice. Oxford, UK: Wiley-Blackwell; 2010:114-33.
[15] Reken S, Ruiz F. Incorporating cost effectiveness into guidelines using GRADE-like evidence profiles. Otolaryngol Head Neck Surg 2010;143(1 suppl):44-5.
[16] Reken S, Wonderling D, Alderson P. Presenting and reporting network meta-analyses in clinical guidelines. Guidelines International Annual Conference; 2009; Lisbon, Portugal.
[17] Reken S, Wonderling D, Alderson P. Validation of a reporting guideline for mixed treatment comparisons. Otolaryngol Head Neck Surg 2010;143(1 suppl):16-7.
[18] Guyatt GH, Oxman AD, Kunz R, Falck-Ytter Y, Vist GE, Liberati A, et al. GRADE: going from evidence to recommendations. BMJ 2008;336:1049-51.
[19] Akl EA, Maroun N, Guyatt G, Oxman AD, Alonso-Coello P, Vist GE, et al. Symbols were superior to numbers for presenting strength of recommendations to health care consumers: a randomized trial. J Clin Epidemiol 2007;60:1298-305.
[20] Schünemann HJ, Best D, Vist G, Oxman AD, GRADE Working Group. Letters, numbers, symbols and words: how to communicate grades of evidence and recommendations. CMAJ 2003;169:677-80.