
Evidence-based Radiology: Step 3—Primary Literature Validity (Critical Appraisal)

Paul Cronin, MD, MS, MRCPI, FRCR

After going through the initial 2 steps of evidence-based radiology as outlined in the previous articles,1,2 the third step in the evidence-based radiology process is "Critical Appraisal." In critical appraisal of the primary literature, 2 sections of the manuscript are evaluated. First, the reader needs to determine the validity of the study methods, and second, to examine the statistical strength of the results. This involves assessing the Materials and Methods section of the article for validity (or the presence of bias) and evaluating the Results section for the statistical strength of the study and for the performance of the study at both the population and individual levels.

There are 2 broad categories of research studies: primary research (or first-hand studies), which includes experiments, clinical trials, and surveys; and secondary research, which draws conclusions from primary studies and includes overviews (nonsystematic reviews of primary studies, systematic reviews, and meta-analyses), guidelines, decision analyses, and economic analyses, including cost-effectiveness analyses. Primary research is about measuring differences between groups of subjects who undergo different clinical interventions or who have different characteristics.

Just because a study is deemed the best type of study by the study hierarchy (described in an earlier article), one cannot necessarily trust the study's conclusions. Studies can be performed well, poorly, or fall somewhere in between. Thus, the type of study (level of evidence) is only the first dimension of the evidence that affects the degree of confidence one can ultimately have in the authors' conclusions. The other dimensions can be determined using critical appraisal. To start the appraisal process, begin with the "Patient-Intervention-Comparator-Outcome" (PICO) framework.

Division of Cardiothoracic Radiology, Department of Radiology, University of Michigan Medical Center, Ann Arbor, MI. Funded in part by GE-AUR Radiology Research Academic Fellowship. Address reprint requests to Paul Cronin, MD, MS, MRCPI, FRCR, Division of Cardiothoracic Radiology, Department of Radiology, University of Michigan Hospitals, B1 132G Taubman Center/5302, 1500 E Medical Center Drive, Ann Arbor, MI 48109-5302. E-mail: [email protected]

0037-198X/09/$-see front matter © 2009 Published by Elsevier Inc. doi:10.1053/j.ro.2009.03.002

PICO

● P—population/problem
● I—intervention/index test
● C—control/reference test
● O—outcome

When a question is framed in the PICO format, the next steps in critically appraising the evidence are assessing its validity (closeness to the truth/lack of bias), impact (size of the effect), and applicability (usefulness) in our clinical practice. This can be done by answering the following questions:

● Question 1: What is the PICO of the study, and is the PICO of the study close enough to your PICO?
● Question 2: What is the internal validity of the study, or how well was the study done?
● Question 3: What do the results mean, and could they have been due to chance?

Let us assess each of these questions in turn.

Question 1: What Is the PICO of the Study, and Is the PICO of the Study Close Enough to Your PICO?

When you find a study that you think will help to answer your clinical question, the first thing to look at is whether the PICO of the study matches the PICO of your question. This allows you to decide whether it really provides useful information relevant to your PICO. The study will rarely exactly match your question, so you will need to judge whether it is a close enough match to assist with your clinical decision.

Question 2: What Is the Internal Validity of the Study, or How Well Was the Study Done?

The quality of a study, which is also referred to as its "internal validity," is based on how good the research methodology is. While appraising the internal validity of a study, several questions should be considered3:

● Was the diagnostic test evaluated in an appropriate spectrum of patients (similar to those in whom it would be used in practice)?
● Was there an independent, blind comparison with a reference (gold) standard of diagnosis?
● Was the reference (gold) standard applied, regardless of the diagnostic test result?
● Was the test (or cluster of tests) validated in a second, independent group of patients?

Each time the answer to a question is "no," a potential source of methodological bias is identified. The Centre for Evidence-Based Medicine has stated that the ideal study performed to evaluate a test should be an independent blinded comparison of an appropriate spectrum of consecutive patients, all of whom have undergone both the test and the reference standard.4

The quality of a study is also based on how well the research methods prevented the results from being affected by bias and confounding factors. Bias is the degree to which the result is skewed away from the truth. Bias is different from random error, or scatter, which occurs because of various system variables and is evenly distributed around the true mean. Bias comes in many forms, including selection bias, allocation bias, treatment bias, and measurement bias, to name but a few. Confounding factors are patient characteristics and other possible causal factors, apart from the one that is being measured, that can affect the outcome of the study. To eliminate confounding factors, we need to ensure that the groups are as closely matched as possible at the start of the study and that management of the groups is the same in every respect apart from the treatment or exposure of interest.

To find out how well bias and confounding factors were controlled in a study, check each stage of the study to see how well they have been reduced or eliminated. To do this, ask these questions:

● How fairly were the subjects recruited ("P")?
● How fairly were the subjects allocated to groups, and how well was the comparable status of the study groups maintained through equal management and follow-up of subjects ("I" and "C")?
● How fairly were the outcomes measured ("O")?

If the study that you have identified has eliminated bias, then there is a good chance that the study's results will be reliable. Another issue is test accuracy. The test is not useful if it is not accurate. To be accurate, a test should be both (1) reproducible, that is, we obtain the same (possibly wrong) answer every time; and (2) valid, that is, we obtain the right answer (Fig. 1).

Figure 1 Diagram of the relationship of reproducibility and validity.

How does a busy radiologist sift through large numbers of articles to find those that have the least amount of bias, and therefore provide results that are closest to the truth? What you want to know is, "Can I trust the data from the study?" The Centre for Evidence-Based Medicine in Oxford has developed a way to help you assess how likely it is that the study methods have reduced or eliminated bias, using the acronym "RAMMbo" (Fig. 2).3 The steps in "RAMMbo" are as follows:

● Recruitment: Were the subjects representative of the target population?
● Allocation or Adjustment: Was the treatment allocation concealed before randomization, and were the groups comparable at the start of the trial?
● Maintenance: Was the comparable status of the study groups maintained through equal management and adequate follow-up?
● Measurement: Were the outcomes measured with blinded subjects and assessors, and/or objective outcomes?

Each “RAMMbo” element is discussed below.

Recruitment: Were the Subjects Representative of the Target Population?

It is important that the subjects selected for a study appropriately represent the population of interest. If the subjects of the study are not representative, it may be difficult to know to which population the outcomes are applicable. The best way to ensure that the study groups are representative is to recruit potential subjects sequentially (consecutively) or to use randomization. In general, larger studies are preferable to smaller studies, as small study groups provide an imprecise estimate of the effects. The study should clearly describe the severity, duration, and/or risk level of the patients recruited to ensure that the target population is adequately represented (this is the "P," or population/problem, of the study).


Figure 2 Assessment of “PICO” using “RAMMbo.”

Allocation: Were the Study Groups Comparable?

It is vital that the groups are matched as closely as possible in every way except for the intervention being studied. If the groups are not comparable to begin with, then a difference in outcomes may be due to one of the nonmatched characteristics (or confounding factors) rather than to the intervention under consideration. Important factors on which groups should be matched include age, sex, disease severity or stage, and other risk factors. In experimental studies, once subjects have consented to participate in the trial, they are randomly allocated to either the control or the intervention group. In observational studies, such as cohort studies and case-control studies (a common study design in radiology), the subjects are not randomly allocated to groups, so the ways in which study groups are formed and matched become the critical quality issues. Perfect matching is almost never possible in nonrandomized studies; therefore, statistical adjustments are needed to approximate comparability of the study groups.
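As an aside, the toy Python sketch below illustrates the simple random allocation described above; the subject IDs are invented, and real trials additionally require allocation concealment (and often stratified or block randomization), which this sketch does not implement.

    # Toy illustration of simple random allocation to two groups.
    # Subject IDs are hypothetical; real trials also conceal the
    # allocation sequence from the recruiters.
    import random

    subjects = [f"subject_{i:02d}" for i in range(1, 21)]  # 20 hypothetical IDs
    random.shuffle(subjects)        # put subjects in a random order
    intervention = subjects[:10]    # first half -> intervention group
    control = subjects[10:]         # second half -> control group
    print("Intervention:", intervention)
    print("Control:", control)

Because group membership is decided by chance alone, known and unknown patient characteristics tend, on average, to be balanced between the two groups.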

Maintenance: Was the Comparable Status of the Study Groups Maintained Through Equal Management and Adequate Follow-up?

When comparable groups have been set up, it is important that they remain that way! That is, the management and follow-up of the groups should maintain the comparable status of the groups.

Equal Management

The study groups should be managed so that the only difference between the groups is the intervention. This means that the control group should be treated exactly the same as the experimental group, except for the factor being tested.

Adequate Follow-up

During a study, some subjects may drop out, change groups, or become lost to follow-up. This can be a serious problem if the remaining groups are no longer comparable. Overall, the number of subjects at the end of the study should closely match the number at the start, and the subjects should be analyzed in the groups that they started out in (intention-to-treat analysis).

Measurement: Were the Outcomes Measured With Blinded Subjects and Assessors and/or Objective Measures?

Although it is important to ensure that study groups have been randomly allocated or adjusted to ensure comparability at the start of the study, that most subjects are accounted for, and that relevant outcomes are obtained and analyzed in the starting groups, it is equally important to ensure that the outcomes are measured fairly, to eliminate measurement bias. Measurement bias can be overcome by blinding both the subjects and the outcome assessors, that is, a double-blind trial. A trial in which either the subjects or the outcome assessors, but not both, are blinded to the group allocation is called a single-blind study; its results are less reliable than those of a double-blind study because of the increased potential for bias. A study in which neither the subjects nor the outcome assessors are blinded is the least reliable type of study of all because of the high potential for bias.

Table 1 Outcome Measures for Binary Outcomes

Relative risk (RR) = risk of event in the treatment group/risk of event in the control group.
  Meaning: RR tells us how many times more likely it is that an event will occur in the treatment group relative to the control group. RR = 1 means that there is no difference between the 2 groups. RR < 1 means that the treatment reduces the risk of the event. RR > 1 means that the treatment increases the risk of the event.

Absolute risk reduction (ARR) = risk of event in the control group − risk of event in the treatment group (also known as the absolute risk difference).
  Meaning: ARR tells us the absolute difference in the rates of events between the 2 groups and gives an indication of the baseline risk and treatment effect. ARR = 0 means that there is no difference between the 2 groups (thus, the treatment had no effect). A positive ARR means that the treatment is beneficial; a negative ARR means that the treatment is harmful.

Relative risk reduction (RRR) = ARR/risk of event in the control group (or 1 − RR).
  Meaning: RRR tells us the reduction in the rate of the event in the treatment group relative to the rate in the control group. RRR is probably the most commonly reported measure of treatment effects.

Number needed to treat (NNT) = 1/ARR.
  Meaning: NNT tells us the number of patients we need to treat to prevent 1 adverse event.
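To make the arithmetic in Table 1 concrete, the short Python sketch below computes RR, ARR, RRR, and NNT from the event counts of a hypothetical trial; the counts (10/100 events with treatment vs 20/100 with control) and the function name are illustrative only, not taken from any real study.

    # Worked example of the binary outcome measures in Table 1.
    # All counts are hypothetical and chosen only for illustration.

    def binary_outcome_measures(events_treated, n_treated, events_control, n_control):
        """Return (RR, ARR, RRR, NNT) from 2x2 trial counts."""
        risk_treated = events_treated / n_treated    # event rate, treatment group
        risk_control = events_control / n_control    # event rate, control group
        rr = risk_treated / risk_control             # relative risk
        arr = risk_control - risk_treated            # absolute risk reduction
        rrr = arr / risk_control                     # relative risk reduction (= 1 - RR)
        nnt = 1 / arr if arr != 0 else float("inf")  # number needed to treat
        return rr, arr, rrr, nnt

    # Hypothetical trial: 10/100 events with treatment, 20/100 with control.
    rr, arr, rrr, nnt = binary_outcome_measures(10, 100, 20, 100)
    print(f"RR = {rr:.2f}, ARR = {arr:.2f}, RRR = {rrr:.2f}, NNT = {nnt:.0f}")
    # Prints: RR = 0.50, ARR = 0.10, RRR = 0.50, NNT = 10

Here RR = 0.50 and RRR = 0.50 say that the treatment halves the event rate, while NNT = 10 says that 10 patients must be treated to prevent 1 event.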

Additional Points for Radiology Studies

In addition to answering epidemiologic questions, the Materials and Methods section of radiology articles should be further appraised from the (diagnostic) radiologist's perspective by asking these 5 questions5:

● Has the imaging method been described in sufficient detail for it to be reproduced in one's own department?
● Have the imaging test being evaluated (index test) and the reference (gold standard) test been performed to the same standard of excellence?
● Have "generations" of technology development within the same modality (eg, conventional vs helical single-detector row vs multidetector row CT) been adequately considered in the study design and discussion?
● Has radiation exposure been considered? (The concept of justification and optimization has assumed prime importance in radiation protection of patients.)
● Were magnetic resonance and/or CT images reviewed on a monitor or as film (hard copy) images?

Question 3: What Do the Results Mean?

When you have assessed the study methodology, the next stage is to turn to the Results section. Assessment of the strength of a study can be found in the Results section of an article, and one must ask, "What do the results show, and could they have been due to chance?" It is important to determine what outcome measures were used and whether they are appropriate. Results can be presented either as binary outcomes (also called dichotomous outcomes), that is, "yes" or "no" outcomes, or as continuous outcomes, such as weight or height. For binary outcomes, the results can be expressed in many ways, as shown in Table 1. For continuous outcomes, the important measures are the group means. The difference between the means (averages) of the treatment and control groups tells us how large or small the effect is.

Another important question is, "Are the results real and relevant?" If the results of a study appear to show an effect, you will also need to determine whether this is a real effect or one that is due to chance, through the use of statistical testing, specifically hypothesis testing and P values, and also confidence intervals (CIs). P values are a measure of the probability that a result is purely due to chance. Scientific research is about testing a "null hypothesis" (ie, a hypothesis that there will not be an effect).6 The P value tells us the probability that the observed effect could be due simply to chance. If the P value is low (usually < 0.05), the probability that the result was due to chance is also low (< 5%); that is, the result is probably real. An effect with a low P value (< 0.05) is called a "statistically significant" result. Of course, this in itself does not imply that the result is clinically relevant.

Confidence intervals are generally more informative than P values. They are an estimate of the range of values that is likely to include the real value. Usually, CIs are quoted as 95% CIs, which means that the range of values has a 95% chance of including the real value. If the 95% CI for the difference between groups is narrow and does not overlap the "no effect" point, we can be more certain that the result is real (ie, provided P < 0.05). The greater the number of subjects in the study, the narrower the CIs are likely to be; therefore, larger studies give more reliable results than smaller studies.7-9

Other assessments of study results include sensitivity and specificity, positive and negative predictive values, positive and negative likelihood ratios, and pre- and post-test probability. These are discussed further in the next article.
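As a closing illustration of the P value and CI reasoning above, the following Python sketch computes a 95% CI and a two-sided P value for a difference between two proportions using a simple normal approximation; the counts are hypothetical, and a real analysis would typically use a dedicated statistics package.

    # Minimal sketch of the P value/CI ideas above, using a normal
    # approximation for the difference between two proportions.
    # Counts are hypothetical; an unpooled standard error is used for
    # both the CI and the test, a common simplification.
    import math

    def diff_in_proportions(events1, n1, events2, n2):
        """95% CI and two-sided P value for p1 - p2 (normal approximation)."""
        p1, p2 = events1 / n1, events2 / n2
        diff = p1 - p2
        se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se  # 95% CI
        z = diff / se  # test statistic for the null hypothesis p1 = p2
        # Two-sided P value from the standard normal distribution.
        p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        return diff, ci_low, ci_high, p_value

    # Hypothetical: 10/100 events with treatment vs 20/100 with control.
    diff, lo, hi, p = diff_in_proportions(10, 100, 20, 100)
    print(f"difference = {diff:.2f}, 95% CI {lo:.3f} to {hi:.3f}, P = {p:.3f}")
    # Prints: difference = -0.10, 95% CI -0.198 to -0.002, P = 0.046
    # The CI excludes 0 (the "no effect" point) and P < 0.05, so the
    # difference is unlikely to be due to chance alone.

Note how the CI conveys both the size and the precision of the effect, which a bare P value does not.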


Conclusion

The third step in the process of evidence-based practice is Critical Appraisal. This involves appraising the study PICO; determining the validity of the study (closeness to the truth/lack of bias) by assessing the recruitment of subjects, the allocation or adjustment of groups, the maintenance of subjects during the trial, and the measurement of outcomes using blinding and/or objective outcomes; and examining the statistical strength of the results using P values and CIs.

References

1. Kelly AM: Evidence-based radiology: Step 1—Ask. Semin Roentgenol 43:140-146, 2009
2. Kelly AM: Evidence-based radiology: Step 2—Searching the literature (search). Semin Roentgenol 43:147-152, 2009
3. Sackett DL, Straus SE, Richardson WS, et al: Evidence-based Medicine: How to Practice and Teach EBM (ed 2). Edinburgh, Scotland, Churchill Livingstone, 2000
4. Levels of evidence. Oxford Centre for Evidence-Based Medicine Web site. http://www.cebm.net/levels_of_evidence.asp
5. Dodd JD, MacEneaney PM, Malone DE: Evidence-based radiology: How to quickly assess the validity and strength of publications in the diagnostic radiology literature. Eur Radiol 14:915-922, 2004
6. Guyatt G, Jaeschke R, Heddle N, et al: Basic statistics for clinicians: 1. Hypothesis testing. CMAJ 152:27-32, 1995
7. Gardner MJ, Altman DG: Confidence intervals rather than P values: Estimation rather than hypothesis testing. BMJ (Clin Res Ed) 292:746-750, 1986
8. Guyatt G, Jaeschke R, Heddle N, et al: Basic statistics for clinicians: 2. Interpreting study results: Confidence intervals. CMAJ 152:169-173, 1995
9. Montori VM, Kleinbart J, Newman TB, et al: Tips for learners of evidence-based medicine: 2. Measures of precision (confidence intervals). CMAJ 171:611-615, 2004