
CHEST

Commentary

Mine Is Bigger Than Yours: Measures of Effect Size in Research

David L. Streiner, PhD, CPsych; and Geoffrey R. Norman, PhD

Results of studies that use laboratory tests are often easy to interpret, because we are familiar with the units and how to interpret them. However, this is not the case when the results are presented as ORs, relative risks, correlations, or scores on an unfamiliar scale. This article explains these various indices of effect size—how they are calculated, what they mean, and how they are interpreted. CHEST 2012; 141(3):595–598

Abbreviations: AR = absolute risk; ARR = absolute risk reduction; CABG = coronary artery bypass graft; ES = effect size; NNT = number needed to treat; RR = relative risk

Manuscript received September 28, 2011; revision accepted October 6, 2011.
Affiliations: From the Department of Psychiatry and Behavioural Neurosciences (Dr Streiner) and Department of Clinical Epidemiology and Biostatistics (Dr Norman), McMaster University, Hamilton, ON; and Department of Psychiatry (Dr Streiner), University of Toronto, Toronto, ON, Canada.
Financial/nonfinancial disclosures: The authors have reported to CHEST that no potential conflicts of interest exist with any companies/organizations whose products or services may be discussed in this article.
Correspondence to: David L. Streiner, PhD, CPsych, St. Joseph's Health Care, W5th Campus, 100 W 5th St, Box 585, Hamilton, ON, L8N 3K7, Canada; e-mail: [email protected]
© 2012 American College of Chest Physicians. Reproduction of this article is prohibited without written permission from the American College of Chest Physicians (http://www.chestpubs.org/site/misc/reprints.xhtml). DOI: 10.1378/chest.11-2473

In a previous article about P levels and CIs,1 we mentioned in passing that the important information in any study is actually the effect size. However, we did not say what we meant by "effect size"—what it is, how it is measured, and how we interpret it. We omitted these key points deliberately for two reasons: first, so as not to detract from the main message of the previous article; and second, to give ourselves an excuse to write this one. But, before we start describing it, let us begin by giving some examples of why it is (sometimes) needed. If an article states that a certain intervention reduces the Paco2 level from 50 mm Hg to 40 mm Hg, no further information is needed for you to determine that the intervention was successful; Paco2 moved from a level indicating acidosis to being within the normal range. You know what these values mean and how much of a change is clinically important. But let us consider some other results:

1. The correlation between salt intake and hypertension is 0.46.2
2. The OR for having a myocardial infarction was 0.48 for physicians who took aspirin compared with those who did not (sorry, it did not affect mortality).3
3. The mean Duke Activities Status Index was two points higher among patients treated in sites that have angiography compared with sites that do not.4

Does that correlation of 0.46 indicate a strong or a weak relationship between salt and hypertension? Does the OR of 0.48 mean we should all start swallowing aspirin immediately or not? Is that two-point difference a clinically meaningful one, or do we need a magnifying glass to see it? The answers are not as obvious as with the Paco2 example, because we are not as familiar with these other units of measurement, be they correlation coefficients, ORs (or their cousin on their mother's side, relative risks [RRs]), or especially differences between groups on scales we have not encountered before. So, let us take a guided tour of indices that measure how big an effect is. For obvious reasons, these indices together are called measures of effect size, commonly abbreviated as ES. Ask a statistician anything, and he or she will immediately draw one of two things—either a normal curve or a 2 × 2 table. In this case, we will do the latter.


The reason is that we have two types of data and two types of measures. The data can be divided into (1) dichotomous variables, such as dead or alive, better or worse, and the like; and (2) continuous variables, such as various serum levels, scores on a scale, and so on. Similarly, we have two types of measures: (1) those that indicate how big the difference is and (2) those that tell us if the variables are related to one another. The resulting four types of ES are shown in Table 1. Let us go over them one at a time.

Table 1—Four Types of Effect Sizes (rows: type of data; columns: type of measure)

Type of Data     Difference                           Relationship
Dichotomous      OR; relative risk (RR)               Phi (φ)
Continuous       Standardized mean difference (d)     Correlation (r, R)

When we have a dichotomous outcome and we want to know how big the difference is between groups, we have two alternatives: the RR and the OR, which is sometimes called the relative odds. Which one to use depends on the type of study we are dealing with; the RR is used with randomized controlled trials and cohort studies, and the OR is used with case-control studies. (If you need help regarding what these terms mean, or why we cannot use the RR with case-control studies, see Reference 5.) Both begin with a 2 × 2 table (yet again), as in Table 2, with the rows indicating group membership and the columns the outcome. In the treatment group, the absolute risk (AR) of death is the number who have died (20 in cell A) divided by the number of people in the treatment group (100; the total of cells A and B); or A/(A + B), or 0.20. Similarly, the AR of death in the control group is 40 (cell C) divided by the number of people in that group (100; cells C + D), which is 0.40. The RR is simply the ratio of these two risks, or:

RR = AR_Treatment / AR_Control = [A / (A + B)] / [C / (C + D)] = 0.20 / 0.40 = 0.50
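For readers who want to check the arithmetic, here is a minimal Python sketch (ours, not from the article; the helper name absolute_risk is invented) that reproduces these numbers from the Table 2 counts:

```python
# Absolute risk (AR) and relative risk (RR) from the 2 x 2 counts in Table 2.
# Illustrative sketch only; cell letters follow Table 2 (A-D).

def absolute_risk(events: int, group_size: int) -> float:
    """Proportion of a group experiencing the outcome, e.g., A / (A + B)."""
    return events / group_size

ar_treatment = absolute_risk(20, 100)  # cell A over (A + B) = 0.20
ar_control = absolute_risk(40, 100)    # cell C over (C + D) = 0.40
rr = ar_treatment / ar_control         # 0.20 / 0.40 = 0.50

print(f"AR(treatment) = {ar_treatment:.2f}, AR(control) = {ar_control:.2f}, RR = {rr:.2f}")
```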

This means that the risk of dying in the treatment group is half of that in the control group. When we are dealing with case-control studies or with the output from some statistical tests, such as logistic regression, we get a related measure, called the OR. Here, we first calculate the odds of dying if a person is in one of the two groups: it is 20 (cell A) divided by 40 (cell C), or 0.5. Similarly, the odds of not dying in the two groups is 80 (cell B) divided by 60 (cell D), or 1.33. So, the OR is the ratio of the two, or:

OR = (A / C) / (B / D) = AD / BC = (20 × 60) / (80 × 40) = 0.375
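The cross-product gives the same answer in code; again, a sketch of ours using the Table 2 cell counts:

```python
# Odds ratio via the cross-product AD / BC, using the Table 2 cell counts.
# Illustrative only.
a, b, c, d = 20, 80, 40, 60  # A = treated dead, B = treated alive, C = control dead, D = control alive

odds_ratio = (a * d) / (b * c)  # (20 * 60) / (80 * 40) = 0.375
print(f"OR = {odds_ratio:.3f}")
```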

(We calculated the OR based on data for which we should have used the RR, but that is acceptable; we just cannot go the other way.) Although the RR and OR are interpreted differently (again, see Reference 5 for a fuller explanation), their importance is assessed similarly. An OR or RR of 1.0 means that nothing is going on; there is no difference in outcome between the two groups. As a rough rule of thumb (and realize that it is only a rule of thumb), RRs and ORs < 0.50 or > 2.0 are seen as clinically important. Note that we are not speaking about statistical significance, but rather importance. For example, comparing coronary artery bypass graft (CABG) to stents in people with stable angina or acute coronary syndromes, the OR for restenosis at 6 months was 0.29, leading the authors of the meta-analysis to conclude that CABG is more effective in reducing the rates of major adverse cardiac events.6 As we said in the previous article,1 statistical significance is greatly affected by the sample size, so in very large studies or surveys, an RR of 1.02 or 0.98 can be statistically significant, but the results are likely clinically trivial. Also bear in mind that if a disease has a bad outcome and the intervention is relatively noninvasive or inexpensive, ORs or RRs closer to 1.0 can point to useful treatments; rules of thumb should never trump clinical judgment.

Before leaving the RR and OR, we must give a word of warning. These ES indices are much beloved by purveyors of expensive pharmaceuticals, which in itself should serve as a red flag that they are like bikinis—what they reveal is interesting, but what they conceal is crucial. Let us go back to Table 2, where we found that the RR was 0.5 and the intervention looked like it might be worthwhile. Imagine we do exactly the same study with 10,000 patients in each group rather than 100, but the number who died was exactly the same: 20 in the treatment group and 40 in the control group. If we go through the calculations, we find that the AR in the treatment group is 0.002, and it is 0.004 in the control group. And what is the RR? It is 0.50, just what we found before. Now, though, the intervention looks much less promising—it saves only 20 more people out of 20,000, rather than 20 out of 200. Put very briefly, the RR and OR do not take into account the baseline risk of succumbing to the disease, and that is the crucial part that is concealed. Never evaluate an intervention simply on the basis of the OR or RR; you also need some other index that takes the baseline into account.
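To make the concealment concrete, a brief sketch of ours (the study labels and counts simply mirror the two hypothetical versions described above) recomputes the risks:

```python
# Same relative risk, very different absolute risks: the two hypothetical studies
# above (100 per group vs 10,000 per group), with identical death counts.
studies = {
    "small study (n = 100 per group)": (20, 100, 40, 100),
    "large study (n = 10,000 per group)": (20, 10_000, 40, 10_000),
}

for label, (deaths_t, n_t, deaths_c, n_c) in studies.items():
    ar_t = deaths_t / n_t
    ar_c = deaths_c / n_c
    rr = ar_t / ar_c
    print(f"{label}: AR(treatment) = {ar_t:.3f}, AR(control) = {ar_c:.3f}, RR = {rr:.2f}")
# Both print RR = 0.50, but the absolute risks differ by two orders of magnitude.
```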


Table 2—Results of a Hypothetical Study

Group         Dead          Alive          Total
Treatment     20 (A)        80 (B)         100 (A + B)
Control       40 (C)        60 (D)         100 (C + D)
Total         60 (A + C)    140 (B + D)    200

One simple way to get a better idea of how much real benefit results from the treatment is simply to take the difference in ARs rather than a ratio. This is called the absolute risk reduction (ARR), or the attributable risk. In the first example, the AR was 0.20 in the treatment group and 0.40 in the comparison group, so the ARR is the difference between them: 0.40 - 0.20 = 0.20. That means that 20% of patients avoided death. By contrast, in the second example, although the RR was the same (0.50), the ARR was way down at 0.2%: two patients in 1,000 avoided death.

One other twist is to ask the question, "How many people do I have to treat to avoid one death?" This is called the number needed to treat, or NNT.7 We have already found in the first example that if we treat 100 people, we will avoid 20 deaths. If we treat 100/20 = 5 people, we will avoid one death. So the NNT is simply the reciprocal of the ARR:

NNT = 1 / ARR = 1 / 0.20 = 5
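A minimal sketch of that reciprocal relationship (the helper name number_needed_to_treat is ours):

```python
# Number needed to treat as the reciprocal of the absolute risk reduction (ARR).
# Illustrative helper; not from the article.

def number_needed_to_treat(ar_control: float, ar_treatment: float) -> float:
    """NNT = 1 / ARR, where ARR = AR(control) - AR(treatment)."""
    arr = ar_control - ar_treatment
    return 1 / arr

print(round(number_needed_to_treat(0.40, 0.20)))    # first example: 1 / 0.20 = 5
print(round(number_needed_to_treat(0.004, 0.002)))  # second example: 1 / 0.002 = 500
```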

If we go through the same calculations for the second example, we find that the ARR is 0.002, and the NNT is 500. Now the difference between the two studies becomes clear: despite identical RRs, they are clearly different in terms of effectiveness. (Why is the NNT not 1? For two reasons: first, not everyone who gets the treatment does well; more importantly, despite advertisements to the contrary, not everyone who does not get the treatment succumbs to the dread disease.)

Now let us move to the next box, which is the ES for differences in continuous measures. The problem is that for measures with which we are unfamiliar, such as scales tapping pain, quality of life, activities of daily living, and so forth, we do not know how many points between groups, or within a group from preintervention to postintervention, is a big or a small difference. It could be two points for one scale and 20 points for a different scale. We get around this problem by "translating" them so they all use a common yardstick, called the standardized mean difference, or Cohen's d, in honor of its developer. The formula for d is:

d = (M_T - M_C) / SD

where M_T is the mean of the treatment group, M_C that of the control group, and SD the standard deviation. (We will leave aside the question of whether it is the SD of the control group or that of both groups combined, but most people use the latter.)
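An illustrative sketch of ours follows; the scores are invented, and we use the pooled SD of both groups, the more common choice just mentioned:

```python
# Cohen's d: standardized mean difference between two groups.
# Sketch only; the data are made up.
from statistics import mean, stdev

treatment = [52, 48, 55, 50, 47, 53]   # hypothetical scale scores
control   = [45, 44, 49, 46, 42, 47]

# Pooled SD for two groups of sizes n1 and n2
n1, n2 = len(treatment), len(control)
s1, s2 = stdev(treatment), stdev(control)
sd_pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5

d = (mean(treatment) - mean(control)) / sd_pooled
print(f"d = {d:.2f}")  # expressed in SD units of the scale
```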

The value of d is expressed in units of the SD of whatever scale was used in the study. So, if d is 0.5, that means that the groups differ by one-half an SD. That is simple enough, but is that good or bad? Cohen said that a value of d of 0.2 is small, 0.5 is moderate, and 0.8 and above is large (there is no upper limit). And why those values? Because Jake Cohen speaketh from on high, and we did tremble and listen. Actually, Cohen was not quite that dogmatic; he proposed these as guidelines only, which could change from one field to the next, or as we gained more knowledge. It is the rest of the world that is fairly dogmatic now and has engraved these numbers in stone.

Jumping to cell D of Table 1 (the relationship measure for continuous data), the simple correlation (r) and the multiple correlation (R) are used to tell us the strength of the relationship between two variables (r) or between one variable and two or more predictors (R). They can range between -1.0 and +1.0. Forgetting about the sign (which tells us only the direction of the relationship), the guidelines for interpreting r and R are: a value of 0.100 is small, 0.243 is moderate, and 0.371 and higher are large. These numbers are exactly equivalent to d values of 0.2, 0.5, and 0.8, respectively, and come from the formula that converts d to r. In our opinion, though, these suggested values are too small by about half. Our reasoning is that squaring r or R tells us the amount of variance in one variable explainable by the other variable(s). An r of 0.371 means that only 14% of the variance is explained; 86% thus remains unexplained, and this does not seem high to us at all. We would like at least 25% of one variable to be accounted for by another, which means a correlation of at least 0.50. But when we speak, nobody trembles and listens.

Meyer et al8 calculated r-type ES for a number of interventions and other phenomena. They found that the effect of antihypertensives in reducing stroke was 0.03, and the effectiveness of CABG on stable heart disease and survival at 5 years was 0.08. These are not overly impressive, especially in comparison with the effects of sildenafil for improving sexual functioning (0.38) or even the ratings of prominent movie critics and box office success (0.17).
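The conversion alluded to above is, for two equal-sized groups, r = d / sqrt(d^2 + 4). A quick sketch of ours checks the benchmark values and the variance-explained point:

```python
# Converting Cohen's d to a correlation-type effect size (equal group sizes assumed),
# and squaring r to get the proportion of variance explained. Illustrative sketch.
from math import sqrt

def d_to_r(d: float) -> float:
    """Standard d-to-r conversion for two equal-sized groups."""
    return d / sqrt(d**2 + 4)

for d in (0.2, 0.5, 0.8):
    r = d_to_r(d)
    print(f"d = {d}: r = {r:.3f}, variance explained = {r**2:.1%}")
# Prints r of about 0.100, 0.243, and 0.371; even the "large" r explains only about 14% of the variance.
```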

The last ES, φ (phi), is used to reflect the magnitude of the relationship for dichotomous variables. It is based on the χ² test:

φ = √(χ² / N)


It yields a number between 0 and 1, although one drawback is that its maximum value is 1 only if (A + B) = (C + D) and (A + C) = (B + D). The guidelines for interpreting φ are the same as for r and R. We have rarely encountered this ES outside of the psychologic literature.

So there it is (or rather, there they are). The ES is a handy way of going beyond P levels (and even CIs) and answering the question, "So how big was the effect?" Some guidelines exist for saying whether the effect was small, moderate, or large, but these must be supplemented by your own clinical judgment and knowledge of the area.

References

1. Norman GR, Streiner DL. Do confidence intervals give you confidence? Chest. 2012;141(1):17-19.
2. Berglund G. The role of salt in hypertension. Acta Med Scand Suppl. 1983;672(S672):117-120.

3. Steering Committee of the Physicians' Health Study Research Group. Final report on the aspirin component of the ongoing Physicians' Health Study. N Engl J Med. 1989;321(3):129-135.
4. Pilote L, Lauzon C, Huynh T, et al. Quality of life after acute myocardial infarction among patients treated at sites with and without on-site availability of angiography. Arch Intern Med. 2002;162(5):553-559.
5. Streiner DL, Norman GR. PDQ Epidemiology. 3rd ed. Shelton, CT: PMPH USA; 2009.
6. Bakhai A, Hill RA, Dundar Y, Dickson RC, Walley T. Percutaneous transluminal coronary angioplasty with stents versus coronary artery bypass grafting for people with stable angina or acute coronary syndromes. Cochrane Database Syst Rev. 2005;1(1):CD004588.
7. Laupacis A, Sackett DL, Roberts RS. An assessment of clinically useful measures of the consequences of treatment. N Engl J Med. 1988;318(26):1728-1733.
8. Meyer GJ, Finn SE, Eyde LD, et al. Psychological testing and psychological assessment: a review of evidence and issues. Am Psychol. 2001;56(2):128-165.
