Making valid causal inferences from observational data

Preventive Veterinary Medicine 113 (2014) 281–297

Wayne Martin
Professor Emeritus, University of Guelph, Guelph, Ontario, Canada N1G 2W1

Article history: Received 19 January 2013; received in revised form 29 August 2013; accepted 13 September 2013

Keywords: Causal inference; Cause; Component cause; Causal diagram; Counterfactual; Multivariable model; Propensity score; Instrument variable; Boosted regression; Marginal structural model; Critical appraisal; Forward projection; Causal guidelines

Abstract

The ability to make strong causal inferences, based on data derived from outside of the laboratory, is largely restricted to data arising from well-designed randomized controlled trials. Nonetheless, a number of methods have been developed to improve our ability to make valid causal inferences from data arising from observational studies. In this paper, I review concepts of causation as a background to counterfactual causal ideas; the latter ideas are central to much of current causal theory. Confounding greatly constrains causal inferences in all observational studies. Confounding is a biased measure of effect that results when one or more variables, that are both antecedent to the exposure and associated with the outcome, are differentially distributed between the exposed and non-exposed groups. Historically, the most common approach to control confounding has been multivariable modeling; however, the limitations of this approach are discussed. My suggestions for improving causal inferences include asking better questions (relates to counterfactual ideas and "thought" trials); improving study design through the use of forward projection; and using propensity scores to identify potential confounders and enhance exchangeability, prior to seeing the outcome data. If time-dependent confounders are present (as they are in many longitudinal studies), more-advanced methods such as marginal structural models need to be implemented. Tutorials and examples are cited where possible.

1. Introduction

Humans have made causal inferences based on their observations for thousands of years. As one example, over 2000 years ago, the philosopher Lucretius Carus showed incredible perception into how nature works, and he recorded his insights in a poem entitled "On the Nature of Things". Based on his observations, Carus thought that the basic building blocks of everything, living or dead, were eternal invisible particles that were infinite in number, but limited in size and shape. He posited that everything was formed of these "seeds" and that, on death or dissolution, everything returns to them (Greenblatt, 2011). As a second example, all epidemiologists know of the work of John Snow and colleagues in the 1800s (Bingham et al., 2004; Koch, 2008), and how, with astute observations and what today we might call a cohort study—with exposure based on the level of salt in the home water supply—they concluded that cholera was caused by invisible (at the time) microorganisms that entered the water supply via contamination with human fecal material.

Today, "The goal of most, if not all, scientific investigation is to uncover causal relationships" (Aiello and Larson, 2002). As De Vreese (2009) states, "the goal of epidemiologic research is, ultimately, disease prevention", which requires the identification of causal factors. And, according to Constantine (2012), to help achieve this goal requires a "deep understanding of the research topic, respect for the assumptions and limitations of the analytical tools employed, and perhaps most importantly, a strong theoretical foundation." Despite the good intentions of most epidemiologists, there has been concern about the large number of apparently false rejections of the null hypothesis in observational studies (de Jonge et al., 2011).

In the mid-1990s, Taubes (1995) opined that epidemiology had reached (or had passed) its limits as a science. And, in 2001, editors of the International Journal of Epidemiology wondered if it was time to "call it a day?" (Davey Smith and Ebrahim, 2001). Thus, in 2012, it is comforting to see that over the past decade epidemiology has flourished as a science. Furthermore, credible epidemiologists are holding to the view that observational studies, "warts and all", are our best scientific approach for improving the health of humans (Hernan, 2011). Nonetheless, we must recognize that making valid causal inferences from observational data is a challenging, "risky" process (Hernan and Robins, 2006a). Indeed, Rothman and Greenland (2005) reminded us that ". . . all of the fruits of scientific work, in epidemiology or other disciplines, are at best only tentative formulations of a description of nature . . . the tentativeness of our knowledge does not prevent practical applications, but it should keep us sceptical and critical." Hence, because most epidemiologists will continue to rely on data from observational studies to identify causal associations between exposures and outcomes, new approaches to support our making valid causal inferences are needed.

Recently, philosophical discussions of causal inference have included approaches to identifying and understanding causal factors in complex systems (Campaner, 2011; De Vreese, 2009; Rickles, 2009; Ward, 2009a). More complete reviews on the philosophy of causal inference are also available (Aiello and Larson, 2002; Weed, 2002; White, 2001; Robins, 2001). A number of important papers on the quantitative aspects of causal modeling are published in two special issues of the International Journal of Biostatistics (Moodie and Stephens, 2010a,b; Moodie et al., 2012). My objectives here are to review the literature on making causal inferences from non-experimental data and to make recommendations on how we might improve our ability to make valid causal inferences from data derived from observational studies.

2. Defining a cause

Rothman (1976) reviewed the concepts of "cause" from an epidemiological perspective. For practical purposes, like Susser (1991), I define a cause of disease as any factor that produces a change in the nature or frequency of the health outcome. Often, epidemiologists have separated biological causes (those operating within individuals) from population causes (those operating at or beyond the level of the individual). For example, infection with a specific microorganism often is viewed as a biological cause of disease within individuals. In contrast, lifestyle, nutrition, or other factors that act at the group level or beyond (e.g. weather) and affect whether or not individuals are exposed to the microorganism (or, alternatively, affect the individual's susceptibility to the effects of exposure) would be deemed population causes. Epidemiologists recognize that whereas disease occurs in individuals, "epidemiology deals with groups of individuals because the methods for determining causality require it" (De Vreese, 2009). Further, it is vital that we include social as well as biological factors in our study of health and disease, especially in humans (Kaplan, 2004; Harper and Strumpf, 2012).

Because most causes act in concert with other causes, we recognize that a single cause need not invariably produce the outcome, and a cause need not be directly causal of the outcome. Given this complexity, our challenge is to develop a standardized approach to identify when an exposure should be deemed a cause of the effect (more on this later) and to estimate the magnitude of its effect. In searching for causes, although we stress a holistic approach to health and disease, we cannot consider every potential causal factor in a single study. Rather, we need to place limits on the portion of the "real world" we study and, within this, we constrain the list of factors we identify for investigation. Being pragmatists, we seek to identify causal factors that we can manipulate to prevent disease, while recognizing that some non-manipulable causal factors (e.g. age, sex, race) might be crucial to our understanding of disease patterns in populations. Usually, extant knowledge and current beliefs form the basis for selecting potential causal factors for study. Thus, I will begin my discussion with a brief review of some concepts of how causal factors might act, and interact, to alter the health status of individuals.

3. Conceptual mechanistic models of causation

The biological details of causation often are unknown, and the statistical measures of association epidemiologists use reflect—but do not explain—the number of ways in which the exposure might cause disease (Hernan, 2004; Hernan and Robins, 2006a). Nevertheless, mechanistic models of causation have been helpful in guiding our research efforts. Because our inferences about causation typically are based on the observed differences in outcome frequency, or severity, between exposed and unexposed subjects (Campaner, 2011), we will examine the relationship between a postulated causal model and the resultant, observed, outcome frequencies. We begin with a description of a simple mechanistic model known as the component-cause model.

3.1. Component-cause model

The component-cause model is based on the concepts of sufficient causes (Rothman, 1976). In this model, a sufficient cause always produces the disease (i.e. if the factor is present, the disease invariably follows). Both experience and formal research have indicated that very few exposures (potential causal factors) are sufficient in and of themselves; rather, different groupings of factors combine and become a sufficient cause. In this context, a component cause is one of a number of factors that, in combination, constitute a sufficient cause. Within each sufficient cause, the factors might be present concomitantly—or they might follow one another in a temporal chain of events (Rothman and Greenland, 2005). In Table 1, I portray some potential causal relationships between four risk factors (potential causes) and childhood respiratory disease (CRD) (Chibuk et al., 2010; Korppi et al., 2003).

Table 1
Five hypothetical sufficient causes of childhood respiratory disease (CRD), each comprised of the two specified component causes.

Sufficient cause    STREP   RSV    Stressors   MyP
I                   +       +
II                  +              +
III                 +                          +
IV                                 +           +
V                           +      +
Prevalence^a        0.1     0.25   0.35        0.15

^a Assumed prevalences, for purposes of developing Table 2 and Table A.1.

These risk factors include:

• a bacterium, Streptococcus pneumoniae (STREP);
• a virus, the respiratory syncytial virus (RSV);
• other bacteria, such as Mycoplasma pneumoniae (MyP); and
• damp cool/cold weather (called Stressors).

In this deterministic portrayal of causes, I suggest that there are five sufficient causes, each one containing a minimum of two specific components. I further assume that the specified two-factor combinations each form a sufficient cause. Hence, whenever these combinations of exposures/factors occur in the same child, clinical pneumonia—here denoted as CRD—will occur. As mentioned, one can conceive that these factors might not need to be present concomitantly; they could be sequential exposures in a given child. Some children could have more than two causal factors (e.g. STREP, RSV and Stressors), but the first exposure to any two of these three factors would be sufficient to produce CRD. Note that, in this model, I have indicated that only some specific two-factor combinations (n = 5) act as sufficient causes. Overall, STREP and Stressors are each a component of three of the sufficient causes; RSV and MyP are present in only two sufficient causes. Because no factor is included in every sufficient cause, there is no necessary cause of CRD in my model.

For my model to serve as the basis of observed disease frequencies, I assigned prevalences of 0.1, 0.25, 0.35 and 0.15 to STREP, RSV, Stressors, and MyP, respectively, and assumed that these factors are independently distributed in the population. Table 2 shows the expected CRD disease frequencies, given these assumptions. Now, against this backdrop of four causal factors, I will assume that we plan to measure only the STREP and RSV components (e.g. obtain nasal swabs for culture and/or blood samples for antibody titers) in our research. We might, or might not, be aware that the other components (Stressors and/or MyP) operate as components of one or more of the sufficient causes, and that some of these might contain none, one, or both of STREP and RSV (see Table A.1). Deterministically, in a population of 10,000 children, we would observe that 250 children with CRD will have both STREP and RSV exposure, 336 will have only STREP, and 788 only the RSV component. Because of the specified causal effects and the prevalence of the other unmeasured factors (e.g. Stressors and MyP forming sufficient cause IV), many (N = 354) children with CRD will have neither of the two measured factors (see Tables 2 and A.1).

Fig. 1. A causal model of CRD based on the component causes shown in Table 1.

The data in Table 2 indicate an overall (unadjusted) relative risk (RR) of 4.6 for STREP and an overall RR of 4.5 for RSV. However, on the multiplicative scale, there is clear interaction between these two factors: the risk of CRD is increased 2.9× by STREP when RSV is present and by 9× when RSV is absent. On the risk difference (RD) scale, the data are not additive either (see VanderWeele and Robins, 2007a for a full discussion of this topic). The RD produced by STREP alone is 0.4, and by RSV alone it is 0.3; when both are present, the observed RD is 0.95, whereas under additivity of effects a combined RD of only 0.7 would be expected. In choosing the best scale for assessing biological interaction, Hofler (2005b) and others have asserted that the RD is most appropriate. The reader also should recall that although I have not stated any biological explanations as to how these agents might exert their causal effects, I developed the component model assuming that exposure to both RSV and STREP, or to the other component-cause combinations shown in Table 1, had to occur to form a sufficient cause (Vineis and Kriebel, 2006). The relative and additive strengths of their effects are a result of these causal specifications and the distribution of the components of the five sufficient causes, as well as the scale of measurement; they are not biological constants for each agent.

3.2. Causal-web model

A second way of conceptualizing how multiple factors can combine to cause disease is through a causal web (Fig. 1) consisting of multiple indirect and direct causes (Krieger, 1994). This concept is based on a series of interconnected causal chains or web structures; in a sense, it takes the factors portrayed in the sufficient-cause approach and links them temporally and biologically into causal chains. This approach requires knowledge about the likely associations and causal structure over and above that needed for the component-cause model. In the causal-web model, a direct cause has no known intervening variable between that factor and the disease. Diagrammatically, the exposure is adjacent to, and antecedent to, the outcome; e.g. MyP is a direct cause of CRD in Fig. 1. One possible web of causation of CRD, based on the factors shown in Table 1, might have the structure shown in Fig. 1.

Table 2
The distribution of CRD cases and the source population^a (N = 10,000), with respect to STREP and RSV, given the sufficient causes shown in Table 1. The stratum-specific risks of CRD and the increased risk from STREP exposure in children with, and without, RSV exposure are shown.

RSV   STREP   Population   Cases   Stratum CRD risk   Increased CRD risk from STREP
+     +       250          250     1.0                2.86
+     −       2250         788     0.35
−     +       750          336     0.45               9.0
−     −       6750         354     0.05

^a For purposes of this example, I assume all four component causes (Table 1) are operative in the source population, but only two of the four (RSV and STREP) are known (or of interest) to the researchers. See Table A.1 for details on all four component causes.
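The deterministic frequencies in Table 2 can be reproduced directly from the assumptions of Section 3.1. The following short Python sketch is illustrative only (the data structures and names are mine; the five sufficient causes are those listed in Table 1): it enumerates the 16 possible exposure patterns, accumulates the expected population and case counts within each STREP-by-RSV stratum, and prints the four rows of Table 2 (risks of 1.00, 0.45, 0.35 and 0.05).

from itertools import product

# Component-cause prevalences from Section 3.1 (assumed independent).
prev = {"STREP": 0.10, "RSV": 0.25, "Stressors": 0.35, "MyP": 0.15}

# The five two-component sufficient causes listed in Table 1.
sufficient = [{"STREP", "RSV"}, {"STREP", "Stressors"}, {"STREP", "MyP"},
              {"Stressors", "MyP"}, {"RSV", "Stressors"}]

N = 10_000
factors = list(prev)          # fixed order: STREP, RSV, Stressors, MyP
cells = {}                    # (STREP, RSV) -> (population, cases)

# Enumerate all 16 exposure patterns and their joint probabilities.
for pattern in product([1, 0], repeat=len(factors)):
    state = dict(zip(factors, pattern))
    p = 1.0
    for f in factors:
        p *= prev[f] if state[f] else 1 - prev[f]
    # CRD occurs iff at least one sufficient cause is completed (deterministic model).
    crd = any(all(state[f] for f in sc) for sc in sufficient)
    key = (state["STREP"], state["RSV"])
    pop, cases = cells.get(key, (0.0, 0.0))
    cells[key] = (pop + p * N, cases + p * N * crd)

for (strep, rsv), (pop, cases) in sorted(cells.items(), reverse=True):
    print(f"STREP={strep} RSV={rsv}: population={pop:6.0f} "
          f"cases={cases:5.0f} risk={cases / pop:.2f}")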

The causal-web model complements the component-cause model; however, there is no direct equivalence between them. Nonetheless, formal causal-web diagrams are useful to guide our study design, analyses and interpretation of data (VanderWeele et al., 2008), and share a number of causal principles with the component-cause model (VanderWeele and Robins, 2007b). My example model, in Fig. 1, indicates that Stressors make the child more susceptible to infection with STREP, RSV, and MyP. RSV increases the susceptibility to STREP but not MyP; and RSV can "cause" CRD directly (this might be known to be true, or it might reflect the lack of knowledge about an additional intervening factor which is missing from the causal model). The diagram also indicates that STREP is an indirect cause of CRD via MyP, as well as being a direct cause of CRD. If this causal model is true, it suggests that we could reduce CRD occurrence by removing an indirect cause such as weather stress, even though it has no direct effect on CRD. We could also prevent CRD by blocking the action of the direct causes STREP, RSV, and MyP (e.g. by vaccination). All things being equal, an RSV vaccine would appear to be particularly effective, because of its direct and indirect effects on CRD. As mentioned, this model claims that Stressors do not cause CRD without co-exposure to at least one of STREP, RSV, or MyP infection, and thus suggests a number of two- or three-factor groupings of component causes into sufficient causes. Based on the component-cause model, the outcome frequencies in RSV-infected and non-infected children will depend on the distribution of the other component causes and on whether or not RSV can appear to be a sufficient cause by itself (because the other components are not measured). Later, I will discuss the relationship of the causal structure to the design of our studies and its role as a guide to the correct approach in our analyses and interpretation of the study data. For now, I note that the heavier arrows linking RSV and STREP to CRD are the only paths of concern, given that we measured only RSV and STREP as potential causes of CRD. I would point out that, as Constantine (2012) notes, a specified model "does not 'confirm' causal relationships. Rather, it assumes causal links and then tests how strong they would be if the model were a correct representation of reality."

Direct causes often are the proximal causes emphasized in therapy, such as specific microorganisms or toxins. In contrast, an indirect effect is one in which the effects of the exposure on the outcome are mediated through one or more intervening variables. Many causal factors contribute both directly and indirectly (e.g. RSV and STREP), and it is important to recognize that, in terms of disease control, direct causes might be no more valuable than indirect causes. In fact, many large-scale control efforts are based on manipulating indirect rather than direct causes. Well-known examples include the work of John Snow and his colleagues on cholera control through improved water supply (Bingham et al., 2004; Koch, 2008) and Goldberger's suggestions to prevent pellagra by focusing on indirect socio-economic factors (Rajakumar, 2000).

3.3. Counterfactual concepts of causation for a single exposure

So far, I have asserted causation without specifying "how" or "why". Indeed, I have declared causation based on associations, and must admit that such assertions often are erroneous. Thus, I now describe the most widely accepted conceptual basis for determining causation in epidemiology; it is called the counterfactual or potential-outcomes model (Greenland, 2005; Hofler, 2005b; Maldonado and Greenland, 2002). Flanders (2006) discusses the linkages between counterfactual and sufficient-cause models. Mortimer et al. (2005) note that counterfactuals "represent all of the study subjects' possible outcomes, including the observed outcome and those outcomes that would occur if, contrary to the fact, the subject were exposed to each possible exposure history." In my forthcoming example, there are two exposure states (vaccinated or not) and two possible outcomes (CRD or not); hence there are four possible results per subject.

Suppose we are interested in whether or not a vaccine would protect against a disease called "CRD". In this scenario each individual could be vaccinated or not, and each could develop CRD or not. In the face of a CRD outbreak, if we observed a vaccinated subject who did not develop the disease within a year of vaccination, we might think that the exposure (vaccination) prevented the disease in that subject. In addition, if we imagined (or could observe) that same subject—in the same time period except that he/she was not vaccinated (this is the counterfactual state)—then, if we knew that the disease would have occurred in this individual, we would surely conclude that the exposure (vaccination) had prevented the disease in the individual we observed. Conversely, if the disease did not occur in this non-vaccinated counterfactual state, we would conclude that vaccination was of no consequence to the health status of that subject (because the disease did not occur regardless of exposure). In this sense, the counterfactual model reflects the information and the judgements many of us would use to make causal inferences. Let's examine this in more detail.

Table 3
Observed and counterfactual results of an exposure (E) and disease (D).^a

Subject (i)   Covariate^b (C)   Actual exposure   Actual outcome   Counterfactual outcomes
                                (vaccination)     (CRD)            DE+    DE−
1             1                 0                 1                0      1
3             1                 1                 1                1      1
4             1                 1                 0                0      0
5             1                 1                 1                1      1
6             1                 1                 1                1      1
11            1                 1                 1                1      0
12            1                 0                 1                0      1
13            1                 1                 1                1      1
15            1                 1                 1                1      1
17            1                 0                 0                0      0
18            1                 1                 0                0      1
20            1                 1                 0                0      0
2             0                 0                 1                1      1
7             0                 0                 0                1      0
8             0                 1                 1                1      1
9             0                 0                 0                1      0
10            0                 1                 0                0      0
14            0                 1                 0                0      0
16            0                 0                 0                0      0
19            0                 1                 0                0      0
Totals        12                13                10               10     10

^a In my example (see text), the exposure is vaccination and the disease is CRD. "Actual" represents the exposure and outcome states observed by the researcher (1 = presence; 0 = absence). Counterfactual outcomes represent the outcome that would have been observed in each of the possible exposure states. One of these (e.g. no vaccination in subject 1) was actually observed, and subject 1 developed CRD. The other exposure (vaccination) was not observed in subject 1, but had it been, subject 1 would NOT have developed CRD.
^b The data are sorted by the presence/absence of variable C, a variable that might have impacted on the assignment to vaccination in an RCT, or been a confounder in an observational study.
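Because Table 3 supplies both potential outcomes for every subject, the counterfactual logic of this section can be checked mechanically. The sketch below is my own illustration (the list "rows" simply transcribes Table 3): it verifies that each observed outcome equals the potential outcome under the exposure actually received, identifies the subjects in whom vaccination causes or prevents CRD, and computes the population-level counterfactual contrast.

# Table 3 as records: (subject, C, E, D, DE_plus, DE_minus).
rows = [
    (1, 1, 0, 1, 0, 1), (3, 1, 1, 1, 1, 1), (4, 1, 1, 0, 0, 0),
    (5, 1, 1, 1, 1, 1), (6, 1, 1, 1, 1, 1), (11, 1, 1, 1, 1, 0),
    (12, 1, 0, 1, 0, 1), (13, 1, 1, 1, 1, 1), (15, 1, 1, 1, 1, 1),
    (17, 1, 0, 0, 0, 0), (18, 1, 1, 0, 0, 1), (20, 1, 1, 0, 0, 0),
    (2, 0, 0, 1, 1, 1), (7, 0, 0, 0, 1, 0), (8, 0, 1, 1, 1, 1),
    (9, 0, 0, 0, 1, 0), (10, 0, 1, 0, 0, 0), (14, 0, 1, 0, 0, 0),
    (16, 0, 0, 0, 0, 0), (19, 0, 1, 0, 0, 0),
]
n = len(rows)

# Consistency: the observed outcome equals the potential outcome under
# the exposure actually received.
assert all(d == (de1 if e else de0) for _, _, e, d, de1, de0 in rows)

# Individual-level effects: (DE+, DE-) = (1, 0) caused; (0, 1) prevented.
caused = [i for i, _, _, _, de1, de0 in rows if (de1, de0) == (1, 0)]
prevented = [i for i, _, _, _, de1, de0 in rows if (de1, de0) == (0, 1)]
print("vaccination causes CRD in subjects:", caused)        # [11, 7, 9]
print("vaccination prevents CRD in subjects:", prevented)   # [1, 12, 18]

# Population-level contrast of counterfactual risks: p(DE+) - p(DE-).
p_e1 = sum(de1 for *_, de1, _ in rows) / n    # 10/20 = 0.50
p_e0 = sum(de0 for *_, de0 in rows) / n       # 10/20 = 0.50
print(f"p(DE+) = {p_e1:.2f}, p(DE-) = {p_e0:.2f}, causal RD = {p_e1 - p_e0:.2f}")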

The following discussion on causal inference, based on counterfactuals, closely follows that of Hernan (2004), supplemented with earlier work by Little and Rubin (2000). We can make the previous thought process about causation more formal by denoting the potential outcome in exposed (i.e. vaccinated) subjects as DE+; DE+ is set to 1 if the disease would occur in a particular vaccinated subject and 0 otherwise, as shown in Table 3. The potential outcomes in the same subject if unexposed (i.e. unvaccinated, or the counterfactual state), denoted DE−, are coded as 1 or 0 on the same basis. Our thought process concludes that there is a causal effect in that subject if the outcome under DE+ differs from the outcome under DE− (i.e. if DE+ ≠ DE−). For example, subject 1 in Table 3 was not vaccinated and developed CRD (his/her observed exposure state), and this subject would not have developed CRD if vaccinated, the counterfactual state. If we knew the outcomes under both the exposed and counterfactual states, given that these observations would be made under the exact same conditions—save for vaccination—we would conclude that the vaccine would have prevented the disease in subject 1. Note that exposure to the putative causal factor need not exert a causal (or preventive) effect in every individual, principally because the other factors needed to complete a sufficient cause might be absent in that individual.

Despite its didactic merit, it is clear that, in reality, we cannot determine a causal effect at the individual level, because only one of the two possible exposure levels is observed. The data concerning what might have happened at the other exposure level (i.e. the counterfactual state) are missing. I created the counterfactual data shown in Table 3 for purposes of this example; however, in real life these outcomes are unknown. In Table 3, I have summarized the observed exposure (vaccination status), the observed disease outcome within a year of vaccination, and the counterfactual (potential) outcomes in a population of 20 subjects, based on Hernan (2004). Note that the observed outcome for each subject is the same as the potential outcome for that exposure state. Thus, because subject 1 was not vaccinated, the observed outcome is the same as specified under DE− for subject 1. At this point we can overlook the scientific process (i.e. experiment or observation) that generated the observed data in Table 3. Subscript 'i' is the subject identifier; variable C is (at present) an unidentified extraneous variable; variable E the observed exposure (vaccinated or not); and variable D the observed outcome (CRD or not). A 1 indicates the presence and 0 the absence of the exposure factor or the outcome. In the two columns on the right side of Table 3 are the outcomes for the counterfactual exposed and unexposed populations. Note that in the counterfactual (or potential-outcome) setting, our inference about cause is made by comparing the potential outcomes in the exact same subjects under separate (albeit counterfactual, or unrealized) scenarios that differ only by exposure. Because, in real life, there might be concerns about an attenuated vaccine causing the disease it is designed to prevent, I have set this example up such that the exposure (vaccination) causes (or would have caused) CRD in three individuals (subjects 7, 9, and 11) and prevents (or would have prevented) CRD in three others (subjects 1, 12, and 18). Vaccination had no effect in the remaining 14 individuals (their CRD outcome did not change under the counterfactual state). Recall from our earlier discussion of component causes that these counterfactual effects (if real) could be due to the presence or absence of other causal components as well as the effect of the exposure of interest.

If we expand our horizon to the population level, we could compare the potential frequency of disease in the population if all of its 20 members were exposed, denoted p(DE+), to the potential frequency of disease in that same population if none of its members was exposed, denoted p(DE−). At the population level, we can infer that there is a causal effect (i.e. a true treatment effect) in the population if there is a difference in counterfactual means {i.e. p(DE+) − p(DE−) ≠ 0}. In our example, because the DE+ and DE− totals both equal 10, there is no causal effect in the population.

So how does this potential-outcome model help us? Well, although the previous population measures are not directly observable, we can estimate them under specific conditions; namely, through the use of randomization in the perfect experiment. In essence this would create the setting where everything is exactly the same in the two exposure states, save the exposure. As shown in this example, De Vreese (2009) reminds us that interventions can have a large effect at the population level but offer little benefit to individuals, or they might have no effect at the population level but offer advantages to some individuals. In general, epidemiological findings relate to populations and should not be viewed, necessarily, as good predictors (prognostic causal factors) of outcomes in individuals. In addition, both Greenland (2005) and Hernan (2005) give examples (through hypothetical interventions) of the specificity required in these judgments (and research hypotheses) if we are to make progress in resolving complex health problems. See also Kramer et al. (2012).

4. Experimental approaches to causation

The experimental design which would most nearly approximate the counterfactual is the cross-over design. In this design, subjects are randomly assigned either to receive the treatment of interest, or to serve as placebo controls, in the first period of the experiment. Then, after a suitable "wash-out" period, the subjects receive the other level of the treatment (i.e. if they received the treatment in the first period they would receive the placebo in the second, and vice versa). This allows each subject to serve as its own control, as in the counterfactual setting.

More frequently, the gold-standard approach to identifying causal factors is to perform a two-arm experiment (randomized controlled trial, RCT), in which we randomize some subjects to receive the "treatment" factor and some (often half the subjects) to receive a placebo. After a specified time period, we assess whether there are differences in the outcome between these two groups. This design builds on the fact that, in the long run (i.e. large samples), randomization will balance all the covariates in the two treatment (exposure) groups. In both of these experimental designs, the treatment/exposure (now denoted as X) explicitly precedes the outcome (denoted as Y) temporally, and all other variables (known and unknown) that do not intervene between X (e.g. vaccination) and Y (e.g. CRD) are made independent of X through the process of randomization (this means that these variables do not confound or bias the results we attribute to the exposure X). Given the conditions of our perfect experiment, this independence of all factors from the treatment X produces exchangeability in the treatment groups; that is, the same outcome would be observed (except for sampling error) if the assignments of treatment—vaccination in our example—to study subjects had been totally reversed. The formal application of randomization provides the probabilistic basis for the validity of this assumption, and the expectation of "equality" increases with sample size. Factors that are positioned temporally or causally between X and Y need not be measured and are of no concern with respect to answering the causal objective of the trial. Thus, following the randomization of the 20 subjects in Table 3 to vaccination or placebo, we would observe the outcome shown under the appropriate counterfactual column.

In these experimental contexts, exposure X would be considered a proven cause of outcome Y if it was judged that the value (or state) of Y differed (for reasons other than chance) between exposed and unexposed subjects following the manipulation of X. The measure of causation in this ideal experiment is called the causal-effect coefficient; except for sampling error, the difference in the outcome between the "treated" and "non-treated" groups (i.e. those subjects with different levels of factor X) indicates the average difference between the two counterfactual means. For example, if the risk of the outcome in the group receiving the treatment is denoted R1 = p(D+|E+) ≈ p(DE+) and the risk in the group not receiving the treatment is R0 = p(D+|E−) ≈ p(DE−), then we can choose to measure the effect of treatment using either an absolute measure (e.g. RD) or a relative measure (e.g. RR). Indeed, we could have a quantitative outcome where we would assess whether ȳ1 − ȳ0 ≠ 0. If the difference ȳ1 − ȳ0 is greater than what could be attributed to chance, then we could say that we have proved that the factor is a cause of the outcome event. A key point is that all causal-effect statements are based on contrasts of outcomes in the different treatment groups; the outcome in the treated group cannot be interpreted without knowing the outcome in the untreated group.

For the present, I will view the data in Table 3 as arising from a perfect experiment, with the caveat that it might be imperfect because of interference—e.g. indirect effects of vaccination—so this might not be the best design for a vaccination trial, but bear with me for this example. According to the counterfactual outcomes in Table 3, in our perfect trial we should observe R1 = R0 and conclude that there is no causal effect of vaccination in the population. However, as shown in Table 4, this is not the case using the observed data (columns 3 and 4) from Table 3. Analysis of the observed data indicates that the risk of CRD is higher (RR = 1.26) in the vaccinated than in the non-vaccinated subjects. So, what accounts for the fact that the observed risks do not equal the counterfactual risks? The problem is that our comparison group (E−) is not a good counterfactual group, in that it differs systematically from the E+ group in a manner that alters the risk of the outcome.

For example, it appears that in the context of conducting this clinical trial, 75% of the C+ individuals were randomly assigned to be vaccinated, whereas only 50% of the C− subjects were randomly assigned to be vaccinated. Consequently, the groups were not exchangeable (9/13 = 69% of the V+ group were C+, whereas only 3/7 = 43% of the V− group were C+).

Table 4
A summary of the risk of disease (CRD) in vaccinated (E+) and unvaccinated (E−) subjects: data from Table 3.

CRD cases   Vaccinated    Non-vaccinated   Total   Risk ratio
Yes         7             3                10      1.26
No          6             4                10
Total       13            7                20
CRD risk    7/13 = 0.54   3/7 = 0.43

Based on the above discussion, the data in Tables 3 and 4 would be unlikely to arise in a trial with complete randomization, because factor C (an indicator for risk of disease) would have been distributed equally in the vaccinated and unvaccinated groups, and hence would not bias the observed disease frequency. As Little and Rubin (2000) noted, we need to explain the exposure (vaccination) distribution if we are to obtain the true causal effect. If randomization to vaccination had been conditional on factor C (e.g. representing risk of disease prior to vaccination assignment), we would have had to account for this feature when analyzing the data to obtain unbiased estimates of the causal effect and/or its variance. Thus, following Little and Rubin (2000), the presence of variable C could explain the biased distribution of vaccination. If the researchers in this example had decided, a priori, to vaccinate a higher proportion of the "at risk" subjects, they could have stratified or blocked on the initial disease risk and then randomly assigned vaccination to each subject, with a higher vaccination level in the "high risk" group. To account correctly for this conditional randomization in the experimental design, we must stratify on the level of C in our analysis, to achieve conditional exchangeability (within each level of C) and obtain an unbiased estimate of the causal effect. When this is accomplished, the simple differences (or risk ratios) in outcome between exposure groups, within risk blocks, provide unbiased estimates of the causal effect, as shown in Table 5.

Although the perfect trial mimics the counterfactual, given any lack of compliance, unequal follow-up, loss to follow-up, measurement/diagnostic errors, and other biases, our observed disease frequencies in a real trial will reflect 'associations' and not necessarily causal effects. Chen et al. (2011), Hernan (2012), and Little and Rubin (2000) discuss these and other issues—including interference (indirect effects, as in herd immunity, or indirect social effects), confounder selection, lack of positivity (no comparable exposed, or unexposed, subjects), and missing data—that can bias the observed effects in an RCT.

5. Causal inference in observational studies

In the absence of a perfect trial, our observational-study approach is to estimate the difference in values of the outcome "Y" between subjects that happen to have different values of the exposure "X"; we do not exert control over whether a subject is (or is not) exposed.

In this setting, variables related to both X and Y (and which do not intervene between X and Y) must be controlled (or adjusted for) to prevent confounding bias and to support the estimation of causal effects. However, a major difficulty in observational studies is that the exposed and unexposed groups (even with control of the measured covariates) rarely are exchangeable. Traditionally, in an effort to approach exchangeability, we try to identify and measure "all" the important potential confounders. Unfortunately, we have no way of knowing the degree of non-exchangeability in our data, and it is naive to assume that the overall treatment (exposure) assignment is ignorable "after controlling for recorded confounders" (Little and Rubin, 2000). Thus, for valid causal inferences, a major goal for observational-study data is to have the covariate distribution in the control subjects be as similar as possible to that of the treated (exposed) group, within blocks of data. However, because the equivalence of the exposed/unexposed groups is not fully testable, the association measure from our observational study might not equal the causal effect. And, in general, we should remind ourselves that these association measures might not reflect causation or serve as valid estimates of the true causal effect.

5.1. Statistical methods in support of valid causal inferences from observational data

Little and Rubin (2000) stress that if we wish to make inferences only to the sample of individuals involved in an RCT (internal validity), estimating the causal effect must account for the mechanism of treatment assignment. However, if we wish to make inferences to the larger population from which the sample of randomized subjects was obtained (external validity), then we also must account for the selection mechanism for obtaining the study subjects (Grimes and Schulz, 2002). Hence, the ideal for randomized-trial design is random selection of subjects and random allocation of treatment. Because random selection is rarely used, we address the potential of selection bias by comparing the estimated treatment effects in a number of RCTs (consistency of effects). Large heterogeneity of effects suggests that selection bias cannot be ignored. Little and Rubin (2000) discuss this in more detail and contrast Fisher's, Neyman's and Bayesian approaches to inference—the interested reader should consult their paper for details. In the case of observational studies, we need to be concerned about both selection and confounder bias; however, because Dr. Ian Dohoo is discussing selection and information bias in his paper today, I will not include details on these topics here.

The most likely reason for the exposed and unexposed groups of subjects being non-exchangeable in observational studies is the presence of factors that are related to the exposure and the disease. For example, the C+ and C− subjects in Table 3 might have had different risks of CRD prior to vaccine assignment. In an observational study setting, this could alter whether or not the subjects choose to receive vaccination and also could affect disease occurrence.

Table 5
A summary of the vaccination history and disease (CRD cases) in a population of 20 study subjects. The data are stratified on the confounder covariate "C" from Table 3.

Stratum 1: confounder C = 0 (subjects have a low a priori risk of infection; p(D+) = 2/8 = 0.25)

            Vaccinated   Not vaccinated   Total   Risk ratio^a
Cases       1            1                2       1
Non-cases   3            3                6
Total       4            4                8

Stratum 2: confounder C = 1 (subjects have a high a priori risk of infection; p(D+) = 8/12 = 0.67)

            Vaccinated   Not vaccinated   Total   Risk ratio^a
Cases       6            2                8       1
Non-cases   3            1                4
Total       9            3                12

^a The adjusted risk ratio would be 1, indicating no association between vaccination and CRD. If variable "C" was the only confounder, the risk ratio would equal the causal risk ratio. Thus, vaccination had no causal effect on CRD risk in the population of 20 subjects.
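The confounding just described can be reproduced from the aggregated counts alone. In this illustrative sketch (mine, not the paper's), collapsing over C yields the biased crude risk ratio of about 1.26 from Table 4, while stratifying on C recovers the causal null of 1.0 in each stratum, as in Table 5.

# Aggregated counts from Table 3: (C, E) -> (cases, total).
counts = {
    (1, 1): (6, 9), (1, 0): (2, 3),   # C = 1: high a priori risk stratum
    (0, 1): (1, 4), (0, 0): (1, 4),   # C = 0: low a priori risk stratum
}

def risk(cases, total):
    return cases / total

# Crude risks and risk ratio, collapsed over C (Table 4).
r1 = risk(*[sum(x) for x in zip(counts[(1, 1)], counts[(0, 1)])])   # 7/13
r0 = risk(*[sum(x) for x in zip(counts[(1, 0)], counts[(0, 0)])])   # 3/7
print(f"crude: R1 = {r1:.2f}, R0 = {r0:.2f}, RR = {r1 / r0:.2f}")   # RR = 1.26

# Stratum-specific risk ratios (Table 5): both equal 1.0.
for c in (1, 0):
    rr = risk(*counts[(c, 1)]) / risk(*counts[(c, 0)])
    print(f"stratum C = {c}: RR = {rr:.2f}")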

Recall that in an RCT, if a factor (such as variable C) altered the randomization process, this factor—here, variable C status—would need to be included in the analysis to estimate the true causal effect. Similarly, if we recognized and measured variable C in an observational study, we would need to stratify on variable C as part of our analysis of the data to obtain a valid estimate of the causal effect (as shown in Table 5). The stratification produces conditional exchangeability within levels of the confounder(s), but only if we can also assume no residual confounding within level(s) of C and no unmeasured confounding. Unless we are certain there is no other unmeasured confounder, our measure of association might still be biased. This uncertainty remains the Achilles heel of all observational studies. Nonetheless, given a concerted effort to measure all important confounders, a number of methods offer advances in making valid causal inferences.

5.2. Approaches to control confounding and estimate causal effects

There are a number of ways of trying to ensure that the observed risks would equal the counterfactual risks, including: stratification, matching, standardization, inverse probability-of-treatment (exposure) weights (IPTW) used to create marginal structural models (MSMs), propensity scores (PSs), and instrument variables. Historically, the most common approach has been through multivariable regression models, which include treatment (exposure), confounders, and interactions (if important).

5.3. Regression adjustment models (aka multivariable modeling)

The intent with this procedure is to "adjust" for the effect of confounders such that the adjusted measure of association is an unbiased estimate of the true causal effect. However, as Austin (2011a,b) points out, it is difficult to know whether the adjusted model is specified correctly. Despite its wide usage, a number of other authors also have pointed out the drawbacks to this approach (Greenland, 2007, 2008; Vansteelandt et al., 2012; Moodie and Stephens, 2010a,b).

Although choosing confounders based on the "change-in-estimate" approach when using this method is preferred to purely statistical significance-driven processes (Vansteelandt et al., 2012; Greenland, 2008), the drawbacks to multivariable modeling include:

• Invoking the rule of thumb of 10 cases for every covariate can make the necessary sample sizes very large.
• The approach can be unreliable if the distributions of covariates in exposed and unexposed subjects are "very" different, because linearity of association over larger differences is harder to ensure. And, related to this, it is difficult to assess the actual degree of overlap of the exposed and unexposed groups in multivariable models.
• The multivariable models are conditional and reflect individual-level causal effects. The population effects can differ from these when the model is multiplicative, logistic, or based on proportional hazards.
• Perhaps the biggest drawback to this approach is that it allows us to avoid deciding on the exact composition of the groups that we wish to compare until we see the outcome data. Rubin (2007) stresses that this decision should be made before seeing the outcome data, as would be achieved in a randomized controlled trial.

Often, we focus on the nature of relationships between predictors and the outcome from the early stages of investigation and expend much energy on "getting the association correct" (i.e. transformations, linearity, etc.) in addition to "getting the correct variables" into our model. However, methods for selecting the "correct" variables have their own problems and biases, including being prejudiced by our own beliefs.

A logistic model of the CRD data shown in Table 2, in which STREP is examined as a risk factor for CRD while controlling for RSV, would be an example of multivariable modeling. The effect of STREP would be an estimate of its effect when comparing individuals of comparable RSV status (effectively controlling confounding from RSV). This approach leads to individual-level causal effects, and in this example we would have to retain two measures of the STREP effect because of the presence of interaction on the logistic scale: one when RSV is present and one when it is absent.
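The same adjustment logic can be checked numerically on the vaccination data of Table 3, where the confounder C plays the role that RSV plays above. The minimal sketch below is my own illustration and assumes the Python package statsmodels is available: a logistic model of CRD on vaccination and C returns an adjusted odds ratio of 1.0 for vaccination, agreeing with the stratified analysis of Table 5.

import numpy as np
import statsmodels.api as sm

# The 20 subjects of Table 3, expanded from their (C, E, D) counts.
counts = {(1, 1, 1): 6, (1, 1, 0): 3, (1, 0, 1): 2, (1, 0, 0): 1,
          (0, 1, 1): 1, (0, 1, 0): 3, (0, 0, 1): 1, (0, 0, 0): 3}
C, E, D = [], [], []
for (c, e, d), m in counts.items():
    C += [c] * m
    E += [e] * m
    D += [d] * m

# Logistic regression of CRD (D) on vaccination (E), adjusting for C.
X = sm.add_constant(np.column_stack([E, C]))
fit = sm.GLM(np.asarray(D), X, family=sm.families.Binomial()).fit()
print(f"adjusted OR for vaccination: {np.exp(fit.params[1]):.2f}")   # 1.00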

5.4. Using a causal diagram to support regression modeling

Causal diagrams (also called "directed acyclic graphs") are helpful for displaying relationships among a number of possible causal variables (see Fig. 1) that we wish to study, for deducing statistical associations that might arise from a given set of underlying causal relationships, and as an aid in selecting variables in need of "control" to prevent confounding. These advantages apply to almost all of the methods I discuss subsequently. The easiest way to construct the causal diagram is to begin at the left with exogenous variables that are predetermined by variables outside of our model; the variation of these variables (such as Stressors in Fig. 1) is considered to be due to factors outside of the model. The remaining endogenous variables are placed in the diagram in their presumed causal order; variables to the left could "cause" the state of variables to their right to change (so our diagram suggests that Stressors could alter the risk of infection with various microorganisms and hence the risk of CRD). The only causal models to be described here are called "recursive"; that is, there are no causal feedback loops.

We will assume that our objective is to estimate the causal effect of STREP on CRD, but we are aware of, and have measured, the other variables shown in Fig. 1. The model indicates that STREP can cause changes in CRD directly and also by an indirect pathway through MyP. It also indicates that Stressors can be a direct cause of (susceptibility to) STREP. The rule for tracing the pathways is that you can start backwards from any variable, but once you start forward on the arrows you cannot back up. Paths which start backwards from an exposure variable are spurious causal paths and reflect the impact of confounders. As we move forward from the exposure through other variables in the direction of the arrows, we trace out an indirect causal path, and the variables we pass through are denoted "intervening variables". If no intervening variables are present on the path, we are tracing a direct causal effect. To estimate the causal effect, we must prevent any spurious (confounded) effects, so the variables preceding an exposure factor of interest (e.g. STREP) that have arrows pointing toward it (i.e. from RSV and Stressors), and through which CRD (the outcome) can be reached on a path, must be controlled. The process also asserts that we do not control intervening variables, so MyP is not placed in the statistical model when estimating the causal effect of STREP. If we can assume that there are no (other) confounders missing from the model, our analyses will estimate the total causal effect of STREP on CRD.
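These path-tracing rules can be automated. The sketch below is my own illustration, with Fig. 1 encoded as an edge list from the description given above; it enumerates the spurious ("backdoor") paths from STREP to CRD and confirms that conditioning on RSV and Stressors blocks all of them, while the intervening variable MyP is never needed.

# Fig. 1 as a directed edge list (as described in Sections 3.2 and 5.4).
edges = [("Stressors", "STREP"), ("Stressors", "RSV"), ("Stressors", "MyP"),
         ("RSV", "STREP"), ("RSV", "CRD"),
         ("STREP", "CRD"), ("STREP", "MyP"), ("MyP", "CRD")]

def backdoor_paths(exposure, outcome):
    """All acyclic paths from exposure to outcome whose first step moves
    against an arrow pointing INTO the exposure (spurious paths)."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, []).append((b, "fwd"))
        nbrs.setdefault(b, []).append((a, "bwd"))
    found = []

    def walk(node, path, first):
        for nxt, direction in nbrs.get(node, []):
            if nxt in path:
                continue
            if first and direction != "bwd":   # must start against an arrow
                continue
            if nxt == outcome:
                found.append(path + [nxt])
            else:
                walk(nxt, path + [nxt], False)

    walk(exposure, [exposure], True)
    return found

# No collider sits on any of these paths in Fig. 1, so conditioning on any
# intermediate variable of a path blocks that path.
adjust = {"RSV", "Stressors"}
for p in backdoor_paths("STREP", "CRD"):
    blocked = any(v in adjust for v in p[1:-1])
    print(" - ".join(p), "| blocked by {RSV, Stressors}:", blocked)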


Greenland and Brumback (2002) and VanderWeele and Hernan (2006) discuss relations among causal diagrams, counterfactual models, and component-cause models. Howards et al. (2007) provide both a good discussion on the use of causal diagrams, with linkages to appropriate regression models for estimating the associations, and examples of causal diagrams based on potential causes of perinatal disease. For more-advanced reading, see VanderWeele and Robins (2007b) and VanderWeele et al. (2008). Hernan and Cole (2009) discuss how to describe four types of measurement error—as well as confounding and selection bias—using causal diagrams. Understanding and estimating direct and indirect effects (also known as "mediation analyses") has become a frequent topic of discussion (see VanderWeele and Vansteelandt, 2010; Vansteelandt, 2012). Pearl (2010) discusses causal inference in general, and then moves to a detailed discussion of structural equation models; these are widely used in econometrics but will not be discussed further here.

6. Propensity scores for controlling confounding

In this approach we try to uncover and account for the mechanism behind the allocation of exposure (Rubin, 1991), or vaccination as in my counterfactual example shown in Table 3. For example, the observed data could have resulted from a controlled experiment where the researchers decided to vaccinate a higher percentage of high-risk than low-risk study subjects, or an observational study where those subjects possessing variable C had a different risk of exposure, disease, or both than those who did not possess it. We would need to account for these exposure-altering mechanisms to obtain unbiased causal-effect measures. One method to achieve this is based on the use of a propensity score (PS) (Hernan and Robins, 2006a; Austin, 2011a,b). A PS is the conditional probability of being treated/exposed (i.e. the probability that an individual with certain characteristics will be treated/exposed) given the measured characteristics (covariates). We can denote this as p(E+|C), which is bounded by 0 and 1. Once computed, PSs can be used for matching, as the basis for a stratified analysis, for a weighted analysis, or as a covariate in a modeling procedure. One approach might prove to be more precise or less biased than another depending on the context. Nichols (2008) discusses various re-weightings of the PS that can be implemented in Stata (Stata Statistical Software, College Station, TX). Initially, PSs largely were used in cohort studies to evaluate the effects of treatments (or other exposures) when evidence from a randomized controlled trial was unavailable (D'Agostino, 1998). However, PSs can be used in case–control studies with some limitations (see Mansson et al., 2007).

6.1. Computing propensity scores

With only one or two categorical confounders, the PSs could be calculated manually, by using the observed distribution of exposure within levels of the confounder (here variable C). Based on our counterfactual data, p(E|C+) = 0.75 and p(E|C−) = 0.5 (see Table 6, right column). With more confounders and/or continuous confounders, PSs usually are derived from a logistic or probit model with observed exposure status (or treatment assignment) as the outcome (i.e. treatment assignment in an RCT, but exposure status in an observational study). In general, including potential confounders (i.e. non-intervening variables, known or suspected to be related to both exposure and the outcome) and their interactions, as necessary, is the most appropriate approach when estimating the PS.

Table 6
Conditional probability of exposure, p(E = e|C); inverse probability of exposure weights (WT); pseudo-population (Ps-Pop) composition; and propensity scores for the data in Table 3.

C   E   D   Obs. no. (nj)   pE = p(E = e|C)   WT = 1/pE   Ps-Pop = WT × nj   Propensity score
1   1   1   6               0.75              4/3         8                  0.75
1   1   0   3               0.75              4/3         4                  0.75
1   0   1   2               0.25              4           8                  0.75
1   0   0   1               0.25              4           4                  0.75
0   1   1   1               0.5               2           2                  0.5
0   1   0   3               0.5               2           6                  0.5
0   0   1   1               0.5               2           2                  0.5
0   0   0   3               0.5               2           6                  0.5
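When the PS must be estimated from data, a logistic model of exposure on the covariate(s) reproduces the manual calculation above. This minimal sketch (mine; it assumes the statsmodels package) recovers the two PS values of Table 6.

import numpy as np
import statsmodels.api as sm

# Covariate C and observed exposure E for the 20 subjects, in Table 3's row order.
C = np.array([1] * 12 + [0] * 8)
E = np.array([0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1,   # the 12 subjects with C = 1
              0, 0, 1, 0, 1, 1, 0, 1])              # the 8 subjects with C = 0

# PS model: regress exposure on the covariate(s); here C is the only one.
ps_fit = sm.GLM(E, sm.add_constant(C), family=sm.families.Binomial()).fit()
ps = ps_fit.predict(sm.add_constant(C))
print(sorted({round(float(p), 2) for p in ps}))   # [0.5, 0.75], as in Table 6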

Moodie (2009) demonstrated that it is better to include a non-confounding, non-intervening predictor of the outcome than to omit it. Intervening variables should not be included in the exposure model. Becker and Ichino (2002) provide an example of estimating causal effects using PSs, in Stata. McCaffrey et al. (2004) describe a "boosted" regression approach to fitting a large number of covariates when deriving PSs. Schonlau (2005) describes "boosting" and has developed a plug-in program for Stata.

6.2. Balancing of exposure groups

Propensity scores can be used to help ensure equivalence of the confounder distribution by "balancing" the characteristics of the exposed and non-exposed individuals across all strata (also called "levels" or "blocks") formed by grouping of the PSs. However, this is a large-sample property, so it can be combined with covariate adjustment using regression methods if needed (Austin, 2011a,b; Dohoo et al., 2012). A study is considered balanced if two conditions are met: first, the average value of the PS is the same in exposed and non-exposed individuals within each stratum of the PS; second, the mean value of all covariates making up the PS should be equal in the exposed and non-exposed groups within each stratum (i.e. the prevalence of dichotomous covariates, or the mean of continuous covariates, should be equal). As part of the balancing process, examination of the distribution of each of the original confounders in the groups matched by PS is recommended.

Computation and evaluation of PSs often are limited to observations falling in the range of PSs that includes both exposed and non-exposed individuals (called the "region of common support"). Non-exposed individuals with PS values lower than the lowest value observed for an exposed individual are ignored, as are exposed individuals with higher PSs than any non-exposed subject (also referred to as lack of positivity). Not accounting for these might seem wasteful or potentially biased; however, in the context of trying to assess the causal impact of an exposure, these individuals are so different from the others in the study group that, regardless of their exposure and disease experience, it virtually would be impossible to implicate exposure validly as a cause of the outcome. Nevertheless, it is important to note the characteristics of these subjects, because they might provide a clue about potential causes that can be addressed in future studies.

6.3. Matching using propensity scores

Matching begins with obtaining the individual PS for each of the potential study subjects. Then, we identify an exposed individual and note their PS. Next, one or more non-exposed individuals with a PS the same as, or close to, the PS of this exposed subject would be selected from the available potential study group. Selection of 1:1 or 1:n matches usually is done with replacement (so a non-exposed individual can serve as a matched control more than once). This process is continued until all exposed individuals have one or more matches. Radius matching (also called "caliper matching") selects all non-exposed individuals that have a PS within a "specified" distance of the value of the exposed individual (e.g. ±0.05 PS units). Although there is no total agreement on the "specified" distance, it is often set to 20% of the pooled standard deviation of the logit of the PS (Austin, 2011b). The analysis of the matched data should take the matching into account, although the need to do this has been questioned (Stuart, 2008).

With RCTs, the most common measure of effect computed is the average treatment effect (ATE). The ATE is the difference in the outcome measure between the vaccinated (exposed) and non-vaccinated (non-exposed) individuals (i.e. RD or difference in means). Another measure of effect is the average treatment effect in the treated (exposed) subjects (ATT). In RCTs with full compliance, these measures do not differ (because of randomization); however, in cohort and cross-sectional studies, those who actually receive treatment (or exposure) may differ from those who do not; hence, the measures may differ. Austin (2011a,b) discusses the similarities and differences in these measures of causal effect (see below for further discussion); a small worked sketch follows.
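As a sketch of the idea on these data (my own illustration; with a single binary covariate, exact matching on the PS is equivalent to matching on C), the ATT can be estimated by comparing each exposed subject with the mean outcome of the unexposed subjects at the same PS:

# Counts (C, E, D) -> n from Table 3.
counts = {(1, 1, 1): 6, (1, 1, 0): 3, (1, 0, 1): 2, (1, 0, 0): 1,
          (0, 1, 1): 1, (0, 1, 0): 3, (0, 0, 1): 1, (0, 0, 0): 3}

# Mean outcome among unexposed subjects at each PS level (the PS is a
# one-to-one function of C here, so matching on the PS = matching on C).
def control_mean(c):
    cases, non_cases = counts[(c, 0, 1)], counts[(c, 0, 0)]
    return cases / (cases + non_cases)

# ATT: average of (own outcome - matched-control outcome) over exposed subjects.
num = den = 0.0
for (c, e, d), m in counts.items():
    if e == 1:
        num += m * (d - control_mean(c))
        den += m
print(f"ATT estimate: {num / den:.2f}")   # 0.00: no effect in the treated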

6.3. Matching using propensity scores

Matching begins with obtaining the individual PS for each of the potential study subjects. Then, we identify an exposed individual and note their PS. Next, one or more non-exposed individuals with a PS the same as, or close to, that of the exposed subject are selected from the available potential study group. Selection of 1:1 or 1:n matches is usually done with replacement (so a non-exposed individual can serve as a matched control more than once). This process is continued until all exposed individuals have one or more matches. Radius matching (also called "caliper matching") selects all non-exposed individuals that have a PS within a "specified" distance of the value of the exposed individual (e.g. ±0.05 PS units). Although there is no complete agreement on the "specified" distance, it is often set to 20% of the pooled standard deviation of the logit of the PS (Austin, 2011b). The analysis of the matched data should take the matching into account, although the need to do this has been questioned (Stuart, 2008). A matching sketch is given below.
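A minimal caliper-matching sketch (matching with replacement on the logit of the PS, with the 0.2 × pooled-SD caliper noted above; all variable names remain the hypothetical ones used earlier):

```python
import numpy as np
from scipy.special import logit

df["lps"] = logit(df["ps"])                      # logit of the PS
exp_, unexp = df[df["vacc"] == 1], df[df["vacc"] == 0]

# Caliper: 20% of the pooled standard deviation of logit(PS).
pooled_sd = np.sqrt((exp_["lps"].var() + unexp["lps"].var()) / 2)
caliper = 0.2 * pooled_sd

# Nearest-neighbour matching with replacement: a control can recur.
matches = {}
for i, lps_i in exp_["lps"].items():
    dist = (unexp["lps"] - lps_i).abs()
    j = dist.idxmin()
    if dist[j] <= caliper:                       # discard matches outside the caliper
        matches[i] = j
```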

With RCTs, the most common measure of effect computed is the average treatment effect (ATE). The ATE is the difference in the outcome measure between the vaccinated (exposed) and non-vaccinated (non-exposed) individuals (i.e. the RD or difference in means). Another measure of effect is the average treatment effect in the treated (exposed) subjects (ATT). In RCTs with full compliance, these measures do not differ (because of randomization); however, in cohort and cross-sectional studies, those who actually receive the treatment (or exposure) may differ from those who do not, and hence the measures may differ. Austin (2011a,b) discusses the similarities and differences in these measures of causal effect (see below for further discussion).

6.4. Stratification using propensity scores to obtain average treatment (exposure) effects

Stratification involves dividing the observed data into strata (blocks) that are used to evaluate the "balancing properties" of the PS procedure (Hullsiek and Louis, 2002). Commonly, five blocks are used, based on the PS; this should remove approximately 90% of the bias due to a continuous confounder. In general, stratum-specific estimates of effect are weighted by the proportion of subjects who lie within that stratum, and these can be averaged to produce an overall difference in means or RD. When the sample is stratified into K equal-size strata, stratum-specific weights of 1/K commonly are used when pooling the stratum-specific treatment effects, allowing one to estimate the ATE. Using stratum-specific weights equal to the proportion of treated (exposed) subjects that lie within each stratum allows estimation of the ATT (Austin, 2011b). Austin (2011a) provides a worked example, and a minimal sketch is given below. Because the PS might be inaccurate for subjects with a very low PS, the use of stabilized weights is discussed in Austin (2011a,b).
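A minimal sketch of this pooling, assuming the hypothetical PS strata and 0/1 outcome column ('disease') used earlier:

```python
# Pool stratum-specific risk differences over the PS strata.
# Equal weights (1/K) estimate the ATE; weighting by the proportion of
# exposed subjects in each stratum estimates the ATT. This assumes
# every stratum contains both exposed and non-exposed subjects.
K = df["ps_stratum"].nunique()
n_exposed = (df["vacc"] == 1).sum()

ate = att = 0.0
for _, s in df.groupby("ps_stratum"):
    rd = (s.loc[s["vacc"] == 1, "disease"].mean()
          - s.loc[s["vacc"] == 0, "disease"].mean())
    ate += rd / K
    att += rd * (s["vacc"] == 1).sum() / n_exposed
print("ATE:", ate, "ATT:", att)
```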

6.5. Multivariable modeling using propensity scores

Propensity scores can also be used as an alternative to including individual covariates to control confounding in a multivariable model. In this method, we regress the outcome (simultaneously) on the exposure status and the PS (as a covariate) for each study subject. At this point the question arises: "does using a PS in a multivariable model do a better job than controlling for confounding by including all of the potential confounders directly in the model?" A number of studies have focused on this question and, in general, if there are fewer than seven outcome events per covariate, controlling confounding using a PS (included in the model as a categorical variable based on quintiles of the PS) is preferred. If there are eight or more outcome events per covariate, a logistic model with the original confounders might be the technique of choice. In any event, the more factors that we try to "control", the greater the value of using the PS approach. If interaction is present (i.e. the effect of treatment varies with the level of the PS), then the way in which the PS is used could have a large impact on the overall effect estimate (Kurth et al., 2006).

Perhaps the biggest benefit of the PS approach is that it changes the strategy of analysis. With PSs, we place our emphasis on getting the groups 'comparable' so that our subsequent comparison of the outcome frequency in each group is valid. The focus on comparability is not biased (or should not be) by knowledge of predictor–outcome associations. Thus, issues of necessary transformations, linearity, etc. should be decided a priori (before seeing the outcome). If the subsequent analysis indicates that different analytical approaches should be used, the difference between the a priori and a posteriori methods (after seeing the outcome) should be noted and discussed. Stuart (2008) discusses practical recommendations about the use of PSs; she stresses that checking for balance of covariates should include assessments of variances and interactions, not just assessments of equality of means.

Recently, Ertefaie and Stephens (2010) and Moodie and Stephens (2012) have shown how to use generalized PSs (GPSs) when the exposure is a continuous variable; they concluded that the GPS produced estimates with smaller variance than MSMs. In their approach the exposure variable was deemed to be continuous (although a dichotomous exposure can be used), for example if exposure-days was the variable of interest, or if the "exposure" was a measured variable such as drug dose. The GPS models estimate the causal dose–response relationship. Bia and Mattei (2008) provide an example of estimating causal effects using the GPS, in Stata. Jiang and Foster (2012) provide an example of using the GPS based on breastfeeding and childhood obesity.

Because PS models are not immune to mis-specification, a technique known as doubly robust estimation has been developed. Funk et al. (2011) provide an example of implementing this approach, and Waernbaum (2012) discusses it and compares matching to the use of PSs. In the doubly robust approach, essentially one develops two models: the exposure model (i.e. where the PS is estimated, as explained above) and the outcome model. In the latter, we regress Y on the covariates Z (within exposure categories) to obtain estimates of Y in exposed and unexposed subjects (Ŷ1 and Ŷ0). Then, the actual outcome (Y1 and Y0) in subjects in each exposure category is compared to the predicted outcome (weighted by a function of the PS) to estimate the doubly robust estimator in exposed and unexposed subjects (DR1 and DR0). The difference between DR1 and DR0 is used as the estimated causal effect. The benefit of this approach is that if at least one of the models (exposure or outcome) is specified correctly, the DR estimator will be unbiased (see Funk et al., 2011 for details). A sketch of this estimator is given below.
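Funk et al. (2011) give the estimator in closed form; a minimal sketch of that augmented (doubly robust) computation, using the hypothetical variables from the earlier sketches, might read:

```python
import statsmodels.formula.api as smf

# Outcome models fitted separately within each exposure group.
out1 = smf.logit("disease ~ age + herd_size + parity",
                 data=df[df["vacc"] == 1]).fit()
out0 = smf.logit("disease ~ age + herd_size + parity",
                 data=df[df["vacc"] == 0]).fit()
y1hat, y0hat = out1.predict(df), out0.predict(df)  # predicted risks, all subjects

# Combine observed outcomes and model predictions, weighted by the PS.
e, y, ps = df["vacc"], df["disease"], df["ps"]
dr1 = (e * y / ps - (e - ps) / ps * y1hat).mean()                     # exposed
dr0 = ((1 - e) * y / (1 - ps) + (e - ps) / (1 - ps) * y0hat).mean()   # unexposed
print("DR risk difference:", dr1 - dr0)
```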
6.6. Inverse probability of treatment (exposure) weights (IPTW) based on the propensity score

We will discuss IPTW shortly, but here we use the PS for the weights. We specify an indicator variable Zi denoting whether or not the ith subject was exposed (1 = yes; 0 = no). The IPTW weight (Wi) is then:

Wi = Zi/PSi + (1 − Zi)/(1 − PSi)

This weighting was originally proposed as model-based direct standardization, and it gives the same weights as the IPTW approach shown subsequently (see Linden and Adams, 2010 for a worked example). A short computational sketch follows.
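Continuing the hypothetical data frame from the earlier sketches, the weight and a weighted (pseudo-population) risk ratio might be computed as:

```python
import numpy as np

# IPTW weight from the PS: exposed subjects get 1/PS, non-exposed 1/(1 - PS).
df["w"] = df["vacc"] / df["ps"] + (1 - df["vacc"]) / (1 - df["ps"])

# Weighted outcome risks in the pseudo-population, and the marginal RR.
r1 = np.average(df.loc[df["vacc"] == 1, "disease"],
                weights=df.loc[df["vacc"] == 1, "w"])
r0 = np.average(df.loc[df["vacc"] == 0, "disease"],
                weights=df.loc[df["vacc"] == 0, "w"])
print("IPTW risk ratio:", r1 / r0)
```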

7. Other methods to control confounding

The next two methods are closely related (Hernan and Robins, 2006a), although one method uses standardization to estimate the expected number of cases (or risk), while the other, the MSM, uses "weights" to produce an un-confounded pseudo-population from which we can estimate the causal effect of interest using a simple (as in a 2 × 2 table) measure of association such as the RR (we could also use the RD or odds ratio as our effect measure). Both of these approaches provide a valid summary of the effect of exposure in a specified population, whether or not interaction (at the individual level) is present; the stratum-specific measures of association do not need to be homogeneous. Brumback et al. (2010) explain how to assess effect modification (interaction), while Brumback et al. (2004) discuss sensitivity analysis for unmeasured confounding in the MSM context. These two features—that the population of interest is specified and that the summary measure is valid in the presence of interaction—are key elements for choosing this approach to estimating causal effects, although this putative benefit has been debated (Shahar and Shahar, 2013).

7.1. Marginal structural models

In the MSM (Moodie and Stephens, 2011; Robins et al., 2000; Suarez et al., 2008) we do not need to condition on potential confounders; their effects are removed through the construction of a pseudo-population. The idea behind this approach is to describe and account for the distribution of 'exposure' (vaccination in my example in Table 3). In my example, I used variable C to represent the prior risk of infection to account for the distribution of exposure. I then used the traditional multivariable regression approach to "control" the confounder (thereby explaining the exposure) and obtained the correct causal effect of vaccination using stratification. Another method of achieving this 'control' is to create a pseudo-population by "weighting" the exposure (vaccination) groups. The first component of the weight is the probability of receiving the exposure status (i.e. V+ = E+ or V− = E−) that each subject actually received, conditional on the confounder information; that is, pE = p(E = e | Cj), with the observed exposure "e" taking the value 1 or 0 depending on whether the subject was exposed or not, and "j" representing the strata formed by different levels of the confounder (or combinations of confounders). The weight (WT) we assign to each subject is the inverse of this probability; i.e. WT = 1/pE. Note that there are two measures of WT: one for the exposed and one for the unexposed (see Table 6). The result is the IPTW estimator (Hogan and Lancaster, 2004; Hernan and Robins, 2006a; Cole and Hernan, 2008). VanderWeele and Vansteelandt (2011) discuss the use of this approach in case–control studies.

As in usual sampling theory, the sampling weights describe the number of subjects (nj) that each study subject represents. To create the pseudo-population we multiply IPTW × nj, as shown in Table 6. Because the covariates in the pseudo-population are independent of exposure, we can then form a single 2 × 2 table to assess the risk estimate. MSMs are "marginal" because they produce population-average effects, and "structural" because they describe causal, not associational, effects (Moodie and Stephens, 2011).

Determining the weights with multiple covariates that potentially are associated with exposure (the so-called "treatment model") is a crucial step, especially if exposure is a time-varying variable. When exposure (treatment) is dichotomous, either a logistic regression model or a proportional-hazards model can be used to estimate the conditional pEs and hence the weights. Mortimer et al. (2005) describe how they selected the predictors of exposure using 90% of their data and evaluated them on the remaining 10% (which they called the "test data"). First, they used the AIC to evaluate a series of treatment-predictor models containing different covariates and to select candidate models; developing and evaluating a number of "good" treatment models is deemed a better approach than selecting just one candidate model.

Then, because their disease outcome was measured as a continuous variable (asthma severity, based on peak expiratory flow rate (PEFR)), they used the residual sum of squares (RSS) on the test data to choose the better treatment model. In addition, because there were repeated measures of PEFR and other covariates over the study period, Mortimer et al. (2005) used a mixed model to account for the clustering (prior to determining the conditional pEs). Stabilized weights (marginal pE/conditional pE) were used to create the pseudo-population because they are more efficient than inverse conditional weights. Those authors' final model (with the lowest RSS on the test data) contained a more parsimonious set of covariates than the treatment model with the lowest AIC.

The IPTW estimate is a population-level measure in that it contrasts the outcome frequency if everyone in the study group were exposed versus the outcome if no one were exposed (see Hernan and Robins, 2006b for a worked example)—similar to what I used in the counterfactual example from Table 3. Bodnar et al. (2004) provide a worked example based on perinatal disease and death. Xiao et al. (2010) describe how to fit a MSM directly using weighted proportional-hazards models. Lange et al. (2012) describe how to identify direct and indirect effects using MSMs. Lefebvre et al. (2008) discuss the impact of MSM mis-specification. Platt et al. (2013) describe an information criterion for assessing the fit of MSMs; Pullenayegum et al. (2008) discuss how to use treatment–covariate associations in the re-weighted population to guide model fitting.

In using these weights to create the pseudo-populations, we are assuming that there is no unmeasured confounding and no confounding within the levels of the measured confounders; this produces exchangeability and allows us to estimate the causal effects. However, we need to remind ourselves that this assumption is not verifiable from the available observational data—it must be defended on other substantive grounds. This also is a good time to remind ourselves about deciding on the exact composition of the groups that we wish to compare before seeing the outcome data (as suggested by Rubin, 2007). Given that we cannot verify the exchangeability, it is crucial that we at least have a consensus about what constitutes "comparable groups" before potentially being biased by seeing the outcome data. As with the process of using PSs, MSMs are consistent with this focus, because most of the effort goes into getting the "treatment" model (and hence the weights) correct before assessing the impact of exposure on the outcome.

If there is censoring of study subjects, a censoring model (akin to the treatment model) must also be constructed; the weights for the pseudo-population are then the product of the treatment-model and censoring-model weights (see Hernan et al., 2000). Linden and Adams (2010) provide an example of combining PSs and MSMs for analyzing longitudinal data; they also provide the Stata code for producing stabilized weights. Newman (2006) shows how to extend the MSM approach to the analysis of case–control studies and demonstrates the relationship between the standardized odds ratio (where the stratum-specific odds ratio is weighted by b0j) and the Mantel–Haenszel odds ratio (where the stratum-specific odds ratio is weighted by nj) (Mantel and Haenszel, 1959).

Kurth et al. (2006) compared the results of standardizing, using IPTW, and using PSs in the analysis of a large dataset; some of the measures differed greatly. Reasons for these discrepancies were investigated and recommendations about the choice of analysis given.

MSMs are especially useful for longitudinal studies in which time-dependent confounding is present; standard regression techniques cannot adjust (correctly) for time-dependent confounding. A time-dependent confounder is a factor that confounds the subsequent treatment/exposure–outcome relationship and is itself affected by prior treatments/exposures. These confounders arise frequently in longitudinal studies with measurements of exposure and outcome at several time points (Ten Have and Joffe, 2012). Time-dependent confounders may also be intervening variables. The basic analytical approach is to create new weights in each observation period using the approach discussed earlier (Hernan et al., 2000; Yang and Joffe, 2012). Fewell et al. (2004) provide a worked example (with Stata code); their final model was assessed using pooled logistic regression (which was comparable to a proportional-hazards approach). Godin et al. (2012) provide a worked example based on RRs instead of odds ratios, because the former are collapsible. Moodie (2009) describes how to assess (using MSMs) the optimal treatment rule over a given time period (hence time-varying confounders) when the exposure (i.e. treatment) is measured on a continuous scale. Correct specification of the treatment model is necessary to obtain true causal estimators (Yang and Joffe, 2012). A sketch of constructing stabilized weights for a point exposure is given below.
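To make the weighting concrete, the following minimal sketch constructs stabilized weights for a single (point) exposure and fits a weighted log-risk model; the confounder column 'prior_risk' and all other names are hypothetical, and the Poisson family with a log link is used here only as one common device for estimating an RR (in practice, standard errors should be made robust to the weighting):

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Treatment model: probability of the observed exposure given the
# confounder C (here the hypothetical column 'prior_risk').
treat = smf.logit("vacc ~ prior_risk", data=df).fit()
p1 = treat.predict(df)                               # p(E = 1 | C)
p_cond = np.where(df["vacc"] == 1, p1, 1 - p1)       # p of exposure actually received

# Stabilized weight: marginal p(E = e) over conditional p(E = e | C).
p_marg = np.where(df["vacc"] == 1, df["vacc"].mean(), 1 - df["vacc"].mean())
df["sw"] = p_marg / p_cond

# MSM: weighted regression of the outcome on exposure alone in the
# pseudo-population; exp(coefficient) estimates the causal RR.
msm = smf.glm("disease ~ vacc", data=df, family=sm.families.Poisson(),
              freq_weights=df["sw"].to_numpy()).fit()
print("Causal RR estimate:", np.exp(msm.params["vacc"]))
```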

7.2. Using G-estimation to control time-dependent confounding

Joffe (2012) and Joffe et al. (2012) describe G-estimation (the letter G was used to represent a hypothetical survival curve by Robins (1989)), which also was designed to account for time-dependent confounding. G-estimation has been implemented in Stata (Sterne and Tilling, 2002) and its use also is described by Snowden et al. (2011). However, implementing G-estimation is not straightforward and will not be described here. Vansteelandt and Keiding (2011) discuss the relationship between G-estimation, standardization and other techniques to control confounding.

7.3. Standardization and marginal structural models

This approach uses indirect standardization methods to assess the association between exposure and disease. In a simple dataset, standardization and IPTW will give the same results; however, this is unlikely to be true with complex datasets (Hernan and Robins, 2006a). Recall that in a population, the overall outcome level (or frequency) is a weighted average of the stratum-specific risks (however the strata are defined); i.e. R = Σj pj rj, where pj is the proportion of the population in group level "j" (say, age) and rj is the risk of the outcome in that group. We can vary the selection of the grouping from the unexposed, to the exposed, to the total population. Usually, we would standardize the exposed population to ascertain whether that group experienced more or less disease than was predicted by the counterfactual risks from the unexposed group.

In my simple counterfactual example, the grouping is defined by the level of C, with p1 = p(C = 0) = 0.4 and p2 = p(C = 1) = 0.6 based on the total population. The counterfactual risks for the exposed are obtained from the non-exposed group and are 1/4 = 0.25 in the C = 0 subjects and 2/3 = 0.67 in the C = 1 subjects. Hence, the standardized risk in the exposed is SRE+ = 0.4 × 0.25 + 0.6 × 0.67 = 0.502. Dividing the observed risk by the standardized risk gives the proportionate increase in risk (i.e. the average causal effect) in the exposed (here the vaccinated) due to being exposed. Repeating this for the non-exposed, and using the counterfactual risks obtained from the exposed group, we obtain the standardized risk in the non-exposed, SRE− = 0.4 × 0.25 + 0.6 × 0.67 = 0.502. (Recall that there was no association—within the strata—in my example, so the standardized risks turn out to be the same for both exposure groups.) Dividing the observed risk by the standardized risk gives the proportionate change in risk that would have occurred in the non-vaccinated group if they had been vaccinated. Then, dividing SRE+ by SRE− we obtain the causal risk ratio = 0.502/0.502 = 1. This contrasts the risk if everyone in the population were exposed with the risk if everyone were unexposed, and is the same measure as was obtained from the IPTW method shown earlier. The standardization process is described further by Hernan and Robins (2006a), Newman (2006), and Sato and Matsuyama (2003).

7.4. Instrument variables to control confounding

One of the reasons that a field experiment (randomized controlled trial) can be "imperfect" is lack of compliance—that is, not all subjects randomized to treatment (Z+) complete the treatment, and some of the subjects randomized to the placebo (Z−) group may actually undergo the treatment. Hence, the difference (or ratio) in outcome between the assigned treatment and placebo groups does not estimate the true causal effect of the exposure; rather, it estimates the likely effect of the exposure (the intention-to-treat analysis) if it were to be introduced to that population. If data on compliance were available, we could use them to estimate the true causal effect of treatment among subjects who actually complied; however, we would be concerned about the effect of confounding variables (C; measured and unmeasured) which might account for the failure to comply with the assigned treatment and also have an impact on the outcome risk, and hence bias the measure of association. It turns out that we can estimate the true causal effect by using variable Z (the assigned, or intent-to-treat, group) as an instrument variable (IV). A valid IV (Z) must meet three requirements:

1. It has a direct causal effect on exposure (or actual treatment; E);
2. It is unrelated to the outcome (D) except through its association with the exposure; and,
3. It shares no common causes with the outcome.

The approach bypasses the need to adjust for confounders by estimating the true causal effect (TCE) (shown here on the difference scale) of the exposure based on the effects of the IV, as shown below:

TCE = [p(D+ | Z = 1) − p(D+ | Z = 0)] / [p(E+ | Z = 1) − p(E+ | Z = 0)]

D here is a dichotomous realization of Y. The numerator estimates the effect of exposure as randomized. The denominator reflects the association between the randomized exposure (Z) and the actual exposure (E). With perfect compliance, the denominator becomes 1 and the TCE of E on D becomes the same as the effect of Z on D (i.e. the ratio estimates the causal effect of the exposure among those that actually were exposed in comparison to those who were not exposed). As non-compliance increases, the denominator becomes smaller and inflates the quotient, so that the ratio consistently estimates the TCE. Most importantly, we do not have to correct for any potential confounders such as variable C. Given the concern over unmeasured confounders in observational studies, finding an IV would clearly be an advantage. However, finding a valid IV can be difficult. The situation is even more complex in that the direction of bias from the use of imperfect IVs is not intuitive (Hernan and Robins, 2006b). Also, Terza et al. (2008) caution researchers about using an IV that is adequate in a linear model, but then applying it in a non-linear model such as logistic regression.
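As a worked illustration of this estimator (often called the Wald estimator), with wholly hypothetical numbers:

```python
# Hypothetical compliance data for the IV (TCE) estimator.
p_d_z1, p_d_z0 = 0.15, 0.20   # outcome risk by assigned group (intention to treat)
p_e_z1, p_e_z0 = 0.60, 0.10   # proportion actually exposed, by assigned group

itt = p_d_z1 - p_d_z0          # -0.05: the diluted intention-to-treat difference
tce = itt / (p_e_z1 - p_e_z0)  # -0.10: the estimated effect among compliers
print(itt, tce)
```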

8. Non-quantitative approaches for improving causal inferences

So far in this paper I have focused on technical aids to improving causal inferences and the estimation of causal effects from observational data. I close out this paper with the admonition to all of us to ask better questions. Hernan (2005) described how our research questions/hypotheses for observational studies should be very specific, as if we were going to conduct an RCT. Hernan and Taubman (2008) reiterated this principle using hypotheses about obesity and survival. Harper and Strumpf (2012) indicate that the causal questions we ask need to be tailored to give policy-makers the information they need to know. Once the question is clarified, when considering the details of the study design it is very helpful to incorporate forward projection as part of the assessment of the proposed design. Forward projection is essentially the up-front application of a formal critical-appraisal process to the study design, as discussed by Elwood (2002). This process fits naturally with the use of PSs to balance the exposed and unexposed groups.

8.1. Causal criteria to aid causal inference

Having asked well-defined questions (having well-defined objectives), and having used the best possible design and analysis of the data (see above), it will prove very helpful to assess the results of the study formally, using a set of causal guidelines (these seek to bring uniformity to decisions about causation (Evans, 1995; Susser, 1995)). A serious assessment of the question, the study design and the results should be a forerunner of every causal inference. As Rothman et al. (2008) note, such an assessment provides "a road map through complicated territory". And, as Rothman and Greenland (2005) note, we should "avoid the temptation to use causal criteria simply to buttress pet theories at hand, and instead . . . focus on evaluating competing causal theories using crucial observations".

Doll (2002) describes the application of Sir Bradford Hill's guidelines of causation for epidemiologists. Hofler (2005a) chooses to interpret Hill's criteria in a counterfactual setting, a setting he had elaborated upon in an earlier paper (Hofler, 2005b). Lipton and Odegaard (2005) note the importance of clearly stating the methods used to arrive at the statistical association(s) between an exposure and disease. Lash (2007) accepts the utility of causal criteria, but warns that in many instances researchers underestimate the magnitude of systematic errors and uncertainties in their data and fail to recognize fully "countervailing external information". Thus, I recommend the accompanying paper by Dohoo (2013) on this topic. In addition, careful attention to the issues contained in STROBE is of great importance (von Elm et al., 2008). Shapiro (2008a,b,c) provides a recent summary of good study design and of the utility of guidelines for inferring causation. Ward (2009a,b) published an extensive review of the use of causal criteria; he claims that their application does not fully satisfy either deductive or inductive reasoning, but that it does provide a consistent basis for arriving at the best explanation of the statistical association. Cresswell et al. (2012) describe how they created quantitative scores for each criterion in attempting to judge the validity of claims that pesticide exposure was responsible for reducing the honey bee population. Earlier, Swaen and van Amelsvoort (2009) developed a process for formalizing the application of causal criteria (for assessing the extent to which each criterion was true); their formal weighting of the criteria used discriminant analysis to estimate the probability that the observed associations between an exposure and an outcome were causal. Weed (2000) discussed making causal inferences based on epidemiologic evidence. Recently, Constantine (2012) suggested using a Popperian approach to inference; that is, researchers should test theory-based models and assess the extent to which alternative hypotheses could explain the data.

At the outset, we must be clear about the context for inferring causation. As Rose (2001) stated, it is important to ask whether we are trying to identify causes of disease in individuals or causes of disease in populations. Indeed, with the expansion of molecular studies, the appropriate level at which to make causal inferences, and whether such inferences are valid across different levels of organization, remains open to debate. However, clear decisions about the appropriate level to use (think back to the objectives when choosing this) will guide the study design as well as improve the validity of inferences about causation. Recall De Vreese's (2009) assertion that epidemiological methods naturally lead to inferences at the population level.


8.2. Criteria for inferring causation

The following set of criteria for causation (modified by this author) can be applied at any level of organization (see Dohoo et al., 2009).

• Time sequence
• Strength of association
• Dose–response relationship
• Coherence or plausibility
• Consistency
• Study design and statistical issues
• Extent to which the results could be "caused" by factors other than the exposure of interest

9. Conclusions

Our ability to make valid causal inferences from observational data will be enhanced by asking better counterfactual-based questions; by improving study design through the use of forward projection and attention to the STROBE guidelines; by using newer technical methods (such as PSs and MSMs); and by formally applying causal criteria to the study results.

Conflict of interest

The author declares that he has no financial or professional conflict of interest that would impact on the contents of this paper.

Appendix A.

Table A.1
The frequency of four-factor combinations in a population of 10,000 subjects. The sufficient causes of CRD and resultant CRD case numbers are shown.

Factor combinations (a)                 Sufficient   Number in     Number of
RSV   STREP   Stressor   MyP            cause?       population    cases
1     1       1          1              1            13            13
1     1       1          0              1            74            74
1     1       0          1              1            24            24
1     1       0          0              1            138           138
1     0       1          1              1            118           118
1     0       1          0              1            669           669
1     0       0          1              0            219           0
1     0       0          0              0            1243          0
Subtotals                                            2500          1038
0     1       1          1              1            39            39
0     1       1          0              1            223           223
0     1       0          1              1            73            73
0     1       0          0              0            414           0
0     0       1          1              1            354           354
0     0       1          0              0            2008          0
0     0       0          1              0            658           0
0     0       0          0              0            3729          0
Subtotals                                            7500          690
Totals                                               10,000        1728

Note: If you collapse the table over the two (unobserved) factors (Stressor and MyP), the data shown in Table 2 (with decimal places rounded up) will be obtained.
(a) The sufficient cause results match the data from Table 1.

References

Aiello, A.E., Larson, E.L., 2002. Causal inference: the case of hygiene and health. Am. J. Infect. Control 30, 503–511. Austin, P.C., 2011a. A tutorial and case study in propensity score analysis: an application to estimating the effect of in-hospital smoking cessation counseling on mortality. Multivariate Behav. Res. 46, 119–151. Austin, P.C., 2011b. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav. Res. 46, 399–424. Becker, S.O., Ichino, A., 2002. Estimation of average treatment effects based on propensity scores. Stata J. 2, 358–377. Bia, M., Mattei, A., 2008. A Stata package for the estimation of the dose–response function through adjustment for the generalized propensity score. Stata J. 8, 354–373. Bingham, P., Verlander, N.Q., Cheal, M.J., 2004. John Snow, William Farr and the 1849 outbreak of cholera that affected London: a reworking of the data highlights the importance of the water supply. Public Health 118, 387–394. Bodnar, L.M., Davidian, M., Siega-Riz, A.M., Tsiatis, A.A., 2004. Marginal structural models for analyzing causal effects of time-dependent treatments: an application in perinatal epidemiology. Am. J. Epidemiol. 159, 926–934. Brumback, B.A., Bouldin, E.D., Zheng, H.W., Cannell, M.B., Andresen, E.M., 2010. Testing and estimating model-adjusted effect-measure modification using marginal structural models and complex survey data. Am. J. Epidemiol. 172, 1085–1091. Brumback, B.A., Hernan, M.A., Haneuse, S.J., Robins, J.M., 2004. Sensitivity analyses for unmeasured confounding assuming a marginal structural model for repeated measures. Stat. Med. 23, 749–767. Campaner, R., 2011. Mechanistic causality and counterfactualmanipulative causality: recent insights from philosophy of science. J. Epidemiol. Community Health 65, 1070–1074. Chen, P.L., Cole, S.R., Morrison, C.S., 2011. Hormonal contraception and HIV risk: evaluating marginal-structural-model assumptions. Epidemiology 22, 877–878. Chibuk, T.K., Robinson, J.L., Hartfield, D.S., 2010. Pediatric complicated pneumonia and pneumococcal serotype replacement: trends in


hospitalized children pre and post introduction of routine vaccination with pneumococcal conjugate vaccine (PCV7). Eur. J. Pediatr. 169, 1123–1128. Cole, S.R., Hernan, M.A., 2008. Constructing inverse probability weights for marginal structural models. Am. J. Epidemiol. 168, 656–664. Constantine, N.A., 2012. Regression analysis and causal inference: cause for concern? Perspect. Sex. Reprod. Health 44, 134–137. Cresswell, J.E., Desneux, N., vanEngelsdorp, D., 2012. Dietary traces of neonicotinoid pesticides as a cause of population declines in honey bees: an evaluation by Hill’s epidemiological criteria. Pest Manage. Sci. 68, 819–827. D’Agostino Jr., R.B., 1998. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Stat. Med. 17, 2265–2281. Davey Smith, G., Ebrahim, S., 2001. Epidemiology—is it time to call it a day? Int. J. Epidemiol. 30, 1–11. de Jonge, P., Conradi, H.J., Thombs, B.D., Rosmalen, J.G., Burger, H., Ormel, J., 2011. Prevention of false positive findings in observational studies: registration will not work but replication might. J. Epidemiol. Community Health 65, 95–96. De Vreese, L., 2009. Epidemiology and causation. Med. Health Care Philos. 12, 345–353. Dohoo, I.R., 2013. Bias—is it a problem, and what should we do? Prev. Vet. Med.. Dohoo, I., Martin, W., Stryhn, H., 2009. Veterinary Epidemiologic Research. Friesens, Manitoba, Canada. Dohoo, I., Martin, W., Stryhn, H., 2012. Methods in Epidemiologic Research. VER Inc., Box 491, Charlottetown, PEI, Canada. Doll, R., 2002. Proof of causality: deduction from epidemiological observation. Perspect. Biol. Med. 45, 499–515. Elwood, M., 2002. Forward projection—using critical appraisal in the design of studies. Int. J. Epidemiol. 31, 1071–1073. Ertefaie, A., Stephens, D.A., 2010. Comparing approaches to causal inference for longitudinal data: inverse probability weighting versus propensity scores. Int. J. Biostat. 6 (Article 14). Evans, A.S., 1995. Causation and disease: a chronological journey. The Thomas Parran Lecture. 1978. Am. J. Epidemiol. 142, 1126–1135 (discussion 1125). Fewell, Z., Hernan, M.A., Wolfe, F., Tilling, H., Choi, J.A., Stern, C., 2004. Controlling for time-dependent confounding using marginal structural models. Stata J. 4, 402–420. Flanders, W.D., 2006. On the relationship of sufficient component cause models with potential outcome (counterfactual) models. Eur. J. Epidemiol. 21, 847–853. Funk, M.J., Westreich, D., Wiesen, C., Sturmer, T., Brookhart, M.A., Davidian, M., 2011. Doubly robust estimation of causal effects. Am. J. Epidemiol. 173, 761–767. Godin, O., Elbejjani, M., Kaufman, J.S., 2012. Body mass index, blood pressure, and risk of depression in the elderly: a marginal structural model. Am. J. Epidemiol. 176, 204–213. Greenblatt, S., 2011. The Swerve: How the World Became Modern. W.W. Norton and Co., New York. Greenland, S., 2005. Epidemiologic measures and policy formulation: lessons from potential outcomes. Emerg. Themes Epidemiol. 2, 5–12. Greenland, S., 2007. Bayesian perspectives for epidemiological research. II. Regression analysis. Int. J. Epidemiol. 36, 195–202. Greenland, S., 2008. Invited commentary: variable selection versus shrinkage in the control of multiple confounders. Am. J. Epidemiol. 167, 523–529 (discussion 530–531). Greenland, S., Brumback, B., 2002. An overview of relations among causal modelling methods. Int. J. Epidemiol. 31, 1030–1037. Grimes, D.A., Schulz, K.F., 2002. 
Bias and causal associations in observational research. Lancet 359, 248–252. Harper, S., Strumpf, E.C., 2012. Social epidemiology: questionable answers and answerable questions. Epidemiology 23, 795–798. Hernan, M.A., 2004. A definition of causal effect for epidemiological research. J. Epidemiol. Community Health 58, 265–271. Hernan, M.A., 2005. Invited commentary: hypothetical interventions to define causal effects—afterthought or prerequisite? Am. J. Epidemiol. 162, 618–620 (discussion 621–622). Hernan, M.A., 2011. Epidemiologic studies, warts and all, are our best chance. Epidemiology 22, 636–637. Hernan, M.A., 2012. Beyond exchangeability: the other conditions for causal inference in medical research. Stat. Methods Med. Res. 21, 3–5. Hernan, M.A., Brumback, B., Robins, J.M., 2000. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIVpositive men. Epidemiology 11, 561–570.

Hernan, M.A., Cole, S.R., 2009. Invited commentary: causal diagrams and measurement bias. Am. J. Epidemiol 170, 959–962 (discussion 963–964). Hernan, M.A., Robins, J.M., 2006a. Estimating causal effects from epidemiological data. J. Epidemiol. Community Health 60, 578–586. Hernan, M.A., Robins, J.M., 2006b. Instruments for causal inference: an epidemiologist’s dream? Epidemiology 17, 360–372. Hernan, M.A., Taubman, S.L., 2008. Does obesity shorten life? The importance of well-defined interventions to answer causal questions. Int. J. Obes. (Lond.) 32 (Suppl. 3), S8–S14. Hofler, M., 2005a. The Bradford Hill considerations on causality: a counterfactual perspective. Emerg. Themes Epidemiol. 2, 11–20. Hofler, M., 2005b. Causal inference based on counterfactuals. BMC Med. Res. Methodol. 5, 28–40. Hogan, J.W., Lancaster, T., 2004. Instrumental variables and inverse probability weighting for causal inference from longitudinal observational studies. Stat. Methods Med. Res. 13, 17–48. Howards, P.P., Schisterman, E.F., Heagerty, P.J., 2007. Potential confounding by exposure history and prior outcomes: an example from perinatal epidemiology. Epidemiology 18, 544–551. Hullsiek, K.H., Louis, T.A., 2002. Propensity score modeling strategies for the causal analysis of observational data. Biostatistics 3, 179–193. Jiang, M., Foster, E.M., 2012. Duration of breastfeeding and childhood obesity: a generalized propensity score approach. Health Serv. Res. 48, 628–651. Joffe, M.M., 2012. Structural nested models, g-estimation, and the healthy worker effect: the promise (mostly unrealized) and the pitfalls. Epidemiology 23, 220–222. Joffe, M.M., Yang, W.P., Feldman, H., 2012. G-Estimation and artificial censoring: problems, challenges, and applications. Biometrics 68, 275–286. Kaplan, G.A., 2004. What’s wrong with social epidemiology, and how can we make it better? Epidemiol. Rev. 26, 124–135. Koch, T., 2008. John snow, hero of cholera: RIP. Can. Med. Assoc. J. 178, 1736. Korppi, M., Heiskanen-Kosma, T., Kleemola, M., 2003. Mycoplasma pneumoniae causes over 50% of community-acquired pneumonia in school-aged children. Scand. J. Infect. Dis. 35, 294. Kramer, M.S., Moodie, E.E., Platt, R.W., 2012. Infant feeding and growth: can we answer the causal question? Epidemiology 23, 790–794. Krieger, N., 1994. Epidemiology and the web of causation: has anyone seen the spider? Soc. Sci. Med. 39, 887–903. Kurth, T., Walker, A.M., Glynn, R.J., Chan, K.A., Gaziano, J.M., Berger, K., Robins, J.M., 2006. Results of multivariable logistic regression, propensity matching, propensity adjustment, and propensity-based weighting under conditions of nonuniform effect. Am. J. Epidemiol. 163, 262–270. Lange, T., Vansteelandt, S., Bekaert, M., 2012. A simple unified approach for estimating natural direct and indirect effects. Am. J. Epidemiol. 176, 190–195. Lash, T.L., 2007. Heuristic thinking and inference from observational epidemiology. Epidemiology 18, 67–72. Lefebvre, G., Delaney, J.A., Platt, R.W., 2008. Impact of mis-specification of the treatment model on estimates from a marginal structural model. Stat. Med. 27, 3629–3642. Linden, A., Adams, J.L., 2010. Evaluating health management programmes over time: application of propensity score-based weighting to longitudinal data. J. Eval. Clin. Pract. 16, 180–185. Lipton, R., Odegaard, T., 2005. Causal thinking and causal language in epidemiology: it’s in the details. Epidemiol. Perspect. Innov. 2, 8. Little, R.J., Rubin, D.B., 2000. 
Causal effects in clinical and epidemiological studies via potential outcomes: concepts and analytical approaches. Annu. Rev. Public Health 21, 121–145. Maldonado, G., Greenland, S., 2002. Estimating causal effects. Int. J. Epidemiol. 31, 422–429. Mansson, R., Joffe, M.M., Sun, W., Hennessy, S., 2007. On the estimation and use of propensity scores in case–control and case-cohort studies. Am. J. Epidemiol. 166, 332–339. Mantel, N., Haenszel, W., 1959. Statistical aspects of the analysis of data from retrospective studies of disease. J. Natl. Cancer Inst. 22, 719–748. McCaffrey, D.F., Ridgeway, G., Morral, A.R., 2004. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychol. Methods 9, 403–425. Moodie, E.E., 2009. Risk factor adjustment in marginal structural model estimation of optimal treatment regimes. Biom. J. 51, 774–788. Moodie, E.E., Kaufman, J.S., Platt, R.W., 2012. Special issue on causal inference in health research. Int. J. Biostat. 8 (2), 1–2.

Moodie, E.E., Stephens, D.A., 2010a. Special issue on causal inference. Int. J. Biostat. 6 (Article 1). Moodie, E.E., Stephens, D.A., 2010b. Using directed acyclic graphs to detect limitations of traditional regression in longitudinal studies. Int. J. Public Health 55, 701–703. Moodie, E.E., Stephens, D.A., 2011. Marginal structural models: unbiased estimation for longitudinal studies. Int. J. Public Health 56, 117–119. Moodie, E.E., Stephens, D.A., 2012. Estimation of dose–response functions for longitudinal data using the generalised propensity score. Stat. Methods Med. Res. 21, 149–166. Mortimer, K.M., Neugebauer, R., van der Laan, M., Tager, I.B., 2005. An application of model-fitting procedures for marginal structural models. Am. J. Epidemiol. 162, 382–388. Newman, S.C., 2006. Causal analysis of case–control data. Epidemiol. Perspect. Innov. 3, 2–8. Nichols, A., 2008. Erratum and discussion of propensity-score reweighting. Stata J. 8, 532–539. Pearl, J., 2010. An introduction to causal inference. Int. J. Biostat. 6 (Article 7-4679.1203). Platt, R.W., Alan Brookhart, M., Cole, S.R., Westreich, D., Schisterman, E.F., 2013. An information criterion for marginal structural models. Stat. Med. 32, 1383–1393. Pullenayegum, E.M., Lam, C., Manlhiot, C., Feldman, B.M., 2008. Fitting marginal structural models: estimating covariate-treatment associations in the reweighted data set can guide model fitting. J. Clin. Epidemiol. 61, 875–881. Rajakumar, K., 2000. Pellagra in the United States: a historical perspective. South. Med. J. 93, 272–277. Rickles, D., 2009. Causality in complex interventions. Med. Health Care Philos. 12, 77–90. Robins, J., 1989. The control of confounding by intermediate variables. Stat. Med. 8, 679–701. Robins, J.M., 2001. Data, design, and background knowledge in etiologic inference. Epidemiology 12, 313–320. Robins, J.M., Hernan, M.A., Brumback, B., 2000. Marginal structural models and causal inference in epidemiology. Epidemiology 11, 550–560. Rose, G., 2001. Sick individuals and sick populations. Int. J. Epidemiol. 30, 427–432 (discussion 433–434). Rothman, K.J., 1976. Causes. Am. J. Epidemiol. 104, 587–592. Rothman, K.J., Greenland, S., 2005. Causation and causal inference in epidemiology. Am. J. Public Health 95 (Suppl. 1), S144–S150. Rothman, K.J., Greenland, S., Lash, T., 2008. Modern Epidemiology. Lippincott, Philadelphia. Rubin, D.B., 1991. Practical implications of modes of statistical inference for causal effects and the critical role of the assignment mechanism. Biometrics 47, 1213–1234. Rubin, D.B., 2007. The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Stat. Med. 26, 20–36. Sato, T., Matsuyama, Y., 2003. Marginal structural models as a tool for standardization. Epidemiology 14, 680–686. Schonlau, M., 2005. Boosted regression (boosting): an introductory tutorial and a stata plugin. Stata J. 5, 330–354. Shahar, E., Shahar, D.J., 2013. Marginal structural models: much ado about (almost) nothing. J. Eval. Clin. Pract. 19, 214–222. Shapiro, S., 2008a. Causation, bias and confounding: a hitchhiker's guide to the epidemiological galaxy. Part 2. Principles of causality in epidemiological research: confounding, effect modification and strength of association. J. Fam. Plann. Reprod. Health Care 34, 185–190. Shapiro, S., 2008b. Causation, bias and confounding: a hitchhiker's guide to the epidemiological galaxy. Part 1.
Principles of causality in epidemiological research: time order, specification of the study base and specificity. J. Fam. Plann. Reprod. Health Care 34, 83–87. Shapiro, S., 2008c. Causation, bias and confounding: a hitchhiker’s guide to the epidemiological galaxy. Part 3. Principles of causality in epidemiological research: statistical stability, dose- and duration-response effects, internal and external consistency, analogy and biological plausibility. J. Fam. Plann. Reprod. Health Care 34, 261–264. Snowden, J.M., Rose, S., Mortimer, K.M., 2011. Implementation of Gcomputation on a simulated data set: demonstration of a causal inference technique. Am. J. Epidemiol. 173, 731–738.


Sterne, J.A.C., Tilling, K., 2002. G-Estimation of causal effects, allowing for time-varying confounding. Stata J. 2, 164–182. Stuart, E.A., 2008. Developing practical recommendations for the use of propensity scores: discussion of ‘A critical appraisal of propensity score matching in the medical literature between 1996 and 2003’ by Peter Austin. Stat. Med. 27, 2062–2065 (discussion 2066–2069). Suarez, D., Haro, J.M., Novick, D., Ochoa, S., 2008. Marginal structural models might overcome confounding when analyzing multiple treatment effects in observational studies. J. Clin. Epidemiol. 61, 525–530. Susser, M., 1991. What is a cause and how do we know one? A grammar for pragmatic epidemiology. Am. J. Epidemiol. 133, 635–648. Susser, M., 1995. Judgment and causal inference: criteria in epidemiologic studies. 1977. Am. J. Epidemiol. 141, 701–715 (discussion 699–700). Swaen, G., van Amelsvoort, L., 2009. A weight of evidence approach to causal inference. J. Clin. Epidemiol. 62, 270–277. Taubes, G., 1995. Epidemiology faces its limits. Science 269, 164–169. Ten Have, T.R., Joffe, M.M., 2012. A review of causal estimation of effects in mediation analyses. Stat. Methods Med. Res. 21, 77–107. Terza, J.V., Bradford, W.D., Dismuke, C.E., 2008. The use of linear instrumental variables methods in health services research and health economics: a cautionary note. Health Serv. Res. 43, 1102–1120. VanderWeele, T.J., Hernan, M.A., 2006. From counterfactuals to sufficient component causes and vice versa. Eur. J. Epidemiol. 21, 855–858. VanderWeele, T.J., Hernan, M.A., Robins, J.M., 2008. Causal directed acyclic graphs and the direction of unmeasured confounding bias. Epidemiology 19, 720–728. VanderWeele, T.J., Robins, J.M., 2007a. The identification of synergism in the sufficient-component-cause framework. Epidemiology 18, 329–339. VanderWeele, T.J., Robins, J.M., 2007b. Directed acyclic graphs, sufficient causes, and the properties of conditioning on a common effect. Am. J. Epidemiol. 166, 1096–1104. VanderWeele, T.J., Vansteelandt, S., 2010. Odds ratios for mediation analysis for a dichotomous outcome. Am. J. Epidemiol. 172, 1339–1348. VanderWeele, T.J., Vansteelandt, S., 2011. A weighting approach to causal effects and additive interaction in case–control studies: marginal structural linear odds models. Am. J. Epidemiol. 174, 1197–1203. Vansteelandt, S., 2012. Understanding counterfactual-based mediation analysis approaches and their differences. Epidemiology 23, 889–891. Vansteelandt, S., Bekaert, M., Claeskens, G., 2012. On model selection and model misspecification in causal inference. Stat. Methods Med. Res. 21, 7–30. Vansteelandt, S., Keiding, N., 2011. Invited commentary: Gcomputation—lost in translation? Am. J. Epidemiol. 173, 739–742. Vineis, P., Kriebel, D., 2006. Causal models in epidemiology: past inheritance and genetic future. Environ. Health 5, 21–31. von Elm, E., Altman, D.G., Egger, M., Pocock, S.J., Gotzsche, P.C., Vandenbroucke, J.P., STROBE Initiative, 2008. The strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies. J. Clin. Epidemiol. 61, 344–349. Waernbaum, I., 2012. Model misspecification and robustness in causal inference: comparing matching with doubly robust estimation. Stat. Med. 31, 1572–1581. Ward, A., 2009a. Causal criteria and the problem of complex causation. Med. Health Care Philos. 12, 333–343. Ward, A.C., 2009b. 
The role of causal criteria in causal inferences: Bradford Hill’s “aspects of association”. Epidemiol. Perspect. Innov. 6, 2–24. Weed, D.L., 2000. Epidemiologic evidence and causal inference. Hematol. Oncol. Clin. North Am. 14, 797–807. Weed, D.L., 2002. Environmental epidemiology: basics and proof of cause–effect. Toxicology 181/182, 399–403. White, P.A., 2001. Causal judgments about relations between multilevel variables. J. Exp. Psychol. Learn. Mem. Cogn. 27, 499–513. Xiao, Y., Abrahamowicz, M., Moodie, E.E., 2010. Accuracy of conventional and marginal structural Cox model estimators: a simulation study. Int. J. Biostat. 6 (Article 13). Yang, W., Joffe, M.M., 2012. Subtle issues in model specification and estimation of marginal structural models. Pharmacoepidemiol. Drug Saf. 21, 241–245.