Child welfare and the challenge of causal inference

E. Michael Foster a,⁎, Kimberly McCombs-Thornton b

a Department of Health Care Organization and Policy, School of Public Health, University of Alabama at Birmingham, Birmingham, AL, United States
b Department of Maternal and Child Health, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC, United States

Children and Youth Services Review 35 (2013) 1130–1142. doi:10.1016/j.childyouth.2011.03.012

Article info

Article history: Received 11 March 2011; Accepted 17 March 2011; Available online 26 March 2011.
Keywords: Causal inference; Propensity scores; Methodology.

Abstract

Causal inference refers to the assessment of cause-and-effect relationships in observational data—i.e., in situations where random assignment is impossible or impractical. Choices involving children in the child welfare system evoke core elements of causal inference—manipulation and the counterfactual. How would a child's circumstances differ if we changed her environment? This article begins with one mathematical approach to framing causal inference, the potential outcomes framework, a cornerstone of newer approaches to causal inference. This framework makes clear the identification problem inherent in causal inference and highlights a key assumption often used to identify the model (ignorability, or no unobserved confounding). The article then outlines the various approaches to causal inference and organizes them around whether they assume ignorability, as well as other key features of each approach. The article then provides guidelines for producing good causal inference. These guidelines emerge from a review of methodological literature as broad ranging as epidemiology, statistics, economics, and policy analysis. The steps are illustrated using an example from child welfare. The article concludes with suggestions for how the field could apply these newer methods. © 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Causal inference refers to the assessment of cause-and-effect relationships in observational data—i.e., in situations where random assignment is impossible or impractical. One might want to know, for example, the consequences of a risk factor, such as maternal substance use, for a child's development. Obviously one cannot randomly place children with mothers who are using alcohol or other drugs. Similar questions arise for key choices by child welfare system staff, such as whether to remove a child from her home. Presumably that choice is made based on the child's best interest in the eyes of the caseworker and others; to simply assign a child to another situation would be unethical. As a result, observational data represent the main source of information that can inform child welfare policy, and for that reason, causal inference lies at the heart of research on children and youth services.

Choices involving children in the child welfare system evoke core elements of causal inference—manipulation and the counterfactual (Pearl, 2000). How would a child's circumstances differ if we changed her environment? The fundamental challenge of causal inference is that at a point in time, we observe children only in one condition or circumstance.1 Causal inference is the question of "what if"—what if the child's circumstances were different?

⁎ Corresponding author. Tel.: +1 919 259 6466. E-mail address: [email protected] (E.M. Foster).
1 Economists refer to this challenge as the "evaluation problem" (Heckman & Vytlacil, 2007a). As this reference also notes, not all agree that random assignment is the gold standard nor that a potential cause need be manipulable.


In many cases, the question is: what if social policy changed those circumstances? More formally, the quantity we want to calculate is the difference between what we observe and the counterfactual we do not. We have two unknowns (what is and what might have been) and one known (what we observe); statisticians refer to such a situation as "underidentified."

How does one identify a model? Generally, identification requires either an assumption of some sort (e.g., the absence of confounding by an unobserved characteristic) or additional data (Heckman & Vytlacil, 2007b)—and usually some combination of the two. The latter often involves a comparison group of some sort; the former specifies the conditions under which the experiences of that group represent the counterfactual we need. For example, we might observe the treated entity prior to treatment, and that information is valuable. However, causal inference still requires an assumption about the other forces acting on the entity over time.

In the case of maternal substance abuse, a natural comparison group would involve the children of other women, and one could easily compute differences in outcomes between the two groups. Even a lay person would recognize that such a comparison is potentially misleading—children in the two groups differ in many ways, and addressing the needs of the substance-abusing mother would eliminate only some of those differences. The association is likely much larger than the true effect, and for that reason, every graduate student (hopefully) is taught that "association does not mean causation" (Pearl, 2009).


The number of poor examples of causal inference, however, indicates that this lesson is either not absorbed or not appreciated fully.2 Indeed, moving from association to causation is audacious: it is a venture into the unknown.3 Most statistical tools have mechanical properties that refine associations in one way or another. Regression, for example, is the orthogonal projection of an outcome onto the space spanned by the covariates. In non-technical terms, regression imposes the restriction that none of the covariates is correlated with the error term in the model, whether that is actually true or not. The fundamental problem of causal inference dictates that for the resulting estimate to be causal, other assumptions must be added—namely, that treatment is not determined simultaneously with the outcome and is not correlated with unobserved determinants of the outcome.4 Many of the key assumptions are untestable. (A reasonable rule of thumb is that the more essential an assumption is to causal inference, the less likely it is that one can test it. Regression, for example, often assumes a linear relationship between the covariates and the outcome. This assumption can be tested but is certainly not required for causal inference.)5

Causal inference may be hard, but the good news is that methodologists are developing new tools. This development is refining—and narrowing—the assumptions under which one can move from association to causation. It is also clarifying which assumptions embedded in our analytical techniques are required for causal inference and which serve some other purpose. In many instances, the latter can be relaxed through better empirical practice. For example, outliers may damage causal inference, but many regression diagnostics are available to identify such problems. The bottom line, though, is that a key assumption (ignorability, described below) remains even if the application of regression is improved.

An informal discussion highlights key issues, but a full treatment requires a mathematical presentation. Only such a presentation can provide the specificity needed. Such specificity makes it clear, for example, that some causal questions require assumptions that others do not: the assumption required to assess the effect of teenage childbearing on teen mothers differs from that required to assess the effect on other young women were they to make that choice.

This article begins with one mathematical approach to framing causal inference, the potential outcomes framework (Holland, 1986; Rubin, 2005). This framework is a cornerstone of newer approaches to causal inference; it makes clear the identification problem inherent in causal inference and highlights a key assumption often used to identify the model (ignorability, or no unobserved confounding). The article then outlines the various approaches to causal inference and organizes them around whether they make this assumption, as well as other key features of each approach.

2 The poor examples of causal inference are too many to count. For one example, compare Christakis, Zimmerman, DiGiuseppe, and McCarty (2004) with Foster and Watkins (2010), and see the discussion in Foster (2010a).
3 We recognize the difficulty of this assessment in our own lives—what if I had gone to graduate school in public health instead of economics, and so on—and we recognize how difficult that question is to answer. Why this question would be easier to answer about anonymous people about whom we have less information is difficult to fathom. In our personal lives we often recognize that the question is unanswerable: "Who knows?" That hesitation also should inform efforts to gauge causal relationships in observational data.
4 Heckman and Vytlacil (2007b) make this distinction in a different way. Regression conditions on a variable mechanically; whether it "fixes" it in a causal way—i.e., reveals its influence if manipulated—requires additional assumptions. This distinction is also apparent in Pearl's well-known treatment (Pearl, 2000). He represents the latter with the do() operator, which represents the effect of making the exposure determined by forces outside of the model.
5 This discussion oversimplifies matters somewhat. The importance of different assumptions also depends on the nature of the causal question. The literature in statistics focuses on the non-parametric estimation of the ATT, ATE, and ATU, and for that purpose, linearity (or a specific functional form) is not required. As Heckman and Vytlacil (2007a) note, this is only one of several causal questions related to policy issues. These questions include estimating the impact of a program involving combinations of treatments that have not yet been delivered.


The article then provides guidelines for producing good causal inference. These guidelines emerge from a review of methodological literature as broad ranging as epidemiology, statistics, economics, and policy analysis. The steps are illustrated using an example from child welfare. The article concludes with suggestions for how the field could apply these newer methods.

A brief word about terminology is in order before proceeding. If not already apparent, the term "treatment" is used very generally in this literature. It can refer to an actual treatment or service but also to exposures, conditions, or states. There may be very little about these conditions that is beneficial.

2. Potential outcomes framework

2.1. Basic mathematical language

This basic mathematical language provides specificity about what we really want to know and what we have to assume to estimate that quantity. In the potential outcomes framework, one is interested in a treatment (or exposure or characteristic) (D) and some outcome (Y). D = 1 or 0 for the treated and untreated groups, respectively. One can think of each individual as having two possible outcomes, Y1 (the outcome when treated, D = 1) and Y0 (the outcome when not treated, D = 0). One can thus characterize each individual by three variables, (Y1, Y0, D)—the outcome if treated, the outcome if not treated, and treatment status. The fundamental problem is that we do not observe (the joint distribution of) all three random variables. Rather, we observe D for each individual (was she treated or not) and either Y1 or Y0 for the treated (D = 1) and untreated (D = 0), respectively. One can write Yobs = D·Y1 + (1 − D)·Y0, where Yobs is the observed outcome. Economists refer to this relationship as the "observation rule."

We have omitted a person subscript to this point, but these outcomes differ across individuals, both because individuals differ in the treatment they receive and because they differ in other factors (differentiating even those who receive the same treatment). What we would like to know is the treatment effect—i.e., τi = Y1,i − Y0,i—for every individual i. Generally, we are unable to calculate this term for a single person. Can we get traction on this problem if we reduce our goal? What if we wanted to know only τ = E[Y1,i − Y0,i], the mean of this distribution of effects? The "E" denotes the expectations operator, or the average. When one writes E[Y|X = x] or E[Y|x], this notation refers to the average value of Y for a specific value or range (x) of X. (Following convention, random variables are denoted with upper-case letters and specific values of those variables with lower case.) Because we chose the mean and not some other characteristic of the distribution (such as the median), one can see that τ = E[Y1,i] − E[Y0,i].6

The problem is (still) that neither of the two terms is observed for the whole population. In particular, we observe not E[Y1,i] but E[Y1,i | Di = 1], and not E[Y0,i] but E[Y0,i | Di = 0]. One sees the outcome under treatment only for those treated, and the converse for untreated individuals. These quantities are related to the terms of interest—for example, the expectation of the outcome in the treated state for all persons in the sample:

E[Y1,i] = p(Di = 1)·E[Y1,i | Di = 1] + p(Di = 0)·E[Y1,i | Di = 0]    (1)

The average treated outcome for all individuals is the weighted average of the treated outcome for the treated and for the untreated, where the weights are the proportions of the sample that do and do not receive treatment. The second term, E[Y1,i | Di = 0], is unobserved—that expectation, and thus the expression as a whole, cannot be calculated without some additional assumption. Again, this is the fundamental problem of causal inference.

6 An added advantage of working with the expectation is that this estimand does not require knowledge of the correlation between Y0 and Y1. That correlation is required for the calculation of other potential treatment effects of interest, such as the proportion of individuals who benefit from treatment.


Similarly,

E[Y0,i] = p(Di = 1)·E[Y0,i | Di = 1] + p(Di = 0)·E[Y0,i | Di = 0]    (2)

Here the first term, E[Y0,i | Di = 1], is unobserved—we do not know the average untreated outcome for the individuals who received treatment. Given these two terms, one can rewrite τ as

τ = [p(Di = 1)·E[Y1,i | Di = 1] + p(Di = 0)·E[Y1,i | Di = 0]] − [p(Di = 1)·E[Y0,i | Di = 1] + p(Di = 0)·E[Y0,i | Di = 0]]    (3)

One can rewrite this term as

τ = p(Di = 1)·[E[Y1,i | Di = 1] − E[Y0,i | Di = 1]] + p(Di = 0)·[E[Y1,i | Di = 0] − E[Y0,i | Di = 0]]    (4)

The average treatment effect in Eq. (4) is the weighted sum of the treatment effect for the treated and for the untreated. We label the first bracketed term the "average effect of treatment for the treated" (ATT) and the second the "average effect of treatment on the untreated" (ATU); τ itself is the "average effect of treatment" (ATE). This situation is the identification problem noted in the Introduction: the ATE is "not identified." Eq. (4) represents one equation but includes two unknowns—it is not possible to generate a unique estimate of the treatment effect. This problem would exist no matter what the sample size; it is a problem of logic rather than an empirical problem per se. At this point, we have reached a dead end unless we can add additional data or assumptions of some sort—information that would allow us to identify the model.
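The logic of Eqs. (1)–(4) is easy to verify in a small simulation, where, unlike in real data, both potential outcomes can be generated for every individual. The following sketch uses hypothetical numbers; it is an illustration of the framework, not an analysis from the article.

```python
# A minimal simulation of the potential outcomes framework; all numbers are
# hypothetical and chosen only to make the identification problem visible.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

u = rng.normal(size=n)                         # unobserved factor driving both D and Y
d = (u + rng.normal(size=n) > 0).astype(int)   # treatment assignment, confounded by U

y0 = 10 + 2 * u + rng.normal(size=n)           # potential outcome if untreated
y1 = y0 + 1.0 + 0.5 * u                        # potential outcome if treated (heterogeneous effect)

y_obs = d * y1 + (1 - d) * y0                  # the observation rule

ate = (y1 - y0).mean()                         # knowable here only because we simulated both
att = (y1 - y0)[d == 1].mean()
atu = (y1 - y0)[d == 0].mean()
naive = y_obs[d == 1].mean() - y_obs[d == 0].mean()

print(f"ATE={ate:.2f}  ATT={att:.2f}  ATU={atu:.2f}  naive={naive:.2f}")
# ATE equals the weighted sum in Eq. (4): p(D=1)*ATT + p(D=0)*ATU.
# The naive contrast in observed means is far from all three estimands,
# because E[Y0|D=1] differs from E[Y0|D=0] by construction.
```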

2.2. Assumptions so far

Before proceeding, we want to make explicit three assumptions we have made implicitly so far. The first is the "stable unit treatment value assumption" (SUTVA) (Rubin, 1980). This assumption requires that an individual's counterfactual states (Y0,i and Y1,i) do not depend on the treatment status of other individuals. Note that in the math above there is no "interference" among individuals: individual i's outcome does not depend on the treatment received by person j. As discussed below, many problems in children's policy may not fit this assumption.7

The second assumption is "positivity." This assumption requires that the probability that a given individual receives each level of treatment is positive for every combination of treatment and covariates. This assumption eliminates illogical possibilities, such as men developing uterine cancer. As discussed below, this assumption has an empirical counterpart.8

A third assumption is "consistency" (Cole & Frangakis, 2008). (Economists label this policy invariance (Heckman & Vytlacil, 2007a).) This assumption implies that the outcome of treatment does not depend on the assignment mechanism. In a clinical trial in medicine, it means that whether an individual is randomized into taking a medication does not affect the benefits of that medication when real-world patients subsequently take the medication by choice. In child welfare, this issue is primarily a matter of whether the treatment has been sufficiently differentiated. For example, consistency requires that the returns to removing a child from her home not depend on the reason for that removal. If this assumption does not hold, one possible solution is to define different variations of treatment, such as "removed for abuse," "removed for neglect," and so on. As we discuss below, increased differentiation of the treatment has other implications for causal inference.

7 Relaxing this assumption is possible, but the difficulty is that the treatment for an individual is then defined by his or her own treatment and that of the relevant others who influence his or her outcome. Suppose each child is influenced by his or her own status and that of his or her 10 best friends. In that case, one would index the outcome with 10 additional subscripts, and one could define 2^(10+1) = 2048 potential outcomes. Generally, researchers try to summarize the exposure of individuals other than the subject in some way.
8 For example, a sample of professors may not include women who are economists. Women, of course, can be capable economists, but we may have to deal with not having any in a given sample. The percentage of full professors in economics departments who are female has risen from 6.5% in 1994 to 10.7% in 2010 (Fraumeni, 2011).

In outlining the phases of causal inference, Heckman identifies three distinct tasks (Heckman & Vytlacil, 2007a). To this point our discussion represents the first: "defining a set of hypotheticals or counterfactuals." The potential outcomes framework provides one way to specify these terms and frames the identification problem. A key feature of the literature emerges at this point. In specifying the counterfactuals, economists will generally write a model linking the outcomes to other factors that potentially determine the outcome, such as demographic characteristics. In some models, the link between the covariates and the outcome may differ between the treated and untreated. The model also will often include a model determining treatment itself. A key focus in specifying these elements is the nature and degree of interdependency between the determinants of the outcomes and of exposure.

In contrast, statisticians and biostatisticians often present no formal model of how treatment is determined. The benefits of such an approach include flexibility; in a sense, the approach is more general because it is not embedded in a broader set of assumptions. Economists raise a variety of objections—for example, that the lack of an explicit model prevents the researcher from incorporating well-established information about how the treatment is determined. Non-economists raise such complaints as well. Pearl (2000), for example, argues that the potential outcomes framework (as presented by statisticians) offers no guidance as to which covariates should be included in the empirical analysis. Pearl asserts that the directed acyclic graph (described below) fills this gap in the literature by helping the researcher identify covariates that should (and should not) be included in the analysis.

The next section addresses the second phase of causal inference, identification, which reflects the fundamental problem of causal inference.

3. Phase II: identification

Parameters can be identified in a variety of ways: by setting other parameters equal to a constant (such as zero), by distributional assumptions, by functional form (like linearity), or by other assumptions (such as ruling out some types of behavior). As discussed below, in causal inference these assumptions often involve the determinants of treatment and of the outcome and the relationship between the two. We now consider strategies for identification before turning to task 3, estimation.

By far the most common method for identifying causal models involves a form of so-called ignorability. Ignorability refers to the assumption that the counterfactuals Y0,i and Y1,i do not influence exposure. This assumption implies the following:

(a) E[Y1,i | Di = 1] = E[Y1,i | Di = 0]
(b) E[Y0,i | Di = 0] = E[Y0,i | Di = 1]    (5)

The first term in equation (a) is the outcome under treatment for children actually placed out of home, and so on. The key is that the right-hand side of each equation is not observed, while the left-hand side can be calculated from the data. Equation (a) is required to calculate the ATU; (b), the ATT; and both are required for the ATE. This equation illustrates the value of random assignment: it assures that the condition holds. Under random assignment, ignorability is a feature of the design rather than an assumption made by the researcher.

In observational studies, ignorability in this simple form is often demonstrably false. Treatment assignment depends on a range of variables that likely influence both Y0 and Y1. Ignorability, therefore, takes the following conditional form:

(a) E[Y1,i | Di = 1, Xi = x] = E[Y1,i | Di = 0, Xi = x]
(b) E[Y0,i | Di = 0, Xi = x] = E[Y0,i | Di = 1, Xi = x]    (6)

Conditioning on Xi refers to the calculation of the expectation for a subset of the data sharing a specific value of the covariates (x) (e.g., for African-American children only). This assumption means that, within strata defined by (specific values of) the covariates X, treatment is as if randomly assigned. In other words, the counterfactual for the treated can be represented by the experience of untreated individuals who share the same profile of X (within the same stratum). As discussed below, estimation of treatment effects then proceeds by calculating a treatment effect for each stratum and combining those estimates across strata. One might write the expectation as a linear function of the covariates and an error term, and in that case, the plausibility of ignorability involves the latter (the error term). One might refer to ignorability here as conditional ignorability; this framing shifts the focus to the unobserved determinants of the outcome. For that reason, ignorability is also labeled "no unobserved confounding" by economists. One advantage of Eq. (6) is that it frames the identification problem prior to any choice about how to write the expectation; that choice is really a matter of estimation (task 3).
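To make the conditional form of ignorability concrete, the following sketch (simulated, hypothetical data) stratifies on a single discrete confounder, computes the treated-untreated contrast within each stratum, and combines the contrasts with ATE weights:

```python
# Estimation under conditional ignorability (Eq. 6) by exact stratification;
# the data are simulated so that treatment is as if randomized within strata of X.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.integers(0, 3, size=n)                       # a discrete confounder with three strata
d = rng.binomial(1, np.array([0.2, 0.5, 0.8])[x])    # treatment probability depends on X
y0 = 5.0 * x + rng.normal(size=n)                    # and so does the untreated outcome
y = np.where(d == 1, y0 + 2.0, y0)                   # true effect is 2.0 in every stratum

ate_hat = 0.0
for s in range(3):
    m = x == s
    contrast = y[m & (d == 1)].mean() - y[m & (d == 0)].mean()
    ate_hat += m.mean() * contrast                   # weight by the stratum's share of the sample
    # (weighting instead by the stratum's share of the treated would target the ATT)

naive = y[d == 1].mean() - y[d == 0].mean()
print(f"stratified: {ate_hat:.2f}   naive: {naive:.2f}   (true effect 2.0)")
# The naive contrast is roughly 6 here; stratifying on X recovers 2.0.
```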


Are there alternatives to ignorability? One strategy is to weaken the form of ignorability required. This alternative straddles the boundary between identification and estimation (task 3, discussed below); it involves repeated observations on the unit of interest. For example, one might be interested in the effect of caseworkers' judgment on individuals' movement through the system. One potential source of confounding might be agency-level factors that are correlated with caseworker characteristics (due, for example, to hiring practices). By comparing children with different caseworkers within the same agency, one effectively controls for agency-level confounding, both measured and unmeasured. (The stratum is the agency.) The form of ignorability now required is weaker: treatment need only be free of within-agency unobserved confounding. This strategy is conceptual but also depends on the data available; if one's data include only one child per agency, estimation is obviously not possible.

What are other, even more general alternatives to ignorability? One might feel, for example, that even within the strata (here, the family), the treatment is confounded by unobservables: the teen mother may have been less capable than her sister. All else may not be equal. Perhaps best known among the alternatives are "selection" or Heckman models. These models recognize that treatment assignment may depend on unobserved factors that also influence the potential outcomes. The key to this approach is to add enough structure to the relationships involved to remove the impact of the unobserved confounding variable, U. In essence, the relevant terms are of the form E[Y1,i | Di = 1, Ui = u′, Xi = x] and E[Y0,i | Di = 0, Ui = u″, Xi = x]. One cannot establish something like ignorability because these terms involve different values of the unobserved characteristic, which we obviously do not measure. If we use the latter as the counterfactual for a treated individual, we confound the influence of the treatment with that of U. This is unobserved confounding.

The selection model essentially puts the two expectations on the same basis by adjusting each for the differences in U. This task is accomplished not by measuring U but by making an assumption about the distribution of U and calculating expectations of expectations:9

E[Y0,i | Di = 0, Xi = x] = EUi[E[Y0,i | Di = 0, Xi = x, Ui = u]]    (7)

In essence, this method averages out the potential values of Ui. While we do not observe Ui, we can calculate the conditional expectations (inside the right-hand side of Eq. (7)) and average across them (the outer expectation). This produces the unconditional expectation. We can obtain terms like this for different potential outcomes and levels of treatment. We can then substitute observed terms (the left-hand side of Eq. (7)) for unobserved expectations (such as E[Y0,i | Di = 1, Xi = x]) because they are exchangeable once U's influence is removed.10

The identification problem here involves the ability to distinguish the true effect of the treatment (D) from the confounding influence of U, a joint determinant of the treatment and the outcome. Identification depends on the presumed distribution of the Ui. Identification is improved in circumstances where we have an exclusion restriction or an instrumental variable—a variable that influences the outcome only by influencing the treatment; its effect on the outcome is totally mediated by the treatment.11 The instrumental variable helps with identification in that the outcomes we observe reflect two processes: that determining the outcome (based on treatment and the X variables) and that determining treatment status. By observing different values of the instrumental variable (at the same levels of the other covariates), one can better distinguish the two processes.12 Within those strata, different values of the instrument essentially manipulate treatment assignment without changing the expectation of the outcome directly.

9 The reader likely has done this many times with observed variables. For example, if one has the mean height for boys and for girls (two conditional expectations), one can calculate the unconditional expectation of height by averaging the gender-specific averages. One could write this procedure as EG[E[Hi | Gi = g]], where H and G are height and gender, respectively. G might be a variable with many values, and the key features for "averaging out" this variable are the distribution of G in the population and a link between the variable and the outcome.
10 The reader may be familiar with the Heckman two-step correction. In essence, this model calculates an additional term, called the inverse Mills ratio, such that E[Y0,i | Di = 0, Xi = x, Ui = u] = E[Y0,i | Di = 0, Xi = x] + an additional term that is a function of u, D, X, and perhaps an instrumental variable (as discussed below). The confounding influence of U is captured by the additional term—U is now held constant when we compare individuals who received different treatments.
11 Economists have identified narrower assumptions for identifying these models—that is, one need not assume a distribution for the unobservable, even in the absence of an instrument. These methods are semi-parametric and still depend on untestable assumptions, albeit weaker ones (Andrews & Schafgans, 1998; Heckman & Vytlacil, 2007b).
12 An old example in economics involves the price and quantity of a good. These two outcomes are determined simultaneously. The pattern of prices and quantities that we observe across units (e.g., cities) reflects both supply of and demand for the good. If one wanted to separate the two processes, one would need a determinant of each process that influenced only that process (weather influencing only supply; family income influencing only demand). The problem is the same in causal inference—participation and the outcome are simultaneously determined, and we want to identify the effect of the former on the latter (allowing for the effect of the latter on the former).

4. Phase III: estimation

Heckman's third phase of causal inference involves estimation. Estimation requires calculating expectations of the outcome (or averaging across observations) for different exposures (treated or not) and for different levels of the covariates. The key task is estimating a counterfactual outcome. How this is done depends on the assumptions used to identify the model and on other methodological considerations (and convention). Fig. 1 offers a taxonomy of alternative estimation strategies given how the model is identified.


Fig. 1. Relationship Among Alternative Methods of Causal Inference.

The figure differentiates two key aspects of causal inference that can be used to organize alternative methods. The first involves identification. As discussed to this point, a key strategy for identification is the assumption of ignorability, or no unobserved confounding. One can see that regression, matching, and propensity-score methods all make this assumption; in that sense, they are much more similar to one another than they are to alternatives like instrumental variables estimation. A key step in the proper use of ignorability-based methods involves the choice of covariates.13 Once the covariates are chosen, the differences among the various methods emerge—key among them whether to use all covariates or to summarize them using the propensity score.

The large right branch of the figure groups methods that do not assume ignorability in its fullest form. These methods blend the tasks of identification and estimation. As discussed in more detail below, the effect identified is partly determined by the type of data available for estimation. Instrumental variables estimation, for example, is identified by an exclusion restriction, and that restriction also determines the parameter estimated.

13 These choices are also important for methods like instrumental variables, and arguably this element of the diagram should span both of the main branches.

4.1. Under ignorability

Methods based on ignorability generally share two steps: estimating the difference between the exposed and their comparators within strata or matched sets of some sort, and combining those estimates across strata. Ordinary regression combines these steps seamlessly but is itself a form of matching (Morgan & Winship, 2007). Regression accomplishes these tasks by typically specifying key relationships as linear—between the covariates (and treatment) and the outcome. Perhaps less commonly recognized is that regression also specifies linear relationships among the covariates themselves; most salient here, it specifies a linear relationship between the covariates and the treatment. This structure improves efficiency (lowers standard errors) if correct but produces bias and other statistical problems if false. Linearity is not essential to causal inference (at least in the simple case considered here, involving a treated and a comparison group).

Regression has other key features shaping its use in causal inference. In particular, the estimated effect of treatment is none of the three effects identified above (ATT, ATU, or ATE).

Rather, the estimated effect is calculated by combining estimates across strata in a way that reflects both the distribution of observations across the strata and the distribution of treatment within strata: strata where treatment is distributed more evenly are weighted more heavily. Doing so serves the purpose for which regression was designed—explaining the variation in the outcome—but it has some peculiar implications. Strata where the exposure is very common (or very rare) have relatively little impact on the overall estimate. For example, in an analysis of the effect of teenage childbearing, the effect for poor black women living in poor neighborhoods has relatively little impact on the overall estimate. While they may differ in implementation, matching estimators generally work in the same manner (Zhao, 2004). The principal drawback to matching arises when there are many covariates: no pair of cases will match exactly on every covariate, and the analyst must define some type of distance function—some measure of how similar cases are—and decide how close is "close enough."

The propensity score (the predicted probability of exposure) has become increasingly common in recent years, but in terms of identification and estimation, propensity-score methods are very closely related to ordinary regression (Dehejia & Wahba, 1998; Imbens, 1999; Rosenbaum & Rubin, 1983). The main advantage of the propensity score is that it conveniently summarizes the covariates—it represents a weighted combination of the covariates in which the weights reflect their potential to confound. The analyst can then work with one variable rather than many, which has advantages for some procedures. This reduction makes matching much more feasible and explains why matching estimators have increased in popularity (Zhao, 2004). The key property of the propensity score is that—in large samples—it "balances" the covariates: conditioning on the propensity score, any remaining variation in the covariates is unrelated to treatment.14 The covariates included in the propensity-score model thus lose their ability to confound treatment-comparison differences.

14 As Zhao (2004) discusses, this seemingly reduces demands on the data; one can work with a single variable rather than K covariates. However, the propensity score also assumes that the relationship between the covariates and the outcome is characterized by a continuous function and that any relationship between the covariates and the outcomes (conditional on the propensity score) balances as well. In that sense there is no free statistical lunch; one cannot simply move from K covariates to one covariate without making an additional assumption.
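A minimal propensity-score workflow, again on simulated, hypothetical data, might fit a logistic model for exposure, form subclasses on the estimated score, and combine the subclass contrasts. The scikit-learn call below is only one of several ways to fit the score model, and five subclasses is a convention, not a requirement:

```python
# A sketch of propensity-score subclassification, assuming ignorability given X.
# In practice the score model and the number of subclasses are specification
# choices, and covariate balance should be checked within subclasses.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5_000
X = rng.normal(size=(n, 4))                                  # observed covariates
d = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([0.8, -0.5, 0.3, 0.0])))))
y = 1.5 * d + X @ np.array([1.0, 1.0, -0.5, 0.2]) + rng.normal(size=n)

ps = LogisticRegression().fit(X, d).predict_proba(X)[:, 1]   # estimated propensity score

# Five subclasses on the estimated score; combine contrasts with ATE weights.
stratum = np.digitize(ps, np.quantile(ps, [0.2, 0.4, 0.6, 0.8]))
ate_hat = sum(
    (stratum == s).mean()
    * (y[(stratum == s) & (d == 1)].mean() - y[(stratum == s) & (d == 0)].mean())
    for s in range(5)
)
print(f"subclassification estimate: {ate_hat:.2f}   (true effect 1.5)")
# The estimate should land near 1.5; subclassification removes most, though
# not all, of the overt bias from the observed covariates.
```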


The bottom line, however, is that propensity-score methods and regression depend on the same critical assumption of ignorability—i.e., no unobserved confounding. Whether propensity-score methods adjust for observed covariates more effectively than regression depends on a variety of practical issues, such as model specification. In the end, a well-specified regression model should produce results similar to those from a model based on propensity scores. Comparing results from matching and regression estimators may illuminate key issues of model specification; it provides, however, no information on the validity of the key ignorability assumption.

The propensity score does have value as a diagnostic. For example, one can examine the distribution of the propensity scores and identify scores near 1 or 0. These observations are essentially unlike any in the other group, and one should think carefully about how to handle them. Regression handles such observations behind the scenes, essentially extrapolating values of the outcome across ranges of the explanatory variables where all observations are either treated or untreated. (This is the so-called "support problem" (Heckman, Ichimura, & Todd, 1998).) As a result, the treatment effect may be rather sensitive to the model's functional form. In a more flexible specification (such as a saturated model), the consequences differ: the severely imbalanced strata contribute virtually nothing to the overall estimate, as discussed above.
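One simple diagnostic along these lines—a sketch only, with trimming thresholds that are conventional rather than canonical—flags observations outside the region of common support:

```python
# A sketch of a common-support check: compare estimated propensity scores
# across groups and flag observations outside the overlap region.
import numpy as np

def common_support(ps, d, lo=0.05, hi=0.95):
    """Return a mask for observations whose scores lie in the overlap region."""
    # Overlap region: between the larger of the two group minimums and the
    # smaller of the two group maximums, further trimmed near 0 and 1.
    lower = max(ps[d == 1].min(), ps[d == 0].min(), lo)
    upper = min(ps[d == 1].max(), ps[d == 0].max(), hi)
    return (ps >= lower) & (ps <= upper)

# Usage, with the ps and d arrays from the preceding sketch:
# keep = common_support(ps, d)
# print(f"dropped {100 * (1 - keep.mean()):.1f}% of cases outside common support")
```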

4.2. Without ignorability

As the figure makes apparent, relaxing the ignorability assumption takes the analyst down a rather different branch of methodology. Estimation without ignorability involves recognizing the potential for unobserved confounding and then putting enough structure on its form to "remove" it—i.e., to adjust comparisons of the treated and comparison cases for this confounding, leaving the difference in the adjusted means to represent the true effect of the exposure, net of confounding by both observed and unobserved variables. In practice, these strategies involve specifying a likelihood function for both treatment and the outcome; the shared component of these choices—an unobserved factor (or heterogeneity) influencing both—represents the confounding.

One strategy is to remove the influence of the unobserved confounding by "conditioning" on it, which means one essentially discards the portion of the likelihood that includes this component. In the case of sister comparisons, this shared component influences only variation across families; the model avoids its confounding influence by calculating the effects of exposure only within families.15

15 Another method of "adjusting for" or "controlling for" the unobserved confounder is to "first-difference" the data—one regresses the difference in the outcome between sisters on their difference in teen childbearing status. This approach is intuitive and reveals the importance of families where the sisters differ in their teenage childbearing status. However, differencing does not generalize to analyses of categorical outcomes (Chamberlain, 1980).

Another strategy involves instrumental variables, for which there are at least two estimation approaches. The first involves enhancing the selection model: the likelihood incorporates the instrumental variable as a determinant of treatment (only), adding leverage to the model's ability to distinguish selection into treatment from the consequences of treatment.
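A bare-bones version of the two-step correction described in footnote 10 might look as follows. The data are simulated and hypothetical; the probit and correction terms follow the standard textbook recipe, with z playing the role of the instrument:

```python
# A minimal two-step selection ("Heckman-style") sketch: a probit for treatment
# yields correction terms that absorb the unobserved confounder U, so a
# second-stage OLS can recover the treatment effect. Simulated data only.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 20_000
x = rng.normal(size=n)
z = rng.normal(size=n)                          # instrument: shifts D but not Y directly
u = rng.normal(size=n)                          # unobserved confounder
d = (0.5 * x + z + u > 0).astype(int)           # selection into treatment
y = 1.0 + 2.0 * d + x + 1.5 * u + rng.normal(size=n)   # true effect of D: 2.0

# Step 1: probit for D; form the correction (inverse Mills ratio) terms.
W = sm.add_constant(np.column_stack([x, z]))
gamma = sm.Probit(d, W).fit(disp=False).params
idx = W @ gamma
lam = np.where(d == 1, norm.pdf(idx) / norm.cdf(idx),
               -norm.pdf(idx) / (1 - norm.cdf(idx)))

# Step 2: OLS of Y on X, D, and the correction term. The coefficient on D
# should be close to 2.0; note the naive standard errors are not valid and
# would require further adjustment in real work.
res = sm.OLS(y, sm.add_constant(np.column_stack([x, d, lam]))).fit()
print(res.params)
```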


Instrumental variables can also be used outside the selection model. In this case, estimation proceeds essentially by replacing one troublesome comparison (that of outcomes for the treated and untreated observations) with two others: comparisons of the levels of the outcome, and of the levels of treatment, across different values of the instrumental variable (Foster & McLanahan, 1996; Tan, 2006; Wald, 1940).

A well-known example in economics involves the draft lottery. Economists have long been interested in the effect of military service on the future earnings of those who served. Obviously, for these individuals, earnings had they not served are unknown. Furthermore, comparisons with similar individuals who did not serve are likely confounded by a range of unobserved factors. Draft lottery numbers, however, represent a potential instrumental variable—individuals with lower numbers were much more likely to serve, yet the numbers themselves have no plausible effect on the outcome (Angrist, 1990; Angrist & Krueger, 2001). Simplifying matters, suppose draft numbers were either high or low. In that case, estimation involves the comparison of earnings between individuals with high and low draft numbers, adjusted for the difference in military service between the two groups. At no point does the researcher directly compare earnings between those who did and did not serve.

The principal limitation of this approach is that the estimated treatment effect is neither the ATE, the ATT, nor the ATU. This so-called LATE (local average treatment effect) represents the effect of the exposure for those individuals whose treatment status would change in response to a change in the value of the instrumental variable (Imbens & Angrist, 1994). One cannot identify these individuals specifically; they are those persons whose treatment status is itself a counterfactual determined by the level of the instrumental variable (Deaton, 2009). In the case of the draft lottery, the LATE is the effect of service for individuals with a high number who did not serve but who would have served had they had a low number.16

16 Estimation does depend on other assumptions, including so-called monotonicity. This assumption requires that moving from a high draft number to a low draft number not decrease any individual's chance of serving.
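In the simplest binary case, this logic reduces to the Wald estimator. A simulated sketch (hypothetical numbers, in the spirit of the lottery example) makes the contrast with the naive comparison explicit:

```python
# A sketch of the Wald instrumental-variable estimator: compare outcomes
# across values of the instrument, never directly across treatment groups.
# Simulated, hypothetical data.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
u = rng.normal(size=n)                              # unobserved confounder
z = rng.binomial(1, 0.5, size=n)                    # instrument ("low draft number")
d = (1.2 * z + u + rng.normal(size=n) > 0.6).astype(int)   # service, confounded by U
y = 1.0 * d - 2.0 * u + rng.normal(size=n)          # true effect of service: 1.0

naive = y[d == 1].mean() - y[d == 0].mean()         # badly biased by U
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print(f"naive: {naive:.2f}   Wald IV: {wald:.2f}   (true effect 1.0)")
# With heterogeneous effects the Wald ratio recovers the LATE: the effect for
# those whose service status is changed by their lottery number.
```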
The regression discontinuity design has received a great deal of attention in research on social policy in recent years, especially in education research (Ludwig & Miller, 2007). It is closely related to instrumental variables estimation (Angrist & Krueger, 2001). The basic idea is that treatment status depends on some threshold value of a variable. The variable itself may influence the outcome; the instrument is the discontinuity itself (Hahn, Todd, & Van Der Klaauw, 2001). Angrist and Lavy (1999), for example, examine the relationship between class size and student performance. Of course, comparisons of students in different-size classes likely confound the effects of a variety of factors, such as the affluence of the surrounding community. The authors use data from Israel, where a new class is formed when class size would exceed 40 (or a multiple of 40). This rule induces a discontinuity in class size that is presumably unrelated (directly) to student performance. Consider two children, each in a class of 40 students. The two children are otherwise similar, but a new student enters one classroom. That classroom is then split in half, reducing class size for reasons that are essentially random (that is, not reflecting unmeasured factors that also influence the outcomes of interest). Comparison of the two children now reveals the effect of class size (Angrist & Lavy, 1999). The rule is sketched below.
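In Angrist and Lavy's application, the rule (a version of Maimonides' rule) maps enrollment into predicted class size; the following sketch shows how a one-student change in enrollment produces the discontinuity:

```python
# Maimonides' rule as used by Angrist and Lavy (1999): predicted class size as
# a function of enrollment, with discontinuities at multiples of 40.
def predicted_class_size(enrollment: int) -> float:
    """Average class size implied by the rule that no class may exceed 40."""
    n_classes = (enrollment - 1) // 40 + 1
    return enrollment / n_classes

for e in (40, 41, 80, 81):
    print(e, round(predicted_class_size(e), 1))
# 40 -> 40.0, 41 -> 20.5, 80 -> 40.0, 81 -> 27.0: a one-student change in
# enrollment shifts predicted class size sharply, supplying the discontinuity.
```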


5. Guidelines for causal inference research

Good causal inference involves thoughtful application of a series of steps. The process begins with a careful definition of the treatment or exposure.

5.1. Step 1a: carefully define the treatment

At first glance, defining the treatment of interest appears straightforward: a person who participates in a program receives the treatment, and those not in the program do not. Upon further consideration, questions quickly surface, such as: how much of a program does a person need to receive to be counted as treated? These questions involve the heterogeneity of the treatment and the applicability of the consistency assumption (Cole & Frangakis, 2008).

Child welfare has its own examples. Suppose we wanted to assess the effect of out-of-home placement on child behavior. First, we would need to define out-of-home placement. Will this include only children placed through the child welfare system, or will it also include situations in which a parent voluntarily places a child with a relative? If we include the latter, does the reason for placement matter to the definition—military deployment, say, versus long-term residential substance abuse treatment? On the other hand, if the study focuses solely on children placed through the child welfare system, will it include those placed in traditional foster care as well as those placed in kinship care? If kinship care is included, how will the study handle cases in which the child is placed with a relative but the birth parent remains in the same home as the relative and child? The consistency assumption requires either that this type of heterogeneity be reflected in the definition of exposure or that the description of the resulting estimates clearly state that the effect estimated is a mix of actual effects.

Note that the implications of the definition of exposure ripple through the steps that follow. Better differentiating the varieties of treatment would offer the advantage of a treatment effect more applicable to a child in a given set of circumstances. However, the more we differentiate the treatment, the more choices by relevant agents are involved; increased differentiation increases the number of mechanisms determining treatment. If the treatment is defined as "long-term" out-of-home placement, the exposure is driven by the mechanisms shaping both entry into and exit from such placements. Each of these mechanisms may represent an opportunity for unobserved confounding to arise, as individuals are differentiated by processes involving both measured and unmeasured characteristics.

Another key assumption of causal inference, SUTVA, also needs to be considered in defining the exposure. At first glance, a child's outcomes may not seem to depend on whether others are placed in out-of-home care. Further thought reveals that this assumption may be violated among family members, such as siblings or cousins. There may be neighborhood processes that challenge SUTVA as well.

5.2. Step 1b: carefully define the outcome

As with the treatment, the study also must explicitly define the outcome. Researchers commonly define an outcome through a series of established measurement scales, and internal consistency and reliability measures are traditionally included to demonstrate the quality of the measure. Clearly this step is not unique to causal inference. However, if the outcome measure is poorly defined, then the application of sophisticated statistical methods will yield meaningless results. An analysis is only as strong as its weakest component: advanced causal inference built on a foundation of poor measurement or other methodological problems (such as mishandled missing data) is not very useful.

5.3. Step 2: carefully think about the mechanism determining the treatment

Understanding selection into treatment is essential for identifying the effect of treatment. This step is critical even when one does not write a formal model determining treatment. If the treatment is out-of-home placement, one would next consider how placement is determined. A variety of child welfare staff may be involved in that decision. They will consider the state laws and assess whether the case meets the criteria laid out in the law. This interpretation may depend on caseworker characteristics, such as experience. (The longer she or he has worked in the field, the broader the frame of reference: the more cases she has investigated or worked with, the better her sense of how the potential case compares to others.)

More specifically, the researcher should consider the incentives and motives of those influencing treatment. For instance, it may be important to know whether the caseworkers who investigate families are the same ones who will work with the case if the child welfare agency brings the child into care. If so, the natural incentive may be to take only the most severe cases in order to keep their own caseloads manageable—or worse, to take the less severe cases with whom it would be easier to work. Such perverse incentives may be why many social service departments separate investigatory and case-monitoring activities. Similarly, it would be important to know whether financial incentives from the department, state, or federal government may affect the treatment decision. Are there budget pressures to keep the number in "treatment" low?

Surely, much is to be considered in the treatment decision, and it is uncertain whether the researcher could identify and understand all the factors in the decision-making process. A key issue is whether data are even collected from all the persons shaping these choices—and, if collected, whether the data include the key information. The relevant information clearly extends beyond the simple demographics of caseworkers to data such as motives. These reflections on the nature of treatment have important implications for the choice among the causal methods above.

5.4. Step 3: define which treatment effect is of interest

The potential outcomes framework highlights three possible treatment effects (ATT, ATE, and ATU). The researcher must determine which is most appropriate to the question of interest. Moreover, different agents may be interested in different questions. For the effect of out-of-home placement, child welfare staff may want insight into the ATU: "Should I remove a child currently living in his home? What will be the effect of treatment (out-of-home placement) on the child's behavior?" Children not removed from the home might not experience the same behavioral effect were they instead placed in out-of-home care; the ATU would indicate the behavior changes to expect if children who had not been removed instead experienced an out-of-home placement. The ATE would provide an overall, average effect.

School administrators, on the other hand, may be interested only in the effects for those children who actually were removed from the home. In their minds, removal could be a red flag for potential behavior problems in the classroom and for the need for on-site children's mental health services. The ATT would indicate the behavior they should generally expect from children in out-of-home placement. It may reveal that anticipated fears of behavior issues overestimate the true effect, or vice versa.

To some extent, the definition of the treatment effect may be beyond the researcher's control. Positivity, for instance, may mean that a study of children in out-of-home placement includes no children from upper-income families. As a result, the estimated effect of placement on behavior will not reflect what would happen for upper-income children; for the few children in this category, the effects of placement should be acknowledged as simply unknown.

5.5. Step 4: draw a directed acyclic graph

As the discussion above makes clear, ignorability marks a dividing line between alternative methodologies. The plausibility of ignorability depends on (among other things) the proper selection and availability of key variables. Of particular interest are potential confounders—variables that influence both the treatment and the outcome. Of course, reflections on the determinants of treatment can inform this choice, but the task is more difficult than it might seem. Incorporating some covariates may induce unobserved confounding where none existed; key features of sample selection may essentially act as "bad" covariates, causing further bias. To understand these issues, one can turn to another tool, the directed acyclic graph (DAG) (Pearl, 2009).

The DAG illustrates the set of causal relationships among the treatment, the outcome, and the other covariates. The DAG is "complete" when it captures all interdependencies in the data. Unlike a path diagram, the DAG should include unobserved common causes of observed variables.


Furthermore, the links between variables in a DAG are explicitly causal and not necessarily linear. The key contribution of the DAG is to identify variables that should and should not be included as covariates.

When a variable is added as a covariate, the DAG is modified—the relationships between that variable and the other variables in the model are severed. When a variable is held constant (within a stratum, for example), it no longer covaries with the other variables. If that variable is a common cause of two other variables, those variables are no longer related to each other unless other common causes are at work or one causes the other. This process is the motive for adding covariates to a model: after all common causes have been controlled and any unobserved confounding ruled out, any remaining association is causal in nature. As long as the outcome and the treatment of interest share common causes, they will be associated above and beyond any causal relationship. This spurious relationship appears in the DAG as a "back-door path" linking the two variables. One can eliminate (or "block") this path by conditioning on at least one variable on the path. This process of examining three variables at a time—the covariance between two variables holding a third constant—is the essence of the DAG.

Other combinations of three variables exist such that conditioning on one creates rather than removes a spurious association. These involve pairs of variables that share a common effect. Two variables that influence a third need not be associated; however, when one conditions on the third, a relationship between the two is established. This mechanism can create relationships where none existed, and the consequences are most pronounced when one of the variables is unobserved. Suppose a treatment (D) has an effect (E) that is also caused by an unobserved factor (U). The latter and the treatment may very well be unrelated. However, conditioning on the shared outcome (E) creates a spurious relationship between the two (D and U). If the unobservable also influences the outcome of interest (Y), this conditioning effectively creates an association between the treatment and the outcome (D and Y)—a relationship that one might mistakenly interpret as causal. This indirect association is a back-door path (D–U–Y). The effect of such a path could be eliminated by not conditioning on the effect (E) in the initial triple of variables or by conditioning on some other variable on the causal path from the unobservable (U) to the outcome (Y). The literature labels variables like E "colliders." Conditioning on a collider establishes a spurious correlation between unobserved confounders and the treatment: ignorability is violated even if that assumption held before the covariate was added (Greenland, Pearl, & Robins, 1999; Shrier & Platt, 2008). A small simulation below illustrates the mechanism.

DAGs can help the researcher understand key choices about sampling as well. Sampling essentially includes the sample criteria as variables in the model (Hernan, Hernandez-Diaz, & Robins, 2004), and these criteria may serve as colliders. For example, one's data may be limited to children already in the child welfare system. In assessing the link between a risk factor and length of time in the system, the decision to include only those already in the system has enormous causal consequences: that choice may act as a collider itself, creating spurious associations that span the model. It seems clear that system involvement may reflect a range of unobserved characteristics that are relevant to understanding outcomes. The key issue is whether the exposure or treatment of interest reflects processes that were at work before the child entered the child welfare system. Mediating variables also can act as colliders, a possibility we discuss below.

Many researchers would argue that sampling involves only problems of external validity, but it is important to understand the full consequences in situations where sampling acts as a collider. Issues of external validity often involve characteristics like gender—characteristics that are well defined, easily measured in a new sample, and exogenously determined. An analysis of boys may not generalize to girls, and in that case the relevant issue is indeed external validity. But in situations where the sampling is inherently related to the process being studied, the issues are more complicated. In that case, the results based on the self-selected sample may generalize only to other situations where the criteria for entering the system depend on the same characteristics and those characteristics take the same values. In essence, sample selection conditions on an added variable—the probability of participation. If that variable is related to other processes—especially if it reflects them—then the relations in the data generalize only to other situations where that variable takes on the same value.
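The collider mechanism can be demonstrated in a few lines. In the simulation below (hypothetical numbers), D has no effect on Y at all, yet restricting attention to cases with similar values of the common effect E manufactures an association:

```python
# A small simulation of collider bias: D and U are independent by construction,
# but conditioning on their common effect E induces a spurious D-U association
# that propagates to the D-Y relationship. Variable names follow the text.
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
d = rng.binomial(1, 0.5, size=n)            # treatment, randomized
u = rng.normal(size=n)                      # unobserved factor, independent of D
e = d + u + rng.normal(size=n)              # E: a common effect of D and U (collider)
y = 2.0 * u + rng.normal(size=n)            # D truly has NO effect on Y

overall = y[d == 1].mean() - y[d == 0].mean()
m = np.abs(e) < 0.5                         # "conditioning" on E: restrict to a slice
within = y[m & (d == 1)].mean() - y[m & (d == 0)].mean()
print(f"unconditional contrast: {overall:+.2f}   within a slice of E: {within:+.2f}")
# The unconditional contrast is near zero; the within-slice contrast is not,
# because among cases with similar E, treated cases must have lower U.
```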

1137

sample may generalize only to other situations where the criteria for entering the system depend on the same characteristics and those characteristics have the same value. In essence, sample selection conditions on an added variable—the probability of participation. If that variable is related to other processes—especially if it reflects them—then the relations in the data would generalize only to other situations where that variable took on the same value. Fig. 2 illustrates a DAG one might use in assessing the effect of outof-home placement on child behavior. The heavy line in the middle between these two boxes represents this relationship. A review of prior literature and theory would suggest a series of confounders, linked to both the treatment and the outcome. These are represented by the boxes labeled child, parent, and neighborhood characteristics as well as type, degree, and history of alleged maltreatment. Note that these boxes each have an arrow directed at out-of-home placement (the treatment) and at child behavior (the outcome). (Suppose we are also considering conditioning on characteristics of the child welfare staff. After a little thought, we decide that these may likely be related to selection into out of home care, but are unlikely to have a direct effect on the child behavior.) Fig. 3 illustrates the effect of conditioning on child, parent, neighborhood, and maltreatment characteristics alone; these covariates block the backdoor paths between out-of-home placement (X) and child behavior (Y). This is necessary to yield an unbiased estimate of the effect of X on Y. Since all paths are blocked, it is not necessary to condition on other covariates, such as child welfare worker characteristics. While doing so does not create a backdoor relationship in this example (e.g., child welfare characteristics is not linked to both out-of-home placement and child behavior once the key covariates are controlled for), it does require additional degrees of freedom. What does the DAG tell us about the effect of adding the number of placements as a mediator to the analysis? Many child welfare researchers would recognize that adding this covariate changes the meaning of the effect linking placement and the outcomes—that effect now represents the “direct” effect of the former. The effect on the mediator and of the mediator on the outcome represents the “indirect” effect. In many instances, researchers are interested in decomposing causal effects into these two components. As illustrated in Fig. 4, the mediator becomes a collider due to unobserved confounding involving the number of placements. A new spurious association is represented by the dotted line between out-ofhome placement and unobserved factors influencing the number of placements. For example, one contributor to placement instability is being placed in a home with relatives who may exhibit genetic mental health traits that affect their ability to parent or to provide a stable home, thus leading to turn over. These same genetic traits may be linked to the child's own behavior. Now of course, many relatives provide loving, stable homes for children in foster care so for many children, this unobserved confounding may not exist. But it does for some. We just do not know how many. Moreover, one could imagine a variety of other unobserved confounding that would elicit the same biasing effect. 
As a result, the DAG process reveals that testing for mediation will introduce bias into our understanding of the direct effect of out-of-home placement on the child's behavior. As such, we generally should not control for variables on the causal pathway, such as mediators.17

17 This fact may shock any reader familiar with Baron and Kenny's approach to this issue. Their approach, however, makes a strong assumption—that both the mediator and the treatment are free of unobserved confounding. This assumption would require careful assessment and justification, to say the least (Baron & Kenny, 1986). Other approaches for assessing mediation exist that do not make this assumption, such as principal stratification. It is important to note that the identification issues are more difficult when mediation is involved, and any form of estimation will involve fairly strong assumptions (Ten Have & Joffe, 2010).
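To make this concrete, consider a minimal simulation of the structure in Fig. 4. The linear data-generating process and all coefficient values below are illustrative assumptions, not estimates from child welfare data; the point is only that adjusting for the mediator distorts the "direct" effect even though treatment and the unobserved factor are independent by construction:

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

u = rng.normal(size=n)                # unobserved family genetics
d = rng.binomial(1, 0.5, size=n)      # out-of-home placement, independent of u
m = 0.8 * d + 1.0 * u + rng.normal(size=n)            # number of placements (mediator)
y = 0.5 * d + 0.4 * m + 0.7 * u + rng.normal(size=n)  # child behavior
# Assumed true effects: total = 0.5 + 0.4 * 0.8 = 0.82; direct = 0.5.

def coef_on_d(y, *covs):
    """OLS coefficient on d from regressing y on an intercept, d, and covs."""
    X = np.column_stack([np.ones(n), d] + list(covs))
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print("total effect, y ~ d:       ", coef_on_d(y))        # ~0.82, unbiased
print("'direct' effect, y ~ d + m:", coef_on_d(y, m))     # ~0.22, far from 0.5
print("oracle, y ~ d + m + u:     ", coef_on_d(y, m, u))  # ~0.50, if u were observed

Conditioning on the mediator opens the path through the unobserved factor, and the estimated "direct" effect lands well away from the assumed value; only the (infeasible) regression that also observes the genetic factor recovers it.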


[Fig. 2. Baseline DAG. Nodes: out-of-home placement (treatment); child's behavior (outcome); number of out-of-home placements; child characteristics; parent characteristics; neighborhood characteristics; type, degree, and history of alleged maltreatment; characteristics of child welfare staff (who is involved in making the decision to remove the child, number of years of service, education level of the investigating case worker); unobserved: family genetics; unobserved: child welfare staff interpretation of state law.]
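The logic of Figs. 2 and 3 can be checked the same way. In the sketch below (again, the data-generating process and coefficients are invented for illustration), a single composite confounder stands in for the child, parent, neighborhood, and maltreatment boxes; conditioning on it blocks the backdoor path and recovers the assumed placement effect:

import numpy as np

rng = np.random.default_rng(1)
n = 200_000

c = rng.normal(size=n)                            # composite confounder
d = rng.binomial(1, 1 / (1 + np.exp(-1.5 * c)))   # placement more likely when c is high
y = 0.3 * d + 1.0 * c + rng.normal(size=n)        # child behavior; assumed true effect 0.3

def coef_on_d(y, *covs):
    """OLS coefficient on d from regressing y on an intercept, d, and covs."""
    X = np.column_stack([np.ones(n), d] + list(covs))
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print("unadjusted, y ~ d:   ", coef_on_d(y))     # biased well above 0.3
print("adjusted, y ~ d + c: ", coef_on_d(y, c))  # ~0.3; backdoor path blocked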

5.6. Step 5: Select a statistical method for estimating the treatment effect under ignorability

As noted above, the choice of statistical method has important implications. The primary issue involves the plausibility of ignorability. Whether unobserved confounding exists depends on several data issues. Are the key covariates identified with the DAG included in the dataset? Was unobserved confounding created unintentionally by those who designed the study originally?

If ignorability seems plausible, estimation can take one of several forms. Regression is the best known, but as noted, it makes assumptions that may be undesirable and/or unnecessary. Alternatives include matching and propensity-score methods. One motivation for using regression over other strategies, like matching, is that regression adds structure to the relationships of interest. Whether that structure is beneficial depends on whether it is correct, so one should conduct regression diagnostics to check key features of the model, such as the functional form, and check for the presence of outliers. Another option is to start the analysis with a rather flexible specification, such as a fully saturated model. In that instance, regression and matching are closely related. Of course, the more flexible the specification, the smaller are any advantages offered by functional form assumptions.

Propensity score-based methods represent an alternative to regression, and as noted, one can use the former as a means to assess problems with the latter, such as the support problem. Rubin outlines three conditions that should be met for regression results to be considered "trustworthy": 1) the difference in the mean predicted probability of treatment (the propensity score) between groups is small; 2) the variances of these probabilities are nearly equal across groups; and 3) the variances of the residuals of the covariates, after adjusting for the propensity score, are also nearly equal (Rubin, 2001). Basically, these conditions imply that the amount of confounding is relatively small, and so the amount of damage that might be done by misspecification is fairly small. As noted, one can limit the damage (and benefits) of regression by using a flexible specification.
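As a sketch of what these diagnostics might look like in practice, the code below estimates the propensity score by logistic regression on simulated data and computes quantities in the spirit of Rubin's three conditions, along with the inverse-probability-weighted ATE discussed next. The data, variable names, and rule-of-thumb thresholds are illustrative assumptions; Rubin (2001) gives the precise recommendations:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5_000
X = rng.normal(size=(n, 4))                                       # observed covariates
d = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0] - 0.5 * X[:, 1])))   # treatment indicator

# Estimate the propensity score and put it on the linear (logit) scale.
ps = LogisticRegression(max_iter=1000).fit(X, d).predict_proba(X)[:, 1]
lin = np.log(ps / (1 - ps))

t, c = lin[d == 1], lin[d == 0]
pooled_sd = np.sqrt((t.var() + c.var()) / 2)

# Quantities in the spirit of Rubin's (2001) three conditions.
print("1) std. diff. in linearized PS:", (t.mean() - c.mean()) / pooled_sd)  # want small, e.g. < 0.25
print("2) variance ratio of PS:       ", t.var() / c.var())                  # want roughly 0.5 to 2

# 3) Variance ratios of covariate residuals after regressing out the PS.
for j in range(X.shape[1]):
    b = np.polyfit(lin, X[:, j], 1)            # linear adjustment for the PS
    r = X[:, j] - np.polyval(b, lin)
    print(f"3) residual variance ratio, covariate {j}:", r[d == 1].var() / r[d == 0].var())

# One way to generate the ATE via weighting (cf. Morgan & Winship, 2007).
y = X[:, 0] + 0.4 * d + rng.normal(size=n)     # illustrative outcome; assumed effect 0.4
ate = np.mean(d * y / ps) - np.mean((1 - d) * y / (1 - ps))
print("IPW estimate of the ATE:", ate)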

In general, one would expect propensity score methods to produce results that are similar to those from a flexibly specified and well conducted regression model. Remaining differences could reflect substantive differences, such as the difference between the ATE and the regression estimate. That difference could be explored in more detail by examining treatment effects for specific strata and variation in these effects across strata. (One can generate the ATE using weighted regression. For details, see Morgan and Winship (2007).)

5.7. Step 6: Consider methods for relaxing the ignorability assumption

In some instances, the nature of the treatment, of the outcome of interest, and of the available data may suggest skipping step 5 entirely. However, in any given substantive application, prior research likely relies on regression, and those analyses provide a frame of reference for the new analyses. For that reason, step 6 likely follows step 5, and the key issue is how to relax ignorability. As noted, ignorability cannot simply be dropped; rather, the model must be identified with an alternative assumption, such as the structure of a selection model. As noted, alternative methods of identification cannot be tested; models that do and do not assume ignorability can fit the data identically. One must choose among models based on theoretical understanding and institutional knowledge. This choice draws heavily on step 2—on the information and motives of the key decision makers shaping the exposure and outcome.

As discussed above, possible alternatives to ignorability include instrumental variables or selection models, both of which involve an exclusion restriction. What can one do if no instrumental variable is available? Sensitivity analyses represent one option. These methods posit a variable like U in Eq. (7); one can then consider alternative levels of association between U, the outcome, and the treatment. All of these models fit the data equally well but differ in the implied effect of treatment. One then can assess how strong the relationships involving U would have to be to alter the implications of the baseline findings (produced under ignorability) (Foster, 2011).


[Fig. 3. Final DAG. The nodes are the same as in Fig. 2; as discussed in the text, child, parent, and neighborhood characteristics and the type, degree, and history of alleged maltreatment are conditioned on, blocking the backdoor paths between out-of-home placement and child behavior.]

In some circumstances, the confounding influence of U would have to be extremely large to explain a relationship that otherwise appears significant.

Foster (2011), for example, estimates the effect of military deployment on the behavioral health of female members of the National Guard and Reserve. Propensity-score analyses suggest a large negative effect, but the available data provided relatively few covariates. (The available data did indicate that deployments were not randomly assigned.) Sensitivity analyses reveal that deployment would have to be strongly selective in terms of members' (unobserved) health to explain this relationship.
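An analysis in this spirit can be sketched in a few lines. The version below uses the simple omitted-variable logic for a linear outcome model and a hypothetical binary confounder U: if U shifts the outcome by gamma and its prevalence differs by delta between treated and comparison groups, the confounding bias in the estimate is approximately gamma times delta. The observed effect and all grid values are invented for illustration:

observed = -0.50   # illustrative effect estimated under ignorability

# gamma: effect of the hypothetical binary confounder U on the outcome;
# delta: difference in Pr(U = 1) between treated and comparison groups.
for gamma in (-0.25, -0.5, -1.0, -1.5):
    for delta in (0.1, 0.3, 0.5):
        bias = gamma * delta          # approximate confounding bias
        adjusted = observed - bias    # effect after removing that bias
        flag = "  <- estimate explained away" if adjusted >= 0 else ""
        print(f"gamma={gamma:5.2f}  delta={delta:3.1f}  adjusted={adjusted:6.2f}{flag}")

Only large values of gamma and delta together explain away the estimate, which is the sense in which a finding can be judged robust to unobserved confounding.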

[Fig. 4. Biasing effect of mediator. Nodes: out-of-home placement; child's behavior; number of out-of-home placements (mediator); unobserved family genetics; child, parent, and neighborhood characteristics; type, degree, and history of alleged maltreatment. A dotted line marks the spurious association between out-of-home placement and the unobserved genetics node that conditioning on the mediator creates.]


Consider again the effect of out-of-home placement on child behavior. The DAG may suggest some possible instrumental variables—variables influencing out-of-home placement but not affecting the child behavior outcome directly. Child welfare agency or staff characteristics, for example, may influence selection into placement without causing outcomes directly. Of course, such candidates must be carefully justified, and it is quite likely that an appropriate instrument is not available. Intuitively, selection into out-of-home placement is best explained by characteristics of the maltreatment and the parents, which in this example are also related to the child outcome. On the other hand, longitudinal data might permit fixed-effects estimation to account for unobserved confounding at the child level. That confounding would involve time-invariant heterogeneity only; estimation essentially would involve pre-post comparisons of children placed out of home. The researcher would have to give careful consideration to the time-varying factors that potentially influence both placement and outcomes.

6. Conclusion

If causal inference is this hard, should one simply avoid it? In children and youth services, the answer is perhaps. In some instances, an association is indeed good enough. Nonetheless, the heart of child welfare research lies in determining the causes of a child's difficulties and addressing them through an intervention. If research is meant to inform policy, then it must directly address causal inference. Otherwise, the work implies causation to policy makers who might not understand statistical modeling. This is misleading and, if findings are biased to the point of masking program ineffectiveness, may ultimately harm the very children and families one seeks to help.

The article presents six steps for causal inference and offers a guide for the field. Of course, this list is somewhat arbitrary, and some issues span all six steps. For example, one requirement for causal inference (recognized by both economists and statisticians) is positivity; in some strata, all individuals may belong to either the treated or the comparison group, violating this requirement. The existence of such strata shapes the definition of the exposure of interest and the choice of methods. Regarding the latter, regression would offer a means to extrapolate treatment effects for this group, but a flexible specification of that model would eliminate their contribution entirely.

6.1. Ignorability

Ignorability receives a great deal of attention in our discussion, as it should. It is important for researchers in child welfare to recognize that ignorability is likely implausible in their research. As noted in the introduction, decisions in child welfare should be made based on the potential outcomes—in that case, ignorability should not hold. It seems particularly implausible that one can produce reasonable estimates of key causal effects with a limited set of covariates. Odd self-selected samples complicate matters still further. This recognition should shape research practice in several ways.
First, at the very least, child welfare researchers should use very flexible model specifications, with an eye toward making the most of the covariates that are available. Maternal education likely explains relatively little of the difference between children living with women who do and do not abuse drugs, but one can be fairly certain that including education as a single variable (years of schooling) does not maximize the capacity of that variable to "make all else equal". At the very least, one might separate levels of education into separate categories and add dummy variables to capture non-linearities in the relationship between education and the outcome (or exposure).

Second, researchers should think very carefully about methods that do not require ignorability. The good news is that child welfare policies vary a great deal across communities. This variability appears to represent an exciting opportunity to employ instrumental variables estimation (a sketch appears at the end of this subsection).

Third, researchers in child welfare should recognize when to walk away from causal inference. Bad causal inference—what one might call "casual inference"—can do substantial harm. When one is analyzing real-world services, for example, it is easy to conduct analyses that show that more services do not improve outcomes (Foster, 2000, 2003).18 Such findings could easily reflect unobserved confounding and represent an invitation for policy makers to cut services, regardless of researchers' disclaimers. More generally, complex regressions that include a long but incomplete list of covariates may represent the worst possible approach. The resulting estimates may not be plausibly causal, capturing only some sort of obscure association that defies description. The researchers may add a disclaimer that these estimates are "only associations", but they often turn to a causal interpretation elsewhere in the same paper (Rutter, 2007, 2009). While the literature includes many such estimates, one could argue that they sit at the bottom of the U-shaped curve relating usefulness and complexity (Foster, 2010b). A regression coefficient in a model with many covariates is hard to describe even as an association. Even in a well specified, saturated model, the coefficient is difficult to interpret; as noted, regression combines the stratum-specific estimates in an obscure way. As the model specification becomes more complex, more cells comprise only treated or only comparison cases, and the resulting estimate pertains to less and less of the overall sample. Imposing linearity only makes interpretation more difficult.

More fundamentally, the value of associations generally lies in areas where simplicity is important. For example, suppose a policy maker wanted to identify children at risk of some long-term developmental problem. Research might show an association between some characteristic (e.g., attending day care for long periods of time at an early age) and the development of these problems. One might use attendance as a means of identifying at-risk children for an intervention. A key characteristic of such screening is often simplicity; more complex screening requires collecting more information from large numbers of children or parents. More generally, the strength of a characteristic as a screening measure likely does not depend on the degree of partial association between that variable and the outcome. Think of two variables that are closely correlated. When judged through the lens of causal inference, one variable is a putative cause and the other a potential confounder. For causal inference, distinguishing the differential effect of each is critical. When associations are the focus, the second variable is just redundant. One should pick one variable or the other depending on the strength of its association with the outcome. Other considerations, such as whether the variable is even manipulable, play no role.
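To illustrate the first two suggestions together, the sketch below simulates cross-community variation in placement policy and uses it as an instrument in a hand-rolled two-stage least squares, entering parental education as category dummies rather than a single linear term. Everything here is hypothetical (the policy variable, the coefficients, and the data are invented), and in applied work a dedicated IV routine should be used, since naive second-stage standard errors are invalid:

import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Hypothetical data: z is community policy stringency (the instrument), u is an
# unobserved confounder, edu is a four-level parental education category.
z = rng.normal(size=n)
u = rng.normal(size=n)
edu = rng.integers(0, 4, size=n)
edu_dummies = (edu[:, None] == np.arange(1, 4)).astype(float)  # flexible coding, base = 0

d = (0.8 * z + u + rng.normal(size=n) > 0).astype(float)  # placement decision
y = 0.3 * d - 0.2 * edu + 1.0 * u + rng.normal(size=n)    # outcome; assumed true effect 0.3

def ols_fit(y, X):
    """OLS with an intercept; returns coefficients and the design matrix."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0], X

# Naive OLS is biased because u drives both placement and the outcome.
b_ols, _ = ols_fit(y, np.column_stack([d, edu_dummies]))
print("OLS estimate:  ", b_ols[1])

# 2SLS: the first stage predicts d from the instrument and covariates; the
# second stage replaces d with its prediction. (Use IV software in practice
# to obtain valid standard errors.)
b1, X1 = ols_fit(d, np.column_stack([z, edu_dummies]))
d_hat = X1 @ b1
b2, _ = ols_fit(y, np.column_stack([d_hat, edu_dummies]))
print("2SLS estimate: ", b2[1])   # ~0.3 under the assumed exclusion restriction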
6.2. Prior research in this journal

Our review of recent applications of these methods in this journal, Children and Youth Services Review, finds that some researchers are beginning to apply these steps. All papers carefully defined the treatment and outcome (step 1) to some degree. Considering the causal mechanism (step 2), though, received very little attention in the articles. This process may occur but be omitted from the articles for the sake of brevity. Regardless, articulating the mechanism is important for helping to identify appropriate statistical techniques in later steps.

18 Child welfare has no monopoly on casual inference. See, for example, research on the effect of television viewing (Christakis et al., 2004) or the effect of day care (Belsky et al., 2007). Both are particularly regrettable as they needlessly alarm parents.


Defining the treatment effect of interest (step 3) was not explicitly addressed in any of the articles. The average treatment effect is the standard approach, yet no one names it as such or considers that a particular question may be better addressed with the average effect on the treated or the untreated. All articles define the covariates used (step 4), though each also could benefit from the use of a directed acyclic graph (DAG) to identify more systematically the variables to include and, more importantly, the variables to omit to avoid bias. Many studies chose a method to estimate the treatment effect assuming ignorability (step 5); several used propensity score analysis (see Falconer et al. (2011) and Koh (2010)). Checking balance, a key analytic step to support the ignorability assumption, received little discussion, and at times there is confusion about the interpretation of the resulting analyses. Several papers exaggerate the benefits of propensity score methods relative to regression; as discussed above, propensity score methods have some advantages over regression but rest on the same fundamental assumption of ignorability. Lastly, none of the articles estimated treatment effects without assuming ignorability (step 6), such as with an instrumental variable.

Two other articles in this issue provide thorough examples of applying steps 5 and 6. Zhai, Waldfogel, and Brooks-Gunn use a propensity score analysis to estimate the effect of the Head Start program on preventing child maltreatment. Doyle provides an excellent explanation of instrumental variables and applies the method to estimate the effect of foster care on child outcomes. Both serve as models for future causal inference research in child welfare.

6.3. The role of basic math

We end with a plea that researchers in this area not skim the math in this article. Our review of CYSR articles suggests that the presentation above is more technical than that typically used in the journal. We believe that technical detail is essential to causal inference for several reasons. First, one should remember that our statistical methods have mathematical properties with substantive content. One can describe regression as a method for "adjusting for" covariates, but the nature of that adjustment has important implications for understanding the resulting estimates. It is also not clear what "adjusting for" means in situations where exposure and outcomes are co-determined. Second, various assumptions have a technical content that is difficult to specify in words alone. For example, the specific form of ignorability required depends on the nature of the treatment effect of interest. Third, and relatedly, different literatures use different terms for largely the same concepts. For example, ignorability is (basically) the same concept as no unobserved confounding (NUC), conditional independence, or even exchangeability, and a mathematical representation of each term provides a way to link them. It also provides a way of knowing how they differ and when one is more appropriate than another.

Finally, mathematical specification has other benefits. As one digs deeper (and more specifically) into the relevant constructs, new questions emerge. For example, many applied researchers will not have considered whether they are estimating the ATE, the ATT, or the ATU. Likewise, with direct and indirect effects, a mathematical specification raises new questions; research in this area reveals multiple possible direct effects. One might consider the effect of an initial out-of-home placement on outcomes holding the overall number of placements constant at a range of alternative values—at the level at which placements would have occurred had the first placement not been out of home, or at a fixed number, such as zero. Clearly good substantive thinking can inform causal inference, but better causal inference can also stimulate better thinking.
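A small numeric illustration shows why the choice of estimand matters. The stratum-specific effects and counts below are invented; the three estimands are simply different weightings of the same stratum-specific effects:

import numpy as np

# Hypothetical stratum-specific effects and counts of treated/untreated cases.
effect      = np.array([0.10, 0.30, 0.60])   # treatment effect within each stratum
n_treated   = np.array([400,  100,   50])
n_untreated = np.array([100,  300,  500])

ate = np.average(effect, weights=n_treated + n_untreated)  # weight by stratum size
att = np.average(effect, weights=n_treated)                # weight by treated distribution
atu = np.average(effect, weights=n_untreated)              # weight by untreated distribution
print(f"ATE={ate:.3f}  ATT={att:.3f}  ATU={atu:.3f}")      # 0.345, 0.182, 0.444

The same stratified data support three quite different summary effects, depending on whose counterfactual is of interest.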


References

Andrews, D. W. K., & Schafgans, M. M. A. (1998). Semiparametric estimation of the intercept of a sample selection model. Review of Economic Studies, 65(3), 497–517.
Angrist, J. D. (1990). Lifetime earnings and the Vietnam era draft lottery: Evidence from social security administrative records. The American Economic Review, 80(3), 313–336.
Angrist, J. D., & Krueger, A. B. (2001). Instrumental variables and the search for identification: From supply and demand to natural experiments. The Journal of Economic Perspectives, 15(4), 69–85.
Angrist, J. D., & Lavy, V. (1999). Using Maimonides' rule to estimate the effect of class size on scholastic achievement. Quarterly Journal of Economics, 114(2), 533–575.
Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6), 1173–1182.
Belsky, J., Vandell, D. L., Burchinal, M., Clarke-Stewart, K. A., McCartney, K., & Owen, M. T. (2007). Are there long-term effects of early child care? Child Development, 78(2), 681–701.
Chamberlain, G. (1980). Analysis of covariance with qualitative data. Review of Economic Studies, 47(1), 225–238.
Christakis, D. A., Zimmerman, F. J., DiGiuseppe, D. L., & McCarty, C. A. (2004). Early television exposure and subsequent attentional problems in children. Pediatrics, 113(4), 708–713.
Cole, S. R., & Frangakis, C. E. (2008). On the consistency statement in causal inference: A definition or an assumption. Unpublished manuscript.
Deaton, A. (2009). Instruments of development: Randomization in the tropics, and the search for the elusive keys to economic development. NBER Working Paper.
Dehejia, R. H., & Wahba, S. (1998). Propensity score matching methods for non-experimental causal studies (NBER Working Paper No. 6829). Cambridge, MA: National Bureau of Economic Research.
Falconer, M. K., Clark, M. H., & Parris, D. (2011). Validity in an evaluation of Healthy Families Florida—a program to prevent child abuse and neglect. Children and Youth Services Review, 33(1), 66–77.
Foster, E. M. (2000). Is more better than less? An analysis of children's mental health services. Health Services Research, 35(5, Pt. II), 1135–1158.
Foster, E. M. (2003). Propensity score matching: An illustrative analysis of dose response. Medical Care, 41(10), 1183–1192.
Foster, E. M. (2010a). Causal inference and developmental psychology. Developmental Psychology, 46(6), 1454–1480.
Foster, E. M. (2010b). The U-shaped relationship between complexity and usefulness. Developmental Psychology, 46(6), 1760–1766.
Foster, E. M. (2011). Deployment and the citizen soldier: Need and resilience. Medical Care, 49, 301–312.
Foster, E. M., & McLanahan, S. (1996). An illustration of the use of instrumental variables: Do neighborhood conditions affect a young person's chance of finishing high school? Psychological Methods, 1(3), 249–260.
Foster, E. M., & Watkins, S. (2010). Television viewing and children's behavior: A reanalysis. Child Development, 81(1), 368–375.
Fraumeni, B. M. (2011). Report of the Committee on the Status of Women in the Economics Profession, 2010. The American Economic Review.
Greenland, S., Pearl, J., & Robins, J. M. (1999). Causal diagrams for epidemiologic research. Epidemiology, 10(1), 37–48.
Hahn, J., Todd, P., & Van Der Klaauw, W. (2001). Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69(1), 201–209.
Heckman, J. J., Ichimura, H., & Todd, P. E. (1998). Matching as an econometric evaluation estimator. Review of Economic Studies, 65, 261–294.
Heckman, J. J., & Vytlacil, E. J. (2007a). Econometric evaluation of social programs, part I: Causal models, structural models and econometric policy evaluation. Handbook of Econometrics, 6, 4779–4874.
Heckman, J. J., & Vytlacil, E. J. (2007b). Econometric evaluation of social programs, part II: Using the marginal treatment effect to organize alternative econometric estimators to evaluate social programs, and to forecast their effects in new environments. Handbook of Econometrics, 6, 4875–5143.
Hernan, M. A., Hernandez-Diaz, S., & Robins, J. M. (2004). A structural approach to selection bias. Epidemiology, 15(5), 615–625.
Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396), 945–960.
Imbens, G. W. (1999). The role of the propensity score in estimating dose–response functions. Cambridge, MA: National Bureau of Economic Research.
Imbens, G. W., & Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2), 467–475.
Koh, E. (2010). Permanency outcomes of children in kinship and non-kinship foster care: Testing the external validity of kinship effects. Children and Youth Services Review, 32(3), 389–398. doi:10.1016/j.childyouth.2009.10.010
Ludwig, J., & Miller, D. L. (2007). Does Head Start improve children's life chances? Evidence from a regression discontinuity design. The Quarterly Journal of Economics, 122(1), 159–208.
Morgan, S. L., & Winship, C. (2007). Counterfactuals and causal inference: Methods and principles for social research. Cambridge University Press.
Pearl, J. (2000). Causality: Models, reasoning, and inference. Cambridge, U.K.; New York: Cambridge University Press.
Pearl, J. (2009). Causality: Models, reasoning, and inference (2nd ed.). Cambridge, U.K.; New York: Cambridge University Press.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.
Rubin, D. B. (1980). Randomization analysis of experimental data: The Fisher randomization test: Comment. Journal of the American Statistical Association, 75(371), 591–593.


Rubin, D. B. (2001). Using propensity scores to help design observational studies: Application to the tobacco litigation. Health Services and Outcomes Research Methodology, 2(3), 169–188.
Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469), 322–331.
Rutter, M. (2007). Proceeding from observed correlation to causal inference: The use of natural experiments. Perspectives on Psychological Science, 2(4), 377–395.
Rutter, M. (2009). Epidemiological methods to tackle causal questions. International Journal of Epidemiology, 38(1), 3–6.
Shrier, I., & Platt, R. W. (2008). Reducing bias through directed acyclic graphs. BMC Medical Research Methodology, 8(1), 70.
Tan, Z. (2006). Regression and weighting methods for causal inference using instrumental variables. Journal of the American Statistical Association, 101(476), 1607–1618.
Ten Have, T., & Joffe, M. (2010). A review of causal estimation of effects in mediation analyses. Statistical Methods in Medical Research.
Wald, A. (1940). The fitting of straight lines if both variables are subject to error. The Annals of Mathematical Statistics, 11(3), 284–300.
Zhao, Z. (2004). Using matching to estimate treatment effects: Data requirements, matching metrics, and Monte Carlo evidence. The Review of Economics and Statistics, 86(1), 91–107.