Econometric Approaches to Epidemiologic Data: Relating Endogeneity and Unobserved Heterogeneity to Confounding
NAMVAR ZOHOORI, MD, PHD, AND DAVID A. SAVITZ, PHD
The concepts of endogeneity and unobserved heterogeneity are well-known among econometricians. However, these issues are rarely addressed in epidemiologic studies. This paper explores these two concepts, their relationship to each other, and the implications for analysis in epidemiologic studies. An endogenous variable is defined as a predictor variable which is partly determined by factors within the model itself, while unobserved heterogeneity is conceptualized as a vector of missing variables acting through the error term. Under certain assumptions, the simultaneous existence of an endogenous variable and unobserved heterogeneity is shown to act in a manner analogous to confounding. Specifically, this occurs due to an association between the error term in the equation and the endogenous predictor variable. The accepted econometric solution to this problem is to replace the endogenous variable with an 'instrumental variable' which is not correlated with the error term and thus not susceptible to confounding. The validity of these concepts and of the proposed solution is discussed. Ann Epidemiol 1997;7:251-257.
© 1997 by Elsevier Science Inc.
KEY WORDS: Confounding Factors, Data Interpretation, Econometric Models, Epidemiologic Methods, Multivariate Analysis, Statistics.
INTRODUCTION
Health outcomes are the result of interactions among a complex set of genetic, biological, behavioral, socioeconomic, and environmental factors. During the past decade there has been a shift in the way in which studies of mortality and morbidity conceptualize the pathways leading to these outcomes. Whereas in the past the determinants of health were primarily looked at in separate categories, such as biomedical or socioeconomic factors, several authors have laid the foundation for a framework in which proximate biological and behavioral components of health are integrated along with their underlying socioeconomic and environmental factors into a single conceptual model (1, 2). Thus, evaluation of nutritional status as a dependent variable, for example, requires recognition that while nutritional status is affected by such behavioral factors as dietary intake and physical activity, the latter are themselves determined by a set of underlying factors such as food availability, income and education. Incorporating these relationships in statistical analysis, however, is not without its problems. As the model gets more complex, encompassing a greater number of equations, situations arise in which an explanatory variable in one equation is itself a dependent variable in another, both equations being part of the same model. For example, while nutritional status is determined by dietary intake, the latter is a behavior which is modifiable by the individual in response to his or her nutritional or health status, and therefore, neither can be evaluated without considering the simultaneous effect of the other. In addition, health outcomes are influenced by attributes unique to individuals, such as genetic predispositions, not all of which are explicitly incorporated into any model. A number of recent papers in the epidemiologic literature contend that ignoring these issues in the analysis of health outcomes leads to erroneous estimates of the impact
of individual variables (3-5). These issues, which are considered to be of great importance in the econometric literature, are rarely addressed in epidemiologic studies (6-8). The potential usefulness of econometric techniques for dealing with these issues can be better understood by reconciling the econometric concepts of endogeneity and unobserved heterogeneity with the epidemiologic concept of confounding. In this paper we explore unobserved heterogeneity and endogeneity, and their relationship to each other, in an attempt to clarify their importance in epidemiologic research.
From the Carolina Population Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina. Address reprint requests to: Namvar Zohoori, Carolina Population Center, University of North Carolina, University Square East, CB 8120, Chapel Hill, NC 27516. Received October 1, 1995; accepted March 10, 1997.
THE APPROACH
Zohoori and Savitz, ENDOGENEITY AND UNOBSERVED HETEROGENEITY, AEP Vol. 7, No. 4, May 1997: 251-257
Definitions: Endogeneity and Unobserved Heterogeneity
The dictionary defines heterogeneity as “dissimilarity” or “the condition or state of being heterogeneous.” This concept of dissimilarity or heterogeneity among subjects, or individual observations, is germane to the proper functioning of statistical models (for the sake of simplicity in this paper, we base our examples on the classical linear regression model, although the principles outlined are equally applicable to other, more complex models). In such models, the variation in a given characteristic (dependent variable) in a population or sample is “explained” or “accounted for” based on the dissimilarity of the observations with respect to a given set of explanatory (independent) variables. If models were perfect, that is, if all possible variables could be perfectly measured and included in the model, one would be able to explain all of the variation in the dependent variable and obtain a correlation coefficient of 1. Thus, in the simplest case, we would have
$$Y = \alpha X \qquad \text{(Model 1)}$$
where Y is the dependent variable, X is the explanatory variable, and $\alpha$ is the regression coefficient associated with X. In such a case all observations would fall on the regression line. In practice, however, no model is perfect, nor is any model that simple. At best, purely random variations in the right-hand side variables would necessitate the inclusion in the model of another term, the error term, in order to capture that part of the variation in the dependent variable not explained by the heterogeneity of the explanatory variables. In addition, most models require more than one independent variable. The model is then represented as
$$Y = \alpha_1 X + \alpha_2 Z + \varepsilon \qquad \text{(Model 2)}$$
where Y is the dependent variable, X and Z are individual (or vectors of) explanatory variables, $\alpha_1$ and $\alpha_2$ are individual (or vectors of) partial regression coefficients associated with X and Z, respectively, and $\varepsilon$ is the residual. Statistical theory assumes that $\varepsilon$ is independent of all of the explanatory variables in the model. Generally, if these conditions are met, then the model is an appropriate one for the data. However, many models will be more complicated than this. For example, consider the situation where X and Z are jointly determined by another variable, Q. The relationships among the variables can now be portrayed as in Figure 1. Here, Y is determined by the combined effects of X and Z, which are themselves determined by Q. Note that Q is not an explicit part of the model, which can still be written as
$$Y = \alpha_1 X + \alpha_2 Z + \varepsilon_Y.$$
However, the implicit relationship can be portrayed by the following three equations:
$$\begin{aligned} Y &= \alpha_1 X + \alpha_2 Z + \varepsilon_Y \\ X &= \alpha_3 Q + \varepsilon_X \\ Z &= \alpha_4 Q + \varepsilon_Z \end{aligned} \qquad \text{(Model 3)}$$
FIGURE 1. A graphic presentation of Model 3.
In epidemiologic terms, Z is recognized as a confounder of the X-Y relationship, given that it is a risk factor for the dependent variable, Y, correlated with another independent variable, X, and not a part of the causal pathway between X and Y (9). In this sense, Z has to be controlled for in one way or another, and to ignore its effect, by leaving it out of the analysis, would produce a biased estimate of the effect of X on Y. For example, assume that Y is nutritional status, X is physical activity, Z is a measure of dietary intake, and Q is age. If age has an effect on diet as well as on physical activity, then any study of the relationship of diet to nutritional status will be confounded by physical activity, which must be controlled. Ignoring Q, the determinant of X and Z, leads to a familiar situation of confounding. If we consider model 3 in its entirety, then two classes of variables emerge: Q is an exogenous variable, in that its determinants are not part of the model, whereas X and Z are determined, to some extent, by factors within the three-equation model. In econometric terms, this property defines X and Z as endogenous variables, that is, variables determined in part by other variables in the model. Model 3 was developed to illustrate the familiar problem of confounding for our discussion. We now return to model 2 and develop that model along a different line in order to illustrate a similar situation that can arise in the presence of endogeneity and unobserved heterogeneity. In model 2, it is generally assumed that the error term, $\varepsilon$, is purely random, that is, it is uncorrelated with X or Z. Suppose now that the dependent variable, Y, is affected not only by X, Z, and random variation, but also by certain nonrandom factors specific to the individual which are not measurable or observable, such as an inherent rate of metabolism or a genetic predisposition to obesity or to leanness. These are variables that may affect the nutritional status of an individual independent of the effects of diet or physical activity, and which would have been entered in the model if they were known or measurable. In this case the model would be
$$Y = \alpha_1 X + \alpha_2 Z + \mu EU_Y + \varepsilon_Y \qquad \text{(Model 4)}$$
where $\varepsilon_Y$ is the truly random residual in the model, $EU_Y$ is a vector of unobserved characteristics for each individual, $\mu$ is an unknown parameter explaining the relationship of $EU_Y$ to Y, and $\mu EU_Y + \varepsilon_Y = E_Y$. In this model, $EU_Y$ is defined as unobserved heterogeneity, that is, interindividual differences that are nonrandom but also not subject to direct measurement. The effect of not being able to explicitly include $EU_Y$ in the model will, of course, depend on the assumptions that we make regarding the relationship between $EU_Y$ and the other variables in the model. If we assume that $\mathrm{COV}(EU_Y, X) = 0$ and $\mathrm{COV}(EU_Y, Z) = 0$, then leaving out the $\mu EU_Y$ term in the model will simply lead to an under-specified model with a larger residual ($E_Y = \mu EU_Y + \varepsilon_Y$), but no effect on the parameter estimates ($\alpha_1$ and $\alpha_2$), other than larger standard errors. This is a traditional, implicit assumption made about the myriad unmeasured influences on the dependent variable. However, the main premise of the discussion in this paper is that, under certain circumstances, unobserved heterogeneity will be correlated with the right-hand side variables, thus acting as a confounder. The proper evaluation of the validity of the assumption of no correlation between $EU_Y$ and the other predictor variables will be greatly aided by an understanding of the concept of unobserved heterogeneity, its relationship to endogenous variables, and the consequences of that relationship.
The Problem: Confounding
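Before the model is extended, the two situations described so far can be contrasted numerically. The following is a minimal simulation sketch, not part of the original analysis: the coefficient values, the sample size, and the helper name `ols` are all assumptions chosen for illustration. It shows that omitting Z when Q drives both X and Z (model 3) distorts the estimate of $\alpha_1$, whereas an unobserved term that is uncorrelated with X and Z (model 4 under the zero-covariance assumption) leaves the estimates essentially unchanged.

```python
import numpy as np

def ols(y, *columns):
    """Least-squares slopes of y on the given columns.

    All simulated variables have mean zero, so no intercept is fitted.
    """
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 200_000
alpha1, alpha2 = 1.0, 2.0            # assumed "true" effects of X and Z

# Model 3: Q determines both X and Z, so X and Z are correlated.
Q = rng.normal(size=n)
X = 0.8 * Q + rng.normal(size=n)
Z = 0.8 * Q + rng.normal(size=n)
Y = alpha1 * X + alpha2 * Z + rng.normal(size=n)

adjusted = ols(Y, X, Z)              # Z controlled
crude = ols(Y, X)                    # Z left out: confounded estimate of alpha1

# Model 4 with EU_Y uncorrelated with X and Z (illustrative mu = 1.5).
EU_Y = rng.normal(size=n)
Y4 = alpha1 * X + alpha2 * Z + 1.5 * EU_Y + rng.normal(size=n)
model4 = ols(Y4, X, Z)               # EU_Y omitted, absorbed by the residual

print("adjusted alpha1, alpha2:", adjusted)
print("crude alpha1 (Z omitted):", crude)
print("model 4 alpha1, alpha2 (EU_Y omitted):", model4)
```

Under these assumed values the crude estimate of $\alpha_1$ drifts well away from 1, while the model 4 estimates stay close to the assumed values despite the larger residual variance.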
First, model 4 is extended a step further by specifying that the variable Z is itself determined by two other variables, V and W. As an extension of our example, V and W may include such factors as income, food availability, seasonal variations, cultural or religious beliefs, gender, education, etc. The new model can now be written as
$$\begin{aligned} Y &= \alpha_1 X + \alpha_2 Z + \mu EU_Y + \varepsilon_Y \\ Z &= \beta_1 V + \beta_2 W + \mu EU_Z + \varepsilon_Z \end{aligned} \qquad \text{(Model 5)}$$
Note that Z is an endogenous variable and X, V, and W are exogenous. Also note that each of the equations describing Y and Z has associated with it its own error term, $\varepsilon$, and unobserved heterogeneity, EU. Now, assume that, for any given individual, H is the set of the effects of all possible unobserved characteristics, that is, $H = \{C_1, C_2, \ldots, C_n\}$, where the C's represent the effects of individual characteristics. One can think of H as a set of unobserved (and therefore omitted) variables, such factors as genetic predispositions, inherent rates of metabolism, innate levels of healthiness, etc. Clearly though, for any given outcome in any given individual, not all of the elements of H will affect that outcome. Therefore, for a given outcome in a given individual, EU will be reflective of a specific subset of H. For example, in model 5, let us assume that for a given individual, the outcome Y is affected by X, Z, random error ($\varepsilon_Y$) and the unobserved characteristics $C_1$, $C_2$, $C_3$, $C_4$, $C_5$ and $C_6$, and that for that same individual the outcome Z is affected by V, W, random error ($\varepsilon_Z$) and the unobserved characteristics $C_4$, $C_5$, $C_6$, $C_7$, $C_8$ and $C_9$. In other words, $EU_Y = \{C_1, C_2, C_3, C_4, C_5, C_6\}$ and $EU_Z = \{C_4, C_5, C_6, C_7, C_8, C_9\}$. It follows that $EU_Y \cap EU_Z = \{C_4, C_5, C_6\}$, that is, the intersection of $EU_Y$ and $EU_Z$ is not an empty set.
FIGURE 2. A graphic presentation of the final model of a system with endogeneity and unobserved heterogeneity, showing all the relevant relationships among the variables.
Even if the intersection of $EU_Y$ and $EU_Z$ is an empty set, the extent to which all or some of the members of the two sets are affected by the same external factors determines the degree of correlation between $EU_Y$ and $EU_Z$. Therefore, it can generally be assumed that for any given individual the unobserved heterogeneity terms are likely to be correlated across equations. From this discussion one can attribute a number of general properties to unobserved heterogeneity (the EU term): 1. It is individual-specific; 2. It is outcome-specific; 3. It can be thought of as a vector of missing variables; and 4. For any given individual, the EU terms are very likely to be (though not necessarily) correlated across equations, to a degree that will vary in different situations. Figure 2 shows all the relevant relationships among variables. Here Z, an endogenous variable, is in part a function of $EU_Z$, its vector of unobserved characteristics, and, based on the properties outlined above, $EU_Z$ is correlated with $EU_Y$. Therefore, it follows that Z is correlated with $EU_Y$. Here, then, we have a situation analogous to model 3, where a variable, $EU_Y$ (albeit unobserved, and therefore acting through the error term $E_Y = \mu EU_Y + \varepsilon_Y$), is a risk factor for the outcome, Y, while at the same time being correlated with another explanatory variable, Z. $EU_Y$, therefore, operates as a confounder of the Z-Y relationship.
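The argument above can be illustrated with a small simulation of model 5. This is a sketch under stated assumptions rather than a reproduction of any analysis in the paper: the correlation `rho` between the two unobserved heterogeneity terms, the coefficient values, and the unit coefficient on each EU term are all chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
alpha1, alpha2 = 1.0, 2.0     # assumed true effects of X and Z on Y
rho = 0.6                     # assumed correlation between EU_Y and EU_Z

# Unobserved heterogeneity terms, correlated across the two equations.
EU_Z = rng.normal(size=n)
EU_Y = rho * EU_Z + np.sqrt(1 - rho**2) * rng.normal(size=n)

# Exogenous variables X, V, W; Z is endogenous because it depends on EU_Z.
X, V, W = rng.normal(size=(3, n))
Z = 0.7 * V + 0.7 * W + EU_Z + rng.normal(size=n)
Y = alpha1 * X + alpha2 * Z + EU_Y + rng.normal(size=n)

# Ordinary least squares of Y on X and Z, with EU_Y left in the error term.
design = np.column_stack([X, Z])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
corr_Z_EUY = np.corrcoef(Z, EU_Y)[0, 1]

print("correlation of Z with EU_Y:", corr_Z_EUY)
print("OLS estimates of alpha1, alpha2:", coef)
```

With these assumed values, Z is positively correlated with $EU_Y$ and the estimated coefficient on Z exceeds the assumed value of 2, while the coefficient on the exogenous variable X is essentially unaffected.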
Control for Confounding: Instrumental Variables
Unobserved heterogeneity is therefore a unique kind of confounder in that, by definition, it is not possible to measure it directly and thus to explicitly incorporate it into the model in the usual manner. Further, as stated before, the EU term represents a vector consisting of the effects of a number of variables, not all of which are necessarily confounders in a given model. However, in controlling for confounders, the general advice to epidemiologists is to “restrict attention to the control of only those (previously studied) extraneous variables that the investigator anticipates may account for the hypothesized relationship . . . being studied” (10). In the case of unobserved heterogeneity, however, some of the factors involved may not have been previously studied, and may in fact be totally unknown. Therefore, “controlling” for unobserved heterogeneity as a confounder may run counter to the conventional epidemiologic view. From the epidemiologist's point of view, these are valid concerns if one were trying to “control” for the confounding effect of unobserved heterogeneity by conventional methods. However, instead of controlling directly for unobserved heterogeneity (which we are not able to do), there is a way to prevent it from acting as a confounder. Recall from model 5 (and Figure 2) that $EU_Y$ is a confounder of the Z-Y relationship because of its nonzero correlation with Z while at the same time being a risk factor for Y. Therefore, if the
association of $EU_Y$ with Z can be eliminated, then there will no longer be confounding. Econometricians have devised methods of purging such models of confounding variables. The solution lies in being able to find a variable, called an instrument (or instrumental variable), that is highly, though not completely, correlated with the endogenous variable, but is uncorrelated with the error term. This instrumental variable is then used as a surrogate in place of the original endogenous variable. Figure 3 depicts the role of an instrumental variable in the face of confounding (adapted from Kennedy) (8). For the equation $Y = \beta Z + a$, imagine that Y is represented by the circle Y, Z is represented by the circle Z, and the error term a is represented by area a (ignore, for now, circle Z'). If $\mathrm{COV}(Z, a) = 0$, then the information in area b + c + d can be used to produce an unbiased estimate of the regression coefficient, $\beta$, when Y is regressed on Z. However, suppose that $\mathrm{COV}(Z, a) \neq 0$. Graphically, this covariance can be represented by an overlap of area a with the Z circle. This overlap is represented by area b. That is, one is assuming that variation in Y in area b is due to the influence of both Z and the error term, a. The size of area b in this diagram is chosen arbitrarily. The actual size of area b will depend on the degree of correlation between Z and area a. When Y is regressed on Z in the presence of b, the regression coefficient estimated based on the information in area b + c + d is biased because the information in b is not solely due to the effect of Z on Y. Therefore, one needs to find a way to eliminate b. Let circle Z' be an instrument for Z. Note that Z' satisfies both conditions necessary for an instrumental variable, that is, it is uncorrelated with the error term (no overlap with area a + b), and it is highly (though not perfectly) correlated with the independent variable (large overlap with circle Z). Regressing Z on Z' produces the predicted value Z''.
Now when Y is regressed on Z'', only the information in area d is used for parameter estimation, which will produce an unbiased estimate. Note, however, that not all the information available (b + c + d) is used in estimating $\beta$, and therefore, although the bias associated with $\beta$ is now removed, the variance of the estimate of $\beta$ is larger. In other words, in this exercise we are sacrificing some precision for a reduction in bias, consistent with the recommendation that “confounding should take precedence over precision” (10). In econometric terms, it is better to use an inefficient but consistent instrument (one that will produce a less precise but unbiased parameter estimate) than to use an inconsistent variable (one that will produce a biased parameter estimate). While in some studies authors may choose to live with some degree of bias over a loss of precision, this judgement is difficult to make without knowledge of the magnitude of bias that is present. However, as many econometricians recognize, the choice of an instrumental variable is rather arbitrary and sometimes
difficult. From model 5 it can be seen that both V and W are possible candidates for an instrument for Z. Another option, particularly when dealing with longitudinal data, is to use the lagged value of the endogenous variable as the instrument for that variable (for example, one could use dietary intake in a previous time period as an instrument for dietary intake at the time of the study). However, in this case, one needs to be able to make the assumption that the values of the variable of interest are serially correlated over time, but that the measurement errors are not. The two examples above demonstrate a general case, namely, that any exogenous variable (or combination thereof) within the system, and a lagged value of the variable in question, are common candidates for instrumental variables. As the model gets more complex, however, lagged values of endogenous regressor variables themselves may become endogenous and thus correlated with error terms, leaving exogenous variables (which by the assumption of exogeneity are uncorrelated with the error term) to be considered the best candidates (8). Which exogenous variables, or combination thereof, would make the best instrument cannot be determined directly. Among econometricians, one commonly accepted practice is to arrive at the 'best' instrument by regressing the endogenous variable in question on all the exogenous variables in the system, and calculating a predicted value for the endogenous variable. In econometric terms this is known as estimating the reduced form. This estimate is then used as the instrument for the endogenous variable in the original equation, substituting the predicted value of Z (which is uncorrelated with $EU_Y$) for the measured value of Z (which is correlated with $EU_Y$). (This procedure is also known as “Two Stage Least Squares”).
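The two-stage procedure just described can be sketched in a few lines. The code below is an illustrative sketch only: the function name `two_stage_least_squares`, the simulated data, and all coefficient values are assumptions introduced here, and the data-generating process simply mirrors model 5 with correlated unobserved heterogeneity terms.

```python
import numpy as np

def lstsq_fit(y, design):
    """Least-squares coefficients of y on the columns of design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def two_stage_least_squares(Y, X_exog, Z_endog, instruments):
    """Two-stage least squares with one endogenous regressor.

    Stage 1 (reduced form): regress Z on all exogenous variables in the system.
    Stage 2: regress Y on the exogenous regressors and the predicted Z.
    Returns the stage-2 coefficients, with the coefficient on Z last.
    """
    first_stage = np.column_stack([X_exog, instruments])
    Z_hat = first_stage @ lstsq_fit(Z_endog, first_stage)
    second_stage = np.column_stack([X_exog, Z_hat])
    return lstsq_fit(Y, second_stage)

# Simulated data following model 5 (all numeric values are assumptions).
rng = np.random.default_rng(2)
n = 200_000
EU_Z = rng.normal(size=n)
EU_Y = 0.6 * EU_Z + 0.8 * rng.normal(size=n)   # correlated with EU_Z
X, V, W = rng.normal(size=(3, n))
Z = 0.7 * V + 0.7 * W + EU_Z + rng.normal(size=n)
Y = 1.0 * X + 2.0 * Z + EU_Y + rng.normal(size=n)

b_ols = lstsq_fit(Y, np.column_stack([X, Z]))
b_2sls = two_stage_least_squares(Y, X, Z, np.column_stack([V, W]))

print("OLS  estimates (X, Z):", b_ols)
print("2SLS estimates (X, Z):", b_2sls)
```

In this simulated setting the ordinary least squares coefficient on Z is pushed above the assumed value of 2, while the two-stage estimate is close to it. Note that standard errors read directly off the second-stage regression are not the correct two-stage standard errors; dedicated instrumental variables routines adjust for this.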
For example, this is the approach taken in the analysis by the Cebu Study Team (5) and by Zohoori (11), in which endogenous variables were replaced by their predicted values based on purely exogenous variables. However, in both of these studies, the ability to do this was due to the fact that the datasets used for the analyses contained measurements of a large enough number of exogenous variables. Related to this is the point that the suitability of an instrument depends on the degree of its correlation with the original endogenous variable. Where this correlation is high, the instrument will be a good proxy for the endogenous variable in question.¹
¹ Please see reference 11 for a fuller discussion of this and another related issue, the correlation between different instrumental variables within a system of equations.
There are techniques for testing the endogeneity of independent variables, and thus the need for and suitability of instrumental variables, described in detail in econometrics textbooks (6-8). One such technique is Hausman's specification error test (12). Consider the simple model
$$y = \beta x + \varepsilon$$
We wish to test the null and alternative hypotheses
$H_0$: x and $\varepsilon$ are not correlated
$H_1$: x and $\varepsilon$ are correlated
Let $\hat\beta_1$ be the estimate of $\beta$ when the original variable, x, is used in the equation, and $\hat\beta_2$ be the estimate of $\beta$ when $\hat{x}$, the instrumental variable for x, is used instead of x. Then under the null hypothesis both $\hat\beta_1$ and $\hat\beta_2$ will be unbiased estimates of $\beta$ (although $\hat\beta_2$ will be less precise than $\hat\beta_1$). Under the alternative hypothesis, however, only $\hat\beta_2$ will be unbiased. Thus, if $q = \hat\beta_1 - \hat\beta_2$ does not differ significantly from 0, the null hypothesis is not rejected. Hausman showed that the test statistic $m = q^2 / \mathrm{var}(q)$ can be used as $\chi^2$ with 1 degree of freedom.
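The test described above can be sketched for the simple one-regressor model. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `hausman_statistic` is hypothetical, a single exogenous variable `w` is assumed to be available for constructing $\hat{x}$, and $\mathrm{var}(q)$ is computed as the difference between the variances of the two estimators using the residual variance from the regression on the original x, which is one common form of the statistic.

```python
import numpy as np

CHI2_1DF_CRITICAL_5PCT = 3.841   # upper 5% point of chi-square with 1 df

def hausman_statistic(y, x, w):
    """Return (beta_1, beta_2, m) for y = beta * x + error (no intercept).

    beta_1 uses the original x; beta_2 uses x_hat, the prediction of x from
    the exogenous variable w. m = q**2 / var(q) with q = beta_1 - beta_2.
    """
    x_hat = w * (w @ x) / (w @ w)                 # first-stage prediction
    beta_1 = (x @ y) / (x @ x)
    beta_2 = (x_hat @ y) / (x_hat @ x_hat)
    resid = y - beta_1 * x
    s2 = (resid @ resid) / (len(y) - 1)           # residual variance under H0
    var_q = s2 * (1.0 / (x_hat @ x_hat) - 1.0 / (x @ x))
    q = beta_1 - beta_2
    return beta_1, beta_2, q**2 / var_q

# Simulated illustration (all values are assumptions).
rng = np.random.default_rng(3)
n = 50_000
w = rng.normal(size=n)
u = rng.normal(size=n)
x = w + u
y_null = 1.0 * x + rng.normal(size=n)             # error unrelated to x
y_alt = 1.0 * x + 0.5 * u + rng.normal(size=n)    # error correlated with x

b1_null, b2_null, m_null = hausman_statistic(y_null, x, w)
b1_alt, b2_alt, m_alt = hausman_statistic(y_alt, x, w)

print("H0 true : beta_1, beta_2, m =", b1_null, b2_null, m_null)
print("H0 false: beta_1, beta_2, m =", b1_alt, b2_alt, m_alt)
print("reject H0 in the second case:", m_alt > CHI2_1DF_CRITICAL_5PCT)
```

When the error term is correlated with x, the two estimates diverge and m far exceeds the 5% critical value; when it is not, m will usually, though not always, fall below it.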
DISCUSSION
In the preceding sections we have presented the econometricians' rationale for consideration of endogeneity and unobserved heterogeneity and reconciled that perspective with the epidemiologists' concern with confounding. The principal advantage of the conceptual model and the resulting statistical approach presented here is that confounding due to endogeneity and unobserved heterogeneity may be eliminated by use of an “instrumental variable” to derive a more valid measure of the true etiologic relation between the exposure of interest and the health outcome. While a few recent papers in the health statistics and epidemiology literature have applied the instrumental variables approach to very specific situations (13-16), here we have carried this a step further by outlining the problem and the solution in general epidemiologic terms and within the general context of epidemiologic studies of health outcomes. It must be reemphasized that the procedures outlined here are for removing that confounding that is due to the presence of endogeneity and unobserved heterogeneity. These procedures are, therefore, to be considered in addition to the usual methods of dealing with other known confounding variables, such as stratification and adjustment through multivariate models. In practice, there are a number of considerations that challenge the potential gains in validity. First, an array of assumptions is required for the confounding to operate as postulated and for the confounding to be removed as intended. One must first postulate a set of nonrandom, unmeasurable (or unknown) individual attributes (denoted $\mu EU_Y$)
that influence the dependent variable. Obviously the most intuitive, desirable solution to such a situation is to find some means of measuring those attributes to better understand their relationship with the outcome and explicitly consider their potential confounding role. Next, the assumption needs to be made that there is an analogous set of unmeasurable influences on the independent variable of interest ($\mu EU_Z$). Finally, there must be some correlation between the two sets of unmeasured influences. Although plausible scenarios can be postulated in which all these assumptions could hold, the inability to empirically examine either variable or the relationship between the two variables leaves the constructs with little empirical anchor. The only evidence that can be produced regarding whether the phenomenon is present is obtained when correction procedures are introduced, and the parameter estimates accounting for the phenomenon are compared to those that ignore it. The correction procedure itself is based on a number of assumptions as well. The goal is to create a variable that retains the desired components as a predictor of the outcome, yet is free of the correlated error components. However, without measurement of either part directly, the “correct” choice of an instrumental variable cannot be verified. In addition, there is the problem of identification: there is a need for adequate data from which to develop meaningful instrumental variables. In other words, the dataset needs to contain enough exogenous variables so that each instrumental variable can be uniquely identified. If this is not the case, then two major problems arise. First, the instrumental variables will not adequately represent the endogenous variables, and second, the instrumental variables, having been predicted from a limited and excessively overlapping set of exogenous regressors, will be correlated with each other to an undesirable extent.
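The second identification problem mentioned above can be made concrete with a small sketch. The setup is entirely hypothetical: two endogenous variables, `Z1` and `Z2`, are each predicted first from a single exogenous variable and then from two, and the correlation between the resulting instruments is compared. Variable names and coefficients are assumptions made for illustration.

```python
import numpy as np

def fitted(y, design):
    """Predicted values from a least-squares regression of y on design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef

rng = np.random.default_rng(4)
n = 5_000
ones = np.ones(n)
V, W = rng.normal(size=(2, n))

# Two endogenous variables that depend differently on V and W.
Z1 = 1.0 * V + 0.5 * W + rng.normal(size=n)
Z2 = 0.4 * V - 1.0 * W + rng.normal(size=n)

# Too few exogenous variables: both instruments are functions of V alone.
few = np.column_stack([ones, V])
r_few = np.corrcoef(fitted(Z1, few), fitted(Z2, few))[0, 1]

# Enough exogenous variables: each instrument draws on V and W differently.
both = np.column_stack([ones, V, W])
r_both = np.corrcoef(fitted(Z1, both), fitted(Z2, both))[0, 1]

print("correlation between instruments, V only :", r_few)
print("correlation between instruments, V and W:", r_both)
```

With only one exogenous predictor the two instruments are perfectly collinear, so their separate effects cannot be distinguished in the outcome equation; with two predictors the instruments carry distinct information.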
Related to this issue is the problem of judging how strongly the instrumental variable should be predicted by the exogenous variables. If the endogenous variable of interest is perfectly predicted, then the instrumental variable is the same as the measured variable and has the same undesirable attributes. Conversely, at the other extreme, if there is no predictive ability from the exogenous factors, the instrumental variable will have no variation whatsoever and be completely unrelated to the outcome of interest. When there is potential for endogeneity and unobserved heterogeneity to distort relations between variables, it could be argued that analyses should be conducted in parallel. If the analysis using the instrumental variable generates a different result than the measured variable, then the judgement as to which is better becomes rather subjective. Given the chain of assumptions required for the phenomenon to occur and be corrected as desired, there are other ways in which “correction” could yield a different answer that is not more valid than the uncorrected result. Of course, if the results are similar, then reassurance is provided that
there is less likely to be distortion resulting from the postulated form of confounding. The question may arise as to why epidemiologists have not been as concerned with these issues in the past. Like most disciplines that have a strong statistical component, epidemiology has developed close links with biostatistics, which shares a common perspective. As the discipline has evolved, largely in schools of public health and medicine, the linkages to statisticians who work in other departments such as statistics, economics, sociology, or psychology, have become increasingly remote. Nonetheless, there are many shared challenges across disciplines, particularly those that arise in observational rather than experimental fields such as epidemiology, economics, and sociology. The concerns raised by econometricians are clearly worthy of epidemiologists' attention. Under specific assumptions, uncontrolled confounding will be present, and under further assumptions, such confounding can be controlled, if not measured. What is ultimately needed is for this rather speculative scenario to be penetrated with empirical observation. Examples of such empirical analyses in epidemiologic literature are very few (11, for example). This may be further enhanced through development of better statistical or measurement tools for evaluating the validity of the underlying assumptions. At present, the approach presented here should be considered at the very least as a potentially useful tool to assist in the interpretation of data, but not as an obligatory analytic strategy.
Preparation of this paper was supported in part by a grant (P01HD2807601) from the National Institutes of Health. The authors wish to thank Dr. Jim VanDerslice and two anonymous reviewers for their very valuable comments and suggestions in shaping this paper for publication.
REFERENCES
1. Bongaarts J. A framework for analyzing the proximate determinants of fertility. Popul Dev Rev. 1978;4:105-132.
2. Mosley WH, Chen LC. An analytic framework for the study of child survival in developing countries. Popul Dev Rev. 1984;10:25-48.
3. Briscoe J, Akin JS, Guilkey DK. People are not passive acceptors of threats to health: endogeneity and its consequences. Int J Epidemiol. 1990;19(1):147-153.
4. Popkin BM, Adair L, Akin JS, Black R, Briscoe J, Flieger W. Breastfeeding and diarrheal morbidity. Pediatrics. 1990;86(6):874-882.
5. Cebu Study Team. Underlying and proximate determinants of child health: The Cebu Longitudinal Health and Nutrition Study. Am J Epidemiol. 1991;133(2):185-201.
6. Wonnacott RJ, Wonnacott TH. Econometrics. New York: John Wiley and Sons; 1979.
7. Maddala GS. Introduction to Econometrics. New York: Macmillan Publishing Company; 1988.
8. Kennedy P. A Guide to Econometrics. Cambridge, MA: The MIT Press; 1992.
9. Rothman KJ. Modern Epidemiology. Boston/Toronto: Little, Brown and Company; 1986;92-94.
10. Kleinbaum DG, Kupper LL, Muller KE. Applied regression analysis and other multivariate methods. Boston, MA: PWS-Kent Publishing Company; 1988;171-172.
11. Zohoori N. Does endogeneity matter? A comparison of empirical analyses with and without control for endogeneity. Ann Epidemiol. 1997;7:258-266.
12. Hausman JA. Specification tests in econometrics. Econometrica. 1978;46:1251-1271.
14.
15.
16.