Journal of Econometrics 39 (1988) 23-52. North-Holland
ON THE INTERPRETATION AND OBSERVATION OF LAWS

John W. PRATT and Robert SCHLAIFER

Harvard University, Boston, MA 02163, USA

A law with factors x and concomitants z specifies a distribution given z of a potential value Y_x that is defined for each x whether or not it is observed. An observed distribution of Y given x and z agrees with the law if and only if, given z, the observed x is independent of Y_x or, equivalently, of the joint effect U_x of excluded variables on Y_x. To establish such independence in non-experimental data requires exhaustive exploration of the effects of concomitants, causal and non-causal; R² and F are irrelevant. We show how the model-free theory applies to linear models, time series, and simultaneous equations, and point out its Bayesian implications.
1. Introduction
1.1. Laws and the condition for their observability

In this paper we define a class of relations that we shall call laws and discuss the conditions under which they can be observed in data and hence well estimated given enough data. Our laws include both parametric relations of the kind econometricians call 'structural equations' and wholly or partially 'model-free' relations of the kind discussed in the statistical literature on experimentation. They also include physical laws of the kind typified by Newton's or the laws of electrodynamics (but not those that assert for example that energy is conserved or action is minimized); and we shall use physical laws as examples when so doing clarifies a definition or argument.

After defining 'law' we shall point out that a law can be observed in data if and only if the process that generated the data satisfied a condition first stated precisely by Rubin (1978) but implicit in R.A. Fisher's (1935) examples of proper analysis of experiments involving both 'factors' and 'concomitant measurements'. Then going beyond Rubin we shall point out that this condition remains necessary and sufficient when a law cannot naturally be thought of as expressing the effect of a 'treatment' on an 'experimental unit' and also when a law applies to a time series rather than a cross-section. Finally, we shall show that it is violation of this same condition that leads to the need for special techniques when interdependent simultaneous laws are to be estimated from equilibrium data.
Something very much like the form of this condition that applies in the absence of concomitant measurements appears in many books on econometrics, but never quite correctly stated. Our exposition of the condition owes much to Rubin, but we shall refer very infrequently to Rubin and related authors because the language in which they generally interpret the condition makes it seem to apply to only a very few of the very many kinds of real-world problems to which it actually applies. We are in the position, unusual for critics, of insisting that Rubin's contribution is much more important than he or his followers have even suggested that it is.

Turning from theory to practice, we shall call attention to some implications of this condition that although obvious are often overlooked. In particular, it is only when data are obtained by randomized experimentation that a strong statistical association between y and a particular group of x's can legitimately be taken to show that the value of y is in large part determined by these x's. In data of the kind ordinarily available to econometricians, such an association shows only that the value of y is in large part determined by these x's and/or other x's with which these x's are correlated. Economic theory may assert that it is in fact these x's rather than others that largely determine the value of y, but if one wants to argue that the theory is supported by data, evidence of a kind quite different from a large value of R² must be adduced.
1.2. The word 'cause' and its derivatives

In discussing a particular law with a specific real-world interpretation we may use the word 'cause' or 'because' if it seems to us that natural English usage permits us to do so; but when we use these words we shall do as Feynman (1967, p. 53) does in discussing The Character of Physical Law and neither define them nor agree or disagree with anyone else's definitions. Zellner (1984) provides a very useful survey of the econometric literature on causation. A simplified version of Rubin's model is stated and explicated by Holland (1986).

We decline to debate the 'true' meaning of the word 'cause' or any of its derivatives because what these words should convey about the relation of B to A over and above what is conveyed by the law or laws that relate B to A seems to us to be a question of purely metaphysical or lexicographical interest, irrelevant to either scientific understanding or practical control of the real world. Whether or not a cause must precede its effect, engineers who design machines that really work in the real world will continue to base their designs on a law which asserts that acceleration at time t is proportional to force at that same time t; and this will be true whether it is force that is controlled and acceleration that responds or vice versa. The meaning that merchants attach to what a law says about the change in demand that will or would accompany a
given change in price will be the same whether or not this law is embedded in a system of laws; and a fortiori it will be the same whether or not the change in price may be said to cause the change in demand if the laws are recursive but not if they are interdependent. The meaning of a law does not depend on how it is estimated.

1.3. Brief outline

In section 2 we present the theory, first for laws without concomitants and then for laws with concomitants. In each case, we first state the essential propositions in a form that is 'model-free' in the sense that no assumptions are made about the functional form of a law, and we then interpret them in terms of linear models. We hope that when presented in this order rather than the reverse order employed in our 1984 paper the theory will seem as simple - not to say obvious - as it really is.

In section 3, on implications for practice, we first argue that contributions to R² and P-levels of the F-statistic are of no relevance whatever when one wishes to show that a statistical model expresses a law, and we go on to argue that a law should often contain exogenous variables that cannot possibly affect the endogenous variable on the LHS. We conclude with brief remarks on time series, interdependent laws, and Bayesian analysis.

In section 4, which is really of the nature of an appendix, we first point out that, depending on its interpretation, a requirement that exogenous variables be either uncorrelated with or independent of 'the' excluded variables is either meaningless or impossible to satisfy. We then proceed, first to show what requirements can and must be imposed on the relation between included and excluded variables, and then to give a correct definition of the 'joint effect of excluded variables'.

2. Logical foundations

2.1. Prolegomena

To accomplish our purposes, we shall need a framework and notation that are unusually explicit in certain respects. After explaining them in the present section, we shall put them to use in the next.
2.1.1. The random variables X_i and Y_x

The first two members of our cast of characters (here and hereafter i indexes observations) are the following:

1. A sequence of random variables X_i with the same set of possible values {x} for all i. Their joint distribution is arbitrary. For some or all i, the distribution of X_i may be degenerate in the sense that it assigns probability 1 to one particular x.

2. A set of generic random variables Y_x, one for each x, with Y_{xi} independently identically distributed as Y_x. Their distributions are at least almost always proper in applications of the kind discussed in statistics and econometrics but degenerate in most if not all of the laws of physics that were known before the present century.

The X's and Y's are observable, although on any one observation Y_{xi} is observed for only one x. If x has a finite range, one may think of Y as a vector with elements Y_x, only one of which is observed. The values of the X's and Y's may be of any mathematical nature whatever: discrete or continuous, scalar, vector, matrix, or what have you. As one example of the possible interpretations of the X's and Y's, the values x of X_i and y_{xi} of Y_{xi} may represent, respectively, the force applied to the ith object and the acceleration that results. As another, they may identify or describe the agricultural treatment applied to the ith plot of land and the yield that results. As still another, y_{xi} may represent the value in the ith year of the endogenous variable on the LHS of a time-series model, while x represents the values of the endogenous and predetermined variables on the RHS. The predetermined variables may include exogenous and endogenous variables lagged to any depth. Infinite lags present no special problem.

It is because we are discussing laws rather than statistical associations that we need a set of random variables Y_{xi}, one for each x and each i, and an unrestricted sequence of random variables X_i, one for each i, rather than a single Y and X with a value for each i. The notation allows us to consider observed data in which the values of the successive X's are determined in different ways and to ask what would have happened if on some observation X had had a value other than the one actually observed.

2.1.2. The random variable U_x
In the simplest laws that we shall discuss, those without concomitants, the third member of our cast will be a set of generic random variables U_x, one for each x, with U_{xi} independently identically distributed as U_x. (When the ith value of U pertains to the ith individual drawn from a population, we shall tacitly assume that the population is infinite because the modifications needed for sampling without replacement from a finite population are distracting but neither difficult nor relevant to anything we shall say.) The U's may be interpreted in the real world in either of two quite different ways. Some contexts permit either one.

1. U as an identifier. In the first interpretation of U_x, the subscript x is superfluous. The value of U_{xi} is a label which is always observed and the distribution of U_{xi} is never degenerate. Thus U_{xi} may identify a particular 'experimental unit', such as a plot of land in an agricultural experiment, or what is sometimes called a 'case', such as a firm in an econometric context. This interpretation is less natural, if possible at all, when observations have a meaningful order.
2. U as the effect of excluded variables. In its other interpretation, U_x represents what econometricians call the 'joint effect of excluded variables' and the subscript x indicates that its distribution may depend on x. When so interpreted, U_x ordinarily is unobservable and has a non-degenerate distribution, although we shall see that in special cases it is observable and may have a constant value.
In due course we shall discuss this interpretation at some length. For the moment, we shall merely suggest it by saying that (1) in an agricultural experiment, U_x may be thought of as a summary variable representing every variable that could help to explain or merely predict the plot-to-plot variation in the yields obtained when the same treatment x is applied to several plots of land, and (2) 'pure chance' may be among the explainers or predictors, as when a photon may or may not be reflected by a body of water or a sheet of glass and Feynman (1985) says no one can tell us why.

It may seem that the requirement that the U_{xi} be independently identically distributed excludes most time series from our purview, but it really excludes them only from the category of 'simplest' laws. When we proceed to the general case, the reader will see that what we say applies just as naturally to time series as to cross-sections.

Who thinks about excluded variables? In what follows, we shall more than once compare what one can say about particular examples in terms of a U that represents excluded variables with what one can say in terms of a U that merely identifies a 'case' or 'experimental unit'; and the comparisons will always come out in favor of the former. To prevent misunderstanding of what we mean by these comparisons, we now say as emphatically as we can that we do not mean to suggest that writers who do not explicitly represent excluded variables in their formal propositions about what we call laws think less clearly about excluded variables than those who do. We mean simply:

The possible real-world interpretations of formal propositions about laws become very much clearer when excluded variables are explicitly represented in the propositions.
2.1.3. The random variable Z

After we have explained as much as we can in terms of the simplest possible laws, we shall proceed to the general case by introducing one more random variable, to be denoted by Z and called 'the concomitants'; and we shall modify the definition of U accordingly. For the moment, all that we shall say about these changes is that they will greatly relieve the restrictiveness of the conditions under which laws without concomitants exist and can be observed in data.

2.2. Laws without concomitants

2.2.1. Definition
Consider a sequence of observations on the X's and Y's. We shall say that Y is related to X by a law without concomitants if (1) for every value x of X, there exists an iid sequence {U_{xi}}, and (2) there exists a function f which on the ith observation associates to every possible value x of X_i a random variable

    Y_{xi} = f(x, U_{xi}).                                                (1a)

The values u_{xi} of the U_{xi} and the corresponding values

    y_{xi} = f(x, u_{xi})                                                 (1b)

of the Y_{xi} that are defined for every x on every observation i will be called 'potential' values because the only ones that will be realized on any one observation i are the pair corresponding to the one realized value x of X_i. It is the existence of these potential values that distinguishes a law from a statistical association; and we shall see in due course that it is only in terms of these potential values that it is possible to state the condition under which a law can be observed in data.

Although the observable y_{xi}'s are unique, the unobservable u_{xi}'s and therefore the function f are not; but at this point uniqueness of the y_{xi} is all we require. When we come to discuss linear models of laws in section 2.3.5 we shall make f and the u_{xi} unique in essentially the same way in which the residual and constant term of an ordinary regression are made unique.

Now for any one x let D(Y_x) be the distribution of Y_x; it is the same on every observation and is also the long-run empirical distribution of Y_x. D(Y_x) may properly be said to describe (or to be) a law rather than a statistical association because the value of Y_x is defined on every observation whether or not it is observed. One can also say: D(Y_x) may properly be said to be or to describe a law because it is the distribution that would govern every observed value of Y if on every observation X_i had the same value x.
Because the distribution of U_x is an essential part of a law, a law is guaranteed to hold only when the U's are generated by one particular process or drawn from one particular population. We defer to section 2.5 the considerations that must be kept in mind while deciding whether or not a law that holds when the U's are generated by one particular process is likely to be a good approximation to the law that holds when the U's are generated by another particular process. On the other hand, observe that because we have as yet introduced no model of the relation of Y to X, we do not have to consider the possibility that our laws may hold for one value x of X but not for another. They are 'true' laws each of which is assumed to hold exactly for exactly one x.
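The notion of a potential value is easy to exhibit numerically. The sketch below is our own illustration, not part of the paper: the choice of f, the two possible values of x, and all numbers are invented. It generates, for every observation i, the full set of potential values y_{xi} = f(x, u_{xi}), even though only the one corresponding to the realized X_i would ever be observed; the law D(Y_x) is simply the distribution of the column for that x, observed or not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
xs = (0.0, 1.0)                       # the possible values x of X_i (illustrative)

u = rng.normal(size=n)                # u_xi: here one draw per observation serves every x

def f(x, u):                          # an invented f(x, u), for illustration only
    return 1.0 + 2.0 * x + u

# Potential values: y_xi is defined for EVERY x on EVERY observation i.
potential = {x: f(x, u) for x in xs}
for x in xs:
    print(f"law D(Y_{x}): mean {potential[x].mean():.3f}")

# On any one observation only the value for the realized X_i is observed.
X = rng.choice(xs, size=n)
observed_y = np.where(X == xs[1], potential[xs[1]], potential[xs[0]])
```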
2.2.2. Examples
Example 1. Let Y_{xi} represent the acceleration that results when force x is applied to a particular object, and consider a sequence of observations on which the same object is subjected to a force which varies from one observation to the next. On each observation, there is a well-defined value y_{xi} of Y_{xi} corresponding to each of the infinitely many possible forces x, even though only one of these accelerations can be observed; and therefore there exists for every x a law D(Y_x), even though for most x not even one value of Y_x will be observed. The u_{xi} which appears in (1b) is a constant which may be interpreted as either the observed identity or the observed or unobserved mass of the object.
Example 2. Let the value x of X_i represent the color (frequency) of a beam of monochromatic light in whose path a sheet of glass is placed, and let y_{xi} = 1 [0] if the ith photon in the beam is [not] reflected. On each observation, x and u_{xi}, which is here best interpreted as 'pure chance', assign to Y_{xi} a value y_{xi} that is equal to either 0 or 1 and is always observed because x is constant. The subscript x on u_{xi} exhibits the fact that the probability that the photon will be reflected depends on the color x of the light.
Example 3. Let x be a vector representing direct-labor hours and direct-labor dollars and let Y_{xi} be overhead dollars. In each period of time, x and u_{xi}, which is here best interpreted as the joint effect of excluded variables, assign to Y_{xi} a value y_{xi} which is observed only when X_i = x but could vary from period to period even if x remained constant. The subscript x on u_{xi} allows the distribution of the joint effect U_x of excluded variables, in particular its variance, to depend on the value of x.
2.3. Observability of a law without concomitants

Having discussed the definition and interpretation of a law without concomitants, we next ask when it will be possible to estimate such a law from data; but because most of the considerations that must be kept in mind when choosing an estimator are no different for laws than for statistical associations - no different, for example, for single-equation linear structural models than for linear regression models - we shall address only the one new problem that must be faced when the estimand is not an association but a law; namely, the problem created by the fact that the y_{xi} corresponding to any particular x and i is observed only if X_i = x.

Concerning this problem, we shall first state a sufficient and 'almost' necessary condition under which the distribution D(Y_{xi} | X_i = x) that is observed when X_i = x coincides with the law D(Y_x); and for brevity, we shall say that when this is true, the law D(Y_x) is itself 'observable'. We shall then add some other sufficient conditions which may make life easier for someone who wants to decide whether the condition is in fact satisfied in data in which the x's are not randomized. They are technically a little stronger than our first condition but they are practically equivalent in the sense that although it is easy to invent mathematical examples, we have yet to think of a practical application where one might believe that although a weaker condition is satisfied, a stronger is not.

Ordinarily, no estimator of a law that is not observable will be even consistent, although we shall see in section 3.4 that a consistent estimator that makes use of auxiliary information can sometimes be found when the reason why the law is unobservable is well understood. For completeness, we point out that the condition for observability, which applies to each observation separately, does not preclude such tricks as optional sampling and optional stopping that are also possible when X_i = x on every observation and hence D(Y_x) is fully observed.
2.3.1. Rubin's and Dawid's conditions

Our first condition for observability of the law D(Y_x) for any one x is derived from a stronger condition first stated by Rubin for observability of the laws for all x.¹ We state it for discrete X, but it applies to arbitrary X if conditional probabilities are replaced by conditional densities.

The observed D(Y_{xi} | X_i = x) will coincide on every observation with the law D(Y_x) iff on each observation the probability that X_i = x does not depend on the associated value y_{xi} of Y_{xi}, i.e., iff

    Pr{X_i = x | Y_{xi}} = Pr{X_i = x},    ∀i.                            (2a)

Our next condition, which we state in a form akin to Dawid (1979), is slightly stronger:

    X_i and Y_{xi} are statistically independent,    ∀i.                  (2b)

¹Rubin's condition is stated in great generality by Rubin (1978, p. 42, def. 1), in a more elementary form by Rosenbaum and Rubin (1984, p. 27, eq. 5). Our condition is stated here in a simplified form applicable to laws without concomitants. It will appear in its complete form in section 2.4.2.
The condition (2a) is essentially the same as the condition required for observation of an ordinary regression in a subset of the data in which the regression is defined, namely: the probability that an observation on which X_i = x will be included in the subset does not depend on the value y_i of Y on that observation. The only difference is that whereas the y_i of a regression is observed for every i, the y_{xi} of a law is a potential value observed only when X_i = x.
2.3.2. Conditions with U in place of Y

Because by (1) Y_{xi} is determined by x and U_{xi}, (2a) will be satisfied if

    Pr{X_i = x | U_{xi}} = Pr{X_i = x},    ∀i,                            (2c)

and therefore if

    X_i and U_{xi} are statistically independent,    ∀i.                  (2d)
If U is interpreted as a label identifying an experimental unit, these two conditions say that the probability of applying a given treatment to a given plot of land must be the same for all plots. If U is interpreted as the joint effect of excluded variables, they say that the probability of applying a given treatment to a given plot must not depend on any excluded variable that can help to predict yield given that treatment. The relation of (2d) to the requirement found in some books on econometrics that the exogenous variables must be statistically independent of the joint effect of excluded variables will be discussed in section 2.3.5.
2.3.3. Examples

Example 1. Let P_i, playing the role of X_i, represent the price in year i of an agricultural commodity produced in year i-1 and sold in year i; let Q_{pi}, playing the role of Y_{xi}, represent the quantity that will be demanded in year i if P_i = p; and suppose that the only data recorded in each year are the p and q_{pi} that clear the market. In these data, Pr{P_i = p | Q_{pi}} cannot be independent of Q_{pi} because p must be such that demand q_{pi} is equal to the available supply; and therefore the law of demand D(Q_p) is not observable.

Example 2. Reverse the roles of P and Q, so that Q_i plays the role of X_i and P_{qi} the role of Y_{xi}. As was first pointed out by Fox [Malinvaud (1966, p. 509, sec. 16.3)], it may be that because supply and therefore demand Q_i were assigned a value q before it was possible to predict [at all well] what p_{qi} would result, Pr{Q_i = q | P_{qi}} did not [much] depend on P_{qi}; and if so, then the law D(P_q) may be [nearly] observable, and from it the law D(Q_p) could be computed. Notice that on the one hand the law in this example is [nearly] observable only if in fact P_{qi} cannot be predicted [at all well] at the time q is determined, while on the other it is very often true in the real world that coming events cast their shadow before. In other words, it may sometimes be easier to infer a law from time-series than from cross-sectional data, but the condition that must be satisfied is the same in both cases and cannot be taken for granted in either.
Example 3. In Example 1, it was possible to prove that the analog of Pr{X_i = x | Y_{xi}} depended on Y_{xi} and therefore D(Y_x) was not observable. In our next example, it is a priori reasoning or intuition about excluded variables that leads to a similar conclusion.

Doctors give treatment x to some but not all of the patients who come to a certain hospital for relief from a certain symptom. D(Y_x) will be observable if doctors give this treatment to patients chosen at random or choose at random the treatment to be given to each successive patient; but there is no guarantee that it will be observable if for example the probability of being given the treatment varies with the sex or age of the patient.

If one thinks of the random variable U as merely identifying the patients, no more can be said. When the assignment of treatments to patients is not randomized, Pr{X_i = x | Y_{xi}} may depend on Y_{xi}, or equivalently, X_i may not be statistically independent of U_{xi}. If, however, one thinks of U as the joint effect of excluded variables, much more can be said. If (1) X_i is not independent of age and sex, and (2) age and sex help to predict the variation in the value of Y_{xi} from one patient to another (and are therefore among the variables represented by U_{xi}), then X_i virtually cannot be independent of U_{xi} and therefore the law D(Y_x) almost certainly will not be observable.
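A hypothetical numerical version of this third example may make the point concrete. It is our own sketch, not the authors': the dependence of treatment choice on age, the effect sizes, and the sample size are all invented. When the probability of treatment rises with age, and age helps to predict Y_{xi}, the observed mean among treated patients drifts away from the mean of the law D(Y_1); randomizing the assignment restores agreement.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

age = rng.uniform(20, 80, size=n)            # an excluded variable that predicts Y
u = 0.05 * age + rng.normal(size=n)          # part of the joint effect U_xi of excluded variables
y1 = 2.0 + u                                 # potential value under treatment x = 1; E(Y_1) = 4.5

# Non-randomized assignment: older patients are treated more often,
# so Pr{X_i = 1 | U_xi} depends on U_xi and condition (2c) fails.
p_treat = np.clip((age - 20) / 60, 0.05, 0.95)
x_obs = rng.binomial(1, p_treat)
print("law E(Y_1):                            ", y1.mean())
print("observed E(Y_1 | X=1), doctors' choice:", y1[x_obs == 1].mean())   # biased upward

# Randomized assignment satisfies Pr{X_i = 1 | U_xi} = Pr{X_i = 1}.
x_rand = rng.binomial(1, 0.5, size=n)
print("observed E(Y_1 | X=1), randomized:     ", y1[x_rand == 1].mean())  # close to the law
```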
2.3.4. Manipulation vs. selection of the value of X
The fact that a law is defined without reference to any particular distribution of X implies that it holds when the value of X is 'held' constant or 'arbitrarily set' to a chosen value; and generalizing from the fact that in certain contexts the variables X whose effects are of interest are in fact 'manipulated' in this way, Rubin (1986, p. 962b) - see also Holland (1986, p. 959a) - asserts: '...the motto 'no causation without manipulation' is a critical guideline for clear thinking in empirical studies for causal effects...forc[ing] an initial definition of units and treatments...'. Taken as part of a definition of 'causation' we shall not debate this assertion any more than any other, but we do emphasize that it would be false to assert that Y cannot be related to X by a law unless X can be literally manipulated as it is in contexts of the kind Rubin and Holland take as examples.

For example, an admissions officer cannot 'manipulate' the SAT score of an applicant for admission to college, but the officer may well take SAT score into account in selecting the applicants to be admitted because he is aware of a law which states that expected academic performance increases with SAT score. The law and its implications are the same whether or not high SAT scores may properly be said to 'cause' high academic performance; and the same thing is true of a law which asserts (for example) that increasing age is accompanied by failing eyesight.

As regards the observability of a law, there is no difference between the case where on the ith observation one sets X_i = x and the case where on the ith observation one selects an entity that already has the value x. The law that relates Y to X will be observable in either case only if the value x is chosen independently of the y_x that will result (as when one spends x on advertising or selects a student with SAT score x) or has already resulted (as when one selects a person of age x and measures his visual acuity).

2.3.5. Linear models

As we have already said, the random variables U_{xi} and the function f that appear in the definition (1a) of a law are far from being uniquely identified, but nothing we have said so far about laws depends on identification. If the law D(Y_x) for a given x is observable and has been observed often enough, one can estimate the law by simply tabulating the observed distribution of Y_x. If on the contrary the laws for all x are to be represented by a parametric model, identification is convenient and presents no problem. Thus in a linear additive model Y_i = a + bX_i + U_i, where Y_i and U_i stand for the one Y_{xi} and U_{xi} with x = X_i whose values are actually realized on the ith observation, U_{xi} can be identified by defining U_{xi} = Y_{xi} - E(Y_x), or equivalently E(U_x) = 0, for all x. The distribution of U_x, in particular its variance, may still depend on x and may also be modelled.

For any one x, the observable expectation E(U_i | X_i = x) = E(U_{xi} | X_i = x) will not in general be equal to the E(U_x) which is 0 by definition unless condition (2d) for observability of the law for that x is satisfied. If it is satisfied for all x, then E(U_i | X_i) will be 0, implying that the regression of Y on X coincides with
the law. Unbiased or consistent estimates of the regression intercept and slopes will then be unbiased or consistent estimates of the law's intercept and structural effects b, called 'derivatives' by some econometricians, with corresponding consequences for use of regression residuals in estimating the distributions of U_x. The condition for consistency of OLS estimates of b that is commonly stated in books on econometrics, that X_i be uncorrelated with U_i, is of course weaker. The stronger condition sometimes imposed, that X_i be independent of U_i, does not follow from observability alone but is equivalent to the condition that D(U_x | X_i = x) be the same for all x and hence under observability to the condition that D(U_x) be the same for all x. These distinctions among conditions make little practical difference when the analysis incorporates a model of the dependence of D(U_x) on x, as it must at least implicitly. But when deciding whether a condition is satisfied, we find it clearer and more natural to think directly about observability than about independence or absence of correlation between X_i and U_i.

Instead of requiring that X_i be independent of the joint effect U_{xi} of excluded variables, some econometricians require that X_i be independent of 'the' excluded variables themselves. We shall show in section 4 that this condition is meaningless unless the definite article is deleted and can then be satisfied only for certain 'sufficient sets' of excluded variables some if not all of which must be defined in a way that makes them unobservable as well as unobserved.
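As a purely illustrative check of the linear-model statement (our own simulation; the coefficients and the data-generating processes are invented), one can generate data from a linear additive law Y_i = a + bX_i + U_i and compare the OLS slope when X is chosen independently of U, so that condition (2d) holds, with the slope when X is partly determined by the same excluded variables as Y.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
a, b = 1.0, 2.0                                 # the law's intercept and structural effect

u = rng.normal(size=n)                          # joint effect of excluded variables, E(U_x) = 0

def ols_slope(x, y):
    x = x - x.mean()
    return (x @ (y - y.mean())) / (x @ x)

# Case 1: X chosen independently of U -- condition (2d) holds, the regression coincides with the law.
x_good = rng.normal(size=n)
print(ols_slope(x_good, a + b * x_good + u))    # close to b = 2

# Case 2: X partly determined by the same excluded variables -- (2d) fails, OLS is inconsistent for b.
x_bad = 0.8 * u + rng.normal(size=n)
print(ols_slope(x_bad, a + b * x_bad + u))      # noticeably above 2
```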
2.4. Laws with concomitants

Having said what there is to say about the conditions under which laws without concomitants exist and are observable, we now proceed to the general case and introduce new random variables Z. R.A. Fisher's name for Z was 'concomitant measurements', which we shall abbreviate to 'concomitants'. Another name often found in the literature on experimentation is 'covariates'. The variables we have denoted by X are commonly called 'factors' in that literature and will be so called by us henceforth.

Concomitants formalize the notion of 'controlling for' or 'adjusting for'. In randomized experimentation, they may merely reduce 'random error', or they may 'refine' a law by splitting it into two or more laws each of which applies only when Z has a particular value and which can therefore be more precise than a single law that applies on average across all values of Z. In observational data, there are two more reasons why concomitants may be needed. First, a relation of Y to X may be too unsystematic to be formulated as a law unless concomitants are included. Second, a well formulated law may not be observable unless concomitants are included.
2.4.1. Definition of a law with concomitants
Formally, we introduce random variables Z_{xi}, which for convenience we shall treat as having the same set of possible values {z} for all x and i; and we replace the random variables U_{xi} by random variables U_{xzi}. The Z's and U's are generated by a single random process characterized by a joint distribution in which:

For each x, the marginal distribution of the Z's is arbitrary and may be degenerate as regards some or all of their components; but given any one pair of values x of X_i and z of Z_{xi}, the U_{xzi} are iid.

As the notation Z_{xi} suggests, we follow Fisher (1935; 1953, p. 163) in allowing the Z's to be affected by the X's, despite the fact that most writers on experimentation say that concomitants affected by factors are inadmissible. We shall explain in section 2.4.3 why we admit them.

We can now say that Y is related to X and Z by a law with concomitants if (1) Z and U have a joint distribution with the properties stated above, and (2) there exists a function g which on the ith observation associates to every possible value x of X and z of Z a random variable,

    Y_{xzi} = g(x, z, U_{xzi}),                                           (3a)

and a potential value,

    y_{xzi} = g(x, z, u_{xzi}).                                           (3b)

Instead of a law D(Y_x) corresponding to each possible value of X, we now have a law D(Y_{xz}) corresponding to the observed z and each possible value of x. Although x and z enter (3a) and (3b) symmetrically, the fact that the distribution of U_{xz} is conditional on the observed value z of Z_x but is defined for each possible value x of X whether or not that value is observed means that X and Z play completely different roles in the interpretation of a law. A law specifies a unique distribution D(Y_{xz}) for each x and z (1) regardless of the distribution D(X_i) but (2) given the conditional distribution D(U_{xzi} | Z_{xi} = z) in the population or process to which the law applies. This implies that:

One may choose the value of X_i arbitrarily, but one must allow Z_{xi} to have whatever value it naturally has, because choosing some other value would almost certainly alter the distribution D(U_{xzi} | Z_{xi} = z) that is an essential part of the law.
Example 1. Consider a law without concomitants that relates Y to factors X, and suppose that the excluded variable temperature T has an effect on Y that is part of the joint effect U_x of excluded variables. If thermometer reading is a function of temperature and is added to the law as a concomitant Z, U_{xz} has a distribution which differs from that of U_x because given any value z of Z, T is constant. But if one tampers with the thermometer, T will no longer be constant given z, and therefore the distribution of U_{xz} will no longer be what it was.

Example 2. Consider a time-series model of retail sales Y that includes disposable income, a trend, and a monthly seasonal index. One can think of disposable income as a factor in X because it makes sense to ask what would have happened to Y if, because of (say) some previous change in government economic policy, disposable income in a given month of a given year had had a value other than the one it actually had. One must think of trend and seasonal as concomitants in Z because although they help to predict Y, they do so only because they proxy for unidentified excluded variables. Trend is included in the law because without it there would be no law at all: sales would drift up or down as one or more excluded variables drift up or down. Seasonal is included to 'refine' the law. Without it, one would have a single law applying very imprecisely to all months. With seasonal included one has twelve different laws, one for each month; and while these laws may not all be more precise than the law without seasonal, they must be more precise 'on average' in the sense that, for example, the mean of their variances must be less than the variance of the single law by an amount equal to the variance of their means.

Example 3. The simplest linear law with 'autocorrelated errors',

    Y_t = b x_t + E_t,    E_t = r E_{t-1} + U_t,    U_t iid,

is not in the form required by our definition of a law without concomitants because the sequence {E_t} is not iid, but it conforms with our definition of a law with concomitants when it is rewritten as a non-linear law,

    Y_t = b x_t + r[Y_{t-1} - b x_{t-1}] + U_t,    U_t iid.

Here Y_t plays the role of Y_{xzi} in (3a), x_t plays the role of the factors x, and Y_{t-1} and x_{t-1} play the role of concomitants z. Y_{t-1} and x_{t-1} cannot be considered to be factors because they have no effect of their own on Y_t. What they jointly represent is the effect on Y of the excluded variable E_{t-1}, which in turn merely represents the effects of the unidentified excluded variables represented by U_{t-1}, U_{t-2}, ....
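A minimal simulation of this rewriting may help; it is our own sketch and every parameter value is arbitrary. Data are generated from the 'autocorrelated errors' form, and b and r are then recovered from the rewritten non-linear form by noting that (Y_t - r Y_{t-1}) = b(x_t - r x_{t-1}) + U_t, so that b can be obtained by OLS for each trial r and r chosen to minimize the sum of squared U_t.

```python
import numpy as np

rng = np.random.default_rng(3)
T, b_true, r_true = 5_000, 2.0, 0.6

x = rng.normal(size=T)
e = np.zeros(T)
for t in range(1, T):                        # E_t = r E_{t-1} + U_t, U_t iid
    e[t] = r_true * e[t - 1] + rng.normal()
y = b_true * x + e                           # Y_t = b x_t + E_t

# Rewritten law: Y_t = b x_t + r [Y_{t-1} - b x_{t-1}] + U_t.
def sse_and_b(r):
    ys, xs = y[1:] - r * y[:-1], x[1:] - r * x[:-1]
    b_hat = (xs @ ys) / (xs @ xs)
    resid = ys - b_hat * xs
    return resid @ resid, b_hat

grid = np.linspace(-0.95, 0.95, 381)
r_hat = grid[int(np.argmin([sse_and_b(r)[0] for r in grid]))]
print("r_hat:", round(r_hat, 3), "b_hat:", round(sse_and_b(r_hat)[1], 3))   # near 0.6 and 2.0
```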
2.4.2. Conditions under which a law is observable

The conditions under which a law with concomitants D(Y_{xz}) is observable for any one pair x, z differ from the conditions (2) for a law without concomitants only in that 'given z' is inserted wherever it makes sense:

    Pr{X_i = x | Z_{xi} = z, Y_{xzi}} = Pr{X_i = x | Z_{xi} = z},    ∀i,        (4a)

    X_i and Y_{xzi} are statistically independent given Z_{xi} = z,    ∀i,      (4b)

    Pr{X_i = x | Z_{xi} = z, U_{xzi}} = Pr{X_i = x | Z_{xi} = z},    ∀i,        (4c)

    X_i and U_{xzi} are statistically independent given Z_{xi} = z,    ∀i.      (4d)

As regards the observability of a law, the essential difference between factors X and concomitants Z is:

Whereas X_i must be statistically independent of U_{xzi} given Z_{xi} = z, no similar requirement is imposed on Z_{xi}.

This difference between X and Z in the condition under which a law is observable is the counterpart of the difference between X and Z in the interpretation of a law. As was pointed out in the previous section: Whereas a law holds regardless of the distribution of X, it holds only when the value of Z_{xi} is generated by the process to which the law applies.

Historically, the difference in the condition for observability precedes the difference in interpretation. The concept of a concomitant whose causal effects cannot be estimated was introduced by experimentalists who wanted to bring variables that could not be randomized into the analysis of experiments.

Example. Return to the third example of section 2.3.3, where doctors' choices of the treatment to be administered to each patient were influenced by the sex and age of the patient. If these characteristics of each patient were recorded as concomitants z, then D(Y_{xzi} | Z_{xi} = z, X_i = x) will coincide with D(Y_{xz}) and the law will be observable if any one of the conditions (4) is satisfied, e.g., if Pr{X_i = x | Z_{xi} = z, U_{xzi}} did not depend on any excluded variable that could help to predict the variation of Y_{xz} among patients all characterized by the same value of z.
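Continuing the doctors' example in a purely hypothetical simulation (ours, with invented treatment probabilities and effect sizes): suppose the probability of treatment depends only on a recorded age band z. Then given z the conditions (4) hold, the observed mean given z and x agrees with the law D(Y_{xz}), and the law D(Y_x) can be recovered by averaging over the distribution of z, whereas the raw mean among treated patients is biased.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000

age = rng.uniform(20, 80, size=n)
z = (age >= 50).astype(int)                  # recorded concomitant: age band
u = 0.05 * age + rng.normal(size=n)          # excluded variables still predict Y within each band
y1 = 2.0 + u                                 # potential value under treatment x = 1

# Assignment probability depends only on the recorded band z, so (4c) holds given z.
x = rng.binomial(1, np.where(z == 1, 0.8, 0.2))

print("law E(Y_1):                 ", y1.mean())
print("naive observed E(Y_1 | X=1):", y1[x == 1].mean())          # biased toward older patients

# Given z, observed and law agree; averaging over D(z) recovers the unconditional law.
recovered = sum(y1[(x == 1) & (z == k)].mean() * (z == k).mean() for k in (0, 1))
print("E_z[ E(Y_1 | Z=z, X=1) ]:   ", recovered)
```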
2.4.3. Concomitants affected by factors

As we have already said, we follow R.A. Fisher in allowing concomitants to be affected by factors, and to show why we do so, we quote Fisher (1953, pp. 162-163):

'In agricultural experiments involving the yield following different kinds of treatments, it may be apparent that the yields of the different plots have been much disturbed by variation in the number of plants which have established themselves. If we [believe that plant number is affected by the treatments but] are willing to confine our investigation to the effects on yield, excluding such as flow directly or indirectly from effects brought about by variation in plant number, then it will appear desirable to introduce into our comparisons a correction which makes allowance... for the variations in yield due... to variation in plant number itself.'

Definitional vs. inadmissible concomitants. In other words, the 'total' effect of a treatment on yield can be thought of as the sum of two parts, a 'direct' effect given plant number and an 'indirect' effect due to the fact that treatment affects plant number and plant number affects yield. Whether or not plant number should be included as a concomitant depends entirely on whether it is the direct or the total effect that the analyst wants to learn; and it is clear that in general:

A proposed concomitant Z that is affected by a factor X may be either (1) 'definitional', if what one wants to learn is the effect of X given Z, or (2) 'inadmissible', if what one wants to learn is the total effect of X. It is never 'optional' in the sense that one can learn what one wants to learn whether or not it is included.
Because concomitants affected by factors are unconditionally ruled out by most writers on experimentation, we give two more examples of contexts in which definitional concomitants are required, the first of them being a randomized experiment.

Example 1. Consider a medical experiment conducted in order to evaluate various drugs all intended to improve health by reducing blood pressure and any one of which can be administered in a dose sufficient to produce any desired reduction in blood pressure. The effects of the drugs that are of interest are therefore not their total effects on health but their (possibly harmful) effects on health given post-treatment blood pressure, which is therefore a required definitional concomitant.

Example 2. Suppose that an employer is accused of paying men more than he pays equally qualified women, and suppose, unlikely as it may be, that the judge in the case lays down as a matter of law a list of the variables that are to be used in measuring qualification. Then clearly what must be estimated is the effect on pay of the factor sex given the definitional concomitants listed by the judge; and this is true even if sex affected these concomitants in the sense that the employer hired women less qualified than men, not because he chose to do so, but because the women who applied for a job were less qualified than the men.
2.4.4. Marginal and average laws

Marginal laws. If for any x, D(Y_{xz}) is a law with concomitants Z and the Z_{xi} are iid, the expectation of D(Y_{xz}) wrt Z_x is a 'marginal' law D(Y_x) that does not depend on the value of Z_x. Similarly, if Z_x is a vector (Z_x', Z_x'') and the Z_{xi}'' are iid given Z_x', the expectation of D(Y_{xz}) wrt Z_x'' given Z_x' is a marginal law D(Y_{xz'}) with concomitants Z_x'. Notice however that marginalization wrt a concomitant that is 'definitional' in the sense of section 2.4.3 radically changes the meanings of the effects of the factors.

For example, suppose again that (1) laws D(Y_x) without concomitants relating health to various treatments x are not observable because doctors' choices of treatment were influenced by the sex and age of the patients who happened to come in for treatment, but (2) such laws are wanted because just one of the treatments is to be selected and administered to all remaining members of the population from which the treated patients came. Provided that (1) the laws D(Y_{xz}) relating health to treatment given sex and age are observable, and (2) the joint distribution of age and sex in the remainder of the population is known, the desired laws D(Y_x) can be estimated by computing the expectations of the D(Y_{xz}) wrt this joint distribution.

Laws holding 'on average'. If the Z_{xi} or Z_{xi}'' are not iid but cyclical, like a seasonal index, averaging the laws wrt the distributions of Z_{xi} or Z_{xi}'' yields a law that applies 'on average' or to a randomly chosen phase of the cycle. And similarly when Z_{xi} or Z_{xi}'' has both cyclical and iid components. Thus in a law relating sales Y to factors X with concomitants trend and monthly seasonal index, averaging out the seasonal index yields a law for each year that applies to the 'average month' in that year and serves much the same purpose as a law that relates annual sales to the factors X.
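To make the computation just described concrete, here is a toy reweighting in which every number is invented: given estimated means of the laws D(Y_{xz}) for one treatment and the known joint distribution of age and sex in the remainder of the population, the mean of the desired marginal law D(Y_x) is the corresponding weighted average.

```python
# Estimated laws D(Y_xz) for one treatment x, summarized by their means, and the KNOWN
# joint distribution of z = (sex, age band) in the population to which the law is to apply.
# All numbers are invented for illustration.
law_given_z = {("female", "young"): 3.1, ("female", "old"): 4.0,
               ("male", "young"): 2.8, ("male", "old"): 3.7}
target_dist = {("female", "young"): 0.30, ("female", "old"): 0.25,
               ("male", "young"): 0.25, ("male", "old"): 0.20}

# The expectation of D(Y_xz) wrt the target distribution of z gives the marginal law's mean.
marginal_mean = sum(law_given_z[z] * target_dist[z] for z in law_given_z)
print(marginal_mean)
```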
2.4.5. Linear models

When a law with concomitants is represented by a linear model

    Y_i = a + bX_i + cZ_i + U_i,

where Y_i, Z_i, and U_i stand for the one Y_{xzi}, Z_{xi}, and U_{xzi} with x = X_i whose values are actually realized on the ith observation, the U_{xzi} can be identified by defining E(U_{xzi} | Z_{xi} = z) = 0, ∀x, z. The observable expectation E(U_i | X_i = x, Z_i = z) = E(U_{xzi} | X_i = x, Z_{xi} = z) will not in general be equal to the E(U_{xzi} | Z_{xi} = z) which is 0 by definition unless the condition for observability of the law is satisfied. If it is satisfied for all x and z, then the definition that identifies the U_{xz} implies that E(U_i | X_i = x, Z_i = z) = 0 for all x and z and hence:

If and almost only if the condition for observability is satisfied, X_i and Z_i will be uncorrelated with U_i, implying that the law will coincide with the regression and an unbiased or consistent estimate of the coefficient of X in the regression will be an unbiased or consistent estimate of the structural coefficient or 'derivative' b that expresses the effect of X on Y.
For example: if both Y and X are affected by the excluded variable temperature, then X_i would be correlated with the U_i of a law relating Y to X without concomitants. If, however, the values of X satisfy observability when temperature is included as a concomitant Z, then (1) the regression of Y on X and Z will coincide with this new law because X_i and Z_i will be uncorrelated with its U_i, and (2) therefore the regression coefficient of X can be interpreted as the effect of X on Y, even though (3) the values of Z may not satisfy observability and therefore the coefficient of Z may include not only the effect of Z but also the effects of excluded variables with which Z is correlated.

Interactions. Although variables representing interactions are not needed in model-free laws, they are frequently needed in linear laws but give rise to no new problem. A variable expressing interaction between factors or between a factor and a concomitant is a factor and satisfies the condition for observability if the interacting factor or factors do. A variable representing interaction between concomitants is a concomitant and will have the value it naturally has if the interacting concomitants have their natural values.
Marginal and average laws. If some or all of the factors neither affect nor interact with the concomitants or a subset thereof, the coefficients of these factors remain unchanged when the concomitants or subset thereof are 'expected out' to obtain a marginal law. If interaction between a factor X' and a concomitant Z' not affected by the factor is represented by a term of the form b'X'Z', expecting Z' out merely adds b'E(Z') to the coefficient of X'. Expecting out a concomitant that is affected by a factor would radically alter the definition of the effect of the factor, as we saw in section 2.4.4. In any case, U and its distribution are altered, but the new distribution can easily be estimated by computing the residuals of the marginal law.
We emphasize the ease of computation because we believe that although one may well prefer a more comprehensible ‘parsimonious’ law to a law with many concomitants that tells more than one wants to know about the relation of Y to X, the parsimonious law will in very many cases not be ‘directly’ observable and hence will not coincide with the corresponding parsimonious regression. To estimate it, one will have first to add a substantial number of concomitants, estimate the resulting law, and then expect out the unwanted concomitants.
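The temperature example of section 2.4.5 and the 'add concomitants, estimate, then expect out' procedure can be sketched in a short simulation. It is our own illustration and every coefficient is invented: X and Y are both driven by temperature, so regressing Y on X alone mixes the effect of X with temperature's; regressing Y on X and the concomitant Z = temperature recovers the effect b of X even though the coefficient of Z also absorbs an excluded variable correlated with Z; and since Z is not affected by X and does not interact with it, expecting Z out afterwards leaves the coefficient of X unchanged and merely moves cE(Z) into the intercept.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
b = 2.0                                        # the law's effect of X on Y

temp = rng.normal(size=n)                      # temperature, correlated with X
other = temp + rng.normal(size=n)              # another excluded variable, correlated with temp
x = 0.7 * temp + rng.normal(size=n)            # X satisfies observability only GIVEN temp
y = 1.0 + b * x + 1.5 * temp + 0.5 * other + rng.normal(size=n)

def ols(columns, y):
    X = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("Y on X alone:       ", ols([x], y)[1])      # biased estimate of b
coefs = ols([x, temp], y)
print("Y on X and Z = temp:", coefs[1])            # close to b = 2
print("coefficient of Z:   ", coefs[2])            # about 2: temp's own 1.5 plus 0.5 from 'other'

# Expecting the concomitant out (Z not affected by X, no interaction) leaves b unchanged
# and merely adds coef_Z * E(Z) to the intercept of the marginal law.
print("marginal intercept: ", coefs[0] + coefs[2] * temp.mean())
```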
2.4.6. Concomitants that could be factors
As we have already said, a variable like trend or seasonal that proxies for unidentified excluded variables must necessarily be classified as a concomitant, but one is also free to classify as a concomitant any variable whose values can (at least conceptually) be arbitrarily chosen and which therefore might be meaningfully classified as a factor. How such a variable should be classified involves a tradeoff.

1. If a variable W is included in a law as a factor in X, the law will not be observable unless X with W included satisfies the condition for observability; but the law will hold no matter how W comes to have the value it has, so that if the law is linear, the coefficient of W can be interpreted as the effect of W on Y.

2. If on the contrary W is included as a concomitant in Z, the law will not hold unless W is allowed to have whatever value it naturally has, so that if the law is linear, the coefficient of W cannot be interpreted as the effect of W on Y; but W does not have to satisfy any other condition for the law to be observable.
Since an assumption that a non-randomized variable satisfies the condition for observability is a very strong assumption and strong assumptions are to be avoided whenever possible, the second choice is the better one in a great many cases. For example, take estimation of price elasticity of demand when demand is related to price by a linear law, and suppose that price is exogenous - the data are not limited to market-clearing values. If the person who wants to estimate the elasticity does not want to learn or say anything about the effect on demand of the exogenous variables other than price, there is no need to assume that these variables are uncorrelated with U. They can be thought of as concomitants which as we saw in section 2.4.5 are uncorrelated with U by definition if the one factor price satisfies the condition for observability and is therefore uncorrelated with U.
2.5. Circumstances in which a law applies

Up to now, all that we have said about the conditions under which a law holds is that it holds as long as the process generating the distribution of U given z remains unchanged; and in some cases that may be all that one wants to know. In other cases, however, one may want to know whether a law that holds for one population or process may be reasonably thought to hold at least approximately for another; and therefore we shall now say what little one can on this subject.

If U is interpreted as merely an identifier of a case or experimental unit, then all one can say is that there is no guarantee that the law applies to the new population or process. If, however, U is interpreted as the joint effect of excluded variables, one can say that the law for any one x and z applies if and only if U_{xz} has the same distribution in the new population or process that it has in the population or process to which the law is known to apply; and this focussing of attention on excluded variables usually provides a useful handle on the problem. If for example one wants to decide whether an agricultural law estimated in North Dakota can reasonably be assumed to be a fair approximation to the law that applies in South Dakota, one can bring to bear whatever knowledge there is about similarity or dissimilarity between the two states as regards soil fertility, rainfall, and so forth.

The role of concomitants. When a law applying to one population or process is taken as an approximation to the law that applies to another, the approximation improves as concomitants are added to the law because the concomitants reduce the extent to which Y depends on U and thus reduce the difference between the two exact laws. If the Dakotas differ in rainfall, a law estimated in North Dakota that takes account of rainfall will be a better approximation to the corresponding law in South Dakota than a law that does not take rainfall into account. Notice that this is likely to be true not only when what one wants is a law with concomitants but also when one wants a single law that applies on average across the values of the concomitants. Unlike differences between the distributions of U_{xz} in the two populations, differences between the marginal distributions of Z cause no error of approximation when the distribution of Z in the new population is known, because the law one wants can then be computed by simply expecting Z out of the law with concomitants.

Constant concomitants. Having shown how introduction of concomitants can help when one wishes to extrapolate a law to a new population (or process), we remind the reader that the extrapolated law can nevertheless be an extremely poor approximation to the true law for a reason emphasized by
Marschak (1953, sec. 3) long ago. If in the language of linear models the factors interact strongly with an excluded variable that has a constant value in the population to which the law is known to apply but has a different value or distribution in the new population, there is no way at all to find an even tolerably good approximation to the law that holds in the new population.
2.6. Summary

Now as Ethel Barrymore said after her last curtain: 'That's all there is, folks. There isn't any more.' Although they have been surrounded by not a few pages of explication, the brief definition of a law in section 2.4.1 and any one of the four one-line conditions for observability in section 2.4.2 say everything that has yet been logically shown to be required for a 'law' or 'structural relation' to exist and to be observable in data. All the rest is either mere working out of the details for special cases or else interpretation which depends for its usefulness, not on formal logic, but on common sense and familiarity with the application at hand.
3. Implications for practice

We now turn to the implications for econometric practice of the condition for observability of a law that we stated in four practically equivalent forms in section 2.4.2. We shall state these implications in terms of laws represented by linear models, partly because such laws are the most familiar, partly because so doing will allow us to say very briefly what would otherwise require complicated phraseology. In particular, we shall be able to speak as we did in section 2.4.5 of the coefficients of factors X as 'effects', in contrast to the coefficients of concomitants Z which are 'ordinary regression coefficients'.

What we shall argue is that although econometricians pay great attention to the condition for observability in situations where it can be demonstrated a priori that the condition is almost certain to fail, as when interdependent simultaneous equations are to be estimated from equilibrium data, they often seem to pay little if any attention to the fact that the condition may very well fail in many other situations, or to the fact that when it does fail, it is not always through use of instrumental variables that a remedy is to be sought. To support this assertion we shall in this section suggest what must be done to convince sceptics that a model fitted to data actually satisfies the condition for observability and therefore represents a law rather than a mere regression; and because conviction is necessarily personal rather than objective, we shall assume as we must that we ourselves are the sceptics to be convinced.

To simplify the presentation, we shall consider only single-equation, cross-sectional models in sections 3.1 and 3.2; but then in section 3.3 we shall argue that what we have said applies unchanged to time-series models, and in section
3.4 we shall show that it applies again unchanged to simultaneous equations. Finally, we shall add a brief remark in section 3.5 on the proper application of Bayesian analysis to laws that are not or may not be observable.
3.1. Parsimonious laws vs. niggardly tests

We are second to none in our admiration of parsimonious laws, but we are not to be convinced that a parsimonious linear model which purports to be a law is in fact a law when the only evidence adduced in support of the assertion is a niggardly test which shows that the coefficients of the ordinary regression of Y on the X of the proposed law are statistically significant (and perhaps also have the 'right' signs). To establish a law, one must proceed in two stages:

1. Perform all the tests required to make sure that one has correctly modelled and estimated the regression function E(Y|X).

2. Try to convince others that this function expresses a law by doing everything possible to show that it does not express a law and failing in the attempt.

In other words, it is not enough to show that the variation of Y in the data might have been caused by the variables included in X. One must give strong reason to believe that the variation of Y was not caused by variables not included in X; and to do this:

One must include in the equation fitted to data every 'optional' concomitant (cf. section 2.4.3) that might reasonably be suspected of either affecting or merely predicting Y given X, or if the available degrees of freedom do not permit this, then in at least one of several equations fitted to data.
What is more, we will not be convinced that what purports to be a law is in fact a law if its proponent merely reports that other variables were included in the equation fitted to data but were then rejected because they contributed little to R2; or what amounts to much the same thing, because the F for their effects was not statistically significant. The relevant 'test statistic' for a law as opposed to a regression is not R2 or F, but the vector of changes in the estimated effects of X on Y that result when 'test concomitants' are included in the relation.
If any of these changes are material, the concomitants whose introduction produced them cannot be simply dropped and forgotten unless there is very good reason to think that the changes are due purely to sampling error; and
the burden of proof that sampling error may reasonably be believed to be responsible for the changes is on the proponent of the law, not on a reader who is willing to be but has not yet been convinced that the proposed law is in fact a law. To convince us, what the proponent of the original law without concomitants must show first of all is that sampling error could with probability much closer to 0.5 than 0.05 have produced changes as large as those actually produced by the concomitants. But if he does show in this way that the data do not falsify the law without concomitants, he will also have shown that they do not suffice to falsify the law which includes the concomitants. Choice between the two laws or mixtures of the two will have to depend entirely on a priori beliefs.

If on the contrary the changes in the coefficients of the factors that result from inclusion of test concomitants in a proposed law cannot be legitimately attributed to sampling error, there are two possibilities: If the law that includes the concomitants seems to make sense, the concomitants must be recognized as essential to estimation of the law, although under the conditions stated in section 2.4.5 some or all of them may then be expected out to obtain a more parsimonious law that applies on average. If the law that includes the concomitants does not seem to make sense, the searcher for a law must start over from scratch. An example of a well-known law for which this seems to be the only remedy will be found at the end of the next section.

To sum up: the only kind of evidence that will convince us that a proposed law is in fact a law is the kind that played an important role in ultimately convincing most doubters of the ill effects of smoking.

The estimates of the effects of smoking were shown to remain almost unchanged when one concomitant after another was introduced into the analysis, until finally it became much easier to believe that smoking in fact had the effects it seemed to have than to believe that it was merely proxying for some other, as yet undiscovered variable or variables.
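The check called for here - comparing the estimated effects of X before and after test concomitants are introduced - is mechanical once candidate concomitants are in hand. A minimal sketch in Python (the data-generating process, variable names, and numbers are our own illustrative assumptions, not the authors'):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data: an excluded variable v affects both the factor x and y,
# and a candidate test concomitant z is a (noisy, non-causal) proxy for v.
v = rng.normal(size=n)
x = 0.8 * v + rng.normal(size=n)
z = v + 0.5 * rng.normal(size=n)             # test concomitant
y = 1.0 * x + 2.0 * v + rng.normal(size=n)   # the assumed "law" gives x the effect 1.0

def ols(y, *cols):
    """OLS coefficients of y on a constant plus the given columns."""
    X = np.column_stack([np.ones_like(y)] + list(cols))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_without = ols(y, x)       # regression of y on x only
b_with    = ols(y, x, z)    # same regression with the test concomitant added

print("estimated effect of x without z:", round(b_without[1], 3))
print("estimated effect of x with    z:", round(b_with[1], 3))
print("change produced by z           :", round(b_with[1] - b_without[1], 3))
```

Here the change in the coefficient of x is material, which is exactly the signal described above: the parsimonious model cannot be accepted as a law until such changes are shown to be attributable to sampling error.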
3.2. Non-causal concomitants
One fact about concomitants that has already been stated may be worth more emphasis because it seems often to be overlooked. Both because it is rarely possible even to name every variable that may conceivably affect both Y and X, and because one can rarely if ever obtain data on all the variables that
one can name, variables that cannot possibly affect both Y and X but can proxy for variables that do are often very useful concomitants. For example:

In 1960 data from 48 states on the effect of compulsory motor-vehicle inspection on mortality due to motor-vehicle accidents, inclusion of alcohol consumption leaves the estimated effect of inspection almost unchanged, but inclusion of mortality due to all other accidents cuts it nearly in half, despite the fact that no one dies twice.

In time-series data, it is especially easy to find a non-causal concomitant whose inclusion may show that the effects of the other variables in the model are not what they seem to be before the concomitant is included. We are by no means the first to suggest that a time-series model is never to be taken seriously as representing 'structure' unless either the model actually includes as a concomitant a smooth function of time chosen to add as much as possible to R2, or else the proponent of the model defines such a function and reports the negligible changes in the estimates of the effects of X that resulted from its inclusion in the model. For example:

When time is added to the consumption equation in Klein's three-equation Model I [Theil (1971, pp. 432-433)] and the equation is then estimated by two-stage least squares, R2 increases only from 0.977 to 0.984 and adjusted R2 increases only from 0.974 to 0.981, but the coefficients (and standard errors) of wages W + W', profits P, and lagged profits P_{-1} are affected as follows:
           W + W'          P               P_{-1}          Time
Before:    0.810 (0.045)   0.017 (0.131)   0.216 (0.119)
After:     0.351 (0.184)   0.302 (0.157)   0.501 (0.150)   0.432 (0.170)
To us (and we suppose to everyone else), the revised coefficients make no sense at all given what we think we know about spending habits in the US in the twenties and thirties; and therefore as good Bayesians we certainly do not believe that the model with time added expresses even a rough approximation to a law. But neither do we accept the original model as a law, because the standard errors of the revised coefficients show that we cannot dispose of the differences between them and the original coefficients as mere sampling error. We conclude that even though the original coefficients agree well enough with our very imprecise prior beliefs, they are unsupported by the data; and therefore we are left with our prior beliefs unaltered by this model and these data.

3.3. Laws governing time series
Granger (1986, pp. 967-968) says that time-series causation differs from cross-sectional causation because human behavior may be affected by memory.
With this as with any other statement about the meaning of the word 'cause' or 'causation' we shall not quarrel, for the reason we gave in our Introduction. We do however want to point out that (1) everything we have said about laws quite obviously applies unchanged to laws that include any number of endogenous and exogenous variables lagged as deeply as the modeller likes, and (2) we do not see (and so far as we know, no one has suggested) any other way in which memory can be represented in a law. Here as elsewhere, differing views about the nature or definition of causation seem irrelevant to either understanding or control of the real world.

There is, however, one fact about laws involving lagged variables that we have already mentioned but may be worth emphasis. As we pointed out in Example 2 in section 2.3.3, it must not be assumed that because the value of a lagged included variable x_{t-1} was determined before the value of the current joint effect U_t of excluded variables, x_{t-1} necessarily satisfies the condition for observability - i.e., was independent of U_t. It may well have been influenced by a forecast of an excluded variable represented in U_t, or both x_{t-1} and U_t may have been affected by some third variable - in common parlance, a 'common cause'.
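A small simulation makes the point concrete. In the sketch below (our own construction; the coefficients and the forecast mechanism are assumed purely for illustration), x_{t-1} is set partly on the basis of a forecast of an excluded variable that also enters U_t, so that being determined earlier does not make it independent of U_t:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 10_000
b_true = 1.0                 # effect of x_{t-1} in the assumed law y_t = b*x_{t-1} + U_t

# f is a forecast, available at t-1, of an excluded variable that enters U_t.
f = rng.normal(size=T)
x_lag = 0.7 * f + rng.normal(size=T)   # x_{t-1} influenced by the forecast
U = 1.5 * f + rng.normal(size=T)       # joint effect of excluded variables at t
y = b_true * x_lag + U

# OLS of y_t on x_{t-1} estimates the law only if x_{t-1} is independent of U_t.
X = np.column_stack([np.ones(T), x_lag])
b_hat = np.linalg.lstsq(X, y, rcond=None)[0][1]
print("effect asserted by the law:", b_true, "  OLS estimate:", round(b_hat, 3))
```

Although x_{t-1} was fixed before U_t was realized, the shared forecast makes the two correlated, and the estimate drifts well away from the effect asserted by the law.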
3.4. Interdependent laws and equilibrium data
To show by a simple example how what we have said about single-equation models applies to simultaneous equations, we shall suppose that market-clearing prices and quantities are generated by laws of supply and demand that are interdependent because both variables appear in both laws; and using these data, we wish to estimate the law of demand

Q_d = a + bP + cX + dZ + U,          (5)

where Q_d and the factor P denote demand and price, X and Z respectively denote other factors that affect and concomitants that predict demand, and U is the joint effect of excluded variables. (Although demand Q_d is equal to supply Q_s on every observation in the available data, the demand law tells what quantity customers would be ready to buy at any price P, whether or not that quantity is actually available.)
1. The law of demand is not observable because in equilibrium data the factor P is necessarily affected by and therefore correlated with U. Ceteris paribus, P must increase as U increases since otherwise demand would exceed supply.

2. In discussing how the law may nevertheless be estimated, we make the standard econometric assumption that the variables X, Z, and U, and some other variables W that we shall define in a moment, are 'predetermined' in the non-standard sense that they are not affected by Q_d and P. As we emphasized in discussing Example 2 in section 2.3.3, 'determined before' is not sufficient for 'not affected by' when forecasting is possible.

3. Given this assumption, the law can be estimated consistently if U is not only (1) uncorrelated with X and Z as is required in ordinary data (cf. section 2.4.5), but also (2) uncorrelated given X and Z with one or more predetermined variables W, called 'instruments' for P, that are correlated with P given X and Z.
These assumptions about absence of correlation with U, in combination with the assumption that P does not affect W, X, or Z, imply that W would not predict Q_d given P, X, and Z in data in which Q_d is unconstrained if P as well as X satisfies the condition for observability. In other words, W could not play the role of a factor or concomitant in the law of demand.

Let P' be the (true) regression of P on W, X, and Z, and observe that (1) U is uncorrelated with P' because it is uncorrelated with W, X, and Z, and (2) the regression residual P - P' is uncorrelated with P', X, and Z. Then rewrite (5) as

Q_d = a + bP' + cX + dZ + U',    where    U' = U + b(P - P'),          (6)

and observe that from what has just been said about correlations with U and P - P', it follows that U' is uncorrelated with P', X, and Z.

The coefficients in (6) can be estimated in various ways, the most commonly used being two-stage least squares. In the first stage, an estimate P'' of P' is obtained by OLS regression of P on W, X, and Z. In the second stage, estimates of the coefficients in (6) are obtained by OLS regression of Q_d on X, Z, and P'' in place of P'. The estimates of the coefficients are biased because of errors in the estimates P'' that result from sampling errors in the first-stage regression coefficients, but the errors vanish in the limit and therefore the estimates are consistent.

From what was said about observability of linear laws in section 2.4.5, it follows that the condition U' uncorrelated with P', X, and Z which is necessary and sufficient for the second-stage estimates to be consistent will be satisfied if and almost only if the factors X and the instruments W that help to define the 'purified' factor P' are independent of U given the concomitants Z. If X and W satisfy this condition, Z is uncorrelated with U by definition. If P is the only factor whose effect on Q_d is of interest, only the variables W that serve as instruments for P need be independent of U given Z. All RHS variables other than P should be thought of as concomitants included in Z.
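The two stages are easy to write out explicitly. The following sketch (simulated equilibrium data; all functional forms and parameter values are our own assumptions, not the authors') generates market-clearing P and Q from interdependent supply and demand laws and then recovers the demand coefficients by the procedure just described:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Assumed demand law: Qd = a + b*P + c*X + d*Z + U   (b is the effect of interest)
a, b, c, d = 10.0, -1.0, 0.5, 0.8
# Assumed supply law: Qs = e + g*P + h*W + Us, with W a cost shifter (the instrument)
e, g, h = 2.0, 1.5, 1.0

X = rng.normal(size=n)      # factor in the demand law
Z = rng.normal(size=n)      # concomitant in the demand law
W = rng.normal(size=n)      # predetermined instrument, excluded from the demand law
U = rng.normal(size=n)      # joint effect of excluded variables on demand
Us = rng.normal(size=n)     # excluded variables on the supply side

# Market clearing (demand = supply) determines P, so P is affected by U.
P = (a - e + c * X + d * Z + U - h * W - Us) / (g - b)
Q = a + b * P + c * X + d * Z + U

def ols(y, M):
    return np.linalg.lstsq(M, y, rcond=None)[0]

ones = np.ones(n)

# OLS ignores the failure of the condition for observability and is inconsistent.
b_ols = ols(Q, np.column_stack([ones, P, X, Z]))[1]

# Two-stage least squares: first stage fits P'' from W, X, Z; second stage uses P''.
first = np.column_stack([ones, W, X, Z])
P_hat = first @ ols(P, first)
b_2sls = ols(Q, np.column_stack([ones, P_hat, X, Z]))[1]

print("assumed b:", b, "  OLS:", round(b_ols, 3), "  2SLS:", round(b_2sls, 3))
```

With this many observations the second-stage estimate sits close to the assumed b, while the naive OLS estimate does not; the small-sample bias mentioned in the text comes from using the estimated P'' in place of the true P'.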
3.5. Bayesian statistics
To see how Bayesian analysis can be brought to bear on estimation of a law y = bx + U when one is not sure that the condition for observability is satisfied, one must remember that a proper prior distribution of b is not enough and in fact is not even essential. If the law is not observable and therefore does not coincide with the regression y = cx + R of y on x, what any estimator will estimate is not the b of the law but the c of the regression; and therefore the proper prior distribution that is essential is the conditional distribution of b given c that will not be altered by even infinite data. This distribution is obviously very much easier to require than to supply, but:

If a statistician with a Bayesian computer program treats the likelihood function of c as if it were the likelihood function of b, as he must if he supplies no proper prior distribution of b given c, then what his printout will contain will be neither beans nor corn but succotash.

In saying this, we do not mean to suggest that a Bayesian should as standard practice assess his distribution of effects b given regression coefficients c. To do so would amount to assessing a distribution of the part c - b of c that consists of 'proxy effects' for excluded variables; and rather than assess a judgmental distribution of these effects based on mere guesses at their signs and magnitudes, a Bayesian will do much better to search like a non-Bayesian for concomitants that absorb them. After everything that can be done in this way has been done, uncertainty about remaining proxy effects can probably best be taken account of by simply recognizing informally that the standard error of each regression coefficient seriously understates the real uncertainty about the effect of the corresponding factor. The exception, if there is one, would be the case in which the Bayesian feels sure that a particular factor must be proxying for a particular excluded variable on which no data are available, but even in this case a distribution of c - b is going to do more harm than good unless our Bayesian makes a very good guess at the magnitude and sign of both the effect of the excluded variable and the partial correlation between that variable and the factor in whose effect he is interested.
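A minimal numerical illustration of the succotash warning (entirely our own construction, with assumed coefficients): when the law is not observable and the analyst supplies only a flat prior, the posterior for the 'effect' simply reproduces the regression coefficient c.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
b = 1.0                                  # effect asserted by the law y = b*x + U

# The law is not observable: the excluded variable v affects both x and y.
v = rng.normal(size=n)
x = v + rng.normal(size=n)
y = b * x + 2.0 * v + rng.normal(size=n)

# The regression coefficient c of y on x is what the data identify;
# here c = b + 2*Cov(x, v)/Var(x) = 2, not the b = 1 of the law.
c = np.cov(x, y, bias=True)[0, 1] / np.var(x)

# A flat-prior "Bayesian" analysis that treats the likelihood of c as the
# likelihood of b: with normal errors the posterior mean of the slope is the
# least-squares estimate, i.e., an estimate of c.
posterior_mean = float(x @ y / (x @ x))

print("b of the law         :", b)
print("c of the regression  :", round(c, 3))
print("posterior mean of 'b':", round(posterior_mean, 3))
```

The printout is, in the authors' word, succotash: without a proper prior for b given c, the number labelled b is really c.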
4. Excluded variables

4.1. 'The' excluded variables

Suppose that in Utopia stocks are traded on only one day of each year and that the price of each firm's stock is wholly determined by the firm's known earnings (per share) and dividends (per share). Then price is also wholly
determined by earnings and retained earnings, and if one wants to estimate a law relating price to earnings only, one cannot talk meaningfully about 'the' excluded variable (singular) because it could equally well be either dividends or retained earnings or some other function of earnings and dividends. And earnings cannot be independent of 'the' excluded variables (plural) as some econometricians require, because if it is independent of dividends it is dependent on retained earnings and vice versa.

It will be relevant to what follows to observe that excluded variables that may be thought of as 'affecting' price are not the only excluded variables that may be thought of as helping to 'determine' price. If it is possible to make a forecast of dividends which has non-zero predictive power, then price is functionally determined by earnings, the forecast, and the residual of the regression of dividends on the forecast; and when one asks whether the law relating price to earnings is observable, this way of looking at the problem may be more relevant than the 'natural' way.

4.2. The joint effect of excluded variables

If one is free to choose either dividends or retained earnings as 'the' excluded variable, then since earnings X cannot be independent of both dividends and retained earnings, how can X be independent of U? And if a law is not observable unless X is independent of U, how can it ever be observable?

To prepare the answer to these questions, we first remark that if a person who wants to learn the effect of earnings on stock price either voluntarily or involuntarily excludes from the analysis not only dividends but all functions of earnings and dividends, then all that that person can hope to learn is the 'total' effect of earnings, consisting in part of the 'direct' effect of earnings given (say) dividends or retained earnings, in part of the 'indirect' effect that may result from the fact that earnings may affect dividends or retained earnings and dividends or retained earnings may in turn affect price. Notice that (1) although the direct and indirect effects of earnings depend on the excluded variable chosen to define them, the total effect does not, and (2) the direct and indirect effects corresponding to any particular excluded variable can be learned only by converting that variable into an included variable.

Now to answer the questions, let W_{xi} be a set of natural excluded variables (possibly including 'pure chance') that in conjunction with the sets represented by X_i and Z_{xi} are at least sufficient to determine the value of Y_{xzi}; it does not matter if W is redundant in the sense that it would still be sufficient if one or more of its members were eliminated. More specifically, let h be a function such that given Z_{xi} = z,

Y_{xzi} = h(x, z, W_{xi}),    for all x.
Further, let Z_i* and W_i* be sets of random variables unaffected by x that together with x suffice to determine Z_{xi} and W_{xi}, respectively. Z_i* and W_i* are what Nature does regardless of the value of X_i, while Z_{xi} and W_{xi} are the parts of Z_i* and W_i* that are relevant if X_i = x. (One could define Z_i* as the set of all Z_{xi} for all possible x, and define W_i* similarly, but smaller sufficient sets would ordinarily be available and more natural.)

Now assume that the Z_i* and W_i* are generated by a random process in which (1) the joint distribution of the Z*'s is arbitrary, but (2) given any one value z*, the W_i* are iid on the observations on which Z_i* = z*; and let V_i be independent of Z_i* and sufficient with Z_i* to determine W_i*. V_i can be interpreted as the part of W_i* that is unpredictable by Z_i*. [The assumption that the W_i* are iid given any one z* expresses the notion inherent in a law that Nature behaves the same whenever the same situation occurs. The V_i can often be defined in a very natural way when all relations among variables are linear, but when no natural definition is available, they can be defined as the random numbers that would correspond to W_i* in a simulation of the distribution of W_i* given Z_i* - for instance, as a random vector whose jth component is the cdf of the jth component of W_i* given the first j - 1 components and Z_i*. This generalizes the probability-integral transformation.]

If now we express Y as a function of x, Z*, and V and interpret U as the joint effect on Y of V rather than W, then because V and therefore U are not affected by X, X may be independent of U given z* and hence given z, and therefore the law relating Y to X may be observable. The law will be observable if X_i is independent of V_i given z, i.e., if Pr{X_i = x | Z_{xi} = z, V_i = v} does not depend on v - just as we previously said the law will be observable if X_i is independent of the joint effect U_{xzi}.

Three brief remarks may help a reader who wants to relate what we have just said about excluded variables to what we said in our earlier paper about excluded variables in laws without concomitants. (What we said there about laws with concomitants is incomplete because we considered only concomitants not affected by the factors.)

1. The variables that we there denoted by E and called the 'remainder' of the excluded variables W are those that we have here called V.

2. After requiring that X be uncorrelated with E (uncorrelated being enough because we were dealing with linear models and were interested only in the coefficient of X), we went on to say that this meant that X must not be affected by W.

3. To have an easy way of referring to this condition, we said that if it was satisfied, then X was by definition 'genetically' (although not statistically) independent of W.
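The bracketed construction earlier in this section, which generalizes the probability-integral transformation, can be sketched numerically. Assuming, purely for illustration, that given Z* the pair W* = (W1, W2) is bivariate normal with known conditional means and correlation, the components of V are successive conditional cdfs and come out as (approximately) independent uniform variables unaffected by Z*:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n = 50_000
rho = 0.6                       # assumed correlation of W1 and W2 given Z*

Z = rng.normal(size=n)          # Z* (here a single variable, for simplicity)
e1 = rng.normal(size=n)
e2 = rng.normal(size=n)
W1 = 0.5 * Z + e1
W2 = 0.5 * Z + rho * e1 + np.sqrt(1 - rho**2) * e2

# V1: cdf of W1 given Z*;  V2: cdf of W2 given W1 and Z* (the jth component uses
# the conditional cdf given the first j - 1 components and Z*).
V1 = norm.cdf(W1 - 0.5 * Z)                                             # W1 | Z ~ N(0.5*Z, 1)
V2 = norm.cdf((W2 - 0.5 * Z - rho * (W1 - 0.5 * Z)) / np.sqrt(1 - rho**2))

# The V's should be (approximately) uncorrelated U(0,1) variables, unpredictable by Z*.
print("corr(V1, V2):", round(np.corrcoef(V1, V2)[0, 1], 3))
print("corr(V1, Z ):", round(np.corrcoef(V1, Z)[0, 1], 3))
print("mean, var of V1:", round(V1.mean(), 3), round(V1.var(), 3))      # ~0.5, ~1/12
```

The same recursion defines V for longer vectors W*: each component is the conditional cdf of the corresponding component of W* given the components already transformed and Z*.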
References

Dawid, A.P., 1979, Conditional independence in statistical theory, Journal of the Royal Statistical Society B 41, 1-31.
Feynman, Richard P., 1967, The character of physical law (MIT Press, Cambridge, MA).
Feynman, Richard P., 1985, QED (Princeton University Press, Princeton, NJ).
Fisher, R.A., 1935/1953, The design of experiments (Oliver and Boyd, Edinburgh).
Granger, Clive, 1986, Comment, Journal of the American Statistical Association 81, 967-968.
Holland, Paul W., 1986, Statistics and causal inference, Journal of the American Statistical Association 81, 945-960, 968-970.
Malinvaud, E., 1966, Statistical methods of econometrics (Rand-McNally, Chicago, IL).
Marschak, J., 1953, Economic measurements for policy and prediction, in: W.C. Hood and T.C. Koopmans, eds., Studies in econometric method (Wiley, New York) 1-26.
Pratt, John W. and Robert Schlaifer, 1984, On the nature and discovery of structure, Journal of the American Statistical Association 79, 9-21, 29-33.
Rosenbaum, Paul R. and Donald B. Rubin, 1984, Comment, Journal of the American Statistical Association 79, 26-28.
Rubin, Donald B., 1978, Bayesian inference for causal effects, Annals of Statistics 6, 34-58.
Rubin, Donald B., 1986, Comment, Journal of the American Statistical Association 81, 961-962.
Theil, Henri, 1971, Principles of econometrics (Wiley, New York).
Zellner, Arnold, 1984, Basic issues in econometrics (University of Chicago Press, Chicago, IL) 35-74.