Journal of Econometrics 15 (1981) 357-366. North-Holland Publishing Company

LINEAR REGRESSION AFTER SELECTION*

Arthur S. GOLDBERGER

University of Wisconsin, Madison, WI 53706, USA

Received January 1980, final version received October 1980

*This article is based upon work supported by the National Science Foundation and by The Graduate School of the University of Wisconsin. An earlier version was circulated in 1975 under the title, 'Linear regression in truncated samples.' The author thanks David Crawford, Randall J. Olsen, and the referees for instructive comments.
We study the effect, upon linear regression, of explicit selection on the dependent variable. If the explanatory variables are multinormally distributed along with the dependent variable, then the regression coefficient vector in the selected population is a scalar multiple of that in the original population.
1. Introduction

How does selection on the dependent variable affect the results of linear regression? This question has arisen in recent empirical analyses of labor supply and earnings [e.g., Cain and Watts (1973), Hausman and Wise (1977), Crawford (1979)], but has a longer history. Lush and Shrode (1950, p. 338) wrote:

'[I]f at each age some cows with low records are culled, then the older cows will include a larger fraction of those with inherently high production and a smaller fraction of inherently low producers than are among the cows which make records at the younger ages. If the regression of production on age is computed ... that curve will not show the effects of age alone but will show those effects combined with whatever effects such culling actually had.'

Earlier still, Karl Pearson (1903, pp. 19-20) had written:

'If it be advantageous for a species to have a certain group of its organs of a definite size, falling within a definite range, and related to each other in a definite manner, then these changes cannot take place without modifying not only the size, but the variability and correlation of all the other organs correlated with these, although these organs themselves be not directly selected ... The selection of any complex of characters or organs in an organism changes all the
other characters and organs not directly selected ... [T]he regression coefficient of the non-selected organ on the selected remains unchanged, while that of the selected organ on the non-selected will, as a rule, be widely modified.'

We obtain a precise answer to the question, under the very special assumption that the explanatory variables are multinormally distributed along with the dependent variable.
2. Explicit selection

We start with a population to which a conventional linear regression model applies,

y = x'β + u,   u ~ N(0, ω²),   u independent of x.   (1)

We select a subpopulation by discarding all points where y falls outside Y, a subset of the real line. Thus we have explicit selection on the dependent variable; see Lord and Novick (1968, pp. 140-142).

Our concern in this paper is with the 'projection', or linear regression function (LRF), for y given x. In the full population, the LRF L(y|x) coincides with the conditional expectation function (CEF) E(y|x), namely x'β. In the selected population, the LRF is the line

L*(y|x) = α* + x'β*,   (2)

whose coefficients are determined by the normal equations

Σ*_xx β* = σ*_xy,   (3a)

α* = μ*_y - μ*_x'β*,   (3b)

where μ*_y is the mean of y, μ*_x is the mean vector of x, Σ*_xx is the variance matrix of x, and σ*_xy is the covariance vector of x with y; cf. Cramér (1951, pp. 272-275, 302-304). Here and throughout we use the asterisk (*) to distinguish operations, moments, and expectations in the selected population. The LRF L*(y|x) provides the best linear approximation to E*(y|x), the CEF in the selected population, which in general will be nonlinear. It is the coefficients of the LRF in (2) that (under very general conditions) are consistently estimated when the linear regression of y on x (and a constant) is run on a sample drawn from the selected population.

While a CEF is determined by the conditional distributions for y given x, an LRF depends also on the marginal distribution of x. To complete our
specification, we make the very special assumption that in the original population the regressor vector was multinormally distributed,

x ~ N(0, Σ_xx).   (4)

Without loss of generality we have taken the mean of x to be 0 and have also set the intercept in (1) at zero.

Under this specification the answer to the question which began the paper is very simple: β* is a scalar multiple of β. The calculation runs as follows. From (1) and (4), in the original population the random vector (x', y)' was multinormally distributed with means zero, and with variances and covariances

V(x) = Σ_xx,   C(x, y) = Σ_xx β,   V(y) = σ²,   (5)
where

σ² = τ² + ω²,   τ² = β'Σ_xx β.   (6)

Consequently the marginal distribution of y was normal,

y ~ N(0, σ²);   (7)

the conditional distributions of x given y were normal with

E(x|y) = γy,   (8a)

V(x|y) = Φ,   (8b)

where

γ = Σ_xx β/σ²,   (9a)

Φ = Σ_xx - σ²γγ';   (9b)

and the multiple coefficient of determination of y on x was

ρ² = β'Σ_xx β/σ² = τ²/(τ² + ω²).   (10)

Selection changes the marginal distribution of the explicitly selected variable y. Let E*(y) = μ*_y and V*(y) denote the mean and variance of y in
the selected population, and define

θ = V*(y)/σ².   (11)
But the selection does not change the conditional distributions of x given y (for y in Y); cf. Pearson (1903), Aitken (1934), Lawley (1943). Thus for y in Y the conditional moments in (8) hold also after selection,

E*(x|y) = γy,   (12a)

V*(x|y) = Φ.   (12b)

From (9), (11), and (12), we get the variance of conditional means as

V*(E*(x|y)) = γ V*(y) γ' = θσ²γγ',   (13)

and the mean of conditional variances as

E*(V*(x|y)) = Φ = Σ_xx - σ²γγ'.   (14)

Summing these two gives the variance matrix of x in the selected population as

Σ*_xx = V*(x) = Σ_xx - (1 - θ)σ²γγ'.   (15)

Covarying the conditional mean in (12a) with y, we get the covariance of x with y in the selected population as

σ*_xy = C*(x, y) = C*(E*(x|y), y) = θσ²γ.   (16)

[Eqs. (15)-(16) are special cases of Johnson and Kotz (1972, p. 70, eq. (68.2)).] Thus the normal eq. (3a) which determines the slope vector of the LRF in the selected population is

(Σ_xx - (1 - θ)σ²γγ') β* = θσ²γ.   (17)

Recalling the definitions of γ in (9a) and of ρ² in (10), it is readily confirmed that the solution to (17) is

β* = λβ,   (18)

where

λ = θ/(1 - ρ²(1 - θ)).   (19)
Thus the LRF slope vector β* is a scalar multiple of β. Further, it is easy to verify that the LRF intercept defined in (3b) is

α* = (1 - λρ²) μ*_y,   (20)

and that the coefficient of determination in the selected population, defined as ρ*² = β*'Σ*_xx β*/V*(y), is

ρ*² = λρ².   (21)

Evidently the scalar λ is positive.

Under multinormality, explicit selection on the dependent variable leads to a proportional change in all the LRF slopes and in the coefficient of determination.
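As a check on (18)-(19), the following simulation sketch (not part of the original paper; the covariance matrix, β, the selection set Y = (0.5, 2.0), and the sample size are illustrative assumptions) draws a large sample from specification (1) and (4), applies explicit selection on y, and compares the OLS slopes in the selected sample with λβ, using θ estimated from the selected sample and ρ² computed from the assumed parameters.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2_000_000
    Sxx = np.array([[1.0, 0.5], [0.5, 2.0]])       # Sigma_xx, assumed
    beta = np.array([1.0, -0.5])                   # assumed slope vector
    omega2 = 1.0                                   # Var(u), assumed
    tau2 = beta @ Sxx @ beta                       # eq. (6)
    sigma2 = tau2 + omega2
    rho2 = tau2 / sigma2                           # eq. (10)

    x = rng.multivariate_normal(np.zeros(2), Sxx, size=n)
    y = x @ beta + rng.normal(0.0, np.sqrt(omega2), size=n)

    keep = (y > 0.5) & (y < 2.0)                   # explicit selection: y in Y
    xs, ys = x[keep], y[keep]

    theta = ys.var() / sigma2                      # eq. (11), estimated in the selected sample
    lam = theta / (1.0 - rho2 * (1.0 - theta))     # eq. (19)

    X = np.column_stack([np.ones(len(ys)), xs])    # OLS of y on x and a constant
    b_star = np.linalg.lstsq(X, ys, rcond=None)[0][1:]

    print("OLS slopes in selected sample:", b_star)
    print("lambda * beta               :", lam * beta)

With a sample this large the two printed vectors agree to roughly two or three decimal places.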
3. Truncation

A particular type of explicit selection is truncation on the dependent variable. Suppose that the selection rule is y <= c, where c is the fixed truncation point. From (1) a familiar calculation gives the CEF of y on x in the selected population,

E*(y|x) = x'β - ω r(z),   (22)

where z = (c - x'β)/ω, r(.) = f(.)/F(.), while f(.) and F(.) denote the standard normal pdf and cdf; see Tobin (1958), Amemiya (1973). [Since the points with y > c are discarded rather than being reported at the limit value, we have the 'truncated', as distinguished from the 'censored', variant of Tobin's limited dependent variable model; see Heckman (1976, p. 478), Olsen (1980).] The slopes of this nonlinear CEF are

dE*(y|x)/dx_j = (1 + r'(z)) β_j,   (23)

where x_j and β_j are typical elements of x and β, while

r'(z) = dr(z)/dz = -(z + r(z)) r(z)   (24)
is guaranteed to lie between 0 and -1; see Heckman (1976), Goldberger (1980). Thus in (23) the scale factor 1 + r'(z) lies in the unit interval. As noted by Poirier and Melino (1978), when evaluated at any x, the slopes of the nonlinear CEF for the selected population are proportionally attenuated with respect to those in the CEF for the original population.
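A short sketch (illustrative, not from the paper; the grid of z values is arbitrary) tabulates r(z), the derivative r'(z) from (24), and the scale factor 1 + r'(z) from (23), confirming that the latter lies in the unit interval:

    import numpy as np
    from scipy.stats import norm

    def r(z):
        """Inverse Mills ratio f(z)/F(z) for the standard normal."""
        return norm.pdf(z) / norm.cdf(z)

    def r_prime(z):
        """Derivative of r(z), eq. (24)."""
        return -(z + r(z)) * r(z)

    for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(f"z={z:+.1f}  r(z)={r(z):.4f}  r'(z)={r_prime(z):+.4f}  1+r'(z)={1 + r_prime(z):.4f}")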
It might be conjectured that under truncation, LRF slopes will also be proportionally attenuated; see Cain and Watts (1973, pp. 342-343), Hausman and Wise (1977, p. 935). But the following example shows that in the absence of the multinormality specification (4), neither proportionality nor attenuation is guaranteed for the LRF slopes. Suppose that y = x_1 + x_2 + u, where u ~ N(0, 1) independent of x_1 and x_2, while the random vector x = (x_1, x_2)' takes on only the four distinct values given in column (1) of the table, with probabilities given by the p(x) in column (2). Then y|x ~ N(x_1 + x_2, 1). The CEF E(y|x) = x_1 + x_2 is displayed in column (3). From this information we can calculate Σ_xx as well as σ_xy.
Table 1
Calculation of effects of selection.

  (1)        (2)     (3)       (4)    (5)        (6)        (7)
  x'         p(x)    E(y|x)    z      F(z)       p*(x)      E*(y|x)
  1,  4      2/6      5         0     0.50000    0.20005     4.20212
  1,  1      1/6      2         3     0.99865    0.19978     1.99556
  -1, -1     1/6     -2         7     1.00000    0.20005    -2.00000
  -1, -4     2/6     -5        10     1.00000    0.40011    -5.00000
Our selected population is obtained by retaining only points for which y <= 5. Thus z = 5 - x_1 - x_2, as shown in column (4). Column (5) gives the corresponding F(z). The probabilities assigned to x in the selected population, obtained as p*(x) = p(x)F(z)/Σ p(x)F(z), are shown in column (6). Finally, the CEF in the selected population, E*(y|x) = x_1 + x_2 - r(z), is tabulated in column (7). Since C*(x, y) = C*(x, E*(y|x)), the information suffices to calculate σ*_xy as well as Σ*_xx. Then β* = (1.111, 0.887)'. Comparing this with β = (1, 1)', we see that selection has increased one of the slopes while decreasing the other.
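The arithmetic behind Table 1 and the slope vector β* = (1.111, 0.887)' can be reproduced mechanically. The sketch below (illustrative, not from the paper) rebuilds columns (4)-(7) and solves the full- and selected-population normal equations for the LRF slopes:

    import numpy as np
    from scipy.stats import norm

    x = np.array([[1.0, 4.0], [1.0, 1.0], [-1.0, -1.0], [-1.0, -4.0]])   # column (1)
    p = np.array([2.0, 1.0, 1.0, 2.0]) / 6.0                             # p(x), column (2)
    Ey_x = x.sum(axis=1)                                                 # E(y|x) = x1 + x2, column (3)

    z = 5.0 - Ey_x                                                       # column (4)
    Fz = norm.cdf(z)                                                     # column (5)
    p_star = p * Fz / np.sum(p * Fz)                                     # p*(x), column (6)
    Ey_x_star = Ey_x - norm.pdf(z) / norm.cdf(z)                         # E*(y|x), column (7)

    def lrf_slopes(xv, prob, cond_mean):
        """LRF slopes of y on x for a discrete x-distribution, using C(x, y) = C(x, E(y|x))."""
        mx = prob @ xv
        my = prob @ cond_mean
        Sxx = (xv - mx).T @ np.diag(prob) @ (xv - mx)      # V(x)
        sxy = (xv - mx).T @ (prob * (cond_mean - my))      # C(x, y)
        return np.linalg.solve(Sxx, sxy)

    print("full population    :", lrf_slopes(x, p, Ey_x))            # approx. (1, 1)
    print("selected population:", lrf_slopes(x, p_star, Ey_x_star))  # approx. (1.111, 0.887)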
On the other hand, when the multinormality assumption (4) is imposed, truncation on the dependent variable does imply proportional attenuation of the LRF slopes. For the analysis of section 2 applies, with the following additional conclusions: in the selected population the marginal distribution of y will be truncated normal, with

μ*_y = -σ r(c/σ),   (25a)

θ = 1 - ((c/σ) + r(c/σ)) r(c/σ);   (25b)

see Johnson and Kotz (1970, pp. 81-83). Observe that here θ = 1 + r'(c/σ), which is guaranteed to lie in the unit interval. Since ρ² also lies in the unit interval, we see from (19) that now 0 < λ < 1.
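To see how (25b) and (19) combine, a small sketch (illustrative values of c/σ and ρ², not from the paper) computes θ and λ for the truncated case and confirms that λ stays inside the unit interval:

    import numpy as np
    from scipy.stats import norm

    def r(z):
        return norm.pdf(z) / norm.cdf(z)

    rho2 = 0.6                                       # assumed coefficient of determination
    for c_over_sigma in (-1.0, 0.0, 1.0, 2.0):
        theta = 1.0 - (c_over_sigma + r(c_over_sigma)) * r(c_over_sigma)   # eq. (25b)
        lam = theta / (1.0 - rho2 * (1.0 - theta))                         # eq. (19)
        print(f"c/sigma={c_over_sigma:+.1f}  theta={theta:.4f}  lambda={lam:.4f}")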
4. Incidental selection

To what extent do our conclusions hinge on the assumption that selection is explicit on the dependent variable? It suffices to examine a multinormal version of the now-popular selectivity-bias model [Heckman (1976)]. Suppose that in the full population,

y_1 = x'β_1 + u_1,   y_2 = x'β_2 + u_2,   (26)

x ~ N(0, Σ_xx),   u = (u_1, u_2)' ~ N(0, Ω),   (27)

where u is independent of x, and

Ω = ( ω_1²   ω_12 )
    ( ω_12   ω_2² ).   (28)

Let y = (y_1, y_2)' and B = (β_1, β_2). Then in the full population the random vector (x', y')' is multinormally distributed with mean vector 0 and variance matrix

Σ = ( Σ_xx     Σ_xx B )
    ( B'Σ_xx   Σ_yy   ),   (29)

with

Σ_yy = ( σ_1²   σ_12 )
       ( σ_12   σ_2² )   (30a)

     = B'Σ_xx B + Ω.   (30b)

Further suppose that the selection rule is y_2 <= c_2, with c_2 being a fixed
truncation point. Thus y_2 is the explicit selection variable, while y_1 is, along with x, incidentally selected; see Lord and Novick (1968, pp. 140-142).

We are interested in the relationship between y_1 and x. In the full population, the CEF is E(y_1|x) = x'β_1; in the selected population it is

E*(y_1|x) = x'β_1 - (ω_12/ω_2) r(z_2),   (31)

where z_2 = (c_2 - x'β_2)/ω_2; see Heckman (1976, p. 479). The slopes of this nonlinear function, evaluated at a particular x, are given by

dE*(y_1|x)/dx_j = β_1j + (ω_12/ω_2²) r'(z_2) β_2j.   (32)

Since these are mixtures of corresponding elements of β_1 and β_2, no presumption of proportionality arises.

We proceed to the LRF in the selected population, namely

L*(y_1|x) = α*_1 + x'β*_1,   (33)

where β*_1 is the solution to the normal equation

Σ*_xx β*_1 = σ*_xy1.   (34)

For the incidentally selected variables y_1 and x, the Pearson-Aitken-Lawley approach shows that

Σ*_xx = Σ_xx - (1 - θ_2)(Σ_xx β_2)(Σ_xx β_2)'/σ_2²,   σ*_xy1 = Σ_xx β_1 - (1 - θ_2) σ_12 Σ_xx β_2/σ_2²,   (35)

where

θ_2 = V*(y_2)/σ_2² = 1 + r'(c_2/σ_2).   (36)

From (34) and (35) it is easy to verify that

β*_1 = β_1 - φ(ω_12/σ_2²) β_2,   (37)

where

φ = (1 - θ_2)/(1 - ρ_2²(1 - θ_2)),   ρ_2² = β_2'Σ_xx β_2/σ_2².   (38)

Evidently, even under multinormality, neither proportionality nor attenuation of the LRF slopes is guaranteed when the dependent variable is only incidentally selected.
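A simulation sketch of the incidental-selection result (37)-(38) (not from the paper; Σ_xx, B, Ω, c_2 and the sample size are illustrative assumptions) selects on y_2 <= c_2, regresses y_1 on x in the selected sample, and compares the slopes with β_1 - φ(ω_12/σ_2²)β_2:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 2_000_000
    Sxx = np.array([[1.0, 0.3], [0.3, 1.5]])
    beta1 = np.array([1.0, 2.0])
    beta2 = np.array([-1.0, 0.5])
    Omega = np.array([[1.0, 0.6], [0.6, 1.0]])        # eq. (28), assumed values
    c2 = 0.5

    x = rng.multivariate_normal(np.zeros(2), Sxx, size=n)
    u = rng.multivariate_normal(np.zeros(2), Omega, size=n)
    y1 = x @ beta1 + u[:, 0]
    y2 = x @ beta2 + u[:, 1]

    keep = y2 <= c2                                   # explicit selection on y2
    X = np.column_stack([np.ones(keep.sum()), x[keep]])
    b1_star = np.linalg.lstsq(X, y1[keep], rcond=None)[0][1:]

    sigma2_sq = beta2 @ Sxx @ beta2 + Omega[1, 1]     # sigma_2^2, from eq. (30b)
    rho2_sq = (beta2 @ Sxx @ beta2) / sigma2_sq
    theta2 = y2[keep].var() / sigma2_sq               # eq. (36), estimated
    phi = (1.0 - theta2) / (1.0 - rho2_sq * (1.0 - theta2))   # eq. (38)

    print("OLS slopes of y1 on x (selected)    :", b1_star)
    print("beta1 - phi*(omega12/sigma2^2)*beta2:", beta1 - phi * Omega[0, 1] / sigma2_sq * beta2)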
5. Remarks

Our analysis has been in terms of (full and selected) population parameters rather than sample statistics. But it does suggest how, under multinormality, consistent estimates of the full-population parameters may be obtained with samples drawn from the selected population. For example, Greene (1980), adapting Pearson and Lee (1908) and Olsen (1980), spells out a procedure for consistent estimation of β when the specification consists of (1), (4), and truncation on y. Incidentally, for the censored variant of that model, Greene also shows that the LRF slope vector β* is proportional to β, as in our (18), except that λ = F(c/σ).

We must emphasize that the multinormal specification is rarely appropriate in socioeconomic applications. In the absence of normality of x, procedures for consistently estimating β in (1) - or β_1 in (26) - with a sample drawn from the selected population can be found in the recent econometric selectivity-bias literature; e.g., Amemiya (1973), Hausman and Wise (1977), Heckman (1976, 1979).
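The censored-variant proportionality just cited, λ = F(c/σ), can likewise be checked numerically. The sketch below (illustrative parameter values; it is a check of that statement, not Greene's estimation procedure) censors y at c and compares the OLS slopes with F(c/σ)·β:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    n = 2_000_000
    Sxx = np.array([[1.0, 0.4], [0.4, 1.0]])
    beta = np.array([1.0, -2.0])
    omega = 1.0                                        # std. dev. of u, assumed
    c = 0.7

    x = rng.multivariate_normal(np.zeros(2), Sxx, size=n)
    y = x @ beta + rng.normal(0.0, omega, size=n)
    y_cens = np.minimum(y, c)                          # censoring (not truncation)

    X = np.column_stack([np.ones(n), x])
    b_cens = np.linalg.lstsq(X, y_cens, rcond=None)[0][1:]

    sigma = np.sqrt(beta @ Sxx @ beta + omega**2)
    lam = norm.cdf(c / sigma)                          # lambda = F(c/sigma)
    print("OLS slopes on censored y:", b_cens)
    print("lambda * beta           :", lam * beta)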
References

Aitken, A.C., 1934, Note on selection from a multivariate normal population, Proceedings of the Edinburgh Mathematical Society, Series 2, 4, 106-110.

Amemiya, Takeshi, 1973, Regression analysis when the dependent variable is truncated normal, Econometrica 41, Nov., 997-1016.

Cain, Glen G. and Harold W. Watts, 1973, Toward a summary and synthesis of the evidence, Ch. 9 in: Cain and Watts, eds., Income maintenance and labor supply: Econometric studies (Markham, Chicago, IL) 328-367.

Cramér, Harald, 1951, Mathematical methods of statistics (Princeton University Press, Princeton, NJ).

Crawford, David L., 1979, Estimating models of earnings from truncated samples, Department of Economics doctoral dissertation (University of Wisconsin, Madison, WI).

Goldberger, Arthur S., 1980, Abnormal selection bias, Social Systems Research Institute workshop paper 8006, May (University of Wisconsin, Madison, WI).

Greene, William H., 1980, On the asymptotic bias of the ordinary least squares estimator of the Tobit model, Department of Economics manuscript (Cornell University, Ithaca, NY).

Hausman, Jerry A. and David A. Wise, 1977, Social experimentation, truncated distributions, and efficient estimation, Econometrica 45, May, 919-938.

Heckman, James J., 1976, The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models, Annals of Economic and Social Measurement 5, Fall, 475-492.

Heckman, James J., 1979, Sample selection bias as a specification error, Econometrica 47, Jan., 153-161.

Johnson, Norman L. and Samuel Kotz, 1970, Distributions in statistics: Continuous univariate distributions - 1 (Houghton Mifflin, Boston, MA).

Johnson, Norman L. and Samuel Kotz, 1972, Distributions in statistics: Continuous multivariate distributions (Wiley, New York).

Lawley, D.N., 1943, A note on Karl Pearson's selection formulae, Proceedings of the Royal Society of Edinburgh, Section A, 62, 28-30.

Lord, Frederic M. and Melvin R. Novick, 1968, Statistical theories of mental test scores (Addison-Wesley, Reading, MA).

Lush, Jay L. and Robert R. Shrode, 1950, Changes in milk production with age and milking frequency, Journal of Dairy Science 33, 338-357.
Olsen, Randall J., 1980, Approximating a truncated normal regression with the method of moments, Econometrica 48, July, 1099-1105.

Pearson, Karl, 1903, Mathematical contributions to the theory of evolution - IX: On the influence of natural selection on the variability and correlation of organs, Philosophical Transactions of the Royal Society of London A-200, 1-66.

Pearson, Karl and Alice Lee, 1908, On the generalised probable error in multiple normal correlation, Biometrika 6, 59-68.

Poirier, Dale J. and Angelo Melino, 1978, A note on the interpretation of regression coefficients within a class of truncated distributions, Econometrica 46, Sept., 1207-1209.

Tobin, James, 1958, Estimation of relationships for limited dependent variables, Econometrica 26, Jan., 24-36.