SOCIAL SCIENCE RESEARCH 12, 109-130 (1983)

Recursive Models for Categorical Data*

THOMAS P. WILSON AND WILLIAM T. BIELBY
University of California, Santa Barbara
Standard methods for recursive models with continuous endogenous variables are extended to models with categorical endogenous variables. The concept of a reduced-form equation is generalized in a natural way to cover nonlinear regression functions and, in particular, models with categorical endogenous variables. Maximum-likelihood estimation and asymptotic chi-square tests are described. Two numerical examples are presented: a linear recursive two-equation model for all-categorical data, and a combined linear and logit three-equation recursive model with both categorical and continuous endogenous variables. Limitations of the present work and directions for further extension are noted.
In the past several decades, work on methods for analyzing categorical data has been devoted mainly to log-linear and latent-variable models. Log-linear models are specifically designed for situations in which all the variables are categorical and seek to exploit what are taken to be the special features of such data (e.g., Fienberg, 1980; Duncan, 1982). Latent-variable models, in contrast, attempt to deal with categorical data by assuming that categorical endogenous variables reflect unobserved underlying continuous variables which carry the fundamental structure; this approach, then, uses observed categorical data to infer the properties of that assumed underlying structure (e.g., Amemiya, 1981; Manski, 1981; Muthen, 1981; Winship and Mare, 1983). In this paper we develop a method for analyzing categorical data that differs from both the log-linear and the latent-variable approaches. The method presented here is a straightforward extension of the basic logic of structural-equation models for directly measured continuous variables. Thus, in common with the log-linear approach and in contrast with latent-variable models, we assume that all the variables of importance are explicitly measured. But in contrast with log-linear models, we base our approach on a direct generalization of conventional regression concepts.

* This research was supported in part by a grant from the National Science Foundation (SES 79-24905). We are indebted to J. S. Rao and R. A. Berk for comments on early drafts of this paper. Send requests for reprints to Dr. Thomas P. Wilson, Department of Sociology, University of California, Santa Barbara, CA 93106.

0049-089X/83 $3.00
Copyright © 1983 by Academic Press, Inc. All rights of reproduction in any form reserved.
The result is a unified method for the analysis of recursive systems with categorical as well as continuous variables as either exogenous or endogenous factors.

In this paper, then, we confine attention to models in which all variables are observable, and thus we are not concerned with situations in which measured variables are interpreted as proxies for or effects of underlying factors presumed to carry the real structure. Moreover, we limit the discussion to quasi-linear models in which the regression function depends only on a linear combination of the predetermined variables and the parameters of the model. Further, we discuss only recursive models, neglecting situations in which the recursive assumptions are not met.¹ Finally, we are concerned only with models for cross-sectional data and do not consider time series. For all its limitations, the class of models we do consider is nevertheless extremely important for applications in social research. Moreover, there are issues concerning even this simplest case that have not been adequately understood, and it is our view that thorough understanding of the elementary situation will help clarify more complex ones.

We begin by considering one special case that is important in its own right, namely, that in which all the variables are categorical. The problem of analyzing purely categorical data arises frequently in substantive research and has generated an impressive literature, especially in the log-linear tradition. The approach taken here differs in that it emphasizes the logic of recursive structural models. We then turn to a more general case and develop the structural and reduced-form equations for models with both categorical and continuous endogenous variables.

MODELS FOR PURELY CATEGORICAL DATA
In this section we consider purely categorical data, that is, situations in which all variables are polytomous. Two general classes of models have been developed for the analysis of purely categorical data. In what has come to be called "contingency table analysis," all variables are treated on an essentially equal footing with none singled out as endogenous or exogenous and no causal structure assumed. We shall not be concerned with this situation, which has received extensive attention in the literature, but rather with structural models in which an explicit distinction is made between exogenous and endogenous variables, specific causal relations are assumed, and all variables of substantive interest are observable.

¹ A model can fail to be recursive in two ways. First, one may postulate causal paths that form a closed loop, in which case the system of equations does not have a hierarchical structure. Second, the stochastic components may not be independent across equations of the model. The first type of nonrecursiveness is beyond the scope of this paper. The second type arises routinely because of the omission of structurally important variables from the model. Moreover, even when all important variables are included, the stochastic components still may not be independent. Such problems are probably inevitable in social research, and while ways of dealing with them are part of the art of structural-equation modeling with continuous variables, little in the way of practical lore has developed for categorical data. Consequently, in this paper we confine attention to the idealized recursive case.

There has been some controversy concerning the proper form for structural models for the analysis of purely categorical data. In particular, proponents of linear and log-linear or logit models have argued at some length for their favored approaches, often appealing to supposed mathematical or statistical advantages as grounds for their preferences (e.g., Goodman, 1975; Gillespie, 1977; Brier, 1978; Kritzer, 1979; Swafford, 1980). We shall dispose of this issue by showing that for purely categorical data linear and logit models are mathematically equivalent, and indeed that all quasi-linear models are formally equivalent. Consequently, logit and linear models in particular do not differ in the information they represent about the data but rather in the way information is organized and presented, and hence choice of a particular form for a model must be based on substantive rather than abstract or a priori considerations.

We begin by sketching the setup for purely categorical data and showing the mutual equivalence of quasi-linear models in this case. We then discuss estimation, inference, and recursive models, and conclude with a numerical example of a recursive linear two-equation model.

The Purely Categorical Setup
Consider a polytomy y regarded as dependent on one or more polytomous predetermined variables. Suppose that y has $R + 1$ mutually exclusive and exhaustive categories, one of which we choose as a reference or origin and label with index 0, and let us label the remaining categories with $1, \ldots, R$. We code the categories of y with binary variables $Y^{(0)}, Y^{(1)}, \ldots, Y^{(R)}$ such that $Y^{(r)} = 1$ if an observation is in the rth category of y and 0 otherwise, and we let $\pi^{(r)}$ be the probability that $Y^{(r)} = 1$ conditional on fixed values of the predetermined variables, so that $\pi^{(r)}$ is a function of the predetermined variables. Since the categories of y are mutually exclusive and exhaustive, the $Y^{(r)}$ and $\pi^{(r)}$ sum to 1; consequently we can leave the zeroth category implicit, with $Y^{(0)} = 1 - \sum_{r=1}^{R} Y^{(r)}$ and $\pi^{(0)} = 1 - \sum_{r=1}^{R} \pi^{(r)}$.

Any particular predetermined variable is a polytomy consisting of, say, $T + 1$ categories; we choose one of these as a reference category and leave it implicit, so that this predetermined variable can be represented by exactly T binary variables, just as y can be represented by $Y^{(1)}, \ldots, Y^{(R)}$. We have, then, that the entire set of predetermined variables can be represented by, say, K binary variables $X_1, \ldots, X_K$, in such a fashion that the reference category for each predetermined polytomy is omitted, so that there are no linear relations among the $X_k$. Moreover, some of the $X_k$ may be products of other binary variables to represent interaction terms, but we do not require that when an interaction term appears in a model all of its factors must also appear. We shall array the values of the predetermined variables for a given observation as a $1 \times (K + 1)$ matrix $X = [1\ X_1 \cdots X_K]$, where we have suppressed subscripts referring to the particular observation, and the initial component 1 provides for a constant term in the model.

The leading idea is that the conditional probability of an observation falling in a particular category of y given the value of X is some specified function of linear combinations of the predetermined variables. Thus, letting $\beta^{(r)} = [\beta_0^{(r)}\ \beta_1^{(r)} \cdots \beta_K^{(r)}]'$ be a column vector of coefficients, we assume

$$\pi^{(r)} = P(Y^{(r)} = 1 \mid X) = \psi^{(r)}(X\beta^{(1)}, \ldots, X\beta^{(R)}), \qquad r = 1, \ldots, R, \tag{1}$$
where the $\psi^{(r)}$ form an invertible system of functions mapping the quantities $X\beta^{(r)}$ onto the quantities $\pi^{(r)}$ in a one-to-one fashion. Because the system $\psi^{(1)}, \ldots, \psi^{(R)}$ is invertible we can solve (1) to obtain the relations $\phi^{(r)}(\pi^{(1)}, \ldots, \pi^{(R)}) = X\beta^{(r)}$, where the $\phi^{(r)}$ denote the components of the inverse system, and so the quasi-linear models are precisely those considered by Grizzle, Starmer, and Koch (1969) that have an invertible response function. Moreover, $\pi^{(r)}$ in addition to being the conditional probability that $Y^{(r)} = 1$ is also the conditional expected value of $Y^{(r)}$, so that the $\psi^{(r)}$ are regression functions:

$$E(Y^{(r)} \mid X) = \psi^{(r)}(X\beta^{(1)}, \ldots, X\beta^{(R)}). \tag{2}$$
This fact will be important later when we consider recursive models and, in particular, reduced forms.

The two most common examples of quasi-linear models for purely categorical data are of course the logit and linear models. For the logit model, we take $\psi^{(r)}$ as the logistic function

$$\pi^{(r)} = \frac{\exp(X\beta^{(r)})}{1 + \sum_{t=1}^{R} \exp(X\beta^{(t)})}, \tag{3}$$

where exp is the exponential function. An equivalent form is the log-odds version which emphasizes the log-linear character of the model,

$$\log(\pi^{(r)}/\pi^{(0)}) = X\beta^{(r)}, \tag{4}$$

where log is the natural logarithm. In this form it is easy to see that, because the $X_k$ are binary variables, the $\beta_k^{(r)}$ can be interpreted directly as logarithms of odds ratios and, in the case of interaction terms, logarithms of ratios of such ratios. For the linear model we take $\psi^{(r)}$ to be

$$\pi^{(r)} = X\alpha^{(r)}, \tag{5}$$
where we have used "α" here instead of "β" to emphasize the fact that while the predetermined variables in (3) and (5) are the same, the coefficients are different. In a linear model, the $\alpha_k^{(r)}$ can be interpreted directly as percentage differences and differences of such differences.²

The Equivalence of Quasi-linear Models
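We now show that any two quasi-linear models for purely categorical data are equivalent in the sense that, given the coefficients for a model of one functional form, the coefficients for a model of any other functional form are uniquely determined.³ It will be sufficient to prove that any quasi-linear model is equivalent to a linear model. We proceed by showing how the equivalence can be established in a particular case, generalization of the procedure then being obvious. Suppose $K = 2$, and let $\psi^{(1)}, \ldots, \psi^{(R)}$ be any functions such that

$$\pi^{(r)} = \psi^{(r)}\bigl(\beta_0^{(1)} + \beta_1^{(1)}X_1 + \beta_2^{(1)}X_2 + \beta_{12}^{(1)}X_1X_2,\ \ldots,\ \beta_0^{(R)} + \beta_1^{(R)}X_1 + \beta_2^{(R)}X_2 + \beta_{12}^{(R)}X_1X_2\bigr) \tag{6}$$

is a quasi-linear model. Since $X_1$ and $X_2$ assume only the values 0 and 1, it is convenient to let

$$\psi_{00}^{(r)} = \psi^{(r)}\bigl(\beta_0^{(1)}, \ldots, \beta_0^{(R)}\bigr),$$
$$\psi_{10}^{(r)} = \psi^{(r)}\bigl(\beta_0^{(1)} + \beta_1^{(1)}, \ldots, \beta_0^{(R)} + \beta_1^{(R)}\bigr),$$
$$\psi_{01}^{(r)} = \psi^{(r)}\bigl(\beta_0^{(1)} + \beta_2^{(1)}, \ldots, \beta_0^{(R)} + \beta_2^{(R)}\bigr),$$
$$\psi_{11}^{(r)} = \psi^{(r)}\bigl(\beta_0^{(1)} + \beta_1^{(1)} + \beta_2^{(1)} + \beta_{12}^{(1)}, \ldots, \beta_0^{(R)} + \beta_1^{(R)} + \beta_2^{(R)} + \beta_{12}^{(R)}\bigr).$$

Then

$$\pi^{(r)} = \begin{cases} \psi_{00}^{(r)}, & X_1 = 0,\ X_2 = 0, \\ \psi_{10}^{(r)}, & X_1 = 1,\ X_2 = 0, \\ \psi_{01}^{(r)}, & X_1 = 0,\ X_2 = 1, \\ \psi_{11}^{(r)}, & X_1 = 1,\ X_2 = 1. \end{cases} \tag{7}$$

Hence we can write

$$\pi^{(r)} = \alpha_0^{(r)} + \alpha_1^{(r)}X_1 + \alpha_2^{(r)}X_2 + \alpha_{12}^{(r)}X_1X_2, \tag{8}$$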
² In the linear case the model can be rewritten usefully in the form $Y^{(r)} = X\alpha^{(r)} + \epsilon^{(r)}$, where $E(\epsilon^{(r)} \mid X) = 0$.
³ T. P. Wilson, "Exponential-Family Regression Models," unpublished manuscript, Department of Sociology, University of California, Santa Barbara.
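where

$$\alpha_0^{(r)} = \psi_{00}^{(r)}, \quad \alpha_1^{(r)} = \psi_{10}^{(r)} - \psi_{00}^{(r)}, \quad \alpha_2^{(r)} = \psi_{01}^{(r)} - \psi_{00}^{(r)}, \quad \alpha_{12}^{(r)} = \psi_{11}^{(r)} - \psi_{10}^{(r)} - \psi_{01}^{(r)} + \psi_{00}^{(r)}. \tag{9}$$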
Thus, given any quasi-linear model (6), we can obtain a unique linear model (8). But, since $\psi^{(1)}, \ldots, \psi^{(R)}$ define a one-to-one system of functions, Eqs. (9) can be solved for the β's. Hence, given a linear model (8), we can calculate the coefficients for any other form of quasi-linear model (6). In short, for purely categorical data, the coefficients of any quasi-linear model are one-to-one functions of the coefficients of a uniquely corresponding linear model. Note, however, that the transformations in (7) and (9) are not simple ones, and in general, different functional forms organize the information in quite different ways. In particular, the concept of interaction is relative to the form of model one employs, for presence or absence of interaction in, say, a logit model implies nothing about the presence or absence of interaction in the corresponding linear model.

Estimation

The sample proportions of observations falling into the various categories of the dependent variable have, for fixed X, a multinomial distribution with probabilities $\pi^{(0)}, \pi^{(1)}, \ldots, \pi^{(R)}$. Consequently, maximum-likelihood estimates for the coefficients and the variance-covariance matrix can be obtained using the Fisher scoring algorithm.⁴ In the case of a linear model, this reduces to simple iteration of the procedure described by Grizzle et al. (1969), but for other functional forms the algorithm is more complicated.

The availability of maximum-likelihood estimators and the equivalence of quasi-linear models dispose of one objection raised against the linear model for purely categorical data, namely, that one might obtain estimated coefficients that lead to estimated probabilities greater than one or less than zero. For the maximum-likelihood estimate of the transform of a quantity is that same transform of the maximum-likelihood estimate of the quantity itself (Mood, Graybill, and Boes, 1974, pp. 284-286).
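The correspondence between functional forms can be illustrated numerically. The following is a minimal sketch for the dichotomous case ($R = 1$) with two binary predictors, in which logit coefficients are mapped to the cell probabilities, then to linear coefficients as differences of cell probabilities, and back again. The coefficient values and the function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def logistic(t):
    """Logistic function, mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    """Inverse of the logistic function (log odds)."""
    return np.log(p / (1.0 - p))

# Design rows [1, X1, X2, X1*X2] for the four cells (0,0), (1,0), (0,1), (1,1).
CELLS = np.array([[1, 0, 0, 0],
                  [1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [1, 1, 1, 1]], dtype=float)

def logit_to_linear(beta):
    """Linear coefficients that reproduce the cell probabilities of a logit model."""
    pi = logistic(CELLS @ beta)        # the four cell probabilities
    return np.linalg.solve(CELLS, pi)  # differences of cell probabilities

def linear_to_logit(alpha):
    """Logit coefficients that reproduce the cell probabilities of a linear model."""
    pi = CELLS @ alpha
    return np.linalg.solve(CELLS, logit(pi))

# Hypothetical logit coefficients with NO interaction term (last entry is zero).
beta = np.array([-1.0, 0.8, 0.5, 0.0])
alpha = logit_to_linear(beta)
beta_back = linear_to_logit(alpha)

print("linear coefficients:", alpha)
print("recovered logit coefficients:", beta_back)
```

In this sketch the logit model has no interaction, yet the interaction coefficient of the corresponding linear model (the last entry of `alpha`) is not zero, which is the sense in which interaction is relative to the functional form.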
Hence, estimated coefficients of a linear model leading to probabilities outside the unit interval would transform into coefficients for a logit model generating log odds outside the interval from minus to plus infinity, which is impossible.

It should be noted, however, that although maximum-likelihood estimates can be transformed from one functional form of quasi-linear model to another, the actual computation of the estimates may be very much simpler in one form than in another. For example, the constraint expressed as $\alpha_{12} = 0$ in the linear model (8) is very simple, and estimation taking it into account is straightforward; but this constraint transforms into a complicated nonlinear expression in, say, a logit model and estimation subject to it would be difficult and expensive. And the converse clearly is true for the constraint $\beta_{12} = 0$ in the logit model.

⁴ The dichotomous case, $R = 1$, is discussed by Nelder and Wedderburn (1972; see also Jennrich and Moore, 1975; Charnes, Frome, and Yu, 1976). A more general result covering the polytomous case is mentioned by Barndorff-Nielsen (1978, p. 185) and proved under still broader conditions in Wilson (unpublished).

Statistical Inference

A hypothesis concerning the coefficients can be tested using either likelihood-ratio methods or the computationally simpler Wald statistic, both of which are asymptotically distributed as chi-square with degrees of freedom equal to the number of independent constraints in the hypothesis (Rao, 1973, pp. 418-420). For the case of linear hypotheses, the Wald-statistic approach reduces to that described by Grizzle et al. (1969) modified by using maximum-likelihood estimates of the coefficients and variance-covariance matrix. In the special case of a test for whether a particular coefficient is zero, the likelihood ratio and Wald chi-squares have one degree of freedom, and so in large samples their square roots can be treated as standard normal test statistics.

Finally, it should be noted that hypotheses may be very simple to formulate in one form of a quasi-linear model and quite complicated in another. Thus, while different models are formally equivalent, they may differ importantly in how well they represent the information important for a particular application.

Recursive Models for Purely Categorical Data
We develop the main ideas for recursive models using a two-equation linear model. We consider two endogenous polytomies, y and z, where y depends on the exogenous variables X, as in Eq. (5), and z depends on both X and y. To formalize the latter part of the model, assume that z has $S + 1$ categories represented by binary variables $Z^{(1)}, \ldots, Z^{(S)}$, with $Z^{(0)} = 1 - \sum_{s=1}^{S} Z^{(s)}$, and assume

$$P(Z^{(s)} = 1 \mid X, Y) = X\beta^{(s)} + Y\gamma^{(s)}, \tag{10}$$

where $Y = [Y^{(1)} \cdots Y^{(R)}]$ and $\gamma^{(s)} = [\gamma^{(s1)} \cdots \gamma^{(sR)}]'$. Taken together, Eqs. (5) and (10) specify a two-equation model for y and z that is recursive, since there are no constraints on the coefficients across (5) and (10).⁵ Consequently, each equation can be estimated independently using the maximum-likelihood methods described above.

Reduced-Form Equations
Recalling Eq. (2), we see that (5) and (10) specify the conditional expected values of $Y^{(r)}$ and $Z^{(s)}$ as functions of X and of X and Y, respectively. In the latter case, Y also depends on X, and so we may ask how the conditional expected value of $Z^{(s)}$ depends only on X, both directly and indirectly through Y. From elementary probability theory we have

$$E(Z^{(s)} \mid X) = \sum_{r=0}^{R} E(Z^{(s)} \mid X, Y^{(r)} = 1)\,P(Y^{(r)} = 1 \mid X) = E(Z^{(s)} \mid X, Y^{(0)} = 1)\,\pi^{(0)} + \sum_{r=1}^{R} E(Z^{(s)} \mid X, Y^{(r)} = 1)\,\pi^{(r)}. \tag{11}$$

Recalling that $\pi^{(0)} = 1 - \sum_{r=1}^{R} \pi^{(r)}$, we can substitute into (11) and rearrange to get

$$E(Z^{(s)} \mid X) = E(Z^{(s)} \mid X, Y^{(0)} = 1) + \sum_{r=1}^{R} \bigl\{E(Z^{(s)} \mid X, Y^{(r)} = 1) - E(Z^{(s)} \mid X, Y^{(0)} = 1)\bigr\}\,\pi^{(r)}. \tag{12}$$
Equation (12) states that the expected value of $Z^{(s)}$, which is also the probability of being in the sth category of z, conditional only on the exogenous variables, is equal to the probability that $Z^{(s)} = 1$ given that the observation is already in the zeroth category of y, plus the net difference made by being in the rth category of y weighted by the probability of an observation being in that category. The reduced-form equations for the model, (5) and (10), then are just (5) and (12).

Next, let us substitute (5) and (10) into (12). Because in this case the model is linear, we get

$$E(Z^{(s)} \mid X) = X\beta^{(s)} + \sum_{r=1}^{R} \bigl(X\beta^{(s)} + \gamma^{(sr)} - X\beta^{(s)}\bigr)X\alpha^{(r)} = X\Bigl(\beta^{(s)} + \sum_{r=1}^{R} \gamma^{(sr)}\alpha^{(r)}\Bigr). \tag{13}$$

⁵ Equation (10) can also be written as $Z^{(s)} = X\beta^{(s)} + Y\gamma^{(s)} + \eta^{(s)}$, where $E(\eta^{(s)} \mid X, Y) = 0$. See footnote 2.
Thus, the reduced-form coefficients in (13) are simply the direct effects, $\beta^{(s)}$, plus the sum of the indirect effects, $\gamma^{(sr)}\alpha^{(r)}$, mediated through $Y^{(1)}, \ldots, Y^{(R)}$. This result is exactly the same as that obtained using the classical path-tracing algorithm (Duncan, 1975).⁶ In the linear case, then, the reduced-form equations have the same form and interpretation as for recursive models for continuous data.⁷

Example
Consider the data in Table 1, compiled from surveys conducted by the National Opinion Research Center in 1946 and 1963, which have been used several times in expositions of log-linear methods (e.g., Swafford, 1980). One question of interest is whether education mediates the effects of region and time on the proportion of persons favoring equal opportunity for blacks. For the purpose of this illustration we follow previous workers in making the assumption, certainly questionable, that selective migration during the period has no important effects. The question here lends itself naturally to a two-equation model, one for education as a function of region and time, and a second for attitude as a function of region, time, and education. We adopt a linear model with binary coding for predetermined variables, since this allows the coefficients to be interpreted directly as differences, and the question itself is formulated in these terms. Note that since all the predictors are categorical, no problem arises with a linear specification forcing predicted conditional probabilities outside the open unit interval. Table 2 presents the maximum-likelihood estimates of the coefficients for three models: the saturated two-equation model containing all interaction terms; the final model, obtained by deleting terms that are both numerically small and statistically insignificant at the .05 level and then reestimating; and the reduced-form equation for attitude calculated from (13), which in this case amounts to the usual path-tracing algorithm for linear models. The notation is as follows, where all variables are binary: NONS = 1 for non-Southerners, RCNT = 1 for 1963, COLL = 1 for college, HSCH = 1 for high school, GSCH = 1 - HSCH - COLL = 1 for grade school, OPPB = 1 for approval of equal opportunity for blacks. 
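The path-tracing calculation in (13) can be sketched with the final-model coefficients reported in Table 2. The sketch below is restricted to the constant and NONS terms, leaving aside the terms involving RCNT; the array layout and the function name are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

# Final-model coefficients as printed in Table 2.
# Rows: [constant, NONS]; columns: education equations [HSCH, COLL].
alpha = np.array([[0.3902, 0.1530],
                  [0.0682, 0.0112]])

# Direct effects of [constant, NONS] in the attitude (OPPB) equation.
beta = np.array([0.2025, 0.2588])

# Effects of [HSCH, COLL] on OPPB in the attitude equation.
gamma = np.array([0.0789, 0.2468])

def reduced_form(beta, alpha, gamma):
    """Return (total, indirect): direct effects plus effects traced through education."""
    indirect = alpha @ gamma
    return beta + indirect, indirect

total, indirect = reduced_form(beta, alpha, gamma)
print("reduced-form constant and NONS:", np.round(total, 4))
print("indirect effects via education:", np.round(indirect, 4))
```

With these rounded inputs the constant comes out at about .271 and the NONS coefficient at about .267, close to the reduced-form values of .2710 and .2671 in Table 2; the small gap for NONS presumably reflects rounding in the printed coefficients.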
⁶ Equation (13) can also be obtained by substituting the expression given in footnote 2 into the expression given in footnote 5 and then taking expected values.
⁷ The partitioning of the general reduced-form coefficients into a sum of direct and indirect effects depends heavily on the linearity of the regression function. Consequently, the reduced form for a logit model, for example, does not have this simple form and apparently (12) is the most perspicuous general form in this case.

TABLE 1
Proportion Agreeing Blacks Should Have Equal Opportunity, by Region, Education, and Year(a)

                  South                            Non-South
Year     Grade     High      College      Grade     High      College
1946     .202      .282      .452         .460      .540      .711
         (258)     (227)     (93)         (676)     (813)     (287)
1963     .617      .803      .797         .708      .880      .962
         (115)     (157)     (69)         (?09)     (466)     (209)

(a) NORC data reanalyzed by Swafford (1980).

Note that in Table 2 the equations for education in the saturated and final models have three components, for college, high school, and grade school, respectively, though the last is redundant since it is determined by the first two. Beneath each coefficient is the asymptotically normally distributed test statistic for the hypothesis that the coefficient is zero. It is apparent from the equation for attitude that though the difference between regions decreased substantially over time (-.1522 in the final model), it remained sizable (.2588), even with education controlled. Thus, while education has an effect on attitude, it does not mediate the relationship among region, time, and attitude, a conclusion in accord with previous analyses of these data. However, the addition of the structural hypothesis embodied in the equation for education sheds some light on what is going on: though there is a general increase in the level of education over the 17-year period, the differences between regions remained virtually constant (note the negligible coefficients for RCNT x NONS in the saturated model). Thus, the general change in education cannot affect differences between regions in attitude over time. This conclusion can be seen in another way by considering the reduced-form equation. In particular, the coefficient for region decomposes into the sum of the direct effect of being a non-southerner and the indirect effects via education:

Total = direct + (indirect via HSCH + indirect via COLL),

or

.2671 = .2588 + [(.0682)(.0789) + (.0112)(.2468)] = .2588 + .0083.

The indirect effect of region mediated by education is obviously negligible.

MODELS FOR CATEGORICAL AND CONTINUOUS VARIABLES
Thus far we have considered only purely categorical data, in which all the variables in question are categorical. While this situation arises fairly often in applications, there are also many instances in which one has a mix of categorical and continuous variables. When all the endogenous variables are continuous, there is of course no problem, since then we have simply a case of regression on dummy variables. Complications arise, however, when one or more of the endogenous variables is categorical and depends on a continuous predetermined variable.

The central problem with categorical dependent and continuous predictor variables is that when a continuous predictor is unbounded, a linear model for a categorical endogenous variable will yield estimated probabilities outside the interval from zero to one. We have seen that this cannot happen in the purely categorical situation, but it is a major difficulty with continuous predictors. For such situations, then, a linear specification is inappropriate.

TABLE 2
Saturated, Final, and Reduced-Form Models(a)

Saturated model (columns are dependent variables)
                        COLL       HSCH       GSCH       OPPB
Constant                .1609      .3927      .4464      .2016
                        (10.53)    (19.33)    (21.59)    (8.07)
RCNT                    .0414      .0677      -.1091     .4158
                        (1.56)     (2.00)     (-3.32)    (8.04)
NONS                    .0007      .0650      -.0657     .2585
                        (0.??)     (2.77)     (-2.78)    (8.21)
HSCH                                                     .0804
                                                         (2.06)
COLL                                                     .2501
                                                         (4.36)
NONS x RCNT             .0334      .0017      -.0351     -.1678
                        (1.06)     (0.43)     (-0.93)    (-2.64)
HSCH x RCNT                                              .1048
                                                         (1.55)
COLL x RCNT                                              -.0704
                                                         (-0.80)
HSCH x NONS                                              .0005
                                                         (-0.01)
COLL x NONS                                              .0007
                                                         (0.01)
HSCH x NONS x RCNT                                       -.0130
                                                         (-0.16)
COLL x NONS x RCNT                                       .0732
                                                         (0.73)

Final model (columns are dependent variables)
                        COLL       HSCH       GSCH       OPPB
Constant                .1530      .3902      .4568      .2025
                        (11.61)    (22.52)    (26.45)    (10.16)
RCNT                    .0659      .0699      -.1358     .4011
                        (4.66)     (13.99)    (8.46)     (12.44)
NONS                    .0112      .0682      -.0793     .2588
                        (0.77)     (3.63)     (4.32)     (12.01)
HSCH                                                     .0789
                                                         (3.80)
COLL                                                     .2468
                                                         (12.02)
NONS x RCNT                                              -.1522
                                                         (-4.64)
HSCH x RCNT                                              .0957
                                                         (3.34)

Reduced-form equation for attitude
                        OPPB
Constant                .2710
RCNT                    .4229
NONS                    .2671
NONS x RCNT             -.1457

(a) Asymptotically normal statistic (square root of Wald chi-square with 1 df).

In the case of a single-equation model, the problem is fairly well understood, and the usual approach is to specify some nonlinear form for the structural conditional expectation equations, such as the logistic or cumulative normal function. Moreover, this approach generalizes to recursive multiequation models, which can be analyzed equation by equation. For any reasonable choice of regression functions, maximum-likelihood estimators are available at least in principle, and, as in the purely categorical case, either likelihood-ratio or Wald chi-square statistics can be used to test hypotheses.

To make this more concrete, consider a simple two-equation model in which a dichotomy Y depends on X, which includes some continuous predictors, and a continuous variable Z depends on both X and Y:
$$E(Y \mid X) = \frac{\exp(X\omega)}{1 + \exp(X\omega)}, \tag{14}$$

$$E(Z \mid X, Y) = X\xi + \gamma Y. \tag{15}$$
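As a rough illustration of equation-by-equation estimation for a model of the form (14)-(15), the following sketch simulates data from hypothetical parameter values, fits the logit equation by Fisher scoring, and fits the linear equation by ordinary least squares. The simulated design, the parameter values, and the function names are all assumptions made for the example.

```python
import numpy as np

def fit_logit_fisher(X, y, tol=1e-10, max_iter=50):
    """Maximum-likelihood logit coefficients by Fisher scoring (Newton steps)."""
    omega = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ omega))       # fitted probabilities
        score = X.T @ (y - p)                       # gradient of the log likelihood
        info = X.T @ (X * (p * (1.0 - p))[:, None]) # Fisher information matrix
        step = np.linalg.solve(info, score)
        omega = omega + step
        if np.max(np.abs(step)) < tol:
            break
    return omega

def fit_ols(W, z):
    """Ordinary least squares coefficients for z regressed on the columns of W."""
    coef, *_ = np.linalg.lstsq(W, z, rcond=None)
    return coef

# Simulated data under hypothetical parameter values.
rng = np.random.default_rng(0)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # constant plus one continuous predictor
omega_true = np.array([-0.5, 1.0])
xi_true = np.array([1.0, 2.0])
gamma_true = 1.5

y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ omega_true))).astype(float)
z = X @ xi_true + gamma_true * y + rng.normal(size=n)

omega_hat = fit_logit_fisher(X, y)           # equation (14)
coef_hat = fit_ols(np.column_stack([X, y]), z)  # equation (15)
xi_hat, gamma_hat = coef_hat[:2], coef_hat[2]

print("omega_hat:", omega_hat)
print("xi_hat:", xi_hat, "gamma_hat:", gamma_hat)
```

Because the model is recursive, the two fits share no parameters and can be run in either order.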
Equation (14) can be estimated by an iterative maximum-likelihood procedure and, assuming that $\epsilon = Z - E(Z \mid X, Y)$ is distributed normally with mean 0 and variance $\sigma^2$, ordinary least squares gives maximum-likelihood estimates for Eq. (15).

An interesting feature of (14) arises when some of the $X_k$ are binary variables. For example, suppose $X_1$ and $X_2$ are binary variables, and $X_3 = X_1X_2$, while $X_4, \ldots, X_K$ are continuous. Letting $U = [1\ X_4 \cdots X_K]$, we can rewrite (14) as

$$E(Y \mid X) = \frac{\exp[U(\nu_0 + \nu_1X_1 + \nu_2X_2 + \nu_{12}X_1X_2)]}{1 + \exp[U(\nu_0 + \nu_1X_1 + \nu_2X_2 + \nu_{12}X_1X_2)]}. \tag{16}$$

In this version, the coefficient of the continuous variables U is represented as shifting from one combination of values of $X_1$ and $X_2$ to another. However, letting $\beta_0 = U\nu_0, \ldots, \beta_{12} = U\nu_{12}$, we see that the results in (6)-(9) hold, and so we can reformulate Eq. (14) as

$$E(Y \mid X) = \alpha_0 + \alpha_1X_1 + \alpha_2X_2 + \alpha_{12}X_1X_2, \tag{17}$$
where the $\alpha_k$, defined as in (9), are now functions of the continuous variables U. Thus, in this version, the coefficients of the categorical predictors are represented as sliding as the continuous variables U vary over their ranges. To get explicit expressions for the $\alpha_k$, let the function $\Lambda$ be defined as the binomial logistic $\Lambda(t) = \exp(t)/[1 + \exp(t)]$. Then it is straightforward to verify that

$$\begin{aligned}
\alpha_0 &= \Lambda[U\nu_0], \\
\alpha_1 &= \Lambda[U(\nu_0 + \nu_1)] - \Lambda[U\nu_0], \\
\alpha_2 &= \Lambda[U(\nu_0 + \nu_2)] - \Lambda[U\nu_0], \\
\alpha_{12} &= \Lambda[U(\nu_0 + \nu_1 + \nu_2 + \nu_{12})] - \Lambda[U(\nu_0 + \nu_1)] - \Lambda[U(\nu_0 + \nu_2)] + \Lambda[U\nu_0].
\end{aligned} \tag{18}$$
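The "sliding" coefficients in (18) can be checked numerically. The sketch below assumes a single continuous predictor u, so that $U = [1\ u]$, together with hypothetical values for the vectors $\nu_0, \nu_1, \nu_2, \nu_{12}$; it confirms that the linear form (17) reproduces the logit form (16) at every combination of the binary predictors, while the coefficients themselves change with u.

```python
import numpy as np

def Lam(t):
    """Binomial logistic function, Lambda(t) = exp(t) / [1 + exp(t)]."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical coefficient vectors, one entry per column of U = [1, u].
NU0 = np.array([-1.0, 0.5])
NU1 = np.array([0.8, 0.0])
NU2 = np.array([0.4, -0.3])
NU12 = np.array([0.0, 0.2])

def sliding_alphas(u):
    """Coefficients alpha_0, alpha_1, alpha_2, alpha_12 of (18) at a given u."""
    U = np.array([1.0, u])
    l00 = Lam(U @ NU0)
    l10 = Lam(U @ (NU0 + NU1))
    l01 = Lam(U @ (NU0 + NU2))
    l11 = Lam(U @ (NU0 + NU1 + NU2 + NU12))
    return np.array([l00, l10 - l00, l01 - l00, l11 - l10 - l01 + l00])

def prob_logit(u, x1, x2):
    """E(Y | X) from the logit form (16)."""
    U = np.array([1.0, u])
    return Lam(U @ (NU0 + NU1 * x1 + NU2 * x2 + NU12 * x1 * x2))

def prob_linear(u, x1, x2):
    """E(Y | X) from the linear form (17) with coefficients that depend on u."""
    a = sliding_alphas(u)
    return a[0] + a[1] * x1 + a[2] * x2 + a[3] * x1 * x2

for u in (-2.0, 0.0, 1.5):
    print(u, np.round(sliding_alphas(u), 4))
```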
One question that arises in connection with models of this kind stems from the fact that Eq. (14) is for the expected value of Y, while (15) has Y itself on the right-hand side. Thus, one may ask if it is possible, at least in principle, to actually trace the effects of X on Z through Y. In short, do we in fact have a model of a coherent causal structure that can be given some interpretation? This question is essentially whether or not the notion of a reduced form is intelligible for such a model. In this case, we can answer by observing that Eqs. (11) and (12) apply with $R = 1$, and so, since $P(Y = 1 \mid X) = E(Y \mid X)$, we have

$$E(Z \mid X) = X\xi + (X\xi + \gamma - X\xi)\,P(Y = 1 \mid X) = X\xi + \gamma\,\frac{\exp(X\omega)}{1 + \exp(X\omega)}. \tag{19}$$
There is no way to factor X out of the second term in (19), and so we cannot represent the reduced form as X times a coefficient that in turn is the sum of direct and indirect effects. Nevertheless, we see that the expected value of Z conditional on X alone can be partitioned into the sum of the expected value of Z when $Y = 0$ and $\gamma$ times the probability that $Y = 1$.

In the foregoing example, we have assumed that the expected value of Y depends on X via the logit function. However, essentially the same development can be carried out using any reasonable regression function $\psi$. In the general case, then, we would replace (14) by

$$E(Y \mid X) = \psi(X\omega). \tag{20}$$

The reduced-form equation for Z then is

$$E(Z \mid X) = X\xi + \gamma\psi(X\omega). \tag{21}$$

However, the model specified by (20) and (15) is not the only two-equation model mixing categorical and continuous variables. Let us call this model Type I and consider two others. In models of Type II, both endogenous variables are categorical, and so in the dichotomous case, we have Eqs. (20) and

$$E(Z \mid X, Y) = \psi(X\xi + \gamma Y), \tag{22}$$
where now Z is assumed to be a binary variable, and $\psi$ is a regression function such as the logit. We can again apply Eqs. (11) and (12) to obtain the reduced-form equation for Z:

$$E(Z \mid X) = \psi(X\xi) + [\psi(X\xi + \gamma) - \psi(X\xi)]\,\psi(X\omega). \tag{23}$$
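The reduced forms (21) and (23) are both instances of the law of total expectation over the two values of Y. The following sketch, which assumes a logit $\psi$ and hypothetical parameter values, evaluates each closed form and compares it with the direct sum over $Y = 0$ and $Y = 1$.

```python
import numpy as np

def psi(t):
    """Regression function; the logit is assumed here for illustration."""
    return 1.0 / (1.0 + np.exp(-t))

def reduced_type1(x, xi, gamma, omega):
    """Equation (21): Y dichotomous, Z continuous and linear in Y."""
    return x @ xi + gamma * psi(x @ omega)

def reduced_type2(x, xi, gamma, omega):
    """Equation (23): Y and Z both dichotomous."""
    base = psi(x @ xi)
    return base + (psi(x @ xi + gamma) - base) * psi(x @ omega)

def total_expectation(x, cond_mean, omega):
    """E(Z | X) as the sum of E(Z | X, Y = y) P(Y = y | X) over y in {0, 1}."""
    p1 = psi(x @ omega)
    return cond_mean(x, 0.0) * (1.0 - p1) + cond_mean(x, 1.0) * p1

# Hypothetical parameter values and a single row of predetermined variables.
x = np.array([1.0, 0.7, -1.2])
xi = np.array([0.3, -0.5, 0.2])
omega = np.array([-0.4, 1.1, 0.6])
gamma = 0.9

type1 = reduced_type1(x, xi, gamma, omega)
type2 = reduced_type2(x, xi, gamma, omega)
check1 = total_expectation(x, lambda x_, y_: x_ @ xi + gamma * y_, omega)
check2 = total_expectation(x, lambda x_, y_: psi(x_ @ xi + gamma * y_), omega)
print(type1, check1)
print(type2, check2)
```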
Once again, the reduced-form equation can be written explicitly. In contrast, the remaining major variety of two-equation model is not as easy to analyze. In models of Type III, Y is continuous and Z is categorical. Again assuming Z to be a dichotomy, and the regression of Y on X to be linear, we have structural Eqs. (22) and

$$E(Y \mid X) = X\omega, \tag{24}$$

where we assume that Y has some known conditional distribution function F, such as the cumulative normal. To obtain the reduced form for Z, we must now resort to the continuous version of (11):

$$E(Z \mid X) = \int E(Z \mid X, Y)\,dF(Y \mid X) = \int \psi(X\xi + \gamma Y)\,dF(Y \mid X). \tag{25}$$
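One way to evaluate an integral of the form (25) numerically is Gauss-Hermite quadrature. The sketch below assumes that Y is conditionally normal with mean $X\omega$ and a known standard deviation `sigma`, and that $\psi$ is the logit; these choices, the parameter values, and the function names are assumptions made for illustration, and other distributions or regression functions would call for other quadrature rules.

```python
import numpy as np

def psi(t):
    """Regression function; the logit is assumed here for illustration."""
    return 1.0 / (1.0 + np.exp(-t))

def reduced_type3(x, xi, gamma, omega, sigma, n_nodes=40):
    """Approximate E(Z | X) in (25) for Y | X ~ Normal(x @ omega, sigma**2)."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    # Change of variables: y = mean + sqrt(2) * sigma * node, weights sum to sqrt(pi).
    y = x @ omega + np.sqrt(2.0) * sigma * nodes
    return np.sum(weights * psi(x @ xi + gamma * y)) / np.sqrt(np.pi)

def reduced_type3_grid(x, xi, gamma, omega, sigma, n_points=20001):
    """Brute-force check: Riemann sum of psi times the normal density on a fine grid."""
    mean = x @ omega
    y = np.linspace(mean - 10.0 * sigma, mean + 10.0 * sigma, n_points)
    dens = np.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return np.sum(psi(x @ xi + gamma * y) * dens) * (y[1] - y[0])

# Hypothetical parameter values and a single row of predetermined variables.
x = np.array([1.0, 0.5])
xi = np.array([-0.2, 0.4])
omega = np.array([0.3, 1.0])
gamma, sigma = 0.8, 1.0

value = reduced_type3(x, xi, gamma, omega, sigma)
check = reduced_type3_grid(x, xi, gamma, omega, sigma)
print(value, check)
```

When $\gamma = 0$ the quadrature returns $\psi(X\xi)$, since Z then does not depend on Y.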
In general, the integral in (25) cannot be evaluated explicitly, and so values must be obtained by numerical methods. Nevertheless, the result in (25) is important, for it shows that a reduced-form function does exist, even though it may be inconvenient to compute, and consequently the models of Type III are coherent.

Table 3 summarizes the basic facts concerning the three types of model. It will be noted that we have assumed here that categorical endogenous variables are dichotomous in order to simplify the notation. The polytomous versions are straightforward generalizations of those discussed here, but are notationally more complex. For all of these models, maximum-likelihood estimation and statistical inference are possible in principle, assuming reasonable regression functions and error structures. However, in some cases the computational burden may be excessive for these methods, and more practical estimators may need to be developed.

Example
Table 4 presents maximum-likelihood estimates of structural coefficients for a simple model of socioeconomic attainment incorporating exogenous
TABLE 4
Estimates of Structural Coefficients by Sex: Maximum-Likelihood Estimates for White Respondents in Labor Force Continuously from 1963 to 1967 and Not Self-Employed^a

Males (N = 606)
                  Promotion (logit)       SEI (linear)            Log income (linear)
Schooling^b       .12*** (4.4)            4.9*** (22.1)           .047*** (5.7)
Age^b             .0067 (0.2)             .78** (3.1)             .054*** (7.6)
Age²              -.0012* (-1.7)          -.009* (-1.7)           -.0010*** (-6.8)
Marital status    .33 (1.2)               2.7 (1.1)               .33*** (5.4)
Promotion         -                       5.0** (3.2)             .13** (2.8)
SEI               -                       -                       .0063*** (5.6)
Intercept         -.54 (-1.5)             7.3** (2.4)             7.51*** (87.6)
Overall test      χ² = 76.3***            F(6,600) = 3134***      F(7,599) = 65.2***
R²                -                       .49                     .39
SEE               -                       17.7                    .49

Females (N = 238)
                  Promotion (logit)       SEI (linear)            Log income (linear)
Schooling^b       .13** (2.1)             5.2*** (13.2)           .046*** (2.7)
Age^b             .0081 (0.2)             .19 (0.6)               .050*** (4.8)
Age²              -.0007 (-0.7)           -.002 (-0.4)            -.0009*** (-4.8)
Marital status    .04 (0.1)               1.1 (0.5)               -.15** (-2.1)†††
Promotion         -                       6.0** (2.6)             .17** (2.2)
SEI               -                       -                       .0106*** (4.8)
Intercept         -1.20* (-2.1)           16.1*** (4.0)           7.17*** (50.7)††
Overall test      χ² = 9.0*               F(6,232) = 39.2***      F(9,231) = 20.1***
R²                -                       .46                     .34
SEE               -                       15.6                    .52

a Asymptotically normal test statistics appear in parentheses (square root of Wald chi-square statistic with 1 df).
b Schooling measured as years completed minus 8. Age measured as years since 18th birthday.
* p < .10. † p < .10 for sex difference. ** p < .05. †† p < .05 for sex difference. *** p < .001. ††† p
and endogenous variables which are both categorical and continuous. The exogenous variables are \(X_1\), years of schooling; \(X_2\) and \(X_3 = (X_2 - 18)^2\), linear and quadratic age terms; and \(X_4\), a binary variable indicating married (= 1) versus not married (= 0) at the time of the survey. The three recursively ordered endogenous variables are \(Y_1\), a binary variable equal to 1 if the respondent received a promotion during the 5 years preceding the survey; \(Y_2\), occupational status at the time of the survey in the Duncan SEI metric; and \(Y_3\), the natural logarithm of annual earnings.⁸ A logit specification is used for the promotion equation, while promotion is represented by a dummy variable in the other two equations. Thus, the structural model is represented by three equations,

\[P(Y_1 = 1 \mid X) = \psi(X\alpha), \tag{26}\]
\[E(Y_2 \mid X, Y_1) = X\delta + \gamma_{21}Y_1, \tag{27}\]
\[E(Y_3 \mid X, Y_1, Y_2) = X\lambda + \zeta_{31}Y_1 + \zeta_{32}Y_2, \tag{28}\]
where \(X = [1\; X_1\; X_2\; X_3\; X_4]\); \(\alpha\), \(\delta\), and \(\lambda\) are corresponding 1 × 5 coefficient vectors; and \(\psi\) is the logit function. The data are from a 1967 survey by the Institute for Social Research entitled “Technological Advance in an Expanding Economy” (Mueller et al., 1969). Analyzed here is information reported by 606 males and 238 females who participated in the labor force continuously during the 5-year period preceding the survey. (See Bielby and Baron (1982) for further details.)

Estimates in Table 4 were computed separately for men and women, but only parameters of the income equation differed significantly by sex. As expected, schooling has a modest impact on the probability of promotion, and among men there is evidence that promotion opportunities are greatest for those in their midtwenties (the age profile peaks roughly 5½ years after the 18th birthday). Each additional year of schooling increases status by about 5 points, and among males status increases with age but at a slightly decreasing rate. Having received a promotion is worth 5 to 6 SEI points. The marginal income return to a year of schooling is about 5%. Net of that, income increases about 5% per year for those around 18 years old, but the return declines by about 1% over each 10-year period. Being married “pays off” substantially for men (a 39% advantage at the grand mean), but is an earnings disadvantage for women (a 14% decrease evaluated at the grand mean). Net of other factors in the model, a promotion increases income by 13 to 19% at the mean. Women receive a 1% net income return to each status point, while the return for men is just over half that. However, the intercept is considerably lower for women (antilogs of intercepts are $1826 and $1300, respectively, for men and women).

⁸ As in nearly all attainment models estimated from cross-sectional survey data, there is a potential source of simultaneity bias here. Earnings are reported for the 12-month period preceding the survey and thus may be predetermined with respect to both occupational status and promotion for those who had recently changed jobs.

The reduced-form equations are

\[P(Y_1 = 1 \mid X) = \psi(X\alpha), \tag{29}\]
\[E(Y_2 \mid X) = X\delta + \gamma_{21}\psi(X\alpha), \tag{30}\]
\[E(Y_3 \mid X) = X\lambda + \zeta_{32}X\delta + (\zeta_{31} + \zeta_{32}\gamma_{21})\psi(X\alpha), \tag{31}\]
where the latter two equations follow from (22). Reduced-form effects are obtained by differentiating each equation with respect to \(X_1\) through \(X_4\). If \(\pi\) is the probability of a promotion, where \(\pi = \psi(X\alpha)\), then it can be shown that \(\partial\pi/\partial X_k = \alpha_k\pi(1 - \pi)\) (Hanushek and Jackson, 1977, pp. 188-189). Accordingly, the total effect of \(X_k\) on \(Y_1\) is \(\alpha_k\pi(1 - \pi)\). The total effect of \(X_k\) on \(Y_2\) is \(\delta_k + \gamma_{21}\alpha_k\pi(1 - \pi)\). And the total effect of \(X_k\) on \(Y_3\) is

\[\lambda_k + \zeta_{32}\delta_k + (\zeta_{31} + \gamma_{21}\zeta_{32})\alpha_k\pi(1 - \pi).\]

Except for the factor \(\pi(1 - \pi)\), these expressions correspond to the decomposition of total effects into direct and indirect effects for conventional recursive models with continuous endogenous variables (Duncan, 1975). Since \(\pi\) is a function of X, the direct effect of \(X_k\) on the probability of a promotion and the indirect effects of \(X_k\) mediated by promotion vary as a function of X as well. These effects are greatest when \(\pi = .50\), since the term \(\pi(1 - \pi)\) cannot exceed .25. Consequently, Table 5 reports the maximum effects, evaluated at \(\pi = .50\), and contrasts them to effects evaluated at \(\pi = .25\).⁹ Examining these effects, we find, for example, that an additional year of schooling increases the probability of promotion by .03 at \(\pi = .50\) and by .02 at \(\pi = .25\) (or .75). Since the total effects on promotion are small, indirect effects on status and income through promotion are minuscule and contribute little to total effects.

⁹ Effects evaluated at \(\pi = .50\) are a third larger than those evaluated at \(\pi = .25\), where \(\pi(1 - \pi) = .1875\). The probability of a promotion is approximately .50 for a single, 18-year-old male with 12 years of schooling or for a single, 24-year-old female with an advanced degree. The probability is about .25 for a married, 52-year-old male with 10 years of schooling or for a single, 18-year-old female with 9 years of schooling.
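The logit derivative cited from Hanushek and Jackson can be checked in a few lines. The sketch below assumes that the promotion equation uses the standard logistic function, so that the probability of promotion is \(\pi = 1/(1 + e^{-X\alpha})\), and it treats each exogenous variable as varying separately, as the text does. The symbols \(\delta_k\) and \(\gamma_{21}\) denote the coefficient of \(X_k\) in the status equation and the coefficient of promotion in the status equation, respectively.

```latex
\begin{align*}
\pi &= \frac{1}{1 + e^{-X\alpha}}, \qquad
1 - \pi = \frac{e^{-X\alpha}}{1 + e^{-X\alpha}}, \\[4pt]
\frac{\partial \pi}{\partial X_k}
  &= \frac{\alpha_k\, e^{-X\alpha}}{\left(1 + e^{-X\alpha}\right)^{2}}
   = \alpha_k \cdot \frac{1}{1 + e^{-X\alpha}} \cdot \frac{e^{-X\alpha}}{1 + e^{-X\alpha}}
   = \alpha_k\, \pi (1 - \pi), \\[4pt]
\frac{\partial E(Y_2 \mid X)}{\partial X_k}
  &= \delta_k + \gamma_{21}\, \frac{\partial \pi}{\partial X_k}
   = \delta_k + \gamma_{21}\, \alpha_k\, \pi (1 - \pi).
\end{align*}
```

The last line applies the chain rule to the reduced-form status equation; the corresponding expression for log income follows in the same way.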
TABLE 5
Decomposition of Effects by Sex: White Respondents in Labor Force Continuously from 1963 to 1967 and Not Self-Employed

Dependent variable  Total effect        Direct effect       Indirect effect     Indirect
and exogenous                                               via promotion       effect
variable            π = .50   π = .25   π = .50   π = .25   π = .50   π = .25   via SEI

Males (N = 606)
1. Promotion
   Schooling        .030      .022      .030      .022      -         -         -
   Age              .0017     .0013     .0017     .0013     -         -         -
   Age²             -.0003    -.0002    -.0003    -.0002    -         -         -
   Marital status   .082      .062      .082      .062      -         -         -
2. SEI
   Schooling        5.1       5.0       4.9       4.9       .15       .11       -
   Age              .79       .79       .78       .78       .008      .006      -
   Age²             -.011     -.010     -.009     -.009     -.0015    -.0011    -
   Marital status   3.1       3.1       2.7       2.7       .41       .33       -
3. Log income
   Schooling        .082      .081      .047      .047      .004      .003      .031
   Age              .059      .059      .054      .054      .000      .000      .005
   Age²             -.0011    -.0011    -.0010    -.0010    -.0000    -.0000    -.0001
   Marital status   .36       .36       .33       .33       .010      .008      .02

Females (N = 238)
1. Promotion
   Schooling        .032      .024      .032      .024      -         -         -
   Age              .0020     .0015     .0020     .0015     -         -         -
   Age²             -.0002    -.0001    -.0002    -.0001    -         -         -
   Marital status   .010      .008      .010      .008      -         -         -
2. SEI
   Schooling        5.4       5.4       5.2       5.2       .20       .15       -
   Age              .20       .20       .19       .19       .012      .009      -
   Age²             -.003     -.003     -.002     -.002     -.0010    -.0008    -
   Marital status   1.2       1.2       1.1       1.1       .06       .05       -
3. Log income
   Schooling        .107      .105      .046      .046      .006      .004
   Age              .052      .052      .050      .050      .000      .000
   Age²             -.0009    -.0009    -.0009    -.0009    -.0000    -.0000
   Marital status   -.14      -.14      -.15      -.15                .001
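The decomposition formulas can be applied directly to the rounded coefficients printed in Table 4. The sketch below does this for the effect of schooling among males; the function name `decompose` and its argument names are hypothetical, and the path from promotion through SEI to income is grouped with the effect via promotion, following the formula in the text. Because the published coefficients are rounded, the results should be expected to agree with Table 5 only approximately.

```python
def decompose(alpha_k, delta_k, lambda_k, gamma21, zeta31, zeta32, pi):
    """Split the total effects of one exogenous variable into direct and indirect parts.

    alpha_k, delta_k, lambda_k : coefficients of the variable in the promotion
                                 (logit), SEI, and log-income equations
    gamma21 : effect of promotion on SEI
    zeta31  : effect of promotion on log income
    zeta32  : effect of SEI on log income
    pi      : probability of promotion at which effects are evaluated
    """
    w = pi * (1.0 - pi)              # logit derivative factor pi(1 - pi)
    d_promo = alpha_k * w            # effect on the probability of promotion
    sei = {
        "direct": delta_k,
        "via_promotion": gamma21 * d_promo,
    }
    income = {
        "direct": lambda_k,
        "via_sei": zeta32 * delta_k,
        # includes the chained path promotion -> SEI -> income
        "via_promotion": (zeta31 + gamma21 * zeta32) * d_promo,
    }
    sei["total"] = sei["direct"] + sei["via_promotion"]
    income["total"] = income["direct"] + income["via_sei"] + income["via_promotion"]
    return {"promotion": d_promo, "sei": sei, "income": income}

# Rounded male schooling coefficients as printed in Table 4
male_schooling = dict(alpha_k=0.12, delta_k=4.9, lambda_k=0.047,
                      gamma21=5.0, zeta31=0.13, zeta32=0.0063)

for pi in (0.50, 0.25):
    out = decompose(pi=pi, **male_schooling)
    print(pi, out["promotion"], out["sei"], out["income"])
```

Substituting another row of Table 4 for `male_schooling` gives the corresponding decomposition for that variable or for women.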
In contrast, the modest effect of schooling on status translates into a substantial component of the effect of schooling on income. For men, the income return to a year of schooling through status is 3%. Among women the return is over 5% and accounts for more than half of the total income return to schooling. However, with the exception of schooling, nearly all total effects of exogenous variables are transmitted directly, unmediated by intervening endogenous variables. While the specific pattern of results is not particularly interesting, this example does illustrate that methods of path decomposition routinely applied to recursive models with continuous variables (Alwin and Hauser, 1975; Duncan, 1975) translate directly for models with both continuous and categorical variables.

CONCLUSION

In this paper we have considered the simplest situation: recursive models for cross-sectional data in which all variables are observable. We have shown that the basic logic of such models when all endogenous variables are continuous carries over to situations in which some or all of the endogenous variables are categorical. Thus, there is no need for a fundamentally different conceptual approach to the analysis of categorical data. There are, of course, differences in detail, and in general models for categorical data are more complicated, largely owing to the fact that nonlinear regression functions may be required. However, these complexities do not force a different point of view for categorical data.

The extension of results for continuous data to categorical data depends fundamentally on two points. First, models for continuous data are regression models; that is, they represent the expected value of the dependent variable as depending in some definite way on the predetermined variables. Second, for a binary variable, the probability of equaling one is the same as the expected value of the binary variable; consequently equations for probabilities are equations for expected values and hence are also regression equations. Exploitation of this fact is what yields a coherent notion of reduced-form equations for categorical data, even when these equations are nonlinear or difficult to evaluate explicitly.

For purely categorical data, the situation is especially simple. In this case, linear regression equations are appropriate, so that the reduced-form equations are also linear, and consequently these models parallel the continuous case exactly. Moreover, for purely categorical data, quasilinear models with different forms for the regression functions are mathematically equivalent to one another, so that choice between, say, a logit or a linear model can be made solely on the basis of which form best expresses the relationships of substantive interest in a particular application.

The statistical theory of these models is quite straightforward. Coefficients can be estimated by maximum-likelihood methods, and hypotheses can
be tested using asymptotic chi-square statistics based either on the likelihood-ratio approach or on Wald’s method. However, computation is more complex and expensive than for linear models with continuous data, and software for reduced-form computations remains to be developed.

We have stressed several times that the models considered here are limited in applicability to recursive structures involving observable cross-sectional data. Consequently, a major task is to extend models for categorical data to situations in which these restrictions do not hold. For continuous data, a substantial amount of work has been done on these problems, resulting in highly sophisticated methods for dealing with nonrecursive systems, unmeasured variables, and time-series data. There are, moreover, seemingly esoteric problems, such as censored and truncated data, that are increasingly being recognized as rather more commonplace than might appear at first (e.g., Berk and Ray, 1982) and for which some progress toward solutions has been made when all the endogenous variables are continuous. Clearly, these define important directions for extension of methods for analyzing categorical data. In view of the fact that recursive models for categorical data involve the same underlying logic as those for continuous data, a productive approach might be to investigate whether solutions appropriate to these more complex problems for continuous data might also generalize in a useful fashion to categorical data.

REFERENCES

Alwin, D. F., and Hauser, R. M. (1975), “The Decomposition of Effects in Path Analysis,” American Sociological Review 40, 37-47.
Amemiya, T. (1981), “Qualitative Response Models: A Survey,” Journal of Economic Literature 19 (Dec.), 1483-1536.
Barndorff-Nielsen, O. (1978), Information and Exponential Families in Statistical Theory, Wiley, New York.
Berk, R. A., and Ray, S. C. (1982), “Sample Selection Bias in Sociological Data,” Social Science Research 11, 352-398.
Bielby, W. T., and Baron, J. N. (1982), “Organization, Technology, and Worker Attachment to the Firm,” in Research in Social Stratification and Mobility (D. J. Treiman and R. V. Robinson, Eds.), Vol. II, JAI Press, Greenwich, Conn.
Brier, S. S. (1978), “The Utility of Systems of Simultaneous Logistic Equations,” in Sociological Methodology 1979 (K. F. Schuessler, ed.), pp. 119-129, Jossey-Bass, San Francisco.
Charnes, A., Frome, E. L., and Yu, P. L. (1976), “The Equivalence of Generalized Least Squares and Maximum Likelihood Estimates in the Exponential Family,” Journal of the American Statistical Association 71, 169-171.
Duncan, O. D. (1975), Introduction to Structural-Equation Models, Academic Press, New York.
Duncan, O. D. (1982), “Review Essay: Statistical Methods for Categorical Data,” American Journal of Sociology 87 (Jan.), 957-964.
Fienberg, S. E. (1980), The Analysis of Cross-Classified Categorical Data, 2nd ed., MIT, Cambridge, Mass.
Gillespie, M. W. (1977), “Log-Linear Techniques and the Regression Analysis of Dummy Dependent Variables,” Sociological Methods and Research 6, 103-122.
Goodman, L. A. (1975), “The Relationship between Modified and Usual Multiple-Regression Approaches to the Analysis of Dichotomous Data,” in Sociological Methodology 1976 (D. R. Heise, ed.), pp. 83-110, Jossey-Bass, San Francisco.
Grizzle, J. E., Starmer, C. F., and Koch, G. G. (1969), “Analysis of Categorical Data by Linear Models,” Biometrics 25, 489-504.
Hanushek, E. A., and Jackson, J. E. (1977), Statistical Methods for Social Scientists, Academic Press, New York.
Heckman, J. J. (1978), “Dummy Endogenous Variables in a Simultaneous Equation System,” Econometrica 46, 931-959.
Jennrich, R. I., and Moore, R. H. (1975), “Maximum Likelihood Estimation by Means of Nonlinear Least Squares,” in Proceedings, Statistical Computing Section, Amer. Statist. Assoc.
Kritzer, H. M. (1979), “Approaches to the Analysis of Complex Contingency Tables,” Sociological Methods and Research 7, 305-329.
Manski, C. F. (1981), “Structural Models for Discrete Data: Analysis of Discrete Choice,” in Sociological Methodology 1981 (S. Leinhardt, ed.), pp. 58-109, Jossey-Bass, San Francisco.
Mood, A. H., Graybill, F. A., and Boes, D. C. (1974), Introduction to the Theory of Statistics, 3rd ed., McGraw-Hill, New York.
Mueller, E., Hybels, J., Schmiedeskamp, J., Sonquist, J., and Staelin, C. (1969), Technological Advance in an Expanding Economy, Institute for Social Research, University of Michigan, Ann Arbor.
Muthen, B. (1981), “A General Structural Equation Model with Ordered Categorical and Continuous Latent Variable Indicators,” unpublished manuscript, Department of Psychology, University of California, Los Angeles.
Nelder, J. A., and Wedderburn, R. W. M. (1972), “Generalized Linear Models,” Journal of the Royal Statistical Society, Series A 135, Pt. 3, 370-384.
Rao, C. R. (1973), Linear Statistical Inference and Its Applications, 2nd ed., Wiley, New York.
Swafford, M. (1980), “Parametric Techniques for Contingency Table Analysis,” American Sociological Review 45, 664-690.
Winship, C., and Mare, R. D. (1983), “Structural Equation Models for Discrete Data,” American Journal of Sociology 88.