STATISTICS& PROBABILITY LETTERS ELSEVIER
Statistics & Probability Letters 31 (1996) 113-120
A model for a multivariate binary response with covariates based on compatible conditionally specified logistic regressions H a r r y J o e *, Y i n g L i u Department of Statistics, University of British Columbia, 6356 Agricultural Road, Vancouver, B.C., Canada V6T 1Z2 Received September 1995; revised December 1995
Abstract
A model for a multivariate binary response vector with covariates is obtained. The conditional distribution of each binary response given the other binary responses and the covariates is equivalent to a logistic regression. A simple condition on the regression parameters is necessary and sufficient for the conditional distributions to be compatible, that is, yield a multivariate distribution for the binary response vector. The multivariate model has a wide range of dependence structure for the binary response variables, so it is more generally applicable compared with previous conditional models. The model is applied to a data set from cardiac surgery.
Keywords: Logistic
regression; Compatibility conditions for conditional distributions; Exponential family
I. Introduction
Recently there has been research on models based on compatible conditional distributions; see, for example, Arnold et al. (1992, 1995), Arnold and Press (1989) and references therein. In this article, we apply these ideas to obtain a model for a multivariate binary response (Y1 .... , Ym) with covariate vector x. For the binary response Yj, the conditional distribution of Yj given the other binary responses Y/ (i # j ) is equivalent to a logistic regression with the Y/ (i # j ) and x as the covariates. There are some conditions on the regression parameters in order that the m conditional distributions are compatible. In the case of compatibility, the joint multivariate distribution of (Y1 . . . . . Ym) has a fairly simple form in the exponential family. The statement of the multivariate model is given in Section 2, with the derivation of the compatibility conditions given in the Appendix. This model is more generally applicable than other multivariate models that are based on some logistic regressions because it covers a wide range of dependence, and not just the exchangeable type of dependence or serial dependence. Conditional approaches have been proposed partly because of the difficulty of constructing a multivariate model with given univariate margins. For multivariate binary data with covariates, there is a multivariate probit model that is computationally harder to work with. A multivariate logit model requires a flexible family of * Corresponding author. 0167-7152/96/$12.00 © 1996 Elsevier Science B.V. All rights reserved PII S01 67-71 5 2 ( 9 6 ) 0 0 0 2 1-1
114
H. Joe, Y. Liu / Statistics & Probability Letters 31 (1996) 113-120
multivariate logistic distributions. We now have such families but not when we started on the work in the present research. For multivariate probit and logit models, there is still the problem of deciding whether and how dependence parameters should depend the covariates. This is not a problem with the model in this article, but its main disadvantage is that it is not closed under taking of margins. In Section 3, the model is applied to a data set from cardiac surgery to illustrate the type of situation where it might be used. Some binary response variables measured immediately after surgery were indicators of renal complication, neurological complication, pulmonary complication, and low-out syndrome complication (these are indirect measures of the quality of life after surgery). Covariates include age, sex, indicator of chronic obstructive pulmonary disease, indicator of prior myocardial infarction, indicator of renal disease, indicator of diabetes.
2. A model for multivariate binary data with covariates In this section, we state the multivariate model that arises from conditionally specified logistic regressions, and obtain some of its properties. For notation, we suppose that there are m binary response variables, Yj, j = 1. . . . . m, each taking values 0 or 1, and a covariate vector x (say of dimension r). These are measured for each subject. Consider first the case of two binary response variables Yl, Y2 and a covariate vector x. Suppose that Y1 conditional on x and I12 = y2 is logit and that I12 conditional on x and Y1 = Yl is logit, that is, [Pr(Y1 = l[Y2 = y2, x ) ] logit [Pr(Y1 = l lY2 = y2,x)] = log [Pr(Y1 = 0IY2 --- y2,x)] = Ctl + fllX + 712Y2,
logit [Pr(Y2 = I[Y1 = y l , x ) ]
[Pr(Y2 =
llY1 =
(2.1)
yl,x)]
= log [Pr--~2 = 0]YI = y l , x ) J --- ~2 nt- f12X q-721Yl,
(2.2)
where x is a column vector of length r and the fl's are row vectors of length r. What are necessary and sufficient conditions for (2.1) and (2.2) to be compatible conditional distributions? The answer is that the conditional distributions are compatible if and only if 712 =721, in which case, the joint distribution is pl2(yl,y21x) = [c(x)] -1 exp[(cq + fllx)yt + (ct2 + fl2x)y2 + 712yly2],
(2.3)
where c(x) = 1 + exp(el + fllX) + exp(ct2 + flzX) + exp[(cq + e2) + (ill + fl2)x + 712]. The derivation of this result can be obtained from the condition of Arnold et al. (1992) or that of Gelman and Speed (1993). The model as given in (2.1)-(2.3) generalizes for the dimension m. For the general multivariate case with binary response variables Y1. . . . . Ym, suppose that for j = 1,...,m, Yj conditional on x and Yi = Yi, i # j, is 1ogit with parameters o~j,flj, Yji, i # j. That is, Iogit [Pr(Yj ---- I[Y/= Yi, i # j, x)] = otj + fljx + Z ~ j i Y i ,
j = 1. . . . . m.
(2.4)
i#j
The necessary and sufficient conditions for compatibility of the conditional distributions are 7ij = Yji, i ~ j. The proof of sufficiency is straightforward and is given below; the proof of necessity is given in the Appendix. The resulting joint distribution is
(Oti+flix)yi+
Pt...m(Yl . . . . . y m l x ) = [ c ( x ) ] - l e x p ki=l
with normalizing constant c(x) = E y l1= O ' ' ' E y m =1O
7ijYiYj , 1 <~i
yj = 0 , 1 , j =
..I
m exp{~i=l(~i + ~ixi)Yi + ~i
1..... m
(2.5)
H. Joe, Y. Liu I Statistics & Probability Letters 31 (1996) 113-120
115
Proof of sufficiency. For k = 1. . . . . m, let p-k('lx) be the marginal distribution of (?'1..... Yk-1, Given the model (2.5) with ?ij = ?ji, i ~ j, we have
Yk+l . . . . , Era).
p-k(yl ..... yk-1, Yk+l ..... Ym Ix) =[C(X)]-I(exp[(cXkWflkX)+Z(o~iWflix)yi+Z i~k
ykjyj+ js~k
Z ]:ijYiYj 1<~i
+exP[i~k(O~i+flix)yi+l<~i
=[c(x)]-lexp(Z(°ti+flix)yi+ [. i~k
Z YiJYiYJ]'{1Wexp[otk+flkX+ZTkJYJ i
}"
Hence, we obtain Pr(Yk : yklYj = yj, j ¢ k, x) =
Pl...re(Y1, Y2 ..... Ym Ix ) P-k(Yl ..... Yk-1, Yk+l ..... Ym Ix)
exp{~im=l [(~i + flix)yi] + ~-~1<~i
1
exp[E,~,k(~, +/~,x)y, + El.<,~3, let nij(yi, yjl x) be the (i,j) bivariate marginal distribution of (2.5). The unconditional log odds ratio of rcq will depend on x. For notational simplicity and without loss of generality, we write the log odds ratio only for the case of (i,j) = (1,2). Let
qm(Yl,y2,x) = Z "'" Z exp y3=O y~,=O
[(O~k+flkX-t-YlkYl +72kY2]Yk q- Z
Ykk'YkYk'
•
3~
Then ~Zlz(yl, Y2 ]x) -- [c(x)]-lqm(yl, Y2, x) exp{(~l + fllx)Yl + (Otz + fl2x)y2 + ?lZYi Y2 } and the log odds ratio is ~12, log{ [qm(1, 1, X)qm(O, O, X)]/[qm( 1, O, X)qm(O, 1, x)] }. If x has dimension r, that is, x = (xl . . . . . xr) T for a subject, then model (2.5) is an exponential family model c -1 exp{2Ts} with "sufficient" statistic vector ST = (Sl . . . . . SM) = (Yl . . . . . Ym, YlY2, YlY3 . . . . . Ym_lYm,XkYj, k = 1..... r, j = 1..... m).
116
H. Joe, Y. Liu I Statistics & Probability Letters 31 (1996) 113-120
Given data of the form (yil .... ,Yim, Xil Xir), i = 1..... n, and corresponding vectors si, the sufficient n statistic vector is )--]~i=1si. The estimate of the parameter vector 2 can be obtained using the Newton-Raphson method for an exponential family log-likelihood. Starting guesses for the Newton-Raphson method can come from the separate logistic regressions. Also these separate regressions give an indication of the adequacy of fit of the model (in particular, a check of the requirement 7ij : ?ji can be made). Note that the exponential family model (2.5) is not closed under margins. This is typical of multivariate exponential family models that are not multivariate normal. The model (2.5) can be extended if interaction terms are needed. If, in (2.4), interactions among the response variables and interactions among the covariates with the response variables are included in the logistic regressions (that is, for the response Yj, terms in the regression include ]-Ii~s Yi and x 1-Ills yi, where S is a non-empty subset of {1 ..... m}\{j}), then the generalization of (2.5) has the form: .....
pl...m(Yl . . . . . Ym Ix) = [c(x)] -1 exp
+
~
(~i Jr- flix)yi q-
flsxHyk+ kGS
sc~e,ISI/>2
~ 7ijYiYj 1<~i
Z ?sHYk]' 5~.~,151>13 k~S J
y j = O , 1, j = l
..... m,
where 5a is the class of non-empty subsets of { 1..... m}. The proof is similar to that in the Appendix. We conclude this section with some comparisons. The model (2.5) allows for a general dependence structure for the Yj's, with both positive and negative dependence. A model in Connolly and Liang (1988) results by taking 7ij - 7 for all i ~ j. This simpler model might only be reasonable in a familial data situation, which is the one in the example of this cited article. The assumption in Connolly and Liang (1988) is that Yj conditional on ~ i ~ j Yi and x has a logit probability or a probability in another family. Model (2.5) should not be used if the binary variables are observed sequentially in time; in the case, one could use a model consisting of a product of a sequence of logit models [logit of Yj given Y1 ..... Yj-l,x] and this is proposed in Bonney (1987).
3. Example with a data set
In this section, the model (2.5) is applied to a data set from cardiac surgery. The final choice of covariates to use was based on Akaike information criterion and the range of univariate prediction probabilities. These criteria are really not new, so we omit the details. MCR (Merged, Multi-Center, Multi-Specialty Clinical Registries) is an international data-base developed by Health Data Research Institute (Portland, Oregon) in which information of patients who had heart-related surgery was recorded. The data set is large and not public so we use a random subset of the data of 5000 subjects. The main response variable is indicator of 30 day survival after surgery, but there are also other binary response variables that are known immediately after the surgery. These dependent variables include REC, PUC, NEC, LOC, for indicator of renal complication (mild or severe), pulmonary complication, neurological complication, low-out syndrome complication respectively. The complication variables are related to the quality of life after surgery. More documentation of the data set and some initial data analysis are given in Zhang (1993). The analyses in Zhang (1993) suggest that, among the many possible predictor variables, the more important predictors or risk factors are: AGE (in years), SEX (0 for male, 1 for female), PMI (indicator of prior myocardial infarction), CNS (indicator of central nervous system disease), DIA (indicator of diabetes), REN (indicator of renal disease), COP (indicator of chronic obstructive pulmonary disease). The variable CNS was eliminated after comparing a sequence of models based on the criteria mentioned above.
H. Joe, Y. Liu / Statistics & Probability Letters 31 (1996) 113-120
117
Table 1 Percentages Variable
# l's
Percentage
REC PUC NEC LOC
224 1394 350 472
4.5% 27.9% 7.0% 9.4%
SEX PMI CNS DIA REN COP
1403 2086 115 637 168 364
28.1% 41.7% 2.3% 12.7% 3.4% 7.3%
Table 2(a) Regression parameter estimates and standard errors (6 covariates) Covariate\Response CONSTANT AGE SEX PMI DIA REN COP
REC -6.01 (0.52) 0.031 (0.008) 0.18 (0.16) 0.07 (0.15) 0.15 (0.20) 2.05 (0.21) 0.23 (0.23)
PUC -1.86 (0.16) 0.010 (0.003) 0.03 (0.07) -0.02 (0.07) 0.48 (0.09) -0.35 (0.19) 0.28 (0.12)
NEC -6.70 (0.44) 0.056 (0.007) -0.21 (0.13) 0.07 (0.12) 0.14 (0.16) 0.16 (0.26) 0.11 (0.20)
LOC -5.01 (0.36) 0.027 (0.005) 0.48 (0.11) 0.33 (0.10) -0.01 (0.14) 0.28 (0.23) 0.43 (0.16)
Table 2(b) Dependence parameter estimates and standard errors (4 response variables) Pair
y (SE)
REC, PUC REC, NEC REC, LOC PUC, NEC PUC, LOC NEC, LOC
0.56 (0.15) 1.03 (0.19) 1.48 (0.17) 0.67 (0.12) 1.14 (0.10) 0.68 (0.15)
To provide a univariate summary of the data, we list in Table 1 the percentages of l ' s for the binary response and predictor variables. The age variable ranges from below 20 to 90 with a mean of 63 and a SD of 11. The model fitting with a given set of covariates is straightforward. Models with 3 to 7 of the above covariates were tried. The initial parameter estimates from separate logistic regressions and the estimates from the model (2.5) are fairly similar, in particular the parameters Yij, yji in the separate logistic regressions are about the same for each pair ( i , j ) with i < j . The criterion of range o f prediction probabilities and Akaike information criterion led to the model with 6 covariates. (The Bayesian information criterion with a larger penalty term for the number of parameters
H. Joe, Y. Liu I Statistics & Probability Letters 31 (1996) 113-120
118
led to a model with fewer covariates.) For the model with 6 covariates, we provide the table of regression coefficients, parameter estimates and corresponding standard errors in Table 2. The positive signs of the Yij's indicate that the binary response variables are moderately positively associated. From Table 2, we could as another step go to a simpler model where some of the regression coefficients are zero. That is, each covariate is significant for some response variable but not necessarily every response variable. AGE is significant for all four response variables, but otherwise SEX and PMI are significant for LOC, DIA is significant for PUC, REN is significant for REC, and COP is significant for PUC and LOC.
4. Discussion We have proposed a model for a multivariate binary response vector with covariates. The model covers a wide range of dependence structure and hence is more flexible than some previously defined conditional models. Conditional distributions of one binary response variable given the other responses and the covariates are all logistic regressions, so known diagnostic methods for logistic regression can be used in the model assessment of the new multivariate model. Because the model is in the exponential family, estimation is computationally not difficult. For discrete covariates with few categories, the multivariate binary vector response can be considered as a multinomial response with 2m categories and with the probability parameters depending loglinearly on the binary responses and the covariates. This view of the model is indirect and more complicated, but provides a relation with some known models (for example, in McCullagh and Nelder, 1989). The model has been applied to a data set with multivariate binary response. Comparisons with other models for multivariate binary response will be done later.
Appendix. Conditional specifications and compatibility conditions, and proof of necessity for model (2.5) The following result is from Gelman and Speed (1993). Suppose the conditional densities fll2('lY) are given for all y and f211(.lxo) is given for a particular value x0. Assume that fll2(xoly) > 0 for all y. Then for the joint distribution f12,
f l2(x, y) oc f ll2(xly)f 211(ylx°) f ll2(xoly) '
(A.1)
with the proportionality constant equal to fl(xo) = { f f[f112(xly)f211(ylxo)/f~ 12(xoly)] dx d y } - l . That is, for two random variables, the set of conditional densities given one variable plus one conditional density given the other variable determine the joint bivariate distribution. Now if the conditional densities fll2('[Y) are given for all y and f21t(.lx) are given for all x, then a condition for compatibility is that f112(x lY)f 20(ylxl ) / f ll2(XlY ) f 2fl(ylx2 )
f ll2(xlly)
//
f ll2(x21y)
fl12(x2 ]Y)f 211(Y]Xl)
fll2(Xl lY)f 211(ylx2)
(A.2)
does not depend on x, y for all choices of xl ¢ x:. The multivariate extension is as follows. Let fl...m be the joint density of random variables Y1..... I'm and let fibr be the conditional density of Y/ given the remaining variables Yj, j ¢ i. Then m
fv..m(Yl
. . . . . Y m ) 0¢
0
I-Ii=lf ilr(yilyl,. • ", Yi-1, 0 Yi+l .... , Ym) m 0 0 0 IIi=lf il~(yi lyl ..... yi-l, Yi+l . . . . . Y m )
(A.3)
H. Joe, Y. Liu I Statistics & Probability Letters 31 (1996) 113-120
119
for a given (yl° . . . . . yO) for which all of the conditional densities in the above expression are positive. The compatability condition for given sets of conditional densities fil~ is that
Y I mi = l f ilr(yilyO,, m
0
, Y io- l , Y.i + l . . . . .. Y m )
.
0
x
0
1-[i=lfilr(Yi lYl . . . . . Yi-l,Yi+l, • • •, era)
.
Y I im= l f i l r ( Z i0l Z 01,.
, Z io_ 1, Y i + I , • • • ,
Ym)
(A.4)
I-Im=lfilr(yi]z°, .. . ,Zi_l,O el+l,. • •, Ym)
does not depend on (Yl,.. • , Ym) for (Ylo . . . . . Ym) o • (Z1o . . . . . Zm)" o NOW we go to the proof of the necessity of condition for compatibility of conditional logistic regressions. The notation fJl~ will be used for the probability mass function of Yj given x and the rest of the Y's. From (A.3), Pr(Y1 = Yl . . . . . Ym = y m l X ) OC f llr(YllY2," • ., Ym, x ) f 2 [ r ( Y 2 lY°, Y3 . . . . . Ym, X ) f llr(yOly2,. • ., Ym, x) f21r(yOly O, Y3 . . . . . Ym, X) X
f 31r(Y3 lY °, y02, )'4 . . . . . . Ym, X) 0
0
0
f31r(Y3 lYl, Y2' Y4. . . . . Ym, X)
'''
f ,nlr(YmlyO,. . . , Y m0_ I , X )
(A.5)
fmlr(Y~lY~,'" • , Y mo _ l , x )
for a fixed and arbitrary (yO, yO . . . . . yO). (A.3) simplifies to I ' I i % 1 exp{(~i + flix + E] i Y q Y j ) Y i } Pr(Yl = Yl . . . . . Ym = Ym] X ) O( Ui%l exp{(e, •
..
0
+ fl, X "-k Y~.j i Y i J Y j ) Y O}
Yi+~--~(E))ijY°)Yi--~--~(ZTijYjlY°) • i=1
Take y0 . . . . .
\j
/
i=1
\j>i
(A.6)
/
yO = 0 to obtain
Pr(Y1 -- Y b . . . , Ym = y,nlX) CX exp
~i + fliX +
7ijYj)Yi j>i
, /
the form of the joint probability distribution. For the condition for compatibility, take the ratio of (A6) with y0 and y*; the ratio should not depend on Yl . . . . . Ym. It is
R(yl,
., Ym; yO,.
. . . . .
o
Ym,
m
0
m
0
, exp{Y'~4=I(~j f f i j Y j ) Y i } Y~ . . . . . Ym ) = m * m , exp{Y~i=l(~jiYijYj)Yi } m 0 m 0 e x p { ~ i = l ( ~ ~ j > i ~ ) j i Y j ) y i -- ~"~i=l(~"~q>iYijyj)y i } m , m , exp{~i=l (Y]o> iYjiYj)Yi - E i = I ( E j > iTijYj)Yi }
i=1
j
"
In (A.7), one can take appropriate values for Yi,0 Y i•, for example, y , = (0 ..... 0) and yO = (0 ..... 0, 1, 1), (0 . . . . . 0, 1, 1, 1) . . . . . (1 . . . . . 1). The first choice yields exp{(Tm_l,m - Ym,m-1)Ym} and this is independent of y l , . . . , Ym only if 7m--l,m = 7m,m-l. The other choices yield the other inequalities. Alternatively, the symmetry in the variables and parameters then imply that 7ij = 7ji for all i ¢ j. Finally, note that the conditions in Arnold et al. (1992) would be much harder to apply in this conditionally specified model for the general multivariate situation.
120
H. Joe, Y. Liu / Statistics & Probability Letters 31 (1996) 113-120
Acknowledgements This research has been supported by a NSERC Canada gram. We are grateful to Hongbin Zhang for his assistance with the data set, to Dr. Scott Page for the availability of the data set, and to Jos6 Mendoza for his contribution to the proof in the Appendix. Thanks to the referee for suggesting the extension near the end of Section 2.
References Arnold, B.C., E. Castillo and J.M. Sarabia (1992), Conditionally Specified Distributions, Lecture Notes in Statistics (Springer, New York). Arnold, B.C., E. Castillo and J.M. Sarabia (1995), General conditional specification models, Commun. Statist. A 24, 1-11. Arnold, B.C. and S.J. Press (1989), Compatible conditional distributions, J. Amer. Statist. Assoc. 84, 152-156. Bonney, G.E. (1987), Logistic regression for dependent binary observations, Biometrics 43, 951-973. Connolly, M.A. and K.-Y. Liang (1988), Conditional logistic regression models for correlated binary data, Biometrika 75, 501-506. Gelman, A. and T.P. Speed (1993), Characterizing a joint probability distribution by conditionals, J. Roy. Statist. Soc. B 55, 185-188. McCullagh, P. and J.A. Nelder (1989), Generalized Linear Models (Chapman & Hall, London, 2rid ed.). Zhang, H. (1993), Risk prediction models for binary response variable for the coronary bypass operation, M.Sc. Thesis, Department of Statistics, University of British Columbia.