c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 8 9 ( 2 0 0 8 ) 72–75
journal homepage: www.intl.elsevierhealth.com/journals/cmpb
Computation of the quasi-independence model for the analysis of triangular contingency tables Serpil Aktas¸ ∗ Department of Statistics, Faculty of Science, Hacettepe University, Beytepe, Ankara 06800, Turkey
a r t i c l e
i n f o
a b s t r a c t
Article history:
Triangular contingency tables are a special class of incomplete contingency tables. Asso-
Received 27 February 2007
ciation and independence models are used to analyze such tables. This paper presents
Received in revised form
and compares some methods including the uniform association model and the quasi-
8 October 2007
independence model. These models can be described in terms of the association parameters
Accepted 9 October 2007
for the analysis of triangular contingency tables having ordered categories.
Keywords:
itive (negative) likelihood dependence. The sign test, which is a nonparametric test of the
Contingency tables
independence against the likelihood ratio dependence, is also examined. These methods
Stroke patients
are applied to the disability ratings of stroke patients data. Effects of the structural zeros on
Ordinal association
the results are also discussed.
A computer program is developed for the analysis of quasi-independence model for pos-
© 2007 Elsevier Ireland Ltd. All rights reserved.
1.
Introduction
A special class of incomplete contingency tables is the triangular contingency tables, which contain structural zeros in one or more cells above or below their main diagonals. The triangular contingency table was first analyzed in [1] by partitioning the table into a set of rectangular subtables, each of which can be analyzed in an elementary way. Ref. [2] discussed this kind of tables with the classical example of disability of stroke patients. Refs. [3–5] also discussed the quasi-independence model. Ref. [6] introduced various tests of the quasi-independence model against alternative hypothesis of positive or negative quasi-dependence.
∗
We consider the rxr table where the row and the column categories are ordinal, numbered from 1 to r, and ij denoting the probability that an observation falls in the ith row and jth column of the table and mij denoting the expected frequencies under some model. Ref. [7] defined four types of triangular contingency tables under the following conditions. An upper-right (left) triangular (URT(ULT)) table is described as, ij = 0 for i > j (i + j > r + 1), and a lower-left (right) triangular (LLT(LRT)) table is defined as, ij = 0 for i < j (i + j < r + 1). These tables (i.e., URT, ULT, LLT and LRT) can be transformed to each other by interchanging the row and column variables and/or reversing the category ordering.
Tel.: +90 312 2977930. E-mail address:
[email protected]. 0169-2607/$ – see front matter © 2007 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.cmpb.2007.10.005
c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 8 9 ( 2 0 0 8 ) 72–75
2.
A simple log-linear model that uses the ordinal information but that has only one more parameter than the usual independence model is given by
Computational methods and theory
In this section the quasi-independence test, association model and the sign test for analyzing the triangular contingency tables will be given.
2.1.
Quasi-independence
Due to the incompleteness of the table, independence between the row and the column variables are tested by the quasi-independence (QI) model first proposed by [1]. The QI model in a URT table is given as(1)log mij = + X + i Y j for i ≤ j, i = 1, . . ., r, and j = 1, . . ., r. For testing the QI model, Refs. [5–9] used usual chi-squared tests based on the Pearson and the log-likelihood ratio statistics. Unlike these approaches, Ref. [7] considered testing the QI model in terms of the ordinal association based on concordant and discordant pair of observations. Ref. [7] also showed that quasi independence minimizes (maximizes) the ordinal association in triangular contingency tables in the class of positive (negative) likelihood ratio dependence models with a fixed set of marginals. The presence of the QI model can be characterized by = 0, where is the sum of the differences between the products of subscripted s = 2
Y XY ¯ j − v) ¯ (ui − u)(v log mij = + X i + j + ˇ
(3)
We will refer to this model as the linear-by-linear association model [10]. The ˇ parameter in model (3) that describes the association between X and Y can be interpreted as the common value of the local log odds ratio. The independence model is the special case of ˇXY = 0. Ref. [6] suggested that the model for the special case {ui = i}, {vj = j} in which the local odds ratio ij for the adjacent rows i and i + 1 and the adjacent columns j and j + 1 is uniformly exp(ˇ). He referred to this special case as the uniform association (UA) model.
2.3.
Sign test
Ref. [11] investigated a nonparametric test of independence against the likelihood ratio dependence based on unbiased estimates of the cross-product differences. Consider the cross-product differences as below w(i,j) = ni,j ni+1,j+1 − ni+1,j ni,j+1 ,
i, j = 1, . . . , r − 1
(4)
ˆ (i,j) as Unbiased estimator of w(i,j) defined by the w (ij kl − il kj )
(2)
i
Hence the QI model can be tested by the null hypothesis H0 : = 0 against the alternative HA : > 0 for the positive likelihood ratio dependence. An appropriate asymptotic test for testing H0 against HA is provided by a right-tail test based on √ z = ( nˆ /ˆ ) whose asymptotic null distribution is N(0, 1). For the negative likelihood ratio dependence condition, a reasonable alternative hypothesis is HA : < 0. The departure from the QI model can be attributed to the high value of .
ˆ (i,j) = w
1 (ni,j ni+1,j+1 − ni+1,j ni,j+1 ) n(n − 1)
Theorem. Under the multinomial sampling and using the delta √ method, n(ˆ − ) is asymptotically distributed as N(0, 2 ), 2 − 42 and ,ij = 2{P(X < i, Y < j) + P(X > i, where, 2 = i≤j ,ij ij Y > j) − P(X < i, Y > j) − P(X > i, Y < j) − P(X > j) − P(Y < i)} for i ≤ j. The maximum likelihood estimate of 2 is given by 2ˆ . Ref. [7] noted that his proposed test is likely to perform better than the usual chisquared tests in situations where the underlying models are expected to have the likelihood ratio dependence condition.
1 + ni+1,j+1 n2i,j + ni,j+1 n2i+1,j + ni+1,j n2i,j+1 ) (n n2 4 ij i+1,j+1 2
−(w(i,j) ) , cov(g(i,j) , g(i−1,j−1) ) =
1 (n n n ) − w(i,j) w(i−1,j−1) , 4 i,j i−1,j−1 i+1,j+1
cov(g(i,j) , g(i−1,j+1) ) =
1 (n n n ) − w(i,j) w(i−1,j+1) , 4 i,j+1 i+1,j i−1,j+2
1 cov(g(i,j) , g(i−1,j) ) = − (ni,j+1 ni−1,j ni+1,j + ni,j ni+1,j+1 nj−1,j+1 ) 4 −w(i,j) w(i−1,j) ,
Association model
For the analysis of two-way cross-classification triangular tables having ordered categories, Ref. [5] considered various kinds of association models. Suppose that both the row and column variables of a two-dimensional table are ordinal with the row variable denoted by X and the column variable denoted by Y. We assume that scores {ui } and {vj } are assigned to the rows and the columns, where u1 < u2 < . . . < ur , v1 < v2 < . . . < vr . u¯ and v¯ are the arithmetic means, respectively.
(5)
Using the delta method, it can be shown that the joint dis√ (i,j) ˆ n − w(i,j) ) for i = 1, . . ., r − 1, j = 1, . . ., c − 1 is tribution of n(w asymptotically normal with zero mean vector and covariance matrix N(0, 4˙) where ˙ = ( (i,j)(i ,j ) ), (i,j)(i ,j ) = cov(g(i,j) (X1 , Y1 ), g(i ,j ) (X1 ,Y1 )). Where n is the total number of observations and the elements of covariance matrix are given by var(g(i,j) ) =
2.2.
73
1 cov(g(i,j) , g(i,j−1) ) = − (ni,j ni+1,j+1 ni+1,j+1 + ni+1,j ni,j−1 ni,j+1 ) 4 −w(i,j) w(i,j−1) , cov(g(i,j) , g(s,t) ) = 0
Consider the null hypothesis of H0 : w(i, j) = 0, against HA : ≥ 0 for all i and j. The null hypothesis is equivalent to the independence between the row and the column categories. w(i, j)
74
c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 8 9 ( 2 0 0 8 ) 72–75
Fig. 1 – Illustration of the program calculating the Sarkar’s method.
Ref. [11] introduced the test statistic to test the null hypothesis
S=
(i,j)
ˆn , sgn w
sgn x = {x > 0}
(6)
The mass attributed to the positive quadrant by a standard bivariate normal random vector with correlation coefficient is given by (1/4) + (1/2) arcsin . This probability is invariant under the scale transformation [12].
1≤i≤R−1 1≤j≤C−1
3.
Program description
(i,j)
ˆ n > 0. where S is the number of 2 × 2 subtables such that w We reject H0 , if S is sufficiently large. Ref. [11] derived the asymptotic null distribution of the test statistic S, under H0 computing the probability P(S = m) for m = 1, 2, . . ., (r − 1)(c − 1). This probability is given by P(S = m) =
ˆ (in 1 ,j1 ) > 0, . . . , w ˆ (in m ,jm ) > P(w
Cm
A computer program is written to implement Sarkar’s method using Borland Delphi 5.1 language. A copy of the program is available from the author via e-mail (e-mail:
[email protected]). The illustration of the program is given in Fig. 1. In the program, after we enter the cell frequencies into the above left matrix and click on the “calculate” button, the ˆ , the variance of ˆ and z value will be calculated. The below left matrix also gives the ϕ,ij values.
ˆ n(im+1 ,jm+1 ) ≤ 0, . . . , w ˆ (in (r−1)(c−1) ,j(r−1)(c−1) ) ≤ 0) 0, w
Cm denotes summation over all possible combinawhere tions of m distinct elements {(i1 , j1 ), . . ., (im , jm )} from {(1,1), . . ., (R − 1, C − 1)} and where {(im+1 , jm+1 ), . . ., (i(r−1)(c−1) , j(r−1)(c−1) )} is its complement set. We fail to reject the null hypothesis, if P(S = m) is greater than the significance level ˛.
Table 1 – Initial and final ratings on disability of 121 stroke patients Initial
E
D
Final C
B
A
E D C B A
8 – – – –
15 1 – – –
12 4 4 – –
23 10 4 5 –
11 9 6 4 5
4.
Sample program runs
In this section, the methods described in Section 2 are applied to a 5 × 5 triangular table and the sign test adapted to the triangular table. A classic triangular table discussed by [2] is given in Table 1, which has been analyzed by several authors. In this table, 121 stroke patients’ initial and final disability states are graded on a five-point scale A–E of increasing severity according to their physical disability following a stroke. Since no patient was discharged if he had become worse, a patient’s score on the final examination could only be the same or better than the one on the initial examination. If a hospital never discharges a patient whose condition is worse than at admission, then no discharged patient’s final state can be worse than his initial state and so there are 10 structural zeros.Sample proportions of the cells are illustrated in Table 2.
c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 8 9 ( 2 0 0 8 ) 72–75
Table 2 – Sample proportions of the cells E
D
C
B
A
Total
E D C B A
0.066 – – – –
0.124 0.008 – – –
0.099 0.033 0.034 – –
0.190 0.083 0.033 0.041 –
0.091 0.074 0.050 0.033 0.041
0.570 0.198 0.117 0.074 0.041
Total
0.066
0.132
0.166
0.347
0.289
1
For the stroke patients data, the LR statistic and Pearson goodness-of-fit statistics are 9.60 and 8.369, respectively, each with 6 d.f. for the QI model. These values correspond to asymptotic p-values of 0.143 and 0.212, respectively. There is not strong evidence that the QI model does not fit the data. The form of the data implies that, given a patient’s initial state and the fact that he cannot get worse, the relative probability of a patient being in an achievable final state is independent of his initial state. For the uniform association model with integer scores, Pearson 2 is 5.7438 based on 5 d.f. with corresponding p-value of 0.3319. There is not strong evidence that the QI model can be rejected in favor of the UA model. The maximum likelihood estimate of ˇ is obtained as 0.2849 with estimated ˆ = 1.329 can be interpreted as standard error of 0.1528. Exp(ˇ) the local odds ratio defined for the adjacent rows and adjacent columns. Unlike the QI model, a value greater than one means that a patient with a poor initial state tends to have a poor final state. An empty cell in which observations are impossible is called a structural zero. When the data contain structural zeros in contingency tables, the primary problem is the accuracy of the number of the degrees of freedom, parameter estimates and the deviances. For example, when we use the usual uniform association model regardless of the structural zeros, we get the Pearson 2 14.2995 with 14 d.f. for the stroke patients data. When we applied the method proposed by [7], ˆ is found as 0.0781 with estimated variance 0.1552, and the z statistic is obtained as 2.1813. This value rejects H0 : = 0 against HA : > 0 at the level of 5%. This means that the positive dependence occurs for the data. In other words a patient with a high initial rating tends to have a high final rating. The patients’ final states when discharged are better than their initial states at admission. For the sign test, we found the probability as P(S = 4) = (1/4) + (1/2) = 0.2499822 and testing the hypothesis H0 : w(i,j) = 0 against HA : w(i,j) ≥ 0, we fail to reject the null hypothesis for P(S = 4) > ˛ = 0.05. Therefore, the data has not significant positive dependence along with this test.
5.
Conclusions
The QI model omits the ordinal nature of the row and the column variables. Testing the QI model without tak-
75
ing into account the ordinal structure of the variables causes inaccurate results. In order to avoid this problem, we applied the UA model to the data by taking integer scores and interpreted the QI model in terms of the ordinal association. The linear-by-linear association model can be considered as an alternative to the quasi-independence model. If one wants to use the ordinal nature of the variables, a right-tailed based on z also provides a reasonable test for testing whether or not the true model is the QI model. Additionally, the structural zeros affect the degrees of freedom, deviance statistics and the parameter estimates. For example, for the data in Table 1, if we ignored the structural zeros, the degrees of freedom under model (3) would be d.f. = 15. The correct degrees of freedom should be the degrees of freedom minus the number of structural zeros. Therefore, it would be d.f. = 15–10 = 5. The hypothesis is tested for the d.f. = 5 not for d.f. = 15. It is important to consider the structural zeros when analyzing data.
references
[1] L.A. Goodman, On quasi-independence in triangular contingency tables, Biometrics 35 (1979) 651– 655. [2] Y.M.M. Bishop, S.E. Fienberg, P.W. Holland, Discrete Multivariate Analysis, MIT Press, Cambridge, MA, 1975. [3] P.M. Altham, Quasi-independent triangular contingency tables, Biometrics 31 (1) (1975) 233–238. [4] N. Mantel, Incomplete contingency tables, Biometrics 26 (1970) 291–304. [5] Y.M.M. Bishop, S.E. Fienberg, Incomplete two-dimensional contingency tables, Biometrics 25 (1969) 119–128. [6] L.A. Goodman, On quasi-independence and quasi-dependence in contingency tables, with special reference to ordinal triangular contingency tables, JASA 89 (1994) 1059–1063. [7] S.K. Sarkar, Quasi-independence in ordinal triangular contingency tables, JASA 84 (1989) 592– 597. [8] L.A. Goodman, The analysis of cross-classified data: Independence, quasi-independence and interactions in contingency tables with or without missing entries, JASA 63 (1968) 1091–1131. [9] L.A. Goodman, The analysis of cross-classified data having ordered and/or unordered categories: association models, correlation models, and asymmetry models for contingency tables with or without missing entries., Ann. Stat. vol. 13 (1) (1985) 10–69. [10] A. Agresti, Categorical Data Analysis, Wiley, New York, 2002. [11] M.T. Lee, Some cross-product difference statistics and a test for trends in ordered contingency tables., Stat. Prob. Lett. 7 (No. 1) (1988) 41–46. [12] N.L. Johnson, S. Koltz, Distributions in Statistics, Continuous Distributions, John Wiley, New York, 1976.