Computer Methods and Programs in Biomedicine, 27 (1988) 111-119 Elsevier
CPB 00920 Section I. Methodology
Correspondence analysis of contingency tables

M.A.A. Moussa and B.A. Ouda

Faculty of Medicine, Kuwait University, Kuwait
This paper deals with the multidimensional representation of the dependence between the row and column variables of contingency tables using correspondence analysis. It includes: (1) estimation of the optimal weights that maximize the canonical correlation between two categorical variables by an iterative optimization method; (2) testing the discriminability of the estimated scoring scheme; (3) evaluation of the relative contribution of the categories (rows or columns) to each dimension; (4) simultaneous symmetric graphical representation of the row and column points. The method is applicable to two-way contingency tables, and to multi-way tables concatenated into two-way tables by merging variables to form interactive ones.

Contingency tables; Multi-dimensional scaling; Eigenvalues; Dual scaling
1. Introduction
Contingency tables are frequently used in medicine to summarize patients' responses to a set of categorical variables. In the last few years there has been growing interest in correspondence analysis [1-4]. Nishisato [1, p. 11] gives a full survey of the other names and references that have appeared in the history of correspondence analysis, such as dual scaling, reciprocal averaging, canonical correlation and simultaneous linear regression. The recent flourishing of correspondence analysis as a data-analytic technique is probably due to the heavy emphasis on the geometrical aspects of the method. Correspondence analysis is an especially efficient method for gaining insight into the relation between the variables of a contingency table when the number of categories is large.

In this paper, contingency tables are subjected to correspondence analysis. This comprises: (1) Estimation of the optimal weights for the row and column categories that maximize the correlation between the two categorical variables. (2) Testing the discriminability of the estimated weighting scheme. (3) Identification of the dominant categories (rows and columns) by evaluating the proportion of explained variance in each dimension accounted for by each category. (4) Symmetric graphical representation of row and column points.
Correspondence: M.A.A. Moussa, Faculty of Medicine, Kuwait University, P.O. Box 24923, 13110 Safat, Kuwait.

0169-2607/88/$03.50 © 1988 Elsevier Science Publishers B.V. (Biomedical Division)

2. Methodology

The concept and theoretical implications of correspondence analysis are discussed by Nishisato [1]. The method attaches weights to the columns so as to discriminate optimally among the rows. The strength of association between the two categorical variables is expressed in this analysis by η², the squared correlation ratio, which may be viewed as a qualitative analogue of the coefficient of determination for continuous data. In matrix notation,

η² = x'[F' Dn^(-1) F − ff'/f_t]x / x'[D − ff'/f_t]x   (1)
where
x (m × 1) = vector of unknown weights for the column categories,
F = the frequency matrix (f_ij), i = 1, 2, ..., n; j = 1, 2, ..., m,
Dn (n × n) = diagonal matrix of the row totals of F,
f (m × 1) = vector of the column totals of F,
f_t = total number of frequencies,
D (m × m) = diagonal matrix of the column totals of F,
n, m = the number of rows and columns, respectively.

2.1. Optimization

η² is a ratio of two quadratic forms. In order to determine the optimum vector x, the Lagrange method of multipliers is used and the problem is reduced to that of solving the set of homogeneous equations obtained from the first partial derivatives. The equations to be solved are

(D^(-1/2) F' Dn^(-1) F D^(-1/2) − η² I)w = 0,   w'w = f_t   (2)

where

w = D^(1/2) x   (3)

and 0 is the null vector. The weight vector x can be obtained from w:

x = D^(-1/2) w   (4)

We start by maximizing η² with respect to x. Of the m possible values of η², we are interested only in the largest. Once this is identified, the weight vector x can be calculated from w.

2.2. Iterative procedure

There are many numerical methods for extracting only the largest eigenvalue and the corresponding eigenvector from an equation [5,6]. These methods employ essentially the same technique of generating a convergent sequence of vectors and computing the largest eigenvalue by an iterative process. The iterative method used [1] is:
1. Multiply C_1 by an arbitrary vector b_0 (m × 1) and denote the product by b_1:
   b_1 = C_1 b_0
   where C_1 is the (m × m) residual matrix
   C_1 = D^(-1/2) F' Dn^(-1) F D^(-1/2) − D^(1/2) 1 1' D^(1/2)/f_t
   and 1 is the vector of 1's.
2. Divide the elements of b_1 by the greatest absolute value of the elements of b_1, say |K_1|: b_1* = b_1/|K_1|.
3. Let b_2 = C_1 b_1* and standardize b_2 as in step 2 (b_2* = b_2/|K_2|, where |K_2| is the greatest absolute value of the elements of b_2).
4. Iterate the process,
   C_1 b_j* = b_(j+1),   b_(j+1)/|K_(j+1)| = b_(j+1)*
   until b_j* becomes identical, within a tolerance level, to b_(j+1)*. This provides the largest eigenvalue |K_(j+1)|, corresponding to the maximum η².

The elements of x (Eq. 3) are the optimal weights for the column categories; these are the values that maximize η². The optimal row vector y is then obtained from x:

y = Dn^(-1) F x/η   (5)

The elements of y are the most discriminative scores for the row categories.
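The optimization and power-iteration steps of Sections 2.1 and 2.2 can be sketched in a few lines of modern code. The following is a minimal illustration only, not the published FORTRAN program; the function name and the NumPy-based implementation are hypothetical:

```python
import numpy as np

def dual_scaling_first_solution(F, tol=1e-10, max_iter=1000):
    """Largest squared correlation ratio eta^2 and the optimal weight
    vectors for a two-way contingency table F, via the power iteration
    of Section 2.2.  Hypothetical helper, not the CORRESPOND program."""
    F = np.asarray(F, dtype=float)
    row = F.sum(axis=1)                      # row totals (diagonal of Dn)
    col = F.sum(axis=0)                      # column totals (diagonal of D)
    ft = F.sum()                             # total number of frequencies
    Dmh = np.diag(col ** -0.5)               # D^(-1/2)
    # Residual matrix C1 = D^(-1/2) F' Dn^(-1) F D^(-1/2) - D^(1/2)11'D^(1/2)/ft
    C1 = Dmh @ F.T @ np.diag(1.0 / row) @ F @ Dmh \
        - np.outer(col ** 0.5, col ** 0.5) / ft
    b = np.arange(1.0, F.shape[1] + 1.0)     # arbitrary starting vector b0
    K = 0.0
    for _ in range(max_iter):
        b_next = C1 @ b                      # step 1: multiply by C1
        K = np.abs(b_next).max()             # greatest absolute element |K|
        b_next = b_next / K                  # steps 2-3: standardize
        if np.abs(b_next - b).max() < tol:   # step 4: converged?
            b = b_next
            break
        b = b_next
    eta2 = K                                 # largest eigenvalue = max eta^2
    w = b / np.linalg.norm(b) * np.sqrt(ft)  # rescale so that w'w = ft (Eq. 2)
    x = Dmh @ w                              # column weights (Eq. 4)
    y = (F @ x) / row / np.sqrt(eta2)        # row scores (Eq. 5)
    return eta2, x, y
```

For the 2 × 2 table [[8, 2], [2, 8]] this returns η² = 0.36 (η = 0.6, the phi coefficient of that table), with column weights proportional to (−1, 1); the sign of the scores is arbitrary, as with any eigenvector.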
2.3. Significance test

Once the optimum solution {η², x, y} is obtained, it is necessary to test whether the scores y discriminate significantly among the row categories, and also to know whether the solution recovers the original matrix F in an exhaustive manner.
The test statistic used is Bartlett's [7] chi-square approximation adapted by Bock [8]:

χ²_K = −(f_t − 1 − ½(n + m − 1)) log_e(1 − η²_K)   (6)

with degrees of freedom (df) = n + m − 1 − 2K, K = 1, 2, ..., S, where S = min(n − 1, m − 1).

The χ² test of η² obtained from the contingency table F may be regarded as a test for between-row and between-column discrimination, but more appropriately it is a test for association between the two scaled variates. When the optimum solution {η²_1, x, y} is obtained, the significance of discriminability of the scoring scheme (x, y) can be tested using Eq. 6 with K = 1. If χ² is significant, this indicates not only that the scoring scheme is useful in discriminating among rows, but also that there may exist a second scoring scheme, associated with η²_2, which is significantly discriminative. If, however, η²_1 is not significant, the corresponding scoring scheme (x, y) is still the best one can obtain, and it is probably not worth looking at the second scoring scheme associated with η²_2. The dimensionality of the solution (the maximum number of scoring schemes) is equal to min(n − 1, m − 1).

Regardless of the significance of χ², one will be interested in knowing what percentage of the total variance of the data matrix is accounted for by the optimum scoring scheme {x, y}; this is denoted by δ_1:

δ_1 = η²_1/tr(C_1) × 100   (7)

where tr(C_1) is the total 'inertia' or 'trace' of C_1, which is equal to the sum of the eigenvalues of C_1:

tr(C_1) = Σ_{K=1}^{S} η²_K

δ_1 tells us how much of the information contained in the data matrix can be explained by the optimum solution {x, y}. Thus the importance of solution K can be evaluated by the ratio of the inertia of dimension K to the total inertia. Similarly, the percentage of the variance accounted for by the first q solutions is

(η²_1 + η²_2 + ... + η²_q) × 100/tr(C_1)

2.4. Approximation of the input data matrix

The first-order approximation of the input data matrix may be produced as

e_ij = (f_i+ f_+j/f_t)(1.0 + η y_i x_j)   (8)

For a higher-order approximation,

e_ij = (f_i+ f_+j/f_t)(1 + Σ_K η_K y_iK x_jK)

The order-S approximation recovers the matrix F completely.
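Eqs. 6 and 8 translate directly into code. The sketch below uses hypothetical helper names (NumPy for the reconstitution); it computes Bartlett's chi-square with its degrees of freedom, and the order-q approximation of F from the unweighted scores:

```python
import math
import numpy as np

def bartlett_chi2(eta2_k, ft, n, m):
    """Bartlett/Bock chi-square of Eq. 6 for the K-th squared
    correlation ratio of an n x m table with ft observations."""
    return -(ft - 1 - 0.5 * (n + m - 1)) * math.log(1.0 - eta2_k)

def degrees_of_freedom(n, m, k):
    """df = n + m - 1 - 2K for the K-th solution."""
    return n + m - 1 - 2 * k

def reconstitute(F, etas, ys, xs):
    """Approximation of F (Eq. 8):
    e_ij = (f_i+ f_+j / f_t) * (1 + sum_K eta_K y_iK x_jK),
    using the UNWEIGHTED scores y, x of the first q solutions."""
    F = np.asarray(F, dtype=float)
    E = np.outer(F.sum(axis=1), F.sum(axis=0)) / F.sum()  # independence part
    adjust = np.ones_like(E)
    for eta, y, x in zip(etas, ys, xs):                   # add each solution
        adjust += eta * np.outer(y, x)
    return E * adjust
```

With the first-solution values printed in the Appendix (η² = 0.0976 for the 34 × 9 table, f_t = 53210), bartlett_chi2(0.0976, 53210, 34, 9) gives approximately 5462, matching the reported 5461.47 up to the rounding of η², and degrees_of_freedom(34, 9, 1) gives 40.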
2.5. Relative contribution of categories

An important aid in the interpretation of a dimension (solution) is the property that the sum of the weighted squared distances of the row points (or column points) to the origin is equal to η²_K for dimension K [3]:

η²_K = Σ_i (f_i+/f_t) y²_iK = Σ_j (f_+j/f_t) x²_jK   (9)

where y_iK is the ith element of the weighted row score vector in dimension K (y_K = η y = Dn^(-1) F x) and x_jK is the jth element of the weighted column score vector (x_K = η x = D^(-1) F' y).

From Eq. 9 one can evaluate the relative contribution of row i to dimension K by the ratio ((f_i+/f_t) y²_iK)/η²_K, which can also be interpreted as the proportion of the variance in dimension K accounted for by row i. The same holds for column point j: using the last term of Eq. 9, the relative contribution of column j to dimension K is the ratio ((f_+j/f_t) x²_jK)/η²_K.

It is also possible to compute, for row i on dimension K, the ratio of the squared projected distance to the squared total distance to the origin. With this ratio one can evaluate how well the total chi-squared distance of row i to the origin is represented on dimension K.
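The quantities of Eq. 9 are cheap to compute once the weighted scores are available. A sketch, assuming NumPy and η-weighted score vectors y_w, x_w for a single dimension (the helper name is hypothetical):

```python
import numpy as np

def relative_contributions(F, y_w, x_w, eta2):
    """Relative contributions of Eq. 9 for one dimension K:
    ((f_i+/f_t) y_iK^2)/eta2_K for rows and
    ((f_+j/f_t) x_jK^2)/eta2_K for columns,
    where y_w and x_w are the eta-weighted scores."""
    F = np.asarray(F, dtype=float)
    ft = F.sum()
    rows = (F.sum(axis=1) / ft) * np.asarray(y_w) ** 2 / eta2
    cols = (F.sum(axis=0) / ft) * np.asarray(x_w) ** 2 / eta2
    return rows, cols  # each vector sums to 1 over the dimension
```

Applied to the Appendix data with the printed weighted scores, the first entry of cols for dimension 1 reproduces, up to rounding, the listed contribution 0.24073 of column 1 (MATT).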
2.6. Graphical representation of row and column points

Simultaneous representation is made using the weighted (by η) row and column optimal scores, so that a symmetric representation of the row and column points is displayed [9,3].

2.7. Correspondence analysis of multi-way contingency tables

Correspondence analysis as explained so far is a method for the analysis of two-way contingency tables. A higher-way contingency table can be analysed by taking separate tables for each category of the other variables [3]. For example, when the three-way table F is of order n × m × q, one can analyse two-way tables of order n × m for every category of the third variable. It is also possible to concatenate these tables, thus creating 'interactive variables' [9] by merging two or more original variables. For a three-way table F one can form three two-way matrices of order n × (m × q), m × (n × q) and q × (n × m), respectively. In the French literature this is the method used most often to study interaction in higher-way tables. The two-way tables formed by concatenation are called 'tableaux multiples' [10]. Multiple tables are usually notated by placing the variables which constitute the interactive variable between brackets; for example, in a three-way table the three possible multiple tables are F^(1×(2×3)), F^(2×(1×3)) and F^(3×(1×2)).
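Forming a 'tableau multiple' from a three-way array amounts to a transpose followed by a reshape. A sketch, assuming NumPy (the helper name is hypothetical):

```python
import numpy as np

def multiple_table(F3, keep_axis):
    """Concatenate a three-way table into a two-way 'tableau multiple':
    the variable at keep_axis indexes the rows, and the other two
    variables are merged into one interactive column variable."""
    F3 = np.moveaxis(np.asarray(F3), keep_axis, 0)  # kept variable first
    return F3.reshape(F3.shape[0], -1)              # merge remaining axes
```

For an n × m × q table F3, multiple_table(F3, 0) yields F^(1×(2×3)) of order n × (m·q). The Appendix analysis F^(M×(S×A)) is of this type: sex and age are merged into one interactive variable that is crossed against cause of death.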
[Figure: symmetric representation of the weighted row points (sex-by-age categories) and column points (causes of death) on dimensions 1 and 2.]

Fig. 1. Analysis of F^(M×(S×A)); first two dimensions.
3. Program description

The program CORRESPOND is written in single-precision FORTRAN IV and can be conveniently run on a broad range of computers.

3.1. Input parameters

NR, NC: the number of rows and columns of the data matrix.
F(I, J): the frequency matrix, F_IJ, I = 1, NR; J = 1, NC.

3.2. Output

The printed output lists the following:
1. Input data: F(I, J), I = 1, NR; J = 1, NC.
2. Row totals of F: DN(I), I = 1, NR.
3. Column totals of F: D(J), J = 1, NC.
4. Matrix C for the eigenanalysis: C(J, J), J = 1, NC.
5. Trace of C: TRACE.
6. For each solution (dimension), NROOT = 1, MIN(NR−1, NC−1):
   solution number: NROOT;
   squared correlation ratio η²: S;
   maximum correlation η: SROOT;
   total variance accounted for, δ: SSP;
   optimal scores for rows (unweighted): Y(I), I = 1, NR;
   optimal scores for rows (weighted by η): WT(1, I), I = 1, NR;
   optimal scores for columns (unweighted): B(J), J = 1, NC;
   optimal scores for columns (weighted by η): WT(2, J), J = 1, NC;
   chi-square and degrees of freedom: CHI2, NDF;
   approximation of the input data matrix: E(I, J), I = 1, NR; J = 1, NC;
   relative contribution of rows: RT(I), I = 1, NR;
   relative contribution of columns: CT(J), J = 1, NC;
   proportion of squared projected distances to the origin for rows: RL(I), I = 1, NR;
   proportion of squared projected distances to the origin for columns: CL(J), J = 1, NC.
7. Graph: the symmetric representation of weighted row and column points for the first two dimensions (Fig. 1).

A sample output is contained in the Appendix.

3.3. Subroutines

Routine FRQNCY calculates the two diagonal matrices DN and D and the total number of frequencies, FT. Routine GRAPHICS calls 11 subroutines (PLSTRT, SETCLP, ORIGIN, USSCAL, XAXIS, YAXIS, NEWPN, FIG, MARK, WRITP and PLSTOP), which perform the symmetric graphing and are described elsewhere [11].
4. Restrictions

The program can handle a contingency table of a maximum of 35 rows and 35 columns. However, this can be changed by altering the dimension statements.

5. Specifications

The program runs on the VAX-11/780 system with 2 Mbytes of core memory. The CPU time taken for the example in the Appendix is 2.1 s. The time requirement depends essentially on NR, NC and the number of iterations needed to extract the largest eigenvalue and eigenvector (Section 2.2). Correspondence analysis is not available in any of the well-known statistical software packages, such as SPSS or BMDP.
6. Mode of availability The source listing of the program is available from the authors on request.
References

[1] S. Nishisato, Analysis of Categorical Data: Dual Scaling and its Applications, pp. 21-45, 74-81 (University of Toronto Press, Toronto, 1980).
[2] M.J. Greenacre, Theory and Applications of Correspondence Analysis (Academic Press, London, 1984).
[3] P.G.M. van der Heijden and J. de Leeuw, Correspondence analysis used complementary to loglinear analysis, Psychometrika 50 (1985) 429-447.
[4] P.G.M. van der Heijden, Transition matrices, model fitting and correspondence analysis, in: Data Analysis and Informatics IV, ed. E. Diday (North-Holland, Amsterdam, 1986).
[5] H. Hotelling, Relations between two sets of variates, Biometrika 28 (1936) 321-377.
[6] L. Fox, An Introduction to Numerical Linear Algebra (Oxford University Press, Oxford, 1964).
[7] M.S. Bartlett, Multivariate analysis, J. R. Stat. Soc. Suppl. 9 (1947) 176-197.
[8] R.D. Bock, Methods and Applications of Optimal Scaling, University of North Carolina Psychometric Laboratory Research Memorandum 25 (1960).
[9] A. Gifi, Non-linear Multivariate Analysis (DSWO Press, Leiden, 1981).
[10] J.P. Benzécri, Pratique de l'analyse des données (Dunod, Paris, 1980).
[11] Digital Equipment Corporation, Engineering Graphics Utilities for Computer Aided Design User's Manual, Order Number AA-H192A-TK (DEC, Marlboro, 1979).
Appendix. Sample input-output [3]

Suicide behavior: age by sex by cause of death

Men
Age     MATT  GASH  GASO  HANG  DROW  GUNS  STAB  JUMP  OTHE  TOTAL
10-15      4     0     0   247     1    17     1     6     9    285
15-20    348     7    67   578    22   179    11    74   175   1461
20-25    808    32   229   699    44   316    35   109   289   2561
25-30    789    26   243   648    52   268    38   109   226   2399
30-35    916    17   257   825    74   291    52   123   281   2836
35-40   1118    27   313  1278    87   293    49   134   268   3567
40-45    926    13   250  1273    89   299    53    78   198   3179
45-50    855     9   203  1381    71   347    68   103   190   3227
50-55    684    14   136  1282    87   229    62    63   146   2703
55-60    502     6    77   972    49   151    46    66    77   1946
60-65    516     5    74  1249    83   162    52    92   122   2355
65-70    513     8    31  1360    75   164    56   115    95   2417
70-75    425     5    21  1268    90   121    44   119    82   2175
75-80    266     4     9   866    63    78    30    79    34   1429
80-85    159     2     2   479    39    18    18    46    19    782
85-90     70     1     0   259    16    10     9    18    10    393
90+       18     0     1    76     4     2     4     6     2    113
Total   8917   176  1913 14740   946  2945   628  1340  2223  33828

Women
Age     MATT  GASH  GASO  HANG  DROW  GUNS  STAB  JUMP  OTHE  TOTAL
10-15     28     0     3    20     0     1     0    10     6     68
15-20    353     2    11    81     6    15     2    43    47    560
20-25    540     4    20   111    24     9     9    78    67    862
25-30    454     6    27   125    33    26     7    86    75    839
30-35    530     2    29   178    42    14    20    92    78    985
35-40    688     5    44   272    64    24    14    98   110   1319
40-45    566     4    24   343    76    18    22   103    86   1242
45-50    716     6    24   447    94    13    21    95    88   1504
50-55    942     7    26   691   184    21    37   129   131   2168
55-60    723     3    14   527   163    14    30    92    92   1658
60-65    820     8     8   702   245    11    35   140   114   2083
65-70    740     8     4   785   271     4    38   156    90   2096
70-75    624     6     4   610   244     1    27   129    46   1691
75-80    495     8     1   420   161     2    29   129    35   1279
80-85    292     3     2   223    78     0    10    84    23    715
85-90    113     4     0    83    14     0     6    34     2    256
90+       24     1     0    19     4     0     2     7     0     57
Total   8648    77   241  5637  1703   172   309  1505  1090  19382
Solution 1: squared correlation ratio η² = 0.0976; maximum correlation η = 0.3124; delta (total variance accounted for by the first solution) = 51.9%. Chi-square = 5461.47 with 40 degrees of freedom.

Optimal weight vectors and contributions to dimension 1

Rows
Row  Unweighted  Weighted  Contribution
 1     1.9018    0.5941    0.01936
 2    -1.0006   -0.3126    0.02749
 3    -0.8295   -0.2591    0.03311
 4    -0.7186   -0.2245    0.02327
 5    -0.6349   -0.1983    0.02148
 6    -0.6410   -0.2002    0.02753
 7    -0.8226   -0.2570    0.04042
 8    -0.9573   -0.2991    0.05557
 9    -0.8231   -0.2571    0.03441
10    -0.7374   -0.2303    0.01988
11    -0.7265   -0.2269    0.02355
12    -0.6818   -0.2130    0.02111
13    -0.5727   -0.1789    0.01340
14    -0.5852   -0.1828    0.00919
15    -0.2706   -0.0845    0.00107
16    -0.5321   -0.1662    0.00209
17    -0.6075   -0.1898    0.00078
18     0.7024    0.2194    0.00063
19     1.4320    0.4474    0.02158
20     1.6952    0.5296    0.04655
21     1.3380    0.4180    0.02822
22     1.3932    0.4352    0.03593
23     1.2177    0.3804    0.03675
24     1.1269    0.3520    0.02964
25     1.1476    0.3585    0.03722
26     1.1270    0.3520    0.05174
27     1.2322    0.3849    0.04730
28     1.2857    0.4016    0.06470
29     1.2453    0.3890    0.06108
30     1.4339    0.4479    0.06533
31     1.5164    0.4737    0.05531
32     1.5698    0.4904    0.03311
33     1.3849    0.4326    0.00922
34     1.3663    0.4268    0.00199

Columns
Column  Unweighted  Weighted  Contribution
1     0.8540    0.2668    0.24073
2    -0.3647   -0.1139    0.00063
3    -1.7072   -0.5333    0.11797
4    -0.5787   -0.1808    0.12826
5    -1.8787   -0.5869    0.17570
6    -2.1066   -0.6581    0.26002
7    -0.1927   -0.0602    0.00065
8     1.1521    0.3599    0.07096
9    -0.2847   -0.0889    0.00504
Solution 2: squared correlation ratio η² = 0.0717; maximum correlation η = 0.2677; delta (total variance accounted for by two solutions) = 90%. Chi-square = 3955.11 with 38 degrees of freedom.

Optimal weight vectors and contributions to dimension 2

Rows
Row  Unweighted  Weighted  Contribution
 1    -2.910    -0.7819    0.04569
 2     0.4694    0.1257    0.00605
 3     1.4854    0.3976    0.10618
 4     1.4594    0.3907    0.09602
 5     1.2164    0.3256    0.07886
 6     0.7372    0.1973    0.03642
 7     0.3878    0.1038    0.00898
 8     0.1269    0.0340    0.00097
 9    -0.3301   -0.0884    0.00553
10    -0.5905   -0.1581    0.01275
11    -0.8880   -0.2377    0.03490
12    -1.2325   -0.3299    0.06899
13    -1.4750   -0.3949    0.08893
14    -1.6983   -0.4546    0.07745
15    -1.8606   -0.4981    0.05087
16    -2.0681   -0.5536    0.03158
17    -2.1775   -0.5829    0.01006
18     0.7382    0.1976    0.00069
19     1.6350    0.4377    0.02813
20     1.5430    0.4131    0.03856
21     1.4405    0.3856    0.03271
22     1.1088    0.2968    0.02275
23     1.0477    0.2805    0.02720
24     0.3447    0.0923    0.00277
25     0.2172    0.0581    0.00133
26    -0.0877   -0.0235    0.00031
27    -0.2041   -0.0546    0.00129
28    -0.5023   -0.1345    0.00987
29    -0.8847   -0.2368    0.03082
30    -0.9314   -0.2493    0.02757
31    -0.6991   -0.1872    0.01175
32    -0.4752   -0.1272    0.00303
33    -0.3280   -0.0878    0.00051
34    -0.5280   -0.1414    0.00029

Columns
Column  Unweighted  Weighted  Contribution
1     0.6529    0.1748    0.14070
2     1.3962    0.3738    0.00926
3     2.2709    0.6079    0.20875
4    -1.0251   -0.2744    0.40239
5    -1.0897   -0.2917    0.05911
6     1.0241    0.2741    0.06145
7    -0.5533   -0.1481    0.00539
8    -0.0258   -0.0069    0.00003
9     1.3465    0.3605    0.11288
Proportion of squared projected distances to the origin

Rows
Row  Solution 1  Solution 2
 1    0.33676    0.58345
 2    0.54363    0.08787
 3    0.27833    0.65546
 4    0.23524    0.71262
 5    0.25749    0.69415
 6    0.44570    0.43293
 7    0.74921    0.12227
 8    0.91914    0.01187
 9    0.82862    0.09785
10    0.60544    0.28515
11    0.46674    0.51216
12    0.27462    0.65907
13    0.16339    0.79599
14    0.13488    0.83410
15    0.02648    0.91943
16    0.07775    0.86239
17    0.08586    0.81004
18    0.15008    0.12170
19    0.39298    0.37616
20    0.53544    0.32576
21    0.48855    0.41580
22    0.60736    0.28248
23    0.60278    0.32766
24    0.87294    0.05997
25    0.87247    0.02294
26    0.94869    0.00422
27    0.90716    0.01827
28    0.82084    0.09201
29    0.64452    0.23887
30    0.65026    0.20149
31    0.77006    0.12020
32    0.83475    0.05616
33    0.55332    0.02279
34    0.53131    0.05827

Columns
Column  Solution 1  Solution 2
1    0.66895    0.28712
2    0.03573    0.38461
3    0.39757    0.51661
4    0.30016    0.69155
5    0.60608    0.14975
6    0.82130    0.14253
7    0.05272    0.31912
8    0.71973    0.00027
9    0.04668    0.76701
Similar results appear in the sample output up to solution 8 (the maximum number of scoring schemes).