Correspondence Analysis Michael Greenacre, Universitat Pompeu Fabra, Barcelona, Spain and Barcelona Graduate School of Economics, Barcelona, Spain © 2015 Elsevier Ltd. All rights reserved.
Abstract Correspondence analysis (CA) is applicable to data in the form of rectangular tables, where the entries are nonnegative measures of association between the row and column entities. The primary example of a table suitable for CA is a cross-tabulation, or contingency table, although the method extends smoothly to the analysis of almost any table of nonnegative numbers measured on the same scale, where the relative values in each row and in each column are of interest. The results of CA are one or more sets of scale values for the rows and columns, values that have a geometric interpretation leading to visualizations of the similarities between rows and between columns, as well as the row–column associations. Important variants of CA are multiple correspondence analysis, applicable to multivariate respondent-level categorical data, and canonical correspondence analysis, where the CA solution is linearly constrained by external explanatory variables.
Historical Background
In the early 1960s, a dedicated group of French researchers, led by the extraordinary scientist and philosopher Jean-Paul Benzécri, developed methods for structuring and interpreting large sets of complex data. This group's method of choice was correspondence analysis (CA), a method for transforming a rectangular table of data, usually frequency counts, into a visual map that displays the rows and columns of the table with respect to continuous underlying dimensions. Benzécri's contribution to data analysis in general and to CA in particular was not so much in the mathematical theory underlying the methodology, known for some time already, as in the strong attention paid to the graphical interpretation of the results and in the broad applicability of the methods to data in many contexts. His initial interest was in analyzing large sparse matrices of word counts in linguistics, but he soon realized the power of the method in fields as diverse as biology, archaeology, physics, and music. The fact that his approach paid so much attention to the visualization of data, to be interpreted and verbalized with ingenuity and insight into the substantive problem, fitted perfectly the esprit géométrique of the French and their tradition of visual abstraction and creativity. Originally working in Rennes in western France, this group consolidated in Paris in the 1970s to become an influential and controversial movement in post-1968 France. In 1973, they published the two fundamental volumes of L'Analyse des Données (Data Analysis), the first on La Classification, that is, unsupervised classification or cluster analysis, and the second on L'Analyse des Correspondances, or CA (Benzécri, 1973). From 1976 until 1997 they published the journal Les Cahiers de l'Analyse des Données (available at www.numdam.org), which further reflected the depth and diversity of Benzécri's work. The first books in English explaining this approach to multivariate data analysis were by Lebart et al. 
(1984) and Greenacre (1984). For a more complete historical account of the origins of CA, see Greenacre (1984), Gifi (1990), and the first part of the book edited by Blasius and Greenacre (2014).
Correspondence Analysis

CA is a variant of principal component analysis (PCA) applicable to categorical data rather than interval-level measurement data (see Factor Analysis and Latent Variable Models in Personality Psychology). For example, Table 1 is a contingency table obtained from the 2009 International Social Survey Program (ISSP) survey on social inequality (ISSP Research Group, 2012), tabulating responses from 41 countries on the question: 'What description describes the society in your country best?' (1) A small elite at the top, very few people in the middle, and the great mass of people at the bottom; (2) A society like a pyramid with a small elite at the top, more people in the middle, and most at the bottom; (3) A pyramid except that just a few people are at the bottom; (4) A society with most people in the middle; (5) Many people near the top, and only a few near the bottom. Of interest in CA are the profiles of the rows and columns of the table, defined as the relative frequencies in the rows (i.e., row entries divided by their respective row sums) and the relative frequencies in the columns (i.e., column entries divided by their respective column sums). For example, the profile of Argentina, which has row total (i.e., sample size) 1133, is 506/1133 = 0.447, 400/1133 = 0.353, etc., while that of Austria, with row total 1019, is 160/1019 = 0.157, 243/1019 = 0.238, etc. Similarly, the column profile of Type A, with column total 14 151, is 506/14 151 = 0.0358, 160/14 151 = 0.0113, etc. The theory and computing algorithm of CA can be summarized by the following steps:
International Encyclopedia of the Social & Behavioral Sciences, 2nd edition, Volume 5
1. Let N be the I × J table with grand total n and let P = (1/n)N be the correspondence matrix, with grand total equal to 1. If N is a contingency table, then P is an observed discrete bivariate distribution.
2. Let r and c be the vectors of row and column sums of P, respectively, and D_r and D_c diagonal matrices with r and c on the diagonal. The elements of r and c are called row and column masses, and are also the average, or expected, column and row profiles, respectively. The masses can be equivalently defined as the row and column sums of N divided by the
http://dx.doi.org/10.1016/B978-0-08-097086-8.42005-2
Table 1  Response frequencies from 41 countries to the question on the description of society

Country (code)              Type A  Type B  Type C  Type D  Type E  Cannot choose
Argentina (AR)                 506     400     104      78      21      24
Austria (AT)                   160     243     284     207      23     102
Australia (AU)                  89     431     318     595      30      15
Belgium (BE)                    71     360     245     335      31      49
Bulgaria (BG)                  598     256      52      31       5      55
Switzerland (CH)                78     290     293     466      43      59
Chile (CL)                     354     698     191     170      42      45
China (CN)                     644    1509     359     357      66      75
Cyprus (CY)                     43     227     544     123      10      53
Czech Republic (CZ)            360     408     215     157      24      41
Former E. Germany (DE-E)        92     158      78      63      18      30
Former W. Germany (DE-W)       144     286     211     170      35     110
Denmark (DK)                    23     154     367     847      51      48
Estonia (EE)                   313     447      92      94      14      45
Spain (ES)                     185     454     237     188      39      57
Finland (FI)                    58     198     273     298      10      23
France (FR)                    444    1454     443     329      44      36
Great Britain (GB)             137     385     173     192      32      27
Croatia (HR)                   617     310      69      59      20     119
Hungary (HU)                   537     307      57      35      13      42
Israel (IL)                    204     613     168     102      12      74
Iceland (IS)                    90     173     181     442      42      13
Italy (IT)                     338     428     133     122      22      29
Japan (JP)                     130     452     310     235      47     115
South Korea (KR)               303     556     409     240      79      12
Latvia (LV)                    723     214      57      25      39      11
Norway (NO)                     29     152     333     795     100      28
New Zealand (NZ)                59     300     232     303      15      13
The Philippines (PH)           371     476     132     118      80      23
Poland (PL)                    437     389     160     149      44      83
Portugal (PT)                  339     301     104      54      33     160
Russia (RU)                    565     486     174     116      47     215
Sweden (SE)                     77     251     321     409      20      36
Slovenia (SI)                  245     292     252     114      25     101
Slovakia (SK)                  468     420      91      79      15      72
Turkey (TR)                    565     507     169     126      45      45
Taiwan (TW)                    364     733     522     318      47      42
Ukraine (UA)                  1264     404      88      43      30     178
The United States (US)         200     454     175     304      34     372
Venezuela (VE)                 301     346     161     101      49      26
South Africa (ZA)             1626    1017     278     212      69      83
grand total: in this example, the mass of the first column (Type A) is c_1 = 14 151/54 597 = 0.259; hence Argentina has a higher than average frequency of Type A, whereas Austria has a lower than average frequency, and so on for the other columns.
3. Compute the singular value decomposition (SVD) of the centered and normalized matrix S, with general element s_ij = (p_ij - r_i c_j)/sqrt(r_i c_j):

S = D_r^{-1/2} (P - r c^T) D_c^{-1/2} = U D_s V^T   [1]

where the left and right singular vectors are orthonormal, U^T U = V^T V = I, and the singular values in the diagonal matrix D_s are in descending order.
4. Compute the standard coordinates X and Y of the rows and columns:

X = D_r^{-1/2} U,   Y = D_c^{-1/2} V   [2]
and the respective principal coordinates F and G:

F = D_r^{-1/2} U D_s = X D_s,   G = D_c^{-1/2} V D_s = Y D_s   [3]
The respective row and column contribution coordinates are simply the matrices of left and right singular vectors U and V, i.e., in terms of the standard coordinates:

U = D_r^{1/2} X,   V = D_c^{1/2} Y   [4]
The total variance, called the inertia, is equal to the sum of squares of the matrix decomposed in [1]:

inertia = sum_{i=1}^{I} sum_{j=1}^{J} (p_ij - r_i c_j)^2 / (r_i c_j) = sum_{i=1}^{I} sum_{j=1}^{J} p_ij^2 / (r_i c_j) - 1   [5]

which is equal to 0.370 in this example. The inertia is identical to the Pearson chi-square statistic calculated on the original table divided by n (see Multivariate Analysis: Discrete Variables (Overview)). The squared singular values s_1^2, s_2^2, ..., called the principal inertias, decompose the inertia into parts attributable to the respective principal axes, in descending order of importance, just as in PCA the total variance is decomposed along principal axes. In this example, the principal inertias have values (and percentages of the total) equal to 0.238 (64.4%), 0.0557 (15.0%), 0.0461 (12.5%), 0.0241 (6.5%), and 0.0060 (1.6%) – the number of principal inertias is one less than the number of rows or columns, whichever is smaller. Apart from their several algebraic properties, the coordinates are used to represent the rows and columns in joint plots, where the interpretation differs depending on the choice of coordinates. The principal coordinates are the projections of the profiles onto principal axes that explain maximum inertia. These projected profile points are equivalently interpreted as approximating the chi-square distances between the profiles, for example, between the profiles of rows i and i':

d(i, i') = sqrt( sum_{j=1}^{J} (p_ij/r_i - p_i'j/r_i')^2 / c_j )   [6]
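Steps 1–4 can be sketched directly in code. The following Python function is our own minimal illustration (using numpy; the function name and the small toy table are not from the text), and checks that the total inertia equals the Pearson chi-square statistic divided by n, as stated for eq. [5].

```python
import numpy as np

def ca(N):
    """Basic correspondence analysis of a nonnegative matrix N (steps 1-4)."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                       # correspondence matrix (step 1)
    r, c = P.sum(axis=1), P.sum(axis=0)   # row and column masses (step 2)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)    # SVD, eq. [1] (step 3)
    X = U / np.sqrt(r)[:, None]           # standard coordinates, eq. [2] (step 4)
    Y = Vt.T / np.sqrt(c)[:, None]
    F, G = X * s, Y * s                   # principal coordinates, eq. [3]
    return r, c, X, Y, F, G, s**2         # s**2 are the principal inertias

# A small illustrative table (not the ISSP data of Table 1)
N = np.array([[20, 10, 5],
              [10, 25, 10],
              [5, 10, 30]])
r, c, X, Y, F, G, inertias = ca(N)

# Total inertia = Pearson chi-square statistic / n, eq. [5]
n = N.sum()
E = np.outer(N.sum(1), N.sum(0)) / n
chi2 = ((N - E)**2 / E).sum()
print(np.isclose(inertias.sum(), chi2 / n))   # True
```

The bilinear model [7] can be checked from the same output: `np.outer(r, c) * (1 + X @ np.diag(np.sqrt(inertias)) @ Y.T)` reproduces the correspondence matrix P exactly.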
The total inertia of the table can then be written as the weighted average of the squared chi-square distances of the profiles to their average, or centroid; for example, for the row profiles:

inertia = sum_{i=1}^{I} r_i [d(i, c)]^2 = sum_{i=1}^{I} r_i sum_{j=1}^{J} (p_ij/r_i - c_j)^2 / c_j

The standard coordinates are convenient reference points for the interpretation, being the projections onto the principal axes of the unit profiles [1 0 0 ... 0], [0 1 0 ... 0], etc. The contribution coordinates, which are a rescaled version of the standard coordinates, are related to the contributions of each row or column to the principal axes (see below). When the substantive variable, in this case the column classification,
has few categories, one of the most popular displays is the symmetric map, where both rows and columns are displayed in principal coordinates, i.e., using the first two columns of F and G for the respective row and column coordinates in a two-dimensional display, as shown in Figure 1. Types A to D form a curved pattern from left to right, with Latvia and Ukraine the most extreme on the left, i.e., the highest proportions of respondents saying their societies contain a small elite, few in the middle, and most at the bottom. Denmark and Norway are at the extreme right, i.e., the highest proportions saying their societies have most people in the middle and fewer at the extremes. Thus the horizontal position of the countries, accounting for almost two-thirds (64.4%) of the inertia, scales them in terms of how egalitarian the respondents view their societies, while the vertical position, accounting for 15%, contrasts Types B and C against the others. For example, France and the United States are near average (the center) horizontally, but for different reasons: 52.9% of French respondents opt for Type B (the pyramid structure of society), while only 29.5% of US respondents do, the average across all countries being 32.9%. Not obvious from Figure 1 is that 24.2% of US respondents 'cannot choose,' whereas in France it is only 1.3% – the 'cannot choose' category dominates the third dimension of the CA, explaining an additional 12.5% of the inertia, which can be seen in the three-dimensional solution shown as a video in the online supplementary material.

Figure 1  Correspondence analysis of Table 1: two-dimensional symmetric map showing rows and columns in principal coordinates (horizontal axis: CA dimension 1; vertical axis: CA dimension 2); 79.4% of the inertia explained.

Alternative choices of coordinates give slightly different interpretations of the results. In the asymmetric map of the rows, for example, also called the row-principal biplot, rows are shown in principal coordinates and columns in standard coordinates (vice versa for the column-principal biplot). The choice between a row-principal or column-principal biplot is governed by whether the original table is considered as a set of rows or a set of columns when expressed in percentage form – usually, as in this example, the substantive categories are the columns, and the row-principal option would be the natural choice. Choosing contribution coordinates instead of standard coordinates in a row-principal biplot is a convenient alternative when there are many column categories, as is the case in community ecological applications, where samples (rows) are described by very many species (columns). The resultant contribution biplot shows the researcher exactly which columns are determinant for interpreting the differences between samples. Note that in CA the row and column profiles are weighted by their corresponding masses, which affect the solution as well as the interpretation – thus the low-frequency Type E category receives low weight in determining the optimal solution. In this example, the row (country) masses are proportional to the sample sizes, so that South Africa, for example, with a sample size of 3285, obtains roughly three times the weight of Argentina. If it is preferred to give each country the same weight, this is achieved simply by transforming the table to row proportions, or row percentages, before applying the CA algorithm, which forces all row totals to be the same while not affecting the row profiles. CA can be equivalently considered a bilinear model for the observed frequencies, since the SVD in [1] may be written in terms of the standard coordinates in the following equivalent form, for the (i, j)-th element p_ij = n_ij/n of the correspondence matrix:

p_ij = r_i c_j (1 + s_1 x_i1 y_j1 + s_2 x_i2 y_j2 + ...)   [7]

(see Multivariate Analysis: Discrete Variables (Correspondence Models)). There are as many terms s_k x_ik y_jk in [7] as there are dimensions, the maximum number of dimensions. For a model with two dimensions, for example, retaining only the terms k = 1 and 2, the residual elements are minimized by weighted least squares.

Contributions to Inertia

The overall quality of the CA solution is judged by the percentages of inertia, similar to a regression analysis: for example, in Figure 1, 79.4% of the inertia (variance) is explained by the two-dimensional solution, and in the three-dimensional solution in the supplementary material an additional 12.5% is
explained, bringing the overall percentage to 91.9%, with only 8.1% unexplained inertia. More detailed diagnostics in CA can be obtained in the form of the so-called contributions to inertia, based on the decompositions of the total inertia in [5] according to individual rows or individual columns, dimension by dimension:

sum_{i=1}^{I} sum_{j=1}^{J} (p_ij - r_i c_j)^2 / (r_i c_j) = sum_{k=1}^{K} s_k^2 = sum_{k=1}^{K} sum_{i=1}^{I} r_i f_ik^2 = sum_{k=1}^{K} sum_{j=1}^{J} c_j g_jk^2   [8]

For example, every column component c_j g_jk^2 can be expressed relative to the principal inertia s_k^2 = sum_j c_j g_jk^2 of the corresponding dimension k. These relative values provide a diagnostic for deciding which columns (or, in a similar way, which rows) are important in the determination of the kth principal axis. In fact, these relative values, summing to 1, are exactly the squared contribution coordinates defined previously. An alternative diagnostic is to express these elemental components c_j g_jk^2 relative to their sum over the dimensions, sum_k c_j g_jk^2, which is just g_jk^2 / sum_k g_jk^2 – these values are analogous to the squared factor loadings in factor analysis, that is, squared correlations between the column category and the corresponding principal axis or factor (see Factor Analysis and Latent Variable Models in Personality Psychology). For the above application, these values for Type E are 0.108, 0.043, and 0.016 on dimensions 1–3, which means that only 0.166 (16.6%) of the variance of this column is explained by the first three dimensions, considered as 'predictors' of Type E. For the category 'cannot choose,' the respective values are 0.026, 0.046, and 0.918, showing that more than 90% of its variance is explained by dimension 3.

Extensions

CA is regularly applied to analyze multiway tables, tables of preferences, ratings, as well as measurement data on ratio- or interval-level scales. Such extensions of the method conform to Benzécri's conception of CA as a universal technique for exploring many different types of data through judicious transformations or coding.

Concatenated Tables

For multiway tables, there are two approaches. The first approach is to convert the table to a flat two-way table that is appropriate to the problem at hand. Thus, if a third variable is introduced into the example above, say 'gender of respondent,' then an appropriate way to flatten the three-way table would be to interactively code 'country' and 'gender' as a new row variable, with 41 × 2 = 82 categories. The table for males is thus concatenated beneath the table for females. For each country there would now be a female and a male point, and one could compare genders and countries in this richer map. This process of interactive coding of the variables can continue as long as the data do not become too fragmented into category combinations with very few counts. A less informative way of concatenating tables is to juxtapose several cross-tabulations of a set of row variables and a set of column variables into a supermatrix of tables. For example, given a set of demographic variables such as age, education, and marital status, and another set of substantive categorical variables, all cross-tabulations of variable pairs composed of a demographic and a substantive variable are stacked row- and column-wise, and the whole matrix of tables is subjected to a CA. It should be remembered that only pairwise relationships are displayed in the eventual visualizations.

Multiple Correspondence Analysis

Another approach to multiway data, called multiple correspondence analysis (MCA) – also called homogeneity analysis (Gifi, 1990; Greenacre, 2007: Chapter 18) – applies when there are several categorical variables addressing the same issue, often called 'items.' In the same ISSP survey, for example, there are 11 questions concerning the issue of 'getting ahead,' from "How important is coming from a wealthy family" to "How important is being born a man or a woman," to which respondents have to respond on a five-point scale from 'Essential' to 'Not important at all.' MCA is usually defined as the CA algorithm applied to an indicator matrix Z, with the rows being the respondents and the columns being dummy variables for each of the categories of all the variables. The data are zeros and ones, with the ones indicating the chosen categories for each respondent. The resultant map shows each category as a point and, in principle, the position of each respondent as well. Alternatively, one can set up what is called the Burt matrix, B = Z^T Z, the square symmetric table of all two-way cross-tabulations of the variables, including the cross-tabulations of each variable with itself. This matrix, named after the psychologist Sir Cyril Burt, is reminiscent of a covariance matrix, and the CA of the Burt matrix can be likened to the PCA of a covariance matrix. The analysis of the indicator matrix Z and the Burt matrix B give identical standard coordinates of the category points, but slightly different scalings in the principal coordinates, because the principal inertias of B are the squares of those of Z. A variant of MCA called joint correspondence analysis (JCA) avoids the fitting of the tables on the diagonal of the Burt matrix, an approach that is analogous to least-squares factor analysis – see Greenacre (2007: Chapter 19). Notice that the Burt matrix is a concatenated matrix of pairwise relationships within a set of variables, whereas the concatenated matrix described at the end of Section Concatenated Tables contains pairwise relationships between two sets of variables.
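The relationship between the indicator and Burt versions of MCA can be verified numerically. The sketch below is our own illustration (the random data, seeds, and function names are not from the text): it builds Z for four three-category items, forms B = Z^T Z, and confirms that the principal inertias of B are the squares of those of Z.

```python
import numpy as np

def indicator_matrix(data, n_cats):
    """Crisp (dummy-variable) coding of integer-coded categorical data."""
    blocks = []
    for q, K in enumerate(n_cats):
        Zq = np.zeros((data.shape[0], K))
        Zq[np.arange(data.shape[0]), data[:, q]] = 1
        blocks.append(Zq)
    return np.hstack(blocks)

def principal_inertias(N):
    """Principal inertias (squared singular values) of the CA of N."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False)**2

rng = np.random.default_rng(0)
data = rng.integers(0, 3, size=(100, 4))   # 100 respondents, 4 three-category items
Z = indicator_matrix(data, [3, 3, 3, 3])   # 100 x 12 matrix of zeros and ones
B = Z.T @ Z                                # Burt matrix

lam_Z = principal_inertias(Z)
lam_B = principal_inertias(B)
print(np.allclose(lam_B[:8], lam_Z[:8]**2))   # True
```

With Q = 4 variables and 12 categories there are 12 − 4 = 8 nonzero principal inertias, which is why only the first eight values are compared.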
Ratings, Rankings, and Paired Comparisons

As far as other types of data are concerned, namely rankings, ratings, paired comparisons, and ratio- and interval-scale measurements, the key idea is to recode the data in a form that justifies the basic constructs of CA, namely profile, mass, and chi-square distance. For example, in the analysis of rankings, or preferences, applying the CA algorithm to the original rankings of a set of objects by a sample of subjects is difficult to justify, because there is no reason why weight should be accorded to an object in proportion to its average ranking. A practice called doubling resolves the issue by adding either an 'anti-object' for each ranked object or an 'anti-subject' for each responding subject, in both cases with rankings in the reverse order – see Greenacre (2007: Chapter 23). This addition
of apparently redundant data leads to CA effectively performing different variants of PCA on the original rankings. The same idea can be applied to ratings data, for example five-point ordinal scales, usually coded 1–5. A rating of 2, for example, would be recoded as the pair of values 1 and 3, since there is one scale point to the left of the rating 2 and three scale points to the right of it. This doubled pair of variables codes the opposite poles of the original scale, and the CA of the doubled table based on several rating scales is again quite similar to the PCA of the original undoubled table, the only difference being the normalization inherent in the chi-square distance.
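The doubling recode for ratings is easy to write down. The helper below is our own sketch (names and the toy data are illustrative): each rating r on a lo–hi scale becomes the pair (r − lo, hi − r), exactly the (1, 3) recoding of a rating of 2 described above.

```python
import numpy as np

def double_ratings(R, lo=1, hi=5):
    """Double an n x m table of ratings on the scale lo..hi.

    Each rating r becomes the pair (r - lo, hi - r): the number of scale
    points below it and above it, coding the two poles of each item."""
    R = np.asarray(R)
    return np.hstack([R - lo, hi - R])   # positive poles, then negative poles

R = np.array([[2, 5],
              [4, 1]])
D = double_ratings(R)
print(D)
# [[1 4 3 0]
#  [3 0 1 4]]
```

Each doubled pair sums to hi − lo (here 4), so every respondent receives the same total per item and hence the same mass in the subsequent CA, which is the point of the recoding.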
Fuzzy Coding

There are several ways for CA to handle continuous data – one of the most versatile is to recode each continuous variable into a set of fuzzy categories. This generalizes the dummy-variable coding of MCA, which has only zeros and ones, also called 'crisp' coding. Coding a variable into three categories such as 'low,' 'medium,' and 'high' leads to fuzzy-coded values such as [0 0.23 0.77], where 0.77 indicates that the datum in question is 'mostly' in the high category but 'somewhat' in the middle category. This is a more accurate coding than crisply coding the datum as [0 0 1]. Fuzzy coding is achieved through the use of membership functions, which transform the original variable into a set of fuzzy categories, with the fuzzy values being nonnegative and summing to 1. Once the data have been recoded, CA is applied in the usual way – see Aşan and Greenacre (2011).
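Triangular membership functions are the simplest way to produce such fuzzy values. The function below is our own sketch (the hinge points 0, 50, 100 and the datum 88.5 are illustrative choices that reproduce the [0 0.23 0.77] example above).

```python
import numpy as np

def fuzzy_code(x, hinges):
    """Fuzzy-code value x into len(hinges) categories using triangular
    membership functions; the result is nonnegative and sums to 1."""
    m = np.zeros(len(hinges))
    if x <= hinges[0]:
        m[0] = 1.0
    elif x >= hinges[-1]:
        m[-1] = 1.0
    else:
        k = np.searchsorted(hinges, x) - 1           # interval containing x
        w = (x - hinges[k]) / (hinges[k + 1] - hinges[k])
        m[k], m[k + 1] = 1 - w, w
    return m

hinges = [0.0, 50.0, 100.0]       # 'low', 'medium', 'high'
print(fuzzy_code(88.5, hinges))   # memberships [0, 0.23, 0.77]
```

The hinges are typically chosen as quantiles of the variable (e.g., minimum, median, maximum), so that each fuzzy category carries a comparable mass in the CA.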
CA Can Perform MDS on a Distance Matrix

CA can also be shown to be a limiting case of multidimensional scaling (MDS) by applying it to an appropriately transformed square symmetric matrix of distances. The distances are squared and subtracted from a constant C, which is at least as large as the largest squared distance in the table (so that the diagonal of the transformed matrix consists of C's). The CA of this transformed matrix yields a solution which, after an appropriate rescaling that depends on C, approximates the classical scaling solution of the distance matrix, and tends to the exact solution as C is increased (Carroll et al., 1997).
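This limiting connection can be demonstrated numerically. The sketch below is our own code; in particular, the final rescaling factor sqrt(C/2) applied to the coordinates X D_s^{1/2} is our reading of the limiting relationship, not a formula quoted from Carroll et al. (1997). A small Euclidean configuration is recovered from its distance matrix via the CA of C − d².

```python
import numpy as np

# A small Euclidean configuration and its distance matrix
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)

# Subtract squared distances from a large constant C (diagonal becomes C)
C = 1e4 * (D**2).max()
T = C - D**2

# Ordinary CA of the transformed matrix, as in eq. [1]
P = T / T.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(S)
X = U / np.sqrt(r)[:, None]               # standard coordinates

# Rescale: for large C the leading coordinates approximate classical scaling
coords = X[:, :2] * np.sqrt(s[:2]) * np.sqrt(C / 2)
Dhat = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
print(np.allclose(Dhat, D, rtol=1e-2))    # interpoint distances recovered
```

Increasing C tightens the approximation, at the cost of making the signal in S numerically smaller.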
Subset Correspondence Analysis

In some cases we would like to analyze certain subsets of rows or columns, for example, a specific set of response categories in survey research, such as all the response categories excluding the 'missing' ones (refused to answer, not applicable, etc.). Because CA uses the table margins as weights, simply omitting such categories from the table leads to problematic situations, which are easily avoided by maintaining the original margins of the full table whenever a subtable is analyzed – see Greenacre and Pardo (2006) and Greenacre (2007: Chapter 21).
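The margin-preserving idea can be sketched as follows. This is our own minimal reading of subset CA (function name and toy table are illustrative, not Greenacre and Pardo's implementation): the standardized residuals are computed from the full table, and only the columns of interest are retained for the SVD.

```python
import numpy as np

def subset_ca(N, cols):
    """CA of a column subset of N, keeping the margins of the full table."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)          # masses from the FULL table
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    Ssub = S[:, cols]                            # keep only the columns of interest
    U, s, Vt = np.linalg.svd(Ssub, full_matrices=False)
    F = (U / np.sqrt(r)[:, None]) * s            # row principal coordinates
    return F, s**2

N = np.array([[20, 10, 5, 7],
              [10, 25, 10, 2],
              [5, 10, 30, 9]])
# analyze the first three columns, excluding a 'missing'-type fourth category
F, inertias = subset_ca(N, [0, 1, 2])
```

The subset inertias sum to exactly the part of the full-table inertia contributed by the retained columns, so nothing is redistributed onto new margins of the reduced table.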
Canonical Correspondence Analysis CA finds the optimal dimensions of a data table, but in certain situations additional variables serving as explanatory variables
are available. For example, in a social survey like the one described in Section Multiple Correspondence Analysis, the age, number of years of education, and marital status might be available for each respondent. The optimal CA dimensions can be regressed on the variables, to assess whether and how much the dimensions can be explained by these variables. A more specific analysis can be performed when the dimensions are forced to be related, usually linearly, to the explanatory variables. In other words, the researcher is only interested in the variance that can be explained by age, education, and marital status. This is achieved by first projecting the primary response data onto the space defined by the explanatory variables, and then performing the dimension reduction in this reduced space. In the CA context this is called canonical correspondence analysis, and is used extensively in ecology when biological community data, usually abundance counts of many species, are related to several environmental variables (ter Braak, 1986; Greenacre, 2007: Chapter 24). The latest developments on the subject, including discussions of sampling properties of CA solutions and comprehensive reference lists, may be found in the volumes edited by Greenacre and Blasius (1994, 2008) and Blasius and Greenacre (1998, 2014).
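The constrained analysis can be sketched by inserting one projection step into the CA computation. The code below is our own minimal reading of this idea (the random data are illustrative; this is not ter Braak's full algorithm): the standardized residual matrix is projected, with row-mass weighting, onto the space spanned by the explanatory variables before the SVD.

```python
import numpy as np

def cca(N, E):
    """Sketch of canonical CA: the CA of N with the solution restricted
    to the space of the row-level explanatory variables in E."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    # weighted projector onto span{intercept, explanatory variables}
    Q = np.sqrt(r)[:, None] * np.column_stack([np.ones(len(r)), E])
    H = Q @ np.linalg.solve(Q.T @ Q, Q.T)
    U, s, Vt = np.linalg.svd(H @ S, full_matrices=False)
    return s**2                     # constrained principal inertias

rng = np.random.default_rng(1)
N = rng.integers(1, 20, size=(10, 5))   # abundance-like counts, 10 sites x 5 species
E = rng.normal(size=(10, 2))            # two environmental variables
inertias = cca(N, E)
# the constrained inertia is at most the total inertia of the ordinary CA
```

Because the projection can only remove variance, the constrained inertia never exceeds the ordinary CA inertia, and the number of nonzero constrained dimensions is limited by the number of explanatory variables.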
See also: Factor Analysis and Latent Structure Analysis: Overview; Multidimensional Scaling I; Multidimensional Scaling II; Multivariate Analysis: Discrete Variables (Correspondence Models); Multivariate Analysis: Discrete Variables (Overview).
Bibliography

Aşan, Z., Greenacre, M.J., 2011. Biplots of fuzzy coded data. Fuzzy Sets and Systems 183, 57–71.
Benzécri, J.-P., 1973. L'Analyse des Données. Vol. I: La Classification; Vol. II: L'Analyse des Correspondances. Dunod, Paris.
Blasius, J., Greenacre, M.J., 1998. Visualization of Categorical Data. Academic Press, San Diego, CA.
Blasius, J., Greenacre, M.J., 2014. Visualization and Verbalization of Data. Chapman & Hall/CRC Press, Boca Raton, FL.
ter Braak, C.J.F., 1986. Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology 67, 1167–1179.
Carroll, J.D., Kumbasar, E., Romney, A.K., 1997. An equivalence relation between correspondence analysis and classical metric multidimensional scaling for the recovery of Euclidean distances. British Journal of Mathematical and Statistical Psychology 50, 81–92.
Gifi, A., 1990. Nonlinear Multivariate Analysis. Wiley, Chichester, UK.
Greenacre, M.J., 1984. Theory and Applications of Correspondence Analysis. Academic Press, London. Freely downloadable from: www.carme-n.org.
Greenacre, M.J., 2007. Correspondence Analysis in Practice, second ed. Chapman & Hall/CRC Press, Boca Raton, FL. Published in Spanish as La Práctica del Análisis de Correspondencias by the BBVA Foundation, Madrid, and freely downloadable from: www.multivariatestatistics.org.
Greenacre, M.J., Blasius, J., 1994. Correspondence Analysis in the Social Sciences. Academic Press, London.
Greenacre, M.J., Blasius, J., 2008. Multiple Correspondence Analysis and Related Methods. Chapman & Hall/CRC Press, Boca Raton, FL.
Greenacre, M.J., Pardo, R., 2006. Subset correspondence analysis: visualization of selected response categories in a questionnaire survey. Sociological Methods and Research 35, 193–218.
ISSP Research Group, 2012. International Social Survey Programme: Social Inequality IV – ISSP 2009. GESIS Data Archive, Cologne. ZA5400 data file version 3.0.0. http://dx.doi.org/10.4232/1.11506.
Lebart, L., Morineau, A., Warwick, K., 1984. Multivariate Descriptive Statistical Analysis. Wiley, Chichester, UK.