Journal of Statistical Planning and Inference 138 (2008) 2941 – 2952 www.elsevier.com/locate/jspi
Correspondence analysis of aggregate data: The 2×2 table
Eric J. Beh
School of Computing and Mathematics, University of Western Sydney, Locked Bag 1797, Penrith South DC, 1797 NSW, Australia
Received 6 February 2007; received in revised form 6 September 2007; accepted 8 November 2007
Available online 28 November 2007
Abstract
The issue of how much information the marginal frequencies of a single 2×2 table provide for the estimation of the cell frequencies has received much attention since it was first made popular in the statistical literature by Sir Ronald Fisher [1935. The logic of inductive inference (with discussions). J. Roy. Statist. Soc. Ser. A 98, 39–82]. The general consensus is that this marginal information provides no useful information for making individual (cell) inferences. This paper provides a graphical investigation of the problem. In particular, we demonstrate the applicability of correspondence analysis to a single 2 × 2 table where the joint cell frequencies are not available and only the marginal information is known. To complement this approach, an aggregate association index is proposed to determine, based only on the aggregate data, whether there is any possibility of a significant association existing between the two dichotomous variables of the table.
© 2007 Elsevier B.V. All rights reserved.
Keywords: Aggregate association index; Association; Contingency table; Ecological inference; Profile coordinates; Transition formulae
1. Introduction
The 2 × 2 contingency table is the most fundamental of data structures when cross-classifying categorical variables. It is therefore not surprising that the analysis of this type of data has received a huge amount of attention in the statistical, and related, literature. In particular, inferences concerning the odds ratio of the table and differences in proportions have been the focus of much of this attention. However, the issue of what information the marginal frequencies provide for the inference of association between the variables is also a long standing problem. Fisher (1935, p. 48) broached the subject by saying
“Let us blot out the contents of the table, leaving only the marginal frequencies. If it be admitted that these marginal frequencies by themselves supply no information on the point at issue, namely, as to the proportionality of the frequencies in the body of the table we may recognize that we are concerned only with the relative probabilities of occurrence of the different ways in which the table can be filled in, subject to these marginal frequencies.”
There have been many discussions on this very issue. Yates (1984, p. 447) agreed with Fisher’s comments, but clarified that they apply “except in extreme cases and in repeated sampling”. Others to have discussed this issue include Plackett (1977), Aitkin and Hinde (1984), Barnard (1984) and Beh et al. (2002).
0378-3758/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.jspi.2007.11.004
While the statistical investigation of association between dichotomous variables given only marginal information has largely involved likelihood theory, very little has been done on the matter from a graphical perspective. One methodology that can be employed to investigate how informative marginal frequencies are is correspondence analysis. Correspondence analysis is a very popular method of graphically identifying the association between two or more categorical variables. Its theoretical development and application in a wide variety of disciplines highlight the importance of the analysis as a tool for all types of researchers. Typically, correspondence analysis is performed on a contingency table where the marginal, or aggregate, and joint frequencies are known. For a description of the theoretical aspects of the analysis refer to, for example, Greenacre (1984) and Beh (2004). The purpose of this paper is to explore the application of correspondence analysis to a single 2 × 2 contingency table where only the aggregate information is known. To do so, this paper is divided into five further sections. Section 2 defines the notation used throughout the paper, and describes the bounds associated with the cells of the contingency table as well as the Pearson product moment correlation. Section 3 describes the correspondence analysis of a single 2 × 2 contingency table and examines its various features when only the aggregate data are known. In particular, attention is paid to the profile coordinates used for the graphical identification of association, and their relationship within and between the variables. Section 4 proposes a new index that measures the strength of the association between the two variables of a 2 × 2 contingency table when only the marginal frequencies are known, and assumed fixed.
In Section 5 the “criminal twin” contingency table considered by Fisher (1935) is reexamined and it is shown that there is indeed a great deal of information available in the marginal frequencies. We make some concluding remarks in Section 6.
2. The 2 × 2 contingency table
Consider a 2 × 2 contingency table, $N$, that classifies $n$ units/individuals. Denote the number of classifications in the (1, 1)th cell by $n_{11}$, and let the $i$th row marginal frequency be $n_{i\bullet} > 0$, for $i = 1, 2$. Similarly, let the $j$th column marginal frequency be denoted by $n_{\bullet j} > 0$, for $j = 1, 2$. Table 1 provides a description of a 2 × 2 table with this notation. Formal procedures for testing whether there is a (statistical) association between the variables of a contingency table involve calculating the Pearson chi-squared statistic. For a 2 × 2 table of the form given in Table 1, the statistic is expressed as
$$X^2 = n\,\frac{(n_{11}n_{22} - n_{12}n_{21})^2}{n_{1\bullet}\,n_{2\bullet}\,n_{\bullet 1}\,n_{\bullet 2}}.$$
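As a quick numerical check of this statistic, the following minimal Python sketch evaluates $X^2$ for a 2 × 2 table of counts. The function name `pearson_chi2_2x2` is an assumption introduced here for illustration only; the counts are those of the twin data of Table 2 (Section 5), and no continuity correction is applied.

```python
def pearson_chi2_2x2(n11, n12, n21, n22):
    """Pearson chi-squared statistic of a 2x2 table (no continuity correction)."""
    n1_, n2_ = n11 + n12, n21 + n22      # row totals n1., n2.
    n_1, n_2 = n11 + n21, n12 + n22      # column totals n.1, n.2
    n = n1_ + n2_
    if min(n1_, n2_, n_1, n_2) <= 0:
        raise ValueError("all marginal frequencies must be positive")
    return n * (n11 * n22 - n12 * n21) ** 2 / (n1_ * n2_ * n_1 * n_2)


# Twin data of Table 2: rows monozygotic/dizygotic, columns convicted/not convicted
x2 = pearson_chi2_2x2(10, 3, 2, 15)
print(round(x2, 4))  # 13.0317
```

The value agrees with the statistic of 13.032 reported for these data in Section 5.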
Incorporating Yates’ continuity correction is often advised when performing such a test since it provides a better approximation to the asymptotic chi-squared distribution. However, Conover (1974) showed that it is often unnecessary to consider such a correction. When discussing the role of the chi-squared statistic, we will therefore be restricting ourselves to the case where the correction is not adopted.

Table 1
Notation for cell values of a 2 × 2 table
         Column 1   Column 2   Total
Row 1    n11        n12        n1•
Row 2    n21        n22        n2•
Total    n•1        n•2        n

Suppose for now the individual level data (joint frequencies/proportions) of the 2 × 2 table are known. Define the proportions
$$P_1 = \frac{n_{11}}{n_{1\bullet}}, \qquad P_2 = \frac{n_{21}}{n_{2\bullet}}$$
so that $P_1$ is the proportion of those classified in “Row 1” that exhibit the characteristic of “Column 1”. It is therefore an estimator (in the case when the joint frequencies are unknown) of the conditional probability of an individual being classified into “Column 1” given that they are classified into “Row 1”. Similarly, $P_2$ is the proportion of those classified in “Row 2” that exhibit the characteristic in “Column 1”. That is, when the joint frequencies are unknown, it is an estimator of the conditional probability of an individual being classified into “Column 1” given that they are classified into “Row 2”. It can be verified algebraically that $P_1$ and $P_2$ satisfy the relationship
$$n_{\bullet 1} = P_1 n_{1\bullet} + P_2 n_{2\bullet} \tag{1}$$
which is referred to as the accounting identity (King, 1997, p. 38) of $N$. Of course, $P_1$ and $P_2$ are proportions and so must lie within the interval [0, 1]. However, when deriving bounds for the Pearson product moment correlation of a 2 × 2 table, Duncan and Davis (1953) showed that these bounds can be narrowed and involve the marginal frequencies such that
$$L_1 = \max\left(0, \frac{n_{\bullet 1} - n_{2\bullet}}{n_{1\bullet}}\right) \le P_1 \le \min\left(\frac{n_{\bullet 1}}{n_{1\bullet}}, 1\right) = U_1, \tag{2}$$
$$L_2 = \max\left(0, \frac{n_{\bullet 1} - n_{1\bullet}}{n_{2\bullet}}\right) \le P_2 \le \min\left(\frac{n_{\bullet 1}}{n_{2\bullet}}, 1\right) = U_2. \tag{3}$$
Much of the discussion throughout this paper focuses on defining the various characteristics of the 2 × 2 table and correspondence analysis as a function of $P_1$ or $P_2$. To begin our investigation of these characteristics suppose that the proportion of individuals/units classified into the $(i, j)$th cell, for $i, j = 1, 2$, is denoted by $p_{ij} = n_{ij}/n$. Denote the proportion of these individuals/units classified in row $i$ and column $j$ as $p_{i\bullet}$ and $p_{\bullet j}$, respectively. Independence between the two dichotomous variables will exist when $p_{ij} = p_{i\bullet}p_{\bullet j}$ for each $i$ and $j$. Of course, complete independence will not always be satisfied, and so a multiplicative measure of the departure from the model of complete independence can be considered such that
$$p_{ij} = \alpha_{ij}\,p_{i\bullet}\,p_{\bullet j}. \tag{4}$$
Note that independence occurs when $\alpha_{ij} = 1$ for all $i = 1, 2$ and $j = 1, 2$. As complete independence is not always going to occur, one can determine when $\alpha_{ij} \ne 1$ by evaluating
$$\alpha_{ij} = \frac{p_{ij}}{p_{i\bullet}\,p_{\bullet j}}. \tag{5}$$
This value is referred to as the Pearson ratio (Goodman, 1996). To provide a graphical depiction of the association between the row and column variables, correspondence analysis decomposes the Pearson ratio using singular value decomposition (SVD). Greenacre (1984) and Lebart et al. (1984) provide a theoretical description of this approach to correspondence analysis. Alternatively, a bivariate moment decomposition (BMD), involving orthogonal polynomials, may also be considered. See, for example, Beh (1997, 2004). When applying these procedures to a single 2 × 2 table the output will be equivalent for both approaches. Both these methods of decomposing the Pearson ratio (5) can be implemented so that
$$\alpha_{ij} = 1 + \rho\,x_i y_j. \tag{6}$$
Here $x_i$ and $y_j$, for $i, j = 1, 2$, are the $i$th row and $j$th column orthogonal polynomials, respectively (when BMD is applied), or the singular vectors (using SVD), and are subject to the constraints
$$p_{1\bullet}x_1 + p_{2\bullet}x_2 = 0, \quad p_{1\bullet}x_1^2 + p_{2\bullet}x_2^2 = 1, \quad p_{\bullet 1}y_1 + p_{\bullet 2}y_2 = 0, \quad p_{\bullet 1}y_1^2 + p_{\bullet 2}y_2^2 = 1.$$
In order to ensure that these properties are observed, it can be shown that
$$x_1 = \sqrt{\frac{p_{2\bullet}}{p_{1\bullet}}}, \quad y_1 = \sqrt{\frac{p_{\bullet 2}}{p_{\bullet 1}}}, \quad x_2 = -\sqrt{\frac{p_{1\bullet}}{p_{2\bullet}}} \quad \text{and} \quad y_2 = -\sqrt{\frac{p_{\bullet 1}}{p_{\bullet 2}}}$$
and can be more generally expressed as
$$x_i = (-1)^{i-1}\left(\sqrt{\frac{p_{1\bullet}}{p_{2\bullet}}}\right)^{2i-3}, \qquad y_j = (-1)^{j-1}\left(\sqrt{\frac{p_{\bullet 1}}{p_{\bullet 2}}}\right)^{2j-3}$$
for $i, j = 1, 2$. The values of $x_i$ and $y_j$ are also noted in Lancaster (1969, p. 218) and Goodman (1996, p. 413). In fact, Goodman (1996, p. 411) points out that $\rho$ for model (6) is the Pearson product moment correlation. By considering (4) and (6) the $(i, j)$th cell may be reconstituted so that $p_{ij} = p_{i\bullet}p_{\bullet j}(1 + \rho\,x_i y_j)$, where
$$\rho = \sum_{i=1}^{2}\sum_{j=1}^{2} x_i y_j p_{ij}. \tag{7}$$
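The scores and the reconstitution formula can be checked numerically. The sketch below is illustrative only: the helper name `scores` is an assumption, and the cell proportions are those of the twin data of Table 2 (Section 5). It verifies the constraints on $x_i$ and $y_j$, evaluates $\rho$ from (7), and rebuilds each $p_{ij}$ from the margins and $\rho$.

```python
from math import isclose, sqrt


def scores(p1_, p_1):
    """Row scores (x1, x2) and column scores (y1, y2) from the margins p1. and p.1."""
    p2_, p_2 = 1 - p1_, 1 - p_1
    x = (sqrt(p2_ / p1_), -sqrt(p1_ / p2_))
    y = (sqrt(p_2 / p_1), -sqrt(p_1 / p_2))
    return x, y


# Cell proportions of the twin data in Table 2 (n = 30)
p = [[10 / 30, 3 / 30], [2 / 30, 15 / 30]]
row = [sum(r) for r in p]                       # p1., p2.
col = [p[0][j] + p[1][j] for j in range(2)]     # p.1, p.2
x, y = scores(row[0], col[0])

# Constraints: weighted mean 0 and weighted mean square 1 for both sets of scores
assert isclose(row[0] * x[0] + row[1] * x[1], 0, abs_tol=1e-12)
assert isclose(row[0] * x[0] ** 2 + row[1] * x[1] ** 2, 1)
assert isclose(col[0] * y[0] + col[1] * y[1], 0, abs_tol=1e-12)
assert isclose(col[0] * y[0] ** 2 + col[1] * y[1] ** 2, 1)

# Correlation from (7), then reconstitution p_ij = p_i. p_.j (1 + rho x_i y_j)
rho = sum(x[i] * y[j] * p[i][j] for i in range(2) for j in range(2))
recon = [[row[i] * col[j] * (1 + rho * x[i] * y[j]) for j in range(2)]
         for i in range(2)]
print(round(rho, 4))  # 0.6591
```

For a 2 × 2 table the reconstitution is exact, so `recon` reproduces `p` up to floating-point error.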
For the case where BMD is used to decompose the Pearson ratio this model is equivalent to the rank-2 version of the model described in Rayner and Best (1996) and Beh (1997). Under the hypothesis of independence, $\sqrt{n}\,\rho$ is asymptotically a standard normal random variable; see, for example, Rayner and Best (1996, 2001, p. 141). Here, $\rho$ is equivalent to the (individual) Pearson product moment correlation of the 2 × 2 table. Duncan and Davis (1953) showed that the correlation is bounded by
$$-\min\left(\sqrt{\frac{p_{1\bullet}p_{\bullet 1}}{p_{2\bullet}p_{\bullet 2}}}, \sqrt{\frac{p_{2\bullet}p_{\bullet 2}}{p_{1\bullet}p_{\bullet 1}}}\right) \le \rho \le \min\left(\sqrt{\frac{p_{1\bullet}p_{\bullet 2}}{p_{2\bullet}p_{\bullet 1}}}, \sqrt{\frac{p_{2\bullet}p_{\bullet 1}}{p_{1\bullet}p_{\bullet 2}}}\right) \tag{8}$$
which only requires the marginal information of the 2 × 2 table. In practice the conditional proportions $P_1$ and $P_2$ are unknown, since $n_{11}$ is unknown. For any estimate of $P_1$ and $P_2$ to be valid they must satisfy the identity
$$p_{\bullet 1} = P_1 p_{1\bullet} + P_2 p_{2\bullet} = P_2 + p_{1\bullet}(P_1 - P_2) \tag{9}$$
which Goodman (1959) considered for the “ecological regression” of the aggregate information of a set of $G$ (> 1) 2 × 2 tables. This is easily obtained by dividing the accounting identity (1) by the sample size $n$. The correlation coefficient (7) can be expressed as a function of $P_1$ such that
$$\rho(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = \left(\frac{P_1 - p_{\bullet 1}}{p_{2\bullet}}\right)\sqrt{\frac{p_{1\bullet}p_{2\bullet}}{p_{\bullet 1}p_{\bullet 2}}}. \tag{10}$$
This expression can be derived by substituting the definition of $x_i$ and $y_j$ (for $i, j = 1, 2$) into (7) and making $\rho$ the subject using (9). Therefore, any variation in $P_1$ will impact upon the correlation of the contingency table in a manner described by (10). Similarly, from (9), $P_2 = (p_{\bullet 1} - p_{11})/p_{2\bullet}$. Therefore $\rho$ can be expressed as a function of $P_2$ and the aggregate data by
$$\rho(P_2 \mid p_{1\bullet}, p_{\bullet 1}) = \left(\frac{p_{\bullet 1} - P_2}{p_{1\bullet}}\right)\sqrt{\frac{p_{1\bullet}p_{2\bullet}}{p_{\bullet 1}p_{\bullet 2}}}.$$
If we express this correlation coefficient in terms of the conditional proportions $P_1$ and $P_2$ we obtain
$$\rho(P_1, P_2 \mid p_{1\bullet}, p_{\bullet 1}) = (P_1 - P_2)\sqrt{\frac{p_{1\bullet}p_{2\bullet}}{p_{\bullet 1}p_{\bullet 2}}}$$
which was also considered by Goodman (1959, p. 613). These expressions show that there will be a positive association between the two dichotomous variables when $P_1 > p_{\bullet 1}$, $P_2 < p_{\bullet 1}$ or $P_1 > P_2$. They also show that the two dichotomous variables are independent when $P_1 = P_2 = p_{\bullet 1}$.
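The bounds (2), (3) and (8), and the correlation function (10), depend only on the margins and are straightforward to compute. The following is a minimal sketch under that reading; the function names are assumptions introduced here, and the margins are those of the twin data of Table 2 (Section 5).

```python
from math import sqrt


def p_bounds(n1_, n2_, n_1):
    """Bounds (2) and (3) on P1 and P2 from the marginal frequencies."""
    l1, u1 = max(0.0, (n_1 - n2_) / n1_), min(n_1 / n1_, 1.0)
    l2, u2 = max(0.0, (n_1 - n1_) / n2_), min(n_1 / n2_, 1.0)
    return (l1, u1), (l2, u2)


def rho_bounds(p1_, p_1):
    """Bounds (8) on the correlation from the marginal proportions."""
    p2_, p_2 = 1 - p1_, 1 - p_1
    lower = -min(sqrt(p1_ * p_1 / (p2_ * p_2)), sqrt(p2_ * p_2 / (p1_ * p_1)))
    upper = min(sqrt(p1_ * p_2 / (p2_ * p_1)), sqrt(p2_ * p_1 / (p1_ * p_2)))
    return lower, upper


def rho_of_p1(P1, p1_, p_1):
    """Correlation (10) as a function of P1 and the margins."""
    p2_, p_2 = 1 - p1_, 1 - p_1
    return (P1 - p_1) / p2_ * sqrt(p1_ * p2_ / (p_1 * p_2))


# Margins of the twin data: n1. = 13, n2. = 17, n.1 = 12
n1_, n2_, n_1 = 13, 17, 12
n = n1_ + n2_
(l1, u1), (l2, u2) = p_bounds(n1_, n2_, n_1)
lo, hi = rho_bounds(n1_ / n, n_1 / n)
print(round(u1, 4), round(u2, 4))   # upper bounds of P1 and P2
print(round(lo, 4), round(hi, 4))   # bounds of rho
print(round(rho_of_p1(10 / 13, n1_ / n, n_1 / n), 4))  # rho at the observed P1
```

Evaluating `rho_of_p1` at the two ends of the interval for $P_1$ returns the two ends of the interval for $\rho$, which is one way to see that (8) follows from (2) and (10).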
3. Correspondence analysis
3.1. Profile coordinates
Since it is assumed here that the individual level (cell) proportions, and hence $P_1$ and $P_2$, are unknown, the results above provide no real insight into how one may detect whether there is any association between the two variables. In this section, we suggest that some insight may be gained by considering a more graphical approach to the problem. We will be considering the implementation of correspondence analysis and examine the behaviour of some of its characteristics when only aggregate data are available. Correspondence analysis can be used to graphically depict the association between categorical variables that, when cross-classified, form a contingency table. We can implement the above procedure and apply it to the classical correspondence analysis approach (Greenacre, 1984) or the ordered procedure discussed in Beh (1997), to observe how the position of the row profiles, $p_{ij}/p_{i\bullet}$, and column profiles, $p_{ij}/p_{\bullet j}$, are affected when only the aggregate information is available. Doing so, the coordinate along the first, and only, principal axis of the correspondence plot for the $i$th row profile is
$$f_i = \rho\,x_i = (-1)^{i-1}\rho\left(\sqrt{\frac{p_{1\bullet}}{p_{2\bullet}}}\right)^{2i-3} \tag{11}$$
for $i = 1, 2$. Similarly, the coordinate of the $j$th column profile can be shown to be
$$g_j = \rho\,y_j = (-1)^{j-1}\rho\left(\sqrt{\frac{p_{\bullet 1}}{p_{\bullet 2}}}\right)^{2j-3}$$
for $j = 1, 2$. These results indicate that profile coordinates are the furthest away from the origin when $\rho$ lies at the extremes of its bounds (8) or, alternatively, when $P_1$ and $P_2$ lie at the extremes of their bounds, (2) and (3), respectively. That is, the presence of a large association between the two dichotomous variables will result in the configuration of points in the correspondence plot being far from the origin. Similarly, when $\rho = 0$ the profile coordinates will lie at the origin. In fact, using the row profile coordinates (11), and the bounds (8), it can be easily shown that $f_1$ and $f_2$ are bounded so that
$$-\min\left(\sqrt{\frac{p_{\bullet 1}}{p_{\bullet 2}}}, \frac{p_{2\bullet}}{p_{1\bullet}}\sqrt{\frac{p_{\bullet 2}}{p_{\bullet 1}}}\right) \le f_1 \le \sqrt{\frac{p_{\bullet 2}}{p_{\bullet 1}}}\min\left(1, \frac{p_{\bullet 1}p_{2\bullet}}{p_{\bullet 2}p_{1\bullet}}\right),$$
$$-\min\left(\sqrt{\frac{p_{\bullet 1}}{p_{\bullet 2}}}, \frac{p_{1\bullet}}{p_{2\bullet}}\sqrt{\frac{p_{\bullet 2}}{p_{\bullet 1}}}\right) \le f_2 \le \sqrt{\frac{p_{\bullet 2}}{p_{\bullet 1}}}\min\left(1, \frac{p_{1\bullet}p_{\bullet 1}}{p_{2\bullet}p_{\bullet 2}}\right).$$
Similarly, the bounds for $g_1$ and $g_2$ are
$$-\min\left(\sqrt{\frac{p_{1\bullet}}{p_{2\bullet}}}, \frac{p_{\bullet 2}}{p_{\bullet 1}}\sqrt{\frac{p_{2\bullet}}{p_{1\bullet}}}\right) \le g_1 \le \sqrt{\frac{p_{2\bullet}}{p_{1\bullet}}}\min\left(1, \frac{p_{1\bullet}p_{\bullet 2}}{p_{2\bullet}p_{\bullet 1}}\right),$$
$$-\min\left(\sqrt{\frac{p_{1\bullet}}{p_{2\bullet}}}, \frac{p_{\bullet 1}}{p_{\bullet 2}}\sqrt{\frac{p_{2\bullet}}{p_{1\bullet}}}\right) \le g_2 \le \sqrt{\frac{p_{2\bullet}}{p_{1\bullet}}}\min\left(1, \frac{p_{1\bullet}p_{\bullet 1}}{p_{2\bullet}p_{\bullet 2}}\right),$$
respectively. From a practical point of view the use of these intervals is not very helpful since they suggest that the lower bound of each coordinate will be negative while the upper bound will be positive. However, they do show that there are distinct bounds for the profile coordinates when analysing a single 2 × 2 contingency table and that these bounds are dependent only on the marginal information of the table. Recall that any variation in $P_1$ will impact upon the value of the Pearson product moment correlation, $\rho$. Therefore such variation in $P_1$ will also influence the position of the row and column profile coordinates since
$$f_i(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = (-1)^{i-1}\left(\sqrt{\frac{p_{1\bullet}}{p_{2\bullet}}}\right)^{2i-3}\left(\frac{P_1 - p_{\bullet 1}}{p_{2\bullet}}\right)\sqrt{\frac{p_{1\bullet}p_{2\bullet}}{p_{\bullet 1}p_{\bullet 2}}}, \tag{12}$$
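$$g_j(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = (-1)^{j-1}\left(\sqrt{\frac{p_{\bullet 1}}{p_{\bullet 2}}}\right)^{2j-3}\left(\frac{P_1 - p_{\bullet 1}}{p_{2\bullet}}\right)\sqrt{\frac{p_{1\bullet}p_{2\bullet}}{p_{\bullet 1}p_{\bullet 2}}}. \tag{13}$$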
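For example, when $i = 1$ the coordinate for the first row response simplifies to
$$f_1(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = \frac{P_1 - p_{\bullet 1}}{\sqrt{p_{\bullet 1}p_{\bullet 2}}} \tag{14}$$
while, for $i = 2$, the profile coordinate is
$$f_2(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = -\frac{p_{1\bullet}}{p_{2\bullet}}\left(\frac{P_1 - p_{\bullet 1}}{\sqrt{p_{\bullet 1}p_{\bullet 2}}}\right). \tag{15}$$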
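To see how the coordinates move with $P_1$, the sketch below evaluates (12) and (13) for every admissible value of $n_{11}$, using the margins of the twin data of Table 2 (Section 5). The function name `profile_coordinates` is an assumption introduced for illustration; the printed rows should correspond to the coordinate columns of Table 3.

```python
from math import sqrt


def profile_coordinates(P1, p1_, p_1):
    """Row coordinates (f1, f2) and column coordinates (g1, g2), per (12) and (13)."""
    p2_, p_2 = 1 - p1_, 1 - p_1
    # Common factor: the correlation rho written as a function of P1, as in (10)
    rho = (P1 - p_1) / p2_ * sqrt(p1_ * p2_ / (p_1 * p_2))
    f = tuple((-1) ** (i - 1) * sqrt(p1_ / p2_) ** (2 * i - 3) * rho for i in (1, 2))
    g = tuple((-1) ** (j - 1) * sqrt(p_1 / p_2) ** (2 * j - 3) * rho for j in (1, 2))
    return f, g


# Margins of the twin data: n1. = 13, n.1 = 12, n = 30
n1_, n_1, n = 13, 12, 30
for n11 in range(12, -1, -1):
    f, g = profile_coordinates(n11 / n1_, n1_ / n, n_1 / n)
    print(n11, [round(v, 4) for v in f + g])
```

Because every coordinate is a fixed multiple of the same factor, the whole configuration moves along the axis together as $P_1$ changes.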
Therefore $f_1$ and $f_2$ are related by
$$f_2(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = -\frac{p_{1\bullet}}{p_{2\bullet}}\,f_1(P_1 \mid p_{1\bullet}, p_{\bullet 1}),$$
which is an intra-variable transition formula since one row profile coordinate can be easily obtained from the other. It shows that when the two dichotomous variables are statistically independent (note that this occurs when $P_1 = p_{\bullet 1}$) the row profile coordinates $f_1$ and $f_2$ are equal to zero. This is an important feature of the profile coordinates obtained from the correspondence analysis of a contingency table of any size. This relationship also indicates that $f_2/f_1 = -p_{1\bullet}/p_{2\bullet}$; that is, the ratio of the two row profile coordinates is dependent only on the aggregate information contained in the contingency table. Therefore any change in the position of one coordinate will affect the position of the second coordinate, and this will occur irrespective of the cell values in the table. This is an important property and one that is not necessarily true in classical correspondence analysis. Similar expressions and results can be obtained for the column profile coordinates $g_1$ and $g_2$.
3.2. Transition formulae
When the profile coordinates associated with one variable are known, the profile coordinates associated with the second variable can be calculated using the transition formulae. This can be done by comparing $f_i(P_1 \mid p_{1\bullet}, p_{\bullet 1})$ and $g_j(P_1 \mid p_{1\bullet}, p_{\bullet 1})$, defined by (12) and (13), respectively. Doing so obtains the transition formula
$$f_i(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = (-1)^{i-j}\,g_j(P_1 \mid p_{1\bullet}, p_{\bullet 1})\left(\frac{p_{2\bullet}p_{\bullet 1}}{p_{1\bullet}p_{\bullet 2}}\right)^{3/2}\frac{p_{1\bullet}^{\,i}\,p_{\bullet 2}^{\,j}}{p_{2\bullet}^{\,i}\,p_{\bullet 1}^{\,j}}.$$
For example, when $i = j = 1$
$$f_1(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = g_1(P_1 \mid p_{1\bullet}, p_{\bullet 1})\sqrt{\frac{p_{\bullet 1}p_{2\bullet}}{p_{\bullet 2}p_{1\bullet}}}.$$
When $i = j = 2$ we also obtain the property
$$\frac{f_1(P_1 \mid p_{1\bullet}, p_{\bullet 1})}{g_1(P_1 \mid p_{1\bullet}, p_{\bullet 1})} = \frac{g_2(P_1 \mid p_{1\bullet}, p_{\bullet 1})}{f_2(P_1 \mid p_{1\bullet}, p_{\bullet 1})} = \sqrt{\frac{p_{\bullet 1}p_{2\bullet}}{p_{\bullet 2}p_{1\bullet}}}.$$
This result shows that if $f_i$ is a positive coordinate then so too is $g_j$, for $i = j$. It also shows that the relationship between the row and column profile coordinates is independent of $P_1$ and $P_2$, and is thus independent of the cells of the contingency table. That is, the strength of the relationship between the row and column coordinates can be determined solely from the information given in the marginal frequencies of the table. By considering the bounds of $\rho$ given by (8) and the above transition formulae, an alternative set of bounds for $\rho$ can be obtained using only the row and column profile coordinates of a 2 × 2 table. It can be shown that this bound is
$$\max\left(\frac{f_2}{g_1}, \frac{f_1}{g_2}\right) \le \rho \le \min\left(\frac{f_1}{g_1}, \frac{f_2}{g_2}\right)$$
and indicates that, from the perspective of a correspondence analysis of a 2 × 2 table, the bounds of the correlation between the variables can be expressed in terms of the coordinates obtained from the correspondence plot.
3.3. The total inertia and association
A feature of using correspondence analysis to determine the level of association present in a contingency table is that the Pearson chi-squared statistic, $X^2$, can be partitioned as a weighted sum of squares of the row or column profile coordinates. For the analysis of two dichotomous variables, the statistic can be defined such that
$$X^2(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = n\left[p_{1\bullet}f_1^2(P_1 \mid p_{1\bullet}, p_{\bullet 1}) + p_{2\bullet}f_2^2(P_1 \mid p_{1\bullet}, p_{\bullet 1})\right] = n\left[p_{\bullet 1}g_1^2(P_1 \mid p_{1\bullet}, p_{\bullet 1}) + p_{\bullet 2}g_2^2(P_1 \mid p_{1\bullet}, p_{\bullet 1})\right].$$
However, since a large sample size can inflate the value of the chi-squared statistic, correspondence analysis focuses on measuring the strength of the association by $\phi^2 = X^2/n$, referred to as the total inertia. Therefore when only the marginal information is available, and by considering (14) and (15), the total inertia can be expressed as a function of $P_1$ by
$$\phi^2(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = \left(\frac{P_1 - p_{\bullet 1}}{p_{2\bullet}}\right)^2\frac{p_{1\bullet}p_{2\bullet}}{p_{\bullet 1}p_{\bullet 2}}.$$
However, by considering the correlation coefficient (10), the total inertia can be expressed as the square of the product moment correlation such that $\phi^2(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = \rho^2(P_1 \mid p_{1\bullet}, p_{\bullet 1})$. If we consider the definition of the row profile coordinates given by (12), then the estimated total inertia can be expressed by
$$\phi^2(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = \frac{p_{1\bullet}}{p_{2\bullet}}\,f_1^2(P_1 \mid p_{1\bullet}, p_{\bullet 1}).$$
Therefore
$$f_1(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = \pm\sqrt{\frac{p_{2\bullet}}{p_{1\bullet}}\,\phi^2(P_1 \mid p_{1\bullet}, p_{\bullet 1})}.$$
Similarly,
$$f_2(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = \mp\sqrt{\frac{p_{1\bullet}}{p_{2\bullet}}\,\phi^2(P_1 \mid p_{1\bullet}, p_{\bullet 1})}.$$
Thus, profile coordinates close to the origin indicate a very small total inertia and very small contribution to the association in the table. The converse also holds true, where a small total inertia will lead to profile coordinates close to zero. If we are again interested in the behaviour of the association in terms of $P_1$, we can conclude that there is no significant statistical association between the two categorical variables if
$$X^2(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = n\,\phi^2(P_1 \mid p_{1\bullet}, p_{\bullet 1}) = n\left(\frac{P_1 - p_{\bullet 1}}{p_{2\bullet}}\right)^2\frac{p_{1\bullet}p_{2\bullet}}{p_{\bullet 1}p_{\bullet 2}} < \chi^2_\alpha, \tag{16}$$
where $\chi^2_\alpha$ is the $100(1 - \alpha)$th percentile of a chi-squared distribution with 1 degree of freedom at the $\alpha$ level of significance. Therefore the asymptotic $100(1 - \alpha)\%$ expected interval of $P_1$ under the null hypothesis of independence between the two dichotomous variables is
$$L_\alpha = p_{\bullet 1} - p_{2\bullet}\sqrt{\frac{\chi^2_\alpha}{n}\,\frac{p_{\bullet 1}p_{\bullet 2}}{p_{1\bullet}p_{2\bullet}}} < P_1 < p_{\bullet 1} + p_{2\bullet}\sqrt{\frac{\chi^2_\alpha}{n}\,\frac{p_{\bullet 1}p_{\bullet 2}}{p_{1\bullet}p_{2\bullet}}} = U_\alpha. \tag{17}$$
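The interval (17) can be evaluated with the Python standard library alone. The sketch below is illustrative: the function name `independence_interval` is an assumption, and the chi-squared critical value with 1 degree of freedom is obtained as the square of a standard normal quantile.

```python
from math import sqrt
from statistics import NormalDist


def independence_interval(n, p1_, p_1, alpha=0.05):
    """Interval (17) of P1 values consistent with independence at level alpha."""
    p2_, p_2 = 1 - p1_, 1 - p_1
    # For 1 degree of freedom the chi-squared critical value is z_{alpha/2} squared
    chi2_crit = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    half_width = p2_ * sqrt(chi2_crit / n * p_1 * p_2 / (p1_ * p2_))
    return p_1 - half_width, p_1 + half_width


# Margins of the twin data of Table 2 (Section 5): n = 30, p1. = 13/30, p.1 = 12/30
lower, upper = independence_interval(30, 13 / 30, 12 / 30, alpha=0.05)
print(round(lower, 4), round(upper, 4))
```

For the twin data this gives an interval of roughly 0.2 to 0.6, the values quoted in Section 5.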
Therefore one may conclude, given a level of significance $\alpha$ and the aggregate data, that there is a significant association between the two dichotomous variables if $L_1 \le P_1 \le L_\alpha$ or $U_\alpha \le P_1 \le U_1$. Note also that, for 1 degree of freedom, $\chi^2_\alpha = Z^2_{\alpha/2}$ where $Z$ is a standard normally distributed random variable. Therefore the asymptotic $100(1 - \alpha)\%$ expected interval for $P_1$ may be alternatively expressed as
$$L_\alpha = p_{\bullet 1} - p_{2\bullet}Z_{\alpha/2}\sqrt{\frac{1}{n}\,\frac{p_{\bullet 1}p_{\bullet 2}}{p_{1\bullet}p_{2\bullet}}} < P_1 < p_{\bullet 1} + p_{2\bullet}Z_{\alpha/2}\sqrt{\frac{1}{n}\,\frac{p_{\bullet 1}p_{\bullet 2}}{p_{1\bullet}p_{2\bullet}}} = U_\alpha.$$
4. Aggregate association index
Consider a plot of the Pearson chi-squared statistic $X^2(P_1 \mid p_{1\bullet}, p_{\bullet 1})$, defined by (16), versus $P_1$. If the area under $X^2(P_1)$ but above $\chi^2_\alpha$ is large then there may be evidence to suggest that there is a significant association (at the $\alpha$ level of significance) between the two dichotomous variables. A relatively small area between the curve and $\chi^2_\alpha$ will indicate that it is unlikely that such an association exists. This area may be found by calculating
$$A_\alpha = 100 \times \left(1 - \frac{\chi^2_\alpha\left[(L_\alpha - L_1) + (U_1 - U_\alpha)\right] + \int_{L_\alpha}^{U_\alpha} X^2(P_1)\,dP_1}{\int_{L_1}^{U_1} X^2(P_1)\,dP_1}\right),$$
where $\chi^2_\alpha$ is the $100(1 - \alpha)$th percentile of a chi-squared distribution with 1 degree of freedom. The value $A_\alpha$ is referred to as the aggregate association index and is used as a measure of the strength of the association between the two variables based only on the marginal information of a single 2 × 2 table at some value of $\alpha$. It is bounded by [0, 100] where a value of zero indicates that, at the $\alpha$ level of significance, there is no evidence of an association between the two categorical variables. A value of $A_\alpha \approx 100$ indicates that, at the $\alpha$ level of significance, there is strong evidence to conclude (based only on the available aggregate data) that there is an association between the variables. More attention will be given to this index in the following example.
5. Example
Consider the following 2 × 2 contingency table that Fisher (1935) used to explore what information the marginal frequencies offered for the inference of the cell values. Plackett (1977) also considered the analysis of Fisher’s data but focused on the odds ratio of the table. These data, summarised in Table 2, consider the results of a study carried out on criminal twins. The 30 individuals classified in Table 2 each have a criminal twin of the same sex. The table classifies whether the twin is a monozygotic or dizygotic twin and whether they are convicted of a criminal offence. While Fisher (1935) does not cite the source of these data, his reference to them as “a classification as Lange supplies in his study on criminal twins” suggests that they originally appeared in Lange (1930, p. 211). When the individual level frequencies of Table 2 are assumed to be known, $P_1 = 10/13 = 0.7692$, $P_2 = 2/17 = 0.1176$ and the Pearson chi-squared statistic is 13.032. With a p-value of 0.0003 there is evidence of a strong association between the two variables. The product moment correlation of $\rho = 0.6591$ indicates that this association is positive.
Table 2
Conviction of same-sex twins of criminals
              Convicted   Not convicted   Total
Monozygotic   10          3               13
Dizygotic     2           15              17
Total         12          18              30

Table 3
Characteristics of the correspondence analysis of Table 2 assuming fixed marginal frequencies
n11   P1       ρ         f1        f2        g1        g2        X2        p-Value
12    0.9231   0.9337    1.0677    −0.8165   1.1435    −0.7624   26.1538   0.0000
11    0.8462   0.7964    0.9107    −0.6964   0.9754    −0.6503   19.0271   0.0000
10    0.7692   0.6591    0.7537    −0.5764   0.8072    −0.5381   13.0317   0.0003
9     0.6923   0.5218    0.5967    −0.4563   0.6390    −0.4260   8.1674    0.0043
8     0.6154   0.3845    0.4397    −0.3362   0.4709    −0.3139   4.4344    0.0352
7     0.5385   0.2472    0.2826    −0.2161   0.3027    −0.2018   1.8326    0.1758
6     0.4615   0.1098    0.1256    −0.0961   0.1345    −0.0897   0.3620    0.5474
5     0.3846   −0.0275   −0.0314   0.0240    −0.0336   0.0224    0.0226    0.8804
4     0.3077   −0.1648   −0.1884   0.1441    −0.2018   0.1345    0.8145    0.3668
3     0.2308   −0.3021   −0.3454   0.2642    −0.3700   0.2466    2.7376    0.0980
2     0.1538   −0.4394   −0.5025   0.3842    −0.5381   0.3588    5.7919    0.0161
1     0.0769   −0.5767   −0.6595   0.5043    −0.7063   0.4709    9.9774    0.0016
0     0.0000   −0.7140   −0.8165   0.6244    −0.8745   0.5830    15.2941   0.0001

Therefore a monozygotic twin of a convicted criminal is associated with being convicted of a crime, while a dizygotic twin of a convicted criminal tends not to be a convicted criminal. When only the aggregate data are assumed to be known, $n_{11}$ is bounded by [0, 12] with $P_1$ and $P_2$ bounded by [0, 0.9231] and [0, 0.7059], respectively. Since the total inertia, and the Pearson chi-squared statistic, is maximised at the bounds of $P_1$ and $P_2$, the local maximum chi-squared values are 15.2941 and 26.1538. Therefore, based only on the aggregate data, the maximum value that the Pearson chi-squared statistic can take is 26.1538 and occurs when
$[P_1, P_2] = [0.9231, 0]$. Similarly the correlation coefficient of the table is bounded by (−0.7140, 0.9337). Based on this interval it is not possible to determine whether there is a significant association between the two variables nor how each of the categories are related. A full summary of the values of $P_1$, $\rho$, $f_i$, $g_j$ and $X^2$ for each possible value of $n_{11}$ is, however, given in Table 3. The p-value associated with $X^2$ is also given in this table. The bounds of $P_1$, $\rho$ and $X^2$ stated earlier in this example can be verified by observing the second, third and eighth columns of Table 3, respectively. While the aggregate data by themselves do not allow for a clear picture of the nature of the association between the two variables, something may be obtained by considering a graphical summary through performing a correspondence analysis of Table 2. Assuming that the marginal frequencies of this table are known, the row profile coordinates can be expressed as functions of $P_1$ by
$$f_1\left(P_1 \,\middle|\, p_{1\bullet} = \tfrac{13}{30}, p_{\bullet 1} = \tfrac{12}{30}\right) = \frac{5P_1 - 2}{\sqrt{6}} = 2.0412P_1 - 0.8165$$
and
$$f_2\left(P_1 \,\middle|\, p_{1\bullet} = \tfrac{13}{30}, p_{\bullet 1} = \tfrac{12}{30}\right) = -\frac{13}{17}\left(\frac{5P_1 - 2}{\sqrt{6}}\right) = -1.5609P_1 + 0.6244,$$
respectively. Therefore, for every increase of 0.1 in $P_1$ the coordinate for the first row category, $f_1$, will increase by 0.204 while the coordinate for the second category, $f_2$, reduces by 0.156. These equations, together with the equations of the two column profile coordinates, are simultaneously represented on the correspondence plot of Fig. 1. Note that the bounds stated above for $P_1$ are reflected in this plot and that the true position of the profile coordinates is represented by •. Fig. 1 shows that when only the aggregate data are available there appears to be a strong positive association between the two variables. That is, Monozygotic twins are associated with convicted criminals while Dizygotic twins are associated with a non-conviction. This is exactly the conclusion obtained from assuming that the individual level data were known. The proximity of the points is such that $f_1$ will always be positioned close to $g_1$ irrespective of the value of $P_1$. Similarly $f_2$ will be positioned close to $g_2$. The strength of this relationship is not surprising since
$$\frac{f_1(P_1)}{g_1(P_1)} = \frac{g_2(P_1)}{f_2(P_1)} = 0.9337.$$
Fig. 1 also shows that the origin coincides with the hypothesis of independence between the type of twin and their felony record. That is, the origin of the correspondence plot occurs where $P_1 = P_2 = p_{\bullet 1} = 0.4$. In fact, by considering the interval $[L_\alpha, U_\alpha]$, for a level of significance of $\alpha = 0.05$, we can conclude that there is not a significant association between the two variables if $0.2 < P_1 < 0.6$. Therefore, one may conclude that there is an association if $P_1$ falls within the interval [0, 0.2] or [0.6, 0.9231]. This interval is reflected in Fig. 1 by the horizontal lines at $P_1 = 0.2$ and 0.6. Since the interval $[L_{0.05}, U_{0.05}]$ represents only 43% of the total range of possible values that $P_1$ can attain, there are signs to suggest that this positive association is statistically significant.

[Figure 1: $P_1$ (vertical axis) against CA coordinate (horizontal axis), with lines labelled $f_1$, $f_2$, $g_1$ and $g_2$.]
Fig. 1. Correspondence plot of Table 2 for changes in $P_1$. Closed line (-) is associated with the row profile coordinates, dashed line (- -) is associated with the column profile coordinates. The position of the profile coordinates (when the joint frequencies are known) is represented by •.

[Figure 2: two panels, $X^2(P_1)$ against $P_1$ and $X^2(P_2)$ against $P_2$.]
Fig. 2. Plot of $X^2$ versus $P_1$ and $P_2$ for Table 2.

Based only on the availability of the aggregate data, $\rho$ can be expressed as a function of $P_1$ by
$$\rho\left(P_1 \,\middle|\, p_{1\bullet} = \tfrac{13}{30}, p_{\bullet 1} = \tfrac{12}{30}\right) = \sqrt{\frac{221}{216}}\left(\frac{30P_1 - 12}{17}\right) = 1.7850P_1 - 0.7140.$$
Therefore the correlation will increase by 0.1785 for every 0.1 increase in $P_1$. Alternatively, one may consider the total inertia $\phi^2$, which can be expressed in terms of $P_1$ and $P_2$ by
$$\phi^2\left(P_1 \,\middle|\, p_{1\bullet} = \tfrac{13}{30}, p_{\bullet 1} = \tfrac{12}{30}\right) = \frac{221}{216}\left(\frac{30P_1 - 12}{17}\right)^2 = 3.1863P_1^2 - 2.5490P_1 + 0.5098,$$
$$\phi^2\left(P_2 \,\middle|\, p_{1\bullet} = \tfrac{13}{30}, p_{\bullet 1} = \tfrac{12}{30}\right) = \frac{221}{216}\left(\frac{12 - 30P_2}{13}\right)^2 = 5.4487P_2^2 - 4.3590P_2 + 0.8718,$$
respectively. Therefore the chi-squared statistic can be obtained by multiplying these functions by $n = 30$. The resulting quadratic functions are plotted in Fig. 2 and show that the Pearson chi-squared statistic is maximised at 26.1538 when $[P_1, P_2] = [0.9231, 0]$, verifying the result obtained earlier in the example. Note also that Fig. 2 confirms that the Pearson chi-squared statistic has a minimum value of zero when the two variables are independent ($P_1 = P_2 = p_{\bullet 1} = 0.4$). At the 0.05 level of significance, any 2 × 2 contingency table with a Pearson chi-squared value exceeding 3.84 indicates that there exists an association between the dichotomous variables. Therefore, if we consider that part of the area under the curve indicated by the chi-squared statistic in Fig. 2 that exceeds 3.84, then the association index is $A_{0.05} = 61.83$. That is, it is expected that 61.83% of contingency tables randomly generated with the marginal frequencies $n_{1\bullet} = 13$, $n_{2\bullet} = 17$, $n_{\bullet 1} = 12$ and $n_{\bullet 2} = 18$ will exhibit a significant association between the two dichotomous categorical variables. In fact, based on the p-values summarised in Table 3, of the 13 possible contingency tables with these marginal frequencies, the observed number of tables where there is a significant association (at the 0.05 level) between the rows and columns is eight, or 61.54%. The association index when $\alpha$ is 0.001, 0.005, 0.01 and 0.1 is $A_{0.001} = 22.18$, $A_{0.005} = 35.49$, $A_{0.01} = 42.51$ and $A_{0.1} = 71.35$, respectively. As expected, if $\alpha$ is increased (so that confidence in the test of association is reduced) $A_\alpha$ increases.
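The index and the exact count quoted above can be reproduced from the margins alone. The sketch below is a minimal illustration, not a reference implementation: the function names `aai` and `count_significant` are assumptions, the integrals of the quadratic $X^2(P_1)$ are evaluated in closed form, and clamping $[L_\alpha, U_\alpha]$ to $[L_1, U_1]$ is an added safeguard that has no effect for these data.

```python
from math import erfc, sqrt
from statistics import NormalDist


def aai(n1_, n2_, n_1, alpha=0.05):
    """Aggregate association index A_alpha from the marginal frequencies."""
    n = n1_ + n2_
    p1_, p_1 = n1_ / n, n_1 / n
    p2_, p_2 = 1 - p1_, 1 - p_1
    l1, u1 = max(0.0, (n_1 - n2_) / n1_), min(n_1 / n1_, 1.0)   # bounds (2)
    crit = NormalDist().inv_cdf(1 - alpha / 2) ** 2             # chi-squared, 1 df
    k = n * p1_ / (p2_ * p_1 * p_2)       # X2(P1) = k * (P1 - p.1)**2, from (16)
    half = sqrt(crit / k)
    la, ua = max(l1, p_1 - half), min(u1, p_1 + half)           # clamped (assumption)

    def area(a, b):
        """Integral of X2(P1) over [a, b]."""
        return k * ((b - p_1) ** 3 - (a - p_1) ** 3) / 3

    numerator = crit * ((la - l1) + (u1 - ua)) + area(la, ua)
    return 100 * (1 - numerator / area(l1, u1))


def count_significant(n1_, n2_, n_1, alpha=0.05):
    """Number of admissible tables with a significant X2, and the total number."""
    n = n1_ + n2_
    k = n * (n1_ / n) / ((n2_ / n) * (n_1 / n) * (1 - n_1 / n))
    cells = range(max(0, n_1 - n2_), min(n1_, n_1) + 1)
    # p-value of a chi-squared variate with 1 df: P(X2 > x) = erfc(sqrt(x / 2))
    hits = sum(erfc(sqrt(k * (c / n1_ - n_1 / n) ** 2 / 2)) < alpha for c in cells)
    return hits, len(cells)


print(round(aai(13, 17, 12, 0.05), 2))   # 61.83
print(count_significant(13, 17, 12))     # (8, 13)
```

Note that the exact count treats each admissible table as equally weighted, whereas $A_\alpha$ is an area-based measure, so the two percentages need not agree in general.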
Similarly, reducing the level of significance will lead to a reduction in A to levels where it would be unwise to conclude that there is a significant association at the individual level. 6. Discussion This paper investigates the properties of applying correspondence analysis to a single 2 × 2 contingency table where the joint frequencies, or individual level data, are not known. The study of aggregate data, and the information it provides for inferring individual level data, has been a topic of much discussion since Fisher’s exposition of the problem. However, until now, the correspondence analysis of this type data has not been investigated. This is despite the continuing growth in theory and application of both correspondence analysis and techniques for dealing with aggregate data. The bound of proposed by Duncan and Davis (1953), while being more useful than knowing that −1 1, is often not very informative for determining the strength or direction of the association between the two dichotomous variables. However, by using correspondence analysis, knowledge of the direction of this relationship can be determined by quantifying the ratio f1 /g1 . Similarly, the direction of the association can be seen through the correspondence plot. For example, while the Pearson product moment correlation of the twin data of Table 2 falls somewhere within the interval [−0.7140, 0.9337] the correspondence plot of Fig. 1 suggests that the association between the variables is positive. What is also of interest from the plot is that, based only on the marginal information in the contingency table, the general configuration of points in the plot appears to be only dependent only on this information. However, knowledge of the cell values does determine the strength of the association, and the distance of the configuration of points from the origin. 
While the features of correspondence analysis discussed in this paper provide insight into how the two dichotomous variables are associated with one another, it is the aggregate association index ($A_\alpha$) that can provide a clear indication of the relative strength of the association at the individual level. One must be clear on the interpretation of $A_\alpha$. It does not say that there will be (or will not be) a significant association between the two dichotomous variables; this decision can only be made based on the individual level data and the chosen value of $\alpha$. Instead, $A_\alpha$ should be interpreted as the possibility, given only the availability of aggregate data, that there will exist a significant association between the two variables at the $\alpha$ level of significance.

The work described in this paper leaves open the possibility of exploring further the application of correspondence analysis of aggregate data. Certainly investigating the application of correspondence analysis for $G$ (> 1) 2 × 2 tables may provide further insight into solving the ecological problem; refer to King (1997) and King et al.
(2004) for information on this problem. Future research on the characteristics of the aggregate association index, including a simulation study on its behaviour, should also be carried out to identify situations where it will work and where it will be less reliable.

References

Aitkin, M., Hinde, J.P., 1984. Comments to “Tests of significance for 2 × 2 contingency tables”. J. Roy. Statist. Soc. Ser. A 147, 453–454.
Barnard, G.A., 1984. Comments to “Tests of significance for 2 × 2 contingency tables”. J. Roy. Statist. Soc. Ser. A 147, 449–450.
Beh, E.J., 1997. Simple correspondence analysis of ordinal cross-classifications using orthogonal polynomials. Biometrical J. 39, 589–613.
Beh, E.J., 2004. Simple correspondence analysis: a bibliographic review. Internat. Statist. Rev. 72, 257–284.
Beh, E.J., Steel, D.G., Booth, J.G., 2002. What useful information is in the marginal frequencies of a 2 × 2 table? Preprint 4/2002, School of Mathematics and Applied Statistics, University of Wollongong, Australia.
Conover, W.J., 1974. Some reasons for not using the Yates continuity correction on 2 × 2 contingency tables (with discussions). J. Amer. Statist. Assoc. 69, 374–382.
Duncan, O.D., Davis, B., 1953. An alternative to ecological correlation. Amer. Sociol. Rev. 18, 665–666.
Fisher, R.A., 1935. The logic of inductive inference (with discussions). J. Roy. Statist. Soc. Ser. A 98, 39–82.
Goodman, L.A., 1959. Some alternatives to ecological correlation. Amer. J. Sociol. 64, 610–625.
Goodman, L.A., 1996. A single general method for the analysis of cross-classified data: reconciliation and synthesis of some methods of Pearson, Yule and Fisher, and also some methods of correspondence analysis and association analysis. J. Amer. Statist. Assoc. 91, 408–428.
Greenacre, M.J., 1984. The Theory and Application of Correspondence Analysis. Academic Press, London.
King, G., 1997. A Solution to the Ecological Inference Problem. Princeton University Press, Princeton, USA.
King, G., Rosen, O., Tanner, M.A., 2004. Ecological Inference: New Methodological Strategies. Princeton University Press, Princeton, USA.
Lancaster, H.O., 1969. The Chi-Squared Distribution. Wiley, New York, USA.
Lange, J., 1930. Crime and Destiny. Allen and Unwin, London.
Lebart, L., Morineau, A., Warwick, K.M., 1984. Multivariate Descriptive Analysis. Wiley, New York, USA.
Plackett, R.L., 1977. The marginal totals of a 2 × 2 table. Biometrika 64, 37–42.
Rayner, J.C.W., Best, D.J., 1996. Smooth extensions of Pearson’s product moment correlation and Spearman’s Rho. Statist. Probab. Lett. 30, 171–177.
Rayner, J.C.W., Best, D.J., 2001. A Contingency Table Approach to Nonparametric Testing. Chapman & Hall, London.
Yates, F., 1984. Tests of significance for 2 × 2 contingency tables (with discussions). J. Roy. Statist. Soc. Ser. A 147, 426–463.