Accepted Manuscript Correspondence analysis and the Freeman-Tukey statistic: A study of archaeological data Eric J. Beh, Rosaria Lombardo, Gianmarco Alberti
PII: DOI: Reference:
S0167-9473(18)30158-0 https://doi.org/10.1016/j.csda.2018.06.012 COMSTA 6636
To appear in:
Computational Statistics and Data Analysis
Received date : 1 May 2017 Revised date : 14 April 2018 Accepted date : 25 June 2018 Please cite this article as: Beh E.J., Lombardo R., Alberti G., Correspondence analysis and the Freeman-Tukey statistic: A study of archaeological data. Computational Statistics and Data Analysis (2018), https://doi.org/10.1016/j.csda.2018.06.012 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
*Manuscript Click here to view linked References
Correspondence Analysis and the Freeman-Tukey Statistic: A Study of Archaeological Data Eric J. Beh School of Mathematical & Physical Sciences, University of Newcastle, Australia
Rosaria Lombardo Department of Economics, University of Campania, Italy
Gianmarco Alberti Department of Classics and Archaeology, University of Malta, Malta
Abstract Traditionally, simple correspondence analysis is performed by decomposing a matrix of standardised residuals using singular value decomposition where the sum-of-squares of these residuals gives Pearson’s chi-squared statistic. Such residuals, which are treated as being asymptotically normally distributed, arise by assuming that the cell frequencies are Poisson random variables so that their mean and variance are the same. However, studies in the past reveal that this is not the case and that the cell frequencies are prone to overdispersion. There are a growing number of remedies that have been proposed in the statistics, and allied, literature. One such remedy, and the focus of this paper, is to stabilise the variance using the Freeman-Tukey transformation. Therefore, the properties that stem from performing correspondence analysis will be examined by decomposing the Freeman-Tukey residuals of a two-way contingency table. The application of this strategy shall be made by studing one large, and sparse, set of archaeological data. Keywords: Correspondence Analysis, Pearson’s Residuals, Freeman-Tukey Statistic, Singular Value Decomposition, Hellinger Distance, Archaeological Data.
Email address: e.mail:
[email protected] (Eric J. Beh)
Preprint submitted to Computational Statistics and Data Analysis
April 12, 2018
1
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
1. Introduction Correspondence analysis has long been used as a tool to visualise the association structure between the categorical variables of a contingency table. As such, it has gained popularity in many areas of research, especially in those disciplines where data take on the form of contingency tables as a means of summarising the information that has been collected. For example, while adopting other dimension-reduction techniques like multidimensional scaling (Torgerson, 1958; Borg and Groenen, 2005), archaeologists have increasingly used correspondence analysis since the seminal study of B´olviken and colleagues in early 1980’s (B´olviken et al., 1982). One may also consider Kendall (1971) for an earlier discussion of the role of correspondence analysis issues in archaeology. The classical approach to performing this analysis has rested on the partition of Pearson’s chi-squared statistic into the sum-of-squares of the singular values of Pearson’s residuals. Pearson’s statistic is a special case of the family of statistics derived from the series of power-divergence measures introduced by Cressie and Read (1984). Other members of this family include the log-likelihood ratio statistic (Wilks, 1938), the Freeman-Tukey statistic (Freeman and Tukey, 1950), the modified chi-squared statistic (Neyman, 1949) and the modified log-likelihood ratio statistic (Kullback, 1959). Like Pearson’s statistic, all the members of the Cressie-Read family are asymptotically chi-squared random variables. Therefore, one may also consider other members of this family for the purposes of conducting a correspondence analysis. Of those members just described, the one that lends itself nicely to performing correspondence analysis is the Freeman-Tukey statistic. This is because of the added complexities associated with the logarithm function which we shall not consider here and the utility, and widespread use of the Freeman-Tukey statistic. Thus, it is this statistic that we shall focus our attention on in this paper. To discuss the role of the Freeman-Tukey statistic in correspondence analysis, this paper is divided into six further sections. Section 2 provides a brief overview of the classical approach to correspondence analysis. Such an approach rests on the singular value decomposition of a cell frequencies Pearson’s residual. These residuals are assumed to be random variables from a Poisson distribution while their sum-of-squares yields Pearson’s chi-squared statistic. However, using Pearson’s residuals implies that the mean and variance of the cell frequencies are equivalent. Generally, this may not be the case and there is evidence in the statistical literature that shows that there can exist 2
38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
overdispersion in a contingency table; a brief overview of this issue is made in Section 3. To stablise the variance, one approach is to consider the FreemanTukey transformation which leads to the Freeman-Tukey statistic; see Section 4. Section 5 provides a description of the key features of simple correspondence analysis by decomposing the Freeman-Tukey residuals of a two-way contingency table using singular value decomposition. Such features include the derivation of principal coordinates and the distance of these coordinates from the origin and from intra-variable points. We show that these distances are not chi-squared distances as is the case for classical correspondence analysis, but are Hellinger distances. Such distances have been explored before in the context of correspondence analysis; see, for example, Domenges and Volle (1979), Rao (1995) and Cuadras and Cuadras (2006). However their link with the Freeman-Tukey statistic is less well known. Modelling categorical data by considering the association model and reconstitution formulae derived from this analysis is also discussed in Section 5. Section 6 provides a practical demonstration of the correspondence analysis procedure using the Freeman-Tukey statistic as an alternative to Pearson’s statistic. This approach is illustrated by means of an archaeological example. While sparse data may prove more ubiquitous in archaeology than one could expect at first (and is an issue that deserves a specific study), the example takes into account one such dataset which summarises the distribution of 488 artifacts across 20 prehistoric huts unearthed at Capo Milazzese (Sicily). The data have been compiled and discussed at length by Alberti (2017). Some concluding remarks are made in Section 7. We must point out that Beh and Lombardo (2014, Section 9.3) provides a very brief, skeletal, outline of the correspondence approach described in this paper. Such an outline is confined only to the definition of the FreemanTukey statistic, the definition of principal coordinates and their link to Hellinger distances. It gave no description of the role that the Freeman-Tukey statistic plays in defining association, or correspondence, models. Nor did Beh and Lombardo (2014) provide an extensive description of the application of the technique to large, and sparse, archaeological data. Instead, the applicability of the approach was confined to a small data set that has been commonly used in the correspondence analysis literature. Beh and Lombardo (2014, Section 9.5.2) does provide R code to perform the analysis, and all computations and plots are obtained using adaptations of this code.
3
74
75 76 77 78 79 80 81 82 83 84
2. The Classical Approach to Correspondence Analysis 2.1. Notation Consider an I × J two-way contingency table, N, where the (i, j)th cell entry is denoted by nij for i = 1, 2, . . . , I and j = 1, 2, . . . , J. Let the grand total of N be n and let the matrix of relative frequencies be P so that P P the (i, j)th cell entry is pij = nij /n where Ii=1 Jj=1 pij = 1. Define the ith PJ row marginal proportion by pi• = j=1 pij . Similarly, define the jth column P marginal proportion as p•j = Ii=1 pij . To determine whether there exists a statistically significant association between the row and column variables, one may calculate Pearson’s chisquared statistic J I X X (nij − ni• n•j /n)2 (pij − pi• p•j )2 X = =n . ni• n•j /n pi• p•j i=1 j=1 i=1 j=1 2
85 86 87
88 89 90 91 92 93 94 95 96 97 98 99
J I X X
Given a level of significance, α, the statistical significance of the association may be tested by comparing X 2 with the 1 − α percentile of the chi-squared distribution with (I − 1) (J − 1) degrees of freedom. 2.2. Decomposing Pearson’s Residual When it is found that a statistically significant association exists between two categorical variables, a visual inspection of this association can be made using correspondence analysis. This technique has a long and interesting history, especially in Europe - see, for example, Van Meter et al. (1994), Beh (2004), Holmes (2008), Armatte (2008), De Falguerolles (2008), Lebart (2008) and Beh and Lombardo (2012, 2014) for some historical overviews. Two of the earliest, and pioneering contributions (especially for English speaking researchers), are the books of Greenacre (1984) and Lebart et al. (1984). Central to correspondence analysis is the partition of Pearson’s chi-squared statistic. This is achieved by applying the singular value decomposition (SVD) to the Pearson’s residuals, rij , such that rij =
100 101
M X pij − pi• p•j = aim λm bjm , √ pi• p•j m=1
where M = min (I, J)−1. Here aim is the mth element of the singular vector associated with the ith row category. Similarly, bjm is the mth element of 4
102 103
the singular vector associate with the jth column category. These elements have the property I X i=1
104 105
aim aim0 =
(
1 m = m0 0 m= 6 m0
)
J X
.
bjm bjm0 =
j=1
(
1 m = m0 0 m= 6 m0
)
.
(1)
The value λm is the mth singular value of the Pearson residuals and are typically arranged in descending order where λm =
I X J X
pij aim bjm .
(2)
i=1 j=1
106 107 108 109 110 111 112 113 114 115
116 117 118
2.3. Principal Coordinates The key feature from the correspondence analysis of a contingency table is the low (< M ) dimensional subspace that is obtained. Such a subspace, commonly referred to as a correspondence plot, often consists of two (or sometimes three) dimensions for ease of visualisation and provides a joint display of the association between the variables. When the plot consists of all M dimensions, it is referred to as the optimal correspondence plot. Such a plot also provides a visual inspection to help discriminate among the categories. To construct such a plot, the ith row and jth column categories can be simultaneously depicted along the mth axis using the principal coordinates bjm aim and gjm = √ λm , (3) fim = √ λm pi• p•j respectively. The Pearson chi-squared statistic of N can be expressed in terms of the weighted sum-of-squares of the coordinates and the sum of squares of the singular values such that X2 = n
M X
m=1 119 120 121 122 123 124 125 126
λ2m = n
M X I X
m=1 i=1
2 pi• fim =n
M X J X
m=1 j=1
2 p•j gjm .
In the correspondence analysis literature, X 2 /n is referred to as the total inertia and these results show that the total association contained in the contingency table can be partitioned so that each axis of the correspondence plot provides a certain percentage of the association. There are a number of coordinates systems that may be considered which construct a variety of different types of plots. For example, one may consider the biplot and its numerous varieties; see, for example, Greenacre (2010), Gower et al. (2011) and Beh and Lombardo (2014). 5
127 128 129
2.4. Correspondence Models From this classical approach, the (i, j)th cell proportion can be exactly reconstituted from the association model √
pij(M ) = pi• p•j + 130 131 132 133
135 136 137 138 139 140
M X
aim λm bjm ,
m=1
where the inclusion of the subscript (M ) indicates that the model accounts for the association reflected in an M-dimensional correspondence plot. Alternatively, this model, when expressed in terms of the row and column principal coordinates, leads to the reconstitution formula pij(M ) = pi• p•j
134
pi• p•j
M X
!
fim gjm . 1+ λm m=1
(4)
Therefore, the origin (when all the principal coordinates are zero) coincides with complete independence between the variables. The advantage of considering these models is that they may be used as a means of determining the goodness-of-fit of a correspondence plot in reconstituting the cell frequencies. For example, since a two-dimensional plot describing the association between the variables is commonly constructed, the quality of the plot may be assessed by considering any deviation of pij from pij(2) = pi• p•j
!
fi1 gj1 fi2 gj2 + . 1+ λ1 λ2
(5)
143
As we shall demonstrate in the example of Section 6, it is possible that such models will lead to expected, or reconstituted, cell frequencies, npij(2) , that are negative.
144
3. The Problem of Overdispersion
141 142
145 146 147 148 149 150 151 152
Despite the popularity of Pearson’s chi-squared statistic, there is a potentially troublesome issue that confronts the correspondence analyst. The structure of the Pearson residuals, rij , is based on the assumption that the cell frequencies, nij , are from a Poisson distribution with expectation npi• p•j . Therefore the mean and variance of nij are treated as being identical. However, in some circumstances this equality will not hold. In√fact, Agresti (2002, p. 81) points out that the asymptotic variance of nrij “is less than 1.0, averaging [(I − 1) (J − 1)]/(number of cells)”. Haberman (1973) 6
153 154 155 156 157 158 159 160 161 162 163 164 165 166
also points out that, under independence, the maximum likelihood √ estimate (using the Poisson distribution, for example) of the variance of nrij is (1 − pi• ) (1 − p•j ) < 1.0. Both results imply that the variance of the residuals exceeds its expectation and so the residuals, and hence nij , exhibit overdispersion when it is treated as a Poisson random variable. Therefore, performing correspondence analysis using rij may, in some cases, not be the most appropriate means of studying the association structure between the variables. By adopting the strategy used by Efron (1992), there is an easy way to visually determine whether there is overdispersion in a contingency table; this strategy will be demonstrated in Section 6. Efron (1992) focused on overdispersion for regression problems although the strategy is also applicable for studying the association between categorical variables. One strategy that may be considered to overcome this problem is to consider instead the adjusted Pearson residuals proposed by Haberman (1973) ∗ rij =q
pij − pi• p•j
pi• p•j (1 − pi• ) (1 − p•j )
.
177
∗ One advantage of considering the analysis of rij is that, unlike an analysis √ ∗ of rij , nrij is standard normally distributed. Beh (2012) showed how correspondence analysis can be performed through the decomposition of these PI PJ ∗ 2 adjusted residuals. However, the quantity n i=1 j=1 rij is not a chisquared random variable, although Golden (2000, p. 38) presents a means of calculating the p-values using the computational procedure outlined in Sheil and O’Muircheartaigh (1977). Another strategy, and one that we shall outline below, is to consider the Freeman-Tukey statistic as an alternative to Pearson’s statistic. The following sections describe the correspondence analysis of N using the FreemanTukey statistic.
178
4. The Freeman-Tukey Statistic
167 168 169 170 171 172 173 174 175 176
179 180 181 182
Since overdispersion arises because the variance of nij is larger than its expectation, a popular method used to stablise the variance of nij is to consider the transformation proposed by Freeman and Tukey (1950) who, adapting their notation for a two-way contingency table, showed that q √ nij + nij + 1
7
(6)
183 184
√ has a mean and variance of 4npi• p•j + 1 and 1 respectively. Thus, Freeman and Tukey (1950) proposed T˜2 =
I X J X √
nij +
i=1 j=1
185 186
I X J X
i=1 j=1
188
nij + 1 −
√
pij +
s
1 pij + − n
I X J X √ i=1 j=1
190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208
4npi• p•j + 1
2
2
s
1 4pi• p•j + . n
For large sample sizes, T˜2 may be approximated by eliminating 1/n. Doing so gives the statistic T˘2 = 4n
189
q
as an alternative to Pearson’s statistic. Note that T˜2 may be alternatively, but equivalently, expressed as T˜2 = n
187
q
pij −
√
pi• p•j
2
,
and is referred to as the Freeman-Tukey chi-squared statistic. See, for example, Bishop, Fienberg & Holland (1975, p. 508). There has been a great deal of discussion given to comparing the behaviour of the Freeman-Tukey statistic with that of Pearson’s statistic (and other commonly used members of the Cressie-Read family). For small samples, Larntz (1978) and Lawal (1984) discussed that Pearson’s statistic is better at capturing the magnitude of a nominal chi-squared value than T˘2 . Bishop, Fienberg & Holland (1975, p. 513) note that, “in passing”, X 2 , T˘2 and the likelihood ratio statistic may be very different from each other in goodness-of-fit tests”. While the Freeman-Tukey statistic has rarely been considered for the purposes of visually exploring the association between categorical variables, it has been most commonly used in the context of goodness-of-fit testing. One may consider, for example the studies of Brooks et al. (2000) and Mazzetta, Brooks & Freeman (2007, p. 1012) who considered the statistic for its variance stabilising properties. Read (1993) provides an investigation of the goodness-of-fit properties of the Freeman-Tukey statistic and considers further variations of the statistic. When viewing correspondence analysis primarily as a means of visually summarising the association between categorical variables, the additional features of the Freeman-Tukey statistic outweighs the need to consider its 8
209 210 211 212 213 214
deficiencies from an inferential perspective. So, since virtually all applications of correspondence analysis consider the case where there exists a statistically significant association between the variables, the Freeman-Tukey statistic provides an ideal alternative to Pearson’s statistic. From the perspective of the correspondence analysis of a two-way contingency table, one may perform the analysis using the Freeman-Tukey residual r˜ij =
215
√
pij +
s
1 pij + − n
s
4pi• p•j +
1 , n
such that the total inertia of N can be expressed as I X J T˜2 X 2 = r˜ij . n i=1 j=1
216 217
√ Here n˜ rij is asymptotically standard normally distributed. Note that, for large n, r˜ij is approximately r˘ij = 2
218
so that
√
pij −
√
pi• p•j
J I X T˘2 X 2 r˘ij = n i=1 j=1 219 220 221 222 223
224
225 226 227 228
approximates T˜2 /n. Like X 2 , T˜2 and T˘2 are chi-squared distributed with (I − 1) (J − 1) degrees of freedom. Also, like X 2 , their magnitude is influenced by the sample size n and any variation in the cells contingency; note that for X 2 , Pearson’s classic measure of contingency is pij − pi• p•j , while for √ √ T˘2 considers the amended measure of contingency pij − pi• p•j . 5. Correspondence Analysis using the Freeman-Tukey Residual 5.1. The SVD of the Freeman-Tukey residuals Consider the statistic, T˜2 . A correspondence analysis of a contingency table can be undertaken by considering the SVD of the Freeman-Tukey residual r˜ij such that √ pij +
s
1 pij + − n
s
4pi• p•j + 9
M X 1 ˜ m˜bjm , = a ˜im λ n m=1
(7)
229
where ˜m = λ
I X J X
pij a ˜im˜bjm .
i=1 j=1
230 231 232 233
˜ m is the mth singular value of the I × J matrix of Freeman-Tukey Here λ residuals whose (i, j)th element is r˜ij and where these singular values are arranged in descending order. The properties of a ˜im and ˜bjm are analogous to aim and bjm for classical correspondence analysis and are subject to I X i=1
234
a ˜im a ˜im0 =
(
1 m = m0 0 m= 6 m0
236 237 238 239 240 241 242 243
244 245 246 247 248
√
pij −
√
250
.
˜bjm˜bjm0 =
j=1
(
1 m = m0 0 m= 6 m0
)
.
(8)
pi• p•j =
M X
˘ m˘bjm . a ˘im λ
(9)
m=1
For many practical situations where n is large, choosing between (7) or (9) will have a negligible impact on the numerical summaries of correspondence analysis. Nor will there be any considerable differences in the configuration of points when comparing their correspondence plots. For relatively, small samples sizes, or where sparse data are analysed (such as the data we shall be considering in Section 6), the configuration of points obtained from (9) can be very different to the configuration using (7). As we shall see in Section 5.3, considering (9) does have a big impact on the interpretation of the distance between the points. 5.2. Principal Coordinates A graphical interpretation of the association between the categories of the row and column variables may be made when considering the SVD of the Freeman-Tukey residual r˜ij ; see (7). Such a plot may be obtained using the following row and column principal coordinates a ˜im ˜ f˜im = √ λ m pi•
249
J X
For cases where n is large, one may consider the SVD of r˘ij such that 2
235
)
and
˜bjm ˜m, g˜jm = √ λ p•j
(10)
which graphically summarises the ith row and jth column category, respectively, along the mth dimension of the plot. Therefore, by considering (8), the 10
251 252
variation between the row coordinates (and column coordinates) reflects the variation in the contingency table such that I X i=1
253 254
pi• f˜im f˜im0 =
(
˜ 2 m = m0 λ m 0 m 6= m0
)
J X
j=1
p•j g˜jm g˜jm0 =
(
˜ 2 m = m0 λ m 0 m 6= m0
256 257 258 259 260
.
(11) When considering r˜ij , the total inertia can be expressed as the sum-ofsquares of the singular values such that M X T˜2 ˜2 . = λ m n m=1
255
)
(12)
Therefore, just like for classical correspondence analysis, the total variation in the contingency table (measured using the total inertia quantity) can be partitioned into M components, or principal inertias where the mth principal ˜ 2 . Hence the percentage of the total variation that the mth axis of inertia is λ m ˜ 2 /T˜2 . the correspondence plot summarises can be calculated to be 100 × nλ m Note that, from (11) and (12), I J M X M X X X T˜2 2 2 pi• f˜im p•j g˜jm = = . n m=1 i=1 m=1 j=1
261 262 263 264 265 266 267 268
Therefore, when considering the Freeman-Tukey residuals, r˜ij , principal coordinates that lie close to the origin indicate that the points category does not contribute as much to the association structure between the variables as other points in the configuration. Similarly, points lying at a distance from the origin make are stronger contribution to the association than other categories in the table. We may also consider the SVD of the residual r˘ij which yields approximations to f˜im and g˜jm a ˘im ˘ f˘im = √ λ m pi•
269 270 271 272
and
˘bjm ˘m, g˘jm = √ λ p•j
(13)
respectively. For large sample sizes, the configuration of points obtained using (10) and (13) will display negligible differences and so either correspondence plot may be considered. However, (13) provides more easily derived results for modelling the association between the variables and measuring, 11
273 274 275 276
277 278 279 280 281 282 283
and interpreting, the distance between two points in a correspondence plot. More will be said on the derivation and interpretation of these distances in Section 5.3 while a discussion on modelling the association will be made in Section 5.4. 5.3. Profile Distances Consider for the moment the squared Euclidean distance between two column coordinates in a correspondence plot when performing correspondence analysis using Pearson’s statistic. The jth column profile is defined by {p1j /p•j , . . . , pIj /p•j } and the j 0 th column profile is {p1j 0 /p•j 0 , . . . , pIj 0 /p•j 0 } and they can be graphically depicted in a RI space. An analogous expression of Pearson’s statistic can be obtained using the centred column profiles: 2
X =n
I X J X p•j i=1 j=1
284 285 286 287 288 289
291 292
I X
1 (j, j) = i=1 pi•
pij pij 0 − p•j p•j 0
I X J X
i=1 j=1
294 295 296 297 298
.
!2
=
M X
m=1
(gjm − gj 0 m )2 .
In the case of the Freeman-Tukey residual, we shall confine our attention to the classic Freeman-Tukey statistic T˘2 . It may be alternatively, and equivalently, expressed as T˘ = 4n
293
pi•
!2
In this situation the weight system for the row categories that exist in the J RI space is the matrix D−1 I , while the weight system in R for the column categories is given by the matrix DJ (Lombardo et al., 2016). In an optimal correspondence plot, the squared Euclidean distance between two column principal coordinates is equivalent to the chi-squared distance between their profiles. That is d2J
290
pij − pi• p•j
p•j
s
pij √ − pi• p•j
!2
,
and is the weighted sum-of-squares of the centred square-root of the column profiles. One may note that the summation is akin to the numerator of the Goodmn-Kruskal tau index (Goodman and Kruskal, 1954) used for the nonsymmetrical correspondence analysis of a contingency table; see D’Ambra and Lauro (1989), Kroonenberg and Lombardo (1999) and Beh and Lombardo (2014, Chapter 5). Their difference lies in the presence of the square 12
299 300 301 302 303 304 305
root transformation and of the constant term of 4n. When considering the re-expression of T˘2 in terms of the centred square-root of the column profiles, the weight system for the row categories that exist in the RI space is the identity matrix, while the weight system in RJ for the column categories is given by the matrix DJ . Under such a framework, one may consider the squared Euclidean distance between the square-root of the jth and j 0 th column categories, d˘2J (j, j 0 ) = φ
s
I X i=1
306 307 308 309 310 311 312 313 314 315 316
0
(j, j ) = 4
I X i=1
= 4
I X i=1
=
" s "
pij 0 p•j 0
!2
,
(14)
!
pij √ − pi• − p•j
s
pij 0 √ − pi• p•j 0
!#2
M M ˘ m˘bjm 1 X ˘ m˘bj 0 m 1 X a ˘im λ a ˘im λ − √ √ 2 m=1 p•j 2 m=1 p•j 0
" M I X X i=1
318
s
and is the squared Hellinger distance between the two column profiles. In this case, φ is called the dominating measure and for the purposes of our discussion, an appropriate choice is φ = 4. An overview of the Hellinger distance was given by Pollard (2002, p. 61). Gibbs and Su (2002) provide a comprehensive review of various distance measures and their application in statistics, including the Hellinger distance. Discussions of the properties concerned with the minimum Hellinger distance in the context of a variety of statistical situations have been made by, for example, Beran (1977), Simpson (1987) and Lindsay (1994). By considering the squared Hellinger distance, (14), and the SVD of r˘ij , 2 ˘ dJ (j, j 0 ) can be equivalently expressed as
d˘2J
317
pij − p•j
˘m a ˘im λ
m=1
˘bj 0 m ˘bjm −√ √ p•j p•j 0
!#2
#2
.
By expanding this last line and using the orthogonality property of a ˘im , the squared Hellinger distance can be simplified to d˘2J (j, j 0 ) =
M X
m=1
˘ m˘bj 0 m ˘ m˘bjm λ λ − √ √ p•j p•j 0
13
!2
=
M X
m=1
(˘ gjm − g˘j 0 m )2 .
319 320 321 322 323
Note that this is just the squared Euclidean distance between the principal coordinates of the jth and j 0 th column coordinates in the optimal correspondence plot when partitioning the Freeman-Tukey statistic, T˘2 . Similarly, the distance between the ith and i0 th row points in this optimal correspondence plot is d˘2I
0
(i, i ) = 4
J X
j=1 324 325 326 327 328 329 330 331 332 333 334
335 336 337 338 339 340 341 342 343 344
s
pi0 j pi0 •
!2
=
M X
m=1
f˘im − f˘i0 m
2
.
5.4. Modelling Association Using Freeman-Tukey Residuals For each cell of a two-way contingency table, the departure from independence between the categorical variables can be made by determining how different the Freeman-Tukey residual is from zero. Alternatively, this departure can be determined by modelling the association. This can be done in a variety of ways; see, for example, Agresti (2002) and Wong (2010). Here we shall consider the reconstitution of the (i, j)th cell proportion using the Freeman-Tukey residuals described above. Suppose we consider the case where n is considered large, so that r˘ij is considered. In this case, the (i, j)th cell proportion may be reconstituted using the saturated model M 1 X √ ˘ m˘bjm pi• p•j + a ˘im λ 2 m=1
= pi• p•j
346
s
The advantages of considering the Hellinger distance are the same as those when considering the distance between profiles using Pearson’s statistic. For example, using the Hellinger distance, d˘2J (j, j 0 ) satisfies the property of distribution equivalence; see, for example, Lebart, Morineau and Warwick (1984, p. 35) for a definition of this property. Also, the Hellinger distance is only dependent on the (square-root) of the two profiles and is independent of the sample size n. In his discussion of issues concerning correspondence analysis, Rao (1995) advocated the use of Hellinger distances while Cuadras and Cuadras (2006) explored further their role in the visualisation the association between categorical variables. We shall therefore refer the reader to these contributions for more details.
p˘ij(M ) =
345
pij − pi•
!2
M X 1 ˘ m˘bjm 1+ √ a ˘im λ 2 pi• p•j m=1
!2
.
(15)
˘m = We can immediately see here that under complete independence (when λ 0 for m = 1, 2, . . . , M ) pij = pi• p•j which is to be expected. What is also 14
347 348 349 350 351 352 353 354
clear by considering (15) is that, any deviation from complete independence will always yield a positive (i, j)th cell proportion. This property is not guaranteed when modelling, or performing a correspondence analysis, using Pearson’s residuals. By considering coordinates, defined by (13), √ √ the row and column principal ˘ ˘ ˘ ˘ pi• fim /λm and bjm = p•j g˘jm /λm . Therefore, substithen a ˘im = tuting these into (15) and simplifying, yields the saturated reconstitution formula p˘ij(M ) = pi• p•j
355 356 357
M ˘ 1 X fim g˘jm 1+ ˘m 2 m=1 λ
!2
.
(16)
From these models, the cell frequencies npij may be reconstituted using only the information contained in the first two, say, dimensions of the correspondence plot. In this case, the reconstituted cell values may be calculated by p˘ij(2) = pi• p•j
f˘i1 g˘j1 f˘i2 g˘j2 + 1+ ˘1 ˘2 2λ 2λ
!2
.
(17)
360
Like (15), using (16) or (17) guarantees that any reconstituted cell proportion will be positive. We shall compare and contrast the reconstitution of the cell frequencies by comparing npij(2) with n˘ pij(2) in Section 6.
361
6. Analysis of Sicily Data, Italy
358 359
362 363 364 365 366 367 368 369 370 371 372 373 374
6.1. Brief Overview of the Data Our study in this section is based on data, summarised in Table 1, that were compiled by Bernad`o Brea and Cavalier (1968), and analysed extensively by Alberti (2017), and come from an archaeological site at Capo Milazzese, Sicily. Table 1, which is of size 20 × 13, cross-classifies 374 artifacts found in 13 huts at the site. This table provides a description of the type of object found (where its row number is preceeded by “R”) and each of the sites/huts (whose number is preceeded with a “C”) where the object was found. Alberti (2017) gave a much larger table of size 31×19 (see his Table 2) which contains row and column categories that were treated as supplementary categories on the grounds of archaeological considerations. Also, the analysis of his table was undertaken by first aggregating a carefully selected set of row categories. The labels in Table 1 were selected to match the labels 15
16
Object C01 A-dinner (drink./eat.) 0 A-dinner (pour.) 1 Ae-cook 0 Ae-dinner (drink./eat.) 0 Ae-storage 0 A-storage 0 Cooking 2 Dinner (drink./eat.-c.ware) 1 Dinner (eat-c.ware) 0 Dinner (drink./eat.-f.ware) 1 Dinner (pour.-c.ware) 0 Dinner (pour.-f.ware) 1 Dinner (supporting) 0 Dinner (processing) 1 Processing 0 Spinning 0 Storage (long-term) 0 Storage (short-term) 2 Storage (other) 5 Working 0
C02 1 0 0 0 1 1 6 3 0 3 0 3 1 0 6 0 2 2 5 6
C03 2 0 0 0 1 2 3 0 0 1 0 4 4 0 3 0 2 1 2 8
C04 1 0 0 0 0 0 0 0 0 1 0 3 2 2 2 0 1 2 2 0
C05 1 0 0 0 0 1 1 0 0 1 0 2 2 1 0 0 0 1 1 2
C06 0 0 0 0 0 0 0 0 0 2 0 2 1 0 0 0 1 1 1 6
Huts C08 C09 1 1 0 1 0 0 0 0 0 1 0 0 3 5 0 0 0 2 2 5 1 0 2 4 1 0 0 2 2 15 0 2 2 2 3 4 2 3 3 10 C10 4 1 1 1 1 1 2 0 1 4 1 4 0 0 4 5 2 2 4 12
C11 C16 6 3 0 0 0 0 2 1 1 1 0 0 3 2 0 0 0 0 2 2 0 0 7 2 2 3 0 0 2 1 0 0 1 4 1 1 0 1 0 5
Table 1: Cross-classification 374 artifacts found in Capo Milazzese by type hut in which it was found (f.ware = fineware, c.ware = courseware, pour. =pouring, drink. = drinking, eat. = eating)
Code R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 R14 R15 R16 R17 R18 R19 R20
C18 C20 2 2 0 1 0 0 1 0 1 0 0 0 1 5 0 1 1 2 1 3 0 1 5 5 2 1 0 0 0 8 0 0 0 1 2 3 1 1 5 6
375 376
0.10 0.00 −0.10
Pearson Residual
0.20
377
Alberti (2017) used in his Table 4. The largeness of the contingency table and the relative sparsity of many objects in the sites makes it a particularly interesting data set to analyse.
0.5
1.0
1.5
2.0
2.5
3.0
Std Dev
Figure 1: Residual plot of Table 1
378 379 380 381 382 383 384
6.2. Preliminary Analysis Just as Alberti (2017) found, the Pearson chi-squared statistic for Table 1 is 265.32 and, with (20 − 1) (13 − 1) = 228 degrees of freedom and a pvalue of 0.0454, there is enough evidence to conclude a statistically significant association exists between the hut and type of pottery found there. Although, caution must be observed if we were to be convinced of such a conclusion at the 0.05 level of significance. It should be noted that Alberti (2017) reported 17
385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422
a p-value of 0.038 that was based on a permutation test and not on the theoretical chi-squared distribution. However, if the data are overdispersed, it is prudent to cross-check this finding with a method that accomodates for any overdispersion in the data. The presence of overdispersion can be easily identified by adopting the strategy outlined in Efron (1992) who suggested constructing a residual plot of the contingency table. If the residual plot consists of a configuration of points that is not randomly spread out within the plot (so that the points lie more on the left, or the right, side of the plot) then their is overdispersion in the data. In the context of analysing the association between the variables of a contingency table, the residual plot is constructed by plotting rij versus √ npi• p•j ; Figure 1 gives such a plot where the label along the horizontal √ axis, “Std Dev”, are the npi• p•j values and the label along the vertical axis, “Pearson’s Residual”, are the rij values. Figure 1 shows that most of the residuals, rij lie on the left side of the plot and so suggest the presence of overdispersion. Thus, using Pearson’s chi-squared statistic to assess the structure of the association between the huts and the objects found in these huts (Table 1) is not recommended. To overcome the overdispersion problem in the data, the variance will be stablised using the Freeman-Tukey transformation. Therefore, the Freeman-Tukey statistic, T˘2 , will be used as the measure of association, and to visualise the nature of this association, between the categories of Table 1. In doing so, we shall also compare, and contrast, the correspondence analysis of Table 1 using the classical approach and the Freeman-Tukey based approach outlined in the previous sections. Calculating the Freeman-Tukey statistic of the data gives a quite different measure of association; T˘2 = 363.043 with (20 − 1) (13 − 1) = 228 degrees of freedom, and a p-value of < 0.0001. There is a strong evidence to conclude a statistically significant association exists between the hut and type of pottery. Such a definitive conclusion was not observed using Pearson’s statistic. The difference between the Freeman-Tukey statistic, T˘2 , and the Pearson chi-squared statistic is due to the many very small cell frequencies which contributes to the overdispersion in Table 1. An inspection of the √ n˘ rij values, which are summarised in Table 2, reveals that there are some cells that exhibit large positive, or negative, deviations from what is expected under independence between the variables. Those values that appear in bold are those that are statistically significant at the 5% level of significance. By looking at these residuals, it clearly shows that the objects labelled R20 (Working), R15 (Processing) and R7 (Cooking) have larger residuals than 18
19
Code R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 R14 R15 R16 R17 R18 R19 R20
Object A-dinner (drink./eat.) A-dinner (pour.) Ae-cook Ae-dinner (drink./eat.) Ae-storage A-storage Cooking Dinner (drink./eat.-c.ware) Dinner (eat-c.ware) Dinner (drink./eat.-f.ware) Dinner (pour.-c.ware) Dinner (pour.-f.ware) Dinner (supporting) Dinner (processing) Processing Spinning Storage (long-term) Storage (short-term) Storage (other) Working % of T˘ 2
C01 -1.90 1.23 -0.39 -0.87 -1.02 -0.87 0.61 1.13 -0.95 -0.05 -0.67 -0.57 -1.69 1.05 -2.54 -1.02 -1.64 0.89 2.42 -3.07 11.35%
C03 -0.08 -1.19 -0.59 -1.33 0.43 1.50 0.05 -1.33 -1.46 -1.14 -1.03 0.06 1.41 -1.46 -0.43 -1.57 0.31 -0.97 -0.32 0.94 5.77%
C04 -0.03 -0.83 -0.41 -0.92 -1.09 -0.92 -2.38 -0.92 -1.01 -0.19 -0.72 0.72 1.03 1.82 0.12 -1.09 0.24 0.76 0.64 -3.28 8.16%
C05 0.17 -0.75 -0.37 -0.83 -0.99 1.17 -0.14 -0.83 -0.91 0.03 -0.65 0.36 1.20 1.09 -2.45 -0.99 -1.58 0.14 0.03 -0.13 4.94%
C06 -1.90 -0.77 -0.39 -0.87 -1.02 -0.87 -2.22 -0.87 -0.95 0.78 -0.67 0.26 0.31 -0.95 -2.54 -1.02 0.36 0.07 -0.05 1.83 7.30%
Huts C08 -0.38 -0.97 -0.49 -1.08 -1.28 -1.08 0.68 -1.08 -1.19 0.26 1.16 -0.39 -0.11 -1.19 -0.35 -1.28 0.77 1.04 0.26 -0.39 4.11% C09 -1.83 0.44 -0.78 -1.75 -0.07 -1.75 -0.01 -1.75 0.92 0.34 -1.35 -1.18 -3.40 0.92 2.63 0.76 -0.48 0.10 -0.67 0.13 10.46%
Table 2: Freeman-Tukey residuals for Table 1
C02 -1.20 -1.31 -0.65 -1.46 0.27 0.54 1.14 2.00 -1.60 0.00 -1.13 -0.87 -0.85 -1.60 0.61 -1.73 0.05 -0.44 1.01 -0.29 6.56%
C10 0.42 0.54 1.27 0.36 0.07 0.36 -1.37 -1.64 0.21 0.13 0.73 -0.85 -3.19 -1.79 -0.80 2.54 -0.27 -0.83 0.13 1.12 8.43%
C11 2.27 -1.07 -0.54 1.63 0.58 -1.20 0.38 -1.20 -1.32 -0.02 -0.93 1.73 0.49 -1.32 -0.70 -1.42 -0.28 -0.69 -2.84 -4.27 13.60%
C16 0.88 -1.05 -0.53 0.82 0.60 -1.18 -0.20 -1.18 -1.29 0.04 -0.91 -0.67 1.17 -1.29 -1.46 -1.40 1.76 -0.64 -0.79 0.29 5.57%
C18 0.45 -0.97 -0.49 0.92 0.72 -1.08 -0.79 -1.08 0.81 -0.57 -0.84 1.25 0.71 -1.19 -3.18 -1.28 -2.06 0.40 -0.57 0.62 7.61%
C20 -0.38 0.69 -0.65 -1.46 -1.73 -1.46 0.71 0.54 1.23 0.00 0.87 0.13 -0.85 -1.60 1.37 -1.73 -0.77 0.19 -1.46 -0.29 6.14%
% of T˘ 2 5.11% 3.18% 1.39% 4.83% 2.84% 4.60% 4.37% 5.64% 4.45% 0.68% 3.03% 2.49% 8.99% 6.60% 11.46% 7.40% 3.95% 1.48% 5.23% 12.27% 100.00%
423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444
445 446 447 448 449 450 451 452 453 454 455 456 457 458 459
any of the other objects found at the site, with each contributing to 12.27%, 11.46% and 4.37% of the magnitude of T˘2 , respectively. This suggests that there exists a stronger association than expected (if independence was observed between the rows and columns) between these objects and the huts in which they were found. However, their cell frequencies are much higher than other objects classified in Table 1 and so they are not considered extreme cases from a classical correspondence analysis perspective. Although, there are objects that are, overall, also dominant in their contribution to the TukeyFreeman statistic. For example, R13 (Dinner (supporting)), R16 (Spinning) and R14 (Dinner (processing)) contribute to 8.9%, 7.4% and 6.60% of T˘2 , respectively, and have relatively small cell frequencies when compared with those just mentioned. For these objects, it is not the magnitude of the largest residuals that impacts upon their contribution to the magnitude of the Freeman-Tukey statistic, but the consistently large (although not extreme) residuals summarised in Table 2. Similarly, the huts labelled C11 and C01 have the most extreme residuals of the huts exacavated on the site and are the most dominant of the columns that contribute to the Freeman-Tukey statistic accounting for 13.60% and 11.35%, respectively. Table 2 also shows that the hut labelled C09 contributes to 10.46% of T˘2 and yet only has two statistically significant Freeman-Tukey residuals. Therefore, these objects and huts do not behave in a way that is consistent with the typical object or hut on the archaeological site. 6.3. A Visual Inspection of the Association A visual inspection of the association between the huts and the objects found within them can be made by performing correspondence analysis on Table 1. Alberti (2017) performed such an analysis in his detailed description of the data, although he used the classical approach which is based on the Pearson chi-squared statistic, X 2 ; see, for example, Greenacre (1984) and Beh and Lombardo (2014) for a comprehensive discussion of the various practical and technical issues of correspondence analysis. The two-dimensional correspondence plot of the contingency table, using the classical approach, is given on the left of Figure 2; note that the configuration of these points is identical to the plots given by Alberti (2017b, Figure 6) apart from the reflection about both axes. Similarly, the correspondence plot obtained using the Freeman-Tukey statistic T˘ is on the right of Figure 2 and is constructed using the principal coordinates defined by (13). Here we shall refer to the plot on the left of Figure 2 as the classic plot while the plot on 20
Freeman−Tukey Statistic
Pearson’s Statistic R3 R16
R3
R4 1 C10
R11 R20
R5 C06 0 R4
R9
C20 C09
C18 R17 R10 R1 R15 C03 C08 C16 C02 R12 R6 R7 C11 R13 R18 R19 C05
R2
R8
Principal Axis 2 (17.02%)
Principal Axis 2 (19.95%)
1
R5 R1
R13 C16 C03
C18 0
C06
C11 C05
R6
R16
R11
R17 R20 R9
C10
R12 R10 R7
C08 R18 C20 C02
R15 C09
R2
R19 R8
C04
C04
R14
−1
C01
−1
R14 C01 −1
0
1
−1
Principal Axis 1 (25.46%)
0
1
Principal Axis 1 (24.6%)
Figure 2: Comparison of correspondence plots using Pearson’s chi-squared statistic (left) and Freeman-Tukey statistic (right) for Table 1
460 461 462 463 464 465 466 467 468 469
the right shall be called the FT-plot. These two plots have been constructed so that the scale of the first and second principal axes in both plots are identical. By comparing these two configurations of points, it is clear that, aside from a few similarities, the two displays are different. This is because different measures of association have been used to quantify that association and the way in which the sparsity of the cell frequencies has been dealt with. Suppose we consider the classic plot of Figure 2. The close proximity of R15 (Processing) to the hut labelled C09 reflects the strong association between them, however their position is strongly influenced by the very large cell frequency that they share. Table 1 shows that the cell frequency intersecting 21
470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507
these categories is 15. Their position in the FT plot, using the partition of T˘2 , results in similarly (but NOT identically) positioned points. This feature is reflective of the statistically significant Freeman-Tukey residual of 2.63 for this pair of categories which is not the largest of the residuals. The cell frequency of 12 for the row/column pair of R20/C10 means that C10 and R20 (Working) share a very similar position in the classic and FT plot of Figure 2. Perhaps the most obvious difference between the two plots is that the row points and column points in the classic plot of Figure 2 are evenly scattered about the region. However, when the FT plot - obtained by considering the Freeman-Tukey statistic - gives what appears to be two rather distinct clusters (defined by the row points and the column points) with only a small overlap between them. This suggests that there is no obvious strong rowcolumn associations except for where the two clusters intersect. This may be because of the sparsity of the cell frequencies in Table 1, although (as we shall discuss in Section 6.4) it may be because more information about the nature of the association can be assessed by considering more than twodimensions. In the FT plot, we can see that the huts labeled C03, C11 and C18 (possible habitations) have a distribution of objects characterised by the relatively large cell frequency with R12 (Dinner (pour.-f.ware)), while the huts C02, C20 (possible utilitarian huts) and C09 have a distribution of objects dominated by R7 (cooking) and R15 (processing). However, it is the relative distribution of the objects, not the dominance (or lack thereof) of a cell frequency, which defines the association depicted in the correspondence plot using the Freeman-Tukey statistic. We can see that the points for some of these categories are in close proximity to the origin and so the strength of this association may actually be rather weak. From an archaeological point-of-view, the similarities shared by the plots in Figure 2 (as to the absolute and relative position of some points) are equally informative. This holds true for the four huts – C03, C16, C18 and C11 – classified as habitations by Alberti on analytical and anthropological grounds and whose relative position seems rather unchanged. Interestingly, the correspondence analysis based using the Freeman-Tukey statistic also shows that the position of hut C06 (possibly another habitation) is rather close to the small cluster of four huts just described. All-in-all, the similarities and the differences brought to the fore by the new approach do underline the importance of comparing traditional and new techniques so that one may better understand the varying nuances of the data being analysed. 22
508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539
540 541 542 543 544
6.4. Quality of the Correspondence Plot Another important issue is that the quality of both plots in Figure 2 is relatively poor; this led Alberti (2017b, Tables 4 and 5) to consider the first 5 dimensions in his interpretation of the data. The classical correspondence plot on the left of Figure 2 graphically depicts 45.41% of the association between the type of object and the hut in which it was found. Similarly, using the Freeman-Tukey statistic, the plot on the right of Figure 2 depicts a lower proportion of the association (41.62%). However, even though the two-dimensional plot obtained using the Freeman-Tukey statistic is of lower quality when compared with that obtained using the classic approach, its description of the association between the variables is more statistically meaningful. Indeed, the Freeman-Tukey statistic is 363.01 giving a total inertia of 0.971 with a p-value < 0.0001, while Pearson’s statistic is 265.3 (and a p-value of 0.045) indicating that the statistically significant association between the two variables is not as strong. This is due, in part, to the variance issues associated with Pearson’s statistic and to the sparsity of the data. Table 3 summarises the weight along all of the axes of the correspondence plot (P.Inertia, also commonly referred to as the axis’ principal inertia) obtained using the classic approach and through the Freeman-Tukey statistic. It also summarises the quality of the percentage contribution (%Inertia) that each axis makes to the total inertia (thereby reflecting the quality of presentation of each axis) and the cumulative percentage (C.%Inertia) of all of the axes (thereby reflecting the quality of a correspondence plot of any dimension). For both approaches, the quality of the correspondence plot remains consistent regardless of the number of dimensions used to construct the correspondence plots. For example, the correspondence plots in Figure 2 can be improved by adding a third dimension. Doing so, the third dimension of a classic correspondence plot accounts for a further 12.82% of the total inertia so that a three-dimensional plot visually describes 58.23% of the association between the huts and artifacts found at the site. When performing correspondence analysis using the Freeman-Tukey statistic the inclusion of a third dimension visualises a further 13.59% of association. 6.5. Modelling the Association We now turn our attention to the quality of the association models obtained by reconstituting the cell proportions of Table 1 using the information in the two-dimensional correspondence plots of Figure 2. Table 4 summarises the expected cell frequencies of Table 1 using the information contained in 23
Axis P. Inertia 1 0.181 2 0.142 3 0.091 4 0.083 5 0.075 6 0.043 7 0.037 8 0.028 9 0.012 10 0.007 11 0.007 12 0.003 Total 0.709
Pearson CA Freeman-Tukey CA %Inertia C.%Inertia P.Inertia %Inertia C.%Inertia 25.46 25.46 0.239 24.60 24.60 19.95 45.41 0.165 17.02 41.62 12.82 58.23 0.132 13.59 55.21 11.74 69.97 0.110 11.36 66.57 10.58 80.55 0.101 10.45 77.02 6.07 86.62 0.067 6.87 83.89 5.23 91.85 0.055 5.62 89.51 3.96 95.81 0.040 4.13 93.64 1.74 97.55 0.028 2.89 96.53 1.04 98.59 0.017 1.77 98.30 0.96 99.55 0.010 1.00 99.30 0.45 100.00 0.007 0.70 100.00 — — 0.971 — —
Table 3: Inertia values from a classical and Freeman-Tukey correspondence analysis of Table 1 545 546 547 548 549 550 551 552 553 554 555 556 557 558
the two-dimensional correspondence plots of Figure 2. The values not in parentheses are those calculated using n times the proportions estimated from the classic two-dimensional model of (5) while those values in parentheses are obtained in a similar manner from the model reflecting the Freeman-Tukey residuals; see (17). It is immediately clear by inspecting Table 4 that (5) gives 20, of the 20 × 12 = 260, negative expected cells frequencies, while all of the expected cell frequencies using (17) are non-negative. Since all cell frequencies should be non-negative, this result suggests that for the sparse data in Table 1, the Pearson based model of (5) is a poor choice of model while the model derived from the Freeman-Tukey residuals, (17), gives valid, non-negative, estimates for all of the cells. Table 4 also shows that, where both sets of expected cell frequencies are non-negative, both models provide similar estimates. An overall comparison of (5) and (17) can be made by calculating the following goodness-of-fit (GoF) statistics GoF =
I X J X
i=1 j=1
559 560
nij − npij(2) npij(2)
2
˘ = GoF
I X J X
i=1 j=1
nij − n˘ pij(2) n˘ pij(2)
2
.
Both models are assessed using the same Pearson chi-squared statistic, rather than the Freeman-Tukey statistic, so that an easy comparison of the 24
25
A-dinner (pour.)
Ae-cook
Ae-dinner (drink./eat.)
Ae-storage
A-storage
Cooking
Dinner (drink./eat.-c.ware)
Dinner (eat-c.ware)
Dinner (drink./eat.-f.ware)
Dinner (pour.-c.ware)
Dinner (pour.-f.ware)
Dinner (supporting)
Dinner (processing)
Processing
Spinning
Storage (long-term)
Storage (short-term)
Storage (other)
Working
R2
R3
R4
R5
R6
R7
R8
R9
R10
R11
R12
R13
R14
R15
R16
R17
R18
R19
R20
C01 -0.5 (0.0) 0.5 (0.2) -0.1 (0.1) -0.4 (0.2) -0.1 (0.0) 0.2 (0.0) 2.1 (1.5) 1.0 (0.6) 0.1 (0.0) 1.0 (1.2) 0.0 (0.0) 1.0 (1.3) 0.5 (0.1) 1.2 (1.1) 2.2 (1.3) -0.4 (0.1) 0.3 (0.1) 1.9 (2.2) 3.5 (2.8) 0.2 (0.0)
C02 1.1 (1.6) 0.8 (0.5) 0.0 (0.0) -0.1 (0.2) 0.4 (0.4) 0.5 (0.2) 4.3 (3.7) 1.3 (0.7) 0.6 (0.4) 3.1 (3.1) 0.3 (0.2) 3.8 (4.5) 1.5 (1.5) 1.4 (1.0) 5.5 (4.3) 0.3 (0.3) 1.5 (1.4) 3.5 (3.3) 5.1 (3.9) 5.2 (4.1)
C03 3.2 (3.4) 0.1 (0.1) 0.0 (0.0) 0.8 (0.9) 0.8 (0.8) 0.5 (0.5) 2.7 (2.5) 0.2 (0.2) 0.3 (0.2) 2.3 (2.2) 0.2 (0.1) 5.0 (5.1) 2.7 (3.6) 0.4 (0.2) 2.5 (1.9) 0.3 (0.2) 1.8 (1.7) 2.0 (1.9) 1.8 (1.6) 5.3 (5.0)
C04 1.0 (0.5) 0.2 (0.0) -0.1 (0.0) 0.2 (0.0) 0.2 (0.0) 0.3 (0.0) 1.8 (1.3) 0.6 (0.2) 0.0 (0.0) 1.0 (1.2) 0.0 (0.0) 2.4 (2.4) 1.6 (1.1) 0.8 (0.5) 1.1 (0.6) -0.5 (0.1) 0.7 (0.3) 1.6 (1.7) 2.3 (1.7) 0.9 (0.3)
C05 1.2 (1.0) 0.0 (0.0) -0.1 (0.0) 0.3 (0.2) 0.3 (0.1) 0.2 (0.1) 1.3 (1.0) 0.3 (0.1) 0.0 (0.0) 0.8 (0.9) 0.0 (0.0) 2.2 (2.2) 1.4 (1.6) 0.5 (0.1) 0.6 (0.4) -0.3 (0.0) 0.7 (0.5) 1.1 (1.0) 1.3 (0.8) 1.2 (0.8)
C06 1.1 (1.1) 0.1 (0.0) 0.1 (0.0) 0.3 (0.2) 0.3 (0.2) 0.2 (0.1) 1.1 (1.1) 0.1 (0.1) 0.2 (0.0) 1.0 (1.0) 0.1 (0.0) 1.8 (2.2) 0.8 (1.4) 0.1 (0.1) 1.5 (0.7) 0.3 (0.0) 0.7 (0.6) 0.8 (1.0) 0.7 (0.9) 2.6 (1.4)
Huts C08 1.2 (1.1) 0.3 (0.1) 0.0 (0.0) 0.2 (0.1) 0.4 (0.2) 0.3 (0.1) 2.1 (1.9) 0.4 (0.3) 0.3 (0.1) 1.6 (1.6) 0.2 (0.1) 2.5 (2.8) 1.1 (1.3) 0.5 (0.4) 2.6 (1.7) 0.3 (0.0) 1.0 (0.8) 1.6 (1.8) 2.0 (1.9) 3.4 (1.9) C09 1.2 (1.4) 1.2 (1.8) 0.3 (0.2) -0.1 (0.1) 0.7 (0.5) 0.5 (0.5) 5.3 (6.0) 1.1 (1.7) 1.5 (1.7) 4.8 (4.8) 0.8 (0.7) 3.8 (4.5) 0.1 (0.5) 0.8 (2.2) 10.0 (11.2) 2.4 (2.1) 2.2 (2.2) 3. 9 (4.8) 5.1 (6.9) 11.4 (8.9)
C10 2.9 (3.6) 0.6 (1.0) 0.5 (0.4) 0.7 (1.0) 1.1 (1.4) 0.5 (1.2) 3.4 (4.6) -0.1 (0.8) 1.6 (1.7) 4.2 (3.9) 0.8 (0.8) 4.1 (5.0) 0.2 (1.8) -0.6 (0.7) 8.2 (8.5) 3.2 (2.8) 2.4 (3.0) 2.1 (2.8) 1.2 (3.6) 13.1 (13.5)
C11 4.1 (3.2) -0.3 (0.0) 0.0 (0.0) 1.3 (0.7) 1.0 (0.5) 0.5 (0.2) 1.8 (1.8) -0.3 (0.0) 0.0 (0.0) 1.6 (1.7) 0.0 (0.0) 5.6 (5.4) 3.6 (4.9) 0.1 (0.1) 0.3 (0.4) -0.2 (0.1) 1.9 (1.2) 1.3 (1.7) 0.3 (1.1) 4.1 (1.9)
C16 3.2 (3.3) -0.1 (0.0) 0.0 (0.0) 0.9 (1.0) 0.8 (0.8) 0.5 (0.4) 1.9 (1.8) -0.1 (0.1) 0.2 (0.1) 1.7 (1.7) 0.1 (0.1) 4.5 (4.6) 2.6 (3.9) 0.1 (0.1) 1.3 (1.0) 0.2 (0.0) 1.6 (1.5) 1.3 (1.4) 0.7 (1.0) 4.4 (3.9)
Table 4: Expected cell frequencies of Table 1 using the chi-squared statistic and, in parentheses, the Freeman-Tukey statistic.
Object A-dinner (drink./eat.)
Code R1
C18 2.6 (2.7) -0.1 (0.0) 0.0 (0.0) 0.8 (0.7) 0.7 (0.6) 0.4 (0.3) 1.6 (1.5) -0.1 (0.0) 0.2 (0.0) 1.5 (1.4) 0.1 (0.0) 3.8 (4.0) 2.2 (3.5) 0.1 (0.1) 1.2 (0.7) 0.2 (0.0) 1.4 (1.1) 1.1 (1.2) 0.6 (0.8) 3.7 (2.7)
C20 1.6 (1.6) 0.7 (0.6) 0.2 (0.1) 0.2 (0.2) 0.6 (0.4) 0.4 (0.3) 3.7 (3.8) 0.7 (0.8) 0.9 (0.6) 3.2 (3.1) 0.4 (0.3) 3.6 (4.2) 0.9 (1.2) 0.6 (1.1) 5.9 (5.2) 1.2 (0.6) 1.7 (1.6) 2.8 (3.2) 3.4 (3.9) 7.3 (5.1)
568
estimated cell frequencies can be made. For Table 1, GoF = 244.13 while ˘ = 263.86, with df (20 − 3) (13 − 3) = 170. Therefore, while both give GoF a poor two-dimensional reconstitution of the sparse data in Table 1 the Freeman-Tukey model (5) provides slightly better estimates of the cell frequencies than the classic Pearson based model (17). However, what is important to underline is that, unlike the Pearson model of (5), the Freeman-Tukey model of (17) is guaranteed to produce estimates of the cell frequencies that are not negative!
569
7. Discussion
561 562 563 564 565 566 567
570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596
While correspondence analysis is most commonly performed by applying SVD to the standardised (Pearson) residuals, one must be cautious about the presence of overdispersion in a contingency table. Such a feature can be found in large, sparse data of the type described in Section’s 6. To stabilise the variance, we have shown here how correspondence analysis may be performed by applying SVD to the Freeman-Tukey residuals, r˘ij . In many of the practical and simulated applications we have undertaken on this approach, the Freeman-Tukey based correspondence plot gives a configuration of points that can be (but not always) very similar to a configuration obtained using the classical approach for very large sample sizes and tables that are small to moderate in size. However, as we have shown in our two examples, for tables that are large, and sparse, the classical approach and the FreemanTukey approaches can give very different configurations. Such differences still reflect the underlying association structure between the variables. However, due to the sparsity of the cell frequencies, the plot obtained from the Freeman-Tukey residual is less likely to yield row and column points that are in close proximity to one just because the cell frequency they share is large. Rather, since the Freeman-Tukey approach involves the square-root of the row and column profiles (instead of their observed profiles), large cell frequencies appear to play less of a role in determining the configuration of points. Also, since the underlying premise of the Freeman-Tukey residual involves stabilising the variability in the contingency table, such large cell frequencies are less likely to dominate such a configuration than if the classic Pearson residual were considered for such an analysis. Using the Freeman-Tukey statistic as a basis for performing correspondence analysis preserves many of the features of the analysis. The principal coordinates from the SVD of r˘ij are centred around zero and have a variance 26
617
equivalent to T˘2 /n. The squared Euclidean distance between two row (or column) points in a correspondence plot may no longer reflect the chi-squared distance of their profiles but they do reflect their Hellinger distance. Given the arguments made by Cuadras and Cuadras (2006) who supported the use of the Hellinger distances between points, using the Freeman-Tukey statistic as a basis for performing correspondence analysis is appealing. From a modelling perspective, given a low dimensional correspondence plot using the Freeman-Tukey statistic guarantees that negative expected cell frequencies will not be reconstituted; such a feature is not guaranteed from the reconstitution model derived from the classical correspondence analysis. Given that many areas of research yield large, and sparse, cross-classified data, the strategy adopted in this paper proves promising. Other strategies can be considered for performing correspondence analysis where the overdispersion exists in a contingency table; this includes, but is not limited to, the approached described by Beh (2012). The approach outlined in this paper also suggests that it is possible to explore the correspondence analysis of a contingency table using other members of the family of Cressie-Read power divergence statistics (Cressie and Read, 1984). Investigating how correspondence analysis can be performed, and generalised for members of this family (beyond Pearson’s statisic and the Freeman-Tukey statistic) will be left for future consideration.
618
References
619
Agresti, A., 2002. Categorical Data Analysis (2nd ed). Wiley, New York.
597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616
620 621 622 623
624 625 626
627 628
629 630
Alberti, G., 2017. New light on old data: Toward understanding settlement and social organization in Middle Bronze Age Aeolian Islands (Sicily) through quantitative and multivariate analysis. Journal of Archaeological Science: Reports 11, 310 – 329. Armatte, M., 2008. Histoire et pr´ehistoire de l’analyse des donn´ees par J. P. Benz´ecri: un cas de g´en´ealogie r´etrospective. Electronic Journal for History of Probability and Statistics 4(2), 24 pages. Beh, E. J., 2004. Simple correspondence analysis: A bibliographic review. International Statistical Review 72, 257 – 284. Beh, E. J., 2012. Simple correspondence analysis using adjusted residuals. Journal of Statistical Planning and Inference 142, 965 – 973. 27
631 632
633 634
635 636
637 638
639 640
641 642 643
644 645
646 647
648 649
650 651
652 653 654
655 656
657 658
Beh, E. J., Lombardo, R., 2012. A genealogy of correspondence analysis. Australian and New Zealand Journal of Statistics 54, 137 – 168. Beh, E. J., Lombardo, R., 2014. Correspondence Analysis: Theory, Practice and New Strategies. Wiley, Chichester. Beran, R., 1977. Minimum Hellinger distance estimates for parametric models. The Annals of Statistics 5, 445 – 463. Bernad`o Brea, L., Cavalier, M., 1968. Meligun`ıs Lip`ara IV. Stazioni preistoriche delle isole Panarea, Salina e Stromboli. Flaccovio, Palermo. Bishop, Y. M., Fienberg, S. E., Holland, P. W., 1975. Discrete Multivariate Analysis: Theory and Practice. MIT Press. B´olviken, E., Helskog, E., Helskog, K., Holm-Olsen, I. M., R., L. S., Bertelsen, 1982. Correspondence analysis: an alternative to principal components. World Archaeology 14, 41 – 60. Borg, I., Groenen, P. J. F., 2005. Modern Multidimensional Scaling: Theory and Applications. Springer, New York. Brooks, S. P., Catchpole, E. A., Morgan, B. J. T., 2000. Bayesian animal survival estimation. Statistical Science 15, 357 – 376. Cressie, N. A. C., Read, T. R. C., 1984. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B 46, 440 – 464. Cuadras, C. M., Cuadras, D., 2006. A parametric approach to correspondence analysis. Linear Algebra and its Applications 417, 64 – 74. D’Ambra, L., Lauro, N. C., 1989. Non-symmetrical correspondence analysis for three-way contingency table. In: Coppi, R., Bolasco, S. (Eds.), Multiway Data Analysis. Elsevier, Amsterdam, pp. 301–315. De Falguerolles, A., 2008. L’analyse des donn´ees; before and around. Electronic Journal for History of Probability and Statistics 4(2), 32 pages. Domenges, D., Volle, M., 1979. Analyse factorielle sph´erique: une exploration. Annales de l’ins´e´e 35, 3 – 84.
28
659 660 661
662 663
664 665
666 667 668
Efron, B., 1992. Overdispersion estimates based on the method of asymmetric maximum likelihood. Journal of the American Statistical Association 87, 98 – 107. Freeman, M. F., Tukey, J. W., 1950. Transformations related to the angular and square root. The Annals of Mathematical Statistics 21, 607 – 611. Gibbs, A. L., Su, E., 2002. On choosing and bounding probability metrics. International Statistical Review 70, 419 – 435. Golden, R. M., 2000. Examples of complex probability models in the social and behavioural sciences (presentation). Last Retrieved: April 24, 2017. URL www.utdallas.edu/~golden/ANNTALKS/Feb21talk/
671
Goodman, L. A., Kruskal, W., 1954. Measures of association for crossclassifications. Journal of the American Statistical Association 49, 732 – 764.
672
Gower, J. C., Lubbe, S., le Roux, N., 2011. Understanding Biplots. Wiley.
673
Greenacre, M., 2010. Biplots in Practice. Fundacion BBVA, Barcelona.
669 670
674 675
676 677
678 679 680
681 682 683
Greenacre, M. J., 1984. Theory and Application of Correspondence Analysis. Academic Press, London. Haberman, S., 1973. The analysis of residuals in cross-classified tables. Biometrics 75, 457 – 467. Holmes, S., 2008. Multivariate data analysis: The French way. In: Nolan, D., Speed, T. (Eds.), Probability and Statistics: Essays in Honor of David A. Freedman. Institute of Mathematical Statistics, Ohio, pp. 219 – 233. Kendall, D. G., 1971. Abundance matrices and seriation in archaeology. Zeitschrift f¨ ur Wahrscheinlichkeitstheorie und Verwandte Gebiete 17, 104 – 112.
686
Kroonenberg, P., Lombardo, R., 1999. Nonsymmetric correspondence analysis: A tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research 34, 367 – 396.
687
Kullback, S., 1959. Information Theory and Statistics. Wiley.
684 685
29
688 689 690
691 692 693
694 695 696
697 698
699 700 701
702 703 704
705 706
707 708 709
710 711
712 713
714 715
716 717 718
Larntz, K., 1978. Small-sample comparisons of exact levels for chi-squared goodness-of-fit statistics. Journal of the American Statistical Association 73, 253 – 263. Lawal, H. B., 1984. Comparisons of the X 2 , Y 2 , Freeman-Tukey and Williams’s improved G2 test statistics in small samples of one-way multinomials. Biometrika 71, 415 – 418. Lebart, L., 2008. Exploratory multivariate data analysis from its origins to 1980: Nine contributions. Electronic Journal for History of Probability and Statistics 4(2), 17 pages. Lebart, L., Morineau, A., Warwick, K. M., 1984. Multivariate Descriptive Statistical Analysis. Wiley. Lindsay, B. G., 1994. Efficiency versus robustness: The case for minimum Hellinger distance and related methods. The Annals of Statistics 22, 1081 – 1114. Lombardo, R., Beh, E. J., Kroonenberg, P. M., 2016. Modelling trends in ordered correspondence analysis using orthogonal polynomials. Psychometrika 81, 325 – 349. Mazzetta, C., Brooks, S., Freeman, S. N., 2007. On smoothing trends in population index modeling. Biometrics 63, 1007 – 1014. Neyman, J., 1949. Contributions to the theory of the χ2 test. Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, 239 – 273. Pollard, D., 2002. A User’s Guide to Measure Theoretic Probability. Cambridge University Press. Rao, C. R., 1995. A review of canonical coordinates and an alternative to correspondence analysis using Hellinger distance. Q¨ uestii´o 19, 23 – 63. Read, C. B., 1993. Freeman-tukey chi-squared goodness-of-fit statistics. Statistics & Probability Letters 18, 271 – 278. Sheil, J., O’Muircheartaigh, I., 1977. Algorithm as 106: the distribution of non-negative quadratic forms in normal variables. Applied Statistics 26, 92 – 98. 30
721
Simpson, D. G., 1987. Minimum Hellinger distance estimation for the analysis of count data. Journal of the American Statistical Association 82, 802 – 807.
722
Torgerson, W. S., 1958. Theory and Methods of Scaling. Wiley, New York.
719 720
726
Van Meter, K., Schiltz, M. A., Cibois, P., Mounier, L., 1994. Correspondence analysis: A history and french sociological perspective. In: Greenacre, M., Blasius, J. (Eds.), Correspondence Analysis in the Social Sciences. Academic Press, London, pp. 128 – 137.
727
Wong, R. S. K., 2010. Association Models. Sage Publications Inc.
723 724 725
31