J. theor. Biol. (1972) 34, 337-352
$2^k$ Contingency Tables in Ecology
E. C. PIELOU†
Biology Department, Queen’s University, Kingston, Ontario, Canada
WITH AN APPENDIX BY D. S. ROBSON
Biometrics Unit, Cornell University, Ithaca, New York, U.S.A.
(Received 22 March 1971)
Suppose observations are made on the presence or absence of k different species in N sampling units. Denote by s the number of species per sampling unit and let $m_2$ and Var(s) respectively be the observed variance of s and its expectation under the null hypothesis that all k species are independent. It is shown that the difference $m_2 - \mathrm{Var}(s)$ is directly interpretable as a measure of the overall association among the species. Examples using field data are given and it is shown how the proposed measure of association may be used to judge: (i) whether some chosen group of species contributes markedly to the total amount of interdependence within a community; and (ii) whether some chosen group of species may be disregarded without affecting the result when ecological data are to be classified or ordinated.
1. Introduction

It often happens that ecological field sampling consists in listing the species in each of a number, say N, of sampling units. Results obtained in this way can be displayed in a matrix all of whose elements are 0 or 1 (see Table 1). Thus, let k be the number of species encountered in all the sampling units (hereafter called “quadrats” for simplicity since the word “unit” is needed in another context). Then the data matrix is of order N × k and its (i, j)th element is given the value 1 or 0 according as the ith quadrat does or does not contain the jth species. Numerous examples could be given of this type of sampling. For instance, the organisms whose species are recorded and the “quadrats” containing them might be: microarthropods in soil samples; warblers’ nests in forest stands; ectoparasites on rabbits; immature aquatic insects caught in drift nets in a stream; or pollen-eating insects found in flowers.

Now suppose that we wish to use these data to measure the extent to which the several species are associated, on the assumption that the quadrats are independent of one another. The customary procedure is to condense the N × k data matrix into a $2^k$ contingency table. Since there are k species altogether, $2^k$ different species combinations are possible (we include absence of all species as a possible combination) and the frequencies in the $2^k$ cells of the table are the frequencies of occurrence of the corresponding combinations. Each species combination can be represented by a k-element row vector all of whose elements are either 1 or 0, and the N rows of the data matrix are, of course, the vectors representing the species combinations that actually occurred in the sample. The cell frequencies in the contingency table thus show the number of times each distinct vector occurs as a row of the matrix.

Provided k is not too large, the $2^k$ contingency table can be analysed as it stands. There has been much recent work on this topic; accounts of the current state of the subject, and abundant references, will be found in Fienberg (1970), Goodman (1970) and Bishop (1969).

† Present address: Biology Department, Dalhousie University, Halifax, Nova Scotia, Canada.

TABLE 1
The general form of data matrix when presence-absence records are tabulated for k species in N quadrats

Quadrat number            Species 1, 2, 3, ..., k            Number of species per quadrat
1                         (each entry 0 or 1)                $s_1$
2                         (each entry 0 or 1)                $s_2$
3                         (each entry 0 or 1)                $s_3$
...                       ...                                ...
N                         (each entry 0 or 1)                $s_N$
Number of occurrences
per species               $n_1, n_2, n_3, \ldots, n_k$       A

A is the total number of occurrences, i.e. $\sum_{i=1}^{N} s_i = \sum_{j=1}^{k} n_j = A$.
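The condensation of the data matrix described above can be sketched in a few lines of Python. This is only an illustration: the 5 × 3 data matrix is made up, and the function name `condense` is a hypothetical helper, not something defined in the paper. It tallies the occupied cells of the $2^k$ table, the row totals $s_i$, the column totals $n_j$, and the number of quadrats holding 0, 1, ..., k species.

```python
from collections import Counter

# Hypothetical presence-absence data: 5 quadrats (rows) x 3 species (columns).
data = [
    (1, 0, 1),
    (0, 0, 0),
    (1, 1, 1),
    (1, 0, 1),
    (0, 1, 0),
]

def condense(matrix):
    """Return (cells of the 2^k table, row totals, column totals, tally of s)."""
    k = len(matrix[0])
    # One cell per distinct species combination (row vector) that occurred.
    cells = Counter(tuple(row) for row in matrix)
    s = [sum(row) for row in matrix]          # species per quadrat
    n = [sum(col) for col in zip(*matrix)]    # occurrences per species
    tally = Counter(s)
    # Number of quadrats containing exactly 0, 1, ..., k species.
    f = [tally.get(value, 0) for value in range(k + 1)]
    return cells, s, n, f

cells, s, n, f = condense(data)
print("occupied cells:", dict(cells))
print("row totals s_i:", s)
print("column totals n_j:", n)
print("quadrats with 0..k species:", f)
print("total occurrences A:", sum(s))
```

Only the cells that actually occur are stored, which is why a `Counter` is used here instead of an array with $2^k$ entries.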
As k increases, however, the number of cells in the contingency table rapidly becomes enormous; for instance, when k = 20, $2^k$ > 1,000,000. Except when k is small, therefore, the great majority of the observed frequencies in the cells of the table are zero, and their expected values are very small indeed. Consequently, before observed and expected frequencies can be compared it is necessary to group the cells, and the most natural grouping is that which yields, as the resultant pooled frequencies, the frequency distribution of the number of species per quadrat. In what follows we shall call this variate s and shall denote by $f_s$ the frequencies of the observed values of s for s = 0, 1, ..., k. Thus $f_0$ is the frequency in the single cell representing absence of all species; $f_1$ is the sum of the frequencies in the k cells representing presence of one species; ...; $f_i$ is the sum of the frequencies in the $\binom{k}{i}$ cells representing presence of exactly i species; ...; and $f_k$ is the frequency in the single cell representing presence of all k species.

The purpose of this paper is to show how associations among the species affect the distribution of s: in particular, how the net amount of interspecific association can be expressed in terms of the variance of s.

One could, of course, discuss the distribution of s without reference to the $2^k$ table. The various observed values of s, and their frequencies, are directly obtainable merely by counting the species in each quadrat and it may seem needlessly indirect to regard the distribution as the outcome of pooling the cells of a contingency table in accordance with certain rules. The reason for treating the distribution in this way, that is as being derived from the contingency table, will become clear below.

2. Relation Between the Observed and Expected Variance of s

Suppose $n_j$ of the N quadrats examined contain the jth species; that is, $n_j$ is the number of “occurrences” of this species in the sample.
Also, put $p_j = n_j/N$, so that $p_j$ is the observed proportion of quadrats containing the jth species. Taking as null hypothesis that all the species are independent of one another, Var(s), the expected variance of s, can be obtained as shown by Barton & David (1959). Thus if the species are independent, the value of s for any one quadrat is the number of “successes” in k simultaneous Bernoulli trials, one for each species, whose probabilities of success are $p_1, p_2, \ldots, p_k$. It follows that

$$\mathrm{Var}(s) = \sum_{j=1}^{k} p_j(1-p_j) = k\bar{p}(1-\bar{p}) - \sum_{j=1}^{k}(p_j-\bar{p})^2 = k\{\bar{p}(1-\bar{p}) - \mathrm{var}(p)\}, \tag{1}$$

where $\bar{p} = \frac{1}{k}\sum_{j} p_j$ and $\mathrm{var}(p) = \frac{1}{k}\sum_{j}(p_j-\bar{p})^2$.
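As a numerical check on (1), the sketch below evaluates the expected variance in both forms, as a sum of Bernoulli variances and through the mean and variance of the $p_j$. The function names and the column totals used are illustrative assumptions, not values from the paper.

```python
def expected_variance(n, N):
    """Var(s) under independence: sum of p_j (1 - p_j) with p_j = n_j / N."""
    p = [nj / N for nj in n]
    return sum(pj * (1 - pj) for pj in p)

def expected_variance_via_mean(n, N):
    """The same quantity written as k * (p_bar * (1 - p_bar) - var(p))."""
    k = len(n)
    p = [nj / N for nj in n]
    p_bar = sum(p) / k
    var_p = sum((pj - p_bar) ** 2 for pj in p) / k   # divisor k, as in the text
    return k * (p_bar * (1 - p_bar) - var_p)

# Hypothetical column totals for k = 3 species in N = 5 quadrats.
n, N = [3, 2, 3], 5
print(expected_variance(n, N), expected_variance_via_mean(n, N))
```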
Now put $\sum_j n_j = A$; that is, A is the total number of occurrences in the whole sample. Then

$$\bar{p} = \frac{A}{kN} \quad \text{and} \quad \mathrm{var}(p) = \frac{1}{k}\sum_{j=1}^{k}\left(\frac{n_j}{N}\right)^2 - \left(\frac{A}{kN}\right)^2 .$$

Hence

$$\mathrm{Var}(s) = \frac{A}{N} - \frac{1}{N^2}\sum_{j=1}^{k} n_j^2 . \tag{2}$$

Referring back to Table 1 it is seen that, when the species are independent, the variance of the values of s (which are the row sums of the data matrix) is expressible in terms of values of n (which are the column sums of the data matrix).

Now denote by $m_2$ the variance of an observed distribution of s. As before, Var(s), as given by (2), denotes the expected variance under the null hypothesis of independence of the species. We shall show how the deviation of $m_2$ from Var(s) may be interpreted in terms of interspecific associations.

Consider the data obtained by sampling an actual population in which the species are not independent. Visualize the data displayed in a $2^k$ contingency table. The table is to be thought of as a k-dimensional hypercube having a “framework” of $\binom{k}{2}2^{k-2}$ plane (i.e. two-dimensional) faces, each of which is itself a 2 × 2 table showing the association between a single pair of species for a given combination of all the remaining k−2 species. For example, one of the faces relates to species X and Y, for some specified combination of the other species, and would appear thus:

                          Species Y
                          Present    Absent
Species X    Present      a          b          a+b
             Absent       c          d          c+d
                          a+c        b+d
Now let us make what may be called a “unit increase” in association in this particular 2 × 2 table by changing its cell frequencies

from $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$ to $\begin{pmatrix} a+1 & b-1 \\ c-1 & d+1 \end{pmatrix}$.

The marginal frequencies are, of course, unaltered. We now enquire into the effect of this unit increase on the variance of s. The variance before the change will be denoted by $\mathrm{var}_0(s)$, and after the change by $\mathrm{var}_1(s)$. It is further assumed that the given combination of the k−2 other species consists of r presences and k−r−2 absences. Thus the effect of the unit increase in association is to alter the frequency distribution of s in the way shown in Table 2.

TABLE 2
The distribution of the number of species per quadrat before and after a “unit increase” in association

No. of species per quadrat    Original frequency    Altered frequency
0                             $f_0$                 $f_0$
1                             $f_1$                 $f_1$
...                           ...                   ...
r                             $f_r$                 $f_r + 1$
r+1                           $f_{r+1}$             $f_{r+1} - 2$
r+2                           $f_{r+2}$             $f_{r+2} + 1$
...                           ...                   ...
k                             $f_k$                 $f_k$
Total                         N                     N

There is no change in $\bar{s}$, the mean number of species per quadrat. But the variance of s, from being

$$\mathrm{var}_0(s) = \frac{1}{N}\sum_{s=0}^{k} s^2 f_s - \bar{s}^2$$

becomes

$$\mathrm{var}_1(s) = \frac{1}{N}\Big\{\sum_{s=0}^{r-1} s^2 f_s + r^2(f_r+1) + (r+1)^2(f_{r+1}-2) + (r+2)^2(f_{r+2}+1) + \sum_{s=r+3}^{k} s^2 f_s\Big\} - \bar{s}^2$$
and therefore, since $r^2 - 2(r+1)^2 + (r+2)^2 = 2$,

$$\mathrm{var}_1(s) - \mathrm{var}_0(s) = 2/N.$$
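The effect of a single unit increase can be checked directly on a small data matrix. In the sketch below (Python; the 5 × 4 matrix and the function name are made up for illustration) the first two columns play the roles of species X and Y. One quadrat with X only and one with Y only, both sharing the same combination of the other species, are replaced by one quadrat with both and one with neither, which leaves every column total unchanged.

```python
def variance_of_s(matrix):
    """Observed variance (divisor N) of the number of species per quadrat."""
    N = len(matrix)
    s = [sum(row) for row in matrix]
    mean = sum(s) / N
    return sum(v * v for v in s) / N - mean * mean

# Hypothetical data: columns are species X, Y and two other species.
before = [
    (1, 0, 1, 0),   # X present, Y absent; other species pattern (1, 0)
    (0, 1, 1, 0),   # X absent, Y present; same pattern of other species
    (1, 1, 0, 0),
    (0, 0, 0, 1),
    (0, 0, 1, 1),
]

# Unit increase: the "b" and "c" quadrats become an "a" and a "d" quadrat.
after = list(before)
after[0] = (1, 1, 1, 0)
after[1] = (0, 0, 1, 0)

N = len(before)
change = variance_of_s(after) - variance_of_s(before)
print(change, 2 / N)
```

Reversing the substitution gives the corresponding unit decrease, with a change of −2/N.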
It thus appears that a unit increase in association applied to any of the component 2 × 2 tables in the $2^k$ table will cause the variance of s to increase by the amount 2/N. The magnitude of the increase is unaffected by the magnitude of the cell frequencies anywhere in the $2^k$ table. Moreover, it is immaterial that changing the cell frequencies in one of the component 2 × 2 tables of the k-dimensional table causes concomitant changes in all other 2 × 2 tables that have a pair of cells in common with it. In one half of these simultaneously altered tables the change will be in the direction of increased positive (or decreased negative) association and in the other half the change will be in the opposite direction; the two effects exactly cancel each other.

In the same way, if we make a “unit decrease” in the association shown in one of the component 2 × 2 tables by changing it

from $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$ to $\begin{pmatrix} a-1 & b+1 \\ c+1 & d-1 \end{pmatrix}$

the result is a decrease of magnitude 2/N in the variance of s; that is, $\mathrm{var}_1(s) - \mathrm{var}_0(s) = -2/N$.

Now it is clear that if the marginal totals of a $2^k$ table are given, many different sets of cell frequencies are possible. Further, any such set can be converted into any other by a sequence of unit changes (unit increases and unit decreases) while at the same time the marginal totals are held constant; the marginal totals (i.e. $n_j$ with j = 1, 2, ..., k) are the numbers of occurrences of the different species. Thus it is seen that the $2^k$ table corresponding to complete independence among the species can, by a succession of unit changes, be converted to any observed table. Let us denote by $v_p$ and $v_n$, respectively, the number of unit increases and unit decreases required to bring about the conversion from the theoretical table to the actual one. The conversion also causes the variance of s to change from its theoretical value, Var(s), to its observed value, $m_2$. Therefore

$$m_2 - \mathrm{Var}(s) = (2/N)v_p - (2/N)v_n = 2v/N, \text{ say},$$

where v is the net number of unit increases. Clearly v, which may be positive or negative, measures directly the net amount of positive association in a $2^k$ table. It depends on N, the number of
quadrats in the sample, and therefore, as a measure of interspecific association that is unaffected by sample size, it is more useful to take

$$v/N = \tfrac{1}{2}\{m_2 - \mathrm{Var}(s)\}. \tag{3}$$

It is true that in a community in which some pairs of species are associated positively and some negatively, v/N would greatly underestimate the total amount of interdependence among the species. However, it should prove a useful measure of the overall interdependence among the species in communities in which positive associations greatly exceed negative associations in number and intensity. This is probably true, for example, of animal communities in which several trophic levels are present (Pielou, 1971).

There are two other contexts in which v/N should be useful and these are considered below in connection with actual examples. But first it is necessary to prove that if one excludes from consideration any species that is wholly independent of all the others, the magnitude of v/N is unchanged. The manipulations are simpler if we prove, instead, the following lemma which is obviously equivalent.

Lemma

Suppose k species are distributed among N quadrats so that the proportion of quadrats containing the jth species is $p_j$ (j = 1, ..., k). The species are not independent. Now imagine that a (k+1)th “species” is added to a proportion $p_{k+1}$ of the quadrats; the quadrats to which this new “species” is assigned are chosen at random, i.e. without regard to the presence or absence of any of the original k species. Then the statistic v/N is unchanged.

Proof

The subscripts 0 and 1, respectively, denote values before and after addition of the (k+1)th “species”. From (2)

$$\mathrm{Var}_0(s) = \sum_{j=1}^{k} p_j - \sum_{j=1}^{k} p_j^2 \tag{4}$$

and

$$\mathrm{Var}_1(s) = \sum_{j=1}^{k+1} p_j - \sum_{j=1}^{k+1} p_j^2 . \tag{5}$$

We require to show that

$$(m_2)_0 - \mathrm{Var}_0(s) = (m_2)_1 - \mathrm{Var}_1(s)$$

or, equivalently,

$$(m_2)_1 - (m_2)_0 = \mathrm{Var}_1(s) - \mathrm{Var}_0(s). \tag{6}$$

From (4) and (5) it is seen that the right-hand side of (6) is

$$\mathrm{Var}_1(s) - \mathrm{Var}_0(s) = p_{k+1}(1-p_{k+1}). \tag{7}$$
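It remains to evaluate the left-hand side of (6). Written in full, the frequency distributions of s before and after addition of the (k+1)th “species” are as shown in Table 3.

TABLE 3
Distribution of s before and after the random addition of a new “species” to the quadrats

s        Initial frequency    Final frequency
0        $f_0$                $(1-p_{k+1})f_0$
1        $f_1$                $p_{k+1}f_0 + (1-p_{k+1})f_1$
2        $f_2$                $p_{k+1}f_1 + (1-p_{k+1})f_2$
...      ...                  ...
k        $f_k$                $p_{k+1}f_{k-1} + (1-p_{k+1})f_k$
k+1      0                    $p_{k+1}f_k$
Total    N                    N

It is seen from the table that

$$\bar{s}_0 = \frac{1}{N}\sum_{j=0}^{k} j f_j$$

and

$$\bar{s}_1 = \frac{1}{N}\sum_{j=0}^{k}\{j(1-p_{k+1}) + (j+1)p_{k+1}\} f_j .$$

Therefore

$$\bar{s}_1 - \bar{s}_0 = \frac{1}{N}\,p_{k+1}\sum_{j} f_j = p_{k+1}$$

and also $\bar{s}_1 + \bar{s}_0 = p_{k+1} + 2\bar{s}_0$.

Now

$$(m_2)_0 = \frac{1}{N}\sum_{j=0}^{k} j^2 f_j - \bar{s}_0^2$$

and

$$(m_2)_1 = \frac{1}{N}\sum_{j=0}^{k}\{j^2(1-p_{k+1}) + (j+1)^2 p_{k+1}\} f_j - \bar{s}_1^2 .$$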
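Therefore

$$\begin{aligned}
(m_2)_1 - (m_2)_0 &= \frac{p_{k+1}}{N}\Big\{\sum_j f_j + 2\sum_j j f_j\Big\} - (\bar{s}_1-\bar{s}_0)(\bar{s}_1+\bar{s}_0) \\
&= \frac{p_{k+1}}{N}\{N + 2N\bar{s}_0\} - p_{k+1}(p_{k+1} + 2\bar{s}_0) \\
&= p_{k+1}(1-p_{k+1}).
\end{aligned}$$

Comparing this with (7) shows that the proof is complete.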
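The lemma can also be checked numerically. The sketch below (Python; the frequency distribution, the value of $p_{k+1}$ and the function names are illustrative assumptions) builds the "final" column of Table 3 from an arbitrary initial distribution and confirms that the variance of s rises by exactly $p_{k+1}(1-p_{k+1})$, which is the amount by which Var(s) rises, so that v/N is unaffected.

```python
def m2(freq):
    """Variance (divisor N) of s for frequencies freq[0..k] of s = 0..k."""
    N = sum(freq)
    mean = sum(j * f for j, f in enumerate(freq)) / N
    return sum(j * j * f for j, f in enumerate(freq)) / N - mean ** 2

def add_independent_species(freq, p):
    """Final frequencies of Table 3 after adding a species to a fraction p
    of the quadrats at random, regardless of the species already present."""
    k = len(freq) - 1
    final = []
    for j in range(k + 2):
        unchanged = (1 - p) * freq[j] if j <= k else 0.0   # new species absent
        promoted = p * freq[j - 1] if j >= 1 else 0.0      # new species present
        final.append(unchanged + promoted)
    return final

f_initial = [3, 5, 8, 4]      # hypothetical f_0..f_3, so N = 20 and k = 3
p_new = 0.3                   # hypothetical p_{k+1}
f_final = add_independent_species(f_initial, p_new)
increase = m2(f_final) - m2(f_initial)
print(increase, p_new * (1 - p_new))
```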
3. Ecological Uses of v/N

For ease of computation it is convenient to express the statistic v/N entirely in terms of the marginal totals of the data matrix (see Table 1). Equation (2), namely

$$\mathrm{Var}(s) = \frac{A}{N} - \frac{1}{N^2}\sum_{j=1}^{k} n_j^2 \qquad \Big(\text{where } A = \sum_{j} n_j\Big),$$

gives Var(s) in terms of the column totals, $n_j$, of the data matrix; these totals are the numbers of occurrences of the k species. To express $m_2$ in terms of the row totals (which are the numbers of species in each quadrat), observe that

$$m_2 = \frac{1}{N}\sum_{i=1}^{N} s_i^2 - \Big(\frac{1}{N}\sum_{i=1}^{N} s_i\Big)^2 = \frac{1}{N}\sum_{i=1}^{N} s_i^2 - \frac{A^2}{N^2}.$$

Therefore

$$\frac{v}{N} = \tfrac{1}{2}\{m_2 - \mathrm{Var}(s)\} = \frac{1}{2N^2}\Big\{N\sum_{i} s_i^2 - A^2 - AN + \sum_{j} n_j^2\Big\}. \tag{8}$$

I have now given a definition, in (3), and a computational formula, in (8), for the statistic v/N. It will now be shown, by means of two examples, how the statistic can be used in ecological studies. Both uses hinge on the property proved above, namely that the value of v/N depends only on the degree of interdependence among associated species. Consequently if, and only if, some of the species in a community are independent of all the others, they may be disregarded without affecting the value of v/N.
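The computational formula (8) is short enough to write out directly. The following is a minimal Python sketch; the function names and the small 5 × 3 matrix are assumptions made for illustration, while the four marginal quantities in the last call are those printed for the first subsample (row (i)) of Table 4.

```python
def v_over_N(N, A, sum_n_sq, sum_s_sq):
    """Formula (8): v/N from the marginal totals of the data matrix."""
    return (N * sum_s_sq - A * A - A * N + sum_n_sq) / (2 * N * N)

def v_over_N_from_matrix(matrix):
    """Apply formula (8) to a 0/1 data matrix (rows = quadrats)."""
    N = len(matrix)
    s = [sum(row) for row in matrix]            # row totals
    n = [sum(col) for col in zip(*matrix)]      # column totals
    A = sum(s)
    return v_over_N(N, A, sum(v * v for v in n), sum(v * v for v in s))

# Hypothetical 5 x 3 data matrix.
example = [(1, 0, 1), (0, 0, 0), (1, 1, 1), (1, 0, 1), (0, 1, 0)]
print(v_over_N_from_matrix(example))

# Marginal quantities for subsample 1, row (i), of Table 4.
print(round(v_over_N(N=20, A=149, sum_n_sq=1147, sum_s_sq=1327), 2))
```

Working from the four marginal quantities alone means the statistic can be recomputed from a published table without access to the original data matrix.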
Example I
Plant ecologists often use degrees of association among plant species as criteria for classifying vegetation. (I am concerned here only with classification methods based on presence-and-absence data.) It is often found that some species of plant, even common ones, contribute nothing to a classification. They can be regarded as “background” or “irrelevant” species; they are devoid of diagnostic value and add only uninformative “noise” to the data on which a classification is to be based. It would obviously simplify the classifying process if species in this category were dropped from consideration but one usually hesitates to dismiss a species, especially a common one, as irrelevant without a test. (The only species that can be disregarded without qualms are those present in all the quadrats; they are truly irrelevant to a classification.) Calculating v/N from the data, first with and then without the species believed to be irrelevant, constitutes such a test. Two points should be borne in mind when applying the test. (i) One must resist the temptation to recognize irrelevant species after analysing the observations. Recognition of species worth testing for irrelevance must be done in the field in the light of botanical knowledge, not as the result of “data dredging” by means of hunting and snooping (Selvin & Stuart, 1966). (ii) The sampling distribution of v/N is not derivable (even in principle) without an underlying hypothesis intended to describe the interrelations among all a community’s species. Therefore, to judge the significance of a change in v/N brought about by excluding a group of species from consideration, it is simplest to plan the analysis in a way that allows a distribution-free test to be performed. The best way to do this is to divide the quadrats into a number of subsamples (either naturally or arbitrarily) and analyse each separately. 
One can then judge the significance of the difference between the “before” and “after” values of v/N by a sign test or a matched pairs test. When the number of species or of occurrences excluded is large, and if there is reason to believe the excluded species are not independent among themselves, more exact and more troublesome tests are possible as described in the Appendix by Dr D. S. Robson.

To demonstrate the use of v/N we consider data obtained by sampling the ground vegetation in deciduous woodland. The community has been described, and its species listed, elsewhere (Pielou, 1966). Here it suffices to say that the plant species were listed in each of 100 m² quadrats placed at random in a 3000 m² tract of woodland. A total of k = 62 species was
found in the whole sample. While the observations were being made it became obvious that the species did not occur independently of one another; certain species groups tended to recur fairly frequently. At the same time, the two commonest species seemed, so far as could be judged subjectively, to be independent of all others. The two plants were seedlings of Acer saccharum (sugar maple) and the herb Erythronium americanum, which were present in 92 and 64 quadrats respectively. Although the sample is too small for a classification to be worthwhile, it is of interest to test whether these two plants would be irrelevant if classification were being attempted.

The test can be done by calculating v/N from (i) the complete data matrix, and (ii) the data matrix from which the two columns relating to the test species have been deleted. Since the sampling distribution of the difference between the two values of v/N is unknown, the sample of 100 quadrats was divided at random into five subsamples each of 20 quadrats. The results are shown in Table 4. Values of v/N were calculated using (8) and the table shows the required sum and sums of squares of the marginal totals of the data matrix. It is evident that the statistic v/N undergoes no significant change when the two test species are ignored and we conclude that observations on them could be neglected if the quadrats were to be classified.

TABLE 4
To test whether some of the species in a plant community are independent of all the others

                                  A      k     $\sum n_j^2$    $\sum s_i^2$    v/N     Difference (i)−(ii)
Subsample 1, N = 20      (i)†     149    40    1147            1327            3.13
                         (ii)     118    38    654             924             3.56    −0.43
Subsample 2, N = 20      (i)      144    44    1018            1250            3.00
                         (ii)     113    42    533             809             2.11    +0.89
Subsample 3, N = 20      (i)      157    42    1119            1415            2.04
                         (ii)     125    40    589             957             2.01    +0.03
Subsample 4, N = 20      (i)      143    40    1073            1269            3.93
                         (ii)     115    38    649             863             2.98    +0.95
Subsample 5, N = 20      (i)      135    32    1155            1063            1.86
                         (ii)     101    30    559             661             1.95    −0.09
Whole sample, N = 100    (i)      728    62    25974           6324            2.78
                         (ii)     572    60    13414           4214            2.52    +0.26

† (i) with, (ii) without the two test species. $A = \sum_j n_j = \sum_i s_i$; $v/N = \frac{1}{2N^2}\{N\sum_i s_i^2 - A^2 - AN + \sum_j n_j^2\}$.
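The sign test suggested above for the subsample differences can be sketched as follows. This is one possible implementation under stated assumptions: the function name is hypothetical, the test is taken to be two-sided with zero differences discarded, and the input is the column of five subsample differences from Table 4.

```python
from math import comb

def sign_test_p_value(differences):
    """Exact two-sided sign test; zero differences are discarded."""
    positives = sum(1 for d in differences if d > 0)
    negatives = sum(1 for d in differences if d < 0)
    m = positives + negatives
    larger = max(positives, negatives)
    # Probability of a split at least this uneven when each sign has chance 1/2.
    tail = sum(comb(m, i) for i in range(larger, m + 1)) / 2 ** m
    return min(1.0, 2 * tail)

# Differences (i) - (ii) for the five subsamples of Table 4.
table4_differences = [-0.43, 0.89, 0.03, 0.95, -0.09]
print(sign_test_p_value(table4_differences))
```

With three positive and two negative differences the split is as even as five observations allow, which agrees with the conclusion that v/N undergoes no significant change.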
Example 2
In this example we wish to test the hypothesis that a predesignated group of species contributes significantly to the “structure” of an ecological community, using the word structure to connote the total amount of mutual dependence among all the species. The data relate to the arthropod fauna (insects, mites and spiders) inhabiting fruiting bodies (“brackets”) of the birch bracket fungus Polyporus betulinus (Bull.) Fries. Nine collections of fungus brackets, obtained in the province of New Brunswick, have so far been examined and details may be found in Pielou & Verma (1968). A total of 345 arthropod species was found in the brackets and 41 of them were species of parasitic wasps (Hymenoptera). These wasps lay their eggs in the eggs, larvae or pupae of their appropriate host species and each immature parasite develops within, and ultimately kills, its host. The extent to which a particular species of parasite always attacks the same species of host is not known. At one time there was thought to be an exact one-to-one relationship between them but this is now known to be untrue and evidence accumulates of parasite species attacking more than one host, and of host species being the victims of more than one parasite. In any case, the parasitic hymenoptera as a group are obviously important elements in the structure (as defined above) of any arthropod community.

Values of v/N for the nine fungus bracket communities were calculated taking into account (i) all arthropod species, and (ii) all species except the parasitic hymenoptera. The results are given in Table 5 and it is clear that exclusion of the parasites reduces v/N in every case. This is not surprising in light of the biological characteristics of the excluded species but the magnitude of the effect is worth recording. The result also demonstrates the potential usefulness of this procedure when one wishes to test a more speculative hypothesis.
It will also be noticed, as might have been foreseen, that v/N depends (roughly) on k, the number of species. Obviously a greater degree of mutual dependence is possible among many species than among few. What is more surprising, perhaps, is the lack of close relationship between the number of species of parasitic hymenoptera in each community (given by the difference between the two tabulated values of k) and the decrease in v/N that results from their exclusion from the analysis.

TABLE 5
v/N for arthropod communities when species of parasitic hymenoptera are (i) included, and (ii) excluded

                                  A      k     $\sum n_j^2$    $\sum s_i^2$    v/N      Difference (i)−(ii)
Collection 1, N = 83     (i)†     169    43    3585            609             0.838
                         (ii)     164    38    3580            576             0.790    +0.05
Collection 2, N = 56     (i)      53     18    377             115             0.166
                         (ii)     46     13    366             90              0.114    +0.05
Collection 3, N = 98     (i)      205    35    5969            799             1.154
                         (ii)     193    28    5939            677             0.839    +0.31
Collection 4, N = 130    (i)      380    56    13594           1742            1.369
                         (ii)     329    44    13247           1363            1.166    +0.20
Collection 5, N = 127    (i)      518    55    22834           3100            2.555
                         (ii)     411    42    19183           2097            1.996    +0.56
Collection 6, N = 207    (i)      874    99    54232           4992            1.666
                         (ii)     845    88    54127           4627            1.435    +0.23
Collection 7, N = 177    (i)      981    89    55919           6721            1.765
                         (ii)     800    76    43026           4750            1.631    +0.13
Collection 8, N = 214    (i)      891    89    50205           5471            2.582
                         (ii)     867    82    50093           5171            2.396    +0.19
Collection 9, N = 138    (i)      486    59    19430           2460            1.461
                         (ii)     448    49    19034           2136            1.346    +0.12

† (i) with, (ii) without the parasitic Hymenoptera.

4. Remarks on an Inverse Analysis

It is apparent from (8) that the formula for v/N is not symmetrical in the n’s and the s’s (i.e. in the column and row totals respectively of the data matrix). One is therefore prompted (almost automatically) to try an inverse analysis. Hitherto we have regarded the presence of a species as an attribute of a quadrat. Now let us regard occurrence in a quadrat as an attribute of a species. Compilation of the data matrix thus entails listing, for each species, the quadrats it was found in.

The intrinsic asymmetry of the two approaches is immediately apparent. Whereas some of the quadrats were empty (i.e. they contained 0 species), none of the species lacks a quadrat since the k tabulated species all occurred in at least one quadrat. The parent community from which the sample came presumably contains other species, which chanced not to be found in the sample, and usually the number of these undiscovered species is unknown. Suppose the number was M−k, so that the community contains a total of M species including the undiscovered ones. Then M−k new columns, all of whose elements are zero, must be added to the data matrix.

If we now write $m_2(s)$ and $m_2(n)$ for the observed variances of s (the number of species per quadrat) and n (the number of occurrences per species) respectively, and Var(s) and Var(n) for their expected values, two sample statistics are definable: $\tfrac{1}{2}\{m_2(s) - \mathrm{Var}(s)\} = v/N$ as already given in (3); and $\tfrac{1}{2}\{m_2(n) - \mathrm{Var}(n)\} = v'/M$, say.
By analogy with (8), it is seen that

$$\frac{v'}{M} = \frac{1}{2M^2}\Big\{M\sum_{j} n_j^2 - A^2 - AM + \sum_{i} s_i^2\Big\}. \tag{9}$$

If the enlarged data matrix, which is of order N × M, were condensed into a $2^N$ contingency table, v′ would be the net number of “unit increases” that would have to be made to its component 2 × 2 tables to convert the expected $2^N$ table into the observed one. (By an “expected $2^N$ table” we mean one in which each cell frequency is proportional to the product of the appropriate marginals.) Thus if M were known, v′/M would serve to measure (inversely) a community’s species diversity. However, M is usually not known; also, ecologists have a superabundance of diversity indices already. Thus v′/M seems unlikely to be a useful statistic.

This work was supported by a grant from the National Research Council of Canada.

REFERENCES
BARTON, D. E. & DAVID, F. N. (1959). Jl R. statist. Soc. B 21, 190.
BISHOP, Y. M. M. (1969). Biometrics 25, 383.
FIENBERG, S. E. (1970). Ecology 51, 419.
GOODMAN, L. A. (1970). J. Am. statist. Ass. 65, 226.
PIELOU, D. P. & VERMA, A. N. (1968). Can. Entomol. 100, 1179.
PIELOU, E. C. (1966). J. theor. Biol. 13, 131.
PIELOU, E. C. (1971). In Structure and Function of Ecosystems (J. A. Wiens, ed.). Portland: Oregon State University Press.
SELVIN, H. C. & STUART, A. (1966). Am. Statist. 20(3), 20.
Appendix: Statistical Tests of Significance

D. S. ROBSON

If the collection of k species is partitioned into two classes, say the X-class and the Y-class, then the total number of species $s_i$ appearing in the ith quadrat is expressible as the sum $s_i = x_i + y_i$ of the numbers in the two classes. The statistic v/N calculated from the complete data matrix, say v(x+y)/N, is then related to the corresponding statistics of the X- and Y-data matrices by the equation

$$v(x+y)/N = v(x)/N + v(y)/N + m_{11}(x, y),$$

where the product moment is defined by

$$m_{11}(x, y) = \frac{1}{N}\Big[\sum xy - \frac{1}{N}\Big(\sum x\Big)\Big(\sum y\Big)\Big].$$
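The decomposition can be verified on a small example. In the Python sketch below the 5 × 4 data matrix, the split into an X-class (first two columns) and a Y-class (last two columns), and the function names are all illustrative assumptions; v/N is computed from the marginal totals in the same way as formula (8) of the main text.

```python
def v_over_N(matrix):
    """v/N for a 0/1 data matrix, computed from its marginal totals."""
    N = len(matrix)
    s = [sum(row) for row in matrix]
    n = [sum(col) for col in zip(*matrix)]
    A = sum(s)
    return (N * sum(v * v for v in s) - A * A - A * N
            + sum(v * v for v in n)) / (2 * N * N)

def m11(x, y):
    """Product moment of the per-quadrat counts x and y (divisor N)."""
    N = len(x)
    return (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / N) / N

# Hypothetical data: columns 1-2 form the X-class, columns 3-4 the Y-class.
data = [
    (1, 0, 1, 0),
    (0, 1, 1, 0),
    (1, 1, 0, 0),
    (0, 0, 0, 1),
    (0, 0, 1, 1),
]
x_part = [row[:2] for row in data]
y_part = [row[2:] for row in data]
x = [sum(row) for row in x_part]
y = [sum(row) for row in y_part]

whole = v_over_N(data)
parts = v_over_N(x_part) + v_over_N(y_part) + m11(x, y)
print(whole, parts)
```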
The reduction resulting from deletion of one class of species, say the Y-class,

$$v(x+y)/N - v(x)/N = v(y)/N + m_{11}(x, y),$$

will thus have expectation zero if each species in the Y-class is not only independent of all species in the X-class, implying $m_{11}(x, y)$ has mean zero, but also independent of all other species in the Y-class, implying v(y)/N has mean zero. A hypothesis of no expected reduction is thus a composite of two null hypotheses, each of which is separately and independently testable.

If the Y-class is independent of the X-class then $m_{11}$ has mean zero even when the expectation is conditioned on one of the order statistics, say $\{y\} = \{y_i \mid i = 1, 2, \ldots, N\}$ = unordered collection of N y’s. Thus,
$$E_{H_0}[m_{11}(x, y) \mid (x), \{y\}] = 0,$$

where the average is taken over the N! equally likely permutations of (y). For example, if the Y-class contains a single species then $y_i$ = 0 or 1 and

$$m_{11}(x, y) = \frac{n}{N}(\bar{x}_n - \bar{x}),$$

where $\bar{x}_n$ is the mean value of x in those n quadrats containing the Y-species. The permutation distribution of $m_{11}$ in this case is, in effect, the distribution of a mean of a sample of size n drawn randomly and without replacement from a finite population of size N. The normal approximation to this distribution gives

$$\frac{n(N-1)(\bar{x}_n - \bar{x})^2}{(N-n)\,m_2(x)} \sim \chi^2_1 .$$
In the more general case we have

$$\mathrm{Var}[m_{11}(x, y) \mid (x), \{y\}] = m_2(x)\,m_2(y)/(N-1)$$

and

$$\{[m_{11}(x, y)]^2 / m_2(x)\,m_2(y)\}(N-1) = r_{xy}^2(N-1) \sim \chi^2_1 .$$

If the X- and Y-classes both contain a large number of species then the distribution of $r_{xy}^2$ derived from bivariate (independent) normal theory might provide a better approximation than this chi-square distribution on one degree of freedom, but when one class is small the above approximation is the better of the two.

The statistical significance of v(y)/N for the $k_y$ species in the Y-class can be tested in a $2^{k_y}$ contingency table when $k_y$ is small, and can be tested by
appealing to the central limit theorem when $k_y$ is large. A row total $y_i$ (see Table 1) is the sum of $k_y$ indicator variables, say

$$y_i = \sum_{j=1}^{k_y} \delta_{ij},$$

which are independent under the null hypothesis, and hence as $k_y$ gets large the probability distribution of $y_i$ approaches a normal distribution. The same may be said of the column totals

$$n_j = \sum_{i=1}^{N} \delta_{ij}$$

as N gets large, and hence the conditional distribution of the row totals, given the column totals, must approach multivariate normality. Since the conditional moments are

$$\mathrm{Var}_{H_0}(y_i \mid n_1, \ldots, n_{k_y}) = \sum_{j=1}^{k_y} \mathrm{Var}_{H_0}(\delta_{ij} \mid n_j) = \sum_{j=1}^{k_y} p_j(1-p_j),$$

$$\mathrm{Cov}_{H_0}(y_i, y_{i'} \mid n_1, \ldots, n_{k_y}) = \sum_{j=1}^{k_y} \mathrm{Cov}(\delta_{ij}, \delta_{i'j} \mid n_j) = -\frac{1}{N-1}\sum_{j=1}^{k_y} p_j(1-p_j),$$

then the quadratic form of any N−1 y’s, say $y = (y_1, \ldots, y_{N-1})$,

$$(y - E(y \mid n))\,V^{-1}\,(y - E(y \mid n))',$$

where V is the conditional covariance matrix of y, reduces to

$$\frac{m_2(y)}{\sum_{j=1}^{k_y} p_j(1-p_j)}\,(N-1) = \frac{m_2(y)}{\mathrm{Var}(y)}\,(N-1) = \Big[1 + \frac{2v(y)/N}{\mathrm{Var}(y)}\Big](N-1) \sim \chi^2_{N-1}.$$

For large $k_y$ this test statistic is therefore approximately chi-square distributed on N−1 degrees of freedom, and is asymptotically independent of $\chi^2_1 \sim r_{xy}^2(N-1)$ since the distribution of the latter is conditional on {y} and hence on $m_2(y)$. These two chi-squares could therefore be added to provide a test of the composite null hypothesis, but the separate tests of the two facets of this null hypothesis are more informative.
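The quantities used in these tests can be computed with a few lines of Python. The sketch below rests on stated assumptions: the data and all function names are made up for illustration, the exact permutation moments are obtained by brute-force enumeration of all N! orderings (feasible only for very small N), and no tail probabilities are computed, so the two statistics would still have to be referred to chi-square tables on 1 and N−1 degrees of freedom.

```python
from itertools import permutations

def m2(values):
    """Second moment about the mean, with divisor N."""
    N = len(values)
    mean = sum(values) / N
    return sum(v * v for v in values) / N - mean * mean

def m11(x, y):
    """Product moment of x and y, with divisor N."""
    N = len(x)
    return (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / N) / N

def permutation_moments(x, y):
    """Exact mean and variance of m11 over all N! orderings of y."""
    values = [m11(x, ordering) for ordering in permutations(y)]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

def chi_square_between_classes(x, y):
    """r_xy^2 (N - 1), referred to chi-square on one degree of freedom."""
    N = len(x)
    return m11(x, y) ** 2 / (m2(x) * m2(y)) * (N - 1)

def chi_square_within_class(y_matrix):
    """(N - 1) m2(y) / sum p_j (1 - p_j), referred to chi-square on N - 1 d.f."""
    N = len(y_matrix)
    y = [sum(row) for row in y_matrix]
    p = [sum(col) / N for col in zip(*y_matrix)]
    var_y = sum(pj * (1 - pj) for pj in p)
    return m2(y) / var_y * (N - 1)

# Hypothetical counts: x_i for the X-class, and a two-species Y-class matrix.
x = [1, 1, 2, 0, 0]
y_matrix = [(1, 0), (1, 0), (0, 0), (0, 1), (1, 1)]
y = [sum(row) for row in y_matrix]

perm_mean, perm_var = permutation_moments(x, y)
print(perm_mean, perm_var, m2(x) * m2(y) / (len(x) - 1))
print(chi_square_between_classes(x, y), chi_square_within_class(y_matrix))
```

The enumeration reproduces the conditional mean of zero and the conditional variance $m_2(x)m_2(y)/(N-1)$; for realistic N a random sample of permutations would be needed in place of the full enumeration.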