J. theor. Biol. (1982) 96,129-142
Towards a Basis for Classification: the Incompleteness of Distance Measures, Incompatibility Analysis and Phenetic Classification
DAVID PENNY
Department of Botany and Zoology, Massey University, Palmerston North, New Zealand

(Received 25 March 1981)

There has been a long-term controversy in biology over whether relationships between organisms should be expressed in evolutionary terms, or as clusters based on overall similarity. Phenetic methods are those that use a measure of overall similarity (dissimilarity, or distance) between pairs of taxa, rather than the original data. It is shown that information is lost in converting the original data to distances, in that it is in general not possible to recover the original data from a distance matrix. In particular it is shown that: (a) different data sets may give the same dissimilarity matrix, (b) even data sets that give different minimal trees can give the same dissimilarity matrix, (c) as the number of taxa and/or characters increases there may be many more data sets than similarity matrices, (d) even when there are fewer data sets than similarity matrices it is possible to find cases where the same similarity matrix can be obtained from different data sets. It is concluded that because of this loss of information the methods of phenetic classification are inherently weaker than methods that retain the original data. An indication is given of how information is lost in transforming to distances. Incompatibility matrices are shown not to contain all the original information, but these methods usually retain the original data for tree building.

1. Introduction and Terminology
One of the first conclusions about evolution that Charles Darwin reached was that existing species had been linked in the past by an “evolutionary tree”. This is recorded in the first of his notebooks on the “species problem” that he started in 1837 (Darwin, 1960). For over a hundred years this was the rationale for determining relationships between organisms and for classifying them (see Simpson, 1975). During the last 20 years there has been a major development of an alternative statistical approach where methods have been developed for
0022-5193/82/100129+14$03.00/0 © 1982 Academic Press Inc. (London) Ltd
finding clusters of taxa. These methods will be called “phenetic” where they use only pairwise comparisons to obtain a measure of “distance” between pairs of taxa (see for example Estabrook, 1975; Jardine & Sibson, 1971; Sibson, 1971; Sneath & Sokal, 1973; Sokal & Sneath, 1963). These distances are stored in a symmetric similarity, dissimilarity, or distance matrix, and these three terms are treated as equivalent in the present paper. Initially the term numerical taxonomy was applied only to phenetic methods (Sokal & Sneath, 1963) but more recently it has been applied to all objective methods, either phenetic or the newer phylogenetic methods (Sneath & Sokal, 1973).

Arguments for the phenetic and traditional evolutionary approaches can be summarized briefly as follows. The original evolutionary methods could not be shown to be objective but were based on a sound biological foundation. The phenetic methods were objective, although the lack of a biological framework made it impossible to decide between results from the different phenetic methods (Penny, 1976). Recent discussions have appeared comparing objective phylogenetic (evolutionary) techniques with phenetic methods (Farris, 1979; Janowitz, 1979; Mickevich, 1978).

The present paper introduces a fundamental objection to phenetic methods, namely that the methods do not use all the information in the data. It is concluded that the transformation from the original data to distances loses sufficient information to make phenetic methods inherently weaker than methods that retain all the information. In at least some cases the loss of information with phenetic methods is sufficient to prevent any conclusion about relationships being derived from the distance matrix.

The following example is introduced to explain the terminology used. There are four taxa (rows) W1-Z1, each with 12 characters (columns), and any entry in a column can take one of four discrete states.
Each character state is equally different from the others, that is, they are qualitative multistate characters (Sneath & Sokal, 1973). An example would be nucleotides in a nucleic acid sequence where, for example, A (adenine) is considered equally different from C, G and U(T). The data, and the dissimilarity matrix formed by comparing pairs of taxa, are as follows:

Taxa    Characters
W1      AAAAAAAAAAAA
X1      AABAABBBBBBB
Y1      BBABBAAABBBB
Z1      AAABBBCCCCCC

Dissimilarity matrix
        W1   X1   Y1   Z1
W1       0    8    8    9
X1       8    0    8    9
Y1       8    8    0    9
Z1       9    9    9    0        (1)
Two restrictions are required. Any rearrangement of the columns is not considered to alter the data, and any consistent relabelling of a column is permitted. That is, a column BBCA is equivalent to CCAB, UUGC, etc. What is required for each character (column) is a “taxon comparison matrix”, which for AABC is:

        W(A)  X(A)  Y(B)  Z(C)
W(A)     0     0     1     1
X(A)     0     0     1     1
Y(B)     1     1     0     1
Z(C)     1     1     1     0        (2)
An entry “1” indicates that the taxa are distinct for that character, a “0” that they are identical. The taxon comparison matrix is the same for any consistent relabelling of the character states.

2. Is There a Unique Similarity Matrix for Each Data Set?
An important question that should be asked is “Is there a unique dissimilarity matrix for each data set?”, or alternatively, “Is it possible to recover the original data from a similarity matrix?” This is equivalent to the question, “Does the similarity matrix retain all the information in the original data?” Gower (1975) has made the observation that methods are not available to generate the original data from distance matrices, but this could be either because an algorithm is possible but has not been found, or because no general algorithm can exist. The question could be answered by finding different data sets that give the same distance matrix. If two or more different data sets give the same similarity matrix then it would not be possible, even in principle, to get back the original data. By using appropriate search and construction methods (see later) the following six examples are found:

Taxa  Characters        Taxa  Characters        Taxa  Characters
W1    AAAAAAAAAAAA      W2    AAAAAAAAAAAA      W3    AAAAAAAAAAAA
X1    AABAABBBBBBB      X2    AABAABBBBBBB      X3    ABBABBAABBBB
Y1    BBABBAAABBBB      Y2    BBBBBBAAAABB      Y3    BAABAABBBBBB
Z1    AAABBBCCCCCC      Z2    AABBBACCCCCC      Z3    AAABBBCCCCCC

W4    AAAAAAAAAAAA      W5    AAAAAAAAAAAA      W6    AAAAAAAAAAAA
X4    BBBBBBAAAABB      X5    ABBABBAABBBB      X6    BBBBBBAAAABB
Y4    AABAABBBBBBB      Y5    BBBBBBBBAAAA      Y6    ABBABBBBBBAA
Z4    AABBBACCCCCC      Z5    ABBBAACCCCCC      Z6    ABBBAACCCCCC
                                                                    (3)
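The six data sets in (3) can be checked mechanically. The following is a minimal sketch in Python (not part of the original analysis; the names `DATA_SETS` and `dissimilarity` are illustrative) that counts, for each pair of taxa, the characters at which they differ, and prints the resulting entries in the order W-X, W-Y, W-Z, X-Y, X-Z, Y-Z.

```python
from itertools import combinations

# The six data sets of (3); rows are the taxa W, X, Y, Z.
DATA_SETS = {
    1: ["AAAAAAAAAAAA", "AABAABBBBBBB", "BBABBAAABBBB", "AAABBBCCCCCC"],
    2: ["AAAAAAAAAAAA", "AABAABBBBBBB", "BBBBBBAAAABB", "AABBBACCCCCC"],
    3: ["AAAAAAAAAAAA", "ABBABBAABBBB", "BAABAABBBBBB", "AAABBBCCCCCC"],
    4: ["AAAAAAAAAAAA", "BBBBBBAAAABB", "AABAABBBBBBB", "AABBBACCCCCC"],
    5: ["AAAAAAAAAAAA", "ABBABBAABBBB", "BBBBBBBBAAAA", "ABBBAACCCCCC"],
    6: ["AAAAAAAAAAAA", "BBBBBBAAAABB", "ABBABBBBBBAA", "ABBBAACCCCCC"],
}


def dissimilarity(rows):
    """Return the upper triangle of the dissimilarity matrix as a tuple.

    Each entry is the number of characters at which two taxa differ,
    in the order W-X, W-Y, W-Z, X-Y, X-Z, Y-Z.
    """
    return tuple(
        sum(a != b for a, b in zip(rows[i], rows[j]))
        for i, j in combinations(range(len(rows)), 2)
    )


matrices = {k: dissimilarity(rows) for k, rows in DATA_SETS.items()}
for k, m in matrices.items():
    print(k, m)
```

Every data set should print the same six entries as the matrix in (1), although the data sets themselves are all different.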
The distances between pairs of taxa for each of these data sets are identical to those of (1): that is, the distance W1-X1 equals W2-X2, W3-X3, etc. It is therefore proven that in general there is not a unique similarity matrix for each data set.

3. Is the Loss of Information Important?
It still needs to be determined whether this loss of information is sufficient to be significant in a given study. This question is investigated by examining the three unrooted or undirected trees (I-III) that can be formed from four taxa. They are

Tree I:    W joined to X, Y joined to Z
Tree II:   W joined to Y, X joined to Z
Tree III:  W joined to Z, X joined to Y        (4)

The three cases have their first link from W to X, Y and Z respectively. The minimal length of each of the 18 trees (six data sets and three trees each) is determined by the algorithm of Fitch (1971). Each data set has 30 different character states, and with 12 characters there must therefore be at least 18 (30-12) character state changes on any tree (Fitch, 1977; Foulds, Hendy & Penny, 1979). Only changes additional to these 18 will be counted. An example with the first data set is shown:

W1  AAAAAAAAAAAA                                      BBABBAAABBBB  Y1
                 \                                   /
                  AAAAAAAABBBB ------- AAABBAAABBBB
                 /                                   \
X1  AABAABBBBBBB                                      AAABBBCCCCCC  Z1        (5)

This tree with 19 changes has one additional change or “duplication” (Penny, 1976), which occurs because the character state “B” for character six is derived twice. A summary of the number of duplications for all 18 trees follows:

            Number of duplications
            (trees from (4))           Ranking of
Data set    I     II    III            trees I-III
1           1     2     3              I < II < III
2           1     3     2              I < III < II
3           2     1     3              II < I < III
Thus all possibilities (counting only strict inequalities) are found from data sets consistent with the same dissimilarity matrix. It can be concluded that in this case insufficient information is retained by the dissimilarity matrix to allow any conclusion to be drawn about the relationship of the taxa. Note that this has arisen when W, X and Y are more similar to each other than any of them is to Z, and phenetic methods would have shown that W, X and Y formed a good cluster.
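The duplication counts can be reproduced with a small implementation of the Fitch (1971) set-intersection procedure. The following Python sketch is an illustration only, written for the special case of four taxa; the names `TREES`, `fitch_merge`, `tree_length` and `duplications` are hypothetical, and the data sets are those of (3).

```python
# The six data sets of (3); rows are the taxa W, X, Y, Z.
DATA_SETS = {
    1: ["AAAAAAAAAAAA", "AABAABBBBBBB", "BBABBAAABBBB", "AAABBBCCCCCC"],
    2: ["AAAAAAAAAAAA", "AABAABBBBBBB", "BBBBBBAAAABB", "AABBBACCCCCC"],
    3: ["AAAAAAAAAAAA", "ABBABBAABBBB", "BAABAABBBBBB", "AAABBBCCCCCC"],
    4: ["AAAAAAAAAAAA", "BBBBBBAAAABB", "AABAABBBBBBB", "AABBBACCCCCC"],
    5: ["AAAAAAAAAAAA", "ABBABBAABBBB", "BBBBBBBBAAAA", "ABBBAACCCCCC"],
    6: ["AAAAAAAAAAAA", "BBBBBBAAAABB", "ABBABBBBBBAA", "ABBBAACCCCCC"],
}

# The three unrooted four-taxon trees, each written as the two pairs of
# taxa separated by the central edge (W=0, X=1, Y=2, Z=3).
TREES = {
    "I": ((0, 1), (2, 3)),
    "II": ((0, 2), (1, 3)),
    "III": ((0, 3), (1, 2)),
}


def fitch_merge(s, t):
    """One Fitch step: intersect if possible, otherwise union at cost 1."""
    common = s & t
    return (common, 0) if common else (s | t, 1)


def tree_length(rows, tree):
    """Minimal number of character state changes on a four-taxon tree."""
    (a, b), (c, d) = tree
    total = 0
    for col in zip(*rows):
        left, cost_left = fitch_merge({col[a]}, {col[b]})
        right, cost_right = fitch_merge({col[c]}, {col[d]})
        _, cost_mid = fitch_merge(left, right)
        total += cost_left + cost_right + cost_mid
    return total


def duplications(rows, tree):
    """Changes beyond the unavoidable (number of states - 1) per character."""
    baseline = sum(len(set(col)) - 1 for col in zip(*rows))
    return tree_length(rows, tree) - baseline


table = {
    k: tuple(duplications(rows, TREES[t]) for t in ("I", "II", "III"))
    for k, rows in DATA_SETS.items()
}
for k, row in table.items():
    print(k, row)
```

For data set 1 and Tree I the sketch gives a length of 19, that is, one duplication, in agreement with the worked example (5).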
4. Frequency of Such Cases
The next question is to decide how frequently such cases may arise. If the problem outlined above is unique, then it would not be a serious problem in practice. However, the following analysis shows that the problem increases as the number of taxa and/or the number of characters increases. This can be demonstrated by comparing “D”, the number of distinct data sets, with “S”, the number of possible similarity matrices. The number of data sets is calculated in two parts: D1, the number of distinct columns or taxon comparison matrices (as defined in (2)), and D2, the number of combinations of them. A method to calculate the number of distinct columns for a given number of taxa is as follows:

Taxa in a column    Possible character states
1                   A
2                   A or B
3                   A or B, C if B included earlier
4                   A or B, C if B included earlier, D if C included earlier
5                   A or B, C if B included earlier, D if C included earlier, E if D included earlier
etc.
A simple inductive method is used to count the number of different columns for a given number of taxa. With n taxa, columns may have one state, or two states, up to n states. The number of columns for any number of states can be found from the table of “Stirling numbers of the second kind”, which are given below. These numbers are derived in a similar manner to the numbers in Pascal’s triangle except that the generating function is more complex (see Cohen, 1978). The entry

$X_{i,j} = j \cdot X_{i-1,j} + X_{i-1,j-1}$.

The numbers are as follows:

Number of    Number of different states in a column (j)        Total
taxa         1     2     3     4     5     6     7             columns
1            1                                                 1
2            1     1                                           2
3            1     3     1                                     5
4            1     7     6     1                               15
5            1     15    25    10    1                         52
6            1     31    90    65    15    1                   203
7            1     63    301   350   140   21    1             877
8                                                              4140
9                                                              21147
                                                                        (7)
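The table can be regenerated directly from the recurrence. The following Python sketch is illustrative only (the function name `stirling_table` is not from the original); it builds the Stirling numbers of the second kind and sums each row to obtain the total number of columns.

```python
def stirling_table(n_max):
    """Stirling numbers of the second kind X[i][j] for 0 <= j <= i <= n_max,
    built from X[i][j] = j * X[i-1][j] + X[i-1][j-1] with X[0][0] = 1."""
    X = [[0] * (n_max + 1) for _ in range(n_max + 1)]
    X[0][0] = 1
    for i in range(1, n_max + 1):
        for j in range(1, i + 1):
            X[i][j] = j * X[i - 1][j] + X[i - 1][j - 1]
    return X


X = stirling_table(9)
# Row totals: the number of distinct columns for 1 to 9 taxa.
totals = [sum(X[i][1:]) for i in range(1, 10)]
for i in range(1, 8):
    print(i, X[i][1:i + 1], totals[i - 1])
```

The row totals are the Bell numbers, which is a convenient check on the individual entries.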
The character state for the first taxon is always coded A, and that there is only one possibility is shown by the entry “1” in the first row. Either an A or a B state can be added for the second taxon to give the two columns $[AA]^T$ and $[AB]^T$. These are shown in the entries in the second row. The simple inductive method used develops the remaining entries. With four taxa the 15 possible columns are:

123456789012345
AAAAAAAAAAAAAAA
AAABBABBABBBBBB
AABABBABBACBCCC
ABAABBBACCACBCD        (8)

As is shown in the table of Stirling numbers there is one column with only one code, seven columns with two codes (A & B), six columns with three codes (A, B & C) and one column with the four codes A, B, C & D. The number of data sets that can be selected with c characters (columns) is the number of combinations with replacement (since the order of the columns is not used, but the same column can be used repeatedly). The number of data sets D is thus

$D = \binom{m + c - 1}{c}$,        (9)

where m is the total number of columns in (7) for the number of taxa. The number of distinct similarity matrices S is simply:

$S = (c + 1)^{n(n-1)/2}$.        (10)
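As a rough numerical illustration of the two counts, the following Python sketch evaluates D and S for a given number of taxa n and characters c. It assumes the standard count of combinations with replacement for D, that is $\binom{m+c-1}{c}$, and takes S as $(c+1)^{n(n-1)/2}$; the names `TOTAL_COLUMNS`, `n_data_sets`, `n_similarity_matrices` and `ratio` are illustrative.

```python
from math import comb

# Total number of distinct columns m for n taxa, taken from table (7).
TOTAL_COLUMNS = {3: 5, 4: 15, 5: 52, 6: 203}


def n_data_sets(n, c):
    """D: ways of choosing c columns from m with replacement (assumed form)."""
    m = TOTAL_COLUMNS[n]
    return comb(m + c - 1, c)


def n_similarity_matrices(n, c):
    """S: each of the n(n-1)/2 entries can take any of c + 1 values."""
    return (c + 1) ** (n * (n - 1) // 2)


def ratio(n, c):
    """The ratio D/S."""
    return n_data_sets(n, c) / n_similarity_matrices(n, c)


for n in (3, 4, 5, 6):
    print(n, [round(ratio(n, c), 3) for c in (2, 5, 10, 15, 20)])
```

With four taxa and two characters this gives D = 120 and S = 729, so the ratio is below 1 for very small data sets and grows as characters are added.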
Formula (10) is the formula for permutations with repetition when selecting n(n - 1)/2 elements from c + 1 (Cohen, 1978). For example, with the data in (1) there are 12 characters (c = 12) and therefore a particular entry of the similarity matrix of (1) could have a value of 0, 1, 2, . . . , 12, that is, any of c + 1 values. There are n(n - 1)/2 different entries in the symmetric similarity matrix ((4 x 3)/2 in (1)), each of which can have any of the c + 1 values. Hence the formula (10).

FIG. 1. Ratio = (number of distinct data sets)/(number of dissimilarity matrices). This ratio is plotted against the number of characters in the data set for three, four, five and six taxa. The dotted line is for six taxa but with the restriction that no more than four character states can occur in any character. When the ratio exceeds one there are more data sets than dissimilarity matrices. In these cases some dissimilarity matrices must be derived from more than one data set.

The ratio D/S of the number of distinct data sets (D) to the number of similarity matrices (S) is plotted in Fig. 1. The ratio is plotted against the number of characters for 3-6 species. When the ratio exceeds one there are more possible data sets than distinct similarity matrices. Then by the
pigeonhole principle (Cohen, 1978) there must be at least one similarity matrix that is obtained from more than one data set. It is shown in Fig. 1 that the ratio exceeds 1 as the number of characters increases, and that the increase is more rapid as the number of taxa increases. The ratio soon reaches astronomical size, and with six taxa and 21 characters the ratio exceeds $10^9$. That is, there are more than $10^9$ data sets per similarity matrix. (It should be noted that the numerator reflects a combinatorial process, and the denominator is a polynomial.) It is concluded therefore that there is a general problem of the loss of information in transforming the original data to distances.

5. Redundancy When D/S Less than 1
One additional point needs to be made before discussing these results. If the ratio in Fig. 1 (D/S) is less than 1, then it is logically possible for each similarity matrix to be derived from a unique data set. If additional restrictions are placed on the character states it is possible to reduce the rate of increase of the ratio shown in Fig. 1. For example, the dotted line in Fig. 1 shows the increase for six taxa if only four character states (instead of six) are permitted. It may be possible to place sufficient restrictions (for example allowing only binary characters) so that the ratio would not exceed 1 with the number of taxa and columns used in a particular study. However, it can readily be shown that even when the ratio D/S is less than 1, redundancy may occur in that different data sets still give the same dissimilarity matrix. For example, with four taxa and only two characters there are three pairs of characters (data sets) that give the same similarity matrix. They are

“6”+“15” = “9”+“14”
“7”+“15” = “10”+“13”
“8”+“15” = “11”+“12”        (11)

where the numbering indicates a column from (8). This can readily be checked by forming the “taxon comparison matrix” (2) for each column and adding the entries together for the pairs of characters. An example is as follows:

  “6”        “15”        “9”        “14”
0 0 1 1    0 1 1 1    0 0 1 1    0 1 1 1      0 1 2 2
0 0 1 1  + 1 0 1 1  = 0 0 1 1  + 1 0 1 1   =  1 0 2 2
1 1 0 0    1 1 0 1    1 1 0 1    1 1 0 0      2 2 0 1
1 1 0 0    1 1 1 0    1 1 1 0    1 1 0 0      2 2 1 0
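A search of this kind can be sketched in a few lines of Python. The sketch below is not the original search program; it takes the 15 columns exactly as printed in (8), forms the taxon comparison entries of (2) for each, and groups every selection of columns (with replacement) by the sum of those entries. The names `COLUMNS`, `comparison` and `equivalent_sets` are illustrative.

```python
from collections import defaultdict
from itertools import combinations, combinations_with_replacement

# The 15 columns of (8), numbered 1-15, read down the four rows as printed.
COLUMNS = ["AAAA", "AAAB", "AABA", "ABAA", "ABBB", "AABB", "ABAB", "ABBA",
           "AABC", "ABAC", "ABCA", "ABBC", "ABCB", "ABCC", "ABCD"]


def comparison(col):
    """Upper triangle of the taxon comparison matrix (2) for one column."""
    return tuple(int(col[i] != col[j]) for i, j in combinations(range(4), 2))


def equivalent_sets(n_chars):
    """Group all selections of n_chars columns by their summed comparison
    matrix and return the groups that contain more than one selection."""
    groups = defaultdict(list)
    for combo in combinations_with_replacement(range(1, 16), n_chars):
        vectors = (comparison(COLUMNS[k - 1]) for k in combo)
        total = tuple(map(sum, zip(*vectors)))
        groups[total].append(combo)
    return [g for g in groups.values() if len(g) > 1]


pairs = equivalent_sets(2)
for group in pairs:
    print(" = ".join("+".join(f'"{k}"' for k in combo) for combo in group))
```

Calling `equivalent_sets(3)` or `equivalent_sets(4)` extends the same search to more characters; how the resulting groups are counted as “cases” is a separate convention that this sketch does not attempt to reproduce.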
With four taxa and three characters there are 70 equivalent cases, which can be divided into three groups: (a) degenerate cases based on equivalent pairs: if “6”+“15” = “9”+“14”, then “1”+“6”+“15” = “1”+“9”+“14”, “2”+“6”+“15” = “2”+“9”+“14”, etc.; (b) recombinations such as “6”+“15” = “9”+“14” and “10”+“13” = “7”+“15”, therefore “6”+“10”+“13”+“15” = “7”+“9”+“14”+“15” (and this is a degenerate case of “6”+“10”+“13” = “7”+“9”+“14”); (c) new cases that cannot be derived from equal pairs, for example “3”+“9”+“12” = “5”+“11”+“14”; “2”+“10”+“15” = “7”+“12”+“14”; “2”+“15”+“15” = “12”+“13”+“14”. Note that the proportion of equivalent cases (a + b + c) increases when going from two to three characters even though the ratio D/S shown in Fig. 1 decreases. With four taxa and four characters 630 equivalent cases have been found, and they include the following:

“3”+“7”+“9”+“9” = “4”+“6”+“10”+“10”
“2”+“8”+“9”+“9” = “4”+“6”+“12”+“12”        (12)
“2”+“8”+“10”+“10” = “3”+“7”+“12”+“12”

These quadruples were used to construct the data sets in (1) and (3). With five characters 3389 equivalent quintuples have been found, but more could exist because there was insufficient storage on the small microcomputer used to search for equivalent cases. The algorithm used for the search was of order n (Aho, Hopcroft & Ullman, 1974) but did require considerable storage. The conclusion is that even when there are more similarity matrices than data sets there can still be redundancy. It is not sufficient to show that the ratio in Fig. 1 is less than unity, and then hope for uniqueness.

6. Incompatibility Analysis
An incompatibility matrix can also be derived from the original data and the same question can be asked about whether it retains all the information.
The concepts will be explained briefly but can be found in more detail in Le Quesne (1969), Estabrook, Strauch & Fiala (1977), Fitch (1977) and Foulds, Hendy & Penny (1979). Columns 5 and 6 of the data in (1) will be considered in more detail. Both columns have two character states A and B, and for each column there must be at least one A-B change on any tree linking the taxa W1-Z1. When columns 5 and 6 are examined simultaneously it is found that taxon W1 has AA for the two columns, X1 has AB, Y1 has BA and Z1 has BB. Thus there are four states (AA, AB, BA, BB) and any tree to link these four states will require at least three changes, which is one more than the sum of the two columns considered independently. The incompatibility matrix would have an entry “1” for columns 5 and 6 but a “0”, for example, for columns 1 and 2, because no additional changes are required when these two columns are considered together. All pairs of characters are compared and the result is a symmetric (c x c) matrix where c is the number of characters in the data set. Again it can be shown that it is not possible in general to get back to the original data. Consider an incompatibility matrix

0 0 1
0 0 1
1 1 0        (13)
which is known to be derived from a data set with four taxa. The following six data sets all give the above incompatibility matrix (13).

(a)    (b)    (c)    (d)    (e)    (f)
AAA    AAA    AAA    AAA    AAA    AAA
AAB    BBA    BBA    AAB    BBB    BBB
BBA    AAB    BBB    BBB    AAB    BBA
BBB    BBB    AAB    BBA    BBA    AAB
 I      II     III    I      II     III        (14)
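For binary characters the comparison described for columns 5 and 6 of (1) reduces to asking whether all four combinations of states (AA, AB, BA, BB) occur among the taxa. The following Python sketch uses that test, which is an assumption adequate only for two-state characters and four taxa, to build the incompatibility matrix; the names `incompatible`, `incompatibility_matrix` and `TARGET` are illustrative. It is applied to the six arrangements of the rows BBA, BBB and AAB beneath a first row AAA.

```python
from itertools import combinations, permutations


def incompatible(col_a, col_b):
    """Return 1 if two binary characters show all four state combinations
    among the taxa (one extra change is then needed on any tree), else 0."""
    return int(len(set(zip(col_a, col_b))) == 4)


def incompatibility_matrix(rows):
    """Symmetric c x c matrix of pairwise incompatibilities for a data set
    given as a list of taxon strings."""
    cols = list(zip(*rows))
    c = len(cols)
    matrix = [[0] * c for _ in range(c)]
    for i, j in combinations(range(c), 2):
        matrix[i][j] = matrix[j][i] = incompatible(cols[i], cols[j])
    return matrix


TARGET = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]  # the matrix (13)

# Six data sets: first taxon AAA, the other three rows in every order.
data_sets = [["AAA", *rest] for rest in permutations(["BBA", "BBB", "AAB"])]
for rows in data_sets:
    print(rows, incompatibility_matrix(rows) == TARGET)
```

The same function applied to the data in (1) marks columns 5 and 6 as incompatible and columns 1 and 2 as compatible.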
The numerals I-III beneath each data set in (14) indicate which of the three possible trees (4) for four taxa is minimal for the particular data set a-f. The conclusion is that even data sets that determine different minimal trees can give the same incompatibility matrix. It is not possible to get back to the original data, and therefore the incompatibility matrix does not contain all the information in the original data. It has been demonstrated in earlier sections that the number of data sets (D) often exceeds the number of similarity matrices (S). A similar analysis has been made for incompatibility matrices, but only a brief account will be given because the conclusion is again that there is a common problem
of the possible data sets exceeding the number of incompatibility matrices. The number of distinct columns in the present case is less than with similarity matrices because incompatibility analysis only uses characters where at least two character states in a column occur more than once (Estabrook et al., 1977; Fitch, 1977). It is thus not possible to use the “Stirling numbers of the second kind” (7) without subtracting those columns where only one or no character state occurs twice or more. The new table to replace (7) is

Number of    Number of different states in a column        Total
taxa         1     2     3     4     5     6     7         columns
1            0                                             0
2            0     0                                       0
3            0     0     0                                 0
4            0     3     0     0                           3
5            0     10    15    0     0                     25
6            0     25    75    45    0     0               145
7            0     56    280   315   105   0     0         756
8                                                          3892
etc.
                                                                    (15)
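The entries of (15) can be checked by direct enumeration rather than by a recurrence. The following Python sketch is an independent check, not the derivation used in the text; it lists every distinct column for n taxa and keeps those in which at least two character states each occur more than once. The names `partitions` and `usable_columns` are illustrative.

```python
from collections import Counter


def partitions(n):
    """Yield every distinct column for n taxa as a tuple of state indices,
    e.g. (0, 0, 1, 2) for AABC; a new state may be used only after all
    earlier states have appeared."""
    def extend(prefix, used):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for state in range(used + 1):
            yield from extend(prefix + [state], max(used, state + 1))
    yield from extend([], 0)


def usable_columns(n):
    """Count, by number of states, the columns in which at least two
    character states each occur more than once."""
    counts = Counter()
    for col in partitions(n):
        sizes = Counter(col).values()
        if sum(size >= 2 for size in sizes) >= 2:
            counts[len(sizes)] += 1
    return counts


for n in range(1, 9):
    counts = usable_columns(n)
    print(n, [counts[j] for j in range(1, n + 1)], sum(counts.values()))
```

Enumeration is practical here because the number of columns is only 4140 for eight taxa.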
This table can be derived in several ways, but perhaps the simplest is to use the recurrence relation (for

$\ldots\, Y_{i-1,j}) - (X_{i-1,j-1} - Y_{i-1,j-1})$
where Xij etc. are entries from the table of “Stirling numbers of the second kind” (7), and Yij etc. are entries in (15). Again a simple inductive proof is used. The number of data sets (D’) is calculated as for (9) but using the values of m’ from (15). The number of incompatibility matrices (S’) equals
(in the case considered earlier with four taxa the entries in the incompatibility matrix were “0” or “1”, but with six or more taxa higher values are possible). As with similarity matrices it is found that the ratio D’/S’ is greater than 1 for many cases, and in these cases there cannot be a unique transformation from an incompatibility matrix to the original data. One partial qualification needs to be made, and that is that the original data are usually retained in studies where incompatibility analysis is used and to
this extent the information is not “lost”. In these cases the incompatibility matrix is used, for example, as an aid to building evolutionary trees. But if maximal cliques are derived from an incompatibility matrix, the objection would still remain that not all the original information had been used when discarding some characters.

7. Discussion

The present conclusions are derived only for qualitative multistate characters. These were chosen from analogy with linear programming, in that the problem may be more difficult when only integer values are allowed. It is thus to be expected that the problem discussed in this paper would be even more marked if real numbers were permitted. In general, it cannot be assumed that conversion to distances will be unique until proof is given. Although it has been demonstrated that information is lost in using distances, it will probably be helpful to give some explanation for this. The following example is for six taxa, and when all the information is considered the optimal relationships may be:
For one particular character the following character states may occur:

[Tree diagram for the six taxa, with the character state (A or B) of each taxon marked.]
When taxa two and five are compared for this character there will be no difference (both have B). But when the entire data set is considered it is found (but certainly not proven) that the distance between taxa two and five should probably be 2 for this character. It is when pairs of character states are considered that information loss may occur, and phenetic methods (by definition) use only information from comparisons of pairs of taxa. Although the main conclusion of this work is that distance measures should not be the only data used for determining relationships, there are still occasions in which they may be used. The first case is simply as an aid to other methods such as tree-building, where it may be important to know “the next most similar taxon”. In other studies the information available may only be a similarity matrix, such as for DNA hybridisation studies or with immunological methods (see Wilson, Carlson & White, 1977). There
may be problems with morphological data where the measurements are not independent genetic characters and a transformation is desired. And then there will be cases when there is simply no better biological model available. There are several methods which make fuller use of the data. There are tree-building methods (Eck & Dayhoff, 1966; Fitch, 1977; Foulds et al., 1979), cladistic analysis (Hennig, 1966), methods based on maximum likelihood (Felsenstein, 1973; Fitch & Langley, 1976), and predictive classification (Gower, 1974). Of these, cladistic methods at least can be accused of not using all the information. A recently described tree-building method (Penny, 1981) largely removes the problem for incompatibility analysis because maximal cliques are used as a starting point for adding more characters with additional information. There has been a tendency over recent years towards the use of a more phylogenetic approach, and these methods retain all the information. However, it has not been shown that existing methods use all the possible information in the data, and a subsequent publication will describe methods to make fuller use of the data.

This work was undertaken while on sabbatical leave with Prof. J. Maynard Smith and the Population Biology Group at the University of Sussex, to whom I am grateful for much hospitality. I am also grateful to Drs J. Haigh, J. C. Gower and Prof. R. Sibson for searching for previous work on this topic.

REFERENCES
AHO, A. V., HOPCROFT, J. E. & ULLMAN, J. D. (1974). The Design and Analysis of Computer Algorithms. Reading, Massachusetts: Addison-Wesley.
COHEN, D. I. A. (1978). Basic Techniques of Combinatorial Theory. New York: Wiley.
DARWIN, C. (1960). In: Darwin's Notebooks on Transmutation of Species Part I First Notebook (July 1837-Feb 1838). (G. de Beer, Ed.), Bull. Brit. Museum (Natural History), Historical Series Vol. 2. No. 2.
ECK, R. V. & DAYHOFF, M. O. (1966). Atlas of Protein Sequence and Structure 1966. Silver Spring, Maryland: National Biomedical Res. Foundation.
ESTABROOK, G. (ed.) (1975). Proc. 8th Intern. Conf. on Numerical Taxonomy. San Francisco: Freeman.
ESTABROOK, G., STRAUCH, J. G. & FIALA, L. K. (1977). Syst. Zool. 26, 269.
FARRIS, J. S. (1979). Syst. Zool. 28, 200.
FELSENSTEIN, J. (1973). Syst. Zool. 22, 240.
FITCH, W. M. (1971). Syst. Zool. 20, 406.
FITCH, W. M. (1977). Am. Nat. 111, 223.
FITCH, W. M. & LANGLEY, C. H. (1976). Fed. Proc. 35, 2092.
FOULDS, L. R., HENDY, M. D. & PENNY, D. (1979). J. mol. Evol. 13, 127.
GOWER, J. C. (1974). Biometrics 30, 643.
GOWER, J. C. (1975). In: Proc. 8th Intern. Conf. on Numerical Taxonomy (G. Estabrook, ed.), p. 71. San Francisco: Freeman.
HENNIG, W. (1966). Phylogenetic Systematics. Urbana, Illinois: University of Illinois Press.
JANOWITZ, M. F. (1979). Syst. Zool. 28, 197.
JARDINE, N. & SIBSON, R. (1971). Mathematical Taxonomy. London: Wiley.
LE QUESNE, W. J. (1969). Syst. Zool. 18, 201.
MICKEVICH, M. F. (1978). Syst. Zool. 27, 143.
PENNY, D. (1976). J. mol. Evol. 8, 95.
PENNY, D. (1982). Zool. J. Linn. Soc. 74, 277.
SIBSON, R. (1971). In: Mathematics in the Archaeological and Historical Sciences (F. R. Hodson, D. G. Kendall & P. Tautu, eds), Edinburgh: University Press.
SIMPSON, G. G. (1975). In: Phylogeny of the Primates: A Multidisciplinary Approach (W. P. Luckett & F. S. Szalay, eds), New York: Plenum Press.
SNEATH, P. H. A. & SOKAL, R. R. (1973). Numerical Taxonomy. San Francisco: Freeman.
SOKAL, R. R. & SNEATH, P. H. A. (1963). Principles of Numerical Taxonomy. San Francisco: Freeman.
WILSON, A. C., CARLSON, S. S. & WHITE, J. T. (1977). Ann. Rev. Biochem. 46, 573.