Distance matrix comparison and tree construction

Pattern Recognition Letters 7 (1988) 207-213, North-Holland, April 1988

Alain HÉNAUT and Marie-Odile DELORME

Centre de Génétique Moléculaire, Laboratoire propre du CNRS associé à l'Université Pierre et Marie Curie (Paris VI), 91190 Gif sur Yvette, France

Received 10 October 1987. Revised 1 January 1988.

Abstract: Graphical representation of distance matrices in a Euclidean space allows two distance matrices to be merged when they share elements. The graphical representation of the merged matrices is combined with a robust classification method that can single out species whose membership in a cluster cannot be established with certainty. These possibilities are exploited to test the consistency of phylogenetic trees, and to establish exact relations between species for which several independent distance measurements are available.

Keywords: Principal coordinate analysis, dynamic clustering, multivariate statistics, evolution, phylogeny, rRNA.

Introduction

The need to classify living beings and to search for phylogenetic relationships is as old as biology itself. For a long time this problem was resolved 'by hand', each systematist using his own heuristics, and the results obtained were quite remarkable. Molecular biologists, who rely on comparisons of the primary structures of genes or proteins, relaunched the study of taxonomy in cases where traditional methods were inadequate (prokaryotes, protists, etc.). However, these developments raised new problems:
• Molecular biologists are looking for automatic procedures in order to build 'objective' classifications.
• Direct comparisons can be made only between homologous sequences, whereas one must take into account information coming from different molecular families in order to avoid cases of likeness created by convergence.
• The data concerning a cluster are scattered in the literature. Most often they occur as several distance matrices that overlap only partially.

Before presenting the computerized solutions we propose for these problems, let us briefly review the techniques currently in use; exhaustive presentations are available in the literature (Sneath and Sokal, 1973; Felsenstein, 1982). When one classifies 'by hand', the usual method is to proceed step by step, joining the two nearest objects at each step. At the end of each step, the distances between each new group and the remaining objects are recalculated. This process is repeated until all the objects are grouped into one single cluster. Then a tree is drawn whose nodes represent the successive mergings, the length of the branches being equal to the distance between the two merged objects or groups of objects. Thus an agglomerative hierarchical clustering is obtained. There are different ways to recalculate the 'distance' between an object and a cluster (single-linkage tree, k-means algorithm, etc.), from which the different variants of hierarchical clustering derive. Sneath and Sokal (1973) describe several variants in detail. All these processes can be carried out by hand when the number of objects is small. But they can

0167-8655/88/$3.50 © 1988, Elsevier Science Publishers B.V. (North-Holland)



also be automated; the oldest and most widely used programs ('Single-linkage tree', Sneath, 1957; 'WPGMA' and 'UPGMA', Sokal and Michener, 1958) correspond to such hierarchical clustering methods. It is important to note that the initial distance matrix is the only object that comprises the whole information; the different classification algorithms only serve to give a representation perceptible at a glance. Unfortunately, while this representation is objective (the same algorithm giving the same result on a given table), it is not unique (different algorithms producing different trees). Each tree is only a particular representation of the distance matrix. The most correct way of representing distances is to provide the coordinates of the objects in space. The coordinate matrix gives exactly the same information as the distance matrix, but is easier to draw on a graph. The calculation of the coordinates from the distance matrix has been solved ('Principal Coordinate Analysis', Gower, 1966). Therefore, the problem of relations between species can be considered as a geometrical problem. This simplifies things, and as a consequence we are able to build up different methods for:
• drawing a phylogenetic tree in which the species whose position in the tree is definite can be distinguished from those whose phylogenetic relations are uncertain;
• comparing phylogenetic trees mapped from different molecular families;
• grouping the data drawn from several tables of distances into a single tree.
The method was used to solve a real problem: the construction of phylogenies based on the comparison of ribosomal RNAs.
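The step-by-step 'by hand' procedure reviewed above (join the two nearest objects, recompute the distances to the new group, repeat) can be sketched in a few lines. This is a minimal, illustrative average-linkage (UPGMA-style) sketch, not the cited programs; the function name `upgma` and the nested-tuple tree representation are our own choices.

```python
def upgma(dist, leaves):
    """Agglomerative average-linkage clustering from a distance table.

    dist: {frozenset({a, b}): distance} over the initial leaves.
    Returns the root as a nested tuple (left, right, height), where
    height is half the distance at which the two groups were merged.
    """
    d = dict(dist)
    size = {name: 1 for name in leaves}   # cluster -> number of leaves
    root = None
    while len(size) > 1:
        pair = min(d, key=d.get)          # join the two nearest groups
        a, b = tuple(pair)
        merged = (a, b, d.pop(pair) / 2)
        # recalculate distances to the new group as a size-weighted mean
        for c in list(size):
            if c not in (a, b):
                dac = d.pop(frozenset((a, c)))
                dbc = d.pop(frozenset((b, c)))
                d[frozenset((merged, c))] = (
                    (size[a] * dac + size[b] * dbc) / (size[a] + size[b]))
        size[merged] = size.pop(a) + size.pop(b)
        root = merged
    return root
```

Because each node's height is half the merge distance, the resulting tree is drawn directly from the successive mergings, as described above.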

Euclidean representation of a distance matrix

In a distance matrix, objects are characterized pairwise by a distance index. When the objects are sequences, the distance index is a more or less direct measurement of the substitution rate. Whatever the individuals i and j, a distance index d(i,j) satisfies the following properties:



• d(i,j) ≥ 0,
• d(i,j) = d(j,i),
• d(i,i) = 0.

Representing objects in a system of axes always means finding a Euclidean space E, an orthonormal basis of this space and vectors X(i) such that, whatever the objects i and j:

d(i,j)² = Σₚ (Xₚ(i) − Xₚ(j))²,

where Xₚ(i) is the coordinate of the vector X(i) on the p-th vector of the basis. One can prove (Gower, 1966) that if a Euclidean representation of the individuals in a distance matrix exists, then the scalar product between the objects i and j is necessarily:

X(i)·X(j) = ½[d(i,·)² + d(·,j)² − d(i,j)² − d(·,·)²].

d(i,·)² and d(·,j)² correspond to the mean squares of the distances in the row and in the column of element ij, and d(·,·)² corresponds to the mean square of all the elements of the distance matrix. One can also show that the principal components, calculated by transforming the scalar product matrix into a diagonal form, are a Euclidean representation of the distance matrix. The k first components preserve, as well as possible by orthogonal projection into a k-dimensional space, the distances between the objects. In practice, calculating the principal components is equivalent to finding the eigenvalues and eigenvectors of the scalar product matrix using Householder's algorithm (Foucart, 1984). The eigenvectors are then normalized so that the sum of squares of the i-th eigenvector is equal to the i-th eigenvalue. The normalized eigenvector matrix gives the coordinates of the objects on the axes. In the Euclidean representation, the centroid of all the objects lies at the intersection of the axes. The principal components calculation ensures that the dispersion of the objects around the centroid decreases from axis to axis and that the dispersion is as large as possible along each axis. The transformation into a diagonal form of a distance matrix between 50 objects takes less than 5 minutes with a compiled BASIC program.
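The computation just described can be sketched as follows. This is a hedged illustration, not the authors' program: it uses NumPy's symmetric eigensolver in place of an explicit Householder reduction, and the function name is our own.

```python
import numpy as np

def principal_coordinates(D, tol=1e-10):
    """Principal Coordinate Analysis (Gower, 1966) from a distance matrix.

    D: (n, n) symmetric matrix of distances d(i, j).
    Returns an (n, k) coordinate matrix whose columns are ordered by
    decreasing eigenvalue; only axes with positive eigenvalues are kept.
    """
    D2 = np.asarray(D, dtype=float) ** 2
    # Scalar products from the formula
    #   X(i).X(j) = 1/2 [d(i,.)^2 + d(.,j)^2 - d(i,j)^2 - d(.,.)^2]
    row = D2.mean(axis=1, keepdims=True)
    col = D2.mean(axis=0, keepdims=True)
    B = 0.5 * (row + col - D2 - D2.mean())
    # Diagonalize the scalar product matrix (eigh stands in for Householder)
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    keep = vals > tol
    # Scale each eigenvector so its sum of squares equals its eigenvalue
    return vecs[:, keep] * np.sqrt(vals[keep])
```

By construction the centroid of the returned coordinates lies at the origin, and the pairwise Euclidean distances reproduce the original matrix whenever a Euclidean representation exists.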


Merging two distance matrices

Merging two distance matrices sharing several points is conceptually very easy, since it is a matter of superposing two partially overlapping images. In practice, since the space has more than two dimensions, calculating the rotations and translations required to superpose the two sets of objects is not that simple. The first step consists in scaling the two distance matrices. After that, a Euclidean representation is calculated for each matrix. Thus we have two sets of coordinates xₙ and yₙ (n = 1, 2, ..., N) for the N shared objects, which give their positions in the two Euclidean representations. The problem is to find the orthogonal rotation matrix U which minimizes the function

E = (1/2N) Σₙ (U xₙ − yₙ)²,

where xₙ and yₙ contain the centred coordinates of the shared objects, in order to remove any translation from the problem. Kabsch (1976, 1978) demonstrated that it is easier to determine the rotation matrix starting from a matrix R constructed with the components of the vectors xₙ and yₙ,


R = (rᵢⱼ) = (Σₙ yₙᵢ xₙⱼ),

and from the matrix R̃R, obtained by multiplying the transpose R̃ of R by R. Unfortunately, the algorithm published by Kabsch (1976, 1978) is only valid in the particular case of a three-dimensional space. Generalisation to a larger space raises problems which we solved as follows. Starting from the Kabsch matrix R̃R, the eigenvalues μ₁, μ₂, ..., μₖ (k = space dimension) and eigenvectors a₁, a₂, ..., aₖ are calculated. The μⱼ and aⱼ are used to calculate the vectors

bⱼ = (1/√μⱼ) R aⱼ

required to deduce the rotation matrix U = (uᵢⱼ) = (Σₖ bₖᵢ aₖⱼ). The calculation of the rotation matrix U is only possible for a sub-space in which the eigenvalues μⱼ are not equal to zero. A necessary condition is that the dimension of the superposition space equals at most that of the smaller of the two spaces to be superposed, but this condition is not enough. Moreover, the number of shared objects used as rooted points has to be greater than the number of dimensions of the superposition space (three points are required to define a plane), and the rooted points must not occupy a space of dimension less than that of the superposition space (to define a plane, the three required points must not be colinear). The program calculates the rotation matrix U for the largest sub-space compatible with the conditions we have just listed. The translation matrix is calculated by making the centroid of the N objects in the first representation coincide with their centroid in the second one. Provided in this way with rotation and translation matrices, the last thing to do is to use them to calculate, in the second axes system, the coordinates of the objects of the first Euclidean representation. Then the projections of the objects can be drawn on different orthogonal planes. The quality of the superposition is measured by the distance between the rooted points. The mean error due to the information lost by reducing the dimension of the space is measured by the difference between the distance between objects in the original space and their distance in the sub-space of superposition.

Strong clusters determination

Automatically clustering objects that are close to each other in space makes the analysis of the graphical representations easier. Grouping together near objects proceeds in two successive steps (Diday, 1971):
• one determines different ways to partition the objects into k groups by the dynamic clustering method;
• then, the objects which are always clustered together during the different partitions previously obtained are selected.
To build a partition of the whole object set, one begins by setting a number k of groups and dividing the objects at random into these groups. The following operations are then carried out:
• the centroid of each cluster is determined;
• each object is reallocated to the cluster whose centroid is the nearest;


• these two operations are repeated as long as modifications in cluster composition occur.
This algorithm has the advantage of optimizing a simple criterion of dispersion, the partition variance, and of always converging whatever the case (Diday, 1971). Practical experience shows that this convergence is very quick (average time is 2 seconds for 50 objects). The speed of these calculations allows comparison of the different ways of allocating the objects into k clusters. One notices that most often there are many possible partitions of very similar quality. This information on the plurality of satisfactory solutions is one of the main advantages of this method. The best way to exploit it is to examine the strong clusters. These consist of subsets of objects always gathered in the same final cluster during the different attempts from different initial partitions. A study of strong clusters eliminates the problem of the a priori choice of the number of groups, because the number of strong clusters does not directly depend on the required number of groups (Diday, 1972). A detailed study of the evolution of the strong clusters during the successive partitions makes it possible to determine the way the groups overlap each other. Thus one can distinguish objects which cluster hierarchically, giving larger and larger groups, from those whose membership in a group is quite uncertain.
Now that we have a reliable hierarchy based on strong clusters, a tree can be constructed. To do this, one must define a distance between the groups. In practice, a matrix is built in which δ is the number of times two objects have not been clustered in the same group during all the successive partitions (if X and X' are always in the same group, then δ(X, X') = 0). δ is an ultrametric distance (Diday, 1972).
The tree constructed this way has qualities that trees obtained with classic clustering methods do not share:
• It is reliable, in that it is much less sensitive than the others to small variations in the initial distance matrix.
• It is free from the 'linkage effect', which makes it roughly insensitive to the removal of an object.
• Objects whose affinities are uncertain (midway between two clusters, for example) are brought out. Thus the user knows on which objects he must concentrate if he wants to refine his classification.
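The two-step procedure above (repeated dynamic clustering, then counting co-memberships) can be sketched as follows. This is an illustrative reconstruction under our own assumptions, not the authors' program: the function names and the reseeding of empty groups are our choices.

```python
import numpy as np

def kmeans_partition(X, k, rng):
    """One run of the dynamic clustering loop: random initial allocation,
    then alternate centroid computation and reallocation until the
    cluster composition no longer changes (Diday, 1971)."""
    labels = rng.integers(0, k, size=len(X))
    while True:
        centroids = np.array([X[labels == g].mean(axis=0) if np.any(labels == g)
                              else X[rng.integers(len(X))]   # reseed empty group
                              for g in range(k)])
        new = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        if np.array_equal(new, labels):
            return labels
        labels = new

def cocluster_counts(X, k=4, runs=20, seed=0):
    """Number of times each pair of objects ends in the same final
    cluster over `runs` random initial partitions. Pairs whose count is
    close to `runs` form strong clusters, and runs - count gives the
    ultrametric distance delta used to build the tree."""
    rng = np.random.default_rng(seed)
    C = np.zeros((len(X), len(X)), dtype=int)
    for _ in range(runs):
        lab = kmeans_partition(X, k, rng)
        C += lab[:, None] == lab[None, :]
    return C
```

On well-separated data the co-clustering counts saturate at `runs` within groups and stay near 0 between them; objects with intermediate counts are exactly the uncertain ones the method is designed to expose.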

Biological application

Let us take an example among vertebrates, so that the interpretation of the results is not complicated by biological problems. Qu and Lempereur both tackled classification problems by means of comparison of conserved parts of ribosomal RNA sequences; Qu worked on 28S rRNA and Lempereur on 18S rRNA (Qu, 1986; Lempereur, 1986). We extracted from their theses the data concerning vertebrates. 18 species have been sequenced, and for 12 of them both rRNA sequences are known:
• Species in common: mouse (Mus musculus), chinese hamster (Cricetulus longicaudatus), man (Homo sapiens), chimpanzee (Pan troglodytes), horse (Equus caballus), chicken (Gallus gallus), pigeon (Columba palombus), lizard (Lacerta lepida), viperine snake (Natrix viperinus), xenopus (Xenopus laevis), trout (Salmo gairdneri), carp (Cyprinus carpio).
• Species for which only the 18S sequence is known: mk2 (Homo sapiens), rat (Rattus norvegicus), sheep (Ovis aries), pleurodeles (Pleurodeles sp.).
• Species for which only the 28S sequence is known: macaque (Macaca mulata), pig (Sus scrofa), axolotl (Ambystoma tigrinum).
We began by superposing the Euclidean representations of the two distance matrices, taking the 12 species in common as rooted points. In the common Euclidean representation, the mean distance between the 18S and 28S positions for each of these 12 species is about 6. Two species (the viperine snake and the lizard) are responsible for the high value of this deviation. If the superposition is recalculated minimizing the deviation between the 18S and 28S positions for the remaining 10 common species, the mean distance drops to 3.6. This mean deviation decreases no further when the number of rooted points is reduced again. This is a first indication of a conflict between the two distance matrices, but a further analysis is required to understand why the position of the reptiles raises a problem.


Graphical representation of the two superposed matrices (Figure 1) reveals that the 18S of both the lizard and the viperine snake are close to the birds, while their 28S are near the mammals. Projection onto plane 1-2 does not reveal the relations between the amphibians. This cluster breaks up on the third axis, where the xenopus and pleurodeles 18S and 28S are on one side, and the axolotl and the birds on the other side. The problem clearly comes from the 28S sequences of the reptiles and the axolotl! Strong clusters determination allows further refinement. Figure 2 gives the number of times


two objects have been clustered together during 20 partitions in 4 groups. There are strong clusters; in other words, there are objects clustered together at least 19 times out of 20. Most often, the composition of these strong clusters is natural: we found either species for which the 18S and 28S are very close (mammals, fishes, xenopus and chicken), or consistent sets based on the 18S (birds + reptiles cluster, xenopus + pleurodeles cluster). There are also surprising clusters, such as the association of the viperine snake and lizard 28S with the mammals, and of the axolotl with the birds and reptiles 18S. This result

Figure 1. Representation in a single space of two distance matrices between vertebrates, one based on comparison between sequences of the 18S rRNA (lower-case names) and the other based on the 28S rRNA (upper-case names). Both the lizard and viperine snake 18S are near the birds, whereas their 28S are close to the mammals. Projection onto plane 1-2 does not show the relations between amphibians. This cluster breaks up on the third axis, where the xenopus and pleurodeles 18S and 28S are on one side, and the axolotl and birds on the other side. Species used as rooted points for the superposition: carp Cyprinus carpio, chic Gallus gallus, chim Pan troglodytes, hams Cricetulus longicaudatus, hors Equus caballus, man Homo sapiens, mous Mus musculus, pige Columba palombus, trou Salmo gairdneri, xeno Xenopus laevis. Other species: axol Ambystoma tigrinum, liz and liza Lacerta lepida (28S and 18S), maca Macaca mulata, mk2 Homo sapiens, pig Sus scrofa, pleu Pleurodeles sp., rat Rattus norvegicus, shee Ovis aries, sna and snak Natrix viperinus (28S and 18S).


has already been observed directly on the graphical representation. The pigeon is unclassifiable: it can be gathered with the mammals as well as with the birds + reptiles cluster. Figure 3 shows the relations between the objects as deduced from the overlapping of the strong clusters. Of course, the broad outline of vertebrate phylogeny is present, and in addition the figure gives information on the validity of the distance matrices. We saw that the pigeon 28S was unclassifiable; here this shows up as its oscillations from one cluster to another. The inconsistency in the results relating to the viperine snake and the lizard is found again. This inconsistency is enough to prove that the data we possess are unusable for these two species. However, one cannot establish whether the position of the axolotl is false, because its 18S is not available. As shown here, the possibility of comparing distance matrices coming from independent data is of great interest.


Conclusion

Until now, considerable effort has been invested in getting the most out of a distance matrix constructed from the sequences of a single gene taken from different organisms (Felsenstein, 1982). The software we have developed allows easy combination of information coming from several distance matrices. This software turns out to be very useful to:
• Get the most out of information scattered in the literature, where distance matrices based on comparison of various sequences, and even, though less often, on physicochemical criteria (immunological reactions, nucleic acid hybridisation, etc.), are available. Our program allows the merging of matrices of various origins provided that they have common species. This permits the measurement of the distance between species for which no direct comparison exists.
• Build more reliable phylogenies. We consider that a tree is firmly established when it is based on independent distance matrices. This is a simple and reliable way to get around the problem of 'molecular evolutionary clock' irregularities and some cases of convergence.
Our software is written in GW-BASIC and runs on IBM-PC compatible micro-computers. The programs are compiled to improve CPU time (QuickBASIC 1.02).

Figure 2. Strong clusters determination starting from an initial partition in 4 groups. The array indicates the number of times two species are clustered together during 20 successive attempts. (The array itself is not recoverable from this copy.)

Figure 3. Tree construction from strong clusters overlapping. The figure shows, from left to right, strong clusters determined from an increasing number of initial partitions.

References

Diday, E. (1971). Une nouvelle méthode en classification automatique et reconnaissance des formes: la méthode des nuées dynamiques. Revue de Statistique Appliquée 19(2), 19-33.
Diday, E. (1972). Thesis. Université Pierre et Marie Curie, Paris.
Felsenstein, J. (1982). Numerical methods for inferring evolutionary trees. Quarterly Review of Biology 57, 379-404.
Foucart, T. (1984). Analyse factorielle de tableaux multiples. Masson, Paris.
Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325-338.
Kabsch, W. (1976). A solution for the best rotation to relate two sets of vectors. Acta Cryst. A 32, 922-923.
Kabsch, W. (1978). A discussion of the solution for the best rotation to relate two sets of vectors. Acta Cryst. A 34, 827-828.
Lempereur, L. (1986). Thesis. Université Paul Sabatier, Toulouse.
Qu, L. (1986). Thesis. Université Paul Sabatier, Toulouse.
Sneath, P. (1957). The application of computers to taxonomy. J. Gen. Microbiol. 17, 201-226.
Sneath, P. and R. Sokal (1973). Numerical Taxonomy. Freeman, San Francisco.
Sokal, R. and C. Michener (1958). A statistical method for evaluating systematic relationships. Univ. Kansas Sci. Bull. 38, 1409-1438.