Analysis and Interpretation of Sequence Data for Bacterial Systematics: The View of a Numerical Taxonomist


System. Appl. Microbiol. 12, 15-31 (1989)

P. H. A. SNEATH

Department of Microbiology, Leicester University, Leicester LE1 7RH, England

Received January 8, 1988. Revised September 23, 1988

Summary

The nature of sequence data is considered with particular attention to phenetic and cladistic relationships. The correct use of the terms phylogenetic, phenotypic and genomic is emphasized. Sequence data are genomic, and this is included within the concept of phenetic relationships. The primary aim of taxonomy is to construct phenetic groups, because of their high content of information. Methods of analysis of sequence data are briefly considered. Sequence relationships are subject to statistical sampling error, which is greater for shorter sequences. It is a substantial cause of uncertainty in both phenograms and cladograms. This is illustrated by examples from ribosomal RNA sequences, using a convenient graphic representation of uncertainty. Other sources of uncertainty include similarities between sequences which are not orthologous, i.e. not strictly comparable (such as paralogous similarities between genes that have undergone gene duplication and independent evolution of the duplicates), and imperfect correlation between relationships from oligonucleotide catalogues and the full sequences. Phylogenetic trees of prokaryotes rest heavily on the assumptions that rRNA sequences from different organisms are orthologous and that their evolution has been constant and divergent. There are some instances where 16S and 5S rRNA sequences give discrepancies larger than those expected from sampling error. The explanations are briefly considered. DNA-DNA pairing may be held to reflect the whole genome, and has, in theory, very small sampling error, but its value is limited by the experimental accuracy of the method. If accuracy can be improved the technique may have great potential at all levels of taxonomic rank.

Key words: Bacterial systematics - Taxonomy - Nucleic acid hybridization - Sequence analysis

Introduction

The profound changes in bacterial taxonomy that are implied by molecular data (initially physicochemical data, later sequences; Doty et al., 1960; McCarthy and Bolton, 1963; De Ley et al., 1966; Reich et al., 1966; Sankoff et al., 1973; Hori, 1975; Doolittle et al., 1975; Woese et al., 1975; Woese and Fox, 1977; De Smedt and De Ley, 1977; Stackebrandt and Woese, 1981; Schleifer and Stackebrandt, 1985; Woese, 1987) make it timely to look critically at the interpretation of sequences. Rather than attempt a complete review, I shall consider a set of questions which are tied together by a common theme: the reliability (and conversely, the uncertainty) of conclusions that can be drawn from sequence data. The most serious uncertainty derives from making inappropriate comparisons between sequences where the homology (in the broad sense) is doubtful or incorrect.

This is most clearly seen in protein sequences. It is evident from the history of protein sequences that these problems are seldom obvious until considerable data have accumulated; they then present as anomalies. For this reason statistical and experimental error will first be considered, and other problems will be taken up under Incongruences between Sequences and in the Discussion. Uncertainty over the optimal numerical methods also detracts from reliability (Tateno et al., 1982; Rohlf, 1984); this is now an active area, but it is not yet easy to assess its importance. It must be noted that even long sequences are only samples from the genome, so that conclusions are subject to uncertainty due to sampling error. In the special case of the whole genome (which will be considered in the context of DNA-DNA pairing) it will be shown that for present-day techniques the role played by sampling error is replaced by experimental error. Whether the sequence of the entire genome (if we were able to obtain it) should be considered to be a sample is a question that may be left to the delectation of philosophers.

Some Definitions

It is first essential to emphasize some definitions. I use the terms systematics, classification and taxonomy in the traditional senses of Simpson (1961), Davis and Heywood (1963) and Mayr (1969), as amplified by Sneath and Sokal (1973). Systematics is the scientific study of the kinds and diversity of organisms and of any and all relationships among them. Classification is the ordering of organisms into groups on the basis of their relationships (not confined to relationship by ancestry). Taxonomy is the theoretical study of classification, including its bases, and identification. Classification and taxonomy are often not distinguished where the context is clear. The process of identification does not assume a particular form of relationship; it is sufficient that it yields correct identity of an unknown specimen with a defined taxon.

Taxonomic relationships may be phenetic or cladistic. Phenetic relationship (Cain and Harrison, 1960) is based on numerous observed properties of organisms, more or less equally weighted: it is equivalent to overall resemblance and is measured in units of similarity or dissimilarity (often treated as distances). Cladistic relationship (Cain and Harrison, 1960) is the inferred relationship by ancestry, shown by phylogenetic branching patterns or cladograms, in which the groups are clades. It is measured by time, or in default of time by topological units (numbers of nodes or internodes on a cladogram) or numbers of inferred evolutionary changes. Such units are not phenetic, because they include parallelisms and back mutations. Cladistic relationships are inferred from various types of phenetic relationship with the aid of additional assumptions on how evolution occurs.

The majority of molecular analyses on bacteria have assumed that phylogeny closely mirrors phenetic relations, and have been based on classic phenetic analyses (such as WPGMA and UPGMA cluster analysis, e.g. De Wachter et al., 1985; Hori et al., 1985; Hori and Osawa, 1986; or undefined methods that seem similar, e.g. Stackebrandt and Woese, 1981; Stackebrandt, 1985). More recently explicit cladistic methods have been used in bacteriology (e.g. trees that minimize the amount of evolutionary change, Van Valen, 1982; McDonell et al., 1986; Woese, 1987). Sequence data are often referred to as phylogenetic rather than phenetic, but this is not so. It is doubtful whether data can be called phylogenetic, because the term properly applies to relationships, and hence depends on methods of analysis. Methods of analysis may sometimes be phylogenetic (though, as just mentioned, such methods have as yet seldom been applied in bacteriology). The matter may be put in perspective with an example that is a celebrated triumph of sequence data: the firm placement in the prokaryotes of the blue-green algae (cyanobacteria).

There never was a phenetic analysis (i.e. with numerous unweighted characters) before the genomic sequence studies, because relevant characters were earlier too few. The force of the argument from sequences (Doolittle et al., 1975; Schwartz and Dayhoff, 1978) was little affected by whether the analyses were phenetic or cladistic. There was a phylogenetic analysis, by Humphries and Richardson (1980), which grouped blue-green algae in the traditional way with eukaryotes; however, this used an inappropriate cladistic method (Sneath, 1982) employing a few properties with exaggerated and a priori weights, like the previous botanical classifications. Some earlier proposals did place the blue-green algae with prokaryotes, but again on the basis of exaggerated a priori weights with a few characters, principally on the ground that blue-green algae lacked an organized nucleus.

Confusion often arises between the terms phenetic and phenotypic. Phenotype refers to the way in which genetic information (the genotype) is expressed. The nucleotide sequence coding for an enzyme is genotypic, the amino acid sequence of the enzyme protein may perhaps be regarded as phenotypic, whereas enzymic activities are obviously phenotypic. Messenger RNA and ribosomal RNA can be regarded as formally intermediate between genotype and phenotype (mutations are not inherited, yet must affect protein synthesis), but they reflect the genome so faithfully that they can in practice be treated as genotypic. As one moves from genotype to phenotype the genomic information is less and less directly expressed. The term genomic is convenient to refer to large amounts of information in the genome, or derived from it very directly with few extraneous influences (which would permit inclusion of protein sequences as genomic). One may thus consider most forms of sequence data as genomic. The contrast between phylogenetic and phenetic relations is false: the contrast is between genomic and phenotypic relations.
Genomic relationships are included within phenetic relationships. This is because they express overall similarity of observed (i.e. genomic) properties, not units of time as in cladistic relationships. The latter must be inferred by using additional assumptions.

Taxonomy has a number of aims, not simply to construct a phylogenetic system. A strictly phylogenetic system would present many difficulties and inconsistencies if rigorously applied, caused by hybridization, multiplicity of groupings, alternative definitions of monophyletic groups and the like. A variety of less restrictive concepts are therefore permitted (Simpson, 1961, p. 142; Davis and Heywood, 1963; Mayr, 1969, p. 70; Crowson, 1970, pp. 247-263; Jones and Luchsinger, 1979; Gould, 1980; Mayr, 1981). Extreme solutions would, for example, place birds and crocodiles in one group, snakes and lizards in another, mammals and turtles in a third (Mayr, 1969, p. 71). Or, depending on different opinions about vertebrate fossils, the mammals, snakes and lizards might be a group, and the turtles another. It is difficult to see what practical advantages to scientists (other than palaeontologists) would accrue from some of these arrangements, even if we were confident of them. The situation is worse if one considers the fishes: the typical, actinopterygian, fishes are one clade, but another clade would include crossopterygian fishes, lungfishes, perhaps coelacanth fishes, amphibia, reptiles, birds and mammals (Crowson, 1970, pp. 96, 254). Again, it is difficult to see how scientists could abandon the traditional four groups of protozoa (flagellates, rhizopoda, ciliates and sporozoa), even if imperfect, for the 45 protozoal phyla of Corliss (1984), all of the same rank.

In my view the primary aim of taxonomy is more prosaic, to give a classification that is useful for varied scientific purposes, including identification, and also (a point that is often overlooked) to produce the data-bases that summarize as much relevant information about organisms as possible. Scientific classifications are arrangements which aim to produce groups whose members share the maximum number of common properties, and about which one may make the greatest number of predictive generalizations. These are groups with high information content, natural groups in the sense of Gilmour (1937; 1940). Their relation to phenetics and predictivity is discussed by Gower (1974) and Sneath and Hansell (1985). The distinction from phylogeny is well illustrated by the Periodic System of chemical elements, a classic phenetic taxonomy. The phylogeny of atoms, though of great interest to nuclear physicists, is of little utility to chemists. Phenetic groups, such as the halogens, are useful to them (Sneath, 1988b). This example shows that there is no necessary connection between information content and historical origin; this is also obvious for classifications without relevant historical dimensions (e.g. solvents, Wold, 1975; diseases, Loiseau et al., 1974).

This is no mere pedantic insistence on terminology. If a phylogenetic grouping of bacteria (based on unassailable evidence) was indeed found to exhibit no substantial phenetic coherence it would be of little practical value.
Conversely, if extreme convergence due to strong selection pressures led to two lineages becoming almost identical, there would be little practical value in separating them. Belief in the taxonomic importance of phylogenetic groupings rests on the implicit assumption that there are numerous common properties of such groups to be found on deeper investigation. That is, it is believed that such groups will also prove to be reasonably good phenetic groups. If this assumption is false (as in the instance of the origin of chemical elements, where the phylogenetic groupings have no phenetic coherence) the phylogenetic groupings are of little value outside the restricted field of genealogy. Nevertheless, the genomic groupings, whether phenetic or phylogenetic, pose a challenge to microbiologists to find new properties that characterize such groups, and thus expand the data-base for the organisms.

There is a tendency in molecular studies to try to avoid dependence on phenetics by arguing that certain classes of data are highly conserved. But this term is exceedingly ambiguous, and can easily lead to circular argument. If we use it to mean "accurately reflecting phylogeny" this presupposes that we already know the phylogeny (and also that the data show sufficient but not too much variation). We cannot then use it to support the phylogeny from which it was inferred. If we use it in the sense of "showing little variation within phylogenetic groups" this again presupposes that we know these groups, for otherwise we are prejudging what the groups are. It can also refer to features that on general biological grounds are thought unlikely to change much during evolution, or unlikely to respond to certain (usually unspecified) selection pressures. It can mean "showing little variation within major phenetic groups", but if we then call into question those phenetic groups the term loses most of its meaning. It is often used to mean "invariant". These different meanings of "conserved" are seldom distinguished. Its loose usage is therefore to be deprecated.

Similar problems surround the concept of "primitive organisms". Present-day creatures are all modern and "up-to-date": evidence that some of them are highly similar to their remote ancestors (and have thus experienced prolonged evolutionary stasis) is usually very tenuous in the absence of fossil data (see Discussion).

Analysis of Sequence Data

Sequence relationships are produced by numerical taxonomic analyses (numerical taxonomy: "the numerical evaluation of the affinity or similarity between taxonomic units and the ordering of these units into taxa on the basis of their affinities", Sokal and Sneath, 1963, p. 48). A brief résumé of the steps is as follows. First, a pair of sequences must be recognized in some sense as "the same", and this requires alignment of the sequences to reveal sites or regions that are homologous (or more correctly isologous, because homology due to ancestry is not known at this stage; e.g. two methionine residues are isologous but not necessarily homologous). The significance of this, and associated terms such as orthology and paralogy, will be clearer in the section below on protein sequences. It is sufficient here to point out the logical distinction between observed resemblance and inferred history. Provision must be made for the minimum of insertions and deletions.
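As a rough illustration of the alignment step, the following Python sketch slides one sequence past another and records the offset that gives the most matching positions. It is a deliberately simplified assumption and not the procedure used in any of the cited studies: it makes no provision for insertions or deletions, and the function names `count_matches` and `best_ungapped_offset` are hypothetical names chosen for this example.

```python
from __future__ import annotations


def count_matches(a: str, b: str, offset: int) -> tuple[int, int]:
    """Count identical positions when b is shifted right by `offset` relative to a.

    Returns (matches, overlap), where overlap is the number of compared positions.
    """
    matches = 0
    overlap = 0
    for i, ch in enumerate(a):
        j = i - offset  # position in b that lies opposite position i of a
        if 0 <= j < len(b):
            overlap += 1
            if ch == b[j]:
                matches += 1
    return matches, overlap


def best_ungapped_offset(a: str, b: str, min_overlap: int = 1) -> tuple[int, int, int] | None:
    """Slide b past a (no gaps) and return (offset, matches, overlap) with the most matches."""
    best = None
    for offset in range(-(len(b) - 1), len(a)):
        matches, overlap = count_matches(a, b, offset)
        if overlap < min_overlap:
            continue
        if best is None or matches > best[1]:
            best = (offset, matches, overlap)
    return best


if __name__ == "__main__":
    # Toy sequences invented for the example, not real rRNA data.
    result = best_ungapped_offset("GGACGTACGT", "ACGTACGT")
    print(result)  # (2, 8, 8)
    offset, matches, overlap = result
    print(matches / overlap)  # proportion of matches within the overlap
```

The proportion of matches at the chosen offset is the kind of value that would go into a similarity matrix. A realistic alignment would also have to score insertions and deletions, which this sketch omits.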
The sequences are (in effect) slid past each other and positions of best matching are recorded. Second, the proportion of matches or mismatches between the aligned sequences is calculated. When repeated with numerous sequences this gives a matrix of similarities or dissimilarities between organisms (often expressed as taxonomic distances): these relationships are phenetic, though they may be from genomic data. Third, a number of techniques are used to summarize the taxonomic structure and its implications. The three commonest techniques are phenetic:

(a) cluster analysis, which yields a tree (phenogram) whose levels define phenetic groups of various ranks. Such groups are natural groups in the sense of Gilmour and the philosophers of science (Sneath, 1988a);

(b) shaded similarity matrices, after re-arranging the rows and columns into an order defined by clustering so as to bring like next to like (these give visual representations of phenons as dark triangles, e.g. Hori and Osawa, 1986, Fig. 2);

(c) ordination, in which the positions of organisms are represented in a taxonomic space of two or three dimensions (clusters and their relationships are evident by eye).

The remaining method is:

(d) a cladistic tree whose topology indicates the inferred evolutionary branching. To this may be added inferred time, ancestral properties and amount of evolutionary change. The result is then often described as a phylogeny.

Uncertainty can arise at any of these stages of analysis, in the sense that slightly different techniques (all of which may appear to be plausible models) may yield slightly different results. There may be alternative ways of alignment, different similarity coefficients, and various methods to optimize structure. As a rule (though thorough study has not yet been made) these differences seem minor, and less than effects of statistical sampling error, so that only brief mention will be made of them below.

Protein Sequences

The first molecular sequences to be applied to taxonomy were those of proteins (Doolittle and Blombach, 1964; Margoliash and Smith, 1965; Dayhoff and Eck, 1968). They had an important role in showing that evolution as evidenced by the fossil record was closely mirrored by sequence similarities between proteins belonging to the same protein family. Cytochrome c and globins are the best known examples. They also showed the congruence of sequences with other phenetic data, and gave confidence that sound molecular phylogenetic reconstructions could be made for groups where suitable fossil evidence is not readily available (as in most plants and insects, Fitch and Margoliash, 1967; Boulter et al., 1972; Dayhoff, 1972; Goodman, 1976).
Protein sequences have also contributed to taxonomic problems such as relationships between man and the great apes, and those within orders of mammals and within higher plants (Goodman, 1976; Romero-Herrera et al., 1976; Martin et al., 1985). Protein sequences have not made a large contribution to bacterial systematics, partly because of the small number of available sequences of any one protein family, partly because of doubt of the equivalence of different types of molecule (e.g. cytochromes) or because they appear to have evolved too fast to retain much relevant information (Ambler, 1985).

The sequencing of proteins also led to the realization that the genome is very complex, at least in eukaryotes. Within a broad family of proteins (e.g. globins) there may be subfamilies (e.g. haemoglobins and myoglobins), and the subfamilies may themselves contain further variation (e.g. alpha, beta, gamma, delta haemoglobins). It was such phenomena that forced attention onto different concepts of homology, and led Fitch and Margoliash (1968, 1970) to point out that additional, evolutionary, assumptions are needed if the basically phenetic information (i.e. sequence similarity) is to be used for phylogenetic inferences. The evidence for gene duplication early in a phylogenetic group, with independent evolution thereafter, made it clear that it was critically important to compare the appropriate sequences. Thus alpha haemoglobin in one organism must be compared with alpha haemoglobin in another and not with beta haemoglobin. This led Fitch (1970) to distinguish orthologous proteins (e.g. two alpha haemoglobins, or two beta haemoglobins) from paralogous proteins (e.g. an alpha and a beta haemoglobin).

The globin family is now known to be very complex. Recognition of orthology must often be made, not on external criteria, but from the sequences themselves; thus, a delta haemoglobin is recognized as delta by high similarity to another delta sequence. One mistake may thus affect the identity of many other sequences. Goodman (1976) discusses the difficulty of distinguishing orthology from paralogy as illustrated by different "delta" haemoglobins that appear to have arisen independently in different lineages in primates. More recently the discovery of "silent" genes and noncoding regions of DNA has shown how difficult it is to reach unambiguous phylogenetic interpretations from protein sequences. Harris et al. (1984) give an excellent discussion of the beta haemoglobin gene cluster of primates (other orders of mammals have different beta haemoglobin clusters). The primate cluster contains five functional genes, but also a pseudogene which presumably arose through silencing of a pre-existing functional gene. Such genes are thought to be commonly "turned on" or "turned off" during evolution (so that it may be difficult to know whether functional genes are orthologous).

An illuminating example of paralogy is cited by Fitch and Yasunobu (1975). Duck lysozyme was found to be unexpectedly different from goose lysozyme. The explanation was that they are paralogous, because it was later found that the black swan produces two lysozymes, one very similar to the orthologous lysozyme found in the duck, and the other very similar to the orthologous lysozyme found in the goose.
This illustrates the difference between isology and homology: what was originally observed was the degree of phenetic, isologous, similarity between duck and goose functional lysozymes; to interpret this in terms of cladistic relationship between duck and goose requires additional, and critical, assumptions.

Ambler (1985) points out that unexpectedly large differences between sequences of organisms that are thought to be closely related can be explained away as paralogous gene comparisons. Conversely, unexpectedly high sequence similarity implies orthologous comparison. Indeed one might say that orthology is defined in practice as the highest similarity recognized to date. The high similarity may, however, have a variety of evolutionary explanations: the two organisms in question may share a very recent common ancestor, the rate of evolution of the gene in both lineages may have slowed, there may have been lateral gene transfer from one lineage to the other, or the resemblance may be due to convergence or parallelism. Phylogenetic reconstructions are thus vulnerable to the later discovery of unexpectedly high sequence similarities, and the explanations may prove contentious. Phenetic groupings too are subject to this problem, because operational phenetic homology also relies on the highest one-to-one correspondence in suites of characters (Woodger, 1945; Jardine, 1967; Sneath and Sokal, 1973), but they are less vulnerable to anomalies in single genes because of their avowed aim of subsuming all relevant information.

It can be seen how unreliable the phylogenetic inferences from a single functional gene can be. Use of a wide range of proteins, however, can provide the additional information needed to check such inferences, and gene-sequencing techniques will make such work easier. There is the advantage that it is usually possible to find protein families whose evolutionary rates make them suitable for studying any particular systematic problem. It is also likely that secondary and tertiary structure can make useful contributions. For these reasons renewed study of bacterial protein sequences will be important.

Sampling Error

All samples are subject to sampling error. Suppose, for a given pair of organisms, one observes 6 matches between two sequences, each of length 10, from a particular semantide, A. One will not necessarily observe exactly 6 matches in two sequences of length 10 from another semantide, B, of those organisms. One might observe 4 or 7 matches. In general one will obtain a mean and standard deviation from a series of semantides, A, B, C, D, e.g. 6.2 ± 3.4. The relationship between the organisms here averages about 6 matches out of 10 but this value is very uncertain. The longer the pooled length of the semantides, the smaller the standard deviation will become. Hence the less uncertain will be the estimate of the relationship that one would get from a very large sample of the genome (if one could obtain it).

It should be emphasized that sampling error affects both phenetic and cladistic analyses. A tree from one semantide will differ from a tree from another semantide. In trees, whether phenograms or cladograms, it produces uncertainty of the position of branching points (nodes) and of length of stems (internodes). In shaded similarity matrices it produces blurring of the sharpness of the dark triangles representing taxa. In ordinations it produces uncertainty of the position of the points that represent organisms (so that the ordination behaves like a photograph that is out of focus).

Statistical sampling theory shows that the standard deviation of the numbers of changes on the internode in a cladogram is close to the square root of the number of changes. This indicates approximately the standard deviation of the internode length, so that, for example, an internode for which 25 differences are inferred has a standard deviation of about 5.0, and this is sometimes shown by labelling the internode as 25 ± 5.0. This indicates that though the estimate of the internode length is 25, it may lie within about two standard deviations above and below, i.e. anything from about 15 to 35. This behaviour follows from the theoretical consideration that if evolution is constant and divergent the numbers of changes in unit time will follow a Poisson distribution. Actual data depart somewhat from this behaviour, but departure cannot be gross (because this would imply evolutionary models that would prevent plausible phylogenetic reconstruction). A somewhat more stringent assumption is that the numbers follow the binomial distribution, but in practice this only increases the uncertainty a small amount: these questions have been discussed at length by Fitch and Langley (1976), Astolfi et al. (1981), Kimura (1983) and Sneath (1986).

Standard deviation values on trees, or standard error bars (e.g. Hori and Osawa, 1986, Fig. 3), do not give a good visual indication of uncertainty in trees, so I have developed a graphic method for this that is easier to appreciate (Sneath, 1986). The uncertainty in position of a furcation is shown by dotted or dashed lines. There is a probability of P = 50% that the "true" position (from an infinitely large sample) may be displaced as much as is shown by the dotted lines. Dashed lines indicate that there is only P = 10% chance that the position will lie outside these limits. Thin lines indicate the range P = 5% to 10%, and thick lines indicate that there is only a chance of P = 5% that the node will reach the thick lines. Conversely, the thick lines, for example, represent the parts of the tree that have 95% certainty. The figures presented here were drawn using Method 2 in Sneath (1986), which is a compromise between the Poisson and binomial models. It has a minor disadvantage in that it tends to underestimate uncertainty for nodes that subtend internodes of very unequal length; this method therefore gives slightly optimistic representations of trees.

Examples of Sampling Error

Using this convention Figs. 1-3 show the uncertainty in some cladograms from the recent review of Woese (1987) based on 16S rRNA sequences. They are chosen to illustrate different levels of taxonomic rank. The numbers of differences on internodes of the original figures were obtained by measuring the lengths from the published diagrams and relating them to selected percent sequence similarities in Woese's Tables 17 and 18.
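The square-root rule for internode uncertainty, and the conversion of a percent similarity into a number of differences for an assumed sequence length, can be made concrete with a short Python sketch. It uses only the Poisson approximation described in the text and is not Method 2 of Sneath (1986); the function names are illustrative assumptions, and the similarity of 0.75 in the example is an invented value.

```python
from __future__ import annotations

import math


def differences_from_similarity(similarity: float, length: int) -> float:
    """Number of differing sites implied by a fractional similarity over `length` sites."""
    return (1.0 - similarity) * length


def internode_sd(n_changes: float) -> float:
    """Poisson approximation: the standard deviation is about sqrt(number of changes)."""
    return math.sqrt(n_changes)


def internode_range(n_changes: float, n_sd: float = 2.0) -> tuple[float, float]:
    """Approximate range lying n_sd standard deviations either side of the estimate."""
    sd = internode_sd(n_changes)
    return max(0.0, n_changes - n_sd * sd), n_changes + n_sd * sd


if __name__ == "__main__":
    print(internode_sd(25))     # 5.0, the example given in the text
    print(internode_range(25))  # (15.0, 35.0)
    # Invented example: 75% similarity over an assumed 1600 positions.
    n = differences_from_similarity(0.75, 1600)
    print(n, internode_sd(n))
```

Under this approximation the standard deviation relative to the estimate is 1/sqrt(n), so the uncertainty is governed by the number of observed differences and not by sequence length alone.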
It was assumed that there were 1600 positions on the molecule. There are actually rather fewer, and functional restrictions may make this figure unrealistically high. However, reduction below 1600 would make the uncertainty more marked, so the figures shown tend to be optimistic.

Fig. 1 shows the uncertainty of the cladogram of the major groups of living organisms from Fig. 4 in Woese (1987). The three major clades of archaebacteria, eubacteria and eukaryotes are very firmly indicated. The most uncertain regions involve methanogens, the animal-ciliate-fungal-plant region, and the region of "purple bacteria", cyanobacteria and Gram-positive bacteria (see also Fig. 2). Woese (1987) states that in his opinion the branching order within each Kingdom is correct to a first approximation only; Fig. 1 shows an approximate magnitude of this uncertainty, and provides an explanation (i.e. statistical sampling error). It may be noted that other semantides have failed clearly to resolve the relations between animals, fungi and plants (e.g. Figs. 4 and 6 in Sneath, 1986, on cytochromes, though this of course has only about 100 amino acids, and Fig. 3 in Hori and Osawa, 1986, on 5S rRNA of about 150 residues). Part of the problem may lie in the complex relations of organisms traditionally regarded as fungi, as shown in Hori and Osawa's study.

[Fig. 1 diagram not reproduced. Legible labels include Eubacteria, Archaebacteria and Eukaryotes, with purple bacteria, Gram-positive bacteria, green non-sulphur bacteria, cyanobacteria, Thermotoga, extreme halophiles, methanogens, animals, fungi, higher plants, flagellates and microsporidia.]

Fig. 1. Cladogram of major groups of living organisms based on 16S rRNA sequences (Woese, 1987), and drawn so as to represent the uncertainty due to sampling error. The cladogram is redrawn from Woese, Fig. 4 (using Method 2 of Sneath, 1986, see text). The scale shows the approximate number of base pair differences corresponding to a given length on the tree. The species used to construct the tree are listed in the original figure.

Fig. 2 shows the uncertainty in the cladogram of major groups of eubacteria from Fig. 11 of Woese (1987). The figure of Woese displays some variation within certain of the groups, and this has been omitted in Fig. 2, though without greatly altering the appearance. It is seen that most of the lineages fan out from almost the same point, where the relative order of branching is quite uncertain. Even the deinococci might plausibly belong to this fan of uncertain topology, although it is probable that they branched off first. Only the green non-sulfur bacteria and Thermotoga are convincingly separated from the fan, as noted by Woese (1987).

The same picture is evident from examining individual 16S rRNA sequence similarities, e.g. Table 18 in Woese (1987), where one frequently sees individual similarities that are incongruent with Fig. 2. Thus the cyanobacterium Anacystis nidulans is more similar to the "purple bacterium" Desulfovibrio desulfuricans than to the Gram-positive Clostridium innocuum in Table 18. The relationships between the major groups are thus obtained by some sort of "averaging" of the individual inter-taxon similarities. This may give a more reliable indication of cladogeny than individual values; however, if one averages a number of uncertain values the extent of uncertainty that pertains to the resulting "average" raises difficult statistical problems. One cannot just assume that the uncertainties cancel out in a straightforward way. This is a matter that deserves more study.

The uncertainty in order of branching at the level of traditional families and orders within the best-known Gram-negative bacteria ("purple bacteria" of Woese) is shown in Fig. 3. Because of the dense branching of Fig. 8 of Woese (1987), Fig. 3 has been drawn in a different convention (e.g. Sneath, 1986, Fig. 12). The alpha and delta subdivisions of Woese are reasonably defined. The distinction between the beta and gamma subdivisions is less reliable, as was noted by Woese. Figure 3 shows a number of traditional groupings, but generic groups are not clear; the type species are sometimes not included. Nor is it evident what picture would emerge if many more species were added. The region between beta and gamma subdivisions might become filled in. It is less clear to what extent the topological relations between the previous organisms would change as new ones were added.
It is known that such effects can occur (Wilkinson, 1974; Fortey and Jefferies, 1982; Lanyon,

Fig. 2. Cladogram of major groups of eubacteria from 16S rRNA sequences, showing uncertainty due to sampling error, redrawn from Woese (1987, Fig. 11). Conventions as in Fig. 1. The scale shows the approximate number of base pair differences.

Fig. 3. Cladogram of "purple bacteria" from 16S rRNA sequences (Woese, 1987, Fig. 8), redrawn to show uncertainty due to sampling error. Conventions as in Fig. 1, with scale showing approximate numbers of base pair differences.

1985) but it is uncertain whether this is particularly troublesome for cladistic algorithms (Holmquist, 1979; Tateno et al., 1982; Rohlf, 1984).

Figs. 1-3 illustrate the sampling error from about 1600 nucleotides (16S rRNA). When trees are from smaller sequences uncertainty is greater, though this may be counterbalanced by greater rates of evolution. Fig. 4 is derived from the tree of De Wachter et al. (1985, Fig. 6) from 5S rRNA sequences of a number of Gram-negative bacteria, assuming a mean of 120 nucleotides. The relative extent of dotted and dashed lines due to sequence sizes would be expected to be about √(1600/120) = 3.7 times greater in the tree from 5S sequences. However, the rate of percentage evolutionary change in 5S rRNA appears to be two or three times faster than in 16S rRNA, thus giving more changes for any given taxonomic pair. This would decrease the dotted and dashed regions by a factor corresponding to the square root of the rate difference, say √3 = 1.7. The combined effect would imply that the dotted and dashed regions in Fig. 4 would have an extent of about 3.7/1.7 = 2.2 times that in Fig. 3 for comparable taxonomic relationships. Although only Pseudomonas aeruginosa and Escherichia coli are common to both studies, it can be seen that the extent of dotted and dashed lines at the node subtending these bacteria is indeed about twice as great in Fig. 4 as in Fig. 3. This is also true for other relationships of about the same magnitude. This analysis also shows that a long sequence does not necessarily imply less uncertainty or greater reliability than a shorter one; it must also exhibit a suitable number of differences between organisms.

A tree with relationships of similar size to those in Fig. 4, but from cytochrome c sequences of Ambler and his colleagues, was given by Sneath (1974, Fig. 6). The uncertainty (shown by standard deviations of branch lengths) is rather less than that for 5S rRNA sequences, because although the sequence lengths are about the same there are more changes in the cytochrome. The difference between a minimum length tree and UPGMA phenogram is small, and marginally significant.

Fig. 4. Tree of Gram-negative bacteria from 5S rRNA sequences (De Wachter et al., 1985, Fig. 6), redrawn to show uncertainty due to sampling error. The original D units are almost proportional to numbers of base pair differences (shown in the scale) for small values of D. The differences were estimated assuming 120 base pairs, and then allocated equally to corresponding stems. Thus, Beneckea harveyi and Photobacterium phosphoreum are related at about 0.12 D or 14 differences; the stems subtending them are drawn as representing 7 base pair differences each. The Escherichia coli cluster represents a number of different strains. Other conventions as in Fig. 1.

Fig. 4 illustrates another point: the graphic method gives a lacy appearance where many stems join in a small region. This implies that the detail within this region is uncertain, and the lacy appearance is well marked for the strains of Escherichia coli. If only one strain of E. coli had been included it would be subtended on a short solid line, just like the strain of Proteus vulgaris. Although the lacy appearance of the E. coli strains is quite appropriate (because the fine detail of their branching must indeed be uncertain), this effect may sometimes be misleading if the reason is not kept in mind.

Oligonucleotide Catalogues

Oligonucleotide catalogues provided most of the early information on rRNA, though they are now largely superseded by sequences. The catalogues may be used to illustrate two additional concepts: (1) the amount of uncertainty involved in two different ways of studying the same semantide, and (2) the way in which the uncertainties from different sources (experimental and sampling) add together. Oligonucleotide catalogues have been used to derive taxonomic relationships, expressed as SAB values. These values bear a curvilinear relationship to 16S rRNA sequence similarity (Woese, 1987, Fig. 2). More important than the shape of the curve is the extent to which one can be predicted from the other, i. e. how close the points lie to the fitted line. Agreement from this point of view is only moderately good. Over much of the useful range the standard deviation of sequence similarities for a given SAB value is almost 2%, so that the sequence similarity predicted from it may lie over a range of about 7% (at ± 2 s.d.).
This may introduce considerable doubt in deciding taxonomic relations at the generic, familial and ordinal levels. Fig. 5 shows cluster analyses of organisms from Table 18 of Woese (1987) on percent 16S rRNA sequence similarity (Fig. 5a) and percent SAB (Fig. 5b). There are some differences, involving mainly the cyanobacterium Anacystis nidulans. Also, the relations between Agrobacterium, Desulfovibrio and Escherichia differ in 5a and 5b (and neither agrees with the relations in Fig. 8 of Woese, 1987, where Desulfovibrio is represented as the lowest branch of the three). Uncertainty can be shown if the data are converted to numbers of base pair differences (Figs. 5c, 5d). We lack good statistical models to convert SAB values to equivalent sequence differences if error is to be taken into account, so that Fig. 5d is only a rough indication. It does, however, illustrate an important point: that uncertainty may need to be increased to take account of additional types of error. In this instance the standard deviation of a node, typically 5-6 base pair differences, is increased to almost

Fig. 5. UPGMA phenograms of eubacteria from 16S rRNA (Table 18 of Woese, 1987). (a) Percentage sequence similarities. (b) Percentage SAB values. (c) Percentage sequence similarities, (a), redrawn to show uncertainty due to sampling error after converting percentage similarity to numbers of base pair differences (as shown on scale). (d) Percentage SAB values, (b), redrawn to show uncertainty due both to sampling error and to the lack of correlation between SAB and sequence similarity (see text) after transforming to sequence similarities (see below). Other conventions as in Fig. 1. SAB values were converted to approximate sequence similarities from the equation in Fig. 2 of Woese (1987) using an exponent of 1/5.5. The uncertainty due to lack of correlation between SAB and sequence similarity was estimated at about 1.7% from Fig. 2 of Woese, and the total uncertainty was obtained as the square root of the sum of the two variances.

30 differences because of the extra component due to imperfect correlation between sequence similarity and SAB. This explains the greater uncertainty in Fig. 5d compared to 5c.

DNA-rRNA Relationships

The physicochemical estimation of DNA-rRNA similarity is a valuable technique, pioneered by De Ley and his colleagues (De Smedt and De Ley, 1977; De Ley et al., 1978, and foreshadowed by the ribosome serology of Barbu et al., 1961). It is particularly useful at intermediate taxonomic levels in bacteria. The results are usually presented in two forms: (a) as two-dimensional plots of median denaturation temperature Tm against percent rRNA binding for DNA of numerous strains against a single reference rRNA, or (b) as dendrograms based on Tm with an average linkage method using results from a number of reference rRNAs. Such studies have shown many of the groupings at familial and ordinal levels that were later confirmed by 5S and 16S rRNA sequences, and the concordance between the two methods is impressive.

The reliability of the DNA-rRNA data has not yet received critical study with a sound theoretical basis. The following tentative evaluation for the family level is offered. The number of nucleotides involved from 5S, 16S and 23S rRNA is almost five thousand in all, and though the work may be based on separated 23S fractions (e. g. De Ley et al., 1986) this will only reduce the number to about three thousand. This would imply that the sampling standard error for several hundred differences (which would be expected between members of different families) is about 20 nucleotides, equivalent to about 5% accuracy. The experimental error of physicochemical estimation is probably also about 5%, if its accuracy is similar to DNA-DNA pairing (see later). Although we do not know how these errors combine, or what other factors affect the system, it is likely that the overall accuracy will be close to the square root of the sum of the variances, i. e. about 7%. Therefore one would expect that interfamily distances will be accurate to about 7% in both the two-dimensional plots and the trees. Because experimental accuracy probably does not vary much with taxonomic level, one would surmise that accuracy at the genus and species level would not be much lower, but would remain at a similar value.
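As a rough numerical check of this evaluation, the sketch below reproduces the arithmetic in Python. It assumes that a count of differences has a standard error equal to its square root, and that sampling and experimental errors are independent. The value of 400 differences is a hypothetical stand-in for "several hundred", and the function names are illustrative only. The last lines apply the same rule to Fig. 5d, on the further assumption that the 1.7% quoted in its legend refers to a sequence of about 1600 nucleotides.

```python
import math

def relative_sampling_error(n_differences):
    """Relative standard error of a count, assuming s.e. = sqrt(count)."""
    return math.sqrt(n_differences) / n_differences

def combine_independent(*components):
    """Square root of the sum of the variances of independent components."""
    return math.sqrt(sum(c ** 2 for c in components))

# DNA-rRNA: hypothetical 400 differences between members of different families.
n_diff = 400
sampling = relative_sampling_error(n_diff)   # sqrt(400) / 400
experimental = 0.05                          # about 5%, as suggested in the text
overall = combine_independent(sampling, experimental)
print(f"sampling error: {sampling:.1%}, overall: {overall:.1%}")

# Fig. 5d: node s.d. of 5-6 differences plus 1.7% of an assumed 1600 nucleotides.
node_sd = 5.5
correlation_sd = 0.017 * 1600
fig5d_sd = combine_independent(node_sd, correlation_sd)
print(f"combined s.d.: {fig5d_sd:.1f} differences")
```

Under these assumptions the overall error comes to about 7%, and the combined standard deviation for Fig. 5d to a little under 28 differences, which is of the same order as the "almost 30" quoted in the text.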


Incongruences Between Sequences

There is usually excellent concordance between different sets of sequence data. The question arises whether there are discordances in relationships, based on different sequences, that are too large to be plausibly explained by sampling error. Such incongruences are doubtless of interest to thoughtful readers. They are also important for understanding how evolution has occurred, and for determining orthologous and paralogous semantides.

One such discordance, at the highest taxonomic level, is the position of the archaebacterial groups in relation to eubacteria and eukaryotes. This is partly related to the difficulty in ascribing the ancestral root, which Woese (1987) discusses. A problem remains nevertheless: the relationships of these groups from 16S and 5S rRNA sequences differ more than statistical considerations suggest. Fig. 6 shows relevant parts of the trees for 16S and 5S data (Woese, 1987, Fig. 4; Hori and Osawa, 1986, Fig. 3; De Wachter et al., 1985, Fig. 4; employing one organism from each of the three groups). Figs. 6b and 6c differ principally because of different formulae to correct for double mutations. It is difficult to believe that the discrepancy between 6a and the other two could be due to chance alone (even if the omission of side branches has somewhat reduced the error around the central node). Another explanation is therefore required.

There are other examples that are suggestive though not conclusive. The 16S rRNA sequences (Woese, 1987) indicate that blue-green algae (cyanobacteria) are closer to the Gram-positive bacteria than to the classical Gram-negative bacteria, whereas 5S rRNA studies (Hori and Osawa, 1986, Fig. 3; De Wachter et al., 1985, Fig. 4) imply the reverse. The discrepancy may be accounted for by sampling error; the dotted region in Fig. 2, and the standard error bars of Hori and Osawa (which agree well), are large enough to make this likely, and support the view of Woese that the available data cannot yet give a definitive answer.

More convincing (Fig. 7) is the considerable discrepancy in relative length of the branch leading to Sulfolobus in comparison to those to Halococcus, Methanospirillum and Thermoplasma in the 16S figure of Woese (1987, Fig. 13) and the 5S figure of De Wachter et al. (1985, Fig. 8). It is relatively almost twice as long in the 16S tree. On the basis of 16S rRNA one would expect the number of changes on the 5S tree to be about 60 ± 7.7 rather than the observed 35 ± 5.9. Unfortunately different species of Sulfolobus were studied, and it is not certain that the two species are closely related.
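The comparison of expected and observed changes on the Sulfolobus branch can be made explicit. The sketch below assumes, as the quoted limits suggest, that each count has a standard deviation equal to its square root, and that the two counts are independent. The ratio computed at the end is an added illustration and is not a test used in the text.

```python
import math

expected = 60   # changes predicted on the 5S branch from the 16S tree
observed = 35   # changes actually found on the 5S branch

sd_expected = math.sqrt(expected)   # about 7.7
sd_observed = math.sqrt(observed)   # about 5.9

# Standard deviation of the difference between two independent counts.
sd_difference = math.sqrt(expected + observed)
ratio = (expected - observed) / sd_difference

print(f"{expected} ± {sd_expected:.1f} versus {observed} ± {sd_observed:.1f}")
print(f"difference = {expected - observed}, about {ratio:.1f} standard deviations")
```

On these assumptions the two counts differ by roughly two and a half standard deviations, which is in keeping with the description of this case as more convincing than the preceding one.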

Fig. 6. Relationships between eukaryotes, eubacteria and archaebacteria from 16S and 5S rRNA sequences, drawn to show uncertainty due to sampling error (conventions as in Fig. 1, with scales showing numbers of base pair differences). (a) Relations based on 16S rRNA from Woese (1987, Fig. 4, using Gram-positive bacteria, Methanococcus, Zea mays). (b) Relations based on 5S rRNA from Hori and Osawa (1986, Fig. 3, using Gram-positive bacterial group A, "metabacteria", i. e. archaebacteria, green plants). (c) Relations based on 5S rRNA from De Wachter et al. (1985, Fig. 4, using Gram-positive bacteria, archaebacteria, embryophytes). The triads in (b) and (c) are similar apart from scaling factors due to somewhat different similarity coefficients employed. The uncertainty in (b) is close to that implied by the standard error bar in Hori and Osawa (1986).

There are other examples of 16S and 5S discrepancies, though they are of less interest in bacteriology. In Fig. 8 (the trees are unrooted; the lower stems only indicate they lead to the prokaryotes) the position of flowering plants and yeasts differs (Woese, 1987, Fig. 4; Hori and Osawa, 1986, Fig. 3). The 5S data imply that both these groups branched off before the protozoa-metazoa divergences, whereas the 16S data imply that both branched off after the protozoan lineages. Again the discrepancy seems too large for sampling error. Another instance involves the cellular slime moulds.

What are the likely explanations of discrepancies that cannot be attributed to sampling error? It has been noted


earlier that assertions that certain sequences are conserved rest on shaky logic. One may believe that relations inferred from the longer sequences are, on statistical grounds, the correct ones (and in a broad sense they are more likely to be correct than relations from short sequences). Yet this does not itself explain the discrepancy, because statistics has already been taken into account. The commonest explanation is that discrepancies are due to different rates of evolution in different semantides and lineages, and this may seem acceptable when one tree can be converted into another by a monotonic transformation. The positions of the roots are irrelevant because the trees in Figs. 6-8 are unrooted. The unexpectedly long branch to Sulfolobus in the 16S tree in Fig. 7 might well be explained by mosaic evolution (i. e. differential rates of evolution in different genes), in this instance by a burst of rapid 16S rRNA evolution in the lineage leading to Sulfolobus.

However, there are other explanations. The incongruences may be due to lateral gene transfer, mistaken homology or to evolutionary convergence. Methods for distinguishing such explanations (Sneath, 1974; Sneath et al., 1975) remain to be applied to bacterial data. The obvious explanation may sometimes be wrong. Ambler (1985) notes that evidence for alternative explanations is apt to be dismissed or ignored. It should also be remembered that explanations depending on numerous ad hoc assumptions cease to carry conviction: one can explain anything by making enough special assumptions. Further, differential rates cannot readily account for major topological differences, such as the position of flowering plants and yeasts referred to above. Even the advent of much larger sequences (e. g. from 23S rRNA) will not solve many of these problems.

DNA-DNA Relationships

We now turn to the prospects of obtaining accurate relationships from the entire genome.
The concept goes back at least to Ehrlich (1964), who propounded it with admirable clarity, and some believe that complete genome sequences will soon be available. DNA-DNA pairing is in principle a method of achieving this. Though multiple-copy DNA interferes in higher organisms (so that techniques to remove it must be employed) it plays a minor part in prokaryotes. The greatest difficulties at present are to obtain useful values below about 60% pairing (though opinions vary widely on this) and to achieve very high experimental accuracy. It is also true that we do not yet know how gene duplications, deletions, inversions and "nonsense DNA" affect the degree of DNA-DNA pairing between organisms of the same phylogenetic relationship (e. g. three species which share almost the same most recent common ancestor). The ideal is not in sight. Nevertheless let us speculate on what may be possible if these problems can be overcome. Extensive work is now being done on higher organisms, as noted in the Discussion. Theoretical considerations lead to an important conclusion: that sampling error is very small. In principle,

Fig. 7. Relations between selected archaebacteria as shown by 16S and 5S rRNA sequences, drawn to show uncertainty due to sampling error. Scales show numbers of base pair differences. Conventions as in Fig. 1. (a) Relations from 16S rRNA (Woese, 1987, Fig. 13). (b) Relations from 5S rRNA (De Wachter et al., 1985, Fig. 8).

therefore, the method has great power to discriminate fine detail in trees. This has been illustrated (Sneath, 1986; 1988b) with DNA-DNA pairing data from Sibley and Ahlquist (1981, 1983) on birds. The genome of birds contains about 2 × 10^9 single-copy nucleotide pairs, so that a mispairing of 1% (i. e. 99% pairing) corresponds to about 2 × 10^7 nucleotide differences, whose sampling error is about the square root, viz. 4.5 × 10^3. Relative to internode lengths this is so small that in drawing trees with dots and dashes one cannot show the sampling error graphically. Bacterial genomes are smaller (e. g. 4 × 10^6 nucleotide pairs), but the sampling error is still very small.

In DNA-DNA pairing it is experimental error that dominates the situation. I am grateful to my graduate student Trudy Hartford for some findings on this. Cited standard deviations (e. g. Potts and Berry, 1983) and internal evidence in publications suggest that the standard deviation of a typical DNA-DNA pairing value is 4-8%. Its effect can also be represented in trees (Sneath, 1986, Fig. 11; Sneath, 1988b, Fig. 7). To illustrate this an example has been taken from the DNA-DNA pairing of strains of Bacillus circulans (Nakamura and Swezey, 1983). DNA-DNA pairing was estimated to have s. d. of 3.5%. The genome of species of Bacillus contains about 4 × 10^6 nucleotide pairs (Gillis et al., 1970). Fig. 9a is a UPGMA phenogram of relations between 17 strains of B. circulans, displayed as an unrooted tree. There is a tight cluster of eight strains (1, 2, 6, 7, 8, 9, 10, 11) whose members differ from one another very little: nevertheless they could differ from one another

by as much as 40,000 base pairs (corresponding to 99% pairing). Strain 13 is close to this cluster and the remaining strains are widely different from all others. The numbers of base pair differences are so large that it is not possible at the scale of Fig. 9a to represent any dotted furcations. Even the largest dotted and dashed region would occupy less than 0.1 mm on the figure. Such data would in theory, therefore, be able to distinguish very fine detail with great reliability (though some transformation of the metric is needed, see Discussion). In reality, however, the power of the method is limited by experimental error, and when the distances of the internodes have been represented as dE units (which are the integral differences that would correspond to the mean standard deviation of DNA-DNA pairing values, Sneath, 1986) the resulting tree is very different (Fig. 9b). It is seen that none of the detail within the tight cluster can be regarded as significant. Very probably strain 13 should be part of the same cluster (see Discussion). The order of branching near the centre of the diagram cannot be considered very reliable. It is likely that strains 3 and 4 each represent a tight cluster, but in view of the fact that the model used for dE units may well underestimate the experimental error of DNA work (for reasons under study in this laboratory) it is conceivable that strains 3 and 4 may belong to one tight cluster similar to that indicated on the left of Figs. 9a and 9b. It is, however, very implausible that the isolated strains are closely related to any others.

Fig. 8. Relations between eukaryotes as shown by 16S and 5S rRNA, drawn to show uncertainty due to sampling error. Scales show numbers of base pair differences. 16S rRNA relations from Woese (1987, Fig. 4). 5S rRNA relations from Hori and Osawa (1986, Fig. 3). Conventions as in Fig. 1. Organisms that represent groups are listed in original figures. The trees are unrooted; the lower stems lead to the prokaryotes.

Fig. 9. Relationships of strains named Bacillus circulans based on DNA-DNA pairing (Nakamura and Swezey, 1983, Table 3, with strains numbered 1 to 17 in the same order as in their Table). The percentages were converted to percent dissimilarities, d, clustered by the UPGMA method and displayed as unrooted trees showing uncertainty. (a) Relationships expressed as numbers of base pair differences (as shown by scale). Sampling error is too small to indicate graphically. (b) Relationships expressed as error units of Sneath (1986) as shown by scale. The mean error of a DNA pairing value was estimated as about 3.5% from the differences between pairs of values that ought to be equal (because two strains involved showed 100% DNA-DNA similarity, Hartford and Sneath, in preparation). The percent dissimilarities d were converted to error units, dE, by multiplying by Σd/Σs², where Σd is the sum of the 136 dissimilarities in the table and Σs² is here 136 × 3.5². The dE units correspond to integers whose sampling error behaves in a similar fashion to numbers of differences.
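The contrast between sampling error and experimental error in DNA-DNA pairing can be sketched numerically. The Python example below makes two assumptions: that the sampling standard deviation of a number of base pair differences is its square root, and that the scaling factor in the legend of Fig. 9 is Σd/Σs² with a constant s of 3.5%. The dissimilarities used are hypothetical and are not the values of Nakamura and Swezey; the function names are illustrative only.

```python
import math

def sampling_sd(genome_pairs, percent_mispairing):
    """Return (differences, s.d.) assuming s.d. = sqrt(number of differences)."""
    differences = genome_pairs * percent_mispairing / 100.0
    return differences, math.sqrt(differences)

def to_error_units(dissimilarities, sd_percent):
    """Scale percent dissimilarities by sum(d) / sum(s^2), with a constant s."""
    factor = sum(dissimilarities) / (len(dissimilarities) * sd_percent ** 2)
    return [d * factor for d in dissimilarities]

bird_diff, bird_sd = sampling_sd(2e9, 1.0)          # birds, 1% mispairing
bacillus_diff, bacillus_sd = sampling_sd(4e6, 1.0)  # Bacillus, 1% mispairing
print(f"birds: {bird_diff:.0f} differences, s.d. {bird_sd:.0f}")
print(f"Bacillus: {bacillus_diff:.0f} differences, s.d. {bacillus_sd:.0f}")

# Hypothetical percent dissimilarities, for illustration only.
example_d = [1.0, 5.0, 40.0, 70.0]
example_dE = to_error_units(example_d, sd_percent=3.5)
for d, dE in zip(example_d, example_dE):
    print(f"d = {d:5.1f}%  ->  dE = {dE:6.1f}, sqrt(dE) = {math.sqrt(dE):4.1f}")
```

For the Bacillus genome a sampling standard deviation of 200 base pairs amounts to 0.005% of 4 × 10^6 nucleotide pairs, far below an experimental standard deviation of 3.5%. This is why the error units, and not the raw numbers of differences, govern what detail can be trusted in Fig. 9b.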


Discussion

The distinction between phenetic and phylogenetic groups may seem to be esoteric. The biggest difference in practice is between (a) systems built on a few characters given arbitrary or preconceived importance and (b) those that employ numerous characters without large, or deliberately a priori, weighting. Biology abounds in unfortunate results from the first methodology (Davis and Heywood, 1963, pp. 12-30; Mayr, 1969, p. 205). It particularly plagued early attempts at bacterial classification (Bulloch, 1938; van Niel, 1946). The long insistence that cyanobacteria should be retained among eukaryotes as the blue-green algae was based on just such methodology, and it was again illustrated by Humphries and Richardson (1980) by their choice of only two characters, chlorophyll a and an organized nucleus, for their analysis. Although it should be noted that there were often very few characters available, this methodology did not encourage a search for numerous new ones. The biggest change came with the avowed aim of phenetics to sample very large numbers of characters in order to estimate the relationships between organisms. Once given large samples of characters, the differences in results from phenetic or cladistic analyses seem generally rather small, though there are exceptions.

The question of whether there are major incongruences between relationships from genomic data and large suites of phenotypic characters is important for our understanding of evolution and for practical reasons. This is best illustrated by referring to the cladogeny of fishes discussed earlier, which relies on fossil evidence. The morphological and behavioural characters of the lungfishes, coelacanths and crossopterygian fishes are very similar to those of the typical actinopterygian fishes (Crowson, 1970, pp. 96, 254). Yet it may well turn out that the molecular pattern of the first three will be in accordance with phylogeny, i. e.
more similar to the pattern of the amphibia, reptiles, birds and mammals (which share the same clade) than to the pattern of the actinopterygian clade. This would show as a major incongruence between two large suites of characters. It could be explained by fossil evidence. But in bacteria, without fossils, it would be an unexplained anomaly.

Phylogenetic reconstructions rely heavily on the assumption that evolution in orthologous characters is broadly divergent and reasonably constant (Felsenstein, 1978), the "molecular clock hypothesis". It is clear that the clock is not very accurate (Fitch and Langley, 1976; Kimura, 1983; Dover, 1987), but when stochastic factors and averaging effects are considered it is not too bad for proteins. It is still not clear how steady is the change in DNA. Dover (1987) points out that the apparent constancy may be due to averaging of bursts of rapid change in different parts of the genome. Kimura (1983) considered that available studies suggest reasonable constancy in the noncoding regions of DNA, and studies such as that of Harris et al. (1984) imply that pseudogenes commonly evolve at a rather steady rate. This raises a question of some practical significance: is phylogeny better reflected by genes subject to strong selection, or by pseudogenes? The latter are presumed to be

under little selection because they do not lead to functional gene products. One could argue, for example, that ribosome sequences are very reliable indicators of phylogeny because they are under strong selection, or that they are unreliable for the same reason. They are, of course, under strong constraints against entirely haphazard changes, because these would impair their function, but the same can be said of any functional genes. On the other hand, pseudogenes, being uninfluenced by such constraints and by the environment, would be expected to accumulate changes at a fairly steady rate (though as noted by Harris et al., 1984, they evolve faster than related functional genes, presumably because of the lack of constraints). The logic of such arguments is by no means clear. Although pseudogenes may be absent or rare in prokaryotes, there are numerous genes whose selective values are probably very low. So the dilemma (and therefore the question of whether one needs to know which genes are subject to most or least selection) still remains in another guise.

Despite the lack of fossil data in bacteria, it is important to realize that there are methods of analysis that can explain some phylogenetic anomalies. This is best illustrated with a case where fossil evidence exists. The living coelacanth fish Latimeria appears to have evolved in its morphological features very little since those ancestors that were also the ancestors of the higher vertebrates; it exhibits morphological stasis. There are sequences cited by Dayhoff (1978) for triosephosphate isomerase for Latimeria and for rabbit and chicken (also for parvalbumin, but this shows probable paralogy, so it will not be considered). Can one say from the sequences of living forms whether there has also been stasis in the molecular detail of the genome in the Latimeria lineage? At first sight one might believe there was no answer.
The number of differences between the sequences of Latimeria-rabbit, Latimeria-chicken and rabbit-chicken are, respectively, 42, 55 and 35. The usual methods of tree construction (e. g. Farris, 1970, 1972; Moore et al., 1973) would root it in a common ancestral fish below Latimeria and allocate about 30 changes to the path from Latimeria to the rabbit-chicken furcation. But one could not distinguish the following alternatives: (1) no changes between Latimeria and the root, with molecular stasis in this lineage (and consequently all 30 changes concentrated between the rabbit-chicken furcation and the root); and (2) changes evenly distributed, in which case about 18 would be below Latimeria and there would be no molecular stasis. In this instance the second alternative, but not the first, fits the geological dates of branching, so one infers Latimeria does not show stasis in molecular evolution. Fortunately, stasis can be detected without fossils if there are numerous and sufficiently long sequences. This is because the properties of trees are such that stasis in a segment will show up as discrepancies in the matrix of sequence differences, and perhaps in discrepancies between matrices from different semantides (Sneath, 1974; Sneath et al., 1975). Differences in rates in bacterial rRNA have recently been discussed by MacDonell et al. (1986).
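The allocation of about 30 changes to the Latimeria path can be reproduced with a simple additive apportionment of the three pairwise differences. The Python sketch below uses this three-point calculation as a stand-in; it is not necessarily the procedure of Farris or of Moore et al. cited above, and the function name is illustrative only. The loop then shows why the pairwise differences alone cannot place the root along the Latimeria path.

```python
def three_taxon_branch_lengths(d_ab, d_ac, d_bc):
    """Branch lengths from the central node to taxa a, b and c,
    assuming the three pairwise differences are additive on the tree."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c

# Latimeria-rabbit 42, Latimeria-chicken 55, rabbit-chicken 35.
latimeria, rabbit, chicken = three_taxon_branch_lengths(42, 55, 35)
print(latimeria, rabbit, chicken)  # 31.0 11.0 24.0

# Any position of the root on the Latimeria path gives the same observed
# differences, so alternatives such as complete stasis (0 changes below
# Latimeria) cannot be told apart from these three numbers alone.
for below_latimeria in (0, 15, 31):
    root_to_furcation = latimeria - below_latimeria
    d_lat_rabbit = below_latimeria + root_to_furcation + rabbit
    d_lat_chicken = below_latimeria + root_to_furcation + chicken
    print(below_latimeria, root_to_furcation, d_lat_rabbit, d_lat_chicken)
```

The path from Latimeria to the rabbit-chicken furcation comes to 31 changes, in agreement with the "about 30" of the text, and every root position in the loop returns the same differences of 42 and 55.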


Before considering the two major problems of sampling error and orthology it may be noted that chemotaxonomy has frequently confirmed molecular sequence data. Many good examples are listed by Schleifer and Stackebrandt (1983). However, not all chemotaxonomic findings are concordant; thus quinones are useful in many groups but in the Haemophilus-Pasteurella-Actinobacillus group are poorly correlated with DNA-DNA pairing (analysis of data of Mannheim, 1981, Fig. 3, shows correlation of only about 0.15). There is a temptation to notice findings that agree and overlook those that do not. Numerous chemical properties would, of course, imply a phenetic analysis, giving phenetic relationships (reviewed by O'Donnell, 1985).

A major difficulty with currently available molecular sequences is the fact that they are still only samples of relatively small size. Sampling error will therefore continue to dominate the picture. Work is also needed on the stability of cladistic methods. There is some indication that cladistic methods may lead to more instability of topology than cluster methods when new organisms are added (Fortey and Jefferies, 1982; Jones and Young, 1983; Rohlf, 1984). Such behaviour would be damaging to taxonomy through introducing repeated changes in the groupings. Considerations of reliability and statistical significance have been affirmed from the earliest attempts at numerical phenetics (Michener and Sokal, 1957; Sneath, 1957; Sokal and Sneath, 1963). A main conclusion from the present survey is that such considerations are also urgently needed in cladistic analyses of molecular sequences.

Recognition of orthology is a persistent problem. Ambler (1985) notes that in all single-gene protein phylogenies there are some organisms for which anomalous evolution seems to have occurred. Gene duplication and paralogy account for many instances; Ambler cites a number in bacteria, e. g. cytochromes. Convergence and parallelism also have effects.
A high proportion, often 50%, of sites in most protein families show parallelism (Guise et al., 1982). Baba et al. (1981) found anomalously close relationships between crustacean and horse cytochrome c, which led to the prawn lineage becoming joined in one analysis to the horse and donkey: this was interpreted not as a problem in orthology and paralogy, but as due to a few fortuitous convergent amino acid substitutions. Back-mutations or parallelisms undoubtedly occur in rRNA, as shown by incompatibilities (by the criteria of Guise et al.) between angiosperm 5S sites 117 to 119 in Hori et al. (1985). Ribosomal RNA is therefore not immune to such perturbations, and Ambler (1985) expects that chimeras will be discovered in which one part of a sequence is derived from one bacterium and the other part from another bacterium. Dover (1987), in reviewing the complexity of eukaryote ribosomal genes, points out that in addition to crossing-over events within spacer regions, there may be interactions with rRNA that are as yet poorly understood.

Much of our understanding of bacterial relationships at higher levels depends on rRNA, so it is worth examining this system with some care. Ribosomal groupings do not necessarily have to reflect phylogeny: convergence and parallelism alone will lead to discrepancies. Already there are indications to this effect. The 5S rRNA of the monocotyledons wheat and rye is more similar to that of the dicotyledon tomato than the 5S rRNA of tomato is to that of other dicotyledons such as sunflower, spinach and bean (Hori et al., 1985). To insist that the tomato belonged to the monocotyledonous clade would strain credulity; the similarity is more easily explained by fortuitous parallelisms. Dr. R. G. E. Murray (personal communication) points out that our knowledge of bacterial relationships at high levels is based almost entirely on the protein-synthesizing apparatus of bacteria; it is tacitly assumed that this apparatus does reflect the rest of the genome over a wide range of organisms and epochs. Comparison with other semantides is therefore much needed.

One area that is now being explored is the use of molecular "signatures", small subsequences that are characteristic of various bacterial groupings (e.g. Table 3 in Woese, 1987). Their value will depend considerably on how constant these signatures are within the newly defined bacterial taxa. At present there has been no detailed analysis of this, though some within-taxon variation is evident.

DNA-DNA pairing has profound implications if it can be determined with the highest accuracy. Similar arguments to those given here have been developed to support DNA-DNA results in higher organisms. Thus, in order to evaluate the reliability of the phylogeny of man and the apes from the data of Sibley and Ahlquist (1987), Felsenstein (1987) calculates the length of sequence required to give the same statistical reliability as the DNA data. He shows that the present experimental accuracy of DNA pairing is equivalent to several thousand nucleotides, and that this would increase with replication or better techniques. The study of O'Brien et al.
(1985) shows very strong DNA evidence that the giant panda is much more closely related to bears than to raccoons, and this is supported by serology, isoenzymes and karyology.

There evidently exist in nature numerous varied types of organisms within traditional bacterial species, as shown by Fig. 9. Whether the isolated strains in such studies represent numerous tight clusters, or whether loose DNA clusters are the norm in bacteria, is not yet clear. The occurrence of many almost indistinguishable strains could be due to accidents of sampling (repeated reisolation from a narrow niche), but it is also of great ecological and evolutionary interest. Do such strains always come from a narrow ecological niche? Can we observe evolutionary change over short periods of time? It may be noted that the satellite strain 13 in Fig. 9 is believed to derive from the same source, Ford strain 26, as strain 6 (strains 39 and 47 respectively in Nakamura and Swezey, 1983). The position of strain 13 may perhaps reflect some genetic change as well as experimental error, and the statistical evidence would not entirely rule this out.

Accurate DNA data might behave as an ultrametric, which would make tree construction easy because cladograms would differ little from phenograms. This is now attracting attention (Lausen and Degens, 1988; Bandelt and von Haeseler, in press). The restricted choice of reference strains in DNA studies can lead to difficulty in recovering taxonomic structure, because different reference strains can greatly affect the inferred relationships (Hartford and Sneath, 1988). It will be necessary to introduce suitable transformations, however, because DNA-DNA pairing of 0% is less than the chance expectation of matches for four nucleotides, about 25%. Such formulae as that of De Wachter et al. (1985) would be of interest here.

We are still some way from satisfactory criteria for taxonomic rank. Bacteria show much more molecular diversity than higher organisms of similar rank. Cytochrome c differences within a bacterial species are as great as those between orders of mammals (Ambler, 1985). 5S rRNA variation in Escherichia coli is as great as that between families of flowering plants (compare De Wachter et al., 1985 and Hori et al., 1985). Percent DNA-DNA pairing between strains of a bacterial species, about 80%, is similar to that between orders of birds (Sibley and Ahlquist, 1983), though the position is different for absolute numbers of nucleotide differences (about 2 × 10^6 for bacterial species and about 2 × 10^8 for orders of birds, or 2 × 10^7 for species of birds). If rates of evolution have been similar (in percentage terms) in prokaryotes and eukaryotes, the geological dating for vertebrates would imply that different strains of a single bacterial species had diverged perhaps 100 million years ago. Some would find this implausible. Alternatively, molecular evolution has been faster in prokaryotes. It would then become difficult to know where to place the root on a tree of all organisms, such as Fig. 1, because of doubt about which segments evolved at what rates. The validity of "molecular clocks" across all organisms, and studies such as that of Ochman and Wilson (1987), are of considerable consequence.
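If a molecular clock held strictly, evolutionary distances between contemporary organisms would be ultrametric: in every triad the two largest distances would be equal. The following sketch, offered only as an illustration and not as the procedure of any paper cited here, lists the triads of a distance matrix that depart from this condition by more than a tolerance. The function name, the tolerance parameter `tol` and the small example matrices are hypothetical; in practice the tolerance would have to reflect experimental error.

```python
from itertools import combinations


def ultrametric_violations(dist, tol=0.0):
    """Return the triads (i, j, k) that break the ultrametric condition.

    dist is a square, symmetric matrix (list of lists) of dissimilarities.
    A triad is ultrametric when its two largest distances are equal; here
    they are allowed to differ by at most tol (a hypothetical allowance
    for measurement error).
    """
    violations = []
    for i, j, k in combinations(range(len(dist)), 3):
        # Sort the three pairwise distances; compare the two largest.
        _, middle, largest = sorted((dist[i][j], dist[i][k], dist[j][k]))
        if largest - middle > tol:
            violations.append((i, j, k))
    return violations


# Invented example: strains 0 and 1 are close, strain 2 is equidistant
# from both, as expected under a constant rate of change.
clock_like = [
    [0, 2, 6],
    [2, 0, 6],
    [6, 6, 0],
]

# Invented example: strain 1 has diverged further from strain 2 than
# strain 0 has, so the triad is no longer ultrametric.
not_clock_like = [
    [0, 2, 6],
    [2, 0, 9],
    [6, 9, 0],
]

if __name__ == "__main__":
    print(ultrametric_violations(clock_like))
    print(ultrametric_violations(not_clock_like))
```

The first matrix yields no violating triads and the second yields the single triad (0, 1, 2). A count of such triads, judged against the known experimental error, gives one simple indication of how far a set of DNA pairing data departs from clock-like behaviour.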
Another possibility is that strains of a bacterial species exchange genes frequently (relative to geological time) but maintain very high genomic diversity, perhaps to allow them better to exploit swift environmental changes.

Another, more pragmatic, question is how far we can safely revise bacterial classifications on present sequence data. In many of the earlier papers on rRNA, strain numbers were seldom given, so we do not know which were type strains. Most workers examined only a single strain of a taxon such as a genus, yet the analyses were displayed as representative of the entire taxon, sometimes when the validity of adjacent taxa was being questioned. Many genera are heterogeneous from the molecular viewpoint (Figs. 3, 4), and this would imply that all their species must be studied before deciding on the new disposition of the genus. This still leaves unanswered the question of how homogeneous the species are: contrast the homogeneity of Escherichia coli (Fig. 4) with the heterogeneity of Bacillus circulans (Fig. 9).

A recent report (Wayne et al., 1987) has raised a number of questions discussed here. The report illustrates certain difficulties of definition. It is not always clear when the term "phylogenetic" means "cladistic" and when it means "genomic", or whether it simply refers to relationships based on numerous properties, as opposed to a few with arbitrary weighting as in the early days of bacteriology. The term "genospecies" is used to mean a phylogenetic group based on a specified percentage of DNA-DNA pairing (which, incidentally, obviously implies a DNA molecular clock), and not in the more usual sense of Ravin (1963) to mean a group that can exchange genes. Further, most usage of "species" in biology refers to a group of organisms that is isolated (genetically, phenotypically, or in some other way) from nearby groups. But to define genospecies on a quantitative pairing value is not consistent with this usage, and makes no provision for isolated groups related at different DNA-DNA pairing values. Thus, it would not accommodate very closely related, yet nevertheless distinct and isolated, groups that are referred to in botany as microspecies and in zoology as cryptic or sibling species (Davis and Heywood, 1963; Mayr, 1969). We know little about such microspecies among bacteria, because although there are pairs of named species that show high DNA-DNA pairing (e.g. Escherichia coli and Shigella flexneri, Brenner et al., 1969), it is not clear to what extent such pairs are distinct phenotypically or genomically, or whether they merge into one another. We therefore need time to consider the implications of this report, and to what extent criteria drawn from a few well-studied areas of bacteria can be applied across the whole field. The report very wisely points to the need for study of a range of semantides, and to extension to type strains of the known species of bacteria.

Acknowledgements. I am greatly indebted to Dr Dorothy Jones, Dr R. G. E. Murray, and a number of reviewers for many stimulating comments and helpful suggestions.

References

Ambler, R. P.: Protein sequencing and taxonomy, pp. 307-335. In: M. Goodfellow, D. Jones and F. G. Priest (eds.), Computer-assisted Bacterial Systematics. London, Academic Press 1985
Astolfi, P., Kidd, K. K., Cavalli-Sforza, L. L.: A comparison of methods for reconstructing evolutionary trees. Syst. Zool. 30, 150-169 (1981)
Baba, M. L., Darga, L. L., Goodman, M., Czelusniak, J.: Evolution of cytochrome c investigated by the maximum parsimony method. J. Molec. Evol. 17, 197-213 (1981)
Bandelt, H.-J., von Haeseler, A.: The phylogeny of Prochloron: is there numerical evidence from SAB values? A response to Van Valen. In: Trees and Hierarchical Structures (in press)
Barbu, E., Panijel, J., Quash, G.: Caracterisation immunochimique des ribosomes. Ann. Inst. Pasteur Paris 100, 725-746 (1961)
Boulter, D., Ramshaw, J. A. M., Thompson, E. W., Richardson, M., Brown, R. H.: A phylogeny of higher plants based on the amino acid sequences of cytochrome c and its biological implications. Proc. Roy. Soc. Lond. B 181, 441-455 (1972)
Brenner, D. J., Fanning, G. R., Johnson, K. E., Citarella, R. V., Falkow, S.: Polynucleotide sequence relationships among members of Enterobacteriaceae. J. Bact. 98, 637-650 (1969)
Bulloch, W.: The History of Bacteriology. London, Oxford University Press 1938
Cain, A. J., Harrison, G. A.: Phyletic weighting. Proc. Zool. Soc. Lond. 135, 1-31 (1960)
Corliss, J. O.: The Kingdom Protista and its 45 phyla. BioSystems 17, 87-126 (1984)
Crowson, R. A.: Classification and Biology. London, Heinemann 1970
Davis, P. H., Heywood, V. H.: Principles of Angiosperm Taxonomy. Edinburgh, Oliver & Boyd 1963
Dayhoff, M. O.: Atlas of Protein Sequence and Structure 1972. Silver Spring/MD, National Biomedical Research Foundation 1972
Dayhoff, M. O. (ed.): Atlas of Protein Sequence and Structure, Vol. 5, supplement 3. Washington/DC, National Biomedical Research Foundation 1978
Dayhoff, M. O., Eck, R. V.: Atlas of Protein Sequence and Structure, 1967-68. Silver Spring/MD, National Biomedical Research Foundation 1968
De Ley, J., Park, I. W., Tijtgat, R., van Ermengem, J.: DNA homology and taxonomy of Pseudomonas and Xanthomonas. J. Gen. Microbiol. 42, 43-56 (1966)
De Ley, J., Segers, M., Gillis, M.: Intra- and intergeneric similarities of Chromobacterium and Janthinobacterium rRNA cistrons. Int. J. Syst. Bact. 28, 154-168 (1978)
De Ley, J., Segers, P., Kersters, K., Mannheim, W., Lievens, A.: Intra- and intergeneric similarities of the Bordetella ribosomal ribonucleic acid cistrons: proposal for a new family, Alcaligenaceae. Int. J. Syst. Bact. 36, 405-414 (1986)
De Smedt, J., De Ley, J.: Intra- and intergeneric similarities of Agrobacterium rRNA cistrons. Int. J. Syst. Bact. 27, 222-240 (1977)
De Wachter, R., Huysmans, E., Vandenberghe, A.: 5S ribosomal RNA as a tool for studying evolution, pp. 115-141. In: K. H. Schleifer and E. Stackebrandt (eds.), Evolution of Prokaryotes. London, Academic Press 1985
Doolittle, W. F., Woese, C. R., Sogin, M. L., Bonen, L., Stahl, D. A.: Sequence studies on 16S ribosomal RNA from a blue-green alga. J. Molec. Evol. 4, 307-315 (1975)
Doolittle, R. F., Blombach, B.: Amino-acid sequence investigations of fibrinopeptides from various mammals: evolutionary implications. Nature (Lond.) 202, 147-152 (1964)
Doty, P., Marmur, J., Eigner, J., Schildkraut, C.: Strand separation and specific recombination in deoxyribonucleic acids: physical chemical studies. Proc. Nat. Acad. Sci. USA 46, 461-476 (1960)
Dover, G. A.: DNA turnover and the molecular clock. J. Molec. Evol. 26, 47-58 (1987)
Ehrlich, P. H.: Some axioms of taxonomy. Syst. Zool. 13, 109-123 (1964)
Farris, J. S.: Methods for computing Wagner trees. Syst. Zool. 19, 83-90 (1970)
Farris, J. S.: Estimating phylogenetic trees from distance matrices. Amer. Nat. 106, 645-668 (1972)
Felsenstein, J.: Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27, 401-410 (1978)
Felsenstein, J.: Estimation of hominoid phylogeny from a DNA hybridization data set. J. Molec. Evol. 26, 123-131 (1987)
Fitch, W. M.: Distinguishing homologous from analogous proteins. Syst. Zool. 19, 99-115 (1970)
Fitch, W. M., Langley, C. H.: Evolutionary rates in proteins: neutral mutations and the molecular clock, pp. 197-219. In: M. Goodman, R. E. Tashian and J. H. Tashian (eds.), Molecular Anthropology: Genes and Proteins in the Evolutionary Ascent of the Primates. New York, Plenum Press 1976
Fitch, W. M., Margoliash, E.: Construction of phylogenetic trees. Science 155, 279-284 (1967)
Fitch, W. M., Margoliash, E.: The construction of phylogenetic trees. II. How well do they reflect past history? Brookhaven Symp. Biol. 21, 217-241 (1968)
Fitch, W. M., Margoliash, E.: The usefulness of amino acid and nucleotide sequences in evolutionary studies. Evol. Biol. 4, 67-109 (1970)
Fitch, W. M., Yasunobu, K. T.: Phylogenies from amino acid sequences aligned with gaps: the problem of gap weighting. J. Molec. Evol. 5, 1-24 (1975)
Fortey, R. A., Jefferies, R. P. S.: Fossils and phylogeny - a compromise approach, pp. 197-234. In: K. A. Joysey and A. E. Friday (eds.), Problems of Phylogenetic Reconstruction. London, Academic Press 1982
Gillis, M., De Ley, J., De Cleene, M.: The determination of molecular weight of bacterial genome DNA from renaturation rates. Europ. J. Biochem. 12, 143-153 (1970)
Gilmour, J. S. L.: A taxonomic problem. Nature (Lond.) 139, 1040-1042 (1937)
Gilmour, J. S. L.: Taxonomy and philosophy, pp. 461-474. In: J. Huxley (ed.), The New Systematics. Oxford, Oxford University Press 1940
Goodman, M.: Toward a genealogical description of the primates, pp. 321-353. In: M. Goodman, R. E. Tashian and J. H. Tashian (eds.), Molecular Anthropology: Genes and Proteins in the Evolutionary Ascent of the Primates. New York, Plenum Press 1976
Gould, S. J.: The promise of paleobiology as a nomothetic, evolutionary discipline. Paleobiol. 6, 96-118 (1980)
Gower, J. C.: Maximal predictive classification. Biometrics 30, 643-654 (1974)
Guise, A., Peacock, D., Gleaves, T.: A method for identification of parallelism in discrete character sets. Zool. J. Linn. Soc. 74, 293-303 (1982)
Harris, S., Barrie, P. A., Weiss, M. L., Jeffreys, A. J.: The primate ψβ1 gene: an ancient β-globin pseudogene. J. Molec. Biol. 180, 785-801 (1984)
Hartford, T., Sneath, P. H. A.: Distortion of taxonomic structure from DNA relationships due to different choice of reference strains. System. Appl. Microbiol. 10, 241-250 (1988)
Holmquist, R.: The method of parsimony: an experimental test and theoretical analysis of the adequacy of molecular restoration studies. J. Molec. Biol. 135, 939-958 (1979)
Hori, H.: Evolution of 5S rRNA. J. Molec. Evol. 7, 75-86 (1975)
Hori, H., Lim, B.-L., Osawa, S.: Evolution of green plants as deduced from 5S rRNA sequences. Proc. Nat. Acad. Sci. USA 82, 820-823 (1985)
Hori, H., Osawa, S.: Evolutionary change in 5S rRNA secondary structure and a phylogenetic tree of 352 5S rRNA species. BioSystems 19, 163-172 (1986)
Humphries, C. J., Richardson, P. M.: Hennig's method and phytochemistry, pp. 353-378. In: F. A. Bisby, J. G. Vaughan and C. A. Wright (eds.), Chemosystematics, Principles and Practice. London, Academic Press 1980
Kimura, M.: The neutral theory of molecular evolution. Cambridge, University Press 1983
Jones, A. G., Young, D. A.: Generic concepts of Aster (Asteraceae): a comparison of cladistic, phenetic and cytological approaches. Syst. Bot. 8, 71-84 (1983)
Jones Jr., S. B., Luchsinger, A. E.: Plant Systematics. New York, McGraw-Hill 1979
Jardine, N.: The concept of homology in biology. Brit. J. Phil. Sci. 18, 125-139 (1967)
Lanyon, S. M.: Detecting internal inconsistencies in distance data. Syst. Zool. 34, 397-403 (1985)
Lausen, B., Degens, P. O.: Evaluation of the reconstruction of phylogenies with DNA-DNA hybridization data, pp. 367-374. In: H. H. Bock (ed.), Classification and Related Methods of Data Analysis. Amsterdam, North Holland 1988
Loiseau, P., Legroux, M., Grimont, P., du Pasquier, P., Henry, P.: Taxometric classification of myoclonic epilepsies. Epilepsia 15, 1-11 (1974)
MacDonell, M. T., Swartz, D. G., Ortiz-Conde, B. A., Last, G. A., Colwell, R. R.: Ribosomal RNA phylogenies for the vibrio-enteric group of eubacteria. Microbiol. Sci. 3, 172-178 (1986)
Mannheim, W.: Taxonomic implications of DNA relatedness and quinone patterns in Actinobacillus, Haemophilus and Pasteurella, pp. 265-280. In: M. Kilian, W. Frederiksen and E. L. Biberstein (eds.), Haemophilus, Pasteurella and Actinobacillus. London, Academic Press 1981
Margoliash, E., Smith, E. L.: Structural and functional aspects of cytochrome c in relation to evolution, pp. 221-242. In: V. Bryson and H. J. Vogel (eds.), Evolving Genes and Proteins. New York, Academic Press 1965
Martin, P. G., Boulter, D., Penny, D.: Angiosperm phylogeny studied using sequences of five macromolecules. Taxon 34, 393-400 (1985)
Mayr, E.: Principles of Systematic Zoology. New York, McGraw-Hill 1969
Mayr, E.: Biological classification: toward a synthesis of opposing methodologies. Science 214, 510-516 (1981)
McCarthy, B. J., Bolton, E. T.: An approach to the measurement of genetic relatedness among organisms. Proc. Nat. Acad. Sci. USA 50, 156-164 (1963)
Michener, C. D., Sokal, R. R.: A quantitative approach to a problem in classification. Evolution 11, 130-162 (1957)
Moore, G. W., Goodman, M., Barnabas, J.: An iterative approach from the standpoint of the additive hypothesis to the dendrogram problem posed by molecular data sets. J. Theor. Biol. 38, 423-457 (1973)
Nakamura, L. K., Swezey, J.: Taxonomy of Bacillus circulans Jordan 1890: base composition and reassociation of deoxyribonucleic acid. Int. J. Syst. Bact. 33, 46-52 (1983)
O'Brien, S., Nash, W. G., Wildt, D. E., Bush, M. E., Beneviste, R. E.: A molecular solution to the riddle of the giant panda's phylogeny. Nature (Lond.) 317, 140-144 (1985)
Ochman, H., Wilson, A. C.: Evolution in bacteria: evidence for a universal substitution rate in cellular genomes. J. Molec. Evol. 26, 74-86 (1987)
O'Donnell, A. G.: Numerical analysis of chemotaxonomic data, pp. 403-414. In: M. Goodfellow, D. Jones and F. G. Priest (eds.), Computer-assisted Bacterial Systematics. London, Academic Press 1985
Potts, T. V., Berry, E. M.: Deoxyribonucleic acid-deoxyribonucleic acid hybridization analysis of Actinobacillus actinomycetemcomitans and Haemophilus aphrophilus. Int. J. Syst. Bact. 33, 765-771 (1983)
Ravin, A. W.: Experimental approaches to the study of bacterial phylogeny. Amer. Nat. 97, 307-318 (1963)
Reich, P. R., Somerson, N. L., Hybner, C. J., Chanock, R. M., Weissman, S. M.: Genetic differentiation by nucleic acid homology. I. Relationships among Mycoplasma species of man. J. Bact. 92, 302-310 (1966)
Rohlf, F. J.: A note on minimum length trees. Syst. Zool. 33, 341-343 (1984)
Romero-Herrera, A. E., Lehmann, H., Joysey, K. A., Friday, A. E.: Evolution of myoglobin amino acid sequences in primates and other vertebrates, pp. 289-300. In: M. Goodman, R. E. Tashian and J. H. Tashian (eds.), Molecular Anthropology: Genes and Proteins in the Evolutionary Ascent of the Primates. New York, Plenum Press 1976
Sankoff, D., Morel, C., Cedergren, R. J.: Evolution of 5S rRNA and the non-randomness of base replacement. Nature (Lond.) 245, 232-234 (1973)
Schleifer, K. H., Stackebrandt, E.: Molecular systematics of prokaryotes. Ann. Rev. Microbiol. 37, 143-187 (1983)
Schleifer, K. H., Stackebrandt, E. (eds.): Evolution of Prokaryotes. London, Academic Press 1985
Schwartz, R. M., Dayhoff, M. O.: Origins of prokaryotes, eukaryotes, mitochondria and chloroplasts. Science 199, 395-403 (1978)
Sibley, C. G., Ahlquist, J. E.: The phylogeny and relationships of the ratite birds as indicated by DNA-DNA relationships, pp. 301-335. In: G. G. E. Scudder and J. L. Reveal (eds.), Evolution Today: Proceedings of the Second International Congress of Systematic and Evolutionary Biology. Pittsburgh/PA, Hunt Institute of Botanical Documentation, Carnegie-Mellon University 1981
Sibley, C. G., Ahlquist, J. E.: Phylogeny and classification of birds based on the data of DNA-DNA hybridization. Curr. Ornithol. 1, 245-292 (1983)
Sibley, C. G., Ahlquist, J. E.: DNA hybridization evidence of hominoid phylogeny: results from an expanded data set. J. Molec. Evol. 26, 99-122 (1987)
Simpson, G. G.: Principles of Animal Taxonomy. New York, Columbia University Press 1961
Sneath, P. H. A.: The application of computers to taxonomy. J. Gen. Microbiol. 17, 201-226 (1957)
Sneath, P. H. A.: Phylogeny of micro-organisms. Symp. Soc. Gen. Microbiol. 24, 1-39 (1974)
Sneath, P. H. A.: Systematics and biogeography: cladistics and vicariance. Syst. Zool. 31, 208-217 (1982)
Sneath, P. H. A.: Estimating uncertainty in evolutionary trees from Manhattan-distance triads. Syst. Zool. 35, 470-488 (1986)
Sneath, P. H. A.: Predictivity in taxonomy and the probability of a tree. Plant Systematics and Evolution, in press (1988a)
Sneath, P. H. A.: The phenetic and cladistic approaches, pp. 252-273. In: D. L. Hawksworth (ed.), Prospects in Systematics. Oxford, Oxford University Press 1988b
Sneath, P. H. A., Hansell, R. I. C.: Naturalness and predictivity of classifications. Biol. J. Linn. Soc. 24, 217-231 (1985)
Sneath, P. H. A., Sackin, M. J., Ambler, R. P.: Detecting evolutionary incompatibilities from protein sequences. Syst. Zool. 24, 311-332 (1975)
Sneath, P. H. A., Sokal, R. R.: Numerical Taxonomy. San Francisco, W. H. Freeman 1973
Sokal, R. R., Sneath, P. H. A.: Principles of Numerical Taxonomy. San Francisco, W. H. Freeman 1963
Stackebrandt, E.: Phylogeny and phylogenetic classifications of prokaryotes, pp. 309-334. In: K. H. Schleifer and E. Stackebrandt (eds.), Evolution of Prokaryotes. London, Academic Press 1985
Stackebrandt, E., Woese, C. R.: The evolution of prokaryotes. Symp. Soc. Gen. Microbiol. 32, 1-31 (1981)
Tateno, Y., Nei, M., Tajima, F.: Accuracy of estimated phylogenetic trees from molecular data. I. Distantly related species. J. Molec. Evol. 18, 387-404 (1982)
Van Niel, C. B.: The classification and natural relationships of bacteria. Cold Spring Harbor Symp. Quant. Biol. 11, 285-301 (1946)
Van Valen, L. M.: Phylogenies in molecular evolution: Prochloron. Nature (Lond.) 298, 493-494 (1982)
Wayne, L. G., Brenner, D. J., Colwell, R. R., Grimont, P. A. D., Kandler, O., Krichevsky, M. I., Moore, L. H., Moore, W. E. C., Murray, R. G. E., Stackebrandt, E., Starr, M. P., Trüper, H. G.: Report of the Ad Hoc Committee on Reconciliation of Approaches to Bacterial Systematics. Int. J. Syst. Bact. 37, 463-464 (1987)
Wilkinson, C.: Numerical classification: some questions answered. Can. Entomol. 106, 449-464 (1974)
Woese, C. R.: Bacterial evolution. Microbiol. Rev. 51, 221-271 (1987)
Woese, C. R., Fox, G. E.: Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc. Nat. Acad. Sci. USA 74, 5088-5090 (1977)
Woese, C. R., Fox, G. E., Zablen, L., Uchida, T., Bonen, L., Pechman, K., Lewis, B. J., Woese, C. R.: Conservation of primary structure in 16S ribosomal RNA. Nature (Lond.) 254, 83-86 (1975)
Wold, S.: Analysis of similarities and dissimilarities between chromatographic liquid phases by means of pattern recognition. J. Chromatogr. Sci. 13, 525-532 (1975)
Woodger, J. H.: On biological transformations, pp. 94-120. In: W. E. Le Gros Clark and P. B. Medawar (eds.), Essays on Growth and Form Presented to D'Arcy Wentworth Thompson. Oxford, Clarendon Press 1945

The substance of an address at the joint meeting of the Bergey Manual Trust and the Microbial Systematics Group of the Society for General Microbiology, September 1987.

Dr. P. H. A. Sneath, Dept. of Microbiology, Leicester University, Leicester LE1 7RH, England