A taxonomic distance applicable to paleontology


A. R. BEDNAREK*
Department of Mathematics, University of Florida, Gainesville, Florida 32611

AND

TEMPLE F. SMITH
Department of Physics, Northern Michigan University, Marquette, Michigan 49855

Received 14 July 1979; revised 19 February 1980

ABSTRACT

A general Hausdorff-like metric on sets is presented with applications to paleontological "dissimilarity measures." The distances between taxa represented by living organisms and/or fossil material are obtainable simply and independently of the nature of the characteristics used to define the taxonomic units. In particular, linear independence of the measured anatomical characters is not required.

*Research supported by NSF Grant No. MSC 75-21130.

MATHEMATICAL BIOSCIENCES 50:285-295 (1980)
© Elsevier North Holland, Inc., 1980, 52 Vanderbilt Ave., New York, NY 10017. 0025-5564/80/060285+11$02.25

INTRODUCTION

One of the major problems in paleontology is to find a mathematically proper as well as meaningful taxonomic dissimilarity measure for classification. As pointed out by Simons [15] and Howell and Isaac [9], it is essential to classify fossil material insofar as possible within the taxonomic structure used for living organisms. This is normally quite difficult, as fossil material rarely allows full determination of the many characteristics used to classify modern organisms. The problem is further compounded by the natural intrataxon variations. Thus given a collection of fossil material, one must determine whether the specimens are distinct enough to justify classification into different taxa; if so, how different; and finally, how these taxa are related to the living taxa [4].

The major analytical tools used in such studies come under the heading of cluster analysis. A comprehensive review of its application to taxonomy is found in [10] and [17]. A variety of taxonomic distances or dissimilarity measures have been proposed [13, 1, 7, 6, 17]. Many are complex and often not true metrics [10]. Those that are metrics [5, 11] often require character independence or other assumptions. The advantage of a metric dissimilarity measure is that its mathematical properties and implications are known.

In this study a simple intercluster (proper) metric is defined. This is done in such a way that for any arbitrary set of biological attributes or characters of a continuous or discrete nature, the distance between clusters can be uniquely defined. It is defined irrespective of whether the clusters are composed of many representatives or a single representative.¹ In addition, no heuristic algorithm is required to compensate for a missing character state measurement. This allows the distances amongst a fossil collection having a common set of measurable characteristics to be compared with the distances amongst the relevant modern taxa.

THE METRIC

The proposed metric is analogous to that introduced by Hausdorff (see, e.g., [12]) for nonvoid closed bounded subsets of a metric space. Its major advantage is its conceptual simplicity. The metric is described heuristically as follows: Given any two sets A and B of points (each a representative set of the members of a given taxon, for example) such that the neighborhood about any point is defined by a single "step" operation; then the distance between sets A and B, p(A, B), is the maximum of: (1) the minimum number of "steps" required for set A to expand² and include all of B, or (2) the minimum number of steps for set B to include all of A.

This metric has some conceptual similarity to the older complete linkage or farthest neighbor criterion [3]. This older method of constructing hierarchical classifications joined taxa to form the next higher taxon whenever all members of both taxa were within a minimum "distance" of each other. This minimum "distance" was a dissimilarity measure calculated from the entire set of characteristics. In the proposed scheme the clustering or expanded inclusion of two taxa generates the distance measure rather than the other way around.
In the accompanying appendix we have outlined in some detail the mathematical properties of this Hausdorff-like metric on sets. The three properties for a metric are recalled to be:

I. p(A, B) is always zero or positive, and zero if and only if A and B are identical.
II. p(A, B) must equal p(B, A).
III. The triangle inequality must be satisfied:

$$p(A,B) \le p(A,C) + p(C,B) \quad \text{for all } A, B, C. \tag{1}$$

¹Such representative sets are normally referred to in the systematic literature as OTU (operational taxonomic units).
²The meanings of "step" and "expand" are made precise in the appendix, where they correspond to iterations of a monotone set operator.

While these properties are required for a proper metric, they do not impose Euclidean geometry on the space of sets, a point stressed by Williams and Dale [19] as important to taxonomic problems. It may be useful to recall from molecular taxonomy the concept of minimum mutational distance between protein sequences, which was originally proposed by Fitch and Margoliash [8] and later put on a rigorous mathematical base by Sellers [14]. This molecular metric, the distance between molecular sequences, is just the minimum number of "steps" or mutations required to transform either sequence into the other. For the more general taxonomic problems the definition of a single step neighborhood is not as simple as the mutation "steps" used in molecular taxonomy. This is due in part to the fact that the sets of "points" composing any taxon have no a priori ordinal relationship to those of other taxonomic units.

In order to apply this metric to taxonomy a few definitions are needed. First, the cluster sets or taxa are assumed to be sets of points in a multidimensional character space representing a set of organisms. Each taxon is, therefore, defined operationally by the organisms assigned to it. Any taxon A is operationally defined by the set {α} of representative organisms or taxonomic units. Each is unique at any given level in the hierarchic classification. These representatives are defined by a set of m characters, such that associated with each character $\alpha^i$ is a nearest neighbor "step" $\delta_\alpha^i$. Thus for each taxon representative α, there are two associated vectors: the character-state vector $C_\alpha = (\alpha^1, \alpha^2, \ldots, \alpha^m)$, and the neighborhood vector $\Delta_\alpha = (\delta_\alpha^1, \delta_\alpha^2, \ldots, \delta_\alpha^m)$. The assignment of a proper step $\delta_\alpha^i$ for each character coordinate i represents the major problem in the practical application of the metric.

For integer valued characters such as the number of primate premolars, δ would normally be set equal to unity, while for continuous characters such as skeletal lengths or ratios, δ might be set at one standard deviation of the mean value for the total set of accepted taxon representatives.

It is often useful to define the size or radius of a cluster or taxon. The radius R of a taxon can be defined as the minimum distance $p^m(\alpha, \alpha_0)$ such that all representative members of a given taxon are within R of some standard taxon member $\alpha_0$. Here m denotes the dimensionality of the character space used. Note that in some sense the member $\alpha_0$ defines the center of the taxonomic cluster or perhaps the norm representative member of the taxon. Operationally such a center is just the α giving the minimum R.
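The step sizes and the radius described above can be made concrete with a short Python sketch. It is an illustration only, not the authors' procedure: it assumes box-shaped neighborhoods of half-width δ in each character and a character space dense enough that expansion is never blocked by gaps, so that the number of steps between two single representatives reduces to the ceiling of the largest δ-scaled coordinate difference. The function names, the taxon, and the step sizes are all hypothetical.

```python
import math


def point_steps(x, y, deltas):
    """Steps separating two single representatives (assumes a dense space)."""
    return max(math.ceil(abs(a - b) / d) for a, b, d in zip(x, y, deltas))


def taxon_radius(members, deltas):
    """Return (R, center): the member whose worst-case step count is smallest."""
    best = None
    for center in members:
        worst = max(point_steps(center, other, deltas) for other in members)
        if best is None or worst < best[0]:
            best = (worst, center)
    return best


# Hypothetical taxon with two characters: an integer count (step 1) and a
# continuous ratio whose step is one assumed standard deviation (9.0).
deltas = (1, 9.0)
taxon = [(2, 72.0), (2, 57.0), (3, 60.0), (2, 54.0), (2, 63.0)]

radius, center = taxon_radius(taxon, deltas)
print(radius, center)
```

With these made-up values the representative (2, 63.0) is within one step of every other member, so it plays the role of the center and the radius is 1.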

The distance between any two taxa A and B within an m dimensional character space and that measured within any subspace have the following property:

$$p^n(A,B) \le p^m(A,B) \quad \text{for all } n \le m. \tag{2}$$

The interlevel taxonomic distances have additional properties: given the taxon unit formed by the union, $A \cup B$, of taxa A and B representing the next higher taxon level, then

$$p^m(A \cup B, B) \le p^m(A,B), \tag{3}$$

and more generally,

$$p^m(A \cup B, C \cup D) \le \max_{i,j}\, p^m(i,j) \quad \text{for } i = A, B;\ j = C, D. \tag{4}$$

APPLICATIONS TO PALEONTOLOGY

As an illustrative example the data of Wood [20] on the Homo femur fossils from the Lake Rudolf Basin have been analyzed using the proposed set metric. Three continuous characters were defined and measured by Wood [20]. These and the equivalent mean values for modern man are given in Table 1. The neighborhood step sizes were chosen for all representatives to be one (modern man) standard deviation. The set metric distances obtained are given in Table 2. In addition, the data are presented graphically in Fig. 1.

In general it is assumed that the assignment of modern organisms to their respective taxa can be done either by prior consensus or by one of the standard hierarchical cluster methods using the maximum available characteristic space [10, 17], while any given collection of fossils has a limited set of measurable characteristics in common. Thus, in order to make both inter- and intrataxon direct level comparisons between fossil data and a relevant set of modern taxa, the modern intertaxon distances must be obtained in the limited fossil subspace. Various statistical techniques have been proposed for the analysis of cases containing missing data, but direct metric comparisons require equivalence of the metric spaces themselves. The relevant constraint is given by Eq. (2). This means that new fossil evidence which increases the dimensionality of the character space to be used cannot support closer relationships than those supported by the original fossil data.
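The constraint of Eq. (2) can be checked numerically. The Python sketch below is not the authors' computation: it assumes box neighborhoods of half-width δ and a dense character space, in which case the set distance takes the familiar two-sided Hausdorff form built from δ-scaled step counts. The two sets, the step sizes, and the helper names are hypothetical.

```python
import math


def point_steps(x, y, deltas):
    """Steps between two points: largest delta-scaled difference, rounded up."""
    return max(math.ceil(abs(a - b) / d) for a, b, d in zip(x, y, deltas))


def set_distance(A, B, deltas):
    """Larger of the two one-sided expansion counts between sets A and B."""
    def one_sided(S, T):
        # steps S must expand before it covers every point of T
        return max(min(point_steps(s, t, deltas) for s in S) for t in T)
    return max(one_sided(A, B), one_sided(B, A))


def project(points, keep):
    """Restrict each point (or the step vector) to the characters in keep."""
    return [tuple(p[i] for i in keep) for p in points]


# Hypothetical data: two sets measured on three continuous characters.
deltas = (9.0, 9.0, 9.0)
A = [(70.0, 100.0, 140.0), (75.0, 105.0, 150.0)]
B = [(55.0, 140.0, 120.0), (60.0, 130.0, 125.0)]

for keep in [(0,), (0, 2), (0, 1, 2)]:
    sub_deltas = project([deltas], keep)[0]
    print(len(keep), set_distance(project(A, keep), project(B, keep), sub_deltas))
```

For these values the distance grows from 2 to 3 to 4 as characters are added, in line with $p^n(A,B) \le p^m(A,B)$ for $n \le m$.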

TABLE 1
The Data of B. Wood on the Homo Femur

Proposed setsᵃ   Fossil identification   Neck shape ratio
A                Modern man              72 ± 9
B                ER738                   57
B'               ER1463                  55
C                ER1472                  72
C                ER1481                  77
B                ER1503                  57
B'               ER1505                  42
B'               H-20                    57
B                SK82                    60
B                SK97                    54
D                KR1                     68
D                KR2                     73
D                Trinil I                65
D                Rhodesian               75
D                Skhul IV                77

The table also has a neck length ratio column and a head size ratio column. The modern man values are 107 ± 9 (neck length) and 140 ± 9 (head size). The row alignment of the remaining entries of these two columns is not recoverable; in extraction order they read: 142; 118, -; 119, 116, 135; 129, 134, 113, -; 134, 130, 132, 108, 122, 123, 122, 123; 106, 109, 147, 145, 123, 152, 160.

ᵃThe proposed groupings into sets follow those of Wood [20]. Sets B and C are identified with Australopithecus robustus, and set D with Homo erectus. Set A is defined as all points within one standard deviation of the modern Homo sapiens mean values. The identification of individual fossils also follows [20]. As with most fossils, the data are incomplete; see set B'.

Thus with minimal data (few measurable characters) one should assume the closest taxonomic relation compatible with p. This is a very important point, and as Simons [15] has pointed out, it has not been generally observed: often minimal data have been used to support the conjecture of maximum taxonomic separation. For example, if the distances amongst the fossils are all less than or on the order of the radii of modern species, there is no reason not to assign all the fossils to a single species. If in the same case the fossils are as distant from the modern species as the various modern species are from each other, then in an evolutionary sense the fossil species can be no closer than the next taxonomic level.

The above conclusions rest to some extent on the assumption that the same δ's can be assigned to both modern and fossil characteristics. This may be valid only if the δ's are set to large enough values for the fossil data. A conservative policy would be never to choose fossil δ's less than the observed variations among modern organisms.

The construction of taxonomic trees or hierarchical relationships from dissimilarity measures requires a proper metric for any straightforward interpretation [2]. In addition the property given in Eq. (4) for the proposed metric automatically ensures that the intrataxon level distances decrease as the taxonomic level increases (becomes more inclusive). As noted earlier, the relationship between the proposed dissimilarity metric and the complete linkage method (see [10, pp. 53-56]) suggests a natural clustering using this metric.

Finally, it should be pointed out that the fact that measured anatomical character states for any organism are normally interdependent has no negating effect upon the interpretation of this metric. In particular, if two characters are simply linearly related (proportional to each other), the addition or removal of one or the other has no effect on the value of the metric, provided their δ's are similarly related.

Given the generality of the metric introduced, we expect a broad range of applications. It is expected that a number of metrics introduced earlier, particularly in the case of molecular sequences and evolutionary trees, will turn out to be particular instances of this metric.

TABLE 2
The Set Metric Distances Obtained for the Homo Femur Data in Table 1 with an Associated Dendrogramᵃ

[Distance matrix and dendrogram over the sets A, B, B', C, and D; the entries are not recoverable in this copy.]

ᵃThe step sizes δ used throughout were those for modern Homo sapiens [20] in the first row of Table 1. These metrics among A, B, C, and D fortuitously satisfy the four point condition [18, p. 201] for additive data and thus allow the construction of the unique phylogenetic dendrogram shown.

[Figure: three-dimensional plot of the femoral indices; labeled axes include neck length and head size.]

FIG. 1. A graphic representation of three of the sets listed in Table 1. The shaded cube represents all points within one standard deviation of the modern Homo sapiens mean values [20]. The solid circles are set B, and the open circles set D. Note that the point Trinil I, while not affecting the set metric distance from modern man, appears to be an intermediate between B and the rest of D. The three femoral indices are all calculated as ratios of two linear measurements at the head end of the femur.

MATHEMATICAL APPENDIX

Our purpose here is to describe in detail the metric and its properties, and to provide several examples, including that important particularization in which the taxa are sets of points in a multidimensional characteristic space. Some familiarity with the elementary concepts of set theory is presupposed.

Our concern throughout will be with a finite³ set X and its subsets. We assume that the cardinality |X| of the set X is n. Moreover, we assume that with each x in X there is associated a unique nonempty subset N(x) of X called the neighborhood of x. We require that $x \in N(x)$ (x is contained in its neighborhood). If A is a subset of X we define E(A) to be the set

$$E(A) = \bigcup_{x \in A} N(x);$$

that is, E(A) is the union of all of the neighborhoods of points in A. Note that $A \subset E(A)$. We iterate this process by letting $E^2(A) = E(E(A))$, and in general $E^{k+1}(A) = E(E^k(A))$ for any positive integer k. Define $E^0(A) = A$, and observe that

$$A = E^0(A) \subset E^1(A) \subset E^2(A) \subset \cdots \subset E^k(A) \subset E^{k+1}(A) \subset \cdots.$$

The operator E is monotone; that is, if $A \subset B$, then $E(A) \subset E(B)$. It is also additive, namely, $E(A \cup B) = E(A) \cup E(B)$.

We are now in a position to define an integral valued metric on the nonvoid subsets of X. Letting $A, B \subset X$ with $A, B \ne \emptyset$, we define p(A, B) by

$$p(A,B) = \begin{cases} \min\{k \mid A \subset E^k(B) \text{ and } B \subset E^k(A)\} & \text{if possible,} \\ n & \text{otherwise.} \end{cases}$$

Thus p(A, B) = n if and only if there is no positive integer k for which $A \subset E^k(B)$ and $B \subset E^k(A)$. Furthermore, we observe as noted earlier that our definition of p is analogous to an equivalent interpretation of the Hausdorff metric for nonvoid closed bounded subsets of a metric space [12].

³While much of what follows may be extended to infinite sets, or infinite neighborhoods, these extensions are unnecessary for our purposes.
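The definition translates almost line for line into code. The following Python sketch iterates the operator E on both sets until each covers the other, and returns n = |X| when that never happens. The representation of neighborhoods as a dictionary of sets, the function names, and the small example space are assumptions made for illustration.

```python
def expand(A, neighborhood):
    """E(A): the union of the neighborhoods of all points of A."""
    result = set()
    for x in A:
        result |= neighborhood[x]
    return result


def set_metric(A, B, neighborhood):
    """Smallest k with A inside E^k(B) and B inside E^k(A); n = |X| if none."""
    n = len(neighborhood)
    A, B = set(A), set(B)
    grown_a, grown_b = set(A), set(B)  # E^0(A) and E^0(B)
    for k in range(n):
        if A <= grown_b and B <= grown_a:
            return k
        grown_a = expand(grown_a, neighborhood)
        grown_b = expand(grown_b, neighborhood)
    return n


# Hypothetical space: the points 0..6 on a line, each adjacent to its
# immediate neighbors, except that the link between 4 and 5 is missing.
X = range(7)
neighborhood = {x: {y for y in X if abs(x - y) <= 1} for x in X}
neighborhood[4].discard(5)
neighborhood[5].discard(4)

print(set_metric({0}, {3}, neighborhood))        # 3 steps along the line
print(set_metric({0, 1}, {2, 3}, neighborhood))  # 2 steps in each direction
print(set_metric({0}, {6}, neighborhood))        # unreachable, so n = 7
```

Searching k only up to n - 1 suffices because the chain $E^0(A) \subset E^1(A) \subset \cdots$ can grow strictly at most n - 1 times in a set of n points.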

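PROPOSITION 1

If X is finite, then p as described above is a metric on the nonvoid subsets of X.

Proof. By definition p(A, B) is always nonnegative. Now if A = B, then $A \subset B = E^0(B)$ and $B \subset A = E^0(A)$. Thus p(A, B) = 0. Conversely, if p(A, B) = 0, then $A \subset E^0(B) = B$ and $B \subset E^0(A) = A$, so that A = B. Consequently p(A, B) = 0 if and only if A = B. That p is symmetric [that is, p(A, B) = p(B, A)] is immediate from the definition.

To establish the triangle inequality, we consider any nonvoid subsets A, B, and C of X. If either p(A, C) or p(C, B) is equal to n, then $p(A,B) \le p(A,C) + p(C,B)$, since $p(A,B) \le n$. Otherwise let p(A, C) = h and p(C, B) = j, so that $A \subset E^h(C)$, $C \subset E^h(A)$, $C \subset E^j(B)$, and $B \subset E^j(C)$. Since E is monotone, $A \subset E^h(C) \subset E^h(E^j(B)) = E^{h+j}(B)$ and $B \subset E^j(C) \subset E^j(E^h(A)) = E^{h+j}(A)$. Hence $h + j \ge \min\{k \mid A \subset E^k(B) \text{ and } B \subset E^k(A)\} = p(A,B)$, so that $p(A,B) \le p(A,C) + p(C,B)$. This completes the argument that p is a proper metric on the nonvoid subsets of X.

PROPOSITION 2

For any nonvoid subsets A, B, C, and D of the set X we have

$$p(A \cup B, C \cup D) \le \max\{p(A,C),\, p(A,D),\, p(B,C),\, p(B,D)\}.$$

Proof. The inequality is obviously true if any of the distances on the right hand side have value n. On the other hand, if k is the maximum of the four distances on the right hand side, with k < n, then each of A and B is contained in $E^k(C) \subset E^k(C \cup D)$, and each of C and D is contained in $E^k(A) \subset E^k(A \cup B)$, so that $p(A \cup B, C \cup D) \le k$.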
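Propositions 1 and 2 can be spot-checked by brute force on a small space. The Python sketch below enumerates every nonvoid subset of a hypothetical four-point X, one point of which is isolated so that the value n also occurs, and tests the metric axioms together with the union inequality. It is a sanity check under these assumptions, not a substitute for the proofs.

```python
from itertools import combinations, product


def expand(A, nbrs):
    """E(A) as a frozenset: union of the neighborhoods of the points of A."""
    return frozenset().union(*(nbrs[x] for x in A))


def rho(A, B, nbrs):
    """Set metric: least k with mutual inclusion after k expansions, else n."""
    n = len(nbrs)
    grown_a, grown_b = A, B
    for k in range(n):
        if A <= grown_b and B <= grown_a:
            return k
        grown_a, grown_b = expand(grown_a, nbrs), expand(grown_b, nbrs)
    return n


# Hypothetical neighborhoods on X = {0, 1, 2, 3}; point 3 is isolated.
nbrs = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2}, 3: {3}}
subsets = [frozenset(c) for r in range(1, 5) for c in combinations(range(4), r)]
d = {(A, B): rho(A, B, nbrs) for A in subsets for B in subsets}

identity = all((d[A, B] == 0) == (A == B) for A, B in d)
symmetry = all(d[A, B] == d[B, A] for A, B in d)
triangle = all(d[A, B] <= d[A, C] + d[C, B]
               for A, B, C in product(subsets, repeat=3))
union_bound = all(d[A | B, C | D] <= max(d[A, C], d[A, D], d[B, C], d[B, D])
                  for A, B, C, D in product(subsets, repeat=4))

print(identity, symmetry, triangle, union_bound)
```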

Suppose we have two neighborhoods N(x) and $\bar{N}(x)$ for each $x \in X$. Denote the respective metrics by p and $\bar{p}$. The following proposition is obtained:

PROPOSITION 3

If $N(x) \subset \bar{N}(x)$ for each $x \in X$, then $\bar{p}(A,B) \le p(A,B)$ for any nonvoid $A, B \subset X$.

Proof. Let $\bar{E}(A) = \bigcup_{x \in A} \bar{N}(x)$ and in general $\bar{E}^{k+1}(A) = \bar{E}(\bar{E}^k(A))$. If we suppose that p(A, B) = j, then $A \subset E^j(B)$ and $B \subset E^j(A)$. But $E(A) \subset \bar{E}(A)$, so that $A \subset E^j(B) \subset \bar{E}^j(B)$ and $B \subset E^j(A) \subset \bar{E}^j(A)$; therefore, $\bar{p}(A,B) \le j = p(A,B)$.

We collect below several particular instances of the metric described above.

Example A. We suppose that X is the set consisting of the cells of the N by N grid; that is, $X = \{(i,j) \mid 1 \le i, j \le N\}$. For each cell x we take N(x) to be the set consisting of the cell x and those cells to the right, to the left, above, and below that fall in the grid. Figure 2 provides an illustration of the "growth" of a pattern A under three iterations of the operator E, with the original cells of A denoted by * and the additional cells added in successive applications of E by 1, 2, and 3. One can of course modify the choice of neighborhoods as well as envision the extension to higher dimensions, particularly to dimension three. This illustration is suggestive of possible applications to measuring distance between patterns for recognition purposes.
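The growth of a pattern under the cruciform neighborhood is easy to reproduce. The Python sketch below labels each cell with the iteration at which it first appears, using * for the original pattern as in the description above. The grid size, the starting pattern, and the function names are hypothetical choices for illustration.

```python
def cross_neighborhood(cell, size):
    """The cell itself plus its right, left, upper, and lower grid neighbors."""
    i, j = cell
    candidates = [(i, j), (i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]
    return {(a, b) for a, b in candidates if 1 <= a <= size and 1 <= b <= size}


def grow(pattern, size, iterations):
    """Map every reached cell to the iteration at which it first appears."""
    first_seen = {cell: 0 for cell in pattern}
    current = set(pattern)
    for k in range(1, iterations + 1):
        # one application of E: union of the neighborhoods of current cells
        current = set().union(*(cross_neighborhood(c, size) for c in current))
        for cell in current:
            first_seen.setdefault(cell, k)
    return first_seen


def render(first_seen, size):
    """Draw the grid with '*' for original cells and digits for later ones."""
    rows = []
    for i in range(1, size + 1):
        row = ""
        for j in range(1, size + 1):
            label = first_seen.get((i, j), ".")
            row += "*" if label == 0 else str(label)
        rows.append(row)
    return "\n".join(rows)


# Hypothetical pattern: a single cell at the center of a 9 by 9 grid.
first_seen = grow({(5, 5)}, 9, 3)
print(render(first_seen, 9))
```

Starting from a single cell, three applications of E produce a diamond of 25 cells, since the cruciform neighborhood advances one unit of city-block distance per step.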

FIG. 2. An illustration of the "growth" of a two dimensional pattern under several iterations of the operator E using a cruciform neighborhood.

Example B. In this illustration we consider the case of central interest in the main body of this paper. We assume that we have m nonvoid sets of numbers $A_1, A_2, \ldots, A_m$, each of which is a finite set of attributes. Our space X is the Cartesian product $A_1 \times A_2 \times \cdots \times A_m$; that is, $x \in X$ implies that $x = (x_1, x_2, \ldots, x_m)$, where $x_i \in A_i$ for $i = 1, 2, \ldots, m$. For each integer $i = 1, 2, \ldots, m$ we suppose there is given a $\delta_i > 0$ (which, as mentioned earlier, might represent one standard deviation for the ith attribute). For $x = (x_1, x_2, \ldots, x_m)$ we let

$$N(x) = \{\, y \in A_1 \times A_2 \times \cdots \times A_m \mid x_i - \delta_i \le y_i \le x_i + \delta_i,\ i = 1, 2, \ldots, m \,\}.$$

We note here that if we were to focus on only n of the attributes, n < m, we would obtain the neighborhoods

$$\bar{N}(x) = \{\, y \in A_1 \times A_2 \times \cdots \times A_m \mid x_i - \delta_i \le y_i \le x_i + \delta_i,\ i = 1, 2, \ldots, n \,\},$$

where $N(x) \subset \bar{N}(x)$ for every x. We are then in the situation described in Proposition 3, so that $\bar{p}(A,B) \le p(A,B)$ or, as indicated earlier in this paper, $p^n(A,B) \le p^m(A,B)$.

It should also be noted that this metric is in no way dependent on what statistical parameters are involved in the determination of the neighborhood of a given element. Indeed, the flexibility in defining the neighborhood allows for a variable weighting of the characters of interest. This is true to the extent that the δ's may be taken to be dependent on the choice of the element itself.

REFERENCES

1 T. Bielecki, Some possibilities for estimating inter-population relationships on the basis of continuous traits, Current Anthrop. 3:3-8, 20-46 (1962).
2 W. A. Beyer, M. L. Stein, T. F. Smith, and S. M. Ulam, A molecular sequence metric and evolutionary trees, Math. Biosci. 19:9-25 (1974).
3 H. T. Clifford and W. Stephenson, An Introduction to Numerical Classification, Academic, New York, 1975, Chapter 8.
4 M. H. Day and B. A. Wood, Functional affinities of the Olduvai hominid 8 talus, Man 3:440-455 (1968).
5 J. S. Farris, The meaning of relationship and taxonomic procedure, Syst. Zool. 16:44-51 (1967).
6 J. S. Farris, Estimating phylogenetic trees from distance matrices, Amer. Nat. 106:645-667 (1972).
7 J. S. Farris, A. G. Kluge, and M. J. Eckardt, A numerical approach to phylogenetic systematics, Syst. Zool. 19:172-189 (1970).
8 W. M. Fitch and E. Margoliash, Science 155:279-283 (1967).
9 F. C. Howell and G. L. Isaac, in Earliest Man and Environments in the Lake Rudolf Basin (Y. Coppens, F. C. Howell, G. L. Isaac, and R. E. F. Leakey, Eds.), Univ. of Chicago Press, Chicago, 1976.
10 N. Jardine and R. Sibson, Mathematical Taxonomy, Wiley, New York, 1971.
11 S. C. Johnson, Hierarchical clustering schemes, Psychometrika 32:241-255 (1967).
12 J. L. Kelley, General Topology, Van Nostrand, Princeton, 1955.
13 L. S. Penrose, Distance, size and shape, Ann. Eugen. 18:337-343 (1954).
14 P. H. Sellers, J. Combinatorial Theory Ser. A 16:253 (1974).
15 E. L. Simons, Primate Evolution: An Introduction to Man's Place in Nature, Macmillan, New York, 1972.
16 G. Simpson, A. Roe, and R. Lewontin, Quantitative Zoology, Harcourt Brace, New York, 1960.
17 R. R. Sokal and P. H. Sneath, Principles of Numerical Taxonomy, Freeman, San Francisco, 1977.
18 M. S. Waterman, T. F. Smith, M. Singh, and W. A. Beyer, J. Theoret. Biol. 64:199-213 (1977).
19 W. T. Williams and M. B. Dale, Fundamental problems in numerical taxonomy, Adv. Bot. Res. 2:35-68 (1965).
20 B. A. Wood, Remains attributable to Homo in the East Rudolf succession, in Earliest Man and Environments in the Lake Rudolf Basin (Y. Coppens, F. C. Howell, G. L. Isaac, and R. E. F. Leakey, Eds.), Univ. of Chicago Press, Chicago, 1976.