
Vector Dissimilarity and Clustering

L. P. LEFKOVITCH, Research Program Service, Research Branch, Agriculture Canada, Ottawa, Ontario, K1A 0C6, Canada

Received 9 September 1988; revised 27 August 1990

Mathematical Biosciences 104:39-48 (1991)

ABSTRACT

Based on the description of objects by m attributes, an m-element vector dissimilarity function is defined that, unlike scalar functions, retains the distinction among attributes. This function, which satisfies the conditions for a metric, allows the definition of betweenness, which can then be used for clustering. Applications to the subset-generation phase of conditional clustering and to nearest-neighbor-type algorithms are described.

INTRODUCTION

The multiplicity of similarity coefficients and their transformations to distances (see Gower and Legendre [5]), coupled with the even larger number of clustering methods [12], offer a sometimes bewildering array of choices to the taxonomist wishing to use numerical procedures, and lead to the eclecticism characterizing publications in this field. Because scalar (dis)similarity coefficients give no information about how each attribute differs among the objects, Lefkovitch [10] proposed a clustering procedure for recognizing species' associations based on the logical relationships among their attributes, applied it to numerical ecology, and pointed out that it is also appropriate for numerical taxonomy. That study therefore demonstrated that measures of pairwise relationships among objects (as the units to be grouped will be called) need not be used for clustering. Because (dis)similarity coefficients are intuitively appealing, the objective of this paper is to define a measure of pairwise relationship that contrasts with scalar coefficients by retaining the identity of the attributes, and to show how it can be used in two distinct clustering procedures. The mathematics used consists largely of the application to clustering of the concept of betweenness in Boolean vector algebra.


VECTOR DISSIMILARITY

For simplicity, consider one-state attributes [7], also known as dichotomies [4], that is, attributes in which the state is either "present" or "absent," such that joint absence in two objects gives no information about their resemblance. (Other types of attributes will be considered below.) For the kth object, let b_k be the m-element (Boolean) vector in which the pth element is unity if object k shows the state for attribute p and is zero otherwise. Denote by B the m × n array whose columns are formed by the n vectors b_k, k = 1, ..., n, each corresponding to one of a set N of n > 0 objects. If i and j are any two such vectors corresponding to objects i and j, then

DEFINITION 1

The vector dissimilarity between objects i and j is defined by the vector g(i,j), with elements (in the same sequence as those in i and j)

$$g_p(i,j) = \begin{cases} 1 & \text{if } i_p \neq j_p, \\ 0 & \text{if } i_p = j_p, \end{cases}$$

where i_p denotes the state of the pth attribute shown by object i.

The object/attribute and vector dissimilarity spaces are the same, namely {0,1}^m, and the function g(·,·) maps the object attributes onto the vector dissimilarities. By inspecting the elements of g(i,j), the attributes in which objects i and j differ can be identified immediately; that is, g(·,·) retains the identity of the attributes. Ellis [2] used a norm of g(i,j) as a measure of the distance between i and j; it is not difficult to show that most commonly used scalar similarity and dissimilarity coefficients are functions of various norms of this vector. The objective of this paper is to avoid the use of norms.

The following notation will be used: 0 denotes a vector of zeros, 1 a vector of unities, $\bar{i}$ the complement of i, and N\S those members of N not in S. The following will apply unless otherwise indicated.

(1) The Boolean sum of vectors is given by the component sums

$$1+1 = 1+0 = 0+1 = 1, \qquad 0+0 = 0.$$

(2) The Boolean product of vectors is formed from the Boolean product of corresponding elements,

$$0 \times 0 = 0 \times 1 = 1 \times 0 = 0, \qquad 1 \times 1 = 1.$$

(3) The inequality between Boolean vectors is defined by the component ordering, 0 < 1.
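As a concrete illustration of Definition 1 and operations (1)-(3), the following sketch (ours, not from the paper; the helper names g, bsum, bprod, and leq are invented here) represents vectors in {0,1}^m as Python tuples of 0s and 1s.

```python
# A minimal sketch of vector dissimilarity and the Boolean vector
# operations; vectors in {0,1}^m are represented as tuples of 0s and 1s.

def g(i, j):
    """Definition 1: 1 in each position where the attribute states differ."""
    return tuple(a ^ b for a, b in zip(i, j))

def bsum(i, j):
    """(1) Boolean sum: 1+1 = 1+0 = 0+1 = 1 and 0+0 = 0, component-wise."""
    return tuple(a | b for a, b in zip(i, j))

def bprod(i, j):
    """(2) Boolean product: 1x1 = 1, all other products are 0."""
    return tuple(a & b for a, b in zip(i, j))

def leq(i, j):
    """(3) Inequality induced component-wise by the ordering 0 < 1."""
    return all(a <= b for a, b in zip(i, j))

i = (1, 0, 1, 1, 0)
j = (1, 1, 0, 1, 0)
print(g(i, j))                       # (0, 1, 1, 0, 0): differ in attributes 2 and 3
print(g(i, j) == g(j, i))            # True: g is symmetric
print(leq(bprod(i, j), bsum(i, j)))  # True: ij <= i+j always
```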
It is easy to verify that the function g(·,·) defines an abelian group in {0,1}^m. Three consequences of the definition of vector dissimilarity are

(1) g(i,j) = 0 implies and is implied by i = j;
(2) g(i,j) = g(j,i); and
(3) g(i,j) ≤ g(i,k) + g(k,j),

which are proved by Blumenthal [1]. Other than the fact that g(·,·) is a vector and not a scalar, these three conditions are identical with those of a metric, and so g(·,·) can be regarded as such. Following Blumenthal [1], the vectors on a straight line defined by i and j are any (and all) pairs of vectors u and t such that $u = it + j\bar{t}$, where $\bar{t}$ is the complement of t. If there are lines defined by i and j, there is also a concept of betweenness, which is given by the following definition.

DEFINITION 2 (Blumenthal [1])

If g(i,j) = g(i,k) + g(k,j), then k is between i and j.

This has a natural relationship with that used for the real line. An equivalent definition is that k is between i and j if it is a vertex of the sublattice whose universal bounds are i+j and ij, while the practical determination takes advantage of yet another equivalence, namely, that k is between i and j if k = k(i+j) + ij. This last is easy to verify by considering all eight possible cases. If a row of B consists either entirely of unities or entirely of zeros or is identical with another row, the betweenness relationships are unchanged by the deletion of the corresponding attributes. A detailed study of Boolean vector geometry is given by Menger and Blumenthal [11]. The remaining task here is to show how Boolean betweenness can be used for clustering.
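Before doing so, the three characterizations of betweenness (the metric identity of Definition 2, membership in the sublattice bounded by ij and i+j, and the practical test k = k(i+j) + ij) can be checked exhaustively for small m, since each component involves only the eight cases mentioned above. A sketch (ours, reusing the helpers from the previous listing):

```python
from itertools import product

# Helpers as in the earlier sketch.
g     = lambda i, j: tuple(a ^ b for a, b in zip(i, j))
bsum  = lambda i, j: tuple(a | b for a, b in zip(i, j))
bprod = lambda i, j: tuple(a & b for a, b in zip(i, j))
leq   = lambda i, j: all(a <= b for a, b in zip(i, j))

def between_metric(i, k, j):
    """Definition 2: g(i,j) = g(i,k) + g(k,j)."""
    return g(i, j) == bsum(g(i, k), g(k, j))

def between_lattice(i, k, j):
    """k is a vertex of the sublattice with universal bounds ij and i+j."""
    return leq(bprod(i, j), k) and leq(k, bsum(i, j))

def between_practical(i, k, j):
    """The practical test: k = k(i+j) + ij."""
    return k == bsum(bprod(k, bsum(i, j)), bprod(i, j))

# The three tests agree on every triple; per component there are only
# eight cases (the values of i_p, j_p, k_p), so m = 3 already covers them.
vs = list(product((0, 1), repeat=3))
assert all(between_metric(i, k, j) == between_lattice(i, k, j)
           == between_practical(i, k, j)
           for i in vs for k in vs for j in vs)
```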

CONDITIONAL CLUSTERING AND VECTOR DISSIMILARITY

The first phase of conditional clustering [8] consists of generating subsets of the objects from the empirically observed data. If a subset S of the n objects is represented by an n-element Boolean vector, a ∈ {0,1}^n, with elements

$$a_i = \begin{cases} 1 & \text{if object } i \in S, \\ 0 & \text{otherwise}, \end{cases}$$

the first phase can be regarded as a function that maps vectors belonging to {0,1}^m onto vectors belonging to {0,1}^n. The motivation for the subset-generating phase can be formulated as being an appropriate answer to the question:

If an arbitrary subset S of the n objects is formed, which others of N\S should also be included?

An answer in the case of the real line is informative. Let

$$z_{(1)} \leq z_{(2)} \leq \cdots \leq z_{(n)}$$

be the order statistics corresponding to a measurement of some continuous variable for the n objects. Fisher [3] proposed that if the objects corresponding to z_{(i)} and z_{(i+k)} are included in S, then those corresponding to any points between them, that is, to z_{(i+j)}, 0 < j < k, should also be included. It follows that if the objects corresponding to z_{(1)} and z_{(n)} are included in S, then so should all of N. The answer to the question for the real line, therefore, is:

Include in S those members of N\S that are between any (pair of) members of S.

It is now proposed that betweenness for vector dissimilarity be used as a subset-generating principle in the same way that the order statistics were utilized heuristically by Fisher [3]. This can be stated as:

Include in S all k ∈ N\S that satisfy

$$g(i,j) = g(i,k) + g(k,j)$$

for all distinct i, j ∈ S, that is, if k is between i and j.
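Read as a closure that is iterated until no further object qualifies, the rule can be sketched as follows (our construction; the function name generate_subset and the dict representation of B are ours):

```python
# Sketch of the subset-generating rule: starting from an initial pair,
# repeatedly add every object that is between all distinct pairs of
# current members of S, until nothing changes.

g    = lambda i, j: tuple(a ^ b for a, b in zip(i, j))
bsum = lambda i, j: tuple(a | b for a, b in zip(i, j))

def between(i, k, j):
    return g(i, j) == bsum(g(i, k), g(k, j))

def generate_subset(B, seed):
    """B maps object labels to 0/1 tuples; seed is an initiating pair."""
    S = set(seed)
    changed = True
    while changed:
        changed = False
        for k in sorted(set(B) - S):
            # include k if it is between every distinct pair in S
            if all(between(B[i], B[k], B[j])
                   for i in S for j in S if i != j):
                S.add(k)
                changed = True
    return frozenset(S)

B = {"a": (1, 1, 0), "b": (1, 0, 0), "c": (1, 1, 1), "d": (0, 0, 1)}
print(generate_subset(B, ("b", "c")))   # the subset {a, b, c}: a is between b and c
```

Phase 1 would then collect the distinct subsets obtained from every qualifying initial pair into the matrix A discussed next.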

It remains necessary to identify the initial members of S. Although this may seem to consist of each of the nonempty subsets of N, which is not wrong, it is unnecessary for conditional clustering. If S is initiated by any single object i, then k will be included in S iff k = i; but objects with identical attribute vectors can be represented by just one of them. Suppose an S is initiated by a distinct pair of objects, i and j. If g(i,j) = i+j (i.e., ij = 0), objects i and j have no attribute exhibiting the identical state (excluding "absence" and attributes not included in the study).¹ A k can be between them iff k = 0, but since such a k is between every pair of objects, there is no interest in the subset of objects formed by such pairs. If a subset is initialized by three or more objects, whether or not they contain such pairs, it is not difficult to see that the subset generated by them will be represented by the union of subsets initiated by pairs of its members. This implies that only pairs of objects differing in at least one attribute need be used. Confining attention to these pairs may occasionally result in some objects not being included in any multiple-object subset; but such objects are isolated with respect to the others, so that each forms a single-object subset. Thus only pairs of objects, excluding those for which either i = j or ij = 0, need be considered to initiate the family of subsets for subsequent study. The upper limit of $\binom{n}{2}$ on the number of subsets is rarely reached, not only because of the disqualifications described above, but also because different initial pairs may generate the same final subset. The restriction to pairs, however, requires that the third phase (see below) of conditional clustering must not be constrained to be a partition but allowed to be a covering.

¹It is interesting to note that the numerator of the corresponding (scalar) Jaccard similarity coefficient for i and j (see Sneath and Sokal [12]), which is the cardinality of ij, is zero in these circumstances.

To complete phase 1 of conditional clustering, let the H distinct subsets generated by betweenness be assembled into an n × H matrix A. If the logical reductions of A [8] do not give a unique covering, phase 2 obtains a maximum entropy-based "cost," c_h, h = 1, ..., H, for each subset from the Perron-Frobenius column eigenvector v of A^T(J - A), where J is an n × H matrix of unities. Phase 3 then obtains the groupings as the subsets defined by the columns of A indicated by the unities in the binary vector x:

$$\min\{\, c^T x \mid Ax \geq \mathbf{1},\; x_h \in \{0,1\},\; h = 1, \dots, H \,\},$$

here using ordinary matrix multiplication. If c is defined as c_h = -log v_h, the optimal solution will maximize the joint probability of the chosen subsets, or if defined as c_h = -v_h log v_h, it will maximize the information in the choice [9]. The constraint Ax ≥ 1 ensures that the optimal solution is an irredundant covering of the objects. If the solution is not a partition, but one is required, it can be formed as the union of nondisjoint subsets in the covering. Further details of these procedures are given in [8] and [10].

Extension of vector dissimilarity to s-state unordered attributes is straightforward, using standard procedures for converting these to s one-state attributes. For an ordered attribute, the inclusion of k in the subset defined by i and j depends on whether the measurement for object k is not outside the range determined by i and j. However, if the attribute is a random variable, there is a case for extending the range to include k in the subset if the average distance between it and the current members of S does not exceed the maximum distance among them. This (recursive) extension of the range [8], which is based on extreme value theory, assumes that a unit difference measures the same degree of dissimilarity throughout the whole range.
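Phases 2 and 3 as described above can be sketched as follows (our code, with invented names; brute-force enumeration stands in for a proper 0/1 set-covering solver, so it is practical only for small H):

```python
import numpy as np

def optimal_covering(A, kind="probability"):
    """A: (n, H) 0/1 incidence array of objects x generated subsets."""
    n, H = A.shape
    J = np.ones((n, H))
    # Perron-Frobenius (dominant) column eigenvector of A^T (J - A),
    # scaled to sum to one and used as the subset "probabilities" v.
    w, V = np.linalg.eig(A.T @ (J - A))
    v = np.abs(np.real(V[:, np.argmax(np.real(w))]))
    v /= v.sum()
    # Costs: -log v_h for maximum joint probability, or -v_h log v_h
    # for the information criterion of [9].
    c = -np.log(v) if kind == "probability" else -v * np.log(v)
    best_x, best_cost = None, np.inf
    for mask in range(1, 2 ** H):              # all binary vectors x
        x = np.array([(mask >> h) & 1 for h in range(H)])
        if (A @ x >= 1).all() and c @ x < best_cost:   # Ax >= 1: a covering
            best_x, best_cost = x, c @ x
    return best_x, best_cost
```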

SINGLE LINKAGE CLUSTERING USING VECTOR DISSIMILARITY

Vector dissimilarity can also be used for dendrogram formation. After eliminating one of every pair of objects for which g(i,j) = 0, since such objects are identical with respect to their attributes, a greedy-type grouping procedure based on betweenness is:

Let T be initialized by any object; for j, j′, j″ ∉ T, include j in T if ∃ i ∈ T such that g(i,j) ≤ g(i,j′) and ∄ j″ such that g(i,j″) ≤ g(i,j),

that is, include j in T if it is between at least one object i belonging to T and some j′ not belonging to T, and there is no other object j″ between j and i. By joining j to all members of T for which this is true, a relative neighborhood graph is generated. If this graph has no cycles (it will be connected), it will be a minimum spanning tree with respect to vector dissimilarity. If a j is joined to just one rather than all i that bring about its inclusion, then one of the several possible minimum spanning trees will be produced. The relationship between minimum spanning trees and single linkage dendrograms is discussed by Gower and Ross [6].
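One reading of this rule (ours; the printed statement is terse, so this is a sketch under that reading rather than the paper's algorithm) joins two objects whenever no third object lies between them, which yields a relative-neighborhood-type graph under vector dissimilarity:

```python
# Sketch: connect two objects when no third object is between them;
# with duplicates already removed, the edges form a relative-
# neighborhood-type graph with respect to vector dissimilarity.

g    = lambda i, j: tuple(a ^ b for a, b in zip(i, j))
bsum = lambda i, j: tuple(a | b for a, b in zip(i, j))

def between(i, k, j):
    return g(i, j) == bsum(g(i, k), g(k, j))

def neighborhood_graph(B):
    """B maps object labels to 0/1 tuples, duplicates removed."""
    names = sorted(B)
    return [(i, j)
            for a, i in enumerate(names)
            for j in names[a + 1:]
            if not any(between(B[i], B[k], B[j])
                       for k in names if k != i and k != j)]
```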

NUMERICAL EXAMPLE

Table 1a gives the data from Table 1 of [10] for 10 species after removing duplicate sites. Table 1b gives the 21 distinct subsets generated using betweenness confined to those initial pairs for which g(i,j) ≠ i+j. Table 1c gives the subset probabilities computed as described in [9]. Table 1d gives the row reductions and lists the remaining subsets for determining an optimal covering. The optimal coverings, which were identical for joint probability and information as objective functions, consisted of two subsets {a, b, c, d, e, f, j} and {f, g, h, i, j} having two objects in common. No partition of the objects could be found in these 21 subsets. The covering solution can be compared with the two given in [10], the first of which, obtained directly from the complete incidence matrix, consisted of the two subsets {a, b, c, d, e, j} and {b, c, d, e, f, g, h, i}, whereas the second, based upon a scalar dissimilarity coefficient, consisted of three subsets {a, b, c, d, e}, {e, f, g, h, i}, and {j}. Clearly, there are resemblances and differences; since the role of clustering is to generate hypotheses, the differences are perhaps more informative than the resemblances.

TABLE 1
Numerical Example: 10 Species, 12 Attributes

(a) Initial incidence matrix transposed, B^T: 10 species (A-J) by 12 one-state attributes (data from Table 1 of [10]; the 0/1 entries are not reproduced here).

(b) The 21 subsets generated: the n × H matrix A (10 species by 21 subsets; the 0/1 entries are not reproduced here).

(c) Subset probabilities and information

Subset   Probability   Information
1        0.036346      0.120474
2        0.023830      0.089047
3        0.048862      0.147502
4        0.036346      0.120474
5        0.026852      0.097136
6        0.086229      0.211326
7        0.073713      0.192212
8        0.064219      0.176311
9        0.087676      0.213412
10       0.078182      0.199264
11       0.026239      0.095522
12       0.053679      0.156996
13       0.024633      0.091233
14       0.038596      0.125614
15       0.027096      0.097772
16       0.050926      0.151625
17       0.041432      0.131907
18       0.026856      0.097145
19       0.040818      0.130563
20       0.068258      0.183237
21       0.039213      0.127001

(d) Covering solution procedures

Row reductions: H covered by I; H covered by G; E covered by F; B covered by C; E covered by D; E covered by J.

Remaining subsets, by original subset number (first has maximum probability): 9, 10, 16, 6, 7, 8, 2, 4, 20, 12, 13, 14, 21, 15 (reduced incidence matrix not reproduced).

Optimal covering: original subsets 6 and 20.
(i) Maximum joint probability: objective function = 0.005885.
(ii) Maximum information: objective function = 0.394563.

DISCUSSION

Although the origin of the description vectors is 0, interpretable as a vector describing "absence," other origins can be used to represent pairwise relationships. If the origin is 1 (i.e., "presence"), i as originally defined is replaced by $\bar{i}$, but there is no effect on g(·,·), as shown by the fact that

$$g(i,j) = g(\bar{i},\bar{j});$$

that is, using 1 as the origin is a mapping that preserves distance and is therefore a motion [2]. With 1 as the origin, however, the concept of similarity may be more natural, and although vector similarity can be represented by the complement of g(·,·), betweenness and clustering rules may be more awkward to define.

Of a number of further developments of vector dissimilarity that may be of value for future studies, one will now be briefly discussed. Scalar dissimilarities map the relationship between i and j onto a point of a space usually considered to be continuous (often treated as isomorphic to a Euclidean space), and so allow the definition of derivatives. It is useful to develop the idea of a discrete derivative. Let i be any vector belonging to {0,1}^m, and i(p) the same vector with the pth element replaced by its complement.

DEFINITION 3

The discrete derivative of g(i,j) with respect to j, denoted by g′(i,j), is an m × m matrix with elements

$$g'_{pq}(i,j) = \begin{cases} 1 & \text{if } g_q(i,j) \neq g_q(i,j(p)), \\ 0 & \text{otherwise}, \end{cases} \qquad p, q = 1, \dots, m.$$
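A sketch of this definition (ours, under the indexing convention above: entry (p, q) records whether complementing the pth element of j changes the qth element of g(i,j)):

```python
# Sketch of the discrete derivative: entry (p, q) is 1 when flipping
# attribute p of j changes component q of g(i, j).

def g(i, j):
    return tuple(a ^ b for a, b in zip(i, j))

def flip(j, p):
    """j(p): j with its pth element replaced by the complement."""
    return j[:p] + (1 - j[p],) + j[p + 1:]

def discrete_derivative(i, j):
    m = len(i)
    return [[int(g(i, j)[q] != g(i, flip(j, p))[q]) for q in range(m)]
            for p in range(m)]
```

With complete Boolean data this matrix is simply the identity, since complementing j_p changes g_p and nothing else; the diagnostic value described in (1) and (2) below arises when missing values or compound objects make some of the comparisons vacuous.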

The discrete derivative has immediate application in two areas.

(1) If the pth value for object i is missing, the corresponding elements of g′(i,j) and g′(j,i) are zero; the more zeros there are in g′(i,j), the less well-founded is the measure of dissimilarity and hence any grouping based on the data.

(2) Suppose compound objects are constructed, that is, a group of objects not necessarily identical in the states of each attribute but considered to be a single entity, for example, a species. If for two compound objects (including single objects) g′(i,j) is a zero matrix, there are good grounds for replacing i and j by a compound formed from them; that is, g′(i,j) has a potential as a clustering criterion.

In addition to the possible role of g′(i,j), the fact that effective group-forming and dendrogram-forming clustering procedures are based directly on vector dissimilarity illustrates the potential of this new measure of pairwise relationships, especially as the clustering principle of betweenness has a very natural interpretation. This potentiality is supported by the numerical example and its comparison with two other procedures differing only in the dissimilarities (one without any at all, the other with the scalar Jaccard similarity), because all three suggest essentially the same groups although with some minor differences. Although it is not known which grouping is correct, the results are suggestive about those objects that appear to belong together without doubt and also about those whose positions are perhaps uncertain. Since the objective of clustering is to generate hypotheses about group existence and membership, differences such as these are more likely to be helpful in coming to a decision about relationships than is the adoption of a single method asserted to be the best.

The clustering procedures based on vector dissimilarity described above are appropriate in any context where a classification based on a set of attributes is wanted. Vector dissimilarity is immediately applicable to phenetics in biology, especially if the objective is to assemble individuals into groups. One of the criticisms of phenetics made by advocates of numerical cladistics, where the objective is to model the phylogeny of such groups, is that the use of overall measures of (dis)similarity is not helpful for phylogenetic reconstruction, and that each attribute should be considered separately. Vector dissimilarity does not include the sense of direction in each attribute required by cladistics, but if the zero state is defined as being ancestral to that represented by unity, it becomes possible to interpret the vectors ij and i+j phylogenetically. For example, it is almost (but not absolutely) a truism that the states shown by an ancestral form include those that are uniform in the taxa supposedly descended from it; thus for objects i and j, since ij is between i and j, a hypothesis for the states of their ancestor is ij, while the diversity of attributes in the combined taxon is i+j. The attributes that have changed in i correspond to the unities in g(i, ij), while the unities in g(i+j, ij) indicate those attributes that have changed in one or the other of them. If ij = 0, it must be concluded that the description of the objects is insufficient to determine the states shown by a common ancestor. Because g(·,·) retains the separateness of the attributes required by numerical cladistics without the biologically false assumption of independence implicit in much cladistic practice, there is a role for vector dissimilarity in cladistics as well as in phenetics.
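As a small numerical illustration of this phylogenetic reading (the data here are invented for the illustration), with 0 taken as the ancestral state:

```python
# Sketch: ij as the hypothesized ancestor, i+j as the diversity of the
# combined taxon, and g(., ij) as the derived (changed) states.

g     = lambda i, j: tuple(a ^ b for a, b in zip(i, j))
bsum  = lambda i, j: tuple(a | b for a, b in zip(i, j))
bprod = lambda i, j: tuple(a & b for a, b in zip(i, j))

i = (1, 1, 0, 1, 0)
j = (1, 0, 1, 1, 0)
ancestor  = bprod(i, j)            # ij, which is between i and j
diversity = bsum(i, j)             # i+j
print(ancestor)                    # (1, 0, 0, 1, 0)
print(g(i, ancestor))              # (0, 1, 0, 0, 0): states derived in i
print(g(diversity, ancestor))      # (0, 1, 1, 0, 0): changed in i or in j
```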

REFERENCES

1. L. M. Blumenthal, Boolean geometry I, Rend. Circ. Mat. Palermo Ser. II 1:343-360 (1952).
2. D. Ellis, Autometrized Boolean algebras, Can. J. Math. 3:145-147 (1951).
3. W. D. Fisher, On grouping for maximum homogeneity, J. Am. Stat. Assoc. 53:789-798 (1958).
4. J. C. Gower, A general coefficient of similarity and some of its properties, Biometrics 27:857-871 (1971).
5. J. C. Gower and P. Legendre, Metric and Euclidean properties of dissimilarity coefficients, J. Classification 3:5-48 (1986).
6. J. C. Gower and G. J. S. Ross, Minimum spanning trees and single-linkage cluster analysis, Appl. Stat. 18:54-64 (1969).
7. L. P. Lefkovitch, Hierarchical clustering from principal coordinates: an efficient method for small to very large numbers of objects, Math. Biosci. 31:157-174 (1976).
8. L. P. Lefkovitch, Conditional clusters, musters and probability, Math. Biosci. 60:207-234 (1982).
9. L. P. Lefkovitch, Entropy and set covering, Inf. Sci. 36:283-294 (1985).
10. L. P. Lefkovitch, Species associations and conditional clustering: clustering with or without pairwise resemblances, in Developments in Numerical Ecology, P. Legendre and L. Legendre, Eds., Springer, Berlin, 1987, pp. 309-331.
11. K. Menger and L. M. Blumenthal, Studies in Geometry, Freeman, San Francisco, 1970.
12. P. H. A. Sneath and R. R. Sokal, Numerical Taxonomy, Freeman, San Francisco, 1973.