Conditional Clusters, Musters, and Probability

L. P. LEFKOVITCH
Engineering and Statistical Research Institute, Agriculture Canada, Ottawa K1A 0C6, Canada

Received 12 November 1981; revised 2 February 1982

MATHEMATICAL BIOSCIENCES 60:207-234 (1982)

ABSTRACT

Some consequences, modifications, and extensions of the conditional clustering procedure are described. These include (1) ensuring that the neighborhoods are connected; (2) the derivation of the subset-generating phase from extreme-value theory; (3) relaxing the convexity constraint on the neighborhoods, and so defining entities called musters; (4) defining the concept of probability of membership in an optimal covering, and using the maximum-entropy principle to estimate these; (5) defining a maximum-joint-probability solution to the clustering problem; (6) interpreting the recognition of clusters as information gain with respect to the attribute states of objects, given a probabilistic interpretation of pairwise similarity. Numerical examples are given.

I. INTRODUCTION

The rules of biological nomenclature make it mandatory that each taxon at a given level of taxonomic organization (e.g. a species) participates in precisely one taxon (e.g. a genus) at the next level. What determines a level of organization, and hence a taxon, is operationally well understood by taxonomists, although often undefinable. Given hitherto unstructured and unfamiliar taxa, taxonomists attempt to recognize (or create) organization within them with purposes which, in addition to the legal impositions of nomenclature, range from the convenience of memory to the modeling of evolutionary pathways. Grouping taxa at one level is isomorphic with set partitioning, is called clustering if numerical procedures are used, and implies small values for an appropriately defined entropy. The two main kinds of numerical methods used in clustering can be called "brute force" and "intelligent," whose distinction, even if both produce the same results, is in the amount of work that they do. "Brute force" procedures examine many possible arrangements and explicitly prune the unlikely or inadmissible, which "intelligent" procedures do not even consider. Minimizing the amount of work or confining attention to a small number of possibilities alone does not guarantee an intelligent procedure. Thus many sequential methods pretend to no more than computational optimality and may not achieve the principal objective, which is to produce groups of objects appropriate for a classification, since they usually form dendrograms more appropriate for the modeling of a phylogeny; while many nonhierarchical procedures make strong assumptions restricting the subsets unrealistically (e.g. to being contained in regions having the same shape, orientation, and size, differing only in position), so that the properties of what has been rejected are not known. The method considered in this paper places much weaker constraints on the containing regions, and confines attention to arrangements which are explicitly defined as being reasonable candidate groups appropriate for a classification in terms of the primary rather than the computational objective.

The procedures described in this paper use the concepts of set covering (i.e. a family of subsets of the objects, a hypergraph, constrained so that each object is included at least once) and set partitioning (i.e. a choice constrained so that each object is included precisely once); note that a partition is also a covering, but that coverings need not be partitions. A small check of both conditions is sketched below. The proposals in Section II extend those of [16] in two ways, and fall into the class described by Padberg [17] for the solution of set covering and partitioning problems, which consists of two stages:

Stage 1: Using the set of rules defining "acceptable" subsets of [the objects], generate explicitly [an ensemble of them belonging to the power set of the objects] such that the probability of an optimal solution being contained in [this ensemble] is sufficiently high.

Stage 2: Replace [the power set] in the problem definition... by a list of the members of [the generated subsets], and solve the associated... set-covering problem.
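The covering/partition distinction recurs throughout the paper, so a minimal illustration may help. The following hypothetical sketch (not from the original; it borrows the 0-1 incidence matrix A and selection vector x introduced formally in Section V) checks whether a selection of subsets covers or partitions the objects:

import numpy as np

# objects {0,1,2}; candidate subsets T1={0,1}, T2={1,2}, T3={2}
A = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 1]])   # a_ik = 1 iff object i belongs to subset k

def is_covering(A, x):
    return bool(np.all(A @ x >= 1))   # every object included at least once

def is_partition(A, x):
    return bool(np.all(A @ x == 1))   # every object included exactly once

print(is_covering(A, [1, 1, 0]), is_partition(A, [1, 1, 0]))  # True False
print(is_covering(A, [1, 0, 1]), is_partition(A, [1, 0, 1]))  # True True

Selecting T1 and T2 covers every object but includes object 1 twice; selecting T1 and T3 is also a partition.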

The new extensions are both associated with the first of Padberg's stages. First, by weakening one assumption, subsets consisting of one or more isolated objects together with those contained within a defined region of the dissimilarity space will no longer be found. As a consequence the subsets can be characterized in a simple manner, and may be generalized so that the regions containing them need not be convex. Second, each subset is considered to have a probability of participating in an optimal solution, and these probabilities are estimated by the maximum-entropy principle.

In trying to determine how to choose the optimal clustering from the subsets which are generated in stage 1, several things have been noted elsewhere [16]:

(1) insisting on a partition rather than a covering is likely to mislead, since it will not be clear whether the partition corresponds with "reality" or is merely an artifact of the computational procedures; by contrast, if a covering is also a partition, it will be more convincing;


(2) using the original data not only to generate the subsets but also to make the choice seems to be self-defining and circular in reasoning;
(3) the "usual" measures of homogeneity do not work;
(4) external criteria, e.g. additional data, are rarely available.

It is concluded that since the rules of stage 1 supposedly find "good" subsets, the solution must lie somewhere in the patterns shown by the generated subsets. Two main issues emerge: how to determine an optimal covering (this is considered in Section V), and how to define clusters from the subsets, since there is no reason to expect that the family of containing regions determined by the stage 1 rules is such that a homogeneous (unknown) population is contained within just one. Suppose clusters should be externally isolated and internally homogeneous [5]; the external isolation of a group is often easily recognized no matter which definition of the minimal hull for the containment of the objects is used, but high internal homogeneity usually implies that the objects are very much alike, and hence the hull tends to be hyperspherical. Because this interpretation may be challenged, and also because of the semantic overtones acquired by the term "cluster" in the context of numerical taxonomy, a more general term, namely a muster, is used to avoid confusion. A muster will be given a precise definition, and a particular kind of it will coincide with Cormack-type clusters; these concepts are considered in Sections III and IV. In brief, Section III considers internal homogeneity in terms of restricted subsets, each of which is generated by every pair of its members, and also strongly restricted subsets, which are restricted subsets disjoint from all other subsets, while Section IV considers external isolation in terms of musters, each of which is the union of all subsets having members in common, and also weak musters, each of which is the union of all subsets whose containing regions intersect. Thus musters are subsets whose containing regions belong to a family larger than that of stage 1. The contrast between restricted subsets and musters is especially revealing, since a strongly restricted subset which is simultaneously a weak muster has high internal homogeneity and external isolation, and is therefore a cluster sensu Cormack.

According to Lefkovitch [16], subsets likely to be part of an optimal covering or partition of a set, N, of n objects can be obtained as follows. If i, j, ... denote single objects, D the set of pairwise dissimilarities among the members of N (see Appendix A; here they will be regarded as Euclidean distances, so that only one of i or j will be retained if d_ij = 0), and V(S_t) the neighborhood of a subset S_t of the objects in the dissimilarity space formed at stage t (t = 1, 2, ...), then

    S_{t+1} = S_t ∪ {i : i ∈ V(S_t)},    (I.1)


and the process terminates when S_{t+1} = S_t. Thus S_1, whose choice is considered below, generates a subset, G(S_1), of the objects, so that (I.1) can be considered as a function

    F : 2^N → 2^N,    (I.2)

where 2^N denotes the power set of N. If it were necessary to consider each member of the power set for S_1, then (I.1) would not be attractive computationally, but if V is chosen so that

    lim |S_t| = n  as  max{d_ij : i, j ∈ S_1} → ∞,
    lim |S_t| = 1  as  max{d_ij : i, j ∈ S_1} → 0,    (I.3)

the domain of F can be restricted to each of the pairs, and (I.2) becomes

    S_1 ∈ N^(2) ⊂ 2^N.    (I.4)

Because the G generated by one pair is not necessarily distinct from that generated by another, the number of distinct G may be considerably less than the maximum n(n-1)/2 implied by (I.4), and so the range of the restricted F is a small subset of 2^N. In practical implementation, (I.1) guided by (I.3) led to formulating a rule which was

    include object i in S_{t+1} with those of S_t if w_i ≤ γ(e^{δ_t} − 1),    (I.5)

where δ_t = δ(S_t) is the average dissimilarity among the members of S_t, w_i is the average dissimilarity between i and the members of S_t (see Appendix A), and γ ≥ ½ is a parameter; thus (I.5) implies that V(S_t) is a convex region with boundary given by

    γ(e^{δ_t} − 1) = 1,    (I.6)

which is used as the indicatrix (unit hypersphere) of a Minkowski distance function. In summary, the subsets generated can be regarded as a function of two arguments and some parameters, and written as

    G_k(i, j) = F(i, j; V, D),    (I.7)

with G_k (k = 1, ..., m ≤ n(n+1)/2) denoting the kth distinct subset; each will be called an F-subset. A sketch of this generating phase is given below.
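The following is a minimal sketch of stage 1 coded directly from (I.1) and (I.5) (hypothetical Python, not the published program; the distance matrix d and the parameter gamma are assumptions of the sketch):

import numpy as np

def f_subset(d, i, j, gamma=1.0, max_iter=100):
    """Grow the subset generated by the pair (i, j) under (I.1) and (I.5)."""
    S = {i, j}
    for _ in range(max_iter):
        members = sorted(S)
        delta = np.mean([d[a, b] for x, a in enumerate(members)
                         for b in members[x + 1:]])    # mean within-subset dissimilarity
        threshold = gamma * (np.exp(delta) - 1.0)      # boundary of V(S_t), (I.5)-(I.6)
        grown = set(S)
        for k in range(d.shape[0]):
            if k not in S and np.mean([d[k, a] for a in members]) <= threshold:
                grown.add(k)                           # S_{t+1} = S_t U {i : i in V(S_t)}
        if grown == S:                                 # terminates when S_{t+1} = S_t
            return frozenset(S)
        S = grown
    return frozenset(S)

# the distinct values of f_subset over all initiating pairs are the F-subsets of (I.7)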


However, further study of (I.1) shows that it is unduly restrictive in a way to be discussed; modifying it introduces a "learning" component at the expense of only trivial amounts of additional computation, and permits the generated subsets to be classified with respect to their participation in an optimal covering. Furthermore, the indicatrix of (I.5)-(I.6), in addition to being ad hoc, is also scale dependent; to be preferred is one which is scale free and which arises from some established body of theory.

II. THE MODIFIED GENERATING PROCEDURE

An implicit assumption in (I.1) is that all objects belonging to a subset are also in its neighborhood, i.e.,

    i ∈ S_t ⇒ i ∈ V(S_t).    (II.1)

Example 1. Consider an initiating pair of objects, i and j, separated by a large distance, and many others located close to j; all other objects are remote from these. The operation of (I.5) results in the average distance among the members of S_t falling as more objects are included, so that eventually w_i may exceed γ(e^{δ_t} − 1). Thus object i, a member of S_t, is not included in V(S_t). It follows that an F-subset may consist of a connected region and some isolated points in the dissimilarity space.

While a first reaction to Example 1 is that it is pathological, it is apparent that with respect to S_t, i is an outlier: its removal not only leaves a more homogeneous subset, but also causes δ_{t+1} to fall even further, possibly resulting in the removal of further objects, ensuring that object i cannot be reincluded. The assumption in (II.1) is therefore replaced by

    i ∈ S_t ⇔ i ∈ V(S_t),    (II.2)

so that (I.1) becomes

    S_{t+1} = {i : i ∈ V(S_t)},    (II.3)

while (I.5) is reworded as

    include in S_{t+1} only those i for which w_i ≤ γ(e^{δ_t} − 1).    (II.4)

The function replacing (I.7) is

    T_k(i, j) = φ(i, j; V, D).    (II.5)

Each T_k generated by (II.5) will be called a φ-subset.

Example 2. If Example 1 is changed so that in addition there are many objects situated close to i, then the effect of (II.3) with S_1 = {i, j} may have the result that S_t = ∅ if δ_1 is sufficiently small.

This example can be taken as evidence in support of the claim that for fixed V, there tend to be fewer φ-subsets than F-subsets. Both examples imply that in φ-subset generation it is possible for one or both of the initiating objects to be excluded, and so their special status is reduced.

The arguments so far presented have been leading towards the notion that making a decision about the membership of an object in a subset is analogous to its consideration as a possible outlier, and if the probability of its being an outlier is sufficiently low, one should include the object in the subset. To consider this as a possible model for subset generation, a number of assumptions have to be made. These assumptions, which are virtually the same as those of Scott [19], are that a subset is a candidate for inclusion in an optimal covering if

(1) for any subset of objects, there exists a probability density for an arbitrary point in the dissimilarity space which depends only upon the distance between this point and the subset;
(2) given a subset of objects, the remaining objects are mutually independent, at least locally.

These remarks and assumptions lead naturally to considering the probability of membership as being one of extreme values. Using the exponential extreme-value distribution (dissimilarities with a theoretical upper bound of unity require prior transformation to −ln(1 − d_ij); see Appendix A),

the distribution function of the probability of membership can be approximated by

    F(x) = exp{−exp[−(x − c_t + g·b)/b]},    (II.6)

whose location and scale parameters can be estimated by the standard moment relationships for this distribution (for details of these distributions and of the estimation problem see [10]), where c_t is the centroid of the population, which is here assumed to be zero, and g is the Euler number 0.57722... . To a reasonable degree of approximation, if

    Pr(i ∈ S_t) = exp{−exp[−(w_i − A_t + 0.45σ_t)/(0.78σ_t)]}    (II.7)

(the characteristic largest value A_t and the dispersion σ_t are defined below),

then (II.4) can be replaced by

    S_{t+1} = {i : Pr(i ∈ S_t) ≤ α},    (II.8)

where α is either specified in advance or perhaps is a function of S_t. Suppose α_t is some function of S_t; (I.3) can then be written as

    lim |S_t| = n  as  α_t → 1,
    lim |S_t| = 1  as  α_t → 0.    (I.3′)

To this, if a further requirement is added, namely, that the cardinality should also tend to increase with the variability of the subset, a definition of α_t may be obtained which is data dependent. Suppose A_t is the characteristic largest value, defined here as max(d_ij : i, j ∈ S_t); if σ_t = A_t − δ_t, then this additional requirement can be expressed as

    |S_{t+1}| tends to increase with σ_t.    (I.3a)

Using the probability corresponding with A_t to define α_{t+1} gives

    α_{t+1} = exp{−exp[−0.45σ_t/(0.78σ_t)]} ≈ 0.57,    (II.9)

which, for practical application, decides on membership by comparing w_i with A_t (i.e., include object i in the subset if its average dissimilarity to the members does not exceed the maximum dissimilarity among them). Although (II.7)-(II.9) assume extreme-value probability theory and the geometry is somewhat more difficult to visualize than that of (II.4), the rule is scale free, and the decision level is data dependent and does not involve an arbitrary parameter. A sketch of the modified generator is given below.
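A minimal sketch of the modified generator (II.3) with the practical form of the scale-free rule (hypothetical Python, under the assumptions above; not the published program):

import numpy as np

def phi_subset(d, i, j, max_iter=100):
    """(II.3): keep exactly those objects whose mean dissimilarity to the
    current subset does not exceed the largest dissimilarity within it."""
    n = d.shape[0]
    S = {i, j}
    for _ in range(max_iter):
        members = sorted(S)
        A_t = max(d[a, b] for a in members for b in members)   # characteristic largest value
        new_S = set()
        for k in range(n):
            others = [a for a in members if a != k]
            if others and np.mean([d[k, a] for a in others]) <= A_t:
                new_S.add(k)       # unlike (I.1), an initiating object may be dropped
        if new_S == S:
            return frozenset(S)
        S = new_S
        if len(S) < 2:             # the pair may dissolve entirely (Example 2)
            return frozenset(S)
    return frozenset(S)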

By definition, each φ-subset is a collection of objects contained within one member of a family of convex neighborhoods whose shape and size are determined (in part) by the objects. Convexity, which is a fairly weak restriction, will be relaxed in a later section, and subsets, called φ-musters, contained in neighborhoods which not only need not be convex but may not even be a star body, will be defined. To provide a foundation for this relaxation, it is necessary to classify the φ-subsets.

III. RESTRICTED SUBSETS

Some φ-subsets can be seen to be more homogeneous than other φ-subsets; this attribute will now be expressed, in terms of each subset and of the pairs generating it, concisely in terms of their images. Consider

    B_k = T_k \ {i : (∃j)(i, j ∈ T_k, i ≠ j, φ(i, j; V, D) ≠ T_k)},    (III.1)

defined by the point-to-set function, which omits those members of T_k which do not generate it. Then

DEFINITION 1

A subset T_k will be called φ-restricted if B_k = T_k, and φ-diffuse if B_k ⊂ T_k.
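Stated operationally, B_k keeps only those members every one of whose pairs regenerates T_k. A minimal sketch of the test (hypothetical code, assuming a phi_subset(d, i, j) routine such as the one sketched in Section II):

def image_B(d, T, phi_subset):
    """B_k of (III.1): the members of T_k all of whose pairs regenerate T_k."""
    T = frozenset(T)
    return frozenset(i for i in T
                     if all(phi_subset(d, i, j) == T for j in T if j != i))

def classify(d, T, phi_subset):
    B = image_B(d, T, phi_subset)
    return "phi-restricted" if B == frozenset(T) else "phi-diffuse"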

There are many differences between these two kinds of subsets, of which one will be described in terms of the dissimilarities among their members. Let r(T_k) be the ratio of the smallest to the largest dissimilarity between distinct pairs (i.e. excluding i = j); this order statistic tends to unity in the most homogeneous subsets, and to zero in the most heterogeneous.

THEOREM 1

If L and M are respectively φ-restricted and φ-diffuse subsets, |L| = |M|, and δ(L) = δ(M), then r(L) > r(M).

Proof. Since M is diffuse and δ(L) = δ(M), then because there is at least one pair of members of M which do not generate the subset, min d_ij (i, j ∈ M) < min d_ij (i, j ∈ L); the assumption of equal δ completes the proof. ∎

COROLLARY

max d_ij (i, j ∈ M) ≥ max d_ij (i, j ∈ L).

Proof. This follows at once from the equality of mean dissimilarities and the inequality of Theorem 1. ∎

Thus with respect to r(T_k) and related measures, φ-restricted subsets can be said to be more homogeneous than φ-diffuse ones. φ-restricted subsets are reminiscent of maximal cliques, as the following easily proved consequences of (III.1) illustrate.

THEOREM 2

(a) If the union of two or more φ-restricted subsets is a φ-subset, it is φ-diffuse.
(b) If the union of two or more φ-diffuse subsets is a φ-subset, it may be φ-diffuse or φ-restricted.
(c) If a proper subset of a φ-restricted subset contains more than one object, it cannot be a φ-subset.
(d) A φ-subset which is a subset of a φ-diffuse subset may be φ-restricted or φ-diffuse.

The following lemmas are useful with respect to optimal coverings, and prepare the ground for the definition of isolated φ-restricted subsets.

LEMMA 1

The intersection of two φ-restricted subsets contains no more than one object.

Proof. Suppose L_1 and L_2 are both φ-restricted, L_1 ≠ L_2, and |L_1 ∩ L_2| ≥ 2; choose S_1 = {i, j ∈ L_1 ∩ L_2, i ≠ j}. By definition, both L_1 and L_2 are obtained from S_1, but this implies L_1 = L_2, which contradicts the supposition that they are distinct. ∎

LEMMA 2

φ-subsets which are proper subsets of φ-restricted subsets will not be generated by (II.5).

Proof. This is a simple consequence of (II.5) and (III.1). ∎

DEFINITION 2

A φ-restricted subset disjoint from any other φ-subset will be called strongly φ-restricted.

THEOREM 3

If each of the final ensemble of subsets is strongly φ-restricted, together they form an optimal partition with respect to φ.

Proof. Since they are disjoint, they form a partition, and since they are φ-restricted, by Theorem 1 they are more homogeneous than if they were to have been φ-diffuse. Lastly, no other subsets can be formed by the specified φ. ∎

THEOREM 4

If the final ensemble includes φ-restricted subsets whose union forms a covering, then only these need to be considered for an optimal covering, and the φ-diffuse subsets may be deleted.

Proof. Deleting the φ-diffuse subsets, which by Theorem 1 are more heterogeneous than the φ-restricted, does not create infeasibility; combining this with Lemma 2 completes the proof. ∎

The main consequences of these theorems are of practical interest and can be expressed as

COROLLARY 1

If the final ensemble includes strongly φ-restricted subsets, they participate in the optimal covering.

Determining if a subset is φ-restricted is best done after completing any logical reductions [16] on the ensemble; Lemma 1 shows that only those subsets whose intersection with any other is no greater than unity need be investigated. The determination may be made either by repeating some computation, or by keeping a record of the pairs of objects generating each subset. Although r(T_k) is a useful description of the homogeneity of a subset, it rarely takes a value of unity even for φ-restricted subsets. There may be advantage, on occasions, in describing heterogeneity by the ratio |T_k \ B_k| / |T_k|, which will always be zero for φ-restricted subsets.

IV. φ-MUSTERS

Elsewhere [16] it has been emphasized that each subset in an optimal solution need not bear a one-to-one relationship with an unknown true population, since the "shapes" that the latter have in the dissimilarity space are not necessarily part of the V-family, and may not even be convex. In consequence, it is necessary to weaken the convexity assumption and enlarge the V-family, thereby defining a form of subset homogeneity exhibiting continuity rather than the compactness of φ-restricted subsets.

DEFINITION 3

A subset R satisfying the three conditions
(a) R ≠ ∅ [i.e., R is not empty],
(b) (∀T_k)(T_k ⊆ R ∨ T_k ⊆ N\R) [i.e., a φ-subset belongs either to R or to the complement of R],
(c) (∀W ⊂ R)(∃T_k)((T_k ⊄ W) ∧ (T_k ⊄ N\W)) [i.e., R has no proper subset which satisfies (a) and (b)],
is called a φ-muster. This definition is essentially identical with one proposed by Tutte [22].
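In computational terms (Section I), a muster is the union of all φ-subsets having members in common, so the musters are the connected components of the graph whose vertices are the subsets and whose edges join intersecting subsets. A minimal union-find sketch (hypothetical code):

def musters(subsets):
    subsets = [frozenset(T) for T in subsets]
    parent = list(range(len(subsets)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path compression
            a = parent[a]
        return a

    for a in range(len(subsets)):
        for b in range(a + 1, len(subsets)):
            if subsets[a] & subsets[b]:     # shared member: same muster
                parent[find(a)] = find(b)

    comps = {}
    for a, T in enumerate(subsets):
        comps.setdefault(find(a), set()).update(T)
    return [frozenset(c) for c in comps.values()]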

THEOREM 5

The family of φ-musters forms a partition.

Proof. Definition 3 implies that φ-musters are pairwise disjoint. ∎

COROLLARY

If min |R| = n, the only set of φ-musters covering N consists of the partition formed by the improper subset.

Proof. The assumption implies that there is only one φ-muster. ∎

To avoid contradiction, the neighborhood of a φ-muster formed by at least two φ-subsets is defined differently from (I.6), and is

DEFINITION 4

The neighborhood of a φ-muster is the union of the neighborhoods of its component φ-subsets.

COROLLARY

The neighborhood of a φ-muster need not be convex. In fact, there is no requirement that the boundary be smooth, or that the neighborhood be without holes.

The weakest form of homogeneity can now be defined: if R_u is the uth φ-muster, and Z_u = Z(R_u) is its neighborhood, then

DEFINITION 5

The subset

    H = ∪ {R_u ∪ R_v : Z_u ∩ Z_v ≠ ∅}

is called a weak φ-muster.

In words, if the neighborhoods of two musters intersect, form the subset which is the union of the objects they contain. Linking with Theorem 3, the following definitions characterize special clusters and partitions:

DEFINITION 6

A φ-muster which is also strongly φ-restricted is called a φ-restricted cluster.

DEFINITION 7

If all φ-musters are φ-restricted clusters, the partition is called φ-regular.


DEFINITION 8

If the subsets forming a φ-regular partition are also weak φ-musters, the partition is called φ-isolated.

If a φ-isolated partition is found, this implies that large changes in the definition of V would be needed in order to achieve a different partition, and also implies the mutual isolation important in the context of numerical taxonomy [5]. In consequence, there is a high probability that the partition is optimal. If there is only one muster for a given set of objects, there may be some inadequacy in the definition of V or D; equally, the objects may truly belong to one group, suggesting that an ordination may be preferable to a clustering.

V. OPTIMAL COVERINGS

If the majority of generated subsets are diffuse, or there is only one muster, a further step is needed. By implication, φ-subsets are more homogeneous than those not formed, and also are isolated from other φ-subsets, so that a solution chosen from just them has a high probability of being globally optimal. It is the main purpose of this section to define and obtain this probability, which is the joint probability corresponding to the optimal solution in (V.9) below; note that the particular values obtained will be assumed to be conditional on the choice of φ.

If A represents an n × m matrix with a_ik = 1 if object i is a member of subset k, and a_ik = 0 otherwise, then a covering is indicated by any binary vector x satisfying Ax ≥ 1, while those x for which Ax = 1 indicate partitions. If a particular object belongs to precisely one subset, this subset must be part of every covering, and the corresponding element in x can be set equal to unity. If object i belongs only to the same subsets as object j, then object i subsumes object j, and the latter need not be considered explicitly; if j belongs also to subset k which then becomes empty, then x_k = 0 (these reductions are sketched below). Considering all members of 2^N, including those not generated by (II.3), these remarks can be expressed as the following probability statement:

    p*_k = Pr(x_k = 1) =
        0  if subset k has not been formed by φ or is empty by subsumption,
        1  if subset k is part of every covering.    (V.1)
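A minimal sketch of the two reductions just described (hypothetical code; the 0-1 incidence matrix A, objects as rows, is an assumption of the sketch):

import numpy as np

def reductions(A):
    """Fix essential subsets; drop objects subsumed by another object."""
    n, m = A.shape
    # a subset that is the sole home of some object must be in every covering
    essential = {int(np.flatnonzero(A[i])[0]) for i in range(n) if A[i].sum() == 1}
    kept = []
    for j in range(n):
        # object i subsumes object j when i's memberships are contained in j's:
        # any subset covering i then automatically covers j
        subsumed = any(i != j and np.all(A[i] <= A[j]) and
                       (A[i].sum() < A[j].sum() or i < j)   # tie-break identical rows
                       for i in range(n))
        if not subsumed:
            kept.append(j)
    return essential, kept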

While just those φ-subsets for which p*_k = 1 may form a covering, this is unlikely. The notion now proposed is that each remaining subset has a probability p_k of participating in a covering, namely,

    0 < p_k < 1,    (V.2)


where these probabilities are to be considered more in the sense of "degrees of belief" than of frequencies; these probabilities can be normalized to standardize the total belief to unity. An optimal covering can now be regarded as the conjunction of individual hypotheses that a subset participates in an optimal solution, each of which depends on the contribution the subset makes, and so the desired solution is one for which the joint probability is a maximum. As already noted, a subset is absolutely essential if a particular object belongs just to it, and is less so the more other subsets there are to which its members belong. Thus the information given by an object about an optimal covering is smaller the more widespread its membership, and dually, the probability that a particular subset is part of this covering depends on the pooled information given by its members. This implies that a subset's probability should be based on the greatest number of possibilities for its contribution to a covering, which in turn leads to the maximum-entropy principle to obtain consistent estimates of them [21]. The importance of an object is inversely proportional to the number of subsets to which it belongs, or, equivalently, directly proportional to the number to which it belongs in the complementary problem, defined by A* = (a*_ij) = (1 − a_ij). This is equivalent to defining the relative importance of the objects as the probability of their participation in a set representation (an optimal ensemble of objects such that each subset is represented). Those objects eliminated by the reductions have prior values of zero for these probabilities, while those belonging just to one subset have prior values of unity. Let π_i, Σπ_i = 1, denote the set representation probability of the ith remaining object; then the previous arguments lead to asserting that p ∝ A′π, and also that π ∝ A*p. Suppose

    p* = A′π / 1′A′π,    π* = A*p / 1′A*p    (V.3)

are prior estimates of p and π respectively. Then in the absence of other constraints, the minimum-cross-entropy (maximum-conditional-entropy) estimate of p is

    p̂ = {p : min Σ_k p_k log(p_k / p*_k)  subject to  Σ_k p_k = 1} = p*,    (V.4)

and, similarly, the set representation probabilities are estimated by

    π̂ = π*.    (V.5)

L. I’. LEFKOVITCH

220 Substituting

i for g, b for p in (V.3) and rearranging (A’A* - A r)a = (A*A’-

gives

0,

XI)& = 0

(V.6)

where λ is a scalar, so that it is apparent that p̂ is an eigenvector of A′A*, and π̂ is an eigenvector of A*A′. Since A′A* is irreducible (if A is) and nonnegative, the eigenvalue of largest absolute value is positive, with the corresponding eigenvector nonnegative, there being no other positive eigenvalue with corresponding nonnegative eigenvector. Thus the eigenvector associated with the largest eigenvalue is the desired solution. A two-step iterative scheme based on (V.3) converges quickly, and estimates both p̂ and π̂ (a sketch is given below, after the constrained case). Although this minimum-cross-entropy solution is given by almost any reasonable loss function, if other constraints exist the solutions obtained are not necessarily consistent unless this formalism is adopted [21]. For example, if a partition is the objective, the requirement that the intersections of the constituent subsets in the partition must all be null can be formed into a set of constraints on the probabilities. If A is considered as the incidence matrix of a hypergraph, and B the corresponding adjacency matrix of its representative graph, i.e.,

    b_kl = 1  if T_k ∩ T_l ≠ ∅, k ≠ l,
    b_kl = 0  otherwise,    (V.7)

then appending the restriction

    p̂′ B p̂ = 0    (V.8)

to (V.3) will confine attention to solutions which will indicate partitions. Without (V.8), some of the p_k may be zero, but to satisfy the additional constraints some must be zero unless B is completely null. Since B may be of large order, the additional computational requirements may be considerable; it is better, perhaps, to adjust V so as to ensure that subset isolation is a major component of the generating phase. Johnson [11] gives APL code for the solution of constrained minimum-cross-entropy problems, and Erlander [6] places the problem in a more general setting. Abadie [1] discusses numerical methods for the solution of this class of optimization problems.
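For the unconstrained case, the two-step scheme of (V.3) amounts to a power iteration for the dominant eigenvectors of (V.6). A minimal sketch (hypothetical code; the incidence matrix A is assumed 0-1):

import numpy as np

def covering_probs(A, tol=1e-10, max_iter=1000):
    """Alternate p <- A'pi/1'A'pi and pi <- A*p/1'A*p until convergence."""
    n, m = A.shape
    A_star = 1 - A                      # complementary incidence matrix A*
    pi = np.full(n, 1.0 / n)            # start from uniform object probabilities
    p = np.full(m, 1.0 / m)
    for _ in range(max_iter):
        p = A.T @ pi
        p = p / p.sum()                 # p* = A'pi / 1'A'pi
        pi_new = A_star @ p
        pi_new = pi_new / pi_new.sum()  # pi* = A*p / 1'A*p
        if np.max(np.abs(pi_new - pi)) < tol:
            return p, pi_new
        pi = pi_new
    return p, pi

The essential subsets (prior probability unity) and eliminated objects (prior probability zero) of (V.1) are fixed beforehand, as described in the text.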

The covering itself is then found from

    minimize  −Σ_k x_k log p̂_k  subject to  Ax ≥ 1,  x_k ∈ {0, 1},    (V.9)


which is a linear least-cost set-covering problem whose objective function is the negative of the log likelihood of the postulated x; requiring Ax = 1 in (V.9) constrains x to indicate a partition. The optimal vector is therefore a maximum-likelihood solution conditional on φ. To solve (V.9), it is beneficial to eliminate unneeded subsets.

THEOREM 6

If W is the union of two or more φ-subsets, and if T_k ⊆ W, and Pr(W) > p_k, then x_k = 0 in all optimal coverings.

Proof. Assume that at least one object in T_k is required to complete the covering. It is obvious that the joint probability will be higher if the subsets forming W are used rather than T_k. ∎

If a partition is being sought, the condition is that T_k = W rather than being just a subset. Since large p_k implies x_k = 1, and small p_k implies x_k = 0, (V.9) may be replaced by min(−Σ p_k log p_k) and solved approximately by replacing small probabilities by zero until (Ax)_i = 0 for at least one element i; even if this is not the optimal solution, it provides an irredundant cover and a bound on the solution. A heuristic method for obtaining a near-optimal covering, proposed by Chvátal [4], is very much faster than the methods ensuring a solution, and the degree of approximation is probably no worse than that in obtaining the subsets (a sketch is given below). The method may also be used to obtain a partition, if one exists, by replacing −log p_k with y_k = |T_k| t − log p_k, where t > −Σ log p_k [8]; it is necessary to verify that the solution obtained is a partition (see also [24]).

In contexts other than that of numerical taxonomy, the usefulness of a subset need not be proportional to its probability, and weights may be used in addition. Let c_k denote the "cost" of the kth subset; then minimizing −Σ x_k c_k log p_k is a possible objective function (weighted log likelihood). If the probabilities are not considered relevant, then c′x is the objective function (this was used by Lefkovitch [16]). All of these linear set-covering problems may be solved by methods given by Garfinkel and Nemhauser [8] or by Balas and Ho [3]. Minimizing the average cost, c′x/x′x, is a fractional problem which may be solved by methods described by Granot and Granot [9].
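A minimal sketch of a Chvátal-style greedy approximation to (V.9), with cost −log p_k per subset (hypothetical code; it assumes a feasible covering exists and returns an irredundant cover, not necessarily the optimum):

import math

def greedy_cover(subsets, p, n_objects):
    """Repeatedly pick the subset with least cost per newly covered object."""
    uncovered = set(range(n_objects))
    chosen = []
    while uncovered:
        best = min((k for k in range(len(subsets)) if uncovered & set(subsets[k])),
                   key=lambda k: -math.log(p[k]) / len(uncovered & set(subsets[k])))
        chosen.append(best)
        uncovered -= set(subsets[best])
    return chosen

Note that a subset with p_k = 1 has zero cost and is selected first, as (V.1) requires.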


No matter how the optimal covering is obtained, it does not necessarily follow that each of the selected subsets corresponds to a distinct unknown true population. Musters, possibly even weak musters, formed from the subsets in the optimal covering are probably to be considered as distinct populations. Should there be only one muster, this suggests either that the objects really form one group, and that distinct populations are not represented in the data, or, more interestingly, that the data chosen to obtain the dissimilarities do not reflect the population differences which exist.

NUMERICAL EXAMPLES

The first example, from psychometry, is given in some detail, since the results have an intuitive interpretation; the second reveals some unsuspected but unsurprising groupings of 55 species of medicks, while the third is included to indicate that relatively large problems are amenable to these methods.

Example 1. Published data on the similarities of lowercase letters as perceived in Sweden (Kuennapas and Janson [13]) were used; since w is not a usual part of the Swedish alphabet, and since the three vowels represented by letters with diacritical marks were also excluded, only 25 objects were studied. Table 1 gives the generated distinct subsets, the reduction patterns (e.g. "d covered by b"), and the estimated subset-covering and object-representation probabilities. Table 2 gives the best covering and the best approximation to a partition that were found, together with the values of the joint probabilities and entropy. Comments on the six musters formed from the optimal covering are given in Table 3; similar comments may be made for the near partition.

Example 2. The data used in this example consist of 100 attributes assessed on each of 55 species of medicks (Medicago, including alfalfa); similarities were based on the Gower measure as modified by Lefkovitch [14] for each of the main sets of attributes (fruit, flower, vegetative parts), and the consensus among them [15] (thought to represent the relationships among the objects more reliably than that obtained by pooling the data) used to obtain the pairwise dissimilarities. Altogether 86 subsets were generated; these fell into 4 musters containing more than one species, and a further 9 each containing 1 (species incertae sedis). The same musters were obtained after reductions, which left 35 subsets, and an optimum covering obtained included just 15. Only 6 of these 15 contained more than one object, and these 6 fell into 4 musters (Table 4). The table also indicates that the musters in the optimal covering consist of groups for which there are good biological interpretations.

Example 3. The data used here consist of 200 attributes assessed on about 2000 species, belonging to 152 defined genera or species groups within each genus, of beetles belonging to the North American Scarabaeidae. These attributes, which if present were all 2-state in the terminology of Lefkovitch [14], were used to compute dissimilarities. 197 distinct subsets were obtained, of which 33 had prior probabilities of unity, including 20 consisting of single genera; after logical reductions, 121 remained.

CONDI’l‘IONAL CLUSTERS, MUSTERS, AND PROBABILITY

223

TABLE 1
Generated Subsets and Representation and Covering Probabilities of Letter-Similarity Data

Letter (representation probability):
a (1.0), b (0.0), c (0.0), d (0.129), e (0.163), f (0.0), g (0.149), h (0.0), i (0.0),
j (0.0), k (1.0), l (0.0), m (0.152), n (0.0), o (0.127), p (0.129), q (0.0), r (1.0),
s (1.0), t (0.0), u (0.152), v (0.0), x (1.0), y (0.0), z (0.0)

Subset (covering probability):
1 (0.048), 2 (0.061), 3 (0.198), 4 (0.057), 5 (0.057), 6 (0.047), 7 (0.108), 8 (0.096),
9 (0.055), 10 (0.048), 11 (0.113), 12 (0.057), 13 (0.057), 14 (1.0), 15 (1.0), 16 (1.0),
17 (1.0), 18 (1.0)

Reduction sequence: d covered by b; [the per-letter subset lists and the remaining
reduction pairs are not legible in the source scan]

TABLE 2
Near-Optimal Coverings for Letter-Similarity Data

Solution and content                 Subsets   Musters   −ln(joint probability)   Entropy
Generated subsets (Table 1)            18         5             34.691             3.530
Covering (3 7 11 14-18)                 8         6              6.025             0.807
Near-partition (2 3 5 12 14-18)         9         8             10.167             0.816

TABLE 3
Subsets, Musters and Comments on Optimal Covering of Swedish Letter Data^a

Muster^b           Comments
b d g p q c o      Circular letters, with or without a vertical stroke
u n m h k e        Parallel vertical linearity, with or without a vertical stroke
v x y              Angled letters, open above
s z                Zigzag letters
f i j l r t        Vertical linearity
a                  Roundness with a hook

^a See [13] for the letter shapes actually used.
^b Subsets underlined.

The optimal covering contained just 13 of the multiple-object subsets. The results are summarized in Table 5 and reveal that the traditional groupings have been recovered; however, the study is incomplete, and so the detailed arrangements have not been given, although it is important to note that a muster consisting of Dynastinae and Rutelinae, because of the dubious relationships of just one genus, is provocative. The lines connecting the subfamilies in Table 5 are extracted from the minimum spanning tree of the 152 genera. However, the single-linkage clusters corresponding to the musters cannot be obtained with just one phenon line, but require several, and these fail to recover the traditional classification; by contrast, the musters did not. The difference is easily explained if it is remembered that groups so separated can be very unequal in the dissimilarity space, and so in the space needed to contain them.

VI. DISCUSSION

Many clustering algorithms are based on the following logic: if G(N) is a global measure of some property of interest about a clustering (e.g. the determinantal ratio of discriminant analysis, or some norm which measures departures of a reconstructed similarity matrix from that empirically obtained), and some values of G(N) are "better" than others, then construct an algorithm to optimize G(N). There is nothing wrong with this reversal of logic if the objective is to form clusters, since if a feasible algorithm exists, the clusters will satisfy this property.


TABLE 4
Subsets of Medicago Species Generated by Conditional Clustering Contained in the Optimal Covering

Subset (probability):
plicata, radiata (1.0)
lupulina, secundiflora (1.0)
blancheana, rotata (1.0)
rotata, arabica, constricta, coronata, disciformis, doliata, granadensis, intertexta,
  lanigera, littoralis, minima, murex, muricoleptis, noeana, praecox, rigidula, sauvagei,
  shepardii, tenoreana, tornata, truncatula, turbinata (0.1256)
archiducis-nicolai, cancellata, carstiensis, cretacea, daghestanica, hybrida, papillosa,
  pirottae, platycarpa, popovii, prostrata, pubescens, rhodopaea, rupestris, ruthenica,
  sativa, suffruticosa (0.1861)
archiducis-nicolai, cancellata, cretacea, daghestanica, hybrida, marina, papillosa,
  pirottae, prostrata, rhodopaea, rupestris, sativa, saxatilis, suffruticosa (1.0)
Species incertae sedis (1.0 each): arborea^d, ovalis, rugosa^c, hispida^b, laciniata^b, scutellata^b

^a Perennial species.
^b Annual species.
^c Transitional to Trigonella.
^d The only shrub.

But if the purpose is not merely to form clusters, but rather to reveal the existence of unknown true populations in N, then global measures are suspect: why should the properties of one true population (e.g. the pattern of covariation among the attributes of its members) have any relationship with those of another? (Assuming that it does is implicit in the use of "sister groups" by the Hennigian school of cladistic methodology.) Thus in forming the subsets which are candidates for being one of the true populations, the properties of other candidate subsets are irrelevant; as it were, each of the objects under study must be seen through the eyes of the candidate subsets. This is the essence of stage 1 of conditional clustering.

Stage 1 depends on an empirically chosen set of rules operating on empirically observed or inferred measures of relationship among the objects.

TABLE 5
Simplified Mustering of North American Scarab Genera^a

Troginae: 2(1) subsets^b, 3 genera, 40(54,305)^c
Ceratocanthinae: 1 subset, 2 genera, 15(15,85)
Hybosorinae: 1 subset, 2 genera, 8(12,13)
Ochodaeinae: 1 subset, 2 genera, 21(31,61)
Glaphyrinae: 1 subset, 1 genus, 5(8,8)
Pleocominae: 1 subset, 1 genus, 17(27,27)
Aphodiinae: 5(3) subsets, 17 genera, 220(557,1612)
Scarabaeinae^d: 3(2) subsets, 5 genera, 90(91,1700)
Melolonthinae: 12(8) subsets, 24 genera, 418(960,1600)
Chasmatopterinae: 1 subset, 3 genera, 5(5,5)
Rutelinae^e: 1 subset, 10 genera, 71(83,367)
Anomalinae: 1 subset, 5 genera, 81(117,267)
Dynastinae: 1 subset, 21 genera, 154(164,494)
Trichiinae: 2(1) subsets, 6 genera, 24(28,33)
Valginae: 1 subset, 1 genus, 3(5,20)
Cetoniinae: 1 subset, 15 genera, 71(86,126)

^a Grouped into subfamilies, and superimposed on the minimum spanning tree.
^b Indicates 1 of the 2 subsets contains just 1 genus.
^c Indicates 40 species examined; 54 in North America and Mexico, and 305 in the world.
^d Further genera of Scarabaeinae and all Geotrupinae (about 20 genera in all) remain to be studied.
^e The genus Parastasia 1(1,74) placed in both.

If stage 1 and its "clean up" phases (reductions, musters) are insufficient to give a classification of the objects, any further processing must depend on the further consequences of stage 1; these are here expressed in terms of a probability, which is interpreted as the degree of belief about the truth of certain propositions (namely, membership of the optimal covering) given the truth of others (namely, the pattern of object subset membership). Furthermore, these probabilities (not to be confused with the probability of attribute-state identity; see Appendix A), to be consistent, must satisfy the


maximum-entropy principle [21], especially if it is argued that clustering is entropy reducing [23]. Indeed, it may be argued that, without independent data, there is no other appropriate characterization for each subset, since the rules defining the neighborhood function almost certainly reflect the anticipated properties of an optimal covering, making a second extraction of information from the same primary source somewhat questionable.

In summary, the whole process can be described as a multiple-level nonlinear programming problem which finds an x satisfying

    min  −Σ_k x_k log p_k

subject to

    Σ_k p_k log(p_k / p*_k) is a minimum,
    Σ_i π_i log(π_i / π*_i) is a minimum,

where

    p* = A′π / 1′A′π,    π* = A*p / 1′A*p,

together with

    Ax ≥ 1  and  x_k ∈ {0, 1}.

Suggestions for a computer program to generate A (the φ-subsets) and the sequence for the remaining stages are given in Appendix B.

Zoologists, in describing new species, are required to compare them with the supposed nearest relatives. A near relative of conditional clustering is "Adclus" [20]. In the latter, R is an n × n similarity matrix among objects; A, as in conditional clustering, is an n × m incidence matrix; and Y a diagonal m × m nonnegative matrix of weights. It is postulated that

    R = AYA′

and that A and Y are chosen by minimizing ||R − AYA′||. In the original form of Adclus, A is initialized as the distinct maximal complete subsets for each distinct r_ij used as a threshold (i.e., each subset is a hypersphere of radius


d_ij); as noted by Arabie and Carroll [2], this can become computationally unmanageable, and among other modifications, they propose specifying the number of subsets. It is apparent that A can also be obtained as in stage 1 of this paper, and the subsets need no longer be hyperspheres. Having obtained A, determining the Y which minimizes the loss function is a familiar optimization problem (a sketch is given below); furthermore, since small y_ii imply that the corresponding columns of A are of little importance, they can be deleted, so admitting a subsidiary criterion, namely, to minimize m. However, the second phase of Adclus uses the same data to determine the Y (i.e. to choose the subsets) as were used in generating them; as already noted, there are pitfalls in this.
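A minimal sketch of that familiar optimization (hypothetical code, not the Adclus program itself; it fits the nonnegative diagonal weights by nonnegative least squares on the vectorized problem):

import numpy as np
from scipy.optimize import nnls

def adclus_weights(R, A):
    """Choose nonnegative diagonal Y minimizing ||R - A Y A'|| (Frobenius)."""
    m = A.shape[1]
    # column k of the design matrix is vec(a_k a_k') for subset k
    X = np.column_stack([np.outer(A[:, k], A[:, k]).ravel() for k in range(m)])
    y, _ = nnls(X, R.ravel())
    return y    # small y_k suggest deletable subsets, reducing m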

Most clustering methods are justified either by their plausibility or by the fact that they have worked. Perhaps no more is required for what essentially are empirical processes, but among competing procedures, the advantages of the present one are perhaps not apparent. One requirement rarely considered is that of consistency, i.e., as the number of objects increases, the solution obtained converges (or converges in probability) to the true solution. Because the subset-generating phase can be considered as a particular kind of k-means procedure (each initial pair defines a cluster center which is updated as its composition changes), consistency follows from a theorem of Pollard [18]. The main advantages over the usual k-means methods are (1) the initial cluster centers are chosen systematically to span the whole space and are not confined to a small number; (2) the properties of a subset do not depend on those of any other; (3) the subsets need not be convex. Other advantages can be readily found. To my knowledge, the consistency of any of the hierarchical methods has never been established, and it seems to be a difficult problem.

A weaker requirement is that of sensitivity; a clustering method is suspect if changes in the input data result in disproportionate (large or small) changes in the output. There seems to be good reason to expect conditional clustering to be satisfactory in this respect, although this question has been examined only in terms of the neighborhood function [16]. Some widely used hierarchical procedures are deficient even here [7].

Even if one has satisfied requirements such as consistency and proportionate sensitivity, it is necessary to determine how much confidence should be placed in any model. One method, to compare its results with those of other models (the principle of respectable association, which is the basis for much testing of clustering methods), is inadequate, since the right answers can be obtained using faulty or even inappropriate reasoning. The point is that modeling is not a substitute for thought, but its extension in the form of a disciplined, ordered framework for analysing data. Thus, to ask if a model is correct or valid is an empty question, since this is equivalent to asking if knowledge and thought are correct or valid; these may be inaccurate,


illogical, inappropriate, etc., but never invalid. If the data are correctly translated, the model cannot do other than trace out the consequences correctly. These remarks emphasize the base on which all clustering methods are founded and on which they founder, namely, the data chosen to describe the objects, and also how they are used to obtain a measure of dissimilarity. It must be hoped that the choice and computation are guided by informed intuition and not solely, or even largely, by convenience.

Conditional clustering, as described in this paper, is consistent and constructed on clearly stated simple principles; unlike dendrogram methods, it provides a solution to the clustering problem, and unlike many other nonhierarchical procedures, with only minimal artificial constraints. It has three of the main components of artificial intelligence [12]: the first is implicit pruning (φ-subsets), the second is chunking (musters, and the material leading to Theorem 6), and the third is analogy (objective functions as decision criteria). Further development of each alone may yield better procedures.

APPENDIX A

This paper is founded on the notion of the dissimilarity between pairs of objects. Although dissimilarity is a reasonably well-understood concept which sometimes can be estimated directly, it is usually based on more primitive concepts from which it tends to be defined operationally. With the express purpose of illuminating this paper, a definition will be attempted based on these primitive concepts, now described. An attribute is a subdivision of an object which exhibits one of a number of states; sometimes the number of states is finite, and sometimes the possibilities map to some continuous subset of the real line. In any particular context, the definition of an attribute and the delimitation of its states is an empirical problem. The set of attributes will be denoted by Z, and z_ik is the state of the kth attribute of object i; |Z| may be finite for practical purposes, but in fact is infinite. Suppose each member of N is described by the states shown by each of |Z| ≥ 1 attributes. Then

DEFINITION 9

The similarity between objects i and j, denoted by s_ij, is

    s_ij = Pr(z_ik = z_jk, k ∈ Z),

where k is chosen at random from Z.

Remark. This definition may be clarified by an example. Suppose each attribute may show only one of two states; let a characteristic function be defined as

    s_ijk = 1  if  z_ik = z_jk,
    s_ijk = 0  otherwise,        k = 1, ..., |Z|.

Then

    s_ij = E_k(s_ijk).

The whole class of similarity coefficients can be regarded as being based on different characteristic functions.

Remark. Because the empirical nature of attributes does not ensure their mutual independence, estimates of similarity tend to be too high; prior probabilities (weights) on the attributes are sometimes used to reduce this.

If similarity is a probability, then so is dissimilarity in this framework, i.e.

    d_ij = 1 − s_ij,

which is the probability that the objects do not show the same state for an attribute chosen at random. The mean dissimilarity between an object and the members of a set now satisfies the relationships

    w_i = 1 − |S_t|^{-1} Σ_{j ∈ S_t} s_ij = 1 − E_{j ∈ S_t}(s_ij) = 1 − s_{i,t},

where s_{i,t} is the similarity of object i to S_t, defined as

    s_{i,t} = Pr(z_ik = z_jk | k ∈ Z; j ∈ S_t).

The mean dissimilarity among members of a set becomes

    δ_t = 1 − E_{i,j ∈ S_t}(s_ij) = 1 − Pr(z_ik = z_jk | k ∈ Z; i, j ∈ S_t).
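A minimal sketch of these quantities (hypothetical code; Z is assumed to be a matrix of attribute states, rows as objects, columns as attributes):

import numpy as np

def dissimilarities(Z):
    """d_ij = 1 - Pr(z_ik = z_jk) for an attribute k chosen at random."""
    n = Z.shape[0]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d[i, j] = 1.0 - np.mean(Z[i] == Z[j])
    return d

def w(d, i, S):
    """Mean dissimilarity between object i and the members of S_t."""
    return float(np.mean([d[i, j] for j in S]))

def delta(d, S):
    """Mean dissimilarity among the members of S_t."""
    S = sorted(S)
    return float(np.mean([d[S[a], S[b]]
                          for a in range(len(S)) for b in range(a + 1, len(S))]))

# bounded dissimilarities enter (II.6)-(II.9) through -log(1 - d),
# the Hartley information of attribute-state identity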

In the context of (II.6)-(II.9), the prior transformation required for bounded dissimilarities gives

    −ln(1 − d_ij),

i.e. the Hartley information relevant to the states of their attributes, so that the values corresponding to w_i and δ_t are the expectations of this information, namely the entropy. Thus the subset-generating phase can be considered as the inclusion of objects in a subset until too much chaos is created, i.e. until the probability of identity of attribute states in two objects chosen at random becomes too small in relation to some initial condition.

APPENDIX B

Since the arithmetic for generating φ-subsets is of order n^4, care must be exercised in its computer implementation. In the accompanying pseudo-ALGOL (Table 6), some additional features are included, hitherto unmentioned, which depend on the fact that if two objects are sufficiently distant, the φ-subset obtained is N, and so operating on this pair represents wasted effort. Suppose for any object the distances to all remaining objects not yet used with it as initial pairs are sorted in ascending order, and subsets initiated in that order. If the φ-subset corresponding to any such pair is N, then it will also be N for any subsequent pair involving the object in question, and also for any other pair separated by at least the same value, and so these pairs need not be considered. With an efficient sorting procedure, the total amount of work can be reduced without any change in the subsets obtained.

Two heuristics are also incorporated. The first makes possible the processing of very large numbers of objects with only modest storage and run times. It is motivated by the following reasoning. Suppose the smallest distance between pairs of objects generating N is called cutoff. The effect of a prior reduction of its value is to prevent the formation of subsets of cardinality approaching n occupying a large proportion of the space, which are those having a high probability of containing all of more than one true population, i.e. to eliminate subsets of no great interest. In contrast, the use of too low an initial cutoff may cause a single true population to be represented by more than one φ-subset, and thus produce an inferior solution. Any deterioration may be partly or completely remedied by making further passes through the data with a higher cutoff and using a specially selected subset of the objects. If this subset is sufficiently small, cutoff need not come into play. This subset is obtained as follows: at the beginning of each pass, suppose cutoff = f(1/ns),


TABLE 6

procedure gensubset(d, n, A);
  real d(1..n, 1..n), dd(1..n), c, cutoff;
  integer ind(1..n), now(1..n), n, ms, nl, i, il, ii, j, jj, m, mm, l, nu, k, ns;
  set A(1..n, 1..n(n-1)/2), S1(1..n), S2(1..n);
comment It is assumed that pairwise distances are available in d by means of some prior
  process, and that objects identical to others have been removed. The declaration set is
  easily implemented by the usual boolean arrays, but a more efficient code can result if
  sets are stored by bits. Output is assumed to be incorporated in the procedures reduce,
  muster, probs, and cover. These, and other external references, perform the following
  functions:
    cut     finds the value of cutoff
    sort    sorts an array in ascending order, carrying a set of pointers with it
    maxd    finds the maximum distance among objects belonging to a subset
    avd     finds the average distance between an object and the members of a subset
    store   stores a subset
    reduce  performs the row reductions on the subsets
    again   decides if the process should be repeated with fewer objects, and selects them
    muster  forms the musters from the subsets
    probs   estimates the probabilities of the subsets and the objects
    cover   finds an optimal covering
end comment;
if (n < 3) go to end gensubset;
for i := 1...n, now(i) := i; end i;
ns := n; ms := n/2;
cycle: cutoff := cut(ns); nl := ns - 1; A := empty;
for ii := 1...nl, i := now(ii); m := 0; il := ii + 1;
  for jj := il...ns, j := now(jj); m := m + 1; ind(m) := j; dd(m) := d(i, j); end jj;
  sort(dd, ind, m);
  for il := 1...m,
    if (dd(il) > cutoff) go to end ii
    else j := ind(il); S1 := {i, j}; mm := 2;
    for l := 1...n,
      nu := 0; S2 := empty; c := maxd(S1);
      for k := 1...n,
        if (avd(S1, k) > c) go to end k
        else nu := nu + 1; S2 := S2 union {k};
      end k;
      if (nu = n) cutoff := min(dd(il), cutoff);
      if (((nu > mm) and (nu > ms)) or (nu = n)) go to end ii
      else if ((nu < mm) or (nu < 2)) go to end il
      else if (S2 = S1) go to keep
      else S1 := S2; mm := nu;
    end l;
    keep: if (S1 not in A) store(S1, A);
  end il;
end ii;
reduce(A, now); if (again(now)) go to cycle
else muster(A); probs(A); A := cover(A); muster(A);
end gensubset;


where ns is the number of objects in the current pass (initially ns = n); at the end of a pass, the reduction process, in addition to its usual role of eliminating unneeded φ-subsets, records the identity of the dominated objects and also those belonging just to one subset, in an array called now; this array will contain the ns objects to be used in a subsequent pass, if any. Further passes can then be made until ns no longer decreases. The justification for this restricted selection is as follows. The inclusion of objects belonging just to one subset is clearly necessary, since their membership pattern may be due to their having been prevented from belonging to others by the (low) value of cutoff rather than by being truly isolated; a higher value may allow them to be associated with other objects. The use of the dominated objects rather than the dominators is explained by the fact that since the dominated belong to at least as many subsets as those that dominate them, at a higher cutoff they will still be dominated by them; for the same reason, the use of the dominators may result in more single-object subsets for higher cutoff values, which is a contradiction. The formation of musters, even weak musters, is clearly desirable to complete the process. Some experimentation with a few data sets suggests

    cutoff = 1/ln(ns)

is a "good" value as a proportion of the range of the dissimilarities.

The second heuristic is to set an upper limit, ms, on the cardinality of the "acceptable" φ-subsets. A natural upper limit is n − 1, but it may also be useful to consider a smaller value. The mustering process, especially weak musters, will tend to minimize the effects of this, as will making the upper limit larger for smaller values of ns. In fact, it can be seen that cutoff and ms are often two aspects of the same phenomenon, since a larger cardinality usually implies a proportionately large region of the dissimilarity space.

I am grateful to Dr. E. Small, Biosystematics Research Institute, Agriculture Canada, and Mr. J. Cooper, Biology Department, Carleton University, Ottawa, for allowing me to use their extensive data relating to medicks and scarabs respectively. I am also indebted to Drs. J. E. Shore and R. W. Johnson, Information Processing Systems Branch, Naval Research Laboratory, Washington, D.C., for two days of intensive and extensive discussions on their axiomatic formulation of the maximum-entropy principle. The reader wishing to construct a computer program for this method should be grateful to a referee, whose suggestion led to the inclusion of Appendix B. Part of this work was carried out while I was a visiting scientist at the Department of Statistics, Rothamsted Experimental Station, Harpenden, U.K. Contribution number 1200, E.S.R.I.


234 REFERENCES J. Abadie, Advances 2 3 4

5 6

8 9 IO

II 12 13 14

I5 i6

I7

in nonlinear

programming,

in OR’78 (K. B. Haley, Ed.), North-

Holland. Amsterdam, 1978, pp. 900-930. P. Arabie and J. D. Carroll, Mapclus: a mathematical

programming

approach

to fitting

the Adclus model, Psychometrika 45:2 1 l-235 ( 1980). E. Balas and A. Ho, Set covering algorithms using cutting planes, heuristics, and subg radient optimization: a computational study, Math. Progrumming 12:37-60 ( 1980). V. Clvatal, A greedy heuristic for the set-covering problem, Muth. Oper. Res. 4:233-235 (197)). R. ivl. Cormack (1971) A review of classification, J. Rqy. Statist. Sot. Ser. A 134: 321-367 (1971). S. Lrlander, Entropy in linear programs, Muthematicul Progrumming 2 1: 137- 151 (1981). L. Fisher and J. W. Van Ness, Admissible clustering procedures, Biometrika 58:91- 106 (IVI). R. Garfinkel and G. L. Nemhauser, Integer Progrumming, Wiley, New York, 1972. D. Granot and F. Granot, On integer and mixed integer fractional programming prob!ems, Ann. Discrete Muth. 1:22 l-23 1 ( 1977). N. L. Johnson and S. Kotz, Continuous Univariute Distributions I, Houghton Mifflin, Boston 1970. K. W. Johnson, Determining probability distributions by maximum entropy and minimum cross-entropy, in APL 79 Conference Proceedings, 1979, pp. 24-29. F. Mix, On interrelationships between natural and artificial intelligence research, in Fundumentctls of Computer Science, Vol. 8, North-Holland, Amsterdam, 1979, pp. l-9. T. Kuennapas and A.-J. Janso;l, Multidimensional similarity of letters, Perception and Motor Skills 28:3- 12 ( 1969). i. P. Lefkovitch, (1976) Hierarchical clustering from principal coordinates-an efficient method for small to very large numbers of objects, Muth. Biosci. 3 1: 157- 174 ( 1976). L. P. Lefkovitch, Consensus coordinates from qualitative and quantitative attributes, Biometi :ul J. 20:679-691 ( 1978). L. P. Lefkovitch, Conditional clustering, Biometrics 36:43-58 (1980). M. W. Padberg, Covering, packing and knapsack problems, Ann. Discrete Muth.

4:265-287 ( 1979). D. Pallard, Strong consistency of k-means clustering, Ann. Statist. 9: 135- 140 (1981). E. L. Scott, Subclustering, in Clussicul and Contagious Discrete Distributions, (<;. P. Patil. Ed.), Statistical Publishing Society, Bombay, 1965, pp. 33-44. 20 R. N. Gepard and P. Arabie, Additive clustering: representation of similarities as combinations of discrete overlapping properties, Psychologicul Rev. 86:87- 123 (1979). 21 J. E. Shore and R. W. Johnson, Axiomatic derivation of the principle of max!mum entropy and the principle of minimum cross entropy, IEEE Truns. Inform. Theory IT-26:26-37 (1980). 22 W. T. Tutte, All the king’s horses (a guide to reconstruction), in Gruph Theory und Reluted Tk,pics(J. 4. Bondy and U. S. R. Murty, Eds.), Academic, New York, 1979, pp. 15-33. 3-3 S. Watanabe, Pattern recognition as a quest for minimum entropy, Puttern Recognition 13:381-387 (1981). 24 E. Zemel. Measuring the quality of approximate solutions to zero-one programming problem<. Mum. Oper. Res. 6:3 19-339 ( 198 1). 18 19