Entropy and set covering


INFORMATION SCIENCES 36, 283-294 (1985)







L. P. LEFKOVITCH
Engineering and Statistical Research Institute, Agriculture Canada, Central Experimental Farm, Ottawa, Ontario K1A 0C6, Canada

ABSTRACT

If the set covering constraints are $Ax \ge 1$ and $x_j \in \{0,1\}$, the prior probability that the $j$th subset participates in an optimal covering (independently of subset costs) is shown to be given by the principal row eigenvector of $A^*A$, where $a^*_{ij} = 1 - a_{ij}$. These probabilities lead to new and interesting objective functions, which are shown to be equivalent to cross-entropy or weighted cross-entropy. The probabilities can also be used to obtain better bounds for heuristic solutions to optimal covering and set representation problems.

© Elsevier Science Publishing Co., Inc. 1985, 52 Vanderbilt Ave., New York, NY 10017

INTRODUCTION

A set N of n > 2 distinct objects, for which s distinct subsets have been obtained in some way, can be represented as an $n \times s$ incidence matrix, A, whose elements are defined as $a_{ij} = 1$ if object i belongs to subset j, and zero otherwise ($i = 1,\dots,n$; $j = 1,\dots,s$). It is assumed that the union of these subsets is N. In some circumstances, it is of interest to make a parsimonious selection from this family of subsets, subject to the constraint that the union of the selected subsets remains N. One commonly used possibility for this is to consider that the selection of a subset involves a known nonnegative cost, $c_j$ ($j = 1,\dots,s$), e.g. a penalty for making an incorrect choice, and that the costs are additive. In these circumstances, if x is an s-element binary vector to be determined, the least-cost set-covering problem [5]

$$\text{LC:}\quad \min\Bigl\{\sum_j c_j x_j \;\Big|\; Ax \ge 1,\; x_j \in \{0,1\}\Bigr\}$$

is defined; usually there is a unique solution for x, but multiple solutions are not uncommon [5]. If the costs for the subsets are equal, the solution is called a minimum covering, and LC becomes

$$\text{MC:}\quad \min\Bigl\{\sum_j x_j \;\Big|\; Ax \ge 1,\; x_j \in \{0,1\}\Bigr\};$$

there tend to be many solutions for this x [5]. It is apparent that there are two components in the solution of these problems, namely, feasibility for x, and cost minimization. This paper is concerned first with the feasibility component, arguing that each subset has a prior probability, $p_j$, of participating in an optimal solution, and second with combining these probabilities with the costs in a number of ways. Before describing how to obtain these probabilities, it is necessary to discuss their interpretation, which here is neither empirical (i.e. based on experiment), nor subjective (involving the opinions of individuals), nor objective (independent of thought), nor pragmatic (practical rather than true or correct), but is logical (based on natural reasoning). This can be clarified as follows. The array A represents a set of predicates, first, about each object (e.g. object i belongs to subset j) which are either true or false; second, about each subset (e.g. subset j contains object i); and third, compound predicates about the objects, the subsets, and the object-subset combinations. If these predicates constitute the evidence, and hypotheses of the form “subset j is a member of the optimal covering” are being considered, a degree-of-confirmation function is sought which measures the extent to which the evidence supports the hypotheses. Before any evaluation of these propositions, if the sole evidence is that there are s subsets, the principle of indifference leads to a statement that the evidence in favor of the participation of the jth subset in the optimal MC is equal to that of any other. After evaluation of the predicates in A, the evidence may suggest otherwise; for example, if column j’ consists entirely of unities, then for an optimal MC, the evidence is overwhelming that it will participate in the solution, while the evidence about the participation of all others indicates that they do not. 
Thus the evaluation of the evidence in A may lead to unequal degrees of confirmation for the subsets as potential members of an optimal MC. If the degree of confirmation is assigned a nonnegative numerical value, for which zero indicates certainty that the subset does not participate in an optimal solution, and if complete confirmation is assigned a value of unity, then these (posterior) degrees of confirmation have the basic formal properties of a finitely additive probability, and are logical probabilities in the Carnap sense [4, Chapter VII]. Informally, therefore, such a probability represents the degree of confirmation of a hypothesis (e.g. subset j participates in the optimal covering), given the evidence in the set of predicates (e.g. those implicit in A). In this paper, probability is given the interpretation described in this paragraph.
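The LC and MC problems of the Introduction can be made concrete with a small exhaustive search. The following is only an illustrative sketch: the function name `least_cost_coverings`, the toy incidence matrix, and the costs are assumptions introduced here, and enumeration of all $2^s$ binary vectors is practical only for very small s.

```python
from itertools import product

def least_cost_coverings(A, costs=None):
    """Enumerate every binary x with A x >= 1; return (best cost, all optimal x)."""
    n, s = len(A), len(A[0])
    if costs is None:
        costs = [1] * s  # equal costs: LC reduces to the minimum covering MC
    best_cost, best = None, []
    for x in product((0, 1), repeat=s):
        # Feasibility: every object i must belong to at least one selected subset.
        if all(any(A[i][j] and x[j] for j in range(s)) for i in range(n)):
            cost = sum(c * xj for c, xj in zip(costs, x))
            if best_cost is None or cost < best_cost:
                best_cost, best = cost, [x]
            elif cost == best_cost:
                best.append(x)
    return best_cost, best

# Hypothetical incidence matrix: 4 objects (rows) by 4 subsets (columns).
A = [
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
mc_cost, mc_solutions = least_cost_coverings(A)                       # MC
lc_cost, lc_solutions = least_cost_coverings(A, costs=[1, 3, 1, 3])   # LC
```

For this toy matrix the equal-cost problem has two optimal coverings, while the unequal costs single out one of them, which mirrors the remark that MC tends to have many solutions and LC usually only one.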







Before proposing a procedure to estimate the probabilities, it is convenient to describe how they can be used. If a procedure exists to obtain the probabilities from A, an interesting choice of subsets consists of those which maximize their joint probability, which can be written as

$$\text{MP:}\quad \min\Bigl\{-\sum_j x_j \log p_j \;\Big|\; Ax \ge 1,\; x_j \in \{0,1\}\Bigr\}.$$

The objective function in MP can be shown to be equivalent to cross-entropy; two lemmas are required.

LEMMA 1. With the usual conventions that $\log 0 = -\infty$, $\log(y/0) = \infty$, and $0 \cdot (\pm\infty) = 0$,

$$-\sum_j x_j \log p_j = \sum_j x_j \log(x_j/p_j).$$

Proof. There are three steps in this proof. (1) Since $p_j = 0$ implies $x_j = 0$, then $-x_j \log p_j = x_j \log(1/p_j)$. (2) If $x_j = 0$, then $x_j \log(x_j/p_j) = x_j \log(1/p_j)$. (3) If $x_j = 1$, then $x_j \log(x_j/p_j) = x_j \log(1/p_j)$. Assembling (1), (2), and (3) completes the proof.
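The identity in Lemma 1 can be checked numerically for a binary x. The sketch below assumes hypothetical values for p and x, and implements the zero conventions simply by skipping the terms with $x_j = 0$.

```python
import math

def neg_log_joint(x, p):
    """-sum_j x_j log p_j; terms with x_j = 0 contribute nothing (0 * inf = 0)."""
    return -sum(math.log(pj) for xj, pj in zip(x, p) if xj == 1)

def cross_entropy(x, p):
    """sum_j x_j log(x_j / p_j) for binary x, with the same convention."""
    return sum(xj * math.log(xj / pj) for xj, pj in zip(x, p) if xj == 1)

# Hypothetical probabilities; the last p_j = 0 forces x_j = 0, as in step (1).
p = [0.5, 0.3, 0.2, 0.0]
x = [1, 0, 1, 0]

lhs = neg_log_joint(x, p)   # objective of MP
rhs = cross_entropy(x, p)   # cross-entropy form used in the lemma
```

Both quantities equal the negative logarithm of the joint probability of the selected subsets, here $-\log(0.5 \times 0.2)$.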

It follows that MP can be rewritten

$$\text{MP}':\quad \min\Bigl\{\sum_j x_j \log(x_j/p_j) \;\Big|\; Ax \ge 1,\; x_j \in \{0,1\}\Bigr\},$$

in which the objective function can be recognised as being a cross-entropy. Its relationship with a cross-entropy for which the $x_j$ are normalized suggests an alternative objective function for MP', which is

$$f(x;p) = \sum_j \frac{x_j}{z} \log\frac{x_j/z}{p_j},$$

where $z = \sum_j x_j$.

LEMMA 2. The vector x which minimizes $f(x;p)$ is the same as that minimizing MP'.

Proof. Since $f(x;p) = z^{-1}\sum_j x_j\bigl(\log(x_j/p_j) - \log z\bigr)$, it is apparent that $z^{-1}$ can be ignored in the search for the optimal x. Dropping this term, the function can be rewritten as

$$f = \sum_j x_j \log(x_j/p_j) - z \log z,$$

in which the first term on the r.h.s. can be recognised as the objective function of MP'. Since $x_j/p_j \ge 1$ for $x_j = 1$, $z \log z$ will always be smaller than the first term unless (1) $\sum_{j:\,x_j=1} p_j = 1$, and (2) the $p_j$ corresponding to $x_j = 1$ are all equal, when strict equality will be true. It follows that the second term plays no role in the search for the optimal x, and so can be omitted.

Clearly, therefore, the solution to MP and MP' is also that of minimum cross-entropy. In a generalization of LC, a natural objective function is the least expected cost, namely,

$$\text{LEC:}\quad \min\Bigl\{\sum_j c_j (1 - p_j) x_j \;\Big|\; Ax \ge 1,\; x_j \in \{0,1\}\Bigr\}.$$

Since $(1 - p_j) \approx -\log p_j$ for large $p_j$ (i.e. corresponding to those subsets whose prior probabilities are large), if the $p_j$ are independent of the $c_j$, the solution for x in LEC also minimizes $-\sum_j c_j x_j \log p_j = \sum_j c_j x_j \log(x_j/p_j)$, and so also minimizes the weighted cross-entropy [6]. Obviously, the optimal solution for x in LEC will be identical with that of MP for uniform $c_j$.

For simplicity in what follows, two other related programming problems are described. The set representation problem, SR, attempts to determine a minimum number of objects such that each subset is covered, and is therefore equivalent to a MC with the roles of objects and subsets interchanged. Of more interest is the complementary set representation problem, which is a SR in which the zero elements of $A^T$ are replaced by unities, and the unit elements by zero, i.e.

$$\text{CSR:}\quad \min\Bigl\{\sum_i y_i \;\Big|\; A^* y \ge 1,\; y_i \in \{0,1\}\Bigr\},$$

where $A^* = \{a^*_{ij}\} = \{1 - a_{ij}\}$, and y is to be determined. If each object has a probability of participation $q_i$ in a set representation, then it is of interest to choose the objects to maximize the joint probability of those chosen, which is equivalent to

$$\text{MPR:}\quad \min\Bigl\{-\sum_i y_i \log q_i \;\Big|\; A^* y \ge 1,\; y_i \in \{0,1\}\Bigr\}.$$

If the objects have costs which are unequal (suppose some objects are poorly determined, or are rare, etc.), an objective function can be defined which is the least expected cost of the representation, namely,

$$\text{LECR:}\quad \min\Bigl\{\sum_i r_i u_i y_i \;\Big|\; A^* y \ge 1,\; y_i \in \{0,1\}\Bigr\},$$

where the $u_i$ are related to $1 - q_i$, and $r_i$ is the cost of including object i.
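The complementary set representation problem can be illustrated on a toy matrix. This sketch assumes that $A^*$ is stored with one row per subset and one column per object (the complement of $A^T$), so that $A^* y$ is defined for an n-element y; the function names and the example matrix are hypothetical, and the exhaustive search is only suitable for very small n.

```python
from itertools import product

def complement(A):
    """Complement of the transposed incidence matrix: entry (j, i) is 1 - a_ij."""
    n, s = len(A), len(A[0])
    return [[1 - A[i][j] for i in range(n)] for j in range(s)]

def min_representation(A_star):
    """Smallest binary y with A* y >= 1, found by exhaustive search (CSR)."""
    n = len(A_star[0])
    best = None
    for y in product((0, 1), repeat=n):
        # Each subset must exclude at least one of the selected objects.
        if all(any(row[i] and y[i] for i in range(n)) for row in A_star):
            if best is None or sum(y) < sum(best):
                best = y
    return best

# Hypothetical incidence matrix: 4 objects (rows) by 4 subsets (columns).
A = [
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
A_star = complement(A)
best_y = min_representation(A_star)
```

Under this reading a feasible y selects objects such that every subset omits at least one selected object; replacing the objective `sum(y)` by $-\sum_i y_i \log q_i$ would give the corresponding MPR search.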




The objective of this paper is to show that p and q are unique, and are easily obtained from A.

DETERMINING p AND q


It will be assumed that A cannot be reduced further using those reductions not involving costs [5], and that n and s are the numbers of remaining objects and subsets respectively. This section is focused on the following theorem and its proof.

THEOREM. The s-element vector p of probabilities of participation in an optimal covering is given by the principal row eigenvector of $A^*A$.

Proof. Define $X = \{x : Ax \ge 1,\; x_j \in \{0,1\}\}$ as the set of all coverings of the n objects permitted by A. If the jth column of A is indicated $m_j = \sum_{x \in X} x_j$ times in X, let $M = \sum_{j=1}^{s} m_j$. The number of ways a given set of $m_j$ values can be realized is the multinomial coefficient

$$w(m) = \frac{M!}{\prod_j m_j!}.$$

Using the Stirling approximation to the factorials [the $m_j$ tend to be large, $M \ge |X| = O(2^s)$],

$$\log w = -M \sum_j u_j \log u_j,$$

where $u_j = m_j/M$. The more frequently the jth column of A is indicated in X, the larger are $m_j$ and $u_j$. An $m_j$ will be large iff the objects in the jth subset belong to few others, and so for such objects $v_i = \sum_{j=1}^{s} a_{ij} u_j$, the sum of the $u_j$ to which object i belongs, will also be large. The problem, therefore, is to determine the $u_j$ and $v_i$ in such a way that

$$Au = v, \qquad (1)$$


where $v_i$ is large if the number of subsets to which object i belongs is small, and is small if this number is large, without forming X. If $Y = \{y : A^* y \ge 1,\; y_i \in \{0,1\}\}$ is the set of all representations of the s subsets in the complementary problem, and $\tilde v_i$ the relative frequency of the number of times object i occurs in Y, then the logarithm of the corresponding multinomial coefficient is proportional to $-\sum_i \tilde v_i \log \tilde v_i$, and (omitting a few steps) $\tilde v$ needs to satisfy

$$A^* \tilde v = \tilde u. \qquad (2)$$

Note that $\tilde v_i$ will be large if the ith object occurs in many subsets of $A^*$, equivalent to few subsets of A. Since A and $A^*$ are complementary, representing equivalent sets of propositions, it follows that the $\tilde u$ and $\tilde v$ in (2) are the same as u and v in (1). Combining (1) and (2) leads to

$$A^*A u = \lambda u \quad\text{and}\quad AA^* v = \lambda v,$$


which are eigenvalue problems. Because the derivation for u and v assumes the principle of indifference by utilizing the multinomial coefficient, and hence aims at a maximum opportunity for each subset to contribute to a covering, while the objective functions in the various programming problems seek to minimize the number of subsets, the $p_j$ are not simply $1 - u_j$ after standardization. The two principles can be reconciled by reversing the roles of A and $A^*$ in the previous arguments, so that p is a row eigenvector of $A^*A$, and q a row eigenvector of $AA^*$.

Since there can be more than one eigenpair in $A^*A$ and $AA^*$, the last point to be established is the uniqueness of the solutions. Since A and $A^*$ are nonnegative, so are their products, and if A is irreducible, $A^*A$ will also be (if A is reducible, the covering problem can be decomposed into as many independent subproblems as there are irreducible submatrices, and so this assumption does not sacrifice generality). It follows from the Perron-Frobenius theorem that there is only one positive eigenvalue in these circumstances, and the elements of the corresponding row and column eigenvectors are nonnegative; all other eigenvalues are either negative, complex, or zero. Thus there is only one admissible solution.

REMARK 1. Although $1 - u$ standardized to unity is approximately equal to p, it will not be identical unless A is a generalized inverse of $A^*$.

REMARK 2. The use of an eigenvalue decomposition to estimate p is not indicated, since only one eigenvector is required, and an iterative procedure based solely on A has proved to converge quite rapidly even for fairly dense problems, and especially for those which are sparse. The initial vector in the iterative procedure should be assigned values of $s^{-1}$, since convergence will occur at once for set-covering constraints which are balanced.

REMARK 3.
It is easy to recognize that the logarithm of the multinomial coefficient defines the entropy of the assignment of values to p, and since the admissible solution is unique, it has maximum entropy, and is therefore consistent with the requirements of information operators [13]. However, the numerical estimates of the probabilities of some of the subsets may differ by an amount sufficiently small that it is reasonable to believe that in exact arithmetic they would be identical; this implies that for any tolerance which can be assigned to p (and hence to any supposedly optimal solution to LC or LEC), there is a class of equivalent solutions. This class may be determined as follows. If b is a set of hypothetical probabilities for p (e.g. $b_j = s^{-1}$), then the minimum discrimination information statistic [11]

$$G^2 = 2k \sum_j p_j \log(p_j/b_j),$$

where k is the number of unities in A, can be regarded as a $\chi^2$ with $s - n - 1$ degrees of freedom (N.B. if n > s, then replace the test by that for the corresponding hypothetical representation) for examining the consistency of p with b. The set of assignments to b such that $P(G^2) > \alpha$ defines an $\alpha$-tolerance class. From the concentration theorem [9], the range of values for the entropy of the members of this class is given by

$$H(p) - \frac{\chi^2_{s-n-1}(\alpha)}{2k} \le H(b) \le H(p), \qquad (3)$$

where $H(\cdot)$ denotes the entropy. This inequality is valid asymptotically for any random experiment with $n - s - 1$ degrees of freedom, even though the value of $H(p)$ may change.
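The iterative procedure mentioned in Remark 2 is not spelled out in the text; one plausible reading is ordinary power iteration for the principal row (left) eigenvector of $A^*A$. The sketch below is written under that assumption. It also assumes that $A^*$ is the complement of $A^T$ (shape s by n, so that $A^*A$ is s by s), that the principal eigenvalue is strictly dominant so that the iteration converges, and that p is scaled to sum to one; the function name and tolerances are hypothetical.

```python
import numpy as np

def covering_probabilities(A, tol=1e-12, max_iter=10_000):
    """Principal row (left) eigenvector of A*A by power iteration, scaled to sum to 1."""
    A = np.asarray(A, dtype=float)
    n, s = A.shape
    A_star = 1.0 - A.T               # assumed reading: complement of A^T, shape (s, n)
    p = np.full(s, 1.0 / s)          # initial values s^{-1}, as Remark 2 suggests
    for _ in range(max_iter):
        nxt = (p @ A_star) @ A       # row-vector update p^T (A* A) without forming A* A
        nxt /= nxt.sum()             # standardize to unit sum
        if np.abs(nxt - p).max() < tol:
            return nxt
        p = nxt
    raise RuntimeError("power iteration did not converge")

# Hypothetical balanced example: every object lies in two subsets, every subset
# holds two objects, so the uniform starting vector is already the eigenvector.
p_balanced = covering_probabilities([[1, 1, 0], [0, 1, 1], [1, 0, 1]])
```

For the balanced example the function returns the uniform vector after a single update, in line with the remark that convergence occurs at once for balanced constraints. Swapping the roles of the two factors (iterating with $AA^*$) would give a corresponding sketch for q.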







The key differences between the Chvátal heuristic procedure [1,2,8] and the proposed modification arise from the facts that neither the subsets nor the constraints are of equal weight, and the subtractions and additions of Chvátal's original method are adjusted by weighting the terms according to the values of $p_j$ and $q_i$; it will also be apparent that Chvátal's method is reproduced should the subsets and objects exhibit uniform probabilities. Numerical experience with the examples considered by Hey [8] and with several empirical sets of data suggests that the modified heuristic tends to find the optimal solution more often than the unmodified version, and to do so in fewer steps when both achieve it.
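For orientation, the unmodified greedy heuristic can be sketched as follows. This is the classic cost-per-newly-covered-object rule, not the weighted modification proposed here, whose exact weighting by $p_j$ and $q_i$ is not given in this text; the function name and the toy subsets are assumptions.

```python
def greedy_cover(subsets, costs):
    """Classic greedy heuristic: repeatedly take the subset with the lowest
    cost per newly covered object until every object is covered."""
    uncovered = set().union(*subsets)
    chosen = []
    while uncovered:
        # Only subsets that still cover something new are candidates.
        j = min(
            (j for j, S in enumerate(subsets) if S & uncovered),
            key=lambda j: costs[j] / len(subsets[j] & uncovered),
        )
        chosen.append(j)
        uncovered -= subsets[j]
    return chosen

# Hypothetical family of subsets over the objects 1..4, with equal costs.
subsets = [{1, 2}, {2, 3}, {3, 4}, {1, 4}, {1, 2, 3}]
cover = greedy_cover(subsets, costs=[1, 1, 1, 1, 1])
```

Ties are broken by taking the first candidate, so the result depends on the ordering of the subsets; a probability-weighted variant would replace the ratio in `key` by a weighted one.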

2.



Lefkovitch uses the MP solution to determine the optimal grouping of objects into not necessarily disjoint subsets in the absence of any other reasonable cost for the subsets [12].

3.



There is a class of stochastic set covering problems which differ from MP and LEC in that the probabilities are supposed to be known a priori. In one example of this class, a decision to include subset i has a known positive probability of not being followed, and so is essentially a two-stage procedure: first to choose the $x_j$, paying a cost for each subset in the solution, and then paying a penalty for each unsatisfied constraint. This leads to a nonlinear objective function which is the sum of the subset inclusion costs and the unsatisfied constraint penalties. In another example, one specifies a priori upper bounds on the probabilities that the constraints will not be satisfied; this can be formulated as an integer program. The probabilities obtained in this paper can be used for these problems, although typically they are considered to be externally available [3,6].

4. INDUCTION

The reduction principles used to eliminate those compound predicates implied by others sometimes yield a set in which each predicate remaining is essential [5]; but more commonly, from those remaining, there are several minimal irredundant sets. In the search for the best among these, those having the highest joint probability can be considered as representing the hypotheses having the highest degree of confirmation. If there is just one possible choice, this can be taken as being the “essential” subset of predicates; if there is more than one, then each is equally valid, and the choice is arbitrary. In these circumstances, if A has been obtained from empirical data, recourse to their source may sometimes result in a further set of relationships which resolves the dilemma; if A is obtained by purely logical considerations, then an augmented set of predicates having some degree of independence from those previously considered may be available to aid in the choice.

5.



Since any integer program can be transformed into a 0-1 integer program [15], it is therefore equivalent to a set-covering problem, and so the notion of probability of participation in an optimal solution applies to general integer programs. The principles described above for determining the probabilities for set covering are therefore of wider application.

6.



The A matrix given in Table 1 was obtained from the end product of the subset-generation phase of conditional clustering [12], using a dissimilarity coefficient for two-state attributes calculated from the raw data given in [14]. Ten subsets were generated for 14 species, which after reductions, collapsed to 7 species. Table 1 gives the complete 14 × 10 incidence matrix, indicating the species which remain, the prior probabilities of the subsets and objects, and the optimal covering. The optimal covering grouped the species into four subsets, two of which intersected, so that there is evidence for the existence of three species groups for these data; these coincide with the conjectured grouping of Figure 1 of [14], with the following differences: cucullata and gayana are associated with the ciliata-canterai-dandyana group, with which inflata bears some resemblance, as well as the last's affinity with virgata. Otherwise, the groupings coincide. However, the groupings obtained by the authors of [14] are their subjective opinions, while those here are objective, conditional on the dissimilarity coefficient.

DISCUSSION

If the only information about the subsets is given by A, and the objective is to obtain a parsimonious covering of the objects, why should one nonredundant covering be preferred to another? A simple analogy may make this question clearer. Suppose a die has been thrown a number of times, and we are told that the mean score was 3.5; we wish to choose one face on which to place a bet. What assignment of probabilities should be made to the faces? This can be regarded as a set-covering problem for which only one element of x is to be unity, all others zero. The numerical values assigned to the faces are assumed to be no more than labels, so that any permutation of these labels, making appropriate changes to the numerical value of the mean, will leave the choice of face unchanged. 
It is easy to find solutions, among the very many assignments which are possible, which satisfy the constraint that the expected value is 3.5 [for example, Pr(i = 1,2,5,6) = 0, Pr(i = 3,4) = 0.5 is one such], but there is nothing in the information available which suggests that the probability of any face should be zero or any other value. If, in addition, we are told that the mean score remains at 3.5 for all permutations of the labels, this represents a considerable gain in information, and the commonly accepted assignment of 1/6 to each face is the only solution. However, without this additional information,


TABLE 1
Subsets and Prior Probabilities for Analysis of Chloris Data (a)

[Incidence matrix (14 species × 10 subsets) and representation probabilities: not recoverable from the source. Legible species labels: ciliata (b), crinita, gayana, virgata (b), verticillata, chloridea, divaricata (b).]

Covering probabilities

Subset    Probability
1         0.072948
2         0.145897
3         0.236066 (c)
4         0.072948
5         0.095493
6         0.090170
7         0.190985 (c)
8         0.095493
9         1.0 (c)
10        1.0 (c)

(a) Varadarajan and Gilmartin, 1983. (b) After reductions. (c) In optimal covering.


TABLE 2
Die Maximum-Entropy Probability Assignments for Various Sets of Constraints

[Table body not recoverable from the source. Legible row labels: Constraints ($\sum_i i p_i$, $\sum_i i^2 p_i$); Entropy (nats).]

(a) Unspecified a priori.

it is not difficult to reason that the class of acceptable solutions should be confined to those assignments which do not favor one face over another unless there is some evidence in support. This class is defined by those assignments which maximize the entropy [9], which, in the absence of any degeneracies, can be shown to have at most one member which satisfies the constraints. Table 2 gives the maximum entropy assignments for a number of dice subject to various single and multiple constraints; the faces on which to bet are clear. Table 2 also shows that there may be no solution which satisfies the constraints precisely; however, the solution for which the constraints are satisfied as nearly as possible is unique, and is called the center of attraction [10] for the probability density, which coincides with the maximum entropy density when the constraints are satisfied precisely. Examples of this class are the last two dice in Table 2. The numerical procedures used to maximize the entropy, which also obtain the center of attraction if a maximum-entropy solution does not exist, are described in [13]; in the implementation used for the examples, the constraints were first standardized to unity.
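The die analogy can be reproduced in a few lines for the simplest case of a single mean constraint. The sketch below is not the procedure of [13]; it assumes the standard exponential form of the maximum-entropy solution, $p_i \propto \exp(\lambda i)$, and finds $\lambda$ by bisection. The function name, the bracket for $\lambda$, and the tolerance are hypothetical, and the prescribed mean must lie strictly between the smallest and largest face values.

```python
import math

def maxent_die(mean, faces=(1, 2, 3, 4, 5, 6), tol=1e-12):
    """Maximum-entropy probabilities on `faces` subject to a prescribed mean.
    The solution has the form p_i proportional to exp(lam * i)."""
    def probs(lam):
        w = [math.exp(lam * i) for i in faces]
        z = sum(w)
        return [wi / z for wi in w]

    def mean_of(lam):
        return sum(i * pi for i, pi in zip(faces, probs(lam)))

    lo, hi = -50.0, 50.0             # assumed bracket; mean_of increases with lam
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_of(mid) < mean:
            lo = mid
        else:
            hi = mid
    return probs(0.5 * (lo + hi))

uniform = maxent_die(3.5)   # mean 3.5 gives lam = 0, i.e. equal probabilities
loaded = maxent_die(4.5)    # a larger mean shifts probability to the high faces
```

With a mean of 3.5 the assignment is uniform, so no face is favored; with any other mean the probabilities are monotone in the face value and the face on which to bet is the one at the favored end.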



The justification of the maximum-entropy principle, therefore, is that it obtains the least prejudiced assignment of probabilities to events, subject to constraints given by the available data [13]. This principle is also supported, both theoretically and experimentally, from the world of physics [9]. In the set-covering context, the assignment of probabilities departs from uniformity according to the relationships given by the constraint matrix A, and it can be seen that there are as many constraints as there are rows and columns in (the fully reduced) A. These constraints can be considered as a set of sufficient statistics, together equivalent to postulating a particular distribution for the probability density of p (and q) and a (small) number of sufficient statistics to replace the data. It can be seen, therefore, that there may exist a family of probability density distributions equivalent to the empirical sets of constraints, which, if found, may result in a computational advantage over current practice.

REFERENCES

1. E. Balas and A. Ho, Set covering algorithms using cutting planes, heuristics, and subgradient optimization: A computational study, Math. Programming Stud. 12:37-60 (1980).
2. V. Chvátal, A greedy heuristic for the set covering problem, Math. Oper. Res. 4:233-235 (1979).
3. S. Erlander, Entropy in linear programs, Math. Programming 21:137-151 (1981).
4. T. L. Fine, Theories of Probability, Academic, 1973.
5. R. Garfinkel and G. L. Nemhauser, Integer Programming, Wiley, 1972.
6. S. Guiasu, Information Theory with Applications, McGraw-Hill, 1977.
7. J. Halpern, The sequential covering problem under uncertainty, INFOR - Canad. J. Oper. Res. Inform. Process. 15:76-93 (1977).
8. A. M. Hey, Algorithms for the set covering problem, Ph.D. thesis, Dept. of Management Studies, Imperial College, London, 1980.
9. E. T. Jaynes, Papers on Probability, Statistics and Statistical Physics (R. D. Rosenkrantz, Ed.), Reidel, 1983.
10. P. E. Jupp and K. V. Mardia, A note on the maximum-entropy principle, Scand. J. Statist. 10:45-47.
11. S. Kullback, Information Theory and Statistics, Wiley, 1959.
12. L. P. Lefkovitch, Conditional clusters, musters and probability, Math. Biosci. 60:207-234 (1982).
13. J. E. Shore and R. W. Johnson, Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross entropy, IEEE Trans. Inform. Theory IT-26:26-37 (1980).
14. G. S. Varadarajan and A. J. Gilmartin, Phenetic and cladistic analysis of North American Chloris (Poaceae), Taxon 32:380-386 (1983).
15. K. Zorychta, On converting the 0-1 linear programming problem to a set-covering problem, Bull. Acad. Polon. Sci. Sér. Sci. Math. Phys. 25:919-923.

Received 22 October 1984; revised 19 March 1985