Generalized subsumption and its applications to induction and redundancy

Generalized subsumption and its applications to induction and redundancy

ARTIFICIAL INTELLIGENCE 149 Generalized Subsumption and Its Applications to Induction and Redundancy* Wray Buntine School of Computing Sciences, Uni...

1MB Sizes 0 Downloads 34 Views

ARTIFICIAL INTELLIGENCE

149

Generalized Subsumption and Its Applications to Induction and Redundancy* Wray Buntine School of Computing Sciences, University of Technology, Sydney, Broadway, N.S.W. 2007; and Macquarie University, North Ryde, N.S.W. 2113, Australia Recommended by Alain Colmerauer ABSTRACT

A theoretical framework and algorithms are presented that provide a basis for the study o f induction of definite (Horn) elauses. These hinge on a natural extension o f O-subsumption that forms a strong model of generalization. The model allows properties" of inductive search spaces to be considered in detail. A useful by-product o f the model is a simple but powerful model o f redundancy. Both induction and redundancy control are central tasks in a learning system, and, more broadly, in a knowledge acquisition system. The results also demonstrate interaction between induction, redundancy, and change in a system's current knowledge-with subsumption playing a key role.

1. Introduction

Inductive inference has long been a research topic of artificial intelligence (AI) [1] and computer science [2]. A common application cited is the (semi-) automated acquisition of knowledge for expert systems [3, 4] and knowledge bases [5]. Significant successes have been reported with commercial products of AI induction research [6], often based on Quinlan's ID3 algorithm to induce decision trees [7]. Similar practical achievements, however, have yet to be demonstrated of definite clause induction systems, for example, as would be used to induce PROLOG rules. A theoretical foundation for induction of definite clauses has been developed [8] and some progress has been made towards achieving practicality. By representing an inductive hypothesis as a logic program, that is, as a set of rules, Shapiro's Model Inference System [8, 9] demonstrates that the process of *This is a revised version of the paper that won the Artificial Intelligence Best Paper Award at E C A 1-86.

Artificial Intelligence 36 (1988) 149-176 0004-3702/88/$3.50 © 1988, Elsevier Science Publishers B.V. (North-Holland)

150

W. BIINTINE

induction can be split neatly into two tasks: searching for suitable rules to include and debugging demonstrably false applications of the rules to pinpoint which rule(s) to subsequently exclude. Further refinements of definite clause induction methods, specifically, improvements in the search for rules, could be made possible by adapting and unifying known practical methods in use in the field of machine learning ]1, 10]. The theory of definite clauses, however, currently lacks one important component: a suitable model of generalization. This is critical to many of these induction methods. In Section 2.2, 0-subsumption a weak model of generalization and its associated set of induction tools developed by Plotkin ]11, 121 and Shapiro [8] arc shown to be sometimes deficient for this purpose. To fill the gap, the main contribution of this paper is the introduction and development of a stronger model of generalization of definite clauses called generalized subsumption, and key results and algorithms concerning its application to induction. The model provides a new foundation for understanding definite clause induction systems. A useful by-product is a model of redundancy. Redundancy control is another central task for a knowledge acquisition system [5], and, as we shall see, actually plays a key role in some induction tasks. 0-subsumption is shown in Section 4 to be a special case of generalized subsurnption. Consequently, algorithms presented here simplify to the corresponding 0-subsumption algorithms. As onc would expect the advantages of using generalized subsumption instead of 0-subsumption come with a computational price, for example, in some settings termination may not be guaranteed. 1.1. Outline

The paper is organized as follows. Section 2 discusses the uses of a generalization model and highlights some of the properties a generalization model should have. Section 3 provides necessary concepts and notation for the theory to follow. Starting from the definition of generalized subsurnption, a basic theory of generalization is developed in Section 4. Applications of the theory and broader implications to AI are discussed in the remaining sections. Basic properties of inductive search spaces are discussed in Section 5, a model of redundancy in Section 6, and the finding of a most specific generalization in Section 7. 2. Generalization Models

A variety of generalization models are in use in machine learning and other AI fields. Many of these are defined operationally. For instance, some are coded, as in the LEX system [13], or defined in terms of suitable primitive operations, as in 0-subsumption (the definition is given in the start of Section 2.2). Michalski [14] lists operators for generalizing and specializing rules during

GENERALIZED SUBSUMPTION

151

induction. Forms of subsumption similar in spirit to 0-subsumption are used in knowledge representation (for example, [15]) for the purpose of organizing concepts. Generality of substitutions [16, 17] is used in logic programming to build concise representations of the solutions to sets of equations. This section, however, focuses on the applications of a model of generalization to induction and on the kinds of properties a model should have.

2.1. Using generalization models Regardless of the representation language used, a key part of the induction process, as it is envisaged by many researchers in the AI discipline, is a search through a space of rules [1]. A model of generalization provides a basis for organizing this search space [10]. Briefly, rule R l is more general than rule R~, or R 2 is more specific than Rt, if in any world R~ can be used to show at least the same results as R_~ (adapted from Mitchell [10]). Rules can be organized into a structure called a hierarchy. All clauses below a node in a specialization (generalization) hierarchy are specializations (generalizations) of the node, and, furthermore, the hierarchy contains all rules more specific (general) than the root. The use of a specialization or generalization hierarchy as a search space has several advantages. Complete branches can be pruned during search in the knowledge that all specializations of a rule, or generalizations, are guaranteed to inherit some prohibited property. There are several such properties: failure to comply with a constraint, failure to prove a fact known to be true [8], and proving a fact known to be false [18]. Furthermore, knowledge of specializations and generalizations of the desired rule can also be used to delimit the space of potential rules, and, as a by-product, simplify the process of testing whether a potential rule is consistent with the data currently available [10]. Finally, suppose R~ and R 2 are known specializations of some unknown rule. A most specific generalization of R 1 and R~ (that is, a generalization of both R l and R2 more specific than any other generalization) is a more general rule that may also be a specialization of the unknown rule. Applications of an algorithm to compute a most specific generalization, based on 0-subsumption as a model of generalization, have been discussed by Plotkin [11, 12] and Vere [19]. Mitchell et al.'s LEX system [13] and the learning system of Fu and Buchanan [20] employ a similar style of process.

2.2. Properties of generalization models An existing model of generalization for clauses, 0-subsumption, performs poorly at several of the above mentioned tasks. Two simple examples demonstrate this and give some insight into the sorts of properties a generalization

152

w. BUNTINE

model should have. Recall that clause C 0-subsumes clause D if there exists a substitution 0 such that D D_CO [11]. In informal terms, D can be converted to C by dropping conditions and turning constants to variables. First, suppose we have been advised that any small fluffy dog and any fluffy cat are cuddly pets. In definite clause form (introduced in m o r e detail in 1 Section 3.1) this can be expressed as

cuddly-pet(X) ~- small(X), fluffy(X), dog(X),

(1)

cuddly-pet(X) ~- fluffy(X), cat(X).

(2)

Suppose we also know the following clauses hold:

pet(X) ~- cat(X), pet(X) <-- dog(X), small(X) ~- cat(X) tame(X) +- pet(X) The most specific generalization under 0-subsumption of clauses (1) and (2), a possible clause for determining cuddly pets, is

cuddly-pet(X) +- fluffy(X). Given our current knowledge, a more likely clause is

cuddly-pet(X) ~-- small(X), fluffy(X), pet(X).

(37

This should be considered a generalization of clauses (1) and (2) because it must succeed for any value of X for which either of clauses (1) or (2) does. Clause (3), however, is not m o r e general under 0-subsumption than either of clauses (1) or (2), demonstrating a first inadequacy with this model of generalization. In similar circumstances, suppose we are trying to find a clause to determine examples of the concept pig. We might consider the clause

pig(X) ~-- pet(X).

(4)

If we know of no one with a pet pig, this clause can be rejected on the grounds that it does not help explain any of the available data. From our current knowledge the following three clauses can also be rejected because clause (5) is essentially the same as clause (4) (tame(X) is a redundant I Variables are in upper case, predicate, function, and constant symbols in lower case.

GENERALIZED SUBSUMPTION

153

atom since all pets are tame) and clauses (6) and (7) are special cases of clause (4). pig(X) *- pet(X), tame(X), (5)

pig(X) *- dog(X),

(6)

pig(X)

(7)

cat(X).

0-subsumption, however, fails to justify that clause (5) is as general as clause (4) and that clauses (6) and (7) are more specific, demonstrating further inadequacies with 0-subsumption. This can result, for instance, in the Model Inference System [8] doing much unnecessary work during search of a specialization hierarchy. The cause of these kinds of inadequacies is as follows. When forming a generalization or specialization of a clause, or finding the most specific generalization of two clauses, 0-subsumption does not allow current knowledge to be utilized. This can be partly alleviated by implementing subsumption relative to some set of facts [12, 19]. In the above example, however, the problem still remains because relevant facts are not immediately available; they need to be inferred from the current knowledge. This raises the question of how can current knowledge be utilized? Because it is possible to handcraft generalization algorithms for some given knowledge, a more important question is how can arbitrary current knowledge be utilized? Plotkin has considered the question of using current knowledge in his thesis [21]. He introduces the following notion: C generalizes D relative to P if there exists a substitution 0 such that P ~ V(CO-~ D). But this model of generalization then makes the clause

small(X) *- cat(X)

(8)

more general than the clause

cuddly-pet(X)

fluffy(X), cat(X)

(9)

relative to

pet(X) ~- cat(X), cuddly-pet(X) *-- small(X), fluffy(X), pet(X). Giving the " * - " an imperative reading, clause (8) is primarily about the concept small because it is used to provide results on which objects are small. Likewise, clause (9) is about the concept cuddly-pet. The two clauses are certainly not connected in the sense of generality introduced in Section 2.1. Plotkin's notion of generality is discussed further in Section 4.

[54

w. BUNTIN[2

Another method of generalizing a clause is used in Sammut's MARVIN system

[18, 221. MARVIN induces definite clauses by working interactivelv with a knowledgeable trainer. Sammut's technique would, for instance~ generalize either of clauses (1) or (2) to clause (3) and suggests an improved model of generalization because it has the ability to incorporate current knowledge int~ its generalization process.

3. Preliminaries Before introducing this model of generalization, I briefly review necessary concepts and illustrate the conventions on notation to follow. Some ~f the notation is only used in proofs and the more technical theorems: the casual reader need only gloss over Section 3.1. Appropriate background for logic programming is set out by Lloyd [23], for 0-subsumption in the context of induction is given by Plotkin [11], and for a theory of induction in definite clause logic is provided by Shapiro [8]. The model-theoretic definition of an interpretation and a Herbrand interpretation from first-order predicate logic [23] are assumed here. A Herbrand interpretation is an interpretation where function and constant symbols are assigned to themselves. A particular Herbrand interpretation over a language L can be concisely represented by the set of ground atoms in the language L that are true in the interpretation. The usual definitions for a formula to be true in an interpretation I, for a formula to be valid (that is, true in any interpretation) and for a formula to be satisfiable (that is, true in an interpretation) apply. P ~ F denotes that the formula P--~ F is valid.

3.1. Logic programming concepts Some basic concepts of logic programming follow. The notation is close to that of Lloyd [23]. A term is a constant, variable, or the application of a function symbol to the appropriate number of terms. An atom is the application of a predicate symbol to the appropriate number of terms. Atoms are usually represented by the letters A or B, sometimes with primes or subscripts or both. A literal is an atom or the negation of an atom. A goal has the form ~-B~ . . . . . B,, where n~>0 and each B i is an atom, Goals are usually represented by the letters G or H. If n = 0 the goal is the empty goal. A definite clause, abbreviated to clause, has the form A ~-B~ . . . . . B,, where n >~ 0 and A and each Bi are atoms. A clause is implicitly universally quantified. For n : 0, the clause is often called a fact. If the above clause is represented by the letter C, then Chead, the head of the clause, is the atom A and Cbody , the bod~v is the conjunction B I A .-. /x B~. The goal part of the clause C is the goal e - BI . . . . . B,,. It is logically equivalent to the negation of

GENERALIZED SUBSUMPTION

155

C[~o,Jy. Clauses are usually represented by letters such as C or D. Occasionally, goals or definite clauses may be represented as sets of literals, as in the definition of 0-subsumption (Section 2.2). The goal or clause actually corresponds to the disjunction of the set. For instance, the clause above can be represented as {A, ~B~ . . . . . -~B,,}. A Horn clause is a definite clause or a goal clause. A unit clause consists of a single literal. A logic program, for example P or Q, is a finite set of clauses representing their conjunction. If the set is empty, the program is the null program, represented as ~. A substitution, usually represented by a lower case Greek letter such as 0, o-, or ~-, consists of a finite number of distinct variables paired with terms. The instance of a finite string of symbols F by substitution 0, represented by FO, is obtained by simultaneously replacing each occurrence of a component variable of 0 in F by its corresponding term. Any clause or literal is ground if it contains no variables. The existential quantifier (3) and the universal quantifier (~') are sometimes used to close a formula. That is, ~'F represents the closed formula obtained by universally quantifying every unbound variable in F. So ~' 3 X f(x, y) actually represents V,. 3.~ f(x, y). Given a logic program P and a goal G = * - A t . . . . . A,,, the goal succeeds on P if and only if P ~ 3 ( A ~ /x,../x A,,). A sound and complete logic programming system can demonstrate this by constructing a refutation for G, that is, a linear sequence of goals with initial goal G and final goal the empty goal where each is the resolvent [23, 24] of the previous goal and a clause from P. 3.2. A context for induction

Next, a theoretical context for induction is considered. In an induction problem, the intended interpretation is that ideal but unknown interpretation giving truth values for formulae consistent with the problem specification and any future observations that could possibly be made in the domain being considered. An induction system's current knowledge consists of data, literals, plus background knowledge expressed as a logic program. These are assumed true in the intended interpretation. A formula is known true if it is a logical consequence of the current knowledge. An induction system is to find an hypothesis, expressed as a set of clauses, consistent with the current knowledge and capable of explaining the data in the sense that all the positive literals but no negative literals in the data are deducible from the conjunction of the hypothesis and the background knowledge. The hypothesis is not considered part of the current knowledge, although at some point a clause may become plausible enough to be included.

156

w. BUNTINE

(This formulation ignores the important problem of truth maintenance [25] to be performed when the truth of part of the system's current knowledge in the intended interpretation becomes falsifiable and so recent induction steps unjustified.) The intended hypothesis is that ideal but unknown hypothesis corresponding to the intended interpretation. The following key definition formalizes the notion of a clause being directly responsible for proving an atom. It is Shapiro's notion of "covers" [9], rephrased slightly. The notion is an essential building block of the theory to follow. A clause C covers" ground atom A in interpretation 1 if there is a substitution 0 such that Ch~,~O is identical to A and q(C~ooy0 ) is true in interpretation I. Such a clause is called a covering clause. A set of clauses covers atom A in interpretation 1 if some clause in the set covers atom A in interpretation 1. The search of the space of rules mentioned in Section 2 can be formulated as a search for covering clauses [9]; clauses are to be found that cover particular known facts in the intended interpretation.

4. Generalized Subsumption In the remainder of this paper 1 shall refer to generalized subsumption as, more briefly, subsumption, where no confusion can arise. Informally, subsumption corresponds to the more general relation for rules considered in Section 2. Subsumption is defined formally below in a model-theoretic manner. An operational view of subsumption follows. The logic program P used in the definition is intended to represent a system's current knowledge. When this is the case, knowing C subsumes D is an assurance that, in the intended interpretation, C will apply in any situation that D does; generalized subsumption is a safe model of generalization.

Definition 4.1. Clause C subsumes (or is more general than) clause D with respect to (w.r.t.) logic program P if for any Herbrand interpretation I (for the language of at least 2 P, C, D ) such that P is true in I, and for any atom A, C covers A in I whenever D covers A. This is denoted C >~e D. C is referred to as a generalization of D, and D as a specialization of C. The definition extends naturally to sets of clauses (that is. using "the set of clauses { C~ . . . . . C,, } covers A in I whenever the set of clauses {D 1. . . . , D., } does") and is written {C~ . . . . . C,,} >;, {D~ . . . . . D,n}. 3 These sets may repre~That is, 1 is a Herbrand interpretation constructed from symbols in P, C, D, with possibly some other symbols. 3 In the notation of Lloyd [22] this then becomes equivalent to: for any Herbrand interpretation l of at least P, C~ . . . . . C,, and [)~ ..... I),,,. such thai P is true in I. 7 ' , , ,, (I)_D_ T~,, . . . . . ;~,,,~(I).

GENERALIZED SUBSUMPTION

157

sent (1) logic programs or (2) clausal specifications for a particular predicate. So, depending on the usage, this can provide definitions for ( l ) a logic program subsuming another, or (2) a clausal specification for a particular predicate subsuming another. To assist in the formulation of clause hierarchies and redundancy, further definitions are required. Clause C is equivalent to clause D w.r.t, logic program P (represented as C = p D ) if C>~t,D and D>-pC. " = p " is an equivalence relation since " >~ p " is transitive and reflexive. The equivalence class under "=~," of C (denoted [C]~,) is the set of all clauses D such that C =I,D. Naturally, these concepts can also be extended to sets of clauses. Theorem 4.2 below lays the foundations for an algorithm to test for subsumption. This gives an operational view of subsumption.

Theorem 4.2. (Testing for ">e"). (1) Let C~ . . . . , C,, and D~ . . . . . D,, be clauses and P be any logic program. { C~ . . . . , C,,} >~~ { D~ . . . . . D,,} if and only if for all j such that 1 <~j <~m there exists an i such that 1 <~i <~n and C~ >~e D~. (2) Let C and D be two clauses containing disjoint variables and P be any logic program. Let 0 be a substitution grounding the variables in D using new constants not occurring in C, D, or P. C>~pD if and only if, for some substitution or, C~,, d o" is identical to D ~ d and P

A

ObodyO

p

3(CbodyO'O

) ,

(10)

Proof. (1) The "if" part follows directly from the definitions of ">~ p " and "covers". The following argument proves the "only-if" part. Suppose {C~ ~ o . . ~ C,,}~1,{D l ~ • • • , Din}. Then, from the definition of " ~~ / , " , {C~ . . . . . C,,} ~>e {D~} for every j such that 1 ~
P A (D0.ody0

p

since the right-hand side is satisfied by a least Herbrand interpretation of the left-hand side (this follows from a simple model-theoretic argument, as in [23, Theorem 6.6]). C~ ~>e Dj follows by (2) of the theorem.

158

w. BUNTINE

(2) First, consider the "if" part. Assume the substitution ~r exists and the formula is valid. Pick any Herbrand interpretation I of at least P. C, and D. such that P is true in I. Pick any atom A in the corresponding Herbrand base such that D covers A in I. Now, by the definition of covers, there exists a substitution ¢b such that A is identical to D~,,~,,~chand ~(D~o~,~b) is true in I. So there must exist another substitution, ¢b', such that Dch' is ground, A is identical to D~,dCh' and Dboay~' is true in 1. Because formula (10) is valid, by the uniform replacement of constants P A Db,,dy~b' ~ ::](Cb,,dv(r05') . So 3(Cb,,<,.cr&' ) is also true in I. Because (~head O" is identical to Dh~,d, this implies C covers A in I. This argument follows for any I satisfying the initial constraints, so by definition, C >~, D. The "only-if" part remains to be proven. Assume C ~>f, D. The following argument shows a substitution cr exists. Let a substitution 0 be given as in the theorem. Let I be a Herbrand interpretation of P, C, D, and DO that assigns true to every atom in the corresponding Herbrand base. Now, D covers Du,.~,dO in l, so C must also. The existence of the substitution ~r follows by the construction of 0. Finally, let 1 be a least Herbrand interpretation of P/~ DbodyO over the language of P, C, D, and DO. Then D covers D~,~,aO in 1, so C must also. Therefore. C covers C~,,~o-O in I. By the definition of covers, 3(('~,,,dy(r0) is satisfied by 1. Because I is a least interpretation, by standard properties of Horn clauses (for instance, [23, T h e o r e m 6.6]), formula (10) must be valid. ~ According to T h e o r e m 4.2(1), to compare two sets of clauses it is sufficient to compare individual rules in the sets. Every clause in a more specific set must be subsumed by some clause in the more general set. When comparing two clauses, the logical formula (10) can be tested by a logic programming system. This test can also be reinterpreted as follows: it is necessary to show that the more general clause can be converted to the other by repeatedly (1) turning variables to constants or other terms. (2) adding atoms to the body, or (3) partially evaluating the body by resolving some clause in t" with an atom in the body. At this point, readers may wish to convince themselves that clause (3) given in Section 2.2 indeed subsumes both clauses (1) and (2) w.r.t, the logic program given. Any generalized subsumption test is only semi-decidable. That is, termination of the test is only guaranteed when in fact C~>pD. Although, if the program P contains no recursion, termination is guaranteed. Similarly, sub-

GENERALIZED SUBSUMPTION

159

sumption w.r.t, a program containing no functions symbols, a DATALOG program [26], would be decidable, The theorem below outlines the specific connection between generalized subsumption and logical implication.

Theorem 4.3. Let D be any clause that is not tautologically true and P be a logic program. P ~ V D if and only if C >~ D for some clause C occurring in P, that is, P>~t,D. Proof. Let 0 be a substitution grounding the variables in D using distinct constants not occurring in D or P. Suppose P ~ VD. then P/x Db,,d~0 ~D~I~O. By the completeness of resolution for definite clauses, there exists a refutation with initial goal ~-Dhe,dO using clauses from the logic program P/x D~,,,dyO. Because D is not tautologically true, the first clause resolving with Dhe~,,~must be a member of P. Clearly, this clause satisfies the conditions given in Theorem 4.2(2) and so is ~>p D. The converse follows similarly. [] Generality is often considered to be closely related to implication; the theorem and its proof provide a comparison between generalized subsumption and implication. A distinction between C>~1~D and P ~ (C--~ D) arises, for instance, when C is recursive or C and D have different predicates at the head, as in the last example in Section 2.2. In approximate terms, C > p D when the body of D implies the body of C in the context of P. That is, C > t, P when it is possible to explain why C should apply in any situation that D does using only the knowledge contained in P. Although, using the theorem, it can be shown that if C>~,D then P ~ ( C - - ~ D ) .

Corollary 4.4. If C > p D , then (P /x VC) ~ (P /x VD). As a consequence of the corollary, when a clause in a logic program is replaced by a more general clause, at least the same goals will succeed on the newly constructed program. Furthermore, if Q ~ P and C -- p D, then ~ (Q/~ VC)<-~(Q/~ VD). So, equivalent clauses can be used interchangeably in the right context. This raises the question, addressed in Section 6, of all clauses in the same equivalence class, which is the best? Finally, I relate the model to various other models of generalization. 0-subsumption is a special case of generalized subsumption. For any two clauses C and D, C ~ 0 D if and only if C 0-subsumes D. This follows from Theorem 4.2(2) when the logic program P is the null program, keeping in mind that the clauses being compared are definite. Compare the resultant algorithm with Loveland's Theorem 4.1.1 [24]. In effect, generalized subsumption simplifies to 0-subsumption when no current knowledge is incorporated in the model of generalization.

160

W. BUNTINE

The version of subsumption usually used in automated theorem proving is (' subsumes D if ~ VC--~ VD. The difference between this and 0-subsumption is a special case of the difference between the conditions P ~ (VC--~VD) and C~rD. Consider Plotkin's notion of generality [21, p. 49]: C generalizes D relative to P if there exists a substitution 0 such that P ~ V ( C O - o D ) . The following theorem, from Plotkin [21, p. 52, Theorem lt, highlights the difference between this and generalized subsumption. Theorem 4.5. Let C and D be clauses and P a logic program. C generalizes D relative to P if and only if C occurs at most once in some refutation demonstrating P b (VC--~VD).

In view of Theorem 4.3, C ~>~,D is the special case of C generalizing D relative to P where C is at the root of the refutation demonstrating The version of subsumption typically used in knowledge representation is similar to 0-subsumption. Compare the definition given at the beginning of this section with Brachman and Levesque's extensional semantics of subsumption [15]. The organization of a knowledge base imposed by subsumption could be improved if the version of subsumption adopted incorporated some use of current knowledge, as suggested here. By careful selection of the kind of knowledge to be used, this improvement could be achieved with only a small loss in efficiency. A notion closely related to generalized subsumption is uniform containment [27], used in deductive database theory. From Theorem 4.3 and Sagiv's Proposition 2 [27], P uniformly contains Q if and only if P subsumes () w.r.t. P. The added flexibility of the setting for generalized subsumption allows a cleaner analysis of redundancy, for instance. Michalski's [14] usage of "more general" is in fact identical to that considered here. He is using a more extensive language than definite clauses so lists other generalization and specialization steps. His definition [14, p. 117], however, does not acknowledge the role that current knowledge can play in generalization. Now that the basic features of generalized subsumption have been covered, the remainder of the paper outlines some applications of the model. 5. Subsumption Hierarchies: The Search Space for Induction

Given positive examples E (a set of positive literals), negative examples N (a set of negative literals), and background knowledge, a logic program P, a potential induction hypothesis H explaining the examples must be such that (P/~ H) ~ E and P ~, H/x N is satisfiable. These conditions define a search

GENERALIZED

161

SUBSUMPTION

space on hypotheses. In the following analysis, another condition has been assumed: the supplied language is complete, that is, no new predicate or function symbols other than those already in E, N, and P need to be introduced into H. In other words, I am ignoring the problem of new terms, also referred to as constructive induction [14]. This language completeness assumption is not always applicable. For example, given positive and negative examples of the reverse predicate (reverse (X, Y) is true if list Y is the reverse of the list X), and background knowledge P = {}, no potential hypothesis H consisting of Horn clauses constructed from only the reverse predicate can exist. Whereas, given P equal to 4

append([], X, X ) , append(A.X, Y, A.Z)~--append(X, Y, Z) , one possible hypothesis is

the

O(n 2) reverse predicate

reverse([ ], [ ] ) , reverse( A . X, Y)~-reverse(X, Xrev ) , append( Xrev , [A[, Y). To capture the O(n) reverse predicate requires introducing the rather meaningless predicate reverse1. A correct hypothesis together with this introduced predicate is then

reverse(X, Y) ~-- reversel (X, [ ], Y) , reversel([ [, X, X ) , reversel(A.X, Y, Z ) ~ - r e v e r s e l ( X , A. Y, Z) . But the language completeness assumption simplifies generation of hypotheses: only hypotheses constructed from known symbols need be considered. Induction can then be achieved, as described in Section 2.1, by searching through those clauses more general than a known specialization of a clause, for instance, a ground fact, as done by M A R V I N [18]. Alternatively, a searcfi can be made of those clauses more specific than a known generalization. The Model Inference System [8] would begin, for instance, with the fact sort(X, Y). This fact is more general than any clause with the two-place sort predicate at its head. These search spaces are usually organized into hierarchies. Loosely, a specialization (generalization) hierarchy w.r.t. P rooted at a given clause C is some organization of clauses ~

e C) such that at least one member from each equivalence class whose members are <~eC (>~eC) is represented. Clauses may be organized into a tangled hierarchy where all clauses below a given clause in the hierarchy are ~e) that clause. M A R V I N searches a ~The notation for lists is

Head. Tail,

equivalent to

[Head ITail ] or cons(Head, Tail).

162

W. I3UNTINE

generalization hierarchy, the Model Inference System a specialization hierarchy, and the Version Space method [10] the intersection of the two hierarchies. Surprisingly, specialization and generalization hierarchies possess very different characteristics. To investigate these, I consider the advantages gained during search of the hierarchies when using a stronger model of generalization, as achieved by utilizing more of an induction system's current knowledge. The use of a stronger model of generalization does not in fact increase the scope of the search space defined by a specialization hierarchy. A hierarchy so constructed will still span an equivalent set of clauses. More specilically, let Q ~ P be two logic programs, let S t, denote the set of clauses represented in some specialization hierarchy w.r.t. P rooted at the clause (', and let 5,'~, denote the same for Q. Then every clause in S~, has a member of its equivalence class w.r.t. Q in S O (as a specialization w.r.t. P is also a specialization w.r.t. Q), ~md every clause in S o has a member of its equivalence class w.r.t Q m ~,'~, (let D ~ S~_~ and C~,,~O = D~,,~, then (D U ('0) =c) D and -<-~(7, hencc ~:~, ('). Using a similar argument, this property can be shown to occur in any rule language where the bodies of the rules are closed under conjunction. And, any specialization hierarchy rooted at the most general clause (for the sort predicate, this was sort(X, Y)) will span all possible clauses, it forms a complete search space. A stronger model of generalization, however, reduces the size but not the scope of the search space because several equivalence classes may collapse into the one. This was demonstrated by the second example in Section 2.2 where clause (5) becomes equivalent to clause (4). The stronger model also allows more pruning to occur during search, as demonstrated by clauses (6) and (7) in the same example. To illustrate these ideas with a more extensive example, consider the logic program below called Philosophers: in tellectual( socrates ) . man(socrates) , mortal(X) e- man(X) , greek(socrates) , man(descartes) .

Part of a specialization hierarchy constructed with the same language and organized using 0-subsumption is shown in Fig. l. Compare this with Fig. 2 showing part of the same hierarchy organized using subsumption w.r.t. Philosophers. Notice that not only have groups of nonequivalent clauses collapsed into single equivalence classes, as highlighted by the shading, but also the hierarchical structure of the space has been rearranged. For instance, compare the positions of intellectual(socrates), the clause shaded the darkest. Most of the significant differences between the hierarchies stem from the clause m o r t a l ( X ) +- m a n ( X ) .

163

G E N E R A L I Z E D SUBSUMPTION

intell(X)eman(Y) •

ntell(X)~mortal ( Y)-

intell(X){ntell(descartes) intell(s ocrates)!iiiili "~/X ~........~......................................................................... . . . . . . . . . . ~!i!! g reek ( Y) •

~'~~in~a~(~ ) ~~(~)!"~(ne~k:~

),

", '~

" ~ in~l(~)v~man(X), \ man(Y), \ mortal(Y). ,'~ mortal(Y). .~,.'~.,.-.~\\~,~"~ ~

T

~" intell(X) ~ x 5 ~ man(X), Q . mo~al(X). ~ ~

~ ,ntell(X)~-~ ~ man(Y), ~ _ \mortal(X).~ ~--~. ~

intell(X)~ man(Y), mo~al(Y), greek(Z).

,ntell(X)eman(Y), mortal(Z), greek(W).

~intell(X)~ ~ man(Y), E 2 mo~al(X),~ '- greek(Z)~

Fig. 1. Part of a specialization hierarchy w.r.t. ¢), that is, under 0-subsumption, rooted at intellectual(X), Arrows indicate "subsumes w.r.t. 0", that is, "0-subsumes".

In contrast, for generalization hierarchies w.r.t, different knowledge, the same equivalence of scope does not hold. The more knowledge incorporated into the model of generalization, the greater the number of clauses spanned. In particular, a generalization hierarchy w.r.t, the current knowledge rooted at a given fact contains representatives of exactly those equivalence classes whose clauses are currently known to cover that fact. This stands out dramatically if we compare the clauses found subsuming (above) intellectual(socrates) in Fig. 1 with those in Fig. 2. Thus, when using a generalization hierarchy w.r.t, current knowledge to achieve induction, an assumption is being made about the current knowledge. It must be sufficient to justify some clause, true in the intended interpretation, subsumes the root clause. I call this the justifiability assumption. The justifiability assumption dictates that knowledge should be acquired incrementally, with new concepts being built on existing knowledge. When the assumption is used, the scope of the resultant search space is greatly restricted in comparison with a full specialization hierarchy. There is one situation where the justifiability assumption is satisfied by default. This occurs when an unknown concept does not need a recursive

164

w. BUNTINE

intell(X) ~ ~ ~ n--- "'"""'" t ~ - mor t a l ( ~ / ~ / ~ . / / / ~ ~

n ~

t

e

~

l ~ ~, ~ , ( X ) ~

~

gre.~e scartes,. ~ ~}'22~X), ~esca~es)

intell(X)~ man(X), greek(X).

~:~:;~ (socrates}: Fig. 2. Part of a specialization hierarchy w.r.t. P~ilosophers roo~ed at b~ellectua~(XL Arrows indicate "subsumes w.r.t. Philosopher.~ "~. Clauses equivalent w.r.t. Philosophers arc shaded identically in Fig. 1 and Fig. 2.

definition and all predicates that could be used in the definition are such that the truth of their instances are known according to current knowledge. The Version Space method is often explained in this context. 1 envisage two other situations where the justifiability assumption is applicable. Induction may be being controlled by a knowledgeable trainer, as is expected with M A R V I N [18]. Before beginning induction, it is not unreasonable to expect the trainer to supply the system with relevant knowledge and pertinent examples sufficient to justify that a clause is indeed a cover. Although Banerji [28] comments, in the context of the induction of recursive rules, such a system "depends heavily on the facts being presented in a proper sequence", Second, a system may be supplied with all currently available knowledge, that is, observations and a suitably rich description language, and be required to perform induction solely with that knowledge. Any inductive hypothesis can only be accepted if it is currently plausible. I claim that to a first approximation, plausibility of a clause is equivalent to being able to justify that the clause covers at least some facts known true, but none known false. Most specific generalizations, constructed by some induction techniques to search a generalization hierarchy, are to be considered separately in Section 7.

GENERALIZED SUBSUMPTION

165

6. A Model of Redundancy In this section I consider a restricted form of redundancy called logical redundancy. With this form, redundancy is determined according to the ideal rules of logic, independently of the particular implementation of the knowledge-based or logic programming system being reasoned about. Before considering logical redundancy in more detail, it is best to consider the issues that it ignores. In the final analysis, a rule or some condition in a rule is redundant for a particular knowledge-based system if, on its removal, the system would continue to perform as required and would still do so after any feasible extensions to the system's knowledge have been made. Apart from just the form of knowledge available, several aspects need to be considered when determining this: (1) the kind of questions that will be asked of the system by the user; (2) the completeness of the reasoning component of the system (that is, given enough resources, is it able to draw all possible conclusions); (3) the resource and timing constraints the system operates under; and (4) the subset of the system's knowledge base that is correct or complete. (For example, if a set of Horn clauses is complete, that is, does not need to be extended in order to make conclusions about facts, we are able to use the closed world assumption to make conclusions about their negation [23, Chapter 3].) With logical redundancy, the following assumptions are being made: (]) any possible question can be asked of the system; (2) the system's reasoning component is complete; (3) the system has unbounded resources~ and, (4) no part of the knowledge base is complete, although a certain subset (usually represented below by P) is correct. It is fairly safe to make assumptions (]) and (4) because these only weaken the actual situation. ! consider some implications of relaxing assumption (3) in Section 6.2.

6.1. Logical redundancy At least two kinds of this style of redundancy exist in a logic program. A clause can be redundant and so can an atom within a clause. A clause D is redundant in a logic program P if P - { D } ~ VD. Such a clause can be removed from the logic program and, by assumptions (2) and (3) above, a particular implementation will still have exactly the same goals succeed. For example, the third and fourth clauses in the program

member(X, X. Y) , member(X, Z. Y) ~- member(X, Y) , member(X, Z.X.[ ]) , member(l, 3.2.1.[ ]),

166

W. BUNTINE

arc redundant because they are both logical consequences of the first two. From T h e o r e m 4.3, if a clause is redundant, there always exists another clause in the logic program that can be considered primarily responsible for rendering it redundant. Several such clauses may exist. For example, from T h e o r e m 4.2 the third and fourth clauses in the above program are both more redundant than the second clause and the first cannot be so related to any of the others in the program. In view of this, the subsumption relation can be considered to order clauses in terms of their relative redundancy. In addition, the subsumption test gives a semi-decidable algorithm to detect logical redundancy, ideally suited to computation by a logic programming system. This allows~ for instance, the redundancy of clauses and not just facts to be detected (compare with Bowen and Kowalski [29]). Within clauses themselves, a second type of redundancy is possible. If clauses are being constructed by an algorithm with no proper regard for lt~e underlying semantics, for example, if clauses are being enumerated during induction, they may contain atoms making no effective contribution lo the successful working of the clause. The clause

cuddly-cat(X) ~- fluC]}'( X ), cat(X), animal(X) is such a case if it is known that a cat is always an animal. This is because thc atom animal(X) will be proven true whenever cat(X) is. The last atom can be said to be redundant because its only effect is to cause additional but unnecessary computation. Formally, this occurs because if the logic program P contains the clause animal(X) ~- cat(X), then

cuddly-cat(X) ~-- fluffy( X ) , cat(X), animal(X) - t , cuddly-cat(X) *-- fluf[y( X ), cat(X) and, according to Corollary 4.4, the shorter clause can replace the longer in any such logic program and the same goals will still succeed. Plotkin [11] denotes the process of ensuring that a clause contains no further redundant atoms, in the context of 0-subsumption, reducing a clause. A corresponding concept is appropriate for generalized subsumption. Definition 6.1. Clause A 0 *- A ~. . . . . A,, is reduced w.r.t logic program P if for all i such that l ~ < i ~ < n , A ~ - A ~ . . . . . A~ ~ , A i ~ . . . . . A,,is not equivalent, w.r.t. P to A~ ~- A ~. . . . . A,,. A clause D within a logic program P can be replaced by an equivalent but reduced clause w.r.l. P and the new logic program will have exactly the same goals succeed. More precisely

GENERALIZED SUBSUMPTION

167

Lemma 6.2. Let D ~ P. l f D' = e D and D' has been obtained by deleting atoms

from the body of D, then b P < - ~ ( P - {D} tO { D ' } ) . Proof. D ~>1,D', so by Corollary 4.4, ~ P - ~ ( P - {D} tO {D'}). D' ~>0 D (as D' has been obtained by deleting atoms from D) so the reverse direction holds as well. [] The equivalent but reduced version of a clause will always remain equivalent to the original if the logic program is subsequently expanded. In additiom a reduced form obtained by deleting atoms is guaranteed to cause smaller proofs to be built--and would usually cause less computation--than the original clause, because any proof using the original will always contain a proof using this reduced form. When the reduced form is unique, it is guaranteed to cause smaller proofs to be built than any other equivalent clause. Unfortunately, the uniqueness condition does not always hold. Although it is a simple matter to show that the reduced form w.r.t, a nonrecursive program will always be unique. An algorithm for reducing a clause, given below, is the same as Plotkin's [11, Theorem 2] but is based on the subsumption test given in Theorem 4.2 rather than 0-subsumption, and, as also suggested by Sagiv [27, Section VIII, each atom need only be considered for reduction, in Step 2, once in the entire course of the algorithm. It inherits termination problems from the subsumption tests performed in Step 2. Because the input clause is finite in length, the algorithm is guaranteed to terminate with a clause D set to a reduced form of C if all subsumption tests terminate. Theorem 6.3 (Reduction algorithm). The reduction algorithm below accepts as input a logic program P and a clause C. Assuming all subsumption tests terminate, the clause D when output will be in reduced form and be equivalent to C w.r.t.P.

Step 1. Set D to C. Step 2. For each atom A in the body of C, if D - { ~ A } < ~ I , D D - {-~A}.

set D to

Proof. Termination has already been discussed. Clearly, D ~~,C on output as D has been constructed by removing atoms from the tail of C. It remains to show that D is reduced w.r.t. P on output. The following shows this by arguing that if an atom can be reduced from the output version of D, then it would have already been reduced during Step 2.

168

W. BUNTINE

Represent D by A ~ ~ A ~. . . . , A,,. Assume A,, can be reduced from D, that is Ao~-AI,...,A,,

~=eA~I~--AI ....

,A,,,

and An+ ~ was the last atom reduced to obtain D, that is A o ~--A ~. . . .

, A,, =t, A o ~-- A ~. . . . .

A,,~I .

This implies Ao <--A ~. . . . .

A,,

~~ e A ~

~-A~ ....

Ao~-A

A,,

t,A,,+~<~oAoe-A~

, A,,+.~

and so ~. . . . .

.....

A,~ t

<~t, A o ~-- A ~ , . . . , A,,.~ 1 .

Consequently, A,, could have been reduced before A,,+~, that is A~I~--AI,...,A,,

I,A,,+~=t, Ao~-AI

.....

A,,~I.

So, if an atom in the output version of D can be reduced, it would also have been able to be reduced during Step 2 (as earlier versions of D were equivalent but had extra atoms), so should not exist in the output version. Because this is a contradiction, no atom in the output version of D can be reduced from D. [] The complexity of this algorithm depends on the complexity of subsumption. Unfortunately, 0-subsumption is known to be NP-complete [30, p. 264]. The algorithm can be speeded in some instances by attempting to reduce a small group of atoms before employing the full subsumption test in Step 2. For example, consider the clause a(X) ~-b(X,

Y), c(X, Z), d(Z, W).

Using T h e o r e m 4.2(2), it can be shown that b ( X , Y ) can be reduced from this clause if a(X)~--c(X, Z), d(Z, W) <~pa(X)~--b(X, Y) .

If successful, this computation is far shorter than the full subsumption test of

a(x)~-c(X, z), ct(Z, w)~pa(X)~-b(x, v), c(X, z), a(z, w).

GENERALIZED SUBSUMPTION

169

Similarly, d(Z, W) can be reduced if

b(X, Y),

z)

W),

where z is a unique constant symbol. An application of the reduction algorithm is demonstrated in Section 7. Maher [31] suggests that a logic program can be reduced under 0-subsumption to a "canonical form" by removing all subsumed clauses and then replacing all remaining clauses by their unique reduced form under 0-subsumption. Likewise, it is possible to remove both kinds of logical redundancy from a program by removing all redundant clauses and then replacing each clause in the program by its reduced form w.r.t, the program, as in Lemma 6.2. Sagiv [27, Fig. 2], proposes a variant of this technique for "minimizing" a D A T A L O G program (although, in his algorithm, the strict order in which redundancies can be removed is unnecessary); as mentioned in Section 4, subsumption w.r.t. DATALOG programs is decidable. Similar techniques to these need to be developed for the more extensive knowledge representations usually found in knowledge-based systems, to assist in the task of knowledge base maintenance. 6.2. Redundancy when resources are bounded

Consider what happens when assumption (3) for logical redundancy is dropped; resources are always limited in practice. In this situation, rules that are logically redundant may become necessary for the system because the processing resources may not be available to deduce them in the course of answering other questions. It is important then to introduce controlled logical redundancy. There are two main contexts in which this can be done. Logically redundant rules may need to be generated by the system before it could even begin, within its resource constraints, to answer a particular question. This, of course, leads to the problem of lemma conjecturing [32], so important in proof discovery. Alternatively, the system may be currently answering a particular style of question and is required to improve its performance on these. A partial solution to this problem has already been proposed in another setting: explanation-based learning in the context of a strong, logical, domain theory [33]. Methods currently used [33, 34] effectively take the results of a deduction, a proof tree, remove problem specific components from the tree (the interesting question is, which ones?) and then collapse the remainder of the tree to a single rule using a technique such as partial evaluation. This is a simple knowledge compilation method that produces logically redundant rules to allow similar problems to be solved more efficiently. This pre-empts the need for methods such as analogy to be used when solving these similar problems.

170

,a++ BUNTINIai

The kind of "generalization step" used may not always be directly applicable in more extensive contexts. For instance, with a logic incorporating a time or change component, such as a planning logic, goal regression ma> also bu required to generalize an " e x p l a n a t i o n " / p r o o f .

7. Most Specific Generalizations A clause C is a most specific generalization w.r.t, logic program P of clauses D~ and D~ if C is constructed from (predicate, constant, and function) symbols occurring in P, DI and D 2, C is a common generalization of D~ and D,, and for any other common generalization, C', C'->-/, C. Although the clause (~ is not unique, its equivalence class under =/, is. Now, it is a simple matter to devise a logical formula that is a most specific common generalization: take the disjunction of the two clauses D~ and D,. But finding a clause is a more difficult problem. The concept is important for induction because, assuming that known clauses D~ and D 2 are specializations w.r.t. P of some unknown clause true in the intended interpretation, the most specific generalization w.r.t. P of D~ and D e must also be true. The assumption is similar to the justifiability assumption. Vere [19] gives illustrative examples where the current knowledge is represented by a set of facts. The following theorem, in conjunction with Plotkin's least generalization algorithm [11, T h e o r e m 3] suggests a method to find a most specific generalization if one exists. Plotkin's algorithm is used to find the least generalization (most specific generalization w.r.t. ~, under 0-subsumption) of two clauses. Theorem 7.1 (Finding a most specific generalization of two clauses). Let C and D be two clauses containing disjoint variables and P be a logic program. Let O~ be a substitution grounding the variables occurring in Che,,~ to new constants, O~ be a substitution likewise grounding the remaining variables occurring in C, and d)~ and d)2 be likewise for D. If a most specific generalization of C and D w. r.t. P exists then it is equivalent w.r.t. P to a least generalization of CO~ t3 { ~ A ~ . . . . . ~ A , , } and Ddo~Lt(-~B l . . . . . ~B,~,} where for l~i-~-.n, P:\ C~,odyO~~ ~ A~, and A~ is a ground atom constructed from symbols" occurring in P, C, 0~, 02, and D. Likewise .[or each B:. Proof. Let msg(C', D') denote a representative most specific generalization of clauses C' and D ' w.r.t, program P. Likewise, let lg(C', D') denote a representative least generalization. Let msg be any most specific generalization of C and D w.r.t. P containing no variables in common with C, D, or P. First, goals G and H shall be constructed such that msg = t, lg(C01 tO G, Dgal tO H) . Let r be the unique substitution affecting only variables in (msg)h~a u such that (msg)hc~d r = Chela. This exists because msg is a generalization of C. Let

GENERALIZED SUBSUMPTION

171

C' = C ~J (msg)r~7 for any substitution cr such that (1) ~ affects variables in the body of (msg)'r only, (2) (rnsg)~o- >~p C, and (3) variables common to C and (msg)'ccr must occur in the head of C. C' <~ rnsg by the construction of C', C' ~

j, C by conditions (2) and (3). Construct D' correspondingly. Clearly, lg((Y, D') <~ rnsg by definition of lg because both C' and D' are ~<~ msg, so

lg(C', D') <~pmsg. In addition,

lg( C', D ' ) >~,,msg( C', D ' ) because any generalization w.r.t. ~ is also a generalization w . r . t . P . But, C' = p C and D' = e D , so

lg(C', D') >~~ rnsg . Consequently,

rnsg =elg(C', D') =plg(C'O~, D'd~) =elg(CO~ ~ G, Dd)~ U H) , where G represents the goal part of (msg)r~rO~ and likewise for H. If a particular msg and substitution o- can be chosen such that the atoms in G satisfy the conditions on the A~ given in the theorem, and likewise for the corresponding substitution for H, we are done. The argument that the conditions on H are satisfied is similar to the corresponding argument for G so is not given below. Now (rnsg)-c >~ C and the heads of these two clauses are identical so by Theorem 4.2(2),

P/x C~,o~yO~O2 ~ ~v(mSg)bo~y~O~ , where V is the set of variables occurring only in the body of msg. Using the usual refutation process, a substitution o" can be constructed such that cr assigns variables in V to ground terms composed of the symbols in P/x CbodyOlO2 or (msg)b,,d~ rO~ and P/x Cbody0~02 ~ (msg)bodyrO , or.

172

w. BI !NT1NE

Function and constant symbols in msg all occur in P, C, and D by definition. So the corresponding ~- must also have function and constant symbols occurring only in P, C, and D. Consequently, (msg)~,od:.rO~~ is composed of function and constant symbols occurring only in P, C, 0~, &_, and D. Finally, note that the atoms in G are exactly the atoms in (msg)body~'O1~r and so satisfy the necessary conditions. ~] Two corollaries indicate special cases when most specific generalizations arc guaranteed to exist. The corollaries treat situations such as the logic program P being either a set of facts or a DATALOG program. Corollary 7.2. If both clauses C and D are unit clauses and only a finite number of ground atoms (constructed from symbols" in P, C and D) are a logical consequence of P, then a most specific generalization qf C and D w. r.t. P exists. Proof. In the construction in T h e o r e m 7.1, note that for the situation given in the corollary both C~,ody and Dbm~y are empty, so the sets { A~ . . . . . A,,} and { B ~ , . . . , B,,,) must contain ground atomic consequences of P. Consider the case where both these sets equal the entire set of ground atomic consequences of P. This set is finite so a least generalization, lg, can be constructed as given in the theorem. Now, for any other common generalization, cg, of C and D w.r.t. P, there exists another common generalization, cg', such that cg' ~ , c g and cg' can be constructed as outlined in T h e o r e m 7.1 but using other sets of atoms {A'~. . . . . A~,} and {B'~. . . . . B,~). A proof of this last statement is similar to the proof of T h e o r e m 7.1. Because the sets {A I . . . . . A~',} and {B'~ . . . . . B~} can contain nothing more than ground atomic consequences of P, lg <~ cg'. Because cg was picked arbitrarily, lg is most specific. ~q ¢

t

Corollary 7.3. If P, C, and D contain no function symbols, then a most specific generalization of C and D w.r.t. P exists. A direct implementation of T h e o r e m 7.1 as it stands is impractical for all but the simplest cases because it essentially involves the deduction of all ground facts logically implied by the logic program P. Furthermore, the resultant most specific generalization is to be reduced w.r.t, the logic program and this operation is NP-complete for even the simplest case when P = { }. Plotkin, in his thesis, suggests that the processes of constructing a least generalization and of performing its reduction should be interwoven. Buntine [36] discusses this in more detail. T h e o r e m 7.1 is, however, of theoretical significance because it provides a precise characterization of the technique suggested by Plotkin for using his least generalization algorithm, extended to allow the generalization of pairs of clauses, not just facts, and to allow the use of current knowledge expressed as definite clauses.

Two simple examples below illustrate some problems that may be encountered in practice. In the first example, suggested by Tim Niblett, an infinite number of facts can be generated that contribute to an "infinite" most specific generalization. The simplicity of this example suggests that most specific generalizations will commonly not exist. A most specific generalization of h(a) and h(b) w.r.t. P, given below, does not exist.

g(f(a)) ,
g(a, X) ,
g(b, X) .
A proof is by contradiction. Suppose a most specific generalization does exist. Consider applying the construction given in Theorem 7.1 with C ≡ h(a), D ≡ h(b), the grounding substitutions θ_1, θ_2, φ_1 and φ_2 empty, and the two sets of ground atoms, {A_1, ..., A_p} and {B_1, ..., B_m}, both equal to H_n, where H_n is the set of all ground atomic consequences of P with term depth ≤ n. H_n is given by

H_n = {g(f(a))} ∪ G_0 ∪ G_1 ∪ ... ∪ G_n ,

where

G_i ≡ {g(a, f^i(a)), g(b, f^i(a)), g(b, f^i(b)), g(a, f^i(b))} .

The theorem says to construct the least generalization of

h(a) ← g(f(a)), g(a, a), g(a, f(a)), ..., g(a, f^n(a)),
        g(b, a), g(b, f(a)), ..., g(b, f^n(a)),
        g(b, b), g(b, f(b)), ..., g(b, f^n(b)),
        g(a, b), g(a, f(b)), ..., g(a, f^n(b)) .
h(b) ← g(f(a)), g(a, a), g(a, f(a)), ..., g(a, f^n(a)),
        g(b, a), g(b, f(a)), ..., g(b, f^n(a)),
        g(b, b), g(b, f(b)), ..., g(b, f^n(b)),
        g(a, b), g(a, f(b)), ..., g(a, f^n(b)) .

This eventually reduces w.r.t. P to cg_n, given by

h(X) ← g(X, X), g(X, f(X)), ..., g(X, f^n(X)),
        g(X, a), g(X, f(a)), ..., g(X, f^n(a)),
        g(X, b), g(X, f(b)), ..., g(X, f^n(b)) .

By Theorem 7.1, the most specific generalization must be equal to cg_n for some n. But then a contradiction exists, as cg_{n+1} is also a common generalization, but it is more specific than cg_n (because both cg_n and cg_{n+1} are known to be reduced).
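To make the reduction step concrete, the sketch below (again illustrative Python, with assumed names) implements plain θ-subsumption between definite clauses and the greedy literal-removal reduction it licenses. This corresponds to the P = { } case mentioned after Corollary 7.3; the example above reduces w.r.t. P, a strictly stronger notion that this fragment does not capture.

    def is_var(t):
        return isinstance(t, str) and t[:1].isupper()

    def match_term(p, t, theta):
        # One-way matching: extend theta so that p instantiated by theta equals t.
        # Variables of the target are treated as ordinary symbols, so two distinct
        # clauses should be standardized apart first (for reduction, where both
        # clauses share one variable namespace, the shared names are intended).
        if is_var(p):
            if theta.get(p, t) != t:
                return None
            out = dict(theta)
            out[p] = t
            return out
        if isinstance(p, tuple) and isinstance(t, tuple) \
                and p[0] == t[0] and len(p) == len(t):
            for a, b in zip(p[1:], t[1:]):
                theta = match_term(a, b, theta)
                if theta is None:
                    return None
            return theta
        return theta if p == t else None

    def subsumes(d, c):
        # True iff d theta-subsumes c: some theta maps d's head onto c's head and
        # every body atom of d onto some body atom of c (backtracking search).
        (dh, db), (ch, cb) = d, c
        theta0 = match_term(dh, ch, {})
        if theta0 is None:
            return False
        def extend(i, theta):
            if i == len(db):
                return True
            return any((t2 := match_term(db[i], a, theta)) is not None
                       and extend(i + 1, t2) for a in cb)
        return extend(0, theta0)

    def reduce_clause(clause):
        # Greedy reduction: drop a body atom whenever the original clause still
        # theta-subsumes the smaller one (the converse holds trivially, since the
        # smaller clause is a subset), so the two clauses are equivalent.
        head, body = clause
        body = list(body)
        changed = True
        while changed:
            changed = False
            for a in list(body):
                rest = [b for b in body if b != a]
                if subsumes((head, body), (head, rest)):
                    body, changed = rest, True
                    break
        return head, body

    # h(X) <- g(X, a), g(Y, a) reduces to h(X) <- g(X, a).
    print(reduce_clause((('h', 'X'), [('g', 'X', 'a'), ('g', 'Y', 'a')])))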

In the second example, a most specific generalization does exist but is impractical as an immediate induction hypothesis. Consider a most specific generalization of C ≡ member(4, [3, 4]) and D ≡ member(2, [5, 1, 2]) w.r.t. P, given below.

member(X, X.Y) ,
member(2, 1.2.[ ]) .

Using a theorem and algorithm developed by Buntine [36], a most specific generalization exists. One has been computed to be 43 atoms in length, but it is unknown at present whether this clause can be reduced further. Had a most specific generalization been developed using a naive application of Theorem 7.1, the intermediate clause constructed by the least generalization algorithm would have been around 2000 atoms long before application of the reduction algorithm.

One method of overcoming the problems of infinite or lengthy most specific generalizations is to incorporate relevance notions [35, 36] as follows. In an induction problem, a ground atom B is irrelevant to a ground atom A if, for the intended hypothesis H, B does not occur in any proof of A in H. The definition of a most specific generalization can then be extended to incorporate relevance. A maximally specific plausible generalization of two clauses is a common generalization of the two clauses satisfying available information on relevance that is not more general than any other common generalization satisfying available information on relevance. For instance, in the second example above, we could say that for N an integer and L a list of integers, member(U, V) is irrelevant to member(N, L) if one of the following holds: U is not an integer, V is not a list of integers, or the length of V is greater than or equal to the length of L. This information about relevance, when incorporated into the example, yields a maximally specific plausible generalization, shown below after a brief sketch of the relevance test.
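The relevance test just described is easy to state operationally. The fragment below is a hypothetical illustration (the function name and the representation of atoms as Python tuples and lists are assumptions) of how such a test prunes candidate body atoms before any generalization is attempted.

    def irrelevant(candidate, target):
        # member(U, V) is irrelevant to member(N, L) when U is not an integer,
        # V is not a list of integers, or len(V) >= len(L), as suggested in the
        # text for this example.
        _, u, v = candidate
        _, _, l = target
        return (not isinstance(u, int)
                or not (isinstance(v, list) and all(isinstance(x, int) for x in v))
                or len(v) >= len(l))

    # Prune candidate body atoms (e.g. ground consequences of P) for the
    # example target member(4, [3, 4]); only the shorter-list atoms survive.
    target = ('member', 4, [3, 4])
    candidates = [('member', 4, [4]), ('member', 2, [2]), ('member', 2, [1, 2])]
    kept = [a for a in candidates if not irrelevant(a, target)]
    # kept == [('member', 4, [4]), ('member', 2, [2])]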

The maximally specific plausible generalization obtained is

member(X, U.V.W) ← member(X, V.W) .

8. Conclusions

An integral part of the induction process is search through a space of rules. This paper introduces a model of generalization, incorporating theory and algorithms, on which suitable search spaces can be built. This supersedes θ-subsumption as such a model of generalization for definite clauses. Several results have been presented: a characterization of the potential effects of a system's current knowledge on inductive search spaces, methods to detect redundant clauses in logic programs and redundant atoms within clauses, a theoretical characterization of Plotkin's induction technique of finding a most specific generalization, and a significant improvement over an existing tool, θ-subsumption.

For the reasons outlined in Section 5, the model also suggests an improvement over Shapiro's most general refinement operator [8] for enumerating a specialization hierarchy. Of course, to achieve induction in practice, a means of heuristically searching hierarchies needs to be developed. This is currently being investigated. More broadly, however, this work provides a case study of the interaction between different facets of knowledge: generalization, induction, redundancy, relevance, and structure, all key factors in the maintenance of knowledge bases. Finally, how can the model of generalization presented be strengthened without losing the computational properties demonstrated here? Maher [31] and Sagiv [27] both contribute in this direction. The kind of current knowledge utilized, presently definite clauses, could be extended, or generalization could be taken relative to some particular domain, that is, class of interpretations.

ACKNOWLEDGMENT

I would like to thank Ross Quinlan, Jenny Edwards, and Graham Wrightson for their encouragement and support; Donald Michie, John Potter, Michael Maher, the journal referees, and especially Tim Niblett for the suggestions they made concerning the current extension; Paul O'Rorke for making me aware of the many versions of subsumption; and Tim Niblett for introducing me to Plotkin's thesis. This research has been supported by a Commonwealth Postgraduate Research Award from the Australian Government.

REFERENCES

1. Dietterich, T.G., London, B., Clarkson, K. and Dromey, G., Learning and inductive inference, in: P.R. Cohen and E.A. Feigenbaum (Eds.), The Handbook of Artificial Intelligence III (Kaufmann, Los Altos, CA, 1982) 323-512.
2. Angluin, D. and Smith, C.H., Inductive inference: Theories and methods, Comput. Surveys 15 (3) (1983) 237-269.
3. Hart, A., The role of induction in knowledge elicitation, Expert Syst. 2 (1985) 24-28.
4. Quinlan, J.R., Compton, P.J., Horn, K.A. and Lazarus, L., Inductive knowledge acquisition: A case study, in: J.R. Quinlan (Ed.), Applications of Expert Systems (Addison-Wesley, London, 1987).
5. Kitakami, H., Kunifuji, S., Miyachi, T. and Furukawa, K., A methodology for implementation of a knowledge acquisition system, in: Proceedings IEEE International Symposium on Logic Programming, Atlantic City, NJ (1984) 131-142.
6. Michie, D., Current developments in expert systems, in: J.R. Quinlan (Ed.), Applications of Expert Systems (Addison-Wesley, London, 1987).
7. Quinlan, J.R., Induction of decision trees, Mach. Learning 1 (1986) 81-106.
8. Shapiro, E.Y., Inductive inference of theories from facts, Tech. Rept. 192, Department of Computer Science, Yale University, New Haven, CT (1981).
9. Shapiro, E.Y., Algorithmic Program Debugging (MIT Press, Cambridge, MA, 1983).
10. Mitchell, T.M., Generalization as search, Artificial Intelligence 18 (1982) 203-226.
11. Plotkin, G.D., A note on inductive generalisation, in: B. Meltzer and D. Michie (Eds.), Machine Intelligence 5 (Elsevier North-Holland, New York, 1970) 153-163.
12. Plotkin, G.D., A further note on inductive generalisation, in: B. Meltzer and D. Michie (Eds.), Machine Intelligence 6 (Elsevier North-Holland, New York, 1971) 101-124.
13. Mitchell, T.M., Utgoff, P.E., Nudel, B. and Banerji, R., Learning problem solving heuristics through practice, in: Proceedings IJCAI-81, Vancouver, BC (1981) 127-134.
14. Michalski, R., A theory and methodology of inductive learning, Artificial Intelligence 20 (1983) 111-161.

15. Brachman, R.J. and Levesque, H.J., The tractability of subsumption in frame-based description languages, in: Proceedings AAAI-84, Austin, TX (1984) 34-37.
16. Robinson, J.A., A machine-oriented logic based on the resolution principle, J. ACM 12 (1) (1965) 23-41.
17. Siekmann, J. and Szabo, P., Universal unification and a classification of equational theories, in: Proceedings Conference on Automated Deduction, Lecture Notes in Computer Science 87 (Springer, New York, 1982) 369-389.
18. Sammut, C.A. and Banerji, R.B., Hierarchical memories: An aid to concept learning, in: R.S. Michalski, J. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach II (Morgan Kaufmann, Los Altos, CA, 1986).
19. Vere, S.A., Induction of relational productions in the presence of background information, in: Proceedings IJCAI-77, Cambridge, MA (1977) 349-355.
20. Fu, L.-M. and Buchanan, B.G., Learning intermediate concepts in constructing a hierarchical knowledge base, in: Proceedings IJCAI-85, Los Angeles, CA (1985) 659-666.
21. Plotkin, G.D., Automatic methods of inductive inference, Ph.D. Thesis, University of Edinburgh, Scotland (1971).
22. Sammut, C.A., Concept development for expert system knowledge bases, Aust. Comput. J. 17 (1) (1985) 49-55.
23. Lloyd, J.W., Foundations of Logic Programming (Springer, New York, 1984).
24. Loveland, D.W., Automated Theorem Proving: A Logical Basis (North-Holland, Amsterdam, 1978).
25. Doyle, J., A truth maintenance system, Artificial Intelligence 12 (1979) 231-272.
26. Gallaire, H. and Minker, J., Logic and Databases (Plenum, New York, 1978).
27. Sagiv, Y., Optimizing datalog programs, in: Proceedings Foundations of Deductive Databases and Logic Programming Workshop, Washington, DC (1986) 136-162; also Rept. STAN-CS-86-1132, Department of Computer Science, Stanford University, CA (1986).
28. Banerji, R.B., Changing language while learning recursive descriptions from examples, in: T.M. Mitchell, J.G. Carbonell and R.S. Michalski (Eds.), Machine Learning: A Guide to Current Research (Kluwer Academic, Boston, MA, 1986) 5-9.
29. Bowen, K.A. and Kowalski, R.A., Amalgamating language and meta-language in logic programming, in: K.L. Clark and S.-A. Tarnlund (Eds.), Logic Programming (Academic Press, New York, 1982) 153-172.
30. Garey, M.R. and Johnson, D.S., Computers and Intractability (Freeman, San Francisco, CA, 1979).
31. Maher, M.J., Equivalences of logic programs, in: Proceedings Third International Conference on Logic Programming, Lecture Notes in Computer Science 225 (Springer, New York, 1986) 410-424.
32. Bledsoe, W.W., Some thoughts on proof discovery, in: Proceedings IEEE Symposium on Logic Programming, Salt Lake City, UT (1986) 2-10.
33. Mitchell, T.M., Keller, R.M. and Kedar-Cabelli, S.T., Explanation-based generalization: A unifying view, Mach. Learning 1 (1) (1986).
34. DeJong, G. and Mooney, R., Explanation-based learning: An alternative view, Mach. Learning 1 (2) (1986) 145-176.
35. Buntine, W.L., Induction of Horn clauses: Methods and the plausible generalization algorithm, Int. J. Man-Mach. Stud. 26 (1987) 499-519; revised version of a paper presented at Knowledge Acquisition for Knowledge-Based Systems Workshop, Banff, Alta. (1986).
36. Buntine, W.L., A most specific generalisation algorithm for Horn clauses, Unpublished manuscript (1988).

Received November 1986; revised version received January 1988