BIOCHIMIE, 1985, 67, 493-498
Statistico-syntactic learning techniques. Henry SOLDANO and Jean-Louis MOISY.
Atelier de Bio-Informatique, Institut Curie, 11, rue P. et M. Curie, 75231 Paris Cedex 05. (Received 30-1-1985, accepted after revision 19-4-1985).
Résumé - Methods of "learning from examples" make it possible to solve classification problems: discrimination between two classes of objects, and assimilation of an object to a class of objects representing a property. They are used in situations where no decision procedure is known a priori but where a sufficient number of examples is available. After a learning stage on the examples, a procedure is constructed to solve the problem. In the methodology presented here, the description of an object takes the form of a list of attributes. The knowledge acquired by learning takes the form of bundles of rules, regarded as arguments in favour of a certain decision. Key-words: learning / artificial intelligence.
Summary - The methods of "learning from examples" enable classification problems to be solved: discrimination between two classes of objects, and assimilation of an object to a class of objects representing a property. They are used when we do not know a priori a procedure for deciding, but have examples in sufficient amount. After a learning stage with the examples, a procedure for solving the problem is built. In the methodology presented, the description of an object is a list of attributes, and the acquired knowledge consists of sets of "rules" considered as arguments in favour of a particular decision. Key-words: machine learning / artificial intelligence.
I. Introduction
The techniques reported in this paper, which are used in several applications included in this issue, belong to Pattern Recognition. In this field, the goal is to establish links between a set of objects and a set of concepts or properties that these objects may have and, more specifically, to evaluate whether a certain object verifies a certain property. When no procedure for answering this question is available at the outset, it is possible to build one from a set of objects that verify the property (Examples) or do not verify it (Counter-Examples). The techniques used are then called techniques of "learning from examples".

We may divide a learning process into two steps:
-- A learning step, during which the acquisition of knowledge leads to building the procedure.
-- A generalization step, during which the use of the procedure allows the generalization of the property to a set larger than the set of examples.

These techniques take various forms, from purely statistical methods [3] to methods of logical induction [6].
The statistico-syntactic learning techniques are related to Artificial Intelligence:
-- The description of objects is a sequence of binary attributes. From the attributes, short logical formulae are built and selected according to their relevance to the property being learnt. These formulae lead to "rules", which represent the acquired knowledge. The required procedure is then deduced from the rules using a simple algorithm.
-- Once the procedure is established, it works automatically. For each description of a new object, it gives a decision about whether the learnt property is verified.

Furthermore, these techniques provide a picture of the acquired knowledge which enables us to explain a decision, to "argue". An enriching dialogue is then possible with the user (a specialist in the field).
II. Problematics

Let $\Omega$ be the set of objects and P the set of studied properties which the objects may have. To each object x corresponds a description d(x) containing the accessible information about it. A description space D, associated with the set of objects, is chosen depending on the studied property. Let us express in question form the problems we will try to solve:

1) QUESTIONS RELATED TO THE NOTION OF ASSIMILATION
a) Does an object x, known through its description d(x), verify a property A? In order to answer this question, a new question must be formulated: does x look like the examples of A (i.e. the objects verifying A)? The notion of similarity of x to the examples of A is acquired here by "learning from these examples". This learning consists of finding what the examples of A have in common in their descriptions, in other words, finding a coherency in the description of the objects which verify A. When x conforms to this coherency, x is said to be "assimilable to A". The search for this coherency is carried out by the "algorithm of generalization by points of view" (cf. IV).

Some other questions are related to the notion of assimilation:
b) How can an incomplete description be completed so that the corresponding object is assimilable to A (if possible)?
c) How can a description be modified so that the corresponding object is assimilable to A?

The same methodology applies to answering these questions.
2) QUESTION RELATED TO THE NOTION OF DISCRIMINATION
Does an object x verify a property A rather than a property B? When all the objects which do not verify A verify B, the question is equivalent to question a). However, we are interested here in another aspect of this question: how does x verify what distinguishes A from B, that is, what distinguishes the examples of A from the examples of B (if B is the negation of A, the latter are called counter-examples of A)? This distinction is acquired by learning from the examples of A and B. When x verifies what distinguishes A from B and does not verify what distinguishes B from A, the result of discrimination is: "x verifies A rather than B". The learning algorithm used is called the "discrimination algorithm".
Description of objects

In the methods we will discuss, the following pattern of description is used:

$d(x) = d_1(x)\, d_2(x) \ldots d_p(x) \in \{0,1\}^p$

where p is a constant and each $d_i$ is a binary variable. The interesting case in which $d_i$ may take more values, in order to represent ignorance or uncertainty about its value, will not be studied here.
General principles of statistico-syntactic methods

A list of examples (and counter-examples) represents a so-called "expression in extension" of a property (or of what distinguishes two properties). In such an expression, any object which does not belong to the list is not considered. The principle of statistico-syntactic learning leads from this "expression in extension" to an "expression in comprehension" of the property. This expression consists of a set of logical formulae built on the description, and a principle of management of these formulae.

The main characteristics of the statistico-syntactic techniques are:
a) a generalization property: an object which is neither an example nor a counter-example may be considered;
b) the results do not depend on the order of the examples;
c) the results depend on the repetition of the examples;
d) they admit contradictions in input: the same description may be found both in the examples and in the counter-examples of a property;
e) they admit contradictions in output: an object may be assimilable to a property as well as to its negation.
III. Discrimination
The learning phase of the process of discrimination between two properties A and B consists in building two lists of rules: a list of arguments in favour of A and a list of arguments in favour of B. An argument applied to an object x will be:

$P(x) \Rightarrow A$ or $P(x) \Rightarrow B$

where:
-- P is a logical formula built with the attributes, true or false for x (the premise of the argument);
-- $\Rightarrow A$ signifies "to be in favour of A", which actually corresponds to "P is often true for the examples of A and rarely for the examples of B".

The principle of management of the arguments is reduced to a mere count of the arguments in favour of A and of B. More precisely, it rests upon the following idea: a single argument (in favour of A, for instance) is insufficient to infer A, for two reasons:
1) non-overlapping: some examples of A do not verify the premise of the argument, which then does not apply to them;
2) error risk: some examples of B verify the premise of an argument in favour of A, which then applies to them wrongly.

However, the accumulation of arguments tends to cancel these two causes of error. The construction and selection of arguments are oriented so as to enhance the overlapping and to limit the error risk.

The checking and evaluation of the quality of the acquired knowledge (arguments and management of arguments) will be made using the examples of A and B, and using other objects which verify A or B but do not belong to the lists of examples.
Description of the Discrimination algorithm

The definition of the rules acquired by the discrimination algorithm depends upon three notions:
-- the kind of logical formulae considered,
-- the selection criteria for these formulae,
-- how the space of the considered formulae is explored.

The considered formulae are those built with the attributes (logical variables) and the connectors $\neg$ (not) and $\wedge$ (and). We progressively consider formulae of increasing length.
Formulae of length 1: singlets

Let $d_i$ be an attribute; two possible singlets correspond to it: $d_i$ and $\neg d_i$. Therefore, there are 2p possible singlets. A rule in favour of A will be, for instance: $(\neg d_i) \Rightarrow A$.
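The enumeration of singlets can be sketched as follows. This is only an illustration: the encoding of a literal as a pair (index, value) and the function names `singlets` and `holds` are assumptions made for the example, not notation from the paper.

```python
from itertools import product

# Assumed encoding: a literal is a pair (i, v), true for a description d
# when d[i] == v. So (i, 1) stands for d_i and (i, 0) for "not d_i".
# A formula is a tuple of literals, read as their conjunction.

def singlets(p):
    """Return the 2p formulae of length 1 over p binary attributes."""
    return [((i, v),) for i, v in product(range(p), (1, 0))]

def holds(formula, d):
    """True when every literal of the conjunction is satisfied by d."""
    return all(d[i] == v for i, v in formula)

description = (1, 0, 1)
# Singlets that are true for this description: d_0, not d_1, d_2.
print([f for f in singlets(len(description)) if holds(f, description)])
```

Representing a formula as a tuple of literals lets the same `holds` function evaluate doublets and longer conjunctions without change.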
Formulae of length 2: doublets

Let $d_i$ and $d_j$ be two attributes; we consider the four doublets:

$d_i \wedge d_j$, $d_i \wedge \neg d_j$, $\neg d_i \wedge d_j$, $\neg d_i \wedge \neg d_j$

There are p(p-1)/2 pairs of attributes, so we consider 4p(p-1)/2 doublets.

Before considering the case of longer formulae (triplets, quadruplets...), let us see what criteria are used to select them. To build a rule in favour of A (resp. B) we search for formulae that are often true for the examples of A (resp. B) and seldom true for the examples of B (resp. A). Therefore we count $n_A$ (resp. $n_B$), the number of examples of A (resp. B) for which the formula is true. In order to build a rule in favour of A, we shall retain a formula such that:

$n_A \geq S_{Amin} \times N_A$ and $n_B \leq S_{Bmax} \times N_B$
where $N_A$ (resp. $N_B$) is the total number of examples of A (resp. B), and $S_{Amin}$ and $S_{Bmax}$ are arbitrary thresholds. In the same way, to build rules in favour of B we shall retain only formulae such that:

$n_B \geq S_{Bmin} \times N_B$ and $n_A \leq S_{Amax} \times N_A$

where $S_{Bmin}$ and $S_{Amax}$ are also arbitrary thresholds.

The minimal thresholds, $S_{Amin}$ and $S_{Bmin}$, are typically chosen in the range 0.3 to 0.6, the maximal thresholds, $S_{Amax}$ and $S_{Bmax}$, being in the range 0.1 to 0.3. For the rules in favour of A, a value of $S_{Amin}$ close to 0.6 will enhance the overlapping of the examples of A, while a value of $S_{Bmax}$ close to 0.1 will reduce the error risk on the examples of B.

In the case of a rule equivalent to an implication (i.e. $S_{Bmax} = 0$) no error would occur for the examples of B. However, the choice of such thresholds is not realistic: the overlapping of the examples of A is then generally impossible. Furthermore, this choice is contrary to the foundations of the method.

These thresholds being defined a priori, the rules are ranked according to a measure of informational quality. If we consider the objects as individuals on which we observe two characters, the value of the formula and the property that they possess, each having two values as shown in Table I, it is then natural to take the chi-square as an information measure:

$\chi^2 = \dfrac{N\,(n_A N_B - n_B N_A)^2}{N_A N_B\,(n_A + n_B)\,(N - n_A - n_B)}$

where $N = N_A + N_B$.

TABLE I

Value of the formula    Property A       Property B       Total
1                       $n_A$            $n_B$            $n_A + n_B$
0                       $N_A - n_A$      $N_B - n_B$      $N - n_A - n_B$

In addition, we may measure the divergence between the two distributions of the value of the formula on A and B by Kullback's divergence, which is:

$\left(\dfrac{n_A}{N_A} - \dfrac{n_B}{N_B}\right) \log\left[\dfrac{n_A\,(N_B - n_B)}{n_B\,(N_A - n_A)}\right]$

or even use the Mahalanobis distance:

$\dfrac{|n_A - n_B|}{\sqrt{n_A + n_B}}$
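The threshold test and the three quality measures can be written down directly from the counts $n_A$, $n_B$, $N_A$, $N_B$. The sketch below is illustrative only: the function names, the default threshold values (picked inside the ranges quoted above) and the handling of degenerate counts are assumptions, not part of the published method.

```python
import math

def admissible_for_A(nA, nB, NA, NB, s_a_min=0.5, s_b_max=0.2):
    """Threshold test for a rule in favour of A (default thresholds are illustrative)."""
    return nA >= s_a_min * NA and nB <= s_b_max * NB

def chi_square(nA, nB, NA, NB):
    """Chi-square of the 2x2 table (formula value x property)."""
    N = NA + NB
    denom = NA * NB * (nA + nB) * (N - nA - nB)
    if denom == 0:
        return 0.0  # assumption: a formula that is always or never true carries no information
    return N * (nA * NB - nB * NA) ** 2 / denom

def kullback(nA, nB, NA, NB):
    """Symmetric Kullback divergence between the two frequencies nA/NA and nB/NB."""
    pA, pB = nA / NA, nB / NB
    if pA == pB:
        return 0.0
    if min(pA, pB) == 0 or max(pA, pB) == 1:
        return math.inf  # assumption: the logarithm diverges at the boundaries
    return (pA - pB) * math.log(pA * (1 - pB) / (pB * (1 - pA)))

def mahalanobis(nA, nB):
    """Distance |nA - nB| / sqrt(nA + nB), as written in the text."""
    return abs(nA - nB) / math.sqrt(nA + nB) if nA + nB else 0.0

# A formula true for 30 of 50 examples of A and 5 of 50 examples of B.
print(admissible_for_A(30, 5, 50, 50), chi_square(30, 5, 50, 50))
```

Any of the three measures can serve as the sort key when ranking the admissible rules; which one works best is a matter for experiment on the data at hand.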
Formulae of length 3: triplets

Let $d_i$, $d_j$ and $d_k$ be three attributes, and let $\ast$ be a variable connector (like $\pm$) representing the negation or its absence. We consider the formulae of the following type: $\ast d_i \wedge \ast(\ast d_j \wedge \ast d_k)$. For our purposes we shall consider only the following formulae: $\ast d_i \wedge \ast d_j \wedge \ast d_k$.

There are p(p-1)(p-2)/6 ways to choose three attributes, so there are $2^3 \times p(p-1)(p-2)/6$ triplets. The complexity of the search is then $O(p^3)$. This makes it necessary to have a search strategy for formulae based on successive restrictions of the search space. So we shall build the n-tuples by combining singlets only with (n-1)-tuples that satisfy certain conditions. For instance, in order to build doublets in favour of A, we shall use only the singlets satisfying the following constraint: $n_A \geq S_{Amin} \times N_A$. If a singlet does not satisfy this constraint, neither will its conjunction with another singlet.

The selection by verification of constraints (thresholds) and by evaluation of the informational quality of rules gives a list of admissible rules from which we may, for instance, keep the k best (in the sense of informational quality) for A and for B.

However, since the management of these rules during the generalization step is reduced to a mere count of the arguments for A and B verified by an object, it is useful to check the homogeneity and overlapping of the list. For this purpose we use a compression method: the synonymous or almost synonymous rules (i.e. those whose premises take the same value on the examples) are clustered into classes by means of a classical clustering algorithm. In order to obtain a list of uncorrelated rules which enhances the overlapping and reduces the error risk, we keep one (or a few) members of each class. Generally, the number of rules that we finally keep for each property will not exceed a few dozen.

During the generalization step, for each new object, we count the arguments in favour of A and B to make a decision which takes its value in {A, B, S}, where S (for Silence) represents an uncertainty about the object, which occurs because we find either too few arguments or more or less the same number of arguments in favour of each class.

The foundations of this method come from the work of Bongard (1970) [1]. It has been developed more recently by Quinqueton and Sallantin, inserted into an actual methodology of learning [7, 8], and applied successfully in various fields, including Biochemistry [4].
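The restricted search for doublets and the count-based decision of the generalization step can be sketched together. This is a simplified illustration under stated assumptions: literals are encoded as pairs (index, value), the names `doublets_in_favour` and `decide` are hypothetical, the thresholds `min_args` and `min_gap` that trigger Silence are invented for the example, and ranking and compression of the rules are left out.

```python
from itertools import combinations

def holds(formula, d):
    """A formula is a tuple of literals (i, v); it is true when d[i] == v for all of them."""
    return all(d[i] == v for i, v in formula)

def count(formula, examples):
    """Number of examples for which the formula is true."""
    return sum(holds(formula, d) for d in examples)

def doublets_in_favour(ex_pos, ex_neg, s_min=0.5, s_max=0.2):
    """Doublets often true on ex_pos and seldom true on ex_neg.

    Only singlets that already reach the minimal threshold on ex_pos are
    combined, since a conjunction cannot be true more often than its parts.
    """
    p = len(ex_pos[0])
    kept = [(i, v) for i in range(p) for v in (1, 0)
            if count(((i, v),), ex_pos) >= s_min * len(ex_pos)]
    rules = []
    for a, b in combinations(kept, 2):
        if a[0] == b[0]:
            continue  # d_i and "not d_i" can never hold together
        f = (a, b)
        if (count(f, ex_pos) >= s_min * len(ex_pos)
                and count(f, ex_neg) <= s_max * len(ex_neg)):
            rules.append(f)
    return rules

def decide(d, rules_A, rules_B, min_args=1, min_gap=1):
    """Count the arguments verified by d and answer 'A', 'B' or 'S' (Silence)."""
    votes_A = sum(holds(f, d) for f in rules_A)
    votes_B = sum(holds(f, d) for f in rules_B)
    if votes_A + votes_B < min_args or abs(votes_A - votes_B) < min_gap:
        return "S"
    return "A" if votes_A > votes_B else "B"

ex_A = [(1, 1, 0), (1, 1, 1), (1, 1, 0), (0, 1, 0)]
ex_B = [(0, 0, 1), (1, 0, 1), (0, 0, 0), (0, 1, 1)]
rules_A = doublets_in_favour(ex_A, ex_B)
rules_B = doublets_in_favour(ex_B, ex_A)
print(decide((1, 1, 0), rules_A, rules_B), decide((0, 0, 1), rules_A, rules_B))
```

Calling the same function with the two example lists swapped yields the arguments in favour of B, which mirrors the symmetric roles of the two sets of thresholds in the text.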
IV. Assimilation

The notion of Assimilation is different from that of Discrimination studied above. The problem is no longer to acquire knowledge of what distinguishes two lists of examples, but rather to define a notion of similarity, or neighbourhood, with respect to one list of examples. Furthermore, in some problems it is not possible to give counter-examples. For instance, when predicting earthquake-prone areas, the dangerous areas (examples) are those which have contained the epicentre of a strong earthquake, so that the counter-examples (areas in which no earthquake will occur) are unknown from the start [2, 5].
Principle

In order to study the assimilation of objects to a list of examples, we shall define a notion of internal coherency of the description of the objects of that list. The similarity of an object to the list of examples will be the verification, by this object, of a certain number of constraints expressing this internal coherency.

The constraints used here are related to the notion of "point of view". Let A be a property; the point of view related to the attribute $d_i$ is defined as the prediction of the value of $d_i$, made by using the values of the other attributes. A correct prediction of $d_i$ will be taken as an argument in favour of the hypothesis "the object is assimilable to A".
In order to achieve this prediction, we shall use the discrimination algorithm. For that purpose, the examples of A are divided into two lists: those for which the attribute $d_i$ takes the value 1 and those for which it takes the value 0. The objects are then, of course, described by the other p-1 attributes. After the learning step, the result of the generalization on an object is a prediction of $d_i$. A point of view is thus related to two lists of rules and to a principle of management of these rules. Note that during the learning step a point of view may be declared irrelevant, when the learning of discrimination has failed (e.g. for lack of discrimination rules).
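The preparation of the two lists for a point of view can be sketched as below. The function name `split_by_attribute` is an assumption introduced for the illustration; descriptions are taken to be tuples of 0s and 1s.

```python
def split_by_attribute(examples, i):
    """Split the examples of A according to the value of attribute i.

    Each object is then described by the remaining p-1 attributes, so that
    the two lists can be handed to a discrimination algorithm whose task is
    to predict d_i from the other attributes.
    """
    ones, zeros = [], []
    for d in examples:
        reduced = d[:i] + d[i + 1:]  # description without attribute i
        (ones if d[i] == 1 else zeros).append(reduced)
    return ones, zeros

examples_A = [(1, 0, 1), (0, 1, 1), (1, 1, 0)]
print(split_by_attribute(examples_A, 0))
```

If one of the two lists comes out empty, no discrimination can be learnt for that attribute; treating such a point of view as irrelevant is one plausible reading of the failure case mentioned above.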
Internal coherency expressed through a set of points of view

In order to study the internal coherency of the description of an object, with respect to the hypothesis "the object is assimilable to A", we shall use a set of points of view (which generally corresponds to the set of attributes). The answer of each point of view will be:
J: Correct prediction (Justification)
C: Incorrect prediction (Contestation)
S: Impossible prediction (Silence).

An object x will be represented by a sequence of answers of the points of view, each taking its value in {J, C, S}: its image by the algorithm. An object ideally assimilable to A would have an image such as:

PV(x) = J J J ... J, with $d(x) = d_1(x)\, d_2(x)\, d_3(x) \ldots d_p(x)$

However, objects which are rarely contested and rarely silent will also be seen as assimilable to A. So an object y such that:

PV(y) = J J J C J C J J J S S J

could be considered as assimilable to A. This decision depends on tolerance thresholds for contestation and silence.
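A minimal sketch of the image PV(x) and of the tolerance test follows. The interface is assumed for the example: each point of view is represented by a hypothetical callable that receives the description without attribute i and returns 1, 0, or None when no prediction is possible, and the tolerance values are arbitrary.

```python
def image(d, predictors):
    """Return PV(d) as a string over the alphabet J, C, S.

    predictors[i] is a callable (assumed interface) predicting d[i] from
    the other attributes; it returns None when it cannot decide.
    """
    answers = []
    for i, predict in enumerate(predictors):
        guess = predict(d[:i] + d[i + 1:])
        if guess is None:
            answers.append("S")  # Silence
        elif guess == d[i]:
            answers.append("J")  # Justification
        else:
            answers.append("C")  # Contestation
    return "".join(answers)

def assimilable(pv, max_contest=0.2, max_silence=0.2):
    """Tolerance test: fractions of C and S must stay under the thresholds."""
    n = len(pv)
    return pv.count("C") <= max_contest * n and pv.count("S") <= max_silence * n

# The image given for the object y in the text: 2 contestations, 2 silences out of 12.
print(assimilable("JJJCJCJJJSSJ"), assimilable("JJJCJCJJJSSJ", max_contest=0.1))
```

With the illustrative thresholds of 0.2 the object y is accepted, while tightening the contestation tolerance to 0.1 rejects it, which shows how the decision depends on the chosen tolerances.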
Evaluation of the learning task

Let $L_A$ be the set of examples of A; we shall call $Gen(L_A)$ the set of objects assimilable to A when the learning task has been carried out on $L_A$. We shall consider the learning as allowable when $Gen(L_A) \neq \Omega$ and $Gen(L_A)$ includes $L_A$. These are extreme conditions. In practice, it may happen that some objects of $L_A$ are not assimilable to A without our considering that the learning has failed. Moreover, obtaining a large proportion of objects assimilable to A among all the considered objects is welcome.

When some objects not belonging to the learning set are known to satisfy, or not to satisfy, the property, the evaluation is straightforward. In particular, the decisions about counter-examples, if there are any, constitute a conclusive test.

When counter-examples are unknown, one can estimate the quality of the knowledge acquired on A with a set of logical tests:

a) Logical consistency of A: $Gen(Gen(L_A))$ has to be similar to $Gen(L_A)$.

b) Logical stability using contradiction: let $\neg Gen(L_A)$ be the set of non-assimilable objects. Normally, $\neg Gen(L_A) \cap L_A = \emptyset$. The test consists of comparing the two sets $Gen(L_A)$ and $Gen(\neg Gen(L_A))$, or $Gen(Gen(L_A))$ and $Gen(\neg Gen(L_A))$. If $Gen(L_A) \cap Gen(\neg Gen(L_A)) \neq \emptyset$, then there is a contradiction from a logical point of view. If $Gen(L_A) \cup Gen(\neg Gen(L_A)) \neq \Omega$, then the excluded middle axiom is not satisfied.

c) Logical coherence: let $X = Gen(L_A) - L_A$. $Gen(X) \cap L_A$ makes it possible to evaluate the coherence of assimilation to A. The ideal situation would be $Gen(X)$ including $L_A$.
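These tests reduce to set operations once Gen is available as a function from a set of examples to a set of assimilable objects. The sketch below assumes such a callable `gen` and a finite universe `omega`; `toy_gen` is a deliberately crude stand-in (Hamming distance at most 1 from an example) used only to make the example executable, and is not the generalization-by-points-of-view algorithm. Strict set equality is used where the text asks for similarity.

```python
from itertools import product

def logical_tests(gen, L_A, omega):
    """Evaluate the allowability condition and tests a), b), c) for a non-empty L_A."""
    G = gen(L_A)
    not_G = omega - G            # non-assimilable objects
    G_not = gen(not_G)
    X = G - L_A
    return {
        "allowable": G != omega and L_A <= G,
        "consistency": gen(G) == G,              # a) equality stands in for "similar"
        "contradiction": bool(G & G_not),        # b) non-empty intersection
        "excluded_middle": (G | G_not) == omega, # b) union covers the universe
        "coherence": len(gen(X) & L_A) / len(L_A),  # c) share of L_A recovered from X
    }

# Toy universe: all descriptions over 3 binary attributes.
OMEGA = frozenset(product((0, 1), repeat=3))

def toy_gen(examples):
    """Stand-in only: objects within Hamming distance 1 of some example."""
    return frozenset(x for x in OMEGA
                     if any(sum(a != b for a, b in zip(x, e)) <= 1 for e in examples))

print(logical_tests(toy_gen, frozenset({(1, 1, 1), (1, 1, 0)}), OMEGA))
```

On this toy data the stand-in generalizes too widely: the contradiction test fires and the consistency test fails, which is the kind of diagnosis the logical tests are meant to provide when no counter-examples are available.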
V. Prospects

Using the methodology presented, one can deal with different problems in various fields. These techniques lead, of course, to efficient classification procedures. Beyond that, their goal is directed towards information exchange with specialists and the acquisition of explicit knowledge. The way the learning method runs makes this dialogue possible, by allowing several modes of interpretation of the rules, for instance grouping synonyms. Looking ahead, this dialogue is set up by examining the space of considered objects and studying the properties of these objects.

Let us consider, for instance, question c) concerning the assimilation problem: how to modify the description of an object in such a way that the new description corresponds to an object assimilable to A?

Let x be an object and PV(x) its image by the algorithm; we shall say that x' is a neighbour of x if its description is obtained by modifying only one attribute of the description of x. Among all the neighbours we shall consider only those for which the modified attribute corresponds to a contestation. Modifying this attribute probably moves the object in the direction of the property A. The result PV(x') is then checked. This leads to a Piloting Principle: how to lead an object towards A or how to lead it away from A.

If x is a biological sequence and A a biological signal or a kind of activity, a neighbour corresponds to a mutant. Then, if x verifies A, we shall search for the mutants which preserve or strengthen the property. A selected mutant is then a possible sequence which may be checked in a biological experiment. This methodology thus offers an approach to the general problem of Plan Generation, and particularly to the Drug Design problem in the field of Molecular Biology.

REFERENCES
1. Bongard, M. (1970) in: Pattern Recognition. Spartan Books.
2. Cisternas, A., Godefroy, P., Gvishiani, A., Gorshkov, A.I., Kosobokov, V., Lambert, J., Rantzman, E., Sallantin, J., Soldano, H., Soloviev, A. & Weber, C. (1985) "A dual approach to recognition of earthquake prone areas in Western Alps", Ann. Geophysicae, 3,
3. Devijver, P.A. & Kittler, J. (1982) in: Pattern Recognition, a statistical approach. Prentice Hall International.
4. Duquesne, M. & Sallantin, J. (1983) "La reconnaissance des formes, un outil pour étudier la corrélation entre le pouvoir cancérogène d'un hydrocarbure polybenzénique et la forme de cette molécule". Proceedings of 19th Rencontres Internationales de Chimie Thérapeutique.
5. Gvishiani, A.D., Sallantin, J., Soldano, H., Cisternas, A. & Soloviev, A. (1984) "Results of French-Soviet research on recognition of high-risk earthquake areas in the Alps". Report to the Science Academy of the U.S.S.R., 275, 1353.
6. Michalski, R.S., Carbonell, J. & Mitchell, T. (1983) Machine Learning, Tioga Publishing Company, Palo Alto, California.
7. Quinqueton, J. & Sallantin, J. (1983) Algorithms for learning logical formulas. 8th IJCAI.
8. Quinqueton, J. & Sallantin, J. (1984) Généralisation par points de vue et apprentissage de concepts. Rapport de recherche INRIA, 1265,