P. R. Krishnaiah and L. N. Kanal, eds., Handbook of Statistics, Vol. 2 ©North-Holland Publishing Company (1982) 451-477
")1"~
z~K./
A Unifying Viewpoint on Pattern Recognition*
J. C. Simon, E. Backer and J. Sallentin
0. Introduction

Pattern Recognition is a burgeoning field of a rather unwieldy nature. Algorithms and 'ad hoc' techniques are proposed (or even independently discovered) in many applied fields and by scientists of different cultures: physicists, naturalists, doctors, engineers, and also mathematicians, statisticians or computer scientists. The same algorithms are often proposed (or 'discovered') under different names, and they are presented with different explanations or justifications. This paper is an effort to present under the same point of view a large class of PR algorithms, through an approach similar to the structural approach which was so successful in mathematics. Of course the Theory of Groups did not help anyone to perform additions or multiplications, but it helped to unify a very large class of operations which had been considered quite different. The presentation became simpler, and even new, powerful properties were shown, at least in mathematics. Without antagonizing anyone, we hope to show what is common to approaches to PR now considered quite different, such as the statistical approach, the fuzzy set approach, the clustering approach and some others.
1. Representations and interpretations
1.1. Notations and basic definitions

We would like to recall some conventions, as we have used them already [1].
Information

Information is a loose concept, made of two parts: a representation and one or more interpretations. Later we speak of couples or pairs of 'representation-interpretation'.

*An earlier version of this article has been published in the journal Signal Processing, Volume 2 (1980) pp. 5-22 under the title: "A Structural Approach of Pattern Recognition".
Representation

A representation is the material support of information: for example, a string of letters or bits, or the result of a measurement, such as an image. Let such a value be represented by a string of italic letters,

$$X = (x_1, x_2, \dots, x_n).$$
We represent a variable measurement by a string of roman letters, such as $x = (x_1, x_2, \dots, x_n)$.
The set of all defined X is called the Representation Space X.
Interpretation

A representation may have many interpretations:
- trivial: the nature of the element of rank j in the representation string;
- an identification: the 'name' of the object represented by X. Such an interpretation is the most frequent in Pattern Recognition (PR) for a representation of an object measured by some physical sensors;
- a property, such as a truth value or an assertion, a term, an expression;
- an action: a program is represented by a string of 'instructions', themselves translated into a string of bits, interpreted by a computer as actions;
- in practice, many others, as witnessed by man in everyday life...

Of course one may ask why we choose to call 'interpretation' the result of a process on a representation. We believe that in our field of PR, related to understanding and linguistics, it is more appropriate, as this frame of concepts was advocated quite early in linguistics [2]. We call the semantics of a representation the set of interpretations which may be found from this representation. Again we underline that this set may be infinite and/or ill defined. But after all, the set of properties of a mathematical object such as a number may also be infinite; to demonstrate a theorem we use only a finite set of properties. Thus 'information', which is the couple of a representation and of one (or more) interpretations, may be quite ill defined. This we see in everyday life: a representation is understood quite differently by different people, especially if they belong to different cultures.
Identification

Let $\Omega$ be the set of names, $\Omega = \{\omega_1, \omega_2, \dots, \omega_p\}$. An identification is a mapping E from the representation space into the set of names,

$$E : X \to \Omega,$$

written also as

$$E : (x_1, x_2, \dots, x_n) \to \omega.$$
An identification is the simplest interpretation of a representation, but many other interpretations may be considered, as we will see later on.
PR operators or programs

Of course such a mapping is only a mathematical description; it has to be implemented in a constructive way. A PR operator or algorithm effectively does the task of giving a name when the input data is a representation. The PR specialists look for such algorithms and implement them on computer systems. Finding such efficient PR algorithms and programs is the main goal of the Pattern Recognition field.
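By way of illustration, here is a minimal sketch (Python; the prototypes, the Euclidean dissimilarity and all names are hypothetical illustrations, not the authors' construction) of such a PR operator: a mapping E that gives a name $\omega$ to an input representation X by comparing it to one stored prototype per name.

```python
import numpy as np

def make_identifier(prototypes):
    """Build a PR operator E: X -> Omega from one prototype per name.

    `prototypes` maps each name omega to a representative vector.
    The dissimilarity used here is the Euclidean distance, one choice
    among many possible (dis)similarity measures.
    """
    names = list(prototypes)
    P = np.array([prototypes[w] for w in names], dtype=float)

    def E(x):
        x = np.asarray(x, dtype=float)
        d = np.linalg.norm(P - x, axis=1)  # dissimilarity to each prototype
        return names[int(np.argmin(d))]    # the most similar name wins

    return E

# Hypothetical two-name example.
E = make_identifier({"omega_1": [0.0, 0.0], "omega_2": [3.0, 4.0]})
print(E([0.5, 0.2]))  # -> omega_1
```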
Interpretations of an object

An identification is not the only interpretation looked for in the representation of an object:
(i) A feature is the result of a partial identification. For example, a phoneme in a spoken word, a segment in a letter, a contour or a texture in an image. Sometimes the term initial level feature is used instead of representation. It points out the fact that such representations are obtained through physical sensors from the outside universe. Thus a representation is already an interpretation of the outside world.
(ii) A fuzzy identification is sometimes preferred to the identification by yes or no. It may be defined as a 'multimapping' of X in $\Omega \times f$,

$$E_f : X \to \Omega \times f$$

or

$$E_f : (x_1, \dots, x_n) \to \{(\omega_i, f_i) \mid \forall i\}.$$
$f_i$ is a membership function, of real value in the interval [0, 1].
(iii) A class or cluster is a name given to a set of elements.
(iv) More generally, the symbolic description of a class may be:
- the most representative element, such as the center of gravity or the element closest to the center of gravity;
- the $n_j$ most representative elements, the skeleton for instance;
- a representation of the class, such as a linear manifold, a geometrical representation;
- a concept: "a mental representation of something perceived through the senses" (Britannica World Language Dictionary), usually represented by a sentence in a language;
- a statement, a logical proposition, an expression. These strings of symbols are generated by a syntax; they may themselves be interpreted as 'true' or 'false'.

The only way we use symbolic descriptions is through a similarity measure between the representation of the object and its interpretation (the symbolic description). As we will see, these similarity measures may be called by very different names, but in fact they play the same part in the determination of an identification. The aim of this paper is to study them and to show their common properties.
Similarity, distance

Let X, Y, Z be entities that we wish to compare. Note that they are not always of the same nature. Later on they may be taken as objects, classes or what we
called symbolic descriptions, expressions, operators, etc. Let $\succeq$ be an order relation on the set of couples, with the following interpretation: $(X,Y) \succeq (X,Z)$ means that X is more 'similar' to Y than to Z, or that the 'resemblance' is greater. More generally, this relation may be interpreted as a 'natural association'. A constructive procedure to build this order is to implement a similarity (or dissimilarity) measure, i.e. a real-valued function whose domain is the set of couples.

Similarity (resemblance)
$$\mu(X,X) = \sup \mu, \tag{1.1}$$
$$\mu(X,Y) = \mu(Y,X), \tag{1.2}$$
$$\mu(X,Y) \ge \mu(X,Z) \quad \text{is equivalent to} \quad (X,Y) \succeq (X,Z). \tag{1.3}$$

Dissimilarity (dissemblance)

$$\lambda(X,X) = \inf \lambda, \tag{1.4}$$
$$\lambda(X,Y) = \lambda(Y,X), \tag{1.5}$$
$$\lambda(X,Y) \le \lambda(X,Z) \quad \text{is equivalent to} \quad (X,Y) \succeq (X,Z). \tag{1.6}$$
Distance. A distance is a dissimilarity measure which also satisfies the 'triangle inequality':

$$d(X,Y) \le d(X,Z) + d(Z,Y), \tag{1.7}$$
$$d(X,Y) = 0 \quad \text{is equivalent to} \quad X = Y. \tag{1.8}$$

1.2. Remarks
(I) Usually an identification may be described as a multilevel process. Let us take the example of written word identification. From the initial representation level, for instance the pixel level of an image, a first group of interpretations is obtained. They result in the identification of a certain number of 'features', such as segments, curves, crossings, extremities, etc. From this level the letters are found; then, at another level, the word is identified from the letters. Thus, starting from a representation level, an identification process gives access to an interpretation level, which then becomes the new representation level. Such a scheme is more general than the 'historical' PR scheme: feature identification followed by a 'classifier'. It is now commonly said that image and speech recognition are such multilevel processes and may be described as an interactive, competitive system of procedures, either inductive (data driven) or deductive (concept driven) [3].
A unifying viewpoint on pattern recognition
455
(II) Any partial or intermediate identification process has to be implemented by a program, i.e. a combination of primitive operators. The problem which a PR specialist faces is to choose the appropriate operators and to combine them. Some are chosen for their simplicity, such as the linear operators, or because they pertain to some physical property of the problem, for instance the contour filters or the texture detectors of digital images, or the segmentation operators of speech. Usually the first level uses arithmetic operators, such as the filters in signal processing. The upper levels rely on syntactic or linguistic operators; in speech recognition these techniques are now used even at the lower levels [4]. However, at the lowest level a large class of operators is directly inspired by the properties of the representation space. They may be designated as the characteristic function processes [5]. In fact these functions are similarity measures, as we will see later.

(III) Many PR specialists like to oppose the statistical approach to the syntactical approach. In fact this distinction, which has a historical interest, does not seem justified now. Should we not say rather that the syntactical approach relies on the properties of a set of operators (which may be syntactic), and that the statistical approach relies on the properties of the representation space? In fact, in the so-called statistical approach it is customary that the statistical assumptions are not justified, and sometimes even ignored. Even if the setting up of a 'probability density' should be justified by the statistical properties of the PR problem, it is always used as a similarity function. As has been underlined by Diday and Simon [6], clustering, which is typically a lower level process, is determined only by the data of similarity functions. Of course the statistical properties and techniques may be very interesting to establish and justify these similarity measures. But as soon as these similarity measures are determined, the identification is also determined.
1.3. Properties of a representation space

The representation X of an object has been defined as a list of measurements. Let E be a finite set of m representations X. Most often it is implied that E is a sample of a more general infinite set X into which E is embedded. Intuitively it seems natural that any variable point x of X may be obtained by a measurement; hence this idea of an infinite representation space. But in certain instances it is not clear at all that such an infinite set is defined (attainable) everywhere or even exists. Most of the efforts of the statistical approach of PR are oriented towards a restitution of such a space X. In the first place, let us consider what we may assert on a finite set of m representations $X_j$ ($1 \le j \le m$). An order on the pairs of E may be given experimentally through a dissimilarity table; it is
assumed that such an order is a basic property of the data. Let $\lambda(j,h)$ be such a dissimilarity measure. It is clear that it does not always satisfy the triangle inequality (1.7).

PROPERTY. A dissimilarity relation being given on a finite set E, it is always possible to find a homeomorphic mapping of $\lambda$ in $\mathbb{R}$ such that the resulting dissimilarity measure is a distance and induces the same order on the set of pairs
of E. Such a homeomorphism should map $\inf \lambda$ onto 0 and should add a constant to all the values of $\lambda(j,h)$, such that (1.7) is verified for the whole table. Similar properties may be found for a similarity measure.

We deal now with the properties of an infinite representation space X, into which E is embedded. Two basic properties have to be examined for such a space: (a) is it a metric space? (b) does it have a density measure?

Topology and metric spaces

By referring to the work and language of mathematical topology we have a clear way to state the properties relevant to our problem; for a reference textbook see [7]. Let us recall some definitions.

Topology T on a set X. A non-empty collection of subsets of X, called open sets, satisfying four axioms (using the union, intersection and complementation operations).

Basis. A subcollection $\beta$ of open sets, such that any open set is a union of some open sets of $\beta$.

Countable basis. A basis $\beta$ is countable if the number of open sets of $\beta$ is countable.

Neighbourhood. A neighbourhood of a point p is a set N containing p and also some open set of X which contains p.

Hausdorff spaces. Every pair of distinct points has disjoint neighbourhoods.

Compact spaces. Every open cover has a finite subcover.

Many other concepts are defined and studied: basis of a topology, closed sets, interior, exterior, frontier, limit point, continuous or homeomorphic maps, etc. Note that no interesting property is obtained from a finite set X. However, constructively defining a collection of open sets is not an easy job. Usually it is done through the use of distances.

Metric spaces. Assuming that there exists a distance d(p,x) between any two points p and x of X, an r-neighbourhood is $N_p^r = \{x \in X \mid d(x,p) < r\}$. Such neighbourhoods allow one to form a collection of open sets. $\mathbb{R}$, $\mathbb{R}^n$ and Hilbert space are metric spaces. Every metric space is a Hausdorff space.

Metrizable spaces. A topological space is metrizable if there exists an injective mapping of the space into a metric space.

An important theorem bridges topological spaces in general and metric spaces: a compact Hausdorff space having a countable basis is metrizable.
One may question whether the Hausdorff condition is verified everywhere, especially with the finite precision of the measurements and of the computations. However it seems quite reasonable to give the Hausdorff quality to an infinite representation space, in other words to assume that two distinct points have separate neighbourhoods. Thus from now on we assume that a representation space X is a metric space. The problem is of course to find the metric. We have seen how an experimental table of dissimilarities on E may be transformed into a table of distances without changing the order on the couples of points. This table of distances provides a sampling of the general distance measure on X. The problem is to find a distance algorithm which agrees with the measured distance table. It is a generalization problem, as there exist so many in PR.
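As a concrete reading of the PROPERTY above, here is a minimal sketch (Python; the helper name and the brute-force search are our own assumptions) that maps $\inf \lambda$ onto 0 and adds a constant to a finite dissimilarity table so that the triangle inequality (1.7) holds, while the order on pairs is preserved.

```python
import numpy as np

def dissimilarity_to_distance(lam):
    """Shift a finite dissimilarity table into a distance table.

    lam: symmetric (m x m) table with the infimum on the diagonal.
    Subtracting inf(lam) maps the infimum onto 0; adding a constant c
    to every off-diagonal value then enforces the triangle inequality
    (1.7).  Both steps are monotone, so the order on pairs is kept.
    """
    lam = np.asarray(lam, dtype=float)
    m = len(lam)
    d = lam - lam.min()                    # map inf(lam) onto 0
    # d(i,j) <= d(i,k) + d(k,j) + c for all triples  =>  c >= worst violation.
    c = max((d[i, j] - d[i, k] - d[k, j]
             for i in range(m) for j in range(m) for k in range(m)
             if len({i, j, k}) == 3), default=0.0)
    d[~np.eye(m, dtype=bool)] += max(c, 0.0)
    return d

# Hypothetical table violating the triangle inequality (9 > 1 + 1).
lam = np.array([[0., 9., 1.], [9., 0., 1.], [1., 1., 0.]])
print(dissimilarity_to_distance(lam))
```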
Density measure

Let us call (intentionally) $\mu(X)$ a density measure at $X \in \mathbf{X}$, a function of X taking its values in $\mathbb{R}^+$. Many efforts are made in 'statistical' PR to build up such a density from an experimental distribution of representations X of objects. This density is used as a similarity measure between an object X and a class. These efforts are along two lines, according to whether the objects are labeled or not.

Labeled. (1) Probability densities are obtained through various statistical techniques (parametric) and also by interpolation techniques (non-parametric) [8]. (2) k-Nearest Neighbours (k-NN) [9, 10, 11]; note also the Shared k-NN approach of Jarvis [12] and the Mutual NN of Gowda [13]. (3) Potential or inertia functions [1]. (4) Fuzzy belonging [14]. All of these measures are interpreted and used as a similarity between an object and a concept, such as a class (also called aggregate, taxon, OTU (operational taxonomic unit), fuzzy set) [1].

Unlabeled. In clustering techniques the knowledge of the problem is usually given as some unlabeled density function, obtained by the potential or inertia function from the existing samples and the distance between points. It is assumed that this density is the sum of the densities pertaining to each class: $\mu(X) = \sum_i \mu_i(X)$ [1]. Some practical ways to obtain this density are to build up the minimum spanning tree (MST) or the k-NN graph. The clustering techniques use this knowledge to build up the clusters from the regions of high density. It is important to note that the algorithms to obtain these densities use the distance between points, i.e. the property that the representation space is a metric space.

Remark

It should be clear that the hypothesis that a representation space is a metric space is a 'strong' one, even if it is the most frequent. Some finite data cannot be
embedded in such an infinite space. We should be careful about the experimental validity of our data. For instance, if the $x_i$ take only one among a few values (binary for example), it is not legitimate to extend them in $\mathbb{R}$: they are 'qualitative' values. On the other hand, though measured amplitudes, such as grey levels in an image, are always finite in number, according to the precision of the sensors, it is legitimate to extend them in $\mathbb{R}$, thus considering them as 'quantitative' values.
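Returning to the density measures of this subsection, the sketch below (Python; the Euclidean distance and the sample are hypothetical) illustrates the k-NN technique listed under 'Labeled': the density at X is taken inversely proportional to the volume of the smallest ball around X containing its k nearest sample points.

```python
import numpy as np

def knn_density(sample, x, k=5):
    """k-NN density estimate at x (a rough sketch, not a tuned estimator).

    The density is taken inversely proportional to the volume of the
    smallest ball around x containing k sample points; the surrogate
    1 / r_k**n is used, n being the dimension of the space.
    """
    sample = np.asarray(sample, dtype=float)
    n = sample.shape[1]
    r_k = np.sort(np.linalg.norm(sample - x, axis=1))[k - 1]
    return 1.0 / (r_k ** n + 1e-12)  # epsilon guards against r_k == 0

# Hypothetical sample: the density is higher inside the cloud of points.
rng = np.random.default_rng(0)
E = rng.normal(size=(100, 2))
print(knn_density(E, np.zeros(2)) > knn_density(E, np.full(2, 5.0)))  # True
```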
1.4. Interpretation spaces and their internal laws
We first examine the interpretation spaces directly deduced from the representation space.
The representation space is finite

Let us assume that the representation space is the finite set E. The first example of an interpretation space is the set $P = 2^E$ of the subsets of E. To 'classify' is to separate E into k disjoint subsets $C_i$, with $E = \cup_i C_i$. The basic operations on such a finite (but exponential) set P are the union $\cup$, the intersection $\cap$ and the complementation $\complement$. Under these laws, the elements of P form a distributive lattice or algebra. Let $L, L' \in P$. The lattice relations ($L \cup L'$, $L \cap L'$, $L \Delta L'$) may be represented by Fig. 1. A class such as $C_i$ may be obtained from the elements $X_i$ of E by a succession of union operations.

Hierarchies. Hierarchies are a special set H of subsets of E, such that if $L, L' \in H \subset P$, then either $L \cap L' = \emptyset$, or $L \cap L' \ne \emptyset$ and $L \subseteq L'$ or $L' \subseteq L$. Hierarchies may be represented by trees. We later outline how hierarchies are obtained from E and a distance on the elements of E.

The representation space is a metric space

Again the interpretation space P is the set of subsets of X. The basic operations are $\cup$, $\cap$, $\complement$, as for finite sets, but now an infinite number of operations may be considered. The symmetric difference $\Delta$ is also utilized: $A \Delta B = (A \cap \complement B) \cup (B \cap \complement A)$.
These laws on P make a distributive lattice of this set, sometimes also called a semi-ring or $\sigma$-algebra. Distributivity means that

$$A \cap (B \cup C) = (A \cap B) \cup (A \cap C). \tag{1.9}$$
A similar relation would be obtained by replacing $\cup$ with $\Delta$. Hierarchies are special subsets of P, with the same properties as for the finite space P. Fuzzy sets are built up on P. Apart from questions of language and terminology, the algorithms that they suggest do not seem different from the usual ones. We will come back to this question in the following paragraph.
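The lattice laws just described can be checked directly on toy sets; the following sketch (Python, with arbitrary toy subsets) verifies distributivity (1.9) and shows that, unlike union and intersection, the symmetric difference is not idempotent, a point Subsection 2.1 returns to.

```python
# Toy subsets of a finite E; any sets would do.
A, B, C = frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({3, 5})

# Distributivity (1.9): intersection distributes over union.
assert A & (B | C) == (A & B) | (A & C)

# The symmetric difference, built from union, intersection, complement.
assert A ^ B == (A | B) - (A & B)

# Union and intersection are idempotent; the symmetric difference is not.
assert A | A == A and A & A == A
assert A ^ A == frozenset()  # A delta A is the empty set
```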
Languages as interpretation spaces

We have looked at the interpretation spaces deduced directly from the representation space, i.e. the set of subsets which, equipped with operations, becomes a structure (the representation space being E or X). But such interpretations are not the only possible ones. The interpretation of an object may also be a sentence in a language. Let us outline different occurrences of such interpretations.

Terms. We will call a term a sentence naming an object. The set of terms is a generalization of the set $\Omega$ of names. The languages of terms are regular, i.e. may be recognized by a finite automaton [15].

Logical expressions. The languages of different logics have been well defined; for instance cf. [15]. A sentence of such a language is called an expression. It is formed with terms and formulas. The formulas are built up with logical connectives, according to a syntax and a number of axioms. An essential point of these logical languages is that an expression may be interpreted as 'true' or 'false'. In the classical sentential logic the basic connectives are and $\wedge$ and or $\vee$, but many others have been proposed, with different syntax and semantics. They have allowed other logics to be proposed, such as the predicate logics (extensions of the classical sentential logic), the modal, the intuitionistic, the fuzzy and the quantum logics. Let us simply point out that the expressions of sentential logic form an algebra, called a Boolean algebra, isomorphic to an algebra of sets; thus we obtain the same structure as that of a distributive lattice, where the connectives 'and' and 'or' play respectively the same roles as the operations 'intersection' and 'union' for sets [15]. By suppressing some axioms and introducing new connectives, other logics are formed, in which the distributivity of the lattice structure is no longer certain.

Natural languages. A sentence in natural language may be considered as a proposition or predicate on an object (of course not always...) [16]. Human beings use the sentences of natural language as the interpretation space of their
perceptions of the world of objects. Of course the formalization is a lot more difficult. The languages of the different logics have in fact been proposed to model natural language.

Operators, programs, algorithms. A programming language is a set of instructions with syntactic rules to form the proper sentences of the language. Such a sentence, i.e. a program, is interpreted by a computer as actions. Usually the domain is formalized as 'recursive functions'; in PR the domain of primitive recursive functions is of main interest [17]. The basic operations are concatenation, composition and recursion. The term operator denotes that these algorithmic functions may be implemented by machines. Instead of speaking of the interpretation of an object, it is preferable in this context to speak of relevance, or interest for a recognition. An application of these ideas will be examined in connection with Information Measures.
2. Laws and uses of similarity
Similarity and dissimilarity measures and distances have been defined, and some examples have been given. It was advanced that these measures form the basis for an interpretation at the first levels of an identification. We will now show how the structure of the interpretation space induces laws on the similarity or dissimilarity measures.

2.1. Laws and structure of a similarity
Subsections 1.3 and 1.4 gave the properties of representation spaces and of interpretation spaces. The first group of properties may be called 'data driven': they come from the structure of the representation space and from the knowledge of a similarity or dissimilarity measure (later on we simply call it a measure). The second group of properties may be called 'concept driven': they pertain to the interpretation space. The central issue of PR is how a problem can be translated into concepts, in other words how to find an appropriate interpretation space related to the problem and the set of operators which allow one to pass from the representation to the interpretation. However, in this paper we are concerned with another fundamental issue: how should the data driven representation structure be related, in general, to the concept driven interpretation structure? We will show that most of the time a homomorphism exists between the measure and the interpretation space. Let us designate by f(X; A) a similarity or a dissimilarity, where X is a variable element (a point) of the representation space and A is an element of the structure of the interpretation space. We now come to the main object of this paper, answering the following question: knowing f(X; A) and f(X; B), what are the values of $f(X; A \cup B)$ and $f(X; A \cap B)$?
[Fig. 2: a PR problem is translated, by data modelling, into a representation space carrying a data driven structure and, by concept formalization, into an interpretation space carrying a concept driven structure; the question is whether a homomorphism links the two structures.]
The answer generally found by the users is to use on f two homomorphisms, i.e. to find two laws on f such that $\oplus$ is an additive law, homomorphic to $\cup$:

$$f(X; A \cup B) = f(X; A) \oplus f(X; B), \tag{2.1}$$

and $*$ is a multiplicative law, homomorphic to $\cap$:

$$f(X; A \cap B) = f(X; A) * f(X; B). \tag{2.2}$$
If necessary, the multiplicative law is distributive with respect to the additive law. This will be true if the structure of the interpretation is itself distributive. Fig. 2 illustrates the above viewpoints. We will see that for most PR problems there exists a homomorphic correspondence between the concept driven structure and the data driven structure, in other words between the interpretation and the knowledge given by the similarity or dissimilarity measure.
The range of f

f takes its values in the real domain $\mathbb{R}$, but its interval of variation or range R is usually a part of $\mathbb{R}$. For example: {0, 1}, two values only, the range of the characteristic function of a set; [0, 1], the range of probability and of fuzzy belonging; $\mathbb{R}^+$, the range of distances; etc.

Semi ring

Let us now recall the definition of a semi ring. It is a structure Z on a set (range) R,

$$Z = \langle R, \oplus, *, 0, 1 \rangle.$$
$\oplus$ is an associative law, called addition; 0 is the identity element of this law: $a \oplus 0 = a$.
$*$ is an associative law, called multiplication; 1 is the identity of this law: $a * 1 = a$.
0 is an 'absorbing' element: $a * 0 = 0 * a = 0$.
$*$ is distributive with respect to $\oplus$: $a * (b \oplus c) = (a * b) \oplus (a * c)$.

Most of the time each law is commutative, but this is not always necessary. Thus, with the two laws induced by the homomorphisms of the interpretation space, f most of the time has the structure of a topologic semi ring: 'topologic' because of the topology of R. Of course, if the structure of the interpretation space is such that only one law is considered, the structure on f has only one law; it is a topologic semi group. For example, hierarchies consider only union; the corresponding law is unique (but may be anything between INF and SUP, as we will see later).
Examples of semi rings

Let us give some examples of semi rings:

$Z_1 = \langle \{0,1\}, \oplus, \times, 0, 1 \rangle$,
$Z_2 = \langle \{0,1\}, \mathrm{SUP}, \mathrm{INF}, 0, 1 \rangle$,
$Z_3 = \langle [0,1], \mathrm{SUP}, \mathrm{INF}, 0, 1 \rangle$,
$Z_4 = \langle [0,1], \mathrm{SUP}, \times, 0, 1 \rangle$,
$Z_5 = \langle [0,1], x + y - xy, \times, 0, 1 \rangle$,
$Z_6 = \langle \bar{\mathbb{R}}^+, \mathrm{INF}, +, +\infty, 0 \rangle$, with $\bar{\mathbb{R}}^+ = \mathbb{R}^+ \cup \{+\infty\}$,
$Z_7 = \langle \bar{\mathbb{R}}^+, \mathrm{SUP}, \mathrm{INF}, 0, +\infty \rangle$,
$Z_8 = \langle \bar{\mathbb{R}}^+, +, \times, 0, 1 \rangle$,
$Z_9 = \langle \dots \rangle$.

We will see their use on similarity measures. Any of the first or second laws may generate a semi group.
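To make the structure concrete, here is a minimal sketch (Python; the SemiRing helper is our own illustration, not part of the text) encoding $Z_3$ and $Z_5$ as pairs of laws with their identities, and checking the axioms listed above together with the idempotence question discussed next.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class SemiRing:
    """Z = <R, add, mul, zero, one> as in Subsection 2.1."""
    add: Callable[[float, float], float]
    mul: Callable[[float, float], float]
    zero: float
    one: float

Z3 = SemiRing(add=max, mul=min, zero=0.0, one=1.0)         # SUP / INF on [0, 1]
Z5 = SemiRing(add=lambda x, y: x + y - x * y,              # probabilistic sum
              mul=lambda x, y: x * y, zero=0.0, one=1.0)

a, b, c = 0.2, 0.5, 0.7
for Z in (Z3, Z5):
    assert Z.add(a, Z.zero) == a and Z.mul(a, Z.one) == a  # identity elements
    assert Z.mul(a, Z.zero) == Z.zero                      # 0 is absorbing

# Z3's lattice laws are distributive: INF(a, SUP(b,c)) = SUP(INF(a,b), INF(a,c)).
assert Z3.mul(a, Z3.add(b, c)) == Z3.add(Z3.mul(a, b), Z3.mul(a, c))

# Idempotence separates the two: SUP(a, a) = a, while a + a - a*a != a in general.
assert Z3.add(a, a) == a and Z5.add(a, a) != a
```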
Remark

The difference between a group and a semi group comes from the fact that no inverse is defined for an element x: no $-x$ for addition, no $x^{-1}$ for multiplication, as would always be defined in a group. We will show that this has some important consequences.
Idempotent operations

An operation is called idempotent if, applied to the same element, it gives as a result the element itself. For instance, if $\cup$ and $\cap$ are respectively the set
operations union and intersection,

$$A \cup A = A \quad \text{and} \quad A \cap A = A;$$

the two operations are idempotent. If $\Delta$ is the symmetric difference, $A \Delta A = \emptyset$; $\Delta$ is not idempotent. Let us suppose that (2.1) and/or (2.2) are satisfied; if $\odot$ is the operation on f,
$$f(X; A) \odot f(X; A) = f(X; A); \tag{2.3}$$

the homomorphism implies that the operation on f is also idempotent. Then we may consider two possibilities:

(a) The law on f has an inverse; it is a group. Then $f - f = 0$ or $f \times f^{-1} = 1$, so $f \odot f = f$ is realized only by the identity elements 0 or 1. The only possible semi ring is $Z_1$ or $Z_2$.

(b) The law on f has no inverse; it is a semi group. An example of this is INF or SUP, which are idempotent for any element of R: INF(A; A) = A and SUP(A; A) = A. As we see later, this explains the interest of these operations, introduced with the fuzzy set concept. Some other semi group laws may also be considered, of course; we may characterize them by their idempotent properties.

An alternative to (a) is to use other semi rings but to forbid the use of idempotence. Let us take the example of probability density. Then $Z_5$ is utilised, but with the hypothesis of independence between elements, which excludes an idempotent operation:
$$p(X; A \cap B) = p(X; A) \times p(X; B), \tag{2.4}$$
$$p(X; A \cup B) = p(X; A) + p(X; B) - p(X; A \cap B). \tag{2.5}$$
Similar operations may be performed on a probability P or an information J. If two events are independent in probability,

$$P(A \cap B) = P(A) \times P(B). \tag{2.6}$$

The information given by their simultaneous realisation is usually assumed to be

$$J(A \cap B) = J(A) + J(B). \tag{2.7}$$

The range of J is $\mathbb{R}^+$. The semi ring may be $Z_6$ [18].
Let us come back to the range [0, 1]; other operations may be taken. Let $\mathcal{K}$ be a law such that

$$\mathcal{K}(A; B) \le \mathrm{INF}(A; B). \tag{2.8}$$

$\mathcal{K}$ is a contracting law, and formulas similar to (2.4) and (2.5) may be used:

$$\text{for intersection: } \mathcal{K}(A; B); \quad \text{for union: } 1 - \mathcal{K}(1 - A; 1 - B). \tag{2.9}$$

EXAMPLE. $\mathcal{K}$ is INF. Then $1 - \mathcal{K}(1 - A; 1 - B)$ is nothing but SUP.

2.2. Application to clustering and first level recognition
As has been underlined, the first level recognition (between the representation space, or initial level after the sensors, and the interpretation space of the first level 'features') uses similarity or distance measures, sometimes called 'characteristic functions', with the varieties of probability and fuzzy belonging. The above frame allows us to unify the different techniques under a common structural point of view.

2.2.1. Hierarchies

The representation space is either a finite set E or a metric space X. A hierarchy is a finite set H belonging to the set of parts of E or X, with the following conditions:

$$\{X\} \in H \subset P; \quad E \text{ (or } \mathbf{X}\text{)} \in H; \quad \text{if } h_i, h_j \in H, \text{ then either } h_i \cap h_j = \emptyset, \text{ or } h_i \cap h_j \ne \emptyset \text{ and either } h_i \supseteq h_j \text{ or } h_i \subseteq h_j. \tag{2.10}$$
A hierarchy is a semi lattice, even more a tree, in which the elements are obtained by the operation $\cup$. Two elements $h_i, h_j$ of H being given, there always exists an element h called the least upper bound (l.u.b.), the smallest set with the property $h_i \subseteq h$ and $h_j \subseteq h$. If $h_i \cap h_j \ne \emptyset$, it is clear that $h_i = h$ or $h_j = h$.

Ultrametric distances. Let us define measures compatible with the above structure. If $h_i \cap h_j \ne \emptyset$, $h_i$ and $h_j$ are elements in a chain ordered by $\subseteq$:

$$X \subseteq h_1 \subseteq \cdots \subseteq h_i \subseteq \cdots \subseteq h_j. \tag{2.11}$$
h"
X1
X2
X3
Fig. 3
Let $\lambda$ be a measure on such chains:

$$\lambda(X) = 0 < \lambda(h_1) \le \cdots \le \lambda(h_i) \le \cdots \le \lambda(h_j). \tag{2.12}$$

Suppose now $h_i \cap h_j = \emptyset$, and let h be the l.u.b. of $h_i$ and of $h_j$; then $\lambda(h_i), \lambda(h_j) < \lambda(h)$. Let X and X' be leaves of the hierarchy (tree). The distance between X and X' is $\delta(X,X') = \lambda(h)$, where h is the l.u.b. of X and X'. Let $X_1, X_2, X_3$ be three leaves of the hierarchy, h be the l.u.b. of $X_1, X_2$, and h' be the l.u.b. of $X_1, X_3$ and of $X_2, X_3$; then

$$\delta(X_1, X_2) < \delta(X_2, X_3) = \delta(X_1, X_3). \tag{2.13}$$

From (2.13) every triangle $X_1, X_2, X_3$ is isosceles, with a base smaller than the two equal sides (see Fig. 3). It is shown that such a proposition is equivalent to the following relation, proper to ultrametric distances:

$$\delta(i,j) \le \mathrm{SUP}[\delta(i,k), \delta(j,k)]. \tag{2.14}$$
($\delta(i,j)$ is an abbreviation of $\delta(X_i, X_j)$.) The problem faced by the builder of a hierarchy is precisely to find $\delta$ from the data of an ordinary distance d(X,X'). Only such an ultrametric measure will satisfy the structure of the interpreting hierarchy.

Indexed hierarchy. Any h of H is an equivalence class on E or X such that $\delta(X,X') = \lambda(h)$ for all X, X' belonging to h. Starting from elements X of E or from disjoint classes $C_i$ of X, the operation union (or symmetric difference) allows one to build up a tree on which may exist a measure $\lambda$ such that, if $h = h_i \cup h_j = h_i \Delta h_j = \mathrm{l.u.b.}(h_i, h_j)$,

$$\lambda(h) > \mathrm{SUP}[\lambda(h_i), \lambda(h_j)]. \tag{2.15}$$

This relation, applied to three elements X or classes $C_i$, yields (2.14). Such a structure is called an indexed hierarchy (see Fig. 4). The Lance and Williams algorithm is a technique to build H and $\lambda(h)$ simultaneously.
[Fig. 4: an indexed hierarchy: a tree over leaves X, with the index λ(h) attached to each node h.]
Two elements $h_i$ and $h_j$ are united in h according to a criterion. A generalised distance D(i,j) is computed when, in the course of the process, $h_i$ and $h_j$ become leaves of the tree. $h_i$ and $h_j$ are united if D(i,j) is minimum, and $\lambda(h) = D(i,j)$; then (2.15) is obeyed. To compute the current D(i,j), a new h being formed, D(h,h') has to be computed for every leaf h' of the tree. Various operations may be utilised for the computation of D(i,j):
- INF corresponds to single linkage,
- MEAN corresponds to average linkage,
- SUP corresponds to complete linkage.
In fact any operation which gives a result superior or equal to INF may be utilised; otherwise the relation (2.15) would not hold, as is easy to verify. A minimal sketch of this scheme is given below.
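The sketch (Python; the toy distance table and the brute-force search are our own assumptions, not an optimized implementation) unites, at each step, the pair with minimum generalised distance D(i,j), records $\lambda(h) = D(i,j)$, and derives the new distances from the children's distances via the chosen operation (INF, MEAN or SUP).

```python
import numpy as np

def lance_williams_tree(d, law=max):
    """Agglomerative construction of an indexed hierarchy (a sketch).

    d: (m x m) symmetric distance table on the leaves.
    law: how D(h, h') is formed from the two merged children's distances;
         min -> single linkage, max -> complete linkage,
         lambda a, b: (a + b) / 2 -> (unweighted) average linkage.
    Returns the merges as (cluster_i, cluster_j, lambda(h)) triples.
    Any `law` giving a result >= min keeps the index non-decreasing, cf. (2.15).
    """
    D = np.asarray(d, dtype=float).copy()
    np.fill_diagonal(D, np.inf)
    active = list(range(len(D)))
    merges = []
    while len(active) > 1:
        # Unite the pair h_i, h_j with minimum generalised distance D(i, j).
        sub = D[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = active[a], active[b]
        merges.append((i, j, D[i, j]))          # lambda(h) = D(i, j)
        # New distances D(h, h') computed from the children, via `law`.
        for k in active:
            if k not in (i, j):
                D[i, k] = D[k, i] = law(D[i, k], D[j, k])
        active.remove(j)                        # cluster j is absorbed into i
    return merges

# Hypothetical 4-point distance table.
d = [[0, 1, 4, 5], [1, 0, 3, 6], [4, 3, 0, 2], [5, 6, 2, 0]]
print(lance_williams_tree(d, law=max))          # complete linkage
```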
Remark. It appears clearly now that the necessity of an ultrametric distance comes from the interpretation structure; an ordinary distance would not respect the homomorphism.

2.2.2. Adaptive partitions

Hierarchy formation may be seen as an inductive process (data driven). The knowledge of the problem is essentially given by the distances d(X,X'); the results depend on the technique used to compute an ultrametric distance. Adaptive partition techniques, such as the "Dynamic Cluster Algorithm" of Diday [19], work in a different way. There the number of disjoint classes is chosen and a criterion is minimized. Though the classes are not known a priori, this technique may be considered as more 'concept driven' [1]. Let

$$C = (C_1, \dots, C_i, \dots, C_k) \tag{2.16}$$

be the k disjoint classes, and

$$A = (A_1, \dots, A_i, \dots, A_k) \tag{2.17}$$

be the k corresponding 'kernels' or symbolic descriptions.

(I) The representation set is the finite set E.
From the distance d(X,X') are usually deduced:
- a distance between X and $A_i$,

$$D(X, A_i) = \sum_{X' \in A_i} d(X, X')\,\mu(X'); \tag{2.18}$$

- a distance between $A_i$ and $C_i$,

$$R(A_i, C_i) = \sum_{X \in C_i} D(X, A_i)\,\mu(X). \tag{2.19}$$
D and R may be considered as inertia measures.

(II) The representation space is a metric space X. (2.18) and (2.19) may be extended to the continuous problem [20]; they become the usual inertia formulas.

The basic operations performed on the interpretation spaces (C and A are respectively elements of these spaces) are union and difference. The usual ring on the measure is $Z_8$, but other semi ring laws may be considered, such as SUP (MAX) or INF (MIN). As an example see the work of Hansen and Delattre [21].

2.2.3. Strong and weak patterns

Usually in 'classification' the classes are defined first, and thus the interpretation structure; then we look for a data-driven structure homomorphic to the concept-driven structure. On the contrary, cluster analysis goes the other way round: the interpretation of the problem is inferred automatically from a data driven structure. The laws in the representation domain infer the laws of the interpretation domain. Irrespective of the procedures, almost all clustering algorithms provide some partitioning of the data on the basis of some (dis)similarity measure and a criterion which has to be optimized. In any case the final data-driven structure is such that the intersections of the final clusters are empty; the concept-driven structure of the interpretations is one of disjoint subsets. The idea of 'strong and weak patterns', like the fuzzy idea, is a technique to obtain a data-driven structure in which the intersections of the interpretation sets are not empty. Let us recall this idea [6, 22]. Suppose that, by some clustering method, different stable optimized partitions may be obtained, either by changing the thresholds or the initial conditions. These different partitions $C^{(1)}, \dots, C^{(q)}$ not only allow one to learn new facts about the problem but also to define another interpretation structure. Let
$$\Pi = C^{(1)} \wedge \cdots \wedge C^{(q)} \tag{2.20}$$

be the 'cross partition'. If X and X' are classified together in all the $C^{(i)}$ ($1 \le i \le q$), they belong to a strong pattern; if they are classified together in only p of the q partitions, they belong to a weak pattern.
Of course p = q for a class of $\Pi$. If the classes of the $C^{(i)}$ are k in number, the number of classes of $\Pi$ may be much larger, thus detailing or 'refining' the interpretation structure. As in fuzzy belonging, the relation of an element X to a concept (here a class) may be estimated as a number between 0 and 1. On the other hand, a natural way to infer concepts when the intersections of subsets are not empty is to use laws of similarity which are homomorphic to a distributive lattice on the set of subsets $L_i$. These approaches [14, 23, 24] have in common a multimapping

$$E : X \to L_i \times f, \tag{2.21}$$

where the range of f is [0, 1] and the interpretations form a distributive lattice.
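As an illustration of the cross partition (2.20), here is a minimal sketch (Python; the label vectors are hypothetical) that intersects q partitions and reads off the resulting classes, the strong patterns being the groups stable across all partitions.

```python
from collections import defaultdict

def cross_partition(partitions):
    """Intersect q partitions, each given as a list of class labels per element.

    Elements sharing the same label in every partition fall in the same
    class of the cross partition Pi = C(1) ^ ... ^ C(q), cf. (2.20);
    such groups are the 'strong patterns'.
    """
    classes = defaultdict(list)
    for idx, labels in enumerate(zip(*partitions)):
        classes[labels].append(idx)   # key = tuple of labels across partitions
    return list(classes.values())

# Hypothetical: two clusterings of six elements (labels are arbitrary).
C1 = [0, 0, 0, 1, 1, 1]
C2 = [0, 0, 1, 1, 1, 0]
print(cross_partition([C1, C2]))      # -> [[0, 1], [2], [3, 4], [5]]
```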
2.3. Probability and fuzzy sets
As we have repeatedly claimed, we wish to show that from the algorithmic point of view there is no deep difference between the fuzzy set approach and the usual probability density approach; the axiomatic difference bears on the laws of the similarity measure. We will examine in parallel the homomorphic correspondences usually found, in both the probability and the fuzzy points of view.
The use of INF, SUP, ×

Let $\mathbb{L}$ be a set of conceptual entities L, with a distributive lattice determined on $\mathbb{L}$ by union $\cup$ (or symmetric difference $\Delta$) and intersection $\cap$. Let f be a measure on the range [0, 1] and $\mathcal{K}$ be one of the laws on f: INF, SUP, ×. Let a, b, c be X or L. The following formulas satisfy the homomorphism:

$$f(a; b \cap c) = \mathcal{K}[f(a; b), f(a; c)], \tag{2.22}$$
$$f(a; b \cup c) = 1 - \mathcal{K}[(1 - f(a; b)), (1 - f(a; c))]. \tag{2.23}$$
Such formulas are currently used in fuzzy set formulations for object-concept similarity, for concept-concept similarity and object-object similarity [30, 31, 32].
The use of the ring Z5

As we have seen, this ring, which uses + and ×, allows one to remain in the range [0, 1]. It is used for probability with the hypothesis of independence, made necessary by the problems of idempotence (see Subsection 2.1). It has also sometimes been used for fuzzy sets. Let us give examples from both domains.
In probability

Object-concept:

$$p(X; L \cap L') = p(X; L) \times p(X; L'), \tag{2.24}$$
$$p(X; L \cup L') = p(X; L) + p(X; L') - p(X; L) \times p(X; L'). \tag{2.25}$$

Concept-concept:

$$P(L, L') = \frac{1}{m} \sum_X p(X; L) \times p(X; L'). \tag{2.26}$$

Object-object:

$$p(X, X') = 1 - \sum_{i=1}^{n} p(X; L_i) \times p(X'; L_i). \tag{2.27}$$

Bayes probability of error for two classes:

$$r(L, L') = \int_{\mathbf{X}} p(X; L \cap L')\,p(X)\,\mathrm{d}X. \tag{2.28}$$

Bhattacharyya coefficient for two classes:

$$b(L, L') = \int_{\mathbf{X}} \sqrt{p(X; L)\,p(X; L')}\,p(X)\,\mathrm{d}X. \tag{2.29}$$
Similar formulas are used in fuzzy set formulations; for examples see [14, Chapter 2].

The use of the semi rings with idempotent operations

The interest of semi rings such as $Z_3$, $Z_4$, $Z_6$, $Z_7$ and $Z_9$ is that idempotent operations are possible for any f(X). This seems to be the main interest of the fuzzy set measures of similarity, which have extensively used the semi rings $Z_3$, $Z_4$. Let us give some examples.

For fuzzy sets
Object-concept:

$$f(X; L \cap L') = \mathrm{INF}[f(X; L), f(X; L')], \tag{2.30}$$
$$f(X; L \cup L') = \mathrm{SUP}[f(X; L), f(X; L')]. \tag{2.31}$$
Concept-concept:

$$F(L, L') = \frac{1}{m} \sum_X \mathrm{INF}[f(X; L), f(X; L')]. \tag{2.32}$$

Object-object:

$$\mu(X, X') = \frac{1}{2} \sum_{i=1}^{m} \big(\mathrm{SUP}[f(X; L_i), f(X'; L_i)] - \mathrm{INF}[f(X; L_i), f(X'; L_i)]\big). \tag{2.33}$$

Remarks

(1) The operation SUP is often interchanged with the 'average' $(1/N)\Sigma$. However the average is not an associative operation, and some care should be taken to preserve the homomorphic properties.

(2) Note that in probability the conditional Bayes error for two classes is written as

$$f(X; L \cap L') = \mathrm{INF}[p(X; L), p(X; L')]. \tag{2.34}$$
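The contrast between the two families of laws can be shown in a few lines (Python; the membership values are hypothetical): the fuzzy combination (2.30)-(2.31) is idempotent, while the probabilistic combination (2.24)-(2.25) is not, which is why independence is required on that side.

```python
f_L, f_Lp = 0.6, 0.3     # memberships / densities of X in L and L' (hypothetical)

# Fuzzy laws (2.30)-(2.31): semi ring Z3, idempotent.
fuzzy_and = min(f_L, f_Lp)            # f(X; L n L') -> 0.3
fuzzy_or  = max(f_L, f_Lp)            # f(X; L u L') -> 0.6

# Probabilistic laws (2.24)-(2.25): semi ring Z5, assumes independence.
prob_and = f_L * f_Lp                 # p(X; L n L') -> 0.18
prob_or  = f_L + f_Lp - f_L * f_Lp    # p(X; L u L') -> ~0.72

# Idempotence holds for Z3 but not for Z5, which is why independence
# (forbidding L n L = L) is needed on the probabilistic side.
assert min(f_L, f_L) == f_L
assert f_L * f_L != f_L
```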
2.4. Information and feature evaluation

The measures of information

An important amount of work has been done to estimate 'the information of an event'. Sometimes the information is considered as directly deduced from the probability measure through the formula

$$J(A) = -\log P(A). \tag{2.35}$$
Then of course (2.7) is obtained. But efforts have been made to define an information measure directly [25, 26]. Let p and q be logical statements. If J(p) and J(q) are known, what are $J(p \vee q)$ and $J(p \wedge q)$? In general one may assume that

$$J(p \vee q) = F[J(p), J(q)], \tag{2.36}$$
$$J(p \wedge q) = H[J(p), J(q)]. \tag{2.37}$$

Kampé de Fériet [18] studies F. For instance he shows that if p implies q, then $J(p) \ge J(q)$ and

$$J(p \vee q) \le \mathrm{INF}[J(p), J(q)]. \tag{2.38}$$
Van der Pyl [27] studies H and proposes to use, instead of (2.7),

$$J(p \wedge q) = J(p) + J(q) + k\,J(p)\,J(q). \tag{2.39}$$

(2.35) implies (2.7) and thus k = 0.
Feature evaluations

Many measures have been proposed for the evaluation of features, or rather, following [17], for the evaluation of the PR operators which detect the existence of the features: Mutual Information, Quadratic Entropy, Bhattacharyya and Bayesian distances, Patrick-Fisher, Bayes error, Kolmogorov's, etc.; for examples see [28]. Among these, Mutual Information has some interesting properties. Let A, B be two operators to be evaluated, and let $\Omega$ be the 'ideal operator' given by the training set. Knowing $I(A; \Omega)$ and $I(B; \Omega)$, what can be said if A and B are in series or in parallel?

Series:

$$I(A \cap B; \Omega) \le \mathrm{INF}[I(A; \Omega), I(B; \Omega)]. \tag{2.40}$$

Parallel:

$$I(A \cup B; \Omega) \ge I(A; \Omega) + I(B; \Omega) - I(A; B). \tag{2.41}$$

These relations should be compared to (2.4) and (2.5) concerning probability. They show clearly that it is not possible to find a homomorphism between the laws on I and the series or parallel operations on the operators.
2.5. Declarative statements

A concept is usually represented by a sentence in a natural language under the form of a declarative statement. A simplified model of such a statement is a statement in a logical language. A logical language is generated by a set of syntactic rules. It is made up of:
(i) a language of terms, recursively defined from constants and variables with functions; it is a regular language [15];
(ii) a number of logical connectives, which obey a set of rules called axioms. With the terms and the connectives are built the well formed formulas (wff), respecting the axioms. They are usually called logical propositions or expressions or statements. The most frequent connectives are $\wedge$, $\vee$;
(iii) an interpretation 'true' or 'false' assigned to any logical proposition p or q. Two connectives, I and O, are always interpreted as true and false respectively.
Let us consider a similarity measure f(X, p) between an object X and a proposition p. Many semantic interpretations may be given to f such as a natural association, an interest, a generalized belonging, a verification index, a propensity, or simply a similarity.
Remarks

We consider logics different from the 'classical' (Boolean) logic, but in all of these logics no quantifiers such as 'there exists' or 'for all' are used; thus modal logic is not envisioned here. All these logics are sentential logics. Their propositions form a lattice under the operations $\vee$, $\wedge$, but this lattice is not always distributive. Let us recall, for ease of reading, some useful definitions of sentential logics.
Properties of the logics

(1) Contraposition:
$$\text{if } p \wedge q = p, \text{ then } \bar{p} \wedge \bar{q} = \bar{q}. \tag{2.42}$$

(2) De Morgan's rules:
$$\overline{p \vee q} = \bar{p} \wedge \bar{q} \quad \text{and} \quad \overline{p \wedge q} = \bar{p} \vee \bar{q}. \tag{2.43}$$

(3) Double negation:
$$\bar{\bar{p}} = p. \tag{2.44}$$

(4) Excluded middle:
$$p \vee \bar{p} = I. \tag{2.45}$$

(5) No contradiction:
$$p \wedge \bar{p} = O. \tag{2.46}$$

(6) Distributivity:
$$p \wedge (q \vee r) = (p \wedge q) \vee (p \wedge r). \tag{2.47}$$

(7) Pseudomodularity:
$$\text{if } p \wedge q = p, \text{ then } q \wedge (p \vee \bar{q}) = p \vee (q \wedge \bar{q}). \tag{2.48}$$

The different sentential logics are distinguished from one another according to which of the above properties are verified.
Table 1
Properties of sentential logics

Logics                        | 1 | 2 | 3 | 4 | 5 | 6 | 7
------------------------------|---|---|---|---|---|---|---
Classical                     | × | × | × | × | × | × | ×
Quantum                       | × | × | × | × | × |   | ×
Fuzzy                         | × | × | × |   |   | × | ×
Non distributive fuzzy        | × | × | × |   |   |   | ×
Intuitionist                  | × | × |   |   | × | × | ×
Non distributive intuitionist | × | × |   |   | × |   | ×
An × in Table 1 signifies that the corresponding logic verifies the property. For instance the classical logic verifies all seven of the above properties. Let us discuss the different logics according to their properties.

Distributivity

THEOREM OF STONE. All the distributive logics (property (6)) are homomorphic to a distributive lattice of subsets of a set.

Thus for the distributive logics we are again in the situation of Subsection 2.1. Let us consider a semi-ring Z having an additive law $\perp$ and a multiplicative law $*$; then

$$f(X; p \vee q) = f(X; p) \perp f(X; q), \tag{2.49}$$
$$f(X; p \wedge q) = f(X; p) * f(X; q). \tag{2.50}$$
But then, as before, we have to consider the idempotence question, since $p \vee p = p$ and $p \wedge p = p$. If the usual + and × are taken as addition and multiplication, and if (2.49) and (2.50) are true, then the only possible ring is $Z_1$: the similarity reduces to a binary decision 'true' or 'false'. But if we take INF and SUP as laws on f, then any f(X; p) is an idempotent; the semi-rings $Z_3$ or $Z_7$ may be considered.

Negation

(1) If (2.44) is verified, a corresponding law on f has to be found. For instance, if the range is [0, 1],

$$f(X; \bar{p}) = 1 - f(X; p). \tag{2.51}$$
Such a law on f for negation has been chosen by Watanabe and, in fuzzy logic, by Zadeh.
(2) If (2.46) is verified, and if the range is [0, 1], then we should have

$$f(X; \bar{p}) = \begin{cases} 0 & \text{if } f(X; p) > 0, \\ 1 & \text{otherwise.} \end{cases} \tag{2.52}$$
Fuzzy logic does not verify this last relation, but it is an essential property of the intuitionistic logic. On the other hand, it is clear that the intuitionistic logic does not verify (2.44), the double negation property. The only logic which verifies both the double negation (2.44) and the no contradiction (2.46) is the classical logic; but then the only admissible rings on f are the first two, leading to only two values for f. The interest of logics other than the classical one now appears clearly. Classical logic gave birth to Quantum logic; Fuzzy and Intuitionist to non distributive logics. Examples may be found in the 'real Universe', where distributivity is not verified.
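The rows of Table 1 can be probed numerically. The sketch below (Python; the truth values are hypothetical) checks the fuzzy laws INF, SUP with the negation (2.51): double negation (2.44) and distributivity (2.47) hold, while the excluded middle (2.45) and no contradiction (2.46) fail, as in the 'Fuzzy' row.

```python
# Fuzzy connectives on [0, 1]: AND = INF, OR = SUP, NOT from (2.51).
AND = min
OR = max
NOT = lambda p: 1.0 - p

p = 0.25   # a hypothetical truth value strictly between 0 and 1

# (3) Double negation (2.44) holds for this negation.
assert NOT(NOT(p)) == p

# (4) Excluded middle (2.45) fails: SUP(p, 1-p) < 1 for 0 < p < 1.
assert OR(p, NOT(p)) != 1.0

# (5) No contradiction (2.46) fails: INF(p, 1-p) > 0 for 0 < p < 1.
assert AND(p, NOT(p)) != 0.0

# (6) Distributivity (2.47) holds for the lattice laws INF, SUP.
q, r = 0.75, 0.5
assert AND(p, OR(q, r)) == OR(AND(p, q), AND(p, r))
```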
Idempotence

An essential property of the basic connectives $\vee$ and $\wedge$ is idempotence. But in modeling natural language, which describes the real world, we find that idempotence should not always be verified by such connectives. For example, the repetition of a proposition is not always equivalent to the proposition itself: "This is a man" and "This is a man, who is a man" are not equivalent propositions. Let us introduce another connective $\square$ such that $p \square p = p$ is not always true for all p. We wish of course to give to $\square$ the same properties as $\wedge$, except maybe idempotence. For this, let us consider a 'projective operator' on X, $\varphi_p(X)$. This operator takes into account the first proposition p, which has modified our knowledge of X. Then (2.50) may be written as

$$f(X; p \square q) = f(X; p) * f(\varphi_p(X); q). \tag{2.53}$$

Then

$$f(X; p \square p) = f(X; p) * f(\varphi_p(X); p). \tag{2.54}$$

But only if $p \square p = p$ do we have

$$f(X; p \square p) = f(X; p). \tag{2.55}$$
It means that $f(\varphi_p(X); p)$ is now the idempotent: "repeating once more will not change our knowledge on X" [29]. The propositions p verifying (2.55) are special in the language; sometimes they are called 'observables'. They form a lattice, which is not always distributive.

On the use of projective operators

Projective operators are introduced in many other instances: Fourier, Hadamard and K.L. expansions, filtering, and others such as those pointed out by Watanabe, Bongard and Ullman. The main question is: knowing f(X; L), what is $f(\varphi_L(X); L')$? The usual answer is to assume that the representation space X is a Euclidean space and that the 'concepts' L, L' are built up as subspaces of this space. Thus the situation is similar to that of Subsection 1.4: the interpretation space is a metric space. If $\langle \cdot \rangle$ is the scalar product,

$$f(X; L) = \frac{\langle X \cdot \varphi_L(X) \rangle}{\langle X \cdot X \rangle} \tag{2.56}$$

and

$$f(X; L \cap L') = \frac{\langle X \cdot \varphi_L(\varphi_{L'}(X)) \rangle}{\langle X \cdot X \rangle}. \tag{2.57}$$

A more general answer would be to assume on X a structure built with a semi-ring operation and to obtain in a similar manner f(X; L) and f(X; L ∩ L').
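A minimal sketch (Python/NumPy; the subspaces and names are hypothetical) of the projective similarity (2.56): f(X; L) is the fraction of the squared norm of X captured by orthogonal projection onto the subspace L; for the commuting projectors chosen here, the composition also illustrates (2.57).

```python
import numpy as np

def proj(B):
    """Orthogonal projector onto the span of the columns of B."""
    B = np.asarray(B, dtype=float)
    return B @ np.linalg.pinv(B)

def f(X, P):
    """Similarity (2.56): <X . P X> / <X . X>, in [0, 1] for a projector P."""
    X = np.asarray(X, dtype=float)
    return float(X @ P @ X) / float(X @ X)

# Hypothetical concepts: L = span(e1, e2), L' = span(e2, e3) in R^3.
L  = proj(np.array([[1, 0], [0, 1], [0, 0]]))
Lp = proj(np.array([[0, 0], [1, 0], [0, 1]]))

X = np.array([1.0, 2.0, 2.0])
print(f(X, L), f(X, Lp))   # similarity of X to each concept
print(f(X, L @ Lp))        # composition, cf. (2.57); here L @ Lp projects on L n L'
```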
3. Conclusion
By establishing a homomorphism between the representation space and an interpretation space we have shown that a unifying point of view may be obtained for many apparently different PR problems. The operational laws on the (dis)similarity measures between representations of objects and concepts appear to be the key factors in constructing this homomorphism with the structure of the concepts (the interpretation space). In other words, a link has to be made between data driven structures and concept driven structures through these operational laws. It is the belief of the authors that a unifying point of view may lead to the following conclusions. Many Pattern Recognition problems are stated and treated as if they were all quite different. However, a closer look at the fundamental problem may show that these PR problems differ only in discourse and terminology, or in the context in which they appear. We have emphasized the fact that the underlying problem lies in how to link the representation space to the interpretation space, and that the construction of a homomorphism between them demands a precise formulation of the linking key factor: a (dis)similarity
measure. Then, the number of admissible operators to construct similarity based homomorphisms between representation space and interpretation space appears to be rather limited. In that sense, for example, fuzzy set theory may enlarge the interpretation space but does not enlarge the operational formulation of the homomorphisms between the two spaces involved. In other words, the laws of fuzzy operation do not differ basically from the probabilistic ones. So, the diversity of problems is in some sense misleading. The same can be said for the huge number of existing mapping algorithms between the representation space and the interpretation space. Once the set of admissible operational laws has been defined, many apparently different algorithms appear to be intrinsically the same. They just construct the homomorphic mapping in a different fashion, still based on the same laws of operation. Hence they differ not fundamentally, but merely in details of computation and in the discourse in which they are phrased. This fact should be stated as such and should not be brought up as something special. A unifying view should keep future researchers from the 're-invention of the wheel'. As soon as it becomes clear that constructing a homomorphism between representation space and interpretation space is the basic issue in PR, the solution space of 'true' homomorphisms is a very restricted one. Future designers of PR algorithms should identify both their problems and their approaches within the framework of the admissible operational laws on (dis)similarity. In conclusion, an important outcome of this paper would be that the many identification algorithms may be thought of or explained as special cases of a much more general model, so that we escape the criticism that Pattern Recognition is just a 'bag of tricks'.
References

[1] Simon, J. C. (1978). Some current topics in clustering in relation with pattern recognition. Proc. Third Internat. Conf. on Pattern Recognition, Coronado, pp. 19-29.
[2] Saussure, F. de (1972). Cours de Linguistique Générale. Payot, Paris.
[3] Haralick, R. M. (1978). Scene matching problems. Proc. 1978 NATO ASI on Image Processing, Bonas.
[4] De Mori, R. (1978). Recent advances in automatic speech recognition. Proc. Fourth Internat. Conf. on Pattern Recognition, Kyoto, pp. 106-124.
[5] Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
[6] Diday, E. and Simon, J. C. (1976). Cluster analysis. In: Fu, K. S., ed., Digital Pattern Recognition. Springer, Berlin.
[7] Hu, S. T. (1966). Introduction to General Topology. Holden-Day, San Francisco.
[8] Kanal, L. (1974). Patterns in pattern recognition, 1968-1974. IEEE Trans. Inform. Theory 20 (6) 697-722.
[9] Cover, T. M. and Wagner, T. J. (1976). Topics in statistical recognition. In: Fu, K. S., ed., Digital Pattern Recognition. Springer, Berlin.
[10] Devijver, P. A. (1977). Reconnaissance des formes par la méthode des plus proches voisins. Thèse de Doct., Univ. Paris VI, Paris.
[11] Cover, T. M. and Hart, P. E. (1967). Nearest neighbour pattern classification. IEEE Trans. Inform. Theory 13, 21-26.
[12] Jarvis, R. A. (1978). Shared near neighbour maximal spanning tree for cluster analysis. Proc. Third Internat. Conf. on Pattern Recognition, Coronado, pp. 308-313.
[13] Gowda, K. C. and Krishna, G. (1978). Agglomerative clustering using the concept of mutual nearest neighbourhood. Pattern Recognition 10, 105-112.
[14] Backer, E. (1978). Cluster Analysis by Optimal Decomposition of Induced Fuzzy Sets. Delft University Press, Delft.
[15] Lyndon, R. C. (1964). Notes on Logic. Van Nostrand Mathematical Studies 6. Van Nostrand, New York.
[16] Sabah, G. (1977). Sur la compréhension d'histoires en langage naturel. Thèse de Doct., Univ. Paris VI, Paris.
[17] Simon, J. C. (1975). Recent progress to a formal approach of pattern recognition and scene analysis. Pattern Recognition 7, 117-124.
[18] Kampé de Fériet, J. (1977). Les deux points de vue de l'information: information à priori, information à posteriori. Colloques du CNRS 276, Paris.
[19] Diday, E. (1973). The dynamic cluster algorithm and optimisation in non-hierarchical clustering. Proc. Fifth IFIP Conf., Rome.
[20] Miranker, W. L. and Simon, J. C. (1975). Un modèle continu de l'algorithme des nuées dynamiques. C.R. Acad. Sci. Paris Sér. A 281, 585-588.
[21] Hansen, P. and Delattre, M. (1978). Complete link cluster analysis by graph coloring. J. Amer. Statist. Assoc. 73, 397-403.
[22] Simon, J. C. and Diday, E. (1972). Classification automatique. C.R. Acad. Sci. Paris Sér. A 275, 1003.
[23] Ruspini, E. (1969). A new approach to clustering. Inform. Control 15, 22-32.
[24] Bezdek, J. C. (1973). Fuzzy mathematics in pattern classification. PhD Thesis, Cornell University, Ithaca.
[25] Carnap, R. and Bar-Hillel, Y. (1953). Semantic information. British J. Phil. Sci. 4, 147-157.
[26] Kampé de Fériet, J. (1973). La Théorie Généralisée de l'Information et de la Mesure Subjective de l'Information. Lecture Notes in Mathematics 398. Springer, Berlin.
[27] Van der Pyl, T. (1976). Axiomatique de l'information. C.R. Acad. Sci. Paris Sér. A 282.
[28] Backer, E. and Jain, A. K. (1976). On feature ordering in practice and some finite sample effects. Proc. Third Internat. Conf. on Pattern Recognition, Coronado, pp. 45-49.
[29] Sallentin, J. (1979). Représentation d'observations dans le contexte de la théorie de l'information. Thèse de Doct., Univ. Paris VI, Paris.
[30] Zadeh, L. A. (1971). Similarity relations and fuzzy orderings. Inform. Sci. 3, 177-200.
[31] Zadeh, L. A. (1977). Fuzzy sets and their application to pattern classification and clustering analysis. In: van Ryzin, J., ed., Classification and Clustering, 251-299. Academic Press, New York.
[32] Zadeh, L. A. (1978). PRUF, a meaning representation language for natural languages. Internat. J. Man-Mach. Stud. 10, 395-460.