Self-organizing map for clustering in the graph domain




Pattern Recognition Letters 23 (2002) 405–417 www.elsevier.com/locate/patrec

Simon Günter *, Horst Bunke
Department of Computer Science, University of Bern, Neubrückstrasse 10, CH-3012 Bern, Switzerland

Abstract

Self-organizing map (som) is a flexible method that can be applied to various tasks in pattern recognition. However, it is limited in the sense that it uses only pattern representations in terms of feature vectors. It was only recently that an extension to strings was proposed. In the present paper we go a step further and present a version of som that works in the domain of graphs. Graphs are a powerful data structure that includes pattern representations based on strings and feature vectors as special cases. After introducing the new method, a number of experiments will be described demonstrating its feasibility in the context of a graph clustering task. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Self-organizing map; Structural pattern recognition; Graph matching; Graph edit distance; Graph clustering; Neuron utility

1. Introduction

Self-organizing map (som) as proposed in (Kohonen, 1997) is a general unsupervised method that can be used for various purposes in pattern recognition and information processing. In the present paper, the focus will be on clustering. Som consists of a layer of units, often called neurons, that adapt themselves to a population of input patterns. The adaptation is achieved through an iterative procedure which processes the input patterns sequentially. Upon presentation of an input pattern x, the neuron y that is closest to x is determined. Then y and some of the neurons in its neighborhood are updated, i.e., they are moved

* Corresponding author. E-mail addresses: [email protected] (S. Günter), [email protected] (H. Bunke).

closer to x. It can be expected that after a sufficient number of iterations all neurons have migrated into areas of the feature space where there is a high concentration of input patterns. A limitation of som is the representation of patterns in terms of feature vectors only. In the area of structural pattern recognition more powerful data structures for pattern recognition have been proposed, for example, strings, trees and graphs. A version of som for strings has been proposed recently (Kohonen and Somervuo, 1998). In the present paper we go a step further and consider graphs, which include string and tree representations as special cases. If graphs are used for pattern representation, the objects from the underlying problem domain are usually represented by the nodes of the graphs, while relations between those objects are modeled through edges. Labels or attribute vectors are often attached to the nodes and edges of a graph. Hence it is possible to capture both unary properties of individual objects and contextual relationships between different objects in a graph representation.

The use of graph representations in pattern recognition has a long history. Application examples include recognition of graphical symbols (Lee et al., 1990), character recognition (Lu et al., 1991; Rocha and Pavlidis, 1994), shape analysis (Cantoni et al., 1994; Pelillo et al., 1999), three-dimensional object recognition (Wong, 1994) and video and image database indexing (Baxter and Glasgow, 2000; Shearer et al., 2001). Graph representations in pattern recognition are typically used in the context of nearest-neighbor classification. That is, an unknown input pattern is compared to a number of prototypes stored in a database. The unknown input is then assigned to the same class as the most similar prototype. A number of similarity measures on graphs and related computational procedures have been proposed in this context (Bunke and Allermann, 1983; Sanfeliu and Fu, 1983; Wang et al., 1994; Christmas et al., 1995; Cross et al., 1996; Messmer and Bunke, 1998; Myers et al., 2000). Of particular interest in the context of the present paper is graph similarity based on graph edit distance (Bunke and Allermann, 1983; Sanfeliu and Fu, 1983; Wang et al., 1994; Messmer and Bunke, 1998). It is an extension of the well-known concept of string edit distance (Wagner and Fischer, 1974).

In this paper, we present an extension of som to the domain of graphs. To make this extension possible, two problems must be addressed. First, a

concept equivalent to the distance of two feature vectors in the n-dimensional feature space must be provided for the domain of graphs. This problem is solved by means of graph edit distance. The second problem is to provide an operation that decreases the distance of two given graphs. This operation corresponds to the update rule used in som, which moves a neuron closer to a given input pattern. As will be shown in Section 3, such an operation can be derived from graph edit distance computation.

In Section 2 we first review the fundamental properties of som as far as they are relevant for the present paper. Then a number of basic concepts from the field of graph matching will be introduced. Section 3 describes the proposed extension of som from feature vector representations to the domain of graphs. Next, experimental results are presented in Section 4. Finally, conclusions will be provided in Section 5.

2. Preliminaries

In this section, all background material necessary for the remainder of the paper will be presented.

2.1. Self-organizing map

A pseudo-code description of the classical som algorithm is given in Fig. 1. The algorithm can serve two purposes, either clustering or mapping a high-dimensional pattern space to a lower-dimensional one.

Fig. 1. The som-algorithm.
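Fig. 1 itself is not reproduced in this extracted version. As a rough, hedged stand-in, the following Python sketch gives one possible rendering of such a som loop, simplified to a winner-only update; all names and the exact schedule are ours, not the figure's, and the line numbers referred to in the text below correspond to the original figure rather than to this sketch:

import random

def classical_som(X, M, iterations, c, decay):
    # X: list of feature vectors (lists of floats); M: number of neurons;
    # c: initial learning rate; decay: multiplicative learning-rate reduction.
    dim = len(X[0])
    lo = [min(x[k] for x in X) for k in range(dim)]
    hi = [max(x[k] for x in X) for k in range(dim)]
    # random initialization inside the bounding hyperbox of the inputs
    neurons = [[random.uniform(lo[k], hi[k]) for k in range(dim)] for _ in range(M)]
    for _ in range(iterations):
        x = random.choice(X)          # pick an input pattern at random
        # determine the winner neuron y*, i.e. the neuron nearest to x
        w = min(range(M), key=lambda i: sum((x[k] - neurons[i][k]) ** 2 for k in range(dim)))
        # move the winner towards x; a full som would also update the neurons in
        # the neighborhood N(y*), with a learning rate shrinking with distance to y*
        neurons[w] = [y + c * (xk - y) for y, xk in zip(neurons[w], x)]
        c *= decay                    # reduce the learning rate
    return neurons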


In the present paper we focus on its application to clustering. Given a set of patterns, X, the algorithm returns a prototype yi for each cluster i. The prototypes are sometimes called neurons. The number of clusters, M, is a parameter that must be provided a priori. In the algorithm, first each prototype yi is randomly initialized (line 4). In the main loop (lines 5–11) one randomly selects an element x ∈ X and determines the neuron y* that is nearest to x. In the inner loop (lines 8, 9) one considers all neurons y that are within a neighborhood N(y*) of y*, including y*, and updates them according to the formula in line 9. The effect of neuron updating is to move neuron y closer to pattern x. The degree by which y is moved towards x is controlled by the parameter c, which is called the learning rate. It has to be noted that c depends on the distance between y and y*, i.e., if neuron y ∈ N(y*) has a smaller distance to y* than neuron y′ ∈ N(y*), then y is moved towards x by a larger degree than neuron y′. After each iteration through the repeat-loop, the learning rate c is reduced by a small amount, thus facilitating convergence of the algorithm. It can be expected that after a sufficient number of iterations the yi's have moved into areas where many xj's are concentrated. Hence each yi can be regarded as a cluster center. The cluster around center yi consists of exactly those patterns that have yi as their closest neuron. More detail about this algorithm can be found in (Kohonen, 1997). The proposed extension of the som algorithm to the domain of graphs, including the architecture of the som, will be discussed in Section 3.

2.2. Graph matching and similarity

In this paper we consider graphs with labeled nodes and edges. Let L_V and L_E denote sets of node and edge labels, respectively. Formally, a graph is a 4-tuple g = (V, E, μ, ν), where V is the set of nodes, E ⊆ V × V is the set of edges, μ : V → L_V is a function assigning labels to the nodes, and ν : E → L_E is a function assigning labels to the edges. A graph isomorphism from a graph g to a graph g′ is a bijective mapping from the nodes of g to the nodes of g′ that preserves all labels and the structure of the edges.
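As a concrete illustration (ours, not the paper's), the 4-tuple g = (V, E, μ, ν) can be held in a very small data structure; the sketch below stores the two labeling functions as dictionaries, with names of our own choosing:

from dataclasses import dataclass, field

@dataclass
class Graph:
    nodes: set = field(default_factory=set)         # V
    edges: set = field(default_factory=set)         # E, pairs (u, v) with u, v in V
    node_label: dict = field(default_factory=dict)  # mu: V -> L_V
    edge_label: dict = field(default_factory=dict)  # nu: E -> L_E

# example: two nodes labelled 'a' and 'b', connected by one edge labelled 'x'
g = Graph(nodes={1, 2}, edges={(1, 2)},
          node_label={1: 'a', 2: 'b'}, edge_label={(1, 2): 'x'})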

Graph isomorphism is a useful concept to find out if two patterns are the same, up to invariance properties inherent to the underlying graph representation. Real-world objects are usually affected by noise, so that the graph representations of identical objects may not match exactly. Therefore it is necessary to integrate some degree of error tolerance into the graph matching process. A powerful concept to deal with noisy and distorted graphs is error-correcting graph matching using graph edit distance. In its most general form, a graph edit operation is either a deletion, insertion, or substitution (i.e. a label change). Edit operations can be applied to nodes as well as to edges. Their purpose is to correct errors and distortions in graphs.

Formally, let g1 = (V1, E1, μ1, ν1) and g2 = (V2, E2, μ2, ν2) be two graphs. An error-correcting graph matching (ecgm) from g1 to g2 is a bijective function f : V̂1 → V̂2, where V̂1 ⊆ V1 and V̂2 ⊆ V2. We say that node x ∈ V̂1 is substituted by node y ∈ V̂2 if f(x) = y. If μ1(x) = μ2(f(x)) then the substitution is called an identical substitution; otherwise it is termed a non-identical substitution. Any node from V1 − V̂1 is deleted from g1, and any node from V2 − V̂2 is inserted in g2 under f. The mapping f directly implies an edit operation on each node in g1 and g2, i.e., nodes are substituted, deleted, or inserted, as described above. Additionally, the mapping f indirectly implies substitutions, deletions, and insertions on the edges of g1 and g2. By means of the edit operations implied by an ecgm, differences between two graphs that are due to noise and distortions are modeled.

In order to enhance the noise modeling capabilities, a cost is often assigned to each edit operation. The costs are non-negative real numbers and are application dependent. Typically, the more likely a certain distortion is to occur, the lower is its cost. The cost c(f) of an ecgm f from a graph g1 to a graph g2 is the sum of the costs of the individual edit operations implied by f. An ecgm f from a graph g1 to a graph g2 is optimal if there is no other ecgm from g1 to g2 with a lower cost. The edit distance, d(g1, g2), of two graphs is equal to the cost of an optimal ecgm from g1 to g2, i.e.



d(g1, g2) = min{c(f) | f : V̂1 → V̂2 is an ecgm from g1 to g2}.    (1)

The edit operations implied by an ecgm f : V̂1 → V̂2 can be interpreted as a sequence of edit operations (in any order) that transform g1 into g2. Hence string edit distance as proposed in (Wagner and Fischer, 1974) is a special case of graph edit distance. For more details on error-correcting graph matching and graph edit distance, including computational procedures, see (Messmer and Bunke, 1998), for example.
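As an illustration of Eq. (1), the following hedged sketch enumerates every ecgm between two very small graphs that, like representation R1 introduced in Section 4, carry attributed nodes but no edges, and returns the cheapest total cost. The function and cost-parameter names are ours, and this brute-force enumeration is not a substitute for the efficient computational procedures cited above:

from itertools import combinations, permutations

def edit_distance(nodes1, nodes2, c_sub, c_del, c_ins):
    # nodes1, nodes2: lists of node attributes of g1 and g2; an ecgm maps a
    # subset of the nodes of g1 bijectively onto a subset of the nodes of g2.
    best = float('inf')
    for k in range(min(len(nodes1), len(nodes2)) + 1):
        for sub1 in combinations(range(len(nodes1)), k):       # mapped nodes of g1
            for sub2 in permutations(range(len(nodes2)), k):   # their images in g2
                cost = sum(c_sub(nodes1[i], nodes2[j]) for i, j in zip(sub1, sub2))
                cost += sum(c_del(nodes1[i]) for i in range(len(nodes1)) if i not in sub1)
                cost += sum(c_ins(nodes2[j]) for j in range(len(nodes2)) if j not in sub2)
                best = min(best, cost)
    return best

# toy usage with scalar node attributes: substitution costs the attribute
# difference, deletion/insertion costs the attribute value itself
d = edit_distance([1.0, 4.0], [1.5],
                  c_sub=lambda a, b: abs(a - b),
                  c_del=lambda a: a, c_ins=lambda b: b)   # d == 3.5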

3. Som for the domain of graphs

As pointed out in Section 1, two problems need to be solved when transferring som from vectorial pattern representations to the domain of graphs. First, a suitable measure of similarity, or distance, between graphs needs to be provided (see line (7) in Fig. 1). Graph edit distance as introduced in Section 2.2 will be used for this purpose in the present paper. Secondly, a procedure in the graph domain that is equivalent to the neuron updating rule defined in line (9) in Fig. 1 is needed. This procedure will be described in Section 3.1. Then, in Sections 3.2–3.5, further details of the som algorithm in the graph domain will be discussed.

3.1. Neuron updating in the graph domain

The procedure for neuron updating in the graph domain is based on the concept of the weighted mean of a pair of graphs. Let g1 and g2 be graphs. We call graph g a weighted mean of g1 and g2 if for some a with 0 ≤ a ≤ d(g1, g2) the following equations hold:

d(g1, g) = a,    (2)

d(g1, g2) = a + d(g, g2).    (3)

From Eqs. (2) and (3) it immediately follows that d(g1, g2) = d(g1, g) + d(g, g2). For the actual computation of the weighted mean, g, of a pair of graphs, g1 and g2, we assume that the costs associated with the edit operations fulfill the triangular inequality, i.e., if e1, e2 and e3 are edit operations such that the application of e2 followed by e3, or the application of e3 followed by e2, has the same result as the application of e1, then always

c(e1) ≤ c(e2) + c(e3).    (4)

This assumption is not a real restriction because, for any pair of graphs, g1 and g2, in the computation of d(g1, g2) we are searching for a sequence of edit operations that transform g1 into g2 with minimum cost. Hence, if Eq. (4) were not satisfied, then the algorithm would always choose e2 and e3 instead of e1. In this case we could simply change the cost of e1 into c(e2) + c(e3), which would satisfy Eq. (4) and yield the same value of d(g1, g2).

In Fig. 2 a procedure for the computation of a weighted mean, g, of a pair of graphs, g1 and g2, is given. The basic idea of this procedure is to take a subset, E′, of the edit operations implied by an optimal ecgm from g1 to g2 and apply them to graph g1. This results in a graph g, which is a weighted mean of g1 and g2. Moreover, d(g1, g) = a′ and d(g1, g2) = a′ + d(g, g2), where a′ is equal to the sum of the costs of the edit operations in subset E′. It can be shown that the procedure given in Fig. 2 is correct and complete. That is, for any two input graphs, g1 and g2, and their corresponding output (g, a′), Eqs. (2) and (3) hold. Furthermore, there is no weighted mean g of a pair of graphs, g1 and g2, that cannot be generated by this procedure.

The link between the neuron updating rule of line (9) in Fig. 1 and Eqs. (2) and (3) is as follows. First, we rewrite this rule as

y_new − y_old = c(x − y_old).    (5)

Furthermore we observe that y_new is located on a straight line segment that connects x and y_old, which means that in addition to Eq. (5) the following equation holds:

x − y_new = (1 − c)(x − y_old).    (6)

If we substitute g1 = x, g = y_new, g2 = y_old and replace the subtraction operator by d(·,·), then Eqs. (5) and (6) turn into

d(g, g2) = c · d(g1, g2),    (7)

d(g1, g) = (1 − c) · d(g1, g2).    (8)



Fig. 2. Procedure for weighted mean graph computation.
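Fig. 2 itself is not reproduced here. Following the idea described above, a hedged sketch of a weighted-mean computation for edge-free attributed graphs (such as those of representation R1) might read as follows; the helper optimal_ecgm is an assumption of ours, not part of the paper, and could for instance be obtained from an edit-distance routine that also records the optimal mapping:

def weighted_mean(nodes1, nodes2, a, optimal_ecgm):
    # nodes1, nodes2: node-attribute lists of g1 and g2 (edge-free graphs);
    # a: requested share of the edit cost, 0 <= a <= d(g1, g2);
    # optimal_ecgm(nodes1, nodes2): assumed helper returning the edit operations
    # implied by an optimal ecgm, each as (kind, data, cost) with
    # kind in {'sub', 'del', 'ins'}.
    ops = optimal_ecgm(nodes1, nodes2)
    g = list(nodes1)          # start from g1 and apply a subset E' of the operations
    spent = 0.0               # plays the role of a' in the text
    for kind, data, cost in ops:
        if spent + cost > a:  # stop once the requested cost share would be exceeded
            break
        spent += cost
        if kind == 'sub':
            i, new_attr = data
            g[i] = new_attr   # (identical or non-identical) node substitution
        elif kind == 'del':
            g[data] = None    # mark a node of g1 as deleted
        else:                 # 'ins': insert a node of g2
            g.append(data)
    return [attr for attr in g if attr is not None]

Under cost functions satisfying Eq. (4), the graph returned by this sketch satisfies d(g1, g) = spent and d(g1, g2) = spent + d(g, g2), i.e. it is a weighted mean in the sense of Eqs. (2) and (3) for a equal to the cost actually spent.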

From these equations we get Eqs. (2) and (3) if we let a = (1 − c) · d(g1, g2). Hence the procedure given in Fig. 2 can be regarded as a meaningful implementation of the som neuron updating rule in the graph domain.

3.2. Neuron initialization

In this section we will provide details of the neuron initialization procedure; see line (4) in Fig. 1. A number of procedures for the initialization of neurons have been proposed for the classical version of som. The main concern in neuron initialization is to provide some reasonable first approximation to the input pattern distribution. Clearly, the closer the initial neuron distribution is to the real distribution of the input patterns, the better are the expected convergence behavior and the final result of the som algorithm. For the domain of graphs, the following initialization procedures seem feasible:

• Random initialization: Here all neurons are randomly generated. This method is problematic as there exists no straightforward way to guarantee that the randomly generated neurons occupy the same part of the pattern space as the input patterns. (This problem is much alleviated in the classical som, where the bounding hyperbox of all inputs can be easily determined.)

• Selection of a subset of the input patterns: This possibility is extremely simple to implement, but it does not reflect one of the key ideas of som, namely to randomly initialize the neurons. Moreover, it may lead to a bias of the result towards the initially selected graphs.

• Selection of a subset of the input patterns, followed by a random perturbation: Under this procedure each of the selected input graphs is perturbed by means of randomly chosen edit operations. This method seems closer to the original idea of som to randomly generate the initial neuron population. However, the problem is a suitable choice of the perturbation operators. It can be argued that any appropriate operator must eventually be problem dependent. Hence, the design of a general initialization method seems rather difficult.

• Random selection of pairs of input patterns and generation of a weighted mean for each pair: Under this procedure, a pair of input graphs, g1 and g2, and a value a ∈ [0, 1] are randomly chosen. Then we let g = weighted_mean(g1, g2, a). It can be expected that each neuron g created by this procedure will be in a location of the pattern space that is occupied by input patterns. Hence the problem of generating isolated neurons (i.e. neurons residing in a part of the input space where no other neurons or patterns are), which may arise from a random initialization, is



avoided. Moreover, this procedure has a strong random component and is application independent. For these reasons this method has been chosen for neuron initialization. For the random selection of the pairs of input graphs and of the values a ∈ [0, 1], uniform distributions are used.

3.3. Topology of the som

In the classical version of som a topology, i.e. a neighborhood relation, is defined on the som. Typically the neurons in the som are linearly ordered, or arranged in a two-dimensional array. Such a topology is important when a high-dimensional input space is to be mapped to a low-dimensional som. In the domain of graphs, however, there is no concept equivalent to dimensionality. The neurons, i.e. graphs, in the som should provide an approximation to the original input pattern distribution, but input and output space are dimensionless. This is one of the reasons why no topological neighborhood relation among the neurons in the som is defined in this work. Another reason is that we are mainly concerned with applying som for the purpose of clustering, rather than providing an approximation of the input space distribution by means of som. Consequently, the som proposed in this paper for the clustering of graphs consists of a number of neurons that have no topological connections among each other. Using no topological relations between the neurons in the som means that proximity among neurons and input patterns is exclusively defined through edit distance. In order to simplify and speed up the som algorithm, only the winner neuron, i.e. the neuron closest to the actual input, is updated. This means that N(y*) = {y*} in line (8) in Fig. 1.
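Putting Sections 3.1–3.3 together, a minimal, hedged sketch of the resulting clustering loop might look as follows. Here d and weighted_mean are assumed helpers (graph edit distance and a procedure in the spirit of Fig. 2), the initialization follows one possible reading of Section 3.2, and the utility mechanism described in Section 3.4 below is omitted for brevity:

import random

def graph_som(X, M, iterations, c, decay, d, weighted_mean):
    # X: list of input graphs; M: number of clusters/neurons;
    # d(g1, g2): graph edit distance; weighted_mean(g1, g2, a): returns a graph g
    # with d(g1, g) = a (see Section 3.1).
    # Initialization: weighted mean of a randomly chosen pair of inputs, with the
    # cost share drawn uniformly from [0, d(g1, g2)].
    neurons = []
    for _ in range(M):
        g1, g2 = random.sample(X, 2)
        neurons.append(weighted_mean(g1, g2, random.uniform(0.0, d(g1, g2))))
    for _ in range(iterations):
        x = random.choice(X)
        # winner-only update, since N(y*) = {y*} (Section 3.3)
        w = min(range(M), key=lambda i: d(x, neurons[i]))
        # Eqs. (7)/(8): the new neuron is a weighted mean of x and the old winner
        # with a = (1 - c) * d(x, y_old), so that d(x, y_new) = (1 - c) * d(x, y_old)
        neurons[w] = weighted_mean(x, neurons[w], (1.0 - c) * d(x, neurons[w]))
        c *= decay            # learning-rate reduction (cf. Section 3.5)
    # each input belongs to the cluster of its nearest neuron
    clusters = [min(range(M), key=lambda i: d(x, neurons[i])) for x in X]
    return neurons, clusters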

3.4. Utility for outlier elimination

A drawback arising from not considering any topological neighborhood relations among neurons is the potential existence of outliers. An outlier is an isolated neuron that is never a winner, regardless of which input pattern is considered. In other words, an outlier is a neuron y such that there is no input x to which y is the nearest neighbor. Consequently, y will never be changed during the execution of the algorithm. Isolated neurons are artifacts that do not contribute to the final result. Hence some procedure should be provided in order to detect and remove them.

A method for outlier detection and elimination for the classical version of som was proposed in (Fritzke, 1999). It is based on the concept of utility. The utility of a neuron y is an indicator reflecting how much y contributes to the approximation of the input data, i.e., it shows the contribution of y to minimizing the sum S of squared distances between all inputs and their nearest neuron. Neurons with a high utility are important and should be kept, while a neuron with a low utility can be removed without significantly changing S. Let y1 and y be neurons and x an input pattern. Then the utility of y1 is defined as follows:

U(y1) = Σ_{x ∈ near(y1)} [min{d(y, x)² | y ≠ y1} − d(y1, x)²],    (9)

where

near(y1) = {x | (∀y)(y ≠ y1) ⇒ d(y, x) ≥ d(y1, x)}.    (10)

In these equations, d(·,·) denotes the graph edit distance. The utility U(y1) of neuron y1 is obtained by computing, for all inputs x ∈ near(y1), the difference between the squared distance of the second-nearest neuron to x and the squared edit distance of y1 to x. The set near(y1) consists of all inputs that have y1 as their closest neuron. Clearly, if y1 has a small utility then either for each input pattern in its immediate neighborhood there is another neuron close by, or there are no input patterns in its immediate neighborhood at all. In the second case, near(y1) = ∅ and U(y1) = 0. On the other hand, if U(y1) is large then the removal of y1 would leave all inputs in the set near(y1) without a neuron close by, and the sum of squared distances of all inputs to their nearest neuron would significantly increase.

In order to eliminate neurons with a small utility, the neuron y with minimum utility Umin = U(y) among all neurons is determined. Let


the average utility of all neurons be Uav. Then y is eliminated if

Umin < b · Uav,  where 0 < b < 1.    (11)

After a neuron has been eliminated, a new neuron is inserted. Here the question of neuron initialization arises again. In order to initialize a new neuron, one could surely use the same procedure as in step 4 of the som algorithm (see Fig. 1 and Section 3.2). However, one can expect that the utility of a new neuron will be higher if it is inserted near an existing neuron with a high utility. Therefore, the procedure adopted in our algorithm takes the neuron y with the highest utility and randomly selects an input pattern x. Then weighted_mean(x, y, a) is called with a small value of a (in the experiments reported in Section 4, a = 0.1). Clearly, this strategy of inserting a new neuron has some inherent randomness, but at the same time it ensures that the new neuron will not be isolated from the input patterns, i.e. its utility can be expected to be high.

In the algorithm in Fig. 1, the utility check and the potential replacement of the neuron with the lowest utility is to be done after step (10). However, as the determination of the utility is computationally expensive, it is actually performed only after every kth iteration through the repeat loop. In the experiments described in Section 4, k was set equal to 200.
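As an illustration of Eqs. (9)–(11), the following sketch computes the utility of every neuron and the elimination test; d is again an assumed graph edit distance function, and the names are ours:

def utilities(X, neurons, d):
    # X: input graphs; neurons: current neurons; d: graph edit distance.
    # Eqs. (9) and (10): for every input whose nearest neuron is y1, accumulate the
    # squared distance to the second-nearest neuron minus the squared distance to y1.
    U = [0.0] * len(neurons)
    for x in X:
        dists = [d(y, x) for y in neurons]
        nearest = min(range(len(neurons)), key=lambda i: dists[i])
        second = min(dists[i] for i in range(len(neurons)) if i != nearest)
        U[nearest] += second ** 2 - dists[nearest] ** 2
    return U

def should_replace_weakest(U, b=0.05):
    # Eq. (11): eliminate the minimum-utility neuron if U_min < b * U_av, 0 < b < 1
    return min(U) < b * (sum(U) / len(U))

Neurons that are never a winner accumulate nothing and therefore get utility 0, matching the case near(y1) = ∅ discussed above.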


3.5. Further remarks

As termination criterion in line (11) in Fig. 1 a fixed number of iterations through the repeat loop was chosen. This number was set equal to 5000 in the experiments described in Section 4. The learning rate c was initially set equal to 0.9. Reduction of the learning rate was accomplished by multiplying c by the constant 0.999 in each iteration through the repeat loop. For the utility check in Eq. (11), b = 0.05 was used.

The algorithm proposed in this paper requires the correct number of clusters to be given as a parameter. Often this number is not known. The problem of automatically finding the correct number of clusters has been addressed in (Günter and Bunke, 2001).

4. Experimental results

The extension of som to the graph domain as described in Section 3 was experimentally evaluated. In the experiments, graph representations of capital characters were used. This domain was chosen because it allows a straightforward visual interpretation of the graphs and the clustering results. For the purpose of simplification, only those characters from the alphabet were considered that consist of straight lines only. In Fig. 3, 15 characters are shown, each representing a different class. In the corresponding graphs, each straight line segment is represented by a node with the coordinates of its endpoints in the image plane as attributes. No edges are included in this kind of graph representation. The edit costs are defined as follows. The cost of deleting or inserting a line segment is proportional to its length, while substitution costs correspond to the difference in length of the two considered line segments. This kind of representation will be called R1 in the following.

In addition to representation R1, a second representation, called R2 below, was considered. Under R2, the nodes represent locations where either a line segment ends, or where the endpoints of two different line segments coincide with each other. The attributes of a node represent its location in the image.

Fig. 3. 15 characters each representing a different class.



There is an edge between two nodes if the corresponding locations are connected by a line in the image. No edge attributes are used in this representation. The deletion and insertion cost of a node is a constant, while the cost of a node substitution is proportional to the distance of the corresponding points in the image plane. The deletion and insertion of an edge also have constant cost. As there are no edge labels, edge substitutions will never be needed under representation R2. Examples of graph representations R1 and R2 are shown in Fig. 4.

To the knowledge of the authors, there exist no automatic procedures to infer the costs of graph edit operations from a sample set. Therefore, for each application involving graph edit distance computation, the edit costs have to be determined experimentally. For graph representation R1 there is only one parameter involved. It is a constant that defines the weight of a node insertion or deletion relative to a substitution. For graph representation R2 two parameters are required, one defining the cost of node insertions and deletions, and the other defining the cost of edge insertions and deletions. It turned out that the choice of these parameters is not particularly critical. Although any slight change of the parameter values leads to a change in the graph edit distance values, the partition of the set of input graphs into different clusters, and the cluster representatives, remain stable over a wide range of parameter values.

For each of the 15 prototypical characters shown in Fig. 3, ten distorted versions were generated. Examples of distorted A's and E's are shown in Figs. 5 and 6, respectively. The degree of distortion of the other characters is similar to that in Figs. 5 and 6. Under graph representation R1 the graph of a distorted character may have a different number of nodes and different attribute values. Under R2, additionally the structure of the edges may change (see also Fig. 4). As a result of the distortion procedure, a sample set of 150 characters was obtained. Although the identity of each sample was known, this information was not used in the experiments described below, i.e., only unlabeled samples were used in the experiments.
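Returning to the cost model of representation R1 described above (a node holds the two endpoints of a line segment; deletion and insertion cost proportional to the segment length, substitution cost equal to the length difference), a hedged sketch of these cost functions might read as follows; the constant K_DEL stands for the single weighting parameter mentioned in the text, and its value here is purely illustrative:

from math import dist   # Euclidean distance between two points (Python >= 3.8)

K_DEL = 1.0              # relative weight of node deletions/insertions; the single
                         # parameter of R1 mentioned above (value illustrative only)

def seg_length(node):
    # under R1 a node carries the two endpoints of a straight line segment
    p, q = node
    return dist(p, q)

def r1_sub(n1, n2):
    # substitution cost: difference in length of the two line segments
    return abs(seg_length(n1) - seg_length(n2))

def r1_del(n):
    # deletion cost (and, symmetrically, insertion cost): proportional to length
    return K_DEL * seg_length(n)

r1_ins = r1_del

# such functions could be passed as c_sub, c_del, c_ins to a distance routine,
# for instance the brute-force edit_distance sketch given in Section 2.2
a_node = ((0.0, 0.0), (3.0, 4.0))   # a segment of length 5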

Fig. 4. An example of graph representations R1 and R2: (a) original figure; (b) representation R1; (c) representation R2.

Fig. 5. Ten distorted versions of character A.

Fig. 6. Ten distorted versions of character E.


Fig. 7. Cluster centers obtained in one of the experimental runs.

The graph clustering algorithm described in Section 3 was run on a set of 150 graphs representing the (unlabeled) sample set of characters, with the number of clusters set to 15. As the algorithm is non-deterministic, a total of 10 runs were executed. The cluster centers obtained in one of these runs are shown in Fig. 7. Obviously, all cluster centers are correct in the sense that they represent meaningful prototypes of the different character classes. In all other runs similar results were obtained for both representations R1 and R2, i.e., in none of the runs was an incorrect prototype generated. Also, all of the 150 given input patterns were assigned to their correct cluster center.

The distortions of the character prototypes in the experiment described above were created manually. In order to evaluate the influence of distortions in a more systematic manner, additional experiments were conducted, in which the given character prototypes were distorted by means of an automatic procedure.

Fig. 8. Distorted versions of the 15 prototypes for various values of f. For R1: (a) f = 0.05, (b) f = 0.1, (c) f = 0.15, (d) f = 0.2. For R2: (e) f = 0.05, (f) f = 0.1, (g) f = 0.15, (h) f = 0.2.



For representation R1 the automatic distortion procedure takes each line segment of a character and applies a translation, a rotation, and a scaling operation to it. The degree of each distortion is a random variable with zero mean and a standard deviation that is controlled by a parameter f. There is also a certain probability, dependent on f, that a line segment is deleted. The distortion procedure for R2 is similar. Examples of distorted characters for both representations R1 and R2, and various values of f, are shown in Fig. 8.

To test the influence of noise on the graph clustering procedure, the experiment described at the beginning of this section (see Figs. 3–7) was repeated, but this time the distorted versions of the characters were generated automatically. For both graph representations R1 and R2 the parameter f was increased in steps of 0.01 over the interval [0.01, 0.20]. To take the random nature of the som algorithm into account, each experimental run was repeated five times. Hence, a total of 200 runs were executed. Typical examples of cluster centers obtained in these runs are shown in Fig. 9. Note that the order of the cluster centers displayed in Fig. 9 no longer corresponds to Fig. 3, because all graphs are unlabeled. The cluster centers for f = 0.05, f = 0.1 and part of those for f = 0.15 are quite similar in shape to the prototypes. But for higher noise levels the clusters contain more and more elements that were generated

Fig. 9. Cluster centers obtained for various values of f. For R1: (a) f = 0.05, (b) f = 0.1, (c) f = 0.15, (d) f = 0.2. For R2: (e) f = 0.05, (f) f = 0.1, (g) f = 0.15, (h) f = 0.2.



Fig. 10. Correct cluster assignment rate depending on degree of noise.

from different prototypes. Consequently the cluster centers become more and more dissimilar to the prototypes. It should be noted, however, that the fact that clusters contain elements from different prototypes does not mean that the clustering algorithm failed. With increasing noise it simply becomes more likely that an element is more similar to another prototype than to the one it was created from.

In order to provide a quantitative assessment, in addition to the qualitative judgment based on Figs. 8 and 9, an index C is introduced that measures the correctness, or purity, of a clustering generated by the som algorithm. For each input pattern x, let C(x) be the percentage of other input patterns in the same cluster that were created out of the same prototype as x. Then C is defined as the sum of all C(x)'s, divided by the number of input patterns x. Clearly, 0 ≤ C ≤ 1, with the value C = 1 indicating the most favorable case, where each cluster is totally pure (i.e. it contains only elements from the same prototype), while for C = 0 none of the clusters contains two elements generated out of the same prototype. The values of C depending on f for each of the 200 experimental runs described above are plotted in Fig. 10. For each value f = 0.01, 0.02, ..., 0.2 the average of the corresponding five runs for both R1 and R2 is given.

The graph in Fig. 10 corresponds well with our intuitive assessment based on Fig. 9. Up to a noise level of about f = 0.1, clusters of high purity are obtained. Purity deteriorates as f increases to f = 0.15, and clusters become very heterogeneous for f = 0.2. Moreover, the prototypes based on R2 appear more distorted than those based on R1 for f = 0.2.
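A small sketch of how the purity index C defined above could be computed from the cluster assignments and the known generating prototypes; the handling of singleton clusters is our own convention, since the paper does not specify it:

def purity_index(cluster_of, prototype_of):
    # cluster_of[i]: cluster assigned to input i; prototype_of[i]: prototype the
    # distorted input i was generated from.  C(x) is the fraction of the *other*
    # members of x's cluster that stem from the same prototype as x, and C is the
    # mean of all C(x).
    n = len(cluster_of)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i and cluster_of[j] == cluster_of[i]]
        if others:   # convention (ours): a singleton cluster contributes C(x) = 0
            same = sum(1 for j in others if prototype_of[j] == prototype_of[i])
            total += same / len(others)
    return total / n

# toy example: three inputs from prototype 'A', one from 'E', clustered as shown
print(purity_index(cluster_of=[0, 0, 1, 1], prototype_of=['A', 'A', 'A', 'E']))  # 0.5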

5. Conclusions

Som has become an established tool in pattern recognition and related areas. In the present paper an extension of som to the domain of graphs has been proposed. Two problems must be solved to make this extension possible. First, a distance, or similarity, measure for graphs is needed. This problem is solved by means of graph edit distance. Secondly, the neuron updating rule of som must be suitably generalized from vectorial pattern representations to graphs. Such a procedure has been derived from graph edit distance computation.

In the present paper the application of som to the clustering of graphs has been addressed. Graph representations of line drawings were used to show the feasibility of the proposed method. It was demonstrated that the som-based clustering procedure yields reasonable results even if the underlying graphs are significantly distorted.

Graphs are a flexible and powerful representation formalism for various pattern recognition tasks. They include both feature vector and string representations as special cases. Hence the procedure proposed in this paper provides an extension of som to areas where the description of objects is no longer restricted to unary properties, but may include relations between various parts of an object. Only line drawings are considered in this paper, but it is straightforward to apply the approach to graph representations of other types of patterns. An example is graphs derived from segmented color images (De Mauro et al., 2001; Kittler and Ahmadyfard, 2001).

From a general point of view, the procedure proposed in this paper can be regarded as a first step towards extending existing tools from the n-dimensional feature space into the symbolic domain. Many other clustering algorithms are known from the literature (Jain et al., 1999). With procedures for graph edit distance and weighted mean graph computation at our disposal, it can be expected that some of these algorithms can be extended from feature vector representations to graphs as well. The well-known k-means clustering algorithm (Jain et al., 1999) seems to be a particularly promising candidate because it is very similar to som. In fact, k-means can be regarded as a "batch version" of som, where cluster centers are not updated immediately after the presentation of each input pattern, but only after a complete run through the whole set of inputs. An application of clustering to a problem in online character recognition, where the patterns are represented by strings, has been described in (Connell and Jain, 2001).

References

Baxter, K., Glasgow, J., 2000. Protein structure determination: combining inexact graph matching and deformable templates. In: Proc. Vision Interface, Montreal, pp. 179–186.

Bunke, H., Allermann, G., 1983. Inexact graph matching for structural pattern recognition. Pattern Recognition Lett. 1 (4), 245–253.

Cantoni, V., Cinque, L., Guerra, C., Levialdi, S., Lombardi, L., 1994. 2-D object recognition by multiscale tree matching. Pattern Recognition 31, 1443–1455.

Christmas, W.J., Kittler, J., Petrou, M., 1995. Structural matching in computer vision using probabilistic relaxation. IEEE Trans. Pattern Anal. Machine Intell. 8, 749–764.

Connell, S., Jain, A., 2001. Template-based online character recognition. Pattern Recognition 34 (1), 1–14.

Cross, A., Wilson, R., Hancock, E., 1996. Genetic search for structural matching. In: Buxton, B., Cipolla, R. (Eds.), Computer Vision – ECCV'96, Lecture Notes in Computer Science 1064. Springer, Berlin, pp. 514–525.

De Mauro, C., et al., 2001. Similarity learning for graph-based image representations. In: Jolion, J.-M., Kropatsch, W., Vento, M. (Eds.), Proc. 3rd IAPR-TC15 Workshop on Graph-Based Representations in Pattern Recognition, pp. 250–259.

Fritzke, B., 1999. Growing self-organizing networks – history, status quo and perspectives. In: Oja, E., Kaski, S. (Eds.), Kohonen Maps. Elsevier, Amsterdam, pp. 131–144.

Günter, S., Bunke, H., 2001. Validation indices for graph clustering. In: Jolion, J.-M., Kropatsch, W., Vento, M. (Eds.), Proc. 3rd IAPR-TC15 Workshop on Graph-Based Representations in Pattern Recognition, pp. 229–238.

Jain, A.K., Murty, M.N., Flynn, P.J., 1999. Data clustering: a review. ACM Comput. Surv. 31, 264–323.

Kittler, J., Ahmadyfard, A., 2001. On matching algorithms for the recognition of objects in cluttered background. In: Arcelli, C., et al. (Eds.), Proc. 4th Internat. Workshop on Visual Form, Lecture Notes in Computer Science 2059. Springer, Berlin, pp. 51–66.

Kohonen, T., 1997. Self-Organizing Map. Springer, Berlin.

Kohonen, T., Somervuo, P., 1998. Self-organizing maps on symbol strings. Neurocomputing 21, 19–30.

Lee, S.W., Kim, J.H., Groen, F.C.A., 1990. Translation-, rotation- and scale-invariant recognition of hand-drawn symbols in schematic diagrams. Internat. J. Pattern Recognition Artif. Intell. 4, 1–15.

Lu, S.W., Ren, Y., Suen, C.Y., 1991. Hierarchical attributed graph representation and recognition of handwritten Chinese characters. Pattern Recognition 24, 617–632.

Messmer, B., Bunke, H., 1998. A new algorithm for error-tolerant subgraph isomorphism detection. IEEE Trans. Pattern Anal. Machine Intell. 20, 493–504.

Myers, R., Wilson, R., Hancock, E., 2000. Bayesian graph edit distance. IEEE Trans. Pattern Anal. Machine Intell. 22, 628–635.

Pelillo, M., Siddiqi, K., Zucker, S., 1999. Matching hierarchical structures using associated graphs. IEEE Trans. Pattern Anal. Machine Intell. 21, 1105–1120.

Rocha, J., Pavlidis, T., 1994. A shape analysis model with applications to a character recognition system. IEEE Trans. Pattern Anal. Machine Intell., 393–404.

Sanfeliu, A., Fu, K.S., 1983. A distance measure between attributed relational graphs for pattern recognition. IEEE Trans. Systems Man Cybernet. 13, 353–362.

Shearer, K., Bunke, H., Venkatesh, S., 2001. Video indexing and similarity retrieval by largest common subgraph detection using decision trees. Pattern Recognition 34 (5), 1075–1091.

Wagner, R.A., Fischer, M.J., 1974. The string-to-string correction problem. J. Assoc. Comput. Mach. 21 (1), 168–173.


Wang, I., Zhang, K., Chirn, G., 1994. The approximate graph matching problem. In: Proc. 12th ICPR, Jerusalem, pp. 284–288.

Wong, E.K., 1994. Model matching in robot vision by subgraph isomorphism. Pattern Recognition 25, 287–304.