GRAPHICAL MODELS AND IMAGE PROCESSING
Vol. 58, No. 3, May 1996, pp. 187–197, Article No. 0016

Multidimensional Co-occurrence Matrices for Object Recognition and Matching

VASSILI KOVALEV
Computer Center, Belarus Academy of Sciences, Kirova Street, 32-A, 246652 Gomel, Belarus

AND

MARIA PETROU
Department of Electronic and Electrical Engineering, University of Surrey, Guildford GU2 5XH, United Kingdom

Received April 26, 1993; revised July 21, 1995; accepted December 6, 1995
A novel method is proposed for object recognition and matching. It is based on the automatic search for features that characterize a certain object class, using a training set consisting of both positive and negative examples. Special multidimensional co-occurrence matrices are used for the description and representation of some basic image structures. The features are extracted from the elements of this matrix and express quantitatively the relative abundance of some elementary structures, i.e., they are quotients of certain elements of the matrix. Only features which discriminate the classes clearly are used. The method is demonstrated in numerous applications, falling under the general problems of texture recognition, texture defect detection, and shape recognition. © 1996 Academic Press, Inc.

1. INTRODUCTION

The problem of object recognition and matching is one of the basic objectives of computer vision. The crux of the problem lies in the choice of appropriate descriptors of the objects to be recognized and the appropriate representation of these descriptors for the process of matching and recognition to take place. Various types of descriptors have been proposed in the literature, ranging from shape descriptors (e.g., [1, 2]) to region descriptors that try to encapsulate structural (e.g., texture) and nonstructural properties (e.g., color) of the object (e.g., [3, 4]). Among the different representations proposed in the literature, some of the most popular ones are attribute relational databases (e.g., [3]) and graphs (e.g., [5]), semantic nets (e.g., [6]), schemata (e.g., [7]), frames (e.g., [8]), Freeman chain codes ([9]), and co-occurrence matrices. Which of the above descriptors and representations are most appropriate for each case depends on the level of the vision task performed. For example, gray-level co-occurrence matrices and gray-level run length matrices, suggested by Haralick et al. [10] and Galloway [11], have been widely used for texture description, texture classification, and segmentation [12–14]. Haddon and Boyce [15–17] have renewed the interest in co-occurrence matrices by proposing the use of special co-occurrence matrices for edge detection and estimation of the optic flow field, and by using co-occurrence matrices as look-up tables to emphasize differences between the salient (atypical) and the background (typical) image features. We argue that the attributes used to represent an object are problem dependent and that often what characterizes an object is the relative abundance of these attributes as well as their relative position. Thus, in this paper we present a novel method for image recognition and matching based on quantitative estimation of relations between some "elementary" image structures, which are represented by elements of special multidimensional co-occurrence matrices (MDCMs). We call this method the "elementary structure balance method," as it exploits the idea that what makes an object what it is, is the balanced presence of certain characteristics in it. The relative abundance with which certain characteristics are present in an object is learned during a training phase which consists of both positive and negative examples. We argue that by far the most desirable line of approach is to pay careful attention to choosing these characteristics so that the classes of objects are linearly separable. In other words, we argue that careful feature selection followed by a simple classifier is much preferable to a quick feature selection stage followed by a carefully designed classifier. As is well known, even the most sophisticated classifiers minimize the error of classification rather than avoid it when they are dealing with overlapping classes. In the application
problems we consider here to demonstrate our method, like medical diagnosis or signature verification, there is really no room for even small errors. Ideally, one should have zero error, or at most an error equivalent to that of the human expert. Thus, we choose our features on the basis of how well they discriminate between the positive and the negative examples and set hard thresholds for them. The process of object recognition is then one of hard evidence gathering, where votes are counted from features with values that fall within the acceptable range. In this way, our classifier could be considered an N-tuple neural network classifier operating on linearly separable classes. In Section 2, we review the definition of the generalized co-occurrence matrices and describe our methodology for feature extraction and matching. In Section 3 we discuss various specific applications of our approach and present extensive experiments in a diversity of fields where the strengths and weaknesses of our approach are exemplified. Finally, we present our conclusions in Section 4.

2. METHODOLOGY
We start by assuming that the image of any object can be considered a composition of elementary structures (N-tuples), the elements of which (the pixels that constitute them) carry some attributes and have some relations. For example, as basic spatial (geometric) pixel configurations we may consider pixel pairs, pixel triplets, and quadruples. Examples of the attributes of each pixel used are its gray-level value and its gradient magnitude and/or orientation. Examples of the relations between the constituent pixels of the N-tuples considered are the relative gradient orientation between pairs of pixels and the gray-level difference. To represent these parameters we use an M-dimensional co-occurrence matrix, with each of the attributes and relations corresponding to a different axis of the matrix. In general, an M-dimensional co-occurrence matrix is an M-dimensional array W, the elements of which have the general form w(a1, a2, . . . , aM1; b1, b2, . . . , bM2), where M1 + M2 = M, ai takes all possible values a certain attribute could take for an elementary structure, and bj takes all possible values a certain relation could take, defined on the pixels that make up the elementary structures we consider (including their relative position). The value of each element of the array is the number of elementary structures of each type (their frequency of occurrence) in the image. Note that it is possible not to consider any attributes (i.e., M1 = 0), but we must always have M2 ≠ 0. Such a case, for example, arises when we consider binary images in which shapes are the objects to be recognized. We shall see such an example in the applications section when we consider the problem of signature verification.
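For concreteness, a minimal sketch (ours, not the authors' implementation) of how such a matrix can be accumulated for the simplest case of pixel pairs is given below: the two attributes are the quantized gray levels of the pair and the single relation is their rounded Euclidean distance, giving a three-dimensional array w(s(i), s(j), d(i, j)). The bin counts and the distance limit are illustrative choices.

```python
# A minimal sketch of a multidimensional co-occurrence matrix for pixel
# pairs: attributes are the two quantized gray levels, the relation is the
# rounded Euclidean distance.
import numpy as np

def pair_cooccurrence(image, n_gray_bins=10, max_dist=10):
    """Count pixel pairs by (gray bin of i, gray bin of j, distance bin)."""
    # Quantize gray levels into a small number of bins, as argued in the text.
    levels = (image.astype(float) / (image.max() + 1) * n_gray_bins).astype(int)
    h, w = levels.shape
    W = np.zeros((n_gray_bins, n_gray_bins, max_dist + 1), dtype=np.int64)
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.column_stack([ys.ravel(), xs.ravel()])
    vals = levels.ravel()
    for k in range(len(coords)):           # plain O(N^2) reference version
        dy = coords[k + 1:, 0] - coords[k, 0]
        dx = coords[k + 1:, 1] - coords[k, 1]
        d = np.rint(np.sqrt(dy * dy + dx * dx)).astype(int)
        keep = d <= max_dist
        np.add.at(W, (vals[k], vals[k + 1:][keep], d[keep]), 1)
    return W

img = np.random.default_rng(0).integers(0, 256, (32, 32)).astype(np.uint8)
print(pair_cooccurrence(img).shape)        # (10, 10, 11)
```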
The selection of the appropriate matrix type, i.e., the type of the elementary geometric pixel configuration and the attributes and relations used, is a mathematically intractable problem. It depends on the following factors:
• the image capturing device and, more generally, the image modality;
• the image type (e.g., gray scale, binary, or color image);
• the required level of detail in the image description and other special requirements (e.g., translation- and rotation-independent recognition).
We do not have any magic recipe to offer for the choice of the matrix. This indeed is a very fundamental problem in computer vision. In all the examples we shall present in the next section, the type of matrix constructed was chosen using common sense and after significant experimentation during the training phase of the algorithm. The other major factor which should affect the construction of the matrix chosen is the memory required to store such a matrix and the ability of the computer used to handle it. This problem, however, is a technology problem and it can easily be handled. Perhaps in the future hardware development will be successful enough to allow the employment of huge matrices with billions of elements if necessary. However, the use of such matrices is not considered either necessary or even desirable in most applications. This is because images contain a lot of noise, which not only manifests itself in the gray values of the pixels but indirectly affects all measurements performed on them. Thus, the possible values the various attributes and relations can take are quantized to a small number of bins, the width of which reflects the uncertainty in the calculation of these values. In the applications we shall present in the next section, the typical size of the matrix used was of the order of 10 in each axis. Thus the matrices used were not excessively large. Further computational efficiency may be achieved for the case when the co-occurrence frequency refers to the co-occurrence of the same type of attribute (e.g., a pair of gray-level values). We then define the co-occurrence matrix as expressing the frequency of a particular set of attribute values occurring at a certain distance from each other, as opposed to a certain relative position. This way we obtain features which are rotation and reflection invariant. That is, attribute values that occur at positions (i1, j1) and (i1 − Δi, j1 − Δj), for example, will be considered in the same bin as attribute values that occur at positions (i2, j2) and (i2 + Δi, j2 + Δj). In such a case a large number of the bins of the co-occurrence matrix will remain empty. The resulting efficiency in the representation of the matrix is particularly high in cases where triplets of attributes are considered.
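The two bookkeeping devices just described, coarse binning and position-independent pairing, can be made concrete with the following small sketch (illustrative only; the bin width and the key layout are our assumptions):

```python
# Coarse binning whose width encodes measurement uncertainty, and a
# rotation/reflection-invariant key that keeps only the inter-pixel distance
# plus a sorted attribute pair instead of the signed displacement.
import math

def quantize(value, bin_width):
    """Coarse binning; the bin width reflects measurement uncertainty."""
    return int(value // bin_width)

def invariant_key(attr_a, attr_b, dy, dx, bin_width=1.0):
    """Identical key for (dy, dx) and (-dy, -dx), and for swapped attributes."""
    d = quantize(math.hypot(dy, dx), bin_width)
    lo, hi = sorted((attr_a, attr_b))
    return (lo, hi, d)

# These two pairs differ by a 180-degree rotation plus an attribute swap,
# so they fall into the same bin of the matrix:
assert invariant_key(3, 7, 2, 5) == invariant_key(7, 3, -2, -5)
```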
After we have decided upon the appropriate type of matrix for the problem we want to solve, we have to decide upon the process of recognition and, in particular, upon the way the information encapsulated by the co-occurrence matrix can be used to identify an object. We argue that what makes an object recognizable for what it is, is the balanced presence in it of some specific elementary structures. Thus, for the process of object recognition, we do not try to compare the elements of the co-occurrence matrix directly. Instead, we construct some features out of them by considering all possible pairs of non-zero matrix elements and taking their ratio, which corresponds to the relative abundance of the corresponding elementary structures. To facilitate the search for matrix element pairs, the matrix elements are stored as a vector, say v(m), in raster scan format, each element having a single index. Thus, we denote a feature by r(m, n) ≡ v(m)/v(n), where its two indices denote the elements of the matrix, the ratio of which defines this particular feature. Clearly, the number of possible pairs one can consider is very large, but not prohibitively large. During the training phase of the algorithm we exhaustively choose pairs of elements and test them for their ability to discriminate between the positive and the negative examples shown. If the range of values a certain pair of elements takes for the positive examples overlaps with the range of values it takes for the negative examples, this pair is quickly discarded. During the process of feature extraction, we are not only looking for features for which the ranges of possible values do not overlap between the positive and negative examples, we are also ranking features according to how well they separate the two classes. Thus, only features that separate the positive and the negative examples into two linearly separable classes are considered [18]. Further, we do not consider for feature calculation any attribute or relation with frequency less than a certain threshold, TRS. Thus, matrix elements smaller than this threshold are ignored as representing characteristics that are not prominent enough. Matrix elements which are above this threshold are used to construct the features in the following way: Let us say that a particular feature r(m, n) takes values in the range [rp,min(m, n), rp,max(m, n)] for positive examples. We define an interval of reliability [a(m, n), b(m, n)] for this feature,

a(m, n) = rp,min(m, n) − Δ(m, n) (KR − 1)/2,
b(m, n) = rp,max(m, n) + Δ(m, n) (KR − 1)/2,

where Δ(m, n) ≡ rp,max(m, n) − rp,min(m, n). The parameter KR is a control parameter that depends on how well the particular feature distinguishes the two classes of positive and negative examples, and thus we call it the
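A compact sketch of this training-time search is given below, under an assumed data layout in which each example image has been reduced to its raster-scanned matrix vector v; the names TRS and KR follow the text, while the function itself is hypothetical:

```python
# Exhaustive search over ratios of prominent matrix elements; a candidate
# survives only if its positive and negative value ranges do not overlap,
# and each survivor gets the reliability interval [a, b] built from KR.
from itertools import permutations

def select_features(pos_vectors, neg_vectors, TRS=300, KR=1.5):
    """Return {(m, n): (r_min, r_max, a, b)} for the surviving ratios."""
    n_elem = len(pos_vectors[0])
    features = {}
    for m, n in permutations(range(n_elem), 2):
        # Ignore matrix elements that are not prominent enough (below TRS).
        if any(v[m] < TRS or v[n] < TRS for v in pos_vectors + neg_vectors):
            continue
        pos = [v[m] / v[n] for v in pos_vectors]
        neg = [v[m] / v[n] for v in neg_vectors]
        r_min, r_max = min(pos), max(pos)
        # Keep only candidates whose class ranges do not overlap.
        if not (max(neg) < r_min or min(neg) > r_max):
            continue
        delta = (r_max - r_min) or r_min     # special case r_max == r_min
        a = r_min - delta * (KR - 1) / 2     # interval of reliability [a, b]
        b = r_max + delta * (KR - 1) / 2
        features[(m, n)] = (r_min, r_max, a, b)
    return features
```

Only the surviving (m, n) pairs, with their acceptable ranges and reliability intervals, are carried over to the classification stage.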
"reliability coefficient." For the special case that rp,max(m, n) = rp,min(m, n) = r(m, n), we define Δ(m, n) ≡ r(m, n). All features that are reliable according to the above definition are used for the classification stage. Typically, the number of features used is small, perhaps about 10 or so, as will be shown in the next section. The number of images used for training is not very large either, as too many images tend to introduce unnecessary detail. Experimentation showed that a few representative images, in the range of 5–15, were enough. After the feature selection during the training phase and the definition of the range of their possible values, the unknown image of an object can easily be classified as follows: The co-occurrence matrix of the unknown image is constructed and the corresponding features are computed as was done during the training phase. A positive vote is counted every time a feature of the test image has a value within the acceptable range, and a negative vote is counted for every feature that takes a value outside the acceptable range. The total number of votes indicates the confidence with which the unknown object was classified in (or rejected from) the class of objects the system was trained to recognize. More precisely, during the testing of a certain object I, any feature which falls within the range of acceptable values [rp,min(m, n), rp,max(m, n)] for the particular class of objects contributes positively to the vote counter TM(I). Any feature the value of which is outside the acceptable range [rp,min(m, n), rp,max(m, n)] but inside the range [a(m, n), b(m, n)] is ignored as conveying unreliable information, and any feature which is outside the range [a(m, n), b(m, n)] contributes negatively to the accumulated recognition vote, as it conveys reliable counterevidence. At the end, if TM(I) is positive, the object I is recognized as belonging to the class it was tested for, and if TM(I) is negative, it does not belong to this class. The absolute value of TM(I) divided by the total number of features examined is the degree of confidence with which this object is recognized.
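The voting rule can then be sketched as follows (again illustrative; it consumes the feature dictionary produced by the selection sketch above):

```python
# Three-way voting: acceptable range votes +1, the neutral zone inside the
# reliability interval is ignored, anything outside votes -1.
def classify(test_vector, features):
    """Return (TM, confidence) for one test image vector."""
    TM = 0
    for (m, n), (r_min, r_max, a, b) in features.items():
        if test_vector[n] == 0:
            continue                  # ratio undefined; skip this feature
        r = test_vector[m] / test_vector[n]
        if r_min <= r <= r_max:
            TM += 1                   # inside the acceptable range: +1 vote
        elif a <= r <= b:
            pass                      # neutral zone: unreliable, ignored
        else:
            TM -= 1                   # reliable counter-evidence: -1 vote
    return TM, abs(TM) / max(len(features), 1)
```

A positive TM accepts the object and a negative TM rejects it, with |TM| over the number of features as the confidence, exactly as described above.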
For many applications, it is interesting to be able to identify the elementary structures in an object that make it distinct from other objects. This is the task of visualization of the characteristic elementary structures. We adopt the following process for this purpose:
• During the recognition phase, we identify the matrix elements which contribute to the calculation of the reliable features. From these matrix elements we keep those that contribute to the reliable features most often. Experimentation has shown that elements with a minimum frequency value of 0.3–0.4 are appropriate, irrespective of the type of image or the values of the control parameters chosen.
• For each pixel in the test image, we calculate the frequency with which it contributes to the calculation of these key features. We use this frequency as the probability of the particular pixel belonging to the object we wish to identify. Thus we create a gray-scale image, the "auxiliary gray image," where each pixel is labeled with the probability of belonging to the class we have just identified.
• We threshold the auxiliary gray image to create a segmented image. The segmented image is considered a binary image.
• We apply binary morphology to the thresholded image in order to obtain smoothly shaped regions.
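These visualization steps can be summarized in a short sketch; the list of contributing pixel tuples is hypothetical, and binary morphology is delegated to SciPy, which we assume to be available:

```python
# Accumulate per-pixel contribution frequencies into the "auxiliary gray
# image", threshold it, and smooth the binary mask with morphology.
import numpy as np
from scipy import ndimage

def auxiliary_image(shape, structures, threshold=0.5):
    acc = np.zeros(shape, dtype=float)
    for pixel_tuple in structures:         # e.g. pairs or triplets of (y, x)
        for (y, x) in pixel_tuple:
            acc[y, x] += 1.0               # frequency of contribution
    if acc.max() > 0:
        acc /= acc.max()                   # probability-like gray image
    mask = acc > threshold                 # thresholded, binary segmentation
    mask = ndimage.binary_opening(mask)    # smooth the region shapes
    return acc, mask
```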
3. APPLICATIONS

The methodology described above was exhaustively tested on a variety of problems and on hundreds of images. We shall discuss here in detail the peculiarities of each problem we tackled. For the construction of the co-occurrence matrices, the following attributes were considered:
• The gray-level value s(i) of a pixel i (single-pixel indexing is used for convenience).
• The magnitude of the local gradient q(i), computed using the Sobel masks.
• The surface slope f(i) (i.e., the angle between the reference plane and the local tangent to the surface plane) when we deal with range data.
Further, the following relations were considered:
• The Euclidean distance d(i, j) between two pixels i and j.
• The angle a(i, j) between the local gradient directions at pixel positions i and j.
• The length u(i, j) of a contour that connects pixels i and j.
• The angle a(i, j) between the tangents of a contour at points i and j connected by the contour.
• The ratio t(i, j) ≡ u(i, j)/d(i, j) of the length of a contour and the length of the chord between two pixels i and j that are connected by the contour.
• The normalized length c(i, j) of a contour segment that connects pixels i and j, i.e., the length u(i, j) of a contour that connects pixels i and j divided by the length of the closed contour to which this segment belongs.
• The distance dY(i, j) between two pixels i and j along the Y axis only. This relation is only used for the special case when we deal with ultrasound M-regime images, where the distance along the Y axis represents distance inside the patient's body and time is measured along the other axis.
Thus, the following co-occurrence matrices were defined and their applicability to the various problems was examined:
W1: w(s(i), s(j), d(i, j))
W2: w(s(i), s(j), a(i, j), d(i, j))
W3: w(q(i), q(j), a(i, j), d(i, j))
W4: w(f(i), f(j), d(i, j))
W5: w(s(i), s(j), s(k), d)
W6: w(c(i, j), t(i, j), a(i, j))
W7: w(u(i, j), d(i, j), a(i, j))
W8: w(d(i, j), d(i, k), d(j, k))
W9: w(s(i), s(j), dY(i, j))
In the above definitions, d represents the side length of the equilateral triangle formed by the triplet of pixels considered.
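To make the triplet matrices concrete, the sketch below (one possible discretization, ours rather than the authors' code) populates a W5-type matrix by placing approximately equilateral triangles of side d at a few sampled orientations around every pixel; sorting the gray-level triple makes the count independent of the triangle's orientation and vertex labeling:

```python
# W5-style triplet counts over approximately equilateral triangles.
# `levels` is assumed to be an image already quantized to ints in [0, n_bins).
import math
import numpy as np

def triplet_cooccurrence(levels, n_bins, dists=(2, 4, 6), n_orient=6):
    h, w = levels.shape
    W = np.zeros((n_bins, n_bins, n_bins, len(dists)), dtype=np.int64)
    for di, d in enumerate(dists):
        for t in range(n_orient):
            ang = 2 * math.pi * t / n_orient
            # Offsets of the two remaining vertices; both lie at distance d
            # from the pixel and at distance d from each other.
            p1 = (round(d * math.sin(ang)), round(d * math.cos(ang)))
            p2 = (round(d * math.sin(ang + math.pi / 3)),
                  round(d * math.cos(ang + math.pi / 3)))
            for y in range(h):
                for x in range(w):
                    y1, x1 = y + p1[0], x + p1[1]
                    y2, x2 = y + p2[0], x + p2[1]
                    if 0 <= y1 < h and 0 <= x1 < w and 0 <= y2 < h and 0 <= x2 < w:
                        a, b, c = sorted((levels[y, x], levels[y1, x1],
                                          levels[y2, x2]))
                        W[a, b, c, di] += 1   # sorted triple: orientation-free
    return W
```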
We shall now discuss some particular applications of our method which make use of the above defined matrices.

3.1. Texture Classification

As a first example of an application of our method, we shall consider the case of CT brain scans which had to be classified either as normal or as pathological. We had 18 such images, 8 of which were normal and were used as negative examples, and 10 of which were pathological and were used as positive examples. In all of them the brain was isolated by some preprocessing method and the task was to classify it to the appropriate class. The matrices used were W1 and W5. During the training phase, the reliability coefficient KR was set equal to 1.5 for the W1 matrices and 2.1 for the W5 matrices. Threshold TRS was equal to 300 in both cases. The features obtained are shown in Table 1.

TABLE 1
Brain Pathology Features

W1 matrices (v(m)/v(n) = X/Y):
  1   X = w(6, 6, 2)   Y = w(7, 4, 9)
  2   X = w(6, 6, 2)   Y = w(7, 4, 10)
  3   X = w(6, 6, 3)   Y = w(7, 4, 9)
  4   X = w(6, 6, 3)   Y = w(7, 4, 10)
  5   X = w(6, 6, 4)   Y = w(7, 4, 9)
  6   X = w(6, 6, 4)   Y = w(7, 4, 10)
  7   X = w(6, 6, 5)   Y = w(7, 4, 9)
  8   X = w(6, 6, 5)   Y = w(7, 4, 10)
  9   X = w(6, 6, 6)   Y = w(7, 4, 9)
 10   X = w(6, 6, 6)   Y = w(7, 4, 10)
 11   X = w(6, 6, 7)   Y = w(7, 4, 10)
 12   X = w(6, 6, 8)   Y = w(7, 4, 10)
 13   X = w(6, 6, 9)   Y = w(7, 4, 10)

W5 matrices (v(m)/v(n) = X/Y):
  1   X = w(6, 6, 6, 2)   Y = w(7, 6, 4, 9)
  2   X = w(6, 6, 6, 3)   Y = w(7, 6, 4, 10)
  3   X = w(6, 6, 6, 4)   Y = w(7, 6, 4, 9)
  4   X = w(6, 6, 6, 4)   Y = w(7, 6, 4, 10)
  5   X = w(6, 6, 6, 5)   Y = w(7, 6, 4, 9)
  6   X = w(6, 6, 6, 5)   Y = w(7, 6, 4, 10)
  7   X = w(6, 6, 6, 6)   Y = w(7, 6, 4, 9)
  8   X = w(6, 6, 6, 6)   Y = w(7, 6, 4, 10)
  9   X = w(6, 6, 6, 7)   Y = w(7, 6, 4, 9)
 10   X = w(6, 6, 6, 7)   Y = w(7, 6, 4, 10)
Note that intensity gray-level numbers and distances are used to identify the matrix elements. For example, for the W1 matrix, w(7, 4, 9) is the number of image pixel pairs with intensity levels of 7 and 4 which are at a distance of 9 pixels from each other. It is worth noticing that very few matrix elements contribute to the features chosen. This is a very general observation, found to be true in many applications, and in particular medical ones, independent of the image type (modality), the matrix type, and the control parameter values. To facilitate the discussion, in what follows matrix elements v(m) will be referred to as X-structures, while matrix elements v(n) will be referred to as Y-structures, in relation to the ratio v(m)/v(n). We notice from Table 1 that the differences between normal and pathological images are defined by the structures formed by pixels with intensity levels of 6, 7, and 4. The main features distinguishing the pathology from the norm are certain quantitative relations (balance) between the X- and Y-structures for both matrix types. It was established during the training phase that the matrix elements themselves take values in overlapping ranges, so they cannot be used as discriminating features between the different classes. This observation remains valid even after various ways of normalization were tried [19]. The W1 features are less stable than the W5 features: in the first case no features can be found for KR > 1.5, while in the second case no features can be found for KR > 2.3. In this application, and in all other applications described in this section, the validation of the quality of the result is based on the "leave-one-out" scheme [20]. In this scheme, all the images except one are used as a training set. (It is assumed that the classes of all images are known.) The image left out is then classified. Next, the image left out is reinstated into the training set and another image is selected for recognition. The process is repeated until all the images have been classified. The classification results for each CT brain image with the W5 matrix are shown in Table 2. As seen from Table 2, all eight normal images have been correctly recognized. From the 10 pathology images, only one image (number 16) has been wrongly classified as normal, whereas another (number 9) has not been classified at all, because its ratios r(m, n) for all features happened to fall in the "neutral zone" between the norm and pathology ranges, [a(m, n), rp,min(m, n)]. It should be noted that the feature set in 12 out of the 18 cases agrees completely with the feature set obtained when all 18 images were used for training (Table 1). Having recognized the pathological brains, the next step is to identify the regions with the pathological lesions, i.e., the pathological features. We notice from Table 1 that matrix element w(7, 4, 10) appears most frequently (8 times) in the calculation of pathological features with the help of matrix W1, while the next most frequent element is w(7, 4, 9) (with frequency 5).
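For completeness, the leave-one-out scheme can be sketched in a few lines, reusing the hypothetical select_features and classify helpers sketched in Section 2:

```python
# Leave-one-out validation: train on all images but one, classify the one
# left out, and repeat for every image.
def leave_one_out(vectors, labels):
    """vectors: raster-scanned matrices; labels: True for pathology."""
    results = []
    for i in range(len(vectors)):
        train = [(v, y) for j, (v, y) in enumerate(zip(vectors, labels)) if j != i]
        pos = [v for v, y in train if y]          # positive examples
        neg = [v for v, y in train if not y]      # negative examples
        feats = select_features(pos, neg)
        TM, conf = classify(vectors[i], feats)
        results.append((labels[i], TM, conf))     # true label vs. vote result
    return results
```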
TABLE 2
Result of Brain Pathology Recognition

Image   Number of features   TM(I) sum   Measure (%)
Norm:
  1            13               −10          −76.9
  2            10               −10         −100.0
  3            10               −10         −100.0
  4            10               −10         −100.0
  5            10               −10         −100.0
  6            10               −10         −100.0
  7            16               −10          −62.5
  8            10               −10         −100.0
Pathology:
  9            11                 0            0.0
 10            10               +10         +100.0
 11            10               +10         +100.0
 12            10               +10         +100.0
 13            12               +10          +83.3
 14            10               +10         +100.0
 15            10               +10         +100.0
 16            36               −22          −61.1
 17            17               +10          +58.8
 18            10               +10         +100.0
When matrix W5 is used, the two most frequent elements are w(7, 6, 4, 9) and w(7, 6, 4, 10). During the testing phase we compute the number of times a pixel contributes to these pathological matrix elements and, at the end, we threshold the accumulator array thus created to identify the pathological pixels. Some typical results of this process are shown in Fig. 1, where not only the two most frequent pathological matrix elements were used, but all of those appearing in Table 1 under matrix W5 (i.e., all pixels that belonged to triplets that contributed to at least one of the pathological features were counted when calculating the output accumulator array). No postprocessing, apart from thresholding, has been used in these images. Outputs like these could be very useful for the clinician. Note how prominent some circular structures are in these thresholded outputs. This happens because the central pixel is the pathological one and it participates in several triplets formed with possibly "healthy" pixels at a specific distance from it. If necessary, a postprocessing step can be applied whereby these central pathological pixels are fully isolated. We shall see in the next section that this method has been successfully used in the identification of texture defects as well. For the task of texture classification some other applications were also considered: the classification of material surfaces seen under various types of microscope, and the identification of forest type by analyzing the canopy texture as seen in aerial photographs. The former application ultimately relates to problems of identifying material wear, while the latter relates to problems of monitoring the environment with remote sensing.
FIG. 1. On the left, two brains with pathology and, on the right, the pathological pixels enhanced. Notice the circular structures which identify the pathological elements in their centers.
Each available set of images was divided into training images and testing images. Every time, all possible such divisions of the image set were considered; thus a very large number of experiments was performed. The microscope images used were 256 × 256 and they were of three different modalities: 42 images from a scanning electron microscope (SEM), 12 images from an optical microscope (OM), and 15 images from a scanning probe microscope (SPM). The scanning electron microscope was from the JEOL series. These microscopes create images where the gray-level value corresponds to the surface slope angle (i.e., the "reflected electron beam regime"). All were images of various materials and the task was their automatic identification. For the SEM images, there were seven classes to be identified, namely steel, aluminum, copper-brass, a type of plastic, and two types of ceramic. The classes to be identified from the OM images were three, namely steel, copper-brass, and plastic, and the classes to be discriminated from the probe microscope images were gold foil and two types of aluminum hard disk base after special high-quality processing by two different technologies [21]. The magnification of the OM and SEM microscopes was ×300, which results in textured images. The SPM images are actually range data. Each image was divided into four 128 × 128 images which were used as independent images to enhance the size of the dataset.
For the training stage, 3–5 positive and 5–15 negative example images were used, and the remaining images were used for testing. All possible combinations of training and testing images were used. Matrices W1, W2, and W3 were used for the SEM and OM images, while matrix W4 was used for the SPM images. The best results were produced by matrix W2 for the OM images and by matrix W3 for the SEM images. In all test runs these matrices classified all test images correctly, with the confidence assigned to the correct class higher than 75%. Typical bin widths for which good results were obtained are Δs = 13, 26, 50, Δq = 40, 80, Δa = 10, 15, 30, Δf = 0.5, 3, and d = 1, . . . , 10. A typical experiment with some SEM images is shown in Fig. 2. The three positive examples (PE) and the first of the test images were all of the same material. Our system was trained to recognize this material, and the chart with the results shows how confidently the images of other materials were rejected. For the forest texture classification, 30 images were available, of various sizes (100 × 100 or 40 × 320). The different forest areas were already segmented by hand and the task was to classify these image patches into one of four possible classes: pine, fir, leaf-tree, and bush. All the images were obtained under strictly the same conditions; thus both matrices that were tried, W1 and W3, produced equally good results, with typical bin size values Δs = 26, Δq = 70, and d = 1, . . . , 20. In all cases the correct class was picked, but sometimes the confidence in it was not very high, perhaps around 50%. As the images used were obtained under the same lighting conditions, matrix W1 could be used. However, for images in which the illumination conditions vary, it is envisaged that only matrix W3 would produce good results. Finally, to demonstrate the power and sensitivity of the method, we set up the following problem: We used two CT images of the liver of a healthy patient, one taken from the liver port region and one near the gallbladder region. These two images are shown in Figs. 3a and 3c. Four subimages were extracted from each of them from the liver region, and these are shown in Figs. 3b and 3d, respectively. The problem was to use three subimages from each category to train the system and to try to classify the remaining two in the corresponding class. All possible combinations of training and testing images were used, and in all cases our method successfully classified the test images in the appropriate class with more than 75% confidence, in spite of the fact that the human eye can hardly discriminate between the two texture types. The matrices used for these experiments were W2 and W5, with Δs = 3, Δq = 10, and Δa = 15. The W5 matrices in general seemed to perform better. The same method was also tested for the identification of liver tumors using ultrasound images. In all, 56 images were available and the task was to classify them as normal or pathological. The most promising results were obtained with the W3 matrices with Δq = 40, Δa = 10, 30, and d = 1, . . . , 10. However, even these results were not at all satisfactory. Similar experiments with eight images of the heart, used for the recognition of heart disease (aortic stenosis), were disappointing. In this case matrix W9 was used with Δs = 13, 26, dY = 1, . . . , 15, and bin width ΔdY = 3 for dY. We concluded that our method in its present form is not appropriate for the analysis of ultrasonic images, perhaps due to the very high levels of noise these images contain and the high expertise required to interpret them. We speculate that such images could possibly be analyzed with our method if much higher-order N-tuples were used, to take into consideration more "global" information contained in them.
FIG. 2. (a) SEM images of some materials: PE, positive examples used for training; NE, negative examples used for training. (b) Test images, only the first one is of the same material as the positive examples. (c) Confidence with which each of the test images was recognized.
FIG. 3. (a) A CT cross section of the liver of a healthy patient from the region near the "liver port". (b) Four subimages of the previous image from the liver region. (c) A cross section of the liver of the same patient from the region near the gallbladder. (d) Four subimages of the previous image from the liver region.
3.2. Texture Defect Detection

In this task, we try to identify parts of a textured image that could be defects. The basic idea is that if, for a particular test image, some of the chosen features keep taking irregular values, we identify the pixels that contribute to the calculation of these features and, in an output image where initially all pixels have zero value, we increment their values. Abnormal pixels will pair with many normal pixels to give rise to abnormal values of various features, and they will have their values persistently incremented, so that a simple thresholding of the output image at the end will identify them. This technique was used for the inspection of two types of hard disk base, produced by two different surface technologies. There were six images available, taken with a SEM at ×1000 and ×3000 magnifications for the two types of hard disk. Matrices W2 and W3 were tried, with Δs = 26, Δq = 70, Δa = 15, and d = 1, . . . , 20. Both gave very good results, with all the defective regions correctly identified with confidence higher than 85%. Some typical results of this series of experiments are shown in Fig. 4. The same method was used for the detection of faults in SEM images of integrated circuits. There were eight images available for this task, none of which had any defects. There are three basic types of defect that could be present in integrated circuits: connection line breaks, unwanted connections (bridges) between connection lines, and wrong distance between two parallel connections. Such defects were simulated on seven of the eight images, while the eighth image was kept as the reference acceptable image. The most successful matrix was W3 with Δq = 70, Δa = 15, and d = 1, . . . , 20. The use of this matrix could correctly identify the faulty regions with confidence better than 50%, and often better than 90%, for the first two types of fault. The fault related to the wrong distance between two connections could not be identified with our method. However, our method was very robust to image rotation (by whatever angle) and to noise and aliasing effects. Figure 5 shows some typical results. Finally, we set up the following experiment: In the image of a certain type of forest, we embedded a small region from a different forest type. Our method could identify the atypical patch of the forest we had put in. The results of this experiment are shown in Fig. 6.
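The defect-detection loop itself reuses the accumulator idea of the visualization step of Section 2; a sketch, with hypothetical bookkeeping of which matrix bin each counted structure fell into, is:

```python
# While the test matrix is built, each counted elementary structure is
# recorded together with its matrix bin; bins belonging to features with
# out-of-range values then vote for their pixels in an accumulator.
import numpy as np

def defect_mask(shape, structure_bins, bad_bins, threshold):
    """structure_bins: iterable of (pixel_tuple, bin_index) records."""
    acc = np.zeros(shape, dtype=np.int64)
    for pixel_tuple, bin_index in structure_bins:
        if bin_index in bad_bins:
            for (y, x) in pixel_tuple:
                acc[y, x] += 1      # abnormal pixels keep being incremented
    return acc >= threshold         # simple thresholding marks the defects
```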
FIG. 4. (a) The textured image of a hard disk base with no defects. (b) Defective image of a hard disk base. (c) Defective regions identified as a binary mask. (d) The defective regions identified on the original image.
FIG. 5. (a) The image of an integrated circuit with no defects. (b) The same image as in (a) rotated by an arbitrary angle with a fault added at the highlighted connection. (c) The faulty connection identified (accumulator array). (d) The faulty connection identified on the original image.
FIG. 6. (a) An image of a patch of forest. (b) The same image as in (a) with a different patch of forest grafted in the middle. (c) The accumulator array where the defective features are enhanced. (d) The suspected regions of another texture identified on image (b) by the thresholded accumulator array.
3.3. 2D Shape Recognition

In this application, we assume that the image has been segmented and that connected contours have been created, and the issue is to recognize the shapes of contours irrespective of rotation, translation, and scaling. Various applications were considered: the classification of luminophore grains into good and defective ones from their SEM images (23 images were available for this task); the classification of guns from the shape of their semiburnt gunpowder grains from OM images scanned from photos in a crime-related book (21 images from three different types of guns, 7 for each type, were available for this task); the classification of plums into three classes from camera images (15 images of three different varieties of plums, 5 images per variety, were available for this task); and finally the verification of signatures from computer scans (52 images were available, consisting of samples of genuine signatures from two different people and samples of corresponding forged ones).
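Before discussing the individual applications, the contour relations u, d, t, and c used by the W6 and W7 matrices can be sketched as follows, assuming the segmented shape is available as an ordered list of boundary points of a closed contour (taking the shorter of the two arcs between the points is our assumption):

```python
# u is arc length along the contour, d the chord length, t = u/d the
# tortuosity of the segment, and c the arc length normalized by the
# contour perimeter.
import math

def contour_relations(points, i, j):
    n = len(points)
    step = lambda a, b: math.dist(points[a % n], points[b % n])
    perimeter = sum(step(k, k + 1) for k in range(n))
    arc = sum(step(k, k + 1) for k in range(i, i + (j - i) % n))
    u = min(arc, perimeter - arc)          # shorter way around the contour
    d = math.dist(points[i], points[j])
    t = u / d if d else float("inf")       # ratio of contour to chord length
    c = u / perimeter                      # normalized segment length
    return u, d, t, c
```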
FIG. 7. (a) The contour of a perfect grain. (b) The contour of a faulty grain of the same type seen with left-to-right reflection and at a different scale. (c) The faulty part of the contour (b) identified. Underneath the two contours the corresponding orientation histograms are shown. They are used for the construction of the co-occurrence matrices we use, two corresponding cross sections of which are shown below the histograms. The arrow in the second cross section indicates the place where the two cross sections differ.
For the visual inspection of luminophore grains, the "good" grains are characterized by a smooth shape, while the "bad" grains contain high-curvature points. At this stage, occluded grains were ignored and only 23 complete contours, obtained from the gray images with the help of standard tracing and contouring methods, were used. An example of a "good" grain is shown in Fig. 7a. A "bad" grain, shown in Fig. 7b, is the scaled-down mirror image of the "good" one. The corresponding orientation histograms of the two contours are also shown in Fig. 7, together with two corresponding cross sections of the co-occurrence matrices considered. Figure 7c shows the part of the contour which was highlighted as the defective part, according to the method we described in the texture defect detection section.
For this application and the one concerning gunpowder grains, matrix W6 was used. In all cases the contour was classified in the correct class with a confidence of more than 75%, with parameter values Δc = 0.05, 0.1, Δt = 0.1, 0.32, 0.84, and Δa = 10, 12, 15. The only exceptions were some gunpowder grains which, though classified in the correct class, were classified with a confidence of just over 50%. For the classification of plums, matrix W7 was used with bin sizes Δu = 2, 4 mm, Δd = 1, 2 mm, and Δa = 10, 15, and correct classification was obtained with a confidence of more than 80%. The problem of signature verification is one of the most challenging ones. Figure 8 shows some examples of genuine signatures and good forgeries. These examples correspond to a genuine case of fraud investigated by the police. The matrix used for this problem was W8 with d = 1, 2, . . . , 26 and bin size Δd = 4, 8, 10. These values correspond to a matrix of size 26 × 26 × 26, which amounts to 17576 elements. However, due to the redundancy we discussed earlier, arising because the same kind of quantity is measured along each axis, only 3272 elements of this matrix are distinct and have to be computed.
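The redundancy count can be checked with a short sketch: since the three inter-pixel distances of a triplet play interchangeable roles, a sorted triple of quantized distances can serve as the canonical bin key, giving C(28, 3) = 3276 unordered triples out of 26^3 = 17576 ordered ones (the paper's figure of 3272 is slightly lower, presumably because a few binned triples cannot be realized geometrically):

```python
# Canonical W8 bins: sorted triples of quantized distances.
from itertools import combinations_with_replacement

def w8_bin(d_ij, d_ik, d_jk):
    """Canonical key: the sorted triple of quantized distances."""
    return tuple(sorted((d_ij, d_ik, d_jk)))

n_bins = 26
distinct = sum(1 for _ in combinations_with_replacement(range(n_bins), 3))
print(distinct, n_bins ** 3)   # 3276 unordered triples vs. 17576 ordered
```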
FIG. 8. (a) Signatures of a person. (b) Forged versions of the above signatures. (c) Confidences with which the genuine signatures that were not used for training and the forged signatures were recognized.
Of the genuine signatures given in Fig. 8, all were used for training except L7, L11, L19, L23, and L29. The testing of the system was done on the "leave-one-out" basis, i.e., every time all forged signatures were used for training except one, which was classified afterward. The confidence of the classification of the genuine signatures which were not used for training, and of the forged signatures, is presented in Fig. 8c. It can be seen that the system confidently recognized all genuine signatures, and some forged ones, as such. A couple of forged signatures were classified as genuine, but with a very low confidence level. From this example, and from the second set of signatures we used, it became obvious that our method is very good for genuine/forgery signature verification when the forged signature differs from the genuine one in the density of the writing. When the difference is only in some details of the signature, our method fails. However, what usually distinguishes different handwritings is exactly the frequency of the vertical movement of the pen.

4. CONCLUSIONS
We have argued here that what specifies the identity of an object is the relative abundance of some basic elementary geometric structures. Not all basic structures of a certain type are equally important, and we have also argued for using only those which distinguish the object from anything else. Such features, when identified, lead to linearly separable classes, and then a simple classifier suffices for the confident identification of a certain object. As a powerful way of representing the properties of the elementary geometric structures, we proposed the use of co-occurrence matrices, which are like highly dimensional histograms in a feature space where certain attributes of individual pixels, or relations between N-tuples of pixels, are measured along the different axes. Other approaches to classification, where clustering is performed in this multidimensional space, inevitably have to use distances, and an appropriate metric has to be chosen that takes into consideration the different dimensions of the quantities measured along the different axes. On the contrary, our approach makes use of ratios of matrix/histogram elements, i.e., relative abundances, thus eliminating the need to specify a metric and measuring units. The use of multidimensional histograms, and in particular the exhaustive search we make for the identification of discriminating N-tuples, may sound excessive in terms of computer resources. We argue, however, that because of noise and intraclass variability, not all such elements should be retained, and the various quantities measured should be quantized using a relatively small number of bins, say of the order of 10. We have demonstrated our method in a variety of applications, including texture analysis and shape recognition. In all cases very good results were obtained using a small number of features (i.e., relative abundances of elementary structures), usually of the order of 10. We also proposed a method for identifying the elementary structures which contribute to the rejection of an object from a certain class. This method is based on identifying the pixels/elements which contribute to abnormal values of features and enhancing them in a properly constructed accumulator array. We showed that this approach can be used for the identification of pathological cells in some medical applications and also for the identification of faults in various textured images. In some applications, for example the case of recognizing the location of a liver cross-section image using texture features, the task was performed using features which the human vision system obviously does not use, since to the naked eye the discriminated textures looked identical. Obviously, the success of the method relies on the correct choice of attributes and relations and on the availability of a sufficient number of examples and counterexamples for each class. The correct choice of attributes and relations is clearly a matter of domain-dependent experience and knowledge of the problem in hand.
ACKNOWLEDGMENTS

The authors wish to thank Prof. A. G. Mrochek and Drs. S. N. Nefedov, S. I. Pimanov, A. A. Kovalev, A. Yu. Sasov, S. A. Chizhik, H. S. Ahn, and E. E. Gutman for providing the initial image data. Special thanks to Drs. N. V. Mytzik and A. Ya. Grigoriev for contour preparation and discussion of results.

REFERENCES

1. T. F. Cootes, A. Hill, C. J. Taylor, and J. Haslam, The use of active shape models for locating structures in medical images, Image Vision Comput. 12(6), 1994, 355–366.
2. J. Kittler, W. J. Christmas, and M. Petrou, Probabilistic relaxation for matching problems in computer vision, in Proceedings, 4th International Conference on Computer Vision, 1993, pp. 666–673.
3. T. L. Kunii, S. Weyl, and I. M. Tenenbaum, A relational database schema for describing complex scenes with colour and texture, in Proceedings, 2nd International Joint Conference on Pattern Recognition, Copenhagen, 1974, pp. 310–316.
4. K. Y. Song, J. Kittler, M. Petrou, and I. Ng, Chromato-structural approach towards surface defect detection in random textured images, in Proceedings, Machine Vision Applications in Industrial Inspection II, SPIE 2183, 1994, pp. 193–204.
5. S. Z. Li, J. Kittler, and M. Petrou, Automatic registration of aerial photographs and digitized maps, Optical Eng. 32, 1993, 1213–1221.
6. H. Niemann, Pattern Analysis and Understanding, Springer-Verlag, Berlin/New York, 1990.
7. A. C. Shaw, A formal picture description scheme as a basis for picture processing systems, Inform. Control 14, 1969, 9–52.
8. M. Sharples, D. Hogg, C. Hutchinson, S. Torrance, and D. Young, Computers and Thought: A Practical Introduction to Artificial Intelligence, MIT Press, Cambridge, MA, 1989.
9. H. Freeman, On the encoding of arbitrary geometric configurations, IRE Trans. Electronic Comput. EC-10(2), 1961, 260–268.
10. R. M. Haralick, K. Shanmugam, and I. Dinstein, Textural features for image classification, IEEE Trans. Systems Man Cybernet. SMC-3(6), 1973, 610–621.
11. M. M. Galloway, Texture analysis using gray level run lengths, Comput. Graphics Image Process. 4, 1975, 172–179.
12. L. Van Gool, P. Dewaele, and A. Oosterlinck, Survey: Texture analysis anno 1983, Comput. Vision Graphics Image Process. 29, 1985, 336–357.
13. J. S. DaPonte and P. Sherman, Classification of ultrasonic image texture by statistical discriminant analysis and neural networks, Computerized Med. Imaging Graphics 15(1), 1991, 3–9.
14. A. Visa, Unsupervised image segmentation based on a self-organizing feature map and a texture measure, in Proceedings, 11th International Conference on Pattern Recognition, The Hague, The Netherlands, 1992, Vol. 3, pp. 101–104.
15. J. F. Haddon and J. F. Boyce, Image segmentation by unifying region and boundary information, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-12(10), 1990, 929–948.
16. J. F. Haddon and J. F. Boyce, A relaxation computation of optic flow from spatial and temporal co-occurrence matrices, in Proceedings, 11th International Conference on Pattern Recognition, The Hague, The Netherlands, 1992, Vol. 3, pp. 594–597.
17. J. F. Haddon and J. F. Boyce, Co-occurrence matrices for image analysis, Electronics Commun. Eng. J. 4, 1993, 71–83.
18. R. M. Haralick and L. G. Shapiro, Glossary of computer vision terms, Pattern Recognit. 24(1), 1991, 69–93.
19. V. A. Kovalev, Feature extraction and visualization methods based on image class comparison, in Proceedings, Medical Imaging 1994: Image Processing, SPIE 2167, 1994, pp. 691–701.
20. D. Patel and T. J. Stonham, Texture image classification and segmentation using RANK-order clustering, in Proceedings, 11th International Conference on Pattern Recognition, The Hague, The Netherlands, 1992, Vol. 3, pp. 92–95.
21. V. A. Kovalev and S. A. Chizhik, On the oriental structure of solid surfaces, J. Friction Wear 14(2), 1993, 45–54.