Classification of silhouettes using contour fragments

Mohammad Reza Daliri a,b,*, Vincent Torre a

a SISSA/ISAS, Via Beirut 2-4, 34014 Trieste, Italy
b Cognitive Neuroscience Lab., German Primate Center, Kellnerweg 4, 37077 Göttingen, Germany

* Corresponding author. Address: Cognitive Neuroscience Lab., German Primate Center, Kellnerweg 4, 37077 Göttingen, Germany. Fax: +49 551 3851452. E-mail addresses: [email protected] (M.R. Daliri), [email protected] (V. Torre).
Article history: Received 9 November 2008; Accepted 7 May 2009; Available online 18 May 2009.

Keywords: Shape recognition; Fragment-based approach; Symbolic representation; PCA; SVM; Kernel method
Abstract

In this paper, we propose a fragment-based approach for the classification and recognition of shape contours. In this method, the perceptual landmarks along the contours are first localized in a scale-invariant manner, which makes it possible to extract the contour fragments. Using a predefined dictionary for the fragments, these landmarks and the parts between them are transformed into a compact symbolic representation. Using a string kernel-like approach, an invariant high-dimensional feature space is created from the symbolic representation, and the most relevant lower dimensions are then extracted by principal component analysis. Finally, a support vector machine is used for classification in the feature space. The experimental results show that the proposed method performs on a par with the best approaches for shape recognition while having lower complexity. © 2009 Elsevier Inc. All rights reserved.
1. Introduction

The ultimate aim of computer vision is to build an artificial system with the capabilities of the human visual system. To this end, object recognition is one of the fundamental problems in vision to be solved. Humans can recognize an object under many different conditions (different positions in the scene, orientations, poses, sizes, illumination conditions, etc.), while dealing with all of these variations is a difficult task for an artificial system. Humans can also easily determine the category of an object. In terms of visual perception, this is an even more difficult task than recognition for an artificial system. The most important reasons that make this task difficult are the natural variability within a category and the different levels at which categorization can be performed [1]. The most straightforward way to deal with the problem of categorization is to use fragment-based approaches [2]. An object has many different features, like color, texture and shape, that can be used for recognition. Shape is probably the most important perceivable feature of an object. According to psychological studies [3], the surface characteristics of an object play a secondary role in object recognition, and real-time object recognition is mediated by edge-based information. In this paper, a new representation for shape-based recognition, based on the extraction of perceptually relevant fragments, is proposed. In this approach, each shape is transformed into a symbolic representation, using a predefined dictionary for
the contour fragments, which is later mapped to an invariant high-dimensional space used for recognition. Fig. 1 shows a flowchart of the steps of the proposed algorithm. This manuscript is organized in nine sections. In the next section we review shape recognition algorithms and fragment-based approaches to the problem of shape recognition. We present our algorithm for extracting the contour fragments and transforming shapes into strings of symbols in Section 3. Section 4 describes the creation of the high-dimensional invariant feature space. Section 5 shows how to reduce the dimensionality of the feature space. We describe the classification method in Section 6. Section 7 presents the complexity analysis of the algorithm, and the experimental results are given in Section 8. We conclude the paper in Section 9.

2. Previous approaches

Shape recognition has been studied extensively in the past; we review some of those approaches in this section. The visual system can use local, informative image fragments of a given object, rather than the whole object, to classify it into a familiar category [2]. Fragment-based approaches use object parts that capture commonly occurring object features and, at the same time, account for intra-class variations. In [4], informative patches in the images are derived from the training examples and used as fragments. Bouchard and Triggs [5] used a generative model that codes the geometry and appearance of generic object categories as a loose hierarchy of parts, with scale-invariant keypoint-based local features at the lowest level of the hierarchy. Probabilistic spatial relations were used to link parts to subparts, with soft assignment of subparts to parts. Ullman
Fig. 1. Flowchart of the algorithm showing different steps of the proposed method.
et al. [6] extracted a set of features of intermediate complexity (which they called fragments) from a set of training images; these features were optimal building blocks for encoding those images. The fragments were extracted by maximizing the information delivered about a set of objects, using a search procedure. A number of recent approaches to object recognition rely on the edge content of categories. There are two main approaches to contour-based recognition: local or fragment-based approaches, and holistic or global methods. In [7] a hierarchical chamfer algorithm was used for matching edges by minimizing a generalized distance between them; the matching was performed in a series of images of the same scene at different resolutions. Belongie et al. introduced the shape context [8], a descriptor developed for finding correspondences between point sets. Given a set of points from an image (e.g. extracted from a set of detected edge elements), the shape context captures the distribution of points in the plane relative to each point on the shape. A shape is represented by a discrete set of points sampled from the internal or external contour of the object, and the shape contexts are used as attributes in a weighted bipartite matching problem. In [9] an approach was proposed for partitioning a natural image into regions based on the shock graph of its contour fragments. The shock graph [10] is a variant of the medial axis of the contour map of an image, obtained by viewing the medial axis as the locus of singularities formed in the course of wave propagation from boundaries. Shotton et al. [11] presented a visual recognition system based on local contour features. This system builds a class-
specific codebook of local contour fragments using a chamfer-matching algorithm. Boosting combines these fragments into a cascaded sliding-window classifier, and mean shift is used to select strong responses as the final set of detections. In [12] a part-based approach for the classification of contour shapes was proposed, in which Bayesian inference is used to perform classification within a three-level statistical framework comprising models for contour segments, for classes and for the entire database. Another probabilistic approach was presented in [13]: unlabeled point sets were used for shape representation, and a background model was used to handle occlusion, significant dissimilarities between shapes and image clutter. The model could learn part decompositions and soft correspondences simultaneously; the algorithm is an extension of Procrustes matching to unlabeled point sets. Daliri and Torre [14] also used Procrustes analysis for shape matching, but with a different approach to occlusion that preserves the order of the contour points. The shape context was used to measure the similarity between point sets over the contours, dynamic programming was applied to find the best correspondence between the point sets and remove outliers, and Procrustes matching was then applied to the corresponding points. Finally, the edit distance was used (on the symbolic forms of the matched points) to measure the similarity between shapes. Latecki and Lakamper [15] used the correspondence of visual parts (for shape contours) as a cognitively motivated shape similarity measure. In their approach, a discrete curve evolution method was used as a filter for shape comparison and as a basis for shape decomposition into visual parts. A polygonal approximation of shape contours was considered for shape representation in [16]. In this approach, the contours were divided into equal segments and all segments served as local features for shape matching; an elastic comparison of shape segments was used as the similarity measure between shapes. There is a trade-off between the accuracy and the complexity of a recognition algorithm. Most of the approaches discussed above focus on recognition performance, while the aim of this manuscript is to propose a compact representation for shape recognition that has lower complexity than the previous methods while keeping its recognition performance at the level of the best approaches proposed in the past. Because the proposed approach uses boundary fragments for contour representation, it is useful for the categorization task too. The algorithm uses a classification approach, namely support vector machines; therefore, unlike some of the methods, e.g. [14,16], that rely on pair-wise similarity measures, it cannot be applied to retrieval.

3. Extracting the local boundary features (fragments)

In this section we describe how to extract the contour boundary fragments, which are invariant with respect to the scale of the shapes. The proposed approach uses the boundary curvature, computed in a robust manner. The boundary landmarks, which are the bases for the fragments, can be extracted from the curvature profile; later, using a predefined dictionary for the fragments, the boundary can be turned into a symbolic representation. In the following subsections, the transformation of the contours into a symbolic representation through the computation of curvature is explained.

3.1. Scale-invariant curvature computation
Curvature is one of the most important features for describing a shape by its contour. Recent findings in the neurophysiology of shape representation in area V4 (part of the ventral pathway in the primate visual cortex) show that neurons in this area are
tuned for boundary curvature and the relative position of shape components [17]. Following this finding, our aim is to demonstrate the possibility of shape recognition using contour fragments extracted from the curvature profile. The proposed approach belongs to the contour-based methods for shape recognition, so we apply our method to databases of silhouettes in which each image contains just one shape, completely separated from the background. Contour extraction is very simple in this condition (Fig. 2A). Following the contour extraction, we have a sequence of points along the contour. We propose using the tangent vectors along the contour to compute the curvature. Because the gradient vectors are orthogonal to the tangent vectors, we start by computing the gradient at the best scale; this is how the curvature computation is made invariant with respect to scale. Considering a Gaussian filter $\exp(-(x^2+y^2)/2t)$ with $t = \sigma^2$, and $L_x$ and $L_y$ the derivatives of the original image (along the x and y directions) convolved with the Gaussian filter, the local scale can be determined by computing the normalized derivatives of the original image as follows [18]:

$$G_k = t^{k/2}\,\sqrt{L_x^2 + L_y^2} \qquad (1)$$
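As an illustration of the scale-selection step, here is a minimal sketch, assuming a 2D grayscale NumPy array and using SciPy's Gaussian derivative filters: it evaluates the normalized derivative $G_k$ of Eq. (1) over a set of scales and keeps, per pixel, the scale that maximizes it. The acceptance thresholds discussed below (1.85, and 1% of the maximum response) are omitted for brevity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def scale_map(image, scales=(3, 5, 7, 9, 11, 13, 15, 19), k=0.5):
    """Per-pixel best scale from the normalized derivative G_k of Eq. (1).
    `scales` are the sigmas considered (the paper normalizes them by the
    image size, {...}/256); t = sigma^2 as in the text."""
    image = np.asarray(image, dtype=float)
    responses = []
    for s in scales:
        t = float(s) ** 2
        Lx = gaussian_filter(image, sigma=s, order=(0, 1))   # dL/dx at scale s
        Ly = gaussian_filter(image, sigma=s, order=(1, 0))   # dL/dy at scale s
        responses.append(t ** (k / 2.0) * np.hypot(Lx, Ly))  # Eq. (1)
    best = np.argmax(np.stack(responses), axis=0)            # index of max response
    return np.asarray(scales)[best]                          # (H, W) map of scales
```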
According to the discussion in [18,19], the most convenient choice for Gaussian step edges is $k = \frac{1}{2}$. A local maximum of $G_{1/2}(t)$ at $t = t_{max}$ signals the presence of a feature having the scale $\sqrt{2\pi t_{max}}$, provided that $G_{1/2}(t_{max})$ is larger than 1.85 and larger than 1% of the maximum of the other values of $G_{1/2}(t)$. These quantities were estimated by assuming that the noise in images has a normal distribution with a standard deviation $\sigma_{Noise}$ of approximately 1.5 gray levels, and they are consistent with the threshold of four gray levels used to find step edges (see [20]). In addition, numerical simulations show that with these parameter values no false edges are detected in the presence of a constant intensity profile superimposed on Gaussian noise with a standard deviation of 1.5 gray levels. When $G_k(t)$ increases monotonically, the scale is taken as the largest in the range of values considered. This quantity can be reliably computed for every point of the image and, therefore, it is possible to have a map of scale. The set of scales is composed of nine scales, in particular {3, 5, 7, 9, 11, 13, 15, 19}/256. The best scale for the contour points is extracted using the Lindeberg approach [18], as discussed above. The gradient $\vec{G} = (G_x, G_y)$ is then computed at this scale using a two-dimensional Gaussian filter in the x and y directions. Since the tangent vector is orthogonal to the gradient vector, we can simply derive it from the gradient vector. In this way we obtain a good representation of the contour shapes, because it takes the best scale into account:

$$\vec{T} = (T_x, T_y) = (-G_y, G_x) \qquad (2)$$

According to the definition of the curvature C of a planar curve as the rate of change of the tangent vector with respect to the arc length s, we need to compute the derivatives of the tangent vector in the x and y directions. This can simply be done by convolving each component of the tangent vector with the first derivative of the one-dimensional Gaussian function $g(s,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-s^2/2\sigma^2}$:

$$\frac{\partial T_x}{\partial s} = \frac{\partial\,[T_x * g(s,\sigma)]}{\partial s} = T_x * \frac{\partial g(s,\sigma)}{\partial s} \qquad (3)$$

The same procedure is applied to the y component of the tangent vector. Note that different shapes have different contour lengths; to account for this, we use an adaptive sigma for the Gaussian function, related to the length of the shape contour:

$$\sigma_1 = \sigma_0\, \frac{l}{l_0} \qquad (4)$$

where l is the length of the shape contour and, based on our experiments, we select $l_0 = 200$ and $\sigma_0 = 3$. Having the two derivative components of the tangent vector, the value of the curvature can be computed as follows:

$$\|C\| = \sqrt{\left(\frac{\partial T_x}{\partial s}\right)^2 + \left(\frac{\partial T_y}{\partial s}\right)^2} \qquad (5)$$
The complete representation of the curvature needs a sign attribute. The sign can be extracted from the direction of the tangent vector. To remove noise, we first apply a Gaussian filter with a small sigma (σ = 3) to each component of the tangent vector to obtain a smoother version of the tangent vector, $\vec{T}_{smooth}$. The sign of the curvature at each point is calculated from the cross product of the smoothed tangent vectors at this point and at the previous point in the contour sequence:

$$\mathrm{Sign}(C) = \mathrm{sign}\big[(T_{x,smooth}(s),\, T_{y,smooth}(s),\, 0) \times (T_{x,smooth}(s-1),\, T_{y,smooth}(s-1),\, 0)\big] \qquad (6)$$

The complete definition of our curvature is obtained by multiplying the value of the curvature by its sign:
Fig. 2. (A) A contour of a bird shape. The starting point for the computation of the curvature is the head of the bird, moving counterclockwise along the contour. (B) Raw and filtered curvature profiles for the given contour. (C) Symbolic representation of the bird shape given the dictionary defined in the text.
$$C = \|C\| \cdot \mathrm{Sign}(C) \qquad (7)$$
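A minimal sketch of the signed curvature computation of Eqs. (3)–(7), assuming the tangent components of Eq. (2) have already been sampled along the closed contour as 1D arrays Tx and Ty; `gaussian_filter1d` with `mode='wrap'` provides the circular Gaussian-derivative convolution, and the contour length l of Eq. (4) is taken here to be the number of samples.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def signed_curvature(Tx, Ty, l0=200.0, sigma0=3.0):
    """Signed curvature along a closed contour from its tangent components."""
    sigma1 = sigma0 * len(Tx) / l0                     # Eq. (4): adaptive sigma
    dTx = gaussian_filter1d(Tx, sigma=sigma1, order=1, mode='wrap')  # Eq. (3)
    dTy = gaussian_filter1d(Ty, sigma=sigma1, order=1, mode='wrap')
    magnitude = np.hypot(dTx, dTy)                     # Eq. (5)
    # Eq. (6): sign of the z-component of T_smooth(s) x T_smooth(s-1)
    Txs = gaussian_filter1d(Tx, sigma=3.0, mode='wrap')
    Tys = gaussian_filter1d(Ty, sigma=3.0, mode='wrap')
    cross_z = Txs * np.roll(Tys, 1) - Tys * np.roll(Txs, 1)
    return magnitude * np.sign(cross_z)                # Eq. (7)
```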
The curvature obtained at this step is still noisy, and due to the noise there are some undesired peaks in the curvature profile. In order to reduce the noise, we use a non-linear filtering approach: it is more important to smooth regions of low curvature and to leave regions of high curvature almost unaltered. We first compute the local square curvature as

$$\overline{C^2}(n) = \frac{1}{2\sigma_1 + 1} \sum_{i=-\sigma_1}^{\sigma_1} C^2(n+i) \qquad (8)$$
Non-linear filtering is then performed by convolving the curvature with a one-dimensional Gaussian function with an adaptive scale, where the scale of the filtering is defined as follows:

$$\sigma_2(n) = \sigma_{min} + \frac{\hat{C}}{\overline{C^2}(n)} \qquad (9)$$
In our experiments we selected $\sigma_{min} = 0$ and $\hat{C} = 0.02$ based on the analysis of different shapes (see Fig. 2B for an example). In this way we create a robust and perceptually relevant representation of the curvature of the shapes, meaning that the curvature reflects the sequence of properties of the shape on the original contour, like convexity, concavity, sharp angles, straight lines, etc. We can now find the local maxima (negative and positive peaks) of the curvature and localize them as landmarks on the original 2D contours. We actually extract more information from our robust curvature representation, which will be used for the symbolic representation in the next step. Besides the maxima, we also localize the starting-point and the ending-point of each peak. Each landmark in the original 2D space is represented by two vectors pointing from the maximum towards the starting-point and the ending-point: this means that each landmark defines an angle. In the following we describe the mapping of the salient features of the curvature into a symbolic representation.
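The adaptive smoothing of Eqs. (8) and (9) can be sketched as follows for a closed curvature profile C (a 1D NumPy array); the space-variant Gaussian is applied point by point, which is simple though not the fastest possible implementation.

```python
import numpy as np

def adaptive_smooth(C, sigma1, sigma_min=0.0, C_hat=0.02):
    """Smooth low-curvature regions strongly, leave high-curvature peaks intact."""
    n = len(C)
    w = max(1, int(round(sigma1)))
    # Eq. (8): circular moving average of C^2 over a window of half-width sigma1
    C2 = np.array([np.mean(C[np.mod(np.arange(i - w, i + w + 1), n)] ** 2)
                   for i in range(n)])
    sigma2 = sigma_min + C_hat / np.maximum(C2, 1e-12)        # Eq. (9)
    # Gaussian filtering with a point-dependent scale on the closed profile
    idx = np.arange(n)
    out = np.empty(n)
    for i in range(n):
        d = np.minimum(np.abs(idx - i), n - np.abs(idx - i))  # circular distance
        g = np.exp(-0.5 * (d / max(sigma2[i], 1e-6)) ** 2)
        out[i] = np.sum(g * C) / np.sum(g)
    return out
```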
3.2. Symbolic representation using contour fragments

In the previous section we described how to compute the one-dimensional curvature profile, whose local extrema are perceptually relevant to the changes on the original contour. Our curvature representation carries even more information: from it we can distinguish curved parts, straight lines, sharp angles, and concave and convex parts. We first explain how to extract this information and then describe how to use it to transform each shape into a symbolic representation suitable for our recognition task. First of all, in a pre-processing step, we remove the angles that are close to 180°. In the curvature profile, straight lines have values close to zero, sharp angles (corners) have high values, and curved parts have intermediate values between these two extremes. We have set thresholds to identify the different parts, based on extensive experiments on a database of shapes. We apply the following dictionary, which transforms the curvature representation into a sequence of symbols (see Fig. 2C for an example):

3.2.1. Symbols for angles
As mentioned before, we localize the local peaks of the curvature, which correspond to the corners of the contour. Each peak is represented by three points: the peak location, the location of the starting-point of the peak and the location of its ending-point. These three points define an angle in the original 2D space. We quantize the angles so that each is either 45°, 90° or 135°, and each angle can have a positive or a negative value of the curvature. We therefore obtain a total of six different corner types, labeled A1, A2, ... up to A6.

3.2.2. Symbols for curves
We also detect and label curves as either concave or convex, according to the sign of the curvature. Therefore, we simply start with two different curve symbols, labeled C1 and C2.

3.2.3. Symbols for links between angles (and curves)
We apply three symbols for the parts of the contour that join two corners (or two curves): L1 is used if the part is a straight line, L2 is applied if it is not a straight line (and not a curve) but has an average positive curvature, and L3 is assigned if, on average, it has a negative curvature.
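To make the dictionary concrete, here is a minimal labeling sketch; the exact assignment of A1–A6 to angle/sign combinations and the straight-line tolerance are illustrative assumptions, since the text does not spell them out.

```python
def angle_symbol(theta_deg, curvature_sign):
    """Corner symbol: quantize the landmark angle to 45/90/135 degrees and
    combine it with the curvature sign to pick one of A1..A6 (assumed order)."""
    q = min((45, 90, 135), key=lambda a: abs(a - theta_deg))
    base = {45: 1, 90: 2, 135: 3}[q]
    return f"A{base}" if curvature_sign > 0 else f"A{base + 3}"

def curve_symbol(curvature_sign):
    """Curve symbol: C1 or C2 depending on the sign of the curvature."""
    return "C1" if curvature_sign > 0 else "C2"

def link_symbol(mean_curvature, straight_eps=0.005):
    """Link symbol for a part joining two corners (or curves): L1 straight,
    L2 average positive curvature, L3 average negative curvature."""
    if abs(mean_curvature) < straight_eps:
        return "L1"
    return "L2" if mean_curvature > 0 else "L3"
```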
4. High-dimensional feature space

According to the dictionary of fragments defined in the previous section, starting from an arbitrary point we move along the contour and transform the shape contour into a sequence of symbols. Obviously, more complex shapes have more landmarks, so they are represented by longer sequences of symbols: the length of the symbolic sequence of a shape is related to its complexity and the number of its landmarks. This means that the dimension of the symbolic representation differs from shape to shape. Note that this representation is still sensitive to the starting-point on the contour and is not rotation invariant; however, as we will see in the final step, our feature space will be completely invariant to similarity transforms. We can now treat our problem as a text categorization problem, where each sequence of symbols is considered a sentence or a separate document. There are several successful works on text categorization. A standard approach [21] uses the classical text representation technique [22], which maps a document into a high-dimensional feature vector, where each entry of the vector represents the presence or absence of a feature. Such sparse vectors can then be used in conjunction with many learning algorithms. This simple technique has recently been used very successfully in supervised learning tasks with Support Vector Machines [23]. Our approach is based on a more recently proposed method that considers documents simply as symbol sequences and makes use of a specific kernel, named the string kernel [24]. We do not use this method directly as a kernel; instead, we apply the idea to create our high-dimensional invariant feature space. Using this approach, we transform our symbolic representation into a high-dimensional feature space of the same size for all shapes. The size of this feature space is related to the length of our dictionary, which is fixed for all shapes. This transformation also makes our system invariant to rotation and to the starting-point on the contour. The feature space is generated by the set of all substrings of k symbols, where k can vary from 1 to the length of the dictionary, which is 11 here. The more common substrings two shapes share, the higher their inner product and, therefore, the more similar they are. It is important to note that substrings do not need to be contiguous; the degree of contiguity of a substring in the string representation of a shape determines how much weight it carries in the comparison. For each substring there is a dimension of the feature space, and the value of that dimension depends on the frequency of appearance and on the degree of compactness of that specific substring in the text. To consider non-contiguous substrings, it is necessary to insert a decay factor between 0 and 1 into our formulation, which is used for weighting contiguity. The only difference between our work and the string kernel for text classification is that for shape recognition we need invariance with respect to rotation and starting-point, which text does not need. We have a finite alphabet of 11 symbols, based on the dictionary defined in Section 3. To create the feature space we need to search all possible substrings, starting
from each single symbol up to k symbols. Theoretically, based on our dictionary, k can be 11, but as the length of the substrings grows the dimension of the feature space grows exponentially and managing it would become impossible, so we have to stop the search at an acceptable substring length. For each possible substring there is a dimension in the feature space, and the value of that dimension is the sum over all occurrences of that substring in the sequence, weighted by the decay factor for non-contiguity. Each dimension of the feature space corresponds to the feature mapping of a particular substring in the original sequence of symbols. Suppose the dimension corresponds to the substring $s_1$ and the original symbol sequence is S; the feature mapping $\phi_{s_1}$ can be written as follows:

$$\phi_{s_1}(S) = \sum_{\mathbf{i}\,:\; s_1 = S[\mathbf{i}]} \lambda^{l(\mathbf{i})} \qquad (10)$$

where $l(\mathbf{i})$ is the distance between each two symbols of $s_1$ found in the original sequence S, and $\lambda$ is the decay factor. Since each contour is a closed curve, to solve the problem of the starting-point and of rotation invariance we consider each sequence to be a closed loop as well, i.e. the last symbol of each sequence is contiguous to the first. With this definition, the problems of rotation and starting-point are solved while searching for substrings and applying the decay factor. To calculate the distance between symbols in the sequence we use the original distance in the 2D plane, normalized by the arc length of each contour. In this way we incorporate spatial information into our feature-space representation and make it more precise. An important issue when creating the feature space is that, because the angles are quantized, there is an overlap between some dimensions of our feature space. For example, if we have a feature A2–L1–A5 with weight w in our symbolic representation, we give a smaller weight (here w/3) to the neighboring dimensions A1–L1–A5, A3–L1–A5, A2–L1–A4, A2–L1–A6, A1–L1–A4, A1–L1–A6 and so on. After creating the feature space, one basic preprocessing step is the normalization of the feature vectors [25]. If $x \in \mathbb{R}^N$ is an input vector, the normalized vector $\tilde{x}$ is given by

$$\tilde{x} = \frac{x}{\|x\|} \in \mathbb{R}^N \qquad (11)$$

where $\|x\|^2 = \sum_{i=1}^{N} x_i^2$ and N is the dimension of the feature space.
5. Dimensionality reduction

As mentioned in the previous section, the feature space created for our shape representation has a very high dimension, even when only a limited substring length is considered. It would therefore be easier to manage the feature vectors if we reduced their dimension using a standard method. Principal component analysis (PCA) [26] is one approach to this aim. PCA can be used for dimensionality reduction in a data set by retaining those characteristics of the data that contribute most to its variance, keeping the lower-order principal components and ignoring the higher-order ones. PCA involves the eigenvalue decomposition of the data covariance matrix after mean-centering the data for each attribute or dimension. It transforms the data into a new coordinate system such that the greatest variance of any projection of the data comes to lie on the first coordinate (the first principal component), the second greatest variance on the second coordinate, and so on.

6. Classification

After following all the steps so far, each shape is represented by a feature vector, and a database of shapes creates our feature space. Using a training set of shapes, we need to apply a classifier to find the best hyperplanes between the different shape classes. One of the most successful classifiers for high-dimensional data is the Support Vector Machine (SVM) [23]. SVMs are a method for creating functions from a set of labeled training data; the function can be a classification function or a general regression function. For classification, SVMs find the hypersurfaces in the training input space that maximize the margin – the distance between the plane and the closest points – between the different classes, so they have a very good generalization property. Once the hyperplanes have been learned from the training data, test data can be classified according to their position with respect to these hyperplanes. More information about SVMs can be found in [27,28].
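The normalization of Eq. (11), the PCA reduction and the linear SVM chain naturally into one estimator. This sketch uses scikit-learn (whose SVC wraps LIBSVM) in place of the LIBSVM tools used in Section 8; `n_components` is a free parameter, set to 200 in the experiments below.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def build_classifier(n_components=200):
    """L2 normalization (Eq. (11)) -> PCA (Section 5) -> linear SVM (Section 6)."""
    return make_pipeline(Normalizer(norm="l2"),
                         PCA(n_components=n_components),
                         SVC(kernel="linear"))

# Leave-one-Out evaluation as used throughout Section 8
# (X: one feature vector per shape, y: class labels):
# accuracy = cross_val_score(build_classifier(), X, y, cv=LeaveOneOut()).mean()
```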
7. The complexity of the algorithm

The computational complexity of classification and recognition depends on the number of shapes to recognize and the number of classes to identify. Normally, the literature reports the complexity of the representation part; here, to be able to compare the complexity of our algorithm with those reported in the literature, we likewise give the complexity of creating the feature space. The complexity of an algorithm is equal to the complexity of its most expensive step. Let us suppose that a shape has N points along its contour. The complexity of computing the curvature is of the order of O(N log N), which comes from the computation of the convolution. The symbolic representation has a complexity linear in N. Creating the feature space is of the order of O(N), given that we cut the search at a reasonable substring length (3 or 4 here). As mentioned, we do not count the complexity of the classification part, but since we use a linear SVM, it is linear in the number of training examples. Therefore the total complexity of the algorithm is of the order of O(N log N). Compared to other approaches, such as Shape Context [8] with complexity O(N³), and Inner-Distance [29] and Robust Symbolic Representation [14] with complexity O(N²), our algorithm has lower complexity.

8. Experimental results

In this section, some experiments aimed at evaluating and comparing the proposed algorithm for shape recognition are presented. First we show how to set the parameters of the algorithm, namely the decay factor, the maximum length of substrings for creating the high-dimensional feature space, and the number of principal components for reducing the dimensionality, and show how these parameters change the performance of the method. After finding the optimal parameters on a small subset of shapes, we fixed them for the main experiments. We used the LIBSVM [30] tools, which support multi-class classification. Different kernel functions with different parameters were tested, but a simple linear kernel gave the best results. The parameters of the classifier were set using a cross-validation strategy.

8.1. The effect of the decay factor (λ)

To find the best value of the decay factor used for weighting contiguity, we selected a subset of the MPEG-7 shape database [31] and, having fixed the other parameters, studied how the performance changes with this parameter. The subset consisted of 20 classes, each containing 20 shapes. The performance was evaluated using the cross-validation Leave-one-Out strategy: each time, the algorithm was trained
with all the shapes in the database except one, left out for testing, until every shape had been used for the test. The performance versus different values of the decay factor is plotted in Fig. 3. According to this plot, the best value for λ is 0.3, and we fixed this value for the further experiments. The plot shows that this parameter significantly affects the accuracy of the algorithm; as shown here, for a given dataset it can be set by a cross-validation procedure.

8.2. The effect of the length of substrings for creating the high-dimensional feature space

In this subsection the same database as in the previous part was used to show how the maximum length of the substrings used for creating the high-dimensional feature space affects the performance of the algorithm. As discussed in Section 4, it is possible to build feature vectors with different maximum substring lengths, and we therefore compared the results obtained with substrings of maximum length 2, 3 and 4 symbols. Note that the maximum usable substring length depends on the length of the dictionary, which is 11 here. As shown in Table 1, the recognition rate increases with longer substrings, but not by enough to justify a significantly heavier computational load, so for the further experiments we fixed the maximum substring length to 3. There is a trade-off between recognition accuracy and the length of the feature vectors: beyond some point, increasing the dimension does not add information for the classifier but does increase the computational load of further processing.

8.3. The effect of the number of principal components for the dimensionality reduction

The last parameter we consider is the number of principal components used to reduce the dimensionality of our feature space. The same subset of the MPEG-7 database and the same Leave-one-Out evaluation strategy were used here too. As shown in Fig. 4, the best performance was obtained with 200 principal components, so we selected this value for the main experiments. The recognition accuracy for the raw feature vectors was slightly higher (98%) than for the feature vectors reduced by PCA. It is worth mentioning that the decision of how many dimensions
Fig. 3. Recognition rate for 20 classes from the MPEG-7 database, used to set the value of the decay factor λ. The best recognition rate was obtained with λ = 0.3, and this value was fixed for the further experiments.
Table 1
The effect of different maximum substring lengths for creating the high-dimensional feature space. The maximum length of substrings was set to 3 for the main experiments.

Maximum length of substrings    Recognition rate (%)
2                               86
3                               97.75
4                               98
to retain basically depends on when there is only very little random variability left. The nature of this decision is somewhat arbitrary; however, there are several approaches for this aim, such as the Kaiser criterion [32] and Cattell's scree test [33]. Besides the approach proposed above, we applied Cattell's scree test; according to this test the optimal number of dimensions to retain was 173, which yields a recognition rate of 95%.

8.4. Comparing LDA and PCA for dimensionality reduction

As mentioned in the text, there are several ways to reduce the dimension of the feature space; LDA and PCA are among the most popular. In this section we compare the effect of the dimensionality reduction method on the classification accuracy of our algorithm. The same subset of the MPEG-7 shape dataset was selected for the experiment. Using the cross-validation Leave-one-Out strategy we obtained recognition rates of 97.75% and 97% for PCA and LDA, respectively. PCA works slightly better than LDA in this case, but the overall results are not very different. This is compatible with the results in [34], which show that PCA outperforms LDA when the number of samples per class is small, while with a large number of training samples LDA outperforms PCA.

8.5. Comparison of different approaches to curvature computation

To show the merit of our approach to computing the curvature, we compared the classification accuracy on a subset of the MPEG-7 shape database using three different methods: the one explained in Section 3 and the two other methods described in [35,36]. The way of creating the symbolic representation, the feature space, the dimensionality reduction and the classification remained
Fig. 4. Recognition rate for a subset of shapes from MPEG-7 database used to select the number of principal components for the dimensionality reduction of the feature space. Best recognition rate was obtained with 200 components.
Fig. 5. Sample shapes from the Chicken Pieces Dataset. Each row presents one class of the dataset.
the same in our comparison, but the method for computing the curvature was different. Our approach and the one explained in [35] both performed well (our approach showed slightly higher accuracy: 97.75% versus 97.25%), and both outperformed the approach of [36], which had a recognition rate of 95.75%.
Fig. 6. A set of examples from MPEG-7 shape database. One shape from each category is shown.
8.6. Chicken Pieces Dataset

Neuhaus and Bunke [37] proposed a general edit distance-based kernel function for pattern classification. They tested their approach on different datasets of strings and graphs, including some shape datasets. To compare the algorithm proposed here with theirs, we selected one of their shape datasets, the Chicken Pieces Dataset (Fig. 5), and used the same classification protocol for comparison. This dataset consists of 446 silhouettes from five different classes of chicken pieces: wing, leg, thigh, quarter and breast. The dataset was randomly divided into three subsets of 149 shapes for training, 149 shapes for validation and 148 for testing. The validation set was used for setting the parameters, and the results were compared with a k-nearest-neighbor classifier. We used the same procedure and compared the classification accuracy of our method with those reported in [37]. Table 2 reproduces the classification rates of the different approaches. Our fragment-based approach outperforms both the k-nearest-neighbor (kNN) classifier and the edit distance-based kernel approach, with a classification rate of 84.5%.

8.7. MPEG-7 shape database

The proposed recognition method was tested on the large MPEG-7 CE-Shape-1 database [31]. It consists of 70 classes, each having 20 different shapes, for a total of 1400 shapes (Fig. 6). Shapes in the database come from a variety of both natural and artificial objects. The database is challenging due to the presence of examples that are visually dissimilar from other members of their class, and of examples that are highly similar to members of other classes [31].
Table 2
Classification rates on the Chicken Pieces Dataset for different approaches. Our approach shows the best classification accuracy.

Method                      Classification rate
kNN [37]                    74.3%
Edit-dist kernel [37]       81.1%
Our approach                84.5%
The standard recognition test reported in the literature for this database is the cross-validation Leave-one-Out strategy. In Table 3 we summarize the results of this test for the different algorithms available in the literature. The proposed method has a performance comparable to the best methods, with a lower complexity.

8.8. ETH-80 3D object database

We also tested our algorithm on another challenging database, the ETH-80 3D object database [42]. It contains high-resolution color images of 80 3D objects from eight different categories (Fig. 7). For each object there are 41 images taken from different viewpoints spaced equally over the upper viewing hemisphere (at distances of 22.5–26°), for a total of 3280 images. For every image a high-quality segmentation mask is provided, so that shape and contour-based methods can be easily applied. The standard test is leave-one-object-out cross-validation [42]: recognition is considered successful if the correct category is assigned, and the results are averaged over all 80 possible test objects. The database is used for a best-case analysis, that is, categorization of unknown objects under the same viewing conditions, with a near-perfect figure-ground segmentation and known scale [42].
Table 3
Comparison of recognition rates for different algorithms tested on the MPEG-7 database. Our method outperforms all previous algorithms except the Robust Symbolic Representation [14], while having lower complexity.

Method                                   Recognition accuracy (%)
Normalized squared distance [38]         96.9
RACER [39]                               96.8
Chance probabilities [40]                97.4
String of symbols [41]                   97.36
Polygonal multi-resolution [16]          97.57
Class segment sets [12]                  97.93
Probabilistic approach [13]              97.5
Robust Symbolic Representation [14]      98.57
Our method                               98.21
Fig. 7. A set of images from the ETH-80 object database, in which each row shows one category of objects.

Table 4
Recognition rates for different methods tested on the ETH-80 database. Our method is among the best proposed for this database, with a very compact representation and much lower complexity. (Note that the Decision Tree approach uses several cues.)

Algorithm                                Recognition rate (%)
Color hist. [42]                         64.85
DxDy [42]                                79.79
Mag–Lap [42]                             82.23
PCA masks [42]                           83.41
PCA gray [42]                            82.99
SC greedy [42]                           86.40
SC + DP [42]                             86.40
Decision Tree [42]                       93.02
MDS + SC + DP [29]                       86.80
IDSC + DP [29]                           88.11
Robust Symbolic Representation [14]      89.03
Ours                                     86.40

Table 4 lists the recognition rates for the different algorithms. Our method performs very similarly to the best-proposed approaches for this database, which are more complex than our algorithm with its very compact representation.

9. Conclusions

In this paper we have presented an algorithm for shape-based object recognition. In this approach, we first extract fragments from the contours of shapes in such a way that they are perceptually relevant, and then convert this information into a sequence of symbols using a predefined dictionary for the fragments. Finally, we transform the representation into a completely invariant high-dimensional feature space using a string kernel-like approach from the text categorization community. We reduce the dimension of our feature space using PCA and classify with support vector machines. By its nature our method is translation invariant, as it works with the contours of shapes. It is to some extent scale invariant as well, because we use automatic scale selection via the Lindeberg formula and extract perceptually relevant landmarks with the method described in Section 3. Our feature-space representation makes the system completely invariant to similarity transforms. The quantization step for the angles in the symbolic representation enables the algorithm to deal with some deformations and makes it very useful for the categorization task. Furthermore, the symbolic representation is very compact, and the performance of the proposed approach is comparable to the best shape recognition methods we have considered here. This approach can also handle partial occlusion, because measuring the similarity of substrings in the sequence of each shape is related to local properties of the shapes.

Acknowledgments

We would like to thank Prof. Alessandro Verri for his preliminary idea of this work, Drs. Latecki and HaiRong Liu for providing their code for comparison, Dr. Enrique Vidal for providing us the Chicken Pieces Dataset, and N. Ashrafi for helping to prepare the figures. We also wish to thank the reviewers for their helpful and constructive comments.

References
[1] S. Edelman, Representation and Recognition in Vision, MIT Press, Cambridge, USA, 1999.
[2] J. Hegde, E. Bart, D. Kersten, Fragment-based learning of visual object categories, Current Biology 18 (2008) 597–601.
[3] I. Biederman, G. Ju, Surface versus edge-based determinants of visual recognition, Cognitive Psychology 20 (1988) 38–64.
[4] S. Agarwal, A. Awan, D. Roth, Learning to detect objects in images via a sparse part-based representation, IEEE TPAMI 26 (11) (2004) 1475–1490.
[5] G. Bouchard, B. Triggs, Hierarchical part-based visual object categorization, Proceedings of CVPR 1 (2005) 710–715.
[6] S. Ullman, M. Vidal-Naquet, E. Sali, Visual features of intermediate complexity and their use in classification, Nature Neuroscience 5 (7) (2002) 682–687.
[7] G. Borgefors, Hierarchical chamfer matching: a parametric edge matching algorithm, IEEE TPAMI 10 (6) (1988) 849–865.
[8] S. Belongie, J. Malik, J. Puzicha, Shape matching and object recognition using shape contexts, IEEE TPAMI 24 (4) (2002) 509–521.
[9] O.C. Ozcanli, B.B. Kimia, Generic object recognition via shock patch fragments, in: BMVC'07, Warwick Print, 2007, pp. 1030–1039.
[10] T. Sebastian, P. Klein, B. Kimia, Recognition of shapes by editing their shock graphs, IEEE TPAMI 26 (2004) 551–571.
[11] J. Shotton, A. Blake, R. Cipolla, Multiscale categorical object recognition using contour fragments, IEEE TPAMI 30 (7) (2008) 1270–1281.
[12] K.B. Sun, B.J. Super, Classification of contour shapes using class segment sets, Proceedings of CVPR 2 (2005) 727–733.
[13] G. McNeill, S. Vijayakumar, A probabilistic approach to robust shape matching, in: Proc. of Int. Conf. on Image Processing (ICIP), Atlanta, 2006.
[14] M.R. Daliri, V. Torre, Robust symbolic representation for shape recognition and retrieval, Pattern Recognition 41 (5) (2008) 1799–1815.
[15] L.J. Latecki, R. Lakamper, Shape similarity measure based on correspondence of visual parts, IEEE TPAMI 22 (10) (2000) 1185–1190.
[16] E. Attalla, P. Siy, Robust shape similarity retrieval based on contour segmentation polygonal multiresolution and elastic matching, Pattern Recognition 38 (12) (2005) 2229–2241.
[17] A. Pasupathy, C.E. Connor, Population coding of shape in area V4, Nature Neuroscience 5 (2002) 1332–1338.
[18] T. Lindeberg, Edge detection and ridge detection with automatic scale selection, International Journal of Computer Vision 30 (2) (1998) 117–154.
[19] P. Majer, The influence of the gamma-parameter on feature detection with automatic scale selection, in: Proc. 3rd Int. Conf. Scale-Space and Morph. Computer Vision, 2001, pp. 245–254.
[20] W. Vanzella, F.A. Pellegrino, V. Torre, Self-adaptive regularization, IEEE TPAMI 26 (6) (2004) 804–809.
[21] T. Joachims, Text categorization with support vector machines: learning with many relevant features, in: Proc. Euro. Conf. Machine Learning, Springer, Berlin, 1998, pp. 137–142.
[22] G. Salton, A. Wong, C.S. Yang, A vector space model for automatic indexing, Communications of the ACM 18 (11) (1975) 613–620.
[23] V. Vapnik, Statistical Learning Theory, Wiley Interscience, New York, 1998.
[24] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, C. Watkins, Text classification using string kernels, Journal of Machine Learning Research 2 (2002) 419–444.
[25] A.B.A. Graf, S. Borer, Normalization in support vector machines, in: Proc. of the 23rd DAGM-Symposium on Pattern Recognition, LNCS 2191, 2001, pp. 277–282.
[26] K. Fukunaga, Introduction to Statistical Pattern Recognition, Elsevier, 1990.
[27] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000.
[28] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (2) (1998) 121–167.
[29] H. Ling, D.W. Jacobs, Shape classification using the inner-distance, IEEE TPAMI 29 (2) (2007) 286–299.
[30] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[31] L. Latecki, R. Lakamper, U. Eckhardt, Shape descriptors for non-rigid shapes with a single closed contour, in: Proc. of the IEEE Conf. on CVPR, 2000, pp. 424–429.
[32] H.F. Kaiser, The application of electronic computers to factor analysis, Educational and Psychological Measurement 20 (1960) 141–151.
[33] R.B. Cattell, The scree test for the number of factors, Multivariate Behavioral Research 1 (2) (1966) 245–276.
[34] A.M. Martinez, A.C. Kak, PCA versus LDA, IEEE TPAMI 23 (2) (2001) 228–233.
[35] H. Liu, L.J. Latecki, W. Liu, A unified curvature definition for regular, polygonal, and digital planar curves, International Journal of Computer Vision 80 (1) (2008) 104–124.
[36] M. Bicego, V. Murino, Investigating hidden Markov models' capabilities in 2D shape classification, IEEE TPAMI 26 (2) (2004) 281–286.
[37] M. Neuhaus, H. Bunke, Edit distance-based kernel functions for structural pattern classification, Pattern Recognition 39 (2006) 1852–1863.
[38] B.J. Super, Learning chance probability functions for shape retrieval or classification, in: Proc. of the IEEE Workshop on Learning in Computer Vision and Pattern Recognition, June 2004.
[39] B.J. Super, Improving object recognition accuracy and speed through non-uniform sampling, in: Proc. SPIE Conference Intelligent Robots and Computer Vision XXI: Algorithms, Techniques, and Active Vision, Providence, RI, SPIE 5267, 2003, pp. 228–239.
[40] B.J. Super, Retrieval from shape databases using chance probability functions and fixed correspondence, International Journal of Pattern Recognition and Artificial Intelligence 20 (8) (2006) 1117–1138.
[41] M.R. Daliri, V. Torre, Shape recognition and retrieval using string of symbols, in: Proceedings of the Fifth International Conference on Machine Learning and Applications (ICMLA'06), Orlando, Florida, USA, 2006.
[42] B. Leibe, B. Schiele, Analyzing appearance and contour based methods for object categorization, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2003, pp. 409–415.