Metric Learning Based Object Recognition and Retrieval

Jianyu Yang*, Haoran Xu

School of Urban Rail Transportation, Soochow University, 8 Jixue Road, Xiangcheng District, Suzhou, Jiangsu, China
Abstract

Object recognition and retrieval is an important topic in intelligent robotics and pattern recognition, where an effective recognition engine plays a central role. To achieve good performance, we propose a metric learning based recognition algorithm. To represent invariant object features, including local shape details and global body parts, a novel multi-scale invariant descriptor is proposed. Different types of invariant features are represented at multiple scales, which makes the subsequent metric learning effective. To reduce the effect of noise and improve computational efficiency, an adaptive discrete contour evolution method is also proposed to extract the salient sample points of an object. The recognition algorithm builds on metric learning, and the object features are summarized as histograms inspired by the Bag of Words (BoW) model. Metric learning is applied to the object features according to their scales, which reduces the intra-class distance while maximizing the inter-class distance. The proposed method is invariant to rotation, scale variation, intra-class variation, articulated deformation and partial occlusion, and the recognition process is fast and robust to noise. The method is evaluated on multiple benchmark datasets, and the comparative experimental results demonstrate its invariance and robustness.

Keywords: metric learning; object recognition; object retrieval; robot learning; intelligent analysis.
* Corresponding author. Email addresses: [email protected] (Jianyu Yang), [email protected] (Haoran Xu).
1. Introduction

Object recognition is a fundamental and challenging problem in robot learning and pattern recognition, with applications such as object retrieval [1][2], robot navigation [3] and multi-camera object recognition based on network control technology [4][5]. The visual features of an object carry its essential information, and an intuitive idea is to measure the similarity of two objects by comparing their features. In recent years, a fruitful literature on object recognition and retrieval algorithms has reported promising results [6][7][8][9]. Most researchers in the object feature analysis community focus on designing robust and discriminative descriptors, which play an important role in traditional pairwise matching based methods [2][10][11][12]. However, recognizing objects effectively in large-scale datasets remains an open problem, owing to unavoidable perturbations, e.g., intra-class variation, articulated deformation, occlusion, noise and combinations of them.

To represent shape features, most existing descriptors improve shape matching in one of two aspects. On one hand, descriptors are expected to capture the geometric invariance of salient shape features under various variations, e.g., rotation and articulated deformation; the Integral Invariant (II) [13] is a typical method of this kind. On the other hand, shape structure information, usually represented by the spatial relations between pairs of contour points, is preferred for representing the shape, e.g., in the shape context methods [2][10]. However, both types of methods have limits. Learning algorithms were widely used in recent works to address this problem, but most of them are applied directly, without considering the properties of the descriptors, which limits recognition performance. As object descriptors represent various aspects of salient features, matching the learning method to the feature description is crucial for recognition performance, so the learning method should be carefully selected and designed.

In this paper, we propose an object recognition method based on a metric learning algorithm. For feature description, a novel multi-scale invariant descriptor (MID) is proposed with three types of invariants in different scales. The multiple feature scales are well matched with the metric learning method. Besides, we adopt the bag of words (BoW)
Figure 1: The pipeline of our method.
method to summarize the feature vectors of different scales into histograms, which makes the features well suited to the metric learning method. This improves the discrimination of our method by increasing the inter-class distance while reducing the intra-class distance. To reduce the intra-class distance, the core idea is to compute the MID over the simply connected region, instead of the full circle used by the II, for articulated invariance. To increase the representativeness of the features and decrease the effect of noise, an evolution method is proposed to find the salient feature points of an object.

The pipeline of our method is shown in Fig. 1. In our framework, the raw data is first preprocessed by a new contour evolution method. In general shape data, there are many redundant contour points that contribute nothing to the object representation but introduce extra perturbation into object matching. To extract the salient feature points, we propose an adaptive discrete contour evolution (ADCE) algorithm to process the closed shape contour. After that, invariants of different types and scales are calculated to represent the shape features. To make the features appropriate for metric learning, we use a bag of salient points (BoP) to represent the shape feature points statistically as histograms. Finally, a metric learning algorithm based on relevant component analysis is applied. The proposed method has the following advantages:

• Our method is invariant to salient intra-class variation, partial occlusion and articulated deformation.
• The metric learning based method enlarges the inter-class distance and reduces the intra-class distance.

• Object recognition by comparing the metric-learned BoP histograms is very fast. The traditional matching algorithms, e.g., DTW, DP and TPS, are not needed in our method to find the best alignment.

• Our method is robust to noise.

The proposed method is evaluated on four benchmark shape datasets: the MPEG-7 dataset [14], Kimia's 99 dataset [15], Kimia's 216 dataset [15] and the Articulated dataset [10]. Experiments on these datasets evaluate the invariance properties of our method with respect to rotation, scale variation, intra-class variation, articulated deformation and occlusion. We also compare the shape retrieval capability of our method with other methods, and test its robustness to noise.

The remainder of this paper is organized as follows. The next section summarizes relevant work on object recognition. In Section 3, the multi-scale invariants are presented. The adaptive discrete contour evolution method is presented in Section 4. Section 5 introduces the BoP method and the RCA metric learning for object recognition, and the experiments follow in Section 6. Finally, this work is concluded in Section 7.
2. Related Work

In the literature, there are various methods for object recognition. In this section, we mainly discuss two kinds: learning based methods and matching based methods, which are the most related to the method proposed in this paper.

The invariant descriptors [13][16][17][18] are designed to be invariant to spatial variations and various deformations. The integral invariants [13][19] use a distance integral invariant and an area integral invariant to represent each point of the shape contour. The distance integral invariant is not discriminative for shape matching, while the area integral invariant is a local descriptor (an approximation of curvature) that cannot represent spatial structure and context. The curvature scale space (CSS) method [16] uses the locations of the curvature zero-crossing points on the shape contour to represent shapes, but the scale is hard to determine and convex shapes have no curvature zero crossings. Biswas et al. [17] used a variety of simple invariants to represent shape features for shape indexing and retrieval.

The structure learning methods make use of the spatial relations among contour points [2][9][11][20][21]. The most cited work is the Shape Context (SC) [2], which represents each contour point by a histogram of its spatial relations with all other points. This method characterizes the spatial distribution of contour points well and performs well on rigid objects, whereas it is sensitive to articulated deformation. The triangle area based methods [11][20] use the spatial relationship of three contour points to represent the shape feature. Alajlan et al. [11] proposed the triangle-area representation (TAR), in which the length of the triangle base varies; however, the feature extraction is complicated and sensitive to noise. El Rube et al. [20] proposed the multiscale triangle-area representation (MTAR), which uses the zero crossings of the triangle area at different wavelet scales for matching and is more robust to noise. Wang et al. [9] use a height function to represent the spatial relationships among contour points: each contour point is represented by the distances from all other points to its tangent line. Fuzzy modeling and learning methods [22][23] have been widely applied in computer vision systems and could also be used in this application.

In most recent works, it is hard to assign a method to a single category, because both targets are pursued together: representing the shape structure by the spatial relationships of contour points while preserving invariance. For instance, Ling et al. [10] proposed the inner distance shape context (IDSC), which uses the inner distance instead of the Euclidean distance to overcome the sensitivity of the SC to articulated deformation. Grigorescu et al. [24] proposed a distance set method that represents a contour point by the spatial arrangement of the image features around it. Xie et al. [25] used the skeleton context to represent the spatial relationships of shape structures. Attalla et al. [26] proposed a multi-resolution polygonal shape descriptor, segmenting the shape contour into equal segments and capturing the shape features around their centers. Adamek et al. [27] represented the contour convexities and concavities at different scales
Figure 2: Complex shapes.
as shape features for matching. Schmidt et al. [28] proposed a graph cut method that represents planar graphs for shape matching and image segmentation. Previous methods have also been combined to represent shapes for better performance [29][30]. The shape tree method [31] represents objects hierarchically at different scales of detail.

Besides shape representation, contour processing is also important for extracting shape features. Latecki et al. [32] proposed a method to measure shape similarity based on the correspondence of visual parts, called discrete contour evolution (DCE). This method is widely used to extract the salient feature points of a shape contour, e.g., in the shape vocabulary method. In this sense, it is more useful as a preprocessing step for object representation and matching than as an object matching method itself.

Metric learning methods have been widely used in pattern recognition and have reported promising performance [33][34][35]; we adopt this type of learning due to the properties of our feature invariants. In this work, the proposed metric learning based method makes use of multiple scales of object features for object recognition and retrieval. Unlike the context based methods, our method does not use the spatial relationships among contour points directly. In contrast to DCE, the proposed ADCE method adapts the convergence condition of the evolution. The feature learning based on the BoW histograms is much faster than the traditional matching methods.
3. Multi-Scale Shape Descriptor

3.1. Definition of the Multi-scale Invariant Descriptor (MID)

Since the raw data of a shape is a closed contour composed of a series of sample points, the descriptor represents the shape features of each sample point on the contour. Denote by S = {p(i) | i ∈ [1, n]} the closed planar shape contour with a sequence of sample points p(i), where n is the length of the contour. The sample point
Figure 3: Invariants in the articulated parts. The grey zones A and B on the rabbit ears in (b) are not both simply connected zones within the circle, while the grey zone A in (c) is a simply connected zone. However, not all three segments in (c) are simply connected segments; only the blue segment A crossing the circle center is used to calculate the arc length invariant.
p(i) is parameterized as p(i) = {u(i), v(i)}, where u(i) and v(i) are its coordinates in the image. The MID descriptor M is defined as follows:

M = {s_k(i), l_k(i), c_k(i) | k ∈ [1, m], i ∈ [1, n]},    (1)
where s_k, l_k and c_k are three invariants: the normalized area s, arc length l and central distance c at scale k. Here k is the scale label and m is the total number of scales. These invariants are defined separately below.

Definition 1. Consider a circle C_k(i) with radius r_k centered at p(i); one or more zones of the circle may be occupied by the shape (see Fig. 3, zone A and zone B in (a)). Zone A is said to be simply connected to p(i) because p(i) lies on the edge of this zone (direct connection), while the other zones have no connection to p(i) via the shape within the circle. Denoting the area of the zone simply connected to p(i) (zone A in Fig. 3 (b)) by s*_k(i), the normalized area s_k(i) is defined as the ratio of s*_k(i) to the area of C_k(i):

s_k(i) = s*_k(i) / (π r_k²).    (2)
Since s*_k(i) is less than the area of C_k(i), the ratio s_k(i) ranges from 0 to 1. In our method, only the zone simply connected to the center p(i) is considered, while other zones without a circle-inside connection to p(i), e.g., zone B, are not counted. In the following definitions of l_k and c_k, only the simply connected zone and contour segment are considered as well. The reason for and benefit of this choice are explained in Section 3.4.
Definition 2. In the same circle C_k(i), there are one or more contour segments of the shape contour cut out by C_k(i) (see Fig. 3, segments A, B and C in (c)). As with the simply connected zone in Definition 1, we only consider the simply connected segment A. Denoting the arc length of the simply connected segment (segment A in this case) by l*_k(i), the normalized arc length l_k(i) is defined as the ratio of l*_k(i) to the circumference of C_k(i):

l_k(i) = l*_k(i) / (2π r_k).    (3)
The length l*_k(i) is shorter than the circumference of C_k(i) in most of the benchmark datasets, e.g., the MPEG-7 dataset [14] used in our experiments; therefore, the normalized l_k(i) ranges from 0 to 1. If an l*_k(i) larger than the circumference of C_k(i) occurs in a special dataset, the range of l_k(i) can be restricted to [0, 1] by dividing by a larger normalization parameter, namely the smallest integer multiple of the circumference that exceeds l*_k(i).

Definition 3. The central distance c*_k(i) is defined as the distance between p(i) and w(i):

c*_k(i) = ||p(i) − w(i)||,    (4)
where w(i) is the weight center of the simply connected zone in C_k(i), calculated by averaging the coordinates of all the pixels in the zone. Then c_k(i) is obtained by normalizing with r_k:

c_k(i) = c*_k(i) / r_k = ||p(i) − w(i)|| / r_k.    (5)
As the weight center must lie inside the simply connected zone, c*_k(i) is less than r_k, and c_k(i) ranges from 0 to 1. For convenience, we denote the simply connected zone by Z in the remainder of this paper. As the computations of the area s*_k(i), the arc length l*_k(i) and the weight center w(i) are straightforward, with efficient implementations available in the toolboxes of Matlab and OpenCV, we do not specify the formulas here for brevity.
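Since the paper leaves these computations to standard toolboxes, the following is a minimal Python sketch of how the three invariants of Definitions 1-3 could be computed at one contour point from a binary shape mask. The function name and the pixel-counting approximation of the arc length are our own assumptions, not the authors' implementation.

```python
import numpy as np
from scipy import ndimage

def mid_invariants(shape_mask, point, r):
    """The three MID invariants (s_k, l_k, c_k) of Definitions 1-3 at one
    contour point, for one circle radius r.  shape_mask is a boolean
    image of the filled shape; point = (row, col) lies on its contour."""
    h, w = shape_mask.shape
    yy, xx = np.mgrid[0:h, 0:w]
    py, px = int(point[0]), int(point[1])
    circle = (yy - py) ** 2 + (xx - px) ** 2 <= r ** 2

    # Keep only the zone simply connected to p(i): the connected
    # component of (shape ∩ circle) that contains p(i) itself.
    inside = shape_mask & circle
    zone_labels, _ = ndimage.label(inside)
    zone = zone_labels == zone_labels[py, px]

    # Eq. (2): area of the simply connected zone over the circle area.
    s = zone.sum() / (np.pi * r ** 2)

    # Eq. (3): arc length of the contour segment through p(i) inside the
    # circle over the circumference, approximated here by counting
    # boundary pixels of the shape that fall in the simply connected zone.
    boundary = shape_mask & ~ndimage.binary_erosion(shape_mask)
    l = (boundary & zone).sum() / (2 * np.pi * r)

    # Eq. (5): distance from p(i) to the zone's weight center over r.
    cy, cx = ndimage.center_of_mass(zone)
    c = np.hypot(py - cy, px - cx) / r

    return s, l, c
```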
For each point p(i), the descriptor thus contains 3m invariants that represent the shape feature in different aspects (area, arc length and central distance) and at different scales (semi-globally and locally). The combination of these invariants captures the full shape information at p(i). The area alone can only represent the concavity and convexity of a simple shape, while the arc length alone can only represent the complexity of the local shape (a long arc length indicates a complex shape); combining the two, however, improves the discrimination of the representation significantly. Similarly, merging the invariants across scales increases the capability of the descriptor as well.

3.2. The Setting of r and k

To represent a shape with the MID, the radius r_k of the circle C_k needs to be set first. In our method, the radius at each scale is set with respect to an initial radius R:

r_k = R / 2^k,    (6)
so that r_k is half of r_{k−1} at the previous scale. That is, r_1 = R/2 is the radius of C_1 at the highest scale (k = 1), and the circles at the following scales (k = 2, 3, ..., m) have their radii halved scale by scale. The setting of the initial R is important for representing both the local and the semi-global information. We use the invariants at the first scale, i.e., s_1, l_1 and c_1, to represent the semi-global information, and the invariants at the last scale, i.e., s_m, l_m and c_m, to represent the local information. In our method, the semi-global information is the invariant salient feature of the body part of a shape, and the initial R is set as

R = √(area_S),    (7)
where area_S is the area of the whole shape in the image. The circle C_1(i) at the first scale covers the meaningful body part near p(i), and the invariants at this scale capture the semi-global invariant shape features of this part. Conversely, the circle at the last scale is very small, covering so small an area that the invariants can only represent
the detailed but simple shape features, e.g., the curvature of the local shape. Therefore, the invariants at the different scales together represent the full information of the shape.

The scale number m also affects the capability of the descriptor and needs to be selected carefully. A larger scale number is not always better than a smaller one. For simple shapes, e.g., a regular square, the normalized area at the corner points is always 0.25 at every scale, so a larger scale number does not increase the discriminative power of the descriptor but introduces extra noise, which enlarges the intra-class distance and lowers the retrieval accuracy. Moreover, the computational cost also grows with the scale number. The scale number m should therefore be set according to the complexity of the shapes. In the proposed descriptor, the larger scales are always used regardless of the complexity, and m determines only the smallest scale. Complex shapes include more salient local features, which need smaller scales to represent; simple shapes with few local features need only the larger scales. We therefore set the scale number m according to a so-called convergence condition: if the average difference of the invariants between two neighboring scales m and m + 1 is less than a threshold, e.g., 1e-2, the invariants at scale m + 1 are unnecessary.
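A minimal sketch of this convergence condition follows, reusing the hypothetical mid_invariants helper sketched in Section 3.1; the function name, the maximum scale cap and the loop structure are our illustration rather than the authors' code.

```python
import numpy as np

def select_scale_number(shape_mask, contour, R, tol=1e-2, m_max=6):
    """Pick the scale number m by the convergence condition of Sec. 3.2:
    stop adding scales once the mean absolute change of the invariants
    between two neighboring scales drops below tol."""
    prev = None
    for k in range(1, m_max + 1):
        r = R / 2 ** k                  # Eq. (6): radii halve per scale
        cur = np.array([mid_invariants(shape_mask, p, r) for p in contour])
        if prev is not None and np.mean(np.abs(cur - prev)) < tol:
            return k - 1                # scale k adds little: keep m = k-1
        prev = cur
    return m_max
```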
4. Contour Evolution for Salient Feature Points

Most of the previous methods use all the sample points on the shape contour, which suffers from three problems: 1) there are many redundant points without salient features, yet they carry the same weight as the salient feature points, which makes the descriptor less representative of the shape; 2) the redundant points are sensitive to noise, and their variations increase the alignment error and the intra-class distance; 3) the redundant points increase the computational cost of both representation and matching. To extract the shape features and weed out the redundant points, Latecki et al. [32] proposed the discrete contour evolution (DCE) method, which deletes redundant points and preserves the salient feature points. The main idea of DCE is that, at each evolution step, the point with the minimum contribution to target identification is deleted. The contribution of a point p(i) is defined by a relevance measure function K:

K(i) = B(i) b(i, i−1) b(i, i+1) / (b(i, i−1) + b(i, i+1)),    (8)
Figure 4: The DCE evolution results under different evolution degrees. The evolution degree increases from left to right.
where b(i, i−1) is the length of the segment between p(i) and p(i−1), b(i, i+1) is the length of the segment between p(i) and p(i+1), and B(i) is the turn angle between these two segments. The lengths b are normalized with respect to the contour perimeter. A higher K indicates a larger contribution to the shape. The deficiency of this method, however, is that it cannot evolve adaptively, because K is computed locally: the evolution may stop too early, leaving border noise (Fig. 4 (b)), or go too far, collapsing the contour into a convex polygon (Fig. 4 (f)).

In this work, we propose an adaptive discrete contour evolution (ADCE) method that overcomes this problem with an adaptive ending of the evolution. The ideal stopping point is when the visual parts of the evolved contour correspond exactly to the significant visual parts of the original shape contour; the evolved contour should therefore maintain a certain similarity to the original contour. We define an area-based adaptive ending function F(t), called the evolution area difference function (EADF), as follows:

F(t) = Σ_{i=1}^{t} |S_i − S_{i−1}| · n_0 / S_0,    (9)
where S_0 is the area of the original contour, S_t is the area after t evolution steps and n_0 is the number of points on the original contour. Note that we use the absolute difference between S_t and S_{t−1}, since both convex and concave evolution contribute to the total area difference. The contour evolution ends when F(t) exceeds a given threshold. The evolution results for two sample shapes are shown in Fig. 5, together with the point numbers of the original and evolved contours. With the threshold set to a relatively large value of 0.5, the evolved shapes preserve the salient features of the original shapes without the redundant points. We should note that this ADCE step is only used to find the representative feature
Figure 5: (b) and (d) are the ADCE evolution results of the original shapes (a) and (c) under the threshold F = 0.5. The numbers beside the shapes are the point counts of the respective contours.
points, while the MID of the evolved contour is still calculated from the original image at these feature points. This preserves the original shape features of the salient feature points. A sketch of the full ADCE loop follows.
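A minimal sketch of ADCE under our reading of Eqs. (8)-(9): the relevance measure uses the turn angle at each point, the polygon area is computed by the shoelace formula, and the brute-force re-evaluation of K at every step is our simplification, not the authors' implementation.

```python
import numpy as np

def relevance(p_prev, p, p_next, perimeter):
    """Relevance measure K of Eq. (8): the turn angle at p times the
    normalized lengths of its two adjacent segments."""
    a = np.linalg.norm(p - p_prev) / perimeter
    b = np.linalg.norm(p - p_next) / perimeter
    v1, v2 = p_prev - p, p_next - p
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    turn = np.pi - np.arccos(np.clip(cos, -1.0, 1.0))  # 0 for a straight point
    return turn * a * b / (a + b)

def polygon_area(pts):
    """Shoelace formula for the area of a closed polygon."""
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def adce(contour, f_max=0.5):
    """Adaptive DCE: delete the least relevant point per step until the
    accumulated area difference F(t) of Eq. (9) exceeds f_max."""
    pts = contour.astype(float)
    n0, s0 = len(pts), polygon_area(contour.astype(float))
    f = 0.0
    while len(pts) > 3:
        perim = np.sum(np.linalg.norm(np.roll(pts, -1, axis=0) - pts, axis=1))
        ks = [relevance(pts[i - 1], pts[i], pts[(i + 1) % len(pts)], perim)
              for i in range(len(pts))]
        cand = np.delete(pts, int(np.argmin(ks)), axis=0)
        f += abs(polygon_area(cand) - polygon_area(pts)) * n0 / s0
        if f > f_max:               # EADF threshold reached: stop evolving
            break
        pts = cand
    return pts
```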
5. Metric Learning Based Object Recognition

Most previous shape matching methods use typical matching algorithms, e.g., thin plate spline (TPS), dynamic programming (DP) and dynamic time warping (DTW), to calculate the pairwise similarity of shape descriptions. These methods ignore the prior knowledge in the dataset, and large intra-class variation lowers the accuracy of matching and retrieval. Furthermore, finding the best alignment between shapes is time consuming. In this work, we handle these problems by two steps of learning: 1) unsupervised dictionary learning for a new compact representation of shapes based on the multi-scale invariant descriptor; 2) supervised metric learning for the weights of the dimensions of the new representation.

5.1. Bag of Salient Feature Points (BoP)

Inspired by the popular image representation with Bag-of-Words (BoW) [36][37], we learn a dictionary of shape feature points from a dataset and quantize the shapes into shape codes. We name our method Bag of Salient Feature Points (BoP). Analogously to the feature points extracted by SIFT [38] in image representation, the feature points extracted by ADCE and represented by the MID are used to learn a dictionary with an unsupervised method. In this paper, we use k-means clustering to learn the dictionary: a set of training feature points is extracted from randomly selected training shape samples, and the k-means algorithm is run on this training set (a sketch of the dictionary and histogram steps follows below).
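A minimal sketch of the two BoP steps, assuming the MID descriptors are stacked as rows of an array; the function names and the histogram normalization are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_dictionary(training_descriptors, n_words=300, n_iter=10):
    """k-means dictionary learning over MID descriptors of training
    feature points (rows are 3m-dimensional MID vectors)."""
    km = KMeans(n_clusters=n_words, max_iter=n_iter, n_init=1)
    km.fit(training_descriptors)
    return km.cluster_centers_          # dictionary G = [g_1, ..., g_N]

def bop_histogram(shape_descriptors, dictionary):
    """Assign each salient point of one shape to its nearest word and
    summarize the shape as a word histogram h (normalized here so that
    histograms of shapes with different point counts are comparable)."""
    d = np.linalg.norm(
        shape_descriptors[:, None, :] - dictionary[None, :, :], axis=2)
    words = d.argmin(axis=1)            # nearest-center assignment
    h = np.bincount(words, minlength=len(dictionary)).astype(float)
    return h / h.sum()
```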
The clustered centers are used as the words of the feature point dictionary G = [g_1, g_2, g_3, ..., g_N], where each column is a cluster center of the feature points; the N centers then represent the whole shape space. For each shape, the feature points are assigned to the cluster centers by the k nearest neighbors (KNN) rule. The assignment is summarized as a histogram h = (h_1, h_2, ..., h_N) representing the shape, where each dimension h_i is the i-th bin of the histogram. Finally, each shape is represented by a statistical histogram over the words of the dictionary, and the similarity between shapes can be calculated from these histograms efficiently, without pairwise alignment.

Histogram representations often face the issue that the spatial information of the feature points in the image is lost; similarly, the order of the feature points along the shape contour is lost in the BoP. However, the MID includes the context information and the semi-global structure of the shape, which alleviates this problem as far as possible. Using BoP with the MID is thus an appropriate combination in our method.

5.2. Metric Learning for Object Recognition

For measuring the similarity between objects, the Bhattacharyya distance is employed to compare the BoP histogram vectors h, owing to its superior Bulls-Eye retrieval rate in our experiments compared with other distances, e.g., the Euclidean distance.
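As an aside, a minimal sketch of one common form of the Bhattacharyya distance between normalized histograms; the exact variant used in the paper is not specified, so this form is an assumption.

```python
import numpy as np

def bhattacharyya_distance(h1, h2):
    """Bhattacharyya distance between two L1-normalized BoP histograms;
    the -log of the Bhattacharyya coefficient is one common variant."""
    bc = np.sum(np.sqrt(h1 * h2))        # Bhattacharyya coefficient
    return -np.log(np.clip(bc, 1e-12, 1.0))
```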
However, intra-class variation increases the distance between shapes of the same class, which lowers the retrieval rate whenever the intra-class distance exceeds the inter-class distance. This is because the values of the "irrelevant" dimensions of h have a large variance. For instance, one type of feature point (one bag) may have no discriminative power, yet the shapes of a given class contain different numbers of this type of feature point, e.g., from 1 to 100; the difference in this number then inflates the total intra-class distance. To solve this problem and exploit the prior knowledge of the dataset, we need to learn weights that assign large values to the relevant dimensions and small values to the irrelevant dimensions.

In our method, the Relevant Component Analysis (RCA) algorithm [39][40], a preferable choice among the popular distance metric learning methods, is employed to learn these weights. RCA is more appropriate than the alternatives here because it addresses exactly this relevant/irrelevant-dimension problem and can be applied to the original feature vectors directly. In RCA, all the histogram vectors are transformed to a new space by a weighted transformation matrix W. Assume a total of p shapes in k shape classes, where each class j consists of n_j shapes represented by histogram vectors {h_i^j}_{i=1}^{n_j} with mean m_j. The matrix W is computed as

W = C*^{−1/2},    (10)

where C* is the within-class covariance matrix of all the points:

C* = (1/p) Σ_{j=1}^{k} Σ_{i=1}^{n_j} (h_i^j − m_j)(h_i^j − m_j)^t.    (11)

This matrix is then applied to the original data points: x_new = W x. The matching distance d between two shapes p and q is the Euclidean distance between the transformed histogram vectors:

d = ||x_new^p − x_new^q||.    (12)
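A minimal sketch of Eqs. (10)-(12) in Python; the eigendecomposition-based inverse square root and the regularization floor are our own implementation choices.

```python
import numpy as np

def rca_transform(histograms, labels):
    """RCA of Eqs. (10)-(11): estimate the within-class covariance C*
    from class-centered histograms and return W = C*^(-1/2)."""
    X = np.asarray(histograms, dtype=float)
    labels = np.asarray(labels)
    centered = np.vstack([X[labels == j] - X[labels == j].mean(axis=0)
                          for j in np.unique(labels)])
    c_star = centered.T @ centered / len(X)
    # Symmetric inverse square root via the eigendecomposition of C*.
    evals, evecs = np.linalg.eigh(c_star)
    evals = np.maximum(evals, 1e-10)    # guard against singular directions
    return evecs @ np.diag(evals ** -0.5) @ evecs.T

def rca_distance(h_p, h_q, W):
    """Matching distance of Eq. (12) in the RCA-transformed space."""
    return np.linalg.norm(W @ h_p - W @ h_q)
```

In practice W would be estimated once from the training histograms and then applied to every query histogram at matching time.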
6. Experimental Results

In this section, we evaluate our method in three respects: 1) demonstrating the invariance properties of the proposed MID descriptor under articulated deformation, partial occlusion and intra-class variation, including rotation and scale variation; 2) evaluating its representative and discriminative power through shape matching and retrieval experiments on benchmark datasets, including the MPEG-7 [14], Articulated [10], Kimia's 99 and Kimia's 216 [15] datasets, in comparison with other methods; 3) testing its robustness to noise in shape description and retrieval.
6.1. Invariance Properties of MID

In this experiment we evaluate the invariance of the MID to rotation, scale variation, articulated deformation, intra-class variation and partial occlusion. Since rotation and scale variation occur in most object detection and recognition applications, we first test these two variations combined with intra-class variation. We select two images with salient intra-class variation from the same class of the MPEG-7 dataset, as shown in Fig. 6. In the first column of Fig. 6, rows 1 and 2 are the original images with salient variations, rows 3 and 4 are the corresponding rotated images, rows 5 and 6 are scaled images, and rows 7 and 8 are rotated and scaled images. Columns 2-4 plot the MID invariant functions of the shapes at scales 1-3. The plots show significant correspondences and similarities among the functions of the different shapes. At the higher scales (e.g., scale 3 in the figures), more detailed information appears as local detail in the functions. Note that the plots are the invariant functions of the original shapes, with many redundant points; after the evolution by ADCE, the invariants capture the shape information of the salient feature points more clearly, e.g., the peak values of the functions. The strong correspondence of the function peaks across the shapes indicates the invariance of our method to intra-class variation.

Articulated deformation is another challenging variation in shape description. We use shape samples of two classes, rabbits and horses, from the MPEG-7 dataset to demonstrate the invariance of our method. As mentioned in Section 3, the articulated-deformation invariance on the rabbit ears (Fig. 3) follows from defining the invariants over simply connected zones and segments. Here we use another example, the horse legs, as shown in Fig. 7. The standing horse (top image of the first column in Fig. 7) has its two front legs parallel and close together, so the circle centered on the left leg covers both front legs, whereas the circle centered at the corresponding point on the running horse (bottom image of the first column) covers only one leg. The functions of their invariants at the first scale are shown in columns 2, 3 and 4, with labels (black circles on the functions) corresponding to the circle centers on the horse legs. We can find the
Figure 6: In the first column, rows 1 and 2 are the original images with salient variations, rows 3 and 4 are the corresponding rotated images, rows 5 and 6 are scaled images, and rows 7 and 8 are rotated and scaled images. Columns 2-4 are the invariant functions at different scales corresponding to the images on the left.
Figure 7: The invariant functions and histograms for articulated shapes. The first column shows two shapes with articulated deformation at the horse legs. Columns 2-4 are the functions of the invariants s, l and c at the first scale. The small circles on the invariant functions correspond to the circle-center positions on the original shape contours. The last column shows the BoP histograms of the respective shapes.
strong similarities between the corresponding functions at the labels, which indicate the articulated-deformation invariance of our method. The right front leg of the standing horse does not affect the calculation of the invariants of the left front leg, even though it lies inside the circle. The Articulated dataset [10], which contains shapes with various articulated deformations, is used to test the robustness of our method in shape matching and retrieval in Section 6.2. We only show the functions at the first scale, because the first scale is the one most affected by articulated deformation: the higher scales have smaller circles covering a more local range that the articulated parts cannot seriously affect. Thus, if the method is invariant to articulated deformation at the first scale, it is also invariant at the higher scales. The BoP histograms of the two shapes (Fig. 7, column 5) also show significant correspondence, especially in the tall bins, which supports our method as well.

Occlusion is unavoidable in real applications of object recognition by shape matching, due to obstacles during tracking. This experiment demonstrates the invariance of the MID descriptor on partially occluded shapes. A sample shape of the camel class from the MPEG-7 dataset is partially occluded in our test, and the MID invariants of both the occluded and the original shape are shown in Fig. 8. The camel head is occluded, and the squares in the function plots (columns 2-4) correspond to the parts marked by squares in the original and occluded images (column 1). The blocked part of the occluded camel is a line segment, and the functions inside the
Figure 8: The invariant functions and histograms of an occluded shape compared with the original shape. The first column shows the original shape and the occluded shape; the occluded part is marked by squares in both images. Columns 2-4 are the functions of the invariants s, l and c at the first scale. The squares on the functions correspond to the occluded part of the image. The last column shows the BoP histograms of the respective shapes.
corresponding squares are much simpler than those of the original image. Outside the squares, the corresponding functions are similar, which indicates that our method is invariant to occlusion. The reason for showing only the first scale of the functions in Fig. 8 is the same as in the articulated-deformation experiment. The histograms are shown in the last column of Fig. 8, where most of the corresponding bins are similar. The values of the 89th bin differ greatly because this class of points is mostly distributed in the occluded part, i.e., the head of the camel. Note that too heavy an occlusion, e.g., of the main part of the shape, may cause ambiguities in shape matching and retrieval, since too much shape information is lost.

6.2. Object Recognition

6.2.1. Implementation Setting

For shape representation, the scale number of the MID descriptor is set to 4 for the MPEG-7 dataset and 3 for the Articulated and Kimia's 99 datasets, according to the convergence condition of our method. We use the proposed ADCE method to extract the salient feature points, with the threshold F set to 0.3; the number of extracted points ranges from 40 to 400. In the BoP, standard k-means clustering is employed for training the dictionary, with the cluster number set to 300 unless otherwise specified and the iteration number set to 10. We also test our method with different numbers of cluster centers. In the following we give the retrieval results and analysis on the
Figure 9: Typical shapes of MPEG-7, with two shapes from each class.
benchmark datasets.

6.2.2. MPEG-7 Dataset

MPEG-7 [14] is a standard dataset widely used to test shape matching and retrieval methods. It consists of 1400 binary images divided into 70 shape classes, each containing 20 shapes. Fig. 9 shows some typical shapes of MPEG-7, two from each class; the shapes exhibit large intra-class variations. The retrieval rate of a method is generally measured by the so-called bull's eye score: every shape in the dataset is used as a query and compared with all other shapes by shape matching, and among the 40 most similar shapes, the number of shapes from the same class as the query is counted. The bull's eye retrieval rate is the ratio of the total count of same-class shapes retrieved to the highest possible count, so the best possible rate is 100%. We compute this over all shapes in the dataset, dividing the sum by 28000 (1400 × 20); a sketch of this protocol is given below. The bull's eye scores of our method and of competing methods are listed in Table 1, where our method achieves the best score. It is significantly higher than the IDSC [10] method, which exploits the shape context and is robust to articulated deformation, and also higher than the IDSC improved by the LP learning method. These results confirm that our method is representative and discriminative for shape retrieval. The retrieval rates of the 70 shape classes obtained by our method are displayed as a histogram in Fig. 10, which shows that the rates of two classes, Class 32 and Class 63, are much lower than those of the other classes.
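A minimal sketch of the bull's eye protocol described above; the function name and the assumption that the query itself is included among its own retrievals follow common usage and are not spelled out in the paper.

```python
import numpy as np

def bulls_eye_score(dist_matrix, labels, top=40, per_class=20):
    """Bull's eye rate: for each query shape, count same-class shapes
    among its `top` nearest neighbors (the query itself included, as is
    customary), then divide by the best possible total (e.g. 1400*20)."""
    labels = np.asarray(labels)
    hits = 0
    for q in range(len(labels)):
        nearest = np.argsort(dist_matrix[q])[:top]
        hits += int(np.sum(labels[nearest] == labels[q]))
    return hits / (len(labels) * per_class)
```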
Table 1: Retrieval Rates (Bull's Eye) of Different Methods on the MPEG-7 Dataset

Method                          Retrieval rate (%)
Visual Part [32]                76.45
SC [2]                          76.51
Distance Sets [24]              78.38
Gen. Model [41]                 80.03
SSC [42]                        82.39
AISD [43]                       84.26
IDSC [10]                       85.40
HPM [44]                        86.35
CDTW+SC [12]                    86.73
TAR [11]                        87.23
Shape Tree [31]                 87.70
ASC [45]                        88.30
Contour Flexibility [46]        89.31
Locally Affine Invariant [6]    89.62
Height Functions [9]            89.66
Shape Vocabulary [8]            90.41
Our Method                      92.57
Figure 10: The retrieval rate of the 70 classes on the MPEG-7 dataset.
Figure 11: Shapes of the two classes with the lowest retrieval rates among the 70 classes of the MPEG-7 dataset.
These shapes are shown in Fig. 11: Class 32 consists of circles and Class 63 of spoons. The circles of Class 32 are hard to retrieve for two reasons. One is that there is only one kind of salient feature point to represent a circle, i.e., the points on the circle, which share the same invariant values and fall into the same BoP cluster. The other is that there are many interference points, e.g., the points of the chaotic curve inside the circle: these points carry so many shape features that they cannot be removed by ADCE, so they produce large values on the irrelevant dimensions of the BoP histogram and increase the intra-class distances. For the spoon class, there are too many intra-class variations and too few salient feature points for discrimination.

6.2.3. Kimia's Datasets

The Kimia's datasets [15] are another widely used benchmark for shape matching and retrieval, comprising Kimia's 25, Kimia's 99 and Kimia's 216. As Kimia's 25 is too small (only 25 shape samples in total), we choose Kimia's 99 and Kimia's 216 to test our method for shape retrieval. All the shape samples of these two datasets are shown in Figs. 12 and 13. Kimia's 99 contains 99 shapes grouped into 9 classes. In our test, each shape is used as a query to match all other shapes and their similarities are computed. The retrieval result is summarized as the number of same-class shapes among the 1st to 10th most similar shapes, so the best possible result at each rank is 99. Kimia's 216 consists of 18 shape classes with 12 shapes each; we test our method on it in the same manner, and the best possible score at each rank is 216. The results on these two datasets, compared with other methods, are shown in Tables 2 and 3; our method obtains comparable results.

6.2.4. Articulated Dataset

In this experiment, we apply our method to the Articulated dataset [10] to evaluate its robustness to articulated deformation. This dataset contains 40 shapes from 8 object classes (six classes of scissors and two classes of staplers), as shown in Fig. 14. Within each class, the five shapes exhibit serious articulated
Figure 12: All the shapes of the Kimia’s 99 dataset.
Figure 13: All the shapes of the Kimia’s 216 dataset.
Table 2: Retrieval Results on the Kimia's 99 Dataset

Method                    1st  2nd  3rd  4th  5th  6th  7th  8th  9th  10th
SC [2]                     97   91   88   85   84   77   75   66   56   37
CDPH+EMD [47]              96   94   94   87   88   82   80   70   62   55
Gen. Model [41]            99   97   99   98   96   96   94   83   75   48
Efficient Indexing [17]    99   97   98   96   97   97   96   91   83   75
Path Similarity [48]       99   99   99   99   96   97   95   93   89   73
Our Method                 99   99   99   99   98   98   96   95   92   82
Table 3: Retrieval Results on the Kimia's 216 Dataset

Method           1st  2nd  3rd  4th  5th  6th  7th  8th  9th  10th  11th
SC [2]           214  209  205  197  191  178  161  144  131  101    78
CDPH+EMD [47]    215  215  213  205  203  204  190  180  168  154   123
Our Method       216  216  215  212  210  207  205  201  192  185   167
Figure 14: All the shapes of the articulated shape dataset.
deformation relative to each other. Besides the articulated deformation, another challenge for shape retrieval on this dataset is the significant inter-class similarity. The test is carried out in the same manner as the experiments on the Kimia's datasets, and the results are shown in Table 4. Our method obtains the highest scores in the table, notably outperforming the IDSC [10] method, which is specially designed for articulated shapes. Although the curves of the holes inside the scissors are not used as contours in our method, the holes do affect the values of the MID invariants, which increases the discrimination of our method on this dataset.
Table 4: Retrieval Results on the Articulated Shape Dataset

Method                Top 1  Top 2  Top 3  Top 4
L2 (baseline) [10]      25     15     12     10
SC [10]                 20     10     23      5
MDS+SC [10]             36     26     17     15
IDSC [10]               40     34     35     27
Our Method              40     36     36     31
Figure 15: The same shape with Gaussian noise of different degrees.
Figure 16: The invariant functions under Gaussian noise, with σ = 0, 0.4 and 0.8 in rows 1-3, respectively. The columns are the different invariant types at the first scale.
6.3. Robustness to Noise

This experiment evaluates the robustness of our method against noise. Gaussian noise is added to the original shape contours of the Kimia's 99 dataset: each contour point is perturbed in both the x and y directions by a Gaussian random variable with zero mean and deviation σ (sketched below). The noisy shapes for different deviations σ are shown in Fig. 15, and the MID functions of the original shape (Fig. 15 (a)) and of the noisy shapes with σ = 0.4 and σ = 0.8 (Fig. 15 (c), (e)) are shown in Fig. 16. The MID invariants remain nearly unchanged under noise, and increasing σ has very little effect on our method. The invariants s and c are almost identical across the different σ; the absolute values of l increase as σ grows, because the noise increases the total arc length of the contour and hence the normalized l.
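A minimal sketch of the contour perturbation used in this test; the fixed random seed is our addition for repeatability.

```python
import numpy as np

def perturb_contour(contour, sigma, seed=0):
    """Perturb every contour point in both x and y with zero-mean
    Gaussian noise of deviation sigma (the setup of Sec. 6.3)."""
    rng = np.random.default_rng(seed)   # fixed seed for repeatability
    return contour + rng.normal(0.0, sigma, size=contour.shape)
```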
Table 5: Retrieval Results on the Kimia's 99 Dataset with Noise

σ      1st  2nd  3rd  4th  5th  6th  7th  8th  9th  10th
0       99   99   99   99   98   98   96   95   92   82
0.2     99   99   99   98   98   97   96   93   91   80
0.4     99   99   98   98   97   97   95   92   90   79
0.6     99   99   98   97   97   96   94   92   88   77
0.8     99   99   98   97   96   95   92   89   84   73
We also test our method on the perturbed Kimia's 99 dataset with different σ, computing the retrieval rates in the same manner as before. The results are shown in Table 5: the retrieval rates are barely affected at noise levels from σ = 0.2 to σ = 0.6, and even at σ = 0.8 they remain relatively high. This verifies the robustness of our method against noise in shape retrieval.
7. Conclusion

In this paper, we proposed an object recognition method based on metric learning, in which different types of invariants are used and different scales of object features are computed. The experimental results verify the effectiveness of the proposed descriptor, and its robustness to noise is also demonstrated. The metric learning method works well with the descriptor, making the recognition process efficient and accurate. From the invariant functions, we can see that the features at different scales are all captured by the corresponding invariants. The proposed ADCE algorithm is capable of extracting the salient feature points, which is a necessary preprocessing step for the BoP clustering and histogram representation. The experiments on the benchmark datasets verify that the proposed method offers a clear advantage in recognition accuracy. The method is applicable to general contour based object recognition, e.g., contour based pedestrian and vehicle recognition via network control technology [49][50] in intelligent transportation. Robots could also recognize objects with this method, and we will explore such applications in
future work.
References

[1] You, X., Tang, Y.Y.: Wavelet-based approach to character skeleton, IEEE Trans. Image Processing, 16, (5), pp. 1220-1231 (2007).
[2] Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts, IEEE Trans. Pattern Analysis and Machine Intelligence, 24, (4), pp. 509-522 (2002).
[3] Wolter, D., Latecki, L.J.: Shape matching for robot mapping, Pacific Rim International Conference on Artificial Intelligence (2004).
[4] Qiu, J., Wei, Y., Karimi, H.R.: New approach to delay-dependent H-infinity control for continuous-time Markovian jump systems with time-varying delay and deficient transition descriptions, Journal of The Franklin Institute, 352, (1), pp. 189-215 (2015).
[5] Zhang, C., Feng, G., Gao, H., Qiu, J.: H-infinity filtering for nonlinear discrete-time systems subject to quantization and packet dropouts, IEEE Transactions on Fuzzy Systems, 19, (2), pp. 353-365 (2011).
[6] Wang, Z., Liang, M.: Locally affine invariant descriptors for shape matching and retrieval, IEEE Signal Processing Letters, 17, (9), pp. 803-806 (2010).
[7] Lu, H., Feng, X., Li, X., Zhang, L.: Superpixel level object recognition under local learning framework, Neurocomputing, 120, pp. 203-213 (2013).
[8] Bai, X., Rao, C., Wang, X.: Shape vocabulary: a robust and efficient shape representation for shape matching, IEEE Trans. Image Processing, 23, pp. 3935-3949 (2014).
[9] Wang, J., Bai, X., You, X., et al.: Shape matching and classification using height functions, Pattern Recognition Letters, 33, (2), pp. 134-143 (2012).
[10] Ling, H., Jacobs, D.: Shape classification using the inner-distance, IEEE Trans. Pattern Analysis and Machine Intelligence, 29, (2), pp. 286-299 (2007).
[11] Alajlan, N., El Rube, I., Kamel, M., et al.: Shape retrieval using triangle-area representation and dynamic space warping, Pattern Recognition, 40, (7), pp. 1911-1920 (2007).
[12] Palazón-González, V., Marzal, A.: On the dynamic time warping of cyclic sequences for shape retrieval, Image and Vision Computing, 30, (12), pp. 978-990 (2012).
[13] Siddharth, M., Daniel, C., Byung-Woo, H., et al.: Integral invariants for shape matching, IEEE Trans. Pattern Analysis and Machine Intelligence, 28, (10), pp. 1602-1618 (2006).
[14] Latecki, L., Lakamper, R., Eckhardt, T.: Shape descriptors for non-rigid shapes with a single closed contour, Proc. IEEE Conf. Computer Vision and Pattern Recognition, Hilton Head Island, SC, USA, vol. 1, pp. 424-429 (2000).
[15] Sebastian, T., Klein, P., Kimia, B.B.: Recognition of shapes by editing their shock graphs, IEEE Trans. Pattern Analysis and Machine Intelligence, 26, (5), pp. 550-571 (2004).
[16] Mokhtarian, F., Abbasi, S., Kittler, J.: Efficient and robust retrieval by shape content through curvature scale space, in: Smeulders, A.W.M., Jain, R. (Eds.), Image Databases and Multi-Media Search, pp. 51-58 (1997).
[17] Biswas, S., Aggarwal, G., Chellappa, R.: Efficient indexing for articulation invariant shape matching and retrieval, Proc. IEEE Conf. Computer Vision and Pattern Recognition, Minneapolis, MN, USA, pp. 1-8 (2007).
[18] Soysal, M., Alatan, A.: Joint utilization of local appearance and geometric invariants for 3D object recognition, Multimedia Tools and Applications, 74, pp. 2611-2637 (2015).
[19] Byung-Woo, H., Stefano, S.: Shape matching using multiscale integral invariants, IEEE Trans. Pattern Analysis and Machine Intelligence, 37, (1), pp. 151-160 (2015).
[20] El Rube, I., Alajlan, N., Kamel, M., et al.: Robust multiscale triangle-area representation for 2D shapes, Proc. IEEE International Conference on Image Processing, Genoa, Italy, September 2005, pp. 545-548.
[21] Prashant, A., Frederic, F.: Point-based medialness for 2D shape description and identification, Multimedia Tools and Applications, in press.
[22] Qiu, J., Tian, H., Lu, Q., Gao, H.: Non-synchronized robust filtering design for continuous-time T-S fuzzy affine dynamic systems based on piecewise Lyapunov functions, IEEE Transactions on Cybernetics, 43, (6), pp. 1755-1766 (2013).
[23] Qiu, J., Ding, S.X., Gao, H., Yin, S.: Fuzzy-model-based reliable static output feedback H-infinity control of nonlinear hyperbolic PDE systems, IEEE Transactions on Fuzzy Systems, doi: 10.1109/TFUZZ.2015.2457934.
[24] Grigorescu, C., Petkov, N.: Distance sets for shape filters and shape recognition, IEEE Trans. Image Processing, 12, (10), pp. 1274-1286 (2003).
[25] Xie, J., Heng, P., Shah, M.: Shape matching and modeling using skeletal context, Pattern Recognition, 41, (5), pp. 1756-1767 (2008).
[26] Attalla, E., Siy, P.: Robust shape similarity retrieval based on contour segmentation polygonal multiresolution and elastic matching, Pattern Recognition, 38, (12), pp. 2229-2241 (2005).
[27] Adamek, T., O'Connor, N.E.: A multiscale representation method for nonrigid shapes with a single closed contour, IEEE Trans. Circuits and Systems for Video Technology, 14, (5), pp. 742-753 (2004).
[28] Schmidt, F.R., Toeppe, E., Cremers, D.: Efficient planar graph cuts with applications in computer vision, Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 351-356 (2009).
[29] Alajlan, N., Rube, I.E., Kamel, M.S., Freeman, G.: Shape retrieval using triangle-area representation and dynamic space warping, Pattern Recognition, 40, pp. 1911-1920 (2007).
[30] Alajlan, N., Kamel, M., Freeman, G.: Geometry-based image retrieval in binary image databases, IEEE Trans. Pattern Analysis and Machine Intelligence, 30, (6), pp. 1003-1013 (2008).
[31] Felzenszwalb, P.F., Schwartz, J.: Hierarchical matching of deformable shapes, Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-8 (2007).
[32] Latecki, L., Lakamper, R.: Shape similarity measure based on correspondence of visual parts, IEEE Trans. Pattern Analysis and Machine Intelligence, 22, (10), pp. 1185-1190 (2000).
[33] Miao, Y., Tao, X., Sun, Y., et al.: Risk-based adaptive metric learning for nearest neighbour classification, Neurocomputing, 156, pp. 33-41 (2015).
[34] Wang, J., Wang, H., Yan, Y.: Robust visual tracking by metric learning with weighted histogram representations, Neurocomputing, 153, pp. 77-88 (2015).
[35] Liu, J., Sun, Z., Tan, T.: Distance metric learning for recognizing low-resolution iris images, Neurocomputing, 144, pp. 484-492 (2014).
[36] Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching in videos, Proc. International Conference on Computer Vision, pp. 1470-1477 (2003).
[37] Csurka, G., Dance, C., Fan, L., et al.: Visual categorization with bags of keypoints, Workshop on Statistical Learning in Computer Vision, ECCV (2004).
[38] Lowe, D.: Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision, 60, pp. 91-110 (2004).
[39] Bar-Hillel, A., Hertz, T., Shental, N., Weinshall, D.: Learning distance functions using equivalence relations, Proc. International Conference on Machine Learning, Washington, DC, pp. 11-18 (2003).
[40] Shental, N., Hertz, T., Weinshall, D., Pavel, M.: Adjustment learning and relevant component analysis, ECCV, pp. 776-790 (2002).
[41] Tu, Z., Yuille, A.: Shape matching and recognition using generative models and informative features, Proc. European Conf. Computer Vision (ECCV), Prague, Czech Republic, pp. 195-209 (2004).
[42] Premachandran, V., Kakarala, R.: Perceptually motivated shape context which uses shape interiors, Pattern Recognition, 46, (8), pp. 2092-2102 (2013).
[43] Fu, H., Tian, Z., Ran, M., et al.: Novel affine-invariant curve descriptor for curve matching and occluded object recognition, IET Computer Vision, 7, (4), pp. 279-292 (2013).
[44] McNeill, G., Vijayakumar, S.: Hierarchical procrustes matching for shape retrieval, Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 885-894 (2006).
[45] Ling, H., Yang, X., Latecki, L.: Balancing deformability and discriminability for shape matching, Proc. European Conf. Computer Vision (ECCV), Heraklion, Crete, Greece, pp. 411-424 (2010).
[46] Xu, C., Liu, J., Tang, X.: 2D shape matching by contour flexibility, IEEE Trans. Pattern Analysis and Machine Intelligence, 31, (1), pp. 180-186 (2009).
[47] Shu, X., Wu, X.: A novel contour descriptor for 2D shape matching and its application to image retrieval, Image and Vision Computing, 29, (4), pp. 286-294 (2011).
[48] Bai, X., Latecki, L.: Path similarity skeleton graph matching, IEEE Trans. Pattern Analysis and Machine Intelligence, 30, (7), pp. 1282-1292 (2008).
[49] Zhang, C., Feng, G., Qiu, J., Shen, Y.: Control synthesis for a class of linear network-based systems with communication constraints, IEEE Transactions on Industrial Electronics, 60, (8), pp. 3339-3348 (2013).
[50] Qiu, J., Gao, H., Ding, S.X.: Recent advances on fuzzy-model-based nonlinear networked control systems: a survey, IEEE Transactions on Industrial Electronics, doi: 10.1109/TIE.2015.2504351.