Contour Segment Grouping for Object Detection

Accepted Manuscript

Hui Wei, Chengzhuan Yang, Qian Yu

PII: S1047-3203(17)30148-7
DOI: http://dx.doi.org/10.1016/j.jvcir.2017.07.003
Reference: YJVCI 2034

To appear in: J. Vis. Commun. Image R.

Received Date: 3 January 2017
Accepted Date: 12 July 2017

Please cite this article as: H. Wei, C. Yang, Q. Yu, Contour Segment Grouping for Object Detection, J. Vis. Commun. Image R. (2017), doi: http://dx.doi.org/10.1016/j.jvcir.2017.07.003

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Contour Segment Grouping for Object Detection

Hui Wei a,b, Chengzhuan Yang a,b, Qian Yu a,b

a Laboratory of Cognitive Model and Algorithm, School of Computer Science, Fudan University, No. 825 Zhangheng Road, Shanghai 201203, China
b Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, No. 825 Zhangheng Road, Shanghai 201203, China

Abstract

In this paper, we propose a novel framework for object detection and recognition in cluttered images, given a single hand-drawn example as the model. Compared with previous work, our contribution is three-fold. 1) Three preprocessing procedures are proposed to reduce the number of irrelevant edge fragments that are often generated during edge detection in cluttered real images. 2) A novel shape descriptor is introduced for conducting partial matching between edge fragments and model contours. 3) An efficient search strategy is adopted to identify the locations of target object hypotheses. In the hypothesis verification stage, an appearance-based method (a support vector machine on pyramid histogram of oriented gradients features) is adopted to verify each hypothesis, identify the object, and refine its location. We conduct extensive experiments on several benchmark datasets, including the ETHZ shape classes, INRIA horses, Weizmann horses, and two classes (anchors and cups) from Caltech 101. Experimental results show that the proposed method significantly improves the accuracy of object detection. Comparisons with other recent shape-based methods further demonstrate the effectiveness and robustness of the proposed method.

Keywords: Shape-based object detection; Contour grouping; Depth-first search

1. Introduction

Object detection is a very important and difficult task in computer vision and image analysis, which seeks to locate and identify specific classes of objects within images.

Email addresses: [email protected] (Hui Wei), [email protected] (Chengzhuan Yang), [email protected] (Qian Yu)


This task becomes particularly difficult in images with strongly cluttered backgrounds, and with objects subjected to scale changes, rotation changes, and substantial intra-class variations. Generally speaking, two main paradigms can be distinguished: appearance-based and shape-based detection. In recent years, some appearance-based methods for object detection have achieved remarkable success [30, 50, 20, 46, 12]. However, in many cases the appearances of intra-class objects vary considerably [43], which makes appearance features unreliable. Therefore, we have recently observed a significant increase in methods that utilize shape information [53, 57, 62].

Shape information plays an important role in object detection because shape remains stable under various transformations and blur. Compared with other image features, shape tends to be a very stable feature that is invariant to lighting conditions and to variations in object color and texture. Cognitive science has demonstrated that shape plays a very important role in human object recognition [2]. People can easily distinguish between the shapes of two objects without using any extra information. More importantly, shape can efficiently represent image structure with large spatial extents [49]. Because of these advantages, shape-based approaches for object detection have drawn considerable attention from many researchers [18, 47, 17, 38, 42, 22, 56].

In general, most shape-based approaches primarily employ the contours of objects for object detection. We first obtain the edge pixels using an edge detector such as those discussed by [10, 40] for gray-level images. Edge fragments are then obtained using an edge-linking algorithm such as that described by [28]. The goal is then to select a small subset of edge fragments that match the model contour well. This task, however, presents several challenges. First, because a perfect edge detector does not exist, the contour of a target object is typically broken into several pieces, and some edge fragments are missing owing to the influence of noise and occlusion. Second, the object often appears in a cluttered image, which is the principal source of difficulty for the detection task. In addition, the contours of target objects typically comprise only a small proportion of the available edge fragments, often less than two or three percent. Finally, the shapes of target objects in images vary significantly because of scale changes, rotation changes, non-rigid deformation, and intra-class variations.

Figure 1: Examples of objects with missing parts owing to missing edges and/or broken edge links.

Figure 1 provides examples of edge images that illustrate the challenges involved in shape-based object detection. These problems are unavoidable in cluttered images and must be addressed by any shape-based object detection approach. Therefore, a novel method to solve these issues is proposed in this paper.

Figure 2 presents an example of our proposed shape-based object detection approach. We first obtain edge fragments using an edge detector algorithm, as illustrated in Figure 2(b), where each edge fragment is indicated by a different color. These edge fragments typically form the input of a shape-based object detection approach. The edge detector can produce many edge fragments, but most of them come from the background or irrelevant texture, and only a small subset belongs to the target object. An exhaustive brute-force search over all possible global configurations of edge fragments is prohibitively complex, and at the same time unnecessary. We next use three preprocessing procedures to reduce the large number of irrelevant edge fragments typically generated in cluttered real images, as shown in Figure 2(c). Through preprocessing, we can determine the salient edge fragments and simultaneously reduce the number of irrelevant background fragments. Third, a novel shape descriptor is proposed to represent both edge fragments and the model contours, and a partial shape matching method is then conducted to obtain the three best matches between the model contours and edge fragments. The spatial relationships between the edge fragments and corresponding model contours are then built using the numerous matches obtained between the model contours and edge fragments.

Figure 2: Proposed approach: (a) original image; (b) edge fragments obtained from (a), which typically serve as the input of a shape-based object detection method; (c) preprocessed edge fragments; (d) searched candidate hypothesis results; and (e) final detection result.

Subsequently, a depth-first search-and-combine strategy is adopted to identify the candidate locations of a target object and obtain the candidate hypothesis results, as illustrated in Figure 2(d). Finally, an evaluation method is adopted to verify the candidate hypotheses and determine the final detection result, as shown in Figure 2(e). In the hypothesis verification stage, an appearance-based method, a support vector machine applied to pyramid histogram of oriented gradients descriptors, is adopted to verify each hypothesis, identify the object, and refine its location.

The remainder of this paper is organized as follows. Related work is reviewed in Section 2. In Section 3, a novel shape descriptor is introduced. The early preprocessing and partial shape matching methods are described in Section 4. Section 5 presents our proposed efficient search strategy for identifying the candidate hypotheses and verifying them. Experimental results are provided in Section 6. The final section presents our conclusions and future work.


2. Related Work

A number of applications of the shape-based paradigm have been developed that achieve state-of-the-art performance for several object categories using only shape information. In the following, we briefly review some important studies related to the subject of this paper.

Some methods learn codebooks of local image features for object detection. For example, Leibe et al. [31] employed a codebook representative of local appearance, containing information regarding local structures that may appear on objects of the target category. An implicit shape model that specifies where the codebook entries may occur on the object was then learned. Finally, they integrated the recognition and segmentation problems into a common probabilistic framework with the implicit shape model. However, an approach that depends on the consistency of these appearances may fail for certain object categories (e.g., bottles and cups) whose surface markings are particularly variable, whereas contour fragments are highly suitable for such objects. Subsequently, Opelt et al. [43] and Shotton et al. [49] considered codebook learning of contour fragments for object detection. These two approaches first learn distinct contours as weak classifiers, and then employ boosting to build a strong classifier model. However, these methods depend on the chamfer distance to match edge images with learned contour fragments, and chamfer matching is sensitive to background clutter and object rotation.

Some approaches consider regions rather than points of local interest (appearance) or contours to better estimate the location and scale of a target object. For example, Gu et al. [21] employed the regions proposed by [3] as the image feature for object detection. In their method, each image is first represented by a bag of regions derived from a region tree. They then adopt a discriminative max-margin framework to learn the region weights. Finally, a generalized Hough voting scheme is applied to cast hypotheses of object locations. The advantage of region features is that they are less affected by background clutter; their main problem, however, is that they are very sensitive to segmentation errors, so the resulting detection accuracy is not high.

Similarly, Toshev et al. [54] employed the grouping of superpixel regions for object detection. The superpixel regions were first obtained via an initial over-segmentation. The shapes of these superpixel regions were then extracted and described using a shape descriptor. Finally, the object detection and segmentation problem was formulated as an integer quadratic program, and a semidefinite programming relaxation was employed to solve it. However, it is very difficult to obtain a contiguous object region by image segmentation owing to segmentation errors. As such, some recent studies have proposed approaches employing category-independent localization. For example, Alexe et al. [1] considered an objectness measure employed over bounding boxes to bias a sampling procedure for identifying potential object bounding boxes. Carreira et al. [11] presented a novel framework for generating object regions in an image using bottom-up processes, and subsequently ranked these plausible object regions using mid-level cues. Similarly, Endres et al. [14] proposed a category-independent method to produce a bag-of-regions model, and then employed a structured learning method to rank these regions, such that the top-ranked regions were more likely to be good segmentations of various objects. However, these region proposal methods do not provide the true location of a target object, and can therefore only be regarded as low-level image analysis; moreover, they produce a large number of candidate object regions, and their accuracy is not high.

Most shape-based object detection approaches primarily use the contours of objects. The method proposed here also falls into this category. Therefore, in the following, we review some methods employing contour fragments for object detection. For example, Ferrari et al. [17, 19] built a network of nearly straight adjacent segments (kAS) that work together in a group to match the model parts. In their method, they first build a contour segment network (CSN) of an image, and the kAS are taken from the CSN. They then present a scale-invariant local shape feature for the kAS. Finally, they propose a sliding-window mechanism for detecting objects based on kAS features. The drawback of this method is that the kAS features are not very accurate owing to inconsistent fragment extraction, which affects the detection performance. In addition, the sliding-window mechanism makes detection very slow.

Similarly, Ravishankar et al. [45] approximated the contours of objects by means of short, slightly curved segments, which were found to have better discriminative power than straight segments. This method handles local deformation well; its disadvantage is that it requires selecting the local contour segments manually. Zhu et al. [67] formulated object detection in cluttered images as a set-to-set matching problem, and presented an approximate solution to the hard combinatorial problem. Because it provides only an approximate solution, the detection accuracy is relatively low. Lu et al. [36] first decomposed a model contour into several part bundles. A particle filtering inference method was then employed to simultaneously select and group relevant contour fragments in edge images, and the fragment groups were subsequently matched to the model contours. This method can handle large variations of object contours, including local deformation and missing parts in cluttered images, but owing to the limitation of their inference method it cannot detect multiple objects. To address the non-rigid deformation of an object's shape, Bai et al. [4] employed skeleton information to capture the main structure of an object, and then adopted the oriented chamfer matching method [49] to match the model segments to images. Since chamfer matching is very sensitive to cluttered backgrounds, this affects the detection performance. Riemenschneider et al. [47] adopted a partial matching method to match edge fragments to a single model contour of a category for object detection. This method deals with cluttered backgrounds and occlusion very well; its limitation is the weak discriminative power of its purely angle-based shape descriptor. Srinivasan et al. [52] adopted a many-to-one matching method for object detection, and simultaneously trained a latent support vector machine so that the many-to-one score was tuned very discriminatively to improve detection accuracy. This method requires a large amount of training data to learn the shape model, and its time complexity is very high. Ma and Latecki [38] presented a shape-based object detection approach employing partial shape matching that casts the object detection problem as maximum clique inference on a weighted graph. This method handles occlusion and cluttered backgrounds very well, but it is very inefficient at detecting objects.

In [61], the authors applied the learning of meaningful contour segments to object detection. They first obtained meaningful contours from a collection of training images, then learned the co-occurrences of discriminative contours, and finally employed these contour co-occurrences to detect the objects. This method also requires a large amount of training data to learn the meaningful contours of the object; moreover, it uses maximum-margin multiple instance learning (MIL) to learn the co-occurrences of discriminative contours, and the MIL method is unstable in the learning process, which can affect the stability of the detection performance. Wang et al. [55] presented a novel shape model, denoted the fan shape model (FSM), for object detection. They modeled the contour sample points as rays of finite length emanating from a reference point. This model calculates the shape scale by maximizing an energy function in a scale voting space, but its accuracy is reduced if part of the detected object is corrupted by erroneous edge pixels. Lin et al. [33] proposed an effective method for learning a hierarchical object shape model, representing a category of shapes in the form of an And-Or tree model, which was then employed for object detection. This method captures intra-class variance well; however, it is very difficult to obtain the optimal solution of the And-Or tree model in the learning process. Nguyen [42] presented a mean field-based chamfer template matching method. In this method, each template is represented as a field model, and matching a template with an input image is formulated as maximum a posteriori estimation in the field model. They adopted a variational approach to approximate this estimation, and the method can be applied to two versions of chamfer template matching [49, 37] to detect objects in cluttered real images. However, this method still cannot handle cluttered backgrounds very well, and its detection performance is relatively low. Zhou et al. [65] proposed a framework for crowd modeling using a combination of multiple kernel learning (MKL)-based fast head detection and shape-aware matching. First, they trained a classifier for head detection using the MKL technique. Then, a fast head detection process was presented by applying the classification procedure only at selected spatial locations in the image. Finally, they modeled the crowd as a deformable shape through connected boundary points (head detections) and matched it with the subsequent detections from the next frame in a shape-aware manner.

Hong et al. [24] presented a 3D object recognition method based on multi-view data fusion, named multi-view ensemble manifold regularization (MEMR). In their method, they added a regularization term to the support vector machine (SVM), and a multi-view learning method was then proposed to train the modified SVM. Finally, hypergraph construction was applied to better capture the connectivity among views. Teo et al. [53] described a Gestaltist approach for contour-based object detection that combines bottom-up and top-down contour information to detect objects in cluttered environments. This method handles cluttered backgrounds, occlusion, and viewpoint changes very well, but its time complexity is very high. In [25], a compact dominant orientation templates (DOT) representation was proposed, which is used to efficiently tackle the partial occlusion problem in object detection. The main difference between that method and ours is that it adopts gradient features for object matching, while our method uses shape features for object detection. Most recently, Huang et al. [26] proposed a constellational contour parsing framework for contour-based object detection, which finds the optimal configuration of contour segments in the cluttered image to match the model contour. This method can detect objects with different scales, rotations, perspective changes, and deformations; however, its computational complexity is relatively high.

3. Shape Descriptor

In this section, we present the details of our proposed shape descriptor. Let $p_i = (x_i, y_i)$, $i = 1, 2, \ldots, N$, denote the sequence of equidistant sample points on the contour of a given shape $S$, generated by tracing the boundary counter-clockwise at a constant speed, where $x_i$ and $y_i$ are the coordinates of the point $p_i$, $p_1$ is the starting point, and $N$ is the number of boundary points. First, we compute the coordinates of the centroid $g = (x_g, y_g)$ of the shape $S$, which is given by the following expression:

$$x_g = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad y_g = \frac{1}{N}\sum_{i=1}^{N} y_i. \tag{1}$$

Figure 3: Shape descriptor of point $p_i$, comprised of three components ($d_i^S$, $\alpha_i^S$, and $cR_i^S = \frac{chordLen_i}{radLen_i}$).

For each point $p_i = (x_i, y_i)$ of the shape $S$, we find its farthest point $fp_i = (x_{fp_i}, y_{fp_i})$ by computing the distances from $p_i$ to all other points of the shape $S$ and taking the point at maximum distance. The function $D^S(p_i)$ denotes the shape descriptor of the point $p_i$ of the shape $S$, as expressed by the following equation:

$$D^S(p_i) = (d_i^S, \alpha_i^S, cR_i^S)^T. \tag{2}$$

Here $d_i$ is the distance between the point $p_i$ and the centroid $g$ of the shape $S$, and $\alpha_i$ is the angle spanned by the rays from $p_i$ to the centroid $g$ and from $p_i$ to the farthest point $fp_i$. The component $cR_i = \frac{chordLen_i}{radLen_i}$ is the ratio of the chord length $chordLen_i$ to the arc length $radLen_i$, where $chordLen_i$ is the distance between $p_i$ and its farthest point $fp_i$, and $radLen_i$ is the sum of the distances between each pair of adjacent points on the contour arc $\widehat{p_i fp_i}$ from $p_i$ to $fp_i$ (see Figure 3). The angle $\alpha_i$ is normalized by $\pi$, and the last component is scale invariant by construction. To make the first component scale invariant, the distance $d_i$ is normalized by the maximum distance from the points of shape $S$ to the centroid. After normalization, all three components lie in the interval $[0, 1]$; therefore, our proposed descriptor is scale invariant. We obtain the final shape descriptor $FD(S)$ of the shape $S$, which is defined as follows:


$$FD(S) = (D^S(p_1), \ldots, D^S(p_N)) = \begin{pmatrix} d_1^S & \cdots & d_N^S \\ \alpha_1^S & \cdots & \alpha_N^S \\ cR_1^S & \cdots & cR_N^S \end{pmatrix}. \tag{3}$$

We observe that $FD(S)$ is a $3 \times N$ matrix whose column $i$ is the shape descriptor $D^S(p_i)$ of the sample point $p_i$ of shape $S$. In this paper, we use the proposed shape descriptor to compare edge fragments with the model contours. Let $FD_A = (D^A(p_1), \ldots, D^A(p_N))$ and $FD_B = (D^B(q_1), \ldots, D^B(q_N))$ denote the shape descriptors extracted from shapes $A$ and $B$, respectively. The shape dissimilarity between them is calculated using the following equation, where $t$ is the offset representing the starting point of the matching procedure:

$$dist(A, B) = \min_{t\in[1,\ldots,N]} \frac{1}{3 \times N} \sqrt{\sum_{i=1}^{N} \left( (d_i^A - d_{i+t}^B)^2 + (\alpha_i^A - \alpha_{i+t}^B)^2 + (cR_i^A - cR_{i+t}^B)^2 \right)}. \tag{4}$$
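To make Eqs. (1)-(4) concrete, the following Python sketch computes the three descriptor components for N ordered contour samples and the offset-minimized dissimilarity between two descriptors. It is only an illustration of the formulas above; the variable names and the arc-length bookkeeping (e.g., taking the shorter arc to the farthest point) are our own assumptions rather than the authors' code.

```python
import numpy as np

def shape_descriptor(points):
    """Compute the 3 x N descriptor of Eq. (3) for N ordered contour points.

    points: (N, 2) array of equidistant samples traced counter-clockwise.
    Rows of the result are d_i (normalized centroid distance), alpha_i (angle
    at p_i between the centroid and the farthest point, normalized by pi),
    and cR_i (chord length / arc length to the farthest point).
    """
    pts = np.asarray(points, dtype=float)
    N = len(pts)
    g = pts.mean(axis=0)                               # centroid, Eq. (1)

    # pairwise distances and the farthest point index of every p_i
    D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    far = D.argmax(axis=1)

    # d_i normalized by the maximum centroid distance (scale invariance)
    d = np.linalg.norm(pts - g, axis=1)
    d = d / d.max()

    # alpha_i: angle between rays p_i -> g and p_i -> fp_i, normalized by pi
    v1, v2 = g - pts, pts[far] - pts
    cos_a = np.einsum('ij,ij->i', v1, v2) / (
        np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1) + 1e-12)
    alpha = np.arccos(np.clip(cos_a, -1.0, 1.0)) / np.pi

    # cR_i: chord length / arc length along the contour from p_i to fp_i
    seg = np.linalg.norm(np.roll(pts, -1, axis=0) - pts, axis=1)
    cum = np.concatenate(([0.0], np.cumsum(seg)))      # cumulative arc length
    arc = np.abs(cum[far] - cum[np.arange(N)])
    arc = np.minimum(arc, cum[-1] - arc)               # shorter way around (assumption)
    chord = D[np.arange(N), far]
    cR = chord / np.maximum(arc, 1e-12)

    return np.vstack([d, alpha, cR])                   # 3 x N matrix, Eq. (3)

def shape_distance(FA, FB):
    """Offset-minimized dissimilarity of Eq. (4) between two 3 x N descriptors."""
    N = FA.shape[1]
    best = np.inf
    for t in range(N):                                 # try every starting offset t
        diff = FA - np.roll(FB, -t, axis=1)
        best = min(best, np.sqrt((diff ** 2).sum()) / (3 * N))
    return best
```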

By employing angle information, distance information, and convexity/concavity properties, our proposed shape descriptor provides a fully quantitative description of an object's shape, which significantly increases its descriptive capability. We later compare the performance of our shape descriptor with the shape context descriptor proposed by [6], the methods proposed by [38] and [47], and other popular shape description methods [5, 34, 51, 44, 58]. Shape context is one of the most commonly employed shape descriptors because of its simplicity and good capability for describing shape. It considers the distribution of an object's contour points, and thus offers a global discriminative ability. The methods proposed by [38] and [47] are recently proposed descriptors used to compute the similarity of edge fragments with model contours. The method proposed by [38] is similar to shape context, but it does not build histograms representing lengths and directions. In contrast, the method proposed by [47] considers only angle information, and its shape descriptor achieves only weak discriminative power.

Figure 4: An example of linking adjacent edge fragments.

The other shape descriptors [5, 34, 51, 44, 58] are recently proposed, popular shape representation methods. We will demonstrate in Section 6.5 that the proposed method is more flexible than shape context, the methods of [38] and [47], and the other shape descriptors [5, 34, 51, 44, 58].

4. Early Preprocessing and Partial Matching

Background clutter is one of the principal difficulties in object detection. Given an image I, we first obtain the edge pixels using an edge detector algorithm. We then obtain numerous edge fragments using the edge-linking software developed by [28]. However, because only a small portion of the obtained edge fragments is representative of the target object, we first remove some of the irrelevant edge fragments. The three preprocessing procedures proposed to address the issue of background clutter are as follows.

First, owing to the limitations of the edge-linking algorithm, it can produce a large number of very short edge fragments; these fragments lack discriminative power and are unnecessary, so we remove them. Meanwhile, the salient contour of a target object is often broken into several segments owing to the effects of occlusion and noise. Hence, we also link closely adjacent contour fragments to create more discriminative long contour fragments. We calculate the distance between the endpoints of each edge fragment and the endpoints of all other edge fragments, and then link adjacent edge fragments when the distance is less than a given threshold value. An example of this procedure is shown in Figure 4. In this manner, we obtain edge images with fewer edge fragments.
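As a rough illustration of this first step, the sketch below discards very short fragments and greedily merges pairs of fragments whose nearest endpoints are closer than a threshold. The parameter names echo the thresholds Len and lt reported in Section 6.1, but the greedy merging order and the data layout are simplifying assumptions of ours.

```python
import numpy as np

def preprocess_fragments(fragments, min_len=10, link_thresh=2):
    """Drop very short edge fragments and greedily link closely adjacent ones.

    fragments:   list of (k, 2) point arrays, each an ordered edge fragment.
    min_len:     minimum number of points a fragment must have (threshold Len).
    link_thresh: maximum endpoint distance for two fragments to be linked (lt).
    """
    frags = [np.asarray(f, float) for f in fragments if len(f) >= min_len]

    merged = True
    while merged:
        merged = False
        for i in range(len(frags)):
            for j in range(i + 1, len(frags)):
                a, b = frags[i], frags[j]
                # distances between the four possible endpoint pairings
                ends = [(np.linalg.norm(a[-1] - b[0]),  lambda: np.vstack([a, b])),
                        (np.linalg.norm(a[-1] - b[-1]), lambda: np.vstack([a, b[::-1]])),
                        (np.linalg.norm(a[0] - b[0]),   lambda: np.vstack([a[::-1], b])),
                        (np.linalg.norm(a[0] - b[-1]),  lambda: np.vstack([b, a]))]
                dist, join = min(ends, key=lambda e: e[0])
                if dist <= link_thresh:
                    frags[i] = join()          # replace fragment i by the linked fragment
                    del frags[j]
                    merged = True
                    break
            if merged:
                break
    return frags
```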


Figure 5: Early preprocessing results for the edge images given in Figure 1.

Second, we employ a single model contour that can be hand-drawn or extracted from an example image. The model contour is first decomposed into possibly overlapping model contour segments at high-curvature points. In addition to selecting longer contour segments, shorter segments must also be selected because contour segments may be missing in edge images. The selection of segments and their grouping into long, meaningful contour segments was designed manually. For each model contour segment, we compute the ratio of the chord length to the arc length, and we compute the same ratios for the fragments in edge images. We then compare the similarity between these ratios for the model contour and each edge fragment, and eliminate the obviously dissimilar edge fragments from the edge image. This further provides edge images with fewer edge fragments.

Finally, we observe that the contours of objects are connected with each other in edge images; if an edge fragment is not connected with any other edge fragment, we consider it isolated and unnecessary, and we remove it. We compute the distances between every edge fragment and the other edge fragments in the edge image, and delete those whose distance is greater than a pre-set threshold. In this way, we obtain better edge images with fewer edge fragments.

Figure 5 shows some edge image examples after preprocessing the edge images given in Figure 1. Compared with Figure 1, the proposed early processing clearly reduces the number of irrelevant and/or background edge fragments, and can therefore improve the efficiency of the subsequent partial matching and depth-first search methods.
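The second and third filters can be sketched in the same spirit: keep a fragment only if its chord-to-arc ratio is close to that of at least one decomposed model segment, and then drop fragments that lie far from every other remaining fragment. The tolerance ratio_tol and the point-to-point distance test are illustrative assumptions of ours; dt mirrors the distance threshold reported in Section 6.1.

```python
import numpy as np

def chord_arc_ratio(points):
    """Chord length over arc length of an ordered open fragment."""
    pts = np.asarray(points, float)
    chord = np.linalg.norm(pts[-1] - pts[0])
    arc = np.linalg.norm(np.diff(pts, axis=0), axis=1).sum()
    return chord / max(arc, 1e-12)

def filter_fragments(fragments, model_segments, ratio_tol=0.3, dt=25):
    """Keep fragments whose chord/arc ratio is close to some model segment's,
    then drop fragments isolated from all remaining fragments (threshold dt)."""
    model_ratios = [chord_arc_ratio(s) for s in model_segments]
    kept = [f for f in fragments
            if any(abs(chord_arc_ratio(f) - r) <= ratio_tol for r in model_ratios)]

    def min_gap(a, b):
        # smallest distance between any point of fragment a and any point of b
        d = np.linalg.norm(np.asarray(a)[:, None, :] - np.asarray(b)[None, :, :], axis=2)
        return d.min()

    return [f for i, f in enumerate(kept)
            if any(min_gap(f, g) <= dt for j, g in enumerate(kept) if j != i)]
```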



Figure 6: An example of partial shape matching between edge fragments and model contours.

We next introduce partial shape matching between model contours and edge fragments. Given a set of edge fragments $E = \{e_1, \ldots, e_k\}$, where each edge fragment $e_i$ $(i = 1, \ldots, k)$ is a list of $N_i$ points, the geometry of an edge fragment is encoded in a $3 \times N_i$ matrix according to the proposed shape descriptor. Similarly, a $3 \times M$ matrix is used to fully represent the contour of model $S$, which comprises $M$ points. The goal is to locate the best match between an edge fragment and a segment of the model contour. Thus, we seek a segment $S(i, l) = \{p_i, \ldots, p_{i+l-1}\} \subseteq S$, where $i$ is the starting point of the segment and $l$ is its length. If the model contour is a closed curve, the indices are taken modulo $M$. The entire edge fragment is considered for the match; hence, only the best match between a segment of the model contour and the edge fragment is obtained.

We first sample a fixed number $M$ of points from the model contour, ordered as $S = \{p_1, p_2, \ldots, p_M\}$. For the obtained edge fragments, points are sampled at equal distances, resulting in a sequence of points $e_i = \{q_1, q_2, \ldots, q_{N_i}\}$. The sampling distance $d$ between the points allows different scales to be handled; we set $d = 5$ in our experiments. In partial shape matching, if the number of points of an edge fragment is larger than the number of points of the model contour, we do not compare them, because an edge fragment in an edge image is only a part of the model contour. Otherwise, we directly compare the edge fragment with the model contour. First, we slide the edge fragment along the model contour, and at each location we compute the proposed shape descriptor. Then, we use Eq. (4) to calculate the distance between them, and finally we take the minimum distance as the best match between the edge fragment and the model contour.
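A minimal sketch of this sliding comparison is given below, under the assumption that both the model contour and the edge fragment have already been resampled at the same spacing d and that a descriptor function (such as the sketch in Section 3) is supplied by the caller: the fragment is placed at every starting index of the closed model contour and the placement with the smallest descriptor distance is returned.

```python
import numpy as np

def best_partial_match(model_points, fragment_points, descriptor):
    """Slide an edge fragment along a closed model contour (indices taken
    modulo M) and return (best starting index, best distance).

    descriptor: function mapping an (n, 2) point array to a 3 x n descriptor,
    e.g., the shape_descriptor sketch from Section 3.
    """
    model = np.asarray(model_points, dtype=float)
    frag = np.asarray(fragment_points, dtype=float)
    M, L = len(model), len(frag)
    if L > M:              # a fragment longer than the model cannot be part of it
        return None, np.inf

    frag_desc = descriptor(frag)                     # 3 x L descriptor of the fragment
    best_i, best_d = None, np.inf
    for i in range(M):                               # every placement S(i, L) on the model
        idx = np.arange(i, i + L) % M                # indices modulo M for a closed contour
        seg_desc = descriptor(model[idx])            # 3 x L descriptor of the model segment
        diff = frag_desc - seg_desc
        d = np.sqrt((diff ** 2).sum()) / (3 * L)     # Eq. (4)-style distance, no extra offset
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d
```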

An example of the best partial match between an edge fragment and the model contour is shown in Figure 6. To increase the robustness of the partial matching process and reduce the occurrence of mismatches between edge fragments and model contours, the three segments of the model contour with the smallest distances obtained by partial matching are retained for subsequent processing.

5. Identifying and Verifying the Candidate Hypotheses

This section introduces the search strategy proposed to obtain candidate hypothesis results efficiently. After obtaining numerous matches between edge fragments and the model contours, the spatial relationships between the edge fragments in the edge image and the corresponding model contour segments must be established. Subsequently, a depth-first search strategy is adopted to determine the candidate hypotheses. Finally, an evaluation method is employed to verify the hypotheses and determine the final detection result.

5.1. Establishing the Spatial Relationships Between Edge Fragments and Model Contours

The shape descriptor proposed by [38] was developed to obtain the best matching between an edge fragment and a model contour. However, that shape descriptor has weak discriminative power and, more importantly, is very slow at partial shape matching, as will be demonstrated in Section 6. Nevertheless, because the spatial relationships between two edge fragments and the two corresponding model contours are incorporated into their method, their descriptor can be used to establish the spatial relationships between edge fragments and model contours. Therefore, this spatial relationship is employed as heuristic information for our search strategy, which not only improves the accuracy of object detection but also increases its time efficiency, as is later verified through extensive experimental comparison.

5.2. Depth-first Search

After establishing the spatial relationships between edge fragments and the model contours, the depth-first search procedure is conducted to obtain the candidate hypotheses.

In the search procedure, the best matches between edge fragments and the model contour are obtained at five edge fragment scales. The detailed search process is as follows; a sketch of the grouping loop is given after the steps.

Step 1: For a given fixed number of sample points on the model contour, partial shape matching is employed to obtain the best three matches between an edge fragment and a segment of the model contour. In partial shape matching, we use five edge fragment scales to obtain all matches between edge fragments and the model contour.

Step 2: After obtaining numerous matches between edge fragments and the model contour at all scales, we calculate the relative orientation and distance between every two edge fragments, and simultaneously compute the same quantities for the two corresponding model contour segments. As a result, we obtain a measure of deviation between an edge fragment configuration and its corresponding model contour configuration.

Step 3: The search process is conducted only for adjacent edge fragments; it is unnecessary to search for edge fragments located at distances beyond a given threshold. Different threshold values are set at different scales to distinguish adjacent edge fragments.

Step 4: Three to four edge fragments are sufficient to detect an object. Therefore, the search is conducted until a maximum of four edge fragments is obtained.

Step 5: A threshold is established on the maximum allowable difference between an edge fragment configuration and its corresponding model contour configuration; this deviation is the quantity obtained in Step 2. If the deviation is less than the threshold, the search continues to the next edge fragment of the neighboring curve. If all adjacent curve segments exceed the threshold, the curve segment is deleted. If three or more edge fragments (not exceeding the limit of four set in Step 4) satisfy the relationship between edge fragments and the model contour, these edge fragments are considered a candidate result. Figure 7 presents an example of some candidate results.
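Viewed as control flow, the search is a bounded depth-first traversal: start from one matched fragment, repeatedly add an adjacent fragment whose configuration deviation from the corresponding model configuration stays below a threshold, and stop at four fragments. The sketch below captures only this loop; the adjacency and deviation functions, like the thresholds lr and condev, stand in for the quantities described in Steps 2, 3, and 5 and are passed in by the caller.

```python
def grow_hypotheses(matched_fragments, adjacency, deviation,
                    lr=40, condev=1.2, min_size=3, max_size=4):
    """Depth-first grouping of matched edge fragments into candidate hypotheses.

    matched_fragments:   ids of fragments with a good partial match to the model.
    adjacency(a, b):     distance between fragments a and b (Step 3).
    deviation(group, b): configuration deviation of adding b to group, measured
                         against the corresponding model configuration (Step 2).
    Returns all groups of min_size..max_size fragments below the thresholds.
    """
    hypotheses = []

    def dfs(group, remaining):
        if len(group) >= min_size:
            hypotheses.append(list(group))            # record a candidate result (Step 5)
        if len(group) == max_size:
            return                                    # at most four fragments (Step 4)
        for b in sorted(remaining):
            if any(adjacency(a, b) <= lr for a in group) and deviation(group, b) <= condev:
                dfs(group + [b], remaining - {b})     # extend along an adjacent fragment

    for seed in matched_fragments:
        dfs([seed], set(matched_fragments) - {seed})
    return hypotheses
```

In practice one would also deduplicate groups discovered in different orders; that bookkeeping is omitted from the sketch.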


Figure 7: An example of the search strategy employed to identify candidate hypotheses: (a) original image; (b) preprocessing results; (c) candidate results. (Here, we show only some of the candidate results.)

5.3. Evaluation of Detection Hypotheses

After all candidate detection results have been obtained, the next step is to verify the candidate hypotheses to obtain the final object detection result. A candidate result, here denoted a hypothesis, is also known as an initial detection. These initial detections include many false positives and false negatives, and the locations of target objects are not sufficiently precise. We therefore train a support vector machine classifier using the pyramid histogram of oriented gradients features described in [9]. A classifier score can be obtained for any hypothesis via the support vector machine, and the score is high if the candidate is a true detection. If the classifier score is greater than a predetermined threshold, the candidate hypothesis is regarded as a final detection, as shown in Figure 2(e). In this process, we use a non-maximum suppression method to remove duplicate hypotheses. Through this step, we are able to significantly improve the performance of our detection results.
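In outline, the verification step scores each candidate bounding box with the trained classifier, keeps boxes above a threshold, and suppresses overlapping duplicates. The sketch below assumes a scoring function (e.g., an SVM over pyramid HOG features) is supplied by the caller; the intersection-over-union overlap measure and the specific thresholds are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def verify_hypotheses(boxes, score_fn, score_thresh=0.0, nms_thresh=0.5):
    """Score candidate boxes, discard weak ones, and apply non-maximum suppression."""
    scored = [(score_fn(b), b) for b in boxes]
    scored = [(s, b) for s, b in scored if s > score_thresh]    # keep confident candidates
    scored.sort(key=lambda sb: sb[0], reverse=True)             # highest score first

    kept = []
    for s, b in scored:
        if all(iou(b, k) < nms_thresh for _, k in kept):        # suppress duplicate boxes
            kept.append((s, b))
    return kept
```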

6. Experimental Results

In this section, we present an extensive experimental evaluation of our method using seven diverse object classes from four challenging datasets: ETHZ shape classes [18], INRIA horses [19], Weizmann horses [8], and Caltech 101 [32]. The complete algorithm was implemented in MATLAB. The experimental platform was a PC with a quad-core 3.1 GHz CPU and 4 GB of memory.

6.1. ETHZ Shape Classes Dataset

This dataset has five diverse classes (bottles, swans, mugs, giraffes, and apple logos) and contains a total of 255 images, with 32-87 images per category. All categories incorporate significant scale changes, illumination changes, and intra-class variations. Moreover, many objects in this dataset are surrounded by unrelated background clutter and contain interior contours. It is therefore very suitable for comparing shape-based object detection methods. In the experiment, we used hand-drawn images as our models; models (one per class) are also provided with this dataset. All 255 images served as test images.

First, the early preprocessing was employed to obtain fewer edge fragments in the edge image, based on the following threshold settings. The edge fragment length threshold Len was set to 10. The edge fragment linking threshold lt was set to 2 for all categories. The model contour decomposition threshold mc was set to 12 salient segments for apple logos, 15 for bottles, 13 for giraffes, 15 for mugs, and 19 for swans. The distance threshold dt was set to 25 for all categories. Second, the partial shape matching method was applied to obtain the top three matches between edge fragments and model contours. Third, the depth-first search strategy was employed to obtain the candidate hypotheses according to the following threshold specifications: the adjacency threshold between edge fragments lr was set to 40, and the configuration deviation threshold between edge fragments and corresponding model contours condev was set to 1.2. Finally, the obtained candidate hypotheses were evaluated using the procedure outlined in Subsection 5.3.

In the training process, we used half of the images in each class as positive examples. For negative examples, we first extracted a background box from every positive example, yielding an equal number of background boxes; these background boxes, together with a randomly selected half of the images from the other classes, served as the final negative examples. Figure 8 presents some examples of the detection results for this dataset, where the hand-drawn images employed as models in the experiment are given in the final column.

Two main measures are used to evaluate detection performance: the 20%-intersection-over-union (IoU) criterion and the stricter PASCAL criterion. The PASCAL criterion requires an IoU with a ground-truth bounding box greater than 50%, whereas the 20%-IoU criterion requires an IoU greater than 20%. For this dataset, we employed both the PASCAL criterion and the 20%-IoU criterion to evaluate detection performance.

We evaluated the object detection performance of our method on this dataset by comparing our results with existing shape-based methods, including Ferrari's methods [17, 18], the recent chamfer matching method proposed by Nguyen [42], and Halawani's method [22]. We present the detection rate (DR) versus false positives per image (FPPI) in Figure 9. The FPPI/DR curve is a standard measure of detection performance: the detection rate is the number of correct detections divided by the total number of objects, and FPPI is the average number of false positives per image. We can see that the performance of our approach is better than Ferrari's methods in all classes. At the same time, our method gives better DR/FPPI rates than Nguyen's method in all categories except the apple logo and bottle classes, for which our method is slightly worse at low FPPIs. Compared with Halawani's method, our approach gives comparable or better DR/FPPI rates in all categories. Therefore, our approach achieves the best overall result among these shape-based object detection methods. Table 1 compares the detection rates obtained at 0.3/0.4 FPPI for numerous methods. The table indicates that the proposed method demonstrated the best performance for all classes considered except apple logos (i.e., for bottles, giraffes, mugs, and swans).
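For reference, the detection rate and FPPI at one operating point can be computed as below; overlap is any box-overlap function (for instance the iou helper from the verification sketch in Section 5.3), and criterion is 0.2 for the 20%-IoU rule or 0.5 for the PASCAL rule. This is a generic evaluation sketch, not the exact scoring code used for the curves in Figure 9.

```python
def detection_rate_fppi(detections_per_image, gts_per_image, overlap, criterion=0.5):
    """detections_per_image: list (one entry per image) of detected boxes;
    gts_per_image: list of ground-truth boxes per image.
    Returns (detection rate, false positives per image) at this operating point."""
    n_gt = sum(len(g) for g in gts_per_image)
    true_pos, false_pos = 0, 0
    for dets, gts in zip(detections_per_image, gts_per_image):
        unmatched = list(gts)
        for d in dets:
            hits = [g for g in unmatched if overlap(d, g) >= criterion]
            if hits:
                true_pos += 1
                unmatched.remove(hits[0])       # each ground-truth object counts once
            else:
                false_pos += 1
    dr = true_pos / float(n_gt) if n_gt else 0.0
    fppi = false_pos / float(len(gts_per_image)) if gts_per_image else 0.0
    return dr, fppi
```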

Figure 8: Some examples of detection results for the apple logo (first row), bottle (second row), giraffe (third row), mug (fourth row), and swan (fifth row) classes of the ETHZ shape classes dataset. The templates employed are the right-most images.



Figure 9: Comparison of DR/FPPI curves on ETHZ shape classes (the curve for our method under PASCAL in the bottle plot is identical to the curve for 20%-IoU).

For apple logos, our detection result was worse than the best result at 0.3 FPPI, but it achieves the same DR value at 0.4 FPPI. A possible reason is that the leaf of the apple logo influences the spatial relationship between edge fragments and the model contour, which in turn affects the performance on this class. Furthermore, the mean value of our method is better than that of all other methods considered for this measure, as shown in Table 1. In summary, our method obtains good object detection results because it effectively combines the benefits of shape and appearance features: we first adopt the depth-first search to obtain all possible candidate hypothesis results using shape information, and then an appearance-based method (a support vector machine on pyramid histogram of oriented gradients features) is adopted to verify the candidate hypotheses. Therefore, our method achieves high object detection performance while remaining efficient at detecting objects in cluttered backgrounds.

To further demonstrate the efficiency of our method, we compared the elapsed time of our method with that of Ma et al. [38] for every category in the ETHZ shape classes dataset under similar conditions.

Table 1: Comparison of detection rates for 0.3/0.4 FPPI on ETHZ shape classes.

Method                        Apple logos    Bottles        Giraffes         Mugs            Swans          Mean
Our Method                    0.89/0.95      1/1            0.92/0.96        0.966/1         1/1            0.955/0.982
Halawani et al. [22]          0.905/0.905    0.965/0.965    0.87/0.90        0.79/0.816      0.82/0.94      0.87/0.9052
Nguyen [42]                   0.856/0.856    1/1            0.76/0.826       0.52/0.55       0.75/0.80      0.777/0.806
Yang et al. [60]              0.80/0.80      0.929/0.959    0.7692/0.7921    0.833/0.8485    0.909/0.941    0.848/0.8679
Wang et al. [55]              0.90/0.90      1/1            0.92/0.92        0.94/0.94       0.94/0.94      0.940/0.940
Ma et al. [38]                0.92/0.92      0.979/0.979    0.854/0.854      0.875/0.875     1/1            0.926/0.926
Srinivasan et al. [52]        0.95/0.95      1/1            0.872/0.896      0.936/0.936     1/1            0.952/0.956
Maji et al. [39]              0.95/0.95      0.929/0.964    0.896/0.896      0.936/0.967     0.882/0.882    0.919/0.932
Felz et al. code [15]         0.95/0.95      1/1            0.729/0.729      0.839/0.839     0.588/0.647    0.821/0.833
Lu et al. [36]                0.9/0.9        0.792/0.792    0.734/0.77       0.813/0.833     0.938/0.938    0.836/0.851
Riemenschneider et al. [47]   0.933/0.933    0.970/0.970    0.792/0.819      0.846/0.863     0.926/0.926    0.893/0.905
Ferrari et al. [18]           0.777/0.832    0.798/0.816    0.399/0.445      0.751/0.8       0.632/0.705    0.671/0.72

We show the comparison results in Figure 10. As can be seen, our method significantly reduces the time required and thereby improves the efficiency of the algorithm.

6.2. INRIA Horses Dataset

The INRIA horses dataset is composed of 170 images containing instances of horses and 170 background images, for a total of 340 images. This dataset does not contain any templates, so we manually created a template for the experiments. The dataset contains different colors and textures as well as various articulated poses in cluttered images. Some parameter settings differed from those used for the ETHZ shape classes dataset: we set the edge fragment linking threshold lt to 1, and the model contour decomposition threshold mc to 20 salient segments. All other parameters are the same as for the ETHZ shape classes dataset. In the training process, we used half of the images containing horses as positive examples and an equal number of background images as negative examples. The test set contains all 340 images.



Figure 10: Comparison of the time consumption on ETHZ shape classes.

Figure 11: Examples of objects detected by our proposed method for the INRIA horses dataset. The last image is the hand-drawn template employed.



Figure 12: Comparison of DR/FPPI curves for the INRIA horses dataset.

Table 2: Comparison of detection rates with 0.3/0.4 FPPI for the INRIA horses dataset.

Method                         Horses
Our Method                     1/1
Ferrari et al. (2008) [17]     0.84/0.85
Ferrari et al. (2010) [18]     0.78/0.80
Halawani et al. (2016) [22]    0.815/0.83

To evaluate the detection accuracy of our method, we followed the PASCAL criterion and the 20%-IoU criterion. Some detection examples from this dataset are illustrated in Figure 11, which also includes the hand-drawn template employed. We compared the performance of our approach with Ferrari's methods [17, 18], Nguyen's method [42], and Halawani's method [22]. The FPPI versus DR curves are presented in Figure 12. We can see that the performance of our approach is better than that of Ferrari's, Nguyen's, and Halawani's methods. Table 2 lists the DR values at 0.3/0.4 FPPI; as the table shows, the proposed method demonstrates the best performance among the methods considered.

6.3. Weizmann Horses Dataset

The Weizmann horses dataset contains 228 positive images and 228 background images, for a total of 456 images. It is a very challenging multi-scale dataset. In the experiment, all parameters are the same as for the INRIA horses dataset. In the training process, we used half of the images containing horses as positive examples and an equal number of background images as negative examples.

Figure 13: Examples of objects detected by our proposed method for the Weizmann horses dataset. The last image is the hand-drawn template employed.

All 456 images in the dataset were employed as test images. This dataset also includes no templates, so a template was manually created for the experiments. Some detection examples from this dataset are illustrated in Figure 13, which also includes the hand-drawn template employed.

For this dataset, we employed the PASCAL criterion to evaluate the detection performance. We compare the performance of our method with other shape-based approaches, namely contour fragments [49], strip features [63], and active contour fragments [64]. The comparison results are shown in Figure 14. As indicated in the figure, the performance of our approach is slightly worse than Zheng's method [64] at low FPPIs, but our approach achieves better performance over all positive images compared with the methods proposed by [49]. Compared with the strip feature proposed by [63], our method gives comparable DR/FPPI rates.


Figure 14: Comparison of DR/FPPI curves on the Weizmann horses dataset.

Table 3: Comparison of detection rates with 0.4 FPPI for the Weizmann horses dataset.

Method                              Horses
Our Method                          1
Shotton et al. (2008) [49]          0.9520
Zhu et al. (2010) [66]              0.8600
Yang et al. (2010) [59]             0.9397
Bhattacharjee et al. (2015) [7]     0.9520

In particular, our approach tends to give a higher detection rate at low FPPIs. Table 3 compares the detection rate values at 0.4 FPPI for several methods; the results of the other methods are drawn directly from their published results. As can be observed, our method achieved the best detection result among the five methods.

6.4. Caltech 101 Dataset

Finally, we consider the well-known Caltech 101 dataset [32]. Two shape-based classes were selected from this dataset: anchors and cups, with 42 and 57 positive images, respectively. Although most of the images provide only limited background clutter, the dataset incorporates significant intra-class variations. This dataset also includes no templates, so both anchor and cup templates were manually created for the experiments. In the training process, half of the images containing the target object were employed as positive examples, and an equal number of background images obtained from the Caltech 101 background set were employed as negative examples.

Figure 15: Examples of detection results for the anchor (first row) and cup (second row) classes for the Caltech 101 dataset. The right-most images are the hand-drawn templates employed.

The testing set included the remaining positive images and an equal number of negative images from the background set. In the experiment, the model contour decomposition threshold mc was set to 10 salient segments for both anchors and cups; the other parameters were equivalent to those employed for the ETHZ shape classes.

The results of our proposed method were compared with the kAS method [17] and with interest point (IP) detectors, including the Harris-Laplace [41], LoG [35], and DoG [35] detectors, under similar conditions. In the kAS method, 2AS, i.e., pairs of adjacent segments (PAS), produced the best results on all datasets and was therefore employed in the experiment. Interest points represent particular properties in an image and produce local features widely used for object detection [16]; IP descriptors capture the appearance characteristics of local image patches. To directly compare the results of our approach with those of PAS, Harris-Laplace, LoG, and DoG, the 20%-IoU criterion was employed to evaluate the detection accuracy. Some detection examples from this dataset are illustrated in Figure 15, which also shows the hand-drawn templates employed. We show our quantitative results in Figure 16: our proposed method achieves better performance for all positive images compared with all other methods considered.


Figure 16: Comparison of DR/FPPI curves for the Caltech101 dataset.

6.5. Performance Analysis of the Proposed Shape Descriptor

In this subsection, we analyze the performance of the proposed shape descriptor. We tested our shape descriptor on two popular benchmark datasets: the MPEG-7 dataset [29] and Kimia's dataset [48]. In our implementation, we set the number of sample points from a shape contour to N = 100.

6.5.1. MPEG-7 dataset

The MPEG-7 dataset [29] has been widely used for testing the performance of shape descriptors over the last decade. It contains a total of 1400 shape images belonging to 70 different classes, with 20 images per class. Some examples from this dataset are shown in Figure 17. This dataset is very challenging for shape-based image retrieval, because some shapes from different classes are very similar while shapes within the same class often exhibit complex deformations.

Retrieval accuracy on the MPEG-7 shape dataset is measured using the well-known bull's eye score. In this measurement, each shape of the dataset is used as a query and compared with all shapes in the dataset. The number of correct matches (i.e., retrieved shapes belonging to the same class as the query) among the top 20 × 2 = 40 retrieved shapes with the smallest dissimilarity values is counted. Since the maximum number of correct matches for a single query is 20, the total number of possible correct matches is 1400 × 20 = 28000; the percentage of matched shapes out of 28000 is the bull's eye retrieval rate.
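The bull's eye protocol can be written down directly: for each query, count how many of its 40 closest shapes (query included) share its class, and divide the grand total by 28000. The sketch assumes a precomputed pairwise dissimilarity matrix; it illustrates the protocol rather than reproducing the authors' evaluation code.

```python
import numpy as np

def bulls_eye_score(dist, labels, top_k=40, per_class=20):
    """dist: (n, n) pairwise dissimilarity matrix (smaller = more similar);
    labels: length-n array of class labels.  Returns the bull's eye score."""
    dist = np.asarray(dist, float)
    labels = np.asarray(labels)
    n = len(labels)
    correct = 0
    for q in range(n):
        order = np.argsort(dist[q])[:top_k]           # 40 closest shapes (query included)
        correct += int(np.sum(labels[order] == labels[q]))
    return correct / float(n * per_class)             # e.g., divide by 28000 for MPEG-7
```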


Figure 17: Some example shapes from the MPEG-7 dataset. One shape from each of the 70 classes is presented.

Table 4: Bull's eye scores on the MPEG-7 shape dataset.

Method           Score
SC [6]           64.59%
WARP [5]         58.50%
IDSC [34]        68.83%
CPDH [51]        71.89%
ECCobj2D [27]    54.56%
SSD [44]         61.00%
Our Method       72.38%

We compared the performance of our proposed shape descriptor with shape context (SC) [6], WARP [5], the inner-distance shape context (IDSC) [34], the contour points distribution histogram (CPDH) [51], the eccentricity transform (ECCobj2D) [27], and the shape salience descriptor (SSD) [44]. The comparison results are presented in Table 4. As can be observed, our shape descriptor achieved the highest retrieval rate among all of the compared methods. The reason is that our shape descriptor combines the benefits of the boundary and the region of an object's shape: the distance between a sample point and the centroid captures region information, while the angle and the ratio of chord length to arc length capture boundary information. Therefore, our shape descriptor has a strong ability to characterize shape.


Figure 18: The complete Kimia’s 99 dataset.

6.5.2. Kimia's dataset

The Kimia's dataset [48] is another widely used benchmark for evaluating the performance of shape description methods, comprising Kimia's 25, Kimia's 99, and Kimia's 216. Since the Kimia's 25 dataset is too small (it contains only 25 shape samples in total), we chose the Kimia's 99 and Kimia's 216 datasets to test the performance of our method for shape retrieval. These two datasets contain a significant level of noise and intra-class variation, making them very suitable for comparing shape descriptor performance.

The complete Kimia's 99 dataset is shown in Figure 18. It contains a total of 99 shapes grouped into nine categories, and exhibits occlusions, missing parts, and articulations. In the experiment, each shape in the dataset is used as a query, and the number of retrieved shapes from the same class is counted at each of the top 1 to top 10 closest matches (excluding the query itself); the best possible result at each rank is 99. We compared the performance of our proposed shape descriptor with shape context [6] and the methods proposed by [38] and [47]. The comparison results are listed in Table 5, and a small sketch of this per-rank tally is given below. As can be observed, our shape descriptor nearly always outperforms the other descriptors considered.

A portion of the Kimia's 216 dataset is shown in Figure 19. This dataset consists of 18 shape classes, with 12 shapes per class, for a total of 216 shapes. In the experiment, each shape in the dataset is used as a query, and the retrieved shapes from the same class are counted at each of the top 1 to top 11 closest matches (excluding the query itself); the best possible result at each rank is 216. The 11 closest matches obtained for each shape were evaluated as to whether or not they were in the same class as the query.
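The Kimia protocol differs from the bull's eye score in that the query itself is excluded and correct matches are tallied separately at each rank. A small sketch of this per-rank count, again assuming a precomputed dissimilarity matrix, is shown below; the tallies correspond to the columns of Tables 5 and 6.

```python
import numpy as np

def kimia_rank_counts(dist, labels, top_k=10):
    """Count, for each rank 1..top_k, how many queries retrieve a shape of
    their own class at that rank (query itself excluded)."""
    dist = np.asarray(dist, float)
    labels = np.asarray(labels)
    n = len(labels)
    counts = np.zeros(top_k, dtype=int)
    for q in range(n):
        order = [i for i in np.argsort(dist[q]) if i != q][:top_k]
        for rank, i in enumerate(order):
            if labels[i] == labels[q]:
                counts[rank] += 1
    return counts      # per-rank totals, as reported in Tables 5 and 6
```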

Table 5: Retrieval results for the Kimia's 99 dataset.

Method                        1st  2nd  3rd  4th  5th  6th  7th  8th  9th  10th  Total
Shape Context [6]              97   91   88   85   84   77   75   66   56    37    756
Ma and Latecki [38]            94   86   81   79   72   63   62   58   44    36    675
Riemenschneider et al. [47]    91   85   79   77   69   61   64   61   51    36    672
Our Shape Descriptor           97   95   93   92   86   84   79   75   67    47    815

Figure 19: Sample shapes from the Kimia’s 216 dataset. Two shapes for each of the 18 classes are presented.

We compared the performance of our shape descriptor with shape context [6], the methods proposed by [38] and [47], the path similarity skeleton graph matching (PSSGM) method [23], and the contour segment matching (CSM) method [58]. The comparison results are presented in Table 6. As can be seen, the performance of the proposed shape descriptor is nearly always better than those of the other methods considered.

Table 6: Retrieval results for the Kimia's 216 dataset.

Method                        1st  2nd  3rd  4th  5th  6th  7th  8th  9th  10th  11th  Total
Shape Context [6]             214  209  205  197  191  178  161  144  131   101    78   1809
Ma and Latecki [38]           207  197  185  179  177  168  160  144  147   122   104   1790
Riemenschneider et al. [47]   205  191  180  168  155  159  140  141  128   108    85   1660
PSSGM [23]                    210  208  203  202  200  192  186  167  161   130    96   1955
CSM [58]                      216  215  206  204  200  186  172  163  130   124   107   1923
Our Shape Descriptor          216  211  208  197  197  192  184  178  167   164   142   2056


Figure 20: Examples of rotated images: (a) original image; (b) image rotated by five degrees; (c) image rotated by 20 degrees.

6.6. Influence of Rotation Transformation
To further validate the effectiveness and robustness of the proposed method, we added a comparison experiment in which the objects are rotated by a given angle. The results of our method were compared with the fan shape model (FSM) [55] and the Discriminative Combinations of Line Segments and Ellipses (DCLE) model [13]. FSM and DCLE represent recent progress in shape-based object detection and are among the best-performing shape-based detectors. The ETHZ shape classes dataset was again employed, and the DR versus FPPI results of each method were evaluated under similar conditions after the images had been rotated by a given angle, in order to compare the recognition rates of the three algorithms. Because the giraffe class exhibits the largest deformations of all ETHZ shape classes and is at the same time the most difficult class to recognize, it is well suited for this comparison. An example of a rotated giraffe image is shown in Figure 20. All parameters were equal to those previously employed for the ETHZ shape classes dataset. Figure 21 presents the comparison results of the three algorithms under no rotation, a rotation of five degrees, and a rotation of 20 degrees. The figure shows that the proposed method achieves the highest recognition rate after image rotation, indicating a better tolerance to rotation transformations.

6.7. Influence of Early Preprocessing
In this subsection, we analyze the influence of early preprocessing on the performance of our proposed object detection method. At the same time, we analyze the elapsed time of the proposed method in two situations: with and without the early preprocessing phase.

Figure 21: Comparison results of the three algorithms under conditions of rotation transformation. Each panel plots detection rate against false positives per image (FPPI/DR) for the giraffe class at the 20% IoU criterion, under no rotation, a five-degree rotation, and a 20-degree rotation, for the fan shape model, DCLE, and our method.

We employed the ETHZ shape classes dataset as the subject of analysis. Figure 22 presents the comparison results of our approach in these two situations. The red curves represent our approach with the early preprocessing phase, while the green curves represent our approach without it. It can be seen from Figure 22 that the early preprocessing phase influences the performance of our object detection method. However, even without early preprocessing, our approach still achieved high detection performance in all classes. Using the early preprocessing phase decreased the detection rate only slightly in all classes except the mug class at lower FPPI values. Therefore, the overall performance of our approach is very stable, and the method remains effective when detecting objects against cluttered backgrounds. Figure 23 presents the elapsed time of our method in the two situations. The red curves represent the elapsed time with the early preprocessing phase, while the green curves represent the elapsed time without it. It can be seen that the early preprocessing stage significantly reduces the time required and improves detection efficiency. This is an important reason for including the early preprocessing phase in our object detection method.
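Figures 22 and 23 report detection-rate/FPPI curves and per-image run times. As a point of reference, a single (FPPI, DR) point under the 20% intersection-over-union criterion could be computed roughly as below; the bounding-box format and all names are illustrative assumptions rather than our exact evaluation code.

    def iou(a, b):
        """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def dr_fppi(detections, ground_truth, iou_thr=0.2):
        """One (FPPI, DR) point; detections and ground_truth are per-image
        lists of boxes (illustrative sketch only)."""
        n_images = len(detections)
        n_gt = sum(len(g) for g in ground_truth)
        tp, fp = 0, 0
        for dets, gts in zip(detections, ground_truth):
            matched = [False] * len(gts)
            for d in dets:
                hits = [j for j, g in enumerate(gts)
                        if not matched[j] and iou(d, g) >= iou_thr]
                if hits:
                    matched[hits[0]] = True
                    tp += 1
                else:
                    fp += 1
        return fp / n_images, tp / n_gt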


Figure 22: Comparison results of our proposed method on the ETHZ shape classes dataset in the two situations (with and without early preprocessing). Each panel plots detection rate against false positives per image at the 20% IoU criterion for one class: apple logos, bottles, giraffes, swans, and mugs.

Figure 23: Comparison of the elapsed time of our proposed method on the ETHZ shape classes dataset in the two situations (with and without early preprocessing). Each panel plots elapsed time in minutes against the image index for one class: apple logos, bottles, giraffes, swans, and mugs.


6.8. Parameter Study
In this subsection, the influence of the main parameters on the experimental results is examined, and the stability of our method with respect to changes in these parameters is tested. We again employ a subset of the giraffe class in the ETHZ shape classes dataset as the subject of analysis.

6.8.1. Analysis of the Early Preprocessing Parameters
Four parameters must be specified for the early preprocessing. The first is the edge fragment length threshold Len, which is set to 10 for the ETHZ shape classes. This value was obtained from test images based on observations of the limited discriminative capability of short contour fragments. The second is the linking threshold lt, which is set to 2 for all categories in the ETHZ shape classes dataset. This value was also obtained from test images and cannot be too large; otherwise, foreground contours may be incorrectly linked with background contours, and an improper setting can therefore produce substantial deviations in the partial matching process and in localizing the target object. The third is the model contour decomposition threshold mc, which controls the decomposition of the model contour into meaningful segments obtained manually; its value therefore differs between the classes of the ETHZ shape classes dataset. The fourth is the distance threshold dt, which is set to 25 for the ETHZ shape classes and was obtained from test images. If this value is too large, very few edge fragments are treated as isolated; if it is too small, edge fragments that belong to the actual contour of an object may be removed. A simplified sketch of these filters is given after Table 7.

Experiments were conducted to analyze the influence of the distance threshold dt and the linking threshold lt on the detection result, with all other parameters held constant. The DR values for a subset of the giraffe class images in the ETHZ shape classes dataset at 0.3 FPPI are listed in Table 7 for lt values of 1, 2, 3, and 4, and dt values of 20, 25, 30, 35, and 40. We can see that the DR increases and stabilizes with increasing dt and that the best DR is obtained when lt = 2 and dt = 35. The distance threshold dt thus has a relatively small influence on the detection result, whereas the linking threshold lt has a greater influence. Therefore, the experimental results are influenced by both the linking threshold lt and the distance threshold dt.

Table 7: The detection rate of the giraffe class in the ETHZ shape classes at 0.3 FPPI for different values of dt and lt employed in early preprocessing.

          lt = 1    lt = 2    lt = 3    lt = 4
dt = 20   0.8795    0.9186    0.8837    0.8864
dt = 25   0.8878    0.9200    0.9047    0.9070
dt = 30   0.8936    0.9285    0.9175    0.9078
dt = 35   0.8948    0.9286    0.9175    0.9078
dt = 40   0.8948    0.9286    0.9175    0.9078
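To make the roles of the thresholds Len, lt, and dt more concrete, the following sketch applies the three early-preprocessing filters to a list of edge fragments, each represented as an array of 2-D edge points. The fragment representation, the greedy endpoint-based linking rule, and the isolation test are simplifying assumptions for illustration, not our exact implementation.

    import numpy as np

    def early_preprocess(fragments, length_thr=10, link_thr=2.0, dist_thr=25.0):
        """Simplified early preprocessing (illustrative sketch).

        fragments  : list of (k_i, 2) arrays of edge points.
        length_thr : Len, drop fragments with fewer points than this.
        link_thr   : lt, link fragments whose facing endpoints are this close.
        dist_thr   : dt, drop fragments farther than this from all others.
        """
        # 1. Remove very short fragments with little discriminative power.
        frags = [f for f in fragments if len(f) >= length_thr]

        # 2. Greedily link fragments whose endpoints nearly touch.
        merged = True
        while merged:
            merged = False
            for i in range(len(frags)):
                for j in range(i + 1, len(frags)):
                    if np.linalg.norm(frags[i][-1] - frags[j][0]) <= link_thr:
                        frags[i] = np.vstack([frags[i], frags[j]])
                        del frags[j]
                        merged = True
                        break
                if merged:
                    break

        # 3. Remove isolated fragments whose endpoint distance to every other
        #    fragment exceeds dt.
        def min_gap(i):
            gaps = [min(np.linalg.norm(frags[i][0] - frags[j][0]),
                        np.linalg.norm(frags[i][-1] - frags[j][-1]),
                        np.linalg.norm(frags[i][0] - frags[j][-1]),
                        np.linalg.norm(frags[i][-1] - frags[j][0]))
                    for j in range(len(frags)) if j != i]
            return min(gaps) if gaps else 0.0

        return [f for i, f in enumerate(frags) if min_gap(i) <= dist_thr]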

6.8.2. Analysis of the Depth-first Search Parameters
Two parameters must be specified for the depth-first search. The first is the adjacency threshold lr between edge fragments. For the obtained edge fragments, points are sampled at an equal spacing d to handle different scales, and five scales are considered, so the value of d ranges from one to five. We set the adjacency threshold to 40/d for the ETHZ shape classes because the segments of an object are connected with each other, and objects can be detected only by searching adjacent edge fragments. The second parameter is the allowable configuration deviation condev between edge fragments and the corresponding model contours, which is set to 1.2 for the ETHZ shape classes dataset. If this value is too large, the number of candidate results becomes excessive; if it is too small, true detections may be removed. A sketch of these two tests is given after Table 8.

The influence of different lr and condev values on the detection result was analyzed. The DR values of the giraffe class in the ETHZ shape classes dataset at 0.3 FPPI were evaluated for lr values of 30, 40, and 50, and condev values of 1, 1.1, 1.2, and 1.3. The results are listed in Table 8. As the table shows, the DR is best when lr = 30 and condev = 1. Moreover, the DR decreases with increasing lr, because a larger lr increases the number of false positives. In contrast, condev exhibits only a minor influence on the DR: increasing condev yields more detection results but also more false detections, whereas reducing it yields fewer detection results, so that true detections may be eliminated. Therefore, the detection result is influenced by both parameters.

Table 8: The detection rate of the giraffe class in the ETHZ shape classes at 0.3 FPPI for different values of lr and condev employed in the depth-first search.

          condev = 1    condev = 1.1    condev = 1.2    condev = 1.3
lr = 30     0.9355          0.9167          0.9250          0.9178
lr = 40     0.9118          0.8947          0.9200          0.9070
lr = 50     0.8889          0.8636          0.8750          0.8696
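The two depth-first search parameters can be read as simple gating tests applied while a grouping hypothesis is expanded, as sketched below. The endpoint-based adjacency test and the form of the deviation score are assumptions for illustration; the actual deviation measure is the one used by the matching stage described earlier in the paper.

    import numpy as np

    def adjacent(frag_a, frag_b, sample_step, base=40.0):
        """Adjacency test constraining the depth-first expansion: two fragments
        are neighbours when their closest endpoints lie within lr = base / d."""
        lr = base / sample_step
        ends_a = (frag_a[0], frag_a[-1])
        ends_b = (frag_b[0], frag_b[-1])
        gap = min(np.linalg.norm(pa - pb) for pa in ends_a for pb in ends_b)
        return gap <= lr

    def within_deviation(config_cost, condev=1.2):
        """Gate on the configuration deviation between grouped edge fragments
        and the corresponding model contours (config_cost is assumed to be the
        deviation score produced by the partial matching stage)."""
        return config_cost <= condev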

7. Conclusion
We presented a novel framework for contour-based object detection in cluttered images. Compared with previous work, this paper provides three main contributions. First, three preprocessing procedures were proposed to reduce the number of irrelevant edge fragments in cluttered real images. Second, a novel shape descriptor was introduced for partial shape matching, which is used to obtain the best matches between edge fragments and model contours. Third, a depth-first search strategy was adopted to identify the locations of candidate hypotheses. Experimental results on four challenging datasets demonstrated that the proposed method significantly improves detection accuracy in comparison with state-of-the-art shape-based object detection methods. The template model, however, was provided manually. Hence, in future work, we will investigate learning shape models directly from cluttered natural images.

Acknowledgments
This work was supported by the NSFC Project (Project No. 61375122) and, in part, by the Shanghai Science and Technology Development Funds (Project Nos. 13dz2260200 and 13511504300).


References
[1] B. Alexe, T. Deselaers, V. Ferrari, Measuring the objectness of image windows, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (11) (2012) 2189–2202.
[2] J. R. Anderson, Cognitive psychology and its implications, Macmillan, 2005.
[3] P. Arbelaez, M. Maire, C. Fowlkes, J. Malik, From contours to regions: An empirical evaluation, in: Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on, IEEE, 2009, pp. 2294–2301.
[4] X. Bai, X. Wang, L. J. Latecki, W. Liu, Z. Tu, Active skeleton for non-rigid object detection, in: ICCV, 2009, pp. 575–582.
[5] I. Bartolini, P. Ciaccia, M. Patella, WARP: Accurate retrieval of shapes using phase of Fourier descriptors and time warping distance, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (1) (2005) 142–147.
[6] S. Belongie, J. Malik, J. Puzicha, Shape matching and object recognition using shape contexts, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (4) (2002) 509–522.
[7] S. D. Bhattacharjee, A. Mittal, Part-based deformable object detection with a single sketch, Computer Vision and Image Understanding 139 (2015) 73–87.
[8] E. Borenstein, E. Sharon, S. Ullman, Combining top-down and bottom-up segmentation, in: Computer Vision and Pattern Recognition Workshop, 2004. CVPRW'04. Conference on, IEEE, 2004, pp. 46–46.
[9] A. Bosch, A. Zisserman, X. Munoz, Image classification using random forests and ferns, in: IEEE International Conference on Computer Vision, 2007, pp. 1–8.
[10] J. Canny, A computational approach to edge detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 8 (6) (1986) 679–698.
[11] J. Carreira, C. Sminchisescu, CPMC: Automatic object segmentation using constrained parametric min-cuts, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (7) (2012) 1312–1328.
[12] G. Cheng, J. Han, L. Guo, T. Liu, Learning coarse-to-fine sparselets for efficient object detection and scene classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1173–1181.
[13] A. Y. Chia, M. K. Leung, S. Rahardja, Object recognition by discriminative combinations of line segments, ellipses, and appearance features, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (9) (2012) 1758–1772.
[14] I. Endres, D. Hoiem, Category-independent object proposals with diverse ranking, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2) (2014) 222–234.
[15] P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained, multiscale, deformable part model, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.


[16] R. Fergus, P. Perona, A. Zisserman, Object class recognition by unsupervised scale-invariant learning, in: Computer Vision and Pattern Recognition (CVPR), 2003 IEEE Conference on, IEEE, 2003, pp. 264–271.
[17] V. Ferrari, L. Fevrier, F. Jurie, C. Schmid, Groups of adjacent contour segments for object detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (1) (2008) 36–51.
[18] V. Ferrari, F. Jurie, C. Schmid, From images to shape models for object detection, International Journal of Computer Vision 87 (3) (2010) 284–303.
[19] V. Ferrari, T. Tuytelaars, L. Van Gool, Object detection by contour segment networks, in: Computer Vision–ECCV 2006, Springer, 2006.
[20] J. Gall, A. Yao, N. Razavi, L. Van Gool, V. Lempitsky, Hough forests for object detection, tracking, and action recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (11) (2011) 2188–2201.
[21] C. Gu, J. J. Lim, P. Arbelaez, J. Malik, Recognition using regions, in: Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on, IEEE, 2009, pp. 1030–1037.
[22] A. Halawani, H. Li, 100 lines of code for shape-based object localization, Pattern Recognition 60 (2016) 458–472.
[23] J. Hedrich, C. Yang, C. Feinen, S. Schäfer, D. Paulus, M. Grzegorzek, Extended investigations on skeleton graph matching for object recognition, in: Proceedings of the 8th International Conference on Computer Recognition Systems CORES 2013, Springer, 2013, pp. 371–381.
[24] C. Hong, J. Yu, J. You, X. Chen, D. Tao, Multi-view ensemble manifold regularization for 3d object recognition, Information Sciences 320 (2015) 395–405.
[25] C. Hong, J. Zhu, J. Yu, J. Cheng, X. Chen, Realtime and robust object matching with a large number of templates, Multimedia Tools and Applications 75 (3) (2016) 1459–1480.
[26] C. Huang, T. X. Han, Z. He, W. Cao, Constellational contour parsing for deformable object detection, Journal of Visual Communication and Image Representation 38 (2016) 540–549.
[27] A. Ion, N. M. Artner, G. Peyré, W. G. Kropatsch, L. D. Cohen, Matching 2d and 3d articulated shapes using the eccentricity transform, Computer Vision and Image Understanding 115 (6) (2011) 817–834.
[28] P. D. Kovesi, MATLAB and Octave functions for computer vision and image processing, available from: (2008).
[29] L. J. Latecki, R. Lakamper, U. Eckhardt, Shape descriptors for non-rigid shapes with a single closed contour, in: Computer Vision and Pattern Recognition (CVPR), 2000 IEEE Conference on, IEEE, 2000, pp. 424–429.
[30] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, in: Computer Vision and Pattern Recognition (CVPR), 2006 IEEE Conference on, IEEE, 2006, pp. 2169–2178.


[31] B. Leibe, A. Leonardis, B. Schiele, Combined object categorization and segmentation with an implicit shape model, in: ECCV Workshop on Statistical Learning in Computer Vision, 2004, pp. 17–32.
[32] F. F. Li, R. Fergus, P. Perona, Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories, in: Computer Vision and Pattern Recognition Workshop, 2004. CVPRW '04. Conference on, 2004, pp. 59–70.
[33] L. Lin, X. Wang, W. Yang, J. Lai, Learning contour-fragment-based shape model with and-or tree representation, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 135–142.
[34] H. Ling, D. W. Jacobs, Shape classification using the inner-distance, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (2) (2007) 286–299.
[35] D. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
[36] C. Lu, L. J. Latecki, N. Adluru, X. Yang, H. Ling, Shape guided contour grouping with particle filters, in: ICCV, 2009, pp. 2288–2295.
[37] M. Y. Liu, O. Tuzel, A. Veeraraghavan, R. Chellappa, Fast directional chamfer matching, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, 2010, pp. 1696–1703.
[38] T. Ma, L. J. Latecki, From partial shape matching through local deformation to robust global shape similarity for object detection, in: CVPR, 2011, pp. 1441–1448.
[39] S. Maji, J. Malik, Object detection using a max-margin Hough transform, in: Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on, 2009.
[40] D. R. Martin, C. C. Fowlkes, J. Malik, Learning to detect natural image boundaries using local brightness, color, and texture cues, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (5) (2004) 530–549.
[41] K. Mikolajczyk, C. Schmid, Indexing based on scale invariant interest points, in: ICCV, 2001, pp. 525–531.
[42] D. T. Nguyen, A novel chamfer template matching method using variational mean field, in: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, IEEE, 2014.
[43] A. Opelt, A. Pinz, A. Zisserman, A boundary-fragment-model for object detection, in: Computer Vision–ECCV 2006, Springer, 2006, pp. 575–588.
[44] G. V. Pedrosa, M. A. Batista, C. A. Barcelos, Image feature descriptor based on shape salience points, Neurocomputing 120 (2013) 156–163.
[45] S. Ravishankar, A. Jain, A. Mittal, Multi-stage contour based detection of deformable objects, in: Computer Vision–ECCV 2008, Springer, 2008, pp. 483–496.


[46] X. Ren, D. Ramanan, Histograms of sparse codes for object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3246–3253.
[47] H. Riemenschneider, M. Donoser, H. Bischof, Using partial edge contour matches for efficient object category localization, in: Computer Vision–ECCV 2010, Springer, 2010, pp. 29–42.
[48] T. B. Sebastian, P. N. Klein, B. B. Kimia, Recognition of shapes by editing shock graphs, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (5) (2004) 550–571.
[49] J. Shotton, A. Blake, R. Cipolla, Multiscale categorical object recognition using contour fragments, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (7) (2008) 1270–1281.
[50] J. Shotton, J. Winn, C. Rother, A. Criminisi, TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation, in: Computer Vision–ECCV 2006, Springer, 2006, pp. 1–15.
[51] X. Shu, X.-J. Wu, A novel contour descriptor for 2d shape matching and its application to image retrieval, Image and Vision Computing 29 (4) (2011) 286–294.
[52] P. Srinivasan, Q. Zhu, J. Shi, Many-to-one contour matching for describing and discriminating object shape, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, 2010, pp. 1673–1680.
[53] C. L. Teo, C. Fermüller, Y. Aloimonos, A gestaltist approach to contour-based object recognition: Combining bottom-up and top-down cues, International Journal of Robotics Research 34 (4-5) (2015) 627–652.
[54] A. Toshev, B. Taskar, K. Daniilidis, Object detection via boundary structure segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 950–957.
[55] X. Wang, X. Bai, T. Ma, W. Liu, L. J. Latecki, Fan shape model for object detection, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2012, pp. 151–158.
[56] H. Wei, C. Yang, Q. Yu, Efficient graph-based search for object detection, Information Sciences 385 (2017) 395–414.
[57] H. Wei, Q. Yu, C. Yang, Shape-based object recognition via evidence accumulation inference, Pattern Recognition Letters 77 (2016) 42–49.
[58] C. Yang, O. Tiebe, P. Pietsch, C. Feinen, U. Kelter, M. Grzegorzek, Shape-based object retrieval by contour segment matching, in: Image Processing (ICIP), 2014 IEEE International Conference on, IEEE, 2014, pp. 2202–2206.
[59] X. Yang, L. J. Latecki, Weakly supervised shape based object detection with particle filter, in: European Conference on Computer Vision, Springer, 2010, pp. 757–770.
[60] X. Yang, H. Liu, L. J. Latecki, Contour-based object detection as dominant set computation, Pattern Recognition 45 (5) (2012) 1927–1936.


[61] P. Yarlagadda, B. Ommer, From meaningful contours to discriminative object shape, in: European Conference on Computer Vision, 2012, pp. 766–779.
[62] Q. Yu, H. Wei, C. Yang, Local part chamfer matching for shape-based object detection, Pattern Recognition 65 (2017) 82–96.
[63] W. Zheng, L. Liang, Fast car detection using image strip features, in: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, 2009, pp. 2703–2710.
[64] W. Zheng, S. Song, H. Chang, X. Chen, Grouping active contour fragments for object recognition, in: Asian Conference on Computer Vision, 2012, pp. 289–301.
[65] T. Zhou, J. Yang, A. Loza, H. Bhaskar, M. Al-Mualla, Crowd modeling framework using fast head detection and shape-aware matching, Journal of Electronic Imaging 24 (2) (2015) 023019.
[66] L. Zhu, Y. Chen, A. Yuille, Learning a hierarchical deformable template for rapid deformable object parsing, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (6) (2010) 1029–1043.
[67] Q. Zhu, L. Wang, Y. Wu, J. Shi, Contour context selection for object detection: A set-to-set contour matching approach, in: Computer Vision–ECCV 2008, Springer, 2008, pp. 774–787.
