Accurate object detection with a discriminative shape model


Huapeng Yu a,b,c,∗, Yongxin Chang a,b,c, Pei Lu a,b,c, Zhiyong Xu a, Chengyu Fu a, Yafei Wang b

a Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
b School of Optoelectronic Information, University of Electronic Science and Technology of China, Chengdu 610054, China
c Graduate University of Chinese Academy of Sciences, Beijing 100039, China

Article history: Received 24 July 2013; accepted 18 January 2014.

Abstract

A discriminative model over a bag-of-visual-words representation significantly improves the accuracy of object detection under clutter. However, it encounters a performance bottleneck because it completely ignores the geometric constraints between features. On the contrary, to detect objects accurately an explicit shape model heavily relies on the geometric information of the object and, as a result, lacks discriminative power. In this paper, we present a discriminative shape model that exploits the advantages of both models, based on the insight that the two are essentially complementary: the discriminative model provides discriminative power, while the shape model encodes geometry. The cost function that we use to distinguish objects considers both the detection maps of the discriminative model and the result of shape matching. In this cost function, we adopt a novel way of dealing with multi-scale detection maps. We show that this cost function has very strong discriminative power, which makes it possible to learn a discriminative threshold for full object detection. For the shape model, we also present a scheme for learning a good shape model from noisy images. Experiments on UIUC Car and Weizmann–Shotton horses show the state-of-the-art performance of our model. © 2014 Elsevier GmbH. All rights reserved.

1. Introduction

Object detection under clutter is one of the most challenging tasks in computer vision. First, we must determine whether there is any object in the image at all. Secondly, we must localize all the instances of the object in the image. We call the former determination or verification [5], and the latter localization [1–4,11,21]. In this paper, we deal with both.

The difficulties of object detection under clutter mainly result from intra-class variations caused by illumination, pose, occlusion, viewpoint, noise, etc. A discriminative model over a bag-of-visual-words (BOVW) representation [1,2,5,6,8,11,21] is a state-of-the-art approach to dealing with this. By discarding the geometric information of the object and building on solid kernel-based statistical learning theory [19,20], the discriminative model over BOVW obtains state-of-the-art accuracy and robustness. However, it encounters a performance bottleneck because it completely ignores the geometric constraints between features. A typical example is the part redetection problem (see Fig. 2), i.e. a part of an object (e.g. the head or tail of a car) is detected again.

∗ Corresponding author. E-mail address: [email protected] (H. Yu).

The explicit shape model [13], on the contrary, heavily relies on the geometric information of the object. It can learn a shape model from real images or hand-drawings; matching the learned shape model to test images yields accurate object detection results. Relative to the traditional approaches providing only rectangular bounding boxes [1–6], the shape model can output the closed contour of the object. Relative to the recent approaches providing closed contours [11,21], the shape model can output richer, point-wise shape matching information. Although very appealing, heavily relying on the geometric information of the object and learning the model only from positive samples result in poor discriminative power of the shape model. How to improve the discriminative power is the main obstacle to applying shape models to object detection under clutter.

In this paper, we propose a discriminative shape model that exploits the advantages of both the discriminative model and the shape model. This is based on the insight that the two models are essentially complementary: the discriminative model makes no use of geometry, while the shape model lacks discriminative power. Our model bridges the gap between them. The discriminative model provides initial bounding boxes and the corresponding multi-scale detection maps; the shape model then makes use of this information to distinguish objects and output accurate object detection results.


We do verification with a cost function which considers both the detection maps of the discriminative model and the result of shape matching. Relative to [11,21], our cost function accumulates only positive score points, which are sparse and have better discriminative power. Moreover, our cost function makes use of both the closed contour and the point-wise shape matching result. We show that this cost function has very strong discriminative power, which makes it possible to learn a discriminative threshold for full object detection. In [11,21], the solution of the optimization problem of the cost function relies on the ratio contour (RRC) algorithm [15]. In our model, thanks to the discriminative model providing initial bounding boxes and the shape model providing initial positions and scales within the bounding boxes, the solution is trivial.

Once it is verified that there is an object in the bounding box, localization is trivial: we just select the best match from the matches with minimal costs. In [13], verification does not exist and localization is just the best match, which apparently lacks discriminative power and results in a high FP (false positive) rate. Our model can effectively reduce the FP rate and improve the quality of shape matching by selecting a better match.

We evaluate our model on two publicly available datasets. One is UIUC Car, a dataset with rigid object images (cars) from a single viewpoint. The other is Weizmann–Shotton horses, which contains non-rigid objects. Experiments show the state-of-the-art performance of our model.

2. Discriminative model over bag-of-visual-words representation

Given an image, traditional object detection outputs the bounding boxes of all objects in the image without assuming how many objects exist in it. Formally [1], let X denote the space of all images and Y the space of all rectangular bounding boxes; we can then define a quality function f(x, y) as in (1).

f : X \times Y \to \mathbb{R}    (1)

The quality function f(x, y) predicts the quality of an object located at bounding box y in image x. For a single fixed image x, we write f(y) for f(x, y). To predict the best location of the object, we need to solve

y_{opt} = \arg\max_{y \in Y} f(y)    (2)

Traditional sliding window approaches [2–4] approximate the solution of Eq. (2) by searching only over a small subset of Y, while the ESS (efficient subwindow search) algorithm [1] uses a branch-and-bound strategy to obtain a globally optimal solution in a computationally efficient way. We focus on the situation where the quality function f is the decision function of a support vector machine (SVM) with a linear kernel.

The bag-of-visual-words representation we adopt is the HMAX model [2,6,7]. It is similar to a hierarchical spatial pyramid histogram representation [1,5,8]. One key difference lies in the low-level features: the HMAX model uses Gabor wavelets [9,10], while spatial pyramid histogram representations typically use interest point descriptors such as SURF [1]. Whatever low-level features are used, formally [1] we can write the decision function of an SVM with a linear kernel over a bag-of-visual-words representation as

f(y) = \beta + \sum_{l=1}^{L} \sum_{i=1}^{l} \sum_{j=1}^{l} \sum_{k=1}^{N} \alpha_k \left\langle h^{y}_{l,(i,j)}, h^{k}_{l,(i,j)} \right\rangle    (3)

which can be rewritten in per-point form as

f(y) = \beta + \sum_{m=1}^{n} \sum_{l=1}^{L} w^{l,(i,j)}_{c_m}, \qquad i = 1 \ldots l, \; j = 1 \ldots l    (4)

where h^{y}_{l,(i,j)} are the histograms of all features of the image x that fall into the spatial grid cell with index (i, j) of an l × l spatial pyramid in the bounding box y, c_m is the visual word of feature point m, and (i, j) in Eq. (4) is the grid cell containing point m at pyramid level l. Eq. (4) expresses (3) as a sum of per-point contributions using the linearity of the scalar products. More details about Eqs. (3) and (4) can be found in [1]. Note that Eq. (4) accumulates over all points and all scales of the bounding box y.

Fig. 1. An example of multi-scale detection maps: part (a) is the original image, parts (b) to (f) are the corresponding detection maps at several different scales (from small to large). Both object and background appear at multiple scales, so accumulating over scales loses discriminative power between object and background.

We differ from [1,11] in how we deal with points and scales. First, we accumulate only object points, i.e. the points with positive scores. This is based on the insight that, relative to all the points in the bounding box y, object points are sparse. Because of this sparsity, accumulating only object points improves the discriminative power of the decision function f. Secondly, we observe that accumulating over all scales of the bounding box y reduces the discriminative power of the decision function f. In fact, although the object appears at multiple scales, so does the background, so accumulating over scales loses discriminative power between object and background. Fig. 1 gives an example, and the experiments in Section 6.1 also show this. The modified decision function can thus be written as

f(y) = \beta + \sum_{m=1}^{n} p^{l_{\max},(i,j)}_{c_m}    (5)

where p^{l_{\max},(i,j)}_{c_m} are all the positive scores at a certain scale l_{\max}, the scale with the peak score over all points and scales. Eq. (5) means we accumulate the positive scores of the scale l_{\max}, which expresses the total energy of all object points in the bounding box y. Here, we replace the two inner linear summation ('SUM') operations in Eq. (4) with two nonlinear maximum operations ('MAX'). We believe that this is consistent with the HMAX model [2,6,7] and makes the decision function more robust to clutter [7].
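The following is a minimal sketch of the modified decision function of Eq. (5), assuming the per-point SVM scores for each scale of a candidate bounding box are already available as 2-D numpy arrays (the input format and function name are illustrative, not the authors' implementation):

```python
import numpy as np

def modified_decision_score(score_maps, beta=0.0):
    """Sketch of Eq. (5): accumulate only positive scores at the single
    scale l_max whose peak score is highest over all points and scales.

    score_maps: list of 2-D arrays of per-point SVM scores, one per scale
    of the candidate bounding box (an assumed input format).
    """
    # 'MAX' over scales: pick the scale containing the overall peak score.
    l_max = int(np.argmax([m.max() for m in score_maps]))
    best = score_maps[l_max]
    # Accumulate only object points (positive scores) at scale l_max.
    return beta + float(best[best > 0].sum())
```

Summing only the positive entries of a single best scale is what keeps the score sparse and, per the discussion above, more discriminative than accumulating all points over all scales.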

3. Discriminative shape model

In Eq. (5), we accumulate over all feature points in the bounding box y at a certain scale l_max. Apparently, this coarse accumulation completely ignores the geometric information of the object; Fig. 2 shows a typical false positive (FP) resulting from it. We therefore use the explicit shape model [12,13] to complement the discriminative model over the bag-of-visual-words representation, and call the combination the discriminative shape model. In this section, we first briefly introduce the explicit shape model and then present the cost function of our discriminative shape model.


Fig. 2. An example of an FP resulting from ignoring geometry.

Fig. 3. The relations between the bounding box and the shape matching result. The dashed lines are the matching points v^k_cond. The ellipse is the maximal closed contour C of the shape matching result. The region enclosed by C is R(C). For matching points in R(C) we accumulate with different weights.

3.1. Overview of explicit shape model

The explicit shape model [13] learns a shape model from real images or hand-drawings; this shape model is then matched to test images to detect the objects and, at the same time, output their accurate shapes. The model has two most appealing properties. One is that it can learn a shape model from just one hand-drawing. The other is that it outputs the accurate shape of the object, which is critical for in-depth semantic analysis of the object and the whole image.

The local contour features used by the shape model are the scale-invariant pairs of adjacent segments (PAS) features [12]. Learning a shape model is composed of four stages: determine the model parts; assemble an initial shape; refine it; learn the intra-class deformations. To do object detection, we match the learned shape model to the test image edges in two stages. First we obtain rough estimates for the location (x, y) and scale s of the object based on a Hough-style voting scheme. Then the estimates are used to initialize the non-rigid shape matcher [14]. The output of the shape matcher is then scored and ready for the final result. More details about shape learning and matching can be found in [13].

Although very appealing, the shape model heavily relies on the geometric information of the object and, as a result, lacks discriminative power. To deal with this problem we present the discriminative shape model. Moreover, our discriminative shape model can also improve the quality of shape matching.

3.2. Cost function with shape matching information

The cost function of our discriminative shape model considers both the discriminative power of the discriminative model and the shape matching information provided by the shape model. Based on the discussion in Section 2, we now formally present the cost function.

In Eq. (5), p^{l_{\max},(i,j)}_{c_m} are the positive scores at scale l_max of bounding box y. Here we use p^{l_{\max},(i_{\max},j_{\max})}_{y} to denote the maximum positive score over all scales and cells of bounding box y. We then obtain the best candidate bounding box with Eq. (6), which searches over all bounding boxes. Eq. (6) can be solved by ESS or simply by a traditional sliding window.

y_{cand} = \arg\max_{y \in Y,\, l_{\max}} p^{l_{\max},(i_{\max},j_{\max})}_{y}    (6)
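As a concrete illustration of Eq. (6), the following is a brute-force sliding-window sketch (the ESS branch-and-bound solver of [1] is the efficient alternative); the fixed box size, the step, and the assumption that all per-scale score maps are resized to a common image grid are simplifications of this sketch, not part of the original method:

```python
import numpy as np

def candidate_box(score_maps, box_hw=(40, 100), step=8):
    """Sliding-window sketch of Eq. (6): choose the bounding box whose
    peak positive score over all scales and cells is maximal.

    score_maps: list of 2-D per-scale score maps, assumed resized to a
    common image grid; box_hw and step are illustrative values.
    """
    h, w = score_maps[0].shape
    bh, bw = box_hw
    best_score, best_box = -np.inf, None
    for top in range(0, h - bh + 1, step):
        for left in range(0, w - bw + 1, step):
            # Peak score inside this window, taken over all scales.
            peak = max(m[top:top + bh, left:left + bw].max()
                       for m in score_maps)
            if peak > best_score:
                best_score, best_box = peak, (top, left, bh, bw)
    return best_box, best_score
```

A full search would also vary the box size; ESS avoids the exhaustive loop by bounding the achievable peak score over whole sets of boxes.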

Applying shape matching to y_cand, we obtain the candidate matches v^k_cond (k = 1, ..., n). For each v^k_cond, we obtain the maximal closed contour C and the corresponding region R(C) of the object. We can then define our cost function as

\Phi(C) = \frac{\lambda}{\sum_{(i,j) \in R(C)} w(i,j)\, p^{l_{\max},(i,j)}_{y_{cand}}}    (7)

where p^{l_{\max},(i,j)}_{y_{cand}} are the positive scores at scale l_max of bounding box y_cand, w(i, j) are the weights for the positive score cells (i, j) within the region R(C), and λ is an adjustable penalty term. With the candidate match v^k_cond we can define w(i, j) as

w(i,j) = \begin{cases} w_1, & v^{k}_{cond}(i,j) = 1 \\ w_2, & v^{k}_{cond}(i,j) = 0 \end{cases}    (8)

where v^k_cond(i, j) = 1 means that (i, j) is a matched point. In the experiments of Section 6 we adopt fixed values for w_1 and w_2, i.e. 2 for w_1 and 1 for w_2. Fig. 3 shows the relations between y_cand, C, R(C) and v^k_cond. Different from Eq. (5), here we only accumulate the object points within the region R(C), which are more probably the real object points. Thus, we further improve the discriminative power of the cost function with shape matching information. Also note that in p^{l_{\max},(i,j)}_{y_{cand}} the index (i, j) denotes a grid cell, while in v^k_cond(i, j) it denotes an image point, so in practice a resize operation is mandatory.

Eq. (7) integrates the discriminative scores p^{l_{\max},(i,j)}_{y_{cand}} and the shape matching information w(i, j). Optimizing Eq. (7) over the candidate matches v^k_cond (k = 1, ..., n), we obtain the final optimal match v_opt, as described by Eq. (9).

v_{opt} = \arg\min_{v^{k}_{cond}} \Phi(C)    (9)
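A minimal sketch of the cost computation of Eqs. (7)–(9) follows, including the Gaussian transform mentioned below as one plausible normalization; the mask-based input format, the epsilon guard, and sigma are assumptions of this sketch:

```python
import numpy as np

def match_cost(score_map, region_mask, match_mask, lam=1.0, w1=2.0, w2=1.0):
    """Sketch of Eqs. (7)-(8): cost of one candidate match v^k_cond.

    score_map:   2-D score map at scale l_max, resized to image points.
    region_mask: boolean mask of R(C), the region enclosed by contour C.
    match_mask:  boolean mask of the matched points (v^k_cond(i, j) = 1).
    lam, w1, w2: penalty and weights; w1 = 2, w2 = 1 as in Section 6.
    """
    weights = np.where(match_mask, w1, w2)       # Eq. (8)
    pos = np.clip(score_map, 0.0, None)          # object points only
    denom = float((weights * pos)[region_mask].sum())
    return lam / max(denom, 1e-12)               # Eq. (7)

def best_match(costs, sigma=1.0):
    """Eq. (9): pick the candidate with minimal cost. The Gaussian
    transform exp(-cost^2 / (2 sigma^2)) maps low costs (objects) near 1
    and high costs near 0, normalizing the cost to [0, 1]."""
    k = int(np.argmin(costs))
    return k, float(np.exp(-costs[k] ** 2 / (2.0 * sigma ** 2)))
```

Accumulating only positive scores keeps the denominator positive, which is why the minimization in Eq. (9) is well posed.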

Although mainly inspired by [11], we differ from it in several key aspects. First, the numerator of Eq. (7) is an adjustable penalty term, not a term for continuity and proximity: we have no need to consider the continuity and proximity of edgels, as we can obtain the maximal closed contour C of the object through shape matching. Secondly, the denominator of Eq. (7) makes use of the richer shape matching information v^k_cond, which is not available from the closed contour C alone, i.e. matched points within the region R(C) are accumulated with a different weight. Thirdly, as also mentioned in Section 2, Eq. (7) only accumulates object points. We believe this helps to improve the discriminative power of the cost function. Moreover, without accumulating background points (negative score points), we naturally meet the constraint that the denominator of Eq. (7) is positive. This means we can always guarantee the global optimal solution of Eq. (9). For convenience, we can further transform the cost function with a Gaussian, which can be regarded as normalizing the cost function to [0, 1].

4. Proposed algorithms

Assuming we have learned a shape model from images or hand-drawings, object detection with the discriminative shape model (presented in Section 3) is composed of three key stages. First, we need to obtain the candidate bounding box y_cand and the corresponding detection map p^{l_{\max},(i,j)}_{c_m}. This can be done by ESS or simply by a traditional sliding window, as mentioned in Section 3.2. Secondly, we need to apply shape matching to y_cand to obtain the candidate matches v^k_cond (k = 1, ..., n). Applying shape matching to y_cand rather than to the whole image is an important extension to [13], which effectively reduces the computational complexity of shape matching and also makes best use of the discriminative power. Thirdly, for each v^k_cond we need to obtain the maximal closed contour C and the corresponding region R(C) of the object. We use the Ratio Contour (RRC) algorithm [15] to achieve this goal. With RRC, we can get all the salient closed contours in v^k_cond. The outmost closed contour of these contours together is an acceptable approximation of the maximal closed contour C of v^k_cond. Simple filling and subtraction operations with these contours then yield the final region R(C). Fig. 4 shows an example.

Fig. 4. An example of obtaining the outmost closed contour of the shape matching result: part (a) is the original matching result, part (b) is the binary one, part (c) is the set of salient closed contours obtained with RRC, part (d) is the final outmost closed contour.

The complete algorithm for object detection with the discriminative shape model is presented in Algorithm 1; a sketch of its region-extraction steps (lines 5–9) follows below.

Algorithm 1. v_opt = SingleObjectDetection(x, M)
Input: x — image; M — learned shape model
Output: v_opt — optimal match
1: Obtain the candidate bounding box y_cand and the corresponding detection map p^{l_{\max},(i,j)}_{c_m} of image x.
2: Obtain the PAS features P of the image in bounding box y_cand.
3: Match M to P to obtain the candidate matches v^k_cond (k = 1, ..., n).
4: For k = 1 to n do
5:   Call RRC to obtain the salient closed contour set S of v^k_cond.
6:   Construct a binary contour map x1 (1 for contour points) from S.
7:   Fill x1 from point (1, 1) to obtain the map x2.
8:   x3 = x2 − x1.
9:   R(C) is the set of zero points in x3.
10:  Compute Φ_k(C) from R(C) and p^{l_{\max},(i,j)}_{c_m}.
11: End For
12: Find the minimal Φ_k(C).
13: Return the corresponding v^k_cond.
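The following is a minimal sketch of the region-extraction steps (lines 5–9 of Algorithm 1), assuming the salient closed contours from RRC are already rasterized into a binary map; the function name and input format are illustrative:

```python
from collections import deque

import numpy as np

def region_from_contours(contour_map):
    """Sketch of lines 5-9 of Algorithm 1: flood-fill the background of
    the binary contour map x1 (1 = contour point) from the image corner,
    then take the zero points of x3 = x2 - x1 as the region R(C).
    Assumes the corner lies outside all contours."""
    x1 = contour_map.astype(np.uint8)
    h, w = x1.shape
    x2 = x1.copy()
    queue = deque([(0, 0)])                      # fill from the corner
    while queue:
        r, c = queue.popleft()
        if 0 <= r < h and 0 <= c < w and x2[r, c] == 0:
            x2[r, c] = 1                         # reachable background
            queue.extend([(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)])
    x3 = x2 - x1       # 1 = outside background, 0 = contour and interior
    return x3 == 0     # boolean mask of R(C)
```

After the fill, every point reachable from the corner is marked background, so the zero points of x3 are exactly the contour and its interior, i.e. R(C).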

5. Multiple object detection

As usual [1–4,11], we can apply Algorithm 1 repeatedly to detect multiple objects in an image. Each time, we simply remove the last detected object from the detection maps. Non-maximum suppression (or neighborhood suppression) can further take the object size into account, i.e. redetections within the region of already detected objects are discarded. In practice, to simplify the implementation, the region of a detected object is taken to be the size of its bounding box, which ensures that only one object is detected per candidate bounding box.

As mentioned at the very beginning, object detection is composed of both verification and localization. Typically, for verification, we need a binary classifier to decide whether the current detection is an object or not. An empirical threshold applied to the detection result is a simple solution, adopted by most of the related work [1–4,11,21]. In this paper, we regard Φ(C) as the last verification stage, relative to the stage that obtains y_cand. We hope to learn a discriminative threshold for Φ(C) from training samples. Fig. 5 shows the idea. Note that in Fig. 5 Φ(C) is normalized to [0, 1] with a Gaussian, as mentioned in Section 3.2. Experiments show that our cost function Φ(C) has strong discriminative power, and that learning a discriminative threshold for Φ(C) from training samples is possible.

Fig. 5. Learning a discriminative threshold for Φ(C). Note that Φ(C) is normalized to [0, 1] with a Gaussian. The area with low Φ(C) is the object region, the area with high Φ(C) the background region. If the problem is linearly separable, there is a gap between the object and background regions. The dashed line represents the learned discriminative threshold.

6. Experiments

We evaluate our work on two publicly available datasets, UIUC Car and Weizmann–Shotton horses. For each one, we present the implementation details and the corresponding results. We compare our results with several state-of-the-art ones for traditional bounding box detection; just as expected, we obtain state-of-the-art performance on both datasets. For accurate object shape detection, we can only give rough statistical results because the two datasets lack shape ground truth. We have not found a more appropriate dataset that has both enough training samples and shape ground truth; perhaps we should set one up next.

6.1. UIUC Car

UIUC Car [3] is a classical example of a dataset with rigid object images (cars) from a single viewpoint (side view, head left or right). It contains 1050 training images (550 positive and 500 negative) of fixed size 100 × 40 pixels. Two test sets with varying resolution are available. One is the single scale set, which contains cars of roughly the same size as in the training images. The other is the multi-scale set, in which cars have sizes ranging from roughly 0.8 to 2 times the size of the cars in the training images. The difficulties of this dataset include clutter, partial occlusion, noise, varying size, etc. We use the default setup and scoring program of this dataset.

First, we need to learn a shape model from the positive training images. Because of the strong noise, we find the learned model is unusable when learning with the default settings of [13]. We replace the Berkeley edge detector [16] with the one based on the non-classical receptive field (NCRF) [17]. The settings we used for NCRF include anisotropic inhibition and thinning, with no hysteresis thresholding. We also use RRC to obtain the most salient closed contour, which helps filter out the noisy edgels outside it. Fig. 6 shows the models we learned from 7 randomly selected positive training images. We can see that with NCRF and noise filtering a much better shape model for the car is learned.

We use the HMAX model [2] to obtain the candidate bounding box and the corresponding detection map of the scale with the maximum score. The settings are the same as in [2]. We are then ready to learn a discriminative threshold for the cost function, as sketched below.
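The following is a minimal sketch of this threshold learning under the linearly separable assumption of Fig. 5: place the threshold in the gap between the normalized Φ(C) ranges of the two classes (the midpoint rule and the input format are assumptions of this sketch, not the authors' exact procedure):

```python
import numpy as np

def learn_threshold(phi_pos, phi_neg):
    """Learn a discriminative threshold for the normalized cost Phi(C),
    which after the Gaussian transform is high for objects and low for
    background. phi_pos / phi_neg: normalized Phi values of positive and
    negative training samples."""
    lo, hi = float(np.max(phi_neg)), float(np.min(phi_pos))
    if hi <= lo:
        raise ValueError("ranges overlap: not linearly separable")
    return 0.5 * (lo + hi)   # midpoint of the gap

# E.g. for Table 1 (single scale), negatives span [0, 0.803] and positives
# [1.0, 1.0], so any value in (0.803, 1.0) separates them; the midpoint
# 0.9 coincides with the learned threshold reported there.
```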


Fig. 6. The shape model for UIUC Car learned from 7 randomly selected training images. The left of the first row is the original initial model, which is noisy. The right of the first row shows the main closed contour (in red) obtained with RRC. The left of the second row is the model filtered with the main closed contour. The right of the second row is the left one with the main closed contour added. The third row is the final model, in which blue lines depict the final initial model and red points depict the final sampled model. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 1. Learned discriminative thresholds for the cost function. For the multi-scale case, the training images are randomly selected from the multi-scale test images.

               Total training images   Normalized Φ(C)              Learned threshold
                                       Positive      Negative
Single scale   1050                    [1.0, 1.0]    [0, 0.803]     0.9
Multi-scale    20                      [0.96, 1.0]   [0, 0.988]     0.99

Table 2. Detection rate on UIUC Car. References [2] and [1] are results at the point of equal precision and recall. Reference [13] is the result at the default threshold 0.15. Ours is based on the learned discriminative thresholds in Table 1.

               Single scale   Multi-scale
Mutch [2]      99.94          90.6
Lampert [1]    98.5           98.6
Ferrari [13]   —              55.43
Ours           99.75          94.85

For the single scale case, we can learn the threshold directly from the training images. For the multi-scale case, we can learn it from part of the multi-scale test images or from randomly resized training images. Table 1 shows the learning results. Note that here we simply set the numerator of Eq. (7) to 1, i.e. no extra penalty. We can see that the single scale case is linearly separable [18] while the multi-scale case is not.

The final object detection results on the test sets are shown in Table 2. For our result on the single scale set, the only FN (false negative) is caused by a failure of shape matching, which is consistent with the single scale case being linearly separable. For the multi-scale test set we can see that it is not linearly separable.

Table 2 also compares our result with several other state-of-the-art results. In this table, [2] and [1] are the results of purely discriminative models, while [13] is the result of a shape model. We can see that our model largely improves the performance of the shape model [13]. As mentioned before, the main cause lies in the strong discriminative power of our model. Compared to the purely discriminative models, for the multi-scale case our result is much better than [2] but clearly worse than [1]; for the single scale case, our result is better than [1] and a bit worse than [2].

We have done another interesting experiment: we replaced the low-level features of Lampert's experiment on UIUC Car with HMAX features.


Then the detection rate of Lampert's method on the multi-scale set drops to 93.53. First, this shows that with the same low-level features our model outperforms Lampert's, just as analyzed in Section 2. Secondly, this suggests that for the multi-scale case the bottleneck of our method lies in the low-level features. Note that in Table 2 only ours has a verification stage based on learned discriminative thresholds, which means a stable high performance.

The most interesting property of an explicit shape based model is the ability to obtain shape matching information, which is a much more accurate object detection output than traditional bounding boxes [1–4] or closed contours [11]. Note that to obtain a better shape matching result, we also take the shape matching score [13] into account. Due to the lack of shape ground truth for UIUC Car, we can only give a rough statistical result for shape matching. We adopt a simple 50% coverage rule similar to the one for bounding box detection [3,4,12,13]. Under this rule we obtain a rough statistical result: 99.5 for single scale and 92.22 for multi-scale.

6.2. Weizmann–Shotton horses

To verify the generalization of our model, we select Weizmann–Shotton horses [22] as our second dataset, which contains non-rigid objects (horses). This dataset consists of a single scale set and a multi-scale one. All the positive samples contain exactly one horse each, side view with head to the left. For training the discriminative model, we use the single scale set, which has 328 positive images and 900 negative images in total. For balance between the samples, we use all 328 positive images and only the first 301 negative images. For testing, we use the 228 positive and 228 negative test images of the multi-scale set. We adopt the same settings for the model as in Section 6.1 except for the penalty term of Eq. (7). Fig. 7 shows the learned shape model.

Fig. 7. The shape model for Shotton horses learned from 12 randomly selected training images. The left of the first row is the original initial model, which is noisy. The right of the first row shows the main closed contour (in red) obtained with RRC. The left of the second row is the model filtered with the main closed contour. The right of the second row is the left one with the main closed contour added. The third row is the final initial model. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Table 3. Learned discriminative threshold for the cost function.

               Total training images   Normalized Φ(C)             Learned threshold
                                       Positive      Negative
Single scale   629                     [0.83, 0.99]  [0, 0.77]     0.8

Table 4. Detection rate on Shotton horses. [12] is the result at the point of equal precision and recall. [13] is the result at the default threshold 0.15. Ours is based on the learned discriminative threshold in Table 3.

               Multi-scale
Ferrari [12]   95.7
Ferrari [13]   93.69
Ours           99.25

Table 3 shows the result of learning the discriminative threshold from 629 images (328 positive and 301 negative) of the single scale set. We can see that the single scale set is linearly separable. Note that here we set the penalty term of Eq. (7) to the total number of matched points. In fact, with the same penalty term as in Section 6.1, i.e. no extra penalty, the single scale set would not be linearly separable, because the upper bound of the negatives saturates at 1.0.

Table 4 shows the detection results for the multi-scale set with the learned discriminative threshold 0.8. As in [12], we use the 228 positive and 228 negative test images of the multi-scale set. Note that ours has a verification stage based on the learned discriminative threshold, which means a stable high performance.

In Table 4, [12] is the result of a purely discriminative model, while [13] is the result of a shape model. We can see that our result outperforms both [12] and [13], which again verifies the state-of-the-art performance of our model. Note that, different from Section 6.1, here we deal with typical non-rigid objects (horses). The result shows that our model also applies equally well to non-rigid objects.

As to shape matching, in the same way as Section 6.1, we give a rough statistical result for the 228 positive test images: 87.72. We can see that there is still room for improving the shape matching process.

7. Conclusions

In this paper, we present a discriminative shape model for object detection under clutter. As the discriminative model discards geometric information while the explicit shape model lacks discriminative power, we bridge the gap between the two models through a cost function which integrates both shape matching information and discriminative power. We argue that full object detection is composed of verification and localization. We do verification with the cost function and the learned discriminative thresholds. For localization, we provide not only bounding boxes but also point-wise shape matching results. For the explicit shape model, we also present a scheme for learning a good shape model from noisy images.

Experiments on two publicly available datasets (one rigid, one non-rigid) show the state-of-the-art performance of our model.

At present we only deal with a single view; how to efficiently deal with multiple views is an interesting and challenging research topic. Other possible future work includes: further improving the shape matching process; trying better low-level features; and setting up a better dataset which contains not only enough training samples but also shape ground truth.

References

[1] C.H. Lampert, M.B. Blaschko, T. Hofmann, Beyond sliding windows: object localization by efficient subwindow search, in: CVPR, 2008.
[2] J. Mutch, D.G. Lowe, Object class recognition and localization using sparse features with limited receptive fields, Int. J. Comput. Vis. (IJCV) 80 (1) (2008) 45–57.
[3] S. Agarwal, A. Awan, D. Roth, Learning to detect objects in images via a sparse, part-based representation, IEEE Trans. Pattern Anal. Mach. Intell. 26 (11) (2004) 1475–1490.
[4] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part based models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (9) (2010) 1627–1645.
[5] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, in: CVPR, 2006, pp. 2169–2178.
[6] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, T. Poggio, Robust object recognition with cortex-like mechanisms, IEEE Trans. Pattern Anal. Mach. Intell. 29 (3) (2007) 411–426.
[7] M. Riesenhuber, T. Poggio, Hierarchical models of object recognition in cortex, Nat. Neurosci. 2 (1999) 1019–1025.
[8] Y. Tian, Relevant algorithms for general object recognition based on feature combination, Master Thesis, Shanghai Jiao Tong University, Shanghai, 2007.
[9] D. Gabor, Theory of communication, J. IEE 93 (1946) 429–459.
[10] J.P. Jones, L.A. Palmer, An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex, J. Neurophysiol. 58 (1987) 1233–1258.
[11] Z. Zhang, Y. Cao, D. Salvi, K. Oliver, J. Waggoner, S. Wang, Free-shape subwindow search for object localization, in: CVPR, 2010.
[12] V. Ferrari, L. Fevrier, F. Jurie, C. Schmid, Groups of adjacent contour segments for object detection, IEEE Trans. Pattern Anal. Mach. Intell. 30 (1) (2008) 36–51.
[13] V. Ferrari, F. Jurie, C. Schmid, From images to shape models for object detection, Int. J. Comput. Vis. (IJCV) 87 (3) (2010) 284–303.
[14] H. Chui, A. Rangarajan, A new point matching algorithm for non-rigid registration, Comput. Vis. Image Understand. 89 (2–3) (2003) 114–141.
[15] S. Wang, T. Kubota, J. Siskind, J. Wang, Salient closed boundary extraction with ratio contour, IEEE Trans. Pattern Anal. Mach. Intell. 27 (4) (2005) 546–561.
[16] D. Martin, C. Fowlkes, J. Malik, Learning to detect natural image boundaries using local brightness, color, and texture cues, IEEE Trans. Pattern Anal. Mach. Intell. 26 (5) (2004) 530–549.
[17] C. Grigorescu, N. Petkov, M.A. Westenberg, Contour detection based on nonclassical receptive field inhibition, IEEE Trans. Image Process. 12 (7) (2003) 729–739.
[18] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley and Sons, New York, 2001, pp. 195–196.
[19] B. Scholkopf, A.J. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
[20] G. Bakir, T. Hofmann, B. Scholkopf, A.J. Smola, B. Taskar, S.V.N. Vishwanathan, Predicting Structured Data, MIT Press, Cambridge, MA, 2007.
[21] Z. Zhang, S. Fidler, J. Waggoner, Y. Cao, S. Dickinson, J.M. Siskind, S. Wang, Superedge grouping for object localization by combining appearance and shape information, in: CVPR, 2012.
[22] J. Shotton, A. Blake, R. Cipolla, Contour-based learning for object detection, in: ICCV, 2005.
