Author's Accepted Manuscript

Fast Object Detection Based on Selective Visual Attention

Mingwei Guo, Yuzhou Zhao, Chenbin Zhang, Zonghai Chen

www.elsevier.com/locate/neucom

PII: S0925-2312(14)00685-7
DOI: http://dx.doi.org/10.1016/j.neucom.2014.04.054
Reference: NEUCOM14242

To appear in: Neurocomputing

Received date: 26 October 2013
Revised date: 9 February 2014
Accepted date: 24 April 2014

Cite this article as: Mingwei Guo, Yuzhou Zhao, Chenbin Zhang, Zonghai Chen, Fast Object Detection Based on Selective Visual Attention, Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2014.04.054

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Fast Object Detection Based on Selective Visual Attention

Mingwei Guo, Yuzhou Zhao, Chenbin Zhang, Zonghai Chen*
Department of Automation, University of Science and Technology of China, Hefei 230027, PR China

Abstract: Selective visual attention plays an important role in the human visual system. In real life, the human visual system cannot process all of the visual information captured by the eyes in time. Selective visual attention filters this information and selects the interesting parts for further processing such as object detection. Inspired by this mechanism, we construct an object detection method that is faster than methods that search for objects with a sliding window. The method first extracts a saliency map from the original image and then obtains a candidate detection area from the saliency map with an adaptive threshold. To detect objects, we only need to search the candidate detection area with the deformable part model. Since the candidate detection area is much smaller than the whole image, detection is sped up. We evaluate the detection performance of our approach on the PASCAL VOC 2008, INRIA person and Caltech 101 datasets, and the results indicate that our method speeds up detection without a decline in detection accuracy.

Key words: object detection; selective visual attention; deformable part models; sliding window

1 Introduction

Computer vision is a science that tries to make machines 'see' the world as human beings do. Researchers study the mechanisms of the human visual system to improve computer vision. One of the mechanisms that makes the human visual system so effective is its ability to extract the

* Corresponding author. Tel.: +86 055163606104. E-mail address: [email protected] (Z. Chen).


relevant information at an early processing stage, a mechanism called selective visual attention. The information captured by the human eyes is too plentiful for the visual system to process in time. However, not all visual information is equally important: the brain filters it and selects the interesting parts for further processing through selective visual attention.

Object detection is one of the fundamental problems in computer vision, covering detection in both video and static images. This paper mainly discusses object detection in static images, that is, detecting and locating objects of a given class in a single image. Today, most object detection methods simplify detection into a binary classification problem: determine whether an object of the given class is present in a sliding window or not. Because of the sliding window, detection is relatively slow, which limits these methods in many applications. The shortcoming of this kind of method stems from the sheer amount of visual information: computers cannot process all of the visual information in an image in time. To solve this problem, we naturally turn to selective visual attention.

In figure 1, column a is the original image, column b is the detection image and column c is the saliency detection image (the gray scale of a pixel in the saliency map shows its degree of saliency). The original images come from the PASCAL VOC 2008 dataset [8-11]. From the first row of images, we can see that the pixels contained in the bounding box of the detection image have significant saliency. However, to obtain this bounding box, we have to search all positions and scales in the image pyramid. Reducing the search area is thus a natural way to speed up detection, so we take as the search area a candidate detection area containing the pixels of significant saliency.


In row 2 of figure 1, the detection image contains a bounding box without a target object, and the pixels inside it have little saliency. By extracting the candidate detection area from the saliency map, we can therefore eliminate such erroneous detections and maintain detection accuracy.

[Fig. 1 is placed here.]

We simulate selective visual attention in order to construct an object detection method. From research in visual psychology, selective visual attention can be summarized in two aspects [1]: 1) bottom-up selective visual attention, which is driven by external stimuli such as strong contrast; this kind of attention is a low-level cognitive process; 2) top-down visual attention, which is controlled by high-level information in the brain, such as knowledge, expectation and goals. Itti et al. [2-4] proposed a bottom-up, task-independent, saliency-based model of visual selective attention that simulates the attention mechanism of the human visual system. Serre, Poggio et al. [5] simulated the computations performed by the feedforward path of the ventral stream in visual cortex, and the local circuits implementing them, to build a model that extracts saliency from images. Moreover, plenty of other saliency detection methods have been proposed; for example, Zhai and Shah [6] extract color saliency by calculating histograms, and Qiong Yan et al. [7] construct a hierarchical model to detect salient areas. Many approaches simulate the visual selective attention of the human visual system to extract salient areas in real time [30-34]. To solve the slow detection caused by the sliding window, we decrease the search area by extracting a candidate


detection area from the saliency map, which speeds up object detection. Although there has been research on speeding up object detection, for example P. Felzenszwalb's cascade object detection [14], B. Alexe's objectness measure [26] and others [27-29], our method is a novel way to accelerate detection by extracting a salient candidate detection area: we reduce the detection time itself, not just the number of search windows. Once the candidate detection area is obtained, we use the deformable part model as the detector to determine whether a target object is present in the area. The deformable part model (DPM) was proposed by Felzenszwalb et al. [12-17] and performs excellently in the PASCAL VOC Challenge. By using mixtures of multi-scale deformable part models, DPM is able to capture significant variations in appearance and thus represent a rich object category.

This paper is organized as follows: Section 2 introduces the deformable part model; the method to extract the candidate detection area is presented in Section 3; Section 4 shows how the model is trained; Section 5 discusses our experimental results; the last section summarizes the paper and discusses future work.

2 Deformable Part Model

Because of its excellent performance in the PASCAL VOC Challenge, we use the deformable part model as the detector to ensure detection accuracy. A DPM is defined by two kinds of filters: a coarse root filter that approximately covers an entire object, and higher-resolution part filters that cover smaller parts of the object. During detection, the root filter location defines a detection window, and the part filters are placed $\lambda$ levels down in the pyramid so that their features are computed at twice the resolution of the features at the root filter level.


As shown in figure 2, DPM first computes a HOG [18] feature pyramid by building a standard image pyramid via repeated smoothing and subsampling, and then computing a feature map from each level of the image pyramid. We define the root filter $F_0$ and a set of part models $(P_1, P_2, \cdots, P_n)$, where $P_i = (F_i, v_i, d_i)$. Here $F_i$ is the filter for the $i$-th part, $v_i$ is a two-dimensional vector specifying the center of a box of possible positions for part $i$ relative to the root position, and $d_i$ is a four-dimensional vector specifying the coefficients of a quadratic function defining a deformation cost. The score $S$ of a filter $F$ on a detection window is calculated by Eq. (1):

$$S = F \cdot \phi(H, p, w, h) \qquad (1)$$

where $p$ is a cell in the feature pyramid $H$, $w$ is the width of the detection window and $h$ is its height.
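To make Eq. (1) concrete, the following is a minimal NumPy sketch of scoring a filter at one cell of a feature pyramid level. It is illustrative rather than the authors' voc-release implementation; the feature level is assumed to be an H × W × d array of HOG cells produced by any HOG extractor.

```python
import numpy as np

def filter_score(level, F, y, x):
    """Eq. (1): S = F . phi(H, p, w, h), the dot product between a filter F
    (an h x w x d array of HOG-cell weights) and the h x w window of feature
    cells rooted at cell p = (y, x) of one pyramid level."""
    h, w, _ = F.shape
    window = level[y:y + h, x:x + w, :]      # phi(H, p, w, h)
    return float(np.sum(F * window))

def score_map(level, F):
    """Score F at every valid root position of one pyramid level."""
    h, w, _ = F.shape
    H, W, _ = level.shape
    out = np.empty((H - h + 1, W - w + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = filter_score(level, F, y, x)
    return out
```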

[Fig. 2 is placed here.]

3 Extraction of Salient Candidate Detection Area

The purpose of saliency detection is to obtain the salient area of an image by simulating the selective visual attention of the human visual system [24]. We extract the salient area with a hierarchical saliency detection method and then process it to obtain the candidate detection area.

[Fig. 3 is placed here.]

3.1 Saliency detection

Small-scale salient patterns in images can disturb saliency detection, which may cause the candidate detection area to miss the target object. To avoid this problem, we detect the salient area with a hierarchical model that processes images at multiple scales. As shown in figure 3, the process contains three steps:

a) Image layer extraction

[Fig. 4 is placed here.]

We first generate an initial over-segmentation with a watershed-like method and then compute a scale value for each segmented region, which lets us sort the regions in ascending order. The scale of a region is defined as follows: if a region can contain an n × n cell of pixels, then its scale is larger than n. In figure 4, the scales of the red region a and the blue region b are smaller than 5, while the scale of the gray region c is larger than 5. In practice, we take 3, 13 and 33 as the scales of the three image layers. When the scale of a region in some layer is smaller than the scale of that layer, we merge the region into its nearest region and update its color with their average. The resulting image layers are shown in figure 3(c); a minimal sketch of the scale test appears below.
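The scale test lends itself to a compact morphological check: a region can contain an n × n cell of pixels exactly when its binary mask survives erosion by an n × n structuring element. The sketch below assumes the over-segmentation is given as a label image; it illustrates the definition and is not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def region_scale_exceeds(mask, n):
    """True if the boolean region `mask` can contain an n x n cell of
    pixels, i.e. some pixel survives erosion by an n x n window."""
    return bool(binary_erosion(mask, structure=np.ones((n, n), bool)).any())

def regions_to_merge(labels, n):
    """Labels of regions whose scale falls below the layer scale n; these
    are the regions that get merged into their nearest neighbours."""
    return [r for r in np.unique(labels)
            if not region_scale_exceeds(labels == r, n)]
```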

b) Single-layer saliency cues

After extracting the three image layers, we compute a saliency cue map for each layer from two kinds of cues: local contrast and a location heuristic.

Local contrast captures the fact that image regions contrasting with their surroundings are salient. For a region $R_i$ ($i = 1, 2, \cdots, n$), we define its local contrast saliency cue as:

$$C_i = \sum_{j=1}^{n} w(R_j)\,\phi(i,j)\,\|c_i - c_j\|^2 \qquad (2)$$

$$\phi(i,j) = \exp\{-D(R_i, R_j)/\sigma^2\} \qquad (3)$$

where $c_i$ and $c_j$ are the colors of regions $R_i$ and $R_j$ respectively, $w(R_j)$ counts the number of pixels in $R_j$, $D(R_i, R_j)$ is the Euclidean distance between regions $R_i$ and $R_j$, and $\sigma$ is a parameter that controls the smoothness of the Gaussian kernel.

The location heuristic saliency cue reflects the fact that the human visual system favors central regions, so the closer a pixel is to the center of an image, the higher its saliency. It is defined as:

$$H_i = \frac{1}{w(R_i)} \sum_{x_i \in R_i} \exp\{-\lambda \|x_i - x_c\|^2\} \qquad (4)$$

where $\{x_0, x_1, \cdots\}$ are the pixels in region $R_i$ and $x_c$ is the center of the image.

After obtaining the local contrast cue $C_i$ from Eqs. (2) and (3) and the location heuristic cue $H_i$ from Eq. (4), the region's saliency cue $\bar{s}_i$ is calculated as (figure 3(d) shows an example of the saliency cue):

$$\bar{s}_i = C_i \cdot H_i \qquad (5)$$
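Eqs. (2)-(5) reduce each region to a few statistics, so they can be sketched directly in NumPy. In the sketch below, `sigma` and `lam` stand for the free parameters σ and λ of Eqs. (3) and (4), whose values the text does not fix, and $D(R_i, R_j)$ is taken between region centroids, which is one plausible reading of the definition.

```python
import numpy as np

def saliency_cues(colors, counts, centroids, pixels, center, sigma, lam):
    """colors: (n, 3) mean region colors c_i; counts: (n,) pixel counts
    w(R_i); centroids: (n, 2) region centers; pixels: list of (m_i, 2)
    pixel-coordinate arrays, one per region; center: (2,) image center."""
    # Eq. (3): phi(i, j) = exp{-D(R_i, R_j) / sigma^2}
    D = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    phi = np.exp(-D / sigma ** 2)
    # Eq. (2): C_i = sum_j w(R_j) phi(i, j) ||c_i - c_j||^2
    color_dist2 = np.sum((colors[:, None] - colors[None, :]) ** 2, axis=-1)
    C = np.sum(counts[None, :] * phi * color_dist2, axis=1)
    # Eq. (4): H_i = (1 / w(R_i)) sum_{x in R_i} exp{-lam ||x - x_c||^2}
    H = np.array([np.exp(-lam * np.sum((p - center) ** 2, axis=1)).mean()
                  for p in pixels])
    return C * H  # Eq. (5): s_i = C_i * H_i
```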

c) Saliency map

Lastly, we apply a weighted average to the three layer saliency cue maps to get the final saliency map. We define $p_{ij}^l$ as the gray scale of the pixel at row $i$ and column $j$ in the $l$-th ($l = 1, 2, 3$) layer saliency cue map. The gray scale of pixel $p_{ij}$ in the final saliency map is then given by Eq. (6):

$$p_{ij} = \sum_{l=1}^{3} A_l\, p_{ij}^l \,/\, 3 \qquad (6)$$

where $A_l$ is the weight of each layer saliency cue map. To eliminate the interference of small-scale salient patterns, we assign small weights to the small-scale layer saliency maps. Because the image is processed at three scales, the interference of small-scale salient patterns is suppressed; moreover, by combining the local contrast and location heuristic cues, the saliency map represents the real saliency of the image.
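Eq. (6) is a per-pixel weighted average over the three layer cue maps, sketched below. The weights $A_l$ are not listed in the text, so the values here are placeholders chosen only to illustrate giving small-scale layers small weights.

```python
import numpy as np

def fuse_layers(cue_maps, A=(0.5, 1.0, 1.5)):
    """Eq. (6): p_ij = sum_l A_l * p_ij^l / 3, with `cue_maps` the three
    H x W layer saliency cue maps ordered from smallest to largest scale.
    The weights A are illustrative placeholders, not the paper's values."""
    return sum(a * np.asarray(m) for a, m in zip(A, cue_maps)) / 3.0
```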

3.2 Candidate detection area

The saliency map obtained by saliency detection is a grayscale image, from which we need to extract the candidate detection area. As shown in figure 5, pixel intensities in the saliency map lie between 0 and 255; the higher the intensity of a pixel, the more salient it is, such as the white area in Fig. 5(b).

[Fig. 5 is placed here.]

To get a candidate detection area that contains the target object, we first convert the saliency map to a binary image. In this step, we must make sure that the binary image does not miss any pixels belonging to the object. The gray scale of a salient pixel differs considerably from image to image, so we cannot simply set one fixed threshold to obtain the binary image. After converting the saliency map to a binary image, we obtain the candidate detection area by drawing a bounding box on the binary image that contains all of the white pixels.

[Fig. 6 is placed here.]

A proper threshold should not only make the candidate detection area contain the ground truth area of the object, but also keep the candidate detection area as small as possible. If we use a fixed threshold to obtain the binary image, such as the median value of the saliency map, we cannot get a proper candidate detection area. In Fig. 6, (c) is the binary image obtained by using the median value of the saliency map as the threshold: the binary image either fails to contain the whole object or contains too large an area, which may lead to erroneous or slow detection. To avoid this problem, we use an adaptive threshold. The threshold $T_S$ of the saliency map $S$ is defined as:

$$T_S = \frac{2}{W \times H} \sum_{x=0}^{W-1} \sum_{y=0}^{H-1} s(x, y) \qquad (7)$$

where $W$ and $H$ are the width and height of the saliency map in pixels and $s(x, y)$ is the saliency value of the pixel at position $(x, y)$. A few binary images obtained with the adaptive threshold are shown in Fig. 6(d).
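Eq. (7) sets the threshold to twice the mean saliency, after which the candidate detection area is just the bounding box of the white pixels. A minimal sketch:

```python
import numpy as np

def candidate_area(saliency):
    """saliency: H x W map with values in [0, 255].
    Returns the candidate detection area as (x0, y0, x1, y1), or None."""
    T = 2.0 * saliency.mean()        # Eq. (7): T_S = 2/(W*H) * sum s(x, y)
    ys, xs = np.nonzero(saliency >= T)
    if xs.size == 0:                 # degenerate case: no salient pixel
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```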

3.3 Algorithm of extraction of the salient candidate area

The general scheme for using our saliency detection method to extract the salient candidate area is shown in Table 1. The algorithm takes as input the original image I to be processed and a training image set Di of the specific class. First, we initialize I by over-segmenting it and extract three image layers at different scales ($L_s = \{w_1, w_2, w_3\}$). We then compute the single-layer saliency cues for each image layer and integrate them into a saliency map. Next, we compute the threshold $T_S$ with Eq. (7). Finally, we binarize the saliency map with $T_S$, and the salient candidate area is the window that contains all pixel positions whose value $b_{ij}$ is 1. An end-to-end sketch follows.
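The whole pipeline can then be strung together as below. This sketch assumes the helpers from the previous sketches (`fuse_layers`, `candidate_area`) plus two hypothetical functions, `build_layer` (Section 3.1(a)) and `layer_saliency` (Eqs. (2)-(5)); `dpm_detect` stands in for any DPM detector run only on the cropped candidate area.

```python
def detect(image, dpm_detect):
    layers = [build_layer(image, n) for n in (3, 13, 33)]  # Section 3.1(a)
    cues = [layer_saliency(layer) for layer in layers]     # Eqs. (2)-(5)
    saliency = fuse_layers(cues)                           # Eq. (6)
    box = candidate_area(saliency)                         # Eq. (7) + bbox
    if box is None:
        return []
    x0, y0, x1, y1 = box
    # Run the (otherwise unchanged) DPM detector on the crop only, then
    # map its detections back to full-image coordinates.
    detections = dpm_detect(image[y0:y1 + 1, x0:x1 + 1])
    return [(x + x0, y + y0, w, h, score)
            for (x, y, w, h, score) in detections]
```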

[Table 1 is placed here.]

4 Training Models

In [12,13], Felzenszwalb's DPM uses a latent support vector machine (LSVM) to train its model, treating the position information as a latent value. The classifier is defined as:

$$f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z) \qquad (8)$$

where:

$$\beta = (F_0, \cdots, F_n, a_1, b_1, \cdots, a_n, b_n) \qquad (9)$$

$$\Phi(H, z) = (\phi(H, p_0), \phi(H, p_1), \cdots, \phi(H, p_n)) \qquad (10)$$

Here $F_0$ is the root filter and $(P_1, P_2, \cdots, P_n)$ is the set of part models with $P_i = (F_i, v_i, d_i)$, where, as in Section 2, $F_i$ is the filter for the $i$-th part, $v_i$ is a two-dimensional vector specifying the center of a box of possible positions for part $i$ relative to the root position, and $d_i$ is a four-dimensional vector specifying the coefficients of a quadratic deformation cost. The objective function is:

$$L_D(\beta) = \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^{n} \max(0,\, 1 - y_i f_\beta(x_i)) \qquad (11)$$

$z$ in Eq. (8) is the latent variable. To optimize Eq. (11), we initialize $\beta$ and iterate: pick the best $z$ for each positive example, then optimize $\beta$ via gradient descent with data-mining. The training procedure in [12,13] contains four steps: initialize the root filter, update the root filter, initialize the part filters and update the model. Fig. 7 and Fig. 8 show models trained by LSVM on the PASCAL VOC 2008 dataset.
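The alternation described above can be sketched as follows. It is a didactic stand-in for the training code of [12,13], with `phi(x, z)` and `latent_positions(x)` as assumed callbacks for the DPM-specific feature construction and latent search, and plain full-batch subgradient descent in place of the stochastic, data-mined optimization actually used.

```python
import numpy as np

def train_lsvm(pos, neg, phi, latent_positions, dim,
               C=0.002, rounds=4, epochs=100, lr=1e-3):
    """Minimise Eq. (11) by coordinate descent on (z, beta)."""
    beta = np.zeros(dim)
    for _ in range(rounds):
        # Step 1: with beta fixed, pick the best latent z per example
        # (Eq. (8)); for positives this relabels the part placements.
        feats = [(max((phi(x, z) for z in latent_positions(x)),
                      key=lambda f: beta @ f), y)
                 for examples, y in ((pos, +1), (neg, -1))
                 for x in examples]
        # Step 2: with z fixed, run subgradient descent on Eq. (11).
        for _ in range(epochs):
            g = beta.copy()                 # gradient of ||beta||^2 / 2
            for f, y in feats:
                if y * (beta @ f) < 1:      # active hinge term
                    g -= C * y * f
            beta -= lr * g
    return beta
```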

[Fig. 7-8 are placed here.]


5 Experiment

We first evaluated our method on the PASCAL VOC 2008 dataset, which contains 10057 images covering 20 object classes. The data is split 50/50 into training and test sets, and the distributions of images and objects by class are approximately equal across the two sets. Our experimental environment is Felzenszwalb's open system VOC-Release 4.0, with default values for the feature pyramid parameters, number of filters and so on. We first compare our saliency detection method with some state-of-the-art methods; Fig. 9 shows the comparison. In Fig. 9, Origin is the original image, SR is the saliency detection result of Hou's method [30], ST is the result of [31], IM is the result of [32], RC is the saliency map extracted by [33], RARE is the result of [34], and SAL is the result of our proposed method. We can see that the salient pixels of the SR and ST saliency maps do not represent the position of the object, so we cannot obtain a proper candidate detection area from those maps.

[Fig. 9 is placed here.]

A candidate detection area is proper when it not only contains the ground truth of the object but is also as small as possible. In order to get proper candidate areas for most images in the PASCAL VOC 2008 dataset, we binarize the saliency maps with the adaptive threshold described above. Average values of precision, recall and F-measure (Eq. 12) are computed over the MSRA 1000 dataset:

$$F_\beta = \frac{(1 + \beta^2)\,\mathrm{Precision} \times \mathrm{Recall}}{\beta^2 \times \mathrm{Precision} + \mathrm{Recall}} \qquad (12)$$

We use $\beta^2 = 0.3$ in our work to weight precision more than recall. The comparison is shown in Fig. 10: our method (SAL) achieves the highest precision, recall and $F_\beta$ values.
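Eq. (12) in code, with the paper's $\beta^2 = 0.3$ so that precision outweighs recall:

```python
def f_measure(precision, recall, beta2=0.3):
    """Eq. (12): F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom else 0.0
```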

[Fig. 10 is placed here.]

In Fig. 11, we show the person detection process and the Precision/Recall curve of our method. Column a is the original image, column b is the saliency map, column c is the binary image, column d shows the candidate detection area and column e shows the final detection result; below the two rows of images is the Precision/Recall curve. Fig. 12 and Fig. 13 show the detection process and Precision/Recall curves for the car and horse classes respectively, with the same structure as Fig. 11.

[Fig. 11-13 are placed here.]

Our experimental environment is Microsoft Windows XP with a 3.00 GHz Intel Core i5-2320 CPU and 2.91 GB of DDR3 RAM. Table 2 shows the processing time and the average number of search windows per image for our method and for DPM.

[Table 2 is placed here.]

From Table 2, we can see that the processing time of our proposed method comprises three steps: saliency detection, candidate area extraction and object detection. The total time of the three steps is 2.83 s on average, faster than the 3.45 s average processing time of DPM in [12], which shows that our method speeds up object detection. The average number of search windows per image of our method is also lower than that of DPM, which indicates that our method decreases the search area for object detection.

An important criterion for evaluating object detection in the PASCAL VOC Challenge is Average Precision (AP). First, we need to judge when a detection is correct. Detections are considered true or false positives based on the area of overlap with ground truth bounding boxes: to be considered correct, the overlap ratio $a_0$ between the detection bounding box $B_p$ and the ground truth bounding box $B_{gt}$, given by Eq. (13), must exceed 50%. Then, for a category of objects, with $n$ the total number of objects in the test set, True Positives (tp) the number of correct detections and False Positives (fp) the number of false detections, Recall is tp/n and Precision is tp/(fp+tp). Lastly we draw a precision/recall curve with precision made monotonically decreasing and compute AP as the area under this curve by numerical integration.

$$a_0 = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})} \qquad (13)$$
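Eq. (13) and the AP computation can be sketched as below; `average_precision` makes the precision envelope monotonically decreasing and integrates numerically, as the text describes. This is an illustration, not the official PASCAL evaluation code.

```python
import numpy as np

def overlap(bp, bgt):
    """Eq. (13): a0 = area(Bp intersect Bgt) / area(Bp union Bgt);
    boxes are (x0, y0, x1, y1). A detection is correct when a0 > 0.5."""
    ix0, iy0 = max(bp[0], bgt[0]), max(bp[1], bgt[1])
    ix1, iy1 = min(bp[2], bgt[2]), min(bp[3], bgt[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(bp) + area(bgt) - inter
    return inter / union if union > 0 else 0.0

def average_precision(recall, precision):
    """AP: area under the PR curve after enforcing monotone precision."""
    p = np.maximum.accumulate(np.asarray(precision)[::-1])[::-1]
    return float(np.trapz(p, recall))
```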

In Table 3, we list the average precision of our method, of DPM and of some entries in the PASCAL VOC Challenge 2008. The comparison of the PR curves of our method and some other methods is shown in Fig. 14, and Fig. 15 shows some detection results of our method. From these results, we can see that the object detection method based on selective visual attention achieves state-of-the-art detection results while also speeding up detection. The main reasons are:


1) In [12], object detection uses a sliding window to search every position and scale in the image pyramid for the target object. In fact, the area of the object is much smaller than the whole image, as can be seen easily in Figs. 11-13. The candidate detection area obtained by saliency detection based on selective visual attention covers less than 60% of the image on average, and we only need to search this candidate area to detect the target object. In this way, we speed up object detection.

2) We not only speed up object detection but also maintain detection accuracy. As shown in the second row of Fig. 1, we can exclude erroneous detection bounding boxes that contain no salient pixels.

[Table 3 and Fig. 14-15 are placed here.]

Besides the PASCAL VOC 2008 dataset, we also tested our method on the INRIA person dataset and the Caltech 101 dataset; some detection results are shown in Fig. 16 and Fig. 17. In these figures, the first column is the original image, the second column is the saliency map extracted by our method, the third column is the binary image, the fourth column shows the candidate detection area and the last column shows the final detection results. From these results, we can see that our method decreases the search area and still detects objects correctly even when the background is complex. In Table 4, we list the mAP, aSW (average search windows) and aPT (average processing time) of our method on the different datasets; the bolded values are the results of our method while the other values are the results of DPM. The INRIA person dataset contains only person targets, so its mAP is higher than on the other datasets, but it is about the same as the AP of the person class on the PASCAL VOC dataset. We can see that our method performs well on both the


INRIA person dataset and the Caltech 101 dataset.

[Fig. 16-17 and Table 4 are placed here.]

However, there are still some problems with our method. In Fig. 18, DPM detects the human correctly while our method detects no target object. This is because the pixels of a large-scale object in an image (the woman in Fig. 18) are not all salient, so the candidate detection area obtained by our method cannot contain the whole object. In the fifth row of Fig. 16, our method fails to detect the little girl in the middle of the image: her saliency is low because she is similar to the background. The human visual system solves such problems easily, so our future work is to learn more from it in order to improve our object detection method.

[Fig. 18 is placed here.]

6 Conclusion

We have proposed a fast object detection method that simulates the selective visual attention of the human visual system. The method detects the salient area of an image at multiple scales and then extracts a candidate detection area, within which objects are detected with the deformable part model. Tested on the PASCAL VOC 2008, INRIA person and Caltech 101 datasets, our method reduces the search area dramatically, which speeds up object detection, while its accuracy achieves state-of-the-art results. However, when facing occlusion or large-scale objects, our


method may fail. Building an intelligent model based on the feature extraction and feature imagination mechanisms of the human visual system is an important way to improve our method, and is also our long-term goal.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61375079, 61005091).

References

[1] A. M. Treisman, G. Gelade. A feature-integration theory of attention. Cognitive Psychology, 1980, 12(1): 97-136.
[2] L. Itti, C. Koch. Computational modelling of visual attention. Nature Reviews Neuroscience, 2001, 2(3): 194-203.
[3] V. Navalpakkam, L. Itti. An integrated model of top-down and bottom-up attention for optimal object detection. Proc. IEEE Conf. Computer Vision and Pattern Recognition, New York: IEEE Press, 2006, 2: 2049-2056.
[4] A. Borji, D. N. Sihite, L. Itti. Salient object detection: a benchmark. Proc. European Conference on Computer Vision (ECCV), Florence: Springer Berlin Heidelberg, 2012: 414-429.
[5] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, T. Poggio. Robust object recognition with cortex-like mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(3): 411-426.
[6] Y. Zhai, M. Shah. Visual attention detection in video sequences using spatiotemporal cues. Proc. ACM Multimedia, 2006: 815-824.
[7] Q. Yan, L. Xu, J. Shi, J. Jia. Hierarchical saliency detection. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Portland, Oregon: IEEE Press, 2013.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, A. Zisserman. The PASCAL Visual Object Classes Challenge 2009 (VOC 2009) Results, http://www.pascal-network.org/challenges/VOC/voc2009/, 2009.
[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, A. Zisserman. The PASCAL Visual Object Classes Challenge 2010 (VOC 2010) Results, http://www.pascal-network.org/challenges/VOC/voc2010/, 2010.
[10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 (VOC 2011) Results, http://www.pascal-network.org/challenges/VOC/voc2011/, 2011.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, A. Zisserman. The PASCAL Visual Object Classes Challenge 2008 (VOC 2008) Results, http://www.pascal-network.org/challenges/VOC/voc2008/, 2008.
[12] P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(9): 1627-1645.
[13] P. Felzenszwalb, D. McAllester, D. Ramanan. A discriminatively trained, multiscale, deformable part model. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Anchorage: IEEE Press, 2008: 1-8.
[14] P. Felzenszwalb, R. Girshick, D. McAllester. Cascade object detection with deformable part models. Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Francisco: IEEE Press, 2010: 2241-2248.
[15] P. Felzenszwalb, D. Huttenlocher. Pictorial structures for object recognition. Int'l J. Computer Vision, 2005, 61(1): 55-79.
[16] P. Felzenszwalb, D. McAllester. The generalized A* architecture. J. Artificial Intelligence Research, 2007, 29: 153-190.
[17] R. Girshick, P. Felzenszwalb, D. McAllester. Object detection with grammar models. Proc. 25th Annual Conference on Neural Information Processing Systems, Granada: Curran Associates Inc., 2011: 145-154.
[18] N. Dalal, B. Triggs. Histograms of oriented gradients for human detection. Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Diego: IEEE Press, 2005, 1: 886-893.
[19] A. Yuille, P. Hallinan, D. Cohen. Feature extraction from faces using deformable templates. Int'l J. Computer Vision, 1992, 8(2): 99-111.
[20] D. Ramanan, C. Sminchisescu. Training deformable models for localization. Proc. IEEE Conf. Computer Vision and Pattern Recognition, New York: IEEE Press, 2006, 1: 206-213.
[21] P. Ott, M. Everingham. Shared parts for deformable part-based models. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Providence: IEEE Press, 2011: 1513-1520.
[22] M. Pedersoli, A. Vedaldi, J. González. A coarse-to-fine approach for fast deformable object detection. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Providence: IEEE Press, 2011: 1353-1360.
[23] H. Cho, P. E. Rybski, W. Zhang. Vision-based 3D bicycle tracking using deformable part model and interacting multiple model filter. Proc. IEEE International Conference on Robotics and Automation, Shanghai: IEEE Press, 2011: 4391-4398.
[24] M. Zhu, Z. Wang, Z. Chen. Human visual intelligence and particle filter based robust object tracking algorithm. Control and Decision, 2012, 27(11): 1720-1724.
[25] X. Xu, Z. Wang, Z. Chen. Visual tracking model based on feature-imagination and its application. Proc. 2010 2nd International Conference on Multimedia Information Networking and Security, Nanjing: IEEE Press, 2010: 370-374.
[26] B. Alexe, T. Deselaers, V. Ferrari. What is an object? Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Francisco: IEEE Press, 2010: 73-80.
[27] Z. Pei, Y. Zhang, T. Yang, X. Zhang, Y. H. Yang. A novel multi-object detection method in complex scene using synthetic aperture imaging. Pattern Recognition, 2012, 45(4): 1637-1658.
[28] T. Malisiewicz, A. Gupta, A. A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. Proc. IEEE International Conference on Computer Vision, Barcelona: IEEE Press, 2011: 89-96.
[29] C.-F. Juang, W.-K. Sun, G.-C. Chen. Object detection by color histogram-based fuzzy classifier with support vector learning. Neurocomputing, 2009, 72(10-12): 2464-2467.
[30] X. Hou, L. Zhang. Saliency detection: a spectral residual approach. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Minneapolis: IEEE Press, 2007: 1-8.
[31] D. Walther, C. Koch. Modeling attention to salient proto-objects. Neural Networks, 2006, 19(9): 1395-1407.
[32] R. Margolin, L. Zelnik-Manor, A. Tal. Saliency for image manipulation. The Visual Computer, 2013, 29(5): 381-392.
[33] T. N. Vikram, M. Tscherepanow, B. Wrede. A saliency map based on sampling an image into random rectangular regions of interest. Pattern Recognition, 2012, 45(9): 3114-3124.
[34] N. Riche, M. Mancas, M. Duvinage, M. Mibulumukini, B. Gosselin, T. Dutoit. RARE2012: a multi-scale rarity-based saliency detection with its comparative statistical analysis. Signal Processing: Image Communication, 2013, 28(6): 642-658.
[35] E. Rodner, J. Denzler. One-shot learning of object categories using dependent Gaussian processes. Proc. 32nd Annual Symposium of the German Association for Pattern Recognition, Darmstadt: Springer, 2010, vol. 6376: 232-241.
[36] H. Harzallah, F. Jurie, C. Schmid. Combining efficient object localization and image classification. Proc. IEEE International Conference on Computer Vision, Kyoto: IEEE Press, 2009: 237-244.
[37] C. H. Lampert, M. B. Blaschko, T. Hofmann. Beyond sliding windows: object localization by efficient subwindow search. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Anchorage: IEEE Press, 2008: 1-8.


Highlights

1. Simulate selective visual attention to propose an object detection method.
2. Analyze the use of the saliency map in object detection.
3. Extract a salient candidate area to reduce the search area of object detection.

Zonghai Chen received his BS and MS degrees from the University of Science and Technology of China (USTC) in 1988 and 1991 respectively. He has served on the faculty of USTC since 1991 and has been a professor there since 1998. He was assistant to the president of USTC from 2000 to 2003, in charge of the technology industry. He is an expert who enjoys the special government allowance of the State Council of the People's Republic of China and has more than 300 refereed publications. His research focuses on the modeling, simulation and control of complex systems, information acquisition and control, robotics and intelligent systems, and quantum system control and quantum state manipulation. He has received 14 provincial and ministerial progress prizes in science and technology.

Mingwei Guo received his BS degree from the University of Science and Technology of China (USTC) in 2009 and has been a Ph.D. student in the Department of Automation of USTC since 2009. His current research focuses on the detection and localization of rich object categories in natural images based on the feature extraction and feature imagination mechanisms of the human visual system.

Yuzhou Zhao received the BS degree in automation from the University of Science and Technology of China (USTC) in 2008. He is currently a Ph.D. student in the Laboratory of Simulation and Intelligent Control of USTC. His research interests include computer vision and intelligent systems.

Chenbin Zhang received the M.S. and Ph.D. degrees in Pattern Recognition and Intelligent Systems from the University of Science and Technology of China. He is currently an Associate Professor in the Department of Automation, University of Science and Technology of China. His current research interests include the modeling, analysis and control of complex systems, such as mobile robots, quantum control systems, energy management systems of electric vehicles, and cyber-physical systems.



Figure Captions

Fig. 1. The contrast of object detection and saliency detection.
Fig. 2. HOG feature pyramid.
Fig. 3. The process of saliency detection.
Fig. 4. Definition of region scale.
Fig. 5. Process of getting the candidate detection area: (a) is the original image, (b) is the saliency map, (c) is the binary image and the yellow bounding box in (d) is the candidate detection area.
Fig. 6. Binary image of the saliency map with different kinds of threshold: (a) is the original image, (b) is the saliency map, (c) is the binary image obtained with a fixed threshold and (d) is the binary image obtained with the adaptive threshold.
Fig. 7. Human model.
Fig. 8. Car model.
Fig. 9. The contrast of saliency detection: Origin is the original image, SR is the saliency detection result of Hou's method [30], ST is the result of [31], IM is the result of [32], RC is the saliency map extracted by [33], RARE is the result of [34], and SAL is the result of our proposed method.
Fig. 10. Precision-Recall bars for binarization of saliency maps extracted by different methods.
Fig. 11. Person detection process and Precision/Recall curve.
Fig. 12. Car detection process and Precision/Recall curve.
Fig. 13. Horse detection process and Precision/Recall curve.
Fig. 14. PR curves of different methods.
Fig. 15. Some detection results of our method.
Fig. 16. Some detection results of our method on the INRIA person dataset.
Fig. 17. Some detection results of our method on the Caltech 101 dataset.
Fig. 18. The existing problem: (a) is the original image, (b) is the detection result of DPM and (c) is the saliency map of (a).

Tables

Table 1. Algorithm for extraction of the salient candidate area

Input: I, D_i
Output: CDA

1: Initialize I by over-segmenting it, and extract the image layers L_s = {w_1, w_2, w_3}.
2: For l = 1 to 3: for each region R_i of layer w_l, compute its saliency cue
   $\bar{s}_i = C_i \cdot H_i$, where
   $C_i = \sum_{j=1}^{n} w(R_j)\,\phi(i,j)\,\|c_i - c_j\|^2$ and
   $H_i = \frac{1}{w(R_i)} \sum_{x_i \in R_i} \exp\{-\lambda \|x_i - x_c\|^2\}$,
   obtaining the layer cue maps {S_1, S_2, S_3}.
3: Fuse the layers into the saliency map S = {p_ij}, i = 0, ..., width-1; j = 0, ..., height-1, with
   $p_{ij} = \sum_{l=1}^{3} A_l\, p_{ij}^l / 3$.
4: Compute the adaptive threshold
   $T_S = \frac{2}{W \times H} \sum_{x=0}^{W-1} \sum_{y=0}^{H-1} s(x, y)$.
5: Binarize: B = {b_ij}, with b_ij = 1 if p_ij >= T_S and b_ij = 0 otherwise.
6: CDA is the window that contains all positions of pixels whose value b_ij is 1.

Table 2. The processing time of each step

Step                         Average processing time (s)    Average searching windows
Saliency detection           0.28
Candidate area extraction    0.11
Object detection             2.44                           56541
DPM [12]                     3.45                           79945

Table 3. Detection results on PASCAL VOC 2008

Class    UCT[13]  CD     Jena[35]  LPC[36]  MPI[37]  XRCE   DPM    SAL
Aero     0.326    0.252  0.048     0.365    0.259    0.264  0.336  0.377
Bike     0.420    0.146  0.014     0.343    0.080    0.105  0.371  0.407
Bird     0.113    0.098  0.003     0.107    0.101    0.014  0.066  0.099
Boat     0.110    0.105  0.002     0.114    0.056    0.045  0.099  0.073
Bottle   0.282    0.063  0.001     0.221    0.001    0.000  0.267  0.288
Bus      0.232    0.232  0.010     0.238    0.113    0.108  0.229  0.293
Car      0.320    0.176  0.013     0.366    0.106    0.040  0.319  0.416
Cat      0.179    0.090  0.000     0.166    0.213    0.076  0.143  0.244
Chair    0.146    0.096  0.001     0.111    0.003    0.020  0.149  0.147
Cow      0.111    0.100  0.047     0.177    0.045    0.018  0.124  0.151
Table    0.066    0.130  0.004     0.151    0.101    0.045  0.119  0.092
Dog      0.102    0.055  0.019     0.090    0.149    0.105  0.064  0.097
Horse    0.327    0.140  0.003     0.361    0.166    0.118  0.321  0.450
Mbik     0.386    0.241  0.031     0.403    0.200    0.136  0.353  0.367
Pers     0.420    0.112  0.020     0.197    0.025    0.090  0.407  0.420
Plant    0.126    0.030  0.003     0.115    0.002    0.015  0.107  0.098
Sheep    0.161    0.028  0.004     0.194    0.093    0.061  0.157  0.166
Sofa     0.136    0.030  0.022     0.173    0.123    0.018  0.136  0.155
Train    0.244    0.282  0.064     0.296    0.236    0.073  0.244  0.240
Tv       0.371    0.146  0.137     0.340    0.015    0.068  0.371  0.410
mAP      0.229    0.128  0.022     0.226    0.104    0.071  0.219  0.249

Table 4. Detection results of our method (first value) and DPM (second value) on different datasets

           VOC              INRIA            Caltech 101
mAP        0.249 / 0.219    0.413 / 0.409    0.215 / 0.212
aSW        56541 / 79945    71394 / 87654    43763 / 53259
aPT (s)    2.83 / 3.45      3.41 / 3.94      2.37 / 2.93
