Third International Conference on Advances in Control and Optimization of Dynamical Systems March 13-15, 2014. Kanpur, India
SURF-based Human Tracking Algorithm with On-line Update of Object Model

Meenakshi Gupta ∗, Nishant Kejriwal ∗∗, Laxmidhar Behera ∗, K.S. Venkatesh ∗

∗ Department of Electrical Engineering, Indian Institute of Technology Kanpur, Uttar Pradesh, India - 208016. e-mail: [email protected], [email protected], [email protected]
∗∗ Innovation Lab, Tata Consultancy Services (TCS), Noida. e-mail: [email protected]
Abstract: The ability to robustly track a human is an essential prerequisite for an increasing number of applications that need to interact with a human user. This paper presents a robust vision-based algorithm to track a human in a dynamic environment using an interest-point based method. The tracking algorithm is expected to cope with changes in pose, scale and illumination as well as camera motion. Interest-point based (e.g. SURF) tracking methods suffer from the unavailability of a sufficient number of matching key points for the target in all frames of a running video. One solution to this problem is to have an object model which contains SURF features for all possible poses and scaling factors. Such an object model with all possible descriptors could be created off-line and used for detecting the target in each frame. However, such a scheme cannot be used for tracking an object online. In order to overcome this problem, we propose a new approach that updates the object model online so that a sufficient number of matching key points is available for the target even when its pose and scale change. Experimental results are provided to show the efficacy of the algorithm.

1. INTRODUCTION

Visual tracking of objects is one of the several capabilities that human beings have. At present, introducing these capabilities into artificial visual systems is one of the most active research challenges in computer vision and mobile robotics. The field has witnessed an unprecedented advancement owing to the availability of high-quality cameras and inexpensive computing power, together with the development of ingenious techniques for image and video processing. In spite of the advances made in this field, visual tracking is still fraught with difficulties arising from abrupt object motion, appearance changes including pose, non-rigid object structure, occlusion and camera motion [Yilmaz et al., 2006] [Yang et al., 2011]. In this paper, we focus on interest-point based methods [Kloihofer and Kampel, 2010] [Ta et al., 2009] [He et al., 2009] which use local features such as SIFT [Lowe, 2004] or SURF [Bay et al., 2008] as the visual feature for object tracking, owing to their robustness to photometric and geometric distortions. We specifically look into the problem of tracking a non-rigid object (a human) from a camera placed on a mobile platform [Motai et al., 2012] [Gupta et al., 2011]. Most human-following robots make use of multiple sensors in order to track and follow a human, as in [Hu et al., 2013] [Bellotto and Hu, 2009] [Vadakkepat et al., 2008]. Vision-based human detection and tracking is one of the most important modules for human-following robots, as can be seen in [Nagumo and Ohya, 2001] [Yoshimi et al., 2006] [Hirai and Mizoguchi, 2003]. The most popular vision-based tracking algorithm is mean-shift. It is a local search algorithm based on colour histogram matching [Comaniciu et al., 2000] and is easy to implement. However, colour-based tracking methods [Zhang et al., 2011]
are sensitive to variations in illumination and require the background colour not to match that of the target [Gupta et al., 2011]. This has prompted researchers to use histograms of other distinctive features (such as SIFT or SURF) for mean-shift tracking [Ahmadi et al., 2012]. In [Garg and Kumar, 2013], Garg and Kumar proposed an object tracking algorithm that applies mean-shift directly to SURF features. They proposed a method called re-projection to overcome the unavailability of a sufficient number of key points for a given object. However, such an algorithm cannot be used to track a non-rigid object, as it does not account for changes in the object's pose due to non-rigid motion or out-of-plane rotations. Gupta et al. [Gupta et al., 2013] proposed a tracking algorithm that uses a dynamic object model description to detect the target. This dynamic object model derives its points from a template pool, which helps reinforce the features that occur more frequently than others. In this way, they resolve the stability-plasticity dilemma in object tracking [Gu et al., 2010] without having to learn the actual motion model of the object [Ta et al., 2009] [He et al., 2009] or create a bag-of-words through clustering [Bing et al., 2010]. Their dynamic object model description is able to track a non-rigid object under out-of-plane rotations, but it increases the overall computational cost of the algorithm due to frame-to-frame matching.

In this work, we have combined the SURF-based mean-shift algorithm and the dynamic object model description in such a way that the algorithm can track a non-rigid object in real time. The human to be tracked is selected in the first frame by manually drawing a polygon on the boundary of the human silhouette. The bounding rectangle of the polygon is used as the initial window for the mean-shift tracker. The target is located in the next frame by the mean-shift
tracker. The polygon containing the key points obtained through SURF correspondences defines the target region in the new frame. The limitation of such an approach is that the number of matching points obtained diminishes over time. To maintain a healthy number of matching points for defining the target polygon, two further sets of key points are added to redefine the polygon region. The first set of keypoints is added using a nearest-neighbour approach, and the second set is obtained by matching with the object model. The object model is continuously updated with the SURF descriptors of the points added through the nearest-neighbour approach. In this way, the object model accounts for changes in pose.

The proposed approach has several advantages. It uses the concept of a dynamic object model description [Gupta et al., 2013] but does not require frame-to-frame matching, and can therefore be implemented on a mobile robotic platform. Unlike the approaches in [Gu et al., 2010] [Haner and Gu, 2010] [Kloihofer and Kampel, 2010], which use models for both foreground and background, we only maintain a model for the foreground, which contains the object to be tracked. This obviates the need for learning a separate classifier, thereby reducing the computational requirements of the tracking algorithm. We use SURF as the only image feature for representing and tracking the object. This is in contrast with most works, which use SURF along with other features such as colour or blobs to increase the robustness of the tracking algorithm [Haner and Gu, 2010] [Zhang et al., 2011] [Lien et al., 2013].

The rest of this paper is organized as follows. The next section defines the problem through several notations and symbols. Section 3 provides preliminary information about the SURF-based mean-shift tracker and about the object description and object model used for defining the polygon. The tracking algorithm is discussed in Section 4. The experimental results are provided in Section 5, followed by the conclusion in Section 6.

2. PROBLEM DEFINITION

Consider a set of frames I_i, i = 0, 1, 2, ..., N of a video sequence in which the human identified by the user in the first frame is to be tracked over all the frames. The human is identified by a polygon P_0 drawn by selecting points on the boundary of the human silhouette. The bounding rectangle of this polygon is taken as the mean-shift window W_0. Let V(I) = {(x_1, v_1), (x_2, v_2), ..., (x_n, v_n)} be the set of SURF key points of an image I, where x_i is the 2-dimensional key point location of the 64-dimensional SURF descriptor v_i. The set of SURF key points lying within a polygon P_t of a frame I_t is represented by the symbol V_P(I_t). The initial object description O_d(0) is taken as V_P(I_0), and the initial object model O_m(0) contains the SURF descriptors of V_P(I_0), i.e. ψ(V_P(I_0)). The SURF correspondence found by the mean-shift tracker between the object description O_d(t-1) and the window W_t of image I_t is the set of best matching key points, defined as
\[
\zeta\!\left(I_{t-1}^{O_d(t-1)} \sim I_t^{W_t}\right) \triangleq \left\{(x_1, v_1)_{t-1}, \cdots, (x_m, v_m)_{t-1},\ (x_1, v_1)_t, \cdots, (x_m, v_m)_t\right\} \tag{1}
\]
where I_t is the current image and m is the number of matching points. The SURF correspondence between the resized tracker windows of two consecutive images (I_{t-1}, I_t) is denoted as ζ(I_{t-1}^{W_{t-1}^r} ∼ I_t^{W_t^r}).
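To make this notation concrete, a minimal Python/OpenCV sketch of how V(I), V_P(I) and the initial O_d(0) and O_m(0) could be obtained is given below. It is our own illustration rather than the authors' implementation: it assumes an OpenCV build that ships the contrib SURF module, and the helper names surf_keypoints and points_in_polygon are hypothetical.

```python
import cv2
import numpy as np

# Assumes an OpenCV build with the contrib modules, where the (patented) SURF
# detector is available; 64-dimensional descriptors are the default setting.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

def surf_keypoints(image):
    """V(I): list of (x_i, v_i) pairs -- 2-D key point location and 64-d descriptor."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image
    keypoints, descriptors = surf.detectAndCompute(gray, None)
    if descriptors is None:
        return []
    return [(np.array(kp.pt), desc) for kp, desc in zip(keypoints, descriptors)]

def points_in_polygon(pairs, polygon):
    """V_P(I): key points whose locations fall inside the polygon P."""
    contour = np.array(polygon, dtype=np.float32).reshape(-1, 1, 2)
    return [(x, v) for (x, v) in pairs
            if cv2.pointPolygonTest(contour, (float(x[0]), float(x[1])), False) >= 0]

# Initialisation on the first frame I_0 with the user-drawn polygon P_0:
# frame0 = cv2.imread("frame_0000.png")
# P0 = [(x1, y1), (x2, y2), ...]                      # boundary of the human silhouette
# Od = points_in_polygon(surf_keypoints(frame0), P0)  # object description O_d(0)
# Om = [v for _, v in Od]                             # object model O_m(0) = psi(V_P(I_0))
```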
The tracking polygon P_t on an image frame I_t is represented by P_t = (B_t, c_t), where c_t = (c_x, c_y) is the centre of the polygon and B_t is the set of key point locations enclosed by the polygon, defined for the first frame as
\[
B_0 = \{\, x_i \mid (x_i, v_i) \in V_0 \wedge x_i \in P_0 \,\} \tag{2}
\]
Given I_0, P_0(B_0, c_0) and V_0, the task is to compute P_t(B_t, c_t) for all frames t = 1, 2, ..., N.

3. PRELIMINARY INFORMATION

In this section, we provide preliminary information about the SURF-based mean-shift tracker, the object description and the object model, which are used in our proposed method.

3.1 SURF-based mean-shift tracker

Obtaining SURF correspondences between complete images (frame-to-frame matching) to localize the target is computationally expensive. This makes it necessary to carry out SURF matching only in a local region around the current location. A well-suited local search algorithm is mean-shift tracking, which is based on colour histogram matching [Comaniciu et al., 2003]. Mean-shift tracking requires a histogram of the object template; in this work, the histogram is formed by creating a fixed number of clusters of the SURF descriptors of the object model. The centre of the new target window computed by the mean-shift algorithm is given by
\[
z = \frac{\displaystyle\sum_{i=1}^{n} x_i\, w_i\, g\!\left(\left\|\frac{x - x_i}{h}\right\|^2\right)}{\displaystyle\sum_{i=1}^{n} w_i\, g\!\left(\left\|\frac{x - x_i}{h}\right\|^2\right)} \tag{3}
\]
where g(x) = −k′(x) is the derivative of the kernel profile and w_i is the weight associated with each key point location x_i of the source window that has a correspondence in the target window. The new centre location depends on the number of correspondences n between the source and the target window. The SURF correspondences between the windows are computed using a minimum-distance criterion, with RANSAC [Bing et al., 2010] used to remove outliers. SURF histograms have been used for object recognition [Ahmadi et al., 2012] and place recognition [Nicosevici and Garcia, 2012] using a bag-of-words approach. Histogram creation requires a sufficient number of SURF key points for the object template, which may not be available if we use only the SURF descriptors of the selected object. Therefore, in this work, an object model is used to create the histogram. This object model is updated in each iteration to accommodate changes in pose.

3.2 Object description and object model

The object description O_d consists of three different sets of SURF keypoints and descriptors. The first set is obtained by the mean-shift tracker and is represented as M_1(t). The mean-shift tracker gives the SURF correspondences between the object description O_d(t-1) and the descriptors of the window W_t of image I_t as ζ(I_{t-1}^{O_d(t-1)} ∼ I_t^{W_t}). The matched keypoints and descriptors of I_t^{W_t} (i.e. of V(I_t^{W_t})) are directly added to M_1(t). Mathematically,
\[
M_1(t) = \left\{ (x_i, v_i)_t \in V\!\left(I_t^{W_t}\right) \;\middle|\; (x_i, v_i)_t \in \zeta\!\left(I_{t-1}^{O_d(t-1)} \sim I_t^{W_t}\right),\ i = 1, 2, \cdots, m \right\} \tag{4}
\]
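Both ingredients used so far, the correspondence operator ζ(· ∼ ·) computed with the minimum-distance (ratio) criterion plus RANSAC outlier removal, and the centre update of equation (3), can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the ratio-test threshold of 0.7, the Gaussian kernel profile (so that g(u) = exp(−u/2)), the uniform weights in the usage note and the function names are our assumptions.

```python
import cv2
import numpy as np

def surf_correspondences(pairs_src, pairs_dst, ratio=0.7):
    """zeta(src ~ dst): best-matching (x, v) pairs between two SURF key point sets,
    using Lowe's ratio test (minimum-distance criterion) and RANSAC outlier removal."""
    if len(pairs_src) < 4 or len(pairs_dst) < 4:
        return []
    des_src = np.float32([v for _, v in pairs_src])
    des_dst = np.float32([v for _, v in pairs_dst])
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = [pair[0] for pair in matcher.knnMatch(des_src, des_dst, k=2)
               if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]
    if len(matches) < 4:
        return []
    src = np.float32([pairs_src[m.queryIdx][0] for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([pairs_dst[m.trainIdx][0] for m in matches]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    keep = mask.ravel().astype(bool) if mask is not None else np.ones(len(matches), bool)
    return [(pairs_src[m.queryIdx], pairs_dst[m.trainIdx])
            for m, ok in zip(matches, keep) if ok]

def mean_shift_centre(x, locations, weights, h):
    """One centre update of equation (3); assumes a Gaussian kernel profile,
    i.e. g(u) = exp(-u / 2), with per-point weights w_i supplied by the caller."""
    locs = np.asarray(locations, dtype=float)
    w = np.asarray(weights, dtype=float)
    u = np.sum(((np.asarray(x, dtype=float) - locs) / h) ** 2, axis=1)
    g = np.exp(-u / 2.0)
    den = np.sum(w * g)
    return np.sum(locs * (w * g)[:, None], axis=0) / den if den > 0 else np.asarray(x, float)

# Usage: corr = surf_correspondences(Od_prev, window_pairs)
#        M1   = [dst for _, dst in corr]                                  # cf. equation (4)
#        z    = mean_shift_centre(centre, [x for x, _ in M1], np.ones(len(M1)), h=30.0)
```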
The matched keypoints of O_d(t-1) are first projected into the image I_t by replacing x_i^{t-1} with x_i^t, and then the keypoints and descriptors of I_{t-1}^{O_d(t-1)} (i.e. V(I_{t-1}^{O_d(t-1)}) ∈ ζ(I_{t-1}^{O_d(t-1)} ∼ I_t^{W_t})) are added to M_1(t).

The second set is obtained by applying a nearest-neighbour approach on the set M_1(t) and is represented as M_2(t). To obtain the region where the nearest-neighbour approach has to be applied, the mean-shift tracker windows of images I_{t-1} and I_t are resized (enlarging each window to 1.20 times its original size) and the SURF correspondence between the two resized windows is obtained as ζ(I_{t-1}^{W_{t-1}^R} ∼ I_t^{W_t^R}). The keypoints lying in the region W_{t-1}^R − W_{t-1} are obtained and a polygon P_R is formed from their corresponding keypoints on image I_t. All the keypoints and descriptors of I_t^{W_t^R} (i.e. of V(I_t^{W_t^R})) which lie inside the polygon P_R and have a distance less than a user-specified threshold from some point of the set M_1(t) are added to the set M_2(t).

The third set is obtained by finding the correspondence between the SURF descriptors of the object model and those of V(I_t^{W_t}), i.e. ζ(O_m(t-1) ∼ I_t^{W_t}). Only those keypoints and descriptors of ζ(O_m(t-1) ∼ I_t^{W_t}) are retained in this set which have a distance less than a user-specified threshold from some point of the set M_1(t) ∪ M_2(t); this set is represented by M_3(t). Hence, the object description is
\[
O_d(t) = M_1(t) \cup M_2(t) \cup M_3(t) \tag{5}
\]
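The nearest-neighbour gating used to build M_2(t) above could be sketched as follows; the function name and the 20-pixel distance threshold are our assumptions, the latter standing in for the user-specified threshold mentioned in the text.

```python
import cv2
import numpy as np

def nearest_neighbour_points(candidates, m1_pairs, region_polygon, dist_thresh=20.0):
    """M_2(t): keep candidate key points from the enlarged window that lie inside the
    region polygon P_R and are within dist_thresh pixels of some point already in
    M_1(t). The 20-pixel threshold is an assumed stand-in for the user-specified one."""
    if not m1_pairs:
        return []
    contour = np.array(region_polygon, dtype=np.float32).reshape(-1, 1, 2)
    m1_locs = np.array([x for x, _ in m1_pairs], dtype=float)
    m2 = []
    for x, v in candidates:
        inside = cv2.pointPolygonTest(contour, (float(x[0]), float(x[1])), False) >= 0
        near = np.min(np.linalg.norm(m1_locs - np.asarray(x, float), axis=1)) < dist_thresh
        if inside and near:
            m2.append((x, v))
    return m2
```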
The polygon made by enclosing the points of O_d(t) is the desired polygon and is denoted as P_t. To obtain a reliable object description, we use RANSAC based on homography to remove outliers from the sets M_1(t) and M_2(t). For the set M_3(t), we avoid outliers by putting a threshold on the similarity between the descriptors of the object model and the matching points obtained in the current image I_t within the window W_t. It should be noted that the object description O_d(t) is able to accommodate, through M_2(t), the variations that may arise due to changes in pose or motion. The set M_1(t) provides high-frequency temporal information about the object. On the other hand, the set M_3(t), obtained from the object model of descriptors, provides low-frequency temporal information about the model variation over all previous frames I_{t-1} to I_0.

Updating the object model. The object model consists of a set of SURF descriptors which is updated in each iteration using the set M_2(t). Mathematically,
\[
O_m(t) = O_m(t-1) \cup \psi(M_2(t)) \tag{6}
\]
The size of the object model plays an important role, as there is a trade-off between accuracy and computational cost: increasing the size of the object model increases the accuracy but also increases the computational cost. In this work, we have taken the size of the object model as 500 descriptors, which describes the object sufficiently well. Each descriptor is assigned an initial weight w_0 whenever it is added to the object model. The initial weight is selected so as to ensure that the descriptor survives at least for a few frames. The weight is reset to its initial value whenever a match is found with this descriptor; otherwise it is decremented by 1. The descriptors with weights less than one are discarded whenever the size of the object model reaches its maximum value. Mathematically,
\[
w_i(t+1) = \begin{cases} w_0 & \text{if a match is found} \\ w_i(t) - 1 & \text{otherwise} \end{cases} \tag{7}
\]
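A minimal sketch of this update (equations (6)–(7)) is given below; the model is held as a list of [descriptor, weight] entries, and the value w0 = 5 is an assumed initial weight (the paper only requires that a descriptor survive for a few frames), while max_size = 500 matches the cap stated above. The function name and data layout are ours.

```python
def update_object_model(model, m2_pairs, matched_idx, w0=5, max_size=500):
    """Equations (6)-(7): 'model' is a list of [descriptor, weight] entries.
    matched_idx holds indices of model descriptors matched in the current frame.
    w0 = 5 is an assumed initial weight; max_size = 500 is the cap used in the paper."""
    matched = set(matched_idx)
    for i, entry in enumerate(model):
        entry[1] = w0 if i in matched else entry[1] - 1   # equation (7)
    model.extend([[v, w0] for _, v in m2_pairs])          # O_m(t) = O_m(t-1) U psi(M_2(t))
    if len(model) > max_size:                             # prune aged-out descriptors
        model[:] = [entry for entry in model if entry[1] >= 1]
    return model
```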
Fig. 1. Block diagram of the proposed tracking algorithm (select the human in the first frame and initialize O_m, O_d → make histogram from O_m and initialize the mean-shift tracker → capture new frame → mean-shift tracker → M_1(t) → nearest-neighbour approach around M_1(t) → M_2(t) → matching with O_m → M_3(t) → O_d = M_1(t) + M_2(t) + M_3(t) → update O_m and histogram → make polygon P_t from O_d).
A part of this object model is included in the object description as the set M_3(t), as described above. This set, in turn, contributes towards defining the polygon P_t of the target in the current image I_t.

4. PROPOSED TRACKING ALGORITHM
The block diagram of the proposed algorithm is shown in Fig. 1. In the proposed tracking algorithm, the human to be tracked is selected in the first frame by manually drawing a polygon P_0 on the boundary of the human silhouette. The bounding rectangle of the polygon is used as the initial window W_0 for the mean-shift tracker. The object description is initialized as V_P(I_0) and the object model is initialized with ψ(V_P(I_0)). The histogram for the mean-shift tracker is built from the SURF descriptors of the object model O_m, and it is updated in each iteration as the object model is updated. The mean-shift tracker gives the best-matched SURF correspondences between the object description O_d and the SURF descriptors of the window W_t of image I_t; these constitute the set M_1(t). The object description is then augmented with the set M_2(t), obtained using the nearest-neighbour approach. SURF matching of the window W_t with the object model yields the set M_3(t). The new object description is computed as per equation (5), and the object model is updated using equation (6). The convex hull of the points of the object description gives the desired polygon P_t.

5. EXPERIMENTAL RESULTS

In order to test our tracking method, a video was recorded from a camera mounted on a mobile robot platform following the human. The video with our tracker can be seen online [Gupta, 2013]. Snapshots of some frames of the video are shown in Figure 2. The images contain the three sets of points which constitute the object description. The blue points are obtained by the mean-shift tracker and correspond to the set M_1(t); the polygon made by the set M_1(t) is also shown in blue. The white polygon denotes the region where the nearest-neighbour approach is applied to find the set M_2(t), which is shown in green. The points obtained from the object model are shown in red and correspond to the set M_3(t). The desired polygon P_t of the algorithm is shown in red. The mean-shift tracker window is drawn in pink. From Figure 2 it is clear that when the human changes pose, the polygon shifts away from the human body, but the object model brings it back onto the human body.
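Two steps of the algorithm above admit compact sketches: building the mean-shift histogram by clustering the object-model descriptors into a fixed number of bins, and taking the convex hull of the object-description key points to obtain the polygon P_t. The cluster count k = 16 is our assumption (the paper only states that a fixed number of clusters is used), and the function names are hypothetical.

```python
import cv2
import numpy as np

def build_histogram(object_model, k=16):
    """Bag-of-words histogram for the mean-shift tracker: k-means clustering of the
    object-model SURF descriptors, then a normalised bin-count histogram.
    k = 16 is an assumed vocabulary size."""
    data = np.float32([entry[0] for entry in object_model])   # 64-d descriptors
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centres = cv2.kmeans(data, k, None, criteria, 5,
                                    cv2.KMEANS_RANDOM_CENTERS)
    hist = np.bincount(labels.ravel(), minlength=k).astype(np.float64)
    return hist / hist.sum(), centres

def target_polygon(object_description):
    """P_t: convex hull of the key point locations in O_d(t)."""
    pts = np.float32([x for x, _ in object_description]).reshape(-1, 1, 2)
    return cv2.convexHull(pts)
```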
Fig. 2. Working of the human tracking algorithm with a dynamic object model (frame numbers 1, 25, 50, 78, 106, 142, 152, 196, 201, 208, 238, 254, 260, 292, 311, 336, 373, 408, 441, 461, 500, 531, 570, 661, 707, 708, 709, 710, 766, 803, 836, 875, 917, 967, 987).
It should be noted that the polygon P_t contains a major portion of the human body, so the majority of the points inside the polygon, which belong to the human body, will be at almost the same depth from the robot. The motion command to the robot can therefore be given on the basis of the average depth of these majority pixels. Fig. 3 shows the number of matching points obtained from the object description O_d(t) on a test run. It also shows the matching-point contributions obtained from the three sources mentioned in Section 3.2.
6. CONCLUSION

This paper proposes an interest-point based tracking algorithm for non-rigid objects, such as humans, in a non-stationary video. The tracking algorithm makes use of a dynamic object model which evolves over time to accommodate the changes that might arise due to changes in pose over subsequent frames. This dynamic object model aims to preserve a good set of key points, which is essential for tracking the target. Thus, the proposed approach aims to overcome the limitation of interest-point based tracking algorithms arising from the depletion of matching key points over subsequent frames.
Fig. 3. Number of matching points versus frame number for the object description O_d(t) = M_1(t) + M_2(t) + M_3(t) and its three contributions.

REFERENCES

A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Computing Surveys (CSUR), 38(4), December 2006. ISSN 0360-0300.
A. Ahmadi, M. R. Daliri, A. Nodehi, and A. Qorbani. Objects recognition using the histogram based on descriptors of SIFT and SURF. Journal of Basic and Applied Scientific Research, 2(9):8612–8616, 2012.
H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110:346–359, December 2008.
N. Bellotto and H. Hu. Multisensor-based human detection and tracking for mobile service robots. IEEE Transactions on Systems, Man and Cybernetics, Part B, 39(1):167–181, February 2009.
Zhigang Bing, Yongxia Wang, Jinsheng Hou, Hailong Lu, and Hongda Chen. Research of tracking robot based on SURF features. In International Conference on Natural Computation (ICNC), pages 3523–3527, Yantai, Shandong, 2010. IEEE.
D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean-shift. In Proc. of Int. Conf. on Computer Vision and Pattern Recognition, volume 2, pages 142–149, Hilton Head Island, SC, 2000. IEEE.
Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(4):1–14, April 2003.
Sourav Garg and Swagat Kumar. Mean-shift based object tracking algorithm using SURF features. In Recent Advances in Circuits, Communications and Signal Processing, pages 187–194. WSEAS, 2013.
Steve Gu, Ying Zheng, and Carlo Tomasi. Efficient visual object tracking with online nearest neighbor classifier. In 10th Asian Conference on Computer Vision (ACCV), pages 271–282, Queenstown, New Zealand, 2010. Springer Berlin Heidelberg.
Meenakshi Gupta. Experimental video on human tracking. http://www.youtube.com/watch?v=CW27gJvt4rw&feature=youtu.be, 2013.
Meenakshi Gupta, Laxmidhar Behera, and V. K. Subramanian. A novel approach of human motion tracking with mobile robotic platform. In Int. Conf. on Computer Modeling and Simulation, pages 218–223, Cambridge, UK, 2011. IEEE.
Meenakshi Gupta, Sourav Garg, Swagat Kumar, and Laxmidhar Behera. An online visual human tracking algorithm using SURF-based dynamic object model. In International Conference on Image Processing (ICIP), Melbourne, Australia, 2013. IEEE.
S. Haner and I. Y. Gu. Combining foreground/background feature points and anisotropic mean shift for enhanced visual object tracking. In International Conference on Pattern Recognition (ICPR), pages 3488–3491, Istanbul, 2010. IEEE.
Wei He, T. Yamashita, Lu Hongtao, and Shihong Lao. SURF tracking. In International Conference on Computer Vision, pages 1586–1592, Kyoto, 2009. IEEE.
N. Hirai and H. Mizoguchi. Visual tracking of human back and shoulder for person following robot. In Proc. IEEE/ASME Intl. Conf. on Advanced Intelligent Mechatronics (AIM), volume 1, pages 527–532, July 2003.
J. Hu, J. Wang, and M. Ho. Design of sensing system and anticipative behavior for human following of mobile robots. IEEE Transactions on Industrial Electronics, 2013.
Werner Kloihofer and Martin Kampel. Interest point based tracking. In International Conference on Pattern Recognition, pages 3549–3552. ACM, 2010.
Cheng-Chang Lien, Shin-Ji Lin, Cheng-Yang Ma, and Yu-Wei Lin. SURF-badge-based target tracking. In World Academy of Science, Engineering and Technology, volume 77, pages 877–883, 2013.
D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, January 2004.
Yuichi Motai, Sumit Kumar Jha, and Daniel Kruse. Human tracking from a mobile agent: Optical flow and Kalman filter arbitration. Signal Processing: Image Communication, 27(1):83–95, January 2012.
Y. Nagumo and A. Ohya. Human following behaviour of an autonomous mobile robot using light-emitting device. In Proc. 10th IEEE Intl. Workshop on Robot and Human Interactive Communication, pages 225–230, Bordeaux-Paris, 2001.
Tudor Nicosevici and Rafael Garcia. Automatic visual bag-of-words for online robot navigation and mapping. IEEE Transactions on Robotics, 28(4):886–898, August 2012.
Duy-Nguyen Ta, Wei-Chao Chen, Natasha Gelfand, and Kari Pulli. SURFTrac: Efficient tracking and continuous object recognition using local feature descriptors. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2937–2944, Miami, FL, 2009. IEEE.
P. Vadakkepat, P. Lim, L. C. De Silva, Liu Jing, and Li Li Ling. Multimodal approach to human-face detection and tracking. IEEE Transactions on Industrial Electronics, 55(3):1385–1393, March 2008.
H. Yang, L. Shao, F. Zheng, L. Wang, and Z. Song. Recent advances and trends in visual tracking: A review. Neurocomputing, 74:3823–3831, 2011.
T. Yoshimi, M. Nishiyama, T. Sonoura, H. Nakamoto, S. Tokura, H. Sato, F. Ozaki, N. Matsuhira, and H. Mizoguchi. Development of a person following robot using vision based target detection. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), pages 5286–5291, Beijing, 2006.
Jian Zhang, Jun Fang, and Jin Lu. Mean-shift algorithm integrating with SURF for tracking. In International Conference on Natural Computation (ICNC), pages 960–963, Shanghai, 2011. IEEE.