Expert Systems with Applications 38 (2011) 13682–13687
A hybrid tracking method for scaled and oriented objects in crowded scenes

M. Fatih Talu a,*, Ibrahim Türkoğlu b, Mehmet Cebeci c

a Inonu University, Computer Engineering Department, 44280 Malatya, Turkey
b Firat University, Department of Electronics and Computer Science, Elazig, Turkey
c Firat University, Department of Electrical–Electronic Engineering, Elazig, Turkey
Keywords: Multi object tracking; Background subtraction; Mean shift; Level set methods; Occlusion handling
Abstract

Traditional kernel-based mean shift assumes constancy of the object scale and orientation during the course of tracking and uses a symmetric or asymmetric kernel, such as a circle or an ellipse, for target representation. In a tracking scenario, it is not uncommon to observe objects with complex shapes whose scale and orientation constantly change due to camera and object motion. In this paper, we propose a multi-object tracking method which tracks the complete object regions, adapts to changing scale and orientation, and assigns consistent labels to each object throughout real-world video sequences. Our approach has five major components: (1) dynamic background subtraction, (2) level sets, (3) mean shift convergence, (4) object identification, and (5) occlusion handling. The experimental results show that the proposed method is superior to traditional mean shift tracking in the following aspects: (1) it provides consistent tracking of multiple objects, instead of a single object, throughout the video; (2) it is not affected by the scale and orientation changes of the tracked objects; (3) its computational complexity is much lower than that of traditional mean shift due to using the level set method instead of a probability density kernel.
1. Introduction

Object tracking is an important task within the field of computer vision, with many application areas such as human–computer interaction, surveillance, smart rooms, and medical imaging. In its simplest form, tracking can be defined as the problem of estimating the trajectory of an object in the image plane as it moves around a scene (Yilmaz, Javed, & Shah, 2006). The problem can be complex due to the loss of information caused by the projection of the 3D world onto a 2D image, noise in images, complex object motion, the non-rigid or articulated nature of objects, partial and full object occlusions, complex object shapes, scene illumination changes, and real-time processing requirements (Yilmaz et al., 2006). Numerous approaches have been proposed and implemented to overcome these difficulties. They can be categorized into four groups (Yang, Duraiswami, Elgammal, & Davis, 2004): (1) region-based, (2) feature-based, (3) model-based, and (4) appearance-based. Many of these approaches require a statistical description of the region or the pixels to perform the tracking. The tracked object can be described using either parametric or nonparametric representations. In a parametric framework, the tracked objects are typically fitted by Gaussian or mixture-of-Gaussians models (Weng, Kuo, & Tu, 2006).
A nonlinear estimation problem has to be solved to obtain the number of Gaussians and their parameters. However, the common parametric forms rarely fit the multimodal complex densities encountered in practice, and they are problematic when the fitted distributions are multidimensional. In contrast, nonparametric density estimation techniques allow complex densities to be represented using the data alone. They have been successfully applied to object tracking (Comaniciu, Ramesh, & Meer, 2003; Zhao, Nevatia, & Wu, 2008). Among these techniques, color-based tracking with kernel density estimation (Yilmaz et al., 2006) has recently gained popularity, primarily due to its broad range of convergence and its robustness to unmodelled spatial deformations. The algorithm, called "mean shift", is a nonparametric density estimation technique for analyzing complex multimodal feature spaces and estimating the stationary points (modes) of the underlying probability density function without explicitly estimating it. Originally introduced by Fukunaga and Hostetler (1975), it was successfully applied to computer vision applications (Comaniciu & Meer, 2002; Subasica, Loncarica, & Birchbauerb, 2009), which used an elliptical shape representation and a color histogram computed from the elliptical region for modeling the appearance. In spite of the robustness of the mean shift algorithm in finding the target location, it has three important difficulties (Dewan & Hager, 2006): (1) the kernel-based densities are expensive to evaluate, (2) the classically used similarity measures between the distributions in the model and target images are unwieldy, and (3) it is affected by the scale and orientation changes of the tracked objects.
Many approaches have been proposed to eliminate these difficulties. For instance, Comaniciu et al. (2003), Yang, Duraiswami, and Davis (2005), Yang et al. (2004), and Babu and Makur (2007) address the question of which similarity function between the distributions is most efficient for robust tracking. In addition, Yilmaz (2007) and Collins (2003) tried to find the best scale and orientation values within the mean shift iterations. Yilmaz (2007) presented an asymmetric kernel mean shift algorithm in which the scale and orientation of the kernel are changed adaptively depending on the observations at each iteration. Collins (2003) addressed the problem of choosing the correct scale at which to track a blob using the mean shift algorithm.

In this paper, we present a multi-object tracking method which tracks the complete object regions, adapts to changing scale and orientation, and assigns consistent labels to each object throughout real-world video sequences. In the proposed framework, the complex object shapes, which may not have an analytical form, are implicitly represented by a modified level set formalism. Our approach has five major components: (1) dynamic background subtraction, (2) level sets, (3) mean shift convergence, (4) object identification, and (5) occlusion handling.

The paper is organized as follows. In the next section, we introduce the proposed multi-object tracking method, its motivation, and its formulation. Experimental results using the proposed method are given in Section 3. Finally, we discuss future directions and conclude in Section 4.
2. The proposed framework

Our approach to tracking multiple objects emphasizes the use of appearance and shape models. Object appearance information is evaluated in a dynamic background model, and object shape information is evaluated in a level set model. Finally, we use a mean shift convergence algorithm to provide data association between objects in consecutive frames. An overview diagram is given in Fig. 1. In this section, we address these components in detail.

2.1. Dynamic background subtraction

Background subtraction techniques can be classified into two broad categories (Cheung & Kamath, 2004): non-recursive and recursive. Non-recursive background modeling techniques use a sliding-window approach for background estimation: a fixed number of frames is stored in a buffer. Recursive background modeling techniques do not keep a buffer of history frames; instead, they recursively update a single background model based on each incoming frame. In this article, we employ a recursive technique, since this approach generally requires less storage than non-recursive techniques (Smith, 2006). The literature on recursive techniques is broad, with propositions involving the approximated median filter, the Kalman filter, and mixtures of Gaussians. Unlike Kalman filters, which track the evolution of only a single Gaussian distribution model, the mixture-of-Gaussians method tracks multiple Gaussian distributions simultaneously in the color model of each pixel. In this article, we use a modified version of the usual pixel-level background subtraction method (Zivkovic & Heijden, 2005). This method consists of a likelihood classifier for each pixel given its background model. Specifically, it checks the current color of the pixel $x_t$ in RGB coordinates at time $t$ by the test

$$p(x_t \mid BG) > \epsilon, \qquad (1)$$

where $\epsilon$ (a preset threshold) is a small constant. In order to incorporate some level of robustness to lighting changes, the background models are typically designed as mixture models with $B$ components for each pixel in the image:

$$p(x \mid X_t, BG) = \sum_{m=1}^{B} p_m \, g(x; \mu_m, \sigma_m^2), \qquad (2)$$

$$g(x; \mu, \sigma^2) = g(x_r, x_g, x_b; \mu_r, \mu_g, \mu_b, \sigma_r, \sigma_g, \sigma_b) \propto \exp\!\left( -\frac{(x_r-\mu_r)^2}{2\sigma_r^2} - \frac{(x_g-\mu_g)^2}{2\sigma_g^2} - \frac{(x_b-\mu_b)^2}{2\sigma_b^2} \right), \qquad (3)$$
where $\mu_m$ and $\sigma_m^2$ are, respectively, the means and variances of the Gaussian mixture model (GMM). The GMM parameters are updated continuously from the video stream, using only the non-moving regions. The mixing weights, denoted by $p_m$, are non-negative. The GMM algorithm can automatically select the number of components needed per pixel and update the $p$, $\mu$, and $\sigma$ values using a suitable recursive procedure. In this way, undesired background information can be subtracted from the image, leaving the foreground objects, for relatively static scenes. A more sophisticated background model (for example, the nonparametric model of Elgammal, Duraiswami, Harwood, & Davis, 2002) could be used to account for more variation, but this is not the focus of this work; we assume that comparison with a background model yields adequate foreground blobs.
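To make the pixel-wise test of Eqs. (1)–(3) concrete, the following is a minimal sketch rather than the authors' implementation: it keeps a single Gaussian per pixel (i.e. B = 1) with one variance per channel, classifies a pixel as foreground when its deviation from the model is large, and updates the model only on background pixels. The class name, learning rate, and threshold are our own choices.

```python
# Minimal per-pixel background model in the spirit of Eqs. (1)-(3).
# Assumption: a single Gaussian per pixel (B = 1); the cited method uses an
# adaptive mixture with B components per pixel (Zivkovic & Heijden, 2005).
import numpy as np

class SimpleGaussianBackground:
    def __init__(self, first_frame, alpha=0.01, k=2.5):
        f = first_frame.astype(np.float32)
        self.mean = f.copy()                    # per-pixel, per-channel mean mu
        self.var = np.full_like(f, 15.0 ** 2)   # per-pixel, per-channel variance sigma^2
        self.alpha = alpha                      # learning rate of the recursive update
        self.k = k                              # deviation threshold (plays the role of eps in Eq. (1))

    def apply(self, frame):
        f = frame.astype(np.float32)
        d2 = (f - self.mean) ** 2
        # Background test of Eq. (1): the likelihood is high enough when the
        # squared deviation stays within k^2 * sigma^2 on every channel.
        bg = (d2 < (self.k ** 2) * self.var).all(axis=2)
        # Recursive update of mu and sigma^2, restricted to non-moving (background) pixels.
        a = self.alpha * bg[..., None]
        self.mean += a * (f - self.mean)
        self.var += a * (d2 - self.var)
        return (~bg).astype(np.uint8) * 255     # binary foreground mask
```

An off-the-shelf adaptive-mixture alternative is available as `cv2.createBackgroundSubtractorMOG2()` in OpenCV, which implements a closely related Zivkovic background model.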
Fig. 1. Overview diagram of our approach.
2.2. Level sets

The level set method is a very simple, yet powerful numerical technique for tracking interfaces and shapes (Osher & Fedkiw, 2002; Sethian, 1999). In two dimensions, the level set method amounts to representing a closed curve $C$ (such as the shape in our example) in the plane as the zero level set of a two-dimensional auxiliary function $u$:

$$C = \{(x, y) \mid u(x, y) = 0\} \qquad (4)$$

and then manipulating $C$ implicitly, through the function $u$. This function is called a level set function. $u$ is assumed to take positive values inside the region delimited by the curve $C$ and negative values outside. The level set function, however, does not provide the compact support that is required for it to be used as a density estimator in the mean shift framework. Hence, we introduce an indicator function $I(x)$ to bound the level set by its zero level set, such that everything outside the object boundary is set to 0. In this formalism, the level set kernel is computed by

$$\vartheta(x) = C\, u(x) I(x), \qquad (5)$$

where $C = 1 / \sum_x u(x) I(x)$ is the normalization term. Using the level set kernel, the density estimator in the mean shift framework becomes

$$f(x) = \frac{1}{n} \sum_{i=1}^{n} \vartheta(x_i). \qquad (6)$$
The level set kernel generated from Eq. (6) for the object blobs shown in Fig. 2(a) is given in Fig. 2(b).
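As an illustration of Eqs. (5) and (6), the sketch below builds a level set kernel from a binary foreground blob. Constructing $u(x)$ as a signed distance transform of the blob mask is our assumption (one common choice), not a detail stated in the paper; the function and variable names are likewise ours.

```python
# Sketch of the level-set kernel of Eqs. (5)-(6), assuming the level set
# function u is a signed distance transform of a binary blob mask
# (positive inside the blob, negative outside).
import numpy as np
from scipy.ndimage import distance_transform_edt

def level_set_kernel(blob_mask):
    """blob_mask: 2-D boolean array, True inside the detected object blob."""
    inside = distance_transform_edt(blob_mask)     # distance to the boundary, measured inside
    outside = distance_transform_edt(~blob_mask)   # distance to the boundary, measured outside
    u = inside - outside                           # level set function u(x)
    indicator = blob_mask.astype(np.float64)       # I(x): 1 inside, 0 outside
    weights = u * indicator                        # u(x) I(x), compact support (Eq. (5))
    C = 1.0 / weights.sum()                        # normalization term of Eq. (5)
    return C * weights                             # the kernel theta(x)

# Example: the kernel peaks in the blob's interior and vanishes at/outside its boundary.
mask = np.zeros((60, 80), dtype=bool)
mask[20:45, 25:60] = True
theta = level_set_kernel(mask)
print(theta.sum())   # ~1.0 by construction
```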
2.3. Mean shift convergence

The mean shift algorithm is a nonparametric clustering technique which does not require prior knowledge of the number of clusters and does not constrain the shape of the clusters (Comaniciu & Meer, 2002; Comaniciu et al., 2003). Given $n$ data points $x_i$, $i = 1, \ldots, n$, on a $d$-dimensional space $\mathbb{R}^d$, the multivariate kernel density estimate obtained with kernel $K(x)$ and window radius $h$ is

$$f(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right). \qquad (7)$$
For radially symmetric kernels, it suffices to define the profile of the kernel $k(x)$ satisfying

$$K(x) = c_{k,d}\, k\!\left( \|x\|^2 \right), \qquad (8)$$
where $c_{k,d}$ is a normalization constant which assures $K(x)$ integrates to 1. The modes of the density function are located at the zeros of the gradient function, $\nabla f(x) = 0$. The gradient of the density estimator is

$$\nabla f(x) = \frac{2 c_{k,d}}{n h^{d+2}} \sum_{i=1}^{n} (x_i - x)\, g\!\left( \left\| \frac{x - x_i}{h} \right\|^2 \right) = \frac{2 c_{k,d}}{n h^{d+2}} \left[ \sum_{i=1}^{n} g\!\left( \left\| \frac{x - x_i}{h} \right\|^2 \right) \right] \left[ \frac{\sum_{i=1}^{n} x_i\, g\!\left( \left\| \frac{x - x_i}{h} \right\|^2 \right)}{\sum_{i=1}^{n} g\!\left( \left\| \frac{x - x_i}{h} \right\|^2 \right)} - x \right], \qquad (9)$$
where $g(s) = -k'(s)$. The first term is proportional to the density estimate at $x$ computed with the kernel $G(x) = c_{g,d}\, g\!\left( \|x\|^2 \right)$, and the second term,

$$m_h(x) = \frac{\sum_{i=1}^{n} x_i\, g\!\left( \left\| \frac{x - x_i}{h} \right\|^2 \right)}{\sum_{i=1}^{n} g\!\left( \left\| \frac{x - x_i}{h} \right\|^2 \right)} - x, \qquad (10)$$

is the mean shift. The mean shift vector always points toward the direction of the maximum increase in the density.
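For reference, a generic mean shift mode-seeking loop following Eqs. (9) and (10) can be sketched as below, here with a Gaussian profile $g(s) = \exp(-s/2)$; the function name, bandwidth, and stopping tolerance are our own choices rather than values from the paper.

```python
# Generic mean-shift mode seeking: repeatedly apply the mean shift vector of
# Eq. (10) until it vanishes, which happens at a mode of the density estimate.
import numpy as np

def seek_mode(points, start, h=10.0, tol=1e-3, max_iter=100):
    """points: (n, d) sample locations; start: (d,) initial position estimate."""
    x = np.asarray(start, dtype=np.float64)
    for _ in range(max_iter):
        s = np.sum(((x - points) / h) ** 2, axis=1)           # ||(x - x_i)/h||^2
        g = np.exp(-0.5 * s)                                  # kernel profile g(s)
        new_x = (points * g[:, None]).sum(axis=0) / g.sum()   # weighted mean of Eq. (10)
        shift = new_x - x                                     # mean shift vector m_h(x)
        x = new_x
        if np.linalg.norm(shift) < tol:                       # converged to a density mode
            break
    return x
```

Seeded at an arbitrary point, the loop converges to the nearest mode; in the tracking context the samples would be the pixels of a detected blob, with the level set kernel of Eq. (6) acting as the density estimator.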
2.4. Object identification

Let $h$ represent the state of the objects in the scene at time $t$; it consists of the number of objects in the scene, their positions, and other parameters describing their size, shape, and pose. Our goal is to identify the objects $h(t)$ detected by the dynamic background subtraction algorithm at time $t$:

$$h(t) = \{obj_1, obj_2, \ldots, obj_n\} = \{(k_1, m_1), \ldots, (k_n, m_n)\} \in H^n, \qquad (11)$$
where $k_i$ is the unique identity of the $i$th object whose parameter is $m_i$, and $H$ is the solution space. To identify the objects at time $t$, we use the state of the objects in the scene at time $t-1$. There are three different cases relating the objects in frames $t$ and $t-1$. The object relations between two consecutive frames are represented as follows:

$$Obj_t \rightarrow \begin{cases} \text{Associated}, & k_i^{(t)} = k_i^{(t-1)} \\ \text{New}, & m_i^{(t-1)} = \emptyset \\ \text{Dead}, & m_i^{(t)} = \emptyset \end{cases} \qquad (12)$$

where $k_i^{(t)} = k_i^{(t-1)}$ means that object $k_i$ is a tracked object, $m_i^{(t-1)} = \emptyset$ means that $k_i$ is a new object, and $m_i^{(t)} = \emptyset$ means that $k_i$ is a dead object.
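A compact way to realize the Associated / New / Dead bookkeeping of Eq. (12) is sketched below, using nearest-centroid matching with a distance gate as the association test; the gating threshold and the dictionary-based track store are our assumptions, not details given in the paper.

```python
# Sketch of per-frame label assignment following Eq. (12): existing labels are
# kept when a nearby detection is found (Associated), unmatched detections get
# fresh labels (New), and unmatched previous labels are reported as Dead.
import numpy as np

def associate(prev_tracks, detections, next_id, max_dist=30.0):
    """prev_tracks: {label: centroid at t-1}; detections: list of centroids at t."""
    tracks, used = {}, set()
    for label, prev_c in prev_tracks.items():
        dists = [np.linalg.norm(np.subtract(c, prev_c)) for c in detections]
        if dists:
            j = int(np.argmin(dists))
            if dists[j] < max_dist and j not in used:
                tracks[label] = detections[j]     # Associated: k_i^(t) = k_i^(t-1)
                used.add(j)
    for j, c in enumerate(detections):
        if j not in used:                         # New object: m_i^(t-1) is empty
            tracks[next_id] = c
            next_id += 1
    dead = [lab for lab in prev_tracks if lab not in tracks]  # Dead: m_i^(t) is empty
    return tracks, dead, next_id
```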
Fig. 2. Level set result.
2.5. Occlusion handling

Occlusion can be classified into three categories: self occlusion (one part of the object occludes another), interobject occlusion (two objects being tracked occlude each other), and occlusion by the background scene structure (Yilmaz et al., 2006). In this paper, we focus only on interobject occlusion, which is illustrated in Fig. 3(b). In that case, two different objects are seen as a single object. To distinguish the two objects from each other, we use the mean-shift mode-finding algorithm, which climbs the local density hill until it reaches its peak. This process is illustrated on two sample distributions in Fig. 3(a). The algorithm takes the starting position of each object from the prior frame and moves it to the corresponding peak. The resulting mean-shift vectors are shown in Fig. 3(c). In this way, the unique identification numbers of the tracked objects are preserved. Fig. 4 demonstrates the results of our interobject occlusion handling algorithm.
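This occlusion step can be sketched as seeded mean-shift hill climbing on the merged blob's weight map (for instance, its level set kernel): each object's position from the prior frame climbs to its own local mode, so the two identities stay separate. The bandwidth, iteration count, and the use of pixel weights are our assumptions.

```python
# Seeded mean shift on a 2-D weight map: a prior-frame position is moved uphill
# to the nearest local mode, so two positions seeded inside one merged blob
# converge to two separate peaks and keep their identities.
import numpy as np

def climb(weight_map, start, h=8.0, iters=30):
    ys, xs = np.nonzero(weight_map > 0)
    pts = np.stack([ys, xs], axis=1).astype(np.float64)   # pixel coordinates of the blob
    w = weight_map[ys, xs]                                # e.g. the level set kernel values
    x = np.asarray(start, dtype=np.float64)
    for _ in range(iters):
        g = w * np.exp(-0.5 * np.sum(((x - pts) / h) ** 2, axis=1))
        x = (pts * g[:, None]).sum(axis=0) / g.sum()      # weighted mean step (Eq. (10))
    return x

# Usage (hypothetical names): seed from the two objects' prior-frame positions.
# pos_a_t = climb(theta_merged, pos_a_prev)
# pos_b_t = climb(theta_merged, pos_b_prev)
```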
Fig. 3. (a) Result of the mean-shift mode-finding algorithm. (b) Frame in which interobject occlusion is present. (c) Result of our occlusion handling algorithm, which finds the peak points of the objects by mean-shift mode finding on the result frame obtained using the dynamic background and level set methods. Note that the starting positions of the two points come from the prior frame.
Fig. 4. (a) Before occlusion, (b) interobject occlusion is present and object number 4 is lost, (c) result of our occlusion handling algorithm, (d) after occlusion.
Fig. 5. Selected frames of the tracking results from the Camera3exp3_1 video in Video Surveillance Online Repository (2007).
3. Experimental results

We have experimented with the system on many types of data obtained from the Video Surveillance Online Repository (ViSOR) (Video Surveillance Online Repository, 2007) and show only some representative results here. We present results on outdoor scene videos captured by a single camera.

The first video sequence contains 4387 frames of size 384 × 288 at a 29 fps sampling rate. In this sequence, many people pass through the scene, and the interhuman occlusions are large.
Fig. 6. Selected frames of the tracking results from the video named visor_1210775946477_video2_5_mjpg. avi in Video Surveillance Online Repository (2007).
We show in Fig. 5 some sample frames from the results on this sequence. The identities of the objects are indicated by their ID numbers displayed above their heads. We evaluate the results using trajectory-based errors. In the first video sequence, the proposed multi-object tracking algorithm detects the complete body regions of the moving objects and tracks their trajectories successfully. Although interhuman occlusion is present at frame 1485 in Fig. 5, our occlusion handling algorithm is able to continue the tracking procedure successfully. The proposed algorithm automatically adapts to the changing scale and orientation of the tracked objects in spite of their constantly changing shapes.

The second video sequence contains 1304 frames of size 320 × 240 at a 10 fps sampling rate. In this sequence, the image quality is lower than in the first video and the occlusions are large. We show in Fig. 6 some sample frames from the results on this sequence. In frame 462 in Fig. 6, full occlusion is present; in this case, our proposed algorithm detects only one object although two objects are present. In addition, the proposed method could not detect all moving objects in frame 462 in Fig. 6 because of missed detections by the dynamic background algorithm. In both scenarios, the algorithm can process about 30 frames per second on a PC with 1 GB RAM and a 3.2 GHz processor, running in Matlab (implemented efficiently using matrix–vector elementwise operators).
4. Conclusion

We have presented a principled approach that simultaneously detects and tracks the complete moving object regions in a crowded scene acquired from a single stationary camera, adapts to the changing scale and orientation of the objects, and assigns consistent labels to each moving object throughout real-world video sequences. An optimal object tracker should have: (1) the ability to handle new objects entering the field of view and already tracked objects exiting it, (2) the ability to handle partial and full object occlusion during the course of tracking, (3) the ability to assign a unique identification number to a tracked object whose size may change throughout the video sequence, and (4) the ability to run online. Our proposed object tracking algorithm has all of the aforementioned abilities. However, when the number of occluded objects exceeds two, our algorithm cannot guarantee that the true object positions are recovered; in that situation, it only gives information about which objects are occluded. The experimental results show that the proposed method is superior to traditional mean shift tracking in the following aspects: (1) it provides consistent tracking of multiple objects instead of a single object throughout the video; (2) it is not affected by the scale and orientation changes of the tracked objects; and (3) its computational complexity is much lower than that of traditional mean shift due to using the level set method instead of a probability density kernel.
References

Babu, R. V., & Makur, A. (2007). Kernel-based spatial-color modeling for fast moving object tracking. In IEEE international conference on acoustics, speech and signal processing (ICASSP 2007) (pp. I-901–I-904).
Cheung, S.-C. S., & Kamath, C. (2004). Robust techniques for background subtraction in urban traffic video. In Proceedings of the SPIE (Vol. 5308, pp. 881–892).
Collins, R. T. (2003). Mean-shift blob tracking through scale space. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR 2003).
Comaniciu, D., & Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 603–619.
Comaniciu, D., Ramesh, V., & Meer, P. (2003). Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 564–575.
Dewan, M., & Hager, G. D. (2006). Toward optimal kernel-based tracking. In IEEE computer society conference on computer vision and pattern recognition (CVPR 2006) (pp. 618–625).
Elgammal, A., Duraiswami, R., Harwood, D., & Davis, L. (2002). Background and foreground modeling using non-parametric kernel density estimation for visual surveillance. Proceedings of the IEEE, 90(7), 1151–1163.
Fukunaga, K., & Hostetler, L. D. (1975). The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21, 32–40.
Osher, S. J., & Fedkiw, R. P. (2002). Level set methods and dynamic implicit surfaces. Springer-Verlag.
Sethian, J. A. (1999). Level set methods and fast marching methods: Evolving interfaces in computational geometry, fluid mechanics, computer vision, and materials science. Cambridge University Press.
Smith, M. (2006). Background subtraction for urban traffic monitoring using webcams. Master's thesis, University of Amsterdam.
Subasica, M., Loncarica, S., & Birchbauerb, J. (2009). Expert system segmentation of face images. Expert Systems with Applications, 36(3, Part 1), 4497–4507.
Video Surveillance Online Repository (ViSOR) (2007).
Weng, S. K., Kuo, C. M., & Tu, S. K. (2006). Video object tracking using adaptive Kalman filter. Journal of Visual Communication and Image Representation, 1190–1208.
Yang, C., Duraiswami, R., Elgammal, A., & Davis, L. (2004). Real-time kernel-based tracking in joint feature-spatial spaces. Technical report CS-TR-4567, Department of Computer Science, University of Maryland, College Park.
Yang, C., Duraiswami, R., & Davis, L. (2005). Efficient spatial-feature tracking via the mean-shift and a new similarity measure. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 176–183).
Yilmaz, A. (2007). Object tracking by asymmetric kernel mean shift with automatic scale and orientation selection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1–6).
Yilmaz, A., Javed, O., & Shah, M. (2006). Object tracking: A survey. ACM Computing Surveys, 38(4), Article 13.
Zhao, T., Nevatia, R., & Wu, B. (2008). Segmentation and tracking of multiple humans in crowded environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(7), 1198–1211.
Zivkovic, Z., & Heijden, F. V. (2005). Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognition Letters, 773–780.