Available online at www.sciencedirect.com
Procedia Computer Science 13 (2012) 185 – 191
Proceedings of the International Neural Network Society Winter Conference (INNS-WC2012)
Upper body tracking using KLT and Kalman filter
Pouya Bagherpour a,∗, Seyed Ali Cheraghi a, Musa bin Mohd Mokji a
a Faculty of Electrical Engineering, Universiti Teknologi Malaysia (UTM), Skudai, 81310 Johor, Malaysia
Abstract
Human monitoring based on image and sequence analysis is employed in security surveillance systems, and upper-body tracking is often needed for this purpose. However, tracking challenges such as variations and similarity in appearance can mislead a limb tracker. This paper describes a novel framework for visual tracking of human upper body parts in an indoor environment that can handle these challenges. It is based on the Kanade-Lucas-Tomasi (KLT) tracker and a motion-model Kalman filter. In our approach, the different upper body limbs are tracked by KLT methods, and the motion model is then imposed on the Kalman filter to predict and estimate the best tracked patch among the KLT tracking results. These characteristics make our approach suitable for visual surveillance applications. Experiments on different datasets show the efficiency of our approach in tracking the human upper body limbs.
© 2012 Published by Elsevier B.V. Selection and/or peer-review under responsibility of the Program Committee of INNS-WC 2012. Open access under CC BY-NC-ND license.
Keywords: upper body tracking, KLT, Kalman filter, affine transformation, articulated human body
∗ Corresponding authors. Email addresses: [email protected] (Pouya Bagherpour), [email protected] (Seyed Ali Cheraghi), [email protected] (Musa bin Mohd Mokji)
1877-0509 © 2012 Published by Elsevier B.V. doi:10.1016/j.procs.2012.09.127
1. Introduction
Automatic video surveillance deals with the observation of predefined targets such as people or vehicles in various environments. It involves detecting, tracking and classifying any kind of activity in the field of view. Its applications cover a broad range, from monitoring car parking lots to military surveillance systems. Human behavior analysis in public places such as stores and shopping malls is one of the critical topics in surveillance systems. An important task for automatic security surveillance systems is to monitor people’s movements and behaviors in order to recognize anomalies. In this respect, human motion tracking plays a substantial role. Tracking human body parts requires estimating the position of each articulated body part in each frame. Some single-camera human motion tracking algorithms employ a shape model of the target to support the tracking. Tracking methods can be categorized into three classes: feature-based methods, which are based on regions of difference between the templates’ features and the input frames [1] [2]; gradient-based techniques [3], which estimate image flow over a number of images by using spatial and temporal partial derivatives; and Bayes-based trackers
which model the state $x_k$ and the observation $z_k$ to handle uncertainty [4] [5]. Given all the observations $z_{1:k}$, the goal of a Bayes tracker is to estimate the current state $x_k$ with respect to its probability density function. The assumption imposed on state estimation is that changes in the shape, position and appearance of the targets are smooth over time (i.e. consistency in trajectory). Gradient-based trackers such as Kanade-Lucas-Tomasi (KLT) are popular methods that have been widely used in human body tracking. Numerous studies have addressed hand tracking for purposes such as gesture recognition. For example, the real-time hand tracker of Shan et al. [6] made the particle filter faster by reducing the sample size according to mean shift. Kölsch and Turk [7] combined Kanade-Lucas-Tomasi with k-nearest neighbors in order to build a fast tracker. Hye-Jin Kim et al. [8] proposed the AR-KLT model for hand tracking, which employs KLT as one of its parts and is based on articulated human body motion. In this paper we employ the KLT feature tracker to track the human upper body in an indoor environment. Under variations in appearance such as illumination and pose changes, and under similarity in appearance, KLT may drift to a wrong region of a cluttered background. One solution to this problem is to use an estimator that predicts the target’s next position from its current position. The Kalman filter is one of the most efficient such methods and has been used in various scenarios. Boykov and Huttenlocher [9] used it for vehicle tracking in an adaptive framework. Pentland and Horowitz [10] and Metaxas and Terzopoulos [11] describe elastic solid models for tracking in conjunction with Kalman filtering, and give simple examples of articulated tracking by enforcing constraints. Reid et al. [12] fused KLT feature tracking with a Kalman filter in order to localize the head, which was then used for gaze estimation.
Takahashi et al. [13] used a Kalman filter to track face and finger regions acquired by a KLT tracker. The Kalman filter has also been used for tracking moving targets under occlusion. Jang and Choi [14] propose the structural Kalman filter to estimate motion information under deteriorating conditions such as occlusion. In this paper we concentrate on tracking upper body parts while addressing these tracking challenges; we therefore propose a framework based on KLT and a motion-model Kalman filter. In this method, the different upper body limbs are tracked by KLT methods, and the motion model is then imposed on the Kalman filter to predict and estimate the best tracked patch among the KLT tracking results. The rest of the paper is structured as follows. Section 2 describes our method and details the structure of the algorithm and its constraints. Section 3 presents the experimental results obtained by evaluating the proposed method on video datasets. Finally, after the quantitative and qualitative results, the paper closes with conclusions and future work.
2. Kanade-Lucas-Tomasi (KLT)
In KLT methods, image features are used as the candidate state of each limb to find the best solution of the tracking problem. Based on the features extracted from the frame and a cost function that defines the robustness of the state, the state estimate is iteratively refined until it converges to a local optimum of the cost function. Normally, a temporal constraint is applied to the tracker by initializing the starting point in each frame around the previous state estimate. The goal of this method is to align the template $T(x)$ to an input image. The overall KLT system is shown in Fig. 1. Starting from the initial point, which can be the previous state, the tracker tries to align the template with the image. If the template and image do not converge, the tracker searches from another point to find the alignment.
This successive process continues until the best alignment is found and the new state is extracted from it. The new state is then used as the initial point for the next frame. KLT trackers use a template as the target model for state estimation: a template image $T(x)$, which is a sub-region (e.g. a 9x9 window) of the frame at $t = 1$, is aligned to an image $I(x)$ at $t = 2$, where $x = (x, y)^T$ are the pixel coordinates. The alignment maps a pixel $x$ of the template to the corresponding point on the input image by the warp $W(x; p)$, where $p = (p_1, \ldots, p_n)^T$ is a vector of parameters. For instance, for a moving object an affine warp might be considered:
$$W(x; p) = \begin{pmatrix} (1 + p_1)\,x + p_3\,y + p_5 \\ p_2\,x + (1 + p_4)\,y + p_6 \end{pmatrix} = \begin{pmatrix} 1 + p_1 & p_3 & p_5 \\ p_2 & 1 + p_4 & p_6 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \tag{1}$$
The goal of the KLT algorithm is to find the best alignment of the template with the warped image by minimizing the sum of squared errors between them.
$$\sum_{x} \left[ I(W(x; p)) - T(x) \right]^2 \tag{2}$$
Fig. 1. KLT trackers system. (Flowchart blocks: input sequences, input frame, initial state x, target model, tracker, convergence test with yes/no branches, output state x.)
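As a rough illustration of Eqs. (1) and (2), the following Python sketch applies the affine warp to a set of template coordinates and evaluates the sum of squared errors. It is an illustrative sketch, not the authors' implementation: NumPy, the function names `affine_warp`, `bilinear_sample` and `ssd_cost`, and the use of bilinear interpolation to read the image at non-integer positions are all assumptions made here.

```python
import numpy as np

def affine_warp(points, p):
    """Apply the affine warp W(x; p) of Eq. (1) to an (N, 2) array of (x, y) points."""
    p1, p2, p3, p4, p5, p6 = p
    A = np.array([[1.0 + p1, p3, p5],
                  [p2, 1.0 + p4, p6]])
    homogeneous = np.column_stack([points, np.ones(len(points))])
    return homogeneous @ A.T

def bilinear_sample(image, points):
    """Read a 2-D image at real-valued (x, y) points (assumed interpolation scheme)."""
    h, w = image.shape
    x = np.clip(points[:, 0], 0, w - 1)
    y = np.clip(points[:, 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0, x0] + fx * image[y0, x1]
    bottom = (1 - fx) * image[y1, x0] + fx * image[y1, x1]
    return (1 - fy) * top + fy * bottom

def ssd_cost(image, template, template_points, p):
    """Sum of squared errors of Eq. (2) between I(W(x; p)) and T(x)."""
    warped = affine_warp(template_points, p)
    residual = bilinear_sample(image, warped) - template
    return float(np.sum(residual ** 2))
```

Here `template` holds the template intensities at `template_points`; with `p` set to zero the warp is the identity, so the cost of a template cut from the same image is zero.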
Equation (2) is summed over all pixels $x$ in the template $T(x)$, and the minimization is performed with respect to $p$. Even if the warp $W(x; p)$ is linear in $p$, minimizing Eq. (2) is a non-linear problem, because the pixel values $I(x)$ are in general non-linear in $x$. To solve this problem, KLT assumes that a current estimate of $p$ is known and iteratively solves for an increment $\Delta p$ by minimizing
$$\sum_{x} \left[ I(W(x; p + \Delta p)) - T(x) \right]^2 \tag{3}$$
with respect to $\Delta p$, after which the parameters are updated by
$$p \leftarrow p + \Delta p \tag{4}$$
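The iteration of Eqs. (3) and (4) can be sketched as a Gauss-Newton loop in the forward additive form, following the general scheme described in [16]. This is a minimal sketch under stated assumptions, not the code evaluated in the paper: NumPy, the helper names `sample`, `warp` and `forward_additive_klt`, bilinear interpolation, finite-difference image gradients from `np.gradient`, and the default threshold `eps` are all choices made here for illustration.

```python
import numpy as np

def sample(image, x, y):
    """Bilinearly interpolate `image` at real-valued coordinates (assumed scheme)."""
    h, w = image.shape
    x = np.clip(x, 0.0, w - 1.001)
    y = np.clip(y, 0.0, h - 1.001)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0, x0] + fx * image[y0, x0 + 1]
    bottom = (1 - fx) * image[y0 + 1, x0] + fx * image[y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bottom

def warp(xs, ys, p):
    """Affine warp W(x; p) of Eq. (1), applied to coordinate arrays xs, ys."""
    wx = (1 + p[0]) * xs + p[2] * ys + p[4]
    wy = p[1] * xs + (1 + p[3]) * ys + p[5]
    return wx, wy

def forward_additive_klt(image, template, xs, ys, eps=1e-4, max_iter=100):
    """Estimate p by repeating Eq. (3) (solve for dp) and Eq. (4) (p <- p + dp)."""
    p = np.zeros(6)
    grad_y, grad_x = np.gradient(image)  # derivatives along rows (y) and columns (x)
    for _ in range(max_iter):
        wx, wy = warp(xs, ys, p)
        error = template - sample(image, wx, wy)
        gx = sample(grad_x, wx, wy)
        gy = sample(grad_y, wx, wy)
        # Image gradient times the warp Jacobian [[x, 0, y, 0, 1, 0], [0, x, 0, y, 0, 1]].
        steepest = np.stack([gx * xs, gy * xs, gx * ys, gy * ys, gx, gy], axis=1)
        # Least-squares solution of the linearized Eq. (3).
        dp = np.linalg.lstsq(steepest, error, rcond=None)[0]
        p = p + dp
        if np.linalg.norm(dp) <= eps:  # convergence test on the norm of dp
            break
    return p
```

Solving the linearized problem with `np.linalg.lstsq` is equivalent to forming and inverting the Gauss-Newton Hessian explicitly, but avoids building that matrix by hand.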
The iteration of Eqs. (3) and (4) continues until the parameter vector $p$ converges. The convergence test is normally performed by comparing the norm of $\Delta p$ with a threshold $\epsilon$: $\|\Delta p\| \le \epsilon$. This algorithm finds the best alignment for each possible location in the search space, which contains several regions around the previous state. As can be seen in Fig. 2, nine possible matches for the forearm are extracted by the KLT method. In this case, the red quadrilateral is the patch selected to represent the forearm, because it has the minimum root squared error between the template and the patch. However, this method can suffer from variations in pose and appearance and may drift to limb-like regions of a cluttered background.
2.1. Motion model Kalman Filtering
To solve this problem of the KLT method, the Kalman filter can be used as a predictor of the object’s next position. In this case, the Kalman filter can impose the motion model of the upper body limbs as the process noise covariance. The Kalman filter is represented by a transition model and an observation model, as shown in the following equations:
$$x_k = F_k x_{k-1} + m_{k-1} \tag{5}$$
and
Fig. 2. Finding the matched patch. The quadrilaterals depict all the candidate patches for the forearm. The red one is the best match obtained from the KLT method and the error function.
$$z_k = G_k x_k + n_k \tag{6}$$
In Eqs. (5) and (6), $F_k$ and $G_k$ are user-supplied matrices that define the linear relationships between successive states and between states and observations, respectively. The Gaussian noises $m_{k-1}$ and $n_k$ have zero mean and covariances $Q_k$ and $R_k$, and represent the process noise and the observation noise, respectively. The motion model of an object, on the other hand, contains the spatiotemporal information between successive frames. In human upper body part tracking, the movement of the limbs can be described by a recursive motion model, which is built into the Kalman filter as follows:
$$x_{t+1} = x_t + dx_t\,\Delta T + \frac{1}{2}\,d^2x_t\,\Delta T^2 \tag{7}$$
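To make Eqs. (5) to (7) concrete, the sketch below builds a transition matrix from the constant-acceleration motion model and runs the standard Kalman predict and update steps for a single image coordinate. The state layout [position, velocity, acceleration], the position-only observation matrix, and the function names are assumptions chosen for illustration; the paper does not specify them, nor the values of $Q_k$ and $R_k$.

```python
import numpy as np

def constant_acceleration_model(dt):
    """Transition matrix F from Eq. (7) and an assumed position-only observation matrix G."""
    F = np.array([[1.0, dt, 0.5 * dt ** 2],
                  [0.0, 1.0, dt],
                  [0.0, 0.0, 1.0]])
    G = np.array([[1.0, 0.0, 0.0]])
    return F, G

def kalman_predict(x, P, F, Q):
    """Propagate the state and its covariance with the transition model, Eq. (5)."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, z, G, R):
    """Correct the prediction with an observation z, using the model of Eq. (6)."""
    innovation = z - G @ x_pred
    S = G @ P_pred @ G.T + R              # innovation covariance
    K = P_pred @ G.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ innovation
    P_new = (np.eye(len(x_pred)) - K @ G) @ P_pred
    return x_new, P_new
```

In a tracker of the kind described here, `z` would be the position of the patch reported by KLT in the current frame, and the predicted position could be used to judge which candidate patch is plausible; one such filter per coordinate (or a stacked state for x and y) would be needed.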
In Eq. (7), $x_t$ is the current position of the limb, $\Delta T$ is the time interval between successive frames, and $dx_t$ and $d^2x_t$ are the velocity and acceleration, respectively. Some constraints can be applied to this equation in the case of human body part tracking. Since the time interval between frames is constant, the acceleration of the limbs can be assumed to be constant [15]. Given this model, the transition model is structured based on the motion of the current states.
3. Results and Discussion
Our algorithm was evaluated on a large dataset of 45 videos with different lengths and setups. The frame size was 240 × 320 pixels. All the videos are challenging, with cluttered backgrounds, illumination changes and variations in the appearance of the target, such as different outfits for different persons in the scene. All videos were recorded in an indoor environment. The ground truth was produced by carefully annotating the positions of the body parts manually. The KLT trackers track the human limbs smoothly. However, in some circumstances such as fast movement and pose changes, the tracker may drift to a limb-like region of the cluttered background. Fig. 3 shows the tracker losing the forearm in frame 270. As discussed in the previous section, KLT methods can be improved by using the Kalman filter as a predictor. To compare the body part tracking results, Fig. 5 shows four different KLT methods combined with the Kalman filter: the Forward Additive (FA-KF), Forward Compositional (FC-KF), Inverse Additive (IA-KF) and Inverse Compositional (IC-KF) approaches [16]. We ran the four algorithms on the Windows 7 operating system with a 2.2 GHz processor on the MATLAB 7 platform. The timing results for each method are shown in Fig. 4.a. As can be seen, the Forward Additive algorithm is far slower than the other methods.
The Inverse Additive algorithm is also somewhat slower than the Inverse Compositional algorithm. The error of each method is computed as the mean square error between the ground truth points and the limb points determined by the tracking method. As shown in Fig. 4.b, pure KLT does not give sufficient results
Fig. 3. The results of KLT on consecutive frames (frames 180, 195 and 270). As the forearm moves across the frames, the tracker drifts to a wrong region.
and has the highest error, while the proposed methods achieve much better results. Forward Additive KLT combined with the Kalman filter has the highest error among the combinations; Inverse Additive and Inverse Compositional have the lowest errors. These results are confirmed by the trajectories in Fig. 5. The FA-KF, shown in the first row, has almost lost the track in the last frame, while the IA-KF and IC-KF track very smoothly. In terms of accuracy rate, Table 1 shows the results of the four KLT methods with the Kalman filter. As can be seen, head tracking is sufficiently accurate even for the forward methods, owing to the distinctive template of the head. The accuracy for the arms is lower than for the head, because the appearance of the arms changes much more than that of the head. The accuracy decreases further for forearm tracking, where changing appearance and larger movements cause low accuracy even for the inverse KLT methods. Finally, in terms of accuracy rate, the preferred choice remains one of the inverse KLT methods.
Fig. 4. a) Time consumption (per second) of the KLT methods with the Kalman filter estimator, plotted against frame number for FA, FC, IA and IC. b) Mean square error in pixels, plotted against frame number for KLT without KF, FA, FC, IA and IC.
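The error measure plotted in Fig. 4.b can be sketched as follows. This is an assumed formulation: the paper states only that the mean square error is taken between the ground truth points and the tracked limb points, so the array layout (frames × points × 2), the absence of any normalization, and the helper name `mean_square_error_per_frame` are hypothetical.

```python
import numpy as np

def mean_square_error_per_frame(tracked, truth):
    """Mean squared distance between tracked and annotated points, one value per frame.

    Both inputs are assumed to have shape (frames, points, 2), holding (x, y) positions.
    """
    tracked = np.asarray(tracked, dtype=float)
    truth = np.asarray(truth, dtype=float)
    squared_distance = np.sum((tracked - truth) ** 2, axis=-1)  # shape (frames, points)
    return squared_distance.mean(axis=-1)                       # shape (frames,)
```

Plotting the returned array against the frame index gives an error curve per method of the kind shown in the figure.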
For whole upper body tracking, we ran the algorithms on different upper body parts: the head, upper arms and forearms. The results are depicted in Fig. 6. Here, too, the inverse algorithms give better results than the forward methods. Hence, if high efficiency is important, the choice is between the inverse compositional and inverse additive algorithms, and if a simpler algorithm is desired, the inverse compositional algorithm is the best choice.
4. Conclusion and Future Work
In this paper, a novel framework for tracking upper body parts based on KLT trackers and a motion-model Kalman filter has been proposed. In this method, KLT trackers were used to track the different limbs and the motion model
Fig. 5. Results of the four KLT methods. To show the results clearly and to prevent the patches of different limbs from overlapping, only the forearm results are shown, for frames 193, 201, 258 and 272. The first row shows the Forward Additive approach, the second row the Forward Compositional method, the third row the Inverse Additive approach and the last row the Inverse Compositional method.
Table 1. Accuracy rate for the KLT methods with the Kalman filter.
Method   Head   Arms   Forearms
FA-KF    90%    67%    45%
FC-KF    90%    70%    50%
IA-KF    95%    77%    65%
IC-KF    95%    80%    67%
Kalman filter was used to predict and estimate the true position of the limbs. The empirical results show that the method copes with different object poses, cluttered backgrounds and different scene setups in an indoor environment. However, applying articulated human kinematic constraints to the method should improve the tracking results further. Hence, in future work, we will add kinematic models to our algorithm to enhance the results.
Acknowledgments
This research is supported by a grant from Universiti Teknologi Malaysia (UTM) and MOSTI under vote No. 4S012 and sub-project code 01-01-06-SF0908.
Fig. 6. The results of KLT-KF on the whole upper body. a) Forward Additive, b) Forward Compositional, c) Inverse Additive and d) Inverse Compositional approach.
References
[1] J. Shin, S. Kim, S. Kang, S.-W. Lee, J. Paik, B. Abidi, M. Abidi, Optical flow-based real-time object tracking using non-prior training active feature model, Real-Time Imaging 11 (3) (2005) 204–218.
[2] I. Haritaoglu, D. Harwood, L. S. Davis, W4: Real-time surveillance of people and their activities, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 809–830.
[3] K. Moustakas, D. Tzovaras, M. G. Strintzis, A non causal bayesian framework for object tracking and occlusion handling for the synthesis of stereoscopic video, in: Proceedings of the 3D Data Processing, Visualization, and Transmission, 2nd International Symposium, 3DPVT ’04, IEEE Computer Society, Washington, DC, USA, 2004, pp. 147–154.
[4] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003) 564–577.
[5] S. J. Russell, P. Norvig, J. F. Candy, J. M. Malik, D. D. Edwards, Artificial intelligence: a modern approach, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1996.
[6] C. Shan, Y. Wei, T. Tan, F. Ojardias, Real time hand tracking by combining particle filtering and mean shift, in: Proceedings of the Sixth IEEE international conference on Automatic face and gesture recognition, FGR’ 04, IEEE Computer Society, Washington, DC, USA, 2004, pp. 669–674.
[7] M. Kölsch, M. Turk, Fast 2d hand tracking with flocks of features and multi-cue integration, in: IEEE Workshop on Real-Time Vision for Human-Computer Interaction (at CVPR), 2004, p. 158.
[8] H.-J. Kim, K.-C. Kwak, J. Lee, Bimanual hand tracking, in: Proceedings of the 6th international conference on Computational Science and Its Applications - Volume Part I, ICCSA’06, Springer-Verlag, Berlin, Heidelberg, 2006, pp. 955–963.
[9] Y. Boykov, D. P. Huttenlocher, Adaptive bayesian recognition in tracking rigid objects, in: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Hilton Head, SC, 2000, pp. 697–704.
[10] A. Pentland, B. Horowitz, Recovery of nonrigid motion and structure, IEEE Trans. Pattern Anal. Mach. Intell. 13 (7) (1991) 730–742.
[11] D. Metaxas, D. Terzopoulos, Shape and nonrigid motion estimation through physics-based synthesis, IEEE Trans. Pattern Anal. Mach. Intell. 15 (6) (1993) 580–591.
[12] I. Reid, B. Benfold, A. Patron, E. Sommerlade, Understanding interactions and guiding visual surveillance by tracking attention, in: R. Koch, F. Huang (Eds.), Computer Vision ACCV 2010 Workshops, Vol. 6468 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2011, pp. 380–389.
[13] M. Takahashi, M. Fujii, M. Naemura, S. Satoh, Human gesture recognition system for tv viewing using time-of-flight camera, Multimedia Tools and Applications (2011) 1–23.
[14] D.-S. Jang, S.-W. Jang, H.-I. Choi, 2d human body tracking with structural kalman filter, Pattern Recognition 35 (10) (2002) 2041–2049.
[15] K. Rangarajan, M. Shah, Establishing motion correspondence, CVGIP: Image Understanding 54 (1) (1991) 56–73.
[16] S. Baker, I. Matthews, Lucas-kanade 20 years on: A unifying framework, International Journal of Computer Vision 56 (2004) 221–255.