Pattern Recognition Letters 33 (2012) 2192–2197
A hybrid motion and appearance prediction model for robust visual object tracking

Hamidreza Jahandide, Kamal Mohamedpour, Hamid Abrishami Moghaddam *

K.N. Toosi University of Technology, Seyed Khandan, P.O. Box 16315-1355, Tehran, Iran

* Corresponding author. Address: Biomedical Engineering Group, Electrical Engineering Department, K.N. Toosi University of Technology, Seyed Khandan, P.O. Box 16315-1355, Tehran, Iran. Tel.: +98 21 84062229; fax: +98 21 88462066. E-mail addresses: [email protected] (H. Jahandide), [email protected] (K. Mohamedpour), [email protected] (H. Abrishami Moghaddam). URL: http://www.ee.kntu.ac.ir/biomedical/index2.htm (H. Abrishami Moghaddam). http://dx.doi.org/10.1016/j.patrec.2012.07.021
Article info

Article history: Received 11 March 2012; available online 10 August 2012. Communicated by G. Borgefors.

Keywords: Visual object tracking; Occlusion; Appearance prediction; Adaptive Kalman filter
Abstract

In this paper a new video object tracking method is proposed. A hybrid model based on motion and appearance is constructed for the object, and a Kalman filter is applied to both components in order to reduce noise and provide a prediction for the next frame. Making a prediction of the object appearance in the next frame available contributes effectively to robust object tracking in spite of large changes in scene illumination. Experimental results using the proposed method and its counterparts without appearance prediction demonstrate the superiority of the novel hybrid prediction method under drastic changes in illumination.

© 2012 Elsevier B.V. All rights reserved.
1. Introduction

Object tracking is an important task within the field of computer vision and has many applications such as surveillance (Ying-Li et al., 2011; Jong Sun et al., 2011; Zhu et al., 2010), human-computer interaction systems (Hansen and Qiang, 2010; Reale et al., 2011), augmented reality (Ferrari et al., 2001), robotics (Zhengtao et al., 2010), video compression (Del Bue et al., 2002), and driver assistance (Cheng and Trivedi, 2010; Eidehall et al., 2007; Murphy-Chutorian and Trivedi, 2010). Changes in object appearance and the occurrence of occlusion are among the main challenges in object tracking. Changes in background or object appearance may occur due to changes in illumination, deformation of objects, changes in object pose, and so on. To handle changes in appearance, a model of the object that evolves frame by frame and captures the variation of the object image is required (Jang and Choi, 2000). However, when there are drastic changes in the appearance of the target, model evolution is not enough to prevent the algorithm from failing. To overcome this drawback, we propose to complete the model with a prediction of the object appearance in the next frame.

Nguyen and Smeulders (2004) use a Kalman filter to smooth the appearance model. They use the color components of each pixel as a feature vector and apply a Kalman filter to each pixel in the target region. However, the application of an individual Kalman filter to
each pixel is redundant, since almost any change in lighting conditions affects the brightness of all pixels in the target region, and there is no need to process each pixel individually. In (Peng et al., 2005) an algorithm is proposed that uses a Kalman filter to filter the object kernel histogram instead of the object trajectory. Bins in the histogram are filtered with individual Kalman filters, supposing that the value of each bin changes independently. Due to the lack of any prediction model for the state equation, the only state equation available for the Kalman filter would be:
$\hat{x}_k^- = \hat{x}_{k-1}$    (1)
where $\hat{x}_k^-$ is the a priori state estimate at step $k$ and $\hat{x}_{k-1}$ is the a posteriori state estimate at step $k-1$. The above equation actually provides no prediction for the next frame; hence the ability of the Kalman filter to reduce error is limited. Furthermore, the method suffers from high computational complexity, since there are several bins and an individual Kalman filter must be applied to each of them.

In this paper, an algorithm is proposed that uses Kalman filters to estimate both motion and appearance information in order to predict object position and color. Predicting the color of the object for the next frame makes our algorithm robust to changes in illumination. The method takes advantage of the fact that the motion and the color distribution of the object are independent: when one of the motion or appearance models deviates from reality and the algorithm is about to lose the object, the other model can help to keep track of it.

The paper is structured as follows: In Section 2 the proposed method is described and the appearance model is constructed. This section also explains how the algorithm handles occlusion. In Section 3 experimental results are presented, and conclusions are drawn in the last section.
2. Proposed method

Fig. 1 shows the data flow diagram of the algorithm. The user is required to specify the position of the object in the first frame. Background subtraction is performed in the first frame in order to determine which pixels belong to the object: the first frame is subtracted from the background image and, using Otsu's method (Otsu, 1975), a threshold is found for converting the subtraction result to a binary image. In the binary image, pixels with a non-zero value are considered to belong to the object. After the first frame, background subtraction is no longer required and the appearance model provides the necessary information to detect the pixels of the object in subsequent frames. Keeping background subtraction active beyond the first frame did not significantly improve the tracking performance; it only increased the computational complexity.

Fig. 1. Data flow diagram for the proposed algorithm.
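As an illustration, this initialization step could be implemented as follows. This is a minimal sketch assuming OpenCV is available; the function and variable names are ours, not from the paper:

```python
import cv2

def init_object_mask(first_frame_gray, background_gray):
    """Locate object pixels in the first frame by background subtraction.

    Both inputs are expected as 8-bit grayscale images of the same size.
    The absolute difference is binarized with Otsu's threshold; non-zero
    pixels of the result are taken to belong to the object.
    """
    diff = cv2.absdiff(first_frame_gray, background_gray)
    # Otsu's method picks the threshold that maximizes the between-class
    # variance of the difference image's histogram.
    _, mask = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return mask  # non-zero entries mark object pixels
```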
2.1. Appearance model

In this stage, an appearance model is created from state variables extracted from the histogram of the object, and a Kalman filter is applied to these variables. Applying the Kalman filter to the appearance state variables reduces error and provides a prediction of the object appearance in the next frame. To create an appearance model, the authors in (Peng et al., 2005) use a kernel histogram composed of m bins; the set {p_b}, b = 1, ..., m, represents the object model, in which p_b is the bth bin of the kernel histogram. Each bin is treated as a state variable and is filtered by an individual Kalman filter. For example, with 32 bins per channel in the kernel histogram, there are 32^3 = 32,768 bins in the RGB space. Although many of these bins have zero values and the number of non-zero bins is much smaller (about 100 according to the experimental results in (Peng et al., 2005)), the number of variables to be filtered by the Kalman filter remains too high.

To overcome this drawback, we propose a different approach to construct the appearance model. The pixels of the object (which are known thanks to background subtraction) are clustered based on their brightness values. The centroid of each cluster is defined as the first element of a state vector; the second and third elements of the state vector are the first and second derivatives of the first element, respectively. Fig. 2 shows an object before and after background subtraction together with its gray-level histogram. It is clear from Fig. 2(c) that there are two dominant colors in the object; if we cluster the pixels of the object into two clusters, the centroid of each cluster represents one of these colors.

Fig. 2. The object before (a) and after (b) background subtraction and its histogram after transforming the color image to gray level (c).
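A sketch of how the per-cluster appearance states could be initialized is given below. The paper does not name a clustering algorithm, so a plain 1-D k-means over gray levels is assumed here; all names are illustrative:

```python
import numpy as np

def init_appearance_states(frame_gray, mask, n_clusters=2, n_iter=20):
    """Cluster the object pixels by gray level and build one appearance
    state vector [c, c_dot, c_ddot] per cluster (derivatives start at 0)."""
    values = frame_gray[mask > 0].astype(float)
    # Spread the initial centroids over the observed gray-level range,
    # then run a few Lloyd iterations in one dimension.
    cents = np.linspace(values.min(), values.max(), n_clusters)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(values[:, None] - cents[None, :]), axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):
                cents[j] = values[labels == j].mean()
    # First element: cluster centroid; second and third: its first and
    # second temporal derivatives, initialized to zero.
    return [np.array([c, 0.0, 0.0]) for c in cents]
```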
2.1.1. Kalman appearance prediction

During tracking, it is expected that the object appearance preserves its dominant-color characteristics and that changes in illumination merely shift these dominant colors along the histogram. Hence, a model of this color shift (due to changes in object appearance) can be used to predict the position of each cluster centroid in the next frame. As will be demonstrated, such a model reduces the probability of losing the object due to a change in its appearance. Fig. 3 shows the illumination-component histogram of the object window under increasing illumination; as the illumination increases, the centroids of the bins in the histogram shift upward.

Fig. 3. (a-c) Three frames (1st, 50th and 70th) in which the white rectangle specifies the object window for which the histogram is computed. (d-f) The illumination component histogram of the object window in the three frames under illumination changes.

Supposing that this shift is linear with constant second-order changes, the state equation in the kth frame would be:
$x_{i,k} = A x_{i,k-1} + w_{i,k-1}$    (2)
in which $x_{i,k}$ is the state vector at the kth frame for the ith cluster; its components are the centroid of the ith cluster and its first- and second-order temporal derivatives, respectively ($x_{i,k} = [c, \dot{c}, \ddot{c}]^T$). $w_{i,k-1}$ is the state noise, which is assumed to be normal with zero mean and covariance $Q_i$. With the assumption that the shift of the cluster centroids is linear with constant second-order derivative, the state transition matrix A has the following structure:
$$A = \begin{bmatrix} 1 & \Delta T & \frac{1}{2}\Delta T^2 \\ 0 & 1 & \Delta T \\ 0 & 0 & 1 \end{bmatrix}$$    (3)
where $\Delta T$ represents the time interval between successive frames. Based on the above assumptions, the Kalman equations for predicting the appearance model, performed in the Kalman appearance prediction block of Fig. 1, can be written as:
$\hat{x}_{i,k}^- = A \hat{x}_{i,k-1}$    (4)

$P_{i,k}^- = A P_{i,k-1} A^T + Q_i$    (5)

$k_{i,k} = P_{i,k}^- h^T \left( h P_{i,k}^- h^T + r_i \right)^{-1}$    (6)
where h is a vector that relates the state vector and measurement:
$h = [1 \;\; 0 \;\; 0]$    (7)
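A minimal NumPy sketch of this prediction step (Eqs. (3)-(7)) for a single cluster might look as follows; the function signature and the scalar measurement variance r are our choices, not from the paper:

```python
import numpy as np

def appearance_predict(x_post, P_post, Q, r, dT=1.0):
    """One prediction step of Eqs. (4)-(6) for a single cluster.

    x_post : posterior state [c, c_dot, c_ddot] from the previous frame
    P_post : its 3x3 posterior error covariance
    Q, r   : state-noise covariance (3x3) and scalar measurement variance
    """
    A = np.array([[1.0, dT, 0.5 * dT ** 2],
                  [0.0, 1.0, dT],
                  [0.0, 0.0, 1.0]])           # Eq. (3)
    h = np.array([[1.0, 0.0, 0.0]])           # Eq. (7): only c is measured
    x_prior = A @ x_post                      # Eq. (4)
    P_prior = A @ P_post @ A.T + Q            # Eq. (5)
    # Eq. (6): Kalman gain; the bracketed term is a 1x1 matrix here.
    k = P_prior @ h.T @ np.linalg.inv(h @ P_prior @ h.T + r * np.eye(1))
    return x_prior, P_prior, k
```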
2.1.2. Kalman appearance correction

The following equation relates the state vector to the observation z:
$z_{i,k} = h x_{i,k} + t_{i,k}$    (8)
where $t_{i,k}$ is the measurement noise, assumed to be normal with zero mean and covariance $r_i$. The measurement $z_{i,k}$ in the kth frame is the centroid of the ith cluster of pixels (based on their RGB values) in the current frame, obtained using the appearance model of the previous frame. In each frame, after obtaining the observation $z_{i,k}$, the measurement update of the Kalman filter, performed in the Kalman appearance correction block of Fig. 1, is carried out as follows:
$\hat{x}_{i,k} = \hat{x}_{i,k}^- + k_{i,k} \left( z_{i,k} - h \hat{x}_{i,k}^- \right)$    (9)

$P_{i,k} = (I - k_{i,k} h) P_{i,k}^-$    (10)
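The corresponding correction step, again as a hedged sketch for one cluster (the gain k comes from the prediction step above):

```python
import numpy as np

def appearance_correct(x_prior, P_prior, k, z):
    """Measurement update of Eqs. (9)-(10) for one cluster.

    z is the measured centroid of the cluster's pixels in the current
    frame; k is the 3x1 gain from the prediction step.
    """
    h = np.array([[1.0, 0.0, 0.0]])
    x_post = x_prior + (k @ (z - h @ x_prior)).ravel()   # Eq. (9)
    P_post = (np.eye(3) - k @ h) @ P_prior               # Eq. (10)
    return x_post, P_post
```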
Each cluster shifts individually, so an individual Kalman filter is applied to each centroid. The updated state is used to detect and segment the object in the next frame. This is performed by thresholding all pixels within the search area:
$$O(x, y, k) = \begin{cases} 0 & \text{if } |f(x, y, k) - \hat{x}_{i,k}(1)| > T \\ 1 & \text{otherwise} \end{cases}$$    (11)
where f(x, y, k) is the gray level of the pixel at position (x, y) in the kth frame and T is an empirical threshold, chosen equal to 15% of the maximum image gray level; varying T between roughly 10% and 20% of the maximum gray level did not affect the performance of the tracking algorithm significantly. The output O is a binary image in which non-zero pixels specify the object. The centroid of these pixels is taken as the object's position and is used as the measurement for the motion model explained in Section 2.2. It will be shown in Section 3 that segmenting the object with the model updated by the Kalman filter gives drastically better results than segmentation without the Kalman filter.
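A sketch of this segmentation rule is given below; combining the per-cluster tests with a logical OR is our reading of Eq. (11), since one filter runs per cluster:

```python
import numpy as np

def segment_object(search_area_gray, x_posts, T=0.15 * 255):
    """Binarize the search area with Eq. (11): a pixel is labeled as
    object if its gray level lies within T of any updated centroid.

    T defaults to 15% of the maximum gray level of an 8-bit image,
    as suggested in the paper."""
    f = search_area_gray.astype(float)
    O = np.zeros(f.shape, dtype=np.uint8)
    for x_post in x_posts:                 # one updated state per cluster
        O[np.abs(f - x_post[0]) <= T] = 1
    return O
```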
2.2. Motion model

The state variable and state transition equation for a linear motion model with constant acceleration are defined as:
$s_k = [x_k, y_k, \dot{x}_k, \dot{y}_k, \ddot{x}_k, \ddot{y}_k]^T$    (12)

$s_k = \Phi s_{k-1} + e_k$    (13)
$$\Phi = \begin{bmatrix} 1 & 0 & \Delta T & 0 & \frac{1}{2}\Delta T^2 & 0 \\ 0 & 1 & 0 & \Delta T & 0 & \frac{1}{2}\Delta T^2 \\ 0 & 0 & 1 & 0 & \Delta T & 0 \\ 0 & 0 & 0 & 1 & 0 & \Delta T \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$    (14)
in which $(x_k, y_k)$, $(\dot{x}_k, \dot{y}_k)$ and $(\ddot{x}_k, \ddot{y}_k)$ denote the position, velocity and acceleration of the object centroid at the kth frame, respectively, and $e_k$ is the state noise, which is assumed to be normal with zero mean and covariance $Q_k$. The measurement equation has the following form:
$z_k = c\, s_k + \varsigma_k$    (15)

$$c = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \end{bmatrix}$$    (16)
where $\varsigma_k$ is the measurement noise, which is assumed to be normal with zero mean and covariance $r_k$.

2.2.1. Kalman motion correction

In each frame, after obtaining the pixels belonging to the object, their centroid is taken as the measurement $z_k$. Updating the state $\hat{s}_k$, performed in the Kalman motion correction block of Fig. 1, incorporates the measurement $z_k$ into the state:
$\hat{s}_k = \hat{s}_k^- + k_k \left( z_k - c \hat{s}_k^- \right)$    (17)
where $k_k$ is the Kalman gain. The posterior estimate error covariance is computed as follows:

$P_k = (I - k_k c) P_k^-$    (18)

where $P_k^-$ is the prior estimate error covariance.
2.2.2. Kalman motion prediction

In the Kalman motion prediction stage (Fig. 1), the prediction of the state for the next frame is performed:
$\hat{s}_{k+1}^- = \Phi \hat{s}_k$    (19)
The first two elements of $\hat{s}_{k+1}^-$ are the prediction of the object position in the next frame; the center of the search window in the next frame is positioned at these coordinates. The prior estimate error covariance and the Kalman gain for the next frame are computed as follows:
$P_{k+1}^- = \Phi P_k \Phi^T + Q_{k+1}$    (20)

$k_{k+1} = P_{k+1}^- c^T \left( c P_{k+1}^- c^T + r_{k+1} \right)^{-1}$    (21)
2.2.3. Handling occlusion

To make the algorithm robust against occlusion, the adaptive Kalman filtering method proposed in (Weng et al., 2006) was adopted. Accordingly, a parameter $a_k$, called the occlusion rate, is computed in each frame; it corresponds to the proportion of occlusion and is defined as follows:

$$a_k = \begin{cases} 1 - PN_k / PN_{k-1} & \text{if } PN_k / PN_{k-1} \le 1 \\ 0 & \text{otherwise} \end{cases}$$    (22)

where $PN_k$ and $PN_{k-1}$ are the numbers of pixels belonging to the object in the kth and (k-1)th frames. The occlusion rate is used to adjust the Kalman filter parameters automatically. When the occlusion rate is below a threshold, the object is not considered occluded and the Kalman filter works in its normal prediction-updating mode. Since the measurement error and the occlusion rate are directly proportional, and according to (21) the measurement error $r_k$ and the Kalman gain $k_k$ are inversely proportional, the measurement error covariance $r_k$ is set equal to the occlusion rate $a_k$ and the prediction error covariance $Q_k$ is set equal to $(1 - a_k)I$. On the other hand, when the occlusion rate is above the threshold, the object is considered occluded and the measurement error covariance $r_k$ is set to infinity, because under occlusion the measurement is pure noise and its error covariance is infinite; moreover, the prediction error covariance $Q_k$ is set to zero. In this condition, according to (21), the Kalman gain is zero, and consequently, according to (17), the measurement has no effect on the state: tracking is based entirely on prediction. This situation continues until the occlusion rate falls below the threshold again, the object is found, and the measurement is incorporated into the Kalman equations once more.
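A sketch of this adaptive adjustment is shown below. The paper does not give the occlusion-rate threshold, so the value used here is purely illustrative, and a large finite number stands in for the "infinite" measurement variance so that Eq. (21) stays computable:

```python
import numpy as np

def occlusion_rate(pn_k, pn_k_minus_1):
    """Eq. (22): fraction of object pixels lost since the previous frame."""
    if pn_k_minus_1 <= 0:
        return 1.0                  # no reference pixels: treat as occluded
    ratio = pn_k / pn_k_minus_1
    return 1.0 - ratio if ratio <= 1.0 else 0.0

def adapt_noise(a_k, occlusion_threshold=0.8, dim=6):
    """Adjust the motion-filter noise parameters from the occlusion rate.

    Below the threshold: r_k = a_k and Q_k = (1 - a_k) I. Above it the
    object is treated as occluded: Q_k vanishes and the measurement
    variance becomes effectively infinite, so the gain collapses to
    (almost) zero and tracking runs on prediction alone. The threshold
    value here is an assumption, not taken from the paper.
    """
    if a_k < occlusion_threshold:
        return a_k, (1.0 - a_k) * np.eye(dim)
    return 1e12, np.zeros((dim, dim))
```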
3. Experimental results

In this section, results from the proposed algorithm are compared with the following methods:

I. The motion model with the appearance model without applying the Kalman filter (MKF).
II. The appearance model with the Kalman filter but without prediction (MAKF without prediction).
III. Kalman-based template matching (KTM), the method presented in (Nguyen and Smeulders, 2004).
All four methods use the same motion model with a Kalman filter. The algorithms were implemented in Matlab and tested on an Intel(R) Core 2 Duo CPU with 4.00 GB of memory. The test sequences were chosen to evaluate the proposed algorithm in situations with drastic changes in scene illumination and object appearance.

The first experiment evaluates the tracking performance of the new algorithm. Fig. 4 illustrates a few frames of the tracking result on a sequence from the PETS2004 database (Browse1 sequence, frames 500-570). As Fig. 4 shows, despite the heavy change in illumination, the object is tracked successfully by the algorithm.

Fig. 4. Tracking results under extreme illumination changes in frames 4, 60, 90 and 100, respectively (EC Funded CAVIAR project/IST 2001 37540).

The second experiment evaluates the segmentation ability of the algorithm. Fig. 5 shows the results of object segmentation on the same sequence as in Fig. 4 using three different algorithms. This experiment demonstrates that a model predicting the changes in appearance increases the accuracy of object segmentation when the illumination changes drastically (see Fig. 6).

Fig. 5. Results of object segmentation based on MAKF with prediction (b, f, j), MAKF without prediction (c, g, k) and MKF (d, h, l). (a, e, i) show the original image in the search area in frames 15, 26 and 40 (first, second and third rows, respectively) as the object enters areas with less illumination (EC Funded CAVIAR project/IST 2001 37540).

Fig. 6. Euclidean distance between the true and the estimated object centroid for the four algorithms.

In Table 1, the results from the proposed algorithm are compared with those of the other three methods, based on the mean distance error ($d_e$) between the tracked object centroid and the ground truth, which is calculated as:
$d_e = \sqrt{(x_t - x_g)^2 + (y_t - y_g)^2}$    (23)
where $d_e$ is the distance error, $(x_t, y_t)$ are the coordinates of the tracked object centroid, and $(x_g, y_g)$ are the coordinates of the corresponding object in the ground truth.

Table 1. Comparison of the four methods based on the average Euclidean distance between the true and the estimated object centroid (Error, in pixels) and on the processing time (Time, in seconds). The last two columns compare MAKF with prediction against KTM: increase in error and decrease in time (both in percent). A dash means the method lost the object.

Sequence      | MKF          | MAKF w/o pred. | MAKF w/ pred. | KTM           | MAKF w/ pred. vs. KTM
              | Error  Time  | Error  Time    | Error  Time   | Error  Time   | Error incr.  Time decr.
Sequence 1 a  | -      2.514 | 2.55   2.554   | 2.33   2.556  | 2.30   10.316 | 1.3          75
Sequence 2 b  | 4.92   2.378 | 4.81   2.410   | 4.63   2.414  | 4.56   9.573  | 1.5          75
Sequence 3 c  | -      3.673 | 2.72   3.762   | 2.68   3.768  | 2.65   15.138 | 1.1          75
Sequence 4 d  | 2.62   2.731 | 2.46   2.782   | 2.43   2.784  | 2.41   11.025 | 0.8          75
Sequence 5 e  | 3.24   7.184 | 2.94   7.392   | 2.78   7.412  | 2.74   28.417 | 1.4          74

a PETS2004 dataset, Browse1, frames 500-570. b PETS2004 dataset, Browse1, frames 40-100. c BEHAVE dataset, Sequence 03950-4050. d PETS2004 dataset, Rest_InChair, frames 300-400. e WalkByShop, frames 834-1036.

As indicated in Table 1, the error of MKF is around 5% higher than that of MAKF without prediction in the sequences where MKF does not lose the object (MKF fails to track the object in Sequences 1 and 3), and the error of MAKF without prediction is around 4% higher than that of MAKF with prediction. The error of MAKF with prediction is, in turn, around 2% higher than that of KTM, which is not a considerable difference. On the other hand, the computational time of KTM is around four times that of MAKF with prediction. Therefore, with nearly the same accuracy as the method proposed in (Nguyen and Smeulders, 2004), our method reduces the computational complexity drastically.
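For completeness, the evaluation metric of Eq. (23), averaged over a sequence, can be computed as follows (a trivial sketch, names ours):

```python
import numpy as np

def mean_distance_error(tracked, ground_truth):
    """Average Euclidean distance (Eq. (23)) between tracked centroids
    and ground-truth centroids, each given as an (N, 2) array of (x, y)."""
    tracked = np.asarray(tracked, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return float(np.mean(np.linalg.norm(tracked - ground_truth, axis=1)))
```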
4. Conclusion

In this paper an algorithm for object tracking is proposed that uses Kalman filters to reduce error and to provide a prediction for both the motion and the appearance model. With this method we are able to track objects despite large appearance changes. This is possible because we use a linear model for changes in appearance and embed this model in the Kalman filter; we therefore have a prediction of the appearance model for the next frame, which reduces the probability of losing the object when large changes in illumination occur. The algorithm has a much smaller computational complexity than other algorithms that use a Kalman filter to update their appearance models, and it is robust against long-lasting occlusions thanks to the adaptive Kalman filter applied to the motion model.
References

Cheng, S.Y., Trivedi, M.M., 2010. Vision-based infotainment user determination by hand recognition for driver assistance. IEEE Trans. Intell. Transport. Systems 11, 759-764.
Del Bue, A., Comaniciu, D., Ramesh, V., Regazzoni, C., 2002. Smart cameras with real-time video object generation. In: Proceedings of the International Conference on Image Processing, vol. 3, pp. III-429-III-432.
EC Funded CAVIAR project/IST 2001 37540. http://www.dai.ed.ac.uk/homes/rbf/CAVIAR/.
Eidehall, A., Pohl, J., Gustafsson, F., 2007. Joint road geometry estimation and vehicle tracking. Control Eng. Practice 15, 1484-1494.
Ferrari, V., Tuytelaars, T., Van Gool, L., 2001. Real-time affine region tracking and coplanar grouping. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR2001), vol. 2, pp. II-226-II-233.
Hansen, D.W., Qiang, J., 2010. In the eye of the beholder: A survey of models for eyes and gaze. IEEE Trans. Pattern Anal. Machine Intell. 32, 478-500.
Jang, D.S., Choi, H.I., 2000. Active models for tracking moving objects. Pattern Recognition 33, 1135-1146.
Jong Sun, K., Dong Hae, Y., Young Hoon, J., 2011. Fast and robust algorithm of tracking multiple moving objects for intelligent video surveillance systems. IEEE Trans. Consumer Electron. 57, 1165-1170.
Murphy-Chutorian, E., Trivedi, M.M., 2010. Head pose estimation and augmented reality tracking: An integrated system and evaluation for monitoring driver awareness. IEEE Trans. Intell. Transport. Systems 11, 300-311.
Nguyen, H.T., Smeulders, A.W.M., 2004. Fast occluded object tracking by a robust appearance filter. IEEE Trans. Pattern Anal. Machine Intell. 26, 1099-1104.
Otsu, N., 1975. A threshold selection method from gray-level histograms. Automatica 11, 285-296.
Peng, N.S., Yang, J., Liu, Z., 2005. Mean shift blob tracking with kernel histogram filtering and hypothesis testing. Pattern Recognition Lett. 26, 605-614.
Reale, M.J., Canavan, S., Lijun, Y., Kaoning, H., Hung, T., 2011. A multi-gesture interaction system using a 3-D iris disk model for gaze estimation and an active appearance model for 3-D hand pointing. IEEE Trans. Multimedia 13, 474-486.
Weng, S.K., Kuo, C.M., Tu, S.K., 2006. Video object tracking using adaptive Kalman filter. J. Vis. Commun. Image Represent. 17, 1190-1208.
Ying-Li, T., Feris, R.S., Haowei, L., Hampapur, A., Ming-Ting, S., 2011. Robust detection of abandoned and removed objects in complex surveillance videos. IEEE Trans. Systems Man Cybernet. 41, 565-576.
Zhengtao, Z., De, X., Min, T., 2010. Visual measurement and prediction of ball trajectory for table tennis robot. IEEE Trans. Instrument. Measure. 59, 3195-3205.
Zhu, J., Yuanwei, L., Zheng, Y.F., 2010. Object tracking in structured environments for video surveillance applications. IEEE Trans. Circ. Systems Video Technol. 20, 223-235.