Pattern Recognition Letters 33 (2012) 2192–2197
A hybrid motion and appearance prediction model for robust visual object tracking

Hamidreza Jahandide, Kamal Mohamedpour, Hamid Abrishami Moghaddam *

K.N. Toosi University of Technology, Seyed Khandan, P.O. Box 16315-1355, Tehran, Iran

* Corresponding author. Address: Biomedical Engineering Group, Electrical Engineering Department, K.N. Toosi University of Technology, Seyed Khandan, P.O. Box 16315-1355, Tehran, Iran. Tel.: +98 21 84062229; fax: +98 21 88462066. E-mail addresses: [email protected] (H. Jahandide), [email protected] (K. Mohamedpour), [email protected] (H. Abrishami Moghaddam). URL: http://www.ee.kntu.ac.ir/biomedical/index2.htm (H. Abrishami Moghaddam). http://dx.doi.org/10.1016/j.patrec.2012.07.021
Article info

Article history: Received 11 March 2012; available online 10 August 2012. Communicated by G. Borgefors.

Keywords: Visual object tracking; Occlusion; Appearance prediction; Adaptive Kalman filter
Abstract

In this paper a new video object tracking method is proposed. A hybrid model based on motion and appearance is constructed for the object, and a Kalman filter is applied to both components in order to reduce noise and provide a prediction for the next frame. Making a prediction of the object appearance in the next frame available contributes effectively to robust object tracking in spite of large changes in scene illumination. Experimental results using the proposed method and its counterparts without appearance prediction demonstrate the superiority of the novel hybrid prediction method under drastic changes in illumination.

© 2012 Elsevier B.V. All rights reserved.
1. Introduction

Object tracking is an important task within the field of computer vision and has many applications such as surveillance (Ying-Li et al., 2011; Jong Sun et al., 2011; Zhu et al., 2010), human-computer interaction systems (Hansen and Qiang, 2010; Reale et al., 2011), augmented reality (Ferrari et al., 2001), robotics (Zhengtao et al., 2010), video compression (Del Bue et al., 2002), and driver assistance (Cheng and Trivedi, 2010; Eidehall et al., 2007; Murphy-Chutorian and Trivedi, 2010). Changes in object appearance and the occurrence of occlusion are among the main challenges in object tracking. Changes in background or object appearance may occur due to changes in illumination, deformation of objects, changes in object pose, and so on. To handle changes in appearance, a model of the object that evolves frame by frame and captures the variation of the object image is required (Jang and Choi, 2000). However, when there are drastic changes in the appearance of the target, model evolution is not enough to prevent the algorithm from failing. To overcome this drawback, we propose to complete the model with a prediction of the object appearance in the next frame.

Nguyen and Smeulders (2004) use a Kalman filter to smooth the appearance model. They use the color components of each pixel as a feature vector and apply a Kalman filter to each pixel in the target region. However, the application of an individual Kalman filter to
each pixel is redundant, since almost any change in lighting conditions affects the brightness of all pixels in the target region, and there is no need to process each pixel individually. In (Peng et al., 2005) an algorithm is proposed that uses a Kalman filter to filter the object kernel histogram instead of the object trajectory. Bins in the histogram are filtered with individual Kalman filters, supposing that the value of each bin changes independently. Due to the lack of any prediction model for the state equation, the only state equation available for the Kalman filter would be:
$\hat{x}_k^- = \hat{x}_{k-1}$    (1)
where $\hat{x}_k^-$ is the a priori state estimate at step $k$ and $\hat{x}_{k-1}$ is the a posteriori state estimate at step $k-1$. The above equation actually provides no prediction for the next frame; hence the ability of the Kalman filter to reduce error is limited. Furthermore, the method suffers from high computational complexity, since there are several bins and an individual Kalman filter must be applied to each of them.

In this paper, an algorithm is proposed that uses Kalman filters to estimate both motion and appearance information in order to predict object position and color. Predicting the color of the object for the next frame makes our algorithm robust to changes in illumination. The method takes advantage of the fact that the motion and the color distribution of the object are independent: when one of the motion or appearance models deviates from reality and the algorithm is about to lose the object, the other model can help to keep track of it.

The paper is structured as follows: In Section 2 the proposed method is described and the appearance model is constructed. This section also explains how the algorithm handles occlusion. In Section 3 experimental results are presented, and conclusions are drawn in the last section.
2. Proposed method

Fig. 1 shows the data flow diagram of the algorithm. The user is required to specify the position of the object in the first frame. Background subtraction is performed in the first frame in order to determine which pixels belong to the object: the first frame is subtracted from the background image and, using Otsu's method (Otsu, 1975), a threshold is found for converting the subtraction result to a binary image. In the binary image, pixels with a non-zero value are considered to belong to the object. After the first frame, background subtraction is no longer required and the appearance model provides the necessary information to detect the pixels of the object in subsequent frames. Keeping background subtraction active beyond the first frame did not significantly improve the tracking performance; it only increased the computational complexity.

Fig. 1. Data flow diagram for the proposed algorithm.
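As an illustration, this initialization step could be implemented as follows. This is a minimal sketch assuming OpenCV is available; the function and variable names are ours, not from the paper:

```python
import cv2

def init_object_mask(first_frame_gray, background_gray):
    """Locate object pixels in the first frame by background subtraction.

    Both inputs are expected as 8-bit grayscale images of the same size.
    The absolute difference is binarized with Otsu's threshold; non-zero
    pixels of the result are taken to belong to the object.
    """
    diff = cv2.absdiff(first_frame_gray, background_gray)
    # Otsu's method picks the threshold that maximizes the between-class
    # variance of the difference image's histogram.
    _, mask = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return mask  # non-zero entries mark object pixels
```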
2.1. Appearance model

In this stage, an appearance model is created from state variables extracted from the histogram of the object, and a Kalman filter is applied to these variables. Applying the Kalman filter to the appearance state variables reduces error and provides a prediction of the object appearance in the next frame. To create an appearance model, the authors in (Peng et al., 2005) use a kernel histogram composed of m bins; the set {p_b}, b = 1, ..., m, represents the object model, in which p_b is the bth bin of the kernel histogram. Each bin is treated as a state variable and is filtered by an individual Kalman filter. For example, with 32 bins per channel in the kernel histogram, there are 32^3 = 32,768 bins in the RGB space. Although many of these bins have zero values and the number of non-zero bins is much smaller (about 100 according to the experimental results in (Peng et al., 2005)), the number of variables to be filtered by the Kalman filter remains too high.

To overcome this drawback, we propose a different approach to construct the appearance model. The pixels of the object (which are known thanks to background subtraction) are clustered based on their brightness values. The centroid of each cluster is defined as the first element of a state vector; the second and third elements of the state vector are the first and second derivatives of the first element, respectively. Fig. 2 shows an object before and after background subtraction together with its gray-level histogram. It is clear from Fig. 2(c) that there are two dominant colors in the object; if we cluster the pixels of the object into two clusters, the centroid of each cluster represents one of these colors.

Fig. 2. The object before (a) and after (b) background subtraction and its histogram after transforming the color image to gray level (c).
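A sketch of how the per-cluster appearance states could be initialized is given below. The paper does not name a clustering algorithm, so a plain 1-D k-means over gray levels is assumed here; all names are illustrative:

```python
import numpy as np

def init_appearance_states(frame_gray, mask, n_clusters=2, n_iter=20):
    """Cluster the object pixels by gray level and build one appearance
    state vector [c, c_dot, c_ddot] per cluster (derivatives start at 0)."""
    values = frame_gray[mask > 0].astype(float)
    # Spread the initial centroids over the observed gray-level range,
    # then run a few Lloyd iterations in one dimension.
    cents = np.linspace(values.min(), values.max(), n_clusters)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(values[:, None] - cents[None, :]), axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):
                cents[j] = values[labels == j].mean()
    # First element: cluster centroid; second and third: its first and
    # second temporal derivatives, initialized to zero.
    return [np.array([c, 0.0, 0.0]) for c in cents]
```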
2.1.1. Kalman appearance prediction

During tracking, it is expected that the object appearance preserves its dominant-color characteristics and that changes in illumination merely shift these dominant colors along the histogram. Hence, a model of this color shift (due to changes in object appearance) can be used to predict the position of each cluster centroid in the next frame. As will be demonstrated, such a model reduces the probability of losing the object due to a change in its appearance. Fig. 3 shows the illumination-component histogram of the object window under increasing illumination; as the illumination increases, the centroids of the bins in the histogram shift upward.

Fig. 3. (a-c) Three frames (1st, 50th and 70th) in which the white rectangle specifies the object window for which the histogram is computed. (d-f) The illumination component histogram of the object window in the three frames under illumination changes.

Supposing that this shift is linear with constant second-order changes, the state equation in the kth frame would be:
$x_{i,k} = A x_{i,k-1} + w_{i,k-1}$    (2)
in which $x_{i,k}$ is the state vector at the kth frame for the ith cluster; its components are the centroid of the ith cluster and its first- and second-order temporal derivatives, respectively ($x_{i,k} = [c, \dot{c}, \ddot{c}]^T$). $w_{i,k-1}$ is the state noise, which is assumed to be normal with zero mean and covariance $Q_i$. With the assumption that the shift of the cluster centroids is linear with constant second-order derivative, the state transition matrix A has the following structure:
$$A = \begin{bmatrix} 1 & \Delta T & \frac{1}{2}\Delta T^2 \\ 0 & 1 & \Delta T \\ 0 & 0 & 1 \end{bmatrix}$$    (3)
where $\Delta T$ represents the time interval between successive frames. Based on the above assumptions, the Kalman equations for predicting the appearance model, performed in the Kalman appearance prediction block of Fig. 1, can be written as:
$\hat{x}_{i,k}^- = A \hat{x}_{i,k-1}$    (4)

$P_{i,k}^- = A P_{i,k-1} A^T + Q_i$    (5)

$k_{i,k} = P_{i,k}^- h^T \left( h P_{i,k}^- h^T + r_i \right)^{-1}$    (6)
where h is a vector that relates the state vector and measurement:
$h = [1 \;\; 0 \;\; 0]$    (7)
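A minimal NumPy sketch of this prediction step (Eqs. (3)-(7)) for a single cluster might look as follows; the function signature and the scalar measurement variance r are our choices, not from the paper:

```python
import numpy as np

def appearance_predict(x_post, P_post, Q, r, dT=1.0):
    """One prediction step of Eqs. (4)-(6) for a single cluster.

    x_post : posterior state [c, c_dot, c_ddot] from the previous frame
    P_post : its 3x3 posterior error covariance
    Q, r   : state-noise covariance (3x3) and scalar measurement variance
    """
    A = np.array([[1.0, dT, 0.5 * dT ** 2],
                  [0.0, 1.0, dT],
                  [0.0, 0.0, 1.0]])           # Eq. (3)
    h = np.array([[1.0, 0.0, 0.0]])           # Eq. (7): only c is measured
    x_prior = A @ x_post                      # Eq. (4)
    P_prior = A @ P_post @ A.T + Q            # Eq. (5)
    # Eq. (6): Kalman gain; the bracketed term is a 1x1 matrix here.
    k = P_prior @ h.T @ np.linalg.inv(h @ P_prior @ h.T + r * np.eye(1))
    return x_prior, P_prior, k
```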
2.1.2. Kalman appearance correction

The following equation relates the state vector to the observation z:
$z_{i,k} = h x_{i,k} + t_{i,k}$    (8)
where $t_{i,k}$ is the measurement noise, assumed to be normal with zero mean and covariance $r_i$. The measurement $z_{i,k}$ in the kth frame is the centroid of the ith cluster of pixels (based on their RGB values) in the current frame, obtained using the appearance model of the previous frame. In each frame, after obtaining the observation $z_{i,k}$, the measurement update of the Kalman filter, performed in the Kalman appearance correction block of Fig. 1, is carried out as follows:
$\hat{x}_{i,k} = \hat{x}_{i,k}^- + k_{i,k} \left( z_{i,k} - h \hat{x}_{i,k}^- \right)$    (9)

$P_{i,k} = (I - k_{i,k} h) P_{i,k}^-$    (10)
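The corresponding correction step, again as a hedged sketch for one cluster (the gain k comes from the prediction step above):

```python
import numpy as np

def appearance_correct(x_prior, P_prior, k, z):
    """Measurement update of Eqs. (9)-(10) for one cluster.

    z is the measured centroid of the cluster's pixels in the current
    frame; k is the 3x1 gain from the prediction step.
    """
    h = np.array([[1.0, 0.0, 0.0]])
    x_post = x_prior + (k @ (z - h @ x_prior)).ravel()   # Eq. (9)
    P_post = (np.eye(3) - k @ h) @ P_prior               # Eq. (10)
    return x_post, P_post
```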
Each cluster shifts individually, so an individual Kalman filter is applied to each centroid. The updated state is used to detect and segment the object in the next frame. This is performed by thresholding all pixels within the search area:
$$O(x, y, k) = \begin{cases} 0 & \text{if } |f(x, y, k) - \hat{x}_{i,k}(1)| > T \\ 1 & \text{otherwise} \end{cases}$$    (11)
where f(x, y, k) is the gray level of the pixel at position (x, y) in the kth frame and T is an empirical threshold, chosen equal to 15% of the maximum image gray level; varying T between roughly 10% and 20% of the maximum gray level did not affect the performance of the tracking algorithm significantly. The output O is a binary image in which non-zero pixels specify the object. The centroid of these pixels is taken as the object's position and is used as the measurement for the motion model explained in Section 2.2. It will be shown in Section 3 that segmenting the object with the model updated by the Kalman filter gives drastically better results than segmentation without the Kalman filter.
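A sketch of this segmentation rule is given below; combining the per-cluster tests with a logical OR is our reading of Eq. (11), since one filter runs per cluster:

```python
import numpy as np

def segment_object(search_area_gray, x_posts, T=0.15 * 255):
    """Binarize the search area with Eq. (11): a pixel is labeled as
    object if its gray level lies within T of any updated centroid.

    T defaults to 15% of the maximum gray level of an 8-bit image,
    as suggested in the paper."""
    f = search_area_gray.astype(float)
    O = np.zeros(f.shape, dtype=np.uint8)
    for x_post in x_posts:                 # one updated state per cluster
        O[np.abs(f - x_post[0]) <= T] = 1
    return O
```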
2.2. Motion model

The state variable and state transition equation for a linear motion model with constant acceleration are defined as:
$s_k = [x_k, y_k, \dot{x}_k, \dot{y}_k, \ddot{x}_k, \ddot{y}_k]^T$    (12)

$s_k = \Phi s_{k-1} + e_k$    (13)
$$\Phi = \begin{bmatrix} 1 & 0 & \Delta T & 0 & \frac{1}{2}\Delta T^2 & 0 \\ 0 & 1 & 0 & \Delta T & 0 & \frac{1}{2}\Delta T^2 \\ 0 & 0 & 1 & 0 & \Delta T & 0 \\ 0 & 0 & 0 & 1 & 0 & \Delta T \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$    (14)
in which $(x_k, y_k)$, $(\dot{x}_k, \dot{y}_k)$ and $(\ddot{x}_k, \ddot{y}_k)$ denote the position, velocity and acceleration of the object centroid at the kth frame, respectively, and $e_k$ is the state noise, which is assumed to be normal with zero mean and covariance $Q_k$. The measurement equation has the following form:
$z_k = c\, s_k + \varsigma_k$    (15)

$$c = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \end{bmatrix}$$    (16)
where $\varsigma_k$ is the measurement noise, which is assumed to be normal with zero mean and covariance $r_k$.

2.2.1. Kalman motion correction

In each frame, after obtaining the pixels belonging to the object, their centroid is taken as the measurement $z_k$. Updating the state $\hat{s}_k$, performed in the Kalman motion correction block of Fig. 1, incorporates the measurement $z_k$ into the state:
$\hat{s}_k = \hat{s}_k^- + k_k \left( z_k - c \hat{s}_k^- \right)$    (17)
where $k_k$ is the Kalman gain. The posterior estimate error covariance is computed as follows:

$P_k = (I - k_k c) P_k^-$    (18)

where $P_k^-$ is the prior estimate error covariance.
2.2.2. Kalman motion prediction

In the Kalman motion prediction stage (Fig. 1), the prediction of the state for the next frame is performed:
$\hat{s}_{k+1}^- = \Phi \hat{s}_k$    (19)
The first two elements of $\hat{s}_{k+1}^-$ are the prediction of the object position in the next frame; the center of the search window in the next frame is positioned at these coordinates. The prior estimate error covariance and the Kalman gain for the next frame are computed as follows:
$P_{k+1}^- = \Phi P_k \Phi^T + Q_{k+1}$    (20)

$k_{k+1} = P_{k+1}^- c^T \left( c P_{k+1}^- c^T + r_{k+1} \right)^{-1}$    (21)
2.2.3. Handling occlusion

To make the algorithm robust against occlusion, the adaptive Kalman filtering method proposed in (Weng et al., 2006) was adopted. Accordingly, a parameter $a_k$, called the occlusion rate, is computed in each frame; it corresponds to the proportion of occlusion and is defined as follows:

$$a_k = \begin{cases} 1 - PN_k / PN_{k-1} & \text{if } PN_k / PN_{k-1} \le 1 \\ 0 & \text{otherwise} \end{cases}$$    (22)

where $PN_k$ and $PN_{k-1}$ are the numbers of pixels belonging to the object in the kth and (k-1)th frames. The occlusion rate is used to adjust the Kalman filter parameters automatically. When the occlusion rate is below a threshold, the object is not considered occluded and the Kalman filter works in its normal prediction-updating mode. Since the measurement error and the occlusion rate are directly proportional, and according to (21) the measurement error $r_k$ and the Kalman gain $k_k$ are inversely proportional, the measurement error covariance $r_k$ is set equal to the occlusion rate $a_k$ and the prediction error covariance $Q_k$ is set equal to $(1 - a_k)I$. On the other hand, when the occlusion rate is above the threshold, the object is considered occluded and the measurement error covariance $r_k$ is set to infinity, because under occlusion the measurement is pure noise and its error covariance is infinite; moreover, the prediction error covariance $Q_k$ is set to zero. In this condition, according to (21), the Kalman gain is zero, and consequently, according to (17), the measurement has no effect on the state: tracking is based entirely on prediction. This situation continues until the occlusion rate falls below the threshold again, the object is found, and the measurement is incorporated into the Kalman equations once more.
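A sketch of this adaptive adjustment is shown below. The paper does not give the occlusion-rate threshold, so the value used here is purely illustrative, and a large finite number stands in for the "infinite" measurement variance so that Eq. (21) stays computable:

```python
import numpy as np

def occlusion_rate(pn_k, pn_k_minus_1):
    """Eq. (22): fraction of object pixels lost since the previous frame."""
    if pn_k_minus_1 <= 0:
        return 1.0                  # no reference pixels: treat as occluded
    ratio = pn_k / pn_k_minus_1
    return 1.0 - ratio if ratio <= 1.0 else 0.0

def adapt_noise(a_k, occlusion_threshold=0.8, dim=6):
    """Adjust the motion-filter noise parameters from the occlusion rate.

    Below the threshold: r_k = a_k and Q_k = (1 - a_k) I. Above it the
    object is treated as occluded: Q_k vanishes and the measurement
    variance becomes effectively infinite, so the gain collapses to
    (almost) zero and tracking runs on prediction alone. The threshold
    value here is an assumption, not taken from the paper.
    """
    if a_k < occlusion_threshold:
        return a_k, (1.0 - a_k) * np.eye(dim)
    return 1e12, np.zeros((dim, dim))
```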
3. Experimental results

In this section, results from the proposed algorithm are compared with the following methods:

I. The motion model with the appearance model without applying the Kalman filter (MKF).
II. The appearance model with the Kalman filter but without prediction (MAKF without prediction).
III. Kalman-based template matching (KTM), the method presented in (Nguyen and Smeulders, 2004).
All four methods use the same motion model with a Kalman filter. The algorithms were implemented in Matlab and tested on an Intel(R) Core 2 Duo CPU with 4.00 GB of memory. The test sequences were chosen to evaluate the proposed algorithm in situations with drastic changes in scene illumination and object appearance.

The first experiment evaluates the tracking performance of the new algorithm. Fig. 4 illustrates a few frames of the tracking result on a sequence from the PETS2004 database (Browse1 sequence, frames 500-570). As Fig. 4 shows, despite the heavy change in illumination, the object is tracked successfully by the algorithm.

Fig. 4. Tracking results under extreme illumination changes in frames 4, 60, 90 and 100, respectively (EC Funded CAVIAR project/IST 2001 37540).

The second experiment evaluates the segmentation ability of the algorithm. Fig. 5 shows the results of object segmentation on the same sequence as in Fig. 4 using three different algorithms. This experiment demonstrates that a model predicting the changes in appearance increases the accuracy of object segmentation when the illumination changes drastically (see Fig. 6).

Fig. 5. Results of object segmentation based on MAKF with prediction (b, f, j), MAKF without prediction (c, g, k) and MKF (d, h, l). (a, e, i) show the original image in the search area in frames 15, 26 and 40 (first, second and third rows, respectively) as the object enters areas with less illumination (EC Funded CAVIAR project/IST 2001 37540).

Fig. 6. Euclidean distance between the true and the estimated object centroid for the four algorithms.

In Table 1, the results from the proposed algorithm are compared with those of the other three methods, based on the mean distance error ($d_e$) between the tracked object centroid and the ground truth, which is calculated as:
$d_e = \sqrt{(x_t - x_g)^2 + (y_t - y_g)^2}$    (23)
where $d_e$ is the distance error, $(x_t, y_t)$ are the coordinates of the tracked object centroid, and $(x_g, y_g)$ are the coordinates of the corresponding object in the ground truth.

Table 1. Comparison of the four methods based on the average Euclidean distance between the true and the estimated object centroid (Error, in pixels) and on the processing time (Time, in seconds). The last two columns compare MAKF with prediction against KTM: increase in error and decrease in time (both in percent). A dash means the method lost the object.

Sequence      | MKF          | MAKF w/o pred. | MAKF w/ pred. | KTM           | MAKF w/ pred. vs. KTM
              | Error  Time  | Error  Time    | Error  Time   | Error  Time   | Error incr.  Time decr.
Sequence 1 a  | -      2.514 | 2.55   2.554   | 2.33   2.556  | 2.30   10.316 | 1.3          75
Sequence 2 b  | 4.92   2.378 | 4.81   2.410   | 4.63   2.414  | 4.56   9.573  | 1.5          75
Sequence 3 c  | -      3.673 | 2.72   3.762   | 2.68   3.768  | 2.65   15.138 | 1.1          75
Sequence 4 d  | 2.62   2.731 | 2.46   2.782   | 2.43   2.784  | 2.41   11.025 | 0.8          75
Sequence 5 e  | 3.24   7.184 | 2.94   7.392   | 2.78   7.412  | 2.74   28.417 | 1.4          74

a PETS2004 dataset, Browse1, frames 500-570. b PETS2004 dataset, Browse1, frames 40-100. c BEHAVE dataset, Sequence 03950-4050. d PETS2004 dataset, Rest_InChair, frames 300-400. e WalkByShop, frames 834-1036.

As indicated in Table 1, the error of MKF is around 5% higher than that of MAKF without prediction in the sequences where MKF does not lose the object (MKF fails to track the object in Sequences 1 and 3), and the error of MAKF without prediction is around 4% higher than that of MAKF with prediction. The error of MAKF with prediction is, in turn, around 2% higher than that of KTM, which is not a considerable difference. On the other hand, the computational time of KTM is around four times that of MAKF with prediction. Therefore, with nearly the same accuracy as the method proposed in (Nguyen and Smeulders, 2004), our method reduces the computational complexity drastically.
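For completeness, the evaluation metric of Eq. (23), averaged over a sequence, can be computed as follows (a trivial sketch, names ours):

```python
import numpy as np

def mean_distance_error(tracked, ground_truth):
    """Average Euclidean distance (Eq. (23)) between tracked centroids
    and ground-truth centroids, each given as an (N, 2) array of (x, y)."""
    tracked = np.asarray(tracked, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return float(np.mean(np.linalg.norm(tracked - ground_truth, axis=1)))
```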
4. Conclusion

In this paper an algorithm for object tracking is proposed that uses Kalman filters to reduce error and to provide a prediction for both the motion and the appearance model. With this method we are able to track objects despite large appearance changes. This is possible because we use a linear model for changes in appearance and embed this model in the Kalman filter; we therefore have a prediction of the appearance model for the next frame, which reduces the probability of losing the object when large changes in illumination occur. The algorithm has a much smaller computational complexity than other algorithms that use a Kalman filter to update their appearance models, and it is robust against long-lasting occlusions thanks to the adaptive Kalman filter applied to the motion model.
References

Cheng, S.Y., Trivedi, M.M., 2010. Vision-based infotainment user determination by hand recognition for driver assistance. IEEE Trans. Intell. Transport. Systems 11, 759-764.
Del Bue, A., Comaniciu, D., Ramesh, V., Regazzoni, C., 2002. Smart cameras with real-time video object generation. In: Proceedings of the International Conference on Image Processing, vol. 3, pp. III-429-III-432.
EC Funded CAVIAR project/IST 2001 37540. http://www.dai.ed.ac.uk/homes/rbf/CAVIAR/.
Eidehall, A., Pohl, J., Gustafsson, F., 2007. Joint road geometry estimation and vehicle tracking. Control Eng. Practice 15, 1484-1494.
Ferrari, V., Tuytelaars, T., Van Gool, L., 2001. Real-time affine region tracking and coplanar grouping. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR2001), vol. 2, pp. II-226-II-233.
Hansen, D.W., Qiang, J., 2010. In the eye of the beholder: A survey of models for eyes and gaze. IEEE Trans. Pattern Anal. Machine Intell. 32, 478-500.
Jang, D.S., Choi, H.I., 2000. Active models for tracking moving objects. Pattern Recognition 33, 1135-1146.
Jong Sun, K., Dong Hae, Y., Young Hoon, J., 2011. Fast and robust algorithm of tracking multiple moving objects for intelligent video surveillance systems. IEEE Trans. Consumer Electron. 57, 1165-1170.
Murphy-Chutorian, E., Trivedi, M.M., 2010. Head pose estimation and augmented reality tracking: An integrated system and evaluation for monitoring driver awareness. IEEE Trans. Intell. Transport. Systems 11, 300-311.
Nguyen, H.T., Smeulders, A.W.M., 2004. Fast occluded object tracking by a robust appearance filter. IEEE Trans. Pattern Anal. Machine Intell. 26, 1099-1104.
Otsu, N., 1975. A threshold selection method from gray-level histograms. Automatica 11, 285-296.
Peng, N.S., Yang, J., Liu, Z., 2005. Mean shift blob tracking with kernel histogram filtering and hypothesis testing. Pattern Recognition Lett. 26, 605-614.
Reale, M.J., Canavan, S., Lijun, Y., Kaoning, H., Hung, T., 2011. A multi-gesture interaction system using a 3-D iris disk model for gaze estimation and an active appearance model for 3-D hand pointing. IEEE Trans. Multimedia 13, 474-486.
Weng, S.K., Kuo, C.M., Tu, S.K., 2006. Video object tracking using adaptive Kalman filter. J. Vis. Commun. Image Represent. 17, 1190-1208.
Ying-Li, T., Feris, R.S., Haowei, L., Hampapur, A., Ming-Ting, S., 2011. Robust detection of abandoned and removed objects in complex surveillance videos. IEEE Trans. Systems Man Cybernet. 41, 565-576.
Zhengtao, Z., De, X., Min, T., 2010. Visual measurement and prediction of ball trajectory for table tennis robot. IEEE Trans. Instrument. Measure. 59, 3195-3205.
Zhu, J., Yuanwei, L., Zheng, Y.F., 2010. Object tracking in structured environments for video surveillance applications. IEEE Trans. Circ. Systems Video Technol. 20, 223-235.