Region tracking for non-rigid video objects in a non-parametric MAP framework

Chiou-Ting Hsu, Ming-Shen Hsieh

Department of Computer Science, National Tsing Hua University, Taiwan

Signal Processing: Image Communication 21 (2006) 235–251. doi:10.1016/j.image.2005.10.002

Received 19 November 2004; received in revised form 11 October 2005; accepted 13 October 2005

Abstract

This paper presents a non-parametric maximum a posteriori (MAP) framework for tracking non-rigid video objects. We formulate region tracking as a MAP estimation problem and define the probabilistic models in terms of the distances between the intensity distribution of the object and those of its spatial and temporal neighborhoods. Furthermore, in order to better model the complex intensity changes caused by non-rigid movement, we propose a non-parametric method to approximate the likelihood and prior terms of the MAP problem. The proposed non-parametric estimation algorithm relies mostly on intensity features and requires no time-consuming motion estimation. Finally, we employ a contour evolution method in the MAP optimization step to iteratively track the object contour. The experimental results demonstrate that the proposed method achieves satisfactory results and outperforms the previous parametric method. © 2005 Elsevier B.V. All rights reserved.

Keywords: MAP model; Non-parametric density estimation; Non-rigid motion; Curve evolution

1. Introduction

Region tracking is an important preprocessing task in various applications such as object-based video coding, video surveillance and video retrieval. Several techniques [20,19] assume that a video object follows a uniform motion or has a parametric shape and thus need only a limited number of parameters to track the object. However, for a video object composed of multiple regions with different moving behaviors, the inconsistent motions and the resulting self-occlusion become a tough issue in region tracking. Moreover, if the target object is a

non-rigid moving object, the tracking problem becomes even more challenging because of its time-varying deformation. Many tracking techniques have been proposed to deal with object deformation and occlusion problems. For ease of discussion, we classify these methods into three categories: contour-based, region-based and topology-independent methods. Contour-based methods [4–8,14,15,17,19] use a contour (or boundary) to encompass a tracking region and keep fitting the contour to certain visual features detected from each video frame. The method proposed in [14,15] updates the contour by partially matching the edge points of a tracking region across contiguous frames. The snake-based method proposed in [6,8] assumes the contour is


aligned to pixels with several specific features and updates the contour by minimizing energy terms defined by these features. In [17], a contour is determined by combining the current edge map and the contour predicted from the tracking result in the previous frame. Region-based methods [10,12,18,22,23] track a video object by computing and updating the motions of a set of regions belonging to the target object. The method proposed in [10] combines spatial and temporal information to track a moving object. First, a video frame is spatially segmented into regions, and the segmented regions are classified into moving or non-moving regions according to their temporal information. A moving object is then constructed by merging a set of spatially connected moving regions. In [22], tracking of moving objects is modeled as a graph labeling problem. The motion of each foreground region is first estimated by region matching. A tracking memory is then used to ensure the temporal coherency of these foreground regions during the whole tracking process. Although contour- and region-based methods solve many difficulties in the region tracking problem, some issues need further discussion. Contour-based methods are generally ineffective at tracking regions without strong contour features (e.g. weak edges), because energy terms derived from these weak features tend to converge to false boundaries. On the other hand, though region-based methods try to estimate and track the motion of each region, the tracked region contours are usually unsatisfactory [13]. Therefore, in [13], a topology-independent method, which imposes few constraints on the video object, is proposed to solve the above difficulties. The method proposed in [13] assumes that the intensity difference between successive frames is a zero-mean Gaussian process and that the luminance and chrominance statistics of the video object are nearly constant. The tracking problem in [13] is then formulated as a maximum a posteriori (MAP) estimation problem and the resulting probability maximization is expressed as a level set partial differential equation. Although the tracking results in [13] are very promising, the performance could be further improved if two more issues were taken into consideration. First, though a Gaussian model is a good approximation for frame-to-frame differences in most cases, it is insufficient to represent the complex intensity changes caused by the movement of a non-rigid

moving object. Second, the tracking performance should be insensitive to control parameters, because parameter setting is usually a non-trivial task even for an experienced user. Hence, this paper proposes an efficient region tracking algorithm for non-rigid moving objects and tries to lessen its sensitivity to parameter selection. As suggested in [13], since motion estimation is time-consuming and usually unreliable for deformed and occluded objects, we include no motion fields in our tracking process. Our tracking problem is similarly formulated in a MAP framework and relies merely on the general assumption of a constant intensity distribution between consecutive frames. However, unlike [13], we propose to use a non-parametric method to estimate the prior and likelihood models. Our major concern is based on the observation that, in most non-rigid moving cases, the distributions of intensity changes are usually multi-modal and non-Gaussian. We believe a non-parametric method can approximate such distributions better than a single Gaussian. In the MAP estimation process, we employ a contour evolution process to iteratively track the region contour. Our experiments show that the proposed method is efficient, depends little on the control parameters, and achieves better performance than the parametric method.

This paper is organized as follows. Section 2 elaborates the MAP model, our proposed non-parametric estimation and the curve evolution process. Section 3 shows our experimental results and comparisons with the parametric method. Section 4 discusses the selection of control parameters and the limitations of the proposed method, and suggests a possible extension. Finally, Section 5 concludes this paper.

2. Proposed method

2.1. MAP model

We formulate the region tracking problem as a MAP estimation process. Let $I_n$ denote the nth frame in a video sequence, $R_n$ denote the region of our target object in $I_n$, and $R_n^c$ denote the complement region of $R_n$. The tracking problem is modeled as estimating the region $R_n$, when the current frame $I_n$, the previous frame $I_{n-1}$ and the segmented region $R_{n-1}$ in the previous frame are given. The region $R_n$ is determined by maximizing the a

ARTICLE IN PRESS C.-T. Hsu, M.-S. Hsieh / Signal Processing: Image Communication 21 (2006) 235–251

posteriori probability:

$$\hat{R}_n = \arg\max_{R_n} p(R_n \mid R_{n-1}, I_n, I_{n-1}). \qquad (1)$$

Using the Bayes formula, the MAP estimation is rewritten as

$$\hat{R}_n = \arg\max_{R_n} \frac{p(I_n \mid R_{n-1}, R_n, I_{n-1})\, p(R_n \mid R_{n-1}, I_{n-1})}{p(I_n \mid I_{n-1}, R_{n-1})} = \arg\max_{R_n} p(I_n \mid R_{n-1}, R_n, I_{n-1})\, p(R_n \mid R_{n-1}, I_{n-1}), \qquad (2)$$

where the likelihood model $p(I_n \mid R_{n-1}, R_n, I_{n-1})$ measures how faithful the estimates of $R_{n-1}$ and $R_n$ are for the current frame $I_n$ when the previous frame $I_{n-1}$ is given, and the term $p(R_n \mid R_{n-1}, I_{n-1})$ represents the prior probability of $R_n$ conditioned on $R_{n-1}$ and $I_{n-1}$.

2.2. Initialization

To initialize the MAP estimation for the first frame, we simplify Eq. (1) as

$$\hat{R}_n = \arg\max_{R_n} p(R_n \mid I_n). \qquad (3)$$

Hence, the problem in Eq. (3) becomes a spatial region segmentation problem. Here, we model the spatial region segmentation of $I_n$ as a mode seeking process and employ the mean-shift based technique [2] to iteratively converge to the mode along the gradient direction. The detailed steps are described below. First, assuming the frame $I_n$ has $N$ pixels with spatial coordinates $x_j = (x_{j1}, x_{j2})^T$, $j = 1, \ldots, N$, we measure the density at the ith pixel $x_i$ using a kernel density estimator:

$$\hat{f}(v_i) = \frac{1}{N}\sum_{j=1}^{N} K_{H_f}(v_i - v_j) = \frac{1}{N H_f^d}\sum_{j=1}^{N} K\!\left(\frac{v_i - v_j}{H_f}\right), \qquad (4)$$

where $v_i$ and $v_j$ are the d-dimensional feature vectors of the pixels $x_i$ and $x_j$, respectively, $K$ is a kernel function, and $H_f$ indicates the kernel bandwidth. As suggested in [2], we incorporate both the color components and the spatial coordinates to form a five-dimensional feature vector (i.e. three color components and two spatial coordinates) for each pixel, and use the Epanechnikov kernel [1] to estimate the density. The Epanechnikov kernel

function is defined as

$$K(v) = \begin{cases} 1 - \lVert v \rVert^2 & \text{if } \lVert v \rVert \le 1, \\ 0 & \text{if } \lVert v \rVert > 1. \end{cases} \qquad (5)$$
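To make Eqs. (4) and (5) concrete, the sketch below evaluates the Epanechnikov kernel density estimate and one mean-shift update. This is a minimal illustration under our own assumptions (NumPy, an O(N²) pairwise computation kept only for clarity, and hypothetical function names); it is not the authors' implementation.

```python
import numpy as np

def epanechnikov(v):
    # Epanechnikov kernel of Eq. (5); v has shape (..., d)
    sq = np.sum(v * v, axis=-1)
    return np.where(sq <= 1.0, 1.0 - sq, 0.0)

def density(features, h):
    # Kernel density estimate of Eq. (4) at every feature vector.
    # features: (N, d) per-pixel vectors, e.g. three color components
    # plus two spatial coordinates (d = 5); h: bandwidth H_f.
    n, d = features.shape
    diffs = (features[:, None, :] - features[None, :, :]) / h
    return epanechnikov(diffs).sum(axis=1) / (n * h ** d)

def mean_shift_step(v, features, h):
    # One mean-shift update toward the local mode: with the
    # Epanechnikov kernel the update is the mean of the samples
    # falling inside the bandwidth ball around v.
    inside = epanechnikov((features - v) / h) > 0.0
    return features[inside].mean(axis=0)
```

Iterating mean_shift_step from each pixel's feature vector until it stops moving groups the pixels by the mode they converge to, which yields the region map described next.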

The mean-shift procedure then assigns each feature vector to its corresponding local maximum along the gradient direction. The pixels whose feature vectors are associated with the same local maximum thus constitute a region. Next, we require the user's input to manually select a set of connected regions as the target object. We then combine the manually selected regions into a single region $R_n$ and pass this region as the prior information to the next frame.

2.3. Likelihood model

The likelihood term $p(I_n \mid R_{n-1}, R_n, I_{n-1})$ measures the faithfulness of observing $I_n$ when $R_n$, $R_{n-1}$ and $I_{n-1}$ are known. This term is difficult to formulate if no further restriction is imposed. Thus, we follow the two assumptions presented in [13] to simplify the estimation:

(1) Conditional independence: Given $R_n$, $R_{n-1}$ and $I_{n-1}$, the conditional probabilities of $I_n(x)$ and $I_n(y)$ for different pixels $x$ and $y$ are assumed to be independent. That is,

$$p(I_n(x), I_n(y) \mid R_{n-1}, R_n, I_{n-1}) = p(I_n(x) \mid R_{n-1}, R_n, I_{n-1})\; p(I_n(y) \mid R_{n-1}, R_n, I_{n-1}). \qquad (6)$$

(2) Partial independence: Given $R_n$, $R_{n-1}$ and $I_{n-1}$, the conditional probability of $I_n(x)$ is assumed to be a function that depends only on $R_{n-1}$, $I_{n-1}$, and the membership of $x$ in $R_n$. That is,

$$p(I_n(x) \mid R_{n-1}, R_n, I_{n-1}) = \begin{cases} p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1}) & \text{if } x \in R_n, \\ p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1}) & \text{if } x \in R_n^c, \end{cases} \qquad (7)$$

where $p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ and $p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ are the conditional probabilities when the pixel $x$ is inside and outside the region $R_n$, respectively. The conditional independence assumption factorizes the likelihood term into a product of conditional probabilities, one per pixel. This assumption is appropriate once the correspondences between pixels in two consecutive frames $I_{n-1}$ and $I_n$ are available. However, measuring pixel correspondence is not an easy task, especially for


pixels located in occluded regions. Thus, conditional independence is generally an over-simplified assumption. Nevertheless, we still adopt it to bring our estimation into a more tractable form. The partial independence assumption reduces the estimation of the likelihood term to two exclusive cases, depending on whether a pixel is inside or outside the region $R_n$. When a pixel is inside the region, i.e. $x \in R_n$, the likelihood of $I_n(x)$ is assumed to be independent of the pixels outside $R_n$. On the other hand, when a pixel is outside the region, i.e. $x \in R_n^c$, the likelihood of $I_n(x)$ is assumed to be independent of the pixels inside $R_n$. This assumption is feasible when the intensity distribution of the region $R_n$ and that of its complement $R_n^c$ are very dissimilar. With the above assumptions, we rewrite the likelihood term as

$$p(I_n \mid R_{n-1}, R_n, I_{n-1}) = \prod_{x \in L} p(I_n(x) \mid R_{n-1}, R_n, I_{n-1}) = \prod_{x \in R_n} p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1}) \times \prod_{x \in R_n^c} p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1}), \qquad (8)$$

where $L$ denotes the set of all possible configurations of $x$. In Eq. (8), we first need to estimate the conditional probability for each pixel $x$ within the frame $I_n$, either inside or outside the region $R_n$. Quite obviously, when there is no abrupt illumination change between two consecutive frames, each pixel and its corresponding pixel in the previous frame generally have similar intensities. Although the exact pixel correspondence is unavailable, we do know the corresponding regions $R_{n-1}$ and $R_n$. Thus, if a pixel $x$ is inside the region $R_n$, its intensity $I_n(x)$ tends to be similar to that of the pixels inside $R_{n-1}$. Therefore, we propose to formulate the conditional probability $p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ in terms of the difference between the intensity $I_n(x)$ and the pixel intensities inside $R_{n-1}$. Similarly, for a pixel outside the region $R_n$, we formulate the conditional probability $p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ in terms of the difference between $I_n(x)$ and the pixel intensities outside $R_{n-1}$. Many density estimation methods can be employed to measure the conditional probabilities. For example, in [13,21], the conditional probability is modeled in a parametric Gaussian form. Although the Gaussian model is a commonly used assumption,

the intensity changes caused by non-rigid moving objects tend to be more complex than a single Gaussian. Once the parametric model is inaccurate and poorly approximates the probability, the tracking process gradually drifts away from the true target object. Therefore, to better approximate the probability, we propose to use a non-parametric estimation method that assumes no special structure of the distribution to estimate the likelihood term. We use the kernel density estimator and define the conditional probability $p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ as follows:

$$p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1}) = \frac{1}{|R_{n-1}|}\sum_{\hat{x} \in R_{n-1}} K_{H_I}(I_n(x) - I_{n-1}(\hat{x})) = \frac{1}{|R_{n-1}| H_I}\sum_{\hat{x} \in R_{n-1}} K\!\left(\frac{I_n(x) - I_{n-1}(\hat{x})}{H_I}\right), \qquad (9)$$

where the pixel intensity $I_n(x)$ is represented by 256 gray levels, and the kernel function $K_{H_I}(\cdot)$ gives different values according to the intensity difference between $I_n(x)$ and $I_{n-1}(\hat{x})$. The kernel bandwidth $H_I$ controls the tolerance in intensity difference between $R_{n-1}$ and $R_n$. The notation $|R_{n-1}|$ indicates the number of pixels within $R_{n-1}$. Similarly, we define the conditional probability $p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ by

$$p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1}) = \frac{1}{|R_{n-1}^c|}\sum_{\hat{x} \in R_{n-1}^c} K_{H_I}(I_n(x) - I_{n-1}(\hat{x})), \qquad (10)$$

where $|R_{n-1}^c|$ indicates the number of pixels within $R_{n-1}^c$. Here, we also use the Epanechnikov kernel defined in Eq. (5) to estimate the probabilities in Eqs. (9) and (10). Nevertheless, the assumption of similarity between a pixel $x$ in $R_n$ (or $R_n^c$) and all the pixels $\hat{x} \in R_{n-1}$ (or $\hat{x} \in R_{n-1}^c$) is over-simplified if the region $R_n$ (or $R_n^c$) is composed of multiple components with dissimilar intensity distributions. Hence, we further modify the conditional probabilities defined in Eqs. (9) and (10) to include only the neighboring pixels around $x$ in the estimation. The modified equations are

$$p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1}) = \frac{1}{|R_{n-1} \cap N_x|}\sum_{\hat{x} \in R_{n-1} \cap N_x} K_{H_I}(I_n(x) - I_{n-1}(\hat{x})), \qquad (11)$$

and

$$p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1}) = \frac{1}{|R_{n-1}^c \cap N_x|}\sum_{\hat{x} \in R_{n-1}^c \cap N_x} K_{H_I}(I_n(x) - I_{n-1}(\hat{x})). \qquad (12)$$

In Eqs. (11) and (12), $N_x$ is a neighborhood window around the pixel $x$, defined by $N_x = \{\hat{x} : \lVert x - \hat{x} \rVert \le d\}$, where $d$ controls the size of the neighborhood window. Therefore, even if the region $R_n$ is composed of multiple components, the similarity assumption remains feasible, because we now assume similarity between a pixel $x$ in $R_n$ and its spatial and temporal neighborhood $\hat{x} \in R_{n-1} \cap N_x$; a sketch of the two windowed estimates follows.
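As a concrete reading of Eqs. (11) and (12), the sketch below evaluates the two windowed kernel estimates at one pixel. It is a minimal sketch under our own assumptions (NumPy arrays, a boolean mask for $R_{n-1}$, a square window in place of the disk $N_x$, and the hypothetical name window_likelihoods); it is not the authors' code.

```python
import numpy as np

def window_likelihoods(x, I_n, I_prev, R_prev, d=5, h_i=255.0):
    # Eqs. (11) and (12) at pixel x = (row, col).
    # I_n, I_prev: (H, W) intensity frames n and n-1;
    # R_prev: (H, W) boolean mask of R_{n-1}; d: window radius;
    # h_i: intensity bandwidth H_I. A square window approximates N_x.
    r, c = x
    H, W = I_prev.shape
    rs, re = max(r - d, 0), min(r + d + 1, H)
    cs, ce = max(c - d, 0), min(c + d + 1, W)
    win_int = I_prev[rs:re, cs:ce].astype(float)
    win_in = R_prev[rs:re, cs:ce]
    u = (float(I_n[r, c]) - win_int) / h_i
    k = np.where(np.abs(u) <= 1.0, 1.0 - u * u, 0.0)   # Eq. (5), d = 1
    n_in, n_out = win_in.sum(), (~win_in).sum()
    p_in = k[win_in].sum() / (n_in * h_i) if n_in else 0.0
    p_out = k[~win_in].sum() / (n_out * h_i) if n_out else 0.0
    return p_in, p_out
```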

2.4. Prior model

The prior term $p(R_n \mid R_{n-1}, I_{n-1})$ measures the probability of the estimate $R_n$ given the prior information $R_{n-1}$ and $I_{n-1}$. Note that the information in $I_{n-1}$ is useful only when $I_n$ is also given. Therefore, we drop the $I_{n-1}$ term and simplify our prior model to

$$p(R_n \mid R_{n-1}, I_{n-1}) = p(R_n \mid R_{n-1}). \qquad (13)$$

With Eq. (13), the prior model becomes the estimation of the probability of $R_n$ when $R_{n-1}$ is given. We again use the assumptions of conditional independence and partial independence to factorize the prior model:

$$p(R_n \mid R_{n-1}) = \prod_{x \in L} p(R_n(x) \mid R_{n-1}) = \prod_{x \in R_n} p(R_n(x) = 1 \mid R_{n-1}) \times \prod_{x \in R_n^c} p(R_n(x) = 0 \mid R_{n-1}), \qquad (14)$$

where $R_n(x) = 1$ and $R_n(x) = 0$ indicate that the pixel $x$ is inside or outside $R_n$, respectively. To measure the prior probability for each pixel, we assume that the region $R_n$ is in the vicinity of $R_{n-1}$ and formulate the prior model in terms of the spatial distance between pixels in $R_n$ and $R_{n-1}$. Again, we use kernel density estimation for the prior probabilities:

$$p(R_n(x) = 1 \mid R_{n-1}) = \frac{1}{|R_{n-1}|}\sum_{\hat{x} \in R_{n-1}} K_{H_S}(x - \hat{x}) = \frac{1}{|R_{n-1}| H_S^2}\sum_{\hat{x} \in R_{n-1}} K\!\left(\frac{x - \hat{x}}{H_S}\right) \qquad (15)$$

and

$$p(R_n(x) = 0 \mid R_{n-1}) = \frac{1}{|R_{n-1}^c|}\sum_{\hat{x} \in R_{n-1}^c} K_{H_S}(x - \hat{x}) = \frac{1}{|R_{n-1}^c| H_S^2}\sum_{\hat{x} \in R_{n-1}^c} K\!\left(\frac{x - \hat{x}}{H_S}\right), \qquad (16)$$

where $K_{H_S}(\cdot)$ is a kernel function with kernel bandwidth $H_S$ that controls the tolerance in spatial translation between $R_n$ and $R_{n-1}$. A sketch of this spatial estimate follows.
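The prior probability of Eq. (15) admits the same treatment; Eq. (16) is the identical computation over the complement mask. Again, this is a minimal sketch under our own assumptions (NumPy, a boolean mask, the hypothetical name prior_in), not the authors' implementation.

```python
import numpy as np

def prior_in(x, R_prev, h_s):
    # Eq. (15): spatial kernel density of pixel x w.r.t. R_{n-1}.
    # x: (row, col); R_prev: (H, W) boolean mask; h_s: bandwidth H_S.
    ys, xs = np.nonzero(R_prev)                 # pixel coordinates of R_{n-1}
    du = (np.asarray(x, float) - np.stack([ys, xs], axis=1)) / h_s
    sq = np.sum(du * du, axis=1)
    k = np.where(sq <= 1.0, 1.0 - sq, 0.0)      # 2-D Epanechnikov, Eq. (5)
    return k.sum() / (len(ys) * h_s ** 2)
```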

In Eqs. (15) and (16), if a pixel $x$ is inside $R_{n-1}$, then the probability of $R_n(x) = 1$ is relatively high. On the other hand, if a pixel $x$ lies outside $R_{n-1}$, then the probability of $R_n(x) = 1$ is inversely proportional to its distance to the region boundary of $R_{n-1}$. Our formulation of the prior term aims to keep the change of the region boundary as smooth as possible. The underlying idea is similar to most existing methods [9,13,24]. However, the methods in [9,13,24] enforce the smoothness assumption on the region contour and model the prior probability as inversely proportional to the contour length of the region; the estimated region thus tends to shrink to a region with low curvature. Unlike the methods proposed in [9,13], our prior term tends to retain the original shape rather than reduce the contour curvature regardless of that shape.

2.5. MAP estimation

Given the likelihood and prior models, we rewrite the MAP estimation by taking the logarithm of Eq. (2):

$$\hat{R}_n = \arg\max_{R_n} \left\{ \log p(I_n \mid R_n, R_{n-1}, I_{n-1}) + \log p(R_n \mid R_{n-1}, I_{n-1}) \right\}. \qquad (17)$$

The product of the likelihood and prior models now becomes the sum of their logarithms. Next, according to Eqs. (8) and (14), we further rewrite Eq. (17) as

$$\hat{R}_n = \arg\max_{R_n} \Bigg\{ \sum_{x \in R_n} \log p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1}) + \sum_{x \in R_n^c} \log p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1}) + \sum_{x \in R_n} \log p(R_n(x) = 1 \mid R_{n-1}) + \sum_{x \in R_n^c} \log p(R_n(x) = 0 \mid R_{n-1}) \Bigg\}. \qquad (18)$$

To maximize the term in Eq. (18), we use the Euler–Lagrange ascent equation [24] and derive the evolution equation for each boundary pixel $x$:

$$\frac{\partial x}{\partial t} = \Big( \log p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1}) - \log p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1}) + \log p(R_n(x) = 1 \mid R_{n-1}) - \log p(R_n(x) = 0 \mid R_{n-1}) \Big)\, \vec{n}(x), \qquad (19)$$

where $\vec{n}(x)$ is the unit normal to the region contour at $x$, pointing outward from $R_n$. From the first two terms in Eq. (19), if a pixel $x$ belongs to the target region $R_n$, then its conditional probability $p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ is higher than $p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1})$; the evolution equation thus pushes the pixel $x$ outward toward the region boundary of $R_n$. On the contrary, if the pixel $x$ is outside the target region, its conditional probability $p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ is higher than $p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ and the evolution equation pushes the pixel $x$ inward toward the region boundary. The last two terms in Eq. (19) represent a force that prevents severe boundary change from the previously tracked region $R_{n-1}$. Once a pixel $x$ evolves outward across the region boundary of $R_{n-1}$, we have $p(R_n(x) = 1 \mid R_{n-1}) < p(R_n(x) = 0 \mid R_{n-1})$, and the evolution equation pulls the movement of $x$ back toward $R_{n-1}$. Otherwise, the evolution equation pushes the movement of $x$ toward $R_{n-1}^c$.

2.6. Stopping criterion

We can rewrite the evolution equation in Eq. (19) as

$$\frac{\partial x}{\partial t} = \left( \log \frac{p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})}{p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1})} + \log \frac{p(R_n(x) = 1 \mid R_{n-1})}{p(R_n(x) = 0 \mid R_{n-1})} \right) \vec{n}(x). \qquad (20)$$

From Eq. (20), the evolution equation consists of two terms: the likelihood flow term

$$\log \frac{p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})}{p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1})}\, \vec{n}(x)$$

and the prior flow term

$$\log \frac{p(R_n(x) = 1 \mid R_{n-1})}{p(R_n(x) = 0 \mid R_{n-1})}\, \vec{n}(x).$$

As discussed in Section 2.5, evolution by the likelihood flow term tries to maximize the similarity between the regions in two consecutive frames in terms of their intensity distributions, while evolution by the prior term tries to minimize the shape deformation between the regions in two consecutive frames. To provide more flexibility in our evolution process, we further include a positive weight $\lambda_p$ to penalize the latter term in Eq. (20):

$$\frac{\partial x}{\partial t} = \left( \log \frac{p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})}{p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1})} + \lambda_p \log \frac{p(R_n(x) = 1 \mid R_{n-1})}{p(R_n(x) = 0 \mid R_{n-1})} \right) \vec{n}(x). \qquad (21)$$

The weight $\lambda_p$ suppresses the influence of the prior term during the evolution process. In other words, a small $\lambda_p$ allows more serious shape deformation, while a large $\lambda_p$ allows only slight deformation. We refer to the term

$$\log \frac{p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})}{p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1})} + \lambda_p \log \frac{p(R_n(x) = 1 \mid R_{n-1})}{p(R_n(x) = 0 \mid R_{n-1})}$$

in Eq. (21) as the evolution coefficient. When the evolution coefficient is positive, the evolution moves the pixel $x$ toward the outside of $R_n$. On the other hand, when the evolution coefficient is negative, the evolution moves toward the inside of $R_n$. If the evolution coefficient is zero, the evolution of $x$ stops. However, we may not always reach a zero evolution coefficient, because its value may keep oscillating between positive and negative. Therefore, we must define a stopping criterion to cease the evolution. Our stopping criteria for the two situations are defined as follows.

(a) Evolution toward the outside of the region. In this case, the evolution coefficient is initially positive and the evolution of $x$ moves along $\vec{n}(x)$. After several evolution steps, if $x$ moves across the true boundary of the region, the evolution coefficient changes from positive to negative and forces the evolution to move along $-\vec{n}(x)$. If we continue the evolution process, the evolution direction of $x$ keeps changing between $\vec{n}(x)$ and $-\vec{n}(x)$. This oscillation between evolution directions indicates the existence of a true boundary. Hence, we stop the evolution once the evolution coefficient changes from positive to negative. However, when the evolution coefficient becomes negative, the position of $x$ lies outside the region; thus, we roll back one evolution step to move the pixel $x$ into the interior of the region.

(b) Evolution toward the inside of the region. In this case, the evolution coefficient is initially negative and the evolution of $x$ moves along $-\vec{n}(x)$. Similarly, we stop the evolution once the evolution coefficient changes from negative to positive. In this case, no rollback action is needed.

A sketch of this evolution loop follows.
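The sketch below ties the pieces together: it computes the evolution coefficient of Eq. (21) and moves one boundary pixel until the coefficient changes sign, with the one-step rollback of case (a). The probability functions are assumed to wrap the earlier estimators, the outward normal is supplied by the caller and held fixed for the sketch, and the names are our own; this is an illustration, not the authors' implementation.

```python
import numpy as np

def evolution_coefficient(x, p_in, p_out, prior1, prior0, lam_p=2e-6, eps=1e-12):
    # Evolution coefficient of Eq. (21); p_in/p_out wrap Eqs. (11)-(12)
    # and prior1/prior0 wrap Eqs. (15)-(16), all callables of x.
    like = np.log(p_in(x) + eps) - np.log(p_out(x) + eps)
    prior = np.log(prior1(x) + eps) - np.log(prior0(x) + eps)
    return like + lam_p * prior

def evolve_pixel(x, normal, coeff, max_steps=100):
    # Move a boundary pixel along +/- the outward normal until the
    # coefficient changes sign (Section 2.6). Case (a): an outward run
    # ends one step outside the boundary, so roll back; case (b): an
    # inward run needs no rollback.
    x = np.asarray(x, float)
    sign0 = np.sign(coeff(x))
    if sign0 == 0:
        return x
    for _ in range(max_steps):
        x_next = x + sign0 * np.asarray(normal, float)
        if np.sign(coeff(x_next)) != sign0:   # sign change: boundary found
            return x if sign0 > 0 else x_next
        x = x_next
    return x
```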

3. Experimental results

We have applied the proposed algorithm to several video sequences with different degrees of deformation. For ease of discussion, the test video sequences are divided into two categories: video with slightly deformed objects (e.g. ‘‘ZoomCar’’) and video with severely deformed objects (e.g. ‘‘Flower’’ and ‘‘Bream’’). We also implemented the parametric method [13] and compared its results with our proposed method. The proposed algorithm is implemented on a 1.2 GHz PC. The execution time is listed in Table 1. The tracking time includes I/O time, color conversion, probability estimation and evolution. The evolution time is proportional to the size of the target object; the target object in the ‘‘Flower’’ sequence is the largest, so the evolution in ‘‘Flower’’ takes the most time.

Table 1. Execution time

Video sequence   Number of frames   Evolution time (s/frame)   Tracking time (s/frame)   Tracking time (frames/s)
ZoomCar          64                 0.03                       0.40                      2.50
Flower           123                0.26                       0.58                      1.72
Bream            300                0.08                       0.18                      5.56

3.1. Tracking of slightly deformed moving objects

Fig. 1 shows the test video sequence ‘‘ZoomCar’’ and the tracking results. The first frame, its spatially

segmented result, and the manually initialized target object are shown in Fig. 1(a)–(c), respectively. Note that the target object consists of several components with dissimilar intensity distributions, and the intensity of the car window is very similar to that of the highway. Also, as the camera zooms in, the tracking region grows quite fast. With such a fast zooming-in effect, many techniques [10,15,22] need additional camera motion compensation to adjust to the size change of the target object. Our method, in contrast, overcomes this difficulty by using the proposed non-parametric model to better approximate the change without relying on any camera motion information. Fig. 2 shows the tracking result of the parametric method [13]. Note that, in [13], the point-wise likelihood functions $p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ and $p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ are modeled in a parametric form (i.e. zero-mean Gaussian), while in our proposed method these two likelihood functions are better approximated using the non-parametric representation. In the video sequence ‘‘ZoomCar’’, when the camera zooms in slowly, i.e. between the 1st frame and the 23rd frame, both methods achieve satisfactory results, as shown in Figs. 1(d) and 2. However, when the camera zooms in quickly on the target object after the 33rd frame, the non-parametric likelihood terms approximate the point-wise intensity change distinctly better than the parametric method and achieve better performance.


Fig. 1. Video sequence ‘‘ZoomCar’’: (a) the 1st frame; (b) the spatially segmented result of the 1st frame; (c) the initialized target object at the 1st frame; and (d) the tracking results between the 3rd frame and the 64th frame, with frame number indicated below each frame.

Fig. 2. The tracking results on the ‘‘ZoomCar’’ sequence using the parametric method [13].


Fig. 3. Video sequence ‘‘Flower’’: (a) the 1st frame; (b) the segmentation result of the 1st frame; (c) the initialized target object at the 1st frame; and (d) the tracking results between the 10th and the 120th frame.

parametric method (Fig. 4), especially in modeling the complex intensity changes around the flower boundary. Figs. 5 and 6 show the tracking results on the test video sequence ‘‘Bream’’ using the proposed method and the parametric method [13], respectively. As shown in Fig. 5(d), the target object ‘‘Bream’’ turns around twice (the first turn between the 110th and the 135th frames, and the second between the 195th and 240th frames) and severely deforms its region contour. When the bream turns around, several of its parts become self-occluded (e.g. the 115th, 120th, 220th, and 225th frames) and reappear afterward (e.g. the 135th and 240th

frames). As the bream deforms very quickly, a parametric shape model again cannot represent the severe deformation. Motion information also becomes unreliable, since self-occlusion greatly degrades the accuracy of motion estimation [11]. The tracking results by the parametric model [13] (shown in Fig. 6) unfortunately also fail to track the severe deformation in this challenging case. Our method, nevertheless, suffers from none of these difficulties and tracks the deformation very well. Though the results between the 225th and 240th frames are somewhat inaccurate, these errors are corrected by the evolution process in the following frames.


Fig. 4. The tracking results on the ‘‘Flower’’ sequence using the parametric method [13].

4. Discussion and extension

In this section, we discuss the selection of some control parameters and the limitations of the proposed method, and suggest one possible extension.

4.1. Neighborhood window size d

As described in Section 2.3, we decompose the likelihood model into a product of the conditional probabilities $p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ for the pixels within $R_n$ and the conditional probabilities $p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ for the pixels outside $R_n$. We then formulate these conditional probabilities in terms of the intensity difference between $x$ and its neighboring pixels inside and outside $R_{n-1}$, respectively. Here, we discuss how the size $d$ of the neighborhood window $N_x = \{\hat{x} : \lVert x - \hat{x} \rVert \le d\}$ affects the performance of our evolution. A small neighborhood window $N_x$ is generally preferred for its lower computational cost. In

addition, if a video object consists of several components with dissimilar intensities, a small neighborhood window better models the conditional probability $p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})$. Otherwise, if we use a large neighborhood window that covers multiple components with dissimilar intensities, the conditional probability $p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ for pixels inside the region tends to be very small. In the worst case, if $p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ for pixels inside the region becomes even smaller than $p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1})$, our evolution step can never find the true region boundary. However, a small neighborhood window size $d$ poses another problem. In many video sequences, we found that boundaries between tracking regions and their neighboring regions are generally obscure. When there is no clear boundary between regions, pixels closest to the region boundary are more likely to be misclassified. Therefore, a neighborhood window size smaller than a pre-defined threshold should be avoided. Moreover, when using a small neighborhood window to estimate the conditional


Fig. 5. Video sequence ‘‘Bream’’: (a) the 1st frame; (b) the segmentation result of the 1st frame; (c) the initialized target object at the 1st frame; and (d) the tracking results between the 30th frame and the 300th frame.

probabilities, if the tracking region moves very fast and the motion exceeds the neighborhood window size, the conditional probability $p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ decreases to zero and ceases the evolution steps prematurely. We use Fig. 7 to illustrate this problem. Fig. 7(a) shows the initial condition of the evolution process, and Fig. 7(b) and (c) show the succeeding evolution steps. In Fig. 7(c), the pixel $x$, in order to catch up with the fast moving object, has moved its neighborhood window completely outside the object region $R_{n-1}$.

Thus, the conditional probability $p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ decreases to zero and our evolution process stops at this instant, which is an incorrect stopping condition. From the above discussion, an ideal neighborhood window size should be adaptive to the moving speed of the tracking region. In practice, however, we do not know the moving speed in advance, so applying an adaptive window size in the likelihood probability estimation is infeasible. In order to solve this difficulty with a fixed-size neighborhood window, we propose to use a ‘‘virtual neighborhood


Fig. 6. The tracking results on the ‘‘Bream’’ sequence using the parametric method [13].


Fig. 7. Estimation of the likelihood probability using the neighborhood window $N_x$. $x$ is the current evolution pixel and $N_x$ is its neighborhood window: (a)–(c) are three evolution steps for the pixel $x$. The evolution process stops at (c), because no pixel belongs to the set $R_{n-1} \cap N_x$.

window’’ to estimate Eqs. (11) and (12). In Section 2.3, we use the neighborhood window $N_x$ to measure the intensity distribution of a neighboring area around a pixel $x$. Though the intensity distribution within the neighborhood window enters the kernel density estimation in Eqs. (11) and (12), the location of the neighborhood window has nothing to do with the estimation itself. Therefore, during the evolution process for each boundary pixel, we propose to use the intensity distributions measured in the initial phase of the evolution to update the conditional probabilities $p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ and $p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ in the succeeding evolution steps. We illustrate this idea in Fig. 8, where $N_x$ indicates the neighborhood window and $N'_x$ indicates the virtual neighborhood window. The location of the virtual neighborhood window $N'_x$ is determined in the initial phase of

an evolution process and does not move with the evolution of the pixel $x$. Thus, Eq. (11) is modified as

$$p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1}) = \frac{1}{|R_{n-1} \cap N'_x|}\sum_{\hat{x} \in R_{n-1} \cap N'_x} K_{H_I}(I_n(x) - I_{n-1}(\hat{x})). \qquad (22)$$

To sum up, we use the ‘‘virtual neighborhood window’’ to resolve the conflict in the selection of the neighborhood window size. The neighborhood window size then no longer affects the accuracy of the likelihood probability estimation, and the estimation does not suffer from fast movement of the tracking region. In our experiments, the virtual neighborhood window size $d$ is set to 5. A sketch of this modification follows.
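A minimal sketch of Eq. (22) under the same assumptions as the Section 2.3 sketch: the kernel sums are taken over a window anchored at the pixel's initial position x0, while the intensity is read at the current position, so fast motion no longer empties the window. The function name is hypothetical.

```python
import numpy as np

def anchored_likelihoods(x_cur, x0, I_n, I_prev, R_prev, d=5, h_i=255.0):
    # Eq. (22): the virtual window N'_x stays centered at the initial
    # position x0 of the evolution; only I_n(x) follows the pixel.
    H, W = I_prev.shape
    rs, re = max(x0[0] - d, 0), min(x0[0] + d + 1, H)
    cs, ce = max(x0[1] - d, 0), min(x0[1] + d + 1, W)
    win_int = I_prev[rs:re, cs:ce].astype(float)
    win_in = R_prev[rs:re, cs:ce]
    u = (float(I_n[x_cur[0], x_cur[1]]) - win_int) / h_i
    k = np.where(np.abs(u) <= 1.0, 1.0 - u * u, 0.0)   # Eq. (5), d = 1
    n_in, n_out = win_in.sum(), (~win_in).sum()
    p_in = k[win_in].sum() / (n_in * h_i) if n_in else 0.0
    p_out = k[~win_in].sum() / (n_out * h_i) if n_out else 0.0
    return p_in, p_out
```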


Fig. 8. Estimation of the likelihood probability using the ‘‘virtual neighborhood window’’ $N'_x$. $x$ is the current evolution pixel, $N_x$ is its neighborhood window and $N'_x$ is the virtual neighborhood window determined in the initial phase of the evolution process. (a)–(c) are three evolution steps of the pixel $x$. The evolution in (c) continues until the stopping criterion is satisfied.

4.2. The intensity bandwidth $H_I$

In Eqs. (11) and (12), the kernel bandwidth $H_I$ controls the maximum tolerance in intensity difference of the tracking region between consecutive frames. We refer to this kernel bandwidth as the intensity bandwidth. Consider the relationship between the intensity bandwidth $H_I$ and the likelihood flow coefficient

$$\log \frac{p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})}{p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1})}$$

in Eq. (21). We use Fig. 9 as an example, which shows the variation of the likelihood flow coefficient versus the intensity bandwidth for one boundary pixel (i.e. $x \in R_n$) in the first frame of the video sequence ‘‘Bream’’. In Fig. 9, when the intensity bandwidth increases from 1 to 70, $p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ increases faster than $p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1})$, because the growing bandwidth keeps including all the intensities of the video object while not yet covering the intensity range of the background. As the intensity bandwidth becomes larger than 70, $p_{\mathrm{out},x}(I_n(x) \mid R_{n-1}, I_{n-1})$ increases faster than $p_{\mathrm{in},x}(I_n(x) \mid R_{n-1}, I_{n-1})$, because the bandwidth now also covers the intensities of the background. Hence, the likelihood flow increases when the intensity bandwidth increases from 1 to 70 and decreases when the intensity bandwidth is larger than 70. Note that, though a different intensity bandwidth changes the magnitude of the likelihood flow, the sign of the likelihood flow is unaffected. In our experiments, to ensure that the intensity bandwidth covers all the intensities of the target object, we set the intensity bandwidth to 255. A sketch of this bandwidth sweep follows.
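The behavior plotted in Fig. 9 can be probed by sweeping the bandwidth for one boundary pixel. The sketch below reuses the hypothetical window_likelihoods helper from the Section 2.3 sketch; the helper and all names are our own assumptions, not the authors' code.

```python
import numpy as np

def likelihood_flow_sweep(x, I_n, I_prev, R_prev, bandwidths, eps=1e-12):
    # Likelihood flow coefficient log(p_in / p_out) of one pixel as a
    # function of the intensity bandwidth H_I (cf. Fig. 9).
    coeffs = []
    for h_i in bandwidths:
        p_in, p_out = window_likelihoods(x, I_n, I_prev, R_prev, d=5, h_i=h_i)
        coeffs.append(np.log(p_in + eps) - np.log(p_out + eps))
    return np.array(coeffs)

# e.g. likelihood_flow_sweep(x, I_n, I_prev, R_prev, np.arange(1.0, 256.0))
```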

Fig. 9. Variation of the likelihood flow coefficient versus the intensity bandwidth $H_I$ for one boundary pixel in the target region of the ‘‘Bream’’ sequence.

4.3. The weight $\lambda_p$ and the spatial bandwidth $H_S$

As mentioned in Sections 2.4 and 2.6, the spatial bandwidth $H_S$ controls the maximum tolerance in spatial translation of the tracking region between consecutive frames, and the weight $\lambda_p$ controls the tolerance of shape deformation relative to the tracked region in the previous frame. Consider two extreme cases. If the evolution equation in Eq. (21) is dominated by the prior flow term, the evolution allows no shape deformation. On the other hand, if the evolution equation is dominated by the likelihood flow term, the evolution tends to deform arbitrarily and disregards the original shape in the previous frame. In Fig. 10, we use an example to demonstrate how the weight $\lambda_p$ and the spatial bandwidth $H_S$ influence the tracking results. We set the weight $\lambda_p = 0.1$ in Fig. 10(a) and (b) and $\lambda_p = 0.01$ in Fig. 10(c) and (d). The results in Fig. 10(a)–(d) confirm our assumption that a smaller weight $\lambda_p$ allows larger deformation and thus achieves better performance in this case. Fig. 10(a)–(d) also shows that, with a fixed weight $\lambda_p$, a large spatial bandwidth $H_S$ allows larger spatial


Fig. 10. The tracking results of the 10th, the 20th and the 30th frames on the ‘‘Flower’’ sequence with different weights $\lambda_p$ and spatial bandwidths $H_S$: (a) $\lambda_p = 0.1$, $H_S = 5$; (b) $\lambda_p = 0.1$, $H_S = 15$; (c) $\lambda_p = 0.01$, $H_S = 5$; (d) $\lambda_p = 0.01$, $H_S = 15$; and (e) $\lambda_p = 0.000002$, $H_S = \sqrt{2}$ (note that the results in (e) are the same as the 2nd row in Fig. 3).

translation between consecutive frames. Thus, we can conclude that the combination of a smaller weight $\lambda_p$ and a larger spatial bandwidth $H_S$ is a better selection for tracking severely deformed regions. In this work, since we wish to deal with arbitrary deformation, we use a nominal value $\lambda_p = 0.000002$ in our experiments. Since the evolution process is then dominated by the likelihood term, the size of the spatial bandwidth $H_S$ has little influence on the final results. In order to reduce the computational cost, we set the spatial bandwidth to $\sqrt{2}$ in our experiments. The tracking results with $\lambda_p = 0.000002$ and $H_S = \sqrt{2}$ are also shown in Fig. 10(e) for better visual comparison.

4.4. Extension

Here we discuss a limitation of our evolution process and present a possible extension to solve this problem. Consider the case depicted in Fig. 11. For a small and fast-moving object, the region $R_n$ is very likely to have no overlap with the region $R_{n-1}$ in the previous frame. In this case, our evolution process prematurely stops at an incorrect position. As shown in Fig. 11(b), the pixel $x$ stops its evolution when it comes across a region boundary. Though the stopping criterion for the evolution is satisfied, the correct stopping point should be $x'$ instead of $x$. The difficulty due


to non-overlapping regions in consecutive frames can possibly be solved once the true motion of the region is available. If the motion is available, we can


use the motion to predict a starting position for the evolution process. For example, in the ‘‘Table Tennis’’ sequence (Fig. 12), the ball keeps moving up and down very fast and produces the above-mentioned non-overlapping case. If we directly employ the proposed evolution process to track the ball, the evolution fails to track the object, as shown in Fig. 13. Therefore, we need to predict the object motion to initialize our original evolution process. Here, we employ a kernel-based object tracking method [3] to estimate the object motion and use the motion-predicted position as the starting position for the evolution process. The correctly tracked results are shown in Fig. 12.

5. Conclusion


Fig. 11. Incorrect stopping condition of the evolution process for a fast moving object: (a) initial condition of the evolution for x and (b) the stopping position of x.

This paper proposes an efficient approach for tracking non-rigid moving objects. We formulate the tracking problem in a MAP framework and use a non-parametric method to estimate the probabilities. In the proposed method, we assume no special observation model and apply no restriction on the region shape. We rely only on the general assumption that a video object in two consecutive frames has a relatively constant intensity distribution. In addition, our method includes no motion estimation and needs no special strategies to deal with different cases, such as camera motion, inconsistent motion, and deformation. The selection of the neighborhood window size is a tradeoff between computational cost and accuracy of the probability estimation; we use a ‘‘virtual neighborhood window’’ to solve this problem and thus do not have to adjust the neighborhood window size for different sequences. Our experimental results show that the proposed algorithm performs very well and outperforms the parametric method even for severely deformed video objects.

Fig. 12. Video sequence ‘‘Table Tennis’’: (a) the initialized target object at the 1st frame and (b) the tracking results between the 10th and the 50th frames.


Fig. 13. The incorrect tracking result without using the object motion to initialize the evolution process.

Acknowledgement

The authors would like to thank the anonymous reviewers for their valuable comments, which helped improve the quality of this paper. This work was partially supported by the National Science Council of Taiwan under contracts NSC93-2213-E-007-001 and NSC93-2213-E-007-055.

References

[1] Y. Cheng, Mean shift, mode seeking, and clustering, IEEE Trans. Pattern Anal. Mach. Intell. 17 (8) (August 1995) 790–799.
[2] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell. 24 (5) (May 2002) 603–619.
[3] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. 25 (5) (May 2003) 564–577.
[4] C.E. Erdem, A.M. Tekalp, B. Sankur, Video object tracking with feedback of performance measures, IEEE Trans. Circuits Systems Video Technol. 13 (4) (April 2003) 310–324.
[5] D. Freedman, T. Zhang, Active contours for tracking distributions, IEEE Trans. Image Process. 13 (4) (April 2004) 518–526.
[6] Y. Fu, A.T. Erdem, A.M. Tekalp, Tracking visible boundary of objects using occlusion adaptive motion snake, IEEE Trans. Image Process. 9 (12) (December 2000) 2051–2060.
[7] C. Gentile, O. Camps, M. Sznaier, Segmentation for robust tracking in the presence of severe occlusion, IEEE Trans. Image Process. 13 (2) (February 2004) 166–178.
[8] M. Kass, A. Witkin, D. Terzopoulos, Snakes: active contour models, in: Proceedings of the First International Conference on Computer Vision, London, England, 1987, pp. 259–268.
[9] J. Kim, J.W. Fisher III, A. Yezzi Jr., M. Çetin, A.S. Willsky, Nonparametric methods for image segmentation using information theory and curve evolution, in: Proceedings of the ICIP, 2002, pp. 797–800.
[10] M. Kim, J.G. Choi, D. Kim, H. Lee, M.H. Lee, C. Ahn, Y.S. Ho, A VOP generation tool: automatic segmentation of moving objects in image sequences based on spatio-temporal information, IEEE Trans. Circuits Systems Video Technol. 9 (8) (December 1999) 1216–1226.
[11] K.P. Lim, A. Das, M.N. Chong, Estimation of occlusion and dense motion fields in a bidirectional Bayesian framework, IEEE Trans. Pattern Anal. Mach. Intell. 24 (6) (May 2002) 712–718.
[12] H. Luo, A. Eleftheriadis, Model-based segmentation and tracking of head-and-shoulder video objects for real-time multimedia services, IEEE Trans. Multimedia 5 (3) (September 2003) 379–389.
[13] A.R. Mansouri, Region tracking via level set PDEs without motion computation, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (July 2002) 947–961.
[14] T. Meier, K.N. Ngan, Automatic segmentation of moving objects for video object plane generation, IEEE Trans. Circuits Systems Video Technol. 8 (5) (September 1998) 525–538.
[15] T. Meier, K.N. Ngan, Video segmentation for content-based coding, IEEE Trans. Circuits Systems Video Technol. 9 (8) (December 1999) 1190–1203.
[17] H.T. Nguyen, M. Worring, R.V.D. Boomgaard, A.W.M. Smeulders, Tracking nonparameterized object contours in video, IEEE Trans. Image Process. 11 (9) (September 2002) 1081–1091.
[18] I. Patras, E.A. Hendriks, R.L. Lagendijk, Video segmentation by MAP labeling of watershed segments, IEEE Trans. Pattern Anal. Mach. Intell. 23 (3) (March 2001) 326–332.
[19] T. Schoepflin, V. Chalana, D.R. Haynor, Y. Kim, Video object tracking with a sequential hierarchy of template deformations, IEEE Trans. Circuits Systems Video Technol. 11 (11) (November 2001) 1171–1182.
[20] I.K. Sethi, R. Jain, Finding trajectories of feature points in a monocular image sequence, IEEE Trans. Pattern Anal. Mach. Intell. 9 (1) (1987) 56–73.
[21] A.M. Tekalp, Digital Video Processing, Prentice-Hall, Englewood Cliffs, NJ, 1995.
[22] Y. Tsaig, A. Averbuch, Automatic segmentation of moving objects in video sequences: a region labeling approach, IEEE Trans. Circuits Systems Video Technol. 12 (7) (July 2002) 597–612.
[23] D. Wang, Unsupervised video segmentation based on watersheds and temporal tracking, IEEE Trans. Circuits Systems Video Technol. 8 (5) (September 1998) 539–546.
[24] S.C. Zhu, A. Yuille, Region competition: unifying snakes, region growing, and Bayes/MDL for multiband image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 18 (9) (September 1996) 884–900.