Heuristic algorithm for visual tracking of deformable objects

J. Vis. Commun. Image R. 22 (2011) 465–478

Elena Sánchez-Nielsen a,*, Mario Hernández-Tejera b

a Department of Statistics, Operations Research and Computer Science, University of La Laguna, 38271 La Laguna, Spain
b Institute of Intelligent Systems and Numerical Applications in Engineering, University Campus of Tafira, 35017 Gran Canaria, Spain

Article history: Received 26 July 2005. Accepted 27 May 2011. Available online 17 June 2011.

Keywords: Real-time visual tracking; Heuristic tracking algorithm; Template matching; Template updating; Template tracking; Heuristic search; Target motion; Kullback–Leibler

Abstract

Many vision problems require fast and accurate tracking of objects in dynamic scenes. These problems can be formulated as exploration problems and thus can be expressed as a search in a state space. However, they are hard to solve because they involve searching a space of transformations corresponding to all possible motions and deformations. In this paper, we propose a heuristic algorithm through the space of transformations for computing the target 2D motion. Three features are combined in order to compute motion efficiently: (1) a quality of function match based on a holistic similarity measurement, (2) the Kullback–Leibler measure as a heuristic to guide the search process and (3) the incorporation of target dynamics into the search process for computing the most promising search alternatives. Once the 2D motion has been calculated, the resulting value of the quality of function match is used to verify template updates. A template is updated only when the target object has evolved to a transformed shape dissimilar to the current shape. Also, a short-term memory subsystem is included with the purpose of recovering previous views of the target object. The paper includes experimental evaluations with video streams that illustrate the efficiency and suitability for real-time vision based tasks in unrestricted environments.

© 2011 Elsevier Inc. All rights reserved.

1. Introduction

Template tracking is a basic task in visual systems whose main goal is the detection and tracking of a mobile object of interest in a dynamic vision context, given one or several explicit templates that represent the target object. If an active vision approach is considered, it is also desirable that the tracking process keeps the object of interest centered in the image, moving the sensor adequately [1,2]. At present, there are still obstacles to achieving all-purpose and robust tracker approaches. Four main issues must be addressed in order to carry out an effective template tracking approach:

(1) Real-time performance. Real-time template tracking is a critical task in many computer vision applications such as vision based interface tasks [3], visual surveillance [35], traffic control [36], navigation tasks for autonomous robots [37], gesture based human–computer interaction [38], perceptual intelligence applications [4], virtual and augmented reality systems [39] or applications from the "looking at people" domain [5].

This work has been supported by the Spanish Government and Canary Islands Autonomous Government under the Projects TIN2004-07087 and PI2003/165.
* Corresponding author. Fax: +34 922473377. E-mail addresses: [email protected] (E. Sánchez-Nielsen), mhernandez@iusiani.ulpgc.es (M. Hernández-Tejera).

doi:10.1016/j.jvcir.2011.05.005

Moreover, in real-time applications not all system resources can be allocated to tracking processes, because other high-level tasks such as trajectory interpretation and reasoning may be demanded. Therefore, it is desirable to keep the computational cost of a tracker approach as low as possible to make real-time performance feasible on general purpose hardware.

(2) Initialisation. Many template based tracking approaches rely on manual initialisation. Some approaches assume that the template which represents the target object is correctly aligned in the first frame [6]. Other approaches select the reference templates by a hand-drawn prototype template, i.e., an ellipse outline for faces [7,8], or they are extracted from a set of examples such as level appearance [9] or outlines [10,11]. Moreover, the condensation algorithm [10] also requires training, using the object moving over an uncluttered background to learn the motion model parameters before it can be applied to the real scene. However, these selection processes restrict their use in many practical embedded applications. Therefore, quick and transparent initializations without user participation are required.

(3) Matching. Template matching is the process in which a reference template T(k) is searched for in an input image I(k) to determine its location and occurrence. Over the last decade, different approaches based on searching the space of transformations using a similarity measurement have been proposed for template based matching.


Some of them explicitly establish point correspondences between two shapes and subsequently find a transformation that aligns these shapes [12,13]. The iteration of these two steps involves the use of algorithms such as iterated closest points (ICP) [14,15] or shape context matching [13]. However, these methods require a good initial alignment in order to converge, particularly when the image contains a cluttered background. Other approaches are based on searching the space of transformations using Hausdorff matching [16], which rely on an exhaustive search that works by subdividing the space of transformations in order to find the transformation that matches the template position in the current image. Similar techniques have also been used for tracking selected targets in natural scenes [30] and for person tracking using an autonomous robot [31]. However, no heuristic functions and no target dynamics have been combined in the search process, which increases the computational cost of the tracking process.

(4) Updating. The underlying assumption behind several template tracking approaches is that the appearance of the object remains the same through the entire video [17–19]. This assumption is generally reasonable only for a certain period of time, and a naive solution to this problem is updating the template every frame [30–32] or every n frames [33] with a new template extracted from the current image. However, small errors can be introduced in the location of the template each time it is updated, so the template gradually drifts away from the object [20]. Matthews et al. [20] propose a solution to this problem; however, their solution only addresses objects whose visibility does not change while they are being tracked.

In this paper, a template based solution for fast and accurate tracking of moving objects is proposed. The main contributions are: (1) an A* search algorithm that uses the Kullback–Leibler measurement as a heuristic to guide the search process for efficient matching of the target position, (2) dynamic update of the search space in each image, whose dimension is determined by the target dynamics, dramatically reducing the number of possible search alternatives, (3) updating templates only when the target object has evolved to a new shape significantly dissimilar to the current template, in order to solve the drift problem, and (4) representation of illustrative views of the target shape evolution through a short-term memory subsystem. As a result, the first two contributions provide a fast algorithm over a space of transformations for computing target 2D motion, and the other two contributions provide robust tracking, because accurate template updating can be performed. In addition to these contributions, the paper also contains a number of experimental evaluations and comparisons:

• A direct comparison of the performance of conventional search approaches [16], which work by subdividing the transformation space, and the proposed A* search approach, which incorporates target dynamics and a heuristic to guide the search process, demonstrating that the A* search based approach is faster.
• An empirical comparison of updating templates using a continuous updating approach like that proposed in [30–32] and the template updating approach proposed in this paper, demonstrating that not updating the template in every frame, together with the use of a dynamic short-term memory subsystem, leads to a more robust tracking approach.

• An analysis of the time required for computing the proposed template matching and updating approach, illustrating that the time to track targets in video streams is lower than the real-time budget.

The structure of this paper is as follows: the problem formulation is presented in Section 2. In Section 3, the heuristic algorithm for computing the target position is described. The template updating problem is detailed in Section 4. Experimental results are provided in Section 5, and Section 6 concludes the paper.

2. Problem formulation

The template tracking problem of objects in 3D space from 2D images is formulated in terms of decomposing the transformations induced by moving objects between frames into two parts: (1) a 2D motion, corresponding to the change of the target position in the image space, which is referred to as the template position matching problem, and (2) a 2D shape change, corresponding to a different aspect of the object becoming visible or an actual change of shape in the object, which is referred to as the template updating problem. For the sake of the subsequent problem formulation, some definitions are first introduced:

Definition 1 (Template). Let T(k) = {t1, ..., tr} ⊆ R² be a set of points that represent a template at step time k.

Definition 2 (Image). Let I(k) = {i1, ..., is} ⊆ R² be another set of points that denote an input image at step time k. It is assumed that each new step time k corresponds to a new frame k of the video stream.

Definition 3 (Set of transformations). Let a translational transformation g be parameterized by the x displacement g_x and the y displacement g_y; that is, g = (g_x, g_y). Let a bounded set of translational transformations be a set G = [g_xmin, g_xmax] × [g_ymin, g_ymax] ⊆ R², and let g_c = (g_cx, g_cy) denote the transformation that corresponds to the center of G. It is defined as:

g_c = ( (g_xmin + g_xmax)/2 , (g_ymin + g_ymax)/2 )    (1)

where (g_xmin, g_xmax) and (g_ymin, g_ymax) represent respectively the lower and upper bounds of G in the x and y dimensions.

Definition 4 (Bounded error notion of quality of match). Let a bounded error notion of quality of match Q(g; T(k), I(k), ε) be a measurement for computing the degree of match between a template T(k) and a current input image I(k), where the dependence of Q on T, I and/or ε is omitted for the sake of simplicity but without loss of generality. That is, the quality of match assigned to a transformation g is represented by the allowed error bound ε when template points are brought to image points using the transformation g. This quality of match function assigned to a transformation g is expressed as:

Q(g) = Σ_{t ∈ T} [ min_{i ∈ I} ‖g(t) - i‖ < ε ]    (2)

where ‖·‖ denotes a measurement of distance, [·] equals 1 when its condition holds and 0 otherwise, and g(t) represents the result of applying the transformation g = (g_x, g_y) to every point in template T(k).
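To make expression (2) concrete, here is a minimal sketch (our own illustration, not the authors' implementation), assuming NumPy/SciPy and a boolean edge image; the distance transform gives, for every pixel, the distance to the nearest edge point of I(k):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def quality_of_match(template_pts, edge_image, g, eps):
    """Sketch of expression (2): count how many template points fall
    within distance eps of some edge point of I(k) after translating
    the template by g = (gx, gy).  template_pts is an (r, 2) array of
    (x, y) coordinates; edge_image is True at edge locations."""
    # Distance from every pixel to the nearest edge pixel.
    dist_to_edge = distance_transform_edt(~edge_image)
    shifted = template_pts + np.asarray(g)        # g(t) for every t in T(k)
    xs = shifted[:, 0].astype(int)
    ys = shifted[:, 1].astype(int)
    # Discard points translated outside the image bounds.
    ok = (xs >= 0) & (xs < edge_image.shape[1]) & \
         (ys >= 0) & (ys < edge_image.shape[0])
    # The bracketed condition of (2): 1 where the distance is below eps.
    return int(np.count_nonzero(dist_to_edge[ys[ok], xs[ok]] < eps))
```

Computing the distance transform once per frame makes evaluating Q(g) for many candidate translations cheap, which is what the search of Section 3 exploits.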


2.1. Template position matching

Given a step time k, a template T(k), an input image I(k) and an error bound ε, the template position matching problem can be viewed as the search process in the space of transformations to find the transformation g_opt that maximizes the quality of match Q(g):

g_opt(T(k), I(k), ε) = argmax_{g ∈ G} Q(g; T(k), I(k), ε)    (3)

2.2. Template updating

Once the new position of the target has been computed, the template is updated to reflect the change in its shape. Since the template position matching problem determines translational motion between consecutive frames, all the non-translational transformations of the image motion and three-dimensional shape change are considered to be a change in the 2D shape of the object. Let g_opt = (g_xopt, g_yopt) denote the translation that best matches template T(k) with the new object position at step time k, and let g_opt(T(k)) denote the set of points of template T(k) translated by g_opt. Given g_opt(T(k)) and I(k), new 2D shape changes between successive images are computed as the measure of the discrepancy between g_opt(T(k)) and I(k) under a certain error bound δ. That is, the new template T(k+1) is represented by all those points of the input image I(k) that are within distance δ of some point of g_opt(T(k)), according to the following expression:

T(k+1) = { i ∈ I(k) : min_{t ∈ T(k)} ‖g_opt(t) - i‖ < δ }    (4)

where ‖·‖ denotes a measurement of distance. Fig. 1 illustrates a graphical summary of the problem formulation. The sets of points that represent I(k) and T(k) are based on edge features extracted from real images using a Canny edge detector [26]. The template T(k) corresponds to the target to be tracked at every step time k and it is computed from the previous step time k - 1.


Given T(k), I(k) and G, the template position matching consists of the search process in the space of transformations G to find the transformation g_opt that brings the maximum number of template points to image points under an allowed error bound ε and a measurement of distance. The result of applying the initial transformation g = (g_x, g_y) of G to every point in template T(k) is illustrated in Fig. 1(a). In this initial situation, the template T(k) does not match the target position and therefore the match function Q(g), based on a distance measurement, is not maximized. Fig. 1(b) illustrates the result of applying the transformation that best matches the current template points T(k) in the image I(k), after the last step of the search process has been computed. That is, this situation corresponds to the transformation g_opt that maximizes the match function Q(g). In this case, the translated template points g_opt(T(k)) lie within distance ε of some image point of I(k) under a measurement of distance. Since the template position matching computes translational motion, all the non-translational transformations of the image motion are taken into account in the updating of the new template T(k+1) according to expression (4). In the following sections, the proposed solution for the problem is described and evaluated with experiments.

3. Template position matching based on A* search algorithm

Problem solving under the framework of heuristic search is expressed through a state space based representation approach [23], where the possible problem situations are considered as a set of states. The start state corresponds to the initial situation of the problem, the final, goal or target state corresponds to the problem solution, and the transformation between states is carried out by means of operators. As a result, problem solving is addressed as the sequence of operators which transforms the start state into the target state. A problem space tree is used to represent the problem space. The states of the space are represented by nodes of the tree; the initial state is the root of the tree, and the operators correspond to edges between nodes.

Fig. 1. A schematic overview of the template tracking problem. The template position matching consists of the search process in the space of transformations G to find the transformation g_opt that maximizes the match function Q(g). (a) illustrates the result of applying the initial transformation g = (g_x, g_y) of the search process to every point in template T(k), where Q(g) is not maximized. (b) shows the result of applying the transformation that best matches the current template points T(k) in the image I(k), after the search process has been computed and where Q(g) is maximized. Once the template matching problem has been computed, the template is updated in order to represent the new view of the target, as illustrated in (c).


According to the heuristic search framework described, the A* heuristic search problem based on the state space representation is formulated as the search process oriented to find the transformation g_opt ∈ G that maximizes the quality of function match Q(g) between the transformed template g(T(k)) and the current image I(k). Next, the elements of the problem are described in order to formalize the heuristic search framework:

• State: each search state n is associated with a subset of transformations G_n ⊆ G. Each state is represented by the transformation that corresponds to the center of the partial set of transformations, which is referred to as g_c.
• Initial state: is represented by a bounded set of translational transformations G, which allows matching the current template position in the current scene. The computation of the initial dimension of G, M × N, is described in Section 3.4.
• Final state: is the transformation that best matches the current template points T(k) in the current image I(k), according to the quality of function match Q(g). The quality of function match assigned to a transformation g is expressed in terms of the partial directed Hausdorff distance (see Appendix A) between g(T(k)) and I(k). It is defined as:

Q(g) = h_q(g(T(k)), I(k)) < ε    (5)

where the parameter q represents the selected qth quantile value and ε denotes that each point of g(T(k)) must be within distance ε of some point of I(k).

• Operators: are the functional elements that lead the transformation of one state to another. For each current state n, the operators A and B are computed:
– Function A. Each partial set of transformations from the current state is partitioned into four regions by vertical and horizontal bisections. Each of these regions is associated with a new node n_i of the search tree (that is, four new states).
– Function B. The quality of function match (Eq. (5)), denoted by the partial directed Hausdorff distance between g_c(T(k)) and the current image I(k), is computed for each of the new states generated, where g_c(T(k)) represents the transformation of the template points by the central translation of the corresponding state. It is referred to as h_q(g_c(T(k)), I(k)).

Splitting each current state into four new states makes the search tree a quaternary tree structure, where each node is associated with a 2^i × 2^j region. To be precise, the heuristic search process is initiated by associating the set of transformations G with the root of the search tree; subsequently, the best node at each tree level l is expanded into four new distinct states, which are non-overlapping and mutually exclusive. The splitting operation finishes when the quadrisection process computes a translational motion according to the quality of function match Q(g), or when all the regions associated with the different nodes have been partitioned into cells of unit size. Fig. 2 illustrates the space of transformations tree that corresponds to a search process. Each of the four computed regions is referred to as the NW, NE, SE and SW cell. The best node to expand from the NW, NE, SW and SE cells at each tree level l is computed by an A* algorithm [23], which combines features of uniform-cost search and pure heuristic search. The corresponding cost value assigned to each state n is defined as:

f(n) = c(n) + h*(n)    (6)

where c(n) is the estimated cost of the path from the initial node n_0 to the current node n, and h*(n) is the heuristic estimate of the cost of a path from node n to the goal.

Fig. 2. Search tree of the space of transformations: hierarchical partition of the space of states. This space is partitioned using a quadrisection process. The nodes at the leaf level define the finest partition.

3.1. Heuristic evaluation function h*(n)

The heuristic value of the cost of a path from node n to the goal, h*(n), is estimated by evaluating the quality of the best solution reachable from the current node n. The desirability of the best state is estimated by measuring the similarity between the distribution functions that characterize the current state and the objective state. The similarity between distributions is measured using the Kullback–Leibler distance [27,40], a frequently used information-theoretic similarity measure in the machine learning and information theory fields.

Let P denote the distribution function that characterizes the current state n and let Q denote the corresponding distribution function that characterizes the objective state. The definition of both functions is based on the quality of function match assigned to the target transformation, g_opt. Since the quality of function match is denoted by the partial directed Hausdorff distance, the distribution function P can be approximated by a histogram of distances {H_gc}_{i=1,...,r}, which contains the number of template points of T(k) at distance d_j with respect to the points of the input image I(k) when the transformation g_c of the current state n is applied to the current template T(k). The distribution function Q that characterizes the objective state can be modeled by approximating {H_gc}_{i=1,...,r} by an exponential distribution function f(n) = k·e^(-an), where the parameter a controls the decay of the exponential. According to the definition of the quality of function match (expression (5)), the transformation g_opt corresponds to the transformation g_c that provides the highest number of distance values d_j near zero and lower than the error bound ε. Therefore, if the quality of function match Q(g_c) assigned to the transformation g_c is verified, the distributions P and Q will show respectively an appearance similar to the illustration of Fig. 3(a) and (b). A sketch of the resulting search loop is given below.
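Pulling the operators, the cost of Eq. (6) and this heuristic together, the search loop can be sketched as follows; `quality` and `heuristic` are callables standing for Eq. (5) and for the Kullback–Leibler estimate of Section 3.2, and the naming is ours, not the authors':

```python
import heapq

def cell_centre(cell):
    """Central translation g_c of a cell, cf. Eq. (1)."""
    xmin, xmax, ymin, ymax = cell
    return ((xmin + xmax) // 2, (ymin + ymax) // 2)

def a_star_translation_search(G0, quality, heuristic, eps):
    """Best-first search over nested cells of the translation space G.
    A node is a half-open cell (xmin, xmax, ymin, ymax) represented by
    its centre translation g_c; `quality` evaluates the partial directed
    Hausdorff distance of Eq. (5), `heuristic` the estimate h*(n)."""
    open_list = [(heuristic(cell_centre(G0)), 0, G0)]  # f(n) = c(n) + h*(n)
    while open_list:
        _, depth, cell = heapq.heappop(open_list)      # most promising node
        gc = cell_centre(cell)
        if quality(gc) < eps:                          # goal test, Eq. (5)
            return gc
        xmin, xmax, ymin, ymax = cell
        if xmax - xmin <= 1 and ymax - ymin <= 1:
            continue                                   # unit cell: no split
        xm, ym = (xmin + xmax) // 2, (ymin + ymax) // 2
        # Function A: bisect each axis that is still wider than one unit.
        xparts = [(xmin, xm), (xm, xmax)] if xmax - xmin > 1 else [(xmin, xmax)]
        yparts = [(ymin, ym), (ym, ymax)] if ymax - ymin > 1 else [(ymin, ymax)]
        for x0, x1 in xparts:
            for y0, y1 in yparts:
                sub = (x0, x1, y0, y1)
                # Function B: score the new state; depth plays the role of c(n).
                f = (depth + 1) + heuristic(cell_centre(sub))
                heapq.heappush(open_list, (f, depth + 1, sub))
    return None                                        # no cell satisfies eps
```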


Fig. 3. Distribution functions: (a) histogram of distances {H_gc}_{i=1,...,r} associated with a transformation g_c that verifies the quality of function match Q(g_c). (b) Distribution function f(n) = k·e^(-an) with parameter a = 1 that characterizes the final state. The horizontal axis represents distance values and the vertical axis denotes the number of template points transformed with g_c at distance d_j with respect to the current scene points I(k).

3.2. Computing similarity using the Kullback–Leibler measure

Given the distribution functions P and Q, with R the number of template points, the Kullback–Leibler distance (KLD) between the two distributions is defined as:

D(P‖Q) = Σ_{i=1}^{N} p_i log(p_i / q_i)    (7)
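A minimal sketch of this heuristic computation, assuming NumPy; the bin count and the smoothing constant are choices of the sketch, not values from the paper:

```python
import numpy as np

def kld_heuristic(distances, a=1.0, n_bins=32, smooth=1e-12):
    """Sketch of h*(n): the Kullback-Leibler distance of Eq. (7) between
    P, the histogram of template-to-image distances under the centre
    translation g_c, and Q, the decaying exponential f(n) = k*exp(-a*n)
    that models the goal state (Section 3.1)."""
    hist, _ = np.histogram(distances, bins=n_bins, range=(0, n_bins))
    p = hist / max(hist.sum(), 1)                 # empirical distribution P
    q = np.exp(-a * np.arange(n_bins))
    q = q / q.sum()                               # exponential model Q
    p = p + smooth                                # avoid log(0) and 0/0
    q = q + smooth
    return float(np.sum(p * np.log(p / q)))      # D(P || Q), Eq. (7)
```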

According to [27,40], D(P‖Q) has two important properties: (1) D(P‖Q) ≥ 0; and (2) D(P‖Q) = 0 if and only if P = Q. These properties show that when the template points do not match the input image points, the value of the KLD will be nonzero and positive, because the distributions P and Q are not similar, P ≠ Q. On the other hand, if the template points match the input image points, then the value of the KLD is equal to or near zero.

3.3. Estimated cost function c(n)

An estimated cost function c(n) is added to the f(n) function (expression (6)) in order to generate a backtracking process when the heuristic function leads the search process towards unpromising solutions. This depth term is based on the number of operators of type A applied from the initial search node to the current node n.

3.4. Initial state computation

The dimension M × N of the set of transformations G in the 2D space is dynamically computed in every frame k by incorporating an alpha-beta predictive filter [22,28,29] into the search algorithm. In order to restrict the search exploration to the strictly necessary dimension, an adjustable-size based set of transformations approach is proposed. This approach is based on the assumption that there is a relationship between the size of the set of transformations and the resulting uncertainty of the alpha-beta predictive filtering. The basic hypothesis underlying this approach is that regular movement of mobile objects is related to small uncertainties in the prediction of future target locations. Therefore, this hypothesis allows the definition of a reduced set of transformations when the target is evolving through a regular motion. Otherwise, the set of transformations is increased when the motion trajectory exhibits any deviation from the assumed temporal motion model.

The computation of the dimension of G involves the calculation of the parameters to be estimated by the filtering approach. These parameters are represented by the 2D opposite coordinates of the bounding box that encloses the target shape and can be written as a four-dimensional vector h = [h1, ..., h4]^T. The location vector and the corresponding velocity vector are jointly expressed as a state vector x = [h^T, ḣ^T]^T. The dynamic state equation, assuming a constant velocity model, is formulated as:

x(k+1) = Φ x(k) + v(k)    (8)

where

Φ = | 1  Δt |
    | 0  1  |

Δt is the sampling interval and v(k) denotes the process noise vector, which is used for modeling unknown manoeuvres of a non-constant velocity target. The state vector estimate is obtained as:

x̂(k+1|k+1) = x̂(k+1|k) + [α, β/ΔT]^T v(k+1)    (9)

where x̂(k+1|k+1) represents the updated estimate of x, x̂(k+1|k) corresponds to the predicted value at time step k + 1, and v(k+1) denotes the residual error, called the innovation factor, which is defined as the difference between the current measurement z(k+1) and the predicted observation for this measurement, ẑ(k+1|k). The parameters α and β are used respectively for weighting the target position and velocity in the update state computation of the filtering approach. These parameters are computed from the process noise standard deviation and the measurement noise deviation. The current measurement z(k+1) is computed from the vector h once the object has been matched through the search algorithm and has been updated through the reference template process (described in Section 4). Since the innovation factor v(k) represents a measure of the error of ẑ(k+1), a decision rule focused on the uncertainty measurement can be obtained in order to compute the dimension of G. Two main criteria are considered in the decision rule design. The first one is that small values of the innovation factor indicate low uncertainty about the estimate and therefore a reduced size of G, whereas deviations of the target motion from the assumed temporal motion model involve higher uncertainty about the estimation and hence a larger dimension of G. The second criterion is that the dimension M × N must be a 2^p × 2^q value in order to ensure that each terminal cell of G will contain only a single transformation after the last quadrisection operation has been applied. Assuming these requirements, M × N is associated with lower and upper bounds defined by 2^min ≤ v(k) ≤ 2^max, where v(k) = (v_M(k), v_N(k)) and 2^min, 2^max represent the nearest values to the innovation factor v(k).


Let φ be the parameter that weights the influence of the difference between both bounds in the selection of the appropriate value. This parameter is defined as a percentage margin, which is selected as 35%. A dynamic computation of this parameter could be introduced in order to obtain different weights according to the motion trajectory. Let w be the parameter computed according to the following expression:

w = φ (2^max - 2^min)    (10)

Once w is calculated, we obtain an estimate that determines whether the 2^max or the 2^min bound value should be selected. Using this estimate, the dimension M × N of G is computed independently for each axis according to the following decision rule:

M = 2^max,  if w + 2^min ≤ v_M(k)
M = 2^min,  if w + 2^min > v_M(k)    (11)

N = 2^max,  if w + 2^min ≤ v_N(k)
N = 2^min,  if w + 2^min > v_N(k)    (12)

Fig. 4 illustrates the integration of the different stages that compose the alpha-beta filtering with the computation of the adjustable-size based set of transformations and their respective equations. Once M and N have been computed, the lower and upper bounds [g_xmin, g_xmax] × [g_ymin, g_ymax] of G in each step time k are calculated as follows:

g_xmin(k) = g_x(k-1) - M/2    (13)
g_xmax(k) = g_x(k-1) + (M/2 - 1)    (14)
g_ymin(k) = g_y(k-1) - N/2    (15)
g_ymax(k) = g_y(k-1) + (N/2 - 1)    (16)

where (g_x(k-1), g_y(k-1)) represents the components of the solution transformation g computed in the previous step time k - 1.

3.5. Search algorithm

The proposed algorithm using A* heuristic search for fast template position matching is described in Fig. 5.
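Complementing Fig. 5, the prediction stage of Section 3.4 can be sketched per coordinate as follows (a simplification under our own naming; the α and β gains shown are illustrative assumptions, as the paper derives them from the noise statistics):

```python
import math

def alpha_beta_step(x, v, z, dt, alpha=0.85, beta=0.005):
    """One alpha-beta step for a single coordinate of the state vector:
    constant-velocity prediction (Eq. (8)) followed by the innovation
    correction (Eq. (9))."""
    x_pred = x + dt * v                 # predicted position
    innov = z - x_pred                  # innovation factor v(k+1)
    x_new = x_pred + alpha * innov      # position update
    v_new = v + (beta / dt) * innov     # velocity update
    return x_new, v_new, innov

def search_dimension(innov, phi=0.35):
    """Dimension (M or N) of G from one innovation component, following
    Eqs. (10)-(12): bracket |v(k)| between the nearest powers of two,
    2^min and 2^max, and use the margin w to pick one of them."""
    mag = max(abs(innov), 1.0)
    lo = 2 ** math.floor(math.log2(mag))          # 2^min
    hi = 2 ** math.ceil(math.log2(mag))           # 2^max
    w = phi * (hi - lo)                           # Eq. (10)
    return hi if w + lo <= mag else lo            # Eqs. (11)-(12)
```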

3.6. Recovering exceptions

Four different situations arise in shape tracking when alpha-beta predictive filtering is integrated with an A* heuristic search:

A. The new measurement computed is non-zero and the alpha-beta prediction is zero: this situation corresponds to the start of the movement of the target shape and therefore an initialization process of the alpha-beta filtering must be computed.
B. The new measurement is non-zero and similar to the non-zero prediction: this situation reflects an accurate estimation by the alpha-beta filtering and the A* search algorithm.
C. The new measurement is non-zero and noticeably different from the non-zero prediction: this situation denotes the presence of a motion discrepancy due to erratic direction changes or extreme deformations introduced by the target shape.
D. The new measurement is zero and the prediction is non-zero: stopping conditions are introduced by the target shape in its motion trajectory.

Since there is no reason to assume that the motion after a discontinuity can be accurately predicted from the motion history of the tracked object, a new initialisation of the alpha-beta filtering is computed for the last two situations. Additional mechanisms are required in the third situation, when the heuristic algorithm cannot provide a solution g_opt such that the condition Q(g) = h_q(g(T(k)), I(k)) < ε is verified, because erratic direction changes or extreme deformations of the target shape have occurred. With the purpose of computing the target position in this situation, the search algorithm carries out the following steps:

I. Search area increase. A new dimension M × N of G is proportionally computed for the initial search area established to match the target position.
II. Use of an iconic memory stack including different significant views of the object. To guarantee the best matching of the target object, the heuristic search is computed with all the different views of the target stored in the short-term memory of the tracking system.
III. Compute weak solutions. A weak solution is a solution where the error bound ε of the quality of function match (expression (5)) is increased by one unit up to a maximum value of 10. New A* searches using the current template and the different views that represent the target object are computed for each new value of ε until a solution is achieved.
IV. Alert state establishment. The loss of the target is declared when the object cannot be matched with the previous actions. These actions are repeated in the following images until new regularities are detected. A disappearance of the object from the scene is the main reason for the alert situation.

Fig. 4. Adjustable-size based set of transformations: computation of the M × N dimension of G.

4. Template update approach

The new template is only updated when the target object has evolved significantly to a new shape. The aspect changes can be due to the non-rigidity of objects or to a different aspect of the object becoming visible.

Since the heuristic search computes only the 2D motion component, and the best matching between the current template T(k) and input image I(k) is expressed by means of the error bound distance ε of the quality of function match, a low error bound ε denotes that the target is moving according to a translational motion, whereas the target object will have evolved towards a new shape when a large value of the error bound ε is computed. In order to detect an appreciable shape change, minimum and maximum boundaries for ε are defined from the analysis of different processed image sequences that represent deformable and rigid objects. Let ε_min be the minimum distance value that does not denote shape changes and let ε_max be the maximum distance value that is acceptable for a tolerable matching. Given ε_min and ε_max, all the solution values computed in the range ε_min ≤ Q(g) ≤ ε_max denote a 2D shape change. Therefore, the target object shows a new shape change when the qth fraction of the points that characterize the template position matching is between ε_min and ε_max distance of the I(k) image points. At the same time, in certain situations the target object will show previous views that are not represented by the current template, such as the object becoming visible again after a disappearance from the image or under occlusion conditions. Recovery of previous views of the target can be achieved through the use of a short-term memory subsystem (STMS).

4.1. Short-term memory subsystem (STMS)

The different templates that compose the STMS must represent the most common views of the object. With the purpose of minimizing redundancies, the problem of which representative templates must be stored is addressed as a dynamic approach that removes the less distinctive templates in order to leave space for new ones when the capacity of the visual memory is reached and a new template must be inserted into the STMS. In order to determine the weight of every template, we introduce a relevance index associated with every template. This index is defined according to the following expression:

R(k, i) = T_p(i) / (1 + k - T_s(i))    (17)

where k represents the time step, i corresponds to the template identification symbol, T_p(i) characterizes the persistence parameter of the ith template and represents its frequency of use as the current template, and T_s(i) denotes the obsolescence parameter and corresponds to the time since the last instant it was used as the current template. Both parameters are updated in each tracking stage k. A new template is inserted into the STMS when the quality of function match Q(g_opt) between the current template T(k) and input image I(k) ranges from ε_min to ε_max. On the other hand, a template is removed from the STMS when the stack of templates is full and a new template must be added to the short-term memory. In this situation, the template with the lowest relevance index is removed and the new template is inserted.

4.2. Template updating algorithm

According to the value of Q(g_opt) assigned to the computed transformation g_opt, every current template T(k) is updated as T(k+1) based on one of the steps of the following algorithm:

Step 1. If Q(g_opt) ≤ ε_min, the new template at time step k+1, T(k+1), is equivalent to the best matching of T(k) in I(k). That is, the edge points of I(k) that directly overlap some edge point of the best matching g_opt(T(k)) represent the new template T(k+1):

T(k+1) ← { i ∈ I(k) : min_{t ∈ T(k)} ‖g_opt(t) - i‖ = 0 }

Step 2. If some template of the STMS computes the best matching when the current template T(k) cannot match the target object with a distance value lower than or equal to ε_min, i.e., Q(g_opt) > ε_min, that STMS template is selected as the new template T(k+1). Otherwise, the current template T(k) is updated by incorporating the shape change by means of the partial directed Hausdorff distance measure (see Appendix A). In this context, we denote STMS = {T(STMS)_i}_{i=1,...,N} as the different templates that integrate the short-term memory, and Q(g; T(STMS)_i, I(k), ε) as the best error bound distance ε computed for the ith template of the STMS, referred to as T(STMS)_i. The updating process is expressed as:

T(k+1) ← { i ∈ I(k) : min_{t ∈ T(k)} ‖g_opt(t) - i‖ ≤ δ },  if Q(g_opt; T(k), I(k), ε) ≤ Q(g_opt; T(STMS)_i, I(k), ε)
T(k+1) ← T(STMS)_i,  if Q(g_opt; T(STMS)_i, I(k), ε) < Q(g_opt; T(k), I(k), ε)

Step 3. If the best matching computed using the current template T(k) and all templates of the STMS exceeds the error bound distance ε_max, no template is updated:

T(k+1) ← ∅,  if Q(g_opt; T(k), I(k), ε) ≥ ε_max and Q(g_opt; T(STMS)_i, I(k), ε) ≥ ε_max

Fig. 5. Heuristic algorithm for template position matching.
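Read as pseudocode, the three steps above reduce to a small decision function. Below is a minimal sketch under our own naming, together with the relevance index of Eq. (17); overlap_pts and dilated_pts stand for the two set expressions above and are assumed to be precomputed elsewhere:

```python
def relevance(k, persistence, last_used):
    """Relevance index of Eq. (17): templates used often and recently
    score higher and survive longer in the STMS."""
    return persistence / (1.0 + k - last_used)

def update_template(q_current, q_stms, eps_min, eps_max,
                    overlap_pts, dilated_pts, stms_template):
    """Three-way decision of Section 4.2.  q_current and q_stms are the
    best error bounds reached with the current template and with the
    best STMS template."""
    if q_current <= eps_min:
        return overlap_pts           # Step 1: pure translation
    if q_stms < q_current:
        return stms_template         # Step 2: recover a previous view
    if q_current <= eps_max:
        return dilated_pts           # Step 2: absorb the shape change
    return None                      # Step 3: no update, target lost
```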


Fig. 6. Template updating: the target motorcycle is moving according to a translational motion trajectory. Therefore, Q(g_opt) ≤ ε_min and the new template T(k+1) is updated from the edge contours of frame k+1 that directly overlap some edge contour of the best matching g_opt(T(k)).

Fig. 7. Template updating: a shape change of the hand target has taken place between frame k and frame k+1 due to fast motion. Therefore, ε_min ≤ Q(g_opt) ≤ ε_max and the new template T(k+1) is updated from the edge contours of frame k+1 that are within distance δ of some edge contour of the best matching g_opt(T(k)).

Fig. 8. Template updating: since the target person has rotated around himself according to a non-translational motion from frame k to frame k+1, the best matching is obtained by means of a template of the STMS. Therefore, this template, T(STMS)_i, is selected as the new template T(k+1).

Fig. 9. Template updating: no template is updated at step time k+1, since the target car has disappeared from the scene. All the templates of the STMS are used in the following images until new regularities are detected.


Figs. 6–9 illustrate the results obtained with the updating template approach using rigid and non-rigid targets. Each figure shows: (1) the current edge template updated from frame k, T(k), used for matching the target object in frame k+1, and (2) the new template T(k+1), which is updated from the best matching results between T(k) and frame k+1, with the use of the STMS when it is required. Edges were extracted using a Canny edge detector [26].

5. Experiments and results

In this section, the proposed A* search strategy for computing target position and the template updating approach are evaluated with different experiments, and the performance achieved is analyzed in relation to Rucklidge's conventional strategy [16] and previous template updating approaches [30–32]. In order to compare the proposed approach with the previous works, we use the same parameter values that have previously been used.

5.1. Experimental environment

An experimental evaluation of the proposed framework is included to show its use in practical applications for vision based interface tasks such as gesture based human–computer interaction, navigation tasks for autonomous robots and visual surveillance. Thirty different sequences, containing 750 frames on average, have been used for the experimental evaluations, achieving the same behavior for all of them. In particular, indoor and outdoor video streams labeled "People", "Hand", "Car", "Motorcycle" and "Cover Book" are illustrated. Each of these sequences contains frames of 280 × 200 pixels that were acquired at 25 fps. Edges were extracted from frames using a Canny edge detector [26]. Different features are taken into account in the experiments: (1) the use of arbitrary shapes, such as deformable targets and rigid objects, (2) mobile and fixed acquisition cameras, (3) presence of objects similar to the target shape and (4) different illumination conditions from outdoor and indoor environments. All experimental results were computed on a P-IV 3.0 GHz. The main aspects of each sequence are:

• People sequence: contains 855 frames (that is, 34 seconds) characterized by one person of 70 × 148 average size moving through an indoor environment surrounded by several static objects near the target shape. The person's movement is based on speeding up, slowing down and erratic motions. Also, diverse turns, stretching and deformations are introduced by the target object.
• Hand sequence: consists of 512 frames (20 seconds) characterized by a flexible hand of 108 × 116 average size that twists, bends and rotates on a desk with different objects around it. The tracker was tested for dramatic changes in appearance (frame 200) and rotations (frame 250).
• Car sequence: comprises 414 frames (17 seconds) acquired through a visual surveillance system. Different parked cars and a mobile car are present. The average size of the template is 60 × 40 pixels. The tracker was tested for scale changes (frames 100, 250).
• Motorcycle sequence: includes 70 frames (3 seconds) of a motorcycle of 184 × 149 average size that is moving fast in a traffic environment. The video stream was acquired through a mobile camera with zoom features.
• Cover Book sequence: consists of 955 frames (38 seconds) of a book cover of 180 × 180 average size in motion that represents a large planar object moving across the sequence with an affine transformation model.


5.1.1. Initialisation

The first reference template is computed from initial image areas that exhibit coherent movement. Movement detection is performed through the computation of optical flow [21] between the first two frames. Afterwards, a segmentation process based on thresholding selects patches showing coherent movement. The patches are ranked according to different criteria, e.g., their size, and the first ranked blob is selected as the reference template. Other criteria can be established in order to select a specific target object. (A rough sketch of this initialisation procedure is given at the end of this subsection.)

5.1.2. Initial state

The different initial states evaluated for each image of each video sequence correspond to different sets of transformations G and are based on:

• Fixed search area without motion prediction (A1): each initial state for each image to be processed is made up of a 64 × 64 pixel 2D translation set ranging from (-32, -32) to (32, 32).
• Fixed search area with motion prediction (A2): a 64 × 64 pixel 2D translation set ranging from (-32, -32) to (32, 32) is computed from the predicted target position.
• Adjustable-size based search area (A3): the dimension of each initial state for each image I(k) is computed from expressions (13)–(16).

5.1.3. Final state

In all the experiments carried out, the goal state of the A* heuristic search is defined as the translation g that verifies that 80% (parameter q = 0.8) of the template edge points are at most 2 pixels distance (error bound distance ε = 2.0) from the current explored image I(k). If no goal state is computed, the error bound distance ε is increased by one unit up to a maximum value of 10. The heuristic thresholds ε_min, ε_max and δ are set to 2, 10 and 8, respectively.

5.1.4. Short-term memory subsystem (STMS)

The short-term memory subsystem (STMS) is addressed as a dynamic approach that removes the less distinctive templates in order to leave space for the representative ones when the capacity of the visual memory is reached. The dimension of the STMS is set to 6; that is, the limit of the different reference templates that constitute the STMS is six. Using six different reference templates provides efficiency and flexibility to the tracking process: on one hand, the computational cost of the matching process between the target and every template stored in the STMS is not increased; on the other hand, with the use of the persistence and relevance features (described in Section 4.1), the different views of the target shape evolution are sufficiently represented for rigid and deformable objects.

5.1.5. Tracking sequences

Figs. 10 and 11 illustrate three sample frames of the indoor and outdoor sequences mentioned above, respectively. The first and fourth columns show original frames; the second and fifth columns depict the edge image computed with the Canny approach [26]; the third and sixth columns show the located edge template. No object is shown in the last frame of the Car sequence because the object present in the input scene does not correspond to the tracked target. Fig. 12 shows three sample frames for a 180 × 180 region of a book cover tracked across the video sequence, illustrating original frames, edge image and located edge template. In the following sections, the results of the template matching through the search process, template updating, average runtime and experimental conclusions for these sequences are described.
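As announced in Section 5.1.1, here is a rough sketch of the movement-based initialisation, assuming OpenCV; the flow threshold and the Canny thresholds are assumptions of this sketch, not values from the paper:

```python
import cv2
import numpy as np

def first_template(frame0, frame1, mag_thresh=1.0):
    """Initialisation of Section 5.1.1: dense optical flow between the
    first two frames, a magnitude threshold to keep coherently moving
    pixels, and the largest blob as the first reference template."""
    g0 = cv2.cvtColor(frame0, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)
    moving = (mag > mag_thresh).astype(np.uint8)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(moving)
    if n < 2:
        return None                            # no coherent movement found
    biggest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])  # skip background
    edges = cv2.Canny(g1, 50, 150)             # edge points inside the blob
    return np.argwhere((labels == biggest) & (edges > 0))
```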


Fig. 10. Indoor sequences: People sequence (frames 150, 200 and 250 are shown) and Hand Sequence (frames 100, 150 and 200 are shown).

Fig. 11. Outdoor sequences: Car sequence (frames 100, 250 and 350 are shown) and Motorcycle Sequence (frames 10, 30 and 50 are shown).

5.2. Computation of initial search state

This section analyzes the three different approaches for computing the initial search state using the A* search strategy described in Section 3. The performance of the different approaches is measured by means of the total number of translations computed (nodes to be explored) and the time required for processing each sequence. Results are illustrated in Fig. 13.

The results reported in Fig. 13 show that the approach based on a fixed search area with motion prediction (A2) requires a lower number of translations for matching the template than the approach without a predictive filter (A1). On the other hand, the number of translations to be computed is reduced considerably by the approach based on an adjustable-size search area (A3) in comparison with the other two approaches. At the same time, a reduction in the number of translations to be computed (nodes to be explored) is associated with a reduction in the temporal cost of the search process. Fig. 14 shows the performance of the different approaches based on the average number of nodes explored and the time measured in seconds for each frame of each sequence.

As we can see, the average number of nodes to be explored and the time required per frame are reduced through the use of the adjustable-size search area approach in comparison with the other two approaches.

5.3. Comparative analysis between search strategies

Results in the previous section evaluate the performance of the computation of the initial search state for the proposed A* heuristic search framework. In this section, we compare the best results achieved by the proposed A* search, which uses an adjustable-size based set of transformations, with the conventional blind search strategy [16], which does not use information to guide the search process and uses a fixed set of transformations. The average running time measured in seconds for computing every search strategy in each frame is illustrated in Fig. 15. From the results of Fig. 15, we can observe that the proposed A* heuristic search framework is computationally lighter and, as a consequence, faster than the blind search strategy, by an average factor of three.


Fig. 12. Planar object sequence: Cover Book Sequence (frames 300, 500 and 700 are shown).

Fig. 13. Computation of initial search state: nodes explored and total time required for processing each of the sequences using a fixed search area without motion prediction approach (A1), a fixed search area with motion prediction approach (A2) and an adjustable-size search area approach (A3).

Fig. 14. Performance of initial search state based approaches: (a) average number of nodes to be explored and (b) time measured in seconds per frame for each sequence, using a fixed search area without motion prediction approach (A1), a fixed search area with motion prediction approach (A2) and an adjustable-size search area approach (A3).

5.4. Comparative analysis between template updating approaches

Standard approaches are based on: (i) not updating the template, or (ii) updating the template every frame. Fig. 16 illustrates a qualitative comparison of updating approaches on the Hand sequence.

It shows the original frame, the appearance of the template to be matched in I(k), the result of the matching in I(k), and the step time k from which the template was computed; e.g., T(74) denotes that the template was computed from step time k = 74.


Fig. 15. Comparative analysis between two search strategies: average running time in seconds required to process one frame using a blind search strategy and the proposed A* search strategy.

The tracking process fails when the template is not updated (approach 1), as illustrated in frames 155 and 200, because the template position matching process cannot compute a transformation g for which the maximum number of template points of T(74) are within distance ε of some point of I(k). The tracking process also fails when the template is continuously updated (approach 2), because the template has also been updated in situations where the target has not evolved to a new shape, and therefore the template was constructed from background edges. In this situation, the new templates such as T(154) and T(199) do not represent the target to be tracked, and therefore the objects localized in frames 155 and 200 are erroneous. Updating the template only when the target has evolved to a new shape (approach 3) allows a successful tracking process across the entire sequence, as illustrated in the last row of Fig. 16.

Fig. 16. A qualitative comparison of update approaches 1, 2 and 3. With approach 1, the template is not updated and the tracking process fails because the template does not reflect the change of shape of the target. With approach 2, the template is updated every frame and continuously accumulates background edges; in this situation the tracking process fails because the template does not correspond to the target to be tracked. With approach 3, the template is only updated when the target has evolved to a new shape; the target is tracked correctly and the template is appropriately updated across the sequence.


Table 1. Template updating: number of template updates, number of different templates stored in STMS during tracking, number of templates used from STMS during tracking, and number of frames where the target was retrieved using STMS, for each sequence.

Sequence                                                    | People | Hand  | Car  | Motorcycle | Cover Book
N° of frames                                                | 855    | 512   | 414  | 70         | 950
Number of updates                                           | 430    | 300   | 200  | 60         | 480
Number of different templates stored in STMS                | 84     | 19    | 6    | 0          | 36
Number of templates used from STMS                          | 16     | 12    | 6    | 0          | 16
Number of frames where the target was retrieved using STMS  | 10     | 6     | 0    | 0          | 8

In order to assess the robustness of the template updating algorithm described in Section 4.2, we evaluate the number of template updates for each sequence, testing: (1) our approach, based on updating templates only when the target has evolved to a new shape together with the use of a short-term memory, and (2) the template updating approach used in [30–32], based on continuous updating at every frame using the Hausdorff measurement. We also evaluate the performance achieved when a short-term memory is incorporated into the template updating process. Table 1 summarizes the results obtained for the sequences illustrated in this paper.

The experimental results reported in Table 1 show that the number of required updates is minimized in relation to the other updating approaches based on the Hausdorff measurement [30–32]. This minimizes drift risk. Concretely, no target drifted using the proposed approach, whereas templates drifted in the People, Hand, Motorcycle and Cover Book sequences using the continuous every-frame updating approach. Moreover, the use of a short-term memory avoids the loss of the target in certain situations, as illustrated in Table 1 for the People, Hand and Cover Book sequences. For the Car sequence, six templates were used in attempts to recover the target once it had disappeared from the input image, as shown in Fig. 11; the target was not recovered because it did not appear in the input image again. For the Motorcycle sequence, no STMS template was used because the current template was matched in every input image. To be precise, STMS templates were used when the current template did not reflect the target object in the next frame. These situations corresponded to: (1) an imprecision error of the edge detection process in the current template (People, Hand and Cover Book sequences), (2) disappearance and reappearance of the target in the video stream (Car and People sequences) and (3) occlusion conditions of the target (People, Hand and Car sequences).

5.5. Average runtime behavior

Visual systems with real-time restrictions require the processing of each frame in a maximum time of 40 ms. In order to evaluate the average runtime behavior of the proposed tracking approach, Table 2 shows the runtime in seconds for computing the A* search strategy and the template updating process for each video stream. The time measured in seconds for processing each frame of the People, Hand, Car, Motorcycle and Cover Book sequences is, respectively: 0.022, 0.024, 0.013, 0.037 and 0.02. The time for the Motorcycle sequence increases because the target dimension encloses practically the whole scene to be processed. The experimental results reported in Table 2 confirm that the proposed tracker meets real-time restrictions. Moreover, the computational cost of the visual tracking task is lower than the processing latency (40 ms). This feature allows using the remaining time for other processes that integrate the visual system.

Table 2. Runtime in seconds for each process of each evaluated sequence.

Sequence                                     | People | Hand  | Car  | Motorcycle | Cover Book
N° of frames                                 | 855    | 512   | 414  | 70         | 950
Seconds                                      | 34     | 20    | 17   | 3          | 38
Time required to process A* Search Strategy  | 13.5   | 8.49  | 4.95 | 1.6        | 14.2
Time required to process Template Updating   | 5.4    | 4.29  | 0.7  | 1.0        | 5.9
Total (seconds) to process the sequence      | 18.9   | 12.78 | 5.65 | 2.6        | 20.1

6. Conclusions and future work

This paper is concerned with fast and accurate tracking of arbitrary shapes in video streams without any assumption about the speed and trajectory of the objects. The described approach does not need an a priori 2D template of the object to be tracked. The major aspects of the approach are focused on the decomposition of the transformation between frames of a 2D object moving in 3D space into two parts: (1) a 2D motion corresponding to the new target position and (2) a 2D shape change. An A* search framework in the space of transformations is proposed to compute efficient target motion, using the Kullback–Leibler measurement as a heuristic to guide the search process. The most promising initial search alternatives are computed through the incorporation of target dynamics. The 2D shape change is captured with 2D templates that evolve with time. These templates are only updated when the target object has evolved to a significantly new shape. Also, the representative temporal variations of the target shape are enclosed in a short-term memory subsystem.

The proposed template based tracking system has been tested, and it has been empirically shown that: (1) the computational cost of visual tracking for an arbitrary target shape is directly related to the set of transformations; real-time performance on general purpose hardware can be achieved when an A* search strategy and an adjustable-size based set of transformations are used, (2) the proposed heuristic search is faster than previous search strategies with similar features, by an average factor of three, allowing real-time performance, (3) although abrupt motions cannot be predicted by an alpha-beta filtering approach, the tracker performance was well adapted to the non-stationary character of a person's movement, which alternates abruptly between slow and fast motion, as in the People sequence, (4) target shape evolution introduces, in certain situations, views that cannot be matched in the current image I(k) using the A* heuristic search; these situations arise when the target shape is represented by sparse edge points and its size is extremely reduced due to disappearance and reappearance conditions, as illustrated in the Car sequence, and in these situations the use of a color cue is required in order to avoid the loss of the target, and (5) updating templates using the combined results of the quality of function match value and the use of a short-term memory leads to accurate template based tracking.

This work leaves a number of open possibilities that may be worth further research; among others, it may be interesting to consider: (i) further processing of input images in order to reduce illumination sensitivity, by means of an anisotropic diffusion filter instead of the Gaussian filter used by the classical Canny detector in order to obtain better edge detection, and a dynamic reformulation of the Canny edge detector, replacing the hysteresis step by a dynamic threshold process in order to reduce the blinking effect of edges across successive frames and, as a consequence, generate more stable edge sequences (the study of known models for varying illumination such as [41] can also be interesting in order to supplement the previous approach), (ii) the study of more sophisticated filtering association approaches such as multiple hypothesis tracking [34] for reducing the initial search space and (iii) the inclusion of more perceptual pathways (e.g. color information) in order to perform more robust tracking when the template is represented by a reduced set of edge points.

Appendix A. Partial directed Hausdorff distance

The Hausdorff distance [24] is a metric between two sets of points that is computed without an explicit pairing of the points in the respective data sets A and B. Formally, given two finite sets of points A = {a_1, ..., a_r} and B = {b_1, ..., b_s}, the Hausdorff distance is defined as:

$$H(A, B) = \max\big(h(A, B),\, h(B, A)\big) \tag{A.1}$$

where

$$h(A, B) = \max_{a \in A}\, \min_{b \in B}\, \|a - b\| \tag{A.2}$$

The function h(A, B) is called the directed Hausdorff distance from set A to B, where $\|a - b\|$ denotes the Euclidean distance. The directed distance ranks each point of A by its distance to the nearest point in B and uses the largest ranked value as the measure of distance. In order to avoid erroneous results due to occlusion or noise, the directed Hausdorff distance can be naturally extended to a partial directed distance between the sets A and B by replacing the maximum with the qth quantile value. The partial directed Hausdorff distance is defined as:

$$h_q(A, B) = \operatorname*{Q^{\mathrm{th}}}_{a \in A}\, \min_{b \in B}\, \|a - b\| \tag{A.3}$$

where $Q^{\mathrm{th}}_{a \in A}$ denotes the qth quantile of $\min_{b \in B} \|a - b\|$ over the points of A. An efficient computation of the partial directed Hausdorff distance between a data set that represents a template T(k) and another data set that represents an input image I(k) can be carried out using the distance transform image, or Voronoi surface [25], of I(k).
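As an illustration of this computation, the following sketch evaluates (A.3) using the Euclidean distance transform from SciPy; the function name, array conventions and default quantile are our own assumptions. Because the distance transform of I(k) depends only on the image, it can be computed once per frame and reused for every candidate transformation of T(k).

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def partial_directed_hausdorff(template_pts, image_edges, q=0.8):
    """Partial directed Hausdorff distance h_q(T(k), I(k)) of Eq. (A.3),
    computed with a distance transform of the image edge map.

    template_pts : (N, 2) integer array of (row, col) template edge points,
                   already placed under the candidate transformation and
                   assumed to fall inside the image bounds.
    image_edges  : 2D boolean array, True at edge pixels of I(k).
    q            : quantile in (0, 1]; q = 1 gives the plain directed distance.

    Illustrative sketch of the Appendix A computation, not the authors' code.
    """
    # Distance from every pixel to its nearest image edge pixel
    # (the "Voronoi surface" of I(k) mentioned in the text).
    dt = distance_transform_edt(~image_edges)
    rows, cols = template_pts[:, 0], template_pts[:, 1]
    nearest = dt[rows, cols]          # min_b ||a - b|| for each template point a
    return np.quantile(nearest, q)    # qth quantile instead of the maximum
```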

References

[1] Y. Aloimonos (Ed.), Active Perception, Lawrence Erlbaum Associates, NJ, 1993.
[2] R. Bajcsy, Active perception, Proceedings of the IEEE 76 (8) (1988) 996–1005.
[3] M. Turk, Computer vision in the interface, Communications of the ACM 47 (1) (2004) 61–67.
[4] A.P. Pentland, Perceptual intelligence, Communications of the ACM 43 (3) (2000) 35–44.
[5] D.M. Gavrila, The visual analysis of human movement: a survey, Computer Vision and Image Understanding 73 (1999) 82–89.
[6] J.M. Rehg, T. Kanade, Visual tracking of high DOF articulated structures: an application to human hand tracking, in: Proc. 3rd European Conference on Computer Vision, vol. II, 1994, pp. 35–46.
[7] Y. Zhong, A.K. Jain, M.P. Dubuisson-Jolly, Object tracking using deformable templates, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (5) (2000) 544–549.
[8] M. Spengler, B. Schiele, Towards robust multi-cue integration for visual tracking, Machine Vision and Applications 14 (2003) 50–58.
[9] M. Black, A. Jepson, EigenTracking: robust matching and tracking of articulated objects using a view-based representation, International Journal of Computer Vision 26 (1) (1998) 63–84.
[10] M. Isard, A. Blake, Condensation: conditional density propagation for visual tracking, International Journal of Computer Vision 29 (1) (1998) 5–28.
[11] C. Kervrann, F. Heitz, A hierarchical Markov modelling approach for the segmentation and tracking of deformable shapes, Graphical Models and Image Processing 60 (3) (1998) 173–195.

[12] S. Baker, I. Matthews, Lucas–Kanade 20 years on: a unifying framework, International Journal of Computer Vision 56 (3) (2004) 221–255.
[13] S. Belongie, J. Malik, J. Puzicha, Shape matching and object recognition using shape contexts, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (4) (2002) 509–522.
[14] P.J. Besl, N.D. McKay, A method for registration of 3-D shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence 14 (2) (1992) 239–256.
[15] Y. Chen, G. Medioni, Object modelling by registration of multiple range images, Image and Vision Computing 10 (3) (1992) 145–155.
[16] W.J. Rucklidge, Efficient Computation of the Minimum Hausdorff Distance for Visual Recognition, Lecture Notes in Computer Science, vol. 1173, Springer-Verlag, NY, 1996.
[17] T.-L. Liu, H.-T. Chen, Real-time tracking using trust-region methods, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (3) (2004) 397–401.
[18] D. Comaniciu, V. Ramesh, P. Meer, Real-time tracking of non-rigid objects using mean shift, in: IEEE Conf. on Computer Vision and Pattern Recognition, Hilton Head, SC, vol. II, 2000, pp. 142–149.
[19] G.D. Hager, P.N. Belhumeur, Efficient region tracking with parametric models of geometry and illumination, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (10) (1998) 1025–1039.
[20] I. Matthews, T. Ishikawa, S. Baker, The template update problem, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (6) (2004) 810–815.
[21] M. Campani, A. Verri, Motion analysis from first-order properties of optical flow, CVGIP: Image Understanding 56 (1) (1992) 90–107.
[22] C. Brown, Tutorial on Filtering, Restoration, and State Estimation, Technical Report 534, Computer Science Department, University of Rochester, 1995.
[23] J. Pearl, Heuristics: Intelligent Search Strategies for Computer Problem Solving, Addison-Wesley Series in Artificial Intelligence, 1984.
[24] D.P. Huttenlocher, G.A. Klanderman, W.J. Rucklidge, Comparing images using the Hausdorff distance, IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (9) (1993) 850–863.
[25] D.W. Paglieroni, Distance transforms: properties and machine vision applications, CVGIP: Graphical Models and Image Processing 54 (1) (1992) 56–74.
[26] J. Canny, A computational approach to edge detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 8 (6) (1986) 679–698.
[27] T.M. Cover, J.A. Thomas, Elements of Information Theory, John Wiley & Sons, 1991.
[28] R. Kalman, A new approach to linear filtering and prediction problems, Transactions of the ASME, Journal of Basic Engineering 82 (1960) 35–45.
[29] Y. Bar-Shalom, X.-R. Li, Estimation and Tracking: Principles, Techniques, and Software, Artech House, Boston, 1993.
[30] R. Parra, M. Devy, M. Briot, 3D Modelling and Robot Localization from Visual and Range Data in Natural Scenes, Lecture Notes in Computer Science, vol. 1542, Springer-Verlag, 1999.
[31] C. Schlegel, J. Illmann, H. Jaberg, M. Schuster, R. Worz, Integrating Vision Based Behaviours with an Autonomous Robot, Lecture Notes in Computer Science, vol. 1542, Springer-Verlag, 1999.
[32] E. Sánchez-Nielsen, M. Hernández-Tejera, Tracking moving objects using the Hausdorff distance: a method and experiments, in: Frontiers in Artificial Intelligence and Applications: Pattern Recognition and Applications, IOS Press, 2000, pp. 164–172.
[33] J. Reynolds, Autonomous Underwater Vehicle: Vision System, Ph.D. Thesis, Robotic Systems Lab, Department of Engineering, Australian National University, Canberra, Australia, 1998.
[34] T. Cham, J. Rehg, A multiple hypothesis approach to figure tracking, in: IEEE Conf. on Computer Vision and Pattern Recognition, vol. II, 1999, pp. 219–239.
[35] R. Collins, A. Lipton, H. Fujiyoshi, T. Kanade, Algorithms for cooperative multisensor surveillance, Proceedings of the IEEE 89 (10) (2001) 1456–1477.
[36] D. Koller, K. Daniilidis, H. Nagel, Model-based object tracking in monocular sequences of road traffic scenes, International Journal of Computer Vision 10 (1993) 257–281.
[37] D. Burschka, G. Hager, Dynamic composition of tracking primitives for interactive vision-guided navigation, in: SPIE Proceedings: Mobile Robots XVI, 2001, pp. 114–125.
[38] C. Lange, T. Hermann, H. Ritter, Holistic body tracking for gestural interfaces, in: 5th International Workshop on Gesture and Sign Language Based Human–Computer Interaction, Lecture Notes in Computer Science, vol. 2915, Springer-Verlag, 2003, pp. 132–139.
[39] V. Ferrari, T. Tuytelaars, L. Van Gool, Real-time affine region tracking and coplanar grouping, in: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. II, 2001, pp. 226–233.
[40] S. Kullback, Information Theory and Statistics, Dover Publications, 1968.
[41] G.D. Hager, P.N. Belhumeur, Efficient tracking with parametric models of geometry and illumination, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (10) (1998) 1025–1039.