Pattern Recognition 38 (2005) 1045 – 1058 www.elsevier.com/locate/patcog
Real-time multiple people tracking using competitive condensation Hee-Gu Kang, Daijin Kim∗ Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja-Dong, Nam-Gu, Pohang, 790-784, Korea Received 29 December 2003; received in revised form 24 November 2004; accepted 17 December 2004
Abstract
The CONDENSATION (Conditional Density Propagation) algorithm offers robust tracking performance and is suitable for real-time implementation. However, a CONDENSATION tracker is difficult to run in real time for multiple people tracking, since it requires very complicated shape modelling and a large number of samples for precise tracking. Further, it tracks poorly when people are close to each other or partially occluded. To overcome these difficulties, we present three improvements. First, we construct effective templates of people's shapes using the SOM (Self-Organizing Map). Second, we use the discrete HMM (Hidden Markov Model) as an accurate dynamical model of the people's shape transitions. Third, we use a competition rule to separate close or partially occluded people effectively. Simulation results show that the proposed CONDENSATION algorithm can achieve robust and real-time tracking in image sequences of a crowd of people. © 2005 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.
Keywords: Real-time multiple people tracking; Competitive CONDENSATION; HMM; SOM; People count; Video surveillance
∗ Corresponding author. E-mail addresses: [email protected] (H.-G. Kang), [email protected] (D. Kim).

1. Introduction
The goal of this research is to track a number of people walking in a cluttered scene in real time. People tracking appears in many applications, such as surveillance, human-computer interaction, and people counting systems. However, tracking multiple people is difficult, since the shapes and the dynamics of human beings are complicated and the backgrounds are often cluttered. Moreover, partial occlusion and "people-in-groups" situations occur frequently during the practical tracking of multiple people.

Background subtraction [1,2] is a popular method to segment and track moving objects in a real-time surveillance system. The W4 system, a surveillance system that detects and tracks people and their body parts using background subtraction and shape analysis, achieved real-time tracking performance on the PC platform [1]. However, segmentation based on background subtraction performs poorly on image sequences from a moving camera and on sequences that include sudden changes of illumination or shadow.

The CONDENSATION algorithm proposed by Isard and Blake is a tracking method based on the recursive Bayesian filter [3]. It first calculates the a priori density $p(x_t \mid Z_{t-1})$ from the initialization density $p(x_{t-1} \mid Z_{t-1})$ using the system dynamics model $p(x_t \mid x_{t-1})$, and then evaluates the a posteriori density $p(x_t \mid Z_t)$ given the new measurement $p(z_t \mid x_t)$ as

$$p(x_{t-1} \mid Z_{t-1}) \;\xrightarrow{\ \text{dynamics } p(x_t \mid x_{t-1})\ }\; p(x_t \mid Z_{t-1}) \;\xrightarrow{\ \text{measurement } p(z_t \mid x_t)\ }\; p(x_t \mid Z_t). \tag{1}$$
Philomin et al. [4] achieved reliable tracking from a moving camera by using a point distribution model for the human shape and the quasi-Monte Carlo technique to efficiently sample the human shape in the high-dimensional state space. However, their CONDENSATION tracker is difficult to run in real time for multiple object tracking, since it requires a very complicated shape model and a large number of samples per object for good tracking performance.

Tracking multiple objects using the CONDENSATION algorithm has also been considered in other application areas. MacCormick and Blake [5] proposed a state vector that concatenates the states of all individual objects. Their approach fixes the number of tracked objects in advance and needs a very high-dimensional state space to track several objects, such as a crowd of people. Koller-Meier and Ade [6] extended the CONDENSATION algorithm to handle the tracking of multiple objects in range image sequences. Each object represented in the multi-modal state distribution is separated by a clustering method such as the nearest neighbor technique. However, this approach has difficulty handling multiple objects that are near each other or partially occluded, which frequently occurs in practical situations.

In our work, we propose a new competitive CONDENSATION algorithm that can achieve robust and real-time tracking of multiple people who are near each other or partially occluded. To accomplish this, we construct a discrete human shape model using the SOM, where each node corresponds to a representative template of a human shape. We use the discrete HMM to compute the dynamical transition of the human shape accurately and rapidly. We also use a competitive rule to select the most plausible samples effectively when some people are very near each other or partially overlapped. This greatly improves the robustness of multiple people tracking in image sequences of a crowd of people.

This paper is organized as follows. Section 2 gives a brief description of the theoretical background of the algorithms used in this paper. Section 3 describes the techniques used to construct a discrete human shape model. In Section 4, we explain the competitive CONDENSATION algorithm proposed for robust multi-people tracking. Section 5 shows simulation results that demonstrate the robustness and real-time performance of the proposed tracking method. Finally, a conclusion is drawn.

0031-3203/$30.00 © 2005 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2004.12.008
2. Backgrounds
2.1. The CONDENSATION algorithm
The CONDENSATION algorithm is a probabilistic tracking algorithm based on factored sampling [3]. Let $x$ represent an n-dimensional state vector of the tracked object. Because the exact state is unknown, the object is described by a probability function $p(x)$. As the observation of the object changes over time, the probability function of the state vector $p(x)$ changes as well. The transition of the state vector $x$ is modelled by a stochastic dynamic equation as

$$x_t = f(x_{t-1}, w_{t-1}), \tag{2}$$

where $w_{t-1}$ is a noise vector with a Gaussian distribution that accounts for the uncertainties in the modelling. The transition of the state's probability function from $p(x_{t-1})$ to $p(x_t)$ is then modelled by the mapping

$$p(x_t) = \int p(x_t \mid x_{t-1})\, p(x_{t-1})\, dx_{t-1}, \tag{3}$$

where $p(x_t \mid x_{t-1})$ is a Markov process in that the density function $p(x_t)$ depends only on the immediately preceding density function $p(x_{t-1})$, and not on any density function prior to $t-1$.

Let $z_t$ be the measurement at time $t$, with the observation history $Z_t = \{z_1, \ldots, z_t\}$. To track the objects, we update the state vector $x$ at each time step based on the new measurement $z_t$ according to Bayes' rule as

$$p(x_t \mid Z_t) = \frac{p(z_t \mid x_t, Z_{t-1})\, p(x_t \mid Z_{t-1})}{p(z_t \mid Z_{t-1})} = k\, p(z_t \mid x_t, Z_{t-1})\, p(x_t \mid Z_{t-1}) = k\, p(z_t \mid x_t)\, p(x_t \mid Z_{t-1}), \tag{4}$$

where $k$ is a normalization factor. The simplification from $p(z_t \mid x_t, Z_{t-1})$ to $p(z_t \mid x_t)$ follows from the assumption that the measurements are independent, i.e., $p(z_t \mid x_t, Z_{t-1}) = p(z_t \mid x_t)$. From the above equation, we obtain the posteriori state density $p(x_t \mid Z_t)$ by multiplying the priori state density $p(x_t \mid Z_{t-1})$ and the observation density $p(z_t \mid x_t)$. The priori state density $p(x_t \mid Z_{t-1})$ is obtained by computing $\int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid Z_{t-1})\, dx_{t-1}$. The observation density $p(z_t \mid x_t)$ evaluates the likelihood that a state $x_t$ causes the measurement $z_t$.

This analytic propagation rule is not practical, since the functions involved cannot be computed in closed form. Thus, the CONDENSATION algorithm uses the factored sampling method [7] to compute the posteriori state density $p(x_t \mid Z_t)$ approximately. The CONDENSATION algorithm can be summarized as follows:
(1) Sampling: Sample $s_{t-1}^i$ from the state density $p(x_{t-1} \mid Z_{t-1})$ represented by the sample set $\{s_{t-1}^i, \pi_{t-1}^i\}$, where $\pi_t^i = p(z_t \mid x_t = s_t^i)$.
(2) Prediction: Predict the new sample $s_t^i$ using the probabilistic dynamical model $p(x_t = s_t^i \mid x_{t-1} = s_{t-1}^i)$.
(3) Measurement: Compute $\pi_t^i = p(z_t \mid x_t = s_t^i)$.
Continue the above sampling steps until a predetermined stopping condition is satisfied.
Usually, the dynamic model is trained off-line. When the parameter vector represents the outlined contour of the tracked object, the observation density $p(z_t \mid x_t = s_t^i)$ may be measured using the sum of the distances between all measurement points on the outline of the human body and the corresponding closest feature points in the image.
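The three steps of the CONDENSATION iteration can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the tracker used in this paper: the state is a single scalar, the dynamical model is a Gaussian random walk, the observation density is a Gaussian around a fixed target, and the names `condensation_step` and `track_static_target` are hypothetical.

```python
import math
import random


def condensation_step(samples, weights, predict, likelihood, rng):
    """Run one sampling -> prediction -> measurement iteration."""
    n = len(samples)
    # (1) Sampling: draw n states with probability proportional to their weights.
    resampled = rng.choices(samples, weights=weights, k=n)
    # (2) Prediction: push each sample through the dynamical model p(x_t | x_{t-1}).
    predicted = [predict(s, rng) for s in resampled]
    # (3) Measurement: weight each predicted sample by p(z_t | x_t = s).
    new_weights = [likelihood(s) for s in predicted]
    total = sum(new_weights)
    if total > 0.0:
        new_weights = [w / total for w in new_weights]
    else:
        # All likelihoods vanished; fall back to uniform weights.
        new_weights = [1.0 / n] * n
    return predicted, new_weights


def track_static_target(target=5.0, steps=20, n=200, seed=0):
    """Toy 1-D run: samples start uniform on [0, 10] and condense on the target."""
    rng = random.Random(seed)
    samples = [rng.uniform(0.0, 10.0) for _ in range(n)]
    weights = [1.0] * n

    def predict(s, r):
        # Assumed dynamics: Gaussian random walk.
        return s + r.gauss(0.0, 0.2)

    def likelihood(s):
        # Assumed observation density: Gaussian around the target position.
        return math.exp(-0.5 * ((s - target) / 0.5) ** 2)

    for _ in range(steps):
        samples, weights = condensation_step(samples, weights, predict, likelihood, rng)
    # State estimate: expectation of the weighted sample set.
    estimate = sum(s * w for s, w in zip(samples, weights))
    return estimate, weights
```

Calling `track_static_target()` returns the weighted mean of the final sample set together with the normalized weights; in a real tracker, `predict` and `likelihood` would be replaced by the trained dynamical model and the contour-based observation density.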
2.2. Self-organizing map
The self-organizing map (SOM) is a neural network built around a one- or two-dimensional lattice of neurons for extracting the features contained in the input space [8]. Unlike competitive learning, the update rule is applied not only to the winner node but also to the neighborhood nodes of the winner node. The SOM algorithm can be summarized as follows:
(1) Sampling: Draw a sample $x(n)$ from the input space.
(2) Similarity matching: Find the best-matching neuron $i(x)$ by

$$i(x) = \arg\min_j \lVert x(n) - w_j(n) \rVert, \quad j = 1, 2, \ldots, N. \tag{5}$$

(3) Updating: Adjust the synaptic weight vectors of all neurons by

$$w_j(n+1) = w_j(n) + \eta(n)\, h_{j,i(x)}(n)\, (x(n) - w_j(n)), \tag{6}$$

where $N$ is the number of all neurons, $\eta(n)$ is the learning-rate parameter, and $h_{j,i(x)}(n)$ is the neighborhood function around the winning neuron.
Continue the above steps until a predetermined stopping condition is satisfied.

One important property of the SOM is its ability to approximate the data distribution in the input space, so the resultant feature map reflects the inherent data distribution density. We use this approximation property to build the representative human shape models from a variety of outlines of human bodies.

2.3. Hidden Markov model
Time-varying observation sequences can be modelled using the HMM [9]. We define the elements of an HMM by specifying the following parameters:
• $N$: The number of states in the model. The set of states is denoted as $S = \{S_1, S_2, \ldots, S_N\}$ and the state of the model at time $t$ is $q_t$, with $q_t \in S$ and $1 \le t \le T$, where $T$ is the length of the output observation symbol sequence.
• $M$: The number of distinct observable symbols. The set of observable symbols is denoted as $V = \{v_1, v_2, \ldots, v_M\}$.
• $A_{N \times N}$: An $N \times N$ matrix that specifies the state-transition probability that the state will transit from state $S_i$ to state $S_j$. $A_{N \times N} = [a_{ij}]_{1 \le i,j \le N}$, where $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$.
• $B_{N \times M}$: An $N \times M$ matrix that specifies the probability that the system generates the observable symbol $v_k$ at state $S_j$ and at time $t$. $B_{N \times M} = [b_j(k)]_{1 \le j \le N,\, 1 \le k \le M}$, where $b_j(k) = P(v_k \mid q_t = S_j)$.
• $\pi_N$: An N-element vector that indicates the initial state probabilities. $\pi_N = [\pi_i]_{1 \le i \le N}$, where $\pi_i = P(q_1 = S_i)$.

Given appropriate values of $N$, $M$, $A$, $B$ and $\pi$, the HMM can be used as a generator to give an observation sequence $O = O_1 O_2 \cdots O_T$, where each observation $O_t$ is one of the symbols from $V$ and $T$ is the number of observations in the sequence, by the following procedure:
(1) Choose an initial state $q_1 = S_i$ according to the initial state distribution $\pi$.
(2) Set $t = 1$.
(3) Choose $O_t = v_k$ according to the symbol probability distribution in state $S_i$, i.e., $b_i(k)$.
(4) Transit to a new state $q_{t+1} = S_j$ according to the state transition probability distribution for state $S_i$, i.e., $a_{ij}$.
(5) Set $t = t + 1$. If $t < T$, return to step 3. Otherwise, terminate the procedure.

3. Learning a discrete human model
3.1. Extracting shape vectors from image sequences
Since the human body is a nonrigid object, it is difficult to represent all possible shapes of the human body. Therefore, we need to construct a human shape model that consists of some representative template shapes. To build the human shape model, we have to extract shape vectors from the training image sequences. Baumberg and Hogg [10] proposed a way to generate a flexible shape model of a nonrigid object, such as a walking pedestrian. Their system automatically extracted shape vectors from image sequences obtained from a stationary camera. For simplicity and flexibility, we adopt the same method as Baumberg and Hogg [10], as follows.

First, we construct a background image of the image sequences by median filtering, under the assumption that no occlusion has occurred, and extract the foreground objects using a background subtraction technique. Next, each foreground object provides an ordered set of boundary points of its silhouette, which is treated as a sufficient number of points for B-spline approximation. Then, we select an initial starting point on the closed boundary: the upper-right one of the two points at which the boundary intersects the principal axis (i.e., the axis through the centroid of the boundary points which minimizes the sum of the perpendicular distances to that axis). The boundary points are reordered so that the first point is the selected initial point. The ordered set of $M$ boundary points $W_i = (X_i, Y_i)$, $0 \le i < M$, can be approximated by a closed B-spline curve $P(u) = (P_x(u), P_y(u))$ with $N$ control points $Q_k = (R_k, S_k)$ given below:

$$P(u) = \sum_{k=0}^{N-1} Q_k B_k(u), \tag{7}$$
Fig. 1. A procedure for extracting shape vectors: (a) background image, (b) input image, (c) foreground object, (d) B-spline curve.
where $P(0) = P(N)$, and the modified basis function $B_k(u)$ is

$$B_k(u) = \begin{cases} B(u - k) & \text{if } (u - k) \ge 0, \\ B(u + N - k) & \text{if } (u - k) < 0. \end{cases} \tag{8}$$

The B-spline curve that approximates the extracted boundary points is required to minimize the error function

$$\mathrm{Error} = \sum_{i=0}^{M-1} \left[ (P_x(u_i) - X_i)^2 + (P_y(u_i) - Y_i)^2 \right], \tag{9}$$

where the parameter values $u_k$ are determined by the following equation:

$$u_k = \begin{cases} 0 & \text{for } k = 0, \\ \xi \sum_{i=1}^{k} \lvert W_i - W_{i-1} \rvert & \text{for } k > 0, \end{cases} \tag{10}$$

where $W_M = W_0$ and the scale factor $\xi$ is chosen such that $u_M = N$. By Eq. (10), $u_k$ is the parameter value of the B-spline curve such that $P(u_k)$ corresponds to the boundary point $W_k$. For a good shape representation, we want to obtain the control points of a length-wise uniformly spaced B-spline curve. However, since the boundary points $W_k$ are not uniformly spaced, we need an approximation method. We take the approximation method of Baumberg and Hogg [10], who defined the control point parameter $u_k$ as in Eq. (10) for a good shape representation. Finally, the $N$ control points $Q_k = (R_k, S_k)$ of the length-wise uniformly spaced B-spline are used as a shape vector $x$ to represent the segmented object:

$$x = (R_0, S_0, R_1, S_1, \ldots, R_{N-1}, S_{N-1}). \tag{11}$$
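The parameterisation of Eq. (10), the wrapped basis of Eq. (8), the curve of Eq. (7) and the shape vector of Eq. (11) can be sketched as follows. The order of the B-spline is not stated in this section, so the sketch assumes a uniform cubic basis supported on [0, 4); the least-squares fit of Eq. (9) is omitted, and all function names are hypothetical.

```python
import math


def chord_length_parameters(points, n_ctrl):
    """Eq. (10): u_k grows with boundary length and is scaled so that u_M = N."""
    m = len(points)
    cumulative = [0.0]
    for i in range(1, m + 1):
        x0, y0 = points[i - 1]
        x1, y1 = points[i % m]  # W_M = W_0 closes the boundary
        cumulative.append(cumulative[-1] + math.hypot(x1 - x0, y1 - y0))
    scale = n_ctrl / cumulative[-1]  # the scale factor of Eq. (10)
    return [scale * c for c in cumulative]


def cubic_basis(s):
    """Uniform cubic B-spline basis B(s), an assumed choice of basis."""
    if 0.0 <= s < 1.0:
        return s ** 3 / 6.0
    if 1.0 <= s < 2.0:
        t = s - 1.0
        return (-3.0 * t ** 3 + 3.0 * t ** 2 + 3.0 * t + 1.0) / 6.0
    if 2.0 <= s < 3.0:
        t = s - 2.0
        return (3.0 * t ** 3 - 6.0 * t ** 2 + 4.0) / 6.0
    if 3.0 <= s < 4.0:
        t = s - 3.0
        return (1.0 - t) ** 3 / 6.0
    return 0.0


def periodic_basis(k, u, n_ctrl):
    """Eq. (8): wrap the basis around so that the curve is closed."""
    s = u - k
    if s < 0.0:
        s += n_ctrl
    return cubic_basis(s)


def evaluate_closed_bspline(ctrl, u):
    """Eq. (7): P(u) as a weighted sum of the control points (needs len(ctrl) >= 4)."""
    n = len(ctrl)
    u = u % n  # P(0) = P(N)
    x = sum(q[0] * periodic_basis(k, u, n) for k, q in enumerate(ctrl))
    y = sum(q[1] * periodic_basis(k, u, n) for k, q in enumerate(ctrl))
    return x, y


def shape_vector(ctrl):
    """Eq. (11): interleave control point coordinates into one vector."""
    return [coord for q in ctrl for coord in q]
```

For a unit square boundary and eight control points, `chord_length_parameters` places the four corners at parameters 0, 2, 4 and 6, and closes the curve at 8.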
Fig. 1 shows the whole procedure for obtaining the shape vector of a walking pedestrian in an image. The upper left picture shows a background image obtained by median filtering, the upper right picture is an input image, the lower left picture is the foreground object obtained by background subtraction, and the lower right picture is the object approximated by the B-spline curve.

3.2. Creating the template shapes using SOM
For real-time object tracking, it is better to reduce not only the dimensionality of the shape vector but also the space of the shape model as much as possible. Cootes et al. [11] proposed to use principal component analysis (PCA), which reduces only the dimensionality of the shape vectors to a number of modes of variation. In contrast, we reduce the shape model's space by applying vector quantization (VQ) to the shape space so as to discretize it into one of N discrete values. Moreover, VQ reduces the computational complexity of the measurement process because it directly selects the measurement points of the template shape from a look-up table that holds the pre-computed measurement points for each template. This avoids time-consuming processes such as PCA reconstruction and B-spline drawing.
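The SOM steps of Section 2.2 (Eqs. (5) and (6)), which are used here as the vector quantizer, can be sketched as below. The Gaussian neighborhood function, the linearly decaying learning rate and neighborhood width, and the names `train_som`, `best_match` and `lattice_distance` are assumptions made for this illustration; the section does not specify these choices.

```python
import math
import random


def best_match(weights, x):
    """Eq. (5): lattice node whose weight vector is closest to x."""
    return min(weights, key=lambda node: sum((a - b) ** 2 for a, b in zip(x, weights[node])))


def lattice_distance(a, b):
    """Distance between two templates, measured on the SOM lattice."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def train_som(data, rows, cols, epochs=20, eta0=0.5, sigma0=None, seed=0):
    """Train a rows x cols SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialise every node with a randomly chosen training vector.
    weights = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    sigma0 = sigma0 or max(rows, cols) / 2.0
    total = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # (1) Sampling
            frac = step / total
            eta = eta0 * (1.0 - frac)                   # assumed learning-rate decay
            sigma = max(sigma0 * (1.0 - frac), 0.5)     # assumed neighborhood shrinkage
            win = best_match(weights, x)                # (2) Similarity matching
            for node, w in weights.items():             # (3) Updating, Eq. (6)
                h = math.exp(-lattice_distance(node, win) ** 2 / (2.0 * sigma ** 2))
                for j in range(dim):
                    w[j] += eta * h * (x[j] - w[j])
            step += 1
    return weights
```

After training, `best_match(weights, x)` returns the lattice label of the template that quantizes a shape vector `x`, and `lattice_distance` gives the distance between two such labels.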
We use the self-organizing map (SOM) to provide a set of discrete template shapes. The SOM used in this work has one input layer and one output layer. Shape vectors extracted from the training image sequences are presented to the SOM. We take the weight vector between all input nodes and each output node as a template vector, and we define the distance between two templates as the lattice distance between the corresponding output nodes. Because the feature map of the SOM reflects the distribution of the input patterns, this provides a more meaningful distance measure than a distance defined on the shape vector space. Each output node serves as the label of its template. Fig. 2 shows 16 B-spline curves recovered from the control points of 16 template shapes, where the templates are located from (0,0) to (15,0) in the 16 × 16 two-dimensional SOM space. From this figure, we note that the template shapes are not only sufficiently distinct to separate different shape vectors, but also change smoothly to cover all possible shape vectors between two adjacent nodes.

Fig. 2. B-spline curves recovered from control points of 16 typical template shapes located at (n, 0) (n = 0, ..., 15).

4. Tracking algorithm
4.1. Dynamical model
We need to define a dynamical model in order to apply the discrete shape model to the CONDENSATION algorithm. Since the more similar two shapes are, the higher the probability that the shape at time $t-1$ ($s_{t-1}$) turns into the shape at time $t$ ($s_t$), the transition probability is usually defined by a Gaussian distribution as

$$p(x_t = s_t \mid x_{t-1} = s_{t-1}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2} \left[ d_{temp}(s_t, s_{t-1}) / \sigma \right]^2 \right\}, \tag{12}$$

where $d_{temp}(s_t, s_{t-1})$ is the dissimilarity measure between two templates, which could be defined as the distance between the two corresponding output nodes in the lattice of the SOM. However, this dynamical model does not directly reflect the actual transitions of human shapes. Therefore, we use the hidden Markov model (HMM) to predict the transition of human shapes more precisely. The HMM learns the dynamical model from the shape transition sequences obtained from the training image sequences. We use the Viterbi algorithm for finding the optimal state transition probability from a given shape sequence. In our experiment, the number of observable symbols is 256 (= 16 × 16), since each observation symbol is the index of a shape template, i.e., the index of an output node of the SOM. The number of hidden states $N$ is 32 in our experiment. We can predict the transition probability $p(s_{t-1}, s_t \mid \lambda)$ for a specific observed shape transition sequence $(s_{t-1}, s_t)$ from the trained HMM $\lambda$. Finally, the shape transition probability is given by

$$p(x_t = s_t \mid x_{t-1} = s_{t-1}) = \frac{p(s_{t-1}, s_t \mid \lambda)}{\sum_{s} p(s_{t-1}, s \mid \lambda)}, \tag{13}$$

where $s$ denotes all possible observed shape sequences.

4.2. Measurement
After obtaining the predicted new samples from the dynamical model, we need to measure the observation density $P(z \mid x)$ to evaluate the likelihood that state $x_t$ causes measurement $z_t$. Here, we use an observation density based on the multi-feature distance transform algorithm [12], where four-directional edge features are used as the multi-features. For each template shape, we obtain a shape vector which corresponds to its control points from the trained SOM. We can recover the B-spline curve from the control points using Eqs. (7) and (8) and obtain the edge direction
at the measurement point $z_i$ along the recovered B-spline curve by first-order differentiation. The image features are obtained by applying a Canny edge detector to the input image $I$, which provides oriented edge features discretized into four bins (see Fig. 3). Then, the observation density is given by

$$P(z \mid x) = \exp\left\{ -\frac{1}{M} \sum_{i=1}^{M} d^2(z_i, I) \right\}, \tag{14}$$

where $M$ is the number of measurement points along the B-spline curve and $d^2(z_i, I)$ denotes the distance between the measurement point $z_i$ and the edge feature point in $I$ that is closest to $z_i$ and has the same edge direction as $z_i$.

Fig. 3. Four different directions in the Canny edge detector.

We considered two approaches to computing the distance $d^2(z_i, I)$. First, we built an N × N mask whose elements hold the distance from the center point of the mask. The mask is applied, with a min(mask) operator that returns the minimum value in the mask, to the region around a given measurement point in the edge feature image. This method gives a fast measurement. However, since it neglects the edge features outside the mask, the observation can be distorted, especially when the samples lie coarsely in the state space. Second, we used the distance transform [12], where each pixel value of the distance transformed image is the distance between that pixel and the closest feature point. Thus, we can compute the distance $d^2(z_i, I)$ by merely reading the pixel value of the distance transformed image at the measurement point $z_i$. The distance transform provides a more precise distance measure, so the observation is not distorted. However, computing the distance transform can make the entire tracking process time-consuming.
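The second approach, a table look-up in a distance transformed image followed by Eq. (14), can be sketched as follows. This is a minimal illustration under stated assumptions: the distance transform is computed by brute force (practical only for tiny images), one distance map is kept per edge direction, $d^2$ is read as the squared look-up value, and the function names are hypothetical.

```python
import math


def distance_transform(features, width, height):
    """Brute-force Euclidean distance transform.

    features: list of (x, y) edge pixels of one direction bin.
    Returns a height x width table; entry [y][x] is the distance from
    pixel (x, y) to the nearest feature pixel (infinite if there is none).
    """
    if not features:
        return [[math.inf] * width for _ in range(height)]
    return [
        [min(math.hypot(x - fx, y - fy) for fx, fy in features) for x in range(width)]
        for y in range(height)
    ]


def observation_density(measure_points, dist_maps):
    """Eq. (14): exp(-(1/M) * sum of squared distances).

    measure_points: list of (x, y, direction) along the template contour.
    dist_maps: {direction: distance transformed image for that direction}.
    """
    total = 0.0
    for x, y, direction in measure_points:
        d = dist_maps[direction][y][x]  # one table look-up per measurement point
        total += d * d
    return math.exp(-total / len(measure_points))
```

Because the distance maps are computed once per frame, evaluating a sample costs only M look-ups, which is what makes this approach precise but front-loads the cost into the transform itself.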
Fig. 4. Edge segmentation results: (a) original image; (b) edge map; (c) moving edge map.
To reduce the distortion due to clutter in the scene, we use the moving edge map [13]. The moving edge map is obtained by a point-wise AND operation between the edge features of the current image and the edge features of the difference image of two consecutive frames. We do not use the background edge map. The moving edge map is effective for extracting the edge features of moving objects in clutter. Fig. 4 shows two kinds of edge segmentation. The moving edge map is clearly more reliable than the edge map in the case of large object movement. Finally, the observation density is defined as the weighted sum of the measurement densities on the edge map and the moving edge map as

$$p(z \mid x = s) = w_1 \cdot p_{EMAP}(z \mid x = s) + w_2 \cdot p_{MEMAP}(z \mid x = s), \tag{15}$$

where $w_1 + w_2 = 1$, and $p_{EMAP}(\cdot)$ and $p_{MEMAP}(\cdot)$ are the observation densities on the edge map and the moving edge map, respectively. When the camera is moving or an object does not move, the moving edge map has no influence on the measurement density.

Fig. 5. Several states for several people converge into one dominant person.

4.3. A competition rule for multiple object tracking
The original CONDENSATION algorithm was proposed to track a single object and has been extended to track multiple objects in a variety of ways. Koller-Meier and Ade [6] used a multi-modal distribution where each mode corresponds to an individual person. They assume a fair segmentation, i.e., that all objects are detected with the same probability, but this assumption is not reasonable in several practical situations. As a result, their approach is not reliable, because the sampling process makes the state samples converge to the one target which best fits the model. Fig. 5 shows several proximate objects converging into one dominant object, where each human-shaped contour represents one state of Koller-Meier's CONDENSATION tracker. Contrary to our expectation of tracking multiple objects, the states converge to the location of the best-fit object. MacCormick and Blake [5] considered that a joint probability with concatenated state vectors could be used to track multiple objects. This method has the disadvantage that the size of the state vector varies depending on the number of tracked people, and it becomes large when the number of tracked people is high.

To track an arbitrary number of objects in real time, we consider the approach where one independent tracker is assigned to each object. When tracking multiple people, the shapes of the objects are very similar. When people come close to each other or some are occluded by another object, the observation density becomes a multi-modal distribution. As the observation density becomes complicated, we need a large number of samples to track people precisely. Even if we use a sufficient number of samples and a highly complicated shape model, the "people in groups" situation can often make tracking fail.

Fig. 6. The effect of competition rule on the conditional observation density.

To alleviate this situation, we propose a competitive CONDENSATION algorithm as follows. Suppose that each tracker tracks only one object. In the CONDENSATION algorithm, the state density at time $t$ is represented by a set of samples $s_t$ and weights $\pi_t$. In a very close or overlapped situation, a tracker contains a subset of samples around the features tracked by some other trackers. A tracker needs only the subset of samples which act as candidate states around the object (feature) to be tracked. Thus, we consider a heuristic that suppresses the weight of samples around the features tracked by the other trackers. Since this heuristic makes the suppressed samples soon disappear, the tracker can represent the density with even a small number of samples. To obtain this suppression effect, the weight $\pi_t^{(i,n)}$ of the $n$th sample $s_t^{(i,n)}$ of the $i$th tracker at time $t$ is modified as

$$\pi_t^{(i,n)} = p(z_t \mid x_t = s_t^{(i,n)}) \times \prod_{k \in T,\, k \ne i} \left\{ 1 - 1 / \exp\left[ \alpha \cdot d(E_k, s_t^{(i,n)}) \right] \right\}, \tag{16}$$

where $\alpha$ is a constant, $T$ is the set of all trackers, $E_k$ denotes the state estimate of tracker $k$, such as the expectation value of the density, and $d(E_k, s_t^{(i,n)})$ is the distance between $E_k$ and a sample $s_t^{(i,n)}$. Here, we use $E_k$, which is a weighted sum over all samples of tracker $k$, instead of an individual sample $s_t^{(k,n)}$, because $s_t^{(k,n)}$ is a randomly generated sample and an individual sample by itself is not meaningful for the suppression effect. As shown in Fig. 6, the samples of a given tracker that are located close to other trackers are suppressed by Eq. (16), and these reduced weights result in a reduced number of such samples for that tracker in the next sampling process.
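The suppression of Eq. (16) can be sketched as follows. The sketch assumes that the distance $d$ is the Euclidean distance between positions and uses an arbitrary value for the constant in the exponent; neither choice is specified in this section, and the function names are hypothetical.

```python
import math


def state_estimate(samples, weights):
    """E_k: weighted mean of the samples of one tracker."""
    total = sum(weights)
    dim = len(samples[0])
    return tuple(
        sum(w * s[j] for s, w in zip(samples, weights)) / total for j in range(dim)
    )


def competitive_weight(likelihood, sample_pos, other_estimates, alpha=0.5):
    """Eq. (16): scale the likelihood by one suppression factor per other tracker.

    likelihood: p(z_t | x_t = s) of the sample.
    sample_pos: position of the sample.
    other_estimates: state estimates E_k of all trackers k != i.
    alpha: assumed value of the constant in Eq. (16).
    """
    weight = likelihood
    for estimate in other_estimates:
        d = math.dist(sample_pos, estimate)  # assumed Euclidean distance
        # 1 - 1/exp(alpha * d): zero on top of another tracker, near one far away.
        weight *= 1.0 - math.exp(-alpha * d)
    return weight
```

A sample that coincides with another tracker's estimate receives zero weight, whereas a sample far from every other tracker keeps essentially its original likelihood, so each tracker's samples are pushed away from objects that are already being tracked.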
Fig. 7. Tracking two people in case of not using the competitive rule.
5. Experiment results and discussion
5.1. Tracking multiple people on a natural scene video
We present some results on tracking multiple people. Shape vectors were obtained from the training image sequences using B-spline approximation and are represented by 40 control points. We constructed 256 template shapes by training the 16 × 16 SOM with many human boundary data segmented from the training images. The dynamic model for shape transition was obtained in two steps. First, we built the training sequence data of shape transitions for training the HMM, where the initial tracker was assumed to obey the dynamical model of Eq. (12). Second, the HMM was trained on these shape transition sequences, and the shape transition model was determined using the techniques described in Section 4.1. We selected test image sequences compressed in H.263 format at 15 fps. We assign 100 samples to each tracker when the competition rule is used, and 1000 samples to each tracker when it is not. We represent each sample by a state vector that includes the center position, the scale, and the shape vector of the template shape. Figs. 7 and 8 show the tracking results in the case of two people "moving in a group" when the competitive rule is used or not, respectively. Generally, it is difficult to track multiple people "moving in a group" because moving people easily become partially occluded and the state densities of the people moving in a group (specifically, the position, the scale, the shape and the dynamics) are almost identical. As seen in Fig. 8, the tracking of two people failed when the competitive rule was not used, even though a large number of samples (1000 per tracker) was used. The samples around the occluded people have lower measurement densities than the samples around non-occluded people because the occluded people do not reveal their whole edge features. As a result, the samples of the trackers tended to converge to non-occluded people by the sampling
Fig. 8. Tracking two people in case of using the competitive rule.
Fig. 9. Tracking a crowd of people using competition rule.
process, and the trackers ended up following the non-occluded people. However, the tracking of two people was successful when the competitive rule was used, even though only a small number of samples (100 per tracker) was used (see Fig. 7). When the competitive rule is used, it suppresses the measurement densities around the samples of the tracker for non-occluded people. This suppression effect lets the measurement densities around the samples of the tracker for the occluded people survive into the next frame, so the trackers continue to follow each individual person. Fig. 9 shows that the proposed competitive CONDENSATION algorithm can successfully track a crowd of people (more than 10 people) in real time at 15 fps. In this figure, we do not assign any trackers to some non-moving people. Trackers without the competition rule could not track a crowd of people at all. However, the competition rule enabled the trackers to track a crowd of people even with a small number of samples (100 per tracker).

5.2. Detecting newly appeared people
The original CONDENSATION algorithm was proposed to track moving objects, and we also place an emphasis on the multiple tracking capability in this research. However, to apply our algorithm to a practical application, we would need a separate object segmentation algorithm to detect moving objects. In this work, we propose instead to reuse the CONDENSATION algorithm for detecting the appearance of new people. A detector is just one of the trackers in the set of CONDENSATION trackers with the competition rule, but we apply an additional process to the detector. We know that some people exist around the locations of the trackers in the image. Initially, many people candidates
Fig. 10. An illustration of detecting newly appeared people.
were randomly located in the manner of uniform distribution over the areas that have no people, (i.e., areas of not having trackers). The detector is trying to find the new people in the people non-existent areas since the samples located in and around people being tracked are suppressed by the competition rule (Eq. (16)), and the samples converge to the location of newly appeared person by its matching criterion. Being converging to new person, if the expectation value of the observation density of a certain detector exceeds a predetermined threshold, the system adds a new tracker for each newly-detected object. The detector generates the samples for a priori density from two different sources, such as a posteriori density at the previous step, as well as the uniformly distributed initialization density for ability to detect newly appeared person on location which we do not know. Fig. 10 illustrates a detection process using the CONDENSATION algorithm, where the bottom section of each picture shows the location and matching weight t of the samples and the shape contours represent the samples whose weight value t are greater than a given threshold value. First picture shows randomly sampled samples from the uniform distribution.
We can see that the randomly generated samples converge to the newly appeared object within two or three consecutive frames (Fig. 10). The performance of the detector is adequate in such cases, but it still cannot detect a crowd of people in a very complex scene such as the one shown in Fig. 9. Therefore, a more precise and robust detector still needs to be developed.

5.3. Application to real-time video analyzer

To demonstrate the practical use of the proposed competitive CONDENSATION algorithm, we applied it to a real-time video analyzer, which is a type of people counting system. The implemented system tracks people's trajectories in stored digital video and analyzes their movement to compute several trajectory statistics, such as the number of passengers moving in specific directions and the flow density within a specific region. Fig. 11 shows the architecture of the proposed real-time video analyzer. It operates as follows. First, the digital video recorder (POS-WATCH™) compresses and stores the surveillance
Fig. 11. The architecture of the real-time video analyzer.
Fig. 12. An illustrative surveillance region for the real-time video analyzer.
video from the stationary camera in H.263 format. Next, the backup utility, provided by the vendor of the digital video recorder, moves the stored video from the digital video recorder to the video storage of the analyzer. The decoding module then decompresses the H.263 video to recover the original video images. From two sequential images, the current frame and the previous frame, a Canny edge detector produces the edge feature map and the moving edge feature map. Next, we apply the distance transform to the feature images to build four distance-transformed images, one for each direction. The tracking and detection modules, which are based on the competitive CONDENSATION algorithm described earlier, then generate the tracking histories, such as the trajectories of the people who appeared. Next, the statistics module uses the tracking information to analyze the behavior of the people appearing in the stored video. Finally, the analysis results are transferred to a report module, where they are presented in tabular and graphical form. In addition, the shape modelling and dynamics learning modules run off-line to build the shape templates and the parameters of the probabilistic dynamics, respectively. Fig. 13 shows the tabular report of the real-time video analyzer for the surveillance area shown in Fig. 12. The table summarizes the number of passengers and the number of events of entering or leaving the user-defined, arbitrarily shaped square region during a 20 min period. Each row denotes a specific direction defined as a pair of the markers (top, bottom, left and right side of the
Fig. 13. A tabular report of the real-time video analyzer: people counting result.
Fig. 14. A graphical report of the real-time video analyzer: the flow density.
user-defined, arbitrarily shaped square region). A column is added whenever an event occurs. From the table, we learn that the numbers of people entering and leaving the region are 71 and 70, respectively, and that entering through the right side and leaving through the bottom side occur very frequently (Fig. 13). Fig. 14 shows the graphical report of the real-time video analyzer for the surveillance area shown in Fig. 12, where the graph represents the flow density of passengers over the user-defined, arbitrarily shaped square region. The square region is divided into 10 × 10 equal areas. From the graph, we see that many people cross the region diagonally, in the 45° direction. The analyzer can also produce a summarized video that contains only the frames with moving people.
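The two reports described above, the entry and exit counts per side and the 10 × 10 flow density, can be sketched as follows. This is only an illustration under stated assumptions, not the analyzer's actual statistics module: the region is simplified to an axis-aligned rectangle `(x0, y0, x1, y1)` in image coordinates (y grows downward), a crossing is attributed to a side by looking at where the point outside the region lies, and the names `count_events` and `flow_density` are hypothetical.

```python
from collections import Counter


def inside(point, region):
    """True if the point lies in the half-open rectangle (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = region
    return x0 <= point[0] < x1 and y0 <= point[1] < y1


def crossing_side(outside_point, region):
    """Attribute a crossing to a side from the position of the outside point.

    Corner cases are resolved in the order left, right, top, bottom, which
    is a simplification chosen for this sketch.
    """
    x0, y0, x1, y1 = region
    x, y = outside_point
    if x < x0:
        return "left"
    if x >= x1:
        return "right"
    if y < y0:
        return "top"
    return "bottom"


def count_events(trajectories, region):
    """Count ("enter", side) and ("exit", side) events over all trajectories."""
    events = Counter()
    for trajectory in trajectories:
        for prev, cur in zip(trajectory, trajectory[1:]):
            was_in, is_in = inside(prev, region), inside(cur, region)
            if not was_in and is_in:
                events[("enter", crossing_side(prev, region))] += 1
            elif was_in and not is_in:
                events[("exit", crossing_side(cur, region))] += 1
    return events


def flow_density(trajectories, region, cells=10):
    """Count trajectory points per cell of a cells x cells grid on the region."""
    x0, y0, x1, y1 = region
    grid = [[0] * cells for _ in range(cells)]
    for trajectory in trajectories:
        for point in trajectory:
            if inside(point, region):
                col = min(cells - 1, int((point[0] - x0) * cells / (x1 - x0)))
                row = min(cells - 1, int((point[1] - y0) * cells / (y1 - y0)))
                grid[row][col] += 1
    return grid
```

For example, a trajectory that starts to the right of the region, walks through it, and leaves below it contributes one ("enter", "right") event and one ("exit", "bottom") event, and each of its points inside the region increments one grid cell.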
6. Conclusions

To track multiple people in real time, we proposed a discrete human shape model that reduces the shape space to a discrete-valued parameter space, and a competitive heuristic that uses the state estimates of the other trackers to sample precisely around the object being tracked. The discrete human model using the SOM reduced the search space effectively and also provided a meaningful distance measure between templates. The competition rule reduced the number of samples through effective sampling based on competition among adjacent trackers. It also allows an arbitrary number of partially occluded people to be tracked successfully. We also proposed a detection method based on the CONDENSATION algorithm for newly appeared people, although it fails to detect multiple objects that are very close together or overlapping. The proposed competitive CONDENSATION tracking method is practical for real-time tracking applications owing to its simplicity and efficiency. We have successfully applied the proposed tracking algorithm to a real-time video analyzer that serves as a people counting system.
Acknowledgements

The authors would like to thank the Ministry of Education of Korea for its financial support toward the Electrical and Computer Engineering Division at POSTECH through its BK21 program. This research was also partially performed for the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs funded by the Ministry of Science and Technology of Korea.
References

[1] I. Haritaoglu, D. Harwood, L. Davis, W4: Real-time surveillance of people and their activities, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 809–830.
[2] C.R. Wren, A. Azarbayejani, T. Darrell, A. Pentland, Pfinder: real-time tracking of the human body, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 780–785.
[3] M. Isard, A. Blake, Contour tracking by stochastic propagation of conditional density, Proc. Eur. Conf. Comput. Vision (1996) 343–356.
[4] V. Philomin, R. Duraiswami, L.S. Davis, Quasi-random sampling for condensation, Proc. Eur. Conf. Comput. Vision (2000).
[5] J. MacCormick, A. Blake, A probabilistic exclusion principle for tracking multiple objects, Proceedings of the Seventh International Conference on Computer Vision, 1999.
[6] E.B. Koller-Meier, F. Ade, Tracking multiple objects using the condensation algorithm, J. Robotics and Autonomous Syst. 34 (2–3) (2001) 93–105.
[7] U. Grenander, Y. Chow, D.M. Keenan, A Pattern Theoretic Study of Biological Shapes, Springer, Berlin, 1991.
[8] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall, Englewood Cliffs, NJ, 1999, pp. 443–465.
[9] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77 (2) (1989) 257–286.
[10] A. Baumberg, D.C. Hogg, Learning flexible models from image sequences, Proc. Eur. Conf. Comput. Vision (1994).
[11] T.F. Cootes, C.J. Taylor, A. Lanitis, D.H. Cooper, J. Graham, Building and using flexible models incorporating grey-level information, Proc. IEEE Int. Conf. Comput. Vision (1993) 242–246.
[12] D. Gavrila, V. Philomin, Real-time object detection for "smart" vehicles, Proc. IEEE Int. Conf. Comput. Vision (1999) 87–93.
[13] C. Kim, J. Hwang, Fast and robust moving object segmentation in video sequences, IEEE Int. Conf. Image Process. (1999) 131–134.
About the Author: HEEGU KANG received the B.S. degree (2002) in Computer Science and Engineering and Mathematics, and the M.S. degree (2002) in Computer Science and Engineering from Pohang University of Science and Technology (POSTECH), South Korea. He is currently a research engineer at Affiliated Technology Research Center in INTELLIX Co. Ltd., Seoul, Korea. His research interests include intelligent systems, video surveillance, visual tracking and multimedia systems.

About the Author: DAIJIN KIM received the B.S. degree in Electronic Engineering from Yonsei University, Seoul, Korea, in 1981, and the M.S. degree in Electrical Engineering from the Korea Advanced Institute of Science and Technology (KAIST), Taejon, in 1984. In 1991, he received the Ph.D. degree in Electrical and Computer Engineering from Syracuse University, Syracuse, NY. During 1992–1999, he was an Associate Professor in the Department of Computer Engineering at DongA University, Pusan, Korea. He is currently an Associate Professor in the Department of Computer Science and Engineering at POSTECH, Pohang, Korea. His research interests include intelligent systems, biometrics and RFID systems.