Pattern Recognition Letters 34 (2013) 780–788
Multi-part sparse representation in random crowded scenes tracking

Jie Shao a,b,*, Nan Dong c, Minglei Tong a

a Department of Computer and Information Engineering, Shanghai University of Electric Power, 201804, PR China
b Department of Information and Communication Engineering, Tongji University, Shanghai 200090, PR China
c Chinese Academy of Sciences, Shanghai Advanced Research Institute, Shanghai 201203, PR China
Article info

Article history: Available online 22 July 2012

Keywords: Visual tracking; Multi-part sparse representation; Crowded scenes; Particle filter
Abstract

A multi-part sparse representation method is used for pedestrian tracking in random crowded scenes in this paper. In crowded scenes, there are random movements and orderly movements. Random movements are those in which the motion of each individual in the crowd appears to be unique, and different participants move in different directions over time. This means that methods based on multiple motion-flow models are not applicable. As a result, we propose a fully unsupervised tracking algorithm based on a multi-part local sparse appearance model. Based on the fact that only the non-occluded segments of a target are effective in feature matching, while the occluded segments are disturbed, our algorithm employs a multi-part sparse reconstruction code. The method is applied to target segments instead of the whole target, and is implemented by solving an l1 regularized least squares problem. The segment group with the smallest projection error is taken as the tracking result. All the segment groups are drawn from a density distribution in a Bayesian state inference framework. After the tracking process in each frame, the template dictionary is jointly inferred and updated to adapt to appearance variation. We test the method on numerous videos, including different types of very crowded scenes with serious occlusion and illumination variation. The proposed approach demonstrates excellent performance in comparison with previous methods.

© 2012 Elsevier B.V. All rights reserved.
1. Introduction

Tracking in crowded scenes is a great challenge for traditional tracking methods due to the large number of clustered targets in the image. However, these types of high density scenes are the ones most in need of intelligent video surveillance, since incidents usually occur in such congested areas. A fixed-view surveillance system continuously captures the global variations of all the motions that take place in the scene. These motions exhibit different speeds and directions as the targets traverse the densely crowded place. Sometimes, movements tend to repeat due to the large number of targets in the space; we call this an orderly crowded scene. Examples include events involving queues of people, a video of a marathon race, or traffic on a road. Besides, there are other crowded scenes where the motions of the crowd appear to be random. That is, erratic movements occur not only at any time, but also at every spatial location. We call this a random crowded scene, for instance people walking in the street or visiting an exhibition. Most of the recent works aim at the former, the orderly case (Rodriguez et al., 2009; Ali and Shah, 2008; Kratz and Nishino, 2010). In their works,

* Corresponding author at: Department of Computer and Information Engineering, Shanghai University of Electric Power, 201804, PR China. Tel.: +86 21 65430410. E-mail address: [email protected] (J. Shao).
limited numbers of motion flows are detected by crowd behavior models (Rodriguez et al., 2009; Ali and Shah, 2008) or by training (Kratz and Nishino, 2010). This is a major shortcoming, as a random crowded scene where behaviors are not repeated cannot be learned in advance. In the scene with pedestrians walking on the ground in Fig. 3, for example, people walk in different directions. Therefore, every spatial location supports different types of motion over time. Sparse representation has recently played an important role in many research areas, including face recognition, background subtraction, texture segmentation, and so on (Wright et al., 2010). In sparse representation, the coefficients of the representation vector associated with the target appearance are usually nonzero, while the other, non-associated coefficients vanish. Thus a representation vector can encode the identity of a target patch (Wright et al., 2009). The method has been shown to be robust to various forms of image degradation, and especially to occlusions. As a result, to overcome the problem of tracking in random crowded scenes, we develop an unsupervised tracking algorithm that uses sparse representation on target segments and finds the optimal set of segments in the template subspace. An overview of the method is shown in Fig. 1. First, the template dictionary of the target is built from information in the first frame. Then candidates are predicted in a probabilistic particle filtering framework. Both the candidates and the templates are composed of multiple sparse appearance models. Each model is associated with a target segment. This makes each segment an independent entity, and decreases
Fig. 1. The overview of the proposed method (first frame, 7 divisions, N templates, particle filter, l1 minimization, optimization, tracking result, template update).
the adverse effects of occlusions in the process of l1 minimization. The drawback of particle filtering is that similar objects can hardly be distinguished from each other based on a normal appearance representation, which impairs the reliability of the estimated results, especially in crowded scenes. In order to overcome the problem of feature confusion, a kernel color representation is used for the target segments, which makes the sparse appearance model include not only appearance but also texture information. The method is derived from the multi-part color histogram representation proposed by Maggio and Cavallaro (2009). At the end of the process, the segment group with the smallest target template projection error is chosen and deemed the tracking result in the current frame. Then the dictionary is updated according to it. We test the proposed method on a number of videos involving heavy inter-object occlusion, illumination variation and cluttered movements. Both outdoor and indoor sequences with different sizes are used in the experiments. The proposed approach shows excellent performance in comparison with many other trackers. So far, we have only tested the approach on single target tracking; in a further extension, it could be combined with multiple hypothesis tracking (MHT) for multiple target tracking. The main contributions of this paper are as follows: (1) It makes no prior assumption about motion flows in the crowded scenes, hence there is no need to obtain prior models or knowledge of the clusters, and it is fully unsupervised. (2) It provides a framework for using multi-part sparse representation for tracking in random crowded scenes. Sparse coding is used on each target segment instead of the whole target, which excludes errors caused by occluded segments. (3) Our algorithm is suitable not only for orderly crowded scenes, but also for random crowded scenes. The template dictionary continues to learn and update every frame. After summarizing related work in the next section, an introduction to the local sparse appearance model is provided in Section 3. Section 4 demonstrates the tracking framework, followed by experimental results and qualitative comparisons in Section 5. Finally, we conclude the paper in Section 6.

2. Related work

Since tracking is one of the most widely researched areas in computer vision, we only focus on the algorithms specially designed for crowded scenes in this section.
Tracking in crowded scenes is a relatively new topic in crowd analysis. Most of the literature related to crowd analysis is interested in crowd density estimation and crowd event detection (Dong et al., 2010; Garate et al., 2010; Ryan et al., 2010; Andrade et al., 2006; Patzold et al., 2010). These methods aim at detecting abnormal events in crowd flows using motion pattern classification, or at people counting. For example, Garate et al. (2010) tracked and extracted feature points based on a HOG descriptor; at the event recognition stage, they statistically analyzed the vectors formed by tracking the feature points to recognize a predefined event. Dong et al. (2010) detected abnormal motion variations using a motion behavior map, while Andrade et al. (2006) and Qiao et al. (2009) relied on optical flow for a similar purpose; all of these focused on behaviors of the crowd rather than individuals. Ryan et al. (2010) and Patzold et al. (2010) addressed people counting by individual tracking. The former work obtained observation data by background subtraction, and then performed blob merging and splitting to improve tracking results. The latter found the upper body of each target with a trained shape detector, which means the algorithm can only work after time-consuming training. Since 2006, more research has focused on tracking individuals in crowded scenes, but most of it tracks multiple targets in sparse crowded scenes with only partial occlusions (Cheriyadat et al., 2008; Tsai et al., 2006; Lien et al., 2007; Lu et al., 2009). Like Ryan et al. (2010) mentioned above, targets can be distinguished from the background by background subtraction, which is based on the fact that varied pixels usually belong to the foreground, except for the effects of noise and light variation. However, in the case of a high density crowd, the high degree of inter-object occlusion makes it impossible to separate individuals from the crowd by background subtraction. In the last two years, another group of works presented algorithms for tracking individual targets in high density crowd scenes. Ali and Shah (2008) used a scene structure based force model to track individuals among hundreds of people, which was implemented by calculating the probability of movement from one location to another. As a result, it can only deal with crowd flows with regular movements. One year later, they published another paper about tracking in unstructured crowded scenes (Rodriguez et al., 2009). They captured different behavior modalities at specific
locations in the scenes, and then tracked individuals according to these behaviors (or motion flows). Both approaches impose a fixed number of possible motions at each spatial location in the frame. Similarly, Kratz and Nishino (2010) encoded the possible motions in an HMM, and derived a full distribution of motions for each video before tracking. This requires a lot of training time and is not robust for all videos. Our goal is to track individuals in high density crowd scenes using a robust approach without training. Our proposed tracking method borrows some ideas from the work of Thida et al. (2009) and Khansari et al. (2007). First, we use the popular particle filter framework, which has been applied to tracking problems under the name Condensation. There are recent studies about crowd scene tracking using particle filters similar to our method; a particle swarm optimization algorithm is used in Thida et al. (2009). Our modeling uses a linear template combination, which is an improvement over the approach of Khansari et al. (2007), who used only one template with undecimated wavelet features for data association in tracking. Despite sharing many ideas with the previous works discussed above, our work is the first algorithm that combines l1 regularization and sparse representation for high density crowd tracking. Our work can be viewed as an extension of recent studies in visual surveillance via sparse representation.

3. Local sparse appearance model

3.1. Video representation

Different from conventional settings of sparse representation such as Mei and Ling (2011) and Liu et al. (2011), where the input signal is a single feature, the input signal in our work is composed of a group of multi-part appearance representations $X = \{x_1, \ldots, x_n, \ldots, x_N\}$, where $x_n$, $n = 1{:}N$, denotes the n-th vectorized appearance information of the target. Therefore, the input is not a vector but a matrix with a group of vectors. In recent works, there have been several attempts at appearance modeling, such as interest point detection in Ross et al. (2008) and HOG and optical flow descriptions in Zhao et al. (2011), but neither of them is suitable for representing small patches in crowded scenes. Color distribution is one choice in target tracking; it is appropriate for representing patches lacking detail, but it lacks structural information. In order to overcome this shortcoming, we introduce a multi-part kernel histogram representation. If we label a target with a rectangle, seven possible segments of the rectangle are shown in Fig. 2. The multi-part RGB kernel histograms are calculated based on these seven divisions. The first histogram is from the whole region of the foreground object patch, and the second to the fifth histograms are calculated from four equally divided parts of the whole. The last two parts are the inside and the outside half of the target. Let $z_n$ denote the pixels in segment n, which is centered at a pixel $ctr_n$. A Gaussian kernel $K(ctr_n)$ is applied to assign smaller weights to pixels far away from the center. So the
value of the j-th bin $b_j$ of the histogram in segment n can be computed as:

$$b_j = c \sum_{RGB}\sum_{z_n \in \Omega_j} K(ctr_n) = c \sum_{RGB}\sum_{z_n \in \Omega_j} \exp\left(-\|z_n - ctr_n\|^2 / (2\sigma^2)\right), \quad (1)$$

where $\Omega_j$ is the group of pixels with value j in division n of the target, and c is a normalization coefficient. We define the above kernel histogram as $x_n$ in the appearance model X. So N equals 7, as there are 7 segment histograms for a target:

$$x_n = \{b_1, \ldots, b_j, \ldots, b_d\}^T \in \mathbb{R}^{d\times 1}. \quad (2)$$
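As a concrete illustration of Eqs. (1) and (2), the following Python sketch computes a kernel-weighted RGB histogram for one segment. It is our own sketch rather than the authors' Matlab implementation; the number of bins per channel and the kernel bandwidth are illustrative choices.

```python
import numpy as np

def kernel_histogram(patch, bins_per_channel=8, sigma=None):
    """Kernel-weighted RGB histogram of one target segment (sketch of Eqs. (1)-(2)).

    patch: H x W x 3 uint8 array for one of the seven divisions of Fig. 2.
    Returns a normalized d-dimensional vector x_n, d = 3 * bins_per_channel.
    """
    h, w, _ = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    ctr_y, ctr_x = (h - 1) / 2.0, (w - 1) / 2.0
    if sigma is None:
        # bandwidth tied to the segment size (assumed choice)
        sigma = 0.5 * np.hypot(ctr_y, ctr_x) + 1e-6
    # Gaussian kernel K(ctr_n): pixels far from the segment center get smaller weights
    weights = np.exp(-((ys - ctr_y) ** 2 + (xs - ctr_x) ** 2) / (2.0 * sigma ** 2))

    hist = []
    for ch in range(3):  # accumulate kernel weights per color bin for R, G, B
        bin_idx = (patch[:, :, ch].astype(np.int64) * bins_per_channel) // 256
        hist.append(np.bincount(bin_idx.ravel(), weights=weights.ravel(),
                                minlength=bins_per_channel))
    x_n = np.concatenate(hist)
    return x_n / (x_n.sum() + 1e-12)  # normalization coefficient c
```

The seven vectors $x_n$ obtained from the divisions of Fig. 2 are then stacked column-wise to form the multi-part representation X of a patch.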
3.2. Sparse representation model

In crowded visual tracking scenarios, noise and partial occlusions are common. Occlusions create unpredictable errors which affect the accuracy of the tracking results. In most cases, an occlusion is a connected region of pixels, and only part of the target is occluded. Consequently, in this paper, a target patch is treated as a set of patch segments. This means that, in our method, detecting the optimal candidate in the current frame is transformed into finding the optimal set of segments by solving a sparse coding problem. The basic idea of our approach is to represent the knowledge of a target segment x using a dictionary T, whose columns are templates for sparse reconstruction. In the previous subsection, we used $x_n$ to indicate the n-th segment of the target x. Here, as we treat each target segment as an independent entity in sparse representation, n is omitted, and x denotes the sparse appearance model of one target segment:

$$x = TA + e, \quad (3)$$

where e represents the noise, and the reconstruction weight vector A can be computed by optimizing the l1 regularized least squares problem. Accordingly, the template dictionary is $T = \{T_1, \ldots, T_N\} \in \mathbb{R}^{d \times (M \cdot N)}$, containing N = 7 template groups of target segments. Each template group of a segment comes from M different target patches, $T_n = \{t_1, \ldots, t_M\} \in \mathbb{R}^{d \times M}$. All the target patches are initialized in the first frame; they are patches with small coordinate perturbations of the original target position, as shown in Fig. 1. As a result, the template dictionary is composed of $M \times N$ template vectors. Each template vector corresponds to the kernel color histogram of a patch segment. The partitions are overlapping regions in the target. A tracking candidate is also a set of segments in the current frame, whose appearance representation is $X = \{x_1, \ldots, x_N\} \in \mathbb{R}^{d \times N}$. Since the locations of occlusion may differ between tracking targets and are unknown to the computer, we can use multi-part templates to avoid the impact of occlusions. Each vector $x_n \in \mathbb{R}^{d \times 1}$ can be represented as a linear combination of a few vectors in the dictionary T, which is the extension of Eq. (3):

$$x_n \approx TA = \sum_{n=1}^{N} T_n a^{\{n\}} = \sum_{i=1}^{NM} a_i t_i, \quad (4)$$

where $A = \{a^{\{1\}}, \ldots, a^{\{n\}}, \ldots, a^{\{N\}}\}^T = \{a_1, \ldots, a_M, \ldots, a_i, \ldots, a_{MN}\}^T \in \mathbb{R}^{MN \times 1}$ is the reconstruction weight vector in our method.
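As a rough sketch of this structure (our own illustration, not the authors' code; `split_into_segments` and `kernel_histogram` are the hypothetical helpers introduced above), the dictionary T can be assembled as a d × (M·N) matrix whose columns are the kernel histograms of the M template patches for each of the N = 7 segments:

```python
import numpy as np

def build_dictionary(template_patches, split_into_segments, M=10, N=7):
    """Assemble T = [T_1, ..., T_N] in R^{d x (M*N)} from M template patches (sketch of Eqs. (3)-(4))."""
    groups = []
    for n in range(N):  # one template group T_n per segment division
        cols = [kernel_histogram(split_into_segments(p)[n]) for p in template_patches[:M]]
        groups.append(np.stack(cols, axis=1))  # d x M
    T = np.concatenate(groups, axis=1)         # d x (M*N)
    # zero-mean, unit-norm normalization of each template column (as noted in Section 4.3)
    T = T - T.mean(axis=0, keepdims=True)
    T = T / (np.linalg.norm(T, axis=0, keepdims=True) + 1e-12)
    return T
```

A candidate segment x is then approximated as T @ A with a sparse weight vector A, as in Eq. (4).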
Fig. 2. Seven divisions of a target. (a) Whole, (b) rotation sensitive division, (c) size sensitive division, and (d) target overall division.
4. Tracking framework

4.1. Particle filtering

The state variable $s_t$ of a target patch is modeled by five parameters: $s_t = (p_x, p_y, X, v_x, v_y)$, where $(p_x, p_y)$ are the 2D position parameters, X is the normalized multi-part kernel color histogram appearance representation described in Section 3, and $(v_x, v_y)$ are the average velocities of the horizontal and vertical position parameters $(p_x, p_y)$.
Fig. 3. Particle filtering in the image sequence 'street'. (a) Original position in the first frame; (b) localization of particles in the second frame.
The particle filter is used to search for candidates in the current image based on their previous states. With all available observations of a target $y_{1:t-1} = \{y_1, y_2, \ldots, y_{t-1}\}$ up to time t − 1, the predicted distribution of $s_t$ is computed by:

$$p(s_t \mid y_{1:t-1}) = \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid y_{1:t-1})\, ds_{t-1}. \quad (5)$$
With $p(s_{t-1} \mid y_{1:t-1})$ known from the previous iteration and $p(s_t \mid s_{t-1})$ determined by the state equation (Arulampalam et al., 2002), the update step using Bayes' rule is:

$$p(s_t \mid y_{1:t}) = \frac{p(y_t \mid s_t)\, p(s_t \mid y_{1:t-1})}{\int p(y_t \mid s_t)\, p(s_t \mid y_{1:t-1})\, ds_t}. \quad (6)$$
Assume that $p(s_t \mid y_{1:t})$ is approximated by the set of particles $\{s_t^k\}_{k=1,\ldots,N_p}$ with importance weights $\omega_t^k$, where $N_p$ is the number of particles. The candidate samples $s_t^k$ in this paper are drawn from the prior probability $p(s_t \mid s_{t-1})$, so that the updated set of weights $\omega_t^k$ at time t is determined as

$$\omega_t^k = \frac{\omega_{t-1}^k\, p(y_t \mid s_t^k)}{\sum_{k=1}^{N_p} \omega_{t-1}^k\, p(y_t \mid s_t^k)}. \quad (7)$$
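Below is a minimal sketch of one bootstrap particle filter step corresponding to Eqs. (5)-(7) and the constant-velocity propagation of Eq. (8) further below; the Gaussian diffusion noise and the abstract likelihood function are our assumptions, since in this paper the weights are ultimately tied to the reconstruction residual of Section 4.2.

```python
import numpy as np

def propagate_and_reweight(particles, weights, velocity, likelihood, noise_std=3.0):
    """One bootstrap particle filter step (sketch).

    particles: Np x 2 array of (px, py) positions at time t-1.
    weights:   Np importance weights at time t-1.
    velocity:  average (vx, vy) of the tracked patch so far (Eq. (8)).
    likelihood: function giving p(y_t | s_t) for a position -- left abstract here.
    """
    rng = np.random.default_rng()
    # prediction: constant (average) velocity plus Gaussian diffusion (assumed noise model)
    particles = particles + np.asarray(velocity) + rng.normal(0.0, noise_std, particles.shape)
    # update: w_t^k proportional to w_{t-1}^k * p(y_t | s_t^k), then normalize (Eq. (7))
    weights = weights * np.array([likelihood(p) for p in particles])
    return particles, weights / weights.sum()
```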
In our proposed method, we assume that target objects are first initialized by the user in a reference frame, such as the first frame. Fig. 3(a) shows the initialization of a target patch in the first frame by the user. The particles, centered in blobs of estimated size, are scattered around the target patch in the second frame, as shown in Fig. 3(b). Then, in the following frames, an average velocity is used to model the object motion when propagating the particles in the state equation (frm denotes the number of passed frames):

$$(p_x, p_y)_t = (p_x, p_y)_{t-1} + (v_x, v_y)_{t-1}, \qquad (v_x, v_y)_t = \frac{1}{t-1} \sum_{i=1,\ldots,t-1} (v_x, v_y)_{t-i}. \quad (8)$$

4.2. Optimization through l1 minimization

Particles $s_t^k$ generated in the process of particle filtering are used as observation candidates. Since the multi-part appearance X of each particle is used in the sparse representation model for template matching, according to Eq. (4), we need to consider the relationship between X and the reconstruction weight vector A. Given the dictionary T, we define the following function to measure the confidence of each part $x_n$ of a candidate $X = \{x_1, \ldots, x_N\}$ and the choice of the reconstruction weight vectors A, and then solve it as an l1 regularized least squares problem:

$$f(x_n, A_n, T) = \|x_n - TA_n\|_2^2 + \lambda \|A_n\|_1, \quad (9)$$

where $\lambda$ is a regularization parameter. The first term $\|x_n - TA_n\|_2^2$ in Eq. (9) is the reconstruction error: the smaller this term is, the more similar this part of the candidate and the templates are. The second term is the sparse regularization, which makes the l1 minimization favor templates with large norms. As A is a latent variable introduced in the formulation, in order to properly find the optimal candidate, we need to minimize Eq. (9) for each segment representation $x_n$. Our implementation solves the l1 regularized least squares problem via a Lasso solution, using the sparse decomposition toolbox SPAMS. Specifically, with the N parts of a target patch defined in Section 3.1, $X = \{x_1, \ldots, x_N\}$, the final optimal reconstruction weight vector $A^{*}$ is learned by solving the following optimization problem:

$$A^{*} = \arg\min_{A_1, \ldots, A_N} \sum_{n=1}^{N} f(x_n, A_n, T). \quad (10)$$

As we assume the dictionary T is determined in advance, the reconstruction weight vectors $A_1, \ldots, A_N$ for different patch parts are independent. Therefore, they can be optimized independently. Specifically, for $X = \{x_1, \ldots, x_N\}$, the corresponding optimization problem is as follows:

$$\arg\min_{A_1, \ldots, A_N} \sum_{n=1}^{N} \|x_n - TA_n\|_2^2 + \lambda \sum_{n=1}^{N} \|A_n\|_1. \quad (11)$$

Then the tracking result corresponds to the candidate with the smallest residual after projecting it on the target template subspace. Specifically, at frame t, let $v = \{X^1, X^2, \ldots, X^{N_p}\}$ be the appearance representations of the $N_p$ target candidates $\{s_t^k\}_{k=1,\ldots,N_p}$. The tracking result $X^{*}$ is chosen by Eq. (12) below.
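The per-segment problem of Eqs. (9)-(11) and the candidate selection of Eq. (12) below can be sketched with an off-the-shelf Lasso solver. We use scikit-learn here purely for illustration, standing in for the SPAMS toolbox used by the authors; the regularization weight is an arbitrary example value.

```python
import numpy as np
from sklearn.linear_model import Lasso

def score_candidate(X_cand, T, lam=0.01):
    """Sum of per-segment confidences f(x_n, A_n, T) for one candidate (sketch of Eqs. (9)-(12)).

    X_cand: d x N matrix of segment histograms of one particle.
    T:      d x (M*N) template dictionary.
    The candidate with the smallest returned score is taken as the tracking result.
    """
    d, N = X_cand.shape
    total = 0.0
    for n in range(N):
        x_n = X_cand[:, n]
        # scikit-learn's Lasso minimizes (1/(2d))||x - T A||^2 + alpha ||A||_1,
        # which plays the role of the lambda-weighted objective in Eq. (9)
        A_n = Lasso(alpha=lam, fit_intercept=False, max_iter=2000).fit(T, x_n).coef_
        total += np.sum((x_n - T @ A_n) ** 2) + lam * np.sum(np.abs(A_n))
    return total
```

At frame t, this score would be evaluated for the appearance X of every particle, and the minimizer reported as the tracking result X*.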
Table 1
Summary of our proposed tracking algorithm in crowded scenes.

1. Locate the target in the first frame manually or automatically, then divide the target into seven parts and initialize the dictionary T.
2. Initialize the state of the target s_1 with the current p_x, p_y, X and v_x = 0, v_y = 0. The value of the reconstruction weight vector is 1 for all coefficients a_i.
3. Advance to the next frame.
4. Draw particles from the particle filter. For each particle, extract the corresponding window from the current frame, and calculate its multi-part appearance representation X as the observation model.
5. Compute the reconstruction weight vector A_n for each segment x_n, and then take the optimal candidate particle with the smallest residual of Eq. (12) as the tracking result.
6. Update the dictionary T.
7. Go to step 3.
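For orientation only, the steps of Table 1 can be tied together in a per-frame loop such as the following sketch; `extract_segments` is a hypothetical helper, and the other functions are the illustrative sketches given earlier, not the authors' implementation.

```python
import numpy as np

def track_frame(frame, particles, weights, velocity, T, extract_segments):
    """One pass of steps 3-6 in Table 1 (sketch built on the earlier helper functions)."""
    # step 4: propagate particles; the placeholder likelihood leaves weights uniform,
    # since the selection below already ranks particles by reconstruction cost
    particles, weights = propagate_and_reweight(particles, weights, velocity,
                                                likelihood=lambda p: 1.0)
    candidates = [np.stack([kernel_histogram(seg)
                            for seg in extract_segments(frame, p)], axis=1)
                  for p in particles]
    # step 5: pick the particle whose segments have the smallest reconstruction cost
    best = int(np.argmin([score_candidate(X, T) for X in candidates]))
    # step 6: the dictionary update of Eq. (15) would follow here
    return particles[best], candidates[best]
```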
Fig. 4. Experimental results of tracking an individual in four different crowded scenes. (a) People are waiting in the airport; (b) people walk in the street; (c) people are walking in the campus, and the sequences are from the PETS2009 dataset; and (d) people are walking downstairs in a subway station.
$$X^{*} = \arg\min_{X \in v} \sum_{x_n \in X} f(x_n, A^{*}, T). \quad (12)$$
4.3. Template update
The appearance of a target may change drastically due to intrinsic and extrinsic factors. Therefore, to produce a robust tracker, it is important to update the appearance model online to reflect these changes while tracking. In order to update the templates dynamically, we make use of the sparsity of the reconstruction weight vector A, which can be viewed as a selection of the relevant/important templates in the dictionary T. A large coefficient $a_i$ in A always correlates with a more relevant template after l1 regularization. Besides, owing to the term $\|x_n - TA_n\|_2^2$, the larger the norm of $t_i$ is, the smaller the coefficient $a_i$ needs to be. Consequently, since we need to minimize the regularization term $\|A\|_1$, we exploit this characteristic by using gradient descent as a learning factor. The chosen appearance templates, the dictionary T, are typically learned at the first frame of the video and a zero-mean, unit-norm normalization is applied. Then, at frame t, given $T_{t-1}$ and $X_t$, we can find the optimal $T_t$ by the following update, using the gradient with respect to T:

$$T_t = T_{t-1} - \eta \nabla_{T} \sum_{n}^{N} f(x_{n,t}, A_t, T_{t-1}), \quad (13)$$

where the function $f(\cdot)$ is from Eq. (9),

$$f(x_{n,t}, A_t, T_{t-1}) = \|x_{n,t} - T_{t-1} A_{n,t}\|_2^2 + \lambda \|A_{n,t}\|_1. \quad (14)$$
Fig. 5. The RMS error (differences between the ground truths and the tracking results) for objects in six different crowded scenes (Airport, Street, Pets1, Subway, Pets2, Entrance).
Table 2
Comparison of mean processing time per frame.

Method         Mean processing time (frames per second)
Our approach   4
L1             0.5
MS             20
Fig. 6. Tracking results of the sequence ‘PETS2’ over frames with algorithms of MP (ours), affine template L1 and MS (mean shift).
As a result,

$$T_t = T_{t-1} + \frac{2\eta}{t} \sum_{n}^{N} (x_{n,t} - T_{t-1} A_{n,t})\, A_{n,t}^{T}. \quad (15)$$
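A sketch of this update in Python (our reading of the gradient step of Eqs. (13)-(15); the learning rate value is illustrative, and the outer product with A_{n,t} follows from differentiating the reconstruction term):

```python
import numpy as np

def update_dictionary(T, X_t, A_t, t, eta=0.05):
    """Gradient-step template update (sketch of Eq. (15)).

    T:   d x (M*N) dictionary from frame t-1.
    X_t: d x N segment histograms of the tracking result at frame t.
    A_t: list of N reconstruction weight vectors, each of length M*N.
    t:   frame index; dividing by t damps the effect of new variations over time.
    """
    d, N = X_t.shape
    for n in range(N):
        residual = X_t[:, n] - T @ A_t[n]                  # x_{n,t} - T_{t-1} A_{n,t}
        T = T + (2.0 * eta / t) * np.outer(residual, A_t[n])
    # renormalize each template column after the update
    return T / (np.linalg.norm(T, axis=0, keepdims=True) + 1e-12)
```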
After the update, each template is normalized again. $\eta$ is a learning rate, and the factor t in the denominator makes the effect of new variations smaller as time passes.

4.4. Summary of the tracking algorithm

Here we provide a summary of the proposed tracking algorithm in Table 1.

5. Experimental results and analysis

5.1. Datasets and results

We implemented the proposed approach in Matlab and evaluated its performance on numerous video sequences. The videos were recorded in indoor and outdoor environments in crowded scenes of different densities, and the target underwent lighting changes and occlusion. The template sizes differ depending on the target patches in the first frame. The number of templates for each part of the target patch is M = 10. In all cases, the initial position of the target was selected manually. The number of particles used for our method is 400 for all experiments. A visualization of our tracking results is shown in Fig. 4. Each row from left to right shows five frames of our proposed method tracking different targets in crowded scenes. The trajectories in different areas of the frame demonstrate the ability of our approach to capture the temporal motion variations of the crowd. Besides, we capture the specific size of the target in a patch as well. The first test sequence shows a crowded scene in the waiting area of an airport, where a passenger in a dark shirt is tracked. The passenger walks from one side of the chairs to the other side, and then crosses a group of people moving in a different direction. Such dynamic variations in the movement of the target cannot be captured by a predefined motion flow model such as floor fields (Rodriguez et al., 2009). The second row shows a sequence of images of people walking in different directions in the street. The tracked person is in a gray shirt whose color is similar to the ground. It demonstrates that our method is robust to color similarity and successfully tracked the person. The third test sequence is from PETS2009 and shows a group of people walking from left to right through the campus. The sequences in the fourth row show
Fig. 7. Tracking results of the sequence ‘Entrance’ over frames with algorithms of MP (ours), L1 and MS (mean shift).
pedestrians walking downstairs in a subway station, so the density of the crowd is very high. Our tracker tracks well despite the light variation and occlusion. Then, we computed the root mean square (RMS) error between the estimated locations of the patches and the ground truth. Given the ground-truth state vector $ctr_t$, which is defined by the 2D location of the target, the RMS error of the tracking result $\hat{ctr}_t$ is

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{t=1}^{frm} (ctr_t - \hat{ctr}_t)^2}{frm}},$$

where frm is the number of frames in the video. Fig. 5 shows the error for each target averaged over all of the frames in the video. All the videos mentioned above and shown in the next subsection are evaluated here.

5.2. Qualitative comparison

In our experiments, we compare the tracking results of our proposed method (MP) with the standard mean shift (MS) tracker (Collins, 2003) and the affine template L1 tracker (Mei and Ling, 2011). All three algorithms are implemented in Matlab.
We use the authors' software for testing L1. All the videos are composed of 24-bit color images. Our method takes about four frames per second for images of 768 × 576 pixels in the above experimental setup. We expect our method to process over 10 frames per second if the implementation is optimized. The comparison of mean processing time per frame between L1, MS and our method is listed in Table 2.

The first test video 'PETS2' is from the View2 sequences of the PETS2009 S2 L2 dataset. Some samples of the final tracking results are demonstrated in Fig. 6. The four representative frames of the video sequence are frames 2, 10, 20 and 30. The target-to-background contrast is low due to strong sunshine. From Fig. 6, we can see that our tracker is capable of tracking the object all the time, even with severe light reflection. In comparison, the affine template L1 tracker and the MS tracker lock onto the target starting from the second frame. But then the affine template L1 tracker fails to track the target accurately in the seventh frame, similar to the MS tracker, and the MS tracker loses the target from the fifteenth frame on. In contrast, our proposed method avoids this problem and is effective in such a noisy situation.
The second test video 'Entrance' is taken at a security entrance of the airport. The woman is far away from the camera and walks in the crowd inconspicuously. Some samples of the final tracking results are demonstrated in Fig. 7. The frame numbers are 10, 50, 88 and 119. The MS tracker loses the target very quickly and wanders around. Our tracker and the affine template L1 tracker are able to track the target well through the background with similar color. Then we manually labeled the ground truth of the sequences 'PETS2' and 'Entrance' for 200 frames. The evaluation criterion of the tracking error is the relative position error between the center of the tracking result and that of the ground truth, measured in pixels. Ideally, the position differences should be around zero; in practice, only integral numbers of pixels are recorded. In Fig. 8, the position differences of the results of our MP tracker are much smaller than those of the other two trackers, which demonstrates the advantage of our approach.

Fig. 8. Quantitative comparison of the trackers in terms of position errors (in pixels) over 200 frames. (a) PETS2; (b) Entrance.

6. Conclusion

We have developed and tested an unsupervised robust tracking framework for crowded scenes with random motions, i.e., crowded scenes in which spatial locations support more than one kind of behavior over time. To this end, multiple target segments are treated as independent entities projected onto a set of sparse coding templates. The template dictionary is learned and updated without supervision. Then an l1 regularized least squares approach is used to solve the sparse optimization problem. For further robustness, dynamic template update is introduced in our approach. The experimental results show that the proposed approach provides superior tracking results in all kinds of crowded scenes. We believe the approach can be further extended to track across temporary full occlusions in future research.

Acknowledgments

The authors gratefully acknowledge support by the Natural Science Foundation for the Youth (NSFC: 61105016), Shanghai University of Electric Power, Tongji University and Chinese Academy of Sciences.

References
Ali, S., Shah, M., 2008. Floor fields for tracking in high density crowd scenes. In: ECCV ’08 Proceedings of the 10th European Conference on Computer Vision, pp. 1–14. Andrade, E., Blunsden, S., Fisher, R., 2006. Modelling crowd scenes for event detection. In: 18th International Conference on Pattern Recognition, 2006. ICPR 2006, pp. 175–178. Arulampalam, M.S., Maskell, S., Gordon, N., Clapp, T., 2002. A tutorial on particle filters for online nonlinear/non-gaussian bayesian tracking. IEEE Transactions on Signal Processing 50 (2), 174–188. Cheriyadat, A., Bhaduri, B., Radke, R., 2008. Detecting multiple moving objects in crowded environments with coherent motion regions. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2008. CVPRW ’08, pp. 1–8. Collins, R.T., 2003. Mean-shift blob tracking through scale space. In: IEEE Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin, pp. 234–240. Dong, N., Jia, Z., Shao, J., Xiong, Z., Li, Z., Liu, F., Zhao J., Peng, PeiYuan, 2010. Traffic abnormality detection through directional motion behavior map. In: 2010 Seventh IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 80–84. Garate, C., Bilinsky, P., Bremond, F., 2010. Crowd event recognition using hog tracker. In: 2009 Twelfth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS-Winter), pp. 1–6. Khansari, M., Rabiee, H., Asadi, M., Ghanbari, M., 2007. Occlusion handling for object tracking in crowded video scenes based on the undecimated wavelet features. In: IEEE/ACS International Conference on Computer Systems and Applications, 2007. AICCSA ’07, pp. 692–699. Kratz, L., Nishino, K., 2010. Tracking with local spatio-temporal motion patterns in extremely crowded scenes. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition CVPR ’10, pp. 693–700. Lien, C.-C., Wang, J.-C., Jiang, Y.-M., 2007. Multi-mode target tracking on a crowd scene. In: Third International Conference on Intelligent Information Hiding and Multimedia Signal Processing, 2007. IIHMSP 2007, pp. 427–430. Liu, B., Huang, J., Kulikowski, C., Yang, L., 2011. Robust tracking using local sparse appearance model and k-selection. In: In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR’2011, Colorado, USA, June 2011, pp. 1313–1320. Lu, W., Wang, S., Ding, X., 2009. Vehicle detection and tracking in relatively crowded conditions. In: Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics, pp. 4136–4141. Maggio, E., Cavallaro, A., 2009. Accurate appearance-based bayesian tracking for maneuvering targets. Computer Vision and Image Understanding 113 (4), 544– 555. Mei, X., Ling, H., 2011. Robust visual tracking and vehicle classification via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 66, 1–14. Patzold, M., Evangelio, R., Sikora, T., 2010. Counting people in crowded environments by fusion of shape and motion information. In: 2010 Seventh IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 157–164. Qiao, W., Wang, H., Wu, X., Liu, P., 2009. Crowd target extraction and density analysis based on ftle and glcm. In: 2nd International Congress on Image and Signal Processing, 2009. CISP ’09, pp. 1–5. Rodriguez, M., Ali, S., Kanade, T., 2009. Tracking in unstructured crowded scenes. 
In: 2009 IEEE 12th International Conference on Computer Vision, pp. 1389–1396. Ross, David A., Jongwoo, L., Ruei-Sung, L., 2008. Incremental learning for robust visual tracking. International Journal of Computer Vision 77, 125–141. Ryan, D., Denman, S., Fookes, C., Sridharan, S., 2010. Crowd counting using group tracking and local features. In: 2010 Seventh IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 218–224.
Thida, M., Remagnino, P., Eng, H.-L., 2009. A particle swarm optimization approach for multi-objects tracking in crowded scene. In: 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pp. 1209–1215. Tsai, Y.-T., Shih, H.-C., Huang, C.-L., 2006. Multiple human objects tracking in crowded scenes. In: 18th International Conference on Pattern Recognition, 2006. ICPR 2006, pp. 51–54. Wright, J., Ma, Y., Mairal, J., Sapiro, G., Huang, T., Yan, S., 2010. Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE 98 (6), 1031–1044.
Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., Ma, Y., 2009. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (1), 210–227. Zhao, B., Fei-Fei, L., Xing, E.P., 2011. Online detection of unusual events in videos via dynamic sparse coding. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR’2011, Colorado, USA, June 2011, pp. 3313–3320.