Neurocomputing 184 (2016) 145–167
Robust object tracking based on local region sparse appearance model

Guang Han, Xingyue Wang, Jixin Liu, Ning Sun, Cailing Wang

Engineering Research Center of Wideband Wireless Communication Technique, Ministry of Education, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
Article history: Received 22 January 2015; Received in revised form 25 July 2015; Accepted 26 July 2015; Available online 26 November 2015

Abstract
We propose a robust object tracking algorithm based on a local region sparse appearance model. The object is first divided into many small patches, and the patches are then grouped into several sub-regions according to their distribution; a sparse dictionary is obtained by clustering the patches within each sub-region. The object dictionary base is the combination of the dictionaries from all sub-regions, which achieves spatial alignment between different parts of the object, so the model captures the spatial structure of the object well and resists appearance change effectively. Noise in the sparse reconstruction error map is removed by further processing so that only valuable information is retained. For model updating, a novel flexible template set update mechanism is introduced: only valuable object samples are put into the template set, and samples that are not valuable are rejected even when the set is not full. The patch sparse coefficient histograms of the updated templates are then combined as a weighted sum to extract temporal information about the object, providing a reliable template basis for selecting good candidates. In addition, when the tracking result deviates from the actual object position, a dynamic sub-region resampling method based on the cosine angle corrects the deviation in time, effectively preventing the object from being lost completely. Both qualitative and quantitative evaluations on challenging video sequences demonstrate that the proposed tracking algorithm performs favorably against several state-of-the-art methods.

© 2015 Elsevier B.V. All rights reserved.
Keywords: Object tracking; Local region descriptors; Local sparse representation
1. Introduction

Video-based object tracking has become one of the hotspots of computer vision [1,2], mainly because of its important applications in both civilian and military domains [3,4]. Achieving stable and reliable tracking in complex situations remains challenging; examples include heavy occlusion, short-time complete occlusion, illumination variation, changes in object pose, shape and scale, background clutter, and appearance change caused by camera motion. These factors may cause mismatches or loss of the object during tracking and reduce the accuracy of the tracker. To address these problems, many algorithms have been proposed; they can be divided into two categories: discriminative methods [5–25] and generative methods [26–37]. These two families have been introduced in many papers and are not discussed further here; more details can be found in [38,39]. Another important classification method is
[Corresponding author. Tel.: +86 25 8349 2050. E-mail address: [email protected] (G. Han). doi: 10.1016/j.neucom.2015.07.122. © 2015 Elsevier B.V. All rights reserved.]
that it divides trackers into whole-based and patch-based methods according to how the object appearance is modeled. Whole-based methods treat the tracked object as a holistic template and extract a holistic feature. They are suitable when the object is relatively complete, but their performance is not robust when the object is partially occluded. Ross et al. [34] propose Incremental Visual Tracking (IVT), a typical whole-based method that uses low-dimensional Principal Component Analysis (PCA) within a particle filter framework. Incremental PCA is trained on previous tracking results to obtain a feature subspace, and the candidate with the smallest distance to this subspace is selected as the tracking result in the current frame. Because incremental PCA and a forgetting factor are used in model updating, IVT adapts effectively to deformation of the object appearance and to illumination change. However, it has also been demonstrated that IVT, with its whole-based representation scheme, is sensitive to partial occlusion. The Multiple Instance Learning (MIL) tracker proposed in [15,40] is also based on a holistic template. MIL builds on semi-supervised learning and formulates tracking as a classification
and detection problem. Tracking drift is relieved by a pre-trained discriminative online classifier, but when inaccurate positive or negative samples are used to update the classifier, the classifier deviates and degrades. In addition, [41–43] propose tracking algorithms based on sparse representation with a holistic template, within a particle filter framework. To find the object in the current frame, each candidate is linearly represented in the space spanned by object templates and trivial templates. After an L1-regularized least squares problem is solved, the sparse coefficients and reconstruction error of each candidate are obtained, and the candidate with the smallest reconstruction error is taken as the tracking result. This approach is robust to trivial noise and illumination variation, but it cannot handle partial occlusion effectively. Patch-based methods divide the object into a number of patches and extract features from these patches. Owing to their ability to capture local appearance information [37,39], patch-based methods can effectively resist appearance change caused by illumination variation, deformation and partial occlusion. Extensive experiments in [12] show that patch-based trackers are superior to whole-based trackers. Adam et al. [26] propose a tracker based on patch histogram features: the object is divided into a number of patches, and the object is located in the next frame according to a voting map formed by comparing the histograms of candidate patches with the corresponding templates. This method handles partial occlusions successfully, but the template set is not updated properly, so tracking fails if the object undergoes large deformation. In [44], the object template and the candidate are divided into overlapping patches of the same size.
After the sparse coefficients of each patch are solved, the sparse coefficient histograms of candidate patches are compared with those of the corresponding templates, and the candidate with the largest confidence value is taken as the tracking result. Although this method is robust to partial occlusion, it can easily misidentify patches: because its dictionaries are obtained by clustering all the patches, tracking accuracy is not robust. In addition, Jia et al. [39] propose an alignment method to extract local sparse features and spatial information of the object. It is also robust to partial occlusion, but it lacks a particle re-sampling mechanism; once the tracking result deviates from the actual object location, the tracker may lose the object completely. In view of the advantages and problems of patch-based sparse representation in object tracking, this paper offers several innovative designs and improvements over existing techniques, proposing a robust object tracking algorithm based on a local region sparse appearance model. The main contributions of this paper can be summarized as follows. First, a novel local region appearance model based on sparse representation and spatial alignment is presented. The object is divided into many small patches, and the patches are then grouped into several sub-regions according to their distribution. The object dictionary base is established by combining the dictionaries from all sub-regions, where the dictionaries of each sub-region are obtained by clustering all the patches in that sub-region. Spatial alignment between different parts of the object is thus achieved and false recognition of patches is prevented, so the method extracts sparse features and spatial structure information of the object effectively. Second, the initial sparse reconstruction error map is corrected.
Small isolated regions with abnormally large or small sparse reconstruction error are deleted using a region growing method, so noise caused by illumination, pose and scale change, etc., can be resisted effectively. Third, a new flexible template set update mechanism is introduced: only valuable object samples are put into the template set, and samples that are not valuable are rejected even when the template set is not full. The patch sparse coefficient histograms of the updated templates are then combined as a weighted sum to extract temporal information about the object. This update mechanism provides a reliable template basis for obtaining good candidates. Fourth, a novel and effective dynamic sub-region resampling method based on the change of the cosine angle is proposed. When an abnormal position is detected (for example, when the object drifts or is lost) according to the change of the cosine angle, sub-regions are dynamically divided and particle resampling is performed in every sub-region. The object position can thus be corrected effectively and complete loss of the object prevented. Finally, we evaluate the proposed algorithm on 25 challenging video sequences, where it shows robust tracking performance.
2. Related work and context

Sparse representation has recently been applied to object tracking [34,37,39,41–44] with good experimental results, and tracking based on sparse representation has become a popular research area. Mei and Ling [43] introduce sparse representation to object tracking for the first time. In their method, each candidate object is linearly represented in the space spanned by object templates and trivial templates; an L1-regularized least squares problem is solved to obtain the sparse coefficients and reconstruction error of each candidate, and the tracking result is decided by the largest confidence score derived from the reconstruction error. Although this method tracks the object steadily, shortcomings remain: its computational cost is very high, since the particle filter requires many candidate particles to ensure accuracy and the L1-regularized least squares problem must be solved for each one. In addition, the L1 method uses only a holistic template, and the trivial templates model not only the object region but also the background. When partial occlusion occurs, the reconstruction errors of image regions from the object and from the background may therefore both be very small, and tracking drift or failure occurs easily. To reduce this computational complexity, Bao et al. [41] propose an Accelerated Proximal Gradient (APG) method for the L1 minimization problem, solving the sparse problem iteratively until the result converges; its computational complexity is much lower than that of the Lasso method. Bai et al. [45] apply a structured sparse representation model to visual tracking and propose a Block Orthogonal Matching Pursuit (BOMP) algorithm based on orthogonal matching pursuit and the mutual spatial relations of the object.
During the iterative procedure, this method matches predefined patches instead of individual feature dimensions. Once the best patch is matched, the corresponding coefficients are estimated, and the residual error is used in the next iteration. The algorithm fully exploits the structural characteristics of the object to reduce the amount of computation in the sparse representation, and it is robust to occlusion. Zhang et al. [13] propose a multi-task sparse learning method that makes full use of the correlation among particles to solve the sparse problem: since the particles are related and most similar particles are reconstructed from the same templates, joint sparsity can be captured. The overall particle sparse representation is therefore more reasonable, and processing is also accelerated. Zhang et al. [10,29] use the properties of compressive sensing to
Fig. 1. The schematic of dividing the object and the patch distribution map. (a) Divide the object into some patches. (b) Divide the patch distribution map into some sub-regions. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)
compress high-dimensional feature vectors. The high-dimensional feature vector is mapped into a low-dimensional space by a sparse matrix that satisfies the restricted isometry property. Meanwhile, instead of the previous L1-norm minimization, they solve the minimization problem with an Orthogonal Matching Pursuit algorithm of very low time complexity, which allows the tracking algorithm to meet real-time requirements. Wang et al. [46] propose the SPTrack algorithm, which combines a PCA subspace with trivial templates to model the sparse representation of the object. It achieves very high tracking precision on a large number of dataset tests; however, it is costly, needing to solve L1-norm minimization hundreds of times per frame. For appearance modeling, Liu et al. [37] introduce a patch-based sparse representation model in which the object dictionaries are represented by sparse coding histograms. Since a patch-based method is used, it deals with partial occlusion effectively, but its dictionaries are obtained by solving the L1-norm problem on all the patches, and its mean-shift tracking framework requires multiple iterations, so the total complexity of the algorithm is high. Moreover, because the sparse coefficient histogram of local patches does not provide enough spatial structure information, the object is lost when it moves fast. In this paper, by contrast, the object is further divided into several sub-regions on top of the division into many small patches, and the object dictionary base is established by combining the dictionaries from all sub-regions. False recognition of patches is thus prevented effectively, and processing is also very fast. Zhong et al.
[44] propose a sparse collaborative model that exploits both patch templates (Sparse Generative Model, SGM) and holistic templates (Sparse Discriminative Classifier, SDC); this algorithm deals with partial occlusion well. However, it requires parameter tuning for different video sequences. The dictionaries in the SGM are obtained by clustering all the patches from the object, so the number of clusters is large and the computational cost is correspondingly high. This method also fails to extract effective spatial structure information: if similar patches appear in different parts of the object, patches are falsely recognized, causing position deviation and a decline in tracking performance. In this paper, by contrast, the object is further divided into several sub-regions on top of the division into many patches. The object dictionary base is established by combining the dictionaries from all sub-regions, where the dictionaries of each sub-region are obtained by clustering all the patches in that sub-region. Spatial alignment between
different parts of the object is thus achieved, and the spatial structure information of the object is also extracted effectively. Because the number of clusters is small, processing is very fast. Jia et al. [39] propose an alignment method to extract patch sparse features and spatial information of the object. It is also robust to partial occlusion, but it lacks a particle re-sampling mechanism; once the tracking result deviates from the actual object location, the tracker may lose the object completely. The algorithm in this paper has a real-time detection and resampling mechanism: when the object position deviates, the deviation is corrected in time, and tracking accuracy is greatly improved.
3. Proposed tracking algorithm

3.1. Object region division

Motivated by recent work [37,39,44], a patch-based representation is used for the appearance model in this paper. We start by manually selecting the object, shown as the red box in Fig. 1(a). The patch sparse representation method can be described as follows: the object region is divided into some patches, and an L1-regularized least squares problem is then solved on each patch. The object division is shown in Fig. 1(a); patches overlap, and each dashed box of a different color marks the location of one patch. Let L (in pixels) denote the distance between overlapping patches.

3.2. Dividing the sub-region

We normalize the object region to the same size by applying an affine transformation. The normalized object is divided into some patches, as shown in Fig. 1(a), and these patches are rearranged in the corresponding order to obtain the patch distribution map of the object region in Fig. 1(b). To further improve tracking accuracy and robustness, this paper introduces the following innovative design: the patch distribution map in Fig. 1(b) is divided into a number of sub-regions, for example 9 sub-regions. Each sub-region corresponds to a part of the original object region; for example, sub-region 1 in Fig. 1(b) corresponds to the top-left region of the original object. The object dictionary base is established by combining the dictionaries from all sub-regions, and the updating of histograms and the similarity calculation of candidate objects are processed in the same way. This method effectively prevents false identification of patches caused by the
Fig. 2. The results of the object tracking, the corresponding sparse reconstruction error map and mask map. (a) Out of occlusions. (b) Lateral occlusions. (c) Vertical occlusions. (d) Heavy occlusions. (e) Shallow occlusions.
dictionary base of a different sub-region. For example, in Fig. 1, the dictionary base from the head region should not be valid in the jaw region. If the object were not divided into sub-regions, then whenever a patch with an appearance similar to the head region appeared in the jaw region (such similarity arises from patch appearance change caused by occlusion, illumination variation, etc.), the contaminated jaw patch would fall into the feature subspace formed by the dictionary base of the head region and might be accepted as a correct candidate patch. This is wrong in terms of spatial relations, because a head-region patch should not appear in the jaw region. Dividing the object into several sub-regions avoids this mistake: the division helps align different regions of the object, prevents false identification of patches across regions, and increases tracking accuracy. This novel local region appearance model based on sparse representation and spatial alignment is one of the innovative contributions of this paper, and one of the biggest differences between our algorithm and those in [37,39,44]. Assume that there are N patches in each sub-region, so that the total number of patches in the object region is 9N, and that the patch size is m × m (in pixels). A detailed analysis and discussion of the selection of, and constraints among, the size of the normalized object, the patch size, the number of patches in the object region, and the size and number of sub-regions is given in Section 4.3.

3.3. Sparse representation of patches and histogram construction

Assume that the object region is divided into 9 sub-regions and that the dictionary base of each sub-region is D = [d_1, d_2, …, d_K] ∈ R^{d×K}, where d is the feature dimension of each dictionary atom, d = m × m, and K is the number of dictionary atoms obtained by clustering all the patches in each sub-region, 1 < K < N. The object dictionary base used in this paper comes only from the object region manually labeled in the first frame. Let Y be a candidate object in the current frame; its region is also divided into 9 sub-regions as in Fig. 1(b), and the patches of each sub-region of Y are Y = [y_1, y_2, …, y_N] ∈ R^{d×N}. When the sparse coefficients of patches in different sub-regions are calculated, the corresponding dictionary base is used for each sub-region:

for each sub-region:  min_{b_i} ‖y_i − D b_i‖₂² + λ‖b_i‖₁,  s.t.  b_i ≥ 0   (1)
where y_i ∈ R^{d×1} is the feature vector of the i-th patch in each sub-region, b_i ∈ R^{K×1} is the sparse coefficient vector of the i-th patch in each sub-region, and λ is a weight parameter. The histogram B = [B_1, B_2, …, B_9] represents the sparse representation of one candidate object, where B_i = [b_1, b_2, …, b_N] ∈ R^{K×N} represents the sparse representation of one sub-region of the candidate.

3.4. Construction of the sparse reconstruction error map and occlusion handling

With the sparse coefficients of all patches in each sub-region obtained in Section 3.3, the reconstruction error of each patch can be calculated as

ε_i = ‖y_i − D b_i‖₂²   (2)

where ε_i is the reconstruction error of the i-th patch in each sub-region. Arranging the errors of all patches in the 9 sub-regions in the form of Fig. 1(b) yields the sparse reconstruction error map of the object, shown in the middle column of Fig. 2. Fig. 2 shows the tracking results when the object is out of occlusion, in shallow occlusion and in heavy occlusion, together with the corresponding sparse reconstruction error maps and mask maps. The results intuitively show that occluded parts of the object lead to larger values in the sparse reconstruction error map. To improve accuracy, occluded regions should not be involved in the calculation of the final object confidence (i.e. their sparse coefficients are set to 0). We therefore use a mask histogram β = [β_1, β_2, …, β_9], which indicates whether the sparse coefficient vector of each patch in the corresponding sub-region is set to 0, where β_i = [ρ_1, ρ_2, …, ρ_N] ∈ R^{K×N} and ρ_i is obtained by

ρ_i = 1 if ε_i < ε_0;  ρ_i = 0 otherwise   (3)

Here ε_0 is a uniform threshold used to indicate whether the current patch is occluded. When the mask histogram β is displayed in the form of Fig. 1(b), a mask map is built, as shown in the last column of Fig. 2, where ε_0 = 0.11. In our experiments, ε_0 can be set neither too large nor too small; relatively good results are obtained with ε_0 = 0.11. As Fig. 2 shows, besides occlusion there are many other factors that increase the reconstruction error of a patch, for example object pose, appearance and illumination variation, or object drift. Meanwhile, within occluded regions there are also some patches with small reconstruction error, as shown in the mask map of Fig. 2(c); this is because part of the occluding object looks similar to some dictionary atoms (for example, part of the book is similar to the hair). According to the general spatial constraints between the tracked object and the occluding object, non-occluded regions should not appear inside an occluded region (except for ring-shaped occlusion); conversely, small occluded regions should not appear inside a non-occluded region (except for very small occlusions). Accordingly, small non-occluded regions inside the occluded region and small occluded regions inside the non-occluded region should be removed from the sparse reconstruction error map. A region growing method is therefore adopted to deal with small interfering areas in the mask map in order to improve accuracy. After the mask maps in Fig. 2 are processed, new mask maps (Fig. 3) are obtained, and the mask histogram is restored as a new mask histogram Nβ = [Nβ_1, Nβ_2, …, Nβ_9], where Nβ_i = [Nρ_1, Nρ_2, …, Nρ_N] ∈ R^{K×N}. This novel and effective design corrects the initial sparse reconstruction error map: small isolated regions with abnormally large or small sparse reconstruction error are deleted by the region growing method, so noise caused by illumination, pose and scale change, etc., is resisted effectively. Finally, the histogram B is rewritten as a new histogram NB, where ⊙ denotes element-wise multiplication:

NB = B ⊙ Nβ   (4)

3.5. Confidence calculation of the candidate object

Finally, the confidence of a candidate object x is calculated as

Conf(x) = Σ_{i=1}^{9N} Σ_{k=1}^{K} min(NB_i^k(x), Nψ_i^k) · exp(−(ε_i)²/σ²)   (5)

where ψ is the histogram of the object region template, which is constantly updated, and k indexes the K bins of the histogram of patch i. ψ is also multiplied element-wise by the mask, Nψ = ψ ⊙ Nβ, to ensure a fair comparison, and x denotes the x-th candidate object in the current iteration. The candidate object with the largest confidence Conf is taken as the tracking result.
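To make the pipeline of Eqs. (1)–(5) concrete, the following minimal sketch (ours, not the authors' implementation) strings the steps together with NumPy, scikit-learn and SciPy on random data. All sizes, thresholds and the dictionary itself are illustrative placeholders; scikit-learn's `Lasso(positive=True)` stands in for the non-negative L1 solver (its objective matches Eq. (1) up to a constant scaling), and the paper's region growing cleanup is approximated here by connected-component filtering:

```python
import numpy as np
from scipy import ndimage
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Illustrative sizes (the paper tunes m, N, K in its Section 4.3).
m, N, K = 6, 16, 8            # patch side, patches per sub-region, dictionary atoms
d = m * m                     # feature dimension of one vectorized patch
lam, eps0, sigma = 0.01, 0.11, 0.5

D = np.abs(rng.standard_normal((d, K)))   # one sub-region dictionary (Eq. (1)'s D)
Y = np.abs(rng.standard_normal((d, N)))   # patches of one candidate sub-region

# Eq. (1): non-negative L1-regularized least squares, solved per patch.
solver = Lasso(alpha=lam, positive=True, max_iter=5000)
B = np.stack([solver.fit(D, Y[:, i]).coef_ for i in range(N)], axis=1)  # K x N

# Eq. (2): per-patch reconstruction error.
eps = np.sum((Y - D @ B) ** 2, axis=0)    # length-N vector

# Eq. (3): occlusion mask (1 = patch kept, 0 = patch treated as occluded).
rho = (eps < eps0).astype(float)

# Region growing cleanup, approximated by connected-component filtering:
# flip tiny occlusion islands back to "kept" in the spatial mask map.
mask_map = rho.reshape(4, 4).copy()       # patches arranged on a 4x4 grid
labels, n_comp = ndimage.label(mask_map == 0)
for lab in range(1, n_comp + 1):
    if (labels == lab).sum() < 2:         # isolated single-patch occlusion
        mask_map[labels == lab] = 1
new_rho = mask_map.ravel()

# Eq. (4): masked sparse-coefficient histogram (mask broadcast over bins).
NB = B * new_rho

# Eq. (5): confidence against a (here random) masked template histogram N_psi.
N_psi = np.abs(rng.standard_normal((K, N))) * new_rho
conf = np.sum(np.minimum(NB, N_psi) * np.exp(-eps ** 2 / sigma ** 2))
print(float(conf))
```

In the real tracker, D would come from clustering the first-frame patches of each sub-region, and Nψ from the weighted template histogram of Section 3.6; the sketch covers a single sub-region only.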
Fig. 3. New mask maps after the mask maps in Fig. 2 are processed (frames #5, #32, #214, #547, #894).
3.6. Template updating

This paper designs a new flexible template set mechanism. Its basis is that the object manually selected in the first frame is used as the initial template, and a template is added to the set only when certain conditions (Section 3.6.2) are satisfied; otherwise the number of templates in the set remains unchanged. The maximum number of templates is n, so templates are replaced only when the set already contains n templates. Let the template set of the object region be T = [T_1, T_2, …, T_n], where n is the size of T.

3.6.1. Initialization of the template set

The object selected manually in the first frame is used to initialize the template set T. As tracking proceeds, T is filled continuously; templates are replaced only once T is full.

3.6.2. Filling and updating of the template set

The specific process of filling and updating the template set is as follows. First, the best tracking result in the current frame is obtained by the calculation in Section 3.5, and the cosine angle between this result and the manually selected object is calculated. The cosine angle measures the similarity between the latest tracking result and the object template in the first frame: a smaller cosine angle indicates greater similarity. This value determines whether the current result may be filled into or used to update the template set; the set is filled or updated only when the cosine angle is within a certain range. Second, the occlusion rate of the tracking result in the current frame is calculated as the ratio of the number of patches whose reconstruction error exceeds a predefined value to the total number of patches in the current best candidate. The template set should not be filled or updated if the occlusion rate exceeds a threshold (i.e. the object is heavily occluded), because under heavy occlusion most of the object appearance may have changed markedly, and forcing a fill or update would likely disturb the template set severely and could even cause tracking drift. Therefore, tracking results with heavy occlusion are not used to update or fill the template set. Finally, if the two conditions above are fulfilled, we check whether the template set is full. If not, the best candidate in the current frame is added to the set; otherwise, the template with the largest cosine angle is replaced by the best candidate. Note that the first template in the set stays constant and is not involved in template updating.
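A compact sketch of this fill-or-replace logic follows (our reading of the two conditions; the threshold values, the choice of the first template as the similarity reference for replacement, and all helper names are illustrative assumptions, not values from the paper):

```python
import numpy as np

def cosine_angle(a, b):
    """Angle (radians) between two vectorized appearance descriptors."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def maybe_update(templates, first_template, result, occlusion_rate,
                 n_max=10, angle_max=0.6, occ_max=0.8):
    """Fill or replace the template set per the paper's two conditions.

    templates[0] (the manually selected object) is never replaced.
    angle_max and occ_max are illustrative thresholds.
    """
    angle = cosine_angle(result, first_template)
    if angle > angle_max or occlusion_rate > occ_max:
        return False                      # not valuable: reject even if not full
    if len(templates) < n_max:
        templates.append(result)          # fill while the set is not full
    else:
        # replace the template with the largest angle (least similar)
        angles = [cosine_angle(t, first_template) for t in templates[1:]]
        worst = 1 + int(np.argmax(angles))
        templates[worst] = result
    return True

rng = np.random.default_rng(1)
first = np.abs(rng.standard_normal(64))
T = [first]
# A slightly perturbed copy of the object is accepted and fills the set...
accepted = maybe_update(T, first, first + 0.05 * rng.standard_normal(64), 0.1)
# ...while a heavily occluded result is rejected even though the set is not full.
rejected = maybe_update(T, first, rng.standard_normal(64), 0.95)
print(accepted, rejected, len(T))
```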
3.6.3. Construction of the template histogram

Each template of the object contains 9N patches, and each patch yields a histogram according to the method in Section 3.3. The combination of the histograms of the 9N patches is ψ_l, where l denotes the l-th template in the template set. Since there are n templates, we obtain n histograms ψ_l, l = 1, 2, …, n. The template histogram ψ is the weighted sum of these n histograms, as shown in Fig. 4. The weights are calculated from the cosine angles between the tracking result in the previous frame and the templates; if the template set is not full, the weight of an empty template slot is 0. The weights satisfy the constraint ω_1 + ω_2 + ⋯ + ω_n = 1, and the weight ω_1 of the first template stays constant. This is the new flexible template set update mechanism proposed in this paper: the patch sparse coefficient histograms of the updated templates are combined as a weighted sum to extract temporal information about the object, providing a reliable template basis for obtaining good candidates.
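The weighted combination above can be sketched as follows. The inverse-angle weighting rule is our assumption: the paper only requires that smaller angles receive larger weights, that ω_1 stays fixed, and that the weights sum to 1, without giving an explicit formula:

```python
import numpy as np

def template_histogram(histograms, angles, w1=0.3):
    """Weighted sum of template histograms (cf. the paper's Sec. 3.6.3).

    w1 is the fixed weight of the first template; the remaining mass
    1 - w1 is split in proportion to inverse cosine angle (assumption).
    """
    hs = np.asarray(histograms, dtype=float)        # n x K x N stack
    inv = 1.0 / (np.asarray(angles[1:], dtype=float) + 1e-6)
    w_rest = (1.0 - w1) * inv / inv.sum()           # sums to 1 - w1
    w = np.concatenate(([w1], w_rest))              # full weight vector
    psi = np.tensordot(w, hs, axes=1)               # weighted sum over templates
    return w, psi

rng = np.random.default_rng(2)
hists = np.abs(rng.standard_normal((4, 8, 16)))     # 4 templates, K=8, N=16
w, psi = template_histogram(hists, angles=[0.0, 0.2, 0.4, 0.8])
print(np.round(w, 3), psi.shape)
```

With these angles the remaining weights come out ordered so that the template closest to the previous tracking result contributes most, as the paper requires.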
3.7. Proposed tracking algorithm After the tracking object is manually selected, the proposed tracking algorithm is executed from the second frame. In order to locate the object accurately and execute the tracking algorithm fast, we use a two-step search method to locate the tracking object. The specific process of this method is as follows: first, a large-scale particle sampling is carried out in the vicinity of the object location in previous frame, i.e. a coarse sampling with large sampling parameters is carried out in advance. Assuming that the number of particles is S. Accordingly, S candidate objects are got. Then the confidence of each candidate object is calculated according to formula (5). Second, a small-scale particle sampling is performed in the vicinity of the location of candidate object with the largest confidence, i.e. a precision sampling with small sampling parameters is carried out. The number of particles is SS, so SS candidate objects in the second step are got. In the same way, the confidence of each candidate object is calculated again according to the formula (5). Finally the candidate object with the largest confidence is chosen as the tracking result in current frame. However, if the object moves very fast, the above method may still lose the object, such as situation in Fig. 5. If the cosine angle suddenly becomes large, tracking maybe fail, Fig. 5 shows a demonstration of the object being lost. It may make the tracking not return the actual position of the object if the object is lost. In order to return the actual position of the object after the object being lost, we propose a novel and effective dynamic sub-region
resampling method based on the change of cosine angle. After the tracking result in the current frame is obtained by the two-step search method, the cosine angle between the best tracking result and the object template is calculated. If the cosine angle in the current frame is 1.5 times larger than the average cosine angle over all previous frames, we consider that the object may be lost, as shown in Fig. 5 (note that besides fast motion, this situation may also be caused by short-time complete occlusion or an obvious object appearance change; the proposed sampling method is equally effective in these cases). Then the dynamic sub-region resampling method is used to search for the possibly lost object. The idea of this method is straightforward. The region around the location where the object was lost is dynamically divided into several sub-regions, and a coarse sampling is performed within each sub-region, as shown in Fig. 6. The blue boxes are the dynamically divided sub-regions, and the green '+' marks are the sampling centers. We then use formula (5) to calculate the confidence of the candidate objects in all the dynamic sub-regions, and a precise sampling is performed again in the vicinity of the candidate object with the highest confidence. Finally, the candidate object with the highest score is the desired tracking object. Fig. 7 shows the variation trend of the cosine angle after the above-mentioned resampling method is used; most cosine angle values are in the expected range, except a few large ones. The innovation of this method is that when the object is lost due to fast motion, short-time complete occlusion, etc., a large-scale, low-density resampling in dynamically divided sub-regions is performed. This
Fig. 4. Construction schematic of the object template histogram.
Fig. 5. The variation trend of the cosine angle on the boy video sequence after the object is lost.
process not only keeps the algorithm's processing speed stable, but also searches for the object over a larger range. Then a small-scale sampling is performed again in the vicinity of the best location from the previous stage, so the object can be located more accurately. This method can correct the tracking position effectively and prevent the object from being completely lost; therefore, tracking accuracy can be improved dramatically. In the tracking process, handling the scale change of the object is also an important task. A simple and effective scale processing method, the random scale method, is used in this paper. Its principle is similar and closely related to the particle sampling process. When sampling is performed, particles are randomly scattered according to certain laws in the vicinity of the object location in the previous frame, and the candidate object corresponding to each particle is extracted. Meanwhile, the scale of each candidate object is randomly expanded or reduced within a certain range relative to the tracking object size in the previous frame. The best location and scale are then determined by the object confidence calculation (formula (5)): the location and size of the candidate object with the largest confidence value are taken as the position and size of the object in the current frame.
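The two-step search and the cosine-angle loss test described above can be sketched as follows. The confidence function, particle counts, and Gaussian sampling spreads are illustrative assumptions, not the paper's formula (5) or its actual sampling parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def two_step_search(confidence, prev_center, S=400, SS=150,
                    sigma_coarse=12.0, sigma_fine=3.0):
    """Coarse sampling of S particles around the previous location,
    then fine sampling of SS particles around the best coarse candidate."""
    coarse = prev_center + rng.normal(0.0, sigma_coarse, size=(S, 2))
    best = coarse[np.argmax([confidence(p) for p in coarse])]
    fine = best + rng.normal(0.0, sigma_fine, size=(SS, 2))
    return fine[np.argmax([confidence(p) for p in fine])]

def object_lost(angle, angle_history, factor=1.5):
    """Loss test: the current cosine angle exceeds `factor` times the
    average cosine angle over all previous frames."""
    return len(angle_history) > 0 and angle > factor * float(np.mean(angle_history))
```

When `object_lost` fires, the tracker would repeat the coarse step once per dynamically divided sub-region before the final fine sampling, as described above.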
4. The experimental results and analysis
The proposed algorithm is implemented in MATLAB and runs at 1.8 frames per second on an Intel 3.6 GHz Core i7-4790 PC with 16 GB memory. In order to evaluate the performance of the proposed algorithm, we conduct experiments on twenty-five challenging video sequences. These video sequences contain highly challenging situations: heavy occlusion, short-time complete occlusion, illumination variation, fast motion, background clutter, in-plane and out-of-plane rotations, and scale variation (as shown in Figs. 8–13). All the video sequences are public datasets that can be downloaded from the internet; most of them come from the tracking benchmark 2013 [48] and the VOT2014 Challenge [49]. We compare with nine state-of-the-art tracking algorithms: IVT [34], Frag [26], STC [38], L1APG [41], SCM [44], ASLA [39], MTT [13], LSK [37] and SRPCA [46]. The initial position of the object is the same for all the algorithms, and the location of the object is manually labeled in the first frame of every sequence. In the proposed algorithm, the object is normalized to 30×30 pixels, the patch size is 6×6 pixels, the overlap distance L between patches is 3 pixels, the number N of patches in each sub-region is 9, the number K of dictionaries clustered in each sub-region is 5, the weight parameter λ in formula (1) is 0.01, the threshold ε0 in formula (3) is 0.11, the template number n is 8, and the template set T is updated in every frame; these parameters are fixed in all the experiments. A detailed analysis and discussion of the selection of, and constraint relationship between, the size of the normalized object, the size of the patch, the number of patches in the object region, and the size and number of sub-regions is given in the third part of this section.
4.1. Qualitative evaluation
Fig. 6. An example of resampling in the dynamically divided sub-regions. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)
4.1.1. Heavy occlusion
We test four video sequences: occlusion1, faceocc2, girl and walking2. There are heavy occlusions in these datasets, as shown in Fig. 8, as well as in-plane rotation (Fig. 8(b)) and out-of-plane rotation (Fig. 8(c)). The proposed algorithm achieves excellent tracking results in these experiments, for the following main reasons. First, the object region is divided into many sub-regions, the dictionary base of the object is built by combining the dictionaries from all the sub-regions, and the dictionaries of each sub-region are obtained by clustering all the patches in that sub-region. This
Fig. 7. The variation trend of the cosine angle on the boy video sequence after the proposed resampling method is used.
Fig. 8. A part of tracking results on the occlusion1, faceocc2, girl, and walking2 sequences when the objects are heavily occluded.
process helps align different regions of the object and prevents false matches between patches in different sub-regions. Second, the sparse reconstruction error map is post-processed to eliminate the influence of noise. Third, a flexible template set update mechanism is used. All of these measures help improve the accuracy of object tracking. From Fig. 8, we can see that tracking drift occurs in the STC, IVT, Frag, MTT and LSK algorithms, since their update mechanisms do not consider partial occlusion. In contrast, ours and the L1APG, SCM, ASLA and SRPCA algorithms achieve better tracking results, because these algorithms have update mechanisms for dealing with occlusion.
4.1.2. Short-time complete occlusion
We test three video sequences: jogging-1, jogging-2 and suv. There is short-time complete occlusion in these datasets, as shown in Fig. 9. In addition, the objects are non-rigid, and parts of these non-rigid objects undergo significant appearance change, as in Fig. 9(a) and (b): the running motion in the leg region differs greatly from the manually selected training data. Results
show that the proposed algorithm achieves a significant performance improvement over the other algorithms, for the following reasons. 1. A novel and effective dynamic sub-region resampling method based on the change of cosine angle is used. Once the object deviates or is lost, the cosine angle varies markedly, and dynamic sub-region resampling is automatically performed within a large range around the object position where the cosine angle changed. When the best object position is fixed in the current stage, a precise small-scale search is performed next. Accordingly, the object can be relocated and any position deviation corrected in time. 2. Local region sparse representation of the object is used. The object is divided into several sub-regions on top of having been divided into many small patches, and the dictionary base of the object is built by combining the dictionaries from all the sub-regions. The changed part is only a portion of the non-rigid object, and the sub-region processing mode ensures that it does not affect the processing results of the unchanged parts. The reconstruction error of patches in the
Fig. 9. A part of tracking results on the jogging-1, jogging-2, and suv sequences when the objects are completely occluded for a short time.
change part may be large. If this error value is greater than a certain threshold, the corresponding patches are not involved in the calculation of the object confidence, so the interference caused by the significantly changed part can be excluded. Therefore, good tracking results can be obtained even if part of the object's appearance changes significantly. Because the other algorithms lack these two processing mechanisms, they generally track the objects wrongly or lose them, as seen in Fig. 9.
4.1.3. Illumination variation
We test five video sequences with large illumination variation: car4, car11, davidin300, tunnel and sylvester, as shown in Fig. 10. In addition, there are certain scale and pose variations in these datasets, as shown in Fig. 10(c) and (e). The Frag algorithm performs poorly when severe illumination variation occurs, with tracking drift in all five video sequences. In the car11, tunnel and sylvester video sequences, the LSK algorithm tracks poorly in low-light environments. In the davidin300 video sequence, the trivial template mechanism of L1APG mistakes the object's pose and facial expression variation for occlusion, so the object is tracked wrongly. In the car4, car11 and davidin300 video sequences, our algorithm and the SCM, ASLA and SRPCA algorithms obtain good tracking results because they use object descriptions based on sparse representation and effective update mechanisms. In the tunnel video sequence, the object becomes completely dark for a period of time due to the illumination variation, so most of the algorithms drift to different degrees. Because sparse-patch-based processing, an effective update mechanism and confidence calculation are used in our algorithm and ASLA, the object can still be located well even when it becomes completely dark. In
the sylvester video sequence, there is pose variation in addition to the illumination variation. Since robust sparse feature description, particle sampling and a voting mechanism are used in our algorithm and the SCM and MTT algorithms, good tracking results are obtained on this sequence.
4.1.4. Fast motion
We test four video sequences with motion blur caused by fast motion: boy, owl, face and fish, as shown in Fig. 11. The results show that when objects move fast, most algorithms drift or lose them. When the objects move fast in place, some algorithms can still continue to locate and track them, but if a fast-moving object does not stay in place, it may be lost completely. The dynamic sub-region resampling and local region sparse representation methods are used in this paper. When the tracker deviates from the actual position because of fast motion, resampling is automatically triggered by the change of cosine angle, so even when the object drifts or is lost, the proposed algorithm can find it again and correct the position error. When motion blur is produced by fast motion, the local region sparse appearance model proposed in this paper has a strong ability of sparse reconstruction, so the candidate object closest to the original signal can be found. Compared with the other algorithms, the proposed algorithm achieves better tracking results.
4.1.5. Background clutter
We test four video sequences with complex backgrounds: board, stone, subway and singer2, as shown in Fig. 12. There are many other objects in the background, including some similar to the tracked object. In addition, there is serious out-
Fig. 10. A part of tracking results on the car4, car11, davidin300, tunnel, and sylvester sequences when the objects undergo illumination variation.
of-plane rotation, as shown in Fig. 12(a) and (d). In the board video sequence, the L1APG, IVT and MTT algorithms lose the object in almost all frames, and the SCM, ASLA, Frag, STC, LSK and SRPCA algorithms also drift or lose the object when it undergoes serious out-of-plane rotations. The singer2 video sequence shows the same situation: tracking drift occurs quickly due to background clutter in the L1APG, SCM, MTT, LSK and SRPCA algorithms, and the Frag, IVT and STC algorithms lose the object when its pose changes. The tracking result of the proposed algorithm is generally good; only a certain degree of drift occurs at the very end, when the object pose changes greatly. The proposed algorithm achieves good
tracking performance, mainly because local region object description based on sparse representation and the flexible template set update mechanism are used. In the stone and subway video sequences, there are many similar distracting objects in the background, so many algorithms drift or lose the object very quickly. Our algorithm and the SCM and ASLA algorithms achieve better tracking results than the others, since the patch description method based on sparse representation has a strong ability of sparse reconstruction, and accordingly strong object recognition ability: even when there are many similar objects in the background, the desired object can still be located accurately.
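The exclusion of high-error patches from the object confidence, as described above for occlusion and appearance change, can be sketched as follows. The exponential scoring is an illustrative stand-in for the paper's formula (5); only the threshold ε0 = 0.11 is taken from the stated parameters.

```python
import numpy as np

def patch_confidence(recon_errors, eps0=0.11):
    """Confidence from per-patch sparse reconstruction errors.

    Patches whose reconstruction error exceeds eps0 (likely occluded or
    significantly changed) are excluded from the score, so they cannot
    drag down the confidence of an otherwise well-matched candidate.
    """
    errors = np.asarray(recon_errors, dtype=float)
    valid = errors <= eps0
    if not valid.any():
        return 0.0
    # lower error on the valid patches -> higher confidence
    return float(np.exp(-errors[valid]).sum() / errors.size)
```

A candidate with a few badly occluded patches thus still scores close to a clean candidate, while a candidate whose patches all reconstruct poorly scores zero.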
Fig. 11. A part of tracking results on the boy, face, owl, and fish sequences when the objects move fast.
4.1.6. Out-of-plane rotation
We test five video sequences with different degrees of out-of-plane rotation: fleetFace, freeman1, freeman3, polarbear and surfing, as shown in Fig. 13. In the fleetFace video sequence, the STC and L1APG algorithms lose the object; the other algorithms can track it, but with different degrees of drift, while the accuracy of the proposed algorithm is high. In the freeman1 and freeman3 video sequences, the tracked object is relatively small and the degree of out-of-plane rotation is relatively large, so many algorithms, such as L1APG, STC, LSK and MTT, lose the object, and the others also drift to some degree; the tracking results of our algorithm and SCM are good. In the polarbear and surfing video sequences, the degree of out-of-plane rotation is relatively small, so most of the algorithms track well. In general, the proposed algorithm obtains better tracking results than the other algorithms. For non-rigid objects, parts of the object change significantly; as shown in Fig. 13(a)–(c), heavy out-of-plane rotation occurs in the head region, yet the algorithm in this paper still achieves good tracking results. This is because local region sparse representation of the object is used: the object is divided into several sub-regions on top of having been divided into many small patches, and the dictionary base of the object is
based on the combination of the dictionaries from all the sub-regions. The changed part is only a portion of the non-rigid object, and the sub-region processing mode ensures that it does not affect the processing results of the unchanged parts. The reconstruction error of patches in the changed part may be large; if this error value is greater than a certain threshold, the corresponding patches are not involved in the calculation of the object confidence, so the interference caused by the significantly changed part can be excluded. Therefore, good tracking results can be obtained even if part of the object's appearance changes significantly. Because the other algorithms lack this processing mechanism, they generally track the object wrongly or lose it.
4.2. Quantitative evaluation
We evaluate the above-mentioned algorithms using the center location error and overlap rate. The center location error is the distance (in pixels) between the center of the tracked object and the center of the manually labeled object: the smaller the center location error, the better the performance. The overlap rate is computed as the intersection over union of the tracking result R_T and the ground truth R_G, i.e., |R_T ∩ R_G| / |R_T ∪ R_G|: the larger the overlap rate, the better the performance. A specific
Fig. 12. A part of tracking results on the board, stone, subway, and singer2 sequences when the objects move in a cluttered background.
description of these two evaluation methods can be found in the literature [47]. Figs. 14 and 15 show the center location error curves and overlap rate curves of the 10 algorithms on the 25 video sequences. Results in Fig. 14 show that the center location error curves diverge for some algorithms, such as STC on the occlusion1, faceocc2, jogging-1, jogging-2 and owl video sequences, Frag on the walking2, car11, davidin300, tunnel, boy and surfing video sequences, and IVT on the girl, jogging-1, jogging-2, sylvester, subway, fleetFace, freeman3 and boy video sequences. This indicates that these algorithms lose the objects during tracking, and their results move farther and farther from the actual object location. In addition, the center location error curves of some algorithms oscillate, for example Frag, MTT, LSK and SRPCA on the face video sequence and IVT, MTT and SRPCA on the owl video sequence. This indicates that for these algorithms, the tracking result can still return to the vicinity of the actual object location after the object is lost, and this process repeats several times. Fig. 15 shows a similar situation. Overall, the proposed algorithm is better than the other nine state-of-the-art algorithms. Tables 1 and 2 show the average center location error and average overlap rate, where the bold red and blue fonts represent the top two tracking results. In this paper, many innovative designs and improvements based on existing patch
sparse representation methods for the object appearance model are proposed. The results in Tables 1 and 2 show that the proposed algorithm achieves better performance than the SCM, ASLA and LSK algorithms based on patch sparse representation, and it also has obvious advantages over the other six state-of-the-art tracking algorithms.
4.3. Discussion
4.3.1. The selection and analysis of important parameters
There are many parameters in the proposed algorithm, for example the size of the patch, the size of each sub-region, and the number of dictionaries obtained by clustering in each sub-region. Different selections of these parameters may affect the algorithm differently. The following are experimental results for different parameter selections and an analysis of their influence on the algorithm. Assume that the normalized object size is 30×30 pixels. In order to make the number of patches and sub-regions integers (otherwise a small part of the edge regions of the object would be missing, part of the object information would be lost, and the tracking accuracy would suffer), the size of the normalized object is allowed to float up and down slightly. After the patch size and overlap distance L are fixed, there is a certain constraint relationship between the size of the normalized object,
Fig. 13. A part of tracking results on the fleetFace, freeman1, freeman3, polarbear, and surfing sequences when the objects undergo out-of-plane rotation.
the number of patches in the object region, and the size and number of the sub-regions, as shown in Tables 3–6. Note that the unit of measure for the size of a sub-region is not pixels but the number of patches; for example, a sub-region size of 3×3 means there are 3 patches along each of the length and width directions of the sub-region. If the patch size is 4×4 (in pixels, the same below) and the overlap distance L is 2, the constraint relationship between the size of the object, the number of patches, and the size and number of sub-regions is shown in Table 3. To make the tables more concise, the size of the object, the number of patches, and the size and number of sub-regions are denoted Size_obj, Num_pat, Size_subre and Num_subre respectively. If the patch size is 6×6 and the overlap distance L is 3, the constraint relationship is shown in Table 4. If the patch size is 8×8 and the overlap distance L is 4, the constraint relationship is shown in Table 5.
If the patch size is 10×10 and the overlap distance L is 5, the constraint relationship is shown in Table 6. In this paper, every constraint relation in Tables 3–6 is assigned an ID to facilitate presenting the experimental results. The number of cluster dictionaries used in each sub-region also differs for every ID. Figs. 16–21 give the number K of cluster dictionaries used in each sub-region and the corresponding tracking results on the occlusion1 video sequence for every constraint relation. From Tables 3–6 and Figs. 16 and 17, if the patch size and overlap distance L are fixed, then for the same cluster number K, the smaller the sub-region size is, the larger the number of sub-regions in the object region is, and generally the better the tracking results are. Under the same ID, the influence of changing the cluster number K on the tracking results is not very obvious, although if K is very small, the tracking results are generally poor; otherwise the differences are very small. From a statistical point of view, a large K generally performs better than a small K.
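The constraint relationship in Tables 3–6 can be reproduced with a short calculation. The floor-division handling of object sizes that do not tile exactly (e.g. a 28×28 object with a 6×6 patch) is an assumption consistent with the tabulated values.

```python
def layout(size_obj, patch, stride, subre):
    """Patch and sub-region counts implied by Tables 3-6.

    Patches of side `patch` are sampled with step `stride` (the overlap
    distance L); sub-regions of `subre` x `subre` patches tile the
    resulting patch grid.
    """
    per_side = (size_obj - patch) // stride + 1      # patches per side
    num_pat = per_side ** 2                          # Num_pat
    num_subre = (per_side // subre) ** 2             # Num_subre
    return num_pat, num_subre
```

For the configuration of ID 8 (30×30 object, 6×6 patch, L = 3, 3×3 sub-regions) this gives 81 patches and 9 sub-regions, matching Table 4.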
Fig. 14. The center location error curves.
If the experimental results in Tables 3–6 are considered as separate groups, we can see from Figs. 18 and 19 that the larger the number of patches in the object region is, the better the tracking results are. When the sub-region size is very small or the number of patches is very large, a more detailed division and description of the object is obtained; when heavy occlusion occurs, the influence of the occlusion boundary can be reduced to a minimum, because useful regions are retained as far as possible, and good tracking results follow. From Tables 3–6 and Figs. 20 and 21, when the ID is 2, 8, 12 or 15, the sub-region size is the same (3×3) while the patch size is 4×4, 6×6, 8×8 and 10×10 respectively. We can see that the
smaller the patch size is, the better the tracking results are for the same K. Similarly, when the ID is 1, 7, 11 or 14, or 3, 9 or 13, or 4 or 10, the results are basically similar. This shows that, to a certain extent, the smaller the patch size is, the better the tracking results are. However, the larger the number of sub-regions is, the longer the processing time is; especially as K increases, the processing time increases significantly. Taking into account the balance of speed and accuracy, the parameter configuration of ID 8 is used in all the experiments in this paper, i.e., the object size is 30×30, the patch size is 6×6, the overlap distance L is 3, the sub-region size is 3×3, and the number of cluster dictionaries in every sub-region is
Fig. 15. The overlap rate curves.
5, as given at the beginning of Section 4. In addition, different choices of the overlap distance L also produce different numbers of patches and sub-regions in the object region. To simplify the experiments, the overlap distance L is uniformly set to half the patch side length in this paper. Interested readers can select different overlap distances L to verify their influence on the tracking results in the future; this experiment is not listed here.
4.3.2. The contribution analysis and discussion for each part of the proposed algorithm
The proposed algorithm is mainly divided into three parts: local region sparse representation of the
object, the flexible template set update mechanism, and dynamic sub-region resampling based on cosine angle change. In order to better display the contribution of each part of the proposed algorithm, the experiment is designed as follows. Each part of the algorithm is numbered: ① local region sparse representation of the object; ② flexible template set update mechanism; ③ dynamic sub-region resampling. For completeness, conventional particle sampling is numbered ④. The experimental configurations are ①+④ (no update), ①+②+④, and ①+②+③ (the complete algorithm in this paper). Six video sequences, occlusion1, jogging-1, tunnel, owl, subway and freeman1, are used in the experiments. These video sequences contain heavy occlusion, short-time complete occlusion, illumination variation, fast
Table 1 Average center location errors.
motion, background clutter and out-of-plane rotation, six kinds of challenging situations respectively. Tables 7 and 8 give the tracking results of using different parts of the proposed algorithm. The difference between ①+④ and ①+②+④ is whether the flexible template set update mechanism is used. The results in Tables 7 and 8 show that on the occlusion1, jogging-1, owl and subway video sequences, adding ② brings no obvious performance improvement, while on the tunnel and freeman1 video sequences, adding ② improves tracking performance significantly. This is because the appearance of the object changes markedly in those sequences: in freeman1 there is heavy out-of-plane rotation, and in tunnel there is obvious illumination variation, so the flexible template set update mechanism ② plays a positive role and the tracking effect is enhanced significantly. In the occlusion1 video sequence, the object appearance also changes obviously due to heavy occlusion, but the algorithm itself detects the occlusion through the sparse reconstruction error estimate, and such samples are not updated into the
template set. Thus the template update plays a limited role, and the improvement is not obvious. The difference between ①+②+④ and ①+②+③ is whether dynamic sub-region resampling is used. In the occlusion1, tunnel, subway and freeman1 video sequences, adding ③ brings no obvious improvement, while in the jogging-1 and owl video sequences, adding ③ improves performance significantly. This is because when the object deviates or is lost due to fast motion or short-time complete occlusion, the cosine angle varies obviously; dynamic sub-region resampling is then automatically performed within a large range around the object position, and once the best object position is fixed in the current stage, a precise small-scale search follows. Thus the object position deviation can be corrected effectively, and the problem of the object drifting or being lost is solved. Comparing the results of ①+④ with the ASLA, L1APG and SCM algorithms, we can see that even without ② and ③, the tracking results are not poor, so the local region sparse representation of the object proposed in this paper is effective. Comparing the results of ①+②+④ with the ASLA, L1APG and SCM
Table 2 Average overlap rates.
Table 3
The constraint relationship between the size of the object, the number of patches, and the size and number of sub-regions (patch = 4×4, L = 2).

ID | Size_obj | Num_pat      | Size_subre | Num_subre
1  | 30×30    | 14×14 = 196  | 2×2        | 7×7 = 49
2  | 32×32    | 15×15 = 225  | 3×3        | 5×5 = 25
3  | 26×26    | 12×12 = 144  | 4×4        | 3×3 = 9
4  | 32×32    | 15×15 = 225  | 5×5        | 3×3 = 9
5  | 26×26    | 12×12 = 144  | 6×6        | 2×2 = 4
6  | 30×30    | 14×14 = 196  | 7×7        | 2×2 = 4
Table 4
The constraint relationship between the size of the object, the number of patches, and the size and number of sub-regions (patch = 6×6, L = 3).

ID | Size_obj | Num_pat      | Size_subre | Num_subre
7  | 28×28    | 8×8 = 64     | 2×2        | 4×4 = 16
8  | 30×30    | 9×9 = 81     | 3×3        | 3×3 = 9
9  | 28×28    | 8×8 = 64     | 4×4        | 2×2 = 4
10 | 34×34    | 10×10 = 100  | 5×5        | 2×2 = 4
Table 5
The constraint relationship between the size of the object, the number of patches, and the size and number of sub-regions (patch = 8×8, L = 4).

ID | Size_obj | Num_pat      | Size_subre | Num_subre
11 | 30×30    | 6×6 = 36     | 2×2        | 3×3 = 9
12 | 30×30    | 6×6 = 36     | 3×3        | 2×2 = 4
13 | 36×36    | 8×8 = 64     | 4×4        | 2×2 = 4
Table 6
The constraint relationship between the size of the object, the number of patches, and the size and number of sub-regions (patch = 10×10, L = 5).

ID | Size_obj | Num_pat      | Size_subre | Num_subre
14 | 28×28    | 4×4 = 16     | 2×2        | 2×2 = 4
15 | 36×36    | 6×6 = 36     | 3×3        | 2×2 = 4
Fig. 16. Center location errors on the occlusion1 video sequence for every constraint relation. (a) The results of Table 3, (b) The results of Table 4, (c) The results of Table 5, (d) The results of Table 6.
Fig. 17. Overlap rates on the occlusion1 video sequence for every constraint relation. (a) The results of Table 3, (b) The results of Table 4, (c) The results of Table 5, (d) The results of Table 6.
Fig. 18. Average center location errors of Tables 3–6.
Fig. 19. Average overlap rates of Tables 3–6.
Fig. 20. Center location errors when the sub-region size is the same.
algorithms, we can see that even without ③, the tracking results of ①+②+④ are generally better than the latter. This shows that the proposed local region sparse representation of the object and the flexible template set update method are both feasible. The combination of ①, ② and ③ is the complete algorithm proposed in this paper, and the experimental results show that its tracking performance is better than that of the other algorithms.
5. Conclusions
In this paper, we propose a robust object tracking algorithm based on a local region sparse appearance model. This algorithm divides the object region into several sub-regions to achieve the
alignment of different regions of the object, so the spatial structure information of the object is extracted effectively and object appearance changes caused by illumination, posture variation, heavy occlusion and so on can be handled well. Meanwhile, this paper also carries out secondary processing on the constructed sparse reconstruction error maps, deleting small isolated regions to retain valuable information and reduce the effect of noise. In the updating framework, we design a novel flexible template set update mechanism, which uses a weighted sum to calculate the object template histogram; this update mechanism provides a strong template foundation for matching a good candidate object. In addition, in order to prevent the object from being completely lost, we design a two-step search method and a dynamic sub-region resampling method based on the cosine angle to scatter the particles and correct the tracking
Fig. 21. Overlap rates when the sub-regions have the same size.
Table 7 Center location errors when using different parts of the proposed algorithm.

              occlusion1   jogging-1   tunnel   owl     subway   freeman1
① + ④            4.1        114.7       24.7    11.5      7.3     103.0
① + ② + ④        3.6        100.3        5.7    11.3      2.9       8.9
① + ② + ③        3.4          5.0        5.9     5.4      2.9       6.7
ASLA             7.5        102.6        6.5    10.1      4.0     105.2
L1APG            7.8         86.7       21.7    27.3    147.6      62.3
SCM              3.6        104.0       34.0    12.5      3.7       6.9
Table 8 Overlap rates when using different parts of the proposed algorithm.

              occlusion1   jogging-1   tunnel   owl     subway   freeman1
① + ④            0.92        0.18       0.47    0.71     0.62      0.26
① + ② + ④        0.92        0.18       0.66    0.74     0.71      0.57
① + ② + ③        0.93        0.78       0.67    0.82     0.73      0.65
ASLA             0.86        0.19       0.63    0.76     0.70      0.27
L1APG            0.85        0.19       0.23    0.50     0.16      0.20
SCM              0.92        0.18       0.31    0.74     0.69      0.61
deviation. Once the tracking deviates, these methods offer a strong guarantee of returning to the actual object position. Both qualitative and quantitative evaluation results on 25 challenging video sequences demonstrate that the proposed tracking algorithm is robust and outperforms the other nine state-of-the-art algorithms.
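As an illustration of the cosine-angle idea used to detect tracking deviation, a similarity check between the tracking result and the template can be sketched as follows. This is a minimal sketch under stated assumptions: the feature histograms and the threshold value are illustrative, not the paper's exact formulation:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0

def tracking_deviated(candidate_hist, template_hist, threshold=0.8):
    """Flag a deviation when the candidate histogram no longer points in
    roughly the same direction as the template histogram.
    The threshold 0.8 is a hypothetical value chosen for illustration."""
    return cosine_similarity(candidate_hist, template_hist) < threshold
```

When such a check fires, the tracker can trigger the dynamic sub-region resampling described above instead of trusting the current estimate.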
Acknowledgment This work is supported by the National Natural Science Foundation of China under Grants 61302156, 61401220, 61471206 and 61402237 and the University Natural Science Research Project of Jiangsu province under Grant 13KJB510021.
Guang Han received the B.S. degree from Shandong University of Technology in 2004, and M.S. and Ph.D. degrees from Nanjing University of Science and Technology, in 2006 and 2010, respectively. Since 2010, he has been with Nanjing University of Posts and Telecommunications, Nanjing, China, where he is currently a lecturer in the Engineering Research Center of Wideband Wireless Communication Technology, Ministry of Education. His current research interests include pattern recognition, video analysis, computer vision and machine learning.
Xingyue Wang received the B.S. degree from Soochow University in 2009. Currently, he is pursuing the M.S. degree in the Department of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China. His current research interests include image processing, multimedia communications and pattern recognition.
Jixin Liu received the B.S., M.S. and Ph.D. degrees from Nanjing Normal University, Nanjing Tech University and Nanjing University of Science and Technology, in 2004, 2008 and 2013, respectively. Since 2013, he has been with Nanjing University of Posts and Telecommunications, Nanjing, China, where he is currently a lecturer in the Engineering Research Center of Wideband Wireless Communication Technology, Ministry of Education. His current research interests include pattern recognition and intelligent systems.
Ning Sun received the B.S., M.S. and Ph.D. degrees from Guilin University of Electronic Technology, Nanjing Institute of Electronic Technology and Southeast University, in 2000, 2004 and 2007, respectively. Since 2012, he has been with Nanjing University of Posts and Telecommunications, Nanjing, China, where he is currently an Associate Professor in the Engineering Research Center of Wideband Wireless Communication Technology, Ministry of Education. His current research interests include deep learning, pattern recognition and embedded-platform-based video analysis.
Cailing Wang received the B.S., M.S. and Ph.D. degrees from Nanjing University of Science and Technology, in 2002, 2004 and 2011, respectively. Since 2012, she has been with Nanjing University of Posts and Telecommunications, Nanjing, China, where she is currently a lecturer in the Department of Automation. Her current research interests include pattern recognition, intelligent systems, image processing and machine vision.