Self-expressive tracking

Yao Sui a,*, Xiaolin Zhao b, Shunli Zhang a, Xin Yu a, Sicong Zhao a, Li Zhang a

a Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
b School of Aeronautics and Astronautics Engineering, Airforce Engineering University, Xi'an 710038, China
Article history: Received 9 June 2014; Received in revised form 16 December 2014; Accepted 6 March 2015.
Abstract: Target representation is critical to visual tracking. A good representation usually exploits inherent relationships and structures among the observed targets, the candidates, or both. In this work, we observe that the candidates are strongly correlated to each other and exhibit an obvious clustering structure when they are densely sampled around possible target locations. We therefore propose a Self-Expressive Tracking (SET) algorithm based on an accurate representation with good discriminative performance. The interrelationship and the clustering structure among the observed targets and the candidates are exploited by a self-expressive scheme with a low-rank constraint. Further, we design a discriminative criterion of the likelihood for target location, which simultaneously considers the target, the background and the representation errors. To appropriately capture the appearance changes of the target, we develop an update strategy that adaptively switches between different update rates during tracking. Extensive experiments demonstrate that our tracking algorithm outperforms many other state-of-the-art methods.
Keywords: Visual tracking; Target representation; Appearance model
1. Introduction

Visual tracking plays an important role in machine learning and pattern recognition because of its various applications, e.g., visual surveillance, robotics, and human-machine interfaces. In recent years, much success has been achieved in constructing trackers. However, many challenges remain, e.g., illumination variations, occlusions, out-of-plane rotations, and cluttered background. The fundamental task of tracking is to estimate the motion states of the target among the candidates (i.e., the observed image patches in the current frame), given all the observed targets (located in previous frames). To address these challenges, several factors generally need to be considered.

The first is how the target representation is formulated. Many effective representations have been proposed, e.g., subspace [1–4], histograms [5,6], sparse coding [7–10], superpixels [11], and compressive sensing [12,13]. Recently, reconstructable methods, in which the representation can also be used to reconstruct the target, have become popular in tracker construction, e.g., subspace and sparse representations, because the reconstructions (with noise removal) work more robustly in challenging situations. Furthermore, to represent the target more accurately, these representations usually exploit some particular relationship among the observed targets and the candidates. For example, subspace and sparse representations learn the relationship in the sense of low-dimensional subspaces and sparse reconstructions,
respectively. However, the relationship in these methods is unidirectional, i.e., only the observed targets contribute to the representations of the candidates, while the contributions of the candidates themselves are not considered. In this work, with the reconstructable property, we use a more informative method to formulate the representation, in which the observed targets and the candidates contribute to the representations of each other. This method exploits the interrelationship among the observed targets and the candidates; the representation is thus self-expressive. Such a self-expressive scheme makes the representations more accurate, because all the pairwise relations among the observed targets and the candidates are considered.

Second, the inherent structures among the candidates should be taken into account for a good representation. Such structures are often determined by the generation strategy (e.g., sampling [14] and sliding window [15]) of the candidates. In the literature, many approaches have been proposed to capture these structures, e.g., subspace [16], sparsity [10], compactness [17], and self-similarity [18]. In this work, we generate the candidates by densely sampling image patches around possible target locations. Thus, the candidates exhibit a clustering structure (i.e., a mixture of several subspaces) due to their spatially local similarity. We therefore impose a low-rank constraint on their representations to capture the clustering structure among the candidates. The clustering structure leads to good discriminative performance for target location, which successfully handles challenges such as occlusions.

Third, a criterion is required to evaluate the likelihood of the candidates belonging to the target. The criterion determines the accuracy of the target location. There are many popular
approaches, e.g., reconstruction error [7], histogram matching [9], and representation difference [16]. In this work, the evaluation criterion of the likelihood is designed to be discriminative, where the target, the background and the representation errors are simultaneously taken into account. The empirical results in this work demonstrate that the criterion performs robustly in various challenging cases.

Last but not least, the online update strategy is crucial because it maintains the appearance model to capture the up-to-date changes of the target. Many methods have been used to design the update strategy, e.g., simple thresholding [7] and the random method [19]. The most difficult problem of the update strategy is how to determine the update rate. In this work, we use multiple appearance models with different update rates and update them independently. During tracking, the appearance models are adaptively switched according to the likelihood values, leading to an adaptive update rate. Thus, our strategy can capture both slight and significant appearance changes.

Contributions: A self-expressive tracking (SET) algorithm is proposed in this work.
- A representation method is proposed by using a self-expressive scheme with a low-rank constraint, which exploits the interrelationship and the clustering structure among the observed targets and the candidates.
- A robust evaluation criterion of the likelihood is designed for target location, which simultaneously takes into account the target, the background and the representation errors.
- An online update strategy is developed to capture both slight and significant appearance changes of the target, where the update decision is adaptively made in terms of multiple predefined update rates.
2. Related work

Subspace representation is a conventional approach to tracking. Incremental subspace learning is introduced in [1], in which the principal components of the observed targets in successive frames are learned to represent the target. To model different appearance changes, a sparsity structure is imposed on the bases of the subspace in [2]. Subspace representation has been shown to be effective against some challenges, e.g., pose changes of the target and illumination variations of the scenario. However, it performs unstably in the case of partial occlusions. This is because the subspace analysis assumes that the representation errors follow a Gaussian distribution with small variance, leading to small and dense errors that are unsuitable for modeling occlusions.

Sparse representation has been demonstrated to be robust against partial occlusions. In [7], sparse representation is first introduced to tracking, where both the target representation and the occlusions are enforced to exhibit a sparsity structure. Thus, the target is represented as a linear combination of a few observed targets, and the occlusions are absorbed by the sparse errors. The underlying assumption is that the errors follow a Laplacian distribution, which is suitable for modeling arbitrarily large but sparse errors (e.g., partial occlusions). The major problem of [7] is its expensive computational cost. Consequently, jointly sparse representation [10] and compressive sensing [13] are used to make this paradigm computationally attractive for tracking.

To exploit the advantages of the above two approaches, recent efforts aim to combine subspace and sparse representations to construct a robust tracker. In [4], subspace representation is used to describe the target and sparse representation is used to formulate occlusions. Similarly, in [20], the target is assumed to follow a Gaussian distribution (subspace prior) and occlusions are regarded as being generated from a Laplacian distribution (sparsity prior). In [21], subspace representation is used with a block sparse representation to construct a structured appearance model. In [3],
two-dimensional PCA is used to construct the subspace in the original image domain and occlusions are absorbed by a sparse representation error. In [16], both the subspace and the sparsity structures among the observations are simultaneously exploited by enforcing their representations to be jointly low-rank and sparse, and occlusions are modeled as a sparse additive error term. As presented above, these approaches generally handle the challenges by formulating the representation errors as outliers that exhibit sufficient sparsity during tracking. There are also extensive studies providing alternative approaches, including local patch analysis, e.g., fragments [6] and the sparse local appearance model [19]; discriminative methods, e.g., boosting [22,23], multiple instance learning [24] and structured learning [25,26]; and collaborative appearance models, e.g., [27]. For thorough reviews of tracking methods, the reader is referred to [28–30].

Motivated by the previous success, in this work we propose the self-expressive scheme with the low-rank constraint to capture the clustering structure among the observed targets and the candidates. Such a structure leads to good discriminative performance for target location, which can successfully handle challenges such as occlusions. Because of the sampling based generation strategy [14], the differences between the candidates, caused by the motion state transitions, e.g., translations, rotations, and scalings, can be large but sparse. Thus, we use the sparse representation to model these sampling differences. Compared to the subspace representation based trackers [1–4], which perform the subspace analysis on the observed targets (along the temporal dimension), our method conducts the analysis of a mixture of subspaces on both the observed targets and the candidates (along both the temporal and spatial dimensions), leading to a robust description of the target. Different from the trackers based on joint representations, which deal with the relationship [10,16] and the subspace structure [16] among the observed targets and the candidates, our method handles the interrelationship and exploits the clustering structure (a mixture of several subspaces), leading to a good discriminative capability for target location.
3. Representation method

3.1. Formulation

Our representation method is designed as a self-expressive scheme with a low-rank constraint. The self-expressive scheme has various applications in machine learning and pattern recognition, e.g., subspace segmentation [31] and graph construction [32]. It models the interrelationship between pairwise data, which is formulated as

\min_{Z} f(Z), \quad \text{subject to} \quad X = XZ, \qquad (1)
where X denotes the data matrix, Z denotes the representation matrix of X, and f is a function with respect to Z. Further, to exploit the clustering structure (a mixture of several subspaces) and to allow the representation errors to be large but sparse, our representation Z is found by

\min_{Z, E} \; \|Z\|_* + \lambda \|E\|_1, \quad \text{subject to} \quad X = XZ + E, \qquad (2)
where E denotes the representation errors, \lambda controls the weights of the low-rankness of Z and the sparsity of E, and \|\cdot\|_* denotes the nuclear norm, computed by summing the singular values of a matrix. The low-rank constraint on Z guarantees to discover the mixture of several subspaces among the data. The sparsity on E ensures that the differences between the columns of X are large but sparse. Fig. 1 illustrates the proposed representation, where the observed images X are decomposed into the reconstructed images XZ and the sparse errors E by using Eq. (2). It can be seen that the errors are denser when the images have larger misalignments against the corresponding means; therefore, their reconstructed images are pulled toward the means.
Fig. 1. Illustration of the representation. (a) Shows the observed images X (top line), the reconstructed images XZ (middle line), and the representation errors E (bottom line). (b) Plots the representation Z, where the colder color indicates the smaller value. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
Clearly, the representation matrix Z shown in Fig. 1(b) exhibits the clustering structure.
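To make the decomposition concrete, the following is a minimal numpy sketch (not the authors' code) that evaluates the objective of Eq. (2) and the constraint residual for a given pair (Z, E); the solver that actually produces Z and E is described in Section 3.2.

```python
import numpy as np

def objective_eq2(X, Z, E, lam=1.0):
    """Objective of Eq. (2) plus the constraint residual X - XZ - E.

    X: d x n data matrix (columns are vectorized image patches),
    Z: n x n representation matrix, E: d x n error matrix.
    """
    nuclear = np.linalg.norm(Z, ord='nuc')            # ||Z||_* (sum of singular values)
    sparse = lam * np.abs(E).sum()                     # lambda * ||E||_1
    residual = np.linalg.norm(X - X @ Z - E, 'fro')    # should be ~0 at a feasible point
    return nuclear + sparse, residual
```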
3.2. Evaluation method

The representation model in Eq. (2) presents a least squares problem with a nuclear norm and an \ell_1-regularization. To solve it efficiently, the Augmented Lagrange Multiplier (ALM) method is applied. The problem is first reformulated as

\min_{Z_1, Z_2, E} \; \|Z_1\|_* + \lambda \|E\|_1, \quad \text{subject to} \quad X = XZ_2 + E, \quad Z_1 = Z_2, \qquad (3)

where Z is renamed as Z_1 and Z_2 is a slack variable. The ALM formulation is found by

L(Z_1, Z_2, E) = \|Z_1\|_* + \lambda \|E\|_1 + \langle Y_1, X - XZ_2 - E \rangle + \langle Y_2, Z_2 - Z_1 \rangle + \frac{\tau}{2} \left( \|X - XZ_2 - E\|_F^2 + \|Z_2 - Z_1\|_F^2 \right), \qquad (4)

where Y_1 and Y_2 denote the Lagrange multipliers, \tau > 0 is a penalty parameter, \|\cdot\|_F denotes the Frobenius norm, and \langle \cdot, \cdot \rangle denotes the inner product of two matrices. The representation matrix is obtained by minimizing L(Z_1, Z_2, E). To the best of our knowledge, there is no closed-form solution to the minimization of L(Z_1, Z_2, E). Thus, an iterative algorithm is developed to obtain the solution Z_1^*, Z_2^* and E^*. First, the shrinkage function is defined as

\varphi_\sigma(x) = \mathrm{sign}(x) \max(0, |x| - \sigma), \qquad (5)

where \sigma denotes the soft-thresholding parameter. With this function, the following lemmas can be derived to develop the iterative algorithm.

Lemma 1. Given Z_2^* and E^*, Z_1^* can be obtained from Z_1^* = U \varphi_{1/\tau}(S) V^T, where [U, S, V^T] = \mathrm{svd}(Z_2^* + Y_2 / \tau).

Proof. If Z_2^* and E^* are given, minimizing L(Z_1, Z_2, E) is equivalent to solving the following problem:

\min_{Z_1} \; \frac{1}{\tau} \|Z_1\|_* + \frac{1}{2} \|Z_1 - (Z_2^* + Y_2 / \tau)\|_F^2, \qquad (6)

which is derived by completing the square. This is a convex optimization problem with respect to Z_1 and its global minimum can be obtained by the singular value thresholding algorithm [33], such that Z_1^* = U \varphi_{1/\tau}(S) V^T, where [U, S, V^T] = \mathrm{svd}(Z_2^* + Y_2 / \tau). □

Lemma 2. Given Z_1^* and E^*, Z_2^* can be obtained from Z_2^* = (X^T X + I)^{-1} (X^T X - X^T E^* + Z_1^* + (1/\tau)(X^T Y_1 - Y_2)).

Proof. If Z_1^* and E^* are given, the minimization of L(Z_1, Z_2, E) is a quadratic form with respect to Z_2. Its global minimum can be obtained by least squares. Thus, this lemma holds. □

Lemma 3. Given Z_1^* and Z_2^*, E^* can be obtained from E^* = \varphi_{\lambda/\tau}(X - XZ_2^* + (1/\tau) Y_1).

Proof. If Z_1^* and Z_2^* are given, minimizing L(Z_1, Z_2, E) is equivalent to solving the following problem:

\min_{E} \; \frac{\lambda}{\tau} \|E\|_1 + \frac{1}{2} \|E - (X - XZ_2^* + Y_1 / \tau)\|_F^2, \qquad (7)

which is derived by completing the square. This is a convex optimization problem with respect to E and its global minimum can be obtained by the soft-thresholding algorithm [34], such that E^* = \varphi_{\lambda/\tau}(X - XZ_2^* + (1/\tau) Y_1). □

The iterative algorithm is depicted in Algorithm 1. It stops when either of the following criteria is reached: (1) the difference of L(Z_1, Z_2, E) between two iterations is very small, or (2) the number of iterations exceeds a predefined value.

Algorithm 1. Evaluating representations.

Input: data matrix X, parameters \lambda and \tau.
Output: the representation matrix Z of X.
1. Initialize k = 0, \rho > 1, Z_1^{(0)} = Z_2^{(0)} = E^{(0)} = 0, and Y_1^{(0)} = Y_2^{(0)} = 0.
2. while true do
3.   [U, S, V^T] = svd(Z_2^{(k)} + Y_2^{(k)} / \tau)
4.   Obtain Z_1^{(k+1)} = U \varphi_{1/\tau}(S) V^T
5.   Obtain Z_2^{(k+1)} = (X^T X + I)^{-1} (X^T X - X^T E^{(k)} + Z_1^{(k)} + (1/\tau)(X^T Y_1^{(k)} - Y_2^{(k)}))
6.   Obtain E^{(k+1)} = \varphi_{\lambda/\tau}(X - XZ_2^{(k)} + (1/\tau) Y_1^{(k)})
7.   if converged or terminated then
8.     break
9.   Update Y_1^{(k+1)} = Y_1^{(k)} + \tau (X - XZ_2^{(k)} - E^{(k)})
10.  Update Y_2^{(k+1)} = Y_2^{(k)} + \tau (Z_2^{(k)} - Z_1^{(k)})
11.  Update \tau = \rho \tau, k = k + 1
12. return Z_1 and Z_2.
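As a concrete reference, the following numpy sketch implements the ALM iterations of Algorithm 1 under the lemmas above. It is an illustrative re-implementation, not the authors' released code; in particular, each step here uses the freshest available iterates (a common ADMM-style choice), and the stopping test simply monitors the change of the objective.

```python
import numpy as np

def shrink(M, sigma):
    """Element-wise shrinkage (soft-thresholding) operator of Eq. (5)."""
    return np.sign(M) * np.maximum(np.abs(M) - sigma, 0.0)

def evaluate_representation(X, lam=1.0, tau=30.0, rho=1.1, max_iter=200, tol=1e-6):
    """ALM iterations for Eq. (2); returns the representation Z and the error E."""
    d, n = X.shape
    Z1, Z2 = np.zeros((n, n)), np.zeros((n, n))
    E, Y1, Y2 = np.zeros((d, n)), np.zeros((d, n)), np.zeros((n, n))
    XtX = X.T @ X
    prev_obj = np.inf
    for _ in range(max_iter):
        # Lemma 1: singular value thresholding of Z2 + Y2/tau
        U, S, Vt = np.linalg.svd(Z2 + Y2 / tau, full_matrices=False)
        Z1 = (U * shrink(S, 1.0 / tau)) @ Vt
        # Lemma 2: least-squares update of Z2
        Z2 = np.linalg.solve(XtX + np.eye(n),
                             XtX - X.T @ E + Z1 + (X.T @ Y1 - Y2) / tau)
        # Lemma 3: soft-thresholding update of E
        E = shrink(X - X @ Z2 + Y1 / tau, lam / tau)
        # dual (multiplier) updates and penalty increase
        Y1 += tau * (X - X @ Z2 - E)
        Y2 += tau * (Z2 - Z1)
        tau *= rho
        obj = np.linalg.norm(Z1, 'nuc') + lam * np.abs(E).sum()
        if abs(prev_obj - obj) < tol:   # criterion (1): tiny change between iterations
            break
        prev_obj = obj
    return Z2, E
```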
3.3. Discriminative performance

The representation has a naturally discriminative capability (see Fig. 1(a)), because Z describes how similar one sample is to another in the sense of low-rank reconstruction. However, such a description is unbalanced, i.e., Z_{ij} indicates the contribution of X_i to the reconstruction of X_j, but not vice versa. Thus, an alternative description is defined, which is required to be non-negative and symmetric while preserving the clustering structure.

Definition 1 (similarity matrix). Given a sample matrix X, of which each column denotes a sample, the similarity between the pairwise
samples from X is found by

A = \frac{1}{2} \left( |Z| + |Z|^T \right), \qquad (8)

where Z denotes the representation matrix of X, obtained from Eq. (2), and |\cdot| evaluates the element-wise absolute values of a matrix.

An example of the proposed representation method is shown in Fig. 2, where the observed samples are generated by embedding two clusters lying in two-dimensional subspaces into a three-dimensional space with sparse corruptions. Benefitting from the accuracy of the representations, most of the samples are recovered from the corruptions (Fig. 2(a)). Attributed to the preservation of the clustering structure, the similarity matrix A clearly exhibits the block patterns that correspond to the two clusters (Fig. 2(b)).
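In code, Definition 1 is a one-liner; the sketch below simply restates Eq. (8) for the representation matrix produced by Algorithm 1 (it is illustrative, not library code).

```python
import numpy as np

def similarity_matrix(Z):
    """Eq. (8): non-negative, symmetric similarity A = (|Z| + |Z|^T) / 2."""
    absZ = np.abs(Z)
    return 0.5 * (absZ + absZ.T)
```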
4. Self-expressive tracking

The proposed representation provides a clustering structure and leads to a naturally discriminative capability. Thus, it inspires us to describe the candidates from the perspectives of both the target and the background in tracking. An overview of the proposed tracking algorithm is shown in Fig. 3.

4.1. Adaptive self-expressive appearance model
A novel appearance model, the adaptive self-expressive (ASE) appearance model, is developed in this work. The ASE consists of a target descriptor, which describes a candidate from the perspective of the target, and a background descriptor, which describes the candidate from the perspective of the background. Thus, a candidate that is closer to its target descriptor and farther away from its background descriptor is more likely to be the target. Because the appearance of the target varies, a number of representative observed targets (target templates) need to be maintained during tracking to describe the characteristics of the target. Similarly, background patches (background templates) are also required to describe the background. In this work, eight background templates are used, which are generated in each frame by shifting the bounding box of the latest observed target by 20% of its width or height. Fig. 4 shows an illustration of these templates. Let T denote the target templates, B denote the background templates, and C denote the candidates. By setting the data matrix X = [T, B, C], the similarity matrix A is obtained from Eq. (8), which indicates the similarity between all pairs of the target templates, the background templates and the candidates. Then, the candidates are classified into one target category and eight background categories, according to their maximum similarity to the target and background templates. Because the target templates are
Fig. 2. An example on the artificial data. (a) Shows the observed samples in two clusters (blue and green circles), and the reconstructed samples (red and magenta points). (b) Shows the similarity matrix, where the darker the entry is, the smaller the value is. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
Fig. 3. Tracking flow diagram of the proposed tracker.
Fig. 4. Illustration of the target and background templates.
Fig. 5. Illustration of the ASE. A candidate is marked in the left image. The middle image patch shows its target descriptor, and the eight image patches in the right show its background descriptor.
correlated to each other, only one target category is used. However, the eight background templates are quite different from each other. Thus, eight background categories are required, corresponding to the eight background templates. Note that only the candidates belonging to the target category are considered to be the target, while the candidates belonging to the background categories are used to learn the background descriptor.

For each candidate, we use the reconstructed target template most similar to this candidate as its target descriptor. The reconstruction is obtained from the corresponding column of the matrix [T~, B~, C~] = [T, B, C]Z, where T~, B~ and C~ are the reconstructions of the target templates, the background templates and the candidates, respectively. Note that using the reconstructed target templates makes the target descriptor more stable. Furthermore, different reconstructed target templates are adaptively used for different candidates in terms of the similarity, leading to different target descriptors. The background descriptor of each candidate contains eight components learned from the eight background categories. Each component is defined as the weighted average of all the reconstructed samples in the corresponding background category. The weights are set to the corresponding similarity values between the candidate and these samples. Note that using all the reconstructed background samples aims to promote the diversity of the background descriptor, because there is only one background template in each background category. The weighted average makes the background descriptor smoother and more stable. The weights vary between different candidates, so that different background descriptors are adaptively constructed. The main steps to construct the ASE are depicted in Algorithm 2. An illustration of the ASE is shown in Fig. 5.

Fig. 6 shows an example to demonstrate that the candidates are successfully classified into the target or background categories by using the maximum similarity to the target and background templates, such that the target and background descriptors are learned from correct samples. Three (one good and two bad) representative candidates are analyzed in both general and challenging cases. It can be seen that the template most similar to the good candidate is a target template, and the template most similar to a bad candidate is a background template (see the panels on the right of the images in both sub-figures). This is attributed to the preservation of the clustering structure.
Furthermore, there is another interesting observation that may facilitate locating the target: the representation error of the good candidate is much sparser than those of the two bad candidates. This is because the bad candidates have larger translations or rotations, and the proposed representation tends to eliminate these differences in the reconstructions through the sparsity on E. Thus, the reconstructed bad candidates are pulled toward the sampling center, such that their representation errors are denser.

Algorithm 2. Constructing the ASE appearance model.
Input: the target templates T, the background templates B, the candidates C, and the representation matrix Z.
Output: the ASE of the candidates.
1. Construct the similarity matrix A using Eq. (8) with X = [T, B, C].
2. Let N_t denote the number of target templates, N_b the number of background templates, and N_l = N_t + N_b.
3. for each candidate C_i do
4.   s = argmax_{j = 1:N_l} A_{i,j}
5.   if s ∈ [1, N_t] then
6.     put C_i into the target category
7.   else
8.     put C_i into the k-th background category S_k, where k = s - N_t
9. Obtain all the reconstructed samples from [T~, B~, C~] = [T, B, C]Z.
10. for each candidate C_i do
11.   Obtain the target descriptor from t = T~_l, where l = argmax_{j = 1:N_t} A_{i,j}.
12.   Obtain the k-th component of the background descriptor from b^{(k)} = D^{(k)} w, where D^{(k)} denotes all the reconstructed samples in the k-th background category S_k, and w is the weight vector computed by w_l = A_{i,j} for j ∈ S_k, l = 1, 2, ....
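The numpy sketch below mirrors Algorithm 2 for illustration only; it is not the authors' implementation, and details such as the empty-category fallback and the normalization of the weighted average are assumptions made here to keep the example self-contained.

```python
import numpy as np

def build_ase(T, B, C, Z):
    """Illustrative sketch of Algorithm 2 (not the authors' code).

    T: d x Nt target templates, B: d x Nb background templates,
    C: d x Nc candidates, Z: representation of X = [T, B, C] from Algorithm 1.
    Returns each candidate's category label and its (target, background) descriptors.
    """
    Nt, Nb, Nc = T.shape[1], B.shape[1], C.shape[1]
    X = np.hstack([T, B, C])
    A = 0.5 * (np.abs(Z) + np.abs(Z).T)           # similarity matrix, Eq. (8)
    Xrec = X @ Z                                   # reconstructions [T~, B~, C~]
    Trec, Brec = Xrec[:, :Nt], Xrec[:, Nt:Nt + Nb]

    cand_sims = A[Nt + Nb:, :Nt + Nb]              # candidate-to-template similarities
    labels = cand_sims.argmax(axis=1)              # < Nt: target category, else background k = label - Nt

    descriptors = []
    for i in range(Nc):
        # target descriptor: the most similar reconstructed target template
        t = Trec[:, cand_sims[i, :Nt].argmax()]
        # one background component per background category
        bg = []
        for k in range(Nb):
            members = np.where(labels == Nt + k)[0]
            if members.size == 0:                  # assumed fallback: use the reconstructed template
                bg.append(Brec[:, k])
                continue
            D = Xrec[:, Nt + Nb + members]         # reconstructed samples of category k
            w = A[Nt + Nb + i, Nt + Nb + members]  # weights: similarity to candidate i
            bg.append(D @ (w / (w.sum() + 1e-12)))  # weighted average (normalized here)
        descriptors.append((t, np.stack(bg, axis=1)))
    return labels, descriptors
```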
Fig. 6. Example of the similarity. In each sub-figure, one good candidate (in red) and two bad candidates (in blue and green) are marked in the image. Their representation errors are shown at the three corners of the image in the respective colors. On the right of the image, the three panels in the respective colors plot the similarity values between the three candidates and the templates, respectively, where the horizontal axis denotes the index of the templates (target in blue and background in red), and the vertical axis denotes the similarity values obtained from the similarity matrix defined in Eq. (8). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
4.2. Update strategy

During tracking, the target and background templates need to be updated to maintain the ASE. The background templates B are updated in each frame. We refer to the method in [35] to update the target templates T. This method evaluates the correlation between the latest observed target and the target templates. If the correlation is less than a predefined threshold, denoted by θ here, the target template with the smallest correlation will be replaced by the latest observed target. Thus, the parameter θ controls the update rate. However, note that θ in [35] is invariant during tracking, such that the robustness of the appearance model is limited to some extent. In this work, we use multiple groups of target and background templates to construct different ASEs. These ASEs have different update rates, i.e., different θ values, to capture both slight and significant changes of the target. Each ASE is updated independently using the corresponding θ. During tracking, we switch between the different ASEs to locate the target, leading to an adaptive update rate. The main steps of the update algorithm are summarized in lines 13 to 17 of Algorithm 3.
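For illustration, here is one plausible reading of the multi-rate update as a numpy sketch; the exact correlation test of [35], the use of degrees, and the replacement rule below are assumptions, not the authors' routine.

```python
import numpy as np

def update_target_templates(groups, thetas, new_target):
    """Update every template group with its own threshold theta_k.

    groups: list of d x Nt matrices of l2-normalized target templates,
    thetas: per-group thresholds in degrees (e.g., 15 and 30 as in Section 5.1),
    new_target: the latest observed target, as a d-vector.
    """
    y = new_target / (np.linalg.norm(new_target) + 1e-12)
    for T, theta in zip(groups, thetas):
        cosines = np.clip(T.T @ y, -1.0, 1.0)
        angles = np.degrees(np.arccos(cosines))   # angle between target and each template
        if angles.min() > theta:                  # even the best match is far: appearance changed
            T[:, angles.argmax()] = y             # replace the least correlated template
    return groups
```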
4.3. Target location

Let s_t denote the motion state variable of a candidate at time t. Given all the previously observed targets y_{1:t-1} = {y_1, y_2, ..., y_{t-1}} up to time t-1, the posterior distribution of s_t is recursively predicted by

p(s_t | y_{1:t-1}) = \int p(s_t | s_{t-1}) \, p(s_{t-1} | y_{1:t-1}) \, ds_{t-1}, \qquad (9)

where p(s_t | s_{t-1}) denotes the motion state transition that is used to sample the candidates. At time t, a candidate c is observed. The distribution of the state variable s_t is updated by

p(s_t | c, y_{1:t-1}) = \frac{p(c | s_t) \, p(s_t | y_{1:t-1})}{p(c | y_{1:t-1})}, \qquad (10)

where p(c | s_t) denotes the observation likelihood. The target at time t, denoted by y_t, is found by

y_t = \arg\max_{c \in C} \, p(s_t | c, y_{1:t-1}), \qquad (11)

where C denotes the set of all the candidates at time t.

4.3.1. Motion state transition

In this work, the motion state variable of a candidate is defined as s = (x, y, δ, φ), where x and y denote the 2D position, δ is the scaling coefficient, and φ is the rotation parameter. The motion state transition is formulated as a Gaussian distribution, i.e., p(s_t | s_{t-1}) ~ N(s_{t-1}, Σ), where Σ = diag(v_x, v_y, v_δ, v_φ) is a diagonal matrix whose diagonal elements denote the variances of x, y, δ and φ, respectively.

4.3.2. Observation likelihood

A good candidate is expected to be close to its target descriptor and far away from its background descriptor. Additionally, as shown in Fig. 6, a good candidate is also expected to have a sparser representation error. Thus, the observation likelihood of a candidate is defined by

p(c | s_t) = \min_{l = 1, 2, \ldots, N_\theta} L_l, \qquad (12)

where N_θ denotes the number of the template groups, and L_l denotes the likelihood value computed by using the l-th group, which is found by

L_l \propto \exp\left\{ -\frac{1}{\sigma^2} \left( \| \tilde{c} - t \|_2^2 - \rho \min_k \| \tilde{c} - b_k \|_2^2 + \beta \sum_j \mathbb{1}(e_j \neq 0) \right) \right\}, \qquad (13)

where c~ denotes the reconstruction of the candidate c, t denotes the target descriptor, b_k denotes the k-th component of the background descriptor, e_j denotes the j-th element of the representation error e (so the last term counts its nonzero entries), ρ and β are weight parameters, and σ controls the scale of the exponential function.
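The two ingredients above translate directly into code. The sketch below is illustrative only, with the sign convention of Eq. (13) as reconstructed here and the parameter defaults taken from Section 5.1; it samples candidate states from the Gaussian transition and evaluates the group likelihoods that Eq. (12) combines.

```python
import numpy as np

def sample_states(prev_state, n, variances, rng):
    """Draw n candidate states s = (x, y, delta, phi) from p(s_t | s_{t-1}) = N(s_{t-1}, Sigma)."""
    return prev_state + rng.normal(scale=np.sqrt(variances), size=(n, 4))

def group_likelihood(c_rec, t, bg_components, e, rho=0.01, beta=0.001, sigma=0.01):
    """L_l of Eq. (13) for one template group: favor candidates close to the target
    descriptor t, far from every background component, and with a sparse error e."""
    d_target = np.sum((c_rec - t) ** 2)
    d_background = min(np.sum((c_rec - b) ** 2) for b in bg_components.T)
    nonzeros = np.count_nonzero(e)
    return np.exp(-(d_target - rho * d_background + beta * nonzeros) / sigma ** 2)

def observation_likelihood(group_likelihoods):
    """Eq. (12): the final likelihood is the minimum over the N_theta groups."""
    return min(group_likelihoods)
```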
The main steps of the proposed tracking algorithm are depicted in Algorithm 3.

Algorithm 3. Self-expressive tracking.

Input: the N frame images {I_t}_{t=1}^{N}, and the target location in the first frame.
Output: the target locations from the second to the last frame.
1. Initialize the N_θ groups of target templates {T^{(k)}}_{k=1}^{N_θ} and the background templates.
2. for t = 2 : N do
3.   Generate the candidates.
4.   %% Perform the self-expressive scheme %%
5.   Obtain the N_θ groups of representations of the candidates using Algorithm 1.
6.   Construct the N_θ groups of ASEs for the candidates using Algorithm 2.
7.   %% Locate the target %%
8.   for each candidate do
9.     for each ASE_k in the k-th group do
10.      Evaluate L_k using Eq. (13).
11.    Determine the likelihood using Eq. (12).
12.  Locate the t-th target obj_t by using Eq. (11).
13.  %% Update the ASEs %%
14.  for each ASE_k do
15.    if max_j arccos⟨obj_t, T_j^{(k)}⟩ > θ_k then
16.      Update T^{(k)} by the method in [35].
17.    Update the background templates.
18. return all the locations obj_{1:N}.
Note that the minimization over L_l in Eq. (12) indicates that all the ASEs are expected to consistently assign large likelihood values to a good candidate. More importantly, it switches the ASEs among the N_θ likelihood values during tracking, leading to an adaptive update rate. In Eq. (13), the minimization in the second term ensures that a good candidate should be far away from all the background samples, and the representation error involved in the third term makes the likelihood more robust (see Fig. 6).
5. Experiments

5.1. Implementation details

The proposed tracking algorithm, SET, is implemented in Matlab on a PC with an Intel Core 2.8 GHz processor. Its average running speed is about 5 frames per second. The color frame images of the experimental video sequences are converted to gray scale and the pixel values are normalized to [0, 1]. Ten target templates and eight background templates are maintained. In each frame, 200 candidates are generated. The candidates, as well as
the target and background templates, are normalized to a size of 20 × 20 pixels and stacked into column vectors. In Eq. (4), the parameters are set to λ = 1 and τ = 30. In Eq. (13), the parameters are set to σ = 0.01, ρ = 0.01, and β = 0.001. Two template groups are used, i.e., N_θ = 2, θ_1 = 15, and θ_2 = 30. The source code will be available on our project website.
5.2. Data description and competing trackers

The experimental results on twenty video sequences are reported in this work. The target locations in these video sequences are manually labeled and used as the ground truth. All the video sequences can be downloaded from our project website. The proposed tracker is compared, in qualitative and quantitative experiments, against seventeen other state-of-the-art trackers, including Frag [6], OAB [22], IVT [1], SOAB [23], L1 [7], VTD [2], MIL [24], Struck [26], ASLAS [19], SCM [27], 2DPCA [3], CT [13], LRST [16], TLD [36], MTT [37], SRPCA [4], and LSST [20]. Because no source code of LRST is publicly available, we implemented it ourselves following the paper [16]. The other sixteen competing trackers are publicly provided by their authors. The parameters of the competing trackers are tuned carefully to obtain their best performance.
Fig. 7. Tracking results on the representative frames of the video sequences.
Fig. 8. Tracking results on the representative frames of the video sequences.
5.3. Qualitative results

Due to space limitations, only the tracking results on representative frames of ten video sequences are reported, as shown in Figs. 7 and 8. Complete quantitative evaluations for all the video sequences are given in Tables 1 and 2 and Fig. 12. As shown in Fig. 7(a), a car is moving quickly on a dark road. In this scene, the challenges are illumination changes and cluttered background. MTT, Struck, IVT, SCM and SET achieve good tracking results, while the other trackers obtain less accurate results or fail in tracking this car. Fig. 7(b) shows several persons walking in a corridor, where one person is occluded by the others. Only LSST and SET successfully track the person; in contrast, the other trackers drift away. In Fig. 7(c), a person is walking in a room. This scene includes the difficulties of illumination changes and out-of-plane rotations. SRPCA and SET obtain good tracking results, while the other trackers produce inaccurate results or drift away. Fig. 7(d) shows the results on the video sequence containing occlusions and in-plane rotations. LSST, ASLAS, Struck, SCM, and SET achieve accurate tracking results. As shown in Fig. 7(e), a football player is running in the field. This scene includes the challenges of out-of-plane rotations, occlusions and similar object distractions. Only SCM, CT and SET succeed in tracking this player, while the other trackers obtain inaccurate results or drift away.
Fig. 8(a) shows the results on the video sequence with difficulties in head pose changes, out-of-plane rotations and occlusions. MTT, Struck and SET achieve the best tracking results. Frag and OAB also obtain accurate tracking results. In contrast, the other trackers fail in tracking the target. Fig. 8(b) shows the results on the video sequence with the difficulty of motion blurring. 2DPCA, SRPCA, TLD, LSST and SET achieve accurate tracking results, while the other trackers fail in tracking the target. As shown in Fig. 8(c), a woman is singing on a stage where the illumination changes drastically. SRPCA, LSST and SET obtain accurate tracking results. ASLAS and SCM also successfully track the singer. In contrast, the other trackers obtain inaccurate results or drift away. Fig. 8(d) shows the results on the video sequence with difficulties in body pose changes and occlusions. Frag, OAB, 2DPCA, LSST and SET successfully track the target, while the other trackers drift away. As shown in Fig. 8(e), a person is walking along a pavement. In this scene, the challenges are body pose changes, occlusions, similar object distractions, and cluttered background. Only OAB and SET achieve accurate results, while the other trackers fail in tracking this person.

It can be seen that SET performs well in various challenging situations. This is attributed to the discriminative capability of the self-expressive scheme and the robustness of the observation likelihood. To verify this point, Figs. 9–11 show illustrations of the ASE in challenging cases. In each figure, a good (solid lines) and
Table 1
Average tracking location errors (in pixel) obtained by the eighteen trackers over the twenty video sequences (the best results were shown in bold-face in the original typeset table). Values are listed per tracker in the sequence order: animal, car4, car11, caviar1, caviar2, caviar3, davidindoor, davidoutdoor, dollar, faceocc2, football, girl, jumping, oneleaveshop, singer1, stone, sylv, thusl, thusy, twinings.

SET:    6.1, 2.0, 2.0, 1.0, 1.8, 2.5, 6.0, 4.9, 5.3, 5.1, 5.3, 4.4, 3.4, 1.8, 2.2, 2.1, 6.1, 10.1, 3.9, 5.4
Frag:   87.8, 83.2, 62.2, 5.4, 6.0, 19.2, 71.6, 68.9, 69.1, 16.4, 138.2, 6.8, 21.3, 58.8, 38.5, 66.3, 11.9, 9.1, 69.0, 12.5
OAB:    20.6, 89.5, 78.8, 107.2, 68.4, 66.9, 48.9, 75.0, 11.3, 14.1, 162.4, 7.1, 45.8, 42.6, 10.9, 92.0, 15.1, 11.8, 5.0, 64.0
IVT:    130.2, 2.6, 3.0, 36.2, 65.1, 66.3, 10.7, 350.1, 15.8, 10.1, 16.2, 23.2, 45.3, 17.6, 10.0, 26.5, 71.3, 272.5, 275.7, 13.4
SOAB:   74.1, 64.3, 3.5, 3.3, 9.7, 24.0, 45.0, 118.2, 62.9, 19.7, 36.3, 26.6, 59.5, 30.9, 85.1, 21.9, 20.7, 159.1, 24.8, 85.3
L1:     66.3, 88.1, 29.3, 47.6, 3.0, 18.4, 67.9, 183.0, 34.6, 34.5, 39.2, 9.5, 25.8, 11.6, 60.1, 6.5, 52.5, 244.7, 18.9, 22.1
VTD:    131.9, 3.6, 92.4, 1.5, 5.7, 83.1, 21.1, 62.3, 18.7, 10.9, 118.7, 9.0, 82.7, 79.8, 26.2, 20.5, 8.8, 171.5, 219.7, 3.8
TLD:    –, 6.3, 33.0, 9.1, –, 31.7, 5.0, 71.1, 68.3, 38.2, 15.5, 11.9, 3.7, 50.1, 10.7, 2.3, 6.5, 90.9, 198.3, 38.6
MIL:    43.0, 72.3, 67.2, 45.0, 73.0, 106.0, 17.8, 106.2, 17.8, 15.3, 13.8, 14.9, 12.5, 55.2, 18.7, 30.8, 17.5, 169.1, 221.1, 8.7
Struck: 11.5, 2.5, 2.3, 3.0, 10.0, 57.3, 6.2, 79.1, 14.4, 5.3, 14.8, 4.4, 6.0, 11.1, 14.0, 1.9, 9.6, 160.9, 66.8, 7.4
ASLAS:  76.6, 8.1, 4.6, 1.4, 58.5, 61.5, 37.4, 86.2, –, 4.8, 9.6, 5.2, 53.6, 21.0, 3.8, 1.9, 42.9, 11.1, 324.8, –
SCM:    –, 3.2, 1.9, 1.1, 2.0, 62.1, 6.9, 70.7, 6.1, 4.5, 6.0, 9.1, –, 2.9, 3.3, 1.6, 23.5, 265.6, 181.1, 45.7
MTT:    11.7, 13.3, 1.8, 57.2, 2.4, 66.8, 69.9, 104.1, 6.1, 9.7, 13.8, 3.8, 68.0, 2.7, 11.8, 1.8, 17.3, 287.0, 326.1, 11.1
2DPCA:  61.3, 2.6, 2.8, 1.1, 65.1, 26.8, 17.7, 52.2, 64.8, 11.2, 167.2, 36.0, 4.1, 6.6, 4.9, 2.8, 29.8, 12.0, 209.9, 52.2
CT:     8.7, 70.3, 19.5, 13.1, 71.0, 50.5, 15.4, 20.2, 10.7, 11.6, 7.9, 16.9, 9.3, 56.4, 17.3, 31.3, 10.9, 61.2, 223.1, 9.6
LRST:   –, 93.6, 4.1, 49.9, 91.8, 9.9, 9.7, 66.1, 69.5, 31.7, 21.6, 18.4, –, 5.5, 16.2, 48.7, 13.9, 63.2, 45.8, 25.4
SRPCA:  8.6, 55.4, 36.5, 1.6, 2.8, 112.4, 4.9, 5.1, 84.4, 8.3, 50.6, 13.7, 4.1, –, 2.8, 2.5, 200.8, –, 136.3, –
LSST:   10.0, 2.7, 3.4, 1.3, 1.6, 3.4, 12.4, 7.6, 71.9, 3.0, 24.6, 24.1, 4.3, 63.0, 2.4, 41.5, 52.5, 14.0, 218.4, 12.3
Table 2
Success rates obtained by the eighteen trackers over the twenty video sequences (the best results were shown in bold-face in the original typeset table). Values are listed per tracker in the sequence order: animal, car4, car11, caviar1, caviar2, caviar3, davidindoor, davidoutdoor, dollar, faceocc2, football, girl, jumping, oneleaveshop, singer1, stone, sylv, thusl, thusy, twinings.

SET:    0.92, 1.00, 0.99, 1.00, 1.00, 1.00, 0.96, 0.99, 1.00, 1.00, 0.96, 0.89, 0.98, 1.00, 1.00, 1.00, 0.94, 0.92, 0.98, 1.00
Frag:   0.07, 0.27, 0.06, 0.94, 0.54, 0.26, 0.06, 0.56, 0.39, 0.62, 0.30, 0.71, 0.30, 0.30, 0.25, 0.27, 0.77, 0.98, 0.62, 0.73
OAB:    0.77, 0.27, 0.08, 0.30, 0.40, 0.14, 0.22, 0.34, 0.82, 0.82, 0.28, 0.83, 0.04, 0.35, 0.25, 0.15, 0.57, 0.90, 0.93, 0.51
IVT:    0.03, 1.00, 0.99, 0.26, 0.42, 0.16, 0.51, 0.03, 1.00, 0.62, 0.71, 0.21, 0.12, 0.42, 0.68, 0.47, 0.45, 0.16, 0.07, 0.38
SOAB:   0.37, 0.24, 0.86, 0.98, 0.48, 0.25, 0.16, 0.18, 0.39, 0.68, 0.22, 0.32, 0.06, 0.35, 0.22, 0.54, 0.57, 0.20, 0.64, 0.22
L1:     0.11, 0.24, 0.63, 0.30, 0.99, 0.23, 0.25, 0.08, 0.55, 0.47, 0.21, 0.65, 0.24, 0.52, 0.23, 0.81, 0.24, 0.21, 0.62, 0.64
VTD:    0.14, 0.94, 0.10, 0.98, 0.68, 0.13, 0.14, 0.58, 0.47, 0.61, 0.32, 0.85, 0.08, 0.12, 0.25, 0.17, 0.70, 0.44, 0.33, 0.72
TLD:    –, 0.84, 0.25, 0.86, –, 0.17, 0.45, 0.48, 0.41, 0.52, 0.73, 0.29, 0.85, 0.21, 0.77, 0.99, 0.94, 0.74, 0.64, 0.43
MIL:    0.14, 0.27, 0.06, 0.28, 0.41, 0.16, 0.23, 0.33, 0.74, 0.68, 0.63, 0.25, 0.23, 0.37, 0.25, 0.19, 0.77, 0.33, 0.33, 0.78
Struck: 0.86, 0.38, 0.99, 0.99, 0.43, 0.16, 0.30, 0.56, 1.00, 1.00, 0.78, 0.88, 0.81, 0.40, 0.25, 0.99, 0.84, 0.61, 0.62, 0.97
ASLAS:  0.28, 0.99, 0.77, 1.00, 0.43, 0.16, 0.37, 0.52, –, 0.99, 0.62, 0.30, 0.27, 0.40, 1.00, 0.93, 0.45, 0.62, 0.10, –
SCM:    –, 1.00, 0.97, 1.00, 1.00, 0.16, 0.85, 0.35, 1.00, 1.00, 0.70, 0.68, –, 1.00, 1.00, 1.00, 0.56, 0.20, 0.33, 0.40
MTT:    0.80, 0.33, 1.00, 0.30, 0.99, 0.16, 0.34, 0.34, 1.00, 0.82, 0.74, 0.93, 0.07, 0.99, 0.32, 0.92, 0.76, 0.20, 0.31, 0.62
2DPCA:  0.15, 1.00, 0.97, 1.00, 0.41, 0.19, 0.35, 0.35, 0.39, 0.73, 0.28, 0.18, 0.96, 0.56, 0.34, 0.80, 0.65, 0.90, 0.33, 0.37
CT:     0.85, 0.27, 0.56, 0.54, 0.36, 0.15, 0.49, 0.78, 1.00, 0.98, 0.77, 0.24, 0.45, 0.38, 0.25, 0.20, 0.64, 0.62, 0.24, 0.86
LRST:   –, 0.30, 0.95, 0.30, 0.38, 0.35, 0.82, 0.31, 0.36, 0.39, 0.51, 0.29, –, 0.93, 0.35, 0.12, 0.53, 0.48, 0.60, 0.50
SRPCA:  0.83, 0.31, 0.61, 1.00, 0.94, 0.08, 0.95, 0.98, 0.24, 0.80, 0.18, 0.17, 0.98, –, 1.00, 0.60, 0.04, –, 0.07, –
LSST:   0.77, 1.00, 0.99, 1.00, 1.00, 1.00, 0.57, 0.40, 0.38, 1.00, 0.46, 0.17, 0.96, 0.41, 1.00, 0.16, 0.59, 0.87, 0.33, 0.33
Fig. 9. An illustration of the ASE in the case of occlusion.
a bad (dashed box) candidate are marked in (a). The target descriptor and the background descriptor of the good (top) and the bad (bottom) candidates are shown in (b) and (c), respectively. Obviously, compared to the good candidates, the bad candidates deviate far away from their target descriptors in all the cases, leading
to small likelihood values. This is because the self-expressive scheme absorbs the misalignments of the bad candidates. Meanwhile, the components of the background descriptors also differ, due to the weighted average of the samples in each background category, leading to different likelihood values.
Fig. 10. An illustration of the ASE in the case of illumination change.
Fig. 11. An illustration of the ASE in the case of out-of-plane rotation.
Fig. 12. Quantitative evaluation by tracking location errors (in pixel) obtained by the eighteen trackers, with one panel per video sequence (animal, car4, car11, caviar1, caviar2, caviar3, davidindoor, davidoutdoor, dollar, faceocc2, football, girl, jumping, oneleaveshop, singer1, stone, sylv, thusl, thusy, twinings). In each sub-figure, the x-axis corresponds to the frame index number, and the y-axis is associated with the tracking location error.
5.4. Quantitative comparison of competing trackers

To quantitatively evaluate the performance of the proposed tracker, the tracking location error (TLE) and the success rate (SR) are used to compare against the state-of-the-art methods. The tracking result in a frame is considered successful if area(R_T ∩ R_G) / area(R_T ∪ R_G) > 0.5, where R_T and R_G denote the tracking and the ground truth bounding boxes, respectively. Fig. 12 plots the TLEs obtained by the eighteen trackers on the twenty experimental video sequences.
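The success test above is the standard bounding-box overlap criterion; a small self-contained helper (hypothetical, using (x, y, w, h) boxes) is given below.

```python
def is_success(box_t, box_g, thresh=0.5):
    """area(R_T ∩ R_G) / area(R_T ∪ R_G) > thresh for axis-aligned (x, y, w, h) boxes."""
    xt, yt, wt, ht = box_t
    xg, yg, wg, hg = box_g
    iw = max(0.0, min(xt + wt, xg + wg) - max(xt, xg))   # intersection width
    ih = max(0.0, min(yt + ht, yg + hg) - max(yt, yg))   # intersection height
    inter = iw * ih
    union = wt * ht + wg * hg - inter
    return inter / union > thresh
```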
Fig. 13. Verification of the interrelationship. (a) Shows the experimental image, where the target is marked by the solid box. The confidence maps are plotted in (b)–(d) to show the likelihood obtained from Eq. (12) by using the self-expressive scheme, templates [16] and random dictionary, respectively. Each element of the confidence maps corresponds to a candidate in the search area within the dashed box in (a). The maximum is expected to appear in the center of each confidence map.
Fig. 14. Comparisons of the tracking location errors (in pixel) obtained by the three appearance models on the four representative video sequences.
Table 3
Average tracking location errors (in pixel) and success rates of the three proposed trackers over the representative video sequences. Values are listed per tracker in the sequence order: car11, caviar3, dollar, jumping, singer1, sylv.

Tracking location error:
SET-15: 1.9, 60.2, 3.3, 5.8, 2.4, 6.0
SET-30: 2.3, 2.8, 11.2, 3.7, 2.6, 15.2
SET:    2.0, 2.5, 5.3, 3.4, 2.2, 6.1

Success rate:
SET-15: 0.99, 0.16, 1.00, 0.86, 1.00, 0.85
SET-30: 0.99, 0.98, 0.83, 0.96, 1.00, 0.81
SET:    0.99, 1.00, 1.00, 0.98, 1.00, 0.94
The average TLEs and the SRs for each video sequence are also reported in Tables 1 and 2, respectively. SET achieves the largest SRs on seventeen of the twenty video sequences, and on the remaining three it obtains the second largest SRs. Overall, SET outperforms the other seventeen state-of-the-art trackers.

5.5. Verification of the interrelationship

To verify that the interrelationship exploited by the self-expressive scheme leads to better tracking results, the following comparisons are conducted. With the same configuration of the low-rank constraint, the template method (used in LRST [16], see also Section 5.4) and the random dictionary method, both of which exploit the relationship among the observed targets and the candidates, are used as competing approaches. The similarity matrix in the two competing methods is obtained by A'(i, j) = exp(-‖Z_i - Z_j‖_2^2), where Z_i denotes the i-th column of Z. Fig. 13 shows the comparison results. It can be seen that the self-expressive scheme obtains better performance on this image, where the illumination of the scenario changes drastically. This is attributed to two factors: (1) the interrelationship allows the candidates themselves to contribute to their representations, leading to a more informative and more accurate description, and (2) because all the pairwise relations are handled, the interrelationship encourages better discriminative performance for the representations when the clustering structure (mixture of several subspaces) is involved, compared to the subspace structure of the two competing methods.
5.6. Analysis of the appearance model

The ASE has two important characteristics: the self-expressive scheme (representations) and adaptive model learning (target and background descriptors). Here, their contributions to the final tracking results are analyzed. Thus, another two appearance models are designed for the evaluation: one using only the self-expressive scheme, denoted as SES, and one using only adaptive model learning, denoted as AML. Instead of using adaptive model learning, SES uses the mean of the target samples as the target model and the eight background templates as the eight components of the background model. AML uses the similarity matrix computed by A''(i, j) = exp(-‖X_i - X_j‖_2^2) instead of that of the self-expressive scheme. In Fig. 14, the tracking location errors obtained by the three appearance models on the four representative video sequences are plotted. It can be seen that (1) adaptive model learning is crucial to the ASE, without which an incorrect appearance model is obtained, and (2) the self-expressive scheme promotes the accuracy and the robustness of the ASE.

5.7. Effectiveness of the update strategy

To verify the effectiveness of the update strategy of SET, another two trackers are also implemented, corresponding to trackers with high and low update rates, respectively: one with N_θ = 1 and θ_1 = 15, denoted as SET-15, and the other with N_θ = 1 and θ_1 = 30, denoted as SET-30. The two trackers use the likelihood
defined in Eq. (13) and set the other parameters the same as SET. The quantitative performance of the three proposed trackers is shown in Table 3. It can be seen that the three trackers are all insensitive to illumination variations (car11, singer1) and motion blurring (jumping). SET-15 is not good at handling occlusions (caviar3), because it updates the ASEs so frequently that distracting objects are acquired. SET-30 cannot handle similar object distractions (dollar) and out-of-plane rotations (sylv) very well, because the update ignores some small changes of the target and the cumulative errors lead to incorrect ASEs. Overall, benefitting from the update strategy, SET performs better on the six representative video sequences.
6. Conclusion

In this paper, a tracking algorithm based on the self-expressive scheme has been proposed, which successfully exploits the interrelationship and the clustering structure among the observed targets and the candidates. A novel appearance model has been proposed to represent the target, by which tracking drift is alleviated in various challenging situations. A large number of experiments have been conducted to verify the robustness and effectiveness of the proposed tracking algorithm. The experimental results have demonstrated that (1) the self-expressive formulation leads to a robust tracker, (2) the proposed appearance model improves the performance of the proposed tracker due to its discriminative capability, and (3) the proposed update strategy successfully captures the up-to-date appearance changes of the target. Both the qualitative and quantitative evaluations on various challenging video sequences have demonstrated that the proposed tracking algorithm outperforms the other seventeen state-of-the-art methods.
Conflict of interest There is no conflict of interest in this paper.
Acknowledgment This work is supported by the National Natural Science Foundation of China (NSFC) with Grants #61172125 and #61132007.
Appendix A. Supplementary data Supplementary data associated with this paper can be found in the online version at http://dx.doi.org/10.1016/j.patcog.2015.03.007. References [1] D.A. Ross, J. Lim, R.-S. Lin, M.-H. Yang, Incremental learning for robust visual tracking, Int. J. Comput. Vis. 77 (1–3) (2007) 125–141. [2] J. Kwon, K. Lee, Visual tracking decomposition, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1269–1276. [3] D. Wang, H. Lu, Object tracking via 2DPCA and L1-regularization, IEEE Signal Process. Lett. 19 (11) (2012) 711–714. [4] D. Wang, H. Lu, M.-H. Yang, Online object tracking with sparse prototypes, IEEE Trans. Image Process. 22 (1) (2013) 314–325, URL 〈http://ieeexplore.ieee.org/ xpls/abs_all.jsp?arnumber=6212358〉. [5] D. Comaniciu, S. Member, V. Ramesh, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. 25 (5) (2003) 564–577.
[6] A. Adam, E. Rivlin, I. Shimshoni, Robust fragments-based tracking using the integral histogram, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006, pp. 798–805. [7] X. Mei, H. Ling, Robust visual tracking using L1 minimization, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1436–1443. [8] X. Mei, H. Ling, Y. Wu, Minimum error bounded efficient l1 tracker with occlusion detection, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2011, pp. 1257–1264. [9] B. Liu, J. Huang, L. Yang, C. Kulikowsk, Robust tracking using local sparse appearance model and K-selection, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2011, pp. 1313–1320. [10] T. Zhang, B. Ghanem, S. Liu, Robust visual tracking via multi-task sparse learning, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2042–2049. [11] S. Wang, H. Lu, F. Yang, M. Yang, Superpixel tracking, in: IEEE International Conference on Computer Vision, 2011, pp. 1323–1330. [12] H. Li, C. Shen, Q. Shi, Real-time visual tracking using compressive sensing, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1305–1312. [13] K. Zhang, L. Zhang, M.-H. Yang, Real-time compressive tracking, in: European Conference on Computer Vision (ECCV), 2012, pp. 866–879. [14] M. Isard, CONDENSATION—conditional density propagation for visual tracking, Int. J. Comput. Vis. 29 (1) (1998) 5–28. [15] H. Grabner, H. Bischof, On-line boosting and vision, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 260–267. [16] T. Zhang, B. Ghanem, S. Liu, N. Ahuja, Low-rank sparse learning for robust visual tracking, in: European Conference on Computer Vision (ECCV), 2012, pp. 470–484. [17] X. Li, A. Dick, C. Shen, A. van den Hengel, H. Wang, Incremental learning of 3ddct compact representations for robust visual tracking, IEEE Trans. Pattern Anal. Mach. Intell. 35 (4) (2013) 863–881. [18] X. Lu, Y. Yuan, P. Yan, Robust visual tracking with discriminative sparse learning, Pattern Recognit. 46 (7) (2013) 1762–1771. [19] X. Jia, H. Lu, M.-H. Yang, Visual tracking via adaptive structural local sparse appearance model, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2012, pp. 1822–1829. [20] D. Wang, H. Lu, M.-H. Yang, Least soft-thresold squares tracking, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2371–2378. [21] T. Bai, Y. Li, Robust visual tracking with structured sparse representation appearance model, Pattern Recognit. 45 (6) (2012) 2390–2404. [22] H. Grabner, M. Grabner, H. Bischof, Real-time tracking via on-line boosting, in: British Machine Vision Conference, 2006, pp. 6.1–6.10. [23] H. Grabner, C. Leistner, H. Bischof, Semi-supervised on-line boosting for robust tracking, in: European Conference on Computer Vision (ECCV), 2008, pp. 234– 247. [24] B. Babenko, S. Member, M.-H. Yang, S. Member, Robust object tracking with online multiple instance learning, IEEE Trans. Pattern Anal. Mach. Intell. 33 (8) (2011) 1619–1632. [25] Z. Kalal, J. Matas, K. Mikolajczyk, P-N learning: bootstrapping binary classifiers by structural constraints, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 49–56. [26] S. Hare, A. Saffari, P. 
Torr, Struck: structured output tracking with kernels, in: IEEE International Conference on Computer Vision (ICCV), 2011, pp. 263–270. [27] W. Zhong, H. Lu, M.-H. Yang, Robust object tracking via sparsity-based collaborative model, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2012, pp. 1838–1845. [28] A. Yilmaz, O. Javed, M. Shah, Object tracking, ACM Comput. Surv. 38 (4) (2006) 13–57. [29] Y. Wu, J. Lim, M.-H. Yang, Online object tracking: a benchmark, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2411–2418. [30] S. Zhang, H. Yao, X. Sun, X. Lu, Sparse coding based visual tracking: review and experimental comparison, Pattern Recognit. 46 (7) (2013) 1772–1788. [31] E. Elhamifar, R. Vidal, Sparse subspace clustering, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 2790–2797. [32] J. Wright, Y. Ma, J. Mairal, G. Sapiro, Sparse representation for computer vision and pattern recognition, Proc. IEEE 98 (6) (2010) 1031–1044. [33] J. Cai, E. Candès, Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM J. Optim. 20 (4) (2010) 1956–1982. [34] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci. 2 (1) (2009) 183–202. [35] X. Mei, H. Ling, Robust visual tracking and vehicle classification via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 33 (11) (2011) 2259–2272. [36] Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, IEEE Trans. Pattern Anal. Mach. Intell. 34 (7) (2012) 1409–1422. [37] T. Zhang, B. Ghanem, S. Liu, N. Ahuja, Robust visual tracking via structured multi-task sparse learning, Int. J. Comput. Vis. 101 (2013) 367–383.
Yao Sui received his B.S. degree in Physics from China Agricultural University, Beijing, China, in 2004, and his M.S. degree in Software Engineering from Peking University, Beijing, China, in 2007, respectively. He is currently working toward his Ph.D. degree in the Department of Electronic Engineering, Tsinghua University, Beijing, China. His research interests include image processing, computer vision, machine learning, and pattern recognition.
Xiaolin Zhao received his Ph.D. degree in the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2011. He is currently working in Aeronautics and Astronautics Engineering College, Air Force Engineering University, Xi'an, China. His interests include computer vision, image/video processing and pattern recognition.
Shunli Zhang received his B.S. and M.S. degrees from Shandong University, Jinan, China, in 2008 and 2011, respectively. He is currently working toward the Ph.D. degree in the Department of Electronic Engineering, Tsinghua University, Beijing, China. His research interests include image processing, pattern recognition and computer vision.
Xin Yu received his B.S. degree in Electronic Engineering from University of Electronic Science and Technology of China, Chengdu, China, in 2009. He is currently pursuing the Ph.D. degree in the Department of Electronic Engineering, Tsinghua University, Beijing, China. His interests include computer vision and image processing.
Sicong Zhao received his B.E. degree in Electronic Engineering from Tsinghua University, Beijing, China, in 2010. He is currently pursuing the Ph.D. degree in the Department of Electronic Engineering, Tsinghua University, Beijing, China. His research interests include object tracking and background subtraction.
Li Zhang received his B.S., M.S. and Ph.D. degrees from Tsinghua University, Beijing, China, in 1987, 1992 and 2008, respectively. He is currently the professor of the Department of Electronic Engineering, Tsinghua University, Beijing, China. His research interests include image processing, computer vision, pattern recognition, and computer graphics.