Dual-scale structural local sparse appearance model for robust object tracking


PII: S0925-2312(16)31048-7
DOI: http://dx.doi.org/10.1016/j.neucom.2016.09.031
Reference: NEUCOM17560
To appear in: Neurocomputing
Received 6 January 2016; Revised 24 July 2016; Accepted 13 September 2016

Dual-scale Structural Local Sparse Appearance Model for Robust Object Tracking

Zhiqiang Zhao a,b, Ping Feng a, Tianjiang Wang a,*, Fang Liu a, Caihong Yuan a,c, Jingjuan Guo a,b, Zhijian Zhao d, Zongmin Cui b

a School of Information Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
b School of Information Science and Technology, University of Jiujiang, Jiujiang, Jiangxi 332005, China
c School of Computer and Information Engineering, Henan University, Kaifeng, Henan 475004, China
d Business School of Hunan University, Changsha, Hunan 410006, China

Abstract

Recently, sparse representation has been applied successfully to object tracking. However, existing sparse representations capture either the holistic features or the local features of the target, but not both. In this paper, we propose a dual-scale structural local sparse appearance (DSLSA) model based on overlapped patches, which captures the quasi-holistic features and the local features of the target simultaneously. We first propose two structural local sparse appearance models at different scales, both built on overlapped patches. The larger-scale model captures the structural quasi-holistic features of the target, and the smaller-scale model captures its structural local features. We then propose a new mechanism that associates these two models into a single dual-scale appearance model. Both qualitative and quantitative analyses on challenging benchmark image sequences indicate that the tracker with our DSLSA model performs favorably against several state-of-the-art trackers.

Keywords: Appearance model, visual tracking, sparse representation, dual scale

1. Introduction

Visual tracking [1] is an important problem in computer vision, with applications in visual surveillance, traffic flow monitoring, robotics, and human-computer interaction. Over the past decades, visual tracking technology has made great progress. However, it remains very challenging because of uncertain factors in the tracking process, such as changes of target shape, illumination change, varying viewpoints, background clutter, and partial occlusion. Generally, a tracker includes three main parts: (1) a motion model, which produces candidate targets for the next frame on the basis of the tracking result of the current frame; (2) an appearance model, which models the appearance of the target throughout the tracking process; (3) an observation model, which defines a similarity measure between candidate targets and the target in order to discriminate the best candidate. This paper focuses on the second part. An appearance model [2] generally includes two parts: visual representation and statistical modeling. The focus of visual representation is how to construct robust target descriptors using different visual features, while the focus of statistical modeling is to establish an effective mathematical model for the target by statistical learning.

* Corresponding author.
Email address: [email protected] (Tianjiang Wang)

Visual representations in visual tracking comprise holistic visual representations and local visual representations. The advantages of holistic visual representations are simplicity and efficiency. Raw pixel representation, histogram representation, covariance representation, and so on are classical holistic visual representations. Ross et al. [3] and Silveira and Malis [4] present raw pixel representations based on vectors. Other works [5]-[7], [48,49] present histogram representations as visual features of the target. The histogram representation can capture the holistic visual features of the target, but it is sensitive to noise and to variations in illumination, as well as to the shape of the target. A few other reports [8]-[10] present covariance representation as the visual feature of the target. The advantage of covariance representation is that it can capture the intrinsic information of the target, but it is also sensitive to noise and it loses information about shape and location. Holistic visual representation is sensitive to holistic appearance variation, while local visual representation can capture more information on the local structure of the target. Thus, local visual representation is more robust to variations in the appearance of the target. Lin et al. [10] utilize a hierarchical part-template shape model to detect and segment humans. Kwon et al. [11] use a particle-filter model based on a template to track a target. The works in [12] and [13] exploit visual features based on superpixels for visual tracking. Other papers [14]-[16] present visual features based on the scale-invariant feature transform (SIFT). The advantage of these local visual representations is that

they can capture the structural features of the target; however, a larger computational cost is needed to capture the huge number of local features. Benefiting from the robust performance of compressive sensing [17,18] in signal processing, sparse coding [19], which acts as a powerful model, has attracted increasing attention in the field of vision applications [20,21,47]. Recently, sparse coding has achieved great success in face recognition [20], which has inspired researchers to apply sparse coding to visual tracking, also with great success. Mei and Ling [21] and Mei et al. [22] apply sparse coding as a holistic feature of the target appearance model for visual tracking. Other works [23]-[25] incorporate sparse coding as a local feature of the target appearance model. Wu et al. [37] propose multi-scale pooling on weighted local sparse coding to obtain a pyramid representation of an object. Wang et al. [38] propose a novel algorithm that exploits joint optimization of representation and classification for robust tracking.

In this paper, we propose a dual-scale structural local sparse appearance (DSLSA) model, which can capture both the quasi-holistic structural features and the local structural features of the target simultaneously. The larger-scale appearance model is named the quasi-global structural local sparse appearance (QSLSA) model, whose patch scale is half of the target; it captures the quasi-holistic structural features of the target. The smaller-scale appearance model is named the small structural local sparse appearance (SSLSA) model, whose patch scale is a quarter of the target; it captures small structural local features of the target. To inherit the advantages of the QSLSA and SSLSA models, we propose a new mechanism that associates them into a new appearance model, namely our DSLSA model.

The remainder of this paper is organized as follows. In Section 2, we summarize works related to ours. In Section 3, we first review sparse coding and then detail our DSLSA model. In Section 4, we present the tracking framework of our DSLSA model. Section 5 gives the experimental results of the tracker with our DSLSA model and other trackers, and in Section 6, conclusions are given.

2. Related Works

Sparse representation, which is a main part of sparse coding, has been applied successfully in the field of visual tracking ([21-23,26,31,37-42]). In sparse representation, a signal can be represented as a linear combination of a few basis vectors. The sparse representation in [21,22] is solved through l1-norm minimization with a non-negativity restriction. The holistic sparse representation is used as the appearance model of the target; it can capture the overall information of the target but captures little local structural information. Thus, these methods are sensitive to noise and to variations of the target appearance, as well as to partial occlusion. Liu et al. [23] use histogram sparse coding to represent the basic parts of the target. This method can capture the spatial information of the target; however, the captured spatial information is not sufficient. The local sparse representations in Xie et al. [24] and Zarezade et al. [27] also act as the target appearance model. In these studies, the target appearance model is made up of sparse representations at several different scales with non-overlapping patches. The advantage of this method is that it can capture information about the target at several scales. However, dividing the target into non-overlapping patches cannot capture enough structural spatial information. Meanwhile, the four corners of the target, which might belong to the background, have the same weight as the main part of the target in visual tracking. Furthermore, more scales and a large number of patches mean higher computational cost. Jia et al. [25] propose an adaptive structural local sparse appearance (ASLSA) model, which extracts a fixed spatial layout of overlapped patches from the target to construct its appearance model. These overlapped local patches can capture more structural spatial information of the target than non-overlapping local patches.

Our work is inspired by the ASLSA model [25]. First, we propose a structural local sparse appearance model with fewer patches than the ASLSA model, namely our SSLSA model; to improve on the performance of the ASLSA model, the SSLSA model trims the patch layout of the ASLSA model. Furthermore, we propose a new structural local sparse appearance model with a larger scale, namely the QSLSA model, which can capture the structural quasi-holistic features of the target. The robustness of a single appearance model, either the QSLSA model or the SSLSA model, is not strong enough. Therefore, based on the SSLSA and QSLSA models, we propose the DSLSA model through a new associative mechanism to increase the robustness of the appearance model.

3. The Proposed Method

3.1. Overview of Sparse Coding

Let $z \in \mathbb{R}^{G}$ be a column vector, namely a vector obtained by stacking all pixel intensities of an image patch into a column. Here, $B = \{b_1, b_2, \cdots, b_n\} \in \mathbb{R}^{G \times n}$ ($G \gg n$) is a set of target templates; $n$ is the number of target templates; $b_i$ is a column vector obtained by stacking all pixel intensities of a template into a column; and $B$ is the template matrix. The sparse representation of the target $z$ can be expressed approximately as

$$z \approx Ba + \varepsilon \qquad (1)$$

where $a = (a_1, a_2, \cdots, a_n)^{T} \in \mathbb{R}^{n}$ is called the target coefficient vector, and $\varepsilon$ is an error vector whose non-negative entries indicate the pixels of $z$ that are corrupted or occluded.

In order to obtain a reasonable solution for $a$ in eq. (1), Field [28] proposes a sparse constraint on $a$, formulated as the $l_0$-norm minimization in eq. (2):

$$a^{*} = \arg\min_{a} \|a\|_0 \quad \text{s.t.} \quad \|z - Ba\|_2^2 \le \varepsilon \qquad (2)$$

where $\|\cdot\|_0$ and $\|\cdot\|_2^2$ denote the $l_0$-norm and the squared $l_2$-norm of a vector, respectively. The $l_0$-norm minimization is an NP-hard optimization problem. Thus, Mei and Ling [21] propose an $l_1$-regularized least-squares model, eq. (3), to solve the sparse problem:

$$a^{*} = \arg\min_{a} \|z - Ba\|_2^2 + \lambda \|a\|_1 \qquad (3)$$

where $\lambda$ is the parameter penalizing sparsity, and $\|\cdot\|_1$ and $\|\cdot\|_2^2$ denote the $l_1$-norm and the squared $l_2$-norm of a vector, respectively.
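To make the optimization of eq. (3) concrete, the following is a minimal sketch of a non-negative, l1-regularized least-squares solver based on proximal gradient descent (ISTA). This is not the solver used in the paper, which relies on the method of Mairal et al. [32]; the function name, step size, and iteration count below are illustrative assumptions.

```python
import numpy as np

def sparse_code(z, B, lam=0.01, n_iter=200):
    """Approximately solve  min_a ||z - B a||_2^2 + lam * ||a||_1,  a >= 0,
    via proximal gradient descent (ISTA). Illustrative sketch only."""
    G, n = B.shape
    a = np.zeros(n)
    # Lipschitz constant of the gradient of the quadratic term
    L = 2.0 * np.linalg.norm(B, 2) ** 2 + 1e-12
    for _ in range(n_iter):
        grad = 2.0 * B.T @ (B @ a - z)           # gradient of ||z - B a||^2
        a = a - grad / L                         # gradient step
        a = np.maximum(a - lam / L, 0.0)         # non-negative soft-thresholding
    return a

# Toy usage: 10 templates of a 512-dimensional patch
rng = np.random.default_rng(0)
B = rng.random((512, 10))
B /= np.linalg.norm(B, axis=0)                   # l2-normalize each template
z = B[:, 3] + 0.01 * rng.standard_normal(512)    # a noisy copy of template 3
a = sparse_code(z, B)
print(np.argmax(a))                              # expected: 3
```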

3.2. Structural Local Sparse Appearance Model

The sparse coefficient of a holistic sparse representation is obtained from the whole target, whereas the sparse coefficient of a local sparse representation is obtained from patches of the target. Given a target image, we first warp it into a new image block of 32*32 pixels. Given a known template set $T = \{t_1, t_2, \cdots, t_n\} \in \mathbb{R}^{d \times n}$, we extract a set of overlapped local image patches with a fixed spatial layout from the image block as the dictionary, namely $D = \{d_1, d_2, \cdots, d_{n \times N}\} \in \mathbb{R}^{d \times (n \times N)}$, where $d$ is the dimension of an image patch vector, $n$ is the number of templates, and $N$ is the number of local image patches. We apply $l_2$-normalization to every column vector $d_i$ of the dictionary $D$. Each $d_i$ represents a fixed patch of the target; for a patch at a given position, the dictionary $D$ collects all the information related to that patch across all templates of the target. Generally, the more templates, the greater the accuracy of the dictionary. However, a large number of templates degrades the sparse representation of the target and increases the computational cost. Thus, the number of templates in this paper is 10. For every candidate target, the structural local sparse appearance model proceeds in the following four stages.

1) Extraction and pretreatment. First, we extract local patches with the fixed spatial layout from each candidate and stack them into vectors, $Z = \{z_1, z_2, \cdots, z_N\} \in \mathbb{R}^{d \times N}$. Each $z_i$ is a column vector, which we normalize with the $l_2$-norm.

2) Sparse representation. Following formula (3) with non-negative constraints [21], we compute the sparse representation of each local patch by

$$a_i^{*} = \arg\min_{a_i} \|z_i - D a_i\|_2^2 + \lambda \|a_i\|_1 \quad \text{s.t.} \quad a_i \ge 0 \qquad (4)$$

where $z_i$ is an arbitrary local patch after vectorization, $a_i \in \mathbb{R}^{(n \times N) \times 1}$ is the sparse coefficient corresponding to $z_i$, and $a_i \ge 0$ means that all elements of $a_i$ are non-negative. The sparse coefficient $a_i$ is generated from the local patch $i$ with respect to all templates; thus, $a_i$ is a sparse representation of local patch $i$ over all templates. Together, all the sparse coefficients represent a candidate target, and they are denoted as $A = \{a_1, a_2, \cdots, a_N\}$, $a_i \in \mathbb{R}^{(n \times N) \times 1}$.

3) Accumulation and weighting. For each $a_i$, we accumulate the sparse coefficients that correspond to the same patch position using the following formula:

$$c_j^{i} = \sum_{k=0}^{N-1} e_{k \times n + j}^{i}, \quad e_{k \times n + j}^{i} \in a_i, \quad j = 1, 2, \cdots, N \qquad (5)$$

where $i$ indexes the patch, $n$ is the number of templates, and $e$ is an element of $a_i$. Therefore, the sparse representation of a candidate target can be denoted as $s_i = \{c_1^{i}, c_2^{i}, \cdots, c_N^{i}\}^{T}$ after the accumulation. Then, we normalize $c_j^{i}$ with the following formula, which we call the weighting operation:

$$c_j^{i} = \frac{c_j^{i}}{\sum_{k=1}^{N} c_k^{i}}, \quad i = 1, 2, \cdots, N \qquad (6)$$

After the accumulation and weighting stage, we obtain a new set of vectors, denoted as $S = \{s_1, s_2, \cdots, s_N\}$. Here, $s_i$ is an $N \times 1$ column vector, and all the vectors of $S$ form an $N \times N$ square matrix.

4) Alignment pooling. According to the sparsity assumption, the sparse coefficient of a local patch is large when the local patch of the candidate target aligns with the corresponding local patch of the target. Therefore, we take the diagonal elements of the matrix $S$ as the pooled feature used to measure a candidate target:

$$g = \mathrm{diag}(S) \qquad (7)$$

Because the template set $T$ includes the state of the target object and its appearance variants, this pooled feature reflects the similarity between the candidate target and the target object. Finally, we discriminate the best candidate target by maximum a posteriori (MAP) estimation.
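The accumulation, weighting, and alignment-pooling stages (eqs. (5)-(7)) reduce to a few array operations. Below is a minimal numpy sketch of stages 3 and 4; the reshape layout of the coefficient vector is an assumption, since it depends on how the dictionary columns are ordered.

```python
import numpy as np

def alignment_pooling(A, n_templates, N_patches):
    """A: list of N_patches sparse coefficient vectors, each of length
    n_templates * N_patches (one vector per local patch, from eq. (4)).
    Returns the pooled feature g of eq. (7). Sketch only; assumes that
    reshaping a coefficient vector to (n_templates, N_patches) groups the
    entries of the same patch position in one column."""
    S = np.zeros((N_patches, N_patches))
    for i, a_i in enumerate(A):
        E = a_i.reshape(n_templates, N_patches)
        c = E.sum(axis=0)                 # accumulate entries of the same patch position (eq. (5))
        c = c / (c.sum() + 1e-12)         # weighting (normalization), eq. (6)
        S[:, i] = c                       # column s_i of the square matrix S
    return np.diag(S)                     # alignment pooling, eq. (7)

# Toy usage: 10 templates, 8 patches
rng = np.random.default_rng(1)
A = [np.abs(rng.standard_normal(10 * 8)) for _ in range(8)]
g = alignment_pooling(A, n_templates=10, N_patches=8)
print(g.shape)   # (8,)
```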

3.3. DSLSA Model

In this section, we introduce our DSLSA model. We first propose two structural local sparse appearance models at different scales, both based on overlapped local patches, and then we propose an associative mechanism that combines them into a new appearance model. The reason for using two scales is twofold. First, appearance models at two scales can capture more features of the target than a single scale. Second, appearance models with more scales involve higher computational cost. Therefore, we adopt two scales to strike a balance between accuracy and speed.

One component of our DSLSA model is the QSLSA model, whose local patches are half the size of the target image. A local patch of this model captures the features of half the target; we call such features quasi-holistic features, and we call the structural local sparse appearance model built on them the QSLSA model. The other component of our DSLSA model is the SSLSA model, whose local patches are a quarter the size of the target image. Based on the QSLSA and SSLSA models, we construct our DSLSA model using a new associative mechanism. The framework of the DSLSA model is summarized in Fig. 1.

Figure 1: The framework of the tracker with our DSLSA model. The top flow (16*32 or 32*16 patches) is the QSLSA model and the bottom flow (16*16 patches) is the SSLSA model; each flow warps the candidate to 32*32, extracts overlapped patches, performs sparse coding, accumulation and weighting, and alignment pooling, and the two pooled features are combined into p(z_t | x_t).

3.3.1. QSLSA Model

We first introduce the QSLSA model, which is based on the larger scale. This appearance model extracts six overlapped local patches from the target image to generate the dictionary; the order and the locations of these six patches are shown in Fig. 2. Generally, the local patches are square, with sizes such as 4*4, 8*8, 12*12, or 16*16. However, in order to capture features of larger extent, the local patches of the QSLSA model are set to 16*32 (or 32*16). The representation of all the local patches is denoted as $Z' = \{z'_1, z'_2, \cdots, z'_M\}$, $z'_i \in \mathbb{R}^{d_1}$, where $d_1$ is the dimension of a local patch, namely $d_1 = 512$, and $M$ is the number of local patches, set to 6. The template set of the QSLSA model is denoted as $T' = \{t'_1, t'_2, \cdots, t'_n\}$, $t'_i \in \mathbb{R}^{d_1}$. Thus, the dictionary can be denoted as $D' = \{d'_1, d'_2, \cdots, d'_{n \times M}\} \in \mathbb{R}^{d_1 \times (n \times M)}$, which is generated from the local patches $Z'$ and the template set $T'$.

After obtaining the dictionary $D'$, we generate candidate targets by PF and then run the four stages of the structural local sparse appearance model to generate the pooled feature for every candidate target. In the first stage, we extract the local patches $Z'$ from every candidate target, transfer every patch to a column vector, and normalize it with the $l_2$-norm. In the second stage, we generate the sparse coefficients for each local patch, denoted as $A' = \{a'_1, a'_2, \cdots, a'_M\}$, $a'_i \in \mathbb{R}^{(n \times M) \times 1}$. In the third stage, we accumulate and weight the coefficients to obtain a square matrix $S' = \{s'_1, s'_2, \cdots, s'_M\}$. In the last stage, we obtain the pooled feature $g' = \mathrm{diag}(S')$. The detailed process of the QSLSA model is shown as the top flow of Fig. 1.

Figure 2: The six local patches of the QSLSA model.

Figure 3: The eight local patches of the SSLSA model.

3.3.2. SSLSA Model

In this section, we introduce the SSLSA model, which is based on small local patches. The size of the overlapped local patches in the SSLSA model is set to 16*16. This appearance model extracts eight overlapped local patches from the target image to generate the dictionary; the order and the locations of these eight patches are shown in Fig. 3. The representation of these local patches is denoted as $Z'' = \{z''_1, z''_2, \cdots, z''_O\}$, $z''_i \in \mathbb{R}^{d_2}$, where $d_2$ is the dimension of a local patch, set to 256, and $O$ is the number of local patches, set to 8. The template set of the SSLSA model is denoted as $T'' = \{t''_1, t''_2, \cdots, t''_n\} \in \mathbb{R}^{d_2 \times n}$. Thus, the dictionary can be denoted as $D'' = \{d''_1, d''_2, \cdots, d''_{n \times O}\} \in \mathbb{R}^{d_2 \times (n \times O)}$, which is generated from the local patches $Z''$ and the template set $T''$.

The remaining process of the SSLSA model is similar to that of the QSLSA model above. We generate candidate targets by PF and then run the four stages of the structural local sparse appearance model to generate the second pooled feature for every candidate target. The sparse coefficients for each local patch in stage 2 are denoted as $A'' = \{a''_1, a''_2, \cdots, a''_O\}$, $a''_i \in \mathbb{R}^{(n \times O) \times 1}$. The square matrix obtained in the third stage is denoted as $S'' = \{s''_1, s''_2, \cdots, s''_O\}$. In the last stage, we obtain the second pooled feature of the SSLSA model, $g'' = \mathrm{diag}(S'')$. The detailed process of the SSLSA model is shown as the bottom flow in Fig. 1.
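As a concrete illustration of the dual-scale patch layouts, the sketch below extracts overlapped patches from a 32*32 warped target. The exact offsets of the six 16*32/32*16 QSLSA patches and the eight 16*16 SSLSA patches are only shown in Figs. 2 and 3, so the layouts below are assumptions for illustration.

```python
import numpy as np

def extract_patches(image32, boxes):
    """Crop and vectorize patches from a 32x32 target image.
    boxes: list of (top, left, height, width). Returns one l2-normalized
    column vector per patch."""
    patches = []
    for (t, l, h, w) in boxes:
        p = image32[t:t + h, l:l + w].reshape(-1).astype(float)
        patches.append(p / (np.linalg.norm(p) + 1e-12))
    return patches

# Assumed QSLSA layout: six half-target patches (three 16x32 bands, three 32x16 bands)
QSLSA_BOXES = [(0, 0, 16, 32), (8, 0, 16, 32), (16, 0, 16, 32),
               (0, 0, 32, 16), (0, 8, 32, 16), (0, 16, 32, 16)]
# Assumed SSLSA layout: eight overlapped quarter-target 16x16 patches
SSLSA_BOXES = [(0, 0, 16, 16), (0, 16, 16, 16),
               (8, 0, 16, 16), (8, 8, 16, 16), (8, 16, 16, 16),
               (16, 0, 16, 16), (16, 8, 16, 16), (16, 16, 16, 16)]

img = np.random.default_rng(2).random((32, 32))
print(len(extract_patches(img, QSLSA_BOXES)),   # 6 patches of dimension 512
      len(extract_patches(img, SSLSA_BOXES)))   # 8 patches of dimension 256
```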

3.3.3. Associative Mechanism

The pooled features gather the structural local information of the target, which helps to handle scale variation, illumination change, and partial occlusion. The QSLSA and SSLSA models each generate a pooled feature. Based on these pooled features, we build our DSLSA model as the union of the QSLSA model and the SSLSA model. In order to use the pooled features more effectively, we propose an associative mechanism that combines the pooled features of the QSLSA model and the SSLSA model by the following formula:

$$p(x_t \mid z_{1:t}) \propto \frac{1}{M} g'_t + \frac{1}{O} g''_t \qquad (8)$$

where $p(x_t \mid z_{1:t})$ is the posterior estimate of a candidate target. Here, $1/M$ and $1/O$ are normalization coefficients that make the value of each pooled feature lie between 0 and 1.
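The sketch below illustrates the associative mechanism of eq. (8), assuming each model's pooled feature is summarized by summing its entries before the candidates are ranked; the paper does not spell out this reduction, so it is an assumption here.

```python
import numpy as np

def dslsa_likelihood(g_qslsa, g_sslsa, M=6, O=8):
    """Combine the two pooled features into one candidate likelihood, eq. (8).
    g_qslsa: pooled feature of the QSLSA model (length M).
    g_sslsa: pooled feature of the SSLSA model (length O).
    The reduction by np.sum is an illustrative assumption."""
    return np.sum(g_qslsa) / M + np.sum(g_sslsa) / O

def select_best(candidates):
    """MAP selection over candidates, each given as a (g_qslsa, g_sslsa) pair."""
    scores = [dslsa_likelihood(gq, gs) for gq, gs in candidates]
    return int(np.argmax(scores))
```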

4. Implementation

4.1. Particle Filter

The particle filter (PF) [29,44] is widely used in visual tracking. The principle of the PF is to use a random sample set to approximate the posterior of the target as its state evolves. Given the target state up to frame $t-1$ and the observation set $z_{1:t}$, the posterior $p(x_t \mid z_{1:t})$ can be inferred iteratively by Bayesian theory:

$$p(x_t \mid z_{1:t}) \propto p(z_t \mid x_t) \int p(x_t \mid x_{t-1}) \, p(x_{t-1} \mid z_{1:t-1}) \, dx_{t-1} \qquad (9)$$

where $\propto$ means proportional to, $p(x_t \mid x_{t-1})$ denotes the motion model, and $x_t$ is the bounding box of the target at time $t$. The motion model describes the temporal relationship of the target state between two consecutive frames. The target pose between two consecutive frames is modeled by an affine transformation with six parameters, and the state transition is modeled by a Gaussian kernel, namely $p(x_t \mid x_{t-1}) = \mathcal{N}(x_t ; x_{t-1}, \Sigma)$, where $\Sigma$ is a diagonal covariance matrix whose elements are the variances of the affine parameters.

The observation model $p(z_t \mid x_t)$ represents the likelihood of the observation $z_t$ under the condition of $x_t$. As shown in eq. (5), the sparse coefficients are transferred into the weights of the local patches; therefore, we take the pooled feature as the likelihood of the observation $z_t$ for a candidate target. Because our DSLSA model has two pooled features, generated by the QSLSA and SSLSA models respectively, we use their combination given by eq. (8) as the final likelihood of the candidate target.

The final state $x_t$ of the target at time $t$ is computed through the following MAP formula:

$$\hat{x}_t = \arg\max_{x_t^{i}} \, p(x_t^{i} \mid z_{1:t}) \qquad (10)$$

where $x_t^{i}$ represents the state of the $i$-th sample at time $t$.
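A minimal sketch of the particle filter loop described above: Gaussian sampling of the six affine parameters around the previous state, followed by MAP selection (eq. (10)). The function names and the dummy likelihood are assumptions; the actual likelihood is the dual-scale pooled feature of eq. (8).

```python
import numpy as np

def particle_filter_step(prev_state, sigma, likelihood_fn, n_particles=600, rng=None):
    """prev_state: 6 affine parameters of the previous frame's target state.
    sigma: per-parameter standard deviations (diagonal of Sigma in eq. (9)).
    likelihood_fn: maps a candidate state to p(z_t | x_t), e.g. eq. (8).
    Returns the MAP particle (eq. (10))."""
    rng = rng or np.random.default_rng()
    particles = prev_state + rng.standard_normal((n_particles, 6)) * sigma
    weights = np.array([likelihood_fn(x) for x in particles])
    return particles[np.argmax(weights)]

# Toy usage with a dummy likelihood peaked at the previous state
state = np.array([100.0, 80.0, 1.0, 0.0, 0.0, 1.0])
sigma = np.array([4.0, 4.0, 0.01, 0.0, 0.0, 0.01])
best = particle_filter_step(state, sigma, lambda x: -np.sum((x - state) ** 2))
print(best.shape)  # (6,)
```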

4.2. Template Update

Generally, the appearance of the target varies frequently during tracking, because the pose of the target changes and there are various outside interferences, such as illumination change, noise, and background clutter. Visual tracking with fixed templates cannot accommodate the variation of the target appearance, and such tracking has poor anti-interference ability. On the other hand, updating the templates too frequently is also inadvisable, because frequent updates may accumulate estimation errors and cause drifting away from the target.

Many template update methods are available in the literature ([4], [21], [30] and [36]). Mei and Ling [21] use a target template and trivial templates to deal with interference and partial occlusion; the drawback of this method is that it does not provide a measure to prevent the drift problem. Ross et al. [3] extend the sequential Karhunen-Loeve algorithm and propose an incremental principal component analysis (PCA) algorithm; its disadvantage is that PCA is sensitive to partial occlusion, because it assumes the reconstruction error follows a Gaussian distribution. Building on the template update problem formulated by Matthews et al. [30], Jia et al. [25] introduce sparse representation and subspace learning into the template update and use a random number generated from a Gaussian distribution to decide the update, so that earlier-captured tracking results are eliminated with higher probability than later-captured ones.

Our template update strategy combines both sparse representation [21] and subspace learning [3]. First, an estimated target can be modeled by a linear combination of the PCA basis vectors and additional trivial templates, as follows:

$$p = Uq + e = \begin{bmatrix} U & I \end{bmatrix} \begin{bmatrix} q \\ e \end{bmatrix} \qquad (11)$$

where $p$ denotes the observation vector, $U$ is the matrix composed of eigenbasis vectors, $q$ is the coefficient vector of the eigenbasis, and $e$ indicates the pixels in $p$ that are corrupted or occluded. As the error caused by occlusion and noise is arbitrary and sparse, we solve the problem as an $l_1$-regularized least-squares problem [25]:

$$\min_{c} \|p - Hc\|_2^2 + \psi \|c\|_1 \qquad (12)$$

where $H = \begin{bmatrix} U & I \end{bmatrix}$, $c = \begin{bmatrix} q & e \end{bmatrix}^{T}$, and $\psi$ is the regularization parameter. With the coefficients of the trivial templates accounting for noise and occlusion, our template update strategy can deal with the challenge of appearance change as well as cope with partial occlusion. In addition, our DSLSA model has two component appearance models, the QSLSA model and the SSLSA model. Therefore, we use two template sets in our DSLSA model, one for the QSLSA model and one for the SSLSA model; the updating of these two template sets is mutually independent.
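The update step of eqs. (11)-(12) is again an l1-regularized least-squares problem; the sketch below reuses the proximal-gradient idea from Section 3.1 (the paper itself uses the solver of Mairal et al. [32]), and the occlusion-masking threshold is an illustrative assumption.

```python
import numpy as np

def template_update_coeffs(p, U, psi=0.01, n_iter=200):
    """Solve  min_c ||p - H c||_2^2 + psi * ||c||_1  with H = [U, I] (eq. (12)).
    Returns (q, e): eigenbasis coefficients and trivial (error) coefficients."""
    d, k = U.shape
    H = np.hstack([U, np.eye(d)])
    c = np.zeros(k + d)
    L = 2.0 * np.linalg.norm(H, 2) ** 2
    for _ in range(n_iter):
        c = c - 2.0 * H.T @ (H @ c - p) / L                      # gradient step
        c = np.sign(c) * np.maximum(np.abs(c) - psi / L, 0.0)    # soft-thresholding
    return c[:k], c[k:]

# Pixels with large trivial coefficients can be treated as occluded and excluded
# (or replaced by the reconstruction) before the subspace update.
# The 0.1 threshold is an assumption for illustration.
def occlusion_mask(e, thresh=0.1):
    return np.abs(e) > thresh
```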

Figure 4: Tracking results when there is large illumination variation.

Figure 5: Tracking results when the target objects are heavily occluded.

Figure 6: Tracking results when the target objects undergo large rotations.

5. Experiments

For simplicity, the trackers with the QSLSA model, the SSLSA model, and our DSLSA model are called the QSLSA tracker, the SSLSA tracker, and the DSLSA tracker, respectively. All the experiments are implemented in MATLAB on a personal computer (PC) with an Intel i7-3770 CPU (3.4 GHz) and 16 GB memory. For every video sequence, the location of the target in the first frame is manually tagged. The $l_1$ minimization problem in the sparse representation is solved with the method of Mairal et al. [32]. The regularization constant $\lambda$ in all experiments is set to 0.01. In every experiment, every target image is warped into an image of 32*32 pixels before the tracking process. In order to guarantee the accuracy and the generality of the experiments, the number of particles in the PF is set to 600. For the template update, we use eight eigenvectors to implement incremental subspace learning, updating every five frames. In order to verify the effectiveness of our method, we use 12 video sequences adopted from Jia et al. [25], Xiao et al. [31], and Wu et al. [33]. The challenges of these sequences include illumination change, occlusion, rotation, etc. The trackers compared with our method include the CSK tracker [34], IVT tracker [3], Frag tracker [35], L1APG tracker [36], and ASLSA tracker [25], all run on our PC.

5.1. Qualitative Evaluation

Illumination change: We test the challenge of illumination change using four sequences, namely sylvester, trellis, fish, and singer1. Fig. 4 presents the tracking results of all trackers on these sequences. In the sylvester sequence, the tracking results of all trackers are good in the front part of the sequence, while most trackers drift from the target in the latter part. In the trellis sequence, the shadows and the sunshine make it very challenging to track the target; only the ASLSA tracker and our DSLSA tracker are able to lock onto the target accurately. In the fish sequence, continuous illumination changes exist throughout; the IVT tracker, the ASLSA tracker, and our DSLSA tracker can track the target accurately. In the singer1 sequence, the singer is affected by strong illumination changes and scale variations; most trackers lose the target, while our DSLSA tracker tracks the target well. Thus, on these sequences our DSLSA tracker deals with these challenges better than the other trackers.

Occlusion: In order to verify the performance of our DSLSA tracker in dealing with partial occlusion and long-time occlusion, we use five challenging sequences. Fig. 5 presents the tracking results of these sequences with all trackers. In the girl sequence, the girl undergoes several pose changes and heavy partial occlusions. Because the girl is blocked by a similar head, only the L1APG tracker, the ASLSA tracker, and our DSLSA tracker can track the girl accurately. In the walking sequence, the man is partially occluded by a light pole. Although the other trackers can track the target, their estimates have different degrees of bias from the target. In the woman sequence, the woman is heavily partially occluded several times; the CSK tracker, the ASLSA tracker, and our DSLSA tracker can track the woman accurately. In the faceocc2 sequence, the head is partially occluded by a book and a hat; the CSK tracker, the L1APG tracker, and our DSLSA tracker can track the target accurately. Several trackers show good performance in dealing with occlusion on these five sequences. Our DSLSA tracker tracks the target accurately, as shown by the small center error and the high overlap rate.

Rotation: Fig. 6 presents the tracking results of all trackers in dealing with the challenge of rotation using three sequences. The rotations in these sequences include in-plane rotations and out-of-plane rotations. In the david2 sequence, the man's head swings randomly, accompanied by the movement of his body; the rotations of the head include both in-plane and out-of-plane rotations. We can see from Fig. 6(a) that the accuracy of our DSLSA tracker is higher than that of the other trackers. In the freeman3 sequence, a man walks around among students sitting in a classroom; in addition to the changes of the head state, there are changes in the spatial position. In this challenging sequence, only the ASLSA tracker and our DSLSA tracker can lock onto the target accurately. In the dog1 sequence, the rotations of the dog include both in-plane and out-of-plane rotations. We can see from Fig. 6(c) that our DSLSA tracker deals with this challenge well.

Figure 7: Center error curves of the CSK, Frag, IVT, L1APG, ASLSA and our DSLSA trackers.

Figure 8: Overlap rate curves of the CSK, Frag, IVT, L1APG, ASLSA and our DSLSA trackers.

Table 1: RMSEs of trackers. The best results are shown in red.

sequence    CSK     Frag    IVT     L1APG   MIL     ASLSA   DSLSA
david2      2.33    56.72   1.43    1.57    14.26   1.47    0.70
sylvester   9.92    14.40   34.27   23.82   19.40   7.90    5.98
trellis     18.82   63.65   124.96  62.12   62.19   5.39    4.02
fish        41.19   22.73   4.07    18.68   39.40   3.32    2.70
singer1     14.01   77.10   11.75   98.82   14.22   5.90    4.09
girl        19.34   20.12   18.58   3.41    14.59   4.54    2.37
walking     7.17    8.76    1.81    2.16    7.96    1.78    1.52
freeman3    53.90   58.30   35.90   33.81   101.10  2.14    1.23
dog1        3.81    12.24   3.51    9.62    21.65   4.81    4.30
woman       6.28    104.73  182.08  135.23  123.30  2.41    2.01
faceocc2    5.92    42.22   7.08    10.85   17.91   12.69   5.41
caviar2     8.75    6.95    4.41    35.57   75.44   1.32    1.20
average     15.95   40.66   35.82   36.30   42.62   4.47    2.96

5.2. Quantitative Evaluation

In order to better evaluate the performance of all trackers, we use two criteria together: the root-mean-squared error (RMSE) of the center location and the mean overlap rate (MOR) [31]. The RMSE is defined by the following formula:

$$RMSE = \frac{1}{V} \sum_{t=1}^{V} \sqrt{(x_t - g_t)^2} \qquad (13)$$

where $V$ is the number of frames of a sequence, $x_t$ is the estimated result of the tracker at frame $t$, and $g_t$ is the ground truth of frame $t$. Table 1 lists the RMSEs of all trackers. We can see that the RMSEs of our DSLSA tracker are better than those of the other trackers. For clarity, Fig. 7 presents the center errors between the estimates of all trackers and the ground truth of the target; the center errors of our tracker are smaller than those of the other trackers. Given the tracking result $R_T$ and the ground truth $R_G$, we use the rule of [39], $Score = \frac{area(R_T \cap R_G)}{area(R_T \cup R_G)}$, to evaluate the overlap rate of all trackers. Table 2 lists the MORs of all trackers. In most cases, the MORs of our DSLSA tracker are the best. In order to display the differences between the trackers more specifically, Fig. 8 presents the overlap rate curves of all trackers. We can see that our DSLSA tracker performs favorably against the other trackers.
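A minimal sketch of the evaluation criteria: the center-location RMSE of eq. (13) and the PASCAL-style overlap score used for the MOR. Rectangles are assumed to be (x, y, width, height) tuples.

```python
import numpy as np

def center_rmse(track_centers, gt_centers):
    """Eq. (13): mean Euclidean center error over the V frames of a sequence."""
    diff = np.asarray(track_centers, float) - np.asarray(gt_centers, float)
    return np.mean(np.linalg.norm(diff, axis=1))

def overlap_score(rt, rg):
    """Overlap rate: area(RT ∩ RG) / area(RT ∪ RG); boxes are (x, y, w, h)."""
    x1, y1 = max(rt[0], rg[0]), max(rt[1], rg[1])
    x2 = min(rt[0] + rt[2], rg[0] + rg[2])
    y2 = min(rt[1] + rt[3], rg[1] + rg[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = rt[2] * rt[3] + rg[2] * rg[3] - inter
    return inter / union if union > 0 else 0.0

print(overlap_score((0, 0, 10, 10), (5, 0, 10, 10)))  # 1/3
```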

5.3. Validation of the Associative Mechanism

In order to validate the effectiveness of our associative mechanism, we use two criteria, the average root-mean-squared error (ARMSE) and the average mean overlap rate (AMOR), and we compare three trackers, the QSLSA tracker, the SSLSA tracker, and the ASLSA tracker, with our DSLSA tracker. Because the frameworks of these four trackers are based on the PF, we run each of them 100 times independently under the same conditions to evaluate their robustness. The ARMSE is defined by the following formula:

$$ARMSE = \frac{1}{L} \sum_{i=1}^{L} \left( \frac{1}{V} \sum_{t=1}^{V} \sqrt{(x_t^{i} - g_t^{i})^2} \right) \qquad (14)$$

where $L$ is the number of independent runs of the tracker, $V$ is the number of frames of a sequence, $x_t^{i}$ is the result of the tracker at frame $t$ in run $i$, and $g_t^{i}$ is the ground truth of frame $t$. $L$ is set to 100 in our experiments. The AMOR is defined by the following formula:

$$AMOR = \frac{1}{L} \sum_{i=1}^{L} \left( \frac{1}{V} \sum_{t=1}^{V} Score_t \right) \qquad (15)$$

where $L$ is the number of independent runs, $V$ is the number of frames of a sequence, and $Score_t$ is the overlap rate between the tracking result $R_T$ and the ground truth $R_G$ at time $t$. The ARMSE and AMOR of a tracker are more reliable than the RMSE and MOR of a single run: their values degrade noticeably whenever a tracker drifts, so they better reflect the robustness of a tracker. Table 3 lists the ARMSEs of these four trackers, and Table 4 lists their AMORs. As shown in Tables 3 and 4, most ARMSEs and AMORs of the QSLSA tracker and the SSLSA tracker are better than those of the ASLSA tracker. This indicates that the QSLSA model is better than the ASLSA model, and that the SSLSA model also improves on the ASLSA tracker.

To further improve the robustness of the appearance model, we propose the new associative mechanism, applied in our DSLSA model, to inherit the advantages of the QSLSA and SSLSA models. As shown in Tables 3 and 4, several ARMSEs and AMORs of our DSLSA tracker are the best, and almost all of them are better than those of the ASLSA model. Although a few ARMSEs and AMORs of our DSLSA tracker are slightly poorer than the results of the QSLSA tracker, the average ARMSE and the average AMOR of our DSLSA tracker are better than those of the QSLSA tracker. In summary, the associative mechanism of our DSLSA model improves the robustness of a single appearance model, namely the QSLSA model or the SSLSA model. To display the robustness of our DSLSA tracker specifically, we present the average center errors of these four trackers in Fig. 9 and their average overlap rates in Fig. 10. The drawback of our DSLSA tracker is that its time complexity is slightly higher than that of the other three trackers: the speeds of the ASLSA tracker, the QSLSA tracker, the SSLSA tracker, and our DSLSA tracker are 7.8, 8.9, 8.4 and 5.8 frames per second (fps), respectively. All these results are the average fps over 100 runs of all sequences.

Figure 9: Average center error curves of the ASLSA, QSLSA, SSLSA, and our DSLSA trackers over 100 independent runs.

Figure 10: Average overlap rate curves of the ASLSA, QSLSA, SSLSA, and our DSLSA trackers over 100 independent runs.

Table 2: MORs of trackers. The best results are shown in red.

sequence    CSK     Frag    IVT     L1APG   MIL     ASLSA   DSLSA
david2      0.81    0.25    0.66    0.85    0.48    0.87    0.92
sylvester   0.64    0.60    0.52    0.43    0.51    0.68    0.72
trellis     0.49    0.28    0.28    0.20    0.26    0.84    0.76
fish        0.21    0.53    0.77    0.53    0.28    0.87    0.90
singer1     0.36    0.25    0.57    0.24    0.36    0.75    0.81
girl        0.38    0.47    0.17    0.75    0.42    0.67    0.71
walking     0.55    0.55    0.73    0.74    0.48    0.76    0.72
freeman3    0.30    0.31    0.38    0.31    0.02    0.76    0.85
dog1        0.55    0.55    0.74    0.51    0.43    0.71    0.67
woman       0.66    0.16    0.19    0.18    0.17    0.79    0.85
faceocc2    0.78    0.49    0.73    0.68    0.62    0.65    0.77
caviar2     0.54    0.53    0.71    0.30    0.22    0.78    0.77
average     0.52    0.41    0.54    0.48    0.35    0.76    0.79

Figure 11: Precision and success plots over OOTB-50. The performance score for each tracker is shown in the legend.

Table 3: ARMSEs of trackers. The best results are shown in red.

sequence    ASLSA   QSLSA   SSLSA   DSLSA
david2      3.79    1.77    3.00    1.72
sylvester   14.23   12.24   14.99   12.75
trellis     13.44   15.63   15.84   11.03
fish        3.82    6.42    3.61    3.97
singer1     6.70    5.19    5.86    5.34
girl        11.63   5.44    11.03   5.86
walking     1.90    1.77    2.01    1.78
freeman3    2.99    1.81    2.44    1.89
dog1        5.00    4.41    5.07    4.75
woman       57.71   31.53   64.16   10.02
faceocc2    24.55   14.16   20.54   14.85
caviar2     10.78   2.22    4.61    1.77
average     13.04   8.55    12.76   6.31

Table 4: AMORs of trackers. The best results are shown in red.

sequence    ASLSA   QSLSA   SSLSA   DSLSA
david2      0.77    0.86    0.81    0.87
sylvester   0.62    0.61    0.61    0.63
trellis     0.76    0.63    0.73    0.73
fish        0.85    0.81    0.86    0.85
singer1     0.75    0.78    0.76    0.78
girl        0.56    0.68    0.56    0.67
walking     0.75    0.67    0.74    0.73
freeman3    0.75    0.81    0.77    0.80
dog1        0.70    0.63    0.69    0.66
woman       0.57    0.71    0.39    0.80
faceocc2    0.58    0.61    0.59    0.62
caviar2     0.67    0.74    0.75    0.77
average     0.69    0.71    0.69    0.74


5.4. Evaluation on OOTB-50

In this section, we evaluate our algorithm on the online object tracking benchmark (OOTB) [33]. We use the first part of the benchmark, namely OOTB-50. The trackers compared with our method include CSK, IVT, Frag, L1APG, ASLSA, DSSM [45], and PCOM [46]. The precision plot and the success plot are used to evaluate the performance of the trackers. The precision score of each tracker is its score at a threshold of 20 pixels. Likewise, we use the area under the curve (AUC) of each success plot to rank the trackers. The temporal robustness evaluation (TRE) is used to evaluate the robustness of the trackers. All results are obtained on our PC over the 50 sequences of OOTB-50. To be fair, the locations of the target in the first frame and the affine transformation parameters of the target state variable are consistent with the settings of OOTB. Fig. 11 shows the qualitative comparison of all trackers. We can see that our DSLSA model performs well on both the precision plot and the success plot.

Finally, we discuss how to choose the patch sizes for our dual-scale tracker. The results below are averaged over 100 runs on the 12 video sequences of Section 5.3 under the same conditions. We consider the following three aspects. First, overlapped patches of small sizes, such as 8*8 or 4*4, can extract more local features; however, they increase the computational complexity heavily. The speed of the ASLSA tracker with 8*8 patches is 1.01 fps, and with 4*4 patches it is 0.04 fps. Therefore, we do not consider such small scales for our dual-scale tracker. Second, a more complex model with more than two scales can extract more features, but it also increases the computational complexity. We tested a tracker with three scales, namely the QSLSA, SSLSA, and ASLSA models: its ARMSE is 6.3 and its average speed is 4.41 fps. Finally, we tested the other possible dual-scale combinations. The first combines the scales of the QSLSA and ASLSA models; the ARMSE of this tracker is 6.49 and its average speed is 5.51 fps. The second combines the scales of the SSLSA and ASLSA models; the ARMSE of this tracker is 8.58 and its average speed is 5.48 fps. In summary, we choose the QSLSA model and the SSLSA model as our dual-scale model.

6. Conclusion

In this paper, we present a dual-scale structural local sparse appearance model based on overlapped patches. This appearance model includes a larger-scale model, the QSLSA model, and a smaller-scale model, the SSLSA model. The QSLSA model captures the structural quasi-holistic features of the target, and the SSLSA model captures more structural local features of the target. We then join these two models into a new appearance model, the DSLSA model, through an associative mechanism. The DSLSA model absorbs the advantages of the QSLSA and SSLSA models, which increases the accuracy and the robustness of our appearance model. Both qualitative and quantitative analyses on challenging benchmark image sequences indicate that the tracker with our DSLSA model performs favorably against several state-of-the-art trackers.

Acknowledgements Thank the editor and the anonymous referees for their valuable comments. This work was supported by the Nature Science Foundation of China (No. 61572214, U1233119 and 61462048), Wuhan Science and Technology Bureau of Hubei Province, China (No. 2014010202010110), and the Natural Science Foundation of Jiangxi Province, China (No. 20161BAB202036). [1] A. Yilmaz, O. Javed, M. Shah, Object tracking: A survey, Acm computing surveys (CSUR), 2006, 38(4): 13. [2] X. Li, W. Hu, C. Shen, Z. Zhang, A. Dick, and A. V. D. Hengel, A survey of appearance models in visual object tracking, ACM Transactions on Intelligent Systems and Technology, vol. 4, pp. 1-48, 2013. [3] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, Incremental Learning for Robust Visual Tracking, International Journal of Computer Vision, vol. 77, pp. 125-141, 2007. [4] G. Silveira and E. Malis, Real-time visual tracking under arbitrary illumination changes, 2007 Ieee Conference on Computer Vision And Pattern Recognition, Vols 1-8, pp. 186-191, 2007 [5] B. Georgescu and P. Meer, Point matching under large image deformations and illumination changes, IEEE Trans Pattern Anal Mach Intell, vol. 26, pp. 674-88, Jun 2004. [6] J. Ning, L. E. I. Zhang, D. Zhang, and C. Wu, Robust Object Tracking Using Joint Color-Texture Histogram, International Journal of Pattern Recognition and Artificial Intelligence, vol. 23, pp. 12451263, 2009. [7] J. Wang, H. Wang, Y. Yan, Robust visual tracking by metric learning with weighted histogram representations[J], Neurocomputing, 2015, 153: 77-88. [8] W. Hu, X. Li, W. Luo, X. Zhang, S. Maybank, and Z. Zhang, Single and multiple object tracking using log-euclidean Riemannian subspace and block-division appearance model, IEEE Trans Pattern Anal Mach Intell, vol. 34, pp. 2420-40, Dec 2012. [9] F. Porikli, O. Tuzel, P. Meer, Covariance tracking using model update based on lie algebra, Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. IEEE, 2006, 1: 728735. [10] Z. Lin, L. S. Davis, D. Doermann, and D. DeMenthon, Hierarchical part-template matching for human detection and segmentation, 2007 Ieee 11th International Conference on Computer Vision, Vols 1-6, pp. 1152-1159, 2007. [11] J. Kwon, H. S. Lee, F. C. Park, and K. M. Lee, A Geometric Particle Filter for Template-Based Visual Tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, pp. 625-643, 2014. [12] Y. Fan, L. Huchuan, and Y. Ming-Hsuan, Robust superpixel tracking, IEEE Trans Image Process, vol. 23, pp. 1639-51, Apr 2014.


[13] Y. Yuan, J. Fang, and Q. Wang, Robust Superpixel Tracking via Depth Fusion, Ieee Transactions on Circuits And Systems for Video Technology, vol. 24, pp. 15-26, Jan 2014. [14] S. Fazli, H. M. Pour, and H. Bouzari, Particle Filter based Object Tracking with Sift and Color Feature, 2009 Second International Conference on Machine Vision, Proceedings, ( Icmv 2009), pp. 8993, 2009. [15] Y. Song, C. Li, L. Wang, et al, Robust visual tracking using structural region hierarchy and graph matching[J], Neurocomputing, 2012, 89: 12-20. [16] Z. Qi, R. Ting, F. Husheng, and Z. Jinlin, Particle Filter Object Tracking Based on Harris-SIFT Feature Matching, Procedia Engineering, vol. 29, pp. 924-929, 2012. [17] E. J. Candes, J. Romberg, and T. Tao, Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on Information Theory, vol. 52, pp. 489-509, 2006. [18] D. L. Donoho, Compressed sensing, IEEE Transactions on Information Theory, vol. 52, pp. 1289-1306, 2006. [19] B. A. Olshausen and D. J. Field, Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature, vol. 381, pp. 607-9, Jun 13 1996. [20] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, Robust face recognition via sparse representation, IEEE Trans Pattern Anal Mach Intell, vol. 31, pp. 210-27, Feb 2009. [21] X. Mei, H. Ling, Robust visual tracking using l1 minimization, Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009: 1436-1443. [22] X. Mei, H. Ling, Y. Wu, et al., Minimum error bounded efficient l1 tracker with occlusion detection, Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011: 1257-1264. [23] B. Y. Liu, J. Z. Huang, L. Yang, and C. Kulikowsk, Robust Tracking Using Local Sparse Appearance Model and K-Selection, 2011 Ieee Conference on Computer Vision And Pattern Recognition (Cvpr), pp. 1313-1320, 2011. [24] C. Xie, J. Tan, P. Chen, J. Zhang, and L. He, Multi-scale patchbased sparse appearance model for robust object tracking, Machine Vision And Applications, vol. 25, pp. 1859-1876, Oct 2014. [25] X. Jia, H. Lu, M. Yang, Visual tracking via adaptive structural local sparse appearance model, Computer vision and pattern recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012: 1822-1829. [26] B. Y. Liu, L. Yang, J. Z. Huang, P. Meer, L. G. Gong, and C. Kulikowski, Robust and Fast Collaborative Tracking with Two Stage Sparse Optimization, Computer Vision-Eccv 2010, Pt Iv, vol. 6314, pp. 624-637, 2010. [27] A. Zarezade, H. R. Rabiee, A. Soltani-Farani, and A. Khajenezhad, Patchwise Joint Sparse Tracking With Occlusion Detection, Ieee Transactions on Image Processing, vol. 23, pp. 4496-4510, Oct 2014. [28] D. J. Field, What Is the Goal of Sensory Coding?, Neural Computation, vol. 6, pp. 559-601, 1994. [29] Z. Zhao, T. Wang, F. Liu, et al., Remarkable local resampling based on particle filter for visual tracking[J], Multimedia Tools and Applications, 2015: 1-26. DIO: 10.1007/s11042-015-3075-6 [30] I. Matthews, T. Ishikawa, and S. Baker, The template update problem, IEEE Trans Pattern Anal Mach Intell, vol. 26, pp. 810-5, Jun 2004. [31] Z. Xiao, H. Lu, and D. Wang, L2-RLS-Based Object Tracking, Ieee Transactions on Circuits And Systems for Video Technology, vol. 24, pp. 1301-1309, Aug 2014. [32] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, Online Learning for Matrix Factorization and Sparse Coding, Journal Of Machine Learning Research, vol. 11, pp. 19-60, Jan 2010. 
[33] Y. Wu, J. Lim, M. Yang, Online object tracking: A benchmark, Proceedings of the IEEE conference on computer vision and pattern recognition, 2013: 2411-2418. [34] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, Exploiting the Circulant Structure of Tracking-by-Detection with Kernels, Computer Vision - Eccv 2012, Pt Iv, vol. 7575, pp. 702-715, 2012. [35] A. Adam, E. Rivlin, I. Shimshoni, Robust fragments-based tracking using the integral histogram, Computer vision and pattern recognition, 2006 IEEE Computer Society Conference on. IEEE, 2006, 1:

798-805. [36] C. Bao, Y. Wu, H. Ling, et al., Real time robust l1 tracker using accelerated proximal gradient approach, Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012: 1830-1837. [37] Y. Wu, B. Ma, M. Yang, J. Zhang, and Y. Jia, Metric Learning Based Structural Appearance Model for Robust Visual Tracking, IEEE Transactions on Circuits and Systems for Video Technology, 2014, 24, (5), pp. 865-877 [38] Q. Wang, F. Chen, W. Xu, and M. Yang, Object Tracking With Joint Optimization of Representation and Classification, IEEE Transactions on Circuits and Systems for Video Technology, 2015, 25, (4), pp. 638-650. [39] Y. Yang, Y. Xie, W. Zhang, et al, Global Coupled Learning and Local Consistencies Ensuring for sparse-based tracking[J], Neurocomputing, 2015, 160: 191-205. [40] S. Zhang, H. Yao, H. Zhou, X. Sun, and S. Liu, Robust visual tracking based on online learning sparse representation, Neurocomputing, 2013, 100, pp. 31-40. [41] S. Zhang, H. Zhou, F. Jiang, and X. Li, Robust Visual Tracking Using Structurally Random Projection and Weighted Least Squares, IEEE Transactions on Circuits and Systems for Video Technology, 2015, 25, (11), pp. 1749-1760. [42] R. Yao, S. Xia, Y. Zhou, Robust tracking via online Max-Margin structural learning with approximate sparse intersection kernel[J], Neurocomputing, 2015, 157: 344-355. [43] E. Mark, V. Luc, et al., Pascal visual object classes challenge results[J]. Available from www. pascal-network. org, 2005. [44] J. Ding, T. Tang, W. Liu, et al., Tracking by local structural manifold learning in a new SSIR particle filter[J], Neurocomputing, 2015, 161: 277-289. [45] B. Zhuang, H. Lu, Z. Xiao, et al., Visual tracking via discriminative sparse similarity map[J], Image Processing, IEEE Transactions on, 2014, 23(4): 1872-1881. [46] D. Wang, H. Lu, Visual tracking via probability continuous outlier model, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014: 3478-3485. [47] C. Hong, J. Yu, D. Tao, et al., Image-based three-dimensional human pose recovery by multiview locality-sensitive sparse retrieval[J], Industrial Electronics, IEEE Transactions on, 2015, 62(6): 3742-3751. [48] Z. Xia, X. Wang, X. Sun, Q. Liu, and N. Xiong, Steganalysis of LSB matching using differences between nonadjacent pixels, Multimedia Tools and Applications, 2016, vol. 75, no. 4, pp. 1947-1962. [49] Z. Xia, X. Wang, X. Sun, and B. Wang, Steganalysis of least significant bit matching using multi-order differences, Security and Communication Networks, 2014, vol. 7, no. 8, pp. 1283-1291.

13