Two dimensional hashing for visual tracking



Chao Ma, Chuancai Liu*
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, PR China
* Corresponding author. E-mail addresses: [email protected] (C. Ma), [email protected] (C. Liu).
This paper has been recommended for acceptance by Y. Aloimonos.

Article history: Received 27 August 2014. Accepted 16 January 2015.
Keywords: Hashing; Tracking; Appearance model; Incremental learning

Abstract: The appearance model is a key part of tracking algorithms. To attain robustness, many complex appearance models have been proposed to capture the discriminative information of the object. However, such models are difficult to maintain accurately and efficiently. In this paper, we observe that hashing techniques can represent an object by a compact binary code that is efficient to process. However, updating hash functions online during tracking remains inefficient when the number of samples is large. To deal with this bottleneck, a novel hashing method called two dimensional hashing is proposed. In our tracker, samples and templates are hashed to binary matrices, and the Hamming distance is used to measure the confidence of candidate samples. In addition, a designed incremental learning model is applied to update the hash functions, both to adapt to situation changes and to save training time. Experiments on our tracker and eight other state-of-the-art trackers demonstrate that the proposed algorithm is more robust in dealing with various types of scenarios. © 2015 Elsevier Inc. All rights reserved.

1. Introduction

Visual tracking, a central topic in computer vision, is widely used in applications such as augmented reality, surveillance and object identification. Numerous tracking algorithms have been proposed during the past decades [1-19]. Most visual trackers can be divided into four parts: feature extraction, appearance model, sampling method and online learning model. Among these four parts, the appearance model is the key to coping with occlusion, pose change, illumination and other complex scenarios, and designing an effective and robust appearance model is the main task in most state-of-the-art trackers. These algorithms often construct appearance models from low-level and mid-level cues such as raw pixels, color histograms, gradients, superpixels, texture, Haar-like features, SIFT, SURF, MSER, LBP, HoG, Harris corners and saliency [20]. Mei and Ling [1] treat tracking as a sparse approximation problem over a raw-pixel representation. Based on this idea, many works [2-6] use sparse representation methods to solve the tracking problem and have achieved good performance. He et al. [7] propose a new histogram extraction method called the locality sensitive histogram; based on it, an illumination invariant feature is derived that proves effective for tracking. Wang et al. [8,9] propose a superpixel


tracking algorithm, which uses superpixels to generate local part confidences of the object; its distinctive appearance model proves effective against occlusion and pose change. Other works [10,11] also use superpixel features to construct robust appearance models for visual tracking and achieve good tracking results. Chen et al. [12] integrate several types of features into a robust appearance model that combines complex cells with adaptive weights and handles occlusion, pose change and illumination effectively. Besides these trackers, many other algorithms [13-15,17-19] use hand-crafted features or models as the basis of tracking. An alternative way to construct an appearance model is the data-dependent approach. Given sufficient descriptors or raw pixels of samples, a data-dependent method aims to find a map from a high dimensional space to a low dimensional one. The most representative work based on this idea is IVT [21,22]. Given the samples at the first frame, the tracker obtains a low dimensional representation by PCA, and candidates from the coming frames are mapped into the low dimensional feature space to get a compact representation. Supported by this theory, IVT performs well in experiments. However, the tracker does not take semantic information into consideration and its mapping functions have limited discriminability, which may cause drift or loss of the target in complex cases. Compressive tracking [23,24] also follows the dimensionality reduction manner, using a compressive method to reduce the dimension of the original feature vector. Although this tracker can handle some




tracking situations, its performance is unstable because of the randomness in the mapping. Recently, data-dependent hashing has become a popular method in large-scale vision problems: it is an effective way to speed up similarity computation, which makes it attractive for tracking algorithms. Li et al. [25] propose learning compact binary codes for visual tracking. Their algorithm keeps a sample buffer and retrains the hash functions every ten frames. Although it obtains good tracking results, this training stage is not an efficient way to update hash functions. To better apply hashing to tracking and to provide an efficient updating strategy, we propose a two dimensional hashing method for visual tracking. Inspired by 2DPCA [26] and 2DLDA [27], the proposed method operates on 2D sample matrices (object images) rather than 1D feature vectors. The algorithm does not need to vectorize the sample matrices, so the dimension of the covariance matrices is reduced dramatically, which is the key to its computational efficiency. Given training samples at the first frame, the hash functions are initialized by the proposed method. Binary codes of candidates are then computed in each coming frame; the confidence of each candidate is calculated from its Hamming distances to the templates, and the candidate with the highest confidence is chosen as the target of the current frame. To adapt to changing situations, the hash functions are updated during tracking by the designed incremental learning model, thanks to which our algorithm only needs to store a few matrices. Experimental results in the CLE, SR and FPS metrics demonstrate the effectiveness and efficiency of our tracker, which outperforms state-of-the-art tracking algorithms. The rest of the paper is organized as follows: Section 2 reviews related work. Section 3 presents the details of the two dimensional hashing method. Section 4 describes the tracking framework built on it. Experimental results are analyzed in detail in Section 5, and conclusions are drawn in Section 6.

2. Related work

Recently, hashing techniques have attracted great attention in computer vision and pattern recognition due to their efficiency and accuracy in data organization. A hashing method aims to map the original high-dimensional data to compact binary codes while preserving the structure of the original data; once mapped, pairwise comparisons can be computed efficiently. Hashing techniques can be categorized into data-independent methods [28-32] and data-dependent methods [33-40,25]. One of the best-known data-independent methods is Locality Sensitive Hashing (LSH) [28], which generates hash codes by random projections. Many methods of the LSH family [29-32] have followed this idea and achieved good performance in computer vision and pattern recognition applications. Compared to data-independent techniques, more recent works focus on data-dependent techniques with the goal of obtaining more compact binary codes. They learn hash functions by optimizing an objective function that preserves sample similarity in Hamming space. However, directly solving this optimization problem is very difficult, so relaxation techniques are often used to simplify it. Spectral hashing [33] solves the problem with a spectral relaxation and has proved successful. Liu et al. [34] propose hashing with graphs, which utilizes the manifold structure of the data to generate binary codes. Deep learning techniques have also been applied to hashing [41-43], using common deep learning frameworks (RBMs and neural networks) to learn the binary code of data, which achieves

good performance in computer vision applications. Recently, LDA Hashing [35] and Semi-supervised Hashing [36] have been proposed; both learn hash functions from labeled training samples. With the success of hashing methods, many applications [37,44] now use hashing techniques to solve real problems.

Incremental learning is widely used in online learning applications, which makes it well suited to object tracking. Many incremental learning methods [27,45-49] have been proposed and applied successfully. Pang et al. [46] propose an incremental LDA for online face recognition; they update the scatter matrices incrementally, but the updating scheme is still time and memory consuming. To overcome the speed limitation, Kim et al. [47] use a sufficient spanning set approximation for incremental LDA, which significantly improves the computation speed while giving results similar to batch LDA. Based on 2DLDA, Wang et al. [48] propose an incremental 2DLDA for face recognition that is both fast and effective, and Li et al. [45] apply the idea to tracking. Zhao and Yuen [27] improve the LDA/GSVD algorithm and fast SVD updating with an incremental supervised learning method called GSVD-ILDA, which learns an adaptive subspace incrementally instead of recomputing LDA/GSVD. Lu et al. [49] extend the CLDA algorithm to an online manner and apply it to face recognition successfully.

In this work we aim to apply data-dependent hashing to object tracking, which is challenging. The hashing methods mentioned above are designed for off-line training and are not suitable for online training: applied directly in a batch manner during tracking, they would take several seconds per frame, which is unacceptable even if the accuracy were good. To solve this problem, we use an incremental learning model to update our hash functions at each frame. However, because of the demands of the hashing method, our covariance matrices are not the same as those of other incremental LDA algorithms such as [45], so the original incremental methods cannot be applied directly. Fortunately, by the proof of Theorem 1, we can update the covariance matrices incrementally and improve the tracking speed. Overall, the main contributions of our tracker are:

1. We propose a 2D-based hashing method that quickly extracts binary features of samples.
2. We successfully integrate the hashing method into a tracking model.
3. We design an effective and suitable learning model to update the hash functions at every frame.
4. We conduct comparison experiments that demonstrate the effectiveness and efficiency of our tracker.

3. Two dimensional hashing

In this section, we introduce the details of our hashing method. Given training image samples $\mathcal{X} = \{A_i, y_i\}, i = 1, \dots, n$, with $A_i \in \mathbb{R}^{D \times D}$, $y_i$ is the label of the $i$-th sample, indicating positive (foreground) or negative (background). Pairs of samples are categorized into two sets: the neighbor pair set $NE$ and the non-neighbor pair set $NO$. A pair $(A_i, A_j) \in NE$ is a neighbor pair (both positive or both negative); similarly, $(A_i, A_j) \in NO$ is a non-neighbor pair (one of them is positive and the other is negative). $\{A_i^c, y_i\}_{i=1}^{n}$ are the centralized samples of $\mathcal{X}$. The proposed hashing method maps each $A_i$ into Hamming space to obtain a binary representation. Suppose the number of projection directions is $K$; that is, $K$ hash functions need to be learned from the training samples. Following the general idea, our hash function is defined as



$$h_k(A_i^c) = \operatorname{sgn}\left(w_k^\top A_i^c + b_k\right) \qquad (1)$$

where $b_k$ is the mean of the projected data. Since the training data $\{A_i^c, y_i\}_{i=1}^{n}$ have zero mean, we get $b_k = -\sum_{i=1}^{n} w_k^\top A_i^c / n = 0$. $\operatorname{sgn}(\cdot)$ is the element-wise sign function, so the hash functions produce codes in $\{-1, 1\}$. We obtain the usual binary bits by

$$z_i^k = \frac{1}{2}\left(1 + h_k\left(A_i^c\right)\right) = \frac{1}{2}\left(1 + \operatorname{sgn}\left(w_k^\top A_i^c\right)\right) \qquad (2)$$

One thing to keep in mind here is that $A_i^c$ is a matrix and $w_k^\top A_i^c$ is a vector, so $z_i^k \in \mathbb{R}^{D \times 1}$ is a binary code vector representing the code of $A_i^c$ in direction $w_k$. Thus the binary code of $A_i$ is a matrix $z_i = [z_i^1, z_i^2, \dots, z_i^K]$ rather than a vector. Let $W = [w_1, \dots, w_K] \in \mathbb{R}^{D \times K}$ be the set of $K$ projections. Our goal is to learn a $W$ that gives, as much as possible, the same bits for pairs $(A_i, A_j) \in NE$ and different bits for pairs $(A_i, A_j) \in NO$. An objective function reaching this goal on the training samples can be defined as:

$$J(W) = \sum_k \left( \frac{1}{|NE|} \sum_{(A_i^c, A_j^c) \in NE} \left\langle \operatorname{sgn}\left(w_k^\top A_i^c\right), \operatorname{sgn}\left(w_k^\top A_j^c\right) \right\rangle - \frac{1}{|NO|} \sum_{(A_i^c, A_j^c) \in NO} \left\langle \operatorname{sgn}\left(w_k^\top A_i^c\right), \operatorname{sgn}\left(w_k^\top A_j^c\right) \right\rangle \right) \qquad (3)$$

where $\langle \cdot, \cdot \rangle$ is the inner product operator and $|\cdot|$ denotes the cardinality of a set. However, the objective function $J(W)$ is hard to solve because it is not differentiable. To overcome this limitation, we apply a relaxation to Eq. (3): a simple way is to replace the sign function with its signed magnitude. The new objective function is

$$J(W) = \sum_k \left( \frac{1}{|NE|} \sum_{(A_i^c, A_j^c) \in NE} w_k^\top A_i^{c\top} A_j^c w_k - \frac{1}{|NO|} \sum_{(A_i^c, A_j^c) \in NO} w_k^\top A_i^{c\top} A_j^c w_k \right) \qquad (4)$$

Let $\bar{A}$ be the mean of $\{A_i\}_{i=1}^{n}$. Then $A_i^{c\top} A_j^c = (A_i - \bar{A})^\top (A_j - \bar{A})$, and substituting this into Eq. (4) we get

$$J(W) = \sum_k \left( \frac{1}{|NE|} \sum_{(A_i, A_j) \in NE} w_k^\top (A_i - \bar{A})^\top (A_j - \bar{A}) w_k - \frac{1}{|NO|} \sum_{(A_i, A_j) \in NO} w_k^\top (A_i - \bar{A})^\top (A_j - \bar{A}) w_k \right)$$
$$= \sum_k \left( w_k^\top \left[ \frac{1}{|NE|} \sum_{(A_i, A_j) \in NE} (A_i - \bar{A})^\top (A_j - \bar{A}) \right] w_k - w_k^\top \left[ \frac{1}{|NO|} \sum_{(A_i, A_j) \in NO} (A_i - \bar{A})^\top (A_j - \bar{A}) \right] w_k \right) \qquad (5)$$

We define $S_{NE}$ and $S_{NO}$ as the covariance matrices for pairs $(A_i, A_j) \in NE$ and pairs $(A_i, A_j) \in NO$:

$$S_{NE} = \mathbb{E}_{(A_i, A_j) \in NE}\, (A_i - \bar{A})^\top (A_j - \bar{A}) = \frac{1}{|NE|} \sum_{(A_i, A_j) \in NE} (A_i - \bar{A})^\top (A_j - \bar{A}) \qquad (6)$$

$$S_{NO} = \mathbb{E}_{(A_i, A_j) \in NO}\, (A_i - \bar{A})^\top (A_j - \bar{A}) = \frac{1}{|NO|} \sum_{(A_i, A_j) \in NO} (A_i - \bar{A})^\top (A_j - \bar{A}) \qquad (7)$$

Note that if $(A_i, A_j) \in NE$ then $(A_j, A_i) \in NE$; this ensures that $S_{NE}$ is symmetric, and similarly for $S_{NO}$. So Eq. (5) can be rewritten as

$$J(W) = \sum_k \left( w_k^\top S_{NE} w_k - w_k^\top S_{NO} w_k \right) \qquad (8)$$

Assuming $\|w_k\| = 1$ for every $k$, the above function can be expressed in matrix form:

$$J(W) = \operatorname{tr}\left( W^\top (S_{NE} - S_{NO}) W \right) \qquad (9)$$

The optimal projection directions $W^*$ are obtained by

$$W^* = \arg\max_W J(W) \quad \text{s.t.} \quad W^\top W = I \qquad (10)$$

Finding the optimal $W$ is now a standard eigen-problem, easily solved by the eigen-decomposition of the matrix $S_{NE} - S_{NO}$:

$$\max_W J(W) = \sum_{l=1}^{L} \lambda_l, \qquad W^* = [e_1, e_2, \dots, e_L] \qquad (11)$$

where $\lambda_1 > \lambda_2 > \dots > \lambda_L$ are the top eigenvalues of the matrix $S_{NE} - S_{NO}$ and $e_1, e_2, \dots, e_L$ are the corresponding eigenvectors. Finally, the binary code of $A_i$ is $\operatorname{sgn}(W^\top A_i^c) = (z_i^1, z_i^2, \dots, z_i^K)^\top$, and it should be noted that the proposed hash code of an image sample is a matrix rather than a vector. Hashing methods usually vectorize the original object matrix or generate a high dimensional feature vector to represent the object, which makes the covariance matrix high-dimensional. In our method, by contrast, the covariance matrices $S_{NE}$ and $S_{NO}$ are calculated directly from the matrices $A_i \in \mathbb{R}^{D \times D}$ (similar to 2DLDA [27]), so their dimension is reduced significantly. Owing to this property, the speed of learning binary codes meets the time requirements of object tracking.
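To make the training procedure concrete, below is a minimal NumPy sketch of the batch step described by Eqs. (6), (7) and (11). It is an illustration under our own naming, not the authors' C++/OpenCV implementation: it enumerates the neighbor and non-neighbor pair sets, accumulates $S_{NE}$ and $S_{NO}$, and keeps the top-$K$ eigenvectors of $S_{NE} - S_{NO}$.

```python
import numpy as np
from itertools import permutations

def train_2d_hash(samples, labels, K=4):
    """Learn K 2D hash directions from DxD sample matrices (Eqs. (6)-(11)).

    samples: array of shape (n, D, D); labels: n binary labels.
    Returns W of shape (D, K) whose columns are the hash directions.
    """
    A_bar = samples.mean(axis=0)                      # mean matrix
    centered = samples - A_bar                        # centralized samples A_i^c

    # Ordered pairs, so (i, j) and (j, i) both occur and the
    # accumulated covariance matrices come out symmetric.
    NE = [(i, j) for i, j in permutations(range(len(samples)), 2)
          if labels[i] == labels[j]]
    NO = [(i, j) for i, j in permutations(range(len(samples)), 2)
          if labels[i] != labels[j]]

    D = samples.shape[1]
    S_NE = np.zeros((D, D))
    S_NO = np.zeros((D, D))
    for i, j in NE:                                   # Eq. (6)
        S_NE += centered[i].T @ centered[j]
    S_NE /= len(NE)
    for i, j in NO:                                   # Eq. (7)
        S_NO += centered[i].T @ centered[j]
    S_NO /= len(NO)

    # Eq. (11): top-K eigenvectors of S_NE - S_NO (symmetric by construction).
    eigvals, eigvecs = np.linalg.eigh(S_NE - S_NO)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:K]]
    return W, A_bar

def hash_code(A, W, A_bar):
    """Binary code of one DxD sample: a K x D bit matrix (Eqs. (1)-(2))."""
    return (W.T @ (A - A_bar) > 0).astype(np.uint8)
```

Note that for a 32 × 32 patch, $D = 32$, so the eigen-problem involves only a 32 × 32 matrix; this small size is what makes per-frame retraining affordable.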

4. Visual tracking with two dimensional hashing

This section introduces the framework of our tracking algorithm; the basic flow is shown in Fig. 1. Its three main parts (appearance model, tracking model and update model) are described below.

4.1. Patch-based appearance model

Given an object $O$ and the current frame $F$, the tracking task is to locate $O$ in $F$. $O$ is represented by positive templates $\{T_s^+\}_{s=1}^{T^+}$ and the background $B$ by negative templates $\{T_s^-\}_{s=1}^{T^-}$. We wish to find the scale and position of a candidate $C$ in $F$ that is most similar to the positive templates and most dissimilar to the negative templates. The similarity between candidates and templates is measured by the Hamming distance:

$$d(C, T_s) = \| h(C) - h(T_s) \|_H \qquad (12)$$

where $\|\cdot\|_H$ is the Hamming distance in Hamming space and $h(\cdot)$ denotes the hash functions. As defined in Section 3, the binary code of a candidate is a binary code matrix, and the Hamming distance between matrices can be measured by bit-wise XOR, just as for vectors. The confidence of each candidate (described below) is calculated based on Eq. (12). However, hashing the candidate rectangle directly does not describe the target best: to the best of our understanding of the tracking problem, the structure of the target is discriminative for distinguishing it from the background. To describe this structure better, we cut the target rectangle into several patches $(P_1, P_2, \dots, P_n)$, as Fig. 2 shows. Each patch $P_i$ is mapped by its own hash functions $h^i(\cdot) = (h_1^i, h_2^i, \dots, h_K^i)$. The similarity measurement of Eq. (12) is then replaced by:

$$d(C, T_s) = \sum_{i=1}^{n} \left\| h^i\left(P_i^C\right) - h^i\left(P_i^{T_s}\right) \right\|_H \qquad (13)$$

where $n$ is the number of patches and $P_i^C$ and $P_i^{T_s}$ are the $i$-th patches of candidate $C$ and template $T_s$. Finally, we define the confidence of the $j$-th candidate $C_j$ through its average distances to all templates,



Fig. 1. Basic flow of two dimensional hashing tracking.

Fig. 2. The object rectangle is cut into several patches and each patch is mapped to a binary code by its own hash functions.

$$Con(C_j) = -\frac{1}{T^+} \sum_{s=1}^{T^+} d\left(C_j, T_s^+\right) + \frac{1}{T^-} \sum_{s=1}^{T^-} d\left(C_j, T_s^-\right) \qquad (14)$$

The proposed appearance model resizes candidates and training samples to the same size (32 × 32 in the experiments), which handles rectangles of different scales, and a predefined patch partition (discussed in Section 5.2.2) is applied to all candidates and samples throughout. The patches of the target are processed separately, and the $i$-th hash functions $h^i(\cdot) = (h_1^i, h_2^i, \dots, h_K^i)$ must be trained on the $i$-th patches of the samples. This encodes the structure of the target into the appearance model and lets our tracker handle many situations successfully.

4.2. Tracking model

The proposed tracking approach follows the Bayesian framework. Given a sequence, the state $X_t = (x_t^h, x_t^v, x_t^s)$ consists of horizontal position, vertical position and scale, and $Y_t$ is the observation at frame $t$. The posterior of the state at time $t$ is computed by

$$p(X_t \mid Y_{1:t}) = p(Y_t \mid X_t) \int p(X_t \mid X_{t-1})\, p(X_{t-1} \mid Y_{1:t-1})\, dX_{t-1} \qquad (15)$$

where $Y_{1:t}$ denotes all observations from the first frame to the current one. As numerous tracking algorithms demonstrate, constructing an effective observation model $p(Y_t \mid X_t)$ is important. In our formulation,

$$p(Y_t \mid X_t) \propto Con(X_t) \qquad (16)$$

where $Con(X_t)$ is the confidence of state $X_t$ defined in Section 4.1. Besides the observation model, the motion model must also be determined. Here the motion model $p(X_t \mid X_{t-1})$ in Eq. (15) is assumed to be Gaussian in both location and scale:

$$p(X_t \mid X_{t-1}) = \mathcal{N}(X_t;\, X_{t-1},\, U) \qquad (17)$$

where $U$ denotes the diagonal covariance matrix whose elements $(\sigma_x, \sigma_y, \sigma_s)$ are the standard deviations of the X-axis location, Y-axis location and scale. The previous target state $X_{t-1} = (x_{t-1}, y_{t-1}, s_{t-1})$ serves as the mean of the Gaussian for the next frame; these elements decide how the tracker accounts for changes of location and scale. With the observation and motion models determined, the estimate $\widehat{X}_t$ of the target at the current frame is obtained by the MAP estimate over the $M$ samples:

$$\widehat{X}_t = \arg\max_{X_t^{(i)}} p\left(X_t^{(i)} \mid Y_{1:t}\right), \quad \forall i = 1, \dots, M \qquad (18)$$

where $X_t^{(i)}$ denotes the state of the $i$-th sample.
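To illustrate how Eqs. (13)-(18) fit together, here is a simplified one-frame tracking step in the same NumPy sketch style. Candidate sampling, the patch extractor and template handling are placeholder assumptions of ours, not the paper's exact implementation.

```python
import numpy as np

def hamming(code_a, code_b):
    """Bit-wise XOR distance between two binary code matrices (Eq. (12))."""
    return int(np.count_nonzero(code_a != code_b))

def confidence(cand_patches, pos_templates, neg_templates, hash_fns):
    """Eqs. (13)-(14): patch-wise Hamming distances averaged over templates.

    cand_patches: list of n patch matrices of the candidate.
    pos/neg_templates: lists of template patch lists.
    hash_fns: list of n per-patch functions mapping a patch to its code.
    """
    def dist(tmpl_patches):
        return sum(hamming(h(c), h(t))
                   for h, c, t in zip(hash_fns, cand_patches, tmpl_patches))
    d_pos = np.mean([dist(t) for t in pos_templates])
    d_neg = np.mean([dist(t) for t in neg_templates])
    return d_neg - d_pos      # high when near positives and far from negatives

def track_one_frame(prev_state, extract_patches, pos_t, neg_t, hash_fns,
                    M=300, sigmas=(8.0, 8.0, 0.02), rng=np.random):
    """Gaussian motion model (Eq. (17)) plus MAP estimate (Eq. (18))."""
    # Sample M candidate states (x, y, scale) around the previous state.
    states = prev_state + rng.randn(M, 3) * np.asarray(sigmas)
    scores = [confidence(extract_patches(s), pos_t, neg_t, hash_fns)
              for s in states]
    return states[int(np.argmax(scores))]
```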

4.3. Incremental learning model

Training samples obtained from the first frame are rarely sufficient to handle all frames of a sequence, so the hash functions need timely updates. A naive strategy is to store positive and negative samples in a buffer and use them to retrain the hash functions at each frame, which is obviously time-consuming. Instead, we use an incremental learning model to update the covariance matrices $S_{NE}$ and $S_{NO}$ defined in Section 3. These covariance matrices measure the correlation between pairwise data, which differs from the common covariance matrix in LDA and its follow-up work. Since every patch is processed separately, we take one patch as an example to introduce the update model. Assume $\{A_i, y_i\}_{i=1}^{n}$ are the samples extracted at previous frames and $\{A_i, y_i\}_{i=n+1}^{n+m}$ are the current-frame samples. We define $S_{NE}^{o}$ and $S_{NO}^{o}$ as the covariance matrices of the pair datasets $NE^{o}$ and $NO^{o}$ obtained from $\{A_i\}_{i=1}^{n}$, while $S_{NE}^{\triangle}$ and $S_{NO}^{\triangle}$ are the covariance matrices of the pair datasets $NE^{\triangle}$ and $NO^{\triangle}$ extracted from $\{A_i\}_{i=n+1}^{n+m}$. Similarly, $S_{NE}^{new}$ and $S_{NO}^{new}$ are the



covariance matrices of $NE^{new} = \{NE^{o} \cup NE^{\triangle}\}$ and $NO^{new} = \{NO^{o} \cup NO^{\triangle}\}$. $\bar{A}_o$, $\bar{A}_{\triangle}$ and $\bar{A}_{new}$ are the means of the old samples $\{A_i\}_{i=1}^{n}$, the newly arrived samples $\{A_i\}_{i=n+1}^{n+m}$ and the whole set $\{A_i\}_{i=1}^{n+m}$. Our goal is to obtain $\operatorname{tr}(W^\top (S_{NE}^{new} - S_{NO}^{new}) W)$ from the new training samples (i.e., to refresh Eq. (9)). By simple algebraic steps, we can divide the objective function into two parts and update them separately:

$$J(W) = \operatorname{tr}\left( W^\top \left(S_{NE}^{new} - S_{NO}^{new}\right) W \right) = \operatorname{tr}\left\{ W^\top S_{NE}^{new} W \right\} - \operatorname{tr}\left\{ W^\top S_{NO}^{new} W \right\} \qquad (19)$$

Theorem 1. Let $X_p = \{A_1, A_2, \dots, A_n\}$, $X_q = \{A_{n+1}, A_{n+2}, \dots, A_{n+m}\}$ and $X_w = \{X_p \cup X_q\}$. $\bar{A}_p$ and $\bar{A}_q$ are the means of $P_p$ and $P_q$, the pair datasets extracted from $X_p$ and $X_q$, and $\bar{A}_w$ is the mean of $P_w = \{P_p \cup P_q\}$. $S_p, S_q, S_w$ are the covariance matrices defined in Section 3 for $P_p, P_q, P_w$ respectively: $S_p = \sum_{(A_i, A_j) \in P_p} (A_i - \bar{A}_p)^\top (A_j - \bar{A}_p)$, $S_q = \sum_{(A_i, A_j) \in P_q} (A_i - \bar{A}_q)^\top (A_j - \bar{A}_q)$ and $S_w = \sum_{(A_i, A_j) \in P_w} (A_i - \bar{A}_w)^\top (A_j - \bar{A}_w)$. Then:

$$\operatorname{tr}\left(W^\top S_w W\right) \approx \operatorname{tr}\left( W^\top \left[ S_p + S_q + \frac{|P_p||P_q|}{|P_w|} \left(\bar{A}_q - \bar{A}_p\right)^\top \left(\bar{A}_q - \bar{A}_p\right) \right] W \right)$$

Proof. By definition we have:

$$S_w = \sum_{(A_i, A_j) \in P_p} (A_i - \bar{A}_w)^\top (A_j - \bar{A}_w) + \sum_{(A_i, A_j) \in P_q} (A_i - \bar{A}_w)^\top (A_j - \bar{A}_w)$$
$$= \sum_{(A_i, A_j) \in P_p} (A_i - \bar{A}_p + \bar{A}_p - \bar{A}_w)^\top (A_j - \bar{A}_p + \bar{A}_p - \bar{A}_w) + \sum_{(A_i, A_j) \in P_q} (A_i - \bar{A}_q + \bar{A}_q - \bar{A}_w)^\top (A_j - \bar{A}_q + \bar{A}_q - \bar{A}_w)$$
$$= S_p + |P_p| \left(\bar{A}_p - \bar{A}_w\right)^\top \left(\bar{A}_p - \bar{A}_w\right) + S_q + |P_q| \left(\bar{A}_q - \bar{A}_w\right)^\top \left(\bar{A}_q - \bar{A}_w\right) + d_p + d_q.$$

Here:

$$d_p = \sum_{(A_i, A_j) \in P_p} \left\{ (A_i - \bar{A}_p)^\top (\bar{A}_p - \bar{A}_w) + (\bar{A}_p - \bar{A}_w)^\top (A_j - \bar{A}_p) \right\}$$

$$d_q = \sum_{(A_i, A_j) \in P_q} \left\{ (A_i - \bar{A}_q)^\top (\bar{A}_q - \bar{A}_w) + (\bar{A}_q - \bar{A}_w)^\top (A_j - \bar{A}_q) \right\}$$

Let $S_{pq} = S_p + |P_p| (\bar{A}_p - \bar{A}_w)^\top (\bar{A}_p - \bar{A}_w) + S_q + |P_q| (\bar{A}_q - \bar{A}_w)^\top (\bar{A}_q - \bar{A}_w)$. Taking the trace of both sides of the above equation,

$$\operatorname{tr}\left(W^\top S_w W\right) = \operatorname{tr}\left(W^\top S_{pq} W\right) + \operatorname{tr}\left(W^\top d_p W\right) + \operatorname{tr}\left(W^\top d_q W\right)$$

where

$$\operatorname{tr}\left(W^\top d_p W\right) = \operatorname{tr}\left( W^\top \left\{ \sum_{(A_i, A_j) \in P_p} (A_i - \bar{A}_p)^\top (\bar{A}_p - \bar{A}_w) + (\bar{A}_p - \bar{A}_w)^\top (A_j - \bar{A}_p) \right\} W \right) = \operatorname{tr}\left( W^\top \left(\bar{A}_p - \bar{A}_w\right)^\top \sum_{(A_i, A_j) \in P_p} \left(A_i + A_j - 2\bar{A}_p\right) W \right)$$

and similarly $\operatorname{tr}(W^\top d_q W) = \operatorname{tr}( W^\top (\bar{A}_q - \bar{A}_w)^\top \sum_{(A_i, A_j) \in P_q} (A_i + A_j - 2\bar{A}_q)\, W )$.

In our tracking application there are two different pair-sets, $E$ (same-class) and $O$ (different-class), in $P_p$ and $P_q$, as defined in Section 3 (Fig. 3 shows the two types of pair-set). Without loss of generality, we take $P_p$ as the example. By definition, $\sum_{(A_i, A_j) \in P_p} (A_i + A_j - 2\bar{A}_p)$ consists of $\sum_{(A_i, A_j) \in E} (A_i + A_j - 2\bar{A}_p)$ and $\sum_{(A_i, A_j) \in O} (A_i + A_j - 2\bar{A}_p)$, so we must discuss whether these two sums vanish. Assume the number of negative samples is $N$ and the number of positive samples is $P$. The cardinality of the neighbor pair set is then $|E| = N(N-1) + P(P-1)$ and the cardinality of the non-neighbor pair set is $|O| = 2NP$; pair-set $E$ contains $2P(P-1)$ positive and $2N(N-1)$ negative sample occurrences, while pair-set $O$ contains $NP$ of each. We list all possible cases below:

1. $N = P$ and $\bar{A}_p = \frac{\sum_{i=1}^{P+N} A_i}{P+N}$ is the mean of all samples. Under this situation we can do some simple transforms (the first sum below runs over the positive samples and the second over the negative ones):

$$\sum_{(A_i, A_j) \in O} \left(A_i + A_j - 2\bar{A}_p\right) = 2N \sum_{i=1}^{P} A_i + 2P \sum_{i=1}^{N} A_i - 4NP\bar{A}_p$$

Because $N = P$, we easily get $\sum_{(A_i, A_j) \in O} (A_i + A_j - 2\bar{A}_p) = 0$. Similarly,

$$\sum_{(A_i, A_j) \in E} \left(A_i + A_j - 2\bar{A}_p\right) = 2(P-1) \sum_{i=1}^{P} A_i + 2(N-1) \sum_{i=1}^{N} A_i - 2\left( P(P-1) + N(N-1) \right) \bar{A}_p = 0$$

Fig. 3. Two types of pair-set.



2. $N \neq P$ and $\bar{A}_p = \frac{\bar{A}_{p^+} + \bar{A}_{p^-}}{2}$, where $\bar{A}_{p^+}$ is the mean of the positive samples and $\bar{A}_{p^-}$ is the mean of the negative samples. In this case, substituting the conditions into the above equations, only $\sum_{(A_i, A_j) \in O} (A_i + A_j - 2\bar{A}_p) = 0$ holds.

3. $N \neq P$ and $\bar{A}_p$ is the mean of all samples. In this case, neither sum vanishes.

From this analysis we conclude that only an equal number of positive and negative samples makes $\operatorname{tr}(W^\top d_p W) + \operatorname{tr}(W^\top d_q W)$ have no (or little) impact on $\operatorname{tr}(W^\top S_w W)$. Fortunately, in real applications we can easily enforce this by repeatedly sampling the smaller class. Then:

$$\operatorname{tr}\left(W^\top S_w W\right) \approx \operatorname{tr}\left(W^\top S_{pq} W\right) = \operatorname{tr}\left( W^\top \left\{ S_p + |P_p| \left(\bar{A}_p - \bar{A}_w\right)^\top \left(\bar{A}_p - \bar{A}_w\right) + S_q + |P_q| \left(\bar{A}_q - \bar{A}_w\right)^\top \left(\bar{A}_q - \bar{A}_w\right) \right\} W \right)$$

According to the definition, $\bar{A}_w = \frac{|P_p|}{|P_w|} \bar{A}_p + \frac{|P_q|}{|P_w|} \bar{A}_q$, so $\bar{A}_p - \bar{A}_w = \frac{|P_q|}{|P_w|} (\bar{A}_p - \bar{A}_q)$ and $\bar{A}_q - \bar{A}_w = \frac{|P_p|}{|P_w|} (\bar{A}_q - \bar{A}_p)$. Substituting these into the above formulation gives the final conclusion of the theorem:

$$\operatorname{tr}\left(W^\top S_w W\right) \approx \operatorname{tr}\left( W^\top \left[ S_p + S_q + \frac{|P_p||P_q|}{|P_w|} \left(\bar{A}_q - \bar{A}_p\right)^\top \left(\bar{A}_q - \bar{A}_p\right) \right] W \right) \qquad \square$$

By Theorem 1, the traces of the new covariance matrices for positive pairs and negative pairs can be calculated as follows:

$$\operatorname{tr}\left( W^\top S_{NE}^{new} W \right) = \operatorname{tr}\left( W^\top \left\{ \frac{|NE^{o}|\, S_{NE}^{o}}{|NE^{o}| + |NE^{\triangle}|} + \frac{|NE^{\triangle}|\, S_{NE}^{\triangle}}{|NE^{o}| + |NE^{\triangle}|} + \frac{|NE^{o}||NE^{\triangle}|}{\left(|NE^{o}| + |NE^{\triangle}|\right)^2} \left(\bar{A}_o - \bar{A}_{\triangle}\right)^\top \left(\bar{A}_o - \bar{A}_{\triangle}\right) \right\} W \right) \qquad (20)$$

$$\operatorname{tr}\left( W^\top S_{NO}^{new} W \right) = \operatorname{tr}\left( W^\top \left\{ \frac{|NO^{o}|\, S_{NO}^{o}}{|NO^{o}| + |NO^{\triangle}|} + \frac{|NO^{\triangle}|\, S_{NO}^{\triangle}}{|NO^{o}| + |NO^{\triangle}|} + \frac{|NO^{o}||NO^{\triangle}|}{\left(|NO^{o}| + |NO^{\triangle}|\right)^2} \left(\bar{A}_o - \bar{A}_{\triangle}\right)^\top \left(\bar{A}_o - \bar{A}_{\triangle}\right) \right\} W \right) \qquad (21)$$

In a practical implementation, $|NE^{o}|, |NE^{\triangle}|, |NO^{o}|$ and $|NO^{\triangle}|$ are constant at each frame, so we can use learning rates $\lambda_{NE}$ and $\lambda_{NO}$ to represent $|NE^{\triangle}| / (|NE^{o}| + |NE^{\triangle}|)$ and $|NO^{\triangle}| / (|NO^{o}| + |NO^{\triangle}|)$. Then:

$$\operatorname{tr}\left( W^\top S_{NE}^{new} W \right) = \operatorname{tr}\left( W^\top \left\{ (1 - \lambda_{NE})\, S_{NE}^{o} + \lambda_{NE}\, S_{NE}^{\triangle} + \lambda_{NE} (1 - \lambda_{NE}) \left(\bar{A}_o - \bar{A}_{\triangle}\right)^\top \left(\bar{A}_o - \bar{A}_{\triangle}\right) \right\} W \right) \qquad (22)$$

$$\operatorname{tr}\left( W^\top S_{NO}^{new} W \right) = \operatorname{tr}\left( W^\top \left\{ (1 - \lambda_{NO})\, S_{NO}^{o} + \lambda_{NO}\, S_{NO}^{\triangle} + \lambda_{NO} (1 - \lambda_{NO}) \left(\bar{A}_o - \bar{A}_{\triangle}\right)^\top \left(\bar{A}_o - \bar{A}_{\triangle}\right) \right\} W \right) \qquad (23)$$

And the mean of $\{A_i\}_{i=1}^{n+m}$ is updated by:

$$\bar{A}_{new} = (1 - \lambda)\, \bar{A}_o + \lambda\, \bar{A}_{\triangle} \qquad (24)$$

Here $\lambda = (\lambda_{NE} + \lambda_{NO}) / 2$ (in the experiments we set $\lambda_{NE} = \lambda_{NO}$, so $\lambda = \lambda_{NE} = \lambda_{NO}$). By Eqs. (19), (22) and (23) the objective function $J(W)$ can be updated efficiently, and the new hash functions are then obtained by Eq. (11). As this incremental learning model shows, the algorithm only needs to store $S_{NE}^{o}$, $S_{NO}^{o}$ and $\bar{A}_o$, which saves both time and space. The incremental learning is run separately for every patch's hash functions. Algorithm 1 gives the pseudo-code.

Algorithm 1. Incremental Learning Model.
Input: covariance matrices $S_{NE}^{o}, S_{NO}^{o}$, mean value $\bar{A}_o$, newly arrived dataset $\{A_i\}_{i=n+1}^{n+m}$;
Output: new projection directions $W = (w_1, w_2, \dots, w_K)$ for each patch;
1: Calculate the mean value $\bar{A}_{\triangle} = \sum_{i=n+1}^{n+m} A_i / m$.
2: Get the pair datasets $NO^{\triangle}$ and $NE^{\triangle}$ from $\{A_i\}_{i=n+1}^{n+m}$.
3: for $p = 1$ to $n$ do
4:   Calculate $S_{NE}^{\triangle} = \sum_{(A_i, A_j) \in NE^{\triangle}} (A_i^p - \bar{A}_{\triangle}^p)^\top (A_j^p - \bar{A}_{\triangle}^p) / |NE^{\triangle}|$;
5:   Calculate $S_{NO}^{\triangle} = \sum_{(A_i, A_j) \in NO^{\triangle}} (A_i^p - \bar{A}_{\triangle}^p)^\top (A_j^p - \bar{A}_{\triangle}^p) / |NO^{\triangle}|$;
6:   Calculate $S_{NE}^{new}$, $S_{NO}^{new}$ and $\bar{A}_{new}$ by Eqs. (22)-(24);
7:   Generate $W_p$ for the $p$-th patch from $S_{NE}^{new}$ and $S_{NO}^{new}$ by Eq. (11);
8: end for

5. Experiments

5.1. Experiment details

In Section 5 we compare our tracker 2DHT (Two Dimensional Hashing for visual Tracking) with eight state-of-the-art trackers on ten challenging sequences, using the source code published by the trackers' authors. The eight trackers are the compressive tracker CT [23,24], the IVT method [21,22], the complex cell tracker CCT [12], the L1APG method [2], the locality sensitive histogram tracker LSHT [7], the LSST tracker [6], the multiple instance learning tracker MIL [14] and superpixel tracking SPT [8,9]. Some experimental screenshots are shown in Figs. 8-10. The tested sequences are provided by the benchmark paper [50]. To analyze our tracker's behavior in more depth, we also design comparison experiments with different settings in Section 5.2.2. Our 2DHT is implemented in C++ with OpenCV, without any optimization, and runs at about ten frames per second on an Intel Core 2 Duo E7500 CPU with 2 GB RAM.

Two of the most popular metrics are used to evaluate the proposed tracker and the other eight state-of-the-art trackers. The first is the success rate (SR), defined as the proportion of successfully tracked frames among all frames of a sequence, with a successfully tracked frame defined as:

$$SUC = \begin{cases} 1 & \dfrac{\operatorname{area}\left(ROI_{gt} \cap ROI_{t}\right)}{\operatorname{area}\left(ROI_{gt} \cup ROI_{t}\right)} > 0.5 \\ 0 & \text{otherwise} \end{cases}$$

where $ROI_{gt}$ denotes the manually labeled ground truth and $ROI_{t}$ the tracker's result. The other is the center location error (CLE) with respect to the manually labeled ground truth:

$$CLE = \sqrt{\left(x - x_{gt}\right)^2 + \left(y - y_{gt}\right)^2}$$

where $(x, y)$ is the target location given by the tracker and $(x_{gt}, y_{gt})$ is the ground-truth position. The compared results are summarized in Tables 1 and 2.

In addition, a small number of parameters must be set in our tracker. During tracking, all image candidates and samples are resized to a 32 × 32 rectangle. The number of hash functions ($K$ in Section 3) for each patch is set to 4, so the binary code of one patch has $4 \times D$ bits, where $D$ is the width of the patch and differs between patch strategies. The learning rate $\lambda$ defined in Section 4.3 is set to 0.15-0.25 according to how strongly the target or background varies.
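As a sketch of the learning-rate form of the update, the following assumes the new-frame pair statistics $S_{NE}^{\triangle}, S_{NO}^{\triangle}$ and mean $\bar{A}_{\triangle}$ have already been computed as in steps 1-5 of Algorithm 1; the function name and signature are ours, for illustration only.

```python
import numpy as np

def incremental_update(S_NE_old, S_NO_old, A_old,
                       S_NE_new, S_NO_new, A_new_frame, lam=0.2):
    """Blend old and new-frame pair covariances with learning rate lam.

    Implements Eqs. (22)-(24) with lambda_NE = lambda_NO = lam; the rank-one
    term accounts for the shift between the old mean and the new-frame mean.
    """
    diff = A_old - A_new_frame
    corr = lam * (1.0 - lam) * diff.T @ diff            # (A_o - A_tri)^T (A_o - A_tri)
    S_NE = (1.0 - lam) * S_NE_old + lam * S_NE_new + corr   # Eq. (22)
    S_NO = (1.0 - lam) * S_NO_old + lam * S_NO_new + corr   # Eq. (23)
    A = (1.0 - lam) * A_old + lam * A_new_frame             # Eq. (24)
    return S_NE, S_NO, A
```

The new hash directions then come from the eigen-decomposition of $S_{NE}^{new} - S_{NO}^{new}$ as in Eq. (11); between frames only $S_{NE}^{o}$, $S_{NO}^{o}$ and $\bar{A}_o$ need to be stored.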



Table 1. Center location errors (in pixels) for our tracker and the state-of-the-art trackers. Bold fonts indicate the best performance; italic fonts indicate the second best.

Sequence | 2DHT | CT | IVT | CCT | L1APG | LSHT | LSST | MIL | SPT
David | 7.80 | 26.52 | 130.62 | 20.25 | 9.68 | 8.50 | 33.15 | 26.37 | 22.71
Deer | 5.12 | 113.08 | 36.05 | 5.91 | 214.48 | 100.91 | 190.17 | 61.24 | 112.87
Dog | 9.35 | 14.06 | 4.95 | 4.26 | 4.08 | 6.48 | 5.79 | 15.66 | 17.46
Dudek | 12.25 | 34.06 | 10.39 | 30.40 | 16.84 | 15.15 | 9.24 | 31.77 | 69.97
FaceOcc1 | 10.01 | 26.64 | 17.98 | 16.57 | 15.07 | 12.31 | 13.51 | 37.02 | 37.01
FaceOcc2 | 8.21 | 23.58 | 10.91 | 16.24 | 21.36 | 7.99 | 21.73 | 16.00 | 94.94
FleetFace | 20.57 | 63.64 | 60.97 | 25.77 | 86.84 | 25.15 | 30.50 | 56.54 | 196.41
Jumping | 5.08 | 7.85 | 6.98 | 5.46 | 117.33 | 52.56 | 10.53 | 6.65 | 31.84
Mhyang | 3.46 | 16.55 | 2.80 | 3.91 | 3.88 | 4.45 | 2.88 | 18.11 | 23.99
Sylvester | 5.86 | 18.95 | 52.50 | 5.89 | 38.95 | 18.72 | 29.06 | 9.70 | 27.36
Average | 8.77 | 34.49 | 33.40 | 13.46 | 52.85 | 25.22 | 34.85 | 27.90 | 63.45

Table 2. Success rate (SR) for our tracker and the state-of-the-art trackers. Bold fonts indicate the best performance; italic fonts indicate the second best.

Sequence | 2DHT | CT | IVT | CCT | L1APG | LSHT | LSST | MIL | SPT
David | 0.89 | 0.64 | 0.11 | 0.99 | 0.91 | 0.99 | 0.77 | 0.70 | 0.89
Deer | 1.00 | 0.12 | 0.57 | 1.00 | 0.04 | 0.04 | 0.04 | 0.52 | 0.08
Dog | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 0.86 | 1.00 | 0.88
Dudek | 0.97 | 0.95 | 0.65 | 0.85 | 0.57 | 0.97 | 0.96 | 0.95 | 0.68
FaceOcc1 | 1.00 | 1.00 | 1.00 | 0.98 | 0.99 | 1.00 | 1.00 | 0.99 | 0.24
FaceOcc2 | 1.00 | 1.00 | 0.48 | 0.70 | 0.32 | 1.00 | 0.49 | 1.00 | 0.02
FleetFace | 0.82 | 0.62 | 0.62 | 0.64 | 0.63 | 0.83 | 0.72 | 0.75 | 0.02
Jumping | 1.00 | 1.00 | 0.98 | 1.00 | 0.05 | 0.18 | 0.81 | 1.00 | 0.38
Mhyang | 1.00 | 0.93 | 0.92 | 1.00 | 1.00 | 1.00 | 0.95 | 0.69 | 0.56
Sylvester | 1.00 | 0.75 | 0.45 | 1.00 | 0.45 | 0.79 | 0.33 | 0.98 | 0.31
Average | 0.968 | 0.801 | 0.678 | 0.916 | 0.596 | 0.779 | 0.693 | 0.858 | 0.406
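For completeness, the two evaluation metrics of Section 5.1 are simple to compute; a sketch with boxes given as (x, y, w, h) tuples (the helper names are ours):

```python
import numpy as np

def success(roi_gt, roi_t, thresh=0.5):
    """SUC: 1 if the overlap ratio (intersection over union) exceeds 0.5."""
    ax, ay, aw, ah = roi_gt
    bx, by, bw, bh = roi_t
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return 1 if inter / union > thresh else 0

def cle(center, center_gt):
    """Center location error: Euclidean distance to the ground-truth center."""
    return float(np.hypot(center[0] - center_gt[0], center[1] - center_gt[1]))
```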

Table 3. Five patch strategies of the appearance model: the left column lists the patch grid of each strategy, the middle column the patch size, and the right column whether the patches overlap. (All candidate rectangles are resized to 32 × 32.)

Patch strategy (grid) | Patch size (pixels) | Patches overlap
2 × 2 | 16 × 16 | No
3 × 3 | 16 × 16 | Yes
4 × 4 | 8 × 8 | No
5 × 5 | 8 × 8 | Yes
7 × 7 | 8 × 8 | Yes
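The five strategies of Table 3 amount to sliding-window patch extraction over the 32 × 32 sample; below is a sketch under the assumption (ours) that the stride is chosen so the grid tiles the sample exactly, with a stride smaller than the patch size producing the overlapped strategies:

```python
import numpy as np

def extract_patches(sample, grid, patch):
    """Cut a square sample into grid x grid patches of size patch x patch."""
    size = sample.shape[0]
    stride = (size - patch) // (grid - 1) if grid > 1 else 0
    return [sample[r * stride:r * stride + patch, c * stride:c * stride + patch]
            for r in range(grid) for c in range(grid)]

# The five strategies of Table 3 on a 32x32 sample:
sample = np.zeros((32, 32))
for grid, patch in [(2, 16), (3, 16), (4, 8), (5, 8), (7, 8)]:
    patches = extract_patches(sample, grid, patch)
    assert len(patches) == grid * grid
```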

For example, the target and background in sequence Mhyang change slowly, so the learning rate can be set lower; in sequence Dudek they change quickly, so the learning rate should be set higher to adapt to the fast changes.

5.2. Quantitative analysis

5.2.1. 2DHT vs. state-of-the-art trackers
In our experiments we use the success rate and the center error to evaluate our tracking algorithm; the ground truth is obtained by manually labeling all frames of the sequences. Table 1 lists the average center error of our tracker and the eight state-of-the-art trackers. The success rate is the ratio of successfully tracked frames to total frames, and Table 2 gives the success rates of the nine trackers on the ten test sequences. As the tables show, our tracker achieves the best or second-best performance on most sequences, and the curves in Fig. 7 confirm the stability of 2DHT. The second-best tracker is CCT, which benefits from its powerful appearance model (a combination of complex cells) but is unstable on some sequences such as David, Dudek and FleetFace. MIL also performs well in both SR and CLE owing to

Fig. 4. Experimental results with different patch strategies: (a) FPS (without optimization); (b) center location error; (c) success rate.



its delicate update model. However, the appearance model of MIL is not robust, so the tracker loses the target under blur (Deer) and pose change (FleetFace). LSHT also performs well on several sequences (David, Dog, FaceOcc2 and Mhyang) thanks to its illumination invariant feature, but it cannot capture discriminative information from blurred objects, which is why it fails on Deer and Jumping. LSST and L1APG are sparse-representation-based trackers. Sparse coding has proven effective in object tracking since [1] was published, but it is a coarse way of representing the object, so its performance is limited: in our experiments these two trackers only track the object well on a few sequences. SPT achieves the worst results in Tables 1 and 2; it only considers the confidence of local parts and does not exploit global distinctiveness, which in turn confirms the importance of our patch-based appearance model. IVT and CT can be seen as the same type as our tracker (dimensionality reduction), yet our tracker performs better thanks to the techniques described above. IVT demonstrates high efficiency, and its success shows that data-driven feature selection works well in object tracking; however, IVT does not use negative samples when learning its mapping functions, so its discriminative ability is limited, and if it did take negative samples into account, its feature selection would become time-consuming with a large number of samples. In our tracker, the 2D hashing manner speeds up processing so that a large number of samples can be used to train the mapping directions in real time. CT also uses both negative and positive samples, but its mapping

functions are randomly predefined by a compressive technique. Although these mappings are guaranteed by mathematical theory, the random mapping process cannot always provide a stable appearance model, which makes CT lose the target completely on several sequences. In conclusion, the experiments in Tables 1 and 2 and Fig. 7 demonstrate that our 2DHT is a robust and efficient tracker that provides a reasonable and efficient feature selection method.

5.2.2. Comparison of 2DHT with different settings
(1) Patch strategies: The patch-based appearance model introduced in Section 4.1 is an important part of our tracker. We design six patch strategies to estimate their influence on performance. The strategies are listed in Table 3, whose left column gives the number of patches inside one rectangle, middle column the patch size (all rectangles are resized to 32 × 32) and right column whether the patches overlap. The results are shown in Fig. 4. Fig. 4(a) makes it clear that the more patches the appearance model has, the lower the tracking speed, while Fig. 4(b) and (c) show that performance rises as the number of patches increases. Consider the change from no patches to the 2 × 2 strategy: the tracking results improve considerably, which is strong evidence of the importance of the patch-based appearance model. However, the 2 × 2 strategy does not meet the precision requirement, so we increase the grid from 3 × 3 to 7 × 7. The 3 × 3 strategy has a lower FPS than 4 × 4 because the patches of

Fig. 5. Comparison of column dimension reduction and row dimension reduction on the sequences Deer, David, Jumping and Mhyang.

Fig. 6. Comparison of the trackers with the proposed hashing and with PCA hashing on the sequences Deer, Dudek, Dog, David, FaceOcc, FleetFace, Jumping and Sylvester.



Table 4. Processing time on each sequence per frame (ms): the whole process and the two main parts of our tracker, confidence calculation and hashing training.

Sequence | Confidence calculation | Hashing training | Whole process
David | 66.81 | 37.03 | 103.8
Deer | 61.61 | 34.92 | 96.54
Dog | 60.69 | 28.71 | 89.41
Dudek | 67.01 | 36.52 | 103.53
FaceOcc1 | 67.78 | 37.41 | 105.2
FaceOcc2 | 66.93 | 34.01 | 101.1
FleetFace | 73.75 | 37.04 | 110.8
Jumping | 63.71 | 30.52 | 94.25
Mhyang | 66.54 | 35.80 | 102.3
Sylvester | 64.64 | 33.78 | 98.42
Average | 65.94 | 34.57 | 100.5

3 × 3 overlap. In our view, the 4 × 4 strategy achieves the best balance between time cost and accuracy: although the 5 × 5 and 7 × 7 strategies perform slightly better, the improvement is small and sacrifices a lot of processing time.

(2) Column dimension reduction vs. row dimension reduction: In the 2D manner there are two directions of dimension reduction, by columns or by rows. The hashing method of Section 3 uses row reduction; to switch to column reduction, we only need to change the covariance matrices $S$ to $\mathbb{E}\,(A_i - \bar{A})(A_j - \bar{A})^\top$. We compare both directions on the sequences David, Deer, Mhyang and Jumping. As shown in Fig. 5, the difference between the two is not significant, because columns and rows both carry the structural information of the target. Still, we suggest testing both directions and choosing the better one when applying our method.

(3) 2DH tracker vs. PCAHash tracker: Because the hashing method in our tracker is similar in spirit to LDA, we also implement a PCAhash variant for comparison. To be fair, the 2D manner and all other settings are kept in the PCAhash method. Following the idea of PCA, the PCAhash tracker is easily obtained by letting the covariance matrix be $S = \frac{1}{M} \sum_i (A_i - \bar{A})^\top (A_i - \bar{A})$ and the objective function be $J(W) = \operatorname{tr}(W^\top S W)$. As shown in Fig. 6, the PCAhash tracker performs similarly to ours on Jumping, Deer and Dog, but on David, FaceOcc, Dudek, FleetFace and Sylvester our tracker performs much better than the

Table 5. The challenges of the tested sequences. The degree of each challenge is marked as – (not present), ✓ (present) and ✓+ (severe). PC, Occ, BC, IV and FM denote pose change, occlusion, background clutter, illumination variation and fast motion respectively.

Sequence | PC | Occ | BC | IV | FM
David | ✓ | – | ✓ | ✓+ | –
Deer | – | – | ✓ | – | ✓+
Dog | ✓ | – | – | – | –
Dudek | ✓+ | ✓ | – | – | –
FaceOcc1 | – | ✓+ | – | – | –
FaceOcc2 | ✓+ | ✓+ | – | ✓ | ✓
FleetFace | ✓+ | – | – | – | –
Jumping | – | – | ✓+ | – | ✓+
Mhyang | ✓ | – | ✓+ | ✓ | –
Sylvester | ✓+ | ✓ | ✓ | ✓ | ✓

PCAhash tracker. The reason is that PCAhash does not utilize the discriminative information of the samples: the target and background samples may not be separated after mapping, which makes it difficult for the tracker to distinguish the target from other candidates. In conclusion, the PCAhash tracker can handle sequences with simple backgrounds but tends to lose the target in complex cases, while our tracker is much more robust.

5.2.3. Computation time analysis
This section analyzes the computation time of our tracking algorithm. During tracking, the whole process divides into two parts, confidence calculation and hashing training, and Table 4 lists the computation time of both parts on each sequence. The experiments ran on the same computer mentioned in Section 5.1, without parallel optimization. As Table 4 shows, with the 4 × 4 patch strategy the tracker averages 100 ms per frame (10 FPS); a simple parallel optimization brings it to 20 FPS. The confidence calculation takes nearly 65% of the total time and remains the most time-consuming part, which underscores the value of the hashing technique for fast similarity computation. The learning part takes only 34 ms per frame on average, thanks to the 2D manner and the incremental learning designed in this paper.

For a better understanding of the time consumption, we analyze the computational complexity. Assume there are $N$ candidates and $M$ training pairs at each frame. The patch

Fig. 7. Center error curves of our tracker and eight tracking algorithms on the ten sequences.



number is $P$ and the dimensionality of a patch matrix is $d$ (equal to its number of elements). The complexity is as follows:

1. $O(PdNT)$: the tracker calculates the Hamming distance between each candidate and the templates (the number of templates is $T$) by Eq. (13). The Hamming distance can be computed faster than the Euclidean distance, and $P, d, T$ are much smaller than $N$, so this step costs little time.
2. $O(MPd^{\frac{3}{2}} + MPd)$: calculate the covariance matrices of all pairs in the 2D manner and average them. If the sample matrices were vectorized as in the original method, the complexity would be $O(2MPd^{2})$, which is unacceptable for online updating.
3. $O(MPd)$: calculate the average patch matrices of the training samples.
4. $O(Pd + Pd^{\frac{3}{2}})$: update the covariance matrices and the mean matrix.

The sum of these four parts is the total complexity, which is dominated by the first and second parts. This analysis shows that the use of hashing is an important factor in efficient tracking, and the second part further highlights the time saved by our two dimensional hashing technique.

5.3. Qualitative analysis

In this section we evaluate the trackers qualitatively. Figs. 8-10 show screenshots of the trackers' results and Fig. 7 presents the CLE curves. Several challenging situations are the key elements for testing tracker performance; Table 5 lists where they occur in each sequence.

Illumination variation: We use the sequences David and Mhyang. The target in David is a human face moving from dark to light. From frame #120 to frame #220, as shown in Figs. 8-10, CT, IVT and MIL lose the target because they have no mechanism for handling illumination. At frame #302, SPT and L1APG give imprecise results. The key challenge comes at frame #498, where most trackers lose the target completely; CCT and LSHT perform well because of their robust features, and our tracker performs best owing to its robust appearance model and real-time update model. In sequence Mhyang, most trackers get

Fig. 8. Screenshots of tracking results of our tracker and eight state-of-the-art trackers on sequences: David, Deer, Dog and Dudek.



Fig. 9. Screenshots of tracking results of our tracker and eight state-of-the-art trackers on sequences: FaceOcc1, FaceOcc2, FleetFace and Jumping.

Fig. 10. Screenshots of tracking results of our tracker and eight state-of-the-art trackers on sequences: Mhyang and Sylvester.

object well, thanks to the patch-based appearance model, even when the target is heavily occluded. CCT, CT and SPT perform well before the occlusion but drift away once it occurs. In FaceOcc2 especially, when the man's face is heavily occluded (frame #700), almost no tracker gives a precise location of the target. We notice that the discriminative-model trackers (CT, MIL and 2DHT) perform better than the generative ones in this situation, mainly because when most of the object is invisible, a tracker that locates the object by eliminating negative samples can still do well. With this characteristic our tracker handles the situation easily and efficiently; moreover, 2DHT's appearance model is more sophisticated than those of CT and MIL, which makes the proposed algorithm more robust.

Background clutter: Background clutter appears in several sequences (Deer, Jumping) and tests the discriminative ability of the tracking algorithms: trackers with global distinctiveness usually cope well. As the screenshots of Deer show (Fig. 8), all trackers except 2DHT and CCT lose the target completely because they do not capture the global structure of the target; our tracker benefits from its patch-based appearance model and handles the clutter well, while CCT also gives good results because of its appearance model. The same advantage lets 2DHT and CCT perform well on Jumping, where the other trackers cannot match them.

Overall, our tracker proves its worth as a novel tracking technique: the robust tracking results and short computation times in the experiments are strong evidence of its value.

6. Conclusion

In this paper we propose a hashing based tracking algorithm called 2DHT. General data-dependent hashing methods can greatly speed up similarity measurement, which makes them attractive for tracking; however, training hash functions on a large number of samples is still time-consuming. To overcome this bottleneck, we propose the two dimensional hashing method, which greatly reduces the dimensionality of the covariance matrices and thus significantly improves the speed of learning hash functions. Both the theoretical framework in Section 3 and the experimental evaluation in Section 5 demonstrate the effectiveness and efficiency of the 2D hashing method. To obtain global distinctiveness, a patch-based appearance model is designed to capture structural information, and it proves very successful in the experiments. During tracking, the hash functions are updated by the incremental learning model to adapt to changing situations. With these innovations, our tracker achieves strong accuracy and efficiency. In addition, the proposed two dimensional hashing technique is easy to implement and can be applied to many other computer vision tasks, such as image retrieval and object detection.



Conflict of interest

The authors declare that they have no conflict of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant Nos. 61373063, 61373062) and the project of the Ministry of Industry and Information Technology of China (Grant No. E0310/1112/02-1).

References

[1] X. Mei, H. Ling, Robust visual tracking using l1 minimization, in: 2009 IEEE 12th International Conference on Computer Vision, IEEE, 2009, pp. 1436–1443.
[2] C. Bao, Y. Wu, H. Ling, H. Ji, Real time robust l1 tracker using accelerated proximal gradient approach, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 1830–1837.
[3] W. Zhong, H. Lu, M.-H. Yang, Robust object tracking via sparsity-based collaborative model, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 1838–1845.
[4] B. Liu, J. Huang, L. Yang, C. Kulikowsk, Robust tracking using local sparse appearance model and k-selection, in: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2011, pp. 1313–1320.
[5] T. Zhang, B. Ghanem, S. Liu, N. Ahuja, Robust visual tracking via multi-task sparse learning, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 2042–2049.
[6] D. Wang, H. Lu, M.-H. Yang, Least soft-threshold squares tracking, in: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2013, pp. 2371–2378.
[7] S. He, Q. Yang, R.W. Lau, J. Wang, M.-H. Yang, Visual tracking via locality sensitive histograms, in: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2013, pp. 2427–2434.
[8] S. Wang, H. Lu, F. Yang, M.-H. Yang, Superpixel tracking, in: 2011 IEEE International Conference on Computer Vision (ICCV), IEEE, 2011, pp. 1323–1330.
[9] F. Yang, H. Lu, M.-H. Yang, Robust superpixel tracking, IEEE Trans. Image Process. 23 (4) (2014) 1639–1651.
[10] S. Oron, A. Bar-Hillel, D. Levi, S. Avidan, Locally orderless tracking, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 1940–1947.
[11] Y. Yuan, J. Fang, Q. Wang, Robust superpixel tracking via depth fusion.
[12] D. Chen, Z. Yuan, Y. Wu, G. Zhang, N. Zheng, Constructing adaptive complex cells for robust visual tracking, in: 2013 IEEE International Conference on Computer Vision (ICCV), IEEE, 2013, pp. 1113–1120.
[13] S. Hare, A. Saffari, P.H. Torr, Struck: structured output tracking with kernels, in: 2011 IEEE International Conference on Computer Vision (ICCV), IEEE, 2011, pp. 263–270.
[14] B. Babenko, M.-H. Yang, S. Belongie, Robust object tracking with online multiple instance learning, IEEE Trans. Pattern Anal. Mach. Intell. 33 (8) (2011) 1619–1632.
[15] Z. Kalal, J. Matas, K. Mikolajczyk, P–N learning: bootstrapping binary classifiers by structural constraints, in: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2010, pp. 49–56.
[16] Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, IEEE Trans. Pattern Anal. Mach. Intell. 34 (7) (2012) 1409–1422.
[17] L. Sevilla-Lara, E. Learned-Miller, Distribution fields for tracking, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 1910–1917.
[18] Z. Li, W. Wang, Y. Wang, F. Chen, Y. Wang, Visual tracking by proto-objects, Pattern Recogn. 46 (8) (2013) 2187–2201.
[19] Y. Su, Q. Zhao, L. Zhao, D. Gu, Abrupt motion tracking using a visual saliency embedded particle filter, Pattern Recogn. 47 (5) (2014) 1826–1834.
[20] X. Li, W. Hu, C. Shen, Z. Zhang, A. Dick, A.V.D. Hengel, A survey of appearance models in visual object tracking, ACM Trans. Intell. Syst. Technol. (TIST) 4 (4) (2013) 58.
[21] J. Lim, D.A. Ross, R.-S. Lin, M.-H. Yang, Incremental learning for visual tracking, in: Advances in Neural Information Processing Systems, 2004, pp. 793–800.
[22] D.A. Ross, J. Lim, R.-S. Lin, M.-H. Yang, Incremental learning for robust visual tracking, Int. J. Comput. Vision 77 (1–3) (2008) 125–141.
47 (5) (2014) 1826–1834. [20] X. Li, W. Hu, C. Shen, Z. Zhang, A. Dick, A.V.D. Hengel, A survey of appearance models in visual object tracking, ACM Trans. Intell. Syst. Technol. (TIST) 4 (4) (2013) 58. [21] J. Lim, D.A. Ross, R.-S. Lin, M.-H. Yang, Incremental learning for visual tracking, in: Advances in Neural Information Processing Systems, 2004, pp. 793–800. [22] D.A. Ross, J. Lim, R.-S. Lin, M.-H. Yang, Incremental learning for robust visual tracking, Int. J. Comput. Vision 77 (1-3) (2008) 125–141.

[23] K. Zhang, L. Zhang, M.-H. Yang, Real-time compressive tracking, in: Computer Vision – ECCV 2012, Springer, 2012, pp. 864–877. [24] K. Zhang, L. Zhang, M. Yang, Fast Compressive Tracking. [25] X. Li, C. Shen, A. Dick, A. van den Hengel, Learning compact binary codes for visual tracking, in: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2013, pp. 2419–2426. [26] J. Yang, D. Zhang, A.F. Frangi, J.-y. Yang, Two-dimensional pca: a new approach to appearance-based face representation and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 26 (1) (2004) 131–137. [27] H. Zhao, P.C. Yuen, Incremental linear discriminant analysis for face recognition, IEEE Trans. Syst. Man Cybern. B: Cybern. 38 (1) (2008) 210–221. [28] A. Gionis, P. Indyk, R. Motwani, et al., Similarity search in high dimensions via hashing, in: VLDB, vol. 99, 1999, pp. 518–529. [29] M. Datar, N. Immorlica, P. Indyk, V.S. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, in: Proceedings of the Twentieth Annual Symposium on Computational Geometry, ACM, 2004, pp. 253–262. [30] B. Kulis, P. Jain, K. Grauman, Fast similarity search for learned metrics, IEEE Trans. Pattern Anal. Mach. Intell. 31 (12) (2009) 2143–2157. [31] B. Kulis, K. Grauman, Kernelized locality-sensitive hashing for scalable image search, in: 2009 IEEE 12th International Conference on Computer Vision, IEEE, 2009, pp. 2130–2137. [32] M. Raginsky, S. Lazebnik, Locality-sensitive binary codes from shift-invariant kernels, in: Advances in Neural Information Processing Systems, 2009, pp. 1509–1517. [33] Y. Weiss, A. Torralba, R. Fergus, Spectral hashing, in: Advances in Neural Information Processing Systems, 2009, pp. 1753–1760. [34] W. Liu, J. Wang, S. Kumar, S.-F. Chang, Hashing with graphs, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 1–8. [35] C. Strecha, A.M. Bronstein, M.M. Bronstein, P. Fua, Ldahash: Improved matching with smaller descriptors, IEEE Trans. Pattern Anal. Mach. Intell. 34 (1) (2012) 66–78. [36] J. Wang, S. Kumar, S.-F. Chang, Semi-supervised hashing for large-scale search, IEEE Trans. Pattern Anal. Mach. Intell. 34 (12) (2012) 2393–2406. [37] G. Ye, D. Liu, J. Wang, S.-F. Chang, Large-scale video hashing via structure learning, in: 2013 IEEE International Conference on Computer Vision (ICCV), IEEE, 2013, pp. 2272–2279. [38] B. Zhao, E. Xing, Hierarchical feature hashing for fast dimensionality reduction, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2043–2050. [39] Y. Weiss, R. Fergus, A. Torralba, Multidimensional spectral hashing, in: Computer Vision – ECCV 2012, Springer, 2012, pp. 340–353. [40] Y. Gong, S. Lazebnik, A. Gordo, F. Perronnin, Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 35 (12) (2013) 2916–2929. [41] R. Salakhutdinov, G. Hinton, Semantic hashing, Int. J. Approx. Reason. 50 (7) (2009) 969–978. [42] Y. Kang, S. Kim, S. Choi, Deep learning to hash with multiple representations, in: ICDM, 2012, pp. 930–935. [43] G. Hinton, R. Salakhutdinov, Discovering binary codes for documents by learning deep generative models, Top. Cogn. Sci. 3 (1) (2011) 74–91. [44] Q. Shi, H. Li, C. Shen, Rapid face recognition using hashing, in: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2010, pp. 2753–2760. [45] G. Li, D. Liang, Q. Huang, S. Jiang, W. 
Gao, Object tracking using incremental 2d-lda learning and bayes inference, in: 15th IEEE International Conference on Image Processing, 2008, ICIP 2008, IEEE, 2008, pp. 1568–1571. [46] S. Pang, S. Ozawa, N. Kasabov, Incremental linear discriminant analysis for classification of data streams, IEEE Trans. Syst. Man Cybern. B: Cybern. 35 (5) (2005) 905–914. [47] T.-K. Kim, K.-Y.K. Wong, B. Stenger, J. Kittler, R. Cipolla, Incremental linear discriminant analysis using sufficient spanning set approximations, in: IEEE Conference on Computer Vision and Pattern Recognition, 2007, CVPR’07, IEEE, 2007, pp. 1–8. [48] J.-G. Wang, E. Sung, W.-Y. Yau, Incremental two-dimensional linear discriminant analysis with applications to face recognition, J. Netw. Comput. Appl. 33 (3) (2010) 314–322. [49] G.-F. Lu, J. Zou, Y. Wang, Incremental complete lda for face recognition, Pattern Recogn. 45 (7) (2012) 2510–2521. [50] Y. Wu, J. Lim, M.-H. Yang, Online object tracking: A benchmark, in: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2013, pp. 2411–2418.
