Individual adaptive metric learning for visual tracking


Neurocomputing 191 (2016) 273–285


Sihua Yi, Nan Jiang, Xinggang Wang, Wenyu Liu*

College of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, PR China

* Corresponding author. E-mail address: [email protected] (W. Liu).

Article info

Abstract

Article history: Received 18 September 2015; received in revised form 17 November 2015; accepted 19 January 2016; available online 11 February 2016. Communicated by Jinhui Tang.

Recent attempts demonstrate that learning an appropriate distance metric in visual tracking applications can improve tracking performance. However, existing metric learning methods learn and adjust the distance between all pairwise sample points in an iterative way, which raises a time-consumption issue in real-time tracking applications. To address this problem, this paper proposes a novel metric learning method and applies it to visual tracking. The main idea of the proposed method is to adapt the distance from each individual sample point to a few anchor points, instead of the distance between all pairs of samples, so as to reduce the number of distances to be adjusted. Based on this idea, we construct a convex matrix function that collapses the sample points to their class centers and maximizes the inter-class distance. Given n training samples in d-dimensional space, the problem can be solved in closed form with computational complexity O(d^2 n). This is much more computationally efficient than traditional methods, which cost O(dn^2) in each iteration (normally in tracking applications, d << n). Furthermore, the proposed method can be learned in an online manner, which accelerates the learning process and improves the matching accuracy in visual tracking applications. Experiments on UCI datasets demonstrate that the proposed learning method is comparable with traditional metric learning methods in terms of classification accuracy but much more time efficient. Comparison experiments on benchmark video sequences show that the tracking algorithm based on our learning approach outperforms state-of-the-art tracking algorithms.

Keywords: Metric learning; Classification; Visual tracking

1. Introduction

During the visual tracking process, matching the target of interest between consecutive frames is critical to tracking performance, owing to the fine and dynamic distinctions between the visual appearances of the target and the background caused by occlusion, background clutter, illumination variation, motion blur and pose change. Instead of matching under a pre-specified or fixed distance metric [1-3], recent attempts [4-10] incorporate metric learning into tracking applications: they learn and adjust the distance metric so as to yield small distances between different appearances of the target and large distances between the target's appearances and the background's appearances. These methods demonstrate that learning an appropriate distance metric in tracking applications can significantly improve tracking performance.

Normally, distance metric learning methods focus on learning and adapting a Mahalanobis distance between pairs of training samples [11-17]. This can be viewed as linearly transforming the original feature space into a new space in which the inter-class distance (calculated between pairwise samples in different classes) is constrained to be larger than the intra-class distance (calculated between pairwise samples in the same class), so that the feature samples of the target are separated from the feature samples of the background.

However, a straightforward integration of distance metric learning algorithms into tracking methods may still be problematic. Conventional metric learning needs to adjust the Mahalanobis distance between all pairs of feature samples in an iterative way. As shown in Fig. 1(a), for n training samples of dimension d, there are O(n^2) distances to be adjusted in every iteration, i.e., a single iteration of looping through all constraints costs O(dn^2) [18,19]. With the increase of available samples during a tracking process, a straightforward concatenation of these metric learning algorithms and tracking methods may lead to a serious time-complexity issue. Additionally, a change in the distance between any sample pair causes changes in the distances between other sample pairs, i.e., the adjustment of the distance between any pair of samples affects the other distances. As a result, the objective of adapting all pairwise samples simultaneously may be difficult to achieve.

To address these two problems, this paper proposes a novel metric learning approach, which adapts the distance from each individual sample point to a few anchor points instead of the distances between all pairwise samples, so as to reduce the number of distances to be adapted.

Fig. 1. (a) Conventional metric learning methods adapt the distance between all pairs of sample points; for n training samples, there are n(n-1)/2 distances to be adjusted. (b) The proposed method adapts the distance from each individual sample point to its congeneric center and to its heterogeneric center; for n training samples, only 2n distances need to be adjusted. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

As shown in Fig. 1(b), for any sample point, when selecting its class mean (called the congeneric center) and the mean of the points belonging to the other classes (called the heterogeneric center) as anchor points, the intra-class distance can be expressed by the distances from the feature samples to their congeneric centers, and the inter-class distance can be expressed by the distances from the samples to their heterogeneric centers. The number of distances to be adapted is then significantly reduced. Ideally, under an optimal distance metric, sample points are collapsed to their congeneric centers and placed farthest from the other classes. We therefore construct a convex function that projects most sample points to the congeneric centers through a sparsity regularization and maximizes the inter-class distance. This matrix function has a closed-form solution with computational complexity O(d^2 n). Normally, in tracking applications, the number n of training samples is much larger than the feature dimension d [20-25], so the proposed method is much more computationally efficient than traditional metric learning methods (which cost O(dn^2) in each iteration, as discussed above). Intuitively, the distances adjusted on individual samples (the magenta and golden lines in Fig. 1) are far fewer than the distances adjusted on pairwise training samples (the red lines in Fig. 1), and the adjustment for each individual point is independent of the others. Thus the optimal solution of the proposed individual adaptive metric learning (IAML) algorithm is also easier to achieve than that of conventional metric learning algorithms. Furthermore, IAML can be learned in an online manner, which accelerates the learning process and improves the matching accuracy in visual tracking applications. We combine IAML with the binary classification based tracking framework [7,9,24,26]. Experimental results on both synthetic data and benchmark video sequences show that IAML improves the matching accuracy, and its application to visual tracking outperforms state-of-the-art tracking methods.

This paper contributes to the research on metric learning based tracking in the following ways. (1) A novel distance metric learning method is proposed, which reduces computational complexity by adjusting the distance from each individual sample to the congeneric center and the heterogeneric center instead of the distances between all pairs of sample points. (2) Based on the idea of adapting individual sample points, we construct a convex matrix function, which collapses the sample points to their respective congeneric centers and maximizes the inter-class distance. This function has a closed-form solution with low computational complexity. (3) An online version of the proposed learning method is presented, which accelerates the learning process and improves tracking accuracy in visual tracking applications. (4) The proposed metric learning method is incorporated into a visual tracking framework, and the performance improvement is demonstrated by extensive experiments.

The rest of the paper is organized as follows. Section 2 provides an overview of related visual tracking and metric learning algorithms. The formulation and solution of our metric learning method are given in Section 3. The use of our metric learning method in visual tracking is presented in Section 4. Experimental results are given in Section 5, and the paper is concluded in Section 6.

2. Related work

Recent attempts to incorporate subspace learning into visual tracking have produced encouraging results, and various objectives for subspace adaption have been proposed. For example, Wang et al. [27] parameterize the instantaneous image motion by a subspace motion model. Ho et al. [28] propose linear subspaces to represent the appearance of the target. Lim et al. [29,30] incrementally learn a low-dimensional eigenspace representation to reflect appearance changes of the target. Elgammal et al. [31-33] characterize the target with a manifold structure. Liu et al. [34] segment the subspaces by low-rank representation. Mei et al. [35-37] model the target's appearance by sparse approximation over template sets. Li et al. [38] transform the feature space to a 3D-DCT subspace. Shirazi et al. [39] learn the target's appearance in a non-Euclidean geometry subspace by a Grassmann approach. However, these methods do not take the background information into account, which may lead to tracking failure when the background shares a similar appearance with the foreground.

By contrast, an increasing number of trackers adopt binary classifiers to separate the target from its background. For example, Collins et al. [20] adapt the color space to distinguish between the target and its surrounding environment. Grabner et al. [21,40] track the target with an online boosting classifier. Babenko et al. [22] learn multiple instances with bag information. Bai et al. [23] distinguish between the target and the background by a weakly supervised ranking support vector machine. Zhang et al. [24] learn multiple weak classifiers to separate the target and the background in each dimension. Yao et al. [41] represent the visual appearance by a Gaussian mixture model and distinguish the foreground from the background by the KL divergence. Henriques et al. [25] classify the target and the background with a kernelized correlation filter.

Based on this classification strategy, learning the distance metric between the feature samples representing the target and the background has produced encouraging results, and various objectives for metric adjustment have been proposed. For example, Wang et al. [4] learn the distance metric by collapsing classes [19]. Jiang et al. [5] learn the metric by maximizing the k-NN classification accuracy [18] for differential tracking. There are many other distance metric learning algorithms [11-16,42-44] applied in the context of visual tracking, but they only provide alternatives to the above-mentioned metric-adaptive methods.


However, a straightforward combination of existing distance metric learning algorithms and visual tracking is still problematic, since existing metric learning methods need to optimize the Mahalanobis distance between pairwise samples iteratively, which is computationally expensive. Although the method in [8] tries to ease the computational burden through a sparsity regularization, it is still very time consuming. This limits the utilization of metric learning in visual tracking and may cause unsatisfactory tracking results. To reduce the computational complexity, this paper presents a novel metric learning approach which has a closed-form solution with low computational complexity and can be learned in an online manner, accelerating the learning process and improving the tracking accuracy.

3. Proposed metric learning method

To improve the computational efficiency of distance metric learning based visual tracking methods, we propose a novel metric learning approach, which adapts the distance from each individual sample point to the congeneric center and the heterogeneric center instead of the distances between all pairwise samples. Under the optimal distance metric, the distance between each sample point and its congeneric center (i.e., the intra-class distance) should approach zero, while the distances from the samples to the heterogeneric centers (i.e., the inter-class distance) should be maximized.

3.1. Formulation

Given a supervised sample set $\{x_i, c_i\}_{i=1}^{n}$ in the $d$-dimensional space $\mathbb{R}^d$, where $x_i$ represents the feature vector and $c_i$ is its corresponding class label, the Mahalanobis distance $d_M(x_i, x_j)$ between any pair of training samples $x_i$ and $x_j$ is characterized by a matrix $L$ as follows:

$$d_M(x_i, x_j) = (Lx_i - Lx_j)^\top (Lx_i - Lx_j) \quad (1)$$

which can be viewed as projecting the feature samples $\{x_i\}_{i=1}^{n} \in \mathbb{R}^d$ into a new subspace through a linear projection $L \in \mathbb{R}^{p \times d}$, where $p$ is the dimension of the projected space. The inter-class distance $D_{\mathrm{inter}}$ is formulated as

$$D_{\mathrm{inter}} = \sum_{k=1}^{\kappa} \sum_{c_i = k} d_M(x_i, \bar{\mu}_i) = \sum_{k=1}^{\kappa} \sum_{c_i = k} (Lx_i - L\bar{\mu}_i)^\top (Lx_i - L\bar{\mu}_i) = \lVert L(X - M_{\bar{C}}) \rVert_F^2 \quad (2)$$

where $\bar{\mu}_i$ denotes the heterogeneric center of the sample point $x_i$, i.e., the mean of the samples $\{x_j\}_{c_j \neq c_i}$, $X = [x_1, \ldots, x_n]$ is the $d \times n$ sample matrix, $M_{\bar{C}}$ is a $d \times n$ matrix whose $i$th column is $\bar{\mu}_i$, and $\lVert \cdot \rVert_F$ is the Frobenius norm. The intra-class distance $D_{\mathrm{intra}}$ is calculated by

$$D_{\mathrm{intra}} = \sum_{k=1}^{\kappa} \sum_{c_i = k} d_M(x_i, \mu_i) = \sum_{k=1}^{\kappa} \sum_{c_i = k} (Lx_i - L\mu_i)^\top (Lx_i - L\mu_i) = \lVert L(X - M_C) \rVert_F^2 \quad (3)$$

where $\mu_i$ denotes the congeneric center of $x_i$, i.e., the mean of the class that $x_i$ belongs to, and $M_C$ is a $d \times n$ matrix whose $i$th column is $\mu_i$.

Let $z_i = L(x_i - \mu_i)$. Notice that, for any pair of $x_i$ and $\mu_i$, if $z_i$ is an all-zero vector, the projection $Lx_i$ is mapped onto the congeneric center. So maximizing the number of samples collapsed to their class means is equivalent to maximizing the number of all-zero columns of $L(X - M_C)$. Let $Z = L(X - M_C)$, $\tilde{Z} = Z^\top Z$, $\tilde{Z}_i$ be the $i$th row of $\tilde{Z}$, and $Z_i$ be the $i$th column of $Z$. It is easy to show that

$$Z_i \equiv 0 \iff \tilde{Z}_i \equiv 0 \quad (4)$$

Thus maximizing the number of samples collapsed to their congeneric centers, i.e., the number of all-zero vectors $L(x_i - \mu_i)$, can be transformed into maximizing the number of all-zero rows of $\tilde{Z}$. According to [45], maximizing the number of all-zero rows of $\tilde{Z}$ can be treated as minimizing the $(2,1)$-norm of $\tilde{Z}$ (computed by first taking the 2-norm of each row $\tilde{Z}_i$ and then the 1-norm of the resulting vector $[\lVert \tilde{Z}_1 \rVert_2, \ldots, \lVert \tilde{Z}_n \rVert_2]$), denoted $\lVert \tilde{Z} \rVert_{(2,1)}$. In summary, maximizing the inter-class distance and collapsing most training sample points to their congeneric centers can be formulated as follows:

$$\max_L \; \lVert L(X - M_{\bar{C}}) \rVert_F^2, \qquad \min_L \; \lVert \tilde{Z} \rVert_{(2,1)} \quad (5)$$

3.2. Closed form solution

In our previous work [8], we have shown that minimizing $\lVert \tilde{Z} \rVert_{(2,1)}$ can be treated as minimizing $\operatorname{tr}(\tilde{Z}\tilde{Z}^\top)$, where $\operatorname{tr}(\cdot)$ is the trace function. Therefore maximizing the number of samples collapsed to the congeneric centers can be transformed into minimizing $\operatorname{tr}(\tilde{Z}\tilde{Z}^\top)$, which expands as

$$\operatorname{tr}(\tilde{Z}\tilde{Z}^\top) = \lVert \tilde{Z} \rVert_F^2 = \lVert (LX - LM_C)^\top (LX - LM_C) \rVert_F^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} \big( (Lx_i - L\mu_i)^\top (Lx_j - L\mu_j) \big)^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} (z_i^\top z_j)^2 \quad (6)$$

From Eq. (6), minimizing $\operatorname{tr}(\tilde{Z}\tilde{Z}^\top)$ is essentially minimizing $(z_i^\top z_j)^2$ for every $i$ and $j$, which drives $\{z_i\}_{i=1}^{n}$ toward all-zero vectors, i.e., projects all feature samples $\{Lx_i\}_{i=1}^{n}$ to their congeneric centers $\{L\mu_i\}_{i=1}^{n}$. Then Eq. (5) is transformed into

$$\max_L \; \lVert L(X - M_{\bar{C}}) \rVert_F^2, \qquad \min_L \; \operatorname{tr}(\tilde{Z}\tilde{Z}^\top) \quad (7)$$

Denoting $A = L^\top L$, we have

$$\lVert L(X - M_{\bar{C}}) \rVert_F^2 = \operatorname{tr}\big( (X - M_{\bar{C}})^\top A (X - M_{\bar{C}}) \big), \qquad \operatorname{tr}(\tilde{Z}\tilde{Z}^\top) = \lVert (X - M_C)^\top A (X - M_C) \rVert_F^2 \quad (8)$$

Then Eq. (7) can be transformed into the convex problem

$$\max_A f(A) \quad \text{s.t.} \quad A \in S_+^d, \qquad \text{where } f(A) = \operatorname{tr}\big( (X - M_{\bar{C}})^\top A (X - M_{\bar{C}}) \big) - \lambda \lVert (X - M_C)^\top A (X - M_C) \rVert_F^2 \quad (9)$$

where $\lambda$ is a weighting parameter and $S_+^d$ denotes the set of positive semi-definite $d \times d$ matrices; $A \in S_+^d$ guarantees that the Mahalanobis distance is non-negative. Eq. (9) can be differentiated with respect to $A$ as

$$\frac{\partial f(A)}{\partial A} = \bar{X}\bar{X}^\top - 2\lambda \, YY^\top A \, YY^\top, \qquad \text{where } \bar{X} = X - M_{\bar{C}}, \;\; Y = X - M_C \quad (10)$$

Normally, $Y$ is of full rank, so $YY^\top$ is a full-rank square matrix and hence invertible. Then the closed-form solution of Eq. (9) is

$$A = \frac{1}{2\lambda} (YY^\top)^{-1} (\bar{X}\bar{X}^\top) (YY^\top)^{-1} \quad (11)$$

From Eq. (11), $A$ is insensitive to $\lambda$; hence we fix $\lambda$ to 1 throughout all the experiments. Let $\tilde{L} = \bar{X}^\top (YY^\top)^{-1}$. It is easy to demonstrate that

$$A = \tilde{L}^\top \tilde{L} \quad (12)$$

which is a positive semi-definite matrix.

As discussed before, metric learning can be viewed as projecting the feature samples $\{x_i\}_{i=1}^{n} \in \mathbb{R}^d$ into a new subspace through a linear projection $L \in \mathbb{R}^{p \times d}$. When $A$ is obtained by Eq. (11), the dimension of the projected space (i.e., the dimension of $L$), which depends on the rank of $A$, is optimized adaptively.

Eq. (11) essentially adapts the distance from each sample point $x_i$ to its congeneric center $\mu_i$ and the distance from $x_i$ to its heterogeneric center $\bar{\mu}_i$ simultaneously. Given a labeled sample set $\{x_i\}_{i=1}^{n}$, the centers $\{\mu_i\}_{i=1}^{n}$ and $\{\bar{\mu}_i\}_{i=1}^{n}$ are fixed, so the distance adjustment for any pair $(x_i, \bar{\mu}_i)$ or $(x_i, \mu_i)$ is independent of the adjustments for the other pairs. By contrast, conventional metric learning methods have to adapt the distances between $n(n-1)/2$ pairs of points, and each distance adjustment may affect the others. Thus the proposed approach reaches the optimal solution more easily than conventional methods.
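To make the closed form concrete, the following is a minimal NumPy sketch of Eqs. (11) and (12). It is not the authors' released code: the names (`iaml_fit`, `mahalanobis`, `lam`, `eps`) are illustrative, and the small ridge `eps` added to $YY^\top$ is our own safeguard for the case where that matrix is numerically rank-deficient.

```python
import numpy as np

def iaml_fit(X, labels, lam=1.0, eps=1e-6):
    """Closed-form IAML metric of Eqs. (11)-(12); X is d x n, labels has length n."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    d, n = X.shape
    M_c = np.empty_like(X)     # congeneric centers: column i = mean of x_i's own class
    M_cbar = np.empty_like(X)  # heterogeneric centers: column i = mean of the other classes
    for k in np.unique(labels):
        idx = labels == k
        M_c[:, idx] = X[:, idx].mean(axis=1, keepdims=True)
        M_cbar[:, idx] = X[:, ~idx].mean(axis=1, keepdims=True)
    Y = X - M_c                # intra-class deviations (Eq. (3))
    Xbar = X - M_cbar          # inter-class deviations (Eq. (2))
    YYt_inv = np.linalg.inv(Y @ Y.T + eps * np.eye(d))
    A = YYt_inv @ (Xbar @ Xbar.T) @ YYt_inv / (2.0 * lam)  # Eq. (11)
    L_tilde = Xbar.T @ YYt_inv   # Eq. (12): A equals L~^T L~ up to the 1/(2*lam) scale
    return A, L_tilde

def mahalanobis(A, u, v):
    """Squared Mahalanobis distance (u - v)^T A (u - v), as in Eq. (1)."""
    diff = np.asarray(u) - np.asarray(v)
    return float(diff @ A @ diff)
```

Forming the two outer products costs $O(d^2 n)$ and the inverse costs $O(d^3)$, matching the complexity analysis below.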


Fig. 2. The framework of our tracking algorithm. Red and green bounding boxes represent the target (positive samples) and background samples (negative samples), respectively. They are collected from previous video frames. Blue boxes represent the target candidates in the current frame, and the cyan box represents the target estimated by our algorithm. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

Fig. 3. The linear transformation of the Optdigits dataset. Samples from different classes are shown in different colors. (a) Feature samples in the original space. (b) Sample projection by LDA with 10 discriminant vectors. (c) Sample projection by LDA with 7 discriminant vectors. (d) Sample projection by LDA with 4 discriminant vectors. (e) Sample projection by LDA with 2 discriminant vectors. (f) Sample projection by IAML. To clearly show the difference between LDA and IAML, we only plot 6 classes of samples from Optdigits. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)


The major time cost of Eq. (11) is spent on computing $\bar{X}\bar{X}^\top$, $YY^\top$, and $(YY^\top)^{-1}$, which respectively scale as $O(d^2 n)$, $O(d^2 n)$ and $O(d^3)$. In tracking applications, to cover all possible appearances of the target as well as the background, the number of training samples should be large, while the feature dimension can be compressed by dimension-reduction methods such as PCA or random projection [46]. The number of training samples is therefore always much larger than the feature dimension, i.e., $n \gg d$, and the complexity of Eq. (11) is $O(d^2 n)$ without any iteration. This is much more computationally efficient than conventional iterative metric learning methods (as discussed in Section 1, most existing metric learning methods cost $O(dn^2)$ in each iteration). However, in visual tracking applications the number of training samples grows fast during the tracking process, i.e., $n$ becomes very large, which still causes time and space complexity issues. Hence, to apply our metric learning algorithm in visual tracking, we propose an online algorithm to update $\bar{X}\bar{X}^\top$ and $YY^\top$ in the next subsection.

Table 1
Detailed information of the UCI benchmark datasets: name, number of samples, feature dimension, and number of classes.

Name       | Samples | Dimensions | Classes
Iris       | 150     | 4          | 3
Wine       | 178     | 13         | 3
Soybean    | 302     | 35         | 19
Balance    | 625     | 4          | 3
Fertility  | 100     | 9          | 2
Pima       | 768     | 8          | 2
Seeds      | 210     | 7          | 3
Australian | 690     | 14         | 2
Heart      | 270     | 13         | 2
Semeion    | 1593    | 256        | 10
CNAE-9     | 1080    | 856        | 9
Optdigits  | 3823    | 64         | 10

Fig. 4. The k-NN classification error rates of the metric learning methods on the UCI datasets.

3.3. Online solution

Considering the appearance change during the tracking process, recent observations are more likely to be indicative of the target's appearance than old ones. Thus the online metric learning algorithm needs to incrementally update the distance metric with the newly added training samples and, at the same time, decrementally adapt the metric to remove the oldest samples.

Let $\{x_i\}_{i=1}^{n_c}$ be the training samples at the current time, $\{x_i\}_{i=1}^{n_r}$ be the old samples that need to be removed, $\{x_i\}_{i=n_c+1}^{n_c+n_a}$ be the newly added samples, $m_k$ be the sample mean of the $k$th class, $m_{kc}$ be the sample mean of the $k$th class at the current time, $m_{kr}$ be the sample mean of the $k$th class to be removed, and $m_{ka}$ be the sample mean of the $k$th class being newly added. Then the number of training samples in the $k$th class becomes $n_k = n_{kc} - n_{kr} + n_{ka}$. At the current step, $YY^\top$ is updated as

$$YY^\top = [X - M_C][X - M_C]^\top = XX^\top + M_C M_C^\top - X M_C^\top - M_C X^\top = XX^\top + \sum_{k=1}^{\kappa} n_k m_k m_k^\top - \sum_{k=1}^{\kappa} \sum_{c_i=k} x_i m_k^\top - \sum_{k=1}^{\kappa} \sum_{c_i=k} m_k x_i^\top = \sum_{i=n_r+1}^{n_c+n_a} x_i x_i^\top - \sum_{k=1}^{\kappa} n_k m_k m_k^\top$$

$$\text{where } \begin{cases} \sum_{i=n_r+1}^{n_c+n_a} x_i x_i^\top = \sum_{i=1}^{n_c} x_i x_i^\top - \sum_{i=1}^{n_r} x_i x_i^\top + \sum_{i=n_c+1}^{n_c+n_a} x_i x_i^\top \\ m_k = (n_{kc} m_{kc} - n_{kr} m_{kr} + n_{ka} m_{ka}) / n_k \end{cases} \quad (13)$$

Here $\sum_{i=1}^{n_c} x_i x_i^\top$ and $m_{kc}$ are carried over from the last step; updating $\sum_{i=n_r+1}^{n_c+n_a} x_i x_i^\top$ and $m_k$ costs $O(d^2)$ and $O(d)$, respectively, while computing $m_k m_k^\top$, $\sum_{i=n_c+1}^{n_c+n_a} x_i x_i^\top$ and $\sum_{i=1}^{n_r} x_i x_i^\top$ costs $O(d^2)$, $O(d^2 n_a)$ and $O(d^2 n_r)$, respectively. During the updating process, $n_a$ (the number of newly added samples at each time) and $n_r$ (the number of old samples removed at each time) can be manually kept small. Hence the computational complexity of updating $YY^\top$ remains small.

Similar to updating $YY^\top$, let $m$ be the mean of all training samples, $m_c$ be the mean of $\{x_i\}_{i=1}^{n_c}$, $m_r$ be the mean of $\{x_i\}_{i=1}^{n_r}$, $m_a$ be the mean of $\{x_i\}_{i=n_c+1}^{n_c+n_a}$, $\bar{m}_k$ be the mean of $\{x_i\}_{c_i \neq k}$, $\bar{m}_{kc}$ be the mean of $\{x_i\}_{i=1, c_i \neq k}^{n_c}$, $\bar{m}_{kr}$ be the mean of $\{x_i\}_{i=1, c_i \neq k}^{n_r}$, and $\bar{m}_{ka}$ be the mean of $\{x_i\}_{i=n_c+1, c_i \neq k}^{n_c+n_a}$. The total number of training samples is $n = n_c - n_r + n_a$. At the current step, $\bar{X}\bar{X}^\top$ is updated by

$$\bar{X}\bar{X}^\top = [X - M_{\bar{C}}][X - M_{\bar{C}}]^\top = XX^\top + M_{\bar{C}} M_{\bar{C}}^\top - X M_{\bar{C}}^\top - M_{\bar{C}} X^\top = XX^\top + \sum_{k=1}^{\kappa} n_k \bar{m}_k \bar{m}_k^\top - \sum_{k=1}^{\kappa} \sum_{c_i=k} x_i \bar{m}_k^\top - \sum_{k=1}^{\kappa} \sum_{c_i=k} \bar{m}_k x_i^\top = \sum_{i=n_r+1}^{n_c+n_a} x_i x_i^\top + \sum_{k=1}^{\kappa} n_k \big( \bar{m}_k \bar{m}_k^\top - m_k \bar{m}_k^\top - \bar{m}_k m_k^\top \big)$$

$$\text{where } \begin{cases} \bar{m}_k = (n m - n_k m_k) / (n - n_k) \\ m = (n_c m_c - n_r m_r + n_a m_a) / n \end{cases} \quad (14)$$

where $m_k$ is updated by Eq. (13); updating $m$, $\bar{m}_k$, $m_k \bar{m}_k^\top$ and $\bar{m}_k m_k^\top$ costs $O(d)$, $O(d)$, $O(d^2)$ and $O(d^2)$, respectively. Since the computational complexity of Eqs. (13) and (14) can be kept small, our method is able to learn and adjust the distance metric for large amounts of training samples during the visual tracking process.
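As a concrete illustration of the bookkeeping behind Eqs. (13) and (14), the sketch below maintains $\sum_i x_i x_i^\top$ together with per-class counts and sums, and rebuilds $YY^\top$ and $\bar{X}\bar{X}^\top$ from them. The class and variable names are illustrative (this is not the authors' implementation), class indices are assumed 0-based, and at least two non-empty classes are assumed.

```python
import numpy as np

class OnlineIAMLStats:
    """Running statistics behind Eqs. (13) and (14): S = sum_i x_i x_i^T plus
    per-class counts and sums, from which YY^T and Xbar Xbar^T are rebuilt."""

    def __init__(self, d, num_classes):
        self.S = np.zeros((d, d))                # sum of x x^T over all kept samples
        self.n_k = np.zeros(num_classes)         # per-class sample counts
        self.sum_k = np.zeros((num_classes, d))  # per-class feature sums

    def add(self, x, k):                         # incremental step of Eq. (13)
        self.S += np.outer(x, x)
        self.n_k[k] += 1
        self.sum_k[k] += x

    def remove(self, x, k):                      # decremental step of Eq. (13)
        self.S -= np.outer(x, x)
        self.n_k[k] -= 1
        self.sum_k[k] -= x

    def scatter_matrices(self):
        """Rebuild YY^T (Eq. (13)) and Xbar Xbar^T (Eq. (14)); assumes at least
        two non-empty classes so the heterogeneric means are well defined."""
        n = self.n_k.sum()
        m = self.sum_k.sum(axis=0) / n           # overall mean
        YYt = self.S.copy()
        XbarXbart = self.S.copy()
        for k in range(len(self.n_k)):
            if self.n_k[k] == 0:
                continue
            m_k = self.sum_k[k] / self.n_k[k]
            m_bar_k = (n * m - self.n_k[k] * m_k) / (n - self.n_k[k])
            YYt -= self.n_k[k] * np.outer(m_k, m_k)
            XbarXbart += self.n_k[k] * (np.outer(m_bar_k, m_bar_k)
                                        - np.outer(m_k, m_bar_k)
                                        - np.outer(m_bar_k, m_k))
        return YYt, XbarXbart
```

Each `add` or `remove` costs $O(d^2)$, so the per-frame cost is governed only by the numbers $n_a$ and $n_r$ of samples added and removed, as required by the complexity analysis above.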


Table 2
Time consumption (seconds). Bold values indicate the best performance.

Datasets | lda | mmc | nca | mcml | lmnn | itml | boost | ours
Iris  | 0.0002 | 2.2484 | 0.0440 | 96.7203 | 0.9247 | 0.0317 | 0.0224 | 0.0001
Wine  | 0.0003 | 87.4613 | 10.6632 | 0.7814 | 10.6475 | 0.3167 | 2.0941 | 0.0001
Soyb  | 0.0024 | 5.3615 | 2.4851 | 575.3617 | 3.7304 | 0.6845 | 2.0459 | 0.0007
Bal2  | 0.0003 | 13.5636 | 2.6221 | 6.5719 | 0.3484 | 0.4711 | 0.0103 | 0.0001
Fert  | 0.0002 | 0.6536 | 0.0442 | 45.5056 | 1.4164 | 0.0512 | 0.0334 | 0.0001
Pima  | 0.0003 | 8.9261 | 20.6144 | 6.1547 | 0.6897 | 0.0234 | 0.0523 | 0.0002
Seed  | 0.0002 | 1.3536 | 0.1782 | 25.5424 | 1.3417 | 0.4839 | 0.7524 | 0.0001
Aust  | 0.0004 | 129.0449 | 163.7497 | 6.0617 | 1.4735 | 0.0415 | 0.2119 | 0.0002
Heart | 0.0003 | 1.5737 | 0.5547 | 1.3967 | 0.5114 | 0.0936 | 0.0781 | 0.0001
Seme  | 0.0056 | 2439.1889 | 858.5701 | 1281.2768 | 9.6135 | 1.0948 | 61.6817 | 0.0025
CNAE  | 0.0086 | 43.4928 | 66.7856 | 597.6320 | 83.8927 | 2.0648 | 0.0241 | 0.0037
Optd  | 0.0098 | 3343.2573 | 1097.1364 | 2814.2247 | 51.7529 | 1.2532 | 209.2137 | 0.0071

Fig. 5. Tracking results of each learning-based tracker on video sequences in which the illumination changes significantly. The sequences from top to bottom are Car4, Fish, Shaking and Trellis, respectively.

3.4. Difference to linear discriminant analysis

Linear discriminant analysis (LDA), proposed in [47,48], linearly projects the training samples into a subspace constructed by an optimal set of discriminant vectors, which maximizes the inter-class scatter and minimizes the intra-class scatter simultaneously. In this subsection, we present the main differences between IAML and LDA.

Given a $\kappa$-class sample set $\{x_i\}_{i=1}^{n}$, multi-class LDA seeks optimal discriminant vectors $\{\omega_j\}_{j=1}^{J}$ by optimizing

$$\max_{\omega_j} \frac{\omega_j^\top S_{\mathrm{inter}} \, \omega_j}{\omega_j^\top S_{\mathrm{intra}} \, \omega_j}, \qquad \text{where } \begin{cases} S_{\mathrm{inter}} = \sum_{k=1}^{\kappa} n_k (m_k - m)(m_k - m)^\top \\ S_{\mathrm{intra}} = \sum_{k=1}^{\kappa} \sum_{c_i=k} (x_i - m_k)(x_i - m_k)^\top \end{cases} \quad (15)$$

where $S_{\mathrm{inter}}$ denotes the inter-class scatter matrix and $S_{\mathrm{intra}}$ the intra-class scatter matrix. By solving Eq. (15), the discriminant vectors $\{\omega_j\}_{j=1}^{J}$ are the $J$ eigenvectors of $S_{\mathrm{intra}}^{-1} S_{\mathrm{inter}}$ corresponding to its $J$ largest eigenvalues, where $J$ is smaller than the number of classes $\kappa$ (a minimal LDA sketch is given after the comparison below). The main differences between LDA and IAML are as follows:

- The objective of LDA is to optimize the discriminant vectors $\{\omega_j\}_{j=1}^{J}$. When projecting feature samples onto these vectors, the number $J$ of eigenvectors, i.e., the dimension of the projected subspace constructed by the discriminant vectors, needs to be pre-specified. However, it is difficult to predetermine the optimal dimension of the projected space [8], and this dimension-determination issue limits the classification performance of LDA. By contrast, IAML directly projects the feature samples into a Mahalanobis subspace that separates the different classes, and optimizes the dimension of this subspace adaptively.
- According to Eq. (15), multi-class LDA defines the inter-class distance through the distances between the class centers $\{m_k\}_{k=1}^{\kappa}$ and the total center $m$ of all samples. When the number of samples in the $k$th class is much larger than in the other classes, its class center will be close to the total center, i.e., $m_k - m$ approaches an all-zero vector, which drives $\omega_j^\top (m_k - m)(m_k - m)^\top \omega_j$ toward zero. The $k$th class may then be difficult to separate from the other classes. In contrast, IAML defines the inter-class distance through the distance between each individual point $x_i$ and its heterogeneric center $\bar{\mu}_i$, which is independent of the number of samples in each class.
- From Eq. (15), LDA minimizes the intra-class distance by minimizing the sum of the distances from each sample to its class mean. This may leave some samples close to their class centers while others remain far away, and samples far from their class centers may be mixed up with samples of other classes, which limits the classification performance of LDA. Based on Eq. (5), IAML instead maximizes the number of samples collapsed to the congeneric centers, i.e., the number of samples correctly classified.
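The following is a compact multi-class LDA baseline matching Eq. (15), assuming NumPy and SciPy are available; the function name and the small regularizer added to $S_{\mathrm{intra}}$ are our own illustrative choices. Note that the subspace dimension $J$ must be supplied by hand, which is exactly the dimension-determination issue raised in the first point above.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, labels, J):
    """Multi-class LDA (Eq. (15)): the J leading generalized eigenvectors
    of (S_inter, S_intra); X is d x n and J must be pre-specified."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    d, n = X.shape
    m = X.mean(axis=1)
    S_inter = np.zeros((d, d))
    S_intra = np.zeros((d, d))
    for k in np.unique(labels):
        Xk = X[:, labels == k]
        mk = Xk.mean(axis=1)
        S_inter += Xk.shape[1] * np.outer(mk - m, mk - m)
        D = Xk - mk[:, None]
        S_intra += D @ D.T
    # Generalized eigenproblem S_inter w = lambda * S_intra w; keep the J largest.
    evals, evecs = eigh(S_inter, S_intra + 1e-6 * np.eye(d))
    order = np.argsort(evals)[::-1][:J]
    return evecs[:, order]     # d x J discriminant vectors
```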


Fig. 6. Tracking results of each learning-based tracker on video sequences in which the target is occluded. The sequences from top to bottom are Coke, FaceOcc1, Jogging1 and Woman, respectively.

4. Metric learning based tracking

In this section we apply the proposed online metric learning algorithm to visual tracking. The whole tracking framework is shown in Fig. 2. The target representation and sample selection strategies are similar to [24]: we utilize high-dimensional Haar-like features to represent the target appearance and reduce the feature dimension to 65 by random projection. In the $N$th frame, denote the center location of the target by $l_N$. Treating the tracking task as a binary classification problem, the positive samples at the $N$th frame are selected from the region set $\{R^p\}$ with

$$\lVert R^p - l_N \rVert_2 < r_p \quad (16)$$

where $r_p$ is a predefined radius and $\lVert \cdot \rVert_2$ is the 2-norm. In addition, 50 image patches are randomly selected as negative samples from the region set $\{R^n\}$ with

$$r_{n1} < \lVert R^n - l_N \rVert_2 < r_{n2} \quad (17)$$

where $r_{n1}$ and $r_{n2}$ are two predefined radii with $r_p < r_{n1} < r_{n2}$ (similar to [24], in the experiments $r_p = 2$, $r_{n1} = 3$ and $r_{n2} = 6$).

Assuming that the tracking process can be viewed as an $\bar{N}$-order Markov process [9,49,50], the positive and negative training feature samples from the last $\bar{N}$ frames are collected into a target set and a background set, respectively. Assuming that the target appearance changes drastically within about two seconds and that the frame rate of a normal video is about 25 frames per second, we fix $\bar{N}$ to 50 in our tracking experiments.
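A sketch of the sampling rules in Eqs. (16) and (17), assuming integer pixel locations; `ring_locations` and the example center `l_N` are illustrative, and the boundary handling only approximates the strict inequalities. The same rule with a radius of 30 yields the candidate search region of Eq. (18) below.

```python
import numpy as np

l_N = (120, 80)   # hypothetical target center, for illustration only

def ring_locations(center, r_in, r_out):
    """Integer locations l with r_in <= ||l - center||_2 < r_out (r_in = 0 gives a disc)."""
    cx, cy = center
    r = int(np.ceil(r_out))
    return [(cx + dx, cy + dy)
            for dx in range(-r, r + 1)
            for dy in range(-r, r + 1)
            if r_in <= np.hypot(dx, dy) < r_out]

# Eq. (16): positive sample regions within radius r_p = 2 of l_N.
positives = ring_locations(l_N, 0.0, 2.0)
# Eq. (17): 50 negatives drawn at random from the annulus 3 < ||R - l_N|| < 6.
ring = ring_locations(l_N, 3.0, 6.0)
rng = np.random.default_rng(0)
negatives = [ring[i] for i in rng.choice(len(ring), size=min(50, len(ring)), replace=False)]
```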


With these training samples, the online IAML algorithm is used to separate the target from its background. To predict the object location in the $(N+1)$th frame, the target candidates are drawn from the region set $\{R^s\}$ with

$$\lVert R^s - l_N \rVert_2 < r_s \quad (18)$$

where $r_s$ is the search radius ($r_s$ is fixed to 30 in the experiments). Considering that, under the proposed distance metric, sample points of the same class are close to their class center and far from the other classes, we select as the target the candidate that is nearest to the center of the target class and farthest from the center of the background class. Let $C(l)$ be the target's appearance at spatial location $l$, $m_1$ be the mean of the samples in the target set, and $m_2$ be the mean of the samples in the background set. The target location $l_{N+1}^{*}$ is selected by

$$l_{N+1}^{*} = \arg\min_{l \in \{R^s\}} \big( d_M(C(l), m_1) - d_M(C(l), m_2) \big) \quad (19)$$

The whole algorithmic description of the proposed tracking method is given in Algorithm 1.

Algorithm 1. Tracking by IAML.

Input: training samples of the positive and negative classes from the last $\bar{N}$ frames, the target location at the current frame, and $\sum_{i=1}^{n_c} x_i x_i^\top$, $m_c$ and $\{m_{kc}\}_{k=1}^{\kappa}$ saved from the last step.
Output: the target location in the next frame, the training samples, and $\sum_{i=n_r+1}^{n_c+n_a} x_i x_i^\top$, $m$ and $\{m_k\}_{k=1}^{\kappa}$ (which serve as $\sum_{i=1}^{n_c} x_i x_i^\top$, $m_c$ and $\{m_{kc}\}_{k=1}^{\kappa}$ for the next step).

for each video frame N:
1. Collect positive and negative samples at the current frame by Eqs. (16) and (17), and compute the sample mean $m_a$, the congeneric centers $\{m_{ka}\}_{k=1}^{\kappa}$, and $\sum_{i=n_c+1}^{n_c+n_a} x_i x_i^\top$ of these newly added samples.
2. If $N > \bar{N}$, calculate $m_r$, $\{m_{kr}\}_{k=1}^{\kappa}$, and $\sum_{i=1}^{n_r} x_i x_i^\top$ for the training samples of frame $(N - \bar{N})$ in the target and background sets.
3. Update $\sum_{i=n_r+1}^{n_c+n_a} x_i x_i^\top$, $\{m_k\}_{k=1}^{\kappa}$ and $YY^\top$ by Eq. (13); update $m$ and $\bar{X}\bar{X}^\top$ by Eq. (14); and calculate the distance metric by Eq. (11).
4. Draw target candidates in the next frame by Eq. (18), and predict the target location by Eq. (19).
5. Add $\{x_i\}_{i=n_c+1}^{n_c+n_a}$ to the training sample set.
6. If $N > \bar{N}$, remove the old samples $\{x_i\}_{i=1}^{n_r}$ of frame $(N - \bar{N})$.
end
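A sketch of the localization step of Eq. (19) inside the tracking loop; `extract_features` stands in for the Haar-plus-random-projection pipeline and is purely illustrative, and `mahalanobis` is the helper from the sketch in Section 3.2.

```python
import numpy as np

def localize(candidates, frame, A, m1, m2, extract_features, mahalanobis):
    """Eq. (19): return the candidate location nearest to the target center m1
    and farthest from the background center m2 under the learned metric A."""
    best_loc, best_score = None, np.inf
    for loc in candidates:
        c = extract_features(frame, loc)   # appearance C(l) at location l
        score = mahalanobis(A, c, m1) - mahalanobis(A, c, m2)
        if score < best_score:
            best_loc, best_score = loc, score
    return best_loc
```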

5. Experiments

In this section, we compare IAML with several state-of-the-art methods on synthetic data and in visual tracking applications. For a fair comparison, we use the source code provided by the respective authors, and each implementation is initialized with its default parameters. All experiments are performed using Matlab R2012b on an Intel Dual-Core 3.40 GHz CPU with 4 GB RAM.

5.1. Comparison with LDA in the transformed space

To demonstrate the difference between IAML and LDA, we transform the feature space of the Optdigits dataset (from the UCI repository, http://archive.ics.uci.edu/ml/) into the new metric spaces produced by these two methods, respectively. The dimension of each space is reduced to 2 by PCA, as shown in Fig. 3. In the space transformed by LDA, the yellow points are always mixed up with the blue points, because the optimal number of discriminant vectors is difficult to pre-define; this limits the classification performance of LDA. In contrast, IAML optimizes the dimensionality of the projected space adaptively and can distinguish most samples of different classes.

5.2. Comparison of learning methods on the UCI benchmarks

To validate the general superiority of our metric learning approach, we compare the offline IAML with multi-class LDA (the default number of discriminant vectors is set to $\kappa - 1$ according to [51,48]) and with state-of-the-art metric learning methods, including MMC [11], NCA [18], MCML [19], LMNN [13], ITML [14], and Boosting Metric [52]. The evaluation is performed on the twelve UCI datasets described in Table 1.

Fig. 4 plots the five-fold 3-NN classification error rates; we run the comparison 10 times and report the average results. The classification error of IAML is lower than that of multi-class LDA most of the time, because IAML can optimize the dimension of the projected subspace while LDA cannot. Compared with the conventional metric learning methods, IAML always achieves the best or second-best performance. One possible reason is that the distance adjustments on pairwise training samples are more rigorous than the adjustments on individual samples, which makes it harder for conventional metric learning methods to distinguish samples of different classes.

Table 2 compares the average running time of each metric learning algorithm. IAML is at least two times faster than multi-class LDA, mainly because LDA determines the discriminant vectors by an eigenvalue decomposition step, which is time consuming. IAML is also at least 100 times faster than the traditional metric learning methods. This suggests that IAML can be used to distinguish the target from the background in real-time visual tracking applications.
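One way to reproduce this kind of evaluation, assuming scikit-learn is available; the sketch is ours, not the authors' protocol code. The factorization helper also shows how the projected dimension falls out of the rank of $A$, as noted in Section 3.2; the tolerance is an illustrative choice.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def projection_from_A(A, tol=1e-10):
    """Factor the PSD metric A (Eq. (11)) as A = P^T P; the number of retained
    eigenvalues (the rank of A) sets the projected dimension adaptively."""
    w, E = np.linalg.eigh(A)
    keep = w > tol
    return np.sqrt(w[keep])[:, None] * E[:, keep].T    # p x d projection

def knn_error(X, labels, A, k=3, folds=5):
    """Five-fold k-NN classification error in the projected space (cf. Fig. 4)."""
    Z = (projection_from_A(A) @ X).T                   # n x p projected samples
    clf = KNeighborsClassifier(n_neighbors=k)
    return 1.0 - cross_val_score(clf, Z, labels, cv=folds).mean()
```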


Fig. 7. Tracking results of each learning-based tracker on video sequences in which the target region is blurred due to fast motion of the target or camera. The sequences from top to bottom are Boy, David2, Deer and Jumping, respectively.

5.3. Comparison of learning methods for visual tracking applications

In this section, we evaluate the visual tracking application of the metric learning methods on a benchmark [53] that includes 50 video sequences. In the metric learning component of the visual tracking algorithm proposed in Section 4, we replace our online metric learning method by LDA, MMC, NCA, MCML, LMNN, ITML and Boost Metric, respectively, to measure the tracking performance obtained with each metric learning approach.

Figs. 5-8 show some video frames of the comparison results. During the tracking process, the matching performance is affected by challenges such as illumination variation (Fig. 5), occlusion (Fig. 6), motion blur (Fig. 7) and background clutter (Fig. 8). As demonstrated in Figs. 5-8, although some of the trackers based on the other learning methods can track the target for a period of time, only our method tracks the target throughout, because IAML optimizes the projected dimension adaptively and its objective is easy to optimize.

We also measure the tracking performance quantitatively. For the performance criterion, we do not choose the bounding-box overlap rate, since it has the disadvantage of heavily penalizing trackers that do not track across scale, even if the target position is otherwise tracked perfectly [25]. Instead, we use two evaluation metrics to compare the visual tracking application of our metric learning method with the other methods. The first is the average center location error, defined as the average distance between the center of the tracking result and the ground truth in each frame. However, when a tracker loses the target, its output location can be random, and the average error may then fail to measure the tracking performance correctly [53]. We therefore propose a second evaluation metric, called the scaled precision plot: we rescale the target to 32 × 32 in each video frame, consider a frame whose center location error is smaller than a threshold to be positive predictive, and define the scaled precision rate as the ratio of positive predictive frames to all video frames (a sketch of this computation follows below).

Table 3 shows the center location error of the trackers based on the different learning algorithms, and Fig. 9 shows the scaled precision plot. As illustrated in Fig. 9 and Table 3, the IAML-based tracking algorithm always achieves better tracking accuracy than the other algorithms. Table 4 shows the speed (frames per second) of the trackers based on each learning algorithm. From Table 4, IAML is more than 2 times faster than LDA and about 100 times faster than the average of the other metric learning algorithms.
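A sketch of the scaled-precision computation just described, assuming the tracked and ground-truth centers have already been rescaled so that the target is 32 × 32 in every frame; the names and the threshold range are illustrative.

```python
import numpy as np

def scaled_precision(tracked, truth, thresholds=np.arange(0, 51)):
    """Fraction of frames whose scaled center location error is below each
    threshold; `tracked` and `truth` are (num_frames, 2) arrays of centers."""
    err = np.linalg.norm(np.asarray(tracked) - np.asarray(truth), axis=1)
    return np.array([(err < t).mean() for t in thresholds])
```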


Fig. 8. Tracking results of each learning-based tracker on video sequences in which the background near the target has a visual appearance similar to the target. The sequences from top to bottom are Basketball, Girl, Subway and Walking2, respectively.

Table 3
CLE indicates the average center location error (pixel) of the tracking application for each learning method over all video sequences. Bold value indicates the best performance.

Evaluation | lda | mmc | nca | mcml | lmnn | itml | boost | ours
CLE | 36.9 | 82.3 | 71.7 | 77.7 | 38.9 | 51.0 | 35.1 | 14.1

Table 4
Speed (frames per second) of each metric learning based tracker. Bold value indicates the best performance.

Evaluation | lda | mmc | nca | mcml | lmnn | itml | boost | ours
FPS | 15.35 | 0.77 | 0.09 | 0.02 | 0.59 | 2.81 | 0.20 | 35.26

Fig. 9. The scaled precision rate over all testing sequences for the different learning-based trackers.

5.4. Comparison of our tracking algorithm with other state-of-the-art trackers

In this section, we evaluate the performance of our tracker on the benchmark [53] against thirteen other state-of-the-art trackers: the accelerated proximal gradient approach (APG) [36], the adaptive structural local sparse appearance tracker (ASLA) [54], the circulant structure tracker (CSK) [55], the compressive tracker (CT) [24], the distribution fields tracker (DFT) [56], the fast compressive tracker (FCT) [26], the incremental visual tracker (IVT) [30], the kernelized correlation filter tracker (KCF) [25], the sparse coding tracker (L1) [35], the multiple instance learning tracker (MIL) [22], the multi-task sparse tracker (MTT) [37], the partial least squares tracker (PLS) [57], and the spatio-temporal context tracker (STC) [58].

For quantitative evaluation, we measure the tracking performance of all algorithms on these sequences by the average center location error, as shown in Table 5. The proposed tracker achieves a lower center location error than the other trackers. Moreover, the video sequences in [53] are annotated with 11 attributes describing the challenges that may affect the performance of a visual tracker: illumination variation, scale variation, occlusion, deformation, motion blur, fast motion, in-plane rotation, out-of-plane rotation, out-of-view, background clutter and low resolution. These attributes are useful for diagnosing and characterizing the behavior of trackers. Fig. 10 shows the scaled precision plot on all sequences and for each attribute.

As illustrated in Fig. 10, our approach provides results comparable with the other trackers on 10 of the 11 attributes, the exception being out-of-view. This is because IAML significantly improves the matching performance in tracking applications but lacks a failure recovery mechanism.


Table 5 CLE indicates the average center location error (pixel) of each tracking algorithm on all video sequences. Bold value indicates the best performance. Evaluation

APG

ASL

CSK

CT

DFT

FCT

IVT

L1

MIL

MTT

PLS

STC

KCF

ours

CLE

58.6

37.0

80.7

46.5

51.6

45.6

51.8

85.1

49.7

49.8

77.4

54.5

19.1

14.1

Fig. 10. The scaled precision rate on all of the testing sequences for different trackers.


6. Conclusion

In this paper, we have proposed a novel metric learning approach called IAML for real-time visual tracking. The approach relies on the observation that adjusting the distance from individual sample points to a few anchor points, instead of the distance between pairwise samples, can significantly improve computational efficiency. Based on this observation, we construct a matrix function that collapses sample points to the congeneric centers through a sparsity regularization and maximizes the distances between samples and the heterogeneric centers. The function has a closed-form solution, which is about 100 times faster than traditional metric learning algorithms. Additionally, IAML achieves superior classification accuracy compared with other metric learning methods: the constraints adapted on individual samples are less rigorous than those between pairwise samples, which makes the optimal solution of IAML easier to achieve than that of traditional methods. We also present an online version of IAML, which accelerates the learning process and improves tracking accuracy. With the help of the proposed metric learning approach, our tracking algorithm outperforms state-of-the-art trackers.

Furthermore, owing to the effectiveness and efficiency of the proposed metric learning method, we believe our algorithm can be applied to a wide range of applications in computer science and engineering that require a reliably learned distance metric. In the future, we would like to develop more applications based on the proposed metric learning method.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments. This work is supported by the National Natural Science Foundation of China (Grant no. 61572207).

References

[1] M.J. Black, A.D. Jepson, Eigentracking: robust matching and tracking of articulated objects using a view-based representation, Int. J. Comput. Vis. 26 (1) (1998) 63–84.
[2] M. Isard, A. Blake, Condensation: conditional density propagation for visual tracking, Int. J. Comput. Vis. 29 (1) (1998) 5–28.
[3] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. 25 (5) (2003) 564–577.
[4] X. Wang, G. Hua, T.X. Han, Discriminative tracking by metric learning, in: Computer Vision - ECCV 2010, Heraklion, Crete, Greece, Springer, 2010, pp. 200–214.
[5] N. Jiang, W. Liu, Y. Wu, Adaptive and discriminative metric differential tracking, in: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, USA, IEEE, 2011, pp. 1161–1168.
[6] N. Jiang, W. Liu, H. Su, Y. Wu, Tracking low resolution objects by metric preservation, in: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, USA, IEEE, 2011, pp. 1329–1336.
[7] G. Tsagkatakis, A. Savakis, Online distance metric learning for object tracking, IEEE Trans. Circuits Syst. Video Technol. 21 (12) (2011) 1810–1821.
[8] N. Jiang, W. Liu, Y. Wu, Order determination and sparsity-regularized metric learning adaptive visual tracking, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Rhode Island, USA, IEEE, 2012, pp. 1956–1963.
[9] X. Li, C. Shen, Q. Shi, A. Dick, A. van den Hengel, Non-sparse linear representations for visual tracking with online reservoir metric learning, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Rhode Island, USA, IEEE, 2012, pp. 1760–1767.
[10] B. Wang, G. Wang, K.L. Chan, L. Wang, Tracklet association with online target-specific metric learning, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, Ohio, USA, IEEE, 2014, pp. 1234–1241.
[11] E.P. Xing, M.I. Jordan, S. Russell, A.Y. Ng, Distance metric learning with application to clustering with side-information, in: Advances in Neural Information Processing Systems, 2002, pp. 505–512.
[12] M. Schultz, T. Joachims, Learning a distance metric from relative comparisons, in: Advances in Neural Information Processing Systems (NIPS), 2004, p. 41.
[13] K.Q. Weinberger, J. Blitzer, L.K. Saul, Distance metric learning for large margin nearest neighbor classification, in: Advances in Neural Information Processing Systems, 2005, pp. 1473–1480.
[14] J.V. Davis, B. Kulis, P. Jain, S. Sra, I.S. Dhillon, Information-theoretic metric learning, in: Proceedings of the 24th International Conference on Machine Learning, Corvallis, USA, ACM, 2007, pp. 209–216.
[15] G.-J. Qi, J. Tang, Z.-J. Zha, T.-S. Chua, H.-J. Zhang, An efficient sparse metric learning in high-dimensional space via l1-penalized log-determinant regularization, in: Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, Canada, ACM, 2009, pp. 841–848.
[16] P. Jain, B. Kulis, I.S. Dhillon, K. Grauman, Online metric learning and fast similarity search, in: Advances in Neural Information Processing Systems, 2009, pp. 761–768.
[17] A. Bellet, A. Habrard, M. Sebban, A survey on metric learning for feature vectors and structured data, arXiv preprint arXiv:1306.6709.
[18] J. Goldberger, G.E. Hinton, S.T. Roweis, R. Salakhutdinov, Neighbourhood components analysis, in: Advances in Neural Information Processing Systems, 2004, pp. 513–520.
[19] A. Globerson, S. Roweis, Metric learning by collapsing classes, in: Advances in Neural Information Processing Systems, vol. 18, 2005, pp. 451–458.
[20] R.T. Collins, Y. Liu, M. Leordeanu, Online selection of discriminative tracking features, IEEE Trans. Pattern Anal. Mach. Intell. 27 (10) (2005) 1631–1643.
[21] H. Grabner, M. Grabner, H. Bischof, Real-time tracking via on-line boosting, in: British Machine Vision Conference, vol. 1, 2006, p. 6.
[22] B. Babenko, M.-H. Yang, S. Belongie, Visual tracking with online multiple instance learning, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, USA, IEEE, 2009, pp. 983–990.
[23] Y. Bai, M. Tang, Robust tracking via weakly supervised ranking SVM, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Rhode Island, USA, IEEE, 2012, pp. 1854–1861.
[24] K. Zhang, L. Zhang, M.-H. Yang, Real-time compressive tracking, in: Computer Vision - ECCV 2012, Florence, Italy, Springer, 2012, pp. 864–877.
[25] J.F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell. 37 (3) (2015) 583–596.
[26] K. Zhang, L. Zhang, M.-H. Yang, Fast compressive tracking, IEEE Trans. Pattern Anal. Mach. Intell. 36 (10) (2014) 2002–2015.
[27] J. Wang, F. Zhong, G. Wang, Q. Peng, X. Qin, Visual tracking via subspace motion model, Constraints 48 (3) (2002) 173–194.
[28] J. Ho, K.-C. Lee, M.-H. Yang, D. Kriegman, Visual tracking using learned linear subspaces, in: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol. 1, IEEE, 2004, pp. I-782.
[29] J. Lim, D.A. Ross, R.-S. Lin, M.-H. Yang, Incremental learning for visual tracking, in: Advances in Neural Information Processing Systems, 2004, pp. 793–800.
[30] D.A. Ross, J. Lim, R.-S. Lin, M.-H. Yang, Incremental learning for robust visual tracking, Int. J. Comput. Vis. 77 (1–3) (2008) 125–141.
[31] A. Elgammal, C.-S. Lee, Tracking people on a torus, IEEE Trans. Pattern Anal. Mach. Intell. 31 (3) (2009) 520–538.
[32] H. Qiao, P. Zhang, B. Zhang, S. Zheng, Learning an intrinsic-variable preserving manifold for dynamic visual tracking, IEEE Trans. Syst. Man Cybern., Part B: Cybern. 40 (3) (2010) 868–880.
[33] M. Wang, H. Qiao, B. Zhang, A new algorithm for robust pedestrian tracking based on manifold learning and feature selection, IEEE Trans. Intell. Transp. Syst. 12 (4) (2011) 1195–1208.
[34] G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 663–670.
[35] X. Mei, H. Ling, Robust visual tracking using l1 minimization, in: 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, IEEE, 2009, pp. 1436–1443.
[36] C. Bao, Y. Wu, H. Ling, H. Ji, Real time robust l1 tracker using accelerated proximal gradient approach, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Rhode Island, USA, IEEE, 2012, pp. 1830–1837.
[37] T. Zhang, B. Ghanem, S. Liu, N. Ahuja, Robust visual tracking via multi-task sparse learning, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Rhode Island, USA, IEEE, 2012, pp. 2042–2049.
[38] X. Li, A. Dick, C. Shen, A. van den Hengel, H. Wang, Incremental learning of 3D-DCT compact representations for robust visual tracking, IEEE Trans. Pattern Anal. Mach. Intell. 35 (4) (2013) 863–881.
[39] S. Shirazi, M.T. Harandi, B.C. Lovell, C. Sanderson, Object tracking via non-Euclidean geometry: a Grassmann approach, arXiv preprint arXiv:1403.0309.
[40] H. Grabner, C. Leistner, H. Bischof, Semi-supervised on-line boosting for robust tracking, in: Computer Vision - ECCV 2008, Marseille, France, Springer, 2008, pp. 234–247.
[41] Z. Yao, W. Liu, Extracting robust distribution using adaptive Gaussian mixture model and online feature selection, Neurocomputing 101 (2013) 258–274.
[42] L. Torresani, K.-C. Lee, Large margin component analysis, Adv. Neural Inf. Process. Syst. 19 (2007) 1385.
[43] C. Shen, J. Kim, L. Wang, A. van den Hengel, Positive semidefinite metric learning with boosting, in: Advances in Neural Information Processing Systems, vol. 22, 2009, pp. 629–633.
[44] J. Kwon, K.M. Lee, Visual tracking decomposition, in: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, IEEE, 2010, pp. 1269–1276.
[45] Y. Ying, K. Huang, C. Campbell, Sparse metric learning via smooth optimization, in: Advances in Neural Information Processing Systems, 2009, pp. 2214–2222.
[46] W.B. Johnson, J. Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space, in: Contemporary Mathematics, 1984, pp. 189–206.
[47] D.H. Foley, J.W. Sammon, An optimal set of discriminant vectors, IEEE Trans. Comput. 100 (3) (1975) 281–289.
[48] J. Duchene, S. Leclercq, An optimal transformation for discriminant and principal component analysis, IEEE Trans. Pattern Anal. Mach. Intell. 10 (6) (1988) 978–983.
[49] R. Washington, Markov tracking for agent coordination, in: Proceedings of the Second International Conference on Autonomous Agents, Minneapolis, USA, ACM, 1998, pp. 70–77.
[50] Z. Rabinovich, J.S. Rosenschein, Extended Markov tracking with an application to control.
[51] C.R. Rao, The utilization of multiple measurements in problems of biological classification, J. R. Stat. Soc. 10 (2) (1948) 159–203.
[52] C. Shen, J. Kim, L. Wang, A. van den Hengel, Positive semidefinite metric learning using boosting-like algorithms, J. Mach. Learn. Res. 98888 (1) (2012) 1007–1036.
[53] Y. Wu, J. Lim, M.-H. Yang, Online object tracking: a benchmark, in: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, USA, IEEE, 2013, pp. 2411–2418.
[54] X. Jia, H. Lu, M.-H. Yang, Visual tracking via adaptive structural local sparse appearance model, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Rhode Island, USA, IEEE, 2012, pp. 1822–1829.
[55] J.F. Henriques, R. Caseiro, P. Martins, J. Batista, Exploiting the circulant structure of tracking-by-detection with kernels, in: Computer Vision - ECCV 2012, Florence, Italy, Springer, 2012, pp. 702–715.
[56] L. Sevilla-Lara, E. Learned-Miller, Distribution fields for tracking, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Rhode Island, USA, IEEE, 2012, pp. 1910–1917.
[57] Q. Wang, F. Chen, W. Xu, M.-H. Yang, Object tracking via partial least squares analysis, IEEE Trans. Image Process. 21 (10) (2012) 4454–4465.
[58] K. Zhang, L. Zhang, M.-H. Yang, D. Zhang, Fast tracking via spatio-temporal context learning, arXiv preprint arXiv:1311.1939.

Sihua Yi received his B.S. degree in Electronics and Information Engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2006, and his M.S. degree in Electronics and Information Engineering from the Wuhan Research Institute of Posts and Telecommunications (WRI), Wuhan, PR China, in 2009. He is currently working toward the Ph.D. degree at HUST. His research interests include computer vision and pattern recognition.

Nan Jiang received the B.S. and Ph.D. degrees in Electronics and Information Engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2006 and 2011, respectively. She was an assistant professor at the Huazhong University of Science and Technology. Her research interests include computer vision and pattern recognition.


Xinggang Wang is an assistant professor at the School of Electronics Information and Communications of the Huazhong University of Science and Technology. He received his Bachelor's degree in communication and information systems and his Ph.D. degree in computer vision, both from the Huazhong University of Science and Technology. From May 2010 to July 2011, he was with the Department of Computer and Information Science, Temple University, Philadelphia, PA, as a visiting scholar. From February 2013 to September 2013, he was with the University of California, Los Angeles, as a visiting graduate researcher. He is a reviewer for IEEE Transactions on Cybernetics, Pattern Recognition, Computer Vision and Image Understanding, Neurocomputing, and for CVPR, ICCV and ECCV. His research interests include computer vision and machine learning.

Wenyu Liu received the B.S. degree in Computer Science from the Tsinghua University, Beijing, China, in 1986, and the M.S. and Ph.D. degrees, both in Electronics and Information Engineering, from the Huazhong University of Science & Technology (HUST), Wuhan, China, in 1991 and 2001, respectively. He is now a professor and associate dean of the School of Electronic Information and Communications, HUST. His current research areas include computer vision, multimedia, and sensor network. He is a senior member of IEEE.