Robust visual tracking by metric learning with weighted histogram representations


Jun Wang a,b,c, Hanzi Wang a,* (corresponding author), Yan Yan a

a School of Information Science and Technology, Xiamen University, Xiamen 361005, China
b Fujian Key Laboratory of the Brain-like Intelligent Systems (Xiamen University), Xiamen 361005, China
c Cognitive Science Department, Xiamen University, Xiamen 361005, China

Article info

Article history: Received 22 May 2014; received in revised form 2 September 2014; accepted 22 November 2014; available online 3 December 2014. Communicated by Xiaoqin Zhang.

Keywords: Visual tracking; Distance metric learning; Weighted histogram representations; Particle filters

Abstract

Measuring the similarity between the target template and a target candidate is a critical issue in visual tracking. An appropriate similarity metric can improve the accuracy and robustness of visual tracking. This paper proposes a robust visual tracking algorithm that incorporates online distance metric learning into visual tracking based on a particle filter framework. The appearance variations of an object are effectively learned via an online metric learning mechanism. In addition, we use spatially weighted feature representations using both color and spatial information of objects, which can further improve the tracking performance. The proposed algorithm is compared with several state-of-the-art tracking algorithms, and experimental results on challenging video sequences demonstrate the effectiveness and robustness of the proposed tracking algorithm.

1. Introduction

Visual tracking has been well studied in recent decades. The goal of visual tracking is to continually predict the locations of target objects in video sequences. A large number of visual tracking algorithms have been proposed and applied in vehicle navigation, video surveillance, human–computer interaction, and so on [1,2]. Although much progress has been made, robust visual tracking remains a challenging problem due to occlusion, background clutters, fast motion, illumination changes, motion blur and rotation (see Fig. 1).

Generally speaking, visual tracking algorithms can be categorized as either generative [3–9] or discriminative [10–20]. Generative tracking algorithms search each frame for the image region that is most similar to the target template, i.e., the region with a maximal similarity score or a minimal reconstruction error. In generative tracking algorithms, an appearance model is used to represent the target object and is dynamically updated during tracking. Ross et al. [3] used a low-dimensional subspace model to represent an object, which is robust to illumination and pose changes. Wang et al. [4] proposed the least soft-threshold squares (LSS) algorithm to deal with appearance variations. Li et al. [6] used a set of cosine basis functions to build a compact 3D-DCT object representation, where an incremental 3D-DCT algorithm


was proposed to achieve robust tracking in challenging environments. In [7], a motion model was decomposed into multiple basic motion models, which aimed to handle motion changes. Recently, sparse representation based tracking algorithms were also developed [8,9,21,22]. In [8], visual tracking was formulated as a sparsity-based reconstruction problem in a particle filter framework, where a target candidate with the smallest reconstruction error is considered as the tracking result. Zhang et al. [9] generalized the ℓ1 minimization tracking algorithm [8] as a multitask tracking (MTT) algorithm, where visual tracking was formulated as a multi-task sparse learning problem. Jia et al. [21] presented a structural local sparse appearance model for object representation, which exploited both partial and spatial information of the target via an alignment-pooling method. Zhong et al. [22] adopted local representations to build a sparsity-based generative model, which can effectively handle heavy occlusion. In contrast, discriminative tracking algorithms treat visual tracking as a binary classification problem. These kinds of algorithms consider the differences between an object and its surrounding background. Grabner et al. [10] presented an online boosting algorithm (OAB) to select discriminative features for visual tracking. OAB was extended to a semi-supervised boosting algorithm [11], which effectively alleviated the drifting problem. Avidan [12] proposed an ensemble tracking framework wherein multiple weak classifiers were combined into a strong classifier by using an AdaBoost algorithm. The randomized ensemble tracking (RET) algorithm was proposed by Bai et al. [13], where a set of weak classifiers was combined by using a weight vector that is treated as a


Fig. 1. Tracking in several challenging situations including motion blur (the first row: Jumping), fast motion (the middle row: Face), and occlusion (the last row: Coke can). The tracking results of CT [15], Frag [5], L1 [8], MTT [9], TLD [19], VTD [7] and the proposed tracking algorithm are represented by magenta, cyan, blue, green, yellow, black and red rectangles, respectively. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

distribution of confidence among the weak classifiers. Collins et al. [14] adaptively selected the best discriminative feature from multiple features via an online ranking mechanism, which can deal with partial occlusions, background clutters and illumination changes. Zhang et al. [15] proposed a real-time compressive tracking (CT) algorithm that adopts random projection to project a datum in a high-dimensional space to a low-dimensional vector. In order to alleviate the drifting problem, CT used a spatial sampling scheme to obtain several positive samples to train a classifier, instead of using one positive sample as in [10,14]. Babenko et al. [16] used positive and negative bags to learn classifiers, where multiple instance learning was introduced into visual tracking. Hare et al. [17] proposed a tracking-by-detection algorithm based on a structured output SVM learning technique. Yang et al. [18] presented a superpixel based appearance model for visual tracking to handle pose variations.

The measure of similarity between the target template and a target candidate is an important issue that directly affects the accuracy and robustness of visual tracking algorithms. In this paper, we present a simple and robust tracking algorithm that is able to handle fast motion, background clutters, motion blur, occlusion, etc. The proposed algorithm employs an online distance metric learning technique [23], instead of a predefined metric, to measure the similarity between the target template and a target candidate. In order to further improve the robustness of visual tracking, we use a spatially weighted histogram-based feature representation built on intensity values. The computational complexity of our algorithm is low because the dimensionality of the proposed feature representation is low; therefore, the computational cost of online distance metric learning and object tracking is reduced. By using the online distance metric learning algorithm [23], we compute the similarity between the target template and a target candidate and track the object in a particle filter framework.

The remainder of this paper is organized as follows. Section 2 summarizes the related work. Section 3 presents the feature representation and proposes an online distance metric learning

based tracking (referred to as OMLT) algorithm. Section 4 presents experimental results, and evaluates the performance of the proposed tracking algorithm and several competing algorithms. Section 5 concludes the paper.

2. Related work

The performance of most tracking algorithms greatly depends on the distance metric or similarity measure between the target template and a target candidate. Most existing tracking algorithms use a pre-determined metric, e.g., the EMD metric [5], the histogram intersection [22], or the Bhattacharyya coefficient metric [24,25]. A predefined distance metric cannot adapt to appearance variations and may lead to tracking failure. Recently, much research in the fields of image retrieval and pattern recognition has demonstrated that an appropriate distance metric can significantly improve the retrieval or classification performance. Distance metric learning algorithms have therefore attracted much interest in visual tracking [26,28,30,32]. The goal of metric learning here is to learn a metric for measuring the similarity between the target template and a target candidate.

Jiang et al. [26] proposed a discriminative tracking algorithm using the neighborhood component analysis (NCA) metric learning algorithm [27]. NCA learned a distance metric and reduced the dimensionality of the feature space; however, NCA could suffer from spurious local maxima. Wang et al. [28] presented an object tracking algorithm based on maximally collapsing metric learning (MCML) [29]. MCML learned a distance metric by collapsing samples with the same class label together and pushing away samples with different class labels; it assumed that samples with the same class label have a unimodal class distribution. Tsagkatakis and Savakis [30] used the information-theoretic metric learning (ITML) algorithm [31] for visual tracking, where visual tracking was considered as a nearest neighbor classification problem. ITML required a large number of


training samples; otherwise, it was prone to overfitting. Li et al. [32] proposed a robust visual tracking algorithm based on a scalable image similarity learning mechanism [33]. The objective function in [33] considered the distances among samples with the same label. All of the above-mentioned tracking algorithms used a discriminative tracking framework based on an adaptive metric learning mechanism.

Due to their simplicity and robustness to scaling and rotation, color histograms have been widely used to represent objects for visual tracking. Color histograms count the number of occurrences of each intensity or color value inside an object region. However, color histograms neglect the spatial layout of the feature values, which is important for improving the robustness of tracking algorithms. In order to address this problem, several visual tracking algorithms exploit the spatial information of an object in histogram-based representations. Adam et al. [5] divided an object region into multiple non-overlapping patches and represented each patch with a histogram. A voting map was obtained by measuring the histogram similarity between a template patch and the corresponding image patch; the patch-division mechanism takes into account the spatial distribution of the feature values. The object location was estimated by combining the voting maps of the multiple patches, and this combination mechanism reduces the influence of the outliers resulting from occlusions. Comaniciu et al. [24] proposed a mean shift-based tracking algorithm, which used histogram-based features to represent an object. Birchfield and Rangarajan [34] proposed spatiograms to incorporate spatial information, namely the spatial mean and covariance of the positions of the pixels that fall into each histogram bin. Wang et al. [35] used a Gaussian mixture model to represent an object in a joint spatial-color space. More recently, He et al. [36] used a locality sensitive histogram to represent an object. The locality sensitive histogram is computed at each pixel location; in contrast to an intensity histogram, each pixel contributes a floating-point value to the corresponding bin instead of increasing the bin count by one.


3. The proposed visual tracking algorithm

Before formally presenting the proposed tracking algorithm, we describe some notation used in this work. Let $\{\vec{p}_i, c_i\}_{i=1}^{n}$ be a set of $n$ labeled training samples with samples $\vec{p}_i \in \mathbb{R}^d$ and discrete class labels $c_i \in \{\pm 1\}$. Let $L \in \mathbb{R}^{m \times d}$ be a linear transformation matrix. For a positive semidefinite (p.s.d.) matrix $W$, we write $W \succeq 0$.

3.1. Object representation

An important issue in visual tracking is how to represent an object. For visual tracking based on histogram representations, the pixels close to the target center should contribute more to the histogram used to represent the object, because these pixels contain more foreground information than other pixels. The histogram bins that contain more pixels close to the target center therefore play a more important role for visual tracking than the other bins. Spatial layout information is commonly used for visual tracking [36–38]. Thus, we use spatially weighted histogram (referred to as SWH) representations for visual tracking. The spatially weighted histogram takes into account not only the frequency of occurrence of the pixel color values, but also the distribution of the pixels' spatial locations inside an object region. Let $\hat{h} = [h_1, h_2, \ldots, h_B]^T \in \mathbb{R}^B$ denote the spatially weighted histogram inside an object region:

$$h_k = \frac{1}{\Gamma} \sum_{i=1}^{N} \big[ f(\vec{x}_i)\, \delta(b(\vec{x}_i) - k) \big]\, n_k, \quad k = 1, \ldots, B, \qquad (1)$$

where $\Gamma$ is a normalization factor ensuring $\sum_{k=1}^{B} h_k = 1$; $N$ is the number of pixels inside the object region; $B$ is the number of bins of the spatially weighted histogram; and $\delta(\cdot)$ is the Kronecker delta function. The function $b(\cdot): \mathbb{R}^2 \to \{1, 2, \ldots, B\}$ maps the color value of a given pixel $\vec{x}_i$ to its bin $b(\vec{x}_i)$ in the quantized feature space, and $n_k$ is equal to $\sum_{i=1}^{N} \delta[b(\vec{x}_i) - k]$. The function $f(\cdot)$, which calculates the weight of a pixel $\vec{x}_i$ according to its distance to the target center, is written as

$$f(\vec{x}_i) \propto \exp\Big\{ -\tfrac{1}{2} D_{M^{-1}}(\vec{x}_i, \vec{\mu}) \Big\}, \qquad (2)$$

where $\vec{\mu}$ and $M$ are the mean vector and covariance matrix of the locations of all pixels inside the object region, respectively; $\vec{x}_i = [x_i, y_i]^T$ is the coordinate of the pixel $\vec{x}_i$; and $D_{M^{-1}}(\vec{x}_i, \vec{\mu})$ is the squared Mahalanobis distance between $\vec{x}_i$ and $\vec{\mu}$. The function $f(\cdot)$ in Eq. (2) thus assigns a spatial weight to each pixel location. Based on Eqs. (1) and (2), we obtain a spatially weighted histogram representation for a target.
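To make Eqs. (1) and (2) concrete, the following is a minimal NumPy sketch of how a spatially weighted histogram could be computed for a grayscale patch; the number of bins and the small regularizer added to the location covariance are illustrative choices rather than settings from the paper.

```python
import numpy as np

def spatially_weighted_histogram(patch, num_bins=16):
    """Sketch of Eqs. (1)-(2): every pixel votes into its intensity bin with a
    weight that decays with the Mahalanobis distance of its location from the
    patch center; each bin is then scaled by its pixel count n_k and the whole
    histogram is normalized to sum to one."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)

    mu = coords.mean(axis=0)                               # mean pixel location
    cov = np.cov(coords, rowvar=False) + 1e-6 * np.eye(2)  # location covariance M
    cov_inv = np.linalg.inv(cov)

    diff = coords - mu
    maha = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)   # squared Mahalanobis distance
    weights = np.exp(-0.5 * maha)                          # f(x_i) in Eq. (2)

    # b(x_i): quantize the intensity value of each pixel into one of num_bins bins
    bins = np.minimum(patch.ravel().astype(int) * num_bins // 256, num_bins - 1)

    hist = np.zeros(num_bins)
    counts = np.zeros(num_bins)
    np.add.at(hist, bins, weights)                         # sum of f(x_i) per bin
    np.add.at(counts, bins, 1.0)                           # n_k: number of pixels per bin
    hist *= counts                                         # the n_k factor in Eq. (1)
    return hist / max(hist.sum(), 1e-12)                   # 1/Gamma normalization
```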

3.2. Training sample selection

Similar to [16], training samples are selected via a spatial distance-based mechanism, as shown in Fig. 2. We select the image regions from a small neighborhood around the current object location as positive samples, and select the image regions relatively far from the current object location as negative samples. Based on the above scheme, we obtain a set of training samples $\{\vec{p}_i, c_i\}_{i=1}^{n}$ at the current frame. Here, we use $\vec{p}_i$ to denote the SWH representation obtained by Eq. (1) for the $i$-th sample.

Fig. 2. An illustration of training sample selection. (a) The current target location represented by a red rectangle box. (b) The selected positive samples represented by color rectangle boxes. (c) The selected negative samples represented by color rectangle boxes. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)
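As a rough illustration of the sampling scheme above (Fig. 2), the sketch below draws positive box centers from a small disc around the estimated target center and negative centers from a surrounding annulus; the radii correspond to the settings later reported in Section 4, while the numbers of samples are illustrative assumptions.

```python
import numpy as np

def select_training_samples(center, r_pos=4.0, r_in=8.0, r_out=20.0,
                            n_pos=45, n_neg=50, seed=0):
    """Spatial distance-based sampling (Section 3.2): positives come from a
    small neighborhood of the target center, negatives from a ring farther away."""
    rng = np.random.default_rng(seed)
    cx, cy = center

    def sample_ring(r_min, r_max, n):
        radii = rng.uniform(r_min, r_max, n)
        angles = rng.uniform(0.0, 2.0 * np.pi, n)
        return np.stack([cx + radii * np.cos(angles),
                         cy + radii * np.sin(angles)], axis=1)

    positives = sample_ring(0.0, r_pos, n_pos)      # label c_i = +1
    negatives = sample_ring(r_in, r_out, n_neg)     # label c_i = -1
    labels = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])
    return np.vstack([positives, negatives]), labels
```

Each sampled center would then be turned into an SWH feature via Eq. (1) before metric learning.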


3.3. Distance metric learning

The distance between two vectors is usually measured via a distance metric. The squared Euclidean distance between two vectors $\vec{p}_i$ and $\vec{p}_j$ under a linear transformation $L$ is computed as

$$D_L(\vec{p}_i, \vec{p}_j) = [L(\vec{p}_i - \vec{p}_j)]^T [L(\vec{p}_i - \vec{p}_j)], \qquad (3)$$

where the linear transformation is performed by the matrix $L$. Eq. (3) can be reformulated as

$$D_W(\vec{p}_i, \vec{p}_j) = (\vec{p}_i - \vec{p}_j)^T W (\vec{p}_i - \vec{p}_j), \qquad (4)$$

where $W = L^T L$ is a Mahalanobis distance metric matrix. For distance metric learning, the objective is to learn a linear transformation matrix $L$ in Eq. (3) or to learn a Mahalanobis distance matrix $W$ in Eq. (4). Distance metric learning in Eq. (4) can be formulated as a convex optimization over a p.s.d. matrix $W$, where a global minimum can be obtained. In contrast, the solution in Eq. (3) may get trapped in local minima. In this work, we learn a Mahalanobis distance metric matrix $W$ in Eq. (4) for visual tracking.

A distance metric learning strategy, originally proposed for large margin nearest neighbor (LMNN) classification [23], is employed in the proposed tracking algorithm. In [23], the objective function contains two competing terms, $\varepsilon_{pull}(W)$ and $\varepsilon_{push}(W)$, and is defined as follows:

$$\varepsilon(W) = (1 - \eta)\,\varepsilon_{pull}(W) + \eta\,\varepsilon_{push}(W), \qquad (5)$$

where $\eta \in [0, 1]$ is a parameter that balances the two terms.

The first term in Eq. (5) penalizes large distances between samples with an identical label. In practice, this term only needs to penalize large distances between samples and their neighboring samples. The neighboring samples of a sample $\vec{p}_i$ have the same class label as $\vec{p}_i$ and are close to $\vec{p}_i$ in the Euclidean space. The first term in Eq. (5) uses the implicit assumption that the distances between a sample and its neighboring samples are smaller than those between the sample and all other samples with different class labels. During the visual tracking process, we select both positive and negative samples according to the object location in each frame, and for each frame we find the neighboring samples of each sample. The first term in Eq. (5) pulls neighboring samples closer together during the distance metric learning process. The term $\varepsilon_{pull}(W)$ is defined as [23]

$$\varepsilon_{pull}(W) = \sum_{\vec{p}_i,\, \vec{p}_j \in N(\vec{p}_i)} D_W(\vec{p}_i, \vec{p}_j), \qquad (6)$$

where $N(\cdot)$ denotes a sample's neighboring sample set, and $\vec{p}_j \in N(\vec{p}_i)$ means that $\vec{p}_j$ is one of the $K$ neighboring samples of $\vec{p}_i$.

The second term in Eq. (5) penalizes small distances between samples that have different class labels. This term provides a pushing force to keep away differently labeled samples whose distances are smaller than a unit margin. The second term in Eq. (5) involves triplets satisfying the following condition:

$$D_W(\vec{p}_i, \vec{p}_l) - D_W(\vec{p}_i, \vec{p}_j) \le 1, \qquad (7)$$

where $\vec{p}_j \in N(\vec{p}_i)$ and $c_l \ne c_i$. Let $(\vec{p}_i, \vec{p}_j, \vec{p}_l)$ be a triplet, where $i, j, l \in \{1, 2, \ldots, n\}$, $\vec{p}_j \in N(\vec{p}_i)$, $c_i = c_j \ne c_l$, and $(\vec{p}_i, \vec{p}_j, \vec{p}_l)$ satisfies the condition in Eq. (7). The second term in Eq. (5) is defined as [23]

$$\varepsilon_{push}(W) = \sum_{(\vec{p}_i, \vec{p}_j, \vec{p}_l)} \big[ 1 + D_W(\vec{p}_i, \vec{p}_j) - D_W(\vec{p}_i, \vec{p}_l) \big]. \qquad (8)$$

The goal of distance metric learning is to directly minimize the objective function in Eq. (5). A gradient descent algorithm is used to iteratively estimate the distance metric matrix $W$. Based on Eqs. (3) and (4), Eq. (6) can be rewritten as

$$\varepsilon_{pull}(W) = \sum_{\vec{p}_i,\, \vec{p}_j \in N(\vec{p}_i)} D_W(\vec{p}_i, \vec{p}_j) = \sum_{\vec{p}_i,\, \vec{p}_j \in N(\vec{p}_i)} (\vec{p}_i - \vec{p}_j)^T W (\vec{p}_i - \vec{p}_j) = \sum_{\vec{p}_i,\, \vec{p}_j \in N(\vec{p}_i)} [L(\vec{p}_i - \vec{p}_j)]^T [L(\vec{p}_i - \vec{p}_j)]. \qquad (9)$$

Then, the inner products in Eq. (9) are converted into outer products via the trace operator of a matrix. Therefore, Eq. (9) can be rewritten as

$$\varepsilon_{pull}(W) = \mathrm{tr}\Big( \sum_{\vec{p}_i,\, \vec{p}_j \in N(\vec{p}_i)} [L(\vec{p}_i - \vec{p}_j)][L(\vec{p}_i - \vec{p}_j)]^T \Big) = \mathrm{tr}\Big( \sum_{\vec{p}_i,\, \vec{p}_j \in N(\vec{p}_i)} (\vec{p}_i - \vec{p}_j)(\vec{p}_i - \vec{p}_j)^T W \Big), \qquad (10)$$

where $\mathrm{tr}(\cdot)$ denotes the trace operator of a matrix. The gradient of $\varepsilon_{pull}(W)$ in Eq. (6) is computed as

$$\nabla_W \varepsilon_{pull}(W) = \nabla_W \mathrm{tr}\Big( \sum_{\vec{p}_i,\, \vec{p}_j \in N(\vec{p}_i)} (\vec{p}_i - \vec{p}_j)(\vec{p}_i - \vec{p}_j)^T W \Big) = \sum_{\vec{p}_i,\, \vec{p}_j \in N(\vec{p}_i)} (\vec{p}_i - \vec{p}_j)(\vec{p}_i - \vec{p}_j)^T. \qquad (11)$$

Similarly, the gradient of $\varepsilon_{push}(W)$ in Eq. (8) is computed as

$$\nabla_W \varepsilon_{push}(W) = \sum_{(\vec{p}_i, \vec{p}_j, \vec{p}_l)} \big[ (\vec{p}_i - \vec{p}_j)(\vec{p}_i - \vec{p}_j)^T - (\vec{p}_i - \vec{p}_l)(\vec{p}_i - \vec{p}_l)^T \big]. \qquad (12)$$

Thus, the gradient of the objective function in Eq. (5) is computed as

$$G = \nabla_W \varepsilon(W) = (1 - \eta) \sum_{\vec{p}_i,\, \vec{p}_j \in N(\vec{p}_i)} (\vec{p}_i - \vec{p}_j)(\vec{p}_i - \vec{p}_j)^T + \eta \sum_{(\vec{p}_i, \vec{p}_j, \vec{p}_l)} \big[ (\vec{p}_i - \vec{p}_j)(\vec{p}_i - \vec{p}_j)^T - (\vec{p}_i - \vec{p}_l)(\vec{p}_i - \vec{p}_l)^T \big]. \qquad (13)$$

Let $W^{(k)}$ and $G^{(k)}$ denote the learned distance metric and the gradient at the $k$-th iteration of the distance metric learning process, respectively. The metric $W^{(k)}$ is updated as

$$W^{(k)} = W^{(k-1)} - \tau G^{(k)}, \qquad (14)$$

where $\tau$ is a step size. At each iteration, the updated matrix is projected onto the positive semidefinite cone so that it remains positive semidefinite. During the distance metric learning process, the metric matrix $W^{(k)}$ in Eq. (14) is iteratively updated until the learning algorithm converges or the maximal number of iterations is reached.

In each frame, after obtaining both positive and negative training samples, we find the $K$ nearest neighboring samples of each sample. The neighboring samples of each sample in each frame are fixed during the metric learning process. A set of triplets is generated based on these training samples and the learned matrix $W$; these triplets form a training set, which is used to learn the metric matrix. When the iterative process is completed, a metric matrix $W$ is obtained. The complete procedure of the metric learning is summarized in Algorithm 1.

The LMNN distance metric learning algorithm has the following advantages. Firstly, LMNN shrinks the distances between a sample and its neighboring samples, while it separates the samples with different class labels by a large margin. Secondly, LMNN formulates the optimization as an instance of semidefinite programming, which has a globally optimal solution. Thirdly, LMNN can adaptively learn a discriminative distance metric.


Considering the above advantages, we propose a novel tracking algorithm that integrates LMNN for visual tracking.

Algorithm 1. Online distance metric learning.

Input: A set of both positive and negative training samples $P = \{\vec{p}_i, c_i\}_{i=1}^{n}$; the initial metric matrix $W^{(0)} = I$, where $I$ is an identity matrix; the minimum and maximum numbers of iterations, min and max; the tolerance for convergence, tol.
Output: The updated distance metric matrix $W$.
1: $k := 0$; prev_cost := cost := Inf.
2: Find the $K$ neighboring samples of each training sample, and generate a set of neighboring sample pairs $S = \{(\vec{p}_i, \vec{p}_j)\}_{i=1}^{n}$, $\vec{p}_j \in N(\vec{p}_i)$, $j = 1, \ldots, K$.
3: while ((prev_cost − cost > tol || $k$ < min) && $k$ < max) do
4:   Generate a set of triplets $\Omega := \{(\vec{p}_i, \vec{p}_j, \vec{p}_l)\}_{i=1}^{n}$ in terms of Eq. (7).
5:   prev_cost := cost.
6:   Compute cost = $\varepsilon(W^{(k)})$ via Eqs. (5), (6) and (8).
7:   Compute the gradient $G^{(k+1)}$ via Eq. (13) with $S$ and $\Omega$.
8:   Update $W^{(k+1)}$ in terms of $G^{(k+1)}$ via Eq. (14).
9:   Project $W^{(k+1)}$ onto the positive semidefinite cone.
10:  $k := k + 1$.
11: end while
12: $W := W^{(k+1)}$.
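A compact NumPy sketch of Algorithm 1 is given below: it fixes the same-label neighbors once, forms the active triplets of Eq. (7), descends along the gradient of Eq. (13), and projects the metric back onto the p.s.d. cone after every step. The step size and the brute-force neighbor search are illustrative simplifications, not the paper's MATLAB implementation.

```python
import numpy as np

def learn_metric(P, labels, K=3, eta=0.5, tau=1e-3,
                 min_iter=50, max_iter=500, tol=1e-3):
    """Sketch of Algorithm 1. P: (n, d) array of SWH features; labels in {+1, -1}."""
    n, d = P.shape
    W = np.eye(d)

    # K nearest same-label neighbors in Euclidean space, fixed for this frame
    dists = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    pairs = []
    for i in range(n):
        same = np.where((labels == labels[i]) & (np.arange(n) != i))[0]
        for j in same[np.argsort(dists[i, same])[:K]]:
            pairs.append((i, j))

    def d_w(a, b):
        diff = P[a] - P[b]
        return diff @ W @ diff            # Mahalanobis distance of Eq. (4)

    prev_cost = np.inf
    for it in range(max_iter):
        grad = np.zeros((d, d))
        cost = 0.0
        for i, j in pairs:
            outer_ij = np.outer(P[i] - P[j], P[i] - P[j])
            grad += (1.0 - eta) * outer_ij                # pull gradient, Eq. (11)
            cost += (1.0 - eta) * d_w(i, j)
            for l in np.where(labels != labels[i])[0]:    # differently labeled samples
                margin = 1.0 + d_w(i, j) - d_w(i, l)
                if margin > 0.0:                          # active triplet, Eq. (7)
                    outer_il = np.outer(P[i] - P[l], P[i] - P[l])
                    grad += eta * (outer_ij - outer_il)   # push gradient, Eq. (12)
                    cost += eta * margin
        W -= tau * grad                                   # update of Eq. (14)
        eigval, eigvec = np.linalg.eigh(W)                # project onto the p.s.d. cone
        W = (eigvec * np.maximum(eigval, 0.0)) @ eigvec.T
        if it + 1 >= min_iter and prev_cost - cost < tol:
            break
        prev_cost = cost
    return W
```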


3.4. Likelihood evaluation and template update

The particle filter [39] provides a robust framework for visual tracking due to its simplicity and effectiveness. A particle filter method, also known as a sequential Monte Carlo method [40] for importance sampling, sequentially estimates the posterior distribution of the latent state variables of a dynamical system based on a sequence of corresponding observations [41].

Let $\vec{s}_t$ and $\vec{y}_t$ denote the latent state variable describing the parameters of a target (e.g., the location and scale) and the corresponding observation, respectively, at time $t$. The tracking problem can be formulated as the estimation of the posterior probability $p(\vec{s}_t \mid \vec{y}_{1:t})$, where $\vec{y}_{1:t} = \{\vec{y}_1, \vec{y}_2, \ldots, \vec{y}_t\}$ represents the observations from the previous $t$ frames. In the particle filter framework, $p(\vec{s}_t \mid \vec{y}_{1:t})$ is approximated by a set of $n$ target candidates (called particles) $\{\vec{s}_t^{\,i}\}_{i=1}^{n}$ with the importance weights $\{\omega_t^i\}_{i=1}^{n}$. The $n$ target candidates are drawn from an importance distribution $q(\vec{s}_t \mid \vec{s}_{1:t-1}, \vec{y}_{1:t})$ that is often assumed to follow a first-order Markov model; therefore, the importance distribution can be simplified to the state transition model $p(\vec{s}_t \mid \vec{s}_{t-1})$, which is assumed to be a Gaussian distribution. The importance weight $\omega_t^i$ of a target candidate $i$ is updated by the observation likelihood as $\omega_t^i = \omega_{t-1}^i\, p(\vec{y}_t \mid \vec{s}_t^{\,i})$. Since the particles are resampled to equal weights before the update, $\omega_t^i$ is proportional to $p(\vec{y}_t \mid \vec{s}_t^{\,i})$ after the update. The state at time $t$ is estimated as $\hat{\vec{s}}_t = \sum_{i=1}^{n} \omega_t^i \vec{s}_t^{\,i}$.

In the particle filter framework, an important issue is to formulate the observation likelihood $p(\vec{y}_t \mid \vec{s}_t^{\,i})$, which reflects the similarity between a target candidate $i$ (particle) and the target template. $p(\vec{y}_t \mid \vec{s}_t^{\,i})$ is derived from the distance between a target candidate $\vec{s}_t^{\,i}$ and the target template $\vec{O}_t$ at time $t$, denoted as $D_W(\vec{y}_t^{\,i}, \vec{O}_t)$. Therefore, we have

$$p(\vec{y}_t \mid \vec{s}_t^{\,i}) = \frac{1}{T} \exp\big\{ -\lambda D_W(\vec{y}_t^{\,i}, \vec{O}_t) \big\}, \qquad (15)$$

where $\lambda$ is a positive constant controlling the shape of the Gaussian kernel; $T$ is a normalization factor; $W$ is the online learned metric matrix; and $D_W(\vec{y}_t^{\,i}, \vec{O}_t)$ can be computed by using Eq. (4).
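A minimal sketch of Eq. (15) under the learned metric might look as follows; the value of λ and the omission of the normalization factor T (which cancels when the particle weights are renormalized) are illustrative choices.

```python
import numpy as np

def observation_likelihood(y, template, W, lam=1.0):
    """Eq. (15): likelihood of a candidate SWH y given the template histogram,
    using the Mahalanobis distance D_W of Eq. (4) with the learned metric W."""
    diff = y - template
    return np.exp(-lam * (diff @ W @ diff))

# The particle weights are then proportional to these likelihoods:
# weights = likelihoods / likelihoods.sum()
```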

In our implementation, we simply consider the target state parameters as the 2D position and scale of the target in the particle filter module. Given a state $\vec{s}_t$, its corresponding observation $\vec{y}_t$ is collected by cropping the corresponding image region in terms of $\vec{s}_t$. The observation $\vec{y}_t$ is represented by a spatially weighted histogram via Eq. (1). In order to deal with dynamic scenes and appearance variations, we use a simple template update scheme (similar to [22]).

Fig. 3. The framework of the proposed distance metric learning based tracking algorithm. (a) Input. (b) Particle generation. (c) Object localization. (d) Training sample selection. (e) Neighboring sample identification. (f) Metric learning. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)


Fig. 4. The video sequences: the sequence names are listed in the first row. The images on the second and third rows are the first and last frames of each sequence. The targets are represented by red rectangle boxes at the corresponding frames. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

The template histogram is updated as follows:

$$\vec{h}_n' = \alpha \vec{h}_{n-1}' + (1 - \alpha)\vec{h}_n, \quad \text{if } \mathrm{sim}(\vec{h}_n, \vec{h}_{n-1}') > \theta, \qquad (16)$$

where $\vec{h}_n'$ is the updated template histogram at the $n$-th frame; $\vec{h}_{n-1}'$ is the template histogram at the $(n-1)$-th frame; $\vec{h}_n$ is the target histogram at the $n$-th frame; and $\alpha$ is a constant. This update is performed only when the similarity between $\vec{h}_n$ and $\vec{h}_{n-1}'$ is larger than a threshold $\theta$. The above scheme effectively enables the proposed tracking algorithm to deal with object appearance variations and alleviates the drifting problem.
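A sketch of the conditional update in Eq. (16) is shown below. The values of α and θ follow the settings reported in Section 4; since the similarity measure sim(·,·) is not spelled out at this point in the paper, histogram intersection is assumed here as a placeholder.

```python
import numpy as np

def update_template(template, target_hist, alpha=0.95, theta=0.55):
    """Eq. (16): blend the previous template with the current target histogram
    only when the two are similar enough, which limits drift under occlusion."""
    similarity = np.minimum(template, target_hist).sum()   # assumed sim(.,.)
    if similarity > theta:
        return alpha * template + (1.0 - alpha) * target_hist
    return template
```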

3.5. Overview of the proposed OMLT

By integrating the spatially weighted histogram representations and the online distance metric learning into a particle filter framework, we propose a robust visual tracking algorithm. The framework of the proposed OMLT algorithm is described in Fig. 3. It consists of five major components: particle generation, object localization, training sample selection, triplet generation and metric learning. In Fig. 3(b), a number of particles are generated around the target location in the previous frame and the likelihoods of the particles are evaluated. The current target location is estimated as shown in Fig. 3(c). Both positive and negative samples (in green and blue, respectively) are selected in Fig. 3(d), and the $K$ neighboring samples of each sample are found as shown in Fig. 3(e). Triplets are generated and used to learn a metric matrix (illustrated in Fig. 3(f)). The complete procedure of the proposed tracking algorithm is summarized in Algorithm 2.

Algorithm 2. The proposed tracking algorithm.

Input: Video frames $F_1, \ldots, F_L$, the initial target state $\vec{s}_1$ in the first frame.
Output: Current target state $\hat{\vec{s}}_t$.
1: for $t = 1$ to $L$ do
2:   if $t == 1$ then
3:     Obtain $\vec{y}_1$ and its representation via Eq. (1); initialize the target template with $\vec{y}_1$; initialize $n$ particles $\{\vec{s}_t^{\,i}\}_{i=1}^{n}$ with equal weights.
4:     Go to step 14.
5:   else
6:     Generate $n$ particles $\{\vec{s}_t^{\,i}\}_{i=1}^{n}$ using the particle filter.
7:     for each particle $\vec{s}_t^{\,i}$ do
8:       Obtain $\vec{y}_t^{\,i}$ and its representation via Eq. (1); compute $p(\vec{y}_t \mid \vec{s}_t^{\,i})$ using Eq. (15) with the learned metric $W_{t-1}$ and update the particle weight $\omega_t^i$.
9:     end for
10:    Estimate the target state $\hat{\vec{s}}_t = \sum_{i=1}^{n} \omega_t^i \vec{s}_t^{\,i}$.
11:    Obtain $\vec{y}_t$ and its representation with Eq. (1).
12:    Update the target template using Eq. (16).
13:  end if
14:  Select positive and negative samples (see Section 3.2).
15:  Run metric learning in Algorithm 1, and obtain a metric matrix $W_t$.
16:  Propagate particles and resample w.r.t. $p(\vec{s}_t \mid \vec{s}_{t-1})$.
17:  Return $\hat{\vec{s}}_t$ and $W_t$.
18: end for
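Steps 6–12 of Algorithm 2 can be sketched as a single per-frame function. Here `extract_swh` is a hypothetical helper that crops the region described by a state (x, y, scale) and returns its SWH from Eq. (1), and the Gaussian motion noise values are illustrative rather than the paper's settings.

```python
import numpy as np

def track_frame(frame, prev_state, template, W, extract_swh,
                n_particles=300, motion_std=(4.0, 4.0, 0.01), lam=1.0, seed=0):
    """One iteration of the particle filter in Algorithm 2 (steps 6-12)."""
    rng = np.random.default_rng(seed)

    # step 6: propagate particles with a Gaussian state transition model p(s_t | s_{t-1})
    particles = np.asarray(prev_state) + rng.normal(0.0, motion_std, size=(n_particles, 3))

    # steps 7-9: evaluate the observation likelihood of Eq. (15) for each particle
    weights = np.empty(n_particles)
    for i, state in enumerate(particles):
        y = extract_swh(frame, state)
        diff = y - template
        weights[i] = np.exp(-lam * (diff @ W @ diff))
    weights /= weights.sum()

    # step 10: the state estimate is the weighted mean of the particles
    estimated_state = weights @ particles
    return estimated_state, particles, weights
```

After this step, the new estimate is used to select training samples and to rerun Algorithm 1, yielding the metric $W_t$ for the next frame.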

4. Experiments

In order to evaluate the effectiveness of the online metric learning and the robustness of the proposed OMLT algorithm, we conduct experiments on eight challenging video sequences which are publicly available.¹ Fig. 4 shows the first and last frames of the eight video sequences. Table 1 summarizes the main attributes of the video sequences.

¹ The Basketball, Jumping and Subway sequences are from http://visual-tracking.net. The Coupon book and Coke can sequences are from http://vision.ucsd.edu/bbabenko/project-miltrack.shtml. The Caviar1 and Face sequences are from http://faculty.ucmerced.edu/mhyang/pubs.html. The Badminton sequence is from http://www.ics.uci.edu/jsupanci/.

The proposed OMLT is implemented in MATLAB R2011a on a 3.40 GHz Intel Core i7-2600 PC with 8 GB memory. The number of particles is set to 300. The controlling parameter η in Eq. (5) is set to 0.5. The number K of neighboring samples is set to 3. The tolerance of convergence tol, the minimum iteration number min and the maximum iteration number max in Algorithm 1 are set to 0.001, 50 and 500, respectively. The α and θ in Eq. (16) are set to 0.95 and 0.55, respectively. Positive samples are selected from a region with a radius of 4 pixels centered at the current target, and negative samples are selected from a region with an inner radius of 8 pixels and an outer radius of 20 pixels centered at the target. The targets are manually initialized in the first frame. The metric matrix W in Eq. (4) is updated in each frame.

4.1. Evaluating the effectiveness of metric learning

To validate the effectiveness of online metric learning in visual tracking, we compare the proposed OMLT algorithm with variants that use different metrics. Specifically, we use the histogram intersection [22] and the Bhattacharyya coefficient [24,25,37], instead of an online learned metric, in the proposed algorithm to measure the similarity between the target template and a target candidate.


For the Bhattacharyya coefficient, we compute the observation likelihood via Eq. (15) with $\sqrt{1 - \sum_{k=1}^{B} \sqrt{\vec{y}_t^{\,i}(k)\, \vec{O}_t(k)}}$ in place of $D_W(\vec{y}_t^{\,i}, \vec{O}_t)$. For the histogram intersection, we compute the observation likelihood in Eq. (15) with $1 - \sum_{k=1}^{B} \min\big(\vec{y}_t^{\,i}(k), \vec{O}_t(k)\big)$ in place of $D_W(\vec{y}_t^{\,i}, \vec{O}_t)$, where $\vec{y}_t^{\,i}(k)$ and $\vec{O}_t(k)$ are the $k$-th bins of the spatially weighted histograms of the observation $\vec{y}_t^{\,i}$ and the target template $\vec{O}_t$, respectively. When the histogram intersection or the Bhattacharyya coefficient is used as the similarity metric, the metric learning process is ignored in Algorithm 2.
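For reference, the two baseline distances plugged into Eq. (15) in place of $D_W$ could be written as follows (a small sketch assuming both histograms are normalized to sum to one).

```python
import numpy as np

def bhattacharyya_distance(y, o):
    """Distance used by the BCT baseline: sqrt(1 - Bhattacharyya coefficient)."""
    return np.sqrt(max(1.0 - np.sum(np.sqrt(y * o)), 0.0))

def histogram_intersection_distance(y, o):
    """Distance used by the HIT baseline: one minus the histogram intersection."""
    return 1.0 - np.sum(np.minimum(y, o))
```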

4.1.1. Quantitative comparison

We use two criteria to evaluate the performance of the tracking algorithms with different metrics. The first criterion is the center location error, computed as the Euclidean distance between the center of the bounding box of the tracking result and the center of the ground truth bounding box. The second criterion is the tracking success rate. Let the overlap ratio be $\mathrm{area}(R_T \cap R_G) / \mathrm{area}(R_T \cup R_G)$, where $R_T$ and $R_G$ are the bounding boxes of the tracking result and the corresponding ground truth, respectively. When the overlap ratio is larger than 0.5, the result is considered a successful result.

Fig. 5 shows the plots of the center location errors obtained by the tracking algorithms with different similarity metrics on the eight video sequences. Table 2 shows the tracking success rates obtained by the tracking algorithms with different similarity metrics on the eight video sequences. In the experiments, the tracking algorithm using the histogram intersection as the similarity metric is denoted by HIT, and that using the Bhattacharyya coefficient as the similarity metric is denoted by BCT. The three tracking algorithms use the same feature representations, i.e., the spatially weighted histogram representations. As shown in Fig. 5 and Table 2, OMLT achieves the best results in terms of the center location error and the success rate on the eight video sequences despite severe occlusion, illumination changes, non-rigid object deformation and out-of-plane rotation.
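The two evaluation criteria above can be computed directly from the bounding boxes; the sketch below assumes boxes given as (x, y, w, h) tuples, which is an assumption about the data format rather than something specified in the paper.

```python
def center_location_error(box_a, box_b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx, by = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

def is_success(box_t, box_g, threshold=0.5):
    """Overlap ratio area(R_T ∩ R_G) / area(R_T ∪ R_G) compared with 0.5."""
    x1, y1 = max(box_t[0], box_g[0]), max(box_t[1], box_g[1])
    x2 = min(box_t[0] + box_t[2], box_g[0] + box_g[2])
    y2 = min(box_t[1] + box_t[3], box_g[1] + box_g[3])
    inter = max(x2 - x1, 0.0) * max(y2 - y1, 0.0)
    union = box_t[2] * box_t[3] + box_g[2] * box_g[3] - inter
    return inter / union > threshold
```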

4.1.2. Qualitative comparison

In the Coupon book sequence shown in Fig. 6(a), the BCT and HIT algorithms drift far away from the coupon book due to appearance variations at the 54th frame. When the other coupon book appears, the BCT and HIT algorithms drift to the distracter after the 136th and 134th frames, respectively. The proposed OMLT tracks the target successfully throughout the video sequence. These tracking results verify the effectiveness of the proposed OMLT using online metric learning.

As shown in Fig. 6(b), a number of basketball players run back and forth on a basketball court in the Basketball sequence. Because of partial occlusion, out-of-plane rotation, background clutters and drastic illumination changes, the BCT and HIT algorithms fail to

Table 1
The main attributes of the eight video sequences. Occ: occlusion, FM: fast motion, BC: background clutters, MB: motion blur, IC: illumination changes, IPR: in-plane rotation, OPR: out-of-plane rotation, and Def: deformation.

Sequence      Frames   Image size   Color
Badminton     281      640 × 360    RGB
Basketball    725      576 × 432    RGB
Caviar1       382      384 × 288    RGB
Coke can      292      320 × 240    RGB
Coupon book   327      320 × 240    RGB
Face          493      640 × 480    RGB
Jumping       313      352 × 288    RGB
Subway        175      352 × 288    RGB


Table 2 Success rates (in percentage) obtained by the tracking algorithms with different similarity metrics. The best results are shown in red color.


Fig. 5. The center location errors (in pixels) obtained by the tracking algorithms with different similarity metrics on the eight video sequences.


Fig. 6. The tracking results on the eight video sequences. (a) Coupon book. (b) Basketball. (c) Face. (d) Jumping. (e) Coke can. (f) Caviar1. (g) Badminton. (h) Subway.

track the target (a basketball player) after the 240th and 651st frames, respectively. In contrast, OMLT keeps tracking the target, even in the case of drastic illumination changes.

Fig. 6(c) shows the tracking results on the Face sequence, which is captured in an indoor scene. The tracked target is a moving face. Due to the influence of motion blur, deformation and fast motion, the BCT algorithm fails to track the target after the 157th frame, and the HIT algorithm stays away from the target from the 171st frame. In contrast, the proposed OMLT algorithm can accurately track the target throughout the video sequence.

The Jumping sequence, shown in Fig. 6(d), is recorded in an outdoor scene. The tracked target is the face of a man who jumps quickly. The appearance variations caused by fast motion and motion blur are very severe in this video sequence. Because of motion blur, the BCT and HIT algorithms lose the target in some frames, e.g., from the 33rd to the 35th frame and at the 46th frame. In contrast, the proposed OMLT algorithm keeps tracking the target accurately throughout the video sequence.

In the Coke can sequence shown in Fig. 6(e), a coke can (i.e., the target) is rotated and moved, which causes severe occlusion and illumination changes. Because of the influence of out-of-plane rotation, illumination changes and background clutters, the BCT and HIT algorithms fail to track the target after the 13th frame. In contrast, OMLT achieves better tracking performance than the other tracking algorithms.

As shown in Fig. 6(f), a person is tracked against background clutters in the Caviar1 sequence, and the tracked target is occluded by other people. Due to background clutters and occlusion, the HIT algorithm drifts away from the target from the 173rd frame to the 186th frame. In contrast, the proposed OMLT and the BCT algorithms can accurately track the target, with OMLT achieving the more accurate tracking results.

In the Badminton sequence shown in Fig. 6(g), a badminton match is in progress. Affected by partial occlusion, fast motion, background clutters and deformation, both the HIT and BCT algorithms fail to track the target (a badminton player) between the 53rd and the 212th frames. In contrast, OMLT tracks the target accurately.

As shown in Fig. 6(h), the Subway sequence is captured in a subway scene, where the target (a pedestrian) is occluded by four other pedestrians. Because of the influence of occlusion, deformation and background clutters, the BCT algorithm fails to track the target after the 143rd frame. Compared with the BCT and HIT algorithms, the proposed OMLT algorithm achieves more accurate tracking results.

4.2. Comparison with several state-of-the-art algorithms

We also compare the proposed OMLT with several state-of-the-art tracking algorithms on the eight challenging video sequences, both quantitatively and qualitatively. The competing algorithms include CT [15], Frag [5], L1 [8], MTT [9], OAB [10], VTD [7] and TLD [19]. For fairness, in all the experiments, we use the source codes or binary codes provided by the authors and initialize these algorithms with their default parameters. In the first frame of each video sequence, we use identical initial target locations for all the competing algorithms.


4.2.1. Quantitative comparison

Fig. 7 shows the plots of the center location errors obtained by the competing tracking algorithms on the eight video sequences. We also report the average center location errors obtained by the competing algorithms in Table 3. We note that TLD loses the target, and thus we cannot report its tracking results for some frames in three of the eight video sequences, namely the Jumping, Basketball and Badminton sequences. Therefore, we do not report the center location errors for the frames in which TLD fails to track the targets in these three video


sequences. In Table 3, we do not report the average center location errors of TLD for these three video sequences. From Fig. 7 and Table 3, we see that, in most sequences, the proposed OMLT algorithm obtains better performance than the competing algorithms, and it obtains the smallest average center location errors over the eight video sequences. For the Badminton sequence, OMLT obtains an average center location error of 13 pixels, which, while not perfect, is the best result among the eight algorithms.


Fig. 7. Quantitative evaluation of the eight algorithms on the eight sequences in terms of center location errors (in pixels).

Table 3 The average center location errors (in pixels) obtained by the eight algorithms. The best and the second best results are shown in red color and blue color, respectively. The entry ‘–’ for TLD indicates that the value is not available as TLD loses the target.

Table 4 The success rates (in percentage) obtained by the eight algorithms. The best and the second best results are shown in red color and blue color, respectively.



Fig. 8. The tracking results of the eight algorithms on the eight video sequences. (a) Coupon book. (b) Basketball. (c) Face. (d) Jumping. (e) Coke can. (f) Caviar1. (g) Badminton. (h) Subway.


Table 4 summarizes the success rates obtained by the competing algorithms. From Table 4, we can see that the proposed OMLT algorithm obtains the highest average tracking success rate; OMLT obtains the best results in six of the eight sequences and the second best results in the other two sequences. Although OMLT obtains a relatively low success rate on the Coke can sequence, it still obtains the best result among the competing algorithms.

4.2.2. Qualitative comparison

Fig. 8(a) shows the tracking results in the Coupon book sequence. Distracted by background clutters, L1 and VTD fail to track the target coupon book completely after the 131st frame, while Frag and TLD fail to track it after the 136th and 142nd frames, respectively. OAB, CT, MTT and OMLT all successfully track the target coupon book; however, OAB obtains the largest center location errors among these four algorithms. Compared with the competing algorithms, the proposed OMLT achieves the most accurate tracking results.

Fig. 8(b) shows the tracking results in the Basketball sequence. Suffering from non-rigid object deformation, partial occlusion, out-of-plane rotation, background clutters and illumination changes, MTT and OAB fail to track the target (a basketball player) after the 22nd and 15th frames, respectively. Frag obtains inaccurate tracking results from the first frame to the 397th frame and loses the target entirely afterwards. L1 tracks the other basketball player from the 288th frame to the 507th frame. From the 8th frame to the 239th frame, TLD fails to track the target. CT is not robust to partial occlusion and thus drifts away from the target in some frames (e.g., in the 19th, 622nd and 659th frames). VTD stays far away from the target because of non-rigid object deformation in some frames (e.g., in the 73rd, 481st and 487th frames). Compared with these competing algorithms, the proposed OMLT algorithm tracks the target accurately throughout the video sequence.

In the Face sequence shown in Fig. 8(c), the target is a moving face in an indoor scene. Due to the influence of fast motion and motion blur, CT fails to track the target from the 21st frame to the 158th frame. VTD stays away from the target from the 201st frame to the 337th frame, and it tracks the target with incorrect scale from the 250th to the 286th frame. Frag loses the target from the 282nd frame to the 340th frame. OAB and MTT fail to track the target after the 163rd and 150th frames, respectively. L1 keeps tracking the target, but it obtains inaccurate tracking results. TLD and the proposed OMLT algorithm can accurately track the target throughout the video sequence; however, TLD tracks the target with less accurate scale estimates in some frames, e.g., from the 335th frame to the 343rd frame.

In the Jumping sequence shown in Fig. 8(d), a man jumps quickly in an outdoor scene. Because of motion blur and fast motion, OAB and MTT fail to track the face of the man (i.e., the target) after the 16th and 17th frames, respectively. From the 15th frame to the 74th frame, VTD tracks the target only intermittently, and it loses the target after the 75th frame. After the 45th and 53rd frames, Frag and L1 lose the target intermittently. TLD reports the target as missing in 44 of the 313 frames of the Jumping sequence. In contrast, CT and the proposed OMLT track the target accurately throughout the video sequence.

Fig. 8(e) shows the tracking results in the Coke can sequence. Due to the influence of out-of-plane rotation and illumination changes, L1 obtains inaccurate tracking results from the 15th frame to the 33rd frame and fails to track the coke can (i.e., the target) completely after the 34th frame. Frag and VTD fail to track the target after the 65th and 36th frames, respectively. TLD can track the target, but with incorrect scale after the 40th frame. From the 41st frame, OAB and CT lose the target intermittently.
In contrast, both MTT and the proposed OMLT algorithm achieve better tracking results.

87

The Caviar1 sequence, shown in Fig. 8(f), is captured in a corridor scene. The tracked target is one person, who is occluded by two other people. There is a distracter (i.e., another person) in the background. Affected by severe occlusion and background clutters, CT, L1, MTT and OAB fail to track the target after the 120th frame. VTD fails to track the target after the 129th frame. TLD tracks the other person from the 113th frame to the 146th frame. In contrast, only Frag and OMLT can accurately track the person throughout the video sequence. Fig. 8(g) shows the tracking results in the Badminton sequence. Due to the influence of non-rigid object deformation, partial occlusion, background clutters, fast motion and motion blur, Frag and VTD fail to track the body of a male player (i.e., the target) after the 50th and 129th frames, respectively. From the 208th frame, CT fails to track the target completely; however, it tracks the other player who is similar to the target in appearance. MTT, L1 and OAB are incapable of accurately tracking the target, and thus obtain inaccurate tracking results in most frames. TLD is not stable over this video sequence as the target is lost in 63 frames of the 281 frames for the Badminton sequence. Compared with these competing algorithms, the proposed OMLT algorithm can accurately track the target in the case of partial occlusion, non-rigid object deformation and fast motion. In the Subway sequence shown in Fig. 8(h), several pedestrians walk in a subway scene with background clutters. Because of the influence of occlusion and non-rigid object deformation, MTT and VTD fail to track the target after the 40th frame, while OAB fails to track the target after the 113th frame. Distracted by background clutters, Frag fails to track the target from the 125th frame to the 132nd frame. When the target undergoes occlusion, L1, CT and TLD lose the target in some frames, e.g., from the 38th frame to the 44th frame. Compared with these competing algorithms, the proposed OMLT algorithm keeps tracking the target accurately throughout the video sequence.

5. Conclusion In this paper, we have proposed an effective tracking algorithm based on online distance metric learning. The online distance metric learning mechanism adaptively learns object appearance variations. The observation likelihood of a target candidate is derived from the distance between the target template and the target candidate. And the distance is measured by using an online learned distance metric matrix. For improving the robustness against appearance variations, we use spatially weighted histogram to represent an object. We combine online distance metric learning with the spatially weighted histogram representation in a particle filter framework. Experimental results on several challenging video sequences demonstrate that the proposed tracking algorithm is robust to challenges including illumination changes, fast motion, motion blur, occlusions, etc. Both quantitative and qualitative comparisons with several state-of-the-art algorithms demonstrate the effectiveness and robustness of the proposed tracking algorithm.

Acknowledgment This work was supported by the National Natural Science Foundation of China under Grants 61170179, 61201359 and 61472334, the Natural Science Foundation of Fujian Province of China under Grant 2012J05126, and the Specialized Research Fund for the Doctoral Program of Higher Education of China under Grant 20110121110033.

88

J. Wang et al. / Neurocomputing 153 (2015) 77–88

References [1] Y. Wu, J. Lim, M.-H. Yang, Online object tracking: a benchmark, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2013, pp. 2411–2418. [2] X. Li, W. Hu, C. Shen, Z. Zhang, et al., A survey of appearance models in visual object tracking, ACM Trans. Intell. Syst. Technol. 4 (4) (2013) 58-58:48. [3] D.A. Ross, J. Lim, R.-S. Lin, M.-H. Yang, Incremental learning for robust visual tracking, Int. J. Comput. Vision 77 (1) (2008) 125–141. [4] D. Wang, H. Lu, M.-H. Yang, Least soft-threshold squares tracking, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2013, pp. 2371–2378. [5] A. Adam, E. Rivlin, I. Shimshoni, Robust fragments-based tracking using the integral histogram, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2006, pp. 798–805. [6] X. Li, A. Dick, C. Shen, A. van den Hengel, H. Wang, Incremental learning of 3D-DCT compact representations for robust visual tracking, IEEE Trans. Pattern Anal. Mach. Intell. 35 (4) (2013) 863–881. [7] J. Kwon, K.M. Lee, Visual tracking decomposition, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2010, pp. 1269–1276. [8] X. Mei, H. Ling, Robust visual tracking and vehicle classification via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 33 (11) (2011) 2259–2272. [9] T. Zhang, B. Ghanem, S. Liu, N. Ahuja, Robust visual tracking via multi-task sparse learning, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2012, pp. 2042–2049. [10] H. Grabner, M. Grabner, H. Bischof, Real-time tracking via on-line boosting, in: British Machine Vision Conference, 2006, pp. 47–56. [11] H. Grabner, C. Leistner, H. Bischof, Semi-supervised on-line boosting for robust tracking, in: European Conference on Computer Vision, 2008, pp. 234–247. [12] S. Avidan, Ensemble tracking, IEEE Trans. Pattern Anal. Mach. Intell. 29 (2) (2007) 261–271. [13] Q. Bai, Z. Wu, S. Sclaroff, M. Betke, C. Monnier, Randomized ensemble tracking, in: IEEE International Conference on Computer Vision, 2013, pp. 2040–2047. [14] R.T. Collins, Y. Liu, M. Leordeanu, Online selection of discriminative tracking features, IEEE Trans. Pattern Anal. Mach. Intell. 27 (10) (2005) 1631–1643. [15] K. Zhang, L. Zhang, M.-H. Yang, Real-time compressive tracking, in: European Conference on Computer Vision, 2012, pp. 864–877. [16] B. Babenko, M.-H. Yang, S. Belongie, Robust object tracking with online multiple instance learning, IEEE Trans. Pattern Anal. Mach. Intell. 33 (8) (2011) 1619–1632. [17] S. Hare, A. Saffari, P.H. Torr, Struck: structured output tracking with kernels, in: IEEE International Conference on Computer Vision, 2011, pp. 263–270. [18] Fan Yang, Huchuan Lu, Minghsuan Yang, Robust superpixel tracking, IEEE Trans. Image Process. 23 (4) (2014) 1639–1651. [19] Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, IEEE Trans. Pattern Anal. Mach. Intell. 34 (7) (2012) 1409–1422. [20] K. Zhang, L. Zhang, M. Yang, Real-time object tracking via online discriminative feature selection, IEEE Trans. Image Process. 22 (12) (2013) 4664–4677. [21] X. Jia, H. Lu, M.-H. Yang, Visual tracking via adaptive structural local sparse appearance model, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2012, pp. 1822–1829. [22] Wei Zhong, Huchuan Lu, Minghsuan Yang, Robust object tracking via sparse collaborative appearance model, IEEE Trans. Image Process. 23 (5) (2014) 2356–2368. [23] K.Q. Weinberger, L.K. 
Saul, Distance metric learning for large margin nearest neighbor classification, J. Mach. Learn. Res. 10 (2009) 207–244. [24] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. 25 (5) (2003) 564–577. [25] J. Kwon, K.M. Lee, Minimum uncertainty gap for robust visual tracking, in: International Conference on Computer Vision and Pattern Recognition, 2013, pp. 4321–4328. [26] N. Jiang, W. Liu, Y. Wu, Adaptive and discriminative metric differential tracking, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2011, pp. 1161–1168. [27] J. Goldberger, S. Roweis, G. Hinton, R. Salakhutdinov, Neighborhood component analysis, in: Advances in Neural Information Processing Systems, 2005, pp. 513–520. [28] X. Wang, G. Hua, T.X. Han, Discriminative tracking by metric learning, in: European Conference on Computer Vision, 2010, pp. 200–214. [29] A. Globerson, S.T. Roweis, Metric learning by collapsing classes, in: Advances in Neural Information Processing Systems, 2005, pp. 451–458. [30] G. Tsagkatakis, A. Savakis, Online distance metric learning for object tracking, IEEE Trans. Circuits Syst. Video Technol. 21 (12) (2011) 1810–1821. [31] J.V. Davis, B. Kulis, P. Jain, S. Sra, I.S. Dhillon, Information-theoretic metric learning, in: International Conference on Machine Learning, 2007, pp. 209–216. [32] X. Li, C. Shen, Q. Shi, A. Dick, A. van den Hengel, Non-sparse linear representations for visual tracking with online reservoir metric learning, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2012, pp. 1760–1767.

[33] G. Chechik, V. Sharma, U. Shalit, S. Bengio, Large scale online learning of image similarity through ranking, J. Mach. Learn. Res. 11 (2010) 1109–1135. [34] S.T. Birchfield, S. Rangarajan, Spatiograms versus histograms for region-based tracking, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2005, pp. 1158–1163. [35] H. Wang, D. Suter, K. Schindler, C. Shen, Adaptive object tracking based on an effective appearance filter, IEEE Trans. Pattern Anal. Mach. Intell. 29 (9) (2007) 1661–1667. [36] S. He, Q. Yang, R.W. Lau, J. Wang, M.-H. Yang, Visual tracking via locality sensitive histograms, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2013, pp. 2427–2434. [37] P. Prez, C. Hue, J. Vermaak, M. Gangnet, Color-based probabilistic tracking, in: European Conference on Computer Vision, 2002, pp. 661–675. [38] D. Comaniciu, V. Ramesh, P. Meer, Real-time tracking of non-rigid objects using mean shift, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2000, pp. 142–149. [39] M. Isard, A. Blake, Condensation—conditional density propagation for visual tracking, Int. J. Comput. Vision 29 (1) (1998) 5–28. [40] A. Doucet, D.N. Freitas, N. Gordon, Sequential Monte Carlo Methods in Practice, Springer, New York, 2001. [41] N. Wang, J. Wang, D.-Y. Yeung, Online robust non-negative dictionary learning for visual tracking, in: IEEE International Conference on Computer Vision, 2013, pp. 657–664.

Jun Wang received the M.S. degree in Computer Science and Technology from Nanchang University, China, in 2007. He is currently a Ph.D. student in the School of Information Science and Technology at Xiamen University, China. His research interests include visual tracking and pattern recognition.

Hanzi Wang is currently a Chairman of the Professor Committee in the School of IST at Xiamen University (XMU) in China, a Distinguished Professor of “Minjiang Scholars” in Fujian province and a Founding Director of the Center for Pattern Analysis and Machine Intelligence (CPAMI) at XMU. He was an Adjunct Professor (2010–2012) and a Senior Research Fellow (2008–2010) at the University of Adelaide, Australia; an Assistant Research Scientist (2007–2008) and a Postdoctoral Fellow (2006–2007) at the Johns Hopkins University; and a Research Fellow at Monash University, Australia (2004–2006). He received his Ph.D degree in Computer Vision from Monash University where he was awarded the Douglas Lampard Electrical Engineering Research Prize and Medal for the best Ph.D. thesis in the department. His research interests are concentrated on computer vision and pattern recognition including visual tracking, robust statistics, object detection, video segmentation, model fitting, optical flow calculation, 3D structure from motion, image segmentation and related fields. He is a Senior Member of the IEEE. He is an Associate Editor for IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT) and he was a Guest Editor of Pattern Recognition Letters (September 2009). He is the General Chair for ICIMCS2014. He was the Program Chair for CVRS2012, Publicity Chair for IEEE NAS2012, and Area Chair for DICTA2010. He also serves on the program committee (PC) of ICCV, ECCV, CVPR, ACCV, PAKDD, ICIG, ADMA, and CISP, and he serves on the reviewer panel for more than 40 journals and conferences.

Yan Yan is currently an Associate Professor in the School of Information Science and Technology at Xiamen University, China. He received the B.S. degree in Electrical Engineering from the University of Electronic Science and Technology of China (UESTC), China, in 2004 and the Ph.D. degree in Information and Communication Engineering from Tsinghua University, China, in 2009, respectively. He worked at Nokia Japan R&D center as a Research Engineer (2009–2010) and Panasonic Singapore Lab as a Project Leader (2011). His research interests include image recognition and machine learning.