Coupling Deep Correlation Filter and Online Discriminative Learning for Visual Object Tracking

Hanhui Li^a, Hefeng Wu^{b,a}, Shujin Lin^a, Xiaonan Luo^{a,c,∗}

^a Sun Yat-sen University, Guangzhou 510006, China
^b Guangdong University of Foreign Studies, Guangzhou 510006, China
^c Guilin University of Electronic Technology, Guilin 541004, China
Abstract

Advances in mathematical models and optimization have introduced powerful tools for solving the visual object tracking problem, but this fundamental problem remains unsolved due to various challenges such as illumination variation, deformation and occlusion. Recent progress in correlation-based models has provided an effective and efficient tracking solution. This type of tracker exploits the convolution theorem to significantly speed up its computation. However, the scheme for training correlation filters restricts these trackers to only a few negative examples, which consequently lowers their performance. Therefore, in this paper, we propose to combine the fast-calculation advantage of the correlation filter with an online discriminative learning model that can fully exploit the negative examples in the context around the target. The proposed tracker proceeds in a coarse-to-fine scheme: it first employs a correlation filter to generate a coarse estimate of the target's location, then refines this estimate with a translating model, and finally computes the scale variation of the target via a scaling model. Both the translating model and the scaling model are formulated in the framework of our online discriminative model. Besides, we also propose an effective offline representation learning method to generate robust image features for our correlation filter. Extensive experiments on the online Object Tracking Benchmark against state-of-the-art methods validate the effectiveness of the proposed tracker.
∗ Corresponding author. Tel.: +86 020 39943198; fax: +86 020 39943199. Email address: [email protected] (Xiaonan Luo)
Keywords: visual tracking, correlation filter, Fourier transform, online optimization, deep learning
1. Introduction

Visual object tracking has become one of the most significant topics in the computer vision and image processing communities in recent decades. Techniques on this topic not only can be employed in applications like video surveillance, but also can serve as an effective preprocessing step in other video analysis methods. Due to the practicality of visual object tracking, abundant research on this topic has been conducted and various trackers have been proposed [1, 2, 3]. Recently, a new type of tracker based on the correlation filter (CF) has been proposed [4, 5, 6, 7, 8, 9]. CF based trackers have attracted considerable attention because of their astounding computational efficiency and high accuracy: they can run at hundreds of frames per second with considerable performance, which is faster and more accurate than conventional tracking methods. The secret of the high speed of CF based trackers is that they locate their target with the correlation operation, i.e., they learn a filter that gives the maximum response value at the location of the target. Furthermore, the convolution theorem [10] shows that the convolution operation can be performed in the frequency domain with element-wise calculations, which greatly increases the speed of CF based trackers. Nevertheless, learning CF based trackers also introduces a serious problem: their learning and update process can only be performed with particular training examples. More specifically, only the patch of the target [4] or its cyclic-shifted variations [5] can be used for training and updating. Such a restriction makes it challenging for CF based trackers to distinguish the target from distractors, and consequently lowers their performance. Based on the above advantages and disadvantages of CF based trackers, in this paper, we aim at combining the correlation filter with an online discriminative learning method to generate an effective yet efficient tracker. The proposed tracker not only
possesses the ability of the correlation filter to detect its target quickly, but also fully explores possible training examples to increase its robustness. Moreover, to further enhance the performance of CF based trackers, we propose to combine them with deep learning based features, since this type of feature has achieved significant breakthroughs in vision-related tasks [11]. We introduce a metric learning based method, which aims at maximizing the distances between targets and distractors, to train a deep convolutional neural network (CNN) and use it for feature extraction, consequently generating our deep correlation filter. In summary, the proposed tracker is composed of two major components: a deep correlation filter and an online discriminative learning method. The deep correlation filter employs a CNN to generate a robust representation of the context around the target and applies the correlation filter to locate the target at a coarse level, while the online discriminative learning method is employed to train a translating model and a scaling model to refine the coarse predictions of the deep correlation filter. Combining these two components lets our tracker locate its target precisely. The contributions of this paper are threefold: first, we propose a novel tracker that better exploits its training examples; second, we introduce the deep correlation filter for fast target detection, together with an offline representation learning method to enhance the filter; third, we propose an online discriminative learning method to refine our tracking results, which also provides the ability to estimate the size variations of the target.

2. Related Work

In this review we concentrate on the recent advances of mathematical models applied in online object tracking [1, 2, 3]. The essential problem in object tracking is how to build a robust model that can capture the appearance variations of the target over time. Discrete histograms of image characteristics, such as colors, textures and edges, have been introduced to approximately model the appearance distribution of the target [12, 13]. Gaussian mixture models [14] have also been presented to model the underlying multi-modality of the distribution. In order to better reflect the partial nature of appearance deformation, Adam et al. [15]
propose a fragments-based model to formulate the multi-modal spatial appearance distribution. Ross et al. [16] introduce incremental principal component analysis (PCA) to discover the primal elements in appearance representation. Recent advances in L1 optimization have made it possible to employ sparse representation-based models to solve the tracking problem [17, 18] with certain success, but at the cost of high computational complexity. In general, these models all face the challenge of model adaptation over time, due to the little object information available at model initialization. Locating the target in the search space is another critical issue in object tracking. In [12], a kernel method is introduced to convert the similarity measurement of histograms into a convex optimization problem, in which the mean shift algorithm can be nicely applied to find a local maximum. Besides, Kalman filters [19] have been applied and extended to model the object's motion assumption, and Bayesian particle filters [20] have been exploited to approximate the non-parametric state distribution of the object via Monte Carlo theory to avoid being trapped locally. In the past few years, the object tracking problem has been formulated as a binary classification task that separates the target from the background, and many machine learning theories have been introduced to learn discriminative models, achieving impressive progress [21, 22, 23, 24, 25, 26, 27, 28]. The discriminative models are trained to find the optimal margin between target examples and background examples in the feature space. Semi-supervised learning [23] and multiple instance learning [24] have also been explored to deal with the unreliable examples collected for model training and updating. However, these classification-based trackers are commonly obstructed by the ill-posed difficulty that the learned model easily degrades over time and leads to tracking drift, due to the lack of good training examples. The recent breakthrough in correlation filters has made them a considerable solution for the tracking problem [4, 6, 5, 29, 30, 7, 31, 9, 8]: Bolme et al. [4] first propose to minimize the output sum of squared error to train a CF based tracker. Henriques et al. [6, 5] exploit the property of circulant matrices and propose a fast closed-form solution for learning kernelized CF based trackers. Beyond estimating the motion of the target via CF, Danelljan et al. [31] propose to use a 1D CF to estimate the scale variations of the target. They also propose to incorporate a spatial regularization term into the procedure of CF learning [9].
Figure 1: Diagrammatic demonstration of the proposed tracker. The upper row illustrates the learning procedure of our model, while the lower row describes the tracking procedure. In the initialization/update phase, the Fast Fourier Transform and Passive-Aggressive optimization are respectively employed to learn/update the correlation filter model and the translating and scaling models, using the context around the target's position. In the tracking phase, given the frame to be processed, our tracker first employs the deep correlation filter model to generate a coarse prediction of the target's position, and then employs the translating and scaling models to predict the precise position and size of the target.
However, as mentioned above, due to the theoretical foundation of CF based trackers, the context of the target cannot be fully exploited. This disadvantage causes CF based trackers to fail when the target is similar to the context. Therefore, we try to develop a tracker with stronger discriminability than conventional CF based trackers.

3. The Proposed Method

In this section, we introduce the proposed visual object tracker in detail. Our tracker is composed of two major components: a deep correlation filter, which generates a coarse estimate of the position of our target, and an online discriminative learning method, which is responsible for refining the estimate given by the deep correlation filter. To begin with, we present an overview of the proposed tracker in Section 3.1; then we present the deep correlation filter and a novel method to train it offline in Sections 3.2 and 3.3, respectively; lastly, we introduce the online discriminative learning method for prediction refinement in Section 3.4.
3.1. Overview of the Proposed Tracker

The key idea of the proposed tracker is to take the fast-detection advantage of CF based trackers and to exploit more training examples to enhance the final performance. To achieve this, we propose to couple the correlation filter with an online discriminative learning method based on the Passive-Aggressive algorithm [32]. Before introducing the proposed framework, we formulate our task for a clearer description: given a sequence of $T$ frames, let $p_t = (p_x, p_y)$ denote the position of the target at frame $t$, $s_t = (s_{w,t}, s_{h,t})$ denote the scale factor of the size of the target at frame $t$, and $b = (w, h)$ denote the size of the target. Given $b$, $p_1$ and $s_1$, the task of our tracker is to predict $p_t$ and $s_t$ for $t = 2, \ldots, T$. However, exhaustively exploring the possible combinations of $p$ and $s$ is too time-consuming for the object tracking scenario, so our solution to the aforementioned task is divided into two steps: the proposed tracker first locates the target, and then evaluates the size variation of the target, as demonstrated in Figure 1. First of all, assuming frame $t$ is the current frame to be processed, the proposed tracker crops out an image patch centered at $p_{t-1}$ with size $(1+\lambda_p) \cdot b \cdot s_{t-1}$, where $\lambda_p$ is an empirical constant controlling the padding size. We set $\lambda_p = 2$ throughout this paper, so that the cropped image patch is much larger than the bounding box of the target and contains the context around the target for further analysis. Let $I_t$ denote the cropped image patch. The cropped image patch is then resized and fed into a deep convolutional neural network (CNN) to extract its feature. Let $\mathbf{x}_t \in \mathbb{R}^{W \times H \times D}$ denote the extracted feature, where $W$, $H$ and $D$ are the dimensions of the extracted feature.
Secondly, a correlation filter $\mathbf{F}_t \in \mathbb{R}^{W \times H \times D}$ is employed to estimate the confidence of the target being centered at every position in $I_t$. Let $c_{\mathbf{F}_t}(p, \mathbf{x}_t)$ denote the calculated confidence at position $p$. The positions with the top $K$ highest confidence values are selected to construct a candidate set, denoted as $P = \{p_k \mid k = 1, \ldots, K\}$.

Thirdly, for every position $p_k$ in the candidate set, we crop a mini-patch centered at $p_k$ with the size of the target, and employ a translating model to calculate its confidence. Note that feature extraction via the CNN maintains the spatial structure (see Figures 2(a) and 2(b)), so we can directly obtain the feature of a mini-patch by cropping $\mathbf{x}_t$. Let $\mathbf{x}_{t,k} \in \mathbb{R}^{W' \times H' \times D'}$ denote the $W' \times H' \times D'$ dimensional feature of the $k$-th mini-patch; then a translating model $\mathbf{T}_t \in \mathbb{R}^{W' \times H' \times D'}$ is employed to calculate its corresponding confidence $c_{\mathbf{T}_t}(\mathbf{x}_{t,k})$. The refined confidence of position $p_k$, $c(p_k, \mathbf{x}_{t,k})$, is defined as follows:

$$c(p_k, \mathbf{x}_{t,k}) = \delta(c_{\mathbf{F}_t}(p_k, \mathbf{x}_t)) \cdot \delta(c_{\mathbf{T}_t}(\mathbf{x}_{t,k})), \qquad (1)$$

where $\delta(x) = \frac{1}{1+e^{-x}}$ is the sigmoid function. The position with the highest refined confidence is considered the location of the target.

Fourthly, the proposed tracker samples $K$ mini-patches with different $s$ centered at the predicted target location, resizes their features to $W' \times H' \times D'$, and then employs a scaling model $\mathbf{S}_t \in \mathbb{R}^{W' \times H' \times D'}$ to find the best scale factor at frame $t$. Note that both the translating model $\mathbf{T}_t$ and the scaling model $\mathbf{S}_t$ are trained via the proposed online discriminative learning method.
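The fusion in Eq. 1 is straightforward to implement. Below is a minimal, illustrative sketch in Python/NumPy (our actual implementation is in MATLAB); the helper name refine_location and the toy inputs are ours, and it assumes the CF response map and the translating-model scores of the top-$K$ candidates have already been computed.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def refine_location(cf_response, translating_scores, K=10):
    """Coarse-to-fine fusion of Eq. 1: take the top-K positions of the
    correlation-filter response, then re-rank them by the product of the
    sigmoid-squashed CF and translating-model confidences."""
    flat = cf_response.ravel()
    topk = np.argpartition(flat, -K)[-K:]  # candidate set P (arbitrary order)
    # translating_scores[i] is assumed to be c_T evaluated on the mini-patch
    # cropped at candidate topk[i]
    fused = sigmoid(flat[topk]) * sigmoid(np.asarray(translating_scores))
    best = topk[np.argmax(fused)]
    return np.unravel_index(best, cf_response.shape)

# Toy usage with a random response map and random translating-model scores.
rng = np.random.default_rng(0)
resp = rng.standard_normal((48, 48))
print(refine_location(resp, rng.standard_normal(10), K=10))
```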
At last, we update $\mathbf{F}_t$, $\mathbf{T}_t$ and $\mathbf{S}_t$ with the context around the predicted location of the target. Detailed computational processes are presented in the following sections. The advantages of the proposed tracker are twofold: on the one hand, compared with previous random sampling based trackers, the proposed tracker generates a dense confidence estimate of the context around the target via a single correlation operation; on the other hand, compared with CF based trackers, introducing a translating model helps to explore more examples and improve the precision of our tracker, as we explain in Section 3.2. Besides, the scaling model also allows our tracker to adapt to size changes during the tracking process.

3.2. Deep Correlation Filter

In this section, we introduce the deep correlation filter, which consists of a deep convolutional neural network for obtaining a robust image representation, and a correlation filter for estimating the coarse location of our target. We also analyze the scheme of CF based trackers and explain why it is necessary to exploit more training examples in our tracker. Conventional CF based trackers utilize hand-crafted features, such as image intensity and histograms of oriented gradients [4, 5], and are easily affected by motion blurs
or illumination variations. On the other hand, recent studies on CNNs [11] show that features extracted by a pre-trained CNN model can obtain astounding performance on many visual object recognition applications. Therefore, it is worthwhile to employ CNN features in object tracking. Nevertheless, due to the deep structure of CNNs, obtaining the CNN feature of an image patch is time-consuming. Even with a high-speed GPU, we can only calculate hundreds of CNN features of image patches in one second [33]. This computational efficiency is not enough for traditional trackers based on a random sampling strategy, which usually sample hundreds of image patches in a frame to detect the target. Therefore, applying CNN features in CF based trackers is a reasonable choice for real-time object tracking.

In fact, CNN features can be adapted to correlation filters naturally. Consider a CNN with only four types of operations: convolution, deconvolution, pooling and activation functions [34]. Since these operations are performed at the cell level, we can construct a CNN with these operations that maintains the 2D spatial structure of its input. Figures 2(a) and 2(b) demonstrate an example of this property: here we visualize the output of the relu3_4 layer in the VGG-19 network [35], and we can easily see that the CNN feature captures the contour of the target. In this paper, instead of using the pre-trained VGG-19 network directly, we modify its architecture and enhance its representation ability via an offline learning strategy; we leave the details to the next section and focus here on the correlation filter. The reason we want our feature to maintain the spatial structure of the input image is that, with the spatial information, subsequent processes can be sped up: on the one hand, a correlation filter with this type of feature can realize fast detection of the target; on the other hand, we can directly sample training examples for discriminative learning from the extracted feature, instead of from the input image. Note that in the following sections, for compactness, we omit the steps of converting coordinates between the image space and the CNN feature space.

Figure 2: Demonstration of training examples in the proposed tracker: (a) context of the target, where red grids denote negative examples and the green grid denotes the positive example; (b) deep feature of the context of the target, which maintains the spatial relationship between the positive example and the negative examples; (c) negative examples used in the correlation filter; (d) negative examples used in the proposed translating model. Note that figures are resized for better visualization.

To understand the fast detection advantage of the correlation filter, we consider object tracking as a regression problem, as in [5]. Assume we have $N$ training examples $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ and their corresponding confidences $y_1, y_2, \ldots, y_N$; we obtain the correlation filter $\mathbf{F}$ by minimizing the following ridge regression problem:

$$\min_{\mathbf{F}} \sum_{n=1}^{N} (\mathbf{F} \cdot \mathbf{x}_n - y_n)^2 + \lambda_F \|\mathbf{F}\|_2^2, \qquad (2)$$

where $\lambda_F$ is an empirical constant balancing the squared regression loss against the complexity of the learned model. If we vectorize both $\mathbf{F}$ and $\mathbf{x}$, the regression problem defined in Eq. 2 has the following analytic solution [36]:

$$\mathbf{F} = (X^T X + \lambda_F I)^{-1} X^T \mathbf{y}. \qquad (3)$$
Note that in our derivation, we slightly abuse notation and let $\mathbf{F}$ and $\mathbf{x}$ denote the vectorizations of $\mathbf{F}$ and $\mathbf{x}$ (which does not affect the final results). $X$ is the training data matrix whose $n$-th row represents the example $\mathbf{x}_n$, $\mathbf{y}$ is the vector concatenating $y_1, y_2, \ldots, y_N$, and $I$ is the identity matrix. An important property of $X$ is that if $X$ is a circulant matrix generated by cyclically shifting $\mathbf{x}$ $W \times H$ times, one element at a time, then $X$ can be diagonalized by the discrete Fourier transform (DFT) as follows [37]:

$$X = D\,\mathrm{diag}(\hat{\mathbf{x}})\,D^H, \qquad (4)$$
where $D$ denotes the transformation matrix of the discrete Fourier transform, $\hat{\mathbf{x}}$ denotes the DFT of $\mathbf{x}$, and $D^H$ denotes the Hermitian transpose of $D$. Substituting Eq. 4 into Eq. 3 and applying a few algebraic operations, we can obtain the following expression for $\hat{\mathbf{F}}$ [5]:

$$\hat{\mathbf{F}} = \frac{\hat{\mathbf{x}}^* \odot \hat{\mathbf{y}}}{\hat{\mathbf{x}}^* \odot \hat{\mathbf{x}} + \lambda_F}, \qquad (5)$$

where $\hat{\mathbf{x}}^*$ denotes the complex conjugate of $\hat{\mathbf{x}}$ and $\odot$ denotes the Hadamard (element-wise) product.
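The element-wise form of Eq. 5 rests on the circulant diagonalization of Eq. 4: multiplying by a circulant matrix is a circular convolution, which the DFT turns into an element-wise product. The following is a small, self-contained numerical check of this property (a sketch, not part of the tracker):

```python
import numpy as np

n = 8
rng = np.random.default_rng(1)
x = rng.standard_normal(n)
z = rng.standard_normal(n)

# Circulant matrix X with X[i, j] = x[(i - j) mod n]: each column is a
# cyclic shift of x, mirroring the cyclic-shift training examples.
X = np.column_stack([np.roll(x, j) for j in range(n)])

direct = X @ z                                                  # O(n^2) product
via_dft = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(z)))   # O(n log n)
print(np.allclose(direct, via_dft))  # True: Eq. 4 in action
```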
Moreover, in our tracking scheme for processing frame $t$, if we regard calculating the confidence $c_{\mathbf{F}_t}(p, \mathbf{x}_t)$ at position $p$ as translating the target to position $p$, then we can gather all possible translations and construct a circulant matrix $X_t$ as in Eq. 4, and compute the confidences $\mathbf{y}_t$ for all possible positions $p$ as follows:

$$\mathbf{y}_t = (c_{\mathbf{F}_t}(p_0, \mathbf{x}_t), c_{\mathbf{F}_t}(p_1, \mathbf{x}_t), \ldots, c_{\mathbf{F}_t}(p_{W \times H - 1}, \mathbf{x}_t)) = \mathcal{F}^{-1}(\hat{\mathbf{F}}_t \odot \hat{\mathbf{x}}_t), \qquad (6)$$

where $\mathcal{F}^{-1}$ denotes the inverse DFT. Note that $p_i$, $i = 0, 1, \ldots, W \times H - 1$, denotes the relative position (translation).
To ensure that the correlation filter $\mathbf{F}$ adapts to the appearance changes of the target, we employ a linear online update strategy: let $\mathbf{z}_t$ denote the CNN feature of the context around the predicted location of the target at frame $t$; we calculate $\hat{\mathbf{F}}_{t+1}$ directly in the frequency domain as

$$\hat{\mathbf{F}}_{t+1} = (1 - \lambda_l)\hat{\mathbf{F}}_t + \lambda_l \frac{\hat{\mathbf{z}}_t^* \odot \hat{\mathbf{y}}}{\hat{\mathbf{z}}_t^* \odot \hat{\mathbf{z}}_t + \lambda_F}, \qquad (7)$$

where $\lambda_l$ is the user-defined learning rate. In this paper, we assume $\mathbf{y}$ follows a 2D Gaussian function with spatial bandwidth $b_y$, and set $\hat{\mathbf{F}}_2 = \frac{\hat{\mathbf{x}}_1^* \odot \hat{\mathbf{y}}}{\hat{\mathbf{x}}_1^* \odot \hat{\mathbf{x}}_1 + \lambda_F}$, since Eq. 7 applies for $t = 2, \ldots, T$.
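Eqs. 5-7 amount to a few lines of code. The NumPy sketch below works on a single feature channel for brevity (the tracker uses $D$-channel CNN features); it builds the Gaussian label map, trains the filter in closed form (Eq. 5), produces the dense response of Eq. 6, and applies the linear update of Eq. 7. The function names are ours.

```python
import numpy as np

def gaussian_labels(H, W, sigma):
    """2D Gaussian regression target y, peaked at the patch center."""
    ys, xs = np.mgrid[0:H, 0:W]
    return np.exp(-0.5 * ((ys - H // 2) ** 2 + (xs - W // 2) ** 2) / sigma ** 2)

def train_filter(x, y, lam):
    """Eq. 5: closed-form filter in the frequency domain."""
    xf, yf = np.fft.fft2(x), np.fft.fft2(y)
    return (np.conj(xf) * yf) / (np.conj(xf) * xf + lam)

def detect(Ff, x):
    """Eq. 6: dense confidence map over all cyclic translations."""
    return np.real(np.fft.ifft2(Ff * np.fft.fft2(x)))

def update_filter(Ff, z, y, lam, lr):
    """Eq. 7: linear interpolation toward the filter fit on the new context z."""
    zf, yf = np.fft.fft2(z), np.fft.fft2(y)
    return (1 - lr) * Ff + lr * (np.conj(zf) * yf) / (np.conj(zf) * zf + lam)

# Toy usage: the response on the training patch itself peaks at the center.
rng = np.random.default_rng(2)
x = rng.standard_normal((48, 48))
y = gaussian_labels(48, 48, sigma=2.0)
Ff = train_filter(x, y, lam=1e-3)
print(np.unravel_index(detect(Ff, x).argmax(), (48, 48)))  # (24, 24)
Ff = update_filter(Ff, x + 0.1 * rng.standard_normal((48, 48)), y, 1e-3, lr=0.02)
```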
Although the correlation filter provides a fast solution for detecting the target, its theoretical foundation imposes a restriction: a correlation filter constructed in the aforementioned way can only model object motion as translation, and its negative examples are generated by cyclically shifting a single positive example, as demonstrated in Figure 2(c). As a result, CF based trackers cannot fully exploit the context of their target, and easily fail in background clutter situations.

3.3. Offline Representation Learning

The VGG-19 network [35] was originally designed for classification; in other words, it aims at maximizing the margin between different classes of objects (e.g. dog vs. car),
[Figure 3 diagram: an input triplet (112×112×3) passes through Conv1 (112×112×64), Conv2 (56×56×128), Conv3 (28×28×256), Conv4 (14×14×512) and Conv5 (7×7×512), followed by a deconvolution layer (28×28×512), a fully connected layer (1×1×1024) and L2 normalization (1×1×1024); Euclidean distances between the anchor, positive and negative embeddings feed the triplet loss.]

Figure 3: Architecture of the proposed network for deep representation learning. The blue / green / red dots represent the anchor, the positive sample and the negative sample, respectively. The aim of this network is to generate robust image features for correlation filters via minimizing the triplet loss function. The 2D outputs of the deconvolution layer are used as features in the proposed tracker.
instead of enhancing its ability to identify the subtle differences between two similar objects, which is significant for distinguishing our target from distractors in visual object tracking. Therefore, we propose an offline representation learning method to adapt it to our task. Our method adopts a metric learning mechanism, since it provides a considerable measure of the visual similarity between two objects. Specifically, given a triplet $(o^+, o, o^-)$, where $o$ is a patch centered at the target and is considered the anchor, $o^+$ is a positive instance which also contains the target but is slightly different from $o$, and $o^-$ is a negative instance which does not contain the target, our goal is to learn a distance function $d$ satisfying $d(o^+, o) < d(o, o^-)$. To implement such a function, we design an end-to-end deep neural network as demonstrated in Fig. 3: the first part of our network consists of the first 5 convolutional components of VGG-19, whose layers are pre-trained well enough to tackle general tasks. We resize the inputs to 112 × 112, so the size of the final outputs (Conv5) is only 7 × 7, which is too small to describe the appearance of our targets; therefore we adopt a deconvolution layer to upscale and refine the outputs. Then we add a fully connected layer to transform the 2D features into vectors, and use L2 normalization to ensure the lengths of the output vectors are equal to 1, to avoid the gradient vanishing or gradient explosion problem. At last, we calculate the Euclidean distances between the vectors and estimate the loss of the inputs. We define the corresponding hinge-like loss function as $l(o^+, o, o^-) = \max[0,\, d(o^+, o) - d(o, o^-)]$, which suffers a loss if and only if the distance constraint is violated.
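For illustration, the loss and the L2 normalization can be sketched as follows in NumPy; the embeddings would come from the deconvolution/FC layers above, and here we substitute random vectors (all names are ours):

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Project an embedding onto the unit sphere, as the L2 layer does."""
    return v / (np.linalg.norm(v) + eps)

def triplet_loss(o_pos, o_anchor, o_neg):
    """Hinge-like loss l(o+, o, o-) = max(0, d(o+, o) - d(o, o-));
    zero if and only if the constraint d(o+, o) < d(o, o-) holds."""
    p, a, n = map(l2_normalize, (o_pos, o_anchor, o_neg))
    return max(0.0, np.linalg.norm(p - a) - np.linalg.norm(a - n))

# Toy triplet: a perturbed copy of the anchor vs. an unrelated vector.
rng = np.random.default_rng(3)
anchor = rng.standard_normal(1024)
positive = anchor + 0.05 * rng.standard_normal(1024)
negative = rng.standard_normal(1024)
print(triplet_loss(positive, anchor, negative))  # 0.0: constraint satisfied
```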
Our network is optimized by the standard gradient descent method. Note that the aforementioned network is trained and tested on disjoint data sets. Once the training procedure is finished, we remove the fully connected layer and the L2 normalization layer, and use the outputs of the deconvolution layer as the features of given image patches. This training process has the advantage that it does not require us to minimize a 2D spatial loss, which would cause the learning process to collapse, as observed in our early study. Besides, such a network structure is more convenient to employ in most modern deep learning frameworks.

3.4. Online Discriminative Learning

We now propose an online discriminative model to enhance the performance of CF based trackers. Our online discriminative learning is a variation of the Passive-Aggressive (PA) algorithm [32], and is applied to a translating model and a scaling model in our framework. The intuition behind the proposed discriminative learning method is that we can explore mini-patches in the context of our target as negative examples to train a discriminative model, as demonstrated in Figure 2(d). Besides, since the CNN feature maintains the spatial information of its input, we can directly obtain the features of mini-patches by sampling in the CNN feature of the context around the target. However, notice that the padding area around the target is much bigger than the target itself (see Figure 2(a)), so the number of negative mini-patches is also much larger than that of positive mini-patches. If we directly use these imbalanced examples to train a discriminative model, the obtained model will be severely biased toward the negative examples. Hence we propose an online discriminative learning method to handle this situation. We begin by building the translating model. As mentioned above, let $\mathbf{T}_t$ denote the translating model at frame $t$ and let $\mathbf{x}_{t,k}$, $k = 1, \ldots, K$, denote the features of $K$ examples. We consider constructing the translating model as learning a binary classifier, and define $r_{t,k} = 1$ if the $k$-th mini-patch is centered at the predicted position of the target, and $r_{t,k} = -1$ otherwise. We then minimize the following optimization problem to learn the translating model $\mathbf{T}_{t+1}$:
$$\min_{\mathbf{T}_{t,k+1}} \; \frac{1}{2}\|\mathbf{T}_{t,k+1} - \mathbf{T}_{t,k}\|_2^2 + C^+[r_{t,k} = 1]\,\xi + C^-[r_{t,k} = -1]\,\xi, \quad \text{s.t.} \; \max(0,\, 1 - \mathbf{T}_{t,k+1} \cdot \mathbf{x}_{t,k}\, r_{t,k}) \le \xi \;\text{ and }\; \xi \ge 0. \qquad (8)$$
In Eq. 8, $C^+$ and $C^-$ are empirical parameters that control regularization and the balance between positive and negative examples. We set $C^+ > C^-$ because we need to emphasize the effect of positive training examples. $\xi$ is the slack variable, and $[\cdot]$ denotes the Iverson bracket: $[P] = 1$ if statement $P$ is true, and $[P] = 0$ otherwise. We set $\mathbf{T}_1 = 0$, $\mathbf{T}_{t,0} = \mathbf{T}_t$ and $\mathbf{T}_{t+1} = \mathbf{T}_{t,K}$, so that our translating model is updated with all examples at each frame. Eq. 8 is a convex optimization problem and can be easily solved by the method of Lagrange multipliers; its analytic solution is as follows:

$$\mathbf{T}_{t,k+1} = \mathbf{T}_{t,k} + \tau_{t,k}\, r_{t,k}\, \mathbf{x}_{t,k}, \qquad \tau_{t,k} = \min\left(C_{t,k},\; \frac{\max(0,\, 1 - \mathbf{T}_{t,k} \cdot \mathbf{x}_{t,k}\, r_{t,k})}{\|\mathbf{x}_{t,k}\|_2^2}\right), \qquad (9)$$
where $C_{t,k} = C^+$ if $r_{t,k} = 1$, and $C_{t,k} = C^-$ otherwise. Consequently, the confidence of a mini-patch, $c_{\mathbf{T}_t}(\mathbf{x}_{t,k})$, is calculated as follows:

$$c_{\mathbf{T}_t}(\mathbf{x}_{t,k}) = \mathbf{T}_t \cdot \mathbf{x}_{t,k}. \qquad (10)$$
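A single PA step of Eq. 9, together with the confidence of Eq. 10, can be sketched as follows (Python/NumPy, flattened features; the closed form uses the hinge loss of the current model, as in the standard PA algorithm [32]; function names are ours):

```python
import numpy as np

def pa_translate_update(T, x, r, C_pos=1.0, C_neg=0.01):
    """One Passive-Aggressive step of Eq. 9 for the translating model.
    T: current weights; x: flattened mini-patch feature; r: label in {+1, -1}."""
    loss = max(0.0, 1.0 - r * np.dot(T, x))   # hinge loss of the current model
    C = C_pos if r == 1 else C_neg            # C+ > C- favors the positive example
    tau = min(C, loss / (np.dot(x, x) + 1e-12))
    return T + tau * r * x

# Toy usage: one positive mini-patch and one context negative per "frame".
rng = np.random.default_rng(4)
T = np.zeros(256)
target = rng.standard_normal(256) + 1.0
for _ in range(20):
    T = pa_translate_update(T, target + 0.1 * rng.standard_normal(256), +1)
    T = pa_translate_update(T, rng.standard_normal(256), -1)
print(np.dot(T, target))  # Eq. 10 confidence; positive for target-like patches
```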
The learning scheme of the proposed scaling model is similar, but we consider it as learning a regression function. We randomly sample $K$ mini-patches with different sizes and define their corresponding regression values for online learning. Note that the features of all sampled mini-patches are resized to $W' \times H' \times D'$ for subsequent processing. With a slight abuse of notation, we again use $\mathbf{x}_{t,k}$, $k = 1, \ldots, K$, to denote the features of these mini-patches, and let $r_{t,k} \in [0, 1]$ denote the regression value. We propose to optimize the following problem to learn our scaling model:

$$\min_{\mathbf{S}_{t,k+1}} \; \frac{1}{2}\|\mathbf{S}_{t,k+1} - \mathbf{S}_{t,k}\|_2^2 + C^+[r_{t,k} = 1]\,\xi + C^-[r_{t,k} \ne 1]\,\xi, \quad \text{s.t.} \; |\mathbf{S}_{t,k+1} \cdot \mathbf{x}_{t,k} - r_{t,k}| - \varepsilon \le \xi \;\text{ and }\; \xi \ge 0, \qquad (11)$$
where $\varepsilon$ is a positive constant used to control the sensitivity of the scaling model. We set $\mathbf{S}_1 = 0$, $\mathbf{S}_{t,0} = \mathbf{S}_t$ and $\mathbf{S}_{t+1} = \mathbf{S}_{t,K}$ as well. The regression value of an example is defined as the Jaccard index between its bounding box and the predicted bounding box of the target at the current frame. The solution of Eq. 11 can likewise be obtained via the method of Lagrange multipliers:

$$\mathbf{S}_{t,k+1} = \mathbf{S}_{t,k} + \mathrm{sign}(r_{t,k} - \mathbf{S}_{t,k} \cdot \mathbf{x}_{t,k})\,\tau_{t,k}\,\mathbf{x}_{t,k}, \qquad \tau_{t,k} = \min\left(C_{t,k},\; \frac{\max(0,\, |\mathbf{S}_{t,k} \cdot \mathbf{x}_{t,k} - r_{t,k}| - \varepsilon)}{\|\mathbf{x}_{t,k}\|_2^2}\right), \qquad (12)$$
where $C_{t,k} = C^+$ if $r_{t,k} = 1$, and $C_{t,k} = C^-$ otherwise. The confidence of a mini-patch, $c_{\mathbf{S}_t}(\mathbf{x}_{t,k})$, is calculated as follows:

$$c_{\mathbf{S}_t}(\mathbf{x}_{t,k}) = \mathbf{S}_t \cdot \mathbf{x}_{t,k}. \qquad (13)$$
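The scaling step mirrors the classification step but uses the ε-insensitive regression loss; the following is a sketch of Eq. 12 and of the Jaccard index that defines the regression target $r$ (again, function names and the box convention are ours):

```python
import numpy as np

def jaccard(box_a, box_b):
    """Jaccard index (IoU) of two boxes given as (x, y, w, h); this defines
    the regression value r of a scale sample against the predicted box."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)

def pa_scale_update(S, x, r, C_pos=1.0, C_neg=0.01, eps=0.01):
    """One epsilon-insensitive Passive-Aggressive step of Eq. 12."""
    pred = float(np.dot(S, x))
    loss = max(0.0, abs(pred - r) - eps)   # zero inside the eps-tube
    C = C_pos if r == 1 else C_neg
    tau = min(C, loss / (np.dot(x, x) + 1e-12))
    return S + np.sign(r - pred) * tau * x

print(jaccard((0, 0, 10, 10), (5, 0, 10, 10)))  # ~0.333 for half-overlapping boxes
```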
In this way, we have applied the proposed online discriminative learning method to translation and scale estimation. A characteristic of the proposed discriminative model is that it balances adapting to changes in the appearance of the target against the consistency of the learned model, so that we can relieve the side effects of noise during the tracking process.

4. Experiments

In this section, we present the implementation details of the proposed tracker, and validate its effectiveness via several experiments on a widely-used benchmark.

4.1. Implementation Details

We implement the proposed tracker in MATLAB on a machine with a 3.40 GHz CPU, 32 GB RAM and a TITAN X video card. The proposed tracker runs at about 10 frames per second. We use the MatConvNet toolbox [33] to implement the CNN for feature extraction. In offline representation learning, we select 51 sequences from the VOT2015 dataset (http://www.votchallenge.net/vot2015/), which is disjoint from the test set, as our training data. The proposed network is trained for 5 epochs. In each epoch, we sample 500 triplets from
each sequence randomly. The learning rates of gradient descent in the five epochs are set to [0.0100, 0.0095, 0.0091, 0.0087, 0.0083], respectively. As to online discriminative learning, the number of candidates for confidence refinement, which is also the number of sampled mini-patches, $K$, is set to 10; the regularization parameter in ridge regression, $\lambda_F$, is set to 0.001; the learning rate of the correlation filter, $\lambda_l$, is set to 0.02; the spatial bandwidth of the 2D Gaussian function, $b_y$, is set to 0.1; the regularization parameters in online discriminative learning, $C^+$ and $C^-$, are set to 1 and 0.01, respectively; the sensitivity parameter $\varepsilon$ is set to 0.01; and we use fixed scale factors $s \in \{0.992, 1, 1.08\}$. All parameters are fixed during our experiments.

4.2. Setup

The proposed tracker is validated on the online Object Tracking Benchmark (OTB) [38], which consists of 50 sequences and provides the results of 29 state-of-the-art trackers for comparison, such as Struck [25] and TLD [39]. We also include 8 CF based trackers in our experiments: MOSSE [4], STC [7], CSK [6], DCF and KCF [5], DSST [31], SAMF-AT [8] and SRDCF [9]. All trackers are run with their default parameter settings. We use the center location error (CLE) provided by the benchmark as our evaluation metric, and we also draw precision plots for comparison. The representative scores of the precision plots are the CLE with a threshold of 20 pixels. Due to the page limit, we only report the top 10 trackers in our experiments. Besides, OTB includes 11 tasks with respect to different challenging factors: Fast Motion (FM), Motion Blur (MB), Deformation (DEF), Occlusion (OCC), Background Clutter (BC), Out of View (OV), Illumination Variation (IV), Out-of-Plane Rotation (OPR), Scale Variation (SV), In-Plane Rotation (IPR) and Low Resolution (LR). Results of the proposed tracker on these tasks are reported as well.

4.3. Quantitative Analysis

The overall performance of the proposed tracker is shown in Figure 4. From these results, we can see that the proposed tracker outperforms the baseline trackers, raising the best average precision score of the CF based trackers from 83.8% to 84.2%.
Figure 4: Precision plots of the proposed tracker against the baseline trackers on the OTB [38] with all 50 sequences. The proposed tracker outperforms the state-of-the-art methods significantly. Note that only the results of the top 10 trackers are shown in this figure. Best viewed on a high-resolution screen.

Table 1: Summary of experimental results of precision scores on the OTB [38]. The best and second best scores are marked in bold red and italic blue fonts, respectively. Our tracker outperforms the conventional correlation filter based methods and other sophisticated models in most situations.

          FM     MB     DEF    OCC    BC     OV     IV     OPR    SV     IPR    LR     Average
Ours      0.735  0.796  0.835  0.802  0.848  0.614  0.811  0.818  0.832  0.827  0.806  0.842
SRDCF     0.741  0.789  0.855  0.844  0.803  0.680  0.761  0.818  0.778  0.766  0.518  0.838
SAMF-AT   0.715  0.718  0.797  0.872  0.713  0.672  0.718  0.811  0.776  0.782  0.554  0.833
KCF       0.610  0.660  0.747  0.753  0.753  0.650  0.733  0.732  0.679  0.729  0.381  0.742
DSST      0.513  0.544  0.658  0.706  0.694  0.511  0.730  0.736  0.738  0.768  0.497  0.740
DCF       0.566  0.598  0.746  0.730  0.719  0.632  0.704  0.716  0.654  0.708  0.354  0.730
CSK       0.381  0.342  0.476  0.500  0.585  0.379  0.481  0.540  0.503  0.547  0.411  0.545
MOSSE     0.284  0.282  0.358  0.415  0.407  0.261  0.315  0.431  0.426  0.466  0.396  0.454
STC       0.250  0.261  0.359  0.394  0.437  0.198  0.422  0.385  0.416  0.320  0.265  0.388
Struck    0.604  0.551  0.521  0.564  0.585  0.539  0.558  0.597  0.639  0.617  0.545  0.656
TLD       0.551  0.518  0.512  0.563  0.428  0.576  0.537  0.596  0.606  0.584  0.349  0.608
This performance gain reflects the benefit of exploiting the negative examples in CF based trackers. Besides, Table 1 also shows that the proposed tracker handles most challenging factors well, especially in the MB, BC, IV, OPR, SV, IPR and LR tasks. Compared with other CF based trackers, the most significant advantage of the proposed tracker is in the low-resolution situation: it raises the best precision score among CF based trackers from 55.4% to 80.6%.
Table 2: Summary of precision scores of the proposed method with different components on the OTB [38]. "Ours" denotes the tracker with all components, "Ours-I" denotes our tracker using the outputs of relu3_4 of VGG-19 as features and without offline representation learning, "Ours-II" denotes our tracker without offline representation learning, and "Ours-III" denotes our tracker without online discriminative learning. The best and second best scores are marked in bold red and italic blue fonts, respectively.

          FM     MB     DEF    OCC    BC     OV     IV     OPR    SV     IPR    LR     Average
Ours      0.735  0.796  0.835  0.802  0.848  0.614  0.811  0.818  0.832  0.827  0.806  0.842
Ours-I    0.640  0.677  0.748  0.703  0.815  0.636  0.778  0.735  0.752  0.735  0.677  0.775
Ours-II   0.631  0.680  0.690  0.685  0.739  0.458  0.743  0.744  0.778  0.755  0.742  0.774
Ours-III  0.720  0.775  0.831  0.779  0.832  0.552  0.809  0.830  0.820  0.845  0.658  0.839
KCF       0.610  0.660  0.747  0.753  0.753  0.650  0.733  0.732  0.679  0.729  0.381  0.742
Mostly, we owe this advantage to the deep correlation filter in the proposed tracker. However, we also notice that the performance of the proposed tracker is not satisfactory in the out-of-view situation. This is because the proposed tracker lacks a sophisticated strategy for stopping model updates, so the online discriminative model might be disrupted when the target disappears. To evaluate the effectiveness of each component in our method, we run the proposed tracker with different components on the OTB and report the corresponding precision scores in Table 2. We construct three variants of the proposed tracker: "Ours-I" denotes the proposed tracker with the VGG-19 feature; "Ours-II" denotes the proposed tracker without offline representation learning; "Ours-III" denotes the proposed tracker without online discriminative learning. From the experimental results, we can see that what contributes most to our performance gain is the offline representation learning mechanism: on the one hand, the average performance of our tracker without representation learning ("Ours-II") is just slightly worse than that of our tracker with the VGG-19 feature ("Ours-I"), suggesting that the performance gain does not come directly from the network structure; on the other hand, our tracker with representation learning ("Ours-III") improves the average performance by 6.5%, validating the importance and effectiveness of representation learning in object tracking. By comparing "Ours" to "Ours-III", we can see that the proposed discriminative learning method is capable of enhancing the performance generally, and is especially helpful under circumstances with low-resolution targets.
Figure 5: Selected qualitative results of our tracker against six state-of-the-art trackers on the sequences basketball, couple and jumping (from top to bottom), which feature fast motion and motion blur. The predicted bounding boxes of seven trackers are presented in different colors: ours (red), SRDCF (green), SAMF-AT (blue), KCF (orange), DSST (pink), Struck (cyan) and TLD (purple). Best viewed on a high-resolution screen.
4.4. Qualitative Analysis

We select 9 difficult sequences for qualitative analysis: basketball, couple, jumping, ironman, shaking, football1, soccer, skiing and motorRolling. These sequences can be divided into three groups by their major challenging factors: in basketball, couple and jumping, trackers face fast motion; in ironman, shaking and football1, trackers face severe illumination changes or out-of-plane rotation; in soccer, skiing and motorRolling, trackers face background clutter and in-plane rotation. We use SRDCF, SAMF-AT and KCF as the baselines for CF based trackers. We also select DSST, Struck and TLD for comparison, since each of them is a representative method among various types of trackers. Figure 5 demonstrates the qualitative results of the proposed tracker on the sequences basketball, couple and jumping. In the basketball sequence, Struck, TLD and KCF drift near the 64th frame because of the fast motion of the target, but KCF retrieves the target at frame 142.
Figure 6: Selected qualitative results of our tracker against six state-of-the-art trackers on the sequences ironman, shaking and football1, which feature wide illumination variations, background clutter and out-of-plane rotations. The predicted bounding boxes of seven trackers are presented in different colors: ours (red), SRDCF (green), SAMF-AT (blue), KCF (orange), DSST (pink), Struck (cyan) and TLD (purple). Best viewed on a high-resolution screen.
In the couple sequence, KCF, DSST, SAMF-AT and Struck fail due to the vibration of the camera. In the jumping sequence, fast motion and motion blur are the most challenging factors, causing KCF and DSST to drift at frame 97. Thanks to the proposed offline representation learning strategy, we obtain robust image features, and consequently our tracker performs well on these sequences. Figure 6 shows the qualitative results of the proposed tracker on the sequences ironman, shaking and football1. These sequences show typical tracking situations with illumination variation, out-of-plane rotation and deformation. The ironman sequence has severe illumination changes and background clutter; the proposed tracker drifts a little at the very beginning, but retrieves the target after frame 49, while the other trackers fail on this sequence. In the shaking sequence, KCF and SRDCF fail at the beginning (frame 43) due to the background clutter. From frame 112 to 364, TLD and Struck drift as well because of the illumination variation. Although DSST does not drift like the other trackers, it fails to capture the scale variations of the target. In the football1 sequence, the out-of-plane rotation and fast motion of the target cause TLD, SRDCF, KCF and Struck to drift as well. What really shows the power of the proposed tracker are its results on the sequences soccer, skiing and motorRolling, as demonstrated in Fig. 7.
Figure 7: Selected qualitative results of our tracker against six state-of-the-art trackers on the sequences soccer, skiing and motorRolling, which are challenging because of severe background clutter and in-plane rotation. The predicted bounding boxes of seven trackers are presented in different colors: ours (red), SRDCF (green), SAMF-AT (blue), KCF (orange), DSST (pink), Struck (cyan) and TLD (purple). Best viewed on a high-resolution screen.
In the soccer sequence, background clutter and occlusions occur frequently, causing Struck and TLD to fail. Most trackers fail on the sequences skiing and motorRolling, since they exhibit severe in-plane rotation. However, to our surprise, the proposed tracker handles both sequences well. Theoretically, CF based methods only model the translation of the target, and require the spatial structure of the target to be consistent during the tracking process. Therefore, the only explanation we can derive is that our tracker has learned a feature robust to rotations, which is worth discussing further in our future work. In summary, given these qualitative results, we consider the proposed tracker a valuable solution for online visual object tracking in difficult situations.

5. Conclusions

In this paper, we tackle the deficiency of correlation filter based trackers in exploiting negative training examples. We train a deep convolutional neural network to extract a robust representation of the context around the target via an offline learning process, and then locate the target quickly via a correlation filter. We further increase the precision of our tracker by learning a translating model and a scaling model based on the Passive-Aggressive algorithm. We demonstrate that the aforementioned processes can be performed efficiently in a joint framework with proper design. The performance of the proposed tracker on the benchmark demonstrates that our idea of increasing the discriminative ability of correlation filter based trackers is worthwhile. In the future, we will consider building more sophisticated motion models based on correlation filters, so that challenging factors such as rotation and occlusion can be handled better.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Nos. 61320106008, 61402120, 61572531), and the Natural Science Foundation of Guangdong Province (No. 2014A030310348).

References

[1] A. Yilmaz, O. Javed, M. Shah, Object tracking: A survey, ACM Computing Surveys 38 (4) (2006).
[2] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, M. Shah, Visual tracking: An experimental survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7) (2014) 1442–1468.
[3] X. Li, W. Hu, C. Shen, Z. Zhang, A. R. Dick, A. van den Hengel, A survey of appearance models in visual object tracking, ACM TIST 4 (4) (2013) 58.
[4] D. S. Bolme, J. R. Beveridge, B. A. Draper, Y. M. Lui, Visual object tracking using adaptive correlation filters, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 2544–2550.
[5] J. F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3) (2015) 583–596.
[6] J. F. Henriques, R. Caseiro, P. Martins, J. Batista, Exploiting the circulant structure of tracking-by-detection with kernels, in: Proceedings of European Conference on Computer Vision (ECCV), 2012, pp. 702–715.
[7] K. Zhang, L. Zhang, Q. Liu, D. Zhang, M. Yang, Fast visual tracking via dense spatio-temporal context learning, in: Proceedings of European Conference on Computer Vision (ECCV), 2014, pp. 127–141.
[8] A. Bibi, M. Mueller, B. Ghanem, Target response adaptation for correlation filter tracking, in: Proceedings of European Conference on Computer Vision (ECCV), 2016, pp. 419–433.
[9] M. Danelljan, G. Häger, F. S. Khan, M. Felsberg, Learning spatially regularized correlation filters for visual tracking, in: Proceedings of IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4310–4318.
[10] R. C. Gonzalez, R. E. Woods, Digital Image Processing, Addison-Wesley, 1992.
[11] A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN features off-the-shelf: An astounding baseline for recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, workshop, 2014, pp. 512–519.
[12] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (5) (2003) 564–575.
[13] H. Wu, N. Liu, X. Luo, J. Su, L. Chen, Real-time background subtraction-based video surveillance of people by integrating local texture patterns, Signal, Image and Video Processing 8 (4) (2014) 665–676.
[14] A. D. Jepson, D. J. Fleet, T. F. El-Maraghi, Robust online appearance models for visual tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (10) (2003) 1296–1311.
[15] A. Adam, E. Rivlin, I. Shimshoni, Robust fragments-based tracking using the integral histogram, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 798–805.
[16] D. A. Ross, J. Lim, R. Lin, M. Yang, Incremental learning for robust visual tracking, International Journal of Computer Vision 77 (1-3) (2008) 125–141.
[17] X. Mei, H. Ling, Robust visual tracking and vehicle classification via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (11) (2011) 2259–2272.
[18] D. Wang, H. Lu, M. Yang, Online object tracking with sparse prototypes, IEEE Transactions on Image Processing 22 (1) (2013) 314–325.
[19] R. E. Kalman, A new approach to linear filtering and prediction problems, Transactions of the ASME–Journal of Basic Engineering 82 (1960) 35–45.
[20] M. S. Arulampalam, S. Maskell, N. Gordon, T. Clapp, A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking, IEEE Transactions on Signal Processing 50 (2) (2002) 174–188.
[21] S. Avidan, Support vector tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (8) (2004) 1064–1072.
[22] H. Grabner, H. Bischof, On-line boosting and vision, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 260–267.
[23] H. Grabner, C. Leistner, H. Bischof, Semi-supervised on-line boosting for robust tracking, in: Proceedings of European Conference on Computer Vision (ECCV), 2008, pp. 234–247.
[24] B. Babenko, M. Yang, S. Belongie, Robust object tracking with online multiple instance learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (8) (2011) 1619–1632.
[25] S. Hare, A. Saffari, P. H. S. Torr, Struck: Structured output tracking with kernels, in: Proceedings of IEEE International Conference on Computer Vision (ICCV), 2011, pp. 263–270.
[26] Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (7) (2012) 1409–1422.
[27] K. Zhang, L. Zhang, M. Yang, Real-time compressive tracking, in: Proceedings of European Conference on Computer Vision (ECCV), 2012, pp. 864–877.
[28] R. Liu, G. Zhong, J. Cao, Z. Lin, S. Shan, Z. Luo, Learning to diffuse: A new perspective to design PDEs for visual analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (12) (2016) 2457–2471.
[29] A. Rodriguez, V. N. Boddeti, B. V. K. V. Kumar, A. Mahalanobis, Maximum margin correlation filter: A new approach for localization and classification, IEEE Transactions on Image Processing 22 (2) (2013) 631–643.
[30] M. Danelljan, F. S. Khan, M. Felsberg, J. van de Weijer, Adaptive color attributes for real-time visual tracking, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1090–1097.
[31] M. Danelljan, G. Häger, F. Khan, M. Felsberg, Accurate scale estimation for robust visual tracking, in: Proceedings of British Machine Vision Conference (BMVC), 2014.
[32] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, Y. Singer, Online passive-aggressive algorithms, Journal of Machine Learning Research 7 (2006) 551–585.
[33] A. Vedaldi, K. Lenc, MatConvNet: Convolutional neural networks for MATLAB, in: Proceedings of the 23rd Annual ACM Conference on Multimedia, MM '15, Brisbane, Australia, October 26-30, 2015, pp. 689–692.
[34] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61 (2015) 85–117.
[35] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1556.
[36] R. Rifkin, G. Yeo, T. Poggio, Regularized least-squares classification, Nato Science Series Sub Series III Computer and Systems Sciences 190 (2003) 131–154.
[37] R. M. Gray, Toeplitz and circulant matrices: A review, Foundations and Trends in Communications and Information Theory 2 (3) (2006).
[38] Y. Wu, J. Lim, M.-H. Yang, Online object tracking: A benchmark, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2411–2418.
[39] Z. Kalal, J. Matas, K. Mikolajczyk, P-N learning: Bootstrapping binary classifiers by structural constraints, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 49–56.