Learning Transform-Aware Attentive Network for Object Tracking
Xiankai Lu, Bingbing Ni¹, Chao Ma and Xiaokang Yang
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Abstract
Existing trackers often decompose the task of visual tracking into multiple independent components, such as target appearance sampling, classifier learning, and target state inference. In this paper, we present a transform-aware attentive tracking framework, which uses a deep attentive network to directly predict the target states via spatial transform parameters. During off-line training, the proposed network learns generic motion patterns of target objects from auxiliary large-scale videos. These learned motion patterns are then applied to track target objects on test sequences. Built on the Spatial Transformer Network (STN), the proposed attentive network is fully differentiable and can be trained in an end-to-end manner. Notably, we only fine-tune the pre-trained network in the initial frame. The proposed tracker requires neither online model update nor appearance sampling during the tracking process. Extensive experiments on the OTB-2013, OTB-2015, VOT-2014 and UAV-123 datasets demonstrate the competitive performance of our method against state-of-the-art attentive tracking methods.

Keywords: Transform-aware, visual attention, Spatial Transformer Networks, object tracking.
1 Corresponding author: [email protected]
1. Introduction

Object tracking is a fundamental problem in computer vision with a wide range of applications, including human-machine interfaces, video surveillance, traffic monitoring, etc. Typically, given the ground truth of a target object in the first frame, object tracking aims at predicting the target states, e.g., position and scale, in subsequent frames. Recent years have witnessed the success of the tracking-by-detection approach, which incrementally learns a binary classifier to discriminate target objects from the background. This approach requires generating a large number of samples in each frame using either sliding windows [1, 2, 3], random samples [4, 5], or region proposals [6, 7]. For training the discriminative classifier, the samples are assigned binary labels according to their overlap ratio scores with respect to the tracked result in the previous frame. During tracking, the classifier is used to compute the confidence scores of the samples, and the sample with the highest confidence score indicates the tracked result. Note that independently computing the confidence scores of samples often causes a heavy computational burden, which is even heavier for deep learning trackers. For example, the speed of the recently proposed MDNet [3] tracker is less than one frame per second. To avoid drawing samples, an alternative approach is to learn correlation filters [8, 9]. The output correlation response maps can be used to locate target objects precisely. However, such response maps are hardly aware of scale changes. We also note that correlation filters heavily rely on an incremental update scheme, which occurs frame by frame on the fly. Slight inaccuracies in a frame easily accumulate and degrade the learned correlation filters.

In this work, instead of drawing a large number of samples to learn a discriminative classifier or directly learning correlation filters, we exploit a novel framework to infer target states in terms of both position and scale changes in an end-to-end manner (see Figure 1). We take inspiration from the recent success of the spatial transformer [10] as well as the visual attention mechanism in learning deep neural networks.
Figure 1: Different tracking schemes. (a) Tracking by sampling target states from the image. (b) Tracking by inferring target states from response maps (e.g., correlation filter based response maps [8, 9]). (c) The proposed tracking framework. The proposed tracker builds upon a tailored spatial transformer network that takes the reference image and search area as input and outputs the spatial transformer parameters containing both location and scale information as tracking results.
On the one hand, Spatial Transformer Networks (STN) learn invariance to translation, scale, rotation and more generic warping. Therefore, an STN can attend to the task-relevant regions via predicted transformation parameters. It is straightforward to exploit this invariance to estimate the appearance changes of target objects. On the other hand, existing attentive tracking methods built on deep neural networks such as the Restricted Boltzmann Machine (RBM) [11] and the Recurrent Neural Network (RNN) [12] cannot deal with spatial transformations. Therefore, multiple independent components are needed for position and scale estimation. In other words, the visual attention mechanism in [11, 12] is only exploited as one submodule for estimating location changes. This work aims at learning a unified attention network that directly predicts both the position and scale changes via spatial transformer parameters.
The proposed transform-aware attentive network (TAAT) is a Siamese matching network with two input branches. We constantly feed the ground truth of the target in the first frame into one branch, while sequentially feeding image frames into the other branch. Each branch consists of multiple convolutional layers to generate deep features. Features from the two branches are then concatenated and fed into fully connected layers that output the spatial transformer parameters. The proposed network naturally attends to regions of interest where the target object is likely to be. Compared to traditional attentive tracking methods, the proposed network outputs a considerably finer attentive area defined by the spatial transformer parameters. This naturally makes the tracking algorithm more invariant to translation and scale changes. We first train the proposed TAAT network off-line in an end-to-end manner on a large labeled video dataset. We use a data augmentation scheme in both the temporal and spatial domains. In each iteration, we feed a triplet, i.e., reference image, search image, and ground-truth image of the target object, into the network. We use an ℓ1 loss constraint to speed up convergence. During the tracking process, we apply this pre-trained network to search frames. The output directly gives the moving states of the target as well as a glimpse [13] of the input image. Figure 2 illustrates an overview of the proposed tracker.

We summarize the contributions of this work as follows:

• We propose a transform-aware attentive network for object tracking by integrating the attention mechanism into a tailored Spatial Transformer Network. The proposed network attends to the region of interest with finer attention and can be trained in an end-to-end manner. With the use of an ℓ1 loss constraint, the proposed network converges fast in the training stage.

• We cast the visual tracking problem as pairwise matching. We effectively get rid of the cumbersome sampling scheme. The proposed algorithm achieves a satisfying tracking speed.

• Extensive experiments on popular benchmark datasets demonstrate the favorable performance of the proposed algorithm when compared with state-of-the-art trackers.
Figure 2: Architecture of the proposed transform-aware attentive network for visual tracking. It consists of an attention module (left) and a patch similarity module (right). It takes as input a triple of images: reference image R, search area S, and ground-truth image G. In the training stage, a pair of reference image R and search area S is fed into a matching network, i.e., a tailored Spatial Transformer Network. The output transformer parameters Θ define the tracking prediction, and the corresponding cropped area (V) can be viewed as a glimpse on the search area [13]. To fine-tune the matching network, we compute the ℓ1 distance between the representation of the glimpse V and the ground-truth image G as the loss for back propagation. At test time, given a pair of reference image R and search image S, the proposed network outputs the glimpse (V) as well as the similarity between R and S.
The rest of this paper is organized as follows. In Section 2, we review the works closely related to our proposed approach. Section 3 gives a detailed description of the proposed transform-aware attentive model. Experimental results are reported and analyzed in Section 4. We conclude this paper in Section 5.
2. Related Work

Visual tracking has long been an active research area, and deep learning has become popular for visual tracking. We briefly categorize the most related works into the following aspects: (1) tracking by sampling target states in images, (2) tracking by inferring target states from response maps, and (3) tracking by attention models.

2.1. Tracking with Sampling

Traditional tracking-by-detection methods [2, 14, 15, 16, 17, 18, 19, 20] usually learn a discriminative classifier from a large number of candidates sampled around the position in the previous frame. The learned classifier is then used to compute the confidence scores of samples in the current frame. The sample with the highest confidence score thus indicates the tracking result. This strategy is popular among recent deep trackers [21, 3]. On the other hand, Zhu et al. [7] exploit region proposals trained for object detection to generate good candidates. Note that generating a large number of samples brings not only a heavy computational burden but also sampling ambiguity [9], i.e., assigning spatially correlated samples with binary labels. To reduce the computational load, Tao et al. [21] employ the region of interest (RoI) pooling technique. Rather than searching over hundreds of sampled candidates, the proposed method employs a deep neural network with attention to directly output the target states. As a result, the proposed algorithm successfully circumvents the annoying sampling issues and achieves real-time tracking speed. In [16], an off-line target detector is first trained on large amounts of labeled videos and the pre-trained model is then tested on unseen videos. Different from this method, the proposed method regards object tracking as an attention process, where bottom-up and top-down mechanisms are combined to locate the tracking target in the next frame.
AC
Recent years have witnessed the success of inferring target states from re-
sponse maps [22, 23, 24, 25, 26, 27]. The most representative approach is correlation filter based trackers [8, 9, 6]. The key idea lies in that: correlation filter can be seen as a template encoding the appearance of target objects. The
110
correlation response indicates the similarity between the target template and a candidate search window. The position of maximum response value indicates
6
ACCEPTED MANUSCRIPT
the location of the target. Note that correlation response maps can also be obtained by the fully convolutional deep network. In [28, 29], Wang et al. develop a convolutional sub-network on top of deep features generated by VGG-Net [30] to output confidence map. The location of the maximum value in the confi-
CR IP T
115
dence map is used for identifying the target’s position in the next frame. Unlike these methods that can only infer the location from these response maps, the
proposed network learns invariance to spatial transforms including but not lim-
ited to location and scale. With the use of STN, state sampling happens in the middle CNN layers in a similar way to Faster-RCNN [31].
AN US
120
2.3. Tracking with Deep Attention Model
Visual attention mechanisms mimic the biological vision system [32, 33, 34] by allocating limited perceptual resources to attend to interesting or salient areas. Visual attention models have been widely used to improve visual tracking [35, 36, 37, 11, 38]. Most of these models are saliency map driven [39, 40, 41]. Recently, deep attention mechanisms have become an attractive research domain [42, 43], and show great advantages in a variety of tasks, such as object recognition [44], image captioning [43], image generation [13] and fine-grained classification [45, 10]. A deep attentive network is capable of learning "where" and "what" to focus on. Considerable efforts have been made to train attentive networks for visual tracking. Inspired by the success of DRAW [13], Kahou et al. [12] trained a temporal recurrent neural network (RNN) to predict where to track in the subsequent frame. They also apply 2D Gaussian filters to crop search areas. Since this model is only trained with toy examples generated from the MNIST dataset [46], it is unlikely to perform well when tracking generic target objects. In [47], Cui et al. build an attention module upon a spatial multi-direction RNN to encode the ensemble of target parts, and use the output saliency map to compute the weights of different parts. However, this RNN is incrementally trained on each tracking result, so it lacks the ability to mine complex attentive mechanisms from large-scale image sequences. In this work, we develop a transform-aware attentive network built upon Spatial Transformer Networks. Our network is fully differentiable, allowing end-to-end training on large-scale auxiliary sequences, so the proposed tracker can learn generic transformation knowledge from these auxiliary sequences and subsequently track novel targets.
3. Transform-aware Attentive Tracking

In this section, we first give an overview of the proposed transform-aware attentive network. We then present the network architecture in more detail, introduce the training scheme on large-scale datasets, and finally show how to use the trained model to perform visual tracking.

3.1. Overview

As mentioned before, existing algorithms [2, 14, 8, 21, 3] relying on sampling target states face the challenges of both high computational load and sampling ambiguity (i.e., assigning spatially correlated samples either positive or negative labels). Meanwhile, response maps generated by correlation filters [9] or fully convolutional neural networks [28] cannot infer more target states than location changes. Our goal is to explore the mechanism of spatio-temporal transformation of generic target objects between image pairs. To this end, we train a tailored Spatial Transformer Network that attends to the most relevant regions of large-scale image sequences in an end-to-end fashion. The transformation includes scaling, cropping, rotation, as well as non-rigid deformation, and naturally fulfills the goal of visual tracking. When performed on the entire feature map of a search image, the proposed network directly outputs the transform-aware parameters as tracking results. Figure 2 shows an overview of the proposed algorithm. In the training stage, we feed a triplet of images (reference image R, search image S, and ground-truth image G) into the network. Note that in the training process the reference image R is the target in the last frame, while at test time it is the ground-truth patch in the first frame. The search image S denotes spatially augmented patches in the training phase and the frame in which tracking is performed in the test phase. The ground-truth image G is cropped from the corresponding S during training and comes from the first frame during testing. During forward propagation, the Spatial Transformer Network performs pairwise matching between the reference image R and the search image S. The output transformer parameters indicate the target states containing both location and scale changes. During backward propagation, we compute the ℓ1 loss between the estimated target patch (defined by the output transform-aware parameters) and the ground-truth image G to fine-tune the network.

3.2. Formulation of Attention
The attention procedure in object tracking can be described as follows: given the reference target appearance R in the first frame, the trained network attends to the target in the search patch S. This attending step can be modeled by learning a relationship between the image appearance and the relative transformation parameters Θ. Similar to the Lucas-Kanade algorithm [48], this procedure can be formulated as:

$$\Theta = f\!\left(\begin{bmatrix} \phi(S) \\ \phi(R) \end{bmatrix}\right) \qquad (1)$$

In this work, we use Θ = [s_x, s_y, t_x, t_y] to denote the state changes in terms of scale and location, where s_x and s_y stand for scale changes, and t_x and t_y for translation changes in the horizontal and vertical directions. Here f denotes the transformation estimation function and φ denotes the feature extraction function. The transformation between the search image S and the estimated attentive region (i.e., the output tracking result) is subject to an inverse warp function:

$$\begin{pmatrix} x_i^{\mathrm{in}} \\ y_i^{\mathrm{in}} \end{pmatrix} = \begin{bmatrix} s_x & 0 & t_x \\ 0 & s_y & t_y \end{bmatrix} \begin{pmatrix} x_i^{\mathrm{out}} \\ y_i^{\mathrm{out}} \\ 1 \end{pmatrix} \qquad (2)$$

where $x_i^{\mathrm{in}}$ and $y_i^{\mathrm{in}}$ are the input image coordinates and $x_i^{\mathrm{out}}$ and $y_i^{\mathrm{out}}$ are the output image coordinates. Their values are normalized to [−1, 1], and the index i runs over the output image. Note that the transformer Θ naturally delineates an
attentive region where the target might be.

3.3. Network Architecture

We briefly introduce the STN for completeness. The STN is a sampling-based differentiable network which consists of a localization network for regressing the transformer parameters and sampling layers that transform the input maps into output maps corresponding to a region of the input maps. This is in accord with the attention procedure introduced above. Thus, we tailor the STN to realize the attention module for estimating the transformer parameters Θ.

Given an input image pair of reference image R and search image S, we develop this attention module upon a Siamese structure. We first use shared convnets from deep networks (e.g., Alex-Network [49]) for feature extraction. The two branches of deep features are concatenated and fed into a three-layer fully connected regression network, which directly outputs the transformer Θ. Note that this fully connected regression network learns a generic spatial-temporal transformation invariant to significant appearance changes caused by scale change, abrupt motion, deformation, as well as partial occlusion.

After computing the transformer parameters Θ, we utilize a differentiable sampling layer to obtain the tracked result in the search image S. For each training iteration, the tracked result can be thought of as a glimpse V, whose values are computed as follows:

$$V_i^c = \sum_{n}^{H} \sum_{m}^{W} S_{nm}^{c} \, \max\!\big(0, 1 - |x_i^{\mathrm{in}} - m|\big) \, \max\!\big(0, 1 - |y_i^{\mathrm{in}} - n|\big) \qquad (3)$$

where $V_i^c$ is the output value of the i-th pixel in channel c of V. Here, W and H denote the width and height of S, respectively. Since the estimated $x_i^{\mathrm{in}}$ and $y_i^{\mathrm{in}}$ are not always integers, bilinear interpolation is employed to compute $V_i^c$ from the four neighboring pixels around $(x_i^{\mathrm{in}}, y_i^{\mathrm{in}})$.
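To make the sampling step concrete, the following is a minimal NumPy sketch of Eqs. (2)-(3): it builds the inverse-warp grid from Θ = [s_x, s_y, t_x, t_y] and bilinearly samples a glimpse from the search image. It is an illustration under the normalized-coordinate convention above, not the Caffe/MATLAB implementation used in this work, and the function and argument names are ours.

```python
import numpy as np

def extract_glimpse(S, theta, out_h=112, out_w=112):
    """Sample a glimpse V from search image S (H x W x C) given
    theta = [sx, sy, tx, ty], following Eqs. (2)-(3)."""
    sx, sy, tx, ty = theta
    H, W, C = S.shape
    # Normalized output coordinates in [-1, 1].
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing="ij")
    # Inverse warp (Eq. 2): map output coordinates back into the search image.
    x_in = sx * xs + tx
    y_in = sy * ys + ty
    # Convert normalized coordinates to pixel coordinates.
    px = (x_in + 1.0) * (W - 1) / 2.0
    py = (y_in + 1.0) * (H - 1) / 2.0
    # Bilinear sampling kernel (Eq. 3), restricted to the four neighbors.
    x0, y0 = np.floor(px).astype(int), np.floor(py).astype(int)
    V = np.zeros((out_h, out_w, C), dtype=S.dtype)
    for dx in (0, 1):
        for dy in (0, 1):
            xn = np.clip(x0 + dx, 0, W - 1)
            yn = np.clip(y0 + dy, 0, H - 1)
            w = np.maximum(0, 1 - np.abs(px - xn)) * np.maximum(0, 1 - np.abs(py - yn))
            V += w[..., None] * S[yn, xn, :]
    return V
```

For instance, theta = [0.5, 0.5, 0.2, 0.0] would crop a half-size window of S shifted horizontally by 0.2 in normalized units.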
10
ACCEPTED MANUSCRIPT
3.4. Network Training To guide the matching network learn to estimate the transform parameters
210
between successive frames, we design a patch similarity module to compute the
CR IP T
loss for back propagation. Rather than comparing the difference between the
output transformer parameters and ground truth bounding boxes, we compute the distance between the tracked result (a glimpse of visual attention) in a form 215
of image patch and ground truth patches. This image-level distance makes the training easily as it does not require the extract transformation parameters,
explicitly. Inspired by [50, 51], we feed both glimpse and ground-truth patches
AN US
into Alex-Network and use the output of the conv5 layer as features. The loss
between the tracked patch (glimpse) V and ground-truth image G is calculated 220
via the `1 distance as follows:
L = kφ(V ) − φ(G)k1 +
λ kW k22 2
(4)
M
where φ(V ) and φ(G) denote the extracted deep features from tracked image patch V and ground truth patch G. Here, W denotes the weight parameters of the fully connected regression layers in the Spatial Transformer Network
225
ED
and λ is the weight decay. We add a `2 normalization layer before the loss layer to eliminate the scale disparity of input deep features. This operation
PT
also brings an advantage, i.e., during the test we can leverage a simple inner product to measure the similarity among the input features. With the use of the `1 loss, the proposed attentive network converges fast. Figure 3 visualizes the
CE
predicted transformer parameters with different iterations on training sequences.
230
Although the proposed attentive network initiates in different regions with high variance, and it gradually attends to the target regions along with the increase
AC
of iterations. Since target states between two consecutive frames usually do not change
dramatically, we crop a search window centered at the previous position in the
235
search image S for training. Let wt−1 and ht−1 denote width and height of ground-truth image patch in the previous frame. We enlarge the search window
11
Iterations
CR IP T
ACCEPTED MANUSCRIPT
Figure 3: Visualization of attentive process on the sequences in ALOV dataset [52]. Each col-
umn shows the attentive regions with different training iterations (10, 1000, 1500, 20000, 70000)
AN US
for a same frame. With the increase of iterations, the proposed network attends to the regions of interest precisely.
in proportion to a scaling factor k > 1 with width ws = kwt−1 and height hs = kht−1 . As a result, this simple scheme avoids searching over the whole image. In addition, data augmentation is implemented on the training data in both the spatial and temporal domains. In the spatial domain, we augment the
M
240
search window with both position and scale changes [16], and obtain M crops. To augment the temporal variance among sequences, we extract multiple pairs
ED
between search window and the reference image. In this case, both the search image and the reference image are randomly picked from different frames. Since long temporal span usually causes untrue invariance, we control the temporal
PT
245
augmentation within a span of T frames. More implementation details can be
CE
found in 4.1.
3.5. State Inference The proposed algorithm does not require model update to perform visual
AC
tracking. Once we have the trained network, we can directly apply it to track target objects in testing sequences. We fix the reference image as the ground truth patch in the first frame and fine-tune the network with samples generated as the training phase. Thus we can obtain a domain specific tracking network. From the second frame, we take crop centered position in the previous frame
12
ACCEPTED MANUSCRIPT
with a width and height of kwt−1 and kht−1 . We sequentially feed each crop into the network as the search image S. The output transformer Θ = [sx , sy , tx , ty ] of the network indicates the target state changes. In this work, we assume the
CR IP T
glimpse V matches the target tightly, so the target coordinates in V can be expressed as [1, 1], [−1, −1] which indicate bottom right and top left corners in
V. Then we infer the bounding box on search area from the glimpse V (See Eq. 3) as:
$$\begin{pmatrix} x_1 \\ y_1 \end{pmatrix} = \begin{bmatrix} s_x & 0 & t_x \\ 0 & s_y & t_y \end{bmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, \qquad \begin{pmatrix} x_2 \\ y_2 \end{pmatrix} = \begin{bmatrix} s_x & 0 & t_x \\ 0 & s_y & t_y \end{bmatrix} \begin{pmatrix} -1 \\ -1 \\ 1 \end{pmatrix} \qquad (5)$$
where (x1 , y1 ) and (x2 , y2 ) are the bottom-right and top-left coordinates in S. 250
2 y1 +y2 In the t-th frame, the target position is pt = (cx, cy) = ( x1 +x , 2 ), and the 2
M
scale is (w, h) = (x1 − x2 , y1 − y2 ).
Position Refinement. To obtain a more tight bounding box, we leverage
ED
the bounding box regression to refine estimated results as in [31, 3]. In the first frame, we train four linear ridge regressors for the center, width and height of bounding boxes using the conv4 features of Alex-Net. For each subsequent
PT
frame, regressors take the output glimpse of the attentive network as input and
AC
CE
output the refined bounding boxes as tracking results:

$$\begin{aligned} t_x &= (cx - x_a)/w_a, & t_y &= (cy - y_a)/h_a, \\ t_w &= \log(w/w_a), & t_h &= \log(h/h_a), \\ t_x^{*} &= (x^{*} - x_a)/w_a, & t_y^{*} &= (y^{*} - y_a)/h_a, \\ t_w^{*} &= \log(w^{*}/w_a), & t_h^{*} &= \log(h^{*}/h_a) \end{aligned} \qquad (6)$$
Here, cx, cy, w, and h denote the predicted boxs center coordinates and its width and height. Variables cx, xa , and x∗ are for the predicted box, anchor box, and ground-truth box respectively. These four regressors are not updated
13
ACCEPTED MANUSCRIPT
Figure 4: Overall performance on the OTB-2013 [53] and OTB-2015 [54] datasets using one-pass evaluation (OPE). The legend of the distance precision plots contains the threshold scores at 20 pixels, while the legend of the overlap success plots contains the area-under-the-curve (AUC) scores for each tracker. The proposed tracker performs well against the baseline trackers. Here TAAT-bb denotes the proposed method without bounding box regression.
during tracking and adjust predicted position only when the estimated position
PT
is reliable (i.e. f (x, z) > µ, µ is a predefined value). Scale Refinement.
We observe that the estimated scale changes are not
smooth from frame to frame. We thus crop N patches centered at the estimated
CE
position pt but with different scales n = b− N 2−1 c, b− N 2−2 c, · · · , b N 2−1 c [55]. Then these patches are resized to a fixed size of training patches. We feed them into
AC
the network and then compare the similarity of the output glimpses Vn and
the reference image R. For efficiency, rather than using the ℓ1 loss as in Eq. 4, we compute the inner product between the glimpse V and the reference image R to evaluate their similarity:

$$f(R, V) = \phi(R)^{T}\phi(V) \qquad (7)$$
ACCEPTED MANUSCRIPT
Algorithm 1 Proposed Tracking Algorithm
Require: Pre-trained network f, previous target position (x_{t-1}, y_{t-1}) and size (w_{t-1}, h_{t-1})
Ensure: Estimated position (x_t, y_t) and size (w_t, h_t).
1: Fine-tune the pre-trained network f with samples in the first frame
2: repeat
3:   Crop out the search window in frame t centered at (x_{t-1}, y_{t-1});
4:   Estimate the new position (x_t, y_t) from Equation 5 over deep convolutional features;
5:   if f(x, z) > µ then
6:     Perform position refinement via Equation 6;
7:   end if
8:   Crop multiple patches centered at (x_t, y_t) and obtain the optimal scale via Equation 8;
9: until End of video sequences.
M
9:
where φ(R) and φ(V ) denote the deep features generated from R and V. The
ED
refined tracking scale n∗ is inferred from the glimpse with the maximum value of similarity score:
$$n^{*} = \arg\max_{n} f(R, V_n) \qquad (8)$$
PT
The overall algorithmic steps are summarized in Algorithm 1.
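The scale search of Eqs. (7)-(8) amounts to an inner product between ℓ2-normalized features; a minimal sketch (assuming precomputed feature vectors; names are ours):

```python
import numpy as np

def best_scale(feat_ref, feats_by_scale):
    """Pick the scale whose glimpse features are most similar to the
    reference features under the inner product of Eqs. (7)-(8).
    feat_ref: 1-D reference feature vector phi(R).
    feats_by_scale: list of 1-D glimpse feature vectors phi(V_n)."""
    def normalize(v):
        return v / (np.linalg.norm(v) + 1e-8)
    r = normalize(feat_ref)
    scores = [float(np.dot(r, normalize(v))) for v in feats_by_scale]
    return int(np.argmax(scores)), scores
```

With N = 5 scales and a step of 1.02 (Section 4.1), the candidate scale factors would be 1.02 raised to the powers −2, −1, 0, 1, 2.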
CE
4. Experiment
4.1. Implementation

The proposed model is implemented in MATLAB with the Caffe library [56]. The model is trained on an Intel Xeon 1.60 GHz CPU with 16 GB RAM and a TITAN X GPU. We utilize Alex-Network [49], VGG-Network [30] and ResNet [57] to build the convnet part of the localization network, respectively. Specifically, we leverage the deep features from the Conv5 layer, the Conv5_3 layer of VGGNet, and the res4f layer of ResNet for target appearance representation. For ResNet, we add a
ACCEPTED MANUSCRIPT
Figure 5: Quantitative results on 6 challenging attributes in the OTB-2015 [54] dataset using one-pass evaluation (OPE). The legend of the distance precision plots contains the threshold scores at 20 pixels, whereas the legend of the overlap success plots contains the area-under-the-curve (AUC) scores for each tracker. The proposed method generally performs better than the state-of-the-art trackers.
1 × 1 convolution layer which reduces the number of feature channels from 1024 to 256. We remove the original fully connected layers and add three fully connected layers with 2048, 256 and 4 nodes, respectively. As suggested in [10], we initialize the weights of the last fully connected layer with 0. In addition, Alex-Network is utilized to extract features when matching the glimpse and ground-truth patches. We constantly set the size of the glimpse to half of the search image, i.e., 112 × 112. To ensure the extracted deep features have the same dimension (2304), we append an average pooling layer (stride = 2, kernel size = 2) for the ground-truth patch (G) before inputting it to the convnet. We remove the fully connected layer and add an ℓ2 normalization layer before the loss layer. We use the ALOV [52] and ImageNet Video [58] datasets to fine-tune the Spatial Transformer Network. These datasets cover diverse challenges of visual tracking including illumination change, cluttered background, occlusion, abrupt motion, etc. We exclude the sequences overlapping with the test datasets from ALOV for fair comparison. As described in Section 3.2, a triplet of images is fed into the network in every iteration. All images are re-scaled to 224 × 224. For data augmentation, we set the search radius k = 2.5. For every frame, M = 10 random patches are cropped, and for temporal augmentation the frame interval T is 10; this results in approximately 400k triplets. We use the stochastic gradient descent (SGD) method with a mini-batch size of 70 to train the network. The base learning rate is 0.0005, and it is decreased by multiplying by a factor of 0.8 at two epochs. We stop the training procedure when the loss drops below 0.02; the whole training takes about 12 hours. For scale refinement, we set the step size to 1.02 and the number of scales N to 5. The threshold value µ is set to 0.56. During the tracking procedure, we set the height of the search area to 3
290
times the height of the target and the width of the search area is 5 times the width of the target. Then the cropped image patch is re-scaled to 224 × 224. 4.2. Evaluation on OTB Dataset
4.2.1. Dataset and Evaluation Settings
The OTB dataset includes 50 sequences (OTB-2013) [53] and its updated
M
295
version [54] contains 100 sequences (OTB-2015) with more than 58,000 frames
ED
in total. These video sequences cover various challenging factors, such as fast motion, illumination change, background clutter and occlusion. We validate the proposed tracker against several recently proposed trackers which can be divided into three categories: (i) deep learning based tracking methods including
PT
300
Generic Object Tracking Using Regression Networks (GOTURN) [16], Siame-
CE
sefc [60] and Fully Convolution Network Tracker (FCNT) [28]; (ii) correlation filter based trackers including KCF [8], KCFDP [6] and DSST [61] as well as five representative trackers with favorable performance in the benchmark: MEEM [59], TGPR [62], TLD [14], Struck [2] and SCM [63]. We follow the
AC
305
benchmark evaluation protocol in [53], and use the precision and success plots to evaluate all the trackers. The precision plot demonstrates the percentage of frames where the distance between the predicted target location and the ground truth is within a given threshold (e.g., 20 pixels). The success plot illustrates
310
the percentage of frames where the overlap ratio between the predicted bound17
ACCEPTED MANUSCRIPT
Figure 6: Tracking results on six challenging benchmark sequences by ours and the MEEM [59], KCF [8], KCFDP [6] and Struck [2] trackers. Our tracker performs well against the state-of-the-art trackers in terms of precise localization and scale estimation. From the first row to the final row: Liquor, Lemming, Jogging, MotorRolling, Human and Girl2.
ing box and the ground truth bounding box is higher than a threshold (e.g., 0.5
AC
Intersection of Union). 4.2.2. Evaluation Results
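For clarity, the two OPE metrics described above can be sketched as follows (assuming (x, y, w, h) boxes; a simplified illustration, not the official benchmark toolkit):

```python
import numpy as np

def center_error(pred_boxes, gt_boxes):
    """Euclidean distance between predicted and ground-truth box centers.
    Boxes are (x, y, w, h) arrays of shape (T, 4)."""
    pc = pred_boxes[:, :2] + pred_boxes[:, 2:] / 2.0
    gc = gt_boxes[:, :2] + gt_boxes[:, 2:] / 2.0
    return np.linalg.norm(pc - gc, axis=1)

def iou(pred_boxes, gt_boxes):
    """Intersection-over-union per frame for (x, y, w, h) boxes."""
    x1 = np.maximum(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = np.maximum(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = np.minimum(pred_boxes[:, 0] + pred_boxes[:, 2], gt_boxes[:, 0] + gt_boxes[:, 2])
    y2 = np.minimum(pred_boxes[:, 1] + pred_boxes[:, 3], gt_boxes[:, 1] + gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred_boxes[:, 2] * pred_boxes[:, 3] + gt_boxes[:, 2] * gt_boxes[:, 3] - inter
    return inter / np.maximum(union, 1e-8)

def precision_at(pred_boxes, gt_boxes, thresh=20.0):
    """Fraction of frames whose center error is within `thresh` pixels."""
    return float(np.mean(center_error(pred_boxes, gt_boxes) <= thresh))

def success_auc(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """Area under the success curve: mean success rate over IoU thresholds."""
    overlaps = iou(pred_boxes, gt_boxes)
    return float(np.mean([np.mean(overlaps > t) for t in thresholds]))
```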
315
The overall results in Figure 4 are under the one-pass evaluation (OPE).
Figure 4 compares the average precision and success results with baseline algorithms on all the 50 benchmark sequences and the 100 sequences, respectively.
18
ACCEPTED MANUSCRIPT
TAAT Res denotes ResNet [57] based localization network, TAAT VGG denotes VGGNet [30] based localization network and TAAT denotes AlexNet [49] based localization network. In addition, we disable the bounding box regression module and report the
CR IP T
320
results using the name TAAT-bb. Among the compared trackers, even without the bounding box regression module, the proposed tracker TAAT-bb still achieves comparable performance with MEEM and KCFDP in terms of dis-
tance precision and overlap success. Equipped with all the components, the 325
proposed TAAT tracker achieves the best distance precision rates and over-
AN US
lap success rates. Among the compared trackers, TAAT Res achieves the top
performance when compared with deep learning based methods and tradition hand-crafted feature based methods according to distance precision (87.3% and 80.5%). TAAT VGG and origin TAAT trackers achieve slight inferior perfor330
mance on OTB-2013 and OTB-2015 datasets, respectively. With respect to the overlap success rate, the proposed method performs favorably against the re-
M
cent deep trackers Siamesefc [60] due to SiameseFC adopts more complex scale estimation strategy with scale penalty term. However, TAAT is superior to
335
ED
rest trackers including MEEM [59], KCF [8], KCFDP [6] with 61.7% and 58.2% scores on OTB-2013 and OTB-2015 datasets, respectively. It is worth mentioning that the proposed TAAT method achieves better performance than GO-
PT
TURN [16]. There are two reasons, first reason is that GOTURN directly optimizes the difference between predicted bounding box and the ground truth, this
CE
leads to the tracking performance fluctuation across different video sequences. 340
In addition, the proposed TAAT is pre-trained on a larger dataset which lead
AC
to more stronger generalization ability on unseen testing videos. We further compare the tracking performance for different video attributes
including fast motion, occlusion, motion blur, illumination variation, in plane rotation and out of plane rotation. Figure 5 illustrates that our tracker performs
345
well the against state-of-the-art methods in terms of distance precision rate on all 100 video sequences. Our method achieves superior performance with fast motion (83.5%), occlusion (76.0%), motion blur (75.0%), illumination variation 19
ACCEPTED MANUSCRIPT
(81.6%), in plane rotation (82.6%) and out plane rotation (81.6%). These results suggest that the proposed method are effective in handling various challenging 350
scenario, especially tracking failure caused by fast motion, occlusion.
CR IP T
Figure 6 qualitatively compares tracking results on featured challenging sequences. Since we do not update the trained network online, the proposed algorithm effectively avoids the noisy updates which always leads to tracking
failures. The proposed network learns the invariance to a wide range of spa355
tial transformation. It is able to deal with the heavy occlusion, e.g., recovering
the target after long-term occlusion (from 330th frame to 370th frame) in the
AN US
Lemming sequence and girl2 sequence (from 110th frame to 120th frame) as
shown in Figure 6. In addition, via the deep network in the proposed attentive network, the proposed method handles rotation and background clutter 360
(MotorRolling) effectively as the high level convolution features contain strong semantic information.
M
4.3. Evaluation on VOT2014 Dataset
4.3.1. Dataset and Evaluation Settings
365
ED
The VOT 2014 dataset [64] contains 25 real-world video sequences. Each sequence is labeled with six attributes including camera motion, illumination change, motion change, occlusion, size change and no degradation. There are
PT
two evaluation criteria in VOT2014 including accuracy and robustness. Accuracy is computed as the Pascal VOC Overlap Ratio (VOR): e =
area(RT ∩RG ) area(RT ∪RG ) ,
CE
where RT and RG are the areas of tracked and ground truth box respectively. 370
The robustness indicates the number of failures to track an object in a sequence. These two metrics are used to rank all the trackers. According to the evaluation
AC
protocol, a restart scheme is incorporated into a tracker whenever tracking failure occurs (overlap between estimated and ground truth target bounding box equals zero).
20
ACCEPTED MANUSCRIPT
375
4.3.2. Evaluation Results on VOT2014 Dataset We follow the protocol of VOT2014 and run experiment analysis to obtain the final report. Figure 7 illustrates the accuracy-robustness plots of all
CR IP T
comparison trackers including GOTURN, Struck, SAMF, DGT, eASMS, CMT,
MatFlow, ABS, BDT, HMMTxD, ACAT, DynMS, ACT, PTp, EDFT, IPRT, 380
LT FLO, SIR PF, FoT, ThunderStruck, FSDT, FRT, IVT, OGT, MIL, IIVTV2, Matrioska, CT, IMPNCC and NCC, where the best trackers are closer to the
top-right corner. The proposed TAAT as well as the extended TAAT VGG,
TAAT res perform favorably among all the other trackers, e.g., DSST [61],
AN US
KCF [8] and SAMF[55], which are all correlation filter based trackers.
In addition, we present the average accuracy, robustness rank and expected
385
average overlap (EAO) of all compared trackers in Table 1. The proposed tracker achieves a comparable result in both baseline and region noise experiments. Due to no model updating during tracking, the proposed tracker is not sensitive to
AC
CE
PT
ED
M
region noise.
Figure 7: The robustness-accuracy ranking plots of all trackers under baseline and region noise experiments. Trackers close to the top right corner of the plot are among the top performers.
21
ACCEPTED MANUSCRIPT
Tracker | Baseline Accuracy | Baseline Robustness | Region noise Accuracy | Region noise Robustness | Overall Accuracy | Overall Robustness | EAO
TAAT | 5.52 | 4.20 | 4.48 | 3.68 | 5.00 | 3.94 | 0.21
KCF | 3.56 | 7.52 | 5.08 | 7.92 | 4.32 | 7.72 | 0.19
TAAT VGG | 5.40 | 4.22 | 4.36 | 3.25 | 4.88 | 3.735 | 0.21
DSST | 4.92 | 7.28 | 4.28 | 6.64 | 4.60 | 6.96 | 0.17
TAAT res | 5.20 | 3.28 | 4.48 | 3.28 | 4.84 | 3.28 | 0.22
Struck | 10.00 | 13.16 | 10.36 | 11.92 | 10.18 | 12.54 | 0.19
GOTURN | 5.84 | 8.20 | 5.64 | 6.84 | 5.74 | 7.52 | 0.21
SAMF | 4.40 | 7.64 | 4.48 | 7.56 | 4.44 | 7.60 | 0.20
DGT | 8.40 | 5.12 | 6.16 | 6.12 | 7.28 | 5.62 | 0.18
eASMS | 8.92 | 7.96 | 7.28 | 7.88 | 8.10 | 7.92 | 0.18
CMT | 12.92 | 14.88 | 15.68 | 14.76 | 14.30 | 14.82 | 0.17
MatFlow | 11.92 | 4.52 | 9.88 | 8.40 | 10.90 | 6.46 | 0.17
ABS | 11.44 | 9.04 | 9.68 | 7.56 | 10.56 | 8.30 | 0.16
BDF | 12.08 | 10.88 | 12.32 | 9.84 | 12.20 | 10.36 | 0.15
HMMTxD | 5.80 | 10.16 | 4.80 | 10.08 | 5.30 | 10.12 | 0.15
ACAT | 8.24 | 9.60 | 8.76 | 8.68 | 8.50 | 9.14 | 0.15
DynMS | 11.56 | 10.12 | 11.60 | 11.00 | 11.58 | 10.56 | 0.15
ACT | 8.96 | 9.80 | 9.40 | 9.60 | 9.18 | 9.70 | 0.15
PTp | 19.04 | 10.60 | 15.20 | 10.36 | 17.12 | 10.48 | 0.14
EDFT | 10.68 | 13.84 | 10.88 | 15.52 | 10.78 | 14.68 | 0.14
IPRT | 14.72 | 12.60 | 13.72 | 12.56 | 14.22 | 12.58 | 0.14
LT FLO | 10.28 | 18.80 | 9.76 | 17.84 | 10.02 | 18.32 | 0.14
SIR PF | 10.80 | 14.00 | 10.56 | 14.52 | 11.68 | 13.76 | 0.14
FoT | 11.92 | 17.12 | 12.88 | 19.20 | 12.40 | 11.16 | 0.14
ThunderStruck | 10.08 | 14.00 | 10.56 | 11.52 | 10.68 | 12.76 | 0.13
FSDT | 14.12 | 18.16 | 12.20 | 15.08 | 12.24 | 16.62 | 0.13
FRT | 13.12 | 23.96 | 15.88 | 23.32 | 14.66 | 23.06 | 0.13
IVT | 13.60 | 18.48 | 16.80 | 16.60 | 15.46 | 17.64 | 0.12
OGT | 10.06 | 16.04 | 9.80 | 16.40 | 9.90 | 16.24 | 0.13
MIL | 21.12 | 14.02 | 25.84 | 14.92 | 23.90 | 14.82 | 0.12
IIVTv2 | 14.20 | 16.22 | 15.76 | 14.88 | 14.98 | 15.56 | 0.12
Matrioska | 12.52 | 11.64 | 10.08 | 15.04 | 11.66 | 13.34 | 0.11
CT | 16.16 | 17.64 | 17.00 | 16.56 | 16.58 | 17.10 | 0.11
IMPNCC | 14.96 | 20.24 | 17.68 | 18.68 | 16.32 | 19.46 | 0.11
NCC | 11.88 | 27.24 | 12.20 | 27.64 | 12.04 | 27.44 | 0.08
Table 1: The performance of all trackers on the VOT2014 dataset. We report the average ranks under the baseline and region noise experiments and the EAO. Red, blue and green colors denote the first, second and third best results.
4.4. Evaluation on UAV-123 Dataset

4.4.1. Dataset and Evaluation Settings
AC
390
UAV-123 dataset [65] is a recently released dataset which contains 123 track-
ing targets. All videos are collected from a low-altitude aerial perspective. The evaluation settings are similar to OTB dataset.
22
ACCEPTED MANUSCRIPT
395
4.4.2. Evaluation Results

Figure 8 illustrates the detailed tracking results on the UAV-123 dataset including MEEM, SRDCF, MUSTER, SAMF, KCF, DSST, DCF, Struck, ASLA,
CR IP T
OAB, CSK and TLD. The proposed TAAT method achieves top performance in
terms of distance precision (72.5%) and overlap rate (48.0%). This result demonstrates the effectiveness of the proposed TAAT on this long-term tracking dataset.
Figure 8: Overall performance on the UAV-123 [65] dataset using one-pass evaluation (OPE).
4.5. Tracker Analysis
M
400
ED
In this section, we discuss the proposed tracker in more detail.

4.5.1. Different loss functions
405
PT
Here, we evaluate the effectiveness of the proposed ℓ1 loss. For comparison, we substitute the proposed ℓ1 loss with the traditional ℓ2 loss. Figure 9(a) reports the convergence curves during the training procedure. It is clear that the ℓ1 loss converges more quickly than the ℓ2 loss. In terms of final tracking performance, the ℓ1 and ℓ2 losses achieve similar results (distance precision: 0.805 versus 0.804, overlap rate: 0.573 versus 0.571 on the OTB-2015 dataset).
410
4.5.2. Tracking speed

As our method does not require the cumbersome sampling stage during tracking, it achieves a satisfying tracking speed. Without bounding box regression, it runs at 20.1 frames per second (FPS). With all the components, it runs
Figure 9: Left: convergence curves with different loss functions. Right: tracking performance and speed with varying numbers of scales on the OTB-2015 dataset. The average overlap rate and average speed versus the number of scales N are illustrated.
Tracker | TAAT | TAAT VGG | TAAT Res | MDNet [3] | HCFT [9] | STCT [29]
FPS | 20.1 | 15.5 | 13.8 | 0.7 | 8.3 | 4.1
Table 2: Speed comparison between deep trackers on the OTB dataset. FPS: frames per second.
at 15 FPS. Here we compare the speed of state-of-the-art deep trackers on the OTB dataset in Table 2. All the trackers run on an Intel Xeon 1.60 GHz CPU with 16 GB RAM and a TITAN X GPU. It is worth mentioning that our model performs no model updating during tracking; the key parameter that affects the results is the number of scales N in scale estimation. A larger value of N contributes to more accurate estimation of the target's size, but leads to a heavier computational burden.

4.5.3. Parameter sensitivity

This subsection validates the effectiveness of several key components in our tracker. Since our model performs no model updating during tracking, the key parameters that affect the results are the number of scales N in scale estimation and the search area size. Generally, a higher value of N contributes to more precise size estimation (i.e., overlap rate) but aggravates the computational burden. Figure 9(b) reports the overlap rate versus tracking speed for different values of N.
24
ACCEPTED MANUSCRIPT
5. Conclusion

In this paper, we propose a transform-aware attentive tracking method inspired by deep attentive networks. With the use of a tailored Spatial Transformer Network, the proposed network attends to regions of interest where novel target objects might be. The output spatial transformer parameters indicate the target states with both location and scale information. The proposed algorithm does not require the cumbersome state sampling and model updating of existing tracking algorithms. It is evaluated on four popular benchmark datasets, OTB-2013, OTB-2015, VOT-2014 and UAV-123. The experimental results demonstrate that the proposed TAAT performs very well against baseline algorithms and achieves a satisfying speed. In the future, an online updating mechanism can be integrated into the TAAT model to handle target variance adaptively.
[1] N. Wang, J. Shi, D. Yeung, J. Jia, Understanding and diagnosing visual
ED
tracking systems, in: IEEE Int. Conf. Comput. Vis., 2015, pp. 3101–3109. [2] S. Hare, A. Saffari, P. H. Torr, Struck: Structured output tracking with kernels, in: IEEE Int. Conf. Comput. Vis., 2011, pp. 263–270.
PT
445
[3] H. Nam, B. Han, Learning multi-domain convolutional neural networks
CE
for visual tracking, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4293–4302.
[4] J. Shen, D. Yu, L. Deng, X. Dong, Fast online tracking with detection
AC
450
refinement, IEEE Trans. Intelligent Transportation Systems 19 (1) (2018) 162–173.
[5] B. Ni, A. A. Kassim, S. Winkler, A hybrid framework for 3-d human motion tracking, IEEE Trans. Circuits Syst. Video Techn. 18 (8) (2008) 1075–1084.
25
ACCEPTED MANUSCRIPT
[6] D. Huang, L. Luo, M. Wen, Z. Chen, C. Zhang, Enable scale and aspect ratio adaptability in visual tracking with detection proposals, in: BMVC,
455
2015, pp. 185.1–185.12.
CR IP T
[7] G. Zhu, F. Porikli, H. Li, Beyond local search: Tracking objects everywhere with instance-specific proposals, arXiv preprint arXiv:1605.01839.
[8] J. F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking
with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell.
460
37 (3) (2015) 583–596.
AN US
[9] C. Ma, J.-B. Huang, X. Yang, M.-H. Yang, Hierarchical convolutional fea-
tures for visual tracking, in: IEEE Int. Conf. Comput. Vis., 2015, pp. 3074–3082. 465
[10] M. Jaderberg, K. Simonyan, A. Zisserman, et al., Spatial transformer networks, in: Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2017–2025.
M
[11] M. Denil, L. Bazzani, H. Larochelle, N. de Freitas, Learning where to attend with deep architectures for image tracking, Neural computation 24 (8)
470
ED
(2012) 2151–2184.
[12] S. E. Kahou, V. Michalski, R. Memisevic, Ratm: Recurrent attentive track-
PT
ing model, arXiv preprint arXiv:1510.08660. [13] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, D. Wierstra, DRAW:
CE
A recurrent neural network for image generation, in: Int. Conf. Machin. Learn, 2015, pp. 1462–1471.
[14] Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, IEEE
AC
475
Trans. Pattern Anal. Mach. Intell. 34 (7) (2012) 1409–1422.
[15] T. Zhou, F. Liu, H. Bhaskar, J. Yang, H. Zhang, P. Cai, Online discriminative dictionary learning for robust object tracking, Neurocomputing 275 (2018) 1801–1812.
26
ACCEPTED MANUSCRIPT
480
[16] D. Held, S. Thrun, S. Savarese, Learning to track at 100 FPS with deep regression networks, in: Eur. Conf. Comput. Vis., 2016, pp. 749–765. [17] B. Ma, L. Huang, J. Shen, L. Shao, Discriminative tracking using tensor
CR IP T
pooling, IEEE Transactions on Cybernetics 46 (11) (2016) 2411–2422.
[18] Q. Guo, W. Feng, C. Zhou, C. Pun, B. Wu, Structure-regularized com-
pressive tracking with online data-driven sampling, IEEE Trans. Image
485
Processing 26 (12) (2017) 5692–5705.
[19] C. Li, L. Lin, W. Zuo, J. Tang, M. Yang, Visual tracking via dynamic graph
AN US
learning, IEEE Transactions on Pattern Analysis and Machine Intelligence. [20] X. Wang, C. Li, B. Luo, J. Tang, SINT++: robust visual tracking via adversarial positive instance generation, in: CVPR, 2018, pp. 4864–4873.
490
[21] R. Tao, E. Gavves, A. W. M. Smeulders, Siamese instance search for tracking, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1420–
M
1429.
ED
[22] J. Shen, J. Peng, L. Shao, Submodular trajectories for better motion segmentation in videos, IEEE Trans. Image Processing 27 (6) (2018) 2688–
495
2700.
PT
[23] B. Ni, X. Yang, S. Gao, Progressively parsing interactional objects for fine grained action detection, in: Proc. IEEE Conf. Comput. Vis. Pattern
CE
Recognit., 2016, pp. 1020–1028. 500
[24] W. Song, Y. Li, J. Zhu, C. Chen, Temporally-adjusted correlation filter-
AC
based tracking, Neurocomputing 286 (2018) 121–129.
[25] C. Zhang, J. Yan, C. Li, R. Bie, Contour detection via stacking random forest learning, Neurocomputing 275 (2018) 2702–2715.
[26] X. Lu, C. Ma, B. Ni, X. Yang, I. Reid, M. Yang, Deep regression tracking
505
with shrinkage loss, in: Eur. Conf. Comput. Vis., 2018, pp. 369–386.
27
ACCEPTED MANUSCRIPT
[27] T. Zhang, S. Liu, C. Xu, B. Liu, M. Yang, Correlation particle filter for visual tracking, IEEE Trans. Image Processing 27 (6) (2018) 2676–2687. [28] L. Wang, W. Ouyang, X. Wang, H. Lu, Visual tracking with fully convo-
510
CR IP T
lutional networks, in: IEEE Int. Conf. Comput. Vis., 2015, pp. 3119–3127. [29] L. Wang, W. Ouyang, X. Wang, H. Lu, STCT: sequentially training convolutional networks for visual tracking, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1373–1381.
[30] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-
515
AN US
scale image recognition abs/1409.1556.
[31] S. Ren, K. He, R. B. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (6) (2017) 1137–1149.
[32] P. Cavanagh, G. A. Alvarez, Tracking multiple targets with multifocal at-
[33] S. Zhang, X. Lan, H. Yao, H. Zhou, D. Tao, X. Li, A biologically inspired
ED
520
M
tention, Trends in Cognitive Sciences 9.
appearance model for robust visual tracking., IEEE Trans. Neural Netw. Learning Syst. (2016) 1–14.
PT
[34] Y. Jiao, Z. Li, S. Huang, X. Yang, B. Liu, T. Zhang, Three-dimensional attention-based deep ranking model for video highlight detection, IEEE Trans. Multimedia 20 (10) (2018) 2693–2705.
CE
525
[35] L. Itti, P. Baldi, A principled approach to detecting surprising events in
AC
video, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005, pp. 631–637.
[36] D. Gao, V. Mahadevan, N. Vasconcelos, The discriminant center-surround
530
hypothesis for bottom-up saliency, in: Proc. Adv. Neural Inf. Process. Syst., 2008, pp. 497–504.
28
ACCEPTED MANUSCRIPT
[37] V. Mahadevan, N. Vasconcelos, Biologically inspired object tracking using center-surround saliency mechanisms, IEEE Trans. Pattern Anal. Mach. Intell. 35 (3) (2013) 541–554. [38] J. Choi, H. J. Chang, J. Jeong, Y. Demiris, J. Y. Choi, Visual tracking
CR IP T
535
using attention-modulated disintegration and integration, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4321–4330.
[39] S. Hong, T. You, S. Kwak, B. Han, Online tracking by learning discrimina-
tive saliency map with convolutional neural network, in: Int. Conf. Machin. Learn, 2015, pp. 597–606.
AN US
540
[40] W. Wang, J. Shen, R. Yang, F. Porikli, Saliency-aware video object segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 40 (1) (2018) 20–33. [41] W. Zhang, Q. Chen, W. Zhang, X. He, Long-range terrain perception using convolutional neural networks, Neurocomputing 275 (2018) 781–787. [42] V. Mnih, N. Heess, A. Graves, et al., Recurrent models of visual attention,
M
545
in: Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2204–2212.
ED
[43] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation
550
PT
with visual attention, in: Int. Conf. Machin. Learn, 2015, pp. 2048–2057. [44] J. Ba, V. Mnih, K. Kavukcuoglu, Multiple object recognition with visual
CE
attention, arXiv preprint arXiv:1412.7755. [45] P. Sermanet, A. Frome, E. Real, Attention for fine-grained categorization,
AC
arXiv preprint arXiv:1412.7054.
[46] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning ap-
555
plied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
[47] Z. Cui, S. Xiao, J. Feng, S. Yan, Recurrently target-attending tracking, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1449–1458. 29
ACCEPTED MANUSCRIPT
[48] S. Baker, I. A. Matthews, Lucas-kanade 20 years on: A unifying framework, Int. J. Comput. Vis. 56 (3) (2004) 221–255.
560
[49] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep
CR IP T
convolutional neural networks, in: Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[50] S. Zagoruyko, N. Komodakis, Learning to compare image patches via convolutional neural networks, in: Proc. IEEE Conf. Comput. Vis. Pattern
565
Recognit., 2015, pp. 4353–4361.
AN US
[51] X. Han, T. Leung, Y. Jia, R. Sukthankar, A. C. Berg, Matchnet: Unifying feature and metric learning for patch-based matching, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3279–3286. 570
[52] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, M. Shah, Visual tracking: an experimental survey, IEEE Trans. Pattern
M
Anal. Mach. Intell. 36 (7) (2014) 1442–1468.
[53] Y. Wu, J. Lim, M.-H. Yang, Online object tracking: A benchmark, in:
575
ED
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 2411–2418. [54] Y. Wu, J. Lim, M. Yang, Object tracking benchmark, IEEE Trans. Pattern
PT
Anal. Mach. Intell. 37 (9) (2015) 1834–1848. [55] Y. Li, J. Zhu, A scale adaptive kernel correlation filter tracker with feature
CE
integration, in: ECCV Workshops, 2014, pp. 254–265. [56] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast fea-
AC
580
ture embedding, in: ACM Multimedia, 2014, pp. 675–678.
[57] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
30
ACCEPTED MANUSCRIPT
585
[58] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, F. Li, Imagenet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (3) (2015)
CR IP T
211–252. [59] J. Zhang, S. Ma, S. Sclaroff, Meem: Robust tracking via multiple experts
using entropy minimization, in: Eur. Conf. Comput. Vis., 2014, pp. 188–
590
203.
[60] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr,
AN US
Fully-convolutional siamese networks for object tracking, in: Eur. Conf. Comput. Vis., 2016, pp. 850–865. 595
[61] M. Danelljan, G. H¨ ager, F. S. Khan, M. Felsberg, Accurate scale estimation for robust visual tracking, in: BMVC, 2014.
[62] J. Gao, H. Ling, W. Hu, J. Xing, Transfer learning based visual tracking
M
with gaussian processes regression, in: Eur. Conf. Comput. Vis., 2014, pp. 188–203.
[63] W. Zhong, H. Lu, M.-H. Yang, Robust object tracking via sparsity-based
ED
600
collaborative model, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
PT
2012, pp. 1838–1845.
[64] L. Agapito, M. M. Bronstein, C. Rother, The visual object tracking
CE
VOT2014 challenge results, in: ECCV Workshops, 2014, pp. 191–217. 605
[65] M. Mueller, N. Smith, B. Ghanem, A benchmark and simulator for uav
AC
tracking, in: Eur. Conf. Comput. Vis., 2016.
31
CR IP T
ACCEPTED MANUSCRIPT
Xiankai Lu received the B.S. degree in au-
AN US
tomation from the Shan Dong University, Jinan, China, in 2012. He is currently
pursing his PhD degreein Shanghai Jiao Tong University, Shanghai, China. His research interests include image processing, object tracking and deep learning
PT
ED
M
610
Bingbing Ni received a B.Eng. in electrical
engineering from Shanghai Jiao Tong University, Shanghai, China, in 2005, and
CE
a Ph.D. from the National University of Singapore, Singapore, in 2011. He is currently a Professor with the Department of Electrical Engineering, Shanghai Jiao Tong University. Before that, he was a Research Scientist with the Ad-
AC
615
vanced Digital Sciences Center, Singapore. He was with Microsoft Research Asia, Beijing, China, as a Research Intern in 2009. He was also a Software Engineer Intern with Google Inc., Mountain View, CA, USA, in 2010. Dr. Ni was a recipient of the Best Paper Award from PCM11 and the Best Student
620
Paper Award from PREMIA08. He was also the recipient of the first prize in
32
ACCEPTED MANUSCRIPT
the International Contest on Human Activity Recognition and Localization in
CR IP T
conjunction with the International Conference on Pattern Recognition in 2012.
Chao Ma is a senior research associate with the
625
AN US
Australian Centre for Robotic Vision at The University of Adelaide. He received a Ph.D. from Shanghai Jiao Tong University in 2016. His research interests include computer vision and machine learning. He was sponsored by China
Scholarship Council as a visiting Ph.D. student at the University of California
ED
M
at Merced from the fall of 2013 to the fall of 2015. He is a member of the IEEE.
Xiaokang Yang received a B.S. from Xiamen University, Xia-
630
men, China, in 1994, an M.S. from the Chinese Academy of Sciences, Shanghai,
PT
China, in 1997, and a Ph.D. from Shanghai Jiao Tong University, Shanghai, in 2000. He is currently a Distinguished Professor with the School of Electronic In-
CE
formation and Electrical Engineering and the Deputy Director of the Institute of Image Communication and Information Processing at Shanghai Jiao Tong
635
University. He has authored over 200 refereed papers and holds 40 patents. His
AC
current research interests include visual signal processing and communication, media analysis and retrieval, and pattern recognition. He is an Associate Editor of the IEEE TRANSACTIONS ON MULTIMEDIA and a Senior Associate Editor of the IEEE SIGNAL PROCESSING LETTERS
33