Learning transform-aware attentive network for object tracking


Communicated by Dr Zhen Lei

Accepted Manuscript

Learning Transform-Aware Attentive Network for Object Tracking
Xiankai Lu, Bingbing Ni, Chao Ma, Xiaokang Yang

PII: S0925-2312(19)30235-8
DOI: https://doi.org/10.1016/j.neucom.2019.02.021
Reference: NEUCOM 20484
To appear in: Neurocomputing
Received date: 8 October 2018
Revised date: 2 January 2019
Accepted date: 11 February 2019

Please cite this article as: Xiankai Lu, Bingbing Ni, Chao Ma, Xiaokang Yang, Learning Transform-Aware Attentive Network for Object Tracking, Neurocomputing (2019), doi: https://doi.org/10.1016/j.neucom.2019.02.021

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Learning Transform-Aware Attentive Network for Object Tracking


Xiankai Lu, Bingbing Ni¹, Chao Ma and Xiaokang Yang
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

Abstract

Existing trackers often decompose the task of visual tracking into multiple independent components, such as target appearance sampling, classifier learning, and target state inference. In this paper, we present a transform-aware attentive tracking framework, which uses a deep attentive network to directly predict the target states via spatial transform parameters. During off-line training, the proposed network learns generic motion patterns of target objects from auxiliary large-scale videos. These learned motion patterns are then applied to track target objects in test sequences. Built on the Spatial Transformer Network (STN), the proposed attentive network is fully differentiable and can be trained in an end-to-end manner. Notably, we only fine-tune the pre-trained network in the initial frame. The proposed tracker requires neither online model update nor appearance sampling during the tracking process. Extensive experiments on the OTB-2013, OTB-2015, VOT-2014 and UAV-123 datasets demonstrate the competitive performance of our method against state-of-the-art attentive tracking methods.

Keywords: Transform-aware, visual attention, Spatial Transformer Networks, object tracking.

¹ Corresponding author: [email protected]

Preprint submitted to Neurocomputing, February 14, 2019


1. Introduction

Object tracking is a fundamental problem in computer vision with a wide range of applications, including human-machine interfaces, video surveillance, traffic monitoring, etc. Typically, given the ground truth of a target object in the first frame, object tracking aims at predicting the target states, e.g., position and scale, in subsequent frames. Recent years have witnessed the success of the tracking-by-detection approach, which incrementally learns a binary classifier to discriminate target objects from the background. This approach requires generating a large number of samples in each frame using either sliding windows [1, 2, 3], random samples [4, 5], or region proposals [6, 7]. For training the discriminative classifier, the samples are assigned binary labels according to their overlap ratio scores with respect to the tracked result in the previous frame. For the tracking process, the classifier is used to compute the confidence scores of the samples. The sample with the highest confidence score indicates the tracked result. Note that independently computing the confidence scores of samples often causes a heavy computational burden, which is even heavier for deep learning trackers. For example, the speed of the recently proposed MDNet [3] tracker is less than one frame per second. To avoid drawing samples, an alternative approach is to learn correlation filters [8, 9]. The output correlation response maps can be used to locate target objects precisely. However, such response maps are hardly aware of scale changes. We also note that correlation filters heavily rely on an incremental update scheme, which occurs frame by frame on the fly. Slight inaccuracy in a frame is easily aggregated and degrades the learned correlation filters.

In this work, instead of drawing a large number of samples to learn a discriminative classifier or directly learning correlation filters, we exploit a novel framework to infer target states in terms of both position and scale changes in an end-to-end manner (see Figure 1). We take inspiration from the recent success of the spatial transformer [10] as well as the visual attention mechanism in learning deep neural networks.


Figure 1: Different tracking schemes. (a) Tracking by sampling target states from the image. (b) Tracking by inferring target states from response maps (e.g., correlation filter based response maps [8, 9]). (c) The proposed tracking framework. The proposed tracker builds upon a tailored spatial transformer network that takes the reference image and search area as input and outputs the spatial transformer parameters containing both location and scale information as tracking results.

On the one hand, Spatial Transformer Networks (STN) learn invariance to translation, scale, rotation and more generic warping. Therefore, an STN can attend to task-relevant regions via the predicted transformation parameters. It is straightforward to exploit this invariance to estimate the appearance changes of target objects. On the other hand, existing attentive tracking methods built on deep neural networks such as the Restricted Boltzmann Machine (RBM) [11] and the Recurrent Neural Network (RNN) [12] cannot deal with spatial transformations. Therefore, multiple independent components are needed for position and scale estimation. In other words, the visual attention mechanism in [11, 12] is only exploited as one submodule for estimating location changes. This work aims at learning a unified attention network that directly predicts both the position and scale changes via spatial transformer parameters.


The proposed transform-aware attentive network (TAAT) is a Siamese matching network with two input branches. We constantly feed the ground truth of the target in the first frame into one branch, while sequentially feeding image frames into the other branch. Each branch consists of multiple convolutional layers to generate deep features. Features from the two branches are then concatenated and fed into fully connected layers that output the spatial transformer parameters. The proposed network naturally attends to regions of interest where the target object is likely to be. Compared to traditional attentive tracking methods, the proposed network outputs a considerably finer attentive area defined by the spatial transformer parameters. This naturally makes visual tracking more invariant to translation and scale changes. We first train the proposed TAAT network off-line in an end-to-end manner on a large labeled video dataset. We use a data augmentation scheme in both the temporal and spatial domains. In each iteration, we feed an image triplet, i.e., reference image, search image, and ground-truth image of the target object, into the network. We use an ℓ1 loss constraint to speed up convergence. During the tracking process, we apply this pre-trained network to search frames. The output directly gives the moving states of the target as well as a glimpse [13] of the input image. Figure 2 illustrates an overview of the proposed tracker.

We summarize the contributions of this work as follows:

• We propose a transform-aware attentive network for object tracking by integrating the attention mechanism into a tailored Spatial Transformer Network. The proposed network attends to the region of interest with finer attention and can be trained in an end-to-end manner. With the use of an ℓ1 loss constraint, the proposed network converges quickly in the training stage.

• We cast the visual tracking problem as pairwise matching. We effectively get rid of the cumbersome sampling scheme. The proposed algorithm achieves a satisfying tracking speed.

• Extensive experiments on popular benchmark datasets demonstrate the favorable performance of the proposed algorithm when compared with state-of-the-art trackers.

Figure 2: Architecture of the proposed transform-aware attentive network for visual tracking. It consists of an attention module (left) and a patch similarity module (right). It takes as input a triple of images: reference image R, search area S, and ground-truth image G. In the training stage, a pair of reference image R and search area S is fed into a matching network, i.e., a tailored Spatial Transformer Network. The output transformer parameters Θ define the tracking prediction, and the corresponding cropped area (V) can be viewed as a glimpse of the search area [13]. To fine-tune the matching network, we compute the ℓ1 distance between the representations of the glimpse V and the ground-truth image G as the loss for back-propagation. At test time, given a pair of reference image R and search image S, the proposed network outputs the glimpse V as well as the similarity between R and S.
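To make the training procedure in the caption above concrete, the following is a minimal sketch (PyTorch-style) of one training step on a triplet (R, S, G): the matching network predicts Θ, a glimpse V is cropped from S, and the ℓ1 distance between deep features of V and G drives back-propagation. The components `attention_net`, `crop_glimpse` and `phi`, and the use of PyTorch itself, are illustrative assumptions rather than the authors' exact Caffe implementation.

```python
import torch
import torch.nn.functional as F

def train_step(attention_net, crop_glimpse, phi, optimizer, R, S, G):
    theta = attention_net(R, S)                        # (N, 4): [sx, sy, tx, ty]
    V = crop_glimpse(S, theta)                         # differentiable crop of the search area
    feat_v = F.normalize(phi(V).flatten(1), dim=1)     # L2-normalised deep features of the glimpse
    feat_g = F.normalize(phi(G).flatten(1), dim=1)     # L2-normalised deep features of the ground truth
    loss = F.l1_loss(feat_v, feat_g, reduction='sum')  # Eq. (4); weight decay is left to the optimizer
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```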

The rest of this paper is organized as follows. In Section 2, we review the works closely related to our proposed approach. Section 3 gives a detailed description of the proposed transform-aware attentive model. Experimental results are reported and analyzed in Section 4. We conclude this paper in Section 5.

2. Related Work


Visual tracking has long been an active research area, and deep learning has become popular for visual tracking. We briefly categorize the most related works into the following aspects: (1) tracking by sampling target states in images, (2) tracking by inferring target states from response maps, and (3) tracking by attention models.

2.1. Tracking with Sampling

Traditional tracking-by-detection methods [2, 14, 15, 16, 17, 18, 19, 20] usually learn a discriminative classifier from a large number of sampled candidates around the position in the previous frame. The learned classifier is then used to compute the confidence scores of samples in the current frame. The sample with the highest confidence score thus indicates the tracking result. This strategy is popular among recent deep trackers [21, 3]. On the other hand, Zhu et al. [7] exploit region proposals trained for object detection to generate good candidates. Note that generating a large number of samples brings not only a heavy computational burden but also sampling ambiguity [9], i.e., assigning spatially correlated samples with binary labels. To reduce the computational load, Tao et al. [21] employ the region of interest (RoI) pooling technique. Rather than searching over hundreds of sampled candidates, the proposed method employs a deep neural network with attention to directly output the target states. As a result, the proposed algorithm successfully circumvents these sampling issues and achieves real-time tracking speed. In [16], an off-line target detector is first trained on large amounts of labeled videos and the pre-trained model is then tested on unseen videos. Different from this method, the proposed method regards object tracking as an attention process in which bottom-up and top-down mechanisms are combined to locate the tracking target in the next frame.

2.2. Tracking with Response Map

Recent years have witnessed the success of inferring target states from response maps [22, 23, 24, 25, 26, 27]. The most representative approach is correlation filter based trackers [8, 9, 6]. The key idea is that the correlation filter can be seen as a template encoding the appearance of the target object. The correlation response indicates the similarity between the target template and a candidate search window, and the position of the maximum response value indicates the location of the target. Note that correlation response maps can also be obtained by fully convolutional deep networks. In [28, 29], Wang et al. develop a convolutional sub-network on top of deep features generated by VGG-Net [30] to output a confidence map. The location of the maximum value in the confidence map is used for identifying the target's position in the next frame. Unlike these methods, which can only infer the location from response maps, the proposed network learns invariance to spatial transforms including but not limited to location and scale. With the use of the STN, state sampling happens in the middle CNN layers in a similar way to Faster R-CNN [31].


2.3. Tracking with Deep Attention Model

Visual attention mechanisms mimic the biological vision system [32, 33, 34] to allocate limited perceptual resources to areas of interest or saliency. Visual attention models have been widely used to improve visual tracking [35, 36, 37, 11, 38]. Most of these models are saliency map driven [39, 40, 41]. Recently, deep attention mechanisms have become an attractive research domain [42, 43] and show great advantages in a variety of tasks, such as object recognition [44], image captioning [43], image generation [13] and fine-grained classification [45, 10]. A deep attentive network is capable of learning "where" and "what" to focus on. Considerable efforts have been made to train attentive networks for visual tracking. Inspired by the success of DRAW [13], Kahou et al. [12] trained a temporal recurrent neural network (RNN) to predict where to track in the subsequent frame. They also apply 2D Gaussian filters to crop search areas. Since this model is only trained with toy examples generated from the MNIST dataset [46], it is unlikely to perform well on tracking generic target objects. In [47], Cui et al. build an attention module upon a spatial multi-direction RNN to encode the ensemble of target parts, and use the output saliency map to compute the weights of different parts. However, this RNN is incrementally trained on each tracking result, so it lacks the ability to mine complex attentive mechanisms from large-scale image sequences. In this work, we develop a transform-aware attentive network built upon Spatial Transformer

Networks. Our network is fully differentiable, allowing end-to-end training on large-scale auxiliary sequences. The proposed tracker can thus learn generic transformation knowledge from these auxiliary sequences and subsequently track novel targets.

3. Transform-aware Attentive Tracking

In this section, we first give an overview of the proposed transform-aware attentive network. We then present the network architecture in more detail and introduce the training scheme on large-scale datasets. Lastly, we show how to use the trained model to perform visual tracking.


3.1. Overview

As mentioned before, existing algorithms [2, 14, 8, 21, 3] that rely on sampling target states face both the challenge of a high computational load and that of sampling ambiguity (i.e., assigning spatially correlated samples either a positive or a negative label). Meanwhile, response maps generated by correlation filters [9] or fully convolutional neural networks [28] cannot infer more target states than location changes. Our goal is to explore the mechanism of the spatio-temporal transformation of generic target objects between image pairs. To this end, we train a tailored Spatial Transformer Network that attends to the most relevant regions (attention), learned from large-scale image sequences in an end-to-end fashion. The transformation includes scaling, cropping, rotation, as well as non-rigid deformations, and naturally fulfills the goal of visual tracking. When performed on the entire feature map (non-locally) of a search image, the proposed network directly outputs the transform-aware parameters as tracking results. Figure 2 shows an overview of the proposed algorithm. In the training stage, we feed a triplet of images (reference image R, search image S, and ground-truth image G) into the network. Note that in the training process the reference image R is the target in the last frame, while at test time it is the ground-truth patch in the first frame. The search image S denotes spatially augmented patches in the training phase

and the frame in which tracking is performed in the test phase. The ground-truth image G is cropped from the corresponding S during training and comes from the first frame in testing. During the forward propagation, the Spatial Transformer Network performs pairwise matching between the reference image R and the search image S. The output transformer parameters indicate the target states, containing both location and scale changes. As for the backward propagation, we compute the ℓ1 loss between the estimated target patch (defined by the output transform-aware parameters) and the ground-truth image G to fine-tune the network.

3.2. Formulation of Attention

The attention procedure in object tracking can be described as follows: given the reference target appearance (R) in the first frame, the trained network attends to the target in the search patch (S). This attending step can be modeled by learning a relationship between the image appearance and the relative transformation parameters Θ. Furthermore, similar to the Lucas-Kanade algorithm [48], this procedure can be formulated as:

$$\Theta = f\!\left(\begin{bmatrix}\phi(S)\\ \phi(R)\end{bmatrix}\right) \qquad (1)$$

In this work, we use Θ = [sx , sy , tx , ty ] to denote the state changes in terms of scale and location (e.g., location and size), where sx and sy stand for scale changes, tx and ty for translation changes on the horizontal and vertical direc-

CE

tions. f denotes the translation estimation function and φ means the feature extraction function. So the transformation between the search image S and

AC

the estimated attentive region (i.e., outputted tracking result) is subject to an inverse warp function:    xin s  i = x yiin 0

0 sy

9

  xout i  tx    yiout    ty 1 

(2)

ACCEPTED MANUSCRIPT

in out where xin and yiout are the output i and yi are the input image coordinates, xi

image coordinates. Their values are normalized to [−1 1]. The index i is based on the output image. Note that the transformer Θ in natural delineates an

190

CR IP T

attentive region where the target might be. 3.3. Network Architecture

We briefly introduce the STN for completeness. STN is a sample-based differential network which consists of localization network for regressing trans-

former parameters and sampling layers to transform the input maps to output

195

AN US

maps which correspond to a region of the input maps. This is in accord with

the attention procedure introduced above. Thus we tailor STN to realize this attention module for estimating transformer parameters Θ.

Given an input image pair of reference image R and search image S, we develop this attention module upon a Siamese structure. We first use the shared convnets from deep networks (e.g., Alex-Network [49]) for feature extraction. The two branches of deep features are concatenated and fed into a three-

M

200

layer fully connected regression network, which directly outputs the transformer

ED

Θ. Note that this fully connected regression network learns a generic spatialtemporal transformation invariant to significant appearance changes caused by scale change, abrupt motion, deformation, as well as partial occlusion.

PT

After computing the transformer parameters Θ, we utilize a differentiable sampling layer to obtain the tracked result in the search image S. For each training iteration, the tracked result is often thought of a glimpse V, and the

CE

objective function can be computed as follows:

AC

Vic =

205

H,W X n,m

in c Snm max(0, 1 − xin i − m ) max(0, 1 − yi − n )

(3)

where $V_i^c$ is the output value of the $i$-th pixel in channel $c$ of V. Here, W and H denote the width and height of S, respectively. Since the estimated $x_i^{in}$ and $y_i^{in}$ are not always integers, bilinear interpolation is employed to compute $V_i^c$ from the four neighboring pixels around $(x_i^{in}, y_i^{in})$.
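For concreteness, a direct (non-vectorised) NumPy rendering of the bilinear kernel in Eq. (3) for a single channel is given below; array shapes and names are illustrative assumptions.

```python
import numpy as np

def sample_glimpse_channel(S_c, x_in, y_in):
    """S_c: (H, W) channel of the search image; x_in, y_in: source coordinates
    (in pixel units) for each output pixel i. Returns the sampled values V_i^c."""
    H, W = S_c.shape
    V_c = np.zeros(len(x_in))
    for i, (xi, yi) in enumerate(zip(x_in, y_in)):
        for n in range(H):
            for m in range(W):
                wx = max(0.0, 1.0 - abs(xi - m))   # horizontal bilinear weight
                wy = max(0.0, 1.0 - abs(yi - n))   # vertical bilinear weight
                V_c[i] += S_c[n, m] * wx * wy
    return V_c
```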


3.4. Network Training To guide the matching network learn to estimate the transform parameters

210

between successive frames, we design a patch similarity module to compute the

CR IP T

loss for back propagation. Rather than comparing the difference between the

output transformer parameters and ground truth bounding boxes, we compute the distance between the tracked result (a glimpse of visual attention) in a form 215

of image patch and ground truth patches. This image-level distance makes the training easily as it does not require the extract transformation parameters,

explicitly. Inspired by [50, 51], we feed both glimpse and ground-truth patches

AN US

into Alex-Network and use the output of the conv5 layer as features. The loss

between the tracked patch (glimpse) V and ground-truth image G is calculated 220

via the `1 distance as follows:

L = kφ(V ) − φ(G)k1 +

λ kW k22 2

(4)


where φ(V) and φ(G) denote the deep features extracted from the tracked image patch V and the ground-truth patch G, W denotes the weight parameters of the fully connected regression layers in the Spatial Transformer Network, and λ is the weight decay. We add an ℓ2 normalization layer before the loss layer to eliminate the scale disparity of the input deep features. This operation also brings an advantage: during testing we can leverage a simple inner product to measure the similarity between the input features. With the use of the ℓ1 loss, the proposed attentive network converges quickly. Figure 3 visualizes the predicted transformer parameters at different iterations on training sequences. Although the proposed attentive network starts from different regions with high variance, it gradually attends to the target regions as the number of iterations increases.

Since target states between two consecutive frames usually do not change

dramatically, we crop a search window centered at the previous position in the


search image S for training. Let wt−1 and ht−1 denote width and height of ground-truth image patch in the previous frame. We enlarge the search window

Figure 3: Visualization of the attentive process on sequences from the ALOV dataset [52]. Each column shows the attentive regions at different training iterations (10, 1000, 1500, 20000, 70000) for the same frame. As the number of iterations increases, the proposed network attends to the regions of interest precisely.

in proportion to a scaling factor $k > 1$, with width $w_s = k w_{t-1}$ and height $h_s = k h_{t-1}$. As a result, this simple scheme avoids searching over the whole image. In addition, data augmentation is applied to the training data in both the spatial and temporal domains. In the spatial domain, we augment the search window with both position and scale changes [16], and obtain M crops. To augment the temporal variance among sequences, we extract multiple pairs of search window and reference image. In this case, both the search image and the reference image are randomly picked from different frames. Since a long temporal span usually causes spurious invariance, we restrict the temporal augmentation to a span of T frames. More implementation details can be found in Section 4.1.
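As a concrete illustration of the augmentation scheme above, the sketch below enumerates training triplets under the settings later reported in Section 4.1 (search radius k = 2.5, M = 10 spatial crops, temporal span T = 10). The frame list, box format and jitter ranges are illustrative assumptions, not the authors' exact recipe.

```python
import random

def make_training_pairs(frames, boxes, k=2.5, M=10, T=10):
    """frames: list of images; boxes: per-frame (cx, cy, w, h) ground truth."""
    pairs = []
    for t in range(1, len(frames)):
        r = random.randint(max(0, t - T), t - 1)            # temporal augmentation within T frames
        cx, cy, w, h = boxes[t]
        for _ in range(M):                                   # spatial augmentation: jitter position and scale
            jx, jy = random.uniform(-0.2, 0.2) * w, random.uniform(-0.2, 0.2) * h
            js = random.uniform(0.9, 1.1)
            search_box = (cx + jx, cy + jy, k * w * js, k * h * js)
            pairs.append((frames[r], boxes[r],               # reference image and its box
                          frames[t], search_box,             # search window to crop from frame t
                          boxes[t]))                         # ground-truth box defining G
    return pairs
```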

3.5. State Inference The proposed algorithm does not require model update to perform visual

AC

tracking. Once we have the trained network, we can directly apply it to track target objects in testing sequences. We fix the reference image as the ground truth patch in the first frame and fine-tune the network with samples generated as the training phase. Thus we can obtain a domain specific tracking network. From the second frame, we take crop centered position in the previous frame

12

ACCEPTED MANUSCRIPT

with a width and height of kwt−1 and kht−1 . We sequentially feed each crop into the network as the search image S. The output transformer Θ = [sx , sy , tx , ty ] of the network indicates the target state changes. In this work, we assume the

CR IP T

glimpse V matches the target tightly, so the target coordinates in V can be expressed as [1, 1], [−1, −1] which indicate bottom right and top left corners in

V. Then we infer the bounding box on search area from the glimpse V (See Eq. 3) as: 

 x1 s  = x y1 0 



0 sy

  1  tx    1 ty   1    −1  tx    −1 ty   1 

AN US



 x2 s  = x y2 0

0

sy

(5)

where (x1 , y1 ) and (x2 , y2 ) are the bottom-right and top-left coordinates in S. 250

2 y1 +y2 In the t-th frame, the target position is pt = (cx, cy) = ( x1 +x , 2 ), and the 2

M

scale is (w, h) = (x1 − x2 , y1 − y2 ).

Position Refinement. To obtain a more tight bounding box, we leverage

ED

the bounding box regression to refine estimated results as in [31, 3]. In the first frame, we train four linear ridge regressors for the center, width and height of bounding boxes using the conv4 features of Alex-Net. For each subsequent

PT

frame, regressors take the output glimpse of the attentive network as input and

AC

CE

output the refined bounding boxes as tracking results: tx =(cx − xa )/(wa ), ty = (cy − ya )/(ha ),

tw = log(w/wa ), th = log(h/ha ), t∗x =(x∗ − xa )/wa , t∗y = (y ∗ − ya )/ha ,

(6)

t∗w = log(w∗ /wa ), t∗h = log(h∗ /ha )

Here, cx, cy, w, and h denote the predicted boxs center coordinates and its width and height. Variables cx, xa , and x∗ are for the predicted box, anchor box, and ground-truth box respectively. These four regressors are not updated

13

Figure 4: Overall performance on the OTB-2013 [53] and OTB-2015 [54] datasets using one-pass evaluation (OPE): precision plots and success plots of OPE on OTB-2013 and OTB-2015. The legend of the distance precision plots contains the threshold scores at 20 pixels, while the legend of the overlap success plots contains area-under-the-curve (AUC) scores for each tracker. The proposed tracker performs well against the baseline trackers. Here TAAT-bb denotes the proposed method without bounding box regression.

during tracking and adjust predicted position only when the estimated position

PT

is reliable (i.e. f (x, z) > µ, µ is a predefined value). Scale Refinement.

We observe that the estimated scale changes are not

smooth from frame to frame. We thus crop N patches centered at the estimated

CE

position pt but with different scales n = b− N 2−1 c, b− N 2−2 c, · · · , b N 2−1 c [55]. Then these patches are resized to a fixed size of training patches. We feed them into

AC

the network and then compare the similarity of the output glimpses Vn and

the reference image R. For efficiency, rather than using the `1 loss as in Eq. 4, we compute the inner product between glimpse V and reference image R to evaluate their similarity as: f (R, V ) = φ(R)T φ(V )

14

(7)

ACCEPTED MANUSCRIPT

Algorithm 1 Proposed Tracking Algorithm
Require: Pre-trained network f, previous target position (x_{t-1}, y_{t-1}) and size (w_{t-1}, h_{t-1})
Ensure: Estimated position (x_t, y_t) and size (w_t, h_t).
1: Fine-tune the pre-trained network f with samples in the first frame
2: repeat
3:   Crop out the search window in frame t centered at (x_{t-1}, y_{t-1});
4:   Estimate the new position (x_t, y_t) from Equation 5 over deep convolutional features;
5:   if f(x, z) > µ then
6:     Perform position refinement via Equation 6;
7:   end if
8:   Crop multiple patches centered at (x_t, y_t) and obtain the optimal scale via Equation 8;
9: until End of video sequences.



where φ(R) and φ(V ) denote the deep features generated from R and V. The

ED

refined tracking scale n∗ is inferred from the glimpse with the maximum value of similarity score:

n∗ = arg max f (R, Vn ).

(8)

n

PT

The overall algorithmic steps are summarized in Algorithm 1.
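As one self-contained illustration of the scale step in Algorithm 1 (Eqs. (7)-(8)), the snippet below compares N rescaled glimpses to the reference by an inner product of ℓ2-normalised features and picks the best scale. The feature vectors are assumed to be precomputed; names, and the scale settings echoed from Section 4.1 (step 1.02, N = 5), are illustrative.

```python
import numpy as np

def refine_scale(features_by_scale, reference_feature):
    """features_by_scale: list of 1-D feature vectors phi(V_n); returns the index n*."""
    ref = reference_feature / np.linalg.norm(reference_feature)
    scores = [float(np.dot(ref, f / np.linalg.norm(f))) for f in features_by_scale]  # Eq. (7)
    return int(np.argmax(scores))                                                    # Eq. (8)

# Scale factors as in Section 4.1: step 1.02 and N = 5 crops (exponents -2..2).
scale_factors = [1.02 ** n for n in range(-2, 3)]
```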


4. Experiment

4.1. Implementation

The proposed model is implemented in MATLAB with the Caffe library [56].

AC

260

The model is trained on an Intel Xeon 1.60G Hz CPU with 16G RAM and TITAN X GPU. We utilize Alex-Network [49], VGG-Network [30] and ResNet [57] to build the Convnet part in localization network, respectively. Specifically, we leverage the deep features from Conv5, Conv5 3 layer of VGGNet and res4f

265

layer in ResNet for target appearance representation. For ResNet, we add a 15

(Figure 5 consists of six precision plots of OPE on OTB-2015, for the attributes fast motion (37), occlusion (44), motion blur (29), illumination variation (35), in-plane rotation (51) and out-of-plane rotation (59); see the caption below.)

Figure 5: Quantitative results on 6 challenging attributes in the OTB-2015 [54] dataset using one-pass evaluation (OPE). The legend of the distance precision plots contains the threshold scores at 20 pixels, whereas the legend of the overlap success plots contains area-under-thecurve (AUC) scores for each tracker. The proposed method generally performs better than the state-of-the-art trackers.

M

1 × 1 convolution layer which reduces the feature channel from 1024 to 256. We remove the origin fully connected layers and add three fully connected layers

ED

with nodes 2048, 256 and 4, respectively. As suggested in [10], we initialize the weight of the last fully connected layer with 0. In addition, Alex-Network is uti270

lized to extract features when matching the glimpse and ground-truth patches.

PT

We set the size of glimpse half of the search image, i.e., 112 × 112 constantly. To ensure the extracted deep features have same dimensions (2304), we append

CE

an average pooling layer (stride = 2, kernel size = 2) for the ground-truth patch (G) before inputting to the Convnet. We remove the fully connected layer and

275

add an `2 normalization layer before the loss layer. We use the ALOV [52]

AC

and ImageNet Video [58] datasets to finetune the Spatial Transformer Network. These datasets cover diverse challenges of visual tracking including illumination change, cluttered background, occlusion, abrupt motion, etc. We exclude the overlapping sequences in the ALOV and test datasets for fair comparison. As

280

described in Section 3.2, a triple image is fed into the network in every iteration.

16


All images are re-scaled to 224 × 224. For data augmentation, we set the search radius k = 2.5. Every frame, there are M = 10 random patches are cropped, and for temporal augmentation the frame interval T is 10, this results in ap-

285

CR IP T

proximately 400k triplet pairs. We use the stochastic gradient descent (SGD) method with a mini-batch size of 70 to train the network. The base learning

rate is 0.0005, and it is decreased by multiplying a factor of 0.8 at two epochs.

We stop the training procedure when the loss drops lower than 0.02 and the whole training times is about 12 hours. For scale refinement, we set the step size to 1.02 and the number of scales N to 5. The threshold value µ is set to

0.56. During the tracking procedure, we set the height of the search area is 3

AN US

290

times the height of the target and the width of the search area is 5 times the width of the target. Then the cropped image patch is re-scaled to 224 × 224. 4.2. Evaluation on OTB Dataset

4.2.1. Dataset and Evaluation Settings

The OTB dataset includes 50 sequences (OTB-2013) [53] and its updated

M

295

version [54] contains 100 sequences (OTB-2015) with more than 58,000 frames

ED

in total. These video sequences cover various challenging factors, such as fast motion, illumination change, background clutter and occlusion. We validate the proposed tracker against several recently proposed trackers which can be divided into three categories: (i) deep learning based tracking methods including

PT

300

Generic Object Tracking Using Regression Networks (GOTURN) [16], Siame-

CE

sefc [60] and Fully Convolution Network Tracker (FCNT) [28]; (ii) correlation filter based trackers including KCF [8], KCFDP [6] and DSST [61] as well as five representative trackers with favorable performance in the benchmark: MEEM [59], TGPR [62], TLD [14], Struck [2] and SCM [63]. We follow the

AC

305

benchmark evaluation protocol in [53], and use the precision and success plots to evaluate all the trackers. The precision plot demonstrates the percentage of frames where the distance between the predicted target location and the ground truth is within a given threshold (e.g., 20 pixels). The success plot illustrates

310

the percentage of frames where the overlap ratio between the predicted bound17

Figure 6: Tracking results on six challenging benchmark sequences by ours and the MEEM [59], KCF [8], KCFDP [6] and Struck [2] trackers. Our tracker performs well against the state-of-the-art trackers in terms of precise localization and scale estimation. From the first row to the final row: Liquor, Lemming, Jogging, MotorRolling, Human and Girl2.

ing box and the ground truth bounding box is higher than a threshold (e.g., 0.5

AC

Intersection of Union). 4.2.2. Evaluation Results

315

The overall results in Figure 4 are under the one-pass evaluation (OPE).

Figure 4 compares the average precision and success results with baseline algorithms on all the 50 benchmark sequences and the 100 sequences, respectively.

18


TAAT Res denotes ResNet [57] based localization network, TAAT VGG denotes VGGNet [30] based localization network and TAAT denotes AlexNet [49] based localization network. In addition, we disable the bounding box regression module and report the

CR IP T

320

results using the name TAAT-bb. Among the compared trackers, even without the bounding box regression module, the proposed tracker TAAT-bb still achieves comparable performance with MEEM and KCFDP in terms of dis-

tance precision and overlap success. Equipped with all the components, the 325

proposed TAAT tracker achieves the best distance precision rates and over-

AN US

lap success rates. Among the compared trackers, TAAT Res achieves the top

performance when compared with deep learning based methods and tradition hand-crafted feature based methods according to distance precision (87.3% and 80.5%). TAAT VGG and origin TAAT trackers achieve slight inferior perfor330

mance on OTB-2013 and OTB-2015 datasets, respectively. With respect to the overlap success rate, the proposed method performs favorably against the re-

M

cent deep trackers Siamesefc [60] due to SiameseFC adopts more complex scale estimation strategy with scale penalty term. However, TAAT is superior to

335

ED

rest trackers including MEEM [59], KCF [8], KCFDP [6] with 61.7% and 58.2% scores on OTB-2013 and OTB-2015 datasets, respectively. It is worth mentioning that the proposed TAAT method achieves better performance than GO-

PT

TURN [16]. There are two reasons, first reason is that GOTURN directly optimizes the difference between predicted bounding box and the ground truth, this

CE

leads to the tracking performance fluctuation across different video sequences. 340

In addition, the proposed TAAT is pre-trained on a larger dataset which lead

AC

to more stronger generalization ability on unseen testing videos. We further compare the tracking performance for different video attributes

including fast motion, occlusion, motion blur, illumination variation, in plane rotation and out of plane rotation. Figure 5 illustrates that our tracker performs

345

well the against state-of-the-art methods in terms of distance precision rate on all 100 video sequences. Our method achieves superior performance with fast motion (83.5%), occlusion (76.0%), motion blur (75.0%), illumination variation 19


(81.6%), in plane rotation (82.6%) and out plane rotation (81.6%). These results suggest that the proposed method are effective in handling various challenging 350

scenario, especially tracking failure caused by fast motion, occlusion.

CR IP T

Figure 6 qualitatively compares tracking results on featured challenging sequences. Since we do not update the trained network online, the proposed algorithm effectively avoids the noisy updates which always leads to tracking

failures. The proposed network learns the invariance to a wide range of spa355

tial transformation. It is able to deal with the heavy occlusion, e.g., recovering

the target after long-term occlusion (from 330th frame to 370th frame) in the

AN US

Lemming sequence and girl2 sequence (from 110th frame to 120th frame) as

shown in Figure 6. In addition, via the deep network in the proposed attentive network, the proposed method handles rotation and background clutter 360

(MotorRolling) effectively as the high level convolution features contain strong semantic information.

M

4.3. Evaluation on VOT2014 Dataset

4.3.1. Dataset and Evaluation Settings

365

ED

The VOT 2014 dataset [64] contains 25 real-world video sequences. Each sequence is labeled with six attributes including camera motion, illumination change, motion change, occlusion, size change and no degradation. There are

PT

two evaluation criteria in VOT2014 including accuracy and robustness. Accuracy is computed as the Pascal VOC Overlap Ratio (VOR): e =

area(RT ∩RG ) area(RT ∪RG ) ,

CE

where RT and RG are the areas of tracked and ground truth box respectively. 370

The robustness indicates the number of failures to track an object in a sequence. These two metrics are used to rank all the trackers. According to the evaluation

AC

protocol, a restart scheme is incorporated into a tracker whenever tracking failure occurs (overlap between estimated and ground truth target bounding box equals zero).

20


375

4.3.2. Evaluation Results on VOT2014 Dataset We follow the protocol of VOT2014 and run experiment analysis to obtain the final report. Figure 7 illustrates the accuracy-robustness plots of all

CR IP T

comparison trackers including GOTURN, Struck, SAMF, DGT, eASMS, CMT,

MatFlow, ABS, BDT, HMMTxD, ACAT, DynMS, ACT, PTp, EDFT, IPRT, 380

LT FLO, SIR PF, FoT, ThunderStruck, FSDT, FRT, IVT, OGT, MIL, IIVTV2, Matrioska, CT, IMPNCC and NCC, where the best trackers are closer to the

top-right corner. The proposed TAAT as well as the extended TAAT VGG,

TAAT res perform favorably among all the other trackers, e.g., DSST [61],

AN US

KCF [8] and SAMF[55], which are all correlation filter based trackers.

In addition, we present the average accuracy, robustness rank and expected

385

average overlap (EAO) of all compared trackers in Table 1. The proposed tracker achieves a comparable result in both baseline and region noise experiments. Due to no model updating during tracking, the proposed tracker is not sensitive to

AC

CE

PT

ED

M

region noise.

Figure 7: The robustness-accuracy ranking plots of all trackers under baseline and region noise experiments. Trackers close to the top right corner of the plot are among the top performers.

21


baseline

region noise

Overall

EAO

Accuracy

Robustness

Accuracy

Robustness

TAAT

5.52

4.20

4.48

3.68

5.00

3.94

0.21

KCF

3.56

7.52

5.08

7.92

4.32

7.72

0.19

TAAT VGG

5.40

4.22

4.36

3.25

4.88

3.735

0.21

DSST

4.92

7.28

4.28

6.64

4.60

6.96

0.17

TAAT res

5.20

3.28

4.48

3.28

4.84

3.28

0.22

Struck

10.00

13.16

10.36

11.92

10.18

12.54

0.19

GOTURN

5.84

8.20

5.64

6.84

5.74

7.52

0.21

SAMF

4.40

7.64

4.48

7.56

4.44

7.60

0.20

DGT

8.40

5.12

6.16

6.12

7.28

5.62

0.18

eASMS

8.92

7.96

7.28

7.88

8.10

7.92

0.18

CMT

12.92

14.88

15.68

14.76

14.30

14.82

0.17

MatFlow

11.92

4.52

9.88

8.40

10.90

6.46

0.17

ABS

11.44

9.04

9.68

7.56

10.56

8.30

0.16

BDF

12.08

10.88

12.32

9.84

12.20

10.36

0.15

HMMTxD

5.80

10.16

4.80

10.08

5.30

10.12

0.15

ACAT

8.24

9.60

8.76

8.68

8.50

9.14

0.15

DynMS

11.56

10.12

11.60

11.00

11.58

10.56

0.15

ACT

8.96

9.80

9.40

9.60

9.18

9.70

0.15

PTp

19.04

10.60

15.20

10.36

17.12

10.48

0.14

EDFT

10.68

13.84

10.88

15.52

10.78

14.68

0.14

IPRT

14.72

12.60

13.72

12.56

14.22

12.58

0.14

LT FLO

10.28

18.80

9.76

17.84

10.02

18.32

0.14

SIR PF

10.80

14.00

10.56

14.52

11.68

13.76

0.14

FoT

11.92

17.12

12.88

19.20

12.40

11.16

0.14

ThunderStruck

10.08

14.00

10.56

11.52

10.68

12.76

0.13

FSDT

14.12

18.16

12.20

15.08

12.24

16.62

0.13

FRT

13.12

23.96

15.88

23.32

14.66

23.06

0.13

IVT

13.60

18.48

16.80

16.60

15.46

17.64

0.12

OGT

10.06

16.04

9.80

16.40

9.90

16.24

0.13

MIL

21.12

14.02

25.84

14.92

23.90

14.82

0.12

14.20

16.22

15.76

14.88

14.98

15.56

0.12

Matrioska

12.52

11.64

10.08

15.04

11.66

13.34

0.11

CT

16.16

17.64

17.00

16.56

16.58

17.10

0.11

IMPNCC

14.96

20.24

17.68

18.68

16.32

19.46

0.11

NCC

11.88

27.24

12.20

27.64

12.04

27.44

0.08

PT

IIVTv2

AN US

CR IP T

Robustness

ED

Accuracy

M

Trackers

Table 1: The performance of all trackers on VOT2014 dataset. We report the average ranks under baseline region noise experiments and EAO. Red, blue and green colors denote the first,

CE

second and third best results.

4.4. Evaluation on UAV-123 Dataset 4.4.1. Dataset and Evaluation Settings

AC

390

UAV-123 dataset [65] is a recently released dataset which contains 123 track-

ing targets. All videos are collected from a low-altitude aerial perspective. The evaluation settings are similar to OTB dataset.

22


395

4.4.2. Evaluation Results Figure 8 illustrates the detailed tracking results on UAV-123 dataset including MEEM, SRDCF, MUSTER, SAMF, KCF, DSST, DCF, Struck, ASLA,

CR IP T

OAB, CSK and TLD. The proposed TAAT method achieves top performance in

terms of distance precision (72.5%) and overlap rate (48.0%). This result proves the effectiveness of the proposed TAAT on the long term tracking dataset. Precision plots of OPE

0.8

0.5 0.4 0.3

0.5 0.4 0.3 0.2

0.2

0.1

0.1 0

0.6

Success rate

Precision

0.6

TAAT [0.480] SRDCF [0.464] ASLA [0.407] SAMF [0.396] MEEM [0.392] MUSTER [0.391] Struck [0.381] DSST [0.356] DCF [0.332] KCF [0.331] OAB [0.331] IVT [0.318] CSK [0.311] MOSSE [0.297]

0.7

AN US

0.7

Success plots of OPE

0.8

TAAT [0.725] SRDCF [0.676] MEEM [0.627] SAMF [0.592] MUSTER [0.591] DSST [0.586] Struck [0.578] ASLA [0.571] DCF [0.526] KCF [0.523] OAB [0.495] CSK [0.488] MOSSE [0.466] TLD [0.439]

0

0

10

20

30

40

Location error threshold

50

0

0.2

0.4

0.6

0.8

1

Overlap threshold

Figure 8: Overall performance on the UAV-123 [65] dataset using one-pass evaluation (OPE).

4.5. Tracker Analysis

M

400

ED

In this section, we discuss the proposed tracker. 4.5.1. Different loss functions

405

PT

Here, we evaluate the effectiveness of the proposed l1 loss. For comparison, we substitute the proposed l1 loss with the traditional l2 loss. Figure 9(a)

CE

reports the converge curve during the training procedure. It is clear that the l1 loss plot converges more quickly than l2 loss. For final tracking performance, l1 loss and l2 loss achieve similar results (distance precision: 0.805 versus 0.804,

AC

overlap rate: 0.573 versus 0.571 on OTB-2015 dataset).

410

4.5.2. Tracking speed As our method does not require the cumbersome sampling stage during

tracking, it achieves a satisfying tracking speed. Without bounding box regression, it runs at 20.1 frames per second (FPS). With all the components, it runs 23


0.9 L2 loss L1 loss

0.8

17.1

0.574

0.573

0.7

0.573

Training loss

0.566 0.6

15.1 0.5

13.8

0.540

0.4

17.4

0.3

0.1

CR IP T

11.5

0.2

10.6

0

5000

10000

15000

Number of iterations

avg OP

avg speed

Figure 9: Left: convergence curves with different loss functions. Right: tracking performance and speed with varying numbers of scales on the OTB-2015 dataset; the average overlap rate and average speed versus the number of scales N are illustrated.

TAAT VGG

TAAT Res

MDNet [3]

HCFT [9]

STCT [29]

20.1

15.5

13.8

0.7

8.3

4.1

AN US

FPS

TAAT

Table 2: Speed comparison between deep trackers on the OTB dataset. FPS: frames per second.

at 15 FPS. Here we compare the speed of the state-of-the-art deep trackers on OTB dataset in Table 2. All the trackers run on an Intel Xeon 1.60G Hz CPU

M

415

with 16G RAM and a TITAN X GPU. It is worth mentioning that our model

ED

has no model updating during tracking, the key parameter which affects the results is scale number N in scale estimation. The larger value of N contributes

420

PT

to accurate target’s size estimation while leads to heavy computation burden. 4.5.3. Parameter sensibility This subsection validates the effectiveness of several key components in our

CE

tracker. It is worth mentioning that our model has no model updating during tracking, the key parameters which affect the results include number N in scale

estimation and search area size. Generally, the higher value of N contributes to more precise size estimation (i.e., overlap rate) while aggravates computa-

AC 425

tion burden. Figure 9(b) reports the overlap rate versus tracking speed among different values of N .

24


5. Conclusion

In this paper, we propose a transform-aware attentive tracking method inspired by deep attentive networks. With the use of a tailored Spatial Transformer Network, the proposed network attends to regions of interest where novel target objects might be. The output spatial transformer parameters indicate the target states with both location and scale information. The proposed algorithm does not require the cumbersome state sampling and model updating that existing tracking algorithms rely on. It is evaluated on four popular benchmark datasets, namely OTB-2013, OTB-2015, VOT-2014 and UAV-123. The experimental results demonstrate that the proposed TAAT performs very well against baseline algorithms and achieves a satisfying speed. In the future, an online updating mechanism can be integrated into the TAAT model to handle target

M

References

[1] N. Wang, J. Shi, D. Yeung, J. Jia, Understanding and diagnosing visual

ED

tracking systems, in: IEEE Int. Conf. Comput. Vis., 2015, pp. 3101–3109. [2] S. Hare, A. Saffari, P. H. Torr, Struck: Structured output tracking with kernels, in: IEEE Int. Conf. Comput. Vis., 2011, pp. 263–270.

PT

445

[3] H. Nam, B. Han, Learning multi-domain convolutional neural networks

CE

for visual tracking, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4293–4302.

[4] J. Shen, D. Yu, L. Deng, X. Dong, Fast online tracking with detection

AC

450

refinement, IEEE Trans. Intelligent Transportation Systems 19 (1) (2018) 162–173.

[5] B. Ni, A. A. Kassim, S. Winkler, A hybrid framework for 3-d human motion tracking, IEEE Trans. Circuits Syst. Video Techn. 18 (8) (2008) 1075–1084.

25


[6] D. Huang, L. Luo, M. Wen, Z. Chen, C. Zhang, Enable scale and aspect ratio adaptability in visual tracking with detection proposals, in: BMVC,

455

2015, pp. 185.1–185.12.

CR IP T

[7] G. Zhu, F. Porikli, H. Li, Beyond local search: Tracking objects everywhere with instance-specific proposals, arXiv preprint arXiv:1605.01839.

[8] J. F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking

with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell.

460

37 (3) (2015) 583–596.

AN US

[9] C. Ma, J.-B. Huang, X. Yang, M.-H. Yang, Hierarchical convolutional fea-

tures for visual tracking, in: IEEE Int. Conf. Comput. Vis., 2015, pp. 3074–3082. 465

[10] M. Jaderberg, K. Simonyan, A. Zisserman, et al., Spatial transformer networks, in: Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2017–2025.

M

[11] M. Denil, L. Bazzani, H. Larochelle, N. de Freitas, Learning where to attend with deep architectures for image tracking, Neural computation 24 (8)

470

ED

(2012) 2151–2184.

[12] S. E. Kahou, V. Michalski, R. Memisevic, Ratm: Recurrent attentive track-

PT

ing model, arXiv preprint arXiv:1510.08660. [13] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, D. Wierstra, DRAW:

CE

A recurrent neural network for image generation, in: Int. Conf. Machin. Learn, 2015, pp. 1462–1471.

[14] Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, IEEE

AC

475

Trans. Pattern Anal. Mach. Intell. 34 (7) (2012) 1409–1422.

[15] T. Zhou, F. Liu, H. Bhaskar, J. Yang, H. Zhang, P. Cai, Online discriminative dictionary learning for robust object tracking, Neurocomputing 275 (2018) 1801–1812.

26


480

[16] D. Held, S. Thrun, S. Savarese, Learning to track at 100 FPS with deep regression networks, in: Eur. Conf. Comput. Vis., 2016, pp. 749–765. [17] B. Ma, L. Huang, J. Shen, L. Shao, Discriminative tracking using tensor

CR IP T

pooling, IEEE Transactions on Cybernetics 46 (11) (2016) 2411–2422.

[18] Q. Guo, W. Feng, C. Zhou, C. Pun, B. Wu, Structure-regularized com-

pressive tracking with online data-driven sampling, IEEE Trans. Image

485

Processing 26 (12) (2017) 5692–5705.

[19] C. Li, L. Lin, W. Zuo, J. Tang, M. Yang, Visual tracking via dynamic graph

AN US

learning, IEEE Transactions on Pattern Analysis and Machine Intelligence. [20] X. Wang, C. Li, B. Luo, J. Tang, SINT++: robust visual tracking via adversarial positive instance generation, in: CVPR, 2018, pp. 4864–4873.

490

[21] R. Tao, E. Gavves, A. W. M. Smeulders, Siamese instance search for tracking, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1420–

M

1429.

ED

[22] J. Shen, J. Peng, L. Shao, Submodular trajectories for better motion segmentation in videos, IEEE Trans. Image Processing 27 (6) (2018) 2688–

495

2700.

PT

[23] B. Ni, X. Yang, S. Gao, Progressively parsing interactional objects for fine grained action detection, in: Proc. IEEE Conf. Comput. Vis. Pattern

CE

Recognit., 2016, pp. 1020–1028. 500

[24] W. Song, Y. Li, J. Zhu, C. Chen, Temporally-adjusted correlation filter-

AC

based tracking, Neurocomputing 286 (2018) 121–129.

[25] C. Zhang, J. Yan, C. Li, R. Bie, Contour detection via stacking random forest learning, Neurocomputing 275 (2018) 2702–2715.

[26] X. Lu, C. Ma, B. Ni, X. Yang, I. Reid, M. Yang, Deep regression tracking

505

with shrinkage loss, in: Eur. Conf. Comput. Vis., 2018, pp. 369–386.

27


[27] T. Zhang, S. Liu, C. Xu, B. Liu, M. Yang, Correlation particle filter for visual tracking, IEEE Trans. Image Processing 27 (6) (2018) 2676–2687. [28] L. Wang, W. Ouyang, X. Wang, H. Lu, Visual tracking with fully convo-

510

CR IP T

lutional networks, in: IEEE Int. Conf. Comput. Vis., 2015, pp. 3119–3127. [29] L. Wang, W. Ouyang, X. Wang, H. Lu, STCT: sequentially training convolutional networks for visual tracking, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1373–1381.

[30] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-

515

AN US

scale image recognition abs/1409.1556.

[31] S. Ren, K. He, R. B. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (6) (2017) 1137–1149.

[32] P. Cavanagh, G. A. Alvarez, Tracking multiple targets with multifocal at-

[33] S. Zhang, X. Lan, H. Yao, H. Zhou, D. Tao, X. Li, A biologically inspired

ED

520

M

tention, Trends in Cognitive Sciences 9.

appearance model for robust visual tracking., IEEE Trans. Neural Netw. Learning Syst. (2016) 1–14.

PT

[34] Y. Jiao, Z. Li, S. Huang, X. Yang, B. Liu, T. Zhang, Three-dimensional attention-based deep ranking model for video highlight detection, IEEE Trans. Multimedia 20 (10) (2018) 2693–2705.

CE

525

[35] L. Itti, P. Baldi, A principled approach to detecting surprising events in

AC

video, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005, pp. 631–637.

[36] D. Gao, V. Mahadevan, N. Vasconcelos, The discriminant center-surround

530

hypothesis for bottom-up saliency, in: Proc. Adv. Neural Inf. Process. Syst., 2008, pp. 497–504.

28

ACCEPTED MANUSCRIPT

[37] V. Mahadevan, N. Vasconcelos, Biologically inspired object tracking using center-surround saliency mechanisms, IEEE Trans. Pattern Anal. Mach. Intell. 35 (3) (2013) 541–554. [38] J. Choi, H. J. Chang, J. Jeong, Y. Demiris, J. Y. Choi, Visual tracking

CR IP T

535

using attention-modulated disintegration and integration, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4321–4330.

[39] S. Hong, T. You, S. Kwak, B. Han, Online tracking by learning discrimina-

tive saliency map with convolutional neural network, in: Int. Conf. Machin. Learn, 2015, pp. 597–606.

AN US

540

[40] W. Wang, J. Shen, R. Yang, F. Porikli, Saliency-aware video object segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 40 (1) (2018) 20–33. [41] W. Zhang, Q. Chen, W. Zhang, X. He, Long-range terrain perception using convolutional neural networks, Neurocomputing 275 (2018) 781–787. [42] V. Mnih, N. Heess, A. Graves, et al., Recurrent models of visual attention,

M

545

in: Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2204–2212.

ED

[43] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation

550

PT

with visual attention, in: Int. Conf. Machin. Learn, 2015, pp. 2048–2057. [44] J. Ba, V. Mnih, K. Kavukcuoglu, Multiple object recognition with visual

CE

attention, arXiv preprint arXiv:1412.7755. [45] P. Sermanet, A. Frome, E. Real, Attention for fine-grained categorization,

AC

arXiv preprint arXiv:1412.7054.

[46] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning ap-

555

plied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.

[47] Z. Cui, S. Xiao, J. Feng, S. Yan, Recurrently target-attending tracking, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1449–1458. 29

ACCEPTED MANUSCRIPT

[48] S. Baker, I. A. Matthews, Lucas-kanade 20 years on: A unifying framework, Int. J. Comput. Vis. 56 (3) (2004) 221–255.

560

[49] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep

CR IP T

convolutional neural networks, in: Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.

[50] S. Zagoruyko, N. Komodakis, Learning to compare image patches via convolutional neural networks, in: Proc. IEEE Conf. Comput. Vis. Pattern

565

Recognit., 2015, pp. 4353–4361.

AN US

[51] X. Han, T. Leung, Y. Jia, R. Sukthankar, A. C. Berg, Matchnet: Unifying feature and metric learning for patch-based matching, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3279–3286. 570

[52] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, M. Shah, Visual tracking: an experimental survey, IEEE Trans. Pattern

M

Anal. Mach. Intell. 36 (7) (2014) 1442–1468.

[53] Y. Wu, J. Lim, M.-H. Yang, Online object tracking: A benchmark, in:

575

ED

Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 2411–2418. [54] Y. Wu, J. Lim, M. Yang, Object tracking benchmark, IEEE Trans. Pattern

PT

Anal. Mach. Intell. 37 (9) (2015) 1834–1848. [55] Y. Li, J. Zhu, A scale adaptive kernel correlation filter tracker with feature

CE

integration, in: ECCV Workshops, 2014, pp. 254–265. [56] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast fea-

AC

580

ture embedding, in: ACM Multimedia, 2014, pp. 675–678.

[57] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.

30

ACCEPTED MANUSCRIPT

585

[58] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, F. Li, Imagenet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (3) (2015)

CR IP T

211–252. [59] J. Zhang, S. Ma, S. Sclaroff, Meem: Robust tracking via multiple experts

using entropy minimization, in: Eur. Conf. Comput. Vis., 2014, pp. 188–

590

203.

[60] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr,

AN US

Fully-convolutional siamese networks for object tracking, in: Eur. Conf. Comput. Vis., 2016, pp. 850–865. 595

[61] M. Danelljan, G. H¨ ager, F. S. Khan, M. Felsberg, Accurate scale estimation for robust visual tracking, in: BMVC, 2014.

[62] J. Gao, H. Ling, W. Hu, J. Xing, Transfer learning based visual tracking

M

with gaussian processes regression, in: Eur. Conf. Comput. Vis., 2014, pp. 188–203.

[63] W. Zhong, H. Lu, M.-H. Yang, Robust object tracking via sparsity-based

ED

600

collaborative model, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,

PT

2012, pp. 1838–1845.

[64] L. Agapito, M. M. Bronstein, C. Rother, The visual object tracking

CE

VOT2014 challenge results, in: ECCV Workshops, 2014, pp. 191–217. 605

[65] M. Mueller, N. Smith, B. Ghanem, A benchmark and simulator for uav

AC

tracking, in: Eur. Conf. Comput. Vis., 2016.


Xiankai Lu received the B.S. degree in automation from Shandong University, Jinan, China, in 2012. He is currently pursuing the Ph.D. degree at Shanghai Jiao Tong University, Shanghai, China. His research interests include image processing, object tracking, and deep learning.

Bingbing Ni received a B.Eng. in electrical engineering from Shanghai Jiao Tong University, Shanghai, China, in 2005, and a Ph.D. from the National University of Singapore, Singapore, in 2011. He is currently a Professor with the Department of Electrical Engineering, Shanghai Jiao Tong University. Before that, he was a Research Scientist with the Advanced Digital Sciences Center, Singapore. He was with Microsoft Research Asia, Beijing, China, as a Research Intern in 2009. He was also a Software Engineer Intern with Google Inc., Mountain View, CA, USA, in 2010. Dr. Ni was a recipient of the Best Paper Award from PCM'11 and the Best Student Paper Award from PREMIA'08. He was also the recipient of the first prize in the International Contest on Human Activity Recognition and Localization in conjunction with the International Conference on Pattern Recognition in 2012.

Chao Ma is a senior research associate with the Australian Centre for Robotic Vision at The University of Adelaide. He received a Ph.D. from Shanghai Jiao Tong University in 2016. His research interests include computer vision and machine learning. He was sponsored by the China Scholarship Council as a visiting Ph.D. student at the University of California at Merced from the fall of 2013 to the fall of 2015. He is a member of the IEEE.

Xiaokang Yang received a B.S. from Xiamen University, Xiamen, China, in 1994, an M.S. from the Chinese Academy of Sciences, Shanghai, China, in 1997, and a Ph.D. from Shanghai Jiao Tong University, Shanghai, in 2000. He is currently a Distinguished Professor with the School of Electronic Information and Electrical Engineering and the Deputy Director of the Institute of Image Communication and Information Processing at Shanghai Jiao Tong University. He has authored over 200 refereed papers and holds 40 patents. His current research interests include visual signal processing and communication, media analysis and retrieval, and pattern recognition. He is an Associate Editor of the IEEE Transactions on Multimedia and a Senior Associate Editor of the IEEE Signal Processing Letters.
