Robust visual tracking by embedding combination and weighted-gradient optimization


Jin Feng, Peng Xu, Shi Pu, Kaili Zhao, Honggang Zhang (corresponding author)
Beijing University of Posts and Telecommunications, China
Pattern Recognition 104 (2020) 107339, https://doi.org/10.1016/j.patcog.2020.107339


Article history: Received 7 June 2019; Revised 15 February 2020; Accepted 17 March 2020; Available online 19 March 2020.
Keywords: Visual tracking; Data imbalance; Embedding combination; Weighted-gradient loss

Abstract

Existing tracking-by-detection approaches build trackers on binary classifiers. Despite achieving state-of-the-art performance on tracking benchmarks, these trackers pay limited attention to the data imbalance issue, e.g., between positive and negative samples and between easy and hard samples. In this paper, we demonstrate that separately learning feature embeddings for negative samples with different semantic characteristics is effective in reducing the background diversity, which handles the imbalance between positive and negative samples and facilitates background awareness of classifiers. Specifically, we propose a negative sample embedding combination network, which learns several sub-embeddings and combines them to build a robust classifier. In addition, we propose a weighted-gradient loss to handle the imbalance between easy and hard samples. The gradient contribution of each sample to model training is dynamically weighted according to the gradient distribution, which prevents easy samples from overwhelming model training. Extensive experiments on benchmarks demonstrate that our tracker performs favorably against state-of-the-art algorithms.

© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

Visual tracking is one of the fundamental problems in computer vision and plays an important role in many applications such as action recognition [1], automatic driving [2], and video surveillance [3]. It is the task of estimating the trajectory of a target in an image sequence, given only its initial location annotated by a bounding box [4]. The tracking algorithm must learn an appearance model of the target online with limited training data, often only the first frame of the video [5]. The model then needs to generalize to variants of the target appearance under various challenging conditions [6], including deformation, occlusion, illumination variation, background clutters, etc.

In recent years, tracking-by-detection algorithms, which build trackers on binary classification models, have achieved state-of-the-art performance on tracking benchmarks [6-9]. To train the binary classification models, image patches are drawn from the first frame as training data. Patches whose overlap ratio with the target exceeds a threshold are labeled as positive, and the others are labeled as negative [10-12]. However, this most commonly adopted strategy of collecting training data inevitably causes extreme data imbalance.




Specifically, the class imbalance is the imbalance between positive and negative samples, and the attribute imbalance is the imbalance between easy and hard samples. Since it is widely believed that data imbalance has a negative impact on model training, many tracking algorithms try to handle this issue. Most of the adopted methods are borrowed from classification and detection algorithms [13,14], such as hard negative mining [15] and setting hyper-parameters to weight the losses of different classes. However, these methods are not effective enough for tracking, due to the gap between tracking and other tasks: the former directly abandons most samples and makes training inefficient, while the latter only considers the class imbalance but not the attribute imbalance. Hence, it is important to design methods of handling data imbalance specifically for the tracking task.

To alleviate the effect of class imbalance, we propose a novel negative sample embedding combination network. Our work is motivated by the following observation. All the negative samples can be divided into several subsets, each with distinct semantic characteristics. When we take one subset and all the positive samples as the training dataset, it is easier to learn an embedding where positive and negative samples are discriminable, due to the reduced diversity and number of negative samples. We utilize separate branches of embedding specific layers to learn the embedding for each negative sample subset respectively. Afterward, we utilize an embedding combination layer to combine these separate sub-embeddings.


By this mechanism, the negative impact of class imbalance on model training is mitigated and a discriminative boundary between all the positive and negative samples is learned.

To alleviate the effect of attribute imbalance, we present a novel weighted-gradient loss. We focus on harmonizing the total gradient contributions of easy and hard samples to prevent easy samples from overwhelming model training. Specifically, we assign a weighting parameter to the gradient of each sample according to the gradient distribution. With our proposed weighted-gradient loss, the large total gradient contribution accumulated by easy samples is down-weighted, and the total gradient contribution of hard samples is up-weighted. As a result, the contributions of easy and hard samples are balanced and the training becomes efficient and stable.

The main contributions of this work can be summarized as follows:





• We propose a novel negative sample embedding combination (NSEC) network specifically for handling the imbalance between positive and negative samples in the tracking-by-detection framework, which learns a discriminative boundary between positive and negative samples by learning a combined embedding.
• We propose a weighted-gradient loss (WGL) to balance the total gradient contributions of easy and hard samples. The gradients generated by back-propagation are weighted, and the weighting parameters vary dynamically as the gradient distribution changes.
• We evaluate the proposed tracker (ECWGO) on four challenging benchmarks: OTB2013, OTB100, VOT2016, and Temple Color 128. Extensive experimental results demonstrate that the proposed algorithm provides a significant improvement over the baseline and performs favorably against state-of-the-art trackers. The complete code will be made available as soon as possible.

2. Related work

2.1. Tracking by detection

The tracking-by-detection framework [10-12] treats the tracking task as a binary classification problem. It emphasizes learning a robust and discriminative binary classifier to distinguish the target from the background. During the training phase, the target object and the background in the first frame are labeled as the positive class and the negative class respectively. During the tracking phase, a sparse set of samples is drawn around the location of the target in the previous frame as target proposals, and the model classifies each sample as either the target or the background. To adapt to the appearance variation of the target object throughout the video, some trackers also adopt an online update mechanism [16]. Representative tracking-by-detection approaches include Struck [17], MIL [18], CNN-SVM [19], MDNet [10], VITAL [11] and DAT [12].

Unlike the existing two-stage tracking-by-detection trackers listed above, we propose a novel network specifically for mitigating the data imbalance issue. Instead of expecting to learn a binary classifier directly from the imbalanced data, we divide the huge number of negative samples into several sub-classes by their individual semantic characteristics and learn several negative sample embeddings from relatively balanced data. With our proposed embedding combination layer, the different negative sample embeddings are combined and we finally obtain a discriminative boundary between samples of the target and the background.

2.2. Data imbalance

Many computer vision tasks such as classification and detection suffer from the data imbalance issue [20]. This problem has been

extensively studied for years. In visual tracking, data imbalance arises in two aspects. Firstly, the number and appearance diversity of positive samples are extremely limited, while those of negative samples, drawn from across the whole frame, are large. Secondly, there exists an attribute imbalance between easy and hard samples. For example, samples that contain only background pixels are easy to classify as negative, while samples that contain pixels from both the background and the target are hard to classify, as shown in Fig. 2. Generally, the number of negative samples is larger than that of positive samples, and a large portion of negative samples are easy samples. These easy samples dominate the training of the model and leave the hard samples unattended.

In the visual tracking community, online hard negative mining [15] and re-weighting the positive and negative losses are the most commonly used methods to alleviate the effect of data imbalance. However, the former directly abandons most samples and makes training insufficient, while the latter only handles the class imbalance and ignores the attribute imbalance. Recent work on dense object detection proposes focal loss [21] to handle attribute imbalance by decreasing the loss from easy samples. But focal loss is a static loss, so it cannot adapt to changes of the gradient distribution during training. Inspired by recent works that exploit gradient flow [12,22,23], we propose a weighted-gradient loss to balance the gradient distribution in the training process, as we identify that samples of different attributes (easy or hard) dominantly affect model updating through their total gradient contributions. Our proposed loss focuses on balancing the total gradient contributions of samples to model updating and thereby alleviates the effect of data imbalance. Moreover, it is a dynamic loss, which means that it can adapt to the variations of the data distribution as training continues.

3. Proposed algorithm

In this section, we first formulate deep tracking-by-detection approaches. After that, we propose the negative sample embedding combination network. Finally, we introduce the weighted-gradient loss for handling the attribute imbalance during the training process. The details of our work are discussed below.

3.1. Formulation of the tracking-by-detection framework

The tracking-by-detection framework formulates the tracking task as a binary classification problem where the target object is labeled as the positive class and the background is labeled as the negative class [10-12]. When tracking an arbitrary object in a video, there are basically two phases: the training phase and the tracking phase. During the training phase, positive and negative samples are collected from the first frame, where positive and negative samples have overlap ratios of at least θ⁺ and at most θ⁻ with the ground truth bounding box respectively. Afterward, a deep classification network is trained with these samples by minimizing the binary classification loss. During the tracking phase, to estimate the target location in a frame, N target proposals x_1, x_2, ..., x_N are generated around the previous target location and evaluated by the network, which outputs positive class probabilities f⁺(x_i) and negative class probabilities f⁻(x_i). The optimal target location is determined by finding the sample x* with the maximum positive class probability:

x^* = \arg\max_{x_i} f^{+}(x_i).    (1)

To adapt to the appearance variation of the target object during the tracking process, some approaches also adopt an online update mechanism [10].
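To make the tracking phase of Eq. (1) concrete, the following is a minimal PyTorch sketch (PyTorch being the paper's stated implementation framework). The helpers draw_proposals and crop_patches, the two-class output layout, and all names are assumptions for illustration, not the authors' released code.

```python
import torch

def locate_target(model, frame, prev_box, n_proposals=256):
    """Score candidate patches around the previous target location and keep the one
    with the highest positive-class probability, as in Eq. (1).
    draw_proposals / crop_patches are assumed helper functions."""
    boxes = draw_proposals(prev_box, n=n_proposals)      # candidate boxes x_1 ... x_N
    patches = crop_patches(frame, boxes)                 # tensor of shape (N, 3, H, W)
    with torch.no_grad():
        scores = torch.softmax(model(patches), dim=1)    # column 0: f^-(x_i), column 1: f^+(x_i)
    best = int(torch.argmax(scores[:, 1]))               # x* = argmax_i f^+(x_i)
    return boxes[best], scores[best, 1].item()
```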


Fig. 1. The tracking flow chart shows the whole pipeline of our negative sample embedding network. FE stands for the Feature Extractor, ECL for the Embedding Combination Layers that share weight parameters, and ESL for the Embedding Specific Layers that do not share weight parameters. In process (a), the embedding combination layers are trained sequentially and incrementally under the supervision of the corresponding embedding specific layers. In process (b), the learned negative sample embeddings are combined to obtain a discriminative classifier. See details in Section 3.2. Best viewed in color.

Fig. 2. In the tracking-by-detection framework, negative samples are drawn from across the frame. The number and diversity of negative samples are large. We heuristically define four kinds of negative samples according to their different semantic characteristics. Best viewed in color.

3.2. Negative sample embedding combination network

Existing tracking-by-detection methods based on binary classification models commonly assign negative labels to all non-target samples. This is intuitive but brings about the data imbalance issue: with this strategy of collecting samples, the number of negative samples across the whole background is large while the number of positive samples is limited. Besides, it causes attribute imbalance (easy and hard) within the negative samples. For example, a sample that contains no pixels from the target is easy to classify as negative, whereas a sample that contains the target as well as surrounding background pixels is a hard negative sample, as shown in Fig. 2. To overcome this issue, we propose the negative sample embedding combination network. Instead of taking all the negative and positive samples as one training dataset, we divide the negative samples into several subsets with distinct semantic characteristics, and then take one subset together with all the positive samples as a training dataset. It is easier to learn an embedding where positive and negative samples are discriminable, since the number and diversity of negative samples in a single subset are largely reduced and the class imbalance is alleviated. We utilize separate branches of embedding specific layers to learn the embedding for each negative sample subset respectively. Finally, with the proposed embedding combination mechanism, the learned negative sample embeddings are combined and we obtain a robust and discriminative boundary between the positive samples and all the negative samples.

To divide the whole set of negative samples into subsets, we heuristically define four kinds of negative samples according to their specific semantic characteristics, as shown in Fig. 2. They are defined and collected in the following ways:

• Pure background samples S_i^-: image patches that include only background and no pixels from the target at all.
• Part-of-target samples with some background S_ii^-: image patches that include part of the target (not the whole target) and some background. These patches have an overlap ratio with the ground truth bounding box b_gt lower than the threshold θ_ii^-.
• Part-of-target samples without any background S_iii^-: image patches that include only part of the target (not the whole target) and no pixels from the background at all. These patches have an overlap ratio with b_gt lower than the threshold θ_iii^-.
• Target over-included samples S_iv^-: image patches that include not only the whole target but also its surrounding background. These patches have an overlap ratio with b_gt lower than the threshold θ_iv^-.
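As a rough illustration of how a candidate patch could be assigned to one of these subsets, the sketch below uses box intersection as a proxy for pixel overlap. The default thresholds correspond to the best values reported in Tables 3-5; everything else is an assumption rather than the authors' implementation.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def contains(outer, inner):
    """True if box `inner` lies entirely inside box `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def negative_subset(box, gt, th_ii=0.6, th_iii=0.6, th_iv=0.5):
    """Assign a negative candidate to one of the four subsets of Fig. 2 (illustrative only)."""
    o = iou(box, gt)
    if o == 0.0:
        return "S_i"      # pure background: no overlap with the target box
    if contains(gt, box) and o < th_iii:
        return "S_iii"    # part of the target, no background
    if contains(box, gt) and o < th_iv:
        return "S_iv"     # whole target plus surrounding background
    if o < th_ii:
        return "S_ii"     # part of the target with some background
    return None           # too much overlap: not used as a negative sample
```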

As shown in Fig. 1, there are four branches of embedding specific layers. The embedding combination layers share weight parameters, while the embedding specific layers do not. The four branches are trained sequentially with the positive samples and each subset of negative samples, and only one branch is enabled when the corresponding negative group is trained. Because the embedding specific layers do not share weight parameters, each embedding specific layer supervises the embedding combination layers to learn a specific embedding in which the corresponding group of negative samples and the positive samples are discriminable.


Therefore, after finishing the sequential training steps, four individual feature embeddings are latent in the embedding combination layers, and in each of these embedding spaces, positive samples and the corresponding negative samples are discriminable. By introducing the embedding specific layers, we mainly aim to prevent the model from overfitting to any single subset of the negative samples. Afterward, we replace the embedding specific layers with the resulting classifier and finetune the network with all the negative and positive samples. Under the supervision of the resulting classifier, the individual negative sample embeddings latent in the embedding combination layers are combined to obtain an embedding where the positive samples and all the negative samples are discriminable. Finally, we obtain a robust binary classifier to distinguish the target from the background.
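A minimal PyTorch sketch of this architecture is given below: a shared backbone, shared embedding combination layers, four embedding-specific classification heads (one per negative subset) used during sequential training, and a final binary classifier used after combination. The layer widths follow Section 4.1; the flattened feature dimension, names, and interface are assumed placeholders.

```python
import torch.nn as nn

class NSECNet(nn.Module):
    """Sketch of the negative sample embedding combination network (Fig. 1)."""
    def __init__(self, backbone, feat_dim=4608, n_branches=4):
        super().__init__()
        self.backbone = backbone                     # shared feature extractor (FE)
        self.ecl = nn.Sequential(                    # embedding combination layers (ECL, shared)
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU())
        self.esl = nn.ModuleList(                    # embedding specific layers (ESL, one per subset)
            [nn.Linear(512, 2) for _ in range(n_branches)])
        self.classifier = nn.Linear(512, 2)          # binary classifier used after combination

    def forward(self, x, branch=None):
        emb = self.ecl(self.backbone(x).flatten(1))
        head = self.classifier if branch is None else self.esl[branch]
        return head(emb)
```

In this sketch, branch k would be enabled only when training on the positive samples together with negative subset k; once the four branches have been trained sequentially, calls with branch=None route through the final classifier for the combination and finetuning stage.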

3.3. Weighted-gradient loss

In this section, we first analyze the relationship between the gradient distribution and data imbalance (easy or hard) and how the gradient distribution impacts the training process. Afterward, we introduce the weighted-gradient loss and illustrate how it mitigates the effect of data imbalance during training.

The relationship between gradient distribution and data imbalance. As mentioned in Section 3.1, tracking-by-detection approaches treat the tracking task as a classification problem and train the model by minimizing the binary classification loss. For a sample, let p ∈ [0, 1] be the probability for a certain class predicted by the model and y ∈ {0, 1} be its ground truth label. If we take the binary cross-entropy loss as our objective, the loss produced by this sample is

l_{BCE} = -y \log(p) - (1 - y)\log(1 - p),    (2)

and the gradient with respect to the direct output x of the model is

g = \frac{\partial l_{BCE}}{\partial x} = y(p - 1) + (1 - y)p.    (3)

If the ground truth label of a sample is 1 and the predicted probability is 0.9, it can be identified as an easy sample, and according to Eq. (3) the gradient produced by this easy sample has a small magnitude. We can therefore conclude that the gradient magnitude indicates the attribute (easy or hard) of a sample as well as its contribution to the model training process: small gradient magnitudes correspond to easy samples, while large gradient magnitudes correspond to hard samples. Fig. 3 shows the gradient magnitude distribution in a tracking model. The proportion of small gradient magnitudes is extremely large, which means that easy samples greatly outnumber hard samples. The effect of this attribute imbalance is that the total gradient contribution accumulated by the large number of easy samples produces a dominant impact on model training.

Fig. 3. The distribution of gradient magnitude in a binary classification model. The proportion of small gradient magnitudes is extremely large, which shows that easy samples are much more numerous than hard samples, and the accumulated small gradients of easy samples dominate the global gradient.

To overcome this issue, we propose the weighted-gradient loss, which focuses on harmonizing the total gradient contributions of easy and hard samples. Firstly, we introduce an average gradient distribution parameter

AG = N \cdot \varepsilon,    (4)

where N is the total number of samples and ε is the length of a bin, if we divide the gradient magnitude range [0, 1] into 1/ε bins of equal length. The average gradient distribution parameter AG represents the number of samples within each bin if the samples were uniformly distributed over [0, 1]. Afterward, we define the gradient weighting parameter as

\alpha_i = \frac{AG}{N_i},    (5)

where N_i is the number of samples within the bin that sample i falls in. For an easy sample e, the number of samples in its bin tends to be greater than AG, so the corresponding α_e is less than 1 and down-weights the contribution of the samples in this bin. For a hard sample h, the number of samples in its bin is smaller than AG, so the corresponding α_h is greater than 1 and up-weights the contribution of the hard sample to the global gradient. We assign the weighting parameters to the corresponding gradients of the samples. On the basis of Eq. (3), the weighted gradient is

\tilde{g} = \alpha_i \left[ y_i(p_i - 1) + (1 - y_i)p_i \right].    (6)

Accordingly, a novel weighted-gradient loss can be formulated from Eq. (2) as

L_{WGL} = -\frac{1}{N}\sum_{i=1}^{N} \alpha_i \left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right].    (7)
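The following is a minimal PyTorch sketch of this loss under the assumption of a single positive-class logit per sample (the actual model uses a two-way softmax); the bin length default is the best value from Table 6, and the function name and interface are illustrative rather than the authors' code.

```python
import torch

def weighted_gradient_loss(logits, targets, bin_len=0.03):
    """Weighted-gradient loss, Eqs. (4)-(7): weight each sample's BCE term by
    alpha_i = AG / N_i computed from the histogram of gradient magnitudes."""
    targets = targets.float()
    p = torch.sigmoid(logits)
    n = targets.numel()
    # Eq. (3): gradient magnitude of the BCE loss w.r.t. the logit is |p - y|
    g = (p - targets).abs().detach()
    # histogram of gradient magnitudes over [0, 1] with bins of length bin_len
    n_bins = int(round(1.0 / bin_len))
    bin_idx = torch.clamp((g / bin_len).long(), max=n_bins - 1)
    bin_counts = torch.zeros(n_bins, device=logits.device).scatter_add_(
        0, bin_idx, torch.ones_like(g))
    # Eq. (4): AG = N * eps;  Eq. (5): alpha_i = AG / N_i
    alpha = (n * bin_len) / bin_counts[bin_idx]
    # Eq. (7): weighted binary cross-entropy, averaged over the batch
    bce = -(targets * torch.log(p + 1e-12) + (1 - targets) * torch.log(1 - p + 1e-12))
    return (alpha * bce).mean()
```

Because the weights are recomputed from the current batch at every iteration, the weighting adapts as the gradient distribution changes during training.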

Since we embed the weighted-gradient mechanism into the loss function, it adds little computational complexity to model training. With the proposed weighted-gradient loss, the gradients of easy samples are down-weighted and the gradients of hard samples are up-weighted to balance their contributions to model training. Fig. 4 shows the distribution of the total gradient contributions of all samples with the proposed weighted-gradient loss. Moreover, since the weighting parameters are recalculated every iteration, the loss adapts to changes of the data attribute distribution along the training process, which makes training robust and efficient.

Fig. 4. With our proposed weighted-gradient loss, the total gradient contribution of easy samples is down-weighted and that of hard samples is up-weighted. From a global view, the contributions of easy and hard samples to the model training process are balanced.

4. Experiment

In this section, we first present the implementation details. We then evaluate our proposed method on the Online Tracking Benchmark (OTB) [6,7], Temple Color 128 (TC-128) [8] and the Visual Object Tracking 2016 benchmark (VOT2016) [9], and compare the performance of our method with state-of-the-art trackers.

4.1. Experimental setup

Network architecture. As shown in Fig. 1, the CNN feature extractor of our model consists of the first three convolutional layers of the VGG-M model [30]. The embedding layer consists of two fully connected layers with ReLUs, each with 512 output units. The embedding specific layers are four branches of binary classification layers, and the classifier at the end of the model is a binary classification layer. The outputs of all the binary classification layers are normalized by a softmax layer to give the probabilities of the target and the background. Our algorithm is implemented in PyTorch and runs at around 3 fps with a 2.66 GHz Intel Xeon E5-2660 core and an NVIDIA Tesla K40 GPU.

Model training. We do not pretrain our model with any video frames offline, since this is not allowed by the VOT challenges for fair comparison [9]. The convolutional layers use the parameters of the VGG-M model pretrained on the ImageNet dataset, and we initialize the weight parameters of our embedding combination layers and embedding specific layers using only the samples drawn from the first frame of the video. Following the scheme proposed in Section 3.2, the number of positive samples is set to N⁺ = 500, and the number of negative samples of each group is set to N⁻ = 1000. We use T1 = 20 iterations to train each branch of embedding specific layers with a learning rate of lr1 = 5e-4. Afterward, we replace the embedding specific layers with a binary classification layer and finetune the network with all the negative and positive samples for T2 = 80 iterations with a learning rate of lr2 = 2e-4. The network solver is stochastic gradient descent (SGD).

Online tracking. When tracking the target object in a frame of the video, 256 samples are randomly drawn around the target bounding box predicted in the previous frame. These samples are evaluated by the model to obtain binary classification scores, and we locate the target by choosing the sample with the highest positive score. Every time we succeed in locating the target, we collect additional positive and negative samples, using the same strategy as in model training, for updating the model. We finetune the fully-connected layers for 15 iterations every 10 frames with a learning rate of lr3 = 3e-4. Other training settings are the same as those in model training.
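A compact sketch of this online procedure is shown below. It reuses the hypothetical locate_target helper from the sketch in Section 3.1; collect_samples, finetune_fc, the success criterion, and all other details are assumptions rather than the released implementation.

```python
def track_sequence(model, frames, init_box, score_threshold=0.5):
    """Sketch of the online tracking loop of Section 4.1 (assumed helpers throughout)."""
    box = init_box
    memory = collect_samples(frames[0], box)       # positive/negative patches from the first frame
    results = [box]
    for t, frame in enumerate(frames[1:], start=1):
        box, score = locate_target(model, frame, box, n_proposals=256)
        if score > score_threshold:                # tracking judged successful (assumed criterion)
            memory += collect_samples(frame, box)  # same sampling strategy as in model training
        if t % 10 == 0:                            # update the fully-connected layers every 10 frames
            finetune_fc(model, memory, iters=15, lr=3e-4)
        results.append(box)
    return results
```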

4.2. Ablation studies

We propose the negative sample embedding combination network (NSEC) for handling the class imbalance between positive and negative samples, and the weighted-gradient loss (WGL) for re-weighting the gradient contributions of easy and hard samples to further facilitate model training. In this section, we investigate how these two components contribute to learning a discriminative classifier and discuss the sensitivity of the final performance to the parameters. We conduct the following experiments.

Contribution of each component. To investigate how NSEC and WGL contribute to learning a discriminative classifier, we first implement a baseline tracker whose CNN model consists of three convolutional layers and three fully connected layers; we collect positive and negative samples using the strategy described in Section 3.1 and train the model with the cross-entropy loss. Secondly, we replace the CNN model with our proposed negative sample embedding combination network and still train it with the cross-entropy loss. Thirdly, we train the negative sample embedding combination network with our proposed weighted-gradient loss (WGL). Fig. 5 shows the results on the OTB100 benchmark. To further show the effectiveness of the proposed components, some examples of tracking results of the different models (Baseline, NSEC, and NSEC+WGL) are shown in Fig. 6. By observing the trackers' performance on the front, middle, and end parts of the videos, we find that:



• In the early period of the tracking process, our proposed methods can overcome some challenging cases including background clutters, scale variations, and abrupt motions, which shows that the proposed NSEC component provides a better model initialization than the baseline method.
• By adding the proposed WGL component, our tracker performs more stably and gives more accurate results throughout the whole tracking process. This shows the effectiveness of our weighted-gradient loss in dealing with data imbalance.

Fig. 5. Precision and success plots on the OTB100 dataset using one-pass evaluation. The numbers indicate the average distance precision scores at 20 pixels and the area under curve success scores. Best viewed in color.


Fig. 6. Qualitative results of Baseline, NSEC, and NSEC+WGL on 3 challenging sequences (Human3, Biker and DragonBaby). Best viewed in color.

Table 1. Average precision scores on different attributes in the OPE experiment on the OTB100 dataset: fast motion (FM), background clutters (BC), motion blur (MB), deformation (DEF), illumination variation (IV), in-plane rotation (IPR), low resolution (LR), occlusion (OCC), out-of-plane rotation (OPR), out-of-view (OV) and scale variation (SV). The best two results are denoted as bold and italic.

Tracker        FM     BC     MB     DEF    IV     IPR    LR     OCC    OPR    OV     SV     AP
ECWGO          0.874  0.940  0.882  0.895  0.931  0.924  0.897  0.870  0.911  0.851  0.901  0.915
ECO [24]       0.878  0.942  0.897  0.857  0.912  0.892  0.881  0.906  0.907  0.913  0.877  0.910
C-COT [25]     0.883  0.882  0.899  0.857  0.882  0.877  0.883  0.903  0.899  0.895  0.879  0.903
RT-MDNet [26]  0.863  0.882  0.851  0.877  0.883  0.871  0.901  0.822  0.859  0.792  0.864  0.885
MDNet [10]     0.855  0.859  0.850  0.842  0.879  0.878  0.874  0.816  0.852  0.822  0.860  0.878
MCPF [27]      0.845  0.823  0.840  0.815  0.882  0.888  0.911  0.862  0.867  0.764  0.862  0.873
TADT [28]      0.834  0.805  0.833  0.822  0.865  0.832  0.881  0.842  0.872  0.816  0.863  0.866
SiamRPN [29]   0.793  0.803  0.821  0.837  0.873  0.859  0.870  0.791  0.855  0.728  0.846  0.851
Siam-Tri [20]  0.763  0.715  0.727  0.683  0.752  0.774  0.897  0.730  0.763  0.723  0.752  0.781
CNN-SVM [19]   0.747  0.776  0.751  0.791  0.792  0.813  0.811  0.727  0.798  0.650  0.785  0.814

Table 2. Average success scores on different attributes in the OPE experiment on the OTB100 dataset: fast motion (FM), background clutters (BC), motion blur (MB), deformation (DEF), illumination variation (IV), in-plane rotation (IPR), low resolution (LR), occlusion (OCC), out-of-plane rotation (OPR), out-of-view (OV) and scale variation (SV). The best two results are denoted as bold and italic.

Tracker        FM     BC     MB     DEF    IV     IPR    LR     OCC    OPR    OV     SV     AS
ECWGO          0.659  0.678  0.680  0.646  0.685  0.657  0.599  0.650  0.660  0.640  0.652  0.673
ECO [24]       0.683  0.700  0.709  0.633  0.715  0.655  0.603  0.681  0.673  0.660  0.667  0.691
C-COT [25]     0.676  0.652  0.706  0.616  0.686  0.627  0.610  0.676  0.652  0.648  0.656  0.673
TADT [28]      0.657  0.622  0.671  0.607  0.681  0.621  0.634  0.643  0.646  0.625  0.655  0.660
RT-MDNet [26]  0.647  0.639  0.649  0.631  0.658  0.628  0.613  0.618  0.632  0.587  0.630  0.650
MDNet [10]     0.644  0.623  0.660  0.603  0.650  0.620  0.591  0.612  0.612  0.620  0.619  0.644
SiamRPN [29]   0.606  0.601  0.627  0.628  0.663  0.636  0.597  0.597  0.631  0.550  0.628  0.637
MCPF [27]      0.597  0.601  0.599  0.569  0.629  0.620  0.581  0.620  0.619  0.553  0.604  0.628
Siam-Tri [20]  0.585  0.542  0.567  0.504  0.585  0.580  0.634  0.554  0.563  0.543  0.567  0.590
CNN-SVM [19]   0.546  0.548  0.578  0.547  0.537  0.548  0.403  0.514  0.548  0.488  0.489  0.554

With the quantitative results in Fig. 5 and the qualitative results in Fig. 6, we can conclude that our proposed negative sample embedding combination network and weighted-gradient loss improve the tracking performance compared to conventional tracking-by-detection approaches.

Parameter analysis. There are many parameters in this paper. In this section, we discuss how these parameters are determined and analyze some key parameters related to our proposed algorithm.


Fig. 7. Qualitative evaluation of our proposed tracker, CNN-SVM, MDNet, Siam-Tri and CCOT on 6 challenging sequences (Basketball, Singer2, Football, Bird1, Box and Ironman). Best viewed in color.

Table 3. Parameter sensitivity analysis of θ_ii^- on the OTB-2013 dataset. The best two results are denoted as bold and italic.

θ_ii^-    0.7   0.6   0.5   0.4   0.3   0.2   0.1
Prec(%)  90.7  94.6  94.4  90.2  90.5  86.5  82.9
Succ(%)  67.2  69.7  69.6  67.0  67.3  60.3  58.9

Table 4. Parameter sensitivity analysis of θ_iii^- on the OTB-2013 dataset. The best two results are denoted as bold and italic.

θ_iii^-   0.7   0.6   0.5   0.4   0.3   0.2   0.1
Prec(%)  90.1  94.6  90.3  92.2  92.2  88.7  87.1
Succ(%)  67.0  69.7  67.1  68.0  68.1  65.8  62.1

Table 5. Parameter sensitivity analysis of θ_iv^- on the OTB-2013 dataset. The best two results are denoted as bold and italic.

θ_iv^-    0.7   0.6   0.5   0.4   0.3   0.2   0.1
Prec(%)  91.7  92.5  94.6  92.0  92.8  92.2  90.4
Succ(%)  67.3  67.9  69.7  67.9  68.5  68.0  66.5

Generally, the parameters appearing in this paper can be divided into two categories. One category includes the learning rates (lr1, lr2, lr3), the numbers of positive and negative samples (N⁺, N⁻), the numbers of training and updating iterations (T1, T2), and so on. The other category includes the thresholds (θ_ii^-, θ_iii^-, θ_iv^-) used


Fig. 8. Precision and success plots on the OTB2013 dataset using one-pass evaluation. Best viewed in color.

Fig. 9. Precision and success plots on the OTB100 dataset using one-pass evaluation. Best viewed in color.

to define different kinds of negative samples in Section 3.2, and the length of bins (ε) in our proposed weighted-gradient loss in Section 3.3. As discussed in [31], there is no simple and easy way to set hyperparameters in deep models: grid search or random search is time-consuming and requires expensive computation, and in deep learning the hyperparameter settings largely depend on empiricism and engineering. Therefore, for the first category of parameters, we used the same settings as in [10] for fair comparison.

For the scalar parameters θ_ii^-, θ_iii^-, θ_iv^-, which are the upper limits of the overlap ratios between negative samples and the ground truth bounding box, the main basis for determining their values is the definition of each specific kind of negative sample. In tracking-by-detection tasks, an image patch with an overlap ratio of at least 0.7 with the target is defined as a positive sample, so the maximum value that θ_ii^-, θ_iii^-, θ_iv^- can adopt is 0.7. In fact, it is more reasonable to adopt 0.6 or 0.5 than 0.7, because adopting 0.7 leaves no clear margin between positive and negative samples; some negative samples become similar to the positive samples, which confuses the model training process and negatively impacts the performance of our tracker. Moreover, we conducted parameter sensitivity analysis experiments to validate this statement; the results are reported in Tables 3-5.

The scalar parameter ε is mainly used to calculate a corrected value for the gradient magnitude within a certain range, which

is discussed in Section 3.3. Table 6 shows the results of varying ε, the length of a bin when we divide the gradient magnitude range [0, 1] into 1/ε bins of equal length. When ε is too large, the corrected value cannot vary well over different gradient magnitudes and the performance suffers; therefore, the performance improves as ε decreases. However, ε does not need to be very small: with ε = 0.03, the weighted-gradient loss already yields a sufficient improvement over the baseline.

Comparison with other losses. To verify the advantages of the weighted-gradient loss (WGL), we compare the performance of trackers when other losses are adopted for the binary classifier in the model. We choose the binary cross-entropy loss (BCE), the weighted binary cross-entropy loss (W-BCE) and focal loss for comparison. Table 7 shows the detailed performance under different settings. As mentioned above, we adopt the network architecture in [10] as the baseline network. From the results in Table 7, the weighted-gradient loss has a competitive advantage over the other three losses. We analyze these results as follows. The binary cross-entropy loss is the standard and widespread loss function for classification tasks, but it cannot deal with the data imbalance issue in visual tracking. The weighted binary cross-entropy loss can deal with the class imbalance to a certain extent, but it still suffers from the attribute imbalance. Focal loss can deal with the attribute imbalance issue by restraining the easy samples through its weighting factor, but in fact it restrains the hard samples as well. The weighted-gradient loss down-weights the gradients of easy samples and up-weights the gradients of hard samples to facilitate the model training.
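For reference, a focal-loss-style term as in [21] is sketched below; its modulating factor depends only on each sample's own predicted probability, which is why it is static compared with the gradient-distribution-based weighting of WGL. The hyperparameters are the common defaults for focal loss, not values tuned in this paper.

```python
import torch

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Static focal-loss weighting for comparison: (1 - p_t)^gamma is computed per sample
    from its own prediction, independently of the batch's gradient distribution."""
    targets = targets.float()
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)              # probability of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t + 1e-12)).mean()
```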


Fig. 10. The success plots on different attributes in the OPE experiment for OTB2013. The numbers in parentheses indicate the number of sequences involved in the corresponding attribute; the legend numbers indicate the AUC success scores. Best viewed in color.

4.3. Evaluation on OTB

Dataset and evaluation settings. We first evaluate our tracker on the online tracking benchmark (OTB) [6,7]. OTB2013 [6] is a popular dataset with 51 fully-annotated video sequences. OTB100 [7] is the extension of OTB2013 with 100 video sequences, and it is more challenging. OTB is annotated with 11 challenging factors, including illumination variation (IV), scale variation

(SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutters (BC) and low resolution (LR). The video sequences and the 11 challenging factors help us analyze the characteristics of our tracker. We compare our algorithm with the 29 trackers in [7] and other state-of-the-art trackers including ECO [24], TADT [28], Siam-Tri [20], RT-MDNet [26], CNN-SVM [19], MCPF [27], CCOT [25], SiamRPN [29], SiamFC [32] and MDNet [10]. Because it is considered unfair to pretrain trackers with tracking videos, we only initialize MDNet with the parameters of the convolutional layers in VGG-M for a convincing comparison.


Fig. 11. The precision plots on different attributes in the OPE experiment for OTB2013. The numbers in parentheses indicate the number of sequences involved in the corresponding attribute; the legend numbers indicate the average distance precision scores at 20 pixels. Best viewed in color.

We use the two metrics suggested by [7]: success and precision plots. We first calculate the overlap score, defined as

t = \frac{|B_{pred} \cap B_{gt}|}{|B_{pred} \cup B_{gt}|},

where B_pred denotes the predicted target bounding box and B_gt denotes the ground truth bounding box. The success plot shows the percentage of frames with t > s for s ∈ [0, 1], and we use the area-under-curve (AUC) of the success plot as a measure to compare the trackers' performance. Alternatively, the precision plot shows the percentage of frames whose predicted bounding box lies within a threshold distance d of the ground truth. We report the success and precision plots in one-pass evaluation (OPE), in which trackers are evaluated throughout a test sequence with initialization from the ground truth position in the first frame.
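A small NumPy sketch of these two metrics is given below, assuming per-frame boxes in (x, y, w, h) form; thresholds follow the benchmark convention (success over s ∈ [0, 1], precision read at 20 pixels), and the function name is illustrative.

```python
import numpy as np

def otb_metrics(pred, gt):
    """Success AUC and distance precision at 20 px for per-frame boxes given as (x, y, w, h)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    # overlap score t per frame
    x1 = np.maximum(pred[:, 0], gt[:, 0]); y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    t = inter / (pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter)
    # center location error per frame
    err = np.hypot(pred[:, 0] + pred[:, 2] / 2 - (gt[:, 0] + gt[:, 2] / 2),
                   pred[:, 1] + pred[:, 3] / 2 - (gt[:, 1] + gt[:, 3] / 2))
    success = np.array([(t > s).mean() for s in np.linspace(0, 1, 21)])
    return success.mean(), (err <= 20).mean()   # AUC of the success plot, precision at 20 px
```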

Quantitative evaluation. Figs. 8 and 9 illustrate the overall precision and success plots over the 51 sequences of OTB2013 and the 100 sequences of OTB100 respectively. The legend of the precision plot contains the threshold scores at 20 pixels, while the legend of the success plot contains the area-under-curve (AUC) scores for each tracker. As shown in Figs. 8 and 9, our proposed tracker performs favorably against state-of-the-art approaches in both measures.


Fig. 12. Precision and success plots on the Temple Color 128 dataset using one-pass evaluation. Best viewed in color.

Fig. 13. The expected average overlap (EAO) scores and ranks of tested algorithms on VOT2016 dataset. The better trackers are located at the upper-right corner. Best viewed in color.

We partly attribute the performance of our tracker to the power of feature representation and discrimination of CNNs, which are more robust than hand-crafted features against various challenges such as deformation and scale variation. Compared with the trackers that also utilize CNN features, our tracker obtains better performance due to the enhanced discriminating power between the target object and the background: with our proposed network and loss function, we learn a better embedding space in which the positive and negative samples are more discriminable.

It is worth analyzing that the performance of ECWGO is slightly inferior to ECO and CCOT in terms of the success rate, as shown in Fig. 9. The ECO and CCOT methods are effective at estimating the scale variation of the target object because of their continuous filter mechanism, while our tracker estimates the scale based on random samples. Considering both precision and success rate in Fig. 9, the location precision of ECWGO is higher than that of ECO and CCOT, which shows the power of our algorithm in discriminating the target from the background. Therefore, the bottleneck to a higher success rate is the limited ability of scale estimation; we will improve the scale estimation module of our method in future work.

To illustrate the effectiveness of our tracker for various kinds of challenging factors, we show the success and precision scores of our tracker and existing state-of-the-art trackers on OTB2013 in Figs. 10 and 11.

Thanks to the negative sample embedding combination network and the weighted-gradient loss, our tracker performs favorably against the state-of-the-art trackers. The results show that our tracking model has strong discrimination and generalization power: although our model is initialized only with the target in the first frame, it performs robustly when the scale and the appearance vary throughout the video. However, our tracker performs slightly worse in fast motion and out-of-view cases. The main reason is that we draw the target proposals within a certain area, and when the target moves fast or out of view, our tracker loses track of it. To get further insight into the performance of our proposed tracker when handling various challenging factors, Tables 1 and 2 compare the precision and success scores of our tracker and existing state-of-the-art trackers for the 11 challenging factors over the 100 sequences of OTB100.

Qualitative evaluation. To qualitatively compare our method with other visual tracking approaches, we group existing state-of-the-art trackers into several categories. Trackers based on correlation filters include C-COT [25], ECO [24], LCT [33], CF2 [34], CREST [35] and PATV [36]. Trackers based on siamese networks include SiamFC [32], Siam-Tri [20], SiamRPN [29] and CRPN [37]. Trackers based on tracking-by-detection networks include MDNet [10] and DAT [12].

Table 6. Parameter sensitivity analysis of ε on the OTB-2013 dataset. The best two results are denoted as bold and italic.

ε         0.01  0.02  0.03  0.04  0.05  0.06  0.07  0.08  0.09
Prec(%)   92.9  94.0  94.6  94.2  93.5  92.0  91.6  90.3  89.9
Succ(%)   69.1  69.4  69.7  69.5  69.0  67.9  67.3  66.9  66.2

Table 7. Comparison of different loss functions for tracking on the OTB-2013 dataset. The best two results are denoted as bold and italic.

Network architecture  Loss function  Prec(%)  Succ(%)
Baseline              BCE            89.8     66.1
Baseline              W-BCE          89.9     66.2
Baseline              Focal Loss     90.3     66.9
Baseline              WGL            91.6     67.3
NSEC                  BCE            93.1     69.3
NSEC                  W-BCE          93.1     69.3
NSEC                  Focal Loss     94.0     69.5
NSEC                  WGL            94.6     69.7

Table 8. Quantitative comparison of different trackers on the VOT2016 dataset.

Tracker      Acc. rank  Acc. score  Rob. rank  Rob. score  EAO
ECO          3.97       0.55        4.60       0.87        0.37
ECWGO        3.90       0.53        5.33       0.63        0.35
DSLT         4.22       0.54        5.25       0.83        0.34
SiamRPN      4.48       0.57        8.18       1.10        0.34
CCOT         4.50       0.54        5.52       0.89        0.33
TCNN         3.40       0.55        7.43       0.83        0.32
VITAL        3.95       0.55        7.68       1.09        0.32
DAT          2.95       0.56        5.97       0.85        0.32
TADT         3.97       0.54        8.08       1.24        0.30
Staple       4.35       0.54        9.82       1.42        0.30
EBT          9.90       0.46        5.82       1.05        0.29
DSRDCF       5.37       0.52        8.83       1.23        0.28
MDNet        3.68       0.54        7.92       0.91        0.26
SRDCF        4.72       0.53        9.68       1.43        0.25
SiamAN       6.60       0.53        11.08      1.91        0.24
NSAMF        7.58       0.50        9.60       1.25        0.23
ColorKCF     7.83       0.50        12.72      1.50        0.23
SiamFC-Tri   5.27       0.52        12.62      2.13        0.22
GCF          5.82       0.51        11.37      1.57        0.22
ASMS         7.97       0.50        14.33      1.93        0.21
ANT          9.27       0.48        13.38      1.72        0.20
BST          17.58      0.37        11.38      1.97        0.20
KCF          8.15       0.48        12.70      1.95        0.19
SCT4         9.47       0.46        12.52      2.04        0.19
DSST         6.98       0.52        15.10      2.38        0.18
ACT          10.98      0.44        14.30      2.34        0.17
LGT          13.53      0.42        16.25      2.24        0.16
MIL          13.63      0.41        18.33      3.03        0.16
MatFlow      14.38      0.40        16.95      2.67        0.15
STRUCK2014   11.90      0.45        20.10      3.40        0.14
BDF          16.20      0.37        17.90      3.15        0.14
IVT          13.87      0.41        21.92      4.15        0.11

Trackers based on SVM include CNN-SVM [19] and Struck [17]. For presentation clarity, we choose one representative approach from each category for comparison. Fig. 7 shows the results of the representative trackers CCOT, Siam-Tri, MDNet, CNN-SVM, and ECWGO (Ours) on some challenging sequences. The Basketball and Football sequences suffer from background clutters because distractors of the same class as the object are present. The Singer2 sequence suffers from a dark target against colorful background distractors.

The Bird1 sequence suffers from deformation and fast motion, the Box sequence from occlusion, rotation and scale variation, and the Ironman sequence from fast motion, a dark background and background clutters. In many of these challenging sequences, MDNet fails to locate the target because the data imbalance issue leaves the model insufficiently trained and unable to distinguish the target from the background under challenging conditions. Trackers based on siamese networks, such as Siam-Tri and SiamRPN, are not robust enough when similar distractors appear around the target. The overall performance of CNN-SVM is inferior to the other four trackers because of the limited performance of the SVM classifier. The overall performance of CCOT and of our proposed tracker ECWGO is about the same, but our tracker is more robust in some challenging cases (Singer2, Bird1, and Football), mainly because we focus on mitigating the data imbalance issue to prevent easy samples from overwhelming the model training. As shown in Fig. 7, our proposed tracker performs favorably against state-of-the-art trackers.

4.4. Evaluation on Temple Color 128

The Temple Color 128 (TC-128) dataset [8] contains 129 color video sequences. The evaluation metric is the same as for the OTB datasets. Fig. 12 illustrates the overall precision and success plots on the TC-128 dataset. In addition to the trackers mentioned in Section 4.3, we compare our tracker with Struck [17], KCF [38], ASLA [39], and MIL [18], which are evaluated by the authors of TC-128. As Fig. 12 illustrates, our tracker performs favorably against other state-of-the-art trackers.

4.5. Evaluation on VOT2016

The VOT2016 dataset [9] is another popular dataset for visual tracking. Its protocol and evaluation metrics differ from those of the OTB and TC-128 datasets. Therefore, to compare our tracker against existing state-of-the-art trackers more completely, we conduct experiments on the VOT2016 dataset, which contains 60 challenging video sequences. In the VOT challenge protocol, a tracker is re-initialized whenever tracking fails (the overlap between the ground truth and the estimated bounding box is zero). The official evaluation tool provided by the VOT challenge reports accuracy and robustness, which measure the bounding box overlap ratio and the number of failures respectively.


Taking both metrics into consideration, the VOT2016 challenge introduces the expected average overlap (EAO) to rank the tracking algorithms. We compare our tracker with state-of-the-art trackers on the VOT2016 benchmark, including TADT [28], DSLT [40], Siam-Tri [20], SiamRPN [29], ECO [24], C-COT [25], DAT [12], RT-MDNet [26] and MDNet [10]. For fair comparison, we utilize the tracking results downloaded from the official VOT website. Table 8 shows the accuracy score, robustness score, accuracy rank and robustness rank of all the trackers, and Fig. 13 shows the EAO scores and EAO ranks. It is worth noting that C-COT is the winner of the VOT2016 challenge. As shown by Table 8 and Fig. 13, our tracker performs favorably against the state-of-the-art trackers.

5. Conclusion and future work

In this paper, we propose the negative sample embedding network for visual tracking to mitigate the effect of data imbalance. We heuristically define several kinds of negative samples according to their specific semantic characteristics. With the help of the proposed embedding specific branches and the embedding combination layer, we separately learn several negative sample embeddings and combine them to learn a more discriminative boundary between the positive and negative samples. Moreover, we propose a novel weighted-gradient loss to mitigate the effect of the imbalance between easy and hard samples. Since we identify that a large number of easy samples overwhelm the model training and leave the model insufficiently trained, our proposed loss re-weights the total gradient contributions of easy and hard samples. It adds only a little computational complexity over the standard cross-entropy loss, but it shows better performance. Extensive experimental results show that our tracking algorithm performs favorably against state-of-the-art tracking algorithms on the OTB2013, OTB100, Temple Color 128 and VOT2016 benchmarks.

Although our method achieves favorable tracking performance on the four benchmarks, the tracker can still be improved. First, our method is not a real-time tracker, since hundreds of samples are evaluated by our model in each frame; we can utilize an RoI Align mechanism and employ a better online updating strategy to speed it up. Second, the bottleneck to a higher success rate is the limited ability of scale estimation compared with ECO and C-COT; we can improve the scale estimation module of our method. Besides, motivated by the good performance of recent work on bounding box regression, we intend to train a better bounding box regressor to help estimate the object region more precisely for better performance.

Acknowledgement

The work was funded by the Beijing Municipal Science and Technology Commission project under Grant No. Z181100001918005.

References

[1] H. Yang, C. Yuan, B. Li, Y. Du, J. Xing, W. Hu, S.J. Maybank, Asymmetric 3d convolutional neural networks for action recognition, Pattern Recognit. 85 (2019) 1–12, doi:10.1016/j.patcog.2018.07.028.
[2] J. Guo, P. Hu, L. Li, R. Wang, Design of automatic steering controller for trajectory tracking of unmanned vehicles using genetic algorithms, IEEE Trans. Veh. Technol. 61 (7) (2012) 2913–2924, doi:10.1109/TVT.2012.2201513.
[3] X. Zhou, K. Jin, Q. Chen, M. Xu, Y. Shang, Multiple face tracking and recognition with identity-specific localized metric learning, Pattern Recognit. 75 (2018) 41–50, doi:10.1016/j.patcog.2017.09.022.
[4] A. Yilmaz, O. Javed, M. Shah, Object tracking: a survey, ACM Comput. Surv. 38 (4) (2006) 13, doi:10.1145/1177352.1177355.
[5] X. Dong, J. Shen, D. Wu, K. Guo, X. Jin, F. Porikli, Quadruplet network with one-shot learning for fast visual object tracking, IEEE Trans. Image Process. 28 (7) (2019) 3516–3527, doi:10.1109/TIP.2019.2898567.
[6] Y. Wu, J. Lim, M. Yang, Online object tracking: a benchmark, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2411–2418, doi:10.1109/CVPR.2013.312.


[7] Y. Wu, J. Lim, M. Yang, Object tracking benchmark, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1834–1848, doi:10.1109/TPAMI.2014.2388226.
[8] P. Liang, E. Blasch, H. Ling, Encoding color information for visual tracking: algorithms and benchmark, IEEE Trans. Image Process. 24 (12) (2015) 5630–5644, doi:10.1109/TIP.2015.2482905.
[9] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernández, T. Vojír, G. Häger, G. Nebehay, R.P. Pflugfelder, The visual object tracking vot2015 challenge results, in: IEEE International Conference on Computer Vision Workshops, 2015, pp. 1–23, doi:10.1109/ICCVW.2015.79.
[10] H. Nam, B. Han, Learning multi-domain convolutional neural networks for visual tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4293–4302, doi:10.1109/CVPR.2016.465.
[11] Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R.W.H. Lau, M. Yang, Vital: Visual tracking via adversarial learning, in: IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8990–8999, doi:10.1109/CVPR.2018.00937.
[12] S. Pu, Y. Song, C. Ma, H. Zhang, M. Yang, Deep attentive tracking via reciprocative learning, in: Advances in Neural Information Processing Systems, 2018, pp. 1931–1941.
[13] W. Wang, J. Shen, L. Shao, Video salient object detection via fully convolutional networks, IEEE Trans. Image Process. 27 (1) (2018) 38–49, doi:10.1109/TIP.2017.2754941.
[14] W. Wang, Q. Lai, H. Fu, J. Shen, H. Ling, Salient object detection in the deep learning era: An in-depth survey, CoRR abs/1904.09146 (2019).
[15] A. Shrivastava, A. Gupta, R.B. Girshick, Training region-based object detectors with online hard example mining, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 761–769, doi:10.1109/CVPR.2016.89.
[16] J. Shen, D. Yu, L. Deng, X. Dong, Fast online tracking with detection refinement, IEEE Trans. Intell. Transp. Syst. 19 (1) (2018) 162–173, doi:10.1109/TITS.2017.2750082.
[17] S. Hare, A. Saffari, P.H.S. Torr, Struck: Structured output tracking with kernels, in: IEEE International Conference on Computer Vision, 2011, pp. 263–270, doi:10.1109/ICCV.2011.6126251.
[18] B. Babenko, M. Yang, S.J. Belongie, Robust object tracking with online multiple instance learning, IEEE Trans. Pattern Anal. Mach. Intell. 33 (8) (2011) 1619–1632, doi:10.1109/TPAMI.2010.226.
[19] S. Hong, T. You, S. Kwak, B. Han, Online tracking by learning discriminative saliency map with convolutional neural network, in: International Conference on Machine Learning, 2015, pp. 597–606.
[20] X. Dong, J. Shen, Triplet loss in siamese network for object tracking, in: European Conference on Computer Vision, 2018, pp. 472–488, doi:10.1007/978-3-030-01261-8_28.
[21] T. Lin, P. Goyal, R.B. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: IEEE International Conference on Computer Vision, 2017, pp. 2980–2988, doi:10.1109/ICCV.2017.324.
[22] W. Wang, J. Shen, L. Shao, Consistent video saliency using local gradient flow optimization and global refinement, IEEE Trans. Image Process. 24 (11) (2015) 4185–4196, doi:10.1109/TIP.2015.2460013.
[23] W. Wang, J. Shen, F. Porikli, R. Yang, Semi-supervised video object segmentation with super-trajectories, IEEE Trans. Pattern Anal. Mach. Intell. 41 (4) (2019) 985–998, doi:10.1109/TPAMI.2018.2819173.
[24] M. Danelljan, G. Bhat, F.S. Khan, M. Felsberg, Eco: efficient convolution operators for tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6931–6939, doi:10.1109/CVPR.2017.733.
[25] M. Danelljan, A. Robinson, F.S. Khan, M. Felsberg, Beyond correlation filters: learning continuous convolution operators for visual tracking, in: European Conference on Computer Vision, 2016, pp. 472–488, doi:10.1007/978-3-319-46454-1_29.
[26] I. Jung, J. Son, M. Baek, B. Han, Real-time MDNet, in: European Conference on Computer Vision, 2018, pp. 89–104, doi:10.1007/978-3-030-01225-0_6.
[27] T. Zhang, C. Xu, M. Yang, Multi-task correlation particle filter for robust object tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4819–4827, doi:10.1109/CVPR.2017.512.
[28] X. Li, C. Ma, B. Wu, Z. He, M. Yang, Target-aware deep tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1369–1378, doi:10.1109/CVPR.2019.00146.
[29] B. Li, J. Yan, W. Wu, Z. Zhu, X. Hu, High performance visual tracking with siamese region proposal network, in: IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8971–8980, doi:10.1109/CVPR.2018.00935.
[30] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015.
[31] L.N. Smith, A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay, CoRR abs/1803.09820 (2018).
[32] L. Bertinetto, J. Valmadre, J.F. Henriques, A. Vedaldi, P.H.S. Torr, Fully-convolutional siamese networks for object tracking, in: European Conference on Computer Vision Workshops, 2016, pp. 850–865, doi:10.1007/978-3-319-48881-3_56.
[33] C. Ma, X. Yang, C. Zhang, M. Yang, Long-term correlation tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5388–5396, doi:10.1109/CVPR.2015.7299177.
[34] C. Ma, J. Huang, X. Yang, M. Yang, Hierarchical convolutional features for visual tracking, in: IEEE International Conference on Computer Vision, 2015, pp. 3074–3082, doi:10.1109/ICCV.2015.352.



[35] Y. Song, C. Ma, L. Gong, J. Zhang, R.W.H. Lau, M. Yang, Crest: convolutional residual learning for visual tracking, in: IEEE International Conference on Computer Vision, 2017, pp. 2574–2583, doi:10.1109/ICCV.2017.279.
[36] H. Fan, H. Ling, Parallel tracking and verifying: a framework for real-time and high accuracy visual tracking, in: IEEE International Conference on Computer Vision, 2017, pp. 5487–5495, doi:10.1109/ICCV.2017.585.
[37] H. Fan, H. Ling, Siamese cascaded region proposal networks for real-time visual tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7952–7961, doi:10.1109/CVPR.2019.00814.
[38] J.F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell. 37 (3) (2014) 583–596, doi:10.1109/tpami.2014.2345390.
[39] M. Danelljan, G. Häger, F.S. Khan, M. Felsberg, Accurate scale estimation for robust visual tracking, in: British Machine Vision Conference, 2014, pp. 1–11.
[40] X. Lu, C. Ma, B. Ni, X. Yang, I.D. Reid, M. Yang, Deep regression tracking with shrinkage loss, in: European Conference on Computer Vision, 2018, pp. 369–386, doi:10.1007/978-3-030-01264-9_22.

Jin Feng is currently pursuing the Ph.D. degree with the School of Information and Communication Engineering, Beijing University of Posts and Telecommunications (BUPT), China. His research interest is computer vision, with a particular focus on visual tracking.

Peng Xu is currently a Ph.D. student at Beijing University of Posts and Telecommunications (BUPT), China. His research interests lie in computer vision, deep learning, and machine learning. He serves as a reviewer for CVPR, ICCV, CVIU, TNNLS, etc.

Shi Pu is currently a Ph.D. student at the School of Information and Communication Engineering, Beijing University of Posts and Telecommunications. His research interest is computer vision, with a particular focus on visual tracking. He was sponsored by the China Scholarship Council as a visiting Ph.D. student at the University of California at Merced from the fall of 2017 to the fall of 2018.

Kaili Zhao is currently an associate professor at Beijing University of Posts and Telecommunications. Her interests are in computer vision and machine learning. She has developed techniques spanning structured multi-task learning, weakly-supervised learning and deep learning, and has led projects in facial expression analysis, AU detection, semantic segmentation, crowd counting, and pedestrian detection.

Honggang Zhang is currently an associate professor and director of the Web Search Center at Beijing University of Posts and Telecommunications (BUPT), China. He received his Ph.D. degree in computer application technology from the School of Information and Communication Engineering, Beijing University of Posts and Telecommunications in 2003. His research interests cover image retrieval, computer vision, and pattern recognition. He has published more than 30 papers in TPAMI, TIP, SCIENCE, CVPR, ECCV, and NeurIPS. He is a senior member of IEEE.