Explicitly Exploiting Hierarchical Features in Visual Object Tracking

Tianze Gao (1,a), Nan Wang (1,a), Jun Cai (a), Weiyang Lin (a,b), Xinghu Yu (a,c), Jianbin Qiu (a), Huijun Gao (a,*)

(a) State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin, 150001, China
(b) Key Laboratory of Micro-systems and Micro-structures Manufacturing of Ministry of Education, Harbin Institute of Technology, Harbin, 150001, China
(c) HIT-Ningbo Institute of Intelligent Equipment Technology, Ningbo, 315200, China
Abstract

A common drawback of trackers based on convolutional features is their inability to distinguish tracking targets from distractors, even in the presence of distinct color differences. In this paper, we design a robust hybrid tracker that explicitly combines color features with convolutional features. We build our tracker on the fully convolutional Siamese network (SiamFC) to emphasize the performance gain obtained by introducing color features rather than by adopting a more advanced network architecture. A novel approach to integrating the two score maps from different channels is proposed, with techniques including cropping out the ineffective area and denoising via Gaussian smoothing. Experiments conducted on the OTB2015 and VOT2018 benchmarks show the superiority of our hybrid tracker over the original SiamFC.

Keywords: Siamese network, Hybrid tracker, Color histogram
1. Introduction

Visual object tracking is one of the fundamental problems in computer vision. It is widely applied in tasks including video surveillance [1], augmented reality [2] and autonomous driving [3]. In recent years, CNN-based tracking methods have achieved a dominant position. Such success is attributed both to the improvement of hardware computational capability and to the continuous optimization of network architectures. However, with the prevalence of CNNs, traditional feature extraction methods have been neglected to some extent. Although state-of-the-art visual object tracking algorithms mostly rely on CNN features, traditional visual features such as HOG [26], SIFT [4], Haar-like [5], optical flow and the color histogram still bear certain merits in application.

Current mainstream deep-learning-based tracking algorithms are generally divided into two categories: tracking-by-detection and template matching [6]. The tracking-by-detection
1 The first and second authors contributed equally to this paper.
∗ Corresponding author.
Email addresses: [email protected] (Weiyang Lin), [email protected] (Huijun Gao)
methods usually require an online update of the model and thus suffer a severe drop in tracking speed [5, 7]. The advantage of such methods is their robustness to changes in target appearance and environment. The idea of template matching methods is to let networks evaluate the similarity between target patches and candidate patches. The most representative architecture is the Siamese network [8, 9, 10]. The state-of-the-art trackers, the SiamRPN family [11, 12, 13, 14], are all built with Siamese networks. Apart from pure deep learning approaches, trackers that combine CNN features with traditional methods [15, 16, 17] have also achieved promising results. For example, paper [18] feeds convolutional features into a correlation filter [19] and utilizes a coarse-to-fine search strategy. Paper [9] achieves a good candidate sampling effect by using optical flow. Paper [20] combines a deep learning architecture with the traditional particle filter algorithm.

Siamese trackers such as SiamFC [8] often have trouble distinguishing similar objects even when the objects have a distinct color difference, as shown in Fig. 1. This is caused by the insensitivity to low-level features. By using convolutional networks, trackers have access to both semantic information from high-level layers and local features such as texture from low-level layers. Hence, it is a common practice to integrate features learned from different layers to acquire rich hierarchical information [21, 22]. However, such integration is implicit and cannot bring low-level features into full play, since the information they contain may be diluted by high-level features.

In this paper, we explicitly exploit the color histogram as the low-level feature to assist the Siamese network in target tracking. Two channels are used to extract hierarchical features of the input image. As a result, two score maps representing information from different levels are generated, and the final prediction is determined by their integration. We also propose a novel approach to integrate the score maps obtained from the correlation result in SiamFC and the probability calculation on color histograms. On the one hand, we point out that the search image cannot be learned in full scope by traditional Siamese networks, because no padding is added during the convolution process. For this reason, we crop out the ineffective area in the score map before the integration. On the other hand, we apply Gaussian smoothing to the final score map to remove noise pixels that mislead the tracker.

In summary, the contributions of our work are three-fold:

• We design a robust hybrid tracker that explicitly exploits the color histogram as the low-level feature and combines it with the high-level semantic feature learned by a fully convolutional Siamese network.
• We propose a novel approach for the integration of score maps from different channels.
• We conduct experiments on two standard benchmarks, OTB2015 and VOT2018, to compare our hybrid tracker with the original SiamFC tracker.
Fig. 1. Visualization of the response map produced by SiamFC: note that the player in white suit even has a larger response value than the target player (in green suit).
2. Related Works

2.1. Siamese network based trackers

Siamese networks have been very prevalent in the field of visual object tracking. Taking the most notable one, SiamFC [8], as an example, the features of the exemplar image and the search image are extracted by the two branches of the Siamese network. A convolution operation is then conducted on the obtained features to get a response map, from which the target location and scale change are predicted. A commonly discussed drawback of the SiamFC network is that it only compares later frames with the initial template, which can be quite different from the newest target appearance. To solve this problem, RFL [23] integrates a convolutional Long Short-Term Memory (convLSTM) into a Siamese network to update the target appearance. GOTURN [10] compares two adjacent frames to locate the target. CFNet [24] conducts a weighted fusion of all previous frames. Despite the adaptation to target appearance change, the aforementioned algorithms do not perform well in target re-identification. When the prediction deviates, wrong information is memorized and causes severe disruption to the following frames. The worse the deviation is, the worse the template is deformed, and thus a vicious circle is formed. This phenomenon is even more serious in the presence of occlusion.

We argue that simply using the initial frame as the template is quite effective, especially for target re-identification. The state-of-the-art network, SiamRPN++ [14], adopts such a strategy and achieves good tracking performance. However, SiamRPN++ combines high-level layers and low-level layers implicitly via a weighted summation. The features extracted by low-level layers cannot be controlled directly and thus may not be helpful in discriminating semantically similar objects.

2.2. Traditional features used in object tracking

Despite the success of deep learning features in the computer vision field, traditional visual features are still useful since they are manually crafted and more task-specific. KCF [25] inputs HOG [26] features into a kernelized version of the linear correlation filter. It achieves good tracking performance while working at a high speed. DAT [15] uses the color histogram
and applies Bayes' rule to predict the target location. MaskTrack [27] performs object tracking and instance segmentation simultaneously, taking optical flow as an additional input feature. [16] proposes a SIFT [4] based color particle filter tracking method. MIL [28] proposes to track with a boosting framework whose weak classifiers are built on Haar-like [5] features.

Our tracking strategy is inspired by DAT. Unlike DAT, we do not directly use the color histogram feature to predict the target location and scale change. Instead, it serves as complementary information for the deep learning features extracted by a Siamese network.

2.3. Distractor suppression

One of the biggest problems in visual object tracking is the drifting caused by objects with similar appearance (usually called distractors). Researchers have put forward various approaches to address this problem. SO-DLT [7] proposes a multi-scale search mechanism to avoid including extraneous context information; by doing so, distractors are excluded from the search area as far as possible. DAT [15] calculates the color histogram of possible distractors and uses it to adjust the probability map. FCNT [21] points out that different convolutional layers capture different information, and designs a distractor detection mechanism based on a joint decision by a general network (GNet) and a specific network (SNet). DaSiamRPN [12] solves the problem in a data-driven way. It argues that the incapability to discriminate distractors is attributed to two aspects: the non-semantic negative samples and the insufficient number of training pairs. Apart from the original VID [29] dataset, a much larger dataset, Youtube-BB [30], serves as an extra training set. To expand the object categories, it also generates image pairs from two detection datasets: ImageNet [29] and COCO [31]. As one of the state-of-the-art trackers, DaSiamRPN achieves prominent performance among the aforementioned methods. However, it requires a huge amount of distractor samples for training, which may not be available in practice. In addition, despite its efforts to include all possible categories in the training phase, it may lose its superiority when meeting tracking targets with no similar counterparts in the training set.

3. Method

The architecture of our hybrid tracker is briefly illustrated in Fig. 2. Different from SiamFC, we exploit different features of the input frame through two branches. Following [8], the high-level feature is extracted using AlexNet [32]. Color histograms of the target and the search area are calculated as the low-level feature. The two score maps obtained via these two types of features are integrated so as to precisely predict the target position and scale change.

3.1. Score map generated using color histogram

The approach to generating a score map from color histograms is inspired by [15], yet we make some simplifications.
Fig. 2. Main architecture of the hybrid tracker
We first encode the pixels of the input frames into 10 bins and calculate the color histograms of the target area (denoted by $H_T$) and the search area (denoted by $H_S$). Now consider a single pixel $x \in S$, where $S$ is the search area, and denote the bin that $x$ belongs to as $b_x$. Then, following [15], we can calculate the probability of $x$ belonging to the target area using Bayes' rule:

$$P(x \in T \mid b_x) = \frac{P(b_x \mid x \in T)\,P(x \in T)}{P(b_x \mid x \in T)\,P(x \in T) + P(b_x \mid x \in B)\,P(x \in B)} \qquad (1)$$

where $T$ and $B$ denote the target (foreground) area and the background area, respectively. The individual probabilities can be calculated as follows:

$$P(b_x \mid x \in T) = \frac{H_T(b_x)}{|T|}, \quad P(b_x \mid x \in B) = \frac{H_S(b_x) - H_T(b_x)}{|S| - |T|}, \quad P(x \in T) = \frac{|T|}{|S|}, \quad P(x \in B) = \frac{|S| - |T|}{|S|} \qquad (2)$$

where $|\cdot|$ denotes the cardinality. Substituting (2) into (1) gives

$$P(x \in T \mid b_x) = \begin{cases} \dfrac{H_T(b_x)}{H_S(b_x)} & \text{if } H_S(b_x) \neq 0 \\[4pt] 0.5 & \text{if } H_S(b_x) = 0 \end{cases} \qquad (3)$$

The score map used in our method is a variant of (3):

$$S(x) = \begin{cases} \dfrac{H_T(b_x)}{H_S(b_x)} & \text{if } H_S(b_x) \neq 0 \\[4pt] 0.1 & \text{if } H_S(b_x) = 0 \end{cases} \qquad (4)$$
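For concreteness, the following is a minimal sketch of how such a color score map could be computed. It assumes a joint 10×10×10 RGB quantization (one possible reading of the 10-bin encoding) and NumPy as the implementation language; the function name and interface are ours, not the authors' released code.

```python
import numpy as np

def color_score_map(search_rgb, target_mask, n_bins=10, unseen_score=0.1):
    """Color-map branch in the spirit of Eqs. (1)-(4).

    search_rgb : (H, W, 3) uint8 image of the search area S.
    target_mask: (H, W) boolean mask, True inside the target bounding box T.
    Assumes a joint n_bins x n_bins x n_bins RGB quantization.
    """
    # Quantize every pixel into a joint color bin index b_x.
    q = (search_rgb.astype(np.int64) * n_bins) // 256
    bins = (q[..., 0] * n_bins + q[..., 1]) * n_bins + q[..., 2]

    total_bins = n_bins ** 3
    hist_s = np.bincount(bins.ravel(), minlength=total_bins)               # H_S
    hist_t = np.bincount(bins[target_mask].ravel(), minlength=total_bins)  # H_T

    # Eq. (4): S(x) = H_T(b_x) / H_S(b_x); unseen bins get a small constant.
    ratio = np.full(total_bins, unseen_score, dtype=np.float64)
    seen = hist_s > 0
    ratio[seen] = hist_t[seen] / hist_s[seen]

    return ratio[bins]  # (H, W) color score map over the search area
```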
We assign 0.1 to previously unseen pixel values in the score map instead of 0.5 as in [15], because we do not predict the target location and scale change from the score map generated with the color histogram; the score map here is more of an aid to the score map generated by the Siamese network. In fact, 0.1 is above the adjusting thresholds, and we choose it to avoid modifying the scores of unseen pixel values.

3.2. Integration of two score maps

The process of generating a score map with the Siamese network is as described in [8]. Simply put, we crop a search image and an exemplar image from the input frame and send them into the two branches of AlexNet. We then perform a convolution operation (which is in fact a correlation operation) between the two extracted feature maps and obtain a score map indicating the correlation between them.
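As a rough PyTorch sketch of this correlation step (our own minimal illustration of a standard SiamFC-style cross-correlation, not the authors' code), the exemplar features can serve as a grouped-convolution kernel slid over the search features:

```python
import torch
import torch.nn.functional as F

def siamfc_xcorr(exemplar_feat: torch.Tensor, search_feat: torch.Tensor) -> torch.Tensor:
    """Valid (no-padding) cross-correlation between exemplar and search features.

    exemplar_feat: (B, C, Hz, Wz) feature map of the exemplar image, used as the kernel.
    search_feat:   (B, C, Hx, Wx) feature map of the search image.
    Returns a (B, 1, Hx-Hz+1, Wx-Wz+1) score map.
    """
    b, c, h, w = search_feat.shape
    # Fold the batch into the channel dimension and use grouped convolution so that
    # every sample is correlated only with its own exemplar.
    out = F.conv2d(search_feat.reshape(1, b * c, h, w), exemplar_feat, groups=b)
    return out.reshape(b, 1, out.shape[-2], out.shape[-1])
```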
Fig. 3. The integration of the two maps: depending on how a pixel's color score compares with the thresholds Thi, Thr and Thd, its CNN-map score is increased, decreased, reset to zero, or remains unchanged.
Now we have two score maps which contain information from high-level and low-level features respectively, and we design a rule to integrate the information from both channels. To avoid ambiguity, we call the score map generated by the Siamese network the CNN map, and the score map generated with the color histogram the color map. The two maps are first upsampled to the same size using bicubic interpolation, and then the CNN map is adjusted according to the color map. In detail, we set three thresholds, Thi, Thd and Thr, specifying respectively which values in the CNN map should be increased, decreased or reset to zero. The score adjusting rule is shown in Fig. 3, where ∆i, ∆r and ∆d (∆d << ∆r) are the amplitudes of adjustment. Note that the search image in the Siamese branch is larger than the bounding box (for example, twice the size of the bounding box) to include some context information, while the target area used in the color histogram branch is strictly delimited by the bounding box. This prevents interference from the context color from harming the accuracy of the probability calculation. The effect of our integration method is displayed in Fig. 4, which shows that the responses produced by distractors are greatly suppressed.
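The sketch below shows one possible reading of this adjustment rule (a minimal illustration, not the authors' exact implementation): the band boundaries, the hard reset to zero, and the default values taken from Section 4.1 are our assumptions.

```python
import numpy as np

def adjust_cnn_map(cnn_map, color_map,
                   th_i=0.8, th_r=0.1, th_d=2e-3,
                   delta_i=1e-7, delta_d=1e-5):
    """Adjust the CNN map according to the color map (one reading of Fig. 3).

    Both maps are assumed to be resized to the same shape beforehand.
    Pixels with strong color support are boosted, pixels with weak support are
    slightly decreased, and pixels with almost no support are reset to zero;
    everything in between keeps its old score. (The paper also defines an
    amplitude delta_r for the reset branch; a hard reset is used here.)
    """
    out = cnn_map.copy()
    out[color_map >= th_i] += delta_i                         # increase
    out[(color_map < th_r) & (color_map >= th_d)] -= delta_d  # decrease
    out[color_map < th_d] = 0.0                               # reset to zero
    return out
```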
Fig. 4. Effect of our integration method: the patches in the upper left, upper right, lower left and lower right show the effective area (defined in Section 3.3.1) of the input image, the CNN map, the color map and the integrated map, respectively.
3.3. Techniques used in integration

3.3.1. Extraction of effective area

One thing worth noting is that we do not integrate the CNN map with the full scope of the color map. This is because the area that contains effective information (dubbed the effective area) differs between the two maps. We now explain this point in detail. As illustrated in [8], zero padding should not be involved in the convolution process of Siamese networks, in order to avoid learning a center-biased tracker. Hence, we follow this convention and exclude padding operations. One problem with this is that it can lead to a loss of edge information in the search image, as shown in Fig. 5.
Fig. 5. No-padding convolution: the edge information of the search image is discarded.
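To make the notion of the effective area concrete, the sketch below computes the spatial size of a padding-free (valid) convolution stack; the layer list mirrors the AlexNet backbone configuration of the original SiamFC and is our assumption, not taken from the paper.

```python
def valid_output_size(input_size, layers):
    """Spatial output size of a stack of padding-free conv/pool layers.

    layers: list of (kernel_size, stride) tuples.
    """
    size = input_size
    for kernel, stride in layers:
        size = (size - kernel) // stride + 1
    return size

# Conv/pool stack of the SiamFC AlexNet backbone as (kernel, stride) pairs
# (an assumption based on the original SiamFC configuration).
SIAMFC_LAYERS = [(11, 2), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1)]

print(valid_output_size(127, SIAMFC_LAYERS))  # exemplar 127x127 -> 6x6 feature map
print(valid_output_size(255, SIAMFC_LAYERS))  # search 255x255 -> 22x22 feature map
```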
In other words, only part of the information in the input image is learned by the convolutional network. We argue that this is one reason why the exemplar image should include some background in SiamFC. Note that if the target in the coming frame lies on the edge of the search image, the tracker behaves worse for the same reason. Adhering to the principle that the two maps to be integrated should contain the same amount of information, we crop the color map to a smaller size corresponding to the size of the effective area in the search image. A similar strategy is adopted in [13]. The difference is that [13] convolves with padding first and then crops out the padding-affected features using a CIR unit, whereas we exclude padding and abandon the convolution-affected area in the color map.

3.3.2. Score map denoising

As a low-level feature, the color histogram has limited discriminative ability. Background pixels that resemble target pixels in terms of color cannot be identified by the color map. Such noise pixels scatter over the entire search area. Although they do not tend to form large clusters, they still seriously affect the robustness of the tracker, because the prediction of the target location depends solely on the maximum value of the score map, following SiamFC. We apply Gaussian smoothing to the final score map to address this problem.

As shown in Fig. 6 (b), most of the response values of the distractor (marked by a green circle) are suppressed by the color map, yet the remainder (marked by a yellow circle) still misleads the tracker. This phenomenon can be avoided by Gaussian smoothing (Fig. 6 (a), (c)). Note that the visual effects in the different subfigures are not exactly the same, which is attributed to the change in the search area after adopting Gaussian smoothing in previous frames.
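A minimal sketch of the denoising step is given below, using SciPy's Gaussian filter; the kernel width sigma is an assumed value that is not reported in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def denoise_and_locate(score_map, sigma=2.0):
    """Suppress isolated noise responses in the final score map by Gaussian
    smoothing, then take the target location at the maximum response
    (the argmax rule inherited from SiamFC). sigma is an assumed value."""
    smoothed = gaussian_filter(score_map, sigma=sigma)
    peak = np.unravel_index(np.argmax(smoothed), smoothed.shape)
    return smoothed, peak
```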
Fig. 6. Gaussian smoothing for score map denoising: (a) smoothing effect, noise pixels are well suppressed; (b) before Gaussian smoothing, the response of the distractor is only partially suppressed and the tracker mistakes a distractor for the target; (c) after Gaussian smoothing, the response of the distractor is thoroughly suppressed.
Some of the tracking results of our approach are shown in Fig. 7.

4. Experiments

Experiments are conducted on two commonly used benchmarks: OTB2015 [33] and VOT2018. We display the reported performance of the various trackers except for SiamFC, for which we use a reproduced Python version for a fair comparison. The performance of our hybrid tracker and SiamFC is evaluated using the dataset interfaces provided by the GOT-10k [34] toolkit.

4.1. Implementation details

In order to emphasize the performance improvement resulting from our novel integration approach, we follow [8] for most experiment settings. Moreover, the thresholds and increments in Fig. 3 are set as: $Th_i = 8 \times 10^{-1}$, $Th_r = 1 \times 10^{-1}$, $Th_d = 2 \times 10^{-3}$, $\Delta i = 1 \times 10^{-7}$, $\Delta r = 5 \times 10^{-5}$, $\Delta d = 1 \times 10^{-5}$. We expand the search area to the full scope when the predicted target position is out of the boundary.

4.2. Results on OTB2015

The OTB2015 benchmark consists of 100 sequences, 24 of which are grayscale. It evaluates trackers' performance in terms of both a precision score and a success score. To be specific, the precision score is the proportion of successfully tracked frames, namely those in which the center distance between the predicted bounding box and the ground-truth bounding box is below a certain threshold; the center location error thresholds range from 0 to 50 pixels. The success
Fig. 7. The comparison between our hybrid tracker and the original SiamFC tracker.
Fig. 8. Comparison with SiamFC on the OTB2015 benchmark: (a) precision plot, (b) success plot.
score is the proportion of frames in which the IoU (Intersection-over-Union) between the predicted and ground-truth bounding boxes exceeds a given threshold; similar to the precision score, it is evaluated over a series of IoU thresholds ranging from 0.0 to 1.0. Since our tracker exploits the color feature of videos, we only evaluate our tracker on the color sequences. As shown in Fig. 8, our tracker achieves a 4.9% increase in precision score and a 2.7% increase in success score.

We also conduct an extended experiment, in which the tracking results are evaluated on all 100 sequences and the grayscale sequences are always evaluated with the original SiamFC tracker. In this case, there is an obvious rise in both curves, since the grayscale sequences are easier to track than the color ones. Such an extension does no harm to the fairness of the performance comparison, since our hybrid tracker can easily degrade into a vanilla SiamFC tracker simply by removing the color branch.

A further comparison with other trackers is shown in Fig. 9. The performance of SiamFC is a little lower than that reported in [8] for two reasons: (1) the experiments in [8] are conducted on the OTB2013 [35] benchmark; (2) we use a reproduced version of SiamFC (written in PyTorch). This has no influence on the fairness of the comparison, since our tracker is built on the same version with the same parameter settings.
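For reference, a small sketch of how OTB-style precision and success scores can be computed from per-frame center errors and IoU values follows; this is our own illustration (with the conventional 20-pixel precision threshold as an assumed default), not the benchmark toolkit code.

```python
import numpy as np

def precision_score(center_errors, threshold=20.0):
    """Fraction of frames whose center location error is below `threshold` pixels."""
    errors = np.asarray(center_errors, dtype=np.float64)
    return float(np.mean(errors <= threshold))

def success_score(ious, thresholds=np.linspace(0.0, 1.0, 21)):
    """Area under the success plot: the average, over overlap thresholds in [0, 1],
    of the fraction of frames whose IoU exceeds each threshold."""
    ious = np.asarray(ious, dtype=np.float64)
    return float(np.mean([np.mean(ious > t) for t in thresholds]))
```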
Fig. 9. Comparison with other trackers on the OTB2015 benchmark: (a) precision plot, (b) success plot.
4.3. Results on VOT2018

Different from the OTB2015 benchmark, the VOT2018 benchmark adopts a special reinitialization mechanism which reinitializes the tracker once the target is completely lost. Besides, the VOT2018 benchmark evaluates trackers in terms of accuracy and robustness. Here the accuracy score is the average IoU between the ground-truth bounding box and the predicted bounding box, and the robustness value is a weighted average of the failure counts over all types of challenges, including camera motion, illumination change, occlusion, size change and motion change. The experimental results are shown in Table 1 (the "Empty" entry in Table 2 refers to frames with no particular challenge annotation). It can be seen
that although our tracker has a slight drop in accuracy of 1.2%, it achieves a remarkable improvement in robustness (the lower the better) of 19.5%.

Table 1: Performance on the VOT2018 benchmark

            Accuracy    Robustness
  SiamFC    0.502       36.4
  Ours      0.496       29.3
As shown in Table 2, the original SiamFC tracker makes 132 failures over all 60 sequences, which is 29.4% more than ours (102 failures). Here a failure refers to a supervised re-specification of the target location when the intersection over union (IoU) between the predicted box and the ground-truth box is zero. Note that the total number of failures is not a simple summation of the individual items in the table, due to the co-occurrence of challenges. It can be seen that our hybrid tracker makes fewer mistakes than SiamFC under occlusion, size change, motion change and the normal (Empty) condition. We deduce the reason for the robustness improvement from the effect illustrated in Fig. 4: distractors with different color features are well suppressed, which prevents the tracker's focus from deviating. Within the target area, however, a few pixels which have low scores in the color map are also suppressed. This should be the main reason for the aforementioned drop in accuracy.

Table 2: Failure times under different challenges

            Camera motion  Illumination change  Occlusion  Size change  Motion change  Empty  Total
  SiamFC    44             2                    32         25           37             37     132
  Ours      47             3                    27         14           24             24     106
5. Conclusion

In this paper, we propose a hybrid tracker that explicitly exploits the low-level color feature and combines it with the high-level semantic feature learned by SiamFC. With techniques including effective area cropping and Gaussian smoothing, we adopt a novel approach to integrate the two score maps from different channels. Experiments on OTB2015 and VOT2018 validate the superiority of our hybrid tracker over the original SiamFC tracker.

Declaration of Interest Statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement

This work was supported in part by the National Natural Science Foundation of China (61973099), the Key Laboratory of Micro-systems and Micro-structures Manufacturing of Ministry of Education (HIT: NO.2017 KM008) and the 111 Project (B16014).

References

[1] A. Hampapur, L. Brown, J. Connell, A. Ekin, N. Haas, M. Lu, H. Merkl, S. Pankanti, Smart video surveillance: exploring the concept of multiscale spatiotemporal tracking, IEEE Signal Processing Magazine 22 (2) (2005) 38–51. doi:10.1109/MSP.2005.1406476.
[2] Y. Park, V. Lepetit, W. Woo, Multiple 3D Object tracking for augmented reality, in: 2008 7th IEEE/ACM International Symposium on Mixed and Augmented Reality, IEEE, 2008, pp. 117–120. doi:10.1109/ISMAR.2008.4637336.
[3] W. Luo, B. Yang, R. Urtasun, Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, 2018, pp. 3569–3577. doi:10.1109/CVPR.2018.00376.
[4] D. G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110. doi:10.1023/B:VISI.0000029664.99615.94.
[5] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Vol. 1, IEEE, 2001, pp. I-511–I-518. doi:10.1109/CVPR.2001.990517.
[6] T. Yang, A. B. Chan, Learning Dynamic Memory Networks for Object Tracking, in: Lecture Notes in Computer Science, 2018, pp. 153–169. doi:10.1007/978-3-030-01240-3_10.
[7] N. Wang, S. Li, A. Gupta, D.-Y. Yeung, Transferring rich feature hierarchies for robust visual tracking, arXiv preprint arXiv:1501.04587.
[8] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr, Fully-Convolutional Siamese Networks for Object Tracking, in: Lecture Notes in Computer Science, 2016, pp. 850–865. doi:10.1007/978-3-319-48881-3_56.
[9] R. Tao, E. Gavves, A. W. M. Smeulders, Siamese Instance Search for Tracking, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2016, pp. 1420–1429. doi:10.1109/CVPR.2016.158.
[10] D. Held, S. Thrun, S. Savarese, Learning to Track at 100 FPS with Deep Regression Networks, in: Lecture Notes in Computer Science, 2016, pp. 749–765. doi:10.1007/978-3-319-46448-0_45.
[11] B. Li, J. Yan, W. Wu, Z. Zhu, X. Hu, High Performance Visual Tracking with Siamese Region Proposal Network, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, 2018, pp. 8971–8980. doi:10.1109/CVPR.2018.00935.
[12] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, W. Hu, Distractor-Aware Siamese Networks for Visual Object Tracking, in: Lecture Notes in Computer Science, 2018, pp. 103–119. doi:10.1007/978-3-030-01240-3_7.
[13] Z. Zhipeng, P. Houwen, W. Qiang, Deeper and Wider Siamese Networks for Real-Time Visual Tracking, arXiv preprint arXiv:1901.01660.
[14] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, J. Yan, SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks, arXiv preprint arXiv:1812.11703.
[15] H. Possegger, T. Mauthner, H. Bischof, In defense of color-based model-free tracking, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2015, pp. 2113–2120. doi:10.1109/CVPR.2015.7298823.
[16] S. Fazli, H. M. Pour, H. Bouzari, Particle Filter Based Object Tracking with Sift and Color Feature, in: 2009 Second International Conference on Machine Vision, IEEE, 2009, pp. 89–93. doi:10.1109/ICMV.2009.47.
[17] F. Han, H. Dong, Z. Wang, G. Li, F. E. Alsaadi, Improved Tobit Kalman filtering for systems with random parameters via conditional expectation, Signal Processing 147 (2018) 35–45. doi:10.1016/j.sigpro.2018.01.015.
[18] C. Ma, J.-B. Huang, X. Yang, M.-H. Yang, Hierarchical Convolutional Features for Visual Tracking, in: 2015 IEEE International Conference on Computer Vision (ICCV), IEEE, 2015, pp. 3074–3082. doi:10.1109/ICCV.2015.352.
[19] D. Bolme, J. R. Beveridge, B. A. Draper, Y. M. Lui, Visual object tracking using adaptive correlation filters, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 2544–2550. doi:10.1109/CVPR.2010.5539960.
[20] G. Carneiro, J. C. Nascimento, The Fusion of Deep Learning Architectures and Particle Filtering Applied to Lip Tracking, in: 2010 20th International Conference on Pattern Recognition, IEEE, 2010, pp. 2065–2068. doi:10.1109/ICPR.2010.508.
[21] L. Wang, W. Ouyang, X. Wang, H. Lu, Visual Tracking with Fully Convolutional Networks, in: 2015 IEEE International Conference on Computer Vision (ICCV), IEEE, 2015, pp. 3119–3127. doi:10.1109/ICCV.2015.357.
[22] E. Shelhamer, J. Long, T. Darrell, Fully Convolutional Networks for Semantic Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4) (2017) 640–651. doi:10.1109/TPAMI.2016.2572683.
[23] T. Yang, A. B. Chan, Recurrent Filter Learning for Visual Tracking, in: 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), IEEE, 2017, pp. 2010–2019. doi:10.1109/ICCVW.2017.235.
[24] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, P. H. S. Torr, End-to-End Representation Learning for Correlation Filter Based Tracking, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 5000–5008. doi:10.1109/CVPR.2017.531.
[25] J. F. Henriques, R. Caseiro, P. Martins, J. Batista, High-Speed Tracking with Kernelized Correlation Filters, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3) (2015) 583–596. doi:10.1109/TPAMI.2014.2345390.
[26] N. Dalal, B. Triggs, Histograms of Oriented Gradients for Human Detection, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1, IEEE, 2005, pp. 886–893. doi:10.1109/CVPR.2005.177.
[27] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, A. Sorkine-Hornung, Learning Video Object Segmentation from Static Images, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 3491–3500. doi:10.1109/CVPR.2017.372.
[28] B. Babenko, M.-H. Yang, S. Belongie, Robust Object Tracking with Online Multiple Instance Learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (8) (2011) 1619–1632. doi:10.1109/TPAMI.2010.226.
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision 115 (3) (2015) 211–252. doi:10.1007/s11263-015-0816-y.
[30] E. Real, J. Shlens, S. Mazzocchi, X. Pan, V. Vanhoucke, YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 7464–7473. doi:10.1109/CVPR.2017.789.
[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common Objects in Context, in: Lecture Notes in Computer Science, 2014, pp. 740–755. doi:10.1007/978-3-319-10602-1_48.
[32] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, in: Advances in Neural Information Processing Systems, 2012.
[33] Y. Wu, J. Lim, M.-H. Yang, Object Tracking Benchmark, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9) (2015) 1834–1848. doi:10.1109/TPAMI.2014.2388226.
[34] L. Huang, X. Zhao, K. Huang, GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild, arXiv preprint arXiv:1810.11981.
[35] Y. Wu, J. Lim, M.-H. Yang, Online Object Tracking: A Benchmark, in: 2013 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2013, pp. 2411–2418. doi:10.1109/CVPR.2013.312.
Tianze Gao proposed the main ideas of the paper, designed the overall network architecture, wrote part of the experimental program and did the main part of paper writing.
Nan Wang did the main part of literature research, reconfirmed the proposed methods, conducted the contrastive experiments and did part of paper writing.
Jun Cai wrote part of the experimental program and sorted out the experimental data.
Weiyang Lin optimized the code and fixed some bugs.
Xinghu Yu configured the experimental environment and sorted out some of the experimental curves and graphs.
Jianbin Qiu reviewed the theoretical part of the paper and made some modifications.
Huijun Gao, as the corresponding author, funded the whole project and provided the experimental equipment.