J. Vis. Commun. Image R. 56 (2018) 1–14
Robust visual tracking via multi-feature response maps fusion using a collaborative local-global layer visual model

Haoyang Zhang a,b, Guixi Liu a,b,*, Zhaohui Hao a

a School of Mechano-Electronic Engineering, Xidian University, Xi'an, Shaanxi 710071, PR China
b Shaanxi Key Laboratory of Integrated and Intelligent Navigation, Xi'an, Shaanxi, PR China
Article history: Received 26 September 2017; Revised 7 July 2018; Accepted 21 August 2018; Available online 23 August 2018

Keywords: Collaborative visual model; Block color tracking; Correlation filter tracking; Response maps fusion; Online re-detection
Abstract

This paper addresses the issue of robust visual tracking, for which an effective tracker based on multi-feature fusion under a collaborative local-global layer visual model is proposed. In the local layer, we implement a novel block tracker using a structural local color histogram feature based on the foreground-background discrimination analysis approach. In the global layer, we implement a complementary correlation filters-based tracker using the HOG feature. Finally, the local and global trackers are linearly merged at the response map level. We choose different merging factors according to the reliability of each combined tracker, and when both of the combined trackers are unreliable, an online trained SVM detector is activated to re-detect the target. Experiments conducted on challenging sequences show that our final merged tracker achieves favorable tracking performance and outperforms several state-of-the-art trackers. In addition, the performance of the implemented block tracker is evaluated by comparison with some relevant color histograms-based trackers.

© 2018 Elsevier Inc. All rights reserved.
1. Introduction

Visual tracking is an important research topic for many computer vision tasks including robotics, surveillance, and human-computer interaction, to name a few [1]. Given the target specified in the initial frame, the core problem of visual tracking is to continuously estimate the best configuration of the target in the coming frames. Over the past decade, numerous tracking methods have been proposed and significant progress has been made, yet it is still challenging for an existing tracker to simultaneously deal with complicated situations such as illumination variations, occlusion and shape deformation [2,3]. To address this issue, a more comprehensive and robust tracker is needed.

Generally speaking, tracking methods can be classified as either generative or discriminative based on the appearance models used [3]. Generative methods [4–8] learn a generic appearance model according to the given templates or subspace models of the target, and tracking is accomplished by searching for the best matching score within the target region. Discriminative methods take advantage of the appearance information of both the target and the background, and treat tracking
as a binary classification process (known as tracking-by-detection [9]) that aims at distinguishing the target from the background [10–12]. Recently, discriminative correlation filters-based (DCFs) approaches have drawn extensive attention in the computer vision field and have been successfully applied to visual tracking [13–19]. One of the prominent merits that highlights DCFs-based visual tracking among other discriminative approaches is that DCFs are very efficient in the training and detection stages, as they can be transferred into the Fourier domain and operated with element-wise multiplication, which is of significance for real-time tracking.

Despite the development of visual tracking in model representation, such as sparse representation and correlation filters with some well-engineered features like HOG and CN (color names) [16], the previous work [20] still argued that trackers based on standard color representations can achieve competitive performance. Furthermore, in [21], Bertinetto et al. proposed the Staple (sum of template and pixel-wise learners) tracker by complementing color histograms-based models with DCFs, and showed that a simple linear combination of the response maps of these two models achieves remarkable success in visual tracking.

Following the seminal work of [15,20,21], in this paper we implement an effective tracking method by proposing a collaborative local-global layer tracking framework. Our local tracker is based on multiple overlapped local target patches, where each patch is represented by color histograms.
By analyzing the foreground-background discrimination of each patch, the weight of each patch can be determined. The global tracker, as a supplement to our local tracker based on the color feature, is a DCFs-based tracker in which the HOG feature is applied. The tracking results of the local and global trackers are combined at the response map level, and we implement a conditional fusion strategy by analyzing the reliability of each combined tracker. To achieve more robust tracking performance, an SVM classifier is trained and updated during tracking. The SVM detector is used both to determine the reliability of the local color tracker and to perform a re-detection process when both of the combined trackers are unreliable. The overall flowchart of our tracking method is shown in Fig. 1, and the main contributions of this paper are summarized as follows:

(1) The application of the foreground-background discrimination analysis method to weighting the different parts of the target, by which a structural local color model-based (SLC) tracker is proposed.
(2) A novel conditional fusion strategy to combine the two response maps from the SLC tracker and the global DCFs-based tracker, by which the final local-global confidence maps fusion (LGCmF) tracker is formulated.
(3) A thorough performance comparison of the proposed trackers with several state-of-the-art trackers based on experiments performed on 80 challenging sequences.

2. Related work

2.1. Color histograms-based visual model

Color histograms have been used in many visual tracking approaches [20–26]. One of the most important properties of color histograms is their insensitivity to shape variation, which is of significance for tracking non-rigid objects. An early implementation of color histograms in visual tracking is Meanshift tracking [22], where the target position is located by minimizing the Bhattacharyya distance of the color histograms between the target and the candidate area using the Meanshift iteration. Abdelali et al. [23] combined the Bhattacharyya kernel and the integral image as a similarity measure to find the image region most similar to the target. In [20,21,24–26], the histogram model of the target is applied to produce the backprojection map of the searching area, in which each value reflects the probability of the corresponding location belonging to the target. In particular, Possegger et al. [20] proposed a discriminative color model by analyzing the target distractors during the training and detection stages, from which the novel DAT (distractor-aware tracker) was formulated. Duffner et al. [25] applied the backprojection maps to segment the target from the background region, and visual tracking is accomplished by combining a local Hough-voting model.

Consider two regions $R_f$ and $R_s$, where the first corresponds to the foreground of the target and the second to its surroundings, and denote by $H_f$ and $H_s$ the color histograms of the two regions. For a pixel value $I(x)$ at location $x$ in frame $I$, the normalized likelihoods that $I(x)$ corresponds to the foreground and to the surroundings can be written as:
$$P(x|R_f) = \frac{H_f(I(x))}{|R_f|} \quad \text{and} \quad P(x|R_s) = \frac{H_s(I(x))}{|R_s|} \qquad (1)$$
Here $|R_f|$ denotes the number of pixels in the region $R_f$, and $H_f(I(x))$ represents the bin value of the pixel $I(x)$ in the histogram $H_f$. We denote by $C$ the target color model. The probability that a pixel at location $x$ belongs to the target $C$ can then be derived by Bayes' rule as follows [26]:
$$P(C|x) = \frac{P(x|R_f)\,P(R_f)}{P(x|R_f)\,P(R_f) + P(x|R_s)\,P(R_s)} \qquad (2)$$
The prior probabilities can be approximated as:
$$P(R_f) = \frac{|R_f|}{|R_f| + |R_s|} \quad \text{and} \quad P(R_s) = \frac{|R_s|}{|R_f| + |R_s|} \qquad (3)$$
Substituting Eqs. (1) and (3) into Eq. (2), we get:
$$P(C|x) = \frac{H_f(I(x))}{H_f(I(x)) + H_s(I(x))} \qquad (4)$$
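To make the backprojection of Eq. (4) concrete, the following sketch (an illustrative Python/NumPy fragment, not the authors' code; the 32-bins-per-channel quantization mirrors the setting reported later in Section 4.3) computes the per-pixel object likelihood map from foreground and surrounding histograms. The structural local model of Section 3.2 applies the same computation per patch.

```python
import numpy as np

def color_histogram(region, bins=32):
    """Joint RGB histogram of an (H, W, 3) uint8 region, `bins` bins per channel."""
    q = (region.astype(np.int64) * bins) // 256            # quantize each channel to [0, bins)
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    return np.bincount(idx.ravel(), minlength=bins ** 3).astype(np.float64)

def object_likelihood_map(image, hist_fg, hist_sur, bins=32):
    """Eq. (4): P(C|x) = H_f(I(x)) / (H_f(I(x)) + H_s(I(x))) for every pixel of `image`."""
    q = (image.astype(np.int64) * bins) // 256
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    hf = hist_fg[idx]                                       # H_f(I(x)) looked up per pixel
    hs = hist_sur[idx]                                      # H_s(I(x)) looked up per pixel
    return hf / (hf + hs + 1e-12)                           # small constant avoids 0/0

# Usage: the histograms come from the annotated foreground box and its surrounding ring.
# fg_hist = color_histogram(image[y0:y1, x0:x1])
# sur_hist = color_histogram(surrounding_ring(image, box))   # `surrounding_ring` is a hypothetical helper
# p_map = object_likelihood_map(search_region, fg_hist, sur_hist)
```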
2.2. Part-based visual models

Part-based appearance models address the local information of the object, and hence have proved to be particularly useful in dealing with partial occlusion and deformation [27–34]. Adam et al. [27] depicted the target template by arbitrary fragments; the possible states of the target in the current frame were voted for by each patch according to its histogram similarity with the corresponding target patch. Jia et al. [28] proposed the structural local sparse appearance model, where a novel alignment-pooling method was utilized to exploit the features of the target. Kwak et al. [29] divided the target into regular grid cells and employed a binary classifier to learn the occlusion status of different parts of the target. When the learned occlusion model is accurate, this method can efficiently handle occlusion as it obtains the specific occlusion state of the target, but training such a precise occlusion model requires enough training data, which is not always feasible in practice. Kwon et al. [30] successfully addressed the problem of appearance deformation by incorporating a flexible star-like model within a Bayesian filtering framework. However, this method focuses on the local appearance of the target while ignoring its holistic visual information, and hence it is prone to drift in scenes with motion blur and clutter. Some authors emphasize the different importance of each patch in part-based models. For example, Lee et al. [31] selected only pertinent patches that occur repeatedly near the center of the target to construct the foreground appearance model. Li et al. [32] voted for the target state using reliable patches that are selected according to the trackability and motion similarity of each patch. Recently, coupled-layer visual models have been applied to visual tracking [33,34]. In these trackers, the local and global information of the target appearance is taken into account simultaneously and combined in a coupled way, hence achieving more robust tracking performance compared to trackers using only local information.

2.3. Discriminative correlation filters with multi-channel features

The DCF presented in [15] learns a convolution filter from an image patch $x$ of $M \times N$ pixels extracted from the center of the target. The training samples are generated by all the cyclic shifts of $x$: $x(m,n),\ (m,n) \in \{0,\ldots,M-1\} \times \{0,\ldots,N-1\}$. Representing these samples by multi-channel features such as HOG, a multi-channel filter $f$ can be learned by minimizing the squared error cost function:
$$\epsilon(f) = \left\| \sum_{l=1}^{d} x^{(l)} \ast f^{(l)} - g \right\|^2 + \lambda \sum_{l=1}^{d} \left\| f^{(l)} \right\|^2 \qquad (5)$$
Fig. 1. The overall tracking flowchart of the proposed tracking method.

Here, $\ast$ denotes circular convolution, and $g$ is a scalar-valued function of size $M \times N$ whose values $g(w,h)$ represent the desired convolution response output corresponding to the training samples $x(w,h)$. $l \in \{1,\ldots,d\}$ is the feature dimension index of $x$, and $\lambda \ge 0$ is a regularization parameter. As the desired response $g$ is constructed as a Gaussian function, the coefficients of the learned filter $f$ are modeled by Gaussian ridge regression. Using Parseval's formula, Eq. (5) has an approximate solution in the Fourier domain as follows [15]:
$$F^{(l)} = \frac{\bar{G} \odot X^{(l)}}{\sum_{k=1}^{d} \bar{X}^{(k)} \odot X^{(k)} + \lambda} \qquad (6)$$
where the capital letters denote the discrete Fourier transforms (DFTs) of the corresponding functions, the bar (as in $\bar{G}$) represents complex conjugation, and the operator $\odot$ is point-wise multiplication. Target detection is accomplished by applying the filter $f$ to an image patch $z$ cropped in the new frame at the previous target position with the same size as $x$. The response map of $z$ can be calculated by:
$$y = \mathcal{F}^{-1}\left\{ \sum_{l=1}^{d} \bar{F}^{(l)} \odot Z^{(l)} \right\} \qquad (7)$$
where $\mathcal{F}^{-1}$ denotes the inverse DFT. The new target position then corresponds to the location of the peak value $y_m$ of $y$. Note that the issue of correlation filters with multi-channel features has been successfully addressed in previous work (MCCF) [18] for memory and computation savings. However, MCCF specializes in off-line detection and localization tasks with a large amount of training images. For on-line tracking, different from MCCF, we use a single image patch and separately update the numerator and denominator of our filter frame by frame, which is efficient in computation and robust in performance as well (a minimal code sketch of this pipeline is given at the end of this section).

2.4. Long term visual tracking

To obtain robust long-term tracking, it is necessary to take advantage of reliable target appearance information. Supancic et al. [35] address this by employing self-paced learning to select reliable history frames that extend the training set. Hong et al. [36] maintain a long-term target appearance memory using a reliable key point database built on top of short-term tracking. Zhang et al. [37] correct undesirable model updates by constituting an expert ensemble according to historical snapshots. Ma et al. [38] train and update a new regression model only with the most reliable tracking results, and cope with long-term tracking by performing a re-detection process whenever a tracking failure occurs.
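To illustrate the formulation of Eqs. (5)–(7), the following is a minimal NumPy sketch of single-patch filter training and detection in the Fourier domain. It is a simplified illustration rather than the DSST implementation: feature extraction (e.g. HOG) and cosine windowing are omitted, and the Gaussian label width is an assumed value.

```python
import numpy as np

def gaussian_label(M, N, sigma=2.0):
    """Desired response g of size M x N (Eq. (5)). In practice the peak is circularly
    shifted to the origin; here it is centred purely for readability."""
    cols, rows = np.meshgrid(np.arange(N) - N / 2.0, np.arange(M) - M / 2.0)
    return np.exp(-(rows ** 2 + cols ** 2) / (2.0 * sigma ** 2))

def train_filter(x, g, lam=1e-3):
    """Eq. (6): numerator A^(l) (one slice per channel) and shared denominator B,
    learned from a single training patch x of shape (M, N, d)."""
    X = np.fft.fft2(x, axes=(0, 1))
    G = np.fft.fft2(g)
    A = np.conj(G)[..., None] * X                   # conj(G) * X^(l)
    B = np.sum(np.conj(X) * X, axis=2).real + lam   # sum_k conj(X^(k)) * X^(k) + lambda
    return A, B

def detect(A, B, z):
    """Eq. (7): apply the filter to a new patch z of shape (M, N, d); the new target
    position corresponds to the peak of the returned response map."""
    Z = np.fft.fft2(z, axes=(0, 1))
    F = A / B[..., None]                            # F^(l) of Eq. (6)
    return np.fft.ifft2(np.sum(np.conj(F) * Z, axis=2)).real
```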
3. Proposed tracking method

3.1. Problem formulation and overview

Conventional color model-based trackers [20–26] that utilize holistic visual models have many limitations, especially when the target is partially occluded or undergoing drastic deformations, while part-based models can achieve better tracking performance as they treat each part of the target discriminatively. Therefore, in this paper we attempt to improve the performance of color model-based trackers by taking a local view.

During tracking, the trackability of each patch in the target region differs. For color histograms-based models, those patches that have high discrimination against the background are readily tracked. As Fig. 2 shows, patch 2 has a significantly different appearance from its surroundings, and hence its response map computed from Eq. (4) has an obvious peak, which means that this patch tends to be correctly tracked. The foreground-background discrimination analysis method is versatile in many computer vision tasks such as discriminative feature selection [24,39,40] and track quality estimation [41]. In this paper we employ this method to determine the weight of each patch of the target during tracking.

Color histograms are invariant to deformation and rotation yet sensitive to illumination; an effective way to address this problem is to supplement the color feature with gradient features such as HOG [21,40,42]. Among these methods, the recent multi-kernel correlation filter (MKCF) [19] offers a novel and efficient way to combine two features using different kernels. Nevertheless, MKCF is vulnerable to drastic deformation as the features in MKCF are all template-related (CN and HOG). Different from MKCF, the Staple [21] tracker combines the gradient and color histogram features at the response map level and obtains favorable performance, while it still shows inferior performance in some scenes such as drastic deformation, motion blur, background clutter and long-time occlusion (Fig. 3). The shortcomings of Staple can be summarized from two aspects. Firstly, the combined trackers in Staple, namely Staple_h (Staple with color histograms only) and DSST, both perform poorly individually, and this often causes inaccurate fusion results. Secondly, the fixed merging factor (0.3) in the Staple tracker is obtained from the overall performance evaluation; under a fixed merging factor, the Staple tracker may fail by putting too much weight on the unreliable tracker.
Fig. 2. From left to right: estimated target region (blue bounding box region), segmented patch (red bounding box region) and the response map of each patch derived from its color histograms model. The un-occluded patch 2 obtains the best result for its high discrimination against the surrounding area. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 3. Separate tracking results of the combined trackers in Staple (top row of the parallel sequences) and the proposed tracker (bottom row of the parallel sequences), including Staple_h, SLC and DSST. The results of Staple and our LGCmF are also labeled. Note that the re-detection of the online SVM is activated only in the sequence Jogging.
In Fig. 3, we separately label the tracking results of the individual trackers Staple_h, DSST and SLC, together with those of Staple and the proposed LGCmF tracker. Staple fails in the 655th frame of Basketball, the 27th frame of Deer, and the 84th frame of Jumping due to the failure of Staple_h or DSST in these frames. As for the sequence Jogging, both Staple_h and DSST fail after the target undergoes a complete occlusion; as a result of the fusion, Staple also fails. Our LGCmF tracker successfully tracks the target in the above four sequences, which can be attributed to three aspects: firstly, our SLC tracker achieves better performance than Staple_h, as Fig. 3a, b and c show; secondly, our conditional fusion strategy always attempts to emphasize the combined tracker that reliably tracks the target; thirdly, the online trained SVM classifier is activated when both of the combined trackers fail. For example, in the sequence Jogging, when the woman reappears at frame 80, the SLC and DSST trackers still lose the target, yet the SVM classifier re-detects the target, and this newly detected position is then set as the current tracking result.

3.2. Structural local color model-based (SLC) tracker

Given the target region $O$ in the current frame $I$, we sample a set of overlapped image patches $p^{(j)},\ j \in \{1,\ldots,K\}$, where $K$ is the number of patches. For a patch $p^{(j)}$, we denote its state as $S^{(j)} = \{c^{(j)}, s^{(j)}, \omega^{(j)}, o_f^{(j)}, o_s^{(j)}\}$, where the components $c^{(j)}$, $s^{(j)}$ and $\omega^{(j)}$ denote the coordinate, size, and weight of $p^{(j)}$, respectively, and $o_f^{(j)}$ and $o_s^{(j)}$ are the cropped foreground and surrounding regions of $p^{(j)}$. Commonly, the region $o_f^{(j)}$ can be the bounding box annotation centered around $c^{(j)}$ with a size equal to or smaller than $s^{(j)}$. Let $H_f^{(j)}$ and $H_s^{(j)}$ denote the color histograms of the regions $o_f^{(j)}$ and $o_s^{(j)}$, and let $I(x)$ denote the pixel value at location $x$ in image $I$. Then the likelihood that a pixel at location $x$ belongs to the patch $p^{(j)}$ can be calculated according to Eq. (4) as follows:

$$P(r^{(j)}|x) = \frac{H_f^{(j)}(I(x))}{H_f^{(j)}(I(x)) + H_s^{(j)}(I(x))} \qquad (8)$$
where $r^{(j)}$ denotes the visual model of $p^{(j)}$. After calculating the probability map $P(r^{(j)}|x)$, the response scores can be evaluated in a dense sliding-window search as:

$$R^{(j)}(h_j^{(i)}) = \frac{\sum_{x \in h_j^{(i)}} P(r^{(j)}|x)}{|h_j^{(i)}|} \qquad (9)$$

where $h_j^{(i)}$ is the $i$-th sliding window with the same size as patch $p^{(j)}$. Note that Eq. (9) can be efficiently calculated by employing the integral image, and hence the computation cost for each patch in formulating the response scores is very low. During tracking, we segment new patches over the estimated target region and evaluate the response map of each patch in the searching region according to its stored color histogram model. The new position of the $j$-th patch is estimated by searching for the peak value of its response map. Different from previous approaches that estimate the target location via a mere interpolation of each patch's estimates, in this paper we fuse all patches' response maps in a linearly weighted way:

$$R = \sum_{j=1}^{K} \omega^{(j)} R^{(j)} \qquad (10)$$

where $\omega^{(j)}$ is the weight of $p^{(j)}$, evaluated from two aspects: the variance ratio and the histogram similarity between the foreground and surrounding regions.

Variance Ratio (VR). VR was first employed in [24] to select the most discriminative color feature to distinguish the target from the background. To compute the VR, we first form the log likelihood ratio of the color histograms between the foreground and surrounding regions [24]:

$$L(i) = \log \frac{\max(H_f(i), \delta)}{\max(H_s(i), \delta)} \qquad (11)$$

where the factor $\delta$ is used to prevent dividing by zero or taking the log of zero. Eq. (11) maps the foreground/surrounding histograms to positive values for colors distinctive to the foreground region and to negative values for colors that are closely associated with the surrounding area. In [40], the log likelihood ratio $L(i)$ is limited to $[-1, 1]$ to get a more robust description:

$$L(i) = \max\left(-1, \min\left(1, \log \frac{\max(H_f(i), \delta)}{\max(H_s(i), \delta)}\right)\right) \qquad (12)$$

The VR of $L(i)$ with respect to the histograms $H_f$ and $H_s$ is computed by:

$$VR(L; H_f, H_s) = \frac{\mathrm{var}(L; (H_f + H_s)/2)}{\mathrm{var}(L; H_f) + \mathrm{var}(L; H_s) + \epsilon} \qquad (13)$$

where $\epsilon$ is used to avoid division by zero, and $\mathrm{var}(L; H)$ is the variance of $L(i)$ with respect to the histogram $H$, computed by:

$$\mathrm{var}(L; H) = \sum_i H(i) L^2(i) - \left[\sum_i H(i) L(i)\right]^2 \qquad (14)$$

Histogram similarity. A patch region is readily distinguished from its surroundings when they have little similarity in appearance. In this paper we use the Bhattacharyya distance to estimate the appearance similarity:

$$\rho(H_f, H_s) = \sum_{i=1}^{b} \sqrt{H_f(i) H_s(i)} \qquad (15)$$

where $\rho(H_f, H_s)$ indicates the Bhattacharyya distance between the histograms $H_f$ and $H_s$, and $b$ is the number of histogram bins.

Finally, the discrimination value of patch $p^{(j)}$ can be represented by:

$$d^{(j)} = \frac{VR(L^{(j)}; H_f^{(j)}, H_s^{(j)})}{\rho(H_f^{(j)}, H_s^{(j)}) + \epsilon} \qquad (16)$$

where $L^{(j)}$ is the log likelihood ratio of $p^{(j)}$. The weight of each patch is then calculated by:

$$\omega^{(j)} = d^{(j)} \Big/ \sum_{j=1}^{K} d^{(j)} \qquad (17)$$

To ensure the robustness of our tracker, during tracking we only track those patches whose discrimination values are larger than a threshold $\tau_{dis}$.

Fig. 4 shows the variation of each value during the foreground-background discrimination analysis in the sequence Woman, from which we can see that different patches have different discrimination values; an obvious contrast is between patches 2 and 8, where one is in the upper part of the body and the other is in the leg region.

We denote $c$ as the peak value of the locally merged response map $R$. The target position in the new frame then corresponds to the location of $c$ in $R$. During tracking, those patches close to the border of the target region always contain more background context. To suppress the effect of those patches, we introduce a penalty coefficient for each patch, and Eq. (10) can be rewritten as follows:

$$R = \sum_{j=1}^{K} \omega^{(j)} \xi^{(j)} R^{(j)} \qquad (18)$$

where $\xi^{(j)}$ is the penalty coefficient of patch $p^{(j)}$.
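The weighting procedure of Eqs. (11)–(17) can be sketched as follows (illustrative NumPy, not the authors' code; the histograms are assumed to be normalized to distributions, and the numerical guards correspond to the $\delta$ and $\epsilon$ values reported in Section 4.3).

```python
import numpy as np

def log_likelihood_ratio(hf, hs, delta=1e-3):
    """Eqs. (11)-(12): per-bin log likelihood ratio, clipped to [-1, 1]."""
    return np.clip(np.log(np.maximum(hf, delta) / np.maximum(hs, delta)), -1.0, 1.0)

def variance_ratio(L, hf, hs, eps=1e-6):
    """Eqs. (13)-(14): variance of L under (Hf+Hs)/2 over the sum of variances under Hf and Hs."""
    def var(L, h):
        p = h / (h.sum() + eps)                     # treat the histogram as a distribution
        return np.sum(p * L ** 2) - np.sum(p * L) ** 2
    return var(L, (hf + hs) / 2.0) / (var(L, hf) + var(L, hs) + eps)

def patch_weights(hist_fg, hist_sur, eps=1e-6):
    """Eqs. (15)-(17): discrimination value d^(j) and normalized weight w^(j) for each patch.
    hist_fg, hist_sur: lists of K foreground / surrounding histograms (one pair per patch)."""
    d = []
    for hf, hs in zip(hist_fg, hist_sur):
        L = log_likelihood_ratio(hf, hs)
        pf, ps = hf / (hf.sum() + eps), hs / (hs.sum() + eps)
        rho = np.sum(np.sqrt(pf * ps))                          # Bhattacharyya term, Eq. (15)
        d.append(variance_ratio(L, hf, hs) / (rho + eps))       # Eq. (16)
    d = np.asarray(d)
    return d, d / d.sum()                                       # Eq. (17)

# The local response map of Eq. (18) is then the penalized weighted sum of the K patch maps:
# R = sum(w[j] * xi[j] * resp[j] for j in range(K)), where xi[j] is the border penalty.
```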
3.3. Conditional fusion with the DCFs-based tracker

As mentioned before, to address the limitation of the color histogram feature, it is necessary to supplement our local color tracker with gradient features. Specifically, we merge the response maps of the aforementioned SLC tracker and the HOG-based DCF tracker (DSST). Unlike [21], which takes a fixed merging factor, in this paper we adjust the merging factor by analyzing the reliability of each combined tracker. We take the linear interpolation of the response maps of SLC and DSST as follows:

$$S = \beta R + (1 - \beta) y \qquad (19)$$

where $R$ and $y$ are the response maps of SLC and DSST, respectively. The factor $\beta$ is adaptively selected according to the reliability of each combined tracker. For DSST, its reliability can be estimated by the peak value $y_m$ of its response map $y$; as Fig. 5 shows, DSST generally tends to be reliable when $y_m$ is larger than a threshold $\tau_{dcf}$. For SLC, we take advantage of the online trained SVM classifier to evaluate its reliability: during tracking we apply the SVM classifier to the tracking result of SLC and denote this detection score as $d_{slc}$. Fig. 5 shows that the SLC tracker tends to be reliable if $d_{slc}$ is larger than a preset threshold $\tau_{slc}$. In the fusion process we first sequentially evaluate the reliability of the combined SLC and DSST trackers to select the merging factor. When both SLC and DSST are unreliable, a re-detection process based on the SVM classifier is performed by densely evaluating the candidates in the searching region. This re-detected result is adopted only when the maximum score $\max(d_{cand})$ is larger than a threshold $\tau_{svm}$, to ensure its correctness. When $\max(d_{cand}) \le \tau_{svm}$, we abandon the re-detected result and take the merged result with the merging factor set to $\beta_{slc}$.
Fig. 4. Top row: Plots of the Bhattacharyya distance, variance ratio of each patch among its foreground and surroundings in the sequence Woman. Bottom row: Plots of the discrimination and weights of each patch in the sequence Woman. For the sake of clarity, we only show the details of patches 2, 5, and 8.
In such cases the object is usually under partial occlusion or undergoing drastic appearance deformation, so we choose to trust the SLC tracker more, since its part-based appearance model with color histogram features handles these issues better than DSST, which adopts a holistic appearance model using the HOG feature. The conditional fusion process is summarized in Algorithm 1.
Algorithm 1. Conditional fusion of the methods
Require: SLC tracking result $p_{slc}$, DSST response map peak value $y_m$, stored SVM classifier, previous target position $p_{t-1}$.
Ensure: Current target position $p_t$.
1: Get the SVM detection score $d_{slc}$ based on $p_{slc}$.
2: if $d_{slc} \ge \tau_{slc}$ then
3:   Set $\beta = \beta_{slc}$ and get $p_t$ according to Eq. (19).
4: else
5:   if $y_m \ge \tau_{dcf}$ then
6:     Set $\beta = \beta_{dcf}$ and get $p_t$ according to Eq. (19).
7:   else
8:     Perform the dense sampling process around $p_{t-1}$ and get the detection scores $d_{cand}$ of all samples.
9:     if $\max(d_{cand}) \ge \tau_{svm}$ then
10:      $p_t = \arg\max_p(d_{cand})$
11:    else
12:      Set $\beta = \beta_{slc}$ and get $p_t$ according to Eq. (19).
13:    end if
14:  end if
15: end if

3.4. Model updating

Updating the target model is necessary for robust visual tracking; in our method we address this issue from three aspects, namely the color histograms, the correlation filters and the SVM classifier models. For the local target patches, we update their color histogram models only when they are clearly discriminated from the background. For each patch we update its color histogram model as follows:

$$h_{c,t}^{(j)} = \begin{cases} (1-\mu)\,h_{c,t-1}^{(j)} + \mu\, h_{c,new}^{(j)} & \text{if } d_t^{(j)} \ge \tau_{dis} \\ h_{c,t-1}^{(j)} & \text{otherwise} \end{cases} \qquad (20)$$

where $h_{c,t}^{(j)} \in \{h_{f,t}^{(j)}, h_{s,t}^{(j)}\}$ denotes the learned color histograms of the foreground and surrounding regions of patch $p^{(j)}$ in frame $t$, $d_t^{(j)}$ is the current discrimination value computed by Eq. (16), and $\tau_{dis}$ is the threshold on $d_t^{(j)}$. For the correlation filter, we separately update its numerator $A_t^{(l)}$ and denominator $B_t$ in Eq. (6) as follows:

$$A_t^{(l)} = \begin{cases} (1-\eta)\,A_{t-1}^{(l)} + \eta\, \bar{G} \odot X^{(l)} & \text{if } y_m \ge \tau_{dcf} \\ A_{t-1}^{(l)} & \text{otherwise} \end{cases} \qquad (21)$$

$$B_t = \begin{cases} (1-\eta)\,B_{t-1} + \eta \sum_{l=1}^{d} \bar{X}^{(l)} \odot X^{(l)} & \text{if } y_m \ge \tau_{dcf} \\ B_{t-1} & \text{otherwise} \end{cases} \qquad (22)$$

where $y_m$ is the peak value of the response map; the threshold $\tau_{dcf}$ ensures that updating is activated only when the current estimate is reliable. As for the SVM classifier, we update it only when $d_{slc} \ge \tau_{slc}$ or $y_m \ge \tau_{dcf}$, as in such cases the current tracking result is reliable. We take the positive and negative samples at the current estimated target location; the SVM classifier is then updated using these samples and the previous classifier coefficients via the passive-aggressive online learning algorithm [43].
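A compact sketch of the gated updates of Eqs. (20)–(22) is given below (a simplified illustration under the parameter values reported in Section 4.3; the SVM update via the passive-aggressive algorithm [43] is not shown).

```python
def update_models(hists, new_hists, d, A, B, A_new, B_new, y_m,
                  mu=0.04, eta=0.01, tau_dis=2.5, tau_dcf=0.3):
    """Reliability-gated updates of Section 3.4 (sketch, not the authors' code).

    hists, new_hists    : per-patch color histograms (stored / extracted in the current frame)
    d                   : per-patch discrimination values, Eq. (16)
    A, B / A_new, B_new : stored and newly computed DCF numerator / denominator terms, Eq. (6)
    y_m                 : peak of the current DCF response map
    """
    # Eq. (20): update a patch histogram only when the patch is well discriminated.
    for j, (h_old, h_new) in enumerate(zip(hists, new_hists)):
        if d[j] >= tau_dis:
            hists[j] = (1.0 - mu) * h_old + mu * h_new

    # Eqs. (21)-(22): update the correlation filter only when its response peak is reliable.
    if y_m >= tau_dcf:
        A = (1.0 - eta) * A + eta * A_new
        B = (1.0 - eta) * B + eta * B_new
    return hists, A, B
```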
3.5. Scaling

For part-based visual models, a common method to account for the target scale is to estimate the change of distance between patches across two adjacent frames [32], but this is feasible only when the number of correctly tracked patches is large. In this paper, considering the application of the DCFs, we use the method proposed in [15] to estimate the target scale by learning a separate 1-dimensional DCF over a scale pyramid. Let $M \times N$ be the target size in the current frame and $N_s$ denote the number of scales. For each $s \in \{a^n \mid n = \lfloor -(N_s-1)/2 \rfloor, \ldots, \lfloor (N_s-1)/2 \rfloor\}$, an image patch of size $sM \times sN$ centered around the target is extracted for the training of the 1-dimensional DCF. In the detection stage, we construct a new scale pyramid at the estimated target location and apply the correlation filter to these samples. The target scale then corresponds to the scale level with the largest response value. Readers may refer to [15] for more details.
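For reference, the scale pyramid described above can be generated as follows (a small sketch; the 33 scales and the 1.02 step are the values given in Section 4.3).

```python
import numpy as np

def scale_factors(num_scales=33, a=1.02):
    """Scale pyramid of Section 3.5: s = a**n for n = -(Ns-1)/2, ..., (Ns-1)/2."""
    n = np.arange(num_scales) - (num_scales - 1) // 2
    return a ** n

# For each factor s, an sM x sN patch centred on the target is resized to a fixed size,
# its features form one sample of the scale filter, and the separate 1-D DCF scores all
# samples; the estimated scale is the one with the largest filter response (see [15]).
```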
Fig. 5. Variation of the values $y_m$, $d_{slc}$ and $\max(d_{cand})$ in the sequence Jogging. These values obviously fall below certain thresholds when the target is lost during frames 68–81. After the target is re-detected at frame 85, all of these values increase to a relatively high level.
4. Experiments

In this section we first make a thorough performance evaluation of our local-global confidence maps fusion (LGCmF) tracker by comparing it with state-of-the-art trackers. Then we evaluate the effect of our block tracking strategy by comparing the proposed structural local color (SLC) tracker with the most relevant trackers based on color histograms or local visual models. Furthermore, as our method consists of three main components, namely the part-based appearance, multi-feature fusion and re-detection modules, we evaluate each component independently in the experiments to clarify its contribution to accurate visual tracking.

4.1. Experiment setup

We perform the experiments on 80 challenging video sequences from the tracker benchmark datasets [3]. The selected sequences cover 11 different challenging scenes: IV (illumination variation), SV (scale variation), OCC (occlusion), DEF (deformation), MB (motion blur), FM (fast motion), IPR (in-plane rotation), OPR (out-of-plane rotation), OV (out-of-view), BC (background clutter), and LR (low resolution). The performance of a tracker is evaluated by two criteria: CLE (center location error) and VOR (Pascal VOC overlap ratio) [44]. Here CLE is the average Euclidean distance between the ground-truth and the estimated target locations, and VOR is computed by:
$$VOR = \frac{\mathrm{area}(R_t \cap R_g)}{\mathrm{area}(R_t \cup R_g)} \qquad (23)$$
where $R_t$ and $R_g$ indicate the estimated bounding box and the ground-truth, respectively. Experiment results are reported by distance precision (DP) and success rate (SR). DP is the percentage of frames in a sequence whose center location error is smaller than a preset threshold, and SR is the relative number of frames whose VOR is greater than a threshold $t_h \in [0, 1]$. All evaluated trackers are ranked by the average scores of DP and SR over the tested sequences. Besides that, the experiment results of each tracker are also summarized using median values to get a more reasonable comparison. Additionally, the area under curve (AUC) scores [3] derived from the success plots are used as well to rank the performance of the compared trackers.
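The two criteria and the derived DP/SR scores can be computed as in the short sketch below (boxes given as (x, y, w, h); the 20-pixel and 0.5 thresholds are those used when reporting the tables).

```python
import numpy as np

def vor(box_a, box_b):
    """Eq. (23): Pascal VOC overlap ratio of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / (aw * ah + bw * bh - inter)

def precision_and_success(pred, gt, dp_thresh=20.0, sr_thresh=0.5):
    """DP: fraction of frames whose centre location error is below dp_thresh (pixels).
    SR: fraction of frames whose VOR exceeds sr_thresh."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    c_pred = pred[:, :2] + pred[:, 2:] / 2.0
    c_gt = gt[:, :2] + gt[:, 2:] / 2.0
    cle = np.linalg.norm(c_pred - c_gt, axis=1)
    overlaps = np.array([vor(p, g) for p, g in zip(pred, gt)])
    return np.mean(cle <= dp_thresh), np.mean(overlaps > sr_thresh)
```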
The proposed methods are mainly implemented in Matlab, assisted by some C functions¹ to extract the image features. The source codes of the compared trackers are provided by the authors themselves. Both our method and the compared methods are run on a desktop with a 64-bit Windows operating system under the fixed configuration of an AMD A10-5800K 3.80 GHz CPU with 4 GB RAM, and all parameters are fixed.

4.2. Initialization of the method

In the first frame, the target is initialized using a rectangular bounding box. Let M × N denote the initial target size. We extract the initial patches in the target region by taking the patch size and step length as (0.5M, 0.5N), and hence we get 3 × 3 overlapped patches. The background is initialized as an expanded region based on the target, with 0.5(M + N) as the increment of length and width. For each patch, its background region is obtained in the same way, while the foreground region is 0.8 times the size of the patch. The SVM classifier is initialized in the first frame by densely taking positive and negative samples in the searching region, which is 1.5 times the background size. The positive samples are selected as those regions whose overlap ratio with the target region is larger than 0.6, and for the negative samples this ratio is smaller than 0.2. Each sample is represented by the feature used in MEEM [37] based on the CIELab color space, and we use the function supplied by the VLFeat open source library² to train the initial SVM classifier. Fig. 6 shows the details of the initialization of the proposed method.

4.3. Implementation details and parameters setting

For the proposed SLC tracker, the target is divided into 3 × 3 regular overlapped patches. The number of color histogram bins is set to 32 per channel. The penalty coefficients ξ are set to 1 for the central patch and 0.7 for the remaining eight patches. The numerical guards are set as δ = 10⁻³ and ε = 10⁻⁶. The learning rate μ = 0.04. The threshold τ_dis for updating the color histogram models is set to 2.5/0.5 for color/gray images. For DSST, the cell size of the HOG feature is 4.
¹ https://github.com/pdollar/toolbox
² http://www.vlfeat.org/index.html
Fig. 6. (a) Initial target (red bounding box) and background (blue region). (b) Extraction of the patches. (c) The first patch (red bounding box) and its background (blue region) and foreground (yellow region). (d) Initialization of positive (green bounding boxes) and negative samples (pink bounding boxes) for training the SVM classifier. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
The regularization parameter λ = 10⁻³ and the learning rate η = 0.01. We use N_s = 33 scales with a scale factor of a = 1.02. The thresholds τ_slc, τ_dcf and τ_svm are set to 0, 0.3 and 0.5, respectively. The merging factors β_slc and β_dcf are set to 0.57 and 0.3.
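For clarity, the sketch below collects the reported parameter values in one place and shows one possible way to lay out the initial 3 × 3 overlapped patch grid of Section 4.2 (illustrative only: the original text gives the patch size as 0.5M × 0.5N, and a step of half a patch is assumed here so that a 3 × 3 overlapped grid covers the target).

```python
# Parameter values reported in Sections 4.2-4.3, collected for convenience.
PARAMS = dict(
    hist_bins=32, xi_center=1.0, xi_border=0.7,   # SLC: histogram bins, penalty coefficients
    delta=1e-3, eps=1e-6, mu=0.04,                # numerical guards, color-model learning rate
    tau_dis_color=2.5, tau_dis_gray=0.5,          # histogram update thresholds
    hog_cell=4, lam=1e-3, eta=0.01,               # DSST: HOG cell size, regularization, learning rate
    num_scales=33, scale_step=1.02,               # scale pyramid
    tau_slc=0.0, tau_dcf=0.3, tau_svm=0.5,        # reliability thresholds
    beta_slc=0.57, beta_dcf=0.3,                  # merging factors
)

def init_patch_grid(x, y, width, height):
    """3 x 3 overlapped patches over the target box (x, y, width, height): each patch is half
    the target size; a half-patch step (an assumption, see text) yields 50% overlap."""
    pw, ph = 0.5 * width, 0.5 * height
    return [(x + c * pw / 2.0, y + r * ph / 2.0, pw, ph) for r in range(3) for c in range(3)]
```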
4.4. Experiment 1: performance evaluation of the LGCmF tracker

We compare the proposed LGCmF tracker with 12 state-of-the-art trackers including DAT [20], Staple [21], MEEM [37], TGPR [45], KCF [14], SRdcf [17], LCT [38], RPT [32], SAMF [42], DSST [15], ASLA [28] and PPT [31]. Among them, DAT is based on the color histogram feature with a holistic visual model, which is most relevant to our local SLC tracker; Staple and SAMF employ complementary features such as the color feature and the HOG feature; RPT, ASLA and PPT utilize part-based visual models; and KCF, SRdcf, LCT, SAMF and DSST are correlation filters-based trackers. Besides, the trackers MEEM and TGPR are chosen as comparisons because of their competitive performance in VOT 2015 [46] and VOT 2016 [47].

Table 1 reports quantitative comparisons of the evaluated trackers based on the results of the one-pass evaluation (OPE) [3]. For the overall performance, Staple provides a mean SR score of 70.4% and achieves the best performance among the 12 compared trackers. Our approach obtains state-of-the-art results and outperforms Staple in the mean SR score by 4.4%. As for the 11 independent attribute-based datasets, our tracker performs best in 8 datasets including OCC, DEF, OPR, OV, SV, BC, IV and MB. In particular, our tracker makes obvious progress in the scenes of OCC, DEF, OPR and OV compared to the corresponding second-best trackers, with gains of 6.8%, 6.5%, 4.2% and 9.9%. For the datasets of FM and IPR, our method achieves the second-best results with mean SR scores of 66.1% and 69.1%. Even though Staple adopts complementary features in tracking, it still shows poor performance in low resolution scenes and obtains a mere 43.2%
Table 1. Experiment results of our method and 12 state-of-the-art trackers. The mean SR (%) scores at the threshold of 0.5 for each tracker over all 80 tested challenging sequences and the 11 attribute-based datasets are reported. The best two results are shown in red and blue.
mean SR score. Our method, which adopts part-based visual models and the modified fusion strategy, improves this score to 54.1% and meanwhile shows competitive performance against the other compared trackers.

Fig. 7 shows the precision and success plots of our LGCmF tracker and the compared trackers over all and the 11 attribute-based datasets. Our approach provides superior results compared to the
12 state-of-the-art trackers. In particular, the large margins in both the precision and success plots of Fig. 7b and e show that LGCmF has a distinct advantage in dealing with occlusion and deformation. Table 2 reports the median CLE (pixels), mean VOR (%) and average speed (fps) of each tracker over all tested sequences.
Fig. 7. Precision and Success plot in overall tested sequences and attributes-based datasets. The number suffixed in each subtitle indicates the size of the corresponding dataset. The average DP score at the threshold of 20 pixels and the AUC score of success plot are reported in the legends.
Table 2 Quantitative comparison of our method against 12 state-of-the-art trackers. The best two results are shown in red and blue.
Our method obtains the best result in Med CLE with a value of 7.27 pixels and shows competitive results in Med VOR with a value of 64.1%. Considering the fairness of the time comparisons, we list the implementation environment of each method in the Code row (M: Matlab, C: C/C++, MC: mixture of Matlab and C/C++). Among all evaluated trackers, only PPT is purely implemented in C, running at 6.60 fps. Our non-optimized Matlab code runs at 5.42 fps, yet obtains much better tracking precision than PPT.

4.5. Qualitative evaluation

Fig. 8 visualizes the tracking results of our LGCmF tracker and the 12 compared state-of-the-art trackers with different
colors and lines, respectively. Our tracker achieves favorable results in all sequences, which cover all kinds of challenges in visual tracking.

Occlusion. In the sequence Box, the target is gradually occluded from the 449th frame. When it reappears in the scene at frame 505, most of the trackers lose the target; only our method, MEEM, SAMF and PPT track it reliably, while our method achieves the best results in both position and scale. Another example where occlusion is the main challenge is Jogging. After a short complete occlusion, most of the trackers lose the target; only our method, MEEM, SRdcf, SAMF, LCT and PPT capture the target again
Fig. 8. A visualization of the tracking results of all evaluated trackers in challenging sequences. The main challenges of each sequence are also listed.
when the target reappears at frame 82. As mentioned before, the reason that our method can re-detect the target here can be attributed to the application of the online re-detection mechanism.

Background clutter. In Basketball, the presence of players wearing basketball clothes of the same color makes background clutter one of the main challenges over the entire sequence. Among the compared trackers, SRdcf fails from the beginning. At frame 673, Staple and ASLA drift to the distractors; the other trackers stick to the target, yet many of them produce inaccurate results. Our method obtains relatively better results during the whole tracking period. In Board, the target undergoes background clutter as well throughout the sequence. At frame 150, DAT, TGPR, MEEM and LCT lose the target because of the background clutter and motion blur. At frame 675, many failed trackers re-acquire the target, yet five trackers including DAT, Staple, KCF, SRdcf and DSST still lose it; our method accurately tracks the target over the whole tracking period.

Deformation. In Skating2, the target undergoes drastic deformation from the beginning. At frame 233, all of the compared trackers obtain inaccurate estimates; trackers including ASLA, LCT, DAT and Staple even lose the target. At frame 385, KCF and DSST drift to the distractors, and Staple tracks the target with a wrong scale estimate. At frame 415, Staple, SRdcf, PPT and KCF stick to the target, yet our method shows the best tracking result compared to them. In Bolt2, the target also undergoes drastic deformation. Many trackers lose the target from the beginning; only our method, Staple, RPT and DAT successfully track it. Staple and RPT get inaccurate estimates in the 100th frame and DAT loses the target at frame 226, while our method obtains favorable results in the whole tracking process.

Scale variation. In CarScale, the scale of the car varies constantly from beginning to end. The trackers MEEM, DAT, KCF, TGPR and PPT cannot handle the scale variation, and hence they obtain inaccurate tracking results at frame 172; MEEM even loses the target completely. At frame 218, our method and Staple get correct estimates in location and scale, while the other trackers such as RPT, DSST, SRdcf, SAMF, LCT and ASLA all focus on the head of the car. At the end of the sequence, Staple drifts to the head of the car, while our method still accurately tracks the target.

Low resolution. The tracking results in the sequences Panda, Surfer and Freeman4 show that our method is highly robust in dealing with low resolution scenes. In Panda, SRdcf, RPT, KCF and DSST lose the target at frame 550, and more trackers fail when the panda passes the tree. In Surfer, LCT, KCF, DSST, ASLA and Staple lose the target in the 154th frame; MEEM and RPT drift to the wrong location at frame 215. At the end of the sequence, SRdcf, SAMF and PPT get inaccurate target locations. Our method successfully tracks the target in all of the above sequences.
4.6. Comparisons with deep learning based trackers

We compare the overall performance of our method with 5 state-of-the-art deep learning based trackers including HCF [48], HDT [49], LCCFdeep [50], DLT [51] and CNT [52]. All evaluated trackers run on a common machine without the hardware acceleration of GPU computation (see Table 3). Compared to the five deep learning based trackers, our tracker performs best in SR and Med VOR with values of 62.1% and 64.1%. In terms of DP and Med CLE, our tracker outperforms CNT and DLT, and also achieves competitive results compared to LCCFdeep, HDT and HCF, in which CNN features based on VGG-Net are used. More remarkably, all of the deep learning based trackers suffer from a heavy computation load compared to our method.

4.7. Experiment 2: performance evaluation of the SLC tracker

In this experiment we aim to show the improvement of color histograms-based tracking brought by our block tracking strategy SLC. We choose some trackers that are relevant to the SLC tracker as comparisons, including DAT [20], Staple_h [21], KMS [22] and PPT [31]. Among them, Staple_h is the part of Staple that only utilizes the color histograms. DAT, Staple_h and KMS all employ holistic visual models, while PPT attempts to find the pertinent parts to track, which is similar to the idea of our SLC tracker.

Fig. 9 shows the overall precision and success plots on the tested sequences. The most relevant methods to our SLC tracker are the Staple_h and DAT trackers. Staple is merged from Staple_h and DSST; however, from Fig. 9 we can find that the individual performance of Staple_h is unsatisfactory in the overall evaluation. DAT has a similar tracking framework to Staple_h, while DAT performs better by analyzing the distractors of the target during tracking. Our block tracking method SLC significantly improves on DAT and Staple_h, with gains of 15.1% and 18.9% in DP score, and gains of 8.4% and 12.1% in SR score. PPT is a novel color histograms-based visual tracking algorithm that uses part-based appearance models; our method outperforms PPT with gains of 4.9% and 2.5% in DP and SR score, respectively.

Table 4 shows the average center location errors of each tracker on some challenging sequences. The proposed block tracker SLC obtains the best results on these sequences. Note that in the sequences Box and Woman, occlusion is the main challenge; only our method and PPT successfully track the target, owing to the block tracking strategy. Fig. 10 visualizes the tracking results of the compared trackers on several challenging sequences.
Table 3 Overall performance comparisons with deep learning based trackers. We report the average DP (%) at 20 pixels and mean SR with threshold of 0.5. The Med CLE (pixel), VOR (%) and speed (fps) are shown as well.
Fig. 9. Comparison of the color histograms-based trackers according to the experiment results conducted on all tested sequences. The legend shows the ranking of each tracker in terms of DP score (20 pixels) and SR score (AUC). SLC obtains the best result among the compared trackers. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Table 4 Average center location errors (pixels) on challenging sequences. The best two results are shown in red and blue.
SLC can handle challenging scenes such as illumination variation (CarDark, Fish), scale variation (CarScale), occlusion (Woman), background clutter (Car2, Human7), motion blur (BlurBody) and low resolution (Freeman1), etc. In particular, none of the compared trackers handles the illumination variation well, while SLC accurately tracks the target in this circumstance. PPT performs well under occlusion yet still shows an inferior result in CarScale, as it is not designed to cope with the scale issue.

4.8. Experiment 3: contributions evaluation of each component

The utilized components, namely the re-detection module, the part-based appearance model and the multiple feature fusion, are related to three tracker modules in our method: the online SVM detector, the SLC tracker and DSST. To clarify the contributions of each component, we perform experiments by removing one of the related tracker modules from our LGCmF tracker. The individual performance of each tracker module is also evaluated.

From Fig. 11 we notice that removing the SVM detector from LGCmF results in a reduction of 5% in DP and 3.6% in SR, yet removing the SLC tracker or DSST causes a more obvious decrease of the two values. In particular, compared to the other two tracker
modules, removing DSST causes the most distinct performance degradation of 21.1% in DP score and 18.7% in SR score. The above analysis means that, among the three merged components, the re-detection module contributes the least to accurate tracking, while the fusion of the HOG feature contributes the most. Note that although the SLC tracker contributes less than DSST in the overall dataset evaluation, it can still address specific situations. As Table 5 shows, SLC outperforms DSST in eight datasets with different attributes, which means that in these datasets the part-based visual model and the color histograms used in SLC contribute much to the accurate tracking of LGCmF. The individual SVM classifier obtains unsatisfactory results, yet by interacting with the other two tracker modules, it still accurately detects the target in challenging scenes.

5. Conclusion

An effective tracker based on a collaborative local-global layer visual model is proposed in this paper. In the local layer we implement an effective block tracker, SLC, that is based on the traditional color histogram feature and part-based visual models.
Fig. 10. A qualitative comparison of the evaluated trackers. The main challenges of each sequence are suffixed in the bracket.
Fig. 11. Experiment results of different cases over all tested sequences. The DP (20 pixels) and SR (AUC) of each case are reported in the legends.
Table 5 Individual performance of related tracker modules over all tested sequences and 11 attributes-based datasets. The average DP scores (%) at the threshold of 20 pixels are reported. The best results are shown in red.
We discriminate each part of the target according to its foreground-background discrimination score. The local layer response map is the linear weighted combination of the response maps of all patches. To complement the deficiency of color histograms, in the global layer we design a supplementary correlation filters-
based tracker using the HOG feature. The final tracker, LGCmF, is formulated by linearly combining the response maps of the local and global trackers. During the fusion of the local and global layers, we design a conditional fusion strategy to choose different merging factors according to the reliability of each combined tracker, and
when both of the combined trackers are unreliable, an online trained SVM classifier is activated to re-detect the target in a large searching area. Experiments are conducted on 80 challenging benchmark sequences, and both quantitative and qualitative analyses are made. The experiment results show that the proposed LGCmF tracker obtains favorable tracking performance and outperforms several state-of-the-art trackers on the overall and eight attribute-based datasets. Moreover, the superiority of our block tracking method SLC is justified by comparison with some relevant trackers that use color histograms or a block strategy. The contributions of each involved component of LGCmF are also summarized through the component-removal experiments.
Acknowledgement This work is supported by the Foundation of Preliminary Research Field of China under Grant No. 6140001010201, the National Key Research and Development Program Strategic High Technology Special Focus under Grant No. H863-031, and the Open Foundation of Shaanxi Key Laboratory of Integrated and Intelligent Navigation.
References [1] A. Yilmaz, O. Javed, M. Shah, Object tracking: a survey, ACM Comput. Surv. 38 (4) (2006) 1–45. [2] A.W. Smeulders, D.M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, M. Shah, Visual tracking: an experimental survey, IEEE Trans. Pattern Anal. Mach. Intell. 36 (7) (2014) 1442–1468. [3] Y. Wu, J. Lim, M.-H. Yang, Object tracking benchmark, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1834–1848. [4] M.J. Black, A.D. Jepson, Eigentracking: robust matching and tracking of articulated objects using a view-based representation, Int. J. Comput. Vis. 26 (1) (1998) 63–84. [5] M. Yang, Y. Wu, G. Hua, Context-aware visual tracking, IEEE Trans. Pattern Anal. Mach. Intell. 31 (7) (2009) 1195–1209. [6] J. Kwon, K.M. Lee, Visual tracking decomposition, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1269–1276. [7] D. Ross, J. Lim, R.-S. Lin, M.-H. Yang, Incremental learning for robust visual tracking, Int. J. Comput. Vis. 77 (1) (2008) 125–141. [8] X. Mei, H. Ling, Robust visual tracking using l1 minimization, in: IEEE International Conference on Computer Vision, 2009, pp. 1436–1443. [9] Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, IEEE Trans. Pattern Anal. Mach. Intell. 34 (7) (2012) 1409–1422. [10] H. Grabner, M. Grabner, H. Bischof, Real-time tracking via on-line boosting, in: Proceedings of the British Machine Vision Conference, vol. 1, 2006. [11] B. Babenko, M.-H. Yang, S. Belongie, Robust object tracking with online multiple instance learning, IEEE Trans. Pattern Anal. Mach. Intell. 33 (8) (2011) 1619–1632. [12] S. Hare, A. Saffari, P.H.S. Torr, Struck: Structured output tracking with kernels, in: IEEE International Conference on Computer Vision, 2011, pp. 263–270. [13] D.S. Bolme, J.R. Beveridge, B.A. Draper, Y.M. Lui, Visual object tracking using adaptive correlation filters, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2544–2550. [14] J.F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell. 37 (3) (2015) 583–596. [15] M. Danelljan, G. Hager, F.S. Khan, M. Felsberg, Accurate scale estimation for robust visual tracking, in: Proceedings of the British Machine Vision Conference, 2014. [16] M. Danelljan, F.S. Khan, M. Felsberg, J.v.d. Weijer, Adaptive color attributes for real-time visual tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1090–1097. [17] M. Danelljan, G. Hager, F.S. Khan, M. Felsberg, Learning spatially regularized correlation filters for visual tracking, in: IEEE International Conference on Computer Vision, 2015, pp. 4310–4318. [18] H.K. Galoogahi, T. Sim, S. Lucey, Multi-channel correlation filters, in: IEEE International Conference on Computer Vision, 2013, pp. 3072–3079. [19] M. Tang, J. Feng, Multi-kernel correlation filter for visual tracking, in: IEEE International Conference on Computer Vision, 2015, pp. 3038–3046. [20] H. Possegger, T. Mauthner, H. Bischof, In defense of color-based model-free tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2113–2120. [21] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, P.H.S. Torr, Staple: complementary learners for real-time tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1401–1409.
[22] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. 25 (5) (2003) 564–577. [23] H.A. Abdelali, F. Essannouni, L. Essannouni, D. Aboutajdine, Fast and robust object tracking via accept-reject color histogram-based method, J. Vis. Commun. Image Rep. 34 (2016) 219–229. [24] R.T. Collins, Y.X. Liu, M. Leordeanu, Online selection of discriminative tracking features, IEEE Trans. Pattern Anal. Mach. Intell. 27 (10) (2005) 1631–1643. [25] S. Duffner, C. Garcia, Pixeltrack: a fast adaptive algorithm for tracking nonrigid objects, in: IEEE International Conference on Computer Vision, 2013, pp. 2480–2487. [26] C. Bibby, I. Reid, Robust real-time visual tracking using pixel-wise posteriors, in: European Conference on Computer Vision, 2008, pp. 831–844. [27] A. Adam, E. Rivlin, I. Shimshoni, Robust fragments-based tracking using the integral histogram, in: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2006, pp. 798–805. [28] X. Jia, H. Lu, M.-H. Yang, Visual tracking via adaptive structural local sparse appearance model, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1882–1829. [29] S. Kwak, W. Nam, B. Han, J. H. Han, Learning occlusion with likelihoods for visual tracking, in: IEEE International Conference on Computer Vision, 2011, pp. 1551–1558. [30] J. Kwon, K.M. Lee, Highly nonrigid object tracking via patch-based dynamic appearance modeling, IEEE Trans. Pattern Anal. Mach. Intell. 35 (10) (2013) 2427–2441. [31] D.Y. Lee, J.Y. Sim, C.S. Kim, Visual tracking using pertinent patch selection and masking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3486–3493. [32] Y. Li, J. Zhu, S.C. Hoi, Reliable patch trackers: robust visual tracking by exploiting reliable patches, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 353–361. [33] L. Cehovin, M. Kristan, A. Leonardis, Robust visual tracking using an adaptive coupled-layer visual model, IEEE Trans. Pattern Anal. Mach. Intell. 35 (4) (2013) 941–953. [34] O. Akin, E. Erdem, A. Erdem, K. Mikolajczyk, Deformable part-based tracking by coupled global and local correlation filters, J. Vis. Commun. Image Rep. 38 (2016) 763–774. [35] J.S. Supancic III, D. Ramanan, Self-paced learning for long-term tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2379– 2386. [36] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, D. Tao, Multi-store tracker (muster): a cognitive psychology inspired approach to object tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 749–758. [37] J. Zhang, S. Ma, S. Sclaroff, Meem: robust tracking via multiple experts using entropy minimization, in: European Conference on Computer Vision, vol. 6, 2014, pp. 188–203. [38] C. Ma, X. Yang, C. Zhang, M.-H. Yang, Long-term correlation tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5388– 5396. [39] H.T. Nguyen, A.W.M. Smeulders, Robust tracking using foregroundbackground texture discrimination, Int. J. Comput. Vis. 69 (3) (2006) 277–293. [40] Z. Li, S. He, M. Hashem, Robust object tracking via multi-feature adaptive fusion based on stability: contrast analysis, Vis. Comput. 31 (10) (2015) 1319– 1337. [41] O.U. Khalid, A. Cavallaro, B. Rinner, Detecting tracking errors via forecasting, in: Proceedings of the British Machine Vision Conference, 2016. [42] Y. Li, J. 
Zhu, A scale adaptive kernel correlation filter tracker with feature integration, in: European Conference on Computer Vision Workshops, 2014, pp. 254–265. [43] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, Y. Singer, Online passiveaggressive algorithms, J. Mach. Learn. Res. 7 (2006) 551–585. [44] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338. [45] J. Gao, H. Ling, W. Hu, J. Xing, Transfer learning based visual tracking with gaussian processes regression, in: European Conference on Computer Vision, vol. 3, 2014, pp. 188–203. [46] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, et al., The visual object tracking vot2015 challenge results, in: IEEE International Conference on Computer Vision Workshops, 2015, pp. 564–586. [47] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, et al., The visual object tracking vot2016 challenge results, in: IEEE International Conference on Computer Vision Workshops, 2016, pp. 777–823. [48] C. Ma, J. -B. Huang, X. Yang, M. -H. Yang, Hierarchical convolutional features for visual tracking, in: IEEE International Conference on Computer Vision, 2015, pp. 3074–3082. [49] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, M. -H. Yang, Hedged deep tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4303–4311. [50] B. Zhang, S. Luan, C. Chen, J. Han, W. Wang, A. Perina, L. Shao, Latent constrained correlation filter, IEEE Trans. Image Process. 27 (3) (2018) 1038– 1048. [51] N. Wang, D. Y. Yeung, Learning a deep compact image representation for visual tracking, in: International Conference on Neural Information Processing Systems, 2013, pp. 809–817. [52] K. Zhang, Q. Liu, Y. Wu, M.-H. Yang, Robust visual tracking via convolutional networks without training, IEEE Trans. Image Process. 25 (4) (2016) 1779– 1792.