Neurocomputing 339 (2019) 139–148
Robust visual tracking via Laplacian Regularized Random Walk Ranking

Bo Jiang, Yuan Zhang, Jin Tang, Bin Luo, Chenglong Li∗

School of Computer Science and Technology, Anhui University, Hefei, China
Article info

Article history: Received 26 January 2018; Revised 13 November 2018; Accepted 29 January 2019; Available online 14 February 2019. Communicated by Dr. Zhu Jianke.

Keywords: Visual tracking; Laplacian regularization; Random walk; Structured SVM
Abstract

Visual tracking is a fundamental and important problem in computer vision and pattern recognition. Existing visual tracking methods usually localize the visual object with a bounding box. Recently, learning patch-based weighted features has been demonstrated to be an effective way to mitigate the background effects in the target bounding box description, and can thus improve tracking performance significantly. In this paper, we propose a simple yet effective approach, called Laplacian Regularized Random Walk Ranking (LRWR), to learn more robust patch-based weighted features of the target object for visual tracking. The main advantages of our LRWR model over existing methods are: (1) it integrates both local spatial and global appearance cues simultaneously, and thus leads to a more robust solution for patch weight computation; (2) it has a simple closed-form solution, which makes our tracker efficient. The learned features are incorporated into the structured SVM to perform object tracking. Experiments show that our approach performs favorably against state-of-the-art trackers on two standard benchmark datasets. © 2019 Elsevier B.V. All rights reserved.
1. Introduction

Visual object tracking is an important and fundamental problem in computer vision and pattern recognition [1–8]. It has wide applications such as video surveillance, robotics, human-computer interaction, and medical image analysis [1,2]. Recent years have witnessed rapid advances in visual tracking, but it remains a challenging task, partly due to the large changes of object appearance caused by many challenging factors such as pose variation, illumination change, deformation and occlusion. Most visual tracking methods adopt the tracking-by-detection paradigm, which conducts target tracking by classifying the target object against its background from frame to frame. The key issue for this kind of method is how to maintain a classifier during the tracking process. Usually, the classifier is trained in the first frame using the ground-truth bounding box, and updated in the subsequent frames using the tracking results. However, the bounding box cannot describe the target object accurately under irregular object shapes, scale variations and occlusions, and the trackers will be disturbed by the
introduced background information, which makes the tracker run the risk of model drift. In order to overcome the above challenges, many efforts have been devoted to alleviating the undesirable effects of background information [3–7,9–16]. For example, some methods [4–6,9] update the object classifiers by further considering the distance of each candidate bounding box to the bounding-box center and assigning higher weights to candidates that are close to the center. This setting is unreliable when the object shape is irregular. The methods proposed in [17–19] conduct object segmentation during tracking to exclude background information, but their results are usually affected by the unreliable segmentation process. The methods proposed in [20,21] use sparse coding to build a discriminative appearance model; these methods are limited in dealing with cluttered backgrounds, which may lead to bad segmentation results. To improve robustness, Kim et al. [10] recently proposed to represent the tracked object with an image-patch-based 8-neighbor graph, in which two nodes are connected by an edge if they are 8-neighbors, and the edge weight is computed from their low-level feature distance. However, this graph only considers spatial neighbors and cannot capture the intrinsic global relationship among patches. A dynamic graph learning approach is proposed by Li et al. [22] to make the best
Fig. 1. Flowchart of the proposed tracking algorithm. Our tracking algorithm contains four main parts: Patch generation, LRWR weight computation, Multi-scale weighted descriptor for each bounding box and structured SVM localization.
use of the relationship among patches, but it ignores the local cues in graph learning, and its optimization is also complicated, resulting in long latency. To handle the above issues, we aim to learn a robust object representation for objects with deformation and partial occlusion, so as to enable effective visual tracking. In general, our tracking approach contains three main steps. First, we partition the target bounding box into a set of non-overlapping image patches, which are described with color and gradient histograms. Then, to mitigate the effects of noisy background patches, we associate each patch with a weight that reflects how likely it belongs to the target object, and integrate it into the patch feature descriptors to construct a robust weighted feature descriptor. Finally, the constructed weighted features are combined with the structured SVM [4] to perform object tracking. Fig. 1 shows the overview of the proposed tracking approach. To improve the robustness and effectiveness of patch weight computation, we propose a novel model, called Laplacian Regularized Random Walk Ranking (LRWR), to compute the patch weights that reflect how likely the patches belong to the target object. One main benefit of LRWR is that it integrates both local spatial and global appearance cues simultaneously in its weight computation process, which thus leads to a more robust and effective object feature representation for visual tracking. Also, LRWR has a closed-form solution and thus can be computed efficiently. The learned features are incorporated into the traditional structured SVM to perform object tracking. This is possible because, in the structured SVM, the candidate samples are generated around the target bounding box; thus, we can incorporate LRWR-based patch weights into each candidate sample to alleviate the undesired effect of background information. Extensive experiments on standard benchmark datasets
show that the proposed tracking approach outperforms several state-of-the-art tracking methods.

2. Related work

The random walk with restart model has been used for patch-weighted feature representation and object tracking [10]. Here, we give a brief review of the random walk with restart (RWR) model. Given a local/neighborhood graph G(V, E), with V = (v_1, v_2, ..., v_n) and E denoting nodes and edges, a transition matrix A is first defined, where the element a_ij denotes the probability that a walker moves from node v_i to its neighborhood node v_j. In RWR, starting at a node, the walker has two options at each step, i.e., moving to a randomly chosen neighbor with probability α or jumping to a specified node with probability (1 − α) [10,23]. Formally, it iteratively computes the probability distribution r_i^{(t+1)} for node v_i as,
r_i^{(t+1)} = \alpha \sum_{j=1}^{n} a_{ij} r_j^{(t)} + (1 - \alpha) p_i    (1)
Using vector notation, this can be rewritten as,
r^{(t+1)} = \alpha A r^{(t)} + (1 - \alpha) p    (2)
where α ∈ [0, 1], 1 − α is the restart probability, and p = (p_1, p_2, ..., p_n) is the restart distribution. It is known that, regardless of the initialization r^{(0)}, as the iteration t increases the random walk process [10,24] converges to the stationary distribution r^*, i.e., the converged stationary distribution r^* satisfies the following,
r^* = \alpha A r^* + (1 - \alpha) p.    (3)
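As a quick illustration of this convergence, the following minimal NumPy sketch iterates Eq. (2) on a toy row-normalized transition matrix (the matrix and restart vector below are made up for illustration, not taken from the paper) and checks that the result matches the closed-form fixed point of Eq. (3):

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)
A = rng.random((n, n))
A /= A.sum(axis=1, keepdims=True)      # toy row-stochastic transition matrix
p = np.full(n, 1.0 / n)                # restart distribution
alpha = 0.25

r = np.zeros(n)                        # any initialization r^(0)
for _ in range(200):                   # iterate Eq. (2)
    r = alpha * (A @ r) + (1.0 - alpha) * p

# Fixed point of Eq. (3): r* = (1 - alpha) (I - alpha A)^(-1) p
r_star = (1.0 - alpha) * np.linalg.solve(np.eye(n) - alpha * A, p)
print(np.allclose(r, r_star))          # True: the iteration converges to r*
```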
The optimal r^*_i gives a kind of weight/ranking for node v_i of graph G(V, E). This model has usually been used for background/foreground measurement in many vision problems, such as saliency detection [24,25] and object tracking [10]. In [10], the RWR model was used to generate a more effective feature descriptor for the visual tracking problem. One main limitation of RWR is that it only considers the spatial neighbors and thus fails to capture the intrinsic global relationship among patches. Also, in terms of tracking strategy, the tracking algorithm in [10] fails to consider the scale variation of the target object. To overcome these problems, we propose a new Laplacian regularized RWR (LRWR) model for patch-weighted object representation, and we incorporate a scale estimation step into our tracking process to deal with the scale variation of the target object. In the following, we first present our LRWR model and then describe the overall tracking algorithm in Section 4.

3. Laplacian Regularized Random Walk Ranking

As mentioned before, the main aspect of RWR is that it computes the optimal ranking r^* of graph nodes by using local propagation on the graph G(V, E) and prior information (the restart term) p. However, one limitation of RWR is that it generally fails to consider the global consistency of nodes with similar appearance in its propagation process, i.e., if the appearance features of nodes v_i and v_j are similar, then their corresponding ranking weights r_i and r_j should also be close. Our aim in this section is to incorporate this global consistency into the RWR process. We call the resulting model Laplacian regularized RWR (LRWR).

3.1. Model formulation

Let W_ij be the appearance affinity between nodes v_i and v_j. Based on W, the weighting consistency between nodes v_i and v_j can be enforced by minimizing the following Laplacian objective energy function,
J_{Lap}(r) = \sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij} (r_i - r_j)^2    (4)
where a larger W_ij enforces a closer relationship between r_i and r_j [26,27]. Using this Laplacian constraint, we propose our LRWR model as follows. First, we show that the above RWR model can be equivalently reformulated as the following optimization problem,
\min_r J_{RWR}(r) = \| r - \alpha A r - (1 - \alpha) p \|^2.    (5)
This is because the optimal solution r of problem Eq. (5) is obtained by setting the first derivative w.r.t. r to zero, i.e.,
\frac{\partial J_{RWR}(r)}{\partial r} = r - \alpha A r - (1 - \alpha) p = 0    (6)
That is, the optimal r satisfies the following equation,
r = \alpha A r + (1 - \alpha) p    (7)
which is exactly the same as Eq. (3). Also, the objective J_{RWR} is a convex function, so a unique global optimal solution exists. Therefore, the optimal solution of problem Eq. (5) is identical to the converged solution of the above RWR process. Then, we add the Laplacian regularization term J_{Lap}(r) to the above RWR objective J_{RWR}(r) and propose our LRWR as
\min_r J_{LRWR} = J_{RWR} + \beta J_{Lap} = \| r - \alpha A r - (1 - \alpha) p \|^2 + \frac{\beta}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij} (r_i - r_j)^2    (8)
where β > 0 is a weighting parameter. Using matrix notation, this problem can be rewritten compactly as

\min_r J_{LRWR} = \| r - \alpha A r - (1 - \alpha) p \|^2 + \beta r^T (D - W) r,    (9)
where D = diag(d_1, d_2, ..., d_n) with d_i = \sum_{j=1}^{n} W_{ij}; here we use the standard identity \sum_{i,j} W_{ij}(r_i - r_j)^2 = 2 r^T (D - W) r. When β = 0, our LRWR degenerates to the RWR model. Compared with RWR, LRWR conducts local spatial propagation via J_{RWR} while maintaining global consistency via the Laplacian regularization J_{Lap} in its ranking process, and thus obtains a more effective ranking result for the visual tracking problem. In the following, we derive a simple closed-form solution for the proposed LRWR model.

3.2. Optimization

Since both J_{RWR} and J_{Lap} are convex functions, our LRWR objective J_{LRWR} is convex and the global optimal solution can be computed. The optimal solution is obtained by setting the first derivative of J_{LRWR}(r) w.r.t. the variable r to zero, i.e.,
\frac{\partial J_{LRWR}(r)}{\partial r} = 2 \left( K^T K + \beta (D - W) \right) r - 2 (1 - \alpha) K^T p = 0    (10)

where K = I − αA and I is the identity matrix. Therefore, the optimal r^* can be obtained in the following closed form,

r^* = (1 - \alpha) \left( K^T K + \beta (D - W) \right)^{-1} K^T p    (11)

Since the problem is convex, the optimal solution r^* is the global optimal solution.
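A minimal NumPy sketch of this closed-form solution is given below; it assumes a row-normalized transition matrix A, a symmetric affinity matrix W and a restart distribution p are already available, and the function name is purely illustrative:

```python
import numpy as np

def lrwr_closed_form(A, W, p, alpha=0.25, beta=0.2):
    """Eq. (11): r* = (1 - alpha) (K^T K + beta (D - W))^{-1} K^T p, with K = I - alpha A."""
    n = A.shape[0]
    K = np.eye(n) - alpha * A
    D = np.diag(W.sum(axis=1))           # degree matrix of the affinity graph
    M = K.T @ K + beta * (D - W)         # symmetric positive definite system matrix
    # Solve M r = (1 - alpha) K^T p rather than forming an explicit inverse.
    return (1.0 - alpha) * np.linalg.solve(M, K.T @ p)
```

Solving the linear system directly (rather than inverting M) is the usual numerically stable way to evaluate Eq. (11); with 64 patches per bounding box the system is tiny, which is consistent with the efficiency claim above.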
4. Visual object tracking

In this section, we apply our LRWR approach to generate a robust feature representation for visual tracking. Generally, our tracking process contains three main steps: weighted patch feature description, scale estimation and structured output tracking.

4.1. Weighted patch feature descriptor

In visual tracking, it is desirable to generate a robust feature descriptor for each candidate bounding box c of the target object. One popular way is to assign weights to different pixels (or patches) in the bounding box and generate a weighted patch descriptor for each bounding box to alleviate the effects of background information, as discussed in [10]. In the following, we use our LRWR model to compute the patch weights.

(1) Graph construction. Formally, given one bounding box c_t of the target object in the current t-th frame, we first partition it into non-overlapping local patches (Fig. 2(a)) and extract a feature descriptor x_i^{c_t} for each patch. Then, we construct both a local neighborhood graph G(V, E) and a global graph G′(V′, E′) as follows. The local neighborhood graph G(V, E) is constructed with nodes V representing patches and edges E denoting the 8-neighborhood relationship between patches v_i and v_j, as shown in Fig. 2(b). The weight of edge e_ij is calculated as,
a_{ij}^{c_t} = \begin{cases} \exp(-\gamma \| x_i^{c_t} - x_j^{c_t} \|^2) & \text{if } v_j \in N(v_i) \\ 0 & \text{otherwise} \end{cases}    (12)
where N(v_i) denotes the 8-neighbors of node v_i, and γ is a scaling parameter. Then, the transition matrix A^{c_t}, which is used to conduct the ranking propagation, is defined by normalizing each edge weight as,
A_{ij}^{c_t} = \frac{a_{ij}^{c_t}}{\sum_{j=1}^{n} a_{ij}^{c_t}}.    (13)
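For concreteness, the sketch below builds these local 8-neighbor affinities and the row-normalized transition matrix for a patch grid; it assumes the patch descriptors are stacked row-wise in row-major grid order, and the function name and grid layout are illustrative rather than taken from the paper:

```python
import numpy as np

def local_transition_matrix(X, grid=(8, 8), gamma=9.0):
    """Eqs. (12)-(13): 8-neighbor affinities a_ij and the row-normalized matrix A.
    X: (rows*cols, d) patch descriptors in row-major grid order (assumed layout)."""
    rows, cols = grid
    n = rows * cols
    a = np.zeros((n, n))
    for i in range(n):
        ri, ci = divmod(i, cols)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rj, cj = ri + dr, ci + dc
                if (dr, dc) != (0, 0) and 0 <= rj < rows and 0 <= cj < cols:
                    j = rj * cols + cj
                    d = X[i] - X[j]
                    a[i, j] = np.exp(-gamma * (d @ d))   # Eq. (12)
    return a / a.sum(axis=1, keepdims=True)              # Eq. (13)
```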
In addition to the local 8-neighbor graph G(V, E), we also construct a global graph G′(V′, E′) with nodes V′ representing patches and edges E′ connecting every pair of patches, as shown in Fig. 2(c). The weight W_{hk}^{c_t} of edge e_hk, which encodes the appearance affinity and enforces global consistency between nodes v_h and v_k, is computed as,

W_{hk}^{c_t} = \exp(-\varepsilon \| x_h^{c_t} - x_k^{c_t} \|^2)    (14)

where ε is a scaling parameter.

Fig. 2. Illustration of the graph construction. (a) Bounding box partition and patch generation; (b) local graph construction G(V, E); (c) global graph construction G′(V′, E′).

(2) Background and foreground measurement. Using A^{c_t} and W^{c_t}, we then perform LRWR with foreground and background restart distributions p^{c_t} and \tilde{p}^{c_t} to obtain the foreground and background measurements r^{c_t} and \tilde{r}^{c_t}, respectively, for the patches. In our method, the restart distributions p^{c_t} and \tilde{p}^{c_t} are obtained from the optimal bounding box c^*_{t-1} at frame t − 1 so as to inherit the information from the previous frame. Formally, we compute them as,

p_i^{c_t} = \begin{cases} \rho \cdot r_i^{c^*_{t-1}} & \text{if } v_i \in R^{c_t,in} \cup R^{c_t,bnd} \\ 0 & \text{if } v_i \in R^{c_t,out} \end{cases}

\tilde{p}_i^{c_t} = \begin{cases} 0 & \text{if } v_i \in R^{c_t,in} \\ \tilde{\rho} \cdot \tilde{r}_i^{c^*_{t-1}} & \text{if } v_i \in R^{c_t,bnd} \cup R^{c_t,out} \end{cases}

where R^{c_t,in}, R^{c_t,bnd} and R^{c_t,out} denote the sets of inner, boundary and outer patches of the bounding box c_t at the t-th frame, respectively, as shown in Fig. 3; c^*_{t-1} is the optimal target bounding box at frame t − 1 obtained by the proposed tracking algorithm, and ρ, ρ̃ are parameters.

Fig. 3. Illustration of the proposed weight computation process. The 1st row shows the LRWR result; the 2nd row shows the RWR result. (a) Definition of the three regions R^{c_t,out} (blue), R^{c_t,bnd} (green) and R^{c_t,in} (red) over a bounding box c_t. (b) The stationary distribution of the background \tilde{r}^{c_t} obtained by our LRWR model. (c) The stationary distribution of the foreground r^{c_t} obtained by our LRWR model. (d) The foreground weight w^{c_t} obtained by combining r^{c_t} and \tilde{r}^{c_t} via Eq. (15).

(3) Weighted patch descriptor. Based on r_i^{c_t} and \tilde{r}_i^{c_t}, we compute the foreground weight w_i^{c_t} for the i-th patch as

w_i^{c_t} = \frac{1}{1 + \exp(-\mu (r_i^{c_t} - \tilde{r}_i^{c_t}))}    (15)

That is, the larger the weight w_i^{c_t}, the more likely the i-th patch belongs to the foreground. Fig. 3 illustrates an example, where the first row shows the LRWR weight computation result and the second row shows the RWR result. Note that, compared with RWR, the proposed LRWR returns the patch weights more robustly and consistently, which demonstrates the desired benefit of the proposed Laplacian regularization in the LRWR model. By incorporating the weight w^{c_t} into the patch feature descriptor, we obtain the weighted descriptor for the bounding box as

\tilde{x}^{c_t} = \phi(x^{c_t}) = \left[ w_1^{c_t} x_1^{c_t}, w_2^{c_t} x_2^{c_t}, \ldots, w_n^{c_t} x_n^{c_t} \right]    (16)

Compared with the original feature x^{c_t}, the weighted feature descriptor \tilde{x}^{c_t} can alleviate the undesirable effects of background information, which provides a more accurate descriptor for the subsequent tracking process.
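A small sketch of steps (2)–(3), assuming the two LRWR rankings of Eq. (11) have already been computed for the foreground and background restart distributions; the names and array layout are illustrative:

```python
import numpy as np

def weighted_descriptor(X, r_fg, r_bg, mu=42.0):
    """Eqs. (15)-(16): per-patch foreground weights and the weighted bounding-box descriptor.
    X: (n, d) patch descriptors; r_fg, r_bg: foreground/background LRWR rankings."""
    w = 1.0 / (1.0 + np.exp(-mu * (r_fg - r_bg)))    # Eq. (15)
    x_tilde = (w[:, None] * X).reshape(-1)           # Eq. (16): concatenate w_i * x_i
    return x_tilde, w
```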
4.2. Structured SVM tracking

The proposed LRWR-based weighted object descriptor can be incorporated into many traditional tracking-by-detection algorithms. In this section, we incorporate it into the popular tracking-by-detection algorithm Struck [28], which adopts the structured SVM for classification. Recent studies demonstrate that the structured SVM usually outperforms the binary SVM because it uses the more flexible structural information between samples instead of the binary labels used in traditional binary SVM training. Formally, let x_{t-1} = (c_{t-1}, w_{t-1}, h_{t-1}) be the target state at the previous frame t − 1, where c_{t-1}, w_{t-1}, and h_{t-1} are the target center, width, and height, respectively. To estimate the target center c_t at the current frame t, we first set a square search region, which is centered at c_{t-1} and whose side length is determined by w_{t-1} and h_{t-1}. Then, we sample candidate states within the search region using the sliding-window method. At this stage, every candidate has the same size w_{t-1} × h_{t-1}. We then determine the current state x_t in the t-th frame by maximizing the classification score,
x_t = \arg\max_x \; h_{t-1}^T \phi(x)    (17)
where h_{t-1} is the normal vector of the decision plane at the (t − 1)-th frame, and φ(x) = \tilde{x}^c denotes the weighted descriptor of candidate state x, which is computed by Eq. (16). In order to further incorporate the information of the initial frame, in this paper we compute the optimal bounding box x_t by maximizing the classification score as
x_t = \arg\max_x \left( \lambda \, h_{t-1}^T \phi(x) + (1 - \lambda) \, h_0^T \phi(x) \right)    (18)
where h_0 is the classifier learned in the first frame by Struck, and λ is a balance parameter, fixed to 0.30 in this work. This strategy can prevent the tracker from learning drastic appearance changes. After obtaining the optimal tracked bounding box x_t, we then update the classifier h_t to adapt to the appearance changes of the target. To prevent the effects of unreliable tracking results, we update the classifier only when the confidence score of the tracking result is larger than a threshold θ, which we set to 0.3. Here, the confidence score of the tracking result at frame t is defined as the average similarity between the weighted descriptor \tilde{x}^{c_t} of the tracked bounding box x_t and the positive support vectors, i.e., \frac{1}{|S_t|} \sum_{s \in S_t} \langle s, \tilde{x}^{c_t} \rangle, where S_t is the set of positive support vectors at the t-th frame, as discussed in [10].
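The following sketch illustrates how the combined score of Eq. (18) and the confidence-gated update described above could be organized; it is not the authors' implementation, the inner product is assumed as the similarity, and λ denotes the balance parameter of Eq. (18):

```python
import numpy as np

def select_state(candidates, h_prev, h_0, lam=0.30):
    """Eq. (18): return the candidate whose weighted descriptor maximizes the blend
    of the previous-frame and initial-frame classifiers.
    candidates: list of (state, phi) pairs, phi the weighted descriptor of Eq. (16)."""
    scores = [lam * (h_prev @ phi) + (1.0 - lam) * (h_0 @ phi) for _, phi in candidates]
    best = int(np.argmax(scores))
    return candidates[best]

def should_update(phi_t, positive_support_vectors, theta=0.3):
    """Confidence-gated update: the average similarity to the positive support
    vectors must exceed theta before the classifier is retrained on the new result."""
    sims = [s @ phi_t for s in positive_support_vectors]
    return float(np.mean(sims)) > theta
```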
Fig. 4. Illustration of the proposed scale estimation. (a) Traditional methods using the bounding box with the fixed size. (b) Multiple bounding boxes with different scales.
Table 1. The precision rate (PR) and the success rate (SR) of the proposed method with different parameter settings.

Parameter  Setting  PR/SR
α          0.15     0.846/0.606
α          0.20     0.854/0.608
α          0.25     0.864/0.613
α          0.30     0.851/0.607
α          0.35     0.850/0.602
β          0.10     0.853/0.602
β          0.15     0.858/0.609
β          0.20     0.864/0.613
β          0.25     0.855/0.605
β          0.30     0.848/0.601
4.3. Scale estimation

As shown in Fig. 4(a), when the camera moves away, the scale of the target object changes, which is challenging. In this case, the above-mentioned tracker cannot provide accurate results, since it only considers candidate bounding boxes of a fixed size. The problem can be alleviated by considering candidate bounding boxes of various sizes. However, this may generate too many candidates and increase false positives, which can reduce tracking reliability. In this paper, we adopt the approach of the modern tracker [29], which decomposes target state estimation into two subproblems, i.e., translation estimation and scale estimation. It first estimates the center of the target at a fixed scale using the structured SVM classifier [28], and then determines the scale of the target at the estimated center location. Despite its simplicity, this approach enables robust scale estimation by efficiently reducing the number of candidates.
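A compact sketch of this two-step search is shown below; the box representation, the candidate-center generator and the `score` callable (e.g. the classifier score of Eq. (18)) are assumed interfaces, not the paper's code:

```python
def two_step_search(score, center_candidates, prev_box, scales=(0.9, 1.0, 1.1)):
    """Translation first at a fixed scale, then scale selection at the found center.
    prev_box: (cx, cy, w, h); score: callable mapping a box to a classifier score."""
    cx, cy, w, h = prev_box
    # Step 1: translation estimation -- move the box, keep its previous size.
    cx, cy = max(center_candidates, key=lambda c: score((c[0], c[1], w, h)))
    # Step 2: scale estimation -- evaluate a few scales at the estimated center only.
    w, h = max(((w * s, h * s) for s in scales),
               key=lambda wh: score((cx, cy, wh[0], wh[1])))
    return cx, cy, w, h
```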
5. Experiments

5.1. Evaluation settings

Parameters. The experiments are carried out on a PC with an Intel i7 4.0 GHz CPU and 16 GB of memory. The proposed algorithm is implemented in C++, and the proposed tracker runs at 10 frames per second. For fair comparison, we fixed all parameters and other settings across the experiments. In the proposed LRWR model, we set α = 0.25 and β = 0.2. Note that our LRWR is stable with respect to these regularization parameters, as shown in Table 1: when we slightly adjust α or β, the final tracking performance changes only a little. In Eqs. (12), (14) and (15), we empirically set {γ, ε, μ} = {9.0, 9.0, 42.0}. The parameters ρ, ρ̃ used in computing the restart distributions are set to 1.0. In all experiments, we divided the bounding box into 64 non-overlapping patches to balance the accuracy-efficiency trade-off. For each patch, we extract a 32-dimensional feature vector consisting of a 24-dimensional RGB color histogram and an 8-dimensional oriented gradient histogram. To improve efficiency, we scale each frame so that the minimum side length of each bounding box is 32 pixels, and we fix the side length of each search window to 2√(wh), where w and h denote the width and height of the scaled bounding box, respectively.

OTB100 benchmark dataset. We first evaluate the proposed tracking approach on the OTB100 benchmark dataset [1]. This dataset includes 100 image sequences with ground-truth object locations. These sequences are annotated with different attributes for performance analysis. Similar to many previous works, we use both the precision rate (PR) and the success rate (SR) to measure the quantitative performance of different trackers. PR is defined as the ratio of frames whose estimated object location is within a threshold distance of the ground-truth bounding box. SR is calculated as the ratio of frames in which the overlap ratio between the estimated bounding box and the ground truth is larger than a threshold. In particular, the precision at the distance threshold of 20 pixels is employed as the representative PR score, and the average success rate, i.e., the area under the success-rate curve over all overlap thresholds, is used as the representative SR score.

Temple Color benchmark dataset. We also test our tracker and compare it with other trackers on a larger benchmark dataset, TColor-128 [30]. This large-scale database involves 128 challenging image sequences of animals, pedestrians and rigid objects. The evaluation metrics used on this dataset are the same as in [1].
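For reference, the two representative scores described above can be computed from per-frame center errors and overlap ratios roughly as follows (a sketch; the exact threshold grid of the benchmark toolkit is assumed):

```python
import numpy as np

def precision_rate(center_errors, threshold=20.0):
    """Representative PR: fraction of frames whose predicted center lies within
    `threshold` pixels of the ground-truth center."""
    return float(np.mean(np.asarray(center_errors) <= threshold))

def success_rate_auc(overlaps, thresholds=np.linspace(0.0, 1.0, 101)):
    """Representative SR: area under the success-rate curve, i.e. the mean over
    overlap thresholds of the fraction of frames whose overlap exceeds the threshold."""
    overlaps = np.asarray(overlaps)
    return float(np.mean([(overlaps > t).mean() for t in thresholds]))
```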
5.2. Evaluation on the OTB100 dataset

We present the evaluation results on the OTB100 dataset and compare our method with several classical trackers, including SOWP [10], Struck [4] and seven other trackers from [1]. Fig. 5 shows the results of the one-pass evaluation (OPE) using the distance precision rate (PR) and overlap success rate (SR) curves, respectively. Overall, the comparison curves show that our tracker outperforms the second best tracker, achieving a 22.4% gain in PR and a 15.0% gain in SR over Struck. In particular, our method achieves 6.1% and 5.3% gains in PR and SR over SOWP [10], which is the method most closely related to ours. Tables 2–5 summarize the comparison results of our tracker on 11 different attributes in terms of PR and SR, respectively. We compare our tracker with both non-deep-learning methods, including MUSTer [31], MEEM [7], LCT [29], KCF [32], DSST [33] and TLD [34], and deep-learning methods, including DLT [35] and HCF [36].
Table 2. Comparison of attribute-based PR scores on the OTB benchmark. The compared methods are SOWP, Struck, CXT, OAB, LSK, CSK, LOT, VTS, VTD and LRWR. The attributes are IV (illumination variation), SV (scale variation), OCC (occlusion), DEF (deformation), MB (motion blur), FM (fast motion), IPR (in-plane rotation), OPR (out-of-plane rotation), OV (out-of-view), BC (background clutter), and LR (low resolution).
       SOWP    Struck  CXT     OAB     LSK     CSK     LOT     VTS     VTD     LRWR
FM     0.723   0.626   0.554   0.467   0.451   0.408   0.395   0.337   0.332   0.803
BC     0.775   0.566   0.457   0.432   0.455   0.592   0.435   0.535   0.545   0.831
MB     0.702   0.594   0.559   0.463   0.414   0.378   0.369   0.293   0.283   0.779
DEF    0.741   0.527   0.410   0.394   0.436   0.453   0.485   0.462   0.461   0.860
IV     0.766   0.545   0.473   0.401   0.447   0.479   0.322   0.477   0.478   0.828
IPR    0.828   0.637   0.607   0.474   0.504   0.521   0.472   0.548   0.564   0.828
LR     0.903   0.674   0.546   0.445   0.500   0.423   0.415   0.509   0.537   0.869
OCC    0.754   0.537   0.441   0.440   0.478   0.437   0.453   0.462   0.488   0.814
OPR    0.787   0.593   0.524   0.452   0.496   0.484   0.498   0.558   0.574   0.842
OV     0.633   0.503   0.416   0.384   0.416   0.315   0.408   0.409   0.417   0.722
SV     0.746   0.600   0.524   0.524   0.460   0.474   0.457   0.516   0.523   0.816
ALL    0.803   0.640   0.556   0.481   0.497   0.521   0.470   0.507   0.513   0.864
Table 3. Comparison of attribute-based PR scores on the OTB benchmark. The compared methods are MUSTer, LCT, HCF, MEEM, KCF, DSST, TLD, DLT and LRWR. The attributes are the same as in Table 2.
       MUSTer  LCT     HCF     MEEM    KCF     DSST    TLD     DLT     LRWR
FM     0.683   0.681   0.796   0.752   0.625   0.584   0.563   0.391   0.803
BC     0.784   0.734   0.842   0.746   0.718   0.702   0.470   0.515   0.831
MB     0.678   0.669   0.803   0.731   0.606   0.611   0.542   0.387   0.779
DEF    0.689   0.689   0.754   0.754   0.617   0.568   0.484   0.451   0.860
IV     0.770   0.732   0.784   0.728   0.693   0.708   0.535   0.515   0.828
IPR    0.773   0.782   0.853   0.794   0.697   0.724   0.613   0.471   0.828
LR     0.747   0.699   0.847   0.808   0.671   0.708   0.627   0.751   0.869
OCC    0.734   0.682   0.734   0.741   0.625   0.615   0.535   0.454   0.814
OPR    0.744   0.746   0.795   0.794   0.670   0.670   0.570   0.509   0.842
OV     0.591   0.592   0.676   0.685   0.512   0.487   0.488   0.558   0.722
SV     0.710   0.681   0.774   0.736   0.636   0.662   0.565   0.535   0.816
ALL    0.774   0.762   0.821   0.781   0.693   0.695   0.597   0.535   0.864
Table 4. Comparison of attribute-based SR scores on the OTB benchmark. The compared methods are SOWP, Struck, CXT, OAB, LSK, CSK, LOT, VTS, VTD and LRWR. The attributes are the same as in Table 2.
       SOWP    Struck  CXT     OAB     LSK     CSK     LOT     VTS     VTD     LRWR
FM     0.556   0.470   0.407   0.382   0.369   0.341   0.327   0.283   0.282   0.607
BC     0.570   0.438   0.361   0.324   0.360   0.425   0.319   0.396   0.403   0.612
MB     0.567   0.468   0.396   0.390   0.356   0.327   0.316   0.268   0.258   0.606
DEF    0.527   0.383   0.297   0.314   0.390   0.338   0.339   0.339   0.341   0.596
IV     0.554   0.422   0.358   0.312   0.364   0.368   0.253   0.360   0.366   0.590
IPR    0.567   0.453   0.447   0.361   0.384   0.385   0.324   0.384   0.395   0.574
LR     0.423   0.313   0.360   0.225   0.318   0.224   0.211   0.249   0.260   0.484
OCC    0.528   0.394   0.337   0.341   0.370   0.337   0.329   0.348   0.360   0.570
OPR    0.547   0.424   0.391   0.340   0.379   0.355   0.344   0.388   0.398   0.590
OV     0.497   0.384   0.344   0.327   0.342   0.283   0.326   0.347   0.351   0.532
SV     0.475   0.404   0.379   0.379   0.337   0.367   0.328   0.651   0.356   0.552
ALL    0.560   0.463   0.414   0.366   0.386   0.386   0.339   0.364   0.369   0.613
Table 5. Comparison of attribute-based SR scores on the OTB benchmark. The compared methods are MUSTer, LCT, HCF, MEEM, KCF, DSST, TLD, DLT and LRWR. The attributes are the same as in Table 2.
       MUSTer  LCT     HCF     MEEM    KCF     DSST    TLD     DLT     LRWR
FM     0.533   0.534   0.570   0.542   0.463   0.442   0.434   0.318   0.607
BC     0.581   0.550   0.585   0.519   0.500   0.477   0.362   0.372   0.612
MB     0.544   0.533   0.585   0.556   0.463   0.467   0.435   0.320   0.606
DEF    0.524   0.341   0.517   0.489   0.436   0.412   0.341   0.295   0.596
IV     0.592   0.557   0.525   0.515   0.471   0.485   0.401   0.401   0.590
IPR    0.551   0.557   0.559   0.529   0.467   0.485   0.432   0.348   0.574
LR     0.415   0.399   0.388   0.382   0.290   0.314   0.346   0.465   0.484
OCC    0.554   0.507   0.514   0.504   0.441   0.426   0.371   0.335   0.570
OPR    0.537   0.538   0.534   0.525   0.450   0.448   0.390   0.371   0.590
OV     0.469   0.452   0.474   0.488   0.401   0.374   0.361   0.384   0.532
SV     0.512   0.488   0.477   0.470   0.369   0.409   0.388   0.391   0.552
ALL    0.577   0.562   0.556   0.530   0.476   0.475   0.427   0.391   0.613
[Fig. 5 plots. Precision plots of OPE — legend (PR at 20 px): LRWR [0.864], SOWP [0.803], Struck [0.640], CXT [0.556], CSK [0.521], VTD [0.513], VTS [0.507], LSK [0.497], OAB [0.481], LOT [0.470]. Success plots of OPE — legend (average SR): LRWR [0.613], SOWP [0.560], Struck [0.463], CXT [0.414], LSK [0.386], CSK [0.386], VTD [0.369], OAB [0.366], VTS [0.364], LOT [0.339].]
Fig. 5. Precision plots and success plots of OPE (one-pass evaluation) of the proposed tracker against other state-of-the-art trackers on OTB100 dataset. The representative scores measure the PR score at the threshold of 20 pixels and the average SR score over all overlap thresholds, respectively. Our method performs favorably against the state-of-the-art trackers.
Fig. 6. Tracking results of our method against Struck, MEEM, KCF, LCT and SOWP (denoted by different colors and line styles) on 12 challenging sequences (from left to right and top to bottom: Board, Bolt2, Car1, Car24, Car4, Human3, Freeman1, Lemming, Girl2, Shaking, Human4, and Skating1, respectively).
Overall, we can note that LRWR outperforms the other trackers on most of the challenging attributes. In particular, our method achieves better performance on the attributes of fast motion, background clutter, motion blur, deformation, illumination variation, occlusion, out-of-plane rotation, out-of-view and scale variation. This clearly demonstrates the benefit of the global consistency regularization of the LRWR model in weight computation, which leads to a robust and effective feature descriptor for visual tracking. Fig. 6 shows tracking results on some challenging examples, and Fig. 7 shows the corresponding center location errors [29]. The lower the center location error, the more accurately the visual object is located by the tracker. Intuitively, one can note that the proposed tracker performs better and locates the visual object more effectively and accurately on these challenging sequences.

5.3. Evaluation on the Temple Color dataset

Our second evaluation is conducted on the Temple Color dataset [30] with 17 trackers, including SOWP [10], MEEM [7], Struck [4], KCF [32] and 13 other trackers from [30]. Fig. 8 shows the success plot and precision plot over all 128 videos of this dataset. One can note that our tracker generally outperforms the current popular trackers and achieves the best performance on this dataset.
[Fig. 7 plots: per-sequence center location error curves for LRWR, SOWP, Struck, MEEM, LCT and KCF on the 12 sequences.]
Fig. 7. Comparison of center location errors (in pixels) on 12 challenging sequences. Generally, our method obtains lower center location errors.
Fig. 8. Evaluation results on the TColor-128 dataset. The legend contains the area-under-the-curve score and the average distance precision score at 20 pixels for each tracker. Our method performs favorably against the state-of-the-art trackers.
This further demonstrates the effectiveness and robustness of the proposed LRWR based tracking approach.
5.4. Component analysis

In order to validate the significance of the main components, five special versions of our method are implemented for further analysis: (1) LRWR-noL, which removes the Laplacian regularization term from our LRWR model; (2) LRWR-noS, which removes the scale estimation from our tracking algorithm; (3) LRWR-noI, which removes the information of the initial frame; (4) LRWR-noT, which updates the classifier at every frame; and (5) the full LRWR method. The evaluation results of these versions, together with the two baseline trackers Struck [28] and SOWP [10], are shown in Table 6. We can see that all of our versions outperform Struck and SOWP, which demonstrates the importance of our components. Our LRWR performs clearly better than LRWR-noL, which demonstrates the benefit and effectiveness of the Laplacian regularization term in our LRWR model. Our LRWR also performs clearly better than LRWR-noS, which demonstrates the effectiveness of the proposed scale estimation in the LRWR tracking method. Also, LRWR-noT and LRWR-noI perform clearly better than SOWP, which demonstrates the effectiveness of the LRWR-based weight computation.

Table 6. The performance of different versions of the proposed method against Struck and SOWP.

      Struck  SOWP   LRWR   LRWR-noS  LRWR-noL  LRWR-noT  LRWR-noI
PR    0.640   0.803  0.864  0.851     0.833     0.832     0.854
SR    0.463   0.560  0.613  0.601     0.593     0.586     0.603
To further illustrate the robustness of the classifier learned with LRWR compared with LRWR-noL, we introduce the Average Peak-to-Correlation Energy (APCE) [37], which is defined as

APCE = \frac{|s_{max} - s_{min}|^2}{\mathrm{mean}\left( \sum_{i \in \Omega} (s_i - s_{min})^2 \right)}    (19)

where s represents the classifier score and Ω represents the set of all candidate bounding boxes. A larger APCE value indicates a more discriminative feature descriptor. Fig. 9 shows the comparison results. We can note that the proposed LRWR-based patch weighted descriptor is more discriminative than that of LRWR-noL, which clearly demonstrates the effectiveness of the proposed Laplacian regularization term in patch weight computation.

Fig. 9. The APCE scores of LRWR and LRWR-noL on the different sequences of OTB100.

Table 7 shows the average speed of different tracking methods. We can note that (1) Struck runs faster than SOWP and LRWR because it extracts a single global feature for each bounding box, while SOWP and LRWR extract many local features for the patches of each bounding box; and (2) LRWR runs faster than SOWP because LRWR has a simple closed-form solution and thus can be computed efficiently, whereas SOWP relies on an iterative algorithm to obtain its solution.

Table 7. The speed of the proposed method against Struck and SOWP.

Method   Struck  SOWP  LRWR
FPS      20.2    7.3   10.2

6. Conclusions

This paper proposes a novel graph Laplacian Regularized Random Walk Ranking (LRWR) model to generate a robust object representation for the visual tracking problem. LRWR integrates both local spatial and global appearance cues simultaneously in its weight computation process, and thus leads to a more robust and effective object feature representation for visual tracking. Also, LRWR has a simple closed-form solution and thus can be computed efficiently. Extensive experiments on the standard benchmarks OTB100 and TColor-128 demonstrated the better performance of our approach over other state-of-the-art tracking methods.
Acknowledgment This work is supported National Natural Science Foundation of China (61602001, 61702002, 61671018); Natural Science Foundation of Anhui Province (1708085QF139); Natural Science Foundation of Anhui Higher Education Institutions of China (KJ2016A020). References [1] Y. Wu, J. Lim, M.-H. Yang, Object tracking benchmark, IEEE Transactions on Pattern Analysis Machine Intelligence 37 (2015) 1834–1848. [2] A. Li, M. Lin, Y. Wu, M.H. Yang, S. Yan, Nus-pro: A new visual tracking challenge, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2) (2016) 335–349. [3] W. Li, P. Wang, R. Jiang, H. Qiao, Robust object tracking guided by top-down spectral analysis visual attention, Neurocomputing 152 (2015) 170–178. [4] S. Hare, S. Golodetz, A. Saffari, V. Vineet, M.M. Cheng, S.L. Hicks, P.H.S. Torr, Struck: Structured output tracking with kernels, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2016) 2096–2109. [5] S. He, Q. Yang, R.W.H. Lau, J. Wang, M.H. Yang, Visual tracking via locality sensitive histograms, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2427–2434. [6] Y. Yuan, H. Yang, Y. Fang, W. Lin, Visual object tracking by structure complexity coefficients, IEEE Trans. Multimed. 17 (8) (2015) 1125–1136. [7] J. Zhang, S. Ma, S. Sclaroff, in: Meem: robust tracking via multiple experts using entropy minimization, 8694, 2014, pp. 188–203. [8] G. Han, X. Wang, J. Liu, N. Sun, C. Wang, Robust object tracking based on local region sparse appearance model, Neurocomputing 184 (2016) 145–167. [9] C. Dorin, R. Visvanathan, M. Peter, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. 25 (2003) 564–575. [10] H.U. Kim, D.Y. Lee, J.Y. Sim, C.S. Kim, Sowp: spatially ordered and weighted patch descriptor for visual tracking, in: Proceedings of the IEEE International Conference on Computer Vision ICCV, 2015, pp. 3011–3019. [11] R.Z. Han, Q. Guo, W. Feng, Content-related spatial regularization for visual object tracking, in: Proceedings of the IEEE International Conference on Multimedia and Expo, 2018. [12] C. Li, X. Wu, Z. Bao, J. Tang, Regle: Spatially Regularized Graph Learning for Visual Tracking (2017) 252–260. [13] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, S. Wang, Learning dynamic siamese network for visual object tracking, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2018, pp. 1781–1789. [14] F. Liu, C. Gong, T. Zhou, K. Fu, X. He, J. Yang, Visual tracking via nonnegative multiple coding, IEEE Trans. Multimed. 19 (12) (2017) 2680–2691.
[15] P. Zhang, T. Zhuo, L. Xie, Y. Zhang, Deformable object tracking with spatiotemporal segmentation in big vision surveillance, Neurocomputing 204 (2016) 87–96. [16] Q. Guo, W. Feng, C. Zhou, C. Pun, B. Wu, Structure-regularized compressive tracking with online data-driven sampling, IEEE Trans. Image Process. (12) (2017) 5692–5705. [17] S. Duffner, C. Garcia, Pixeltrack: A fast adaptive algorithm for tracking non– rigid objects, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2480–2487. [18] F. Yang, H. Lu, M.H. Yang, Robust superpixel tracking, IEEE Trans. Image Process. 23 (2014) 1639–1651. [19] R. Yao, S. Xia, Z. Zhang, Y. Zhang, Real-time correlation filter tracking by efficient dense belief propagation with structure preserving, IEEE Trans. Multimed. 19 (4) (2017) 772–784. [20] B. Ma, J. Shen, Y. Liu, H. Hu, L. Shao, X. Li, Visual tracking using strong classifier and structural local sparse descriptors, IEEE Trans. Multimed. 17 (10) (2015) 1818–1828. [21] S. Zhang, H. Yao, X. Sun, X. Lu, Sparse coding based visual tracking: Review and experimental comparison, Pattern Recognit. 46 (7) (2013) 1772–1788. [22] C. Li, L. Lin, W. Zuo, J. Tang, Learning patch-based dynamic graph for visual tracking, in: Proceedings of the AAAI, 2017. [23] J.-Y. Pan, H.-J. Yang, C. Faloutsos, P. Duygulu, Automatic multimedia cross– modal correlation discovery, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, in: KDD ’04, ACM, 2004, pp. 653–658. [24] Z. He, B. Jiang, Y. Xiao, C. Ding, B. Luo, Saliency detection via a graph based diffusion model, in: Proceedings of the Graph-Based Representations in Pattern Recognition, 2017, pp. 3–12. [25] J.S. Kim, J.Y. Sim, C.S. Kim, Multiscale saliency detection using random walk with restart, IEEE Trans. Circuits Syst. Video Technol. 24 (2014) 198–210. [26] B. Jiang, C. Ding, B. Luo, J. Tang, Graph-laplacian pca: closed-form solution and robustness, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3492–3498. [27] B. Jiang, C. Ding, B. Luo, Robust data representation using locally linear embedding guided pca, Neurocomputing 275 (2018) 523–532. [28] S. Hare, A. Saffari, P.H.S. Torr, Struck: Structured output tracking with kernels, in: Proceedings of the ICCV, 2011, pp. 263–270. [29] C. Ma, X. Yang, C. Zhang, M.H. Yang, Long-term correlation tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5388–5396. [30] P. Liang, E. Blasch, H. Ling, Encoding color information for visual tracking: Algorithms and benchmark, IEEE Trans. Image Process. 24 (12) (2015) 5630–5644. [31] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, D. Tao, Multi-store tracker (muster): a cognitive psychology inspired approach to object tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition CVPR, 2015, pp. 749–758. [32] J.F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell. 37 (3) (2015) 583–596. [33] M. Danelljan, G. Hauml; ger, F. Shahbaz Khan, M. Felsberg, Accurate Scale Estimation for Robust Visual Tracking, 2014. [34] Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, IEEE Trans. Pattern Anal. Mach. Intell. 34 (7) (2012) 1409–1422. [35] W. Naiyan, Y. Dit-Yan, Learning a deep compact image representation for visual tracking, in: C.J.C. Burges, L. Bottou, M. Welling, Z. 
Ghahramani, K.Q. Weinberger (Eds.), Proceedings of the Advances in Neural Information Processing Systems 26, 2013, pp. 809–817. [36] C. Ma, J.B. Huang, X. Yang, M.H. Yang, Hierarchical convolutional features for visual tracking, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 3074–3082. [37] M. Wang, Y. Liu, Z. Huang, Large margin object tracking with circulant feature maps, CoRR abs/1703.05020 (2017).
Bo Jiang received the B.Eng. degree in 2012 and the Ph.D. degree in 2015 in computer science from Anhui University, Hefei, China. He is currently an associate professor in computer science at Anhui University. His current research interests include image and graph matching, image feature extraction, and statistical pattern recognition.
Yuan Zhang received the B.S. degree in applied mathematics in 2016 from Anhui University, Hefei, China, and is currently a Master's student in computer science at Anhui University. Current research interests include object tracking with graph models and statistical pattern recognition.
Jin Tang received the B.Eng. degree in automation in 1999, and the Ph.D. degree in computer science in 2007 from Anhui University, Hefei, China. Since 2009, he has been a professor at the School of Computer Science and Technology at the Anhui University. His research interests include image processing, pattern recognition, machine learning and computer vision.
Bin Luo received his Ph.D. degree in Computer Science in 2002 from the University of York, the United Kingdom. He has published more than 200 papers in journals, edited books and refereed conferences. He is a professor at Anhui University of China. At present, he chairs the IEEE Hefei Subsection. He served as a peer reviewer of international academic journals such as IEEE Trans. on PAMI, Pattern Recognition, Pattern Recognition Letters, International Journal of Pattern Recognition and Artificial Intelligence, and Neurocomputing, etc. His current research interests include random graph based pattern recognition, image and graph matching, and video analysis. Chenglong Li received the M.S. and Ph.D. degrees from the School of Computer Science and Technology, Anhui University, Hefei, China, in 2013 and 2016, respectively. From 2014 to 2015, he was a visiting student with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. He is currently a lecturer at the School of Computer Science and Technology, Anhui University, and also a postdoctoral research fellow at the Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), China. He was a recipient of the ACM Hefei Doctoral Dissertation Award in 2016.