Pattern Recognition 100 (2020) 107130
Non-rigid object tracking via deep multi-scale spatial-temporal discriminative saliency maps

Pingping Zhang a, Wei Liu b, Dong Wang a, Yinjie Lei c, Hongyu Wang a, Chunhua Shen b, Huchuan Lu a,∗

a School of Information and Communication Engineering, Dalian University of Technology, Dalian, Liaoning 116024, China
b School of Computer Science, University of Adelaide, Adelaide, SA 5005, Australia
c College of Electronics and Information Engineering, Sichuan University, Chengdu, Sichuan 610065, China
Article info

Article history: Received 5 January 2019; Revised 10 October 2019; Accepted 24 November 2019; Available online 25 November 2019

Keywords: Deep neural network; Non-rigid object tracking; Salient object detection; Spatial-temporal consistency
Abstract

In this paper, we propose a novel effective non-rigid object tracking framework based on spatial-temporal consistent saliency detection. In contrast to most existing trackers that utilize a bounding box to specify the tracked target, the proposed framework can extract accurate regions of the target as tracking outputs. It achieves a better description of the non-rigid objects and reduces the background pollution for the tracking model. Furthermore, our model has several unique characteristics. First, a tailored fully convolutional neural network (TFCN) is developed to model the local saliency prior for a given image region, which not only provides the pixel-wise outputs but also integrates the semantic information. Second, a novel multi-scale multi-region mechanism is proposed to generate local saliency maps that effectively consider visual perceptions with different spatial layouts and scale variations. Subsequently, the local saliency maps are fused via a weighted entropy method, resulting in a discriminative saliency map. Finally, we present a non-rigid object tracking algorithm based on the predicted saliency maps. By utilizing a spatial-temporal consistent saliency map (STCSM), we conduct the target-background classification and use an online fine-tuning scheme for model updating. Extensive experiments demonstrate that the proposed algorithm achieves competitive performance in both saliency detection and visual tracking, especially outperforming other related trackers on the non-rigid object tracking datasets. Source codes and compared results are released at https://github.com/Pchank/TFCNTracker. © 2019 Elsevier Ltd. All rights reserved.
1. Introduction

Visual object tracking aims to automatically identify the trajectories or locations of moving objects in a sequence of images. It is a long-standing research topic in computer vision due to its numerous applications such as video surveillance, human-computer interaction and automatic driving. Although a diverse set of approaches have emerged and achieved satisfactory solutions under well-controlled environments, tracking generic objects remains very challenging. In this paper, we focus on single-target tracking, in which only the target region in the first frame is given and the tracker must infer new locations in the following frames. Current efforts
∗ Corresponding author.
E-mail addresses: [email protected] (P. Zhang), [email protected] (W. Liu), [email protected] (D. Wang), [email protected] (Y. Lei), [email protected] (H. Wang), [email protected] (C. Shen), [email protected] (H. Lu).
https://doi.org/10.1016/j.patcog.2019.107130
0031-3203/© 2019 Elsevier Ltd. All rights reserved.
of single-target tracking mainly focus on building robust bounding box-based trackers to overcome inevitable factors such as scale change, partial occlusion and illumination variation. To improve the tracking accuracy, a few researchers have shifted their efforts to non-rigid object tracking, which is a more challenging task. This kind of tracking requires accurate target-background separations rather than coarse bounding boxes. Existing non-rigid trackers often rely on pixel-level [1], superpixel-level [2] or patch-level [3,4] classification. For instance, PixelTrack [1] provides soft segmentations of the tracked object based on pixel-wise classification. The superpixel tracker [2] treats the superpixels of specific objects as mid-level features and designs a discriminative model to generate a confidence map for online tracking. HoughTrack [3] identifies the target area through patch-based classification and a voting-based online Hough forest. Recently, Son et al. [4] present an online gradient-boosting decision tree (GBDT) model that integrates classifiers operating on individual patches and generates segmentation masks of the tracked object. However, all these methods are built with handcrafted features, which are not
robust enough for complex object variations, and they are not aware of the internal relations between non-rigid tracking and salient object detection. Visual tracking is essentially a selective attention procedure of the human visual system; however, this saliency characteristic is often ignored in designing a tracking system. As a basic pre-processing procedure in computer vision, salient object detection has shown great progress in recent years. However, several gaps still exist in applying saliency detection algorithms to the tracking problem. Technically, saliency detection generally operates on holistic images, and loses local specificity and scale consideration. Conversely, visual tracking requires focusing on a specific object rather than the entire scene in a cluttered environment. Several attempts [5,6] have been made to connect visual tracking and saliency detection. Mahadevan and Vasconcelos [5] incorporate the center-surround attention into visual tracking, and present a biologically-inspired tracker. Hong et al. [6] adopt a feature projection to generate a target-specific saliency map, then locate the tracked object in a recursive manner. However, their generated saliency maps mainly focus on enhancing the contrast between the center of the object and the local background. Thus, these methods are suitable for bounding box-based object tracking, but they cannot produce the segmentation-based outputs required for non-rigid object tracking. In this paper, we argue that the goals of saliency detection and non-rigid object tracking are quite similar, i.e., producing pixel-wise outputs that distinguish the objects of interest from their surrounding background.

Inspired by the above-mentioned facts, in this paper we propose a novel non-rigid object tracking method based on spatial-temporal consistent discriminative saliency detection. Our method can extract an accurate object region as the tracking output, which provides a better description of non-rigid objects while reducing the background pollution. To achieve this goal, we first develop a tailored fully convolutional neural network (TFCN), which is pre-trained on a well-built saliency detection dataset to predict the saliency map for a given image region. Then, the pre-trained TFCN takes as inputs local image regions with various scales and spatial configurations to predict multiple local saliency maps. Based on the proposed weighted entropy method, these local saliency maps are effectively fused to produce a discriminative saliency map for online tracking. In addition, we build a structural-output target-background classifier with the accumulated saliency maps. It can effectively utilize the spatial-temporal information to generate pixel-wise outputs for depicting the state of the tracked object. Finally, we extract the regions of interest (ROIs) and fine-tune the TFCN to obtain the local saliency map in the next frame. Fig. 1 illustrates the critical stages of the proposed tracking method. In summary, the contributions of this paper are four-fold:

• Salient object detection and visual object tracking are integrated into an iterative learning framework with one FCN. The final outputs of the proposed tracker are pixel-wise saliency maps with the structural property and computational scalability, which are more suitable for the non-rigid object tracking problem.
• An efficient TFCN is developed to model the local saliency prior for a given image region. It not only provides pixel-wise outputs but also integrates the semantic information of targets.
• A multi-scale multi-region mechanism is presented to generate multiple local saliency maps which are fused by a weighted entropy method. This new mechanism can produce discriminative saliency maps to facilitate the tracking process.
• Extensive experiments on public saliency detection and visual tracking datasets show that our algorithm achieves considerably impressive results in both research fields.
The remainder of this paper is organized as follows. In Section 2, we give an overview of visual object tracking, salient object detection and their relationships from the perspective of deep learning. Then we introduce the proposed spatial-temporal consistent saliency detection model in Section 3, and present the non-rigid object tracking algorithm in Section 4. In Section 5, we evaluate and analyze the proposed method through extensive experiments. Finally, we provide conclusions and future works in Section 6.

2. Related work

Recently, deep convolutional neural networks (CNNs) have exhibited impressive performance in both visual object tracking and salient object detection. In this section, we briefly review related works and discuss the relations between these two topics from the view of deep learning. A complete survey of these methods is beyond the scope of this paper, and we refer the readers to recent survey papers [8,9] for more details.

2.1. Deep learning for visual object tracking

In the visual tracking field, many practices indicate that the feature extractor plays an important role in a powerful tracker. Thus, recent state-of-the-art trackers have taken advantage of deep learned features. For instance, Wang et al. [10] observe that the higher layers of pre-trained CNNs comprise abundant semantic information, whereas the lower layers include considerable discriminative cues. Thus, they extract feature maps of the conv4-3 and conv5-3 layers in the VGG-16 model [11], and build two subnets to capture high-level and low-level information, respectively. The two subnets are combined to generate confidence maps for object localization. Motivated by the same observation, Ma et al. [12] introduce multi-scale features into the correlation filter framework, and infer the location in a coarse-to-fine manner. Nam and Han [13] pre-train a domain-specific CNN on large-scale videos to obtain generic target representations, then transfer the model for online tracking. Very recently, many deep feature-based correlation filter trackers with good performance have appeared. Although deep learning-based trackers significantly improve the tracking accuracy, we note that almost all of them are designed for the bounding box inference strategy. Therefore, these trackers cannot provide pixel-level tracking results, which limits their performance in the non-rigid object tracking problem.

2.2. Deep learning for salient object detection

Salient object detection aims to identify the most conspicuous objects or regions in an image. Since the revolution of deep learning in computer vision, salient object detection has made great progress. Motivated by the relationship between saliency detection and semantic segmentation, Li et al. [14] propose a multi-task saliency detection method with collaborative feature learning. Subsequently, Li et al. [15] construct a saliency model based on multi-scale deep features. Lee et al. [16] propose to encode low-level distance maps and high-level semantic features for salient object detection. Liu et al. [17] propose a deep hierarchical network to detect salient objects. The network first makes a coarse global prediction. Then a hierarchical recurrent convolutional neural network (HRCNN) is adopted to refine the details. The whole architecture works in a global-to-local and coarse-to-fine manner. Wang et al. [18] also develop a deep recurrent FCN to incorporate coarse predictions as saliency priors, and stage-wisely refine the generated saliency maps. Zhang et al.
[19,20] present generic frameworks that aggregate multi-level convolutional features for salient object detection. Wang et al. [21] propose stage-wise
Fig. 1. The overall framework of our proposed tracking approach. In the first stage, we adopt a multi-scale multi-region method and the proposed TFCN to predict local saliency maps in the start frame. Grabcut [7] is employed in the start frame to increase the robustness of saliency region predictions. Then we utilize the weighted entropy to fuse the local saliency maps, resulting in a discriminative saliency map for the target localization. During tracking, the STCSM model is incorporated for the location inference. After the localization, we update the TFCN with stochastic gradient descent (SGD)-based fine-tuning.
refinement methods to improve the boundary accuracy. Zhang et al. [22] integrate lossless feature reflection and a structural loss for accurate saliency detection. Then, Zhang et al. [23] introduce a hyper-densely reflective feature fusion for salient object detection. All of these methods demonstrate the effectiveness of deep CNNs in predicting saliency maps. However, they merely work on static images and pay little attention to the scale variations of objects, which are important and necessary in visual object tracking. Therefore, directly applying these saliency detection methods is harmful to the tracking performance. Meanwhile, introducing dynamic information and handling scale changes are critical for saliency detection-based visual tracking.

2.3. Relations between visual object tracking and salient object detection

Currently, visual object tracking and salient object detection are usually studied separately for different applications in computer vision. However, research in psychology indicates that the selective visual attention process, or saliency detection, is crucial to visual tracking. Based on this fact, several attempts have been made to reveal the relations between visual object tracking and salient object detection. For example, Mahadevan and Vasconcelos [5] first connect center-surround saliency detection and visual object tracking, and present a biologically-inspired object tracker. They utilize multiple low-level visual features to boost the overall saliency map for bounding box-based target localization. Recently, Hong et al. [6] adopt deep feature projection methods to generate a target-specific saliency map, then locate the tracked object on it. Feng et al. [24] propose a dynamic saliency-aware regularized scheme for object tracking. Though the results are impressive, the saliency maps obtained by existing methods are all defined with respect to the center of the tracked object, which facilitates bounding box-based trackers but is unsuitable for generating segmentation-based outputs for non-rigid object tracking. To the best of our knowledge, this paper is the first work that provides new insights and attempts to integrate saliency detection and visual tracking under the deep learning framework. First, we find that the goals of saliency detection and non-rigid object tracking are quite similar, i.e., producing pixel-wise outputs that distinguish the objects of interest from their surrounding background. Thus, it is reasonable to perform non-rigid object tracking by using local saliency detection. Second, the proposed method is developed based on an FCN, which can exploit powerful deep features, introduce rich semantic information, and facilitate the generation of segmentation-based outputs. Third, besides spatial information, motion information can also be utilized for the localization of moving objects. If motion information is ignored or inaccurate, object tracking may fail. To deal with this issue, we introduce the accumulated spatial-temporal saliency map, which can quickly capture the target during tracking.

3. Multi-scale local region saliency model

In this section, we start by describing the architectures of the fully convolutional network (FCN) [25] and the proposed TFCN. Then we give the details of how to generate the discriminative saliency map, which is competent in dealing with tracking problems.

3.1. FCN architectures

The original FCN architecture [25] is an end-to-end learning model which can produce a pixel-wise prediction for dense labeling tasks. The model differs from traditional CNNs: it converts all fully-connected layers into convolutional layers, and adopts transposed convolutions for upsampling feature maps. Specifically, the output of a convolutional layer is calculated by
$Y^{j} = f\left(\sum_{i} X^{i} \ast W^{i,j} + b^{j}\right),$   (1)
where the operator ∗ represents 2-D convolution and f(·) is an element-wise non-linear activation function, e.g., the ReLU $f(x) = \max(0, x)$. $X^{i}$ is the ith input feature map and $Y^{j}$ is the jth output feature map. $W^{i,j}$ is a filter of size k × k and $b^{j}$ is the corresponding bias term. A transposed convolution performs the transformation in the opposite direction of a normal convolution. In the FCN [25], transposed convolutions are used to project feature maps back to a higher resolution, i.e., to upsample them. As shown in Fig. 2(a), the FCN model sequentially performs several layers of convolution and pooling on the image to extract multi-scale feature representations. Then the back-end layers perform several transposed convolutions to upsample the resolution-reduced feature maps. Finally, the prediction is obtained by applying a pixel-wise classification with a Softmax function.
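To make this pipeline concrete, the snippet below is a minimal PyTorch-style sketch of Eq. (1) followed by a transposed-convolution upsampling step and a pixel-wise Softmax. The channel counts, kernel sizes and two-class output are illustrative assumptions, not the exact FCN configuration.

```python
import torch
import torch.nn as nn

# Sketch of Eq. (1): a convolution (X * W + b) followed by a ReLU non-linearity,
# and a transposed convolution that upsamples the resolution-reduced features.
conv = nn.Sequential(
    nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1),  # X * W + b
    nn.ReLU(inplace=True),                                                   # f(.)
)
upsample = nn.ConvTranspose2d(in_channels=128, out_channels=2, kernel_size=4,
                              stride=2, padding=1)   # 2x upsampling to 2-class scores

x = torch.randn(1, 64, 125, 125)      # a batch of input feature maps X^i
y = conv(x)                           # output feature maps Y^j, Eq. (1)
scores = upsample(y)                  # per-pixel scores at doubled resolution
probs = torch.softmax(scores, dim=1)  # pixel-wise Softmax over the two classes
print(y.shape, probs.shape)           # (1, 128, 125, 125), (1, 2, 250, 250)
```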
Fig. 2. Comparisons of the FCN-8s [25] and our proposed TFCN. (a) Fully convolutional network (FCN). (b) The FCN-8s for semantic segmentation. (c) The proposed TFCN model for local saliency detection. Conv stands for convolution. The numbers inside each layer indicate the number of convolutional kernels. The arrows stand for skip-connections, aiming to fuse multi-level features. Different colors are used to distinguish the fused levels.
In [25], the authors introduce several skip-connections, which add high-level predictions to intermediary layers to generate predictions at multiple resolutions. The skip-connections significantly improve the semantic segmentation performance.

3.2. Tailored FCN (TFCN)

Our proposed TFCN is largely inspired by the FCN-8s model [25] for semantic segmentation. First, the goals of saliency detection and semantic segmentation are closely related: the goal of saliency detection is to extract the salient region from the background, whereas semantic segmentation is to distinguish different objects from the background. Second, both tasks produce pixel-level outputs; each pixel of the input image needs to be categorized into two or multiple classes. Thus, the pre-trained FCN-8s model can be utilized to provide prior information on generic objects. The original FCN-8s model introduces two independent skip-connections and adds high-level prediction layers to intermediary layers for generating high-resolution predictions, as shown in Fig. 2(b). More details can be found in [25]. To adapt to the saliency detection task, we introduce scale considerations and modify the structure of the original FCN-8s model. The major modifications include: (1) changing the filter size from 7 × 7 to 1 × 1 in the fc6 layer to enlarge the resolution of feature maps and keep abundant details; (2) discarding the drop6, fc7, relu7 and drop7 layers because of their insignificant contribution to our tasks; (3) introducing additional convolutional layers with 2 channels for each block because the proposed TFCN is expected to predict the scores for two classes (salient foreground or general background); and (4) connecting all the previous layers to integrate moderate semantic information. Fig. 2(b) and Fig. 2(c) illustrate the differences between the FCN-8s and the proposed TFCN.
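The TFCN itself is defined and trained in Caffe (Section 5.2). Purely for illustration, the PyTorch-style sketch below mimics modifications (1), (3) and (4) on a VGG-16 backbone: a 1 × 1 fc6 layer, a 2-channel score layer per block, and skip-connections that upsample and sum all score maps. The block splits and channel sizes are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class TFCNSketch(nn.Module):
    """Illustrative sketch of a TFCN-style head on top of a VGG-16 backbone."""
    def __init__(self):
        super().__init__()
        feats = torchvision.models.vgg16().features
        # Split the VGG-16 features into its five convolutional blocks.
        self.blocks = nn.ModuleList([feats[:5], feats[5:10], feats[10:17],
                                     feats[17:24], feats[24:31]])
        self.fc6 = nn.Conv2d(512, 1024, kernel_size=1)       # modification (1): 1x1 fc6
        # One 2-channel score layer per block (salient vs. background), mod. (3).
        dims = [64, 128, 256, 512, 512, 1024]
        self.scores = nn.ModuleList([nn.Conv2d(d, 2, kernel_size=1) for d in dims])

    def forward(self, x):
        h, w = x.shape[2:]
        feats, out = [], x
        for block in self.blocks:
            out = block(out)
            feats.append(out)
        feats.append(F.relu(self.fc6(out)))
        # Modification (4): connect all previous levels by upsampling every
        # score map to the input resolution and summing them.
        fused = sum(F.interpolate(score(f), size=(h, w), mode='bilinear',
                                  align_corners=False)
                    for score, f in zip(self.scores, feats))
        return torch.softmax(fused, dim=1)[:, 1]   # per-pixel saliency probability

saliency = TFCNSketch()(torch.randn(1, 3, 256, 256))   # shape (1, 256, 256)
```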
3.3. Prediction of local region saliency maps

The proposed TFCN aims to generate local saliency maps, which can be used for visual tracking. The overall procedure comprises pre-training the TFCN, extracting scale-dependent regions, and fusing discriminative saliency maps.

3.3.1. Pre-training TFCN

Essentially, the proposed TFCN is derived from the FCN-8s designed for the semantic segmentation task. Thus, directly applying it to salient object detection may lead to negative transfer [26]. To deal with this issue, we first pre-train the proposed TFCN on a well-built saliency dataset (described in Section 5). Formally, we are given the salient object detection dataset $S = \{(X_n, Y_n)\}_{n=1}^{N}$ with N training pairs, where $X_n = \{x_j^n, j = 1, \ldots, T\}$ and $Y_n = \{y_j^n, j = 1, \ldots, T\}$ are the input image and the corresponding ground truth with T pixels, respectively. $y_j^n = 1$ denotes a foreground pixel and $y_j^n = 0$ denotes a background pixel. For notational simplicity, we subsequently drop the subscript n and consider each image independently. We denote W as the parameters of the proposed TFCN. For the pre-training, we adopt the frequency-weighted loss [22], which can be expressed as
$L_f(W) = -\beta \sum_{j \in Y_+} \log \Pr(y_j = 1 \mid X; W) - (1 - \beta) \sum_{j \in Y_-} \log \Pr(y_j = 0 \mid X; W),$   (2)
where $Y_+$ and $Y_-$ denote the foreground and background label sets, respectively. The loss weight is $\beta = |Y_+|/(|Y_+| + |Y_-|)$, where $|Y_+|$ and $|Y_-|$ denote the numbers of foreground and background pixels, respectively. $\Pr(y_j = 1 \mid X; W) \in [0, 1]$ is the confidence score that measures how likely the pixel belongs to the foreground. For the saliency inference, the outputs of the last convolutional layer are utilized to distinguish the salient foreground from the general background.
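The following is a small sketch of Eq. (2), assuming two-class logits from the last convolutional layer; the function name and tensor shapes are our own illustrative choices.

```python
import torch

def frequency_weighted_loss(logits, target):
    """Sketch of the frequency-weighted loss of Eq. (2).

    logits: (N, 2, H, W) two-class scores from the last convolutional layer.
    target: (N, H, W) binary ground truth (1 = salient foreground, 0 = background).
    The weight beta = |Y+| / (|Y+| + |Y-|) follows the definition in the text.
    """
    log_prob = torch.log_softmax(logits, dim=1)      # log Pr(y_j | X; W)
    pos = (target == 1).float()
    neg = (target == 0).float()
    beta = pos.sum() / (pos.sum() + neg.sum())
    loss = -(beta * (pos * log_prob[:, 1]).sum()
             + (1.0 - beta) * (neg * log_prob[:, 0]).sum())
    return loss

# Toy usage with random scores and a random binary mask.
logits = torch.randn(2, 2, 64, 64)
target = (torch.rand(2, 64, 64) > 0.8).long()
print(frequency_weighted_loss(logits, target))
```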
3.3.2. Extracting scale-dependent regions

The human visual system suggests that the size of visual receptive fields significantly affects the fixation mechanisms. Fig. 3(a) shows a typical image with structural and hierarchical characteristics. The human visual system focuses on the fried sunny-side up eggs and makes these regions salient, as shown on the right. However, if we zoom in on a specific region (shown in Fig. 3(b)), we may select the area of the egg yolks as the most salient region or obtain an inconspicuous activation. This scale-sensitivity of visual perception motivates us to present a novel image representation method and develop a saliency detection model with different spatial layouts and scale variations. More specifically, we exploit a multi-region target representation scheme as shown in Fig. 4. We divide each image into seven parts and calculate seven saliency maps over these regions. The detailed configuration is as follows: (1) the first saliency map is obtained from the entire image region; (2) the next four saliency maps are calculated on the four equal parts to introduce spatial information; (3) the last two saliency maps are extracted from the inside and outside areas to highlight the scale support. In addition, we introduce a multiple-scale mechanism into each part to enhance the diversity of the region representation, as shown in Fig. 1. N scales are sampled to generate multiple regions with the object center $(l_x, l_y)$ and size $(w_0, h_0)$. In this paper, we experimentally chose region sizes $(n \times w, n \times h)$, where $w = \frac{1}{4} w_0$, $h = \frac{1}{4} h_0$, and $n = 1, 2, \ldots, N$.
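A hedged sketch of this multi-scale multi-region sampling is given below. The exact definition of the inner and outer areas is not spelled out in the text, so the central-half split used here is an assumption; function and argument names are ours.

```python
import numpy as np

def multi_scale_multi_region_crops(image, center, base_size, num_scales=6):
    """Sketch of the multi-scale multi-region sampling (Fig. 4).

    For every scale n, a window of size (n*w0/4, n*h0/4) centered on the target
    is cropped and split into seven parts: the whole window, its four quadrants,
    an inner (center) area and an outer (border) area.
    """
    lx, ly = center
    w0, h0 = base_size
    H, W = image.shape[:2]
    crops = []
    for n in range(1, num_scales + 1):
        w, h = int(n * w0 / 4), int(n * h0 / 4)
        x0, y0 = max(0, lx - w // 2), max(0, ly - h // 2)
        x1, y1 = min(W, lx + w // 2), min(H, ly + h // 2)
        window = image[y0:y1, x0:x1]
        wh, ww = window.shape[:2]
        regions = [
            window,                                   # (1) whole window
            window[:wh // 2, :ww // 2],               # (2)-(5) four equal quadrants
            window[:wh // 2, ww // 2:],
            window[wh // 2:, :ww // 2],
            window[wh // 2:, ww // 2:],
            window[wh // 4:3 * wh // 4, ww // 4:3 * ww // 4],  # (6) inner area
            window.copy(),                            # (7) outer area (inner masked out)
        ]
        regions[6][wh // 4:3 * wh // 4, ww // 4:3 * ww // 4] = 0
        crops.append(regions)
    return crops   # crops[n-1][m] is region m at scale n

regions = multi_scale_multi_region_crops(np.zeros((480, 640, 3), np.uint8),
                                          center=(320, 240), base_size=(160, 120))
```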
Fig. 3. Illustration of the effect with varied receptive fields. (a) Original image with the salient region. (b) Top: Three zoomed regions. Bottom: Corresponding saliency maps.
Fig. 4. Illustration of the multi-region representation.
Fig. 5. Illustration of saliency maps. From left to right: (a) original image; (b) excitation map; (c) inhibition map; (d) saliency map; (e) texture map generated by fuzzy logical filters; (f) final saliency map after domain transform; and (g) ground truth.
3.3.3. Fusing saliency maps with weighted entropy

Based on the proposed multi-scale multi-region scheme, we can obtain M × N saliency maps ($S_{m,n}$, $m = 1, \ldots, M$, $n = 1, \ldots, N$) in total, where M denotes the number of regions for describing spatial layouts and N is the number of sampled scales. Each saliency map $S_{m,n}$ is calculated based on the TFCN model, i.e.,

$S_{m,n} = \mathrm{padding}(ME_{m,n} - MI_{m,n}),$   (3)

where $ME_{m,n}$ and $MI_{m,n}$ are the excitation map and the inhibition map obtained from the TFCN outputs, respectively. Note that each regional saliency map is re-arranged to the same location of the raw image, and the other region in the saliency map is padded with zeros to guarantee that different saliency maps are of equal sizes. We observe that ME and MI have reciprocal properties, and the proposed strategy is able to eliminate several noises introduced by the transposed convolution operations, as shown in Fig. 5(b-d). To integrate the information of different regions, we perform a pixel-wise summation within the same scale, and then drop the negative values by

$S_n = \max\left(\sum_{m=1}^{M} S_{m,n},\ 0\right),$   (4)

where the max(·) operator avoids model degradation and retains the high-confidence regions. Finally, we exploit a weighted entropy strategy to combine the multi-scale saliency maps,

$S = \sum_{n} w_n S_n,$   (5)

where S is the fused saliency map and $w_n$ is the weight of the nth scale ($w_n \geq 0$, $\sum_n w_n = 1$). Generally, a more important saliency map has a larger weight assigned to it. Thus, we utilize the weighted entropy to measure the discriminative ability of the fused saliency map. Let $\mathbf{w} = [w_1, w_2, \ldots, w_N]^{T}$ be the learnable weights; the weighted entropy is defined as

$H(\mathbf{w}) = -\psi(\mathbf{w}) \sum_{i} s_i^{\alpha+1} \ln s_i,$   (6)

where $\psi(\mathbf{w}) = 1/\sum_{i} s_i^{\alpha}$ denotes a normalization term, $\alpha$ denotes a constant and $s_i$ is a function of $\mathbf{w}$ at the ith entropy state; here we choose $s_i = \sum_{n} w_n S_{i,n}$, $i = 1, \ldots, K$. To reduce computational complexity, we follow the work in [27], and set $\alpha = 1$ and $K = 4$, obtaining $\psi(\mathbf{w}) = 1/\sum_{i} s_i^{\alpha} = 1$. A small weighted entropy indicates that the saliency map is highly different from the others, i.e., more discriminative. The optimal weight vector $\mathbf{w}$ can be obtained by minimizing the objective function (Eq. 6), which can be effectively solved with the iterative gradient descent method [27].
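To illustrate Eqs. (4)-(6), the sketch below fuses per-scale maps with weights found by a simple projected gradient descent on the weighted entropy (treating the normalization term as constant, as the paper does for alpha = 1). The step-size heuristic and simplex projection are our own illustrative choices, not the authors' exact solver.

```python
import numpy as np

def fuse_with_weighted_entropy(scale_maps, alpha=1.0, iters=100, lr=0.1):
    """Sketch of the weighted-entropy fusion of Eqs. (4)-(6).

    scale_maps: list of N per-scale maps S_n (Eq. 4), each of shape (H, W).
    The scale weights w (w_n >= 0, sum w_n = 1) are updated by gradient descent
    on H(w), then projected back onto the simplex after each step.
    """
    S = np.stack([np.clip(m, 1e-6, None) for m in scale_maps])   # (N, H, W)
    w = np.full(len(scale_maps), 1.0 / len(scale_maps))
    for _ in range(iters):
        fused = np.tensordot(w, S, axes=1)                 # s_i = sum_n w_n S_{i,n}
        s = np.clip(fused.ravel(), 1e-6, None)
        # dH/dw_n = -sum_i ((alpha+1) * s_i^alpha * ln s_i + s_i^alpha) * S_{i,n}
        grad_s = -((alpha + 1.0) * s ** alpha * np.log(s) + s ** alpha)
        grad_w = np.tensordot(S.reshape(len(w), -1), grad_s, axes=([1], [0]))
        w = np.clip(w - lr * grad_w / (np.abs(grad_w).max() + 1e-12), 0.0, None)
        w /= w.sum() + 1e-12                               # project to the simplex
    return np.tensordot(w, S, axes=1), w                   # fused map S (Eq. 5)

fused, weights = fuse_with_weighted_entropy([np.random.rand(64, 64) for _ in range(6)])
```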
After obtaining the fused saliency map S, we utilize the domain transform technique to enhance its spatial consistency. Specifically, a high-quality edge-preserving filter [28] is applied to S with the texture map generated by fuzzy logical filters. As shown in Fig. 5(d-f), several holes or disconnected regions can be filled after the domain transform.

4. Non-rigid online tracking model

In this section, we elaborate the proposed non-rigid object tracking method in detail. We develop our online tracking method based on the generated spatial-temporal consistent saliency map. We will show how to initialize the state of target objects with the discriminative saliency map, and how to utilize spatial-temporal saliency information for object tracking. The overall framework of the proposed method is shown in Fig. 1. The critical components and discussions are presented in the following subsections.

4.1. Tracker initialization

Given the center location $(x_1, y_1)$ and specified region $R_1$ of the target object in the first frame, we first crop an image patch $P_1$ centered at the target location with 1.5 times the target size ($W_1 \times H_1$). Subsequently, the image patch $P_1$ is fed forward into the TFCN to generate the initial saliency map $S_1$. To improve the robustness of the initial saliency map, Grabcut [7] is applied to obtain a foreground mask $M_1$ using the map $S_1$ as a prior. Finally, the TFCN is fine-tuned with the intersection region of $S_1$ and $M_1$ in the first frame, which provides more accurate information of the target. Note that we only use Grabcut in the first frame.
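A possible implementation of this initialization step, using OpenCV's GrabCut with the saliency map as a mask prior, is sketched below. The probability thresholds (0.5 and 0.9) and the function name are assumptions for illustration, not values from the paper.

```python
import numpy as np
import cv2

def init_mask_with_grabcut(patch, saliency, iters=5):
    """Sketch of the first-frame initialization: refine the TFCN saliency map S1
    with GrabCut [7], using the saliency values as a prior."""
    mask = np.full(saliency.shape, cv2.GC_PR_BGD, np.uint8)   # probable background
    mask[saliency > 0.5] = cv2.GC_PR_FGD                      # probable foreground
    mask[saliency > 0.9] = cv2.GC_FGD                         # confident foreground
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(patch, mask, None, bgd_model, fgd_model, iters,
                cv2.GC_INIT_WITH_MASK)
    m1 = np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))           # GrabCut mask M1
    return np.logical_and(m1, saliency > 0.5)                 # intersection with S1

patch = (np.random.rand(128, 128, 3) * 255).astype(np.uint8)
saliency = np.random.rand(128, 128)
fg = init_mask_with_grabcut(patch, saliency)
```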
4.2. Target localization

Due to the assumption of spatial-temporal consistency in visual tracking, the shape and deformation information of the tracked object in previous frames can be utilized to predict the new state in the current frame. After cropping the same region in the tth frame, we determine the state of the tracked object with a spatial-temporal consistent saliency map (STCSM), which is defined as

$S_t^{STC} = S_t + \sum_{k=t-\tau-1}^{t-1} \beta(k)\, S_k^{STC}, \quad \beta(k) = \frac{1}{c^{k}},$   (7)
where St denotes the saliency map in the current frame, which can be obtained by the saliency detection method presented in Section 3.3. StST C is the STCSM up to the tth frame, τ is the accumulated time interval, and β (k) corresponds to the weights of previous STCSMs with a decay factor c. We note that β (k) imposes higher weights for recent frames and lower weights for previous frames. In addition, thanks to the availability of the TFCN, St can be used in the generation of the accumulated saliency map before determining the location of the target object. This is very different from most of existing trackers which only take previous information into account and directly detect the location in the current frame. The proposed accumulated saliency maps are also pixel-wise maps, having the structural property and computational scalability. They maintain more discriminative information between the target and backgrounds. For target localization, we can directly treat the saliency mask obtained by the STCSM method as the tracking result in the current frame. This strategy gives more accurate tracking states, which are very appropriate for non-rigid object tracking. The obtained saliency region provides not only accurate tracking states but also detailed segmentation masks concentrating the tracked object
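A minimal sketch of the STCSM accumulation is given below, with tau and c taken from Section 5.2. Applying the decay per frame offset (rather than per absolute frame index) is one reading of Eq. (7) and should be treated as an assumption.

```python
from collections import deque
import numpy as np

class STCSM:
    """Sketch of the spatial-temporal consistent saliency map of Eq. (7):
    the current saliency map S_t plus a decayed sum of the previously
    accumulated maps within the time interval tau."""
    def __init__(self, tau=4, c=1.1):
        self.tau = tau
        self.c = c
        self.history = deque(maxlen=tau)   # previous accumulated maps S^STC_k

    def update(self, s_t):
        acc = s_t.copy()
        # Older maps receive smaller weights through the decay factor c.
        for offset, prev in enumerate(reversed(self.history), start=1):
            acc += prev / (self.c ** offset)
        self.history.append(acc)
        return acc

stcsm = STCSM()
for _ in range(10):
    mask = stcsm.update(np.random.rand(64, 64)) > 0.5   # threshold for localization
```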
For target localization, we can directly treat the saliency mask obtained by the STCSM method as the tracking result in the current frame. This strategy gives more accurate tracking states, which are very appropriate for non-rigid object tracking. The obtained saliency region provides not only accurate tracking states but also detailed segmentation masks covering the tracked object (Fig. 6(d)). It achieves a better description of the non-rigid objects and reduces the background pollution for the tracking model. In addition, based on the saliency mask, we can generate a bounding box prediction by computing the tightest rectangle containing all target pixels. Although this may not be the best way to obtain the optimal bounding box, we believe that it is at least reasonable for the localization. Afterwards, the center of this rectangle is treated as the location of the tracked target. The latter strategy can be used for generic tracking, and facilitates comparisons between our method and other bounding box-based trackers.

4.3. Online model update

To update the proposed tracker for online adaptation, we train the TFCN for saliency detection and object tracking iteratively. The TFCN is updated based on the information from the two tasks. More specifically, we first convert the STCSM into a binary map, and treat the resulting binary map as the ground truth in the current frame. Then, we fine-tune the last two layers of the TFCN with a softmax cross-entropy loss. Following previous works [29,30], we utilize the tracked regions of the most recent 20 frames for the fine-tuning. Because we have only one labeled image pair in each frame, fine-tuning the TFCN with this image pair alone tends to overfit. Thus, we employ data augmentation by mirror reflection and rotation techniques. The learning parameters can be found in Section 5.2. The overall process of our tracking system is summarized in Algorithm 1.

Algorithm 1 Our non-rigid object tracking approach.
Input: Frames $I_t$, t ∈ [1, T], initial location $(x_1, y_1)$ and region $R_1$.
Output: Object mask $O_t$ and compact bounding box $B_t$, t ∈ [2, T].
 1: for each t = 2, ..., T do
 2:   Extract patch $P_t$ based on $R_{t-1}$ by image cropping.
 3:   Fuse saliency maps by
 4:     1) Feed-forward $P_t$ to the TFCN for local saliency maps.
 5:     2) Minimize the objective in Eq. 6 to seek the weights w.
 6:     3) Obtain the fused saliency map $S_t$ by Eq. 5.
 7:     4) Perform domain transform on $S_t$.
 8:   Compute $S_t^{STC}$ by Eq. 7.
 9:   Obtain $O_t$ by thresholding $S_t^{STC}$.
10:   Obtain $B_t$, center location $(x_t, y_t)$ and $R_t$ through $O_t$.
11:   Augment labeled image pairs.
12:   Fine-tune the TFCN with augmented labeled image pairs.
13: end
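For completeness, the localization step in line 10 of Algorithm 1 (the tightest rectangle containing all target pixels, described in Section 4.2) can be written as a few lines; the function name is ours.

```python
import numpy as np

def mask_to_bounding_box(mask):
    """Sketch of the localization step: the tightest rectangle containing all
    target pixels of the binary mask, plus its center as the new location."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None                        # empty mask: no target found
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    center = ((x0 + x1) // 2, (y0 + y1) // 2)
    return (x0, y0, x1 - x0 + 1, y1 - y0 + 1), center   # (x, y, w, h), (cx, cy)

box, center = mask_to_bounding_box(np.random.rand(100, 100) > 0.7)
```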
4.4. Differences with existing works

First, our method significantly differs from video saliency detection algorithms [31,32] in three aspects: (1) Our aim is to adaptively track a single object of interest, while video saliency detection is to capture all salient objects in a scene; (2) The proposed method focuses on simultaneous local saliency detection and visual tracking, while video saliency detection always emphasizes obtaining motion saliency maps; (3) The video sequences for visual tracking are usually more complicated, which makes the video saliency detection method incapable of capturing the tracked objects. The representative results of video saliency detection are shown in Fig. 6(b). Second, our method has obvious advantages compared with recent deep saliency-based trackers [6,24]. In [6], saliency maps are derived from features of the fixed fully connected layers, which are often very noisy (Fig. 6(c)). The method in [24] treats dynamic local saliency maps as regularization for correlation filters. However,
Fig. 6. Typical saliency maps for visual tracking. From left to right: (a) input frames; (b) video saliency detection [31]; (c) target-specific saliency map [6]; (d) our proposed method.
they still focus on bounding box-based object tracking, losing the accurate object shapes. In contrast to previous works, our proposed saliency model is directly learned from raw images in an end-to-end manner. In addition, the introduction of scale estimation and temporal consistency makes the obtained saliency maps dense and edge-preserving, which effectively facilitates non-rigid object tracking. Fig. 6 illustrates that our method achieves better visual effects and location accuracy compared with other methods.
5. Experiments

In this section, we present experimental results to validate the effectiveness of our method. First, we describe the datasets for salient object detection and visual object tracking. Then, we give the implementation details with parameter settings. Third, we test and compare our proposed approach with other state-of-the-art methods. Both quantitative and qualitative analyses of each component are also presented to show its effectiveness.

5.1. Datasets

To pre-train the TFCN, we construct a new large-scale saliency detection dataset based on the THUS [33] and object extraction (OE) [34] datasets. The mirror reflection and rotation techniques (0◦, 90◦, 180◦, 270◦) are used for data augmentation, resulting in a total of 161,464 training images. To evaluate the performance, we adopt the widely-used DUT-OMRON [35], ECSSD [36], HKU-IS-TE [15] and SOD [37] datasets. Other datasets with multiple objects are also frequently used to evaluate saliency detection methods; however, in this paper we focus on single salient object detection, which is closely related to the online tracking problem. The non-rigid object tracking (NROT) dataset [4] and the DAVIS-2016 dataset [38] are adopted to evaluate the tracking performance for non-rigid and articulated objects. In addition, the popular OTB-50 [39] and VOT-2018 [40] datasets are used to evaluate the generalization ability of our tracker. Due to the limitation of space, we refer the readers to the corresponding papers for more details of these datasets.

5.2. Implementation details

We implement our approach on the MATLAB R2014b platform with the Caffe toolbox [41]. Before feeding training images into the TFCN, each image is subtracted by the ImageNet mean [42] and resized to the same resolution (500 × 500). Correspondingly, we also resize the ground truth to the same size. The maximum edge of the ROIs is restricted to 256, considering that the tracked targets may be relatively small or extremely large. Other critical parameters are experimentally set as follows: the number of rectangular regions N = 6, the accumulated time interval τ = 4, and the decay factor c = 1.1. For pre-training the TFCN, we use stochastic gradient descent (SGD) with a momentum of 0.9, weight decay of 0.0005, and mini-batch size of 8. We set the base learning rate to 1e−8 and decrease it by 0.1 when the training loss reaches a plateau. The training process converges after 100k iterations. During online visual tracking, we set the maximum iteration to 100, the batch size to 1, and the learning rate to 1e−12, and keep the other parameters fixed as in saliency detection. We performed a grid search for all the above parameters and found that these settings work best for our approach, so we fix them during our experiments. We run our approach on a quad-core PC with an i7-4790 CPU (16G memory) and an NVIDIA Titan X GPU (12G memory). The pre-training process of our TFCN takes almost 6 hours. The overall tracking algorithm runs at approximately 7 fps. For reproduction, we release the source codes and compared results at https://github.com/Pchank/TFCNTracker.
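The settings above can be summarized in a small configuration sketch. The BGR mean values in the pre-processing are the commonly used VGG/ImageNet means and should be treated as an assumption, since the exact values are not stated in the text; all names below are ours.

```python
import numpy as np
import cv2

IMAGENET_MEAN_BGR = np.array([104.0, 117.0, 123.0], dtype=np.float32)  # assumed means

def preprocess(image_bgr, size=500):
    """Resize to 500 x 500 and subtract the (assumed) ImageNet mean."""
    resized = cv2.resize(image_bgr, (size, size)).astype(np.float32)
    return resized - IMAGENET_MEAN_BGR

PRETRAIN_SOLVER = dict(type="SGD", momentum=0.9, weight_decay=0.0005,
                       batch_size=8, base_lr=1e-8, lr_decay=0.1, max_iter=100000)
ONLINE_SOLVER = dict(type="SGD", momentum=0.9, weight_decay=0.0005,
                     batch_size=1, base_lr=1e-12, max_iter=100)
TRACKER_PARAMS = dict(num_scales=6, tau=4, decay_c=1.1, max_roi_edge=256)
```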
Table 1
Quantitative results on the compared datasets. The best results are in bold. "–" means the corresponding method is trained on that dataset. ∗ means using extensive post-processing.

              DUT-OMRON             ECSSD                 HKU-IS-TE             SOD
Methods       Fη↑    MAE↓   Sλ↑    Fη↑    MAE↓   Sλ↑    Fη↑    MAE↓   Sλ↑    Fη↑    MAE↓   Sλ↑
Ours          0.702  0.083  0.782  0.882  0.050  0.901  0.878  0.039  0.910  0.792  0.120  0.782
Amulet [19]   0.647  0.098  0.771  0.868  0.059  0.894  0.843  0.050  0.886  0.745  0.144  0.753
DCL∗ [43]     0.684  0.157  0.743  0.829  0.149  0.863  0.853  0.136  0.859  0.741  0.194  0.748
DHS [17]      –      –      –      0.872  0.060  0.884  0.854  0.053  0.869  0.775  0.129  0.750
DS∗ [14]      0.603  0.120  0.741  0.826  0.122  0.821  0.787  0.077  0.854  0.698  0.189  0.712
DSS∗ [44]     0.729  0.066  0.765  0.904  0.052  0.882  0.902  0.041  0.878  0.788  0.123  0.744
ELD [16]      0.611  0.092  0.743  0.810  0.080  0.839  0.776  0.072  0.823  0.712  0.155  0.705
LEGS [45]     0.592  0.133  0.701  0.785  0.118  0.787  0.732  0.118  0.745  0.683  0.196  0.657
MCDL [46]     0.625  0.089  0.739  0.796  0.101  0.803  0.760  0.091  0.786  0.677  0.181  0.650
MDF∗ [15]     0.644  0.092  0.703  0.807  0.105  0.776  0.802  0.095  0.779  0.721  0.165  0.674
RFCN∗ [18]    0.627  0.111  0.752  0.834  0.107  0.852  0.838  0.088  0.860  0.743  0.170  0.730
UCF [47]      0.621  0.120  0.748  0.844  0.069  0.884  0.823  0.061  0.874  0.738  0.148  0.762
5.3. Experimental results on saliency detection

Our approach is compared with 11 other state-of-the-art saliency detection methods, including Amulet [19], DCL [43], DHS [17], DS [14], DSS [44], ELD [16], LEGS [45], MCDL [46], MDF [15], RFCN [18] and UCF [47]. For model details, we refer the readers to the corresponding papers. We note that: (1) for a fair comparison, we utilize either the implementations with recommended parameter settings or the saliency maps provided by the corresponding authors; (2) we perform these experiments to highlight the effectiveness of our saliency maps for object tracking, not to beat the current best deep learning-based saliency detection methods. To evaluate the performance, we adopt four metrics, including the widely used precision-recall (PR) curves, the F-measure, the mean absolute error (MAE) [8] and the recently proposed S-measure [48]. The precision and recall are computed by thresholding the predicted saliency map and comparing the binary map with the ground truth. The F-measure is defined as
$F_\eta = \frac{(1 + \eta^2) \times \mathrm{Precision} \times \mathrm{Recall}}{\eta^2 \times \mathrm{Precision} + \mathrm{Recall}},$   (8)
where $\eta^2$ is set to 0.3 to weigh precision more than recall, as suggested in [8]. We note that a higher F-measure means that the corresponding algorithm can capture more valid salient regions. Thus, achieving a high F-measure is very helpful for online tracking. For a fair comparison on non-salient regions, we also calculate the mean absolute error (MAE) by
$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - G(x, y) \right|,$   (9)
where W and H are the width and height of the input image, and S(x, y) and G(x, y) are the pixel values of the saliency map and the binary ground truth at (x, y), respectively. To evaluate the structural similarities of saliency maps, we calculate the S-measure [48], defined as

$S_\lambda = \lambda \cdot S_o + (1 - \lambda) \cdot S_r,$   (10)

where $S_o$ and $S_r$ are the object-aware and region-aware structural similarities, respectively, and $\lambda$ is a balance parameter set to 0.5.
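For reference, Eqs. (8) and (9) can be computed as follows. The fixed threshold used here is a simplification for illustration; the benchmark sweeps thresholds to draw the PR curve, and the S-measure follows the definition in [48].

```python
import numpy as np

def f_measure(saliency, gt, eta2=0.3, threshold=0.5):
    """Sketch of Eq. (8): precision/recall from a thresholded saliency map."""
    pred = saliency >= threshold
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + eta2) * precision * recall / (eta2 * precision + recall + 1e-8)

def mae(saliency, gt):
    """Sketch of Eq. (9): mean absolute error over all pixels."""
    return np.abs(saliency.astype(np.float32) - gt.astype(np.float32)).mean()

saliency = np.random.rand(200, 200)
gt = np.zeros((200, 200), bool)
gt[50:150, 50:150] = True
print(f_measure(saliency, gt), mae(saliency, gt))
```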
Fig. 7 illustrates the PR curves of the different methods on the compared datasets. From the results, we can see that the proposed method achieves better performance than the other competing ones. In addition, Tab. 1 shows the quantitative performance. As can be seen, our method delivers impressive performance in terms of F-measure and S-measure, which are more related to visual tracking. We also observe that DSS [44] is better than our method in terms of F-measure. However, it adopts a more powerful FCN than our TFCN, and many tricks are used to improve its performance, such as dense skip connections at all levels, deeply supervised learning, a conditional random field, etc. The tricks in DSS are certainly complementary to our method; we believe that with these tricks, our method could further improve its performance. We note that our method mainly aims to improve the tracking performance, not to beat other state-of-the-art saliency detection methods. Besides, due to the introduction of many tricks, DSS is very time-consuming and memory-intensive (please see the quantitative results in [44]). Several visual examples are shown in Fig. 8, which clearly demonstrate that the results obtained by our method are much closer to the ground truth. The predicted saliency maps can convincingly identify salient objects and provide accurate saliency regions.

5.4. Experimental results on visual object tracking

5.4.1. Non-rigid object tracking

The NROT dataset [4] includes 11 challenging image sequences with pixel-wise annotations in each frame. The bounding box annotations are generated by computing the tightest rectangular boxes containing all target pixels, which is consistent with our settings. In addition, we select the DAVIS-2016 dataset [38] as an additional benchmark to diversify the appearances of non-rigid objects.

Results on the DAVIS-2016 dataset. For the DAVIS-2016 dataset, we compare our method with 11 state-of-the-art ones (MSK [49], SFL [50], FAVOS [51], RGMP [52], PML [53], OSMN [54], PLM [55], VPN [56], OnAVOS [57], OSVOS [58] and PReMVOS [59]). For performance evaluation, we use the Jaccard mean metric (J mean) to express region similarity and the F-measure mean metric (F mean) to express contour accuracy, following the benchmark setting [38]. Tab. 2 presents the quantitative results. From the results, we can see that our method performs better than the other methods on the DAVIS-2016 dataset. Our method achieves over 1% relative improvement over top-ranked approaches such as FAVOS, RGMP, OnAVOS and OSVOS. PReMVOS achieves the best result in terms of F mean but at a low speed; meanwhile, our method shows a very comparable F mean to PReMVOS. Besides, our method runs the second fastest among all compared methods, being more efficient than most state-of-the-art ones. This demonstrates that our method is robust in predicting moving targets, and can be extended to perform object tracking for long sequences.

Results on the NROT dataset. For the NROT dataset, we compare our method with 8 segmentation-based algorithms (HT [3], SPT [2], PT [1], OGBT [4], RGMP [52], OnAVOS [57], OSVOS [58] and PReMVOS [59]) and 7 bounding box-based trackers (DSMT [6], FCNT [10], HCFT [12], LSART [29], DRT [30], SCF [60], MTPCF [61]).
Fig. 7. Comparisons of PR curve metrics with different methods. Our method consistently outperforms other compared methods.
Fig. 8. Representative results of the compared methods for saliency detection. Our method consistently produces saliency maps closest to the ground truth. Table 2 Results on the DAVIS-2016 dataset. All numbers are presented in terms of percentage (%). Speed is measured in frames per second (fps). For each row, the best result is in bold.
             VPN    PLM    MSK    SFL    FAVOS  RGMP   PML    CTN    LVO    OnAVOS  OSVOS  PReMVOS  Ours
J mean↑      70.2   70.2   79.7   76.1   82.4   81.5   75.5   73.5   75.9   86.1    79.8   84.9     88.5
F mean↑      65.5   62.5   75.4   76.0   79.5   82.0   79.3   69.3   72.1   84.9    80.6   88.6     87.3
Speed↑       1.5    6.5    0.1    0.1    0.8    8.0    3.6    0.03   0.31   0.08    2.5    0.01     7.1
We follow the settings in [4], and adopt two overlap ratio metrics to evaluate the proposed method and the other competing ones on the NROT dataset. The bounding box overlap ratio is used to compare all trackers, whereas the segmentation overlap ratio is adopted for comparing non-rigid object tracking algorithms with segmentation outputs. Tab. 3 and Tab. 4 report the average overlap ratio results on the NROT dataset. From the results, we have two fundamental observations: (1) the proposed method outperforms all compared algorithms in both the bounding box and segmentation overlap ratios on all sequences; (2) other deep learning-based trackers do not utilize segmentation masks, yet their performance on this non-rigid dataset is still competitive; nevertheless, our tracker is more suitable for tracking deformable and articulated objects.
Table 3 Bounding box overlap ratio on the NROT dataset. All numbers are presented in terms of percentage (%). For each row, the best result is in bold.
Sequence      HT    SPT   PT    OGBT  RGMP  OnAVOS OSVOS PReMVOS DSMT  FCNT  HCFT  LSART DRT   SCF   MTCPF Ours
Cliff-dive1   61.0  66.5  29.9  75.9  63.1  71.2   58.1  76.8    67.3  62.3  68.1  77.1  76.8  74.3  70.2  78.6
Cliff-dive2   52.0  30.3  13.4  49.3  55.3  52.1   42.0  52.3    37.6  36.2  40.7  48.6  49.6  50.2  44.5  56.7
Diving         7.9  35.2  12.3  50.6  24.6  38.5   46.3  42.1    41.3  24.3  34.5  51.8  52.6  48.1  50.0  54.6
Gymnastics    10.4  42.6  26.0  70.4  72.8  56.8   64.9  77.9    72.1  56.6  71.2  76.6  77.2  68.7  76.3  78.1
High-jump     39.1   5.3   0.6  51.5  44.2  48.2   47.3  49.8    43.6  44.2  48.9  51.0  48.6  51.2  46.4  52.0
Motocross1    55.1  11.2   3.6  62.4  70.3  67.3   62.1  71.4    68.8  67.6  68.5  68.3  71.0  55.6  64.1  72.6
Motocross2    62.0  46.1  21.3  72.1  73.8  76.3   70.2  76.4    71.3  67.9  70.7  72.9  73.4  69.1  55.4  76.7
Mtn-bike      48.6  53.0  12.9  61.2  56.4  69.0   67.6  72.3    69.4  68.5  72.1  72.6  71.5  68.9  64.8  73.8
Skiing        50.2  27.7  21.3  39.0  40.6  48.6   42.0  54.6    32.6  54.1  52.6  53.1  51.2  48.2  41.5  55.5
Transformer   63.4  55.8  13.9  86.6  71.5  82.5   82.7  88.4    82.4  72.8  84.1  81.8  85.6  79.6  87.5  90.5
Volleyball    27.7  26.8  15.2  46.2  48.6  49.8   45.3  50.7    38.5  45.3  49.8  50.1  49.8  32.0  45.9  51.4
Average       43.4  36.4  15.5  60.5  56.5  60.0   57.1  64.8    56.8  54.5  60.2  64.0  64.3  58.7  58.8  67.3
Fig. 9. Qualitative results of the proposed tracker and other segmentation-based methods.
Fig. 9 shows representative screenshots of our tracker and the other competing ones. The screenshots reveal that our tracker achieves smoother visual effects and more accurate predictions. More specifically, the boundary of the tracked targets is clearly highlighted in most sequences, such as Gymnastics, High-jump and Motocross1. It is worth noting that in challenging sequences such as MotorRolling and Transformer, most methods fail to track the targets, while our approach successfully locates the targets in terms of
either precision or overlap. For the more challenging sequence, i.e., Diving, the other trackers fail and cannot re-locate the target, whereas our tracker can automatically recover the tracked target.

5.4.2. Evaluation of different components

To further verify the contribution of each component in our model, we also implement different variants of the proposed
Table 4
Segmentation overlap ratio on the NROT dataset. All numbers are presented in terms of percentage (%). For each row, the best result is in bold.

Sequence      HT    SPT   PT    OGBT  RGMP  OnAVOS OSVOS PReMVOS FCN-8s TFCN  +Grabcut +Grabcut+MS +Grabcut+MS+AM Overall
Cliff-dive1   64.2  54.6  60.1  67.6  68.3  70.7   70.9  71.8    62.4   67.1  67.5     70.3        71.4           72.9
Cliff-dive2   49.4  41.8  16.0  36.7  48.8  49.5   49.1  50.3    33.6   34.8  35.1     40.9        45.8           51.8
Diving         6.7  21.2  25.5  44.1  40.2  45.1   43.6  44.3    28.5   30.7  32.2     38.1        43.0           45.5
Gymnastics     9.2  10.6  52.0  69.8  74.3  73.4   72.8  74.9    58.0   59.1  59.6     68.4        72.7           75.2
High-jump     40.4  52.1   0.9  42.8  47.5  48.9   44.5  53.6    36.2   37.9  37.5     43.1        45.9           48.1
Motocross1    52.2   8.9   1.4  53.1  60.1  57.3   51.3  59.1    53.2   55.6  56.2     57.9        59.3           60.6
Motocross2    53.0  37.1  39.7  64.5  44.5  68.7   63.1  69.6    64.8   65.1  65.8     68.3        71.0           71.5
Mtn-bike      53.4  43.0  32.1  54.9  53.1  56.2   51.2  57.2    53.9   55.2  55.4     56.5        58.0           58.7
Skiing        41.0  37.3  43.0  32.1  42.5  47.7   39.7  48.4    39.4   44.3  44.6     46.9        48.0           48.6
Transformer   45.0   2.8   5.5  74.0  74.2  73.2   71.0  72.6    53.1   56.8  57.3     72.2        74.4           77.5
Volleyball    31.1   6.5  25.1  41.1  45.1  46.2   38.4  45.2    42.3   44.6  45.1     45.7        46.1           46.6
Average       40.5  28.7  27.4  52.8  54.4  57.9   54.1  58.8    47.8   50.1  50.6     55.3        57.8           59.7
method and report their average segmentation overlap ratios on the NROT dataset. The variants include the following: (1) TFCN denotes the proposed baseline algorithm which only uses the TFCN predictions and performs the object tracking in each frame; (2) TFCN+Grabcut introduces Grabcut [7] for the additional initialization; (3) TFCN+Grabcut+MS additionally fuses the multi-scale dependent saliency maps, which adopt the ROI as the input; (4) TFCN+Grabcut+MS+AM stands for the enhanced method with the accumulated saliency maps; (5) Overall stands for our final model with the additional weighted entropy fusion. To verify the effectiveness of the TFCN, we also add a baseline with the FCN-8s.

Tab. 4 shows the quantitative results. Comparing the results of FCN-8s and TFCN, we can see that our TFCN consistently improves the tracking performance on all sequences. This result confirms that introducing more skip-connections can help the model capture more information for saliency detection [19]; with more accurate saliency maps, higher tracking results can be achieved. In addition, the comparison of TFCN and TFCN+Grabcut shows that adding Grabcut improves the tracking performance slightly. With the multi-scale multi-region method, our proposed model significantly improves the tracking performance by about 5% in average segmentation overlap ratio. Furthermore, the integration of scale-dependent saliency maps and accumulated operations boosts the tracking performance from 50.6 to 57.8. For this result, we performed a grid search over the scale number, accumulated interval and decay factor, and found that N = 6, τ = 4, c = 1.1 work best for our approach. With the weighted entropy, our method further improves the results. These results demonstrate the effectiveness of our methods in both accuracy and robustness.

5.4.3. Generalization ability

To demonstrate the generalization ability, we also evaluate our method on the OTB-50 [39] and VOT-2018 [40] benchmarks, which include both rigid and non-rigid objects (most of them are rigid or approximately rigid). For the evaluation, we follow the main metrics in the two benchmarks; we refer the readers to [39,40] for more details. On the OTB-50 benchmark, our tracker achieves comparable results (precision = 0.930, success rate = 0.660). For more results of other state-of-the-art methods, we refer the readers to the link: http://cvlab.hanyang.ac.kr/tracker_benchmark/benchmark.html. On the one hand, our tracker performs better than the outstanding non-rigid tracker OGBT [4], whose performance is much worse (precision = 0.748, success rate = 0.524). On the other hand, our tracker is not the best compared with the latest ones reported in the benchmark. However, this is principally because of the evaluation protocols. We note the following facts: (1) The ground-truth annotations of the OTB-50 dataset are only based on bounding boxes, and more importantly the quality of the annotations is very poor. Fig. 10 illustrates typical examples with annotation errors: the red bounding boxes are the ground-truths provided in the OTB-50 dataset, which are not consistent and are sometimes misaligned with the targets. It is obvious that evaluation based on such poor ground-truths is not reliable for our method. (2) For the evaluation of generic tracking, the annotations are bounding boxes.
Thus, previous methods mainly focus on estimating the center locations of objects or the overlap between the predicted bounding boxes and the ground-truth bounding boxes. Our tracker generates pixel-wise saliency maps, which are more difficult to produce and more informative than the outputs of bounding box-based trackers. Tab. 5 shows quantitative results on the VOT-2018 benchmark, presented in terms of expected average overlap (EAO), accuracy and robustness (tracking failure). For performance comparison, we mainly compare with the top-8 trackers from the VOT-2018 challenge. All of the compared trackers are based on deep learning techniques. Among the top-ranked trackers,
Fig. 10. Illustration of the inaccurate bounding box annotations [4]. Based on the provided ground-truth boxes (Red), the evaluation is unfair for our predicted boxes (Green).
Table 5
Quantitative comparison on the VOT-2018 benchmark in terms of expected average overlap (EAO), accuracy and robustness. For each row, the best result is in bold.

              DaSiamRPN [62]  STRCF [64]  LSART [29]  CPT [40]  DRT [30]  RCO [40]  UPDT [63]  MFT [40]  Ours
EAO↑          0.326           0.345       0.323       0.339     0.356     0.376     0.378      0.385     0.396
Accuracy↑     0.569           0.523       0.495       0.506     0.519     0.507     0.536      0.505     0.580
Robustness↓   0.337           0.215       0.218       0.239     0.201     0.155     0.184      0.140     0.149
DaSiamRPN [62] achieves higher accuracy compared to its counterparts (MFT [40], UPDT [63] and RCO [40]). However, our tracker achieves the best accuracy while having competitive robustness. Meanwhile, our tracker performs better than the other top-ranked methods in terms of EAO, which shows the effectiveness of our method. More specifically, our tracker obtains the best EAO score of 0.396, with a relative gain of 2% over the VOT-2018 winner (MFT). These results demonstrate that our method has a strong generalization ability for rigid object tracking.

6. Conclusions and future works

In this work, we present a novel online tracking framework for non-rigid objects. Coupled with saliency detection, our framework utilizes learned spatial-temporal saliency maps for object localization. To handle scale-sensitive targets, we develop a tailored FCN (TFCN) to incorporate the local saliency prior into the framework. Meanwhile, we propose a multi-scale multi-region mechanism to generate multiple local saliency maps. By utilizing a weighted entropy fusion, the local saliency maps can be further aggregated into a more discriminative saliency map. Based on the discriminative saliency map, we build an effective non-rigid object tracker. Extensive experiments on public benchmarks demonstrate that our framework performs better than other related trackers for non-rigid objects. We note that most of the proposed modules are general-purpose, and can be used for other applications such as semantic segmentation, video parsing and object detection. Although effective, our framework currently focuses on single-target tracking. In practice, natural scenes often include multiple targets of interest, so extending our method to multiple object tracking is a valuable direction. Besides, our method cannot handle severe occlusion of targets well. We plan to explore new mechanisms to solve this problem in the future.
Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment This work is supported in part by the National Natural Science Foundation of China (NSFC), Nos. 61725202, 61751212 and 61771088. This work is also supported by the Key Research and Development Program of Sichuan Province (2019YFG0409). References [1] S. Duffner, C. Garcia, Pixeltrack: a fast adaptive algorithm for tracking non-rigid objects, in: Proceedings of the International Conference on Computer Vision, 2013, pp. 2480–2487. [2] F. Yang, H. Lu, M.-H. Yang, Robust superpixel tracking, IEEE Trans. Image Process. 23 (4) (2014) 1639–1651. [3] M. Godec, P.M. Roth, H. Bischof, Hough-based tracking of non-rigid objects, Comput. Vis. Image Underst. 117 (10) (2013) 1245–1256. [4] J. Son, I. Jung, K. Park, B. Han, Tracking-by-segmentation with online gradient boosting decision tree, in: Proceedings of the International Conference on Computer Vision, 2015, pp. 3056–3064. [5] V. Mahadevan, N. Vasconcelos, Biologically inspired object tracking using center-surround saliency mechanisms, IEEE Trans. Pattern Anal. Mach. Intell. 35 (3) (2013) 541–554. [6] S. Hong, T. You, S. Kwak, B. Han, Online tracking by learning discriminative saliency map with convolutional neural network, in: Proceedings of the International Conference on Machine Learning, 2015, pp. 597–606. [7] C. Rother, V. Kolmogorov, A. Blake, Grabcut: interactive foreground extraction using iterated graph cuts, ACM Trans. Graph. 23 (3) (2004) 309–314. [8] A. Borji, What is a salient object? A dataset and a baseline model for salient object detection, IEEE Trans. Image Process. 24 (2) (2015) 742–756. [9] P. Li, D. Wang, L. Wang, H. Lu, Deep visual tracking: review and experimental comparison, Pattern Recognit. 76 (2018) 323–338. [10] L. Wang, W. Ouyang, X. Wang, H. Lu, Visual tracking with fully convolutional networks, in: Proceedings of the International Conference on Computer Vision, 2015, pp. 3119–3127. [11] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Proceedings of the International Conference on Learning Representation, 2015, pp. 1–9. [12] C. Ma, J.-B. Huang, X. Yang, M.-H. Yang, Robust visual tracking via hierarchical convolutional features, IEEE Trans. Pattern Anal. Mach. Intell. 25 (4) (2018) 1–14. [13] H. Nam, B. Han, Learning multi-domain convolutional neural networks for visual tracking, IEEE Trans. Pattern Anal. Mach. Intell. 2 (4) (2018) 1–14. [14] X. Li, L. Zhao, L. Wei, M.-H. Yang, F. Wu, Y. Zhuang, H. Ling, J. Wang, Deepsaliency: multi-task deep neural network model for salient object detection, IEEE Trans. Image Process. 25 (8) (2016) 3919–3930. [15] G. Li, Y. Yu, Visual saliency detection based on multiscale deep CNN features, IEEE Trans. Image Process. 25 (11) (2016) 5012–5024. [16] G. Lee, Y.-W. Tai, J. Kim, Eld-net: an efficient deep learning architecture for accurate saliency detection, IEEE Trans. Pattern Anal. Mach. Intell. 40 (7) (2018) 1599–1610.
[17] N. Liu, J. Han, Dhsnet: deep hierarchical saliency network for salient object detection, in: Proceedings of the Computer Vision and Pattern Recognition, 2016, pp. 678–686.
[18] L. Wang, L. Wang, H. Lu, P. Zhang, X. Ruan, Saliency detection with recurrent fully convolutional networks, IEEE Trans. Pattern Anal. Mach. Intell. 41 (7) (2018) 1734–1746.
[19] P. Zhang, D. Wang, H. Lu, H. Wang, X. Ruan, Amulet: aggregating multi-level convolutional features for salient object detection, in: Proceedings of the International Conference on Computer Vision, 2017, pp. 202–211.
[20] P. Zhang, L. Wang, D. Wang, H. Lu, C. Shen, Agile amulet: real-time salient object detection with contextual attention, arXiv:1802.06960 (2018).
[21] T. Wang, A. Borji, L. Zhang, P. Zhang, H. Lu, A stagewise refinement model for detecting salient objects in images, in: Proceedings of the International Conference on Computer Vision, 2017, pp. 4019–4028.
[22] P. Zhang, W. Liu, H. Lu, C. Shen, Salient object detection with lossless feature reflection and weighted structural loss, IEEE Trans. Image Process. 28 (6) (2019) 3048–3060.
[23] P. Zhang, W. Liu, Y. Lei, H. Lu, Hyperfusion-net: hyper-densely reflective feature fusion for salient object detection, Pattern Recognit. 93 (2019) 521–533.
[24] W. Feng, R. Han, Q. Guo, J. Zhu, S. Wang, Dynamic saliency-aware regularization for correlation filter based object tracking, IEEE Trans. Image Process. 28 (7) (2019) 3232–3245.
[25] E. Shelhamer, J. Long, T. Darrell, Fully convolutional networks for semantic segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 39 (4) (2017) 640–651.
[26] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (10) (2010) 1345–1359.
[27] L. Ma, J. Lu, J. Feng, J. Zhou, Multiple feature fusion via weighted entropy for visual tracking, in: Proceedings of the International Conference on Computer Vision, 2015, pp. 3128–3136.
[28] E.S. Gastal, M.M. Oliveira, Domain transform for edge-aware image and video processing, ACM Trans. Graph. 30 (4) (2011) 69.
[29] C. Sun, H. Lu, M.-H. Yang, Learning spatial-aware regressions for visual tracking, in: Proceedings of the Computer Vision and Pattern Recognition, 2018, pp. 8962–8970.
[30] C. Sun, D. Wang, H. Lu, M.-H. Yang, Correlation tracking via joint discrimination and reliability learning, in: Proceedings of the Computer Vision and Pattern Recognition, 2018, pp. 489–497.
[31] H. Kim, Y. Kim, J.-Y. Sim, C.-S. Kim, Spatiotemporal saliency detection for video sequences based on random walk with restart, IEEE Trans. Image Process. 24 (8) (2015) 2552–2564.
[32] C. Bak, A. Kocak, E. Erdem, A. Erdem, Spatio-temporal saliency networks for dynamic saliency prediction, IEEE Trans. Multimed. 20 (7) (2018) 1688–1698.
[33] M.-M. Cheng, N.J. Mitra, X. Huang, P.H. Torr, S.-M. Hu, Global contrast based salient region detection, IEEE Trans. Pattern Anal. Mach. Intell. 37 (3) (2014) 569–582.
[34] X. Wang, L. Zhang, L. Lin, Z. Liang, W. Zuo, Deep joint task learning for generic object extraction, in: Advances in Neural Information Processing Systems, 2014, pp. 523–531.
[35] L. Zhang, C. Yang, H. Lu, X. Ruan, M.-H. Yang, Ranking saliency, IEEE Trans. Pattern Anal. Mach. Intell. 39 (9) (2017) 1892–1904.
[36] J. Shi, Q. Yan, L. Xu, J. Jia, Hierarchical image saliency detection on extended CSSD, IEEE Trans. Pattern Anal. Mach. Intell. 38 (4) (2016) 717–729.
[37] J. Wang, H. Jiang, Z. Yuan, Y. Wu, N. Zheng, S. Li, Salient object detection: a discriminative regional feature integration approach, Int. J. Comput. Vis. 123 (2) (2017) 251–268.
[38] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, A. Sorkine-Hornung, A benchmark dataset and evaluation methodology for video object segmentation, in: Proceedings of the Computer Vision and Pattern Recognition, 2016, pp. 724–732.
[39] Y. Wu, J. Lim, M.-H. Yang, Object tracking benchmark, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1834–1848.
[40] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Zajc, T. Vojir, G. Bhat, A. Lukezic, A. Eldesokey, et al., The sixth visual object tracking VOT-2018 challenge results, in: Proceedings of the European Conference on Computer Vision Workshops, 2018, pp. 1–18.
[41] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: convolutional architecture for fast feature embedding, in: Proceedings of the ACM International Conference on Multimedia, 2014, pp. 675–678.
[42] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: a large-scale hierarchical image database, in: Proceedings of the Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[43] G. Li, Y. Yu, Contrast-oriented deep neural networks for salient object detection, IEEE Trans. Neural Netw. Learn. Syst. 29 (12) (2018) 6038–6051.
[44] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, P.H. Torr, Deeply supervised salient object detection with short connections, IEEE Trans. Pattern Anal. Mach. Intell. 41 (4) (2019) 815–828.
[45] L. Wang, H. Lu, X. Ruan, M.-H. Yang, Deep networks for saliency detection via local estimation and global search, in: Proceedings of the Computer Vision and Pattern Recognition, 2015, pp. 3183–3192.
[46] R. Zhao, W. Ouyang, H. Li, X. Wang, Saliency detection by multi-context deep learning, in: Proceedings of the Computer Vision and Pattern Recognition, 2015, pp. 1265–1274.
[47] P. Zhang, D. Wang, H. Lu, H. Wang, B. Yin, Learning uncertain convolutional features for accurate saliency detection, in: Proceedings of the International Conference on Computer Vision, 2017, pp. 212–221.
[48] D.-P. Fan, M.-M. Cheng, Y. Liu, T. Li, A. Borji, Structure-measure: a new way to evaluate foreground maps, in: Proceedings of the International Conference on Computer Vision, 2017, pp. 4548–4557.
[49] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, A. Sorkine-Hornung, Learning video object segmentation from static images, in: Proceedings of the Computer Vision and Pattern Recognition, 2017, pp. 2663–2672.
[50] J. Cheng, Y.-H. Tsai, S. Wang, M.-H. Yang, Segflow: joint learning for video object segmentation and optical flow, in: Proceedings of the International Conference on Computer Vision, 2017, pp. 686–695.
[51] J. Cheng, Y.-H. Tsai, W.-C. Hung, S. Wang, M.-H. Yang, Fast and accurate online video object segmentation via tracking parts, in: Proceedings of the Computer Vision and Pattern Recognition, 2018, pp. 7415–7424.
[52] S.W. Oh, J.-Y. Lee, K. Sunkavalli, S.J. Kim, Fast video object segmentation by reference-guided mask propagation, in: Proceedings of the Computer Vision and Pattern Recognition, 2018, pp. 7376–7385.
[53] Y. Chen, J. Pont-Tuset, A. Montes, L. Van Gool, Blazingly fast video object segmentation with pixel-wise metric learning, in: Proceedings of the Computer Vision and Pattern Recognition, 2018, pp. 1189–1198.
[54] L. Yang, Y. Wang, X. Xiong, J. Yang, A.K. Katsaggelos, Efficient video object segmentation via network modulation, in: Proceedings of the Computer Vision and Pattern Recognition, 2018, pp. 6499–6507.
[55] J.S. Yoon, F. Rameau, J. Kim, S. Lee, S. Shin, I.S. Kweon, Pixel-level matching for video object segmentation using convolutional neural networks, in: Proceedings of the International Conference on Computer Vision, 2017, pp. 2186–2195.
[56] V. Jampani, R. Gadde, P.V. Gehler, Video propagation networks, in: Proceedings of the Computer Vision and Pattern Recognition, 2017, pp. 451–461.
[57] P. Voigtlaender, B. Leibe, Online adaptation of convolutional neural networks for video object segmentation, in: Proceedings of the British Machine Vision Conference, 2017, pp. 1–12.
[58] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, L. Van Gool, One-shot video object segmentation, in: Proceedings of the Computer Vision and Pattern Recognition, 2017, pp. 221–230.
[59] J. Luiten, P. Voigtlaender, B. Leibe, Premvos: proposal-generation, refinement and merging for video object segmentation, in: Proceedings of the Asian Conference on Computer Vision, 2018, pp. 565–580.
[60] W. Zuo, X. Wu, L. Lin, L. Zhang, M.-H. Yang, Learning support correlation filters for visual tracking, IEEE Trans. Pattern Anal. Mach. Intell. 4 (2) (2018) 311–322.
[61] T. Zhang, C. Xu, M.-H. Yang, Learning multi-task correlation particle filters for visual tracking, IEEE Trans. Pattern Anal. Mach. Intell. 41 (2) (2019) 365–378.
[62] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, W. Hu, Distractor-aware siamese networks for visual object tracking, in: Proceedings of the European Conference on Computer Vision, 2018, pp. 103–119.
[63] G. Bhat, J. Johnander, M. Danelljan, F.S. Khan, M. Felsberg, Unveiling the power of deep tracking, in: Proceedings of the European Conference on Computer Vision, 2018, pp. 483–498.
[64] F. Li, C. Tian, W. Zuo, L. Zhang, M.-H. Yang, Learning spatial-temporal regularized correlation filters for visual tracking, in: Proceedings of the Computer Vision and Pattern Recognition, 2018, pp. 4904–4913.

Pingping Zhang received his B.E. degree in mathematics and applied mathematics from Henan Normal University (HNU), Xinxiang, China, in 2012. He is currently a Ph.D. candidate in the School of Information and Communication Engineering, Dalian University of Technology (DUT), Dalian, China. His research interests include deep learning, saliency detection, object tracking and semantic segmentation.

Wei Liu received the B.Eng. degree from the Department of Automation, Xi'an Jiaotong University, in 2012, and the Ph.D. degree from the Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, in 2019. He is currently a research fellow with the School of Computer Science, University of Adelaide. His current research interests mainly focus on low-level computer vision and graphics.

Dong Wang received the B.E. degree in electronic information engineering and the Ph.D. degree in signal and information processing from the Dalian University of Technology (DUT), Dalian, China, in 2008 and 2013, respectively. He is currently an associate professor with the School of Information and Communication Engineering, DUT. His current research interests include face recognition, interactive image segmentation, and object tracking.

Yinjie Lei received his M.S. degree from Sichuan University (SCU), China, in the area of image processing in 2009, and the Ph.D. degree in computer vision from the University of Western Australia (UWA), Australia, in 2013. He is currently an associate professor with the College of Electronics and Information Engineering at SCU, where he has served as the vice dean since 2017. His research interests mainly include deep learning, 3D biometrics, object recognition and semantic segmentation.

Hongyu Wang received the B.S. degree from Jilin University of Technology, Changchun, China, in 1990 and the M.S. degree from the Graduate School of the Chinese Academy of Sciences, Beijing, China, in 1993, both in electronic engineering. He received the Ph.D. degree in precision instrument and optoelectronics engineering from Tianjin University, Tianjin, China, in 1997. He is currently a professor with Dalian University of Technology, Dalian, China. His research interests include algorithmic, optimization, and performance issues in wireless ad hoc, mesh, and sensor networks.
Chunhua Shen received the B.S. degree from Nanjing University, Nanjing, China, in 1990 and the M.S. degree from the Australian National University, Canberra, Australia, in 1993, both in electronic engineering. He received the Ph.D. degree in precision instrument and optoelectronics engineering from the University of Adelaide, Adelaide, Australia, in 1997. He is currently a professor with the University of Adelaide. His research interests include statistical machine learning and computer vision.
Huchuan Lu received the M.S. degree in signal and information processing and the Ph.D. degree in system engineering from the Dalian University of Technology (DUT), China, in 1998 and 2008, respectively. He has been a faculty member since 1998 and a professor since 2012 in the School of Information and Communication Engineering of DUT. His research interests are in the areas of computer vision and pattern recognition, with a focus on visual tracking, saliency detection and semantic segmentation.