J. Vis. Commun. Image R. 37 (2016) 3–13
Network in network based weakly supervised learning for visual tracking
Yan Chen, Xiangnan Yang, Bineng Zhong, Huizhen Zhang, Changlong Lin
Department of Computer Science and Technology, Huaqiao University, China
Article history: Received 13 December 2014; Accepted 8 October 2015; Available online 22 October 2015.
Keywords: Little supervision; Handcrafted features; Visual tracking; Deep network; Network in network; Object appearance model; Large scale training data; Drifting problem
Abstract: One key limitation of many existing visual tracking methods is that they are built upon low-level visual features and have limited ability to capture the semantics of the data. To effectively fill the semantic gap of visual data in visual tracking with little supervision, we propose a tracking method that constructs a robust object appearance model by learning and transferring mid-level image representations using a deep network, i.e., Network in Network (NIN). First, we design a simple yet effective method to transfer the mid-level features learned by NIN on source tasks with large-scale training data to tracking tasks with limited training data. Then, to address the drifting problem, we simultaneously utilize the samples collected in the initial and most recent frames. Finally, a heuristic schema is used to judge whether or not to update the object appearance model. Extensive experiments show the robustness of our method. © 2015 Elsevier Inc. All rights reserved.
1. Introduction

Visual tracking has gained a great deal of attention in the computer vision community over the past decade. It is a fundamental component of many applications, e.g., intelligent video surveillance, self-driving vehicles, human computer interaction, and robotics. Given the initialized object states (i.e., locations and sizes), the task of visual tracking is to estimate the object states in subsequent frames. Although much progress has been made in recent years, designing robust object tracking methods is still a challenging problem due to object appearance variations caused by illumination changes, occlusions, pose changes, cluttered scenes, moving backgrounds, etc. To effectively handle object appearance variations, many top-performing tracking methods have been proposed. According to several recent literature surveys on visual tracking [1–4], the performance of visual tracking can be significantly boosted by using more sophisticated features, more elaborate synergies between tracking and classification, segmentation or detection, and by taking into account prior information about the scenes and the tracked objects. Handcrafted low-level visual features (e.g., color [5], subspace [6,7], Haar [8,9], LBP [10], SIFT [11], and HoG [12,13]) have been applied
to these tracking methods to construct robust appearance models, and they have shown promising results for some specific data and tasks. However, they are low-level features and are not tuned for the tracked object. Moreover, to capture the object appearance variations during tracking, designing effective features for object tracking usually requires reflecting the time-varying properties of the object appearance model. However, most handcrafted features cannot be simply adapted according to newly observed data. Thus, online learning of features from the object of interest is considered a plausible way to remedy the limitations of handcrafted features. In this paper, inspired by the recent success of deep learning on speech and object recognition problems [14–21], we propose a visual tracking method that relies on a deep neural network structure termed Network in Network (NIN) [21] to address both limitations of handcrafted features and the little supervision (limited data) available in the visual tracking problem. More specifically, as shown in Fig. 1, the discriminative features are first automatically learned via a NIN, which is composed of multiple mlpconv layers, in which a multilayer perceptron (MLP) replaces the conventional linear convolution filter to convolve over the input. With such a "micro network" structure (i.e., MLP), the NIN can effectively handle complex nonlinear data. Consequently, the proposed tracking method can alleviate the need for handcrafted features, and automatically learns hierarchical and object-specific feature representations by end-to-end training. Then, the proposed tracking method simultaneously exploits the samples collected in the initial and most recent frames to alleviate the tracker drifting problem.
Fig. 1. Illustration of how the proposed tracking method constructs an object appearance model from a NIN [21], which is composed of multiple mlpconv layers, in which a multilayer perceptron (MLP) replaces the conventional linear convolution filter to convolve over the input.
Finally, a heuristic schema is used to judge whether or not to update the object appearance models. The rest of the paper is organized as follows. An overview of the related work is given in Section 2. Section 3 introduces how to learn a deep NIN-based object appearance model. The detailed tracking method is then described in Section 4. Experimental results are given in Section 5. Finally, we conclude this work in Section 6.
2. Related work

This section gives a brief review of related tracking methods using different types of object appearance models, namely, generative and discriminative methods. Moreover, some tracking methods based on deep learning are also briefly reviewed. Please refer to the survey papers [1–3] and a recent benchmark [4] for a comprehensive survey of visual tracking. Although numerous methods have been proposed, robust visual tracking remains a significant challenge due to object appearance variations caused by non-rigid shapes, occlusions, illumination changes, cluttered scenes, etc. Recently, to build online object appearance models that account for the inevitable appearance changes of objects, numerous object representation strategies have been proposed. Typically, online tracking methods using different types of object appearance models can be classified into two categories: generative and discriminative methods. Generative tracking methods use some generative process to describe the object appearance and make a decision based on the reconstruction errors or the matching scores. Popular generative methods include the kernel-based tracker [5], subspace-based trackers [6,7], the Gaussian mixture model-based tracker [22], the covariance-based tracker [23], low-rank and sparse representation-based trackers [24–29], and visual tracking decomposition [30]. Specifically, Comaniciu et al. [5] propose a kernel-based tracking method using color histogram-based object representations. To handle object appearance changes, Jepson et al. [22] use an online EM algorithm to construct an adaptive Gaussian mixture model-based object representation. In [6], an incremental principal component analysis (PCA)-based tracking method is proposed. In [23], Porikli et al. propose a tracking method using a covariance-based object description and a Lie algebra-based update mechanism. Recently, a variety of low-rank subspace and sparse representation-based tracking methods have been proposed [24–29] for object tracking due to their robustness to occlusion and image noise. Mei and Ling [24] are among the first to utilize this technique and propose an L1-minimization-based tracking method, in which the target is represented by a sparse linear combination of object templates and trivial templates. However, it is time consuming. To improve the time complexity, many extensions [25,26] have been proposed. Zhang et al. [27] exploit the relationship between particles via low-rank sparse learning for object tracking. In [28], Jia et al. propose a tracking method based on a structural local sparse appearance model. Zhang and Wong [29] propose a tracking method relying on mean shift, sparse coding and spatial pyramids. Kwon and Lee [30] propose a tracking decomposition method which decomposes an observation model into multiple basic observation models to capture a wide range of pose and illumination changes. While generative tracking methods usually produce more accurate results in less complex environments due to the richer image representations used, they are prone to drift in more complex environments because they do not utilize the target and background information. Unlike generative tracking methods, discriminative tracking methods learn to explicitly distinguish the target object from its surrounding background by formulating tracking as a binary classification problem. Based on the variance ratio of two classes, Collins et al. [31] select discriminative color features for constructing adaptive appearance models. Avidan [32] uses an adaptive ensemble of weak classifiers for object tracking. Grabner and Bischof [8] propose an online boosting-based tracking algorithm. Unfortunately, one inherent problem of online learning-based trackers is drift, a gradual adaptation of the tracker to non-targets. To alleviate the drifting problem, Matthews et al. [33] provide a partial solution for template trackers by assuming the current tracker does not stray too far from the initial appearance model. Combining with a prior classifier, Grabner et al. [34] propose a semi-supervised online boosting algorithm for object tracking. However, the method cannot accommodate very large changes in appearance due to its use of the prior classifier. Babenko et al. [35] propose a tracking method using a multiple instance boosting-based appearance model. Hare et al. [9] design a tracking system based on an online kernelized structured output support vector machine. Santner et al. [36] propose a tracking method which combines an online random forest-based tracker, a correlation-based template tracker, and an optical-flow-based mean shift tracker in a cascade style. Zhang and van der Maaten [37] propose a multi-object model-free tracker by using an online structured SVM algorithm to learn the spatial constraints along with the object detectors. Duffner and Garcia [38] present a method for tracking non-rigid objects, which combines and adapts a detector and a probabilistic segmentation method in a co-training manner. The co-training based algorithms require independence among different features, which is too strong an assumption to be met in practice. Kalal et al. [39] propose a TLD-based tracking system that explicitly decomposes the long-term tracking task into tracking, learning, and detection. Kwon and Lee [40] propose a visual tracker sampler, in which multiple appearance models, motion models, state representation types, and observation types are sampled via Markov Chain Monte Carlo to generate the sampled trackers. However, most of these
methods use a single global bounding box to delineate the entire target object. Consequently, they may be sensitive to partial occlusion and pose variation. Although the above generative and discriminative tracking methods have achieved very good performance on several tracking tasks, they may be limited by the choice of the handcrafted low-level features applied to the target objects. To automatically learn features for visual tracking, Carneiro and Nascimento [41] propose a tracking method for the left ventricle endocardium in ultrasound data which combines multiple dynamic models and deep learning architectures. Wang and Yeung [42] propose a stacked denoising autoencoder-based tracking method which first learns generic image features from an offline dataset and then transfers the learned features to the online tracking task. Fan et al. [43] propose a human tracking method via a convolutional neural network, in which the features are learned during offline training. Li et al. [46] and Zhang et al. [47] each propose a convolutional neural network-based tracking method. In [48], Wang et al. propose a tracking method via transferring rich feature hierarchies. In this paper, inspired by the powerful representation ability of Network in Network (NIN), we apply it to object tracking. However, with the rapid development of hardware and big data [58,59], it can also be applied to other fields, such as speech recognition, object classification and mobile visual search [14–21,49–57].

3. Object appearance model

Network in Network (NIN) [21] is an effective deep network structure that enhances model discriminability for local patches within the receptive field. NIN has achieved excellent classification performance on visual recognition problems, e.g., CIFAR-10 and CIFAR-100 [44]. In this section, we address the problem of how to learn a data-driven object appearance model from a NIN. The NIN is fed with raw image pixels and trained in a supervised mode on training samples to produce a likelihood evaluation for each candidate sample.

3.1. Network in network

Following the notation of Lin et al. [21], we briefly review the NIN shown in Fig. 2. In contrast to a conventional convolutional neural network (CNN), NIN builds micro neural networks (i.e., multilayer perceptrons, MLPs) with more complex structures to abstract the data within the receptive field. The MLP is adopted to instantiate the micro neural network because it is a more potent nonlinear function approximator. By sliding the micro networks over the input in a similar manner as a CNN, NIN generates feature maps which can then be fed into the next layer.
A deep NIN can be implemented by stacking multiple of the above-described structures (termed mlpconv layers) and one global average pooling layer. More specifically, in the first mlpconv layer, the input is a pre-processed 3-channel image of size 32 by 32 pixels and is given to a convolutional layer with filters of size $5 \times 5$. The feature maps calculated by an mlpconv layer are then computed as follows:

$f_{i,j,k_1}^{1} = \mathrm{relu}\left( (w_{k_1}^{1})^{T} x_{i,j} + b_{k_1} \right), \quad \ldots, \quad f_{i,j,k_n}^{n} = \mathrm{relu}\left( (w_{k_n}^{n})^{T} f_{i,j}^{n-1} + b_{k_n} \right)$    (1)

where $\mathrm{relu}(x) = \max(0, x)$ is a rectified linear unit (ReLU) transformation, $n$ is the number of layers in the MLP, $(i, j)$ is the pixel index in the feature map, $x_{i,j}$ is the input image patch centered at location $(i, j)$, $k$ is the index of the channels in the feature map, $w$ is the weight and $b$ is the bias. Similarly, the next mlpconv layer takes the output of the previous mlpconv layer as input. Finally, the overall structure of NIN is a stack of mlpconv layers, on top of which lie the global average pooling layer and the objective cost layer. After the three mlpconv layers, the NIN can extract high-level features.

3.2. Learning object appearance models from deep NIN

In this paper, object tracking is formulated as an online transfer learning problem and the deep NIN is used to construct the object appearance model due to its powerful capacity for automatically learning a hierarchical feature representation. As shown in Fig. 3, the key idea is to use the internal NIN features as a generic, mid-level image representation, which can be pre-trained on one dataset (the source task, here CIFAR-10 [44]) and then reused (fine-tuned) on the tracking tasks. More specifically, for the source task, we pre-train a deep NIN with three mlpconv layers followed by one global average pooling layer on the CIFAR-10 natural image dataset [44]. The CIFAR-10 dataset is a labeled subset of the 80 million tiny images dataset, containing 60,000 images and ten classes. The output layer has size 10, equal to the number of target categories. After pre-training on the source task, the parameters of the three mlpconv layers are first transferred to the tracking task and fixed during tracking. Then, we remove the output layer with ten units and add one fully-connected layer with 64 units and an output layer with one unit. Finally, the pre-trained parameters of the three mlpconv layers are kept fixed, and the newly incorporated parameters of the fully-connected and output layers are retrained on the training data from the tracking task to learn an object appearance model. This simple yet effective transfer schema enables the proposed tracking method to tackle the domain changes between the training tasks.
Fig. 2. The structure of Network in Network (NIN) [21].
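To make the mlpconv micro network of Eq. (1) and Fig. 2 concrete, the following is a minimal NumPy sketch that evaluates it at a single spatial location; sliding it over all locations (i, j) yields the mlpconv feature maps. The layer widths in the toy example are hypothetical choices for illustration only, since the exact channel counts are not restated here.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mlpconv_at(x_patch, weights, biases):
    """Evaluate the mlpconv micro network of Eq. (1) at one location (i, j).

    x_patch : flattened input patch centered at (i, j), shape (d,)
    weights : list of weight matrices, one per MLP layer
    biases  : list of bias vectors, one per MLP layer
    """
    f = relu(weights[0] @ x_patch + biases[0])   # f^1_{i,j,:}
    for w, b in zip(weights[1:], biases[1:]):
        f = relu(w @ f + b)                      # f^n_{i,j,:}
    return f

# Toy example: a 5x5x3 patch and a three-layer MLP with hypothetical widths.
rng = np.random.default_rng(0)
patch = rng.standard_normal(5 * 5 * 3)
ws = [rng.standard_normal((192, 75)) * 0.01,
      rng.standard_normal((160, 192)) * 0.01,
      rng.standard_normal((96, 160)) * 0.01]
bs = [np.zeros(192), np.zeros(160), np.zeros(96)]
print(mlpconv_at(patch, ws, bs).shape)  # (96,) channel responses at this location
```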
Fig. 3. Learning object appearance models by transferring the NIN features. First, the deep NIN is pre-trained on the source task (CIFAR-10 classification, top row). Then, the pre-trained parameters of the internal layers of the NIN (three mlpconv layers) are transferred to the tracking task (bottom row). To achieve the transfer and construct the object appearance models, we remove the output layer with ten units and add one fully-connected layer with 64 units and an output layer with one unit. Furthermore, to alleviate the drifting problem, we exploit both the initial and online samples to update the object appearance models. Best viewed in color. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
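To make the transfer schema of Section 3.2 and Fig. 3 concrete, here is a minimal NumPy sketch under stated assumptions: the pre-trained mlpconv layers are treated as a frozen feature extractor (replaced here by a fixed random projection purely for illustration), and only the newly added 64-unit fully-connected layer and 1-unit output are trained on the tracking samples. This is an illustrative reconstruction, not the authors' released code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Frozen "mlpconv" feature extractor: in the real tracker these are the
# transferred NIN layers; a fixed random projection stands in for them here.
PROJ = rng.standard_normal((32 * 32 * 3, 192)) * 0.01

def nin_features(patches):
    flat = patches.reshape(len(patches), -1)
    return np.maximum(0.0, flat @ PROJ)

def train_head(feats, labels, hidden=64, lr=0.1, epochs=200):
    """Retrain only the new layers: a 64-unit fully-connected layer (ReLU)
    and a 1-unit sigmoid output, by plain gradient descent on cross-entropy."""
    d = feats.shape[1]
    W1 = rng.standard_normal((d, hidden)) * 0.01
    b1 = np.zeros(hidden)
    W2 = rng.standard_normal(hidden) * 0.01
    b2 = 0.0
    for _ in range(epochs):
        h = np.maximum(0.0, feats @ W1 + b1)          # FC-64 + ReLU
        p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))      # 1-unit sigmoid output
        g = (p - labels) / len(labels)                # gradient w.r.t. the pre-sigmoid score
        W2 -= lr * (h.T @ g)
        b2 -= lr * g.sum()
        gh = np.outer(g, W2) * (h > 0)                # back-propagate through the ReLU
        W1 -= lr * (feats.T @ gh)
        b1 -= lr * gh.sum(axis=0)
    return W1, b1, W2, b2

# Toy usage: positive/negative 32x32 patches cropped from the first frame.
pos = rng.random((20, 32, 32, 3))
neg = rng.random((20, 32, 32, 3)) * 0.3
X = nin_features(np.concatenate([pos, neg]))
y = np.concatenate([np.ones(20), np.zeros(20)])
W1, b1, W2, b2 = train_head(X, y)
scores = np.maximum(0.0, X @ W1 + b1) @ W2 + b2       # appearance scores for the patches
print(scores[:20].mean(), scores[20:].mean())         # compare positive vs. negative means
```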
4. Object tracking via deep network in network

In this section, we develop the proposed tracking method based on the appearance model described above. The basic idea is to efficiently incorporate the deep NIN-based appearance model into the particle filtering framework, which implements a recursive Bayesian filter by Monte Carlo sampling. The main idea behind the particle filter is to represent the posterior density by a set of random particles with associated weights. Typically, a particle filter is composed of two main components, i.e., a dynamic model and an observation model. The dynamic model is used to generate candidate samples based on the previous particles, while the observation model calculates the similarity between candidate samples and the target appearance model. Given all observations of the object $y_{1:t} = [y_1, \ldots, y_t]$ up to time $t$, the aim of a particle filter-based tracking system is to estimate the posterior density of the target state $p(x_t \mid y_{1:t})$. Using the Bayesian theorem, the posterior density $p(x_t \mid y_{1:t})$ can be reformulated as

$p(x_t \mid y_{1:t}) \propto p(y_t \mid x_t) \int p(x_t \mid x_{t-1}) \, p(x_{t-1} \mid y_{1:t-1}) \, dx_{t-1}$    (2)

where $p(x_t \mid x_{t-1})$ is the dynamic model and $p(y_t \mid x_t)$ is the observation model. The calculation of the integral is carried out by Monte Carlo sampling in the particle filter. In other words, the posterior density $p(x_t \mid y_{1:t})$ is approximated by a set of samples (particles) $\{x_t^i\}_{i=1}^{N_t}$ with associated weights $\{w_t^i\}_{i=1}^{N_t}$. Finally, the optimal object state $x_t^*$ at time $t$ can be determined by the maximum a posteriori estimate

$x_t^* = \arg\max_{x_t^i} p(x_t^i \mid y_{1:t}) = \arg\max_{x_t^i} w_t^i$    (3)

Specifically, let $x_t = (p_t^x, p_t^y, w_t, h_t)$ denote the object state parameters, i.e., the horizontal coordinate, vertical coordinate, width and height, respectively. A Gaussian distribution is used to construct the dynamic model between two consecutive frames:

$p(x_t \mid x_{t-1}) = \mathcal{N}(x_t; x_{t-1}, \Sigma)$    (4)

where $\Sigma$ is a diagonal covariance matrix whose diagonal elements are the variances of the respective state parameters. For each state $x_t$, there is a corresponding image patch that is normalized to $32 \times 32$ pixels by image scaling. The likelihood function $p(y_t \mid x_t)$ is calculated based on the proposed deep NIN-based appearance model. Given an output score $d_t$ from the output layer of the deep NIN-based appearance model, the likelihood function is calculated by

$p(y_t \mid x_t) = \exp(d_t)$    (5)

To capture the appearance variations, the likelihood function needs to adapt over time by updating the deep NIN-based appearance model. However, the main drawback of appearance-adaptive methods is their sensitivity to drift, i.e., they may gradually adapt to non-targets. To alleviate the drifting problem, the proposed tracking method simultaneously exploits the samples collected in the initial and most recent frames. Specifically, we denote the positive samples obtained in the first frame as $s_1^+ = \{x_{1,i}^+\}_{i=1}^{N_1^+}$. The online positive samples obtained in the most recent frames are denoted as $s_{ot}^+ = \{x_t^i\}_{i=1}^{T_1}$. The positive and negative samples obtained at the current frame $t$ are denoted as $s_t^+ = \{x_{t,i}^+\}_{i=1}^{N_t^+}$ and $s_t^- = \{x_{t,i}^-\}_{i=1}^{N_t^-}$, respectively. At the current frame $t$, if the posterior density of the optimal object state $p(x_t^* \mid y_{1:t})$ is smaller than a predefined threshold $T_2$ or larger than a predefined threshold $T_3$, we do not update the deep NIN-based appearance model. Otherwise, based on $s_1^+$, $s_{ot}^+$, $s_t^+$, and $s_t^-$, we update the object appearance model. The basic idea behind this heuristic schema is that if the likelihood is smaller than the threshold $T_2$, the current tracking result may be unreliable, so we should not update the object appearance model. Meanwhile, if the likelihood is larger than the threshold $T_3$, the current tracking result is reliable.
Thus, we should trust the tracking result and not update the object appearance model. Otherwise, we should update the object appearance model to capture the gradual object appearance variations. Finally, a summary of our deep NIN-based tracking method is given in Fig. 4.
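The following is a minimal NumPy sketch of one particle-filter iteration combining the dynamic model of Eq. (4), the likelihood of Eq. (5), and the MAP selection of Eq. (3). The scorer `nin_score` stands in for the deep NIN-based appearance model, and the particle count, state noise, and dummy scorer are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def propagate(particles, sigmas, rng):
    """Dynamic model, Eq. (4): diffuse each state (px, py, w, h) with
    independent Gaussian noise whose standard deviation is given per dimension."""
    return particles + rng.standard_normal(particles.shape) * sigmas

def track_step(particles, weights, nin_score, frame, sigmas, rng):
    """One particle-filter iteration: predict, weight by Eq. (5),
    and pick the MAP state as in Eq. (3)."""
    particles = propagate(particles, sigmas, rng)
    # nin_score(frame, state) -> output-layer score d_t of the appearance model
    d = np.array([nin_score(frame, p) for p in particles])
    weights = weights * np.exp(d)                 # p(y_t | x_t) = exp(d_t)
    weights = weights / weights.sum()
    best = particles[np.argmax(weights)]          # MAP estimate x_t^*
    return particles, weights, best

# Toy usage with a dummy scorer (a real tracker would crop the 32x32 patch
# at each state and run it through the fine-tuned NIN).
rng = np.random.default_rng(0)
particles = np.tile([100.0, 80.0, 40.0, 60.0], (1000, 1))
weights = np.full(1000, 1.0 / 1000)
dummy_score = lambda frame, s: -np.linalg.norm(s[:2] - np.array([102.0, 81.0])) / 10.0
_, _, x_star = track_step(particles, weights, dummy_score, None,
                          np.array([4.0, 4.0, 0.01, 0.01]), rng)
print(x_star)
```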
5. Experiments

In this section, we first introduce the setting of our experiments. Then, we apply the proposed tracking method to the 50 challenging sequences from the CVPR2013 tracking benchmark [4], and systematically compare it with 30 state-of-the-art trackers.
Algorithm 1. Object tracking via the deep NIN-based appearance model.

Initialization:
1. Pre-train a deep NIN on the CIFAR-10 dataset.
2. Manually label the target object in the first frame.
3. Collect positive samples $s_1^+ = \{x_{1,i}^+\}_{i=1}^{N_1^+}$ and negative samples $s_1^- = \{x_{1,i}^-\}_{i=1}^{N_1^-}$, and crop out the corresponding image patches.
4. Resize each positive/negative image patch to $32 \times 32$ pixels.
5. Fine-tune the pre-trained deep NIN-based appearance model based on $s_1^+$ and $s_1^-$.
6. Initialize the particle set $\{x_1^i, w_1^i\}_{i=1}^{N_1}$ at time $t = 1$, where $w_1^i = 1/N_1$, $i = 1, \ldots, N_1$.
7. Set the maximum buffer size $T_1$ for the set of online positive samples $s_{ot}^+$.
8. Set the likelihood thresholds $T_2$ and $T_3$.

for t = 2 to the end of the video:
1. Prediction: for $i = 1, \ldots, N_1$, generate $x_t^i \sim p(x_t \mid x_{t-1}^i)$.
2. Likelihood evaluation: for $i = 1, \ldots, N_1$, let $w_t^i = w_{t-1}^i \, p(y_t \mid x_t^i)$.
3. Determine the optimal object state $x_t^*$ as the particle with the maximum weight.
4. Resampling: normalize the weights $\{w_t^i\}_{i=1}^{N_1}$ and compute their variance. If this variance exceeds a threshold, sample indices $\beta_j$ according to the weights and replace $\{x_t^i, w_t^i\}_{i=1}^{N_1}$ with $\{x_t^{\beta_j}, 1/N_1\}_{j=1}^{N_1}$.
5. If the posterior density of the optimal object state $p(x_t^* \mid y_{1:t})$ is within $[T_2, T_3]$, then:
   5.1 Select positive and negative samples $s_t^+ = \{x_{t,i}^+\}_{i=1}^{N_t^+}$ and $s_t^- = \{x_{t,i}^-\}_{i=1}^{N_t^-}$ at time $t$, respectively.
   5.2 Update the set of online positive samples: $s_{ot}^+ = s_{ot}^+ \cup s_t^+$.
   5.3 If the size of $s_{ot}^+$ is larger than $T_1$, truncate $s_{ot}^+$ to keep the last $T_1$ elements.
   5.4 Update the final positive sample set: $s^+ = s_1^+ \cup s_{ot}^+$.
   5.5 Update the deep NIN-based appearance model based on $s^+$ and $s_t^-$.
end for

Fig. 4. Summary of the deep NIN-based tracking method.
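As a small illustration of step 5 of Algorithm 1 (the heuristic update rule together with the bounded buffer of online positive samples), here is a hedged Python sketch; the `fine_tune` callable and the sample placeholders are hypothetical stand-ins for the NIN retraining and the cropped image patches.

```python
from collections import deque

def maybe_update_model(score, s1_pos, online_pos, new_pos, new_neg, fine_tune,
                       T2=0.5, T3=0.9):
    """Heuristic update (step 5): retrain the appearance model only when the
    likelihood of the MAP state lies inside [T2, T3].

    score      : likelihood/posterior score of the optimal state x_t^*
    s1_pos     : positive samples from the first frame (always kept)
    online_pos : deque of recent online positive samples with maxlen = T1,
                 so it automatically keeps only the last T1 elements (step 5.3)
    new_pos    : positive samples cropped around the current estimate
    new_neg    : negative samples cropped around the current estimate
    fine_tune  : callable retraining the NIN head on (positives, negatives)
    """
    if score < T2 or score > T3:
        # Too unreliable (< T2) or already reliable (> T3): keep the model fixed.
        return False
    online_pos.extend(new_pos)                     # step 5.2
    positives = list(s1_pos) + list(online_pos)    # step 5.4: s^+ = s_1^+ U s_ot^+
    fine_tune(positives, new_neg)                  # step 5.5
    return True

# Toy usage with placeholder samples and a no-op trainer.
online_pos = deque(maxlen=50)                      # T1 = 50
updated = maybe_update_model(0.7, ["patch_gt"], online_pos,
                             ["patch_pos"], ["patch_neg"],
                             fine_tune=lambda pos, neg: None)
print(updated)  # True: 0.7 lies within [0.5, 0.9]
```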
5.1. Experiment setting

Within the Caffe framework [45], the proposed tracking method is implemented in Matlab on an HP Z800 workstation with an Intel(R) Xeon(R) E5620 2.40 GHz processor, 12 GB RAM and a GTX-980 graphics card. The number of particles in particle filtering is set to 1000. The maximum buffer size T_1 and the likelihood thresholds T_2 and T_3 are set to 50, 0.5 and 0.9, respectively. Each image observation of the target object is normalized to a 32 × 32 patch. In our experiments, we choose to track only the location and size for simplicity and computational efficiency. The average processing speed is about one second per frame. We use the same parameters for all of the experiments. For performance evaluation, we test the proposed tracking method on the CVPR2013 tracking benchmark, which contains 50 fully annotated image sequences. Each image sequence is tagged with a number of attributes indicating the presence of different challenging aspects, such as illumination variation, scale variation, occlusion, deformation, and background clutter. In the benchmark, 30 publicly available trackers are evaluated. We follow the protocol used in the benchmark, in which the evaluation is based on two different metrics: the precision plot and the success plot. The precision plot shows the percentage of frames whose estimated location is within a given threshold distance of the ground truth, and a representative precision score (threshold = 20 pixels) is used for ranking. The other metric reports the overlap precision over a range of thresholds. The overlap precision is defined as the percentage of frames where the bounding box overlap exceeds a given threshold varied from 0 to 1. In contrast to the precision plot, the trackers are ranked using the area under the curve (AUC) of the success plot. Please see the original paper [4] for more details. In addition, for a closer-view evaluation, we show several typical examples of the center distance error per frame compared with the top 10 trackers, and the corresponding qualitative comparison results.
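For reference, the two benchmark metrics can be computed as in the following NumPy sketch (our own illustrative reconstruction of the standard definitions, not the benchmark's released code): precision at a 20-pixel center-error threshold, and the area under the success curve over overlap thresholds from 0 to 1. Boxes are assumed to be given as (x, y, w, h).

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between the centers of predicted and ground-truth boxes."""
    cp = pred[:, :2] + pred[:, 2:] / 2.0
    cg = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(cp - cg, axis=1)

def overlap(pred, gt):
    """Intersection-over-union of (x, y, w, h) boxes, per frame."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def precision_at(pred, gt, threshold=20.0):
    """Fraction of frames whose center location error is below the threshold."""
    return np.mean(center_error(pred, gt) <= threshold)

def success_auc(pred, gt, thresholds=np.linspace(0, 1, 101)):
    """Average overlap precision over thresholds in [0, 1] (area under the success curve)."""
    iou = overlap(pred, gt)
    return np.mean([np.mean(iou > t) for t in thresholds])

# Toy usage on two frames of (x, y, w, h) boxes.
pred = np.array([[10.0, 10.0, 40.0, 60.0], [12.0, 14.0, 40.0, 60.0]])
gt = np.array([[12.0, 11.0, 42.0, 58.0], [30.0, 40.0, 42.0, 58.0]])
print(precision_at(pred, gt), success_auc(pred, gt))
```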
5.2. Comparison with other trackers

5.2.1. Quantitative evaluation
The quantitative comparison results of all the trackers are listed in Fig. 5, where only the top 10 trackers are shown for clarity. The values in the legend of the precision plot are the relative number of frames in the 50 sequences where the center location error is smaller than a threshold of 20 pixels. The values in the legend of the success plot are the AUC. In both the precision and success plots, our tracking method achieves state-of-the-art performance compared to all alternative methods. The robustness of our tracking method lies in the deep NIN-based appearance model, which is discriminatively trained online to capture a high-level representation of the object and can effectively account for appearance variations.

5.2.2. Attribute-based evaluation
The object appearance variations may be caused by illumination changes, occlusions, pose changes, cluttered scenes, moving backgrounds, etc. To analyze the performance of trackers for each challenging factor, the CVPR2013 tracking benchmark [4] annotates the attributes of each sequence and constructs subsets with 11 different dominant attributes, namely: illumination variation, scale variation, occlusion, deformation, motion blur, fast motion, in-plane rotation, out-of-plane rotation, out-of-view, background clutter and low resolution. We perform a quantitative comparison with the 30 state-of-the-art tracking methods on the 50 sequences annotated with respect to the aforementioned attributes. We show the precision and success plots of OPE for different subsets divided based on the main variation of the target object in Fig. 6. Our tracking method achieves state-of-the-art results for 2 out of the 11 attributes: deformation and out-of-view. For 6 out of the 11 attributes, namely occlusion, fast motion, motion blur, low resolution, scale variation and out-of-plane rotation, our tracking method performs favorably. For the illumination variation, background clutter and in-plane rotation attributes, the performance of our tracking method deteriorates.

5.2.3. Qualitative evaluation
A qualitative comparison with the top 10 trackers (on eight typical sequences) is shown in Fig. 7. Meanwhile, for a closer-view evaluation, we show the corresponding examples of the center distance error per frame in Fig. 8 with the top 10 trackers compared, which show that our tracking method can transfer the pre-trained NIN features to the specific target object well. According to the overall tracking results, our tracking method achieves state-of-the-art performance. Recall that the pre-trained NIN is learned entirely from natural scenes, which are completely unrelated to the tracking task. This implies that our method can construct robust object appearance models by effectively learning and transferring the highly general NIN features. Meanwhile, by simultaneously exploiting ground-truth and online positive samples, the drifting problem is greatly alleviated.
Fig. 5. The precision and success plots of quantitative comparison for the 50 sequences in the CVPR2013 tracking benchmark [4]. The performance score of each tracker is shown in the legend. Our tracking method (in red) obtains better or comparable performance against state-of-the-art tracking methods. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 6. Plots for different subsets divided based on the main variation of the target object. The value appearing in the title denotes the number of videos tagged with the main variation. Our tracking method (in red) achieves comparable performance against state-of-the-art trackers. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 7. Qualitative comparison on several sequences from [4], i.e., the david3, deer, doll, freeman1, jogging-2, lemming, liquor, and walking2 sequences, respectively.
Fig. 8. Quantitative comparison on the center distance error per frame for several sequences from [4].
6. Conclusion

In this paper, we have proposed a robust visual tracking method via deep Network in Network (NIN) [21]. Our tracking method does not rely on engineered features and automatically learns the most discriminative features in a data-driven way. A simple yet effective method has been used to transfer the generic, mid-level features learned by the deep NIN to the tracking task. The drifting problem is alleviated by simultaneously exploiting ground-truth and online positive samples. Moreover, a heuristic schema has been used to judge whether or not to update the object appearance models. Extensive comparison experiments on the CVPR2013 tracking benchmark [4] demonstrate the advantages of our tracking method.

Acknowledgments

This work is supported by Natural Science Foundation of China (Nos. 61572205, 61202299, 61502181, and 61403150), Natural Science Foundation of Fujian Province of China (Nos. 2015J01257 and 2014J05075), Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University (No. ZQN-PY210), 2015 Program for New Century Excellent Talents in Fujian Province University, and Outstanding Young Persons' Research Program for Higher Education of Fujian Province (No. JA13007).

References

[1] A. Yilmaz, O. Javed, M. Shah, Object tracking: a survey, ACM Comput. Surv. 38 (2006). [2] A.W.M. Smeulders, D.M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, M. Shah, Visual tracking: an experimental survey, IEEE Trans. Pattern Anal. Mach. Intell. 36 (7) (2014) 1442–1468. [3] X. Li, W. Hu, C. Shen, Z. Zhang, A. Dick, A. van den Hengel, A survey of appearance models in visual object tracking, ACM Trans. Intell. Syst. Technol. 22 (8) (2013) 3028–3040. [4] Y. Wu, J.W. Lim, M.H. Yang, Online object tracking: a benchmark, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013. [5] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. 25 (5) (2003) 564–577. [6] D.A. Ross, J. Lim, R. Lin, M. Yang, Incremental learning for robust visual tracking, Int. J. Comput. Vis. 77 (1) (2008) 125–141. [7] Q. Wang, F. Chen, W.L. Xu, M.H. Yang, Object tracking via partial least squares analysis, IEEE Trans. Image Process. 21 (10) (2012) 4454–4465. [8] H. Grabner, H. Bischof, On-line boosting and vision, in: IEEE Conference on Computer Vision and Pattern Recognition, 2006. [9] S. Hare, A. Saffari, P. Torr, Struck: structured output tracking with kernels, in: IEEE Conference on Computer Vision, 2011. [10] V. Takala, M. Pietikainen, Multi-object tracking using color, texture and motion, in: IEEE Workshop on Visual Surveillance, 2007. [11] J.L. Fan, X.H. Shen, Y. Wu, Scribble tracker: a matting-based approach for robust tracking, IEEE Trans. Pattern Anal. Mach. Intell. 34 (8) (2012) 1633–1644. [12] M. Godec, P.M. Roth, H. Bischof, Hough-based tracking of non-rigid objects, in: IEEE Conference on Computer Vision, 2011. [13] Y. Lu, T.F. Wu, S.C. Zhu, Online object tracking, learning and parsing with and-or graphs, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014. [14] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507. [15] G.E. Hinton, L. Deng, D. Yu, G.E. Dahl, A.R.
Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups, IEEE Signal Process. Mag. (2012) 82–97. [16] A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, Neural Inform. Process. Syst. 2012 (2012). [17] Y. Bengio, A. Courville, P. Vincent, Representation learning: a review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1798–1828.
[18] R.B. Girshick, J. Donahue, T. Darrell, J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2013. Available from: arXiv:1311.2524 [cs.CV]. [19] P. Sermanet, K. Kavukcuoglu, S. Chintala, Y. LeCun, Pedestrian detection with unsupervised multi-stage feature learning, in: IEEE Conference on Computer Vision and Pattern Recognition 2013. [20] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, Decaf: a deep convolutional activation feature for generic visual recognition, in: International Conference on Machine Learning, 2014. [21] M. Lin, Q. Chen, Shuicheng Yan, Network in network. CoRR, abs/1312.4400, 2013. [22] A.D. Jepson, D.J. Fleet, T.F. El-Maraghi, Robust online appearance models for visual tracking, IEEE Trans. Pattern Anal. Mach. Intell. 25 (10) (2003) 1296– 1311. [23] F. Porikli, O. Tuzel, P. Meer, Covariance tracking using model update based on lie algebra, in: IEEE Conference on Computer Vision and Pattern Recognition, 2006. [24] X. Mei and H. Ling. Robust Visual Tracking using L1 Minimization. IEEE Conference on Computer Vision, 2009. [25] C. Bao, Y. Wu, H. Ling, H. Ji, Real time robust L1 tracker using accelerated proximal gradient approach, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012. [26] K.H. Zhang, L. Zhang, M.H. Yang, Real-time compressive tracking, in: European Conference on Computer Vision, 2012. [27] T. Zhang, B. Ghanem, S. Liu, N. Ahuja, Low-rank sparse learning for robust visual tracking, in: European Conference on Computer Vision, 2012. [28] X. Jia, H.C. Lu, M.H. Yang, Visual tracking via adaptive structural local sparse appearance model, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012. [29] Z. Zhang, K.H. Wong, Pyramid-based visual tracking using sparsity represented mean transform, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014. [30] J. Kwon, K.M. Lee, Visual tracking decomposition, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010. [31] R.T. Collins, Y. Liu, M. Leordeanu, Online selection of discriminative tracking features, IEEE Trans. Pattern Anal. Mach. Intell. 27 (10) (2005) 1631–1643. [32] S. Avidan, Ensemble tracking, IEEE Trans. Pattern Anal. Mach. Intell. 29 (2) (2007) 261–271. [33] I. Matthews, T. Ishikawa, S. Baker, The template update problem, IEEE Trans. Pattern Anal. Mach. Intell. 26 (6) (2004) 810–815. [34] H. Grabner, C. Leistner, H. Bischof, Semi-supervised on-line boosting for robust tracking, in: European Conference on Computer Vision, 2008. [35] B. Babenko, M. Yang, S. Belongie, Robust object tracking with online multiple instance learning, IEEE Trans. Pattern Anal. Mach. Intell. 33 (8) (2011) 1619– 1632. [36] J. Santner, C. Leistner, A. Saffari, T. Pock, H. Bischof, PROST: parallel robust online simple tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010. [37] L. Zhang, L.V.D. Maaten, Preserving structure in model-free tracking, IEEE Trans. Pattern Anal. Mach. Intell. 36 (4) (2014) 756–769. [38] S. Duffner, C. Garcia, Pixeltrack: a fast adaptive algorithm for tracking nonrigid objects, in: IEEE Conference on Computer Vision, 2013. [39] Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, IEEE Trans. Pattern Anal. Mach. Intell. 34 (7) (2012) 1409–1422. [40] J. Kwon, K. Lee, Tracking by sampling and integrating multiple trackers, IEEE Trans. Pattern Anal. Mach. Intell. 36 (7) (2014) 1428–1441. [41] G. Carneiro, J.C. 
Nascimento, Combining multiple dynamic models and deep learning architectures for tracking the left ventricle endocardium in ultrasound data, IEEE Trans. Pattern Anal. Mach. Intell. 35 (11) (2013) 2592– 2607. [42] N.Y. Wang, D.Y. Yeung, Learning a deep compact image representation for visual tracking, Neural Inform. Process. Syst. (2013) 809–817. [43] J.L. Fan, W. Xu, Y. Wu, Y.H. Gong, Human tracking using convolutional neural networks, IEEE Trans. Neural Network 21 (10) (2010) 1610–1623. [44] Alex Krizhevsky, Learning Multiple Layers of Features from Tiny Images, Technique Report, 2009. [45] Y.Q. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional Architecture for Fast Feature Embedding, 2014. Available from: arXiv:1408.5093. [46] H.X. Li, Y. Li, F. Porikli, DeepTrack: learning discriminative feature representations by convolutional neural networks for visual tracking, in: British Machine Vision Conference, 2014. [47] K.H. Zhang, Q.S. Liu, Y. Wu, M.H. Yang, Robust Tracking via Convolutional Networks without Learning, 2015. Available from: arXiv:1501.04505 [cs.CV]. [48] N.Y. Wang, S.Y. Li, A. Gupta, D.Y. Yeung, Transferring Rich Feature Hierarchies for Robust Visual Tracking, 2015. Available from: arXiv:1501.04587 [cs.CV]. [49] Y.W. Luo, T. Guan, B.C. Wei, H.L. Pan, J.Q. Yu, Fast terrain mapping from low altitude digital imagery, Neurocomputing 156 (2015) 105–116. [50] T. Guan, Y. He, J. Gao, J. Yang, J. Yu, On-device mobile visual location recognition by integrating vision and inertial sensors, IEEE Trans. Multimedia 15 (7) (2013) 1688–1699. [51] T. Guan, Y.F. He, L.Y. Duan, J.Q. Yu, Efficient BOF generation and compression for on-device mobile visual location recognition, IEEE Multimedia 21 (2) (2014) 32–41. [52] B.C. Wei, T. Guan, J.Q. Yu, Projected residual vector quantization for ANN search, IEEE Multimedia 21 (3) (2014) 41–51.
[53] L. Zhang, Y. Yang, Y. Gao, Y. Yu, C. Wang, X. Li, A probabilistic associative model for segmenting weakly-supervised images, IEEE Trans. Image Process. 23 (9) (2014) 4150–4159. [54] L.M. Zhang, Y. Gao, Y.J. Xia, K. Lu, J.L. Shen, R.R. Ji, Representative discovery of structure cues for weakly-supervised image segmentation, IEEE Trans. Multimedia 16 (2) (2014) 470–479. [55] L.M. Zhang, M.L. Song, Y. Yang, Q. Zhao, C. Zhao, N. Sebe, Weakly supervised photo cropping, IEEE Trans. Multimedia 16 (1) (2014) 94–107. [56] L.M. Zhang, M.L. Song, Z.C. Liu, X. Liu, J.J. Bu, C. Chen, Probabilistic graphlet cut: exploring spatial structure cue for weakly supervised image segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[57] R.R. Ji, L.Y. Duan, J. Chen, H. Yao, J. Yuan, Y. Rui, W. Gao, Location discriminative vocabulary coding for mobile landmark search, Int. J. Comput. Vis. 96 (3) (2012) 290–314. [58] C. Wang, X. Li, X.H. Zhou, SODA: software defined FPGA based accelerators of big data, in: Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015. [59] C. Wang, X. Li, P. Chen, X.H. Zhou, H. Yu, Heterogeneous cloud framework for big data genome sequencing, IEEE/ACM Trans. Comput. Biol. Bioinform. 12 (1) (2015) 166–178.