Highlights
• We propose a novel visual tracker based on VAE-MCMC.
• We extend VAE-MCMC into two variants, VampPrior-MCMC and hierarchical VampPrior-MCMC.
• VAE-MCMC outperforms other state-of-the-art visual tracking methods.
Robust Visual Tracking based on Variational Auto-encoding Markov Chain Monte Carlo

Junseok Kwon
School of Computer Science and Engineering, Chung-Ang University, Seoul, Republic of Korea
Abstract

In this study, we present a novel visual tracker based on the variational auto-encoding Markov chain Monte Carlo (VAE-MCMC) method. A target is tracked over time with the help of multiple geometrically related supporters whose motions correlate with those of the target. Good supporters are obtained using variational auto-encoding techniques that measure the confidence of supporters in terms of marginal probabilities. These probabilities are then used in the MCMC method to search for the best state of the target. We extend the VAE-MCMC method to the variational mixture of posteriors (VampPrior)-MCMC and hierarchical VampPrior-MCMC methods. Experimental results demonstrate that the supporters are useful for robust visual tracking and that variational auto-encoding can accurately estimate the distribution of the supporters' states. Moreover, our proposed VAE-MCMC method quantitatively and qualitatively outperforms recent state-of-the-art tracking methods.
1. Introduction

Visual tracking, with applications including autonomous driving [39], surveillance [13], and robotics [43], has been a fundamental problem in computer vision. Recently, several visual tracking methods have achieved great success in real-world environments [42]. Although these methods produce promising results in many tracking sequences, several bottlenecks still exist. In particular, traditional tracking methods are unable to track targets that are frequently occluded by other objects or whose appearance is altered owing to illumination changes, as such targets are not easily visible in the scene. In such cases, visible supporters located around a target can help estimate the target position if the positions of the supporters and the target are geometrically correlated over time. Following the approach proposed in [15], we define supporters as feature points that are useful for predicting the target's position. Their motions are closely correlated; for example, supporters and a target move in the same direction. However, the difficulty of obtaining good supporters makes this visual tracking approach challenging.

To solve this problem, this study proposes a variational auto-encoding (VAE) Markov chain Monte Carlo (MCMC) method and presents a novel visual tracking system based on VAE-MCMC. The VAE technique enables us to measure the confidence of the supporters in terms of marginal probability; it is further enhanced through its variants, the variational mixture of posteriors (VampPrior) and the hierarchical VampPrior [44]. These techniques are integrated into the MCMC sampling method for accurate estimation of target states. Figure 1 shows the basic idea of the proposed method.

The main contributions of the proposed method can be summarized as follows:
• A VAE-MCMC method for visual tracking is presented. Using this method, target states can be accurately estimated with the help of supporters.
• The VAE-MCMC method is extended into two variants, namely VampPrior-MCMC and hierarchical VampPrior-MCMC. The VampPrior-MCMC method substitutes the VampPrior for the VAE, whereas the hierarchical VampPrior-MCMC method hierarchically estimates the VampPrior.
• Extensive experiments conducted with an ablation study demonstrate that VAE-MCMC outperforms other state-of-the-art visual tracking methods.

Section 2 introduces related work. Section 3 introduces a Bayesian tracker, the performance of which can be augmented by multiple supporters. Section 4 presents three visual trackers, namely VAE-MCMC, VampPrior-MCMC, and hierarchical VampPrior-MCMC, and Section 5 provides their implementation details. Section 6 presents the experimental results. Finally, Section 7 concludes the study.
Fig. 1. Basic idea of VAE-MCMC. In this example, we use two supporters, $S_t^1$ and $S_t^2$, at each frame. These supporters are geometrically correlated with the target, $X_t$, as illustrated by the dotted yellow lines and computed by $p(X_t|S_t^1)$ and $p(X_t|S_t^2)$ in (7). VAE-MCMC measures the confidence of $S_t^1$ and $S_t^2$ by estimating the marginal probability $p(S_t)$ in (12). Then, $p(S_t)$ is used to find the best state of the target in the acceptance step of MCMC in (5).
2. Related Work

2.1. MCMC-based Visual Trackers

Isard et al. proposed a particle-filter-based visual tracking method that can model the non-Gaussian posterior of target states and thus accurately track the target despite its nonrigid movements [20]. Khan et al. presented an MCMC-based visual tracking method that can deal with the high-dimensional posterior of target states [22]. Zhou et al. developed a visual tracker based on stochastic approximation Monte Carlo, wherein abrupt changes in the target states can be efficiently explained [50]. Kwon et al. proposed a particle-MCMC-based visual tracking method that simultaneously infers the states of multiple latent variables and provided an efficient sampling strategy in terms of computation time and accuracy [26]. They also presented a visual tracking method based on an uncertainty-calibrated MCMC that can adapt to changes in the shape of posteriors over time according to the current tracking environment [25].

To date, few studies have attempted to combine MCMC with VAE. Caterini et al. integrated VAE into Hamiltonian Monte Carlo, proposing the Hamiltonian variational auto-encoder, which allows accurate approximation of the posterior by producing tight evidence lower bounds [5]. Arulkumaran et al. used MCMC to improve the quality of the sampled states obtained using VAE [2]. Salimans et al. approximated the VAE lower bound without bias using MCMC inference [37]. Unlike the aforementioned methods, the method proposed here integrates VAE into MCMC to pursue efficient inference of high-dimensional posteriors containing multiple latent variables. In comparison to particle MCMC, which has similar goals, our method produces better results in terms of accuracy and computation time.
2.2. VAE-based Visual Trackers

VAE has various applications that employ generative models. For example, VAE was utilized for image captioning by Pu et al. [36], for three-dimensional reconstruction by Tan et al. [41], for visual question generation by Jain et al. [21], for gesture recognition by Shi et al. [38], and for expression recognition by Deng et al. [12]. However, studies that use VAE directly for visual tracking are limited. Although Wang et al. indirectly applied a variational auto-encoder to generate hard positive samples for visual tracking, VAE was not incorporated into the target position estimation process [47]. In contrast, our method integrates VAE into visual tracking algorithms and uses it directly for accurate estimation of target positions over time.

2.3. Visual Trackers Using Deep Features

Various studies have utilized deep features for visual tracking. The deep features extracted by convolutional neural networks have achieved great success in various computer vision applications [6]. These features have also been successfully utilized for visual tracking, resulting in a new category of visual trackers based on deep learning [32, 33, 45]. Wang et al. were the first to incorporate deep features into visual trackers [45], while Ma et al. utilized hierarchical convolutional features trained on object recognition datasets [6] to improve visual tracking accuracy [32]. Nam et al. extracted multi-domain features that adapt the target appearance model to the test sequences [33]. Siamese networks, in which visual tracking problems are transformed into patch matching problems, have recently been proposed [4, 18]. Visual tracking methods based on these networks typically show good performance in terms of speed and tracking accuracy. Held et al. presented a similarity matching function for Siamese networks that shows robustness to variations in target appearance [18]. Bertinetto et al. presented a fully convolutional Siamese network for visual tracking [4], which takes exemplar and search images as input and outputs the localization result using a cross-correlation operation. Similarly to these methods, our method utilizes deep features extracted by the VGG-m network [6]. However, the robustness of our method is attributable to the proposed inference strategy using VAE-MCMC, as experimentally demonstrated. Thus, our method can be integrated into various deep-learning-based tracking methods to improve their visual tracking accuracy.

2.4. Motion Segmentation

Chen et al. proposed a motion detection method based on counter-propagation neural networks [8]. In numerical experiments, their approach outperformed other methods over limited-bandwidth networks. The method in [10] presented a moving object detection method based on long-term background
and short-term foreground models, in which the interaction between these models enhances the accuracy of the method. The method in [9] developed a video stabilization system based on motion estimation; for this purpose, the system employed a shortest-spanning-path clustering algorithm and showed state-of-the-art performance. Kim et al. proposed a visual tracking method that can robustly track targets in the presence of large motions [23]; large motions are dealt with using their proposed coarse-to-fine strategy and image superpixels. The moving vehicle detection method presented in [7] is based on probabilistic neural networks, in which background images are generated in a reliable manner; experimental results demonstrate the qualitative and quantitative effectiveness of the method. In contrast to these methods, our method indirectly utilizes motion segmentation results to estimate the supporters' positions, with their confidence being measured based on the proposed variational auto-encoding method.

3. Bayesian Tracker Augmented by Supporters

Our proposed visual tracker aims to simultaneously find the best state of the target, $\hat{X}_t$, and the best set of states of the supporters, $\hat{S}_t$. To achieve this, we model a joint posterior $p(X_t, S_t)$ and find $\hat{X}_t$ and $\hat{S}_t$:

$$\hat{X}_t, \hat{S}_t = \arg\max_{X_t, S_t} p(X_t, S_t), \quad (1)$$
where $X_t = [x_t\ y_t\ s_t]^T$ is a vector containing the (x, y) position and scale s of the target at time t, $S_t^m = [x_t\ y_t\ s_t]^T$ is a vector containing the (x, y) position and scale s of the m-th supporter at time t, and $S_t = \{S_t^1, \cdots, S_t^M\}$ is obtained by concatenating the M supporters $S_t^m$, for $m = 1, \cdots, M$. In (1), $X_t$ and $S_t$ are represented with a joint probability and should thus be simultaneously inferred for accurate estimation.

Our visual tracker searches for $\hat{X}_t$ and $\hat{S}_t$ using the MCMC sampling method [22]. This method aims to describe $p(X_t, S_t)$ using the proposed samples. This is achieved via two steps, proposal and acceptance. In the proposal step, MCMC proposes new samples, $X'_t, S'_t$, using the current samples, $X_t, S_t$:

$$Q\big((X'_t, S'_t); (X_t, S_t)\big) = \mathcal{N}\big((X'_t, S'_t); (X_t, S_t), \Sigma^2\big), \quad (2)$$

where $Q\big((X'_t, S'_t); (X_t, S_t)\big)$ is a proposal function designed as a normal distribution $\mathcal{N}$ with mean $(X_t, S_t)$ and variance $\Sigma^2$. MCMC then accepts or rejects each proposed sample $(X'_t, S'_t)$ using the following acceptance ratio:

$$a = \min\left[1, \frac{p(X'_t, S'_t)\, Q\big((X_t, S_t); (X'_t, S'_t)\big)}{p(X_t, S_t)\, Q\big((X'_t, S'_t); (X_t, S_t)\big)}\right], \quad (3)$$

where a is the acceptance probability. MCMC iteratively performs both steps and obtains N target states, $\{X_t^n \mid n = 1, \cdots, N\}$, at each frame.

Traditional MCMC methods typically suffer from a convergence problem in high-dimensional state spaces. In particular, the joint posterior $p(X_t, S_t)$ in (3) lies in a higher-dimensional space than a conventional marginal posterior $p(X_t)$. To describe $p(X_t, S_t)$, traditional MCMC methods require a considerable number of samples. However, proposing more samples is not practical, as it increases the computational complexity of tracking algorithms and decreases their speed. Therefore, this study presents a more sophisticated and efficient sampling strategy that describes $p(X_t, S_t)$ using a smaller number of samples. To apply this strategy, several challenges need to be addressed. These include designing the joint proposal function $\mathcal{N}\big((X'_t, S'_t); (X_t, S_t), \Sigma^2\big)$ in (2) as well as concurrently proposing samples of the multiple variables in $X_t$ and $S_t$. We overcome these challenges with the VAE-MCMC method, which we introduce in the following section.

4. The VAE-MCMC Method

By applying VAE [24] to MCMC [22], the VAE-MCMC method describes a high-dimensional posterior $p(X_t, S_t)$ and efficiently obtains samples from it. For example, VAE can obtain samples of $S_t$ and estimate $p(S_t)$, while MCMC simultaneously obtains samples of $X_t$. Similarly to the traditional MCMC described in Section 3, the VAE-MCMC method has proposal and acceptance steps. The proposal step obtains a sample comprising a state $X'_t$ and supporters $S'_t$ using the joint proposal function $Q\big((X'_t, S'_t); (X_t, S_t)\big)$. Considering the difficulty of implementing a joint proposal function, we split it into two proposal functions such that each contains only a single variable, $X'_t$ or $S'_t$, making them easy to design and use. Our proposal function is

$$Q\big((X'_t, S'_t); (X_t, S_t)\big) = Q(S'_t; S_t) \times p(X'_t | S'_t), \quad (4)$$

where $Q(S'_t; S_t)$ proposes new supporters using the current supporters and $p(X'_t|S'_t)$ indicates the probability of a new state conditioned on the proposed supporters. Note that $Q(S'_t; S_t)$ and $p(X'_t|S'_t)$ will be defined in (17) and (7), respectively. The acceptance step aims to accept or reject the proposed samples, $X'_t$ and $S'_t$, using the acceptance ratio:

$$a_{\mathrm{VAE\text{-}MCMC}} = \frac{p(X'_t, S'_t)\, Q\big((X_t, S_t); (X'_t, S'_t)\big)}{p(X_t, S_t)\, Q\big((X'_t, S'_t); (X_t, S_t)\big)} = \frac{p(X'_t|S'_t)\, p(S'_t)\, Q(S_t; S'_t)\, p(X_t|S_t)}{p(X_t|S_t)\, p(S_t)\, Q(S'_t; S_t)\, p(X'_t|S'_t)} = \frac{p(S'_t)\, Q(S_t; S'_t)}{p(S_t)\, Q(S'_t; S_t)}, \quad (5)$$

where the first equality holds because of the multiplication rule and the second equality holds because of (4). On one hand, $p(X_t, S_t)$ in (5) is reduced to $p(S_t)$, thus alleviating the high-dimensionality problem associated with $p(X_t, S_t)$. On the other hand, $Q\big((X'_t, S'_t); (X_t, S_t)\big)$ in (5) is reduced to $Q(S'_t; S_t)$, resolving the implementation problem of the joint proposal function. Note that VAE-MCMC can still describe the original posterior $p(X_t, S_t)$, although $p(X_t, S_t)$ is replaced by the different distribution $p(S_t)$ in (5); this useful property is mathematically proven in [1]. The remaining tasks include inferring the target state using the MCMC method and the supporter states using VAE, explained in Sections 4.1 and 4.2, respectively.
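To make the acceptance step concrete, the following is a minimal sketch of one VAE-MCMC iteration over the supporter states, assuming a symmetric Gaussian random-walk proposal for $Q(S'_t; S_t)$ (as later defined in (17)) and a callable `log_p_S` that returns the lower-bound estimate of $\log p(S_t)$ from (12); both names are placeholders rather than the paper's actual implementation.

```python
import numpy as np

def vae_mcmc_step(S_cur, log_p_S, sigma2_S, rng):
    """One proposal/acceptance step of VAE-MCMC over the supporter states.

    S_cur: current supporter states (e.g., an (M, 3) array of [x, y, s]).
    log_p_S: assumed callable returning the lower-bound estimate of log p(S_t) in (12).
    Because the Gaussian random walk below is symmetric, Q(S; S') = Q(S'; S),
    so the proposal terms in (5) cancel and a = min(1, p(S') / p(S)).
    """
    S_new = S_cur + rng.normal(scale=np.sqrt(sigma2_S), size=S_cur.shape)  # Q(S'; S) as in (17)
    a = min(1.0, float(np.exp(log_p_S(S_new) - log_p_S(S_cur))))           # ratio in (5)
    return (S_new, True) if rng.random() < a else (S_cur, False)
```

After the proposal is accepted or rejected, the target state would be drawn conditioned on the surviving supporters through $p(X'_t|S'_t)$ in (7), as described in Section 4.1.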
4.1. Markov Chain Monte Carlo in VAE-MCMC

The MCMC method in VAE-MCMC proposes a new target state $X'_t$ using the proposal function $Q(X'_t; X_t)$. The proposal function is implemented as

$$Q(X'_t; X_t) = \mathcal{N}(X'_t; X_t, \sigma_X^2), \quad (6)$$

where $\mathcal{N}$ is a normal distribution with mean $X_t$ and variance $\sigma_X^2$. Next, the confidence of the proposed state $X'_t$ conditioned on the supporters is measured by $p(X'_t|S'_t)$ in (4), which encodes the relationship between the target and the supporters. To design this probability, we assume that the target is geometrically related to the supporters and that the direction and distance of the target from each supporter change smoothly over time. Thus, the geometric probability of the target conditioned on the supporters is

$$p(X_t|S_t) = \sum_{m=1}^{M} e^{-\lambda_1 (X_t - S_t^m)^T (X_{t-1} - S_{t-1}^m)} + \sum_{m=1}^{M} e^{-\lambda_2 \left(\|X_t - S_t^m\|_2 - \|X_{t-1} - S_{t-1}^m\|_2\right)^2}, \quad (7)$$

where the first and second terms force the direction and distance of the target from a supporter at time t to be consistent with those at time t-1, and $\lambda_1$ and $\lambda_2$ are weighting parameters. Here, we assume that the supporters are independent of each other. If the target position is accurately estimated, the direction from the target to each supporter is similar over consecutive frames. In this case, the inner product in the first term, $(X_t - S_t^m)^T (X_{t-1} - S_{t-1}^m)$, approaches zero, thus maximizing $p(X_t|S_t)$. In addition, the distance between the target and the supporter will be similar over consecutive frames, ensuring a higher value of the second term and thus maximizing $p(X_t|S_t)$. The remaining tasks include initializing good supporters $S_t$ and estimating the marginal probability $p(S_t)$ in (5), which is explained in Section 4.2. Note that the technical details of initializing the supporters are presented in Section 5.2.
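As an illustration, a direct translation of (7) into a small routine might look as follows; the state layout [x, y, s] and the array shapes are assumptions made only for this sketch.

```python
import numpy as np

def geometric_prob(X_t, X_prev, S_t, S_prev, lam1=0.1, lam2=0.1):
    """Geometric probability p(X_t | S_t) of (7): the direction and distance
    from each supporter to the target should change smoothly over time.

    X_t, X_prev: target states [x, y, s] at times t and t-1.
    S_t, S_prev: (M, 3) arrays of supporter states at times t and t-1.
    lam1, lam2 correspond to the weighting parameters lambda_1 and lambda_2.
    """
    d_t = X_t - S_t            # (M, 3) target-to-supporter offsets at time t
    d_prev = X_prev - S_prev   # (M, 3) offsets at time t-1
    direction_term = np.exp(-lam1 * np.sum(d_t * d_prev, axis=1))
    dist_gap = np.linalg.norm(d_t, axis=1) - np.linalg.norm(d_prev, axis=1)
    distance_term = np.exp(-lam2 * dist_gap ** 2)
    return float(np.sum(direction_term) + np.sum(distance_term))
```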
4.2. Variational Auto-encoding in VAE-MCMC

The marginal probability $p(S_t)$ in (5) can be estimated by marginalization. This study adopts a variational auto-encoder for efficient marginalization [24]. We write

$$p(S_t) = \int p(S_t|z)\, p(z)\, dz, \quad (8)$$

where z indicates a stochastic latent variable. The idea of VAE is to infer p(z) using $p(z|S_t)$. We can assume the distribution of p(z), for instance, a standard normal distribution:

$$p(z) = \mathcal{N}(z; 0, I). \quad (9)$$

However, we also have to infer $p(z|S_t)$, which can be achieved using variational inference as follows:

$$\begin{aligned} D_{KL}\left[q(z|S_t)\,\|\,p(z|S_t)\right] &= \sum_z q(z|S_t) \log \frac{q(z|S_t)}{p(z|S_t)} = \mathbb{E}_z\left[\log q(z|S_t) - \log p(z|S_t)\right] \\ &= \mathbb{E}_z\left[\log q(z|S_t) - \log \frac{p(S_t|z)\, p(z)}{p(S_t)}\right] \\ &= \mathbb{E}_z\left[\log q(z|S_t) - \log p(S_t|z) - \log p(z)\right] + \log p(S_t), \end{aligned} \quad (10)$$

where $p(z|S_t)$ is approximated using a simpler distribution $q(z|S_t)$, and the difference between $p(z|S_t)$ and $q(z|S_t)$ is minimized in terms of the Kullback-Leibler (KL) divergence $D_{KL}$. The last equality in (10) holds because the expectation is over z, so $\log p(S_t)$ can be moved outside it. Reformulating (10), we obtain the following:

$$\begin{aligned} \log p(S_t) - D_{KL}\left[q(z|S_t)\,\|\,p(z|S_t)\right] &= \mathbb{E}_z\left[-\log q(z|S_t) + \log p(S_t|z) + \log p(z)\right] \\ &= \mathbb{E}_z\left[\log p(S_t|z)\right] - \mathbb{E}_z\left[\log q(z|S_t) - \log p(z)\right] \\ &= \mathbb{E}_z\left[\log p(S_t|z)\right] - D_{KL}\left[q(z|S_t)\,\|\,p(z)\right], \end{aligned} \quad (11)$$

where the second and last equalities hold by the linearity of expectation and the definition of the KL divergence, respectively. Then,

$$\log p(S_t) = D_{KL}\left[q(z|S_t)\,\|\,p(z|S_t)\right] + \mathbb{E}_z\left[\log p(S_t|z)\right] - D_{KL}\left[q(z|S_t)\,\|\,p(z)\right] \geq \mathbb{E}_z\left[\log p(S_t|z)\right] - D_{KL}\left[q(z|S_t)\,\|\,p(z)\right], \quad (12)$$

because $D_{KL}\left[q(z|S_t)\,\|\,p(z|S_t)\right] \geq 0$. Here, $\mathbb{E}_z\left[\log p(S_t|z)\right] - D_{KL}\left[q(z|S_t)\,\|\,p(z)\right]$ is a lower bound of $\log p(S_t)$. Thus, estimating $\log p(S_t)$ amounts to maximizing this lower bound, which gives the value of $p(S_t)$. Implementation details on the modeling of $q(z|S_t)$ and $p(S_t|z)$ in (12) and the maximization of the lower bound are provided in (18) and (19) of Section 5.3.
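The lower bound in (12) is the quantity the tracker evaluates as the supporter confidence $p(S_t)$. A minimal Monte Carlo sketch is given below, assuming a diagonal Gaussian encoder as in (18), the reconstruction likelihood of (19), and hypothetical `encode`/`decode` helpers; the closed-form KL term for Gaussians follows [24].

```python
import numpy as np

def elbo_log_p_S(S, encode, decode, num_samples=8, rng=None):
    """Monte Carlo estimate of the lower bound on log p(S_t) in (12).

    encode(S) -> (mu, sigma) and decode(z) -> S_tilde are assumed helpers
    implementing the encoder q(z|S_t) and decoder p(S_t|z).
    """
    rng = rng or np.random.default_rng()
    mu, sigma = encode(S)
    # Closed-form KL[q(z|S_t) || N(0, I)] for a diagonal Gaussian posterior
    kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - np.log(sigma ** 2) - 1.0)
    recon = 0.0
    for _ in range(num_samples):
        z = mu + sigma * rng.standard_normal(mu.shape)   # reparameterization
        recon += -np.linalg.norm(S - decode(z))          # log p(S_t|z) from (19)
    return recon / num_samples - kl
```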
4.3. Advanced Variational Auto-encoding in VAE-MCMC

Selecting a very simplistic prior, such as the normal distribution for p(z) in (8), could lead to over-regularization and, as a consequence, very poor hidden representations [19]. Tomczak and Welling [44] extended the VAE framework using a new type of prior called the VampPrior. In this section, we present VampPrior-MCMC, which substitutes the VampPrior for the VAE. In the original VAE, p(z) is assumed to be the standard normal distribution with mean 0 and variance I presented in (9). In the VampPrior, p(z) is designed as

$$p(z) = \frac{1}{K}\sum_{k=1}^{K} q(z|u_k) = \frac{1}{K}\sum_{k=1}^{K} \mathcal{N}\big(z; \mu(u_k), \sigma^2(u_k)\big), \quad (13)$$
where $u_k$ is a D-dimensional vector called a pseudo-input, which is learned through back-propagation of the variational auto-encoder, and $\mu(u_k)$ and $\sigma^2(u_k)$ denote the mean and variance of $u_k$, respectively. K represents the number of pseudo-inputs, and the $u_k$ can be considered as hyper-parameters of the prior, as mentioned in [44]. $q(\cdot)$ can be any function; if $q(\cdot)$ is a normal distribution, p(z) takes the form of a mixture of Gaussian distributions, as shown in (13). For the maximization of the lower bound (Section 5.3), the new p(z) in (13) is employed instead of that in (9), resulting in better regularization and a more accurate lower bound.
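For illustration, the VampPrior in (13) can be evaluated as a uniform mixture of the encoder's posteriors at the learned pseudo-inputs. The sketch below assumes the same hypothetical `encode` helper as in the previous sketch and a diagonal Gaussian form for $q(\cdot)$.

```python
import numpy as np

def gaussian_pdf(z, mu, sigma):
    """Diagonal Gaussian density N(z; mu, diag(sigma^2))."""
    var = sigma ** 2
    norm = np.prod(1.0 / np.sqrt(2.0 * np.pi * var))
    return norm * np.exp(-0.5 * np.sum((z - mu) ** 2 / var))

def vamp_prior(z, pseudo_inputs, encode):
    """VampPrior of (13): a uniform mixture over K pseudo-inputs u_k.

    pseudo_inputs: iterable of K learned pseudo-input vectors u_k.
    encode(u) -> (mu, sigma) is the assumed encoder used for q(z|S_t).
    """
    vals = [gaussian_pdf(z, *encode(u)) for u in pseudo_inputs]
    return float(np.mean(vals))
```

The hierarchical variant introduced next extends this mixture by conditioning each component on an additional latent level.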
We further enhance the VampPrior-MCMC method by developing it into the hierarchical VampPrior-MCMC method, in which the VampPrior is hierarchically estimated:

$$p(z) = \frac{1}{K}\sum_{k=1}^{K} q(z|u_k, z')\, q(z'|u_k) = \frac{1}{K}\sum_{k=1}^{K} \mathcal{N}\big(z; \mu(u_k, z'), \sigma^2(u_k, z')\big)\, \mathcal{N}\big(z'; \mu(u_k), \sigma^2(u_k)\big), \quad (14)$$
where µ(uk , z0 ) and σ (uk , z0 ) denote the mean and covariance of (uk , z0 ), respectively. Algorithm 1 describes the entire pipeline for the method. 5. Implementation Details In this section, we provide implementation details on three intermediate processes in our method, namely feature extraction (Section 5.1), supporter initialization (Section 5.2), and lower bound maximization (Section 5.3).
5.1. Features and Likelihood

$p(X_t, S_t)$ in (1) can be estimated more accurately by considering the observation $Y_t$:

$$\begin{aligned} p(X_t, S_t) &= \sum_{\text{all } Y_t} p(X_t, S_t|Y_t)\, p(Y_t) = \sum_{\text{all } Y_t} p(S_t|X_t, Y_t)\, p(X_t|Y_t)\, p(Y_t) \\ &= p(S_t|X_t) \sum_{\text{all } Y_t} p(X_t|Y_t)\, p(Y_t) = p(S_t|X_t) \sum_{i=1}^{I} p(X_t|Y_t^i)\, p(Y_t^i), \end{aligned} \quad (15)$$

where the first equality holds because of Bayes' rule and the marginalization theorem, and the third equality holds because $S_t$ is independent of $Y_t$. In (15), $p(X_t|Y_t^i)$ is the likelihood that compares the i-th deep feature of the region described by $X_t$, $Y_t^i(X_t)$, with the reference feature $Y_t^{ref}$. For the comparison, we utilize the correlation operation [4] as a similarity measure $f(\cdot)$. Then, $p(X_t|Y_t^i)$ is designed as

$$p(X_t|Y_t^i) = e^{-\lambda_3 f\left(Y_t^i(X_t),\, Y_t^{ref}\right)}, \quad (16)$$

where $\lambda_3$ is a weighting parameter. For efficient visual tracking, we observe the deep features of the first (conv-1) and last (conv-5) convolutional layers of the VGG-m network [6], which is similar to the approach in [11]. Thus, in (15), I is 2, and $Y_t^1$ and $Y_t^2$ are the deep features extracted from the conv-1 and conv-5 layers, respectively. We assume that the features extracted from conv-1 and those extracted from conv-5 are equally important, which leads to equal weighting of low-level and semantic-level features; thus, $p(Y_t^1) = p(Y_t^2) = 0.5$. We note that $Y_t^{ref}$ denotes the features of the reference image in the initial frame. The reference image is cropped according to the initial coordinates of the target, and its features are extracted by the VGG-m network, as mentioned above.

In (15), $p(S_t|X_t) \propto p(X_t|S_t)\, p(S_t)$, and $p(X_t|S_t)$ can be considered as a geometric likelihood that measures the similarity of star-like topologies between the target and supporters over time [14]. In contrast, $p(X_t|Y_t)$ in the second term functions as a photometric likelihood that calculates the similarity between the reference and observed appearances around the estimated target position.
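A minimal sketch of the photometric likelihood in (16) and its combination over the two feature levels in (15) is given below; `similarity` stands in for the cross-correlation measure $f(\cdot)$ of [4] and is an assumed callable, not the paper's implementation.

```python
import numpy as np

def photometric_likelihood(feat_candidate, feat_reference, similarity, lam3=0.05):
    """p(X_t | Y_t^i) = exp(-lambda_3 * f(Y_t^i(X_t), Y_t^ref)) as in (16)."""
    return float(np.exp(-lam3 * similarity(feat_candidate, feat_reference)))

def appearance_term(candidate_feats, reference_feats, similarity, lam3=0.05):
    """Second factor of (15): sum_i p(X_t | Y_t^i) p(Y_t^i) over the conv-1 and
    conv-5 features, with equal priors p(Y_t^1) = p(Y_t^2) = 0.5."""
    return sum(0.5 * photometric_likelihood(c, r, similarity, lam3)
               for c, r in zip(candidate_feats, reference_feats))
```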
5.2. Initialization of Supporters

We consider supporters as regions whose motions are correlated with those of the target [15]. Therefore, to initialize the supporters $S_t$, we first extract the correlated motions via motion segmentation [34]. Next, we initialize the positions of the supporters in the motion segment that also contains the center position of the target. To determine good positions for the supporters, we adopt the approach proposed in [27], which is similar to feature point extraction algorithms such as the scale-invariant feature transform (SIFT) [30]. However, we can also initialize the supporters randomly or using different feature point extraction algorithms such as speeded-up robust features (SURF) [3], as explained in Section 6.1.1.
Table 1. Performance according to different initialization settings. The accuracy was measured in terms of the AUC of the success plot, while the computation time was measured in terms of FPS.

Method    | AUC   | Number | FPS
SIFT [30] | 0.686 | 1      | 8.1
SURF [3]  | 0.683 | 5      | 6.9
RANDOM    | 0.675 | 10     | 5.7
UNIFORM   | 0.672 | 20     | 4.6

Fig. 2. Initialization of supporters. (a) Input images. (b) Motion segmentation and initialization of supporters. Given an input image, we first conduct motion segmentation to produce two segments (i.e., red and gray regions). Next, we select the motion segment that contains the target position (i.e., orange circles). Finally, in the selected segment, we find the M most robust positions based on the feature point extraction criteria (i.e., blue circles). In (b), M is set to 3. The target position and supporters are geometrically related, as depicted by the yellow lines.
Figure 2 shows the results of the supporter initialization process. The positions of the supporters should be initialized in the first frame of a video. In consecutive frames, their new positions $S'_t$ are proposed based on their current positions $S_t$ by the following function:

$$Q(S'_t; S_t) = \mathcal{N}(S'_t; S_t, \sigma_S^2), \quad (17)$$

where $\mathcal{N}$ is a normal distribution with mean $S_t$ and variance $\sigma_S^2$. Note that $Q(S'_t; S_t)$ in (17) is used in (4). During visual tracking, the supporters can be re-initialized when their confidence is very low, i.e., when $p(S_t) < \theta$, where $p(S_t)$ is estimated in (12). For re-initialization, we follow the same approach as the initialization. The parameter $\theta$ is empirically set to 0.3 and fixed throughout all the experiments.
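The proposal in (17) and the confidence-triggered re-initialization can be sketched as follows; `init_supporters` is a hypothetical helper standing in for the motion-segmentation-based initialization described above.

```python
import numpy as np

def propose_supporters(S_cur, sigma2_S=0.1, rng=None):
    """Gaussian random-walk proposal Q(S'_t; S_t) of (17) over the (M, 3) supporter states."""
    rng = rng or np.random.default_rng()
    return S_cur + rng.normal(scale=np.sqrt(sigma2_S), size=S_cur.shape)

def maybe_reinitialize(S_cur, p_S, init_supporters, theta=0.3):
    """Re-initialize the supporters when their confidence p(S_t), estimated with
    the lower bound in (12), drops below theta (0.3 in all experiments)."""
    return init_supporters() if p_S < theta else S_cur
```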
5.3. Maximization of the Lower Bound

We implement the maximization of the lower bound in (12) using a feed-forward neural network (i.e., a variational auto-encoder), as proposed in [24]. In this context, $q(z|S_t)$ in (12) is the encoder that transforms the input $S_t$ into its latent representation z, whereas $p(S_t|z)$ is the decoder that recovers $\tilde{S}_t$ from z, where $\tilde{S}_t$ denotes the recovered supporters. p(z) is the prior, typically set to the standard normal distribution. Then,

$$q(z|S_t) = \mathcal{N}\big(z; \mu(S_t), \sigma^2(S_t)\big), \quad (18)$$

where $\mu(S_t)$ and $\sigma^2(S_t)$ denote the mean and variance of $S_t$, respectively. We can generate z as $\mu(S_t) + \epsilon\, \sigma(S_t)$ via the encoder, where $\epsilon \sim \mathcal{N}(0, 1)$. The decoder estimates $p(S_t|z)$ by recovering $S_t$ from z and computing the reconstruction error:

$$p(S_t|z) = e^{-\|S_t - \tilde{S}_t\|_2}. \quad (19)$$
To maximize the lower bound, we define the loss function of the variational auto-encoder network as

$$\mathcal{L} = -\left(\mathbb{E}_z\left[\log p(S_t|z)\right] - D_{KL}\left[q(z|S_t)\,\|\,p(z)\right]\right) \approx \|S_t - \tilde{S}_t\|_2 - \frac{1}{2}\sum\left(1 + \log \sigma(S_t)^2 - \mu(S_t)^2 - \sigma(S_t)^2\right), \quad (20)$$

and train the network.
The approximation in (20) holds by Appendix B of [24]. We utilize the supporters as training data, which can be obtained as described in Section 5.2.
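A per-sample sketch of the loss in (20) is given below, written with the closed-form Gaussian KL term from Appendix B of [24]; `mu`, `sigma`, and `S_recon` are assumed to come from the encoder and decoder of (18) and (19).

```python
import numpy as np

def vae_loss(S, mu, sigma, S_recon):
    """Negative lower bound of (12) for one training sample, as in (20)."""
    recon = np.linalg.norm(S - S_recon)  # -log p(S_t|z) under (19)
    kl = -0.5 * np.sum(1.0 + np.log(sigma ** 2) - mu ** 2 - sigma ** 2)  # KL[q(z|S_t) || N(0, I)]
    return float(recon + kl)
```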
6. Experiments

The number of supporters, M, and the number of sample states for the target, N, were set to 5 and 100, respectively. The parameters $\sigma_X^2$ in (6), $\lambda_1$ and $\lambda_2$ in (7), $\lambda_3$ in (16), and $\sigma_S^2$ in (17) were fixed at 0.2, 0.1, 0.1, 0.05, and 0.1, respectively, throughout all the experiments. The ablation study in Section 6.1 examines the sensitivity of the performance to these parameters. For each compared method, the best parameters were carefully selected to obtain the best performance. The proposed tracker was implemented in MATLAB 2018a. The three proposed trackers, VAE-MCMC, VampPrior-MCMC, and hierarchical VampPrior-MCMC, were compared with 30 non-deep-learning-based trackers on the OTB benchmark [48]. The OTB benchmark, the most widely used dataset for visual tracking, contains 50 sequences from the recent literature, and the test sequences are manually annotated with 9 important visual tracking attributes. In addition, we compared our trackers with recently developed deep-learning-based trackers such as TADT [29], SiamDW [49], DAT [35], and SiamRPN++ [28]. We utilized the 50 test sequences from [48] for evaluation and employed ECO [11] as the backbone network. The experiments were performed on a computer with an Intel i7 3.60 GHz CPU and a GeForce Titan XP graphics card.

We used the precision plot, success plot, and area under the curve (AUC) as our evaluation metrics [48]. For the precision plot, we measure the center location error, which is the Euclidean distance between the centers of the tracked bounding box $b_t$ and the ground-truth bounding box $b_g$. The precision plot indicates the percentage of frames in which the center location error is less than a given threshold. To compute the success rate, we calculate the overlap score $S = \frac{|b_t \cap b_g|}{|b_t \cup b_g|}$, where $\cap$ and $\cup$ represent the intersection and union of two bounding boxes, respectively, and $|\cdot|$ denotes the number of pixels in a region. The success rate is then the ratio of successfully tracked frames, i.e., frames whose overlap score S is larger than a given threshold, to the total number of frames. The success plot depicts the success rate as the threshold varies between 0 and 1. AUC is the area under either the precision or the success plot.
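For reference, these evaluation metrics can be sketched as below, assuming axis-aligned [x, y, w, h] bounding boxes; this is a simplified stand-in for the official OTB toolkit rather than the toolkit itself.

```python
import numpy as np

def center_error(box_t, box_g):
    """Center location error between tracked and ground-truth boxes [x, y, w, h]."""
    ct = np.array([box_t[0] + box_t[2] / 2.0, box_t[1] + box_t[3] / 2.0])
    cg = np.array([box_g[0] + box_g[2] / 2.0, box_g[1] + box_g[3] / 2.0])
    return float(np.linalg.norm(ct - cg))

def overlap_score(box_t, box_g):
    """Overlap S = |b_t ∩ b_g| / |b_t ∪ b_g| used by the success plot."""
    x1 = max(box_t[0], box_g[0]); y1 = max(box_t[1], box_g[1])
    x2 = min(box_t[0] + box_t[2], box_g[0] + box_g[2])
    y2 = min(box_t[1] + box_t[3], box_g[1] + box_g[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_t[2] * box_t[3] + box_g[2] * box_g[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(scores, threshold):
    """Fraction of frames whose overlap score exceeds the given threshold."""
    return float(np.mean(np.asarray(scores) > threshold))
```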
Fig. 3. Performance according to the number of supporters (precision and success plots).
6.1. Ablation Study

The effectiveness of each component of our method was verified by conducting several ablation tests and answering the following questions:
• How does the number of supporters affect the accuracy of our tracker? (Section 6.1.1)
• How does the number of sample states affect the accuracy of our tracker? (Section 6.1.2)
• How sensitive is the tracker to the hyper-parameters? (Section 6.1.3)
• How important is the strategy for estimating $p(S_t)$ to our tracker? (Section 6.1.5)

6.1.1. The number of supporters

Figure 3 depicts the precision and success plots of our tracker according to the number of supporters. For this experiment, we varied the number of supporters from 1 to 20. As shown in the figure, the accuracy of our tracker improved as the number of supporters increased, implying that the supporters play an important role in accurately tracking the target. However, this also led to increased computation time.
Fig. 4. Performance according to the number of sample states (precision and success plots).
Furthermore, we observed that when the number of supporters exceeds 20, the additional supporters' contribution to the success of visual tracking is insignificant. Thus, we empirically chose five supporters as a good trade-off between accuracy and computation time. Table 1 reports the accuracy of our tracker with respect to different initialization methods and its speed with respect to different numbers of supporters. We obtained the best results in terms of AUC when we used the SIFT method for initializing the supporters. As mentioned above, more supporters improved the accuracy of our tracker but also slowed it down.

6.1.2. The number of sample states

Figure 4 illustrates the precision and success plots of our tracker according to the number of sample states. For this experiment, we varied the number of samples from 10 to 500. As the number of sample states increased, the accuracy of our tracker increased, but only slightly. In particular, the accuracy did not drop significantly even when only 10 samples were used, thus verifying the effectiveness of our tracker with a small number of samples. This useful property may be attributable to the proposed supporters, which utilize geometric consistency over time to aid the estimation of target positions with a smaller number of samples.
Table 2. Performance according to different hyper-parameter settings. The numbers in parentheses denote the parameter values. The accuracy was measured in terms of the AUC of the success plot.

$\sigma_X^2$: AUC 0.680 (0.01), 0.685 (0.10), 0.686 (0.20), 0.686 (0.50), 0.684 (1.00)
$\lambda_1$: AUC 0.680 (0.01), 0.679 (0.05), 0.686 (0.10), 0.685 (0.20), 0.681 (0.50)
$\lambda_2$: AUC 0.682 (0.01), 0.680 (0.05), 0.686 (0.10), 0.686 (0.20), 0.682 (0.50)
$\lambda_3$: AUC 0.678 (0.01), 0.682 (0.03), 0.686 (0.05), 0.683 (0.07), 0.686 (0.10)
$\sigma_S^2$: AUC 0.680 (0.01), 0.681 (0.05), 0.686 (0.10), 0.681 (0.50), 0.682 (1.00)

Fig. 5. Results of motion segmentation. (a) Deer sequence. (b) Ironman sequence. (c) Matrix sequence. (d) Lemming sequence.
6.1.3. Hyper-parameters

Our tracker uses $\sigma_X^2$ in (6), $\lambda_1$ and $\lambda_2$ in (7), $\lambda_3$ in (16), and $\sigma_S^2$ in (17) as hyper-parameters: $\sigma_X^2$ and $\sigma_S^2$ control the possible changes in the transitions and scales of the target and supporters, respectively, while $\lambda_1$, $\lambda_2$, and $\lambda_3$ adjust the importance of the consistency of direction from the target to a supporter, the consistency of distance between the target and a supporter, and the appearance consistency of the target, respectively. Table 2 summarizes the AUC of the success plot according to different hyper-parameter settings and shows that the performance of our tracker is not very sensitive to these settings.

6.1.4. Motion segmentation

Figure 5 shows the results of motion segmentation obtained using the algorithm in [34]. From the figure, we observe that the pixels belonging to an object typically have the same motion and are segmented into the same region. If we select supporter regions from the segment that includes the target region, the motions of the target and supporters will be correlated with each other and show consistency over time. Using this consistency, our tracker can infer the position of the target with the help of the supporters even if the target disappears from the scene.

6.1.5. Estimation of the marginal probability $p(S_t)$

In this experiment, we examined various methods, namely VAE, VampPrior, and hierarchical VampPrior, for the estimation of the marginal probability $p(S_t)$. Figure 6 shows the accuracy of trackers with the different estimation methods.
Fig. 6. Performance according to estimation methods for $p(S_t)$ (precision and success plots).
The VAE method assumes that the probability has a normal distribution, whereas the VampPrior method can be applied to a probability with any distribution. The hierarchical VampPrior method improves on the VampPrior method by hierarchically estimating the marginal probability. Thus, the hierarchical VampPrior method yielded the highest accuracy in estimating $p(S_t)$, and the hierarchical VampPrior-MCMC tracker showed the most accurate performance, as shown in Figure 6.

6.2. Quantitative Comparison

Figure 7 shows the comparison between our tracker and non-deep-learning-based trackers. Our tracker significantly outperformed the other trackers in terms of precision and success rate. Moreover, the VAE-MCMC tracker was superior to other sampling-based trackers (specifically, ASLA, VTS, and VTD) with the same number of samples, as well as to correlation-filter-based trackers (specifically, SRDCF). Figure 8 compares our tracker with recently developed deep-learning-based trackers. Our tracker surpassed the performance of TADT, SiamDW, COT, SINT, and FCN with respect to success rate and was competitive with ECO.
Fig. 7. Quantitative comparison with non-deep-learning-based trackers (precision and success plots).
These results demonstrate that the use of supporters and VAE is very useful for accurate visual tracking. Our tracker also verified that a neural network (i.e., VAE) can be well integrated into the MCMC framework, considerably enhancing sampling methods.

We also compared our method with the top 10 trackers (e.g., LSART [40], CFWCR [17], CFCF [16], ECO [11], MCCT [46], and CSRDCF [31]) at VOT 2017 (http://www.votchallenge.net/vot2017/). The VOT 2017 dataset contains 60 videos featuring diversity and richness in attributes. In the VOT 2017 dataset, the least challenging sequences from previous VOT datasets were replaced with new sequences, making the benchmark more difficult for state-of-the-art methods. Our method, VAE-MCMC, outperformed the other methods in terms of accuracy, as shown in Figure 9. This experiment demonstrates the consistent effectiveness of our method across multiple visual tracking datasets.

Figure 10 compares several deep-learning-based trackers in terms of computation time. ECO was the fastest because it utilizes correlation filters for inference. Our tracker was the second best in terms of frames per second, despite being based on sampling methods, which are known to be slow. Our proposed tracker reduces the computation time without reducing accuracy by considerably decreasing the number of sample states with the help of supporters.
Fig. 8. Quantitative comparison with deep-learning-based trackers (precision and success plots).
6.3. Qualitative Comparison

Figure 11 shows the qualitative tracking results of our tracker and ECO. The test sequences contain severe illumination changes (e.g., the Skating1, Singer1, Matrix, and Ironman sequences), occlusion (e.g., the Soccer sequence), pose variation (e.g., the Skiing, MotorRolling, and Ironman sequences), background clutter (e.g., the Matrix and Deer sequences), and abrupt motion (e.g., the Skating1 and Deer sequences). Our tracker successfully tracked the targets, while ECO missed them in several sequences, as shown in Figures 11(a)-(c). In particular, our tracker gave good estimates of the scales of the targets, as shown in Figures 11(e)-(f).

Fig. 11. Qualitative comparison. The red and green boxes denote the tracking results of hierarchical VampPrior-MCMC and ECO, respectively. (a) Ironman, (b) Matrix, (c) Shaking, (d) Singer1, (e) Skating1, (f) Skiing, (g) Soccer, (h) Deer, (i) MotorRolling, (j) MountainBike, (k) Singer2, and (l) Surfer sequences.
Fig. 9. Quantitative comparison with the top 10 trackers at VOT 2017.

Fig. 10. Comparison based on the computation time in terms of frames per second (Ours, ECO, COT, SINT, and FCN).
7. Conclusion

This paper presents a visual tracking method using VAE and MCMC. The MCMC sampling method estimates the target state, while VAE finds good supporters. The supporters are initialized based on motion segmentation and are correlated with the target. Thus, they help to accurately estimate the target states, particularly when the target is not clearly visible owing to the various obstacles found in visual tracking environments. Experimental results demonstrate that our tracker outperforms conventional visual trackers, including recent deep-learning-based trackers.

Acknowledgment

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (NRF-2018R1A4A1059731).

References

[1] Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle markov chain monte carlo methods. J. R. Statist. Soc. B, 72(3):269–342.
[2] Arulkumaran, K., Creswell, A., and Bharath, A. A. (2016). Improving sampling from generative autoencoders with markov chains. CoRR, abs/1610.09296. [3] Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. (2008). Speeded-up robust features (surf). Comput. Vis. Image Underst., 110(3):346–359. [4] Bertinetto, L., Valmadre, J., Henriques, J. F., Vedaldi, A., and Torr, P. H. (2016). Fully-convolutional siamese networks for object tracking. CoRR, abs/1606.09549. [5] Caterini, A. L., Doucet, A., and Sejdinovic, D. (2018). Hamiltonian variational auto-encoder. CoRR, abs/1805.11328. [6] Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman., A. (2014). Return of the devil in the details: Delving deep into convolutional nets. In BMVC. [7] Chen, B.-H. and Huang, S.-C. (2015). Probabilistic neural networks based moving vehicles extraction algorithm for intelligent traffic surveillance systems. Inf. Sci, 299:283–295. [8] Chen, B.-H., Huang, S.-C., and Yen, J.-Y. (2018). Counter-propagation artificial neural network-based motion detection algorithm for static-camera surveillance scenarios. Neurocomputing, 273:481–493. [9] Chen, B.-H., Kopylov, A., Huang, S.-C., Seredin, O., Karpov, R., Kuo, S.-Y., Robert Lai, K., Tan, T.-H., Gochoo, M., Bayanduuren, D., Gong, C.S., and Hung, P. C. (2016). Improved global motion estimation via motion vector clustering for video stabilization. Eng. Appl. Artif. Intell., 54(C):39– 48. [10] Chen, Z., Wang, R., Zhang, Z., Wang, H., and Xu, L. (2019). Backgroundforeground interaction for moving object detection in dynamic scenes. Inf. Sci, 483:65–81. [11] Danelljan, M., Bhat, G., Shahbaz Khan, F., and Felsberg, M. (2017). Eco: Efficient convolution operators for tracking. In CVPR. [12] Deng, Z., Navarathna, R., Carr, P., Mandt, S., Yue, Y., Matthews, I., and Mori, G. (2017). Factorized variational autoencoders for modeling audience reactions to movies. In CVPR. [13] Feng, W., Ji, D., Wang, Y., Chang, S., Ren, H., and Gan, W. (2018). Challenges on large scale surveillance video analysis. In CVPRW. [14] Fergus, R., Perona, P., and Zisserman, A. (2005). A sparse object category model for efficient learning and exhaustive recognition. In CVPR. [15] Grabner, H., Matas, J., Gool, L. V., and Cattin, P. (2010). Tracking the invisible: Learning where the object might be. In CVPR. [16] Gundogdu, E. and Alatan, A. A. (2017). Good features to correlate for visual tracking. CoRR, abs/1704.06326. [17] He, Z., Fan, Y., Zhuang, J., Dong, Y., and Bai, H. (2017). Correlation filters with weighted convolution responses. In ICCVW. [18] Held, D., Thrun, S., and Savarese, S. (2016). Learning to track at 100 fps with deep regression networks. In ECCV. [19] Hoffman, M. D. and Johnson, M. J. (2016). Elbo surgery: yet another way to carve up the variational evidence lower bound. In NIPS Workshop. [20] Isard, M. and Blake, A. (1998). Icondensation: Unifying low-level and high-level tracking in a stochastic framework. In ECCV. [21] Jain, U., Zhang, Z., and Schwing, A. (2017). Creativity: Generating diverse questions using variational autoencoders. In CVPR. [22] Khan, Z., Balch, T., and Dellaert, F. (2005). MCMC-based particle filtering for tracking a variable number of interacting targets. TPAMI, 27(11):1805–1918. [23] Kim, C., Song, D., Kim, C.-S., and Park, S.-K. (2019). Object tracking under large motion: Combining coarse-to-fine search with superpixels. Inf. Sci, 480:194–210. [24] Kingma, D. and Welling, M. (2013). Auto-encoding variational bayes. In ICLR. [25] Kwon, J. (2018). 
Uncertainty calibrated markov chain monte carlo sampler for visual tracking based on multi-shape posterior. J Math Imaging Vis., 60(5):681–691. [26] Kwon, J., Dragon, R., and Van Gool, L. (2016). Joint tracking and ground plane estimation. SPL, 23(11):1514–1517. [27] Kwon, J. and Lee, K. M. (2013). Highly nonrigid object tracking via patch-based dynamic appearance modeling. TPAMI, 35(10):2427–2441. [28] Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., and Yan, J. (2019a). SiamRPN++: Evolution of siamese visual tracking with very deep networks. In CVPR. [29] Li, X., Ma, C., Wu, B., He, Z., and Yang, M.-H. (2019b). Target-aware deep tracking. In CVPR. [30] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110.
[31] Lukezic, A., Vojir, T., Zajc, L. C., Matas, J., and Kristan, M. (2017). Discriminative correlation filter with channel and spatial reliability. In CVPR. [32] Ma, C., Huang, J.-B., Yang, X., and Yang, M.-H. (2015). Hierarchical convolutional features for visual tracking. In ICCV. [33] Nam, H. and Han, B. (2015). Learning multi-domain convolutional neural networks for visual tracking. In CVPR. [34] Pathak, D., Girshick, R., Dollr, P., Darrell, T., and Hariharan, B. (2017). Learning features by watching objects move. In CVPR. [35] Pu, S., Song, Y., Ma, C., Zhang, H., and Yang, M.-H. (2018). Deep attentive tracking via reciprocative learning. In NIPS. [36] Pu, Y., Gan, Z., Henao, R., Yuan, X., Li, C., Stevens, A., and Carin, L. (2016). Variational autoencoder for deep learning of images, labels and captions. In NIPS. [37] Salimans, T., Kingma, D. P., and Welling, M. (2014). Markov chain monte carlo and variational inference: Bridging the gap. CoRR, abs/1410.6460. [38] Shi, H., Liu, X., Hong, X., and Zhao, G. (2018). Bidirectional long shortterm memory variational autoencoder. In BMVC. [39] Simon, M., Amende, K., Kraus, A., Honer, J., Saemann, T., Kaulbersch, H., Milz, S., and Gross, H.-M. (2019). Complexer yolo: Real-time 3d object detection and tracking on semantic point clouds. In CVPRW. [40] Sun, C., Wang, D., Lu, H., and Yang, M.-H. (2018). Learning spatialaware regressions for visual tracking. In CVPR. [41] Tan, Q., Gao, L., Lai, Y.-K., and Xia, S. (2018). Variational autoencoders for deforming 3d mesh models. In CVPR. [42] Timofte, R., Kwon, J., and Van Gool, L. (2016). Picaso: Pixel correspondences and soft match selection for real-time tracking. CVIU, 153:162–153. [43] Tokekar, P., Isler, V., and Franchi, A. (2014). Multi-target visual tracking with aerial robots. In IROS. [44] Tomczak, J. M. and Welling, M. (2017). VAE with a vampprior. CoRR, abs/1705.07120. [45] Wang, N. and Yeung, D.-Y. (2013). Learning a deep compact image representation for visual tracking. In NIPS. [46] Wang, N., Zhou, W., Tian, Q., Hong, R., Wang, M., and Li, H. (2018a).
Multi-cue correlation filters for robust visual tracking. In CVPR. [47] Wang, X., Li, C., Luo, B., and Tang, J. (2018b). SINT++: Robust visual tracking via adversarial positive instance generation. In CVPR. [48] Wu, Y., Lim, J., and Yang, M.-H. (2013). Online object tracking: A benchmark. In CVPR. [49] Zhang, Z. and Peng, H. (2019). Deeper and wider siamese networks for real-time visual tracking. In CVPR. [50] Zhou, X. and Lu, Y. (2010). Abrupt motion tracking via adaptive stochastic approximation monte carlo sampling. In CVPR.
Declaration of interests: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.