Nonconvex dictionary learning based visual tracking method☆


Journal Pre-proof

Hongyan Wang, Helei Qiu, Wenshu Li

PII: S0165-1684(20)30078-5
DOI: https://doi.org/10.1016/j.sigpro.2020.107535
Reference: SIGPRO 107535

To appear in: Signal Processing

Received date: 17 June 2019
Revised date: 23 January 2020
Accepted date: 10 February 2020

Please cite this article as: Hongyan Wang, Helei Qiu, Wenshu Li, Nonconvex Dictionary Learning Based Visual Tracking Method, Signal Processing (2020), doi: https://doi.org/10.1016/j.sigpro.2020.107535

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2020 Published by Elsevier B.V.

Highlights

• An inconsistent constraint is applied to the dictionary learning model to make the object and background dictionaries more independent, which enhances the discriminability of the dictionary.
• A nonconvex minimax concave plus function is applied to penalize the sparse coding and error matrices, which avoids over-penalizing large entries and improves the object tracking accuracy.
• Based on the majorization-minimization method, a local linear approximation strategy and the inexact augmented Lagrange multiplier optimization method are utilized to solve the nonconvex optimization problem and obtain an efficient solution with better convergence.

Hongyan Wang a,b,c, Helei Qiu a,b,∗ and Wenshu Li d

a Liaoning Engineering Laboratory of BeiDou High-precision Location Service, Dalian University, Dalian, 116622, China
b Dalian Key Laboratory of Environmental Perception and Intelligent Control, Dalian University, Dalian, 116622, China
c Faculty of Intelligent Manufacturing, Wuyi University, Jiangmen, 529020, China
d School of Information Science and Technology, Zhejiang Sci-Tech University, Hangzhou, 310018, China

ARTICLE INFO

Keywords: Visual tracking; Dictionary learning; Sparse representation; Nonconvex optimization; Bayesian inference

ABSTRACT

To address the severe degradation of object tracking performance in complex circumstances, an object tracking method based on nonconvex discriminative dictionary learning (NDDL) is proposed. First, the object and background samples are acquired according to the temporal and spatial local correlation of objects. Since object and background samples share some common features, an inconsistent constraint is imposed on the dictionaries to improve their robustness and discriminability. Next, a nonconvex minimax concave plus (MCP) function is used to penalize the sparse coding matrices, which avoids the over-penalization caused by convex relaxation methods. Based on sparse representation (SR) theory, an NDDL model is constructed and solved by the majorization-minimization inexact augmented Lagrange multiplier (MM-IALM) optimization method to achieve better convergence. After the optimal discriminative dictionary is obtained, the reconstruction errors of all candidates are calculated to construct the object observation model. Finally, object tracking is carried out within the Bayesian inference framework. Simulation results show that, compared with existing state-of-the-art trackers, the proposed tracker significantly improves the precision and success rate of object tracking in complex circumstances.

1. Introduction

Object tracking is one of the most active research topics in computer vision, with numerous applications including video surveillance [1], human-computer interaction [2], and activity recognition [3]. In the past decades, remarkable progress has been achieved in the field of visual tracking, and numerous tracking methods with high efficiency and robustness have been proposed [4, 5, 6]. However, some challenging problems, for example illumination variation, scale variation, occlusion and deformation, have not been solved effectively, which leads to significant degradation of tracking performance. Therefore, improving tracking performance in complex circumstances is one of the research hotspots in visual tracking.

To improve the performance of visual tracking in complicated scenes, Mei et al. [7] proposed a visual tracking method based on sparse representation (SR), which reconstructs candidates by utilizing object and trivial templates to reduce the impact of occlusion and noise. However, the current tracking result was used directly to replace the template with the lowest similarity, so that external interference can easily be introduced into the template set, which leads to template drift. To tackle this issue, a non-negative dictionary learning (DL) method for updating templates was developed in [8], which fuses the tracking results obtained in recent frames to generate templates with better robustness, and then exploits the obtained templates to achieve accurate object tracking. However, it is difficult to distinguish the object from similar background effectively in complex clutter. Focusing on this problem, a tracking method based on SR and DL was proposed in [9], which associates dictionary atoms with label information to learn a discriminative dictionary, so as to distinguish object and background efficiently. However, in this method the object and background samples were chosen from the object area and from the area far away from the object, respectively, without considering the spatial local correlation between the object and background [10]; consequently, the learned dictionary cannot represent candidates well and has poor discriminability. Aiming at this issue, Xie et al. [11] encoded the appearance information of the object and its adjacent background, trained a linear discriminative model with these samples to improve discriminability, and used a key point matching pattern to enhance tracking performance. However, the samples selected from the object and the adjacent background have some common features, i.e., the learned object and background dictionaries have common atoms, which significantly reduces the discriminability of the dictionary. Regarding this issue, a tracker based on a multi-class discriminative dictionary was considered in [12], which exploits the intra-class visual information and inter-class visual correlations concurrently to learn the shared and class-specific dictionaries, and then imposes an inter-orthogonality constraint on the dictionaries, so that the dictionary has strong discriminability. However, this method does not consider occlusion or noise, which makes it vulnerable to outliers and leads to tracking drift. Considering this, Sui et al. [13] constructed a subspace to represent the object and its adjacent background, and exploited sparse error terms to compensate for damaged samples, improving the robustness to occlusion and noise. Nevertheless, the biased $\ell_1$ norm was used to penalize the error matrix in this method, which may over-penalize large entries and lead to a sub-optimal solution [14, 15, 16, 17], thus affecting the tracking accuracy. To address this, an almost unbiased minimax concave plus (MCP) function was exploited to penalize the error matrix and overcome the unbalanced penalty of the $\ell_1$ norm [18]. However, to the best of our knowledge, nonconvex constraint methods have not yet been applied effectively in visual tracking.

Focusing on the issues illustrated above, a nonconvex discriminative dictionary learning (NDDL) model for visual tracking is developed. First, since the object is easily disturbed by its surrounding background, and in order to improve the separability between the object and background, we consider the temporal local correlation of objects and the spatial local correlation of object and background, unlike previous work which selected background samples randomly. These correlations indicate that objects are strongly locally correlated in the temporal domain, and that the closer the background is to the object spatially, the stronger the correlation between them. From this point of view, the object samples are selected according to the tracking results of recent frames, and the background samples are taken around the object. Besides, there are atoms with common features in the object and background dictionaries, which is harmful to object tracking and reduces the discriminability of the dictionary. To solve this issue, an inconsistent constraint for the dictionary is proposed, which reduces the consistency between dictionaries and makes the object and background dictionaries more independent, improving the discriminability of the dictionary. Furthermore, occlusion and noise always affect the tracking performance and cause the tracker to drift. Concerning this, an error term is added to deal with outliers caused by occlusion and noise and thereby improve the robustness. Moreover, most trackers based on SR or DL use the biased $\ell_1$ norm to penalize the sparse coding or error matrices, which may over-penalize large entries and lead to a sub-optimal solution, thus affecting the tracking accuracy. To address this problem, an almost unbiased MCP function is exploited to penalize the sparse coding and error matrices and overcome the unbalanced penalty of the $\ell_1$ norm, so as to improve tracking accuracy. In addition, the proposed NDDL optimization problem is nonconvex and needs an efficient and fast solution. Inspired by [14], a method named majorization-minimization and inexact augmented Lagrange multiplier (MM-IALM) is utilized to achieve better convergence. Finally, on the basis of the NDDL model, an NDDL-based tracker (NDDLT) is developed. Using the learned discriminative dictionary, an object observation model is constructed within the Bayesian inference framework, and the optimal candidate object is obtained based on the reconstruction error assessment to achieve accurate object tracking.

The main contributions of this work can be summarized as follows:

1) The temporal and spatial local correlation of objects is considered to improve the separability between the object and background.
2) An inconsistent constraint is applied to make the object and background dictionaries more independent, which can improve the discriminability of the dictionary.
3) An error term is added to deal with outliers caused by occlusion and noise to improve the robustness.
4) An MCP function is applied to penalize the sparse coding and error matrices, which avoids the over-penalization of large entries by the $\ell_1$ norm and improves the accuracy of object tracking.
5) Based on the majorization-minimization method, a local linear approximation strategy and the IALM method are utilized to solve the nonconvex optimization problem and obtain an efficient solution with better convergence.
6) An observation model is constructed within the Bayesian inference framework, and the optimal candidate object is obtained based on the reconstruction error.

The rest of this paper is organized as follows. Related work is reviewed in Section 2. Section 3 presents a detailed description of the NDDL model and its optimization method. Section 4 then illustrates the tracking framework incorporating the learned dictionary. The experimental results are reported in Section 5, and Section 6 concludes this paper.

☆ This work is supported by the National Natural Science Foundation of China (No. 61301258 and 31771224), China Postdoctoral Science Foundation (No. 2016M590218), Key Laboratory Foundation (No. 61424010106), Zhejiang Provincial Natural Science Foundation of China (No. Y17C090031), and National Key Research and Development Program of China (No. 2018YFB1004901).
∗ [email protected] (H. Qiu)

Helei Qiu et al.: Preprint submitted to Elsevier

2. Related work

Visual tracking methods can be categorized as generative and discriminative methods. Generative methods [19, 20] cast object tracking as a search task that finds the candidate most similar to the tracked object. Discriminative methods [21, 22, 23, 24] define the tracking task as a detection problem based on a classifier that separates the object from the background.

Correlation Filters. Recently, correlation filtering has been successfully applied in visual tracking [25, 26, 27, 28, 6, 29]. The correlation filter evaluates the similarity between the image area and the learned template by the inner product, and the convolution theorem is used to significantly reduce its computational complexity. Based on correlation filtering, many discriminative correlation filtering (DCF) methods were developed [25, 27, 6, 29], which can effectively exploit all cyclic shifts of object and background training samples. However, the existing DCF paradigm suffers from two major issues, i.e., the spatial boundary effect and temporal filter degradation. To mitigate these issues, a new tracking method was developed in [25], which enables joint spatial-temporal filter learning in a lower dimensional discriminative manifold by employing adaptive spatial feature selection and a temporal consistency constraint. In addition, the underlying DCF formulation is restricted to single-resolution feature maps, which significantly limits its potential. Concerning this, Danelljan et al. [27] employed an implicit interpolation model to formulate the learning problem in the continuous spatial domain. The proposed formulation enables efficient integration of multi-resolution deep feature maps. However, this method has high computational complexity and thus cannot run in real time. Moreover, the complex model, with its massive number of trainable parameters, carries a risk of serious over-fitting [6]. To address these issues, a novel tracker was developed in [6] by revisiting the core DCF formulation and introducing: (i) a factorized convolution operator, which reduces the number of parameters in the model; (ii) a compact generative model of the training sample distribution to reduce memory and time consumption; (iii) a conservative model update strategy with improved robustness and reduced computational complexity.

Deep Learning. In recent years, deep learning has attracted sustained and extensive attention from the industrial and academic communities. Benefiting from deep features with powerful representation ability, deep learning has been applied in various computer vision tasks, such as object tracking [30, 31, 32, 33, 34, 35, 36]. A Siamese network was used in [30] to learn a matching function that matches the target patches of the initial frame with candidates in the following frame. However, Siamese trackers pay less attention to deeper network features. To break this restriction, Li et al. [32] designed a simple yet effective spatial aware sampling strategy, and further trained a ResNet-driven Siamese tracker. Nevertheless, overlapping positive samples and class imbalance limit the performance of deep trackers. Focusing on this problem, adversarial learning was exploited in [35] to identify the mask with target features over a long temporal span, and a high-order cost sensitive loss was designed to decrease the effect of easy negative samples.

Sparse Representation and Dictionary Learning. SR and DL are widely used in image denoising, classification, segmentation, visual tracking and face recognition [37].
The SR-based tracking methods [38, 39, 40, 41, 42] consider the candidate as a linear combination of dictionary atoms. Based on SR, Liu et al. [38] proposed a robust tracker using a local sparse appearance model, and developed a novel DL method with locally constrained sparse representation. To reduce the computational complexity of the $\ell_1$ tracker, a very fast numerical solver based on the accelerated proximal gradient approach was developed to solve the $\ell_1$ norm related minimization problem with guaranteed quadratic convergence [39]. After that, the tracking task was formulated as a binary classification via a naive Bayes classifier with online update in the compressed domain. The purpose of DL is to find a suitable dictionary for dense samples and to transform the samples into an appropriate sparse representation. Regarding DL methods, Aharon et al. [43] developed a K-means singular value decomposition DL method, which attains good results in image restoration. In order to effectively utilize the features and coding coefficients of the training samples, a graph regularized sparse codes method [44] used a manifold learning strategy to design the discriminant function, which enhances the discriminability of the dictionary. In addition, the locally constrained linear coding method [45] was designed by utilizing the distance between the training sample and the atoms, in which the nearest K atoms are selected for encoding to maintain the local features of the training sample. Subsequently, numerous approaches were proposed to improve the performance of the above mentioned methods, which greatly promoted the development of SR and DL techniques. For example, Zhou et al. [42] proposed a DL-based tracker, which encapsulates the local manifold structure of the data to preserve locality and similarity information among instances, and uses the learned dictionary to calculate the corresponding coefficients of each candidate.

Other Methods. Many other methods have also been presented, such as support vector machine (SVM) [46, 47, 48], flock of trackers (FoT) [49, 50, 51] and audio assisted visual tracking [52, 53, 54]. Hare et al. [46] developed a tracker based on structured output prediction, which uses a kernelized structured output SVM and learns online to provide adaptive tracking. Feng et al. [48] proposed a novel method based on the probability hypothesis density (PHD) filter, in which a one-class SVM, trained with features from both color and oriented gradient histograms, was utilized in the update step to mitigate the noise in measurements. Following the idea of FoT [49], a tracker named best structured tracker (BST) [50] exploited a set of local trackers to track patches of the original object independently in an online learning manner. Meanwhile, an outlier detection procedure filtered out the less meaningful ones, and a resampling procedure allowed the trackers that had been filtered out to be reinitialized correctly. Focusing on the issue that visual information alone could not track the object effectively, some methods [52, 53, 54] employed audio information to assist visual tracking. Barnard et al. [52] investigated the problem of visual tracking of multiple human speakers in an office environment.
The proposed method utilized the audio information recorded by a microphone array, i.e., the direction of arrival (DOA) angle, to initialize the visual tracker automatically. Kılıç et al. [53] proposed a novel tracker combining audio and visual data, which employed the DOA angles of the audio sources to reshape the typical Gaussian noise distribution of particles in the propagation step and to weight the observation model in the measurement step. In addition, the authors also proposed a multi-speaker tracker based on the PHD filter [54], in which the DOA of audio sources was utilized to determine the time to propagate the born particles and to reallocate the surviving and spawned particles.

3. The NDDL method

The framework of the proposed NDDLT tracker is shown in Fig. 1, which includes the following parts: sample collection, the NDDL model, the observation model and the evaluation method of candidates. In this section, the NDDL model is first constructed, and then the MM-IALM method is used to tackle the nonconvex discriminative dictionary optimization problem; after that, the convergence and computational complexity of the developed method are analyzed. Finally, the initialization method and the dictionary update scheme are illustrated. The observation model used to evaluate candidates is established in Section 4.

Figure 1: Visual tracking framework based on NDDL model

3.1. The NDDL model

Given a training sample set $\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_L] \in \mathbb{R}^{d \times q}$, where $L$ is the number of classes, $d$ is the feature dimension of each training sample, $q$ is the total number of training samples, and $\mathbf{X}_i \in \mathbb{R}^{d \times q_i}$ ($\sum_{i=1}^{L} q_i = q$) denotes the $q_i$ samples belonging to the $i$-th class. From the training sample set $\mathbf{X}$, a dictionary $\mathbf{D} = [\mathbf{D}_1, \mathbf{D}_2, \cdots, \mathbf{D}_L] \in \mathbb{R}^{d \times k}$ with strong discrimination and robustness should be learned, where $k$ is the total number of atoms in the dictionary $\mathbf{D}$, and $\mathbf{D}_i \in \mathbb{R}^{d \times k_i}$ ($\sum_{i=1}^{L} k_i = k$) is the sub-dictionary associated with the $i$-th class containing $k_i$ atoms. $\mathbf{C} = [\mathbf{C}_1, \mathbf{C}_2, \cdots, \mathbf{C}_L] \in \mathbb{R}^{k \times q}$ represents the coefficient matrix of $\mathbf{X}$ over $\mathbf{D}$, where $\mathbf{C}_i \in \mathbb{R}^{k_i \times q_i}$ denotes the coefficient matrix of $\mathbf{X}_i$ over $\mathbf{D}_i$. According to SR theory, the training samples $\mathbf{X}_i$ can be represented approximately via the dictionary $\mathbf{D}_i$, i.e., $\mathbf{X}_i \approx \mathbf{D}_i \mathbf{C}_i$. Therefore, the basic DL model can be written as

$$\min_{\mathbf{C}_i} \|\mathbf{X}_i - \mathbf{D}_i \mathbf{C}_i\|_F^2 + \alpha \|\mathbf{C}_i\|_0, \tag{1}$$

where $\|\cdot\|_F$ denotes the Frobenius norm, $\|\cdot\|_0$ stands for the $\ell_0$ norm, and $\alpha$ is a regularization parameter. Video sequences inevitably contain occlusion or noise, which leads to outliers and greatly reduces the robustness of the DL method. To tackle this problem, an error term is added to the DL model to deal with the outliers caused by occlusion or noise, so as to improve the robustness of the DL method. Accordingly, the DL model can be constructed as

$$\min_{\mathbf{C}_i} \|\mathbf{X}_i - \mathbf{D}_i \mathbf{C}_i - \mathbf{P}_i\|_F^2 + \alpha \|\mathbf{C}_i\|_0 + \beta \|\mathbf{P}_i\|_0, \tag{2}$$

where $\mathbf{P}_i$ is a reconstruction error matrix and $\beta$ is a regularization parameter.

Obviously, the $\ell_0$ norm optimization problem in (2) is NP-hard and therefore difficult to solve. To tackle it, the $\ell_1$ norm is usually used instead of the $\ell_0$ norm to relax the DL model (2) [10, 13]. However, the $\ell_1$ norm is a biased estimator, which may over-penalize large entries [14, 15]. Focusing on this issue, a nonconvex MCP function is introduced to replace the $\ell_0$ norm and obtain a nearly unbiased estimator [16, 17, 18]. It is defined as follows: given a vector $\mathbf{a} = (a_1, a_2, \cdots, a_p)^T \in \mathbb{R}^p$, $\upsilon > 0$ and $\gamma > 1$, the MCP penalty function can be expressed as

$$J_{\upsilon,\gamma}(a_i) = \upsilon \int_0^{|a_i|} \left(1 - \frac{z}{\gamma\upsilon}\right)_+ dz = \begin{cases} \dfrac{\gamma\upsilon^2}{2}, & |a_i| \ge \gamma\upsilon, \\[2mm] \upsilon|a_i| - \dfrac{a_i^2}{2\gamma}, & |a_i| < \gamma\upsilon, \end{cases} \tag{3}$$

where $(u)_+ = \max\{u, 0\}$. Given a matrix $\mathbf{A}$, the vector MCP function can be extended to the matrix version [15, 18], which is expressed as

$$M_{\upsilon,\gamma}(\mathbf{A}) = \sum_{m,n} J_{\upsilon,\gamma}(A_{mn}). \tag{4}$$
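The closed form in (3) can be checked numerically against its integral definition. The following is a minimal NumPy sketch; the function names `mcp_scalar`, `mcp_matrix` and `mcp_by_quadrature` are illustrative and do not come from the paper.

```python
import numpy as np

def mcp_scalar(a, upsilon=1.0, gamma=2.0):
    """Closed-form MCP penalty J_{upsilon,gamma} of Eq. (3), applied elementwise."""
    a = np.abs(np.asarray(a, dtype=float))
    inner = upsilon * a - a**2 / (2.0 * gamma)   # branch |a| < gamma * upsilon
    outer = gamma * upsilon**2 / 2.0             # branch |a| >= gamma * upsilon
    return np.where(a < gamma * upsilon, inner, outer)

def mcp_matrix(A, upsilon=1.0, gamma=2.0):
    """Matrix MCP M_{upsilon,gamma}(A) of Eq. (4): sum of J over all entries."""
    return float(np.sum(mcp_scalar(A, upsilon, gamma)))

def mcp_by_quadrature(a, upsilon=1.0, gamma=2.0, n=200001):
    """Trapezoidal approximation of upsilon * int_0^{|a|} (1 - z/(gamma*upsilon))_+ dz."""
    z = np.linspace(0.0, abs(a), n)
    g = np.maximum(1.0 - z / (gamma * upsilon), 0.0)
    return upsilon * float(np.sum((g[1:] + g[:-1]) * np.diff(z)) / 2.0)
```

Because the integrand never exceeds 1, the penalty of an entry is at most $\upsilon|a_i|$ and saturates at $\gamma\upsilon^2/2$, which is why large entries are not penalized further.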

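For convenience of expression, let $J_\gamma(\mathbf{A}) = J_{1,\gamma}(\mathbf{A})$ and $M_\gamma(\mathbf{A}) = M_{1,\gamma}(\mathbf{A})$. When $\gamma \to \infty$, $J_\gamma(\mathbf{A}) \to |\mathbf{A}|$ and the function $M_\gamma(\mathbf{A})$ becomes the $\ell_1$ norm, whereas when $\gamma \to 1$ it gives rise to a hard threshold operator corresponding to the $\ell_0$ norm [16]. Thus, $M_\gamma(\mathbf{A})$ bridges the $\ell_1$ and $\ell_0$ norms. Based on the discussion above, replacing the $\ell_0$ norm in model (2) with the MCP function, the NDDL model can be recast as

$$\min_{\mathbf{D}_i, \mathbf{C}_i, \mathbf{P}_i} \|\mathbf{X}_i - \mathbf{D}_i \mathbf{C}_i - \mathbf{P}_i\|_F^2 + \alpha M_\gamma(\mathbf{C}_i) + \beta M_\gamma(\mathbf{P}_i). \tag{5}$$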

The object and background samples are learned independently to obtain the object and background dictionaries, respectively. There may be some atoms with common features between the object and background dictionaries; the relationship between them is shown in Fig. 2. It should be noted that these atoms have no discriminability and make the dictionary redundant, which reduces the discriminability of the dictionary. In addition, reducing the consistency between dictionaries helps to improve the effectiveness of sparse representation. From this point of view, an inconsistency constraint $\|\mathbf{D}_i \mathbf{D}_j^T\|_F^2$ [55] can be applied to the DL model to make the object and background dictionaries more independent and improve the discriminability of the dictionary. Based on the discussion above, the NDDL model can be reformulated as

$$\min_{\mathbf{D}_i, \mathbf{C}_i, \mathbf{P}_i} \|\mathbf{X}_i - \mathbf{D}_i \mathbf{C}_i - \mathbf{P}_i\|_F^2 + \alpha M_\gamma(\mathbf{C}_i) + \beta M_\gamma(\mathbf{P}_i) + \lambda \|\mathbf{D}_i \mathbf{D}_j^T\|_F^2, \tag{6}$$

where $j \neq i$, and $\lambda$ is a regularization parameter.

Figure 2: The relationship between object and background dictionaries.

3.2. The NDDL method

Because the MCP function in model (6) is nonconvex, the NDDL model (6) is a nonconvex optimization problem and cannot be solved directly by convex optimization approaches. Inspired by [14], the nonconvex optimization problem can be solved by the MM-IALM method, which can be regarded as a special case of the MM method [56]. The MM-IALM optimization approach includes one inner loop and one outer loop. In each iteration, the outer loop approximates the original nonconvex problem by local linear approximation (LLA), which transforms it into a weighted convex optimization problem, while the inner loop exploits the IALM approach to minimize this convex problem alternately over each optimization variable; the optimal solution of the original optimization problem is thus approximated until a certain condition is satisfied.

Before giving the detailed optimization process, a generalized shrinkage operator $\mathbb{D}_{\tau,\mathbf{W}}(\mathbf{H})$ needs to be introduced. For $\tau \ge 0$, $\gamma > 1$, given matrices $\mathbf{A}$, $\mathbf{H}$, $\mathbf{A}^{old}$ ($\mathbf{A}^{old}$ represents the matrix obtained in the previous iteration), $\mathbf{W} = (\mathbf{1}_m \mathbf{1}_n^T - |\mathbf{A}^{old}|/\gamma)_+$ and $\mathbf{1}_n = [1, 1, \cdots, 1]^T \in \mathbb{R}^n$, we have

$$[\mathbb{D}_{\tau,\mathbf{W}}(\mathbf{H})]_{mn} = \mathrm{sign}(H_{mn})(|H_{mn}| - \tau W_{mn})_+, \tag{7}$$

where $|\cdot|$ denotes the elementwise absolute value and $\mathrm{sign}(\cdot)$ denotes the sign (plus or minus) of the element in parentheses. Eq. (7) can be regarded as the closed-form solution of the following problem [14]:

$$\varphi_{\tau,\mathbf{W}}(\mathbf{A}) = \min_{\mathbf{A}} \frac{1}{2}\|\mathbf{A} - \mathbf{H}\|_F^2 + \tau Q_\gamma(\mathbf{A}|\mathbf{A}^{old}). \tag{8}$$

Given $\mathbf{A}^{old}$, $Q_\gamma(\mathbf{A}|\mathbf{A}^{old})$ in (8) can be considered as an LLA of $M_\gamma(\mathbf{A})$, which can be expressed as

$$Q_\gamma(\mathbf{A}|\mathbf{A}^{old}) = M_\gamma(\mathbf{A}^{old}) + \sum_{m,n} \left(1 - \frac{|A_{mn}^{old}|}{\gamma}\right)_+ \left(|A_{mn}| - |A_{mn}^{old}|\right). \tag{9}$$
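The shrinkage operator (7) and its weights can be sketched in a few lines of NumPy. The names `lla_weights` and `shrink` are illustrative assumptions, and the weights use the elementwise absolute value of the previous iterate, as in (9).

```python
import numpy as np

def lla_weights(A_old, gamma):
    """W = (1 - |A_old| / gamma)_+ : entries that were large get weight 0."""
    return np.maximum(1.0 - np.abs(A_old) / gamma, 0.0)

def shrink(H, tau, W):
    """Generalized shrinkage of Eq. (7): sign(H) * (|H| - tau * W)_+, elementwise."""
    return np.sign(H) * np.maximum(np.abs(H) - tau * W, 0.0)

def weighted_objective(A, H, tau, W):
    """Objective of Eq. (8) up to the constant part of Q_gamma(A | A_old)."""
    return 0.5 * np.sum((A - H) ** 2) + tau * np.sum(W * np.abs(A))
```

An entry whose previous value satisfies $|A_{mn}^{old}| \ge \gamma$ receives weight 0 and is left unshrunk, which is how the surrogate avoids penalizing large entries.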

Based on the MM-IALM optimization approach illustrated above, a method for solving the NDDL optimization problem is proposed, which can be described as follows.

Outer loop: The outer loop is based on LLA, which can be stated as follows: given a differentiable concave function $f(x) \in (0, +\infty)$, let $f(x|x_t) = f(x_t) + f'(x_t)(x - x_t)$ be the first-order Taylor expansion of $f(x)$ at $x_t$; then $f(x) \le f(x|x_t)$, with equality holding if and only if $x = x_t$. Furthermore, the resulting iterates satisfy $f(x_{t+1}) \le f(x_t)$ if $t > 1$. In order to reduce the computational complexity, a one-step LLA strategy is adopted, i.e., the outer loop is run only once [15, 18] instead of being run until the convergence condition is satisfied. The experiment in [14] shows that the performance of the multi-step LLA strategy, which has higher computational complexity (i.e., waiting for the outer loop to converge), is only slightly better than that of the one-step LLA. Based on (9), the upper bound function of (6) is obtained by substituting $Q_\gamma(\mathbf{C}_i|\mathbf{C}_i^{old})$ and $Q_\gamma(\mathbf{P}_i|\mathbf{P}_i^{old})$ for $M_\gamma(\mathbf{C}_i)$ and $M_\gamma(\mathbf{P}_i)$, respectively. Consequently, the optimization problem (6) can be rewritten as

$$\min_{\mathbf{D}_i, \mathbf{C}_i, \mathbf{P}_i} \|\mathbf{X}_i - \mathbf{D}_i \mathbf{C}_i - \mathbf{P}_i\|_F^2 + \alpha Q_\gamma(\mathbf{C}_i|\mathbf{C}_i^{old}) + \beta Q_\gamma(\mathbf{P}_i|\mathbf{P}_i^{old}) + \lambda \|\mathbf{D}_i \mathbf{D}_j^T\|_F^2. \tag{10}$$
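The upper-bound property that justifies this substitution can be checked numerically. The sketch below, with $\upsilon = 1$ and the illustrative function names `mcp` and `lla_surrogate`, evaluates the matrix MCP and its linear approximation so that the two can be compared at arbitrary points.

```python
import numpy as np

def mcp(A, gamma):
    """Matrix MCP M_gamma(A) with upsilon = 1 (Eqs. (3)-(4))."""
    a = np.abs(A)
    return float(np.sum(np.where(a < gamma, a - a**2 / (2.0 * gamma), gamma / 2.0)))

def lla_surrogate(A, A_old, gamma):
    """Q_gamma(A | A_old) of Eq. (9): MCP at A_old plus a weighted l1 correction."""
    W = np.maximum(1.0 - np.abs(A_old) / gamma, 0.0)
    return mcp(A_old, gamma) + float(np.sum(W * (np.abs(A) - np.abs(A_old))))
```

Since the scalar penalty is concave in $|a|$ with slope $(1 - |a|/\gamma)_+$, the surrogate lies above the MCP everywhere and touches it at $\mathbf{A}^{old}$.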

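Inner loop: The inner loop is based on the IALM approach. It should be noted that the first term in (10) couples the variables $\mathbf{D}_i$ and $\mathbf{C}_i$. As a consequence, in order to solve for the variable $\mathbf{C}_i$ by using the closed-form solution of (8), an auxiliary optimization variable $\mathbf{B}_i = \mathbf{C}_i$ is introduced, and the problem (10) can then be rewritten as

$$\begin{aligned} \min_{\mathbf{D}_i, \mathbf{C}_i, \mathbf{P}_i, \mathbf{B}_i} \; & \|\mathbf{X}_i - \mathbf{D}_i \mathbf{C}_i - \mathbf{P}_i\|_F^2 + \alpha Q_\gamma(\mathbf{B}_i|\mathbf{B}_i^{old}) + \beta Q_\gamma(\mathbf{P}_i|\mathbf{P}_i^{old}) + \lambda \|\mathbf{D}_i \mathbf{D}_j^T\|_F^2 \\ \mathrm{s.t.} \; & \mathbf{B}_i = \mathbf{C}_i. \end{aligned} \tag{11}$$

By employing the IALM method, the constrained optimization problem (11) can be recast as an unconstrained one, which can be formulated as

$$\begin{aligned} & \min_{\mathbf{D}_i, \mathbf{C}_i, \mathbf{P}_i, \mathbf{B}_i} \|\mathbf{X}_i - \mathbf{D}_i \mathbf{C}_i - \mathbf{P}_i\|_F^2 + \alpha Q_\gamma(\mathbf{B}_i|\mathbf{B}_i^{old}) + \beta Q_\gamma(\mathbf{P}_i|\mathbf{P}_i^{old}) + \lambda \|\mathbf{D}_i \mathbf{D}_j^T\|_F^2 + \mathrm{tr}(\mathbf{V}_i^T(\mathbf{C}_i - \mathbf{B}_i)) + \frac{\mu_i}{2}\|\mathbf{C}_i - \mathbf{B}_i\|_F^2 \\ = & \min_{\mathbf{D}_i, \mathbf{C}_i, \mathbf{P}_i, \mathbf{B}_i} \|\mathbf{X}_i - \mathbf{D}_i \mathbf{C}_i - \mathbf{P}_i\|_F^2 + \alpha Q_\gamma(\mathbf{B}_i|\mathbf{B}_i^{old}) + \beta Q_\gamma(\mathbf{P}_i|\mathbf{P}_i^{old}) + \lambda \|\mathbf{D}_i \mathbf{D}_j^T\|_F^2 + \frac{\mu_i}{2}\left\|\mathbf{C}_i - \mathbf{B}_i + \frac{\mathbf{V}_i}{\mu_i}\right\|_F^2, \end{aligned} \tag{12}$$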

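where $\mathbf{V}_i$ is the Lagrange multiplier and $\mu_i > 0$ is a penalty parameter. In the $(q+1)$-th iteration, the variables are updated alternately in the following sequence.

Fixing all other variables to solve for $\mathbf{B}_i$, the problem (12) can be recast as

$$\mathbf{B}_i = \arg\min_{\mathbf{B}_i} \alpha Q_\gamma(\mathbf{B}_i|\mathbf{B}_i^{old}) + \frac{\mu_i}{2}\left\|\mathbf{C}_i - \mathbf{B}_i + \frac{\mathbf{V}_i}{\mu_i}\right\|_F^2. \tag{13}$$

As illustrated above, (7) is the solution of the problem (8), and therefore the solution of (13) can be expressed as

$$\mathbf{B}_i^{q+1} = \mathbb{D}_{\alpha/\mu_i^q, \mathbf{W}_{\mathbf{B}_i}}\left(\mathbf{C}_i^q + \frac{\mathbf{V}_i^q}{\mu_i^q}\right). \tag{14}$$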

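Keeping all other variables unchanged to solve for $\mathbf{C}_i$, (12) is equivalent to the following minimization problem:

$$\mathbf{C}_i = \arg\min_{\mathbf{C}_i} f(\mathbf{C}_i) = \arg\min_{\mathbf{C}_i} \|\mathbf{X}_i - \mathbf{D}_i \mathbf{C}_i - \mathbf{P}_i\|_F^2 + \frac{\mu_i}{2}\left\|\mathbf{C}_i - \mathbf{B}_i + \frac{\mathbf{V}_i}{\mu_i}\right\|_F^2. \tag{15}$$

Obviously, the problem above is convex. In order to obtain the optimal $\mathbf{C}_i$, the derivative of the objective function in (15) is first calculated as

$$d f(\mathbf{C}_i) = (2\mathbf{D}_i^T \mathbf{D}_i + \mu_i \mathbf{I})\mathbf{C}_i - 2\mathbf{D}_i^T(\mathbf{X}_i - \mathbf{P}_i) - (\mu_i \mathbf{B}_i - \mathbf{V}_i), \tag{16}$$

where $\mathbf{I} \in \mathbb{R}^{k_i \times k_i}$ is an identity matrix. Then, setting $d f(\mathbf{C}_i) = 0$, the optimal $\mathbf{C}_i$ can be obtained as

$$\mathbf{C}_i^{q+1} = \left(2(\mathbf{D}_i^q)^T \mathbf{D}_i^q + \mu_i \mathbf{I}\right)^{-1} \left[2(\mathbf{D}_i^q)^T(\mathbf{X}_i - \mathbf{P}_i^q) + \mu_i \mathbf{B}_i^{q+1} - \mathbf{V}_i^q\right]. \tag{17}$$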
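The closed-form update (17) amounts to one linear solve. A minimal NumPy sketch follows; the function names are illustrative, and `np.linalg.solve` is used instead of an explicit inverse purely as an implementation choice of this sketch.

```python
import numpy as np

def update_C(X, D, P, B, V, mu):
    """C-step of Eq. (17): solve (2 D^T D + mu I) C = 2 D^T (X - P) + mu B - V."""
    k = D.shape[1]
    lhs = 2.0 * D.T @ D + mu * np.eye(k)       # k x k, positive definite for mu > 0
    rhs = 2.0 * D.T @ (X - P) + mu * B - V     # k x q
    return np.linalg.solve(lhs, rhs)

def grad_C(C, X, D, P, B, V, mu):
    """Derivative of the objective in Eq. (15), as written in Eq. (16)."""
    k = D.shape[1]
    return (2.0 * D.T @ D + mu * np.eye(k)) @ C - 2.0 * D.T @ (X - P) - (mu * B - V)

def objective_C(C, X, D, P, B, V, mu):
    """Objective of Eq. (15)."""
    return np.sum((X - D @ C - P) ** 2) + 0.5 * mu * np.sum((C - B + V / mu) ** 2)
```

Evaluating `grad_C` at the output of `update_C` should give a matrix that is zero up to round-off, which is a quick way to validate an implementation of this step.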

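Holding all other variables fixed to solve for $\mathbf{P}_i$, the problem (12) can be rewritten as

$$\mathbf{P}_i = \arg\min_{\mathbf{P}_i} \|\mathbf{X}_i - \mathbf{D}_i \mathbf{C}_i - \mathbf{P}_i\|_F^2 + \beta Q_\gamma(\mathbf{P}_i|\mathbf{P}_i^{old}). \tag{18}$$

Similarly, the solution of (18) can be obtained as

$$\mathbf{P}_i^{q+1} = \mathbb{D}_{\beta, \mathbf{W}_{\mathbf{P}_i}}\left(\mathbf{X}_i - \mathbf{D}_i^q \mathbf{C}_i^{q+1}\right). \tag{19}$$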

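Fixing all other variables to solve for $\mathbf{D}_i$, (12) can be reformulated as

$$\mathbf{D}_i = \arg\min_{\mathbf{D}_i} f(\mathbf{D}_i) = \arg\min_{\mathbf{D}_i} \|\mathbf{X}_i - \mathbf{D}_i \mathbf{C}_i - \mathbf{P}_i\|_F^2 + \lambda \|\mathbf{D}_i \mathbf{D}_j^T\|_F^2. \tag{20}$$

It is obvious that (20) is convex. Similar to the derivation of (16), we have

$$d f(\mathbf{D}_i) = -2(\mathbf{X}_i - \mathbf{P}_i)\mathbf{C}_i^T + 2\mathbf{D}_i(\mathbf{C}_i \mathbf{C}_i^T + \lambda \mathbf{D}_j^T \mathbf{D}_j). \tag{21}$$

Setting $d f(\mathbf{D}_i) = 0$, the optimal $\mathbf{D}_i$ can be obtained as

$$\mathbf{D}_i^{q+1} = (\mathbf{X}_i - \mathbf{P}_i^{q+1})(\mathbf{C}_i^{q+1})^T \left(\mathbf{C}_i^{q+1}(\mathbf{C}_i^{q+1})^T + \lambda \mathbf{D}_j^T \mathbf{D}_j\right)^{-1}. \tag{22}$$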
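The dictionary update (22) can be sketched in the same way. The code below assumes that both sub-dictionaries have the same number of atoms ($k_i = k_j$), so that $\mathbf{C}_i \mathbf{C}_i^T + \lambda \mathbf{D}_j^T \mathbf{D}_j$ is well defined; the names `update_D`, `grad_D` and `D_other` (standing for $\mathbf{D}_j$) are illustrative.

```python
import numpy as np

def update_D(X, P, C, D_other, lam):
    """D-step of Eq. (22): D = (X - P) C^T (C C^T + lam * D_j^T D_j)^{-1}."""
    G = C @ C.T + lam * D_other.T @ D_other    # symmetric k x k matrix
    R = (X - P) @ C.T                          # d x k right-hand side
    return np.linalg.solve(G, R.T).T           # D = R G^{-1}, using G = G^T

def grad_D(D, X, P, C, D_other, lam):
    """Derivative of the objective in Eq. (20), as written in Eq. (21)."""
    return -2.0 * (X - P) @ C.T + 2.0 * D @ (C @ C.T + lam * D_other.T @ D_other)

def objective_D(D, X, P, C, D_other, lam):
    """Objective of Eq. (20): data fit plus the inconsistency term."""
    return np.sum((X - D @ C - P) ** 2) + lam * np.sum((D @ D_other.T) ** 2)
```

When $\mathbf{D}_j$ has full column rank and $\lambda > 0$, the matrix being inverted is positive definite, so the solve is well posed even if some rows of $\mathbf{C}_i$ are entirely zero.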

According to [57], 𝐕𝑖 and 𝜇1 can be updated as { 𝑞+1 𝐕𝑖 = 𝐕𝑞𝑖 + 𝜇𝑖𝑞 (𝐁𝑞+1 − 𝐂𝑞+1 𝑖 𝑖 ) , 𝑞+1 𝑞 𝜇𝑖 = min(𝜌𝜇𝑖 , 𝜇𝑖(max) )

(23)

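where $\rho$ is a parameter slightly greater than 1 and $\mu_{i(\max)}$ is the upper bound of the penalty parameter $\mu_i$.

As mentioned above, the proposed MM-IALM method contains an outer loop and an inner loop. Given any initial point $(\mathbf{C}_{i,0}, \mathbf{P}_{i,0})$, the outer loop finds a weighted nonconvex function $Q_\gamma(\mathbf{C}_i, \mathbf{P}_i|\mathbf{C}_{i,0}, \mathbf{P}_{i,0}) = \alpha Q_\gamma(\mathbf{C}_i|\mathbf{C}_{i,0}) + \beta Q_\gamma(\mathbf{P}_i|\mathbf{P}_{i,0})$, which majorizes $G(\mathbf{C}_i, \mathbf{P}_i) = \alpha M_\gamma(\mathbf{C}_i) + \beta M_\gamma(\mathbf{P}_i)$ at $(\mathbf{C}_{i,0}, \mathbf{P}_{i,0})$. After that, the obtained problem (10) can be minimized by the IALM scheme (i.e., (14), (17), (19) and (22)) to find a local solution $(\mathbf{C}^*_{i,0}, \mathbf{P}^*_{i,0}, \mathbf{D}^*_{i,0})$. Subsequently, let $(\mathbf{C}_{i,1}, \mathbf{P}_{i,1}, \mathbf{D}_{i,1}) = (\mathbf{C}^*_{i,0}, \mathbf{P}^*_{i,0}, \mathbf{D}^*_{i,0})$ in the next iteration. In the same way, the method finds the weighted nonconvex function $Q_\gamma(\mathbf{C}_i, \mathbf{P}_i|\mathbf{C}_{i,1}, \mathbf{P}_{i,1})$ which majorizes $G(\mathbf{C}_i, \mathbf{P}_i)$ at the point $(\mathbf{C}_{i,1}, \mathbf{P}_{i,1})$, and so on until the convergence condition $\|\mathbf{C}_i^{q+1} - \mathbf{B}_i^{q+1}\|_\infty < \xi$ is satisfied, where $\xi$ is the convergence threshold. The detailed procedure of our optimization is outlined in Algorithm 1.

Algorithm 1 MM-IALM method for NDDL issue
Input: Training sample set $\mathbf{X}_i$, dictionary $\mathbf{D}_{i,0}$, sparse coding matrix $\mathbf{C}_{i,0}$, auxiliary matrix $\mathbf{B}^*_{i,0}$, error matrix $\mathbf{P}^*_{i,0}$, multiplier $\mathbf{V}_{i,0}$, $\alpha$, $\beta$, $\lambda$, $\rho$, $\mu_{i,0}$, $\mu_{i(\max)}$, $\xi$, $p = 0$;
Output: $\mathbf{D}_i$;
1: while not converged do
2:   $\mathbf{B}^0_{i,p+1} \leftarrow \mathbf{B}^*_{i,p}$, $\mathbf{P}^0_{i,p+1} \leftarrow \mathbf{P}^*_{i,p}$;
3:   $\mathbf{W}_{\mathbf{B}_i} = (\mathbf{1}_m\mathbf{1}_n^T - |\mathbf{B}^*_{i,p}|/\gamma)_+$, $\mathbf{W}_{\mathbf{P}_i} = (\mathbf{1}_m\mathbf{1}_n^T - |\mathbf{P}^*_{i,p}|/\gamma)_+$;
4:   while not converged do
5:     Update $\mathbf{B}^{q+1}_{i,p+1}$ by (14), when others fixed;
6:     Update $\mathbf{C}^{q+1}_{i,p+1}$ by (17), when others fixed;
7:     Update $\mathbf{P}^{q+1}_{i,p+1}$ by (19), when others fixed;
8:     Update $\mathbf{D}^{q+1}_{i,p+1}$ by (22), when others fixed;
9:     Update $\mathbf{V}_i$: $\mathbf{V}_i^{q+1} = \mathbf{V}_i^{q} + \mu_i^{q}(\mathbf{B}^{q+1}_{i,p+1} - \mathbf{C}^{q+1}_{i,p+1})$;
10:    Update $\mu_i$: $\mu_i^{q+1} = \min(\rho\mu_i^{q}, \mu_{i(\max)})$;
11:    Check the convergence condition: $\|\mathbf{C}^{q+1}_{i,p+1} - \mathbf{B}^{q+1}_{i,p+1}\|_\infty < \xi$;
12:    $q = q + 1$
13:  end while
14:  $\mathbf{B}^*_{i,p+1} \leftarrow \mathbf{B}^{q+1}_{i,p+1}$, $\mathbf{P}^*_{i,p+1} \leftarrow \mathbf{P}^{q+1}_{i,p+1}$;
15:  $p = p + 1$
16: end while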
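Line 3 of Algorithm 1 computes the entrywise weights used by the local linear approximation. A minimal sketch, with a hypothetical function name and the value $\gamma = 2$ taken from Section 5.1 as the default, might look like this:

```python
import numpy as np

def lla_weights(M_prev, gamma=2.0):
    """Entrywise weights (1 - |M_prev| / gamma)_+ as in line 3 of Algorithm 1."""
    return np.maximum(1.0 - np.abs(M_prev) / gamma, 0.0)
```

Entries whose magnitude reaches `gamma` receive zero weight, so they are no longer penalized in the following inner loop, while entries close to zero keep a weight close to one.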

3.3. Convergence analysis

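For Algorithm 1, only the updating steps for $\mathbf{C}_i$ and $\mathbf{P}_i$ are nonconvex, so the convergence of the whole algorithm follows if the convergence of these two updating steps is ensured. It is difficult to give a strict mathematical proof of global convergence, but the method has local convergence [14, 15, 18]. Based on (9), the objective function satisfies the following lemma.

Lemma 1. Suppose $G(\mathbf{C}_i, \mathbf{P}_i) = \alpha M_\gamma(\mathbf{C}_i) + \beta M_\gamma(\mathbf{P}_i)$. Then, in each iteration, the values of $G(\mathbf{C}_i, \mathbf{P}_i)$ in the objective function of model (6) are monotonically non-increasing, i.e., $G(\mathbf{C}^*_{i,p+1}, \mathbf{P}^*_{i,p+1}) \le G(\mathbf{C}^*_{i,p}, \mathbf{P}^*_{i,p})$.

Proof 1. Since $G(\mathbf{C}_i, \mathbf{P}_i)$ is a concave function of $(\mathbf{C}_i, \mathbf{P}_i)$, and the weighted function $Q_\gamma(\mathbf{C}_i, \mathbf{P}_i|\mathbf{C}_{i,0}, \mathbf{P}_{i,0})$ obtained by LLA majorizes $G(\mathbf{C}_i, \mathbf{P}_i)$ at $(\mathbf{C}_{i,0}, \mathbf{P}_{i,0})$, we have

$$G(\mathbf{C}_i, \mathbf{P}_i) \le Q_\gamma(\mathbf{C}_i, \mathbf{P}_i|\mathbf{C}_{i,0}, \mathbf{P}_{i,0}). \tag{24}$$

From (9), the equality in (24) holds at $(\mathbf{C}_i, \mathbf{P}_i) = (\mathbf{C}_{i,0}, \mathbf{P}_{i,0})$, that is

$$G(\mathbf{C}_{i,0}, \mathbf{P}_{i,0}) = Q_\gamma(\mathbf{C}_{i,0}, \mathbf{P}_{i,0}|\mathbf{C}_{i,0}, \mathbf{P}_{i,0}). \tag{25}$$

According to the above formulation of both $G(\cdot, \cdot)$ and $Q_\gamma(\cdot, \cdot|\cdot, \cdot)$, and using the iteration steps (i.e., lines 4–13) in Algorithm 1, the optimal solution to $Q_\gamma(\mathbf{C}_i, \mathbf{P}_i|\mathbf{C}^*_{i,p}, \mathbf{P}^*_{i,p})$, denoted by $(\mathbf{C}^*_{i,p+1}, \mathbf{P}^*_{i,p+1})$, can be found as the iteration number increases. Hence, the following inequality can be obtained:

$$Q_\gamma(\mathbf{C}^*_{i,p+1}, \mathbf{P}^*_{i,p+1}|\mathbf{C}^*_{i,p}, \mathbf{P}^*_{i,p}) \le Q_\gamma(\mathbf{C}^*_{i,p}, \mathbf{P}^*_{i,p}|\mathbf{C}^*_{i,p}, \mathbf{P}^*_{i,p}). \tag{26}$$

Combining Eqs. (24)–(26) gives $G(\mathbf{C}^*_{i,p+1}, \mathbf{P}^*_{i,p+1}) \le Q_\gamma(\mathbf{C}^*_{i,p+1}, \mathbf{P}^*_{i,p+1}|\mathbf{C}^*_{i,p}, \mathbf{P}^*_{i,p}) \le Q_\gamma(\mathbf{C}^*_{i,p}, \mathbf{P}^*_{i,p}|\mathbf{C}^*_{i,p}, \mathbf{P}^*_{i,p}) = G(\mathbf{C}^*_{i,p}, \mathbf{P}^*_{i,p})$, which proves the lemma. In addition, according to the discussion in [15] and [17], the local optimal solution of the nonconvex problem is usually better than the global optimal solution of the convex relaxation version.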
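The majorization property used in (24) and (25) can be checked numerically for a scalar penalty. The sketch below assumes the standard MCP form $M_\gamma(x) = |x| - x^2/(2\gamma)$ for $|x| < \gamma$ and $\gamma/2$ otherwise, which is consistent with the weights $(1 - |\cdot|/\gamma)_+$ in Algorithm 1. The paper's own definitions (7)–(9) are not reproduced here, so both functions are illustrative assumptions rather than the authors' exact formulation.

```python
import numpy as np

def mcp(x, gamma=2.0):
    """Assumed scalar MCP, applied entrywise."""
    a = np.abs(x)
    return np.where(a < gamma, a - a**2 / (2.0 * gamma), gamma / 2.0)

def lla_surrogate(x, x0, gamma=2.0):
    """Tangent (LLA) surrogate of the MCP built at x0 and evaluated at x."""
    w = np.maximum(1.0 - np.abs(x0) / gamma, 0.0)   # slope of the MCP at |x0|
    return mcp(x0, gamma) + w * (np.abs(x) - np.abs(x0))
```

Since the MCP is concave in $|x|$, its tangent at $|x_0|$ lies above it everywhere and touches it at $x_0$, which is the scalar counterpart of (24) and (25).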

4. The tracking framework

Based on the dictionary learned in the previous section, this section first introduces the tracking framework built on Bayesian reasoning, and then establishes the object observation model used to track the object.

4.1. Bayesian reasoning

Visual tracking can be regarded as Bayesian reasoning based on a hidden Markov model. The affine parameter $\mathbf{z}_t^i = \{l_x, l_y, \nu, s, \psi, \phi\}$ is used to represent the object state, in which $l_x, l_y, \nu, s, \psi, \phi$ are the horizontal displacement, vertical displacement, horizontal scale factor, rotation angle, aspect ratio and twist angle, respectively [58]. Given a set of observations $\mathbf{y}_{1:t} = \{\mathbf{y}_1, \mathbf{y}_2, \cdots, \mathbf{y}_t\}$ up to the $t$-th frame, the posterior probability is estimated recursively as

$$p(\mathbf{z}_t^i|\mathbf{y}_{1:t}) \propto p(\mathbf{y}_t|\mathbf{z}_t^i)\int p(\mathbf{z}_t^i|\mathbf{z}_{t-1})\,p(\mathbf{z}_{t-1}|\mathbf{y}_{1:t-1})\,\mathrm{d}\mathbf{z}_{t-1}, \tag{27}$$

where $p(\mathbf{y}_t|\mathbf{z}_t^i)$ is the observation model of $\mathbf{y}_t$ under state $\mathbf{z}_t^i$, and $p(\mathbf{z}_t^i|\mathbf{z}_{t-1})$ is the temporal correlation of state transition between two consecutive frames.

Solving Eq. (27) requires a large number of high-dimensional integral operations. A particle filter [59] is therefore applied to approximate the distribution over the location of the object with a finite set of samples. Given all the observations $\mathbf{y}_{1:t}$ up to the $t$-th frame, the optimal state $\hat{\mathbf{z}}_t$ of the object in the current frame is obtained by maximum a posteriori estimation over these samples:

$$\hat{\mathbf{z}}_t = \arg\max_{\mathbf{z}_t^i} p(\mathbf{z}_t^i|\mathbf{y}_{1:t}) = \arg\max_{\mathbf{z}_t^i} p(\mathbf{y}_t|\mathbf{z}_t^i)\,p(\mathbf{z}_t^i|\mathbf{z}_{t-1}), \tag{28}$$

where $i = 1, 2, \cdots, N$ and $N$ is the number of samples. Assuming that the state variables are independent of each other, the motion model of the object can be modeled as a Gaussian distribution [58]:

$$p(\mathbf{z}_t^i|\mathbf{z}_{t-1}) = \mathcal{N}(\mathbf{z}_t^i; \mathbf{z}_{t-1}, \Sigma), \tag{29}$$

where $\mathcal{N}(\mathbf{z}_t^i; \mathbf{z}_{t-1}, \Sigma)$ denotes that $\mathbf{z}_t^i$ obeys a Gaussian distribution with mean $\mathbf{z}_{t-1}$ and covariance $\Sigma$, in which $\Sigma$ is a diagonal covariance matrix whose diagonal elements are the variances of the individual affine parameters.
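The motion model (29) amounts to perturbing the previous state with independent Gaussian noise on each affine parameter. The following sketch shows one way to draw the candidate particles. The function name and the example standard deviations in the usage note are hypothetical, and the default of 600 particles is the setting reported in Section 5.1.

```python
import numpy as np

def propagate_particles(z_prev, std, n_particles=600, rng=None):
    """Draw candidates z_t^i ~ N(z_{t-1}, diag(std**2)), one row per particle."""
    rng = np.random.default_rng() if rng is None else rng
    z_prev = np.asarray(z_prev, dtype=float)
    std = np.asarray(std, dtype=float)          # square roots of the diagonal of Sigma
    noise = rng.standard_normal((n_particles, z_prev.size)) * std
    return z_prev + noise
```

For example, `propagate_particles([100, 80, 1, 0, 1, 0], [4, 4, 0.01, 0, 0.005, 0])` would perturb position, scale and aspect ratio while keeping the two angles fixed.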

4.2. Observation model

For the current candidate set $\mathbf{Y}$, the optimization problem (30) needs to be solved to obtain the sparse coding matrix $\mathbf{C}_i$:

$$\min_{\mathbf{C}_i, \mathbf{P}_i} \|\mathbf{Y} - \mathbf{D}_i\mathbf{C}_i - \mathbf{P}_i\|_F^2 + \beta_1 M_\gamma(\mathbf{P}_i), \tag{30}$$

where $\mathbf{P}_i$ denotes the corresponding error term and $\beta_1$ is a regularization parameter. Obviously, (30) can be solved in the same way as (6). According to the optimal sparse coding matrix obtained from (30), a more relevant candidate should be better represented by the object dictionary, i.e., the corresponding reconstruction error $\varepsilon^i_{\mathbf{D}_1} = \|\mathbf{Y}_i - \mathbf{D}_1\mathbf{c}_1^i\|_F^2$ should be smaller, where $\mathbf{c}_1^i$ is the sparse coefficient vector corresponding to the $i$-th sample $\mathbf{Y}_i$ over the object dictionary $\mathbf{D}_1$. At the same time, the candidate should not be represented effectively by the background dictionary, i.e., the corresponding reconstruction error $\varepsilon^i_{\mathbf{D}_2} = \|\mathbf{Y}_i - \mathbf{D}_2\mathbf{c}_2^i\|_F^2$ should be rather large, where $\mathbf{c}_2^i$ is the sparse coefficient vector corresponding to the $i$-th sample $\mathbf{Y}_i$ over the background dictionary $\mathbf{D}_2$. Based on the discussion above, the object observation model can be constructed as

$$p(\mathbf{y}_t|\mathbf{z}_t^i) \propto \exp\left(-\frac{\varepsilon^i_{\mathbf{D}_1} - \varepsilon^i_{\mathbf{D}_2}}{\sigma}\right), \tag{31}$$

where $\sigma$ is a constant. According to Eq. (31), the posterior probability of each candidate can be estimated, and then the optimal state of the object can be estimated by Eqs. (27) and (28), so that the object can be tracked accurately.
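Given the two reconstruction errors of every candidate, (31) and (28) reduce to a few array operations. The sketch below takes the errors as inputs, because solving (30) is outside its scope. The function names and the optional `log_prior` argument are assumptions introduced for illustration, and $\sigma = 0.1$ is the value reported in Section 5.1.

```python
import numpy as np

def observation_likelihood(err_obj, err_bg, sigma=0.1):
    """Normalized weights proportional to exp(-(err_obj - err_bg) / sigma), Eq. (31)."""
    score = -(np.asarray(err_obj, float) - np.asarray(err_bg, float)) / sigma
    score = score - score.max()      # shift for numerical stability; the scale factor is free
    w = np.exp(score)
    return w / w.sum()

def map_candidate(err_obj, err_bg, log_prior=None, sigma=0.1):
    """Index of the candidate maximizing likelihood times prior, Eq. (28), in the log domain."""
    score = -(np.asarray(err_obj, float) - np.asarray(err_bg, float)) / sigma
    if log_prior is not None:
        score = score + np.asarray(log_prior, float)
    return int(np.argmax(score))
```

A candidate with a small object-dictionary error and a large background-dictionary error gets the highest score, which matches the intent of the observation model.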


4.3. Initialization and dictionary updating

4.3.1. Initialization

In the first frame, a rectangular area is selected manually to obtain the object, and the center of the rectangular area is denoted by $l(x, y)^*$. Within the range $\|l_i - l(x, y)^*\|_2^2 < r$, $q_1$ image blocks are selected as object samples, where $l_i$ is the center of the $i$-th image block and $r$ is the radius of the inner circle. Similarly, within the range $r < \|l_j - l(x, y)^*\|_2^2 < R$, $q_2$ image blocks are sampled as background samples, where $l_j$ is the center of the $j$-th image block and $R$ is the radius of the outer circle. Several object and background samples are randomly selected to initialize $\mathbf{D}_1$ and $\mathbf{D}_2$. The sparse coding and auxiliary matrices are initialized from a random normal distribution, and the error matrix and the Lagrange multiplier are initialized to $\mathbf{0}$.
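The sampling of object and background block centers can be sketched as drawing points from a disc and from an annulus around the selected center. This is an illustration only: the function name and the radii in the usage note are hypothetical, the points are drawn uniformly by area (the paper does not state the sampling distribution), and the plain Euclidean distance is compared with the radii.

```python
import numpy as np

def sample_centers(center, r_inner, r_outer, count, rng=None):
    """Sample `count` points whose distance to `center` lies between r_inner and r_outer."""
    rng = np.random.default_rng() if rng is None else rng
    # Uniform in squared radius gives points uniform by area in the annulus.
    radius = np.sqrt(rng.uniform(r_inner**2, r_outer**2, size=count))
    angle = rng.uniform(0.0, 2.0 * np.pi, size=count)
    offsets = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
    return np.asarray(center, dtype=float) + offsets
```

With the first-frame sample counts from Section 5.1, object centers could be drawn as `sample_centers(c, 0, r, 60)` and background centers as `sample_centers(c, r, R, 100)`.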

4.3.2. Dictionary updating

To ensure that the proposed method can adapt to changes of the object appearance, the dictionary $\mathbf{D}$ should be updated online. Because the object is selected manually in the first frame, it is always the ground truth. Hence, the training sample set $\mathbf{X}_1$ obtained from the first frame is retained all the time to alleviate the drift problem. In order to obtain a more robust and discriminative dictionary, the object and background samples collected from the previous $T$ frames are loaded into a temporary sample pool $\mathbf{X}_{temp} = \{\mathbf{X}_{t-T+1}, \mathbf{X}_{t-T+2}, \cdots, \mathbf{X}_t\}$, where $\mathbf{X}_t$ represents the training samples gathered from the tracking results of the $t$-th frame, so the sample pool can be written as $\mathbf{X}_{train} = \{\mathbf{X}_1, \mathbf{X}_{temp}\}$. Subsequently, a new dictionary $\mathbf{D}$ is learned from the sample pool $\mathbf{X}_{train}$ to track the object in the next frame. It should be pointed out that $\mathbf{X}_{temp}$ needs to be cleared after the optimal dictionary is obtained so that new training samples can be collected.

The tracking result may contain interference such as occlusion or noise while samples are accumulated in the temporary sample pool $\mathbf{X}_{temp}$. If the evaluation value of the optimal object location determined by the tracker (see Section 5.1 for the evaluation method) is larger than the reconstruction error threshold $\theta$, the tracking result is considered unreliable and the frame is skipped to avoid noise; otherwise, the samples of this frame are accumulated into $\mathbf{X}_{temp}$. Note that when a frame is skipped, the dictionary is not updated until the collection of the temporary sample pool is finished. In addition, choosing the value of the reconstruction error threshold $\theta$ is rather difficult [60]. In this paper, the value of $\theta$ is determined experimentally, and its selection remains an open issue to be considered in the future.
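The update policy described above can be summarized as a small bookkeeping class. This is a sketch under assumptions: the class and method names are hypothetical, samples are treated as opaque items, and the defaults $T = 10$ and $\theta = 0.05$ are the values reported in Sections 5.1 and 5.6.2. The reliability test follows the rule exactly as stated in the text.

```python
class SamplePool:
    """Keeps the first-frame samples permanently plus a temporary pool of reliable frames."""

    def __init__(self, first_frame_samples, T=10, theta=0.05):
        self.first = list(first_frame_samples)   # X_1, never discarded
        self.temp = []                           # X_temp, one entry per accepted frame
        self.T = T
        self.theta = theta

    def add_frame(self, samples, evaluation_value):
        """Return X_train once T reliable frames are collected, otherwise None."""
        if evaluation_value > self.theta:
            return None                          # unreliable result: skip this frame
        self.temp.append(list(samples))
        if len(self.temp) < self.T:
            return None                          # pool not full yet: no dictionary update
        train = self.first + [s for frame in self.temp for s in frame]
        self.temp = []                           # clear X_temp after handing out X_train
        return train
```

A tracker loop would call `add_frame` once per frame and relearn the dictionary only when a training set is returned.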

5. Experimental results

The proposed tracker is evaluated on four benchmarks: OTB2015 [3], VOT2016 [4], UAV123 [61] and LaSOT [62]. All experiments are implemented in MATLAB on a 1.8 GHz Intel Core i7-8550U CPU with 8 GB RAM.


5.1. Parameters setup and evaluation metrics

If the number of candidate particles is too large, the computational load during tracking is heavy. Conversely, if it is too small, the computational complexity is low but the optimal candidate location may not be sampled and the integrity of the candidates cannot be guaranteed. As a tradeoff, 600 particles are collected in each frame. The size of each image block is set to 32 × 32, and each image block is represented by gray features. In the first frame, 60 object samples and 100 background samples are collected as training sets to initialize the dictionaries. Considering the validity and timeliness of the proposed method, in each frame after the first, 10 object samples and 60 background samples are collected to form the temporary sample pool for the follow-up dictionary update. The number of atoms in each dictionary is set to 15, and the dictionary is updated every 10 frames (i.e., $T = 10$). The optimal location of the object determined by the proposed tracker is evaluated by $-(\varepsilon^i_{\mathbf{D}_1} - \varepsilon^i_{\mathbf{D}_2})/\sigma$, where the constant $\sigma = 0.1$. The parameter $\gamma$ should be a small real number larger than 1; it is set as $\gamma = 2$. Similar to [18], the other parameters are set as $\rho = 1.2$, $\mu_{i,0} = 10^{-3}$, $\mu_{i(\max)} = 10^{5}$ and $\xi = 10^{-5}$.

The precision and success rate are used as evaluation metrics, as suggested in [3]. The evaluation metric for tracking precision is the center location error, which is defined as the Euclidean distance between the center location of the tracked object and the manually labeled ground truth, i.e., $\sqrt{(x_1 - x_0)^2 + (y_1 - y_0)^2}$, where $(x_1, y_1)$ is the center location of the tracking result and $(x_0, y_0)$ is the real object position. The precision plot indicates the overall tracking performance: it measures the percentage of frames in which the distance between the estimated position of the object and the ground truth is within the given location error threshold. The precision score at a threshold of 20 pixels is used to rank the trackers.

The evaluation metric associated with the tracking success rate is the bounding box overlap, i.e., the overlap score, which is defined as $score = area(R_t \cap R_g)/area(R_t \cup R_g)$, where $R_t$ and $R_g$ denote the tracked and ground truth bounding boxes, respectively, and $\cap$ and $\cup$ represent the intersection and union of two regions, respectively. Thus, $0 \le score \le 1$, and the higher the score, the better the performance of the tracker. To measure the performance on a sequence of frames, the number of successful frames whose overlap score is larger than a given threshold is counted. The success rate is the ratio of the number of successful frames to the total number of frames in a sequence. As the threshold varies between 0 and 1, the success rate changes, and the resulting curve is presented in this work. The area under curve (AUC) of each success plot is used to rank the trackers.
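The two metrics can be written down in a few lines. The sketch below assumes boxes given as `(x, y, width, height)` tuples and approximates the AUC by averaging the success rate over evenly spaced overlap thresholds. Both choices, like the function names, are assumptions made for illustration rather than details taken from the benchmark toolkit.

```python
import math

def center_error(box_a, box_b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx, by = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    return math.hypot(ax - bx, ay - by)

def overlap_score(box_a, box_b):
    """Intersection area divided by union area of two (x, y, w, h) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def precision_score(tracked, truth, threshold=20.0):
    """Fraction of frames whose center location error is within the threshold."""
    errors = [center_error(t, g) for t, g in zip(tracked, truth)]
    return sum(e <= threshold for e in errors) / len(errors)

def success_auc(tracked, truth, n_thresholds=21):
    """Mean success rate over evenly spaced overlap thresholds in [0, 1]."""
    scores = [overlap_score(t, g) for t, g in zip(tracked, truth)]
    thresholds = [i / (n_thresholds - 1) for i in range(n_thresholds)]
    rates = [sum(s > th for s in scores) / len(scores) for th in thresholds]
    return sum(rates) / len(rates)
```

For instance, two 10 × 10 boxes shifted horizontally by 5 pixels have a center location error of 5 and an overlap score of 1/3.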

5.2. Results on OTB2015

Some numerical examples are provided to illustrate the merits of the developed method compared to the following 9 state-of-the-art trackers: CT [41], TLD [22], LSK [38], L1APG [39], ASLA [20], MTT [40], KCF [5], Struck [46] and LSHT [63], which are run on the 40 video sequences¹ selected from the OTB2015 [3] dataset². These test sequences mainly contain complex scenes with challenging factors, e.g., illumination variation, scale variation, occlusion and deformation.

Figure 3: Precision and overlap success plots over four tracking challenges: illumination variation, scale variation, occlusion and deformation (panels (a)-(h)).

Figure 4: Overall performance comparison using precision and success plots (panels (a)-(b)).

5.2.1. Quantitative comparison

1) Attribute-Based Evaluation: Fig. 3 displays results for four main tracking challenges. The proposed method performs favorably under illumination variation, scale variation, occlusion and deformation challenges. These experimental results also verify that the proposed tracker is robust against some of the common challenges in visual tracking.

2) Overall Performance: Fig. 4 shows the overall performance of all the evaluated trackers on the 40 video sequences. Fig. 4(a) shows the precision plot, in which the decimal represents the precision score at a threshold of 20. Fig. 4(b) shows the success plot, in which the decimal represents the AUC. On the whole, the proposed tracking method performs favorably against the state-of-the-art methods. Compared to the sparse trackers ASLA and L1APG, the tracking precision of our tracker improves by about 7.3% and 17.6%, and the tracking success rate improves by about 4% and 17.7%, respectively.

¹ The 40 video sequences include Basketball, Car1, Car2, Car4, Car24, CarDark, CarScale, Couple, Crossing, Crowds, Dancer, David3, Dog, Dog1, Doll, Dudek, FaceOcc1, FaceOcc2, Fish, FleetFace, Freeman1, Freeman3, Girl2, Human7, Jogging.1, Jogging.2, KiteSurf, Man, Mhyang, Panda, RedTeam, Singer1, Skater, Skating1, Subway, Surfer, Suv, Sylvester, Walking, Walking2.
² Details of the dataset can be found at http://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html

5.2.2. Qualitative comparison

Some key frames of the test sequences are selected from the tracking results and displayed in Fig. 5. To validate the tracking performance against tracking challenges, we mainly discuss illumination variation, scale variation, occlusion and deformation.

1) Illumination variation: In the sequences of Fig. 5(a)-(f), the objects suffer from a variety of illumination variations. Nearly all other trackers lose their objects, or their tracking boxes are too large. However, our tracker can always track the object effectively under illumination variation. This is attributed to the fact that the dictionaries of the proposed NDDL method are moderately updated according to the changes in the appearance of the object.

2) Scale variation: The sequences of Fig. 5(g)-(l) are challenging due to severe scale variation. All other trackers drift, or their tracking boxes are larger or smaller than the real object size. However, our tracker can adapt to the scale variation. Compared with the similar SR-based trackers (LSK, L1APG, ASLA, MTT), the proposed method achieves better tracking accuracy because it employs the nonconvex MCP function to penalize the sparse coding and error matrices and thus obtains a nearly unbiased estimation of the object.

Figure 5: Qualitative comparison of trackers over challenging sequences with illumination variation, scale variation, occlusion and deformation at some key frames

3) Occlusion: Occlusion is one of the major challenges in the tracking task. Severe occlusion exists in the sequences of Fig. 5(m)-(r). For example, the object (a woman) is occluded by a lamppost in frame #56 of the Jogging.2 sequence of Fig. 5(m), where only our tracker and L1APG can track the object effectively. In the sequences of Fig. 5(m), (o), (p) and (r), all other trackers drift or lose the object to varying degrees. However, our tracker can still track the object stably with high tracking precision and success rate because an error term is added to our tracker to resist occlusion and noise.

4) Deformation: In Fig. 5(s)-(x), the objects deform to various degrees. For example, in the Skating1 sequence of Fig. 5(t), the object (a skater) is not only deformed but also immersed in a similar background, and the trackers CT, TLD, LSK, LSHT and L1APG lose the object completely. However, our tracker can lock onto the object stably and has high tracking accuracy and robustness in all the sequences of Fig. 5(s)-(x). The main reason is that our tracker learns the dictionary according to the change of object appearance. In addition, the background information around the object is also considered to learn a discriminative dictionary. By using the discriminative dictionary, the interference of similar backgrounds can be alleviated effectively by the proposed method.

In a word, the main reasons why our tracker has better tracking precision and success rate compared with the 9 state-of-the-art trackers can be summarized as follows:

1) The temporal and spatial local correlation of the object is exploited, so the tracker is not easily disturbed by background information and has high robustness.
2) The nonconvex MCP function is utilized to penalize the sparse coding and error matrices to obtain a nearly unbiased estimation, and thus achieves higher tracking precision.
3) The inconsistent constraint for the dictionary makes the object and background dictionaries more independent, which improves the discriminability of the dictionary.
4) To handle occlusion and noise, an error term is added to the DL model to further improve the precision and success rate of our tracker.

Figure 6: AR and EAO score plots on VOT2016: (a) AR plot (mean); (b) EAO scores plot. In the AR plot, the accuracy and robustness are calculated by the average overlap and the average failure rate. In the EAO scores plot, the trackers are ranked from right to left by expected average overlap.

5.3. Results on VOT2016

The VOT2016 [4] dataset includes 60 fully annotated video sequences. A reset-based methodology is applied in the VOT2016 challenge: when the bounding box predicted by the tracker has zero overlap with the ground truth, a failure is detected and the tracker is re-initialized five frames after the failure. Accuracy and robustness (AR) are the basic measures used to probe the performance of a tracker in the reset-based experiments. Accuracy is defined as the average overlap between the predicted and ground truth bounding boxes during successful tracking periods. Robustness is calculated by the average failure rate, which measures how many times the tracker loses the object during tracking. Meanwhile, the expected average overlap (EAO) is the primary measure used by VOT to evaluate trackers; it is an estimator of the average overlap that a tracker is expected to attain on a large collection of sequences with the same visual properties as the given dataset. The EAO addresses the problem of the increased variance and bias of the average overlap (AO) [3] due to the variable sequence lengths. Details of the evaluation method can be found in [4].

The following trackers are compared with our tracker: ECO [6], CCOT [27], MemTrack [34], Meta-Tracker [36] (including MetaCREST and MetaSDNet), BST [50], KCF [5], DSST [23], MIL [21], FoT [49], STRUCK [46], BDF [51] and IVT [19]. Fig. 6 shows the evaluation results of the above trackers over the VOT2016 challenge dataset using AR and EAO score plots. It can be seen from Fig. 6 that the developed DL based tracker achieves better performance than the traditional state-of-the-art methods, while it performs worse than the deep learning based trackers. However, it should be noted that deep learning based trackers potentially have high computational demand and require GPU-assisted computing, which is not conducive to industrial implementation.

5.4. Results on UAV123

In this subsection, the UAV123@10fps dataset of UAV123 [61] is used to evaluate the proposed tracker against the following trackers: CCOT [27], TADT [33], MCCT [29], MCPF [26], MEEM [47], Struck [46], DSST [23], TLD [22], KCF [5], ASLA [20] and IVT [19], using the precision and success rate metrics in the same way as the evaluation strategy of OTB2015 [3]. Details of the evaluation method can be found in [61]. Fig. 7 shows the evaluation results of the above trackers on the UAV123 challenge, including the tracking precision and success plots. The developed tracker performs well in comparison with the traditional state-of-the-art trackers, while it may not possess advantages over the deep network based trackers. However, it must be pointed out that trackers with deep networks usually need a tremendous amount of data for training, and the trained model does not generalize well. The tracking performance of these trackers is greatly reduced when the categories of the objects to be tracked are rare or absent in the training data, whereas the proposed online tracker has a general representation of objects. In addition, although some trackers mentioned above (for example, MEEM) have better tracking performance than the proposed tracker, they have difficulty adapting to scale changes of the object.

5.5. Results on LaSOT

In this subsection, the developed tracker is compared with ATOM [31], SiamRPN++ [32], MDNet [24], VITAL [35], SINT [30], CFNet [28], Struck [46], KCF [5], ASLA [20], L1APG [39], IVT [19], CT [41] and MIL [21] on the LaSOT dataset [62], using the same precision and success rate metrics as above. According to the 80/20 principle (i.e., the Pareto principle), 16 of the 20 videos in each category are selected for training, and the rest are used for testing.

A conclusion similar to that of Subsection 5.4 can be drawn from Fig. 8, which illustrates the evaluation results of the above-mentioned trackers on the LaSOT dataset. It should be mentioned again that the performance advantage of the deep network based trackers is mainly due to the large number of training samples available to them in the LaSOT dataset, as well as the design of effective deep network architectures.

Figure 7: Precision and success plots on UAV123@10fps (panels (a)-(b)).

Figure 8: Precision and success plots on LaSOT testing set (panels (a)-(b)).

5.6. Effects of key components and parameters

5.6.1. Effects of key components

In the proposed DL model, a nonconvex MCP function, an error term and an inconsistent constraint for the dictionary are introduced into the learning model to enhance the ability of the dictionary. In order to verify the effectiveness of these three components, they are removed one at a time, and the corresponding impact on the tracking performance of our tracker is observed. The proposed tracker without the nonconvex MCP constraint is denoted by 'NDDLT-N', and the trackers without the error term and without the inconsistent constraint for the dictionary are denoted by 'NDDLT-P' and 'NDDLT-D', respectively.

With the same experimental configuration as in Section 5.1, the obtained results are presented in Fig. 9. The NDDLT clearly has the best tracking precision and success rate, which illustrates that the three constraints help to improve the tracking performance of the developed tracker. Specifically, the nonconvex MCP constraint improves the precision and success rate of the proposed tracker by 5.7% and 3.9%, which is mainly attributed to the fact that it avoids over-penalizing large entries, which would otherwise lead to sub-optimal solutions, and hence improves the tracking accuracy. The error term improves the precision and success rate of the proposed tracker by 7.4% and 7.1%, because the error term can deal with the outliers caused by occlusion or noise and hence improves the robustness of the DL method. The inconsistency constraint improves the precision and success rate of the proposed tracker by 8% and 7.4%, which is mainly due to the fact that the constraint makes the object and background dictionaries more independent and thus improves the discriminability of the dictionary.

Figure 9: Effectiveness of different key components: the NDDLT, NDDLT-N, NDDLT-P and NDDLT-D performance comparison using precision and success plots (panels (a)-(b)).

5.6.2. Effects of key parameters

The regularization parameter $\alpha$ is exploited to control the sparsity level of the sparse representation. If the value of $\alpha$ is too small, many candidates will be maintained. Otherwise, the sparsity will be over-emphasized, and the sparse coding matrix cannot maintain the variety of particles. The parameter $\beta$ is employed to control the error term. The regularization parameter $\lambda$ is used to control the degree of the dictionary constraint. The parameter $\theta$ is a reconstruction error threshold, which is used to control whether the dictionary is updated. Inspired by Yang et al. [3], the parameters $\alpha$, $\beta$, $\lambda$ and $\theta$ are set according to a sensitivity analysis. Different values of each parameter are examined on five video sequences with 1912 frames. To simplify this problem, $\alpha$, $\beta$, $\lambda$ and $\theta$ are restricted to the discrete sets $\Sigma_{\alpha,\beta} = \{0.1, 0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 3.0, 5.0\}$, $\Sigma_\lambda = \{0.1, 0.5, 1.0, 1.3, 1.5, 1.7, 2.0, 2.5, 5.0\}$ and $\Sigma_\theta = \{0.1, 0.5, 1.0, 1.3, 1.5, 1.7, 2.0, 2.5, 5.0\}$, respectively. The detailed setting of the sensitivity analysis can be found in [3]. It can be seen from Fig. 10 that our tracker achieves a good trade-off by setting $\alpha = 1$, $\beta = 1.2$, $\lambda = 1.5$ and $\theta = 0.05$. Similarly, we can set $\beta_1 = 1$.

Figure 10: Sensitivity analysis of $\alpha$, $\beta$, $\lambda$ and $\theta$ (panels (a)-(d)).

5.7. Computational complexity analysis

The computational load of the proposed method mainly lies in the NDDL model. The outer loop of the proposed NDDL method runs only once, so the computational complexity mainly depends on the inner loop. The sixth and eighth steps, which solve $\mathbf{C}_i$ and $\mathbf{D}_i$ in the inner loop, contain matrix multiplication and inversion, so their computational complexity is the highest. Specifically, the computational complexity of solving $\mathbf{C}_i$ and $\mathbf{D}_i$ is $O(dk_i^2)$ and $O(q_ik_i^2 + dk_i^2)$, respectively. Let $q$ be the number of iterations in the inner loop; the total computational complexity of the proposed method can then be written as $O(q(q_ik_i^2 + 2dk_i^2))$.

Running speed is one of the most important evaluation criteria in object tracking. To analyze the real-time performance of our tracker, the other 9 trackers mentioned in Section 5.2 are compared with our tracker. Table 1 shows the average running speed (frames per second, FPS) of all trackers on the 40 test sequences. As shown in Table 1, the developed tracker has a better real-time performance than the sparse representation based trackers (LSK, L1APG, ASLA, MTT). The reason is as follows: the computational complexity of a sparse representation based tracker is normally proportional to the number of candidate particles. By reasonably selecting the number of particles and using the one-step LLA method with lower complexity, the proposed tracker attains a rather low computational load.

Table 1: Average running speed (FPS) of different trackers under 40 video sequences

Tracker   CT     TLD    LSK   L1APG  ASLA  MTT   KCF     Struck  LSHT   NDDLT
FPS       49.37  24.61  8.90  3.33   7.47  0.99  152.01  12.89   45.31  22.58

5.8. Discussion

Based on the above-mentioned experiments, it is clear that our online tracker has good performance compared to traditional state-of-the-art methods. Besides, benefiting from the powerful representation ability of deep features, many trackers using deep networks are designed to achieve better tracking performance. Nevertheless, it should be pointed out that these deep network based trackers have some shortcomings. On the one hand, they have higher computing demand and cannot guarantee real-time tracking performance without the use of expensive computing equipment. On the other hand, they are limited to the object categories in the training data. In contrast, our online tracker has real-time tracking performance with light-weight equipment and can represent general objects. Nevertheless, considering the huge advantages of deep features, we will further consider improving the tracking performance by incorporating the deep features generated by deep learning approaches into the proposed dictionary learning method.

In addition, our tracker has limitations under some challenges. For example, our tracker would drift or even fail when the object is almost completely occluded. The main reason is that the object samples used in training the object dictionary are missing or incorrect under this condition, and therefore the object dictionary cannot represent the appearance of the object effectively, resulting in a wrong location estimation. Moreover, our tracker is sensitive to simultaneous rotation and fast motion. Consequently, learning a more robust dictionary to deal with abrupt motion and rotation should be considered in the future.

6. Conclusion

To address the issue that object tracking performance degrades significantly in complicated circumstances, a visual tracking method based on the NDDL model has been proposed in this paper. Firstly, the object and background samples were obtained from the tracking results of recent frames and their adjacent areas. After that, an inconsistent constraint was applied to the dictionary learning model to make the object and background dictionaries more independent, which enhances the discriminability of the dictionary. At the same time, considering interference such as occlusion or noise, an error term was exploited to deal with outliers and improve the tracking robustness. Moreover, the MCP function was applied to penalize the sparse coding and error matrices, which avoids the over-penalization of large entries caused by the $\ell_1$ norm and improves the object tracking accuracy. In order to tackle the resulting nonconvex discriminative dictionary optimization problem, a method based on MM-IALM was developed to obtain an efficient solution with better convergence. Subsequently, the object observation model was constructed according to the acquired optimal discriminative dictionary, and the object tracking was then implemented based on Bayesian reasoning. Experimental results show that, compared to the existing state-of-the-art trackers, the proposed method has higher tracking precision and success rate in the cases of illumination variation, scale variation, occlusion and deformation.




Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Hongyan Wang

Helei Qiu

Wenshu Li
