From sample selection to model update: A robust online visual tracking algorithm against drifting

Neurocomputing 173 (2016) 1221–1234

Zhu Teng a,*, Tao Wang a,b, Feng Liu a,b, Dong-Joong Kang c, Congyan Lang a,b, Songhe Feng a

a School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
b Beijing Jiaotong University, No. 3 Shang Yuan Cun, Hai Dian District, Beijing, China
c School of Mechanical Engineering, Pusan National University, Busan, South Korea

Article history: Received 23 December 2014; received in revised form 8 June 2015; accepted 28 August 2015; available online 9 September 2015. Communicated by Jinhui Tang.

Abstract

This paper proposes an online tracking algorithm that employs a confidence combinatorial map model. Drifting is a problem that easily occurs in object tracking, and most recent tracking algorithms have attempted to solve it. We propose a confidence combinatorial map that describes the structure of the object, and on this basis develop the confidence combinatorial map model. The model captures the relationship between the object in the current frame and that in the previous frame. On the strength of this relationship, more precisely classified samples can be selected and employed in the model update stage, which directly influences the occurrence of tracking drift. The proposed algorithm was evaluated on several public video sequences, and its performance was compared with several state-of-the-art algorithms. The experiments demonstrate that the proposed algorithm outperforms the comparison algorithms and gives a very good performance.

Keywords: visual tracking; confidence combinatorial map; model update

1. Introduction

Visual tracking is one of the most significant tasks in computer vision, and new methods are continuously proposed due to its wide range of applications, such as automated surveillance, traffic monitoring, vehicle navigation, human-computer interaction, and medical imaging [1,2]. Generally, visual tracking is the process of continuously locating an annotated object (labeled in the first frame of a video sequence) throughout that sequence. Although it has been studied intensively for decades, building a robust tracker remains very difficult. One reason is the abrupt and complex motion of the object [3]: abrupt motion can cause tracking failure because it violates the smoothness constraint assumed by some motion models [4]. The appearance variation of both objects and background also affects the performance of a tracker [5]. Without appearance adaptation of the object, as in some template-matching-based tracking methods [6], the tracker may drift away from the target in some circumstances. Another challenge is occlusion of the object: when occlusion happens, the object is represented by incomplete information and the tracker easily drifts to other objects or the background. There are also many


other factors that limit the performance of a tracker, such as illumination variation, camera motion, and real-time processing requirements [7].

In general, a tracker can be decomposed into two phases: object representation and object localization. Trackers differ in the formulation of these two phases. Object representation plays a critical role in the tracking problem, since the more distinctive the representations are, the more easily the object can be tracked or localized. From the perspective of shape and appearance, objects can be represented by points [8], silhouettes and contours [9], probability densities of object appearance [10,11], and so on [12]. From the view of features, the object can be represented by intensity [13], color [14], optical flow [15,16], edges [17], texture [18,19], Haar-like features [20–22], and context [23,24]. Features can be built from a single pixel [18] or from a number of pixels (patches) [25,26]. Single-pixel features have limited descriptive ability and are computationally expensive, since the features of all the pixels in a frame must be calculated. Therefore, in this study, we employ patch-based features that combine color and the histogram of oriented gradients (HOG) [11,19], which has been shown to be efficient and is widely used in computer vision [19].

The second phase is object localization, which can be solved in many ways. One way uses a motion model to localize targets [27], such as optical-flow-based methods [28], the Kalman filter [4], and the particle filter [29,12,30,53]. Another localization scheme is based on a learning algorithm


[31,32]. Generally, features extracted from the first frame are used to train a learning algorithm; the features in a new frame are then labeled by the trained classifier, and the location of the object in the new frame is obtained by an optimization scheme. Many other techniques have also been employed to solve the localization problem, such as segmentation [33], metric learning [34,35], and mean-shift [10]. This paper focuses on learning-based visual tracking: it exploits unlabeled samples that originate from all frames of the video sequence except the first, labeled one in order to adapt online to changes in the appearance of the target, and it emphasizes a model update that resists drifting while adapting during tracking.

2. Related works

In this study, a general object rather than a specific one such as a face [36] or a human [37] is the main focus, and we center our attention on learning-based tracking algorithms. The intuition behind the learning-based tracker is that tracking can be considered a binary classification problem, as first proposed by Comaniciu [10]. According to the learning strategy, a learning algorithm can be categorized as online or offline: an online learning method processes one instance at a time, while an offline learning method requires the entire training set to be available at once [38]. Since offline learning methods have limited adaptability to the variation of the objects, the proposed tracker is based on an online algorithm. Many researchers have proposed online learning based tracking algorithms and obtained promising results. The methods of [13,33,39] are based on learning algorithms that incrementally learn a subspace representation. Classifier-based tracking methodologies have been exploited by many researchers [22,31]. For instance, Avidan [18] developed ensemble tracking, which combines a set of weak classifiers into a single strong classifier. Danielsson et al. [40] used two derived weak classifiers to suppress combinations of weak classifiers whose outputs are anti-correlated on the target class. Online boosting and online semi-supervised boosting algorithms were proposed in [21,22]; the online boosting algorithm was demonstrated, with theoretical and experimental evidence, to perform comparably to its offline counterparts [38,41]. Babenko [20] also employed a boosting algorithm but proposed Multiple Instance Learning (MIL) to avoid tracking drift. Santner [42] described a tracker combining template matching, an online random forest, and optical flow. An online Laplacian ranking support vector tracker was proposed in [43] to robustly locate the object, and [44] developed a sparsity-based discriminative classifier and a sparsity-based generative model aimed at ameliorating the drifting problem. Fan et al. [45] developed a scribble tracker, but it requires user-specified scribbles. Wang et al. [46] proposed an online object tracking algorithm with sparse prototypes that learns effective appearance models by combining classic principal component analysis with sparse representation schemes. Most of the above-mentioned trackers employ a learning algorithm but do not specify how to collect samples to update the tracker against drift.

Adaptation is an indispensable capacity for a tracker: without it, the tracker easily drifts to non-target regions after even a slight alteration in the appearance of the target. Recently, many studies have addressed this issue. Erdem et al. [47] employed an adaptive cue integration scheme to adapt the model to changes in the appearance of the target. In [48], incremental covariance tensor learning was used in the model update, and a tracking algorithm with a dynamically updated online basis distribution was presented in [49] to adapt to appearance variations. Das et al. [50] utilized the

generalized expected log-likelihood statistics to detect model changes. Adaptation can be acquired through the model update; however, the data employed in the update are unlabeled, and how to select accurate samples for updating remains a very difficult problem. Tracking drift is one of the most important reasons why tracking is challenging, and it degrades tracking performance: once drift happens, the target is lost, and in the worst scenario the tracker never recovers in the subsequent frames. The relationship between adaptability and tracking drift is subtle: the more adaptive the tracker is, the more easily it can drift; meanwhile, if the tracker hardly adapts to the variations of the target, which occur inevitably in tracking, it will also drift away. The right balance between these two factors must be found.

In this paper, we concentrate on classifier-based tracking approaches. Unlike the above-mentioned trackers, we focus on resisting tracking drift and propose a robust tracking method that benefits from a careful selection of unlabeled samples and an online updated model throughout the tracking process. The tracker is constructed on the 2D Disjunctive Normal Form (DNF) classifier [51], as 2D DNF classifiers have the capacity to represent more complex distributions than the original weak classifiers. Sample selection in the tracking process is carried out using a confidence combinatorial map. The main contributions of this paper are 1) a confidence combinatorial map model for inferring the location of the object, and 2) an online update model that balances adaptation against tracking drift, together with a reset framework that lets the tracker recover from target loss. The remainder of the paper is arranged as follows. In Section 3, the confidence combinatorial map based tracking algorithm is proposed, and the details of the algorithm, including the classifiers we employ and the proposed combinatorial map, are delineated. We demonstrate the effectiveness of the confidence combinatorial map model and of the proposed tracker against several state-of-the-art methods in Section 4, and give the conclusions in Section 5.

3. Confidence combinatorial map based tracking algorithm

In this section, we present the proposed tracking algorithm based on the confidence combinatorial map. The overall tracking process is described in Fig. 1: the first, labeled frame is employed for learning, and the tracking results in the other frames are used to update both the classifiers and the confidence combinatorial map. The 2D DNF classifier [51], briefly described in Section 3.1, is used as the learning classifier. In Section 3.2, the confidence combinatorial map and the model update are proposed, and the detailed tracking process is specified in Section 3.3.

3.1. 2D DNF classifier

The 2D DNF classifier employed in this work was developed based on the boosting algorithm [18,41]. A conventional weak classifier has limited representation capacity, since it uses linear classifiers or stumps, which label samples only slightly better than random guessing. By comparison, 2D DNF classifiers have the capacity to represent more complex distributions.


Fig. 1. Overall description of the proposed tracking algorithm.

Let $\{x_i, y_i\}_{i=1}^{N}$ denote $N$ samples and their labels, where $x_i \in \mathbb{R}^m$ ($m$ is the dimension of the feature vector representing a sample) and $y_i \in \{-1, +1\}$. A weak classifier is then defined by Eq. (1), where $T$ stands for the number of weak classifiers:

$h_t(x): \mathbb{R}^m \to \{-1, +1\}, \quad t \in [1, T]$  (1)

The strong classifier in the boosting algorithm is defined as a linear combination of a collection of weak classifiers, as shown in Eq. (2); the decision is given by $\mathrm{sign}(H(x))$:

$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$  (2)

$\alpha_t = \frac{1}{2} \log \frac{1 - err_t}{err_t}$  (3)

$err_t = \sum_{i=1}^{N} w_i \, | h_t(x_i) - y_i |$  (4)

where $w_i$ is the weight of the $i$-th example; the weights are updated in the process of training the weak classifiers (Eq. (5)):

$w_i \leftarrow w_i \, e^{\alpha_t | h_t(x_i) - y_i |}$  (5)
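For concreteness, the following minimal sketch shows how Eqs. (2)–(5) fit together in training. The weak-learner interface (fit/predict returning labels in {-1, +1}) is an assumption of this sketch, and the error of Eq. (4) is used in its equivalent indicator form:

```python
import numpy as np

def train_boosted_classifier(X, y, make_weak_learner, T):
    """Minimal sketch of the boosting scheme of Eqs. (2)-(5)."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # sample weights w_i
    alphas, learners = [], []
    for t in range(T):
        h = make_weak_learner().fit(X, y, sample_weight=w)
        miss = (h.predict(X) != y)               # |h_t(x_i) - y_i|, up to scale
        err = np.clip(np.sum(w * miss), 1e-10, 1 - 1e-10)   # Eq. (4)
        alpha = 0.5 * np.log((1.0 - err) / err)  # Eq. (3)
        w = w * np.exp(alpha * miss)             # Eq. (5)
        w /= w.sum()                             # keep the weights normalized
        alphas.append(alpha)
        learners.append(h)

    # Eq. (2): the strong classifier is a weighted vote of the weak ones
    def H(x):
        return np.sign(sum(a * h.predict(x) for a, h in zip(alphas, learners)))
    return H
```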

Based on the feature dimensions that the weak classifiers select, the 2D DNF cell classifiers (Eqs. (6)–(8)) are constructed, and the 2D DNF classifier is constituted by a linear combination of a set of 2D DNF cell classifiers, as shown in Eq. (9). Let $\{p_f, y\}$ denote the samples in the $f$-th cell and their labels, where each element of $y$ is $-1$ or $+1$, $p_f = [d_{h_i}; d_{h_j}]_{2 \times N}$, $N$ is the total number of samples in the cell, and $d_{h_i}$ and $d_{h_j}$ are the training feature data chosen by weak classifiers $h_i$ and $h_j$:

$h_{2Dc_f}(p_f): \mathbb{R}^2 \to \{-1, +1\}$  (6)

For each column vector $p_{f_i}$ in $p_f$, the specific mapping is described in Eq. (7), where $|\cdot|$ indicates the cardinality of a set:

$h_{2Dc_f}(p_{f_i}) = \begin{cases} 1, & p_{f_i} \in \bigcup_{1 \le i,j \le m} Cb_{ij} \\ -1, & \text{otherwise} \end{cases}$  (7)

$Cb_{ij} = \begin{cases} b_{ij}, & |\{y_k \mid y_k \in b_{ij} \wedge y_k = 1\}| - |\{y_k \mid y_k \in b_{ij} \wedge y_k = -1\}| > r, \; k \in [1, N] \\ \emptyset, & \text{otherwise} \end{cases}$  (8)

$H_{2D}(x) = \sum_{f=1}^{M} \alpha_{2D_f} h_{2Dc_f}(p_f)$  (9)

The plane $d_{h_i} \times d_{h_j}$ is quantized into $m \times m$ cells, denoted by $b_{ij}$, $1 \le i, j \le m$. If the number of positive samples falling in $b_{ij}$ exceeds the number of negative samples by at least $r$ (set to 5 in the experiments), $Cb_{ij}$ represents the cell $b_{ij}$; otherwise, $Cb_{ij}$ is the empty set. $\bigcup_{1 \le i,j \le m} Cb_{ij}$ is the union of all such cells, and a sample is classified as positive if it falls into any nonempty cell of this union. To avoid confusion among cell, patch, and sample: in this paper, cells denote the bins involved in the 2D DNF classifier, we also call a feature a sample, and a feature is extracted from a patch of the image.

Fig. 2. 2D DNF classifier.

The classifiers employed in this work are presented in Fig. 2. The training process first obtains the weak classifiers, based on which feature dimensions are determined and a number of 2D DNF cell classifiers are built. The 2D DNF classifier is then created as a linear combination of the 2D DNF cell classifiers.
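As a rough illustration, the following sketch builds one 2D DNF cell classifier in the spirit of Eqs. (7)–(8); uniform binning over the observed feature range is an assumption of this sketch:

```python
import numpy as np

def build_2d_dnf_cell(d_hi, d_hj, labels, m=10, r=5):
    """Sketch of one 2D DNF cell classifier: quantize the (d_hi, d_hj)
    plane into an m x m grid; a bin joins the positive union when
    positives outnumber negatives in it by more than r (Eq. (8))."""
    xe = np.linspace(d_hi.min(), d_hi.max(), m + 1)      # bin edges, axis i
    ye = np.linspace(d_hj.min(), d_hj.max(), m + 1)      # bin edges, axis j
    pos, _, _ = np.histogram2d(d_hi[labels == 1], d_hj[labels == 1], bins=[xe, ye])
    neg, _, _ = np.histogram2d(d_hi[labels == -1], d_hj[labels == -1], bins=[xe, ye])
    union = (pos - neg) > r                              # nonempty cells Cb_ij

    def classify(u, v):                                  # Eq. (7)
        i = int(np.clip(np.searchsorted(xe, u) - 1, 0, m - 1))
        j = int(np.clip(np.searchsorted(ye, v) - 1, 0, m - 1))
        return 1 if union[i, j] else -1
    return classify
```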

3.2. Confidence combinatorial map and model update

To adapt to the variations of the target, it is essential for a tracker to perform sample update, during which tracking drift is easily incurred by inaccurate sample selection. To suppress drift, in this section we propose a confidence combinatorial map that describes the structure of the target; based on it, robust samples are selected and employed in the model update during the tracking process, so that the tracker adapts online to changes in the appearance of the target. A brief introduction to the combinatorial map is given in Section 3.2.1, the confidence combinatorial map is proposed in Section 3.2.2, and the sample selection algorithm based on the confidence combinatorial map, together with the update model for tracking, is developed in Section 3.2.3.

3.2.1. 2D combinatorial map

The reason we use a 2D combinatorial map rather than a general graph to describe the structure of the target is that rotation information is incorporated in a combinatorial map. As shown in Fig. 3, the two graphs (the left two figures) are the same, but the combinatorial maps derived from them are different: the combinatorial map delineates the rotation order of edges around a vertex. For instance, in Fig. 3, the rotation order of edges around the nethermost vertex is 8-10-1 in the left map, while in the right map it is 10-8-1. This information is very useful for tracking, in the sense that a tracker requires precise localization even when rotation takes place; in other words, the tracking algorithm should be rotation invariant. If the rotation information can be expressed accurately, a more robust sample selection can be obtained, which reduces tracking drift.

Fig. 3. Isomorphic graphs and derived combinatorial maps.

A 2D combinatorial map [52] can be considered a graph that encodes the orientation of the edges around each vertex. The basic element of a combinatorial map is the dart, and each edge is composed of two darts with opposite directions (like darts 1 and 2 in Fig. 4). The association between the two darts that originate from the same edge is expressed by an involution $\alpha$ ((1,2) in Fig. 4), and a permutation $\sigma$ ((18,5,4) in Fig. 4) encodes the rotation of darts around a vertex: one cycle of $\sigma$ corresponds to one vertex and delineates the order of darts encountered when turning counterclockwise around that vertex. A combinatorial map G is defined as a 4-tuple G = (D, $\alpha$, $\sigma$, $\mu$), where D is a finite set of darts, $\alpha$ is the involution on D, $\sigma$ is the permutation on D, and $\mu$ is a dart weighting function. Taking Fig. 4 as an example, D = {1, 2, ..., 18}, $\alpha$ = (1,2)(3,4)(5,6)(7,8)(9,10)(11,12)(13,14)(15,16)(17,18), and $\sigma$ = (3,2)(1,7,9)(10,11)(12,13)(6,15,14,8)(18,5,4)(16,17). Because combinatorial maps explicitly encode the rotation order of darts around vertices, they enable us to distinguish between configurations in the graph matching problem and to preserve rotation invariance when matching two combinatorial maps.

Fig. 4. 2D combinatorial map.
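To make the structure concrete, here is a minimal sketch of a combinatorial map storing D, α, and σ, instantiated with the example of Fig. 4; the class shape and method names are assumptions of this sketch, not taken from [52]:

```python
class CombinatorialMap:
    """Minimal sketch of a 2D combinatorial map G = (D, alpha, sigma, mu)."""
    def __init__(self, alpha_pairs, sigma_cycles, weights=None):
        self.alpha = {}                      # involution: dart -> opposite dart
        for a, b in alpha_pairs:
            self.alpha[a], self.alpha[b] = b, a
        self.sigma = {}                      # permutation: dart -> next dart ccw
        for cycle in sigma_cycles:
            for k, d in enumerate(cycle):
                self.sigma[d] = cycle[(k + 1) % len(cycle)]
        self.mu = weights or {}              # dart weighting function

    def darts_around_vertex(self, dart):
        """Darts met when turning counterclockwise around the vertex that
        `dart` leaves (one sigma-cycle)."""
        out, d = [dart], self.sigma[dart]
        while d != dart:
            out.append(d)
            d = self.sigma[d]
        return out

# The example of Fig. 4:
G = CombinatorialMap(
    alpha_pairs=[(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12),
                 (13, 14), (15, 16), (17, 18)],
    sigma_cycles=[(3, 2), (1, 7, 9), (10, 11), (12, 13),
                  (6, 15, 14, 8), (18, 5, 4), (16, 17)])
print(G.darts_around_vertex(1))   # -> [1, 7, 9]
```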

3.2.2. Proposal of the confidence combinatorial map

The confidence combinatorial map is built upon the weighted combinatorial map. It preserves the property of rotation invariance when matching two combinatorial maps, which leads to more accurate tracking, and it associates the object of the past with that of the present. In a confidence combinatorial map, the vertex weight is defined as the confidence estimated by the 2D DNF classifier. The map matching algorithm we employ is the greedy algorithm of [52]. Let G1 and G2 be two combinatorial maps (for example, the maps of the target in the previous frame and in the current frame). When matching G1 and G2, the cost of the mapping is estimated by the cost of an edit path between the two maps, i.e., a sequence of operations S1, ..., Sk that transforms one confidence combinatorial map into the other. The cost function of an operation is defined in Def. 1.

Def. 1. Let $\gamma$ be a cost function that assigns a nonnegative real number $\gamma(x \to y)$ to each edit operation $(x \to y)$, where x and y may be darts or vertices. We constrain $\gamma$ to be a distance metric as follows:

(1) $\gamma(x \to y) \ge 0$, $\gamma(x \to x) = 0$;
(2) $\gamma(x \to y) = \gamma(y \to x)$;
(3) $\gamma(x \to z) \le \gamma(x \to y) + \gamma(y \to z)$.

The operations involved in the transformation include insertion, deletion, and substitution. While satisfying the constraints in Def. 1, the specific cost functions used in matching confidence combinatorial maps are defined through the confidence computed by the 2D DNF classifier, as shown in Eq. (10) (insertion), Eq. (11) (deletion), and Eq. (12) (substitution), where $x_i$ indicates a vertex (corresponding to a feature) in the first confidence combinatorial map, and $y_i$ and $y_j$ stand for vertices (features) in the second confidence combinatorial map:

$\gamma(\Lambda \to y_i) = H_{2D}(y_i)$  (10)

$\gamma(x_i \to \Lambda) = H_{2D}(x_i)$  (11)

$\gamma(x_i \to y_j) = | H_{2D}(x_i) - H_{2D}(y_j) |$  (12)
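A minimal sketch of these three operation costs, with None playing the role of the empty symbol $\Lambda$ (an assumption of this sketch):

```python
def edit_operation_cost(x, y, H2D):
    """Edit-operation costs of Eqs. (10)-(12). H2D maps a vertex (feature)
    to its 2D DNF confidence; in a confidence combinatorial map all
    retained vertices have positive confidence, so costs are nonnegative."""
    if x is None:                      # insertion, Eq. (10)
        return H2D(y)
    if y is None:                      # deletion, Eq. (11)
        return H2D(x)
    return abs(H2D(x) - H2D(y))        # substitution, Eq. (12)
```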

The cost of an edit path is defined as the sum of the costs of all operations in the path, and the cost of a combinatorial map is defined as the sum of the costs of deleting all vertices in the map. Over all edit paths between two confidence combinatorial maps, the edit distance of the two maps is defined as the minimum cost of an edit path, as shown in Eq. (13), where $S(G_1, G_2)$ denotes the set of all edit paths transforming $G_1$ into $G_2$:

$d(G_1, G_2) = \min_{(S_1, \ldots, S_k) \in S(G_1, G_2)} \sum_{i=1}^{k} \gamma(S_i)$  (13)

In the proposed confidence combinatorial map, a target window (region) corresponds to a map G = (V, D). The set V is associated with all the features in the target region, and the set D is associated with all 4-connected darts of the vertices. The confidence combinatorial map is formulated from the combinatorial map of a target window by removing the vertices with negative labels and their corresponding darts (as shown in Fig. 5c, where the first value in ⟨ ⟩ indicates the vertex number and the second the weight of the vertex). The weight of a vertex is defined as the confidence that the 2D DNF classifier assigns to the feature. The similarity between two confidence combinatorial maps (for example, those of the current candidate target window and the previous target window) is calculated by Eq. (14). Besides the matching score, the map matching also provides a mapping set $M = \{(x_{p_1} \to x'_{p'_1}), (x_{p_2} \to x'_{p'_2}), \ldots, (x_{p_L} \to x'_{p'_L})\}$ ($L$ is the total number of matching pairs), which is the basis for sample selection:

$\delta(G_1, G_2) = 1 - \frac{d(G_1, G_2)}{\gamma(G_1) + \gamma(G_2)}$  (14)
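A minimal sketch of the similarity normalization of Eq. (14); the edit distance is assumed to be precomputed (e.g., by the greedy matching of [52]):

```python
def map_similarity(d12, G1_weights, G2_weights):
    """Eq. (14): d12 = d(G1, G2) is the edit distance of Eq. (13);
    G1_weights and G2_weights are the vertex confidences of the two maps,
    so each sum is the cost of deleting every vertex of that map."""
    gamma1 = sum(G1_weights)     # gamma(G1): delete all vertices of G1
    gamma2 = sum(G2_weights)     # gamma(G2): delete all vertices of G2
    return 1.0 - d12 / (gamma1 + gamma2)
```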


3.2.3. Sample selection and model update

Sample selection across the tracking frames is the most significant step for adapting the tracker to the variations of the target and keeping it from drifting. The samples used for the update in the current frame are chosen by two criteria. The first is that a sample should belong to the matching pairs between the target of the current frame and that of the previous frame, as shown in Eq. (15). The second, described in Eq. (16), is that the confidence of a sample should exceed a threshold; a higher threshold means that only samples with relatively high confidence are selected for the model update. The threshold can be set between 0.5 and 0.7 (0.5 in the experiments):

$x_{cur_j} \in \{x'_{p'_i}\}_{i=1}^{L}$  (15)

$H_{2D}(x_{cur_j}) > thr$  (16)
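A minimal sketch of the two selection criteria together:

```python
def select_update_samples(matches, H2D, thr=0.5):
    """Eqs. (15)-(16): a sample of the current frame is kept for the model
    update only if it belongs to a matching pair and its 2D DNF confidence
    exceeds the threshold."""
    selected = []
    for x_prev, x_cur in matches:        # the mapping set M from map matching
        if H2D(x_cur) > thr:             # Eq. (16); thr in [0.5, 0.7]
            selected.append(x_cur)       # Eq. (15) holds by construction
    return selected
```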

The sample selection procedure proposed in this paper is illustrated in Fig. 6. At the (t−1)-th frame, the confidence map of the tracked target is extracted and the target is represented by its combinatorial map; the gray values of the nodes in the combinatorial maps of Fig. 6 are the confidences. For the current t-th frame, after the DNF classifier acts on it, several candidate windows are obtained, matching between the target of the (t−1)-th frame and the candidate windows is executed, and the current target is localized as the candidate with the higher similarity. As shown in Fig. 6, although the two candidate windows in the t-th frame are both faces and have much in common, when perceived through their combinatorial maps and compared with the previous one, the two candidates are distinct in both the structure and the distribution of the gray values of the nodes. This facilitates accurate tracking and mitigates drifting. Samples for the model update are acquired through the matching pairs. As the target varies during the tracking process, the described structure of the target follows the changes as well; that is, the combinatorial map of the target adapts as the vertices and edges differ across frames. The dataset for the model update consists of samples selected by the confidence combinatorial map model from non-first frames and samples from the first frame, and the dataset is emptied every five frames. The samples selected during tracking (non-first frames) are the crucial part, as they balance adaptation against tracking drift. The model update of the proposed tracker includes two parts: the DNF classifier and the confidence combinatorial map. The structure of the target described by the confidence combinatorial map is updated every frame. When updating the classifiers, the weak classifiers are renewed first, then the DNF cell classifiers, and finally the DNF classifier; the classifier update is carried out once every five frames. A reset mechanism handles the case in which the tracker continuously loses the target; it is especially effective when the target disappears from the shot. The target is determined to be lost when no candidate is recommended by the confidence combinatorial map model, and the reset mechanism is launched if the tracker loses the target for five consecutive frames, after which the tracker is restored to the last non-lost state. The reset mechanism includes the reloading of the classifiers and the adaptive threshold.

Fig. 5. Confidence combinatorial map. (a) Original image; (b) patch features; (c) confidence combinatorial map.


3.3. Detailed tracking scheme

In this section, we detail the tracking scheme. The tracker consists of two components: learning and tracking. The learning part includes supervised learning based on the first frame and unsupervised learning based on the data generated in the tracking process; naturally, the update step also belongs to the learning part.

Fig. 6. Sample selection for updating. (In the illustrated example, the similarities of the two candidate maps to the object map of the (t−1)-th frame are 0.7629 and 0.4376.)


Table 1. Evaluated image sequences.

Sequence | Frames | Challenging factors
Board [42] | 698 | Background clutter, out-of-plane rotation, illumination variation, motion blur, fast motion, out-of-view, scale variation
Cartoon [47,57] | 200 | Motion blur, abrupt motion, illumination variation
Cup | 359 | Out-of-view, motion blur, illumination variation, scale variation
Girl [7,12,20,47,54] | 501 | Out-of-plane rotation, partial occlusion, in-plane rotation, scale variation
Girl2 [56] | 1500 | Background clutter, deformation, out-of-plane rotation, scale variation, heavy occlusion, motion blur
Dancer2 [57] | 150 | Deformation, scale variation
Lemming [42,46,56] | 1336 | Illumination variation, scale variation, occlusion, fast motion, out-of-plane rotation, out-of-view
FaceOcc1 [20,25,46,47,54] | 898 | Partial occlusion
PencilBox | 226 | Partial occlusion, scale variation, out-of-plane rotation, motion blur, out-of-view
Trans [56] | 124 | Heavy scale change, abrupt motion, illumination variation, occlusion, deformation

Tracking is mainly composed of feature extraction and target localization. The detailed steps are delineated as follows.

Description: Tracking scheme
Input: a video sequence with n frames; a bounding box for the object in the first frame.
Output: a bounding box of the object for each subsequent frame.

For the first frame:
(1) Extract features from the first frame, where the number of positive and negative patches extracted for learning is fixed and the patches are randomly selected.
(2) Learn the weak classifiers and the 2D DNF classifiers.
(3) Estimate the confidence map of the first frame using the weak classifiers and the 2D DNF classifier, and construct the initial combinatorial map of the target.
(4) Set the state of tracking to FOUND, and save the initial classifiers and data.

For a new frame:
(1) Extract features of all the patches from the background region of the current frame. Generally, the background is defined to be twice the size of the target, while the detected region spreads to the entire frame when the target is lost.
(2) Examine all the patches with the combination of weak classifiers and 2D DNF cell classifiers, and create the confidence map. The confidence of a patch is calculated by Eq. (17):

$confidence = \sum_{t=1}^{T} \alpha_t h_t(x) + \sum_{f=1}^{M} \alpha_{2D_f} h_{2Dc_f}(p_f)$  (17)

(3) Cast candidate windows from the integral image of the confidence map.
(4) Construct the combinatorial maps of the candidate windows and compare these maps with the previous combinatorial map of the target. The similarities are first compared with an adaptive threshold, whose initial value is set to 0.4 and which thereafter may fluctuate down by at most 0.1 relative to the similarity of the target in the previous frame.
- If no candidates are left, the detected region spreads to the entire frame and the target is re-detected. If, with the whole frame as the detected region, the tracker still fails to find the target, the state of tracking is set to LOST; otherwise, it is set to FOUND.
- If several candidates qualify, the candidate with the largest confidence is taken as the target and the state of the tracker is set to FOUND.
- If only one candidate is eligible, it is taken as the target and the state of the tracker is set to FOUND.
(5) Under the FOUND state of the current frame, collect samples for updating based on the mapping relationship between the target of the current frame and the target of the previous frame. The update samples are accumulated over five frames, combined with the initial samples, and used to update the classifiers every five frames.
(6) If the state of the tracker is LOST for five consecutive frames, the tracker is reset to the latest saved FOUND tracker. The reset mechanism includes the reloading of the classifiers and the adaptive threshold.
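To make the control flow concrete, here is a minimal sketch of the per-frame loop with the FOUND/LOST bookkeeping and the reset mechanism; `learn_first_frame` and `process_frame` are hypothetical stand-ins for the steps above, not functions from the paper:

```python
def track(frames, learn_first_frame, process_frame, reset_limit=5):
    """Per-frame loop with the FOUND/LOST states and the reset mechanism
    of steps (4)-(6); a sketch under the stated assumptions."""
    state = learn_first_frame(frames[0])          # classifiers + initial map
    saved_state, lost_run, boxes = state, 0, []
    for frame in frames[1:]:
        box, found, state = process_frame(frame, state)
        if found:
            saved_state, lost_run = state, 0      # remember last FOUND tracker
        else:
            lost_run += 1
            if lost_run >= reset_limit:           # LOST for five frames in a row
                state, lost_run = saved_state, 0  # reload classifiers/threshold
        boxes.append(box)
    return boxes
```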

4. Experiments

In this section, we evaluate the proposed tracking method on several challenging video sequences and compare its performance with that of other state-of-the-art tracking methods. The comparisons consist of quantitative experiments for tracking errors and efficiency experiments for tracking time. The performance of the proposed tracker is compared with six other tracking methods: Srpca [46], L1 [12], MIL [20], Frag [25] (search margin set to 7, number of bins set to 16, and the metric employed is a variation of the Kolmogorov–Smirnov statistic), SemiBoost [22], and SuperPixel [56]. The implementations of all these methods are obtainable from the homepages of the corresponding authors. In Section 4.1, the experimental setup is explained. The performance improvement brought by the confidence combinatorial map is demonstrated in Section 4.2, and we report the quantitative experiments by center difference and PASCAL score [55] (the percentage of correctly tracked frames in a sequence) together with the time performance in Section 4.3. All the experiments were executed on an Intel(R) i5 2.80 GHz laptop computer.

4.1. Experimental setup

Ten sequences were used to evaluate the performance of the proposed tracker, which was implemented in C++. These sequences cover various challenging conditions; eight of them are publicly available and widely used in [12,20,25,42,46,47,51,54,56,57], and the other two were collected by ourselves. Detailed information about the test dataset is shown in Table 1. The feature employed in the proposed tracker combines a HOG feature with 31 dimensions [19] (8*8 is used as the patch size of the HOG feature) and RGB color information with three dimensions. This feature vector can be computed easily and is appropriate for object detection and tracking tasks. However, the focus of this work is on the learning algorithm and the confidence combinatorial map rather than on the feature contribution, so other features effective for object tracking could also be employed. The number of weak classifiers in the DNF classifier was set to 30, and 15 of them were substituted at each update. The step length of the DNF classifier was set to 0.1, and the initial adaptive threshold was set to 0.4, thereafter fluctuating down by at most 0.1 relative to the similarity of the previous target. The reset mechanism operated when the tracker lost the target for 5 consecutive frames. All the trackers were initialized on the first frame. In the learning process of the


proposed tracker, 400 samples were randomly selected from the first frame, 200 of which were positive and the rest negative.

Performance evaluation for tracking is a difficult task. Generally, qualitative experiments on test video sequences are very common, and quantitative comparisons often involve plotting the center error distance for each frame. As these plots are difficult to read, the average center distance in pixels (ACDP) is frequently used alongside them. Furthermore, in order to better evaluate the performance of the tracker, the percentage of correctly tracked frames in a sequence was employed as well. This percentage is also called the PASCAL score; a frame is regarded as correctly tracked when the overlap ratio between the tracked bounding box and the ground-truth bounding box is larger than 50%. The speed of the trackers was estimated by the number of processed frames per second (fps). To sum up, besides qualitative experiments on the video sequences, we also employed ACDP, PASCAL score, and fps as evaluation methodologies.

4.2. Performance boosting by the confidence combinatorial map

In this section, the performance improvement brought by the confidence combinatorial map is demonstrated. We compare two trackers, with and without the confidence combinatorial map model, and examine the PASCAL score on three representative video sequences; all the other settings of the two trackers are identical. Table 2 displays the results of the comparison, and it is clear that the confidence combinatorial map is effective and improves the tracker by more than six percent on average.

Table 2. Comparison of trackers with and without the confidence combinatorial model (PASCAL score).

Sequence | Without confidence combinatorial model | With confidence combinatorial model
Cup | 0.721 | 0.789
FaceOcc1 | 0.927 | 0.989
Girl2 | 0.436 | 0.495
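As a reference, a minimal sketch of how the PASCAL score defined above can be computed; the (x, y, w, h) box format is an assumption of this sketch:

```python
def pascal_score(pred_boxes, gt_boxes, min_overlap=0.5):
    """PASCAL score [55] as used here: the fraction of frames whose
    predicted box overlaps the ground truth by more than 50%."""
    def overlap(a, b):
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
        ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
        inter = iw * ih
        union = aw * ah + bw * bh - inter
        return inter / union if union > 0 else 0.0
    hits = sum(overlap(p, g) > min_overlap for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)
```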

Fig. 7. Qualitative experiments on video sequences of Board, Cartoon, Cup, Girl and Girl2. If the target is lost by using a tracking method, then the corresponding rectangle is not drawn in the figures.


Fig. 8. Qualitative experiments on video sequences of Dancer2, Lemming, FaceOcc1, Trans, PencilBox. If the target is lost by using a tracking method, then the corresponding rectangle is not drawn in the figures.

4.3. Comparison with other state-of-the-art tracking methods

In this section, we compare and discuss the performance of the proposed method against other state-of-the-art tracking methods in various challenging situations.

4.3.1. Qualitative evaluation

The results of the qualitative experiments are shown in Figs. 7 and 8. We analyze the results in four situations.

Abrupt motion and motion blur. Sequence Girl2 (size: 640*480) shows a girl riding a scooter in a park. It is a challenging sequence as it presents occlusion, motion blur, and other children in a cluttered background. For example, in frame 118 (Fig. 7), occlusion happens when a person passes in front of the girl (the target). As a

result, the trackers Frag, L1, SuperPixel, and Srpca drift to the person instead of the girl, and SemiBoost and MIL drift to the background, but the proposed tracker successfully follows the girl. The cause is that the other trackers chose samples from the passing person (an incorrect target) to update themselves, while the proposed tracker did not, as the confidence combinatorial map of the passing person or the background did not match that of the target in the previous frame. In other words, the other trackers adapted too much, which led to drift. In frame 1241 of sequence Girl2 in Fig. 7, there is heavy blur; none of the other trackers adapt to it (some lose the object and others drift away), except for the proposed tracker. In the Trans sequence (size: 448*232), a large change in scale and pose occurs across the frames. SuperPixel gives a very successful tracking result (third row in Fig. 8). Although L1 and Srpca have the ability to adapt to scale, they fail to follow the scaling of the transformer, and Srpca even drifts to the background in frame 101. In contrast, though the proposed tracker also fails to adapt to the scaling, it sticks to the transformer.


Fig. 9. Center difference for frames of sequences Board, Cartoon, Cup, Girl, Dancer2, Lemming, Girl2, FaceOcc1, PencilBox, and Trans.


Fig. 10. Overlap rate for frames of sequences Board, Cartoon, Cup, Girl, Dancer2, Lemming, Girl2, FaceOcc1, PencilBox, and Trans.


In fact, the location SuperPixel tracks in frame 101 is inaccurate as well, since the hood of the transformer is not included in the bounding box. In the Cartoon sequence (size: 320*240), the object moves very rapidly. The proposed tracker follows the object closely, while the other trackers either lose the object or estimate an inaccurate location. In particular, in frame 71 (as shown in Fig. 7), all trackers besides Frag and the proposed one suffer from drift, and the Frag tracker adheres to an inaccurate location. In the Dancer2 sequence (size: 320*262), the dancer makes large body movements; if a tracker cannot adapt to scaling, it is very hard for it to give an accurate estimate of the dancer. The SuperPixel tracker can adapt to scaling, but it adapts in an erroneous direction, which leads to a much smaller bounding box rather than a larger one (as shown in frame 138 of Fig. 8). Other trackers with the ability to adapt to scaling, such as Srpca and L1, cannot follow the body movement of the dancer (see frames 20 and 138 in Fig. 8). The remaining trackers, including the proposed one, follow the main body of the dancer more or less and show similar performance.

Target occlusion and object disappearance. Sequence FaceOcc1 (size: 352*288) undergoes heavy occlusions. As shown in the third row of Fig. 8, the book covers the target by more than a half; Frag and Srpca give the best performance (PASCAL score: 1.000), while L1 and the proposed method (PASCAL scores: 0.994 and 0.989) perform slightly worse. Sequence Cup (size: 640*480) presents abrupt motion and target disappearance, both of which can cause drift. For instance, in frame 168 the cup is taken out of the shot; except for the proposed method, which treats the target as lost, all the other trackers drift away. In frame 256 the cup is abruptly moved back into the shot; the SemiBoost tracker approaches the target, but not accurately enough, while the proposed tracker gives a relatively precise localization.


To trace this performance to its source: the reset mechanism of the proposed tracker was activated when the target had disappeared for five frames, and the target was re-localized because the tracker had not adapted to other objects or the background.

Cluttered background. Sequence Lemming (size: 640*480) presents variation of scale and pose in a cluttered background. The cluttered background makes tracking difficult, as shown in frames 395 and 805 of Fig. 8, where the trackers easily drift to the background. In frames 1007 and 1206, the scale and pose of the target change, and the L1 tracker estimates a totally inaccurate scale and location. Srpca gives a smaller-than-normal scale in frame 1206, while the proposed tracker estimates a relatively accurate target location. Sequence Girl (size: 320*240) undergoes appearance variation, large pose change, and heavy occlusion by a person. For example, in frame 205 the target turns around, the pose of the target changes in frame 315, and in frame 457 the target is hidden by another person. Frag, Srpca, and the proposed method give relatively good performance, while the other trackers drift away or estimate an inaccurate location.

Large background disturbance in the first frame. Sequence Board (size: 640*480) presents a large appearance variation as the board is turned to the opposite side, which differs considerably from the front side shown in the video sequence. For instance, in frame 490 the board is about to turn to the opposite side; with the exception of Srpca, Frag, and the proposed tracker, all the trackers drift away, and in frame 636 Srpca drifts to the background as well. In sequence PencilBox (size: 640*480), there are a number of background pixels inside the bounding box of the first frame. It is easy to mistake the background features for object features, which ultimately causes a tracker to drift away. As shown in the last row of Fig. 8, part of the green coat is inside the initial bounding box, and the trackers Frag, L1, and Srpca track the green coat instead of the target (frames 65, 133, 184, 219). SemiBoost loses the object in most of the frames, and MIL drifts to the background and is not able to recover. The performance of the SuperPixel tracker is relatively good, but not as accurate as that of the proposed method (frames 133, 219).

Table 3. PASCAL score on sequences Board, Cartoon, Cup, Girl, Girl2, Dancer2, Lemming, FaceOcc1, PencilBox, and Trans.

Sequence | Srpca | L1 | MIL | Frag | SemiBoost | SuperPixel | Proposed
Board | 0.707 | 0.107 | 0.136 | 0.657 | 0.057 | 0.086 | 0.836
Cartoon | 0.100 | 0.080 | 0.175 | 0.365 | 0.015 | 0.225 | 0.825
Cup | 0.198 | 0.152 | 0.050 | 0.380 | 0.525 | 0.670 | 0.789
Girl | 0.644 | 0.386 | 0.416 | 0.832 | 0.436 | 0.752 | 0.713
Girl2 | 0.411 | 0.029 | 0.000 | 0.054 | 0.191 | 0.055 | 0.495
Dancer2 | 0.767 | 0.573 | 0.807 | 0.773 | 0.480 | 0.067 | 0.667
Lemming | 0.832 | 0.157 | 0.209 | 0.037 | 0.280 | 0.131 | 0.873
FaceOcc1 | 1.000 | 0.994 | 0.804 | 1.000 | 0.793 | 0.888 | 0.989
PencilBox | 0.473 | 0.265 | 0.257 | 0.221 | 0.186 | 0.673 | 0.823
Trans | 0.258 | 0.444 | 0.306 | 0.411 | 0.089 | 1.000 | 0.516
Average | 0.539 | 0.319 | 0.316 | 0.473 | 0.305 | 0.455 | 0.753

Table 5. Computational time (fps).

Sequence | Srpca | L1 | MIL | Frag | SemiBoost | SuperPixel | Proposed
Board | 2.530 | 7.136 | 8.323 | 0.578 | 0.600 | 0.062 | 2.372
Cartoon | 2.080 | 13.608 | 18.952 | 1.838 | 1.285 | 0.103 | 9.756
Cup | 1.988 | 8.309 | 10.952 | 0.518 | 0.831 | 0.052 | 6.653
Girl | 2.297 | 8.036 | 13.355 | 1.244 | 2.581 | 0.062 | 4.351
Girl2 | 2.290 | 10.005 | 11.280 | 0.510 | 2.965 | 0.028 | 5.411
Dancer2 | 3.090 | 10.710 | 18.381 | 1.173 | 4.125 | 0.140 | 12.830
Lemming | 2.059 | 5.733 | 11.320 | 0.437 | 2.993 | 0.112 | 5.425
FaceOcc1 | 3.085 | 12.750 | 13.630 | 1.057 | 2.093 | 0.051 | 2.071
PencilBox | 2.369 | 8.383 | 11.043 | 0.524 | 1.336 | 0.047 | 3.652
Trans | 2.906 | 7.126 | 13.958 | 1.290 | 1.895 | 0.085 | 4.692
Average | 2.469 | 9.180 | 13.119 | 0.917 | 2.070 | 0.074 | 5.721

Table 4. Average center distance in pixels (ACDP) on sequences Board, Cartoon, Cup, Girl, Girl2, Dancer2, Lemming, FaceOcc1, PencilBox, and Trans.

Sequence | Srpca | L1 | MIL | Frag | SemiBoost | SuperPixel | Proposed
Board | 62.113 | 239.640 | 243.603 | 39.738 | 269.992 | 163.948 | 59.593
Cartoon | 74.116 | 118.965 | 69.760 | 46.191 | 201.354 | 95.447 | 29.016
Cup | 122.743 | 78.637 | 243.197 | 101.575 | 182.538 | 52.306 | 38.875
Girl | 12.381 | 51.698 | 37.027 | 24.705 | 118.114 | 26.780 | 27.022
Girl2 | 137.674 | 118.803 | 220.987 | 236.635 | 275.532 | 258.439 | 75.466
Dancer2 | 8.663 | 12.276 | 14.483 | 11.075 | 71.052 | 12.143 | 14.906
Lemming | 7.860 | 172.54 | 133.942 | 210.426 | 130.686 | 116.389 | 17.625
FaceOcc1 | 5.565 | 9.041 | 22.242 | 6.609 | 48.620 | 24.156 | 16.851
PencilBox | 56.900 | 89.039 | 158.798 | 87.405 | 322.713 | 22.185 | 26.252
Trans | 76.270 | 44.339 | 53.814 | 38.262 | 250.818 | 7.995 | 28.891
Average | 56.4285 | 93.4978 | 119.7853 | 80.2621 | 187.1419 | 77.9788 | 33.4497


The DNF classifier plays a role in resisting the disturbance present in the first frame. Meanwhile, the better performance of the proposed tracker also benefits from the confidence combinatorial map model, which balances adaptation against tracking drift. To sum up, the update scheme and the use of the structural information of the object via the confidence combinatorial map underpin the performance of our tracker. Other trackers, such as SemiBoost, use the samples of the object estimated from the previous frame to adapt to the changes of the object. In the proposed model, samples for the update are selected from the matching pairs obtained from the confidence combinatorial maps of the object in the previous and current frames. If tracking drift happened in the previous frame, other trackers might drift from the object, but the proposed model does not, because the resulting mismatches prevent it from absorbing the inaccurate samples in the update process.

4.3.2. Quantitative metric

The center difference and the overlap rate between the estimated bounding box and the ground truth were used as quantitative metrics to compare the performance of the proposed method against the other state-of-the-art approaches. Figs. 9 and 10 report the center difference and the overlap rate, respectively, for each frame of the ten sequences. Sequences Cup, Cartoon, Dancer2, PencilBox, and Trans were evaluated on every frame, while for the other five sequences the center difference and the overlap were measured every five frames. Besides these two metrics, the PASCAL score and ACDP of the ten sequences are given in Tables 3 and 4, where the averages across the ten sequences are also calculated. The proposed method performs best in both average PASCAL score and average ACDP; in particular, it (0.753) outperforms the second best method (Srpca: 0.539) by 0.214 in average PASCAL score. Table 5 shows the computational time of all the evaluated methods on the ten sequences, measured by the number of frames processed per second; the average fps is also reported. The processing time before tracking starts (such as the time elapsed in learning the first frame, initialization time, etc.) was excluded for all methods, and only the computation between frames was counted. Table 5 shows that the proposed method is the third fastest, the two swifter methods being L1 and MIL. Considering the corresponding center difference and PASCAL score of L1 (center difference: 93.498; PASCAL: 0.319) and MIL (center difference: 119.785; PASCAL: 0.316), the proposed method (center difference: 33.450; PASCAL: 0.753) is more effective when both processing speed and tracking performance are taken into account. Although the proposed tracker provides relatively robust and fast results, it loses and re-finds the object several times and cannot completely avoid the problem of fluctuation. The losing and re-finding of the object could be alleviated by considering the temporal relationship of the object locations throughout the sequence, and this will be one of our future works to improve the tracker.
Regarding the fluctuation problem: for the FaceOcc1 sequence in particular, the PASCAL score is 0.989, but the center difference is 16.851 pixels. In other words, the target is successfully followed in almost all frames, but the center difference is slightly large (16.851 pixels). Other methods may have a similar problem. This can be attributed to the size of the patch features used in this work: we employed 8*8 as the patch size, so the estimated bounding box may swing up and down within 8 pixels. If the feature were extracted for each pixel, the problem might be alleviated, but the computational time would then increase greatly.

5. Conclusions

In this paper, we proposed a novel tracking method that resists drift via the confidence combinatorial map. We employed unlabeled samples to update the tracker and adapt it to the variations of the target. A confidence combinatorial map was constructed to describe the structure of the object and the mapping between combinatorial maps, and it facilitated accurate sample selection and model update. The experiments demonstrate that the proposed method outperforms several other state-of-the-art approaches on several challenging video sequences.

Acknowledgments

This work was supported by the Fundamental Research Funds for the Central Universities of China (2014JBM025), the China Postdoctoral Science Foundation (Grant no. 2014M560881), and the Natural Science Foundation of China (61502026, 61300071, 61472028, 61300175, 61301185, and 51505004). It was also partially supported by the National Key Scientific Instrument and Equipment Development Project (2013YQ330667) and the Basic Science Research Program through the Korea NRF (Nos. 2011-0017228 and 2013R1A1A2060427).

References

[1] A. Yilmaz, O. Javed, M. Shah, Object tracking: a survey, ACM Comput. Surv. 38 (4) (2006) 1–45.
[2] K. Cannons, A Review of Visual Tracking, York University, Canada, 2008 (Technical Report CSE-2008-07).
[3] J. Kwon, K.M. Lee, Tracking of abrupt motion using Wang-Landau Monte Carlo estimation, in: Proceedings of the European Conference on Computer Vision (ECCV), 2008.
[4] J. Shi, C. Tomasi, Good features to track, in: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), 1994, pp. 593–600.
[5] A. Jepson, D. Fleet, T. El-Maraghi, Robust online appearance models for visual tracking, in: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), vol. I, 2001, pp. 415–422.
[6] M. Black, A. Jepson, Eigentracking: robust matching and tracking of articulated objects using a view-based representation, Int. J. Comput. Vis. 26 (1) (1998) 63–84.
[7] Y. Wu, J. Lim, M.-H. Yang, Online object tracking: a benchmark, in: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), 2013.
[8] C. Veenman, M. Reinders, E. Backer, Resolving motion correspondence for densely moving points, IEEE Trans. Pattern Anal. Mach. Intell. 23 (1) (2001) 54–72.
[9] A. Yilmaz, X. Li, M. Shah, Contour based object tracking with occlusion handling in video acquired using mobile cameras, IEEE Trans. Pattern Anal. Mach. Intell. 26 (11) (2004) 1531–1536.
[10] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. 25 (2003) 564–575.
[11] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), 2005.
[12] X. Mei, H. Ling, Robust visual tracking using L1 minimization, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009, pp. 1436–1443.
[13] D. Ross, J. Lim, R.-S. Lin, M.-H. Yang, Incremental learning for robust visual tracking, Int. J. Comput. Vis. 77 (1–3) (2008) 125–141.
[14] P. Pérez, C. Hue, J. Vermaak, M. Gangnet, Color-based probabilistic tracking, in: Proceedings of the European Conference on Computer Vision (ECCV), 2002, pp. 661–675.
[15] S. Baker, D. Scharstein, J. Lewis, S. Roth, M.J. Black, R. Szeliski, A database and evaluation methodology for optical flow, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2007.
[16] B. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Proceedings of the International Joint Conference on Artificial Intelligence, 1981, pp. 674–679.
[17] K. Bowyer, C. Kranenburg, S. Dougherty, Edge detector evaluation using empirical ROC curves, Comput. Vis. Image Underst. 10 (2001) 77–103.
[18] S. Avidan, Ensemble tracking, in: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), vol. 2, 2005, pp. 494–501.
[19] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (9) (2010).
[20] B. Babenko, M.-H. Yang, S. Belongie, Visual tracking with online multiple instance learning, in: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), 2009, pp. 983–990.
[21] H. Grabner, H. Bischof, On-line boosting and vision, in: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), vol. 1, 2006, pp. 260–267.
[22] H. Grabner, C. Leistner, H. Bischof, Semi-supervised on-line boosting for robust tracking, in: Proceedings of the European Conference on Computer Vision (ECCV), 2008.
[23] L. Wen, Z. Cai, Z. Lei, D. Yi, S.Z. Li, Robust online learned spatio-temporal context model for visual tracking, IEEE Trans. Image Process. 23 (2) (2014) 785–796.
[24] Y. Li, H. Shen, On identity disclosure control for hypergraph-based data publishing, IEEE Trans. Inf. Forensics Secur. 8 (8) (2013) 1384–1396.
[25] A. Adam, E. Rivlin, I. Shimshoni, Robust fragments-based tracking using the integral histogram, in: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), 2006, pp. 798–805.
[26] C. Lang, G. Liu, J. Yu, S. Yan, Saliency detection by multitask sparsity pursuit, IEEE Trans. Image Process. 21 (3) (2012) 1327–1338.
[27] H. Tao, H. Sawhney, R. Kumar, Object tracking with Bayesian estimation of dynamic layer representations, IEEE Trans. Pattern Anal. Mach. Intell. 24 (1) (2002) 75–89.
[28] J. Kwon, K.M. Lee, Visual tracking decomposition, in: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1269–1276.
[29] S. Zhou, R. Chellapa, B. Moghadam, Adaptive visual tracking and recognition using particle filters, in: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 2003, pp. 349–352.
[30] P. Perez, C. Hue, J. Vermaak, M. Gangnet, Color-based probabilistic tracking, in: Proceedings of the European Conference on Computer Vision (ECCV), 2002.
[31] Z. Kalal, J. Matas, K. Mikolajczyk, P-N learning: bootstrapping binary classifiers by structural constraints, in: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), 2010, pp. 49–56.
[32] H. Grabner, M. Grabner, H. Bischof, Real-time tracking via online boosting, in: Proceedings of the British Machine Vision Conference, 2006, pp. 47–56.
[33] W. Hu, X. Li, X. Zhang, X. Shi, S.J. Maybank, Z. Zhang, Incremental tensor subspace learning and its applications to foreground segmentation and tracking, Int. J. Comput. Vis. 91 (3) (2011) 303–327.
[34] N. Jiang, W. Liu, Data-driven spatially-adaptive metric adjustment for visual tracking, IEEE Trans. Image Process. 23 (4) (2014) 1556–1568.
[35] J. Wu, H. Shen, Y. Li, Z. Xiao, M. Lu, C. Wang, Learning a hybrid similarity measure for image retrieval, Pattern Recognit. 46 (11) (2013) 2927–2939.
[36] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), 2001, pp. 511–518.
[37] M. Isard, J. MacCormick, Bramble: a Bayesian multiblob tracker, in: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), 2001, pp. 34–41.
[38] N. Oza, S. Russell, Online bagging and boosting, in: Proceedings of Artificial Intelligence and Statistics, 2001, pp. 105–112.
[39] G. Li, D. Liang, Q. Huang, S. Jiang, W. Gao, Object tracking using incremental 2-D-LDA learning and Bayes inference, in: Proceedings of the IEEE International Conference on Image Processing, 2008, pp. 1568–1571.
[40] O. Danielsson, B. Rasolzadeh, S. Carlsson, Gated classifiers: boosting under high intra-class variation, in: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), 2011.
[41] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Comput. Learn. Theory: Eurocolt 95 (1995) 23–37.
[42] J. Santner, C. Leistner, A. Saffari, T. Pock, H. Bischof, PROST: parallel robust online simple tracking, in: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), 2010, pp. 723–730.
[43] Y. Bai, M. Tang, Robust tracking via weakly supervised ranking SVM, in: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), 2012.
[44] W. Zhong, H. Lu, M.-H. Yang, Robust object tracking via sparsity-based collaborative model, in: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), 2012.
[45] J. Fan, X. Shen, Y. Wu, Scribble tracker: a matting-based approach for robust tracking, IEEE Trans. Pattern Anal. Mach. Intell. 34 (8) (2012) 1633–1644.
[46] D. Wang, H. Lu, M.-H. Yang, Online object tracking with sparse prototypes, IEEE Trans. Image Process. 22 (1) (2013) 314–325.
[47] E. Erdem, S. Dubuisson, I. Bloch, Fragment based tracking with adaptive cue integration, Comput. Vis. Image Underst. 116 (7) (2012) 827–841.
[48] Y. Wu, J. Cheng, J. Wang, H. Lu, J. Wang, H. Ling, E. Blasch, L. Bai, Real-time probabilistic covariance tracking with efficient model update, IEEE Trans. Image Process. 21 (5) (2012) 2824–2837.
[49] B. Liu, J. Huang, C. Kulikowski, L. Yang, Robust visual tracking using local sparse appearance model and K-selection, IEEE Trans. Pattern Anal. Mach. Intell. 35 (12) (2013) 2968–2981.
[50] S. Das, A. Kale, N. Vaswani, Particle filter with a mode tracker for visual tracking across illumination changes, IEEE Trans. Image Process. 21 (4) (2012) 2340–2346.
[51] Z. Teng, D.-J. Kang, Disjunctive normal form of weak classifiers for online learning based object tracking, in: Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP), 2013, pp. 138–146.
[52] T. Wang, G. Dai, B. Ni, D. Xu, F. Siewe, A distance measure between labeled combinatorial maps, Comput. Vis. Image Underst. 116 (2012) 1168–1177.
[53] T. Zhang, S. Liu, N. Ahuja, M.-H. Yang, B. Ghanem, Robust visual tracking via consistent low-rank sparse learning, Int. J. Comput. Vis. 111 (2) (2015) 171–190.
[54] S. Chen, S. Lia, S. Sua, Q. Tian, R. Ji, Online MIL tracking with instance-level semi-supervised learning, Neurocomputing 139 (2014) 272–288.
[55] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman, The PASCAL Visual Object Classes (VOC) Challenge, Int. J. Comput. Vis. 88 (2) (2009) 303–308.
[56] S. Wang, H. Lu, F. Yang, M.-H. Yang, Superpixel tracking, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2011.
[57] S.M.S. Nejhum, J. Ho, M.-H. Yang, Visual tracking with histograms and articulating blocks, in: Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), 2008.

Zhu Teng received her B.S. degree in Automation from Central South University, China, in 2006, and her Ph.D. degree in Mechanical Engineering from Pusan National University, South Korea, in 2013. She is now a postdoctoral researcher in the School of Computer and Information Technology, Beijing Jiaotong University. Her current research interests are image processing, machine learning, and pattern recognition.

Tao Wang received the Ph.D. degree from Beijing Jiaotong University, Beijing, China, in 2013. He currently teaches in the School of Computer and Information Technology, Beijing Jiaotong University. His main research interests include graph algorithms, pattern recognition, and image understanding.

Feng Liu received his B.S. and M.E. degrees in Computer Software from Beijing Jiaotong University, China, in 1983 and 1988, respectively. He also received his Ph.D. degree in Economics from Renmin University of China in 1997. He is now the director of the Network Management Research Center in the School of Computer and Information Technology, Beijing Jiaotong University. His current research interests are big data analysis and cloud computing.

Dong-Joong Kang received his B.S. degree in Precision Engineering from Pusan National University, Pusan, Korea, in 1988, an M.E. degree in Mechanical Engineering from KAIST (Korea Advanced Institute of Science and Technology), Seoul, Korea, in 1990, and his Ph.D. degree in Automation and Design Engineering from KAIST in 1998. He is now a professor in the School of Mechanical Engineering at Pusan National University. His current research interests are visual surveillance, intelligent vehicles/robotics, and machine vision.

Congyan Lang received her Ph.D. degree from Beijing Jiaotong University, Beijing, China, in 2006. She is now a professor in the School of Computer and Information Technology, Beijing Jiaotong University. Her research interest is visual cognitive computing.


Songhe Feng received the Ph.D. degree from the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, P.R. China, in 2009. He is currently an Associate Professor in the School of Computer and Information Technology, Beijing Jiaotong University. He was a visiting scholar in the Department of Computer Science and Engineering, Michigan State University (2013–2014). His research interests include computer vision and machine learning.