Multi-object tracking via discriminative appearance modeling


Shucheng Huang∗, Shuai Jiang, Xia Zhu

Jiangsu University of Science and Technology, No.2, Mengxi Road, Zhenjiang City, Jiangsu, China

∗ Corresponding author. E-mail address: [email protected] (S. Huang).

Article history: Received 31 August 2015; Revised 4 June 2016; Accepted 9 June 2016; Available online xxx

Keywords: Multi-object tracking; Multiple hypotheses tracking; Metric learning; Multi-cue fusion; Appearance modeling

Abstract

Tracking multiple objects is important for automatic video content analysis and virtual reality. Recently, how to formulate data association more effectively to overcome ambiguous detection responses, and how to build a more effective association affinity model, have attracted increasing attention. To address these issues, we propose a metric learning and multi-cue fusion based hierarchical multiple hypotheses tracking method (MHMHT), which conducts data association more robustly and incorporates more temporal context information. The association appearance similarity is calculated using the distances between the feature vectors of each associated tracklet and the salient templates of each track hypothesis; it is then fused online with the dynamic similarity calculated by a Kalman filter to obtain the association affinity. To make the appearance similarity more discriminative, the spatial-temporal relationships of reliable tracklets in a sliding temporal window are used as constraints to learn the discriminative appearance metric that measures the distance between feature vectors and salient templates. The salient templates of the generated track hypotheses are updated with an incremental clustering method that takes high order temporal context information into account. We evaluate our MHMHT tracker on challenging benchmark datasets. Qualitative and quantitative evaluations demonstrate that the proposed tracking algorithm performs favorably against several state-of-the-art methods.

1. Introduction

Tracking is a fundamental task for video understanding in computer vision and pattern recognition. Video data grow rapidly every day due to the development of mobile terminals, digital camera networks and communication technology. This vast amount of data contains much redundant information, which needs to be effectively presented and abstracted. The ability to simultaneously track multiple objects is important for automatic video analysis and virtual reality: it provides global trajectory and pose information for higher level analysis and decision making, and has numerous applications including target identification, intelligent surveillance, video coding, video analysis and human action recognition.

Multi-object tracking aims at inferring a trajectory for each object in a video sequence, which can be considered a dynamic, incremental spatiotemporal clustering problem: all image regions are clustered as a specific object or as background. In real scenes there are many challenges, such as a varying number of objects, objects with similar appearances, variation of object appearance, complex motion, long-term occlusions and cluttered backgrounds, which introduce uncertainty and make multi-object tracking difficult.




Recently, with significant progress in research on object detection and classification, methods adopting the tracking-by-detection framework have become more and more popular. These approaches are much more flexible and robust in complex scenes where the camera may move or zoom in/out, and they are completely automatic, without the manual initialization, which is important in practical applications.

There are two important parts in multi-object tracking under the tracking-by-detection framework: (1) the association optimization model and (2) the association affinity model. To perform association optimization, the first-order Markov formulation widely used in single object tracking can be extended to multi-object tracking, as in the Joint Probabilistic Data Association Filter (JPDA) (Bar-Shalom et al., 2009) and methods based on particle filters (Breitenstein et al., 2009; Khan et al., 2005; Qu et al., 2007). However, it is usually much more effective to overcome the ambiguities in the tracking process by considering past information and future feedback simultaneously, which is the main idea of current methods such as multiple hypotheses tracking (MHT) (Cox and Hingorani, 1996; Miller et al., 1997; Ryoo and Aggarwal, 2008), Markov chain Monte Carlo (MCMC) data association (Benfold and Reid, 2011; Oh et al., 2009; Yu and Medioni, 2009), and network flow graphs (Ben Shitrit et al., 2011; Pirsiavash et al., 2011; Zhang et al., 2008).


Recently, some work utilizes reliably associated tracklets as elements instead of detection results, to gradually reduce association ambiguities level by level (Brendel et al., 2011; Huang et al., 2008; Xing et al., 2009), which is robust in complex scenes. However, in many association frameworks, especially the hierarchical methods, temporal context is not effectively utilized, and assignments between assumed trajectories and current measurements are only based on the detected responses or tracklets at the previous adjacent time step.

Moreover, association affinity calculation plays an important role in multi-object tracking. Data association faces challenges caused by detection errors, occlusions, and similar appearance among multiple objects. The hand-crafted affinity models used in most previous methods cannot evaluate data association well. Some recent work focuses on building more effective affinity calculation models to achieve better results. Li et al. (2009) propose a learning-based association approach that trains an offline HybridBoost classifier to formulate affinity calculation as a joint problem of ranking and classification; this method needs ground truth from the same scene to generate samples. Kuo et al. (2010) propose an online learned discriminative appearance model which uses the spatial-temporal relationships of reliably associated tracklets as sampling constraints. They also propose an improved version (Kuo and Nevatia, 2011) which incorporates a target-specific appearance model for tracklets confirmed to belong to an object. These approaches only model data association as a bipartite graph of tracklets: to decide whether a tracklet is associated to an object, they need additional algorithms and heuristic information such as an occupancy map. Although many affinity models have been proposed, two major issues remain: how to adapt the appearance similarity model over the video, and how to effectively embed the appearance similarity in the affinity calculation together with other cues, such as dynamic similarity.

To address the above issues, we propose a hierarchical MHT framework based on reliably generated tracklets, which combines the merits of the hierarchical methods and MHT: it conducts data association level by level to hierarchically reduce the ambiguities and form object tracks, while incorporating more temporal context information. The association affinity is calculated by fusing the dynamic similarity, computed by a Kalman filter, with the appearance similarity online via logistic regression. The appearance similarity is defined on the distances between the salient templates of each track hypothesis and the feature vectors extracted from the detected responses of the associated tracklet. An incremental clustering method is adopted to extract and update the salient templates of the generated track hypotheses, considering the high order temporal context. To enhance the discrimination of the appearance similarity calculation, the distance metric is learned from constraints formed by the spatial-temporal relationships of tracklets. We summarize the contributions of this work as follows:

• A hierarchical MHT association framework, which has the merits of both the hierarchical manner and MHT, is proposed to gradually link short tracklets into object tracks by considering more temporal context information.
• A robust appearance model is proposed that calculates the similarity with a salient template set and considers the high order temporal context of the generated tracks. The appearance similarity is then fused with the dynamic similarity by logistic regression trained on reliable tracklets.
• A discriminative appearance metric is learned using the spatial-temporal relationships of tracklets as constraints, to measure the similarity between feature vectors and salient templates for appearance modeling.

The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 introduces the MHT algorithm in detail. Section 4 gives an overview of the proposed tracking framework. Section 5 presents the proposed multi-cue fusion based MHT algorithm. Section 6 reports the experimental results on widely used representative datasets. We conclude the paper in Section 7.

2. Related work

Visual tracking includes single object tracking (Zhang et al., 2013a, 2015a,b) and multi-object tracking (Breitenstein et al., 2009; Khan et al., 2005; Milan et al., 2014; Qu et al., 2007). Research in the multi-object tracking field concentrates on the construction of the affinity model and the strategy of association optimization.

The first-order Markov process, broadly used in modeling the single object tracking problem, can be extended to formulate data association in multi-object tracking. Although many interaction models have been incorporated (Breitenstein et al., 2009; Khan et al., 2005; Qu et al., 2007), it is usually more effective to overcome the ambiguities of occlusions, spurious measurements and missing measurements by considering past information and future feedback simultaneously when estimating the current state. A classic method of this kind is multiple hypotheses tracking (MHT) (Cox and Hingorani, 1996; Zhang and Xu, 2014), which was first used in radar signal processing. MHT maintains the possible associations with higher probability (Miller et al., 1997) over a period of time, so as to find an approximately optimal solution when new information is received; it generates many redundant hypotheses and relies on pruning strategies to reduce the computational complexity. Recently, an observe-and-explain paradigm for MHT was proposed (Ryoo and Aggarwal, 2008). This method enumerates tracking possibilities only when the system has enough information to evaluate them, which avoids an exponential number of possibilities caused by insufficient data; however, additional heuristic decision procedures must be conducted to maintain and generate hypotheses. Inspired by the MHT algorithm, many approaches have been proposed. MCMC data association (Benfold and Reid, 2011; Oh et al., 2009; Yu and Medioni, 2009) uses the Metropolis-Hastings algorithm to construct an irreducible and aperiodic Markov chain, which is employed to efficiently sample from the posterior distribution of the state space and obtain an approximately optimal solution; these methods need a large number of iterations, which leads to high time complexity. Some approaches construct network flow graph models (Ben Shitrit et al., 2011; Pirsiavash et al., 2011; Zhang et al., 2008) to map the maximum-a-posteriori (MAP) data association problem into a cost-flow network with a non-overlap constraint on trajectories. Zhang et al. (2008) propose an explicit occlusion model which gradually enlarges the original node set with occluded hypotheses and imposes lower bound constraints after each iteration of network optimization; the number of tracks needs to be pre-set, and the solution is obtained once all tracks are formed. Pirsiavash et al. (2011) show that the global solution, including the number of tracks, can be obtained using shortest path computations on a flow network, and give a near-optimal algorithm based on 2-pass dynamic programming. However, the cost-flow network can only handle trajectory affinity functions that decompose as products of affinity functions between two adjacent instants, which restricts its representation power.

There are also approaches addressing global association in a hierarchical manner (Brendel et al., 2011; Huang et al., 2008; Xing et al., 2009; Zhang et al., 2012).
Huang et al. (2008) propose a three-level hierarchical association approach which generates reliable tracklets at the low level, formulates the association as a bipartite graph at the middle level, and estimates entries, exits and scene occluders using the already computed tracklets at the high level. Xing et al. (2009) adopt a particle filter in the local stage to generate reliable tracklets, which are buffered when the object state cannot be observed.


A set of potential tracklets is generated in a temporal sliding window in the global stage; the reliable tracklets and potential tracklets are then associated by the Hungarian algorithm to obtain the globally optimal association. Brendel et al. (2011) utilize pairs of detection responses from every two consecutive frames to build a graph of tracklets; the data association problem is formulated as iteratively finding the maximum-weight independent set (MWIS) to hierarchically merge smaller tracks into longer ones. However, in most of these methods, temporal context information is not effectively utilized.

Recently, many researchers have focused on appearance modeling (Zhang et al., 2013b, 2014) and on building more effective affinity calculation models. Li et al. (2009) formulate tracklet association as a joint problem of ranking and classification, and use the HybridBoost algorithm to automatically select various features and non-parametric similarity models to form the tracklet affinity model; the training samples for ranking and binary classification are generated from the set of tracklets at the lower level and the ground truth of the training videos. Kuo et al. (2010) learn a discriminative appearance model for different targets using online boosting, which combines effective image descriptors and their corresponding similarity measurements; training samples are collected from tracklets according to spatial-temporal constraints. They also propose an improved version (Kuo and Nevatia, 2011) which incorporates a target-specific appearance model for the tracklets of an object. In Bae and Yoon (2014), the multi-object tracking problem is formulated based on tracklet confidence, calculated from the detectability and continuity of a tracklet, and is solved by associating tracklets in different ways according to their confidence values. In Kim et al. (2015), the classical multiple hypotheses tracking algorithm is revisited in a tracking-by-detection framework, and higher-order information is exploited by training online appearance models for each track hypothesis. In Rezatofighi et al. (2015), the joint probabilistic data association technique is revisited and a novel solution is proposed based on recent developments in finding the m-best solutions to an integer linear program. In Yang et al. (2011), a conditional random field (CRF) model is used to progressively associate tracklets into long tracks, considering both tracklet affinities and the dependencies among them; a RankBoost algorithm is then adopted to estimate the energy terms of the CRF model using training samples generated from the ground truth of the training videos. However, in the above models, the two issues, how to adapt the similarity model over the video sequence and how to more effectively embed the appearance similarity in an affinity calculation fused with other cues, are still far from solved. Moreover, affinity models trained on ground truth need extra work in deployment, and their effectiveness is limited across different scenes.

3. The MHT algorithm

In a tracking-by-detection framework, new measurements (detection results) are received at every time instant. Many approaches formulate data association as a MAP problem and efficiently find the optimal association state. The formulation proposed in MHT assumes that each measurement may either (1) belong to a previously known object, (2) be the start of a new object, or (3) be a false positive.
For objects that are not assigned measurements, there is the possibility of (4) termination or (5) continuation of the object. Multiple hypotheses with higher probability are maintained so that a better solution can be searched for once more measurements are received, and the association state is determined after a certain period of time. We take MHT as our fundamental optimization framework. To make this paper self-contained, the MHT algorithm is reviewed in this section; for details, please refer to Cox and Hingorani (1996).


Let $Z^k$ be the set of all measurements up to time $k$. A single object trajectory hypothesis $\tau_j$ is defined as a list of measurements at different times. A specific partition of the measurement set, the association hypothesis $\Omega_l^k$, is defined as a set of trajectory hypotheses $\{\tau_j\}$, where $l$ is the index and an additional trajectory $\tau_0$ contains all false positives. When the measurements $Z(k)$ at time $k$ are received, a particular global hypothesis $\Omega_l^k$ can be derived from a certain hypothesis $\Omega_m^{k-1}$ at time $k-1$ by a specific set of assignments of the origins of all measurements received at time $k$ to the objects assumed by the parent hypothesis $\Omega_m^{k-1}$. Let $\theta_l(k)$ denote this set of assignments, which consists of $\gamma$ measurements from known objects, $\eta$ measurements from new objects, $\phi$ spurious measurements, and $\zeta$ terminated objects. The constraint that one observation originates from at most one object and one object has at most one associated observation is imposed to reduce the size of the search space. The probability of an association hypothesis $\Omega_l^k$ can be calculated from its parent hypothesis $\Omega_m^{k-1}$ using Bayes' rule:

$$P\{\Omega_l^k \mid Z^k\} = \frac{1}{c}\, P\{Z(k) \mid \theta_l(k), \Omega_m^{k-1}, Z^{k-1}\}\; P\{\theta_l(k) \mid \Omega_m^{k-1}, Z^{k-1}\}\; P\{\Omega_m^{k-1} \mid Z^{k-1}\}, \qquad (1)$$

where $c$ is a normalization constant. Multi-object tracking aims to find the MAP hypothesis. For the term $P\{Z(k) \mid \theta_l(k), \Omega_m^{k-1}, Z^{k-1}\}$, it is assumed that a measurement $z_i(k)$ with index $i$ follows a uniform distribution if it is considered a false positive, and has the likelihood $P_{asso}(z_i(k) \mid \tau_{j_i}(k-1))$ if it is associated with the object trajectory $\tau_{j_i}$ with index $j_i$. For the term $P\{\theta_l(k) \mid \Omega_m^{k-1}, Z^{k-1}\}$, it is assumed that the detection and termination probabilities of an object trajectory $\tau_j$ are $P_D^j$ and $P_\zeta^j$, respectively, and that the numbers of false positives and new objects follow Poisson distributions with densities $\lambda_F$ and $\lambda_N$, respectively. Substituting these assumptions into Eq. (1), the posterior probability of an association hypothesis can be expressed recursively as:

$$P\{\Omega_l^k \mid Z^k\} = \frac{1}{c}\,\lambda_F^{\phi}\,\lambda_N^{\eta} \prod_{i=1}^{m_k} \Big[P_{asso}\big(z_i(k)\mid \tau_{j_i}(k-1)\big)\Big]^{\gamma_i} \prod_{j}\big(P_D^{j}\big)^{\delta_j}\big(1-P_D^{j}\big)^{1-\delta_j} \prod_{j}\big(P_\zeta^{j}\big)^{\zeta_j}\big(1-P_\zeta^{j}\big)^{1-\zeta_j}\; P\{\Omega_m^{k-1}\mid Z^{k-1}\}, \qquad (2)$$

where $\gamma_i$, $\delta_j$, and $\zeta_j$ are the indicator variables of measurement $i$ originating from a known object, of object $j$ being detected, and of object $j$ being terminated, respectively; $\phi$ and $\eta$ are the total numbers of false positives and new objects, respectively.

In practical applications, tracks that do not compete for common measurements can be partitioned into separate clusters. The $k$-best hypotheses are generated directly to achieve a much more efficient implementation, using an optimized version of Murty's algorithm (Miller et al., 1997) in $O(kN^3)$. Additionally, a pruning stage is used to reduce the complexity. In fact, MHT can be seen as a branch-and-bound method with an additional pruning strategy.

4. Overview of our approach

Our approach can be divided into three main stages: reliable tracklet generation, tracklet association, and object association in upper levels (a code-level sketch of this pipeline is given after the list).

1. In the first stage, we carry out a dual-threshold association method as in Huang et al. (2008), which is conservative and generates reliable object tracklets.


2. In the second stage, the metric learning and multi-cue fusion based tracklet MHT is used to conduct tracklet association. Appearance similarity and dynamic similarity are fused to obtain the association affinity. The distances between the salient templates of the generated tracks and the beginning extremity templates of the tracklets are used to calculate the appearance similarity, while the dynamic similarity is calculated with the Kalman filter from the end extremity dynamic states of the generated tracks and the beginning extremity dynamic states of the tracklets.
• To enhance the discrimination of the appearance model, an effective appearance metric is learned to measure the similarity of detected responses, based on similar and dissimilar constraints generated from the spatial-temporal relationships of the reliable tracklets.
• The appearance similarity is fused online with the dynamic similarity, which is calculated by the Kalman filter, using logistic regression.
• The salient templates are initialized from the generated tracks and updated in each temporal slice according to the generated hypotheses, using an incremental clustering framework, so as to consider temporal context.
3. In the third stage, the same strategy used in the second stage is adopted to conduct tracklet association at the upper level. As a result, object tracks are generated gradually by tracklet association with the proposed MHT.
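To make the three stages concrete, the following is a minimal, hypothetical orchestration sketch in Python. The function names, the callable-based interface and the default of two upper levels are illustrative assumptions, not the authors' implementation; the stage internals are only sketched in later sections.

```python
def hierarchical_mhmht(detections, generate_tracklets, mht_level, num_upper_levels=2):
    """Three-stage pipeline of Section 4 (illustrative sketch).

    detections         : per-frame detection responses.
    generate_tracklets : callable for the dual-threshold association (Section 5.1).
    mht_level          : callable for one level of the multi-cue fused MHT
                         (Sections 5.2-5.5), mapping tracklets/tracks to longer tracks.
    """
    # Stage 1: conservative dual-threshold association -> short, reliable tracklets.
    tracklets = generate_tracklets(detections)
    # Stage 2: metric-learning and multi-cue fusion based tracklet MHT.
    tracks = mht_level(tracklets)
    # Stage 3: repeat the same association on the generated tracks (1-3 times).
    for _ in range(num_upper_levels):
        tracks = mht_level(tracks)
    return tracks
```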

5. Multi-cue fused hierarchical MHT

In this paper, we address the multi-object tracking problem from a clustering perspective. There are natural inner relations between multi-object tracking and clustering: over a short period of time, the appearance and dynamic pattern of an object can be assumed to follow the same probability distribution, which implies that multi-object tracking can be seen as a kind of dynamic incremental clustering in time and space. From this perspective, many clustering algorithms can be transferred to multi-object tracking, and a more essential understanding can be achieved. We propose a metric learning and multi-cue fusion based hierarchical multiple hypotheses tracking method (MHMHT) to conduct association optimization with more temporal context information. The appearance similarity is defined on the distances between the features of the detected responses in the associated tracklet and the salient templates of each track hypothesis; it is then fused with the dynamic similarity computed by the Kalman filter to obtain the association affinity via logistic regression. To enhance the discrimination of the appearance model, the spatial-temporal relationships of tracklets in a sliding temporal window are used as constraints to learn the discriminative appearance metric which measures the similarity between feature vectors and salient templates. Moreover, an incremental clustering method is adopted to extract and update the salient templates of the generated track hypotheses, considering the high order temporal context information. After tracklet association, object trajectories are gradually associated by repeating the same process, which takes the trajectories generated at the lower level as input, several times (1–3 times).

5.1. Reliable tracklets

Given the detected responses in each video sequence, a dual-threshold strategy as in Huang et al. (2008) is used to generate short but reliable tracklets, conducting association only between two consecutive frames. The affinity of two responses is defined based on their position, size and color histogram:

$$sim(r_i, r_j) = \exp\left(-\frac{\|\rho_i - \rho_j\|_2^2}{\sigma_\rho^2}\right) \cdot \exp\left(-\frac{\|\alpha_i - \alpha_j\|_2^2}{\sigma_\alpha^2}\right) \cdot \exp\left(-\frac{\chi^2(h_i, h_j)}{\sigma_h^2}\right), \qquad (3)$$

where $r_i$ and $r_j$ are detected responses in two consecutive frames; $\rho_i$, $\alpha_i$ and $h_i$ are the position, scale and histogram vectors, respectively; $\|\cdot\|_2$ is the 2-norm of a vector; $\chi^2(\cdot,\cdot)$ is the $\chi^2$ distance between histograms; and $\sigma_\rho^2$, $\sigma_\alpha^2$ and $\sigma_h^2$ are the variances of position, scale and color histogram. The responses $r_i$ and $r_j$ are associated if the affinity between them is higher than a threshold $\theta_1$ and exceeds any other possible association involving either of them by another threshold $\theta_2$.
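As a concrete illustration of Eq. (3) and the dual-threshold rule, here is a minimal Python/NumPy sketch. The dictionary-based response layout, the kernel variances and the thresholds theta1 and theta2 are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def chi2_dist(h1, h2, eps=1e-10):
    # Chi-square distance between two (normalized) color histograms.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def affinity(r1, r2, sig_p=10.0, sig_a=0.2, sig_h=0.1):
    # Eq. (3): product of Gaussian kernels on position, scale and histogram.
    pos = np.exp(-np.sum((r1["pos"] - r2["pos"]) ** 2) / sig_p ** 2)
    scale = np.exp(-np.sum((r1["scale"] - r2["scale"]) ** 2) / sig_a ** 2)
    hist = np.exp(-chi2_dist(r1["hist"], r2["hist"]) / sig_h ** 2)
    return pos * scale * hist

def dual_threshold_link(frame_t, frame_t1, theta1=0.6, theta2=0.1):
    """Link responses in two consecutive frames.

    A pair (i, j) is linked only if its affinity exceeds theta1 AND exceeds
    every other affinity involving response i or response j by at least theta2.
    """
    if not frame_t or not frame_t1:
        return []
    A = np.array([[affinity(ri, rj) for rj in frame_t1] for ri in frame_t])
    links = []
    for i in range(A.shape[0]):
        j = int(np.argmax(A[i]))
        best = A[i, j]
        if best < theta1:
            continue
        others = np.concatenate([np.delete(A[i], j), np.delete(A[:, j], i)])
        if others.size == 0 or best - others.max() >= theta2:
            links.append((i, j))
    return links
```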

5.2. Appearance metric learning

Once reliable tracklets have been generated, the appearance difference can be calculated via the tracklet templates $E = \{h_1, \ldots, h_{n_E}\}$. For each tracklet, two template sets, the beginning extremity template set and the end extremity template set, are generated for association, so as to take the appearance evolution over time into account. The appearance difference $d_{app}$ is calculated between the end extremity template set $E^{end}_{T_j}$ of the previous tracklet and the beginning extremity template set $E^{bgn}_{T_i}$ of the subsequent tracklet,

$$d_{app} = d^2\big(E^{end}_{T_j},\, E^{bgn}_{T_i}\big), \qquad (4)$$

which is obtained as the average of the forward appearance difference, i.e., the smallest difference of every feature $h_l$ in the beginning set of the subsequent tracklet to the end set $E^{end}_{T_j}$ of the previous tracklet, and the backward appearance difference, which is calculated in the same way but inversely over time (a code sketch is given at the end of this subsection). The end extremity template set $E^{end}_{T_j}$ and the beginning extremity template set $E^{bgn}_{T_j}$ of a tracklet $T_j$ are generated by collecting the appearance of the detected responses in the beginning and end extremities with a specified length $n_E$. If a tracklet contains fewer than $n_E$ detected responses, the end and beginning template sets are identical.

The similarity metric between feature vectors plays an important role in modeling the forward and backward appearance differences. General-purpose metrics exist (e.g., the Euclidean distance and the cosine similarity), but they often fail to capture the idiosyncrasies of the data. Metric learning aims at automatically learning a real-valued metric function from data and has attracted a lot of interest. The common form of the metric function is the Mahalanobis distance

$$d_M(x_i, x_j) = \sqrt{(x_i - x_j)^T M (x_i - x_j)},$$

where $M$ is a positive semi-definite matrix. There are three families of metric learning algorithms (Bellet et al., 2013):

• Weakly supervised: the algorithm learns the matrix $M$ from pair based constraints, comprising the similar constraints $S = \{(x_i, x_j)\}$ and the dissimilar constraints $D = \{(x_i, x_j)\}$, or from triplet based constraints $R = \{(x_i, x_j, x_k)\}$, where $x_i$ should be more similar to $x_j$ than to $x_k$; this is the side information.
• Fully supervised: the algorithm has access to a set of labeled training instances $\{b_i = (x_i, y_i)\}_{i=1}^{n}$, where each training example $b_i \in B = X \times Y$ is composed of an instance $x_i \in X$ and a label $y_i \in Y$. In practice, the label information is used to generate specific sets of pair or triplet constraints based on a notion of neighborhood.
• Semi-supervised: the algorithm additionally has access to a (typically large) collection of unlabeled instances with no side information. This is useful to avoid overfitting when the labeled data or side information is scarce.
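The following is a minimal sketch of the set-to-set appearance difference of Eq. (4) under a Mahalanobis metric; the feature dimensionality, the identity initialization of M and the toy data are illustrative assumptions.

```python
import numpy as np

def mahalanobis(x, y, M):
    d = x - y
    return float(np.sqrt(d @ M @ d))

def set_difference(E_end_prev, E_bgn_next, M):
    """Eq. (4): average of the forward and backward appearance differences.

    E_end_prev: (n1, d) end-extremity templates of the previous tracklet.
    E_bgn_next: (n2, d) beginning-extremity templates of the next tracklet.
    Forward term: for every template in the beginning set, the smallest
    distance to the end set; backward term: the same with roles swapped.
    """
    fwd = np.mean([min(mahalanobis(h, e, M) for e in E_end_prev)
                   for h in E_bgn_next])
    bwd = np.mean([min(mahalanobis(e, h, M) for h in E_bgn_next)
                   for e in E_end_prev])
    return 0.5 * (fwd + bwd)

# Toy usage with an identity metric (the Euclidean special case).
d = 8
M = np.eye(d)
d_app = set_difference(np.random.rand(4, d), np.random.rand(4, d), M)
```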


Fig. 1. Illustration of the training sample generation: the set on the right with label 1 is the positive sample set, while the set with label −1 is the negative sample set.

For a more thorough survey of metric learning methods, we refer the readers to Bellet et al. (2013) and Yang and Jin (2006). Considering our formulation of multi-object tracking and the goal of obtaining a discriminative appearance metric for the different objects within a sliding temporal window, a weakly supervised method, information-theoretic metric learning, is adopted. Note that many other weakly supervised or semi-supervised methods could also be used to learn the appearance metric within the proposed framework. Information-theoretic metric learning (Davis et al., 2007) can be formulated as follows:

$$\min_{M \in \mathbb{S}_+^{d}} \; \mathrm{tr}\big(M M_0^{-1}\big) - \log\det\big(M M_0^{-1}\big) - \dim + \gamma_M \sum_{i,j} \xi_{ij}$$
$$\text{s.t.} \quad d_M^2(x_i, x_j) \le u_M + \xi_{ij} \quad \forall (x_i, x_j) \in S, \qquad d_M^2(x_i, x_j) \ge v_M - \xi_{ij} \quad \forall (x_i, x_j) \in D, \qquad (5)$$

where $M$ is a positive semi-definite matrix; $\dim$ is the dimension of the input space; $M_0$ is a given positive semi-definite matrix that $M$ should remain close to, often set to the identity matrix; $\gamma_M \ge 0$ is the trade-off parameter; and $u_M, v_M \in \mathbb{R}$ are threshold parameters.

To generate the training samples, we use the method of Kuo et al. (2010), which is based on spatial-temporal constraints. With the reliable tracklets obtained by the dual-threshold association, two assumptions are made: (1) responses in one tracklet describe the same target; (2) any responses in two different tracklets which overlap in time represent different targets. These constraints are reasonable because one object cannot belong to two different trajectories at the same time. Additionally, (3) some tracklets are considered not associable if the frame gap between them is small but they are spatially far apart, which can be expressed as $v_{asso} \ge \theta_3$, where $v_{asso} = \Delta\rho / \Delta t$ is the ratio of the spatial distance $\Delta\rho$ to the time difference $\Delta t$, and $\theta_3$ is a prior threshold. For a tracklet $T_j$, the discriminative set $D_{T_j}$ is formed by the tracklets related to it according to these spatial-temporal constraints. The similar and dissimilar constraints for metric learning are then generated as:

$$S = \{(h_{l_1}, h_{l_2}) \mid h_{l_1}, h_{l_2} \in T_j\}, \qquad D = \{(h_{l_3}, h_{l_4}) \mid h_{l_3} \in T_j,\ h_{l_4} \in T_i,\ T_i \in D_{T_j} \lor T_j \in D_{T_i}\}, \qquad (6)$$

where $h_l$ is the feature of the detected response $r_l$, and $(h_{l_1}, h_{l_2}) = (h_{l_2}, h_{l_1})$, i.e., the pairs are unordered. The training sample generation process is illustrated in Fig. 1.
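The sketch below generates similar/dissimilar pairs from reliable tracklets under the spatial-temporal constraints of Eq. (6), and then fits a simple Mahalanobis matrix from the similar-pair scatter as a stand-in for the information-theoretic metric learning of Eq. (5). The tracklet data layout, the representative position field and the inverse-scatter metric are assumptions made only for illustration.

```python
import numpy as np

def overlap_in_time(tr_a, tr_b):
    # Tracklets are dicts: {"t": (start, end), "feats": (n, d) array,
    #                       "pos": representative image position (illustrative)}.
    (s1, e1), (s2, e2) = tr_a["t"], tr_b["t"]
    return max(s1, s2) <= min(e1, e2)

def build_constraints(tracklets, theta3=50.0):
    """Eq. (6): S from within each tracklet; D from tracklet pairs that overlap
    in time or are clearly not associable (v_asso = distance / gap >= theta3)."""
    S, D = [], []
    for tr in tracklets:
        f = tr["feats"]
        for a in range(len(f)):
            for b in range(a + 1, len(f)):
                S.append((f[a], f[b]))
    for i, ta in enumerate(tracklets):
        for tb in tracklets[i + 1:]:
            gap = abs(tb["t"][0] - ta["t"][1])
            dist = np.linalg.norm(tb["pos"] - ta["pos"])
            not_associable = gap > 0 and dist / gap >= theta3
            if overlap_in_time(ta, tb) or not_associable:
                for ha in ta["feats"]:
                    for hb in tb["feats"]:
                        D.append((ha, hb))
    return S, D

def fit_simple_metric(S, reg=1e-3):
    # Stand-in for ITML: inverse within-class scatter of similar-pair differences.
    diffs = np.array([a - b for a, b in S])
    cov = diffs.T @ diffs / max(len(diffs), 1) + reg * np.eye(diffs.shape[1])
    return np.linalg.inv(cov)   # positive definite by construction
```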

5.3. Multi-cue fusion strategy

Besides the appearance similarity, the dynamic similarity also plays an important part in data association. The dynamic similarity can be calculated by a Kalman filter (Reid, 1979). Let $s_{T_j}(k_{end}) = [(\rho_{T_j}^{end})^T\ (\alpha_{T_j}^{end})^T\ (\upsilon_{T_j}^{end})^T]^T$ be the dynamic state vector of the end extremity of tracklet $T_j$, where $\rho_{T_j}^{end}$ is the position vector, $\alpha_{T_j}^{end}$ is the scale vector and $\upsilon_{T_j}^{end}$ is the velocity vector, with the corresponding measurement $z_{T_j}(k_{end}) = [(\rho_{T_j}^{end})^T\ (\alpha_{T_j}^{end})^T]^T$. Similarly, $s_{T_i}(k_{bgn}) = [(\rho_{T_i}^{bgn})^T\ (\alpha_{T_i}^{bgn})^T\ (\upsilon_{T_i}^{bgn})^T]^T$ is the dynamic state vector of the beginning extremity of tracklet $T_i$, with $z_{T_i}(k_{bgn}) = [(\rho_{T_i}^{bgn})^T\ (\alpha_{T_i}^{bgn})^T]^T$. $F$ and $H$ denote the state-transition matrix and the observation matrix, respectively, which are implicitly determined by the physical meaning of the elements of the dynamic state vector; $Q$ is the covariance of the process noise and $R$ is the covariance of the observation noise; $\Sigma_{T_j}(\cdot)$ denotes the error covariance matrix, which can be computed from $Q$ and $R$ by running the filter iterations. The forward dynamic difference $d_{dynF}$ can be calculated as:

$$\begin{aligned}
\hat{s}_{T_j}(k_{end}+\Delta t \mid k_{end}) &= F^{\Delta t}\, s_{T_j}(k_{end}),\\
\Sigma_{T_j}(k_{end}+\Delta t \mid k_{end}) &= F\,\Sigma_{T_j}(k_{end})\,F^{T} + Q,\\
\tilde{z}(k_{end}+\Delta t) &= z_{T_i}(k_{bgn}) - H\,\hat{s}_{T_j}(k_{end}+\Delta t \mid k_{end}),\\
d_{dynF} &= \Big[\tilde{z}(k_{end}+\Delta t)^{T}\,\Sigma_{T_j}(k_{end}+\Delta t \mid k_{end})^{-1}\,\tilde{z}(k_{end}+\Delta t)\Big]^{1/2}.
\end{aligned} \qquad (7)$$

Here, $Q$ and $R$ are prior parameters, which can be set according to the application scene or learned from the generated lower-level tracklet set by the EM algorithm (Digalakis et al., 1993). The backward dynamic difference is obtained in the same way, except that the velocity vector $\upsilon_{T_i}^{bgn}$ is negated when predicting $\hat{s}_{T_i}(k_{bgn}-\Delta t \mid k_{bgn})$. The dynamic difference $d_{dyn}$ is calculated as the average of the forward and backward dynamic differences; this process is illustrated in Fig. 2. The two kinds of difference are fused and normalized by the sigmoid function to generate the association confidence $P_{asso}$, which ranges from 0 to 1:

$$P_{asso} = \frac{1}{1 + \exp\big(\beta_0 + \beta_1 d_{app} + \beta_2 d_{dyn}\big)}, \qquad (8)$$

where $[\beta_0\ \beta_1\ \beta_2]$ are the parameters of the sigmoid function. With the generated reliable tracklets, these parameters are obtained by logistic regression; for this training the tracklet templates are simplified to a single detected response. The only difference is that pairs of responses whose dynamic differences are too large to be associated, denoted $d_{dyn} > \theta_4$, are not used as training samples, because similarity fusion on such pairs is meaningless.
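The sketch below implements a forward dynamic difference in the spirit of Eq. (7), using a constant-velocity model and the standard Kalman innovation covariance, together with the logistic fusion of Eq. (8). The state layout (2-D position, scale, 2-D velocity), the noise covariances and the fusion coefficients are illustrative assumptions; in the paper the coefficients are fitted by logistic regression on reliable tracklets.

```python
import numpy as np

def constant_velocity_model(dt):
    # State: [x, y, s, vx, vy]; measurement: [x, y, s].
    F = np.eye(5)
    F[0, 3] = F[1, 4] = dt
    H = np.zeros((3, 5))
    H[0, 0] = H[1, 1] = H[2, 2] = 1.0
    return F, H

def forward_dynamic_difference(s_end, P_end, z_bgn, dt, Q, R):
    """Mahalanobis distance of the predicted end state of the previous tracklet
    to the first measurement of the next tracklet (cf. Eq. (7))."""
    F, H = constant_velocity_model(dt)
    s_pred = F @ s_end
    P_pred = F @ P_end @ F.T + Q
    innov = z_bgn - H @ s_pred
    S = H @ P_pred @ H.T + R            # innovation covariance
    return float(np.sqrt(innov @ np.linalg.inv(S) @ innov))

# The backward difference is computed analogously by negating the velocity of the
# next tracklet's beginning state and predicting backwards; d_dyn is the average
# of the forward and backward differences.

def association_confidence(d_app, d_dyn, beta=(0.0, 1.0, 1.0)):
    # Eq. (8): logistic fusion; larger differences give lower confidence.
    b0, b1, b2 = beta
    return 1.0 / (1.0 + np.exp(b0 + b1 * d_app + b2 * d_dyn))
```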


Fig. 2. Illustration of the dynamic difference calculation.

5.4. Tracklet MHT

With the association similarity defined on tracklets, the original MHT can easily be extended to the association of tracklets by treating them as track nodes. Since the measurements at a specific time instant in the original MHT are naturally the detected responses, a specific method is needed to generate measurements when the MHT framework is extended to tracklet association. An intuitive way is to assign tracklets with the same beginning time to the same measurement set, with an empty measurement set whenever no tracklet begins at the corresponding time instant. However, this is inefficient, because many overlapping tracklets begin at different moments and many measurement sets on the time axis would be empty. In this work, we propose a temporal slice based measurement generation algorithm, which minimizes the number of overlapping tracklets across different measurement sets and the number of empty measurement sets: in chronological order, tracklets which overlap with each other in time are split off from the remaining tracklets to form the measurement set of the corresponding temporal slice, as illustrated in Fig. 3 and in the sketch below.

Fig. 3. Illustration of the temporal slice partition: the closed gray dashed lines indicate the tracklets in each partitioned temporal slice.
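A minimal sketch of the temporal-slice measurement generation: tracklets are scanned in chronological order and each slice greedily absorbs the tracklets whose time spans overlap with it, so that no slice is empty. The greedy grouping rule is an assumption about one reasonable realization, not necessarily the authors' exact algorithm.

```python
def partition_into_slices(tracklets):
    """Group tracklets into temporal slices of mutually overlapping intervals.

    tracklets: list of (start_frame, end_frame, id) tuples.
    Returns a list of slices; each slice is a list of tracklet ids and serves
    as one measurement set for the tracklet-level MHT.
    """
    tracklets = sorted(tracklets, key=lambda t: t[0])
    slices, current, current_end = [], [], None
    for start, end, tid in tracklets:
        if current and start > current_end:
            # No temporal overlap with the running slice: close it.
            slices.append(current)
            current, current_end = [], None
        current.append(tid)
        current_end = end if current_end is None else max(current_end, end)
    if current:
        slices.append(current)
    return slices

# Toy usage: the first two tracklets overlap in time, the third does not.
print(partition_into_slices([(0, 10, "T1"), (5, 20, "T2"), (30, 40, "T3")]))
# -> [['T1', 'T2'], ['T3']]
```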

5.5. Appearance modeling and updating strategy

Tracklets generated by the dual-threshold strategy, or by tracklet association at a lower level, have different lengths. For short tracklets covering only a few frames, association based on tracklet templates may be inaccurate, for two reasons: (1) the tracklet templates of short tracklets are not good representations of an object, since they often contain clutter or occlusion; (2) much of the high order temporal context information is lost when calculating the appearance similarity, which makes the association easily affected by noise. To address these problems, high order appearance information should be incorporated. From the view of dynamic incremental clustering, the appearance of one object within an adjacent temporal window can be considered to follow the same distribution. The salient templates $E^*_j$ can be used to represent the appearance of one object $\tau_j$ (or, equivalently, one cluster) in a temporal window near its end extremity, denoted as $\Lambda^{E}_{j} \subset \tau_j$. This representation gets rid of noisy features that deviate from the distribution. To incorporate the high order temporal context information into the MHT, we propose a salient template extraction and updating method based on an incremental clustering framework.

The set of salient templates $E^*_j$ of an object trajectory $\tau_j$ is initialized as the tracklet feature vectors in $E^{end}_{T_j}$. Let $H(k-1) = \{\Omega_1(k-1), \ldots, \Omega_{n_H}(k-1)\}$ be the hypothesis set of the MHT association in the previous temporal slice $k-1$, where $\Omega_a(k-1) = \{\tau^a_1, \ldots, \tau^a_n\}$ is one global hypothesis over all objects. When the set of measurements $O_k$ at temporal slice $k$ is received, the trajectory hypotheses $\tau_j$ in $H(k-1)$ can be partitioned into several track tree groups $\{G_{j_G}\}^{n_G}$, within which object trajectories can be associated with common measurements. Each track tree group $G_{j_G}$ can be seen as an incremental clustering problem when generating the multiple best hypotheses. The appearance difference is calculated using the set of salient templates $E^*_j$ and the associated tracklet appearance features $E^{bgn}_{T_i}$, as described in Section 5.3. Once the multiple best hypotheses are calculated, the set of salient templates corresponding to the generated track $\tau_{j_g} = \tau_j \cup \{T_i\}$ (the augmented cluster) is updated. The appearance coherent sequence corresponding to an object trajectory $\tau_j$, denoted as $\Lambda^{E_{T_i}}_{j}$, is partitioned from the sequence by the maximum coherence time interval $t_{coh}$ and the current temporal slice. For the generated track $\tau_{j_g}$, the appearance coherent sequence is $\Lambda^{E_{T_i}}_{j_g} = \Lambda^{E_{T_i}}_{j} \cup T_i$. When the time span of the possibly associated tracklet $T_i$ exceeds or reaches this bound, the salient templates are set to the tracklet templates of $T_j$.

In our implementation, the salient templates are extracted according to two assumptions:

• Cohesion: a feature template $h_l$ is visually representative if it is similar to most features of the other detected responses in the same appearance coherent sequence of trajectory $\tau_{j_g}$.
• Separation: if a feature template $h_l$ belongs to a generated trajectory $\tau_{j_g}$ whose associated tracklet $T_i$ could also be associated with another object trajectory $\tau_{j_c}$, this template describes trajectory $\tau_{j_g}$ well when it is more similar to most features of the other detected responses in the appearance coherent sequence of $\tau_{j_g}$ than to the features in the appearance coherent sequences of the other trajectories.

According to these two assumptions, we present a formulation that integrates both in a single framework. For the first assumption, the similarity with most features in the same appearance coherent sequence is measured using differences of pairwise features. The difference of a feature template $h_l$ to the features in the appearance coherent sequence $\Lambda^{E_{T_i}}_{j}$ can be calculated as

$$div\big(h_l, \Lambda^{E_{T_i}}_{j}\big) = \frac{1}{\big|\Lambda^{E_{T_i}}_{j}\big|} \sum_{\substack{l_r = 0,\ l_r \ne l \\ h_{l_r} \in \Lambda^{E_{T_i}}_{j}}}^{\big|\Lambda^{E_{T_i}}_{j}\big|} d_M(h_l, h_{l_r}), \qquad (9)$$

where $d_M(h_l, h_{l_r})$ is the function that computes the difference of two features using the learned matrix $M$. We define the cohesion of template $h_l$ with the appearance coherent sequence $\Lambda^{E_{T_i}}_{j_g}$ of trajectory $\tau_{j_g}$ as


$$C\big(h_l, \Lambda^{E_{T_i}}_{j_g}\big) = 1 - \mathrm{sigm}\Big(div\big(h_l, \Lambda^{E_{T_i}}_{j_g}\big)\Big). \qquad (10)$$

Here, $\mathrm{sigm}(\cdot)$ is a monotonically increasing function, which can be defined as the standard sigmoid function. Considering the second assumption, the other trajectories associated with the common tracklet $T_i$, called the competitive trajectories, are obtained from the previous hypothesis set $H(k-1)$ and the $n_g$ track tree groups $\{G_{j_g}\}$ to form the set $\Upsilon_j$. The set of the corresponding appearance coherent sequences is denoted $\Upsilon^{E_{T_i}}_{j}$. The relative difference is calculated as



$$div\big(h_l, \Lambda^{E_{T_i}}_{j_g}, \Upsilon^{E_{T_i}}_{j}\big) = \min\Big\{ div\big(h_l, \Lambda^{E_{T_i}}_{j_r}\big) \;\Big|\; \Lambda^{E_{T_i}}_{j_r} \in \Upsilon^{E_{T_i}}_{j} \Big\} - div\big(h_l, \Lambda^{E_{T_i}}_{j_g}\big). \qquad (11)$$

Accordingly, the separation of template hl is defined as



$$F\big(h_l, \Lambda^{E_{T_i}}_{j_g}, \Upsilon^{E_{T_i}}_{j}\big) = \mathrm{sigm}\Big(div\big(h_l, \Lambda^{E_{T_i}}_{j_g}, \Upsilon^{E_{T_i}}_{j}\big)\Big). \qquad (12)$$

If there are no appearance coherent sequences corresponding to other trajectories, that is, $|\Upsilon^{E_{T_i}}_{j}| = 0$, then $F(h_l, \Lambda^{E_{T_i}}_{j_g}, \Upsilon^{E_{T_i}}_{j})$ is set to 1. Based on these definitions of cohesion and separation, the formulation for extracting the salient template set $E^*_{j_g}$ is:

$$E^*_{j_g} = \arg\max_{E^*_{j_g}} \frac{1}{N_E} \sum_{h_l \in E^*_{j_g}} \phi(h_l), \qquad \phi(h_l) = \lambda\, C\big(h_l, \Lambda^{E_{T_i}}_{j_g}\big) + (1-\lambda)\, F\big(h_l, \Lambda^{E_{T_i}}_{j_g}, \Upsilon^{E_{T_i}}_{j}\big), \qquad (13)$$


where $N_E = |E^*_{j_g}|$ is the number of selected templates, $\lambda \in [0, 1]$ is a weighting parameter that trades off the two contributions, and $\phi(h_l)$ is the saliency score of template $h_l$. It is computationally intractable to solve Eq. (13) directly, since it is a nonlinear integer programming (NIP) problem. Alternatively, we resort to a greedy strategy, which is simple but efficient (a code-level sketch is given below). The salient template set $E^*_{j_g}$ of the generated trajectory $\tau_{j_g}$ is initialized as $\varnothing$. In each iteration, the template $h_l$ is chosen by solving $\arg\max_{h_l} \phi(h_l)$, and the salient template set is updated as $E^*_{j_g} = E^*_{j_g} \cup \{h_l\}$; the procedure repeats until $\Lambda^{E_{T_i}}_{j_g} = \varnothing$ or $|E^*_{j_g}| = n_E$.
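A sketch of the greedy salient-template selection of Eqs. (9)–(13): cohesion and separation are combined into a saliency score and the highest-scoring templates are picked until n_E templates are chosen. The standard logistic sigmoid and the identity fallback metric are illustrative assumptions.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def pairwise_div(h, seq, M):
    # Eq. (9): mean metric distance of template h to the other features in seq.
    d = [np.sqrt((h - g) @ M @ (h - g)) for g in seq if g is not h]
    return float(np.mean(d)) if d else 0.0

def saliency(h, own_seq, rival_seqs, M, lam=0.5):
    # Eq. (10): cohesion with the track's own appearance coherent sequence.
    cohesion = 1.0 - sigm(pairwise_div(h, own_seq, M))
    # Eqs. (11)-(12): separation against competing trajectories' sequences.
    if rival_seqs:
        rel = min(pairwise_div(h, r, M) for r in rival_seqs) \
              - pairwise_div(h, own_seq, M)
        separation = sigm(rel)
    else:
        separation = 1.0
    # Eq. (13): weighted saliency score.
    return lam * cohesion + (1.0 - lam) * separation

def select_salient_templates(own_seq, rival_seqs, n_E, M=None):
    """Greedy selection for Eq. (13): keep the n_E highest-saliency templates."""
    if M is None:
        M = np.eye(len(own_seq[0]))
    ranked = sorted(own_seq, reverse=True,
                    key=lambda h: saliency(h, own_seq, rival_seqs, M))
    return ranked[:n_E]
```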

Using the extracted set of salient templates $E^*_j$ of the object trajectory $\tau_j$, the high order context in a temporal window is taken into account in the appearance similarity calculation, which makes the MHT much more robust in associating ambiguous tracklets and decreases the number of identity switches. After data association on reliable tracklets, object trajectories are gradually generated by applying the same method, which takes the results generated at the lower level as input, several times (1–3 times).

6. Experimental results

In this section, we validate the effectiveness of our hierarchical MHT method and conduct a thorough comparison between our hierarchical MHT and several state-of-the-art multi-object tracking methods.

6.1. Datasets

To evaluate the performance of the hierarchical MHT, experiments are conducted on 4 typical public datasets: the PETS 2009 benchmark data, the TUD dataset, the CAVIAR Test Case Scenarios, and the ETH mobile pedestrian dataset. The first three datasets are captured from static camera views. The PETS2009 benchmark data contain videos taken in a garden in front of a brick building from different camera viewpoints with different crowd densities, at a moderate camera angle and in moderate range. The TUD dataset contains videos captured on streets at a very low camera angle in close range, with frequent full occlusions among pedestrians. The CAVIAR Test Case Scenarios are captured in a shopping center corridor by two fixed cameras from two different viewpoints. The ETH mobile pedestrian dataset is captured by a stereo pair of forward-looking cameras mounted on a moving children's stroller in a busy street scene; due to the low position of the cameras, total occlusions often occur in these videos, which increases the difficulty of object tracking. In our experiments, the PETS S2.L1 sequence of the PETS2009 benchmark data is used for algorithm evaluation. Meanwhile, the classic Stadtmitte sequence of the TUD dataset is included to facilitate comparison. In the CAVIAR Test Case Scenarios, the two most challenging sequences, TwoEnterShop1cor and WalkByShop1cor, which contain the most pedestrians, are selected; notice that we use the videos from the corridor view, which suffer from significant scale variation and frequent full occlusions. Moreover, we choose the Bahnhof and Sunny Day sequences of the ETH mobile pedestrian dataset for testing; because our work focuses on data association in multi-object tracking, only the sequences from the left camera are used. These 6 test videos from the 4 typical public datasets cover various scenes, viewpoints and ranges, partial occlusions, and background clutter, and are therefore representative and challenging.

6.2. Evaluation metrics

To comprehensively evaluate the data association method for multi-object tracking, the common CLEAR MOT metrics as well as the metrics defined in Yang et al. (2011) are adopted:

• MOTA (↑): multiple object tracking accuracy, which evaluates the overall quality of the generated trajectory identities by considering misses, false positives and identity switches.
• MOTP (↑): the ratio of the intersection area of the tracking result over the union of the bounding boxes of the ground truth trajectories.
• F.Neg (↓): the number of trajectories which are not tracked, accumulated and averaged over all frames.
• F.Pos (↓): the number of trajectories which are wrongly generated, accumulated and averaged over all frames.
• IDS (↓): id switches, the number of times that a tracked trajectory changes its matched id.
• GT: the number of trajectories in the ground truth.
• MT (↑): the ratio of mostly tracked trajectories, which are successfully tracked for more than 80% of their length.
• ML (↓): the ratio of mostly lost trajectories, which are successfully tracked for less than 20% of their length.
• PT: the ratio of partially tracked trajectories, 1 − MT − ML.
• FG (↓): fragments, the number of times that a ground truth trajectory is interrupted.

Among these metrics, the first 5 are the CLEAR MOT metrics and the others are defined in Yang et al. (2011). For metrics marked with ↑, higher scores indicate better results; for those marked with ↓, lower scores indicate better results. A small worked example of the MOTA computation is given below.
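As a small worked example of the headline metric, the sketch below computes MOTA from per-frame error counts in the standard CLEAR MOT way; the toy numbers are invented purely for illustration.

```python
def mota(misses, false_positives, id_switches, num_gt_objects):
    """CLEAR MOT accuracy: 1 - (sum of errors) / (sum of ground-truth objects).

    The first three arguments are per-frame lists of error counts;
    num_gt_objects is the per-frame number of ground-truth objects.
    """
    errors = sum(misses) + sum(false_positives) + sum(id_switches)
    return 1.0 - errors / float(sum(num_gt_objects))

# Toy example over 4 frames: 1 miss, 1 false positive, 1 id switch, 20 GT objects.
print(round(mota(misses=[0, 1, 0, 0],
                 false_positives=[0, 0, 1, 0],
                 id_switches=[0, 0, 0, 1],
                 num_gt_objects=[5, 5, 5, 5]), 3))   # -> 0.85
```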


6.3. Experiment setting and prior parameters

The hierarchical MHT with two levels is adopted to carry out data association in our experiments. The first level calculates associations among tracklets whose temporal intervals are smaller than 15 frames. The second level deals with longer time intervals; associations over up to 45 frames can be calculated. The motivation behind this design is to gradually reduce ambiguity with the hierarchical structure and to accumulate more information for further decisions.

The prior parameters of the proposed association hypothesis affect the initiation and termination of a trajectory, as well as its assignment to measurements. Specifically, the probability density of new objects $\lambda_N$, the trajectory detection probability $P_D$, the probability density of false alarms $\lambda_F$, and the association likelihood determine whether a measurement is the beginning of a new object or a false alarm. The detection probability $P_D$, the false alarm density $\lambda_F$, and the association likelihood determine whether a trajectory links or skips a measurement. The detection probability $P_D$ and the termination probability $P_\zeta$ determine whether a trajectory can continue when it misses measurements for a period of time. Besides, the pruning ratio affects the redundancy of the search in the state space; for example, with too little pruning, a new-object hypothesis for continuing measurements with high likelihood would be unnecessarily preserved. These prior parameters are set according to the requirements of each association level, based on experiments: the maximum time interval for association, the minimum number of detected measurements for the beginning of a new object, and the likelihood below which a measurement should not be associated to an existing object. The parameters are kept fixed for every test sequence in each association level.

6.4. Baseline methods

We compare the proposed MHMHT with 5 state-of-the-art methods. The first 2 baseline methods are the original MHT (Cox and Hingorani, 1996) using a simple appearance template, denoted MHT-A, and the method proposed in Ying et al. (2012) which incorporates an appearance and repulsion-inertia model, denoted MHT-AR. The other 3 state-of-the-art methods are: the improved version (Kuo and Nevatia, 2011) of the online learned discriminative appearance model (Kuo et al., 2010), which incorporates a target-specific appearance model for tracklet association, denoted PIRMPT; the discrete-continuous optimization for multi-target tracking (Andriyenko et al., 2012), denoted DCOMTT; and multi-target tracking on confidence maps using an improved particle filter with a Markov random field (Poiesi et al., 2013), denoted MT-TBD. Because the source code of these state-of-the-art methods is not publicly available, we use the results reported in the published papers. To illustrate the performance of the proposed method in more detail, we also report the results of the first-level association of the multi-cue fused hierarchical MHT, i.e., the MHT association on tracklets, denoted MTMHT.

6.5. Results and analyses

The quantitative evaluation of the overall performance on each dataset is shown in Table 1. It can be seen that on all of these datasets the proposed method performs robustly and achieves the best results.
Compared with the baseline methods, the proposed method achieves much higher MOTA and MT and much lower F.Neg, F.Pos, IDS, ML and FG, especially IDS and FG, which shows the effectiveness of the discriminative and adaptive appearance model, the multi-cue fusion and the hierarchical strategy in our formulation. It can also be seen that the MOTP of the proposed method is generally higher than that of the baseline methods in most cases, with only a few exceptions.

This metric indicates the precision of the position estimates of the trajectories in each frame, which is related to the detection accuracy and the association energy function of a concrete association process, and is less important than the other metrics for evaluating multi-object tracking.

Compared with PIRMPT, the proposed method achieves much better performance, with higher MT and lower ML, IDS and FG. This is because our method combines the merits of the hierarchical methods and MHT, and incorporates more sequence information via the adaptive appearance model: the association affinity is calculated by fusing the appearance similarity with the dynamic similarity using logistic regression based on the spatial-temporal constraints of the reliable tracklets, and high order temporal information is also integrated into the appearance model updating. Compared with DCOMTT and MT-TBD, our method also performs better, which demonstrates the effectiveness and robustness of the spatial-temporal context for appearance modeling.

The first-level association results of our method, denoted MTMHT, are better than the two baseline methods (MHT-A and MHT-AR) across the evaluation metrics, especially in the number of ID switches, which demonstrates that the proposed association affinity model, designed from a clustering perspective, facilitates data association. The IDS and F.Pos values of MTMHT are also much better than those of the other 3 state-of-the-art methods (PIRMPT, DCOMTT, and MT-TBD), and are even better than the full hierarchical MHT, which conducts one more level of association on the trajectories generated by MTMHT. However, the F.Neg value of MTMHT is larger than that of the baseline methods, because the purpose of MTMHT is to reduce the association ambiguity and provide reliable tracklets for the upper level association, which means this level favors fewer false positives.

Some exemplar frames of the tracking results are shown in Figs. 4–9, which illustrate the tracking results on the PETS S2.L1 sequence of the PETS 2009 benchmark data, the Stadtmitte sequence of the TUD dataset, the Bahnhof and Sunny Day sequences of the ETH mobile pedestrian dataset, and the TwoEnterShop1cor and WalkByShop1cor sequences of the CAVIAR dataset. The trajectories within a temporal window terminating at the illustrated sample frames are also drawn on the frame images. This display temporal window is set to 240 frames for the TwoEnterShop1cor sequence and to 60 frames for the other sequences. In each figure, the proposed method in the second row is compared with the baseline MHT-A in the first row, which clearly demonstrates the effectiveness of the proposed method, with far fewer ID switches and fragments.

6.6. Benchmark comparison

In addition to the above experiments, we also present our results on MOTChallenge,1 a recent multi-target tracking benchmark (Leal-Taixe et al., 2015), which contains 11 training and 11 testing sequences. Users tune their algorithms on the training sequences and then submit the results on the testing sequences to the evaluation server. This benchmark is of larger scale and includes more variation than the PETS benchmark, with substantially varying properties such as the number of targets present, camera motion and target density.
For performance evaluation, we follow the current evaluation protocols for visual multi-target tracking as in the above experiments, such as MOTA, MOTP, IDS, MT, ML, and track fragmentations (FM). Detailed descriptions of these metrics can be found in Milan et al. (2013) and Leal-Taixe et al. (2015).

1 http://motchallenge.net/results/2D_MOT_2015/.


Table 1. The performance comparison on representative datasets (MTMHT and MHMHT denote the proposed method).

| Dataset | Method | MOTA↑ | MOTP↑ | F.Neg↓ | F.Pos↓ | IDS↓ | GT | MT↑ | PT | ML↓ | FG↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PETS2009 (S2.L1) | MHT-A (Cox and Hingorani, 1996) | 0.7211 | 0.6382 | 0.0460 | 0.2299 | 4 | 19 | 89.5% | 10.5% | 0.0% | 12 |
| | MHT-AR (Ying et al., 2012) | 0.7326 | 0.6463 | 0.0432 | 0.2227 | 1 | 19 | 89.5% | 10.5% | 0.0% | 9 |
| | PIRMPT (Kuo and Nevatia, 2011) | – | – | – | – | 1 | 19 | 78.9% | 21.1% | 0.0% | 23 |
| | DCOMTT (Andriyenko et al., 2012) | 0.8930 | 0.5640 | – | – | – | – | – | – | – | – |
| | MTMHT | 0.9218 | 0.6579 | 0.0353 | 0.0419 | 0 | 19 | 94.7% | 5.3% | 0.0% | 3 |
| | MHMHT | 0.9259 | 0.6581 | 0.0293 | 0.0442 | 0 | 19 | 94.7% | 5.3% | 0.0% | 1 |
| TUDS (Stadtmitte) | MHT-A (Cox and Hingorani, 1996) | 0.7561 | 0.7793 | 0.2105 | 0.0226 | 5 | 10 | 60.0% | 30.0% | 10.0% | 8 |
| | MHT-AR (Ying et al., 2012) | 0.7976 | 0.8165 | 0.1825 | 0.0136 | 2 | 10 | 70.0% | 20.0% | 10.0% | 4 |
| | PIRMPT (Kuo and Nevatia, 2011) | – | – | – | – | 1 | 10 | 60.0% | 30.0% | 10.0% | 0 |
| | MTMHT | 0.8065 | 0.8490 | 0.1914 | 0.0012 | 0 | 10 | 50.0% | 40.0% | 10.0% | 1 |
| | MHMHT | 0.8272 | 0.8511 | 0.1701 | 0.0018 | 0 | 10 | 70.0% | 30.0% | 0.0% | 1 |
| ETHMS (Bahnhof & Sunny Day) | MHT-A (Cox and Hingorani, 1996) | 0.6518 | 0.7482 | 0.2716 | 0.0707 | 28 | 125 | 53.6% | 33.6% | 12.8% | 47 |
| | MHT-AR (Ying et al., 2012) | 0.6732 | 0.7465 | 0.2268 | 0.0922 | 20 | 125 | 56.8% | 32.8% | 10.4% | 39 |
| | PIRMPT (Kuo and Nevatia, 2011) | – | – | – | – | 11 | 125 | 58.4% | 33.6% | 8.0% | 23 |
| | MT-TBD (Poiesi et al., 2013) | – | – | – | – | 45 | 125 | 62.4% | 29.6% | 8.0% | 69 |
| | MTMHT | 0.6931 | 0.7469 | 0.2432 | 0.0583 | 3 | 125 | 56.8% | 34.4% | 8.8% | 40 |
| | MHMHT | 0.7237 | 0.7710 | 0.2095 | 0.0636 | 5 | 125 | 63.2% | 28.8% | 8.0% | 20 |
| CAVIAR (WalkByShop1cor & TwoEnterShop1cor) | MHT-A (Cox and Hingorani, 1996) | 0.8871 | 0.7089 | 0.0766 | 0.0333 | 7 | 20 | 80.0% | 20.0% | 0.0% | 35 |
| | MHT-AR (Ying et al., 2012) | 0.8927 | 0.7081 | 0.0809 | 0.0236 | 5 | 20 | 85.0% | 15.0% | 0.0% | 31 |
| | MTMHT | 0.9004 | 0.7192 | 0.0839 | 0.0138 | 0 | 20 | 80.0% | 20.0% | 0.0% | 28 |
| | MHMHT | 0.9214 | 0.6995 | 0.0347 | 0.0434 | 1 | 20 | 100.0% | 0.0% | 0.0% | 6 |

Fig. 4. The tracking results of the proposed method on S2.L1 sequence from the PETS09 dataset: the first row is the contrasting results obtained by baseline MHT-A.

Fig. 5. Tracking results of the proposed method on the Stadtmitte sequence from the TUD dataset: the first row is the contrasting results obtained by baseline MHT-A.

Fig. 6. Tracking results of the proposed method on the Bahnhof sequence from the ETH dataset: the first row shows the contrasting results obtained by the baseline MHT-A.


Fig. 7. Tracking results of the proposed method on the Sunny Day sequence from the ETH dataset: the first row shows the contrasting results obtained by the baseline MHT-A.

Fig. 8. Tracking results of the proposed method on the TwoEnterShop1cor sequence from the CAVIAR dataset: the first row shows the contrasting results obtained by the baseline MHT-A.

Fig. 9. Tracking results of the proposed method on the WalkByShop1cor sequence from the CAVIAR dataset: the first row shows the contrasting results obtained by the baseline MHT-A.

Table 2
Results on the 2D MOT 2015 Challenge. The proposed MHMHT achieves slightly better performance than several existing methods.

Method                               MOTA   MOTP   FAF   MT      ML      FP       FN       IDS     FM      Hz
LP_SSVM (Wang and C, 2015)           25.2   71.7   1.4   5.8%    53.0%   8,369    36,932   646     849     41.3
ELP (McLaughlin et al., 2015)        25.0   71.2   1.3   7.5%    43.8%   7,345    37,344   1,396   1,804   5.7
MHT-DAM (Kim et al., 2015)           32.4   71.8   1.6   16.0%   43.8%   9,064    32,060   435     826     0.7
MotiCon (Leal-Taixe et al., 2014)    23.1   70.9   1.8   4.7%    52.0%   10,404   35,844   1,018   1,061   1.4
SegTrack (Milan et al., 2015)        22.5   71.7   1.4   5.8%    63.9%   7,890    39,020   697     737     0.2
RMOT (Yoon et al., 2015)             18.6   69.6   2.2   5.3%    53.3%   12,473   36,835   684     1,282   7.9
TC_ODAL (Bae and Yoon, 2014)         15.1   70.5   2.2   3.2%    55.8%   12,970   38,538   637     1,716   1.7
JPDA (Rezatofighi et al., 2015)      23.8   68.2   1.1   5.0%    58.1%   6,373    40,084   365     869     32.6
MHMHT                                31.6   71.9   1.9   17.4%   42.3%   8,987    36,125   385     756     1.2

The baseline methods are several recently published trackers, including LP_SSVM (Wang and C, 2015), ELP (McLaughlin et al., 2015), MHT-DAM (Kim et al., 2015), MotiCon (Leal-Taixe et al., 2014), SegTrack (Milan et al., 2015), RMOT (Yoon et al., 2015), TC_ODAL (Bae and Yoon, 2014), and JPDA (Rezatofighi et al., 2015). Table 2 shows our performance along with that of the baseline trackers. The details are as follows. The proposed MHMHT achieves the second best MOTA at 31.6, which is comparable with the best tracker, MHT-DAM (Kim et al., 2015). On MOTP, our MHMHT achieves the best result, which shows that the trajectories are well aligned with the ground-truth trajectories in terms of the average distance between them. In addition, 17.4% of the tracks are mostly tracked, compared to the next competitor at 16.0%.


Our MHMHT has the best performance on ML, i.e., mostly lost targets. We also achieve the second lowest number of ID switches, far fewer than most of the other trackers. This shows the robustness of the proposed MHMHT over a large variety of videos under different conditions and confirms the power of metric learning and multi-cue fusion. Also note that, because the MOT benchmark is significantly more difficult than the PETS dataset, the strategy of metric learning and multi-cue fusion becomes more important to the performance.

7. Conclusions

In this paper, we focus on the association optimization framework and on affinity calculation using temporal context information for robust and discriminative appearance modeling in multi-object tracking. We formulate multi-object tracking as a dynamic incremental clustering process, and propose a multi-cue fused hierarchical MHT framework based on reliably generated tracklets, which conducts data association level by level to hierarchically reduce ambiguities while incorporating more temporal context information. The appearance similarity is defined by the distance between the features of tracklets and the salient templates of each track hypothesis, and is then fused with the dynamic similarity calculated according to the Kalman filter to obtain the association affinity via logistic regression. To enhance the discriminative power of the appearance model, the spatial-temporal relationships of tracklets in a sliding temporal window are used as constraints to learn the discriminative appearance metric, which measures the similarity between feature vectors and salient templates. Experimental results on four challenging datasets, compared with five state-of-the-art methods, demonstrate the effectiveness of the proposed method.

References

Andriyenko, A., Schindler, K., Roth, S., 2012. Discrete-continuous optimization for multi-target tracking. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, pp. 1926–1933.
Bae, S.-H., Yoon, K.-J., 2014. Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. CVPR.
Bar-Shalom, Y., Daum, F., Huang, J., 2009. The probabilistic data association filter. Control Syst. IEEE 29 (6), 82–100.
Bellet, A., Habrard, A., Sebban, M., 2013. A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709.
Ben Shitrit, H., Berclaz, J., Fleuret, F., Fua, P., 2011. Tracking multiple people under global appearance constraints. In: Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, pp. 137–144.
Benfold, B., Reid, I., 2011. Stable multi-target tracking in real-time surveillance video. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, pp. 3457–3464.
Breitenstein, M.D., Reichlin, F., Leibe, B., Koller-Meier, E., Van Gool, L., 2009. Robust tracking-by-detection using a detector confidence particle filter. In: Computer Vision, 2009 IEEE 12th International Conference on. IEEE, pp. 1515–1522.
Brendel, W., Amer, M., Todorovic, S., 2011. Multiobject tracking as maximum weight independent set. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, pp. 1273–1280.
Cox, I.J., Hingorani, S.L., 1996. An efficient implementation of Reid's multiple hypothesis tracking algorithm and its evaluation for the purpose of visual tracking. Pattern Anal. Mach. Intell. IEEE Trans. 18 (2), 138–150.
Davis, J.V., Kulis, B., Jain, P., Sra, S., Dhillon, I.S., 2007. Information-theoretic metric learning. In: Proceedings of the 24th International Conference on Machine Learning. ACM, pp. 209–216.
Digalakis, V., Rohlicek, J.R., Ostendorf, M., 1993. ML estimation of a stochastic linear system with the EM algorithm and its application to speech recognition. Speech Audio Process. IEEE Trans. 1 (4), 431–442.
Huang, C., Wu, B., Nevatia, R., 2008. Robust object tracking by hierarchical association of detection responses. In: Computer Vision–ECCV 2008. Springer, pp. 788–801.
Khan, Z., Balch, T., Dellaert, F., 2005. MCMC-based particle filtering for tracking a variable number of interacting targets. Pattern Anal. Mach. Intell. IEEE Trans. 27 (11), 1805–1819.
Kim, C., Li, F., Ciptadi, A., Rehg, J., 2015. Multiple hypothesis tracking revisited. ICCV.


Kuo, C.-H., Huang, C., Nevatia, R., 2010. Multi-target tracking by on-line learned discriminative appearance models. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, pp. 685–692.
Kuo, C.-H., Nevatia, R., 2011. How does person identity recognition help multi-person tracking? In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, pp. 1217–1224.
Leal-Taixe, L., Fenzi, M., Kuznetsova, A., Rosenhahn, B., Savarese, S., 2014. Learning an image-based motion context for multiple people tracking. CVPR.
Leal-Taixe, L., Milan, A., Reid, I., Roth, S., Schindler, K., 2015. MOTChallenge 2015: towards a benchmark for multi-target tracking. arXiv:1504.01942.
Li, Y., Huang, C., Nevatia, R., 2009. Learning to associate: HybridBoosted multi-target tracker for crowded scene. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, pp. 2953–2960.
McLaughlin, N., Rincon, J.M.D., Miller, P., 2015. Enhancing linear programming with motion modeling for multi-target tracking. WACV.
Milan, A., Leal-Taixe, L., Reid, I., Schindler, K., 2015. Joint tracking and segmentation of multiple targets. CVPR.
Milan, A., Roth, S., Schindler, K., 2014. Continuous energy minimization for multitarget tracking. IEEE TPAMI 36 (1), 58–72.
Milan, A., Schindler, K., Roth, S., 2013. Challenges of ground truth evaluation of multi-target tracking. CVPR Workshop.
Miller, M.L., Stone, H.S., Cox, I.J., 1997. Optimizing Murty's ranked assignment method. Aerosp. Electron. Syst. IEEE Trans. 33 (3), 851–862.
Oh, S., Russell, S., Sastry, S., 2009. Markov chain Monte Carlo data association for multi-target tracking. Autom. Control IEEE Trans. 54 (3), 481–497.
Pirsiavash, H., Ramanan, D., Fowlkes, C.C., 2011. Globally-optimal greedy algorithms for tracking a variable number of objects. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, pp. 1201–1208.
Poiesi, F., Mazzon, R., Cavallaro, A., 2013. Multi-target tracking on confidence maps: an application to people tracking. Comput. Vision Image Understand. 117 (10), 1257–1272.
Qu, W., Schonfeld, D., Mohamed, M., 2007. Real-time distributed multi-object tracking using multiple interactive trackers and a magnetic-inertia potential model. Multimedia IEEE Trans. 9 (3), 511–519.
Reid, D.B., 1979. An algorithm for tracking multiple targets. Autom. Control IEEE Trans. 24 (6), 843–854.
Rezatofighi, H., Milan, A., Zhang, Z., Shi, Q., Dick, A., Reid, I., 2015. Joint probabilistic data association revisited. ICCV.
Ryoo, M.S., Aggarwal, J.K., 2008. Observe-and-explain: a new approach for multiple hypotheses tracking of humans and objects. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, pp. 1–8.
Wang, S., C, F., 2015. Learning optimal parameters for multitarget tracking. BMVC.
Xing, J., Ai, H., Lao, S., 2009. Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, pp. 1200–1207.
Yang, B., Huang, C., Nevatia, R., 2011. Learning affinities and dependencies for multi-target tracking using a CRF model. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, pp. 1233–1240.
Yang, L., Jin, R., 2006. Distance Metric Learning: A Comprehensive Survey. Michigan State University.
Ying, L., Xu, C., Guo, W., 2012. Extended MHT algorithm for multiple object tracking. In: Proceedings of the 4th International Conference on Internet Multimedia Computing and Service. ACM, pp. 75–79.
Yoon, J., Yang, H., Lim, J., Yoon, K., 2015. Bayesian multiobject tracking using motion context from multiple objects. WACV.
Yu, Q., Medioni, G., 2009. Multiple-target tracking by spatiotemporal Monte Carlo Markov chain data association. Pattern Anal. Mach. Intell. IEEE Trans. 31 (12), 2196–2210.
Zhang, L., Li, Y., Nevatia, R., 2008. Global data association for multi-object tracking using network flows. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, pp. 1–8.
Zhang, T., Ghanem, B., Ahuja, N., 2012. Robust multi-object tracking via cross-domain contextual information for sports video analysis. In: International Conference on Acoustics, Speech and Signal Processing.
Zhang, T., Ghanem, B., Liu, S., Ahuja, N., 2013. Robust visual tracking via structured multi-task sparse learning. Int. J. Comput. Vision 101 (2), 367–383.
Zhang, T., Ghanem, B., Xu, C., Ahuja, N., 2013. Object tracking by occlusion detection via structured sparse learning. In: CVsports Workshop in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.
Zhang, T., Jia, C., Xu, C., Ma, Y., Ahuja, N., 2014. Partial occlusion handling for visual tracking via robust part matching. CVPR.
Zhang, T., Liu, S., Ahuja, N., Yang, M.-H., Ghanem, B., 2015. Robust visual tracking via consistent low-rank sparse learning. Int. J. Comput. Vision 111 (2), 171–190.
Zhang, T., Liu, S., Xu, C., Yan, S., Ghanem, B., Ahuja, N., Yang, M.-H., 2015. Structural sparse tracking. CVPR.
Zhang, T., Xu, C., 2014. Cross-domain multi-event tracking via co-PMHT. ACM Trans. Multimedia Comput. Commun. Appl. 10 (4), 31:1–31:19.
