Multi-marker tracking for large-scale X-ray stereo video data

Multi-marker tracking for large-scale X-ray stereo video data

Signal Processing: Image Communication ( ) – Contents lists available at ScienceDirect Signal Processing: Image Communication journal homepage: ww...

2MB Sizes 0 Downloads 6 Views

Signal Processing: Image Communication (

)



Contents lists available at ScienceDirect

Signal Processing: Image Communication journal homepage: www.elsevier.com/locate/image

Multi-marker tracking for large-scale X-ray stereo video data Xiaoyan Jiang a, *, Marcel Simon b , Yuan Yang c , Joachim Denzler b a b c

School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China Computer Vision Group, Friedrich Schiller University of Jena, Jena, Germany School of Instrument Science and Engineering, Southeast University, Nanjing, China

a r t i c l e Keywords: Multi-object tracking Data association Directed acyclic graph X-ray Big data

i n f o

a b s t r a c t Analyzing large amounts of video data is one of the key challenges in the trend towards big data. In the field of medical research, for example, to analyze infected cardiac movements, stereo X-ray sequences of beating animal hearts implanted with radiopaque markers are recorded. As manual annotation of exact marker positions in largescale recordings is time-consuming and infeasible, research on automatic tracking of multiple markers is a crucial task. We propose an efficient two-stage graph-based data association approach to tackle this problem. Difficulties of the sequences like 2D occlusions, low contrast, inhomogeneous movement, and inaccurate detections, are considered in the framework. Reconstructed 3D observations are modeled and connected using a weighted directed acyclic graph to obtain tracklets with high confidence via shortest path extraction. Afterwards, tracklets are linked into longer tracks by a tracklet graph in a similar manner while global features are adopted. The approach is validated on eight X-ray cardiac datasets of beating sheep hearts with various diseases. Outperforming standard tracking approaches, e.g. particle filter, the experimental results show a high accuracy comparable to human experts and efficiency in the meantime. The proposed approach is generic and can be directly applied to other video data as well. © 2017 Elsevier B.V. All rights reserved.

1. Introduction In recent years, the trend towards big data also reached the area of medical research. One crucial task in this field is the analysis of recorded large-scale video data using object tracking techniques. For example, medical researchers are interested in the effect of certain diseases on the movement of the heart. For this purpose, cardiac sequences are recorded by a biplanar X-ray acquisition system where implanted radiopaque markers in the heart reveal the movement during the cardiac cycles and relevant muscle activities. The accurate tracking of multiple markers is an important step towards understanding and treating heart-related diseases since it is capable of tracking distinct cardiac structures [1,2]. Fig. 1 shows an example of the system and sample images. However, tracking of the radiopaque markers is challenging in complex scenes due to the following difficulties: Occlusion Densely distributed objects (the markers) have interobject occlusions and object-obstacle occlusions. As a result, appearance features of the objects are not temporally consistent, which is even worse if objects are occluded for a long period of time as it also commonly occurs in our application. This is a typical problem if only one single camera is used [3]. In our work, we hence use a biplanar camera setup

and find matching markers in the two views by reconstructing the 3D positions. Confusion The second most difficult case for visual tracking is ‘‘confusion’’ as pointed out by [4]. It happens when targets have indistinguishable appearance or similar-looking regions in the background are close to the object of interest. For instance, markers moving through anatomical structures. In this case, decision ambiguity may lead to mistakes. Others, e.g. bad observation viewpoints, small resolution images [5], background clutter, varying number of objects and inhomogeneous movement also confuse the tracking system. Due to the improvement of detection algorithms [6,7] both in terms of accuracy and computational feasibility, tracking-by-detection (TbD) becomes a popular framework [8]. By continuously applying detectors to single frames, objects can be initialized automatically by the discrete detections. However, the output of detectors is sparse and unreliable and accuracy is far away from perfect since there are a lot of false positive and false negative detections (missing detections). Recursively searching for all possible associations of the detections is computational expensive and in many cases NP-hard. Therefore, efficient data association is crucial for the detection-based multi-object tracking and the goal is to robustly

* Corresponding author to: Xingzhen Lou, Room: 908, Longteng Road Nr. 333, Songjiang District, Shanghai 201620, China.

E-mail addresses: [email protected] (X. Jiang), [email protected] (M. Simon), [email protected] (Y. Yang), [email protected] (J. Denzler). http://dx.doi.org/10.1016/j.image.2017.06.009 Received 6 January 2017; Received in revised form 8 June 2017; Accepted 23 June 2017 Available online xxxx 0923-5965/© 2017 Elsevier B.V. All rights reserved.

Please cite this article in press as: X. Jiang, et al., Multi-marker tracking for large-scale X-ray stereo video data, Signal Processing: Image Communication (2017), http://dx.doi.org/10.1016/j.image.2017.06.009.

X. Jiang et al.

Signal Processing: Image Communication (

(a) Acquisition system.

)



(b) Example images.

Fig. 1. (a) Biplanar high-speed X-ray acquisition system (Neurostar® , Siemens AG), (b) X-ray images of left and right camera views showing a sheep heart with implanted markers.

link detections into tracks. According to literature, a multi-object TbD system generally consists of three units [3,9]: object detection, data association, and state estimation. In this paper, we focus on data association. We present a two-stage graph-based data association approach to analyze the stereoscopic X-ray recordings. The method follows a TbD paradigm using 3D detections reconstructed from 2D detections as input and 3D object trajectories as output. The problem of finding meaningful linking among 3D detections is expressed by a probabilistic formulation, which is mapped to the shortest path searching algorithm in two consecutive graphs. In details, first, confident tracklets (object trajectory fragments) are extracted from a 3D detection based graph using spatial and temporal restrictions. Second, complete trajectories are then generated by linking tracklets modeled by a tracklet graph in a similar way using features extracted from tracklets. The framework assembles 3D detections to form probabilistically reasonable 3D trajectories in complex scenes. Note that the proposed framework is independent from specific scenarios and can be directly applied to any kind of video data. The paper is based on our previous work presented in [10]. Our contributions are an extension of this work with respect to the following aspects: (1) a theoretical modeling of two-stage graph-based data association by probabilistic formulation; This formulation optimizes multiple object trajectories using 3D detections of the full sequence data which is robust to occlusions; (2) a thorough discussion of feature selections. Local and global features are adopted in two sequential graphs, which allows to adapt to different levels of semantic information in different stages. Associations are made with higher confidence as the weight function represents the distance between nodes more reasonably; (3) a comprehensive analysis of the approach by testing on various videos and comparison with state-of-the-art trackers. Additionally, the approach is easy to implement and has a much lower computational time compared to other approaches. In contrast to previous work, we also considered the key characteristics of big data, i.e. high volume, high velocity, and high variety [11] by keeping speed and a generic framework in mind.

However, the computational complexity grows exponentially with the number of targets. In contrast, Hungarian algorithm [15] can be adopted to find the best local assignment with a run time cubical in the number of targets [3]. [8] used Hungarian algorithm in the hierarchical data association approach. Afterwards, [3] presented a greedy approach which is assumed to be sufficient in practice [16] compared with Hungarian algorithm to match detections with particle filtering in pairwise frames. In the prediction stage, a constant velocity motion model is used to propagate the particles and each detection-tracker pair has a gating function to decide the association score. Additionally, the observation in a new time step is mainly updated by the associated detection, a boosted target-specific classifier. In [17], bipartite graphs were employed for tracking locally. Multi-object tracking is formulated as finding strong local minima of a continuous energy function presented in [18], which extends the Kalman filter [19]. Multiple judgments based on likelihood functions were embedded into the weight functions. Their approach outperforms the integer linear programming-based tracker presented in [20]. Compared with these local maximum or minimum searching scheme, recent approaches based on global optimization aim to model the state probability and investigate the best posterior probability over the entire sequence. This avoids locally optimal solutions and prevents association of non-stationary false positive detections. Sample approaches are network flow-based schemes [20–22] and graph-based approaches [9, 23,24]. These approaches model the multi-object tracking issue as a combinatorial optimization problem which can be solved in polynomial time [21,22,25]. Graph-based approaches model the observations with nodes in a graph, where costs are assigned to edges to denote supporting scores for associations between different nodes. Solutions for the multiobject tracking problem are then obtained by searching paths with minimum or maximum cost. The tracking performance of graph-based algorithms depends on the objective function, and more specifically on the assigned costs to the edges. Due to noisy detections, direct linking among detections often fails in challenging situations, such as occlusions. Sometimes multiple objects even merge and split during this occlusion, which causes great difficulties in maintaining the correct identities of the objects. This problem also occurs when objects are crossing each other. Since global optimization using detections over the whole sequence is time-consuming and challenging in deciding the suitable searching depth, splitting the complete object trajectories into meaningful fragments depending on the data itself and then linking fragments is more effective. By this way, connections with highly confident scores are formed in earlier stage. Depending the connected short fragments, longer tracks can be further linked. Consequently, a lot of research separated the tracking process into several stages [8,9,26,27] which aimed at producing more reliable tracklets (short track segments) in the first step. Afterwards in a second step, short tracklets are linked further into longer and reliable tracks. The Markov Chain Monte Carlo data association in [28] is an example for this. [29] proposed a three-layered data association scheme to achieve multi-ball tracking in monocular sequences. Ball candidates are first filtered by a sliding window fitted with a dynamic model at

2. Literature review Numerous works have presented solutions to tackle the problem of data association [12–14]. Classic data association approaches include Multiple Hypotheses Tracking (MHT) [12] and Joint Probability Data Association (JPDA) [13]. Possible origins of target measurements in MHT are accounted by a set of data association hypotheses. For every measurement at a certain time step, the probability for three possible cases are calculated. The cases are (i) the measurement belongs to a previous track, (ii) it represents a new target, (iii) it is a false detection. After several time steps, as many measurements as time steps are obtained and the probabilities of joint hypotheses are computed recursively. Due to complexity, the analysis is limited to only a few such steps [3]. In comparison with the calculation of posterior probabilities for single measurements, JPDA considers independence between objects and computes joint conditional probabilities for data association. 2

Please cite this article in press as: X. Jiang, et al., Multi-marker tracking for large-scale X-ray stereo video data, Signal Processing: Image Communication (2017), http://dx.doi.org/10.1016/j.image.2017.06.009.

X. Jiang et al.

Signal Processing: Image Communication (

)



3. The two-stage graph-based approach In this section, we present the details of the proposed framework with the assumption that we already have access to calibration data of the multi-camera system. The two cameras provide synchronized frame captures, for which 2D detections are obtained by applying object detectors to each frame. Resulting detections are used as input to the tracking approach. The goal of data association for multi-object tracking then is to acquire optimal trajectories from the set of detections. 3.1. The main idea As illustrated in Fig. 2, our proposed framework for multi-object tracking contains 3 modules: hypothesis generation, tracklet extraction, and tracklet linking. To overcome object occlusion and interactions, we work directly with 3D hypotheses reconstructed from 2D detections in each time step from every view using calibration data. In the first stage, we extract short and reliable trajectories (tracklets) from 3D data over the entire sequence. These tracklets are generated under restricted constraints and are assumed to be correct once formed. In the second stage, tracklets are linked into long tracks, where object trajectories are refined by taking difficult cases into account, e.g. occlusions. Both stages are modeled by individual graphs using local and global features in the weight functions, respectively.

Fig. 2. The three modules of our tracking framework illustrated in 2D plane: hypothesis generation, tracklet extraction, and tracklet linking. Column ordered black dots in the top right image represent 3D hypotheses along time from the hypothesis generation module. Next, data association is handled in two steps: tracklet extraction and tracklet linking. Hypotheses are restrictively connected into short track segments, namely tracklets as shown in the middle right figure. Afterwards, tracklets are linked to form complete tracks in the second stage, which is displayed in the bottom right image.

3.2. Problem formulation Suppose we have 𝐶 cameras, and each camera records a video with 𝑇 ∈ Z+ frames. An object detection in 2D image space is formulated as 𝐨𝑐𝑖 = {𝒑𝑖 , 𝒂𝑖 , 𝒔𝑖 , 𝑡𝑖 } representing the position 𝒑𝑖 = (𝑥, 𝑦), appearance 𝒂𝑖 , size 𝒔𝑖 = (𝒔𝑥 , 𝒔𝑦 ), and time index 𝑡𝑖 ≤ 𝑇 of the 𝑖th detection in camera 𝑐 ∈ {1, … , 𝐶}. Given the set of 2D detections of all cameras of the whole sequence, we obtain 3D detections in each frame using triangulation as in [17] constrained by epipolar geometry. In the data association step, a set of object trajectories for the entire sequence 1∶𝑇 is searched by linking elements in the huge 3D detection space 1∶𝑇 . In other words, 1∶𝑇 is a sequential combination of 3D detections. One step further and referring to [21], data association of multiobject tracking using multi-camera systems can be defined as a MAP problem maximizing the posteriori probability of 1∶𝑇 given the 3D hypotheses:

the candidate level. The selected candidates are used to construct a weighted directed graph and a shortest path algorithm is afterwards applied to extract a single trajectory at the tracklet level. Finally, multiple trajectories are generated by considering all possible tracklets at the path level. [26] utilized particle filters for local tracklet extraction and generated tracklets globally within a sliding window afterwards. However, most literature dealing with implanted radiopaque marker tracking for beating hearts dates back to the 1990s. [30] attempted first to track and digitize multiple radiopaque markers by image enhancement and segmentation techniques, such as contrast stretching, spatial band-pass and matched filtering, adaptive binary thresholding, etc. While many previous works deal with heart wall motion tracking [31,32], they often suffer from several disadvantages. For instance, a lot of manual interaction is required while still only sparse marker configurations can be processed, which is time-consuming and tedious as well. In addition, model fitting would fail for abnormal movements of the heart. Furthermore, markers were estimated only in 2D image space [33] which restricts the degree of freedom for the movements and interpretability of the obtained results. We follow the tracking-by-detection framework to obtain marker 3D trajectories. Unlike the work of [21], we employ the shortest path algorithm instead of finding min-cost flows in the constructed graph. Moreover, they use an explicit occlusion model to add occluded object hypothesis to solve occlusions, while we solve it by searching for shortest paths in a two-stage graph formulation. In [26], local tracklets are constructed by particle filters and global tracklets are formed within sliding windows considering occlusions. The linking between tracklets is solved by Hungarian algorithm, which is pair-wise optimal. In contrast, we link tracklets by searching for shortest paths in a tracklet graph, which gains a global solution. The paper is structured as follows: Section 3 presents the two-stage graph-based multi-object tracking approach in detail including discussions regarding feature usage. Experimental results on X-ray recordings including quantitative and qualitative evaluation will be presented in Section 4.3. Section 5 summarizes the work and discusses possible future plans.

∗ 1∶𝑇 = arg max 𝑃 (1∶𝑇 |1∶𝑇 ),

(1)

1∶𝑇 ⊆ 1∶𝑇 ,

(2)

1∶𝑇

∗ where 1∶𝑇 is the optimal set of object trajectories for the entire sequence. The search space, i.e. the 3D hypotheses 1∶𝑇 , is a subspace of 1∶𝑇 since 3D detections belonging to identical objects are integrated into single hypotheses as stated in [27]. ∗ is a subset of time assorted elements From Eq. (1) we can see that 1∶𝑇 in 1∶𝑇 . Calculating the probability for every possible combination of the elements is impractical because of the high computational complexity. Hence, we present the following global optimization strategy for finding the associations which result in the highest posteriori probability over the entire sequence. Let 𝛶 denote the set of tracklets which is searched in the space of 1∶𝑇 and 𝜏𝑘 ∈ 𝛶 is a tracklet. Similarly, 𝜉𝑢 ∈ 1∶𝑇 is defined as a complete track for one single target and composed of a set of tracklets. Referring to [26], finding the optimal trajectories for multiple targets can be written as a global optimization problem: ∗ 1∶𝑇 = arg max 𝑃 (1∶𝑇 |𝛶 ) ⋅ 𝑃 (𝛶 |1∶𝑇 ) 1∶𝑇 ,𝛶

= arg max 1∶𝑇 ,𝛶

∏ 𝑢

𝑃 (𝜉𝑢 |𝛶 )



𝑃 (𝜏𝑘 |1∶𝑇 ).

(3) (4)

𝑘

3

Please cite this article in press as: X. Jiang, et al., Multi-marker tracking for large-scale X-ray stereo video data, Signal Processing: Image Communication (2017), http://dx.doi.org/10.1016/j.image.2017.06.009.

X. Jiang et al.

Signal Processing: Image Communication (

(

𝛶

∗ 1∶𝑇 = arg max 1∶𝑇

𝑘



( ) 𝑃 𝜉𝑢 |𝛶 ∗ ,

(5)

= arg min (− log 𝑃𝑒={𝑣𝑠𝑜𝑢𝑟𝑐𝑒 ,𝐯} ) 𝜏𝑘

(6)



)

𝛥𝑡 (− log 𝑃𝑖,𝑗 ) + (− log 𝑃𝑒={𝐯,𝑣𝑠𝑖𝑛𝑘 } ) 𝑖,𝑗 ( ) ∑ = arg min 𝑤en + 𝑤𝑖,𝑗 + 𝑤ex .

𝑢

+

which combines finding an optimal set of tracklets 𝛶 ∗ and optimal combinations of resulting tracklets. Eq. (4) is correct under the assumption that each track and tracklet is independent, respectively. The two parts in the equation Eq. (4) can be optimized independently. In Eq. (4), the assumption of independence between individual tracklets and tracks is used, which can be motivated by independently moving objects. Although, the assumption of independence between tracklets is a limitation, we will see it is necessary for solving the tracking task. Second, we assume that tracklets are not overlapping spatially: 𝜏𝑘 ∩ 𝜏𝑙 = ∅, ∀𝑘∕ =𝑙, ∀𝜏𝑘 , 𝜏𝑙 ∈ 𝛶 .



∈ . As displayed in the landmark graph, framewise missing ( ) detections are handled by creating additional edges 𝑒𝑡,𝑡+𝛥𝑡 = 𝑣𝑡𝑖 , 𝑣𝑡+𝛥𝑡 𝑖,𝑗 𝑗 across several time steps, which allows skipping certain frames without appropriate detections. We denote 𝛥𝑡 as the allowed number of jumps which should be lower than a upper threshold 𝜑. Features used in the weight function will be discussed in Section 3.5. With this in mind, we are going to transform the MAP problem of extracting the optimal set of tracklets 𝛶 ∗ in Eq. (5) to the shortest paths problem conducted on landmark  in the following section. Since the computational time of finding the optimal data association for pairwise frames by Hungarian algorithm [15] is (𝑛4 ) or (𝑛3 ) with 𝑛 representing the number of detections, finding a globally optimal tracklet set over the entire sequence is very expensive and impractical [26], especially when considering the current big data trend. Hence, we approximate the objective by a greedy solution by searching the optimal fragments for individual tracklets sequentially. Referring to [21] and using negative logarithm transformation, the optimal solution 𝜏𝑘∗ for one single tracklet is realized as follows: ( ) 𝜏𝑘∗ = arg min − log 𝑃 (𝜏𝑘 |1∶𝑇 ) 𝜏𝑘 (

Fig. 3. An exemplary landmark graph  = (, , 𝑤) consists of 3 time steps. Vertices represent 3D hypotheses with corresponding time steps. Edges between two vertices connect succeeding frames (solid lines) and prevent connection within the same time. Furthermore, additional edges enable the link of two nodes with large temporal differences (dotted lines). The edges starting from the source node and ends at the sink node (dashed lines) allow the initiation and termination of tracklets at any time, respectively.

It can be approximated by the following two sequential steps: ∏ ( ) 𝛶 ∗ = arg max 𝑃 𝜏𝑘 |1∶𝑇 ,

𝑣𝑡𝑖 , 𝑣𝑡+1 𝑗

)

)

𝜏𝑘

(8)

𝑖,𝑗

Consequently, the weights are defined as: 𝛥𝑡 𝑤𝑖,𝑗 = − log 𝑃𝑖,𝑗 ,

(9)

𝑤en = − log 𝑃𝑒=(𝑣𝑠𝑜𝑢𝑟𝑐𝑒 ,𝐯) ,

(10)

𝑤ex = − log 𝑃𝑒=(𝐯,𝑣𝑠𝑖𝑛𝑘 ) ,

(11)

where 𝑤𝑖,𝑗 is the transition weight, 𝑤en is the entrance (departing) weight and 𝑤ex is the exit (terminating) weight for a path. From the above equations, we can see that weights relate to probabilities of the pair of nodes to be linked. The lower the weights are the more possible the two nodes can be connected in a path. As long as we set the weights appropriately to represent objects tracks, the shortest paths are capable of approaching the optimal solution of the tracklet set. We employ Dijkstra’s shortest path algorithm [35] to find the path  = (𝑣sink , 𝐯𝑡,0 , … , 𝐯𝑡,1 , 𝑣source ) with minimum weight, where 𝑡,0 and 𝑡,1 are time indexes of the vertices. Afterwards, additional shortest paths are found iteratively by applying Dijkstra’s algorithm iteratively. Due to the non-overlap constraint between tracklets of Eq. (7), we set the weights of edges that connect vertices in an already found path to infinity in order to avoid duplicated connections in multiple paths. Thus, single nodes can only belong to one shortest path at most and are impossible to belong to different shortest paths. Afterwards, tracklets are obtained by traversing each shortest path  from 𝑣sink to 𝑣source with an ending frame index and a starting frame index. For the convenience for the rest of this work, we denote 𝜏𝑘 = {𝜁𝑘 , 𝑡𝜏𝑘 ,0 , 𝑡𝜏𝑘 ,1 } ∈ 𝛶 as a tracklet starting at frame 𝑡𝜏𝑘 ,0 and ending at frame 𝑡𝜏𝑘 ,1 with a set of connected 3D hypotheses e.g. 3D locations 𝜁𝑘 = (𝑷 𝑡𝜏𝑘 ,0 , … , 𝑷 𝑡𝜏𝑘 ,1 ) in between.

(7)

Especially in 3D space, this is true as objects are impossible to occupy identical space in the 3D world. These reasonable constraints enable the generation of tracklets in a more restrictive manner. As a result, the tracklets have high confidence to be parts of a true target trajectory once extracted. From Eqs. (5) and (6), we can see that data association can be separated into two sequential MAP problems, namely tracklet extraction and tracklet linking as shown in Fig. 2. For the following sections, we will show that the two stages are equivalent to searching the paths with minimum weights in two individual graphs, respectively. Briefly, tracklets are extracted from a landmark graph without knowing the number of objects a priori [23], assuming entrance and exit regions [34], or any learning of optimal features or parameters in advance [34]. In the second stage, tracklets are linked in a tracklet graph to form complete tracks by incorporating global features into the weight function. 3.3. Tracklet extraction Given 3D detection hypotheses and the corresponding 2D image features, we construct a so-called landmark graph to extract tracklets in the first stage. We define a DAG as  = (, , 𝑤) and model its elements based on 1∶𝑇 . A vertex 𝑣𝑡𝑖 ∈  represents one 3D hypothesis 𝒉𝑡𝑖 = {𝑷𝑖𝑡 , 𝑨𝑡𝑖 , 𝑺𝑖𝑡 } ∈ 𝑡 where 𝑷𝑖𝑡 = (𝑋, 𝑌 , 𝑍) is the reconstructed 3D position, 𝑨𝑡𝑖 and 𝑺𝑖𝑡 are appearance features and sizes of the corresponding object in 2D image space from all visible views, respectively. In addition, we introduce a virtual source vertex 𝑣𝑠𝑜𝑢𝑟𝑐𝑒 along with a virtual sink vertex 𝑣𝑠𝑖𝑛𝑘 to the graph, with each being connected to all other nodes. All paths through the graph should initialize from and terminate at these two nodes. The obtained graph topology is shown in Fig. 3. Nodes of neighboring frames 𝑡, 𝑡 + 1 are connected by directed edges 𝑒𝑖,𝑗 =

3.4. Tracklet linking So far, we obtain tracklets by globally connecting 3D hypotheses with minimum weight. The next step is to fuse individual tracklets and link these fragments into longer tracks, i.e. complete object trajectories over the entire sequence. For this purpose, we formulate another directed acyclic graph, namely tracklet graph ′ = ( ′ ,  ′ , 𝑤′ ). Before describing the construction of the second-stage graph, we first illustrate the underlying theoretical idea. For the same reason as 4

Please cite this article in press as: X. Jiang, et al., Multi-marker tracking for large-scale X-ray stereo video data, Signal Processing: Image Communication (2017), http://dx.doi.org/10.1016/j.image.2017.06.009.

X. Jiang et al.

Signal Processing: Image Communication (

)



Fig. 4. An exemplary tracklet graph ′ = ( ′ ,  ′ , 𝑤′ ) topology contains vertices representing tracklets and edges between pairwise vertices with temporal consistent order painted in green dash lines. Additional edges (blue dotted lines) begin at the source node and end to the sink node allow any tracklet to initialize and terminate a track, respectively.

Section 3.3, we approximate Eq. (6) with finding the optimal association for individual tracks sequentially. One single optimal linking of tracklets can be formulated as: ( ) 𝜉𝑢∗ = arg min − log 𝑃 (𝜉𝑢 |𝛶 ) 𝜉𝑢 ( = arg min (− log 𝑃𝑒={𝑣′source ,𝐯′ } ) 𝜉𝑢

+



)

ensure tracking precision and set 𝜌 > 1 and 𝛼 > 1. In the following, we will describe the distances in detail. The spatial distance reveals pure geometry information in 3D space, e.g. Euclidean distance between the two represented 3D points: ( ) 𝑑spat 𝑣𝑡𝑖 , 𝑣𝑡+𝛥𝑡 = ‖𝑷𝑖𝑡 − 𝑷𝑗𝑡+𝛥𝑡 ‖2 . (14) 𝑗 The temporal distance, i.e. the frame difference of the two represented vertices, is indicated as ( ) 𝑑temp 𝑣𝑡𝑖 , 𝑣𝑡+𝛥𝑡 = 𝛥𝑡. (15) 𝑗

(12)

(− log 𝑃𝑘,𝑙 ) + (− log 𝑃𝑒={𝐯′ ,𝑣′ } ) sink 𝑘,𝑙 ( ) ∑ = arg min 𝑤′en + 𝑤′𝑘,𝑙 + 𝑤′ex . 𝜉𝑢

In addition, since each 3D hypothesis is reconstructed from a pair of 2D detections, the appearance features calculated in 2D image space also give hints of the possible connections and assist the tracking system to make more properly. Therefore, we use affinity ( decisions ) distance 𝑑Affi 𝑣𝑡𝑖 , 𝑣𝑡+𝛥𝑡 to explore how likely the connected vertices 𝑗 are identical objects seen on 2D images, which is composed by local features, such as appearance similarity and object size changes. We average the likelihoods of the objects computed on local features only in the views in which they are visible instead of all the cameras in the system. According to Eq. (9), affinity distance can be formulated as follows ( ) 𝑑Affi 𝑣𝑡𝑖 , 𝑣𝑡+𝛥𝑡 = 𝑑Appe + 𝑑Size (16) 𝑗 ( ) ( ) = − log(𝑃𝒂 ) + − log(𝑃𝒔 ) (17)

𝑘,𝑙

As we can see from Eq. (12), obtaining tracks as optimal combinations of tracklets is equivalent to searching of paths with minimum weight in the tracklet graph. In this proposed new graph, each vertex 𝑣′𝑘 ∈  ′ represents a tracklet 𝜏𝑘 . All possible pairs of vertices are connected by ( ) directed edges 𝑒′𝑘,𝑙 = 𝑣′𝑘 , 𝑣′𝑙 ∈  ′ . We again define a virtual source ′ vertex 𝑣𝑠𝑜𝑢𝑟𝑐𝑒 and a virtual sink vertex 𝑣′𝑠𝑜𝑢𝑟𝑐𝑒 , which serve as starting and stopping node for all paths in ′ , similar to those in Section 3.3. This results in a graph structure as shown in Fig. 4. Afterwards, we iteratively employ the Dijkstra’s shortest path algorithm [35] on ′ . Similar to before, we invalidate the connections of already linked vertices in the next iterations. Consequently, final trajectories can be obtained by traversing the found shortest paths. In case of missing segments in a track, the trajectory is linearly interpolated using adjacent tracklets.

where 𝑃𝒂 and 𝑃𝒔 are averaging appearance and size similarity probabilities in all pairs of views, as formula ( ) ∑ 𝑃𝒂 = 𝑠𝑖𝑚(𝒂𝑖 , 𝒂𝑗 ) ∕𝐶 ′ , (18)

3.5. Feature selection A crucial part of the presented approach is the selection of useful features for calculating the weights in  and ′ . In the following, local features for the generation of tracklets and global features extracted from tracklets for tracklet linking will be discussed. Instead of calculating weights from probabilities [8,21], we rewrite the weights in Eqs. (8) and (12) in terms of the landmark and tracklet graph by distances computed from local and global features, respectively. Local Features In , a finite weight is assigned to the edge 𝑒𝑡,𝑡+𝛥𝑡 = 𝑖,𝑗 (𝑣𝑡𝑖 , 𝑣𝑡+𝛥𝑡 ), 𝛥𝑡 ∈ [1, 𝜑] as follows: 𝑗 𝑡 𝑡+𝛥𝑡 𝑤𝛥𝑡 𝑖,𝑗 = 𝑤(𝑣𝑖 , 𝑣𝑗 ( ) ) ( ) = 𝛼 ⋅ 𝑑spat 𝑣𝑡𝑖 , 𝑣𝑡+𝛥𝑡 ⋅ 𝑑temp 𝑣𝑡𝑖 , 𝑣𝑡+𝛥𝑡 𝑗 𝑗 ( ) ( ) ⋅ 𝑑Affi 𝑣𝑡𝑖 , 𝑣𝑡+𝛥𝑡 ⋅ 𝑑Epi 𝑣𝑡𝑖 , 𝑣𝑡+𝛥𝑡 , 𝑗 𝑗

( 𝑃𝒔 =

𝑐′

∑ min(𝒔𝑖 , 𝒔𝑗 ) 𝑐′

max(𝒔𝑖 , 𝒔𝑗 )

) ∕𝐶 ′ ,

(19)

where 𝑐 ′ ∈ {1, … , 𝐶 ′ } are the views where both object 𝑖 and 𝑗 are visible. The appearance feature 𝒂 is a RGB histogram of the object. We employ Bhattacharyya distance 𝑠𝑖𝑚 (⋅, ⋅) with similarity values in [0, 1] [36] to evaluate the appearance similarity in that camera 𝑐 ′ . The more similar of the objects are, the higher is score. In Eq. (19), the functions min (⋅, ⋅) and max (⋅, ⋅) computes the minimum and maximum sizes between the two. Each size is the product of the width and height from the corresponding 2D detection. Moreover, calibration error is considered by including the distance to corresponding epipolar lines ( ) (20) 𝑑Epi 𝑣𝑡𝑖 , 𝑣𝑡+𝛥𝑡 = 𝑒𝑑𝑖𝑙 ⋅ 𝑒𝑑𝑖𝑟 ⋅ 𝑒𝑑𝑗𝑙 ⋅ 𝑒𝑑𝑗𝑟 . 𝑗

(13)

which is proportional to the product of spatial distance 𝑑spat (⋅), temporal distance 𝑑temp (⋅), affinity distance 𝑑Affi (⋅) and epipolar distance 𝑑Epi (⋅) of the two vertices. Note that using the weighted sum of different distances in the equation is feasible in the experiments. Eq. (13) only holds when the spatial distance( of the two ) considered vertices is lower than a certain threshold 𝑑spat 𝑣𝑡𝑖 , 𝑣𝑡+𝛥𝑡 ∈ [0, 𝑇 ℎspat ]. Otherwise the weight 𝑗

This distance represents how precisely the reconstruction from 2D detections is possible by multiplying the distances of the detections of the corresponding left and right epipolar lines. Since we intend to extract tracklets with high confidence, the paths are terminated when the objects are not detected for a long time in order to prevent wrong identity linking. The departing and ending edge

𝛥𝑡 represents the penalty cost that is set to infinity, 𝑤𝛥𝑡 𝑖,𝑗 = ∞. 𝛼 = 𝜌 gradually increases the more frames are skipped. We set this parameter to encourage the possible connection of nodes with less frame gaps to

5

Please cite this article in press as: X. Jiang, et al., Multi-marker tracking for large-scale X-ray stereo video data, Signal Processing: Image Communication (2017), http://dx.doi.org/10.1016/j.image.2017.06.009.

X. Jiang et al.

Signal Processing: Image Communication (

weights should be able to encourage high confident connections, which are defined as follows: 𝑤source (𝑣𝑡𝑖 ) = 𝑡 ⋅ 𝑑penalty , 𝑤sink (𝑣𝑡𝑖 )

= (𝑇 − 𝑡) ⋅ 𝑑penalty .

(21) (22)

4.1. Datasets The eight datasets listed in Table 1 were acquired in two years on different dates using distinct operated sheep to have an objective medical analysis.1 The videos were recorded by a biplanar X-ray system, where two X-ray cameras observe the complete heart to make sure that all the markers can be captured in both views. The angles between the two cameras varied in the different datasets. To reveal the importance of medical analysis in different parts of the sheep, different numbers of radiopaque markers were surgically implanted into relevant anatomical parts, including muscular tissue of the right and left ventricle, the tricuspid and mitral annulus and leaflets of a sheep heart. The markers have different locomotion in terms of speed and direction. After a heart-related disease attack, their acceleration and movement becomes unpredictable. The markers have a diameter of 2 mm and a mixed linear and spherical or cylindrical shape. The spatial resolution of the markers in the recorded images is in the order of a few pixels. Note that the appearances alone cannot provide sufficient hints to distinguish different markers in image space. Calibration To acquire 3D hypotheses, the biplanar X-ray system needs to be calibrated. For the calibration of the camera setup, a custombuilt 140 mm × 60 mm × 0.5 mm radio-opaque steel plate containing 18 circular holes of diameter 5 mm as shown in Fig. 5 was used. The holes in this calibration pattern are circles in X-ray images and can automatically be detected by blob detection. The arrangement of different numbers of holes along different lines on the calibration pattern, i.e. 5, 7, 6, allows for a unique and fully automated identification of these points across both camera views. Given the corresponding points in the world coordinate system, we performed camera calibration based on [38]. In the following, we take the most difficult dataset si from Table 1 as an example to present detailed experimental results. Regarding si, the angle between the two cameras is 115◦ , while the distance between each camera and the heart was about 825 mm. We yielded an average back projection error of 1.3 pixels with the computed calibration data. Detection and Reconstruction Concerning detection of markers, due to the high resolution of the image and small size of the marker, sliding window based detectors are computationally too expensive. Furthermore, description of the markers could improve the performance. However, the appearance and structure of the markers in X-ray videos are too simple to obtain useful information. Except traditional methods, convolutional neural network based detectors fail due to the fact that markers have very small sizes. Hence, we extract detections in a simple and efficient way. Original images were first averaged to filter the noise and enhance the regions of real markers. Afterwards, 2D detections were extracted by difference of box filter (DOB) [39] and blob detection. 3D marker detection hypotheses are obtained by triangulating every detection pair of the two camera views constrained by epipolar geometry. We keep a low computational cost while the detection performance is sufficient for the application, which is crucial for big data analysis. Ground Truth Acquisition To assess the performance of the tracker, a comparison of the results with ground-truth data annotated by human medical experts is performed. Evaluation of ID performance, that is MOTA, is done for all our datasets. Most datasets contain ∼3000 frames, ∼40 markers in each view. Assuming one person can label one marker in

(23) (24)

This finite weight is only assigned under the condition that the two represented vertices satisfy a temporal consistency constraint 𝑡𝜏𝑙,0 −𝑡𝜏𝑘,1 > 0. Other edges without temporal consistency are assigned infinite weights. Eq. (24) shows that finite weights are proportional to the spatial distance of the two represented tracklets e.g. the Euclidean distance between their ( ′ ′) 𝑡𝜏 ,0 𝑡𝜏 ,1 ′ head and tail points 𝑑spat 𝑣𝑘 , 𝑣𝑙 = ‖𝜁𝑙 𝑙 −𝜁𝑘 𝑘 ‖2 , the temporal distance ( ′ ′) ′ 𝑑temp 𝑣𝑘 , 𝑣𝑙 = 𝑡𝜏𝑙,0 − 𝑡𝜏𝑘,1 between the two represented tracklets and the ( ) ( ) ( ′ ′) ′ motion distance extracted from them 𝑑Motion 𝑣𝑘 , 𝑣𝑙 = 𝑚 𝜏𝑘 − 𝑚 𝜏𝑙 in which 𝑚 (⋅) is the motion function conducting on a tracklet. We calculate the motion of a tracklet 𝑘 and 𝑙 by averaging the movements of the last 𝑁 time steps or the first 𝑁 time steps, respectively. Infinite weight is also assigned to the edges whose temporal differences is above a maximum threshold 𝜃, since the connection of tracklets which are temporally too far away from each other is meaningless in practice. Additionally, the edge whose spatial distance of the two connected vertices exceeds a maximal threshold 𝛽 is assigned infinite weight as well. Similarly, the entrance weight and exit weight for 𝑣′𝑘 are: ′ 𝑤′en = 𝑡𝜏𝑘 ,0 ⋅ 𝑑penalty ,

(25)

𝑤′ex

(26)

= (𝑇

′ − 𝑡𝜏𝑘 ,1 ) ⋅ 𝑑penalty ,



tracking evaluation. There are two important metrics included in CLEAR MOT, namely multiple object tracking precision (MOTP) and multiple object tracking accuracy (MOTA). MOTP measures the precision of the object locations independent of correct identity matches. In contrast, MOTA evaluates object misses, identity mismatches, and false positive tracks.

From Eqs. (21) and (22), we see that the weights connecting 𝑣source and 𝑣sink are proportional to the number of skipped frames. The common penalty 𝑑penalty should be surpass the largest weight to suppress connections of frames with larger time gaps. To sum up, tracklet generation as the first stage of data association aims at extracting track fragments with high confidence, for which local features are used as constraint. Once the detections are connected as tracklets, they are assumed to be correct. For the next section, global features extracted from tracklets are employed in the weight function in the tracklet graph. Global Features The weights assigned to edges in the tracklet graph ′ indicate scores of how well the two corresponding tracklets fit together. Obviously, not all the pairs of vertices in the graph can be connected by edges with finite weights. We assume that tracklets of identical objects have no intersections both temporally and spatially in 3D space, thus, only temporally consistent and spatially close vertices may have directed edges. ( ) The finite weight 𝑤′𝑘,𝑙 of a possibly associated pair of vertices 𝑣′𝑘 , 𝑣′𝑙 is defined as: 𝑤′𝑘,𝑙 = 𝑤′ (𝑣′𝑘 , 𝑣′𝑙 ) ( ′ ′) ′ ( ) ′ ( ′ ′) ′ = 𝑑spat 𝑣𝑘 , 𝑣𝑙 ⋅ 𝑑temp 𝑣′𝑘 , 𝑣′𝑙 ⋅ 𝑑Motion 𝑣𝑘 , 𝑣𝑙 .

)

′ where 𝑑penalty

is a manually set penalty per frame to encourage linking of confident tracklets, hence it should be larger than the maximal weight in ′ . This means, that the tracklets departing from the first frame or terminating at the last frame of the sequence have a lower entrance/exit weight, and thus are more likely to belong to a shortest path in the graph. 4. Experiments In order to evaluate the practicability and performance of the presented approach, we perform experiments on real-world large-scale X-ray videos of beating sheep hearts. For details and the acquisition of the data, we would like to refer to the previous paper [10] and the medical analysis paper [1]. The task is to track multiple markers in the synchronized videos of two cameras and obtain 3D trajectories as result. Evaluation Metric We employ the widely used CLEAR MOT [37], which has become the de facto standard in the field of multi-object

1 We want to thank Dr. Wolfgang Bothe (Department of Cardiothoracic Surgery, University Hospital Jena) for the cardiac surgery and Rommy Petersohn (Institute of Systematic Zoology and Evolutionary Biology with Phyletic Museum, Friedrich-Schiller University of Jena) for recording the dataset.

6

Please cite this article in press as: X. Jiang, et al., Multi-marker tracking for large-scale X-ray stereo video data, Signal Processing: Image Communication (2017), http://dx.doi.org/10.1016/j.image.2017.06.009.

X. Jiang et al.

Signal Processing: Image Communication (

)



Table 1 Sheep datasets: X-ray recordings of identical sheep were recorded on different dates. The number of markers stayed in the heart is equal or smaller than the number of markers originally implanted. For the convenience of the rest of the section, we name the dataset on 2012–12–12* as si.

Heart disease style Surgery Date X-ray recording 1 X-ray recording 2 X-ray recording 3 Implanted marker Reserved marker Calibrated Frames per camera Spatial resolution Temporal resolution

Sheep#1

Sheep#2

Sheep#3

Sheep#4

Acute 12/12/2012 12/12/2012* – – 58 42 Yes 3001 1024 × 768 500 Hz

Chronic 10/07/2013 10/10/2013 10/21/2013 01/27/2014 27 25 Yes 3960/1726 1024 × 768 500 Hz

Chronic 11/07/2013 11/11/2013 12/03/2013 01/21/2014 35 31 Yes ∼1751 1024 × 768 500 Hz

Acute 02/19/2014 02/19/2014 – – 27 27 Yes 2676 1024 × 768 500 Hz

(a) Custom-built calibration pattern.

(b) Hole detection.

Fig. 5. Picture (a) shows the custom-built calibration pattern that lies under the Biplanar X-ray Acquisition system. Picture (b) presents the detected and numbered circles in an X-ray image of the calibration pattern.

one frame within one second, the annotation of a complete video needs ∼70 h. Since manually locating all markers in each frame of every video is extremely time-consuming, we did not label all the markers for all the videos listed in Table 1. We believe that MOTP evaluation on si provides a fair and meaningful precision measurement of the approach. This is because precision assessment of tracking-by-detection methods is more related to detection precision. Moreover, si has the highest number of densely distributed markers compared to all the other datasets and has the most varying marker movement types. Hence, the precision evaluation is only done in si. Throughout the experiments presented in the following, we will show that it is also not necessary to label all these datasets for our application. The total length of the single sequence either from the left camera or from the right camera of si is 3001 frames and covers ∼7 times complete cardiac cycles. After labeling 2D positions of each marker in both views, correspondence of markers in two views were recognized by human medical experts. Consequently, each marker has 3D ground truth positions reconstructed from 2D labeled positions with unique identities, which is available for every 10th frame. The 3D positions instead of 2D locations of the markers are used for evaluation, which helps to have a more intuitive precision error in real-world units. Although there are errors in calibration, which also appear in the tracking process, we can at least provide the most reliable ground-truth possible for the evaluation.

features and achieves state-of-the-art performance. We use the source code and default values provided at http://rehg.org/mht/, which is available for tracking objects in 2D images and ground plane. The goal is to analyze the following questions: (i) Is the proposed approach capable of dealing with the difficulties in the large-scale X-ray videos (e.g. 2D marker occlusions, nondistinguishable marker appearance, and inhomogeneous marker movement)? (ii) How is the performance of the individual stages? How large is the improvement due to the second stage? (iii) Are the trajectories obtained by our approach valuable or even sufficient for medical analysis? If yes, how is the performance compared to human experts? To ensure a fair comparison, we used identical 3D detections as input to most of the methods to be compared. Due to the medical application at hand, the property of marker identity consistency is to be favored over the precise tracking locations. As a result, we only keep final tracks whose lengths are longer than 50% of the total sequence length since short tracks or id switched ones are practically meaningless even if they would improve evaluation scores. For particle filter, we used 100 particles for each initial detection. The update of the motion of the particles is done using the associated 3D detections. We use 2D detections as input for the MHT-DAM tracker and evaluate the MOTA in 2D space. MOTP is not evaluated for MHT-DAM because 2D precision is without meaning in the medical application at hand. For the two-stage graph-based tracking, we extracted 1000 tracklets in the first stage. In the second stage, 100 tracks were computed. In order to access the quality of the approach in each step, evaluation of the first step, i.e. shortest path algorithm, will be also presented. In all experiments, the maximal allowed number of jumps in the first stage was set to 𝜑 = 3 and the 2D spatial threshold was 𝑇 ℎspat = 15 pixels.

4.2. Qualitative & quantitative analysis In this section, we present a performance comparison between the proposed approach and three other methods, namely 3D particle filtering, shortest path, and 2D multiple hypothesis tracking revisited (MHT-DAM) [40] applied to one camera view. MHT-DAM revisits the classic MHT method by incorporating contemporary appearance 7

Please cite this article in press as: X. Jiang, et al., Multi-marker tracking for large-scale X-ray stereo video data, Signal Processing: Image Communication (2017), http://dx.doi.org/10.1016/j.image.2017.06.009.

X. Jiang et al.

Signal Processing: Image Communication (

)



Table 2 Multiple object tracking accuracy (MOTA) for the sheep heart dataset si of different methods: MHT-DAM [40], 3D extension of the particle filter [41], 3D shortest path algorithm, and the two-stage graph-based 3D tracking approach. Method

Properties

MOTA

Miss rate

False positive rate

Mismatch rate

Run time (min)

MHT-DAM [40] Particle filter Shortest path Two-stage graph

2D, online 3D, online 3D, offline 3D, offline

58.5% 50.88% 77.29% 82.08%

5.32% 32.35% 13.92% 13.08%

34.89% 16.22% 7.14% 4.76%

0.53% 0.54% 1.64% 0.08%

632 150 3 4

Table 3 MOTA for all the datasets by our two-stage graph-based 3D tracking approach. Date

MOTA

Miss rate

False positive rate

Mismatch rate

12/12/2012 10/10/2013 10/21/2013 01/27/2014 11/11/2013 12/03/2013 01/21/2014 02/19/2014

82.08% 92% 60% 68% 43.38% 61.29% 87.09% 99.63%

13.08% 0% 12% 16% 25.81% 12.9% 9.68% 0.37%

4.76% 8% 28% 16% 25.81% 25.81% 3.23% 0%

0.08% 0% 0% 0% 0% 0% 0% 0%

Furthermore, we compare the tracking precision of different methods. The proposed approach had a MOTP of 91%. Fig. 7 displays the median Euclidean error for each tracked marker of 3D particle filter based tracking and the two-stage graph-based 3D tracking approach. From the figure, we can state that the average precision error of our approach is approximately 0.2 mm, with only minor differences between different markers. Compared to the actual marker size, this precision error satisfies the application since locating identical markers by different people is noisy as well as discussed in the next part. A notable exception is marker 16, which is located at the end of a cardiac valve and suffers from the very abrupt movements. The particle filter based 3D tracking approach gives comparable results on many markers. However, in six cases the two-stage graph-based approach is able to reliably track markers which are lost using particle filter. Comparison to Human Experts Since individual persons have an inevitable annotation error, ground-truth data for four representative markers was annotated by more than one person to observe the precision of the tracker. Fig. 8 presents the results. In three out of four cases, our approach is competitive with the precision of human experts. Only for marker 32, the tracking positions are less precise since it is partly occluded by the anatomical structure surrounding in one view. All in all, the results are very promising and indeed comparable to human experts. Qualitative Evaluation The 3D surface of the heart reconstructed for one cardiac cycle based on tracked positions is shown in Fig. 9. Anatomical locations of the tracked markers are identified in distinct cardiac videos. The tracked trajectories were used to analyze heart movements from a medical point of view in [1].

Quantitative Evaluation Table 2 lists the quantitative results of MOTA using different methods: MHT-DAM [40], extended particle filtering in 3D (3D state vectors and updates instead of the 2D updates of [41]), shortest path algorithm, and the presented two-stage graphbased approach. It is shown in the table that MHT-DAM is used as a baseline, where 2D detections are used as input. Thus, MOTA is evaluated using 2D ground truth. From the table, we can see that the missing rate of online particle filter tracking is 32.35%, which influences the MOTA the most. This is partially caused by the rejection of useless short tracks to some extent, since particle filter loses targets easily when tracking in-homogeneously moving markers. Due to continuous false positive detections in anatomical structures in both views, the false positive rate of particle filter based tracking amounts to 16.22%. The shortest path algorithm obtained a moderate performance among the three. The proposed two-stage graph-based approach achieves the highest MOTA with a miss rate comparable to the shortest path algorithm. In addition, it reaches the lowest false positive and mismatch rate. This behavior can be explained by the fact that the tracklet association step of our approach favors long and reliable tracks at the expense of a relatively high miss rate. Short and unreliable tracks are discarded resulting in full misses, i.e. markers not tracked throughout the entire sequence. In addition, since the two-stage graph-based approach is a global optimization method using the complete data of a sequence, nontemporally consistent false positive detections are rejected automatically. The low mismatch rate benefits from this property as well. As expected, the mismatch rate is low for all the three methods which is important for a sufficient quality for medical analysis. Although only ground-truth data for si was available, we can provide MOTA for all the datasets in Table 1 by manually checking marker correspondences. Table 3 presents a comparison of MOTA for all the datasets. From the table, we can find a common trait that mismatch rates are almost zero for all the datasets. This is due to the same reason as before that it is preferable to have no tracks instead of id switched tracks for the markers, for which the miss rate is a sacrifice. Another obvious phenomenon shown in the table is that false positive rates are relatively high in several datasets, for example 11/11/2013 and 12/03/2013. This is due to the unrelated moving parts of a pace maker implanted by the doctors, as shown in Fig. 6. Once these moving points are recognized as markers, they are tracked over the whole sequence, which definitely increases the false positive rates. However, these false positive tracks can be recognized and removed by the doctors easily. Importantly, the results for the dataset from 02/19/2014 are nearly perfect, which is mainly due to the high quality of the images resulting in better 2D detections. Hence the MOTA evaluation of these datasets proves the high quality of our method regarding identity consistency.

4.3. Runtime performance The runtime performance is shown in Table 2. The entire system was implemented in C++. Our method has a complexity of (𝑘 ⋅ (𝑛 log 𝑛 + 𝑚)), where 𝑛 is the number of nodes, 𝑚 is the number of edges, and 𝑘 the number paths to be found in the respective graph [22]. For the extraction of 1000 tracklets from a 1.16 × 106 node 3D detection graph, our algorithm needs approximately 3 min, while the extraction of 100 final tracks from the tracklet graph takes about 31 s. In contrast, particle filter in 3D needs around 150 min for the whole process using the same 3D hypotheses as input. MHT-DAM used 2D detections as input and it took 632 min to run in MATLAB for one sequence of 3001 frames. All measurements were conducted on a standard desktop computer with an Intel® Core™ i5-4590 CPU (3.3 GHz). So far, we can state that the proposed two-stage graph-based approach fulfills the practical application where densely distributed objects have inhomogeneous movement and indistinguishable appearance. The fact that it can be used fully automatic supports above argumentation as well. The comparison to the state-of-the-art 2D or ground-plane tracker MHT-DAM shows that our approach is orders of magnitude faster and more accurate. Furthermore, the proposed approach gives results comparable to human experts and outperforms the traditional tracking method i.e. particle filter based 3D tracking in terms of both precision and accuracy. 5. Conclusion In this paper, we present a two-stage graph-based data association approach for the task of tracking implanted markers in biplanar X-ray 8

Please cite this article in press as: X. Jiang, et al., Multi-marker tracking for large-scale X-ray stereo video data, Signal Processing: Image Communication (2017), http://dx.doi.org/10.1016/j.image.2017.06.009.

X. Jiang et al.

Signal Processing: Image Communication (

)



Fig. 6. X-ray images of pace maker in the heart.

Fig. 7. Precision Comparison of median Euclidean error in 3D for each tracked marker.

Fig. 8. Precision comparison to human experts.

Fig. 9. 3D surface reconstruction of the beating sheep heart based on our tracked results for one cardiac cycle.

9

Please cite this article in press as: X. Jiang, et al., Multi-marker tracking for large-scale X-ray stereo video data, Signal Processing: Image Communication (2017), http://dx.doi.org/10.1016/j.image.2017.06.009.

X. Jiang et al.

Signal Processing: Image Communication (

videos. First, 3D reconstructed detections are used to build a detection graph. The weights between pairs of nodes indicate the similarity of 3D detections. We iteratively apply the shortest path algorithm to the detection graph to obtain an approximation of the globally optimal solution for the set of tracklets. The obtained tracklets are supposed to be short object trajectories with high confidence. Afterwards, these short fragments are linked into longer trajectories by a tracklet graph. In each stage of the graph, different features are efficiently used. We focus on efficiency and effectiveness of the approach in order to address demands of big data analysis. To achieve this goal, we assume independence of tracklets and tracks as well as that objects are occluded only for a short period of time. While limiting from a theoretical point of view, the experimental results show that the approach solved the multi-marker tracking problem with high accuracy. In scenarios where long-term occlusions of objects occur often, our tracker might lose the objects since no re-identification algorithm is employed. However, in the medical application considered in this paper, our approach has no such problem and achieves human-level accuracy.

)



[14] Y. Xiang, A. Alahi, S. Savarese, Learning to track: Online multi-object tracking by decision making, in: IEEE International Conference on Computer Vision, ICCV, 2015, pp. 4705–4713. [15] H.W. Kuhn, The hungarian method for the assignment problem, Naval Res. Logist. Q. 52 (1) (1955) 83–97. [16] B. Wu, R. Nevatia, Detection and tracking of multiple, partially occluded humans by bayesian combination of edgelet based part detectors, Int. J. Comput. Vis. 75 (2) (2007) 247–266. [17] M. Bredereck, X. Jiang, M. Körner, J. Denzler, Data association for multi-object tracking-by-detection in multi-camera networks, in: ACM/IEEE International Conference on Distributed Smart Cameras, ICDSC, 2012, pp. 1–6. [18] A. Andriyenko, K. Schindler, Multi-target tracking by continuous energy minimization, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, vol. 42, 2011, pp. 1265–1272. [19] J. Black, T. Ellis, P. Rosin, Multi-view image surveillance and tracking, in: Workshop on Motion and Video Computing, 2003, pp. 169–174. [20] H. Jiang, S. Fels, J.J. Little, A linear programming approach for multiple object tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2007, pp. 1–8. [21] L. Zhang, Y. Li, R. Nevatia, Global data association for multi-object tracking using network flows, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2008, pp. 1–8. [22] J. Berclaz, F. Fleuret, E. Türetken, P. Fua, Multiple object tracking using k-shortest paths optimization, IEEE Trans. Pattern Anal. Mach. Intell. 33 (9) (2011) 1806–1819. [23] L. Leal-Taixè, G. Pons-Moll, B. Rosenhahn, Branch-and-price global optimization for multi-view multi-target tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, vol. 157, 2012, pp. 1987–1994. [24] J.F. Henriques, R. Caseiro, J. Batista, Globally optimal solution to multi-object tracking with merged measurements, in: IEEE International Conference on Computer Vision, ICCV, 2011. [25] R.T. Collins, Multitarget data association with higher-order motion models, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, vol. 157, 2012, pp. 1744–1751. [26] J. Xing, H. Ai, S. Lao, Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2009, pp. 1200–1207. [27] X. Jiang, M. Körner, D. Haase, J. Denzler, A graph-based map solution for multiperson tracking using multi-camera systems, in: Proceedings of the International Conference on Computer Vision Theory and Applications, VISAPP, 2014, pp. 3: 343–350. [28] W. Ge, R.T. Collins, Multi-target data association by tracklets with unsupervised parameter estimation, in: British Machine Vision Conference, BMVC, 2008. [29] F. Yan, W. Christmas, J. Kittler, Layered data association using graph-theoretic formulation with application to tennis ball tracking in monocular sequences, IEEE Trans. Pattern Anal. Mach. Intell. 30 (10) (2008) 1814–1830. [30] M.A. Niczyporuk, D.C. Miller, Automatic tracking and digitization of multiple radiopaque myocardial markers, Comput. Biomed. Res. 24 (1991) 129–142. [31] A. Muijtjens, J. Roos, T. Arts, A. Hasman, R. Reneman, Tracking markers with missing data by lower rank approximation, J. Biomech. 30 (1997) 95–98. [32] S. Malassiotis, M.G. Strintzis, Tracking the left ventricle in echocardiographic images by learning heart dynamics, IEEE Trans. Med. Imaging 18 (1999) 282–290. [33] P.M.F. Nielsen, I.J.L. Grice, B.H. Smaill, P.J. Hunter, Mathematical model of geometry and fibrous structure of the heart, Am. J. Physiol. Heart Circ. Physiol. 260 (1991) 1365–1378. [34] M. Hofmann, D. Wolf, G. Rigoll, Hypergraphs for joint multi-view reconstruction and multi-object tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2013. [35] E.W. Dijkstra, A note on two problems in connexion with graphs, Numer. Math. 1 (1959) 269–271. [36] F. Bajramovic, C. Grässl, J. Denzler, Efficient combination of histograms for real-time tracking using mean-shift and trust-region optimization, in: German Conference on Pattern Recognition, GCPR, 2005. [37] K. Bernardin, R. Stiefelhagen, Evaluating multiple object tracking performance: the clear mot metrics, EURASIP J. Image Video Process. (2008) 1–10. [38] Z. Zhang, A flexible new technique for camera calibration, IEEE Trans. Pattern Anal. Mach. Intell. 22 (11) (2000) 1330–1334. [39] E. Rodner, H. Süsse, W. Ortmann, J. Denzler, Difference of boxes filters revisited: Shadow suppression and efficient character segmentation, in: IAPR Workshop on Document Analysis Systems, 2008, pp. 263–269. [40] C. Kim, F. Li, A. Ciptadi, J.M. Rehg, Multiple hypothesis tracking revisited, in: International Conference on Computer Vision, ICCV, 2015. [41] X. Jiang, E. Rodner, J. Denzler, Multi-person tracking-by-detection based on calibrated multi-camera systems, in: International Conference on Computer Vision and Graphics, ICCVG, 2012, pp. 7594:743–751.

Acknowledgments The work is supported by the following projects: National Natural Science Foundation of China (NSFC), Nr.: 61461021; Young Teacher Development Program of Shanghai, Nr.: ZZGCD16008; Local Colleges Faculty Construction of Shanghai, Nr.: 15590501300. References [1] W. Bothe, H. Schubert, M. Diab, G. Faerber, C. Bettag, X. Jiang, J. Denzler, T. Doenst, Fully automated tracking of cardiac structures using radiopaque markers and high-frequency videofluoroscopy in an in vivo ovine model from three-dimensional marker coordinates to quantitative analyses, Interactive Cardiovasc. Thorac. Surg. 5 (1) (2016) 1–10. [2] G.J. 3rd, G. KB, S. JT, G. RC, J. BM, R. MB, B. DK, E.L. Jr., Dynamic threedimensional imaging of the mitral valve and left ventricle by rapid sonomicrometry array localization, J. Thorac. Cardiovasc. Surg. [3] M.D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, L.V. Gool, Online multiperson tracking-by-detection from a single, uncalibrated camera, IEEE Trans. Pattern Anal. Mach. Intell. 33 (9) (2011) 1820–1833. [4] D.M. Chu, A.W. Smeulders, Thirteen hard cases in visual tracking, in: IEEE International Conference on Advanced Video and Signal Based Surveillance, 2010, pp. 103–110. [5] H. Aghajan, A. Cavallaro, Multi-Camera Networks: Principles and Applications, Academic Press, 2009. [6] P. Felzenszwalb, R. Girshick, D. McAllester, Cascade object detection with deformable part models, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2010, pp. 2241–2248. [7] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, vol. 1, 2005, pp. 886–893. [8] C. Huang, B. Wu, R. Nevatia, Robust object tracking by hierarchical association of detection responses, in: European Conference on Computer Vision, ECCV, 2008, pp. 788–801. [9] Z. Wu, T.H. Kunz, M. Betke, Efficient track linking methods for track graphs using network-flow and set-cover techniques, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2011, pp. 1185–1192. [10] X. Jiang, D. Haase, M. Körner, W. Bothe, J. Denzler, Accurate 3d multi-marker tracking in X-ray cardiac sequences using a two-stage graph modeling approach, in: Conference on Computer Analysis of Images and Patterns, CAIP, in: Lecture Notes in Computer Science, vol. 8048, Springer, Berlin, Heidelberg, 2013, pp. 117–125. [11] D. Laney, 3d data management: controlling data volume, velocity and variety, META Group Research Note. [12] D.B. Reid, An algorithm for tracking multiple targets, IEEE Trans. Automat. Control 24 (6) (1979) 843–854. [13] T.E. Fortmann, Y. Bar-Shalom, M. Scheffe, Sonar tracking of multiple targets using joint probabilistic data association, IEEE J. Ocean. Eng. 8 (3) (1983) 173–184.

10

Please cite this article in press as: X. Jiang, et al., Multi-marker tracking for large-scale X-ray stereo video data, Signal Processing: Image Communication (2017), http://dx.doi.org/10.1016/j.image.2017.06.009.