Journal Pre-proofs

Multiple object tracking in soccer videos using topographic surface analysis

Wonjun Kim

PII: S1047-3203(19)30304-9
DOI: https://doi.org/10.1016/j.jvcir.2019.102683
Reference: YJVCI 102683
To appear in: J. Vis. Commun. Image R.
Received Date: 4 February 2019
Revised Date: 18 July 2019
Accepted Date: 12 October 2019
Please cite this article as: W. Kim, Multiple object tracking in soccer videos using topographic surface analysis, J. Vis. Commun. Image R. (2019), doi: https://doi.org/10.1016/j.jvcir.2019.102683
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
© 2019 Published by Elsevier Inc.
Journal of Visual Communication and Image Representation journal homepage: www.elsevier.com/locate/jvci
Multiple object tracking in soccer videos using topographic surface analysis
Wonjun Kim a,∗
a Department of Electrical and Electronics Engineering, Konkuk University, Seoul 05029, Republic of Korea
ARTICLE INFO
Article history: Received 1 May 2013; Received in final form 10 May 2013; Accepted 13 May 2013; Available online 15 May 2013
Communicated by S. Sarkar
Keywords: Multiple object tracking, Topographic surface, Color similarity, Spatial proximity

ABSTRACT
Multiple object tracking remains a challenging problem in computer vision, despite several recent attempts to address it within the framework of deep neural networks. This paper introduces a novel method for multiple object tracking in soccer videos, which often contain complicated interactions between players with severe occlusions. We propose to interpret the foreground regions extracted from a given frame as a topographic surface. This interpretation helps to follow target players reliably by accurately providing the boundary of each object even under occlusion. Color similarity and spatial proximity are subsequently employed to refine the estimated position of target players for continuous tracking over whole video sequences. Experimental results on various soccer videos, captured at actual games with a wide-angle camera, demonstrate that the proposed method is effective for tracking multiple players in the dynamic scenes of soccer videos.
© 2019 Elsevier B.V. All rights reserved.
∗ Corresponding author. e-mail: [email protected] (Wonjun Kim)

1. Introduction
Visual object tracking has been studied extensively due to its wide range of applications, e.g., human-robot interfaces, intelligent surveillance, and action recognition [1, 2]. In particular, with the recent surge in demand for sports video analysis, multiple object tracking has become a key prerequisite for realizing advanced operations. For example, the movements and positions of players in a soccer game, collected automatically by a tracking system, give the team manager objective criteria for developing new plans to strengthen the team as well as for evaluating each player accurately. Such a tracking system can also improve the viewing experience by providing additional information, e.g., visualization of players' activity and the corresponding statistics, to help viewers understand the game they are watching.

Even though multiple object tracking has been attracting increasing attention from the computer vision community, most methods introduced in the literature are not directly applicable to sports video analysis for the following reasons. First, the motion patterns of players are significantly different from those in natural videos containing people walking on the street, cars moving on the road, etc. The dynamics of the ball yield nonlinear motion patterns of players; that is, the directions and velocities of foreground objects are unpredictable because the position of the ball changes irregularly even between consecutive frames. Second, severe occlusions frequently occur between players since they compete for the ball over the whole area of the field. Finally, similar shapes and colors among players make the tracking problem more intractable. In particular, the identities of players belonging to the same team are highly likely to be switched when they separate from a merged state, due to their similar appearance.

In this paper, a novel method for multiple object tracking in soccer videos is proposed. The key idea is to formulate the extracted foreground regions (i.e., players) as a topographic surface. Since the boundaries of players are effectively detected by segmenting the topographic surface, the identity of each object can be maintained even under severe occlusions. Note that foreground regions are extracted by adaptively selecting between the results of background subtraction and edge detection. This strategy helps to discriminate a temporarily static player from the background, which remains difficult for traditional approaches to moving object detection. Instead of using a rectangular window to define the tracking model (e.g., center position, velocity, and color feature), as is widely done in previous approaches, the segmentation-based tracking model of the proposed method accurately captures the characteristics of target objects in dynamic scenes of the soccer game by excluding irrelevant background components. This is also useful for correctly updating the color properties of target objects, which is done continuously throughout the whole video sequence. Experimental results demonstrate that the proposed method is capable of reliably tracking multiple objects in various situations of soccer videos.

The main contributions of this paper can be summarized as follows:
• Foreground regions (i.e., players) are accurately extracted in a given frame by adaptively using the results of background subtraction and edge detection. This strategy helps to discriminate a temporarily static player from the background, which remains difficult for traditional approaches to moving object detection.
• A novel scheme for multiple object tracking, especially in soccer videos, is proposed based on the analysis of the topographic surface. By applying a topographic surface-based segmentation technique to the extracted foreground regions, the boundaries of players can be clearly detected, which helps to update the tracking model of each object even under severe occlusions. Furthermore, this segmentation-based strategy prevents the tracking model from being contaminated by the background, which improves the tracking performance.
The rest of this paper is organized as follows. We briefly review the related work in Section 2. The main idea and the detailed procedure of the proposed method are explained in Section 3. Experimental results on various soccer videos taken at actual matches are presented in Section 4, and the conclusion follows in Section 5.

2. Related work
The field of multiple object tracking is evolving rapidly due to its many potential applications. In this section, we briefly review studies relevant to soccer video analysis. Early methods focused on discriminatively representing each object against the background and other tracked objects. Liu et al. [3] utilized a visual codebook generated by dominant color learning to represent the different teams in a soccer video as two types of tracking models. Xing et al. [4] first conducted progressive observation modeling based on the results of playfield segmentation, player detection, and team classification. By exploiting a dual-mode Bayesian inference scheme (i.e., combination of
forward filtering and backward smoothing), they constructed a unified tracking framework that efficiently handles both single isolated objects and multiple occlusions. Zhang et al. [5] proposed a novel particle filter that combines appearance information with cross-domain contextual information, i.e., trajectories estimated from the homography transform, to alleviate the effect of fast camera motion. Even though these methods provide fairly reliable tracking results, they often fail to recover trajectories correctly when multiple target objects occlude each other.

More recently, several attempts have been made to incorporate interactions between players into the tracking model. In [6], context-conditioned motion models are introduced to implicitly incorporate complicated inter-object correlations via hierarchical data association. Lu et al. [7] developed a tracking system that detects, identifies, and tracks multiple players by exploiting both temporal and mutual exclusion constraints with a conditional random field (CRF). Baysal and Duygulu [8] proposed to share particles densely sampled at fixed positions on the model field instead of assigning particles to target objects. This strategy allows the tracking algorithm to embed the interactions of players into the state-space model, and makes tracking possible through occlusions.

In addition, several methods have concentrated on formulating associations between objects to clearly represent their dynamic relationships in complicated situations. Bewley et al. [9] simply combined a high-performance detector with rudimentary tracking schemes, e.g., the Kalman filter and the Hungarian algorithm. They further extended their work to a data association metric with a discriminative appearance model, which improves the tracking performance [10]. Lee et al.
[11] utilized a partial least squares (PLS) method to learn the appearance model of target objects and to construct associations between them even under contamination by occlusions and shape similarities. In [12], the authors proposed a historical appearance matching scheme, which efficiently maintains associations of targets even with temporal errors while preventing tracking failures driven by occlusions. Such approaches have proven effective for handling multiple objects in dynamic environments; however, the amount of computation increases rapidly when there are a large number of objects to track, as in soccer games.

Most notably, the correlation filter (CF) and its variants [13, 14, 15], which have recently been the most widely employed approach for single object tracking, have started to be applied to multiple object tracking. Their strategy is simple and effective: an independent correlation filter is applied to each object separately, which retains the advantage in processing speed. For example, the kernelized correlation filter-based approach [14] runs very fast and also yields results competitive with representative methods specifically designed for multiple object tracking [16]. Nevertheless, most approaches still suffer from the ambiguity raised by severe occlusions between players and the complicated motion patterns occurring in soccer games.

On the other hand, inspired by the great success of deep learning techniques, especially the convolutional neural network (CNN), in many computer vision applications, CNN-based methods are beginning to be studied for object tracking. Specifically, these algorithms formulate object tracking as a problem of discriminative object detection [17, 18]. That is, once observations (e.g., rectangular windows in a given frame) are given, CNN-based trackers attempt to find the patch most similar to the target object. Even though such CNN-based trackers demonstrate quite reliable tracking results for a single target object, they are vulnerable to interference from similar neighbors since CNNs are generally designed for classification without considering the history of motion patterns. This architecture can be straightforwardly extended to multiple object tracking; however, an additional network needs to be trained independently for each object, which requires a vast number of parameters as well as extensive training time. Moreover, CNN-based trackers cannot run in real time without GPUs and are thus hard to deploy on embedded systems. In contrast, the proposed algorithm runs in real time on a single CPU and provides reliable trajectories even under the ambiguity caused by occlusions between players of similar appearance. In the following section, the proposed method is explained in detail.

Fig. 1. An example of the state model for topographic surface-based multiple object tracking.

Fig. 2. (a) Player windows (ROIs). (b) Results of background subtraction [21]. (c) Results of Canny edge detector [22] with morphological filtering. (d) Pink: intersection between (b) and (c) / Cyan: union of (b) and (c). (e) Foreground regions $FG_i^k$.

Fig. 3. (a) Extracted foreground region (two players are overlapped). (b) Result of distance transform. (c) Topographic surface.

3. Proposed method
The motivation for exploiting the topographic surface stems from the fact that it has shown a good ability to predict the boundary of overlapped objects in image segmentation applications [19, 20]. This is desirable for accurately maintaining the tracking model without interference from the irrelevant background, and thus leads to reliable tracking of multiple objects. In order to handle each object independently during tracking, the state model of each player at the kth frame is first defined as follows:

$$S_i^k = \{p_i^k, v_i^k, c_i^k\}, \tag{1}$$
where $i$ denotes the index of the extracted foreground regions (i.e., players). $p_i^k = (x_c, y_c)$, $v_i^k = (d_x, d_y)$, and $c_i^k = (\mu_R, \mu_G, \mu_B)$ are, respectively, the center position, the velocity in the horizontal and vertical directions, and the average color (e.g., the average value of the RGB channels) of the pixels belonging to the ith foreground region that lie in the neighborhood (e.g., 5 × 5 pixels) centered at $(x_c, y_c)$. A simple example of the state model is shown in Fig. 1. By continuously updating this state model throughout the whole video sequence, the topographic surface-based tracking algorithm traces players under various conditions of the soccer game, as explained in the following subsections.

3.1. Foreground extraction
First of all, foreground regions need to be clearly extracted from the background to define the state model for each player, as shown in Fig. 1. In order to extract foreground regions accurately without loss of detail, we propose an adaptive scheme that selects between the results of background subtraction and edge detection. Although diverse background subtraction methods have successfully extracted moving objects, they still struggle to uniformly highlight the whole region of a temporarily static object, which often occurs in soccer games (see Fig. 2). To compensate for this limitation, static edges are employed together with simple morphological filters (e.g., opening and closing). Specifically, the player window (ROI), e.g., 80 × 80 pixels, is first defined using the center position of each player estimated in the previous frame. According to the overlap ratio $F$ between the results of background subtraction and edge detection in the player window, the foreground region for each player is adaptively defined as follows:

$$FG_i^k = \begin{cases} BS_i^k, & \text{if } F > \tau, \\ SE_i^k, & \text{otherwise}, \end{cases} \tag{2}$$

$$F = \frac{BS_i^k \cap SE_i^k}{BS_i^k \cup SE_i^k}, \tag{3}$$

where $BS_i^k$ and $SE_i^k$ denote the results of background subtraction and edge detection in the player window defined for the ith player, respectively. The larger the value of $F$, the clearer the movement of the corresponding object. $\tau$ is a pre-defined threshold, which is set to 0.3 in this work. For background subtraction, an improved Gaussian mixture model is employed [21], while the Canny detector [22] is applied to compute the edge magnitude. The result of foreground extraction is shown in Fig. 2. In the first row, $F$ is larger than $\tau$ and thus the foreground region is defined by the result of background subtraction, whereas in the second row the edge magnitude map is taken as the foreground according to the overlap ratio $F$, which uniformly highlights the temporarily static player. This selection scheme is therefore effective for extracting the whole region of players under various conditions of the soccer game, which is helpful for conducting the topographic surface analysis.

Fig. 4. (a) Topographic surface (distance map). (b) Changes of the watersheds according to different water levels $\ell$. Note that the water level increases from left to right.

Fig. 5. (a) Topographic surfaces of two watersheds (objects), e.g., $R_{init}(B_1)$ and $R_{init}(B_2)$. (b)(c)(d) Intermediate results of watershed segmentation with dilation operators. (e) Result of dam point construction. Note that dam points are represented as red-colored lines and the two objects are also differently colored.

Fig. 6. Some examples of the topographic surface-based tracking. Note that overlapped players are successfully separated, which leads to robust tracking. Overlapped players (i.e., watersheds) are represented in different colors.
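The selection rule of Eqs. (2) and (3) can be sketched as follows. This is a minimal illustration rather than the author's implementation: it assumes that $F$ is read as the ratio between the pixel counts of the intersection and the union of the two binary masks, that masks are given as nested lists of 0/1 values, and that an empty union falls back to the edge result. The function names `overlap_ratio` and `select_foreground` are hypothetical.

```python
def overlap_ratio(bs_mask, se_mask):
    """Eq. (3), read as |BS AND SE| / |BS OR SE| over a player window.

    Both masks are equally sized nested lists of 0/1 values (assumption).
    """
    inter = 0
    union = 0
    for bs_row, se_row in zip(bs_mask, se_mask):
        for b, s in zip(bs_row, se_row):
            if b and s:
                inter += 1
            if b or s:
                union += 1
    # Assumption: an empty union yields F = 0, so the edge mask is selected.
    return inter / union if union else 0.0


def select_foreground(bs_mask, se_mask, tau=0.3):
    """Eq. (2): keep background subtraction if F > tau, else the static edges."""
    f = overlap_ratio(bs_mask, se_mask)
    chosen = bs_mask if f > tau else se_mask
    return chosen, f


if __name__ == "__main__":
    moving_bs = [[1, 1, 0], [1, 1, 0]]   # player clearly detected as moving
    edges = [[0, 1, 1], [0, 1, 1]]
    fg, f = select_foreground(moving_bs, edges)
    print(f, fg is moving_bs)
```

With the default $\tau = 0.3$ from Table 1, a window in which background subtraction responds only weakly (small intersection relative to the union) falls back to the edge-based mask, which matches the behavior described for temporarily static players.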
Fig. 7. Effectiveness of the proposed watershed method in occlusion cases. (a) Tracking results by KCF [14]. (b) Tracking results by the proposed watershed method.

3.2. Topographic surface-based object tracking
Based on the foreground regions extracted in the previous subsection, the topographic surface for each player's window is first defined by adopting the distance transform [23], and each topographic surface can be regarded as a watershed, as shown in Fig. 3. Under this interpretation, constructing a dam where two nearby watersheds merge due to flooding is equivalent to finding the boundary between overlapped players in the soccer game, which plays a key role in multiple object tracking. Motivated by this observation, we propose to apply a marker-controlled watershed segmentation technique [24], which has been popularly employed for medical images [20] as well as textural images [25], to the problem of multiple object tracking. More specifically, let $B_1, B_2, \cdots, B_N$ denote the positions of the river beds (i.e., regional minima) in each watershed, where $N$ is the total number of players. The center position of each object in the distance image $T$ is used as the position of the river bed (i.e., marker), as shown in Fig. 3(b). From a geometrical point of view, the status of each watershed according to the water level $\ell$ is defined as follows:

$$S[\ell] = \{(x, y) \mid T(x, y) < \ell\}, \tag{4}$$

where $S[\ell]$ is the set of pixels whose values are smaller than $\ell$ in the distance image $T$. Based on this, a submerged region whose shape changes with the water level $\ell$ is subsequently defined as follows:

$$R_{\ell}(B_i) = \{(x, y) \mid (x, y) \in R(B_i) \text{ and } (x, y) \in S[\ell]\}, \tag{5}$$

where $R(B_i)$ denotes the set of pixels belonging to the ith watershed. To set the initial catchment basins, the water level is set to $\ell_{init} = 1.3 \times T(p, q)$, where $(p, q)$ is the position of the marker. Now, assume that we gradually fill the catchment basins with water by raising the water level, as shown in Fig. 4. After several repetitions, watersheds located near each other merge at a specific level. To prevent such overflows (i.e., overlapped players), a dam, which can be represented as the boundary between different watersheds, needs to be constructed, and morphological filters have been employed for this task [26, 27, 28]. Specifically, the dilation operation is iteratively applied to the initial status of the ith watershed, i.e., $R_{init}(B_i)$, until the water overflows, that is, until the expanded regions meet the boundary of the current watershed itself or the boundary between different watersheds. This can be formulated as follows:

$$D_i(x, y) = \begin{cases} 1, & \text{if } (x, y) \in \{R_{init}^{m}(B_i) \cap R_{init}^{m}(B_j)\} \text{ or } (x, y) \in \{R_{init}^{iter_{max}}(B_i) - R_{init}^{iter_{max}-1}(B_i)\}, \\ 0, & \text{otherwise}, \end{cases} \tag{6}$$
where $m$ indicates the number of dilation iterations and thus $R_{init}^{m}(B_i)$ is the $m$ times dilated watershed associated with the ith marker. $iter_{max}$ denotes the maximum number of iterations, which guides the expanding region to the boundary of the extracted foreground region. It is noteworthy that the difference between this maximally dilated watershed and the watershed generated at the previous step (i.e., $m = iter_{max} - 1$) yields the boundary pixels of the corresponding foreground region. $D_i(x, y)$ is the dam point map for the ith player window, as shown in Fig. 5. In this example, $i$ and $j$ of (6) are set to 1 and 2, respectively, for better understanding.

Figure 6 shows some tracking results for overlapped players obtained by the proposed method. As can be seen, overlapped players are successfully separated by the proposed topographic surface analysis, which makes the tracking algorithm robust and reliable for soccer videos. Since moving objects are segmented from the background by the foreground extraction introduced in the previous subsection, the binarized map of players is clearly generated. Most players in the binarized map can be regarded as elliptical shapes (see Fig. 6), and distinctive concave points are generated when they overlap. The watershed technique is well known to segment such convex-like objects even under occlusion, and thus the proposed watershed method can accurately assign a label to the area of each player. Furthermore, the proposed watershed method prevents the tracking model from being contaminated by the background, which improves the tracking performance compared to approaches based on a rectangle-shaped tracking model. The effectiveness of the proposed watershed method under occlusion is also shown in Fig. 7.
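The marker-driven dilation with dam construction can be illustrated with the simplified, pure-Python sketch below. It is not the paper's implementation, and several assumptions are made for brevity: the function names `distance_transform` and `segment_with_dams` are hypothetical, a city-block distance is used for the topographic surface, each basin is seeded from a single marker pixel instead of the thresholded initial basin $R_{init}(B_i)$, and dilation uses a 4-connected neighborhood. A pixel that two different labels reach in the same dilation step is marked as a dam point; when no pixel is left between two labels, the boundary is simply where the labels touch.

```python
def distance_transform(mask):
    """City-block distance of each foreground pixel to the nearest background pixel."""
    h, w = len(mask), len(mask[0])
    far = h + w  # exceeds any city-block distance inside an h x w grid
    dist = [[far if mask[y][x] else 0 for x in range(w)] for y in range(h)]
    for y in range(h):                      # forward pass (top-left to bottom-right)
        for x in range(w):
            if dist[y][x]:
                if y > 0:
                    dist[y][x] = min(dist[y][x], dist[y - 1][x] + 1)
                if x > 0:
                    dist[y][x] = min(dist[y][x], dist[y][x - 1] + 1)
    for y in range(h - 1, -1, -1):          # backward pass (bottom-right to top-left)
        for x in range(w - 1, -1, -1):
            if dist[y][x]:
                if y < h - 1:
                    dist[y][x] = min(dist[y][x], dist[y + 1][x] + 1)
                if x < w - 1:
                    dist[y][x] = min(dist[y][x], dist[y][x + 1] + 1)
    return dist


def segment_with_dams(mask, markers, max_iter=100):
    """Grow one label per marker by repeated 4-connected dilation inside `mask`.

    markers: list of (x, y) seed pixels, one per player (label 1..N).
    Returns (labels, dams): labels[y][x] is 0 where unassigned, dams is a set
    of (x, y) pixels claimed by more than one label in the same step.
    """
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    for idx, (mx, my) in enumerate(markers, start=1):
        labels[my][mx] = idx
    dams = set()
    for _ in range(max_iter):
        claims = {}  # free foreground pixel -> labels reaching it in this step
        for y in range(h):
            for x in range(w):
                lab = labels[y][x]
                if lab == 0:
                    continue
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nx, ny = x + dx, y + dy
                    if (0 <= nx < w and 0 <= ny < h and mask[ny][nx]
                            and labels[ny][nx] == 0 and (nx, ny) not in dams):
                        claims.setdefault((nx, ny), set()).add(lab)
        if not claims:
            break  # every reachable foreground pixel is labeled or is a dam
        for (x, y), labs in claims.items():
            if len(labs) == 1:
                labels[y][x] = next(iter(labs))
            else:
                dams.add((x, y))
    return labels, dams


if __name__ == "__main__":
    strip = [[1, 1, 1, 1, 1]]               # two "players" merged into one strip
    labels, dams = segment_with_dams(strip, [(0, 0), (4, 0)])
    print(labels, dams)
```

In a practical pipeline one would more likely rely on an existing distance transform and marker-controlled watershed routine from an image-processing library; the sketch only aims to make the role of the markers and of the dam points concrete.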
Based on the segmentation results computed at the kth frame, the center position of each object is updated with its velocity for the next frame as follows:

$$B_i(k+1) = \frac{1}{|R_{init}^{iter_{max}}(B_i(k))|} \sum_{p_l \in R_{init}^{iter_{max}}(B_i(k))} p_l + \lambda v_i(k), \tag{7}$$

where $|R_{init}^{iter_{max}}(B_i(k))|$ denotes the size of the ith segmented watershed and $p_l$ indicates a pixel position belonging to the ith segmented watershed. $v_i(k)$ denotes the velocity of the ith object, which is generally computed as $v_i(k) = B_i(k) - B_i(k-1)$. $\lambda$ is the weight of the velocity, which is set to 1.0 in the proposed method. It should be emphasized that this velocity helps to locate the marker position precisely by considering the amount of movement computed from consecutive frames. Using the updated center (i.e., marker) position, each object is continuously tracked with the topographic surface-based segmentation scheme throughout the whole video sequence. To further improve the performance of multiple object tracking, the estimated positions are subsequently refined, as explained in the following subsection.
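As a worked illustration of Eq. (7), the sketch below computes the centroid of a segmented region and shifts it by the weighted velocity. The function name `predict_marker` and the representation of a region as a list of (x, y) tuples are assumptions made for this example only.

```python
def predict_marker(region_pixels, marker_k, marker_k_minus_1, lam=1.0):
    """Eq. (7): centroid of the segmented watershed plus lam * velocity.

    region_pixels: list of (x, y) pixels of the i-th segmented watershed.
    marker_k, marker_k_minus_1: marker positions B_i(k) and B_i(k-1).
    """
    n = len(region_pixels)
    cx = sum(x for x, _ in region_pixels) / n
    cy = sum(y for _, y in region_pixels) / n
    # v_i(k) = B_i(k) - B_i(k-1)
    vx = marker_k[0] - marker_k_minus_1[0]
    vy = marker_k[1] - marker_k_minus_1[1]
    return (cx + lam * vx, cy + lam * vy)


if __name__ == "__main__":
    region = [(0, 0), (2, 0), (0, 2), (2, 2)]      # centroid is (1, 1)
    print(predict_marker(region, (1, 1), (0, 1)))  # velocity is (1, 0)
```

With $\lambda = 1.0$ as in Table 1, the predicted marker is the region centroid displaced by one full frame of motion.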
Table 1. Parameter setting for multiple player tracking

| Procedure | Parameters | Description |
|---|---|---|
| Foreground extraction | $\tau = 0.3$ | Eq. (2) |
| Topographic surface | $\ell_{init} = 1.3 \times T(p, q)$ | Eq. (6) |
| Topographic surface | $\lambda = 1.0$ | Eq. (7) |
| Refinement | $\alpha = 0.3$ | Eq. (8) |
| Refinement | $\xi = 1.3 \times T(p, q)$ | Eq. (8) |

Table 2. Performance variation according to changes of parameters

| Descriptions | Parameters | MOTP (↑) | MOTA (↑) |
|---|---|---|---|
| λ, α fixed | τ = 0.1 | 0.967 | 0.915 |
| λ, α fixed | τ = 0.3 | 0.996 | 0.958 |
| λ, α fixed | τ = 0.5 | 0.979 | 0.936 |
| τ, α fixed | λ = 0.5 | 1 | 0.908 |
| τ, α fixed | λ = 0.7 | 1 | 0.881 |
| τ, α fixed | λ = 1.0 | 0.996 | 0.958 |
| τ, λ fixed | α = 0.3 | 0.996 | 0.958 |
| τ, λ fixed | α = 0.5 | 0.929 | 0.9 |
| τ, λ fixed | α = 0.7 | 0.957 | 0.871 |

3.3. Refinement of the river-bed position
In order to also account for the nonlinear motion patterns of players, the center position of each player estimated by (7) is refined based on the color similarity and the spatial proximity between the segmented watershed in the current frame and the tracking model for the corresponding player. This can be formulated as follows:

$$E_{i,n} = (1 - \alpha)\|B_i(k+1) - q_n\| + \alpha\|c_i(k) - f_{q_n}(k+1)\|, \tag{8}$$

where $q_n$ is the position of the nth pixel belonging to the valley of the topographic surface (e.g., $q_n = \{(x, y) \mid T(x, y) < \xi\}$). $f_{q_n}(k+1)$ indicates the average color vector computed from the pixels of the ith foreground region that lie in the neighborhood (e.g., 5 × 5 pixels) centered at $q_n$. $c_i(k)$ denotes the average color vector of the ith tracking model, which is consistently updated over a given sequence. $\alpha$ is a balancing weight between the spatial proximity and the color similarity, which is set to 0.3 in this work. By finding the pixel position that minimizes $E_{i,n}$, the marker position is finally refined as follows:

$$\tilde{B}_i(k+1) = \arg\min_{q_n} E_{i,n}. \tag{9}$$
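The refinement of Eqs. (8) and (9) amounts to a weighted nearest-candidate search, which can be sketched as below. This is an illustration under assumptions, not the paper's code: the function name `refine_marker` is hypothetical, candidates are passed as precomputed (position, local mean color) pairs, and position and color distances are combined without any normalization because the text does not state one.

```python
import math


def refine_marker(predicted, model_color, candidates, alpha=0.3):
    """Eqs. (8)-(9): pick the valley pixel q_n with minimum energy E_{i,n}.

    predicted:   B_i(k+1) from Eq. (7), as (x, y).
    model_color: c_i(k), the mean RGB color of the tracking model.
    candidates:  iterable of (q_n, f_qn) pairs, where q_n is an (x, y) valley
                 pixel and f_qn the mean RGB color around it (assumed given).
    """
    def energy(item):
        q, f = item
        spatial = math.dist(predicted, q)      # ||B_i(k+1) - q_n||
        color = math.dist(model_color, f)      # ||c_i(k) - f_qn(k+1)||
        return (1 - alpha) * spatial + alpha * color

    best_q, _ = min(candidates, key=energy)
    return best_q


if __name__ == "__main__":
    cands = [((10, 10), (0, 0, 200)),    # close, but the wrong team color
             ((13, 14), (200, 0, 0))]    # 5 px away, matching color
    print(refine_marker((10, 10), (200, 0, 0), cands))
```

In this toy case the spatially closest candidate is rejected because its color differs strongly from the tracking model, which is the situation the color term is meant to handle when two players of different teams overlap.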
Next, the color vector of the ith tracking model is updated by online interpolation, i.e., $c_i^{k+1} = c_i(k+1) = 0.9 \cdot c_i(k) + 0.1 \cdot f_{\tilde{B}_i(k+1)}(k+1)$. In the subsequent frame, tracking is conducted based on the updated tracking model $S_i^{k+1} = \{p_i^{k+1}, v_i^{k+1}, c_i^{k+1}\}$, given as follows:

$$\begin{aligned} p_i^{k+1} &\leftarrow \tilde{B}_i(k+1), \\ v_i^{k+1} &\leftarrow p_i^{k+1} - p_i^k, \\ c_i^{k+1} &\leftarrow 0.9 \cdot c_i(k) + 0.1 \cdot f_{\tilde{B}_i(k+1)}(k+1). \end{aligned} \tag{10}$$

Based on this updating procedure with the topographic surface analysis, players are reliably traced in a given soccer video.
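The state model of Eq. (1) and its update in Eq. (10) can be written compactly as follows. The class name `PlayerState`, the function name `update_state`, and the `rate` parameter (fixed to the 0.1 interpolation weight given in the text) are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass


@dataclass
class PlayerState:
    """State model S_i^k = {p, v, c} of Eq. (1)."""
    p: tuple  # center position (x_c, y_c)
    v: tuple  # velocity (d_x, d_y)
    c: tuple  # mean color (mu_R, mu_G, mu_B)


def update_state(state, refined_pos, local_color, rate=0.1):
    """Eq. (10): move to the refined marker and blend in the new local color.

    refined_pos: refined marker position from Eq. (9).
    local_color: mean RGB color around refined_pos in frame k+1.
    rate:        interpolation weight of the new color (0.1 in the text).
    """
    new_v = (refined_pos[0] - state.p[0], refined_pos[1] - state.p[1])
    new_c = tuple((1 - rate) * old + rate * new
                  for old, new in zip(state.c, local_color))
    return PlayerState(p=refined_pos, v=new_v, c=new_c)


if __name__ == "__main__":
    s = PlayerState(p=(10, 10), v=(0, 0), c=(100.0, 100.0, 100.0))
    print(update_state(s, (12, 9), (200.0, 100.0, 0.0)))
```

Because the color is updated with a small weight, a single frame in which the refined position lands on a differently colored neighbor changes the model only slightly, which is consistent with the slow online interpolation described above.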
Fig. 8. Results of multiple object tracking on the ETRI K-League dataset by the proposed method. A total of 21 players (i.e., tracking models) are tracked and represented by different colors (one player is missing) in this example. Note that the shape of each player is accurately maintained during the tracking procedure due to the topographic surface analysis. Best viewed in color.
4. Experimental results
In this section, the performance of multiple object tracking by the proposed method is demonstrated on the ETRI K-League 2017 dataset [29]. This dataset was constructed from soccer matches of the Korean league at the Seoul World Cup soccer stadium, and the resolution of the video samples is 3196×648 pixels. For comparison with the benchmark, the proposed method is also tested on the ISSIA dataset [30], which is the most popularly employed dataset for soccer video tracking. The parameters used for foreground extraction, topographic surface analysis, and the refinement scheme are summarized in Table 1. To justify these parameter settings, the performance variation according to changes of parameters is shown in Table 2. Note that the size of the ROI is determined by the initial setting of tracking (i.e., the size of the rectangle set by the user at the first frame), and thus sizes of 80 × 80 pixels and 40 × 40 pixels are used for the ETRI K-League 2017 and ISSIA datasets, respectively. As can be seen, the parameters selected in Table 1 are appropriate for the experiments. In order to show the efficiency and robustness of the proposed method, it is compared with previous methods popularly employed for multiple object tracking, as demonstrated in the following subsections.

4.1. Qualitative evaluation
First of all, some tracking results by the proposed method are shown in Fig. 8. In this example, a total of 21 players are tracked and the corresponding segmented regions are represented in different colors. As can be seen, our topographic surface-based algorithm provides a reliable position for each player in dynamic scenes of the soccer game. In particular, the proposed method is robust to position switching between two players, as shown in Fig. 9. Even though the positions of the two players are frequently switched within short time periods, the proposed method accurately maintains the trajectory of each player. To evaluate the tracking performance qualitatively, the proposed method is compared with four representative tracking algorithms, i.e., BOOSTING [31], MIL [32], KCF [14], and SORT [9], on the ETRI K-League 2017 dataset [29]. Even though such algorithms were originally developed for single object tracking, they can be directly applied to multiple object
tracking without any notable change. This is because previous methods only require the center position and its small neighboring region for tracking, and thus we simply apply each method to all the players simultaneously for multiple object tracking. Some experimental results are shown in Fig. 10. Specifically, previous methods often lose target objects and also suffer from tracking drift (see the red rectangles indicated by white arrows at the top of Fig. 10(a), (b), and (c)) due to complicated motion patterns and occlusions, whereas the proposed method copes well with such challenging conditions. It should be emphasized that the proposed method is able to provide the foot position accurately since it works on segmentation results instead of the rectangular window widely employed in previous tracking methods. This is desirable for estimating the actual trajectories as well as the amount of movement of each player. More examples of tracking results by the proposed method with foot positions are shown in Fig. 11. Some results of the proposed method on the ISSIA dataset are also shown in Fig. 12.

Fig. 9. Tracking results in the case of position switching (from left to right: 223rd, 275th, 481st, 545th, and 634th frame of the test video). Note that the positions of two players, which are represented by two different colors, are frequently switched within short time periods. Best viewed in color.

Fig. 10. Performance comparison. Tracking results by (a) BOOSTING [31]. (b) MIL [32]. (c) KCF [14]. (d) Proposed method. Note that white arrows indicate examples of tracking loss. Best viewed in color.

4.2. Quantitative evaluation
To evaluate the performance of tracking algorithms quantitatively, sub-videos composed of 1,000 frames and 300 frames are randomly sampled from the original videos in the ETRI K-League 2017 and ISSIA datasets, respectively. Note that a total of 21 and nine players, respectively, are set to be tracked at the first frame of each dataset. First of all, the completeness of the tracking method, i.e., how completely the actual trajectories are traced by a given algorithm, is evaluated with seven metrics that are the most widely employed in this field and are defined as follows [33]:
• MOTP: overlap ratio between the estimated positions and the ground truth, averaged over the matches.
• MOTA: 1 − (rates of false negatives + false positives + mismatches).
• MT: percentage of ground truth trajectories that are successfully traced by the tracker across more than 80% of the whole frames.
• ML: percentage of ground truth trajectories that are covered by the tracker in less than 20% of the whole frames.
• PT: 1 − MT − ML.
• IDS: number of identity (ID) switches.
• FM: number of times that trajectories are broken off (fragmented).

Table 3. Performance comparison on the ETRI K-League 2017 dataset

| Methods | MOTP↑ | MOTA↑ | MT↑ | ML↓ | PT | IDS↓ | FM↓ |
|---|---|---|---|---|---|---|---|
| BOOST | 0.888 | 0.825 | 0.667 | 0.095 | 0.238 | 0 | 3 |
| MIL | 0.967 | 0.777 | 0.619 | 0.143 | 0.138 | 0 | 5 |
| KCF | 0.933 | 0.931 | 0.905 | 0 | 0.095 | 1 | 1 |
| SORT | 1 | 0.509 | 0 | 0.476 | 0.524 | 0 | 9 |
| Proposed | 0.996 | 0.958 | 0.952 | 0.047 | 0 | 0 | 1 |

Table 4. Performance comparison on the ISSIA dataset

| Methods | MOTP↑ | MOTA↑ | MT↑ | ML↓ | PT | IDS↓ | FM↓ |
|---|---|---|---|---|---|---|---|
| BOOST | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| MIL | 0.705 | 0.701 | 0.667 | 0.333 | 0 | 1 | 5 |
| KCF | 0.932 | 0.885 | 0.778 | 0 | 0.222 | 0 | 1 |
| SORT | 0.934 | 0.905 | 0.778 | 0.111 | 0.111 | 1 | 3 |
| Proposed | 0.916 | 0.921 | 0.778 | 0 | 0.222 | 2 | 1 |

The performance comparison based on these metrics is shown in Tables 3 and 4. Moreover, the effectiveness of the proposed foreground extraction is evaluated based on MOTP and MOTA in Table 5. The number of players correctly tracked across all frames is reported in Fig. 13. Based on the evaluation results explained above, we can see that the proposed
Fig. 11. Tracking results by the proposed method with foot positions for each player. Note that these results are useful to estimate the actual trajectories of each player.
Fig. 13. Performance comparison of tracking methods based on the number of correctly tracked players throughout the whole video sequence of the ETRI K-League 2017 dataset.
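The trajectory-completeness metrics defined in Section 4.2 (MT, ML, PT) and MOTA can be illustrated with a minimal sketch. This is not the evaluation code used for the reported tables: the function names, the per-trajectory `coverage` input, and the normalization of MOTA by the total number of ground-truth objects over all frames are assumptions made only for illustration.

```python
from typing import Dict, Sequence


def completeness(coverage: Sequence[float],
                 mt_ratio: float = 0.8,
                 ml_ratio: float = 0.2) -> Dict[str, float]:
    """Return MT, ML and PT from per-trajectory coverage ratios.

    coverage[i] is the fraction of frames (0..1) in which ground-truth
    trajectory i is correctly covered by the tracker (hypothetical input).
    """
    total = len(coverage)
    if total == 0:
        raise ValueError("at least one ground-truth trajectory is required")
    n_mt = sum(1 for c in coverage if c > mt_ratio)  # traced in more than 80% of frames
    n_ml = sum(1 for c in coverage if c < ml_ratio)  # covered in less than 20% of frames
    n_pt = total - n_mt - n_ml                       # remaining trajectories, so PT = 1 - MT - ML
    return {"MT": n_mt / total, "ML": n_ml / total, "PT": n_pt / total}


def mota(false_negatives: int, false_positives: int,
         mismatches: int, num_ground_truth: int) -> float:
    """MOTA = 1 - (false negatives + false positives + mismatches) / ground truth.

    Assumption: all three error counts are normalized by the total number
    of ground-truth objects summed over all frames.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + mismatches
    return 1.0 - errors / num_ground_truth


if __name__ == "__main__":
    # Hypothetical example: 21 players, 20 of them covered in 95% of the frames.
    coverage = [0.95] * 20 + [0.10]
    print(completeness(coverage))
    # Hypothetical error counts over 1,000 frames with 21 players each.
    print(mota(30, 12, 0, 21 * 1000))
```

Counting trajectories first and dividing once keeps PT exactly equal to 1 − MT − ML, without floating-point residue.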
method provides more reliable trajectories of players than previous methods in the soccer game. Note that SORT does not perform well on the ETRI K-League 2017 dataset since the detector, which plays a key role in this method, often fails to consistently capture multiple objects. In particular, the problems of ID switching (IDS) and fragmentation (FM), which are mostly driven by severe occlusions, are efficiently resolved on the ETRI K-League 2017 dataset owing to the topographic surface analysis.

Fig. 12. Some tracking results by the proposed method for the test video of the ISSIA dataset.

Table 5. Effectiveness of the proposed foreground extraction
Methods    MOTP↑   MOTA↑
BS-only    0.972   0.926
SE-only    0.876   0.854
Proposed   0.996   0.958

For a more detailed analysis, three metrics of the global multiple-object tracking accuracy (GMOTA) [34] are additionally adopted, i.e., false negative (FN), false positive (FP), and global identity mismatch (GMME), which are defined as follows:

$$\mathrm{FN} = \sum_{n} \frac{m_n}{g_n}, \quad \mathrm{FP} = \sum_{n} \frac{\mathit{fp}_n}{g_n}, \quad \mathrm{GMME} = \sum_{n} \frac{\mathit{ids}_n}{g_n}, \tag{11}$$

where $g_n$ denotes the actual number of players (i.e., ground truth) at the $n$th frame, $m_n$ is the number of lost players, $\mathit{fp}_n$ is the number of falsely tracked players, and $\mathit{ids}_n$ is the number of identity switches at the $n$th frame. Note that lower values of the GMOTA metrics are desirable for multiple object tracking. The performance comparison using GMOTA metrics is shown in Table 6. As can be seen, the proposed method traces multiple players in the soccer game reliably, yielding the lowest FN, while keeping false positives (i.e., cases of falsely tracking players) lower than all compared approaches except KCF.

Table 6. Performance comparison using GMOTA metrics
Methods   BOOSTING [31]   MIL [32]   KCF [14]   SORT [9]   Proposed
FN↓       0.217           0.301      0.070      0.491      0.039
FP↓       0.094           0.196      0.035      0.375      0.056
GMME↓     0               0          0.002      0          0

Since the size of the stadium (field area) is given, the actual movement of each player can be efficiently estimated from a bird's-eye viewpoint by using the homography transform [35, 36]. The corresponding result is shown in Fig. 14. The movement of players is clearly shown in the transformed space with actual trajectories (see Fig. 14(b)), which is helpful for analyzing the game content and establishing a new strategy. The framework for the proposed method was implemented with Visual Studio 2017 (C implementation). The processing time of the tracking algorithms is evaluated on a single PC whose specifications are given as Intel Xeon
[email protected] and 64GB RAM without parallel processing. The average processing speed of each tracking method is shown in Table 7. Even though the resolution of the test video is quite high (3196 × 648 pixels) and there are a lot of objects to track (21 players), the proposed method can be considered sufficient to allow its implementation in real-time applications (>15 fps). Note that the proposed method speeds up until about 60 fps for the test video of the ISSIA dataset (resolution : 576 × 324 pixels, the number of target objects : 9). 5. Conclusion A novel method for multiple object tracking in soccer videos has been proposed in this paper. The key idea of the proposed method is to apply the concept of the topographic surface to the extracted foreground region, i.e., player, for reliably tracking multiple objects under dynamic environments of the soc-
References
Fig. 14. (a) Transform result for estimating the actual amount of movement. (b) A simple example for actual trajectories of one object during 1,000 frames. Note that numbers on each axis denote the relative distance (meter) from the zero point (see the center of the bottom line).
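The bird's-eye mapping behind Fig. 14 can be illustrated with a minimal, dependency-free sketch of a four-point homography. This is not the author's implementation: the function names, the pixel coordinates of the pitch corners, and the 105 m × 68 m pitch size are hypothetical values chosen for illustration, with the origin placed at the center of the bottom line as in Fig. 14.

```python
from math import hypot


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def estimate_homography(image_pts, field_pts):
    """3x3 homography (h33 fixed to 1) from exactly four correspondences."""
    if len(image_pts) != 4 or len(field_pts) != 4:
        raise ValueError("exactly four point pairs are expected")
    a, b = [], []
    for (x, y), (u, v) in zip(image_pts, field_pts):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def to_field(H, point):
    """Map an image point (e.g., a foot position) to field coordinates."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def travelled_distance(H, foot_positions):
    """Sum of frame-to-frame displacements in field units (meters here)."""
    pts = [to_field(H, p) for p in foot_positions]
    return sum(hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))


# Hypothetical pitch corners in the image (pixels) and on the field (meters).
IMAGE_CORNERS = [(100, 600), (3100, 600), (2400, 150), (800, 150)]
FIELD_CORNERS = [(-52.5, 0.0), (52.5, 0.0), (52.5, 68.0), (-52.5, 68.0)]

if __name__ == "__main__":
    H = estimate_homography(IMAGE_CORNERS, FIELD_CORNERS)
    track = [(1500, 500), (1520, 495), (1545, 492)]  # hypothetical foot positions
    print(travelled_distance(H, track))
```

Feeding in foot positions rather than bounding-box centers matters here, because only points on the ground plane are mapped correctly by a planar homography.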
Table 7. Comparison of the processing time for tracking 21 players in the test sequence of the ETRI K-League 2017 dataset
Methods BOOSTING [31] MIL [32] KCF [14] SORT [9] Proposed
Processing speed 0.51 fps 0.28 fps 7.29 fps 18.79 fps 17.36 fps
Implementation C++ C++ C++ Python (GPU) C++
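The GMOTA terms of Eq. (11) can likewise be sketched in a few lines. The function name and the per-frame list inputs are hypothetical. The optional `per_frame_mean` flag, which divides each sum by the number of frames, is an assumption of this sketch and is not stated in the text; with the flag off, the code follows Eq. (11) literally.

```python
from typing import Dict, Sequence


def gmota_terms(g: Sequence[int], m: Sequence[int],
                fp: Sequence[int], ids: Sequence[int],
                per_frame_mean: bool = False) -> Dict[str, float]:
    """FN, FP and GMME as sums of per-frame ratios, following Eq. (11).

    g[n]   : number of ground-truth players at frame n
    m[n]   : number of lost players at frame n
    fp[n]  : number of falsely tracked players at frame n
    ids[n] : number of identity switches at frame n
    """
    if not (len(g) == len(m) == len(fp) == len(ids)):
        raise ValueError("all per-frame sequences must have the same length")
    if len(g) == 0 or any(gn <= 0 for gn in g):
        raise ValueError("every frame needs a positive ground-truth count")
    # Assumption: optional averaging over frames; scale 1.0 is the literal Eq. (11).
    scale = 1.0 / len(g) if per_frame_mean else 1.0
    return {
        "FN": scale * sum(mn / gn for mn, gn in zip(m, g)),
        "FP": scale * sum(fn / gn for fn, gn in zip(fp, g)),
        "GMME": scale * sum(sn / gn for sn, gn in zip(ids, g)),
    }


if __name__ == "__main__":
    # Hypothetical counts for four frames with 21 players each.
    g = [21, 21, 21, 21]
    m = [1, 0, 2, 0]
    fp = [0, 1, 0, 0]
    ids = [0, 0, 0, 1]
    print(gmota_terms(g, m, fp, ids))
    print(gmota_terms(g, m, fp, ids, per_frame_mean=True))
```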
cer match. To accurately extract foreground regions, a simple scheme for adaptively selecting the result from background subtraction and edge detection is introduced. Subsequently, topographic surface-based analysis and its refinement algorithm are applied to extracted player regions for robust tracking. Based on experimental results, it is thought that the proposed method paves a way for reliably tracking multiple objects even with severe occlusions and complicated motion patterns of the soccer video. Acknowledgments This research is supported by Ministry of Culture, Sports and Tourism(MCST) and Korea Creative Content Agency(KOCCA) in the Culture Technology(CT) Research & Development Program 2016 (R2016030044, Development of Context-Based Sport Video Analysis, Summarization, and Retrieval Technologies).
[1] M. Kristan et al., A novel performance evaluation methodology for single-target trackers, IEEE Trans. Pattern Anal. Mach. Intell. 38 (11) (2016) 2137–2155.
[2] R. S. Feris, B. Siddiquie, J. Petterson, Y. Zhai, A. Datta, L. M. Brown, S. Pankanti, Large-scale vehicle detection, indexing, and search in urban surveillance videos, IEEE Trans. Multimedia 14 (1) (2012) 28–42.
[3] J. Liu, X. Tong, W. Li, T. Wang, Y. Zhang, H. Wang, Automatic player detection, labeling, and tracking in broadcast soccer video, Pattern Recognit. Lett. 30 (2) (2009) 103–113.
[4] J. Xing, H. Ai, L. Liu, S. Lao, Multiple player tracking in sports video: a dual-mode two-way bayesian inference approach with progressive observation modeling, IEEE Trans. Image Process. 20 (6) (2011) 1652–1666.
[5] T. Zhang, B. Ghanem, N. Ahuja, Robust multi-object tracking via cross-domain contextual information for sports video analysis, in: Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2012, pp. 985–988.
[6] J. Liu, P. Carr, R. Collins, Y. Liu, Tracking sports players with context-conditioned motion models, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2013, pp. 1830–1837.
[7] W.-L. Lu, J.-A. Ting, J. J. Little, P. K. Murphy, Learning to track and identify players from broadcast sports videos, IEEE Trans. Pattern Anal. Mach. Intell. 35 (7) (2013) 1704–1716.
[8] S. Baysal, P. Duygulu, Sentioscope: a soccer player tracking system using model field particles, IEEE Trans. Circuits Syst. Video Technol. 26 (7) (2016) 1350–1362.
[9] A. Bewley, Z. Ge, L. Ott, F. Ramos, B. Upcroft, Simple online and realtime tracking, in: Proc. IEEE Int. Conf. Image Process., 2016, pp. 3464–3468.
[10] N. Wojke, A. Bewley, D. Paulus, Simple online and realtime tracking with deep association metric, in: Proc. IEEE Int. Conf. Image Process., 2017, pp. 3645–3649.
[11] S-H. Lee, M-Y. Kim, S-H. Bae, Learning discriminative appearance models for online multi-object tracking with appearance discriminability measures, IEEE Access 6 (2018) 67316–67328.
[12] Y-C. Yoon, A. Boragule, Y-M. Song, K. Yoon, M. Jeon, Online multi-object tracking with historical appearance matching and scene adaptive detection filtering, in: Proc. IEEE Int. Conf. Adv. Video Signal Based Surveill., 2018, pp. 1–6.
[13] D. S. Bolme, J. R. Beveridge, B. A. Draper, Y. M. Lui, Visual object tracking using adaptive correlation filter, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 2544–2550.
[14] J. F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell. 37 (3) (2015) 583–596.
[15] M. Danelljan, G. Häger, F. S. Khan, M. Felsberg, Discriminative scale space tracking, IEEE Trans. Pattern Anal. Mach. Intell. 39 (8) (2017) 1561–1575.
[16] J. Kwon, K. Kim, K. Cho, Multi-target tracking by enhancing the kernelised correlation filter-based tracker, Electron. Lett. 53 (20) (2017) 1358–1360.
[17] H. Nam, B. Han, Learning multi-domain convolutional neural networks for visual tracking, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4293–4302.
[18] D. Held, S. Thrun, S. Savarese, Learning to track at 100fps with deep regression networks, in: Proc. Eur. Conf. Comput. Vis., 2016, pp. 749–765.
[19] X. Yang, H. Li, X. Zhou, Nuclei segmentation using marker-controlled watershed, tracking using mean-shift, and Kalman filter in time-lapse microscopy, IEEE Trans. Circuits Syst. I: Regular Papers 53 (11) (2006) 2405–2414.
[20] J. Cheng, J. C. Rajapakse, Segmentation of clustered nuclei with shape markers and marking function, IEEE Trans. Biomed. Eng. 56 (3) (2009) 741–748.
[21] Z. Zivkovic, Improved adaptive Gaussian mixture model for background subtraction, in: Proc. Int. Conf. Pattern Recognit., 2004, pp. 28–31.
[22] J. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal. Mach. Intell. 8 (6) (1986) 679–698.
[23] R. Kimmel, N. Kiryati, A. M. Bruckstein, Distance maps and weighted distance transforms, J. Math. Imaging Vis. 6 (1996) 223–233.
[24] S. Beucher, F. Meyer, The morphological approach to segmentation: the watershed transformation, Mathematical Morphology in Image Processing. New York, NY, USA: Marcel Dekker, 1993, pp. 433–481.
[25] L. Chen, M. Jiang, J. Chen, Image segmentation using iterative watersheding plus ridge detection, in: Proc. IEEE Int. Conf. Image Process., 2009, pp. 4033–4036.
[26] W. Kim, Y. B. Cho, S. Lee, Thermal sensor-based multiple object tracking for intelligent livestock breeding, IEEE Access 5 (2017) 27453–27463.
[27] W. Yu, X. Tian, Z. Hou, Y. Zha, Robust visual tracking based on watershed regions, IET Comput. Vis. 8 (6) (2014) 588–600.
[28] R. C. Gonzalez, R. E. Woods, Digital Image Processing, 2nd ed. Upper Saddle River, NJ, USA: Prentice-Hall, 2002.
[29] S. Moon, J. Lee, D. Nam, W. Yoo, W. Kim, A comparative study on preprocessing methods for object tracking in sports events, in: Proc. Int. Conf. Adv. Commun. Technol., 2018, pp. 460–462.
[30] T. D’Orazio, M. Leo, N. Mosca, P. Spagnolo, P. L. Mazzeo, A semi-automatic system for ground truth generation of soccer video sequences, in: Proc. IEEE Int. Conf. Adv. Video Signal Based Surveill., 2009, pp. 559–564.
[31] H. Grabner, M. Grabner, H. Bischof, Real-time tracking via on-line boosting, in: Proc. Brit. Machine Vis. Conf., 2006, pp. 47–56.
[32] B. Babenko, M.-H. Yang, S. Belongie, Visual tracking with online multiple instance learning, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2009, pp. 983–990.
[33] Y. Li, C. Huang, R. Nevatia, Learning to associate: HybridBoosted multi-target tracker for crowded scene, in: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2009, pp. 2953–2960.
[34] H. B. Shitrit, J. Berclaz, F. Fleuret, P. Fua, Multi-commodity network flow for tracking multiple people, IEEE Trans. Pattern Anal. Mach. Intell. 36 (8) (2014) 1614–1627.
[35] Z. Zhang, Flexible camera calibration by viewing a plane from unknown orientations, in: Proc. IEEE Int. Conf. Comput. Vis., 1999, pp. 666–673.
[36] R. I. Hartley, Self-calibration from multiple views with a rotating camera, in: Proc. Eur. Conf. Comput. Vis., 1994, pp. 471–478.
Highlights
➢ Adaptively using results of background subtraction and edge detection
➢ Applying topographic surface-based segmentation to the tracking problem
➢ Preserving and updating the tracking model by efficiently removing background
➢ Working in real time without requiring any parallel processing (e.g., GPU)
Conflict of Interest and Authorship Conformation Form
Please check the following as appropriate:
⚫ All authors have participated in (a) conception and design, or analysis and interpretation of the data; (b) drafting the article or revising it critically for important intellectual content; and (c) approval of the final version.
⚫ This manuscript has not been submitted to, nor is under review at, another journal or other publishing venue.
⚫ The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript
⚫ The following authors have affiliations with organizations with direct or indirect financial interest in the subject matter discussed in the manuscript:
Author’s name    Affiliation
Wonjun Kim       Konkuk University (South Korea)