Appearance-based multiple hypothesis tracking: Application to soccer broadcast videos analysis


Author’s Accepted Manuscript

M. Manafifard, H. Ebadi, H. Abrishami Moghaddam

www.elsevier.com/locate/image

PII: S0923-5965(17)30067-X
DOI: http://dx.doi.org/10.1016/j.image.2017.04.001
Reference: IMAGE15208

To appear in: Signal Processing: Image Communication
Received date: 4 May 2016
Revised date: 31 October 2016
Accepted date: 3 April 2017

Cite this article as: M. Manafifard, H. Ebadi and H. Abrishami Moghaddam, Appearance-based Multiple Hypothesis Tracking: Application to Soccer Broadcast Videos Analysis, Signal Processing: Image Communication, http://dx.doi.org/10.1016/j.image.2017.04.001

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


M. Manafifard (a), H. Ebadi (b), H. Abrishami Moghaddam (c)

(a) PhD Student, Dept. of Photogrammetry and Remote Sensing, K. N. Toosi University of Technology, Valieasr Street, Tehran, Iran
(b) Professor, Dept. of Photogrammetry and Remote Sensing, K. N. Toosi University of Technology, Valieasr Street, Tehran, Iran
(c) Professor, Dept. of Electrical Engineering, K. N. Toosi University of Technology, Seyedkhandan Street, Tehran, Iran

Abstract

Soccer is one of the most popular sports in the world, and demand for automatic analysis of matches and tactics is growing. Since players are the focus of attention in soccer videos and drive the entire game, player tracking is fundamental to most soccer video analysis. This paper introduces an efficient implementation of the multiple hypothesis tracking (MHT) algorithm and evaluates its usefulness for soccer player tracking. In contrast to conventional MHT, which carries an inherent linear assumption and ignores appearance cues and occlusions, our approach relies on an appearance-based MHT (AMHT) framework that incorporates particle swarm optimization (PSO) to account for appearance, nonlinear movements and occlusions. Experimental results demonstrate the efficiency and robustness of the algorithm.

Key words: PSO, appearance, MHT, soccer, player, tracking

1. Introduction

Given the growing popularity of soccer around the world, automatic soccer video analysis is required to facilitate the semantic extraction demanded by sports professionals and fans. Soccer video analysis has been applied to a broad range of applications, such as player trajectory and covered distance extraction, content retrieval and indexing, summarization, highlight detection, 3D reconstruction of the soccer match, animations, replays of goals from arbitrary views, virtual content insertion, visualization generation, editorial content creation and content enhancement, content-based video compression, tactical analysis, analysis of attack or goal patterns, statistical evaluations, player action recognition, verification of referee decisions, adapting the training plan, and evaluating the strengths or weaknesses of a team or player. These high-level applications rely heavily on player detection and tracking, which can provide considerable convenience and information for viewers, trainers and sports professionals.
However, soccer player tracking remains challenging due to factors such as motion blur, similar appearance of players, complex interactions and severe occlusions (e.g. during a corner scene), noise and clutter, an unconstrained outdoor environment, changes in player appearance, low pixel resolution (especially for small distant players), a changing background, a varying number of players with unpredictable movements, edited broadcast video, and abrupt camera motion and zoom. In commercial applications, motion capture can be achieved by tracking reflective or magnetic markers or GPS devices [1] on the player's body. These are not always feasible in sports, where they can affect player movement or are not allowed. As a solution, computer vision techniques aim to dispense with such markers.

2. Related works

Player tracking methods are very diverse, and most have limitations. Point trackers [2-5] performed well through partial occlusions, but interest points might be hard to detect and match for distant or blurred players. Snakes [6-8] were limited by their sensitivity to parameters, contour initialization, occlusion and non-smooth shape variation. On the other hand, player tracking can be performed by searching for an optimal path in a graph [9-15]. Accordingly, a minimal path search was applied in [3]. Although graphs provided a useful tool for occlusion resolution, the number of look-ahead or look-back frames had to be increased for long-term occlusions. In [16], a neighborhood graph was built, and its specific structure was exploited to reach the optimum trajectories using the PSO algorithm. Multi-object tracking based on appearance cues was also formulated as a multi-commodity network flow (MCNF) problem on a directed acyclic graph (DAG) [17]. The estimation was performed using linear programming, and a reduced graph was obtained by grouping spatio-temporal locations into tracklets (TMCNF).

An obvious advantage of meanshift [18] over standard template matching [19] was that it avoided brute-force search. However, it required that a portion of the player be inside the initial search window. Since player maneuvers were difficult to represent with a single maneuver model, meanshift was employed for three motion models in [20]. One pseudo-measurement, obtained by fusing the resulting measurements, was then used to drive an interacting multiple model (IMM) filter.

On the other hand, nonlinear and unpredictable player movement might be problematic for the Kalman filter (KF). In [21], one main player was tracked, and the other players were tracked with respect to the main player by defining their positions relative to the main player in the KF state vector. In [22], human intervention was applied to improve KF performance in challenging scenes. The methods in [23-26] tracked players via KF in sequences captured by static cameras, which alleviated at least the effects of camera motion. As a solution, the particle filter (PF) has been applied [27]. In [28], instead of assigning separate particles to each player, densely sampled particles on the field model were shared among all the players. Interactions among the players were then considered by globally evaluating the likelihood of locating players on the particles. Probabilistic support vector classification (PSVC) was also combined with an SVR particle filter to handle occlusion among the players [29]. The main drawback of PF was its dependence on the number of particles, and the precision of player localization by PF needs to be improved. Three methods, namely color-based PF, meanshift and Kalman-meanshift, were compared in [30]. Although Kalman-meanshift reduced meanshift iterations, it failed during nonlinear motions.
In contrast, nonlinear movement was dealt by PF, and tracking was more reliable due to its multiple hypotheses. However, a more precise localization was achieved by the other two trackers, since the location was estimated by the mean value of particles in PF. Multi-player tracking involved the problem of data association to determine which measurements were generated by which players (predicted states). The graph representation implicitly retained association, but it relied on an ad hoc technique. In [31], multi-player tracking was performed by finding local maxima on a confidence map using a greedy heuristic method, which applied the Kalman prediction as the starting position. However, a single Kalman filter was applied for each player ignoring data association and total occlusion problems. Several data association algorithms have been proposed in the literature. They can be divided into single frame association methods, which make assignments based only on the current frame (e.g. NN, GNN), or multi-frame association methods (e.g. MHT). They can also be divided into single target association (e.g. NN) or multi-target association (e.g. GNN, JPDA and MHT). Moreover, the choice of the association algorithm depends on the particular application. Most of the player tracking algorithms ignored association or assumed that the association was trivial so that a nearest neighbor could be effective. Two most common single frame association methods were the nearest neighbor (NN) and global nearest neighbor (GNN) [32]. Accordingly, NN has been applied along with KF [33]. An obvious advantage of NN was its simplicity; however, it might lead several close players being associated with the same measurement. On the other hand, GNN [29] was a global version of the NN by considering assignment for all players simultaneously under the constraint that an observation could be associated with at most one track. 
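The difference between NN and GNN described above can be illustrated with a minimal sketch. It is not taken from any of the cited trackers: the function names are hypothetical, plain Euclidean distance stands in for whatever association score a real tracker would use, and GNN is solved by exhaustive search, which is only practical for a handful of players.

```python
from itertools import permutations
from math import hypot


def distance(a, b):
    """Euclidean distance between two (x, y) points."""
    return hypot(a[0] - b[0], a[1] - b[1])


def nearest_neighbor(predictions, measurements):
    """NN: each prediction independently picks its closest measurement."""
    return [
        min(range(len(measurements)), key=lambda m: distance(p, measurements[m]))
        for p in predictions
    ]


def global_nearest_neighbor(predictions, measurements):
    """GNN by exhaustive search: lowest total distance, each measurement used at most once.

    Assumes len(measurements) >= len(predictions).
    """
    best_cost, best_assignment = float("inf"), None
    for assignment in permutations(range(len(measurements)), len(predictions)):
        cost = sum(distance(p, measurements[m]) for p, m in zip(predictions, assignment))
        if cost < best_cost:
            best_cost, best_assignment = cost, list(assignment)
    return best_assignment


# Two closely spaced predicted players and two measurements (toy values).
predictions = [(0.0, 0.0), (2.0, 0.0)]
measurements = [(1.2, 0.0), (5.0, 0.0)]
nn_result = nearest_neighbor(predictions, measurements)
gnn_result = global_nearest_neighbor(predictions, measurements)
```

In this toy case NN assigns both players to the first measurement, while GNN enforces the uniqueness constraint and spreads them over the two measurements.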
However, GNN might still fail in the case of closely spaced players and a high number of false measurements. One approach to improving GNN performance was joint probabilistic data association (JPDA) [34], in which a track was updated by a weighted sum of all observations in its gate. The main limitation of the original JPDA was its inability to perform track initiation and deletion; it was appropriate only when the number of tracks was known and remained fixed throughout the sequence, which is not the case in soccer broadcast videos. In order to track players from static cameras in a world coordinate system, JPDA with KF was applied in [35]. Moreover, multi-player tracking with Markov chain Monte Carlo data association (MCMCDA) was presented in [36, 37], where the whole detection and labeling results were taken as observations. The best association was then represented by a configuration on a neighborhood graph that optimized a Gibbs distribution. As a result, MCMCDA relaxed the one-to-one correspondence between observations and players. However, detection-based trackers gave poor performance, since detection was not reliable.

MHT was originally developed by [38] for tracking multiple targets in clutter, and its version by [39] was later applied to soccer sequences [40-42]. MHT is a statistical association algorithm capable of track initiation, track termination, track continuation, spurious measurement handling and preservation of the uniqueness constraint. The data association can also be postponed to a later time step in order to resolve uncertainties. Generally, MHT includes hypothesis matrix creation, hypothesis generation, calculation of hypothesis probabilities, a KF associated with each target, and hypothesis management [43, 44]. MHT can be divided into two main categories, namely hypothesis-oriented MHT (HOMHT) and track-oriented MHT (TOMHT) [32].
The former was the original MHT, in which each hypothesis included a set of possible observation-to-track assignments. TOMHT differs in that each track is represented by one tree, where the root denotes the birth of a target and the branches denote the different track hypotheses for it. MHT has also been refined to assign more than one player to a measurement [42]. An improvement of MHT using a modification of Murty’s algorithm was presented in [45], which permitted the association of one player with multiple measurements and vice versa. As a result, multiple hypotheses outperformed a single hypothesis in the presence of occlusions and noise. Although MHT yielded better results than methods with a single association hypothesis, appearance information was ignored on the assumption that little appearance information was available. MHT was widely and successfully used for point tracking with radar [46] and with laser range scanners or laser range finders [47-49]. Because few MHT variants addressed the lack of occlusion handling and appearance cues (e.g. Haar features, optical flow, HSV-LBP histogram) in applications such as pedestrian tracking [50-52], MHT attracted little attention in visual tracking despite its capabilities. MHT has also been integrated with other trackers, such as graph representations, to compensate for its weaknesses (i.e., handling appearance and occlusion situations) [53, 54]. Appearance cues were also inserted into MHT by computing appearance scores from the detection results [50, 51]. Since partially occluding players were not isolated by previous player detectors [12, 25, 36, 45], the appearance score may be distorted by partial occlusions. In previous papers, appearance cues were mostly used to reduce ambiguities among multiple nearby objects, and they rarely helped in tracking and isolating partially occluding objects. However, appearance cues are mainly beneficial in the case of partially occluding competitors. In this paper, appearance cues are applied to isolate partially occluding competitors.

3. Algorithm

In this paper, a modified MHT framework is proposed for player tracking. Since Kalman prediction is the main component of MHT, erratically moving players may be problematic for it, and it does not seem the most appropriate method in the case of sudden changes in the direction and velocity of players or the camera. Moreover, target appearance and occlusions are ignored in conventional MHT, which was designed for tracking small targets.
However, the colors of players’ uniforms provide useful information, and occlusions are common in soccer sequences. Furthermore, the appearance models are most needed when the players are bunched together. Therefore, appropriate modifications have to be made to the MHT. The measurements for many trackers (e.g. MHT, graph and KF) were provided by an independent player detection step. Accordingly, background subtraction has been widely used [12, 25, 45]; however, it was typically applied for sequences captured by static cameras. Generally, detection-based trackers gave poor performance, since any error in the detection step might affect the tracking results. For instance, missed players during the detection step by Adaboost caused missing the player track by MCMCDA [36]. In this paper, most of the errors resulting from missed detections can be resolved by incorporating PSO-based detection inside the tracking framework. On the other hand, one main drawback of exhaustive brute search for players is the need to investigate players in every location of the solution space (four dimensional space regarding player position and size). A possible solution is performing a partial exploration of the image by means of detected blobs and random exploration of swarm particles. The novelty of the proposed approach relies mainly on introducing an effective appearance-based MHT framework, which is capable of handling occlusions. For this purpose, the PSO concept is introduced into the MHT scheme. In the first step, measurement data is approximated by a grass-based blob detection. The players are then predicted, and the validation gates are computed. Afterwards, the ambiguous measurements are identified, and the proposed PSO search is performed to partially correct for them and extract appearance cues. As a result, approximate association is computed by forming hypotheses based on the partially corrected measurements. 
The best assignment is then used by PSO post-processing step for applying more correction on measurements, and an auxiliary partial occlusion resolution step is introduced to correct for partial occlusions in big blobs. As a result, accurate association is performed based on the ultimate modified measurements including the occluding ones. Finally, the hypotheses are updated using the ultimate association. The proposed framework is shown in Fig. 1. As illustrated, the common association solution in the MHT is replaced by the measurements classification, measurements refinement and accurate association blocks (dotted rectangle) in the AMHT to achieve a more reliable result. The main contributions of this paper are: 1) An effective AMHT framework is introduced to account for players’ appearance, nonlinear movements and occlusions. 2) A blob-guided PSO player detection is introduced into the MHT, which is capable of detecting and labeling multiple players and resolving partial occlusions simultaneously. Only teammates in occluding blobs are detected by erasing the previously detected ones. This paper is organized as follows. Section 4 is devoted to measurements. Kalman prediction and gating computation are described in Section 5. A detailed description of modified MHT for data association is presented in Section 6. Updating hypotheses regarding data association results is discussed in Section 7. The experimental results and analyses are given in Section 8 and conclusions are drawn in the last section.

[Figure 1: flowchart. Blocks: new frame; measurements (playfield detection, blob detection); Kalman prediction; gating computation; classification of measurements into ambiguous and definite; first measurements refinement (PSO initialization and PSO search in ambiguous measurements); approximate association; second measurements refinement (PSO post-processing, auxiliary partial occlusion resolution); accurate association; update and save of the hypotheses from time k−1 to time k.]

Fig. 1. Flowchart of the proposed AMHT algorithm. The hypothesis symbol in the flowchart denotes an indexed association hypothesis at time k.

4. Measurement

In order to generate candidate player regions, Gaussian mixture models are adopted to detect the green grass [55]. After the grass and the spectators (i.e., regions outside the convex hull of the field) are eliminated, the external contours of objects in the binary image are traced, and the detected blobs are sent as approximate measurements to the next steps (Fig. 2-b). Moreover, a list of confirmed blobs is generated, consisting of blobs with a reasonable area, major and minor axis length, height-width ratio (>1) and amount of grass. Accordingly, windows in which the number of blob pixels is less than half of the window area (i.e., windows with a large amount of grass pixels) are kept out of the list.
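The confirmation test for blobs can be sketched as follows. This is an illustrative simplification, not the authors' code: the `Blob` fields and the area bounds are hypothetical, the bounding box stands in for the window, and the separate checks on major and minor axis length are omitted.

```python
from dataclasses import dataclass


@dataclass
class Blob:
    width: int        # bounding-box width in pixels
    height: int       # bounding-box height in pixels
    blob_pixels: int  # non-grass (foreground) pixels inside the box


def is_confirmed(blob, min_area=200, max_area=5000):
    """Return True if the blob looks like a single upright player.

    min_area and max_area are assumed values chosen for illustration only.
    """
    if blob.width <= 0 or blob.height <= 0:
        return False
    area = blob.width * blob.height
    if not (min_area <= area <= max_area):
        return False                      # unreasonable size
    if blob.height / blob.width <= 1.0:
        return False                      # height-width ratio must exceed 1
    if blob.blob_pixels < 0.5 * area:
        return False                      # more than half of the window is grass
    return True


confirmed = [b for b in [Blob(20, 45, 600), Blob(40, 20, 700)] if is_confirmed(b)]
```

Only the first blob in the usage line passes, since the second is wider than it is tall.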

[Figure 2: two image panels, (a) and (b), with numbered blobs; axis tick labels omitted.]

Fig. 2. a) Original frame. b) Detected measurements.

5. Kalman prediction and gating computation

Each player is predicted at each frame using a Kalman filter, and a validation gate is placed around the predicted player. As a result, the probability of associating a prediction with an observation outside its validation gate is forced to zero. The validation gate of the j-th player is defined as an ellipse (Fig. 3) centered at the predicted position, whose major and minor axes are defined as follows:

[Eq. (1): major and minor axis lengths of the validation gate, computed from the predicted player size and velocity with scale coefficients, and capped at a maximum allowable size.]

where the two scale coefficients control the size of the validation gate, up to its maximum allowable size. The velocity terms in (1) enlarge the gate for running players and shrink it for slow-moving players. The predicted player size (i.e., the major and minor axis lengths of the elliptical window) also avoids zero-size validation gates for players that appear still due to camera movements.

Next, the allowed total occlusion set (pairs of players who can be in total occlusion) is defined using the predicted positions of the players. A pair of players far from the borders (i.e., not outgoing players) is admitted to the allowed total occlusion set if the players are closely spaced (e.g. mutual distance < 30 pixels), intersect each other, or intersect common measurements. Moreover, two players (except for players intersecting a common measurement) cannot be in total occlusion with each other if one or both of them were in total occlusion with other players in the previous time step. In this way, the continuation of false alarms is avoided. Afterwards, the measurements corresponding to players in the allowed total occlusion set are stored as allowed occluding measurements (measurements which may correspond to total occlusions). The outputs of this step are therefore the predicted players, the allowed total occlusion set and the allowed occluding measurements.

6. MHT modification to data association

Following the prediction of the players’ states, association is performed. A detailed description of MHT can be found in [7, 8]. In this section, AMHT is introduced for association in the presence of occlusions using both appearance and motion cues. After the measurements are classified into ambiguous and definite, a two-step measurements refinement process is performed. The details are elaborated in the following sections.

6.1. Measurements classification

Consider the set of new measurements (i.e., the blobs detected in Section 4) at time k and the j-th predicted player. Although the association is straightforward for widely separated players, ambiguity arises when 1) multiple measurements are inside a predicted player’s gate (Fig. 3-a), 2) a measurement is inside the gates of multiple predicted players (Fig. 3-b), 3) a measurement may correspond to a new player (i.e., a measurement close to the borders and arms) or a missed player (i.e., a confirmed blob (Section 4) associated with no prediction by nearest neighbor data association) (Fig. 3-c), or 4) a measurement is imprecise (Fig. 3-d) (e.g. players merged with line segments). An imprecise measurement is identified whenever the size ratio (i.e., of the major or minor axis) between measurement and prediction exceeds a threshold. The above measurements are added to the ambiguous measurements list. Moreover, labeling errors might occur for players stuck to the image borders. Therefore, measurements corresponding to players that were stuck to the borders in the previous frame and are apart from the borders in the current frame are also added to the ambiguous measurements list to compensate for labeling errors. Finally, a measurement and a prediction may be associated with high probability. A measurement is therefore removed from the ambiguous measurements list if it overlaps strongly with a prediction (e.g. by more than 50%) while neither overlaps any other prediction or measurement.

After the ambiguous measurements are determined, candidate labels (probable teams) are generated for each of them. As a result, PSO will only search for the candidate labels within each ambiguous measurement in the following steps. If the ambiguous measurement is inside the gates of one or more predicted players, the labels of these players are stored as its candidate labels. In Fig. 3-b, the first and second teams will be stored as candidate labels for z1 if t1 and t2 are players of the first and second teams. However, all teams are assumed as candidate labels for an ambiguous measurement that probably corresponds to a new or missed player. In order to reduce ambiguities, all the ambiguous measurements and their candidate labels are sent as input to the proposed PSO.
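The gating test and the ambiguity rules of this section can be sketched as below. This is a simplified illustration under stated assumptions rather than the paper's implementation: the gate is taken as an axis-aligned ellipse with given half-width and half-height, every measurement outside all gates is treated as a possible new or missed player, the size ratio is checked on height only, and the threshold value 1.5 and all names are hypothetical. The border rule and the final removal of highly overlapping pairs are left out.

```python
from collections import namedtuple

Prediction = namedtuple("Prediction", "x y half_w half_h height")
Measurement = namedtuple("Measurement", "x y height")


def in_gate(m, p):
    """True if the measurement center lies inside the elliptical gate of p."""
    return ((m.x - p.x) / p.half_w) ** 2 + ((m.y - p.y) / p.half_h) ** 2 <= 1.0


def ambiguous_measurements(measurements, predictions, size_ratio_max=1.5):
    """Return the indices of measurements that need refinement."""
    # Gates that contain each measurement, and measurement count per gate.
    gated = [[j for j, p in enumerate(predictions) if in_gate(m, p)] for m in measurements]
    per_gate = [sum(j in g for g in gated) for j in range(len(predictions))]
    ambiguous = set()
    for i, (m, g) in enumerate(zip(measurements, gated)):
        if len(g) != 1:
            ambiguous.add(i)  # several gates (case 2) or no gate (case 3, simplified)
            continue
        p = predictions[g[0]]
        ratio = m.height / p.height
        shares_gate = per_gate[g[0]] > 1                                    # case 1
        imprecise = ratio > size_ratio_max or ratio < 1.0 / size_ratio_max  # case 4
        if shares_gate or imprecise:
            ambiguous.add(i)
    return ambiguous


predictions = [Prediction(100, 100, 30, 40, 40),
               Prediction(300, 100, 30, 40, 40),
               Prediction(500, 100, 30, 40, 40)]
measurements = [Measurement(105, 100, 42),   # alone in the first gate: definite
                Measurement(295, 95, 40),    # shares the second gate
                Measurement(310, 110, 38),   # shares the second gate
                Measurement(700, 300, 40),   # outside every gate
                Measurement(505, 100, 90)]   # much taller than its prediction
result = ambiguous_measurements(measurements, predictions)
```

Only the first measurement is left as definite in this toy scene; the others would be passed on to the PSO refinement.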

[Figure 3: four panels, (a) to (d), showing measurements (z1, z2) and the gates of predicted players (t1, t2).]

Fig. 3. Ambiguous measurements. a) Multiple measurements (z1 and z2) inside a predicted player’s gate (t1). b) A measurement (z1) inside the gates of multiple predicted players (t1 and t2). c) A new or missed measurement (z2). d) Imprecise measurement (z1).

6.2. First measurements refinement

A new search scheme using PSO is introduced in this section to partially correct ambiguous measurements and extract appearance cues (i.e., fitness and labels). The ideas behind the proposed search algorithm, including the constrained initialization of particles, the fitness evaluation function and the PSO search, are described in the following sections.

6.2.1. Particle swarm optimization (PSO) algorithm

PSO [56] is an optimization technique in which a set of particles moves in search of the optima of a function. Each particle moves with an associated position and velocity, adjusted according to its own experience (local search) as well as the best experience of the group (global search). All ambiguous measurements are sent as input to the blob-guided PSO detection algorithm, which localizes players within the ambiguous blobs. For this purpose, the global swarm is divided into sub-swarms, which are initialized by the constrained initialization. Then, the most player-like regions are searched within the ambiguous blobs by all the sub-swarms simultaneously. The details of blob-guided PSO detection are elaborated in the following sections.

6.2.2. Constrained initialization

PSO is initialized with a group of particles distributed to cover the possible solution space. Each particle corresponds to a candidate solution of the problem, and it represents an ellipse enclosed by the player bounding box. In this paper, a constrained propagation is applied to reduce the search area. Firstly, the global swarm is divided into smaller swarms (sub-swarms) according to the number of ambiguous blobs detected in a particular frame. Then, the particles of each sub-swarm are placed as uniformly as possible within their corresponding ambiguous blob. As a result, only the area of each ambiguous blob is searched for players instead of the whole image, and the aggregation of particles corresponding to teammates in different blobs is avoided.

Secondly, constrained initialization tightens the initial allowable range for player size. Since the main camera is set up on one side of the pitch, player size varies from largest in the nearest part of the pitch to smallest in the furthest part. Therefore, the minimum and maximum sizes of a player located between the furthest and the nearest confirmed bounding boxes (Section 4) are set to the sizes of the furthest and nearest confirmed bounding boxes, respectively. On the other hand, the maximum size of a player above the furthest confirmed bounding box is set to the size of the furthest confirmed bounding box, while the minimum size of a player below the nearest confirmed bounding box is set to the size of the nearest confirmed bounding box. The sizes of the predictions are also useful for tightening the interval when the number of confirmed bounding boxes is not adequate. Returning to the initial allowable values can improve detection for blobs that are stuck to the borders and smaller than the initial allowable range (e.g. a fallen player). After the range of player size is set, vectors of equally spaced values within the range are created for player size. Moreover, the bounding box enclosing the confirmed ambiguous blob is set as a size candidate. Following the particle size generation, the generated minor (major) axes are assigned to the particles in each sub-swarm. As shown in Fig. 4, players in occluding blobs were missed by random propagation, while they were successfully detected using constrained propagation, since the search space was appropriately covered by the initial particles. Dispersing particles over a relatively smaller search area also accelerates the search for players. At the end of this step, the initial values (position and size) of the PSO particles are generated.
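The two ideas of constrained initialization, splitting the swarm over the ambiguous blobs and spreading positions and sizes evenly, can be sketched as follows. The grid placement, the function names and the particle layout `(x, y, width, height)` are assumptions made for illustration; the paper does not specify how "as uniformly as possible" is achieved, and the size ranges here are simply passed in rather than derived from confirmed bounding boxes.

```python
from math import ceil, sqrt


def split_swarm(n_particles, n_blobs):
    """Divide the global swarm into sub-swarms of nearly equal size, one per blob."""
    base, extra = divmod(n_particles, n_blobs)
    return [base + (1 if b < extra else 0) for b in range(n_blobs)]


def linspace(lo, hi, n):
    """n equally spaced values covering [lo, hi]."""
    if n == 1:
        return [(lo + hi) / 2.0]
    step = (hi - lo) / (n - 1)
    return [lo + k * step for k in range(n)]


def init_sub_swarm(bbox, n, width_range, height_range):
    """Place n particles on a regular grid inside the blob bounding box.

    bbox is (x0, y0, w, h); each particle is (x, y, width, height).
    """
    x0, y0, w, h = bbox
    cols = ceil(sqrt(n))
    rows = ceil(n / cols)
    widths = linspace(width_range[0], width_range[1], n)
    heights = linspace(height_range[0], height_range[1], n)
    particles = []
    for k in range(n):
        r, c = divmod(k, cols)
        x = x0 + (c + 0.5) * w / cols   # cell centers keep particles inside the blob
        y = y0 + (r + 0.5) * h / rows
        particles.append((x, y, widths[k], heights[k]))
    return particles


sizes = split_swarm(8, 3)
particles = init_sub_swarm((100, 50, 40, 80), 4, (10, 16), (30, 48))
```

With eight particles and three ambiguous blobs, the sub-swarms receive three, three and two particles; each sub-swarm is then initialized inside its own blob only.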

[Figure 4: image panels (a) and (b).]

Fig. 4. Particle distribution and corresponding detection results (population size = 8). (a) Random distribution. (b) Constrained distribution.

6.2.3. Fitness

In order to allow for an efficient evolution of PSO, a suitable fitness function is essential. The higher the fitness value, the more similar the particle window is to the corresponding player. In this paper, the fitness function for a particle with respect to the i-th team is formulated as a combination of color and gradient features:

[Eq. (2): fitness of a particle with respect to the i-th team, combining the color cost, the gradient cost and the repulsion factor through the weighting coefficient λ.]

where λ is a weighting coefficient, and the remaining terms are the cost based on the color of the players’ uniforms, the gradient cost, and the repulsion factor penalizing lines and grass regions. In order to define the color cost, multiple player models, each associated with one team, are established. These windows are named team models. The appearance model for each team (team A, team B and referee) is manually selected from one frame of each video. The model can easily be extended to include goalkeepers. Nevertheless, appearance models are not defined for goalkeepers: they often wear uniforms similar to those of the two teams or the referee, and their trajectories are not informative because of their restricted movements and their absence from most frames. After the models are defined, the color histograms of each particle and of the player model are built. Then, the Bhattacharyya coefficient (BC) [30] between the particle histogram ($h_p$) and the i-th model histogram ($h_i$) is used to define the similarity measure:

$$C_{\mathrm{color}}(p, i) = \sqrt{1 - BC(h_p, h_i)} \qquad (3)$$
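A color cost built on the Bhattacharyya coefficient can be sketched as below. The form `sqrt(1 - BC)` is an assumption (a common distance derived from BC), the histograms are normalized to unit sum here so that BC stays in [0, 1], and the three-bin histograms and function names are toy values for illustration.

```python
from math import sqrt


def normalize(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = float(sum(hist))
    return [v / total for v in hist]


def bhattacharyya_coefficient(p, q):
    """BC of two normalized histograms: 1 for identical, 0 for disjoint."""
    return sum(sqrt(a * b) for a, b in zip(p, q))


def color_cost(particle_hist, model_hist):
    """Assumed cost form: small when the particle resembles the team model."""
    bc = bhattacharyya_coefficient(normalize(particle_hist), normalize(model_hist))
    return sqrt(max(0.0, 1.0 - bc))  # clamp guards against rounding slightly above 1


def best_label(particle_hist, team_models):
    """Pick the team whose model gives the minimum color cost."""
    return min(team_models, key=lambda name: color_cost(particle_hist, team_models[name]))


team_models = {"A": [8, 1, 1], "B": [1, 8, 1], "referee": [1, 1, 8]}
label = best_label([7, 2, 1], team_models)
```

Choosing the label with the minimum cost mirrors the relabeling idea used later in the search, where a detection is assigned to the model it resembles most.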

The color term favors the search for the uniform colors most similar to each team. The histogram is built by weighting and concatenating the color histograms of each body part (i.e., shirt, shorts and socks) and normalizing the resulting vector to unit length. Since the gradient directions of points on line segments within the detection window are mostly similar, this property can be exploited to penalize particles that include line segments. The next step is to build a histogram of the gradient directions of the points in the window and define the gradient cost using relative histogram values as follows:

[Eq. (4): gradient cost of a particle, computed from the values of its gradient-direction histogram relative to the value at the peak bin.]

where the histogram terms are the values of the gradient-direction histogram of the particle at a given bin and at the bin with the peak value, respectively. This term penalizes particles that include line segments and is particularly beneficial for detecting players in occluding blobs that include line segments. The repulsion factor also penalizes the presence of large grass regions and lines within the window as follows:

[Eq. (5): piecewise repulsion factor with two rows, one for windows containing too few blob pixels and one for windows whose gradient histogram resembles that of a line.]

where the two regions involved are the elliptical window (particle) and its associated blob, and area(A) returns the number of pixels inside region A. Accordingly, windows whose number of blob pixels is less than 60% of their area inside the convex hull (first row) and windows with a gradient histogram similar to that of lines (second row) are penalized.

6.2.4. PSO search

Following the initialization of the particles, the search step is formulated as a four-dimensional optimization problem, solved using a blob-guided PSO algorithm to find the best match of the predefined player models in the ambiguous blobs. For this purpose, a swarm of particles flies through the image to find the detection windows with the best fitness values. The main weakness of PSO for multi-player detection is that the particles cannot converge to all the different solutions; instead, they migrate towards some local maxima. As a result, some players are missed, and others may not be effectively localized. The problem persists even when an independent swarm is applied for each player, due to the presence of teammates in the same uniforms. To overcome this problem, a blob-guided multi-player detection is applied, in which the global swarm is divided into sub-swarms and each ambiguous blob is searched by its corresponding sub-swarms.

The search step is summarized in Algorithm 1, and the notations are given in Table 1. The initial search is performed first: all the initialized sub-swarms within ambiguous blobs are put together, and an identifier is assigned to each particle indicating the sub-swarm to which it belongs. Furthermore, the initialized swarm is tripled to search for three models corresponding to the competitors and the referee (line 1). Then, PSO performs its evolution in the outermost loop of Algorithm 1, and particles start moving in the solution space. At each iteration, particles corresponding to the two teams and the referee move in their corresponding ambiguous blob to search for the optimum of a cost function. If the t-th team is not among the candidate labels of any ambiguous measurement, the search for that team will be terminated (line 4). Otherwise, the cost value for particle according to t-th model will be computed (line 8). The lower the cost value, the more similar the particle is to the model. Moreover, a large cost value (e.g. 10000) is assigned to each particle outside its corresponding ambiguous blob (line 9). In this way, particles of independent sub-swarms are prevented from merging. Each particle searching for the t-th team keeps track of the best solution it has found so far by updating the local best solution (line 10). In order to update each particle based on its own sub-swarm, the global best is found independently for each sub-swarm. For this purpose, the global best of the sub-swarm corresponding to the particle searching for the t-th team is updated at the last particle of the sub-swarm in the inner loop (line 11). Given the local and global best values, the particle updates its velocity and state at the next iteration (line 5). Therefore, the global best of each sub-swarm searching for the t-th team in the b-th ambiguous blob indicates the state of the player detected by the sub-swarm at the last iteration.
In addition, cost values of detected players by each sub-swarm are generated (line 11). In this research, particles keep moving around until a termination condition (i.e., reaching the maximum number of iterations or convergence (line 7)) is satisfied. Moreover, the particle will stop moving if the t-th team is not among the candidate labels for the particle’s blob (line 7). If a blob does not contain the searched team, the sub-swarm may converge to a different team. Therefore, detected players are relabeled according to the model associated with minimum cost using (3) (line 15). Afterwards, each player is identified with the player detected when searching for its prediction label in the ambiguous blob. Then, the cost and label for each player are extracted from the computed costs and modified labels, respectively. As a result, computed states, costs and modified labels are generated (line 16).

Table 1. Notations for Algorithm 1 and 2.
Denote input image (I) and the set of new measurements (i.e., detected blobs) at time k.
Denote detected ambiguous blobs, ambiguous blob corresponding to i-th particle and number of ambiguous blobs, respectively.
Denote players’ models, t-th model and number of team models (m), respectively.
Denote initialized particles, initialized particles within k-th ambiguous blob and total number of particles in all blobs, respectively.
Candidate labels for ambiguous blobs (Section 6.1); indicates PSO search in initial detected blobs or occluding blobs.
Maximum number of iterations (T); GNI{t} indicates going to the next iteration of the {t} loop.
Denote the state of i-th particle during the search for t-th model, the local best of i-th particle and the global best of the sub-swarm corresponding to i-th particle during the search for t-th model, respectively.
Update the local best and global best values, respectively.
Computes cost value for i-th particle according to t-th model using (2).
Denote the last particle of the sub-swarm corresponding to i-th particle and candidate labels for the blob corresponding to i-th particle, respectively.
Denote state and cost value of detected player in b-th ambiguous blob after searching for t-th team, respectively.
Relabels detected players.
Detected player after searching for prediction label in b-th ambiguous blob; compute cost and label for each player based on the computed costs and modified labels, respectively.
Denote final states, costs and modified labels after the search step.
Hypothesis matrix (H); measurements assigned to predictions by the Hungarian; validation gates of all players (VG) and of j-th player, respectively.
Denote corrected measurements, labels and removed set after the PSO post-processing step.
Computes labels of corrected measurements.
Returns index of removed measurements using the threshold and the corrected measurements.
Allowed occluding measurements (Section 5); outputs the valid occluded players.

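Algorithm 1. Summary of the search step for blob-guided PSO multi-player detection algorithm.
Input: I, , , , , T
Particles initialization: assign to each particle the identifier of its sub-swarm;
1    Triple the initialized swarm for the three models (two teams and referee);
2    for It=1 to T do
3      for t=1 to m do
4        if the t-th team is not among the candidate labels of any ambiguous blob then GNI{t};
5        if It > 1 then Update the velocity and state of the particles;
6        for i=1 to N do
7          if the t-th team is not among the candidate labels of the particle's blob or (the particle is converged) then GNI{i};
8          compute the cost of i-th particle for t-th model using (2);
9          if the particle is outside its ambiguous blob then assign a large cost (e.g. 10000);
10         update the local best of i-th particle;
11         if i is the last particle of its sub-swarm then do: update the global best of the sub-swarm; output the state of the detected player and the corresponding cost;
12       end
13     end
14   end
15   relabel the detected players according to the model with minimum cost using (3);
16   for each player, take the detected player found when searching for its prediction label in the ambiguous blob, and extract its cost and label from the computed costs and modified labels;
Output: final states, costs and modified labels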
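The sub-swarm idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a two-dimensional state (position only, rather than the four-dimensional position and size state of the paper), axis-aligned boxes standing in for blobs, a hypothetical toy cost function, illustrative PSO parameters, and a global best that is updated as soon as a particle improves rather than once per iteration. What it does retain is that each blob is searched by its own sub-swarm, and that particles leaving their blob receive a large cost so that sub-swarms cannot merge.

```python
import random


def sub_swarm_pso(cost, blobs, n_particles=10, n_iter=20,
                  w=0.6, c1=1.5, c2=1.5, penalty=10000.0, seed=0):
    """Run one independent sub-swarm per blob.

    cost  : callable taking a state [x, y] and returning a cost (lower is better)
    blobs : dict blob_id -> (xmin, xmax, ymin, ymax), a hypothetical box per blob
    Returns dict blob_id -> (best_state, best_cost).
    """
    rng = random.Random(seed)
    results = {}
    for blob_id, (x0, x1, y0, y1) in blobs.items():

        def fitness(p):
            # Large cost outside the blob keeps the sub-swarm inside it.
            inside = x0 <= p[0] <= x1 and y0 <= p[1] <= y1
            return cost(p) if inside else penalty

        pos = [[rng.uniform(x0, x1), rng.uniform(y0, y1)]
               for _ in range(n_particles)]
        vel = [[0.0, 0.0] for _ in range(n_particles)]
        lbest = [p[:] for p in pos]              # local best states
        lcost = [fitness(p) for p in pos]        # local best costs
        g = min(range(n_particles), key=lambda i: lcost[i])
        gbest, gcost = lbest[g][:], lcost[g]     # global best of this sub-swarm

        for _ in range(n_iter):
            for i in range(n_particles):
                for d in range(2):
                    r1, r2 = rng.random(), rng.random()
                    vel[i][d] = (w * vel[i][d]
                                 + c1 * r1 * (lbest[i][d] - pos[i][d])
                                 + c2 * r2 * (gbest[d] - pos[i][d]))
                    pos[i][d] += vel[i][d]
                f = fitness(pos[i])
                if f < lcost[i]:
                    lbest[i], lcost[i] = pos[i][:], f
                    if f < gcost:
                        gbest, gcost = pos[i][:], f
        results[blob_id] = (gbest, gcost)
    return results


# Toy usage: two "players" at (10, 10) and (50, 50), one blob around each.
targets = [(10.0, 10.0), (50.0, 50.0)]


def toy_cost(p):
    return min((p[0] - tx) ** 2 + (p[1] - ty) ** 2 for tx, ty in targets)


boxes = {"A": (0, 20, 0, 20), "B": (40, 60, 40, 60)}
detections = sub_swarm_pso(toy_cost, boxes)
```

Because every blob has its own global best, the swarm as a whole returns one detection per blob instead of collapsing onto a single optimum.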

6.3. Approximate association

In the following sections, the hypothesis matrix is constructed, and a set of association hypotheses is obtained. Appropriate modifications are also made to the probability calculation of the MHT to account for appearance cues. Finally, the best solution to the problem is generated as an approximate association.

6.3.1. Hypothesis matrix

Each hypothesis in MHT is denoted by a hypothesis matrix, which consists of three sub-matrices representing assignments of measurements to predictions, new players and false alarms. In the first sub-matrix, predicted players are represented by the columns and the measurements by the rows. A non-zero element at matrix position (i, j) denotes that the i-th measurement is inside the validation gate of the j-th player. In order to build up the ultimate hypothesis matrix, a diagonal matrix (i.e., second sub-matrix) denoting new players is appended to the end of the first sub-matrix. Then, another diagonal matrix (i.e., third sub-matrix) denoting false alarms is appended to the end of the second sub-matrix. Non-zero elements at position (i, i) of the second and third sub-matrices denote that the i-th measurement can be a new player and a false alarm, respectively. Fig. 5-a depicts a situation with five predictions and five measurements. In addition to the partial occlusion between t1 and t2, total occlusion between t4 and t5 happens. Moreover, false alarm z2 is inside the gate of t3, and new player z5 enters the field of view. The situation in Fig. 5-a is represented by the hypothesis matrix in Fig. 5-b. Hypothesis generation is then performed by restricting the hypothesis matrix to have only a single non-zero value in any row or column. However, this one-to-one mapping assumption of the original MHT is violated when occlusions occur. For example, both positions (1,1) and (1,2) in the same row should have non-zero elements when players t1 and t2 are merged. Therefore, a modified framework is proposed in the following sections.

(a) [Image: predicted players t1 to t5 with elliptical validation gates and measurements z1 to z5.]

(b) Rows: measurements z1 to z5. Columns: predictions t1 to t5 | new players | false alarms.

z1: 1 1 0 0 0 | 0 0 0 0 0 | 1 0 0 0 0
z2: 0 0 1 0 0 | 0 0 0 0 0 | 0 1 0 0 0
z3: 0 0 1 0 0 | 0 0 0 0 0 | 0 0 1 0 0
z4: 0 0 0 1 1 | 0 0 0 0 0 | 0 0 0 1 0
z5: 0 0 0 0 0 | 0 0 0 0 1 | 0 0 0 0 1

Fig. 5. (a) Predicted players and elliptical validation gates for a situation with five predictions (t1, t2, t3, t4, t5) and five measurements (z1, z2, z3, z4, z5). (b) Hypothesis matrix.
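The construction of the three sub-matrices can be sketched as follows. This is an illustration only: the function name and arguments are hypothetical, the gate memberships are those read from the situation of Fig. 5 (z1 inside the gates of t1 and t2, z2 and z3 inside the gate of t3, z4 inside the gates of t4 and t5), and restricting the new-player block to selected measurements (here only z5) is an assumption taken from the figure rather than a rule stated in the text.

```python
def build_hypothesis_matrix(n_meas, n_pred, gated, new_player_rows):
    """Build [predictions | new players | false alarms] row by row.

    gated           : set of 0-based (i, j) pairs, measurement i inside gate of prediction j
    new_player_rows : set of measurement indices allowed to start a new player (assumption)
    """
    matrix = []
    for i in range(n_meas):
        pred = [1 if (i, j) in gated else 0 for j in range(n_pred)]
        new = [1 if (k == i and i in new_player_rows) else 0 for k in range(n_meas)]
        false_alarm = [1 if k == i else 0 for k in range(n_meas)]
        matrix.append(pred + new + false_alarm)
    return matrix


def violates_one_to_one(matrix, n_pred):
    """True if some measurement lies in more than one validation gate."""
    return any(sum(row[:n_pred]) > 1 for row in matrix)


# Situation of Fig. 5 with 0-based indices.
gated = {(0, 0), (0, 1), (1, 2), (2, 2), (3, 3), (3, 4)}
H = build_hypothesis_matrix(n_meas=5, n_pred=5, gated=gated, new_player_rows={4})
```

Rows z1 and z4 each contain two non-zero entries in the prediction block, which is exactly the case where the one-to-one assumption of the original MHT breaks down.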

6.3.2. Probability calculation and best assignment

Cox and Hingorani [39] obtained the new global hypothesis at time k from the current set of assignments and a previous hypothesis at time k-1, which is based on measurements up to and including time k-1. In order to add robustness to the MHT, an additional term corresponding to label consistency is included. Moreover, imprecise measurements are substituted by the corrected measurements. The probability calculation of a hypothesis is modified as follows:

|

}



(

[

])

{∏

[



]

[−

{

}

(

−̂ )

{

(

|

}

− ̂ )]

(6)

where c is a normalization constant and { | } is the probability of the parent global hypothesis. denotes the correction of i-th measurement ( ) resulting from the search for j-th player using the PSO scheme. For this purpose, i-th measurement will be replaced by the corrected state if it is searched by PSO. The distance between associated measurement with j-th player is denoted by the exponential term ( ). ̂ is the predicted observation for j-th player and is the associated covariance. will be one if the player t is detected. , , , will be one if i-th measurement comes from a known player, new player, false alarm, or it is ambiguous, respectively. Otherwise, they will be zero. is also defined as follows: {

̂

(7)
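The effect of a label-consistency factor on the association can be sketched as follows. This is a simplified illustration, not the probability of (6): the per-pair probabilities, the penalty value of 0.1 and the function names are hypothetical, and an exhaustive search over permutations stands in for the Hungarian algorithm, which returns the same minimum-cost one-to-one assignment on such small problems.

```python
import math
from itertools import permutations


def association_cost(prob, pred_label, meas_labels, penalty=0.1):
    """Negative log of an association probability.

    The probability is scaled by `penalty` (a hypothetical value) when the
    label of the prediction is not among the labels detected in the measurement.
    """
    consistent = pred_label in meas_labels
    return -math.log(prob * (1.0 if consistent else penalty))


def best_assignment(cost):
    """Minimum-cost one-to-one assignment; cost[j][i] is prediction j to measurement i."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[j][perm[j]] for j in range(n)))
    return list(best)


# Two nearby competitors: kinematics alone slightly prefer the wrong pairing.
probs = [[0.6, 0.4], [0.5, 0.5]]          # hypothetical kinematic probabilities
pred_labels = ["A", "B"]                   # team labels of the predictions
meas_labels = [{"B"}, {"A"}]               # labels detected in each measurement

kinematic_only = [[-math.log(p) for p in row] for row in probs]
with_labels = [[association_cost(probs[j][i], pred_labels[j], meas_labels[i])
                for i in range(2)] for j in range(2)]

assignment_without = best_assignment(kinematic_only)   # [0, 1]
assignment_with = best_assignment(with_labels)         # [1, 0]
```

In this toy case the label factor reverses the assignment, which is the kind of identity switch between nearby competitors that appearance cues are meant to prevent.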

In this paper, appearance cues are mainly considered using the label consistency term to reduce the probability of associating predictions and measurements with inconsistent labels. ̂ denotes the label for j-th player and denotes labels of detected players in i-th measurement using the PSO scheme (Algorithm 1). , and are probability of detection of track t, density of spurious measurement ( ) and density of new players ( ), respectively. The main novelty of (6) is considering label consistency and the measurements corrected by PSO. In order to find the best assignment, the cost function derived by computing the negative logarithm of the probability function is minimized. Then, the best solution is generated using the Hungarian algorithm [57], and the measurements assigned to predictions are stored.

6.4. Second measurements refinement

Following approximate association, the computed association is applied for post-processing of the PSO search results. Then, an auxiliary partial occlusion resolution step is introduced to correct for partially occluding measurements in big blobs. The details of each part are elaborated in the following sections.

6.4.1. PSO post-processing

At this step, post-processing of the initial search is performed by setting to zero (Algorithm 2). Notations for Algorithm 2 are given in Algorithm 1 and Table 1. A threshold value for evaluating the validity of each detected player (Section 6.2.4) is computed. For this purpose, detected PSO ellipses containing a large amount of grass are discarded according to their cost values ( ) (line 1), and the objective threshold value is computed based on the mean (m) and standard deviation ( ) of the costs corresponding to the remaining measurements assigned to predictions by the Hungarian (Section 6.3.2) as (line 2). In order to perform the initial correction of measurements, -th measurement in the whole measurements list (i.e., b-th ambiguous measurement) is replaced by the best player (i.e., with minimum cost) found in the corresponding blob (line 6). Then, the remaining detected players in the ambiguous measurement are computed (line 10) and validated to output the valid occluded players in the b-th ambiguous measurement (line 13). If the corrected label for the detected state (Section 6.2.4) is not changed ( ), and the associated cost does not exceed the threshold value, the detected state will be transferred to the valid set.
Moreover, the detected states which are members of the allowed occluding measurements (Section 5) are always assumed valid, since they may correspond to highly occluding players (line 13). A more reliable way is using classifiers to validate detected players. At the end of this step, the initial corrected measurements along with the valid occluded players in all ambiguous measurements generate the corrected measurements set (line 15). Labels of corrected measurements are computed (line 16), and the removed set is computed as the index of corrected measurements whose corresponding costs are more than the threshold (line 17). Moreover, the initial correction of the hypothesis matrix is performed for corrected measurements. For this purpose, the assignment of a prediction with -th corrected measurement out of the prediction’s validation gate is penalized (line 8). Since an occluding measurement may be out of the prediction’s validation gate due to its large size, the hypothesis matrix is corrected only for corrected measurements which are not members of the allowed occluding measurements (Section 5). As a result, corrected measurements, their associated labels, removed set and corrected hypothesis matrix are generated.

Algorithm 2. Summary of the post-processing step (steps listed by the line numbers cited in the text).
Line 1: discard detected PSO ellipses containing a large amount of grass according to their cost values.
Line 2: compute the threshold from the mean and standard deviation of the costs of the remaining measurements assigned to predictions by the Hungarian.
Line 6: replace the b-th ambiguous measurement by the detected player with minimum cost in the corresponding blob.
Line 8: penalize in H the assignment of a prediction to a corrected measurement outside its validation gate.
Line 10: compute the remaining detected players in the ambiguous measurement.
Line 13: validate the remaining detected players and output the valid occluded players.
Line 15: form the corrected measurements set.
Line 16: compute the labels of the corrected measurements.
Line 17: compute the removed set.
Output: corrected measurements, their labels, removed set, H

6.4.2. Auxiliary partial occlusion resolution

This step is appended for detecting multiple occluding teammates within a blob. For this purpose, an ambiguous measurement which is not in the removed set and whose candidate labels (Section 6.1) include more than one teammate is stored as an occluding blob. Following the initial PSO search and post-processing steps, the image is turned into the grass color at the detected states to search for more players in the remaining pixels of the occluding blobs. Particle initialization, PSO search and post-processing (Algorithm 1 and 2) are then performed within the occluding blobs by setting to one. Moreover, only teams similar to previously detected teams within the blob and corresponding to probable teammates are searched. Since the referee cannot appear more than once in a blob, it is not searched at this step. The post-processing is also initialized by replacing measurements with the corrected ones. At this step, the maximum number of iterations for searching the b-th ambiguous blob and all ambiguous blobs is set to the maximum number of probable teammates in the b-th ambiguous blob and in all ambiguous blobs, respectively. The search will be terminated if the maximum number of iterations is reached, or no measurement is in the blobs. The flowchart of auxiliary partial occlusion resolution is shown in Fig. 6.

[Flowchart boxes: Set to 0; Initial PSO search; Initial post-process. Decision: Any new measurement is added or is not reached? If yes: Set to 1; Pick occluding blobs; Turn the image inside detected measurements into grass pixels; Remove blobs’ pixels inside detected measurements; Constrained initialization; Search step; Post-processing step. If no: Output , , ; End.]

Fig. 6. Flowchart of auxiliary partial occlusion resolution. is maximum number of probable teammates in all ambiguous blobs and contains occluding blobs. Other symbols have been defined in Table 1.

6.5. Accurate association

The results of the second measurements refinement are used to perform accurate association. Therefore, the hypothesis matrix is corrected to account for modified measurements and occlusions. Then, a set of association hypotheses is obtained, and a solution to the assignment problem is generated. The details are elaborated in the following sections.

6.5.1. Hypothesis matrix correction

If partial occlusion happens between players t1 and t2, the measurement of the players under partial occlusion will be divided into two parts (z1, z6) (Fig. 7-a). While the original occluding measurement is replaced by the best isolated measurement (e.g. z1), the remaining ones are appended to the end of the hypothesis matrix (e.g. z6) as partial occlusion candidates (the sixth row in Fig. 7-b). Afterwards, MHT is modified for highly occluding players. In particular, potential total occlusion measurements consist of measurements that are a) not in the removed set of the PSO post-process, b) assigned to players by the Hungarian and c) located inside the validation gates of multiple predicted players. Accordingly, a potential total occlusion measurement with undetected candidate labels (Section 6.1) after the PSO post-process will be assumed to be a total occlusion candidate. One simple strategy is appending new rows corresponding to the total occlusion candidates (e.g. z7) to the end of the hypothesis matrix (seventh row in Fig. 7-b). Similar to the initial hypothesis matrix, each appended row consists of three parts from the three sub-matrices. The first part, corresponding to the first sub-matrix, is built by repeating the row of the hypothesis matrix corresponding to the potential total occlusion measurement. The second part, corresponding to the second sub-matrix, is a zero row, since total occlusion candidates cannot be a new player, and the third part, corresponding to the third sub-matrix, is built similarly to the initial third sub-matrix. By doing so, one-to-one correspondence of measurements to predictions can be imposed even if occlusion happens. The corrected hypothesis matrix corresponding to Fig. 5-a is shown in Fig. 7-b, and the correct assignment is shown in Fig. 7-c.
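The row-appending step can be sketched as follows, under stated assumptions: the function name and arguments are hypothetical, the starting matrix is the five-measurement example of Fig. 5, the false-alarm block is rebuilt as an identity matrix of the new size, and the new-player part of partial occlusion candidates is set to zero (the text states this only for total occlusion candidates).

```python
def correct_hypothesis_matrix(matrix, n_pred, partial_gates, total_sources):
    """Append partial and total occlusion candidate rows.

    matrix        : rows laid out as [predictions | new players | false alarms]
    partial_gates : one set of prediction indices per partial occlusion candidate
    total_sources : indices of rows whose prediction part is repeated
                    as total occlusion candidates
    """
    n_old = len(matrix)
    n_new = n_old + len(partial_gates) + len(total_sources)
    rows = []
    for row in matrix:
        pred = list(row[:n_pred])
        new = list(row[n_pred:n_pred + n_old]) + [0] * (n_new - n_old)
        rows.append((pred, new))
    for gates in partial_gates:
        rows.append(([1 if j in gates else 0 for j in range(n_pred)], [0] * n_new))
    for src in total_sources:
        # Total occlusion candidates cannot be new players: zero second part.
        rows.append((list(matrix[src][:n_pred]), [0] * n_new))
    corrected = []
    for i, (pred, new) in enumerate(rows):
        false_alarm = [1 if k == i else 0 for k in range(n_new)]
        corrected.append(pred + new + false_alarm)
    return corrected


# Five-measurement example: z1 gated by t1 and t2, z4 gated by t4 and t5.
H5 = [
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
]
# z6: partial occlusion candidate gated by t1 and t2; z7: repeat of z4.
H7 = correct_hypothesis_matrix(H5, n_pred=5, partial_gates=[{0, 1}], total_sources=[3])
```

With the two extra rows, each of the merged players can be given its own row, so the one-to-one restriction on rows and columns can be kept during hypothesis generation.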

(a) [Image: predicted players t1 to t5 with validation gates and corrected measurements z1 to z7.]

(b) Rows: measurements z1 to z7. Columns: predictions t1 to t5 | new players | false alarms.

z1: 1 1 0 0 0 | 0 0 0 0 0 0 0 | 1 0 0 0 0 0 0
z2: 0 0 1 0 0 | 0 0 0 0 0 0 0 | 0 1 0 0 0 0 0
z3: 0 0 1 0 0 | 0 0 0 0 0 0 0 | 0 0 1 0 0 0 0
z4: 0 0 0 1 1 | 0 0 0 0 0 0 0 | 0 0 0 1 0 0 0
z5: 0 0 0 0 0 | 0 0 0 0 1 0 0 | 0 0 0 0 1 0 0
z6: 1 1 0 0 0 | 0 0 0 0 0 0 0 | 0 0 0 0 0 1 0
z7: 0 0 0 1 1 | 0 0 0 0 0 0 0 | 0 0 0 0 0 0 1

(c) Same layout as (b).

z1: 0 1 0 0 0 | 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0
z2: 0 0 0 0 0 | 0 0 0 0 0 0 0 | 0 1 0 0 0 0 0
z3: 0 0 1 0 0 | 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0
z4: 0 0 0 0 1 | 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0
z5: 0 0 0 0 0 | 0 0 0 0 1 0 0 | 0 0 0 0 0 0 0
z6: 1 0 0 0 0 | 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0
z7: 0 0 0 1 0 | 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0

Fig. 7. (a) Corrected measurements for situation depicted in Fig. 5-a. (b) Corrected hypothesis matrix. (c) Correct assignment.

6.5.2. Probability calculation correction and M-best assignments

The first term of the probability calculation (Equation 6) is modified as follows:

∏(

[

])

(8)

While the description of most terms can be found in Section 6.3.2, an additional validity term is appended. As a result, the probability of associating j-th prediction with i-th corrected measurement ( ) in the removed set (Section 6.4.1) is reduced (first row in (9)). Since may be removed due to the incomplete or missing image observations of occluded player, the validity probability is increased for occluding measurements (i.e., ) (second row in (9)). Let be the set of players near the borders. The validity probability will be set to one if the corrected measurement is not removed, or all measurements inside the validation gate of player which are not a member (not an exiting player) are removed (third row in (9)). By doing so, players can be associated with more probable removed measurements, and the tracking errors due to missed measurements in tracking-by-detection methods can be resolved. The validity term is defined as follows:

{ (9) where are the set of players near the borders, validation gate of j-th player, j-th player, occluding measurements, i-th corrected measurement at time k and removed set, respectively. The label consistency indicator is also modified after the PSO post-processing step. As a result, the probability of associating differently labeled prediction and measurement ( ̂ ≠ ) which is not a total occlusion candidate ( ) is reduced; since labeling error occurs for highly occluded players. is defined as follows: {

̂ ≠ (10)

̂ are total occlusion candidates, i-th corrected measurement at time k, label for i-th measurement where after the PSO post-processing step and label for j-th player (prediction), respectively.

Initial measurements were replaced by the best isolated measurements within the corresponding blobs, and occlusion candidates were the remaining probable measurements. Therefore, an additional occlusion term is appended to reduce the probability of associating j-th prediction with occlusion candidates. Moreover, the probability of associating j-th prediction with total occlusion candidates ( ) is less than partial occlusion candidates ( ; since partial occlusion candidates are detected measurements within the blobs, and total occlusion candidates are the repeated ones (Section 6.5.1). However, the probability of associating intersecting prediction with ( ) and (1) is increased. is defined as follows:



(11)

{ The density of new players for i-th measurement is set to penalize picking 1) measurements in the removed set (Section 6.4.1), 2) total occlusion candidates, 3) measurements assigned by the Hungarian (Section 6.3.2) and 4) measurements which are not a member (near the arms, near the image borders and probable missed measurements) as new players. is defined as follows: {

(12)

where are i-th corrected measurement at time k, removed set, total occlusion candidates and assigned measurements to predictions by the Hungarian, respectively. The main novelty of (8) is considering label consistency , validity term ( ) and additional occlusion term . Moreover, the density of new players was adjusted for our particular application. Following the probability calculation, M-best solutions to the assignment problem are generated using Murty’s algorithm [39]. In this paper, only the single best solution is kept, since the main goal of this paper is improving the MHT performance by resolving occlusions and using appearance cues.

7. Update

Following measurement to prediction association, measurements are used to update the prediction of players’ states using the Kalman filter. In this regard, a much smaller value is assigned to the measurement noise covariance than to the process noise covariance to deal with unpredictable player movements. In order to rectify total occlusions, the less probable player of an unallowed total occlusion (Section 3) will be removed. Moreover, the player that was less probable at the beginning of the occlusion will be removed if two players are occluded for a long period of time. Accordingly, an upper threshold is defined for occlusion ( frames) and total occlusion ( frames) between two players to penalize continuation of false alarms. If total occlusion occurs, the corresponding prediction will be shifted within the associated measurement without updating the results due to the uncertainty of both measurement and prediction. Players are also relabeled, since labeling errors may occur for blurred or partially visible players stuck to the borders. For this purpose, labels for predictions are updated according to the PSO results for non-occluding confirmed players which are not stuck to the borders. As a result, the set of tracked players can be used for predicting players’ states in the next frame of the sequence.

8. Experimental results

The performance of the proposed tracking algorithm was demonstrated on shots from five soccer broadcast sequences and on datasets captured by static cameras. In the test sequences, occlusions were abundant, and a lot of clutter was present in the broadcast dataset due to camera jitter and simultaneous movements of the camera and the players. Despite the low pixel resolution of the players, the proposed approach performed quite well. All code was implemented in Matlab. In the following, datasets, the computational complexity, parameters setting, performance evaluation, performance comparison and failure cases are presented.

8.1. Datasets

The algorithm was tested on 14100 frames from 5 different video sequences recorded from regular television broadcasts. Moreover, the videos were captured under different weather, ground and lighting conditions involving sunlight, cloudy weather and low quality videos. In order to perform evaluations, 30 far view shots were selected from five videos (i.e., Beira-Mar versus Benfica game, short clips, Yemen versus Kuwait game, Ayaks versus Real game and La Gantoise versus FC Malines game). Two short clips represented player trajectories during a goal scene from the middle of the playfield to near the goalpost. Moreover, three frames per second were subsampled from the 25 or 30 frames per second sequences. The images sizes of these sequences were also , , and , respectively. Furthermore, each player was manually labeled as the ground truth. The experiments were also conducted on soccer clips captured from static cameras. For this purpose, the Spagnolo dataset [58] and one subset of the VS-PETS-2003 dataset [59] were used to validate the effectiveness of the proposed AMHT framework. Accordingly, six clips from the Spagnolo dataset (each clip was 3002 frames long) with image size of covered the whole pitch. The results on this clip were compared to the other counterparts. On the other hand, the selected clip from the VS-PETS-2003 dataset (2499 frames long) with image size of was captured from the third camera on one corner of the pitch. Background subtraction was also applied for detecting the initial blobs, given the static backgrounds.

8.2. Computational complexity

One main factor that affects the processing speed is searching for players of different teams in a four-dimensional search space (i.e., window size and position). Since all pixels in the image are searched for a large number of parameters by exhaustive brute-force search, the search space will be huge.
With this procedure, the total number of scanned pixels is computed by multiplying the number of image pixels, uniforms (three for two teams and referee) and candidates for player width and height, which also increases as the image size grows. A possible solution would be performing a partial exploration of the image by the proposed blob-guided PSO detection inside the AMHT framework. The search space is also reduced by only searching the ambiguous blob pixels inside the validation gates for probable teams, except for blobs at the border of the image (which probably contain new players). For PSO within the AMHT, multiplying the number of particles and iteration steps by the number of probable uniforms in each blob gives the number of times each blob is scanned. In Table 2, the numbers of scans by exhaustive search in the whole image and all blobs (i.e., 15 blobs including 9337 pixels), PSO in all blobs and ambiguous blobs (i.e., 5 ambiguous blobs) and PSO in ambiguous blobs within the AMHT are compared for one sample frame. As shown, the proposed PSO inside the AMHT can significantly reduce the number of scanned pixels. However, the computation cost still depends on many factors, such as the number of ambiguous blobs, occlusions, PSO parameters (e.g. the number of particles) and blob complexity.

Table 2. Number of scans for exhaustive search in the whole image, exhaustive search in all blobs, PSO in all blobs, PSO in ambiguous blobs and PSO in ambiguous blobs within the AMHT framework for one sample frame.
Method | Image size | Width range | Height range | Number of uniforms | Number of iterations | Number of particles | Number of scans
Exhaustive search | | | | 3 | - | - |
Exhaustive search in all blobs | | | | 3 | - | - |
PSO in all blobs | | | | 3 | 10 | 112 | 3360
PSO in ambiguous blobs | | | | 3 | 10 | 37 | 1110
PSO in ambiguous blobs within the AMHT | | | | Differs for each blob | 10 | 37 | 610
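The PSO scan counts follow directly from the product described in the text (particles times iterations times uniforms). The short sketch below reproduces the two values that can be checked from the table; the per-blob split used for the AMHT case is purely hypothetical (the paper only states that the number of uniforms differs for each blob) and is included just to show how a per-blob count would be accumulated.

```python
def pso_scans(n_particles, n_iterations, n_uniforms):
    """Number of window evaluations when every particle searches every uniform."""
    return n_particles * n_iterations * n_uniforms


def pso_scans_per_blob(blobs, n_iterations):
    """Sum of per-blob counts; blobs is a list of (particles, probable_uniforms)."""
    return sum(pso_scans(p, n_iterations, u) for p, u in blobs)


all_blobs = pso_scans(112, 10, 3)        # PSO in all blobs
ambiguous_blobs = pso_scans(37, 10, 3)   # PSO in ambiguous blobs

# Hypothetical split of 37 particles over 5 ambiguous blobs, each with
# its own number of probable uniforms (values chosen for illustration only).
hypothetical_split = [(10, 2), (8, 2), (7, 1), (6, 2), (6, 1)]
within_amht = pso_scans_per_blob(hypothetical_split, 10)
```

The reduction within the AMHT comes entirely from searching fewer uniforms per blob, since the number of particles and iterations is unchanged.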

8.3. Parameters setting

In all experiments, (1.5), (60), (300) and the PSO parameters, such as (0.6), maximum iterations (20), acceleration constants (1/5), (4) and the AMHT parameters, such as (0.05), (0.9), (0.001), (0.00033), (0.05), (0.13), (0.002), (0.37) and (1.5) were kept fixed across all the datasets. Furthermore, was set to 1 and 0.5 for broadcast and static cameras, respectively. The allowable ranges for the ellipse minor and major axes were also fixed at [10 35] and [30 80], respectively. The parameters for confirming blobs, such as minor axis interval, major axis interval and minimum area, were also set to [10 50], [29 110] and 250, respectively. In the case of static cameras, the parameters corresponding to confirmed blobs and the allowable ranges for the ellipse can be adjusted according to the positioning of the cameras. In this regard, the allowable ranges for the ellipse minor axis interval ([20 90]) and major axis interval ([80 184]) and the parameters for confirming blobs, such as minor axis interval ([20 90]), major axis interval ([80 184]) and minimum area (1000), were set to larger values just for the Spagnolo dataset, since its image size was twice the image size of the remaining datasets (Section 8.1).

8.4. Performance evaluation

In contrast to the original MHT, which used the KF assumption without appearance cues, AMHT used appearance cues to perform association and to better discriminate players from false alarms. It was also capable of tracking players during occlusions, and the appearance cues particularly helped when competitors moved close to each other. It also performed player detection and labeling during the tracking step. In order to evaluate AMHT, experiments were first conducted on broadcast videos to illustrate the improvements obtained by the different components of the proposed framework. Since PSO randomly searched the solution space, the reported results were averaged over three trials. In order to evaluate the benefits of the appearance cues, experiments were conducted with the appearance term (AMHT) and without the appearance term in (6) and (8) (MHT-A). As expected, the criterion most affected by using appearance cues was the number of identity switches (IDs). Although the proposed AMHT provided good results for tracking competitors in occluding blobs, the MHT-A occasionally caused ID switches while tracking nearby competitors. When appearance cues are ignored, track IDs may switch when competitors merge and then split, since errors in predicting the players may occur.
Prediction errors particularly happened due to nonlinear movements of players and camera or errors in locating the players. The ID switches also increased for blurred videos due to the growth of errors in labeling and locating players in occluding blobs. Moreover, the difference between the numbers of ID switches in the MHT-A and AMHT grew in situations with highly occluding players, a large number of total occlusions and blurred occluding players, respectively. On the other hand, the number of switches in the MHT-A and AMHT was minimal for clips including widely separated players. Experiments were also conducted without the occlusion term ( ) (MHT-O) to evaluate the benefits of considering occlusions. For this purpose, the appended total ( ) and partial occlusion candidates ( ) were eliminated from the hypothesis matrix. The main weakness of the MHT-O was missing players in occlusion situations, since it only tracked one player in each occluding blob. Furthermore, PSO was applied for player tracking using nearest neighbor data association in order to evaluate its capabilities outside the AMHT framework. For this purpose, each prediction was associated with the nearest blob, and ambiguous associated blobs were searched for the predicted player. As a result, the prediction may be associated with false alarms (e.g. line segments) close to the prediction, or several predictions may be associated with the same measurement. Since errors in predicting the players may occur, track IDs may also switch for nearby teammates. Experiments were also conducted without constrained initialization and auxiliary partial occlusion resolution in big blobs (MHT-APO). AMHT without constrained initialization was conducted by assigning random positions and sizes to the particles. As a result, nearly all players were missed, since the solution space was not properly covered by the particles.
Since color plays a more important role in player detection than gradient, the weight in (2) was set larger than 0.5 to put more weight on the color cost than on the gradient cost. The color cost was indispensable: when it was ignored, partial occlusion resolution among competitors (Section 6.2.4) failed and no occluding competitors could be isolated. The gradient cost, on the other hand, only helped to avoid line segments. Accordingly, ignoring the gradient cost was not problematic for tracking existing players, since the validation gates (Section 5) helped detect players by limiting the search area. However, the color cost alone missed most of the newly arrived players that were merged with lines in big blobs. Although the gradient cost resolved such situations, the overall change in performance was negligible, because they occurred in only a few frames. The gradient cost also helps detect players whose color is similar to the white lines, since the color cue is useless for player detection in such situations.

In Fig. 8, the tracking methods were assessed by MOTA and GMOTA [17] as follows:

$$\mathrm{MOTA} = 1 - \frac{\sum_t \left( fp_t + fn_t + ids_t \right)}{\sum_t g_t}, \qquad \mathrm{GMOTA} = 1 - \frac{\sum_t \left( fp_t + fn_t + gids_t \right)}{\sum_t g_t} \qquad (13)$$

where $fp_t$, $fn_t$, $ids_t$, $gids_t$ and $g_t$ are the numbers of false positives, false negatives, instantaneous identity switches, global identity switches and ground truth detections, respectively.

Since a defender was often close to an offensive player of the other team, most occlusions occurred among competitors. The proposed AMHT handled 95% of occlusions between two competitors and 83% of occlusions between two teammates. Some occluding players were missed because of shifted detections or because they entered the scene inside an occluding blob. Fewer occluding competitors were missed, since detecting competitors in an occluding blob was more precise.
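The scores in (13) can be illustrated with a minimal Python sketch. It assumes the unweighted form of the metrics, with per-frame counts summed over the whole sequence. The function name and the sample counts below are hypothetical and are not taken from the experiments.

```python
def tracking_accuracy(false_pos, false_neg, switches, ground_truth):
    """Return 1 - (sum of errors) / (sum of ground truth detections).

    Each argument is a list with one count per frame. Passing the
    instantaneous identity switches gives a MOTA-style score; passing
    the global identity switches gives a GMOTA-style score.
    """
    total_gt = sum(ground_truth)
    if total_gt == 0:
        raise ValueError("no ground truth detections in the sequence")
    total_errors = sum(false_pos) + sum(false_neg) + sum(switches)
    return 1.0 - total_errors / total_gt


# Hypothetical per-frame counts for a four-frame clip with ten players.
fp = [0, 1, 0, 0]        # false positives
fn = [1, 0, 0, 1]        # false negatives
ids = [0, 0, 1, 0]       # instantaneous identity switches
gids = [0, 0, 1, 1]      # global identity switches
gt = [10, 10, 10, 10]    # ground truth detections

mota_score = tracking_accuracy(fp, fn, ids, gt)    # 1 - 4/40
gmota_score = tracking_accuracy(fp, fn, gids, gt)  # 1 - 5/40
print(round(mota_score, 3), round(gmota_score, 3))  # 0.9 0.875
```

In this toy clip the global count stays high in the last frame while the instantaneous count does not, so the GMOTA-style score is lower than the MOTA-style score.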

Fig. 8. Comparing the performance of AMHT against the MHT-A, MHT-O, MHT-APO and PSO using MOTA and GMOTA metrics on broadcast videos.

Fig. 9 illustrates the performance of the AMHT on several frames. The blue circles mark occlusion situations handled by the AMHT, and the odd columns depict the occluding blobs. The ID of each player is shown at the top of its rectangle, and players that appeared earlier are assigned smaller IDs.

[Figure 9: sample frames, including Frames 4162, 4168, 4309 and 4375]

Fig. 9. Tracking results of the proposed AMHT algorithm on some sample frames. The players in team A, team B and the referee are represented by white, black and blue rectangles, respectively. The blue circles depict the tracked occluding blobs, and the odd columns depict the corresponding occluding blobs from the blob detection step.

8.5. Performance comparison

It is difficult to compare player tracking methods because of the variability in datasets (in particular for broadcast sequences) and evaluation metrics, and the lack of access to the code and datasets of earlier methods. Moreover, the different steps of each paper can affect the subsequent steps, and the variability in the applied steps makes it hard to compare the final tracking step. For this purpose, similar metrics and datasets are used to compare the performance of the proposed AMHT with some player tracking methods in the literature.

To compare the AMHT with MCNF and T-MCNF [17], experiments were conducted on the same dataset (i.e., the Spagnolo dataset). Using the same metrics (i.e., MOTA and GMOTA), the AMHT outperformed MCNF and T-MCNF, which have been shown to outperform KSP [60], C-KSP and DP (Fig. 10). It also outperformed the improved PF (Sentioscope) [28], which has been shown to outperform mean shift, optical flow, color-based PF and color-based mixture PF.

Fig. 10. Comparing the performance of AMHT against the KSP, C-KSP, DP, MCNF, T-MCNF [17] and Sentioscope [28] using MOTA and GMOTA metrics on the Spagnolo dataset.

To compare the AMHT with the MHT, similar experiments were conducted on the same dataset (VS-PETS 2003) as used in [45]. The MHT failed to track more than half of the highly overlapped players (>60% occlusion), while the AMHT successfully tracked more than 96% of these cases. Therefore, the AMHT is less sensitive to occlusion severity than the MHT. The AMHT achieved a MOTA of 92% on this dataset. Most of the missed players were partially visible and located at the image border. Most ID switches were also due to occluding teammates (in particular those at the border). These situations are not problematic for multiple static cameras, since such players enter the other cameras' fields of view. Some players were also missed because of shifted detections or because they entered the scene inside an occluding blob. More than half of the shifts were compensated in the following frames, and the missed tracks of players that entered while occluded were recovered as soon as the players were isolated.

In addition to the above failure cases, there were other failure cases in broadcast sequences. Some difficulties arose in situations where even the naked eye could hardly distinguish players because of blur. False negatives occurred whenever the player blob was missed due to blur or was located outside the grass field. Blur could also cause new players to be missed. Moreover, mislabeling of players (particularly occluding or nearby players) due to blur could shift tracks onto nearby blobs, which was usually compensated in the following frames. Employing more discriminative features would help remedy this.
Although the AMHT was capable of tracking players in many occlusion situations, it was likely to miss one or more players or switch IDs in hard situations (e.g., several players, including teammates, in a long-lasting, highly occluding blob with unpredictable movements). At the tracking level, the affinities in appearance and motion were used. Therefore, the tracker might fail when players were similar in appearance and the prediction results were unsatisfactory. To compensate for missed cases, the system waited for the player to come out of the occlusion, deadline area or blur, and then tracked the player in the following frames.

The evaluation results demonstrate that the proposed AMHT improves on the MHT. Moreover, the reduction in IDS indicates the effectiveness of the proposed framework for the switching problem and observation assignment. Although some errors remain, additional views can resolve most of these cases. Accordingly, the proposed AMHT can improve the tracking results in each view.

9. Conclusion

In the past decades, extensive research efforts have been devoted to soccer player tracking. In this paper, an appearance-based MHT is introduced to compensate for the fact that the MHT ignores appearance and occlusion situations. At the core of the proposed framework, the capabilities of PSO and MHT are combined. Moreover, enhancements of the PSO algorithm, such as constrained initialization of particles, a gradient-based fitness definition, and new PSO search and post-processing schemes, are implemented. Although the results suggest the reliable behavior of the AMHT with respect to the MHT, there are still some open issues that need to be investigated. The first challenge is updating the initial players' models to account for illumination changes and to avoid player model deviation from the correct
representation. Another issue is the automatic selection of appropriate color features, which depends heavily on player uniforms and changes from one game to another. The proposed scheme and the ideas behind it can also be extended to other applications by modifying the search step, which is out of the scope of this paper. In conclusion, improvements in player tracking will certainly lead to more precise analysis of soccer games.

References

[1] P. Rangsee, P. Suebsombat, P. Boonyanant, Simplified low-cost GPS-based tracking system for soccer practice, In: Proc. 13th Int. Symp. on Communications and Information Technologies, 2013, pp. 724-728.
[2] P. Gabriel, J.B. Hayet, J. Piater, J. Verly, Object tracking using color interest points, In: Proc. IEEE Conf. on Advanced Video and Signal Based Surveillance, 2005, pp. 159-164.
[3] J.B. Hayet, T. Mathes, J. Czyz, J. Piater, J. Verly, B. Macq, A modular multi-camera framework for team sports tracking, In: Proc. IEEE Conf. on Advanced Video and Signal Based Surveillance, 2005, pp. 493-498.
[4] H. Li, M. Flierl, Sift-based multi-view cooperative tracking for soccer video, In: Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2012, pp. 1001-1004.
[5] T. Mathes, J.H. Piater, Robust non-rigid object tracking using point distribution manifolds, 28th DAGM Symp. on Pattern Recognition, Berlin, Germany, September 12-14, 2006, pp. 515-524.
[6] N. Vandenbroucke, L. Macaire, C. Vieren, J.G. Postaire, Contribution of a color classification to soccer players tracking with snakes, In: Proc. IEEE Int. Conf. on Systems, Man, and Cybernetics, Computational Cybernetics and Simulation, 1997, pp. 3660-3665.
[7] S. Lefèvre, C. Fluck, B. Maillard, N. Vincent, A fast snake-based method to track football players, Int. Association for Pattern Recognition (IAPR) Workshop on Machine Vision Applications, (2000) 501-504.
[8] S. Lefèvre, N. Vincent, Real time multiple object tracking based on active contours, In: Proc. Int. Conf. on Image Analysis and Recognition, Porto, Portugal, September 29 - October 1, 2004, pp. 606-613.
[9] M. Heydari, A.M.E. Moghadam, An MLP-based player detection and tracking in broadcast soccer video, In: Proc. Int. Conf. on Robotics and Artificial Intelligence, 2012, pp. 195-199.
[10] P. Figueroa, N. Leite, R.M.L. Barros, I. Cohen, G. Medioni, Tracking soccer players using the graph representation, In: Proc. 17th Int. Conf. on Pattern Recognition, 2004, pp. 787-790.
[11] V. Pallavi, J. Mukherjee, A.K. Majumdar, S. Sural, Graph-based multiplayer detection and tracking in broadcast soccer videos, IEEE Transactions on Multimedia, 10 (2008) 794-805.
[12] P.J. Figueroa, N.J. Leite, R.M.L. Barros, Tracking soccer players aiming their kinematical motion analysis, Computer Vision and Image Understanding, 101 (2006) 122-135.
[13] J. Sullivan, S. Carlsson, Tracking and labelling of interacting multiple targets, In: Proc. 9th European Conf. on Computer Vision, Graz, Austria, May 7-13, 2006, pp. 619-632.
[14] J. Sullivan, P. Nillius, S. Carlsson, Multi-target tracking on a large scale: experiences from football player tracking, In: Proc. IEEE Int. Conf. on Robotics and Automation (ICRA) Workshop on People Detection and Tracking, Kobe, Japan, (2009).
[15] J. Miura, H. Kubo, Tracking players in highly complex scenes in broadcast soccer video using a constraint satisfaction approach, In: Proc. Int. Conf. on Content-based Image and Video Retrieval, Niagara Falls, Canada, 2008, pp. 505-514.
[16] M. Manafifard, H. Ebadi, H. Abrishami-Moghaddam, Discrete particle swarm optimization for player trajectory extraction in soccer broadcast videos, Scientia Iranica, 22 (2015) 1031-1044.
[17] H. Ben Shitrit, J. Berclaz, F. Fleuret, P. Fua, Multi-commodity network flow for tracking multiple people, IEEE Transactions on Pattern Analysis and Machine Intelligence, 36 (2014) 1614-1627.
[18] T.-K. Chiang, J.-J. Leou, C.-S. Lin, An improved mean shift algorithm based tracking system for soccer game analysis, In: Proc. Asia-Pacific Signal and Information Processing Association, Sapporo, Japan, (2009) 380-385.
[19] S.H. Khatoonabadi, M. Rahmati, Automatic soccer players tracking in goal scenes by camera motion elimination, Image and Vision Computing, 27 (2009) 469-479.
[20] X. Zhong, N. Zheng, J. Xue, Pseudo measurement based multiple model approach for robust player tracking, In: Proc. 7th Asian Conf. on Computer Vision, Hyderabad, India, January 13-16, 2006, pp. 781-790.
[21] N. Najafzadeh, M. Fotouhi, S. Kasaei, Multiple soccer players tracking, Int. Symp. on Artificial Intelligence and Signal Processing, 2015, pp. 310-315.
[22] M. Schlipsing, J. Salmen, M. Tschentscher, C. Igel, Adaptive pattern recognition in real-time video-based soccer analysis, J. Real-Time Image Processing, (2014) 1-17.
[23] M. Xu, J. Orwell, L. Lowey, D. Thirde, Architecture and algorithms for tracking football players with multiple cameras, IEE Proc. Vision, Image and Signal Processing, 2004, pp. 51-55.
[24] M. Xu, J. Orwell, G. Jones, Tracking football players with multiple cameras, In: Proc. Int. Conf. on Image Processing, 2004, pp. 2909-2912.
[25] J. Ren, M. Xu, J. Orwell, G.A. Jones, Multi-camera video surveillance for real-time analysis and reconstruction of soccer games, Machine Vision and Applications, 21 (2009) 855-863.
[26] T. Misu, A. Matsui, S. Clippingdale, M. Fujii, N. Yagi, Probabilistic integration of tracking and recognition of soccer players, In: Proc. 15th Int. Conf. on Multimedia Modeling, Sophia-Antipolis, France, January 7-9, 2009, pp. 39-50.
[27] R. Hamid, R.K. Kumar, M. Grundmann, K. Kihwan, I. Essa, J. Hodgins, Player localization using multiple static cameras for sports visualization, In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2010, pp. 731-738.
[28] S. Baysal, P. Duygulu, Sentioscope: a soccer player tracking system using model field particles, IEEE Transactions on Circuits and Systems for Video Technology, 26 (2016) 1350-1362.
[29] G. Zhu, C. Xu, Q. Huang, Y. Rui, S. Jiang, W. Gao, H. Yao, Event tactic analysis based on broadcast sports video, IEEE Transactions on Multimedia, 11 (2009) 49-67.
[30] K. Nummiaro, E. Koller-Meier, L.V. Gool, An adaptive color-based particle filter, Image and Vision Computing, 21 (2003) 99-110.
[31] M. Herrmann, M. Hoernig, B. Radig, Online multi-player tracking in monocular soccer videos, AASRI Procedia, 8 (2014) 30-37.
[32] D. Svensson, Target tracking in complex scenarios, Thesis for the degree of Doctor of Philosophy, Chalmers University of Technology, Department of Signals and Systems, Goteborg, Sweden, (2010).
[33] L. Barceló, X. Binefa, J.R. Kender, Robust methods and representations for soccer player tracking and collision resolution, In: Proc. 4th Int. Conf. on Image and Video Retrieval, Singapore, July 20-22, 2005, pp. 237-246.
[34] T.E. Fortmann, Y. Bar-Shalom, M. Scheffe, Sonar tracking of multiple targets using joint probabilistic data association, IEEE J. Oceanic Engineering, 8 (1983) 173-184.
[35] R.G. Abbott, L.R. Williams, Multiple target tracking with lazy background subtraction and connected components analysis, Machine Vision and Applications, 20 (2007) 93-101.
[36] J. Liu, X. Tong, W. Li, T. Wang, Y. Zhang, H. Wang, Automatic player detection, labeling and tracking in broadcast soccer video, Pattern Recognition Letters, 30 (2009) 103-113.
[37] X. Tong, J. Liu, T. Wang, Y. Zhang, Automatic player labeling, tracking and field registration and trajectory mapping in broadcast soccer video, Association for Computing Machinery (ACM) Transactions on Intelligent Systems and Technology, 2 (2011) 1-32.
[38] D.B. Reid, An algorithm for tracking multiple targets, IEEE Transactions on Automatic Control, 24 (1979) 843-854.
[39] I.J. Cox, S.L. Hingorani, An efficient implementation of Reid's multiple hypothesis tracking algorithm and its evaluation for the purpose of visual tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18 (1996) 138-150.
[40] M. Beetz, N.V. Hoyningen-Huene, J. Bandouch, B. Kirchlechner, S. Gedikli, A. Maldonado, Camera-based observation of football games for analyzing multi-agent activities, In: Proc. 5th Int. Joint Conf. on Autonomous Agents and Multiagent Systems, Hakodate, Japan, 2006, pp. 42-49.
[41] S. Gedikli, J. Bandouch, N. von Hoyningen-Huene, B. Kirchlechner, M. Beetz, An adaptive vision system for tracking soccer players from variable camera settings, In: Proc. 5th Int. Conf. on Computer Vision Systems, (2007).
[42] M. Beetz, S. Gedikli, J. Bandouch, B. Kirchlechner, N.V. Hoyningen-Huene, A. Perzylo, Visually tracking football games based on TV broadcasts, In: Proc. 20th Int. Joint Conf. on Artificial Intelligence, Hyderabad, India, 2007, pp. 2066-2071.
[43] I.J. Cox, J.J. Leonard, Probabilistic data association for dynamic world modeling: a multiple hypothesis approach, In: Proc. 5th Int. Conf. on Advanced Robotics. Robots in Unstructured Environments, 1991, pp. 1287-1294.
[44] S.S. Blackman, Multiple hypothesis tracking for multiple target tracking, IEEE Aerospace and Electronic Systems Magazine, 19 (2004) 5-18.
[45] S.-W. Joo, R. Chellappa, A multiple-hypothesis approach for multiobject visual tracking, IEEE Transactions on Image Processing, 16 (2007) 2849-2854.
[46] S. Chang, R. Sharan, M. Wolf, N. Mitsumoto, J.W. Burdick, People tracking with UWB radar using a multiple-hypothesis tracking of clusters (MHTC) method, Int. J. Social Robotics, 2 (2010) 3-18.
[47] M. Mucientes, W. Burgard, Multiple hypothesis tracking of clusters of people, IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2006, pp. 692-697.
[48] A. Rahman, H. Zamzuri, S.A. Mazlan, M.A. Abdul Rahman, Y. Yamamoto, S.B. Samsuri, Dynamic track management in MHT for pedestrian tracking using laser range finder, Mathematical Problems in Engineering, (2015) 9 pages.
[49] K.O. Arras, B. Lau, S. Grzonka, M. Luber, O.M. Mozos, D. Meyer-Delius, W. Burgard, Range-based people detection and tracking for socially enabled service robots, Towards Service Robots for Everyday Environments: Recent Advances in Designing Service Robots for Complex Tasks in Everyday Environments, 2012, pp. 235-280.
[50] C. Kim, F. Li, A. Ciptadi, J.M. Rehg, Multiple hypothesis tracking revisited, IEEE Int. Conf. on Computer Vision (ICCV), (2015) 4696-4704.
[51] L. Ying, C. Xu, W. Guo, Extended MHT algorithm for multiple object tracking, 4th Int. Conf. Internet Multimedia Computing and Service, ACM, Wuhan, China, 2012, pp. 75-79.
[52] L. Ying, T. Zhang, S. Qian, C. Xu, Multi-cue based multi-target tracking with boosted MHT, 14th Pacific-Rim Conf. on Multimedia, Nanjing, China, December 13-16, 2013, pp. 528-537.
[53] A. Torabi, G.-A. Bilodeau, A multiple hypothesis tracking method with fragmentation handling, Canadian Conf. on Computer and Robot Vision, 2009, pp. 8-15.
[54] D.M. Antunes, D. Figueira, D.M. Matos, A. Bernardino, J. Gaspar, Multiple hypothesis tracking in camera networks, IEEE Int. Conf. Computer Vision Workshops, 2011, pp. 367-374.
[55] S. Jiang, Q. Ye, W. Gao, T. Huang, A new method to segment playfield and its applications in match analysis in sports video, In: Proc. 12th Annual Association for Computing Machinery (ACM) Int. Conf. on Multimedia, New York, NY, USA, 2004, pp. 292-295.
[56] J. Kennedy, R. Eberhart, Particle swarm optimization, IEEE Int. Conf. Neural Networks, 1995, vol. 4, pp. 1942-1948.
[57] H.W. Kuhn, The Hungarian method for the assignment problem, Naval Research Logistics Quarterly, 2 (1955) 83-97.
[58] T. D’Orazio, M. Leo, N. Mosca, P. Spagnolo, P.L. Mazzeo, A semi-automatic system for ground truth generation of soccer video sequences, 6th IEEE Int. Conf. on Advanced Video and Signal Surveillance, Genoa, Italy, 2009, http://www.ino.it/home/spagnolo/Dataset.html (accessed 17.06.2014).
[59] University of Reading, VS-PETS football dataset, The First Joint IEEE Int. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Nice, October 2003, http://www.cvg.reading.ac.uk/slides/pets.html (accessed 17.06.2016).
[60] J. Berclaz, F. Fleuret, E. Turetken, P. Fua, Multiple object tracking using K-shortest paths optimization, IEEE Transactions on Pattern Analysis and Machine Intelligence, 33 (2011) 1806-1819.

Highlights

• An effective AMHT framework is introduced to account for players’ appearance, nonlinear movements and occlusions.
• A blob-guided PSO player detection is introduced into the MHT, which is capable of detecting and labeling multiple players and resolving partial occlusions simultaneously. Only teammates in occluding blobs are detected by erasing the previously detected ones.