Scene-specific classifier for effective and efficient team sport players detection from a single calibrated camera


Accepted Manuscript

Pascaline Parisot, Christophe De Vleeschouwer

PII: S1077-3142(17)30003-6
DOI: 10.1016/j.cviu.2017.01.001
Reference: YCVIU 2525

To appear in:

Computer Vision and Image Understanding

Received date: 11 March 2016
Revised date: 4 November 2016
Accepted date: 2 January 2017

Please cite this article as: Pascaline Parisot, Christophe De Vleeschouwer, Scene-specific classifier for effective and efficient team sport players detection from a single calibrated camera, Computer Vision and Image Understanding (2017), doi: 10.1016/j.cviu.2017.01.001

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Highlights

• The cues derived from a background-subtracted foreground mask are combined with image classification to detect team sport players.
• Foreground-based detection exploits the verticality of players' silhouettes to detect player candidates in a computationally efficient manner.
• Online training adapts the classifier to the game at hand, which helps to deal with the large deformations and appearance variability of players.
• The naïve Bayesian decision framework adopted by the classifier makes it robust to training sample label corruption.
• Overall, the proposed system achieves a unique complexity/accuracy trade-off.


Scene-specific classifier for effective and efficient team sport players detection from a single calibrated camera

Pascaline Parisot and Christophe De Vleeschouwer

ISPGroup, ELEN Department, ICTEAM Institute, Université catholique de Louvain, Belgium
Keemotion, Belgium
[email protected], [email protected]

Abstract


This paper considers the detection of players in team sport scenes observed with a still or motion-compensated camera. Background-subtracted foreground masks provide easy-to-compute primary cues to identify the vertical silhouettes of moving players in the scene. However, they are shown to be too noisy to achieve reliable detection when only a single viewpoint is available, as is often desired to reduce deployment cost. To circumvent this problem, our paper investigates visual classification to identify the true positives among the candidates detected by the foreground mask. It proposes an original approach to automatically adapt the classifier to the game at hand, making the classifier scene-specific for improved accuracy. Since this adaptation implies the use of potentially corrupted labels to train the classifier, a semi-naive Bayesian classifier that combines random sets of binary tests is considered as a robust alternative to boosted classification solutions. Finally, our validations on two publicly released datasets prove that the proposed combination of visual and temporal cues supports accurate and reliable player detection in team sport scenes observed from a single viewpoint.

Keywords: detection, scene-specific classifier, online training, random ferns, sports analysis.


1. Introduction


Our work considers the detection of players in team sport actions, in a reliable but cost-effective manner, using a single camera viewpoint. Whilst our contributions are demonstrated in a static acquisition set-up, they only assume that the objects moving in the scene can be extracted with reasonable reliability, using off-the-shelf background subtraction algorithms. Hence, our proposed method remains valid for motion-compensated pan/tilt/zoom camera views [1]. The use of such a (semi-)static acquisition set-up to locate the players in a sport context is relevant to feed sport analytics and enrich broadcast content [2, 3], to retrieve game actions from a database [4, 5], or to drive automatic democratic team sport coverage [6, 7, 8, 9, 10]. Beyond the sport analysis context, the detection of objects-of-interest is a central component of most video analysis systems dedicated to scene interpretation. The simplest detectors traditionally build on the extraction of connected components in a foreground mask [11, 12, 13, 14], and on the comparison of their shapes with a model of the object silhouette [15, 16]. Those detectors obviously suffer from significant false and/or missed detection rates in the presence of multiple
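For illustration, one of the simplest off-the-shelf background subtraction schemes is an exponential running-average background model. The sketch below (numpy only; the update rate, threshold, and toy data are illustrative, and the paper does not prescribe this particular algorithm) shows the principle:

```python
import numpy as np

def update_background(background, frame, alpha=0.02):
    """Exponential running-average background model: a simple
    off-the-shelf option, not the one prescribed by the paper."""
    return (1 - alpha) * background + alpha * frame

def foreground_mask(background, frame, threshold=25.0):
    """Binary mask of pixels deviating from the background model."""
    return np.abs(frame - background) > threshold

# Toy example: an 8x8 gray scene with a bright moving "player" patch.
rng = np.random.default_rng(0)
background = np.full((8, 8), 100.0)
frame = background + rng.normal(0, 1, (8, 8))  # background + sensor noise
frame[2:5, 3:5] = 200.0                        # moving-object pixels
mask = foreground_mask(background, frame)      # flags only the 3x2 patch
background = update_background(background, frame)
```

In the real system, such a mask is only the primary cue; the rest of the paper deals with turning its noisy output into reliable detections.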




moving objects, or a noisy foreground mask. As detailed in Section 2, to address those limitations, many modern approaches have proposed to exploit geometric primitives of the object silhouette to locate objects by deconvolving the foreground mask [17, 18, 19, 20]. Those solutions appear to be especially effective when multiple and complementary views of the same scene are available [21, 18, 19, 20]. A second popular strategy envisioned in the past decade to detect objects in an image consists in exploiting visual classifiers to capture the specificities of the object appearance in terms of texture patterns [22, 23, 24, 25, 26, 27, 28, 29]. In practice, however, those modern approaches require either the deployment of a multi-view set-up, or the identification of visual features that are able to discriminate the object-of-interest from the rest of the scene. The first option dramatically increases the deployment cost, since multiplying the number of cameras increases the amount of processed data, requires synchronization, and makes the system connectivity more complex. The second option is no more practical since, in addition to being based on a computationally demanding multiscale scanning of the frames (see Section 2), it fundamentally relies on a representative set of training samples to learn the object appearance. This last point inherently restricts the scope and the genericity of the resulting detector, and induces a cost to collect and (manually) annotate a training set that is relevant to the application at hand. As illustrated in Figure 1, and pointed out in [30, 31], not collecting appropriate training samples is likely to severely penalize the classification accuracy.

[Figure 1: ROC curves (missed detection rate vs. false detection rate) of the ICF detector trained on APIDIS, SPIROU, and INRIA; (a) tested on SPIROUDOME, (b) tested on APIDIS.]

Figure 1: Without surprise, the accuracy of a classifier highly depends on the representativeness of the training set: the ROC curves obtained with the well-known ICF detector [32] reflect a significant loss of detection accuracy when the classifier is tested on a different dataset than the one used for training. This is problematic because it means that a classifier that has been trained on samples collected from a (few) specific team sport game(s) will not perform properly when deployed to process the images of another game. The two basketball datasets, APIDIS and SPIROUDOME, are introduced in Section 4, and are made publicly available at [33]. The INRIA dataset has been extensively used for pedestrian detection evaluation [32, 23].


Hence, the number of cameras involved in the system, but also the computational complexity associated with visual detectors, as well as their need for different training samples at each game, are identified as being among the principal barriers to exploiting modern object detection approaches commercially, i.e., in a cost-effective and autonomous manner. Therefore, we primarily

investigate solutions that provide reliable detection in a single-view set-up, without having to manually collect training samples at runtime, and without requiring an exhaustive scan of each frame with an ad hoc classifier. Specifically, for computational efficiency, we adopt a hybrid framework. In a first step, a state-of-the-art geometrically-constrained foreground mask detector scans the background-subtracted image to identify, at very low cost, a set of candidate object locations. In a second step, only those candidate locations are processed by a visual classifier that differentiates true and false positives among the foreground detections.

The contributions associated with our proposed two-step hybrid framework can be summarized as follows.

For improved detection accuracy, and to avoid the performance drop depicted in Figure 1, we introduce a scene-specific classifier, which automatically adapts to the specificities of the game at hand (e.g., team colors, background appearance). In short, our proposed classifier is trained with samples that have been automatically labeled based on the probably correct, but error-prone, decisions of the conventional foreground detector. In practice, only the samples for which the labeling decision is sufficiently reliable (in terms of foreground/background detection cues) are considered for training. Adapting the classifier to the case at hand is especially relevant in the envisioned sport analysis context. This is due to the important deformations of the players, which make it important to exploit visual cues that are specific to a game (e.g., player equipment, scene background) in addition to the generic but poorly discriminative (due to the variety of deformations) features associated with player silhouettes.

To get the best out of our scene-specific training strategy, we also propose an original patch-based visual classifier, whose specificity lies in the robustness of its training to label corruption. Being robust to wrongly labeled training samples is critical in our case, since the training samples are automatically selected and labeled based on error-prone foreground detector cues. Our proposed classification method considers an ensemble of random sets of binary tests, also named ferns, to characterize the texture describing the visual appearance of a player, and can thereby be seen as an extension of the works in [34] and [35] towards the description of large image patterns. As an important lesson, we observe that the semi-naive Bayesian decision rule associated with the ensemble of ferns outperforms the boosted methods [32] in the particular case of training with corrupted labels, while performing similarly in the case of uncorrupted labels.

As a second observation, we note the benefit resulting (i) from a soft-thresholding of the variable handled by the binary tests in [35], and (ii) from the selection of spatially localized binary tests in each fern. When combined, all those contributions allow for cost-effective and reliable detection of players, starting from a computationally efficient foreground mask analysis, and increasing the detection reliability through advanced processing of a limited number of candidates. Avoiding the scanning of (sub-regions of) each image with a sliding window at multiple scales, as considered by conventional pedestrian detection methods, drastically reduces the computational complexity, which is of primary importance regarding the deployment of our system in low-cost commercial autonomous sport-analysis or production infrastructures [6, 2].

Our work extends our earlier conference publication [36] in several important aspects. It first improves the random fern classifier accuracy through (i) the use of soft-thresholded binary tests, and (ii) the block-based selection of binary tests in each fern. It then extends the evaluation of the method by presenting results on two distinct datasets that are publicly released [33], and are unique in that they capture team sport with a fixed, side-view, single-viewpoint calibrated camera, as considered by low-cost and easy-to-deploy sports analytics or autonomous production systems. Interestingly, in addition to [36], this validation compares our method to the ICF detector [32, 23], provides running-time figures, and analyzes how the variety of appearance features involved in the classifier impacts its complexity/accuracy trade-off.

The rest of the paper is organized as follows. Section 2 positions the previous art related to human and player detection with respect to our applicative context. Section 3 introduces our proposed two-step hybrid framework. It describes how to design scene-specific classifiers for player detection, and explains why it does not suffer from the drifting issue discussed in [30]. Section 4 then considers a representative basketball use case, and offers public access to two datasets that are extensively used to validate our contributions. Section 5 concludes.

2. Related works: Player/people detection in a nutshell

This section presents the previous works that are relevant to our study, and positions them with respect to our envisioned application scenario, both in terms of detection effectiveness and deployment constraints. Detecting people in images is an important question for many computer vision applications, including surveillance, automotive safety, and sportsmen behavior monitoring. It has motivated a long history of research efforts [23], which have recently converged into two main trends. On the one hand, background subtraction approaches have gained in popularity since they have been considered in a multi-view framework. In each view, those approaches rely on a background model to compute a mask that is supposed to detect the moving foreground objects in the view. The foreground silhouettes computed in the multiple views of a calibrated multi-camera set-up are then merged to mitigate the problems caused by occlusions and illumination changes when inferring people's locations from a single view. Several strategies have been considered to fuse the masks from multiple views [21, 18, 19, 20]. They generally rely on geometric primitives to define a ground occupancy probability map, which exploits the verticality of people's silhouettes to estimate the likelihood that a particular ground plane position is occupied by someone or not. Even if geometric primitives still help in single-view set-ups [17, 19, 20], all of these approaches build on the multiplicity and diversity of viewpoints, and their performance significantly degrades when a single viewpoint is available. As an example, Figure 2 illustrates the foreground detection in a basketball game covered by a single camera. We observe that the dynamic LED advertising boards, but also the shadows and the similar color between players' shirts and some parts of the field, induce a quite noisy foreground mask, making the ground occupancy mask relatively ambiguous, which results in many false positive detections.


Figure 2: Foreground-based detection results in many false positives in the case of a single viewpoint (a). The dynamic LED advertising boards, but also shadows, light reflections, and a lack of color discrimination, induce a noisy foreground mask (b), which results in an ambiguous ground occupancy map (c).



On the other hand, a significant number of investigations have been carried out to detect people or objects of interest based on their visual appearance. Modern approaches make extensive use of training samples to learn how the object is defined in terms of topologically organized components [24, 37] and/or in terms of texture statistics [34, 38]. The pioneering work of Viola and Jones [22] illustrates the success of those approaches in detecting objects in images. It relies on boosting strategies to select and combine a large number of weak binary tests to decide whether the content of a (sub-)image corresponds to the object-of-interest or not. Since the tests are defined in terms of the average luminance observed on small patches defined by their size and location in the image, their statistics intrinsically capture the spatial topological organization of the image textures. Several recent works have been inspired by the same intuition to detect people in images. A representative example is the work in [32, 39, 26], which considers integral image techniques to analyze the content of an image in terms of a multiplicity of pixel features (such as color, gradient, or motion) observed on a set of rectangular patches. The method appears to be efficient in detecting people, as long as a sufficiently large and representative database is available to train the classifier. A similar approach has been applied to sport player detection in [40], and the same image channels have been considered in many works dealing with pedestrian detection [41, 42, 43, 44]. Many works have investigated the use of more discriminant or pedestrian-specific features, e.g., inspired by the symmetry and some inherent attributes of pedestrians, to reduce computational complexity and/or to increase the detection accuracy [41, 45, 46, 47, 48, 43, 49]. The use of semantic channels has also been considered to exploit high-level scene attributes to improve pedestrian detection accuracy [50, 51].
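For reference, the integral-image trick that makes such channel features cheap to evaluate can be sketched as follows (a generic summed-area table, not the implementation of [32]; any per-pixel channel such as luminance, gradient magnitude, or color would be processed the same way):

```python
import numpy as np

def integral_image(channel):
    """Summed-area table with an extra zero row/column, so rectangle
    sums need no boundary special-casing."""
    ii = np.zeros((channel.shape[0] + 1, channel.shape[1] + 1))
    ii[1:, 1:] = channel.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of the channel over a rectangle, in O(1) per query."""
    return (ii[top + height, left + width] - ii[top, left + width]
            - ii[top + height, left] + ii[top, left])

channel = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(channel)
assert rect_sum(ii, 0, 0, 4, 4) == channel.sum()            # whole image
assert rect_sum(ii, 1, 1, 2, 2) == channel[1:3, 1:3].sum()  # inner 2x2
```

After the one-time cumulative sums, every rectangular feature costs four array lookups, which is what makes exhaustive multiscale scanning feasible at all, yet still expensive when millions of windows are tested per frame.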
All those works, however, require a careful manual selection of training samples, which prevents adapting the detector to the appearance specificities encountered in the particular case at hand. Such an adaptation capability is especially relevant in our team sport analysis context, since the background and the appearance of the players of each team are specific to the game at hand, and might carry quite discriminative information. Moreover, in accordance with our Figure 1, some very recent works have also pointed out the difficulty of being robust to large variations in pedestrian appearance [52], which confirms the need for scene-specific classifiers. The importance of using a sanitized training set has been pointed out in [53]. This last observation is directly in line with our observation that conventional approaches suffer in the presence of training set label corruption (see Figure 5), and further motivates the design of robust classifiers to address the online training stage of the adaptive classification system envisioned in our work. Another drawback of those numerous previous approaches lies in the computational load associated with the exhaustive scanning of the image with a multiscale classification window. Many efforts have been made to mitigate this drawback, leading to reasonably accurate solutions processing 480 × 640 images at 60 fps using modern CPUs [54, 26], and increasing the frame rate to 135 fps with GPUs [25]. The same computational complexity issue holds for recent approaches based on deep learning, which are generally even more complex [55, 56, 57, 58], including when cascades of filters are considered [28, 29, 59]. In contrast, we show in our experimental section that a foreground detector can process 100 frames per second with less than one percent of the CPU resources, making it attractive to pre-scan the image before further analysis of the visual texture.
This pre-scanning strategy is similar in principle to the approach in [60], which considers stereo images to estimate the presence of objects above the ground, thereby limiting the search area for the object detector, and reaching 100 fps on a CPU. As detailed in Section 3, our paper takes advantage of the two trends presented above to design a cost-effective and reliable player detector. Cost-effectiveness results from the exploitation


of the foreground mask to locate player candidates¹. Reliability arises from the classification of those candidates based on their visual appearance, using a so-called scene-specific classifier, i.e., a classifier that automatically adapts to the specificities of the scene/game at hand by collecting representative training samples based on the (error-prone) decisions of the foreground detector².

3. Player detection in a single-view set-up


This section presents our proposed hybrid detection framework. It explains how to learn the player appearance from the error-prone decisions of a conventional foreground silhouette detector. The advantage of our framework is twofold. First, the appearance-based classifier does not need to be applied to the whole frame, since it only aims at improving the foreground silhouette detector (referred to as the foreground detector in the following) by differentiating false and true positives among the silhouettes located by the foreground detector operating at a high recall rate. Second, since the training is based on samples extracted at run time, the classifier becomes scene-specific, in that it has the capability to account for the specific background and player appearances associated with the game at hand.


3.1. System overview: hybrid foreground/appearance detection

The proposed hybrid detection scheme is depicted in Figure 3.

[Figure 3: flowchart. Input video → foreground-based detection → ground plane occupancy map. Detection path (solid): high detection rate selection → probably positive image samples → appearance-based classification → appearance-validated detections. Training path (dashed): small false alarm rate selection → probably positive and probably negative image samples → training of classifier.]

Figure 3: Solid lines depict the proposed two-step hybrid detection scheme. The foreground-based detections are validated or rejected based on their appearance. Dashed lines depict the training phase. The appearance-based classifier is trained with image samples that are collected based on the foreground detector decisions. Note that the operating points of the foreground detector used in the detection and training phases are different. A high detection rate is required in the detection path, whilst a small false alarm rate is desired to select positive and negative samples in the training path. In practice, the different operating points are obtained by adjusting the foreground detector's decision threshold.
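The two operating points amount to applying two different thresholds to the same foreground-detector confidence score. A minimal numpy sketch (the function name and the threshold values are illustrative, not the paper's):

```python
import numpy as np

def select_training_samples(scores, strict=0.8, relaxed=0.2):
    """Automatic labeling from the foreground detector's occupancy scores:
    positions above a strict threshold become probably-positive samples,
    positions below a relaxed threshold probably-negative ones, and
    ambiguous positions are simply not used for training.
    Thresholds here are illustrative, not the paper's values."""
    scores = np.asarray(scores)
    probably_positive = scores >= strict
    probably_negative = scores <= relaxed
    return probably_positive, probably_negative

# Occupancy scores of six candidate ground positions.
scores = [0.95, 0.85, 0.5, 0.3, 0.1, 0.05]
pos, neg = select_training_samples(scores)
# Indices 0 and 1 are confident players, 4 and 5 confident background;
# indices 2 and 3 are too ambiguous to label, so they are discarded.
```

Discarding the ambiguous middle band is what keeps the automatically generated labels "probably" correct, at the price of a smaller training set.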

¹The foreground detector here operates with a high detection rate, and thus a significant false positive rate.
²The foreground detector is here parametrized so as to limit the error rate, at the cost of a small detection rate.

Along the solid path, people/player image samples are continuously detected with a high detection rate, and thus a potentially significant false alarm rate. This detection relies on a


conventional foreground-based detector that uses the camera calibration parameters to turn the background-subtracted mask into a (human silhouette) ground occupancy map, as described in [20] and [21]. Candidate player locations are computed as local maxima of this map, and are backprojected to the image plane to define player candidate bounding boxes. The resulting, probably positive, image samples are then processed by an appearance-based classifier, which further investigates the visual features of each detected foreground object to decide whether it corresponds to a human/player or not. Because it exploits color and gradient visual features, the appearance-based classifier offers information complementary to that provided by the foreground detector, thereby making the overall detection more reliable. Along the dashed path, the foreground detector operates at a reasonably small false alarm rate. Its decisions are exploited to feed the training of the classifier with examples that are representative of the game at hand. Specifically, two classes of training samples are defined based on the ground occupancy map computed in [20]. The first class of training samples corresponds to probably positive samples. Those samples are defined by cropping a rectangular sub-image in the camera view, around the backprojection of a ground position that is considered to be occupied by the detector when using a strict detection threshold. The training samples of the second class correspond to probably negative samples, which are randomly cropped around backprojected ground positions that are considered to be unoccupied by the detector, even with a relaxed detection threshold. Because our approach defines the training samples of the classifier based on the foreground detector decisions, no manual annotation is required to generate the training set, which makes it possible to retrain and adapt the classifier to the case at hand.
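The local-maximum search on the ground occupancy map can be sketched as follows (numpy only; the helper name, the 4-connected neighbourhood, and the toy map are illustrative simplifications of the detector described in [20]):

```python
import numpy as np

def local_maxima(occupancy, threshold):
    """Ground positions whose occupancy score exceeds `threshold` and all
    four axis-aligned neighbours: a minimal stand-in for the detector's
    local-maximum search on the ground occupancy map."""
    padded = np.pad(occupancy, 1, constant_values=-np.inf)
    center = padded[1:-1, 1:-1]
    is_max = ((center > padded[:-2, 1:-1]) & (center > padded[2:, 1:-1]) &
              (center > padded[1:-1, :-2]) & (center > padded[1:-1, 2:]) &
              (center > threshold))
    return list(zip(*np.nonzero(is_max)))

occupancy = np.zeros((5, 5))
occupancy[1, 1] = 0.9   # strong candidate (likely a player)
occupancy[3, 3] = 0.3   # weak candidate (noise or a distant player)
# Permissive threshold (detection path, high recall) keeps both candidates;
# a stricter threshold (training path) would keep only the confident one.
candidates = local_maxima(occupancy, 0.2)
```

Each retained ground position is then backprojected through the camera calibration into an image bounding box; only those few boxes, rather than every sliding window, reach the appearance classifier.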
Examples of image samples from both classes are presented in Figure 4.


Figure 4: Examples of samples labeled as probably positive (a) or probably negative (b) by the foreground detector on the SPIROUDOME dataset. The sample appearance variability is inherent to the sport practice, but is also due to the foreground detector's inaccuracy, which itself results from the single-viewpoint acquisition set-up.


The architecture of our two-step detector and the samples presented in Figure 4 suggest three important observations that motivate the rest of this section. We first observe a significant variability within each class of samples, which makes the learning of a classifier challenging, and motivates the need for online training to exploit highly discriminant game-specific features (e.g., related to player equipment or background appearance), in complement to generic but less discriminant features (e.g., related to widely variable poses). Section 3.2 details that aspect, and positions our work with regard to the previous art related to online training.



We also notice from the 4th sample in Figure 4-(a) that some positive samples selected by the foreground detector actually correspond to multiple players. This case is relatively rare in practice, since the foreground detector exploits geometric priors and camera calibration to restrict the detections to foreground activities whose spatial support is similar, both in size and aspect ratio, to a human silhouette [20]. However, such multi-player samples might occasionally pop up, preventing our two-step hybrid detector from always distinguishing each individual player. Interestingly, it is worth noting that this case has little impact in practice, since many applications in team sport analysis do not strictly require the accurate detection of each individual player at each time instant. For example, in autonomous video production/editing, the information about the players' locations is used to select the view angle to render the action, typically by cropping within a fixed view [7, 8] or driving a motorized camera [61]. Hence, this application is not interested in the accurate segmentation and identification of each individual player, but is rather eager to determine whether a given foreground activity either results from (one or several) players, or is caused by some other source such as dynamic advertisement panels or spot lighting. As another example, although it relies on individual player trajectories, team sport analytics can cope with detections that actually correspond to multiple players, as long as they only occur sparsely. This is because modern tracking solutions that exploit sporadic features, like a player's digit or his/her shirt color, are able to derive long-term trajectories, even in the presence of temporary multi-player clutter [62, 63, 64, 65]. Context-conditioned motion models have also been shown to improve tracking in the presence of noisy detections [66].
Finally, and more importantly, we note that, since the samples' labels are defined based on the foreground detector, they are subject to errors. As a consequence, the appearance-based classifier should be designed so that its learning is robust to label corruption. As can be observed in Figure 5, the accuracy of a boosted classifier, namely the popular ICF classifier [32, 23], drops significantly in the presence of label corruption. This is inherent to the boosting principle, and has motivated the design of a classifier that offers better robustness to labeling errors. This classifier combines randomized sets of binary tests and adopts a semi-naive Bayesian decision rule. It is presented in detail in Section 3.3.
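As a rough illustration of this classifier family, a minimal fern ensemble with a semi-naive Bayesian decision rule might look as follows. This is a generic sketch in the spirit of [34, 35], not the authors' implementation (which additionally uses soft-thresholded tests and block-based test selection); the feature vectors, test form, and all sizes are illustrative.

```python
import numpy as np

class FernEnsemble:
    """Each fern is a random set of S binary tests (here, comparisons of
    two feature values); ferns are assumed independent, so the class
    log-likelihoods of their codes simply add up (semi-naive Bayes)."""

    def __init__(self, n_ferns=10, n_tests=4, n_features=32, seed=0):
        rng = np.random.default_rng(seed)
        # Each test compares feature a with feature b.
        self.pairs = rng.integers(0, n_features, size=(n_ferns, n_tests, 2))
        self.n_ferns, self.n_tests = n_ferns, n_tests

    def _fern_codes(self, x):
        a, b = self.pairs[..., 0], self.pairs[..., 1]
        bits = (x[a] > x[b]).astype(int)                # (n_ferns, n_tests)
        return bits @ (1 << np.arange(self.n_tests))    # one code per fern

    def fit(self, X, y):
        # Per-fern, per-class histograms of codes, Laplace-smoothed; a
        # mislabeled sample only perturbs a few histogram bins, instead of
        # being re-weighted ever more heavily as in boosting.
        self.hist = np.ones((2, self.n_ferns, 2 ** self.n_tests))
        for x, label in zip(X, y):
            self.hist[label, np.arange(self.n_ferns), self._fern_codes(x)] += 1
        self.hist /= self.hist.sum(axis=2, keepdims=True)

    def predict(self, x):
        codes = self._fern_codes(x)
        loglik = np.log(self.hist[:, np.arange(self.n_ferns), codes]).sum(axis=1)
        return int(loglik.argmax())

rng = np.random.default_rng(1)
# Toy features: class 1 has larger values in the first half of the vector.
X0 = rng.normal(0, 1, (200, 32)); X0[:, :16] -= 2
X1 = rng.normal(0, 1, (200, 32)); X1[:, :16] += 2
clf = FernEnsemble()
clf.fit(np.vstack([X0, X1]), [0] * 200 + [1] * 200)
```

The additive log-likelihood pooling is what gives the ensemble its tolerance to corrupted labels: a fraction of wrong labels merely flattens the histograms slightly rather than steering the whole training procedure.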


Figure 5: ROC curves obtained with the ICF classifier [32, 23] for different label corruption rates of the training set (0%, 2%, 5%, and 10%). The SPIROUDOME (a) and APIDIS (b) datasets considered in those graphs are introduced in Section 4. We observe that the ICF accuracy rapidly drops as the rate of label corruption increases.


3.2. Specificities of our applicative context vs. work related to online training

To motivate both the need for online training and the development of an original solution to this problem, it is worth presenting the specificities of our application context.

Because it deals with team sport analysis, our system has to handle severe deformations of the object-of-interest (players are running, jumping, falling down, coming into contact with each other, etc.). Moreover, as explained above, the fact that the samples presented to the classifier are selected by the foreground detector adds to the appearance variability of those samples, which might sometimes even correspond to multiple players. Hence, to be effective, the classification of those samples should not only rely on the characterization of the standard appearance of a standing human, as is done for pedestrian detection for example, but should exploit as much as possible of the a priori information that is available about the appearance of the object (e.g., players' jerseys have a known color) and of the scene (sport hall, known background advertisements). Since this a priori information changes from one game to another, the classifier has to be trained online, so as to adapt to the game at hand.

Besides motivating the online training, the wide range of deformations encountered by the positive samples in our application also prevents the use of most of the solutions that have been proposed in the past to learn online without manual labeling of the training samples. Specifically, in the late nineties, Blum and Mitchell [67] introduced the co-training framework to reduce the amount of labeled samples required to train a classifier. Their purpose was to exploit unlabeled samples to jointly reinforce two complementary classifiers, i.e., classifiers that look at the data from different points of view, using independent features. In a straightforward implementation of their framework, the two classifiers are initially trained on a small set of manually labeled samples, and are then jointly improved by increasing the training set of one classifier based on the reliable labels assigned by the other classifier [68]. In more recent works, motion detection has been considered to initialize the learning process, so that manual labeling is no longer required [69, 70, 71]. In both kinds of approaches, however, a key issue lies in the selection of reliably labeled samples. To identify those reliable samples, earlier works make the explicit or implicit assumption that the appearances of the objects-of-interest are sufficiently similar to be accurately described by some fixed discriminative (appearance) model. They then propose to learn such a discriminative model from the dominant statistics observed among the positively labeled samples of each classifier. Depending on the adopted framework, those statistics are defined in terms of PCA [69], in terms of base classifier decisions (in a boosted framework [71], including with cascade structures [72]), or simply in terms of motion blob aspect ratios [70].

In [31], the online incremental learning process is fed by a tracking procedure (to interpolate the position of missed detections and to identify false detections), and the incremental learning problem is formulated in a multiple instance learning framework, so as to account for the fact that only a fraction of the image samples collected from the tracking outcome at a given time instant might actually correspond to the object to detect. To adapt a generic detector to the scene at hand, [30] accounts for the detection score and a variety of context cues (size, pedestrian paths, motion) to assign confidence levels to the samples detected in the target scene, so as to train a scene-specific detector using a confidence-encoded SVM. In a face detection context, [73] relies on a general, offline-trained probabilistic elastic part (PEP) model to identify the most positive and negative samples in the database at hand, and then trains a discriminative SVM classifier online from those most positive/negative samples.

ED

240

275

280

In our sport analysis context, however, the assumption about the existence of a stable appearance model or strong context cues does not hold anymore. Even if context-conditioned motion models have been shown to improve tracking in team sport environments [66], it is not straightforward to exploit those models to discriminate between true and false detections. Players are very active. They follow largely unpredictable motion patterns, and their silhouettes change a lot depending on the action at hand (see the variability in Fig. 4-(a)). Bottom line, we cannot rely on some generic (simplistic) appearance or motion model to select reliable samples among the ones detected based on foreground mask analysis or object tracking. For this reason, we have to deal with erroneous labels during training. This difference is fundamental compared to the prior art: we do not count on some kind of oracle to identify the samples that are confidently labeled. Instead, we decide to live with labeling errors, and to focus on the design of a classifier that is robust to (a limited amount of) labeling errors, as can be expected from a majority vote or a Bayesian decision rule. This is in contrast with most previous works, which adopt boosted classifiers, known to be especially sensitive to labeling errors (see Figure 5).

Another reason why previous works are so concerned with the estimation of reliable labels for the training samples lies in the architecture of their detection framework. Fundamentally, they start with a rudimentary detector, and consider online training to improve its performance through an iterative process switching between the automatic selection and labeling of samples based on the classifier(s) decisions, and classifier(s) updates based on the newly extracted samples. Obviously, this induces a dependency between the decisions of the classifier and its update, which might lead to drifting problems if the update step builds on wrong labeling decisions [30]. In contrast, our framework works in open loop. As depicted in Figure 3, the classifier decisions are not used to label samples that then feed a new training step (potentially after reliable sample identification). In our case, classifier decisions are only considered to sort the (high recall) foreground detector decisions. Hence, there is no risk of drift. The only risk is that our classifier starts validating (rejecting) wrong (right) decisions from the foreground detector. This would typically appear when the majority of the decisions taken by the foreground detector are wrong, which has never been observed in practice, especially since the detection threshold of the foreground detector has been set to values that target a small error rate on each of the dashed paths in Figure 3.

As it appears from Section 4, the semi-naive Bayesian decision rule fundamentally reflects the dominant statistics in the input training samples. Hence, our framework implicitly follows the same intuition as earlier works, in the sense that it relies on the dominant statistics observed among the samples detected by the error-prone foreground detector, but without having to explicitly define a model from which label confidence levels are inferred (see for example the use of PCA in [69], context cues in [30], or base classifier decisions in [71]).

3.3. Classification of object image patterns based on ensembles of random binary tests

Many works have demonstrated the advantages of combining (weak) binary tests to solve image classification problems [34, 38, 22, 32, 35]. In particular, ensembles of random classifiers have gained popularity in recent years, mainly because they reduce the risk of overfitting and offer good generalization properties in case of training sample scarcity [74]. Therefore, we have decided to follow this paradigm to learn the appearance of players in a game. Interestingly, our experiments, reported in Section 4, reveal that those random classifiers are also much more robust to label corruption than the boosted alternatives [22], which gives additional credit to our choice.


The rest of the section is organized as follows. Section 3.3.1 defines the binary tests in terms of pixel value comparisons. Section 3.3.2 presents the Random Ferns approach that is used to combine the binary tests. It adopts a semi-naive Bayesian formulation, and classifies samples based on the joint probability distributions associated with random ferns, i.e. with small sets of randomly selected binary tests [35]. In contrast to previous usages of ferns, which have focused on the description of small texture patches around key points, our paper proposes to exploit ferns to classify semantically meaningful image patterns. By semantically meaningful, we mean that each individual image pattern conveys some information about the objects present in the scene from which it is extracted, i.e. it is large enough to allow the recognition of a real-life object. This implies two changes in the classifier definition, which appear to have a significant impact on accuracy (see Section 4). Those changes are respectively related to the definition of regularized binary tests (Section 3.3.1), and to the selection of spatially localized tests within a fern (Section 3.3.2).

3.3.1. Definition of binary tests

In our work, for each backprojected bounding box associated with a candidate player location, the tests are carried out on so-called image channels, defined in [32] as the R, G, and B components, the gradient magnitude GM, and the magnitudes of the oriented gradients OG_j, 0 ≤ j ≤ 5. In [35], a binary test compares the intensities of two pixel locations. Comparisons of pixel intensities are performed within a small block, e.g. limited to 16 × 16 pixels, because they aim at describing local textures. The extension of [35] to multiple channels is trivial. Formally, given a test image channel I_i ∈ {R, G, B, GM, {OG_k}_{0 ≤ k ≤ 5}} and a pair of pixel locations (m_{i,1}, m_{i,2}) defined within a 16 × 16 block, the i-th binary test b_i is defined as:

    b_i = { 1   if I_i(m_{i,1}) − I_i(m_{i,2}) > 0
          { 0   otherwise.                                      (1)

Such tests have proved effective at capturing the statistics of key point textures [35]. By definition of a key point, those textures are characterized by important gradients and patterns of changing intensities. In contrast, a general object pattern is partly characterized by its homogeneous areas of (close to) constant intensity. In those regions, the outcome of a binary test like the one defined in Equation (1) is close to random, and the test becomes unable to capture any relevant information. For this reason, we propose to regularize the difference d_i = I_i(m_{i,1}) − I_i(m_{i,2}) between the channel pixel intensities, so as to activate the binary test only when this difference is significant enough. Hence, we define the regularized difference d* as

    d* = argmin_x ||x − d_i||_2^2 + λ ||x||_1,                  (2)

and the binary test in Equation (1) becomes:

    b_i = { 1   if d* > 0
          { 0   otherwise.                                      (3)

Computing the regularized difference d* is straightforward, and is known as the soft-thresholding operation in the literature. Specifically, the soft-thresholding operator S_λ is defined as:

    S_λ(d) = { d − 0.5λ   if d > 0.5λ
             { d + 0.5λ   if d < −0.5λ
             { 0          otherwise,                            (4)

and our so-called soft-thresholded binary test in Equation (3) simply writes:

    b_i = { 1   if d_i > 0.5λ
          { 0   otherwise,                                      (5)

or equivalently, with δ = 0.5λ:

    b_i = { 1   if I_i(m_{i,1}) − I_i(m_{i,2}) > δ
          { 0   otherwise.                                      (6)

In practice, our experiments demonstrate that setting δ to a small non-zero value, typically equal to one percent of the image channel dynamic range, significantly improves the classifier performance. Note that our binary tests are only based on the appearance of the image sample cropped around the backprojection of the (probably occupied) ground position. It would be easy to extend them to include features like the position on the ground field, or the repetition of co-located detections in time, as considered in the form of context cues in [30].

3.3.2. Combination of binary tests

Our proposed approach to combining the weak binary classifiers is inspired by a number of earlier works dealing with texture statistics classification [34] and key point identification [35]. It follows our initial work in [36], and differs from those previous works in that it is designed to describe the large pattern corresponding to the projection of a semantically meaningful object, here a human being. Therefore, the binary tests are selected over the entire image support, and have to be defined in terms of their relative position with respect to the image support. This is simply done by normalizing the image sizes, typically to 128 × 64 pixels in our work. To explain the other specificities of our approach compared to [35], it is worth recalling the principle underlying classification with ensembles of random sets of binary tests, also named random ferns (RF) classification. Let C denote the random variable that represents the class of an image sample. In our problem, C = 1 if the sample corresponds to a player, and C = 0 otherwise. Given a set of N binary tests b_i, i = 1, ..., N, the sample class MAP estimate ĉ is defined by:

    ĉ = argmax_{c ∈ {0,1}} P(C = c | b_1, ..., b_N).            (7)

Bayes' formula yields:

    ĉ = argmax_{c ∈ {0,1}} P(b_1, ..., b_N | C = c),            (8)

if we admit a uniform prior P(C).

Learning and maintaining the joint probability in Equation (8) is not feasible for large N, since it would require computing and storing 2^N entries for each class. A naive approximation would assume independence between the binary tests, which would reduce the number of entries per class to N; however, such a representation completely ignores the correlations between the tests. The semi-naive Bayesian approach proposed in [35] accounts for dependencies between tests while keeping the problem tractable, by grouping the N binary tests into M sets of size S = N/M. These groups are named ferns, and the joint conditional probability is approximated by:

    P(b_1, ..., b_N | C = c) = ∏_{k=1}^{M} P(F_k | C = c),      (9)

where F_k denotes the kth fern, and the class conditional distribution of each fern is simply learnt by accumulating observations, as detailed in [35].

Now that the random ferns classification principles have been recalled, we explain how our approach differs from earlier works in the way it assigns tests to ferns. This subtle change is required to characterize large image patterns, and not just small texture patches as in [35]. In [35], a random permutation function with range 1...N splits the N tests into ferns of S tests. This is motivated by the fact that all tests a priori have the same chance of being (in)dependent. In our case, this assumption does not hold anymore. Our tests are local by definition, since they compare the intensities of two locations that are close to each other. Hence, two tests dealing with the same image sub-region are more likely to depend on each other than two tests dealing with pairs of pixels that are far apart. Since dependencies are only handled within a fern, we assign to each fern a set of tests that correspond to the same spatial area, in practice to the same block. Finally, when using pixel intensity comparisons, our proposed approach can be summarized as follows: the image support is split into a grid of non-overlapping 16 × 16 blocks, and all tests of a given fern are defined on pairs of pixels selected within the same block.
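To make the preceding definitions concrete, the sketch below combines the soft-thresholded binary test of Equation (6) with the semi-naive Bayesian decision of Equations (8)-(9). This is a hypothetical Python illustration, not the authors' implementation: each fern groups S tests, its binary outcomes are concatenated into an integer index, and classification sums the class-conditional fern log-probabilities, with Laplace-smoothed counts (a common choice we assume here to avoid zero probabilities).

```python
import math

def binary_test(channel, m1, m2, delta):
    """Soft-thresholded test of Equation (6): fires only when the intensity
    difference between two pixel locations exceeds delta (= 0.5 * lambda)."""
    return 1 if channel[m1[0]][m1[1]] - channel[m2[0]][m2[1]] > delta else 0

class BlockRandomFerns:
    """Minimal sketch of the semi-naive Bayesian ferns of Equations (8)-(9).
    Each fern is a list of test functions mapping a sample to {0, 1}; in our
    setting, all tests of one fern would draw their pixel pairs within the
    same 16 x 16 block."""

    def __init__(self, ferns, n_classes=2):
        self.ferns = ferns
        # counts[c][k][outcome]: class-conditional histogram of fern k,
        # initialized to 1 (Laplace smoothing)
        self.counts = [[[1] * (2 ** len(f)) for f in ferns]
                       for _ in range(n_classes)]

    def _outcome(self, fern, sample):
        # concatenate the S binary test results into one integer in [0, 2^S)
        idx = 0
        for test in fern:
            idx = (idx << 1) | test(sample)
        return idx

    def update(self, sample, label):
        # training reduces to accumulating observations (hence incremental)
        for k, fern in enumerate(self.ferns):
            self.counts[label][k][self._outcome(fern, sample)] += 1

    def classify(self, sample):
        # MAP decision under a uniform prior:
        #   argmax_c  sum_k log P(F_k | C = c)
        best, best_ll = 0, -math.inf
        for c, class_counts in enumerate(self.counts):
            ll = 0.0
            for k, fern in enumerate(self.ferns):
                hist = class_counts[k]
                ll += math.log(hist[self._outcome(fern, sample)] / sum(hist))
            if ll > best_ll:
                best, best_ll = c, ll
        return best
```

In our block-based variant, a fern attached to one block would hold five closures of the form `lambda s: binary_test(s, m1, m2, delta)`, with both pixel locations drawn inside that block.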

4. Experimental validation

This section considers a typical real-life team sport player detection scenario. It evaluates the benefit resulting from the visual classification of the image samples detected by a conventional foreground-based detector into player and not-player classes. To make our validation relevant, we have considered two basketball datasets that correspond to realistic and among the most challenging scenarios encountered when covering team sports with a single viewpoint. The datasets considered in our experiments have been derived from two different games, which respectively took place in the SPIROUDOME sport hall (http://www.spiroudome.com) and in the Hall Octave Henry in Namur. Both datasets, as well as ground truth player locations, are made publicly available at [33]. Those datasets are especially representative and challenging because (1) compared to outdoor sports, indoor team sports cannot exploit the high contrast between players and grass field, making the foreground mask especially noisy, (2) among indoor team sports, basketball action is quite dynamic and results in the largest interactions/occlusions between players, (3) among basketball games, the APIDIS and SPIROUDOME games are especially challenging since they both present players whose shirts have the same color as some parts of the background (same appearance of advertisements on the field and on the local team players' shirts), (4) APIDIS corresponds to an easier case in terms of player activity (women's game) and foreground mask computation (no dynamic advertisement panels), and (5) SPIROUDOME corresponds to the ultimate scenario encountered in practice, with a very dynamic scene (men's basketball national league) and a quite noisy foreground mask due to dynamic LED advertisement boards.

As a main outcome, the section demonstrates the benefit of our game-specific classifier to improve the detections resulting from a state-of-the-art player foreground detector, and shows that it only requires a small fraction of a modern CPU in practical scenarios. Thereby, it also demonstrates the relevance of our hybrid detection strategy, since accurate detection is achieved with a very limited computational complexity. Besides, our experiments also reveal that, in presence of label corruption, our proposed random ferns method to turn an ensemble of weak pixel-based binary tests into a player detector outperforms the conventional AdaBoost strategy, as well as the popular Integral Channel Features (ICF) method [32, 23]. The section extends our preliminary results in [36] by demonstrating the classifier accuracy improvement resulting from (i) the use of soft-thresholded binary tests, and (ii) the block-based selection of binary tests in each fern. In addition to [36], it also presents results on two distinct datasets that are publicly released to favor reproducible research [33], compares our method to the ICF detector [32, 23], provides running time figures, and analyzes how the variety of appearance features involved in the classifier impacts its complexity/accuracy trade-off. We now present our experiments in detail.

Except when explicitly mentioned, the training sets considered in this section have been defined automatically, as explained in Section 3.1. The set of probably positive samples, as identified by the foreground-based detector [20], is referred to as the detector set in the following. The set of probably negative samples is named the random set. In addition, a reference ground truth label has been assigned manually to each sample of the detector set, so as to split the detector set into a positive and a negative set. The positive set includes the valid detections, while the negative set contains the false detections resulting from foreground detector errors. In our experiments, we train the classifiers based on the detector and random training sets, and measure how well those classifiers discriminate between the positive and the negative test sets. For each game, five pairs of detector and random sets have been picked randomly, so as to repeat the experiments and compute average and standard deviation performance metrics. Each detector set is composed of 1000 samples randomly picked among the foreground detected samples and is affected by a variable rate n of false detections (n = 0, 2, 5 or 10%), depending on the foreground detector operating point. When the tests are performed on the same sequence as the one used for training, all test sets have been extracted by the foreground detector in a different game time period than the training sets.
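The detector-set construction just described can be sketched as follows. All names here (`make_detector_set`, `positives`, `negatives`) are hypothetical illustrations, where `positives` and `negatives` stand for the manually labeled valid and false foreground detections:

```python
import random

def make_detector_set(positives, negatives, size=1000, false_rate=0.05, seed=0):
    """Sketch of a detector training set: `size` samples, all presented to
    the classifier with a positive label, of which a fraction `false_rate`
    are actually false detections (i.e. corrupted labels)."""
    rng = random.Random(seed)
    n_false = round(size * false_rate)
    picked = rng.sample(negatives, n_false) + rng.sample(positives, size - n_false)
    rng.shuffle(picked)
    # every sample carries label 1, wrongly so for the n_false false detections
    return [(s, 1) for s in picked]
```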

4.1. Block-based and soft-thresholded random ferns


This section investigates the advantages arising from the two specificities of our ferns-based classifier design, namely the use of soft-thresholded binary tests and the block-based definition of each fern. Figure 6 presents, for both datasets, the ROC curves obtained with our block-based random ferns when the tests are defined either with or without soft-thresholding. We observe that the soft-thresholding significantly improves the classifier performance. Figure 7 presents, for both datasets, the ROC curves obtained with soft-thresholded tests whose pixel locations are either restricted to a single block for each individual fern or can be selected arbitrarily over the whole image pattern. We observe that defining ferns on a block basis significantly improves the classifier accuracy, since it exploits the correlation inherent to neighboring pixels.


Figure 6: ROC curves obtained with (SoftThres) and without (NoThres, as in [35]) binary test soft-thresholding, for the SPIROUDOME (a) and APIDIS (b) datasets.

Figure 7: ROC curves obtained with binary tests defined onto a single block for each fern (BRF) or onto the whole image pattern (IRF), for the SPIROUDOME (a) and APIDIS (b) datasets.


4.2. Robustness to label corruption

This section compares, on both datasets, the proposed block-based random ferns (BRF) classifier with alternative solutions, including the popular ICF detector [32, 23] and the conventional AdaBoost boosting strategy [22]. All classifiers are trained with the same sets, belonging to the same game as the test set. Particular attention is devoted to the impact of errors affecting the training set labels because, during deployment of our hybrid detector, the classifier will be trained based on slightly corrupted labels. Hence, our preferred classifier should be robust to such corruption.

For those trials, presented in Figures 8 and 9, each image sample is characterized by 10 image channels, and the classifier parameters are set as follows. There are 5 tests per fern, and 200 ferns per 16 × 16 image block. Hence, there are 32000 tests for a normalized image of size 128 × 64. The same number of tests, i.e. weak classifiers, is considered for the Random Ferns (RF) and AdaBoost classifiers.

Figure 8 compares BRF with the ICF method. It plots, for both kinds of classifiers, the detection rate on the positive set versus the detection rate on the negative set (which corresponds to the false alarm rate on the detector set). Figures 8-(a) and (b) reveal that our proposed random ferns achieve similar performance to ICF on uncorrupted labels, but outperform it in presence of corrupted labels. Table 1 gives the corresponding area-under-curve values, with standard deviations.

Figure 8: BRF versus ICF, for different corruption rates of the training sample labels, for the SPIROUDOME (a) and APIDIS (b) datasets. Those figures reveal that our proposed random ferns achieve similar performance to ICF on uncorrupted labels, but outperform it in presence of corrupted labels.


Figure 9 compares the BRF and AdaBoost strategies to combine our proposed weak classifiers (i.e. pairwise pixel comparisons) into a strong classifier. It reveals that our proposed random ferns approach outperforms the AdaBoost strategy, both in presence and in absence of label corruption. Moreover, as an additional advantage compared to AdaBoost, it is worth noting that random ferns naturally support incremental training, through an update of the class conditional probabilities [35, 75]. This is especially interesting in our team sport analysis context, since it allows initializing the process with default fern probability distributions (e.g. averaged over several games), and progressively updating them based on the game at hand, making the classifier scene-specific.
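This incremental, scene-specific adaptation can be sketched as follows, with a hypothetical `FernHistogram` helper of our own (not the paper's code): the class-conditional counts of one fern are seeded with a default distribution and then refined with the observations of the game at hand.

```python
class FernHistogram:
    """Sketch of incremental fern training: counts are seeded with a default
    class-conditional histogram (e.g. averaged over several games) and then
    updated online with the game at hand."""

    def __init__(self, default_counts):
        self.counts = list(default_counts)

    def update(self, outcome):
        # one more observation of this fern outcome for the class at hand
        self.counts[outcome] += 1

    def probability(self, outcome):
        # class-conditional probability P(F = outcome | C) for this class
        return self.counts[outcome] / sum(self.counts)
```

As more game-specific observations accumulate, they progressively dominate the default prior, which is the behavior motivating the comment above.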



Figure 9: BRF versus block-based AdaBoost (BAB), for different corruption rates of the training sample labels, for the SPIROUDOME (a) and APIDIS (b) datasets.

Table 1: Areas under curve in Fig. 8 and Fig. 9 (mean ± standard deviation).

             Uncorrupted (0%)    2%                 5%                 10%
SPIROUDOME
  BRF        0.949 ± 0.005      0.948 ± 0.002      0.942 ± 0.005      0.939 ± 0.004
  ICF        0.950 ± 0.008      0.918 ± 0.018      0.881 ± 0.027      0.848 ± 0.044
  BAB        0.858 ± 0.038      0.819 ± 0.037      0.762 ± 0.057      0.727 ± 0.053
APIDIS
  BRF        0.950 ± 0.001      0.951 ± 0.001      0.947 ± 0.001      0.942 ± 0.002
  ICF        0.968 ± 0.005      0.952 ± 0.008      0.925 ± 0.019      0.872 ± 0.023
  BAB        0.927 ± 0.016      0.899 ± 0.016      0.862 ± 0.016      0.804 ± 0.011

4.3. Two-steps hybrid detector accuracy

Since improving the performance of a conventional foreground detector in the single-view case was the initial motivation of our work, we have measured the impact of our proposed scene-specific classifier on the operating points of the integrated system depicted in Fig. 3, when applied to each dataset. For this purpose, for the SPIROUDOME dataset, we have defined manually a detection ground truth over 280 frames, regularly spaced in an interval of 4 min 40 s, and for the APIDIS dataset we have a detection ground truth for one quarter. This ground truth consists of the bounding boxes of the players and referees in the frame view coordinate system. We have then compared this ground truth to the detections computed by the foreground detector in [20], and to the subsets of those detections that are considered to be positive by our scene-specific classifier. A detector output and a ground-truth bounding box are considered to be matched if their intersection-over-mean area ratio exceeds 50%.

Figure 10 presents, as a green solid line, the ROC curve of the initial foreground detector algorithm. The dotted, dash-dotted and dashed magenta lines correspond to the ROC curves obtained when using the random ferns classifier to sort the foreground detections into false and true positives. Each of these 3 curves is derived from a particular foreground detector operating point, respectively corresponding to 20%, 50% and 70% of missed detections. The classifier has been trained with an operating point corresponding to 5% of false positives.

We observe that our scheme is able to preserve the initial detection rate, while significantly reducing the false positive rate. We conclude from Figure 10 that the classifier definitely and significantly improves the operating trade-offs compared to the ones obtained based on foreground detection only, which demonstrates the relevance of the scheme proposed in Figure 3. In addition, the comparison with the red and blue curves obtained by the ICF detector [32] reveals that ICF is only competitive with our scheme in the unrealistic case for which ICF is trained on samples extracted from the game at hand. Hence, we conclude that, whilst being computationally simpler (see Section 4.4), our solution significantly outperforms ICF when ICF is trained on a training set derived from another game than the one at hand. This case is the one of interest, since it corresponds to what is encountered in practical wide-scale deployment scenarios. Videos of the system in action are available at [76], and image samples are presented in Fig. 11.
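The intersection-over-mean matching criterion used in this evaluation can be made concrete with the following sketch (a hypothetical helper; the (x_min, y_min, x_max, y_max) box format is our assumption):

```python
def intersection_over_mean(box_a, box_b):
    """Intersection area divided by the mean of the two box areas.
    Boxes are (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    iw = max(0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    mean_area = 0.5 * ((ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0))
    return inter / mean_area

def matched(box_a, box_b, threshold=0.5):
    # a detection and a ground-truth box match when the ratio exceeds 50%
    return intersection_over_mean(box_a, box_b) > threshold
```

Note that, compared to the more usual intersection-over-union, this ratio is slightly more permissive, since the mean of the two areas never exceeds their union.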


Figure 10: Improvement of ROC resulting from the cascade depicted in Figure 3 for the SPIROUDOME (a) and APIDIS (b) datasets. The green curve corresponds to the foreground detector. The four magenta curves correspond to our proposed hybrid detection scheme, and are derived from four distinct foreground detector operating points, characterized by different missed detection (MD) rates. The red and blue curves correspond to the ICF detector [32], respectively trained on SPIROUDOME and on APIDIS. See the text for details. Video demos are available at [76] and image samples are presented in Fig. 11.


Figure 11: Results from the cascade depicted in Figure 3 for the SPIROUDOME (left) and APIDIS (right) datasets. Blue, green and red bounding boxes correspond to the candidate positions of the foreground detector (20% of missed detections, see Fig. 10). Blue bounding boxes are rejected by the RF classifier. The other boxes are kept by the RF classifier. They are drawn in green when close to the players' ground-truth positions (true positive), and in red otherwise (false positive). Failure cases correspond to missed foreground detections (see the two players in the bottom-left image), erroneous rejection of true positive foreground (see the referee in the top-right image), and non-rejection by the classifier of false positive foreground detections (see the red boxes in the last row). Note that shadows are only detected by the foreground detector when they are reasonably vertical. They might sometimes be rejected by the classifier (see the blue boxes around shadows in the second-row left and bottom-right images), but not always (see the failure-case red boxes around shadows in the last row).


4.4. Computational complexity analysis


Now that we have demonstrated the advantages of our proposed random ferns classifier, both in term of classification accuracy and training robustness, we investigate its computational complexity. We first analyze how classification performances degrade as a function of the number of tests and channels involved in the random ferns. We then provide some running time figures, measured on a modern CPU. Thereby, we demonstrate that our proposed player detection algorithm only requires a small percentage of the resources provided by a modern logical processor unit when running at twenty or thirty frames per second. In the context of autonomous production and sport analytics systems, such small computational resources are desired to preserve enough resources for the remaining components of the system, e.g., including ball detection/tracking [77, 78, 79, 80, 81], automatic camera planning and image editing [61, 7, 8], and video compression for interactive streaming [82]. In the following experiments, there are 5 tests per fern, and F ferns per 16 × 16 image block (F = 210, 180, 150, 120, 90, 60, 30, 10, 9, 6, 3 or 1). The F ferns are equally distributed over the channels. To evaluate the benefit of considering many different channels, we compare classifiers that exploit different color channels, i.e. Y, RGB or RGB + Gradient Magnitude (GM) + Oriented Gradients (OG), while keeping the same number of tests per block. Figures 12-(a) and 12-(b) plot the obtained ROC curves for different numbers of ferns by block. We observe that the classification performance is reasonably preserved while reducing the number of ferns per block. We also observe that reducing the number of channels has little impact on the performances, which suggests that working on the Y component only might be reasonable in resource constrained scenarios. 
Considering a true positive rate around 90%, we see in Figures 12-(c) and 12-(d) that computing 3 ferns per Y or RGB block already rejects more than 60% of the foreground detector false positives, while 30 ferns reject more than 80% of false positives. In Figure 12-(b), we observe that many channels are only helpful to achieve very high rejection rates: with more than 60 ferns per block, the 10-channel scenario preserves 80% of detections while reducing the false positives to 5%. In contrast, using only the 3 RGB channels cannot preserve more than 60% of detections at such a small false positive rate.

Figure 13 plots the area under curve (AUC) values as a function of the number of ferns per block, for different numbers of image channels and different rates of corrupted labels. It reveals that the performance does not improve beyond 60 ferns per block. At a constant number of ferns (i.e., at constant computational cost), when enough resources are available, using more image channels results in better performance. In contrast, when the available resources only allow for a few ferns per block, it is better to focus on the color channels and ignore the (oriented) gradients. Using fewer channels offers the additional advantage of reducing the memory requirements, while still offering reasonable classification accuracy with as few as 3 or 6 ferns per block.

To quantify the computational requirements of our method without being limited by the speed at which the input frames are forwarded by the camera or read from the hard drive, following the methodology introduced in [83], we have run our system while continuously and iteratively accessing the same input image (from the cache). In this experiment, each processing iteration computes a 640 × 480 foreground mask, exploits look-up tables to transform this mask into rectified coordinates [20], computes the associated integral images³, and runs the player detection.
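The cache-resident benchmarking methodology can be sketched with a small timing harness. This is an illustrative sketch, not the paper's instrumentation: `process_frame` is a placeholder for the mask/rectification/detection pipeline, and the trial counts are arbitrary.

```python
import statistics
import time

def measure_throughput(process_frame, frame, n_trials=10, n_iters=1000):
    """Throughput of `process_frame` on a frame kept hot in cache.

    The same input is processed over and over, so disk and camera I/O
    never limit the measurement; per-trial iteration rates are then
    summarized by their mean and standard deviation.
    """
    rates = []
    for _ in range(n_trials):
        start = time.perf_counter()
        for _ in range(n_iters):
            process_frame(frame)
        rates.append(n_iters / (time.perf_counter() - start))
    return statistics.mean(rates), statistics.stdev(rates)
```

With the full pipeline plugged in as `process_frame`, this loop reports mean and standard-deviation throughput figures of the kind quoted in the text.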

³ Which reduce to vertical segments in our optimized implementation.


[Four ROC panels (a)-(d): true positive rate vs. false positive rate, comparing 3 vs. 10 channels (top) and 3 vs. 1 channels (bottom), for SPIROUDOME and APIDIS.]

Figure 12: ROC curves of the random ferns (BRF) classifier trained on corrupted labeled samples (5%), for different numbers of ferns per block (f/b) equally divided over 3 or 10 channels (3 ch = RGB, 10 ch = RGB+GM+OG) (top), or over 1 or 3 channels (3 ch = RGB, 1 ch = Y) (bottom), for the SPIROUDOME (left) and APIDIS (right) datasets.

The processor used was a hyper-threaded quad-core Intel i7-4790 CPU @ 3.60 GHz (i.e., two logical processors per core). Our implementation builds on a Linux operating system (Ubuntu 14.04 LTS, 64 bits), and uses the Intel® Integrated Performance Primitives (IPP) and the C/C++ programming language. The observed throughput over 10 trials was 12053 iterations/sec, with a standard deviation of 122 iterations/sec. Table 2 presents the percentage of resources used by each component of the system. This percentage is expressed in terms of logical processor usage; hence, when fully activated, the 8 threads of the processor correspond to 800% of resources. We observe that none of the processes saturates its assigned logical processor, which reveals that computations do not constrain the throughput. In particular, those numbers show that at 20 fps, which is a reasonable rate to drive an autonomous production system for example, the computational resources required by the foreground detector are negligible. Similarly, we have measured the complexity associated with our block-based random ferns (BRF) classifier by randomly accessing 40 candidate player positions within the same input image stored in the cache. The classifier builds on a single channel and 30 ferns per block, which provides a good accuracy/complexity trade-off (see Figure 12). We have observed that


[Two AUC panels: (a) BRF complexity on SPIROUDOME, (b) BRF complexity on APIDIS; area under curve vs. number of ferns per block.]

Figure 13: Areas under curve (AUC) as a function of the number of ferns per block, for different rates of corrupted labels (0% or 10%) and different numbers of channels (1, 3 or 10). SPIROUDOME is on the left, APIDIS on the right.

System component         Logical processor usage (%)
Foreground mask          37 %
Look-up and integrals    41 %
Player detection         60 %

Table 2: Percentage of logical processor (hardware thread) utilization associated with the players' detection system. Total resources for the tested quad-core processor correspond to 800% of a virtual/logical processor. The input image is accessed from fast cache memory; the system runs at about 12000 frames/sec.
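The Table 2 percentages were measured at the cache-bound rate of roughly 12000 iterations/sec; their equivalent at a production frame rate follows by linear scaling. The sketch below is a back-of-the-envelope check (assuming a constant per-frame cost), not a measurement:

```python
# Scale the Table 2 logical-processor usage (measured at ~12053
# iterations/sec from cache) down to a 20 fps production rate,
# assuming the per-frame cost stays constant.
measured_rate = 12053.0  # iterations/sec
target_fps = 20.0
usage_at_measured_rate = {  # % of one logical processor (Table 2)
    "foreground mask": 37.0,
    "look-up and integrals": 41.0,
    "player detection": 60.0,
}
for component, pct in usage_at_measured_rate.items():
    at_target = pct * target_fps / measured_rate
    print(f"{component}: {at_target:.2f}% of a logical processor at 20 fps")
```

All three components land well below 1% of a single logical processor at 20 fps, which is what makes the foreground-based stage negligible in practice.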

the process saturates one logical processor, and achieves a frame rate of 530 fps with a standard deviation of 11 fps. This means that, at a conventional rate of 20 fps, the random ferns classification needs less than 4% of the computational resources of a single logical processor. Predicting the capacity in an arbitrary usage scenario (e.g., with more camera viewpoints and more detection candidates) is not trivial since, in multi-thread/multi-core CPU architectures, capacity extrapolation requires knowing both the detailed utilization of all shared core processing units (arithmetic and logic unit, floating-point unit, cache, memory bandwidth, etc.) in the reference usage scenario, and the characteristics of the workload to be added in the targeted scenario. However, from the above numbers, one can reasonably consider that in most practical team sport coverage scenarios, for which the acquisition frame rate remains below 30 fps and a 1280 × 480 panoramic image is enough to cover the field, the computational load of our proposed hybrid detector remains small enough not to hamper other concurrent processes, e.g., those dealing with data transfer, viewpoint selection, image reconstruction and compression in an autonomous video production system. This is in contrast with the computational load reported for pure classifier-based detectors (see Fig. 10 in [26]), which monopolize all available resources to process 1280 × 480 images at 30 fps.
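The "less than 4%" figure follows directly from the measured 530 fps by linear scaling, which a one-line computation makes explicit (again under the assumption that per-frame cost is constant):

```python
# The BRF classifier saturates one logical processor at ~530 fps.
# At a conventional 20 fps acquisition rate, its share of that
# processor therefore scales (linearly, by assumption) to:
classifier_fps = 530.0
target_fps = 20.0
share = 100.0 * target_fps / classifier_fps
print(f"{share:.1f}% of one logical processor")  # about 3.8%, i.e. < 4%
```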


5. Conclusion


Our investigations have demonstrated that reliable and computationally efficient players' detection can be supported by a cost-effective single-view acquisition system, as long as advanced

image processing tools are implemented to cope with the limitations of conventional and error-prone foreground silhouette detectors. As a federating contribution, our paper has introduced an original scene-specific classifier to capture the visual appearance of team sport players, so as to discriminate between false and true positives among the candidate positions identified at low computational cost by the foreground detector. Our classifier relies on an ensemble of sets of random binary tests to characterize the texture describing the appearance of large image patterns, including the ones associated to players. Ensembles of random classifiers have gained popularity in recent years, mainly because they reduce the risk of overfitting and offer good generalization properties when training samples are scarce [74]. Our work has revealed that those random classifiers are also more robust to label corruption than the boosted classification methods traditionally adopted for people detection [32]. Hence, the classifier can be adapted to the game at hand in an automatic manner, without manual labeling of training samples, simply by selecting and labeling probably reliable training samples based on the foreground detector decisions. In terms of implementation, our study has revealed that working with a few tens of ferns per RGB or Y block is sufficient to reject most false positive foreground detections while preserving most of the true positives. Overall, our hybrid foreground/appearance detection framework results in high throughput capabilities, or in a small computational load at conventional frame rates.

Acknowledgment

Part of this work has been funded by the Belgian NSF, and by the Walloon region projects SPORTIC and PTZ-PILOT. The authors also thank Adrien Descamps (projects DETECT and PTZ-PILOT, Multitel) for providing the ICF results.

References

[1] Z. Wu, R. Radke, Keeping a Pan-Tilt-Zoom Camera Calibrated, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8) (2013) 1994–2007.
[2] STATS SportVU technology, http://www.stats.com/.
[3] I. Fernandez, F. Chen, F. Lavigne, X. Desurmont, C. De Vleeschouwer, Browsing Sport Content through an Interactive H.264 Streaming Session, in: MMEDIA, 2010, pp. 155–161.
[4] L. Sha, P. Lucey, Y. Yue, P. Carr, C. Rohlf, I. Matthews, Chalkboarding: A New Spatiotemporal Query Paradigm for Sports Play Retrieval, in: Intelligent User Interfaces, 2016.
[5] A. Bialkowski, P. Lucey, P. Carr, Y. Yue, S. Sridharan, I. Matthews, Large-Scale Analysis of Soccer Matches using Spatiotemporal Tracking Data, in: International Conference on Data Mining, 2014.
[6] Keemotion production technology, http://www.keemotion.com.
[7] F. Chen, D. Delannay, C. De Vleeschouwer, An autonomous framework to produce and distribute personalized team-sport video summaries: a basket-ball case study, IEEE Transactions on Multimedia 13 (6) (2011) 1381–1394.
[8] F. Chen, C. De Vleeschouwer, Personalized production of basketball videos from multi-sensored data under limited display resolution, CVIU 114 (6) (2010) 667–680.
[9] P. Carr, M. Mistry, I. Matthews, Hybrid robotic/virtual pan-tilt-zoom cameras for autonomous event recording, in: ACM International Conference on Multimedia, 2013.
[10] M. Monfort, B. Lake, B. Ziebart, P. Lucey, J. Tenenbaum, Softstar: Heuristic-Guided Probabilistic Inference, in: NIPS, 2015.
[11] L. Sun, Q. De Neyer, C. De Vleeschouwer, Multimode Spatiotemporal Background Modeling for Complex Scenes, in: EUSIPCO, 2012, pp. 165–169.
[12] C. Stauffer, W. Grimson, Adaptive background mixture models for real-time tracking, in: CVPR, 1999.
[13] T. Horprasert, D. Harwood, L. Davis, A statistical approach for real-time robust background subtraction and shadow detection, in: ICCV, 1999, pp. 1–19.
[14] V. Reddy, C. Sanderson, B. Lovell, Improved foreground detection via block-based classifier cascade with probabilistic decision integration, IEEE TCSVT 23 (1) (2013) 83–93.
[15] J. Zhou, J. Hoang, Real time robust human detection and tracking system, in: CVPR, 2005.


[16] A. Cavallaro, O. Steiger, T. Ebrahimi, Semantic video analysis for adaptive content delivery and automatic description, IEEE TCSVT 15 (10) (2005) 1200–1209.
[17] I. Matthews, P. Carr, Y. Sheikh, Monocular Object Detection Using 3D Geometric Primitives, in: ECCV, 2012.
[18] F. Fleuret, J. Berclaz, R. Lengagne, P. Fua, Multi-camera people tracking with a probabilistic occupancy map, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2) (2008) 267–282.
[19] A. Alahi, L. Jacques, Y. Boursier, P. Vandergheynst, Sparsity driven people localization with a heterogeneous network of cameras, Jour. of MIV 41 (1-2) (2011) 39–58.
[20] D. Delannay, N. Danhier, C. De Vleeschouwer, Detection and recognition of sports (wo)men from multiple views, in: ACM/IEEE ICDSC, 2009, pp. 1–7.
[21] S. Khan, M. Shah, A multiview approach to tracking people in crowded scenes using a planar homography constraint, in: ECCV, Vol. 4, 2006, pp. 133–146.
[22] P. Viola, M. Jones, Robust real-time object detection, in: Int. workshop on SCTV, 2001.
[23] P. Dollar, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: a benchmark, in: CVPR, 2009.
[24] P. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9) (2010) 1627–1645.
[25] R. Benenson, M. Mathias, R. Timofte, L. Van Gool, Pedestrian detection at 100 frames per second, in: CVPR, 2012.
[26] P. Dollar, R. Appel, S. Belongie, P. Perona, Fast Feature Pyramids for Object Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (8) (2014) 1532–1545.
[27] P. Sermanet, K. Kavukcuoglu, S. Chintala, Y. LeCun, Pedestrian detection with unsupervised multi-stage feature learning, in: CVPR, 2013.
[28] A. Angelova, A. Krizhevsky, V. Vanhoucke, A. Ogale, D. Ferguson, Real-Time Pedestrian Detection With Deep Network Cascades, in: BMVC, 2015.
[29] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, in: NIPS, 2015.
[30] X. Wang, M. Wang, W. Li, Scene-Specific Pedestrian Detection for Static Video Surveillance, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2) (2014) 361–374.
[31] P. Sharma, C. Huang, R. Nevatia, Unsupervised Incremental Learning for Improved Object Detection in a Video, in: CVPR, 2012.
[32] P. Dollar, Z. Tu, P. Perona, S. Belongie, Integral channel features, in: BMVC, 2009.
[33] APIDIS and SPIROUDOME datasets, http://sites.uclouvain.be/ispgroup/index.php/Softwares/APIDIS and http://sites.uclouvain.be/ispgroup/index.php/Softwares/SPIROUDOME.
[34] R. Marée, P. Geurts, J. Piater, L. Wehenkel, Random subwindows for robust image classification, in: CVPR, 2005.
[35] M. Özuysal, M. Calonder, V. Lepetit, P. Fua, Fast keypoint recognition using random ferns, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (3) (2010) 448–461.
[36] P. Parisot, B. Sevilmiş, C. De Vleeschouwer, Training with corrupted labels to reinforce a probably correct team-sport player detector, in: ACIVS, 2013.
[37] Y. Amit, D. Geman, Shape quantization and recognition with randomized trees, Neural Comput. 9 (12) (1997) 1545–1588.
[38] A. Bosch, A. Zisserman, X. Munoz, Image classification using random forests and ferns, in: ICCV, 2007.
[39] P. Dollar, S. Belongie, P. Perona, The fastest pedestrian detector in the west, in: BMVC, 2010.
[40] J. Xing, H. Ai, L. Liu, S. Lao, Multiple player tracking in sports video: A dual-mode two-way bayesian inference approach with progressive observation modeling, IEEE Trans. on Image Processing 20 (6) (2011) 1652–1667.
[41] W. Nam, P. Dollar, J. Han, Local decorrelation for improved pedestrian detection, in: NIPS, 2014.
[42] S. Zhang, R. Benenson, B. Schiele, Filtered channel features for pedestrian detection, in: CVPR, 2015.
[43] S. Zhang, C. Bauckhage, A. Cremers, Informed Haar-like features improve pedestrian detection, in: CVPR, 2014.
[44] R. Benenson, M. Mathias, T. Tuytelaars, L. Van Gool, Seeking the strongest rigid detector, in: CVPR, 2013.
[45] S. Zhang, C. Bauckhage, A. Cremers, Real-time human detection based on optimized integrated channel features, in: Pattern Recognition vol. 484, Proceedings of the 6th Chinese Conference on Pattern Recognition, 2014.
[46] S. Paisitkriangkrai, C. Shen, A. van den Hengel, Pedestrian detection with spatially pooled features and structured ensemble learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (6) (2016) 1243–1257.
[47] J. Cao, Y. Pang, X. Li, Pedestrian Detection Inspired by Appearance Constancy and Shape Symmetry, in: CVPR, 2016.
[48] Y. Yang, Z. Wang, F. Wu, Exploring prior knowledge for pedestrian detection, in: BMVC, 2015.
[49] S. Zhang, C. Bauckhage, D. Klein, A. Cremers, Exploring Human Vision Driven Features for Pedestrian Detection, IEEE Transactions on Circuits and Systems for Video Technology 25 (10) (2015) 1709–1720.
[50] A. Costea, S. Nedevschi, Semantic Channels for Fast Pedestrian Detection, in: CVPR, 2016.
[51] Y. Tian, P. Luo, X. Wang, X. Tang, Pedestrian detection aided by deep learning semantic tasks, in: CVPR, 2015.
[52] P. Balasubramanian, S. Pathak, A. Mittal, Improving gradient histogram based descriptors for pedestrian detection


in datasets with large variations, in: CVPR, 2016.
[53] S. Zhang, R. Benenson, M. Omran, J. Hosang, B. Schiele, How Far are We from Solving Pedestrian Detection?, in: CVPR, 2016.
[54] P. Dollar, R. Appel, W. Kienzle, Crosstalk cascades for framerate pedestrian detection, in: ECCV, 2012.
[55] R. Benenson, M. Omran, J. Hosang, B. Schiele, Ten years of pedestrian detection, What have we learned?, in: ECCV, 2014.
[56] Y. Tian, P. Luo, X. Wang, X. Tang, Deep learning strong parts for pedestrian detection, in: ICCV, 2015.
[57] B. Yang, J. Yan, Z. Lei, S. Li, Convolutional channel features, in: ICCV, 2015.
[58] J. Hosang, M. Omran, R. Benenson, B. Schiele, Taking a deeper look at pedestrians, in: CVPR, 2015.
[59] Z. Cai, M. Saberian, N. Vasconcelos, Learning complexity-aware cascades for deep pedestrian detection, in: ICCV, 2015.
[60] R. Benenson, M. Mathias, R. Timofte, L. Van Gool, Fast stixel computation for fast pedestrian detection, in: ECCV, 2012.
[61] J. Chen, H. Le, P. Carr, Y. Yue, J. Little, Learning online smooth predictors for realtime camera planning using recurrent decision trees, in: CVPR, 2016.
[62] A. K. K.C., D. Delannay, L. Jacques, C. De Vleeschouwer, Iterative hypothesis testing for multi-object tracking with noisy/missing appearance features, in: ACCV, 2012.
[63] H. Ben Shitrit, J. Berclaz, F. Fleuret, P. Fua, Tracking multiple people under global appearance constraints, in: ICCV, 2011.
[64] H. Ben Shitrit, J. Berclaz, F. Fleuret, P. Fua, Multi-Commodity Network Flow for Tracking Multiple People, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (8) (2013) 1614–1627.
[65] A. K. K.C., C. De Vleeschouwer, Discriminative label propagation for multi-object tracking with sporadic appearance features, in: ICCV, 2013.
[66] J. Liu, P. Carr, Detecting and tracking sports players with random forests and context-conditioned motion models, in: Computer Vision in Sports, Part of the series Advances in Computer Vision and Pattern Recognition, 2014, pp. 113–132.
[67] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: COLT, 1998, pp. 92–100.
[68] A. Levin, P. Viola, Y. Freund, Unsupervised improvement of visual detectors using co-training, in: ICCV, 2003, pp. 626–633.
[69] P. Roth, H. Grabner, D. Skocaj, H. Bischof, A. Leonardis, Conservative visual learning for object detection with minimal hand labeling effort, in: DAGM, 2005, pp. 293–300.
[70] V. Nair, J. J. Clark, An unsupervised, online learning framework for moving object detection, in: CVPR, Vol. 2, 2004, pp. 317–324.
[71] O. Javed, S. Ali, M. Shah, Online Detection and Classification of Moving Objects Using Progressively Improving Detectors, in: CVPR, 2005.
[72] B. Wu, R. Nevatia, Improving part based object detection by unsupervised, online boosting, in: CVPR, 2007.
[73] H. Li, G. Hua, Z. Lin, J. Brandt, J. Yang, Probabilistic elastic part model for unsupervised face detector adaptation, in: ICCV, 2013.
[74] P. Geurts, D. Ernst, L. Wehenkel, Extremely Randomized Trees, Machine Learning 36 (1) (2006) 3–42.
[75] A. Legrand, L. Jacques, C. De Vleeschouwer, Mitigating memory requirements for random trees/ferns, in: ICIP, 2015.
[76] Demo video, http://sites.uclouvain.be/ispgroup/index.php/Research/PlayerDetectImproved.
[77] A. Maksai, X. Wang, P. Fua, What Players do with the Ball: A Physically Constrained Interaction Modeling, in: CVPR, 2016.
[78] P. Parisot, C. De Vleeschouwer, Graph-based filtering of ballistic trajectory, in: ICME, 2011.
[79] H.-T. Chen, W.-J. Tsai, S.-Y. Lee, J.-Y. Yu, Ball tracking and 3D trajectory approximation with applications to tactics analysis from single-camera volleyball sequences, Multimedia Tools and Applications 60 (3) (2012) 641–667.
[80] X. Wang, V. Ablavsky, H. Ben Shitrit, P. Fua, Take your eyes off the ball: Improving ball-tracking by focusing on team play, CVIU 119 (2014) 102–115.
[81] X. Wang, E. Tretken, F. Fleuret, P. Fua, Tracking interacting objects optimally using integer programming, in: ECCV, 2014.
[82] E. Bomcke, C. De Vleeschouwer, An interactive video streaming architecture for h.264/avc compliant players, in: ICME, 2009.
[83] P. Parisot, C. De Vleeschouwer, Consensus-based trajectory estimation for ball detection in calibrated cameras systems, Journal on Real-Time Image Processing, 2016, DOI 10.1007/s11554-016-0638-3.
