
Highlights
• Robust system to detect people using only depth information from a ToF camera
• The system outperforms state-of-the-art methods on different datasets without fine-tuning
• The proposal runs in real time using conventional GPUs
• Computational demands are independent of the number of people in the scene
• The generated database is available to the research community


DPDnet: A Robust People Detector using Deep Learning with an Overhead Depth Camera

David Fuentes-Jimenez(a,b), Roberto Martin-Lopez(a), Cristina Losada-Gutierrez(a), David Casillas-Perez(a), Javier Macias-Guarasa(a), Carlos A. Luna(a), Daniel Pizarro(a)

(a) Department of Electronics, University of Alcalá, Ctra. Madrid-Barcelona, km. 33600, 28805 Alcalá de Henares, Spain. E-mails: [email protected] (D. Fuentes-Jimenez), [email protected] (R. Martin-Lopez), [email protected] (C. Losada-Gutierrez), [email protected] (D. Casillas-Perez), [email protected] (J. Macias-Guarasa), [email protected] (C.A. Luna), [email protected] (D. Pizarro).
(b) Corresponding author. Tel: +34 637151349, fax: +34 918856591.

Abstract

This paper proposes a deep learning-based method to detect multiple people from a single overhead depth image with high precision. Our neural network, called DPDnet, is composed of two fully-convolutional encoder-decoder blocks built with residual layers. The main block takes a depth image as input and generates a pixel-wise confidence map, where each detected person in the image is represented by a Gaussian-like distribution. The refinement block combines the depth image and the output of the main block to refine the confidence map. Both blocks are simultaneously trained end-to-end using depth images and ground truth head position labels. The paper provides a rigorous experimental comparison with some of the best methods in the state of the art, exhaustively evaluated on different publicly available datasets. DPDnet outperforms all the evaluated methods with statistically significant differences, with accuracies that exceed 99%. The system was trained on one of the datasets (generated by the authors and available to the scientific community) and evaluated on the others without retraining, showing that it also achieves high accuracy under varying datasets and experimental conditions. Additionally, we compare our proposal with other CNN-based alternatives that have been very recently proposed in the literature, again obtaining very high performance. Finally, the computational complexity of our proposal is shown to be independent of the number of users in the scene, and the system runs in real time using conventional GPUs.

Keywords: People detection, Depth camera information, Interest regions estimation, Overhead Depth Camera, Feature extraction

1. Introduction

People detection and tracking have recently attracted growing interest from the scientific community because of their applications in multiple areas, such as video-surveillance, access control or human behavior analysis. These applications require systems that work under different experimental conditions and that are non-invasive (i.e. without adding turnstiles or other contact devices), such as video cameras.

Detecting people in images is a classic Computer Vision problem, thoroughly studied over the last two decades. In recent years, improvements in technology and the availability of large annotated image datasets, such as Imagenet (Krizhevsky et al., 2012), have allowed Deep Neural Networks (DNNs) to become the state-of-the-art solution for tasks such as segmentation (Cao et al., 2017), classification (Shin et al., 2016) and object detection (Redmon et al., 2015) in RGB images. Numerous DNN-based people detection works have recently appeared (Ouyang & Wang, 2013; Wang & Zhao, 2017; Zhao et al., 2017), obtaining very accurate results in controlled conditions. However, these methods suffer in scenarios with a high amount of clutter.

To reduce the effect of occlusions and to improve detection accuracy in densely populated environments, some works employ depth cameras in overhead configurations (Bevilacqua et al., 2006; Stahlschmidt et al., 2013; Jia & Radke, 2014; Zhang et al., 2012; Galčík & Gargalík, 2013; Del Pizzo et al., 2016; Vera et al., 2016). An additional quality of these methods is that they also preserve privacy, preventing anyone from using the system to find out the identity of each person in the scene, which is a requirement in some applications (Chan et al., 2008). Overhead depth images are easier to process than RGB images. This has opened up room for detection methods based on finding local geometric structures in the depth image (Bevilacqua et al., 2006; Zhang et al., 2012; Stahlschmidt et al., 2013, 2014), which are combined with classification stages (Galčík & Gargalík, 2013; Vera et al., 2016; Zhu & Wong, 2013; Luna et al., 2017) to reduce false detections. The detection rates of these methods are, however, still limited in crowded scenes.

Unlike the aforementioned methods, DNNs can analyze the whole image context to detect people in heavily cluttered images and to discard false positives due to the presence of other objects in the scene. The main challenge of DNN-based methods is how to effectively train the network so that it can cope with unseen environments and different depth sensors. Two people detection methods based on DNNs using overhead depth images have recently been proposed (Liciotti et al., 2018a; Villamizar et al., 2018). However, they have only been tested in limited scenarios, either with a very small dataset (Liciotti et al., 2018a) or a simple scene with only two people interacting in a small room (Villamizar et al., 2018). Using DNNs to effectively solve this problem in general scenarios remains an open question.

In this paper, we propose a method based on DNNs, hereinafter referred to as DPDnet, for detecting multiple people from a single depth image acquired with an overhead depth camera (see Figure 1). Our neural network is trained end-to-end using depth images and likelihood maps generated from the annotated position of each person in the scene.

This paper brings the following contributions: 1) DPDnet achieves top performance when compared with all the evaluated state-of-the-art methods, which constitutes the main contribution of this paper, exceeding 99% accuracy in all the developed experiments for a wide range of experimental conditions. 2) Our approach provides a dense multimodal confidence map based on a Gaussian mixture distribution with the same resolution as the input depth image. This allows the neural network to analyze the entire image context and to provide all detected people in a single output. 3) Our network detects people in real time, regardless of the number of users in the scene. 4) Our proposal reaches a high degree of robustness (referring to its relative invariance to changes in the experimental conditions): the depth images from the evaluated datasets have been recorded with cameras of different resolutions and technologies (structured light or Time-of-Flight), located at heights ranging between 2.8 and 3.53 meters, and with different background content. Our system performance remains very high even when there are mismatched conditions between training and testing data. To the best of our knowledge, there are no other works in the literature able to detect people in top-view depth images with comparable accuracy, evaluated on this variety of experimental datasets.

Figure 1: Scheme of camera location above sensed area and example of recorded frame. (a) Camera location above sensed area. (b) Example of recorded frame in sensed area.

We structure the paper as follows: Section 2 details the state of the art in people detection methods, Section 3 describes our proposal based on CNNs, Section 4 includes the experimental setup, results and discussion, and Section 5 describes the main conclusions and future work.

2. Previous Works

The first works proposed for non-invasive people detection were based on the use of RGB cameras. To name a few, Ramanan et al. (2006) proposed a system based on learning person appearance models, whereas Andriluka et al. (2008) used hierarchical Gaussian Process Latent Variable Models (hGPLVM) for modeling people. Other approaches were based on face detection (Chen et al., 2010) or interest point classification (Jeong et al., 2013).

DNN-based people detection methods have been proposed more recently. In particular, Ouyang & Wang (2013) proposed a new DNN architecture that jointly performs feature extraction, deformation handling, occlusion handling, and classification. Wang & Zhao (2017) proposed a Convolutional Neural Network (CNN) as a feature extractor, followed by a proposal selection network. Tian et al. (2015) combined a novel DNN model that jointly optimizes pedestrian detection and semantic tasks, including both pedestrian and scene attributes. Finally, Zhao et al. (2017) used a novel physical radius-depth (PRD) approach to detect human candidates, followed by a CNN for human feature extraction. These proposals are more accurate than classic methods, but they still suffer in highly cluttered scenes.

To reduce the effect of occlusions and to improve detection accuracy, some works propose the fusion of RGB and depth data (Liu et al., 2015; Munaro et al., 2016; Zhou et al., 2017; Ren et al., 2017; Cao et al., 2017), whereas other works place cameras in overhead configurations (Antic et al., 2009; Cai et al., 2014; Dan et al., 2012; Del Pizzo et al., 2016). Other works employed only depth sensors and overhead cameras for people detection, based either on Time-of-Flight (ToF) (Bevilacqua et al., 2006; Stahlschmidt et al., 2013, 2014; Jia & Radke, 2014) or structured light (Zhang et al., 2012; Galčík & Gargalík, 2013; Rauter, 2013; Zhu & Wong, 2013; Del Pizzo et al., 2016; Vera et al., 2016) technologies.

Some of these methods are based on finding distinctive maxima of the depth image (Bevilacqua et al., 2006; Zhang et al., 2012), or of a depth image filtered with the normalized Mexican Hat Wavelet (Stahlschmidt et al., 2013, 2014). However, they usually fail to separate clusters of people that are close to each other in the scene, or produce false detections when body parts other than the head are close to the camera. Moreover, the proposals by Zhang et al. (2012) and Stahlschmidt et al. (2013, 2014) do not include a classification stage, so they are not able to discriminate people from other objects, and their performance drops if there are other elements in the scene, such as chairs or clothes racks.

In order to reduce false positives, some works include a classification stage that allows discriminating between people and other elements in the scene. In particular, Galčík & Gargalík (2013); Vera et al. (2016); Zhu & Wong (2013); Luna et al. (2017) used shape descriptors that encode the morphology of the human head and shoulders, as seen in the depth image. These proposals are able to discriminate between people and other elements in the scene, but for some of them the detection rates significantly decrease if people are close to each other.

Regarding DNN-based methods using only top-view depth data, the authors of (Villamizar et al., 2018) recently presented an approach, called WatchNet, for detecting people and identifying attacks and intrusions in video surveillance scenarios. The proposal is tested on the UNICITY dataset (Dumoulin et al., 2018) (available at (Dumoulin et al., 2019)), recorded using 6 color and depth cameras located at different positions inside an airlock constructed to simulate a building access room. The recording conditions of the dataset are very restrictive, since the area of interest is very small and it only includes one or two people in each scene. Thus, this proposal is not adapted to crowded scenarios, and there is no attempt to deal with elements other than people in the scene.

Liciotti et al. (2018a) also addressed the problem of detecting people from top-view depth data in crowded environments, using a CNN-based strategy. Their proposal is based on the novel U-Net3 architecture, which includes several convolutional layers, followed by ReLU activations and max pooling operations. To train and test their proposal, they use the TVHeads dataset (available at (Liciotti et al., 2018b)), limited to 1815 depth images. They include a comparison with other deep learning architectures, obtaining high accuracy with most of them, but they do not evaluate their proposal on other, more complex and bigger datasets. Moreover, the authors do not specify how they partition the TVHeads data for training, validation and testing, so it is difficult to assess the generalization capabilities of their proposal, given the very reduced size of their database.

3. DPDnet People Detector

3.1. Problem Formulation

We propose the DPDnet neural network to detect multiple people from a single depth image. The output consists of a multimodal confidence map, with the same size as that of the input image, that jointly codifies the detection of multiple people in the scene (see Figure 2). The term multimodal here refers to the statistical concept of multimodal distributions, which are probability distributions with multiple “modes”.

Figure 2: Depth input image and output multimodal confidence map.

The multimodal confidence map assigns a 2D Gaussian distribution centered at the centroid of the head of each detected person, with its maximum value normalized to one. The standard deviation (Eq. 1) is set to be constant, independently of the size of the head and the head-camera distance. This value has been calculated based on the geometrical conditions of the GOTPD1+ recordings (see Section 4.1.1), also taking into account anthropometric considerations. In GOTPD1+, the average diameter of the head is D = 15 pixels, and we use an uncertainty factor fs = 0.8 to account for possible variations in head sizes, so that the selected σ value is:

σ = (D/2) · fs = 6    (1)

It is important to note that the chosen value for the Gaussian standard deviation does not have a relevant impact on the system performance, as it just models the uncertainty in the position of the head centroid. In addition, the use of Gaussian distributions favors training convergence, given their smoothness. Finally, the 2D position of each person can be easily obtained from the confidence map by detecting its local maxima.

The advantages of this strategy are twofold: first, learning the confidence map is a well-defined way to detect multiple hypotheses, as opposed to other parametrizations (e.g. multiple 2D coordinate vectors), which can lead to ambiguities in the regression function. Second, it makes the computational complexity independent of the number of users in the scene, considering that the time spent extracting multiple detection hypotheses from the confidence map is mostly negligible.
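To make the target construction and the read-out step concrete, the following is a minimal NumPy/SciPy sketch (an illustration, not the authors' released code); the peak-detection threshold and window size are assumptions introduced here for the example.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def make_confidence_map(head_positions, shape=(212, 256), sigma=6.0):
    """Target map: one 2D Gaussian (peak value 1) at each labeled head centroid."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    q = np.zeros(shape, dtype=np.float32)
    for (r, c) in head_positions:
        q += np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2.0 * sigma ** 2))
    # Clip so overlapping Gaussians still keep the map normalized to a maximum of one.
    return np.clip(q, 0.0, 1.0)

def detect_people(confidence_map, threshold=0.5, window=15):
    """Read detections back as local maxima of the predicted confidence map."""
    peaks = confidence_map == maximum_filter(confidence_map, size=window)
    peaks &= confidence_map > threshold
    return np.argwhere(peaks)  # one (row, col) head position per detection
```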

3.2. DPDnet Network Architecture

Figure 3 illustrates the DPDnet network architecture used to detect people from a single depth image. Our DPDnet neural network consists of two blocks: the main block and the refinement block. Both blocks use an encoder-decoder architecture inspired by the SegNet model (Badrinarayanan et al., 2015), originally proposed for semantic segmentation, and use the residual layers proposed in the ResNet model (He et al., 2016). The main block takes a single depth image as input and outputs a confidence map. The refinement block takes as inputs the input depth image and the confidence map obtained from the main block, and generates a refined confidence map at its output. Similar refinement blocks have been proposed for other tasks, such as monocular reconstruction methods (Eigen et al., 2014), with very good results.

Figure 3: The DPDnet architecture divided into the main block and the refinement block. The figure shows the general encoder-decoder structure of both blocks.

The structure of the main block is detailed in Table 3. It takes 212 × 256 depth images as input, and uses separable convolutional layers based on the Xception model (Chollet, 2016). Separable convolutions are faster than conventional convolutions, while achieving similar performance. Regarding the activation functions, we use ReLU activations (Agarap, 2018) because of their properties, and because Krizhevsky et al. (2012) proved that, in terms of training time, saturating non-linearities such as tanh or sigmoid are much slower than non-saturating ones like ReLU. Moreover, saturating non-linearities can lead to vanishing gradients. Finally, we use Batch Normalization layers, as Ioffe & Szegedy (2015) showed that they introduce a regularization component that improves on other alternatives such as Dropout, while allowing CNNs to accelerate training and reach their optimum performance in a lower number of epochs.

To build the encoder-decoder core of the main block, we use several units of residual blocks, similar to those proposed in the ResNet (He et al., 2016) model. The encoding blocks downsample the spatial dimensions by a factor of two, and the decoding blocks use upscaling layers to increase the spatial dimensions by the same factor. Their basic architecture is described in Figure 4 and Tables 1 and 2.

Figure 4: DPDnet decoding and encoding block.

Finally, the last block of the CNN and the cropping layers adapt the output of the last decoding block to the output size of the confidence map (212 × 256), which is the same size as the input image. Table 3 summarizes the information of all the layers that appear in the main block.

The refinement stage is a reduced version of the main block with two encoding and two decoding blocks, and is thus shallower than the main block. The refinement block takes as inputs the input depth image and the confidence map computed by the main block, and generates a new “refined” confidence map. Table 4 summarizes the information of all the layers and blocks used in the refinement stage.

In addition to the DPDnet architecture described above, we have adapted the same model to work with images at half the resolution of the original ones (106 × 128), seamlessly producing downscaled confidence maps. We will refer to this reduced model as DPDnetfast in Section 4. The reduction of input and output size allows low-cost GPUs to run our model with a very small impact in terms of accuracy.

Table 1: Encoding Block (filters=(a, b, c), kernel size=k)

Layer                               | Parameters                                | Output Size
Input                               | -                                         | (Width, Height, Depth)
Convolution (Main Branch)           | kernel=(1, 1), strides=(2, 2), filters=a  | (Width/2, Height/2, a)
Batch Normalization (Main Branch)   | -                                         |
Activation (Main Branch)            | ReLU                                      |
Convolution (Main Branch)           | kernel=(k, k), strides=(1, 1), filters=b  | (Width/2, Height/2, b)
Batch Normalization (Main Branch)   | -                                         |
Activation (Main Branch)            | ReLU                                      |
Convolution (Main Branch)           | kernel=(1, 1), strides=(1, 1), filters=c  | (Width/2, Height/2, c)
Batch Normalization (Main Branch)   | -                                         |
Convolution (Shortcut)              | kernel=(1, 1), strides=(2, 2), filters=c  | (Width/2, Height/2, c)
Batch Normalization (Shortcut)      | -                                         |
Add (Main Branch + Shortcut)        | -                                         | (Width/2, Height/2, a)
Activation (Main Branch + Shortcut) | ReLU                                      |

Table 2: Decoding Block (filters=(a, b, c), kernel size=k)

Layer                               | Parameters                                | Output Size
Input                               | -                                         | (Width, Height, Depth)
Upsampling (Main Branch)            | size=(2, 2)                               | (2*Width, 2*Height, Depth)
Convolution (Main Branch)           | kernel=(1, 1), strides=(1, 1), filters=a  | (Width/2, Height/2, a)
Batch Normalization (Main Branch)   | -                                         |
Activation (Main Branch)            | ReLU                                      |
Convolution (Main Branch)           | kernel=(k, k), strides=(1, 1), filters=b  | (Width/2, Height/2, b)
Batch Normalization (Main Branch)   | -                                         |
Activation (Main Branch)            | ReLU                                      |
Convolution (Main Branch)           | kernel=(1, 1), strides=(1, 1), filters=c  | (Width/2, Height/2, c)
Batch Normalization (Main Branch)   | -                                         |
Upsampling (Shortcut)               | size=(2, 2)                               | (2*Width, 2*Height, Depth)
Convolution (Shortcut)              | kernel=(1, 1), strides=(1, 1), filters=c  | (Width/2, Height/2, c)
Batch Normalization (Shortcut)      | -                                         |
Add (Main Branch + Shortcut)        | -                                         | (Width/2, Height/2, a)
Activation (Main Branch + Shortcut) | ReLU                                      |
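As a reference for how such a residual block can be realized, here is a minimal Keras-style sketch of the encoding block in Table 1. This is an illustration under our reading of the table, not the authors' implementation; in particular, Table 1 only says "Convolution", so plain convolutions are used here, even though the text mentions Xception-style separable convolutions elsewhere in the main block.

```python
from tensorflow.keras import layers

def encoding_block(x, a, b, c, k=3, downsample=True):
    """Residual encoding block following Table 1: a 1x1 -> kxk -> 1x1 main branch
    plus a 1x1 projection shortcut, added together and passed through a final ReLU."""
    s = 2 if downsample else 1
    m = layers.Conv2D(a, 1, strides=s)(x)
    m = layers.BatchNormalization()(m)
    m = layers.Activation("relu")(m)
    m = layers.Conv2D(b, k, strides=1, padding="same")(m)
    m = layers.BatchNormalization()(m)
    m = layers.Activation("relu")(m)
    m = layers.Conv2D(c, 1, strides=1)(m)
    m = layers.BatchNormalization()(m)
    shortcut = layers.Conv2D(c, 1, strides=s)(x)       # projection shortcut
    shortcut = layers.BatchNormalization()(shortcut)
    return layers.Activation("relu")(layers.Add()([m, shortcut]))
```

A decoding block (Table 2) follows the same pattern, with an upsampling layer in both branches in place of the strided downsampling.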

Table 3: Main Stage. In Encoding and Decoding Blocks, a, b and c represent the number of filters of the intermediate convolutional layers in the branches.

Main Block
Layer                          | Output Size | Parameters
Input                          | 212x256x1   | -
Convolution                    | 106x128x64  | kernel=(7, 7) / strides=(2, 2)
Batch Normalization            | -           |
Activation                     |             | ReLU
Max Pooling                    | 35x42x64    | size=(3, 3)
Encoding Conv. Block           | 35x42x256   | kernel=(3, 3) / strides=(1, 1) (a=64, b=64, c=256)
Encoding Conv. Block           | 18x21x512   | kernel=(3, 3) / strides=(2, 2) (a=128, b=128, c=512)
Encoding Conv. Block           | 9x11x1024   | kernel=(3, 3) / strides=(2, 2) (a=256, b=256, c=1024)
Decoding Separable Conv. Block | 9x11x256    | kernel=(3, 3) / strides=(1, 1) (a=1024, b=1024, c=256)
Decoding Separable Conv. Block | 18x22x128   | kernel=(3, 3) / strides=(2, 2) (a=512, b=512, c=128)
Decoding Separable Conv. Block | 36x44x64    | kernel=(3, 3) / strides=(2, 2) (a=256, b=256, c=64)
Cropping                       | 36x43x64    | cropping=[(2, 2) (1, 1)]
Up Sampling                    | 108x129x64  | size=(3, 3)
Convolution                    | 216x258x64  | kernel=(7, 7) / strides=(2, 2)
Cropping                       | 212x256x64  | cropping=[(2, 2) (1, 1)]
Batch Normalization            | -           |
Activation                     |             | ReLU
Convolution                    | 212x256x1   | kernel=(3, 3) / strides=(1, 1)
Activation                     |             | Sigmoid
Output 1                       | 212x256x1   | -

Table 4: Refinement Stage

Refinement Block
Layer                | Output Size | Parameters
Output 1 + Input     | 212x256x1   | -
Convolution          | 106x128x64  | kernel=(7, 7) / strides=(2, 2)
Batch Normalization  | -           |
Activation           |             | ReLU
Max Pooling          | 35x42x64    | size=(3, 3)
Encoding Conv. Block | 35x42x256   | kernel=(3, 3) / strides=(1, 1) (a=64, b=64, c=256)
Encoding Conv. Block | 18x21x512   | kernel=(3, 3) / strides=(2, 2) (a=128, b=128, c=512)
Decoding Conv. Block | 36x42x128   | kernel=(3, 3) / strides=(2, 2) (a=512, b=512, c=128)
Decoding Conv. Block | 72x84x64    | kernel=(3, 3) / strides=(2, 2) (a=256, b=256, c=64)
Zero Padding         | 72x86x64    | padding=(0, 1)
Up Sampling          | 216x258x64  | size=(3, 3)
Cropping             | 212x256x64  | cropping=[(2, 2) (1, 1)]
Convolution          | 212x256x1   | kernel=(3, 3) / strides=(1, 1)
Activation           |             | Sigmoid
Output 2             | 212x256x1   | -
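The following sketch illustrates how the two stages of Tables 3 and 4 can be wired together end-to-end. Here `main_block` and `refinement_block` are hypothetical builder callables (their internals follow Tables 3 and 4), and the combination of the depth image with the first output follows one possible reading of the "Output 1 + Input" row of Table 4 (an element-wise sum); the paper only states that both are fed to the refinement block.

```python
from tensorflow.keras import Input, Model, layers

def build_dpdnet(main_block, refinement_block, input_shape=(212, 256, 1)):
    """Wire the main and refinement stages together; both maps are exposed as outputs
    so that the two-term loss of Eq. (3) can supervise each of them."""
    depth = Input(shape=input_shape)
    coarse = main_block(depth)                   # "Output 1" of Table 3
    combined = layers.Add()([depth, coarse])     # "Output 1 + Input" row of Table 4 (one reading)
    refined = refinement_block(combined)         # "Output 2" of Table 4
    return Model(inputs=depth, outputs=[refined, coarse])
```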

3.3. Training procedure

We denote by pi the input depth image and by qi the target confidence map, where pi, qi ∈ R^(212×256). We assume that both pi and qi are normalized in the interval [0, 1]. The target confidence map qi is generated as a sum of two-dimensional Gaussian distributions with covariance matrix P = σ²·I(2×2), centered around the labeled position of each person in the scene. We use σ = 6 pixels in all our experiments.

The main and refinement blocks of the DPDnet architecture are defined by the functions Tm(pi, θm) and Tr(pi, qi, θr) respectively. All trainable parameters of the main block are gathered in the vector θm, and similarly θr for the refinement block. When referring to all trainable parameters of the entire DPDnet model we use their concatenation θ = (θm, θr). The complete DPDnet function T(pi, θ) is obtained by composition of Tm and Tr:

T(pi, θ) = Tr(pi, Tm(pi, θm), θr)    (2)

We train our network using a set of corresponding depth images and confidence maps {(pi, qi)}, i = 1, ..., N, with the following loss function:

L(θ) = (1/N) Σ_{i=1..N} ||T(pi, θ) − qi||² + λ (1/N) Σ_{i=1..N} ||Tm(pi, θm) − qi||²,    (3)

where λ is a hyper-parameter that weights the importance of the main and refinement terms in the loss function. The optimal value of λ was found empirically on validation data by evaluating its effect on the network performance metrics, and was set to λ = 1 as the best compromise among them. Note that the second term of Equation (3) forces the main block to be as close as possible to the target confidence map; this, in turn, forces the refinement block to improve the output of the main block. By minimizing Equation (3), we train the entire DPDnet model end-to-end in a single step, without separating the main and refinement blocks.

In our experiments, we train the network for 50 epochs and use a validation set to select the best model. We use the Adam optimizer (Kingma & Ba, 2014) with an initial learning rate of 0.001. This optimizer was chosen because of its adaptive behavior in terms of learning rate: Adam starts from an initial learning rate set by the user and then uses the first and second moments of the gradient to adapt the learning rate to the evolution of the loss function. This increases the convergence speed compared to other optimizers, making Adam a good choice for the proposed problem.
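Assuming a model that exposes both confidence maps as outputs (as in the sketch after Table 4), the two-term loss of Equation (3) and the training setup described above can be expressed in Keras roughly as follows. This is an illustrative sketch, not the authors' training script; note that Keras' mean-squared error averages over pixels, which only rescales Equation (3).

```python
import tensorflow as tf

def compile_and_train(model, depth_images, target_maps, val_data, lmbda=1.0):
    """Train DPDnet end-to-end with the two-term MSE loss of Eq. (3).
    `model` is assumed to output [refined_map, coarse_map]; both terms use the same
    ground-truth confidence maps as targets, with the second term weighted by lambda."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss=["mse", "mse"],
        loss_weights=[1.0, lmbda],
    )
    val_images, val_maps = val_data
    return model.fit(
        depth_images, [target_maps, target_maps],
        epochs=50,
        validation_data=(val_images, [val_maps, val_maps]),
    )
```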

(a) Sample frames from the GOTPD1+ dataset.

(b) Sample frames from the MIVIA dataset.

Figure 5: Sample frames from the GOTPD1+ and MIVIA databases (gray coded depth).


4. Experimental Work

4.1. Datasets

In order to provide a wide range of evaluation conditions, we have used four different databases, which are described next. Figures 5 and 6 provide some sample frames to give an idea of the style and quality of the different datasets.

4.1.1. GOTPD1 database

In this work, we have used part of the GOTPD1 database (available at (Macias-Guarasa et al., 2016a) and (Macias-Guarasa et al., 2016b)), which was recorded with a Kinect v2 device located at a height of 3.4 m, generating depth images with a resolution of 512 × 424. Depth values are provided in millimeters as 16-bit unsigned integers. The recordings cover a broad variety of conditions, with scenarios comprising single and multiple people, single and multiple objects (such as chairs), people with and without accessories (hats, caps), people with different complexion, height, hair color and hair configuration, and people actively moving and performing additional actions (such as using their mobile phones, moving their fists up and down, moving their arms, etc.).

The GOTPD1 data was originally split into two subsets, one for training and the other for testing, with the configuration described in (Luna et al., 2017). However, for this work, and given the strict training data requirements of deep learning approaches, we use the full GOTPD1 dataset for training purposes, and we recorded 23 additional sequences to allow for a proper evaluation effort. With this strategy, the training and testing subsets are fully independent, as people and conditions fully differ between training and testing recordings. Table 5 and Table 6 show the details of the training and testing subsets, respectively. #Frames refers to the number of frames in the recorded sequences, while #Positives refers to the total number of heads over all the frames (our recordings involve 39 different people). The database contains sequences in which the users were instructed on how to move under the camera (to allow for proper coverage of the recording area), and sequences where people moved freely (to allow for a more natural behavior); this is fully detailed in the documentation distributed with the database at (Macias-Guarasa et al., 2016a) and (Macias-Guarasa et al., 2016b).

Table 5: GOTPD1+ training subset details.

Sequence IDs                                                   | #Frames | #Positives | Description
S0001 through S0022, S0102, S0110, S0123, S0130, S0136, S0142 | 26853   | 14381      | Single people sequences
S0023 through S0026                                            | 7138    | 2416       | Two people sequences
S0027 through S0029, S0031 through S0034, S0152, S0153         | 6526    | 18106      | Multiple people sequences
S0035, S0038 through S0040                                     | 2323    | 1458       | Sequences with chairs and people balancing fists facing up
Totals                                                         | 42840   | 36361      |

This dataset (composed of the original GOTPD1 and the 23 additional recorded sequences) will be referred to as GOTPD1+ in the tables included in Section 4.4. The first row of Figure 5 shows two sample frames from this dataset.

Table 6: GOTPD1+ testing subset details.

Sequence IDs                                    | #Frames | #Positives | Description
S041, S0101, S0131 through S0135, S143          | 2742    | 2391       | Single people sequences
S0144 through S0146                             | 1888    | 2865       | Two people sequences
S0030, S0147 through S0151, S0154 through S0157 | 4516    | 14837      | Multiple people sequences
S0036 through S0037                             | 1789    | 0          | Sequences with chairs
Totals                                          | 11375   | 21403      |

(a) Sample frames from the Zhang2012 dataset.

(b) Sample frames from the TVHEADS dataset.

Figure 6: Sample frames from the Zhang2012 and TVHEADS databases (gray coded depth).

4.1.2. MIVIA People Counting Dataset

The MIVIA people counting dataset (available at (Lab, 2016)) was developed by the MIVIA Research Lab at the University of Palermo, and it was first described and used in (Del Pizzo et al., 2016). It is mainly aimed at evaluating people flow density in the sensed environment, but we will use it for people counting and tracking purposes.

The MIVIA dataset includes RGB and depth data captured with a Kinect v1 device located in an overhead position at a height of 3.2 m. Depth values are provided as 8-bit unsigned integers (the actual depth resolution is 11 bits, but only the 8 most significant bits are kept, guaranteeing a depth resolution of roughly 1.25 cm). The recordings were captured in indoor and outdoor conditions, and for isolated and group transits.

Table 7: MIVIA testing subset details.

Sequence IDs     | #Frames | #Positives | Description
DIS1, DIS2       | 7692    | 3601       | Single people sequences
DIS3, DIS4, DIS5 | 10259   | 5315       | Two people sequences
DIG1, DIG2, DIG3 | 12222   | 8780       | Multiple people sequences
Totals           | 30173   | 17696      |

From the 34 available sequences, we use 8 of them: all those corresponding to indoor conditions, for both single and multiple person transits. The recordings include people walking with bags and other big objects, and their details are shown in Table 7. The depth streams in this database have a resolution of 640 × 480, so it is necessary to resize them to the input size that DPDnet can handle.

This dataset will be referred to as MIVIA in the tables included in Section 4.4. The bottom row of Figure 5 shows two sample frames from this dataset.

4.1.3. Zhang 2012 database

The Zhang 2012 database (Zhang et al., 2012) consists of two sequences (dataset1 and dataset2) recorded with a Kinect v1 device, also located in an overhead position, at an estimated height of 3.53 m (the authors do not explicitly state the camera position in their paper, so we estimated it from the available images), generating depth data with a resolution of 320 × 240 pixels. Depth values are provided in millimeters as 16-bit unsigned integers. Both sequences include multiple people moving under the camera, and their details are shown in Table 8. The recordings include groups of people walking close to each other, people walking freely, and people with bags and other big objects.

The data, kindly provided to us by its authors, was processed with an ad-hoc background subtraction strategy (described in (Zhang et al., 2012)) that introduced some artifacts in the depth images; we do not have access to the original depth streams.

Table 8: Zhang2012 testing subset details.

Sequence IDs | #Frames | #Positives | Description
dataset1     | 2384    | 4527       | Multiple people sequences
dataset2     | 1500    | 1553       | Multiple people sequences
Totals       | 3884    | 6080       |

The provided ground truth information consists of the same dataset images with red masks covering the location of the heads in the scene. This dataset will be referred to as Zhang2012 in the tables included in Section 4.4. The first row of Figure 6 shows two sample frames from this dataset.

4.1.4. TVHeads database

The TVHeads 2018 database (Liciotti et al., 2018b) consists of 1815 depth images recorded with an RGB-D device (the authors do not specify the camera model), also located in an overhead position, at heights between 2.82 m and 3.01 m depending on the recorded sequence, generating depth data with a resolution of 320 × 240 pixels. All images include multiple people under the camera (but no other moving objects), and their details are shown in Table 9. The recordings include groups of people walking close to each other, squatting, or standing and talking in different scenes.

The data provided by its authors includes 8-bit and 16-bit depth images, and a ground truth that consists of images with a black background and white masks covering the location of the heads in the scene. A relevant difference with respect to the other datasets is that, in this case, the ground truth also includes labeled people whose heads are partially out of the recorded image, as shown in Figure 7. These cases happen frequently in the recorded sequences, and this will have a significant impact on the performance of our proposal, as will be discussed in Section 4.4.5.

This dataset will be referred to as TVHEADS in the tables included in Section 4.4. The bottom row of Figure 6 shows two sample frames from this dataset.

Table 9: TVHeads testing subset details.

Sequence IDs | #Frames | #Positives | Description
Frames       | 1815    | 4413       | Multiple people sequences
Totals       | 1815    | 4413       |

Figure 7: Sample ground truth frames from the TVHEADS database, with labeled people at the image boundaries.

4.2. Experimental setup

In the experiments carried out, we have used the four different datasets described in Section 4.1. We compare the performance of our proposals, DPDnet and DPDnetfast, with other strategies described in the literature that use an overhead depth camera for people detection, tracking or counting. The following state-of-the-art methods are used to carry out a fully exhaustive comparison:

• WaterFilling (Zhang et al., 2012): this method is based on the water filling algorithm, and the original source code was kindly provided to us by its authors. To run this method on the GOTPD1+ dataset we scaled the images to 640 × 480 pixels; it does not require previous training.

• SLICEPCA (Luna et al., 2017): this method uses feature vectors extracted from the depth image and a generative PCA model of these vectors to classify detections. In our experiments, the PCA model is trained using the GOTPD1 dataset, with two target categories: people and non-people.

• SLICESVM (Fernandez-Rincon et al., 2017): this method is similar to SLICEPCA, but uses an SVM classifier instead of PCA. In our experiments, the SVM model is also trained using the full GOTPD1 dataset.

• MexicanHat (Stahlschmidt et al., 2013, 2014): this method is based on the normalized Mexican Hat Wavelet and requires the detection and removal of the floor plane. It does not require training data, though.

We have also carried out a comparison with CNN-based methods, but in this case it is done on the TVHEADS dataset, as Liciotti et al. (2018a) provided a broad performance comparison among six CNN-based strategies using this dataset.

It is important to note that we did not use a tracking module for any of the methods in the comparison. Our objective is to provide a fair comparison of their discrimination capabilities without further improvements. The only input frame manipulation we applied was the scaling required to fit the frame size to the input stage of each evaluated algorithm.

4.3. Evaluation Metrics

To provide a detailed and complete view of the performance of the evaluated algorithms, we calculate the main standard metrics in a detection problem, namely: Precision, Recall, F1score, False Negative Rate (FNR), False Positive Rate (FPR), and overall error ERR (calculated as ERR = FNR + FPR). We also calculate confidence intervals for the F1score and ERR metrics, for a confidence value of 95%, to assess the statistical significance of the results when comparing different strategies.

380

case of the CNN-based methods, the evaluation will be done on the TVHEADS dataset, as stated in Section 4.2, and the discussion will be done in Section 4.4.5. Tables 10, 11, and 12 include the results for the GOTPD1+, MIVIA, and Zhang2012 datasets respectively, including the evaluation metrics described in section 4.3 for all the evaluated algorithms. To visually aid in the comparative 23

385

performance analysis, we have added a green background to those table cells that correspond to the best results across all algorithms for each condition. Table 10: Detailed Results using all the tested algorithms on the GOTPD1+ dataset. All values in % (cells in green background correspond to the best results across all algorithms for each condition). DB

MexicanHat

P recision

Recall

F 1score

FNR

FPR

Single person

99.87

99.34

Two people

99.63

49.72

More than two people

99.60 ± 0.23

0.66

0.16

0.44 ± 0.25

66.34 ± 1.91

50.28

0.41

34.84 ± 1.93

99.86

37.84

54.88 ± 0.87

62.16

0.47

55.92 ± 0.87

Chairs. no people Totals

SLICESV M

94.24

45.63

61.48 ± 0.68

54.37

1.06 ± 0.38

98.23

99.87

99.04 ± 0.36

0.13

2.18

100.00

96.46 ± 0.75

0.00

16.25

5.05 ± 0.89

More than two people*

99.93

99.26

99.59 ± 0.11

0.74

0.63

0.73 ± 0.15

35.33

35.33 ± 2.21

Totals*

96.71

99.40

98.04 ± 0.19

0.60

9.70

2.95 ± 0.24

Single person

98.23

99.87

99.04 ± 0.36

0.13

2.18

1.06 ± 0.38

Two people

94.10

100.00

96.96 ± 0.70

0.00

13.91

4.32 ± 0.82

More than two people

99.93

99.20

99.56 ± 0.12

0.80

0.63

0.79 ± 0.15

99.06

99.36

99.21 ± 0.12

0.64

0.00

0.00 ± 0.00

2.70

1.18 ± 0.15 1.79 ± 0.50

Single person

98.18

98.63

98.40 ± 0.47

1.37

2.31

Two people

99.26

100.00

99.63 ± 0.25

0.00

1.66

0.51 ± 0.29

More than two people

99.90

99.81

99.85 ± 0.07

0.19

0.86

0.26 ± 0.09

22.97

22.97 ± 1.95

Totals

96.90

99.70

98.28 ± 0.18

0.30

9.24

2.59 ± 0.22

Single person

100.00

99.67

99.84 ± 0.15

0.33

0.00

0.18 ± 0.16

Two people

99.88

100.00

99.94 ± 0.10

0.00

0.28

0.09 ± 0.12

More than two people

99.92

99.90

99.91 ± 0.05

0.10

0.71

0.16 ± 0.07

0.39

0.39 ± 0.29

0.11

0.36

0.17 ± 0.06 0.11 ± 0.12

Chairs. no people

Chairs. no people Totals

DPDnet

42.48 ± 0.69

93.17

Totals

DPDnetf ast

21.97 ± 1.92

8.06

Two people*

Chairs. no people

WaterFilling

21.97

Single person*

Chairs. no people*

SLICEP CA

ERR

99.88

99.89

99.88 ± 0.05

Single person

99.93

99.87

99.90 ± 0.12

0.13

0.08

Two people

100.00

99.94

99.97 ± 0.07

0.06

0.00

0.04 ± 0.08

More than two people

100.00

99.71

99.86 ± 0.07

0.29

0.00

0.26 ± 0.09

99.99

99.75

99.87 ± 0.05

0.25

Chairs. no people Totals

24

0.00

0.00 ± 0.00

0.02

0.19 ± 0.06

4.4.1. Results for the GOTPD1+ database Regarding the results for the GOTPD1+ database (Table 10), it can be clearly seen that the DPDnet and DPDnetf ast solutions are the best ones, and this is 390

specially significant for sequences including more than two people. Despite this, SLICEP CA and SLICESV M also obtain top results for the Recall and F N R rates, for the sequences with one and two people. This happens at the cost of significantly increasing the F P R rates, which is the reason why these methods finally had a lower performance in the F 1score and ERR metrics. The WaterFilling

395

algorithm has also top performance for the sequences with two people, but also with higher F P R rates. The MexicanHat algorithm is clearly the worst one, as has already been proved in previous works. Despite the good results of SLICESV M and WaterFilling, their robustness to sequences without people (just chairs, for example) is less than that of

400

DPDnet, SLICEP CA and DPDnetf ast . In the case of SLICESV M , this bad performance could be due to the class modeling carried out, and for the WaterFilling algorithm, it is clearly due to the absence of an explicit person discrimination procedure, which generates a higher number of false positives. The results obtained with the SLICEP CA method in the chair sequences are similar or better

405

than those obtained with our CNN based proposals, but we must bear in mind that the SLICEP CA method requires a calibration process to know the intrinsic and extrinsic parameters of the sensor. It is also worth mentioning that the CNN-based methods are undoubtedly the best when considering the F P R rates, showing a very good discrimination

410

capability. Finally, when comparing DPDnet and DPDnetf ast , a singular effect can be observed: the F N R values for DPDnetf ast are better than those of DPDnet, while the F P R values of DPDnet are better than those of DPDnetf ast . This can be explained by the reduction in the image resolution, since this reduction implies

415

an interpolation that allows to slightly filter the noise and artifacts in the input image. This may lead the fast version to have an easier job in discriminating

25

negatives, while in the case of the standard version, the availability of the full image information allows for a better discrimination of actual people present in the scene. Table 11: Detailed Results using all the tested algorithms on the MIVIA dataset. All values in % (cells in green background correspond to the best results across all algorithms for each condition). DB

MexicanHat

SLICESV M

SLICEP CA

WaterFilling

DPDnetf ast

DPDnet

420

P recision

Recall

F 1score

FNR

FPR

ERR

Single person

93.76

92.66

Two people

88.61

86.64

93.21 ± 0.56

7.34

2.94

4.36 ± 0.46

87.61 ± 0.64

13.36

6.21

More than two people

93.73

8.77 ± 0.55

53.67

68.26 ± 0.76

46.33

2.65

21.21 ± 0.67

Totals

91.81

71.40

80.33 ± 0.43

28.60

3.89

13.27 ± 0.37

Single person*

92.92

93.90

93.41 ± 0.55

6.10

3.39

4.26 ± 0.45

Two people*

87.75

87.97

87.86 ± 0.63

12.03

6.85

8.70 ± 0.55

More than two people*

91.85

94.11

92.96 ± 0.42

5.89

6.13

6.03 ± 0.39

Totals*

90.85

92.22

91.53 ± 0.30

7.78

5.66

6.46 ± 0.27

Single person*

93.30

93.90

93.67 ± 0.55

6.10

3.20

4.13 ± 0.44

Two people*

87.97

87.97

87.97 ± 0.63

12.03

6.71

8.62 ± 0.54

More than two people*

92.91

93.95

93.43 ± 0.40

6.05

5.25

5.59 ± 0.38

Totals*

91.52

92.14

91.83 ± 0.30

7.86

5.20

6.20 ± 0.26

Single person

95.20

98.96

97.04 ± 0.38

1.04

2.41

1.96 ± 0.31

Two people

96.67

99.84

98.23 ± 0.26

0.16

1.96

1.31 ± 0.22

More than two people

91.29

98.53

94.77 ± 0.37

1.47

6.81

4.57 ± 0.34

Totals

93.68

99.02

96.27 ± 0.21

0.98

4.08

2.91 ± 0.18

Single person

99.71

97.41

98.55 ± 0.27

2.59

0.14

0.94 ± 0.22

Two people

99.54

99.09

99.32 ± 0.16

0.91

0.26

0.50 ± 0.14

More than two people

99.24

96.62

97.91 ± 0.23

3.38

0.53

1.72 ± 0.21

Totals

99.43

97.54

98.47 ± 0.13

2.46

0.34

1.14 ± 0.12

Single person

100.00

98.56

99.27 ± 0.19

1.44

0.00

0.47 ± 0.15

Two people

99.81

99.05

99.43 ± 0.15

0.95

0.11

0.41 ± 0.12

More than two people

99.23

98.14

98.68 ± 0.19

1.86

0.54

1.09 ± 0.17

Totals

99.57

98.50

99.03 ± 0.11

1.50

0.26

0.72 ± 0.09

4.4.2. Results for the MIVIA database Regarding the results for the MIVIA database (Table 11), we first have to stress and remind the fact that no training nor adaptation has been done for this new dataset.

26

Again, the best results are found in the CNN-based methods (which use a 425

model trained on a different dataset), and (in specific metrics) in the WaterFilling algorithm (that does not need any training procedure). The SLICEP CA and SLICESV M methods exhibit a relevant performance drop that can be explained by their higher sensitivity to the mismatch in the recording conditions as compared with those of the training material used.

430

The better performance of the WaterFilling algorithm in the Recall and F N R rates is again achieved at the cost of increasing the F P R rates, which finally translates to overall lower F 1score and ERR metrics. As in the case of GOTPD1+, the MexicanHat method is the worst of all in all the evaluated metrics. Table 12: Detailed Results using all the tested algorithms on the Zhang2012 dataset. All values in % (cells in green background correspond to the best results across all algorithms for each condition).

MexicanHat

SLICESV M

SLICEP CA

WaterFilling

DPDnetf ast

DPDnet

DB

P recision

Recall

F 1score

FNR

FPR

Dataset1

40.52

24.51

30.54 ± 1.51

75.49

410.76

102.49 ± −

Dataset2

26.72

21.02

23.53 ± 2.02

78.98

86.32

81.92 ± 1.83

Totals

36.56

23.68

28.74 ± 1.22

76.32

182.85

95.87 ± 0.54

Dataset1*

82.92

85.77

84.32 ± 1.19

14.32

226.82

29.60 ± 1.49

Dataset2*

59.56

59.28

59.42 ± 2.33

40.72

64.38

49.82 ± 2.37

Totals*

77.50

79.44

78.46 ± 1.10

20.56

110.57

36.09 ± 1.29

Dataset1*

94.31

97.81

96.03 ± 0.64

2.19

67.60

7.44 ± 0.86

Dataset2*

94.48

99.30

96.83 ± 0.83

0.70

8.36

3.84 ± 0.92

Totals*

94.35

98.16

96.22 ± 0.52

1.84

25.69

6.28 ± 0.66

Dataset1

95.40

98.76

97.05 ± 0.55

1.24

55.48

5.53 ± 0.75

Dataset2

96.70

98.91

97.79 ± 0.70

1.09

4.96

2.66 ± 0.77

Totals

95.70

98.79

97.22 ± 0.44

1.21

19.71

4.61 ± 0.57

Dataset1

99.57

99.21

99.39 ± 0.25

0.79

4.95

1.12 ± 0.34

Dataset2

99.01

99.11

99.06 ± 0.46

0.89

1.45

1.12 ± 0.50

Totals

99.44

99.19

99.31 ± 0.22

0.81

2.47

1.12 ± 0.28

Dataset1

98.89

99.76

99.32 ± 0.27

0.24

12.94

1.26 ± 0.36

Dataset2

97.10

99.90

98.48 ± 0.58

0.10

4.35

1.83 ± 0.64

Totals

98.46

99.79

99.12 ± 0.25

0.21

6.87

1.44 ± 0.32

27

ERR

435

4.4.3. Results for the Zhang2012 database Regarding the results for the Zhang2012 database (Table 12), we are again in the case of a mismatch between the training and evaluation conditions discussed above. Here, the superiority of the CNN-based methods are even clearer than in the

440

two previous datasets, obtaining very good results considering the experimental conditions. The performance drop of the SLICEP CA and SLICESV M methods is higher than in the MIVIA results, specially in the latter. As described in Section 4.1.3, the images of this dataset contain artifacts from the background subtraction

445

method, and they might affect the classifiers SLICEP CA and SLICESV M , which were trained with cleaner data (the GOTPD1+ database). As discussed in the case of the MIVIA results, the WaterFilling algorithm performance is closer to the top results in the Recall and F N R rates, with the same explanation than above.

450

The CNN-based methods are affected by image artifacts to a much lower extent than SLICEP CA and SLICESV M . Between out two CNN-based proposals, DPDnet is more affected than DPDnetf ast , possibly given that its greater number of parameters (due to the larger input image size) forces it to search for more defined and detailed features similar to those found in the training subset

455

of the GOTPD1+ dataset, while DPDnetf ast generalizes better when facing new datasets, due to the lower definition of its input images (finding more general characteristics of people). Finally, as in all the previous cases, the MexicanHat method gets the worst performance for all metrics.

460

4.4.4. Overall comparison Once the detailed results have been discussed for each dataset, now we will provide full details on the overall comparison among the evaluated algorithms for all the datasets used. We will include tables with the overall behavior of the different strategies, and we will also include bar charts to ease the visual 28

465

comparison, providing error confidence values for all the metrics. In the tables and graphics below, we have omitted the results for the MexicanHat algorithm due to its bad performance. In the charts, please note that the vertical scale is not the same for all of them, as they have been adjusted to allow for a better visualization.

470

From the previous discussion on the results shown in tables 10, 11 and 12, we can conclude that the proposed solutions DPDnet and DPDnetf ast are the best alternatives for a robust detection in this application. They have obtained the best results, and they have proved to be robust against changing database recording and environmental conditions, maintaining the overall er-

475

ror rate around 1% percent in all cases. The classic trainable methods such as SLICEP CA and SLICESV M are good alternatives but have worse results, and they require calibration and retraining to change to provide optimal results. On the other hand, WaterFilling can be seen as another alternative system, but unlike the two previous approaches, it acts more as a detector of maximums than

480

as a classification algorithm, not being able to distinguish persons from other elements. Finally, the MexicanHat algorithm is not a good alternative system and globally has obtained the worst results among the evaluated systems. Table 13 shows the final aggregated results of the evaluated algorithms on the GOTPD1+ dataset, and in Figures 8a and 8b we show the graphical comparison

485

among the different strategies. Table 13: Results for the GOTPD1+ dataset. All values in %.

Method

P recision

Recall

F1 score

FNR

FPR

ERR

DPDnet

99.99

99.75

99.87 ± 0.05

0.25

0.02

0.19 ± 0.06

DPDnetf ast

99.88

99.89

99.88 ± 0.05

0.11

0.36

0.17 ± 0.06

WaterFilling

96.90

99.70

98.28 ± 0.18

0.30

9.24

2.59 ± 0.22

SLICEP CA *

99.06

99.36

99.21 ± 0.12

0.64

2.70

1.18 ± 0.15

SLICESV M *

96.71

99.40

98.04 ± 0.19

0.60

9.70

2.95 ± 0.24

MexicanHat

94.24

45.63

61.48 ± 0.68

54.37

8.06

42.48 ± 0.69

29

Figure 8a shows how the performance metrics are reasonably high, except for the WaterFilling and SLICESV M algorithms, whose P recision and F 1score have a significant drop. This tendency is also clearly observed in the error metrics of Figure 8b. The CNN-based methods achieve the best results, and 490

the observed improvements are statistically significant against the non CNNbased methods, considering the provided confidence intervals.

(a) Performance metrics.

(b) Error metrics.

Figure 8: Overall performance and error metrics for the GOTPD1+ dataset.

Table 14 shows the final aggregated results of the evaluated algorithms on the MIVIA dataset, and in Figures 9a and 9b we show the graphical comparison among the different strategies Table 14: Results for the MIVIA dataset. All values in %.

495

Method

P recision

Recall

F1 score

FNR

FPR

ERR

DPDnet

99.57

98.50

99.03 ± 0.11

1.50

0.26

0.72 ± 0.09

DPDnetf ast

99.43

97.54

98.47 ± 0.13

2.46

0.34

1.14 ± 0.12

WaterFilling

93.68

99.02

96.27 ± 0.21

0.98

4.08

2.91 ± 0.18

SLICEP CA *

91.52

92.14

91.83 ± 0.30

7.86

5.20

6.20 ± 0.26

SLICESV M *

90.85

92.22

91.53 ± 0.30

7.78

5.66

6.46 ± 0.27

MexicanHat

91.81

71.40

80.33 ± 0.43

28.60

3.89

13.27 ± 0.37

Figure 9a shows a much more significant performance drop of the non CNNbased methods (due to the mismatch between the training and testing conditions), with a clear superiority of the DPDnet and DPDnetf ast approaches in 30

P recision and F 1score . Their worse performance in Recall as compared with WaterFilling was already explained in previous sections. In what respect to 500

the error metrics of Figure 9b, the CNN-based methods achieve the best results in the overall error, again with statistically significant results as compared with the others.

(a) Performance metrics.

(b) Error metrics.

Figure 9: Overall performance and error metrics for the MIVIA dataset.

Table 15 shows the final aggregated results of the evaluated algorithms on the ZHAN2012 dataset, and in Figures 10a and 10b we show the graphical comparison 505

among the different strategies, where it is again clear how our proposals clearly outperform the other algorithms, with statistically significant improvements. Table 15: Results for the Zhang2012 dataset. All values in %.

Method

P recision

Recall

F1 score

FNR

FPR

ERR

DPDnet

98.46

99.79

99.12 ± 0.25

0.21

6.87

1.44 ± 0.32

DPDnetf ast

99.44

99.19

99.31 ± 0.22

0.81

2.47

1.12 ± 0.28

WaterFilling

95.70

98.79

97.22 ± 0.44

1.21

19.71

4.61 ± 0.57

SLICEP CA *

94.35

98.16

96.22 ± 0.52

1.84

25.69

6.28 ± 0.66

SLICESV M *

77.50

79.44

78.46 ± 1.10

20.56

110.57

36.09 ± 1.29

MexicanHat

36.56

23.68

28.74 ± 1.22

76.32

182.85

95.87 ± 0.54

31

(a) Performance metrics.

(b) Error metrics.

Figure 10: Overall performance and error metrics for the Zhang2012 dataset.

4.4.5. Comparison with other CNN-based proposals In this section we use the TVHEADS dataset to compare our proposal with other very recently proposed CNN-based methods that face top-view depth 510

data for the people detection task. We focus on the TVHEADS for this evaluation, as Liciotti et al. (2018a) already provided a broad comparison among CNNbased strategies using this dataset. In the evaluation of our proposal on the TVHEADS database, the mismatch between the training and evaluation conditions discussed above is also present,

515

and in this case we have an additional relevant issue: the fact that the TVHEADS dataset ground truth includes labeled people partially occluded by the image boundaries, as described in Section 4.1.4, and graphically shown in Figure 7. Table 16: Results for the TVHeads dataset (some provided by (Liciotti et al., 2018a)) Method

Method                               | Precision    | Recall       | F1 score     | ERR
DPDnet                               | 89.63 ± 0.90 | 72.67 ± 1.31 | 80.26 ± 1.17 | 18.93 ± 1.16
DPDnet-FT                            | 99.78 ± 0.14 | 98.88 ± 0.31 | 99.33 ± 0.24 | 0.97 ± 0.29
Fractal (Larsson et al., 2016)       | 99.26 ± 0.25 | 99.32 ± 0.24 | 99.29 ± 0.25 | 0.56 ± 0.22
U-Net (Ronneberger et al., 2015)     | 94.50 ± 0.67 | 94.89 ± 0.65 | 94.69 ± 0.66 | 0.75 ± 0.25
U-Net2 (Ravishankar et al., 2017)    | 96.78 ± 0.52 | 97.05 ± 0.50 | 96.91 ± 0.51 | 0.69 ± 0.24
U-Net3 (Liciotti et al., 2018a)      | 98.93 ± 0.30 | 98.94 ± 0.30 | 98.93 ± 0.30 | 0.55 ± 0.22
SegNet (Badrinarayanan et al., 2015) | 94.62 ± 0.67 | 95.33 ± 0.62 | 94.96 ± 0.65 | 0.74 ± 0.25
ResNet (He et al., 2016)             | 96.87 ± 0.51 | 96.92 ± 0.51 | 96.89 ± 0.51 | 0.62 ± 0.23

Figure 11: Overall performance and error metrics for the TVHEADS dataset. (a) Performance metrics; (b) Error metrics.

Table 16 shows the results of our proposal compared with those of Liciotti et al. (2018a) and of other methods in the literature designed for general segmentation tasks (Larsson et al., 2016; Ronneberger et al., 2015; Ravishankar et al., 2017; Badrinarayanan et al., 2015), whose results are taken from (Liciotti et al., 2018a). Figures 11a and 11b show the graphical comparison among the different methods. In this case we include the confidence bands for all the metrics, given the similarity among the results achieved by the different systems.

The table shows results for two versions of our proposal. The first one (DPDnet) refers to the standard system trained on the GOTPD1+ dataset, without fine-tuning. As can be seen, the standard DPDnet results are worse than those of all the other CNN-based systems.

After analyzing the DPDnet results in detail, we found that most of the performance drop is due to the large number of labeled people partially occluded by the image boundaries, a situation that is not found in the ground truth information provided with the GOTPD1+ database, so that DPDnet was unable to model it.

To face this issue, and taking into account the very limited size of the TVHEADS dataset, we decided to build an adapted system called DPDnet-FT, which is a fine-tuned version of DPDnet using part of TVHEADS as the fine-tuning data. That way, DPDnet-FT exploits the previously acquired knowledge on the people detection task to adapt its predictions to the new labeling conditions of the TVHEADS environment. We used 80% of the TVHEADS data to fine-tune the DPDnet models, 10% for validation, and the remaining 10% for testing. To provide more reliable results, we applied a 10-fold cross-validation strategy over the TVHEADS database, reporting the average performance metrics over the 10 folds. That way, we could get the most out of the available data while also reducing the statistical significance bands as much as possible.
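As an illustration of this adaptation procedure, the sketch below outlines one possible implementation of the fine-tuning scheme with the 80/10/10 split repeated over 10 folds. The Keras-style API, the random placeholder arrays standing in for TVHEADS, the weight file name and all hyperparameters are assumptions made for the example; they do not describe the actual training configuration used in this work.

```python
# Illustrative sketch only, not the authors' training code: 10 folds, each
# using 80% of the data for fine-tuning, 10% for validation and 10% for
# testing, with the final figure averaged over the folds.
import numpy as np
from tensorflow import keras

# Placeholder stand-in for the TVHEADS depth images and label heat-maps.
N, H, W = 1000, 240, 320
depth = np.random.rand(N, H, W, 1).astype("float32")
heatmaps = np.random.rand(N, H, W, 1).astype("float32")

indices = np.random.RandomState(0).permutation(N)
fold = N // 10
test_losses = []

for k in range(10):
    test_idx = indices[k * fold:(k + 1) * fold]
    rest = np.concatenate([indices[:k * fold], indices[(k + 1) * fold:]])
    val_idx, train_idx = rest[:fold], rest[fold:]

    # Start from weights pre-trained on GOTPD1+ (hypothetical file name),
    # then adapt with a small learning rate to the new labeling conditions.
    model = keras.models.load_model("dpdnet_gotpd1plus.h5", compile=False)
    model.compile(optimizer=keras.optimizers.Adam(1e-4), loss="mse")
    model.fit(depth[train_idx], heatmaps[train_idx],
              validation_data=(depth[val_idx], heatmaps[val_idx]),
              epochs=20, batch_size=8, verbose=0)
    test_losses.append(model.evaluate(depth[test_idx], heatmaps[test_idx], verbose=0))

print("Average test loss over the 10 folds:", float(np.mean(test_losses)))
```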

It is important to mention that Liciotti et al. (2018a) did not provide the experimental details of their database partition scheme, so we do not have the information needed to estimate the confidence intervals of their results. To overcome this issue and provide a rigorous comparison, we assumed that their published results also considered the highest possible amount of data, thus also having the smallest possible statistical confidence intervals.
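This assumption is the most favourable one for the compared systems, since the confidence band of a percentage metric shrinks as the number of evaluated samples grows. As a rough illustration (using a normal-approximation binomial interval, which may differ from the exact procedure applied in this paper), the half-width of a 95% band around an accuracy $p$ estimated from $n$ samples is

$$\Delta \approx 1.96\,\sqrt{\frac{p\,(1-p)}{n}},$$

so that, for instance, $p = 0.99$ measured on $n \approx 4400$ samples yields $\Delta \approx 0.29\%$, whereas the same accuracy measured on ten times fewer samples yields a band roughly $\sqrt{10} \approx 3.2$ times wider.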

The Precision and F1 score results of DPDnet-FT shown in Table 16 are the best among the compared systems, with the differences in Precision being statistically significant. In terms of Recall, DPDnet-FT is in third place, but the differences with the top two systems are not statistically significant.

Finally, in terms of ERR, DPDnet-FT is placed last but, again, the differences with all the other CNN-based systems are not statistically significant. It is relevant to mention that at these accuracy levels the number of errors is extremely small: for example, our ERR = 0.97% means fewer than 4 errors out of 4413 labelled positives. It is also worth mentioning that the TVHEADS dataset does not include sequences with objects, so we cannot evaluate the performance of the CNN-based alternatives under such conditions. It is clear that further experimental work with the proposals by Liciotti et al. (2018a) on larger datasets is required to assess the relative performance among the CNN-based proposals with statistically significant results, but this will be addressed as a future task.

4.5. Computational Performance Evaluation

Regarding the computational demands, Table 17 shows the performance of the evaluated algorithms measured in average frames per second (we have explicitly removed the results for MexicanHat due to its poor detection performance and its high computational cost, according to the results shown in Luna et al. (2017)). All the experiments were run on a standard PC with an Intel Core i7-6700 at 3.4 GHz and 32 GB of RAM; the GPU used was an NVIDIA GTX 1080 Ti. The evaluation was run on samples from the GOTPD1+ dataset.

For DPDnet and DPDnet_fast we also show CPU results.

          | DPDnet | DPDnet_fast | WaterFilling | SLICE_PCA | SLICE_SVM
FPS (CPU) | 2.7    | 9.8         | 16.6         | 312.0     | 406.0
FPS (GPU) | 65.4   | 116.9       | N/A          | N/A       | N/A

Table 17: Timing results, measured as average frames per second (FPS).
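As an indication of how such throughput figures can be obtained, the snippet below shows one straightforward way to measure the average FPS of a CNN-based detector. The Keras-style model loading, the weight file name, the frame resolution, the placeholder frames and the warm-up pass are assumptions of this sketch, not a description of the authors' benchmarking code.

```python
# Minimal timing sketch (illustrative only): average FPS of single-frame
# inference, with a warm-up pass to exclude one-off initialisation costs.
import time
import numpy as np
from tensorflow import keras

model = keras.models.load_model("dpdnet.h5", compile=False)   # hypothetical weight file
frames = np.random.rand(500, 240, 320, 1).astype("float32")   # placeholder depth frames

model.predict(frames[:1], verbose=0)                           # warm-up (kernels, memory allocation)

start = time.perf_counter()
for frame in frames:
    model.predict(frame[np.newaxis], verbose=0)                # one frame at a time, as in on-line use
elapsed = time.perf_counter() - start

print(f"Average throughput: {len(frames) / elapsed:.1f} FPS")
```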

As can be clearly seen, the computational performance of the CNN-based strategies running on a generic CPU is well behind that of the WaterFilling and SLICE approaches, although DPDnet_fast can still run at a reasonable rate of almost 10 FPS. The performance of the SLICE algorithms is much better than the one reported in Luna et al. (2017) because the CPU we use is significantly better. When the GPU comes into play, the results for DPDnet and DPDnet_fast are, as expected, much better than those for WaterFilling, although still behind SLICE_SVM and SLICE_PCA. However, both CNN-based strategies are well above the requirements demanded for real-time operation. Taking into account that these approaches significantly outperform the other methods, there is still room for further reduction of the network complexity.

Regarding the effect of scene complexity on the computational performance of the algorithms, Luna et al. (2017) already showed that the SLICE_PCA algorithm was affected by the number of users in the scene (this was due to the fact that the ROI estimation and feature extraction modules, whose computational demands are proportional to the number of users, had a big impact on the processing time). On the other hand, the DPDnet, DPDnet_fast, and WaterFilling algorithms do not exhibit such a scene-complexity dependence.

Table 18 shows the FPS performance for these algorithms evaluated on sequences with one, two, and more than two people. It can be seen that the number of users has no direct impact on the results for DPDnet, DPDnet_fast, and WaterFilling. For the SLICE_PCA and SLICE_SVM approaches, the relative decrease in computational performance between the single-person and more-than-two-people cases is 65.3% and 58.4%, respectively (see the worked computation after the table).

                     | DPDnet (GPU) | DPDnet_fast (GPU) | WaterFilling | SLICE_PCA | SLICE_SVM
Single person        | 66.9         | 111.5             | 16.6         | 572.0     | 682.0
Two people           | 60.9         | 124.0             | 16.8         | 391.8     | 498.8
More than two people | 65.5         | 117.5             | 16.5         | 198.7     | 283.4

Table 18: Average timing results for sequences with a varying number of users (FPS).
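For clarity, these relative decreases follow directly from the single-person and more-than-two-people rows of Table 18:

$$\frac{572.0 - 198.7}{572.0} \approx 0.653 \;\; (65.3\%\ \text{for SLICE}_{PCA}), \qquad \frac{682.0 - 283.4}{682.0} \approx 0.584 \;\; (58.4\%\ \text{for SLICE}_{SVM}).$$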

Regarding the computational performance comparison with the other CNN-based alternatives, Liciotti et al. (2018a) state in their paper that their proposal is "computationally efficient at inference time". Unfortunately, they do not provide any numerical details on this aspect, so we are unable to compare further with them on this issue.

5. Conclusions and Future Work

In this paper we propose a new CNN-based solution (with two implementations, namely DPDnet and DPDnet_fast) to address the task of multiple people detection from overhead depth images using multimodal likelihood maps. We evaluated and compared our method with other recent state-of-the-art methods using four different depth image datasets with varying recording conditions (technology used in the depth sensors, resolution, height of the sensors' location, background configuration, number of users, actions performed by them, presence of non-people elements, etc.), following a rigorous and exhaustive experimental procedure. To the best of our knowledge, our study is the broadest one in the literature with respect to the number of evaluated methods and the range of evaluation datasets.

From our experimental results, we can claim that the proposed method clearly outperforms all the non-CNN-based state-of-the-art methods on different databases (GOTPD1+, MIVIA and Zhang2012), with no need for retraining or fine-tuning, and with statistically significant differences.

In the case of the comparison with the CNN-based methods on the TVHEADS dataset, we had to apply a fine-tuning procedure given the relevant mismatch between its labeling criteria and those we applied in the GOTPD1+ dataset. After fine-tuning, our system proved to be the best one in terms of the Precision and F1 score performance metrics, with the differences in Precision being statistically significant. For the other performance metrics, the differences found were not statistically significant, so we can state that our proposal is also competitive with respect to the other CNN-based alternatives. Given the limited size of the TVHEADS dataset, we have not been able to establish the superiority of any of the CNN-based methods, but it is clear that at these performance levels (overall error below 1%), and given its computational efficiency, our strategy is ready to be deployed in real scenarios.

Nevertheless, in the near future we will carry out further experimentation with larger datasets using CNN-based approaches, to be able to actually assess their relative performance with statistically significant results.

Also pursuing deployment in real scenarios, after implementing the fast version of DPDnet (DPDnet_fast) we obtained a relative reduction of computing time between 65.3% and 58.4%, with just a minor decrease in accuracy. Given that commercial systems try to optimize hardware costs, we highlight this as an important contribution that can run in real time on low-cost GPUs.

Having evaluated all the systems, including CNN-based and non-CNN-based ones, we can conclude that the CNN-based ones are the best for detecting people in terms of accuracy, speed and robustness. Despite that, these approaches require the use of GPUs, in contrast to the non-CNN-based ones, which usually run in real time on conventional CPUs at the expense of lower performance.

Regarding suggestions for future research, we think that one of the major drawbacks of the proposed task is the expected low performance in outdoor conditions, since current depth camera technology exhibits, in general, problems with IR interference from sunlight. We will address this issue in the future, either by using depth cameras that claim to be more robust to this interference, or by adopting alternative strategies to cope with higher errors in depth estimation and increased noise conditions.

Another relevant future line of research will be to port the DPDnet strategy to a more general formulation of the problem that does not restrict the camera to overhead locations. In this case, we will have to face some associated problems, mainly related to the appearance of occlusions, that could initially be approached by using tracking strategies.

Regarding the CNN architecture, we found that the generic objective function being used might be improved by adopting loss functions that explicitly model the upper human body geometry. This could lead to a stronger, faster and better people detector.

Finally, and as already suggested in Section 4.4.5, we will work on an extensive evaluation of the CNN-based alternatives using larger datasets, to be able to assess their relative performance with statistically significant results.

6. Acknowledgments

This work has been supported by the Spanish Ministry of Economy and Competitiveness under projects HEIMDAL-UAH (TIN2016-75982-C2-1-R) and ARTEMISA (TIN2016-80939-R), and by the University of Alcalá under project JANO (CCGP2017/EXP-025). We also thank Xucong Zhang for providing us with their evaluation data, and the MIVIA and VRAI labs for sharing their MIVIA and TVHEADS datasets with the scientific community.


References

Agarap, A. F. (2018). Deep learning using rectified linear units (ReLU). CoRR, abs/1803.08375. URL: http://arxiv.org/abs/1803.08375. arXiv:1803.08375.

Andriluka, M., Roth, S., & Schiele, B. (2008). People-tracking-by-detection and people-detection-by-tracking. In 2008 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–8). doi:10.1109/CVPR.2008.4587583.

Antic, B., Letic, D., Culibrk, D., & Crnojevic, V. (2009). K-means based segmentation for real-time zenithal people counting. In Proceedings of the 16th IEEE International Conference on Image Processing, ICIP'09 (pp. 2537–2540). Piscataway, NJ, USA: IEEE Press. doi:10.1109/ICIP.2009.5414001.

Badrinarayanan, V., Kendall, A., & Cipolla, R. (2015). SegNet: A deep convolutional encoder-decoder architecture for image segmentation. CoRR, abs/1511.00561. URL: http://arxiv.org/abs/1511.00561. arXiv:1511.00561.

Bevilacqua, A., Di Stefano, L., & Azzari, P. (2006). People tracking using a time-of-flight depth sensor. In Video and Signal Based Surveillance, 2006. AVSS '06. IEEE International Conference on (pp. 89–89). doi:10.1109/AVSS.2006.92.

Cai, Z., Yu, Z. L., Liu, H., & Zhang, K. (2014). Counting people in crowded scenes by video analyzing. In Industrial Electronics and Applications (ICIEA), 2014 IEEE 9th Conference on (pp. 1841–1845). doi:10.1109/ICIEA.2014.6931467.

Cao, Y., Shen, C., & Shen, H. T. (2017). Exploiting depth from single monocular images for object detection and semantic segmentation. IEEE Transactions on Image Processing, 26, 836–846. doi:10.1109/TIP.2016.2621673.

Chan, A., Liang, Z.-S., & Vasconcelos, N. (2008). Privacy preserving crowd monitoring: Counting people without people models or tracking. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (pp. 1–7). doi:10.1109/CVPR.2008.4587569.

Chen, T.-Y., Chen, C.-H., Wang, D.-J., & Kuo, Y.-L. (2010). A people counting system based on face-detection. In Genetic and Evolutionary Computing (ICGEC), 2010 Fourth International Conference on (pp. 699–702). doi:10.1109/ICGEC.2010.178.

Chollet, F. (2016). Xception: Deep learning with depthwise separable convolutions. CoRR, abs/1610.02357. URL: http://arxiv.org/abs/1610.02357. arXiv:1610.02357.

Dan, B.-K., Kim, Y.-S., Suryanto, Jung, J.-Y., & Ko, S.-J. (2012). Robust people counting system based on sensor fusion. Consumer Electronics, IEEE Transactions on, 58, 1013–1021. doi:10.1109/TCE.2012.6311350.

Del Pizzo, L., Foggia, P., Greco, A., Percannella, G., & Vento, M. (2016). Counting people by RGB or depth overhead cameras. Pattern Recognition Letters, 81, 41–50. URL: https://doi.org/10.1016/j.patrec.2016.05.033. doi:10.1016/j.patrec.2016.05.033.

Dumoulin, J., Canévet, O., Villamizar, M., Nunes, H., Khaled, O. A., Mugellini, E., Moscheni, F., & Odobez, J.-M. (2018). Unicity: A depth maps database for people detection in security airlocks. In IEEE International Conference on Advanced Video and Signal-based Surveillance Workshop.

Dumoulin, J., Canévet, O., Villamizar, M., Nunes, H., Khaled, O. A., Mugellini, E., Moscheni, F., & Odobez, J.-M. (2019). UNICITY: A depth maps database for people detection in security airlocks. Available online. doi:10.5281/zenodo.2556678 (last accessed October 2019).

Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14 (pp. 2366–2374). Cambridge, MA, USA: MIT Press. URL: http://dl.acm.org/citation.cfm?id=2969033.2969091.

Fernandez-Rincon, A., Fuentes-Jimenez, D., Losada, C., Marron, M., Luna, C. A., Macias-Guarasa, J., & Mazo, M. (2017). Robust people detection and tracking from an overhead time-of-flight camera. In 12th International Conference on Computer Vision Theory and Applications (pp. 556–564). Porto, Portugal, volume 4. doi:10.5220/0006169905560564.

Galčík, F., & Gargalík, R. (2013). Real-time depth map based people counting. In 15th International Conference on Advanced Concepts for Intelligent Vision Systems - Volume 8192, ACIVS 2013 (pp. 330–341). Berlin, Heidelberg: Springer-Verlag. doi:10.1007/978-3-319-02895-8_30.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016 (pp. 770–778). doi:10.1109/CVPR.2016.90.

Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.

Jeong, C. Y., Choi, S., & Han, S. W. (2013). A method for counting moving and stationary people by interest point classification. In Image Processing (ICIP), 2013 20th IEEE International Conference on (pp. 4545–4548). doi:10.1109/ICIP.2013.6738936.

Jia, L., & Radke, R. (2014). Using time-of-flight measurements for privacy-preserving tracking in a smart room. Industrial Informatics, IEEE Transactions on, 10, 689–696. doi:10.1109/TII.2013.2251892.

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980. URL: http://arxiv.org/abs/1412.6980. arXiv:1412.6980.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012).

Imagenet clas-

sification with deep convolutional neural networks.

In F. Pereira,

C. J. C. Burges,

760

Neural

L. Bottou,

vances

in

Information

1105).

Curran Associates, Inc.

& K. Q. Weinberger (Eds.), Processing

Systems

25

(pp.

Ad1097–

URL: http://papers.nips.cc/paper/

4824-imagenet-classification-with-deep-convolutional-neural-networks. pdf. Lab, M. R. (2016). The MIVIA People Counting Dataset. Available online. URL: http://mivia.unisa.it/datasets-request (last accessed January 765

2019). Larsson, G., Maire, M., & Shakhnarovich, G. (2016). Fractalnet: Ultra-deep neural networks without residuals. CoRR, abs/1605.07648 . URL: http:// arxiv.org/abs/1605.07648. arXiv:1605.07648. Liciotti, D., Paolanti, M., Pietrini, R., Frontoni, E., & Zingaretti, P. (2018a).

770

Convolutional networks for semantic heads segmentation using top-view depth data in crowded environment. In 2018 24th International Conference on Pattern Recognition (ICPR) (pp. 1384–1389). doi:10.1109/ICPR.2018.8545397. Liciotti, D., Paolanti, M., Pietrini, R., Frontoni, E., & Zingaretti, P. (2018b). The TVHEADS (Top-View Heads) dataset. Available online. URL: http:

775

//vrai.dii.univpm.it/tvheads-dataset (last accessed October 2019). Liu, J., Liu, Y., Zhang, G., Zhu, P., & Chen, Y. Q. (2015). Detecting and tracking people in real time with rgb-d camera. Pattern Recognition Letters, 53 , 16 – 23. URL: http://www.sciencedirect.com/science/article/pii/ S016786551400302X. doi:10.1016/j.patrec.2014.09.013.

780

Luna, C. A., Losada-Gutierrez, C., Fuentes-Jimenez, D., Fernandez-Rincon, A., Mazo, M., & Macias-Guarasa, J. (2017). Robust people detection using depth information from an overhead time-of-flight camera. Expert Syst. Appl., 71 , 240–256. doi:10.1016/j.eswa.2016.11.019.

42

Macias-Guarasa, J., Losada-Gutierrez, C., Fuentes-Jimenez, D., Fernandez, R., Luna, C. A., Fernandez-Rincon, A., & Mazo, M. (2016a). The GEINTRA Overhead ToF People Detection (GOTPD1) database. Available online. URL: http://www.geintra-uah.org/datasets/gotpd1 (last accessed January 2019).

Macias-Guarasa, J., Losada-Gutierrez, C., Fuentes-Jimenez, D., Fernandez, R., Luna, C. A., Fernandez-Rincon, A., & Mazo, M. (2016b). The GEINTRA Overhead ToF People Detection (GOTPD1) database. Available online. URL: https://www.kaggle.com/lehomme/overhead-depth-images-people-detection-gotpd1 (last accessed January 2019).

Munaro, M., Lewis, C., Chambers, D., Hvass, P., & Menegatti, E. (2016). RGB-D human detection and tracking for industrial environments. In E. Menegatti, N. Michael, K. Berns, & H. Yamaguchi (Eds.), Intelligent Autonomous Systems 13 (pp. 1655–1668). Cham: Springer International Publishing.

Ouyang, W., & Wang, X. (2013). Joint deep learning for pedestrian detection. In 2013 IEEE International Conference on Computer Vision (pp. 2056–2063). doi:10.1109/ICCV.2013.257.

Ramanan, D., Forsyth, D. A., & Zisserman, A. (2006). Tracking people by learning their appearance. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29, 65–81. doi:10.1109/tpami.2007.250600.

Rauter, M. (2013). Reliable human detection and tracking in top-view depth images. In 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 529–534). doi:10.1109/CVPRW.2013.84.

Ravishankar, H., Venkataramani, R., Thiruvenkadam, S., Sudhakar, P., & Vaidya, V. (2017). Learning and incorporating shape models for semantic segmentation. In MICCAI.

Redmon, J., Divvala, S. K., Girshick, R. B., & Farhadi, A. (2015). You only look once: Unified, real-time object detection. CoRR, abs/1506.02640.

Ren, X., Du, S., & Zheng, Y. (2017). Parallel RCNN: A deep learning method for people detection using RGB-D images. In 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) (pp. 1–6). doi:10.1109/CISP-BMEI.2017.8302069.

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI) (pp. 234–241). Springer, volume 9351 of LNCS. URL: http://lmb.informatik.uni-freiburg.de/Publications/2015/RFB15a (available on arXiv:1505.04597 [cs.CV]).

Shin, H.-C., Roth, H., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., & Summers, R. (2016). Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging, 35, 1285–1298. doi:10.1109/TMI.2016.2528162.

Stahlschmidt, C., Gavriilidis, A., Velten, J., & Kummert, A. (2013). People detection and tracking from a top-view position using a time-of-flight camera. In A. Dziech, & A. Czyżewski (Eds.), Multimedia Communications, Services and Security (pp. 213–223). Springer Berlin Heidelberg, volume 368 of Communications in Computer and Information Science. doi:10.1007/978-3-642-38559-9_19.

Stahlschmidt, C., Gavriilidis, A., Velten, J., & Kummert, A. (2014). Applications for a people detection and tracking algorithm using a time-of-flight camera. Multimedia Tools and Applications, (pp. 1–18). doi:10.1007/s11042-014-2260-3.

Tian, Y., Luo, P., Wang, X., & Tang, X. (2015). Pedestrian detection aided by deep learning semantic tasks. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 5079–5087). doi:10.1109/CVPR.2015.7299143.

Vera, P., Monjaraz, S., & Salas, J. (2016). Counting pedestrians with a zenithal arrangement of depth cameras. Machine Vision and Applications, 27, 303–315. URL: https://doi.org/10.1007/s00138-015-0739-1. doi:10.1007/s00138-015-0739-1.

Villamizar, M., Martínez-González, A., Canévet, O., & Odobez, J. (2018). WatchNet: Efficient and depth-based network for people detection in video surveillance systems. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (pp. 1–6). doi:10.1109/AVSS.2018.8639165.

Wang, C., & Zhao, Y. (2017). Multi-layer proposal network for people counting in crowded scene. In 2017 10th International Conference on Intelligent Computation Technology and Automation (ICICTA) (pp. 148–151). doi:10.1109/ICICTA.2017.40.

Zhang, X., Yan, J., Feng, S., Lei, Z., Yi, D., & Li, S. (2012). Water filling: Unsupervised people counting via vertical Kinect sensor. In Advanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International Conference on (pp. 215–220). doi:10.1109/AVSS.2012.82.

Zhao, J., Zhang, G., Tian, L., & Chen, Y. Q. (2017). Real-time human detection with depth camera via a physical radius-depth detector and a CNN descriptor. In 2017 IEEE International Conference on Multimedia and Expo (ICME) (pp. 1536–1541). doi:10.1109/ICME.2017.8019323.

Zhou, K., Paiement, A., & Mirmehdi, M. (2017). Detecting humans in RGB-D data with CNNs. In 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA) (pp. 306–309). doi:10.23919/MVA.2017.7986862.

Zhu, L., & Wong, K.-H. (2013). Human tracking and counting using the Kinect range sensor based on AdaBoost and Kalman filter. In International Symposium on Visual Computing (pp. 582–591). Springer. doi:10.1007/978-3-642-41939-3_57.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Author Statement for manuscript: "DPDnet: A Robust People Detector using Deep Learning with an Overhead Depth Camera"

David Fuentes-Jimenez, Roberto Martin-Lopez, Cristina Losada-Gutierrez, David Casillas-Perez, Javier Macias-Guarasa, Carlos A. Luna, Daniel Pizarro

February 25, 2019

Following the request in the submission system, we include below the author statement information outlining all authors' individual contributions, using the relevant CRediT roles:

• Conceptualization and design of the algorithmic strategy: David Fuentes-Jimenez.

• Methodology: David Fuentes-Jimenez, David Casillas-Perez, Cristina Losada-Gutierrez and Daniel Pizarro defined the algorithmic and experimental methodology.

• Data curation efforts, including database recording and labeling, and the supervision, execution and validation of all the experiments: David Fuentes-Jimenez, Roberto Martin-Lopez, Javier Macias-Guarasa, and Cristina Losada-Gutierrez.

• Software: David Fuentes-Jimenez, David Casillas-Perez, Roberto Martin-Lopez, Cristina Losada-Gutierrez and Javier Macias-Guarasa generated the implementation of the algorithmic proposals and the labeling and scoring software.

• Writing, review and editing: All the authors participated in the manuscript generation and revision.