DPDnet: A Robust People Detector using Deep Learning with an Overhead Depth Camera
Journal Pre-proof
DPDnet: A Robust People Detector using Deep Learning with an Overhead Depth Camera David Fuentes-Jimenez, Roberto Martin-Lopez, Cristina Losada-Gutierrez, David Casillas-Perez, Javier Macias-Guarasa, Carlos A. Luna, Daniel Pizarro PII: DOI: Reference:
S0957-4174(19)30885-1 https://doi.org/10.1016/j.eswa.2019.113168 ESWA 113168
To appear in:
Expert Systems With Applications
Received date: Revised date: Accepted date:
25 February 2019 28 October 2019 27 December 2019
Please cite this article as: David Fuentes-Jimenez, Roberto Martin-Lopez, Cristina Losada-Gutierrez, David Casillas-Perez, Javier Macias-Guarasa, Carlos A. Luna, Daniel Pizarro, DPDnet: A Robust People Detector using Deep Learning with an Overhead Depth Camera, Expert Systems With Applications (2019), doi: https://doi.org/10.1016/j.eswa.2019.113168
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Elsevier Ltd. All rights reserved.
Highlights • Robust system to detect people only using depth information from a ToF camera • System outperforms state-of-the-art methods in different datasets without fine-tuning • Proposal runs in real time using conventional GPUs • Computational demands are independent of the number of people in the scene • Generated database is available to the research community
1
DPDnet: A Robust People Detector using Deep Learning with an Overhead Depth Camera David Fuentes-Jimeneza,b , Roberto Martin-Lopeza , Cristina Losada-Gutierreza , David Casillas-Pereza , Javier Macias-Guarasaa , Carlos A. Lunaa , Daniel Pizarroa a Department
of Electronics. University of Alcal´ a, Ctra. Madrid-Barcelona, km. 33600, 28805. Alcal´ a de Henares. SPAIN E-mails:
[email protected] (D. Fuentes-Jimenez),
[email protected] (R. Martin-Lopez),
[email protected] (C. Losada-Gutierrez),
[email protected] (D. Casillas-Perez),
[email protected] (J. Macias-Guarasa),
[email protected] (C.A. Luna),
[email protected] (D. Pizarro). b Corresponding author. Tel: +34 637151349, fax: +34 918856591.
Abstract This paper proposes a deep learning-based method to detect multiple people from a single overhead depth image with high precision. Our neural network, called DPDnet, is composed by two fully-convolutional encoder-decoder blocks built with residual layers. The main block takes a depth image as input and generates a pixel-wise confidence map, where each detected person in the image is represented by a Gaussian-like distribution, The refinement block combines the depth image and the output from the main block, to refine the confidence map. Both blocks are simultaneously trained end-to-end using depth images and ground truth head position labels. The paper provides a rigorous experimental comparison with some of the best methods of the state-of-the-art, being exhaustively evaluated in different publicly available datasets. DPDnet proves to outperform all the evaluated methods with statistically significant differences, and with accuracies that exceed 99%. The system was trained on one of the datasets (generated by the authors and available to the scientific community) and evaluated in the others without retraining, proving also to achieve high accuracy with varying datasets and experimental conditions. Additionally, we made a comparison of our proposal with other CNN-based alternatives that have been very recently proposed in the literature, obtaining again very high
Preprint submitted to Journal of Expert Systems with Applications
December 28, 2019
performance. Finally, the computational complexity of our proposal is shown to be independent of the number of users in the scene and runs in real time using conventional GPUs. Keywords: People detection, Depth camera information, Interest regions estimation, Overhead Depth Camera, Feature extraction
1. Introduction People detection and tracking have recently attracted growing interest from the scientific community because of their applications in multiple areas, such as video-surveillance, access control or human behavior analysis. These applica5
tions require systems able to work with different experimental conditions and non-invasive systems (i.e. without adding turnstiles or other contact systems), such as video cameras. Detecting people in images is a classic Computer Vision problem, thoroughly studied over the last two decades. In recent years, improvements in
10
technology and the availability of large annotated image datasets, such as Imagenet (Krizhevsky et al., 2012), have allowed Deep Neural Networks (DNNs) to become the state-of-the-art solutions for tasks such as segmentation (Cao et al., 2017), classification (Shin et al., 2016) and object detection (Redmon et al., 2015) in RGB images. Numerous DNN-based people detection works
15
have recently appeared (Ouyang & Wang, 2013; Wang & Zhao, 2017; Zhao et al., 2017), obtaining very accurate results in controlled conditions. However, these methods suffer in scenarios with a high amount of clutter. To reduce the effect of occlusions and to improve detection accuracy in densely populated environments, some works employ depth cameras in over-
20
head configurations (Bevilacqua et al., 2006; Stahlschmidt et al., 2013; Jia & Radke, 2014; Zhang et al., 2012; Gal˘ a´ık & Gargal´ık, 2013; Del Pizzo et al., 2016; Vera et al., 2016). An additional quality of these methods is that they also preserve privacy, preventing anyone from using the system to find out the identity of each person in the scene, which is a requirement in some applica-
2
25
tions Chan et al. (2008). Overhead depth images are easier to process than RGB images. This has opened up room for detection methods based on finding local geometric structures of the depth image (Bevilacqua et al., 2006; Zhang et al., 2012; Stahlschmidt et al., 2013, 2014), which are combined with classification stages (Gal˘ a´ık & Gargal´ık, 2013; Vera et al., 2016; Zhu & Wong, 2013; Luna
30
et al., 2017) to reduce false detections. The detection rates of these methods is however still limited in crowded scenes. Unlike the aforementioned methods, DNNs have the ability of analyzing the whole image context to detect people in heavy cluttered images and to discard false positives due to the presence of other objects in the scene. The main
35
challenge of DNN-based methods is how to effectively train the network so that it can cope with unseen environments and different depth sensors. Two people detection methods based on DNNs have been recently proposed (Liciotti et al., 2018a; Villamizar et al., 2018) using overhead depth images. However they have been tested in limited scenarios, either with a very small dataset (Liciotti
40
et al., 2018a) or a simple scene with only two people interacting in a small room (Villamizar et al., 2018). Using DNNs to effectively solve this problem in general scenarios remains an open question. In this paper, we propose a method based on DNNs, hereinafter referred to as DPDnet, for detecting multiple people from a single depth image acquired with
45
an overhead depth camera (see Figure 1). Our neural network is trained endto-end using depth images and likelihood maps generated from the annotated position of each person in the scene. This paper brings the following contributions: 1) DPDnet achieves top performance when compared with all the evaluated state-of-the-art methods, which
50
constitutes the main contribution of this paper, exceeding 99% accuracy in all the developed experiments for a wide range of experimental conditions.2) Our approach provides a dense multimodal confidence map based on Gaussian mixture distribution with the same resolution as the input depth image. This allows the neural network to analyze the entire image context and to provide in a single
55
output all detected people. 3) Our network detects people in real time, regard3
(a) Camera location above sensed area.
(b) Example of recorded frame in sensed area.
Figure 1: Scheme of camera location above sensed area and example of recorded frame
less of the number of users in the scene. 4) Our proposal reaches a high degree of robustness (referring to its relative invariance to changes in the experimental conditions): The depth images from the evaluated datasets have been recorded from cameras with different resolutions, technology (structured light or Time60
of-Flight); located in a height range varying between 2.8 and 3.53 meters, and with different background content. Our system performance is very high even when there are mismatched conditions between training and testing data. To the best of our knowledge, there are no other works in the literature able to detect people in top-view depth images with comparable accuracy, and evaluated
65
on this variety of experimental datasets. We structure the paper as follows: section 2 details the state-of-the-art in people detection methods, section 3 describes our proposal based on CNNs, section 4 includes the experimental setup, results and discussion, and section 5 describes the main conclusions and future work discussion.
70
2. Previous Works The first proposed works in non-invasive people detection were based on the use of RGB cameras. To name a few, Ramanan et al. (2006) proposed a system based on learning person appearance models, whereas Andriluka et al. (2008) used hierarchical Gaussian Process Latent Variable Models (hGPLVM)
75
for modeling people. Other approaches were based on face detection (Chen
4
et al., 2010) or interest point classification (Jeong et al., 2013). DNN-based people detection methods have been recently proposed. In particular, Ouyang & Wang (2013) proposed a new DNN architecture that jointly performs feature extraction, deformation handling, occlusion handling, and clas80
sification.
Wang & Zhao (2017) proposed a Convolutional Neural Network
(CNN) as a feature extractor, followed by a proposal selection network. In (Tian et al., 2015), they combined a novel DNN model that jointly optimizes pedestrian detection and semantic tasks, including both pedestrian and scene attributes. Finally, Zhao et al. (2017) used a novel physical radius-depth (PRD) 85
to detect human candidates, followed by a CNN for human feature extraction. These proposals are more accurate than classic methods but they still suffer in highly cluttered scenes. To reduce the effect of occlusions and to improve detection accuracy, some works propose the fusion of RGB and depth data (Liu et al., 2015; Munaro
90
et al., 2016; Zhou et al., 2017; Ren et al., 2017; Cao et al., 2017), whereas other works place cameras in overhead configurations (Antic et al., 2009; Cai et al., 2014; Dan et al., 2012; Del Pizzo et al., 2016). Other works employed only depth sensors and overhead cameras for people detection, either based on Time-of-Flight (ToF) (Bevilacqua et al., 2006; Stahlschmidt et al., 2013, 2014;
95
Jia & Radke, 2014) or structured light (Zhang et al., 2012; Gal˘ a´ık & Gargal´ık, 2013; Rauter, 2013; Zhu & Wong, 2013; Del Pizzo et al., 2016; Vera et al., 2016) technologies. Some of these methods are based on finding distinctive maxima of the depth image (Bevilacqua et al., 2006; Zhang et al., 2012), or a filtered depth image
100
using the normalized Mexican Hat Wavelet filter (Stahlschmidt et al., 2013, 2014). However, they usually fail to separate clusters of people that are close to each other in the scene, or produce false detections when body parts, other than the head, are close to the camera. Moreover, the proposals by Zhang et al. (2012), and Stahlschmidt et al. (2013, 2014) do not include a classification stage,
105
so that they are not able to discriminate people from other objects in the scene, so their performance drops if there are other elements such as chairs or clothes 5
racks. In order to reduce false positives, some works include a classification stage that allows discriminating between people and other elements in the scene. In particular, Gal˘ a´ık & Gargal´ık (2013); Vera et al. (2016); Zhu & Wong (2013); 110
Luna et al. (2017) used shape descriptors that encode the morphology of the human head and shoulders, as seen in the depth image. These proposals are able to discriminate between people and other elements in the scene, but in some of them, their detection rates significantly decrease if people are close to each other.
115
Regarding DNN-based methods using only top-view depth data, the authors of (Villamizar et al., 2018) recently presented an approach for detecting people and identifying people attacks and intrusion in video surveillance scenarios, called WatchNet. The proposal is tested using the UNICITY dataset (Dumoulin et al., 2018) (and available at (Dumoulin et al., 2019)), recorded using 6 color
120
and depth cameras, located at different positions inside an airlock constructed to simulate a building access room. The recording conditions of the dataset are very restrictive, since the area of interest is very reduced, and it only includes one or two people in each scene. Thus, this proposal is not adapted to crowded scenarios, and there is no attempt to deal with elements different from people
125
in the scene. Liciotti et al. (2018a) also addressed the problem of detecting people from top-view depth data in crowded environments, using a CNN based strategy. Their proposal is based on the novel U-Net3 architecture, that includes several convolutional layers, followed by ReLU activations and max pooling operations.
130
To train and test their proposal, they use the TVHeads dataset (available at (Liciotti et al., 2018b)), limited to 1815 depth images. They include a comparison with other deep learning architectures, obtaining high accuracy with most of them, but they do not evaluate their proposal on other more complex and bigger datasets. Moreover, the authors do not specify how they partition the
135
TVHeads data for training, validation and testing, so that it is difficult to assess the generalization capabilities of their proposal, due to the very reduced size of their database. 6
3. DPDnet People Detector 3.1. Problem Formulation 140
We propose the DPDnet neural network to detect multiple people from a single depth image. The output consists of a multimodal confidence map, with the same size as that of the input image, that jointly codifies the detection of multiple people in the scene (see Figure 2). The term multimodal here refers to the statistical concept of multimodal distributions, which are probability distri-
145
butions with multiple “modes”. Input
Output
Convolutional Neural Network
Depth map
Confidence map
Figure 2: Depth input image and output multimodal confidence map.
. The multimodal confidence map assigns a 2D Gaussian distribution centered in the centroid of the head of each detected person, and its maximum value has been normalized to one. The standard deviation (eq. 1) is set to be constant, independently of the size of the head, and the head-camera distance. This 150
value has been calculated based on the geometrical conditions of the GOTPD1+ recordings (see Section 4.1.1), and also taking into account anthropometric considerations. In GOTPD1+, the average diameter of the head is D = 15 pixels, and we use an uncertainty factor f s = 0.8 to account for possible variations in the head sizes, so that the selected σ value is:
σ=
D fs = 6 2
7
(1)
155
It is important to note that the chosen value for the Gaussian standard deviation does not have a relevant impact on the system performance, as it is just modeling the uncertainty in the position of the head centroid. In addition to this, the use of Gaussian distributions favor the training convergence, given its smoothness. Finally, the 2D position of each person can be easily obtained
160
from the confidence map by detecting its local maxima. The advantages of this strategy are twofold: first, learning the confidence map is a well defined process to detect multiple hypothesis, as opposed to using other parametrizations (e.g multiple 2D coordinate vectors) which can lead to ambiguities in the regression function. Second, it makes computational com-
165
plexity to be independent of the number of users in the scene, considering that the time used for extracting multiple detection hypotheses from the confidence map is mostly negligible. 3.2. DPDnet Network Architecture Figure 3 illustrates the DPDnet network architecture to detect people from
170
a single depth image. Our DPDnet neural network consists of two blocks: the main block and the refinement block. Both blocks use an encoder-decoder architecture inspired by the SegNet model (Badrinarayanan et al., 2015), originally proposed in semantic segmentation, and use the residual layers proposed in the ResNet model (He
175
et al., 2016). The main block takes a single depth image as input, and outputs a confidence map. The inputs to the refinement block are the input depth image, along with the confidence map obtained from the main block, and generates a refined confidence map in its output. Similar refinement blocks have been proposed for
180
other tasks, such as in monocular reconstruction methods (Eigen et al., 2014), with very good results. The structure of the main block is detailed in Table 3. It takes 212 × 256 depth images as input, and uses separable convolutional layers based on the Xception model (Chollet, 2016). Separable convolutions are faster than con8
Main Block Output
Input
Depth Map
Confidence Map
Encoder
Decoder
Refinement Output
Confidence Map
Upsampling
Encoding Convolutional Block
Batch Normalization
Decoding Convolutional Block
Separable Convolutional Layer
Activation Relu
Activation Sigmoid
MaxPooling
Figure 3: The DPDnet architecture divided into the main block and the refinement block. Figure shows the general encoder-decoder structure in both blocks.
. 185
ventional convolutions, while achieving similar performance. Regarding the activation functions, we use ReLU activations Agarap (2018) because of their properties, and given that Krizhevsky et al. (2012) proved that, in terms of training time, the saturating non-linearities of tanh or sigmoidal are much slower than the non-saturating ones, like ReLU. Moreover, the saturating non-
190
linearities can derive into gradient vanishing because of their properties. Finally, we use Batch normalization layers as Ioffe & Szegedy (2015) showed that they introduce a regularization component that improves other alternatives such as Dropout, while allowing CNNs to accelerate the training speed and convergence to its optimum performance, in a lower number of epochs.
195
To build the encoder-decoder core of the main block, we use several units of residual blocks, similar to those proposed in the ResNet (He et al., 2016) model.
9
The encoding blocks downsample the spatial dimensions by a factor of two, and the decoding blocks use upscaling layers to increase the spatial dimensions by the same factor. Their basic architecture is described in Figure 4 and Tables 1 200
and 2. Encoding Convolutional Block
Decoding Convolutional Block
Separable Convolutional Layer Batch Normalization
Activation Relu Upsampling
Figure 4: DPDnet decoding and encoding block.
Finally, the last block of the CNN and the cropping layers, adapt the output of the last decoding block to the output size of the confidence map (212 × 256), which is the same size as the input image. Table 3 summarizes the information of all the layers that appear in the main block. 205
The refinement stage is a reduced version of the main block with two encoding and decoding blocks, being thus shallower than the main block. The refinement block takes as inputs the input depth image and the confidence map computed from the main block, and generates a new “refined” confidence map. Table 4 summarizes the information of all the layers and blocks used in the
210
refinement stage. In addition to the DPDnet architecture described above, we have adapted the same model to work with images half the resolution of the original ones (106 × 128), seamlessly producing downscaled confidence maps. We will refer to this reduced model as DPDnetf ast in section 4. The reduction of input and
215
output size allows low cost GPUs to be able to run our model with a very small impact in terms of accuracy. 10
Table 1: Encoding Block
Encoding Block (filters=(a,b,c)),kernel size=k Layer
Parameters
Output Size
Input
-
(Width, Height, Depth)
kernel=(1, 1) Convolution (Main Branch)
strides=(2, 2)
(Width/2, Height/2, a)
filters=a Batch Normalization (Main Branch)
-
Activation (Main Branch)
ReLU kernel=(k, k)
Convolution (Main Branch)
strides=(1, 1)
(Width/2, Height/2, b)
filters=b Batch Normalization (Main Branch)
-
Activation (Main Branch)
ReLU kernel=(1, 1)
Convolution (Main Branch)
strides=(1, 1)
(Width/2, Height/2, c)
filters=c Batch Normalization (Main Branch)
kernel=(1, 1)
Convolution (Shortcut)
strides=(2, 2)
(Width/2, Height/2, c)
filters=c Batch Normalization (Shortcut)
-
Add (Main Branch+Shortcut)
-
Activation (Main Branch+Shortcut)
(Width/2, Height/2, a) ReLU
11
Table 2: Decoding Block
Decoding Block (filters=(a,b,c)),kernel size=k Layer
Parameters
Output Size
Input
-
(Width, Height, Depth)
size=(2,2)
(2*Width,2*Height,Depth)
Upsampling (Main Branch)
kernel=(1, 1) Convolution (Main Branch)
strides=(1, 1)
(Width/2, Height/2, a)
filters=a Batch Normalization (Main Branch)
-
Activation (Main Branch)
ReLU kernel=(k, k)
Convolution (Main Branch)
strides=(1, 1)
(Width/2, Height/2, b)
filters=b Batch Normalization (Main Branch)
-
Activation (Main Branch)
ReLU kernel=(1, 1)
Convolution (Main Branch)
strides=(1, 1)
(Width/2, Height/2, c)
filters=c Batch Normalization (Main Branch)
-
Upsampling (Shortcut)
size=(2,2)
(2*Width,2*Height,Depth)
kernel=(1, 1) Convolution (Shortcut)
strides=(1, 1)
(Width/2, Height/2, c)
filters=c Batch Normalization (Shortcut)
-
Add (Main Branch+Shortcut)
-
Activation (Main Branch+Shortcut)
(Width/2, Height/2, a) ReLU
12
Table 3: Main Stage. In Encoding and Decoding Blocks, a, b and c represents the number of filters of the intermediate convolutional layers in the branches.
Main Block Layer
Output Size
Parameters
Input
212x256x1
-
Convolution
106x128x64
kernel=(7, 7) / strides=(2, 2)
Batch Normalization
-
Activation
ReLU
Max Pooling
35x42x64
Encoding Conv. Block
35x42x256
size=(3, 3) kernel=(3, 3) / strides=(1, 1) (a=64, b=64, c=256)
Encoding Conv. Block
18x21x512
kernel=(3, 3) / strides=(2, 2) (a=128, b=128, c=512)
Encoding Conv. Block
9x11x1024
kernel=(3, 3) / strides=(2, 2) (a=256, b=256, c=1024)
Decoding Separable Conv. Block
9x11x256
kernel=(3, 3) / strides=(1, 1) (a=1024, b=1024, c=256)
Decoding Separable Conv. Block
18x22x128
kernel=(3, 3) / strides=(2, 2) (a=512, b=512, c=128)
Decoding Separable Conv. Block
36x44x64
kernel=(3, 3) / strides=(2, 2) (a=256, b=256, c=64)
Cropping
36x43x64
cropping=[(2, 2) (1, 1)]
Up Sampling
108x129x64
size=(3, 3)
Convolution
216x258x64
kernel=(7, 7) / strides=(2, 2)
Cropping
212x256x64
cropping=[(2, 2) (1, 1)]
Batch Normalization
-
Activation Convolution
ReLU 212x256x1
Activation Output 1
kernel=(3, 3) / strides=(1, 1) Sigmoid
212x256x1 13
-
Table 4: Refinement Stage
Refinement Block Layer
Output Size
Parameters
Output 1 + Input
212x256x1
-
Convolution
106x128x64
kernel=(7, 7) / strides=(2, 2)
Batch Normalization
-
Activation
ReLU
Max Pooling
35x42x64
Encoding Conv. Block
35x42x256
size=(3, 3) kernel=(3, 3) / strides=(1, 1) (a=64, b=64, c=256)
Encoding Conv. Block
18x21x512
kernel=(3, 3) / strides=(2, 2) (a=128, b=128, c=512)
Decoding Conv. Block
36x42x128
kernel=(3, 3) / strides=(2, 2) (a=512, b=512, c=128)
Decoding Conv. Block
72x84x64
kernel=(3, 3) / strides=(2, 2) (a=256, b=256, c=64)
Zero Padding
72x86x64
padding=(0, 1)
Up Sampling
216x258x64
size=(3, 3)
Cropping
212x256x64
cropping=[(2, 2) (1, 1)]
Convolution
212x256x1
kernel=(3, 3) / strides=(1, 1)
Activation Output 2
Sigmoid 212x256x1
14
-
3.3. Training procedure We denote by pi the input depth image and by qi the target confident map, where pi , qi ∈ R212×256 . We assume that both pi and qi are normalized in the 220
interval [0, 1]. The target confidence map qi is generated by a sum of twodimensional Gaussian distributions with covariance matrix P = σ 2 I2×2 and centered around the labeled position of each person in the scene. We use σ = 6 pixels in all our experiments. The main and refinement blocks of the DPDnet architecture are defined with
225
the functions Tm (pi , θm ) and Tr (pi , qi , θr ) respectively. All trainable parameters of the main block are defined with vector θm , and similarly with θr for the refinement block. When referring to all trainable parameters of the entire DPDnet model we use their concatenation θ = (θm , θr ). The complete DPDnet function T (pi , θ) is obtained by composition of Tm and Tr : T (pi , θ) = Tr (pi , Tm (pi , θm ), θr )
(2)
We train our network using a set of corresponding depth images and confi230
dence maps {(pi , qi )}i=1,...,N , with the following loss function being used: L(θ) =
N N 1 X 1 X kT (pi , θ) − qi k2 + λ kTm (pi , θm ) − qi k2 , N i=1 N i=1
(3)
where λ is a hyper-parameter that weights the importance of the main and refinement terms in the loss function. The optimal value of λ was empirically found on validation data, and was set to λ = 1 by evaluating its effect on the network performance metrics, being this λ = 1 the best compromise value 235
among them. Note that the second term of equation (3) imposes the main block to be as close as possible to the target confidence map. This forces the refinement block to improve the output of the main block. By minimizing equation (3), we train the entire DPDnet model end-to-end in a single step without separating the main and refinement blocks.
240
In our experiments, we train the network for 50 epochs and use a validation set to select the best model. We use the Adam optimizer (Kingma & Ba, 2014) 15
with 0.001 as initial learning rate. This optimizer has been chosen because of its adaptive capabilities in terms of learning rate. The great advantage of Adam is the fact that it starts from an initial learning rate set by the user and then 245
uses the first and second moments of the gradient to adapt the learning rate to the evolution of the loss function. This increases the convergence speed as compared to other optimizers, making Adam a good choice for the proposed problem.
(a) Sample frames from the GOTPD1+ dataset.
(b) Sample frames from the MIVIA dataset.
Figure 5: Sample frames from the GOTPD1+ and MIVIA databases (gray coded depth).
16
4. Experimental Work 250
4.1. Datasets In order to provide a wide range of evaluation conditions, we have used four different databases, that will be described next. Figures 5 and 6 provide some sample frames to give an idea of the style and quality of the different datasets. 4.1.1. GOTPD1 database
255
In this work, we have used part of the GOTPD1 database (available at (MaciasGuarasa et al., 2016a) and (Macias-Guarasa et al., 2016b)), that was recorded R with a Kinect v2 device located at a height of 3.4m, generating depth images
with a resolution of 512 × 424. Depth values are provided in millimeters as 16 bits unsigned integers. The recordings covered a broad variety of conditions, 260
with scenarios comprising single and multiple people, single and multiple objects (such as chairs), people with and without accessories (hats, caps), people with different complexity, height, hair color, and hair configurations, people actively moving and performing additional actions (such as using their mobile phones, moving their fists up and down, moving their arms, etc.).
265
The GOTPD1 data was originally split in two subsets, one for training and the other for testing, with the configuration described in (Luna et al., 2017). However, for this work, and given the strict training data requirements of deep learning approaches, we will be using the full GOTPD1 dataset for training purposes, and we recorded 23 additional sequences to allow for a proper evaluation
270
effort. With this strategy, the training and testing subsets are fully independent, as people and conditions fully differ between training and testing recordings. Table 5 and Table 6 show the details of the training and testing subsets, respectively. #Frames refers to the number of frames in the recorded sequences, while #Positives refers to the number of all the heads over all the frames (in
275
our recordings we used 39 different people). The database contains sequences in which the users were instructed on how to move under the camera (to allow for proper coverage of the recording area), and sequences where people moved
17
Table 5: GOTPD1+ Training subset details.
#Sequences
#F rames
#P ositives
Description
26853
14381
Single people sequences
7138
2416
Two people sequences
6526
18106
Multiple people sequences
2323
1458
S0001 through S0022, S0102,S0110,S0123, S0130,S0136,S0142 S0023 through S0026 S0027 through S0029, S0031 through S0034, S0152, S0153 S0035,
Sequences with chairs and
S0038 through S0040
people balancing fists facing up
Totals
42840
36361
freely (to allow for a more natural behavior)1 . This dataset (composed by the original GOTPD1 and the additional 23 280
recorded sequences), will be referred to as GOTPD1+ in the tables included in Section 4.4. The first row of Figure 5 shows two sample frames from this dataset. 1 This
is fully detailed in the documentation distributed with the database at (Macias-
Guarasa et al., 2016a) and (Macias-Guarasa et al., 2016b).
Table 6: GOTPD1+ testing subset details.
Sequence IDs
#Samples
#P ositives
Description
2742
2391
Single people sequences
1888
2865
Two people sequences
4516
14837
Multiple people sequences
S0036 through S0037
1789
0
Sequences with chairs
Totals
11375
21403
S041, S0101, S0131 through S0135, S143 S0144 through S0146 S0030, S0147 through S0151, S0154 through 157
18
(a) Sample frames from the Zhang2012 dataset.
(b) Sample frames from the TVHEADS dataset.
Figure 6: Sample frames from the Zhang2012 and TVHEADS databases (gray coded depth).
4.1.2. MIVIA People Counting Dataset The MIVIA people counting dataset (available at (Lab, 2016)) has been developed by the MIVIA Research Lab at the University of Palermo, and it 285
was first described and used in (Del Pizzo et al., 2016). It is mainly aimed at evaluating people flow density in the sensed environment, but we will use it for people counting and tracking purposes. R The MIVIA dataset includes RGB and depth data captured with a Kinect
v1 device located in an overhead position at a height of 3.2m. Depth values are 290
provided as 8 bits unsigned integers (the actual depth resolution is 11 bits, but they just kept the 8 most significant ones, guaranteeing a depth resolution of roughly 1.25 cm). The recordings have been captured in indoor and outdoor conditions, and for isolated and group transits. From the 34 available sequences,
19
Table 7: MIVIA testing subset details.
Sequence IDs
#F rames
#P ositives
Description
DIS1 DIS2
7692
3601
Single people sequences
DIS3 DIS4 DIS5
10259
5315
Two people sequences
DIG1 DIG2 DIG3
12222
8780
Multiple people sequences
Totals
30173
17696
we will be using 8 of them, all that correspond to indoor conditions for both 295
the single and multiple person transits. The recordings include people walking with bags and other big objects, and their details are shown in Figure 7. In our experiments we will use the depth streams with a resolution of 640 × 480, so that it will be necessary to resize the depth streams used in this database, to those that DPDnet can handle.
300
This dataset will be referred to as MIVIA in the tables included in Section 4.4. The bottom row of Figure 5 shows two sample frames from this dataset. 4.1.3. Zhang 2012 database The Zhang 2012 database (Zhang et al., 2012) consists of two sequences R v1 device also located in an (dataset1 and dataset2) recorded with a Kinect
305
overhead position at an estimated height of 3.53m (the authors do not explicitly state the camera position in their paper, so that we estimated it from the available images), generating depth data with a resolution of 320 × 240 pixels. Depth values are provided in millimeters as 16 bits unsigned integers. Both sequences include multiple people moving under the camera, and their details
310
are shown in Table 8. The recordings include groups of people walking closely to each other, people walking freely, and people with bags and other big objects. The data, kindly provided to us by its authors, was processed with an adhoc background subtraction strategy (described in (Zhang et al., 2012)), that introduced some artifacts in the depth images. Due to that, we do not have
315
access to the original depth streams. The provided groundtruth information 20
Table 8: Zhang2012 testing subset details.
Sequence IDs
#P ositives
#P ositives
Description
dataset1
2384
4527
Multiple people sequences
dataset2
1500
1553
Multiple people sequences
Totals
3884
6080
consists of the same dataset images with red masks covering the location of the heads in the scene. This dataset will be referred to as Zhang2012 in the tables included in section 4.4. The first row of Figure 6 shows two sample frames from this dataset. 320
4.1.4. TVHeads database The TVHeads 2018 database (Liciotti et al., 2018b) consists of 1815 depth images recorded with a RGB-D device (the authors do not specify which is the camera model they are using) also located in an overhead position at heights between 2.82m and 3.01m, depending on the recorded sequence, and generating
325
depth data with a resolution of 320 × 240 pixels. All images includes multiple people under the camera (but no other moving objects), and their details are shown in Table 9. The recordings include groups of people walking closely to each other, squatting or standing and talking in different scenes. The data provided by its authors, includes 8 and 16 bits depth images and
330
a groundtruth that consist of images with black background and white masks covering the location of the heads in the scene. A relevant difference with the other datasets is that, in this case, the ground truth information also includes labeled people whose heads are partially out of the recorded image, as shown in Figure 7. These cases happen frequently in the recorded sequences, and this
335
will have a significant impact on the performance of our proposal as will be discussed in Section 4.4.5. This dataset will be referred to as TVHEADS in the tables included in section 4.4. The bottom row of Figure 6 shows two sample frames from this dataset. 21
Table 9: TVHeads testing subset details.
Sequence IDs
#F rames
#P ositives
Description
Frames
1815
4413
Multiple people sequences
Totals
1815
4413
Figure 7: Sample ground truth frames from the TVHEADS databases with labelled people in the image boundaries.
4.2. Experimental setup 340
In the experiments carried out, we have used the four different datasets described in section 4.1. We compare the performance of our proposals DPDnet and DPDnetf ast with other strategies described in the literature that use an overhead ToF camera in a people detection/tracking/counting task. The following state-of-the-art methods are used to carry out a fully exhaustive comparison:
345
• WaterFilling (Zhang et al., 2012): this method is based on the water filling algorithm and the original source code was kindly provided to us by its authors. To run this method in the GOTPD1+ dataset we scaled the images to 640×480 pixels, and it does not require to be previously trained. • SLICEP CA (Luna et al., 2017): this method uses feature vectors extracted
350
from the depth image and a generative PCA model of these vectors to classify detections. In our experiments, the PCA model is trained using the GOTPD1 dataset, with two target categories: people and non-people. • SLICESV M (Fernandez-Rincon et al., 2017): this method is similar to SLICEP CA , but uses a SVM classifier instead of PCA. In our experiments,
355
the SVM model is also trained using the full GOTPD1 dataset. 22
• MexicanHat (Stahlschmidt et al., 2013, 2014): this method is based on the normalized Mexican Hat Wavelet and requires the detection and removal of the floor plane. It does not require training data though. We have also carried out a comparison with CNN-based methods, but in this 360
case it will be done on the TVHEADS dataset, as Liciotti et al. (2018a) provided a broad performance comparison among six CNN-based strategies using this dataset. It is important to note that in the comparison, we did not use a tracking module for any of the methods. Our objective is to provide a fair comparison on
365
their discrimination capabilities without further improvements. The only input frame manipulation we did was that required to scale the frame size to fit in the input stage conditions of the evaluated algorithms. 4.3. Evaluation Metrics To provide a detailed and complete view on the performance on the evaluated
370
algorithms, we will calculate the main standard metrics in a detection problem, namely: P recision, Recall, F 1score , False Negative Rate (F N R), False Positive Rate (F P R), and overall error ERR (calculated as ERR = F N R + F P R) We will also calculate confidence intervals for the F 1score and ERR metrics, for a confidence value of 95%, to assess the statistical significance of the results
375
when comparing different strategies. 4.4. Results and Discussion In this section we will first provide detailed results for the algorithms described in Section 4.2, when applied to each of the datasets described in Section 4.1. An overall comparison will be then included in Section 4.4.4. In the
380
case of the CNN-based methods, the evaluation will be done on the TVHEADS dataset, as stated in Section 4.2, and the discussion will be done in Section 4.4.5. Tables 10, 11, and 12 include the results for the GOTPD1+, MIVIA, and Zhang2012 datasets respectively, including the evaluation metrics described in section 4.3 for all the evaluated algorithms. To visually aid in the comparative 23
385
performance analysis, we have added a green background to those table cells that correspond to the best results across all algorithms for each condition. Table 10: Detailed Results using all the tested algorithms on the GOTPD1+ dataset. All values in % (cells in green background correspond to the best results across all algorithms for each condition). DB
MexicanHat
P recision
Recall
F 1score
FNR
FPR
Single person
99.87
99.34
Two people
99.63
49.72
More than two people
99.60 ± 0.23
0.66
0.16
0.44 ± 0.25
66.34 ± 1.91
50.28
0.41
34.84 ± 1.93
99.86
37.84
54.88 ± 0.87
62.16
0.47
55.92 ± 0.87
Chairs. no people Totals
SLICESV M
94.24
45.63
61.48 ± 0.68
54.37
1.06 ± 0.38
98.23
99.87
99.04 ± 0.36
0.13
2.18
100.00
96.46 ± 0.75
0.00
16.25
5.05 ± 0.89
More than two people*
99.93
99.26
99.59 ± 0.11
0.74
0.63
0.73 ± 0.15
35.33
35.33 ± 2.21
Totals*
96.71
99.40
98.04 ± 0.19
0.60
9.70
2.95 ± 0.24
Single person
98.23
99.87
99.04 ± 0.36
0.13
2.18
1.06 ± 0.38
Two people
94.10
100.00
96.96 ± 0.70
0.00
13.91
4.32 ± 0.82
More than two people
99.93
99.20
99.56 ± 0.12
0.80
0.63
0.79 ± 0.15
99.06
99.36
99.21 ± 0.12
0.64
0.00
0.00 ± 0.00
2.70
1.18 ± 0.15 1.79 ± 0.50
Single person
98.18
98.63
98.40 ± 0.47
1.37
2.31
Two people
99.26
100.00
99.63 ± 0.25
0.00
1.66
0.51 ± 0.29
More than two people
99.90
99.81
99.85 ± 0.07
0.19
0.86
0.26 ± 0.09
22.97
22.97 ± 1.95
Totals
96.90
99.70
98.28 ± 0.18
0.30
9.24
2.59 ± 0.22
Single person
100.00
99.67
99.84 ± 0.15
0.33
0.00
0.18 ± 0.16
Two people
99.88
100.00
99.94 ± 0.10
0.00
0.28
0.09 ± 0.12
More than two people
99.92
99.90
99.91 ± 0.05
0.10
0.71
0.16 ± 0.07
0.39
0.39 ± 0.29
0.11
0.36
0.17 ± 0.06 0.11 ± 0.12
Chairs. no people
Chairs. no people Totals
DPDnet
42.48 ± 0.69
93.17
Totals
DPDnetf ast
21.97 ± 1.92
8.06
Two people*
Chairs. no people
WaterFilling
21.97
Single person*
Chairs. no people*
SLICEP CA
ERR
99.88
99.89
99.88 ± 0.05
Single person
99.93
99.87
99.90 ± 0.12
0.13
0.08
Two people
100.00
99.94
99.97 ± 0.07
0.06
0.00
0.04 ± 0.08
More than two people
100.00
99.71
99.86 ± 0.07
0.29
0.00
0.26 ± 0.09
99.99
99.75
99.87 ± 0.05
0.25
Chairs. no people Totals
24
0.00
0.00 ± 0.00
0.02
0.19 ± 0.06
4.4.1. Results for the GOTPD1+ database Regarding the results for the GOTPD1+ database (Table 10), it can be clearly seen that the DPDnet and DPDnetf ast solutions are the best ones, and this is 390
specially significant for sequences including more than two people. Despite this, SLICEP CA and SLICESV M also obtain top results for the Recall and F N R rates, for the sequences with one and two people. This happens at the cost of significantly increasing the F P R rates, which is the reason why these methods finally had a lower performance in the F 1score and ERR metrics. The WaterFilling
395
algorithm has also top performance for the sequences with two people, but also with higher F P R rates. The MexicanHat algorithm is clearly the worst one, as has already been proved in previous works. Despite the good results of SLICESV M and WaterFilling, their robustness to sequences without people (just chairs, for example) is less than that of
400
DPDnet, SLICEP CA and DPDnetf ast . In the case of SLICESV M , this bad performance could be due to the class modeling carried out, and for the WaterFilling algorithm, it is clearly due to the absence of an explicit person discrimination procedure, which generates a higher number of false positives. The results obtained with the SLICEP CA method in the chair sequences are similar or better
405
than those obtained with our CNN based proposals, but we must bear in mind that the SLICEP CA method requires a calibration process to know the intrinsic and extrinsic parameters of the sensor. It is also worth mentioning that the CNN-based methods are undoubtedly the best when considering the F P R rates, showing a very good discrimination
410
capability. Finally, when comparing DPDnet and DPDnetf ast , a singular effect can be observed: the F N R values for DPDnetf ast are better than those of DPDnet, while the F P R values of DPDnet are better than those of DPDnetf ast . This can be explained by the reduction in the image resolution, since this reduction implies
415
an interpolation that allows to slightly filter the noise and artifacts in the input image. This may lead the fast version to have an easier job in discriminating
25
negatives, while in the case of the standard version, the availability of the full image information allows for a better discrimination of actual people present in the scene. Table 11: Detailed Results using all the tested algorithms on the MIVIA dataset. All values in % (cells in green background correspond to the best results across all algorithms for each condition). DB
MexicanHat
SLICESV M
SLICEP CA
WaterFilling
DPDnetf ast
DPDnet
420
P recision
Recall
F 1score
FNR
FPR
ERR
Single person
93.76
92.66
Two people
88.61
86.64
93.21 ± 0.56
7.34
2.94
4.36 ± 0.46
87.61 ± 0.64
13.36
6.21
More than two people
93.73
8.77 ± 0.55
53.67
68.26 ± 0.76
46.33
2.65
21.21 ± 0.67
Totals
91.81
71.40
80.33 ± 0.43
28.60
3.89
13.27 ± 0.37
Single person*
92.92
93.90
93.41 ± 0.55
6.10
3.39
4.26 ± 0.45
Two people*
87.75
87.97
87.86 ± 0.63
12.03
6.85
8.70 ± 0.55
More than two people*
91.85
94.11
92.96 ± 0.42
5.89
6.13
6.03 ± 0.39
Totals*
90.85
92.22
91.53 ± 0.30
7.78
5.66
6.46 ± 0.27
Single person*
93.30
93.90
93.67 ± 0.55
6.10
3.20
4.13 ± 0.44
Two people*
87.97
87.97
87.97 ± 0.63
12.03
6.71
8.62 ± 0.54
More than two people*
92.91
93.95
93.43 ± 0.40
6.05
5.25
5.59 ± 0.38
Totals*
91.52
92.14
91.83 ± 0.30
7.86
5.20
6.20 ± 0.26
Single person
95.20
98.96
97.04 ± 0.38
1.04
2.41
1.96 ± 0.31
Two people
96.67
99.84
98.23 ± 0.26
0.16
1.96
1.31 ± 0.22
More than two people
91.29
98.53
94.77 ± 0.37
1.47
6.81
4.57 ± 0.34
Totals
93.68
99.02
96.27 ± 0.21
0.98
4.08
2.91 ± 0.18
Single person
99.71
97.41
98.55 ± 0.27
2.59
0.14
0.94 ± 0.22
Two people
99.54
99.09
99.32 ± 0.16
0.91
0.26
0.50 ± 0.14
More than two people
99.24
96.62
97.91 ± 0.23
3.38
0.53
1.72 ± 0.21
Totals
99.43
97.54
98.47 ± 0.13
2.46
0.34
1.14 ± 0.12
Single person
100.00
98.56
99.27 ± 0.19
1.44
0.00
0.47 ± 0.15
Two people
99.81
99.05
99.43 ± 0.15
0.95
0.11
0.41 ± 0.12
More than two people
99.23
98.14
98.68 ± 0.19
1.86
0.54
1.09 ± 0.17
Totals
99.57
98.50
99.03 ± 0.11
1.50
0.26
0.72 ± 0.09
4.4.2. Results for the MIVIA database Regarding the results for the MIVIA database (Table 11), we first have to stress and remind the fact that no training nor adaptation has been done for this new dataset.
26
Again, the best results are found in the CNN-based methods (which use a 425
model trained on a different dataset), and (in specific metrics) in the WaterFilling algorithm (that does not need any training procedure). The SLICEP CA and SLICESV M methods exhibit a relevant performance drop that can be explained by their higher sensitivity to the mismatch in the recording conditions as compared with those of the training material used.
430
The better performance of the WaterFilling algorithm in the Recall and F N R rates is again achieved at the cost of increasing the F P R rates, which finally translates to overall lower F 1score and ERR metrics. As in the case of GOTPD1+, the MexicanHat method is the worst of all in all the evaluated metrics. Table 12: Detailed Results using all the tested algorithms on the Zhang2012 dataset. All values in % (cells in green background correspond to the best results across all algorithms for each condition).
MexicanHat
SLICESV M
SLICEP CA
WaterFilling
DPDnetf ast
DPDnet
DB
P recision
Recall
F 1score
FNR
FPR
Dataset1
40.52
24.51
30.54 ± 1.51
75.49
410.76
102.49 ± −
Dataset2
26.72
21.02
23.53 ± 2.02
78.98
86.32
81.92 ± 1.83
Totals
36.56
23.68
28.74 ± 1.22
76.32
182.85
95.87 ± 0.54
Dataset1*
82.92
85.77
84.32 ± 1.19
14.32
226.82
29.60 ± 1.49
Dataset2*
59.56
59.28
59.42 ± 2.33
40.72
64.38
49.82 ± 2.37
Totals*
77.50
79.44
78.46 ± 1.10
20.56
110.57
36.09 ± 1.29
Dataset1*
94.31
97.81
96.03 ± 0.64
2.19
67.60
7.44 ± 0.86
Dataset2*
94.48
99.30
96.83 ± 0.83
0.70
8.36
3.84 ± 0.92
Totals*
94.35
98.16
96.22 ± 0.52
1.84
25.69
6.28 ± 0.66
Dataset1
95.40
98.76
97.05 ± 0.55
1.24
55.48
5.53 ± 0.75
Dataset2
96.70
98.91
97.79 ± 0.70
1.09
4.96
2.66 ± 0.77
Totals
95.70
98.79
97.22 ± 0.44
1.21
19.71
4.61 ± 0.57
Dataset1
99.57
99.21
99.39 ± 0.25
0.79
4.95
1.12 ± 0.34
Dataset2
99.01
99.11
99.06 ± 0.46
0.89
1.45
1.12 ± 0.50
Totals
99.44
99.19
99.31 ± 0.22
0.81
2.47
1.12 ± 0.28
Dataset1
98.89
99.76
99.32 ± 0.27
0.24
12.94
1.26 ± 0.36
Dataset2
97.10
99.90
98.48 ± 0.58
0.10
4.35
1.83 ± 0.64
Totals
98.46
99.79
99.12 ± 0.25
0.21
6.87
1.44 ± 0.32
27
ERR
435
4.4.3. Results for the Zhang2012 database Regarding the results for the Zhang2012 database (Table 12), we are again in the case of a mismatch between the training and evaluation conditions discussed above. Here, the superiority of the CNN-based methods are even clearer than in the
440
two previous datasets, obtaining very good results considering the experimental conditions. The performance drop of the SLICEP CA and SLICESV M methods is higher than in the MIVIA results, specially in the latter. As described in Section 4.1.3, the images of this dataset contain artifacts from the background subtraction
445
method, and they might affect the classifiers SLICEP CA and SLICESV M , which were trained with cleaner data (the GOTPD1+ database). As discussed in the case of the MIVIA results, the WaterFilling algorithm performance is closer to the top results in the Recall and F N R rates, with the same explanation than above.
450
The CNN-based methods are affected by image artifacts to a much lower extent than SLICEP CA and SLICESV M . Between out two CNN-based proposals, DPDnet is more affected than DPDnetf ast , possibly given that its greater number of parameters (due to the larger input image size) forces it to search for more defined and detailed features similar to those found in the training subset
455
of the GOTPD1+ dataset, while DPDnetf ast generalizes better when facing new datasets, due to the lower definition of its input images (finding more general characteristics of people). Finally, as in all the previous cases, the MexicanHat method gets the worst performance for all metrics.
460
4.4.4. Overall comparison Once the detailed results have been discussed for each dataset, now we will provide full details on the overall comparison among the evaluated algorithms for all the datasets used. We will include tables with the overall behavior of the different strategies, and we will also include bar charts to ease the visual 28
465
comparison, providing error confidence values for all the metrics. In the tables and graphics below, we have omitted the results for the MexicanHat algorithm due to its bad performance. In the charts, please note that the vertical scale is not the same for all of them, as they have been adjusted to allow for a better visualization.
470
From the previous discussion on the results shown in tables 10, 11 and 12, we can conclude that the proposed solutions DPDnet and DPDnetf ast are the best alternatives for a robust detection in this application. They have obtained the best results, and they have proved to be robust against changing database recording and environmental conditions, maintaining the overall er-
475
ror rate around 1% percent in all cases. The classic trainable methods such as SLICEP CA and SLICESV M are good alternatives but have worse results, and they require calibration and retraining to change to provide optimal results. On the other hand, WaterFilling can be seen as another alternative system, but unlike the two previous approaches, it acts more as a detector of maximums than
480
as a classification algorithm, not being able to distinguish persons from other elements. Finally, the MexicanHat algorithm is not a good alternative system and globally has obtained the worst results among the evaluated systems. Table 13 shows the final aggregated results of the evaluated algorithms on the GOTPD1+ dataset, and in Figures 8a and 8b we show the graphical comparison
485
among the different strategies. Table 13: Results for the GOTPD1+ dataset. All values in %.
Method
P recision
Recall
F1 score
FNR
FPR
ERR
DPDnet
99.99
99.75
99.87 ± 0.05
0.25
0.02
0.19 ± 0.06
DPDnetf ast
99.88
99.89
99.88 ± 0.05
0.11
0.36
0.17 ± 0.06
WaterFilling
96.90
99.70
98.28 ± 0.18
0.30
9.24
2.59 ± 0.22
SLICEP CA *
99.06
99.36
99.21 ± 0.12
0.64
2.70
1.18 ± 0.15
SLICESV M *
96.71
99.40
98.04 ± 0.19
0.60
9.70
2.95 ± 0.24
MexicanHat
94.24
45.63
61.48 ± 0.68
54.37
8.06
42.48 ± 0.69
29
Figure 8a shows how the performance metrics are reasonably high, except for the WaterFilling and SLICESV M algorithms, whose P recision and F 1score have a significant drop. This tendency is also clearly observed in the error metrics of Figure 8b. The CNN-based methods achieve the best results, and 490
the observed improvements are statistically significant against the non CNNbased methods, considering the provided confidence intervals.
(a) Performance metrics.
(b) Error metrics.
Figure 8: Overall performance and error metrics for the GOTPD1+ dataset.
Table 14 shows the final aggregated results of the evaluated algorithms on the MIVIA dataset, and in Figures 9a and 9b we show the graphical comparison among the different strategies Table 14: Results for the MIVIA dataset. All values in %.
495
Method
P recision
Recall
F1 score
FNR
FPR
ERR
DPDnet
99.57
98.50
99.03 ± 0.11
1.50
0.26
0.72 ± 0.09
DPDnetf ast
99.43
97.54
98.47 ± 0.13
2.46
0.34
1.14 ± 0.12
WaterFilling
93.68
99.02
96.27 ± 0.21
0.98
4.08
2.91 ± 0.18
SLICEP CA *
91.52
92.14
91.83 ± 0.30
7.86
5.20
6.20 ± 0.26
SLICESV M *
90.85
92.22
91.53 ± 0.30
7.78
5.66
6.46 ± 0.27
MexicanHat
91.81
71.40
80.33 ± 0.43
28.60
3.89
13.27 ± 0.37
Figure 9a shows a much more significant performance drop of the non CNNbased methods (due to the mismatch between the training and testing conditions), with a clear superiority of the DPDnet and DPDnetf ast approaches in 30
P recision and F 1score . Their worse performance in Recall as compared with WaterFilling was already explained in previous sections. In what respect to 500
the error metrics of Figure 9b, the CNN-based methods achieve the best results in the overall error, again with statistically significant results as compared with the others.
(a) Performance metrics.
(b) Error metrics.
Figure 9: Overall performance and error metrics for the MIVIA dataset.
Table 15 shows the final aggregated results of the evaluated algorithms on the ZHAN2012 dataset, and in Figures 10a and 10b we show the graphical comparison 505
among the different strategies, where it is again clear how our proposals clearly outperform the other algorithms, with statistically significant improvements. Table 15: Results for the Zhang2012 dataset. All values in %.
Method
P recision
Recall
F1 score
FNR
FPR
ERR
DPDnet
98.46
99.79
99.12 ± 0.25
0.21
6.87
1.44 ± 0.32
DPDnetf ast
99.44
99.19
99.31 ± 0.22
0.81
2.47
1.12 ± 0.28
WaterFilling
95.70
98.79
97.22 ± 0.44
1.21
19.71
4.61 ± 0.57
SLICEP CA *
94.35
98.16
96.22 ± 0.52
1.84
25.69
6.28 ± 0.66
SLICESV M *
77.50
79.44
78.46 ± 1.10
20.56
110.57
36.09 ± 1.29
MexicanHat
36.56
23.68
28.74 ± 1.22
76.32
182.85
95.87 ± 0.54
31
(a) Performance metrics.
(b) Error metrics.
Figure 10: Overall performance and error metrics for the Zhang2012 dataset.
4.4.5. Comparison with other CNN-based proposals In this section we use the TVHEADS dataset to compare our proposal with other very recently proposed CNN-based methods that face top-view depth 510
data for the people detection task. We focus on the TVHEADS for this evaluation, as Liciotti et al. (2018a) already provided a broad comparison among CNNbased strategies using this dataset. In the evaluation of our proposal on the TVHEADS database, the mismatch between the training and evaluation conditions discussed above is also present,
515
and in this case we have an additional relevant issue: the fact that the TVHEADS dataset ground truth includes labeled people partially occluded by the image boundaries, as described in Section 4.1.4, and graphically shown in Figure 7. Table 16: Results for the TVHeads dataset (some provided by (Liciotti et al., 2018a)) Method
P recision
Recall
F 1score
ERR
DPDnet
89.63 ± 0.90
72.67 ± 1.31
80.26 ± 1.17
18.93 ± 1.16
DPDnet-FT
99.78 ± 0.14
98.88 ± 0.31
99.33 ± 0.24
0.97 ± 0.29
Fractal (Larsson et al., 2016)
99.26 ± 0.25
99.32 ± 0.24
99.29 ± 0.25
0.56 ± 0.22
U-Net (Ronneberger et al., 2015)
94.50 ± 0.67
94.89 ± 0.65
94.69 ± 0.66
0.75 ± 0.25
U-Net2 (Ravishankar et al., 2017)
96.78 ± 0.52
97.05 ± 0.50
96.91 ± 0.51
0.69 ± 0.24
U-Net3 (Liciotti et al., 2018a)
98.93 ± 0.30
98.94 ± 0.30
98.93 ± 0.30
0.55 ± 0.22
Segnet (Badrinarayanan et al., 2015)
94.62 ± 0.67
95.33 ± 0.62
94.96 ± 0.65
0.74 ± 0.25
ResNet (He et al., 2016)
96.87 ± 0.51
96.92 ± 0.51
96.89 ± 0.51
0.62 ± 0.23
32
(a) Performance metrics.
(b) Error metrics.
Figure 11: Overall performance and error metrics for the TVHEADS dataset.
Table 16 shows the result of our proposal compared with that of (Liciotti et al., 2018a), and others in the literature that were designed for general segmen520
tation tasks (Larsson et al., 2016; Ronneberger et al., 2015; Ravishankar et al., 2017; Badrinarayanan et al., 2015), with the results provided in (Liciotti et al., 2018a). Figures 11a and 11b show the graphical comparison among the different methods. In this case we include the confidence bands for all the metrics given the similarity among the results achieved by the different systems.
In the table we can observe results for two versions of our proposal. The first one (DPDnet) refers to the standard proposal trained on the GOTPD1+ dataset, without fine-tuning. As can be seen, the standard DPDnet results are worse than those of all the other CNN-based systems. After analyzing the DPDnet results in detail, we found that most of the performance drop is due to the high number of labeled people partially occluded by the image boundaries, a situation that does not appear in the ground truth provided with the GOTPD1+ database, so that DPDnet had not been able to model it.

To address this issue, and taking into account the very limited size of the TVHEADS dataset, we built an adapted system called DPDnet-FT, a fine-tuned version of DPDnet that uses part of the TVHEADS data for fine-tuning. That way, DPDnet-FT exploits the previously acquired knowledge on the people detection task to adapt its predictions to the new labeling conditions of the TVHEADS environment. We used 80% of the TVHEADS data to fine-tune the DPDnet models, 10% for validation, and the remaining 10% for testing. To provide more reliable results we applied a 10-fold cross-validation strategy over the TVHEADS database, obtaining the final performance metrics as the average over the 10 folds. That way, we could get the most out of the available data, while also reducing the statistical significance bands as much as possible.
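As an illustration, the split just described can be realized as sketched below. This is only a plausible realization of the stated 80/10/10, 10-fold scheme; the authors' exact fold assignment is not specified, and the dataset size used in the example is a hypothetical placeholder.

import numpy as np

# Plausible sketch (assumption, not the authors' exact scheme) of an 80/10/10
# fine-tune/validation/test split with 10-fold cross-validation: the shuffled
# indices are divided into 10 blocks; in fold k, block k is the test set,
# block (k+1) mod 10 is the validation set, and the remaining 8 blocks (80%)
# are used for fine-tuning.

def ten_fold_splits(num_samples: int, seed: int = 0):
    rng = np.random.default_rng(seed)
    indices = rng.permutation(num_samples)
    blocks = np.array_split(indices, 10)
    for k in range(10):
        test = blocks[k]
        val = blocks[(k + 1) % 10]
        train = np.concatenate([blocks[i] for i in range(10)
                                if i not in (k, (k + 1) % 10)])
        yield train, val, test

# Example with a hypothetical dataset size; per-fold metrics would be
# averaged over the 10 folds, as described above.
for fold, (tr, va, te) in enumerate(ten_fold_splits(num_samples=1000)):
    print(f"fold {fold}: train={len(tr)}  val={len(va)}  test={len(te)}")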
It is important to mention that Liciotti et al. (2018a) did not provide the experimental details of their database partition scheme, so we do not have the precise information needed to estimate the confidence intervals of their results. To overcome this issue and provide a rigorous comparison, we assumed that their published results also considered the highest possible amount of data, and thus have the smallest possible statistical confidence intervals.

The Precision and F1 score results of DPDnet-FT shown in Table 16 are the best among the compared systems, and the differences in Precision are statistically significant. In terms of Recall, DPDnet-FT is positioned in third place, but the differences with the top two systems are not statistically significant. Finally, in terms of ERR, DPDnet-FT is positioned last but, again, there are no statistically significant differences with the other CNN-based systems. It is relevant to mention that at these accuracy levels the number of errors is extremely small: for example, our ERR = 0.97% means less than 4 errors out of 4413 labelled positives. It is also worth mentioning that the TVHEADS dataset does not include sequences with objects, so we cannot evaluate the performance of the CNN-based alternatives under those conditions. Further experimental work with the proposals by Liciotti et al. (2018a) on larger datasets is clearly required to assess the relative performance among the CNN-based proposals with statistically significant results, but this will be addressed as a future task.
4.5. Computational Performance Evaluation

Regarding the computational demands, Table 17 shows the performance of the evaluated algorithms measured in average frames per second (we have explicitly excluded MexicanHat, given its poor detection accuracy and poor computational behavior, according to the results reported in Luna et al. (2017)). All the experiments were run on a standard PC with an Intel i7-6700 CPU at 3.4 GHz and 32 GB of RAM; the GPU used was an NVIDIA GTX 1080 Ti. The evaluation was run on samples from the GOTPD1+ dataset.
In the case of DPDnet and DPDnet-fast we also show CPU results.

Table 17: Timing results, measured as average frames per second (FPS).

            DPDnet   DPDnet-fast   WaterFilling   SLICE-PCA   SLICE-SVM
FPS (CPU)      2.7           9.8           16.6       312.0       406.0
FPS (GPU)     65.4         116.9            N/A         N/A         N/A
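For clarity, the FPS figures of this kind can be obtained by running the detector over a set of depth frames and dividing the number of processed frames by the elapsed wall-clock time. The sketch below is a minimal illustration under that assumption, not the authors' benchmarking code; the detector callable and the frame resolution are hypothetical placeholders.

import time
import numpy as np

# Minimal sketch (not the authors' benchmarking code): average FPS as
# processed frames divided by elapsed wall-clock time, excluding a few
# warm-up iterations (GPU initialization, memory allocation).

def average_fps(detector, frames, warmup: int = 10) -> float:
    for frame in frames[:warmup]:
        detector(frame)                      # warm-up, not timed
    start = time.perf_counter()
    for frame in frames[warmup:]:
        detector(frame)
    elapsed = time.perf_counter() - start
    return (len(frames) - warmup) / elapsed

# Example with a dummy detector and synthetic depth frames (hypothetical size).
dummy_detector = lambda depth: depth.mean()
frames = [np.random.rand(424, 512).astype(np.float32) for _ in range(110)]
print(f"{average_fps(dummy_detector, frames):.1f} FPS")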
As can be clearly seen, the computational performance of the CNN-based strategies running on a generic CPU is well behind that of the WaterFilling and SLICE approaches, although DPDnet-fast can still run at a reasonable rate of close to 10 FPS. The performance of the SLICE algorithms is much better than the one reported in Luna et al. (2017) because the CPU we use is significantly better. When the GPU enters the comparison, the results for DPDnet and DPDnet-fast are, as expected, much better than those for WaterFilling, although still behind SLICE-SVM and SLICE-PCA. However, both CNN-based strategies are well above the requirements for real-time operation. Taking into account that these approaches significantly outperform the other methods, there is still room for further reduction of the network complexity.

Regarding the effect of the scene complexity on the computational performance of the algorithms, Luna et al. (2017) already showed that the SLICE-PCA algorithm was affected by the number of users in the scene: the ROI estimation and feature extraction modules, whose demands are proportional to the number of users, have a large impact on the processing time. On the other hand, the DPDnet, DPDnet-fast and WaterFilling algorithms do not exhibit such a scene-complexity dependence. Table 18 shows the FPS performance for these algorithms evaluated on sequences with one, two, and more than two people. It can be seen that there is no direct impact of the number of users on the results for DPDnet, DPDnet-fast and WaterFilling. For the SLICE-PCA and SLICE-SVM approaches, the relative decrease in computational performance between the single-person and the more-than-two-people cases is 65.3% and 58.4%, respectively (see the computation after Table 18).

Table 18: Average timing results (FPS) for the WaterFilling, DPDnet, DPDnet-fast and SLICE algorithms, for sequences with a varying number of users.

                        DPDnet (GPU)   DPDnet-fast (GPU)   WaterFilling   SLICE-PCA   SLICE-SVM
Single person                   66.9               111.5           16.6       572.0       682.0
Two people                      60.9               124.0           16.8       391.8       498.8
More than two people            65.5               117.5           16.5       198.7       283.4
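For completeness, these relative decreases follow directly from the SLICE columns of Table 18, taking the single-person FPS as the reference value: (572.0 − 198.7)/572.0 ≈ 65.3% for SLICE-PCA, and (682.0 − 283.4)/682.0 ≈ 58.4% for SLICE-SVM.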
Regarding the computational performance comparison with the other CNN-based alternatives, Liciotti et al. (2018a) state in their paper that their proposal is "computationally efficient at inference time". Unfortunately, they do not provide any numerical details on this aspect, so we are unable to make a further comparison with them on this issue.
5. Conclusions and Future Work

In this paper we propose a new CNN-based solution (with two implementations, namely DPDnet and DPDnet-fast) to address the task of multiple people detection from overhead depth images, using multimodal likelihood maps. We evaluated and compared our method with other recent state-of-the-art methods using four different depth image datasets with varying recording conditions (technology used in the depth sensors, resolution, height of the sensor location, background configuration, number of users, actions performed by them, presence of non-people elements, etc.), following a rigorous and exhaustive experimental procedure. To the best of our knowledge, our study constitutes the broadest one in the literature with respect to the number of evaluated methods and the range of evaluation datasets.
From our experimental results, we can claim that the proposed method clearly outperforms all the non-CNN-based state-of-the-art methods on the different databases (GOTPD1+, MIVIA and Zhang2012), with no need for retraining or fine-tuning and with statistically significant differences.

In the case of the comparison with the CNN-based methods on the TVHEADS dataset, we had to apply a fine-tuning procedure, given the relevant mismatch between the labeling criteria used in TVHEADS and the one we applied in the GOTPD1+ dataset. After fine-tuning, our system proved to be the best one in terms of the Precision and F1 score performance metrics, with the differences in Precision being statistically significant. For the other performance metrics the differences found were not statistically significant, so we can state that our proposal is also competitive with the other CNN-based alternatives. With the limited size of the TVHEADS dataset, we have not been able to establish the superiority of any of the CNN-based methods, but it is clear that at these performance levels (overall error below 1%), and given its computational efficiency, our strategy is ready to be deployed in real scenarios. Nevertheless, in the near future we will carry out further experimentation with larger datasets and CNN-based approaches, to be able to actually assess their relative performance with statistically significant results.

Also pursuing deployment in real scenarios, with the fast version of DPDnet (DPDnet-fast) we obtained a relative reduction in computing time of between 58.4% and 65.3%, with only a minor decrease in accuracy. Given that commercial systems try to optimize hardware costs, we highlight this as an important contribution, since the detector can run in real time on low-cost GPUs. Having evaluated all the systems, both CNN-based and non-CNN-based, we can conclude that the CNN-based ones are the best for detecting people in terms of accuracy, speed and robustness. On the other hand, these approaches require the use of GPUs, in contrast to the non-CNN-based ones, which usually run in real time on conventional CPUs at the expense of lower performance.
Regarding suggestions for future research, we think that one of the major drawbacks of the proposed task is the expected low performance in outdoor conditions, as current depth camera technology generally has problems with IR interference from sunlight. We will address this issue in the future, either by using depth cameras that claim to be more robust to this interference, or by adopting alternative strategies to cope with higher errors in depth estimation and increased noise.

Another relevant future line of research will be to port the DPDnet strategy to a more generic formulation of the problem that does not involve the restriction to overhead camera locations. In this case we will have to face some associated problems, mainly related to the appearance of occlusions, which could initially be approached with tracking strategies.

With respect to the CNN architecture, we found that the generic objective function currently used might be improved by adopting loss functions that explicitly model the upper human body geometry. This could lead to a stronger, faster and better people detector. Finally, and as already suggested in Section 4.4.5, we will work on an extensive evaluation of the CNN-based alternatives using larger datasets, to be able to assess their relative performance with statistically significant results.
6. Acknowledgments

This work has been supported by the Spanish Ministry of Economy and Competitiveness under projects HEIMDAL-UAH (TIN2016-75982-C2-1-R) and ARTEMISA (TIN2016-80939-R), and by the University of Alcalá under project JANO (CCGP2017/EXP-025). We also thank Xucong Zhang for providing us with their evaluation data, and the MIVIA and VRAI labs, which share their MIVIA and TVHEADS datasets with the scientific community.
References

Agarap, A. F. (2018). Deep learning using rectified linear units (ReLU). CoRR, abs/1803.08375. URL: http://arxiv.org/abs/1803.08375. arXiv:1803.08375.

Andriluka, M., Roth, S., & Schiele, B. (2008). People-tracking-by-detection and people-detection-by-tracking. In 2008 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–8). doi:10.1109/CVPR.2008.4587583.

Antic, B., Letic, D., Culibrk, D., & Crnojevic, V. (2009). K-means based segmentation for real-time zenithal people counting. In Proceedings of the 16th IEEE International Conference on Image Processing ICIP'09 (pp. 2537–2540). Piscataway, NJ, USA: IEEE Press. doi:10.1109/ICIP.2009.5414001.

Badrinarayanan, V., Kendall, A., & Cipolla, R. (2015). SegNet: A deep convolutional encoder-decoder architecture for image segmentation. CoRR, abs/1511.00561. URL: http://arxiv.org/abs/1511.00561. arXiv:1511.00561.

Bevilacqua, A., Di Stefano, L., & Azzari, P. (2006). People tracking using a time-of-flight depth sensor. In Video and Signal Based Surveillance, 2006. AVSS '06. IEEE International Conference on (pp. 89–89). doi:10.1109/AVSS.2006.92.

Cai, Z., Yu, Z. L., Liu, H., & Zhang, K. (2014). Counting people in crowded scenes by video analyzing. In Industrial Electronics and Applications (ICIEA), 2014 IEEE 9th Conference on (pp. 1841–1845). doi:10.1109/ICIEA.2014.6931467.

Cao, Y., Shen, C., & Shen, H. T. (2017). Exploiting depth from single monocular images for object detection and semantic segmentation. IEEE Transactions on Image Processing, 26, 836–846. doi:10.1109/TIP.2016.2621673.

Chan, A., Liang, Z.-S., & Vasconcelos, N. (2008). Privacy preserving crowd monitoring: Counting people without people models or tracking. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (pp. 1–7). doi:10.1109/CVPR.2008.4587569.

Chen, T.-Y., Chen, C.-H., Wang, D.-J., & Kuo, Y.-L. (2010). A people counting system based on face-detection. In Genetic and Evolutionary Computing (ICGEC), 2010 Fourth International Conference on (pp. 699–702). doi:10.1109/ICGEC.2010.178.

Chollet, F. (2016). Xception: Deep learning with depthwise separable convolutions. CoRR, abs/1610.02357. URL: http://arxiv.org/abs/1610.02357. arXiv:1610.02357.

Dan, B.-K., Kim, Y.-S., Suryanto, Jung, J.-Y., & Ko, S.-J. (2012). Robust people counting system based on sensor fusion. Consumer Electronics, IEEE Transactions on, 58, 1013–1021. doi:10.1109/TCE.2012.6311350.

Del Pizzo, L., Foggia, P., Greco, A., Percannella, G., & Vento, M. (2016). Counting people by rgb or depth overhead cameras. Pattern Recognition Letters, 81, 41–50. URL: https://doi.org/10.1016/j.patrec.2016.05.033. doi:10.1016/j.patrec.2016.05.033.

Dumoulin, J., Canévet, O., Villamizar, M., Nunes, H., Khaled, O. A., Mugellini, E., Moscheni, F., & Odobez, J.-M. (2018). Unicity: A depth maps database for people detection in security airlocks. In IEEE International Conference on Advanced Video and Signal-based Surveillance Workshop.

Dumoulin, J., Canévet, O., Villamizar, M., Nunes, H., Khaled, O. A., Mugellini, E., Moscheni, F., & Odobez, J.-M. (2019). UNICITY: A depth maps database for people detection in security airlocks. Available online. doi:10.5281/zenodo.2556678 (last accessed October 2019).

Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 NIPS'14 (pp. 2366–2374). Cambridge, MA, USA: MIT Press. URL: http://dl.acm.org/citation.cfm?id=2969033.2969091.

Fernandez-Rincon, A., Fuentes-Jimenez, D., Losada, C., Marron, M., Luna, C. A., Macias-Guarasa, J., & Mazo, M. (2017). Robust people detection and tracking from an overhead time-of-flight camera. In 12th International Conference on Computer Vision Theory and Applications (pp. 556–564). Porto, Portugal, volume 4. doi:10.5220/0006169905560564.

Galčík, F., & Gargalík, R. (2013). Real-time depth map based people counting. In 15th International Conference on Advanced Concepts for Intelligent Vision Systems - Volume 8192 ACIVS 2013 (pp. 330–341). Berlin, Heidelberg: Springer-Verlag. doi:10.1007/978-3-319-02895-8_30.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016 (pp. 770–778). doi:10.1109/CVPR.2016.90.

Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.

Jeong, C. Y., Choi, S., & Han, S. W. (2013). A method for counting moving and stationary people by interest point classification. In Image Processing (ICIP), 2013 20th IEEE International Conference on (pp. 4545–4548). doi:10.1109/ICIP.2013.6738936.

Jia, L., & Radke, R. (2014). Using time-of-flight measurements for privacy-preserving tracking in a smart room. Industrial Informatics, IEEE Transactions on, 10, 689–696. doi:10.1109/TII.2013.2251892.

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980. URL: http://arxiv.org/abs/1412.6980. arXiv:1412.6980.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25 (pp. 1097–1105). Curran Associates, Inc. URL: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

Lab, M. R. (2016). The MIVIA People Counting Dataset. Available online. URL: http://mivia.unisa.it/datasets-request (last accessed January 2019).

Larsson, G., Maire, M., & Shakhnarovich, G. (2016). Fractalnet: Ultra-deep neural networks without residuals. CoRR, abs/1605.07648. URL: http://arxiv.org/abs/1605.07648. arXiv:1605.07648.

Liciotti, D., Paolanti, M., Pietrini, R., Frontoni, E., & Zingaretti, P. (2018a). Convolutional networks for semantic heads segmentation using top-view depth data in crowded environment. In 2018 24th International Conference on Pattern Recognition (ICPR) (pp. 1384–1389). doi:10.1109/ICPR.2018.8545397.

Liciotti, D., Paolanti, M., Pietrini, R., Frontoni, E., & Zingaretti, P. (2018b). The TVHEADS (Top-View Heads) dataset. Available online. URL: http://vrai.dii.univpm.it/tvheads-dataset (last accessed October 2019).

Liu, J., Liu, Y., Zhang, G., Zhu, P., & Chen, Y. Q. (2015). Detecting and tracking people in real time with rgb-d camera. Pattern Recognition Letters, 53, 16–23. URL: http://www.sciencedirect.com/science/article/pii/S016786551400302X. doi:10.1016/j.patrec.2014.09.013.

Luna, C. A., Losada-Gutierrez, C., Fuentes-Jimenez, D., Fernandez-Rincon, A., Mazo, M., & Macias-Guarasa, J. (2017). Robust people detection using depth information from an overhead time-of-flight camera. Expert Systems with Applications, 71, 240–256. doi:10.1016/j.eswa.2016.11.019.

Macias-Guarasa, J., Losada-Gutierrez, C., Fuentes-Jimenez, D., Fernandez, R., Luna, C. A., Fernandez-Rincon, A., & Mazo, M. (2016a). The GEINTRA Overhead ToF People Detection (GOTPD1) database. Available online. URL: http://www.geintra-uah.org/datasets/gotpd1 (last accessed January 2019).

Macias-Guarasa, J., Losada-Gutierrez, C., Fuentes-Jimenez, D., Fernandez, R., Luna, C. A., Fernandez-Rincon, A., & Mazo, M. (2016b). The GEINTRA Overhead ToF People Detection (GOTPD1) database. Available online. URL: https://www.kaggle.com/lehomme/overhead-depth-images-people-detection-gotpd1 (last accessed January 2019).

Munaro, M., Lewis, C., Chambers, D., Hvass, P., & Menegatti, E. (2016). Rgb-d human detection and tracking for industrial environments. In E. Menegatti, N. Michael, K. Berns, & H. Yamaguchi (Eds.), Intelligent Autonomous Systems 13 (pp. 1655–1668). Cham: Springer International Publishing.

Ouyang, W., & Wang, X. (2013). Joint deep learning for pedestrian detection. In 2013 IEEE International Conference on Computer Vision (pp. 2056–2063). doi:10.1109/ICCV.2013.257.

Ramanan, D., Forsyth, D. A., & Zisserman, A. (2006). Tracking people by learning their appearance. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29, 65–81. doi:10.1109/tpami.2007.250600.

Rauter, M. (2013). Reliable human detection and tracking in top-view depth images. In 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 529–534). doi:10.1109/CVPRW.2013.84.

Ravishankar, H., Venkataramani, R., Thiruvenkadam, S., Sudhakar, P., & Vaidya, V. (2017). Learning and incorporating shape models for semantic segmentation. In MICCAI.

Redmon, J., Divvala, S. K., Girshick, R. B., & Farhadi, A. (2015). You only look once: Unified, real-time object detection. CoRR, abs/1506.02640.

Ren, X., Du, S., & Zheng, Y. (2017). Parallel rcnn: A deep learning method for people detection using rgb-d images. In 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) (pp. 1–6). doi:10.1109/CISP-BMEI.2017.8302069.

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI) (pp. 234–241). Springer, volume 9351 of LNCS. URL: http://lmb.informatik.uni-freiburg.de/Publications/2015/RFB15a (available on arXiv:1505.04597 [cs.CV]).

Shin, H.-C., Roth, H., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., & Summers, R. (2016). Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging, 35, 1285–1298. doi:10.1109/TMI.2016.2528162.

Stahlschmidt, C., Gavriilidis, A., Velten, J., & Kummert, A. (2013). People detection and tracking from a top-view position using a time-of-flight camera. In A. Dziech, & A. Czyazwski (Eds.), Multimedia Communications, Services and Security (pp. 213–223). Springer Berlin Heidelberg, volume 368 of Communications in Computer and Information Science. doi:10.1007/978-3-642-38559-9_19.

Stahlschmidt, C., Gavriilidis, A., Velten, J., & Kummert, A. (2014). Applications for a people detection and tracking algorithm using a time-of-flight camera. Multimedia Tools and Applications, (pp. 1–18). doi:10.1007/s11042-014-2260-3.

Tian, Y., Luo, P., Wang, X., & Tang, X. (2015). Pedestrian detection aided by deep learning semantic tasks. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 5079–5087). doi:10.1109/CVPR.2015.7299143.

Vera, P., Monjaraz, S., & Salas, J. (2016). Counting pedestrians with a zenithal arrangement of depth cameras. Machine Vision and Applications, 27, 303–315. URL: https://doi.org/10.1007/s00138-015-0739-1. doi:10.1007/s00138-015-0739-1.

Villamizar, M., Martínez-González, A., Canévet, O., & Odobez, J. (2018). Watchnet: Efficient and depth-based network for people detection in video surveillance systems. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (pp. 1–6). doi:10.1109/AVSS.2018.8639165.

Wang, C., & Zhao, Y. (2017). Multi-layer proposal network for people counting in crowded scene. In 2017 10th International Conference on Intelligent Computation Technology and Automation (ICICTA) (pp. 148–151). doi:10.1109/ICICTA.2017.40.

Zhang, X., Yan, J., Feng, S., Lei, Z., Yi, D., & Li, S. (2012). Water filling: Unsupervised people counting via vertical Kinect sensor. In Advanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International Conference on (pp. 215–220). doi:10.1109/AVSS.2012.82.

Zhao, J., Zhang, G., Tian, L., & Chen, Y. Q. (2017). Real-time human detection with depth camera via a physical radius-depth detector and a CNN descriptor. In 2017 IEEE International Conference on Multimedia and Expo (ICME) (pp. 1536–1541). doi:10.1109/ICME.2017.8019323.

Zhou, K., Paiement, A., & Mirmehdi, M. (2017). Detecting humans in rgb-d data with cnns. In 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA) (pp. 306–309). doi:10.23919/MVA.2017.7986862.

Zhu, L., & Wong, K.-H. (2013). Human tracking and counting using the kinect range sensor based on adaboost and kalman filter. In International Symposium on Visual Computing (pp. 582–591). Springer. doi:10.1007/978-3-642-41939-3_57.
Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Author Statement for manuscript: "DPDnet: A Robust People Detector using Deep Learning with an Overhead Depth Camera"

David Fuentes-Jimenez, Roberto Martin-Lopez, Cristina Losada-Gutierrez, David Casillas-Perez, Javier Macias-Guarasa, Carlos A. Luna, Daniel Pizarro

February 25, 2019

Following the request in the submission system, we include below the author statement information outlining all authors' individual contributions, using the relevant CRediT roles:

• Conceptualization and design of the algorithmic strategy: David Fuentes-Jimenez.

• Methodology: David Fuentes-Jimenez, David Casillas-Perez, Cristina Losada-Gutierrez and Daniel Pizarro defined the algorithmic and experimental methodology.

• Data curation efforts, including database recording and labeling, and the supervision, execution and validation of all the experiments: David Fuentes-Jimenez, Roberto Martin-Lopez, Javier Macias-Guarasa, and Cristina Losada-Gutierrez.

• Software: David Fuentes-Jimenez, David Casillas-Perez, Roberto Martin-Lopez, Cristina Losada-Gutierrez and Javier Macias-Guarasa generated the implementation of the algorithmic proposals and the labeling and scoring software.

• Writing, review and editing: All the authors participated in the manuscript generation and revision.