Highlights

• Robust system to detect people only using depth information from a ToF camera.
• Refined algorithm for determining the regions of interest.
• Classifier stage to distinguish between people and other objects in the scene.
• Rigorous experimental procedure shows high performance with a broad user variability.
• Generated database will be made available to the research community.
Robust People Detection Using Depth Information from an Overhead Time-of-Flight Camera

Carlos A. Luna (a), Cristina Losada-Gutierrez (a), David Fuentes-Jimenez (a), Alvaro Fernandez-Rincon (a), Manuel Mazo (a), Javier Macias-Guarasa (a,b)

(a) Department of Electronics, University of Alcalá, Ctra. Madrid-Barcelona, km. 33,600, 28805 Alcalá de Henares, Spain
(b) Corresponding author. Tel: +34 918856918, fax: +34 918856591.

Emails: [email protected] (C.A. Luna), [email protected] (C. Losada-Gutierrez), [email protected] (D. Fuentes-Jimenez), [email protected] (A. Fernandez-Rincon), [email protected] (M. Mazo), [email protected] (J. Macias-Guarasa)
Abstract
In this paper we describe a system for the automatic detection of multiple people in a scene, using only depth information provided by a Time of Flight (ToF) camera placed in an overhead position. The main contribution of this work lies in the proposal of a methodology for determining the Regions of Interest (ROI's) and extracting features, which results in a robust discrimination between people (with or without accessories) and objects (either static or dynamic), even when people and objects are close together. Since only depth information is used, the developed system guarantees users' privacy. The designed algorithm includes two stages: an offline stage and an online one. In the offline stage, a new depth image dataset has been recorded and labeled, and the labeled images have been used to train a classifier. The online stage is based on robustly detecting local maxima in the depth image (candidates to correspond to the heads of the people present in the scene), around each of which a ROI is carefully defined. For each ROI, a feature vector is extracted, providing information on the top view of people and objects, including information related to the expected overhead morphology of the head and shoulders. The online stage also includes a pre-filtering process, in order to reduce noise in the depth images. Finally, there is a classification process based on Principal Component Analysis (PCA). The online stage works in real time at an average of 150 fps. In order to evaluate the proposal, a wide experimental validation has been carried out, including different numbers of people simultaneously present in the scene, as well as people with different heights, complexions, and accessories. The obtained results are very satisfactory, with a 3.1% average error rate.
Keywords: People detection, Depth camera information, Interest regions estimation, Overhead camera, Feature extraction
1. Introduction

Automatic people detection and counting is a highly interesting topic because of its multiple applications in areas such as video-surveillance, access control, people flow analysis, behavior analysis or event capacity management. Given the need to prevent and detect potentially dangerous situations, there is no doubt that these applications have become increasingly important in recent years.

There are several works in the literature aimed at achieving a robust and reliable detection and counting of people in a non-invasive way (without adding turnstiles or other contact systems for access control). The first works in this line were based on the use of RGB cameras. In (Ramanan et al., 2006) the authors propose a system based on learning person models. Lefloch et al. (2008) describe a proposal for real-time people counting based on robust background subtraction and subsequent people segmentation. Other approaches are based on face detection (Chen et al., 2010) or interest point classification (Jeong et al., 2013). These proposals yield good results in controlled conditions, but they have problems in scenarios with occlusions. In order to reduce these occlusions, other works propose placing the camera in an overhead position. This is the case of Antic et al. (2009) and Cai et al. (2014), whereas Dan et al. (2012) use the fusion of RGB and depth information in order to improve the detection. Color and depth information are acquired using a Kinect v1 sensor (Smisek et al., 2011), which simultaneously includes an RGB camera and a depth sensor that constructs a depth map by analyzing a speckle pattern of infrared laser light.

However, using an RGB camera implies that there is information that could allow knowing the identity of the people in the scene. This can be a relevant issue in applications with privacy preservation requirements, due, among others, to legal considerations. This is the reason why, in the last few years, researchers have looked for alternatives in order to preserve users' privacy. This is the case of Chan et al. (2008), who proposed the use of a low resolution camera located far away from the users. This allows monitoring people without being able to identify them and, consequently, without invading their privacy. Unfortunately, this solution can only be used in environments which allow placing the cameras in distant positions, and the proposal exhibits some problems if there are occlusions or people walk very close to each other.
On the other hand, also in recent years, several works have proposed the use of depth sensors or 2.5D cameras (Lange & Seitz, 2001; Sell & O'Connor, 2014), based on Time of Flight (ToF) (Bevilacqua et al., 2006; Stahlschmidt et al., 2013, 2014; Jia & Radke, 2014) or structured light (Zhang et al., 2012; Galčík & Gargalík, 2013; Rauter, 2013; Zhu & Wong, 2013; Del Pizzo et al., 2016; Vera et al., 2016) in order to detect and count people. All these works are based on overhead cameras, with the objective of reducing occlusion effects. The use of depth sensors implies privacy preservation, but it also provides a source of information that allows achieving object segmentation in a more straightforward way than traditional optical cameras (especially in the context of people counting from top-view cameras).

In (Bevilacqua et al., 2006) the use of a ToF depth sensor (Canesta EP205 with a resolution of 64 × 64 pixels) instead of a stereo pair of cameras is proposed, to allow people tracking even in low illumination conditions. In that work, they use both depth and infrared intensity (grayscale) images provided by the sensor, which implies that the users' privacy is not fully preserved. Moreover, it only works properly if people enter the scene separated, to avoid occlusion effects. So, their proposal does not work if the number of people is high, or if they move close to each other, thus not being useful in realistic environments.

In (Zhang et al., 2012), the authors use a vertical Kinect v1 sensor to obtain a depth image. Since they assume that the head is always closer to the camera than other body parts, the proposal for people detection is based on obtaining the local minima of the depth image. This proposal is scale-invariant, but it cannot handle the situation where some moving object is closer to the sensor than the head, such as when raising hands over it.

Another interesting proposal is described in (Stahlschmidt et al., 2013, 2014), where the authors present a system that also uses an overhead ToF camera. The proposal includes several stages: first, they detect the maxima in the height image and define a Region of Interest (ROI) around each maximum. Then, a preprocessing stage is carried out, in which they remove the points that belong to the floor and normalize the measurements. Finally, they use a filter based on the normalized Mexican Hat Wavelet that allows segmenting the different objects in the scene. This proposal improves the results of Bevilacqua et al. (2006) when the number of people in the scene is high, but it still generates errors if people are very close to each other.

An additional relevant drawback of the proposals by Zhang et al. (2012) and Stahlschmidt et al. (2013, 2014) is that they do not include a classifier, but only a detector (Zhang et al., 2012) or a tracker (Stahlschmidt et al., 2013, 2014), so that they cannot discriminate between people and other (mobile or static) objects in the scene, thus leading to the possible appearance of an important number of false positives in realistic scenarios.

Jia & Radke (2014) describe an alternative for people detection and tracking, as well as for posture estimation (standing or sitting), in an intelligent space. The authors detect people in the scene and their pose as a function of the heights of a set of segmented points. Just as in the previously cited works, a preprocessing stage is necessary in order to reduce noise and depth measurement errors. After removing the background (pixels belonging to the floor or furniture), any group of connected depth measurements (blob) with a height over 90 cm is considered to be a person.

The proposal by Del Pizzo et al. (2016) allows real-time people counting and includes two stages: first
there is a background subtraction step, which includes a dynamic background update. This stage allows identifying the pixels in motion within the scene. Then, a second step interprets the results of the foreground detection in order to count people.

Both Jia & Radke (2014) and Del Pizzo et al. (2016) allow people detection while preserving the users' privacy, but again, since no classification stage is included, these proposals are not able to discriminate people from other objects in the scene. In order to solve this issue, other works add a classification stage that allows reducing the number of false positives (Galčík & Gargalík, 2013; Rauter, 2013; Zhu & Wong, 2013; Vera et al., 2016).

The proposals in (Galčík & Gargalík, 2013), (Vera et al., 2016), and (Zhu & Wong, 2013) are based on the human head and shoulders structure in order to obtain a descriptor that will be used to detect people in depth images. In particular, Galčík & Gargalík (2013) propose to first detect the areas that can be a head, and then generate a descriptor with three components related to the head structure: the head area, its roundness, and a box test component. Finally, the correctness of the descriptor is computed. An additional tracking stage is included in order to reduce the number of false positives. The proposal by Vera et al. (2016) is also based on using the head roundness, its area, and a tracker, but the authors also include a stage where the tracklets obtained from several depth cameras are combined. On the other hand, Zhu & Wong (2013) propose to use the Head and Shoulder Profile (HASP) as the input feature vector, and an Adaboost based algorithm for classification. All these proposals are able to efficiently discriminate between people and other objects or animals in the scene, but the detection rates significantly decrease if people are close to each other. Additionally, since the proposals are based on the head and shoulder structure of humans, they should not work properly if people wear accessories such as hats, caps, etc.

In summary, the works by Stahlschmidt et al. (2013), Stahlschmidt et al. (2014), Jia & Radke (2014), Zhang et al. (2012), and Del Pizzo et al. (2016) allow detecting people while preserving their privacy, but since these proposals do not include a classification process, they are not able to discriminate between people and other objects, thus leading to a possibly high number of false positives in realistic scenarios. On the other hand, the proposals by Galčík & Gargalík (2013), Zhu & Wong (2013), and Vera et al. (2016) include a classification stage (based on the head or head and shoulder structure of humans), but they exhibit a lack of robustness if people are close to each other, and they should not work properly if people wear accessories that change their appearance, such as hats, caps, backpacks, etc.

In this paper, we propose a system for robust and reliable detection of multiple people, using only depth data acquired with a ToF camera. The proposed solution works properly even if the number of people is high and/or they are close to each other. This work improves the robustness of previous proposals in the literature, as it is able to discriminate people from other objects in the environment. Moreover, our proposal is able to detect people even if they are wearing accessories that change their appearance, such as hats, caps, backpacks, etc.
The structure of the paper is as follows: Section 1 provides a general introduction and a critical review of the literature, Section 2 describes the notation used, Section 3 describes the person detection algorithm, Section 4 includes the experimental setup, results and discussion, and Section 5 contains the main conclusions and some ideas for future work.

2. Notation
Real scalar values are represented by lowercase letters (e.g. $\delta$ or $n$). Vectors are organized by rows, and they are represented using bold lowercase letters (e.g. $\mathbf{x}$). Capital letters are reserved for defining the size of vectors and sets (e.g. vector $\mathbf{x} = [x_1, \cdots, x_N]^{\top}$ is of size $N$), and $\mathbf{x}^{\top}$ denotes the transpose of vector $\mathbf{x}$. Matrices are represented by bold capital letters (e.g. $\mathbf{Z}$). Calligraphic types are reserved for representing ranges or sets, such as $\mathbb{R}$ for real numbers, or $\mathcal{H}$ for generic sets, with $\overline{\mathcal{H}}$ being the complementary set of $\mathcal{H}$. The cardinality of a set $\mathcal{A}$ is defined as the number of elements included in the set, and it will be referred to as $\mathcal{C}(\mathcal{A})$.

Throughout this paper, we will use the concept of neighborhood area for a given region $R$ in the image plane $\mathcal{P}$. This will be referred to as $\mathcal{V}_v^{R}$, where $R$ is the region to which the neighborhood refers (possibly composed of just a single pixel), and $v$ is the neighborhood distance used. If $\mathcal{P} = \{p_{i,j}\}$, with $0 < i \le M$ and $0 < j \le N$, where $p_{i,j}$ refers to the pixel in position $(i, j)$, formally:

$\mathcal{V}_v^{R} = \left\{ p_{k,l} \in \mathcal{P} \ / \ p_{k,l} \notin R \ \text{and} \ \exists\, p_{m,n} \in R : |k - m| \le v \ \text{and} \ |l - n| \le v \right\}$    (1)

Figure 1 shows an example of two different regions $R_1$ and $R_2$, and two neighborhood areas: $\mathcal{V}_2^{R_1}$, associated to $R_1$ with neighborhood distance 2, and $\mathcal{V}_1^{R_2}$, associated to $R_2$ with neighborhood distance 1.

In the same way, we specify the concept of neighborhood in a given radial direction, referred to as $\mathcal{V}_{v,\delta_i}^{R}$ (with $1 \le i \le 8$), as shown in Figure 2, where neighborhoods are represented in radial directions (radial subzones) for the examples in Figure 1. Arrows labeled $\delta_1$ to $\delta_4$ follow the direction of the compass points (north, east, south and west, respectively), whereas $\delta_5$ to $\delta_8$ correspond to the four diagonals (northeast, southeast, southwest and northwest, respectively).

We also define the concept of local neighborhood area of a region $R$, for a given distance and a given direction $\delta_i$, as the one which excludes the neighborhood area of the immediately lower neighborhood distance:
Figure 1: Examples of neighborhood areas.
Figure 2: Examples of neighborhood areas in radial directions.
Figure 3: Examples of local neighborhood areas in radial directions.
Figure 4: Proposed system architecture for people detection using ToF cameras.
$\mathcal{L}_{v,\delta_i}^{R} = \mathcal{V}_{v,\delta_i}^{R} \cap \overline{\mathcal{V}_{v-1,\delta_i}^{R}}, \quad \text{for } v \ge 2$    (2)
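To make the neighborhood definitions of equations (1) and (2) concrete, the following Python sketch (it is not part of the original paper, and the function names are ours) computes the neighborhood areas for a region given as a boolean mask, using the Chebyshev distance implied by the definitions:

import numpy as np

def neighborhood_area(region_mask, v):
    """Pixels outside the region whose Chebyshev distance to the region is at most v (Eq. 1)."""
    M, N = region_mask.shape
    rows, cols = np.where(region_mask)
    ii, jj = np.meshgrid(np.arange(M), np.arange(N), indexing='ij')
    # Chebyshev distance from every pixel of the image plane to the closest pixel of the region.
    dist = np.full((M, N), np.inf)
    for r, c in zip(rows, cols):
        dist = np.minimum(dist, np.maximum(np.abs(ii - r), np.abs(jj - c)))
    return (dist <= v) & (~region_mask)

def local_neighborhood_area(region_mask, v):
    """Ring of pixels at distance exactly in (v-1, v] from the region (Eq. 2), for v >= 2."""
    return neighborhood_area(region_mask, v) & ~neighborhood_area(region_mask, v - 1)

For a region consisting of a single pixel, neighborhood_area(mask, 1) returns exactly its 8 nearest neighbors, which is how $\mathcal{V}_1^{q_{i,j}}$ is used later in the noise reduction stage.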
Figure 3 graphically shows some examples of local neighborhood areas in radial directions.

3. Person Detection Algorithm
As previously discussed in the introduction, the solution we propose for people detection only uses depth information provided by a ToF camera in an overhead position, in order to guarantee the privacy of the users in the environment. A general block diagram of the proposal is shown in Figure 4.

There are two processes, an offline process and an online one. For the offline process, we have recorded and labeled a database of depth sequences (the GOTPD1 database, which is available to the academic community for research purposes; full information can be found in (Macias-Guarasa et al., 2016)). This dataset has been used to define two different classes: one
class for people without any accessories, and another one for people including accessories such as hats and caps.

The online process includes five stages: one to obtain the height matrix for each depth image acquired by the ToF camera; another to carry out a filtering process to reduce the measurement noise; a third stage for the detection of local maxima in the height matrix that could correspond to a person's head, and the determination of the Regions of Interest (ROI) in their environment (corresponding to the person's head, neck and shoulders); a fourth module to extract a feature vector for each ROI; and, finally, a last stage that performs the discrimination between people and other objects.

The proposed algorithm is based on the detection of isolated maxima in the height matrix that can belong to the heads of the people in the scene (or to any other object), and then on the extraction of features in each ROI around each maximum. In order to discriminate between people and other objects, we use a classification stage based on PCA (Principal Component Analysis). This stage allows determining whether a feature vector extracted from a ROI around a maximum belongs to any of the trained classes representing people. The main difficulty of a realistic scenario is the high variability in the shape of people's heads and shoulders (long/short hair, hats, caps, backpacks, etc.) which the system has to face.

Next, each stage of the proposed solution shown in Figure 4 is described in detail.

3.1. Height Acquisition
The ToF camera is located in an overhead position, and its optical axis is perpendicular to the floor plane, as shown in Figure 5. In this figure, the camera coordinate system is defined as $X_c$, $Y_c$, $Z_c$, and its origin ($O_c$) is the optical center of the camera lens. $\mathbf{p}_{3D} = [x_{p_{3D}}, y_{p_{3D}}, z_{p_{3D}}]^{\top}$ corresponds to a 3D point in the scene whose coordinates $x_{p_{3D}}, y_{p_{3D}}, z_{p_{3D}}$ are related to the camera coordinate system. $d_{p_{3D}} = \sqrt{x_{p_{3D}}^2 + y_{p_{3D}}^2 + z_{p_{3D}}^2}$ is the distance between $\mathbf{p}_{3D}$ and $O_c$, and $h_{p_{3D}}$ is the height of $\mathbf{p}_{3D}$ with respect to the floor plane. Supposing that the floor plane and the $X_c, Y_c$ plane are parallel, and that the distance between them is $h_{camera}$, then $h_{p_{3D}} = h_{camera} - z_{p_{3D}}$, $\forall \mathbf{p}_{3D}$.

Figure 5: Definition of coordinates and measurements.

A ToF camera provides, for each pixel $q_{i,j}$ in the depth image (being $i, j$ the pixel coordinates in the image plane), the 3D coordinates of the point $\mathbf{p}_{q_{i,j}} = [x_{p_{q_{i,j}}}, y_{p_{q_{i,j}}}, z_{p_{q_{i,j}}}]^{\top}$ in the 3D scene associated to that pixel, as well as the distance $d_{p_{q_{i,j}}}$, all of them related to the camera coordinate system. So, for each acquired image, it is possible to obtain a height measurement matrix $\mathbf{H}^{mea}$, whose dimensions are the same as the camera resolution (one measurement for each image pixel). Assuming a camera with a spatial resolution $M \times N$, then:

$\mathbf{H}^{mea} = \begin{bmatrix} h^{mea}_{1,1} & \cdots & h^{mea}_{1,N} \\ \vdots & \ddots & \vdots \\ h^{mea}_{M,1} & \cdots & h^{mea}_{M,N} \end{bmatrix} \in \mathbb{R}^{M \times N},$    (3)

where $h^{mea}_{i,j} = h_{p_{q_{i,j}}} = h_{camera} - z^{mea}_{p_{q_{i,j}}}$ represents the obtained height for pixel $q_{i,j}$ with respect to the floor. These heights are obtained from the height of the camera with respect to the floor ($h_{camera}$) and the value of the $z$ coordinate of the 3D point related to the camera coordinate system ($z^{mea}_{p_{q_{i,j}}} = z_{p_{q_{i,j}}}$).
respect to the floor. These heights have been obtained from the height of the camera with respect to the (zmea pq = zpqi, j ). i, j
3.2. Noise Reduction
PT
165
ED
floor (hcamera ), and the value of the z coordinate of the 3D point related to the camera coordinate system
One of the fundamental problems in ToF cameras is the high noise level that is present in the measured matrix Hmea . This noise is especially significant if there are moving objects in the scene, leading to a great
CE
number of invalid measurements along the objects’ edges (Jimenez et al., 2014) (motion artifacts). Another noise source is the multipath interference (Jim´enez et al., 2014). ToF camera manufacturers detect if a measurement zmea pq is not valid, and they indicate this circumstance
AC
170
i, j
by assigning a predetermined value to these invalid measurements (e.g. in the PMD S3 camera the invalid R
mea measurements have a value zmea pq = 2mm whereas in Microsoft Kinect v2 the assigned value is zpq = 0mm). i, j
i, j
In order to reduce the noise as well as the number of invalid measurements, we have implemented a noise reduction algorithm that includes two stages. In the first one, the invalid measurements are corrected using 175
the information of the nearest neighbors pixels. Then, a mean filter is used to smooth the detected surfaces. In this work, we consider as invalid values those hmea i, j which are provided by the camera as invalid 10
ACCEPTED MANUSCRIPT
measurements (associated to invalid coordinate measurements zmea pq ) as well as those with a height greater i, j
180
than the maximum height for a person. We define the set of pixels with an invalid measurement given by the n o n o null-h pmax = q mea camera as Inull-camera = qm,n / zmea m,n / hm,n > h pmax as the set of pixels pqm,n is invalid , and, I
with a height greater than the maximum allowed value for a person (h pmax = 220cm in this work). From them, the full set of invalid measurements, referred to as Inull , is Inull = Inull-camera ∪ Inull-h pmax .
CR IP T
Next, for each qi, j ∈ Inull a new height value is estimated, referred to as b hmea i, j , following the procedure
described next:
1. First, there is a search for valid height values in a neighborhood area around every pixel qi, j ∈ Inull . 185
q
This search is carried out in the 8 nearest neighbors, in the neighborhood area V1 i, j . If there are no valid q
AN US
heights in this neighborhood area, the search continues in the neighborhood area of level two V2 i, j . For q q the first neighbor level v∗ in which a valid pixel is found (Vv∗i, j , with v∗ = argmin1≤l≤2 l/Vl i, j , Ø ), mea the estimated value b hmea i, j is obtained as the average of the valid values hm,n in that neighborhood level. ∗ q If hmea,v are the valid values in Vv∗i, j , then the value of b hmea m,n i, j is given by:
mea,v∗ b hmea = average h m,n i, j ∗
qi, j
(4)
hmea,v ∈Vv∗ m,n
The height measurement matrix obtained after removing the invalid measurement will be referred to
M
190
as Hval ∈ R M×N .
If no valid height measurement is obtained for any $v \in \{1, 2\}$, we consider that the number of measurement errors in the image is excessive, so the image is rejected and a new one is processed. This condition may seem very stringent, but in the experimental conditions of this work we have confirmed that the probability of having a 5×5 pixel area without any valid measurement is certainly low. In fact, the created dataset does not include any image with a 5×5 pixel area only containing invalid pixels. Nevertheless, in case an image is rejected, the people's positions could be recovered by including a tracking stage in the proposal.

In the search for valid pixels, we use a maximum neighborhood level of two because we consider that, up to that level, it can be guaranteed that there is a high correlation between the pixels' information, given the camera position and the image resolution used.
2. Mean filter: once the matrix $\mathbf{H}^{val}$ has been obtained, a nine-element mean filter is used to estimate a new height value for each pixel, called $\hat{h}^{val}_{i,j}$ and calculated as:

$\hat{h}^{val}_{i,j} = \underset{0 \le \Delta q \le 1,\ 0 \le \Delta r \le 1}{\operatorname{average}} \left( \hat{h}^{mea}_{i \pm \Delta q,\, j \pm \Delta r} \right)$    (5)
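As an illustration only (the paper does not provide code), the two-stage noise reduction of this subsection could be sketched in Python as follows; the function name reduce_noise and the use of NaN to mark camera-flagged invalid measurements are assumptions of the example:

import numpy as np

def reduce_noise(h_mea, h_pmax=220.0):
    """Two-stage noise reduction sketch: invalid-pixel correction followed by a 3x3 mean filter.

    h_mea : (M, N) array of measured heights in cm; NaN marks camera-flagged invalid pixels.
    Returns the filtered height matrix H, or None if the image is rejected.
    """
    h = h_mea.astype(float).copy()
    invalid = np.isnan(h) | (h > h_pmax)          # I^null = I^null-camera U I^null-hpmax
    h[invalid] = np.nan

    # Stage 1: replace each invalid pixel with the average of its valid neighbors,
    # searching neighborhood level v = 1 first, then v = 2.
    M, N = h.shape
    h_val = h.copy()
    for i, j in zip(*np.where(invalid)):
        for v in (1, 2):
            window = h[max(0, i - v):i + v + 1, max(0, j - v):j + v + 1]
            neighbors = window[~np.isnan(window)]
            if neighbors.size > 0:
                h_val[i, j] = neighbors.mean()
                break
        else:
            return None                            # a 5x5 area with no valid pixel: reject the image

    # Stage 2: nine-element (3x3) mean filter to smooth the surfaces (Eq. 5).
    padded = np.pad(h_val, 1, mode='edge')
    H = np.zeros_like(h_val)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            H += padded[1 + di:1 + di + M, 1 + dj:1 + dj + N]
    return H / 9.0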
After the filtering process, the obtained height measurement matrix is finally referred to as $\mathbf{H}$:

$\mathbf{H} = \left[ \hat{h}^{val}_{i,j} \right] = \left[ h_{i,j} \right],$    (6)

where the notation $h_{i,j}$ is used to refer to the heights of each pixel in the image plane after the correction of the invalid values and the application of the mean filter. This notation is used in order to simplify the expressions below.

3.3. Regions of Interest (ROI's) Estimation
In this work, the regions of interest (ROI's) are defined as the pixels around each detected isolated maximum that belong to the same object (which can be a person or not). Since the body parts of interest for this solution are the head, neck and shoulders, the ROI's should include all of them. Additionally, their estimation should be precise enough so that they do not include measurements from any other nearby people or objects in the scene, even if the distance between them is small.

In order to determine the ROI associated to each isolated maximum, the initial criterion is that the height difference between the highest point on the head and the shoulders should not be greater than an interest height $h_{interest}$. In this work, we have selected $h_{interest}$ = 40 cm, based on anthropometric considerations (Bushby et al., 1992; Matzner et al., 2015). Because of that (refer to Figure 6), once a maximum with a height $h^{max}$ has been detected, the ROI around that maximum initially includes all the pixels whose heights $h_{i,j}$ fulfill:

$h^{max} - h_{i,j} \le h_{interest} = 40\ \mathrm{cm}$    (7)
The initial ROI should be adjusted taking into account additional considerations in order to achieve a correct segmentation for the extraction of reliable features that will be used in the classification process. It should work even if people are close to each other or if there are partial occlusions. With this objective, we have designed a robust algorithm for ROI's estimation that is composed of two stages: the robust detection of maximums in the height matrix, which will provide a list of possible candidates to be identified as people; and a second stage of robust ROI's adjustment, for facing situations of highly populated environments (where people are very close to each other) and occlusions.

3.3.1. Detection of Local Maxima

In order to select which regions in the height matrix $\mathbf{H}$ can actually correspond to people or other objects, we have developed a robust local maxima detection algorithm, which is detailed below.
Figure 6: 3D projections and representations of a person in the hinterest height range.
1. Division of the image plane into sub-regions: Assuming that the ToF camera has a spatial resolution $M \times N$, in this stage the image is divided into square sub-regions (SR's) with dimensions $D \times D$ pixels. Thus, the number of sub-regions will be $N_r \times N_c$ (refer to Figure 7), with:

$N_r = \dfrac{M}{D}; \quad N_c = \dfrac{N}{D}$    (8)

The value of $D$ is set as a function of the camera intrinsic parameters, the camera height related to the floor plane ($h_{camera}$), the minimum height of people to be detected ($h_{pmin}$), and the minimum area ($l \times l$), in metric coordinates, that the top part of the head of a person with height $h_{pmin}$ can occupy. In this work, the value of $D$ is given by:

$D = \dfrac{f}{a} \cdot \dfrac{l}{h_{camera} - h_{pmin}}$    (9)

where $f$ is the camera focal length, and $a$ is the pixel dimension (assuming square pixels). In our experimental tests with Kinect v2 cameras, the used values are $\frac{f}{a} = 365.77$, $h_{camera}$ = 340 cm and $h_{pmin}$ = 100 cm. Based on anthropometric characteristics of the human body (Bushby et al., 1992; Matzner et al., 2015), we have set $l$ = 13 cm in this work. This value guarantees that the overhead view of a person whose height is $h_{pmin}$ covers several SR's. From all these considerations, the calculated value for $D$ is 20 pixels. For $l$ = 10 cm, $D$ would become 15 pixels, and the obtained results would be similar.

Figure 7: Division of the original pixel space ($q_{i,j}$ pixels) in $SR_{r,c}$ subregions of dimension $D \times D$ (3 × 3 in this example).

Therefore, each $SR_{r,c}$ (note that the coordinates $r, c$ of the subregion in the new plane partition have already been included in the notation), with $1 \le r \le N_r$ and $1 \le c \le N_c$, includes the corresponding pixels from the image plane:

$SR_{r,c} = \left\{ q_{i,j} \ / \ (r-1)D + 1 \le i \le (r-1)D + D \ \text{and} \ (c-1)D + 1 \le j \le (c-1)D + D \right\},$    (10)
and will have assigned the corresponding heights from matrix $\mathbf{H}$, as shown in equation (11):

$\mathbf{H}^{SR_{r,c}} = \begin{bmatrix} h_{(r-1)D+1,(c-1)D+1} & \cdots & h_{(r-1)D+1,(c-1)D+D} \\ \vdots & \ddots & \vdots \\ h_{(r-1)D+D,(c-1)D+1} & \cdots & h_{(r-1)D+D,(c-1)D+D} \end{bmatrix}$    (11)

where $\mathbf{H}^{SR_{r,c}} \in \mathbb{R}^{D \times D}$.

2. Calculation of maximums: identifying as $h^{maxSR}_{r,c}$ the maximum height value associated to each $SR_{r,c}$ ($h^{maxSR}_{r,c} = \max_{\forall q_{i,j} \in SR_{r,c}} h_{i,j}$), a matrix $\mathbf{H}^{maxSR} \in \mathbb{R}^{N_r \times N_c}$ is constructed as follows:

$\mathbf{H}^{maxSR} = \begin{bmatrix} h^{maxSR}_{1,1} & \cdots & h^{maxSR}_{1,N_c} \\ \vdots & \ddots & \vdots \\ h^{maxSR}_{N_r,1} & \cdots & h^{maxSR}_{N_r,N_c} \end{bmatrix}$    (12)
Each value $h^{maxSR}_{r,c}$ in $\mathbf{H}^{maxSR}$ is considered as a candidate to correspond to a person if it initially fulfills the following condition (see Figure 8):

$h_{pmin} \le h^{maxSR}_{r,c} \ge h^{maxSR}_{r^*,c^*}, \quad \forall h^{maxSR}_{r^*,c^*} \in \mathcal{V}_1^{SR_{r,c}}$    (13)
ACCEPTED MANUSCRIPT
SRr,c
AN US
Figure 8: Scheme to obtain mkr,c , in the HSR matrix, where the thick border square shows the level 1 neighboring region (V1
).
In order to simplify the notation, all the values of hmaxSR that fulfill the previous condition will be r,c referred to as mkr,c with 1 ≤ k ≤ NP , where k indexes each maximum height in a list of them, and NP is the number of detected maximums.
∗
245
M
Given that it is very probable that nearby SR’s will belong to the same person, when nearby mkr∗ ,c∗ are found, they will be substituted by a single one, that with the highest value, that will represent all the
ED
others. In our case, and taking into account the dimensions of the SR’s, the camera placement and characteristics, and the people height range, we will consider close SR’s to a given SRr,c , those that SRr,c
are within its neighbor area of order 2, that is, those included in V2 ∈
SR V2 r,c ,
.
the one with the highest value is chosen, assigning to it the
coordinates of the SR that is the nearest to the centroid of the SR’s that have an associated mkr,c . That n o is, identifying as SRr1 ,c1 , SRr2 ,c:2 , . . . , SRrNS Rp ,cNS Rp (being NS Rp the number of nearby subregions
CE
250
PT
As a consequence, from all m
k∗ r∗ ,c∗
AC
with associated mkr,c ), the coordinates to be assigned to the associated maximum will be:
(14)
Where round {·} is a function that rounds to the nearest integer. That maximum will take the value ∗ mkr,c = maxmk∗ ∈VSRr,c mkr∗ ,c∗ , The left side of Figure 9 shows an example of a person with a depth r∗ ,c∗
255
PN PN S Rp S Rp l=1 cl l=1 rl ; c = round r = round NS Rp NS Rp
2
map with two nearby maximums, and the right side of the same figure shows the final result, with a single maximum set to the highest value of the nearby ones. After this process, the result is a set of NP maximums that are candidates to correspond to NP people. 15
CR IP T
ACCEPTED MANUSCRIPT
Figure 9: Scheme to obtain mkr,c , when nearby maximums are found. The person depth map shows two maximums (left graphic, ∗
mkr,c and mkr∗ ,c∗ ) which will be merged (right graphic).
AN US
3.3.2. ROI’s Adjustment
Since the body parts of interest for the feature extraction and classification are the head, neck and shoul260
ders, the ROI’s should include this three body parts. Thus, each ROI, associated to each mkr,c will always comprise several SR’s. The set of all the SR’s associated with a mkr,c will be referred to as ROIkr,c . Figure 10
shows a diagram that includes, as an example, the SR’s corresponding to the ROI’s associated with two
M
people.
In order to robustly determine the SR’s that belong to the ROIkr,c associated to each mkr,c , we have designed an algorithm that accurately searches for the boundaries between each candidate to be a person
ED
265
and its environment. The proposal exhaustively and sequentially analyzes a set of neighborhood areas, in
PT
different radial directions starting from the SR under study. The analyzed SR’s are added to the ROIkr,c if they fulfill several conditions.
270
CE
In what follows, we will use M to refer to the set of all the SRkr,c that have an associated mkr,c . As seen in section 3.3.1, C (M) = NP is satisfied in that set. ∗
In order to select the SR’s (SRkr∗ ,c∗ ) that belong to each ROIkr,c , NV local neighborhood area levels
AC
and 8 radial directions δi (1 ≤ i ≤ 8) will be established around each SRkr,c ∈ M, as shown in Figure 11. Algorithm 1 describes the procedure for ROI’s adjustment, for which we provide next some considerations and comments (refer to the left side of Algorithm 1 for the locations of the comments discussed below):
275
• C1: The sub-region SRkr,c where the mkr,c has been detected always belong to the ROIkr,c . • C2: The SR’s in the level 1 neighborhood area of SRkr,c , always belong to the ROIkr,c , provided they are within the interest heights limits. 16
AN US
CR IP T
ACCEPTED MANUSCRIPT
AC
CE
PT
ED
M
Figure 10: Example of the SR’s subregion set forming the ROIs associated to two people.
Figure 11: Directions and neighboring levels used to decide which SR’s belong to each ROIkr,c .
17
ACCEPTED MANUSCRIPT
foreach SRkr,c ∈ M do ROIkr,c ←− ∅
SRkr,c
∗
for SRkr∗ ,c∗ ∈ V1 [C3] [C3]
∗
ROIkr,c ←− ROIkr,c ∪ SRkr∗ ,c∗
CR IP T
[C1,2]
/hmaxSR ≥ hmaxSR − hinterest do r,c r∗ ,c∗
for i = 1..8 do for v∗ = 2..NV do
SRk
∗
r,c foreach SRkr∗ ,c∗ ∈ Lv∗ −1,δ do ∗
[C3]
if
[C4]
i
hmaxSR r,c
− hinterest then SRkr,c if 1 ≤ i∗ ≤ 4 and C ROIkr,c ∩ Lv∗ −1,δ ≥ v∗ − 1 then ∗ ≥
AN US
[C5]
hmaxSR r∗ ,c∗
i
if checkDecreasingHeightsAndAdd() =FALSE then break to for i SRkr,c if 5 ≤ i∗ ≤ 8 and C ROIkr,c ∩ Lv∗ −1,δ = 1 then ∗
[C6]
i
M
if checkDecreasingHeightsAndAdd() =FALSE then break to for i
ED
Function checkDecreasingHeightsAndAdd() r0 = r∗ − ∆r; c0 = c∗ − ∆c (see Table 1) r00 = r∗ + ∆r; c00 = c∗ + ∆c (see Table 1)
PT
[C8]
if hmaxSR ≥ hmaxSR − hinterest then r,c r0 ,c0
if hmaxSR ≥ hmaxSR ≥ hmaxSR r∗ ,c∗ r0 ,c0 r00 ,c00 then ∗ k ROIr,c ←− ROIkr,c ∪ SRkr∗ ,c∗
CE
[C7]
return TRUE;
else
∗
ROIkr,c ←− ROIkr,c ∪ 12 SRkr∗ ,c∗
AC
[C9]
return FALSE;
else
[C10]
return FALSE Algorithm 1: ROIs Adjustment algorithm.
Table 1: Values taken by the ∆r, ∆c variables for each δ_i direction in Algorithm 1.

         δ1    δ2    δ3    δ4    δ5    δ6    δ7    δ8
  ∆r     -1     0     1     0    -1     1     1    -1
  ∆c      0     1     0    -1     1     1    -1    -1
The decision of including the level 1 neighborhood areas in the $ROI^k_{r,c}$ is based on the sub-region size, the dimensions of people in the scene, and the camera characteristics. Thus, the sub-regions that are adjacent to the considered one always belong to the ROI (if their height is adequate).

• C3: For all the directions $\delta_{i^*}$ ($1 \le i^* \le 8$), all the SR's belonging to the local neighborhood areas of $SR^k_{r,c}$ with a neighborhood distance $2 \le v^* \le N_V$ are analyzed. These SR's are $SR^{k^*}_{r^*,c^*} \in \mathcal{L}^{SR^k_{r,c}}_{v^*-1,\delta_{i^*}}$, with $2 \le v^* \le N_V$. Then each $SR^{k^*}_{r^*,c^*}$ will be added to the $ROI^k_{r,c}$ if it fulfills the conditions below (C4 through C10).

• C4: The maximum height of the $SR^{k^*}_{r^*,c^*}$ considered for its inclusion in a given $ROI^k_{r,c}$ should be within the interest height limits.

• The number of non-shared sub-regions belonging to the ROI in the local neighborhood area with neighborhood distance immediately below the considered one must be:

  – C5: for the horizontal and vertical directions $\delta_1$, $\delta_2$, $\delta_3$ and $\delta_4$, at least equal to the value of the considered neighborhood distance $v^*$ minus 1.

  – C6: for the diagonal directions $\delta_5$, $\delta_6$, $\delta_7$ and $\delta_8$, equal to 1.

  The objective of these two previous conditions is to guarantee continuity in the ROI building up, so that most of the neighborhood area with distance immediately below the considered one already belongs to that ROI. The condition does not force all that area to belong to the ROI, in order to increase the robustness of the ROI estimation process.

  – C8: since the interest is focused on separating people that are close to each other (especially if they are very close), for each considered direction we will impose that the maximum height in the considered $SR^{k^*}_{r^*,c^*}$ has to be lower than the maximum height of the $SR^{k'}_{r',c'}$ immediately adjacent to it (in the local neighborhood area immediately below), and it has to be greater than the maximum height of the $SR^{k''}_{r'',c''}$ immediately adjacent to it (in the local neighborhood area immediately above). All of this applies provided that the immediately adjacent region in the lower neighborhood level has its maximum height within the heights of interest (C7).

  – C9: if condition [C8] is not fulfilled, then there is a minimum between two adjacent ROI's, thus the considered $SR^{k^*}_{r^*,c^*}$ is shared by the two adjacent ROI's. In this situation, half of the points belonging to that $SR^{k^*}_{r^*,c^*}$ will be assigned to each of the adjacent ROI's, the dividing line being the one perpendicular to the considered direction $\delta_{i^*}$.

  – C10: if conditions [C7] or [C8] are not fulfilled, then the search for adjacent sub-regions of interest is stopped (from the considered SR and for greater neighborhood distances in the corresponding direction).
Figures 12.a and 12.d (as well as their zoomed-in versions in Figures 12.b and 12.e) show an example of the selection of the SR's that belong to each ROI in a situation that includes two people very close to each other, including their shared SR's. Figure 12.b shows a shared SR in the neighborhood distance with level $v^*$ = 2, and another one in the neighborhood distance with level $v^*$ = 1, both of them in the direction $\delta_1$, and also a shared SR in the neighborhood distance with level $v^*$ = 1 in the direction $\delta_8$. Figure 12.e shows shared SR's in the neighborhood distances $v^*$ = 2 and $v^*$ = 3 in the directions $\delta_6$ and $\delta_2$, respectively.

Finally, Figures 12.c and 12.f show two different real examples of the results of the ROIs adjustment algorithm for a scene that includes two people. One of them wears a cap (in Figure 12.c) and another one wears a hat (in Figure 12.f), and both of them are marked with a blue square.
3.4. Feature Extraction
Given that our objective is determining the presence of people considering the morphology of the head, neck and shoulders areas, we have designed a feature set able to model such morphology, using the height
PT
measurements from the ToF camera as the only input, and taking into account anthropometric considerations (related to the visible person profile from an overhead view, the head geometry, etc.). So, the feature vector
325
CE
components will be related to the pixel density associated to the person surface in different height levels within the corresponding ROI. In this work, the feature vector is composed of six components that will
AC
be extracted for each ROIkr,c , as they correspond to possible candidates to be people. Five of these features
will be related to the visible people or objects surfaces at different heights, and the sixth component will correspond to the relationship between the lower and higher diameters of the top surface, providing an idea on the eccentricity of the person head. The calculation of the feature vector components is done following
330
the process described in the next sections and shown in Figure 13.
20
(b) Zoom of shared SR’s of ROIk=1 r,c .
(d) SR’s of ROIk=2 r,c .
(e) Zoom of shared SR’s of ROIk=2 r,c
(c) ROI estimation for a scene with two people, one of them wearing a cap.
(f) ROI estimation for a scene with 2 people, one of them wearing a hat.
ED
M
AN US
(a) SR’s of ROIk=1 r,c .
CR IP T
ACCEPTED MANUSCRIPT
PT
Figure 12: Example of the selection of SR’s shared by two ROI’s.
3.4.1. Pixel Assignment to Height Slices and Counting In this first stage, features related to the pixels associated to the head, neck and shoulders areas are
CE
calculated. As described in Section 3.3, we assume them to be included in a height segment hinterest below the person height. So, starting from that person height at each ROIkr,c equally spaced slices are taken, with a
AC
slice height ∆h (in our work, ∆h = 2cm). So, NF =
hinterest ∆h
ROIkr,c
slices (referred to as F s
) will be considered
for analysis, with 1 ≤ s ≤ NF (NF = 20 in our case). The pixels included in the subregions belonging to the
given ROIkr,c , will be assigned to the corresponding slice, that is: ROIkr,c
Fs
n o = qi, j /qi, j ∈ ROIkr,c and hmaxSR − (s − 1) · ∆h ≥ hi, j > hmaxSR − s · ∆h r,c r,c ROIkr,c
For each of these F s
(15)
slices, the number of height measurements obtained by the ToF camera are ROIk ROIk counted (the number of pixels), being ϕ s r,c = C F s r,c , providing information on the pixel density for 21
k
CR IP T
ACCEPTED MANUSCRIPT
k
Figure 13: Feature extraction process: from ϕROIr,c to b ϕROIr,c .
the identification of the head and shoulders areas of a person. With the values of these densities, a vector k
ϕROIr,c of NF components is generated, where the value of each component coincides with the number of > > k ROIk ROIk ROIk ROIk height measurements in each slice ϕROIr,c = ϕ1 r,c . . . ϕNF r,c = C F1 r,c . . . C FNF r,c . In the
AN US
335
k
left part of Figure 14 an example of the values for the 20 components of a ϕROIr,c vector is shown, for a
person with a silhouette similar to that shown in the central part of the figure. The values in the upper section of the figure correspond to the number of pixels in the head area, those in the intermediate section correspond to the neck area, and those in the lower section correspond to the shoulders area.
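A short sketch of this slice-counting step, under the assumption that the ROI is given as a boolean mask over the height matrix, could be:

import numpy as np

def slice_counts(H, roi_mask, delta_h=2.0, n_slices=20):
    """Number of ROI pixels falling in each delta_h-cm slice below the ROI maximum (Eq. 15).

    Returns the vector phi of length n_slices described in the text.
    """
    heights = H[roi_mask]
    h_max = heights.max()
    phi = np.zeros(n_slices, dtype=int)
    for s in range(1, n_slices + 1):
        upper = h_max - (s - 1) * delta_h
        lower = h_max - s * delta_h
        phi[s - 1] = np.count_nonzero((heights <= upper) & (heights > lower))
    return phi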
M
340
3.4.2. Count Accumulation k
ED
The components of the ϕROIr,c vector are very sensitive to the appearance changes of a person (hair style, hair length, neck height, etc.), the person height, and, additionally, their estimation will be affected by the
345
PT
person position within the scene (taking into account that the camera is in overhead position, there will be areas not seen by the camera). Moreover, the effects of distance on the measurement noise, and the multipath
CE
propagation of ToF cameras must also be taken into account. To minimize the noise measurement errors and the appearance changes of people, an accumulation of the distance measurements in several slices is carried out. Also eccentricity information of the top sec-
AC
350
k
tion of the head will be added (as described in section 3.4.4), building a new feature vector b ϕROIr,c = > ROIk ROIk b ϕ1 r,c . . . b ϕ6 r,c , with 6 components in our case. k
The first three components of b ϕROIr,c include information related to the head, each of them integrating the
number of pixels for three consecutive slices (individual components spans 6 cm each). To avoid problems of a wrong estimation of the person height due to measurement noise, we assume that the first three components k
355
of b ϕROIr,c comprise most of the pixels in the top section of the head. Then, depending on the slice with the k ROIk highest pixel density (µH = argmax1≤s≤3 ϕ s r,c ), the first three components of b ϕROIr,c will be generated 22
PT
ED
M
AN US
CR IP T
ACCEPTED MANUSCRIPT
k
CE
Figure 14: Example of slice segmentation for a person. The number of measured points in each slice is shown in the left (ϕROIr,c ). ROIkr,c
AC
The values for the first 5 components of the feature vector are shown in the right (b ϕ
23
).
ACCEPTED MANUSCRIPT
as follows: b ϕ1
ROIkr,c
=
ROIkr,c
=
ROIkr,c
b ϕ3
=
b ϕ2
ROIk b ϕ1 r,c ROIkr,c if µH = 3 ⇒ b ϕ2 ROIk b ϕ3 r,c
= = =
k
P4
P3
ROIkr,c s=1 ϕ s P6 ROIkr,c s=4 ϕ s P9 ROIkr,c s=7 ϕ s
(16)
ROIkr,c s=2 ϕ s P7 ROIkr,c s=5 ϕ s P10 ROIkr,c s=8 ϕ s
CR IP T
if µH = 1 OR µH = 2 ⇒
(17)
The fourth and fifth components of b ϕROIr,c are related to the shoulders zone, integrating each of them
the number of pixels found in three consecutive slices (individual components spans 6 cm each). Again, to ROIkr,c
360
AN US
increase the estimation robustness, the shoulders zone will be considered to start in any of the F s in a given range of values for s:
Srange
10 ≤ s ≤ 16 if µH = 1 OR µH = 2 = 11 ≤ s ≤ 16 if µH = 3
slices
(18)
M
ROIk so that the slice with the highest pixel density will be selected (µS = argmaxSrange ϕ s r,c ), and from it, the k
ED
values of the fourth and fifth components of b ϕROIr,c will be generated, as follows:
PT
ROIkr,c
b ϕ4
=
µX S +1
ROIkr,c
ϕs
s=µS −1
ROIkr,c
b ϕ5
=
µX S +4
(19) ROIkr,c
ϕs
s=µS +2
CE
Figure 14 includes an example in which the different height slices are shown, determining the values of k
AC
365
k
the feature vector components ϕROIr,c and b ϕROIr,c . 3.4.3. Normalization
As can be seen from the calculation scheme shown in Figure 14, the number of pixels associated to ROIkr,c
height measurements in the different ∆h slices, depends on the person height. So, the components b ϕ1
ROIk b ϕ5 r,c
to
associated to people will be also dependent on height, thus being necessary to normalize them.
To carry out the normalization, the relationship between the maximum height (b hmaxSR ) and the number r,c ROIkr,c
of detected pixels by the camera in the top section of the head (b ϕ1
24
) will be calculated. As an initial
ROIkr,c
curve.
AN US
Figure 15: ρ1
CR IP T
ACCEPTED MANUSCRIPT
approximation, a quadratic relationship has been assumed, so that: ROIkr,c
ϕ1
ROIkr,c
≈ ρ1
2 = a0 b hmaxSR + a1b hmaxSR + a2 r,c r,c
(20)
where a0 , a1 and a2 are the coefficients to estimate.
The Levenberg-Marquardt algorithm was used for the determination of those coefficients, as those that ROIkr,c b best fit the input data set b hmaxSR , ϕ , selected from the training database, and according to the non r,c 1 ROIkr,c
in equation (20). For a sample set of people with heights between 140 cm and 213 cm, ROIkr,c
ED
linear function ρ1
M
370
and for each height, the average number of pixels of ϕ1
was calculated, and from it, the normalization
curve was obtained, along with the coefficient values a0 , a1 and a2 . Figure 15 includes a graphic in which the training data and the fitted curve is shown (resulting a0 = 0.138, a1 = −36.94 and a2 = 2997), obtaining
PT
375
a mean square error of 45 pixels approximately. k
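The quadratic normalization curve of equation (20) can be fitted, for instance, with SciPy's Levenberg–Marquardt implementation; the following sketch uses illustrative placeholder data, not the real training set:

import numpy as np
from scipy.optimize import curve_fit

def rho1(h_max, a0, a1, a2):
    """Quadratic model relating the ROI maximum height to the expected head-top pixel count (Eq. 20)."""
    return a0 * h_max ** 2 + a1 * h_max + a2

# h_max_train: maximum heights (cm) of training ROIs; phi1_train: measured phi_1 values for them.
h_max_train = np.array([140.0, 160.0, 180.0, 200.0, 213.0])   # example values only
phi1_train = np.array([420.0, 600.0, 830.0, 1110.0, 1300.0])  # example values only

(a0, a1, a2), _ = curve_fit(rho1, h_max_train, phi1_train, p0=(0.1, -30.0, 3000.0), method='lm')
phi_norm = phi1_train / rho1(h_max_train, a0, a1, a2)          # normalization of Eq. (21)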
CE
Finally, the normalized vector will be obtained dividing the feature vector components of b ϕROIr,c , by the ROIkr,c
AC
estimated ρ1
3.4.4. Eccentricity Calculation
380
ROIk
(that will generate a new normalized vector b ϕnormr,c ): ROIk b ϕi,normr,c
ROIkr,c
=
b ϕi
ROIk ρ1 r,c
, with 1 ≤ i ≤ 5
(21)
ROIk
The five normalized components of the feature vector b ϕnormr,c already described, provide information
on the top view of people and objects with different heights, but initial experiments on people detection showed the need to also include more information related to the expected overhead geometry of the head. So, to incorporate information on the way that pixels are distributed in the top section of the head, a sixth 25
(a) 3D point cloud measures
(b) 2D depth map
CR IP T
ACCEPTED MANUSCRIPT
(c) Sample feature vector values
Figure 16: Example of a frame with 8 people. In Subfigure (c), top graphic corresponds to a person 165cm tall and long hair (that in the north-west position in the group of Subfigure (b), and the bottom graphic to a person 202cm tall and short hair (that in the
(a) 3D point cloud measures
M
AN US
center position in the group of Subfigure (b)).
(b) 2D depth map
(c) Sample feature vector values
ED
Figure 17: Example of a frame with two people, one of them wearing a hat.
component has been added to the feature vector, at heights between hmaxSR and hmaxSR − 6cm (as was r,c r,c discussed above). If the function that calculates the relationship between the major and minor axes of the
PT
385
region located 6cm (3∆h) below the maximum height is referred to as rba (operating on a set of pixels), the
AC
CE
sixth component will be:
o n ROIkr,c b ϕ6,norm = rba qi, j /qi, j ∈ ROIkr,c and hmaxSR ≥ hi, j > hmaxSR − 3∆h r,c r,c
(22)
To provide real examples on the used dataset, in Figures 16, 17 and 18 we show examples of the 3D
representation of scenes with 8, 2 and 1 people, respectively, their corresponding 2D depth maps, and feature
390
vectors for selected elements. Figure 19 shows also an example of a person pushing a chair. 3.5. People Classification Given that the objective of this work is detecting people in a scene in which there may be other static or moving objects (chairs, for example), it’s necessary to implement a classifier able to differentiate the feature 26
(b) 2D depth map
(c) Sample feature vector values
AN US
(a) 3D point cloud measures
CR IP T
ACCEPTED MANUSCRIPT
AC
CE
PT
ED
M
Figure 18: Example of a frame with one person moving his fists up and down.
(a) 3D point cloud measures
(b) 2D depth map
(c) Sample feature vector values
Figure 19: Example of a frame with one person pushing a chair.
27
ACCEPTED MANUSCRIPT
vectors obtained for the different ROIkr,c , as corresponding or not to people. 395
ROIk
From now on, and with the objective of simplifying the notation, the feature vector b ϕnormr,c ∈ R6 obtained
for each ROIkr,c , will be referred to as Ψ (Ψ ∈ R6 ).
In our task, the values of the Ψ feature vector components significantly change when people carry accessories occluding the head and shoulders (wearing hats, caps, etc.). This is why two classes have been defined
400
CR IP T
(α ∈ 1, 2), for people without and with accessories respectively.
To carry out such classification, a classifier based on Principal Component Analysis (PCA) was selected (Shlens, 2014; Jim´enez et al., 2005), thus requiring an offline estimation of the models for each class, prior to the online classification process. 3.5.1. Model Estimation (Offline Process)
405
AN US
In the offline process, the two transformation matrices Uα are calculated. To do so, Nα training vectors were used, associated to different people representative of each of the two classes Ψαi (α = 1, 2, i = 1 · · · Nα ). From those Nα vectors, their average value and scatter matrices are calculated, for each class: Nα 1 X Ψαi Nα i=1
Nα X
M
Ψα = ST α =
i=1
(Ψαi − Ψα )(Ψαi − Ψα )
(23)
>
ED
Matrices Uα for each class α are formed by the eigenvectors associated to the highest eigenvalues of the corresponding scatter matrices ST α (Shlens, 2014; Jim´enez et al., 2005). In our case, three eigenvectors have
than 90%, that is:
CE
410
PT
been chosen, following the criteria that the average normalized residual quadratic error (RMSE) is higher
P6
j=m+1 λα j
RMS E = P6
j=1 λα j
> 0.9
(24)
AC
resulting m = 3 and, consequently, Uα ∈ R6x3 . 3.5.2. Person detection (Online Process)
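A compact sketch of this offline model estimation (class mean, scatter matrix and the m = 3 leading eigenvectors) could be written as follows; Psi_alpha stands for the N_alpha × 6 matrix of training feature vectors of one class, and the function name is ours:

import numpy as np

def train_pca_class_model(Psi_alpha, m=3):
    """Return (mean vector, projection matrix U_alpha) for one class (Eqs. 23-24).

    Psi_alpha : (N_alpha, 6) array of training feature vectors of the class.
    """
    mean = Psi_alpha.mean(axis=0)
    centered = Psi_alpha - mean
    scatter = centered.T @ centered               # scatter matrix S_T_alpha of Eq. (23)
    eigvals, eigvecs = np.linalg.eigh(scatter)    # eigenvalues in ascending order
    U_alpha = eigvecs[:, ::-1][:, :m]             # eigenvectors of the m largest eigenvalues
    return mean, U_alpha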
In the classification process (online process), the Ψ feature vector of each ROIkr,c is calculated, and
for each class α, the difference between this vector and the average vector class Ψα is projected in the 415
transformed space (Φα = Ψ − Ψα ).
The projected vector will then be Ωα = U> α Φα . Next, the projected vector is recovered in the original
b α = Uα Ωα . The Euclidean distance between Φα and Φ b α is called the reconstruction error α . This space Φ 28
ACCEPTED MANUSCRIPT
process is applied for each of the two classes. Finally, a feature vector is classified as corresponding to a person if its reconstruction error is lower than 420
a given threshold for any of both transformations. That is:
b 1 || ≤ T h1 OR 2 = ||Φ − Φ b 2 || ≤ T h2 Ψ is person if 1 = ||Φ − Φ
(25)
using the following equation:
T hα = α + 3σα
CR IP T
where T hα is the threshold for the α class, that in our case was determined experimentally for each class,
(26)
where α is the average value of the reconstruction error and σα is its standard deviation, for Nα people with
425
AN US
different characteristics, and in different scene positions.
In case that the condition in equation 25 does not hold, the feature vector is considered not to correspond to a person. 4. Experimental Work
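The online test of equations (25) and (26) then reduces to a few lines; in this sketch the threshold values are assumed to have been computed beforehand from the training reconstruction errors, as described above:

import numpy as np

def is_person(psi, models, thresholds):
    """Classify a feature vector psi (length 6) as person / not person (Eq. 25).

    models     : list of (mean, U) pairs, one per class, from train_pca_class_model().
    thresholds : list of per-class thresholds Th_alpha = mean_error + 3 * std_error (Eq. 26).
    """
    for (mean, U), th in zip(models, thresholds):
        phi = psi - mean                        # difference with the class average vector
        phi_hat = U @ (U.T @ phi)               # projection and recovery in the original space
        error = np.linalg.norm(phi - phi_hat)   # reconstruction error epsilon_alpha
        if error <= th:
            return True                         # accepted by one of the "person" classes
    return False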
M
4.1. Experimental Setup
In order to provide data for training and evaluating the proposal, we used part of the GOTPD1 database (available at (Macias-Guarasa et al., 2016)), that was recorded with a Kinect R v2 device located at a height
ED
430
of 3.4m. The recordings tried to cover a broad variety of conditions, with scenarios comprising:
PT
• Single and multiple people
CE
• Single and multiple non-people (such as chairs) • People with and without accessories (hats, caps) • People with different complexity, height, hair color, and hair configuration
AC
435
• People actively moving and performing additional actions (such as using their mobile phones, moving their fists up and down, moving their arms, etc.).
The data used was split in two subsets, one for training and the other for testing. The subsets are fully independent, so that no person present in the training database was present in the testing subset. 440
Table 2 and Table 3 show the details of the training and testing subsets, respectively. #Samples refers to the number of all the heads over all the frames in the recorded sequences (in our recordings we used 39 29
ACCEPTED MANUSCRIPT
Table 2: Training subset details.

  Sequence ID                        #Samples   Description                  Class
  seq-P01-M02-A0001-G00-C00-S0041    141        Single person                Class 1: Person without accessories
  seq-P01-M02-A0001-G00-C00-S0042    299        Single person                Class 1: Person without accessories
  seq-P01-M02-A0001-G00-C00-S0043    566        Single person                Class 1: Person without accessories
  seq-P01-M02-A0001-G00-C00-S0044    149        Single person                Class 1: Person without accessories
  seq-P01-M04-A0001-G00-C00-S0045    226        Single person                Class 1: Person without accessories
  seq-P01-M04-A0001-G00-C00-S0046    301        Single person                Class 1: Person without accessories
  seq-P01-M02-A0001-G00-C02-S0047    221        Multiple people with         Class 2: Person with
                                                accessories (hats, caps)     accessories (hats, caps)
  seq-P01-M02-A0001-G00-C02-S0048    152        Multiple people with         Class 2: Person with
                                                accessories (hats, caps)     accessories (hats, caps)
Table 3: Testing subset details, specifying the total number of samples, and the number of samples for classes 1 and 2.

  Sequence type                               #Samples   #Class1   #Class2
  Sequences with a single person              5317       5317      0
  Sequences with two people                   933        833       100
  Sequences with more than two people         8577       6929      1648
  Sequences with chairs and people
  balancing fists facing up                   830        830       0
  Totals                                      15657      13909     1748
PT
different people). The database contains sequences in which the users were instructed on how to move under the camera (to allow for proper coverage of the recording area), and sequences where people moved freely
4.2. Comparison with other methods
AC
445
CE
(to allow for a more natural behavior)2 .
4.2.1. Performance comparison In order to evaluate the improvements of the proposed method compared with others in the literature,
we first chose the recent work described in (Stahlschmidt et al., 2014), given the similarity of the task. For the comparison, we run preliminary experiments on a subset of the testing database. In the comparison we
450
did not use a tracking module for any of the methods (to provide a fair comparison on their discrimination 2
This is fully detailed in the documentation distributed with the database at (Macias-Guarasa et al., 2016)
30
ACCEPTED MANUSCRIPT
Table 4: Comparison results with the strategy described in (Stahlschmidt et al., 2014).

  Sequence type                            #Samples   TPR ST2014       TPR Proposal     FPR ST2014       FPR Proposal
  Single person                            5757       100.00%±0.00%    98.06%±0.36%     0.21%±0.12%      0.23%±0.12%
  Two people                               973        60.43%±3.07%     97.01%±1.08%     0.10%±0.20%      0.32%±0.35%
  More than two people                     2383       84.01%±1.47%     95.90%±0.80%     0.42%±0.26%      0.13%±0.14%
  Sequences with chairs and people
  balancing fists facing up                1042       98.37%±0.77%     97.57%±0.93%     21.02%±2.47%     0.12%±0.21%
  Total counts and average TPR/FPR rates   10155      92.29%±0.52%     97.40%±0.31%     2.38%±0.30%      0.20%±0.09%
ST2014
capabilities without other improvements), and we evaluated the results in correct and false detections (true
AN US
and false positive rates (T PR and FPR, respectively)), with the results shown in Table 4, where #S amples is the number of people heads labeled in the ground truth, ST2014 refers to the results in (Stahlschmidt et al., 2014), and Proposal refers to our results. Table 4 also includes confidence interval values for a confidence 455
level of 95%.
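The ± values reported in Table 4 are consistent with a normal-approximation (Wald) binomial confidence interval at the 95% level. The following Python sketch shows that computation; the example figures are taken from Table 4, but the snippet is only an illustration of how such intervals can be reproduced, not part of the evaluation pipeline itself.

```python
import math

def ci_half_width(p, n, z=1.96):
    """Half-width (in percentage points) of the 95% normal-approximation
    (Wald) confidence interval for a proportion p estimated from n samples."""
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)

# Example: the proposal's TPR on the single-person sequences of Table 4
# (98.06% over 5757 labeled heads) yields an interval of about +/- 0.36%.
print(f"+/- {ci_half_width(0.9806, 5757):.2f} percentage points")
```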
As can be clearly seen, the proposed method outperforms the reference strategy, except for very slight degradations of the TPR in the single person sequences and in the sequences with chairs and people balancing fists facing up (not statistically significant). There is also a very minor (again not statistically significant) degradation of the FPR for sequences with one or two people, but this amounts to only 2 + 1 additional cases out of 5757 + 973 samples, respectively. These results are coherent with our expectations, as the Mexican hat strategy used in (Stahlschmidt et al., 2014) will generate a detection for almost anything that can be assimilated to this shape. This behavior leads to very good performance for single person sequences, but it is accomplished at the expense of generating a much higher number of false positive detections when there are additional objects in the scene (as is the case for sequences with chairs and people balancing their fists facing up).

It can also be observed that the improvements in the TPR are very significant for sequences of multiple people, in which the Mexican hat strategy is not able to accurately separate nearby people. The effect is particularly noticeable in the sequences of two people (with a 60.5% relative improvement), because those recordings were made with both people very close to each other. The observed improvement in sequences with more than two people is lower than with just two people, because in the recordings with more than two people the participants were not requested to remain close to each other and mainly moved freely in the scene, so that the Mexican hat strategy can do a better job in discriminating among the people present.
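For readers unfamiliar with the baseline, the sketch below illustrates the general Mexican-hat filtering idea: the overhead height image is convolved with a 2D Mexican hat kernel and local maxima of the response are taken as head candidates. The kernel scale, window size and threshold here are illustrative assumptions, and the code is not the implementation evaluated in (Stahlschmidt et al., 2014).

```python
import numpy as np
from scipy.ndimage import convolve, maximum_filter

def mexican_hat_kernel(sigma, radius):
    """2D Mexican hat (negated Laplacian-of-Gaussian) kernel."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    r2 = (x * x + y * y) / (2.0 * sigma * sigma)
    return (1.0 - r2) * np.exp(-r2) / (np.pi * sigma ** 4)

def head_candidates(height_map, sigma=8.0, radius=24, thr=0.5):
    """Filter an overhead height map and return local-maxima positions
    whose response exceeds an (illustrative) threshold."""
    response = convolve(height_map.astype(float), mexican_hat_kernel(sigma, radius))
    peaks = (response == maximum_filter(response, size=2 * radius + 1)) & (response > thr)
    return np.argwhere(peaks)  # (row, col) candidate positions
```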
Table 5: Comparison results with the strategy described in (Zhang et al., 2012).

Strategy                Recall            Accuracy          F-score
(Zhang et al., 2012)    99.47% ± 0.42%    99.57% ± 0.38%    99.52% ± 0.40%
Proposal                99.57% ± 0.38%    99.57% ± 0.38%    99.57% ± 0.38%

Note: all metrics take the same value in our results because the numbers of false positives and false negatives are equal.
The last row in Table 4 shows the average results, where we can see that our proposal clearly outperforms the strategy described in (Stahlschmidt et al., 2014), and that the observed differences are statistically significant in the evaluated metrics.
We also compared the performance of our proposal on a different dataset, the one used in (Zhang et al., 2012), kindly provided to us by the authors. The application of our algorithm was challenging for three reasons. First of all, the provided data was generated with a Kinect® v1 sensor, which has a lower resolution than the Kinect® v2 sensor on which we based our development (320x240 vs. 512x424 pixels), and also uses a technology that generates noisier data³. Second, the provided data had already been processed with their background subtraction strategy, so we could not apply our noise reduction algorithm, as we did not have access to the raw depth stream required to do so. Finally, we also had to adapt the PCA models used in the classification stage to the new data, allowing the consideration of a new class to model the frequent presence of people with backpacks and wearing coats with large hoods (either put on their heads, or resting on their shoulders and backs). Given that our system needs to be trained, and to avoid training biases, we generated the models using sequences from the dataset1 described in the paper, and tested our proposal on the dataset2 sequences (which were recorded in a completely different scene from dataset1).

Table 5 shows the comparison results using the same metrics as those used in (Zhang et al., 2012) (to ease the comparison), where we have also added the confidence interval values for a confidence level of 95%: Recall rate is the fraction of people that are detected; Accuracy is the fraction of detected results that are people⁴; and F-score is the trade-off between recall rate and accuracy. It can be clearly seen that the performance of both algorithms is very similar, with a very slight (not statistically significant) advantage of our proposal. It is important to note here that the strategy in (Zhang et al., 2012) assumes that only people are present in the sequences, so that all the detected regions will be considered to be people
(the only validation check the authors use to discriminate between people and other elements in the scene is the area of the detected region bounding box, which is compared to a predefined threshold value). This is actually acknowledged by the authors, who state that "the water filling cannot handle the situation where some moving object is closer to the sensor than head, such as raising hands over head." The use of the classification stage in our proposal avoids such false detections.

³ The Kinect® v1 sensor uses structured light, as opposed to time of flight in the Kinect® v2.
⁴ In the literature, this is usually referred to as "Precision", but we kept the naming used in (Zhang et al., 2012) for easier comparison.

Table 6: Timing details of the proposed algorithm (showing average processing times per frame in ms).

Sequence type               Filter   Max     ROI's   FE      Classify   Total time   FPS
Sequence with 8 users       3.756    0.138   1.132   3.692   0.087      8.805        114
Sequence with 4 users       3.752    0.095   1.016   3.536   0.710      9.110        110
Single user height=178cm    3.772    0.007   0.356   1.331   0.241      5.707        175
Single user height=150cm    3.738    0.012   0.361   1.307   0.254      5.672        176
4.2.2. Computational demands comparison
Regarding the computational demands⁵, we first compared our proposal with that of Stahlschmidt et al. (2014). Table 6 shows the average processing time per frame of our proposal for several sequence types (given that the execution times per frame depend on the number of maxima detected and the conditions found in each frame) and for each of the most relevant stages: noise reduction (column "Filter"), detection of local maxima (column "Max"), ROI estimation (column "ROI's"), feature extraction (column "FE") and classification (column "Classify"). Column "Total time" shows the average total time per frame, and column "FPS" shows the maximum number of frames per second that the algorithm can cope with in real time. It is clear that the most demanding stages are the noise reduction and the feature extraction, which account for 80% of the total execution time. We have also evaluated the best and worst timing cases in the sequences analyzed in Table 6 and found that the number of frames per second that the algorithm can process in real time varies between 43 and 309, thus proving its adequacy for real-time processing.

⁵ All the experiments reported in this section were run on an Asus X553MA laptop, with an Intel BayTrail-M Dual Core Celeron N2840 at 2.58 GHz and 4 GB of RAM.
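Per-stage timings such as those reported in Table 6 can be collected by instrumenting each stage of the pipeline. The sketch below uses placeholder stage functions named after the columns of Table 6; only the timing harness is meaningful, and the stage implementations are hypothetical stand-ins rather than our actual code.

```python
import time

def time_pipeline(frame, stages):
    """Run the (placeholder) pipeline stages on one frame and return the
    per-stage processing time in milliseconds, plus the achievable FPS."""
    timings = {}
    data = frame
    for name, stage in stages:
        t0 = time.perf_counter()
        data = stage(data)
        timings[name] = 1000.0 * (time.perf_counter() - t0)
    total = sum(timings.values())
    timings["Total time"] = total
    timings["FPS"] = 1000.0 / total if total > 0 else float("inf")
    return timings

# Placeholder stages mirroring the columns of Table 6.
stages = [("Filter", lambda d: d), ("Max", lambda d: d), ("ROI's", lambda d: d),
          ("FE", lambda d: d), ("Classify", lambda d: d)]
print(time_pipeline(frame=None, stages=stages))
```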
Table 7 shows the average processing time per frame using the strategy described in (Stahlschmidt et al., 2014) in each of its most relevant stages: preprocessing (column "Prep."), application of the normalized Mexican hat wavelet (column "Wavelet"), and peak detection (column "Peak"). Columns "Total time" and "FPS" have the same meaning as in Table 6. In this case, the timing is very similar across sequence types, as the wavelet filter must be applied over the whole frame, and this stage is also the most time consuming. This is the reason why the strategy in (Stahlschmidt et al., 2014) is much slower than our proposal.

Table 7: Timing details of the strategy described in (Stahlschmidt et al., 2014) (showing average processing times per frame in ms).

Sequence type               Prep.    Wavelet   Peak     Total time   FPS
Sequence with 8 users       11.133   87.439    10.745   109.317      9
Sequence with 4 users       12.837   87.822    10.730   111.389      9
Single user height=178cm    12.959   86.776    10.537   110.272      9
Single user height=150cm    11.616   88.540    10.968   111.124      9
Regarding the computational complexity of the proposal described in (Zhang et al., 2012), the authors state in their paper that the speed of their water filling algorithm is about 30 frames per second (although they do not provide information on the hardware used), which is also lower than in our proposal.
4.3. Results and discussion
To validate the performance of the people detection algorithm, we used sequences in which both the people and the accessories (hats and caps) are different from those used in the offline training process. Experimentation was carried out with sequences involving different numbers of people. In all cases, the selected people had different characteristics with respect to hair style, complexion, height, presence or absence of accessories, etc. Additional sequences also include people moving their fists up and down, and moving three different types of chairs around the scene.
Table 8 shows the classification results, where #Samples is the number of people occurrences, FP and FN are the numbers of false positives and false negatives respectively, and ERR is the system error rate (ERR = 100 · (FP + FN)/#Samples). The table also includes confidence interval values calculated on the ERR metric, for a confidence level of 95%. The last column (#Other) shows the number of other detected maxima in the scene (people's hands, chairs and other objects) that have not been labeled as corresponding to people by the classification stage. As can be clearly seen, this number is high for sequences with a lot of non-people objects, proving the capacity of the proposal to avoid generating false detections without significantly impacting its performance. From Table 8, we can conclude that the performance is very high, with an average error rate of 3.10%. In sequences with one or two people we get the best results, with an average error rate of 2.14%. The error
Table 8: Experimental results.

Sequence                                                     #Samples   FN    FP   ERR %           #Other
Single person                                                5317       96    10   1.99% ± 0.38%   76
Two people                                                   933        28    0    3.00% ± 1.09%   1
More than two people                                         8577       327   4    3.86% ± 0.41%   26
Sequences with chairs and people balancing fists facing up   830        20    0    2.41% ± 1.04%   956
Total counts and average ERR                                 15657      471   14   3.10% ± 0.27%   1059
increases to 3.86% for sequences with more than two people, and it remains at 2.41% in situations where there are increased difficulties due to the presence of chairs and users behaving so as to confuse the system (when they
move their fists up and down). Given these results, we can also state that the system is able to efficiently cope with the approximately 10% of samples of people with accessories (refer to Table 3 for details on the testing subset data).
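The ERR figures and their confidence intervals in Table 8 follow directly from the FP/FN counts; the short sketch below reproduces the totals row as a sanity check, again assuming the normal-approximation (Wald) interval.

```python
import math

def err_with_ci(fp, fn, n_samples, z=1.96):
    """System error rate ERR = 100*(FP+FN)/#Samples and the half-width of
    its 95% normal-approximation confidence interval (both in %)."""
    p = (fp + fn) / n_samples
    return 100.0 * p, 100.0 * z * math.sqrt(p * (1.0 - p) / n_samples)

# Totals row of Table 8: 14 false positives and 471 false negatives
# over 15657 labeled person occurrences.
err, ci = err_with_ci(fp=14, fn=471, n_samples=15657)
print(f"ERR = {err:.2f}% +/- {ci:.2f}%")   # ERR = 3.10% +/- 0.27%
```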
When individually analyzing the frames in which the system made an error, we found that those errors were mainly located near the scene borders, where the capture of objects is less "zenithal" and the noise level of the
measurements is higher (as the illumination intensity reaching the infrared sensor is lower). This noise is
higher for people with lower height (further from the sensor), and with straight black hair (which has a lower light reflection factor). These issues have a relevant impact on the generation of the feature vector values, so that, in some cases, the classification stage is not able to correctly identify them as corresponding or not to people. When these issues affect the classification, they mostly lead to the generation of false
negatives, which are by far the most common type of error in our system, as can be clearly seen in Table 8. The false
positives shown in Table 8 are due to the very few cases in which spurious peaks (due to raised arms or
hands) are incorrectly classified as corresponding to a person.

5. Conclusions
In this work we have presented a system for the robust real-time detection of people, using only the
depth information provided by a ToF camera placed in an overhead position. The proposal comprises several stages that allow the detection of multiple people in the scene in an efficient and robust way, even when the number of people is high and they are very close to each other. It also allows discriminating people from other objects (moving or fixed) present in the scene. Additionally, given that only depth information
is used, people's privacy is guaranteed. This implies an additional advantage over solutions making use of standard RGB or gray-level information. Due to the lack of publicly available datasets fulfilling the requirements of the target application (high quality labeled depth sequences acquired with an overhead ToF camera), we have recorded and labeled a specific database, which is available to the scientific community. This dataset has been used to train and evaluate the PCA models for the two classes defined: people with and without accessories (caps, hats).
For the people detection task on the depth images, we have proposed an algorithm to detect the isolated maxima in the scene (which are candidates to correspond to people), and to precisely define a Region of Interest (ROI) around each of them. From the pixels included in the ROI, a feature vector is extracted, whose component values are related to pixel heights and pixel densities in given areas of the ROI, so that it
allows to properly characterize the people upper body geometry. The classification of such feature vectors
AN US
using PCA techniques allows determining whether they belong or not to the two people classes that have been defined. The system evaluation has followed a rigorous experimental procedure, and has been carried out on previously labeled depth sequences, including a wide range of conditions with respect to number of people present in the scene (considering cases in which they were very close to each other), people complexity, 575
people height, positions of arms and hands (arms up, fists moving up and down, people using their mobile
M
phones, etc.), accessories on the head (caps, hats), and the presence of additional objects such as chairs. The obtained results (3.1% average error rate) are very satisfactory, given the complexity of the task due to the
ED
high variability of the evaluated sequences.
We have also compared our proposal with other alternatives in the literature (both in terms of performance and computational demands), and evaluated our system performance on an alternative challenging data set,
PT
580
proving its ability to efficiently cope with different experimental scenarios.
CE
As a general conclusion, the proposal described in this work allows the robust detection of people present in a given scene, with high performance. It only uses the depth information obtained by a ToF camera, thus being appropriate in those cases where privacy concerns are relevant, also making the system to properly work independently of the lighting conditions in the scene. Additionally, the people detection process can
AC
585
run in real time at an average of 150 fps. Regarding future work, we are planning to address the improvement of the feature extraction and clas-
sification stages, by deriving more robust descriptors to be used in more sophisticated pattern recognition strategies. We will also address the evaluation of the proposal in recently available datasets (such as the one 590
described in (Del Pizzo et al., 2016), which will require additional manual labeling efforts), and complete the GOTPD1 dataset with even more challenging scenarios. 36
ACCEPTED MANUSCRIPT
6. Acknowledgments

This work has been supported by the Spanish Ministry of Economy and Competitiveness under project SPACES-UAH (TIN2013-47630-C2-1-R), and by the University of Alcalá under projects DETECTOR (CCG2015/EXP-019) and ARMIS (CCG2015/EXP-054). We also thank Xucong Zhang for providing us with their evaluation data, and special thanks are given to the anonymous reviewers for their careful reading of our manuscript and their insightful comments and suggestions.

References
Antic, B., Letic, D., Culibrk, D., & Crnojevic, V. (2009). K-means based segmentation for real-time zenithal people counting. In Proceedings of the 16th IEEE International Conference on Image Processing (ICIP'09) (pp. 2537–2540). Piscataway, NJ, USA: IEEE Press. doi:10.1109/ICIP.2009.5414001.

Bevilacqua, A., Di Stefano, L., & Azzari, P. (2006). People tracking using a time-of-flight depth sensor. In Video and Signal Based Surveillance, 2006. AVSS '06. IEEE International Conference on (pp. 89–89). doi:10.1109/AVSS.2006.92.

Bushby, K. M., Cole, T., Matthews, J. N., & Goodship, J. A. (1992). Centiles for adult head circumference. Archives of Disease in Childhood, 67, 1286. doi:10.1136/adc.67.10.1286.

Cai, Z., Yu, Z. L., Liu, H., & Zhang, K. (2014). Counting people in crowded scenes by video analyzing. In Industrial Electronics and Applications (ICIEA), 2014 IEEE 9th Conference on (pp. 1841–1845). doi:10.1109/ICIEA.2014.6931467.

Chan, A., Liang, Z.-S., & Vasconcelos, N. (2008). Privacy preserving crowd monitoring: Counting people without people models or tracking. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (pp. 1–7). doi:10.1109/CVPR.2008.4587569.

Chen, T.-Y., Chen, C.-H., Wang, D.-J., & Kuo, Y.-L. (2010). A people counting system based on face detection. In Genetic and Evolutionary Computing (ICGEC), 2010 Fourth International Conference on (pp. 699–702). doi:10.1109/ICGEC.2010.178.

Dan, B.-K., Kim, Y.-S., Suryanto, Jung, J.-Y., & Ko, S.-J. (2012). Robust people counting system based on sensor fusion. Consumer Electronics, IEEE Transactions on, 58, 1013–1021. doi:10.1109/TCE.2012.6311350.

Del Pizzo, L., Foggia, P., Greco, A., Percannella, G., & Vento, M. (2016). Counting people by RGB or depth overhead cameras. Pattern Recognition Letters.

Galčík, F., & Gargalík, R. (2013). Real-time depth map based people counting. In International Conference on Advanced Concepts for Intelligent Vision Systems (pp. 330–341). Springer.

Jeong, C. Y., Choi, S., & Han, S. W. (2013). A method for counting moving and stationary people by interest point classification. In Image Processing (ICIP), 2013 20th IEEE International Conference on (pp. 4545–4548). doi:10.1109/ICIP.2013.6738936.

Jia, L., & Radke, R. (2014). Using time-of-flight measurements for privacy-preserving tracking in a smart room. Industrial Informatics, IEEE Transactions on, 10, 689–696. doi:10.1109/TII.2013.2251892.

Jimenez, D., Pizarro, D., & Mazo, M. (2014). Single frame correction of motion artifacts in PMD-based time of flight cameras. Image and Vision Computing, 32, 1127–1143. doi:10.1016/j.imavis.2014.08.014.

Jiménez, D., Pizarro, D., Mazo, M., & Palazuelos, S. (2014). Modeling and correction of multipath interference in time of flight cameras. Image and Vision Computing, 32, 1–13. doi:10.1016/j.imavis.2013.10.008.

Jiménez, J. A., Mazo, M., Ureña, J., Hernández, A., Alvarez, F., García, J. J., & Santiso, E. (2005). Using PCA in time-of-flight vectors for reflector recognition and 3-D localization. Robotics, IEEE Transactions on, 21, 909–924. doi:10.1109/TRO.2005.851375.

Lange, R., & Seitz, P. (2001). Solid-state time-of-flight range camera. Quantum Electronics, IEEE Journal of, 37, 390–397. doi:10.1109/3.910448.

Lefloch, D., Cheikh, F. A., Hardeberg, J. Y., Gouton, P., & Picot-Clemente, R. (2008). Real-time people counting system using a single video camera (pp. 681109–681109-12), volume 6811. doi:10.1117/12.766499.

Macias-Guarasa, J., Losada-Gutierrez, C., Fuentes-Jimenez, D., Fernandez, R., Luna, C. A., Fernandez-Rincon, A., & Mazo, M. (2016). The GEINTRA Overhead ToF People Detection (GOTPD1) database. Available online. URL: http://www.geintra-uah.org/datasets/gotpd1 (accessed June 2016).

Matzner, S., Heredia-Langner, A., Amidan, B., Boettcher, E., Lochtefeld, D., & Webb, T. (2015). Standoff human identification using body shape. In Technologies for Homeland Security (HST), 2015 IEEE International Symposium on (pp. 1–6). doi:10.1109/THS.2015.7225300.

Ramanan, D., Forsyth, D. A., & Zisserman, A. (2006). Tracking People by Learning Their Appearance. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29, 65–81. doi:10.1109/tpami.2007.250600.

Rauter, M. (2013). Reliable human detection and tracking in top-view depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 529–534).

Sell, J., & O'Connor, P. (2014). The Xbox One system on a chip and Kinect sensor. Micro, IEEE, 34, 44–53. doi:10.1109/MM.2014.9.

Shlens, J. (2014). A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100. URL: https://arxiv.org/pdf/1404.1100.pdf (accessed June 2016).

Smisek, J., Jancosek, M., & Pajdla, T. (2011). 3D with Kinect. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on (pp. 1154–1160). doi:10.1109/ICCVW.2011.6130380.

Stahlschmidt, C., Gavriilidis, A., Velten, J., & Kummert, A. (2013). People detection and tracking from a top-view position using a time-of-flight camera. In A. Dziech & A. Czyżewski (Eds.), Multimedia Communications, Services and Security (pp. 213–223). Springer Berlin Heidelberg, volume 368 of Communications in Computer and Information Science. doi:10.1007/978-3-642-38559-9_19.

Stahlschmidt, C., Gavriilidis, A., Velten, J., & Kummert, A. (2014). Applications for a people detection and tracking algorithm using a time-of-flight camera. Multimedia Tools and Applications, (pp. 1–18). doi:10.1007/s11042-014-2260-3.

Vera, P., Monjaraz, S., & Salas, J. (2016). Counting pedestrians with a zenithal arrangement of depth cameras. Machine Vision and Applications, 27, 303–315.

Zhang, X., Yan, J., Feng, S., Lei, Z., Yi, D., & Li, S. Z. (2012). Water filling: Unsupervised people counting via vertical Kinect sensor. In Advanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International Conference on (pp. 215–220). IEEE.

Zhu, L., & Wong, K.-H. (2013). Human tracking and counting using the Kinect range sensor based on Adaboost and Kalman filter. In International Symposium on Visual Computing (pp. 582–591). Springer.