CNN-Based Crowd Counting Through IoT: Application For Saudi Public Places



Procedia Computer Science 163 (2019) 134–144
16th International Learning & Technology Conference 2019

Maha Hamdan Alotibi (a, b), Salma Kammoun Jarraya (a, c), Manar Salamah Ali (a), Kawthar Moria (a)

a Department of Computer Science, King Abdul Aziz University, Jeddah, Saudi Arabia
b Department of Computer Science, King Khalid University, Abha, Saudi Arabia
c MIRACL-Laboratory, Sfax University, Tunisia

Abstract

Crowd counting in specific places has recently been considered a significant contribution to many applications in terms of security and economic value. Recently, the Kingdom of Saudi Arabia has considered new ways and methods to diversify its sources of income, and many non-traditional establishments in several fields have been initiated and put in place. However, controlling the number of visitors and participants at events and exhibitions has always been a challenge, and it has always been considered an important success factor for any event. The smart public places approach is one of the inevitable directions of development in Saudi Arabia, where the security, comfort, and safety of crowds are to be controlled and managed using machine learning techniques, more specifically, IoT-based crowd counting techniques. Such a technology will not only help in resolving security and safety problems, but will also play a significant role in reducing waiting time for visitors by giving indicators, projections, and advice on crowded places. In this paper, a mobile-based model is proposed for counting people in highly and lightly crowded public places in Saudi Arabia under various scene conditions with no prior knowledge. The proposed model is built on the pre-trained convolutional neural network (CNN) VGG-16, with some modifications to the last layer of the CNN to increase the efficiency of the training model. In addition to this improvement in efficiency, the proposed method accepts images of arbitrary sizes/scales as inputs. The applicability of the proposed method has been evaluated by incorporating an IoT architecture, in which surveillance cameras are connected to the Internet to capture live pictures of different public places. To achieve this goal, a new and special Saudi people dataset was produced and, together with some existing datasets, used to train the network. The results show a significant improvement in the efficiency of the DCNN over existing counting networks.

© 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the 16th International Learning & Technology Conference 2019.

Keywords: CNN, Mobile Application, People Counting, Internet of Things.


1877-0509; DOI: 10.1016/j.procs.2019.12.095




1. Introduction

Rapid population growth, together with Saudi Arabia's new strategy of relaxing rules and encouraging flexible thinking to modernize the community in the era of IoT, has led to the emergence of new and uncommon social activities, such as entertainment events, sports, and governmental rallies. These new events and activities raise new challenges, including organizing and controlling crowds that are difficult to manage. In such scenarios, ensuring a high level of safety, management, and security requires a careful study of crowd size. While crowd counting technologies have been studied for some time, very limited research has been conducted on crowds with unique and uncommon characteristics, such as Saudi people in Saudi public places. The aim of crowd counting analysis is to calculate the number of people in scenes where the density of people is high. Like other computer vision problems, crowd counting faces several issues, such as non-uniform illumination, extreme clutter, occlusions, non-uniform distribution of people, perspective, intra-scene and inter-scene differences in appearance, and scale, which make it extremely challenging to optimize and unify. The wide range of crowd analysis applications, together with the complexity of the problem, has led researchers in recent years to improve the efficiency of these techniques and to propose new and effective solutions [1]. As with other computer vision problems, crowd counting methodologies have changed considerably over the last years: from detection-based approaches [2][3], regression-based approaches [4][5], density estimation-based approaches [6][7], and early head detection based on Histograms of Oriented Gradients (HOG), to CNN regressors that predict the crowd density.
Studies have shown that, regardless of the pros and cons of other methods, CNN-based regressors considerably outperform traditional crowd counting approaches, because the representations built from local features in those approaches are not good enough to reach a high level of performance and accuracy. The success of CNNs in most computer vision tasks has encouraged researchers to study in depth the nonlinear functions that map crowd images to the corresponding counts or density maps. Most CNN crowd counting techniques still require an input image of a fixed size. This requirement is "artificial" and is likely to reduce the precision with which images are understood. When CNNs were initially proposed for this task, most studies relied on image patches and required fixed-size images [8][9], and the resulting density maps were of weak quality and still need to be improved. More recently, a multi-column style has been introduced that provides better-quality density maps [10][11]. The multi-column style improved performance, but its design has two significant drawbacks when networks go deeper: an ineffective branch structure and very long training times. In this paper, a deeper network called Deeper Convolutional Neural Network (DCNN) is proposed, which focuses on both crowd counting and generating high-quality density maps. Unlike other existing techniques [12], [17], which use a deep CNN as a supplementary component, the proposed method focuses on designing a CNN-based density map generator. 3 × 3 convolution filters are used in all layers to control the complexity of the network, and the model accepts images of flexible size/resolution because its backbone consists purely of convolutional layers.
Therefore, the current study uses the first 8 layers of VGG-16 [14] and modifies the last 8 layers to obtain better results and to accept images of any size. The CNN can efficiently estimate the number of people anywhere and at different crowd levels. Following the IoT concept, the data were collected using cameras in public places, and we aimed to assign IP addresses to the cameras so that they could send images to the counting system. Although about 1.5 million applications are available in the App Store [15] and 1.6 million Android applications exist, only a small number of them focus on reducing people's waiting times, which is the issue our application addresses. The novelty of our method therefore stems from its capacity to overcome several significant challenges for robust counting of visitors in highly and lightly crowded areas. More specifically, the proposed method should work efficiently and correctly indoors, outdoors, and in undefined environments; run on an average smartphone in terms of processor speed, storage, and capacity; perform with high precision under the special nature of Saudi society [Fig 1]; and work efficiently in public places in Saudi Arabia while accepting input images of arbitrary size/scale.

Fig. 1. Different scenes from Saudi public places
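Because every layer of a purely convolutional backbone slides over its input, nothing in such a network fixes the input size; only the overall downsampling factor is fixed. The sketch below illustrates this under stated assumptions and is not the authors' code: it applies the standard convolution output-length formula to a hypothetical layer stack with VGG-16-style 3×3 convolutions, three 2×2 max-pooling layers, six dilated 3×3 layers, and a 1×1 output layer, and reports the density-map size for several input sizes.

```python
def out_len(n, kernel, stride=1, padding=0, dilation=1):
    # Standard output length of a convolution or pooling window along one axis.
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

# Hypothetical stack: ("conv", kernel, dilation) or ("pool",) = 2x2 max-pool, stride 2.
FRONT_END = ([("conv", 3, 1)] * 2 + [("pool",)]
             + [("conv", 3, 1)] * 2 + [("pool",)]
             + [("conv", 3, 1)] * 3 + [("pool",)]
             + [("conv", 3, 1)] * 3)
BACK_END = [("conv", 3, 2)] * 6 + [("conv", 1, 1)]

def density_map_size(height, width, layers=FRONT_END + BACK_END):
    h, w = height, width
    for layer in layers:
        if layer[0] == "conv":
            _, k, d = layer
            pad = d * (k - 1) // 2  # "same" padding: convolutions keep the size
            h, w = out_len(h, k, 1, pad, d), out_len(w, k, 1, pad, d)
        else:
            h, w = out_len(h, 2, 2), out_len(w, 2, 2)  # only pooling shrinks the map
    return h, w

for size in [(480, 640), (768, 1024), (333, 517)]:
    print(size, "->", density_map_size(*size))
# (480, 640) -> (60, 80)
# (768, 1024) -> (96, 128)
# (333, 517) -> (41, 64)
```

With three stride-2 pools the map is roughly one-eighth of the input along each side, whatever the input size; convolutions with "same" padding never change the size.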

The rest of the study is organised as follows: Section 2 summarises crowd counting and density map generation; Section 3 introduces the application model; Section 4 presents the experimental results on several datasets; and Section 5 concludes the study.

2. Background

2.1. An Overview of crowd counting:

Crowd counting is critical for many applications across different industries, and interest in crowd behaviour analysis is therefore rising. Crowd counting problems fall into two categories. The first is region-of-interest (ROI) counting, which estimates the number of people in a chosen region; the choice of region may affect the accuracy and performance of the calculation. The second is line-of-interest (LOI) counting, which calculates the number of people crossing a chosen line. Since the proposed application will be used in public places, ROI counting on a single image is adopted to count both highly and lightly crowded places. Generally, crowd counting methods can be grouped into four main classes: detection-based approaches [16], regression-based approaches [7], density estimation-based approaches [17][18], and CNN-based approaches [19][20][21].

2.2. Related Work:

Owing to the ground-breaking achievements of convolutional neural network (CNN) models, crowd analysis has gained wide interest in recent years, exploiting the capability of CNNs to learn the nonlinear functions that map crowd images to the corresponding density maps or counts. Sindagi and Patel [1] classified these approaches by network property into three categories. First, basic CNN models, which consist of basic CNN layers; the early deep learning methods for estimating crowd density and counts used these models [22][21]. Second, scale-aware models, which are more advanced than basic CNN models and are robust to scale variation; several designs, such as the multi-column architecture, are used to achieve this robustness [23]. Third, context-aware models, which integrate regional and global contextual information from images into the CNN framework [24], thus reducing estimation errors.
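With ROI counting on a single image, a density-map method obtains the count of a region by summing the predicted density over the pixels inside it. A minimal sketch in plain Python, assuming the density map is a list of rows and the ROI is a hypothetical binary mask of the same size:

```python
def roi_count(density, mask):
    # Estimated people in the ROI: sum of the density map where the mask is 1.
    return sum(d for drow, mrow in zip(density, mask)
               for d, m in zip(drow, mrow) if m)

density = [            # toy 4x4 predicted density map (sums to 3 people)
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.25, 0.25, 0.0, 0.0],
    [0.25, 0.25, 0.0, 0.0],
]
upper_half = [[1] * 4 if r < 2 else [0] * 4 for r in range(4)]  # hypothetical ROI
whole_image = [[1] * 4 for _ in range(4)]

print(roi_count(density, upper_half))   # 2.0
print(roi_count(density, whole_image))  # 3.0
```

Because the count is a sum, the same predicted map can answer queries for any number of regions without re-running the network.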




Zhang et al. [11] proposed a multi-column design (MCNN) for images with arbitrary crowd density and perspective. It achieves robustness to large variations in object scale by using large, medium, and small columns, so that filters of different sizes represent objects of different scales. In addition, they proposed a new method for generating ground-truth crowd density maps. The method in [11] trains all regressors of the multi-column network on every input patch. Sam et al. [10], however, argued that because crowd density varies within a picture, training each regressor on a specific subset of training patches would improve performance. They proposed Switching-CNN, which uses a switch classifier to choose the optimal regressor for a given input patch; like the multi-column network, it relies on multiple independent regressors with different receptive fields. In [25], Sindagi and Patel proposed the Contextual Pyramid CNN (CP-CNN), which generates count estimates and higher-quality density maps by explicitly combining local and global contextual information of crowd images. CP-CNN contains four units: a Local Context Estimator (LCE), a Global Context Estimator (GCE), a Fusion-CNN (F-CNN), and a Density Map Estimator (DME). However, it requires a long training time and consists of multiple models, which leads to complex computation. Building on the ideas in [25] and [10], CSRNet [13] improved the quality of density maps with a deeper single-column network built on VGG-16 [26], using the first ten layers of VGG-16 without any fully connected layer.
To maintain the output resolution, dilated convolutional layers are deployed as the back-end of that design to extract deeper saliency information. Its fundamental idea is to implement a deeper CNN that generates high-quality density maps: the CNN captures high-level features with larger receptive fields without necessarily increasing network complexity. Only a few studies have combined IoT with crowd counting. The author of [27] used a camera connected to the network and a detection-based approach to count the persons in the picture. Moreover, the authors of [15] simulated a crowd to calculate the waiting time in a queue.
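The dilated back-end described for CSRNet above enlarges the receptive field without adding weights or lowering resolution. The plain-Python sketch below (a 1D illustration, not CSRNet's implementation) shows that a 3-tap kernel with dilation 2 spans 5 input samples, and how the receptive field of a stack of stride-1 layers grows with the dilation rate.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    # 'Valid' 1D dilated convolution, written as cross-correlation as in CNN
    # libraries: kernel taps are spaced `dilation` samples apart.
    span = dilation * (len(kernel) - 1) + 1  # effective kernel size
    return [
        sum(w * signal[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(len(signal) - span + 1)
    ]

def stacked_receptive_field(kernel, dilations):
    # Stride-1 layers: each layer widens the receptive field by dilation * (kernel - 1).
    return 1 + sum(d * (kernel - 1) for d in dilations)

x = [1, 2, 3, 4, 5, 6, 7]
print(dilated_conv1d(x, [1, 1, 1], dilation=1))  # [6, 9, 12, 15, 18]
print(dilated_conv1d(x, [1, 1, 1], dilation=2))  # [9, 12, 15]
print(stacked_receptive_field(3, [1] * 6))       # 13
print(stacked_receptive_field(3, [2] * 6))       # 25
```

Both kernels have three weights, but six dilation-2 layers see 25 samples instead of 13, with no pooling involved.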

2.3. Limitations of the state-of-the-art approaches:

The majority of CNN-based models [9][28][22] require input images of fixed size and operate on image patches. Moreover, their density maps tend to be of low quality, which leaves room for improvement. To overcome the issue of low-quality density maps, a multi-column style has been introduced [12][10]. Despite commendable achievements in this area, these designs have two notable drawbacks when the networks go deeper: an ineffective branch structure and the long time required for training.

3. System Model:

In this study, a new solution for counting people in public places in Saudi Arabia is proposed. The solution is based on machine learning, specifically deep learning with CNNs, whose main advantage is that they learn data representations directly from data in a hierarchical, layer-based structure. Once the CNN model has been trained, it is adopted in the proposed system to count people/visitors in images from different public places. The proposed solution operates in nine main stages:
(1) cameras are installed in the targeted public areas;
(2) an IP address is assigned to each camera;
(3) the cameras are connected to the cloud over the Internet using Wi-Fi technology;
(4) the user's device is connected to the cloud;
(5) the data are transmitted from the cloud;
(6) the data are accessed from the cloud on the end users' devices through the Internet;
(7) the crowd counting system is run, mainly using our trained DCNN model;
(8) the required processed data are collected;
(9) the result is sent to and displayed on the users' mobile application.
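The nine stages above can be sketched as a small cloud-side handler. This is only an illustration under stated assumptions: the camera IP addresses, place names, and low/medium/high thresholds are hypothetical; the trained DCNN is replaced by a stub that sums a density map carried by each frame; and per-camera counts for the same place are added, which presumes the camera views do not overlap.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    camera_ip: str   # stages 1-2: each installed camera has its own IP address
    place: str       # public place covered by the camera
    density: list    # stand-in for the uploaded image (see count_people)

def count_people(frame):
    # Stage 7 stand-in: instead of running the trained DCNN, sum the density
    # map that this toy frame already carries.
    return sum(sum(row) for row in frame.density)

def crowd_level(count, low=50, high=200):
    # Hypothetical thresholds turning a count into the indicator shown in the app.
    if count < low:
        return "low"
    return "medium" if count < high else "high"

def handle_app_request(cloud_frames, place):
    # Roughly stages 7-9: count the frames stored in the cloud for the requested
    # place and return a summary for the mobile application to display.
    counts = [count_people(f) for f in cloud_frames if f.place == place]
    if not counts:
        return {"place": place, "status": "no camera data"}
    total = round(sum(counts))  # assumes camera views do not overlap
    return {"place": place, "people": total, "level": crowd_level(total)}

frames = [  # stage 3: frames uploaded to the cloud by the cameras (hypothetical)
    Frame("10.0.0.11", "mall-gate", [[10.0, 20.0], [5.0, 5.0]]),
    Frame("10.0.0.12", "mall-gate", [[30.0, 0.0], [0.0, 10.0]]),
    Frame("10.0.0.21", "stadium", [[150.0, 100.0], [20.0, 30.0]]),
]
print(handle_app_request(frames, "mall-gate"))
# {'place': 'mall-gate', 'people': 80, 'level': 'medium'}
print(handle_app_request(frames, "stadium"))
# {'place': 'stadium', 'people': 300, 'level': 'high'}
```

Keeping counting on the cloud side means the phone only receives a small summary, which fits the requirement that the application run on an average smartphone.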

3.1. Developing the Counting system:

The proposed CNN model has two primary parts. The first part, the front-end, is an adapted VGG-16 CNN (without its fully connected layers) [14] with 3×3 kernels for 2D feature extraction. The second part, the back-end, is a dilated CNN [1] with more layers, in which dilated kernels deliver larger receptive fields and replace pooling operations. Four network configurations with different dilation rates in the back-end, all sharing the same front-end arrangement, are recommended. According to [19], for receptive fields of the same size it is more efficient to use smaller kernels with more convolutional layers than bigger kernels with fewer layers. The primary consideration is to balance accuracy against the resources involved, such as the number of parameters, the training time, and memory consumption. Based on the experiments conducted, the optimal arrangement uses the first eight layers of VGG-16 [19] with three rather than five pooling layers, which reduces the adverse impact of pooling operations on output accuracy. In this study, the same front-end structure has been maintained, while the network has been trained on the dataset from the eighth layer to the end of the network. Padding is used so that every convolutional layer keeps the spatial size of its input. The parameters of the convolutional layers are written as "conv-(kernel size)-(dilation rate)-(number of filters)", and max-pooling is applied over a 2×2 pixel window with stride 2. Figure 2 shows an overview of the proposed counting network DNCC.

Fig. 2. Proposed counting network (DNCC)
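The argument for small kernels cited above can be checked with a little arithmetic. For stride-1 layers, a stack of n 3×3 convolutions has the same receptive field as one (2n + 1)×(2n + 1) convolution but fewer weights. The sketch below, with a hypothetical channel width of 64 and biases ignored, compares the two options.

```python
def conv_weights(kernel, channels):
    # Weights of one kernel x kernel convolution with `channels` input and
    # output channels (biases ignored).
    return kernel * kernel * channels * channels

def receptive_field_of_stack(kernel, layers):
    # Stride-1, undilated layers: each extra layer widens the field by kernel - 1.
    return 1 + layers * (kernel - 1)

C = 64  # hypothetical channel width
for small, n, big in [(3, 2, 5), (3, 3, 7)]:
    assert receptive_field_of_stack(small, n) == big  # same receptive field
    stack, single = n * conv_weights(small, C), conv_weights(big, C)
    print(f"{n} x {small}x{small}: {stack} weights | 1 x {big}x{big}: {single} weights "
          f"| ratio {stack / single:.2f}")
# 2 x 3x3: 73728 weights | 1 x 5x5: 102400 weights | ratio 0.72
# 3 x 3x3: 110592 weights | 1 x 7x7: 200704 weights | ratio 0.55
```

The ratios (18/25 and 27/49) do not depend on the channel width, and the stacked version also inserts extra nonlinearities between layers.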




3.2. Training method:

Details about the training are provided in this section. Because a regular CNN without branch structures is used, the model is straightforward and fast to implement.

a. Ground truth generation: Highly crowded scenes are handled using geometry-adaptive kernels. Each head annotation is blurred with a Gaussian kernel normalised to 1, so that the ground truth takes into account the spatial distribution of all images from each dataset. The geometry-adaptive kernels are represented by the function:

$$F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \qquad \sigma_i = \beta \bar{d}_i$$

where, for each targeted object $x_i$ in the ground truth $\delta$, $\bar{d}_i$ represents the mean distance to its $k$ nearest neighbours. To produce the density map, $\delta(x - x_i)$ is convolved with a Gaussian kernel $G_{\sigma_i}$ with standard deviation $\sigma_i$, where $x$ is the pixel's location in the image. The experiments follow the configuration used by [18], with β = 0.3 and k = 3. For sparse (non-crowded) scenes, a Gaussian kernel with an average head size is used to blur all the annotations.
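The ground-truth generation described above can be sketched in plain Python. Assumptions, all hypothetical: head annotations are (column, row) pixel coordinates; the fixed sigma used for sparse scenes is 4 pixels; σ_i is floored at 0.5 to avoid a zero-width kernel when two annotations coincide; and each Gaussian is normalised over the image grid (rather than applied as a truncated kernel) so that the map sums exactly to the number of heads.

```python
import math

def knn_mean_distance(heads, i, k):
    # Mean Euclidean distance from head i to its k nearest other heads (d-bar_i).
    dists = sorted(math.dist(heads[i], q) for j, q in enumerate(heads) if j != i)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)

def head_sigmas(heads, beta=0.3, k=3, sparse=False, head_sigma=4.0):
    # Sparse scenes (or a single head): one fixed sigma standing in for an
    # "average head size"; 4.0 px is a hypothetical value.
    if sparse or len(heads) < 2:
        return [head_sigma] * len(heads)
    # Dense scenes: geometry-adaptive sigma_i = beta * d-bar_i (floored to avoid 0).
    return [max(beta * knn_mean_distance(heads, i, k), 0.5) for i in range(len(heads))]

def density_map(heads, height, width, **kwargs):
    # Sum of one Gaussian per head annotation (x = column, y = row). Each Gaussian
    # is normalised to sum to 1 over the image, so the map integrates to the count.
    dmap = [[0.0] * width for _ in range(height)]
    for (hx, hy), s in zip(heads, head_sigmas(heads, **kwargs)):
        g = [[math.exp(-((c - hx) ** 2 + (r - hy) ** 2) / (2 * s * s))
              for c in range(width)] for r in range(height)]
        total = sum(map(sum, g))
        for r in range(height):
            for c in range(width):
                dmap[r][c] += g[r][c] / total
    return dmap

heads = [(5, 5), (10, 5), (5, 10), (15, 15)]
dmap = density_map(heads, height=20, width=20)
print(round(sum(map(sum, dmap)), 6))  # 4.0
print([round(s, 3) for s in head_sigmas(heads)])
```

Isolated heads, such as the one at (15, 15), get wider kernels than heads in a tight group, which is the point of the geometry-adaptive choice.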

b. Data augmentation: Nine patches, each a quarter of the original image's size, are cropped from each image at different locations. The first four patches are the four non-overlapping quarters of the input image, and the remaining five are cropped randomly from the image. The patches are then mirrored, doubling the training set.

c. Training details: The DCNN is trained directly as an end-to-end structure. The first 8 convolutional layers are taken from a well-trained VGG-16 [26]. The remaining layers are initialised from a Gaussian distribution with a standard deviation of 0.01. During training, stochastic gradient descent (SGD) is used with a constant learning rate of 1e-6. Consistent with [19, 18, 4], the Euclidean distance is used to measure the difference between the ground truth and the estimated density map. The loss function is:

$$L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\lVert Z(X_i; \Theta) - Z_i^{GT} \right\rVert_2^2$$

where $N$ is the training batch size and $Z(X_i; \Theta)$ is the output generated by the DCNN with parameters $\Theta$ for the input image $X_i$, whose ground-truth density map is $Z_i^{GT}$.
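The data augmentation and the loss above can be sketched as follows. This is a minimal illustration, assuming the ground-truth map has the same resolution as the image so that one set of crop boxes serves both; in practice the crops and flips must be applied identically to the image and its density map. The loss sums the squared Euclidean distances over a batch of N maps and scales by 1/(2N).

```python
import random

def nine_crop_boxes(height, width, rng):
    # (top, left, h, w) of nine patches, each a quarter of the image area (half
    # height x half width): four non-overlapping quarters + five random crops.
    ch, cw = height // 2, width // 2
    boxes = [(0, 0, ch, cw), (0, cw, ch, cw), (ch, 0, ch, cw), (ch, cw, ch, cw)]
    for _ in range(5):
        boxes.append((rng.randint(0, height - ch), rng.randint(0, width - cw), ch, cw))
    return boxes

def crop(grid, box):
    top, left, h, w = box
    return [row[left:left + w] for row in grid[top:top + h]]

def hflip(grid):
    # Horizontal mirror; the same flip must be applied to image and density map.
    return [row[::-1] for row in grid]

def euclidean_loss(preds, targets):
    # L(Theta) = 1/(2N) * sum_i ||Z(X_i; Theta) - Z_i^GT||_2^2 over N maps.
    sq = sum((p - t) ** 2
             for pm, tm in zip(preds, targets)
             for prow, trow in zip(pm, tm)
             for p, t in zip(prow, trow))
    return sq / (2 * len(preds))

rng = random.Random(0)
gt = [[float(8 * r + c) for c in range(8)] for r in range(8)]  # toy 8x8 map
patches = [crop(gt, box) for box in nine_crop_boxes(8, 8, rng)]
patches += [hflip(p) for p in patches]  # mirroring doubles the set
print(len(patches))  # 18
print(euclidean_loss([[[1.0, 2.0]], [[3.0, 0.0]]],
                     [[[0.0, 0.0]], [[1.0, 0.0]]]))  # 2.25
```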

4. Experiments:

There are a number of existing datasets for crowd counting, such as the UCF-QNRF, UCSD, WorldExpo'10, UCF CC 50, Mall, and ShanghaiTech datasets. These datasets differ in their characteristics and number of persons, as shown in Table 1.

Table 1. State-of-the-art crowd counting datasets (Num: number of images; Min, Max, Average: persons per image; Total: total annotated persons)

Dataset             | Resolution | Color | Num  | Min | Max  | Average | Total
--------------------|------------|-------|------|-----|------|---------|--------
UCSD                | 158 × 238  | Grey  | 2000 | 11  | 46   | 24.9    | 49,885
UCF CC 50           | different  | Grey  | 50   | 94  | 4543 | 1279.5  | 63,974
Mall                | 640 × 480  | RGB   | 2000 | 11  | 53   | 31.2    | 62,315
WorldExpo           | 576 × 720  | RGB   | 3980 | 1   | 253  | 50.2    | 199,923
ShanghaiTech        | different  | RGB   | 1198 | 20  | 3500 | 1760    | 330,165
Saudi_Dataset (our) | –          | RGB   | 750  | 1   | 450  | 90      | 34,422

However, given the unique characteristics of Saudi crowds in public places, the datasets above are not an optimal choice for achieving the goals of this project. Because of the unique culture of Saudi people, a dedicated dataset is needed for this study. Although the datasets above contain many images, each covers only one category, either indoor or outdoor images, and many categories are missing, such as images of children.

4.1. Evaluation metrics:

To evaluate our model, we used the MAE and MSE, defined as follows:

$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|, \qquad MSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|^2}$$

where $N$ is the number of images in one test sequence, $C_i^{GT}$ is the ground-truth count, and $C_i$ is the estimated count for image $X_i$, computed as:

$$C_i = \sum_{y=1}^{Y} \sum_{x=1}^{X} z_{y,x}$$

where $Y$ and $X$ are the length and width of the density map and $z_{y,x}$ is the pixel at $(y, x)$ of the produced density map.
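The estimated count and both metrics can be computed as follows. This plain-Python sketch uses toy numbers and assumes the common crowd-counting convention that "MSE" denotes the square root of the mean squared error.

```python
import math

def estimated_count(density):
    # C_i: sum of every pixel z_{y,x} of the predicted density map.
    return sum(sum(row) for row in density)

def mae_mse(pred_counts, gt_counts):
    # MAE and the (root) MSE used in crowd counting, over N test images.
    n = len(gt_counts)
    errors = [c - g for c, g in zip(pred_counts, gt_counts)]
    mae = sum(abs(e) for e in errors) / n
    mse = math.sqrt(sum(e * e for e in errors) / n)
    return mae, mse

maps = [[[4.0, 6.0]], [[10.0, 10.0]], [[15.0, 15.0]]]  # toy predicted maps
preds = [estimated_count(m) for m in maps]            # [10.0, 20.0, 30.0]
mae, mse = mae_mse(preds, [12, 18, 35])
print(mae, round(mse, 4))  # 3.0 3.3166
```

MAE reflects accuracy, while the squared term makes MSE more sensitive to occasional large errors, so it is usually read as a measure of robustness.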

a. Results on ShanghaiTech dataset:

The ShanghaiTech dataset contains a total of 330,165 annotated persons [29] in 1198 images (see Table 1). This public dataset consists of two parts, A and B. The proposed method has been compared with seven existing related methods, and the estimation results are shown in Table 2. On Part A, the proposed method achieves the lowest MSE (98.4) among the compared models and the second-lowest MAE (72.3, after CSRNet's 68.2). On Part B, its MAE (11.1) is about 65% lower than that of the Cross-scene [4] method (32.0) and second only to CSRNet (10.6).

Table 2. Estimation errors on the ShanghaiTech dataset

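Year | Method             | Part A MAE | Part A MSE | Part B MAE | Part B MSE
-----|--------------------|------------|------------|------------|-----------
2015 | Cross-scene [4]    | 181.8      | 277.7      | 32.0       | 49.8
2016 | M-CNN [11]         | 110.2      | 173.2      | 26.4       | 41.3
2016 | FCN [30]           | 126.5      | 173.5      | 23.76      | 33.12
2017 | Cascaded-MTL [19]  | 101.3      | 152.4      | 20.0       | 31.1
2017 | Switching-CNN [10] | 90.4       | 135.0      | 21.6       | 33.4
2018 | Z. Shi et al. [31] | 73.5       | 112.3      | 18.7       | 26.0
2018 | CSRNet             | 68.2       | 115        | 10.6       | 16
–    | DNCC (ours)        | 72.3       | 98.4       | 11.1       | 17.3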

Fig. 3. The first row shows test samples from ShanghaiTech Part A, the second row shows the ground truth for these samples, and the third row shows the produced density maps.

b. Results on Saudi_Dataset: A new, specific, large-scale crowd counting dataset has been introduced for this study. Because CNNs need a large amount of training data, we propose a new dataset customised for Saudi people in public places. The dataset consists of 750 images with different numbers of heads and different crowding levels. The count ranges from 1 to 450, with an average of 80 persons in view. The dataset is described as follows:
• The pictures were taken from different camera angles, which allows our system to recognize persons regardless of the position of the camera.
• Images were collected as snapshots from videos of different places, such as malls, restaurant gardens, and airports. We also gathered images from widely available websites, such as Google Images and Instagram; these cover several events and places, including rallies, concerts, and stadiums.
• The dataset contains images of both indoor and outdoor activities.
• Unlike existing datasets, the dataset contains a considerable number of children.
The dataset also poses various challenges, such as high clutter, occlusions, non-uniform illumination, and non-uniform distribution of people.

Fig. 4. Statistics of Saudi_Dataset (plot "Statistics of collected dataset"; axes: number of images, number of persons in view).

Table 3. Estimation errors on Saudi_Dataset

Year | Method            | MAE  | MSE
-----|-------------------|------|-----
2016 | M-CNN [11]        | 43.2 | 49
2017 | Cascaded-MTL [19] | 20.9 | 30
2018 | CSRNet [13]       | 18   | 28.2
–    | DCNN (ours)       | 15.5 | 25.6

Fig. 5. The first column shows test samples from Saudi_Dataset, the second column shows the ground truth for these samples, and the third column shows the produced density maps.

Because of the uncommon and special nature of Saudi society, images of Saudi crowds may require special features in addition to the global features. We trained the network for crowd counting on our data while accepting colour images of unfixed resolution. The M-CNN training code was used as a base; the findings further support the observation that M-CNN's branch structure is ineffective and that Cascaded-MTL has a complicated structure. Our method achieves the lowest MAE, lower by 6.7%, 3.9%, and 2.79%, and the lowest MSE, lower by around 12.25%, 7.5%, and 7.1%, than M-CNN, Cascaded-MTL, and CSRNet, respectively.



5. Conclusion:

In this study, a novel deep CNN called DCNN has been proposed. It follows the structure of VGG-16 with small convolution kernels, accepts images of any resolution as input, and has been trained on new datasets specific to Saudi public places. The counting system is proposed as part of a mobile application that aims to reduce users' waiting time by showing them the crowd size of the area they are heading to. A new, specific, large-scale crowd counting dataset for Saudi public areas has been used to train the model. The results indicate that the proposed method achieves a lower MAE by 6.7%, 3.9%, and 2.79% compared to M-CNN, Cascaded-MTL, and CSRNet, respectively, and a lower MSE by almost 12.25%, 7.5%, and 7.1% compared to the same three methods.

References

[1]

V. A. Sindagi and V. M. Patel, “A survey of recent advances in CNN-based single image crowd counting and density estimation,” Pattern Recognit. Lett., pp. 1–16, 2017.

[2]

M. Li, Z. Zhang, K. Huang, and T. Tan, “Estimating the number of people in crowded scenes by MID based foreground segmentation and head-shoulder detection,” Pattern Recognition, 2008. ICPR 2008. 19th Int. Conf., pp. 1–4, 2008.

[3]

P. Viola, M. J. Jones, and D. Snow, “Detecting pedestrians using patterns of motion and appearance,” Int. J. Comput. Vis., vol. 63, no. 2, pp. 153–161, 2005.

[4]

D. Ryan, S. Denman, C. Fookes, and S. Sridharan, “Crowd Counting using Multiple Local Features,” Techniques, no. December, pp. 1–3, 2009.

[5]

A. B. Chan and N. Vasconcelos, “Bayesian poisson regression for crowd counting,” Proc. IEEE Int. Conf. Comput. Vis., pp. 545–551, 2009.

[6]

B. Xu and G. Qiu, “Crowd density estimation based on rich features and random projection forest,” 2016 IEEE Winter Conf. Appl. Comput. Vision, WACV 2016, 2016.

[7]

V. Lempitsky and A. Zisserman, “Learning To Count Objects in Images,” Adv. Neural Inf. Process. Syst., pp. 1324–1332, 2010.

[8]

Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang, “Crowd Counting via Adversarial Cross-Scale Consistency Pursuit.”

[9]

C. Wang, H. Zhang, L. Yang, S. Liu, and X. Cao, “Deep People Counting in Extremely Dense Crowds,” Proc. 23rd ACM Int. Conf. Multimed. - MM ’15, pp. 1299–1302, 2015.

[10]

D. B. Sam, S. Surya, and R. V. Babu, “Switching Convolutional Neural Network for Crowd Counting,” 2017.

[11]

Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, “Single-Image Crowd Counting via Multi-Column Convolutional Neural Network,” 2016 IEEE Conf. Comput. Vis. Pattern Recognit., pp. 589–597, 2016.

[12]

L. Zeng, X. Xu, B. Cai, S. Qiu, and T. Zhang, “Multi-scale convolutional neural networks for crowd counting,” Proc. - Int. Conf. Image Process. ICIP, vol. 2017–Septe, pp. 465–469, 2018.

[13]

Y. Li, X. Zhang, and D. Chen, “CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes.”

[14]

C. Chung et al., “Implementation of an integrated computerized prescriber order-entry system for chemotherapy in a multisite safetynet health system,” Am. J. Heal. Pharm., vol. 75, no. 6, pp. 398–406, 2018.

[15]

P. S. Sierra and J. B. Siculaba, “Knowledge Management in Organizations,” vol. 731, pp. 509–519, 2017.

[16]

P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian Detection: An Evaluation of the State of the Art,” IEEE TPAMI, vol. 34, no. 4, pp. 743–761, 2012.


[17]

Y. Wang and Y. Zou, “Fast visual object counting via example-based density estimation,” Proc. - Int. Conf. Image Process. ICIP, vol. 2016–Augus, pp. 3653–3657, 2016.

[18]

V. Q. Pham, T. Kozakaya, O. Yamaguchi, and R. Okada, “COUNT forest: Co-voting uncertain number of targets using random forest for crowd density estimation,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2015 Inter, pp. 3253–3261, 2015.

[19]

V. A. Sindagi and V. M. Patel, “CNN-Based cascaded multi-task learning of high-level prior and density estimation for crowd counting,” 2017 14th IEEE Int. Conf. Adv. Video Signal Based Surveillance, AVSS 2017, 2017.

[20]

S. Kumagai, K. Hotta, and T. Kurita, “Mixture of Counting CNNs: Adaptive Integration of CNNs Specialized to Specific Appearance for Crowd Counting,” pp. 1–8, 2017.

[21]

G. Li and Y. Yu, “Visual Saliency Detection Based on Multiscale Deep CNN Features,” 2016.

[22]

E. Walach and L. Wolf, “Learning to count with CNN boosting,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 9906 LNCS, pp. 660–676, 2016.

[23]

L. Zhang, M. Shi, and Q. Chen, “Crowd counting via scale-adaptive convolutional neural network.”

[24]

J. Liu, C. Gao, D. Meng, and A. G. Hauptmann, “DecideNet: Counting Varying Density Crowds Through Attention Guided Detection and Density Estimation,” 2017.

[25]

V. A. Sindagi and V. M. Patel, “Generating High-Quality Crowd Density Maps Using Contextual Pyramid CNNs,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2017–Octob, pp. 1879–1888, 2017.

[26]

K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” pp. 1–14, 2015.

[27]

S. V. B, P. Sanguansat, and S. Toriumi, “Advances in Intelligent Information Hiding and Multimedia Signal Processing,” vol. 82, 2018.

[28]

M. Fu, P. Xu, X. Li, Q. Liu, M. Ye, and C. Zhu, “Fast crowd density estimation with convolutional neural networks,” Eng. Appl. Artif. Intell., vol. 43, pp. 81–88, 2015.

[29]

C. Zhang, H. Li, X. Wang, and X. Yang, “Cross-scene crowd counting via deep convolutional neural networks,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 07–12–June, pp. 833–841, 2015.

[30]

M. Marsden, K. McGuinness, S. Little, and N. E. O’Connor, “Fully Convolutional Crowd Counting On Highly Congested Scenes,” 2016.

[31]

Z. Shi et al., “Crowd Counting with Deep Negative Correlation Learning,” pp. 5382–5390.