DSPNet: Deep scale purifier network for dense crowd counting




Expert Systems With Applications 141 (2020) 112977

Contents lists available at ScienceDirect

Expert Systems With Applications journal homepage: www.elsevier.com/locate/eswa

Xin Zeng, Yunpeng Wu, Shizhe Hu, Ruobin Wang, Yangdong Ye∗

School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China

Article info

Article history: Received 17 April 2019; Revised 17 September 2019; Accepted 24 September 2019; Available online 25 September 2019

Keywords: Crowd counting; Density map estimation; Convolutional neural network; Deep learning

Abstract

Crowd counting has attracted considerable attention in recent years. However, crowd counting in highly congested scenes remains challenging owing to scale variation. To remedy this issue, we propose a novel deep scale purifier network (DSPNet) that can encode multiscale features and reduce the loss of contextual information for dense crowd counting. Our proposed method has two strengths. First, the DSPNet model consists of a frontend and a backend. The frontend is a conventional deep convolutional neural network, while the unified deep neural network backend adopts a “maximal ratio combining” strategy to learn complementary scale information at different levels. The scale purifier module, which improves scale representations, can effectively fuse multiscale features. Second, DSPNet performs inference on the whole RGB image to facilitate model learning and decrease contextual information loss. Our customized network is end-to-end and has a fully convolutional architecture. We demonstrate the generalization ability of our approach by cross-scene evaluation. Extensive experiments on three publicly available crowd counting benchmarks (i.e., the UCF-QNRF, ShanghaiTech, and UCF_CC_50 datasets) show that DSPNet delivers superior performance against state-of-the-art methods. © 2019 Published by Elsevier Ltd.

∗ Corresponding author.
https://doi.org/10.1016/j.eswa.2019.112977
0957-4174/© 2019 Published by Elsevier Ltd.

1. Introduction

With the rapid growth of urban populations, crowd counting has been a popular topic in recent years. The purpose of crowd counting is to predict the number of pedestrians in still images or video sequences (Lempitsky & Zisserman, 2010). Overcrowding in public places may result in a loss of control and even stampedes. Thus, an accurate crowd counting method is essential for highly crowded scenes, such as political rallies, religious activities, musical festivals, and sporting events. By determining the number of pedestrians in such scenes and subsequently taking effective measures, some tragedies may be avoided. Furthermore, counting methods can be extended to other important domains, including medical and biological image processing (Lempitsky & Zisserman, 2010), traffic monitoring (Barcellos, Bouvié, Escouto, & Scharcanski, 2015; De Almeida, Oliveira, Britto Jr, Silva Jr, & Koerich, 2015), and wildlife census (Laradji, Rostamzadeh, Pinheiro, Vazquez, & Schmidt, 2018).

As a well-established problem in computer vision, crowd counting has confronted researchers with many challenges over the last few years. A good crowd counting system must be accurate and robust when faced with high clutter, extreme weather, severe occlusion, scale variation, and non-uniform crowd distributions. Some typical challenges in crowd counting are illustrated in Fig. 1.

Fig. 1. Examples of challenges of crowded scenes: (a) occlusions, (b) high clutter, (c) scale variation, and (d) non-uniform distribution of crowds.

Early work generally estimates the number of pedestrians via detection or regression frameworks, with hand-crafted features as visual descriptors (Ryan, Denman, Sridharan, & Fookes, 2015). However, traditional crowd counting models have limited descriptive ability because they fail to consider potential information. The advent of the convolutional neural network (CNN) has made it possible to overcome various problems that previous image analysis methods failed to address. In visual systems, the CNN supports complex tasks such as crowd counting, anomaly detection, object tracking, and activity recognition (Ronao & Cho, 2016). Thus, crowd counting systems have gradually adopted a CNN as the backbone to support the counting task.

A central issue in dense crowd counting lies in handling scale variation. Pedestrians commonly appear at a wide range of scales, which degrades the final counting results. Several studies have achieved significant error reduction and improvements in precision. Zhang, Zhou, Chen, Gao, and Ma (2016) used a multi-column CNN (MCNN) to capture different scale features with various kernel sizes across the columns. Hydra CNN, presented by Onoro-Rubio and López-Sastre (2016), followed a similar approach to build a scale-aware model. Babu Sam, Surya, and Venkatesh Babu (2017) employed a switchable classifier (Switching-CNN) to assign local image patches to one of the CNN columns, ensuring that each CNN column is adapted to a certain scale. In the same year, Zeng, Xu, Cai, Qiu, and Zhang (2017) proposed the inclusion of a multiscale blob in the convolutional neural network (MSCNN), producing scale-relevant density maps from still images. CSRNet, a single-column fully convolutional network (Long, Shelhamer, & Darrell, 2015), was introduced by Li, Zhang, and Chen (2018). This framework was based on dilated convolutional filters with diverse rates, thus extending the receptive field of feature maps and reducing scale information loss.

However, MCNN and Hydra CNN work on a relatively shallow CNN model. For instance, MCNN includes just five convolutional layers with a small number of filters. Using a deeper network can further enhance the discriminative ability of the model towards dense crowd counting. Furthermore, Switching-CNN relies on a switchable classifier based on the VGGNet (Simonyan & Zisserman, 2015) structure, which is costly and at times inaccurate.

Most existing methods typically use cropped image patches (e.g., CSRNet, Switching-CNN, MSCNN, MCNN, Hydra CNN), and different approaches use different crop sizes during training. The whole image count is then approximated by the total sum over these local patches. However, this processing method still has severe drawbacks. Cropping local patches from each image is adverse when augmenting the training data. Because of contextual information loss, the sum over the image patches does not equal the whole image count. Additionally, overlapping regions incur a large number of redundant computations, which is quite time-consuming.

Based on the aforementioned limitations, our research designs a novel deep scale purifier network (DSPNet) for dense crowd counting. To address the considerable changes in scale, we propose a scale purifier module (SPM) with the deconvolution operation. Specifically, the SPM adopts a “maximal ratio combining” strategy to further capture features at different scale levels (e.g., small, middle, and large) within a unified deep neural network backend. Different columns with various kernel sizes are used to encode different levels of scale information. All multiscale features are fused in the network backend. The SPM exhibits an effective multi-column architecture in which different columns can learn complementary scale information. DSPNet encodes different scale features to generate high-quality multiscale density maps.

As already mentioned, most crowd counting systems rely on patch-based inference. This typical divide-count-sum pipeline, however, overlooks latent global contextual information. Thus, we apply a whole RGB image-based inference strategy, which directly avoids computationally expensive sliding windows. DSPNet is trained using the whole image rather than patches cropped from the input. The use of full image-based inference helps the model achieve better count estimation. Our unified architecture is effective and easy to train, allowing crowd counting in complex environments. Three publicly accessible datasets are used for the experiments, and we further conduct an ablation study on the ShanghaiTech Part_A dataset. Although DSPNet has an unsophisticated structure, it offers superior performance compared to state-of-the-art approaches.

In summary, the major contributions of this work are threefold:

• We establish a novel end-to-end trainable architecture, the DSPNet. To our knowledge, this is the first work that uses a multi-column architecture within a unified deep neural network backend to address the problem of scale variation.

• We propose a scale purifier module (SPM) with the deconvolution operation that is simple and effective for learning different scale levels and reducing the final count error.

• We conduct extensive experiments, including a cross-scene experiment, and compare with state-of-the-art methods on three mainstream datasets. Experimental results show the superiority and good generalization ability of DSPNet.

The remainder of this paper is organized as follows. Section 2 describes the current literature on this topic. Section 3 provides details of our proposed method. The experimental results on crowd counting are summarized in Section 4. Finally, Section 5 concludes the paper and suggests directions for future research.

2. Related work

Crowd counting is a classic pattern recognition problem in computer vision and is the subject of many published works. We briefly review the related crowd counting methods.

2.1. Density estimation-based methods

Methods for density estimation can be categorized into the pedestrian dynamics field and still crowd counting. Detecting and tracking are key for crowd dynamics modeling because they provide the location and velocity features of pedestrian dynamics. Some works incorporate automatic sensor technologies in crowd dynamics, such as counting camera systems (Duives, Daamen, & Hoogendoorn, 2018a), tablet sensors (Nagao, Yanagisawa, & Nishinari, 2018), and Wi-Fi sensors (Duives, Daamen, & Hoogendoorn, 2018b). In this paper, we focus on still crowd counting to estimate crowd density. Initial studies often utilize a detection-style framework to estimate the number of people in a crowd. However, trained detectors are prone to fail in overcrowded or highly dense scenes. Different from counting by detection, counting by regression relies on hand-crafted features, such as SIFT (Lowe, 2004), HOG (Dalal & Triggs, 2005), and VLAD (Jegou et al., 2012). However, such hand-crafted image features are not robust when facing large scale variation. Foreground segmentation in the regression framework, which is considered highly challenging as it only captures appearance information, is indispensable for still crowd counting. It does not consider the texture and contextual information that are incorporated in the regression model.
Although the spatial information of pedestrians is a key factor in real applications, the aforementioned counting methods ignore this point. Moreover, previous studies on crowd counting have mainly focused on very low-density scenes. Most experiments in the literature select datasets with sparse crowd distributions, such as the UCSD (Chan, Liang, & Vasconcelos, 2008), PETS (Ellis & Ferryman, 2010), and Mall (Chen, Loy, Gong, & Xiang, 2012) datasets. For highly congested scenes, the performance of these counting systems is poor. To address these limitations, Lempitsky and Zisserman (2010) developed a pioneering approach for learning a linear mapping between local image patches and corresponding density maps. This method achieved great breakthroughs in a diverse range of counting problems. More specifically, spatial information present in the given image is used to generate a crowd density map, providing the spatial distribution of the crowd. Density maps preserve an extensive set of detailed information, and the final number of pedestrians can be directly obtained by integrating the density map. Given its success, the research community widely regards this strategy as the current best practice. We therefore select the density estimation-based method for dense crowd counting.

2.2. Crowd counting based on CNN

Recently, deep neural networks (DNNs) have become the leading approach in computer vision owing to their success in many challenging tasks. Based on many lessons learned from mainstream experiments, numerous CNN-based models, such as AlexNet (Krizhevsky, Sutskever, & Hinton, 2012), VGGNet (Simonyan & Zisserman, 2015) and ResNet (He, Zhang, Ren, & Sun, 2016), have been applied in academic research and engineering applications. For conventional visual tasks, deep features extracted from the convolutional neural network clearly outperform hand-crafted features. State-of-the-art methods for crowd counting are built using convolutional modules. This pipeline has yielded drastic improvements on several standard benchmarks. A comprehensive survey of this area can be found in Sindagi and Patel (2018), which classifies methods into two broad categories: traditional and CNN-based approaches. Here, we focus on advances in crowd counting and density estimation. Fu et al. (2015) were the first to utilize a CNN-based method for still crowd counting. Sindagi and Patel (2017a) and Shi, Zhang, Sun, and Ye (2018) explored the potential of multi-task learning in crowd counting subproblems. Cascaded-MTL (Sindagi & Patel, 2017a) applied a high-level prior to improve counting accuracy. Switching-CNN (Babu Sam et al., 2017) was made up of one classifier and multiple independent regressors, relying on CNN classification to choose an optimal column for each input image patch. Sam and Babu (2018) built top-down feedback with TDF-CNN to rectify initial counting predictions. Many state-of-the-art approaches exploit the strong abilities of CNNs for crowd counting in numerous ways. An incrementally growing CNN (IG-CNN) (Babu Sam, Sajjan, Venkatesh Babu, & Srinivasan, 2018) used continuous iterations to create expert regressors. Ranjan, Le, and Hoai (2018) presented a multistage iterative counting CNN (ic-CNN) framework, where all the prior stages influence final stage predictions. To decrease the risk of overfitting, Shi, Zhang, Liu et al. (2018) proposed a novel decorrelated CNN (D-ConvNet) that adopts negative correlation learning.

2.3. Multiscale models for crowd counting

To deal with multiscale counting, Idrees, Saleemi, Seibert, and Shah (2013) offered a method (FHSc+MRF) that fuses features from head detections, Fourier analysis, and SIFT interest points. Since learning a linear mapping is highly demanding, Pham, Kozakaya, Yamaguchi, and Okada (2015) extended the density regression model using a nonlinear function with a random forest. With the introduction of “CrowdNet” by Boominathan, Kruthiventi, and Babu (2016), research on crowd counting moved from traditional methods to the convolutional neural network. CNN methods can output more discriminative representations. Hydra CNN is a scale-aware counting model that adopts a multiple input strategy to predict density maps by combining multiple scales. Zhang et al. (2016) built a multi-column layout (MCNN) to capture multiscale features by using kernels of different sizes in a shallow neural network. Furthermore, the MSCNN generates scale-relevant features by adding multiscale blobs. Sindagi and Patel (2017b) proposed a contextual pyramid CNN (CP-CNN) by incorporating different levels of contextual information. The CP-CNN includes MCNN as a part of the network and generates high-quality density maps by incorporating global and local contexts. Shen et al. (2018) utilized a scale-consistency regularization constraint (ACSCP) to enhance the coherence between estimated density maps of different scales. Li et al. (2018) combined the VGGNet network and dilated convolution layers to aggregate scale information in CSRNet. How to handle scale variation in crowd counting is still a key open issue. Thus, we propose a deep scale purifier network to improve the performance of crowd counting.


3. Crowd counting by DSPNet

3.1. Deconvolution for neural network

Despite the improvements in crowd counting using CNNs, the application of deeper CNNs still faces several problems. Deeper networks are composed of multiple convolution and pooling layers, whose large receptive fields and increasing spatial invariance make it difficult to determine object locations. As counting systems gradually move to deeper networks, contextual information is lost. Inspired by DeconvNet (Noh, Hong, & Han, 2015), we design deconvolution layers in our network architecture. Deconvolution, as discussed in Zeiler, Krishnan, Taylor, and Fergus (2010), is similar to conventional convolution but with zeros inserted into the input activations. Deconvolution layers in the decoder can upscale the extracted feature map. Formally, deconvolution can be described as follows:

$$y_c^i = \sum_{k=1}^{K_1} z_k^i * f_{k,c}, \tag{1}$$

where the reconstruction $y_c^i$ is obtained by a 2D convolution, $z_k^i$ represents the feature maps, and $f_{k,c}$ are the filters for channel $c$. Deconvolution is made robust to noise by the use of coded apertures to capture images without losing resolution. Specifically, deconvolutional layers can minimize the reconstruction error of the input image in crowd counting. By placing deconvolutional layers between a series of convolutional layers and the multi-column architecture, our method increases the contextual information used to make pixel-level decisions. DSPNet can thus acquire more latent multiscale information in highly congested scenes.

3.2. Network structure

DSPNet can learn powerful high-level features from low-level raw data in dense crowd counting. We aim to maximize the use of context, that is, to reduce the loss of contextual information in the network. Thus, the whole network performs inference on the full image. In particular, we do not use random crop or center crop and only apply flip augmentation during training. The DSPNet mainly comprises two parts. The frontend captures low-level features, and the backend further learns multiscale representations. We choose VGGNet as part of the DSPNet structure because VGGNet offers high performance, good usability and fast convergence in many regression and classification tasks, and it has proven powerful for improving image recognition performance. As a trade-off between different elements of the model, we propose a DSPNet architecture that is derived from the VGGNet (Simonyan & Zisserman, 2015) and borrows the first ten layers of VGGNet. Fig. 2 describes the specific details of the DSPNet. The SPM is introduced in Section 3.3. The initial Conv1 to Conv4 are retained, and a 3 × 3 convolutional layer is added to Conv4. The first three max-pooling layers are 2 × 2 with a stride of 2. To minimize spatial resolution loss, we remove the fourth pooling operation.
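The resolution bookkeeping implied by this structure can be checked with a few lines of arithmetic. The sketch below is not the authors' code: it assumes a padding of 1 for the 3 × 3 convolutions and for the 4 × 4 stride-2 deconvolutions listed in Fig. 3 (the padding values are not stated in the paper), and the helper names `conv_out`, `deconv_out`, `frontend_size`, and `backend_size` are hypothetical. Under these assumptions, three 2 × 2 stride-2 poolings reduce each spatial dimension by a factor of 8, and three stride-2 deconvolutions restore it.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling window along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride=1, padding=0):
    """Output length of a transposed convolution along one dimension."""
    return (size - 1) * stride - 2 * padding + kernel


def frontend_size(size, num_pools=3):
    """Spatial size after the frontend: size-preserving 3x3 convolutions
    (padding 1 is an assumption) and three 2x2 stride-2 max-poolings."""
    for _ in range(num_pools):
        size = conv_out(size, kernel=3, stride=1, padding=1)  # one representative conv per block
        size = conv_out(size, kernel=2, stride=2, padding=0)  # max-pooling
    return size


def backend_size(size, num_deconvs=3):
    """Spatial size after three 4x4 stride-2 deconvolutions (padding 1 is an assumption)."""
    for _ in range(num_deconvs):
        size = deconv_out(size, kernel=4, stride=2, padding=1)
    return size


if __name__ == "__main__":
    for side in (768, 1024, 770):
        reduced = frontend_size(side)
        print(side, "->", reduced, "->", backend_size(reduced))
```

With these assumed paddings the input size is recovered exactly only when it is divisible by 8; otherwise the floor in the pooling step loses a few pixels, so an implementation would have to pad or resize to keep the density map aligned with the image.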
DSPNet is a fully convolutional network (Long et al., 2015), and hence, it can be applied to crowd images of arbitrary size. All activations in the network are rectified linear units (ReLU) (Nair & Hinton, 2010). The classic Euclidean distance is commonly used as a loss function in regression tasks. We select the Euclidean regression loss for training DSPNet to quantify the difference between the estimated density map and the ground truth. The Euclidean loss function is defined as follows:

$$L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| F(X_i; \Theta) - F_i \right\|_2^2, \tag{2}$$

where $\Theta$ is the set of network parameters to be optimized, $N$ is the number of training images, and $F(X_i; \Theta)$ represents the output density map. $X_i$ and $F_i$ are the ith input image and ground truth density map, respectively.

Because creating crowd datasets is time-consuming and laborious, annotated data are scarce. Moreover, whereas numerous crowd counting models are trained from scratch, we deploy pretrained features in our model. Technically, we employ the VGGNet-16 model pretrained on the ImageNet (Deng et al., 2009) dataset and fine-tune it with the crowd images. Several current methods (Babu Sam et al., 2018; Idrees et al., 2018; Shi, Zhang, Liu et al., 2018) exploit pretrained models with several modifications. Using pretrained features has been found to improve results significantly.

3.3. Scale purifier module (SPM)

Neural networks learn and optimize their filters through the backpropagation algorithm. By using gradient descent, filters can extract deep features that represent important semantic information. The SPM aims to further learn a comprehensive scale representation, which improves the model's representation ability and thus its predictions. Onoro-Rubio and López-Sastre (2016), Zhang et al. (2016) and Deb and Ventura (2018) demonstrate the effectiveness of multi-column architectures for multiscale object counting. Varying the convolution kernel size across columns can produce features at different scales. Further accounting for scale effects is therefore important for better performance. The maximal ratio combining technique is the optimal linear combining strategy. Although this property can benefit model accuracy, our strategy differs slightly from the maximal ratio combining technique. We adopt a same-ratio “maximal ratio combining” strategy to learn different scale levels (e.g., small, middle, and large) as complementary information and add the deconvolution into the frontend of the multi-column architecture.
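For readers unfamiliar with the term, the following sketch contrasts classical maximal ratio combining, as used in diversity combining for signal processing, with combining all branches in the same ratio. It only illustrates the terminology and is not the SPM itself: the branch amplitudes and noise variances are hypothetical numbers, the function names are invented for this example, and the paper does not state how the learned fusion weights of the network relate to these quantities.

```python
def equal_ratio_weights(num_branches):
    """Same-ratio combining: every branch gets an identical weight."""
    return [1.0 / num_branches] * num_branches


def maximal_ratio_weights(amplitudes, noise_vars):
    """Classical maximal ratio combining: weight each branch by its
    signal amplitude divided by its noise variance (then normalize)."""
    raw = [a / v for a, v in zip(amplitudes, noise_vars)]
    total = sum(raw)
    return [w / total for w in raw]


def combined_snr(weights, amplitudes, noise_vars):
    """Signal-to-noise ratio of a weighted sum of branches whose noise
    terms are independent."""
    signal_power = sum(w * a for w, a in zip(weights, amplitudes)) ** 2
    noise_power = sum(w * w * v for w, v in zip(weights, noise_vars))
    return signal_power / noise_power


if __name__ == "__main__":
    amplitudes = [1.0, 0.8, 0.5]  # hypothetical branch signal amplitudes
    noise_vars = [0.2, 0.1, 0.4]  # hypothetical branch noise variances
    mrc = maximal_ratio_weights(amplitudes, noise_vars)
    same = equal_ratio_weights(len(amplitudes))
    print("MRC SNR:       ", combined_snr(mrc, amplitudes, noise_vars))
    print("same-ratio SNR:", combined_snr(same, amplitudes, noise_vars))
```

Maximal ratio combining attains an output signal-to-noise ratio equal to the sum of the per-branch ratios, which is why it is called optimal among linear combiners; same-ratio combining gives up part of that gain but needs no per-branch quality estimates, and the two coincide when all branches are equally reliable.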
All feasible branches are then combined to maximize the scale features. We design the SPM as an attractive alternative to the architecture of MCNN. In sharp contrast to MCNN, the SPM is composed of a set of deconvolutional (Zeiler et al., 2010) and convolutional layers. Because of the pooling operations in the earlier stages, the feature maps need to be upsampled using deconvolution. Deconvolutional layers reconstruct the full-sized input and incorporate sufficient contextual information. The SPM then learns the different scaled features in the deep network backend. We investigate three possible solutions in Section 4.4 and redesign multiple columns with different numbers of filters and filter sizes. Every column can learn different scales of information as complementary features. Finally, we concatenate the predictions from the multi-column networks and process them using a 1 × 1 convolution layer. We use this 1 × 1 convolution rather than a fully connected layer to generate the feature maps. The final feature maps are fused in the same ratio. The SPM network is detailed in Fig. 3.

Fig. 2. Overview of the DSPNet. The input image passes through three blocks of 3 × 3 convolutional layers (two layers with 64 filters, two with 128, and three with 256), each block followed by a 2 × 2 max-pooling layer with stride 2, then through four 3 × 3 convolutional layers with 512 filters and the scale purifier module, which produces the density map and the people count.

Fig. 3. The SPM architecture. (a) Dual-column SPM, (b) three-column SPM, and (c) four-column SPM. Each variant starts with three 4 × 4 stride-2 deconvolution layers (256, 128, and 64 filters), followed by parallel stride-1 convolutional columns with kernel sizes between 3 × 3 and 9 × 9 across the variants, whose outputs are concatenated and passed through a 1 × 1 convolution with a single output channel.

3.4. Ground truth density maps

Most of the publicly available crowd counting datasets provide dotted annotations for the raw data. Specifically, there are two possible learning targets: the head count and the crowd density map. Compared with determining the number of heads, density maps can provide the spatial distribution of the crowd, which benefits public space design and resource management. We select the latter option for dense crowd counting. Our method consists of three main steps. Firstly, we convert the annotated points to ground truth density maps using the Gaussian kernel function. Secondly, the model continuously learns from the ground truth to generate optimal crowd density maps. Finally, the integral of the density map is equal to the number of pedestrians that are labeled manually. A single ground truth density map is the same size as the corresponding crowd image. Fig. 6 presents an example of the visualization results, where the second column presents the ground truth.

Obtaining the ground truth is crucial for crowd counting systems. The ground truth provides the constraints for training the model and greatly helps multiple layers learn strong representations. We follow the method for creating the ground truth in Zhang, Li, Wang, and Yang (2015) and employ a fixed-size Gaussian kernel to handle overcrowded scenes. Specifically, the ground truth $F(z)$ is generated by convolving a delta function $\delta(z - z_i)$ at each head annotation with a normalized Gaussian kernel $G_{\mu,\rho^2}$. For a training image with $N$ labeled heads, the density function is represented as:

$$F(z) = \sum_{i=1}^{N} \delta(z - z_i) * G_{\mu,\rho^2}(z). \tag{3}$$
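A minimal pure-Python sketch of Eq. (3) follows. It uses the kernel size of 15 and standard deviation of 4 adopted in this paper; everything else is an assumption made for illustration, including the function names, the (row, column) annotation format, and in particular the border handling, where the truncated kernel is renormalized so that every head still contributes exactly one unit of mass. The paper does not describe how borders are treated.

```python
import math


def gaussian_kernel(size=15, sigma=4.0):
    """Normalized 2D Gaussian kernel of shape size x size (sums to 1)."""
    half = size // 2
    kernel = [[math.exp(-((x - half) ** 2 + (y - half) ** 2) / (2.0 * sigma ** 2))
               for x in range(size)] for y in range(size)]
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]


def density_map(height, width, heads, size=15, sigma=4.0):
    """Sum one Gaussian blob per annotated head, as in Eq. (3).

    heads: list of (row, col) integer positions inside the image.
    Border handling (renormalizing the clipped kernel) is an assumption.
    """
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    dmap = [[0.0] * width for _ in range(height)]
    for row, col in heads:
        cells = []
        for dy in range(size):
            for dx in range(size):
                r, c = row + dy - half, col + dx - half
                if 0 <= r < height and 0 <= c < width:
                    cells.append((r, c, kernel[dy][dx]))
        mass = sum(weight for _, _, weight in cells)
        if mass == 0.0:
            continue  # head lies entirely outside the image; skip it
        for r, c, weight in cells:
            dmap[r][c] += weight / mass
    return dmap


if __name__ == "__main__":
    heads = [(0, 0), (20, 30), (39, 59)]  # hypothetical annotations
    dmap = density_map(40, 60, heads)
    print("integral of density map:", sum(sum(row) for row in dmap))
```

Because each blob integrates to one, summing the map recovers the annotated head count, which is the property the counting pipeline relies on.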

In general, μ and ρ represent the kernel size and standard deviation, respectively, of the Gaussian kernel. By following Sindagi and Patel (2017a), we set parameters μ = 15 and ρ = 4 for all datasets.

4. Experiments

4.1. Datasets

In this section, we evaluate our method using three publicly available crowd counting datasets: UCF-QNRF (Idrees et al., 2018), ShanghaiTech (Zhang et al., 2016), and UCF_CC_50 (Idrees et al., 2013). Fig. 4 presents representative samples from the three datasets, and detailed statistics of the three datasets are reported in Table 1.

The UCF-QNRF dataset is a recent dataset and the largest crowd counting dataset to date, collected from Flickr, Web Search, and Hajj footage. It contains 1,535 images with 1,251,642 annotated persons in total. The number of people ranges from 49 to 12,865 with an average of 815 individuals in each image, thus making this dataset challenging for deep learning approaches. The training and test sets comprise 1,201 and 334 images, respectively.

The ShanghaiTech dataset is a large-scale crowd counting dataset, containing 1,198 images and 330,165 annotated head centers. The dataset is divided into two parts: Part_A and Part_B. Part_A includes 482 images randomly crawled from the Internet, among which 300 images are used for training, and the remaining 182 images are used for testing. The 716 images in Part_B were collected from the busy urban streets of Shanghai. For Part_B, 400 images are used for training, and 316 images are used to test the model.

The UCF_CC_50 dataset contains 50 images of unconstrained scenes and is a grayscale image dataset, unlike the above two datasets that contain RGB images. The small number of images makes it difficult for a crowd counting system to train well. As a challenging benchmark, this dataset contains extremely dense crowds with counts varying between 94 and 4,543. We follow the dataset instructions and evaluate our results using 5-fold cross-validation.

Fig. 4. Examples of the three crowd counting datasets.

Table 1 Summary of various crowd counting datasets.

Dataset              Num    Total      Min  Max     Ave    Place           Color
UCF-QNRF             1,535  1,251,642  49   12,865  815    Indoor/Outdoor  RGB
ShanghaiTech Part_A  482    241,677    33   3,139   501    Indoor/Outdoor  RGB
ShanghaiTech Part_B  716    88,488     9    578     124    Outdoor         RGB
UCF_CC_50            50     63,974     94   4,543   1,279  Outdoor         Gray

4.2. Evaluation metric and training details

Two standard metrics, the mean absolute error (MAE) and the root mean squared error (RMSE), are employed to compare the performance of different models. The metrics are defined as follows:

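$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| z_i - \hat{z}_i \right|, \tag{4}$$

and

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left| z_i - \hat{z}_i \right|^2}. \tag{5}$$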
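Eqs. (4) and (5) translate directly into code. The sketch below is illustrative only: the function names and the example counts are hypothetical, and `predicted_count` simply integrates a density map given as a nested list, which is how a count is read off a predicted map.

```python
import math


def predicted_count(density_map):
    """Estimated count: the integral (sum) of a 2D density map."""
    return sum(sum(row) for row in density_map)


def mae(ground_truth, estimates):
    """Mean absolute error over the test images, Eq. (4)."""
    n = len(ground_truth)
    return sum(abs(z - z_hat) for z, z_hat in zip(ground_truth, estimates)) / n


def rmse(ground_truth, estimates):
    """Root mean squared error over the test images, Eq. (5)."""
    n = len(ground_truth)
    return math.sqrt(sum((z - z_hat) ** 2 for z, z_hat in zip(ground_truth, estimates)) / n)


if __name__ == "__main__":
    true_counts = [100, 200, 300]       # hypothetical ground truth counts
    estimated_counts = [110, 190, 330]  # hypothetical model estimates
    print("MAE: ", mae(true_counts, estimated_counts))
    print("RMSE:", rmse(true_counts, estimated_counts))
```

Since RMSE squares each error before averaging, a single badly miscounted image raises it more than it raises MAE, which is why the two metrics are commonly read as indicators of accuracy and robustness, respectively.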

Here, $N$ represents the number of images in the test set, $z_i$ is the ground truth count, and $\hat{z}_i$ is the estimated count for the ith sample. Lower MAE and RMSE values indicate better model performance. To measure the quality of the estimated density maps, we calculate the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM). Note that these results are only reported on the ShanghaiTech Part_A dataset. Higher PSNR and SSIM values indicate a higher-quality density map.

We provide details about the training of the DSPNet. The input of the network consists of RGB (UCF-QNRF and ShanghaiTech) and grayscale (UCF_CC_50) images. Ground truth density maps are created as described in Section 3.4. The experiments are implemented in the PyTorch (Paszke et al., 2017) framework, and the computations are run on a GeForce GTX 1080Ti GPU with 11 GB of memory. Data augmentation consists only of random horizontal flipping with a probability of 0.5. Parameter optimization is performed using Adam (Kingma & Ba, 2014) with a batch size of 1. The learning rate is initially set to 0.00001 and is subsequently decayed by a factor of 0.995 every epoch. The training process runs for 500 epochs on all training sets. According to our experiments, DSPNet converges at approximately 100 epochs.

4.3. Comparison with the state-of-the-art methods

We quantitatively evaluate our proposed method and compare its performance with various advanced counting systems: ic-CNN, CSRNet, ACSCP, D-ConvNet, IG-CNN, CP-CNN, CL-CNN (Idrees et al., 2018), and AMDCN (Deb & Ventura, 2018), along with prior models including MCNN, Hydra CNN, Cascaded-MTL, Switching-CNN, MSCNN, and TDF-CNN.

As shown in Table 2, our method achieves the highest accuracy on the UCF-QNRF dataset among all approaches. We obtain the lowest MAE of 107.5, an 18.6% improvement over the previous best-performing solution (CL-CNN). DSPNet also delivers a 57.1% lower RMSE on the UCF-QNRF dataset compared to the MCNN. Thus, our network outperforms existing approaches and achieves the lowest MAE and RMSE. The results clearly reveal the significance of learning scale information, particularly in images with widely varying densities.

Similarly, Table 3 reports the empirical results on ShanghaiTech Part_A and Part_B. The DSPNet model achieves the best performance in three of the four results: it ties CSRNet on the Part_A MAE and attains the lowest Part_B MAE and RMSE, whereas ACSCP attains a lower Part_A RMSE. DSPNet is simple to construct and can be trained directly on the full images. Unlike sophisticated models such as those of Sindagi and Patel (2017b), Shen et al. (2018) and Babu Sam et al. (2018), DSPNet follows Occam’s razor. Our approach has competitive performance in all these heterogeneous scenes.

We conduct further experiments on density map quality using the ShanghaiTech dataset. The results are tabulated in Table 5. Our model outperforms several previous methods on Part_A, including MCNN, CP-CNN, and CSRNet. We surpass the second-best method (CSRNet) by 17.6% and 10.5% for PSNR and SSIM, respectively. DSPNet produces higher quality crowd density maps due to the design and use of deconvolution layers.

To overcome the limited data bottleneck, we split the UCF_CC_50 dataset randomly into five parts of ten images each, following the standard protocol (Idrees et al., 2013). Table 4 presents the DSPNet results, along with the previous top-performing systems for the UCF_CC_50 dataset, evaluated by MAE and RMSE. Our counting system achieves superior performance compared to the state-of-the-art methods. For example, DSPNet exhibits a 6.7% MAE improvement compared with the second-best approach (ic-CNN) and a 4.1% RMSE improvement compared with the second-best method (CP-CNN), which indicates that DSPNet can perform well with only a small amount of training data.

Table 2 Comparison of DSPNet with state-of-the-art methods on the UCF-QNRF dataset.

Method           MAE    RMSE
FHSc+MRF         315    508
MCNN             277    426
Encoder-Decoder  270    478
Cascaded-MTL     252    514
Switching-CNN    228    445
Resnet101        190    277
Densenet201      163    226
CL-CNN           132    191
DSPNet (Ours)    107.5  182.7

Table 3 Comparison of DSPNet with state-of-the-art methods on the ShanghaiTech dataset.

                 Part_A          Part_B
Method           MAE    RMSE     MAE   RMSE
MCNN             110.2  173.2    26.4  41.3
Cascaded-MTL     101.3  152.4    20.0  31.1
TDF-CNN          97.5   145.1    20.7  32.8
Switching-CNN    90.4   135.0    21.6  33.4
MSCNN            83.8   127.4    17.7  30.2
ACSCP            75.7   102.7    17.2  27.4
CP-CNN           73.6   106.4    20.1  30.1
D-ConvNet-v1     73.5   112.3    18.7  26.0
IG-CNN           72.5   118.2    13.6  21.1
ic-CNN           68.5   116.2    10.7  16.0
CSRNet           68.2   115.0    10.6  16.0
DSPNet (Ours)    68.2   107.8    8.9   14.0

Table 4 Comparison of DSPNet with state-of-the-art methods on the UCF_CC_50 dataset.

Method           MAE     RMSE
MCNN             377.6   509.1
MSCNN            363.7   468.4
TDF-CNN          354.7   491.4
Hydra CNN        333.7   425.3
Cascaded-MTL     322.8   397.9
Switching-CNN    318.1   439.2
CP-CNN           295.8   320.9
IG-CNN           291.4   349.4
ACSCP            291.0   404.6
AMDCN            290.82  –
D-ConvNet-v1     288.4   404.7
CSRNet           266.1   397.5
ic-CNN           260.9   365.5
DSPNet (Ours)    243.3   307.6

Table 5 Quality of the density map on the ShanghaiTech Part_A dataset.

Method           PSNR  SSIM
MCNN             21.4  0.52
CP-CNN           21.7  0.72
CSRNet           23.8  0.76
DSPNet (Ours)    28.0  0.84

[Figure: plot of MAE and RMSE; vertical axis: Error.]

4.4. Ablation study

To investigate the behavior of SPM within the proposed DSPNet, we perform a series of ablation studies on ShanghaiTech Part_A and analyze the results. In deep CNN, different layers extract

90 80 70 60 50 0

1

2

3

4

5

Column Fig. 5. DSPNet with different columns using ShanghaiTech Part_A dataset.

special types of features. More specifically, lower layers achieve inner class changes by encoding special features, and upper layers capture the category semantic information (Wang, Ouyang, Wang, & Lu, 2015). The effectiveness of the scale purifier module in crowd counting is demonstrated using ablations. For a fair comparison, we use the DSPNet without SPM as the baseline, which is a singlecolumn network structure. The details of the SPM are detailed in Section 3.3. As shown in Fig. 5, we obtain a significant decrease in MAE and RMSE errors when using the SPM at baseline compared to the single-column DSPNet. Thus, the DSPNet with the threecolumn SPM achieves the most accurate counting results. The performance of different settings is reported in Table 6. MAE values decrease from 76.0 to 68.2 for the ShanghaiTech Part_A dataset. It can be seen that the scale purifier module can increase the prediction accuracy on challenging datasets with complex scale variations. Moreover, the scale purifier module reduces the RMSE further to 107.8, improving the RMSE by approximately 10% compared

to the single-column DSPNet. In conclusion, these experimental results confirm the feasibility and effectiveness of the SPM.

Fig. 6. Examples of ground truth and predicted density maps for the ShanghaiTech Part_A dataset. First column: RGB images. Second column: Ground truth density maps. Third column: DSPNet. Fourth column: MCNN.

Table 6
MAE and RMSE results of different models on the ShanghaiTech Part_A dataset.

Method                              MAE      RMSE
DSPNet (single-column, w/o SPM)     76.0     119.8
DSPNet (dual-column SPM)            71.5     114.9
DSPNet (three-column SPM)           68.2     107.8
DSPNet (four-column SPM)            69.6     112.5

4.5. Effect of generalizability

In the following discussion, we explore the generalization ability of DSPNet. Following existing work (Shi, Zhang, Liu et al., 2018), we use three different cross-scene validation schemes: 1) ShanghaiTech Part_A is the training set, and Part_B is the test set. 2) ShanghaiTech Part_A is the training set, and UCF_CC_50 is the test set. 3) ShanghaiTech Part_B is the training set, and Part_A is the test set. The quantitative results are listed in Table 7. The MAE and RMSE values are used in all evaluations as the standard metrics. Our DSPNet yields the lowest errors in all three schemes. Next, we describe the details of the experiment corresponding to each scheme in the table. When training on Part_A and testing on Part_B, DSPNet improves the MAE by 10.1-82.3% and the RMSE by 8.1-81.6% over the compared methods. DSPNet also performs well when the challenging UCF_CC_50 dataset is selected as the test set. In particular, the MAE decreases from 321.6 to 317.7, and the RMSE decreases from 475.4 to 471.0 when compared to the existing best algorithm, CSRNet. Finally, we train DSPNet on the Part_B dataset. Our method achieves lower errors (120.5 and 194.6) than D-ConvNet-v1 (140.4 and 226.1) and CSRNet (131.6 and 210.3). The ShanghaiTech Part_B dataset is collected from busy streets in Shanghai. Thus, compared to ShanghaiTech Part_A and UCF_CC_50, its scene diversity is limited. Despite this difference in scene diversity, our proposed method still achieves a clear improvement. The results of CSRNet are similar to those of our network, as both DSPNet and CSRNet adopt features pretrained on ImageNet. All experimental results demonstrate that DSPNet is more accurate and robust than current models.

4.6. Discussions of the pretrained network

In this subsection, we discuss the behavior of DSPNet when it is trained on a merged dataset and when it is pretrained on a crowd counting dataset. Considering that the UCF_CC_50 dataset typically adopts 5-fold cross-validation, we only use the UCF-QNRF and ShanghaiTech datasets in this part of the experiments. For the purpose of comparison, we strictly follow the experimental setup described in Section 4.2. First, we experiment on the merged dataset. We merge the training sets of UCF-QNRF, ShanghaiTech Part_A, and Part_B (1901 images in total), and the setups for the merged dataset are performed as in Section 3.4. We separately evaluate the model on the test set of each of the above datasets, and the results are shown in Table 8. The first line denotes DSPNet trained on the respective dataset, and the second line represents the results of DSPNet trained on the merged dataset. The model trained on the respective dataset achieves better results than that trained on the merged dataset, except for the UCF-QNRF dataset. In the merged dataset, the proportions of UCF-QNRF, ShanghaiTech Part_A, and Part_B are 63%, 16%, and 21%, respectively. As the UCF-QNRF dataset maintains a relatively high proportion, the model can perform more accurately

and robustly in testing. The small size of the ShanghaiTech dataset may cause underfitting in the learning process; in general, it is not easy to obtain the minima.

Second, we conduct a series of experiments in which a network is pretrained on one dataset and then fine-tuned on a new dataset. All the test set results are listed in Table 8, and the last three lines correspond to DSPNet pretrained on one dataset and then fine-tuned and tested on the remaining datasets. We obtain the best results in three out of six items. The learned model becomes similar to that learned without a pretrained network when DSPNet uses the UCF-QNRF and Part_B datasets for the pretraining. However, when Part_A is tested with a pretrained network, the performance degrades. From a transfer learning perspective, we attribute these results to differences between the distributions of the source domain (UCF-QNRF or ShanghaiTech Part_B) and the target domain (ShanghaiTech Part_A), which may cause the training to fall into local minima and lead to poor performance on the test set.

Table 7
The results of the transfer performances on the ShanghaiTech and UCF_CC_50 datasets.

                  Part_A → Part_B      Part_A → UCF_CC_50    Part_B → Part_A
Method            MAE      RMSE        MAE      RMSE         MAE      RMSE
MCNN              85.2     142.3       397.7    624.1        221.4    357.8
Cascaded-MTL      40.5     77.0        450.3    710.4        224.0    417.0
D-ConvNet-v1      49.1     99.2        364.0    545.8        140.4    226.1
CSRNet            16.8     28.5        321.6    475.4        131.6    210.3
DSPNet (Ours)     15.1     26.2        317.7    471.0        120.5    194.6

Table 8
The results of DSPNet with different training strategies. In the last three rows, the dataset in parentheses is the one used for pretraining.

                            UCF-QNRF           ShanghaiTech Part_A    ShanghaiTech Part_B
Method                      MAE      RMSE      MAE      RMSE          MAE      RMSE
DSPNet (separate dataset)   107.5    182.7     68.2     107.8         8.9      14.0
DSPNet (merged dataset)     107.1    181.2     68.7     116.4         9.6      17.5
DSPNet (UCF-QNRF)           –        –         74.2     119.0         8.4      13.6
DSPNet (Part_A)             108.3    180.4     –        –             9.4      14.8
DSPNet (Part_B)             110.8    188.4     73.7     120.9         –        –

4.7. Visualization of our method

To provide qualitative insights, we visualize the density maps. Our method can utilize given labels to generate the corresponding density map for each still image within different datasets. For comparison, we show the density maps generated by DSPNet and MCNN. We randomly select crowd images of varying density to display density maps. Some examples of the predicted density maps and true count results are shown in Fig. 6. The results of DSPNet are roughly similar to the ground truth density maps and the crowd distributions in the real-world images. The estimated density maps are easy to distinguish, revealing that our proposed DSPNet can effectively predict crowd density maps. In contrast, the quality of the density maps produced by MCNN is low.
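The count comparison behind Fig. 6 and the error metrics used throughout Section 4 can be illustrated with a minimal sketch. It assumes the standard convention that a density map integrates to the number of people in the image, and the usual definitions of MAE and RMSE over per-image counts. The function names and the tiny 2x2 maps are hypothetical and made up for illustration; they are not taken from the paper's code or data.

```python
import math


def count_from_density(density_map):
    """Estimated crowd count: the sum of all pixels of a 2-D density map."""
    return sum(sum(row) for row in density_map)


def mae_rmse(true_counts, pred_counts):
    """Mean absolute error and root mean squared error over per-image counts."""
    errors = [pred - true for true, pred in zip(true_counts, pred_counts)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mae, rmse


# Toy (made-up) ground truth and predicted density maps for two images.
gt_maps = [
    [[0.0, 0.5], [0.5, 1.0]],  # sums to 2 people
    [[1.0, 1.0], [1.0, 1.0]],  # sums to 4 people
]
pred_maps = [
    [[0.0, 0.5], [0.5, 2.0]],  # sums to 3 people (over-count by 1)
    [[0.5, 0.5], [0.5, 0.5]],  # sums to 2 people (under-count by 2)
]

true_counts = [count_from_density(m) for m in gt_maps]
pred_counts = [count_from_density(m) for m in pred_maps]
mae, rmse = mae_rmse(true_counts, pred_counts)
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```

Because RMSE squares each error before averaging, the single under-count of 2 weighs more heavily in RMSE than in MAE, which is why the two metrics can rank methods differently in the tables.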
Note that our proposed solution relies on naive convolution operations, without a sophisticated network structure. DSPNet performs better in fine-grained feature extraction and produces high-quality crowd density maps. Additionally, our method reduces the counting errors and improves the accuracy on benchmark datasets.
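The relative improvements quoted in Sections 4.3 to 4.5 follow from the table entries. As a worked check, the sketch below recomputes them, assuming the percentages are relative changes with respect to the compared method (the helper name `relative_improvement` is hypothetical). For error metrics (MAE, RMSE) lower is better; for PSNR and SSIM higher is better.

```python
def relative_improvement(baseline, ours, lower_is_better=True):
    """Percentage improvement of `ours` relative to `baseline`."""
    delta = baseline - ours if lower_is_better else ours - baseline
    return 100.0 * delta / baseline


# Values copied from Tables 2, 4, 5, 6 and 7.
improvements = {
    "UCF-QNRF MAE vs CL-CNN": relative_improvement(132, 107.5),
    "UCF-QNRF RMSE vs MCNN": relative_improvement(426, 182.7),
    "Part_A PSNR vs CSRNet": relative_improvement(23.8, 28.0, lower_is_better=False),
    "Part_A SSIM vs CSRNet": relative_improvement(0.76, 0.84, lower_is_better=False),
    "UCF_CC_50 MAE vs ic-CNN": relative_improvement(260.9, 243.3),
    "UCF_CC_50 RMSE vs CP-CNN": relative_improvement(320.9, 307.6),
    "Ablation RMSE vs single-column": relative_improvement(119.8, 107.8),
    "Part_A to Part_B MAE vs CSRNet": relative_improvement(16.8, 15.1),
    "Part_A to Part_B MAE vs MCNN": relative_improvement(85.2, 15.1),
    "Part_A to Part_B RMSE vs CSRNet": relative_improvement(28.5, 26.2),
    "Part_A to Part_B RMSE vs MCNN": relative_improvement(142.3, 26.2),
}

for name, value in improvements.items():
    print(f"{name}: {value:.1f}%")
```

Under this assumption the recomputed values agree, to one decimal place, with the percentages stated in the text (18.6%, 57.1%, 17.6%, 10.5%, 6.7%, 4.1%, approximately 10%, and the ranges 10.1-82.3% and 8.1-81.6%).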

5. Conclusion and future work

In this paper, we propose a novel deep scale purifier network (DSPNet) for dense crowd counting. We present the scale purifier module (SPM) for encoding implicit multiscale features and generating high-quality density maps. The design of the SPM adopts a “maximal ratio combining” strategy to learn scales at different levels and enhance scale representations. To reduce the loss of contextual information, we feed the full images into the DSPNet. The whole network is trained in an end-to-end manner with the naive Euclidean loss. As a result, our proposed model is simple and easy to train. Our approach addresses issues related to scale variation in crowd counting regression. To the best of our knowledge, research on different scales in a unified deep neural network backend has not yet been performed. Experimental results show that DSPNet outperforms state-of-the-art methods in counting precision and robustness on the UCF-QNRF and UCF_CC_50 datasets and achieves comparable performance on the ShanghaiTech dataset. The evaluation results reveal that our crowd counting model can robustly count crowds with improved accuracy. For future work, we will employ active learning techniques in order to count multiple types of objects, including bacterial cells, vehicles, and wild animals. We will also consider the intermediate supervision of the neural network for further performance improvements.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Credit authorship contribution statement

Xin Zeng: Conceptualization, Investigation, Visualization, Writing - original draft, Writing - review & editing. Yunpeng Wu: Investigation, Writing - review & editing. Shizhe Hu: Conceptualization, Writing - review & editing. Ruobin Wang: Investigation, Visualization. Yangdong Ye: Supervision, Writing - review & editing.

Acknowledgments

This work was supported by the National Key R&D Program of China (No. 2018YFB1201403) and the National Natural Science Foundation of China (No. 61772475).


References

Babu Sam, D., Sajjan, N. N., Venkatesh Babu, R., & Srinivasan, M. (2018). Divide and grow: Capturing huge diversity in crowd images with incrementally growing cnn. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3618–3626).
Babu Sam, D., Surya, S., & Venkatesh Babu, R. (2017). Switching convolutional neural network for crowd counting. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4031–4039).
Barcellos, P., Bouvié, C., Escouto, F. L., & Scharcanski, J. (2015). A novel video based system for detecting and counting vehicles at user-defined virtual loops. Expert Systems with Applications, 42, 1845–1856.
Boominathan, L., Kruthiventi, S. S., & Babu, R. V. (2016). Crowdnet: A deep convolutional network for dense crowd counting. In Proceedings of the 24th ACM international conference on multimedia (pp. 640–644).
Chan, A. B., Liang, Z.-S. J., & Vasconcelos, N. (2008). Privacy preserving crowd monitoring: Counting people without people models or tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1–7).
Chen, K., Loy, C. C., Gong, S., & Xiang, T. (2012). Feature mining for localised crowd counting. In BMVC (p. 3).
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 886–893).
De Almeida, P. R., Oliveira, L. S., Britto Jr, A. S., Silva Jr, E. J., & Koerich, A. L. (2015). Pklot–a robust dataset for parking lot classification. Expert Systems with Applications, 42, 4937–4949.
Deb, D., & Ventura, J. (2018). An aggregated multicolumn dilated convolution network for perspective-free counting. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 195–204).
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Li, F. F. (2009). Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 248–255).
Duives, D., Daamen, W., & Hoogendoorn, S. (2018a). Monitoring the number of pedestrians in an area: The applicability of counting systems for density state estimation. Journal of Advanced Transportation, 2018.
Duives, D. C., Daamen, W., & Hoogendoorn, S. (2018b). How to measure static crowds? Monitoring the number of pedestrians at large open areas by means of Wi-Fi sensors. Technical Report.
Ellis, A., & Ferryman, J. (2010). Pets2010: Dataset and challenge. In Advanced video and signal based surveillance (AVSS) (pp. 143–150).
Fu, M., Xu, P., Li, X., Liu, Q., Ye, M., & Zhu, C. (2015). Fast crowd density estimation with convolutional neural networks. Engineering Applications of Artificial Intelligence, 43, 81–88.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
Idrees, H., Saleemi, I., Seibert, C., & Shah, M. (2013). Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2547–2554).
Idrees, H., Tayyab, M., Athrey, K., Zhang, D., Al-Maadeed, S., Rajpoot, N., et al. (2018). Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the european conference on computer vision (ECCV) (pp. 532–546).
Jegou, H., Perronnin, F., Douze, M., Sánchez, J., Perez, P., & Schmid, C. (2012). Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34, 1704–1716.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. In Proceedings of the 2nd international conference on learning representations.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097–1105).
Laradji, I. H., Rostamzadeh, N., Pinheiro, P. O., Vazquez, D., & Schmidt, M. (2018). Where are the blobs: Counting by localization with point supervision. In Proceedings of the european conference on computer vision (ECCV) (pp. 547–562).
Lempitsky, V., & Zisserman, A. (2010). Learning to count objects in images. In Advances in neural information processing systems (pp. 1324–1332).
Li, Y., Zhang, X., & Chen, D. (2018). Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1091–1100).
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431–3440).
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60, 91–110.
Nagao, K., Yanagisawa, D., & Nishinari, K. (2018). Estimation of crowd density applying wavelet transform and machine learning. Physica A: Statistical Mechanics and its Applications, 510, 145–163.
Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning.
Noh, H., Hong, S., & Han, B. (2015). Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision (pp. 1520–1528).
Onoro-Rubio, D., & López-Sastre, R. J. (2016). Towards perspective-free object counting with deep learning. In Proceedings of the european conference on computer vision (ECCV) (pp. 615–629).
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., et al. (2017). Automatic differentiation in pytorch. Advances in neural information processing systems.
Pham, V.-Q., Kozakaya, T., Yamaguchi, O., & Okada, R. (2015). Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation. In Proceedings of the IEEE international conference on computer vision (pp. 3253–3261).
Ranjan, V., Le, H., & Hoai, M. (2018). Iterative crowd counting. In Proceedings of the european conference on computer vision (ECCV) (pp. 270–285).
Ronao, C. A., & Cho, S.-B. (2016). Human activity recognition with smartphone sensors using deep learning neural networks. Expert Systems with Applications, 59, 235–244.
Ryan, D., Denman, S., Sridharan, S., & Fookes, C. (2015). An evaluation of crowd counting methods, features and regression models. Computer Vision and Image Understanding, 130, 1–17.
Sam, D. B., & Babu, R. V. (2018). Top-down feedback for crowd counting convolutional neural network. In Proceedings of the 32nd AAAI conference on artificial intelligence (pp. 7323–7330).
Shen, Z., Xu, Y., Ni, B., Wang, M., Hu, J., & Yang, X. (2018). Crowd counting via adversarial cross-scale consistency pursuit. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5245–5254).
Shi, Z., Zhang, L., Liu, Y., Cao, X., Ye, Y., Cheng, M.-M., et al. (2018). Crowd counting with deep negative correlation learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5382–5390).
Shi, Z., Zhang, L., Sun, Y., & Ye, Y. (2018). Multiscale multitask deep netvlad for crowd counting. IEEE Transactions on Industrial Informatics, 14, 4953–4962.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd international conference on learning representations.
Sindagi, V. A., & Patel, V. M. (2017a). Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In Advanced video and signal based surveillance (AVSS) (pp. 1–6).
Sindagi, V. A., & Patel, V. M. (2017b). Generating high-quality crowd density maps using contextual pyramid cnns. In Proceedings of the IEEE international conference on computer vision (pp. 1861–1870).
Sindagi, V. A., & Patel, V. M. (2018). A survey of recent advances in cnn-based single image crowd counting and density estimation. Pattern Recognition Letters, 107, 3–16.
Wang, L., Ouyang, W., Wang, X., & Lu, H. (2015). Visual tracking with fully convolutional networks. In Proceedings of the IEEE international conference on computer vision (pp. 3119–3127).
Zeiler, M. D., Krishnan, D., Taylor, G. W., & Fergus, R. (2010). Deconvolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Zeng, L., Xu, X., Cai, B., Qiu, S., & Zhang, T. (2017). Multi-scale convolutional neural networks for crowd counting. In 2017 IEEE international conference on image processing (ICIP) (pp. 465–469).
Zhang, C., Li, H., Wang, X., & Yang, X. (2015). Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 833–841).
Zhang, Y., Zhou, D., Chen, S., Gao, S., & Ma, Y. (2016). Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 589–597).