Counting crowds with varying densities via adaptive scenario discovery framework


Communicated by Dr. Wang QI

Journal Pre-proof

Xingjiao Wu, Yingbin Zheng, Hao Ye, Wenxin Hu, Tianlong Ma, Jing Yang, Liang He

PII: S0925-2312(20)30235-6
DOI: https://doi.org/10.1016/j.neucom.2020.02.045
Reference: NEUCOM 21913

To appear in: Neurocomputing

Received date: 19 September 2019
Revised date: 20 December 2019
Accepted date: 9 February 2020

Please cite this article as: Xingjiao Wu, Yingbin Zheng, Hao Ye, Wenxin Hu, Tianlong Ma, Jing Yang, Liang He, Counting Crowds with Varying Densities via Adaptive Scenario Discovery Framework, Neurocomputing (2020), doi: https://doi.org/10.1016/j.neucom.2020.02.045

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2020 Published by Elsevier B.V.

Counting Crowds with Varying Densities via Adaptive Scenario Discovery Framework

Xingjiao Wu (a), Yingbin Zheng (b), Hao Ye (b), Wenxin Hu (a), Tianlong Ma (a), Jing Yang (a), Liang He (a)

(a) East China Normal University, Shanghai, China
(b) Videt Tech Ltd., Shanghai, China

Abstract

The task of crowd counting is to estimate the number of pedestrians in crowd images. Due to camera perspective and physical barriers among dense crowds, constructing a robust counting model for varying densities and various scenarios is a challenging problem. In this paper, we propose an adaptive scenario discovery framework for counting crowds with varying densities. The framework is structured with two parallel pathways that are trained to represent different crowd densities and are arranged in the proper geometric configuration using different sizes of the receptive field. A third adaption branch is designed to adaptively recalibrate the pathway-wise responses by discovering and modeling the dynamic scenarios implicitly. We conduct experiments using the adaptive scenario discovery framework on five challenging crowd counting datasets and demonstrate its superiority in terms of effectiveness and efficiency over previous approaches.

Keywords: Crowd counting, adaptive scenario discovery, convolutional neural network

Preprint submitted to Neurocomputing, February 14, 2020

1. Introduction

Crowd counting, i.e., estimating the number of people in a particular scenario, plays a crucial role in public safety management and urban planning. With the improvement of public security awareness, crowd counting has received increasing attention, and Convolutional Neural Network (CNN) based counting approaches have made significant progress in recent years [1, 2, 3]. However, precisely estimating the number of people in crowd images is still a challenging task because of occlusions, perspective distortion, non-uniform distribution of people, scale variations, and non-uniform illumination. One of the principal problems is that it is difficult to describe the complex scenario effectively.

Figure 1: The deviation between the ground-truth and predicted number of pedestrians by MCNN [4] (a) and CSRNet [5] (b) on the ShanghaiTech dataset [4]. The green part indicates that the predicted number is lower than the ground truth, the middle area indicates the set of images with less error, and the red part indicates the set of images whose predicted number is higher than the ground truth.

In our preliminary experiments, we found an interesting phenomenon: the estimation errors are usually caused by a small minority of the crowd images. We illustrate some examples and the distribution of the deviations between ground truth and prediction by previous works in Figure 1. The images with similar deviation seem to have similar prediction patterns: the images with lower camera viewpoints and more background usually receive counting predictions smaller than the ground truth, while those with high viewpoints receive predicted values larger than the ground truth; some ground-truth and predicted density maps are also shown in Figure 2. Although deep learning based algorithms have better representation ability than handcrafted features, the methods still have a gap for some crowd patterns. We consider a scenario as a few crowd images with similar patterns, such as viewpoint, background, etc. Here we aim to design a model that can discover the scenarios with varying density and model the crowd images simultaneously.

In this paper, we present an adaptive scenario discovery framework for crowd counting. One intuitive idea is to increase the number of network branches with well-designed convolutional layers. Recently, more and more architectures [4, 6, 7, 5] use a network with multiple columns (branches), as the multi-branch approaches aim to learn more contextual information and achieve better performance than systems that employ a single CNN regressor (e.g., [8]). However, highly variable crowd images are difficult to represent by the different receptive fields that are applied in multiple branches [5]. In this work, instead of using the multiple receptive fields directly, we first employ VGG as our backbone and use pre-trained weights for transfer learning to extract the features. To discover the different scenarios, we use two parallel pathways that form the counting network after the backbone. The first pathway, named Sparse Branch, is designed to model the highly congested scenario with sparse crowd. The second pathway, named Dense Branch, is designed for the dense crowd.

We notice that some of the multiple-column (branch) architectures also employ a shared backbone and focus on building more pathways to model the different scenarios. We denote them as multi-pathway methods, including Switch-CNN [6], CMTL [9], DAN [10], D-ConvNets [11], and IG-CNN [12]. These previous methods define a limited number of scenarios and are difficult to adapt to more scenarios. Here we propose the adaption branch, which provides different weights for the dense and sparse branches so that the two branches can be combined to cover a variety of changing scenarios. It is worth mentioning that the weights are learned during training and our adaption branch can evolve dynamically.

Our contributions are summarized as follows.

• From the perspective of scenario discovery, we propose a novel adaptive framework for crowd counting. Different from previous multiple-column (branch) frameworks, ours can represent highly variable crowd images with two branches by incorporating the discretized pathway-wise responses. We further explore the effect of the scenario discovery network and compare it with multi-pathway methods to explain the benefit of our architecture.

• We employ VGG [13] as the backbone, and the counting network is built with two parallel pathways that are trained with different sizes of the receptive field to serve various scales and crowd densities. We consider a scenario as a linear combination of the two pathways with discretized weights and design a third adaption branch to learn the scenario-aware responses and discover the scenarios implicitly.

• We apply our framework to five challenging crowd counting datasets and find that it outperforms the state-of-the-art approaches.

Figure 2: The crowd images from the ShanghaiTech dataset [4] and their crowd counting prediction by CSRNet [5].

The rest of this paper is organized as follows. Section 2 introduces the background of crowd counting. Section 3 discusses the model design, network architecture, and training process in detail. In Section 4, we present the qualitative and quantitative study of the proposed framework. Finally, we conclude our work in Section 5.

2. Related Work

Recently, more and more researchers have devoted themselves to the problem of crowd counting. A detailed survey of recent progress can be found in [1, 2, 3], and we summarize previous works with their backbone models and datasets in Table 1 (due to space limitations, the full list of previous works is given in the supplementary material). In this section, we first briefly introduce the approaches to crowd counting and then discuss the multi-branch architectures for crowd counting.

Table 1: Previous crowd counting works categorized by backbone network (a) and dataset (b).

(a)
Backbone: AlexNet, VGG, ResNet
Model: SD-CNN (2019, [14]); Sindagi et al. (2017, [7]); Switching-CNN (2017, [6]); D-ConvNets (2018, [11]); SaCNN (2018, [15]); Liu et al. (2019, [16]); DSPNet (2020, [17]); L2R (2018, [18]); DAN (2018, [10]); IG-CNN (2018, [12]); SCAR (2019, [19]); Wang et al. (2019, [20]); PSDDN (2019, [21])

(b)
Dataset         Model
ShanghaiTech    [18, 9, 19, 22, 20, 17, 4, 6, 21, 16, 7, 10, 12, 11, 15, 23]
UCF_CC_50       [18, 9, 19, 22, 20, 17, 4, 6, 21, 14, 16, 7, 10, 12, 11, 15]
UCSD            [14, 24, 25, 26, 4, 6, 27, 28]
Mall            [29, 28, 23]
WorldExpo'10    [7, 10, 12, 11, 15, 22, 20, 17, 4, 6, 27, 16, 14]
UCF-QNRF        [16, 30, 22, 20, 17]

2.1. Crowd Counting Approaches

Detection-based approaches. In the early stage of crowd counting, the goal of the task was to count a limited number of people only. Thus, researchers focused on the detection-style framework, which employs a sliding window to detect people in the scenario and counts the number of people by the detected bounding boxes [31, 32, 33]. Recently, [30, 14] designed CNNs with composition loss or multiple stages to count and localize people.

Regression-based approaches. With the development of crowd counting, the scenarios are becoming more and more complex, and occlusions become more frequent as the number of people increases. Some previous works tried to mitigate the issue of occlusion with parts-based and shape-based detectors. However, these methods underperform on images that contain extremely dense crowds or cluttered backgrounds, so researchers proposed methods that count by regression [24, 25, 29, 34, 35, 36].

Density estimation-based approaches. The regression-based approaches have successfully addressed the problems of occlusion and clutter. However, they ignore important spatial information because the regression counts from global information and does not focus on local information. Lempitsky et al. [26, 28] proposed to learn a linear mapping between local patch features and the corresponding object density map, which integrates local spatial information during the learning process.

CNN-based methods. There are several notable ideas in CNN-based crowd counting. One direction deals with the lack of labeled data. Labeling images is expensive, and obtaining good results from a small amount of data is still a challenge. Wang et al. [20] proposed a crowd counting method via domain adaptation, which can free humans from heavy data annotation. L2R [18] leverages unlabeled data for crowd counting through learning to rank; the authors collect additional data from Google using keyword search and query-by-example image retrieval, respectively. Xu et al. [23] proposed a depth information guided crowd estimation method. Lu et al. [37] model the image self-similarity that naturally exists in the object counting problem, represent counting as a matching problem, and employ a form of few-shot learning as a novel method to deal with unlabeled data. Incorporating image context is another direction. SCAR [19] focuses on encoding the spatial dependencies in the whole feature map, which can extract large-range contextual information. Wang et al. [38] designed a new structural context descriptor to characterize the structural properties of individuals in crowd scenes.

2.2. The Multi-branch Approaches

In recent years, multi-branch approaches have become popular and achieve state-of-the-art performance. Zhang et al. [4] first employed multi-column convolutional neural networks (MCNN) with filters of different sizes to obtain more information. Inspired by MCNN, Sam et al. [6] proposed the Switching-CNN, which decouples the three columns into separate CNNs, each trained on a subset of the patches; in particular, they designed a density selector based on VGG that chooses among the branches by utilizing their structural and functional differences. Further works obtain context information of the crowd images based on multiple branches. Sindagi et al. [7] proposed to employ local and global context coding to estimate the density map. Zhang et al. [15] proposed a scale-adaptive CNN architecture with a backbone of fixed small receptive fields. Sindagi et al. [9] proposed to learn two related sub-tasks in a cascaded fashion: crowd count classification (which they call the high-level prior) and density map estimation. Li et al. [10] proposed the Density-Aware Network (DAN) to realize structured density map learning by distinguishing various density levels. Shi et al. [11] used deep convolutional networks with many hidden layers to learn a discriminative feature embedding from raw data, rather than relying on handcrafted feature extraction. Babu et al. [12] tackle this problem with a growing CNN that can progressively increase its capacity to account for the wide variability seen in crowd scenarios.

Many previous multi-branch works employ a shared backbone followed by several pathways. Our framework is also based on this pipeline and is designed with two parallel pathways, i.e., one pathway modeling the dense crowd and another modeling the sparse crowd. Instead of using fixed branch weights or explicitly selecting one column, we employ a dynamic weight to represent more scenarios besides the dense and the sparse ones, and this weight is constantly updated during model training. In other words, the responses of the dense and sparse pathways are adaptively recalibrated by a third branch, which explicitly models interdependencies between pathways. Moreover, with the discretization of these pathway-wise responses, the crowd scenarios are implicitly discovered and the model responds to different crowd images in a highly scenario-specific manner. We achieve an end-to-end regression method using CNNs that takes the entire image as input, and we obtain more accurate results than previous approaches.

Figure 3: Network structure of the proposed framework. DC, MP, GAP, and FC indicate the deconvolution layer, max pooling, global average pooling, and fully connected layer, respectively.

A preliminary version of our framework was described in [39]. In this work, we discuss the multi-pathway approaches with additional analysis and derive insights for the designs of the parallel pathways and the adaption branch. We also extend [39] with state-of-the-art results on three additional datasets, a comprehensive experiment on the network settings, and extensive analysis for all evaluated datasets.

3. Framework

The overall architecture of our framework is illustrated in Figure 3. In this section, the design of adaptive scenario discovery (ASD) is first introduced and then the implementation details are presented. The ASD model consists of two components: the backbone and the counting network. The backbone is used to extract features from the input image, and the counting network estimates density maps by fusing these features.

3.1. Backbone

The use of a backbone network falls into two categories: either designing a new network structure and learning from scratch (e.g., [4, 40]), or selecting a subnet with pre-trained weights from an existing network (e.g., [5, 11]). Our framework follows the second strategy. Recently, a few papers have used VGG [13] in their architectures and achieved good performance (see Table 1(a)). They use the VGG model to incorporate the global context, as the switch classifier, to extract features, or as shared blocks. Following these works, we also employ a VGG-16 model that is pre-trained on the ImageNet dataset [41] and fine-tuned with the crowd images. Here we choose the first 10 layers of VGG-16 with 3 max-pooling layers as the backbone.

Table 2: Configuration of ASD. All convolutional layers use padding to maintain the previous size. The convolutional layer parameters are denoted as conv-(kernel size)-(number of filters)-(dilation rate), and max-pooling layers are conducted over a 2 pixel window with stride 2. Here 'deconv' represents a deconvolution layer, 'conv' represents a convolution layer, 'pool' represents a max-pooling layer, and 'FC' represents a fully connected layer.

Configurations of ASD
Backbone: VGG-16

Dense Branch    Adaption Branch           Sparse Branch
conv3-512-1     Global Average Pooling    deconv1-512-2
conv3-256-1     FC-512-32                 conv9-512-1
conv3-128-1     RELU                      conv7-256-1
conv3-64-1      FC-32-1                   conv7-128-1
conv1-1-1       Sigmoid                   conv3-64-1
                Normalization             Max Pooling
                                          conv1-1-1
Output
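As a rough illustration of what the backbone choice implies for resolution, the sketch below computes the feature-map size after three 2 × 2 max-pooling layers with stride 2. The function name `backbone_output_size` is hypothetical, and the sketch assumes that the padded convolutions keep the spatial size and that pooling rounds down; neither detail is stated beyond the padding note in Table 2.

```python
def backbone_output_size(height, width, num_pools=3):
    """Spatial size after `num_pools` max-pooling layers (2x2 window, stride 2).

    Assumptions (not confirmed by the paper): padded convolutions leave the
    size unchanged, and each pooling step uses floor division.
    """
    for _ in range(num_pools):
        height, width = height // 2, width // 2
    return height, width


# A 768 x 1024 input (the ShanghaiTech Part B resolution) shrinks by a factor of 8.
print(backbone_output_size(768, 1024))
```

Under these assumptions the two counting pathways operate on features at one eighth of the input resolution.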

3.2. Counting Network

The counting network consists of two parallel pathways after the backbone, as listed in Table 2. The first pathway, i.e., the Sparse Branch, is designed to model the highly congested scenario with sparse crowd. It starts with a deconvolutional layer that amplifies the inputs, and then a few convolutional layers with larger receptive fields are used, followed by max pooling. More concretely, the architecture is deconv(1, 512, 2) - conv(9, 512, 1) - conv(7, 256, 1) - conv(7, 128, 1) - conv(3, 64, 1) - pool(2) - conv(1, 1, 1), where 'deconv' represents a deconvolutional layer, 'conv' represents a convolutional layer, and 'pool' represents a max-pooling layer. The numbers in the parentheses are respectively the kernel size, the number of channels, and the dilation rate. The second pathway, named Dense Branch, is designed for the dense crowd. The architecture of the Dense Branch is conv(3, 512, 1) - conv(3, 256, 1) - conv(3, 128, 1) - conv(3, 64, 1) - conv(1, 1, 1). Note that the concept of dense or sparse is relative and both pathways can output a density map.

Unlike the usual density map fusion approaches, we use a dynamic weighting strategy to fuse the density maps. Inspired by the excitation operation in SENet [42], we propose the adaption branch. The set of operations of the adaption branch can be described as GPool(1) - FC(512, 32) - RELU - FC(32, 1) - Sigmoid - Normalization, where 'GPool' represents global average pooling, 'FC' represents a fully connected layer, 'RELU' represents the Rectified Linear Unit, and 'Sigmoid' is the activation function. We obtain an initial response w after the sigmoid; we expect w to adaptively recalibrate the weights of the dense and sparse pathways, and therefore we normalize it into the interval [0, 0.5] with the following formula:

\[ w^{*} = \arctan(\mathrm{sigmoid}(w)) \times \frac{2}{\pi} \tag{1} \]
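Eq. (1) and the weighted fusion of the two pathways can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names `recalibrate` and `fuse` are hypothetical, density maps are represented as nested lists, and the assignment of the weight w* to the sparse pathway and 1 - w* to the dense pathway is an assumption, since the text only states that the two pathways are combined with w and 1 - w.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def recalibrate(w):
    """Eq. (1): w* = arctan(sigmoid(w)) * 2 / pi, which lies in [0, 0.5]."""
    return math.atan(sigmoid(w)) * 2.0 / math.pi


def fuse(sparse_map, dense_map, w_star):
    """Pixel-wise weighted sum of two density maps of equal size.

    Assumption: the sparse pathway is weighted by w_star and the dense
    pathway by 1 - w_star; the paper does not state which pathway gets which.
    """
    return [
        [w_star * s + (1.0 - w_star) * d for s, d in zip(row_s, row_d)]
        for row_s, row_d in zip(sparse_map, dense_map)
    ]


w_star = recalibrate(0.0)  # arctan(0.5) * 2 / pi
fused = fuse([[2.0, 0.0]], [[4.0, 1.0]], w_star)
```

Because sigmoid(w) lies in (0, 1) and arctan maps that range into (0, π/4), multiplying by 2/π bounds the weight by 0.5, which matches the interval given in the text.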

Experiments in Section 4.7 will show the effect compared with a single branch or average fusion. However, we find that this architecture converges slowly, probably because the crowd counting datasets are small while the response is continuous.

Our solution is to divide the response value into bins, borrowing the idea from traditional visual features such as the color histogram [43], SIFT [44], and HoG [45]. The benefits of discretization are twofold. First, the model itself is easier to train and converge. Second, similar attributes are clearly observed among the images within the same bin (see Figure 6), indicating that the discretization operation is able to implicitly discover the dynamic scenarios.

Loss function. We define the loss function as follows:

\[ L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| \mathcal{F}(X_i; \Theta) - F(X_i) \right\|_2^2 \tag{2} \]

where $F(X_i)$ is the ground-truth density map of image $X_i$ from Eq. (3) and $\mathcal{F}(X_i; \Theta)$ is the estimated density map of $X_i$ with the parameters $\Theta$ learned by the proposed network. To preserve the spatial features and the context of the crowd images, we do not extract image patches for data augmentation. And there is also no additional image

Figure 4: Comparison with multi-pathway methods: (a) DAN, (b) Switch-CNN, (c) D-ConvNets, (d) CMTL, (e) ASD, and (f) IG-CNN.

copy/conversion enhancement. During training, we employ stochastic gradient descent (SGD) for its good generalization ability.

3.3. Discussion on Multi-pathway Approaches

We notice that many previous works [9, 6, 10, 12, 11] also employ multi-pathway methods. Our work is related to these works in different aspects, and we discuss the relations and differences in this section.

DAN ([10], Figure 4(a)): Existing methods are affected by the inhomogeneous density distribution problem. To address this problem, the authors propose a simple yet effective Density-Aware Network (DAN) to realize structured density map learning. The key idea is to distinguish various density levels by using a VGG16.

Switch-CNN ([6], Figure 4(b)): The key idea is that crowd counting amounts to density times area, while crowds are not regular across the scenario. Crowd scenarios are therefore assigned to independent CNN regressors based on a switch classifier, and an adaptation of VGG16 is used as the switch classifier to perform 3-way classification.

D-ConvNets ([11], Figure 4(c)): Following the common idea behind these solutions, the authors use deep convolutional networks with many hidden layers, aiming at learning a discriminative feature embedding from raw data rather than relying on handcrafted feature extraction. They deeply learn a pool of decorrelated regressors with sound generalization capabilities by managing their intrinsic diversities.

CMTL ([9], Figure 4(d)): The authors observe that in crowd counting the scale and appearance of objects change considerably due to severe perspective distortion of the scenario. They therefore propose to learn two related sub-tasks in a cascaded fashion: crowd count classification (which they call the high-level prior) and density map estimation.

IG-CNN ([12], Figure 4(f)): The features available for crowd discrimination largely depend on the crowd density, to the extent that people are only seen as blobs in a highly dense scenario. The authors tackle this problem with a growing CNN that can progressively increase its capacity to account for the wide variability seen in crowd scenarios. They address it by creating a set of regressors at the leaf nodes of the CNN tree. This is an excellent idea.

We compare these multi-pathway methods with our ASD (Figure 4(e)) and report the results in Table 8. We find that, with the design of the adaptive scenario, our model can overcome the problem of the discrete representation of scenarios in traditional methods and achieve better results.

4. Experiments

We evaluate our model on five crowd counting benchmarks, including ShanghaiTech [4], UCF_CC_50 [46], UCSD [24], Mall [29], and WorldExpo'10 [47].

4.1. Settings

Ground-truth generation. The CNN-based crowd counting method needs to process continuous data, but the ground truth is discrete, so we need to convert the discrete label data into a continuous density map. The conversion operates at the pixel level, and the idea is to convert the discrete position annotations into image information that may also contain density information. Algorithm 1 gives the details.

Algorithm 1: Ground-truth generation
Input: I: image matrix; label: label list
Output: DM: density map matrix
    create matrix DM, whose width and height are the same as those of the input image I;
    for i = 1; i ≤ length(label) do
        Find the distances to the three nearest neighbors: l1, l2, l3;
        Calculate the Gaussian smoothing parameter: σi = (l1 + l2 + l3)/3 ∗ β;
        Calculate the density of DM(i);
    return DM;
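Algorithm 1 can be sketched in plain Python as below. This is an illustrative reading under stated assumptions rather than the authors' code: the function name `density_map` is hypothetical, the default `beta=0.3` is a placeholder because the value of β is not given here, the Gaussian is truncated at three standard deviations, each kernel is normalized over the pixels that fall inside the image so that every annotated head contributes exactly 1 to the total, and `fallback_sigma` is used when an image has fewer than four annotated heads.

```python
import math


def density_map(height, width, points, beta=0.3, k=3, fallback_sigma=15.0):
    """Build a density map from head positions given as (x, y) tuples.

    Assumptions: beta is a placeholder value; kernels are truncated at 3*sigma
    and renormalized inside the image, so the map sums to the head count.
    """
    dm = [[0.0] * width for _ in range(height)]
    for i, (px, py) in enumerate(points):
        # Distances from head i to every other head, nearest first.
        dists = sorted(
            math.hypot(px - qx, py - qy)
            for j, (qx, qy) in enumerate(points) if j != i
        )
        if len(dists) >= k:
            sigma = beta * sum(dists[:k]) / k  # sigma_i = beta * mean of k nearest
        else:
            sigma = fallback_sigma
        sigma = max(sigma, 1.0)  # guard against coincident annotations
        radius = int(math.ceil(3 * sigma))
        x0, x1 = max(0, int(px) - radius), min(width - 1, int(px) + radius)
        y0, y1 = max(0, int(py) - radius), min(height - 1, int(py) + radius)
        weights, total = {}, 0.0
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                g = math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
                weights[(y, x)] = g
                total += g
        if total > 0:
            for (y, x), g in weights.items():
                dm[y][x] += g / total
    return dm


heads = [(10, 10), (20, 10), (10, 20), (20, 20)]
dm = density_map(32, 32, heads)
```

With this normalization, summing all pixels of `dm` recovers the number of annotated heads, which is the property the counting metrics rely on.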

We follow [5] to generate the density maps from the ground truth. The density map F(x) is generated with the formula:

\[ F(x) = \sum_{i=1}^{N} \delta(x - x_i) \times G_{\sigma_i}(x) \tag{3} \]

where xi is a targeted object in the ground truth δ and Gσi(·) is a Gaussian kernel

with standard deviation σi. For the datasets with highly congested scenarios (such as ShanghaiTech Part A [4] and UCF_CC_50 [46]), F(x) is defined with a geometry-adaptive kernel with σi = β d̄i. Here d̄i is the average distance to the k nearest neighbors of the targeted object xi. For less congested scenarios (i.e., ShanghaiTech Part B [4]), we set σi = 15.

Evaluation metrics. To evaluate the proposed approaches, we follow the standard experimental protocols and apply the mean absolute error (MAE) and mean squared error (MSE) as the evaluation metrics, i.e.,

\[ MAE = \frac{1}{N} \sum_{i=1}^{N} \left| C_i^{gt} - C_i \right|, \qquad MSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left| C_i^{gt} - C_i \right|^2} \]

Here N is the number of test images, and $C_i$ and $C_i^{gt}$ are the estimated and ground-truth counts of the i-th crowd image. We compute $C_i$ as the accumulation of the density map, i.e., $C_i = \sum_{l=1}^{L} \sum_{w=1}^{W} z_{l,w}$, where $z_{l,w}$ is the value at pixel (l, w) of the estimated density map.

Table 3: Crowd counting results on the ShanghaiTech dataset. Part A and Part B indicate ShanghaiTech Part A and Part B, respectively.

                       Part A            Part B
Method                 MAE      MSE      MAE     MSE
DAN [10]               81.8     134.7    13.2    20.1
IG-CNN [12]            72.5     118.2    13.9    21.1
Zhang et al. [47]      181.8    277.7    32.0    49.8
MCNN [4]               110.2    173.2    26.4    41.3
CMTL [9]               101.3    152.4    20.0    31.1
Switching-CNN [6]      90.4     135.0    21.6    33.4
Wang et al. [48]       88.5     147.6    17.6    26.8
CP-CNN [7]             73.6     106.4    20.1    30.1
Huang et al. [49]      -        -        20.2    35.6
D-ConvNet [11]         73.5     112.3    18.7    26.0
ACSCP [40]             75.7     102.7    17.2    27.4
DecideNet [50]         -        -        20.8    29.4
SaCNN [15]             86.8     139.2    16.2    25.8
CSRNet [5]             68.2     115.0    10.6    16.0
ASD [ours]             65.6     98.0     8.5     13.7
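The count accumulation and the two evaluation metrics can be written out as a short sketch. The function names are hypothetical, density maps are assumed to be nested lists, and `mse` follows the definition used in this section, i.e., the square root of the mean squared count error.

```python
import math


def count_from_density(dm):
    """C_i: sum of all pixel values z_{l,w} of a density map."""
    return sum(sum(row) for row in dm)


def mae(pred_counts, gt_counts):
    """Mean absolute error between estimated and ground-truth counts."""
    n = len(gt_counts)
    return sum(abs(g - p) for p, g in zip(pred_counts, gt_counts)) / n


def mse(pred_counts, gt_counts):
    """'MSE' as defined in this section: root of the mean squared count error."""
    n = len(gt_counts)
    return math.sqrt(sum((g - p) ** 2 for p, g in zip(pred_counts, gt_counts)) / n)


pred = [count_from_density([[0.5, 0.5], [4.0, 5.0]]), 20.0]  # counts 10 and 20
gt = [12.0, 17.0]
print(mae(pred, gt), mse(pred, gt))
```

Because the errors are squared before averaging, the second metric penalizes a few large miscounts more heavily than the first, which is why both are reported.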

We denote our approach as ASD in the following comparisons. There are a few hyper-parameters in ASD. In this section, we use VGG-16 as the backbone and set the number of discretization bins to 100. The effect of these settings will be evaluated thoroughly in the ablation study.

4.2. ShanghaiTech

The ShanghaiTech dataset [4] is divided into Part A and Part B. ShanghaiTech Part A contains 482 crowd images, with 300 training images and 182 testing images. The number of pedestrians in each image ranges from 33 to 3,139, and the total number of labeled pedestrians is 241,677. ShanghaiTech Part B contains 716 images (400 for training and 316 for testing). The resolution of the images is fixed at 768 × 1024 pixels, and the number of pedestrians is generally smaller than in Part A, with an average of 123. We compare our framework with several state-of-the-art approaches, including the multi-column CNN with different receptive fields [4], the Switching-CNN that leverages the variation of crowd density [6], and a very recent dilated convolution-based model, CSRNet [5]. Table 3 summarizes the MAE and MSE of previous approaches and ours on the dataset. On Part A of ShanghaiTech, we achieve a significant improvement of 24.8 in absolute MAE over Switching-CNN [6] and of 2.6 in MAE over the state-of-the-art CSRNet [5]. On Part B, our ASD framework also achieves the best MAE of 8.5 and MSE of 13.7 compared to the state of the art. Figure 5(a) and Figure 5(b) illustrate the density maps and the prediction results of some crowd images from the two parts, respectively.
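The quoted improvements on Part A follow directly from the MAE column of Table 3. The small check below reproduces them; the dictionary and the helper name `absolute_gain` are illustrative only.

```python
# Part A MAE values copied from Table 3.
part_a_mae = {"Switching-CNN": 90.4, "CSRNet": 68.2, "ASD": 65.6}


def absolute_gain(baseline, ours):
    """Absolute MAE reduction, rounded to one decimal as reported in the text."""
    return round(baseline - ours, 1)


gain_over_switching = absolute_gain(part_a_mae["Switching-CNN"], part_a_mae["ASD"])
gain_over_csrnet = absolute_gain(part_a_mae["CSRNet"], part_a_mae["ASD"])
print(gain_over_switching, gain_over_csrnet)
```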

4.3. UCF_CC_50

The UCF_CC_50 dataset [46] is challenging for crowd counting because of its high crowd density. The number of pedestrians per image ranges from 94 to 4,543 with an average of 1,279, while the dataset contains only 50 images. The results are reported in Table 4 and sample results are shown in Figure 5(c). Similar to the experiments on ShanghaiTech, the ASD framework shows better results than the other approaches, and we improve on the results of CSRNet [5] by 26.3% for the MAE metric and 31.8% for the MSE, which indicates the low variance of our prediction across the high crowd density images.

Table 4: Crowd counting results on the UCF_CC_50 dataset.

Method                 MAE      MSE
DAN [10]               309.6    402.6
IG-CNN [12]            291.4    349.4
Zhang et al. [47]      467.0    498.5
MCNN [4]               377.6    509.1
CMTL [9]               322.8    397.9
Switching-CNN [6]      318.1    439.2
Wang et al. [48]       234.5    289.6
CP-CNN [7]             295.8    320.9
Huang et al. [49]      409.5    563.7
D-ConvNet [11]         288.4    404.7
ACSCP [40]             291.0    404.6
SaCNN [15]             314.9    424.8
CSRNet [5]             266.1    397.5
ASD [ours]             196.2    270.9

4.4. UCSD

The UCSD dataset [24] is a group of video frames captured by surveillance cameras at 10 fps with a frame size of 238 × 158. The dataset contains 2,000 frames with 49,885 annotated persons. The training and testing splits belong to a single scene. The pedestrians in the frames are sparse, with the number varying from 11 to 46. Following the general settings [24], we use frames 601-1400 as the training data and the remaining 1,200 frames as the testing data. We generate ground-truth density maps with a fixed-spread Gaussian kernel. Due to the small resolution of this dataset, it is difficult to generate high-quality density maps. Therefore, we resize the video frames to 952 × 632 by using bilinear interpolation. The results for the UCSD dataset are

shown in Table 5. Our results are comparable with the state-of-the-art methods. One 265

potential issue that limits the performance is that the background and the portrait of some images may be confused during the upsampling of the grey images. We provide the quality of generated density maps in Figure 5(d). 4.5. Mall The Mall dataset [29] contains 2,000 video frames at the resolution of 640 × 480

270

with 62,315 annotated pedestrians. The crowd density of this dataset is relatively low

with only around 25 persons on average in each frame. The video captured by a pub16

Table 5: Crowd counting results on UCSD.

Method

MAE

MSE

Hydra-CNN [8]

1.65

-

CNN-Boosting [51]

1.10

-

Switching-CNN [6]

1.62

2.10

ConvLSTM-nt [52]

1.73

3.52

ConvLSTM [52]

1.30

1.79

Bidirectional ConvLSTM [52]

1.13

1.43

BSAD [49]

1.00

1.40

ACSCP [40]

1.04

1.35

CSRNet [5]

1.16

1.47

SANet [53]

1.02

1.29

ASD [ours]

1.15

1.44

lic accessible surveillance camera in a shopping mall with more challenging lighting conditions and glass surface reflections. Following the setting of [29], we use the first 800 frames for training and the remaining 1,200 frames for testing. We can observe 275

from Table 6 that the MAE of our framework achieves the highest performance, while the MSE metric is also better than most of the previous results. As the generated result shown in Figure 5(e), our counting methods can perform well not only under the extremely dense crowds but also for relative sparse scenes. 4.6. WorldExpo’10

280

The WorldExpo’10 dataset [47] consists of 3,980 annotated frames from 1,132 video sequences captured by 108 different surveillance cameras during the 2010 Shanghai World Expo. The videos have a resolution of 720 × 576. The training set includes 3,380 annotated frames from 103 scenes, and the testing images are from the other five scenes with 120 frames per scene. The number of pedestrians per frame ranges from 1 to 253. Following the settings of [47], we use MAE as the evaluation metric, and we do not use the perspective maps and the regions of interest

Table 6: Crowd counting results on Mall.

| Method | MAE | MSE |
|---|---|---|
| Gaussian process regression [24] | 3.72 | 20.1 |
| Ridge regression [29] | 3.59 | 19.0 |
| Cumulative attribute regression [54] | 3.43 | 17.7 |
| Count forest [28] | 2.50 | 10.0 |
| DecideNet [50] | 1.52 | 1.90 |
| R-FCN [55] | 6.02 | 5.46 |
| Faster R-CNN [56] | 5.91 | 6.60 |
| Exemplary-Density [57] | 1.82 | 2.74 |
| Boosting-CNN [51] | 2.01 | - |
| Weighted VLAD [58] | 2.41 | 9.12 |
| Bi-ConvLSTM [52] | 2.10 | 7.6 |
| ASD [ours] | 1.50 | 1.91 |

(ROI). Table 7 compares our model with state-of-the-art methods. We can see that our ASD outperforms all previous models, with an average improvement of 8.9% in the MAE metric. Figure 5(f) illustrates the generated density maps for the WorldExpo’10 dataset.

4.7. Ablation Study

We evaluate a few parameters and an alternative implementation for the ASD.

Network design. To further validate whether the performance of our proposed method is affected by the backbone, we first conduct experiments with alternative backbone settings. On ShanghaiTech Part A, employing ResNet-50 and ResNet-101 as the backbone achieves MAEs of 80.23 and 78.09, respectively, which is lower performance (higher MAE) than the VGG-16 based approach. We also compare the proposed ASD with other multi-pathway CNNs (as discussed in Section 3.3) and find that our approach outperforms all these multi-pathway CNNs (Table 8). Furthermore, we study the weight recalibration method. While the Sigmoid function is used in many previous works, here we employ Eq. (1) for

Table 7: The MAE of different scenes and their average MAE for WorldExpo’10.

| Method | S1 | S2 | S3 | S4 | S5 | Avg. |
|---|---|---|---|---|---|---|
| DRSAN [59] | 2.6 | 11.8 | 10.3 | 10.4 | 3.7 | 7.8 |
| SaCNN [15] | 2.6 | 13.5 | 10.6 | 12.5 | 3.3 | 8.5 |
| Zhang et al. [47] | 9.8 | 14.1 | 14.3 | 22.2 | 3.7 | 12.9 |
| ic-CNN [60] | 17.0 | 12.3 | 9.2 | 8.1 | 4.7 | 10.3 |
| D-ConvNet [11] | 1.9 | 12.1 | 20.7 | 8.3 | 2.6 | 9.1 |
| CSRNet [5] | 2.9 | 11.5 | 8.6 | 16.6 | 3.4 | 8.6 |
| Switching-CNN [6] | 4.4 | 15.7 | 10.0 | 11.0 | 5.9 | 9.4 |
| ACSCP [40] | 2.8 | 14.1 | 9.6 | 8.1 | 2.9 | 7.5 |
| IG-CNN [12] | 2.6 | 16.1 | 10.2 | 20.2 | 7.6 | 11.3 |
| MSCNN [61] | 7.8 | 15.4 | 14.9 | 11.8 | 5.8 | 11.7 |
| BSAD [49] | 4.1 | 21.7 | 11.9 | 11.0 | 3.5 | 10.5 |
| DecideNet [50] | 2.0 | 13.1 | 8.9 | 17.4 | 4.8 | 9.2 |
| TDF-CNN [62] | 2.7 | 23.4 | 10.7 | 17.6 | 3.3 | 11.5 |
| MCNN [4] | 3.4 | 20.6 | 12.9 | 13.0 | 8.1 | 11.6 |
| SANet [53] | 2.6 | 13.2 | 9.0 | 13.3 | 3.0 | 8.2 |
| Bi-ConvLSTM [52] | 6.8 | 14.5 | 14.9 | 13.5 | 3.1 | 10.6 |
| ASD [ours] | 2.5 | 14.2 | 7.1 | 7.4 | 3.8 | 7.1 |

recalibration. As shown in Table 9, we observe performance gains in most of the comparisons.

Pathway settings. We compare the performance of our framework under varying pathway settings. As shown in Figure 8, we design the network with at most four pathways that share the same pathway weights. Here branch 1 and branch 2 are the same as in the proposed ASD, and branch 3 and branch 4 are inspired by MCNN [4]. We can see from Table 10 that average weighting with three or four pathways outperforms that with only two pathways, and that the proposed method, with only two branches and learned weights, boosts the performance with fewer parameters.
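The contrast between average weighting and learned weights can be illustrated with a short sketch. It assumes that every pathway outputs a density map of the same size and that the final map is a weighted sum of the pathway maps; the function name `fuse_density_maps`, the toy maps, and the weight values are hypothetical and are not taken from the implementation evaluated in this paper.

```python
from typing import List, Optional, Sequence

DensityMap = List[List[float]]


def fuse_density_maps(maps: Sequence[DensityMap],
                      weights: Optional[Sequence[float]] = None) -> DensityMap:
    """Weighted sum of per-pathway density maps (illustrative only).

    With weights=None every pathway gets the same weight 1/K, which
    mirrors the "same pathway weights" baseline. Explicit weights stand
    in for values that a learned branch would produce.
    """
    k = len(maps)
    if k == 0:
        raise ValueError("at least one pathway map is required")
    if weights is None:
        weights = [1.0 / k] * k
    if len(weights) != k:
        raise ValueError("one weight per pathway is required")
    rows, cols = len(maps[0]), len(maps[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for w, m in zip(weights, maps):
        for r in range(rows):
            for c in range(cols):
                fused[r][c] += w * m[r][c]
    return fused


def count(density_map: DensityMap) -> float:
    """The predicted count is the sum over the density map."""
    return sum(sum(row) for row in density_map)


# Toy 2x2 outputs of a sparse and a dense pathway (made-up numbers).
sparse_map = [[0.5, 0.5], [0.0, 1.0]]   # sums to 2.0
dense_map = [[1.0, 1.0], [1.0, 1.0]]    # sums to 4.0

equal = fuse_density_maps([sparse_map, dense_map])                  # weights 0.5 / 0.5
learned = fuse_density_maps([sparse_map, dense_map], [0.25, 0.75])  # hypothetical weights
print(count(equal), count(learned))
```

Under this assumption, adding pathways only changes the number of terms in the sum, whereas image-dependent weights let two pathways cover a wider range of densities.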

Another architectural question for the whole framework is whether the two parallel pathways have an effect; we therefore evaluate the effect of the two parallel pathways. Figure

Table 8: Comparison with the multi-pathway methods. Part A, Part B and WE indicate ShanghaiTech Part A, ShanghaiTech Part B and WorldExpo’10, respectively.

| Method | Part A MAE | Part A MSE | Part B MAE | Part B MSE | UCF_CC_50 MAE | UCF_CC_50 MSE | WE MAE |
|---|---|---|---|---|---|---|---|
| CMTL [9] | 101.3 | 152.4 | 20.0 | 31.1 | 322.8 | 397.9 | - |
| Switching-CNN [6] | 90.4 | 135.0 | 21.6 | 33.4 | 318.1 | 439.2 | 9.4 |
| DAN [10] | 81.8 | 134.7 | 13.2 | 20.1 | 309.6 | 402.6 | 9.4 |
| IG-CNN [12] | 72.5 | 118.2 | 13.9 | 21.1 | 291.4 | 349.4 | 11.3 |
| D-ConvNet [11] | 73.5 | 112.3 | 18.7 | 26.0 | 288.4 | 404.7 | 9.1 |
| ASD [ours] | 65.6 | 98.0 | 8.5 | 13.7 | 196.2 | 270.9 | 7.1 |
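For readers who want to reproduce or compare against the numbers in these tables, the two reported metrics can be computed as sketched below. The sketch assumes the convention common in the crowd-counting literature, in which MAE is the mean absolute count error and the column labelled MSE is the root of the mean squared count error. The per-image counts are made up, and `relative_improvement` is a hypothetical helper that simply expresses the gap between two MAE values as a fraction of the baseline.

```python
import math
from typing import Sequence, Tuple


def mae_mse(pred: Sequence[float], truth: Sequence[float]) -> Tuple[float, float]:
    """Return (MAE, MSE) over per-image counts.

    MSE here is the root of the mean squared error, following the usual
    crowd-counting convention (an assumption, see the text above).
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and equally long")
    errors = [p - t for p, t in zip(pred, truth)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, mse


def relative_improvement(baseline: float, ours: float) -> float:
    """Fraction by which `ours` lowers an error metric relative to `baseline`."""
    return (baseline - ours) / baseline


# Made-up predicted and ground-truth counts for four test images.
pred = [24.0, 31.0, 18.0, 27.0]
gt = [25.0, 29.0, 18.0, 30.0]
mae, mse = mae_mse(pred, gt)
print(f"MAE = {mae:.2f}, MSE = {mse:.2f}")

# Part A MAE from Table 8: IG-CNN 72.5 vs. ASD 65.6 (about 9.5% lower).
print(f"{relative_improvement(72.5, 65.6):.1%}")
```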

Table 9: Comparison with the sigmoid method. Part A, Part B and WE indicate ShanghaiTech Part A, ShanghaiTech Part B and WorldExpo’10, respectively.

| Method | Part A MAE | Part A MSE | Part B MAE | Part B MSE | UCF_CC_50 MAE | UCF_CC_50 MSE | WE MAE | UCSD MAE | UCSD MSE | Mall MAE | Mall MSE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ASD with Sigmoid | 69.4 | 100.5 | 9.8 | 14.9 | 265.5 | 370.9 | 8.2 | 1.16 | 1.43 | 1.56 | 2.04 |
| ASD with Eq. (1) | 65.6 | 98.0 | 8.5 | 13.7 | 196.2 | 270.9 | 7.1 | 1.15 | 1.44 | 1.50 | 1.91 |

7 gives the comparison of different network architectures on the datasets, including each single pathway and their fusion. We performed five sets of experiments for each dataset: sparse pathway only, dense pathway only, fusion of the two pathways with the same weight, learned weight without discretization, and the full ASD. We observe significant performance gains when adding the dynamic pathway-wise responses and the discretization.

The effect of scenario discovery. Recall that the discretization on the adaption branch is applied to discover the dynamic scenarios implicitly; here we consider different choices of parameters. Since it is impossible to traverse the entire solution space, we use the idea of equivalence-class partitioning to examine the number of divisions of dynamic scenes. The output response after the operation of Eq. (1) falls in the interval (0, 1) and is divided into 2, 10, 100, and 1000 bins. Note that, due to the size of the dataset, only a proportion of the bins contain images after model training; therefore

Table 10: Comparison with different pathway settings. Part A, Part B and WE indicate ShanghaiTech Part A, ShanghaiTech Part B and WorldExpo’10, respectively.

| Method | Part A MAE | Part A MSE | Part B MAE | Part B MSE | UCF_CC_50 MAE | UCF_CC_50 MSE | WE MAE | UCSD MAE | UCSD MSE | Mall MAE | Mall MSE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Branch1+2 | 74.1 | 144.0 | 10.5 | 16.8 | 260.7 | 367.5 | 8.3 | 1.52 | 2.98 | 2.24 | 4.56 |
| Branch1+2+3 | 73.6 | 111.4 | 9.6 | 12.9 | 275.6 | 365.6 | 8.3 | 1.23 | 3.96 | 1.87 | 1.88 |
| Branch1+2+4 | 68.3 | 185.7 | 8.6 | 17.8 | 223.8 | 354.9 | 7.5 | 1.16 | 1.46 | 1.96 | 2.78 |
| Branch1+2+3+4 | 66.5 | 99.0 | 9.0 | 12.9 | 211.9 | 321.6 | 7.9 | 1.16 | 1.85 | 1.77 | 1.99 |
| ASD | 65.6 | 98.0 | 8.5 | 13.7 | 196.2 | 270.9 | 7.1 | 1.15 | 1.44 | 1.50 | 1.91 |

the number of scenarios is usually smaller than the number of bins. Discretization with 2 bins can be considered as a simplified version of Switching-CNN [6], and our learning strategy still achieves lower MAE (Part A: 74.4 vs. 90.4; Part B: 10.7 vs. 21.6; UCF_CC_50: 240.9 vs. 318.1; WorldExpo’10: 8.1 vs. 9.4; UCSD: 1.51 vs. 1.62). Taking ShanghaiTech Part A as an example, without the discretization we obtain an MAE of 69.4, which is not as good as scenario discovery with 15 and 81 scenarios (MAE of 65.6 and 68.7, respectively). We have selected some images for comparison based on the different classifications of the output. Figure 6 shows some crowd images from different scenarios. We can see that there are large differences between scenes from different classes, while scenes from the same class have roughly the same composition.

5. Conclusions

In this paper, we have presented a novel architecture for counting crowds with varying densities. Our approach focuses on the implicit discovery and dynamic modeling of scenarios. We have reformulated the crowd counting problem as a scenario classification problem such that the semantic scenarios are modeled as combined prediction sub-tasks. Experimental comparisons with the state-of-the-art approaches on ShanghaiTech, UCF_CC_50, UCSD, Mall and WorldExpo’10 showed the effectiveness and efficiency of our proposed adaptive scenario discovery framework for the crowd counting task. As future work, we will explore some related topics such as combining the proposed network structure with localization components and video crowd counting based on scenario adaption.

Credit Author Statement

X. Wu and Y. Zheng contributed to the design and implementation of the research, analyzed the results, and wrote the manuscript. H. Ye and J. Yang devised the main conceptual ideas, planned the experiments, and wrote the manuscript. T. Ma and W. Hu carried out the experiments and analyzed the results. L. He devised the project and wrote the manuscript.

Declaration of Competing Interest

None.

References

[1] V. A. Sindagi, V. M. Patel, A survey of recent advances in cnn-based single image crowd counting and density estimation, Pattern Recognition Letters 107 (2018) 3–16.
[2] D. Kang, Z. Ma, A. B. Chan, Beyond counting: Comparisons of density maps for crowd analysis tasks - counting, detection, and tracking, IEEE Transactions on Circuits and Systems for Video Technology 29 (5) (2018) 1408–1422.
[3] M. S. Zitouni, H. Bhaskar, J. Dias, M. E. Al-Mualla, Advances and trends in visual crowd analysis: A systematic survey and evaluation of crowd modelling techniques, Neurocomputing 186 (2016) 139–159.
[4] Y. Zhang, D. Zhou, S. Chen, S. Gao, Y. Ma, Single-image crowd counting via multi-column convolutional neural network, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 589–597.
[5] Y. Li, X. Zhang, D. Chen, CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1091–1100.
[6] D. B. Sam, S. Surya, R. V. Babu, Switching convolutional neural network for crowd counting, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5744–5752.
[7] V. A. Sindagi, V. M. Patel, Generating high-quality crowd density maps using contextual pyramid cnns, in: International Conference on Computer Vision (ICCV), 2017, pp. 1879–1888.
[8] D. Oñoro-Rubio, R. J. López-Sastre, Towards perspective-free object counting with deep learning, in: European Conference on Computer Vision (ECCV), 2016, pp. 615–629.
[9] V. A. Sindagi, V. M. Patel, Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting, in: IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2017, pp. 1–6.
[10] H. Li, X. He, H. Wu, S. A. Kasmani, R. Wang, X. Luo, L. Lin, Structured inhomogeneous density map learning for crowd counting, arXiv:1801.06642.
[11] Z. Shi, L. Zhang, Y. Liu, X. Cao, Y. Ye, M. Cheng, G. Zheng, Crowd counting with deep negative correlation learning, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5382–5390.
[12] D. Babu Sam, N. N. Sajjan, R. Venkatesh Babu, M. Srinivasan, Divide and grow: Capturing huge diversity in crowd images with incrementally growing cnn, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 3618–3626.
[13] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations (ICLR), 2015.
[14] S. Basalamah, S. D. Khan, H. Ullah, Scale driven convolutional neural network model for people counting and localization in crowd scenes, IEEE Access.
[15] L. Zhang, M. Shi, Q. Chen, Crowd counting via scale-adaptive convolutional neural network, in: IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 1113–1121.
[16] C. Liu, X. Weng, Y. Mu, Recurrent attentive zooming for joint crowd counting and precise localization, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1217–1226.
[17] X. Zeng, Y. Wu, S. Hu, R. Wang, Y. Ye, Dspnet: Deep scale purifier network for dense crowd counting, Expert Systems with Applications 141 (2020) 112977.
[18] X. Liu, J. van de Weijer, A. D. Bagdanov, Leveraging unlabeled data for crowd counting by learning to rank, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7661–7669.
[19] J. Gao, Q. Wang, Y. Yuan, SCAR: Spatial-/channel-wise attention regression networks for crowd counting, Neurocomputing.
[20] Q. Wang, J. Gao, W. Lin, Y. Yuan, Learning from synthetic data for crowd counting in the wild, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 8198–8207.
[21] Y. Liu, M. Shi, Q. Zhao, X. Wang, Point in, box out: Beyond counting persons in crowds, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6469–6478.
[22] X. Jiang, Z. Xiao, B. Zhang, X. Zhen, X. Cao, D. Doermann, L. Shao, Crowd counting and density estimation by trellis encoder-decoder networks, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6133–6142.
[23] M. Xu, Z. Ge, X. Jiang, G. Cui, P. Lv, B. Zhou, C. Xu, Depth information guided crowd counting for complex crowd scenes, Pattern Recognition Letters 125 (2019) 563–569.
[24] A. B. Chan, Z. J. Liang, N. Vasconcelos, Privacy preserving crowd monitoring: Counting people without people models or tracking, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–7.
[25] A. B. Chan, N. Vasconcelos, Bayesian poisson regression for crowd counting, in: International Conference on Computer Vision (ICCV), 2009, pp. 545–551.
[26] V. Lempitsky, A. Zisserman, Learning to count objects in images, in: Neural Information Processing Systems (NeurIPS), 2010, pp. 1324–1332.
[27] S. D. Khan, H. Ullah, M. Uzair, M. Ullah, R. Ullah, F. A. Cheikh, Disam: Density independent and scale aware model for crowd counting and localization, in: IEEE International Conference on Image Processing (ICIP), 2019, pp. 4474–4478.
[28] V.-Q. Pham, T. Kozakaya, O. Yamaguchi, R. Okada, Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation, in: International Conference on Computer Vision (ICCV), 2015, pp. 3253–3261.
[29] K. Chen, C. Loy, S. Gong, T. Xiang, Feature mining for localised crowd counting, in: British Machine Vision Conference (BMVC), Vol. 1, 2012, p. 3.
[30] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, M. Shah, Composition loss for counting, density map estimation and localization in dense crowds, in: European Conference on Computer Vision (ECCV), 2018, pp. 532–546.
[31] M. Enzweiler, D. M. Gavrila, Monocular pedestrian detection: Survey and experiments, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (12) (2008) 2179–2195.
[32] B. Leibe, E. Seemann, B. Schiele, Pedestrian detection in crowded scenes, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 878–885.
[33] O. Tuzel, F. Porikli, P. Meer, Pedestrian detection via classification on riemannian manifolds, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (10) (2008) 1713–1727.
[34] H. Ullah, N. Conci, Structured learning for crowd motion segmentation, in: IEEE International Conference on Image Processing (ICIP), 2013, pp. 824–828.
[35] H. Ullah, M. Ullah, N. Conci, Dominant motion analysis in regular and irregular crowd scenes, in: International Workshop on Human Behavior Understanding, 2014, pp. 62–72.
[36] D. Lian, J. Li, J. Zheng, W. Luo, S. Gao, Density map regression guided detection network for rgb-d crowd counting and localization, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1821–1830.
[37] E. Lu, W. Xie, A. Zisserman, Class-agnostic counting, in: Asian Conference on Computer Vision (ACCV), 2018.
[38] Q. Wang, M. Chen, F. Nie, X. Li, Detecting coherent groups in crowd scenes by multiview clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[39] X. Wu, Y. Zheng, H. Ye, W. Hu, J. Yang, L. He, Adaptive scenario discovery for crowd counting, in: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019, pp. 2382–2386.
[40] Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, X. Yang, Crowd counting via adversarial cross-scale consistency pursuit, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5245–5254.
[41] J. Deng, W. Dong, R. Socher, L. Li, K. Li, F. Li, Imagenet: A large-scale hierarchical image database, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[42] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141.
[43] M. A. Stricker, M. Orengo, Similarity of color images, in: Storage and Retrieval for Image and Video Databases III, Vol. 2420, 1995, pp. 381–393.
[44] D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
[45] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, 2005, pp. 886–893.
[46] H. Idrees, I. Saleemi, C. Seibert, M. Shah, Multi-source multi-scale counting in extremely dense crowd images, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2547–2554.
[47] C. Zhang, H. Li, X. Wang, X. Yang, Cross-scene crowd counting via deep convolutional neural networks, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 833–841.
[48] L. Wang, W. Shao, Y. Lu, H. Ye, J. Pu, Y. Zheng, Crowd counting with density adaption networks, arXiv:1806.10040.
[49] S. Huang, X. Li, Z. Zhang, F. Wu, S. Gao, R. Ji, J. Han, Body structure aware deep crowd counting, IEEE Transactions on Image Processing 27 (3) (2018) 1049–1059.
[50] J. Liu, C. Gao, D. Meng, A. G. Hauptmann, Decidenet: Counting varying density crowds through attention guided detection and density estimation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5197–5206.
[51] E. Walach, L. Wolf, Learning to count with cnn boosting, in: European Conference on Computer Vision (ECCV), 2016, pp. 660–676.
[52] F. Xiong, X. Shi, D.-Y. Yeung, Spatiotemporal modeling for crowd counting in videos, in: International Conference on Computer Vision (ICCV), 2017, pp. 5151–5159.
[53] X. Cao, Z. Wang, Y. Zhao, F. Su, Scale aggregation network for accurate and efficient crowd counting, in: European Conference on Computer Vision (ECCV), 2018, pp. 734–750.
[54] K. Chen, S. Gong, T. Xiang, C. Change Loy, Cumulative attribute space for age and crowd density estimation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2467–2474.
[55] J. Dai, Y. Li, K. He, J. Sun, R-fcn: Object detection via region-based fully convolutional networks, in: Neural Information Processing Systems (NeurIPS), 2016, pp. 379–387.
[56] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, in: Neural Information Processing Systems (NeurIPS), 2015, pp. 91–99.
[57] Y. Wang, Y. Zou, Fast visual object counting via example-based density estimation, in: IEEE International Conference on Image Processing (ICIP), 2016, pp. 3653–3657.
[58] B. Sheng, C. Shen, G. Lin, J. Li, W. Yang, C. Sun, Crowd counting via weighted vlad on a dense attribute feature map, IEEE Transactions on Circuits and Systems for Video Technology 28 (8) (2016) 1788–1797.
[59] M. Liu, Y. Liu, J. Jiang, Z. Guo, Z. Wang, Crowd counting with fully convolutional neural network, in: IEEE International Conference on Image Processing (ICIP), 2018, pp. 953–957.
[60] V. Ranjan, H. Le, M. Hoai, Iterative crowd counting, in: European Conference on Computer Vision (ECCV), 2018, pp. 270–285.
[61] L. Zeng, X. Xu, B. Cai, S. Qiu, T. Zhang, Multi-scale convolutional neural networks for crowd counting, in: IEEE International Conference on Image Processing (ICIP), 2017, pp. 465–469.
[62] D. Sam, R. V. Babu, Top-down feedback for crowd counting convolutional neural network, in: AAAI Conference on Artificial Intelligence (AAAI), 2018, pp. 7323–7330.

Figure 5: Qualitative results on the benchmarks.
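The density maps visualised in Figure 5 are compared against ground-truth maps which, for UCSD, are generated with a fixed-spread Gaussian kernel. A minimal sketch of that construction is given below. The value of `sigma`, the truncation at three standard deviations, and the renormalisation of each truncated kernel are assumptions made for illustration; the exact settings used in the experiments are not restated here, and the bilinear resizing step is not shown.

```python
import math
from typing import List, Sequence, Tuple


def gaussian_density_map(points: Sequence[Tuple[float, float]],
                         height: int, width: int,
                         sigma: float = 4.0) -> List[List[float]]:
    """Place one fixed-spread Gaussian per annotated head position.

    points holds (x, y) pixel coordinates, i.e. (column, row). Each
    Gaussian is truncated at 3 * sigma and renormalised over the pixels
    that fall inside the image, so every annotated person contributes
    exactly 1 to the sum of the map.
    """
    density = [[0.0] * width for _ in range(height)]
    radius = int(math.ceil(3 * sigma))
    for x, y in points:
        cx, cy = int(round(x)), int(round(y))
        patch = []
        for r in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for c in range(max(0, cx - radius), min(width, cx + radius + 1)):
                d2 = (r - cy) ** 2 + (c - cx) ** 2
                patch.append((r, c, math.exp(-d2 / (2.0 * sigma ** 2))))
        total = sum(v for _, _, v in patch)
        if total == 0.0:  # annotation lies far outside the image: skip it
            continue
        for r, c, v in patch:
            density[r][c] += v / total
    return density


# Three made-up head annotations on a 48 x 64 frame.
heads = [(10.2, 8.7), (30.0, 20.0), (1.0, 1.0)]
density = gaussian_density_map(heads, height=48, width=64)
print(round(sum(map(sum, density)), 6))  # the map integrates to the head count
```

Because each kernel is renormalised, summing the map recovers the number of annotated heads even when a head lies close to the image border.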


Figure 6: Images of four sample scenarios grouped by adaptive scenario discovery. Various differences between any two scenarios, such as crowd density, ratio of background, and viewpoint, are visible in the images. Left: ShanghaiTech Part A. Right: ShanghaiTech Part B.

Figure 7: The effect of varying network architecture: a. sparse pathway only; b. dense pathway only; c. fusion of the two pathways with the same weight; d. learned weight without discretization; e. proposed approach. Panels: (a) ShanghaiTech Part A; (b) ShanghaiTech Part B; (c) UCF_CC_50; (d) UCSD; (e) Mall; (f) WorldExpo’10.

Figure 8: The architecture with more pathways and same pathway weights. (Diagram labels: Input Image, Backbone, Branch 1 to Branch 4 with weights w1 to w4, DC and MP blocks, Pred.)

Figure 9: The effect of scenario discovery w.r.t. the number of discretization bins and grouped scenarios (“None” indicates the result without discretization). Panels: (a) ShanghaiTech Part A; (b) ShanghaiTech Part B; (c) UCF_CC_50; (d) UCSD; (e) Mall; (f) WorldExpo’10.
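The binning that underlies Figure 9 can be sketched in a few lines. The sketch assumes equal-width bins over the interval (0, 1) and treats every non-empty bin as one discovered scenario; the function names and the response values are hypothetical and serve only to show why the number of scenarios can be smaller than the number of bins.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def discretize(response: float, n_bins: int) -> int:
    """Map a response in the open interval (0, 1) to a bin index in [0, n_bins)."""
    if not 0.0 < response < 1.0:
        raise ValueError("response is expected to lie in (0, 1)")
    return min(int(math.floor(response * n_bins)), n_bins - 1)


def discover_scenarios(responses: Sequence[float], n_bins: int) -> Dict[int, int]:
    """Return {bin index: number of images} for the occupied bins only."""
    return dict(Counter(discretize(r, n_bins) for r in responses))


# Made-up recalibrated responses for six images.
responses = [0.1234, 0.1471, 0.4815, 0.5262, 0.5538, 0.9127]
for n_bins in (2, 10, 100, 1000):
    scenarios = discover_scenarios(responses, n_bins)
    print(f"{n_bins} bins -> {len(scenarios)} occupied (scenarios)")
```

With only six images, at most six bins can be occupied regardless of how finely the interval is divided, which is the effect noted in the ablation study for small datasets.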

Xingjiao Wu is currently a Ph.D. candidate in the School of Computer Science and Technology, East China Normal University, Shanghai, China. His research interests include computer vision and machine learning.

Yingbin Zheng received the B.S. and Ph.D. degrees in computer science from Fudan University, Shanghai, China. He was a research scientist with SAP Labs China and an associate professor with Shanghai Advanced Research Institute, Chinese Academy of Sciences. Currently he is with Videt Tech Ltd., Shanghai, China. His research interests are in computer vision, especially in scene text analysis and video understanding.

Hao Ye received the Ph.D. degree in computer science from Fudan University, Shanghai, China. He was an Associate Professor with the Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai. He is currently with Videt Tech Ltd., Shanghai. His research interests include computer vision, multimedia information processing, deep learning and scene text analysis.

Wenxin Hu received the Ph.D. degree from East China Normal University, Shanghai, China, in 2016. She is currently an associate professor in the School of Data Science and Engineering, East China Normal University. Her current research interests include intelligent data analysis and high performance computing technology.

Tianlong Ma received the B.S. degree from East China Normal University, Shanghai, China, in 1999 and the M.A. degree from Shanghai Jiao Tong University, China, in 2007. He is currently an assistant researcher of East China Normal University. His current research interests include data mining and machine learning.

Jing Yang received the Ph.D. degree from East China Normal University, Shanghai, China, in 2005. She is currently an associate professor in the School of Computer Science and Technology, East China Normal University. Her current research interests include knowledge graph and user behavior analysis.

Liang He received his bachelor’s degree and Ph.D. degree from the Department of Computer Science and Technology, East China Normal University, Shanghai, China. He is now a professor and the associate dean of the School of Computer Science and Technology, East China Normal University. His current research interests include knowledge processing, user behavior analysis, and context-aware computing. He has been awarded the Star of the Talent in Shanghai. He is also a council member of the Shanghai Computer Society, a member of the Academic Committee, the director of the technical committee of Shanghai Engineering Research Center of Intelligent Service Robot, and a technology foresight expert of the Shanghai Science and Technology in focus areas.