Multi-Level Feature Fusion Based Locality-Constrained Spatial Transformer Network for Video Crowd Counting


Yanyan Fang, Shenghua Gao, Jing Li, Weixin Luo, Linfang He, Bo Hu

To appear in: Neurocomputing. PII: S0925-2312(20)30145-4. DOI: https://doi.org/10.1016/j.neucom.2020.01.087

Received 25 September 2019; revised 12 December 2019; accepted 21 January 2020.


Yanyan Fang a, Shenghua Gao b, Jing Li b, Weixin Luo b, Linfang He a, Bo Hu a,*

a School of Information Science and Technology, Fudan University, China
b School of Information Science and Technology, ShanghaiTech University, China
{yyfang, lfhe16, bohu}@fudan.edu.cn, {gaoshh, lijing1, luowx}@shanghaitech.edu.cn

* Corresponding author.

Abstract

Video-based crowd counting can leverage the spatial-temporal information between neighboring frames, and this information improves the robustness of crowd counting. It is therefore more practical than single-image crowd counting in real applications. However, severe occlusions, as well as the translation, rotation, and scaling of persons, cause the density map of heads to change between neighboring frames, which makes video-based crowd counting very challenging. To alleviate these issues, we propose a Multi-Level Feature Fusion Based Locality-Constrained Spatial Transformer Network (MLSTN), which consists of two components: a density map regression module and a Locality-Constrained Spatial Transformer (LST) module. Specifically, we first estimate the density map of each frame by combining the low-level, middle-level, and high-level features of a Convolutional Neural Network, because low-level features are more effective for extracting small heads, while middle- and high-level features are more effective for medium and large heads. Then, to model the relationship between the density maps of neighboring frames, the LST module estimates the density map of the next frame from the concatenation of several regressed density maps. To facilitate performance evaluation for video crowd counting, we have collected and labeled a large-scale video crowd counting dataset, which includes 100 five-second-long sequences with 394,081 annotated heads from 13 different scenes. As far as we know, it is the largest video crowd counting dataset. Extensive experiments show the effectiveness of our proposed approach on our dataset and on other video-based crowd counting datasets. Our dataset is released online at https://github.com/sweetyy83/Lstn_fdst_dataset.

Keywords: Convolutional Neural Network; Locality-Constrained Spatial Transformer Network; Video Crowd Counting; Multi-Level Feature Fusion

1. Introduction

Crowd counting aims to estimate the number of people in stationary images or surveillance videos. It has drawn a lot of attention in computer vision due to its potential applications in many security-related scenarios [1][2], such as video surveillance, traffic monitoring, and emergency management. The majority of previous works for crowd counting, such as [3][4][5], are based on single images. With cameras now deployed at every street corner, video-based crowd counting is more suitable for practical needs because the movement of a crowd is predictable and consistent [6]. Therefore, this paper focuses on video-based crowd counting.

Various approaches have been proposed to tackle the problem of crowd counting. Traditional methods can generally be classified into detection-based and regression-based approaches. The detection-based methods [7][8][9] rely on a detection-style framework, in which a sliding-window detector is used to detect the heads or full bodies of persons in the scene. However, these approaches usually fail to detect the tiny [7] or occluded [10] heads/bodies that are very common in real scenarios. Researchers have therefore attempted to overcome these problems by regression, in which features extracted from crowded scenes or patches are mapped to crowd counts. In recent years, CNN-based approaches have achieved remarkable success in image classification [11], pose estimation [12], and semantic segmentation [13]. They have also been used to solve crowd counting problems, where a CNN is utilized to learn a mapping from an input image to its corresponding density map.

For video crowd counting, there are two important problems: 1) how to leverage the consistency between neighboring frames, and 2) how to extract robust features for crowd counting. For the first problem, LSTM [14] or ConvLSTM [15] based approaches have been proposed to accumulate the features of all history frames for density map estimation, and they have demonstrated their effectiveness for video crowd counting. However, they rely on previous hidden states implicitly and ignore the current state of the neighboring people. When people walk in/out or are occluded, the identities of the crowd in a history frame may be completely different from those in the current frame. Consequently, these historical features may even harm the density map estimation of the current frame if not processed carefully. For the second problem, because multi-column networks can integrate features with different resolutions, existing works such as [5][16] take advantage of multi-column convolutional neural networks with different receptive fields to learn scale-robust features. However, multi-column networks are still affected by outliers in the training data because they are actually global models [17].

Different from existing contributions, in order to encode the consistency between neighboring frames, we present a novel video-based crowd counting framework consisting of a CNN and a spatial transformer network (STN). In this framework, we use a Locality-Constrained Spatial Transformer (LST) module to explicitly model the spatial-temporal correlation between neighboring frames, instead of an LSTM or ConvLSTM that models the spatiotemporal relationship implicitly. On this basis, we achieve more robust crowd counting for different sizes of human heads with multi-level feature fusion. The rationale behind this design is based on two observations. Firstly, for the same population of a crowd, previous work [6] has shown that the trajectories of the crowd can be well predicted. However, due to factors such as distance, rotation, illumination, and changes of perspective, even the same person may show a visually significant change in appearance, so it is sometimes hard to re-identify people directly in two adjacent frames. Because the density map ignores the appearance of a person and is only related to the location of heads, it is widely used in the literature. Although the density map of one frame may be distorted compared to that of the previous frame [15], and the density maps of those people cannot be estimated from previous frames directly, the predictability of people's trajectories means that a transformation can be applied to alleviate the distortion. Secondly, in videos some people are close to the camera and some are far away, so the sizes of human heads vary within one frame, and it is infeasible to estimate the density maps of those people with single-scale feature extraction.

Taking all these factors into account, our LST warps the density map of the whole frame. To be specific, given two images from adjacent frames, we use their similarity to weight the difference between the ground-truth density map and the warped density map. If the two images are similar and contain nearly the same number of people, the difference between the ground-truth density map and the warped density map should be small; if someone walks in/out or is occluded, we allow the warped density map of the previous frame to differ slightly from the ground truth. Further, since our model only uses spatial-temporal dependencies between adjacent frames, it eliminates the impact of uncorrelated history frames on density map estimation. Recently, Residual Networks (ResNets) [18] and DenseNet [19] have been proposed to extract features. For crowd counting, shallow features may be more effective for smaller heads, middle-level features are effective for medium heads, and higher-level features are more effective for large heads, so we propose to fuse low-, middle- and high-level features for more robust crowd counting. Experiments verify the validity of our model for video-based crowd counting.

Figure 1: Representative images in our crowd dataset.

It is necessary to collect a large-scale dataset with multiple scenes for video crowd counting, but most existing crowd counting datasets are based on single images. Although there are some video-based datasets for crowd counting, such as the UCSD dataset and the Mall dataset, they have relatively low-resolution frames and typically cover only one or two scenes. Similarly, the WorldExpo'10 dataset contains only 5 scenes, and the interval between two labeled frames is more than 10 seconds, so the temporal correlation and consistency between neighboring labeled frames may be weak. We therefore build a new large-scale video crowd counting dataset named Fudan-ShanghaiTech (FDST) with more scenes (see Fig. 1 for some typical examples). Specifically, the FDST dataset contains 15,000 frames with 394,081 annotated heads captured from 13 different scenes, including shopping malls, squares, hospitals, etc. The dataset is much larger than the WorldExpo'10 dataset, which only contains 3,980 frames with 199,923 annotated heads. Further, we provide frame-wise annotations, while WorldExpo'10 only provides annotations at intervals of more than 10 seconds. Therefore the FDST dataset is more suitable for evaluating video crowd counting.

In summary, the main contributions of this work are as follows.

• We design a Multi-Level Feature Fusion Based Locality-Constrained Spatial Transformer Network (MLSTN), which explicitly encodes the spatial-temporal dependencies between neighboring frames to achieve more robust crowd counting.

• We propose to fuse multi-level features to boost robustness to heads with different sizes.

• We collect a large-scale and diversified video crowd counting dataset with frame-wise ground-truth annotation, which promotes the development of video crowd counting.

This work is an extension of our previous work [20] published in ICME 2019 (oral). Compared with our conference version, we improve our work as follows: 1) we propose to fuse multi-scale features, which suits density map regression for heads with different sizes; 2) we propose to leverage multiple frames to predict the next frame, which improves prediction accuracy; 3) we conduct more experiments to validate the importance of the different components.

The rest of the paper is organized as follows. We first briefly review the related work in Section 2. Then, we introduce the architecture of the Multi-Level Feature Fusion Based LSTN (MLSTN) in Section 3 and the FDST dataset in Section 4. The performance comparison between our approach and the state of the art is presented in Section 5. Finally, we conclude the paper in Section 6.

2. Related work

The problems of crowd counting and density map estimation face many challenges, such as non-uniform density, occlusions, and intra-scene and inter-scene variations in scale and perspective [1]. Various methods have been proposed in the literature to deal with crowd counting in images [21][3][4][5][16][22] and videos [23][24][25][15][14][26][27]. In this section, we review deep learning based crowd counting and the spatial transformer network (STN), and give an overview of the related works.

2.1. Deep learning methods for crowd counting

2.1.1. Image-based crowd counting

Recent works [5][28][29][22][30][31][32][33] have proved the validity of CNNs for density map estimation in single-image crowd counting. Wang et al. [21] and Fu et al. [3] are among the first researchers to apply CNN-based methods to crowd density estimation. At the same time, Zhang et al. [4] propose the idea of cross-scene crowd counting; their basic idea is to map images to crowd counts and adapt this mapping to new target scenes. However, the shortcoming of this method is the need for perspective maps on both training and test scenes. Therefore, Zhang et al. [5] propose a multi-column CNN architecture that allows input images of arbitrary size or resolution. Similarly, Onoro and Sastre et al. [16] propose a scale-aware counting model called Hydra CNN, which uses a pyramid of image patches extracted at multiple scales to perform the final density prediction. The above methods focus on incorporating scale information into their networks. Sheng et al. [34] propose a new image representation method that combines semantic attributes and spatial cues to improve the discriminative power of the feature representation. Shang et al. [35] present an end-to-end network consisting of a CNN model and an LSTM decoder to predict the number of people. Yao et al. [36] propose a deep spatial regression model based on CNN and LSTM for counting the number of individuals in a still image with arbitrary perspective and arbitrary resolution. Recently, several approaches combine additional cues to assist crowd counting, such as detection [37], attention [38], localization [22][32][33] and synthetic data [31]. In particular, Wang et al. [31] introduce a very large synthetic crowd counting dataset and propose a Spatial Fully Convolutional Network to improve real-world performance with synthetic data. All these methods have achieved great success in crowd counting, but these single-image methods may produce inconsistent head counts for neighboring frames in video crowd counting.

2.1.2. Video-based crowd counting

Although previous works on single-image crowd counting have reported good results, they treat all datasets as sets of still images without considering the spatial-temporal correlation, even for video sequences. Recently, several works [15][39][40][26][14] attempt to exploit spatial-temporal correlation. More specifically, Xiong et al. [15] propose to leverage ConvLSTM to integrate history features and features of the current frame for video crowd counting, which has proven effective. Li et al. [40] propose a multiview-based parameter-free approach to detect groups in crowd scenes. Ding et al. [26] propose a deeply-recursive network based on ResNet blocks for crowd counting. Zou et al. [27] propose an adaptive multi-scale convolutional network that assigns different capacities to different portions of the input. Further, Zhang et al. [14] also propose to use LSTM for vehicle counting in videos. However, all these LSTM based methods may be affected by irrelevant history and do not explicitly consider the spatial-temporal dependencies in videos, whereas our solution explicitly models such dependencies between neighboring frames with the LST.

2.2. Spatial transformer network

Although CNNs have enjoyed huge success in various computer vision problems, they have no principled mechanism for being spatially invariant to the input data. Jaderberg et al. [41] introduced a differentiable Spatial Transformer (ST) module capable of modeling the spatial transformation between inputs and outputs. The ST module provides an end-to-end learning mechanism that can easily be inserted into many existing networks to explicitly learn how to transform the input data and achieve spatial invariance. Since the introduction of the STN, it has proven effective in resolving geometric variations. For example, in [42] and [43], STNs are used to improve the performance of face alignment and detection. Although the STN model is very effective in many cases, it does not work well under heavy deformations. To solve this problem, Wu et al. [44] propose a multiple-STN model named Recursive Spatial Transformer (ReST) and use it for alignment-free face recognition. However, as the number of geometric prediction layers increases, problems such as unwanted boundary effects arise, so Lin and Lucey [45] advocate Inverse Compositional Spatial Transformer Networks (IC-STNs), which combine conventional STNs with the IC-LK algorithm. The ST has also been applied to density map estimation in a coarse-to-fine single-image crowd counting framework [46]. Different from [46], we leverage the ST to model the relationship of the density maps between neighboring frames for video crowd counting.

Table 1: The detailed description of the variables.

Variable                  Description
X_t                       the image of the t-th (t = 1, ..., T) frame of a video sequence
M_t^{GT}                  the ground-truth density map of the t-th frame
M_t^{reg}                 the density map of the t-th frame estimated by the density map regression module
\tilde{M}_{t+2}^{reg}     the concatenation of M_t^{reg}, M_{t+1}^{reg} and M_{t+2}^{reg}
M_{t+3}^{LST}             the density map of the (t+3)-th frame estimated by the LST module

3. The proposed method

3.1. Overview

In this work, we propose a Multi-Level Feature Fusion Based Locality-Constrained Spatial Transformer Network (MLSTN). The architecture of this network is shown in Fig. 2; it consists of two basic modules: a density map regression module and a Locality-Constrained Spatial Transformer (LST) module. Following the prior work [47], we formulate crowd counting as a density map estimation problem.

To obtain more robust crowd counts, we first use a feature extractor to extract multi-level features from each input frame. Specifically, in the density map regression module, we take three consecutive images X_t, X_{t+1}, X_{t+2} as a set of inputs, estimate their corresponding density maps M_t^{reg}, M_{t+1}^{reg}, M_{t+2}^{reg}, and aggregate them by concatenation into a new estimated density map \tilde{M}_{t+2}^{reg}. Finally, \tilde{M}_{t+2}^{reg} is fed to the LST module to predict the density map of the next frame, M_{t+3}^{LST}. In the rest of this section, we introduce these modules in detail. For reading convenience, the important notation used in this paper is summarized in Table 1.

M tGT 3

lLST

M tLST 4

LST

M

LST Shared

reg t 2

M treg 3

Concat

lreg M tGT

M

reg t +1

M treg +2

Feature extractor

Xt

X t 1

Shared

Concat

lreg M treg

M tGT 4

lLST

lreg M tGT +2

M tGT +1

lreg

M treg +1

Feature extractor

Shared

X t 2

M treg +2

X t 1

X t 2

M treg +3

M tGT +3

Shared

X t 3

Density map regression module Figure 2: The structure of the MLSTN module for video crowd counting.


3.2. Density map regression based crowd counting

The generation of density maps is very important for the performance of density map based crowd counting. Similarly to [47], we express crowd counting as a density map estimation problem. That is, given one frame with N heads, if the i-th head is centered at p_i, we represent it as a delta function δ(p − p_i); the ground-truth density map of this frame is then

M = \sum_{i=1}^{N} \delta(p - p_i) * G_\sigma(p).    (1)

Here G_\sigma(p) is a 2D Gaussian kernel with standard deviation σ:

G_\sigma(p) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}.    (2)

That is to say, a pixel near an annotated point has a higher probability of belonging to a head. Once the ground-truth density maps are generated, the density map regression module maps each frame to its corresponding density map. As mentioned above, we denote the ground-truth density map of the t-th (t = 1, ..., T) frame as M_t^{GT}, and the density map estimated by the regression module as M_t^{reg}. The objective of the density map regression module can then be written as

\ell_{reg} = \frac{1}{2T} \sum_{t=1}^{T} \| M_t^{reg} - M_t^{GT} \|^2.    (3)
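As a concrete illustration of Eqs. (1)-(2), the following minimal NumPy/SciPy sketch builds a ground-truth density map from annotated head centers; the function name and the default σ = 3 (the value used later for FDST) are our own choices rather than code from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ground_truth_density_map(head_points, height, width, sigma=3.0):
    """Sum of delta functions at head centers convolved with a 2D Gaussian,
    as in Eqs. (1)-(2). `head_points` is an iterable of (x, y) pixel
    coordinates; `sigma=3` is a dataset-dependent assumption."""
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in head_points:
        col = min(int(round(x)), width - 1)
        row = min(int(round(y)), height - 1)
        density[row, col] += 1.0                    # delta(p - p_i)
    # Gaussian smoothing preserves the total count: density.sum() == N.
    return gaussian_filter(density, sigma=sigma, mode='constant')
```

Integrating the resulting map therefore recovers the head count, which is how counts are read off the regressed maps at test time.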

Figure 3: The network architecture of the feature extractor module.

3.3. Multi-Level Feature Fusion for robust crowd counting

Features are very important in crowd counting. Therefore, in the density map regression module, we use a multi-level feature fusion extractor. As shown in Fig. 3, it combines low-level, middle-level, and high-level features.

In a real scene, the size of the heads in a video frame varies: heads close to the camera are usually larger, and heads far away are smaller. We therefore need features that are robust to different head sizes. To this end, we propose a multi-level feature fusion structure, in which low-level features handle small heads, middle-level features handle medium heads, and high-level features handle large heads. Through the concatenation of low-level, middle-level, and high-level features, we achieve more robust crowd counting for different head sizes. Concatenating all feature layers would be impractical in terms of computational complexity, so we extract only three feature levels and concatenate them. The experiments prove the effectiveness of this design; a minimal sketch of the fusion is given below.
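The sketch below shows one plausible way to implement this fusion in PyTorch, assuming a VGG-16-style backbone tapped at three depths; the exact layer configuration of Fig. 3 may differ, so the tap points and channel sizes should be read as assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class MultiLevelRegressor(nn.Module):
    """Low-, middle- and high-level feature maps are resized to a common
    resolution, concatenated, and regressed to a one-channel density map."""
    def __init__(self):
        super().__init__()
        features = vgg16().features          # pretrained weights omitted for brevity
        self.low = features[:9]              # up to conv2_2 (128 ch): small heads
        self.mid = features[9:16]            # up to conv3_3 (256 ch): medium heads
        self.high = features[16:23]          # up to conv4_3 (512 ch): large heads
        self.head = nn.Sequential(           # fused features -> density map
            nn.Conv2d(128 + 256 + 512, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, 1))

    def forward(self, x):
        f_low = self.low(x)
        f_mid = self.mid(f_low)
        f_high = self.high(f_mid)
        size = f_mid.shape[-2:]              # fuse at the middle-level resolution
        fused = torch.cat([
            F.interpolate(f_low, size=size, mode='bilinear', align_corners=False),
            f_mid,
            F.interpolate(f_high, size=size, mode='bilinear', align_corners=False)],
            dim=1)
        return self.head(fused)
```

Fusing at the middle-level resolution keeps the resizing of the other two branches symmetric; the predicted map can be upsampled to the input resolution if needed.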

3.4. LST module

Many previous works [48][49][6] have shown that, for the same population of a crowd in a video, the trajectories can be well predicted. It is natural to expect that the density maps of previous frames can help predict the density map of the current frame. However, most existing video crowd counting datasets do not provide correspondences between people in neighboring frames, which makes it impossible to directly learn the mapping from the head coordinates of previous frames to those of the current frame. Furthermore, the appearance of the same person may change a lot visually due to variations in perspective, distance, rotation, lighting, and occlusion between neighboring frames, which makes it difficult to directly re-identify a person in two adjacent frames. A density map, in contrast, ignores the appearance of a person and depends only on the locations of heads. Since the trajectories of people are predictable, we can therefore utilize the density map of a previous frame to estimate the density map of the current frame for the same group of people. Specifically, the deformation of the density map of the same group of people between neighboring frames includes translation and scaling, when people walk away from or towards the camera or when the camera moves, and rotation, caused for example by wind or ground vibration.

CNNs define an exceptionally powerful class of models, but they lack the ability to be spatially invariant to the input data in a computationally and parameter-efficient manner [41]. To exploit the spatiotemporal correlation for video crowd counting, we propose a novel approach based on the STN. As shown in recent work [46], the spatial transformer (ST) module is effective at learning the transform between input and output, so for the same group of people, the ST can be used to learn the mapping between two adjacent frames. In practice, however, a plain ST is restricted when people walk into or out of the camera view, or when some people are occluded. Therefore, we propose an LST, whose basic principle is to use the similarity of two adjacent images to weight the difference between the ground-truth and the transformed density map. If the two input images are similar, they likely correspond to the same population, so the difference between the ground-truth density map and the transformed density map should be small; if someone walks in/out or is occluded, we allow the estimated density map to differ slightly from the ground truth. Minimizing these weighted differences over all frames exploits the dependency between adjacent frames for video-based crowd counting.

We define the mapping function of the LST module as f_{LST}. It takes the concatenation of estimated density maps, \tilde{M}_{t+2}^{reg}, as input to estimate the density map of the (t+3)-th frame. Denoting by M_{t+3}^{LST} the density map of the (t+3)-th frame estimated by the LST, we have

M_{t+3}^{LST} = f_{LST}(\tilde{M}_{t+2}^{reg}; A_\theta).    (4)

Figure 4: The architecture of the Locality-Constrained Spatial Transformer (LST).

Inspired by [41], our LST module mainly includes three parts, as shown in Fig. 4: 1) the localisation network, 2) the grid generator, and 3) the sampler. The localisation network takes the density map \tilde{M}_{t+2}^{reg} as input and uses a number of hidden layers to produce the transformation parameters θ. The grid generator then creates a sampling grid Γ_θ from the predicted parameters θ, and the density map is sampled on this grid to produce the transformed density map M_{t+3}^{LST}. Following [41], the LST transformation has the form

(x_i^s, y_i^s)^T = Γ_θ(G_i) = A_θ (x_i^t, y_i^t, 1)^T = \begin{pmatrix} θ_{11} & θ_{12} & θ_{13} \\ θ_{21} & θ_{22} & θ_{23} \end{pmatrix} (x_i^t, y_i^t, 1)^T,    (5)

where (x_i^t, y_i^t) are the target coordinates of the sampling grid Γ_θ in the output density map, (x_i^s, y_i^s) are the source coordinates in the input density map that define the sample points, and A_θ denotes the 2 × 3 affine transformation matrix.
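A minimal PyTorch sketch of this warp is given below. It follows the standard STN recipe, with F.affine_grid and F.grid_sample playing the roles of the grid generator and sampler of Eq. (5); the localisation-network layer sizes and the final averaging of the warped channels are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalityConstrainedST(nn.Module):
    """Hypothetical sketch of the LST warp of Eqs. (4)-(5): a small localisation
    network predicts the 2x3 affine parameters from the concatenated regression
    maps, and the maps are resampled on the resulting grid."""
    def __init__(self, in_channels=3):       # three concatenated regression maps
        super().__init__()
        self.localisation = nn.Sequential(
            nn.Conv2d(in_channels, 16, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 6))
        # Initialise to the identity transform, as recommended in [41].
        self.localisation[-1].weight.data.zero_()
        self.localisation[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, m_reg_concat):
        theta = self.localisation(m_reg_concat).view(-1, 2, 3)   # A_theta
        grid = F.affine_grid(theta, m_reg_concat.size(), align_corners=False)
        warped = F.grid_sample(m_reg_concat, grid, align_corners=False)
        # Collapse the warped stack to a single predicted map for frame t+3
        # (a 1x1 convolution would be an equally plausible choice).
        return warped.mean(dim=1, keepdim=True)
```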

We use X_{t+2}, M_{t+3}^{GT} and M_{t+3}^{LST} to denote the image of the (t+2)-th frame, the ground-truth density map of the (t+3)-th frame, and the density map of the (t+3)-th frame estimated by the LST, respectively. The objective of the LST can then be written as

\ell_{LST} = \frac{1}{2T} \sum_{t=1}^{T-3} S(X_{t+2}, X_{t+3}) \times \| M_{t+3}^{LST} - M_{t+3}^{GT} \|_2^2,    (6)

where S(X_{t+2}, X_{t+3}) denotes the similarity between the corresponding temporally neighboring frames, measured as

S(X_{t+2}, X_{t+3}) = \exp\left(-\frac{\| X_{t+2} - X_{t+3} \|_2^2}{2\beta^2}\right).    (7)
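The similarity-weighted objective of Eqs. (6)-(7) can be written compactly as below; β = 20 follows the value reported for FDST in Section 3.5, and the frames are assumed to be normalized image tensors.

```python
import torch

def lst_loss(m_lst, m_gt, x_prev, x_next, beta=20.0):
    """Similarity-weighted LST loss of Eqs. (6)-(7) for one batch.
    `m_lst`, `m_gt` are (N, 1, H, W) density maps; `x_prev`, `x_next`
    are the corresponding (N, C, H, W) neighboring frames."""
    n = m_lst.size(0)
    # Eq. (7): one similarity scalar per sample.
    diff = (x_prev - x_next).flatten(1)
    sim = torch.exp(-(diff ** 2).sum(dim=1) / (2.0 * beta ** 2))
    # Eq. (6): similarity-weighted squared error between warped and GT maps.
    err = ((m_lst - m_gt).flatten(1) ** 2).sum(dim=1)
    return (sim * err).sum() / (2.0 * n)
```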

3.5. Loss function and implementation details

The objective function consists of two parts, the loss of the density map regression module and that of the LST module:

\ell = \ell_{reg} + \lambda \ell_{LST},    (8)

where λ is a weight used to balance \ell_{reg} and \ell_{LST}.

In the training process, an Adam optimizer is used with a learning rate at 1e-6 on our dataset. To reduce over-fitting, we adopt the batch-normalization [50], and the batch-size is 7. Once our network is trained, in the testing phase, we can directly estimate the density map Mtreg (t = 1, . . . , T ) of each frame and integrate it to get the estimated head counts.

290

The variance in gaussian based density map generation γ = 3, and the β used in similarity measurement is 20 on FDST dataset. We resize all frames to 640 × 360 pixels. We first pretrain density map regression module, then we finetune the whole network by fixing all layers of features in density map regression module. We set λ = 0.01 in the FDST dataset 3 . 3 Because

the ground-truth is annotated 2fps on Expo’10 and ROI’s are also marked, there-

fore the population of two neighboring frames changes a lot. Thus this dataset is not suitable for performance evaluation of our method.

15

295

To evaluate the training time of the proposed approach, we implement our model on an NVidia Titan X GPU platform and test the running time of our model in the training phase on the FDST dataset. We use 60 videos with 9000 frames as the training set and the remaining 40 videos with 6000 frames as the testing set. As stated previously, our training is divided into two parts. First,

300

we use the CNN network for pre-training. After training with 9,000 frames, it takes about 1 hour and 50 minutes for the first test result to appear. Then, we fixe all layers of features and introduce LSTN. It takes about another 45 minutes for the first test result to appear. We also test the running time of our model in the test phase on the FDST dataset. We repeat our program for 30

305

times, and the average runtime of our model is 1.15ms.

4. The Fudan-ShanghaiTech video crowd counting dataset

Scene 1

Scene 2

Scene 3

Scene 4

Scene 5

Scene 6

Scene 7

Scene 8

Scene 9

Scene 10

Scene 11

Scene 12

Scene 13

Scene 14

Scene 15

Figure 5: Representative images of the fifteen scenes in our crowd dataset.

In the past few years, various datasets for video-based crowd counting have been created, but existing datasets typically contain low-resolution images captured from a single scene, such as the Mall dataset and the UCSD dataset.

Although the WorldExpo'10 dataset provides several different scenes, it samples the video sparsely, which is not entirely suitable for video-based crowd counting. Hence, we introduce a new large-scale video-based crowd counting dataset

named FDST. Specifically, we collect 100 five-second-long sequences captured from 15 scenes (please refer to Fig. 5), and the FDST dataset contains 15,000

frames, with a total of 394,081 annotated heads. Annotating the FDST dataset took us more than 400 hours. To our knowledge, this dataset is by far the largest video crowd counting dataset. Statistics for our dataset and other relevant datasets are shown in Table 2.

Table 2: Details of some datasets: Num is the total number of frames; FPS is the number of frames per second; Max/Min/Ave are the maximal, minimal, and average crowd counts in one frame; Total is the total number of labeled people.

Dataset      Resolution                Num     FPS   Max   Min   Ave    Total
UCSD         238 × 158                 2000    10    46    11    24.9   49,885
Mall         640 × 480                 2000    <2    53    13    31.2   62,316
WorldExpo    576 × 720                 3980    50    253   1     50.2   199,923
Ours         1920 × 1080 / 1280 × 720  15000   30    57    9     26.7   394,081

5. Experiments

5.1. Evaluation metric

Following the convention of existing work [51] on crowd counting, we adopt the mean absolute error (MAE) and the mean squared error (MSE) as evaluation metrics. They are defined as

MAE = \frac{1}{T} \sum_{i=1}^{T} | z_i - \hat{z}_i |,    MSE = \sqrt{\frac{1}{T} \sum_{i=1}^{T} (z_i - \hat{z}_i)^2},    (9)

where T is the total number of frames over all testing video sequences, z_i is the true number of people in the i-th frame, and \hat{z}_i is the estimated number of people in the i-th frame. Broadly speaking, MAE and MSE indicate the accuracy and robustness of the estimates, respectively.
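For clarity, the two metrics of Eq. (9) can be computed from per-frame counts as in the short sketch below (note that the MSE used here is a root-mean-square error).

```python
import numpy as np

def mae_mse(true_counts, pred_counts):
    """MAE and (root) MSE of Eq. (9) over all T test frames.
    Both arguments are 1-D arrays of per-frame head counts."""
    z = np.asarray(true_counts, dtype=np.float64)
    z_hat = np.asarray(pred_counts, dtype=np.float64)
    mae = np.mean(np.abs(z - z_hat))
    mse = np.sqrt(np.mean((z - z_hat) ** 2))
    return mae, mse
```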


Table 3: Results of different methods on the UCSD dataset.

Method                                  MAE    MSE
Kernel Ridge Regression [52]            2.16   7.45
Ridge Regression [53]                   2.25   7.82
Gaussian Process Regression [54]        2.24   7.97
Cumulative Attribute Regression [55]    2.07   6.86
Zhang et al. [4]                        1.60   3.31
MCNN [5]                                1.07   1.35
Switch-CNN [28]                         1.62   2.10
CSRNet [29]                             1.16   1.47
FCN-rLSTM [14]                          1.54   3.02
ConvLSTM [15]                           1.30   1.79
Bidirectional ConvLSTM [15]             1.13   1.43
DUB-CSRNet [56]                         1.03   1.24
Liu et al. [57]                         1.17   1.55
LSTN [20]                               1.07   1.39
Our Method                              1.02   1.32

5.2. The UCSD dataset

The UCSD dataset [54] contains a 2000-frame video of pedestrians on a sidewalk of the UCSD campus captured by a surveillance camera. The frame size is 238 × 158 and the frame rate is 10 fps. On average, there are only about 25 persons per frame. Following the train-test setting in [54], we use frames 601 to 1400 as training data and the remaining 1200 frames as testing data. The low resolution of 238 × 158 makes it challenging to generate density maps, especially after pooling operations, so similar to [29] we use linear interpolation to resize each frame to 476 × 316. The accuracy of different methods on this dataset is shown in Table 3. Our model achieves the lowest MAE and an MSE comparable with existing methods.

Table 4: Results of different methods on the Mall dataset.

Method                                  MAE    MSE
Kernel Ridge Regression [52]            3.51   18.10
Ridge Regression [53]                   3.59   19.00
Gaussian Process Regression [54]        3.72   20.10
Cumulative Attribute Regression [55]    3.43   17.70
COUNT Forest [58]                       2.50   10.00
ConvLSTM [15]                           2.24   8.50
Bidirectional ConvLSTM [15]             2.10   7.60
DigCrowd [59]                           3.21   16.4
LSTN [20]                               2.00   2.50
Our Method                              1.80   2.42

5.3. The Mall dataset

The Mall dataset was introduced by Chen et al. [53]. This video contains 2,000 annotated frames at a resolution of 640 × 480, with over 60,000 labeled pedestrians, all captured in a shopping mall by a surveillance camera. The region of interest (ROI) and the perspective map are provided with the dataset. Using the same setting as in [53], the first 800 frames are used for training and the remaining 1,200 frames for testing. The results are shown in Table 4; our method outperforms most methods on this dataset.

5.4. The FDST dataset

The FDST dataset contains 100 five-second videos; we use 60 videos (9,000 frames) for training and the remaining 40 videos (6,000 frames) for testing. We compare our approach with MCNN [5], which achieves strong performance in single-image crowd counting, and ConvLSTM [15], a state-of-the-art video crowd counting method. We also report the performance of our conference paper LSTN [20] and of the methods in [60] that have recently been run on our dataset. All results are shown in Table 5.

Table 5: Results of different methods on our dataset.

Method           MAE    MSE
MCNN [5]         3.77   4.88
ConvLSTM [15]    4.48   5.82
LSTN [20]        3.35   4.45
COMBI [60]       2.92   3.76
ALL-EST [60]     2.84   3.57
Our Method       2.35   3.02

Figure 6: The density maps estimated by our method on our dataset.

We can see that our method achieves the best performance. It is worth noting that, since our dataset contains many scenes and ConvLSTM is not easy to train, the performance of ConvLSTM is even worse than that of a single-image method. We also show density maps estimated by our MLSTN in Fig. 6.

5.5. Ablation study

To further analyze the relative contributions of the various components of our model, as well as the impact of different feature extractors and of heads with different sizes, we conduct an ablation study on the FDST dataset as follows.

Table 6: Impact of LST on our dataset.

Method          MAE    MSE
Ours non-LST    2.84   3.94
Ours-one        2.45   3.22
Ours-three      2.35   3.02

Impact of LST. We first compare the LST and non-LST variants of our model in Table 6, and then show the performance when the regression density maps fed to the LST come from a single frame X_t (Ours-one) or from three consecutive frames X_t, X_{t+1}, X_{t+2} (Ours-three). The improvement over the method without LST shows the effectiveness of the LST. In addition, increasing the number of input frames used to generate the regression density maps further improves the prediction of the next frame, which verifies the rationality of our design.

Table 7: Results of different feature extractors on our dataset.

Method               MAE    MSE
Unet [61] (1/2)      3.81   5.01
Unet [61] (1/4)      3.46   4.47
Unet [61] (1/8)      4.10   5.82
ResNet-VGG16 [62]    2.95   3.90
Our Method           2.35   3.02

Impact of different feature extractors. We fuse low-level, middle-level and high-level features. A more intuitive idea would be to combine the features of every layer, as in U-Net [61] or ResNets [18]. In this experiment, we therefore take the U-Net outputs upsampled to 1/2, 1/4 and 1/8 of the original image size, respectively. In addition, similar to [62], we add residual connections to VGG16. Table 7 shows the results of all the above methods.

Table 8: Impact of heads with different sizes on our dataset.

              Small          Middle         Large          Total
Method        MAE    MSE     MAE    MSE     MAE    MSE     MAE    MSE
Our Method    5.69   6.89    2.59   3.09    0.66   1.11    2.35   3.02

Figure 7: The percentage of heads of three different sizes in the FDST dataset.

Performance comparison for heads with different sizes. When labeling the heads in our dataset, we use rectangular boxes, so the size of the rectangle directly represents the size of the head. Based on the rectangle size, we classify the heads into three categories: Small heads (W ≤ 35 and H ≤ 35), Medium heads, and Large heads. The results

are shown in Table 8. We can see that as the size of the heads gets larger, the counting accuracy increases, which is consistent with intuition.

5.6. Non-parametric tests

We perform a non-parametric statistical test in our experiments. Specifically, we use a nearest neighbor based solution. We first evenly divide the image into


N blocks for both the training and testing images. For each testing image block, we find its nearest neighbor among the training blocks based on the VGG features of the

Table 9: Impact of dividing an image into N parts on the FDST and Mall datasets (MAE/MSE).

Method        Dataset    N = 4×4      N = 8×8      N = 16×16
NPT           FDST       3.56/4.68    3.79/5.14    3.88/5.15
Our Method    FDST       2.35/3.02
NPT           Mall       1.99/2.55    2.07/2.63    2.12/2.69
Our Method    Mall       1.80/2.42

blocks. Then we assume the number of people in the corresponding training block to be the number of people in the testing block. By summing the head counts over all blocks of a testing image, we obtain an approximation of the head

count in that image. We denote this non-parametric test baseline as NPT and show its performance in Table 9. Fewer blocks lead to better performance, and our method outperforms this baseline. Furthermore, the NPT baseline is very time-consuming because of the expensive nearest neighbor search.
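A minimal sketch of this NPT baseline, assuming the per-block VGG features have already been extracted, is given below; the function name and array layout are illustrative.

```python
import numpy as np

def npt_count(test_blocks, train_blocks, train_counts):
    """Each test block inherits the head count of its nearest training block
    in feature space. `test_blocks` and `train_blocks` are (num_blocks, dim)
    arrays of per-block features; `train_counts` holds the ground-truth head
    counts of the training blocks."""
    # Pairwise Euclidean distances between test and training block features.
    dists = np.linalg.norm(test_blocks[:, None, :] - train_blocks[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)           # index of the nearest training block
    # The image-level estimate is the sum of the inherited block counts.
    return float(np.sum(np.asarray(train_counts)[nearest]))
```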


6. Conclusion

In this paper, we propose a Multi-Level Feature Fusion Based Locality-Constrained Spatial Transformer Network (MLSTN), which explicitly relates the density maps of neighboring frames to achieve more robust video crowd counting. Specifically, we first leverage a Convolutional Neural Network to


estimate the density map of each frame by combining its low-level, middle-level and high-level features. Then, to relate the density maps of neighboring frames, a Locality-Constrained Spatial Transformer (LST) module is proposed. We further build a large-scale and diversified video crowd counting dataset with frame-wise ground-truth annotation. As far as we know,


the FDST dataset is the largest video crowd counting dataset, in terms of both the number of frames and the number of scenes. Extensive experiments show the effectiveness of our MLSTN for video crowd counting.

References

[1] V. Sindagi, V. Patel, A survey of recent advances in cnn-based single image crowd counting and density estimation, Pattern Recognition Letters (2017) 3–16. [2] Q. Wang, M. Chen, F. Nie, X. Li, Detecting coherent groups in crowd scenes by multiview clustering, IEEE transactions on pattern analysis and

420

machine intelligence. [3] M. Fu, P. Xu, X. Li, Q.Liu, M.Ye, C.Zhu, Fast crowd density estimation with convolutional neural networks, Engineering Applications of Artificial Intelligence (2015) 81 – 88. [4] C. Zhang, H. Li, X. Wang, X. Yang, Cross-scene crowd counting via deep

425

convolutional neural networks, in: CVPR, 2015. [5] Y. Zhang, D. Zhou, S. Chen, S. Gao, Y. Ma, Single-image crowd counting via multi-column convolutional neural network, in: CVPR, 2016, pp. 589– 597. [6] B. Federico, L. Giuseppe, B. L, A. Bimbo, Context-aware trajectory pre-

430

diction, international conference on pattern recognition. [7] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection (2005) 886–893. [8] B. Leibe, E. Seemann, B. Schiele, Pedestrian detection in crowded scenes, in: 2005 IEEE Computer Society Conference on Computer Vision and

435

Pattern Recognition (CVPR’05), Vol. 1, IEEE, 2005, pp. 878–885. [9] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE transactions on pattern analysis and machine intelligence 32 (9) (2010) 1627–1645.

24

[10] O. Tuzel, F. Porikli, P. Meer, Pedestrian detection via classification on 440

riemannian manifolds, TPAMI 30 (10) (2008) 1713–1727. [11] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks (2012) 1097–1105. [12] G. Cheron, I. Laptev, C. Schmid, P-cnn: Pose-based cnn features for action recognition, ICCV (2015) 3218–3226.

445

[13] S. Gidaris, N. Komodakis, Object detection via a multi-region and semantic segmentation-aware cnn model, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1134–1142. [14] S. Zhang, G. Wu, J. P. Costeira, J. M. F. Moura, Fcn-rlstm: Deep spatiotemporal neural networks for vehicle counting in city cameras, in: ICCV,

450

2017, pp. 3687–3696. doi:10.1109/ICCV.2017.396. [15] X. Feng, X. Shi, D. Yeung, Spatiotemporal modeling for crowd counting in videos, in: ICCV, IEEE, 2017, pp. 5161–5169. [16] D. D. Onoro-Rubio, R. L´opez-Sastre, Towards perspective-free object counting with deep learning, in: ECCV, Springer, 2016, pp. 615–629.

455

[17] H. Li, X. He, H. Wu, S. A. Kasmani, R. Wang, X. Luo, L. Lin, Structured inhomogeneous density map learning for crowd counting, arXiv preprint arXiv:1801.06642. [18] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and

460

pattern recognition, 2016, pp. 770–778. [19] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.

25

[20] Y. Fang, B. Zhan, W. Cai, S. Gao, B. Hu, Locality-constrained 465

spatial transformer network for video crowd counting, arXiv preprint arXiv:1907.07911. [21] C. Wang, H. Zhang, L. Yang, S. Liu, X. Cao, Deep people counting in extremely dense crowds, in: Proceedings of the 23rd ACM international conference on Multimedia, ACM, 2015, pp. 1299–1302.

470

[22] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, M. Shah, Composition loss for counting, density map estimation and localization in dense crowds, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 532–546. [23] Yang Cong, H. Gong, S. Zhu, Yandong Tang, Flow mosaicking: Real-

475

time pedestrian counting without scene-specific learning, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1093– 1100. doi:10.1109/CVPR.2009.5206648. [24] M. Rodriguez, J. Sivic, I. Laptev, J.-Y. Audibert, Data-driven crowd analysis in videos, in: 2011 International Conference on Computer Vision, IEEE,

480

2011, pp. 1235–1242. [25] Z. Ma, A. B. Chan, Crossing the line: Crowd counting by integer programming with local features, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2539–2546. [26] X. Ding, Z. Lin, F. He, Y. Wang, Y. Huang, A deeply-recursive convolu-

485

tional network for crowd counting, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 1942–1946. [27] Z. Zou, Y. Cheng, X. Qu, S. Ji, X. Guo, P. Zhou, Attend to count: Crowd counting with adaptive capacity multi-scale cnns, Neurocomputing 367

490

(2019) 75–83.

26

[28] D. Babu Sam, S. Surya, R. Venkatesh Babu, Switching convolutional neural network for crowd counting, in: CVPR, 2017. [29] Y. Li, X. Zhang, D. Chen, Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes, in: CVPR, 2018, pp. 1091– 495

1100. [30] J. Gao, Q. Wang, X. Li, Pcc net: Perspective crowd counting via spatial convolutional network, IEEE Transactions on Circuits and Systems for Video Technology. [31] Q. Wang, J. Gao, W. Lin, Y. Yuan, Learning from synthetic data for crowd

500

counting in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8198–8207. [32] Y. Liu, M. Shi, Q. Zhao, X. Wang, Point in, box out: Beyond counting persons in crowds, arXiv preprint arXiv:1904.01333. [33] D. Lian, J. Li, J. Zheng, W. Luo, S. Gao, Density map regression guided

505

detection network for rgb-d crowd counting and localization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1821–1830. [34] B. Sheng, C. Shen, G. Lin, J. Li, W. Yang, C. Sun, Crowd counting via weighted vlad on a dense attribute feature map, IEEE Transactions on

510

Circuits and Systems for Video Technology 28 (8) (2016) 1788–1797. [35] C. Shang, H. Ai, B. Bai, End-to-end crowd counting via joint learning local and global count, in: 2016 IEEE International Conference on Image Processing (ICIP), IEEE, 2016, pp. 1215–1219. [36] H. Yao, K. Han, W. Wan, L. Hou, Deep spatial regression model for image

515

crowd counting, arXiv preprint arXiv:1710.09757. [37] J. Liu, C. Gao, D. Meng, A. Hauptmann, Decidenet: counting varying density crowds through attention guided detection and density estimation, in: CVPR, 2018, pp. 5197–5206. 27

[38] J. Gao, Q. Wang, Y. Yuan, Scar: Spatial-/channel-wise attention regression 520

networks for crowd counting, Neurocomputing 363 (2019) 1–8. [39] Q. Wang, M. Chen, F. Nie, X. Li, Detecting coherent groups in crowd scenes by multiview clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence (2018) 1–1doi:10.1109/TPAMI.2018.2875002. [40] X. Li, M. Chen, F. Nie, Q. Wang, A multiview-based parameter free frame-

525

work for group detection, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017. [41] M. Jaderberg, K. Simonyan, A. Zisserman, et al., Spatial transformer networks, in: Advances in neural information processing systems, 2015, pp. 2017–2025.

530

[42] D. Chen, G. Hua, F. Wen, J. Sun, Supervised transformer network for efficient face detection, in: ECCV, Springer, 2016, pp. 122–138. [43] Y. Zhong, J. Chen, B. Huang, Toward end-to-end face recognition through alignment learning, IEEE signal processing letters 24 (8) (2017) 1213–1217. [44] W. Wu, M. Kan, X. Liu, Y. Yang, S. Shan, X. Chen, Recursive spatial

535

transformer (rest) for alignment-free face recognition, in: CVPR, 2017, pp. 3772–3780. [45] C.-H. Lin, S. Lucey, Inverse compositional spatial transformer networks, arXiv preprint arXiv:1612.03897. [46] L. Liu, H. Wang, G. Li, W. Ouyang, L. Lin, Crowd counting using deep

540

recurrent spatial-aware network, arXiv preprint arXiv:1807.00601. [47] V. Lempitsky, A. Zisserman, Learning to count objects in images, in: Advances in neural information processing systems, 2010, pp. 1324–1332. [48] D. Xie, T. Shu, S. Todorovic, S.-C. Zhu, Modeling and inferring human intents and latent functional objects for trajectory prediction, arXiv preprint

545

arXiv:1606.07827. 28

[49] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, S. Savarese, Social lstm: Human trajectory prediction in crowded spaces, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 961–971. doi:10.1109/CVPR.2016.110. 550

[50] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167. [51] K. Tota, H. Idrees, Counting in dense crowds using deep features (2015). [52] S. An, W. Liu, S. Venkatesh, Face recognition using kernel ridge regression, in: CVPR, 2007, pp. 1–7. doi:10.1109/CVPR.2007.383105.

555

[53] K. Chen, C. C. Loy, S. Gong, T. Xiang, Feature mining for localised crowd counting, in: In BMVC. [54] A. B. Chan, Z.-S. J. Liang, N. Vasconcelos, Privacy preserving crowd monitoring: Counting people without people models or tracking, in: CVPR, 2008, pp. 1–7.

560

[55] K. Chen, S. Gong, T. Xiang, C. C. Loy, Cumulative attribute space for age and crowd density estimation, in: CVPR, 2013, pp. 2467–2474. doi: 10.1109/CVPR.2013.319. [56] M.-h. Oh, P. A. Olsen, K. N. Ramamurthy, Crowd counting with decomposed uncertainty, arXiv preprint arXiv:1903.07427.

565

[57] X. Liu, J. Van De Weijer, A. D. Bagdanov, Exploiting unlabeled data in cnns by self-supervised learning to rank, IEEE transactions on pattern analysis and machine intelligence. [58] V. Pham, T. Kozakaya, O. Yamaguchi, R. Okada, Count forest: Co-voting uncertain number of targets using random forest for crowd density estima-

570

tion, in: ICCV, 2015, pp. 3253–3261. doi:10.1109/ICCV.2015.372.

29

[59] M. Xu, Z. Ge, X. Jiang, G. Cui, B. Zhou, C. Xu, et al., Depth information guided crowd counting for complex crowd scenes, Pattern Recognition Letters. [60] W. Liu, M. Salzmann, P. Fua, Estimating people flows to better count them 575

in crowded scenes, arXiv preprint arXiv:1911.10782. [61] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical image computing and computer-assisted intervention, Springer, 2015, pp. 234–241.

580

[62] H. Qassim, A. Verma, D. Feinzimer, Compressed residual-vgg16 cnn model for big data places image recognition, in: 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), IEEE, 2018, pp. 169–175.

30

Yanyan Fang received the B.Sc. degree in Au585

tomatic Test and Control from Harbin Institute of Technology, Harbin, China, in 2005 and the M.Sc. degree in Electronic Engineering from University of Electronic Science and Technology, Chengdu, China, in 2008. She is currently a faculty member at the Department of Electronic Engineering, Fudan University, Shanghai, China. Her research interests include computer vision and deep

590

learning

Shenghua Gao is an associate professor at ShanghaiTech University, China. He received the B.E. degree from the University of Science and Technology of China in 2008 and received the Ph.D. degree from the Nanyang Technological University in 2012. From Jun 2012 to Aug 2014, he 595

worked as a research scientist in UIUC Advanced Digital Sciences Singapore. His research interests include computer vision and machine learning. He has published more than 60 papers on image and video understanding in many top-tier international conferences and journals. He served as an area chair in ICCV2019, SPC in AAAI2019, IJCAI2020. He also served as the Associate

600

Editor for IEEE Transactions on Circuits and Systems for Video Technology (IF:3.558) and Neurocomputing (IF:3.224).

31

Jing Li received the BSc. degree in Electronic Information Science and Technology from Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, 605

China, in 2018. He is currently pursuing the master’s degree with ShanghaiTech University, Shanghai, China. His current research interest is computer vision and machine learning.

32

Weixin Luo received the bachelor degree from Shenzhen University, Shenzhen, China. He is currently pursuing the Ph.D. degree at 610

ShanghaiTech University. He focuses on analysis of surveillance videos including human detection, anomaly detection, and face anonymization.

Linfang He is currently pursuing a bachelor’s degree in the department of computer science, Fudan University. Her research interest is computer vision and deep learning, and she is eager to contribute more to 615

computer vision in the future.

Bo Hu received the B.S. and Ph.D. degrees in electronic engineering from Fudan University, Shanghai, China, in 1990 and 1996, respectively. He is currently a Professor with the Department of 33

Electronic Engineering, Fudan University. His research interests include digital 620

image processing, digital communication, and digital system design.

34

Declaration of interests The authors declare that they have no known competing for financial interests or personal relationships that could have appeared to influence the work reported in this paper.

35

625

Author Contributions Section Yanyan Fang: Conceptualization, Methodology, Software, Writing- Original draft preparation Shenghua Gao: Data curation, Writing- Reviewing and Editing, Jing Li: Software, Validation

630

Weixin Luo: Visualization, Investigation Linfang He: Supervision Bo Hu: Writing- Reviewing and Editing,

36