Multi-Level Feature Fusion Based Locality-Constrained Spatial Transformer Network for Video Crowd Counting

Yanyan Fang a, Shenghua Gao b, Jing Li b, Weixin Luo b, Linfang He a, Bo Hu a,1

a School of Information and Technology, Fudan University, China
b School of Information Science and Technology, ShanghaiTech University, China
{yyfang, lfhe16, bohu}@fudan.edu.cn, {gaoshh, lijing1, luowx}@shanghaitech.edu.cn
1 Corresponding author.

Abstract

Video-based crowd counting can leverage the spatial-temporal information between neighboring frames, and this information can improve the robustness of crowd counting. Therefore, this solution is more practical than single-image-based crowd counting in real applications. Since severe occlusions, translation, rotation, and scaling of persons cause the density map of heads to change between neighboring frames, video-based crowd counting is a very challenging task. To alleviate these issues in video crowd counting, a Multi-Level Feature Fusion Based Locality-Constrained Spatial Transformer Network (MLSTN) is proposed, which consists of two components, namely a density map regression module and a Locality-Constrained Spatial Transformer (LST) module. Specifically, we first estimate the density map of each frame by utilizing the combination of the low-level, middle-level, and high-level features of a convolutional neural network. This is because the low-level features may be more effective for extracting information about small heads, while the middle-level and high-level features are more effective for medium and large heads, respectively. Then, to relate the density maps of neighboring frames, the LST module is proposed, which estimates the density map of the next frame from the concatenation of several regressed density maps. To facilitate the performance evaluation of video crowd counting, we have collected and labeled a large-scale video crowd counting dataset which includes 100 five-second-long sequences with 394,081 annotated heads from 13 different scenes. As far as we know, it is the largest video crowd counting dataset. Extensive experiments show the effectiveness of our proposed approach on our dataset and on other video-based crowd counting datasets. Our dataset is released online at https://github.com/sweetyy83/Lstn_fdst_dataset.
Keywords: Convolutional Neural Network; Locality-Constrained Spatial Transformer Network; Video Crowd Counting; Multi-Level Feature Fusion
1. Introduction

Crowd counting aims to estimate the number of people in still images or surveillance videos. It has drawn a lot of attention in computer vision due to its potential applications in many security-related scenarios [1][2], such as video surveillance, traffic monitoring, and emergency management. The majority of previous works for crowd counting, such as [3][4][5], are based on single images. Currently, with cameras deployed at every street corner, video-based crowd counting is more suitable for practical needs because the movement of a crowd is predictable and consistent [6]. Therefore, this paper focuses on video-based crowd counting.

Various approaches have been proposed to tackle the problem of crowd counting. Traditional methods can generally be classified into detection-based approaches and regression-based approaches. The detection-based methods [7][8][9] rely on a detection-style framework, in which a sliding-window detector is used to detect the heads or full bodies of persons in the scene. However, these approaches usually fail to detect the tiny [7] or occluded [10] heads/bodies that are very common in real scenarios. Thus, researchers have attempted to overcome these problems by regression, in which features extracted from crowded scenes or patches are learned to be mapped to crowd counts. In recent years, CNN-based approaches have achieved remarkable success in image classification [11], pose estimation [12], and semantic segmentation [13]. They have also been used to solve crowd counting problems, where a CNN is utilized to learn a mapping from an input image to its corresponding density map.
For video crowd counting, there are two important problems: 1) how to leverage the consistency between neighboring frames, and 2) how to extract robust features for crowd counting. For the first problem, in order to leverage the spatial-temporal consistency among neighboring frames for more accurate density map prediction in videos, LSTM [14] or ConvLSTM [15] based approaches have been proposed, which accumulate the features of all history frames for density map estimation. These approaches have demonstrated their effectiveness for video crowd counting. However, they rely on previous hidden states implicitly and ignore the current intentions of the neighboring people. When people walk in/out or are occluded, the identities of the crowd in the history frames may be completely different from those in the current frame. Consequently, these historical features may even harm the density map estimation of the current frame without careful processing. For the second problem, because multi-column networks can integrate features with different resolutions, existing works such as [5][16] take advantage of multi-column convolutional neural
networks with different receptive fields to learn scale-robust features. However, multi-column networks are still affected by outliers in the training data because they are essentially global models [17]. Different from existing contributions, in this paper, in order to encode the consistency between neighboring frames, we present a novel video-based crowd
counting framework, which consists of a CNN and a spatial transformer network (STN). In this framework, we utilize a Locality-Constrained Spatial Transformer (LST) module to explicitly model the spatial-temporal correlation between neighboring frames, instead of using an LSTM or ConvLSTM to implicitly model the spatiotemporal relationship in videos. On this basis, we propose more robust crowd counting for different sizes of human heads with multi-level feature fusion. The rationale behind this design is mainly based on the following two observations. Firstly, when we consider the same population of a crowd, previous work [6] has shown that the trajectories of the crowd can be well predicted. However, due to factors such as distance, rotation, illumination, and the change
of perspective, even the same person may show a visually significant change in his/her appearance. Therefore, it is sometimes hard to re-identify people directly across two adjacent frames. Because the density map ignores the appearance of a person and depends only on the locations of heads, it is widely used in the literature. Although the density map of one frame may be distorted compared to the density map of its previous frame [15], so that the density maps of previous frames cannot be used directly as estimates for the current frame, the trajectories of people are predictable, and some transformation can therefore be applied to alleviate the distortion. Secondly, for videos, some people are close to the camera and some are far away. Therefore, the sizes of human heads in a video
vary within one frame. It is obviously infeasible to estimate the density maps of those people by using single-scale feature extraction. Taking all these factors into account, our LST warps the density map of the whole frame. To be specific, given two images from adjacent frames, we use their similarity to weight the difference between the ground-truth density map and the warped density map. If the two images are similar and contain nearly the same number of people, then the difference between the ground-truth density map and the warped density map should be small. But if someone walks in/out or is occluded, we allow the warped density map of the previous frame to be slightly different from the ground truth. Further, since our model only uses spatial-temporal dependencies between adjacent frames, it can eliminate the impact of uncorrelated history frames on density map estimation. Recently, Residual Networks (ResNets) [18] and DenseNet [19] have been proposed to extract features. For crowd counting, the shallow features may be more effective for smaller heads, the middle-level features are effective for medium heads, and the higher-level features are more effective
for large heads. We therefore propose to fuse low-level, middle-level, and high-level features for more robust crowd counting. Experiments verify the validity of our model for video-based crowd counting.

Figure 1: Representative images from our crowd dataset.

It is necessary to collect a large-scale dataset with multiple scenes for video
crowd counting. However, most existing crowd counting datasets are based on single images. Although there are some video-based datasets for crowd counting, such as the UCSD dataset and the Mall dataset, they have relatively low-resolution frames and typically focus on only one or two scenes. Similarly, the WorldExpo'10 dataset only contains 5 scenes, and the interval between two labeled frames
is more than 10 seconds, so the temporal correlation and consistency between neighboring labeled frames may be weak. Thus we propose to build a new large-scale video crowd counting dataset named Fudan-ShanghaiTech (FDST) with more scenes (see Fig. 1 for some typical examples). Specifically, the FDST dataset contains 15,000 frames with 394,081 annotated heads captured from 13 different scenes, including shopping malls, squares, hospitals, etc. The dataset is much larger than the WorldExpo'10 dataset, which only contains 3,980 frames with 199,923 annotated heads. Further, we provide frame-wise annotations, while WorldExpo'10 only provides annotations at intervals of more than 10 seconds. Therefore, the FDST dataset is more suitable for video crowd counting evaluation.
In summary, the main contributions of this work are as follows.

• We design a Multi-Level Feature Fusion Based Locality-Constrained Spatial Transformer Network (MLSTN), which explicitly encodes the spatial-temporal dependencies between neighboring frames to achieve more robust crowd counting.

• We propose to fuse multi-level features to boost the robustness for heads with different sizes.

• We collect a large-scale and diversified video crowd counting dataset with frame-wise ground-truth annotation, which facilitates research on video crowd counting.
This work is an extension of our previous work [20] published in ICME 2019 (oral). Compared with our conference version, we improve our work in the following aspects: 1) we propose to fuse features at multiple scales, which is suitable for density map regression of heads with different sizes; 2) we propose to leverage multiple frames to predict the next frame, which improves the prediction accuracy; 3) we conduct more experiments to validate the importance of different components.

The rest of the paper is organized as follows. We first briefly review the related work in Section 2. Then, we introduce the architecture of the Multi-Level Feature Fusion Based LSTN (MLSTN) in Section 3. We introduce the FDST dataset in Section 4. The performance comparison between our approach and the state of the art is presented in Section 5. Finally, we conclude the paper in Section 6.
2. Related work

The problems of crowd counting and density map estimation face many challenges such as non-uniform density, occlusions, and intra-scene and inter-scene variations in scale and perspective [1]. Various methods have been proposed in the literature to deal with the problem of crowd counting in images [21][3][4][5][16][22] and videos [23][24][25][15][14][26][27]. In this section, we review deep learning based crowd counting and the spatial transformer network (STN), and give an overview of the related works.

2.1. Deep learning methods for crowd counting

2.1.1. Image-based crowd counting

Recent works [5][28][29][22][30][31][32][33] have proved the validity of CNNs for density map estimation in single-image crowd counting. Wang et al. [21] and
Fu et al. [3] are among the first researchers to apply CNN-based methods to crowd density estimation. At the same time, Zhang et al. [4] propose the idea of cross-scene crowd counting. Their basic idea is to map images into crowd counts and adapt this mapping to new target scenes for cross-scene counting. However, the shortcoming of this method is the need for perspective maps both on train-
ing scenes and test scenes. Therefore, Zhang et al. [5] propose a multi-column CNN architecture that allows input images to have arbitrary size or resolution. Similarly, Onoro-Rubio and Lopez-Sastre [16] propose a scale-aware counting model called Hydra CNN, which uses a pyramid of image patches extracted at multiple scales to perform the final density prediction. The above methods mainly focus on incorporating scale information into their networks. Sheng et al. [34] propose a new image representation method, which combines semantic attributes and spatial cues to improve the discriminative power of the feature representation. Shang et al. [35] present an end-to-end network that consists of a CNN model and an LSTM decoder to predict the number of people. Yao et al. [36] propose a
deep spatial regression model, based on a CNN and an LSTM, for counting the number of individuals present in a still image with arbitrary perspective and arbitrary resolution. Recently, several approaches focus on combining additional cues to assist crowd counting, such as detection [37], attention [38], localization [22][32][33], and synthetic data [31]. In particular, Wang et al. [31] introduce a very large synthetic crowd counting dataset and propose a Spatial Fully Convolutional Network to improve real-world performance with synthetic data. All these methods have achieved great success in crowd counting. However, these single-image crowd counting methods may lead to inconsistent head counts for neighboring frames in video crowd counting.
2.1.2. Video-based crowd counting

Although previous works on single-image crowd counting have reported good results, they treat all datasets as sets of still images without considering their spatial-temporal correlation, even for video sequences. Recently, several works [15][39][40][26][14] attempt to exploit spatial-temporal correlation. More specifically, Xiong et al. [15] propose to leverage ConvLSTM to integrate history features and the features of the current frame for video crowd counting, which has shown its effectiveness. Li et al. [40] propose a multiview-based parameter-free approach to detect groups in crowd scenes. Ding et al. [26] propose a deeply-recursive network based on ResNet blocks for crowd counting. Zou et al. [27] propose an adaptive multi-scale convolutional network that assigns different capacities to different portions of the input. Further, Zhang et al. [14] also propose to use LSTM for vehicle counting in videos. However, all these LSTM-based methods may be affected by irrelevant history and do not explicitly consider the spatial-temporal dependencies in videos, whereas our solution explicitly models such dependencies in neighboring frames with the LST.

2.2. Spatial transformer network

Although CNNs have enjoyed huge success in various computer vision problems, they lack a principled way to be spatially invariant to the in-
put data. Recently, Jaderberg et al. [41] introduce a differentiable Spatial Transformer (ST) module which is capable of modeling the spatial transformation between inputs and outputs. The ST module provides an end-to-end learning mechanism that can easily be inserted into many existing networks to explicitly learn how to transform the input data to achieve spatial invariance. Since the introduction of the STN, it has been proven effective in resolving geometric variations. For example, in [42] and [43], STNs are used to improve the performance of face alignment and detection. Although the STN model is very effective in many cases, it does not work well in situations where heavy deformations exist. To solve this problem, Wu et al. [44] propose a multiple-STN model named Recursive Spatial Transformer (ReST) and use it for alignment-free face recognition. However, as the number of geometric prediction layers increases, problems such as unwanted boundary effects arise. Therefore, Lin and Lucey [45] advocate Inverse Compositional Spatial Transformer Networks (IC-STNs), which combine conventional STNs with the IC-LK algorithm. Further, the ST module has also been applied to density map estimation in a coarse-to-fine single-image crowd counting framework [46]. Different from [46], we propose to leverage the ST to relate the density maps of neighboring frames for video crowd counting.

Table 1: Description of the variables.
Variable           Description
X_t                the image of the t-th (t = 1, ..., T) frame of a video sequence
M_t^GT             the ground-truth density map of the t-th frame
M_t^reg            the density map of the t-th frame estimated by the density map regression module
M̃_{t+2}^reg       the concatenation of M_t^reg, M_{t+1}^reg and M_{t+2}^reg
M_{t+3}^LST        the density map of the (t+3)-th frame estimated by the LST module
3. The proposed method
3.1. Overview

In this work, we propose a Multi-Level Feature Fusion Based Locality-Constrained Spatial Transformer Network (MLSTN). The architecture of this network is shown in Fig. 2; it consists of two basic modules: the density map regression module and the Locality-Constrained Spatial Transformer (LST) module.

Following the prior work [47], we formulate the crowd counting task as a density map estimation problem. In order to obtain more robust crowd counting, we first utilize a feature extractor to extract multi-level features for each input frame and then send them into the LST after concatenation. Specifically, in the density map regression module, we first take three consecutive images X_t, X_{t+1}, X_{t+2} as a set of inputs and estimate their corresponding density maps M_t^reg, M_{t+1}^reg, M_{t+2}^reg, and then aggregate them by concatenation into a new estimated density map M̃_{t+2}^reg. In the end, we take M̃_{t+2}^reg as input to the LST module to predict the density map of the next frame, M_{t+3}^LST. In the rest of this section, we introduce the above modules in detail.
For reading convenience, we summarize the important notations used in this paper in Table 1.

Figure 2: The structure of the MLSTN module for video crowd counting. A shared feature extractor maps each input frame to a regressed density map (density map regression module, trained with \ell_{reg}); the concatenated maps are fed to the locality-constrained spatial transformer module (trained with \ell_{LST}).
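To make the data flow of Fig. 2 concrete, a minimal sketch of the overall forward pass is given below (PyTorch-style; the two submodules stand in for the components detailed in Sections 3.2-3.4, and their interface is our assumption, not the authors' released code):

```python
import torch
import torch.nn as nn

class MLSTN(nn.Module):
    """Sketch: three consecutive frames are mapped to density maps by a shared
    regressor, concatenated, and warped by the LST to predict the next frame."""
    def __init__(self, density_regressor: nn.Module, lst: nn.Module):
        super().__init__()
        self.regressor = density_regressor   # shared across the three frames
        self.lst = lst

    def forward(self, x_t, x_t1, x_t2):
        m_t = self.regressor(x_t)            # M_t^reg
        m_t1 = self.regressor(x_t1)          # M_{t+1}^reg
        m_t2 = self.regressor(x_t2)          # M_{t+2}^reg
        m_cat = torch.cat([m_t, m_t1, m_t2], dim=1)   # M~_{t+2}^reg
        m_next = self.lst(m_cat)             # M_{t+3}^LST
        return (m_t, m_t1, m_t2), m_next
```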
3.2. Density map regression based crowd counting

The generation of density maps is very important for the performance of density map based crowd counting. Similar to [47], we also express crowd counting as a density map estimation problem. That is, given one frame with N heads, if the i-th head is centered at p_i, we represent it as a delta function δ(p − p_i); then the ground-truth density map of this frame can be calculated as follows:

M = \sum_{i=1}^{N} \delta(p - p_i) * G_{\sigma}(p).   (1)

Here G_σ(p) is a 2D Gaussian kernel with variance σ:

G_{\sigma}(p) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}.   (2)

That is to say, if a pixel is near an annotated point, it has a higher probability of belonging to a head. Once the density maps are generated, the density map regression module maps each frame to its corresponding density map. As aforementioned, we denote the ground-truth density map of the t-th (t = 1, ..., T) frame as M_t^GT, and the density map estimated by the density map regression module as M_t^reg. Then the objective of the density map regression module can be written as follows:

\ell_{reg} = \frac{1}{2T} \sum_{t=1}^{T} \| M_t^{reg} - M_t^{GT} \|_2^2.   (3)
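As a concrete illustration of Eqs. (1)-(2), a minimal sketch of the ground-truth density map generation is shown below (NumPy/SciPy; sigma = 3 follows the implementation details in Section 3.5, while the function name and the rounding of head coordinates are our own assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ground_truth_density_map(head_points, height, width, sigma=3.0):
    """Eq. (1)-(2): place a delta at every annotated head center and
    convolve with a fixed-width 2D Gaussian kernel."""
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in head_points:                      # head centers p_i = (x, y)
        col = min(int(round(x)), width - 1)
        row = min(int(round(y)), height - 1)
        density[row, col] += 1.0                  # delta(p - p_i)
    # Convolution with G_sigma; the kernel integrates to 1, so the map
    # still sums (approximately) to the number of annotated heads.
    return gaussian_filter(density, sigma=sigma, mode='constant')
```

Since the Gaussian kernel integrates to one, summing a density map approximately recovers the number of annotated heads, which is also how counts are read off the estimated maps at test time.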
Figure 3: The network architecture of the feature extractor module. Low-level, middle-level, and high-level feature maps of a VGG-style stack of 3×3 convolutions (64-512 channels with 2×2 max pooling) are concatenated and passed through further convolutional layers to regress the density maps M_t^reg, M_{t+1}^reg, and M_{t+2}^reg.
3.3. Multi-Level Feature Fusion for robust crowd counting

Features are very important in crowd counting. Therefore, in the density map regression module, we use a multi-level feature fusion extractor to extract features. As shown in Fig. 3, it combines low-level, middle-level, and high-level features.

In a real scene, the size of the heads in a video frame varies: heads close to the camera are usually larger, and heads far away from the camera are smaller. Therefore, we need to extract features that are robust to different sizes of human heads. To this end, we propose a multi-level feature fusion structure, in which low-level features are used for small heads, middle-level features for medium heads, and high-level features for large heads. Through the concatenation of low-level, middle-level, and high-level features, we are able to achieve more robust crowd counting for different sizes of human heads (a sketch is given below). In addition, concatenating the features of every layer is clearly impractical in terms of computational complexity, so we only extract three feature layers and concatenate them. The experiments prove the effectiveness of this design.
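A rough PyTorch sketch of this fusion is given below; it only illustrates the idea, and the backbone split points, channel widths, and regression head are assumptions rather than the authors' exact configuration (cf. Fig. 3):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class MultiLevelFusionRegressor(nn.Module):
    """Sketch: concatenate low/middle/high-level VGG-16 features, then
    regress a single-channel density map. The layer split points are assumed."""
    def __init__(self):
        super().__init__()
        feats = vgg16().features          # ImageNet-pretrained weights would normally be loaded
        self.low = feats[:10]             # up to conv2_2 block (128 channels, assumed split)
        self.mid = feats[10:17]           # up to conv3_3 block (256 channels, assumed split)
        self.high = feats[17:24]          # up to conv4_3 block (512 channels, assumed split)
        self.regressor = nn.Sequential(
            nn.Conv2d(128 + 256 + 512, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, x):
        low = self.low(x)
        mid = self.mid(low)
        high = self.high(mid)
        size = high.shape[-2:]            # fuse at the coarsest spatial resolution
        fused = torch.cat([
            F.interpolate(low, size=size, mode='bilinear', align_corners=False),
            F.interpolate(mid, size=size, mode='bilinear', align_corners=False),
            high], dim=1)
        return self.regressor(fused)      # estimated density map M_t^reg
```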
3.4. LST module

Many previous works [48][49][6] have shown that, for the same population of a crowd in videos, the trajectories can be well predicted. It is therefore natural to expect that the density maps of previous frames may be helpful for predicting the density map of the current frame. However, most existing video crowd counting datasets do not provide the correspondence between people in neighboring frames, which makes it impossible to directly learn the mapping from the head coordinates of previous frames to those of the current frame. Furthermore, the appearance of the same person may change a lot visually due to variations in perspective, distance, rotation, lighting conditions, and occlusion in neighboring frames, which makes it difficult to directly re-identify a person across two adjacent frames. In contrast, the density map ignores the appearance of persons and depends only on the locations of heads. In addition, previous work has shown that the trajectories of people are predictable. Therefore, for the same group of people, we can utilize the density map of the previous frame to estimate the density map of the current frame. To be specific, the deformation of the density map of the same group of people between neighboring frames includes translation and scaling, when people walk away from or towards the camera or when the camera itself moves, and rotation, caused for example by wind or vibration of the ground.
CNNs define an exceptionally powerful class of models, but they are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner [41]. To exploit the spatiotemporal correlation for video crowd counting, we propose a novel approach based on the STN. As can be seen from the recent work [46], the spatial transformer (ST) module is effective at learning the transformation between input and output. Therefore, for the same group of people, the ST can be used to learn the mapping between two adjacent frames. In practice, however, the applicability of the ST is restricted when people walk into/out of the range of the camera or when some people are occluded. Therefore, an LST is proposed in this paper, whose basic principle is to use the similarity of two adjacent images to weight the difference between the ground-truth and the transformed density map. If the two input images are similar, they are likely to correspond to the same population, so the difference between the ground-truth density map and the transformed density map should be small. If someone walks in/out or is occluded, we allow the estimated density map to be slightly different from the ground truth. Minimizing such weighted differences over all frames exploits the dependency between adjacent frames for video-based crowd counting.

We define the mapping function of the LST module as f_LST. It uses the concatenated estimated density map M̃_{t+2}^reg of the (t + 2)-th frame as input to estimate the density map of the (t + 3)-th frame. We denote by M_{t+3}^LST the density map of the (t + 3)-th frame estimated by the LST. Then

M_{t+3}^{LST} = f_{LST}(\tilde{M}_{t+2}^{reg}; A_\theta).   (4)
Figure 4: The architecture of the Locality-Constrained Spatial Transformer (LST), consisting of a localisation network, a grid generator, and a sampler.
Inspired by [41], our LST module mainly includes three parts, as shown in Fig. 4: 1) the localisation network; 2) the grid generator; 3) the sampler. The localisation network takes the density map M̃_{t+2}^reg as input and uses a number of hidden filters to produce the transformation parameters θ. Then, the grid generator creates a sampling grid Γ_θ from the predicted parameters θ, and the sampler maps the input density map onto the sampling grid. This operation produces the transformed density map M_{t+3}^LST. Following [41], the LST transformation has the form

\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \Gamma_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix},   (5)

where (x_i^t, y_i^t) is the target coordinate of the sampling grid Γ_θ in the output density map, (x_i^s, y_i^s) is the source coordinate in the input density map that defines the sample point, and A_θ denotes the affine transformation matrix.

We use X_{t+2}, M_{t+3}^GT and M_{t+3}^LST to denote the image of the (t + 2)-th frame, the ground-truth density map of the (t + 3)-th frame, and the density map of the (t + 3)-th frame estimated by the LST, respectively. Then
the objective of the LST can be written as follows:

\ell_{LST} = \frac{1}{2T} \sum_{t=1}^{T-3} S(X_{t+2}, X_{t+3}) \, \| M_{t+3}^{LST} - M_{t+3}^{GT} \|_2^2,   (6)

where S(X_{t+2}, X_{t+3}) denotes the similarity between the corresponding temporally neighboring frames, which is measured as follows:

S(X_{t+2}, X_{t+3}) = \exp\left(-\frac{\| X_{t+2} - X_{t+3} \|_2^2}{2\beta^2}\right).   (7)
3.5. Loss function and implementation details

The objective function consists of two parts, the loss of the density map regression module and that of the LST module:

\ell = \ell_{reg} + \lambda \ell_{LST},   (8)

where λ is a weight used to balance \ell_{reg} and \ell_{LST}.
In the training process, an Adam optimizer is used with a learning rate of 1e-6 on our dataset. To reduce over-fitting, we adopt batch normalization [50], and the batch size is 7. Once our network is trained, in the testing phase we can directly estimate the density map M_t^reg (t = 1, ..., T) of each frame and integrate it to get the estimated head count. The variance used in Gaussian-based density map generation is σ = 3, and the β used in the similarity measurement is 20 on the FDST dataset. We resize all frames to 640 × 360 pixels. We first pretrain the density map regression module, and then finetune the whole network while fixing all feature layers of the density map regression module. We set λ = 0.01 on the FDST dataset.³

³ Because the ground truth is annotated at 2 fps on WorldExpo'10 and ROIs are also marked, the population of two neighboring labeled frames changes a lot. Thus this dataset is not suitable for the performance evaluation of our method.
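Putting the pieces together under the hyperparameters above, a minimal (illustrative) training step might look as follows; the module instances refer to the sketches given earlier and are our assumptions, not the released implementation:

```python
import torch

# Assumed module instances from the earlier sketches.
regressor = MultiLevelFusionRegressor()
lst = LocalityConstrainedST(in_channels=3)

params = list(regressor.parameters()) + list(lst.parameters())
optimizer = torch.optim.Adam(params, lr=1e-6)   # learning rate from Section 3.5
lam = 0.01                                      # lambda in Eq. (8) on FDST

def training_step(frames, gt_maps):
    """frames: X_t..X_{t+3} resized to 640x360; gt_maps: ground truth at the
    same resolution as the regressed maps (assumed)."""
    optimizer.zero_grad()
    m_reg = [regressor(f) for f in frames[:3]]             # M_t^reg .. M_{t+2}^reg
    loss_reg = sum(0.5 * ((m - g) ** 2).sum()
                   for m, g in zip(m_reg, gt_maps[:3]))     # Eq. (3)
    m_lst = lst(torch.cat(m_reg, dim=1))                    # Eq. (4)
    loss_lst = lst_loss(m_lst, gt_maps[3], frames[2], frames[3], beta=20.0)  # Eq. (6)
    loss = loss_reg + lam * loss_lst                        # Eq. (8)
    loss.backward()
    optimizer.step()
    return loss.item()
```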
To evaluate the training time of the proposed approach, we implement our model on an NVIDIA Titan X GPU platform and measure the running time of our model in the training phase on the FDST dataset. We use 60 videos with 9,000 frames as the training set and the remaining 40 videos with 6,000 frames as the testing set. As stated previously, our training is divided into two parts. First, we use the CNN network for pre-training. After training with 9,000 frames, it takes about 1 hour and 50 minutes for the first test result to appear. Then, we fix all feature layers and introduce the LSTN. It takes about another 45 minutes for the first test result to appear. We also measure the running time of our model in the test phase on the FDST dataset. We repeat our program 30 times, and the average runtime of our model is 1.15 ms.
4. The Fudan-ShanghaiTech video crowd counting dataset
Figure 5: Representative images of the fifteen scenes in our crowd dataset.
In the past few years, various datasets for video-based crowd counting have been created, but existing datasets typically contain low-resolution images captured from a single scene, such as the Mall dataset and the UCSD dataset. Although the WorldExpo'10 dataset provides several different scenes, it samples the video sparsely, which is not entirely suitable for video-based crowd counting tasks. Hence, we introduce a new large-scale video-based crowd counting dataset named FDST. Specifically, we collect 100 five-second-long sequences captured from 15 scenes (please refer to Fig. 5), and the FDST dataset contains 15,000 frames with a total of 394,081 annotated heads. Annotating the FDST dataset took us more than 400 hours. To our knowledge, this dataset is by far the largest video crowd counting dataset. Statistics for our dataset and other relevant datasets are shown in Table 2.

Table 2: Details of some datasets. Num is the total number of frames; FPS is the number of frames per second; Max is the maximal crowd count in one frame; Min is the minimal crowd count in one frame; Ave is the average crowd count in one frame; Total is the total number of labeled people.
Dataset      Resolution                  Num     FPS    Max    Min    Ave     Total
UCSD         238 × 158                   2000    10     46     11     24.9    49,885
Mall         640 × 480                   2000    <2     53     13     31.2    62,316
WorldExpo    576 × 720                   3980    50     253    1      50.2    199,923
Ours         1920 × 1080 / 1280 × 720    15000   30     57     9      26.7    394,081
5. Experiments 320
5.1. Evaluation metric

Following the convention of existing work [51] on crowd counting, we adopt the mean absolute error (MAE) and the mean squared error (MSE) as evaluation metrics. They are defined as follows:

MAE = \frac{1}{T} \sum_{i=1}^{T} |z_i - \hat{z}_i|, \qquad MSE = \sqrt{\frac{1}{T} \sum_{i=1}^{T} (z_i - \hat{z}_i)^2},   (9)

where T is the total number of frames over all testing video sequences, z_i is the true number of people in the i-th frame, and \hat{z}_i is the estimated number of people in the i-th frame. Broadly speaking, MAE and MSE indicate the accuracy and robustness of the estimates, respectively.
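A minimal sketch of these metrics, with the count of each frame obtained by summing its estimated density map as described in Section 3.5 (function and variable names are illustrative):

```python
import numpy as np

def evaluate_counts(pred_density_maps, gt_counts):
    """Eq. (9): MAE and (root of the) MSE over all testing frames.
    pred_density_maps: list of 2D arrays; gt_counts: list of true head counts."""
    pred_counts = np.array([m.sum() for m in pred_density_maps])  # integrate maps
    gt_counts = np.asarray(gt_counts, dtype=np.float64)
    mae = np.mean(np.abs(gt_counts - pred_counts))
    mse = np.sqrt(np.mean((gt_counts - pred_counts) ** 2))
    return mae, mse
```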
Table 3: Results of different methods on the UCSD dataset.

Method                                  MAE     MSE
Kernel Ridge Regression [52]            2.16    7.45
Ridge Regression [53]                   2.25    7.82
Gaussian Process Regression [54]        2.24    7.97
Cumulative Attribute Regression [55]    2.07    6.86
Zhang et al. [4]                        1.60    3.31
MCNN [5]                                1.07    1.35
Switch-CNN [28]                         1.62    2.10
CSRNet [29]                             1.16    1.47
FCN-rLSTM [14]                          1.54    3.02
ConvLSTM [15]                           1.30    1.79
Bidirectional ConvLSTM [15]             1.13    1.43
DUB-CSRNet [56]                         1.03    1.24
Liu et al. [57]                         1.17    1.55
LSTN [20]                               1.07    1.39
Our Method                              1.02    1.32
5.2. The UCSD dataset

The UCSD dataset [54] contains a 2000-frame video of pedestrians on a sidewalk of the UCSD campus captured by surveillance cameras. The frame size is 238 × 158 and the frame rate is 10 fps. On average, there are only about 25 persons per frame. According to the train-test setting in [54], we use
frames from 601 to 1400 as training data and the remaining 1200 frames as testing data. The low resolution of the images (238 × 158) makes it challenging to generate density maps, especially after the pooling operations. Similar to [29], we use linear interpolation to resize each frame to 476 × 316. The accuracy of different methods on this dataset is shown in Table 3. We can see
that our model achieves the lowest MAE and a comparable MSE with respect to existing methods.
Table 4: Results of different methods on the Mall dataset.

Method                                  MAE     MSE
Kernel Ridge Regression [52]            3.51    18.10
Ridge Regression [53]                   3.59    19.00
Gaussian Process Regression [54]        3.72    20.10
Cumulative Attribute Regression [55]    3.43    17.70
COUNT Forest [58]                       2.50    10.00
ConvLSTM [15]                           2.24    8.50
Bidirectional ConvLSTM [15]             2.10    7.60
DigCrowd [59]                           3.21    16.4
LSTN [20]                               2.00    2.50
Our Method                              1.80    2.42
5.3. The Mall dataset

The Mall dataset was introduced by Chen et al. [53]. This video contains 2000 annotated frames at a resolution of 640 × 480, with over 60,000 labeled
pedestrians, all from a shopping mall with a surveillance camera. The Region of Interest (ROI) and the perspective map are provided in the dataset. Using the same setting as mentioned in [53], the first 800 frames are used as training frames and the remaining 1200 frames are used for testing. The results are shown in Table 4. We can see that our method outperforms most methods
on this dataset.

5.4. The FDST dataset

The FDST dataset contains 100 five-second videos. We use 60 videos (9,000 frames) for the training set and the remaining 40 videos (6,000 frames) as the testing set. We compare our approach with MCNN [5], which achieves advanced performance
in single-image crowd counting, and with ConvLSTM [15], which is a state-of-the-art video crowd counting method. We also report the performance of our conference version LSTN [20] and of the methods in [60] that have recently been run on our dataset. All results are shown in Table 5.
Table 5: Results of different methods on our dataset.

Method            MAE     MSE
MCNN [5]          3.77    4.88
ConvLSTM [15]     4.48    5.82
LSTN [20]         3.35    4.45
COMBI [60]        2.92    3.76
ALL-EST [60]      2.84    3.57
Our Method        2.35    3.02
Figure 6: The density maps estimated by our method on our dataset (input frames X_t to X_{t+3}, ground truth, and estimation).
We can see that our method achieves the best performance. It is worth noting that, since there are many scenes in our dataset and it is not easy to train the ConvLSTM, the performance of ConvLSTM is even worse than that of a single-image-based method. We also show the density maps estimated by our MLSTN in Fig. 6.

5.5. Ablation study

In order to further analyze the relative contributions of the various components of our model, and the impact of different feature extractors and of heads with different sizes on performance, we conduct the following ablation studies on the FDST dataset.
Table 6: Impact of LST on our dataset.

Method         MAE     MSE
Ours non-LST   2.84    3.94
Ours-one       2.45    3.22
Ours-three     2.35    3.02
Impact of LST. We first analyze the relative contributions of the LST and non-LST variants of our model in Table 6, where we also show the performance when the regression density maps fed to the LST come from a single input frame X_t (Ours-one) or from three consecutive frames X_t, X_{t+1}, X_{t+2} (Ours-three). Compared with the method without LST, the improvement of our method shows the effectiveness of the LST. In addition, in order to better predict the next frame, we increase the number of input frames used to generate the regression density maps, and the experiments verify the rationality of this design.

Table 7: Results of different feature extractors on our dataset.
Method               MAE     MSE
Unet [61] (1/2)      3.81    5.01
Unet [61] (1/4)      3.46    4.47
Unet [61] (1/8)      4.10    5.82
Resnet-VGG16 [62]    2.95    3.90
Our Method           2.35    3.02
Impact of different feature extractors. We fuse low-level, middle-level, and high-level features. A more intuitive idea, however, is to combine the features of every layer, as in U-Net [61] or ResNets [18]. So in this experiment, we take the outputs of U-Net upsampled to 1/2, 1/4, and 1/8 of the original image resolution, respectively, as the output results. In addition, similar to [62], we equip VGG16 with residual connections. Table 7 shows the results of all the above-mentioned methods.
Table 8: Impact of heads with different sizes on our dataset.

              Small           Middle          Large           Total
Method        MAE     MSE     MAE     MSE     MAE     MSE     MAE     MSE
Our Method    5.69    6.89    2.59    3.09    0.66    1.11    2.35    3.02
Figure 7: The percentage of heads of three different sizes in the FDST dataset.
Performance comparison for heads with different sizes. When labeling the heads of our dataset, we use rectangular boxes, so the size of the rectangle directly represents the size of the head in the dataset. Based on the size of the rectangle, we classify the heads in the videos into three categories by thresholding the box width W and height H: Small heads (W ≤ 35 and H ≤ 35), Medium heads, and Large heads. The results are shown in Table 8. We can see that, as the size of the heads gets larger, the counting accuracy increases, which is consistent with intuition.

5.6. Non-parametric tests

We perform a non-parametric statistical test in our experiments. Specifically, we use a nearest-neighbor based solution. We first evenly divide the image into
N blocks for both the training and testing images. For each testing image block, we find its nearest neighbor among the training blocks based on the VGG features of the different blocks.
Table 9: Impact of dividing an image into N parts on the FDST dataset and the Mall dataset (MAE/MSE).

Dataset    Method        N = 4×4       N = 8×8       N = 16×16
FDST       NPT           3.56/4.68     3.79/5.14     3.88/5.15
FDST       Our Method    2.35/3.02
Mall       NPT           1.99/2.55     2.07/2.63     2.12/2.69
Mall       Our Method    1.80/2.42
Then we take the number of people in the corresponding training block as the number of people in the testing block. By summing the head counts over all the blocks in each testing image, we obtain an approximation of the head count of that image. We denote this non-parametric test baseline as NPT and show its performance in Table 9. We can see that fewer blocks lead to better performance, while our method outperforms this baseline. Further, the NPT baseline is very time-consuming because of the expensive nearest-neighbor search.
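A rough sketch of this NPT baseline as described above (the choice of VGG feature layer and of the distance metric are not specified here and are assumptions):

```python
import numpy as np

def npt_count(test_blocks_feat, train_blocks_feat, train_blocks_count):
    """Block-wise nearest-neighbor counting baseline.
    test_blocks_feat:   (N, D) features of the N blocks of one testing image.
    train_blocks_feat:  (M, D) features of all training blocks.
    train_blocks_count: (M,) head count of each training block."""
    total = 0.0
    for feat in test_blocks_feat:
        # Euclidean nearest neighbor in feature space (distance metric assumed).
        dists = np.linalg.norm(train_blocks_feat - feat, axis=1)
        total += train_blocks_count[np.argmin(dists)]
    return total  # approximate head count of the testing image
```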
6. Conclusion

In this paper, we propose a Multi-Level Feature Fusion Based Locality-Constrained Spatial Transformer Network (MLSTN), which explicitly relates the density maps of neighboring frames to achieve more robust video crowd counting. Specifically, we first leverage a convolutional neural network to estimate the density map of each frame by combining low-level, middle-level, and high-level features. Then, to relate the density maps of neighboring frames, a Locality-Constrained Spatial Transformer (LST) module is proposed. We further build a large-scale and diversified video crowd counting dataset with frame-wise ground-truth annotation. As far as we know, the FDST dataset is the largest video crowd counting dataset in terms of both the number of frames and the number of scenes. Extensive experiments show the effectiveness of our MLSTN for video crowd counting.
References
[1] V. Sindagi, V. Patel, A survey of recent advances in cnn-based single image crowd counting and density estimation, Pattern Recognition Letters (2017) 3–16. [2] Q. Wang, M. Chen, F. Nie, X. Li, Detecting coherent groups in crowd scenes by multiview clustering, IEEE transactions on pattern analysis and
machine intelligence. [3] M. Fu, P. Xu, X. Li, Q.Liu, M.Ye, C.Zhu, Fast crowd density estimation with convolutional neural networks, Engineering Applications of Artificial Intelligence (2015) 81 – 88. [4] C. Zhang, H. Li, X. Wang, X. Yang, Cross-scene crowd counting via deep
convolutional neural networks, in: CVPR, 2015. [5] Y. Zhang, D. Zhou, S. Chen, S. Gao, Y. Ma, Single-image crowd counting via multi-column convolutional neural network, in: CVPR, 2016, pp. 589– 597. [6] B. Federico, L. Giuseppe, B. L, A. Bimbo, Context-aware trajectory pre-
diction, international conference on pattern recognition. [7] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection (2005) 886–893. [8] B. Leibe, E. Seemann, B. Schiele, Pedestrian detection in crowded scenes, in: 2005 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR’05), Vol. 1, IEEE, 2005, pp. 878–885. [9] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE transactions on pattern analysis and machine intelligence 32 (9) (2010) 1627–1645.
[10] O. Tuzel, F. Porikli, P. Meer, Pedestrian detection via classification on
riemannian manifolds, TPAMI 30 (10) (2008) 1713–1727. [11] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks (2012) 1097–1105. [12] G. Cheron, I. Laptev, C. Schmid, P-cnn: Pose-based cnn features for action recognition, ICCV (2015) 3218–3226.
[13] S. Gidaris, N. Komodakis, Object detection via a multi-region and semantic segmentation-aware cnn model, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1134–1142. [14] S. Zhang, G. Wu, J. P. Costeira, J. M. F. Moura, Fcn-rlstm: Deep spatiotemporal neural networks for vehicle counting in city cameras, in: ICCV,
2017, pp. 3687–3696. doi:10.1109/ICCV.2017.396. [15] X. Feng, X. Shi, D. Yeung, Spatiotemporal modeling for crowd counting in videos, in: ICCV, IEEE, 2017, pp. 5161–5169. [16] D. D. Onoro-Rubio, R. López-Sastre, Towards perspective-free object counting with deep learning, in: ECCV, Springer, 2016, pp. 615–629.
[17] H. Li, X. He, H. Wu, S. A. Kasmani, R. Wang, X. Luo, L. Lin, Structured inhomogeneous density map learning for crowd counting, arXiv preprint arXiv:1801.06642. [18] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and
pattern recognition, 2016, pp. 770–778. [19] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
[20] Y. Fang, B. Zhan, W. Cai, S. Gao, B. Hu, Locality-constrained
spatial transformer network for video crowd counting, arXiv preprint arXiv:1907.07911. [21] C. Wang, H. Zhang, L. Yang, S. Liu, X. Cao, Deep people counting in extremely dense crowds, in: Proceedings of the 23rd ACM international conference on Multimedia, ACM, 2015, pp. 1299–1302.
[22] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, M. Shah, Composition loss for counting, density map estimation and localization in dense crowds, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 532–546. [23] Yang Cong, H. Gong, S. Zhu, Yandong Tang, Flow mosaicking: Real-
time pedestrian counting without scene-specific learning, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1093– 1100. doi:10.1109/CVPR.2009.5206648. [24] M. Rodriguez, J. Sivic, I. Laptev, J.-Y. Audibert, Data-driven crowd analysis in videos, in: 2011 International Conference on Computer Vision, IEEE,
2011, pp. 1235–1242. [25] Z. Ma, A. B. Chan, Crossing the line: Crowd counting by integer programming with local features, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2539–2546. [26] X. Ding, Z. Lin, F. He, Y. Wang, Y. Huang, A deeply-recursive convolu-
tional network for crowd counting, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 1942–1946. [27] Z. Zou, Y. Cheng, X. Qu, S. Ji, X. Guo, P. Zhou, Attend to count: Crowd counting with adaptive capacity multi-scale cnns, Neurocomputing 367
(2019) 75–83.
[28] D. Babu Sam, S. Surya, R. Venkatesh Babu, Switching convolutional neural network for crowd counting, in: CVPR, 2017. [29] Y. Li, X. Zhang, D. Chen, Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes, in: CVPR, 2018, pp. 1091–
1100. [30] J. Gao, Q. Wang, X. Li, Pcc net: Perspective crowd counting via spatial convolutional network, IEEE Transactions on Circuits and Systems for Video Technology. [31] Q. Wang, J. Gao, W. Lin, Y. Yuan, Learning from synthetic data for crowd
counting in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8198–8207. [32] Y. Liu, M. Shi, Q. Zhao, X. Wang, Point in, box out: Beyond counting persons in crowds, arXiv preprint arXiv:1904.01333. [33] D. Lian, J. Li, J. Zheng, W. Luo, S. Gao, Density map regression guided
detection network for rgb-d crowd counting and localization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1821–1830. [34] B. Sheng, C. Shen, G. Lin, J. Li, W. Yang, C. Sun, Crowd counting via weighted vlad on a dense attribute feature map, IEEE Transactions on
Circuits and Systems for Video Technology 28 (8) (2016) 1788–1797. [35] C. Shang, H. Ai, B. Bai, End-to-end crowd counting via joint learning local and global count, in: 2016 IEEE International Conference on Image Processing (ICIP), IEEE, 2016, pp. 1215–1219. [36] H. Yao, K. Han, W. Wan, L. Hou, Deep spatial regression model for image
crowd counting, arXiv preprint arXiv:1710.09757. [37] J. Liu, C. Gao, D. Meng, A. Hauptmann, Decidenet: counting varying density crowds through attention guided detection and density estimation, in: CVPR, 2018, pp. 5197–5206.
[38] J. Gao, Q. Wang, Y. Yuan, Scar: Spatial-/channel-wise attention regression
networks for crowd counting, Neurocomputing 363 (2019) 1–8. [39] Q. Wang, M. Chen, F. Nie, X. Li, Detecting coherent groups in crowd scenes by multiview clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence (2018) 1–1doi:10.1109/TPAMI.2018.2875002. [40] X. Li, M. Chen, F. Nie, Q. Wang, A multiview-based parameter free frame-
work for group detection, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017. [41] M. Jaderberg, K. Simonyan, A. Zisserman, et al., Spatial transformer networks, in: Advances in neural information processing systems, 2015, pp. 2017–2025.
[42] D. Chen, G. Hua, F. Wen, J. Sun, Supervised transformer network for efficient face detection, in: ECCV, Springer, 2016, pp. 122–138. [43] Y. Zhong, J. Chen, B. Huang, Toward end-to-end face recognition through alignment learning, IEEE signal processing letters 24 (8) (2017) 1213–1217. [44] W. Wu, M. Kan, X. Liu, Y. Yang, S. Shan, X. Chen, Recursive spatial
transformer (rest) for alignment-free face recognition, in: CVPR, 2017, pp. 3772–3780. [45] C.-H. Lin, S. Lucey, Inverse compositional spatial transformer networks, arXiv preprint arXiv:1612.03897. [46] L. Liu, H. Wang, G. Li, W. Ouyang, L. Lin, Crowd counting using deep
recurrent spatial-aware network, arXiv preprint arXiv:1807.00601. [47] V. Lempitsky, A. Zisserman, Learning to count objects in images, in: Advances in neural information processing systems, 2010, pp. 1324–1332. [48] D. Xie, T. Shu, S. Todorovic, S.-C. Zhu, Modeling and inferring human intents and latent functional objects for trajectory prediction, arXiv preprint
arXiv:1606.07827.
[49] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, S. Savarese, Social lstm: Human trajectory prediction in crowded spaces, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 961–971. doi:10.1109/CVPR.2016.110.
[50] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167. [51] K. Tota, H. Idrees, Counting in dense crowds using deep features (2015). [52] S. An, W. Liu, S. Venkatesh, Face recognition using kernel ridge regression, in: CVPR, 2007, pp. 1–7. doi:10.1109/CVPR.2007.383105.
[53] K. Chen, C. C. Loy, S. Gong, T. Xiang, Feature mining for localised crowd counting, in: In BMVC. [54] A. B. Chan, Z.-S. J. Liang, N. Vasconcelos, Privacy preserving crowd monitoring: Counting people without people models or tracking, in: CVPR, 2008, pp. 1–7.
[55] K. Chen, S. Gong, T. Xiang, C. C. Loy, Cumulative attribute space for age and crowd density estimation, in: CVPR, 2013, pp. 2467–2474. doi: 10.1109/CVPR.2013.319. [56] M.-h. Oh, P. A. Olsen, K. N. Ramamurthy, Crowd counting with decomposed uncertainty, arXiv preprint arXiv:1903.07427.
[57] X. Liu, J. Van De Weijer, A. D. Bagdanov, Exploiting unlabeled data in cnns by self-supervised learning to rank, IEEE transactions on pattern analysis and machine intelligence. [58] V. Pham, T. Kozakaya, O. Yamaguchi, R. Okada, Count forest: Co-voting uncertain number of targets using random forest for crowd density estima-
tion, in: ICCV, 2015, pp. 3253–3261. doi:10.1109/ICCV.2015.372.
[59] M. Xu, Z. Ge, X. Jiang, G. Cui, B. Zhou, C. Xu, et al., Depth information guided crowd counting for complex crowd scenes, Pattern Recognition Letters. [60] W. Liu, M. Salzmann, P. Fua, Estimating people flows to better count them
in crowded scenes, arXiv preprint arXiv:1911.10782. [61] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical image computing and computer-assisted intervention, Springer, 2015, pp. 234–241.
[62] H. Qassim, A. Verma, D. Feinzimer, Compressed residual-vgg16 cnn model for big data places image recognition, in: 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), IEEE, 2018, pp. 169–175.
Yanyan Fang received the B.Sc. degree in Au585
tomatic Test and Control from Harbin Institute of Technology, Harbin, China, in 2005 and the M.Sc. degree in Electronic Engineering from University of Electronic Science and Technology, Chengdu, China, in 2008. She is currently a faculty member at the Department of Electronic Engineering, Fudan University, Shanghai, China. Her research interests include computer vision and deep
learning
Shenghua Gao is an associate professor at ShanghaiTech University, China. He received the B.E. degree from the University of Science and Technology of China in 2008 and received the Ph.D. degree from the Nanyang Technological University in 2012. From Jun 2012 to Aug 2014, he 595
worked as a research scientist in UIUC Advanced Digital Sciences Singapore. His research interests include computer vision and machine learning. He has published more than 60 papers on image and video understanding in many top-tier international conferences and journals. He served as an area chair in ICCV2019, SPC in AAAI2019, IJCAI2020. He also served as the Associate
600
Editor for IEEE Transactions on Circuits and Systems for Video Technology (IF:3.558) and Neurocomputing (IF:3.224).
31
Jing Li received the BSc. degree in Electronic Information Science and Technology from Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, 605
China, in 2018. He is currently pursuing the master’s degree with ShanghaiTech University, Shanghai, China. His current research interest is computer vision and machine learning.
32
Weixin Luo received the bachelor degree from Shenzhen University, Shenzhen, China. He is currently pursuing the Ph.D. degree at 610
ShanghaiTech University. He focuses on analysis of surveillance videos including human detection, anomaly detection, and face anonymization.
Linfang He is currently pursuing a bachelor’s degree in the department of computer science, Fudan University. Her research interest is computer vision and deep learning, and she is eager to contribute more to 615
computer vision in the future.
Bo Hu received the B.S. and Ph.D. degrees in electronic engineering from Fudan University, Shanghai, China, in 1990 and 1996, respectively. He is currently a Professor with the Department of 33
Electronic Engineering, Fudan University. His research interests include digital 620
image processing, digital communication, and digital system design.
Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Author Contributions Section

Yanyan Fang: Conceptualization, Methodology, Software, Writing - Original draft preparation
Shenghua Gao: Data curation, Writing - Reviewing and Editing
Jing Li: Software, Validation
Weixin Luo: Visualization, Investigation
Linfang He: Supervision
Bo Hu: Writing - Reviewing and Editing