Fast crowd density estimation with convolutional neural networks




Engineering Applications of Artificial Intelligence 43 (2015) 81–88


Min Fu a, Pei Xu a, Xudong Li a, Qihe Liu a, Mao Ye a,*, Ce Zhu b

a School of Computer Science and Engineering, Center for Robotics, Key Laboratory for NeuroInformation of Ministry of Education, University of Electronic Science and Technology of China, Chengdu 611731, PR China
b School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu 611731, PR China

* Corresponding author. Tel.: +86 13981910920. E-mail addresses: [email protected] (M. Ye), [email protected] (C. Zhu).


Abstract

Article history: Received 4 September 2013; Received in revised form 31 January 2015; Accepted 9 April 2015.

As an effective means of crowd control and management, crowd density estimation is an important research topic in artificial intelligence applications. Since existing methods can hardly satisfy the accuracy and speed requirements of engineering applications, we propose to estimate crowd density with an optimized convolutional neural network (ConvNet). The contributions are twofold. First, a convolutional neural network is introduced for crowd density estimation, and estimation is significantly accelerated by removing some network connections, based on the observation that similar feature maps exist. Second, a cascade of two ConvNet classifiers is designed, which improves both accuracy and speed. The method is tested on three data sets: PETS_2009, a Subway image sequence and a ground truth image sequence. Experiments confirm the good performance of the method compared with state-of-the-art works on the same data sets. © 2015 Elsevier Ltd. All rights reserved.

Keywords: Crowd density; Convolutional neural network; Neural networks

1. Introduction

Nowadays intelligent surveillance is widely used in fire detection, vehicle identification, crowd management and so on. Crowd density estimation, a crucial part of crowd management, has become increasingly important for preventing stampedes. In general, to estimate crowd density, crowd features are designed first, and a classifier is then trained to discriminate the crowd density from these features. In the past few decades, many efforts have been made to find both better crowd feature descriptions and more efficient classifiers (Davies et al., 1995; Marana et al., 1997, 1998; Xiaohua et al., 2006; Ma et al., 2008, 2010; Su et al., 2010; Zhou et al., 2012). Davies et al. (1995) proposed to estimate crowds using image processing, applying background removal and edge detection for stationary crowd estimation and optical flow computation for estimating crowd motion. Their method works in some practical circumstances, but not when the crowd density is high. To deal with high density crowds, Marana et al. (1997) introduced texture information based on the gray level dependence matrix (GLDM), with the features fed to a self organizing neural network responsible for the crowd density estimation. However, their work achieves only 81.88% accuracy on average.



To enhance the estimation performance, the support vector machine (SVM) was generally adopted (Xiaohua et al., 2006; Ma et al., 2010; Zhou et al., 2012). However, because the features used are not designed for specific practical environments, these SVM based methods also cannot obtain satisfactory results in applications. Meanwhile, the artificial neural network, another popular and efficient approach, has been widely used, because it discriminates features more precisely, much like a human neural network. Recently, the convolutional neural network (ConvNet), which derives from Hubel and Wiesel (1962), has become a research focus again for its three brilliant ideas: local receptive fields, shared weights and spatial sub-sampling. These ensure some degree of shift, scale and distortion invariance (Chen et al., 2010). Besides, a ConvNet can directly process images and learn features through its multiple convolutional layers, whose weights are trained by the back propagation algorithm. For these advantages, ConvNets have been widely applied in different fields such as document recognition (LeCun et al., 1998), traffic sign recognition (Sermanet and LeCun, 2011) and face detection (Garcia and Delakis, 2002; Chen et al., 2010). The standard ConvNet consists of multiple layers. As the layers go up, the resolution goes down, so some details of the image are likely lost because only the feature maps in the last sub-sampling layer are fed to the classifier. To get better shift, scale and distortion invariance, the multi-stage ConvNet of Sermanet and LeCun (2011) and Sermanet et al. (2012) is adopted. Besides, considering practical engineering applications, some important improvements are made for speed and accuracy. The main contribution contains two aspects. (a) The structure of the multi-stage


convolutional network is optimized by merging similar feature maps, so as to decrease the computation cost and accelerate both the training and detection stages. (b) Inspired by the boosting algorithm proposed in Chen et al. (2010), a cascade of two classifiers is designed. The first one picks out samples which are obviously misclassified, and the second one, trained especially on hard samples, reclassifies the rejected samples. With this strategy, accuracy is improved because hard samples are handled more precisely. The rest of this paper is organized as follows: in Section 2, we formally define our problem and introduce the basic concepts of ConvNets. Section 3 gives a detailed description of our improvements on ConvNets, which are specially required for engineering applications. Section 4 shows the experimental results. Finally, conclusions are given in Section 5.

2. Preliminary

2.1. General procedure for crowd estimation

Crowd density estimation is mainly used in public places such as train stations, subway stations and squares. In videos of these places, people often overlap when it is crowded, and sometimes only their heads protrude, so methods that estimate crowd density by detecting individuals and counting them are out of their depth (Davies et al., 1995). Through years of research, a general procedure for crowd estimation has been established; Fig. 1 shows its steps. The first step is to find an effective way to extract crowd features of the different density levels, and the second step is to discriminate these features with a classification model. Based on the observation that textures become more distinctive as the crowd grows, texture analysis (Marana et al., 1998; Ma et al., 2008, 2010) has been chosen many times and shows good performance. So, in this paper, texture is also used, by the convolutional neural network.

Generally, a neural network is a machine learning mechanism constituted of an input layer, multiple hidden layers and an output layer. Each layer is made up of several neuron nodes, and the output of a node in one layer connects to the input of a node in the next layer. Each connection carries a weight which indicates the influence of one node on another. Commonly, every node in one layer connects to every node in the next layer, which is known as full connection. Given the desired outputs, the network can learn the optimal solution of a specific problem by adjusting the weights of the connections (Kim et al., 2012). In this paper, a particular type of neural network is applied: the convolutional neural network, which brings the benefit of feature learning.

2.2. Multi-stage convolutional neural network

Unlike in the general neural network, the neurons of a ConvNet are organized as feature maps in each layer. Benefiting from the convolution operation, images can be used directly as the input from which feature maps are computed. Several feature maps constitute a convolutional layer, as in Fig. 2. A feature map represents one particular kind of feature, obtained by convolving with a different kernel. A kernel is a filter, i.e. a small set of connection weights in a general neural

Fig. 1. The procedure of crowd estimation.

network, and it is initialized with random values. Each convolution result forms one element of a feature map in the next layer, so that the connections between two layers are established. Eq. (1) shows how feature map $S_j$ is computed:

$$S_j = \mathrm{sigmoid}\Big(\sum_i x_i * w_{ij} + b_j\Big), \tag{1}$$

where $*$ denotes the convolution operation and $x_i$ is the $i$th feature map of the previous layer. The matrix $w_{ij}$ contains the convolution weights from the $i$th map of the previous layer to the $j$th map of the following layer, and $b_j$ is the threshold for the $j$th map in the following layer. The sigmoid function is defined as

$$\mathrm{sigmoid}(t) = \frac{1}{1+e^{-t}}. \tag{2}$$
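To make Eq. (1) concrete, here is a minimal NumPy/SciPy sketch under our assumptions ('valid' 2-D convolution and one shared bias per output map); the function and variable names are ours, not from the paper:

```python
import numpy as np
from scipy.signal import convolve2d

def sigmoid(t):
    """Eq. (2): sigmoid(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def feature_map(prev_maps, kernels, b_j):
    """Eq. (1): S_j = sigmoid(sum_i x_i * w_ij + b_j).

    prev_maps: list of 2-D arrays x_i (feature maps of the previous layer)
    kernels:   list of 2-D kernels w_ij, one per previous map
    b_j:       scalar threshold of the jth output map
    """
    acc = sum(convolve2d(x, w, mode="valid") for x, w in zip(prev_maps, kernels))
    return sigmoid(acc + b_j)
```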

Furthermore, to reduce the resolution for shift invariance, sub-sampling is applied to the convolutional layers (Sermanet and LeCun, 2011; Sermanet et al., 2012). The sub-sampling layers, which contain half the rows and columns of the previous layer, are obtained as in Fig. 2. With the further resolution reduction by a second convolution and sub-sampling, a large degree of invariance to spatial transformations is obtained (LeCun et al., 1998). In addition, the connections from the same map share the same weights, which avoids generating too many connection weights. The feature maps in the last sub-sampling layer are used as input to the final classifier, and each unit in these maps is fully connected to every node in the output layer, so the connections of the final classifier contribute most of the weights to be trained. To adjust the weights, the back propagation algorithm, an application of gradient descent, is used: the weights in the network move along the negative gradient of the response (sigmoid) function, which gives the fastest adjustment (Kim et al., 2012).

As in Fig. 2, the improved work is based on the multi-stage ConvNet, which consists of two stages of feature maps. The structure differs slightly from the standard ConvNet in that the sub-sampling layer S1 of the 1st stage is stretched out to serve, together with S2, as input of the full connection (Sermanet and LeCun, 2011; Sermanet et al., 2012). With this processing, more feature details are retained. As for the output, the crowd density is divided into five levels: very low, low, medium, high and very high; in Fig. 2, the levels correspond to 5 nodes in the output layer. The 2nd stage always includes more feature maps. In our work, the first stage contains 5 feature maps, which correspond to 5 kinds of features, and the second stage is initialized to contain 30 feature maps, meaning each map in the first stage generates 6 maps in the second stage. In fact, to compute a feature map in the second stage, all maps from the first stage are involved by adding their convolution results up, so the number of filters in the convolutional layer C2 of the second stage is 150. Besides, for an input image of size 42 by 40, the feature maps in layer S2 have size 8 × 8, so the 30 feature maps of layer S2 alone generate 1920 × 5 weights through full connection, although some of these connections are not necessary. This is a heavy computational burden.
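To make the architecture concrete, here is a minimal PyTorch sketch under our reading of Fig. 2 and the sizes quoted above: sigmoid activations, 2 × 2 average pooling for sub-sampling, and an extra 2 × 2 down-sampling of the S1 branch so that the first stage contributes 5 × 9 × 9 = 405 of the 2325 classifier inputs reported in Section 4.2. The class and variable names are ours, not the authors':

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiStageConvNet(nn.Module):
    """Sketch of the two-stage ConvNet: both S1 and S2 feed the classifier."""
    def __init__(self):
        super().__init__()
        self.c1 = nn.Conv2d(1, 5, kernel_size=(7, 5))    # C1: 1x42x40 -> 5x36x36
        self.c2 = nn.Conv2d(5, 30, kernel_size=3)        # C2: 5x18x18 -> 30x16x16 (150 kernels)
        self.fc = nn.Linear(5 * 9 * 9 + 30 * 8 * 8, 5)   # 405 + 1920 = 2325 inputs, 5 levels

    def forward(self, x):
        s1 = F.avg_pool2d(torch.sigmoid(self.c1(x)), 2)  # S1: 5x18x18
        s2 = F.avg_pool2d(torch.sigmoid(self.c2(s1)), 2) # S2: 30x8x8
        s1_branch = F.avg_pool2d(s1, 2)                  # extra sub-sampling: 5x9x9
        feats = torch.cat([s1_branch.flatten(1), s2.flatten(1)], dim=1)
        return torch.sigmoid(self.fc(feats))             # one response per density level

out = MultiStageConvNet()(torch.randn(1, 1, 42, 40))     # -> shape (1, 5)
```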

3. The improved approach

3.1. Optimizing connections

The multi-stage ConvNet increases the quantity of features in the final classifier, as well as the number of connections. This seriously increases the computation time at both the training and detection stages, which hinders engineering applications. Inspired by the


Fig. 2. The structure of multi-stage convolutional neural network.

observation that there exist redundant connections between similar feature maps, we propose an optimization method, whose efficiency is confirmed in experiments, that removes these extra connections based on a similarity matrix $M(i,j)$ so as to accelerate the network. In our case, the first stage contributes only 1/7 of the feature maps for classification, and its maps vary considerably, with higher resolution than those of the second stage; the optimization therefore focuses on the feature maps of the second stage. The point is thus to find the maps which are very similar, and the criterion is given in Eq. (3). Let $h_k(i,j)$ indicate whether map $i$ and map $j$ are similar for the $k$th sample, let $M_k$ be the similarity matrix containing the similarity values of all feature maps, and let $\sigma$ be a user-set similarity threshold:

$$h_k(i,j) = \begin{cases} 1, & M_k(i,j) \ge \sigma \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$

Once two maps are confirmed to satisfy this condition, one of them is randomly chosen to be deleted together with its connections. Before removal, the weights of the connections from the deleted map are added to those of the kept map; this avoids causing instability and oscillations in the network. After this processing, the simplified network contains only half or fewer of the original connections according to experiments, so the speed is significantly accelerated. Fig. 3 shows the details of the combination of similar feature maps. In the similarity matrix of Fig. 3, the element in the $i$th row and $j$th column exceeds the predefined threshold $\sigma$, which means that maps $A_i$ and $A_j$ are similar; the weights of the connections from map $j$ are added to those from map $i$, and map $j$ together with its connections (the dotted lines in the figure) is deleted. The similarity matrix $M$ in Fig. 3 is the arithmetic average over all training samples,

$$M = \frac{1}{N}\sum_{k=1}^{N} M_k, \tag{4}$$

where $N$ is the total number of samples and $M_k$ is the similarity measure of one sample. $M_k$ is a square matrix and so is $M$; the size of the similarity matrix depends on the number of feature maps in the last layer (with 30 feature maps, for example, the matrix is 30 by 30). Every element of the matrix is calculated as in Eq. (5). Let $A_i$ and $A_j$ be two feature maps; then $M_k(i,j)$ is their cosine similarity,

$$M_k(i,j) = \frac{\mathrm{sum}(A_i \mathbin{.\!*} A_j)}{\|A_i\|\,\|A_j\|} \quad \text{for all } 1 \le i,j \le \dim(M_k), \tag{5}$$

where $.*$ denotes element-wise multiplication of the matrices $A_i$ and $A_j$ (the result is again a matrix), $\|\cdot\|$ is the Euclidean norm of a matrix, $\dim(\cdot)$ is the matrix dimension, and $\mathrm{sum}(\cdot)$ is the sum of all elements of a matrix.

Algorithm 1 shows the detailed steps of the optimization. The optimization is executed once the network is stable, i.e. once the network output error is below a threshold β; this is far from precise classification, but sufficient for singling out the non-contributing connections. After the combination of similar feature maps, the ConvNet is trained further to reach a lower network error. In this way an optimized ConvNet is obtained, and since the connections are decreased, overall training time is saved.

Algorithm 1. The algorithm of connection optimization.

Input: the unoptimized ConvNet
Output: the optimized ConvNet
1: Initialize the ConvNet parameters
2: Train the ConvNet until the network error is below β
3: Calculate the similarity of every two maps in the last layer and obtain the similarity matrix M
4: WHILE there exists M(i, j) > σ
5:   Add the weights related to A_j to the weights of A_i
6:   Remove all connections from A_j
7:   Remove the map A_j
8:   Set the jth row of M to 0
9: ENDWHILE
10: Train the optimized ConvNet
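A compact NumPy sketch of Eqs. (4)-(5) and the merge loop of Algorithm 1, assuming the per-map classifier weights are kept in a dict keyed by map index (a hypothetical layout; all names are ours):

```python
import numpy as np

def similarity_matrix(maps):
    """Eqs. (4)-(5): average cosine similarity between feature maps.
    maps: array of shape (N, J, H, W) -- N samples, J feature maps each."""
    flat = maps.reshape(maps.shape[0], maps.shape[1], -1)
    unit = flat / np.maximum(np.linalg.norm(flat, axis=2, keepdims=True), 1e-12)
    return np.einsum('nid,njd->nij', unit, unit).mean(axis=0)  # Eq. (4)

def optimize_connections(fc_weights, M, sigma):
    """Algorithm 1, steps 4-9: while some M(i, j) > sigma, fold the
    classifier weights of map j into those of map i and drop map j."""
    M = M.copy()
    np.fill_diagonal(M, 0.0)                           # ignore self-similarity
    while True:
        i, j = (int(v) for v in np.unravel_index(np.argmax(M), M.shape))
        if M[i, j] <= sigma:
            break
        fc_weights[i] = fc_weights[i] + fc_weights[j]  # step 5: merge weights
        del fc_weights[j]                              # steps 6-7: remove map j
        M[j, :] = 0.0                                  # step 8 (and the column,
        M[:, j] = 0.0                                  # since M is symmetric)
    return fc_weights
```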

3.2. Cascade classifier

Hard samples have always been a problem for learning. In our work, hard samples are those with very complicated backgrounds which are hard to fit; in their presence, the network sometimes struggles to converge. So picking out the hard samples and training on them individually is a good choice. Inspired by the idea of the boosting algorithm, two optimized multi-stage ConvNets are cascaded into one strong classifier to discriminate sample crowd densities, as shown in Fig. 4.


Fig. 3. Example of feature map combination.

Fig. 4. The boosting training procedure. If a sample is correctly classified and the max value max_val among the members of the output vector exceeds threshold α, the sample goes to the regular sample set; otherwise it goes to the hard sample set. Both sets are then trained individually until the errors of the two classifiers fall below their thresholds.
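A minimal sketch of this splitting rule, assuming classifier 1 has produced an (N, 5) array of output vectors for the training set (function and variable names are ours):

```python
import numpy as np

def split_train_set(outputs, labels, alpha=0.9):
    """Fig. 4 splitting rule: a sample is 'regular' iff classifier 1
    predicts its true class AND the max output component exceeds alpha
    (alpha = 0.9 is the setting reported in Section 4.2).
    outputs: (N, 5) array of classifier-1 responses; labels: (N,) ints."""
    predicted = outputs.argmax(axis=1)
    max_val = outputs.max(axis=1)
    regular = (predicted == labels) & (max_val > alpha)
    return np.where(regular)[0], np.where(~regular)[0]  # regular, hard indices
```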

According to Fig. 4, a hard sample is picked out by classifier 1 under the condition that the max output component is below α or the class label does not match its labeled category. Hard samples whose max output component is below α lie closer to the classification boundary and are thus more likely to interfere with the training. After separating hard samples from regular samples, the two classifiers are trained individually until the network errors are below their thresholds. Experiments show that this training strategy enhances the performance.

For detection, as shown in Fig. 5, the crowd density level of an image is first discriminated by classifier 1. If the max component of its output does not exceed a threshold, the sample is considered a hard sample and is sent to classifier 2 for reclassification. The detailed procedure is as follows. Let $v_i$ be the max value among the members of the output vector of the $i$th classifier in the cascade, $i = 1, 2$, and define the boolean function

$$f(x,\lambda) = \begin{cases} 1, & x > \lambda \\ 0, & \text{otherwise;} \end{cases} \tag{6}$$

the final decision is

$$O = \begin{cases} v_1, & f(v_1,\lambda_1) \ne 0 \\ v_2, & f(v_1,\lambda_1) = 0,\ f(v_2,\lambda_2) \ne 0 \\ \max(v_1, v_2), & \text{otherwise,} \end{cases} \tag{7}$$

where O is the final value whose index in the output layer corresponds to the class to which the input image I_img belongs. By applying this re-identifying detection strategy to the uncertain samples, more precise results are achieved. As both networks in the cascade are optimized, the speed of training and detection is still guaranteed.
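Eqs. (6)-(7) translate into a short decision routine; a sketch assuming the two classifiers are callables returning 5-component output vectors (names are ours; the default λ values follow Section 4.2's settings λ1 = 0.7, λ2 = 0.5):

```python
def cascade_predict(classifier1, classifier2, img, lam1=0.7, lam2=0.5):
    """Cascade detection of Fig. 5 / Eqs. (6)-(7): accept classifier 1
    when its max response clears lambda1, otherwise fall back to
    classifier 2, otherwise take the larger of the two responses."""
    out1 = classifier1(img)
    v1 = out1.max()
    if v1 > lam1:                      # f(v1, lambda1) = 1
        return int(out1.argmax())
    out2 = classifier2(img)            # treated as a hard sample
    v2 = out2.max()
    if v2 > lam2:                      # f(v2, lambda2) = 1
        return int(out2.argmax())
    # neither is confident: take max(v1, v2), the last case of Eq. (7)
    return int(out1.argmax()) if v1 >= v2 else int(out2.argmax())
```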

4. Experiments and comparisons

As there is no standard data set for crowd density estimation (Zhou et al., 2012), the approach is tested on 3 data sets: PETS_2009 (Ferryman and Ellis, 2010), a Subway video (Ma et al., 2008, 2010) and a video of Chunxi_Road in Chengdu.¹ The experimental results are described in tables to demonstrate the validity of the improvements in crowd density estimation.

¹ http://cvlab.uestc.edu.cn/CDE_CNN/Dataset_Chunxi_Road.zip


Fig. 5. The procedure of detection.

Table 1. The range of the number of people for each class.

Data set       Very low   Low     Medium   High    Very high
Chunxi_Road    0–3        3–5     5–10     10–14   ≥ 14
PETS_2009      0–8        9–16    17–24    25–32   ≥ 33
Subway         0–7        7–11    11–21    21–31   ≥ 31

To evaluate the performance of our method, comparisons are made with the works of Kim et al. (2012) and Su et al. (2010) on the same data set.

4.1. Data sets

PETS_2009 and Subway are data sets from other works, and Chunxi_Road is a ground truth video from real life. As the environment of Chunxi_Road is somewhat complicated, with billboards and buildings, the areas of interest are cut out to form the data set. Since the approach is developed to deal with image sequences, the videos are divided into frames and the frames are resized to 42 by 40. People in every image are counted manually, and the images are then divided into 5 density groups by people count according to the standard described in Table 1. Fig. 6 shows examples of the different crowd density levels: the first row is the scene of Chunxi_Road, the second row PETS_2009, and the third row Subway. The Subway scene is quite ideal, as the background is clean without too many objects or complex textures; the Chunxi_Road scene, however, contains stripes on the floor which may disturb texture feature extraction and affect the classification results. As the distance to the camera varies between scenes, the number of people per level changes as well: in PETS_2009 the camera is farther away than in Subway and Chunxi_Road, so the range of the number of people in PETS_2009 is larger than in the other two data sets. The ranges in Table 1 overlap with their adjacent levels, so samples classified into neighboring levels are acceptable. The number of frames of each crowd level for each scene is shown in Table 2: Chunxi_Road contains 1241 frames, and Subway and PETS_2009 contain 663 frames each. About half of the frames are used for training the ConvNet, and the other half are left for testing the proposed approach.

4.2. Experiments and analysis

In our work, the first classifier in the cascade contains 5 and 30 feature maps in the first and second stages respectively, and the second classifier contains 6 and 12 maps in the first and second stages respectively. The first stage is obtained by convolution from the input layer with filters of size 7 × 5, and the second stage by convolution from the first stage outputs with filters of size 3 × 3. The levels of crowd density, 1–5, correspond to the indices of the 5 nodes in the output layer. For example, if the density level is very low, the target output of the network is (1, 0, 0, 0, 0)^T; if the density is medium, it is (0, 0, 1, 0, 0)^T. So the crowd density level of an image is given by the index of the max value in the output vector.
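For completeness, the target encoding and decoding just described, as a tiny sketch (names are ours):

```python
import numpy as np

def target_vector(level, n_levels=5):
    """One-hot target: level 1 (very low) -> (1, 0, 0, 0, 0)^T, etc."""
    t = np.zeros(n_levels)
    t[level - 1] = 1.0
    return t

def predicted_level(output):
    """The density level is the index of the max output component."""
    return int(np.argmax(output)) + 1   # back to the 1..5 scale

assert predicted_level(target_vector(3)) == 3   # medium round-trips
```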

In the experiments, the input image is resized to 42 × 40. Classifier 1, for example, produces 2325 feature values in total: 405 of them are down-sampled from the 5 feature maps of the first stage, and the other 1920 come from the 30 feature maps of layer S2. As the classifier is fully connected, there are at least 2325 × 5 weights to be trained, not counting the connections in the convolution layers. With the optimization of connections, the number of weights is reduced to 597 × 5 in Chunxi_Road, and to 469 × 5 and 725 × 5 in Subway and PETS_2009 respectively; the time consumed is thus 1/3 to 1/4 of before. Table 3 shows the detection speed on all test sets. On the PETS_2009 test set, the standard multi-stage ConvNet takes about 0.016 s per frame, while after deploying our proposed optimization it takes only 0.004 s; the cascade optimized method is about 3 times faster than before. In Subway and Chunxi_Road, the cascade optimized ConvNets also outperform the standard ones in speed.

For the training of the cascade, each classifier is optimized first. The network error β for the optimization is set to 0.2 or 0.3, because most samples except the hard ones are already learned under this setting, and reaching that error rate does not take too much time. The parameter α, which decides whether a sample is hard, is set to 0.9; with this setting, the percentage of hard samples stays under 25%. Besides, the network errors β1 and β2, which correspond to classifier 1 and classifier 2, are set below 0.1; in Chunxi_Road, for example, β1 is 0.04 and β2 is 0.03. For detection, the parameter λ1 in Eq. (7) is set to 0.7 according to experience; correspondingly, λ2 of classifier 2 is set to 0.5, because most hard samples are not easy to distinguish. By this two-step screening, the cascade detection process ensures reliable classification results for most samples.

From Tables 4 to 6, the classification results are mostly concentrated on the main diagonals, which demonstrates the validity of estimating crowd density by ConvNets. These 3 tables show that, in general, the accuracy is not affected by the optimization of feature maps. The cascade optimized method performs well, misclassifying samples only into adjacent classes, which is also reasonable since the exact boundary between two adjacent classes cannot be defined. On the PETS_2009 data set, 21 samples are misclassified by the standard multi-stage ConvNets and the optimized ConvNets, but the cascade optimized ConvNets misclassify only 11 samples (3.96% of the total), all into adjacent classes. From Table 5, the performance of the cascade optimized method is nearly the same as that of the un-cascaded one, but the overall performance is the best: only 12 out of 303 samples (3.96%) are misclassified.


Fig. 6. Examples of the different crowd density levels in Chunxi_Road, PETS_2009 and Subway: (a) very low, (b) low, (c) medium, (d) high and (e) very high.

Table 2. The number of frames per crowd level in the different data sets.

               Train set                          Test set
Data set       VL   L    M    H    VH   Sum      VL   L    M    H    VH   Sum
Chunxi_Road    20   96   292  364  58   830      13   30   162  174  32   411
PETS_2009      50   79   118  55   58   360      42   67   99   46   49   303
Subway         73   61   25   50   50   360      73   61   25   51   50   303

Table 3. Speed on the data sets for different ConvNet structures.

Data set       Multi-stage ConvNet (s/frame)   Optimized ConvNet (s/frame)   Cascade optimized ConvNet (s/frame)
PETS_2009      0.016                           0.004                         0.006
Subway         0.025                           0.006                         0.004
Chunxi_Road    0.031                           0.006                         0.011

In Table 6, it can be observed that the cascade optimized method performs better, judging by the results on the diagonal and its neighbors: 55 out of 411 samples are misclassified, i.e. 13%. The classification of Chunxi_Road is not as good as on the other two data sets. After analyzing the results, 3 main reasons emerge. (1) The textures of the floor are extracted as classifier input together with the other features; Fig. 7 shows images of the 1st stage feature maps for this scene, in which the floor texture can be seen clearly. (2) The number of people in this scene varies little during the video, so dividing the training samples into different levels becomes harder, as the counts of adjacent crowd levels are very close. Fig. 8 shows such uncertain samples: a medium level picture misclassified as low level looks similar to a low level picture misclassified as medium level.

Table 4. Results on PETS_2009 by the standard multi-stage ConvNets, optimized ConvNets and cascade optimized ConvNets.

Methods                      Density level   Very low (%)   Low (%)   Medium (%)   High (%)   Very high (%)
The multi-stage ConvNets     Very low        95.3           4.7       0            0          0
                             Low             7.4            88.1      4.5          0          0
                             Medium          0              3         96           1          0
                             High            0              0         6.5          91.3       2.2
                             Very high       0              0         2.1          4.1        93.8
Optimized ConvNets           Very low        95.3           4.7       0            0          0
                             Low             7.5            86.6      5.9          0          0
                             Medium          0              2         97           1          0
                             High            0              0         6.5          91.3       2.2
                             Very high       0              0         2            4          94
Cascade optimized ConvNets   Very low        90.5           9.5       0            0          0
                             Low             0              100       0            0          0
                             Medium          0              1.1       98.9         0          0
                             High            0              0         8.7          91.3       0
                             Very high       0              0         0            4.1        95.9

This is because there is an overlap between adjacent levels, and the number of people in the two samples lies on the border of the classification rules; so, when separating the training samples, it is hard to clearly tell which level a sample should belong to. (3) The unbalanced numbers of samples in the different classes may cause under-training for some levels and further affect the accuracy of the networks.

4.3. Comparisons with previous works

Kim et al. (2012) estimated crowd density on the PETS_2009 data set from moving areas and contrast information, i.e. the sum of the normalized gray level dependence matrix; the features are then trained by a back propagation neural network. In our work, texture features are automatically extracted by learning.


Table 5. Results on Subway by the standard multi-stage ConvNets, optimized ConvNets and cascade optimized ConvNets.

Methods                      Density level   Very low (%)   Low (%)   Medium (%)   High (%)   Very high (%)
The multi-stage ConvNets     Very low        100            0         0            0          0
                             Low             11.5           88.5      0            0          0
                             Medium          0              4         92           4          0
                             High            0              0         5.9          94.1       0
                             Very high       0              0         0            0          100
Optimized ConvNets           Very low        100            0         0            0          0
                             Low             11.5           88.5      0            0          0
                             Medium          0              4         96           0          0
                             High            0              0         5.9          92.2       1.9
                             Very high       0              0         0            0          100
Cascade optimized ConvNets   Very low        100            0         0            0          0
                             Low             11.5           88.5      0            0          0
                             Medium          0              4         88           8          0
                             High            0              0         3.9          94.2       1.9
                             Very high       0              0         0            0          100

Fig. 7. Example of feature maps from Chunxi_Road.

Table 6. Results on Chunxi_Road by the standard multi-stage ConvNets, optimized ConvNets and cascade optimized ConvNets.

Methods                      Density level   Very low (%)   Low (%)   Medium (%)   High (%)   Very high (%)
The multi-stage ConvNets     Very low        92.3           0         7.7          0          0
                             Low             0              67        33           0          0
                             Medium          0              4.3       90.1         5.6        0
                             High            0              0         6.9          93.1       0
                             Very high       0              0         0            28.1       71.9
Optimized ConvNets           Very low        0              15.4      84.6         0          0
                             Low             0              67        33           0          0
                             Medium          0              3.7       82.1         14.2       0
                             High            0              0         6.3          93.7       0
                             Very high       0              0         0            28.1       71.9
Cascade optimized ConvNets   Very low        100            0         0            0          0
                             Low             0              63.3      36.7         0          0
                             Medium          0              3.1       83.9         12.9       0
                             High            0              0         5.2          94.8       0
                             Very high       0              0         0            28.1       71.9

They are then classified by cascade optimized ConvNets, which are also based on the spirit of back propagation. Table 7 shows the comparison between the two methods. The cascade optimized ConvNets classify low level samples with 100% accuracy, and the correct rate of the other levels is above 90%, whereas the highest accuracy of their GLDM based approach is only 91.8%, and its performance keeps decreasing as the crowd density grows. Furthermore, we also compare with the work of Su et al. (2010), which applies a sparse spatio-temporal local binary pattern (SST-LBP) for texture description and an SVM for classification; to make the results comparable, the class levels are adjusted by uniting the very low and low levels. Fig. 9 shows the estimation results of Kim's BP network, Hang Su's SVM method and the cascade optimized ConvNet.

The reasons for the good performance are also analyzed. In Kim's approach, moving areas and the sum of the normalized gray level dependence matrix serve as features. As the moving areas are obtained by optical flow, they are sometimes ineffective when the crowd does not move, so that only the sum of the normalized gray level dependence matrix works and only one kind of feature is input into the BP network.

Fig. 8. Example of misclassified samples for Chunxi_Road. The column is the real crowd level of an image, and the row is the crowd level that the image is misclassified to.

Table 7. Contrast with previous work on PETS_2009.

Methods                      Density level   VL (%)   L (%)   M (%)   H (%)   VH (%)   Correct rate (%)
Former work                  VL              91.8     8.2     0       0       0        91.8
                             L               8.4      81.2    10.4    0       0        81.2
                             M               0        9.8     83.1    7.1     0        83.1
                             H               0        0       14.7    71.5    13.8     71.5
                             VH              0        0       0       13.4    86.6     86.6
Cascade optimized ConvNets   VL              90.5     9.5     0       0       0        90.5
                             L               0        100     0       0       0        100
                             M               0        1       99      0       0        99
                             H               0        0       1.4     98.6    0        98.6
                             VH              0        0       0       4.1     95.9     95.9

In our work, the approach is designed to process single images, so it adapts to both moving and stationary crowds. Additionally, every element of the feature maps is sent to the classifier, so the result is not affected too much if one or two features fail. In Hang Su's work, the SVM classifiers are grouped one by one (Su et al., 2010) to fit the multi-class crowd estimation problem.



Fig. 9. Crowd density estimation result of three different methods.

But grouping the classifiers is not an easy job. In our work, since the 5 levels constitute one output layer, the interaction between the levels has been learned through sample training, so the cascade optimized ConvNet provides better results.

5. Conclusions

In this paper, we proposed a real-time approach to crowd density estimation with a cascade optimized convolutional neural network. The proposed method is based on the multi-stage ConvNet. Following the key observation that some connections in the network are unnecessary, similar feature maps of the 2nd stage are removed together with their connections. This simplification yields a lighter network and significantly reduces the computational burden. Additionally, the boosting ideology is applied to training, and a cascade classifier of two optimized ConvNets is constructed, so that samples which are hard to classify are sent to the second ConvNet classifier for the final determination. Experimental results show a best average accuracy of 96.8%, which confirms that the cascade optimized ConvNet can be used in practical engineering applications. We are now modifying the network structure so that the exact number of people can be predicted instead of the density level.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (61375038), the 973 National Basic Research Program of China (2010CB732501) and the Fundamental Research Funds for the Central Universities (ZYGX2012YB028 and ZYGX2011X015).

References

Chen, Y., Han, C., Wang, C., Jeng, B., Fan, K., 2015. A CNN-based face detector with a simple feature map and a coarse-to-fine classifier. IEEE Trans. Pattern Anal. Mach. Intell., in press. http://dx.doi.org/10.1109/TPAMI.2007.70798.
Davies, A.C., Yin, J.H., Velastin, S.A., 1995. Crowd monitoring using image processing. Electron. Commun. Eng. J. 7 (1), 37–47.
Ferryman, J., Ellis, A., 2010. PETS2010: dataset and challenge. In: 2010 Seventh IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), IEEE, pp. 143–150.
Garcia, C., Delakis, M., 2002. A neural architecture for fast and robust face detection. In: Proceedings of the 16th International Conference on Pattern Recognition, IEEE, pp. 44–47.
Hubel, D.H., Wiesel, T.N., 1962. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. 160 (1), 106.
Kim, G.A., TaekiKim, Moonhyun, 2012. Estimation of crowd density in public areas based on neural network. KSII Trans. Internet Inf. Syst. 6 (9), 2170–2190.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86 (11), 2278–2324.
Ma, W., Huang, L., Liu, C., 2008. Crowd estimation using multi-scale local texture analysis and confidence-based soft classification. In: Second International Symposium on Intelligent Information Technology Application (IITA'08), IEEE, pp. 142–146.
Ma, W., Huang, L., Liu, C., 2010. Crowd density analysis using co-occurrence texture features. In: 2010 5th International Conference on Computer Sciences and Convergence Information Technology (ICCIT), IEEE, pp. 170–175.
Marana, A., Velastin, S., Costa, L., Lotufo, R., 1997. Estimation of crowd density using image processing. In: IEE Colloquium on Image Processing for Security Applications (Digest No.: 1997/074), IET, pp. 11/11-11/18.
Marana, A., Velastin, S.A., Costa, L.d.F., Lotufo, R., 1998. Automatic estimation of crowd density using texture. Saf. Sci. 28 (3), 165–175.
Sermanet, P., Chintala, S., LeCun, Y., 2012. Convolutional neural networks applied to house numbers digit classification. In: 2012 21st International Conference on Pattern Recognition (ICPR), IEEE, pp. 3288–3291.
Sermanet, P., LeCun, Y., 2011. Traffic sign recognition with multi-scale convolutional networks. In: 2011 International Joint Conference on Neural Networks (IJCNN), IEEE, pp. 2809–2813.
Su, H., Yang, H., Zheng, S., 2010. The large-scale crowd density estimation based on effective region feature extraction method. In: Computer Vision - ACCV 2010, Springer, pp. 302–313.
Xiaohua, L., Lansun, S., Huanqin, L., 2006. Estimation of crowd density based on wavelet and support vector machine. Trans. Inst. Meas. Control 28 (3), 299–308.
Zhou, B., Zhang, F., Peng, L., 2012. Higher-order SVD analysis for crowd density estimation. Comput. Vis. Image Understand. 116 (9), 1014–1021.