Learning for an aesthetic model for estimating the traffic state in the traffic video


Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Contents lists available at ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Learning for an aesthetic model for estimating the traffic state in the traffic video Xingmin Shi n, Zhenyu Shan, Na Zhao Intelligent Transportation and Information Security Lab, Hangzhou Normal University, Hangzhou, Zhejiang, China

Article info

Abstract

Article history: Received 13 March 2015 Received in revised form 30 July 2015 Accepted 27 August 2015

With the increasing number of vehicles on urban roads, traffic jams have become much more serious. Properly estimating the traffic jam level from traffic videos is essential for transportation management departments and drivers. Currently, most solutions for estimating the traffic state from videos are built on evaluating the traffic flow by counting the running vehicles per time unit or detecting their moving speed. However, the main challenge of these solutions lies in the vehicle tracking method, in which the vehicles must be effectively and integrally segmented from the scenes, so the solutions must trade off the accuracy of the estimation results against the efficiency of the method. In this paper, we propose a learning-based aesthetic model to estimate the traffic state in videos. The model uses multiple video-based perceptual features of the traffic state to train a random forest classifier with labeled data, and estimates the traffic state by data classification. Evaluation experiments conducted on a testing image set show that the traffic state estimation accuracy of the proposed model is higher than 98% and that the method runs in real time. © 2015 Elsevier B.V. All rights reserved.

Keywords: Aesthetic model; Traffic state; Random forest; Perceptual feature

1. Introduction

Owing to the rapid increase in the number of vehicles in recent years, urban traffic jams have become a serious problem for transportation management departments and drivers. Many researchers are pursuing effective methods of traffic state estimation to deliver accurate and real-time information about traffic jams. This traffic information is essential for traffic management, movement control, and driving guidance, which are main research directions in intelligent transportation systems (ITS).

In the past few decades, a great deal of research has been conducted on traffic state estimation. Most of it is built on traffic flow estimation by counting the running vehicles per time unit or detecting their moving speed. The traffic flow can be evaluated with buried inductive loop detectors [1], as in the Sydney Coordinated Adaptive Traffic System (SCATS). In this system, the detector receives a stimulus when a vehicle passes over the loop. However, the loop device buried under the road surface is easily damaged and inconvenient to maintain.

Recently, another type of detector, the surveillance camera, has been widely used to evaluate traffic flow [2]. The video data

Corresponding author. E-mail address: [email protected] (X. Shi).

provide more information than the inductive loop detector because the surveillance camera can capture more features. In video-based methods, besides counting the number of vehicles, the estimated speed is also used to evaluate the traffic flow [3]. However, speed estimation greatly depends on vehicle feature extraction, which is seriously affected by changing conditions in videos. Moreover, multi-source traffic data, such as loop detector data, GPS data, and video data, can be fused to improve the accuracy of traffic state estimation [4].

General video-based methods exploit the segmentation of moving vehicles, either by frame differencing or background subtraction. Unfortunately, both approaches depend heavily on the effectiveness and efficiency of the vehicle segmentation method to obtain accurate estimation results [5]. The methods often fail when the illumination or environmental conditions change greatly. Moreover, vehicle segmentation is difficult to implement due to the influence of moving or environmental objects, such as pedestrians and blowing leaves.

In order to overcome the limitations of general video-based methods, we propose to explore the traffic scene from a perceptual view and build a learning-based aesthetic model with specific features extracted from the videos for classifying the traffic state effectively. The vehicle corners and their spatial distribution features are used to construct the feature vector in this work, and the random forest classifier is utilized for perception training and traffic state classification. In summary, the main contributions of

http://dx.doi.org/10.1016/j.neucom.2015.08.099
0925-2312/© 2015 Elsevier B.V. All rights reserved.

Please cite this article as: X. Shi, et al., Learning for an aesthetic model for estimating the traffic state in the traffic video, Neurocomputing (2015), http://dx.doi.org/10.1016/j.neucom.2015.08.099i


this work include: (1) the vehicle corners and their spatial distribution feature are extracted from a perceptual aspect; (2) the proposed learning-based aesthetic model can effectively estimate the traffic state without a vehicle segmentation and tracking process; and (3) the implementation of the proposed method is efficient and can be used in real-time applications.

The rest of this paper is organized as follows. Section 2 reviews the relevant methods proposed in the latest literature on aesthetic evaluation and traffic flow estimation. Section 3 fully illustrates the modules of the proposed method. The experimental results and analysis are elucidated in Section 4. Finally, the conclusion is drawn and future work is outlined in Section 5.

2. Related work

As aesthetic evaluation and video-based traffic state estimation are the two main aspects related to our work, their related work is reviewed as follows.

2.1. Aesthetic approaches for image evaluation

Aesthetic evaluation is widely used in estimating the aesthetic quality, color harmony, beauty, or professional level of a photo or an image. To measure the aesthetic quality of an image, a bags-of-color-patterns method was proposed in [6]. In this work, the aesthetic quality of a photo is classified by taking color harmony into consideration. The color harmony model is built on computing the harmony score of local regions of a photograph instead of utilizing the global statistical information of colors. Generic image descriptors are another method utilized in [7] to evaluate the aesthetic quality of images. In this method, generic content-based local features, such as bag-of-visual-words (BOV), Fisher vectors (FV), and GIST descriptors, are introduced because local-level patch-based information can be aggregated into an integrated global representation of images. Fusion of various features can achieve better performance in aesthetic assessment. The effectiveness of aesthetic assessment with various features, including low-level features, mid-level semantic features, and type descriptors, is evaluated in [8]. In this method, the training dataset is built by collecting free images from public image sharing websites, such as Flickr and DPChallenge, without any manual annotation. In contrast to using a large number of features for better understanding the aesthetics of an image, other researchers select seven features for their top-down approach and improve the accuracy significantly [9].
In [10], the relationship between visual textures and aesthetic perception properties is extracted, and a layered prediction model is proposed to predict the aesthetic content of a given texture, which is defined as a vector of computational features including low-level texture and color features. For evaluating aesthetic features and classifying images, the Cellet, which connects spatially adjacent cells within the same pyramid level, can be utilized to represent the object [11]. The Graphlet is proposed to describe object-level and spatial-level cues and forms an effective saliency descriptor [12], and the experimental results show that the active graphlet path is more indicative of photo aesthetics [13]. Moreover, structural cues are discovered and exploited in [14], and a new summarization technique is proposed in [15], in which the method enforces video stability and preserves aesthetically pleasing frames.

Size also has some impact on aesthetic evaluation. In [16], the researchers investigated the influence of size on aesthetic perception and found that both the resolution and the physical dimensions can affect the appreciation of viewers. A series of

regression models is proposed for predicting the aesthetic level of an image at a given size. Furthermore, the essential features related to the size-dependent property of image aesthetics are fully analyzed. In addition, machine learning methods are utilized to deal with the aesthetic quality estimation of an image [17]. The visual contents are extracted from the images, and support vector machine and tree classifiers are used for classification, seeking to explore the relationship between the perceptual feeling of a person and the low-level content. Moreover, some universal metrics have been adopted to evaluate aesthetics in the game of chess [18].

2.2. Video-based traffic state estimation

For estimating the traffic state in traffic videos, virtual-loop-based methods and vehicle tracking methods are extensively used to count the vehicles. Virtual loops are defined or designated before object tracking is performed. When a moving object enters and then exits the region of a virtual loop, the counter is incremented, thus making vehicle counting possible. User-defined virtual loops are utilized to detect and count vehicles in [19]. In this work, the foreground mask is produced by the Gaussian Mixture Model (GMM) and Motion Energy Images (MEI) method. Particle grouping is utilized to sub-sample video frames according to their spatial and temporal coherence, as well as motion coherence. Such particles are clustered with the k-means algorithm, and their motion patterns and spatial information are taken into consideration in the clustering. Vehicle tracking is performed on the clusters corresponding to vehicles by evaluating the similarity of color histograms. An extended Kalman filter is introduced into real-time freeway traffic state estimation in [20]. This work is intended to pursue a general solution for real-time adaptive traffic state estimation in freeway networks.
The solution is based on a stochastic macroscopic traffic flow model with extended Kalman filtering. The model parameters and the traffic flow variables are jointly estimated. As a result, prior calibration is not necessary, and the solution is adaptive to various scenarios and can trigger incident alarms. A complete system for analyzing the behavior of vehicles is proposed by Nicolas et al. in [21]. To improve the results, scene characteristics and predefined traffic rules are employed in this work. The solution includes three steps. The first step is scene modeling, in which the scene structure and the traffic rules are automatically obtained. The second step is tracking multiple objects, whose trajectories are evaluated. The final step is evaluating the behavior of the vehicles. This method can efficiently detect and estimate the behavior of vehicles, but the predefined rules and geometry constraints make it more complex and inconvenient in application. In order to estimate the speed of the traffic flow, the road space occupancy, and the traffic state, a traffic flow detection tool is implemented in [22].

3. Proposed method

3.1. Overview

The proposed method mainly consists of four modules: data acquisition, initial setup, feature extraction, and traffic state training and classification. The data acquisition module acquires the video data from traffic surveillance cameras. Initial setup mainly concerns the lane setting. In practice, the traffic flow in a specific direction should be estimated independently. Generally, the view of the camera covers several two-way lanes. As a result, the lane region should be designated before image analysis in order to mitigate the effect of the adjacent


lanes in the opposite direction. The feature extraction module extracts the aesthetic features from the images and uses them to build the feature vectors for traffic state training. Finally, the feature vectors are fed into the training and classification module for training the random forest classifier and classifying the traffic states. We specify three main modules for initial setup, feature extraction, and traffic state training and classification, as illustrated in Fig. 1.

3.2. Initial setup

Generally, there are several lanes for vehicles running in two directions, as shown in Fig. 2. For traffic flow estimation, we usually focus on the flow in the lanes of a specific direction or of both directions. As a result, the lane region of interest should be manually configured prior to evaluation in order to mitigate the effect of surrounding objects. As illustrated in Fig. 2(a), the region marked with the red rectangle is used to estimate the bidirectional traffic flow, while the region configured in Fig. 2(b) is intended for one-way traffic flow estimation. In the proposed method, the specified region is used as the region of interest (ROI) of the image, and feature extraction is carried out within this ROI.
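As a minimal sketch of this setup step (assuming frames stored as nested lists of pixel values and a rectangular ROI given as (x, y, w, h); an illustration, not the authors' implementation):

```python
def apply_roi(frame, roi):
    """Crop the manually configured lane region from a frame so that
    feature extraction only sees pixels inside the ROI."""
    x, y, w, h = roi
    return [row[x:x + w] for row in frame[y:y + h]]
```

For example, apply_roi(frame, (1, 2, 3, 2)) keeps a 3-pixel-wide, 2-pixel-high patch whose top-left corner is at column 1, row 2.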

Fig. 1. The diagram of the proposed method.


3.3. Feature selection

Since the proposed method estimates the traffic state from the aesthetic appearance of the road, the feature extraction module focuses on selecting aesthetic features related to perceptual views of the traffic status. Moreover, the selected features should be sensitive to the scene changes caused by different levels of traffic flow, and should be computable from the traffic surveillance video efficiently.

(1) Corner feature: The first feature is the quantity of vehicle corners. A corner can be defined as the intersection of two edges, or as a point at which there are two dominant and different edge directions in a local neighborhood of the point. When more vehicles run on the road, more corners can be detected due to the distinctive profiles of the vehicles. As shown in Fig. 3, many more vehicles are running in Fig. 3(a) than in Fig. 3(c). As a result, compared with the detected corners of the image shown in Fig. 3(d), more corners are introduced by the increasing number of vehicles, as shown in Fig. 3(b).

The Moravec and Harris corner detection algorithms are two classical methods for efficient corner detection [23]. In the Moravec algorithm, a corner is identified as a point with low self-similarity, assessed by taking the sum of squared differences (SSD) between two patches: as the SSD decreases, the similarity increases. The main problem lies in its anisotropy, i.e., if an edge is present but not aligned with the directions of the neighbors, the smallest SSD will be large and the edge will be incorrectly chosen as an interest point. Harris improved the Moravec algorithm by considering the variation of the corner score with respect to direction directly [24]. However, the stability of the Harris corner detector depends on the parameter k, which is an empirical value and can vary over a large range.
In our model, we adopt the FAST (Features from Accelerated Segment Test) corner detector [25], a currently popular corner detection method, for its high computational performance, repeatability, and capability of real-time processing. The FAST feature detector consists of three steps:

Step 1: The segment test. The segment test is conducted on the pixels of a fixed-radius circle and eliminates many non-candidate points.

Step 2: Corner detection based on classification. In this step, a decision tree classifier is utilized to determine whether a

Fig. 2. The configuration of lane setting.


Fig. 3. The changes of corner distribution caused by the number of vehicles.

Fig. 4. The spatial distribution of corners.

candidate point has the feature of a corner according to 16 features, where the state of each feature is −1, 0, or 1.

Step 3: Corner features are verified by applying non-maximum suppression.

(2) Spatial distribution of corners: The quantity of detected corners is a global characteristic of the ROI. On its own, it is not distinctive enough for traffic state estimation. To understand the traffic scene accurately, the spatial distribution of corners should reveal the local changes of the scenario.
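For illustration, the segment test of Step 1 can be sketched in pure Python; the threshold t and the contiguity requirement n = 12 are hypothetical parameter choices here, and the decision-tree classification of Step 2 and the non-maximum suppression of Step 3 are omitted:

```python
# Offsets of the 16-pixel Bresenham circle of radius 3 used by FAST.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def segment_test(img, x, y, t=50, n=12):
    """True if at least n contiguous circle pixels are all brighter than
    p + t or all darker than p - t, where p is the candidate pixel."""
    p = img[y][x]
    states = [1 if img[y + dy][x + dx] > p + t
              else (-1 if img[y + dy][x + dx] < p - t else 0)
              for dx, dy in CIRCLE]
    doubled = states + states  # handle wrap-around of the circle
    for s in (1, -1):
        run = 0
        for v in doubled:
            run = run + 1 if v == s else 0
            if run >= n:
                return True
    return False

def fast_corners(img, t=50, n=12):
    """Scan all pixels far enough from the border and keep the candidates
    that pass the segment test."""
    h, w = len(img), len(img[0])
    return [(x, y) for y in range(3, h - 3) for x in range(3, w - 3)
            if segment_test(img, x, y, t, n)]
```

On a uniform image this yields no corners, while an isolated bright pixel on a dark background passes the test because all 16 circle pixels are darker than it.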

In our model, we divide the ROI into 16 blocks of the same width and height. The corners in each block are summed up as one value of the feature vector; these feature values represent the distribution status of each block and can significantly indicate the variations caused by vehicle movements. As shown in Fig. 4, we specify the region of the one-way lanes for traffic state estimation, and the region is divided into 16 blocks for the spatial corner distribution evaluation. Accordingly, 16 values, each of which corresponds to the number of corners in one block, are obtained and constitute 16 features of the feature vector.
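The resulting 17-dimensional feature vector (the global corner count plus the 16 block counts) can be sketched as follows; the 4 × 4 grid layout and (x, y) corner coordinates are assumptions for illustration, not the authors' exact implementation:

```python
def corner_feature_vector(corners, roi_w, roi_h, grid=4):
    """Build the perceptual feature vector: the total corner count (global
    feature) followed by the per-block corner counts of a grid x grid
    division of the ROI (16 blocks for grid=4)."""
    blocks = [0] * (grid * grid)
    for x, y in corners:
        bx = min(x * grid // roi_w, grid - 1)  # block column of the corner
        by = min(y * grid // roi_h, grid - 1)  # block row of the corner
        blocks[by * grid + bx] += 1
    return [len(corners)] + blocks
```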


Fig. 5. Five traffic scenes used in the experiments.

As a result, the feature vector consists of two perceptual features: a global feature, which is the total quantity of vehicle corners, and the spatial corner distribution feature, which consists of 16 feature values. This feature vector is fed into the classifier of the aesthetic model for traffic state training and classification.

Table 1
The size of training image sets and testing image sets in five scenes.

No.   Size of training image set   Size of testing image set
1     3464                         3465
2     3495                         3496
3     3360                         3568
4     3052                         3065
5     3213                         3218

3.4. Training and classification

In view of the large number of images employed in classifier training, we train the random forest (RF) classifier with the feature vectors extracted from the labeled training image set. Apart from its apparent speed advantage over other discriminative methods, we choose an RF-based method for the following reasons. First, the leaves of an RF contain valuable locality information about the feature space, which is suitable for clustering. Second, an RF maximizes the classification margin in an empirical sense.

An RF consists of multiple decision trees, $F = \{t_1, t_2, \ldots, t_N\}$. Each tree of the RF is trained independently, and each internal node of the RF partitions the data space. A classification function is learned by constructing the trees in the training stage. Based on the splitting function

\[ X_{Rl} = \{\, x \in X_R \mid t_R(x) < 0 \,\}, \qquad X_{Rr} = X_R \setminus X_{Rl}, \tag{1} \]

the training data $X_R$ in an internal node $R$ is split into left and right subsets $X_{Rl}$ and $X_{Rr}$, where $t_R(\cdot)$ is the test function of node $R$ for splitting, usually defined in the oblique linear partition form

\[ t_R(x) = W_R \cdot x - \theta_R, \tag{2} \]

where $W_R$ is a weight vector and $\theta_R$ is a threshold. In a tree, if two samples fall in the same leaf node, the signs of their return values are identical for all test functions along the path from the root to the leaf node. Hence the locality feature can be well preserved by treating the samples in the same leaf node as a cluster.

In the testing stage, for a given test case $x_0$, the RF estimates the probability of each class as

\[ p(k \mid x_0) = \frac{1}{N} \sum_{i=1}^{N} p_i(k \mid x_0), \tag{3} \]

where $p_i(k \mid x_0)$ is the probability of class $k$ estimated by the $i$th tree. It is computed as the ratio of votes that class $k$ receives from the leaf in the $i$th tree:

\[ p_i(k \mid x_0) = \frac{\sum_{x_{hj} \in X,\; x_{hj} \in l_i(x_0)} [\, y_{hj} = k \,]}{\lvert l_i(x_0) \rvert}, \tag{4} \]

where $l_i(x_0)$ is the leaf node that $x_0$ belongs to in the $i$th tree. The overall decision function of the RF is defined as

\[ G(x_0) = \arg\max_{k \in Y} G_k(x_0), \tag{5} \]


Fig. 6. The varying quantity of vehicle corners in traffic scenes.

Fig. 7. The varying quantity of vehicle corners in four local blocks.

where

\[ G_k(x_0) = p(k \mid x_0) \tag{6} \]

is the probability estimated by the RF.
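In effect, Eqs. (3) and (5) average the per-tree class probabilities and take the arg max. A minimal sketch, assuming the per-tree estimates $p_i(k \mid x_0)$ of Eq. (4) have already been computed:

```python
def rf_predict(per_tree_probs):
    """Combine per-tree class-probability estimates: average them over the
    N trees (Eq. (3)) and return the arg max class (Eq. (5)) together with
    the averaged distribution."""
    n_trees = len(per_tree_probs)
    n_classes = len(per_tree_probs[0])
    avg = [sum(tree[k] for tree in per_tree_probs) / n_trees
           for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: avg[k]), avg
```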

4. Experimental results and analysis

4.1. Experimental setup

The experiments are conducted to evaluate the effectiveness and adaptability of the proposed method. The images are obtained from videos captured by cameras installed on posts at one side of road intersections. The image resolution of the video is 640 × 480. All the experiments are executed on a laptop with an Intel Core i7-3667U CPU at 2.0 GHz.

The experiments are designed to evaluate (1) the relationship between the quantity of vehicle corners and the traffic status; (2) the effectiveness and accuracy of traffic state estimation; and (3) the computing efficiency of the execution. The training and testing sets are constructed with color images randomly selected from videos captured by traffic surveillance cameras installed at road intersections. We use five training image sets and five testing image sets in the experiments to evaluate the effectiveness of the proposed method. The images are selected from the five distinct scenes shown in Fig. 5. The sizes of these image sets are listed in Table 1.

4.2. Results and analysis

In order to evaluate the performance of the proposed method, the accuracy of traffic state classification is evaluated based on training sets acquired from cameras installed at


different locations with diverse illumination conditions and context settings. Furthermore, the computing efficiency of model training and traffic state classification is evaluated for the random forest (RF) and SVM classifiers.

(1) Relationship between the quantity of vehicle corners and the traffic status: As shown in Fig. 6, the quantity of corners detected in the images of the traffic video varies over time as different numbers of vehicles run through the view field of the surveillance camera. Joint analysis of the number of corners and the video content shows that more vehicles introduce more corners. On the trend line in Fig. 6, the period in which the number of corners exceeds 600 over consecutive frames corresponds to a time of serious traffic jam. A decrease in the quantity of vehicle corners can be observed from the 2400th frame, which indicates that the vehicle density decreases correspondingly as vehicles gradually move out of the given region. In the same way, when more vehicles move into the region, the quantity of vehicle corners increases correspondingly; such a trend can be observed in Fig. 6 from the 3800th frame to the 4400th frame.

In addition, we use 16 blocks to describe the local spatial distribution of vehicle corners. As shown in Fig. 7, four trend lines correspond to four blocks, R1, R2, R3, and R4, which result from the equal division of the lane region and demonstrate distinctive distribution patterns. The experimental results in Fig. 7 indicate that the varying quantity of vehicle corners in these four blocks is a differentiable feature which can be used for evaluating the regional vehicle distribution perceptually.

(2) Effectiveness and accuracy: For classifier training, the images in the training set are randomly selected from videos of diverse scenes and marked with the corresponding traffic states. As shown in Table 1, the first training set has 3464 images for model training, and the corresponding testing set consists of 3465 color images. We use these two image sets to assess the accuracy of the classifier as the number of training images changes. As shown in Table 2 and Fig. 8, when the sizes of the training image set are 3460 and 2860, the accuracies of traffic state classification with the random forest and SVM are both higher than 98%, and even higher than 99% for the random forest classifier with the complete training set. These results indicate the effectiveness of the proposed method for traffic state estimation. In addition, the accuracy of traffic state estimation by the random forest classifier becomes higher when more images are employed in training with a fixed-size testing image set; the same trend holds for the SVM classifier. The experimental results demonstrate that more training images cover more corner distribution patterns for effective classification. Moreover, the accuracy obtained by the random forest is slightly higher than that of the SVM classifier.

Furthermore, we conduct a five-fold cross validation to verify the effectiveness of the proposed method. For each test, we split the data into five parts randomly. One of them is selected randomly as the testing set, and the other four are used for classifier training. For each dataset, we repeat the same experiment five times, and the classification accuracies

Table 2
The accuracies of traffic state estimation by using different sizes of the training image set.

No.   Training set size   Testing set size   Random forest (%)   SVM (%)
1     3460                3465               99.82               98.53
2     2860                3465               98.98               98.42
3     2360                3465               97.75               97.58
4     2050                3465               97.75               97.53
5     1860                3465               76.94               75.25

Table 3
Training time and classification time of the random forest and SVM classifiers by using different sizes of the training image set.

No.   Training set size   Testing set size   Training time, RF (ms)   Training time, SVM (ms)   Classification time, RF (ms)   Classification time, SVM (ms)
1     3460                3465               278                      4020                      15                             15
2     2860                3465               188                      984                       15                             15
3     2360                3465               171                      3412                      15                             15
4     2050                3465               125                      275                       15                             15
5     1860                3465               119                      739                       15                             15

Fig. 8. The accuracies of traffic state estimation by using different sizes of the training image set.


Fig. 9. Training time of the random forest and SVM classifiers by using different sizes of the training image set.

are 99.13%, 98.70%, 99.21%, 99.42%, and 99.71%, respectively, with an average accuracy of 99.23%. Such results indicate that the proposed method works well on these randomly selected training and testing sets.

(3) Computing efficiency: Computing efficiency is another metric evaluated in the experiments. We compare the training time and classification time of the random forest classifier with those of the SVM. As illustrated in Table 3, the classification times consumed by the random forest classifier and the SVM are the same, while the training time of the random forest classifier is much less than that of the SVM, as presented in Fig. 9. The reason lies in the fact that the SVM is a simple and robust method for small amounts of sample data but unsuitable for large-scale datasets, because it solves a quadratic program to find the support vectors. The random forest classifier, in contrast, handles large-scale datasets well, and therefore its training time is much lower and more stable.
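The five-fold protocol described above can be sketched as follows (a generic illustration of the random splitting, not the authors' exact code):

```python
import random

def five_fold_splits(items, seed=0):
    """Randomly partition the items into five folds; each fold serves once
    as the testing set while the remaining four form the training set."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]
    for i in range(5):
        test = [items[j] for j in folds[i]]
        train = [items[j] for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Reporting the mean accuracy over the five resulting (train, test) pairs gives the cross-validated estimate quoted above.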

5. Conclusion and future work

As the number of vehicles changes with the traffic state, the number of vehicle corners introduced by the distinctive profiles of the vehicles changes correspondingly, and the spatial distribution pattern of the vehicle corners is also related to the traffic state. These two perceptual features are utilized to build an aesthetic model for traffic state estimation from videos. The random forest classifier is utilized for training the model and classifying the traffic state. The experimental results indicate that the proposed model is promising for traffic state estimation, with low time consumption, low computational complexity, and high accuracy. In the future, we will conduct further research on automated tools for related traffic state estimation applications, and we will explore more perceptual features that can be used in the aesthetic model to reduce the classification error of traffic state estimation.

Acknowledgments

This research is supported in part by the following funds: the National Natural Science Foundation of China under Grant numbers 61472113 and 61304188, and the Zhejiang Provincial Natural Science Foundation of China under Grant numbers LZ13F020004 and LR14F020003.

References

[1] S.Y. Cheung, S. Coleri, B. Dundar, S. Ganesh, C.-W. Tan, P. Varaiya, Traffic measurement and vehicle classification with single magnetic sensor, Transp. Res. Rec. 1917 (2005) 173–181.
[2] X. Li, Y. She, D. Luo, Z. Yua, A traffic state detection tool for freeway video surveillance system, Procedia – Soc. Behav. Sci. 96 (2013) 2453–2461.
[3] Y. Xia, X. Li, Z. Shan, Parallelized fusion on multi-sensor transportation data: a case study in CyberITS, Int. J. Intell. Syst. 28 (2013) 540–564.
[4] Y. Xia, W. Xu, L. Zhang, X. Shi, K. Mao, Integrating 3D structure into traffic scene understanding with RGB-D data, Neurocomputing 151 (2015) 700–709.
[5] Y. Xia, X. Shi, G. Song, Q. Geng, Y. Liu, Towards improving quality of video-based vehicle counting method for traffic flow estimation, Signal Process., http://dx.doi.org/10.1016/j.sigpro.2014.10.035.
[6] M. Nishiyama, T. Okabe, I. Sato, Y. Sato, Aesthetic quality classification of photographs based on color harmony, in: Computer Vision and Pattern Recognition – CVPR, 2011, pp. 33–40.
[7] L. Marchesotti, F. Perronnin, D. Larlus, G. Csurka, Assessing the aesthetic quality of photographs using generic image descriptors, in: International Conference on Computer Vision – ICCV, 2011, pp. 1784–1791.
[8] Y. Wang, Q. Dai, R. Feng, Y.-G. Jiang, Beauty is here: evaluating aesthetics in videos using multimodal features and free training data, in: Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 369–372.
[9] S.S. Khan, D. Vogel, Evaluating visual aesthetics in photographic portraiture, in: Proceedings of the Eighth Annual Symposium on Computational Aesthetics in Graphics, Visualization, and Imaging, 2012, pp. 55–62.
[10] S. Thumfart, R.H.A.H. Jacobs, E. Lughofer, C. Eitzinger, F.W. Cornelissen, W. Groissboeck, R. Richter, Modeling human aesthetic perception of visual textures, ACM Trans. Appl. Percept. 8 (2011) 1–29.
[11] L. Zhang, Y. Gao, Y. Xia, Q. Dai, X. Li, A fine-grained image categorization system by cellet-encoded spatial pyramid modeling, IEEE Trans. Ind. Electron. 62 (2015) 564–571.
[12] L. Zhang, Y. Xia, R. Ji, X. Li, Spatial-aware object-level saliency prediction by learning graphlet hierarchies, IEEE Trans. Ind. Electron. 62 (2015) 1301–1308.
[13] L. Zhang, Y. Gao, R. Ji, Y. Xia, Q. Dai, X. Li, Actively learning human gaze shifting paths for semantics-aware photo cropping, IEEE Trans. Image Process. 23 (2014) 2235–2245.
[14] L. Zhang, Y. Gao, Y. Xia, K. Lu, J. Shen, R. Ji, Representative discovery of structure cues for weakly-supervised image segmentation, IEEE Trans. Multimed. 16 (2014) 470–479.
[15] L. Zhang, Y. Xia, K. Mao, H. Ma, Z. Shan, An effective video summarization framework toward handheld devices, IEEE Trans. Ind. Electron. 62 (2015) 1309–1316.
[16] W.-T. Chu, Y.-K. Chen, K.-T. Chen, Size does matter: how image size affects aesthetic perception?, in: Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 53–62.
[17] R. Datta, D. Joshi, J. Li, J.Z. Wang, Studying aesthetics in photographic images using a computational approach, in: European Conference on Computer Vision – ECCV, 2006, pp. 288–301.
[18] A. Iqbal, The relevance of universal metrics in relation to human aesthetic perception, in: International Conference on Intelligent and Advanced Systems – ICIAS, 2010, pp. 1–6.
[19] P. Barcellos, C. Bouvi, F.L. Escouto, J. Scharcanski, A novel video based system for detecting and counting vehicles at user-defined virtual loops, Expert Syst. Appl. 42 (2015) 1845–1856.
[20] Y. Wang, M. Papageorgiou, A. Messmer, Real-time freeway traffic state estimation based on extended Kalman filter: adaptive capabilities and real data testing, Transp. Res. Part A 42 (2008) 1340–1358.
[21] H. Nicolas, M. Brulin, Video traffic analysis using scene and vehicle models, Signal Process.: Image Commun. 29 (2014) 807–830.

Please cite this article as: X. Shi, et al., Learning for an aesthetic model for estimating the traffic state in the traffic video, Neurocomputing (2015), http://dx.doi.org/10.1016/j.neucom.2015.08.099i

[22] X. Li, Y. She, D. Luo, Z. Yu, A traffic state detection tool for freeway video surveillance system, Procedia – Soc. Behav. Sci. 96 (2013) 2453–2461.
[23] P.K. Mital, T.J. Smith, R.L. Hill, J.M. Henderson, Clustering of gaze during dynamic scene viewing is predicted by motion, Cogn. Comput. 3 (2011) 5–24.
[24] C. Harris, M. Stephens, A combined corner and edge detector, in: 4th Alvey Vision Conference, 1988, pp. 147–151.
[25] E. Rosten, T. Drummond, Fusing points and lines for high performance tracking, in: International Conference on Computer Vision – ICCV, 2005, pp. 1508–1515.

Na Zhao is an undergraduate student at the Institute of Service Engineering, Hangzhou Normal University. Her research interests include image processing algorithms and their applications in transportation systems.

Xingmin Shi received his M.S. degree from the School of Computer Science, Beijing Institute of Technology, in 2004. He is currently a Ph.D. student at the College of Information Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang, PR China, and an associate researcher at Hangzhou Normal University, Hangzhou, Zhejiang, PR China. His research interests include data mining, image and video processing, and their applications in intelligent traffic systems.

Zhenyu Shan received the B.S. degree in computer science and technology from Zhejiang University, Hangzhou, in 2004, and the Ph.D. degree in computer science and technology from Zhejiang University, Hangzhou, in 2010. He is currently a lecturer in the Intelligent Transportation and Information Security Lab at Hangzhou Normal University. His research interests include intelligent transportation systems, big data, and visual semantic understanding.
