Learning for an aesthetic model for estimating the traffic state in the traffic video


Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Contents lists available at ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Learning for an aesthetic model for estimating the traffic state in the traffic video Xingmin Shi n, Zhenyu Shan, Na Zhao Intelligent Transportation and Information Security Lab, Hangzhou Normal University, Hangzhou, Zhejiang, China

Article info

Abstract

Article history: Received 13 March 2015 Received in revised form 30 July 2015 Accepted 27 August 2015

With the increasing number of vehicles on urban roads, traffic jams have become much more serious. Properly estimating the traffic jam level from traffic videos is essential for transportation management departments and drivers. Currently, most solutions for estimating the traffic state from videos are built on evaluating the traffic flow by counting the running vehicles per time unit or detecting their moving speed. However, the main challenge of these solutions lies in the vehicle tracking method, in which the vehicles must be effectively and integrally segmented from the scenes, so the solutions must trade off the accuracy of the estimation results against the efficiency of the method. In this paper, we propose a learning-based aesthetic model to estimate the traffic state in videos. The model uses multiple video-based perceptual features of the traffic state to train a random forest classifier with labeled data, and estimates the traffic state by data classification. Evaluation experiments conducted on a testing image set show that the traffic state estimation accuracy of the proposed model is higher than 98% and that the method runs in real time. © 2015 Elsevier B.V. All rights reserved.

Keywords: Aesthetic model; Traffic state; Random forest; Perceptual feature

1. Introduction

Owing to the rapid increase in the number of vehicles in recent years, urban traffic jams have become a serious problem for transportation management departments and drivers. Many researchers are pursuing effective methods of traffic state estimation to deliver accurate and real-time information about traffic jams. This traffic information is essential for traffic management, movement control, and driving guidance, which are main research directions in intelligent transportation systems (ITS).

In the past few decades, a great deal of research has been conducted on traffic state estimation. Most of it is built on traffic flow estimation by counting the running vehicles per time unit or detecting their moving speed. The traffic flow can be evaluated with buried inductive loop detectors [1], as in the Sydney Coordinated Adaptive Traffic System (SCATS). In this system, the detector receives a stimulus when a vehicle passes over the loop. However, the loop device buried under the road surface is easily damaged and inconvenient to maintain.

Recently, another type of detector, the surveillance camera, has been widely used to evaluate traffic flow [2]. The video data

Corresponding author. E-mail address: [email protected] (X. Shi).

provide more information than the inductive loop detector because the surveillance camera can capture more features. In video-based methods, besides counting the number of vehicles, the estimated speed is also used to evaluate the traffic flow [3]. However, speed estimation greatly depends on vehicle feature extraction, which is seriously affected by changing conditions in videos. Moreover, multi-source traffic data, such as loop detector data, GPS data, and video data, can be fused to improve the accuracy of traffic state estimation [4].

General video-based methods exploit the segmentation of moving vehicles, either by frame differencing or background subtraction. Unfortunately, both approaches depend heavily on the effectiveness and efficiency of the vehicle segmentation method to obtain accurate estimation results [5]. The methods often fail when the illumination or environmental conditions change greatly. Moreover, vehicle segmentation is difficult to implement due to the influence of moving or environmental objects, such as pedestrians and blowing leaves.

In order to overcome the limitations of general video-based methods, we propose to explore the traffic scene from a perceptual view and build a learning-based aesthetic model with specific features extracted from the videos for classifying the traffic state effectively. The vehicle corners and their spatial distribution features are used to construct the feature vector in this work, and the random forest classifier is utilized for perception training and traffic state classification. In summary, the main contributions of

http://dx.doi.org/10.1016/j.neucom.2015.08.099
0925-2312/© 2015 Elsevier B.V. All rights reserved.

Please cite this article as: X. Shi, et al., Learning for an aesthetic model for estimating the traffic state in the traffic video, Neurocomputing (2015), http://dx.doi.org/10.1016/j.neucom.2015.08.099i


this work include: (1) the vehicle corners and their spatial distribution feature are extracted from a perceptual aspect; (2) the proposed learning-based aesthetic model can effectively estimate the traffic state without a vehicle segmentation and tracking process; and (3) the implementation of the proposed method is efficient and can be used in real-time applications.

The rest of this paper is organized as follows. Section 2 reviews the relevant methods proposed in the latest literature on aesthetic evaluation and traffic flow estimation. Section 3 fully illustrates the modules of the proposed method. The experimental results and analysis are elucidated in Section 4. Finally, the conclusion is drawn and future work is outlined in Section 5.

2. Related work

As aesthetic evaluation and video-based traffic state estimation are the two main aspects related to our work, their related work is reviewed as follows.

2.1. Aesthetic approaches for image evaluation

Aesthetic evaluation is widely used in estimating the aesthetic quality, color harmony, beauty, or professional level of a photo or an image. To measure the aesthetic quality of an image, a bags-of-color-patterns method was proposed in [6]. In this work, the aesthetic quality of a photo is classified by taking color harmony into consideration. The color harmony model is built on computing the harmony score of local regions of a photograph instead of utilizing the global statistical information of colors. Generic image descriptors are another method utilized in [7] to evaluate the aesthetic quality of images. In this method, generic content-based local features, such as bag-of-visual-words (BOV), Fisher vectors (FV), and GIST descriptors, are introduced because local-level patch-based information can be aggregated into an integrated global representation of images. Fusion of various features can achieve better performance in aesthetic assessment. The effectiveness of aesthetic assessment with various features, including low-level features, mid-level semantic features, and type descriptors, is evaluated in [8]. In this method, the training dataset is built by collecting free images from public image sharing websites, such as Flickr and DPChallenge, without any manual annotation. In contrast to using a large number of features for better understanding the aesthetics of an image, other researchers select seven features for their top-down approach and improve the accuracy significantly [9].
In [10], the relationship between visual textures and aesthetic perception properties is extracted, and a layered prediction model is proposed to predict the aesthetic content of a given texture, which is defined as a vector of computational features including low-level texture and color features. For evaluating aesthetic features and classifying images, the Cellet, which connects spatially adjacent cells within the same pyramid level, can be utilized to represent the object [11]. The Graphlet is proposed to describe object-level and spatial-level cues and forms an effective saliency descriptor [12], and the experimental results show that the active graphlet path is more indicative of photo aesthetics [13]. Moreover, structural cues are discovered and exploited in [14], and a new summarization technique is proposed in [15], in which the method enforces video stability and preserves aesthetically pleasing frames.

Size also has some impact on aesthetic evaluation. In [16], the researchers investigated the influence of size on aesthetic perception and found that both the resolution and the physical dimensions can affect the appreciation of viewers. A series of

regression models is proposed for predicting the aesthetic level of an image at a given size. Furthermore, the essential features related to the size-dependent property of image aesthetics are fully analyzed. In addition, machine learning methods are utilized to deal with the aesthetic quality estimation of an image [17]. The visual contents are extracted from the images, and support vector machine and tree classifiers are used for classification, seeking to explore the relationship between the perceptual feeling of a person and the low-level content. Moreover, some universal metrics have been adopted to evaluate aesthetics in the game of chess [18].

2.2. Video-based traffic state estimation

For estimating the traffic state in traffic videos, virtual-loop-based methods and vehicle tracking methods are extensively used to count the vehicles. Virtual loops are defined or designated before object tracking is performed. When a moving object enters and then exits the region of a virtual loop, the counter is incremented, thus making vehicle counting possible. User-defined virtual loops are utilized to detect and count vehicles in [19]. In this work, the foreground mask is produced by the Gaussian Mixture Model (GMM) and Motion Energy Images (MEI) method. Particle grouping is utilized to sub-sample video frames according to their spatial and temporal coherence, as well as motion coherence. Such particles are clustered with the k-means algorithm, and their motion patterns and spatial information are taken into consideration in the clustering. Vehicle tracking is performed on the clusters corresponding to vehicles by evaluating the similarity of color histograms. An extended Kalman filter is introduced into real-time freeway traffic state estimation in [20]. This work is intended to pursue a general solution for real-time adaptive traffic state estimation in freeway networks.
The solution is based on a stochastic macroscopic traffic flow model with extended Kalman filtering. The model parameters and the traffic flow variables are jointly estimated. As a result, prior calibration is not necessary, and the solution is adaptive to various scenarios and can trigger incident alarms. A complete system for analyzing the behavior of vehicles is proposed by Nicolas et al. in [21]. To improve the results, scene characteristics and predefined traffic rules are employed in this work. The solution includes three steps. The first step is scene modeling, in which the scene structure and the traffic rules are automatically obtained. The second step is tracking multiple objects, whose trajectories are evaluated. The final step is evaluating the behavior of the vehicles. This method can efficiently detect and estimate the behavior of vehicles, but the predefined rules and geometry constraints make it more complex and inconvenient in application. In order to estimate the speed of the traffic flow, the road space occupancy, and the traffic state, a traffic flow detection tool is implemented in [22].

3. Proposed method

3.1. Overview

The proposed method mainly consists of four modules: data acquisition, initial setup, feature extraction, and traffic state training and classification. The data acquisition module acquires the video data from traffic surveillance cameras. Initial setup mainly concerns the lane setting. In practice, the traffic flow in a specific direction should be estimated independently. Generally, the view of the camera covers several two-way lanes. As a result, the lane region should be designated before image analysis in order to mitigate the effect of the adjacent


lanes in the opposite direction. The feature extraction module extracts the aesthetic features from the images and uses them to build the feature vectors for traffic state training. Finally, the feature vectors are fed into the training and classification module for training the random forest classifier and classifying the traffic states. We specify three main modules for initial setup, feature extraction, and traffic state training and classification, as illustrated in Fig. 1.

3.2. Initial setup

Generally, there are several lanes for vehicles running in two directions, as shown in Fig. 2. For traffic flow estimation, we usually focus on the flow in the lanes of a specific direction or of both directions. As a result, the lane region of interest should be manually configured prior to evaluation in order to mitigate the effect of surrounding objects. As illustrated in Fig. 2(a), the region marked with the red rectangle is used to estimate the bidirectional traffic flow, while the region configured in Fig. 2(b) is intended for one-way traffic flow estimation. In the proposed method, the specified region is used as the region of interest (ROI) of the image, and feature extraction is carried out within this ROI.
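As a minimal sketch of this setup step (assuming frames stored as nested lists of pixel values and a rectangular ROI given as (x, y, w, h); an illustration, not the authors' implementation):

```python
def apply_roi(frame, roi):
    """Crop the manually configured lane region from a frame so that
    feature extraction only sees pixels inside the ROI."""
    x, y, w, h = roi
    return [row[x:x + w] for row in frame[y:y + h]]
```

For example, apply_roi(frame, (1, 2, 3, 2)) keeps a 3-pixel-wide, 2-pixel-high patch whose top-left corner is at column 1, row 2.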

Fig. 1. The diagram of the proposed method.


3.3. Feature selection

Since the proposed method estimates the traffic state from the aesthetic appearance of the road, the feature extraction module focuses on selecting aesthetic features related to perceptual views of the traffic status. Moreover, the selected features should be sensitive to the scene changes caused by different levels of traffic flow, and should be computable from the traffic surveillance video efficiently.

(1) Corner feature: The first feature is the quantity of vehicle corners. A corner can be defined as the intersection of two edges, or as a point at which there are two dominant and different edge directions in a local neighborhood of the point. When more vehicles run on the road, more corners can be detected due to the distinctive profiles of the vehicles. As shown in Fig. 3, many more vehicles are running in Fig. 3(a) than in Fig. 3(c). As a result, compared with the detected corners of the image shown in Fig. 3(d), more corners are introduced by the increasing number of vehicles, as shown in Fig. 3(b).

The Moravec and Harris corner detection algorithms are two classical methods for efficient corner detection [23]. In the Moravec algorithm, a corner is identified as a point with low self-similarity, assessed by taking the sum of squared differences (SSD) between two patches: as the SSD decreases, the similarity increases. The main problem lies in its anisotropy, i.e., if an edge is present but not aligned with the directions of the neighbors, the smallest SSD will be large and the edge will be incorrectly chosen as an interest point. Harris improved the Moravec algorithm by considering the variation of the corner score with respect to direction directly [24]. However, the stability of the Harris corner detector depends on the parameter k, which is an empirical value and can vary over a large range.
In our model, we adopt the FAST (Features from Accelerated Segment Test) corner detector [25], a currently popular corner detection method, for its high computational performance, repeatability, and capability of real-time processing. The FAST feature detector consists of three steps:

Step 1: The segment test. The segment test is conducted on the pixels of a fixed-radius circle and eliminates many non-candidate points.

Step 2: Corner detection based on classification. In this step, a decision tree classifier is utilized to determine whether a

Fig. 2. The configuration of lane setting.


Fig. 3. The changes of corner distribution caused by the number of vehicles.

Fig. 4. The spatial distribution of corners.

candidate point has the feature of a corner according to 16 features, where the state of each feature is −1, 0, or 1.

Step 3: Corner features are verified by applying non-maximum suppression.

(2) Spatial distribution of corners: The quantity of detected corners is a global characteristic of the ROI. On its own, it is not distinctive enough for traffic state estimation. To understand the traffic scene accurately, the spatial distribution of corners should reveal the local changes of the scenario.
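For illustration, the segment test of Step 1 can be sketched in pure Python; the threshold t and the contiguity requirement n = 12 are hypothetical parameter choices here, and the decision-tree classification of Step 2 and the non-maximum suppression of Step 3 are omitted:

```python
# Offsets of the 16-pixel Bresenham circle of radius 3 used by FAST.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def segment_test(img, x, y, t=50, n=12):
    """True if at least n contiguous circle pixels are all brighter than
    p + t or all darker than p - t, where p is the candidate pixel."""
    p = img[y][x]
    states = [1 if img[y + dy][x + dx] > p + t
              else (-1 if img[y + dy][x + dx] < p - t else 0)
              for dx, dy in CIRCLE]
    doubled = states + states  # handle wrap-around of the circle
    for s in (1, -1):
        run = 0
        for v in doubled:
            run = run + 1 if v == s else 0
            if run >= n:
                return True
    return False

def fast_corners(img, t=50, n=12):
    """Scan all pixels far enough from the border and keep the candidates
    that pass the segment test."""
    h, w = len(img), len(img[0])
    return [(x, y) for y in range(3, h - 3) for x in range(3, w - 3)
            if segment_test(img, x, y, t, n)]
```

On a uniform image this yields no corners, while an isolated bright pixel on a dark background passes the test because all 16 circle pixels are darker than it.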

In our model, we divide the ROI into 16 blocks of the same width and height. The corners in each block are summed up as one value of the feature vector; these feature values represent the distribution status of each block and can significantly indicate the variations caused by vehicle movements. As shown in Fig. 4, we specify the region of the one-way lanes for traffic state estimation, and the region is divided into 16 blocks for the spatial corner distribution evaluation. Accordingly, 16 values, each of which corresponds to the number of corners in one block, are obtained and constitute 16 features of the feature vector.
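The resulting 17-dimensional feature vector (the global corner count plus the 16 block counts) can be sketched as follows; the 4 × 4 grid layout and (x, y) corner coordinates are assumptions for illustration, not the authors' exact implementation:

```python
def corner_feature_vector(corners, roi_w, roi_h, grid=4):
    """Build the perceptual feature vector: the total corner count (global
    feature) followed by the per-block corner counts of a grid x grid
    division of the ROI (16 blocks for grid=4)."""
    blocks = [0] * (grid * grid)
    for x, y in corners:
        bx = min(x * grid // roi_w, grid - 1)  # block column of the corner
        by = min(y * grid // roi_h, grid - 1)  # block row of the corner
        blocks[by * grid + bx] += 1
    return [len(corners)] + blocks
```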


Fig. 5. Five traffic scenes used in the experiments.

As a result, the feature vector consists of two perceptual features: a global feature, which is the total quantity of vehicle corners, and the spatial corner distribution feature, which consists of 16 feature values. This feature vector is fed into the classifier of the aesthetic model for traffic state training and classification.

Table 1
The size of training image sets and testing image sets in five scenes.

No.   Size of training image set   Size of testing image set
1     3464                         3465
2     3495                         3496
3     3360                         3568
4     3052                         3065
5     3213                         3218

3.4. Training and classification

In view of the large number of images employed in classifier training, we train the random forest (RF) classifier with the feature vectors extracted from the labeled training image set. Apart from its apparent speed advantage over other discriminative methods, we choose an RF-based method for the following reasons. First, the leaves of an RF contain valuable locality information about the feature space, which is suitable for clustering. Second, an RF maximizes the classification margin in an empirical sense.

An RF consists of multiple decision trees, $F = \{t_1, t_2, \ldots, t_N\}$. Each tree of the RF is trained independently, and each internal node of the RF partitions the data space. A classification function is learned by constructing the trees in the training stage. Based on the splitting function

\[ X_{Rl} = \{\, x \in X_R \mid t_R(x) < 0 \,\}, \qquad X_{Rr} = X_R \setminus X_{Rl}, \tag{1} \]

the training data $X_R$ in an internal node $R$ is split into left and right subsets $X_{Rl}$ and $X_{Rr}$, where $t_R(\cdot)$ is the test function of node $R$ for splitting, usually defined in the oblique linear partition form

\[ t_R(x) = W_R \cdot x - \theta_R, \tag{2} \]

where $W_R$ is a weight vector and $\theta_R$ is a threshold. In a tree, if two samples fall in the same leaf node, the signs of their return values are identical for all test functions along the path from the root to the leaf node. Hence the locality feature can be well preserved by treating the samples in the same leaf node as a cluster.

In the testing stage, for a given test case $x_0$, the RF estimates the probability of each class as

\[ p(k \mid x_0) = \frac{1}{N} \sum_{i=1}^{N} p_i(k \mid x_0), \tag{3} \]

where $p_i(k \mid x_0)$ is the probability of class $k$ estimated by the $i$th tree. It is computed as the ratio of votes that class $k$ receives from the leaf in the $i$th tree:

\[ p_i(k \mid x_0) = \frac{\sum_{x_{hj} \in X,\; x_{hj} \in l_i(x_0)} [\, y_{hj} = k \,]}{\lvert l_i(x_0) \rvert}, \tag{4} \]

where $l_i(x_0)$ is the leaf node that $x_0$ belongs to in the $i$th tree. The overall decision function of the RF is defined as

\[ G(x_0) = \arg\max_{k \in Y} G_k(x_0), \tag{5} \]


Fig. 6. The varying quantity of vehicle corners in traffic scenes.

Fig. 7. The varying quantity of vehicle corners in four local blocks.

where

\[ G_k(x_0) = p(k \mid x_0) \tag{6} \]

is the probability estimated by the RF.
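In effect, Eqs. (3) and (5) average the per-tree class probabilities and take the arg max. A minimal sketch, assuming the per-tree estimates $p_i(k \mid x_0)$ of Eq. (4) have already been computed:

```python
def rf_predict(per_tree_probs):
    """Combine per-tree class-probability estimates: average them over the
    N trees (Eq. (3)) and return the arg max class (Eq. (5)) together with
    the averaged distribution."""
    n_trees = len(per_tree_probs)
    n_classes = len(per_tree_probs[0])
    avg = [sum(tree[k] for tree in per_tree_probs) / n_trees
           for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: avg[k]), avg
```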

4. Experimental results and analysis

4.1. Experimental setup

The experiments are conducted to evaluate the effectiveness and adaptability of the proposed method. The images are obtained from videos captured by cameras installed on posts at one side of road intersections. The image resolution of the video is 640 × 480. All the experiments are executed on a laptop with an Intel Core i7-3667U CPU at 2.0 GHz.

The experiments are designed to evaluate (1) the relationship between the quantity of vehicle corners and the traffic status; (2) the effectiveness and accuracy of traffic state estimation; and (3) the computing efficiency of the execution. The training and testing sets are constructed with color images randomly selected from videos captured by traffic surveillance cameras installed at road intersections. We use five training image sets and five testing image sets in the experiments to evaluate the effectiveness of the proposed method. The images are selected from the five distinct scenes shown in Fig. 5. The sizes of these image sets are listed in Table 1.

4.2. Results and analysis

In order to evaluate the performance of the proposed method, the accuracy of traffic state classification is evaluated based on training sets acquired from cameras installed at


different locations with diverse illumination conditions and context settings. Furthermore, the computing efficiency of model training and traffic state classification is evaluated for the random forest (RF) and SVM classifiers.

(1) Relationship between the quantity of vehicle corners and the traffic status: As shown in Fig. 6, the quantity of corners detected in the images of the traffic video varies over time as different numbers of vehicles run through the view field of the surveillance camera. Joint analysis of the number of corners and the video content shows that more vehicles introduce more corners. On the trend line in Fig. 6, the period in which the number of corners exceeds 600 over consecutive frames corresponds to a time of serious traffic jam. A decrease in the quantity of vehicle corners can be observed from the 2400th frame, which indicates that the vehicle density decreases correspondingly as vehicles gradually move out of the given region. In the same way, when more vehicles move into the region, the quantity of vehicle corners increases correspondingly; such a trend can be observed in Fig. 6 from the 3800th frame to the 4400th frame.

In addition, we use 16 blocks to describe the local spatial distribution of vehicle corners. As shown in Fig. 7, four trend lines correspond to four blocks, R1, R2, R3, and R4, which result from the equal division of the lane region and demonstrate distinctive distribution patterns. The experimental results in Fig. 7 indicate that the varying quantity of vehicle corners in these four blocks is a differentiable feature which can be used for evaluating the regional vehicle distribution perceptually.

(2) Effectiveness and accuracy: For classifier training, the images in the training set are randomly selected from videos of diverse scenes and marked with the corresponding traffic states. As shown in Table 1, the first training set has 3464 images for model training, and the corresponding testing set consists of 3465 color images. We use these two image sets to assess the accuracy of the classifier as the number of training images changes. As shown in Table 2 and Fig. 8, when the sizes of the training image set are 3460 and 2860, the accuracies of traffic state classification with the random forest and SVM are both higher than 98%, and even higher than 99% for the random forest classifier with the complete training set. These results indicate the effectiveness of the proposed method for traffic state estimation. In addition, the accuracy of traffic state estimation by the random forest classifier becomes higher when more images are employed in training with a fixed-size testing image set; the same trend holds for the SVM classifier. The experimental results demonstrate that more training images cover more corner distribution patterns for effective classification. Moreover, the accuracy obtained by the random forest is slightly higher than that of the SVM classifier.

Furthermore, we conduct a five-fold cross validation to verify the effectiveness of the proposed method. For each test, we split the data into five parts randomly. One of them is selected randomly as the testing set, and the other four are used for classifier training. For each dataset, we repeat the same experiment five times, and the classification accuracies

Table 2
The accuracies of traffic state estimation by using different sizes of the training image set.

No.   Training set size   Testing set size   Random forest (%)   SVM (%)
1     3460                3465               99.82               98.53
2     2860                3465               98.98               98.42
3     2360                3465               97.75               97.58
4     2050                3465               97.75               97.53
5     1860                3465               76.94               75.25

Table 3
Training time and classification time of the random forest and SVM classifiers by using different sizes of the training image set.

No.   Training set size   Testing set size   Training time, RF (ms)   Training time, SVM (ms)   Classification time, RF (ms)   Classification time, SVM (ms)
1     3460                3465               278                      4020                      15                             15
2     2860                3465               188                      984                       15                             15
3     2360                3465               171                      3412                      15                             15
4     2050                3465               125                      275                       15                             15
5     1860                3465               119                      739                       15                             15

Fig. 8. The accuracies of traffic state estimation by using different sizes of the training image set.


Fig. 9. Training time of the random forest and SVM classifiers by using different sizes of the training image set.

are 99.13%, 98.70%, 99.21%, 99.42%, and 99.71%, respectively, with an average accuracy of 99.23%. Such results indicate that the proposed method works well on these randomly selected training and testing sets.

(3) Computing efficiency: Computing efficiency is another metric evaluated in the experiments. We compare the training time and classification time of the random forest classifier with those of the SVM. As illustrated in Table 3, the classification times consumed by the random forest classifier and the SVM are the same, while the training time of the random forest classifier is much less than that of the SVM, as presented in Fig. 9. The reason lies in the fact that the SVM is a simple and robust method for small amounts of sample data but unsuitable for large-scale datasets, because it solves a quadratic program to find the support vectors. The random forest classifier, in contrast, handles large-scale datasets well, and therefore its training time is much lower and more stable.
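The five-fold protocol described above can be sketched as follows (a generic illustration of the random splitting, not the authors' exact code):

```python
import random

def five_fold_splits(items, seed=0):
    """Randomly partition the items into five folds; each fold serves once
    as the testing set while the remaining four form the training set."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]
    for i in range(5):
        test = [items[j] for j in folds[i]]
        train = [items[j] for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Reporting the mean accuracy over the five resulting (train, test) pairs gives the cross-validated estimate quoted above.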

5. Conclusion and future work

As the number of vehicles changes with the traffic state, the number of vehicle corners introduced by the distinctive profiles of the vehicles changes correspondingly, and the spatial distribution pattern of the vehicle corners is also related to the traffic state. These two perceptual features are utilized to build an aesthetic model for traffic state estimation from videos. The random forest classifier is utilized for training the model and classifying the traffic state. The experimental results indicate that the proposed model is promising for traffic state estimation, with low time consumption, low computational complexity, and high accuracy. In the future, we will conduct further research on automated tools for related traffic state estimation applications, and we will explore more perceptual features that can be used in the aesthetic model to reduce the classification error of traffic state estimation.

Acknowledgments

This research is supported in part by the following funds: the National Natural Science Foundation of China under Grant numbers 61472113 and 61304188, and the Zhejiang Provincial Natural Science Foundation of China under Grant numbers LZ13F020004 and LR14F020003.

References

[1] S.Y. Cheung, S. Coleri, B. Dundar, S. Ganesh, C.-W. Tan, P. Varaiya, Traffic measurement and vehicle classification with single magnetic sensor, Transp. Res. Rec. 1917 (2005) 173–181.
[2] X. Li, Y. She, D. Luo, Z. Yua, A traffic state detection tool for freeway video surveillance system, Procedia – Soc. Behav. Sci. 96 (2013) 2453–2461.
[3] Y. Xia, X. Li, Z. Shan, Parallelized fusion on multi-sensor transportation data: a case study in CyberITS, Int. J. Intell. Syst. 28 (2013) 540–564.
[4] Y. Xia, W. Xu, L. Zhang, X. Shi, K. Mao, Integrating 3D structure into traffic scene understanding with RGB-D data, Neurocomputing 151 (2015) 700–709.
[5] Y. Xia, X. Shi, G. Song, Q. Geng, Y. Liu, Towards improving quality of video-based vehicle counting method for traffic flow estimation, Signal Process., http://dx.doi.org/10.1016/j.sigpro.2014.10.035.
[6] M. Nishiyama, T. Okabe, I. Sato, Y. Sato, Aesthetic quality classification of photographs based on color harmony, in: Computer Vision and Pattern Recognition – CVPR, 2011, pp. 33–40.
[7] L. Marchesotti, F. Perronnin, D. Larlus, G. Csurka, Assessing the aesthetic quality of photographs using generic image descriptors, in: International Conference on Computer Vision – ICCV, 2011, pp. 1784–1791.
[8] Y. Wang, Q. Dai, R. Feng, Y.-G. Jiang, Beauty is here: evaluating aesthetics in videos using multimodal features and free training data, in: Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 369–372.
[9] S.S. Khan, D. Vogel, Evaluating visual aesthetics in photographic portraiture, in: Proceedings of the Eighth Annual Symposium on Computational Aesthetics in Graphics, Visualization, and Imaging, 2012, pp. 55–62.
[10] S. Thumfart, R.H.A.H. Jacobs, E. Lughofer, C. Eitzinger, F.W. Cornelissen, W. Groissboeck, R. Richter, Modeling human aesthetic perception of visual textures, ACM Trans. Appl. Percept. 8 (2011) 1–29.
[11] L. Zhang, Y. Gao, Y. Xia, Q. Dai, X. Li, A fine-grained image categorization system by cellet-encoded spatial pyramid modeling, IEEE Trans. Ind. Electron. 62 (2015) 564–571.
[12] L. Zhang, Y. Xia, R. Ji, X. Li, Spatial-aware object-level saliency prediction by learning graphlet hierarchies, IEEE Trans. Ind. Electron. 62 (2015) 1301–1308.
[13] L. Zhang, Y. Gao, R. Ji, Y. Xia, Q. Dai, X. Li, Actively learning human gaze shifting paths for semantics-aware photo cropping, IEEE Trans. Image Process. 23 (2014) 2235–2245.
[14] L. Zhang, Y. Gao, Y. Xia, K. Lu, J. Shen, R. Ji, Representative discovery of structure cues for weakly-supervised image segmentation, IEEE Trans. Multimed. 16 (2014) 470–479.
[15] L. Zhang, Y. Xia, K. Mao, H. Ma, Z. Shan, An effective video summarization framework toward handheld devices, IEEE Trans. Ind. Electron. 62 (2015) 1309–1316.
[16] W.-T. Chu, Y.-K. Chen, K.-T. Chen, Size does matter: how image size affects aesthetic perception?, in: Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 53–62.
[17] R. Datta, D. Joshi, J. Li, J.Z. Wang, Studying aesthetics in photographic images using a computational approach, in: European Conference on Computer Vision – ECCV, 2006, pp. 288–301.
[18] A. Iqbal, The relevance of universal metrics in relation to human aesthetic perception, in: International Conference on Intelligent and Advanced Systems – ICIAS, 2010, pp. 1–6.
[19] P. Barcellos, C. Bouvi, F.L. Escouto, J. Scharcanski, A novel video based system for detecting and counting vehicles at user-defined virtual loops, Expert Syst. Appl. 42 (2015) 1845–1856.
[20] Y. Wang, M. Papageorgiou, A. Messmer, Real-time freeway traffic state estimation based on extended Kalman filter: adaptive capabilities and real data testing, Transp. Res. Part A 42 (2008) 1340–1358.
[21] H. Nicolas, M. Brulin, Video traffic analysis using scene and vehicle models, Signal Process.: Image Commun. 29 (2014) 807–830.

Please cite this article as: X. Shi, et al., Learning for an aesthetic model for estimating the traffic state in the traffic video, Neurocomputing (2015), http://dx.doi.org/10.1016/j.neucom.2015.08.099i

[22] X. Li, Y. She, D. Luo, Z. Yu, A traffic state detection tool for freeway video surveillance system, Procedia – Soc. Behav. Sci. 96 (2013) 2453–2461.
[23] P.K. Mital, T.J. Smith, R.L. Hill, J.M. Henderson, Clustering of gaze during dynamic scene viewing is predicted by motion, Cogn. Comput. 3 (2011) 5–24.
[24] C. Harris, M. Stephens, A combined corner and edge detector, in: 4th Alvey Vision Conference, 1988, pp. 147–151.
[25] E. Rosten, T. Drummond, Fusing points and lines for high performance tracking, in: International Conference on Computer Vision – ICCV, 2005, pp. 1508–1515.

Na Zhao is an undergraduate student at the Institute of Service Engineering, Hangzhou Normal University. Her research interests include image processing algorithms and their applications in transportation systems.

Xingmin Shi received his M.S. degree from the School of Computer Science, Beijing Institute of Technology, in 2004. He is currently a Ph.D. student at the College of Information Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang, PR China, and an associate researcher at Hangzhou Normal University, Hangzhou, Zhejiang, PR China. His research interests include data mining, image and video processing, and their applications in intelligent traffic systems.

Zhenyu Shan received the B.S. degree in computer science and technology from Zhejiang University, Hangzhou, in 2004, and the Ph.D. degree in computer science and technology from Zhejiang University, Hangzhou, in 2010. He is currently a lecturer in the Intelligent Transportation and Information Security Lab at Hangzhou Normal University. His research interests include intelligent transportation systems, big data, and visual semantic understanding.
