JOURNAL OF TRANSPORTATION SYSTEMS ENGINEERING AND INFORMATION TECHNOLOGY Volume 12, Issue 4, August 2012 Online English edition of the Chinese language journal Cite this article as: J Transpn Sys Eng & IT, 2012, 12(4), 35–42.
RESEARCH PAPER
ETC Data Mining Based on Hybrid Markov Model
QIAN Chao, XU Hongke*, DAI Liang, LI Shuguang
School of Electronic and Control Engineering, Chang’an University, Xi’an 710064, China
Abstract: ETC tolling data contain a vast amount of information, and mining these data to improve management efficiency is an urgent problem for expressway administrations. In this paper, ETC raw data are used to construct a route sequences transaction database. To address the low accuracy and coverage rate of the basic Markov route prediction model, a new method based on a hybrid Markov route prediction model is proposed to predict vehicle routes on the expressway. With this method, the future driving states of ETC vehicles are predicted and unusual route sequences are detected. The experimental results show that the detection results are reliable and that the overall prediction accuracy is above 83%. The method may provide a theoretical foundation and decision support for expressway administrations to develop charge checking and improve the ETC management level.
Key Words: highway transportation; hybrid Markov model; data mining; ETC tolling data; route prediction; route sequence
1 Introduction
The Vehicle Routing Problem (VRP), also called the Vehicle Scheduling Problem (VSP), is a key problem in traffic organization and optimization: under the premise of satisfying customer requirements, it seeks transportation routes that deliver goods to their destinations at the lowest cost. In the expressway network depicted in Fig. 1, a vehicle enters the expressway via station A and exits via station B, which generates the corresponding OD route AB. When the network contains a ring, the OD route may not be unique, and an ambiguous path is generated. The most common way to recognize driving routes accurately is to build path-identifying stations that add path-identifying information to the OD route. The ETC system contains a large quantity of sequence data, such as entrance, exit, path-identifying station, time, vehicle type, and plate. The goal of this article is to find vehicle driving rules from ETC raw data, to predict future driving routes accurately, and to detect unusual routes correctly.
2 Related work and contribution
The VRP was first proposed by Dantzig and Ramser in 1959[1]. It is an NP-hard problem that is difficult to solve with an exact algorithm, so heuristic algorithms are the main solution methods. These mainly include tabu search algorithms, genetic algorithms, simulated annealing algorithms, ant colony algorithms, and so on. Tabu search, genetic, and simulated annealing algorithms overcome the defect that traditional algorithms stop searching once they find a locally optimal solution: they are able to jump out of local optima and move toward the globally optimal solution[2–5]. Although ant colony algorithms have advantages in solving the VRP, they still exhibit several defects, such as slow search speed and a tendency to become trapped in local optima; especially when the road network is large, the solving speed and computed results are not ideal[6].

In recent years, with the expansion of networked toll collection, some scholars in China have also done a lot of work on expressway route recognition and prediction. Ref. [7] presents a route recognition method based on probability analysis, and its application shows the method to be feasible. Through research on massive GPS floating car data, Ref. [8] points out that vehicle movement follows certain patterns (there is a certain space-time regularity in vehicle traffic), and that the position and movement state of a vehicle can be used to predict its future driving route with a certain probability. Based on the variable-length Markov model, the author proposes an algorithm for generating vehicle movement patterns and predicts future driving routes.
Received date: Apr 5, 2012; Revised date: May 24, 2012; Accepted date: Jun 25, 2012 *Corresponding author. E-mail:
[email protected] Copyright © 2012, China Association for Science and Technology. Electronic version published by Elsevier Limited. All rights reserved. DOI: 10.1016/S1570-6672(11)60212-2
QIAN Chao et al. / J Transpn Sys Eng & IT, 2012, 12(4), 35–42
Fig. 1 Structure of expressway network

The quality and operating efficiency of traffic data obtained from transport information collection systems based on GPS floating cars are relatively low and the cost is relatively high; moreover, most present research uses data collected by taxis. The ETC system communicates with the on-board unit (OBU) via the road side unit (RSU), so vehicle information can be identified and exchanged automatically. With the promotion of the ETC system throughout the country, expressway administrations have accumulated large amounts of raw tolling data, which contain rich internal relations and implicit information that urgently need to be mined using advanced technology[9]. In recent years, some scholars have carried out related basic research[10–13]. However, tolling data mining and prediction models aimed at improving the management level and decision ability are not yet mature.

The Markov model is a classical probabilistic-statistical model that Zukerman et al. applied to the prediction of web requests in 1999[14]. In this model, the web browsing process is abstracted as a special random process, the homogeneous discrete Markov chain, and users' browsing characteristics are described by the transition probability matrix, from which users' browsing can be predicted. Building on these studies, this article establishes a Markov driving behavior model. To address the low accuracy and coverage rate of the basic Markov route prediction model, it introduces a hybrid Markov route prediction model and classifies ETC vehicle route sequences with the EM iterative clustering algorithm, so that vehicles in the same class have the same or similar driving behavior; an independent model is then built for each class of vehicles to describe its driving behavior. Finally, future driving routes are predicted from historical data.

3 Markov route sequences prediction model

3.1 Markov driving behavior model

Definition 1 (Markov chain): Assume that the state space $X=\{x_1, x_2, \cdots, x_n\}$ of the sequence of random variables $\{X_t, t=0,1,2,\cdots\}$ is discrete. If, for $t \ge 0$ and any states $i, j, i_0, \cdots, i_{t-1} \in X$, there always holds

$$P\{X_{t+1}=j \mid X_t=i, X_{t-1}=i_{t-1}, \cdots, X_0=i_0\} = P\{X_{t+1}=j \mid X_t=i\} \quad (1)$$

then the random process $\{X_t, t=0,1,2,\cdots\}$ is called a Markov chain. Let $p_{ij}(t) = P\{X_{t+1}=j \mid X_t=i\}$; if $p_{ij}(t)$ is independent of time $t$, namely, $p_{ij}(t_1)=p_{ij}(t_2)$ for any two times $t_1$ and $t_2$, then the Markov chain is called a homogeneous Markov chain[15].

Definition 2 (Route sequences): The entrance and exit sequences of vehicles passing through an expressway in a certain period of time are called route sequences.

Definition 3 (ETC route sequences transaction database DSDB): The ETC route sequences transaction database is constituted by the route sequences of ETC vehicles; that is, $DSDB=\{x_1, x_2, \cdots, x_m\}$, where the vector $x_i\ (i=1,\cdots,m)$ is the route sequence of a vehicle.

Hypothesis 1 (Markov property hypothesis of driving behavior): Assume that the route sequences of vehicles form a homogeneous Markov chain; then, route sequences can be regarded as value sequences of the discrete random variable $X$, thus satisfying the Markov property.

Definition 4 (Markov driving behavior model): The driving behavior of vehicles can be expressed as a homogeneous discrete Markov chain $M=\langle X, A \rangle$, where $X$ is a discrete random variable with range $\{x_1, x_2, \cdots, x_n\}$, and each $x_i$ corresponds to an entrance and exit sequence, which is called a state of the model; $A$ is the transition probability matrix, $A=(p_{ij})_{n \times n}$ $(i, j \in \{1, 2, \cdots, n\})$, where $p_{ij}=P(X_t=x_j \mid X_{t-1}=x_i)$ is the probability of transferring from state $x_i$ to state $x_j$. The transition probability matrix $A$ should satisfy the following two requirements:

$$p_{ij} \ge 0, \quad i, j \in \{1, 2, \cdots, n\}$$

$$\sum_{j=1}^{n} p_{ij} = 1, \quad i \in \{1, 2, \cdots, n\}$$
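As an illustration of Definition 4, the transition probability matrix can be estimated from route sequences by counting consecutive state pairs and normalizing each row. The following is a minimal sketch, not the authors' implementation: the function name, the toy states, and the uniform fallback for states that never appear as a predecessor are assumptions made here for illustration.

```python
def estimate_transition_matrix(sequences, states):
    """Row-normalised transition counts: A[i][j] approximates
    P(X_t = states[j] | X_{t-1} = states[i])."""
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = [[0.0] * n for _ in range(n)]
    for seq in sequences:
        # Count every pair of consecutive states in the route sequence.
        for prev, nxt in zip(seq, seq[1:]):
            counts[index[prev]][index[nxt]] += 1.0
    matrix = []
    for row in counts:
        total = sum(row)
        # Assumption: a state never observed as a predecessor gets a uniform
        # row, so that every row still sums to 1 as Definition 4 requires.
        matrix.append([c / total for c in row] if total > 0 else [1.0 / n] * n)
    return matrix

# Hypothetical toy database: each state is one entrance->exit trip.
states = ["A->B", "B->A", "A->C"]
dsdb = [
    ["A->B", "B->A", "A->B", "B->A"],
    ["A->B", "B->A", "A->C"],
]
A = estimate_transition_matrix(dsdb, states)
```

Every row of `A` is non-negative and sums to 1, which are the two requirements stated for the transition probability matrix.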
3.2 Basic Markov route sequences prediction model

Establishing an effective route sequences prediction model that accurately forecasts the driving behavior of vehicles is one of the critical issues faced by an intelligent decision system for expressway management.

Definition 5 (Route prediction): The problem of route sequences prediction based on DSDB can be defined as follows:

$$p(x, DSDB) \to (seq_1(\omega_1), seq_2(\omega_2), \cdots, seq_n(\omega_n)) \subseteq SEQ \quad (2)$$

where $x$ is the current route sequence of the vehicle, $SEQ$ is the set of all route sequence states, and $\omega_i\ (i=1,2,\cdots,n)$ is the weight of the prediction. Let the vector $x_t$ denote a vehicle's state at time $t$: if the vehicle is in state $x_i$, then the $i$th dimension of $x_t$ equals 1 and all other dimensions are 0. According to Hypothesis 1 and the Markov driving behavior model, a vehicle's state at time $t$ can be predicted by Eq. (3),
$$x_t = x_{t-1} \times A \quad (3)$$
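Eq. (3), combined with a selection of the most probable states, can be sketched as follows. This is an illustrative example only: the function name, the toy transition matrix, and the `top_n` and `threshold` parameters are assumptions introduced here, not values from the experiments.

```python
def predict_next(current_state, states, A, top_n=2, threshold=0.0):
    """Return up to top_n (state, probability) pairs for the next step,
    computed as x_t = x_{t-1} A with a one-hot x_{t-1}."""
    n = len(states)
    # One-hot vector: 1 in the dimension of the current state, 0 elsewhere.
    x_prev = [1.0 if s == current_state else 0.0 for s in states]
    # Vector-matrix product x_{t-1} A.
    x_t = [sum(x_prev[i] * A[i][j] for i in range(n)) for j in range(n)]
    ranked = sorted(zip(states, x_t), key=lambda pair: pair[1], reverse=True)
    # Keep the top-N states whose probability exceeds the threshold.
    return [(s, p) for s, p in ranked[:top_n] if p > threshold]

# Hypothetical three-state model; the numbers are made up for illustration.
states = ["A->B", "B->A", "A->C"]
A = [
    [0.0, 0.9, 0.1],
    [0.7, 0.0, 0.3],
    [0.2, 0.8, 0.0],
]
prediction = predict_next("B->A", states, A, top_n=2)
```

With a one-hot input the product simply selects the row of `A` for the current state; writing it as a vector-matrix product keeps the sketch aligned with Eq. (3).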
In the vector $x_t$, the states corresponding to the dimensions with the largest probability values are the most likely states at time $t$. Usually, the prediction result is either the set of top-N states with the highest probabilities or the set of all states whose probabilities exceed a threshold value.

3.3 Hybrid Markov route sequences prediction model

The driving behavior of vehicles is a complex process influenced by many factors, and differences in these factors give driving behavior different individual characteristics[16]. The basic Markov route sequences prediction model uses a single Markov chain to describe the driving characteristics of all vehicles; this is too simple, and considerable errors exist in the prediction results. Based on their driving characteristics, vehicles can instead be divided into several classes so that vehicles in the same class have the same or similar driving behavior, and an independent model is built for each class to describe its driving behavior. This overcomes the inaccuracy of the basic model, which uses only one Markov chain to describe the driving behavior of all vehicles, and yields a higher predictive accuracy. Such a model, which uses multiple Markov chains to describe multiple classes, is called the hybrid Markov model[17].

Hypothesis 2 (Hypothesis of vehicle classification): According to the ETC route sequences transaction database DSDB, vehicles can be divided into $K$ classes such that vehicles in the same class have the greatest similarity of driving behavior and vehicles in different classes have the least. Let $C=\{c_1, c_2, \cdots, c_K\}$ denote the classes of vehicles, and let $P(C=c_k)$ be the probability that any vehicle belongs to class $c_k$ $(k \in \{1, 2, \cdots, K\})$; then

$$\sum_{k=1}^{K} P(C=c_k) = 1$$
Hypothesis 3 (Hypothesis of classified Markov chain): Assume that the driving process of vehicles in the same class is a special random process (a homogeneous discrete Markov chain).

Definition 6 (Hybrid model): The hybrid model can be expressed as a quadruple $M = \langle X, K, P(C), MC \rangle$, where $X$ is a discrete random variable taking values in $\{x_1, x_2, \cdots, x_n\}$ and each $x_i$ corresponds to a driving state of the model; $K$ is the total number of vehicle classes in the model; $C=\{c_1, c_2, \cdots, c_K\}$ denotes the classes of vehicles, and $P(C)$ is the probability distribution over the classes; $MC=\{mc_1, mc_2, \cdots, mc_K\}$ is the set of classified Markov chains, where each $mc_k$ is a Markov chain that describes the driving behavior of class $c_k$ and is therefore called a classified Markov chain. Its state transition matrix can be expressed as Eq. (4),

$$A_k = (p_{kij})_{n \times n}, \quad i, j \in \{1, 2, \cdots, n\},\ k \in \{1, 2, \cdots, K\} \quad (4)$$
The main purpose of learning the hybrid model is to determine the following parameters: (1) the number of vehicle classes, $K$; (2) the probability that any vehicle belongs to class $c_k$, $P(C=c_k)$; (3) the state transition matrices of the classified Markov chains.

The expectation-maximization (EM) algorithm is an iterative refinement clustering algorithm[18,19] that can be viewed as an extension of the k-means clustering algorithm. This article adopts the EM algorithm to classify the routes of vehicles; the main process is depicted in Fig. 2. First, the algorithm estimates the prior probability distribution of $\theta$ (the set of parameters to be estimated); then, it iteratively refines the parameters through the expectation step (E-Step) and the maximization step (M-Step); the classification result $K$ is output only once the algorithm has converged.

The E-Step assigns vehicles with driving behavior $x_i$ to class $c_k$ under the current $\theta$, namely,

$$P_{i,k}(\theta) = p(c_k \mid x_i, \theta) = \frac{p(c_k \mid \theta)\, p_k(x_i \mid c_k, \theta)}{\sum_{j=1}^{K} p(c_j \mid \theta)\, p_j(x_i \mid c_j, \theta)} \quad (5)$$

where $p(c_k \mid x_i, \theta)$ is the probability that vehicles with driving behavior $x_i$ belong to the $k$th classified Markov model under the parameter set $\theta$; $p(c_k \mid \theta)$ is the marginal probability of the $k$th classified Markov model in the hybrid model, namely,

$$\sum_{k=1}^{K} p(c_k \mid \theta) = 1$$

and $p_k(x_i \mid c_k, \theta)$ is the probability of driving behavior $x_i$ under the $k$th classified Markov model.

The M-Step uses maximum likelihood estimation on the learning data DSDB to revise the estimate of the parameter set $\theta$. The target function $Q$ should satisfy the following equation:

$$Q(\theta, \theta^{old}) = \sum_{i=1}^{m} \sum_{k=1}^{K} P_{i,k}(\theta^{old}) \log\left[ p(c_k \mid \theta)\, p_k(x_i \mid c_k, \theta) \right] + \log p(\theta) \quad (6)$$
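The E-Step of Eq. (5) and an M-Step in the spirit of Eq. (6) can be sketched for a mixture of first-order Markov chains as follows. This is a simplified illustration under stated assumptions rather than the procedure used in the experiments: states are coded as integers, the number of classes is fixed in advance, each class has its own initial-state distribution, and a small additive-smoothing constant `alpha` stands in for the prior term $\log p(\theta)$. All function and variable names are hypothetical.

```python
import random

def sequence_likelihood(seq, init, trans):
    """p_k(x_i | c_k, theta): initial-state probability times transition probabilities."""
    p = init[seq[0]]
    for prev, nxt in zip(seq, seq[1:]):
        p *= trans[prev][nxt]
    return p

def e_step(sequences, weights, inits, transs):
    """Eq. (5): responsibility of each class k for each sequence i."""
    resp = []
    for seq in sequences:
        joint = [w * sequence_likelihood(seq, init, trans)
                 for w, init, trans in zip(weights, inits, transs)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    return resp

def m_step(sequences, resp, n_states, n_classes, alpha=0.01):
    """Re-estimate class weights, initial distributions and transition matrices
    from responsibility-weighted counts (alpha is an assumed smoothing constant)."""
    m = len(sequences)
    weights = [sum(r[k] for r in resp) / m for k in range(n_classes)]
    inits, transs = [], []
    for k in range(n_classes):
        init = [alpha] * n_states
        trans = [[alpha] * n_states for _ in range(n_states)]
        for seq, r in zip(sequences, resp):
            init[seq[0]] += r[k]
            for prev, nxt in zip(seq, seq[1:]):
                trans[prev][nxt] += r[k]
        init_total = sum(init)
        inits.append([v / init_total for v in init])
        transs.append([[v / sum(row) for v in row] for row in trans])
    return weights, inits, transs

def fit_hybrid_model(sequences, n_states, n_classes, iterations=30, seed=0):
    """Alternate M-Step and E-Step from a random soft assignment."""
    rng = random.Random(seed)
    resp = []
    for _ in sequences:
        raw = [rng.random() + 1e-3 for _ in range(n_classes)]
        total = sum(raw)
        resp.append([v / total for v in raw])
    for _ in range(iterations):
        weights, inits, transs = m_step(sequences, resp, n_states, n_classes)
        resp = e_step(sequences, weights, inits, transs)
    return weights, inits, transs, resp

# Hypothetical toy data over two states: "repeat" trips and "alternate" trips.
sequences = [
    [0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 0, 0, 0],
    [0, 1, 0, 1, 0], [1, 0, 1, 0, 1], [0, 1, 0, 1],
]
weights, inits, transs, resp = fit_hybrid_model(sequences, n_states=2, n_classes=2)
```

EM can converge to a local optimum, so the resulting grouping depends on the starting point; a fixed iteration count is used here instead of a convergence test to keep the sketch short, and long sequences would call for log-probabilities to avoid numerical underflow.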
Fig. 2 Diagram of the EM algorithm
Fig. 3 Training process of hybrid Markov driving prediction model
Fig. 4 Hybrid Markov driving sequence prediction model
The main training process for vehicle route prediction and unusual route detection is shown in Fig. 3. Training the hybrid model consists of classifying the vehicles and learning the classified Markov chains. First, the EM algorithm is used to classify the vehicle route sequences in DSDB; then, an independent model is built for each class of vehicles and the corresponding state transition matrix is obtained. After training has been completed, the probability of a given driving sequence can be calculated by inputting or querying the vehicle's historical driving sequences; if it is below the preset threshold, the route can be considered abnormal. Meanwhile, the probability of the follow-up route can be predicted by Eq. (7).

$$\begin{aligned}
p(x_{t+1} \mid x_t, \cdots, x_1) &= \sum_{k=1}^{K} p(x_{t+1}, c_k \mid x_t, \cdots, x_1) \\
&= \sum_{k=1}^{K} p(x_{t+1} \mid x_t, \cdots, x_1, c_k)\, p(c_k \mid x_t, \cdots, x_1) \\
&= \sum_{k=1}^{K} p(x_{t+1} \mid x_t, c_k)\, p(c_k \mid x_t, \cdots, x_1)
\end{aligned} \quad (7)$$
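Eq. (7) can be read as a two-stage computation: first the posterior class probabilities given the observed history, then a posterior-weighted mix of the per-class one-step predictions. The sketch below assumes integer-coded states, a per-class initial-state distribution, and a made-up two-class model; none of these names or numbers come from the paper.

```python
def class_posterior(history, weights, inits, transs):
    """p(c_k | x_t, ..., x_1), proportional to p(c_k) * p(x_1, ..., x_t | c_k)."""
    joint = []
    for w, init, trans in zip(weights, inits, transs):
        p = w * init[history[0]]
        for prev, nxt in zip(history, history[1:]):
            p *= trans[prev][nxt]
        joint.append(p)
    total = sum(joint)
    return [j / total for j in joint]

def predict_next_hybrid(history, weights, inits, transs):
    """Eq. (7): sum over classes of p(x_{t+1} | x_t, c_k) * p(c_k | history)."""
    post = class_posterior(history, weights, inits, transs)
    last = history[-1]
    n_states = len(transs[0])
    return [sum(post[k] * transs[k][last][j] for k in range(len(weights)))
            for j in range(n_states)]

# Hypothetical two-class model over two states (0 and 1).
weights = [0.5, 0.5]
inits = [[0.5, 0.5], [0.5, 0.5]]
transs = [
    [[0.9, 0.1], [0.1, 0.9]],  # class 0: tends to repeat the same trip
    [[0.1, 0.9], [0.9, 0.1]],  # class 1: tends to alternate between trips
]
dist = predict_next_hybrid([0, 1, 0, 1], weights, inits, transs)
```

Because the alternating history is far more likely under class 1, the posterior concentrates on that class and the predicted distribution favors state 0 as the next step.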
4 Analysis of hybrid Markov route sequences prediction model

4.1 Hierarchical structure of hybrid Markov route sequences prediction model

The learning process of the hybrid Markov route sequences prediction model can be divided into four levels, as shown in Fig. 4. The root node represents the model. The second level is the cluster level: each node except the last represents a cluster discovered by the EM algorithm, and the last node is a transition probability matrix that represents the state transition probabilities over all clusters. This transition probability matrix has a set of children, each of which represents a row of the matrix and indicates the probability of the follow-up route. Each cluster node also has a transition probability matrix as its child, which represents the transition probabilities of that cluster.

4.2 Performance index of hybrid Markov route sequences prediction model

In order to evaluate the performance of the hybrid Markov route sequences prediction model, we choose the following two performance indexes:

Definition 7 (Prediction accuracy P):

$$P = P_{accurate}/P_{all} = P_{accurate}/(P_{accurate}+P_{wrong})$$

Prediction accuracy $P$ is the ratio of the number of correct route predictions to the total number of route predictions, where $P_{accurate}$ is the number of predictions for which at least one predicted driving sequence occurs in the time window after the current trip, and $P_{all}$ is the total number of predictions. It describes the prediction accuracy of the model.

Definition 8 (Coverage rate A):

$$A = P_{all}/P_{record} = (P_{accurate}+P_{wrong})/P_{record}$$

Coverage rate $A$ is the ratio of the number of times that the model can be used to the total number of applications, where $P_{record}$ is the total number of applications. It describes the availability of the model.
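The two performance indexes of Definitions 7 and 8 can be computed as in the sketch below. The record format is an assumption made for illustration: each record pairs the set of predicted states (or `None` when the model cannot produce a prediction) with the state that was actually observed, which simplifies the time-window condition to a single observed follow-up sequence.

```python
def prediction_metrics(records):
    """Return (accuracy P, coverage A) for a list of (predicted_set, actual) pairs.
    A predicted_set of None or an empty set means no prediction was available."""
    p_record = len(records)                       # total number of applications
    answered = [(pred, actual) for pred, actual in records if pred]
    p_all = len(answered)                         # P_accurate + P_wrong
    p_accurate = sum(1 for pred, actual in answered if actual in pred)
    accuracy = p_accurate / p_all if p_all else 0.0
    coverage = p_all / p_record if p_record else 0.0
    return accuracy, coverage

# Hypothetical evaluation records.
records = [
    ({"A->B"}, "A->B"),           # correct
    ({"A->B", "B->A"}, "A->C"),   # wrong
    (None, "A->B"),               # model could not predict
    ({"B->A"}, "B->A"),           # correct
]
accuracy, coverage = prediction_metrics(records)
```

In this toy example three of the four requests receive a prediction and two of those three are correct.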
5 Analysis of experimental results
Yanba expressway is located in eastern Shenzhen and passes through several seaside resorts along its route. Since opening to traffic in 2001, it has become a main trunk road radiating outward from eastern Shenzhen, with a daily mixed traffic flow of more than 20000 vehicles. In this article, more than 2 million ETC tolling records from the Yanba expressway are chosen as raw data. According to actual demand and DB44[20], the tolling data are aggregated by the plate information recognized in the OBU and ordered by time; the ETC route sequences transaction database DSDB is then constructed.

5.1 Comparative analysis of Markov route sequences prediction model

In order to verify the predictive effect of the hybrid Markov route sequences prediction model, the driving records of 8000 vehicles were randomly selected from the DSDB as test samples. Prediction accuracy and coverage rate are chosen as the evaluation standards. The accuracy of the basic Markov route sequences prediction model, averaged over orders 1–6, is compared with that of the hybrid model in Table 1.

Table 1 Prediction accuracy under different threshold values

Threshold value    Accuracy of hybrid model    Accuracy of basic model
0.1                0.47                        0.38
0.3                0.62                        0.53
0.6                0.75                        0.66
0.9                0.83                        0.71

As observed in Table 1, the accuracy of the hybrid Markov route sequences prediction model is higher than that of the basic model. As the threshold value is increased, vehicles are divided into several classes or clusters, each of which has an approximately common driving state, which means that the model has a higher similarity. When the threshold value equals 0.9, the hybrid Markov route sequences prediction model exhibits the highest accuracy, and vehicles' driving behavior is divided into 15 classes by the EM algorithm, as depicted in Fig. 5. The basic Markov route sequences prediction model improves prediction accuracy by raising the order, but this reduces the coverage rate. A comparison of the average coverage rate of the hybrid Markov route sequences prediction model with that of the basic Markov route sequences prediction model is depicted in Fig. 6. As observed in Fig. 6, the hybrid model considers the prediction of each order, so all of its coverage rates reach one hundred percent; namely, any route sequence obtains a corresponding predicted value. Meanwhile, the coverage rate of the basic Markov route sequences prediction model decreases as the order increases.

Fig. 5 ETC vehicles’ clustering result

Fig. 6 Comparison of the coverage rates of the two models (vertical axis: coverage rate (%); horizontal axis: order, 1–6; series: hybrid Markov model, basic Markov model)

5.2 Route sequences prediction and unusual detection

A vehicle's future driving route can be predicted by using the hybrid Markov route sequences prediction model. Table 2 shows the prediction results for one car over the next three steps according to its historical driving records in DSDB.

Table 2 Results of driving sequence predictions

No.    Prediction route sequence    Confidence (%)
1      Kuichong→Yantian             81.76
       Tuyang→Yantian               1.17
       Yantian→Kuichong             0.87
       Kuichong→Dameisha            0.62
2      Yantian→Kuichong             51.32
       Yantian→Tuyang               3.17
       Kuichong→Yantian             1.45
       Yantian→Xichong              0.63
3      Kuichong→Yantian             47.24
       Tuyang→Yantian               0.91
       Yantian→Kuichong             0.87
       Xichong→Yantian              0.47

Unusual detection of route sequences means calculating the probability of a given vehicle's route sequences using the hybrid Markov route sequences prediction model; if the probability value is significantly lower than the required value, the given vehicle is considered to have unusual route sequences. The major causes of unusual route sequences include: (1) infeasible route, that is, after the previous route sequence there is no route by which the vehicle could drive to the next sequence; (2) unusual driving sequence, which mainly covers (a) failed ETC deduction and (b) fee evasion, and so on. The result of detecting one month's data in DSDB is depicted in Fig. 7. As shown in Fig. 7, DrivingPath is a nested table with two columns, SequenceID and Way. SequenceID represents the driving sequence, and Way represents the driving route. Probability is the probability of driving according to DrivingPath. Through prediction, vehicles whose route sequences have the minimum probability can be considered to exhibit unusual driving.
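The unusual-route check described in Section 5.2 amounts to scoring each route sequence under the trained hybrid model and flagging those whose probability falls below a threshold. A minimal sketch follows; the three-state model, its probabilities, and the threshold are invented for illustration, and a zero transition probability is used here to represent an infeasible route.

```python
def sequence_probability(seq, weights, inits, transs):
    """p(x_1, ..., x_T) under the mixture: sum over classes of p(c_k) p(x | c_k)."""
    total = 0.0
    for w, init, trans in zip(weights, inits, transs):
        p = w * init[seq[0]]
        for prev, nxt in zip(seq, seq[1:]):
            p *= trans[prev][nxt]
        total += p
    return total

def flag_unusual(sequences, weights, inits, transs, threshold):
    """Return the sequences whose probability is below the preset threshold."""
    return [seq for seq in sequences
            if sequence_probability(seq, weights, inits, transs) < threshold]

# Hypothetical two-class model over three integer-coded states.
weights = [0.6, 0.4]
inits = [[0.5, 0.5, 0.0], [0.4, 0.4, 0.2]]
transs = [
    [[0.1, 0.9, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]],
    [[0.2, 0.6, 0.2], [0.6, 0.2, 0.2], [0.5, 0.5, 0.0]],
]
candidates = [[0, 1, 0], [0, 2, 2], [1, 1, 1]]
flagged = flag_unusual(candidates, weights, inits, transs, threshold=0.01)
```

Raw sequence probabilities shrink as sequences get longer, so in practice the threshold would need to account for sequence length, for example by comparing sequences of equal length; that refinement is left out of this sketch.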
Fig. 7 Detecting result of unusual driving sequences

6 Conclusions

Based on the driving behavior of ETC vehicles, this article presents a new method, built on the hybrid Markov route prediction model, to predict vehicle routes on the expressway. It addresses the low accuracy and coverage rate of the basic Markov route prediction model. The model is applied to the aggregated data in the DSDB: unusual route sequences are detected and ETC vehicles' future driving states are predicted. The experiment demonstrates that the method is reliable and that the overall prediction accuracy is above 83%. It may provide a theoretical foundation as well as decision support for expressway administrations to develop charge checking and improve the ETC management level. When the network structure is complex, there are ambiguous paths in the network, and the number of routes doubles when one path-identifying station is added. A vehicle's route that passes through path-identifying stations can be expressed as segments of the form “entrance + path-identifying station 1,” “path-identifying station 1 + path-identifying station 2,” and “path-identifying station n + exit,” so that one route is divided into multiple routes. Vehicles' routes are then classified into optimum classes by adjusting the threshold values, and as a result the prediction accuracy is improved.

Acknowledgments

This research was funded by the National Natural Science Foundation of China (No. 60804049) and the Program for Changjiang Scholars and Innovative Research Team in University of the Ministry of Education of China (IRT1050).

References

[1] Dantzig G, Ramser J. The truck dispatching problem. Management Science, 1959, 6: 80–91.
[2] Osman I H. Meta-strategy simulated annealing and Tabu search algorithms for the vehicle routing problem. Annals of Operations Research, 1993, 41: 77–86.
[3] Liu X, Qi H. Tabu Search Algorithm of Min-max Vehicle Routing Problems. Systems Engineering, 2007, 25(1): 49–52.
[4] Cheng L H, Wang J Q. Improved genetic algorithm for vehicle routing problem. Computer Engineering and Applications, 2010, 46(36): 219–221.
[5] Hu D W, Zhu Z Q, Hu Y. Simulated Annealing Algorithm for Vehicle Routing Problem. China Journal of Highway and Transport, 2006, 19(4): 123–126.
[6] Bullnheimer B, Hartl R F, Strauss C. An improved ant system algorithm for the vehicle routing problem. Annals of Operations Research, 1999, 89: 319–328.
[7] Zhang J, Li X H, Yu S J. Path Identification Technique Based on Probability Analysis in Networking Toll Expressways. Computer and Communications, 2008, 26(6): 65–69.
[8] Li Z W. Urban Vehicular Mobility Patterns for Driving Route Prediction. Shanghai: Shanghai Jiao Tong University, 2010.
[9] Han J, Kamber M. Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann Publishers, 2006.
[10] Weng J C, Liu L L, Du B. ETC Data Based Traffic Information Mining Techniques. Journal of Transportation Systems Engineering and Information Technology, 2010, 10(2): 57–63.
[11] Diao H X. Escaped toll analysis of ETC system customer data based on fuzzy C-means clustering. Technological Development of Enterprise, 2005, 24(10): 8–10.
[12] Ye C Z, Zhong Z F. Realization of an OD Traveling Time Forecast System Based on Highway Network Toll Data. Microcomputer Information, 2007, 23(6-2): 44–46.
[13] Liu W N, Zeng C J, Sun D H. Clustering Analysis of Overspeed Spots for Commercial Vehicles Based on DBSCAN. Computer Engineering, 2009, 35(5): 268–270.
[14] Zukerman I, Albrecht D. Predicting user’s requests on the WWW. In: Proceedings of the 7th International Conference on User Modeling, New York, 1999, 275–284.
[15] Kijima M. Markov Processes for Stochastic Modeling. London: Chapman & Hall, 1997.
[16] Chang S J, Shi J J, Li X L. Process and Effect Factors of Driver’s Behavior. Transport Standardization, 2010, 4: 103–107.
[17] Sen R, Hansen M H. Predicting a Web user’s next request based on log data. Journal of Computational and Graphical Statistics, 2003, 12(1): 143–155.
[18] Dempster A, Laird N, Rubin D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1977, 39: 1–38.
[19] Lauritzen S L. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 1995, 19: 191–201.
[20] DB44/127-2003. Guangdong provincial expressway network tolling system. Administration of Quality and Technology Supervision of Guangdong Province, 2003.