Discovering spatio-temporal dependencies based on time-lag in intelligent transportation data


Accepted Manuscript

Discovering Spatio-temporal Dependencies Based on Time-lag in Intelligent Transportation Data

Xiabing Zhou, Haikun Hong, Xingxing Xing, Kaigui Bian, Kunqing Xie, Mingliang Xu

PII: S0925-2312(17)30247-3
DOI: 10.1016/j.neucom.2016.06.084
Reference: NEUCOM 18043

To appear in: Neurocomputing

Received date: 9 February 2016
Revised date: 9 June 2016
Accepted date: 13 June 2016

Please cite this article as: Xiabing Zhou, Haikun Hong, Xingxing Xing, Kaigui Bian, Kunqing Xie, Mingliang Xu, Discovering Spatio-temporal Dependencies Based on Time-lag in Intelligent Transportation Data, Neurocomputing (2017), doi: 10.1016/j.neucom.2016.06.084

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Discovering Spatio-temporal Dependencies Based on Time-lag in Intelligent Transportation Data

Xiabing Zhou^{a,∗}, Haikun Hong^{a}, Xingxing Xing^{a}, Kaigui Bian^{a}, Kunqing Xie^{a}, Mingliang Xu^{b}

^{a} Key Laboratory of Machine Perception, Ministry of Education, Peking University, Beijing, 100871, China.
^{b} School of Information Engineering, Zhengzhou University, Zhengzhou 450000, China

Abstract


Learning the spatio-temporal dependency structure is meaningful for characterizing causal or statistical relationships. In many real-world applications, dependency structures are characterized by a time lag between variables. For example, in traffic systems and climate, time lag is a key feature of hidden temporal dependencies and plays an essential role in interpreting the cause of discovered temporal dependencies. However, traditional dependency learning algorithms only use data of the variables at the same time stamp. In this paper, we propose a method for mining dependencies that takes the time lag into account. The proposed approach is based on a decomposition of the coefficients into products of two-level hierarchical coefficients, where one represents the feature level and the other represents the time level. In particular, we capture the prior information on time lag in intelligent transportation data. We construct a probabilistic formulation by placing probabilistic priors on these hierarchical coefficients, and devise an expectation-maximization (EM) algorithm to learn the model parameters. We evaluate our model on both synthetic and real-world highway traffic datasets. Experimental results show the effectiveness of our method.


Keywords: Spatio-temporal dependency, Time lag, Intelligent transportation data




Preprint submitted to Neurocomputing

February 8, 2017


1. Introduction


The problem of mining dependencies between variables in complex systems, such as economic systems, biological systems, traffic systems, and climate change, in order to characterize causal or statistical relationships is important and fundamental. Given multiple variables, the goal is to use the available variables to make precise predictions of future events and trends. In addition to this primary goal, an important task is to identify dependencies between these variables, wherein data from one variable significantly helps in making predictions about another variable, in both the space and time dimensions. For example, economists want to know whether burning natural gas is a causal factor for global warming, so they need to determine whether global warming depends on burning natural gas. In a transportation system, we want to know which entrance ramp flow, and at what time, is a causal factor for changes in the outflow.

In the past, graphical modeling techniques, which use Bayesian networks and other causal networks, have been considered a viable option for modeling dependencies [1, 2, 3, 4]. Statistical tests such as specific hypothesis tests have also been designed to identify causality between various temporal variables [5, 6]. However, most of them either do not consider the time lag [7, 8, 9] or only use a predefined value.

In transportation temporal data, time-lagged relationships are crucial for understanding the linkages and the influence of changes between related entrance ramps and exit ramps. These relationships are lagged in time because vehicles from entrance ramps do not affect the vehicle flow of exit ramps at the same time, but only at a later time. One such important time-lagged pattern in traffic is learning the related entrance ramps for exit ramps. Fig. 1 shows the Origin-Destination (OD) matrix [10, 11, 12]¹ of a highway traffic network, where the rows and the columns denote the entrance and exit ramps in the highway, respectively, and the values of the matrix represent the vehicle counts traveling from the entrances to the exits. Entries with brighter color denote larger vehicle counts, and darker ones represent smaller vehicle counts. We say that there exists a dependency between an entrance ramp and an exit ramp when the corresponding entry of the OD matrix is bright.

¹ The OD matrix can provide important spatial correlation information for traffic behaviors and is a widely used tool in traffic analysis [13].


Owing to its dependence on the traffic system, understanding this pattern has the potential to aid forecasts of vehicle flow at exit ramps.

Figure 1: OD matrix of the highway traffic network of a province in China. The rows represent the entrance stations, and the columns represent the exit stations. If the traffic flow from an entrance station to an exit station is nonzero, the pixel is white.

One important model for mining causal dependencies that considers time lag is Granger causality [14], which is a widely accepted notion of causality in econometrics. In particular, it says that time series A causes B if the current observations in A and B together predict the future observations in B significantly more accurately than the predictions obtained by using just the current observations in B. Recently, there has been a surge of methods that combine this notion of causality with regression algorithms [15, 16, 17]. However, they either do not emphasize the concept and importance of time lag, or use all history data and current data together for prediction, which leads to learning more irrelevant attributes.

In this paper, we propose a method for mining dependencies in spatio-temporal traffic data. The method, called Two-Level Hierarchies with time Lag lasso (TLHL), is based on the idea of combining the notion of causality with a regression algorithm. TLHL decomposes the regression coefficients into a product of a feature-level component and a time-level component. The feature-level component is used for spatial feature learning, as in traditional feature learning, while the time-level component represents temporal feature learning, which is correlated with the time lag and is also constrained by the time lag distribution. The advantage of the two levels is that they drive spatio-temporal feature learning toward the real values. Such a decomposition is very natural in theory: a specific regression coefficient is equal to zero if either of its two components is zero. Furthermore, the feature-level component controls the dependencies between the prediction variables and the response variable, and the time-level component represents the choice of time lag. Specifically, the TLHL model places Gaussian and Cauchy distributions on the component coefficients as priors to control the model complexity. With the Gaussian likelihood, we devise an efficient expectation-maximization (EM) algorithm [18] to learn the model parameters. Moreover, we evaluate our model on both synthetic and real-world traffic data, and the results show that the TLHL model is very effective in mining dependencies because it considers the time lag.

This paper is an improved version of the conference paper [19]. The remainder of this paper is organized as follows. In Section 2, we briefly review background and preliminaries. Section 3 presents the proposed method. Experimental studies are reported in Section 4. We conclude this paper and present future directions in Section 5.

2. Background and Preliminaries


2.1. Background

Estimating the spatio-temporal dependency structure plays an important role in the analysis of traffic networks. Most existing work on dependency structure discovery does not consider the time lag. For example, Meinshausen et al. [20] learn the dependency structure by using lasso; Yuan et al. [15] use group lasso or an extended algorithm [21] to cope with this problem; Han et al. [13] detect dependency relationships in traffic systems with a Gaussian graphical model. However, time lag is an important factor for discovering traffic patterns, especially in spatio-temporal traffic data. In a traffic system, it takes time for a vehicle to travel from an entrance ramp to an exit ramp; thus, we need to consider this time when discovering the dependencies between entrance ramps and exit ramps. Traditional methods either use data of entrance ramps and exit ramps at the same time stamp [13], or use the history data immediately preceding the current time stamp [22, 23]. The methods in [24, 25] plug different values of the time lag into the model, treating the time lag as a constant. The Granger graphical model has a parameter called the maximum lag. The maximum lag for a set of time series signifies the number of time units one must look into the past to make accurate predictions of current and future events. However, all the past time stamp data in this method are treated uniformly. The disadvantage is that it does not model the time lag accurately and always learns more irrelevant attributes.

2.2. Preliminaries

2.2.1. Highway intelligent transportation data

In highway transportation systems, two types of data are usually used for traffic analysis. The first type is collected by sensors on each road, such as inductive loops, which count the traffic volume of a road segment. The other type is collected at the entrance and exit ramps, and records the traffic volume and time at these ramps. The data from highway systems in China are usually of the second type. At each entrance and exit of the highway, there is a ramp for charging tolls and recording related information. These data include the number of vehicles, the time at which each vehicle enters and exits the highway, the license plate, the vehicle type, etc. For convenience of analysis, the data are aggregated into 15-minute periods and uploaded to the database.

In this paper, the data used are from entrance-exit toll ramps and include the traffic volume and the time of data collection. We have about 200 entrance-exit volume data series covering more than 5 years. From these data, we can analyze the characteristics of highway traffic data. In the experiments, only part of the data is used to show the performance of our method.
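To make the aggregation step concrete, the following minimal Python sketch counts individual toll records in fixed 15-minute bins. The record format (a list of `datetime` objects), the sample records, and the function name `aggregate_counts` are assumptions made for illustration; the actual database schema is not described here.

```python
from collections import Counter
from datetime import datetime


def aggregate_counts(timestamps, bin_minutes=15):
    """Count records per fixed-width time bin.

    `timestamps` is an iterable of datetime objects (hypothetical toll
    records). Each bin is keyed by the datetime at which it starts.
    `bin_minutes` is assumed to divide 60 evenly.
    """
    counts = Counter()
    for ts in timestamps:
        # Floor the minute to the start of its bin and drop the seconds.
        floored_minute = (ts.minute // bin_minutes) * bin_minutes
        bin_start = ts.replace(minute=floored_minute, second=0, microsecond=0)
        counts[bin_start] += 1
    return dict(counts)


# Three made-up records: two fall in 10:00-10:15, one in 10:15-10:30.
records = [
    datetime(2010, 8, 1, 10, 2, 11),
    datetime(2010, 8, 1, 10, 14, 59),
    datetime(2010, 8, 1, 10, 15, 0),
]
volume = aggregate_counts(records)
```

Keying each bin by its start time keeps the result easy to align with lagged windows later on, since lag l simply refers to the bin that starts l bin widths earlier.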


2.2.2. Problem definition

Given $d$ random variables $X = \{x_1, \cdots, x_d\}$, each variable $x_i$ has $n$ observations, $x_i = (x_i^1, \cdots, x_i^n)^T$. Let $Y = \{y^1, \cdots, y^n\}$ be the response variable. We aim to learn the dependency structure between the prediction variables and the response variable. In traffic analysis, $X$ denotes the vehicle counts collected from the entrance ramps, and $Y$ denotes the vehicle counts of the exit ramps. The task is to predict the vehicle counts for the exit ramps, since traffic experts are eager to know how many vehicles will pass through some important exit ramps. We aim to encode the dependency structure between entrance ramps and exit ramps. The vehicles from the entrance ramps related to an exit ramp are useful for predicting the vehicle flow of this exit ramp; thus, the challenge is how to learn the related entrance ramps of an exit ramp more accurately.

To solve this, we make full use of the time lag, which is prior information in traffic data. Time lag is caused by travel time, and the mean travel time can be obtained from historical statistics. However, different vehicles may have different travel times, so the mean travel time is not used as a fixed time lag because it fluctuates. Thus, we learn the distribution of the time lag to choose the relevant past data of entrance ramps for exit ramps more flexibly and accurately.

3. Proposed Method

In this section, we introduce the proposed TLHL model. The TLHL model is based on the idea of combining the notion of causality with a regression algorithm. The traditional regression coefficient is decomposed into products of two-level hierarchical coefficients, which represent the feature level and the time level, respectively. In particular, we propose a probabilistic framework for the TLHL model and devise an EM algorithm to infer the model parameters.

3.1. The TLHL Model

First, we propose a model for the component coefficients introduced previously. Most lasso-based algorithms solve the following optimization problem [26, 20, 15, 27]:

$$\min_{\beta} \sum_{i=1}^{n} L(y^i, \beta^T x^i + b) + \lambda R(\beta), \tag{1}$$

where $\beta = (\beta_1, \cdots, \beta_d)$ is the coefficient vector, $L(\cdot, \cdot)$ is the loss function, and $R(\cdot)$ is a regularizer that encodes different sparsity patterns of $\beta$. In this paper, we propose a hierarchical model in which each coefficient in $\beta$ is decomposed into products of multiple hierarchical coefficients. In order to learn the dependencies between variables and the time lag of each variable simultaneously, we consider two-level hierarchies,

$$\beta_j = \alpha_j \sum_{l=1}^{L} \gamma_{jl},$$


where $\beta_j$ is the $j$th element of the vector $\beta$, $\alpha_j$ represents the feature-level coefficient, and $\gamma_{jl}$ denotes the time-level coefficient of the $l$th time lag with respect to the $j$th feature.

In order to improve the efficiency of learning dependencies and time lags, we use prior knowledge about the time lag. For example, in traffic data, some travel times can be obtained from history data, or computed from the road length and vehicle speed. However, different vehicles have different speeds, which leads to fluctuation of the time lag. Let $K(i) = \exp\left(-\frac{(i-\mu_j)^2}{\delta_j^2}\right)$ be the prior information restriction on the time lag at the $i$th time stamp for the $j$th feature, where $\mu_j$ is the mean time lag for vehicles traveling from the $j$th entrance ramp to the target exit ramp, which is known. Thus, we have:

$$\beta_j = \alpha_j \sum_{l=1}^{L} K(l)\,\gamma_{jl}. \tag{2}$$

The objective of the two-level hierarchies with time lag lasso is established as follows:

$$\min_{\beta} \sum_{i=1}^{n} \Big( y^i - \sum_{j=1}^{d} \alpha_j \sum_{l=1}^{L} \exp\Big(-\frac{(l-\mu_j)^2}{\delta_j^2}\Big) \gamma_{jl}\, x_{jl}^i - b \Big)^2 + \lambda_1 \|\alpha\| + \lambda_2 \|\gamma\|, \tag{3}$$

where $\alpha = (\alpha_1, \cdots, \alpha_d)$ and $\gamma = (\gamma_{11}, \cdots, \gamma_{1L}, \cdots, \gamma_{dL})$.
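As a rough illustration of Eqs. (2) and (3), the sketch below evaluates the lag-weighted coefficients and the TLHL objective in plain Python. The function names, the nested-list layout `x[i][j][l]`, the toy parameter values, and the choice of the $\ell_1$ norm for both penalties are assumptions made for illustration only.

```python
import math


def lag_kernel(l, mu_j, delta_j):
    """Prior weight K(l) = exp(-(l - mu_j)^2 / delta_j^2) for lag l."""
    return math.exp(-((l - mu_j) ** 2) / (delta_j ** 2))


def effective_coefficients(alpha, gamma, mu, delta):
    """Return beta[j][l] = alpha_j * K(l) * gamma_jl for lags l = 1..L.

    List index l (0-based) corresponds to lag l + 1.
    """
    return [
        [alpha[j] * lag_kernel(l + 1, mu[j], delta[j]) * gamma[j][l]
         for l in range(len(gamma[j]))]
        for j in range(len(alpha))
    ]


def tlhl_objective(y, x, alpha, gamma, mu, delta, b, lam1, lam2):
    """Squared loss of Eq. (3) plus the two penalties.

    x[i][j][l] is the value of feature j at lag l + 1 for sample i.
    Both norms are taken to be l1 norms here, which is an assumption.
    """
    beta = effective_coefficients(alpha, gamma, mu, delta)
    loss = 0.0
    for i, y_i in enumerate(y):
        pred = b + sum(beta[j][l] * x[i][j][l]
                       for j in range(len(alpha))
                       for l in range(len(gamma[j])))
        loss += (y_i - pred) ** 2
    pen_alpha = sum(abs(a) for a in alpha)
    pen_gamma = sum(abs(g) for row in gamma for g in row)
    return loss + lam1 * pen_alpha + lam2 * pen_gamma


# Toy example: feature 0 is active with mean lag 2, feature 1 is switched off.
alpha = [2.0, 0.0]
gamma = [[0.5, 1.0, 0.5], [1.0, 1.0, 1.0]]
mu, delta = [2.0, 1.0], [1.0, 1.0]
beta = effective_coefficients(alpha, gamma, mu, delta)
```

In this toy setting, setting the feature-level coefficient of a feature to zero removes all of its lags at once, while the kernel down-weights lags that are far from the mean lag of the remaining feature.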

3.2. A Probabilistic Framework for TLHL model

In this section, we give a probabilistic interpretation that introduces our probabilistic model. For a regression problem, we use a normal distribution to define the likelihood for $y^i$:

$$y^i \sim \mathcal{N}\Big( \sum_{j=1}^{d} \sum_{l=1}^{L} \beta_{jl}\, x_{jl}^i + b,\; \sigma \Big), \tag{4}$$

where $\mathcal{N}(\mu, s)$ denotes a normal distribution with mean $\mu$ and variance $s^2$, and $\beta_{jl} = \alpha_j K(l) \gamma_{jl}$ is the per-lag coefficient appearing in Eq. (3). Then we need to specify the prior over the parameter $\beta_j$ in $\beta$. Since $\beta_j$ is represented by $\alpha_j$ and $\gamma_{jl}$, we instead define priors over the component coefficients. The component coefficients corresponding to the feature level and the time level are given two different probabilistic priors. We assume that the time-level coefficient follows a normal distribution:

$$\gamma_{jl} \sim \mathcal{N}(0, \theta_{jl}^2). \tag{5}$$

For the feature-level coefficient, a Cauchy prior is placed:

$$\alpha_j \sim \mathcal{C}(0, \phi_j), \tag{6}$$

where $\mathcal{C}(a, b)$ denotes the Cauchy distribution [28] with the probability density function defined as

$$p(x; a, b) = \frac{1}{\pi b \left[ \left( \frac{x-a}{b} \right)^2 + 1 \right]},$$

where $a$ and $b$ represent the location and scale parameters, respectively. The Cauchy prior is widely used for feature learning [29, 30]. In particular, the traditional (single-level) coefficient is Gaussian distributed, so it is easy to see that $\alpha_j$ follows a ratio distribution, i.e., the distribution of the quotient of two normally distributed variables. This ratio distribution is also known as the Cauchy distribution. Moreover, in order to obtain sparse hyperparameters $\theta_{jl}$, we place the Jeffreys prior [31] over $\theta_{jl}$:

$$p(\theta_{jl}) \propto \frac{1}{\theta_{jl}}. \tag{7}$$

Eqs. (4)-(7) define the probabilistic model of TLHL. Next, we discuss how to learn the model parameters.
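To see how the Cauchy prior of Eq. (6) turns into a penalty on $\alpha_j$, the short sketch below evaluates the density given above and its negative logarithm. The function names and the example values are illustrative assumptions; the constant $\ln \pi$ is dropped from the penalty because it does not depend on $\alpha_j$ or $\phi_j$.

```python
import math


def cauchy_pdf(x, a=0.0, b=1.0):
    """Density of C(a, b): 1 / (pi * b * (((x - a) / b)^2 + 1))."""
    return 1.0 / (math.pi * b * (((x - a) / b) ** 2 + 1.0))


def cauchy_penalty(alpha, phi):
    """Negative log-density of alpha under C(0, phi), without ln(pi).

    This equals ln(phi) + ln(alpha^2 / phi^2 + 1).
    """
    return math.log(phi) + math.log((alpha / phi) ** 2 + 1.0)


# The penalty grows only logarithmically in |alpha|, so large coefficients
# are penalized much less than under a quadratic (Gaussian) penalty.
penalty_small = cauchy_penalty(0.1, 1.0)
penalty_large = cauchy_penalty(10.0, 1.0)
```

Terms of the form $\ln \phi_j + \ln(\alpha_j^2/\phi_j^2 + 1)$ are exactly what the Cauchy prior contributes to the negative log-posterior that is minimized during parameter inference.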

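3.3. Parameter Inference

In the EM algorithm, $\theta = \{\theta_{jl}\}_{j \in N_d, l \in N_L}$, where $N_d$ is the index set $\{1, \cdots, d\}$, are treated as hidden variables, and the model parameters are denoted by $\Theta = \{\delta, b, \sigma, \phi\}$, where $\delta = \{\delta_j\}_{j \in N_d}$ and $\phi = \{\phi_j\}_{j \in N_d}$. In the following, we give the details of the EM algorithm.

E-step: we construct the Q-function as

$$Q(\Theta|\Theta^t) = E[\ln p(\Theta|y, \theta)] = \int p(\theta|y, \Theta^t) \ln p(\Theta|y, \theta)\, d\theta,$$

where $\Theta^t$ denotes the estimate of $\Theta$ in the $t$th iteration. It is easy to get

$$\ln p(\Theta|y, \theta) \propto \ln p(y|\delta, b, \sigma, \gamma, \alpha) + \ln p(\alpha|\phi) + \ln p(\gamma|\theta)$$

$$\propto -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big( y^i - \sum_{j=1}^{d} \alpha_j \sum_{l=1}^{L} \exp\Big(-\frac{(l-\mu_j)^2}{\delta_j^2}\Big) \gamma_{jl}\, x_{jl}^i - b \Big)^2 - n \ln \sigma - \sum_{j=1}^{d} \ln \phi_j - \sum_{j=1}^{d} \ln\Big(\frac{\alpha_j^2}{\phi_j^2} + 1\Big) - \sum_{j=1}^{d} \sum_{l=1}^{L} \frac{\gamma_{jl}^2}{2\theta_{jl}^2}$$

and $p(\theta_{jl}|y, \Theta^t) \propto p(\theta_{jl})\, p(\gamma_{jl}^t|\theta_{jl})$. Then we compute the following expectation:

$$E\Big[\frac{1}{2\theta_{jl}^2} \Big| y, \Theta^t\Big] = \frac{\int_0^{\infty} \frac{1}{2\theta_{jl}^2}\, p(\theta_{jl})\, p(\gamma_{jl}^t|\theta_{jl})\, d\theta_{jl}}{\int_0^{\infty} p(\theta_{jl})\, p(\gamma_{jl}^t|\theta_{jl})\, d\theta_{jl}} = \frac{1}{[2\gamma_{jl}^t]^2}.$$

Let $\nu_{jl} = \frac{1}{[2\gamma_{jl}^t]^2}$, and we can finally get

$$Q(\Theta|\Theta^t) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big( y^i - \sum_{j=1}^{d} \alpha_j \sum_{l=1}^{L} \exp\Big(-\frac{(l-\mu_j)^2}{\delta_j^2}\Big) \gamma_{jl}\, x_{jl}^i - b \Big)^2 - n \ln \sigma - \sum_{j=1}^{d} \ln\Big(\frac{\alpha_j^2}{\phi_j^2} + 1\Big) - \sum_{j=1}^{d} \sum_{l=1}^{L} \nu_{jl} \gamma_{jl}^2 - \sum_{j=1}^{d} \ln \phi_j.$$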

M-step: We maximize $Q(\Theta|\Theta^t)$ to update the estimates of $\alpha$, $\gamma$, $b$, $\sigma$ and $\delta$.

(1) For the estimation of $\alpha$, we have to solve the following optimization problem:

$$\min J = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big( y^i - \sum_{j=1}^{d} \alpha_j \sum_{l=1}^{L} \exp\Big(-\frac{(l-\mu_j)^2}{\delta_j^2}\Big) \gamma_{jl}\, x_{jl}^i - b \Big)^2 + \sum_{j=1}^{d} \ln\Big(\frac{\alpha_j^2}{\phi_j^2} + 1\Big) + \sum_{j=1}^{d} \ln \phi_j. \tag{8}$$

By setting the derivative of $J$ with respect to $\phi$ to zero, we only need to consider $\min_{\phi} \sum_{j=1}^{d} \ln\big(\frac{\alpha_j^2}{\phi_j^2} + 1\big) + \sum_{j=1}^{d} \ln \phi_j$. The stationarity condition $\frac{2\phi_j}{\alpha_j^2 + \phi_j^2} - \frac{1}{\phi_j} = 0$ gives $\phi_j^2 = \alpha_j^2$, and we get

$$\phi_j = |\alpha_j|. \tag{9}$$

By plugging the solution of Eq. (9) into Eq. (8), we simplify Eq. (8) as

$$\min \bar{J} = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big( \hat{y}^i - \sum_{j=1}^{d} \alpha_j \sum_{l=1}^{L} \hat{x}_{jl}^i \Big)^2 + \sum_{j=1}^{d} \ln |\alpha_j|, \tag{10}$$

where $\hat{y}^i = y^i - b$ and $\hat{x}_{jl}^i = K(l) \gamma_{jl} x_{jl}^i$. Problem (10) is non-convex since the second term of the objective function is non-convex. To solve it, we use the majorization-minimization (MM) algorithm [32]. For numerical stability, we slightly modify Eq. (10) by replacing the second term with $\sum_{j=1}^{d} \ln(|\alpha_j| + \eta)$, where $\eta$ is a tuning parameter. We denote the solution obtained in the $t'$th iteration of the MM algorithm as $\alpha_j^{t'}$. Thus, in the $(t'+1)$th iteration, we only need to solve a weighted $\ell_1$ minimization problem [33, 34]:

$$\min \sum_{i=1}^{n} \bar{\sigma} \Big( \hat{y}^i - \sum_{j=1}^{d} \alpha_j \sum_{l=1}^{L} \hat{x}_{jl}^i \Big)^2 + \sum_{j=1}^{d} \frac{|\alpha_j|}{|\alpha_j^{t'}| + \eta}, \tag{11}$$

where $\bar{\sigma} = \frac{1}{2\sigma^2}$, which can be solved by mature solvers, such as LASSO-style solvers.

(2) For the estimation of $\gamma$, we need to solve:

$$\min \tilde{J} = \sum_{i=1}^{n} \bar{\sigma} \big( \hat{y}^i - \gamma_l \tilde{x}_l^i \big)^2 + \sum_{j=1}^{d} \sum_{l=1}^{L} \nu_{jl} \gamma_{jl}^2, \tag{12}$$

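where $\gamma_l = (\gamma_{1l}, \cdots, \gamma_{dl})$ and $\tilde{x}_l^i = (\alpha_1 x_{1l}^i, \cdots, \alpha_d x_{dl}^i)^T$. We use a gradient method such as conjugate gradient to optimize Eq. (12). The gradient with respect to $\gamma_l$ is

$$\frac{\partial \tilde{J}}{\partial \gamma_l} = 2\bar{\sigma} \big( \tilde{X}_l \tilde{X}_l^T \gamma_l - \tilde{X}_l \hat{Y} \big) + 2D, \tag{13}$$

where $\tilde{X}_l = (\tilde{x}_l^1, \cdots, \tilde{x}_l^n)$, $\hat{Y} = (\hat{y}^1, \cdots, \hat{y}^n)^T$, and $D = (\nu_{1l} \gamma_{1l}, \cdots, \nu_{dl} \gamma_{dl})^T$.

(3) For the estimation of $b$, $\sigma$ and $\delta$, we solve:

$$\min \hat{J} = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big( y^i - \sum_{j=1}^{d} \sum_{l=1}^{L} \exp\Big(-\frac{(l-\mu_j)^2}{\delta_j^2}\Big) \bar{x}_{jl}^i - b \Big)^2 + n \ln \sigma, \tag{14}$$

where $\bar{x}_{jl}^i = \alpha_j^{t+1} \gamma_{jl}^{t+1} x_{jl}^i$. We set the derivatives of Eq. (14) with respect to $b$ and $\sigma$ to zero and get

$$b^{t+1} = \frac{1}{n} \sum_{i=1}^{n} \Big[ y^i - \sum_{j=1}^{d} \sum_{l=1}^{L} \exp\Big(-\frac{(l-\mu_j)^2}{\delta_j^2}\Big) \bar{x}_{jl}^i \Big], \tag{15}$$

$$\sigma^{t+1} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \Big[ y^i - \sum_{j=1}^{d} \sum_{l=1}^{L} \exp\Big(-\frac{(l-\mu_j)^2}{\delta_j^2}\Big) \bar{x}_{jl}^i - b^{t+1} \Big]^2 }. \tag{16}$$

For the estimation of $\delta_j$, we also use a gradient method. Differentiating Eq. (14), the gradient can be calculated as

$$\frac{\partial \hat{J}}{\partial \delta_j} = -\sum_{i=1}^{n} \frac{1}{(\sigma^{t+1})^2} \Big\{ \Big[ y^i - b^{t+1} - \sum_{j=1}^{d} \sum_{l=1}^{L} \exp\Big(-\frac{(l-\mu_j)^2}{\delta_j^2}\Big) \bar{x}_{jl}^i \Big] \cdot \sum_{l=1}^{L} \exp\Big(-\frac{(l-\mu_j)^2}{\delta_j^2}\Big) \bar{x}_{jl}^i \cdot \frac{2(l-\mu_j)^2}{\delta_j^3} \Big\}. \tag{17}$$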
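The MM scheme of Eqs. (10)-(11) can be sketched as an outer reweighting loop around a weighted lasso solver. The version below uses plain coordinate descent with soft-thresholding as the inner solver, which is one possible choice and not necessarily the solver used in the experiments. Here `z[i][j]` stands for $\sum_l \hat{x}^i_{jl}$, and all function names, default values, and the toy data are illustrative assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if -value > threshold:
        return value + threshold
    return 0.0


def weighted_l1_step(z, y_hat, alpha, weights, sigma_bar, sweeps=50):
    """Coordinate descent for
    sigma_bar * sum_i (y_hat_i - sum_j alpha_j z_ij)^2 + sum_j w_j |alpha_j|.
    """
    n, d = len(z), len(alpha)
    alpha = list(alpha)
    for _ in range(sweeps):
        for j in range(d):
            col_sq = sum(z[i][j] ** 2 for i in range(n))
            if col_sq == 0.0:
                alpha[j] = 0.0
                continue
            # Correlation of column j with the residual that excludes j.
            rho = 0.0
            for i in range(n):
                partial = sum(alpha[k] * z[i][k] for k in range(d) if k != j)
                rho += z[i][j] * (y_hat[i] - partial)
            alpha[j] = soft_threshold(rho, weights[j] / (2.0 * sigma_bar)) / col_sq
    return alpha


def mm_alpha_update(z, y_hat, alpha0, sigma, eta=1e-3, mm_iters=10):
    """Outer MM loop: reweight with 1 / (|alpha_j| + eta), then solve Eq. (11)."""
    sigma_bar = 1.0 / (2.0 * sigma ** 2)
    alpha = list(alpha0)
    for _ in range(mm_iters):
        weights = [1.0 / (abs(a) + eta) for a in alpha]
        alpha = weighted_l1_step(z, y_hat, alpha, weights, sigma_bar)
    return alpha


# Toy orthogonal design: feature 0 carries signal, feature 1 is near zero.
z = [[1.0, 0.0], [0.0, 1.0]]
y_hat = [3.0, 0.05]
alpha = mm_alpha_update(z, y_hat, alpha0=[1.0, 1.0], sigma=1.0)
```

In this toy run the weak coefficient is set exactly to zero and then receives a very large weight, so it stays at zero, while the strong coefficient is shrunk less and less as its weight decreases. This is the behavior that makes the log penalty sparser than a plain $\ell_1$ penalty.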


4. Experiments

In this section, we evaluate our proposed method on both a synthetic dataset and a real-world traffic dataset. The compared methods include lasso without considering time lag [26], lasso with a fixed time lag, and Granger Lasso [35].

Figure 2: Results on synthetic datasets. (a) MSE with n varying; (b) MAPE with n varying. Compared methods: lasso, lasso fix, Granger lasso, TLHL.

4.1. Synthetic Dataset

Setting: We first evaluate the effectiveness of our method on synthetic data. We simulate a regression problem with $d = 20$ features (prediction observations). The corresponding observations are generated from $y = X\beta + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. We vary $n$ from 40 to 100, and set $\sigma = 0.1$. The true feature-level coefficient is set randomly as

$$\beta^* = (2\ 2\ 0\ 0\ 0\ 2\ 2\ 0\ 0\ 0\ 0\ 2\ 2\ 2\ 2\ 0\ 0\ 0\ 0\ 0)^T.$$

In order to simulate responses generated from prediction observations at different time lags, we randomly generate the mean time lag $\mu_j$, $j \in N_d$, of each feature and the standard deviation $\delta_j$. The maximum lag is set to $L = 10$. For each feature, only 5 of the time-lag-level coefficients $\gamma_{jl}$, $l \in N_L$, $j \in N_d$, have values drawn from $[0, 1]$. Finally, we use $\exp\big(-\frac{(l-\mu_j)^2}{\delta_j^2}\big) \cdot \gamma_{jl}$ as the coefficient of the $j$th feature at the $l$th time. We set $X \sim \mathcal{N}(0, S_{P \times P})$ with $s_{ii} = 1, \forall i$, and $s_{ij} = 0.5$ for $i \neq j$. We generate $n$ samples for training as well as 20 samples for testing.

Figure 3: Results of lag distribution learned by TLHL. (a) F1 with n varying; (b) MSE of variance with n varying. Compared methods in (a): lasso, lasso fix, Granger lasso, TLHL.

We use the mean squared error (MSE), $\frac{1}{n} \sum_{i=1}^{n} \|Y_i - X_i \beta\|_2^2$, and the mean absolute percentage error (MAPE), $\frac{1}{n} \sum_{i=1}^{n} \big| \frac{y_i - x_i \beta}{y_i} \big|$, to evaluate the prediction error, and use the F1-score to evaluate the performance in terms of feature selection, where the F1-score is computed as follows:

$$Pre = \frac{|\{i \mid \beta_i \neq 0, \beta_i^* \neq 0\}|}{|\{i \mid \beta_i \neq 0\}|}, \quad Rec = \frac{|\{i \mid \beta_i \neq 0, \beta_i^* \neq 0\}|}{|\{i \mid \beta_i^* \neq 0\}|}, \quad F1 = \frac{2 \cdot Pre \cdot Rec}{Pre + Rec}.$$
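The three evaluation measures can be written down directly from their definitions. The sketch below is a minimal plain-Python version; the function names and the estimated vector `beta_hat` are made up for illustration, while `beta_true` is the $\beta^*$ given above. Note that MAPE is undefined when a true value is zero, which this sketch does not guard against.

```python
def mse(y_true, y_pred):
    """Mean squared error over paired observations."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error (as a fraction, not in percent)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_support(beta_hat, beta_true):
    """Precision, recall and F1 of the recovered nonzero pattern."""
    selected = {i for i, b in enumerate(beta_hat) if b != 0}
    relevant = {i for i, b in enumerate(beta_true) if b != 0}
    hits = len(selected & relevant)
    if hits == 0:
        return 0.0, 0.0, 0.0
    pre = hits / len(selected)
    rec = hits / len(relevant)
    return pre, rec, 2 * pre * rec / (pre + rec)


# True feature-level coefficients from the synthetic setting.
beta_true = [2, 2, 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 2, 2, 2, 0, 0, 0, 0, 0]
# Hypothetical estimate: six true features found, two missed, two spurious.
beta_hat = [1.9, 2.1, 0.3, 0.2, 0, 1.8, 2.0, 0, 0, 0, 0, 2.2, 1.7, 0, 0, 0, 0, 0, 0, 0]
pre, rec, f1 = f1_support(beta_hat, beta_true)
```

For this made-up estimate, eight features are selected and eight are truly relevant, with six in common, so precision, recall and F1 all equal 0.75.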


Results: We compare the methods as the sample size $n$ varies. Fig. 2 shows the values of MSE and MAPE. The methods that consider time lag outperform the traditional lasso algorithm, and TLHL performs best. Because lasso with a fixed time lag considers only one value of the lag, the fluctuation of the lag cannot be captured. Granger Lasso treats all the history prediction observations within the maximum lag equally and focuses on making the best prediction; thus, many unrelated dependencies are learned and overfitting may appear. Granger Lasso has better prediction accuracy on training samples than on test samples. The TLHL method adds the time lag distribution constraint, which drives feature learning toward the real values. Fig. 3(a) gives the F1-score, which reflects the accuracy of the learned dependency structures. We can see that the result of Granger Lasso is worse than that of TLHL because its recall is small. Our method, TLHL, gives the best results in both prediction accuracy and dependency learning. Fig. 3(b) shows the MSE of the variance of the lag distribution learned by TLHL as $n$ varies. Fig. 4 shows the learned lag distribution: the blue line is the real lag distribution, while the red line is the one learned by our method. We can see that the learned variance is close to the real value. Variance measures how far a set of numbers is spread out: a small variance indicates that the data points tend to be very close to the mean and hence to each other, while a high variance indicates that the data points are spread out around the mean and from each other. In a transportation system, the variance reflects the range of the time lag. A small variance means that the speeds of vehicles are concentrated, i.e., the lag distribution is narrow and the values that deviate from the mean tend to zero.

Figure 4: Lag distribution on synthetic datasets (real lag distribution and lag distribution learned by TLHL; y-axis: pdf).

Figure 5: Results on highway traffic datasets. (a) MSE of lasso, lasso fix, Granger Lasso and TLHL; (b) MSE of time-lagged variance (x-axis: Lag).

4.2. Highway Traffic Dataset

Description and Setting: We evaluate our method on real-life traffic data. The features in this traffic dataset are observations collected from sensors located on ramps in a highway traffic network. Each observation is the vehicle count during a 15-minute interval. There are 236 traffic stations in total, which correspond to 236 ramps, i.e., $d = 236$. The corresponding (response) observations are collected in the time interval 10:00-10:15 AM from 2010/8/1 to 2010/10/31 ($n = 92$). All prediction observations are collected from the 236 entrance ramps at 5:00-10:00 AM ($L = 20$, i.e., twenty 15-minute intervals). The mean time lag of each entrance ramp is the average travel time from this entrance ramp to the exit ramp. The data in this experiment are all normalized.

Because there is no ground truth for the dependency structure in real traffic data, the F1-score cannot be measured. Nevertheless, the detected dependencies are most important for traffic prediction, so we use MSE to evaluate the accuracy. Specifically, the dependency structure information can help domain experts predict vehicle flow with a suitable prediction algorithm; thus, we combine locally weighted learning (LWL) [36] with the vehicle flow information of the entrance ramps selected by the dependency structure learning algorithm. The prediction results are reported as MAPE.

Figure 6: Results of prediction. (a) MAPE of vehicle flow prediction for lasso+LWL, lasso fix+LWL, Granger lasso+LWL and TLHL+LWL; (b) prediction of traffic flow and real traffic flow in one week (y-axis: number of vehicles in 15 min; x-axis: July 1st to July 7th).

Results: Fig. 5(a) shows the MSE of all methods. As can be seen, our proposed TLHL is consistently among the best. Time lag is a key hidden feature in traffic analysis. It takes time for a vehicle to travel from an entrance ramp to an exit ramp. Using only data of the entrance ramp and exit ramp at the same time stamp is not accurate, because some of the vehicles from that entrance ramp do not arrive within the same time stamp. That is why the algorithm that does not consider time lag does not give satisfactory results. Similarly, different vehicles may have different time lags, so using a fixed time lag does not give accurate results either. Fig. 5(b) shows the MSE for different values of the maximum lag. If the maximum lag is set small, the time range of the entrance ramp that we consider is small, and vice versa. When the maximum lag is small, the MSE is large because the model misses some history traffic flow that needs more time to arrive at the exit ramp. However, if we consider a longer history, although no traffic flow of the entrance ramp is missed, the algorithm consumes more time. From the experimental results, when the maximum lag is larger than 20 (5 hours), the MSE is stable, so we set the maximum lag to 20.

Figure 7: Mean and variance of time-lag in one day (x-axis: time of day, 0:00 to 23:45). (a) Mean value of time-lag (min) in one day of a random OD; (b) variance of time-lag (min²) in one day of the same OD as (a); (c) mean value of time-lag (min) in one day of another random OD; (d) variance of time-lag (min²) in one day of the same OD as (c).

The dependency structures obtained by dependency learning methods are essential for the analysis of traffic systems, such as vehicle flow prediction and anomaly detection. Domain experts can obtain the information of upstream stations (entrance ramps) from accurate dependencies and predict the state of downstream stations (exit ramps). Fig. 6(a) shows the results of the exit ramp flow prediction algorithm combined with the dependency learning methods. We revise the LWL prediction algorithm by monitoring the vehicle flow of entrance ramps. Because our TLHL method can obtain an accurate dependency structure between entrance ramps and exit ramps, we can get precise information. TLHL outperforms all other methods in terms of MAPE. Fig. 6(b) gives an illustration of flow prediction for one week. We first use TLHL to learn the relevant spatio-temporal flow values of the entrance ramps, and then combine them with LWL to predict the traffic flow of the exit ramp. The predicted traffic flow and the real traffic flow match very well. The results further indicate that our method is meaningful for traffic flow analysis.

Fig. 7 gives an illustration of the mean and variance of the lag distribution in one day. The mean value is prior information in our paper. Fig. 7(a) shows that the high mean values are at midnight, the morning peak and the evening peak, while Fig. 7(b) gives the corresponding variance of time-lag for the same OD. At midnight, because night vision is not good enough, the speed of most vehicles is slow, so the mean value is large while the variance is small. At peak time, because of traffic jams, different vehicles have different speeds, so both the mean value and the variance are large. The same holds for Fig. 7(c) and Fig. 7(d). From these results, traffic domain experts can analyze the road condition.
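The per-OD statistics discussed above can be obtained from historical travel times. The following minimal sketch converts travel times in minutes to lag units of 15 minutes and computes their mean and variance; the function name, the use of the population variance, and the sample travel times are assumptions for illustration. In the model, only the mean enters as the prior $\mu_j$, while the spread $\delta_j$ is learned.

```python
from statistics import mean, pvariance


def lag_statistics(travel_minutes, bin_minutes=15):
    """Mean and variance of travel time expressed in lag units (time bins)."""
    lags = [t / bin_minutes for t in travel_minutes]
    return mean(lags), pvariance(lags)


# Hypothetical travel times (in minutes) observed for one OD pair.
mu_j, var_j = lag_statistics([30, 45, 60])
```

With these made-up travel times the mean lag is 3 bins, so the prior weight would peak at the observation taken 45 minutes before the target interval.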

Figure 8: Comparison of lag distributions on highway traffic datasets. (a) Lag distributions of two OD pairs of the same length (OD1 and OD2); (b) lag distributions of OD pairs of different lengths (long-distance OD and short-distance OD).

Furthermore, Fig. 8 shows the lag distributions of different OD pairs. Fig. 8(a) compares the lag distributions of two OD pairs. These two OD pairs have the same distance but different variances of time-lag. The blue one reflects the poor road condition. Fig. 8(b) also compares lag distributions: the red line represents the long-distance OD, and the blue line represents the short-distance OD. These results show that the poorer the road condition and the longer the distance of the OD, the larger the variance of time-lag. This is reasonable because a long-distance OD includes more links, vehicles do not keep the same speed, and different vehicles often have different travel times. Similarly, even if the distance of two OD pairs is the same, vehicles have various speeds under poor road conditions (free flow versus traffic jam), which leads to a large variance.


Fig. 9 shows the structure of the highway traffic network, in which each circle represents a traffic ramp and each line between connected ramps is a bidirectional highway. We show the spatial dependency result for one exit ramp. The red circle is the target exit ramp, and the green circles are the associated upstream entrance ramps. The figure shows that the associated upstream entrance ramps include not only ramps close to the target exit ramp on the network but also ramps far from it. This illustrates that relying only on the physical layout of the traffic network to identify the upstream entrance ramps is unreasonable.

Figure 9: Space dependencies on the real-life highway traffic network.

5. Conclusion


In this paper, we propose the two-level hierarchies with time lag lasso to cope with dependency structure learning. We decompose the traditional regression coefficients into products of two-level hierarchical coefficients, where each level represents a different kind of information. Specifically, to learn the time-level structure more accurately, we use prior time-lag information. We develop a probabilistic model to explain how the regularization form for the model parameters is constructed. In the experimental studies, we demonstrate the effectiveness of our method on both synthetic data and real-life highway traffic data. The results show that our TLHL model achieves significant improvements on both datasets compared with other methods. For future work, it would be interesting to extend static dependency structure learning to handle time-varying observations, so that we can follow the evolution of the dependencies and time lags in a network.

Acknowledgments


This research was supported by the National Natural Science Foundation of China under Grant No. 61473006.

References

[1] A. Moneta, P. Spirtes, Graphical models for the identification of causal structures in multivariate time series models, in: JCIS, 2006.

[2] N. Friedman, I. Nachman, D. Pe'er, Learning bayesian network structure from massive datasets: The "sparse candidate" algorithm, in: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999, pp. 206–215.

[3] R. Silva, R. Scheines, C. Glymour, P. Spirtes, Learning the structure of linear latent variable models, The Journal of Machine Learning Research 7 (2006) 191–246.

[4] D. Li, G. Hu, Y. Wang, Z. Pan, Network traffic classification via nonconvex multi-task feature learning, Neurocomputing 152 (2015) 322–332.

[5] P. Spirtes, C. N. Glymour, R. Scheines, Causation, prediction, and search, Vol. 81, MIT Press, 2000.

[6] Z. Ren, Y. Yang, F. Bao, Y. Deng, Q. Dai, Directed adaptive graphical lasso for causality inference, Neurocomputing 173, Part 3 (2016) 1989–1994.

[7] F. Moretti, S. Pizzuti, S. Panzieri, M. Annunziato, Urban traffic flow forecasting through statistical and neural network bagging ensemble hybrid modeling, Neurocomputing 167 (2015) 3–7.

[8] H. Dezani, R. D. Bassi, N. Marranghello, L. Gomes, F. Damiani, I. N. da Silva, Optimizing urban traffic flow using genetic algorithm with petri net analysis as fitness function, Neurocomputing 124 (2014) 162–167.

[9] W.-C. Hong, Traffic flow forecasting by seasonal SVR with chaotic simulated annealing algorithm, Neurocomputing 74 (12–13) (2011) 2096–2107.

[10] S. Lee, B. Heydecker, Y. H. Kim, E.-Y. Shon, Dynamic od estimation using three phase traffic flow theory, Journal of Advanced Transportation 45 (2) (2011) 143–158.

[11] J. Barceló, L. Montero, M. Bullejos, O. Serch, C. Carmona, A kalman filter approach for the estimation of time dependent od matrices exploiting bluetooth traffic data collection, in: Transportation Research Board 91st Annual Meeting, no. 12-3843, 2012.

[12] H. Zhao-cheng, Y. Zhi, Dynamic od estimation model of urban network, Journal of Traffic and Transportation Engineering 5 (2) (2005) 94–98.

[13] L. Han, G. Song, G. Cong, K. Xie, Overlapping decomposition for causal graphical modeling, in: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, 2012, pp. 114–122.

[14] C. W. Granger, Testing for causality: a personal viewpoint, Journal of Economic Dynamics and Control 2 (1980) 329–352.

[15] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society. Series B (Statistical Methodology) 68 (1).

[16] F. Li, N. R. Zhang, Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics, Journal of the American Statistical Association 105 (491) (2010) 1202–1214.

[17] M. T. Bahadori, Y. Liu, Granger causality analysis in irregular time series, in: SDM, 2012, pp. 660–671.

[18] A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood from incomplete data via the em algorithm, Journal of the Royal Statistical Society, Series B 39 (1) (1977) 1–38.

[19] X. Zhou, H. Hong, X. Xing, W. Huang, K. Bian, K. Xie, Mining dependencies considering time lag in spatio-temporal traffic data, in: Web-Age Information Management, 2015, pp. 285–296.

[20] N. Meinshausen, P. Bühlmann, High-dimensional graphs and variable selection with the lasso, The Annals of Statistics (2006) 1436–1462.

[21] D. Cheng, M. T. Bahadori, Y. Liu, Fblg: A simple and effective approach for temporal dependence discovery from time series data, in: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 382–391.


[22] Y. Gao, S. Sun, Multi-link traffic flow forecasting using neural networks, in: Natural Computation (ICNC), 2010 Sixth International Conference on, Vol. 1, 2010, pp. 398–401.

[23] S. Sun, R. Huang, Y. Gao, Network-scale traffic modeling and forecasting with graphical lasso and neural networks, Journal of Transportation Engineering 138 (11) (2012) 1358–1367.

[24] J. Kawale, S. Liess, V. Kumar, U. Lall, A. Ganguly, Mining time-lagged relationships in spatio-temporal climate data, in: Intelligent Data Understanding (CIDU), 2012 Conference on, 2012, pp. 130–135.

[25] S. Yang, On feature selection for traffic congestion prediction, Transportation Research Part C: Emerging Technologies 26 (2013) 160–169.

[26] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B (Methodological) 58 (1996) 267–288.

[27] P. Zhao, G. Rocha, B. Yu, Grouped and hierarchical model selection through composite absolute penalties, Department of Statistics, UC Berkeley, Tech. Rep 703.

[28] B. C. Arnold, P. L. Brockett, On distributions whose component ratios are cauchy, The American Statistician 46 (1) (1992) 25–26.

[29] C. M. Carvalho, N. G. Polson, J. G. Scott, Handling sparsity via the horseshoe, in: International Conference on Artificial Intelligence and Statistics, 2009, pp. 73–80.

[30] D. Hernández-Lobato, J. M. Hernández-Lobato, P. Dupont, Generalized spike-and-slab priors for bayesian group feature selection using expectation propagation, The Journal of Machine Learning Research 14 (1) (2013) 1891–1945.

[31] Y. A. Qi, T. P. Minka, R. W. Picard, Z. Ghahramani, Predictive automatic relevance determination by expectation propagation, in: Proceedings of the Twenty-first International Conference on Machine Learning, ICML '04, 2004, pp. 85–92.

[32] K. Lange, D. R. Hunter, I. Yang, Optimization transfer using surrogate objective functions, Journal of Computational and Graphical Statistics 9 (1) (2000) 1–20.

[33] E. J. Candès, M. B. Wakin, S. P. Boyd, Enhancing sparsity by reweighted ℓ1 minimization, Journal of Fourier Analysis and Applications 14 (5-6) (2008) 877–905.

[34] D. Wipf, S. Nagarajan, Iterative reweighted ℓ1 and ℓ2 methods for finding sparse solutions, Selected Topics in Signal Processing, IEEE Journal of 4 (2) (2010) 317–329.

[35] A. Arnold, Y. Liu, N. Abe, Temporal causal modeling with graphical granger methods, in: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’07, 2007, pp. 66–75.

[36] S. Meng, H. Lei, X. Kunqing, S. Guojie, M. Xiujun, C. Guanhua, An adaptive traffic flow prediction mechanism based on locally weighted learning, Acta Scientiarum Naturalium Universitatis Pekinensis 46 (1) (2010) 64–68.

Xiabing Zhou received the B.E. degree in computer science from Xi'an Jiaotong University in 2011. Since 2011, she has been working toward the Ph.D. degree in data mining and machine learning with the School of Electrical Engineering and Computer Science, Peking University, Beijing.

Haikun Hong received the B.E. degree in computer science from Beijing University of Posts and Telecommunications in 2011. Since 2011, he has been working toward the Ph.D. degree in data mining and machine learning with the School of Electrical Engineering and Computer Science, Peking University, Beijing.

Xingxing Xing received the B.E. degree in computer science from Beijing Normal University in 2012. Since 2012, he has been working toward the Ph.D. degree in data mining and machine learning with the School of Electrical Engineering and Computer Science, Peking University, Beijing.

Kaigui Bian received the B.S. degree in computer science from Peking University, Beijing, China, in 2005, and the Ph.D. degree in computer engineering from Virginia Tech in 2011. He is now an Associate Professor with the Institute of Network Computing and Information Systems, School of EECS, Peking University. His current research interests include cognitive radio (CR) networks, dynamic spectrum access (DSA) technologies, wireless network protocol design, mobile computing, and network security.

Kunqing Xie is a Professor with the School of EECS, Peking University, dean of the Department of Intelligent Science, and director of the Research Center of Intelligent Traffic System (ITS) at Peking University, China. His research interests include spatio-temporal information analysis and data mining, remote sensing and geographic information systems, city planning, and intelligent traffic systems. He has presided over several research programs at the national, provincial and ministerial levels, as well as international cooperation programs. He has published over 80 papers and won several provincial and ministerial level awards in research and teaching.

Ming-Liang Xu is an associate professor in the School of Information Engineering of Zhengzhou University, China. His research interests include computer graphics and computer vision. He received his Ph.D. degree in computer science and technology from the State Key Lab of CAD&CG at Zhejiang University.