Robust causal dependence mining in big data network and its application to traffic flow predictions


Transportation Research Part C xxx (2015) xxx–xxx

Contents lists available at ScienceDirect

Transportation Research Part C journal homepage: www.elsevier.com/locate/trc

Li Li a,b, Xiaonan Su a, Yanwei Wang a,c, Yuetong Lin d, Zhiheng Li a,c,⇑, Yuebiao Li a

a Department of Automation, Tsinghua University, Beijing 100084, China
b Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies, SiPaiLou #2, Nanjing, China
c Graduate School at Shenzhen, Tsinghua University, Shenzhen 518055, China
d Department of Electronics & Computer Engineering Technology, Indiana State University, IN 47809-9989, USA

Article info

Article history: Received 1 May 2014; Received in revised form 16 February 2015; Accepted 3 March 2015; Available online xxxx

Keywords: Big Data; Traffic flow prediction; Causal dependence; Lasso regression; Robust

Abstract

In this paper, we focus on a special problem in transportation studies that concerns the so-called "Big Data" challenge: how to build concise yet accurate traffic flow prediction models based on the massive data collected by different sensors? The size of the data, the hidden causal dependence and the complexity of traffic time series are some of the obstacles to making reliable forecasts at a reasonable cost, both time-wise and computation-wise. To better prepare the data for traffic modeling, we introduce a multiple-step strategy to process the raw "Big Data" into compact time series that are better suited for regression and causality analysis. First, we use the Granger causality to define and determine the potential dependence among data, and produce a much condensed set of time series that are also highly dependent. Next, we deploy a decomposition algorithm to separate the daily-similar trend and nonstationary burst components from the traffic flow time series yielded by the Granger test. The decomposition results are then treated by two rounds of Lasso regression: the standard Lasso method is first used to quickly filter out most of the irrelevant data, followed by a robust Lasso method to further remove the disturbance caused by burst components and recover the strongest dependence among the remaining data. Test results show that the proposed method significantly reduces the costs of building prediction models. Moreover, the obtained causal dependence graph reveals the relationship between the structure of road networks and the correlations among traffic time series. All these findings are useful for building better traffic flow prediction models. © 2015 Elsevier Ltd. All rights reserved.

1. Introduction

The operation and management of contemporary Intelligent Transportation Systems (ITS) rely on a myriad of sensors and actuators to generate important data and actions. The magnitude of data has increased dramatically over the years as a result of the growing number of sensing and actuating units in the ITS, and it poses a serious challenge on how to effectively utilize these "Big Data" to aid traffic research and practices. Unfortunately, traditional databases and search engines can only offer functions such as data storage, indexing and query, and are incapable of turning the "Big Data" into applicable knowledge or building inference mechanisms to quantify the degree of support that data offer for decision making.

⇑ Corresponding author at: #822, Central Main Building, Tsinghua University, Beijing 100084, China. Tel.: +86 (10) 62795503. E-mail address: [email protected] (Z. Li).
http://dx.doi.org/10.1016/j.trc.2015.03.003
0968-090X/© 2015 Elsevier Ltd. All rights reserved.

Please cite this article in press as: Li, L., et al. Robust causal dependence mining in big data network and its application to traffic flow predictions. Transport. Res. Part C (2015), http://dx.doi.org/10.1016/j.trc.2015.03.003


One of the critical tasks in ITS is traffic prediction, which has been heavily investigated by many researchers (Smith et al., 2002; Vlahogianni et al., 2004; Chrobok et al., 2004). Since it is generally believed that more data help to build better prediction models (Bhaskar et al., 2011; Van Lint and Hoogendoorn, 2010; Heilmann et al., 2011), there have been numerous attempts to mine the available data generated by all traffic sensors and actuators (Zhang et al., 2011; Faouzi et al., 2011). However, with the emergence of "Big Data", and in particular considering that some ITS systems are already generating gigabytes of data per day and are on the verge of generating terabytes or even more, it has become apparent that feeding all data into one prediction model is both computationally prohibitive and counter-productive. A potential solution is to employ parallel computing techniques, where the computational tasks are divided into a few smaller subtasks to be run on different computers separately (Chen et al., 2013). However, the benefit-to-cost ratio of this kind of approach remains to be further examined. While it seems intuitively obvious that more data can aid prediction, the performance of prediction is often not dramatically improved in practice. Another, more practical method is to find and use the most relevant data to build parsimonious prediction models. To this end, different methods have been proposed in the last decade, including the hierarchical fuzzy model (Stathopoulos et al., 2010) that fuses multivariate data for traffic flow prediction, the adaptive Lasso (Least Absolute Shrinkage and Selection Operator) method that finds the most influential data in building prediction models (Kamarianakis et al., 2012), and the Graphical Lasso method (Sun et al., 2012) that builds sparse Bayesian Network models for prediction. As summarized in Vlahogianni et al. (2014), the main challenge of these studies lies in how to correctly retrieve both temporal characteristics and spatial dependence of traffic flow time series collected at different locations. The size of the data, the hidden causal dependence and the complexity of traffic time series all hinder us from reaching this goal.

In this paper, we focus on this special problem: how to build traffic prediction models based on the massive data collected by different sensors? In particular, the first problem that we need to address is: when we discuss dependence, how do we thoroughly classify and properly define the so-called "causality" between different sources of data? To answer this question, we turn to the well-known Granger causality test.

The search for causal relationships dates back a millennium, owing to their indubitable importance and wide applications. Since a uniform yet intuitive understanding of cause and effect is lacking, the causal relationship is often difficult to define for complex systems. In 1956, Norbert Wiener proposed that one variable (usually appearing as a particular time series) could be called 'causal' to another variable if the ability to predict the second variable can be noticeably improved by incorporating information about the first variable (Wiener, 1956). In 1969, Clive Granger designed a practical implementation of this idea by checking the information gain of different linear autoregressive models for stochastic processes (Granger, 1969, 1980). Since then, the Granger causality test has become a well-established method for identifying potential causal connectivity (Otter, 1991; Hlaváčková-Schindler et al., 2007; Bressler and Seth, 2011). It has gained tremendous success across many domains (Asimakopoulos et al., 2000; Lozano et al., 2009).
In this paper, we adopt the newly developed Lasso-based Granger causality model (Arnold et al., 2007; Lozano et al., 2009) to retrieve the most important causal relationships from the massive traffic data. However, the original Lasso method is based on the assumption of stationary time series and may fail when this assumption is not met. Therefore, the second problem we need to address is: how to simultaneously cope with the causal dependence modeling and the non-stationarity of traffic time series? Realizing that traffic flow time series usually contain various components and are not strictly stationary (Chen et al., 2012; Zeng and Zhang, 2013; Zhang et al., 2014), we apply the decomposition technique that we developed over the last decade (Li et al., 2014b) to classify the traffic data from each sensor into three kinds of patterns: the intra-day trend, the Gaussian type fluctuations, and the bursts. From the viewpoint of traffic prediction, these patterns represent two opposing aspects of the traffic series: the intra-day trend reflects the endogenous, self-driven and time-invariant characteristics, while the residual series, which includes fluctuations and bursts, reflects the exogenous, environment-dependent and time-variant characteristics. As already shown in Chen et al. (2012), building prediction models on the residual time series may significantly improve prediction accuracy. In this paper, we further demonstrate that residual series also play a fundamental role in uncovering the underlying causal relationships in traffic series. To this end, we may use the Lasso algorithm to quickly filter out most of the unrelated data from the residual time series. However, since the bursts usually represent the impact of un-modeled environmental disturbance that may bias the estimation of the standard Lasso algorithm, a robust Lasso algorithm is designed instead to remove the bursts first and then recover the dependence among the remaining time series.
Test results show that the proposed method is able to appropriately recover both the temporal characteristics and spatial dependence of the original traffic flow time series. Moreover, the obtained causal dependence graph reveals the relationships between the structure of road networks and the correlations among traffic time series. All these findings help to build better prediction models.

Fig. 1 shows the five steps (models) used to handle the interlaced difficulties. The rest of this paper presents each solution model in turn. Section 2 first explains how to decompose traffic flow time series for further study. Section 3 briefly reviews the theory of Granger causality models and then explains how to determine the causal dependence for traffic prediction. Section 4 presents the test results of this new method and discusses the relations between causality modeling and prediction modeling. Finally, Section 5 gives the conclusions.


Fig. 1. The flowchart of the proposed method. [Figure not reproduced. The flowchart lists the difficulties of prediction (big data, causality, complexity of traffic time series), the corresponding solution models (decomposition, original Lasso Granger causality regression model, robust Lasso Granger causality regression model, Granger-Wald test, prediction model), and the data processing procedure from raw data to prediction result (time series decomposing, screen out most unrelated time series, reserve possible related time series, find causal dependent time series, making prediction).]

2. The decomposition of traffic flow time series

It has been pointed out in Vlahogianni et al. (2014) that identifying temporal flow patterns is a critical task for short-term traffic forecasting. However, because traffic flow time series are not stationary processes (Zhang et al., 2014), the Lasso Granger causality model cannot be directly employed in traffic predictions. Therefore, we need to first gain a better understanding of the temporal characteristics of traffic series. Various temporal pattern models have been proposed for traffic flow recently (Jiang and Adeli, 2005; Vlahogianni et al., 2008; Ghosh et al., 2010). As we have shown in our previous study (Chen et al., 2012), a traffic flow series consists of three patterns: the intra-day trend, the Gaussian type fluctuations and the bursts, and each of them embodies different characteristics.

The intra-day trend is shown in Fig. 2(b) as an M-shape curve that represents the regular variations of daily traffic flow. The morning and evening rush hours result in the two peaks; one shallow valley appears at noon, and the other deep valley appears at midnight. In this paper, we use the simple average method to retrieve the common intra-day trend shared by the time series collected at one location for several consecutive days. Suppose the sampled traffic flow data in N consecutive work days can be written as a series of 1D vectors

$$Y_{t_1} = [y_{t_1 1}, y_{t_1 2}, \ldots, y_{t_1 n}], \; \ldots, \; Y_{t_N} = [y_{t_N 1}, y_{t_N 2}, \ldots, y_{t_N n}] \tag{1}$$

where we assume there are $n$ sampling data points per day; for example, if the sample time interval is 30 s, then $n = 2880$. The simple average method then gets the intra-day trend as

$$Y_{\text{Average}} = \left[ \frac{1}{D} \sum_{j=N-D+1}^{N} y_{t_j 1}, \; \ldots, \; \frac{1}{D} \sum_{j=N-D+1}^{N} y_{t_j n} \right] \tag{2}$$
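A minimal sketch of the simple average in Eq. (2) is given here for illustration. The symbol $D$ is not defined in the text; the sketch assumes it is the number of most recent days that are averaged. The function name `intraday_trend` and the toy numbers are hypothetical.

```python
import numpy as np

def intraday_trend(daily_flows, num_days):
    """Average the last `num_days` rows of an (N, n) array of daily flow vectors.

    Row j holds the n samples of day j; the result has one value per time of day,
    following Eq. (2) with D = num_days (an assumed reading of D).
    """
    daily_flows = np.asarray(daily_flows, dtype=float)
    if not 1 <= num_days <= daily_flows.shape[0]:
        raise ValueError("num_days must be between 1 and the number of days N")
    return daily_flows[-num_days:].mean(axis=0)

# Toy data: N = 3 days with n = 4 samples per day
# (30 s sampling over a full day would give n = 2880).
flows = np.array([[10.0, 20.0, 30.0, 40.0],
                  [12.0, 22.0, 28.0, 38.0],
                  [14.0, 18.0, 32.0, 42.0]])
trend = intraday_trend(flows, num_days=2)  # averages the last two days
residual = flows - trend                   # deviation of each day from the trend
```

The residual array computed in the last line is the quantity that the fluctuation and burst definitions operate on.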

The Gaussian type fluctuations refer to the high-frequency residual time series whose empirical probability density function roughly follows that of the normal distribution. Suppose the standard deviation $\hat{\sigma}_{\text{Fluct}}$ is obtained from the square root of the variance for the entire residual time series; we can then define a point as a "burst" if its deviation from the simple-average trend is larger than twice the standard deviation; see Fig. 2(c)–(d) for illustrations. Since the purpose of calculating the mean square deviation (MSD) of the residual series is simply to identify the bursts, we do not need to be concerned about how the MSD of the remaining time points is affected, even after the bursts are removed.

As demonstrated in Chen et al. (2012), the existence of bursts brings difficulties for prediction models. Most existing prediction models assume, explicitly or implicitly, that there exists a certain smooth mapping relationship between the input and the output. As a result, the predicted values are mainly located within the intervals bounded by the upper and lower envelopes, which are one standard deviation $\hat{\sigma}_{\text{Fluct}}$ away from the simple average trend line. In other words, few existing prediction models can appropriately forecast the bursts. For Lasso regression problems, the bursts may significantly bias the regression coefficient estimates and lead to the wrong conclusion on causal relationships; see illustrations in Section 4.2. Therefore, we need to design a new Lasso Granger model to cope with the impact of bursts.

3. Granger causality models

3.1. Basic theory

In short, Clive Granger characterized causality as the extent to which a stationary process $\{x_t\}_{x_t \in \mathbb{R}, t \in \mathbb{Z}}$ influences another stationary process $\{y_t\}_{y_t \in \mathbb{R}, t \in \mathbb{Z}}$, based upon incremental predictability. $\{x_t\}$ is said to Granger cause $\{y_t\}$ if the future values of $\{y_t\}$ can be better predicted by using the past values of both $\{x_t\}$ and $\{y_t\}$, rather than just the past values of $\{y_t\}$.

Fig. 2. (a) The original traffic flow series in a day; (b) the intra-day trend obtained by the simple average method; (c) the residual series with all bursts removed, where σ is the standard deviation; (d) the distribution of the residual series with all bursts removed; and (e) the bursts. In all the figures, the aggregation time scale is five minutes. The specific data shown in this figure were gathered on August 4, 2011 by sensor 402514 (located in the 4th district of the PeMS dataset). [Figure not reproduced. Panels (a), (b) and (e): number of vehicles versus time of the day (one unit = 5 min); panel (c): deviations from average trend versus time of the day, with the ±2σ bounds and burst points marked; panel (d): probability density versus deviations from average trend (veh), measured distribution against Gaussian distribution.]
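The split illustrated in Fig. 2 can be sketched as follows, assuming the two-standard-deviation burst rule of Section 2 and a trend that has already been estimated. The function name `split_residual` and the toy series are hypothetical, and the standard deviation is taken over whatever residual series is passed in.

```python
import numpy as np

def split_residual(flow, trend, n_sigma=2.0):
    """Split the deviation from the trend into fluctuations and bursts.

    A point is flagged as a burst when its deviation from the trend exceeds
    n_sigma standard deviations of the whole residual series.
    """
    residual = np.asarray(flow, dtype=float) - np.asarray(trend, dtype=float)
    sigma = residual.std()                       # square root of the residual variance
    is_burst = np.abs(residual) > n_sigma * sigma
    fluctuations = np.where(is_burst, 0.0, residual)
    bursts = np.where(is_burst, residual, 0.0)
    return fluctuations, bursts, sigma, is_burst

# Toy series: small alternating deviations plus one large spike at the end.
toy_trend = np.zeros(10)
toy_flow = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, 20.0])
fluct, bursts, sigma, is_burst = split_residual(toy_flow, toy_trend)
# Only the last point exceeds 2 * sigma (sigma is about 6.04 here).
```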

Mathematically, the Granger test first performs the following two regressions

$$y_t = \sum_{i=1}^{L} a_i y_{t-i} + e_{t,1} \tag{3}$$

$$y_t = \sum_{i=1}^{L} a_i y_{t-i} + \sum_{i=1}^{L} b_i x_{t-i} + e_{t,2} \tag{4}$$

where $L$ is the maximal lag for $\{x_t\}$ and $\{y_t\}$; $e_{t,1}$ and $e_{t,2}$ are the residuals (regression errors) of the two regression models at time $t$; $a_i$ and $b_i$ are the corresponding regression coefficients. Next, the Granger test measures the influence of $\{x_t\}$ on predictions by comparing the prediction errors. In this paper, we choose the Granger-Wald hypothesis test (Hlaváčková-Schindler et al., 2007). Suppose the sample time points are $t = L + 1, \ldots, N$; we define the test statistic $GW_{\text{Single}}$ as


$$GW_{\text{Single}} = N \, \frac{\hat{\sigma}^2_{e_{t,1}} - \hat{\sigma}^2_{e_{t,2}}}{\hat{\sigma}^2_{e_{t,2}}} \tag{5}$$

where $\hat{\sigma}^2_{e_{t,2}}$ is the estimate of the variance of $e_{t,2}$ from model (4) and $\hat{\sigma}^2_{e_{t,1}}$ is the estimate of the variance of $e_{t,1}$ from model (3), for $t = L + 1, \ldots, N$. If the hypothesis of no causality is true, $GW_{\text{Single}}$ follows a Chi-squared distribution with $L$ degrees of freedom. Thus, if the null hypothesis is rejected by a significant $GW_{\text{Single}}$, we can reasonably conclude that $\{x_t\}$ Granger causes $\{y_t\}$; otherwise, $\{x_t\}$ and $\{y_t\}$ are taken to have no causal relationship.

3.2. Lasso Granger Causality regression model

To apply the Granger test model in (3) and (4) to discover the possible mutual Granger causal relationships among $P$ independent variables $\{x_t^j\}_{x_t^j \in \mathbb{R}, t \in \mathbb{Z}}$, $j = 1, \ldots, P$, the number of regression analyses required is $O(P^2)$. Therefore, when we need to consider a large number of independent variables, the computational load of the Granger test becomes overwhelming. To solve this problem, we often extend the model in (4) to a multi-variable regression model that consists of $\{y_t\}$ and $\{x_t^j\}$, $j = 1, \ldots, P$:

$$y_t = \sum_{i=1}^{L} a_i y_{t-i} + \sum_{j=1}^{P} \sum_{i=1}^{L} b_i^j x_{t-i}^j + e_{t,3} \tag{6}$$

where $L$ is the maximal lag for $y_t$ and $x_t^j$, $e_{t,3}$ is the residual at time $t$, and $a_i$ and $b_i^j$ are the corresponding regression coefficients that can be determined by one round of regression. Suppose the sample time points are still $t = L + 1, \ldots, N$; after applying Lasso regression, we similarly define a statistic $GW_{\text{Multi}}$ as

$$GW_{\text{Multi}} = N \, \frac{\hat{\sigma}^2_{e_{t,1}} - \hat{\sigma}^2_{e_{t,3}}}{\hat{\sigma}^2_{e_{t,3}}} \tag{7}$$

where $\hat{\sigma}^2_{e_{t,3}}$ is the estimate of the variance of $e_{t,3}$ from model (6) and $\hat{\sigma}^2_{e_{t,1}}$ is the estimate of the variance of $e_{t,1}$ from model (3), for $t = L + 1, \ldots, N$. Under the null hypothesis of no causality, the statistic $GW_{\text{Multi}}$ still follows a Chi-squared distribution. Thus, if the null hypothesis is rejected and there exists an element of $b_i^j$ that is non-zero, we can claim that $\{x_t^j\}$ Granger causes $\{y_t\}$.

How to choose an appropriate regression algorithm that can retrieve the most influential variables has received increasing interest. In this paper, we adopt the Lasso regression method (Friedman et al., 2008) to address this issue. The Lasso algorithm carries out variable selection by adding L1-penalty terms on the regression coefficients. For our purpose, we apply the so-called Lasso Granger causality regression model (Valdés-Sosa et al., 2005; Arnold et al., 2007; Lozano et al., 2009) formulated as

$$\min_{a_i, b_i^j} \left\| y_t - \sum_{i=1}^{L} a_i y_{t-i} - \sum_{j=1}^{P} \sum_{i=1}^{L} b_i^j x_{t-i}^j \right\|_2^2 + \lambda \sum_{i=1}^{L} |a_i| + \lambda \sum_{j=1}^{P} \sum_{i=1}^{L} |b_i^j| \tag{8}$$

where $\lambda$ is the scalar regularization parameter determining the sparseness (the percentage of non-zero elements) of $a_i$ and $b_i^j$. It is usually chosen by cross-validation. As shown in various research fields including gene data manipulation and brain data analysis (Valdés-Sosa et al., 2005; Lozano et al., 2009), the L1-penalty term of the Lasso Granger causality model can help to retrieve the most consequential factors from the original datasets. The minimization problem in (8) is a convex optimization problem (Boyd and Vandenberghe, 2004) and can be quickly solved via many algorithms. This makes the Lasso Granger causality model a useful tool for "Big Data" problems.

3.3. The robust Lasso Granger Causality regression model and the two-step solving strategy

In this paper, we choose different penalty functions for the Gaussian type fluctuations and bursts so that the disturbance of the bursts can be suppressed. This strategy leads to the following robust Lasso Granger causality regression model

$$\min_{a_i, b_i^j} H\!\left( y_t - \sum_{i=1}^{L} a_i y_{t-i} - \sum_{j=1}^{P} \sum_{i=1}^{L} b_i^j x_{t-i}^j \right) + \lambda \sum_{i=1}^{L} |a_i| + \lambda \sum_{j=1}^{P} \sum_{i=1}^{L} |b_i^j| \tag{9}$$

where the penalty function $H(\cdot)$ satisfies


$$H(w) = \begin{cases} w^2/2, & |w| \le m \\ m|w| - m^2/2, & |w| > m \end{cases} \tag{10}$$
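As a small numerical illustration of Eq. (10), the sketch below evaluates the penalty for one ordinary residual and one burst-sized residual. The function name `huber_penalty` and the values m = 2, 1 and 10 are chosen only for illustration.

```python
def huber_penalty(w, m):
    """Penalty H(w) of Eq. (10): quadratic for |w| <= m, linear beyond m."""
    if m <= 0:
        raise ValueError("m must be positive")
    magnitude = abs(w)
    if magnitude <= m:
        return 0.5 * w * w
    return m * magnitude - 0.5 * m * m

# With m = 2, a residual of 1 is penalised exactly as in least squares (0.5),
# while a residual of 10 costs 2*10 - 2 = 18 instead of the least-squares 50.
small = huber_penalty(1.0, 2.0)
large = huber_penalty(10.0, 2.0)
```

The two branches agree at |w| = m (both give m^2/2), so the penalty is continuous where it switches from quadratic to linear growth.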

This function imposes a least-square penalty for any residual whose magnitude is smaller than $m > 0$, but a linear penalty for any residual whose magnitude is larger than $m$. That is, the residuals larger than $m$ receive less consideration since they are assumed to be associated with outliers or bad data (Huber, 1964; Boyd and Vandenberghe, 2004). This method can help to reduce the impact of bursts in traffic predictions. However, if we directly solve the robust Lasso regression problem in (9) using a massive dataset, it will take a very long time and sometimes cause an out-of-memory problem. Therefore, we have developed the following two-step solving strategy.

In the first step, we aim to pick out a limited number of time series that are potentially dependent upon the time series under study, or destination time series. To reach this goal, we sequentially solve a series of standard Lasso regression problems with a fixed problem size to accommodate the limited memory of an ordinary PC. Here is an illustrative example: suppose we have 1000 time series arranged in order (one destination time series and 999 possibly dependent time series) and the number of truly dependent series is less than 100. We first solve a Lasso regression problem with the first 200 time series, and discard 100 series whose regression coefficients are all zeros. The remaining 100 time series, mixed with the next 100 series, are fed into a new Lasso regression. This process continues iteratively until all the formulated Lasso regression problems are solved, and the most relevant time series from the 1000 time series are identified.

To accelerate the processing, we use the Alternating Direction Method of Multipliers (ADMM) algorithm (Boyd et al., 2011) to solve these fixed-size Lasso regression problems. Let us rewrite the Lasso regression problem in (8) into the matrix form:

$$\vec{p} = \arg\min \; \frac{1}{2} \|A\vec{p} - \vec{q}\|_2^2 + \lambda \|\vec{p}\|_1 \tag{11}$$

where $\vec{p}$ denotes the regression coefficient vector that needs to be determined, and $A$ and $\vec{q}$ are the known matrix and vector, respectively, with appropriate dimensions. The ADMM algorithm first considers an equivalent convex problem of (11) as

$$\min_{\vec{p}} \; \frac{1}{2} \|A\vec{p} - \vec{q}\|_2^2 + \lambda \|\vec{z}\|_1 + \frac{\rho}{2} \|\vec{p} - \vec{z}\|_2^2 \quad \text{s.t.} \quad \vec{p} = \vec{z} \tag{12}$$

where $\rho > 0$ is a scalar regularization parameter that controls the rate of convergence of the ADMM method. From (12), we can derive an iterative solving procedure as

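$$\vec{p}^{k+1} = \arg\min \; \frac{1}{2} \|A\vec{p} - \vec{q}\|_2^2 + \frac{\rho}{2} \|\vec{p} - \vec{z}^k + \vec{u}^k\|_2^2 = (A^T A + \rho I)^{-1} \left[ A^T \vec{q} + \rho (\vec{z}^k - \vec{u}^k) \right] \tag{13}$$

$$\vec{z}^{k+1} = \arg\min \; \lambda \|\vec{z}\|_1 + \frac{\rho}{2} \|\vec{p}^{k+1} - \vec{z} + \vec{u}^k\|_2^2 = S_{\lambda/\rho}\left( \vec{p}^{k+1} + \vec{u}^k \right) \tag{14}$$

$$\vec{u}^{k+1} = \vec{u}^k + \vec{p}^{k+1} - \vec{z}^{k+1} \tag{15}$$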

where the soft threshold operator $S_{\lambda/\rho}(\cdot)$ is defined as

$$S_{\lambda/\rho}(\vec{w}) = (\vec{w} - \lambda/\rho)_+ - (-\vec{w} - \lambda/\rho)_+ \tag{16}$$
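The iteration in Eqs. (13)–(16) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the paper: the function names, the fixed iteration count and the small diagonal test problem are assumptions, and the default ρ = λ follows the setting stated later in this section.

```python
import numpy as np

def soft_threshold(w, kappa):
    """Element-wise soft threshold of Eq. (16): (w - kappa)_+ - (-w - kappa)_+."""
    return np.maximum(w - kappa, 0.0) - np.maximum(-w - kappa, 0.0)

def admm_lasso(A, q, lam, rho=None, n_iter=200):
    """Minimise 0.5*||A p - q||_2^2 + lam*||p||_1 with the updates (13)-(15)."""
    A = np.asarray(A, dtype=float)
    q = np.asarray(q, dtype=float)
    rho = lam if rho is None else rho
    n_coef = A.shape[1]
    z = np.zeros(n_coef)
    u = np.zeros(n_coef)
    # (A^T A + rho I) does not change between iterations, so it is inverted once.
    inv_term = np.linalg.inv(A.T @ A + rho * np.eye(n_coef))
    Atq = A.T @ q
    for _ in range(n_iter):
        p = inv_term @ (Atq + rho * (z - u))   # ridge-type p-update, Eq. (13)
        z = soft_threshold(p + u, lam / rho)   # shrinkage z-update, Eq. (14)
        u = u + p - z                          # scaled dual update, Eq. (15)
    return z  # z carries the exact zeros produced by the soft threshold

# Diagonal toy problem whose minimiser can be checked by hand:
# each coordinate solves 0.5*(2p - q_i)^2 + |p|, giving [2.75, 0, -1.75, 0].
A_toy = 2.0 * np.eye(4)
q_toy = np.array([6.0, 0.2, -4.0, 0.0])
p_hat = admm_lasso(A_toy, q_toy, lam=1.0)
```

Returning `z` rather than `p` is a design choice of this sketch: at convergence the two coincide, but only `z` is exactly sparse, which is convenient when series with all-zero coefficients are to be discarded.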

Since $\rho > 0$, $A^T A + \rho I$ is always invertible. Therefore, the $\vec{p}$-update rule given in Eq. (13) is indeed a ridge regression problem (Hoerl, 1962; Bickel et al., 2006). In other words, ADMM for Lasso problems can be interpreted as carrying out the ridge regression iteratively. This explains why the ADMM method can solve Lasso problems very quickly. Moreover, we introduce the following threshold strategy

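$$y_t = \begin{cases} 2\hat{\sigma}_{\text{Fluct}}, & y_t > 2\hat{\sigma}_{\text{Fluct}} \\ -2\hat{\sigma}_{\text{Fluct}}, & y_t < -2\hat{\sigma}_{\text{Fluct}} \end{cases} \tag{17}$$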

to pre-process the time series before feeding them into the Lasso regression. Tests show that this simple algorithm could help to suppress the impact of burst disturbance.

In the second step, we solve a robust Lasso regression problem for the time series selected in the first step. The data used in this step are the raw residual time series, which are not pre-processed by the threshold algorithm in (17). We still use the ADMM method to solve the robust Lasso regression problem in (9) and (10). We consider an equivalent convex problem in the matrix form expressed as

$$\min_{\vec{p}} \; H(\vec{z}) + \lambda \|\vec{p}\|_1 + \frac{\rho}{2} \|A\vec{p} - \vec{q} - \vec{z}\|_2^2 \quad \text{s.t.} \quad A\vec{p} - \vec{q} = \vec{z} \tag{18}$$


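We can solve this problem iteratively, where $\vec{p}^k$ is updated the same way as in Eq. (13) and $\vec{z}^k$, $\vec{u}^k$ are updated as

$$\vec{z}^{k+1} = \frac{\rho}{1+\rho} \left( A\vec{p}^{k+1} - \vec{q} + \vec{u}^k \right) + \frac{1}{1+\rho} S_{m+m/\rho}\left( A\vec{p}^{k+1} - \vec{q} + \vec{u}^k \right) \tag{19}$$

$$\vec{u}^{k+1} = \vec{u}^k + A\vec{p}^{k+1} - \vec{q} - \vec{z}^{k+1} \tag{20}$$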
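The z-update of Eq. (19) can be checked numerically with a short sketch. It covers only this single step, not the full robust solver; the function name `huber_z_update` and the values m = 2, ρ = 1 are illustrative assumptions.

```python
import numpy as np

def soft_threshold(w, kappa):
    """Element-wise soft threshold: (w - kappa)_+ - (-w - kappa)_+."""
    return np.maximum(w - kappa, 0.0) - np.maximum(-w - kappa, 0.0)

def huber_z_update(v, m, rho):
    """z-update of Eq. (19), where v stands for A p - q + u (element-wise)."""
    v = np.asarray(v, dtype=float)
    return (rho / (1.0 + rho)) * v + (1.0 / (1.0 + rho)) * soft_threshold(v, m + m / rho)

# With m = 2 and rho = 1 the threshold m + m/rho equals 4.
# A small entry (1) is scaled by rho/(1+rho); a burst-sized entry (10)
# is shifted towards zero by m/rho = 2 instead of being scaled.
z_new = huber_z_update([1.0, 10.0, -10.0], m=2.0, rho=1.0)
```

Entries of v inside the threshold are shrunk proportionally, as under a quadratic penalty, while entries beyond it are moved by the constant m/ρ, which mirrors the quadratic and linear regimes of the penalty in Eq. (10).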

To guarantee the convergence of the ADMM algorithm, we set $\rho = \lambda$ for the standard Lasso regression problem in (13)–(16), and for the robust Lasso regression problem, we adaptively change the scalar $\rho$ as

$$\rho_0 = 1, \quad \rho_{k+1} = \max\{5k, \; 1.01\rho_k\} \tag{21}$$

One of the most desirable features of the ADMM algorithm is that it can be run on several computers in parallel (Boyd et al., 2011). This allows us to tackle the Lasso regression problems using the data collected from thousands of sensors in an acceptable time. Therefore, the proposed model can handle the data sizes of up-to-date traffic prediction applications. Since the Lasso regression model cannot guarantee that all the found time series are causally dependent on the destination series, we need to run the one-by-one Granger-Wald hypothesis test on each of the remaining time series in order to find the most concise prediction model.

4. Test results

4.1. Testing dataset and settings

We choose the publicly available PeMS traffic datasets (PeMS) for the following test. The specific traffic network that we analyze is in the fourth district of the PeMS dataset, which is located in the Bay Area, Alameda, Oakland of the U.S. There are 1000 sensors selected in this study. The sampling period of the testing dataset is from August 1, 2011 to August 31, 2011, where the first three weeks of data are used as the training set for building the Granger causality model, and the last week of data is used as the testing set for traffic predictions. The raw data of PeMS are sampled every 30 s. The short-term prediction provides up-to-date information to traffic operators and travelers, and is thus much more valuable in practice. As suggested in many previous studies (Chen et al., 2012; Vlahogianni et al., 2014), we set the prediction time interval to be five minutes, i.e., we sum up the traffic flow records across all the lanes every five minutes for each sensor to formulate a traffic flow time series. We also set L = 10, which means the maximal time lag is 50 min. The influence of the time aggregation window will be discussed in Section 4.2.
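With the lag setting fixed, the one-by-one Granger-Wald test of Eqs. (3)–(5) can be sketched as follows. This is an illustrative version only: the function names and the synthetic series are hypothetical, ordinary least squares is assumed for both regressions, and N in Eq. (5) is taken as the length of the series. The decision compares the statistic with a Chi-squared critical value at the 95% level, which is 5.991 for 2 degrees of freedom and 18.307 for the 10 degrees of freedom implied by L = 10.

```python
import numpy as np

def lagged_matrix(series, max_lag):
    """Columns hold series[t-1], ..., series[t-max_lag] for t = max_lag, ..., n-1."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    return np.column_stack([series[max_lag - i:n - i] for i in range(1, max_lag + 1)])

def residual_variance(design, target):
    """Variance estimate of the least-squares residuals."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.mean((target - design @ coef) ** 2)

def granger_wald(x, y, max_lag):
    """Statistic of Eq. (5) for the hypothesis that x does not Granger cause y."""
    y = np.asarray(y, dtype=float)
    target = y[max_lag:]
    own_lags = lagged_matrix(y, max_lag)
    both_lags = np.hstack([own_lags, lagged_matrix(x, max_lag)])
    var_1 = residual_variance(own_lags, target)    # model (3)
    var_2 = residual_variance(both_lags, target)   # model (4)
    return len(y) * (var_1 - var_2) / var_2

# Synthetic check: y is driven by the previous value of x, but not vice versa.
rng = np.random.default_rng(0)
n = 500
x = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.standard_normal()
gw_xy = granger_wald(x, y, max_lag=2)   # tests "x Granger causes y"
gw_yx = granger_wald(y, x, max_lag=2)   # tests "y Granger causes x"
x_causes_y = gw_xy > 5.991              # 95% critical value, 2 degrees of freedom
```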
Furthermore, since we need a complete data set when building regression models, we use the simple average method to impute the missing data in the PeMS dataset. The simple average method is adopted mainly because the missing ratios for the selected sensors are sufficiently low. All the following discussions are based on the "imputed" dataset. Readers who are interested in missing data imputation are referred to our recent reports (Qu et al., 2009; Li et al., 2013; 2014a,b,c).

$\lambda$ is the most important parameter in the proposed method because it characterizes the sparseness of the Lasso regression model. On one hand, a small $\lambda$ could guarantee that all the causally dependent time series will not be mistakenly screened out during the aforementioned two-step Lasso regressions. On the other hand, a large $\lambda$ could help to discard most of the irrelevant time series without using the time-consuming one-by-one Granger-Wald hypothesis test. Since it is hard to set different values of $\lambda$ for each time series, we use the same $\lambda$ for all time series. To determine this value, we first randomly sample 50 time series from the entire 1000 time series, and use the one-by-one Granger-Wald hypothesis test to determine the number of dependent time series that they could have among the 1000 time series. The level of confidence is set to be 95%. As shown in Fig. 3, we find that all the 50 time series have fewer than 30 dependent time series. Next, based on the same 50 samples, we examine the change of the average number of the dependent time series with respect to $\lambda$. As shown in Fig. 4, we find that for $\lambda = 50{,}000$, the number of dependent time series should be less than 50. Thus, in the following studies, we set $\lambda = 50{,}000$ for the standard and robust Lasso regressions, and expect that only very few independent time series could pass the filters of the two-step Lasso regressions.

4.2. Test results on Granger Causal dependence

After the dependent time series are chosen, we select the multi-variable linear regression (MVLR) model as the prediction model. This model aims to set a linear dependence between the past data (including past data of the studied time series and the causally dependent time series) and the future data as:

$$y_t = \sum_{i=1}^{L} a_i y_{t-i} + \sum_{j=1}^{P} \sum_{i=1}^{L} b_i^j x_{t-i}^j + e_t \tag{22}$$

where $y_{t-i}$ represents the past traffic flow volume data of the studied time series recorded at time $t - i$, $x_{t-i}^j$ is the traffic volume of the $j$th causally dependent time series recorded at time $t - i$, $P$ denotes the number of the causally dependent time series, $L$ is the maximal time lag, $a_i$ and $b_i^j$ are the regression coefficients of the model, and $e_t$ denotes the modeling error.

Fig. 3. The number of the dependent time series for the sampled 50 time series (in descending order). [Figure not reproduced. Axes: the number of dependent series versus the index of the 50 time series.]

Clearly, if we do not consider the causally dependent time series in (22), the MVLR model degenerates to the classical autoregressive model (Brockwell and Davis, 1987; Chandra and Al-Deek, 2009). The coefficients of this model are calculated by solving the following optimization problem

ai ; bi ¼ arg min

X ðet Þ2

ð23Þ

t
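The least-squares fit behind Eqs. (22) and (23) can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `build_design` and `solve_normal_equations`, the synthetic noise-free data, and the choice to solve the normal equations by Gaussian elimination are all assumptions made for illustration.

```python
import random


def build_design(y, xs, max_lag):
    """Stack lagged regressors of Eq. (22): y_{t-1..t-L}, then x^j_{t-1..t-L}."""
    rows, targets = [], []
    for t in range(max_lag, len(y)):
        row = [y[t - i] for i in range(1, max_lag + 1)]
        for x in xs:  # one block of lags per causally dependent series
            row.extend(x[t - i] for i in range(1, max_lag + 1))
        rows.append(row)
        targets.append(y[t])
    return rows, targets


def solve_normal_equations(rows, targets):
    """Minimize the sum of squared errors of Eq. (23) via (X^T X) w = X^T y."""
    n = len(rows[0])
    aug = []
    for a in range(n):
        gram_row = [sum(r[a] * r[b] for r in rows) for b in range(n)]
        rhs = sum(r[a] * t for r, t in zip(rows, targets))
        aug.append(gram_row + [rhs])
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    weights = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * weights[c] for c in range(r + 1, n))
        weights[r] = (aug[r][n] - tail) / aug[r][r]
    return weights


# Hypothetical noise-free data: y_t = 0.5 y_{t-1} + 0.3 x_{t-1} (L = 1, P = 1).
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [0.0]
for t in range(1, 200):
    y.append(0.5 * y[t - 1] + 0.3 * x[t - 1])

rows, targets = build_design(y, [x], max_lag=1)
coeffs = solve_normal_equations(rows, targets)  # ordered as [a_1, b_1^1]
```

Because the synthetic data contain no modeling error, the fitted coefficients recover the generating values up to rounding; with real traffic data a library least-squares routine would normally replace the hand-written solver.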

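Similar to Chen et al. (2012) and Li et al. (2013), we use the mean square deviation (MSD) as the prediction performance index:

$$ \mathrm{MSD} = \frac{1}{k} \sum_{i=1}^{k} \left( y_{t_i}^{\mathrm{predicted}} - y_{t_i}^{\mathrm{observed}} \right)^2 \tag{24} $$

where $y_{t_i}^{\mathrm{predicted}}$ and $y_{t_i}^{\mathrm{observed}}$ denote the predicted and observed data, respectively. To compare the prediction performances of different models, we compute the relative MSD variation (RMV) as:

$$ \mathrm{RMV}_{\mathrm{model\,1}}^{\mathrm{model\,2}} = \frac{\mathrm{MSD}_{\mathrm{model\,1}} - \mathrm{MSD}_{\mathrm{model\,2}}}{\mathrm{MSD}_{\mathrm{model\,1}}} \tag{25} $$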

where $\mathrm{MSD}_{\mathrm{model\,1}}$ and $\mathrm{MSD}_{\mathrm{model\,2}}$ are the prediction performance indexes of model one and model two, respectively. If $\mathrm{RMV}_{\mathrm{model\,1}}^{\mathrm{model\,2}} > 0$, the prediction accuracy is improved by using model two instead of model one, and vice versa.

To justify using RMV in evaluating prediction models, we compare the MVLR model and the autoregressive MVLR (ARMVLR) model given below

$$ y_t = \sum_{i=1}^{p} a_i y_{t-i} + \sum_{j=1}^{P} \sum_{i=1}^{p} b_i^j x_{t-i}^j + \sum_{i=1}^{q} c_i e_{t-i} + e_t \tag{26} $$

where $a_i$, $b_i^j$ and $c_i$ are the coefficients of this model, and $e_{t-i}$ denotes the modeling error at time $t-i$. Given $p$, we determine the best value of $q$ using the Akaike Information Criterion (AIC). The coefficients of this model can be obtained through standard Box–Jenkins parameter optimization. Clearly, if we do not consider the causally dependent time series in (26), the ARMVLR model degenerates to the classical autoregressive moving average model (Brockwell and Davis, 1987; Chandra and Al-Deek, 2009).

Fig. 5 illustrates the prediction performance. As can be seen in Fig. 5(b), the prediction performance for 82.3% of the time series remains essentially the same for the two models, with −0.01 < RMV < 0.01. Based on our previous studies, e.g. Chen et al. (2012) and Li et al. (2013), the commonly used models are generally comparable in terms of prediction performance, and an appropriate selection of data sources would benefit all kinds of prediction models. Due to the length limit, we only present the test results of the MVLR model; similar tests can easily be extended to other models.

Fig. 4. The change of the average number of the dependent time series with respect to λ. (Horizontal axis: the value of λ, ×10^4; vertical axis: the number of dependent series.)
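As a small worked illustration of the performance indexes in Eqs. (24) and (25), the following sketch computes the MSD of two sets of predictions and the resulting RMV. The function names and the toy flow values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def msd(predicted, observed):
    """Mean square deviation of Eq. (24) over k paired samples."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def rmv(msd_model1, msd_model2):
    """Relative MSD variation of Eq. (25); positive when model two is better."""
    return (msd_model1 - msd_model2) / msd_model1


# Toy example (made-up flow volumes, not taken from the PeMS data).
observed = [100.0, 120.0, 90.0, 110.0]
pred_model1 = [110.0, 110.0, 100.0, 100.0]  # every error has magnitude 10
pred_model2 = [105.0, 115.0, 95.0, 105.0]   # every error has magnitude 5

msd1 = msd(pred_model1, observed)   # 100.0
msd2 = msd(pred_model2, observed)   # 25.0
improvement = rmv(msd1, msd2)       # 0.75
```

Here halving every prediction error reduces the MSD by a factor of four, so the RMV is 0.75; swapping the two models would give a negative RMV.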

Fig. 5. (a) A comparison of prediction MSD values for the MVLR model and the ARMVLR model over 100 randomly selected time series; (b) the RMVs for the MVLR model and the ARMVLR model over 1000 randomly selected time series. Model one uses the MVLR model, model two uses the ARMVLR model. (Panel (a): MSD versus index of time series; panel (b): relative MSD variation versus index of time series.)

Fig. 6. A comparison of prediction performance for models using causal dependence and spatial dependence on 1000 selected time series. Model one uses spatially nearest neighboring time series, and model two uses causally dependent time series. The aggregation time window is set to five minutes. (Axes: relative MSD variation versus index of time series.)

Fig. 7. (a) A comparison of prediction performances for models using the robust Lasso method and the standard Lasso regression method over 1000 selected time series. Consistent with Eq. (25) and the improvements reported in the text, model one uses the standard Lasso method, and model two uses the robust Lasso method. (b) An illustration of the unexpected pattern of traffic flow fluctuations. The aggregation time window is set to five minutes. (Panel (a): relative MSD variation versus index of time series; panel (b): deviations from the average trend versus time of the day, one unit = 5 min, with the ±2σ levels and the burst points marked.)


Fig. 8. (a) A comparison of prediction performance for models using causal dependence and spatial dependence over 1000 selected time series. (b) A comparison of prediction performances for models using the robust Lasso method and the standard Lasso regression method over 1000 selected time series. The aggregation time window is set to 15 min. (Axes in both panels: relative MSD variation versus index of time series.)

Fig. 9. The MSD prediction error indexes with respect to the number of time series used in the MVLR model for the eighth time series. There are 17 time series (whose MSD prediction error indexes are plotted in red) found to be causally dependent on this time series. The remaining MSD prediction error indexes are plotted in blue. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) (Axes: MSD versus the number of time series.)

Fig. 10. Two causality diagrams: in (a), the corresponding structural equation models are $v_B = \kappa_1 v_A$, $v_C = \kappa_2 v_B$; and in (b) the structural equation model is $v_C = \kappa_1 v_A + \kappa_2 v_B$, where $v_A$, $v_B$, $v_C$ denote three variables and $\kappa_1$, $\kappa_2$ are two proportional coefficients. An arrow X → Y designates that X drives Y and Y is driven by X.

Next, we demonstrate the benefits of determining causal dependence. Fig. 6 compares the prediction performance of models using causally dependent time series with that of models using spatially nearest neighboring time series (the same number of time series is used in both cases). It shows that for 30.2% of the time series, prediction accuracy is improved by more than 20% (RMV > 0.2); for 62.3% of the time series, prediction accuracy is improved by more than 10% (RMV > 0.1). Only 0.8% of the time series have RMV values roughly equal to zero, and no time series yields RMV < 0. Since unrelated data may lead to overtraining and thus bias the prediction results, we attribute the improvements in prediction accuracy to our method's effective removal of these unrelated data. Third, we illustrate the benefits of the robust Lasso regression method.

Fig. 11. The Granger causal dependence of 1000 time series.

Fig. 7(a) compares the performances of prediction models that use either the robust Lasso regression method or the standard Lasso regression method to find the causal dependence. Results indicate that, when the robust Lasso regression method is used instead of the standard one, 78.6% of the time series see their prediction accuracy improved, about 10.9% show no change in prediction accuracy, and the remaining 10.5% are degraded. Specifically, the prediction accuracy is improved by more than 10% (RMV > 0.1) for 5.3% of the time series, and by more than 5% (RMV > 0.05) for about 24% of the time series. In contrast, only 4.9% of the time series have RMVs below −0.01. This indicates that the robust Lasso regression method outperforms, or at least performs on par with, the standard Lasso regression method in most cases. Further statistics on the time series show that the success of the robust Lasso regression method mainly lies in its capability to handle the unexpected fluctuation patterns in some traffic flow time series; see Fig. 7(b) for an illustration. For the time series whose bursts noticeably bias the estimated coefficients of the standard Lasso regression, the robust Lasso regression method delivers significantly better results.

Finally, we investigate the possible influence of the aggregation time window. The above results are obtained when the aggregation time window is set to five minutes. Fig. 8 gives the corresponding results when the aggregation time window is changed to 15 min. We can see that while the benefits of determining causal dependence remain noticeable, those of the robust Lasso regression method become less significant. This can be attributed to the fact that aggregation with a large time window has a smoothing effect on traffic flow time series and thus may trivialize the burst components. We conclude that the proposed robust regression method is better suited for short-term traffic prediction applications.

4.3. From Granger causality to prediction

It must be pointed out that the above Granger model only helps us determine the limited causal dependence among massive time series. For traffic prediction purposes, a more concise model needs to be built based on the causally dependent time series identified. Fig. 9 shows the variation of prediction accuracy with respect to the number of time series used in the MVLR model for the eighth time series. The MVLR model is trained with the data collected in the first three weeks of August, 2011, and is tested with the data from the last week of the same month (Brockwell and Davis, 1987; Li et al., 2013). Results show that adding time series that are not causally dependent to the prediction model degrades the prediction because of a noticeable overtraining problem. More interestingly, although there are 17 time series causally dependent on the eighth series, the best MVLR model uses only eleven of them. This result indicates that even though some time series are causally dependent, they need not all be included in the prediction models.

Pearl gives a simple and intuitive explanation for this outcome (Pearl, 1988, 2009a,b). As shown in Fig. 10, even if a variable $v_C$ depends on two variables $v_A$ and $v_B$, the roles of $v_A$ and $v_B$ can be different. In Fig. 10(a), $v_B$ depends on $v_A$ as $v_B = \kappa_1 v_A$, while $v_C$ depends on $v_B$ as $v_C = \kappa_2 v_B$, where $\kappa_1$, $\kappa_2$ are two proportional coefficients. Therefore, when we aim to find a parsimonious prediction model, we should neglect $v_A$ since its influence on $v_C$ can be fully recovered from that of $v_B$. In Fig. 10(b), variable $v_C$ simultaneously depends on two variables $v_A$ and $v_B$ as $v_C = \kappa_1 v_A + \kappa_2 v_B$. The influences of $v_A$ and $v_B$ are separable and independent, and we should take both into account when we build prediction models. Fortunately, the proposed Granger causality model can significantly reduce the cost of screening redundant time series since it has already filtered out most of the irrelevant time series.
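The difference between the two diagrams of Fig. 10 can be checked numerically with a small simulation. This is only a sketch under stated assumptions: Gaussian noise is added to the structural equations (the paper's equations are noise-free), the coefficient values, noise levels and sample size are arbitrary, and the helper names are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_one(x, y):
    """Least-squares slope of y on x (no intercept)."""
    return dot(x, y) / dot(x, x)


def fit_two(x1, x2, y):
    """Least-squares weights of y on (x1, x2), no intercept, via 2x2 normal equations."""
    s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
    t1, t2 = dot(x1, y), dot(x2, y)
    det = s11 * s22 - s12 * s12
    return (t1 * s22 - t2 * s12) / det, (t2 * s11 - t1 * s12) / det


def msd(pred, obs):
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)


def compare(v_a, v_b, v_c):
    """Return (MSD using v_B only, MSD using v_A and v_B, fitted weight of v_A)."""
    w_b = fit_one(v_b, v_c)
    w_a2, w_b2 = fit_two(v_a, v_b, v_c)
    msd_b = msd([w_b * b for b in v_b], v_c)
    msd_ab = msd([w_a2 * a + w_b2 * b for a, b in zip(v_a, v_b)], v_c)
    return msd_b, msd_ab, w_a2


rng = random.Random(1)
n, kappa1, kappa2 = 2000, 2.0, 1.5  # arbitrary illustrative values


def noise(sd):
    return [rng.gauss(0.0, sd) for _ in range(n)]


# Fig. 10(a): chain v_A -> v_B -> v_C (noise terms are an added assumption).
a = noise(1.0)
b = [kappa1 * ai + e for ai, e in zip(a, noise(0.5))]
c = [kappa2 * bi + e for bi, e in zip(b, noise(0.1))]
chain = compare(a, b, c)

# Fig. 10(b): v_A and v_B drive v_C independently.
a2, b2 = noise(1.0), noise(1.0)
c2 = [kappa1 * ai + kappa2 * bi + e for ai, bi, e in zip(a2, b2, noise(0.1))]
fork = compare(a2, b2, c2)
```

In the chain case the fitted weight of $v_A$ stays close to zero and adding $v_A$ barely changes the MSD, whereas in case (b) dropping $v_A$ raises the MSD by orders of magnitude, which mirrors the argument for leaving some causally dependent series out of a parsimonious model.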
Test results show that the proposed Granger causality model is a useful tool for handling “Big Data” in traffic predictions.

4.4. Further discussions on Granger causal dependence

Fig. 11 shows the causal dependence of the 1000 time series under study. It is created by Pajek (Pajek), which is a program for analysis and visualization of large networks. The meanings of the different colors are as follows:

• Green-Yellow: time series whose out-degrees are between zero (including zero) and five.
• Yellow: time series whose out-degrees are between five (including five) and ten.
• Purple: time series whose out-degrees are between ten (including ten) and 15.
• Cyan: time series whose out-degrees are between 15 (including 15) and 20.
• Orange: time series whose out-degrees are between 20 (including 20) and 25.
• Thistle: time series whose out-degrees are between 25 (including 25) and 30.
• Salmon: time series whose out-degrees are between 30 (including 30) and 35.
• Red: time series whose out-degrees are between 35 (including 35) and 40.
• Red-Violet: time series whose out-degrees are between 40 (including 40) and 45.
• Brick Red: time series whose out-degrees are larger than 45.

Fig. 12. The empirical distributions of (a) in-degree and (b) out-degree for the Granger causal dependence shown in Fig. 11. (Panel (a): probability distribution versus the value of in-degree; panel (b): probability distribution versus the value of out-degree.)

Fig. 13. The spatial locations of the dominant time series (plotted as red dots) and the rest of time series (plotted as green dots). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
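The color legend above amounts to binning each node's out-degree in steps of five. A minimal sketch of that mapping is given below; the edge-list representation, the function names and the sensor labels are hypothetical, and because the legend does not assign an out-degree of exactly 45, the sketch places it in the last bin as an explicit assumption.

```python
from collections import Counter

# Legend of Fig. 11, ordered by out-degree bin [0,5), [5,10), ..., [40,45), then 45+.
COLOR_NAMES = [
    "Green-Yellow", "Yellow", "Purple", "Cyan", "Orange",
    "Thistle", "Salmon", "Red", "Red-Violet", "Brick Red",
]


def out_degrees(edges, nodes):
    """Count, for every node, how many other series it drives (arrow source -> target)."""
    counts = Counter(source for source, _target in edges)
    return {node: counts.get(node, 0) for node in nodes}


def color_for(out_degree):
    """Map an out-degree to its legend color (45 itself goes to the last bin by assumption)."""
    if out_degree < 0:
        raise ValueError("out-degree cannot be negative")
    return COLOR_NAMES[min(out_degree // 5, len(COLOR_NAMES) - 1)]


# Tiny made-up dependence graph: s1 drives s2 and s3, s2 drives s3.
nodes = ["s1", "s2", "s3"]
edges = [("s1", "s2"), ("s1", "s3"), ("s2", "s3")]
degrees = out_degrees(edges, nodes)
colors = {node: color_for(deg) for node, deg in degrees.items()}
```

In-degrees can be obtained the same way by counting arrow targets instead of sources.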

Fig. 12 further elaborates the empirical distributions of (a) in-degree and (b) out-degree for the causal dependence in Fig. 11. We can see that the in-degrees of most time series are smaller than ten, which means that most time series are influenced by fewer than ten neighbors. On the other hand, there are a few time series whose out-degrees are larger than 30, which indicates that these dominant time series can influence more than 30 neighbors.

A question naturally arises as to what causes such different dependence relations. Fig. 13 shows the spatial locations of the dominant and remaining time series. We can see that these dominant time series are mainly collected around the merging points of arterials. The traffic that passes through these points contains flow through some upstream points and downstream points. It is conceivable that the time series collected at these points may be “causally” dependent on many time series collected at either upstream or downstream points. Similar to the conclusions given in Li et al. (2013), our findings indicate that there exist tight relations between the structure of road networks and the structure of traffic prediction models. Since a thorough analysis of how to make full use of this property requires a dedicated paper, we will discuss it in our coming reports.

5. Conclusions

In this paper, we discuss the important traffic prediction problem with a special emphasis on the recently emerging “Big Data”. Different from many existing approaches that focus on how to strengthen the computational efficiency of the prediction models with “Big Data” as the direct input, our method addresses how to pick the most relevant data from the “Big Data” to build parsimonious prediction models. Based on the Granger causality theory, we propose a set of Lasso Granger causality models to screen out most irrelevant or redundant data. The corresponding algorithm has good scalability, robustness and real-time responsiveness, and can lead to a traffic prediction model that strikes a good balance between model complexity and model performance.

One remaining limitation is that we only discuss the possible linear causal dependence between time series in this paper, because the original Granger causality model (Granger, 1969, 1980) was defined on linear regressions. A possible modification is to further consider nonlinear regressions implemented via the kernel trick (Li et al., 2013; Yu and Lam, 2014). For instance, we can consider the kernel-type Lasso regression models (Roth, 2004) or group Lasso regression models, which have been shown to be equivalent to multiple kernel learning (Bach, 2008). However, applying the kernel trick will significantly increase the computational cost and running time. The performance-to-cost ratio of the kernel trick remains to be further studied.

Besides, the obtained causal dependence graph sheds light on the implicit relationships between the structure of road networks and the correlations of traffic time series. Constrained by the page limit, we would like to discuss how to combine the structure of road networks and the structure of traffic prediction models in our coming reports.

Acknowledgements

This work was supported in part by the National Science and Technology Support Program 2013BAG18B00, National Natural Science Foundation of China 51278280, National Basic Research Program of China (973 Project) 2012CB725405, and Project Supported by Tsinghua University (20131089307).

Appendix A.

The alternating direction method of multipliers (ADMM) is a simple algorithm for solving large-scale convex optimization problems. It follows the decomposition-coordination procedure and combines the salient properties of dual decomposition and the method of multipliers. Consider the following convex problem

$$ \min_{\vec{x},\,\vec{z}} \; f(\vec{x}) + h(\vec{z}) \quad \text{s.t.} \quad A\vec{x} + B\vec{z} = \vec{b} \tag{A.1} $$

Using the method of multipliers, the augmented Lagrangian function associated with (A.1) can be formulated as

$$ L_\rho(\vec{x}, \vec{z}, \vec{a}) = f(\vec{x}) + h(\vec{z}) + \vec{a}^{\top}(A\vec{x} + B\vec{z} - \vec{b}) + (\rho/2)\,\|A\vec{x} + B\vec{z} - \vec{b}\|_2^2 \tag{A.2} $$

The optimal solutions of $\vec{x}$, $\vec{z}$, $\vec{a}$ can be obtained by applying the dual ascent and dual decomposition methods iteratively to the following problems

$$ \vec{x}^{k+1} = \arg\min_{\vec{x}} L_\rho(\vec{x}, \vec{z}^{k}, \vec{a}^{k}) \tag{A.3} $$

$$ \vec{z}^{k+1} = \arg\min_{\vec{z}} L_\rho(\vec{x}^{k+1}, \vec{z}, \vec{a}^{k}) \tag{A.4} $$

$$ \vec{a}^{k+1} = \vec{a}^{k} + \rho\left(A\vec{x}^{k+1} + B\vec{z}^{k+1} - \vec{b}\right) \tag{A.5} $$

Moreover, if we define $\vec{r} = A\vec{x} + B\vec{z} - \vec{b}$, then

$$ \begin{aligned} \vec{a}^{\top}(A\vec{x} + B\vec{z} - \vec{b}) + (\rho/2)\,\|A\vec{x} + B\vec{z} - \vec{b}\|_2^2 &= \vec{a}^{\top}\vec{r} + (\rho/2)\,\|\vec{r}\|_2^2 \\ &= (\rho/2)\,\|\vec{r} + (1/\rho)\vec{a}\|_2^2 - (1/2\rho)\,\|\vec{a}\|_2^2 \\ &= (\rho/2)\,\|\vec{r} + \vec{u}\|_2^2 - (\rho/2)\,\|\vec{u}\|_2^2 \end{aligned} \tag{A.6} $$

where $\vec{u} = (1/\rho)\vec{a}$ is the scaled dual vector. With the introduction of $\vec{u}$, we can rewrite ADMM in an equivalent form as

$$ \vec{x}^{k+1} = \arg\min_{\vec{x}} \left( f(\vec{x}) + (\rho/2)\,\|A\vec{x} + B\vec{z}^{k} - \vec{b} + \vec{u}^{k}\|_2^2 \right) \tag{A.7} $$

$$ \vec{z}^{k+1} = \arg\min_{\vec{z}} \left( h(\vec{z}) + (\rho/2)\,\|A\vec{x}^{k+1} + B\vec{z} - \vec{b} + \vec{u}^{k}\|_2^2 \right) \tag{A.8} $$

$$ \vec{u}^{k+1} = \vec{u}^{k} + A\vec{x}^{k+1} + B\vec{z}^{k+1} - \vec{b} \tag{A.9} $$
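The scaled-form iterations (A.7)–(A.9) can be traced on a toy problem whose answer is known in closed form. The sketch below is an illustration only, not the robust Lasso solver of the paper: it assumes $f(\vec{x}) = \tfrac{1}{2}\|\vec{x} - \vec{v}\|_2^2$, $h(\vec{z}) = \lambda\|\vec{z}\|_1$, $A = I$, $B = -I$ and $\vec{b} = 0$, for which both sub-problems have closed-form updates and the minimizer is the soft-thresholded $\vec{v}$. The function names and input values are hypothetical.

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |.|: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def admm_l1_denoise(v, lam, rho=1.0, iterations=100):
    """Scaled ADMM for min 0.5*||x - v||^2 + lam*||z||_1 subject to x - z = 0."""
    n = len(v)
    x, z, u = [0.0] * n, [0.0] * n, [0.0] * n
    for _ in range(iterations):
        # (A.7): quadratic in x, solved in closed form element by element.
        x = [(vi + rho * (zi - ui)) / (1.0 + rho) for vi, zi, ui in zip(v, z, u)]
        # (A.8): l1 term plus quadratic penalty, solved by soft-thresholding.
        z = [soft_threshold(xi + ui, lam / rho) for xi, ui in zip(x, u)]
        # (A.9): scaled dual update with the residual x - z.
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]
    return z


v = [3.0, 0.5, -2.0]
solution = admm_l1_denoise(v, lam=1.0)
expected = [soft_threshold(vi, 1.0) for vi in v]  # [2.0, 0.0, -1.0]
```

For this separable problem the iterates approach the soft-thresholded vector geometrically; for a Lasso regression the x-update would instead require solving a linear system involving the design matrix.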

The convergence analysis of ADMM can be found in Boyd et al. (2011).

References

Arnold, A., Liu, Y., Abe, N., 2007. Temporal causal modeling with graphical Granger methods. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 66–75.
Asimakopoulos, I., Ayling, D., Mahmood, W.M., 2000. Non-linear Granger causality in the currency futures returns. Econ. Lett. 68 (1), 25–30.
Bach, F.R., 2008. Consistency of the group Lasso and multiple kernel learning. J. Machine Learning Res. 9, 1179–1225.
Bhaskar, A., Chung, E., Dumont, A.-G., 2011. Fusing loop detector and probe vehicle data to estimate travel time statistics on signalized urban networks. Comput.-Aided Civil Infrastruct. Eng. 26 (6), 433–450.
Bickel, P.J., Li, B., Tsybakov, A.B., Van de Geer, S.A., Yu, B., Valdes, T., Rivero, C., Fan, J., Van der Vaart, A., 2006. Regularization in statistics. Test 15 (2), 271–344.
Boyd, S., Vandenberghe, L., 2004. Convex Optimization. Cambridge University Press.
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends® Machine Learning 3 (1), 1–122.
Bressler, S.L., Seth, A.K., 2011. Wiener-Granger Causality: A well established methodology. NeuroImage 58 (2), 323–329.
Brockwell, P.J., Davis, R.A., 1987. Time Series: Theory and Methods. Springer-Verlag.
Chandra, S.R., Al-Deek, H., 2009. Predictions of freeway traffic speeds and volumes using vector autoregressive models. J. Intell. Transport. Syst. 13 (2), 53–72.
Chen, C., Wang, Y., Li, L., Hu, J., Zhang, Z., 2012. The retrieval of intra-day trend and its influence on traffic prediction. Transport. Res. Part C: Emerg. Technol. 22, 103–118.
Chen, C., Liu, Z., Lin, W.-H., Li, S., Wang, K., 2013. Distributed modeling in a mapreduce framework for data-driven traffic flow forecasting. IEEE Trans. Intell. Transp. Syst. 14 (1), 22–33.
Chrobok, R., Kaumann, O., Wahle, J., Schreckenberg, M., 2004. Different methods of traffic forecast based on real data. Eur. J. Oper. Res. 155 (3), 558–568.
Faouzi, N.-E. El, Leung, H., Kurian, A., 2011. Data fusion in intelligent transportation systems: Progress and challenges – a survey. Inform. Fusion 12 (1), 4–10.
Friedman, J., Hastie, T., Tibshirani, R., 2008. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 (3), 432–441.
Ghosh, B., Basu, B., O’Mahony, M., 2010. Random process model for urban traffic flow using a wavelet-Bayesian hierarchical technique. Comput.-Aided Civil Infrastruct. Eng. 25 (8), 613–624.
Granger, C.W.J., 1969. Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37 (3), 424–438.
Granger, C.W.J., 1980. Testing for causality: a personal viewpoint. J. Econ. Dynam. Control 2, 329–352.
Heilmann, B., El Faouzi, N.-E., de Mouzon, O., Hainitz, N., Koller, H., Bauer, D., Antoniou, C., 2011. Predicting motorway traffic performance by data fusion of local sensor data and electronic toll collection data. Comput.-Aided Civil Infrastruct. Eng. 26 (6), 451–463.
Hlaváčková-Schindler, K., Paluš, M., Vejmelka, M., Bhattacharya, J., 2007. Causality detection based on information-theoretic approaches in time series analysis. Phys. Rep. 441 (1), 1–46.
Hoerl, A.E., 1962. Application of ridge analysis to regression problems. Chem. Eng. Prog. 58 (3), 54–59.
Huber, P.J., 1964. Robust estimation of a location parameter. Ann. Math. Stat. 35 (1), 73–101.
Jiang, X., Adeli, H., 2005. Dynamic Wavelet Neural Network model for traffic flow forecasting. ASCE J. Transport. Eng. 131 (10), 771–779.
Kamarianakis, Y., Shen, W., Wynter, L., 2012. Real-time road traffic forecasting using regime-switching space-time models and adaptive LASSO. Appl. Stoch. Models Bus. Ind. 28 (4), 297–315.
Li, L., Li, Y., Li, Z., 2013. Efficient missing data imputing for traffic flow by considering temporal and spatial dependence. Transport. Res. Part C: Emerg. Technol. 34, 108–120.
Li, Y., Li, Z., Li, L., 2014a. Missing traffic data: comparison of imputation methods. IET Intell. Transport. Syst. 8 (1), 51–57.
Li, L., Su, X., Zhang, Y., Hu, J., Li, Z., 2014b. Traffic prediction, data compression, abnormal data detection and missing data imputation: an integrated study based on the decomposition of traffic time series. Proc. IEEE Conf. Intell. Transport. Syst., 282–289.
Li, Z., Li, Y., Li, L., 2014c. A comparison of detrending models and multi-regime models for traffic flow prediction. IEEE Intell. Transp. Syst. Mag. 6 (4), 34–44.
Lozano, A.C., Abe, N., Liu, Y., Rosset, S., 2009. Grouped graphical Granger modeling for gene expression regulatory networks discovery. Bioinformatics 25 (12), i110–i118.
Otter, P.W., 1991. On Wiener-Granger causality, information and canonical correlation. Econ. Lett. 35 (2), 187–191.
Pajek, .
Pearl, J., 1988. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA.
Pearl, J., 2009a. Causal inference in statistics: an overview. Statist. Surv. 3, 96–146.
Pearl, J., 2009b. Causality: Models, Reasoning, and Inference, second ed. Cambridge University Press, New York.
PeMS, California Performance Measurement System. .
Qu, L., Li, L., Zhang, Y., Hu, J., 2009. PPCA-Based missing data imputation for traffic flow volume: a systematical approach. IEEE Trans. Intell. Transp. Syst. 10 (3), 512–522.
Roth, V., 2004. The generalized Lasso. IEEE Trans. Neural Networks 15 (1), 16–28.
Smith, B.L., Williams, B.M., Oswald, R.K., 2002. Comparison of parametric and nonparametric models for traffic flow forecasting. Transport. Res. Part C: Emerg. Technol. 10 (4), 303–321.
Stathopoulos, A., Karlaftis, M.G., Dimitriou, L., 2010. Fuzzy rule-based system approach to combining traffic count forecasts. Transp. Res. Rec. 2183, 120–128.
Sun, S., Huang, R., Gao, Y., 2012. Network-Scale traffic modeling and forecasting with Graphical Lasso and neural networks. ASCE J. Transport. Eng. 138 (11), 1358–1367.
Valdés-Sosa, P.A., Sánchez-Bornot, J.M., Lage-Castellanos, A., Vega-Hernández, M., Bosch-Bayard, J., Melie-García, L., Canales-Rodríguez, E., 2005. Estimating brain functional connectivity with sparse multivariate autoregression. Philos. Trans. Royal Soc. B 360 (1457), 969–981.
Van Lint, J.W.C., Hoogendoorn, S.P., 2010. A robust and efficient method for fusing heterogeneous data from traffic sensors on freeways. Comput.-Aided Civil Infrastruct. Eng. 25 (8), 596–612.
Vlahogianni, E.I., Golias, J.C., Karlaftis, M.G., 2004. Short-term traffic forecasting: overview of objectives and methods. Transport Review 24, 533–557.
Vlahogianni, E.I., Karlaftis, M.G., Golias, J.C., 2008. Temporal evolution of short-term urban traffic flow: a non-linear dynamics approach. Comput.-Aided Civil Infrastruct. Eng. 22 (5), 317–325.
Vlahogianni, E.I., Karlaftis, M.G., Golias, J.C., 2014. Short-term traffic forecasting: where we are and where we’re going. Transport. Res. Part C: Emerg. Technol. 43 (Part 1), 3–19.
Wiener, N., 1956. In: Beckenbach, E. (Ed.), The Theory of Prediction, Modern Mathematics for Engineers. McGraw-Hill, New York.
Yu, C., Lam, A.C., 2014. Applying multiple kernel learning and support vector machine for solving the multicriteria and nonlinearity problems of traffic flow prediction. J. Adv. Transport. 48 (3), 250–271.
Zeng, X., Zhang, Y., 2013. Development of Recurrent Neural Network considering temporal-spatial input dynamics for freeway travel time modeling. Comput.-Aided Civil Infrastruct. Eng. 28 (5), 359–371.
Zhang, J., Wang, F.-Y., Wang, K., Lin, W.-H., Xu, X., Chen, C., 2011. Data-driven intelligent transportation systems: a survey. IEEE Trans. Intell. Transp. Syst. 12 (4), 1624–1639.
Zhang, Y., Zhang, Y., Haghani, A., 2014. A hybrid short-term traffic flow forecasting method based on spectral analysis and statistical volatility model. Transport. Res. Part C: Emerg. Technol. 43 (Part 1), 65–78.
