Engineering Applications of Artificial Intelligence 77 (2019) 283–310
Stability-based Dynamic Bayesian Network method for dynamic data mining

Mohamed Naili a,∗, Mustapha Bourahla b, Makhlouf Naili c, AbdelKamel Tari d

a Department of Computer Science, Faculty of Mathematics and Informatics, University of Bordj Bou Arreridj, 34030 Bordj Bou Arreridj, Algeria
b Department of Computer Science, University of M'Sila, 28000 M'Sila, Algeria
c Department of Computer Science, University of Biskra, 07000 Biskra, Algeria
d Laboratory of Medical Computing (LIMED), Faculty of Fundamental Sciences, University of Bejaia, 06000 Bejaia, Algeria

∗ Corresponding author. E-mail address: [email protected] (M. Naili).
https://doi.org/10.1016/j.engappai.2018.09.016
Received 27 September 2017; Received in revised form 12 September 2018; Accepted 23 September 2018
0952-1976/© 2018 Elsevier Ltd. All rights reserved.
ARTICLE INFO
Keywords: Dynamic data mining; Dynamic model; Stability; Dynamic Bayesian Network; Grow-Shrink algorithm; Modeling and simulation
ABSTRACT

In this article we introduce a new stability-based Dynamic Bayesian Network method for dynamic systems represented by their time series. Based on the Grow-Shrink algorithm and the stability of the network through time, new variables and arcs can be added to the network in order to generate missing data or predict future values. The concept of stability in the network is maintained through a stability matrix, which contains learned values that indicate the strength of the dependencies between variables over time. Moreover, we present the application of the proposed method to the problem of prediction in a real-life air quality case study, in which we try to predict the level of carbon monoxide in the air, comparing the results obtained using the proposed method with those obtained using the Vector Autoregression model.
1. Introduction

One of the major challenges to address in a dynamic system is predicting missing or future data. Unfortunately, most real-life systems require a considerable time to collect sufficient data (with a considerable possibility of missing data) before any analysis or data mining can be carried out to extract useful knowledge. Variation in real-life processes makes the integration of new methods a necessity. Several methods have already been proposed. These include the Autoregressive Integrated Moving Average (ARIMA) and Vector Autoregression (VAR) models, Artificial Neural Networks (ANN) and the Bayesian network (BN) with its dynamic extension, known as the Dynamic Bayesian Network (DBN).

A classical way to handle the prediction problem is the ARIMA model. An ARIMA model describes the relationship between the current values of a given variable and its previous values and errors, in order to forecast future estimations. Several publications have shown how such models can be used (either alone or in combination with other types of models) in traffic flow, water quality, cloud computing and economics (Williams and Hoel, 2003; Ömer Faruk, 2010; Zhang, 2003; Contreras et al., 2003; Calheiros et al., 2015; Babu and Reddy, 2014; Christodoulos et al., 2010; Narendra Babu and Eswara Reddy, 2015). One important drawback of ARIMA models is the difficulty of representing the relationships between different variables in a compact form, which stems from the fact that the ARIMA model was conceived mainly for univariate problems. Because of this limitation, the VAR model (Hamilton, 1994; Montgomery et al., 2008) has been proposed to deal with multivariate time series by modeling the underlying dependencies between variables. In this model, a given variable can be predicted on the basis of its previous values and those of the other variables.

Artificial Neural Network (ANN) models have been designed to process a set of inputs through an input layer in order to calculate outputs. This is achieved by modeling the possibly non-linear relations between inputs through a set of functions and connections between the ''neurons'' of the network. Despite the difficulty of explaining the parameters learned during the training phase, ANN models have demonstrated a strong ability to model real-life static and dynamic systems such as flood flow, energy consumption forecasting, stock market prediction and so on (Foster et al., 1992; Amrouche and Le Pivert, 2014; Qiu et al., 2016; Chae et al., 2016).

The necessity to explain outputs has driven many researchers towards the use of Bayesian network models. A Bayesian network is a Directed Acyclic Graph (DAG) used to represent dependencies among a set of variables. Bayesian networks, and especially their temporal extension, the Dynamic Bayesian Network (Robinson and Hartemink, 2008), have been used for financial purposes (Kita et al., 2012), in healthcare and biological systems (Sandri et al., 2014; van der Heijden et al., 2014; Acerbi et al., 2016), in traffic flow forecasting (Queen and Albers, 2009) and monitoring (Lv et al., 2013; Wenhui et al., 2005; An et al., 2013; Cheng et al., 2012), and in many other applications.

As mentioned previously, VAR models are used to predict a given variable's future values on the basis of its own and other variables' previous values.
However, on devices with limited resources, especially memory, the large amount of data needed to reach an accurate estimation of the model's parameters can be a very serious problem. The problem of storage space is also encountered when we calculate the conditional probability tables of a given Bayesian network, so what could we propose to overcome such a problem? To tackle this challenge, we propose a Dynamic Bayesian Network based on the variation in the strength of the dependencies between variables over time, where we keep track of these values rather than of the whole history of the variables' values.

In the rest of this article, we discuss various concepts and algorithms within the framework of Bayesian network modeling. In the following section, we explain in more detail the most important notions of Bayesian network models, which form the main frame of the model proposed in this work. In Section 3, we give a brief presentation of the main idea behind the Grow-Shrink algorithm. Section 4 is dedicated to explaining our model, the Stability-based Dynamic Grow-Shrink model (SDGS), proposed to model a dynamic system. In order to assess the performance of the proposed method, in Section 5 we tackle a real-life problem regarding air quality. We close with a brief conclusion in Section 6.

2. Background

As a Directed Acyclic Graph (DAG), a Bayesian network is used as a probabilistic graphical model to represent dependencies among a set of variables. In this type of DAG, variables are represented by nodes, whereas relations between them are represented by edges. With each node we associate a conditional probability table, which represents the probability distribution of the corresponding variable and captures the dependencies between this variable and its parents (the nodes with an edge pointing to this variable's node). Bayesian networks have been widely used to model structured probability distributions and to extract the underlying dependencies among the attributes of a given dataset, and then to use these dependencies to answer forecasting questions or to find the most probable explanation for a sequence of events.

An important concept in Bayesian networks is D-separation. Directional separation (Pearl, 1988, 2009), denoted D-separation, is a set of rules which helps us discover relations between variables. Let 𝑋, 𝑌 and 𝑍 be three subsets of nodes in a given DAG. We say that 𝑍 d-separates 𝑋 from 𝑌 (denoted 𝑋 ⊥ 𝑌 | 𝑍) if for each node 𝑥 in 𝑋 and each node 𝑦 in 𝑌 there exists a node 𝑧 in 𝑍 which satisfies one of the situations represented in Fig. 1 (a shaded node means that the node is observed).

Fig. 1. D-separation rules.

The conditional probability of two independent sets of random variables 𝐴 and 𝐵 given another set 𝐶 can be formulated as follows (Eq. (1)):

$$P(A, B \mid C) = P(A \mid C) \cdot P(B \mid C) \quad (1)$$

The probability of a set of variables in a Bayesian network is the product of the conditional probabilities of each variable in the set given its parents' values. This property can be formulated as follows (Eq. (2)):

$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(x_i)) \quad (2)$$

Bayesian networks are characterized by the following properties. The causal sufficiency property states that variables which are parents of observed variable(s) of the domain must not be hidden. The causal Markov property determines the independence of a given variable from all its non-descendants given its parents. With the faithfulness property, the structure of the Bayesian network and the probability distribution agree on the dependence/independence relations discovered among the variables; in such a case, the probability distribution and the structure are said to be faithful to each other (Margaritis, 2003). The Markov boundary 𝐵(𝑥) is the smallest family of a given variable 𝑥, which consists of its parents, its partners (the other parents of its children) and its children. This family separates (shields) the node 𝑥 from all the other variable nodes. In other words, if 𝑥 is a variable from the set of variables 𝑈, the Markov boundary is the minimal Markov blanket of 𝑥, that is, the minimal set of variables 𝐵𝐿(𝑥) ⊆ 𝑈 (it may not be unique) which makes 𝑥 independent of any other variable 𝑦 ∉ {𝐵𝐿(𝑥) ∪ {𝑥}} (Eq. (3)):

$$x \perp y \mid BL(x) \quad (3)$$

The Bayesian network's learning process includes learning both the structure and the parameters of the network.

2.1. Learning Bayesian network parameters

In order to learn Bayesian network parameters, two main methods have been widely used: Maximum Log-likelihood and Maximum A Posteriori (MAP). The Maximum Likelihood method is based on finding the set of parameters which maximizes the probability of seeing the values observed in a given dataset. For example, if we are dealing with independent and identically distributed Bernoulli random variables 𝑋 (𝑥 ∈ {0, 1}), then the best value for the Bernoulli parameter 𝜇 is the value which maximizes the likelihood function for a vector 𝑋 of 𝑛 elements (Eq. (4)):

$$L(X : \mu) = \prod_{i=1}^{n} \mu^{x_i}(1 - \mu)^{1 - x_i} \quad (4)$$

So the best value for 𝜇 is the one which maximizes Eq. (4). We usually maximize the log-likelihood function rather than the likelihood itself, because differentiating a log function is much easier. As we can notice, with the likelihood/log-likelihood function it is not possible to involve any prior knowledge about the possible values of the parameters, which can be considered a disadvantage. In order to overcome this limitation, Maximum A Posteriori estimation allows such knowledge fusion. MAP estimation is based on Bayes' rule (Eq. (5)), in which it is possible to involve what we know about the possible values of a parameter 𝜇 in the equation used to find the value that maximizes the probability of observing a set of events 𝑋 (considered as independent and identically distributed random variables):

$$p(\mu \mid X) = \frac{p(X \mid \mu)\, p(\mu)}{p(X)} \quad (5)$$

As we can notice in Eq. (5), the posterior probability of 𝜇 depends only on the likelihood 𝑝(𝑋|𝜇) and the prior knowledge 𝑝(𝜇) about the value of 𝜇. The parameters that maximize the product of the likelihood and the prior knowledge will automatically maximize the posterior probability estimate (Eq. (6)):

$$\hat{\mu} = \underset{\mu}{\operatorname{argmax}}\ p(X \mid \mu)\, p(\mu) \quad (6)$$
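To make the two estimators concrete, the following short R sketch (R being the platform used for the experiments in Section 5) computes the maximum likelihood and MAP estimates of a Bernoulli parameter on simulated data. The simulated sample and the Beta(a, b) prior with its hyper-parameters are illustrative assumptions of ours, not part of the original method.

```r
# Illustrative sketch: ML vs MAP estimation of a Bernoulli parameter mu.
set.seed(1)
x <- rbinom(50, size = 1, prob = 0.3)   # simulated i.i.d. Bernoulli sample

# Maximum likelihood: the value maximizing Eq. (4); for Bernoulli data it is the sample mean.
mu_ml <- mean(x)

# MAP with a (hypothetical) Beta(a, b) prior on mu:
# mu_MAP maximizes p(X|mu)p(mu), giving (sum(x) + a - 1) / (n + a + b - 2).
a <- 2; b <- 5
mu_map <- (sum(x) + a - 1) / (length(x) + a + b - 2)

c(ML = mu_ml, MAP = mu_map)
```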
2.2. Structure learning
Learning the structure means extracting the underlying structure of the Bayesian network from a dataset. Many methods have been proposed; generally, these methods are either constraint-based search methods, search-and-score methods or hybrid methods (Noble and Koski, 2012).
The constraint-based search algorithms are composed of two steps: first, we carry out conditional independence tests, then we define the orientation of the non-oriented edges (Chow and Liu, 1968; Kirshner et al., 2004; Tsamardinos et al., 2006). In the first step, we carry out statistical conditional independence tests for all pairs of variables in order to extract the skeleton of the underlying Bayesian network. Recall that the skeleton is an undirected graph in which each conditional dependence between a pair of variables is represented by an undirected edge. This requires an important amount of data in order to find the real relations between variables. After that, the structures have to be identified and the edges directed, either on the basis of the conditional independences and the structures previously found (Partially Directed Acyclic Graph, PDAG) or randomly, keeping in mind that the final graph has to be a DAG (no cycles). We can also find other algorithms of this family, especially those based on discovering the Markov blanket of the variables, for instance the Incremental Association Markov blanket algorithm (IAMB) (Tsamardinos et al., 2003) and the Fast Incremental Association algorithm (Fast-IAMB) (Yaramakala and Margaritis, 2005). Also, in 2003, Margaritis (Margaritis, 2003) proposed another algorithm based on the variable's Markov blanket; we discuss this algorithm in Section 3.

The algorithms based on search and score rely on three parts (de Jongh, 2014). The first part is a search space model, which consists of all possible DAGs over the set of variables, explored using algorithms such as Greedy Search (Chickering, 2003) or genetic algorithms (Srinivas and Patnaik, 1994; Larranaga et al., 1996), where new DAGs can be created through crossover and mutation. The second part is a set of three operators (add arc, remove arc and reverse arc) which allows going from one DAG to another. The third part uses a score function, such as the Akaike Information Criterion (AIC) (Akaike, 1974) or the Bayesian Information Criterion (BIC) (Schwarz, 1978; Liu et al., 2012), which penalizes the complexity (number of parameters), to assess each DAG created after adding, removing or reversing an arc. Generally, these score functions yield a metric value which reflects the goodness of fit of a model (DAG) in representing the underlying model of the system. Besides these two functions, we can find others, such as the Bayesian Dirichlet (BD) score proposed by Heckerman, Geiger and Chickering (1995) and the likelihood-equivalence Bayesian Dirichlet (BDe) score (Heckerman et al., 1995), where the problem of learning the Bayesian network is NP-hard (Chickering et al., 2004). It is worth noting that the main problem in this family of methods lies in the search step, where it is possible to get stuck in a local optimum.

Some works have proposed a combination of constraint-based and score-based approaches (hybrid algorithms) in order to overcome the shortcomings observed when each of these approaches is used alone, such as local optima and overfitting on small datasets (Wang et al., 2007; Wong and Leung, 2004). Hybrid algorithms enhance the search step in order to find a more accurate DAG in fewer iterations, by imposing some constraints on the model space and assessing the resulting DAGs using score functions (Tsamardinos et al., 2006).

2.3. Dynamic Bayesian Network

Real life is far from being static, which makes studying the temporal behavior of a system a fundamental issue in any field. Time series analysis, which aims to analyze a stochastic process's outputs observed sequentially through time, is therefore a suitable field in which to apply a temporal version of Bayesian network algorithms. In order to model the dependencies among a set of stochastic processes through time, the joint probability in a Dynamic Bayesian Network (DBN) can be formulated as follows (Eq. (7)) (Lähdesmäki and Shmulevich, 2008):

$$P(x[1], x[2], \ldots, x[T]) = P(x[1]) \prod_{t=2}^{T} \prod_{i=1}^{N} P(x_i[t] \mid Parents(x_i[t])) \quad (7)$$

where 𝑥ᵢ[𝑡] stands for the 𝑖th variable at a time point 𝑡, and Parents(𝑥ᵢ[𝑡]) represents this variable's parents at that instant of time 𝑡.

We can distinguish two main groups of DBNs: homogeneous and non-homogeneous DBNs. In the first group, neither the arcs nor their directions change over time (the structure is invariant through time). Non-homogeneous DBNs are the opposite (Grzegorczyk, 2015; Dondelinger et al., 2012; Grzegorczyk and Husmeier, 2011; Jia and Huan, 2010; Nielsen and Nielsen, 2008; Gonzales et al., 2015; Shijiazhu and Wang, 2012): both parameters and structure can change over time.

In order to simplify and generalize any DBN, it is recommended to select only (as far as possible) the most important variables. An interesting method used in various DBNs (Chiquet et al., 2009; Lebre, 2009) is the Least Absolute Shrinkage and Selection Operator (Tibshirani, 1994; Hastie et al., 2008), or LASSO for short. This regression analysis method helps to carry out variable selection in order to build a more accurate model and avoid over-fitting. The LASSO problem can be represented in Lagrangian form (using the positive complexity parameter 𝜆) as follows (Eq. (8)):

$$\hat{\beta}^{LASSO} = \underset{\beta}{\arg\min}\left\{ \frac{1}{2}\sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{P} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{P}|\beta_j| \right\} \quad (8)$$

or by specifying the constraint explicitly, as in the following model (Eq. (9)):

$$\hat{\beta}^{LASSO} = \underset{\beta}{\arg\min}\left\{ \sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{P} x_{ij}\beta_j\right)^2 \right\}, \quad \text{subject to } \sum_{j=1}^{P}|\beta_j| \le \vartheta \quad (9)$$

The basic idea behind this method is to minimize the prediction error while not violating the constraint imposed on the sum of the coefficients 𝛽ⱼ (the L1 norm). By choosing small values for the threshold 𝜗, some of the coefficients will tend to zero (shrink), which leads to the exclusion of the corresponding predictors. Another method, slightly similar to the LASSO, is ridge regression (Hastie et al., 2008). Based on the L2 ridge penalty instead of the L1 norm used in the LASSO, the ridge regression method can be summarized as follows (Eq. (10)):

$$\hat{\beta}^{ridge} = \underset{\beta}{\arg\min}\left\{ \sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{P} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{P}\beta_j^2 \right\} \quad (10)$$

A ridge regression problem can also be represented using the following mathematical model (Eq. (11)):

$$\hat{\beta}^{ridge} = \underset{\beta}{\arg\min}\left\{ \sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{P} x_{ij}\beta_j\right)^2 \right\}, \quad \text{subject to } \sum_{j=1}^{P}\beta_j^2 \le \upsilon \quad (11)$$

As a compromise between the LASSO and ridge methods, the elastic-net penalty includes a portion of the LASSO penalty and a portion of the ridge penalty, as illustrated in the following formula (Eq. (12)):

$$\lambda\sum_{j=1}^{P}\left(\alpha\beta_j^2 + (1-\alpha)|\beta_j|\right) \quad (12)$$

It is clear that the more 𝛼 tends to zero, the closer the method is to the LASSO, and the more 𝛼 approaches one, the closer it is to ridge regression.
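As an illustration of the three penalties above, the following R sketch obtains LASSO, ridge and elastic-net coefficients with the glmnet package (the package used later in Section 5.2.1); the simulated data are an assumption made for the example. Note that glmnet's alpha parameter, as in Eq. (24) later, plays the role of (1 − 𝛼) in Eq. (12): in glmnet, alpha = 1 gives the LASSO and alpha = 0 gives ridge regression.

```r
# Illustrative sketch: shrinkage coefficients with glmnet (simulated data).
library(glmnet)

set.seed(1)
n <- 200; p <- 6
x <- matrix(rnorm(n * p), n, p)                   # predictors
y <- x[, 1] - 0.5 * x[, 3] + rnorm(n, sd = 0.5)   # response depends on x1 and x3 only

lasso <- cv.glmnet(x, y, alpha = 1,   type.measure = "mse")  # LASSO, lambda chosen by CV
ridge <- cv.glmnet(x, y, alpha = 0,   type.measure = "mse")  # ridge
enet  <- cv.glmnet(x, y, alpha = 0.5, type.measure = "mse")  # elastic net

# Coefficients at the lambda minimizing the cross-validated MSE;
# LASSO coefficients shrunk exactly to zero identify excluded predictors.
coef(lasso, s = "lambda.min")
coef(ridge, s = "lambda.min")
coef(enet,  s = "lambda.min")
```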
After this brief presentation of some important notions in the field of DBNs, in the next section we introduce our DBN method, which is based on the Grow-Shrink algorithm and the notion of stability in the DBN.

3. Grow Shrink algorithm

Given its low complexity (O(n) in the number of independence tests (Margaritis, 2003)), the Grow-Shrink (GS) algorithm can be considered a very interesting means of modeling Bayesian networks. On the basis of conditional independence tests, this algorithm extracts the underlying causal structure of a dataset. In order to discover the local neighborhood of each variable in the underlying Bayesian network, the author relies on some important notions and assumptions, such as the D-separation rules, causal sufficiency, the causal Markov property and faithfulness. In the following lines we discuss the main steps of the GS algorithm.

3.1. The Grow-Shrink (GS) Markov Blanket algorithm

In the proposed algorithm, Margaritis (Margaritis, 2003) assumes that all conditions for the existence and uniqueness of the Markov boundary (see Section 2) are satisfied, and relies on pairwise independence tests. Two main steps, the growing and the shrinking phases, are used to discover the Markov blanket (Markov boundary) of a given variable 𝑥.

Growing phase: for any variable 𝑥, we look for all variables 𝑦 depending on 𝑥 given the Markov boundary 𝐵(𝑥). This is formulated as follows (Eq. (13)):

Starting with an empty 𝐵(𝑥): for each 𝑦 ≠ 𝑥 such that not (𝑦 ⊥ 𝑥 | 𝐵(𝑥)), 𝐵(𝑥) ← 𝐵(𝑥) ∪ {𝑦}   (13)

Shrinking phase: remove any variable 𝑦 which has been added to the Markov blanket indirectly while it is independent of 𝑥. We can summarize this in the following formula (Eq. (14)):

For each 𝑦 which satisfies (𝑦 ⊥ 𝑥 | 𝐵(𝑥) − {𝑦}), 𝐵(𝑥) ← 𝐵(𝑥) − {𝑦}   (14)

3.2. Inducing the Bayesian network using the Markov Blanket

On the basis of the Markov blanket of each variable 𝑥, Margaritis proposes an efficient algorithm to find the underlying structure of the whole Bayesian network. The main ideas of each step of this algorithm are presented below.

1. Markov Blanket: find the Markov blanket 𝐵(𝑥) for each variable 𝑥.
2. Direct neighbors and graph structure:
• Find the smallest set between 𝐵(𝑥) − {𝑦} and 𝐵(𝑦) − {𝑥}; let it be 𝑇.
• If 𝑥 and 𝑦 (𝑦 ∈ 𝐵(𝑥)) are dependent given all subsets of 𝑇, then we consider 𝑦 a member of the set of direct neighbors of 𝑥, denoted 𝑁(𝑥).
3. Orienting edges: given a variable 𝑥 and a neighbor 𝑦 (𝑦 ∈ 𝑁(𝑥)):
• If there is a variable 𝑧 such that 𝑧 ∈ 𝑁(𝑥) − 𝑁(𝑦) − {𝑦},
• find the smallest set between 𝐵(𝑦) − {𝑥, 𝑧} and 𝐵(𝑧) − {𝑥, 𝑦}; let this set be 𝑇 and let 𝑆 be one of its subsets.
• If 𝑦 and 𝑧 are dependent given every 𝑆 ∪ {𝑥}, then we orient the edge to be an arc 𝑦 → 𝑥.
4. Cleaning the graph of cycles: in this step, we remove the edges which take part in the greatest number of cycles observed in the graph and insert them in a set 𝑅, which stands for the set of edges to be possibly reversed.
5. Reversing edges: we take the edges of the set 𝑅 (collected in step 4) in the reverse order of their insertion, and we insert them again in the graph but reversed.
6. Direction propagation: we deal with the edges between variables that are not yet oriented. If 𝑥 is a variable, 𝑦 is one of its neighbors and the edge between them is not oriented, then we look for directed paths between 𝑥 and 𝑦; if one exists, we orient the edge as 𝑥 → 𝑦.

As we can notice in this algorithm, the main idea is to discover the closest dependencies of the variables and build the whole graph upon them, which makes this algorithm very simple and easy to understand, given its complexity of O(n) independence tests for the Markov blanket algorithm (Margaritis, 2003). The total number of conditional independence tests for the entire algorithm is O(n⁴ 2ⁿ |D|) (Margaritis, 2003), where n is the number of variables and D is the set of instances used to learn the Bayesian network.

Unfortunately, real-life problems are not so simple that a single first training subset could offer the underlying structure and parameters needed to infer and predict any missing or future values. Thus, we need to take into consideration the relationships between variables over time in order to obtain more accurate results. In the following section, we present a new method to induce a DBN based on the GS algorithm and the stability of the network over time.

4. Stability-based Dynamic Grow-Shrink Method (SDGS)

One could think of inducing a new GS network on the basis of all the data received over time. However, this sometimes requires a huge amount of data storage space. To avoid this situation, we propose a DBN based on computing a few parameters which represent the real dependencies between variables, rather than saving all the data observed during the system's life. In order to maintain the stability of the DBN, adding arcs between variables should be based on the strength of their dependencies, observed not only on a given interval of time but through the whole life of the dynamic system. To do so, we propose a new matrix, which we call the ''stability matrix'', which maintains degrees of the strength of the dependencies between variables, updated through time.

The stability matrix is an (n × n) matrix, where n is the number of variables. Columns and rows have the same labels, denoting the variables' names. In this matrix, a column i holds the stability degrees corresponding to a given variable. A stability degree in this method means the strength (updated through time) of the dependency between a given variable and any other variable j. The larger this degree gets, the more adding an arc j → i will preserve the stability of the whole Bayesian network and help to obtain better predictions of future and missing values. In the following sub-sections, we explain how to initialize, compute and update the stability matrix and then the DBN.

4.1. Initialize the stability matrix

Using standardized data for the first m entries, we may initialize the stability matrix either by calculating their partial cross correlations or the shrinkage coefficients (LASSO, ridge or elastic net). In order to facilitate the understanding of the rest of the proposed method, we define some concepts, namely: current dataset, historical average and cumulative dataset. For a given point of time t and a moving window of size m, the current dataset represents all the observations collected in [t − m + 1, t]. For example, for hourly time series, the current dataset at hour 24:00 is all the observations gathered in the last 24 h. By historical average we mean the mean of the values of each variable at a specific point of time. If we consider again the example of hourly time series, the historical average at an hour h is the mean of all the values observed through the system's life up to this specific hour. Based on the concept of historical average explained above, the cumulative dataset is built from the historical averages of all the variables up to a point of time t.
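The following R sketch is a minimal illustration of the notions just defined, assuming a hypothetical matrix of hourly observations ts_data and a window of m = 24 h: it extracts the current subset, builds a cumulative subset of historical averages, and learns a Grow-Shrink network from the current subset with the bnlearn package used in Section 5.

```r
# Illustrative sketch (hypothetical data): current and cumulative subsets for a
# moving window of m = 24 hourly observations, plus a GS network learned with bnlearn.
library(bnlearn)

m <- 24
set.seed(1)
ts_data <- matrix(rnorm(24 * 10 * 5), ncol = 5,
                  dimnames = list(NULL, paste0("V", 1:5)))  # hypothetical hourly series
t_now <- nrow(ts_data)

# Current dataset: the last m observations, i.e. the interval [t - m + 1, t].
current <- ts_data[(t_now - m + 1):t_now, ]

# Cumulative dataset: historical average of each variable per hour of the window
# (mean over all past days), giving one row per position in the 24-hour window.
days <- nrow(ts_data) / m
cumulative <- sapply(colnames(ts_data), function(v)
  rowMeans(matrix(ts_data[, v], nrow = m, ncol = days)))

# GS Bayesian network learned from the (standardized) current subset.
bn <- gs(as.data.frame(scale(current)))
bn
```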
Fig. 2. CO.GT hourly time series plot between 03∕10∕2004 and 4∕4∕2005.
Fig. 5. CO.GT first difference time series diagram.
Fig. 3. Box plot of the CO.GT monthly levels between March 2004 and April 2005.
Fig. 6. Box plot of CO.GT first difference’s hourly levels.
Fig. 4. Box plot of the CO.GT hourly levels.
Fig. 7. Box plot of the CO.GT first difference's monthly levels.
Fig. 8. RMSE's values in terms of 𝛾 and PCCF 𝑝-value using the PCCF-based model.
Fig. 9. RMSE's values in terms of 𝛾 and Shrinkage 𝛼 using the Shrinkage-based model (equal 𝜂, 𝜌).
Fig. 10. RMSE's values in terms of 𝛾 and Shrinkage 𝛼 using the Shrinkage-based model (learned 𝜂, 𝜌).
Fig. 11. RMSE's values in terms of 𝛾 and Shrinkage 𝛼 using the Shrinkage-only-based model.
Fig. 12. Predicted CO.GT first difference's time series using the VAR model.
Fig. 13. Predicted CO.GT first difference's time series using the PCCF-based model (𝛾 = 0.6, 𝑝-value = 0.5).
Fig. 14. Predicted CO.GT first difference's time series using the Shrinkage-based model (𝜌 = 𝜂, 𝛾 = 0.2, 𝛼 = 0).
Fig. 15. Predicted CO.GT first difference's time series using the Shrinkage-based model (learned 𝜂 and 𝜌, 𝛾 = 0∗, 𝛼 = 0).

4.1.1. Stability matrix based on the partial cross correlation

In the beginning, the stability matrix is initialized with stability degrees computed on the basis of the partial cross correlation (PCCF_ij) of the first m collected entries (for example m = 24 points of time (hours) in hourly time series) for given variables i and j. The stability degree at a point of time t (StD_ij,t) is calculated as follows (Eq. (15)):

$$StD_{ij,t} = \sum_{t-m < c \le t} PCCF_{ij,c} \quad (15)$$

Only those PCCF_ij,c with a p-value less than or equal to a specified significance level are considered. The idea behind the stability degree is that we keep track of the strength of the dependency between i and j based on the PCCF_ij,c.

4.1.2. Stability matrix based on shrinkage coefficients (LASSO, ridge or elastic net coefficients)

We can also initialize the stability matrix with the LASSO, ridge or elastic net coefficients of the first collected data subset. For example, for a given target variable j and predictor variables i_1, i_2, ..., i_n, each entry (i_x, j) (which represents the strength of the dependency between j and i_x) in the stability matrix is initialized with the coefficient c_{i_x j}.

4.2. Update the stability matrix

For the subsequent intervals of time, each current subset (the last m observed entries) and the cumulative subset will have m entries. Using standardized values of the variables in both datasets, we calculate the stability degrees either on the basis of the partial cross correlations and their respective p-values, or on the basis of the shrinkage coefficients between variables. We refer to the matrix of new stability degrees as the ''update matrix''. Using the Hadamard (entrywise) multiplication, denoted by the operator ''o'', we update the stability matrix according to the following equation (Eq. (16)):

$$\text{Stability matrix} = \eta \ o\ \text{Cumulative Stability matrix} + \rho \ o\ \text{update matrix} \quad (16)$$

where 𝜂 (n × n) and 𝜌 (n × n) are two matrices (to be learned) for which ∀k, j: 𝜂_kj ∈ [0, 1] and 𝜌_kj ∈ [0, 1]. 𝜂 and 𝜌 are used to calibrate the relative importance of the previous and the new knowledge, represented respectively by the cumulative stability matrix (n × n) (the historical average of the stability degrees calculated over time) and the update matrix (n × n).

4.2.1. Learning 𝜂 and 𝜌

Let us consider the interval of time [t − m + 1, t] from which we want to predict m entries and learn, based on the amount of error calculated, the
change in the values of the matrices 𝜂 and 𝜌. Let Δw_kj,t and Δv_kj,t be the changes in the weights 𝜂_kj and 𝜌_kj associated respectively to the previous cumulative and update stability degrees of the variable k for the variable j at time point t. The values of Δw_kj,t and Δv_kj,t are calculated as follows (Eqs. (17) and (18)):

$$\Delta w_{kj,t} = \varphi_t \left( |y_{kj,t}| + \Delta w_{kj,previous} \right) \quad (17)$$

$$\Delta v_{kj,t} = \varphi_t \left( |u_{kj,t}| + \Delta v_{kj,previous} \right) \quad (18)$$

where Δw_kj, Δv_kj, y_kj,t and u_kj,t represent respectively:

• Cumulative stability matrix: Δw_kj,t represents the change in 𝜂_kj and y_kj,t stands for the stability degree related to the variable k for the variable j in the cumulative stability matrix. Δw_kj,previous represents the previous value of the change in the value of 𝜂_kj.
• Update matrix: Δv_kj,t is the change in 𝜌_kj and u_kj,t is the stability degree related to the variable k for the variable j in the update matrix. Δv_kj,previous represents the previous value of the change in the value of 𝜌_kj.

As we can notice, the changes in the values of the coefficients 𝜂_kj and 𝜌_kj depend on the stability degrees y_kj,t and u_kj,t, on their previous amount of change, represented by Δw_kj,previous and Δv_kj,previous, and on the momentum 𝜑_t. The next value of the momentum is calculated as follows (Eq. (19)):
$$\varphi_{next} = \begin{cases} 0, & \text{if } \xi_{j,t} = 0 \text{ or } \left| |\xi_{j,previous}| - |\xi_{j,t}| \right| = 0 \\[4pt] \dfrac{|\xi_{j,previous} - \xi_{j,t}|}{|\xi_{j,previous} - \xi_{j,t}| \times |\xi_{j,previous} - \xi_{j,t}|} = \dfrac{1}{|\xi_{j,previous} - \xi_{j,t}|}, & \text{if } \xi_{j,t} \neq 0 \end{cases} \quad (19)$$
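The following R sketch illustrates one update step of Eqs. (16)-(18) for a single target variable j. The dimensions, the initial stability degrees and the momentum value are hypothetical; the momentum itself would be obtained from Eq. (19), and treating Δw and Δv as reductions of 𝜂 and 𝜌 (clamped to [0, 1]) is our reading of the remark after Eq. (20) about amplifying the reduction in the weights.

```r
# Illustrative sketch: one update step of the stability degrees for a target
# variable j, following Eqs. (16)-(18). All values below are hypothetical.
n <- 5
cumulative_stab <- runif(n, -1, 1)   # y_kj: cumulative stability degrees (column j)
update_stab     <- runif(n, -1, 1)   # u_kj: stability degrees of the new window
eta <- rep(0.75, n)                  # eta_kj weights for the cumulative part
rho <- rep(0.25, n)                  # rho_kj weights for the update part

# Eq. (16): entrywise (Hadamard) combination of the old and the new knowledge.
stability_j <- eta * cumulative_stab + rho * update_stab

# Eqs. (17)-(18): weight changes driven by the momentum and the previous changes.
phi <- 0.1                           # momentum, in practice given by Eq. (19)
dw_prev <- rep(0, n); dv_prev <- rep(0, n)
dw <- phi * (abs(cumulative_stab) + dw_prev)
dv <- phi * (abs(update_stab) + dv_prev)

# Reduce the weights given to the degrees that contributed to the error,
# keeping them in [0, 1] (an assumption consistent with the definition of eta, rho).
eta <- pmin(pmax(eta - dw, 0), 1)
rho <- pmin(pmax(rho - dv, 0), 1)
```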
Fig. 18. Residuals over time using the Shrinkage-based model (𝜂 = 𝜌, 𝛾 = 0.2, 𝛼 = 0).
Fig. 16. Residuals’ values over time using the VAR model.
Fig. 19. Residuals over time using the Shrinkage-based model (learned 𝜂 and 𝜌, 𝛾 = 0∗ , 𝛼 = 0).
Fig. 17. Residuals’ values over time using the PCCF-based model (𝛾 = 0.6, 𝑝-value = 0.5).
𝜉_j,previous represents the amount of error calculated over the previous interval of time [t − 2m + 1, t − m]. Based on the prediction made over [t − m + 1, t], the amount of error 𝜉_j,t is calculated as follows (Eq. (20)):

$$\xi_{j,t} = \begin{cases} \dfrac{\sum |real - predicted|}{2}, & \text{if } \sum (real - predicted) = 0 \\[4pt] \dfrac{\sum |real - predicted|}{\sum (real - predicted)}, & \text{otherwise} \end{cases} \quad (20)$$

The more 𝜉_j,t exceeds 𝜉_j,previous, the larger the contribution of y_kj,t and u_kj,t in reaching such a level of error; this is why we multiply the momentum by |y_kj,t| and |u_kj,t|, in order to amplify the reduction in the weights given to y_kj and u_kj.

4.3. DBN inference and update

Based on the current data subset, we infer a Bayesian network (BN) structure using the GS algorithm. Using the stability matrix and the cumulative subset, we then enhance the previous structure of the BN. To do so, we update the BN with a set of variables selected on the basis of the stability matrix. Let ABsum_j be the sum of the absolute values of the stability degrees corresponding to a variable j in the (n × n) stability matrix. In order to predict future values of a variable j, we select ''x'' variables from the ''n'' variables in the stability matrix. To do so, we order (in descending order) the absolute values of the stability degrees of the ''n'' variables, and then select the first ''x'' variables for which the sum of their absolute stability degrees divided by ABsum_j is less than or equal to a threshold 𝛾 (Eqs. (21) and (22)). With the |stD_i,j| ordered in descending order:

$$ABsum_j = \sum_{i=1}^{n} |stD_{i,j}| \quad (21)$$

$$i_1, i_2, \ldots \text{ are selected if: } \frac{\sum_{i=1}^{x} |stD_{i,j}|}{ABsum_j} \le \gamma \quad (22)$$

The new structure of the BN will include new arcs between each selected variable i_x and the variable j (i_x → j). Based on the variables selected in the previous step, we select the columns corresponding to the same variables from the cumulative subset and construct a new dataset which includes these columns and those of the current subset. The new dataset is then used to learn the parameters of the modified GS Bayesian network and to predict the next m future values of the studied variable. As we can notice, we do not need to keep track of all the time series' data (which could require a very important amount of storage space) in order to find the suitable coefficients needed to predict future values or generate missing values of a given variable, as is the case with the classical VAR method. In the next section, we discuss a real-life application of the proposed method and compare the results with those found using the VAR model.
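A minimal R sketch of the selection rule of Eqs. (21) and (22): predictors of a target variable j are ranked by the absolute value of their stability degree and retained while the cumulative share of |stD| stays within the threshold 𝛾. The stability column and the value of 𝛾 below are hypothetical.

```r
# Illustrative sketch of Eqs. (21)-(22): select the predictors of variable j
# from its column of the stability matrix (hypothetical values).
stab_j <- c(V1 = 0.9, V2 = -0.6, V3 = 0.2, V4 = 0.05, V5 = -0.4)
gamma  <- 0.6

ab_sum  <- sum(abs(stab_j))                     # Eq. (21)
ordered <- sort(abs(stab_j), decreasing = TRUE) # rank predictors by |stD_ij|
keep    <- cumsum(ordered) / ab_sum <= gamma    # Eq. (22)
selected <- names(ordered)[keep]
selected  # arcs selected -> j are then added to the GS network for these predictors
```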
Table 1. Main statistics of each variable of the processed dataset.

Variable       Min      1st Qu.  Median   Mean     3rd Qu.  Max.
CO.GT          0.1      1.0      1.8      2.106    2.8      11.9
PT08.S1.CO     647      940      1072     1108     1246     2040
C6H6.GT        0.1      4.4      8.375    10.179   14.2     63.7
PT08.S2.NMHC   383.0    734.0    913.0    942.2    1124.0   2214.0
NOx.GT         2.0      95.0     178.0    241.6    322.0    1479.0
PT08.S3.NOx    322.0    650.0    799.0    833.4    971.0    2683.0
NO2.GT         2        75       107      111      139      333
PT08.S4.NO2    551      1195     1442     1441     1671     2775
PT08.S5.O3     221      740      986      1040     1303     2523
T              −1.9     11.1     16.6     17.56    23.5     44.6
RH             9.2      36.0     49.8     49.47    62.8     88.7
AH             0.1847   0.6943   0.9556   0.9872   1.2569   2.1806
5. Real-life problem application of SDGS

In order to assess the SDGS method, we discuss in this section how it is used to predict future values of a very important gas which contributes to air pollution. Between March 10th, 2004 and April 4th, 2005, five metal oxide chemical sensors embedded in an air quality chemical multisensor device were placed in a highly polluted area at road level in Italy (De Vito et al., 2008).

5.1. Dataset description

Between March 10th, 2004 and April 4th, 2005, a dataset of 9358 entries (including missing values) of hourly averaged responses was collected. In order to test the SDGS method, we use the (Vito, 2005) dataset (donated on 03/23/2016). As a multivariate time series, this dataset contains the following attributes (as they are given by the source (Vito, 2005)):

• CO.GT: True hourly averaged concentration of CO in mg/m3 (reference analyzer),
• PT08.S1.CO: PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted),
• NMHC.GT: True hourly averaged overall Non Metanic HydroCarbons concentration in microg/m3 (reference analyzer),
• C6H6.GT: True hourly averaged Benzene concentration in microg/m3 (reference analyzer),
• PT08.S2.NMHC: PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted),
• NOx.GT: True hourly averaged NOx concentration in ppb (reference analyzer),
• PT08.S3.NOx: PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted),
• NO2.GT: True hourly averaged NO2 concentration in microg/m3 (reference analyzer),
• PT08.S4.NO2: PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted),
• PT08.S5.O3: PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted),
• T: Temperature in °C,
• RH: Relative Humidity (%),
• AH: Absolute Humidity.
To deal with missing data, we adopted a very simple approach based on computing a weighted average of the previous and next values, giving the immediately previous and next values greater importance than older ones. In this approach, we try to estimate a missing value on the basis of its two previous and two next values; otherwise we remove the whole entry. For example, the missing value e_i of an entry ''i'' is computed as (Eq. (23)):

$$e_i = a_1 \times e_{i-2} + a_2 \times e_{i-1} + a_3 \times e_{i+1} + a_4 \times e_{i+2} \quad (23)$$

where a_i represents the weight given to each previous and next value. If we initialize all the a_i to zero, their new values are assigned as follows:
• If e_{i−1} and e_{i+1} exist, then a_2 and a_3 are equal to 0.5.
• If e_{i−1} is missing and e_{i+1}, e_{i−2} exist, then a_1 = 0.25 and a_3 = 0.75.
• If e_{i+1} is missing and e_{i−1}, e_{i+2} exist, then a_2 = 0.75 and a_4 = 0.25.
• If both e_{i−1} and e_{i+1} are missing and e_{i−2}, e_{i+2} exist, then a_1 = a_4 = 0.5.
• Otherwise, remove the entry ''i''.
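The imputation rule of Eq. (23), together with the weight assignments listed above, can be written as a short R helper; the function name and the NA-based handling of the series boundaries are our own illustrative choices.

```r
# Illustrative sketch of the imputation rule of Eq. (23) for a missing entry e_i,
# using its two previous and two next values; returns NA when the rules say the
# whole entry should be removed.
impute_entry <- function(e, i) {
  prev2 <- if (i - 2 >= 1)         e[i - 2] else NA
  prev1 <- if (i - 1 >= 1)         e[i - 1] else NA
  next1 <- if (i + 1 <= length(e)) e[i + 1] else NA
  next2 <- if (i + 2 <= length(e)) e[i + 2] else NA

  if (!is.na(prev1) && !is.na(next1)) return(0.5 * prev1 + 0.5 * next1)
  if (is.na(prev1) && !is.na(next1) && !is.na(prev2)) return(0.25 * prev2 + 0.75 * next1)
  if (is.na(next1) && !is.na(prev1) && !is.na(next2)) return(0.75 * prev1 + 0.25 * next2)
  if (is.na(prev1) && is.na(next1) && !is.na(prev2) && !is.na(next2))
    return(0.5 * prev2 + 0.5 * next2)
  NA  # otherwise: remove the whole entry
}

x <- c(1.2, NA, 1.6, 1.8, NA, 2.0)
x[2] <- impute_entry(x, 2)   # 0.5 * 1.2 + 0.5 * 1.6 = 1.4
```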
Using the R platform (Team, 2016), 7465 entries remained after processing the dataset. Table 1 summarizes the main statistics of each variable of the processed dataset. As we can notice in Table 1, the gases have different levels in the same area during the same interval of time. Because we are going to focus on CO.GT, we present in Fig. 2 the time series plot of this gas. This time series does not appear to be stationary: the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test (Kwiatkowski et al., 1992) yields a p-value equal to 0.01, which means that there is strong evidence that this time series is non-stationary. Fig. 3 shows how the levels and the variance grow between September and December 2004, where CO.GT levels soar in some specific months, especially October, November and December. In Fig. 4, we see that the CO.GT levels are highest from 7 to 10 am and from 5 to 8 pm, and the most straightforward explanation is that these hours are the peak traffic periods.

As mentioned before, the CO.GT time series is non-stationary, so making it stationary is an important step before going any further in the analysis. To do so, we differenced it with one lag, and we obtain the diagram depicted in Fig. 5, which shows that the first difference was sufficient. Using the KPSS test (Kwiatkowski et al., 1992) we obtained a p-value equal to 0.1, which indicates that there is no strong evidence that the first difference time series is non-stationary. Figs. 6 and 7 depict respectively the variation in the hourly and monthly CO.GT first difference levels. In Fig. 6, it is clear that the hours between 3 am and 6 am have the lowest variation in the first difference values, whereas peak traffic hours (7 am–10 am and 17 pm–22 pm) show the highest variation, and the hours
between 11 am–16 pm show a stable variation in the values of the CO.GT first difference. For the monthly variation in the CO.GT first difference levels, we notice that the median is almost the same in all months, with slightly different upper and lower quartiles, where October 2004 shows the largest variation (quartiles and whiskers) and April 2005 the lowest. On the basis of the first difference of both the CO.GT and the other variables' time series, we carried out the time series analysis in order to find a suitable VAR model for the first difference of CO.GT, and we then use it together with the SDGS to predict future values of the same gas.

5.2. Experiments and results

In the experiments' context subsection we discuss the R packages used in this study, and the key results are then discussed in the results subsection.

5.2.1. Experiments' context

In order to assess the performance of the studied models, we have used the vars package (Pfaff and Stigler, 2015) for the VAR model and the bnlearn package (Scutari, 2016) for the Grow-Shrink-based models. The bnlearn package includes an implementation of the GS algorithm, which we used to implement a modified version that includes the SDGS functions (using Maximum Likelihood parameter estimation (MLE) (Scutari and Ness, 2018)). In the shrinkage-based models, we used the glmnet package (Hastie and Qian, 2014), whose objective function (Gaussian family) is (Eq. (24)):

$$\hat{\beta}^{LASSO} = \min_{(\beta_0, \beta) \in \mathbb{R}^{p+1}} \left\{ \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i^T \beta \right)^2 + \lambda \left[ \frac{(1-\alpha)\|\beta\|_2^2}{2} + \alpha \|\beta\|_1 \right] \right\} \quad (24)$$

where x_i ∈ R^p and the response y_i ∈ R, i = 1, …, N; λ ≥ 0 is a complexity parameter and 0 ≤ α ≤ 1 is a compromise between ridge (α = 0) and LASSO (α = 1). The LASSO type.measure has been set to ''mse'' (mean squared error criterion), which measures the deviation of the fitted mean from the response and refers to the loss used for cross-validation (default number of folds = 10).

5.2.2. Results

In Table 2 we summarize the results of several experiments conducted with different models in order to predict the first difference values of CO.GT. As mentioned above, we have conducted a set of experiments, each based on a different model, as explained below:

• VAR: VAR model.
• PCCF: PCCF Stability-based Grow-Shrink model.
• Stb-Shrk: Stability-based model in which the stability degrees are calculated using the shrinkage method.
• Shrk-Only: Variables from the cumulative dataset are selected based on the shrinkage method only. When 𝛾 is considered, this means that we select only the predictors for which the ratio of the sum of their coefficients' absolute values equals 𝛾.
• GS-CV: The Grow-Shrink Bayesian network is inferred on the basis of the current subset and the 24 cumulative values of the variables.
• GS-PV: The Grow-Shrink Bayesian network is inferred on the basis of the current subset and the previous 24 values of each variable.
• GS-CV-PV: The Grow-Shrink Bayesian network is inferred on the basis of the current subset, the previous 24 values and the cumulative 24 values of each variable.

Table 2. Best results.

Method | 𝛾 | P-value | Shrinkage 𝛼 | 𝜂_kj | 𝜌_kj | Predicted values | RMSE | Anderson–Darling statistic / p-value
VAR | – | 0.05 | – | – | – | 7440 | 12.6714 | –
VAR | – | – | – | – | – | 7440 | 0.8133 | AD = 1356, T.AD = 1780, p-value = 0
PCCF | 0.5 | 0.6 | – | 0.25 | 0.75 | 7440 | 0.8067 | –
PCCF | 0.3 | 0.8 | – | 0.75 | 0.25 | 7440 | 0.8062 | –
PCCF | 0.5 | 0.6 | – | 1 | 1 | 7440 | 0.8059 | –
PCCF | 0.6 | 0.5 | – | Learned | Learned | 7440 | 0.8026 | AD = 174.1, T.AD = 227.4, p-value = 7.343e−96
Stb-Shrk | 0.2 | – | 0 | 0.25 | 0.75 | 7440 | 0.8061 | –
Stb-Shrk | 0.2 | – | 0 | 0.75 | 0.25 | 7440 | 0.8061 | –
Stb-Shrk | 0.2 | – | 0 | 1 | 1 | 7440 | 0.8060 | AD = 396.6, T.AD = 519.7, p-value = 4.200e−218
Stb-Shrk | 0* | – | 0 | Learned | Learned | 7440 | 0.8066 | AD = 396.1, T.AD = 519.0, p-value = 8.361e−218
Shrk-Only | 0.1 | – | 1 | – | – | 7440 | 0.8101 | AD = 93.52, T.AD = 121.5, p-value = 1.354e−51
Shrk-Only | – | – | 0.9 | – | – | 7440 | 0.8119 | AD = 70.75, T.AD = 91.62, p-value = 4.393e−39
GS-CV | – | – | – | – | – | 6072 | 0.8579 | –
GS-PV | – | – | – | – | – | 6240 | 0.8517 | –
GS-CV-PV | – | – | – | – | – | 6216 | 0.8530 | –

For the models for which goodness-of-fit statistics are reported, the Kolmogorov–Smirnov test statistics are D = 0.44422 (VAR), 0.080376, 0.093817, 0.13871, 0.2172 and 0.21976, all with a p-value < 2.2e−16.

Fig. 20. Residuals density using the VAR model.
Fig. 21. Residuals density using the PCCF-based model (𝛾 = 0.6, 𝑝-value = 0.5).
Fig. 22. Residuals density using the Shrinkage-based model (𝜂 = 𝜌, 𝛾 = 0.2, 𝛼 = 0).
Fig. 23. Residuals density over time using the Shrinkage-based model (learned 𝜂 and 𝜌, 𝛾 = 0∗, 𝛼 = 0).

As we can notice in Table 2, for the VAR model we first considered all the coefficients yielded by the model and then considered only those with a p-value ≤ 0.05. For the stability-based models, in the PCCF model we considered a range of values of 𝛾 from 0* (0* means we consider only the single variable with the greatest absolute stability degree) up to 1 with a step of 0.1, while the p-value took the value 0.05, then 0.1 and greater (by a step of 0.1). The shrinkage 𝛼 has been varied from 0 to 1 with a step of 0.1 in the shrinkage-based models. For the (𝜂, 𝜌) matrices, we have considered four combinations: either constant values for all their entries, (0.25, 0.75), (0.75, 0.25), (1, 1),
or the system learns the value of each entry through time (learned, learned). For the GS-CV, GS-PV and GS-CV-PV models, we notice that the model failed to predict all the 7440 values; this is why we exclude these three models from any further analysis. As for the results, we can notice that the PCCF model with (𝛾 = 0.6, p-value = 0.5, 𝜂 = learned, 𝜌 = learned) yields the best RMSE, 0.8026, with an Anderson–Darling goodness-of-fit p-value of 7.343e−96. The worst model was the VAR model in which we consider only the predictors whose coefficient has a p-value ≤ 0.05, where we obtained an RMSE of 12.6714. Overall, we notice that all models fail to reach a statistically satisfactory p-value on both goodness-of-fit tests; this is why we rely on the RMSE to decide which model is the best.

In the remainder of this section, we start with a sensitivity analysis in which we compare the RMSE values in terms of 𝛾 and the PCCF p-value in the PCCF-based model, and the RMSE values in terms of 𝛾, 𝛼, 𝜂 and 𝜌 in the shrinkage-based models. In the predicted and residual value analysis subsection, we discuss the time series of the predicted CO.GT first difference values yielded by the best models (based on RMSE) among the VAR model, the PCCF-based models and the shrinkage-based models, as well as the residuals yielded by these models. In the absolute error analysis subsection, we discuss the hourly and monthly absolute error values observed when using the aforementioned models. In order to discuss in more detail the question of which predictor(s) is/are best for predicting the CO.GT first difference future values, in Appendices A–C we try to understand how the predictors have contributed to predicting future values of the response variable, in terms of presence, magnitude and direction, besides discussing the evolution of 𝜂 and 𝜌 in the stability-based models that yielded the best RMSE results.

5.2.3. Sensitivity Analysis

In order to track the values of the RMSE, we conducted a series of experiments in which we study the impact of 𝛾, the p-value (in the PCCF-based model) and 𝛼 (in the shrinkage-based models) on the variation of the RMSE. Fig. 8 illustrates the RMSE values in terms of 𝛾 and the p-value in the PCCF-based model. As we can notice, most experiments yielded an RMSE value between 0.802 and 0.818 whatever the values of 𝛾 and the p-value.
Fig. 24. Predicted values and residuals of the first difference of CO.GT using the VAR model.
Fig. 25. Predicted values and residuals of the first difference of CO.GT using the PCCF-based model (𝛾 = 0.6, 𝑝-value = 0.5).
Fig. 26. Predicted values and residuals of the first difference of CO.GT using the Shrinkage-based model (𝜂 = 𝜌, 𝛾 = 0.2, 𝛼 = 0).
Fig. 27. Predicted values and residuals of the first difference of CO.GT using the Shrinkage-based model (learned 𝜂 and 𝜌, 𝛾 = 0∗, 𝛼 = 0).
Remarkably, when we consider 𝛾 = 1 the RMSE soars to more than 0.83. Figs. 9–11 depict the RMSE values in terms of 𝛾 and the shrinkage 𝛼. We notice that the RMSE levels using the shrinkage-based model with equal 𝜂 and 𝜌 and those using learned 𝜂 and 𝜌 (Figs. 9 and 10 respectively) have almost the same pattern, where most values are between 0.806 and 0.82. As in the PCCF-based model, we notice that these two stability-based models exhibit higher RMSE levels when 𝛾 tends to 1. Contrary to the stability-based models, the RMSE values in terms of 𝛾 and the shrinkage 𝛼 using the shrinkage-only-based model become lower (less than 0.815) when 𝛾 tends to 1. The fact that the RMSE level soars when 𝛾 tends to 1 leads us to the conclusion that considering too many predictors (based on the stability degree matrix, which keeps track of the dependency strength between variables) hampers rather than helps the predictability of the model.
5.2.4. Predicted and residual value analysis

In order to have a look at how the models capture the variation in the response variable, in this subsection we analyze the time series of both the predicted values and the residuals yielded by the best models, namely: the VAR model (no predictor selection based on the p-value), the PCCF-based model (𝛾 = 0.6, p-value = 0.5), the Shrinkage-based model (𝜌 = 𝜂, 𝛾 = 0.2, 𝛼 = 0) and the Shrinkage-based model (learned 𝜂 and 𝜌, 𝛾 = 0*, 𝛼 = 0).

In Fig. 12, most forecasted values are close to zero, and the overall shape of the time series is far from the plot of the real values (Fig. 2); this distance between the real and the predicted time series is reflected in the residual time series depicted in Fig. 16. Fig. 13 shows more varied values for the predicted variable, but looking at the corresponding residual plot in Fig. 17 we conclude that the underlying model used to predict these values could not provide an accurate estimation either. For the Shrinkage-based models (Figs. 14 and 15), we notice that the predictions are also not accurate, despite the fact that the values are not close to zero as is the case with the VAR model; by diagnosing the corresponding residual plots in Figs. 18 and 19, we notice that the residuals have the same large values at the same points of time observed in all the considered models, which may point to outliers.
Fig. 30. Box plot of hourly CO.GT first difference’s absolute error levels using the Shrinkage-based model (𝜂 = 𝜌, 𝛾 = 0.2, 𝛼 = 0).
Fig. 28. Box plot of hourly CO.GT first difference’s absolute error levels using the VAR model.
Fig. 29. Box plot of hourly CO.GT first difference’s absolute error levels using the PCCFbased model (𝛾 = 0.6, 𝑝-value = 0.5).
Fig. 31. Box plot of hourly CO.GT first difference’s absolute error levels using the Shrinkage-based model (learned 𝜂 and 𝜌, 𝛾 = 0∗ , 𝛼 = 0).
The first thing we can notice about all these residual plots (Figs. 16–19) is the non-constant variation of their values (heteroscedasticity), although their distribution is close to a normal distribution, as we can see in Figs. 20–23, where most values lie between [−1, +1] (especially in Figs. 21–23); this explains the similarity of their plots in Figs. 16–19. In Figs. 24–27, we present the predicted CO.GT first difference values against their corresponding residuals for each considered model. Fig. 24 confirms that most predicted values in the VAR model are close to zero, which is why the majority of the residuals are concentrated at the 0 value of the CO.GT first difference. Unlike the VAR model, in the remaining plots (Figs. 25–27) the residuals are dispersed within a radius of 1.5 around the point (0, 0), with few exceptions, especially in Fig. 25.
5.2.5. Absolute error analysis

In order to analyze the absolute error yielded by each model, we illustrate the variation in error per hour in Figs. 28–31 and then per month in Figs. 32–35.

In all the plots that represent the hourly variation in the absolute error, we notice that the lowest levels are between 3 am and 6 am, whereas the greatest values are observed at 7 am–10 am and 17 pm–22 pm. When we compare these values to those observed in Fig. 6 (box plot of the CO.GT first difference's hourly levels), we notice that the variation in the response variable's values had an impact on the predictability of the models, and the outliers noticed in Fig. 6 are reflected again in the absolute error plots at the same hours (peak traffic hours). In contrast to the CO.GT first difference's hourly variation, which is reflected in the hourly absolute error plots, we notice that the relatively stable monthly variation (Fig. 7) has not much affected the predictability of the models, as we can see in the monthly absolute error plots (Figs. 32–35). It is noticeable that all models (Figs. 32–35) have a similar pattern of variation, especially the Grow-Shrink-based models, where September, October and November 2004 show the largest variation in absolute errors, whereas August 2004 and April 2005 have the lowest levels.
Fig. 34. Box plot of monthly CO.GT first difference’s absolute error levels using the Shrinkage-based model (𝜂 = 𝜌, 𝛾 = 0.2, 𝛼 = 0).
Fig. 32. Box plot of monthly CO.GT first difference’s absolute error levels using the VAR model.
Fig. 33. Box plot of monthly CO.GT first difference’s absolute error levels using the PCCFbased model (𝛾 = 0.6, 𝑝-value = 0.5).
Fig. 35. Box plot of monthly CO.GT first difference’s absolute error levels using the Shrinkage-based model (learned 𝜂 and 𝜌, 𝛾 = 0∗ , 𝛼 = 0).
At the end of this section, we can conclude that despite the non-zero error level, the stability-based models show a good performance if we take into account the volume of data used in each interval of time to predict future values. Comparing the stability-based models' results to those found using the VAR model, it is obvious that the VAR model is less accurate than the stability-based models, and this besides the amount of data, the storage space and the time needed (almost a year) to collect all the data in order to enhance the accuracy of the VAR model; this is why we tend to prefer the stability-based models, especially the PCCF model. In this case study, the stability-based models required only a (12 × 12) stability matrix (of which only one column of 12 entries is needed if we consider a single response variable), a (2 × 12) matrix used in the learning process of 𝜂 and 𝜌 (for each response variable), and a (24 × 12) dataset for the cumulative subset, where only 24 h of data were enough to begin the prediction.
6. Conclusion

In this work, we have proposed a new approach to build Dynamic Bayesian Network models based on the Grow-Shrink algorithm and the stability of the induced Bayesian network. The concept of stability proposed in the frame of the SDGS method is based on the strength of the dependencies between variables. The proposed approach shows a very interesting ability to model the underlying dependencies among a set of variables through time. Such performance is reflected by the good accuracy obtained using the SDGS models in comparison with the VAR model, which needs an important amount of data to compute accurate parameters; such a limit would absolutely hamper any attempt to predict future values or generate missing values in real-time systems based on very restricted material resources. In order to improve this method, we will consider in future works the problem of periodicity in time series and how we could take it into account to improve the accuracy of prediction for periodic time series. We also intend to incorporate a mechanism which improves learning from errors, in order to enhance the model's results.
Fig. A.1. Coefficients’ mean values for predictors in the VAR model.
Fig. A.2. CO.GT coefficient’ values and their corresponding 𝑝-value over time in the VAR model.
Fig. A.3. PT08.S1.CO coefficient’s values and their corresponding 𝑝-value over time in the VAR model.
Acknowledgments

We would like to thank the anonymous reviewers for their very helpful comments. This research received no specific grant or support from any funding agency or organization in the public, governmental, commercial or not-for-profit sectors.

Appendix A. Predictors' coefficients and frequencies analysis

Fig. A.1 depicts the mean coefficient value of each predictor in the VAR model (with no predictor selection based on 𝑝-value). It is clear that PT08.S2.NMHC, CO.GT, PT08.S1.CO, PT08.S5.O3, C6.H6.GT and T are the most important predictors for predicting the first-difference values of CO.GT, and that the remaining variables, especially AH, are the least important, regardless of the 𝑝-value associated with each predictor's coefficient. To track the coefficients' time series together with their corresponding 𝑝-values in the VAR model, Figs. A.2–A.13 summarize this information so that the coefficients can be compared and some conclusions drawn. As can be noticed, a large number of coefficients have a 𝑝-value greater than 0.05, as is the case for PT08.S2.NMHC, NOx.GT, PT08.S3.NOx, NO2.GT, PT08.S4.NO2 and AH.
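For readers who wish to reproduce this kind of coefficient and 𝑝-value tracking, a minimal sketch is given below. It is only an illustration under our own assumptions (a pandas DataFrame `df` of the first-differenced hourly series containing a 'CO.GT' column, expanding-window refits, lag 1); the experiments reported here relied on the R 'vars' package, whereas the sketch uses Python's statsmodels.

```python
# Hedged sketch: expanding-window VAR fits with coefficient / p-value tracking.
# Assumptions (ours): `df` is a pandas DataFrame of first-differenced hourly
# series including a 'CO.GT' column; statsmodels is used here only for
# illustration (the paper's experiments used the R 'vars' package).
import pandas as pd
from statsmodels.tsa.api import VAR

def track_co_gt_equation(df: pd.DataFrame, start: int = 24, lags: int = 1):
    """Refit a VAR(lags) on an expanding window and record, at each step,
    the coefficients and p-values of the CO.GT equation (cf. Figs. A.2-A.13)."""
    history = []
    for t in range(start, len(df) + 1):
        res = VAR(df.iloc[:t]).fit(lags)
        history.append({
            "time": df.index[t - 1],
            "coef": res.params["CO.GT"],    # coefficients of the CO.GT equation
            "pval": res.pvalues["CO.GT"],   # matching p-values
        })
    return history
```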
Fig. A.4. C6.H6.GT coefficient’s values and their corresponding 𝑝-value over time in the VAR model.
Fig. A.5. PT08.S2.NMHC coefficient’s values and their corresponding 𝑝-value over time in the VAR model.
Fig. A.6. NOx.GT coefficient's values and their corresponding 𝑝-value over time in the VAR model.
Fig. A.7. PT08.S3.NOx coefficient’s values and their corresponding 𝑝-value over time in the VAR model.
Fig. A.8. NO2.GT coefficient’s values and their corresponding 𝑝-value over time in the VAR model.
Fig. A.9. PT08.S4.NO2 coefficient’s values and their corresponding 𝑝-value over time in the VAR model.
Fig. A.10. PT08.S5.O3 coefficient’s values and their corresponding 𝑝-value over time in the VAR model.
Fig. A.11. T coefficient’s values and their corresponding 𝑝-value over time in the VAR model.
Fig. A.12. RH coefficient’s values and their corresponding 𝑝-value over time in the VAR model.
Fig. A.13. AH coefficient’s values and their corresponding 𝑝-value over time in the VAR model.
Fig. A.14. Predictors’ frequencies using the PCCF-based model (𝛾 = 0.6, 𝑝-value = 0.5).
In contrast, the important predictors identified in Figs. A.2–A.13, namely PT08.S2.NMHC, CO.GT, PT08.S1.CO, PT08.S5.O3 and C6.H6.GT, have a very large number of coefficients with a 𝑝-value below 0.1. Considering only statistically significant coefficients (𝑝-value ≤ 0.05) led to an RMSE above 12 (Table 2), which means that we cannot rely on the 𝑝-value alone to select the best predictors. Another point worth emphasizing is that coefficients become statistically significant over time, as can easily be noticed in Figs. A.4, A.9 and A.12–A.13. This can be explained by the fact that, as time goes on, more data are collected; this brings us closer to the real (unknown) underlying distribution of the variables, which in turn leads to more realistic coefficient estimates.
For the Grow-Shrink-based models (Figs. A.14–A.16), the predictors show both differences and common patterns of presence from one model to another. Let us start with the common patterns. In all models, the direct previous values (referred to with the prefix "Prev_") Prev_AH, Prev_C6H6.GT, Prev_NOx.GT, Prev_PT08.S1.CO, Prev_PT08.S4.NO2 and Prev_T, used to predict current values, are the least frequent predictors. As for the important predictors, Lag_PT08.S5.O3 (Lag_ is the prefix for cumulated values) appears in almost all models, and the same holds for Lag_NOx.GT. Prev_CO.GT and Lag_CO.GT (related to the CO.GT first difference) show some importance in all models (especially Lag_CO.GT in the Shrinkage-based model with learned 𝜂 and 𝜌, Fig. A.16), but Prev_CO.GT was not as important as other predictors such as Lag_PT08.S5.O3 (Figs. A.14 and A.15) and Lag_NOx.GT (Figs. A.14–A.16), despite the fact that the response variable is the CO.GT first difference.
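The predictor frequencies plotted in Figs. A.14–A.16 are simple tallies of how often each variable is selected into the network across the successive time steps; a small sketch of such a tally is given below, with a hypothetical input, since the selection lists themselves come from the learned models.

```python
# Hypothetical sketch: tally how often each predictor is selected across time
# steps (the kind of counts shown in Figs. A.14-A.16). `selected_per_step`
# is an assumed input: one set of predictor names per prediction step.
from collections import Counter

def predictor_frequencies(selected_per_step):
    counts = Counter()
    for selected in selected_per_step:
        counts.update(selected)
    return counts

# Example (toy data, not from the case study):
# predictor_frequencies([{"Lag_PT08.S5.O3", "Lag_NOx.GT"},
#                        {"Lag_PT08.S5.O3", "Prev_CO.GT"}])
# -> Counter({'Lag_PT08.S5.O3': 2, 'Lag_NOx.GT': 1, 'Prev_CO.GT': 1})
```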
Fig. A.15. Predictors’ frequencies using the Shrinkage-based model (𝜂 = 𝜌, 𝛾 = 0.2, 𝛼 = 0).
Fig. A.16. Predictors’ frequencies using the Shrinkage-based model (learned 𝜂 and 𝜌, 𝛾 = 0∗ , 𝛼 = 0).
The remaining predictors, for instance Lag_AH, Lag_C6H6.GT, Lag_PT08.S4.NO2 and Lag_T, have frequencies that differ from one model to another. Having discussed the coefficients' importance and presence patterns, in the next appendix we analyze the stability degrees used to select cumulative values of variables, and then we shed light on how previous and cumulative values have contributed to updating the stability-degree matrix in the stability-based models.
Appendix B. Stability degrees and shrinkage-based coefficients analysis

Figs. B.1–B.3 depict the stability degrees' values over time recorded for the three best SDGS models, namely the Shrinkage-based model (𝜂 = 𝜌, 𝛾 = 0.2, 𝛼 = 0), the Shrinkage-based model (learned 𝜂 and 𝜌, 𝛾 = 0∗, 𝛼 = 0) and the PCCF-based model (𝛾 = 0.6, 𝑝-value = 0.5). Regardless of the stability degrees' levels, all models (Figs. B.1–B.3) show the same pattern of peaks, especially for Lag_AH, Lag_NOx.GT, Lag_PT08.S5.O3 and Lag_T. Apart from the PCCF-based model, the Shrinkage-based model (𝜂 = 𝜌, 𝛾 = 0.2, 𝛼 = 0) and the Shrinkage-based model (learned 𝜂 and 𝜌) show very similar patterns for all stability degrees' time series; for example, Lag_AH and Lag_NOx.GT have peaks on the same dates in both models (between October 2004 and January 2005), as do Lag_RH and Lag_T. Since the PCCF-based model (𝛾 = 0.6, 𝑝-value = 0.5) yielded the best RMSE, Figs. B.4–B.15 present the box plots of each predictor's (cumulative values) monthly stability-degree levels. Comparing the PCCF absolute error plot (Fig. 33) with each of these figures (Figs. B.4–B.15), a few remarks can be made. Focusing on the median, for Lag_PT08.S5.O3, Lag_NOx.GT, Lag_RH, Lag_AH, Lag_CO.GT and Lag_NO2.GT we notice an almost inverse relationship between the stability degree and the absolute error (except between December 2004 and February 2005 for Lag_PT08.S5.O3 and Lag_NOx.GT). Lag_C6H6.GT's monthly stability degrees have an almost positive relationship with the absolute error, yet even after removing C6H6.GT from the model the RMSE did not improve (0.81408258). With Lag_PT08.S2.NMHC's monthly stability degrees, we sometimes notice a positive relationship (September–December) and other
Fig. B.1. Predictors’ stability degrees over time using the Shrinkage-based model (𝜂 = 𝜌, 𝛾 = 0.2, 𝛼 = 0).
Fig. B.2. Predictors’ stability degrees over time using the Shrinkage-based model (learned 𝜂 and 𝜌, 𝛾 = 0∗ , 𝛼 = 0).
Fig. B.3. Predictors’ stability degrees over time using the PCCF-based model (𝛾 = 0.6, 𝑝-value = 0.5).
Fig. B.4. Box plot of monthly Lag_AH Stability degrees’ levels using PCCF model.
times an inverse relationship (April–June). Without this predictor the RMSE is almost 0.8171. For Lag_T, Lag_PT08.S4.NO2, Lag_PT08.S3.NOx and especially Lag_PT08.S1.CO, it is hard to recognize any specific pattern between their stability-degree box plots and the absolute error plot.
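One way to make such visual comparisons less subjective, sketched here only as a suggestion and not as the procedure used in the paper, is to rank-correlate each predictor's monthly median stability degree with the monthly median absolute error.

```python
# Suggested (not the authors') way to quantify the visual comparison between a
# predictor's monthly stability degrees and the monthly absolute error:
# Spearman rank correlation of the monthly medians. A negative rho would be
# consistent with the inverse relationship noted for e.g. Lag_PT08.S5.O3.
import pandas as pd
from scipy.stats import spearmanr

def monthly_rank_correlation(stability: pd.Series, abs_error: pd.Series) -> float:
    """Both inputs are time-indexed hourly series; returns Spearman's rho
    between their monthly medians."""
    s = stability.resample("M").median()
    e = abs_error.resample("M").median()
    s, e = s.align(e, join="inner")          # keep only the months present in both
    rho, _ = spearmanr(s, e, nan_policy="omit")
    return rho
```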
As for the rest of the predictors, we notice an unstable pattern (ups and downs) of values, especially for Lag_CO.GT and Lag_C6H6.GT; the latter shows a steep decrease in June, then a constant increase until March 2005, then a steep decrease again in April of the same year (in the
Fig. B.5. Box plot of monthly Lag_C6H6.GT Stability degrees’ levels using PCCF model.
Fig. B.6. Box plot of monthly Lag_CO.GT Stability degrees’ levels using PCCF model.
Fig. B.7. Box plot of monthly Lag_NO2.GT Stability degrees’ levels using PCCF model.
Fig. B.8. Box plot of monthly Lag_NOx.GT Stability degrees’ levels using PCCF model.
Fig. B.9. Box plot of monthly Lag_PT08.S1.CO Stability degrees’ levels using PCCF model.
Fig. B.10. Box plot of monthly Lag_PT08.S2.NMHC Stability degrees’ levels using PCCF model.
Fig. B.11. Box plot of monthly Lag_PT08.S3.NOx Stability degrees’ levels using PCCF model.
Fig. B.12. Box plot of monthly Lag_PT08.S4.NO2 Stability degrees’ levels using PCCF model.
Fig. B.13. Box plot of monthly Lag_PT08.S5.O3 Stability degrees’ levels using PCCF model.
Fig. B.14. Box plot of monthly Lag_RH Stability degrees’ levels using PCCF model.
Fig. B.15. Box plot of monthly Lag_T Stability degrees’ levels using PCCF model.
same month, we notice a sudden decrease in Lag_CO.GT's coefficient values and an increase in Lag_PT08.S2.NMHC's coefficient values).
Appendix C. 𝜂 and 𝜌 analysis

In this appendix, we discuss how historical and recent dependencies between predictors have contributed, through the Cumulative stability matrix and the Update matrix respectively, to predicting the response variable's values. This contribution is weighted by two matrices, 𝜂 and 𝜌 (a schematic, purely illustrative sketch of this weighting is given at the end of this appendix). Figs. C.1–C.4 illustrate the variation through time of the 𝜂 and 𝜌 entries (those corresponding to the target variable) in the experiments based on the PCCF and the Shrinkage-based models; in the following analysis we simply refer to these entries as 𝜂 and 𝜌. In the PCCF-based model (Figs. C.1 and C.2) we notice that, for almost every predictor, a decrease in 𝜂 is reciprocated by an increase in 𝜌 and vice versa, for example for Lag_CO.GT, Lag_PT08.S2.NMHC, Lag_RH and Lag_T. We also notice that the cumulative values of some predictors, such as Lag_CO.GT, Lag_PT08.S2.NMHC, Lag_RH and Lag_PT08.S5.O3, tend to lose importance (decreasing 𝜂) while their recent values gain it (increasing 𝜌), and vice versa for other predictors such as Lag_AH, Lag_NOx.GT, Lag_T, Lag_PT08.S3.NOx and, slightly, Lag_C6H6.GT. Knowing that Lag_NOx.GT, Lag_PT08.S5.O3, Lag_AH and Lag_RH are the most frequent variables in the PCCF model, it seems that the model relies on the historical average of the stability degrees of Lag_NOx.GT and Lag_T more than on their newly calculated stability degrees, and the opposite holds for Lag_PT08.S5.O3 and Lag_RH. We also notice that as the 𝜂 and 𝜌 entries of Lag_AH (a frequent predictor) tend towards negative values, those of Lag_PT08.S1.CO, Lag_PT08.S4.NO2 and, slightly, Lag_PT08.S3.NOx tend towards larger positive values. In the Shrinkage-based model, Lag_CO.GT, Lag_PT08.S2.NMHC and Lag_RH share the same pattern of increasing 𝜂 and decreasing 𝜌 (also visible in the PCCF model, with slight differences in the 𝜌 plots); the same can be said about Lag_C6H6.GT, Lag_PT08.S1.CO and Lag_PT08.S5.O3. Also, the steep decrease in the 𝜌 values of Lag_C6H6.GT, Lag_PT08.S1.CO, Lag_PT08.S5.O3, Lag_CO.GT, Lag_PT08.S2.NMHC, Lag_RH and, slightly, Lag_NO2.GT is not matched by a fast increase in their 𝜂 plots.
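As the schematic illustration announced above, the following sketch pictures how the two weight matrices could act; the element-wise form shown here is only an assumed simplification, while the exact update rule is the one defined for the SDGS method earlier in the paper.

```python
# Schematic only: an assumed element-wise weighting of historical (cumulative)
# and recent (update) dependency strengths by eta and rho. The SDGS method's
# actual update rule is defined earlier in the paper and may differ in detail.
import numpy as np

def combine_stability(cumulative: np.ndarray,  # Cumulative stability matrix
                      update: np.ndarray,      # Update matrix (recent dependencies)
                      eta: np.ndarray,         # weights of the historical part
                      rho: np.ndarray) -> np.ndarray:  # weights of the recent part
    """Each dependency entry is a weighted mix of its historical and recent values."""
    return eta * cumulative + rho * update
```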
Fig. C.1. 𝜂’s value over time for predictor variables using the PCCF-based model (𝛾 = 0.6, 𝑝-value = 0.5).
Fig. C.2. 𝜌’s value over time for predictor variables using the PCCF-based model (𝛾 = 0.6, 𝑝-value = 0.5).
Fig. C.3. 𝜂’s value over time for predictor variables using the Shrinkage-based model (learned 𝜂 and 𝜌, 𝛾 = 0∗ , 𝛼 = 0).
Fig. C.4. 𝜌's value over time for predictor variables using the Shrinkage-based model (learned 𝜂 and 𝜌, 𝛾 = 0∗, 𝛼 = 0).