Information Sciences 264 (2014) 279–290
Causal conditioning and instantaneous coupling in causality graphs
Pierre-Olivier Amblard (a), Olivier Michel (b,*)
(a) Department of Mathematics and Statistics, University of Melbourne, Parkville, VIC 3010, Australia
(b) Gipsa-Lab UMR 5216, 961 rue de la Houille Blanche, BP 46, 38402 Saint Martin d'Hères Cedex, France
* Corresponding author. Tel.: +33 476826333. E-mail addresses: [email protected] (P.-O. Amblard), [email protected] (O. Michel).
http://dx.doi.org/10.1016/j.ins.2013.12.037
Article info
Article history: Received 7 February 2013; Received in revised form 27 August 2013; Accepted 29 December 2013; Available online 7 January 2014
Keywords: Directed information; Transfer entropy; Granger causality; Graphical model
Abstract
This study aims at providing the definitive links between Massey and Kramer's directed information theory and Granger causality graphs, recently formalized by Eichler. This naturally leads to consider two concepts of causality that can occur in physical systems: dynamical causality and instantaneous coupling. Although it is well accepted that the former is assessed by conditional transfer entropy, the latter is often overlooked, even if it was clearly introduced and understood in seminal studies. In the bivariate case, we show that directed information is decomposed into the sum of transfer entropy and instantaneous information exchange. In the multivariate case, encountered for conditional graph modeling, such a decomposition does not hold anymore. We provide a new insight into the problem of instantaneous coupling and show how it affects the estimated structure of a graphical model that should provide a sparse representation of a complex system. This result is discussed and analyzed for the inference of causality graphs. Two synthetic examples are developed to illustrate the theoretical concepts. Practical issues are also briefly discussed on these examples, and an extension of Leonenko's k-nearest-neighbor based entropy estimators is used to derive a nonparametric estimator of directed information.
© 2014 Elsevier Inc. All rights reserved.
1. Introduction
1.1. Motivations
Graphical modeling has received major attention in many different domains, such as the neurosciences [18], econometrics [8], and complex networks [26]. It provides a representation paradigm to explain how information flows between the nodes of a graph. In most cases, and in particular in the present study, the graph vertices are associated with synchronous time series. Inferring a graph thus requires defining the edges or links between the vertices. Granger [16,17] proposed a set of axiomatic definitions for the causality between, say, x and y (with a slight abuse of notation, each vertex will be named after its associated time series). Granger's definitions are based on the improvement that observations of x up to some time t - 1 can bring to the prediction of y at time t. The fundamental idea in Granger's approach is that the past and the present may cause the future, but the future cannot cause the past [16, Axiom A]. Granger's work also stresses the importance of side information, which accounts for the presence of all of the other vertices except x and y, when assessing the existence of a link between two nodes. This leads to what will be referred to as the bivariate case (an absence of side information) and the
multivariate case (a presence of side information) in what follows. Eichler and Dahlhaus [7-9] were among the first to propose the use of Granger causality in the framework of graphical modeling. In [9], precise definitions of Granger causality graphs are presented, and both the concepts of dynamical causality and instantaneous coupling (as we call them herein) are emphasized. Note that the concept of instantaneous coupling was present in the early works on Granger causality; however, as this concept appeared quite weak compared to dynamical causality, it has been overlooked in modern studies on causality, and especially in their applications. Instantaneous dependence in complex networks can have different origins. Even if it is not easy to conceive of instantaneous information exchange between nodes, the recording process (which includes filters, sample-and-hold devices, converters) contains integrators over short time lags. Any information that flows between two nodes within a delay shorter than the integration time can then be seen as instantaneous. Such a case is often met in systems that require long integration times per sample, as for example in functional magnetic resonance imaging. Alternatively, instantaneous coupling can occur if the noise contributions in structural models are no longer independent, which is likely in real-world cases.
1.2. Aim of the paper and outline
The purpose of the present study is to provide new insight into the problems related to instantaneous coupling, and to show how the presence of such coupling can affect the estimated structure of a graphical model that should provide a sparse representation of a complex system. This report is focused on the interplay of two types of dependence, dynamical causality and instantaneous coupling, an interplay that has been largely overlooked in other studies of directed information. The possibility of estimating directed-information-based measures with k-nearest-neighbor (k-NN) based tools is also illustrated. This novel application of k-NN circumvents the requirement for a parametric model, which has so far always been introduced in the estimation of directed information from continuous-valued data. We begin this report with a short review of the possible approaches to Granger causality. Section 3 then gives a brief review and some definitions of Eichler and Dahlhaus causality graphs [7,9], and presents an enlightening toy problem, where instantaneous coupling strongly affects the edge detection in a graphical model. Theoretical relations that demonstrate the link between directed information theory and Granger causality graphs are developed in the following section. The last section discusses some practical implementation issues, and provides the full treatment of a toy problem.
2. Approaches to Granger causality
2.1. Model-based approaches
In Geweke's pioneering work [11,12], an autoregressive modeling approach was adopted (for both the bivariate and multivariate cases) to provide a practical implementation of Granger causality graphs. Such a model-based approach motivated further studies. Information theory tools were also added by Rissanen and Wax [30], to account for the complexity of the regression model. Directed transfer functions, which are frequency-domain filter models for Granger causality, were derived in [16,17] for neuroscience applications. Nonlinear extensions have been proposed [14], with recent developments relying on functional estimation in a reproducing kernel Hilbert space [23,3]. All of these approaches are intrinsically parametric, and as such, they can introduce bias into the analysis.
2.2. Information-theory-based measures
An alternative for assessing the existence of a link between nodes was elaborated earlier in the bivariate case (see, for example, [31,15,32,29,33]). It consists in adapting information theory measures, such as mutual information or information divergence, to assess the existence and/or strength of a link between two nodes. The motivation for introducing such tools relies upon the ability of information theory measures to account for the entire probability density function of the observations (provided that such a probability density exists), instead of only second-order characteristics, as in linear filter modeling approaches. Among these earlier studies, one of the oldest and maybe the least well known was developed by Gouriéroux et al. [15], where a generalization of Geweke's idea [11] using Kullback divergences was introduced for the case of two time series. It is of note that the tools they introduced were later rediscovered by Massey [25] and Kramer [19] in their development of bivariate directed information theory. The development of directionality-specific or causality-specific measures was initiated by a study on directed information by Marko [24], extended by Massey [25], and later by Kramer [19], who introduced causal conditioning by side information. This offers a means to account for side information, or to tackle the multivariate case. The first steps in the exploration of the relationships between the Geweke approach to Granger causality and directed information theory tools were made in [1] for the Gaussian case, with further insights then developed in [2], and in [27,28] in the absence of an instantaneous dependence structure. In [2], a new directed-information-based definition was proposed for Granger causality. Eichler [9] recently studied this latter issue in a graph modeling framework, both from a theoretical point of view, with recourse to probability-based definitions, and in a parametric modeling context.
3. Causality graphs
We briefly review the concept of causality graphs as developed by Eichler. The main reference is [9], where a complete presentation of causality graphs and a study of their Markovian properties are developed.
3.1. Definitions
Let $x_V = \{x_V(k), k \in \mathbb{Z}\}$ be a $d$-dimensional discrete-time stationary multivariate process on some probability space. The probability measures are assumed to be absolutely continuous with respect to the Lebesgue measure, and the associated density will be termed $P$. $V$ is the index set $\{1, \ldots, d\}$. For $a \in V$ we denote by $x_a$ the corresponding component of $x_V$. Likewise, for any subset $A \subseteq V$, $x_A$ is the corresponding multivariate process. The information obtained by observing $x_A$ up to time $k$ is summarized by the filtration generated by $\{x_A(l), \forall l \leq k\}$; this is denoted as $x_A^k$. Following [16,17,9], three definitions can be proposed for Granger causality. The first is based on simple forward prediction, which is the root concept underlying Granger causality. The next two definitions correspond to alternative choices in the definition of instantaneous coupling. Let $A$ and $B$ be two disjoint subsets of $V$, and let $C = V \setminus (A \cup B)$.
Definition 1 (Dynamical). $x_A$ does not (dynamically) cause $x_B$ if for all $k \in \mathbb{Z}$,
$$P(x_B(k+1) \mid x_A^k, x_B^k, x_C^k) = P(x_B(k+1) \mid x_B^k, x_C^k)$$
Dynamical Granger causality states that $x$ causes $y$ if the prediction of $y$ from its past is improved when the past of $x$ is also considered, and this relative to any side information observed prior to the prediction. This is the meaning of Definition 1: conditionally on its own past and on the past of the side information, $x_B$ is independent of the past of $x_A$. In mathematical terms, $x_A^k \to x_B^k \to x_B(k+1)$ is a Markov chain conditionally on the side information $x_C^k$. Conditioning on $x_C^k$ instead of $x_C^{k+1}$ in Definition 1 raises an important issue: in a model estimation framework that is not aimed at identifying links between all pairs of possible nodes, accounting for the present of $x_C$ in the prediction problem can be considered; this is, for instance, the case for autoregressive moving-average (ARMA) modeling. However, conditioning on $x_C^{k+1}$ weakens the effectiveness of the definition of causality by introducing a symmetry in the causal relationships between $B$ and $C$. Conditioning is therefore restricted to the past of the observation, in a strict sense. This excludes the possibility of instantaneous dependences, for which a separate definition is required. There are, however, two possible definitions here, as follows.
Definition 2 (Instantaneous). $x_A$ is (instantaneously) not coupled to $x_B$ if for all $k \in \mathbb{Z}$,
$$P(x_B(k+1) \mid x_A^{k+1}, x_B^k, x_C^{k+1}) = P(x_B(k+1) \mid x_A^k, x_B^k, x_C^{k+1})$$
The second possibility is the following.
Definition 3 (Unconditional instantaneous). $x_A$ is not (unconditionally instantaneously) coupled to $x_B$ if for all $k \in \mathbb{Z}$,
$$P(x_B(k+1) \mid x_A^{k+1}, x_B^k, x_C^k) = P(x_B(k+1) \mid x_A^k, x_B^k, x_C^k)$$
First, Definitions 2 and 3 are easily shown to be symmetrical in $A$ and $B$ (through an application of Bayes' theorem). Secondly, taking $x_C^{k+1}$ as side information in Definition 2 instead of $x_C^k$ in Definition 3 is fundamental here. If the side information is considered up to time $k$ only, the instantaneous dependence or independence is not conditional on the remaining nodes in $C$. Indeed, including all of the information up to time $k$ in the conditioning variables allows the dependence or independence between $x_A(k+1)$ and $x_B(k+1)$ to be tested instantaneously. The independence tested is unconditional if $x_C(k+1)$ is not included in the conditioning set, whereas it is conditional if $x_C(k+1)$ is included. The choice is thus crucial for the type of graph of instantaneous dependence obtained. With Definition 2, the graphs obtained are conditional dependence graphs, as is usual in graphical modeling [36,21]. On the contrary, graphs obtained with Definition 3 are dependence graphs, which do not have the Markov properties that conditional dependence graphs can have. The two possible types of dependence (dynamical and instantaneous) will be encoded in the graphs by two different types of edges between the vertices. Dynamical causality will be represented by an arrow, hence symbolizing directivity, whereas instantaneous coupling will be represented by a line.
3.2. A detailed example
For the sake of illustration, we create a simple four-dimensional example. Let $\rho_{1,2,3} \in (-1, 1)$ and let
Fig. 1. Causality graphs for the example developed in the text. Illustration of the differences between the two definitions of instantaneous coupling.
$$C_e = \begin{pmatrix} 1 & \rho_1 & 0 & \rho_1\rho_2 \\ \rho_1 & 1 & 0 & \rho_2 \\ 0 & 0 & 1 & \rho_3 \\ \rho_1\rho_2 & \rho_2 & \rho_3 & 1 \end{pmatrix} \qquad (1)$$
be the covariance matrix of the independent and identically distributed (i.i.d.) zero-mean Gaussian sequence $(e_{w,t}, e_{x,t}, e_{y,t}, e_{z,t})^\top$. The inverse of $C_e$, which is known as the precision matrix, reveals the conditional independence relationships between the components of the noise (as it is Gaussian), and reads as:
$$C_e^{-1} = \begin{pmatrix} d_1 & -d_1\rho_1 & 0 & 0 \\ -d_1\rho_1 & d_1\rho_1^2 + d_2(1-\rho_3^2) & d_2\rho_2\rho_3 & -d_2\rho_2 \\ 0 & d_2\rho_2\rho_3 & d_2(1-\rho_2^2) & -d_2\rho_3 \\ 0 & -d_2\rho_2 & -d_2\rho_3 & d_2 \end{pmatrix} \qquad (2)$$
where $d_1 = 1/(1-\rho_1^2)$ and $d_2 = 1/(1-\rho_2^2-\rho_3^2)$.
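For illustration, the zero patterns of $C_e$ and of its inverse can be checked numerically. The following minimal sketch is ours, not part of the paper; it assumes numpy is available and borrows the numerical values of $\rho_{1,2,3}$ used later in Section 5.4. It verifies that $e_x$ and $e_y$ are marginally uncorrelated, as announced by $C_e$, yet partially correlated given the other components, as announced by the non-zero $(x, y)$ entry of $C_e^{-1}$:

```python
import numpy as np

rho1, rho2, rho3 = 0.66, 0.55, 0.48      # values used in Section 5.4

# Covariance of (e_w, e_x, e_y, e_z), Eq. (1)
Ce = np.array([[1.0,         rho1,  0.0,  rho1 * rho2],
               [rho1,        1.0,   0.0,  rho2       ],
               [0.0,         0.0,   1.0,  rho3       ],
               [rho1 * rho2, rho2,  rho3, 1.0        ]])

rng = np.random.default_rng(0)
e = rng.multivariate_normal(np.zeros(4), Ce, size=200_000)

# Marginal correlation between e_x and e_y: close to 0, since Ce[1, 2] = 0.
print(np.corrcoef(e[:, 1], e[:, 2])[0, 1])

# Partial correlation of (e_x, e_y) given (e_w, e_z), read off the empirical
# precision matrix as -J_xy / sqrt(J_xx * J_yy): clearly non-zero, cf. Eq. (2).
J = np.linalg.inv(np.cov(e.T))
print(-J[1, 2] / np.sqrt(J[1, 1] * J[2, 2]))
```

This is precisely the situation exploited below: an edge between $x$ and $y$ appears in the conditional instantaneous graph but not in the unconditional one.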
Consider the following structural model:
$$\begin{aligned} w_t &= f_w(w_{t-1}, x_{t-1}, z_{t-1}) + e_{w,t} \\ x_t &= f_x(x_{t-1}, z_{t-1}) + e_{x,t} \\ y_t &= f_y(x_{t-1}, y_{t-1}) + e_{y,t} \\ z_t &= f_z(w_{t-1}, z_{t-1}) + e_{z,t} \end{aligned}$$
To infer the causality graph, we first look for directed links between pairs of nodes. In such a structural model, if a signal $a$ at time $t$ depends, through the function $f_a$, on another signal $b$ at time $t-1$, then there is a link $b \to a$. For example, consider the question of whether there is a link from $z$ to $w$. From the definition of the model, we have:
$$P(w_t \mid w^{t-1}, z^{t-1}, (x,y)^{t-1}) = P_{e_w}(w_t - f_w(w_{t-1}, x_{t-1}, z_{t-1}))$$
$$P(w_t \mid w^{t-1}, (x,y)^{t-1}) = \mathbb{E}_{z^{t-1}}\left[P_{e_w}(w_t - f_w(w_{t-1}, x_{t-1}, z_{t-1}))\right]$$
which are obviously not equal here. Therefore $z \to w \mid x, y$. Consider now the case of $z$ and $y$. We have $P(y_t \mid y^{t-1}, z^{t-1}, (x,w)^{t-1}) = P_{e_y}(y_t - f_y(x_{t-1}, y_{t-1})) = P(y_t \mid y^{t-1}, (x,w)^{t-1})$. Thus $z \nrightarrow y \mid x, w$. Proceeding pairwise in this way, or following the intuitive point of view described above, leads to the set of oriented edges depicted in the causality graph in Fig. 1. To get the instantaneous edges, as discussed in the previous section, we have two possible definitions. If the side information is considered up to time $t-1$, we obtain the unconditional graph in Fig. 1. Indeed, for the unconditional graph, when testing for the presence of an edge between $x$ and $y$, we evaluate $P(x_t \mid x^{t-1}, y^t, (w,z)^{t-1}) = P(e_x \mid e_y) = P(e_x)$, as $e_x$ and $e_y$ are independent (examine $C_e$ and remember the noises are Gaussian). Note that doing this for all pairs, we indeed obtain the graph of dependence relationships. For the conditional graph, we instead evaluate $P(x_t \mid x^{t-1}, y^t, (w,z)^t) = P(e_x \mid e_y, e_w, e_z)$. In this case, we indeed measure the conditional dependence between $x$ and $y$. It turns out in this example that, even though they are independent, $e_x$ and $e_y$ are dependent conditionally on $e_z$, and therefore there is an undirected edge between $x$ and $y$ in the conditional graph. This simple synthetic example raises an important practical issue: in real-world experiments with multiple time series, it is unlikely that the noise sources are truly independent or that all relevant processes were recorded. We see here that instantaneous coupling may stem from such noise correlation, and that it strongly affects the identified graphical model, depending on the definition (conditional or unconditional) adopted for instantaneous dependence.

4. Directed information and causality graphs
The importance of side information and the consequences of choosing a particular definition of instantaneous coupling were introduced in the previous section by considering sets of equations on probability density functions. An alternative approach, which leads to practically computable quantities and provides an interesting insight into the dependence structure, relies on information-theoretic tools. We start with a brief reminder of the main definitions of directed information and some related results. Bivariate analysis results are outlined first, to provide better insight for the discussion of the multivariate case. Note that bivariate analysis corresponds to the simple situation where no side information is present, or where it is simply neglected.
Work by Massey has focused on information measures for systems that can exhibit feedback [25]. In this framework, Massey showed that the appropriate information measure was no longer the mutual information, but the directed information. For two subsets A and B, directed information is defined by
$$I(x_A^k \to x_B^k) = \sum_{i=1}^{k} I(x_A^i; x_B(i) \mid x_B^{i-1})$$
where $I(\cdot\,; \cdot \mid \cdot)$ stands for the usual conditional mutual information [6]. Later, in [19], Kramer introduced the idea of causal conditioning and defined the causally conditioned entropy through a modified version of the Bayes chain rule for conditional entropy: while the usual chain rule is written as $H(x_B^k \mid x_A^k) = \sum_{i=1}^{k} H(x_B(i) \mid x_B^{i-1}, x_A^k)$, the causally conditioned entropy is defined as:
$$H(x_B^k \Vert x_A^k) = \sum_{i=1}^{k} H(x_B(i) \mid x_B^{i-1}, x_A^{i})$$
The difference lies in the conditioning on $x_A$, which is now considered up to time $i$ only, for each term entering the sum. From these definitions, the directed information can easily be decomposed into the difference of two terms:
$$I(x_A^k \to x_B^k) = H(x_B^k) - H(x_B^k \Vert x_A^k)$$
which can be compared to the well-known (and sometimes taken as a definition) formula for the mutual information, $I(x_A^k; x_B^k) = H(x_B^k) - H(x_B^k \mid x_A^k)$. From these definitions, Massey and Kramer derived two interesting results. The first is the following equality, where $Dx_A^k = (0, x_A^{k-1})$ represents the delayed (by one time lag) version of $x_A$:
$$I(x_A^k \to x_B^k) + I(x_B^k \to x_A^k) = I(x_A^k; x_B^k) + I(x_A^k \to x_B^k \Vert Dx_A^k) \qquad (3)$$
This implies that the sum of the two directed informations is larger than the mutual information. The term $I(x_A^k \to x_B^k \Vert Dx_A^k)$ is positive (a sum of positive contributions) and accounts for the instantaneous information exchange. Using Eq. (6), the following can easily be obtained:
$$I(x_A^k \to x_B^k \Vert Dx_A^k) = \sum_{i=1}^{k} I(x_A(i); x_B(i) \mid x_B^{i-1}, x_A^{i-1}) \qquad (4)$$
which is symmetric with respect to $A$ and $B$. It is of note that, according to its definition, directed information accounts for instantaneous information exchange as well as for dynamical information exchange. Then, in the sum of the directed informations on the left-hand side of Eq. (3), the contribution of the instantaneous information is counted twice. It is counted only once in the mutual information, and this explains the remaining term on the right-hand side of this equation. The instantaneous information exchange term of Eq. (4) is zero if and only if $I(x_A(i); x_B(i) \mid x_B^{i-1}, x_A^{i-1}) = 0$ for all $i$, i.e., if and only if $x_A$ and $x_B$ are independent conditionally on their past. Such a situation can occur for multivariate Markov processes described by $X_V(t) = f(X_V(t-1)) + \epsilon_V(t)$, where $\epsilon_V(t)$ is an i.i.d. multivariate noise process with independent components. Note that in the example of the previous section, $C_e$ is not diagonal; the noise components are therefore correlated and lead to some instantaneous information exchange between some of the nodes (in a nontrivial way). We are at this point ready to examine how directed information can be used in causality graphs. For multivariate measurements, two approaches are possible. The first is a bivariate analysis in which we study directed information between pairs of nodes without taking into account the side information (the remaining nodes). The second approach accounts for the side information, but needs further development. Even if the bivariate framework is a naive approach, it is presented here as it provides some insight into how directed information is applied. We will then turn to the trickier multivariate analysis.
4.1. Bivariate analysis in graphs
Consider two disjoint subsets $A$ and $B$ of $V$. The directed information can be re-expressed as the sum:
$$I(x_A^k \to x_B^k) = I(x_A^k \to x_B^k \Vert Dx_A^k) + I(Dx_A^k \to x_B^k)$$
where the first term is the instantaneous information exchange as defined by Eq. (4), whereas the second term, $I(Dx_A^k \to x_B^k)$, will be referred to as the transfer entropy, following Schreiber's definition, which was proposed in a different framework [32]. In the absence of any side information, these terms account for the instantaneous coupling and for the dynamical causality, respectively. Indeed, the transfer entropy reduces to zero if and only if $I(x_A^{i-1}; x_B(i) \mid x_B^{i-1}) = 0$ for all $i$, or equivalently, if and only if $x_A$ does not dynamically cause $x_B$ (see Definition 1). Furthermore, we have seen above that $I(x_A^k \to x_B^k \Vert Dx_A^k) = 0$ if and only if $x_A$ and $x_B$ are independent conditionally on their past or, in the words of our definitions, if and only if $x_A$ is not instantaneously coupled to $x_B$. This result extends those obtained for the Gaussian bivariate case in [1,5], which were restricted to dynamical causality. Again, these conclusions hold only in the case where no side information is considered.
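Both the decomposition above and identity (3) can be verified by exhaustive enumeration on a small example. The sketch below is ours, not from the paper: it builds an arbitrary positive transition matrix for a binary joint Markov chain $(x, y)$, computes every information term exactly from the joint law of all trajectories, and checks $I(x^k \to y^k) = I(x^k \to y^k \Vert Dx^k) + I(Dx^k \to y^k)$ as well as Eq. (3). Only numpy is assumed.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
T = rng.uniform(0.1, 1.0, size=(4, 4))
T /= T.sum(axis=1, keepdims=True)      # transitions on pair-states s = (x, y)
p0 = np.full(4, 0.25)                  # initial law of (x(1), y(1))

K = 5                                  # horizon: 4**5 = 1024 trajectories
prob = {}
for s in itertools.product(range(4), repeat=K):
    p = p0[s[0]]
    for u, v in zip(s, s[1:]):
        p *= T[u, v]
    prob[s] = p

def H(view):
    """Exact entropy (bits) of the image of the trajectory under `view`."""
    marg = {}
    for s, p in prob.items():
        key = view(s)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * np.log2(p) for p in marg.values() if p > 0)

def I(f, g, h):
    """Exact conditional mutual information I(f; g | h)."""
    return (H(lambda s: (f(s), h(s))) + H(lambda s: (g(s), h(s)))
            - H(lambda s: (f(s), g(s), h(s))) - H(h))

X = lambda s, i: tuple(v // 2 for v in s[:i])   # x^i
Y = lambda s, i: tuple(v % 2 for v in s[:i])    # y^i

DIxy = sum(I(lambda s, i=i: X(s, i), lambda s, i=i: Y(s, i)[-1:],
             lambda s, i=i: Y(s, i - 1)) for i in range(1, K + 1))
DIyx = sum(I(lambda s, i=i: Y(s, i), lambda s, i=i: X(s, i)[-1:],
             lambda s, i=i: X(s, i - 1)) for i in range(1, K + 1))
TE = sum(I(lambda s, i=i: X(s, i - 1), lambda s, i=i: Y(s, i)[-1:],
           lambda s, i=i: Y(s, i - 1)) for i in range(1, K + 1))
IIE = sum(I(lambda s, i=i: X(s, i)[-1:], lambda s, i=i: Y(s, i)[-1:],
            lambda s, i=i: (X(s, i - 1), Y(s, i - 1))) for i in range(1, K + 1))
MI = I(lambda s: X(s, K), lambda s: Y(s, K), lambda s: ())

print(abs(DIxy - (TE + IIE)))          # ~0: bivariate decomposition
print(abs(DIxy + DIyx - (MI + IIE)))   # ~0: Massey-Kramer identity (3)
```

Both printed residuals are at machine precision, for any choice of the transition matrix.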
4.2. Multivariate analysis in graphs
It is assumed in the following that the set of measurements, or nodes, $V$ is partitioned into three disjoint subsets $A$, $B$ and $C = V \setminus (A \cup B)$. We study the information flow between $A$ and $B$ when the side information $C$ is considered. Mathematically, taking into account the side information corresponds to conditioning on it. As we outlined in Section 3, since we are dealing with the graph $V$, we must use causal conditioning, or we would break the symmetry between either $A$ or $B$ and $C$ in the analysis. This leads to relating the causally conditioned directed information to Definitions 1-3. As we have two possible definitions for instantaneous coupling, we have two possible choices for using the side information as a conditioner: we can use the past $x_C^{k-1} = Dx_C^k$, or the past as well as the present, $x_C^k$.
Conditioning on the past: Here, we evaluate $I(x_A^k \to x_B^k \Vert Dx_C^k)$. This quantity will be termed causal directed information in the sequel, and is defined by substituting causally conditioned entropies for entropies in the definition of directed information:
$$I(x_A^k \to x_B^k \Vert Dx_C^k) = H(x_B^k \Vert Dx_C^k) - H(x_B^k \Vert x_A^k, Dx_C^k) = \sum_{i=1}^{k} I(x_A^i; x_B(i) \mid x_B^{i-1}, x_C^{i-1}) \qquad (5)$$
This can easily be expanded as:
$$I(x_A^k \to x_B^k \Vert Dx_C^k) = I(Dx_A^k \to x_B^k \Vert Dx_C^k) + I(x_A^k \to x_B^k \Vert Dx_A^k, Dx_C^k)$$
We call the first term of this decomposition, $I(Dx_A^k \to x_B^k \Vert Dx_C^k)$, the conditional transfer entropy between $A$ and $B$ given $C$. It is zero if and only if $I(x_A^{i-1}; x_B(i) \mid x_B^{i-1}, x_C^{i-1}) = 0$ for all $i$, or equivalently, if and only if $P(x_B(k) \mid x_A^{k-1}, x_B^{k-1}, x_C^{k-1}) = P(x_B(k) \mid x_B^{k-1}, x_C^{k-1})$. In other words, according to Definition 1, the conditional transfer entropy between $A$ and $B$ is zero if and only if $A$ does not dynamically cause $B$. The second term, $I(x_A^k \to x_B^k \Vert Dx_A^k, Dx_C^k)$, is zero if and only if $I(x_A(i); x_B(i) \mid x_A^{i-1}, x_B^{i-1}, x_C^{i-1}) = 0$ for all $i$ and therefore, according to Definition 3, if and only if $A$ does not unconditionally instantaneously cause $B$. We will refer to this measure as the unconditional instantaneous information exchange. Note that the term 'unconditional' refers to the nature of the type of independence the measure reveals.
Conditioning up to the present: Here, we evaluate $I(x_A^k \to x_B^k \Vert x_C^k)$, defined as:
$$I(x_A^k \to x_B^k \Vert x_C^k) = H(x_B^k \Vert x_C^k) - H(x_B^k \Vert x_A^k, x_C^k) = \sum_{i=1}^{k} I(x_A^i; x_B(i) \mid x_B^{i-1}, x_C^{i}) \qquad (6)$$
Referring to Definition 2, this quantity will be termed the conditional causal directed information. Note that the term 'conditional' here emphasizes the link with the conditional graphs introduced in Section 3. The idea we pursue now is to find a decomposition in which both the conditional transfer entropy and a measure accounting for Definition 3 appear. Applying the chain rule for conditional mutual information several times, we get:
$$\begin{aligned} I(x_A^i; x_B(i) \mid x_B^{i-1}, x_C^{i}) &= I(x_C(i), x_A^i; x_B(i) \mid x_B^{i-1}, x_C^{i-1}) - I(x_C(i); x_B(i) \mid x_B^{i-1}, x_C^{i-1}) \\ &= I(x_A^{i-1}; x_B(i) \mid x_B^{i-1}, x_C^{i-1}) + I(x_C(i), x_A(i); x_B(i) \mid x_A^{i-1}, x_B^{i-1}, x_C^{i-1}) - I(x_C(i); x_B(i) \mid x_B^{i-1}, x_C^{i-1}) \\ &= I(x_A^{i-1}; x_B(i) \mid x_B^{i-1}, x_C^{i-1}) + I(x_A(i); x_B(i) \mid x_A^{i-1}, x_B^{i-1}, x_C^{i}) + I(x_C(i); x_B(i) \mid x_A^{i-1}, x_B^{i-1}, x_C^{i-1}) - I(x_C(i); x_B(i) \mid x_B^{i-1}, x_C^{i-1}) \end{aligned}$$
Summing over $i$, we get the following decomposition of the causally conditioned directed information $I(x_A^k \to x_B^k \Vert x_C^k)$:
$$I(x_A^k \to x_B^k \Vert x_C^k) = I(Dx_A^k \to x_B^k \Vert Dx_C^k) + I(x_A^k \to x_B^k \Vert Dx_A^k, x_C^k) + \Delta I(x_C^k \to x_B^k \Vert Dx_A^k, Dx_C^k) \qquad (7)$$
The term $I(x_A^k \to x_B^k \Vert Dx_A^k, x_C^k)$ is called the conditional instantaneous information exchange. It is zero if and only if Definition 2 is verified, i.e., if and only if $A$ is not (conditionally) instantaneously coupled to $B$. We recover in the decomposition the conditional transfer entropy, which accounts for dynamical causality. The surprise arises from an extra term in Eq. (7), which is defined as:
$$\Delta I(x_C^k \to x_B^k \Vert Dx_A^k, Dx_C^k) = I(x_C^k \to x_B^k \Vert Dx_A^k, Dx_C^k) - I(x_C^k \to x_B^k \Vert Dx_C^k)$$
This term also measures an instantaneous quantity. It is the difference between two instantaneous couplings of different natures: the first term, $\sum_i I(x_C(i); x_B(i) \mid x_A^{i-1}, x_B^{i-1}, x_C^{i-1})$, describes intrinsic coupling, in the sense that it does not depend on parties other than $C$ and $B$; the second term, $\sum_i I(x_C(i); x_B(i) \mid x_B^{i-1}, x_C^{i-1})$, relates to extrinsic coupling, as it measures the instantaneous coupling at time $i$ created by variables other than $B$ and $C$. The conclusion is the following: causal directed information is the right measure to assess information flow in Granger causality graphs if the unconditional definition is adopted for instantaneous coupling. In this case, the causal directed information $I(x_A^k \to x_B^k \Vert Dx_C^k)$ is zero if and only if there is no causality from $A$ to $B$. If it is not zero, we must evaluate the conditional transfer entropy and the unconditional instantaneous information exchange to assess the dynamical and instantaneous couplings. However, as shown by Eichler [8,9], the graphs obtained in this case do not have nice properties, as the instantaneous graph is not a conditional dependence graph. On the other hand, if we adopt Definition 2 for instantaneous coupling, we do not have the same nice decomposition, and $I(x_A^k \to x_B^k \Vert x_C^k)$ cannot be used to check noncausality. However, we have shown that the correct measures to assess dynamical and instantaneous coupling are then the conditional transfer entropy $I(Dx_A^k \to x_B^k \Vert Dx_C^k)$ and the instantaneous information exchange $I(x_A^k \to x_B^k \Vert Dx_A^k, x_C^k)$, respectively.
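Decomposition (7) can be checked by brute force, in the same spirit as the bivariate sketch above. The code below is again ours and not from the paper: it uses an arbitrary positive transition matrix on three binary components, computes all four terms of Eq. (7) exactly from the joint law, and verifies that they balance, while the extra term $\Delta I$ is itself non-zero in general.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
S = 8                                      # joint states (x_A, x_B, x_C), each binary
T = rng.uniform(0.1, 1.0, size=(S, S))
T /= T.sum(axis=1, keepdims=True)
p0 = np.full(S, 1.0 / S)

K = 4                                      # horizon: 8**4 = 4096 trajectories
prob = {}
for s in itertools.product(range(S), repeat=K):
    p = p0[s[0]]
    for u, v in zip(s, s[1:]):
        p *= T[u, v]
    prob[s] = p

def H(view):
    marg = {}
    for s, p in prob.items():
        key = view(s)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * np.log2(p) for p in marg.values() if p > 0)

def I(f, g, h):
    return (H(lambda s: (f(s), h(s))) + H(lambda s: (g(s), h(s)))
            - H(lambda s: (f(s), g(s), h(s))) - H(h))

def comp(s, j, i):
    """Trajectory of bit j of the joint state, observed up to time i."""
    return tuple((v >> j) & 1 for v in s[:i])

A, B, C = 2, 1, 0                          # bit indices of x_A, x_B, x_C

lhs = t1 = t2 = t3 = 0.0
for i in range(1, K + 1):
    Bi = lambda s, i=i: comp(s, B, i)[-1:]                       # x_B(i)
    lhs += I(lambda s, i=i: comp(s, A, i), Bi,                   # Eq. (6) summand
             lambda s, i=i: (comp(s, B, i - 1), comp(s, C, i)))
    t1 += I(lambda s, i=i: comp(s, A, i - 1), Bi,                # conditional TE
            lambda s, i=i: (comp(s, B, i - 1), comp(s, C, i - 1)))
    t2 += I(lambda s, i=i: comp(s, A, i)[-1:], Bi,               # conditional IIE
            lambda s, i=i: (comp(s, A, i - 1), comp(s, B, i - 1), comp(s, C, i)))
    t3 += (I(lambda s, i=i: comp(s, C, i)[-1:], Bi,              # Delta I term
             lambda s, i=i: (comp(s, A, i - 1), comp(s, B, i - 1), comp(s, C, i - 1)))
           - I(lambda s, i=i: comp(s, C, i)[-1:], Bi,
               lambda s, i=i: (comp(s, B, i - 1), comp(s, C, i - 1))))

print(abs(lhs - (t1 + t2 + t3)))   # ~0: Eq. (7) holds
print(t3)                          # generally non-zero
```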
5. Illustrations, practical implementation
This section is devoted to the practical application of the previous results. We begin by discussing estimation issues, and we then illustrate the key ideas on synthetic examples. To our knowledge, only parametric estimation techniques have been reported for estimating directed information (DI) from continuous-valued data. Here, we extend the k-nearest-neighbor entropy estimation technique developed by Leonenko and co-workers [13] to the case of DI estimation. Convergence properties and issues concerning the choice of k are not discussed in this paper, but can be found in [4] and the references therein.
5.1. Estimation issues
The estimator we use is based on the Leonenko k-nearest-neighbor estimator of the entropy [13] (see also [22] for an extension to multidimensional densities). Let $x_i$, $i = 1, \ldots, N$, be $N$ observations of some random vector $x$ that takes values in $\mathbb{R}^n$. The Leonenko estimator of the entropy then reads:
$$\hat{H}_k(x) = \frac{1}{N} \sum_{i=1}^{N} \log\big((N-1)\, C_k\, V_n\, d(x_i, x_{i(k)})^{n}\big)$$
In this expression, $d: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}_+$ is a metric, $x_{i(k)}$ is the $k$th nearest neighbor of $x_i$, $V_n$ is the volume of the unit ball for the metric $d$, and $C_k = \exp(-\psi(k))$, where $\psi(\cdot)$ is the digamma function, defined as the derivative of the logarithm of the Gamma function. It is shown in [13] that this estimator converges in the mean square sense to the entropy of the random vector $x$ (under the i.i.d. assumption on the $x_i$) for any value of $k$ lower than $N-1$.
Estimation of the conditional mutual information. To estimate directed information, we need to estimate the conditional mutual information $I(a; b \mid c) = H(b, c) + H(a, c) - H(a, b, c) - H(c)$. Estimation of the conditional mutual information can thus be done using four applications of the Leonenko estimator. Although it is asymptotically unbiased, the Leonenko estimator is biased for a finite sample size, and the bias depends on the dimension of the underlying space. Apart from the fact that a plug-in estimator would suffer from a high variance, the biases of the four entropy estimators therefore do not cancel out. A smart idea to circumvent this problem was proposed by Kraskov et al. in 2004 for the mutual information case [20], and later extended by Frenzel and Pompe to the conditional mutual information case [10]. The idea relies on two facts: the estimator is valid for any metric, and it converges for any $k \leq N-1$. The trick is then to use, in the product space, the maximum of the metrics used on the marginal spaces. This determines the distance $d(x_i, x_{i(k)})$ between $x_i$ and its $k$th nearest neighbor as a scale in the product space. This distance is then used on the marginals to determine $k'$, the number of neighbors for which $d(x_i, x_{i(k')})$ is the distance between $x_i$, projected onto the marginal, and its $k'$th nearest neighbor.
Estimation of the directed information. Estimation requires that the processes studied be ergodic and stationary. Without these basic assumptions, nothing can really be done. The goal is to estimate the transfer entropy and the instantaneous information exchange. When dealing with monovariate signals $x_A(k) = x(k)$ and $x_B(k) = y(k)$, and with side information $x_C(k)$, the information measures read as:
$$I(Dx^k \to y^k \Vert Dx_C^k) = \sum_{i=1}^{k} I(x^{i-1}; y(i) \mid y^{i-1}, x_C^{i-1})$$
$$I(x^k \to y^k \Vert Dx^k, x_C^k) = \sum_{i=1}^{k} I(x(i); y(i) \mid y^{i-1}, x^{i-1}, x_C^{i})$$
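The estimators just described can be sketched as follows. This is a minimal implementation of ours, assuming numpy and scipy are available; `entropy_knn` follows the Kozachenko-Leonenko form of the estimator above (with the maximum norm, for which $V_n = 2^n$ and $C_k = \exp(-\psi(k))$), and `cmi_knn` implements the Frenzel-Pompe neighbor-counting version of the conditional mutual information [10,20]. The function names are ours, not from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def entropy_knn(x, k=5):
    """Kozachenko-Leonenko entropy estimate (nats) of samples x, shape (N, n)."""
    n_samp, dim = x.shape
    d, _ = cKDTree(x).query(x, k + 1, p=np.inf)   # first neighbor is the point itself
    eps = d[:, -1]                                # distance to the kth true neighbor
    return (digamma(n_samp) - digamma(k) + dim * np.log(2.0)
            + dim * np.mean(np.log(np.maximum(eps, 1e-300))))

def cmi_knn(a, b, c, k=5):
    """Frenzel-Pompe estimate of I(a; b | c) in nats; a, b, c of shape (N, .)."""
    abc = np.hstack([a, b, c])
    eps = cKDTree(abc).query(abc, k + 1, p=np.inf)[0][:, -1]
    def count(z):
        # neighbors strictly within eps_i (up to numerical tolerance), self excluded
        t = cKDTree(z)
        return np.array([len(t.query_ball_point(z[i], max(eps[i] - 1e-12, 0.0),
                                                p=np.inf)) - 1
                         for i in range(len(z))])
    n_ac, n_bc, n_c = count(np.hstack([a, c])), count(np.hstack([b, c])), count(c)
    return digamma(k) - np.mean(digamma(n_ac + 1) + digamma(n_bc + 1)
                                - digamma(n_c + 1))
```

In practice it is the growth rates of the sums above that are estimated, by feeding `cmi_knn` with the lagged vectors constructed below.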
For stationary sequences, it is convenient to consider the rate of growth of these measures. Indeed, the measures often linearly increase with k. Thus, the rate is defined as the asymptotic linear growth rate. Furthermore, following the proof in [19] for the directed information (or in [6] for the entropy), it can be shown that:
$$\lim_{k\to+\infty} \frac{1}{k}\, I(Dx^k \to y^k \Vert Dx_C^k) = \lim_{k\to+\infty} I(x^{k-1}; y(k) \mid y^{k-1}, x_C^{k-1})$$
$$\lim_{k\to+\infty} \frac{1}{k}\, I(x^k \to y^k \Vert Dx^k, x_C^k) = \lim_{k\to+\infty} I(x(k); y(k) \mid y^{k-1}, x^{k-1}, x_C^{k})$$
Suppose now that we are dealing with finite order joint Markov sequences. Then, by working with vectors, we can represent the signal using an order 1 Markov multivariate process. We thus assume that ðx; y; xC Þ is a Markov process of order 1. Under this assumption and stationarity, we have:
$$\lim_{k\to+\infty} I(x^{k-1}; y(k) \mid y^{k-1}, x_C^{k-1}) = I(x(1); y(2) \mid y(1), x_C(1))$$
$$\lim_{k\to+\infty} I(x(k); y(k) \mid y^{k-1}, x^{k-1}, x_C^{k}) = I(x(2); y(2) \mid y(1), x(1), x_C^{2})$$
and in this case, we can estimate the conditional transfer entropy and the instantaneous information exchange from the data. Practically, from the two time series $x$ and $y$ and a pool of others, $x_C$, we create the realizations of the vectors $x(1)_i = x_{i-d}^{i-1}$, $y(1)_i = y_{i-d}^{i-1}$, $x_C(1)_i = x_{C,i-d}^{i-1}$ and $x_{C,i}^2 = x_{C,i-d}^{i}$ from the signals, and we estimate $I(x(1); y(2) \mid y(1), x_C(1))$ and $I(x(2); y(2) \mid y(1), x(1), x_C^2)$ using these realizations and the k-NN estimators described above [20,10]. This approach has already been described in [35] for the transfer entropy.
5.2. Synthetic examples
We develop here two synthetic examples to illustrate the key ideas developed in this report. In the first example, we stress the importance of causal conditioning using a simple causality chain. The second example is a particular instance of the example developed in Section 3.2, for which we estimate the dynamical and instantaneous coupling measures.
5.3. A chain
Consider the following three-dimensional example, in which the noises are i.i.d. and independent of each other:
$$\begin{aligned} x_t &= b x_{t-1} + e_{x,t} \\ y_t &= c y_{t-1} + d_{xy} x_{t-1}^2 + e_{y,t} \\ z_t &= d z_{t-1} + c_{yz} y_{t-1} + e_{z,t} \end{aligned}$$
where $a = 0.2$, $b = 0.5$, $c = 0.8$, $d_{xy} = 0.8$ and $c_{yz} = 0.7$. The expected conditional and unconditional graphs are shown in Fig. 2. First, we evaluate the Geweke measure based on the linear prediction error [11,12] (the logarithm of the ratio between the variances of the linear predictions). The measures are evaluated on 100 independent realizations of the processes, each of length 3000 samples. These are depicted in Fig. 3 in the form of histograms. As can be seen, the histogram of the conditional Geweke measure $F_{x \to z \Vert y}$ has the same support as the histogram of the unconditional measure $F_{x \to z}$. We therefore have an example where linear Granger causality gives the same answer whether it is conditional or not: $x$ does not dynamically cause $z$ (conditionally on $y$ or not). We then evaluate the transfer entropy $I(Dx \to z) = I(x_{t-2}^{t-1}; z_t \mid z_{t-2}^{t-1})$ and the conditional transfer entropy $I(Dx \to z \Vert Dy) = I(x_{t-2}^{t-1}; z_t \mid z_{t-2}^{t-1}, y_{t-2}^{t-1})$ on the same datasets. The results are depicted at the bottom of Fig. 3. We see that the histograms of the conditional measure are clearly centered around 0, whereas the histogram of the unconditional measure has a clearly non-overlapping support. Therefore, we conclude that when side information is not taken into account, $x$ causes $z$, whereas including $y$ as side information reverses this conclusion: the existing link from $x$ to $z$ passes through $y$. In the plot of the transfer entropy, we present the histograms of the measures for three different values of $k$, the number of nearest neighbors considered in the estimation. As seen, and as reported in [4], there is a trade-off between bias and variance as a function of $k$. The present lack of a precise theoretical analysis does not allow us to optimize this trade-off to choose $k$ (see, however, [34] for a study that goes in this direction). However, numerical simulations have shown that $k$ should be chosen small as the dimension of the space increases.
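A minimal sketch of this experiment follows. It is ours, it reuses the `cmi_knn` estimator from the sketch in Section 5.1 (assumed to be in scope), and it assumes the autoregressive coefficient of $z$, which is not given explicitly in the text, to be $d = 0.2$:

```python
import numpy as np
# assumes cmi_knn from the Section 5.1 sketch is in scope

rng = np.random.default_rng(3)
N = 3000
b, c, d, dxy, cyz = 0.5, 0.8, 0.2, 0.8, 0.7   # d = 0.2 is our assumption

x = np.zeros(N); y = np.zeros(N); z = np.zeros(N)
for t in range(1, N):
    x[t] = b * x[t - 1] + rng.standard_normal()
    y[t] = c * y[t - 1] + dxy * x[t - 1] ** 2 + rng.standard_normal()
    z[t] = d * z[t - 1] + cyz * y[t - 1] + rng.standard_normal()

def lagged(s, lmin, lmax):
    """Rows (s[t - lmin], ..., s[t - lmax]), aligned with target index t >= lmax."""
    return np.column_stack([s[lmax - l: N - l] for l in range(lmin, lmax + 1)])

zt = z[2:, None]
x_past, y_past, z_past = lagged(x, 1, 2), lagged(y, 1, 2), lagged(z, 1, 2)

# transfer entropy I(Dx -> z): clearly positive, as x drives z through y
print(cmi_knn(x_past, zt, z_past, k=5))
# conditional transfer entropy I(Dx -> z || Dy): close to zero
print(cmi_knn(x_past, zt, np.hstack([z_past, y_past]), k=5))
```

Repeating this over independent realizations produces histograms such as those of Fig. 3.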
Fig. 2. Conditional (resp. unconditional) graph representing the chain.

5.4. A four-dimensional complete toy
We come back to the example described in Section 3.2. Below, we give an explicit form to the functional links:
$$\begin{aligned} w_t &= a w_{t-1} + a z_{t-1} + e x_{t-1}^2 + e_{w,t} \\ x_t &= b x_{t-1} + f z_{t-1} + e_{x,t} \\ y_t &= c y_{t-1} + b x_{t-1} + g x_{t-1}^2 + e_{y,t} \\ z_t &= d z_{t-1} + c w_{t-1} + e_{z,t} \end{aligned} \qquad (8)$$
Fig. 3. Dynamical causality analysis from x to z in the first example. Top: linear analysis using the Geweke measures. Both the conditional and unconditional measures lead to the conclusion that $x \nrightarrow z$. Bottom: directed information theoretic analysis. The three different types of histograms correspond to three different choices of the number of nearest neighbors, k, used for the estimation. As can be seen, the variance decreases with k but the bias increases. From the transfer entropy, as $I(x \to z) > 0$, we obtain $x \to z$, whereas the conditional transfer entropy leads to $x \nrightarrow z$, as $I(x \to z \Vert y) = 0$.
and we recall that the noise sequence is white, with the covariance given by Eq. (1). For the purpose of this example, we set $\rho_1 = 0.66$, $\rho_2 = 0.55$ and $\rho_3 = 0.48$. Note that the conditional and unconditional graphs associated with the system described by Eq. (8) are depicted in Fig. 1. To mimic a real experiment, we simulated a long time series, from which $N_b = 100$ consecutive blocks of 3000 samples each were used to generate the realizations of the process. All of the information measures needed were thus evaluated on these blocks. Furthermore, we performed random permutations to simulate the independence situation, called $H_0$. When estimating $I(a; b \mid c)$ from samples $a_i, b_i, c_i$, the permutation is done on all of the $b_i$; indeed, if $b$ is independent of $a$ and $c$, then $I(a; b \mid c) = 0$. For example, when estimating the transfer entropy $I(x_{t-2}^{t-1}; z_t \mid z_{t-2}^{t-1})$, we use the permutation for $z_t$ but not for $z_{t-2}^{t-1}$. Thus, for each block, two measures are actually computed: the one that needs to be evaluated, and another one for which the $H_0$ hypothesis is forced. The $N_b$ results under $H_0$ allow us to evaluate the threshold $\eta_{ij}$ over which only $\alpha\%$ of false-positive decisions (there is a link from $i$ to $j$) will be taken (a sketch of this thresholding procedure is given after the list of measures below). Practically, we set $\alpha = 10\%$. Since 12 pairwise dependence tests need to be made for this toy problem, the Bonferroni¹ correction is applied to the threshold, to maintain the family-wise global false detection rate. Nine different measures were tested in this example:
1. Geweke's instantaneous coupling measure
$$F_{x \cdot y} = \lim_{n\to+\infty} \frac{e(x_n \mid x^{n-1}, y^{n-1})}{e(x_n \mid x^{n-1}, y^{n})}$$
where $e(x \mid z)$ is the variance of the error in the linear estimation of $x$ from $z$.
2. Geweke's conditional instantaneous coupling measure
$$F_{x \cdot y \Vert (w,z)} = \lim_{n\to+\infty} \frac{e(x_n \mid x^{n-1}, y^{n-1}, (w,z)^{n})}{e(x_n \mid x^{n-1}, y^{n}, (w,z)^{n})}$$
¹ i.e., $\alpha$ is replaced with $\alpha/12$ to ensure a family-wise false-positive rate less than $\alpha\%$. Note that the Bonferroni correction is known to be very conservative, and a less conservative procedure, such as false discovery rate control, could easily be adopted.
3. Geweke's dynamical causality measure
$$F_{x \to y} = \lim_{n\to+\infty} \frac{e(x_n \mid x^{n-1})}{e(x_n \mid x^{n-1}, y^{n-1})}$$
Fig. 4. Measures calculated from Example 2. From top to bottom: instantaneous coupling and conditional instantaneous coupling Geweke measures; dynamical causality and conditional dynamical causality Geweke measures; instantaneous information exchange, unconditional instantaneous information exchange, and conditional instantaneous information exchange; transfer entropy; and finally conditional transfer entropy. The left column is the mean measure calculated over 100 realizations of 3000 samples each. The right column represents the number of times the corresponding measure exceeded a threshold chosen to ensure a family-wise false-positive probability of 10% (using the Bonferroni correction).
4. Geweke's conditional dynamical causality measure
$$F_{x \to y \Vert (w,z)} = \lim_{n\to+\infty} \frac{e(x_n \mid x^{n-1}, (w,z)^{n-1})}{e(x_n \mid x^{n-1}, y^{n-1}, (w,z)^{n-1})}$$
5. Instantaneous information exchange $I(x^n \to y^n \Vert Dx^n)$.
6. Instantaneous unconditional information exchange $I(x^n \to y^n \Vert Dx^n, Dw^n, Dz^n)$.
7. Instantaneous conditional information exchange $I(x^n \to y^n \Vert Dx^n, w^n, z^n)$.
8. Transfer entropy $I(Dx^n \to y^n)$.
9. Conditional transfer entropy $I(Dx^n \to y^n \Vert Dw^n, Dz^n)$.
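The permutation-based thresholding announced before the list can be sketched as follows. This helper is ours, not the paper's implementation: the middle argument is shuffled to force $H_0$ ($b$ independent of $a$ and $c$), the measure is re-estimated on the surrogate data of each block, and the $(1 - \alpha/12)$ quantile of the surrogate values gives the Bonferroni-corrected detection threshold $\eta_{ij}$. Any of the estimators above (e.g. `cmi_knn`) can be passed as `estimate`.

```python
import numpy as np

def h0_threshold(estimate, blocks, alpha=0.10, n_tests=12, seed=0):
    """Detection threshold from surrogate (permuted) blocks.

    `estimate(a, b, c)` is any conditional-dependence measure; `blocks` is a
    list of (a, b, c) sample triples, one per data block.
    """
    rng = np.random.default_rng(seed)
    surrogates = []
    for a, b, c in blocks:
        b_perm = b[rng.permutation(len(b))]   # destroys dependence of b on (a, c)
        surrogates.append(estimate(a, b_perm, c))
    return np.quantile(surrogates, 1.0 - alpha / n_tests)

# a link is declared when the measure estimated on the unpermuted data
# exceeds the returned threshold
```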
These Geweke measures are based on linear estimation. Note that they are greater than 1 when $x$ causes $y$, and equal to 1 otherwise. We stress that the Geweke measures are the Gaussian versions of the directed information measures discussed here [1,2]. The information measures were estimated using the appropriate conditional mutual information definitions, with time-lag windows of length 2; e.g., the conditional transfer entropy $I(Dx^n \to y^n \Vert Dw^n, Dz^n)$ is approximated by the estimate of $I(x_{t-2}^{t-1}; y_t \mid y_{t-2}^{t-1}, w_{t-2}^{t-1}, z_{t-2}^{t-1})$. The results are depicted in Fig. 4, where the measures 1 to 9 are shown from top to bottom. The left column represents the matrix of the measures averaged over the $N_b$ blocks. The right column represents the matrix of the estimated probabilities of deciding that there is a link between two nodes. To decide whether there is a link, we use the threshold $\eta_{ij}$ discussed above. To evaluate the Geweke measures, we performed a linear prediction using 10 samples of the past, and evaluated the variance of the errors. The main conclusions to be drawn from this experiment are the following:
- Whether causally conditioned or not, the linear analysis implemented using the Geweke measures fails to retrieve the structure of the causality graphs.
- The instantaneous information exchange must be causally conditioned, as the results given in the fifth line of Fig. 4 do not reveal the exact nature of the dependencies.
- The importance of the horizon of causal conditioning appears in the 6th and 7th lines, where we plot the results for the unconditional and conditional instantaneous information exchange, respectively. The measures are correctly estimated, as the probability of assigning links is very high, as shown in the right column: the forms of the matrices are correct. We recover the form of the covariance matrix of the noise using the unconditional form, whereas we recover the form of the precision matrix using the conditional form of the instantaneous information exchange. Note that this example has a rather low probability of estimating the link between x and y in the conditional form.
- The causality graph needs causal conditioning to be correctly inferred, as revealed by the last two measures. However, note also the relatively low probability of correctly estimating the link from w to z, a difficulty that is clearly due to the low coupling constant c in this direction. Trying to increase this coupling to study the sensitivity is unfortunately impossible, as increasing c destabilizes the system.

6. Conclusion
In the present report, we have revisited and highlighted the links between directed information theory and Granger causality graphs. In the bivariate case, the directed information decomposes into the sum of two contributions: the transfer entropy and the instantaneous information exchange. Each term in this decomposition reveals a type of causality. The transfer entropy between two processes (say, X and Y) is zero if and only if there is no dynamical Granger causality: the knowledge of the past of X does not lead to any improvement in the quality of the prediction of Y. The instantaneous information exchange quantifies the instantaneous link that can exist between the two signals. Recall that although instantaneous information transfer is hard to conceive of in, e.g., a continuous-time setting, it can easily occur in experimental set-ups where the measurements are integrated over time intervals that can be greater than the propagation delays.
In the multivariate case, instantaneous causality gives rise to increased difficulties when relating directed information theory to the measures introduced in the bivariate case. We have shown that two definitions of instantaneous coupling can be provided, depending on the time horizon selected for the side information. If only the past of the side information is considered, instantaneous coupling leads to independence graph models (associated with Definition 3), whereas consideration of the present of the side information as well leads to a conditional graphical model (associated with Definition 2). Preferring one of these definitions is a relatively difficult choice, as discussed above: conditional graphs have nice Markov properties, whereas unconditional graphs represent a preferred solution in neuroscience, as they provide a better match for the concept of functional connectivity. We have also shown that if independence graphs are considered, the directed information causally conditioned on the past of the side information (simply termed causal directed information in this case) decomposes into the sum of the causally conditioned transfer entropy and the causally conditioned (unconditional) instantaneous information exchange. This decomposition, however, no longer holds when the present of the side information is also taken into account in the conditioning. For the conditional graph, an extra term appears in the decomposition of the conditional causal directed information. This term further explains how instantaneous exchange takes place between the two signals of interest and the side information.
All of this theoretical framework is illustrated through practical developments on two synthetic examples. The estimators used in the present study rely on nearest-neighbor entropy estimators that generalize Leonenko's approach to entropy estimation. This implementation can be used efficiently, as long as the dimensionality of the problem at hand is not too high, to ensure both an acceptable computational cost and the convergence of the estimators.

Acknowledgement
P.O.A. is supported by a Marie Curie International Outgoing Fellowship from the European Community.

References
[1] P.O. Amblard, O.J.J. Michel, Sur différentes mesures de dépendance causale entre signaux aléatoires (On different measures of causal dependence between random signals), in: Proc. GRETSI, Dijon, France, September 2009.
[2] P.O. Amblard, O.J.J. Michel, Relating Granger causality to directed information theory for networks of stochastic processes, 2011, arXiv:0911.2873v3 (submitted for publication).
[3] P.-O. Amblard, O.J.J. Michel, C. Richard, P. Honeine, A Gaussian process regression approach for testing Granger causality between time series data, in: Proc. ICASSP 2012, Osaka, 2012.
[4] P.O. Amblard, S. Zozor, O.J.J. Michel, A. Cuculescu, On the estimation of the entropy using kth nearest neighbors, in: IMA Conf. on Maths and Signal Processing, 2008, pp. 79–82.
[5] L. Barnett, A.B. Barrett, A.K. Seth, Granger causality and transfer entropy are equivalent for Gaussian variables, Phys. Rev. Lett. 103 (2009) 238707.
[6] T. Cover, J. Thomas, Elements of Information Theory, Wiley, 1993.
[7] R. Dahlhaus, M. Eichler, Causality and graphical models in time series analysis, in: P. Green, N. Hjort, S. Richardson (Eds.), Highly Structured Stochastic Systems, Oxford University Press, Oxford, 2003.
[8] M. Eichler, Granger causality and path diagrams for multivariate time series, J. Econometr. 137 (2) (2007) 334–353.
[9] M. Eichler, Graphical modelling of multivariate time series, Probab. Theory Relat. Fields (2011). http://dx.doi.org/10.1007/s00440-011-0345-8.
[10] S. Frenzel, B. Pompe, Partial mutual information for coupling analysis of multivariate time series, Phys. Rev. Lett. 99 (2007) 204101.
[11] J. Geweke, Measurement of linear dependence and feedback between multiple time series, J. Am. Stat. Assoc. 77 (1982) 304–313.
[12] J. Geweke, Measures of conditional linear dependence and feedback between time series, J. Am. Stat. Assoc. 79 (388) (1984) 907–915.
[13] M.N. Goria, N.N. Leonenko, V.V. Mergel, P.L. Novi Inverardi, A new class of random vector entropy estimators and its applications in testing statistical hypotheses, J. Nonparametr. Stat. 17 (3) (2005) 277–297.
[14] B. Gourévitch, R.L. Bouquin-Jeannès, G. Faucon, Linear and nonlinear causality between signals: methods, example and neurophysiological applications, Biol. Cybern. 95 (4) (2006) 349–369.
[15] C. Gouriéroux, A. Monfort, E. Renault, Kullback causality measures, Ann. Écon. Stat. (6–7) (1987) 369–410.
[16] C.W.J. Granger, Testing for causality: a personal viewpoint, J. Econ. Dynam. Control 2 (1980) 329–352.
[17] C.W.J. Granger, Some recent developments in a concept of causality, J. Econometr. 39 (1988) 199–211.
[18] M.I. Jordan, T.J. Sejnowski (Eds.), Graphical Models: Foundations of Neural Computation, MIT Press, Cambridge, MA, USA, 2001.
[19] G. Kramer, Directed information for channels with feedback, PhD thesis, Swiss Federal Institute of Technology, Zurich, 1998.
[20] A. Kraskov, H. Stögbauer, P. Grassberger, Estimating mutual information, Phys. Rev. E 69 (2004) 066138.
[21] S. Lauritzen, Graphical Models, Oxford University Press, 1996.
[22] N.N. Leonenko, L. Pronzato, V. Savani, A class of Rényi information estimators for multidimensional densities, Ann. Stat. 36 (2008) 2153–2182.
[23] D. Marinazzo, M. Pellicoro, S. Stramaglia, Kernel-Granger causality and the analysis of dynamical networks, Phys. Rev. E 77 (2008) 056215.
[24] H. Marko, The bidirectional communication theory – a generalization of information theory, IEEE Trans. Commun. 21 (12) (1973) 1345–1351.
[25] J. Massey, Causality, feedback and directed information, in: Proc. Intl. Symp. on Information Theory and its Applications, Waikiki, Hawaii, USA, November 1990.
[26] M.E.J. Newman, Networks: An Introduction, Oxford University Press, 2010.
[27] C.J. Quinn, N. Kiyavash, T.P. Coleman, Equivalence between minimal generative model graphs and directed information graphs, in: Proc. ISIT, St. Petersburg, Russia, 2011.
[28] C.J. Quinn, T.P. Coleman, N. Kiyavash, N.G. Hatsopoulos, Estimating the directed information to infer causal relationships in ensemble neural spike train recordings, J. Comput. Neurosci. 30 (2011) 17–44.
[29] M. Palus, M. Vejmelka, Directionality of coupling from bivariate time series: how to avoid false causalities and missed connections, Phys. Rev. E 75 (2007) 056211.
[30] J. Rissanen, M. Wax, Measures of mutual and causal dependence between two time series, IEEE Trans. Inform. Theory 33 (1987) 598–601.
[31] Y. Saito, H. Harashima, Tracking of information within multichannel EEG record - causal analysis in EEG, in: Recent Advances in EEG and EMG Data Processing, Elsevier, 1981, pp. 133–146.
[32] T. Schreiber, Measuring information transfer, Phys. Rev. Lett. 85 (2) (2000) 461–465.
[33] V. Solo, On causality and mutual information, in: Proc. 47th IEEE Conference on Decision and Control, Cancun, Mexico, 2008.
[34] K. Sricharan, A.O. Hero, Weighted k-NN graphs for Rényi entropy estimation in high dimension, in: Proc. IEEE SSP Workshop, Nice, France, 2011.
[35] R. Vicente, M. Wibral, M. Lindner, G. Pipa, Transfer entropy – a model-free measure of effective connectivity for the neurosciences, J. Comput. Neurosci. 30 (1) (2011) 45–67.
[36] J. Whittaker, Graphical Models in Applied Multivariate Statistics, Springer, 1989.