Information Theoretical Approaches

M Wibral, Goethe University, Frankfurt, Germany
V Priesemann, Max Planck Institute for Brain Research, Frankfurt, Germany

© 2015 Elsevier Inc. All rights reserved.
http://dx.doi.org/10.1016/B978-0-12-397025-1.00338-9

Nomenclature

d: Physical or true interaction delay between two processes
A_{X_t} = {a_1, ..., a_j, ..., a_J}: Set of all possible outcomes of X_t
a_j: Specific outcome of a random variable
H(X): Entropy
H(X|Y): Conditional entropy
h(x): Information content
h(x|y): Conditional information content
I(X; Y): Mutual information
I(X; Y|Z): Conditional mutual information
i(x; y): Local mutual information
i(x; y|z): Local conditional mutual information
p_{X_t}(x_t = a_j): Probability that X_t has the specific outcome a_j
R = {R1, R2}: Joint variable (in this example of two responses)
S_i, R_j: Random variables referring to stimuli (S_i) or responses (R_j)
u: Assumed interaction delay between two processes
X(c), X(c)_t: Cyclostationary process and cyclostationary random variable
X(s), X(s)_t: Stationary process and stationary random variable
𝐗_t, 𝐱_t: State space representation of X at time t and its realization
𝒳, 𝒴, 𝒵: System
X, Y, Z: Random process
X, Y, Z or X_t, Y_t, Z_t: Random variable (at time point t)
x, y, z or x_t, y_t, z_t: Realization of the random variable (at time point t)
Notes regarding t: Whenever necessary, the index t is detailed as t_1, t_2, ..., t_k. For stationary processes, the index t can be omitted.

Information, Meaning, and Neuroscience

Meaning Versus Information

Information theory in neuroscience is often misunderstood. At the heart of the confusion lies the fact that in everyday language, we use 'information' and 'meaning' as synonyms. However, meaning is ultimately relative to the human receiver of information. Information that is meaningful for one person – say a message with a reply to a question the receiver sent earlier – may be meaningless for an accidental receiver of the message who does not know the initial question or even how to read. The information content of the message, however, should stay the same independent of who receives it. Hence, information theory devised a measure of information content that is independent of the receiver. In the following paragraphs, we will provide an intuitive introduction to this measure.

The fundamental definition of information content was given by Shannon (1948) and relies on the notion that an event that we observe reduces the uncertainty (about what is still possible). In the following, we give a heuristic derivation of the Shannon information based on this general intuition about information. Assume, for example, that the message in the example earlier in the text contained the six-digit winning number sequence of a lottery (where the six numbers were drawn at random). Then, before reading the first digit, we are uncertain about which of 10^6 possible (and equally likely) number sequences the message contains. After reading the first digit of our message, which
will be one number between 0 and 9, only 10^5 possibilities are left, and our uncertainty about the overall possible six-digit sequences in the message is reduced. Finally, after reading the last digit, only one six-digit sequence remains possible, now with a probability equal to unity, and there is no remaining uncertainty. In this example, each time we read a digit, our uncertainty was reduced. However, the amount of uncertainty reduction needs to be defined. At first, it might seem that reading the first digit reduces our uncertainty most, because it eliminated 10^6 − 10^5 = 9 × 10^5 possibilities while reading the last digit removed only nine possibilities. Intuition however tells us that in this example, all digits should carry equal amounts of 'information.' How can we define a measure that appeals to our intuition? The correct way to look at the earlier-mentioned problem is in terms of the fractional change in the number of possibilities that reading a digit provides: We see that reading the first digit removes a fraction f of 9/10 of all possibilities, and so do all others, including the last digit. In other words, reading the first digit informs us about which one out of ten equiprobable partitions of the space of all six-digit sequences of numbers the winning number is in. The probability of being in any one partition is 1/10. Upon reading that number, we only have 1/10 of the initial possibilities remaining. If we repeated the lottery with drawing six-digit codes composed of letters a, b, ..., z, then our initial uncertainty would be clearly bigger as there are now 26^6 possibilities to begin with.

Nevertheless, after reading all six letters in the sequence, our uncertainty would be again zero, because only one possibility remained. Hence, compared to the example with the numbers, the reading of each letter must provide a greater reduction in uncertainty. And, indeed, each letter lets us decide between one of 26 equiprobable partitions of the remaining possibilities now, since each letter has a probability of 1/26. Thus, reading a letter reduces the space of potential outcomes to a fraction of f = 1/26. Therefore, it makes sense to define the information in an event in relation to the fraction f of the original space of outcomes or possibilities that observing this event confines us to. The smaller this fraction, the more possibilities are excluded, the more uncertainty is reduced, and therefore, the larger the information content of the event. The 'fraction' of the space of all possible events that an event ai (e.g., obtaining the letter i when reading the first digit) takes can be measured by its probability p(ai). Furthermore, before an event, all (remaining) possibilities are available: p(ai = any event) = 1. Hence, the reduction of uncertainty by an event ai is related to the fraction f(ai) = p(any event)/p(specific event) = 1/p(ai). Interestingly, this quantity f can be defined in a meaningful way even when the events in question are no longer equiprobable, as we are only interested in the fraction of possibilities that remain after observing a specific event with probability p(ai). If, in a next step, we consider the reduction of uncertainty provided by reading two consecutive letters ai, aj at the same time, we see that this reduces our possibilities to a fraction of f(ai, aj) = 1/p(ai, aj). As the letters were drawn independently, this amounts to f(ai, aj) = (1/p(ai))(1/p(aj)). This shows that f itself cannot be a measure of information content, because our intuition tells us that such a measure should be additive for independent events and not multiplicative. A measure that complies with the requirements in the preceding text and is also additive is the logarithm of the fraction f(ai):

h(a_i) = log f(a_i) = log (1/p(a_i))   [1]

Here, the logarithm can in principle be taken to any base. The resulting quantity h is called the Shannon information (content). This definition of information content also has the desirable property that it is continuous in the changes of the underlying probabilities. This means that if the probability of an event ai slightly changes, the corresponding change of h(ai) is also small, that is, our measure does not jump around with slight changes of the probability distributions. As a consequence, small errors in our assumptions about the necessary probability distributions will not produce arbitrarily large errors in our assessment of the information content. Here, we chose an intuitive introduction of the Shannon information. Nevertheless, the definition given in Eqn [1] is precisely the definition of the Shannon information that can also be axiomatically derived via the average information content, called entropy (see Shannon, 1948).
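To make the lottery example concrete, the following minimal Python sketch (illustrative only, not part of the original text) evaluates Eqn [1] with base-2 logarithms, so that information is measured in bits, and shows the additivity for independent events.

    import math

    def info_content(p, base=2):
        """Shannon information content h = log(1/p) of an event with probability p (Eqn [1])."""
        return math.log(1.0 / p, base)

    # Six-digit lottery number: each digit is drawn independently from 0-9.
    h_digit = info_content(1 / 10)      # about 3.32 bit per digit
    h_sequence = 6 * h_digit            # additivity for independent events: about 19.93 bit
    h_letter = info_content(1 / 26)     # about 4.70 bit per letter from a 26-letter alphabet

    print(h_digit, h_sequence, h_letter)

Reading all six letters of the letter code accordingly yields about 6 × 4.70 ≈ 28.2 bit, matching the larger initial uncertainty of 26^6 possibilities.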

Information Theory in Neuroscience

To see what information theory can add to our understanding in neuroscience, it is useful to first define the general questions

we want to address in neuroscience. What kinds of questions would we like to have answers to? At first glance, questions in neuroscience seem to separate well into three levels of understanding that have been repeatedly described (e.g., Marr, 1982) and that are generally considered useful:

• At the functional level (originally called 'computational' level), we ask what information processing problem a neural system (or a part of it) tries to solve. Such problems could, for example, be the detection of edges or objects in a visual scene or the maintenance of information about an object after the object is no longer in the visual scene. It is important to note that questions at the functional level typically revolve around entities that have a direct meaning to us, for example, objects or specific object properties used as stimulus categories, or operationally defined states or concepts such as attention or working memory. An example of an analysis carried out purely at this level is the analysis whether a person behaves as an optimal Bayesian observer (see references in Knill & Pouget, 2004).

• At the algorithmic level (in complex systems science, this would be called the computational perspective on a system (Crutchfield, Ditto, & Sinha, 2010), one reason to largely avoid the use of the word computation here), we ask what entities or quantities of the functional level are represented by the neural system and how the system operates on these representations using algorithms. For example, detection of an object in noisy visual input might in principle be solved algorithmically either by a parallel or by a sequential search of all memories of objects ever seen for the best match.

• At the (biophysical) implementation level, we ask how the representations and algorithms are implemented in biological neural systems. Descriptions at this level are given in terms of the relationship between various biophysical properties of the neural system or its components, for example, membrane currents or voltages, the morphology of neurons, and spike rates. A typical study at this level might aim, for example, at reproducing observed physical behavior of neural circuits, such as gamma-frequency (>40 Hz) oscillations in local field potentials by modeling the biophysical details of these circuits from the ground up (Markram, 2006).

While this separation of levels of understanding initially served to resolve important debates in neuroscience, there is a growing awareness of a specific shortcoming of this classic view: results obtained by careful study at any single one of these levels do not constrain the possibilities at any other level (see, e.g., the afterword by Poggio in Marr, 1982). For example, the goal of winning a game of tic-tac-toe at the functional level can be reached by a brute force strategy at the algorithmic level that is implemented in a mechanical computer (Dewdney, 1993), but it can also be reached by a different algorithm carried out by the biological brains of children. As we will see, information theory serves to address the relationships between the levels. For example, a measure called mutual information offers powerful tools to link entities defined at the functional level to physical quantities observed at the implementation level. Furthermore, measures from a subfield

of information theory, called local information dynamics (Lizier, 2013), link the algorithmic to the implementation level. Local information dynamics establishes this link by deriving constraints for possible algorithms from the physical quantities (implementation level). Such constraints can be derived, for example, by comparing the relative amounts of stored or transferred information or by identifying the sequence of information transfers between various parts of the brain. More specifically, using information theory, we can address the following main questions:

1. How much information does a specific neural response provide about a stimulus or experimental condition? This question clearly relates quantities at the implementation level, that is, physical indices of neural responses, to entities defined by the experimenter at the functional level, such as certain stimulus classes (for example faces, objects, and animals) or operationally defined states such as attention. The benefit of information theory here is that we can measure not only that a response conveys information about a stimulus but also how much. Moreover, this can be done without explicit knowledge of the neural encoding.

2. Which stimulus leads to responses (implementation level) that are relatively unique for this stimulus? In other words, which stimulus is well encoded, in the sense that the system does indeed represent it? Again, this question asks about entities defined at the functional level, that is, stimuli, tasks, or states, but this time relates them to properties at the algorithmic level, that is, how well they are represented by the neural system.

3. If we accept the notion that the brain performs information processing in the form of a (distributed) computation, then this computation can be decomposed into the components of information storage, information transfer, and information modification to improve our understanding of it. We may therefore ask how this computation is carried out, that is, which biological phenomena at the implementation level subserve information storage, information transfer, and information modification at the algorithmic level? Clearly, this question uses data at the implementation level (neural activity) to provide constraints for possible algorithms. For example, if we observe no signs of information storage in the system, this may rule out algorithms that rely heavily on storing intermediate information.

In the following sections, we will first introduce the necessary information theoretical concepts and then proceed to answer questions one and two that belong to the domain of neural (en)coding and representations. Finally, we show how question three can be addressed using information theory in the form of (local) information dynamics.

Information Theoretical Preliminaries

In this section, we introduce the necessary notation and basic information theoretical concepts that are indispensable to understand information theoretical analyses in neuroscience. This is done to obtain a self-contained presentation of the

material for readers without a background in information theory. Readers familiar with elementary information theory and readers who are not interested in the precise definitions at first reading may skip ahead to the next section.

Notation and Basic Information Theory

Notation

To avoid confusion, we first have to state what systems we wish to apply our measures to and how we formalize observations from these systems mathematically. We define that a system of interest (e.g., neuron and brain area) 𝒳 produces an observed time series {x1, ..., xt, ..., xN} that is sampled at time intervals Δ. For simplicity, we choose our temporal units such that Δ = 1 and hence index our measurements by t ∈ {1, ..., N} ⊂ ℕ. The full time series is understood as a realization of a random process X, unless stated otherwise. This random process is nothing but a collection of random variables Xt, sorted by an integer index (t in our case). Each random variable Xt, at a specific time t, is described by the set of all its J possible outcomes A_{Xt} = {a_1, ..., a_j, ..., a_J} and their associated probabilities p_{Xt}(x_t = a_j). Since the probabilities of a specific outcome p_{Xt}(x_t = a_j) may change with t, that is, when going from one random variable to the next, we have to indicate the random variable Xt the probabilities belong to – hence the subscript in p_{Xt}(·). In sum, in this notation, the individual random variables Xt produce realizations xt, and each of the realizations has J different possible outcomes a_j. Furthermore, the time index of a random variable Xt is necessary, because the random variables X_{t1} and X_{t2} with t1 ≠ t2 can have different probability distributions. When using more than one system, the notation is generalized to multiple systems 𝒳, 𝒴, 𝒵, ....

To clarify the conceptual distinction between a random variable Xt and a random process X, think about the following example. Assume one would like to obtain 20 random numbers between 1 and 6 by throwing (biased) dice. One could do this, for example, by throwing one die twenty times or by throwing twenty dice, each one time. The first case represents 20 realizations of a single random variable X1. The six possible outcomes for each of these 20 realizations of the random variable are A_{X1} = {1, 2, 3, 4, 5, 6}, and the probability distribution of the random variable is p_{X1}(·). Here, the random process X consists of only one random variable X1, that is, it is of length one. In contrast, in the second case, where twenty different dice are thrown, each die represents a random variable Y1, ..., Yt, ..., Y20, and the length of the random process Y is 20. Here, each of the random variables Yt is realized only once and has its own probability distribution p_{Yt}(·). The important difference between the two examples is that in the second case, each of the dice in principle can have a different bias, for example, the probability distributions p_{Y_{t1}}(·) and p_{Y_{t2}}(·) can differ for t1 ≠ t2. Thus, in the second case, we could bias the first die to always yield '1' and the last die to always yield '6.' This is impossible with the single die of the first example. The distinction between a random process and a random variable is very important in the context of stationarity, which will be discussed next.

Estimation of probability distributions for stationary and nonstationary random processes

Often, the probability distributions of our random variables Xt are unknown. Since knowledge of these probability distributions is essential to computing any information theoretical measure, the probability distributions have to be estimated from the observed realizations of the random variables xt. This is only possible if we have some form of replication of the processes we wish to analyze. From such replications, the probabilities are estimated, for example, by counting relative frequencies or, more generally, by a density estimation. In general, the probability p_{X_t}(x_t = a_j) to obtain the jth outcome x_t = a_j at time t has to be estimated from replications of the processes at the same time point t, that is, via an ensemble of physical replications of the systems in question. Generating a sufficient number of such physical replications of a process is often impossible in neuroscience. For example, to obtain 10^5 samples, we would have to record from this number of subjects at time t, given that the subjects were similar enough to count as replications of a single random variable at all. Therefore, to generate sufficient data for the estimation of p_{X_t}(·), one either resorts to repetitions of parts of the process in time (often called trials), resorts to the generation of cyclostationary processes, or even assumes stationarity. All three possibilities will be discussed in the following.

In neuroscience, experiments are often organized in repeated parts (e.g., trials and stimulation blocks). The random process X therefore repeats at t_k (we do not require evenly spaced trials in time). Thus, the probability to obtain the value x_t = a_j can be estimated from observations made at a sufficiently large set M of time points t_k, where we know by design of the experiment that the process repeats itself, that is, that random variables X_{t_k} at certain time points t_k have probability distributions identical to the distribution at t that is of interest to us:

∀t ∃M ⊂ ℕ ∧ M ≠ ∅ : p_{X_t}(a_j) = p_{X_{t_k}}(a_j)   ∀ t_k ∈ M, a_j ∈ A_{X_t}   [2]

If the set M of time points t_k that the process is 'repeated' at is large enough, we obtain a reliable estimate of p_{X_t}(·). Cyclostationarity can be understood as a specific form of repeating parts of the random process, where the repetitions occur after regular intervals T. For cyclostationary processes X(c), we assume (Gardner, 1994; Gardner, Napolitano, & Paura, 2006) that there are random variables X(c)_{t+nT} at times t + nT that have the same probability distribution as X(c)_t:

∀t ∃T ∈ ℕ : p_{X_t}(a_j) = p_{X_{t+nT}}(a_j)   ∀ n ∈ ℕ, a_j ∈ A_{X_t}   [3]

This condition guarantees that we can estimate the necessary probability distributions p_{X_t}(·) of the random variable X(c)_t by looking at other random variables X(c)_{t+nT} of the process X(c). Finally, for stationary processes X(s), we can substitute T in Eqn [3] by T = 1 and

p_{X_t}(a_j) = p_{X_{t+n}}(a_j)   ∀ t, n ∈ ℕ, a_j ∈ A_{X_t}   [4]

In the stationary case, the probability distribution p_{X_t}(·) can be estimated from the entire set of measured realizations x_t.

From now on, we assume stationarity of all involved processes, unless stated otherwise. Accordingly, we drop the subscript index indicating the specific random variable, that is, p_{X_t}(·) = p(·), X_t = X, and x_t = x.

Basic information theory

Based on the previously mentioned definitions, we now define the necessary basic information theoretical quantities. We assume the reader is familiar with the basic concepts of probability theory. If this is not the case, the treatment in Effenberger (2013) offers a well-written introduction.

Assume a random variable X with possible outcomes x ∈ A_X and a probability distribution p(x) over these outcomes. As stated before, the information content is the reduction in uncertainty linked to the occurrence of a specific outcome x = a_j (with probability p(x = a_j)) that we obtain when a_j is observed. As shown in the introduction, this uncertainty reduction entirely depends on the probability of the event x, and the event's information content h is defined as

h(x) = log (1/p(x)) = −log p(x)   [5]

While the last term in the equation appears simpler and is more commonly used, we will use the middle term in the following, because it stresses that the information content of an event is related to the inverse of its probability. The average information content of a random variable X is called the entropy H of the random variable:

H(X) = Σ_{x ∈ A_X} p(x) log (1/p(x))   [6]

The information content of a specific outcome x = a_j of X, given we already know the outcome y = a_k of another variable Y, which is not necessarily independent of X, is called conditional information content:

h(x|y) = log (1/p(x|y))   [7]

Averaging this for all possible outcomes of X, given their probabilities p(x|y) after the outcome y was observed, and then averaging over all possible outcomes y that occur with p(y), yields the conditional entropy

H(X|Y) = Σ_{y ∈ A_Y} p(y) Σ_{x ∈ A_X} p(x|y) log (1/p(x|y)) = Σ_{x ∈ A_X, y ∈ A_Y} p(x, y) log (1/p(x|y))   [8]

The conditional entropy H(X|Y) can be described from various perspectives: H(X|Y) is the average amount of information that we get from making an observation of X after having already made an observation of Y. In terms of uncertainties, H(X|Y) is the average remaining uncertainty in X once Y was observed. We can also say that H(X|Y) is the information in X that cannot be obtained from Y. The conditional entropy is used to derive the amount of information shared between the two variables X, Y. This is because this shared or mutual information I(X; Y) is the total average information in one variable (H(X)) minus the average

information that is unique to this very variable (H(X|Y)). Hence, the mutual information is defined as

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)   [9]

Clearly, the mutual information is symmetrical, that is, I(X; Y) = I(Y; X). Similarly to conditional entropy, we can also define a conditional mutual information between two variables X, Y, given the value of a third variable Z is known:

I(X; Y|Z) = H(X|Z) − H(X|Y, Z)   [10]
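As a concrete illustration of Eqns [6], [8], and [9], the following Python sketch (illustrative only; the function names are ours) estimates entropy and mutual information from paired observations of discrete, stationary variables with the simple 'plug-in' estimator based on relative frequencies. Plug-in estimates are biased for small samples, so they should be interpreted with care.

    import numpy as np
    from collections import Counter

    def entropy(xs, base=2):
        """Plug-in estimate of H(X) from a sequence of discrete observations (Eqn [6])."""
        n = len(xs)
        return -sum((c / n) * np.log(c / n) for c in Counter(xs).values()) / np.log(base)

    def conditional_entropy(xs, ys, base=2):
        """H(X|Y) = H(X,Y) - H(Y), following Eqn [8]."""
        return entropy(list(zip(xs, ys)), base) - entropy(ys, base)

    def mutual_information(xs, ys, base=2):
        """I(X;Y) = H(X) - H(X|Y), following Eqn [9]."""
        return entropy(xs, base) - conditional_entropy(xs, ys, base)

    # Toy example: y is a noisy copy of x, flipped with probability 0.1.
    rng = np.random.default_rng(0)
    x = rng.integers(0, 2, 10_000)
    y = np.where(rng.random(10_000) < 0.9, x, 1 - x)
    print(mutual_information(list(x), list(y)))   # clearly above 0 bit and below 1 bit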

The earlier-mentioned measures of mutual information are averages, but they can also be used without taking the expected value, that is, in their localized form. Although average values like the mutual information or entropy are used more often than their localized equivalents, it is perfectly valid to inspect local values like the information content h. This 'localizability' was in fact a requirement both Shannon and Fano postulated for proper information theoretical measures (Fano, 1961; Shannon, 1948), and there is a growing trend in neuroscience (Lizier, Heinzle, Horstmann, Haynes, & Prokopenko, 2011a; Wibral et al., 2014) and in the theory of distributed computation (Boedecker, Obst, Lizier, Mayer, & Asada, 2012; Lizier, 2013) to return to local values. For the previously mentioned measures of mutual information, the localized forms are listed in the following. The local mutual information i(x; y) is defined as

i(x; y) = log [ p(x, y) / (p(x) p(y)) ]   [11]

The local conditional mutual information is defined as

i(x; y|z) = log [ p(x|y, z) / p(x|z) ]   [12]

Hence, mutual information and conditional mutual information are just the expected values of these local quantities. These quantities are called local, because they make it possible to quantify mutual and conditional mutual information between single realizations. Note, however, that the probabilities p(·) that enter Eqns [11] and [12] are global in the sense that they are representative of all possible outcomes. In this sense, the use of local information measures is complementary to the evaluation of probability distributions of single random variables Xt of a nonstationary process X via an ensemble of realizations. In other words, a valid probability distribution has to be estimated irrespective of whether we are interested in average or local information measures.
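To make the localizability concrete, here is a small illustrative sketch (again with plug-in probabilities, not from the original text) that evaluates Eqn [11] for every observed pair of realizations; averaging the local values recovers the plug-in mutual information, and individual local values can be negative, that is, locally misinformative.

    import numpy as np
    from collections import Counter

    def local_mutual_information(xs, ys, base=2):
        """Local MI i(x;y) = log p(x,y) / (p(x) p(y)) for each observed pair (Eqn [11])."""
        n = len(xs)
        px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
        return np.array([np.log((pxy[x, y] / n) / ((px[x] / n) * (py[y] / n))) / np.log(base)
                         for x, y in zip(xs, ys)])

    # with x and y from the previous sketch:
    # i_vals = local_mutual_information(list(x), list(y))
    # i_vals.mean() equals the plug-in mutual information of the sample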

Signal Representation and State Space Reconstruction

The neural signals (i.e., the random process) that we analyze usually show history dependence. This means that the random variables that form the random process are no longer independent, but depend on variables in their past. In this setting, a proper description of the process requires looking at the present and past random variables jointly. A simple example may illustrate this: if we record the position x_t of a pendulum over time, then the position x_{t−1} matters for our information about the next position x_{t+1} of the pendulum at time t + 1. Here, analyzing just x_t is not enough to describe the next position. For example, assume x_t = 0 and x_{t−1} < 0. In this case, we have a zero crossing of the pendulum and in the next step x_{t+1} > 0. Conversely, if x_t = 0 and x_{t−1} = 0 as well, the pendulum does not move at all and the future state of the pendulum is x_{t+1} = 0. Clearly, analyzing a single position x_t is not enough to inform us about the next position of the pendulum. In this example, looking at just one past value together with the present one (x_t, x_{t−1}) is sufficient to predict the next position x_{t+1}. Therefore, (x_t, x_{t−1}) describes the 'state' of the pendulum at time t. In the next paragraph, we will translate the concept of a state to our framework of random processes.

In general, if there is any dependence between the Xt, we have to form the smallest collection of variables 𝐗_t = (X_t, X_{t_1}, X_{t_2}, ..., X_{t_i}, ...) with t_i < t that jointly make X_{t+1} conditionally independent of all X_{t_k} with t_k < min(t_i), that is,

p(x_{t+1}, x_{t_k} | 𝐱_t) = p(x_{t+1} | 𝐱_t) p(x_{t_k} | 𝐱_t)   [13]

∀ t_k < min(t_i), ∀ x_{t+1} ∈ A_{X_{t+1}}, x_{t_k} ∈ A_{X_{t_k}}, 𝐱_t ∈ A_{𝐗_t}

A realization 𝐱_t of 𝐗_t is called a state of the random process X at time t. To obtain the states of a random process, the time intervals between the t_i need not be uniform (Faes, Nollo, & Porta, 2012; Small & Tse, 2004). The process of obtaining 𝐗_t from data is called state space reconstruction. For state space reconstruction, various optimization routines exist (Ragwitz & Kantz, 2002; Small & Tse, 2004). Without proper state space reconstruction, information theoretical analyses of the computational processes carried out by a system will almost inevitably miscount information in the random process. Indeed, the importance of state space reconstruction cannot be overstated; for example, a failure to reconstruct states properly leads to false-positive findings and reversed directions of information transfer as demonstrated in Vicente, Wibral, Lindner, and Pipa (2011). Imperfect state space reconstruction is also the underlying principle of various modes of failure of information transfer analysis recently demonstrated analytically (circumventing trivial estimation problems) by Smirnov (2013). For the remainder of the text, we therefore assume a proper state space reconstruction. The resulting state space representations are indicated by bold letters, that is, 𝐗_t and 𝐱_t refer to the state variables of X and their realizations, respectively. For stationary processes, again, p_{𝐗_{t1}}(·) = p_{𝐗_{t2}}(·) for all t1, t2 and the indices in principle can be omitted.
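A very common practical choice for state space reconstruction is uniform time-delay embedding with an embedding dimension d and a delay τ; the optimization routines cited above choose these parameters in a data-driven way. Below is a minimal illustrative sketch, with all names and parameter values our own:

    import numpy as np

    def delay_embed(x, dim, tau):
        """Uniform delay embedding: the state at time t is (x_t, x_{t-tau}, ..., x_{t-(dim-1)*tau})."""
        x = np.asarray(x)
        start = (dim - 1) * tau
        return np.column_stack([x[start - k * tau: len(x) - k * tau] for k in range(dim)])

    # states = delay_embed(signal, dim=3, tau=5)
    # states[i] is the reconstructed state for sample i + (dim - 1) * tau of the original series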

Analyzing Neural Coding

As introduced in the preceding text, information theory can serve to bridge the gap between the functional level, where we deal with properties of a stimulus or task that bear a direct meaning to us, and the implementation level, where we record physical indices of neural activity, such as action potentials. To this end, we use mutual information (Eqn [9]) and derivatives thereof to answer questions like the following:

1. Which (features of) neural responses (R) carry information about (which features of) a class of stimuli (S)?
2. How much does an observer of a specific neural response r, that is, a receiving brain area, change its beliefs about the identity of a stimulus s, from the initial belief p(s) to the posterior belief p(s|r) after receiving the neural response r?
3. Which specific neural response r is particularly informative about an unknown stimulus from a certain set of stimuli?
4. Which stimulus s leads to responses that are informative about this very stimulus, that is, to responses that can 'transmit' the identity of the stimulus to downstream neurons?

In the following sections, we will show how to answer these questions one by one using information theory.

1. Which features of neural responses (R) carry information about which features of a certain class of stimuli (S)? This question can be easily answered by computing the mutual information I(S; R) between stimulus features in an experiment and specific features of the neural responses (such as spike rates). Despite its deceptive simplicity, computing this mutual information can be very informative about neural codes. For example, if we only extract certain features Fi(r) of neural responses r, such as the time of the first spike (e.g., Johansson & Birznieks, 2004) or the relative spike times (Havenith et al., 2011), also with respect to an oscillation cycle (O'Keefe & Recce, 1993), and compare the resulting mutual information for both features I(S; F1(R)) and I(S; F2(R)), then we can see which feature carries more information. The feature carrying more information is potentially the one also read out by the neural system. In general, these mutual information values give us upper bounds on what downstream neural circuits can possibly extract from a certain response feature dimension, such as spike times or rate, alone. Nevertheless, one has to keep in mind that several response features might have to be considered jointly and carry synergistic information (see Section 'The Question of Ensemble Coding', below).

2. How much does an observer of a specific neural response r, that is, a receiving brain area, change its beliefs about the identity of a stimulus s, from the initial belief p(s) to the posterior belief p(s|r) after receiving the neural response r? This question is natural to ask in the setting of Bayesian brain theories (Knill & Pouget, 2004). Since this question addresses a quantity associated with a specific response (r), we have to decompose the overall mutual information between the stimulus variable and the response variable (I(S; R)) into more specific information terms. As this question is about a difference in probability distributions, before and after receiving r, it is naturally expressed in terms of a Kullback–Leibler divergence between p(s) and p(s|r). The resulting measure is called the specific surprise i_sup (DeWeese & Meister, 1999):

i_sup(S; r) = Σ_s p(s|r) log [ p(s|r) / p(s) ]   [14]

It can be easily verified that I(S; R) = Σ_r p(r) i_sup(S; r). Hence, i_sup is a valid decomposition of the mutual information into more specific, response-dependent contributions. As a Kullback–Leibler divergence, i_sup is always positive or zero:

i_sup(S; r) ≥ 0   [15]

This simply indicates that any incoming response will either update our beliefs – leading to a positive Kullback–Leibler divergence – or not, in which case the Kullback–Leibler divergence will be zero. From this, it immediately follows that i_sup cannot be additive: if of two subsequent responses r1, r2, the first leads us to update our beliefs about s from p(s) to p(s|r1), but the second leads us to revert this update, that is, p(s|r1, r2) = p(s), then i_sup(S; r1, r2) = 0 ≠ i_sup(S; r1) + i_sup(S; r2|r1). Loosely speaking, a series of surprises and belief updates do not necessarily lead to a better estimate.

3. Which specific neural response r is particularly informative about an unknown stimulus from a certain set of stimuli? This question asks how much the knowledge about r and the update of our beliefs about s from p(s) to p(s|r) is worth in terms of an uncertainty reduction about s, that is, an information gain. In contrast to the question about an update of our beliefs earlier in the text, we here ask whether this update increases or reduces uncertainty about s. This question is naturally expressed in terms of conditional entropies, comparing our uncertainty before the response, H(S), with our uncertainty after receiving the response, H(S|r). The resulting difference is called the response-specific information i_r(S; r) (DeWeese & Meister, 1999):

i_r(S; r) = H(S) − H(S|r)   [16]

Again, it is easily verified that I(S; R) = Σ_r p(r) i_r(S; r). However, here, i_r(S; r) is not necessarily positive: As a response r can lead from a probability distribution p(s) with a low entropy H(S) to one with a high entropy H(S|r), and vice versa, the measure can be positive or negative. What is gained by accepting 'negative information gains' in the sum, that is, accepting that results may be misinformative in terms of an increase in uncertainty, is that the measure is additive for two subsequent responses:

i_r(S; r1, r2) = i_r(S; r1) + i_r(S; r2|r1)   [17]

4. Which stimulus s leads to responses r that are informative about the stimulus itself? In other words, which stimulus is reliably associated to responses that are relatively unique for this stimulus, so that we know about the occurrence of this specific stimulus from the response unambiguously? Here, we ask about stimuli that are being encoded well by the system, in the sense that they lead to responses that are informative to a downstream observer. In this type of question, a response is considered informative if it strongly reduces the uncertainty about the stimulus, that is, if it has a large i_r(S; r). Thus, we ask how informative the responses for a given stimulus s are on average over all responses that the stimulus elicits with p(r|s):

i_SSI(s; R) = Σ_{r ∈ A_R} p(r|s) i_r(S; r)   [18]

The resulting measure i_SSI(s; R) is called the stimulus-specific information (SSI) (Butts, 2003). Again, it can be

verified easily that I(S; R) = Σ_s p(s) i_SSI(s; R), meaning that i_SSI is another valid decomposition of the mutual information. Just as the response-specific information terms that it is composed of, the SSI can be negative. The SSI has been used to investigate which stimuli are encoded well in neurons with a specific tuning curve; it was demonstrated that the specific stimuli that were encoded best changed with the noise level of the responses (Butts & Goldman, 2006).
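As an illustration of Eqns [14], [16], and [18], the following sketch evaluates the specific surprise, the response-specific information, and the SSI for a small, made-up joint distribution p(s, r); the numbers and function names are invented for demonstration, and base-2 logarithms are used throughout.

    import numpy as np

    def specific_surprise(p_joint, r):
        """i_sup(S; r): Kullback-Leibler divergence between p(s|r) and p(s), Eqn [14]."""
        p_s = p_joint.sum(axis=1)
        p_s_r = p_joint[:, r] / p_joint[:, r].sum()
        nz = p_s_r > 0
        return float(np.sum(p_s_r[nz] * np.log2(p_s_r[nz] / p_s[nz])))

    def response_specific_information(p_joint, r):
        """i_r(S; r) = H(S) - H(S|r), Eqn [16]; can be negative."""
        p_s = p_joint.sum(axis=1)
        p_s_r = p_joint[:, r] / p_joint[:, r].sum()
        h = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
        return float(h(p_s) - h(p_s_r))

    def stimulus_specific_information(p_joint, s):
        """i_SSI(s; R) = sum_r p(r|s) i_r(S; r), Eqn [18]."""
        p_r_s = p_joint[s, :] / p_joint[s, :].sum()
        return float(sum(p_r_s[r] * response_specific_information(p_joint, r)
                         for r in range(p_joint.shape[1]) if p_r_s[r] > 0))

    # rows: stimuli, columns: responses; entries are joint probabilities p(s, r)
    p_joint = np.array([[0.20, 0.05],
                        [0.05, 0.20],
                        [0.25, 0.25]])
    print(stimulus_specific_information(p_joint, s=0))

Averaging i_sup or i_r over responses with p(r), or averaging i_SSI over stimuli with p(s), recovers I(S; R), which can serve as a sanity check of such a decomposition.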

The Question of Ensemble Coding

Another question that is often asked in neuroscience is whether an ensemble of neurons codes individually or jointly for a stimulus. For example, after recording from an ensemble of two neurons {R1, R2} after stimulation with stimuli s ∈ A_S = {s1, s2, ...}, we may ask the following:

1. What information does Ri provide about S? This is the mutual information I(S; Ri).
2. What information does the joint variable R = {R1, R2} provide about S? This is the mutual information I(S; {R1, R2}).
3. What information does the joint variable R = {R1, R2} have about S that we cannot get from observing both variables R1, R2 separately? This information is called the synergy of {R1, R2} with respect to S.
4. What information does one of the variables, say R1, have about S that we cannot obtain from any other variable (R2 in our case)? This information is the unique information of R1 about S.
5. What information does one of the variables, again say R1, have about S that we could also obtain by looking at another variable R2 alone? This information is the redundant information of R1 and R2 about S.

Interestingly, only questions one and two can be answered using standard tools of information theory such as the (conditional) mutual information. In fact, the answers to questions three to five, that is, the quantification of unique, redundant, and synergistic information, are still a field of active research. While at the time of writing no agreement about a solution of this problem has been reached, it is clear that early attempts at quantifying synergies or redundancies, for example, by the interaction information I(S; R2|R1) − I(S; R1) (McGill, 1954), are imperfect in that they let synergistic and redundant information cancel each other in the measure when both are present (Williams & Beer, 2010). The fact that interaction information has been widely used in neuroscience makes it important to have a closer look at the problem – even in the absence of a solution.

Figure 1 Redundant and synergistic neural coding. (a) Receptive fields (RFs) of three neurons R1, R2, R3. (b) Set of four stimuli. (c) Circuit for synergistic coding. Responses of neurons R1, R2 determine the response of neuron N via an XOR function. In the hidden circuit in between (gray box), open circles denote excitatory neurons; filled circles, inhibitory neurons. Numbers in circles are activation thresholds; signed numbers at connecting arrows are synaptic weights.

Before we present more details, we would like to improve the understanding of the previously mentioned questions by looking at a thought experiment where three neurons in visual area V1 are recorded simultaneously while being stimulated with one of a set of four stimuli (Figure 1(a) and 1(b)). Two of the neurons have almost identical receptive fields (RF1, RF2), while the third one has a collinear but spatially displaced receptive field (RF3). These neurons are stimulated with one of the following stimuli: s1 does not contain anything at the receptive fields of the three neurons, and the neurons stay
inactive; s2 is a small bar moving with the preferred orientation of neurons 1 and 2 and is moving in their preferred direction; s3 is a similar small bar, but now moving over the receptive field of neuron 3, instead of 1 and 2; and s4 is a long bar covering all receptive fields in the example. To make things easy, let us encode responses that we get from these three neurons (colored traces in Figure 1) in binary form, with '1' simply indicating that there was a response in our response window (gray boxes behind the activity traces in Figure 1). If we assume the stimuli to be presented with equal probability (p(s = s_i) = 1/4), then the entropy of the stimulus set is H(S) = 2 bit. Obviously, none of the information terms in the preceding text can be larger than these 2 bits. We also see that each neuron shows activity (binary response = 1) in half of the cases, yielding an entropy H(Rj) = 1 for the responses of each neuron. The responses of the three neurons fully specify the stimulus, and therefore, I(S; R1, R2, R3) = 2. To see the mutual information between an individual neuron's response

and the stimulus, we may compute I(S; Ri) = H(S) − H(S|Ri). To do this, we remember H(S) = 2 and use that the number of equiprobable outcomes for S drops by half after observing a single neuron (e.g., after observing a response r1 = 1 of neuron 1, two stimuli remain possible sources of this response – s2 or s4). This gives H(S|Ri) = 1, and I(S; Ri) = 1. Hence, each neuron provides 1 bit of information about the stimulus when considered individually.

What happens if we consider joint variables formed from pairs of neurons? If we look at neurons 1 and 2, their responses to each stimulus are identical. Each of the neurons provides 1 bit of information about the stimulus. Even if we look at the two of them jointly {R1, R2}, we still get only one bit: I(S; {R1, R2}) = 1. This is because the information carried by their responses is redundant. To see this, consider that we cannot decide between stimuli s1 and s3 if we get the result (r1 = 0, r2 = 0), and we can also not decide between stimuli s2 and s4 when observing (r1 = 1, r2 = 1). (Other combinations of responses do not occur by the construction of this example.) We see that neurons 1 and 2 seem to have the same information about the stimulus; hence, we suspect that a measure of redundant information should show nonzero values in this case.

To understand the concept of synergy, we next consider the output of our two example neurons 1 and 3 being further transformed by a network that implements the mathematical XOR function, such that a downstream neuron N gets activated only if there is one small bar on the screen (i.e., one of our

neurons R1 or R3 gets activated, but not both), but neither for no stimulus nor for the long bar (Figure 1(c)). In this case, the individual mutual information of each neuron R1, R3 with the downstream neuron N is zero (I(N; Ri) = 0). However, the mutual information between our two neurons considered jointly and the downstream neuron is still 1 bit, because the response of N is fully determined by its two inputs: I(N; {R1, R3}) = 1. In this case, there is only synergistic information.

In general, the total information I(Y; {X1, X2}) that two variables X1, X2 have about a third variable Y can be decomposed into the unique information of each Xi about Y, the redundant information that both variables share about Y, and the synergistic information that can only be obtained by considering {X1, X2} jointly. Figure 2 shows this so-called partial information decomposition (Williams & Beer, 2010). One easily sees that the redundant, unique, and synergistic information cannot be obtained by simply subtracting classical mutual information terms, without double counting. However, if we were given a measure of either redundancy or synergy (or even unique information) (Bertschinger et al., 2014), the other parts of the decomposition can be computed. Therefore, Williams and Beer recently suggested a specific measure of redundant information, called I_min, that measures the redundant information about a variable Y that is contained in a set of source variables X = {X1, X2, ..., Xr} (Williams & Beer, 2010):

I_min(Y; {X1, X2, ..., Xr}) = Σ_{y ∈ A_Y} p(y) min_{Xi} I(Y = y; Xi)   [19]

with

I(Y = y; Xi) = Σ_{x_i ∈ A_{Xi}} p(x_i|y) [ log (1/p(y)) − log (1/p(y|x_i)) ]   [20]

Figure 2 Partial information decomposition for three variables: one target Y and two sources X1 and X2. The original partial information decomposition diagram is a modified version of the one first presented in Williams and Beer (2010). Legend: I(Y; X1X2), mutual information between Y and the set {X1X2}; I(Y; X1), mutual information between Y and X1; Iuq(Y; {X1}), unique information about Y from X1, which cannot be obtained from X2; Ird(Y; {X1}{X2}), redundant information about Y that can be obtained either from X1 or from X2; Isyn(Y; {X1X2}), synergistic information that can only be obtained from X1 and X2 together. The definitions of these information measures are currently debated.
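The mutual information claims of the thought experiment can be checked numerically. The sketch below (illustrative only, not part of the original text) encodes the four equiprobable stimuli and the binary responses of the three neurons, builds the XOR neuron N from R1 and R3, and evaluates the plug-in mutual information terms; it reproduces the redundancy of {R1, R2} and the purely synergistic coding of N by {R1, R3}.

    import numpy as np
    from collections import Counter

    def H(zs, base=2):
        """Plug-in entropy of a sequence of (hashable) observations."""
        n = len(zs)
        return -sum((c / n) * np.log(c / n) for c in Counter(zs).values()) / np.log(base)

    def I(xs, ys):
        """Plug-in mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)."""
        return H(xs) + H(ys) - H(list(zip(xs, ys)))

    # one entry per stimulus s1..s4 (equiprobable); binary responses as in Figure 1
    s  = [1, 2, 3, 4]
    r1 = [0, 1, 0, 1]                      # small bar over RF1/RF2, or long bar
    r2 = [0, 1, 0, 1]                      # same receptive field as neuron 1: identical responses
    r3 = [0, 0, 1, 1]                      # small bar over RF3, or long bar
    n  = [a ^ b for a, b in zip(r1, r3)]   # downstream neuron N = XOR(R1, R3)

    print(I(s, r1))                        # 1 bit from a single neuron
    print(I(s, list(zip(r1, r2))))         # still 1 bit: the pair {R1, R2} is redundant
    print(I(s, list(zip(r1, r3))))         # 2 bits: the pair {R1, R3} specifies the stimulus
    print(I(n, r1), I(n, r3))              # 0 bits each
    print(I(n, list(zip(r1, r3))))         # 1 bit: purely synergistic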

However, the I_min measure has been criticized for sometimes showing redundant information even if the source variables X1, X2 have no mutual information or showing synergy when the target variable (Y in the preceding text) is just a collection of the source variables (Griffith & Koch, 2014). For example, when considering neurons 1 and 3 jointly, we can fully decode the 2 bit of stimulus information. At first sight, it seems clear in this example that each neuron contributes 1 bit of unique information, because the responses of neurons 1 and 3 are independent for the randomly chosen stimuli of our set. Therefore, it was axiomatically proposed that in this case, each neuron should contribute 1 bit of unique information (Griffith & Koch, 2014; Harder, Salge, & Polani, 2013). In this view, there is no synergy, in contrast to the results that I_min provides. However, positive synergy has also been proposed as a mathematically sound alternative result for this example (Bertschinger, Rauh, Olbrich, & Jost, 2013). To see the rationale for this alternative proposal, consider that after observing r1 = 1, we (and also neuron 1) know that the two events (r1 = 0, r2 = 0) and (r1 = 0, r2 = 1) are both impossible. Interestingly, after observing just r2 = 1, we (and also neuron 2) know that (r1 = 0, r2 = 0) and (r1 = 1, r2 = 0) are both impossible. The sets of events known to each neuron to be impossible intersect. Hence, although the two neurons do not share any information about each other's responses, they have information in common about impossible states of the world. Therefore, a nonzero synergy for this case has been considered, albeit less than the one bit that I_min measures for the example. Obviously, if a measure is used that provides nonzero synergy in the example, then each neuron will contribute less than 1 bit of unique information. It should be noted that I_min has also been criticized for not having a continuous local counterpart – in contrast to mutual information (see Eqn [11]) (Lizier, Flecker, & Williams, 2013). It should also be noted that the problem of assigning redundant or synergistic information quickly increases in complexity when the number of variables increases (see Figure 3 for the decomposition of I(Y; {X1, X2, X3})).

In sum, there is currently no generally accepted measure, neither of redundancy nor of synergy (Bertschinger et al., 2013; Lizier et al., 2013). Various candidate measures (Griffith & Koch, 2014; Harder et al., 2013; Williams & Beer, 2010) have been proposed, but so far, most of them seem to have considerable shortcomings, and it is even unclear whether all the properties that we intuitively assign to redundancies or synergies are compatible with each other (Bertschinger et al., 2013). For neuroscience, this means that older accounts of synergy (Brenner, Strong, Koberle, Bialek, & Van Steveninck, 2000; Gawne & Richmond, 1993; Han, 1978; McGill, 1954) have to be carefully reconsidered as they may (or may not) double count redundancies, while a mathematically rigorous replacement is missing for now.

Figure 3 Partial information decomposition for four variables: one target variable Y can receive information from various combinations of the three source variables X1, X2, X3. Comparing with Figure 2, one sees that the number of partial information contributions rises very rapidly with the number of source variables. Modified from Williams, P.L., & Beer, R.D. (2010). Nonnegative decomposition of multivariate information. ArXiv e-print No. 1004.2515.

Importance of the Stimulus Set and Response Features

It may not be visible immediately in the preceding equations, but central quantities of the previous treatment, such as H(S),
H(S|r), depend strongly on the choice of the stimulus set A_S. For example, if one chooses to study the human visual system with a set of 'visual' stimuli in the far infrared end of the spectrum, I(S; R) will most likely be very small and analysis futile (although done properly, a zero value of i_SSI(s; R) for all stimuli will correctly point out that the human visual system does not care or code for any of the infrared stimuli). Hence, characterizing a neural code properly hinges to a large extent on an appropriate choice of stimuli. In this respect, it is safe to assume that a move from artificial stimuli (such as gratings in visual neuroscience) to more natural ones may dramatically alter our view of neural codes in the future.

A similar argument holds for the response features that are selected for analysis. If any feature is dropped or not measured at all, this may distort the information measures in the preceding text. This may even happen if the dropped feature, say the exact spike time variable R_ST, seems to carry no mutual information with the stimulus variable when considered alone, that is, I(S; R_ST) = 0. This is because there may still be synergistic information that can only be recovered by looking at other response variables jointly with R_ST. For example, it would be possible in principle that neither spike time R_ST nor spike rate R_SR carry mutual information with the stimulus variable when considered individually, that is, I(S; R_ST) = I(S; R_SR) = 0. Still, when considered jointly, they may be informative: I(S; {R_ST, R_SR}) > 0! The problem of omitted response features is almost inevitable in neuroscience, as the full sampling of all parts of a neural system is typically impossible, and we have to work with subsampled data. Considering only a subset of

(response) variables may dramatically alter the apparent dependency structure in the neural system (see Priesemann, Munk, & Wibral, 2009, for an example). Therefore, the effects of subsampling should always be kept in mind when interpreting results of studies on neural coding.

Analyzing Distributed Computation in Neural Systems

The analysis of neural coding strategies presented in the preceding text relies on our knowledge of the (stimulus) features at the functional level that are encoded in neural responses that we record at the implementation level. If we have this knowledge, information theory will help us link these two levels. This is somewhat similar to the situation in cryptography where we consider a code 'cracked' if we obtain a human-readable plain text message, that is, we move from the implementation level (encrypted message) to the functional level (meaning). However, what happens if the plain text were in a language that bears no resemblance to ours? In this case, we would potentially crack the code without ever realizing it, as the plain text still has no meaning for us. This is the situation we face currently in neuroscience if we move beyond early sensory or motor areas. Beyond these areas, our knowledge of what is actually encoded in those neural signals gets vague. As a result, proper stimulus sets get hard to choose. In this case, the gap between the functional level and the implementation level may actually become too wide for meaningful analyses, as also noticed recently by Carandini (2012). As shown in this section, we may try instead to use information theory to link the implementation level and the algorithmic level, by retrieving the information processing footprint of the computation carried out by a neural circuit in terms of information storage, information transfer, and information modification.

There is considerable agreement that neural systems perform some kind of information processing or computation. Information processing in general can be decomposed into the component processes of (1) information storage, (2) information transfer, and (3) information modification. This decomposition had already been formulated by Turing (see Langton, 1990):





• Information storage quantifies the information contained in the past of a process that is used by the process (e.g., a neuron or a brain area) in its future. This relatively abstract definition means that we will see at least a part of the past information again in the future of the process, but potentially transformed. Hence, information storage can be naturally quantified by a mutual information between the past and the future of a process. If this mutual information is nonzero, we can predict a part of the future information in the process, that is, there is information storage.

• Information transfer quantifies the information contained in the past state variables 𝐗_{t−u} of one source process X that can be used to predict information in future variables Y_t of a target process Y, while this information is not contained in the past state variables 𝐘_{t−1} of the target process.

• Information modification quantifies the combination of information from various source processes into a new form that is not (trivially) predictable from any subset of these source processes (Lizier, Prokopenko, & Zomaya, 2010; Lizier et al., 2013).

A thought experiment may serve to illustrate that information theory serves as an analysis at the algorithmic level rather than at the implementation level. Assume a very capable neuroscientist working with a small neural system, for example, the nervous system of the worm Caenorhabditis elegans. Let us also assume that she has sorted out the anatomical setup of this nervous system in such detail that she can simulate the realistic dynamics of this system in a computer. At this stage, our researcher may actually perform an analysis to understand the causal role elements of the nervous system play for its dynamic behavior. To do this, she may intervene in the biological system, for example, by injecting a current into a specific cell (there are roughly 300 neurons in C. elegans, each with its own identity), and observe how the dynamics change. She may then proceed to (virtually) inject the same current in her modeled neurons. If the modeled dynamics reflect the biological dynamics, she may rightfully claim to have a causal understanding of the system's dynamics at the implementation level. Note that at this level, the interventions used are very different (injecting a current vs. changing some lines of code in the model). Interestingly, both interventions would look indistinguishable at the algorithmic level. This is because the nervous system and its model produce identical behavior in the sense that the activities recorded in vivo and in silico and then stored on disk look identical (we assume a perfect model here). Accordingly, an information theoretical analysis would not be able to tell the difference between the model and the real thing.

At first sight, this may seem rather disappointing – so is there any insight we potentially gain by shedding biological/modeling detail using information theoretical analysis? To see what insights can be gained about the information processing in a neural system using information theory, assume that a set of neurons only contributes to support processes, for example, as pacemaker cells that keep this system in an operational dynamic activity regime. Assume that these neurons do not contribute anything else to the computation proper. Telling these neurons apart from other parts of the network convincingly may be hard at the implementation level, analyzing only the dynamics. It may even be hard to detect their particular role in the first place, performing an analysis at the implementation level. In contrast, an information theoretical analysis at the algorithmic level may easily identify them by their lack of information storage, their lack of mutual information with input, or their high values of information modification (they would modify any potential input to random output in our scenario). Thus, information theory helps to unravel the role each component has in information processing at each point in time.

Based on Turing's general decomposition of computation into component processes of storage, transfer, and modification (Langton, 1990), Lizier and colleagues recently proposed an information theoretical framework to quantify distributed computations in terms of all three component processes locally

INTRODUCTION TO METHODS AND MODELING | Information Theoretical Approaches

for each part of the system (e.g., neurons or brain areas) and each time step (Lizier, 2013; Lizier, Prokopenko, & Zomaya, 2008, 2012; Lizier et al., 2010). This framework is called local information dynamics and has been successfully applied to unravel computation in swarms (Wang, Miller, Lizier, Prokopenko, & Rossi, 2012), in Boolean networks (Lizier, Pritam, & Prokopenko, 2011b), and in neural data (Lizier et al., 2011a). In the following, we present both global and local measures of information transfer, information storage, and information modification, beginning with the well-established measures of information transfer and ending with the highly dynamic field of information modification.


Information Transfer

The analysis of information transfer was formalized initially by Schreiber using the transfer entropy (TE) functional (Schreiber, 2000) and has seen a rapid surge of interest in neuroscience (Amblard & Michel, 2011; Barnett, Barrett, & Seth, 2009; Battaglia, Witt, Wolf, & Geisel, 2012; Besserve, Schölkopf, Logothetis, & Panzeri, 2010; Buehlmann & Deco, 2010; Chávez, Martinerie, & Le Van Quyen, 2003; Garofalo, Nieus, Massobrio, & Martinoia, 2009; Gourévitch & Eggermont, 2007; Ito et al., 2011; Li & Ouyang, 2010; Lindner, Vicente, Priesemann, & Wibral, 2011; Lizier et al., 2011b; Lüdtke et al., 2010; Neymotin, Jacobs, Fenton, & Lytton, 2011; Paluš, Komárek, Hrnčíř, & Štěrbová, 2001; Staniek & Lehnertz, 2009; Stetter, Battaglia, Soriano, & Geisel, 2012; Vakorin, Mišić, Krakovska, & McIntosh, 2011; Vakorin, Kovacevic, & McIntosh, 2010; Vakorin, Krakovska, & McIntosh, 2009; Vicente et al., 2011; Wibral et al., 2013; Wibral et al., 2011) and general physiology (Faes & Nollo, 2006; Faes, Nollo, & Porta, 2011; Faes et al., 2012).

Definition

Information transfer from a random process X (the source) to another random process Y (the target) is measured by the TE functional (Schreiber, 2000):

TE(X_{t-u} \to Y_t) = I(\mathbf{X}_{t-u}; Y_t \mid \mathbf{Y}_{t-1})   [21]

or, equivalently:

TE(X_{t-u} \to Y_t) = \sum_{y_t \in A_{Y_t},\, \mathbf{y}_{t-1} \in A_{\mathbf{Y}_{t-1}},\, \mathbf{x}_{t-u} \in A_{\mathbf{X}_{t-u}}} p(y_t, \mathbf{y}_{t-1}, \mathbf{x}_{t-u}) \log \frac{p(y_t \mid \mathbf{y}_{t-1}, \mathbf{x}_{t-u})}{p(y_t \mid \mathbf{y}_{t-1})}   [22]

where I(·;·|·) is the conditional mutual information, Y_t is the random variable of process Y at time t, and X_{t-u}, Y_{t-1} are the past state random variables of processes X and Y, respectively. The delay variable u in X_{t-u} indicates that the past state of the source is to be taken u time steps into the past to account for a potential physical interaction delay between the processes. The TE functional can be linked to Wiener–Granger-type causality (Barnett et al., 2009; Wiener, 1956). More precisely, for systems with jointly Gaussian variables, TE is equivalent to linear Granger causality (see Barnett et al., 2009 and references therein). However, whether the assumption of Gaussian variables is appropriate in a neural setting must be checked carefully for each case. In fact, it was found that electroencephalography (EEG) source signals are not Gaussian: when decomposing EEG activity into its independent components, which are by definition as non-Gaussian as possible, these components match well with a decomposition into electric dipoles, which are physical approximations to neural sources. The close match between the non-Gaussian decomposition and the dipole decomposition indicates that neural signals may often be non-Gaussian (Wibral, Turi, Linden, Kaiser, & Bledowski, 2008). Therefore, the results of a linear Granger causality analysis should be interpreted with care.
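As an illustration of Eqn [22], the following minimal Python sketch computes TE between two discrete time series with plug-in probability estimates and a target history of a single sample. The function name and toy data are ours; as discussed in the estimation section below, plug-in estimates of this kind are biased for small samples and serve here only to make the formula concrete.

import numpy as np
from collections import Counter

def plugin_te(x, y, u=1):
    # Plug-in estimate of TE(X_{t-u} -> Y_t | Y_{t-1}) for discrete series, Eqn [22],
    # with a target history of one sample (a toy setting; real analyses use embedded states).
    triples = [(y[t], y[t - 1], x[t - u]) for t in range(max(u, 1), len(y))]
    n = len(triples)
    c_full = Counter(triples)                                # counts of (y_t, y_{t-1}, x_{t-u})
    c_past = Counter((yt1, xtu) for _, yt1, xtu in triples)  # counts of (y_{t-1}, x_{t-u})
    c_target = Counter((yt, yt1) for yt, yt1, _ in triples)  # counts of (y_t, y_{t-1})
    c_hist = Counter(yt1 for _, yt1, _ in triples)           # counts of y_{t-1}
    te = 0.0
    for (yt, yt1, xtu), c in c_full.items():
        p_cond_full = c / c_past[(yt1, xtu)]                 # p(y_t | y_{t-1}, x_{t-u})
        p_cond_hist = c_target[(yt, yt1)] / c_hist[yt1]      # p(y_t | y_{t-1})
        te += (c / n) * np.log2(p_cond_full / p_cond_hist)
    return te

# toy example: y is a noisy, one-sample-delayed copy of x
rng = np.random.default_rng(0)
x = rng.integers(0, 2, 10000)
y = np.roll(x, 1) ^ (rng.random(10000) < 0.1).astype(int)
print(plugin_te(x, y, u=1))   # clearly positive (close to 1 - H(0.1) bits)
print(plugin_te(y, x, u=1))   # close to zero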

TE estimation

When the probability distributions entering Eqn [21] are known (e.g., in an analytically tractable neural model), TE can be computed directly. However, in most cases, these probability distributions have to be estimated from data before they can be inserted into Eqn [21]. Indeed, this approach has been used in the past and leads to a so-called 'plug-in' estimator (see Panzeri, Senatore, Montemurro, and Petersen, 2007). However, these plug-in estimators have serious bias problems (Panzeri et al., 2007). Therefore, newer approaches to TE estimation rely on more direct estimations of the entropies that TE can be decomposed into (Kraskov, Stögbauer, & Grassberger, 2004; Vicente et al., 2011). These estimators have better bias properties, and we therefore restrict our presentation to these approaches.

As pointed out in the preceding text, we have to reconstruct the states of the processes before we can proceed to estimate information theoretical quantities. One approach to state reconstruction is time delay embedding (Takens, 1981). It uses past variables X_{t-n\tau}, n = 1, 2, \ldots that are spaced in time by an interval \tau. The number of these variables and their optimal spacing can be determined using established criteria (Faes et al., 2012; Lindner et al., 2011; Ragwitz & Kantz, 2002; Small & Tse, 2004). The realizations of the state variables can be represented as vectors of the form

\mathbf{x}_t^{d} = (x_t, x_{t-\tau}, x_{t-2\tau}, \ldots, x_{t-(d-1)\tau})   [23]

where d is the dimension of the state vector. Using this vector notation, the TE estimator for an interaction with assumed delay u is

TE_{SPO}(X_{t-u} \to Y_t) = \sum_{y_t, \mathbf{y}_{t-1}^{d_y}, \mathbf{x}_{t-u}^{d_x}} p(y_t, \mathbf{y}_{t-1}^{d_y}, \mathbf{x}_{t-u}^{d_x}) \log \frac{p(y_t \mid \mathbf{y}_{t-1}^{d_y}, \mathbf{x}_{t-u}^{d_x})}{p(y_t \mid \mathbf{y}_{t-1}^{d_y})}   [24]

where the subscript SPO (for self-prediction optimal) is a reminder that the past state of Y_t, \mathbf{y}_{t-1}^{d_y}, has to be constructed such that conditioning on the past of the target process is optimal (see Wibral et al., 2013, for details). Put differently, if one conditioned on \mathbf{y}_{t-u}^{d_y} instead of \mathbf{y}_{t-1}^{d_y}, the self-prediction for Y_t would not be optimal and the TE would be overestimated. The parameter u is the assumed time that the information transfer needs to get from process X to Y.
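To make the state reconstruction entering Eqns [23] and [24] concrete, the sketch below builds delay-embedding vectors from a scalar time series for a given dimension d and spacing τ. The function name and the parameter values in the example are ours; in practice, d and τ would be chosen with the criteria cited above.

import numpy as np

def delay_embed(x, d, tau):
    # Rows are the state vectors of Eqn [23]: (x_t, x_{t-tau}, ..., x_{t-(d-1)tau}).
    x = np.asarray(x)
    first = (d - 1) * tau                   # earliest t for which a full state vector exists
    t = np.arange(first, len(x))
    offsets = np.arange(0, d * tau, tau)    # 0, tau, ..., (d-1)*tau
    return x[t[:, None] - offsets[None, :]]

x = np.sin(0.1 * np.arange(1000))
states = delay_embed(x, d=3, tau=2)
print(states.shape)   # (996, 3); states[0] corresponds to (x_4, x_2, x_0)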



It was recently proven for bivariate systems that the estimator in Eqn [24] assumes its maximum value if the parameter u is equal to the true delay d of the information transfer from X to Y (Wibral et al., 2013). This relationship allows the interaction delay d to be estimated by scanning the assumed delay u:

d = \arg\max_{u} \left[ TE_{SPO}(X_{t-u} \to Y_t) \right]   [25]
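The scan over assumed delays in Eqn [25] is straightforward to implement once a TE estimator is available. The sketch below reuses the plug-in estimator sketched in the definition section above purely for illustration; the function name and toy data are ours.

import numpy as np

def scan_interaction_delay(x, y, te_estimate, delays):
    # Estimate the interaction delay as the assumed delay u that maximizes TE (Eqn [25]).
    te_values = np.array([te_estimate(x, y, u) for u in delays])
    return int(delays[np.argmax(te_values)]), te_values

# toy data with a true source-target delay of three samples
rng = np.random.default_rng(1)
x = rng.integers(0, 2, 10000)
y = np.roll(x, 3) ^ (rng.random(10000) < 0.1).astype(int)
d_hat, te_vals = scan_interaction_delay(x, y, plugin_te, delays=np.arange(1, 8))
print(d_hat)   # 3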

We can rewrite Eqn [24] using a representation in the form of four entropies (for continuous-valued random variables, these entropies are differential entropies) H(·) as

TE_{SPO}(X_{t-u} \to Y_t) = H(\mathbf{y}_{t-1}^{d_y}, \mathbf{x}_{t-u}^{d_x}) - H(y_t, \mathbf{y}_{t-1}^{d_y}, \mathbf{x}_{t-u}^{d_x}) + H(y_t, \mathbf{y}_{t-1}^{d_y}) - H(\mathbf{y}_{t-1}^{d_y})   [26]

This shows that TE_{SPO} estimation amounts to computing a combination of different joint and marginal entropies. Entropies can be estimated efficiently by nearest-neighbor techniques. These techniques exploit the fact that the distances between neighboring data points in a given embedding space are related to the local probability density: the higher the local probability density around an observed data point, the closer its nearest neighbors. Since nearest-neighbor estimators are data-efficient (Kozachenko & Leonenko, 1987; Victor, 2002), they allow entropies in high-dimensional spaces to be estimated from limited real data. Unfortunately, it is problematic to estimate TE by simply applying a naive nearest-neighbor estimator for the entropy, such as the Kozachenko–Leonenko estimator (Kozachenko & Leonenko, 1987), separately to each of the terms appearing in Eqn [26]. The reason is that the dimensionality of the state spaces involved in Eqn [26] differs largely across terms; the joint term containing y_t, \mathbf{y}_{t-1}^{d_y}, and \mathbf{x}_{t-u}^{d_x} has the highest dimensionality. Because of the different dimensionalities, fixing a given number of neighbors for the search will set very different spatial scales (range of distances) for each term (Kraskov et al., 2004). Since the bias of each term depends on these scales, the errors would not cancel each other but accumulate. The Kraskov–Grassberger–Stögbauer estimator handles this problem by fixing the number of neighbors k only in the highest-dimensional space and by projecting the resulting distances to the lower-dimensional spaces as the range in which to look for neighbors there (Kraskov et al., 2004). After adapting this technique to the TE formula (Wollstadt et al., 2014; Vicente et al., 2011), the suggested estimator can be written as

TE_{SPO}(X_{t-u} \to Y_t) = \psi(k) + \left\langle \psi\!\left(n_{\mathbf{y}_{t-1}^{d_y}} + 1\right) - \psi\!\left(n_{y_t \mathbf{y}_{t-1}^{d_y}} + 1\right) - \psi\!\left(n_{\mathbf{y}_{t-1}^{d_y} \mathbf{x}_{t-u}^{d_x}} + 1\right) \right\rangle   [27]

where \psi denotes the digamma function, the angle brackets \langle\cdot\rangle indicate an average over different time points for stationary systems or an ensemble average for nonstationary ones, and k is the number of nearest neighbors used for the estimation. n_{(\cdot)} refers to the number of neighbors that lie within a hypercube around a state vector, where the size of the hypercube in each of the marginal spaces is defined by the distance to the kth nearest neighbor in the highest-dimensional space (spanned by y_t, \mathbf{y}_{t-1}^{d_y}, \mathbf{x}_{t-u}^{d_x}).
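A bare-bones sketch of this nearest-neighbor estimator (Eqn [27]) is given below, using maximum-norm distances and counting, in each marginal space, the neighbors that fall strictly within the distance to the kth neighbor in the joint space. It is meant only to illustrate the logic of the Kraskov–Grassberger–Stögbauer approach; the function name, parameter defaults, and toy data are ours, and validated implementations such as TRENTOOL should be preferred for real analyses.

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_te(x, y, u=1, dy=1, dx=1, tau=1, k=4):
    # Kraskov-style estimate of TE_SPO(X_{t-u} -> Y_t), Eqn [27], in nats.
    x, y = np.asarray(x, float), np.asarray(y, float)
    first = max((dy - 1) * tau + 1, (dx - 1) * tau + u)
    t = np.arange(first, len(y))
    y_t = y[t, None]                                                   # future of the target
    y_past = np.column_stack([y[t - 1 - j * tau] for j in range(dy)])  # y_{t-1}^{dy}
    x_past = np.column_stack([x[t - u - j * tau] for j in range(dx)])  # x_{t-u}^{dx}
    joint = np.column_stack([y_t, y_past, x_past])                     # highest-dimensional space
    # distance to the kth neighbor in the joint space (max-norm, excluding the point itself)
    eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]

    def n_within(space):
        tree = cKDTree(space)
        # neighbors strictly inside the joint-space distance, excluding the point itself
        return np.array([len(tree.query_ball_point(pt, r - 1e-12, p=np.inf)) - 1
                         for pt, r in zip(space, eps)])

    n_ypast = n_within(y_past)
    n_y_ypast = n_within(np.column_stack([y_t, y_past]))
    n_ypast_xpast = n_within(np.column_stack([y_past, x_past]))
    return digamma(k) + np.mean(digamma(n_ypast + 1) - digamma(n_y_ypast + 1)
                                - digamma(n_ypast_xpast + 1))

# toy example: y is a noisy, one-sample-delayed copy of x
rng = np.random.default_rng(2)
x = rng.normal(size=3000)
y = np.roll(x, 1) + 0.5 * rng.normal(size=3000)
print(ksg_te(x, y, u=1))   # clearly positive
print(ksg_te(y, x, u=1))   # near zero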

Interpretation of TE as a measure at the algorithmic level

In line with what was said in the introduction, it is important to note that TE analyzes distributed computation at the algorithmic level, not at the level of a physical dynamic system. As such, it is not well suited for inference about causal interactions – although it has been used for this purpose in the past. The fundamental reason for this is that information transfer relies on causal interactions, but causal interactions do not necessarily lead to nonzero information transfer (Ay & Polani, 2008; Chicharro & Andrzejak, 2009; Lizier & Prokopenko, 2010). Instead, causal interactions may serve active information storage alone (see next section) or force two systems into identical synchronization, where information transfer becomes effectively zero. This might be summarized by stating that TE is limited to effects of a causal interaction from a source to a target process that are unpredictable given the past of the target process alone. In this sense, TE may be seen as quantifying causal interactions currently in use for the communication aspect of distributed computation. Therefore, one may say that TE measures predictive or algorithmic information transfer. The difference between an analysis of information transfer in a computational sense and causality analysis based on interventions has been clarified recently by Lizier and colleagues (2010). The same authors also demonstrate why an analysis of information transfer is sometimes more important than the analysis of causal interactions if the computation of the system is to be understood.

Common problems and solutions

Typical problems in TE estimation encompass (1) finite sample bias, (2) the need for multivariate analyses, and (3) the presence of nonstationarities in the data. In recent years, all of these problems have been addressed satisfactorily:

1. Finite sample bias can be overcome by using surrogate data, where the observed realizations y_t, \mathbf{y}_{t-1}^{d_y}, \mathbf{x}_{t-u}^{d_x} of the random variables Y_t, \mathbf{Y}_{t-1}^{d_y}, \mathbf{X}_{t-u}^{d_x} are reassigned to other random variables of the process, such that the temporal order underlying the information transfer is destroyed (for an example, see the procedures suggested in Lindner et al., 2011; a minimal sketch of such a surrogate test is given after this list). This reassignment should conserve as many data features of the single process realizations as possible.

2. So far, we have restricted our presentation of TE estimation to the case of just two interacting random processes X, Y, that is, a bivariate analysis. In a setting that is more realistic for neuroscience, one deals with large networks of interacting processes X, Y, Z, .... In this case, various complications arise if the analysis is performed in a bivariate manner. For example, a process Z could transfer information with two different delays d_{Z\to X}, d_{Z\to Y} to two other processes X, Y. In this case, a pairwise analysis of TE between X and Y will yield an apparent information transfer from the process that receives information from Z with the shorter delay to the one that receives it with the longer delay (common driver effect). A similar problem arises if information is transferred first from a process X to Y and then from Y to Z. In this case, a bivariate analysis will also indicate information transfer from X to Z (cascade effect). Moreover, two sources may transfer information purely synergistically, that is, the TE from each source alone to the target is zero, and only considering them jointly reveals the information transfer. Again, cryptography may serve as an example here. If an encrypted message is received, there will be no discernible information transfer from the encrypted message to the plain text without the key. In the same way, there is no information transfer from the key alone to the plain text. It is only when encrypted message and key are considered jointly that the relation between their combination on the one side and the plain text on the other side is revealed. From a mathematical perspective, this problem seems to be easily solved by introducing the complete TE (Lizier & Prokopenko, 2010):

TE_{compl}(X_{t-u} \to Y_t \mid \mathbf{V}) = \sum_{y_t, \mathbf{y}_{t-1}, \mathbf{x}_{t-u}, \mathbf{v}} p(y_t, \mathbf{y}_{t-1}, \mathbf{x}_{t-u}, \mathbf{v}) \log \frac{p(y_t \mid \mathbf{y}_{t-1}, \mathbf{x}_{t-u}, \mathbf{v})}{p(y_t \mid \mathbf{y}_{t-1}, \mathbf{v})}   [28]

Here, the state random variable V is a collection of the past states of all processes in the network other than X, Y. However, there is often a practical problem with this approach, because even for small networks of random processes, the joint state space of the variables Y_t, \mathbf{Y}_{t-1}, \mathbf{X}_{t-u}, \mathbf{V} is intractably large from the estimation perspective. Moreover, the problem of finding all information transfers in the network, from either single source variables into the target or synergistic transfer from collections of source variables to the target, is in principle NP-complete and can therefore typically not be solved in a reasonable time. Therefore, Faes and colleagues (2012), Lizier and Rubinov (2012), and Stramaglia, Wu, Pellicoro, and Marinazzo (2012) suggested analyzing the information transfer in a network iteratively, selecting information sources for a target in each iteration based on either the magnitude of apparent information transfer (Faes et al., 2012) or its significance (Lizier & Rubinov, 2012; Stramaglia et al., 2012). In the next iteration, already selected information sources are added to the conditioning set (V in Eqn [28]), and the next search for information sources is started. The approach of Stramaglia and colleagues is particular here in that the conditional mutual information terms are computed at each level as a series expansion, following a suggestion by Bettencourt, Gintautas, and Ham (2008). This allows for an efficient computation as the series may truncate early, and the search can proceed to the next level. Importantly, these approaches also consider synergistic information transfer from more than one source variable to the target. For example, a variable transferring information purely synergistically with V may be included in the next iteration, given that the other variables it transfers information with are already in the conditioning set V. However, there is currently no explicit indication in the approaches of Faes and colleagues (2012) and Lizier and Rubinov (2012) as to whether multivariate information transfer from a set of sources to the target is in fact synergistic, and redundant links will not be included. In contrast, both redundant and synergistic multiplets of variables transferring information into a target may be identified in the approach of Stramaglia and colleagues (2012) by looking at the sign of the contribution of the multiplet. Unfortunately, there is also the possibility of cancellation if both types of multivariate information (redundant and synergistic) are present.

3. As already explained in Section 'Basic information theory,' nonstationary random processes in principle require that the necessary estimates of the probabilities in Eqn [21] be based on physical replications of the systems in question. Where this is impossible, the experimenter should design the experiment in such a way that the processes are repeated in time. If such cyclostationary data are available, then TE should be estimated using ensemble methods as described in Wollstadt et al. (2014) and implemented in the TRENTOOL toolbox (Lindner et al., 2011).
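As referenced in point 1 of the list above, here is a minimal sketch of a surrogate-based significance test: the source is circularly shifted by a random lag, which destroys its temporal relation to the target while preserving the source's own statistics, and the observed TE is compared with the resulting null distribution. This shift-based scheme is one simple variant chosen for illustration, not a reimplementation of the procedures cited above; names and defaults are ours.

import numpy as np

def te_surrogate_test(x, y, te_estimate, u=1, n_surrogates=200, min_shift=100, seed=0):
    # Permutation p-value for TE(X -> Y) against circularly shifted source surrogates.
    rng = np.random.default_rng(seed)
    te_obs = te_estimate(x, y, u)
    te_null = np.empty(n_surrogates)
    for i in range(n_surrogates):
        shift = int(rng.integers(min_shift, len(x) - min_shift))
        te_null[i] = te_estimate(np.roll(x, shift), y, u)
    # fraction of surrogates at least as large as the observed value
    p_value = (np.sum(te_null >= te_obs) + 1) / (n_surrogates + 1)
    return te_obs, p_value

# usage with one of the estimators sketched above:
# te_obs, p = te_surrogate_test(x, y, plugin_te, u=1)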

Local information transfer

As TE is formally just a conditional mutual information, we can obtain the corresponding local mutual information (Eqn [12]) from Eqn [24]. This quantity is called the local information transfer te (Lizier et al., 2008). For realizations x_t, y_t of two processes X, Y at time t, it reads

te(\mathbf{X}_{t-u} = \mathbf{x}_{t-u} \to Y_t = y_t) = \log \frac{p(y_t \mid \mathbf{y}_{t-1}^{d_y}, \mathbf{x}_{t-u}^{d_x})}{p(y_t \mid \mathbf{y}_{t-1}^{d_y})}   [29]
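For discrete data, the local values of Eqn [29] can be read off directly from plug-in conditional probabilities. The sketch below (names ours, target history of one sample) returns one te value per time step; individual values may be negative when the source is misinformative about the target at that moment.

import numpy as np
from collections import Counter

def local_te(x, y, u=1):
    # Local transfer entropy te(x_{t-u} -> y_t) per time step, Eqn [29] (plug-in estimate).
    triples = [(y[t], y[t - 1], x[t - u]) for t in range(max(u, 1), len(y))]
    c_full = Counter(triples)
    c_past = Counter((yt1, xtu) for _, yt1, xtu in triples)
    c_target = Counter((yt, yt1) for yt, yt1, _ in triples)
    c_hist = Counter(yt1 for _, yt1, _ in triples)
    te_local = [np.log2((c_full[(yt, yt1, xtu)] / c_past[(yt1, xtu)]) /
                        (c_target[(yt, yt1)] / c_hist[yt1]))
                for yt, yt1, xtu in triples]
    return np.array(te_local)   # its mean equals the plug-in (average) TE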

As said earlier in Section 'Basic information theory,' the use of local information measures does not eliminate the need for an appropriate estimation of the probability distributions involved. Hence, for a nonstationary process, these distributions will still have to be estimated via an ensemble approach for each time point for the random variables involved – for example, via physical replications of the system or via enforcing cyclostationarity by the design of the experiment. The analysis of local information transfer has been applied with great success in the study of cellular automata to prove the conjecture that certain coherent spatiotemporal structures traveling through the network are indeed the main carriers of information transfer (Lizier et al., 2008). In addition, in the study of random Boolean networks, it was shown that the local information transfer and the local active information storage (see next section) are in an optimal balance near the critical point (Lizier et al., 2011b). These theoretical findings on computational properties at the critical point are of great relevance to neuroscience, as it has been conjectured that the brain may operate in a self-organized critical state (Beggs & Plenz, 2003), and recent evidence demonstrates that the human brain is at least very close to criticality, albeit slightly subcritical (Priesemann, Valderrama, Wibral, & Le Van Quyen, 2013; Priesemann et al., 2014).

Information Storage

Before we present explicit measures of information storage, a few comments may serve to avoid misunderstanding. Since this article is about the analysis of neural activity, measures of information storage are concerned with information stored in this activity – rather than in synaptic properties, for example.



As laid out in the preceding text, information storage is conceptualized here as a mutual information between past and future states of neural activity. From this, it is clear that there will not be much information storage if the information contained in the future states of neural activity is low in general. If, on the other hand, these future states are rich in information but bear no relation to past states, that is, are unpredictable, again information storage will be low. Hence, large information storage occurs for activity that is rich in information but, at the same time, predictable. Thus, information storage gives us a way to define the predictability of a process that is independent of the prediction error: Information storage quantifies how much future information of a process can be predicted from its past, whereas the prediction error measures how much information cannot be predicted. If both are quantified in information measures, the error and the predicted information add up to the total amount of information (technically: the entropy rate; also see Eqn [40]). Importantly, since the entropy rate can vary considerably, these two measures can lead to quite different views about the predictability of a process. As predictability (in the sense of the amount of predicted information) in turn plays a crucial role in current theories of brain function, such as predictive coding, we expect to see a rise in the application of this measure to neural data in the near future (Wibral et al., 2014; Gómez et al., 2014). Before turning to the explicit definition of measures of information storage, it is also worth considering different perspectives on the temporal extent of the 'past' and 'future' states that we are interested in: Most globally, excess entropy (Crutchfield & Packard, 1982; Grassberger, 1986) is the mutual information between the semi-infinite past and future of a process before and after time point t. In contrast, if we are interested in the information currently used for the next step of the process, then the mutual information between the semi-infinite past and the next step of the process – the active information storage – is of greater interest.

Excess entropy

Excess entropy is formally defined as

E_{X_t} = \lim_{k \to \infty} I(\mathbf{X}_t^{k-}; \mathbf{X}_t^{k+})   [30]

where \mathbf{X}_t^{k-} = \{X_t, X_{t-1}, \ldots, X_{t-k+1}\} and \mathbf{X}_t^{k+} = \{X_{t+1}, \ldots, X_{t+k}\} indicate collections of the past and future k variables of the process X. These collections of random variables, (\mathbf{X}_t^{k-}, \mathbf{X}_t^{k+}), in the limit k \to \infty span the semi-infinite past and future, respectively. In general, the mutual information in Eqn [30] has to be evaluated over multiple realizations of the process. For stationary processes, however, E_{X_t} is a constant, E_X, and Eqn [30] may be rewritten as an average over time points t and computed from a single realization of the process (at least in principle, as we have to take into account that the process must run for an infinite time to allow the limit k \to \infty for all t):

E_X = \lim_{k \to \infty} \left\langle i(\mathbf{x}_t^{k-}; \mathbf{x}_t^{k+}) \right\rangle_t   [31]

where i(\cdot;\cdot) is the local mutual information from Eqn [11], and \mathbf{x}_t^{k-}, \mathbf{x}_t^{k+} are realizations of \mathbf{X}_t^{k-}, \mathbf{X}_t^{k+}. Even if the process in question is nonstationary, we may look at values that are local in time as long as the probability distributions are derived appropriately (see Section 'Basic information theory'):

e_{X_t} = i(\mathbf{x}_t^{k-}; \mathbf{x}_t^{k+})   [32]

The limit of k \to \infty can be replaced by a finite k_{max} if a k_{max} exists such that conditioning on \mathbf{X}_t^{k_{max}-} renders \mathbf{X}_t^{k_{max}+} conditionally independent of any X_l with l \le t - k_{max}.

Local active information storage

From a perspective of the dynamics of information processing, we might not be interested in information that is used by a process at some time far in the future, but at the next point in time, that is, information that is said to be active or 'currently in use' for the computation of the next step (the realization of the next random variable) in the process (Lizier et al., 2012). To quantify this information, a different mutual information is computed, namely, the active information storage:

A_{X_t} = \lim_{k \to \infty} I(\mathbf{X}_{t-1}^{k-}; X_t)   [33]

Again, if the process in question is stationary, then A_{X_t} = const. = A_X, and the expected value can be obtained from an average over time – instead of an ensemble of realizations of the process – as

A_X = \lim_{k \to \infty} \left\langle i(\mathbf{x}_{t-1}^{k-}; x_t) \right\rangle_t   [34]

which can be read as an average over local active storage values a_{X_t}:

A_X = \left\langle a_{X_t} \right\rangle_t   [35]

a_{X_t} = \lim_{k \to \infty} i(\mathbf{x}_{t-1}^{k-}; x_t)   [36]

Even for nonstationary processes, we may investigate local active storage values, given that the corresponding probability distributions are properly obtained from an ensemble of realizations of X_t, \mathbf{X}_{t-1}^{k-}:

a_{X_t} = \lim_{k \to \infty} i(\mathbf{x}_{t-1}^{k-}; x_t)   [37]

Again, the limit of k \to \infty can be replaced by a finite k_{max} if a k_{max} exists such that conditioning on \mathbf{X}_{t-1}^{k_{max}-} renders X_t conditionally independent of any X_l with l \le t - k_{max}.
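As a simple illustration of Eqns [33]–[36], the following sketch estimates the active information storage of a discrete, stationary process with a plug-in estimator and a finite history length k_max, returning the local values a_Xt alongside their time average A_X. The function name and the toy Markov chain are ours and serve only to make the definitions concrete.

import numpy as np
from collections import Counter

def active_information_storage(x, kmax=1):
    # Plug-in estimate of A_X and the local values a_Xt (Eqns [33]-[36]) with finite history kmax.
    pairs = [(tuple(x[t - kmax:t]), x[t]) for t in range(kmax, len(x))]
    n = len(pairs)
    c_joint = Counter(pairs)                  # p(x_{t-1}^{k-}, x_t)
    c_past = Counter(past for past, _ in pairs)
    c_next = Counter(nxt for _, nxt in pairs)
    a_local = np.array([np.log2((c_joint[pair] / n) /
                                ((c_past[pair[0]] / n) * (c_next[pair[1]] / n)))
                        for pair in pairs])
    return a_local.mean(), a_local

# toy example: a binary process that repeats its last value with probability 0.8
rng = np.random.default_rng(3)
x = [0]
for _ in range(20000):
    x.append(x[-1] if rng.random() < 0.8 else 1 - x[-1])
A, a_local = active_information_storage(np.array(x), kmax=1)
print(A)   # close to 1 - H(0.2), i.e., about 0.28 bits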

Interpretation of information storage as a measure at the algorithmic level

As laid out in the preceding text, information storage is a measure of the amount of information in a process that is predictable from its past. As such, it quantifies, for example, how well activity in one brain area A can be predicted by another area, say, by learning its statistics. Hence, questions about information storage arise naturally when asking about the generation of predictions in the brain.

Information Modification

Langton (1990) described information modification as an interaction between transmitted information and stored information that results in a modification of one or the other. Attempts to define information modification more rigorously implemented this basic idea. First attempts at defining a quantitative measure of information modification resulted in a heuristic measure termed local separable information (Lizier et al., 2010), where


the local active information storage and the sum over all pairwise local transfer entropies into the target were taken:

s_{X_t} = a_{X_t} + \sum_{\mathbf{Z}_{t',i} \ne \mathbf{X}_{t-1}} i(x_t; \mathbf{z}_{t',i} \mid \mathbf{x}_{t-1})   [38]

with \mathbf{Z}_{t',i} indicating the past state variables of all processes that transfer information into the target variable X_t, and where the index t' is a reminder that only past state variables are taken into account, that is, t' < t. As shown previously in the text, the local measures entering the sum are negative if they are misinformative about the future of the target. Eventually, the overall sum, or separable information, might also be negative, indicating that neither the pairwise information transfers nor the history could explain the information contained in the target's future. This was seen as an indication of a modification of either stored information or transferred information. At the same time, the overall information in the future of the target can of course be explained by looking at the sources of information and the history of the target jointly, at least up to the genuinely stochastic part (innovation) in the target, as shown in Lizier et al. (2010). Even a decomposition of the overall information into storage, transfer terms, and the innovation is possible when considering the sources jointly (Lizier et al., 2010). To see why this is possible when considering the source variables and the history of the target jointly, but not when we look at pairwise local mutual information terms, it is instructive to consider a series of subsets formed from the (ordered) set of all variables \mathbf{Z}_{t',i} that can transfer information into the target, except variables from the target's own history. Following the derivation in Lizier et al. (2010) (with the exception of adding the variable index t to account for nonstationary processes), we denote this set as V_{X_t} \setminus \mathbf{X}_{t-1} = \{\mathbf{Z}_{t',1}, \ldots, \mathbf{Z}_{t',G}\}, where G is the number of variables in the set and X_t is the target. \mathbf{X}_{t-1}, the history of the target, is not part of the set. The bold typeface in \mathbf{Z}_{t',i} is a reminder that we work with a state space representation where necessary. Next, we create a series of subsets V_{X_t}^{g} \setminus \mathbf{X}_{t-1} such that V_{X_t}^{g} \setminus \mathbf{X}_{t-1} = \{\mathbf{Z}_{t',1}, \ldots, \mathbf{Z}_{t',g-1}\}, that is, the gth subset only contains the first g-1 sources. As the TE is a mutual information, we can decompose the collective TE from all our source variables, TE(V_{X_t} \setminus \mathbf{X}_{t-1} \to X_t), as a series of conditional mutual information terms, incrementally increasing the set that we condition on:

TE(V_{X_t} \setminus \mathbf{X}_{t-1} \to X_t) = \sum_{g=1}^{G} I(X_t; \mathbf{Z}_{t',g} \mid \mathbf{X}_{t-1}, V_{X_t}^{g} \setminus \mathbf{X}_{t-1})   [39]

The total entropy of the target H(X_t) can then be written as

H(X_t) = A_{X_t} + \sum_{g=1}^{G} I(X_t; \mathbf{Z}_{t',g} \mid \mathbf{X}_{t-1}, V_{X_t}^{g} \setminus \mathbf{X}_{t-1}) + W_{X_t}   [40]

where W_{X_t} is the genuine innovation in X_t. If we rewrite the decomposition in Eqn [40] in its local form,

h(x_t) = a_{X_t} + \sum_{g=1}^{G} i(x_t; \mathbf{z}_{t',g} \mid \mathbf{x}_{t-1}, v_{X_t}^{g} \setminus \mathbf{x}_{t-1}) + w_{X_t}   [41]

and compare it to Eqn [38], we see that the difference between the potentially misinformative sum in Eqn [38] and the fully accounted for information in h(x_t) from Eqn [41] lies in the conditioning of the local transfer entropies. This means that the context the source variables provide for each other is neglected in Eqn [38], and synergies and redundancies are not properly accounted for. Importantly, the results of Eqns [38] and [41] are only identical if no information is provided either redundantly or synergistically by the sources \mathbf{Z}_{t',g}. This observation led Lizier and colleagues to propose a more rigorously defined measure of information modification based on the synergistic part of the information transfer from the source variables \mathbf{Z}_{t',g} and the target's history \mathbf{X}_{t-1} to the target X_t (Lizier et al., 2013). This definition of information modification has several highly desirable properties. However, it relies on a suitable definition of synergy, which is currently not available. As there is currently a considerable debate on how to define the part of the mutual information I(Y; \{X_1, \ldots, X_i, \ldots\}) that is synergistically provided about the target Y by joint source variables X_i, the question of how to best measure information modification remains an open one.
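For completeness, the heuristic local separable information of Eqn [38] can be computed directly from local storage and transfer estimates such as those sketched earlier in this article; the sketch below assumes that the local value arrays have already been aligned to the same time indices (names ours).

import numpy as np

def local_separable_information(a_local, te_locals):
    # s_Xt = a_Xt + sum of the pairwise local TEs into the target (Eqn [38]).
    # a_local: local active storage of the target; te_locals: list of local TE arrays, one per source.
    s = np.asarray(a_local) + np.sum(np.asarray(te_locals), axis=0)
    return s   # negative entries flag candidate information-modification events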

Conclusion and Outlook

Neural systems undoubtedly perform acts of information processing in the form of distributed (biological) computation. This computation can be decomposed into its component processes of information storage, information transfer, and information modification using information theoretical tools. This allows us to derive constraints on possible algorithms served by the observed neural dynamics. The representations that these algorithms operate on, on the other hand, can be guessed by analyzing the relation between meaningful quantities in our experiments or the outside world and indices of neural activity. This is done via an analysis of mutual information between these indices of neural activity and these meaningful quantities. In this article, we have shown how to rigorously define the necessary information theoretical quantities and how to apply them to neuroscientific questions. Due to the dynamic development of the field, not all questions have found their definitive answers yet in terms of information theory. In particular, the question of how to measure nontrivial information modification remains open at the time of writing but is bound to see further progress in the future.

See also: INTRODUCTION TO ACQUISITION METHODS: Basic Principles of Electroencephalography; Basic Principles of Magnetoencephalography; Functional Near-Infrared Spectroscopy; INTRODUCTION TO ANATOMY AND PHYSIOLOGY: Functional Connectivity; INTRODUCTION TO COGNITIVE NEUROSCIENCE: Uncertainty; INTRODUCTION TO METHODS AND MODELING: Dynamic Causal Models for Human Electrophysiology: EEG, MEG, and LFPs; Effective Connectivity; Multi-voxel Pattern Analysis; Resting-State Functional Connectivity; INTRODUCTION TO SYSTEMS: Motion Perception; Neural Correlates of Motor Deficits in Young Patients with Traumatic Brain Injury.


References

Amblard, P.-O., & Michel, O. J. (2011). On directed information theory and Granger causality graphs. Journal of Computational Neuroscience, 30, 7–16.
Ay, N., & Polani, D. (2008). Information flows in causal networks. Advances in Complex Systems, 11, 17–41.



Barnett, L., Barrett, A. B., & Seth, A. K. (2009). Granger causality and transfer entropy are equivalent for Gaussian variables. Physical Reviews Letters, 103, 238701. Battaglia, D., Witt, A., Wolf, F., & Geisel, T. (2012). Dynamic effective connectivity of inter-areal brain circuits. PLoS Computational Biology, 8, e1002438. Beggs, J. M., & Plenz, D. (2003). Neuronal avalanches in neocortical circuits. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, 23, 11167–11177. Bertschinger, N., Rauh, J., Olbrich, E., & Jost, J. (2013). Shared information-new insights and problems in decomposing information in complex systems. In Proceedings of the European Conference on Complex Systems 2012 (pp. 251–269). Springer International Publishing. Bertschinger, N., Rauh, J., Olbrich, E., Jost, J., & Ay, N. (2014). Quantifying unique information. Entropy, 16, 2161–2183. Besserve, M., Scho¨lkopf, B., Logothetis, N. K., & Panzeri, S. (2010). Causal relationships between frequency bands of extracellular signals in visual cortex revealed by an information theoretic analysis. Journal of Computational Neuroscience, 29, 547–566. Bettencourt, L. M., Gintautas, V., & Ham, M. I. (2008). Identification of functional information subgraphs in complex networks. Physical Reviews Letters, 100, 238701. Boedecker, J., Obst, O., Lizier, J. T., Mayer, N. M., & Asada, M. (2012). Information processing in echo state networks at the edge of chaos. Theory in Biosciences, 131, 205–213. Brenner, N., Strong, S. P., Koberle, R., Bialek, W., & Van Steveninck, R. R. D. R. (2000). Synergy in a neural code. Neural Computation, 12, 1531–1552. Buehlmann, A., & Deco, G. (2010). Optimal information transfer in the cortex through synchronization. PLoS Computational Biology, 6, e1000934. Butts, D. A. (2003). How much information is associated with a particular stimulus? Network: Computation in Neural Systems, 14, 177–187. Butts, D. A., & Goldman, M. S. (2006). Tuning curves, neuronal variability, and sensory coding. PLoS Biology, 4, e92. Carandini, M. (2012). From circuits to behavior: A bridge too far? Nature Neuroscience, 15, 507–509. Cha´vez, M., Martinerie, J., & Le Van Quyen, M. (2003). Statistical assessment of nonlinear causality: Application to epileptic EEG signals. Journal of Neuroscience Methods, 124, 113–128. Chicharro, D., & Andrzejak, R. G. (2009). Reliable detection of directional couplings using rank statistics. Physical Review E, 80, 026217. Crutchfield, J. P., Ditto, W. L., & Sinha, S. (2010). Introduction to focus issue: intrinsic and designed computation: Information processing in dynamical systems-beyond the digital hegemony. Chaos, 20, 037101-1–037101-6. Crutchfield, J. P., & Packard, N. H. (1982). Symbolic dynamics of one-dimensional maps: Entropies, finite precision, and noise. International Journal of Theoretical Physics, 21, 433–466. Dewdney, A. K. (1993). The Tinkertoy computer and other machinations. New York: W.H. Freeman. DeWeese, M. R., & Meister, M. (1999). How to measure the information gained from one symbol. Network: Computation in Neural Systems, 10, 325–340. Effenberger, F. (2013). A primer on information theory, with applications to neuroscience. In Computational biomedicine: Data mining and modeling (135–192). Berlin: Springer-Verlag. Faes, L., & Nollo, G. (2006). Bivariate nonlinear prediction to quantify the strength of complex dynamical interactions in short-term cardiovascular variability. Medical and Biological Engineering and Computing, 44, 383–392. 
Faes, L., Nollo, G., & Porta, A. (2011). Information-based detection of nonlinear Granger causality in multivariate processes via a nonuniform embedding technique. Physical Review E, 83, 051112. Faes, L., Nollo, G., & Porta, A. (2012). Non-uniform multivariate embedding to assess the information transfer in cardiovascular and cardiorespiratory variability series. Computers in Biology and Medicine, 42, 290–297. Fano, R. M. (1961). Transmission of Information: A statistical theory of communications. Cambridge, MA: Massachusetts Institute of Technology Press. Gardner, W. A. (1994). An introduction to cyclostationary signals. Cyclostationarity in Communications and Signal Processing, 1–90. Gardner, W. A., Napolitano, A., & Paura, L. (2006). Cyclostationarity: Half a century of research. Signal Processing, 86, 639–697. Garofalo, M., Nieus, T., Massobrio, P., & Martinoia, S. (2009). Evaluation of the performance of information theory-based methods and cross-correlation to estimate the functional connectivity in cortical networks. PloS One, 4, e6482. Gawne, T. J., & Richmond, B. J. (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? Journal of Neuroscience, 13, 2758–2771. Go´mez, C., Lizier, J. T., Schaum, M., Wollstadt, P., Gru¨tzner, C., Uhlhaas, P., et al. (2014). Reduced predictable information in brain signals in autism spectrum disorder. Frontiers in neuroinformatics, 8.

Goure´vitch, B., & Eggermont, J. J. (2007). Evaluating information transfer between auditory cortical neurons. Journal of Neurophysiology, 97, 2533–2543. Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. International Journal of Theoretical Physics, 25, 907–938. Griffith, V., & Koch, C. (2014). Quantifying synergistic mutual information. In Guided Self-Organization: Inception (pp. 159–190). Berlin Heidelberg: Springer. Han, T. S. (1978). Nonnegative entropy measures of multivariate symmetric correlations. Information and Control, 36, 133–156. Harder, M., Salge, C., & Polani, D. (2013). Bivariate measure of redundant information. Physical Review E, 87, 012130. Havenith, M. N., Yu, S., Biederlack, J., Chen, N.-H., Singer, W., & Nikolic´, D. (2011). Synchrony makes neurons fire in sequence, and stimulus properties determine who is ahead. Journal of Neuroscience, 31, 8570–8584. Ito, S., Hansen, M. E., Heiland, R., Lumsdaine, A., Litke, A. M., & Beggs, J. M. (2011). Extending transfer entropy improves identification of effective connectivity in a spiking cortical network model. PloS One, 6, e27431. Johansson, R. S., & Birznieks, I. (2004). First spikes in ensembles of human tactile afferents code complex spatial fingertip events. Nature Neuroscience, 7, 170–177. Knill, D. C., & Pouget, A. (2004). The Bayesian brain: The role of uncertainty in neural coding and computation. Trends in Neurosciences, 27, 712–719. Kozachenko, L. F., & Leonenko, N. N. (1987). Sample estimate of the entropy of a random vector. Problems of Information Transmission, 23, 95–101. Kraskov, A., Sto¨gbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69, 066138. Langton, C. G. (1990). Computation at the edge of chaos: Phase transitions and emergent computation. Physica D: Nonlinear Phenomena, 42, 12–37. Li, X., & Ouyang, G. (2010). Estimating coupling direction between neuronal populations with permutation conditional mutual information. NeuroImage, 52, 497–507. Lindner, M., Vicente, R., Priesemann, V., & Wibral, M. (2011). TRENTOOL: A Matlab open source toolbox to analyse information flow in time series data with transfer entropy. BMC Neuroscience, 12, 119. Lizier, J., & Rubinov, M. (2012). Multivariate construction of effective computational networks from observational data. Max Planck Institute for Mathematics in the Sciences Preprint 25/2012. Lizier, J. T. (2013). The local information dynamics of distributed computation in complex systems. Heidelberg/New York/Dordecht/London: Springer. Lizier, J.T., Flecker, B., & Williams, P.L. (2013). Towards a synergy-based approach to measuring information modification. Artificial Life (ALIFE), IEEE Symposium on artificial life. IEEE. Lizier, J. T., Heinzle, J., Horstmann, A., Haynes, J.-D., & Prokopenko, M. (2011). Multivariate information-theoretic measures reveal directed information structure and task relevant changes in fMRI connectivity. Journal of Computational Neuroscience, 30, 85–107. Lizier, J. T., Pritam, S., & Prokopenko, M. (2011). Information dynamics in small-world Boolean networks. Artificial Life, 17, 293–314. Lizier, J. T., & Prokopenko, M. (2010). Differentiating information transfer and causal effect. European Physical Journal B, 73, 605–615. Lizier, J. T., Prokopenko, M., & Zomaya, A. Y. (2008). Local information transfer as a spatiotemporal filter for complex systems. Physical Review E, 77, 026110. Lizier, J. T., Prokopenko, M., & Zomaya, A. Y. (2010). 
Information modification and particle collisions in distributed computation. Chaos: An Interdisciplinary Journal of Nonlinear Science, 20, 037109. Lizier, J. T., Prokopenko, M., & Zomaya, A. Y. (2012). Local measures of information storage in complex distributed computation. Information Sciences, 208, 39–54. Lu¨dtke, N., Logothetis, N. K., & Panzeri, S. (2010). Testing methodologies for the nonlinear analysis of causal relationships in neurovascular coupling. Magnetic Resonance Imaging, 28, 1113–1119. Markram, H. (2006). The blue brain project. Nature Reviews Neuroscience, 7, 153–160. Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco, CA: W.H. Freeman. McGill, W. J. (1954). Multivariate information transmission. Psychometrika, 19, 97–116. Mitchell, M. (2011). Ubiquity symposium: Biological computation. Ubiquity, 2011, 3. Neymotin, S. A., Jacobs, K. M., Fenton, A. A., & Lytton, W. W. (2011). Synaptic information transfer in computer models of neocortical columns. Journal of Computational Neuroscience, 30, 69–84. O’Keefe, J., & Recce, M. L. (1993). Phase relationship between hippocampal place units and the EEG theta rhythm. Hippocampus, 3, 317–330. Palusˇ, M., Koma´rek, V., Hrncˇ´ıˇr, Z., & Sˇteˇrbova´, K. (2001). Synchronization as adjustment of information rates: Detection from bivariate time series. Physical Review E, 63, 046211.


Panzeri, S., Senatore, R., Montemurro, M. A., & Petersen, R. S. (2007). Correcting for the sampling bias problem in spike train information measures. Journal of Neurophysiology, 98, 1064–1072. Priesemann, V., Munk, M. H.J, & Wibral, M. (2009). Subsampling effects in neuronal avalanche distributions recorded in vivo. BMC Neuroscience, 10, 40. Priesemann, V., Valderrama, M., Wibral, M., & Le Van Quyen, M. (2013). Neuronal avalanches differ from wakefulness to deep sleep – Evidence from intracranial depth recordings in humans. PLoS Computational Biology, 9, e1002985. Priesemann, V., Wibral, M., Valderrama, M., Propper, Le Van Quyen, M., Geisel, T., et al. (2014). Spike avalanches in vivo suggest a driven, slightly subcritical brain state. Frontiers in Systems Neuroscience, 8, 108. Ragwitz, M., & Kantz, H. (2002). Markov models from data by simple nonlinear time series predictors in delay embedding spaces. Physical Review E, 65, 056201. Schreiber, T. (2000). Measuring information transfer. Physical Reviews Letters, 85, 461. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423. Small, M., & Tse, C. K. (2004). Optimal embedding parameters: A modelling paradigm. Physica D: Nonlinear Phenomena, 194, 283–296. Smirnov, D. A. (2013). Spurious causalities with transfer entropy. Physical Review E, 87, 042917. Staniek, M., & Lehnertz, K. (2009). Symbolic transfer entropy: Inferring directionality in biosignals. Journal of Biomedical Engineering and Technology, 54, 323–328. Stetter, O., Battaglia, D., Soriano, J., & Geisel, T. (2012). Model-free reconstruction of excitatory neuronal connectivity from calcium imaging signals. PLoS Computational Biology, 8, e1002653. Stramaglia, S., Wu, G.-R., Pellicoro, M., & Marinazzo, D. (2012). Expanding the transfer entropy to identify information circuits in complex systems. Physical Review E, 86, 066211. Takens, F. (1981). Detecting strange attractors in turbulence. In Dynamical systems and turbulence, Warwick 1980 (pp. 366–381). New York: Springer. Vakorin, V. A., Kovacevic, N., & McIntosh, A. R. (2010). Exploring transient transfer entropy based on a group-wise ICA decomposition of EEG data. NeuroImage, 49, 1593–1600.


Vakorin, V. A., Krakovska, O. A., & McIntosh, A. R. (2009). Confounding effects of indirect connections on causality estimation. Journal of Neuroscience Methods, 184, 152–160. Vakorin, V. A., Misˇic´, B., Krakovska, O., & McIntosh, A. R. (2011). Empirical and theoretical aspects of generation and transfer of information in a neuromagnetic source network. Frontiers in Systems Neuroscience, 5, 96. Vicente, R., Wibral, M., Lindner, M., & Pipa, G. (2011). Transfer entropy – A model-free measure of effective connectivity for the neurosciences. Journal of Computational Neuroscience, 30, 45–67. Victor, J. D. (2002). Binless strategies for estimation of information from neural data. Physical Review E, 66, 051903. Wang, X. R., Miller, J. M., Lizier, J. T., Prokopenko, M., & Rossi, L. F. (2012). Quantifying and tracing information cascades in swarms. PloS One, 7, e40084. Wibral, M., Lizier, J. T., Vo¨gler, S., Priesemann, V., & Galuske, R. (2014). Local active information storage as a tool to understand distributed neural information processing. Frontiers in neuroinformatics, 8. Wibral, M., Pampu, N., Priesemann, V., Siebenhuehner, F., Seiwert, H., Lindner, M., et al. (2013). Measuring information-transfer delays. PloS One, 8, e55809. Wibral, M., Rahm, B., Rieder, M., Lindner, M., Vicente, R., & Kaiser, J. (2011). Transfer entropy in magnetoencephalographic data: Quantifying information flow in cortical and cerebellar networks. Progress in Biophysics and Molecular Biology, 105, 80–97. Wibral, M., Turi, G., Linden, D. E., Kaiser, J., & Bledowski, C. (2008). Decomposition of working memory-related scalp ERPs: Crossvalidation of fMRI-constrained source analysis and ICA. International Journal of Psychophysiology, 67, 200–211. Wiener, N. (1956). The theory of prediction. In E. F. Beckenback (Ed.), Modern mathematics for engineers (pp. 165–190). New York: McGraw-Hill. Williams, P.L., & Beer, R.D. (2010). Nonnegative decomposition of multivariate information. ArXiv e-print No. 1004.2515. Wollstadt, P., Martı´nez-Zarzuela, M., Vicente, R., Dı´az-Pernas, F. J., & Wibral, M. (2014). Efficient Transfer Entropy Analysis of Non-Stationary Neural Time Series. PLoS ONE, 9(7), e102833. http://dx.doi.org/10.1371/journal. pone.0102833.