CHAPTER 5

Surprise: A Shortcut for Attention?

Pierre Baldi
ABSTRACT

Attention plays an essential role in the survival of an organism by enabling the real-time redirection or allocation of computing resources toward "regions" of the input field that are subjectively relevant. A particular class of relevant inputs are those that are unexpected or surprising. We propose an observer-dependent computational definition of surprise by considering the relative entropy between the prior and the posterior distributions of the observer. Surprise requires integration over the space of models, in contrast with Shannon's entropy, which requires integration over the space of data. We show how surprise can be computed exactly in a number of discrete and continuous cases using distributions from the exponential family with conjugate priors. Surprise can be defined at multiple scales, from single synapses, to neurons, to networks, to areas and systems. Surprise provides a shortcut for relevance and a general, principled way of computing centralized saliency maps in any feature space to control the deployment of attention or other information retrieval mechanisms toward the most surprising items, that is, those carrying the largest amount of information with respect to internal expectations. Because attention is a rapid mechanism likely to be driven by bottom-up cues, it could benefit from the simplicity of the surprise calculation to detect mismatches between bottom-up inputs and expectations generated locally or in top-down fashion.

I. INTRODUCTION

Attention plays a fundamental role in the survival of an organism. In particular, attention enables the real-time redirection or allocation of computing resources toward particular "regions" of the input field that are relevant for the organism. A fundamental question, of course, is the algorithmic definition of what is relevant for a given organism in a given situation and at a given time. A particular class of inputs that tend to be relevant are those that are highly unusual or "surprising." Thus, here we address the problem of providing a precise algorithmic definition of surprise and speculate on its possible relationships to attentional mechanisms. A definition of novelty or surprise must first of all be related to the foundations of the notion of probability, which can be approached from a frequentist or subjectivist, also called bayesian, point of view (Berger, 1985; Box and Tiao, 1992). Here we follow the bayesian approach, which has been prominent in recent years and has led to important developments in many fields (Gelman et al., 1995). The definition we propose stems directly from the bayesian foundation of probability theory and the relation given by Bayes' theorem between the prior and posterior probabilities of an observer. The amount of surprise in the data for a given observer can be measured by the change that has taken place in going from the prior to the posterior probabilities.

A. Information and Surprise

To define surprise, we consider an "observer" with a corresponding class of hypotheses or models M. In the subjectivist framework, degrees of belief or confidence are associated with hypotheses or models. It can be shown that, under a small set of reasonable axioms, these degrees of belief can be represented by real numbers and that, when rescaled to the [0, 1] interval, these degrees of confidence must obey the rules of probability and, in particular, Bayes' theorem (Cox, 1946; Jaynes, 1986; Savage, 1972). Within the bayesian framework, prior to seeing any data, the observer has a prior distribution P(M) over the space M of models. Here the word observer is taken in a very general sense
Neurobiology of Attention
Copyright 2005, Elsevier, Inc. All rights reserved.
and could, for instance, refer to a single neuron or a neuronal circuit. The prior P(M) could be hardwired or result from experience, possibly including recently observed data points. It could be generated locally and/or in top-down fashion. If an observer has a model M for the data, associated with a prior probability P(M), the arrival of a data set D leads to a reevaluation of the probability in terms of the posterior distribution:

$$P(M \mid D) = \frac{P(D \mid M)\, P(M)}{P(D)}. \qquad (1)$$
The effect of the information contained in D is clearly to change the belief of the observer from P(M) to P(M|D). Thus, surprise can be defined as the "distance" between the prior and posterior distributions: highly surprising data should correspond to a substantial reevaluation of the prior and, therefore, to larger distances. A way of measuring the information carried by the data D that is complementary to Shannon's definition of entropy,

$$I(D, M) = H(P(D \mid M)) = -\int_{D} P(D \mid M) \log P(D \mid M)\, dD, \qquad (2)$$
(Shannon, 1948; Cover and Thomas, 1991) is to measure the distance between the prior and the posterior. To distinguish it from Shannon's communication information, we call this notion of information the surprise information, or surprise (Baldi, 2002):

$$S(D, M) = d[P(M), P(M \mid D)], \qquad (3)$$
where d is a distance or similarity measure. There are different ways of measuring a distance between probability distributions. In what follows, for standard, well-known theoretical reasons (including invariance with respect to reparameterizations), we use the relative entropy or Kullback–Leibler divergence K (Kullback, 1968), which is not symmetric and hence not a distance. This lack of symmetry, however, does not matter in most cases and in principle can easily be fixed by symmetrizing the divergence. The surprise then is

$$S(D, M) = K(P(M), P(M \mid D)) = \int_{M} P(M) \log \frac{P(M)}{P(M \mid D)}\, dM = -H(P(M)) - \int_{M} P(M) \log P(M \mid D)\, dM = \log P(D) - \int_{M} P(M) \log P(D \mid M)\, dM. \qquad (4)$$

Alternatively, we can define the single-model surprise by the log odds ratio

$$S(D, M) = \log \frac{P(M)}{P(M \mid D)}, \qquad (5)$$

and the surprise by its average

$$S(D, M) = \int_{M} S(D, M)\, P(M)\, dM, \qquad (6)$$

taken with respect to the prior distribution over the model class. In statistical mechanics terminology, the surprise can also be viewed as the free energy of the negative log posterior at temperature t = 1, with respect to the prior distribution over the space of models (Baldi and Brunak, 2001). Note that the definition of surprise addresses the "white snow" paradox of information theory, according to which white snow, the most boring of all television programs, carries the largest amount of Shannon information. At the time of snow onset, the image distribution we expect and the image we perceive are very different, and therefore the snow carries a great deal of both surprise and Shannon information. Indeed, snow may be a sign of storm, earthquake, toddler's curiosity, or military putsch. But after a few seconds, once our model of the image shifts toward a snow model of random pixels, television snow perfectly fits the prior and hence becomes boring. Because the prior and the posterior are virtually identical, snow frames carry zero surprise although megabytes of Shannon information.

II. COMPUTATION OF SURPRISE

To be useful, the notion of surprise ought to be computable analytically, at least in simple cases. Here we consider a data set D = {x1, ..., xN} containing N points. Surprise can be calculated exactly in a number of interesting cases. For simplicity, although this does not correspond to any restriction of the general theory, we consider here only the case of conjugate priors, where the prior and the posterior have the same functional form. In this case, to compute the surprise defined by Eq. (4), we need only compute general terms of the form

$$F(P_1, P_2) = \int P_1 \log P_2\, dx, \qquad (7)$$

where P1 and P2 have the same functional form. The surprise is then given by

$$S = F(P_1, P_1) - F(P_1, P_2), \qquad (8)$$
where P1 is the prior and P2 is the posterior. Note also that in this case the symmetric divergence can be easily computed using F(P1, P1) - F(P1, P2) + F(P2, P2) - F(P2, P1). It should also be clear that in simple cases, for instance, for certain members of the exponential family of distributions (Brown, 1986), the posterior depends entirely on the sufficient statistics, and therefore we can expect surprise also to depend only on the sufficient statistics in these cases. Additional examples and mathematical details can be found in Baldi (2002).

SECTION I. FOUNDATIONS
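Before specializing to conjugate families, the definition can be exercised numerically. The sketch below is illustrative code, not from the chapter; the coin-flip model class, the grid, and the function name are my own choices. It discretizes a one-parameter model space and computes the surprise of Eq. (4) as the relative entropy between the prior and the posterior:

```python
import math

def surprise(prior, grid, n, N):
    """KL(prior || posterior) over a grid of models, as in Eq. (4).
    Models are coin biases p; the data are n heads in N flips."""
    # Unnormalized posterior: prior times the likelihood p^n (1 - p)^(N - n).
    post = [pr * (p ** n) * ((1 - p) ** (N - n)) for pr, p in zip(prior, grid)]
    z = sum(post)                          # normalizing constant P(D)
    post = [q / z for q in post]
    return sum(pr * math.log(pr / q) for pr, q in zip(prior, post))

# Uniform prior over 99 interior grid points (avoiding p = 0 and p = 1).
grid = [k / 100 for k in range(1, 100)]
prior = [1 / len(grid)] * len(grid)

# Data far from the prior's expectation is the more surprising of the two.
print(surprise(prior, grid, n=10, N=20))   # balanced data: smaller surprise
print(surprise(prior, grid, n=19, N=20))   # lopsided data: larger surprise
```

Under a uniform prior the two data sets differ only through their marginal likelihood P(D), so the lopsided sample, which concentrates the posterior far from the bulk of the prior, yields the larger surprise.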
A. Discrete Data and Dirichlet Model

Consider the case where xi is binary. The simplest class of models for D is then M(p), the first-order Markov models, with a single parameter p representing the probability of success, or of emitting a 1. The conjugate prior on p is the Dirichlet prior (or beta distribution in the two-dimensional case),

$$D_1(a_1, b_1) = \frac{\Gamma(a_1 + b_1)}{\Gamma(a_1)\Gamma(b_1)}\, x^{a_1 - 1} (1 - x)^{b_1 - 1} = C_1\, x^{a_1 - 1} (1 - x)^{b_1 - 1}, \qquad (9)$$

with a1 ≥ 0, b1 ≥ 0, and a1 + b1 > 0. The expectation is (a1/(a1 + b1), b1/(a1 + b1)). With n successes in the sequence D, the posterior is a Dirichlet distribution D2(a2, b2) with (Baldi and Brunak, 2001)

$$a_2 = a_1 + n \quad \text{and} \quad b_2 = b_1 + (N - n). \qquad (10)$$

The surprise can be computed exactly as

$$S(D, M) = K(D_1, D_2) = \log \frac{C_1}{C_2} + n\bigl[\Psi(a_1 + b_1) - \Psi(a_1)\bigr] + (N - n)\bigl[\Psi(a_1 + b_1) - \Psi(b_1)\bigr], \qquad (11)$$

where Ψ is the derivative of the logarithm of the gamma function. When N → ∞ and n = pN with 0 < p < 1, we have

$$S(D, M) \approx N\, K(p, a_1), \qquad (12)$$

where K(p, a1) represents the Kullback–Leibler divergence between the empirical distribution (p, 1 - p) and the expectation of the prior (a1/(a1 + b1), b1/(a1 + b1)). Thus, asymptotically, surprise grows linearly with the number of data points, with a proportionality coefficient that depends on the discrepancy between the expectation of the prior and the observed distribution. The same relationship can be expected to hold in the case of a multinomial model with an arbitrary number of classes. In the case of a symmetric prior (a1 = b1), a slightly more precise approximation is provided by

$$S(D_1, D_2) \approx N\left[\sum_{k = a_1}^{2a_1 - 1} \frac{1}{k} - H(p)\right]. \qquad (13)$$

For instance, when a1 = 1, then S(D1, D2) ≈ N[1 - H(p)], and when a1 = 5, then S(D1, D2) ≈ N[0.746 - H(p)].

B. Continuous Data: Unknown Mean/Known Variance

When the xi are real, we can consider first the case of unknown mean with known variance. We have a family M(μ) of models, with a gaussian prior G1(μ1, σ1²). If the data have known variance σ², then the posterior distribution is gaussian G2(μ2, σ2²), with parameters given by (Gelman et al., 1995)

$$\mu_2 = \frac{\mu_1/\sigma_1^2 + Nm/\sigma^2}{1/\sigma_1^2 + N/\sigma^2} \quad \text{and} \quad \frac{1}{\sigma_2^2} = \frac{1}{\sigma_1^2} + \frac{N}{\sigma^2}, \qquad (14)$$

where m is the observed mean. In the general case,

$$S(D, M) = K(G_1, G_2) = \log \frac{\sigma}{\sqrt{\sigma^2 + N\sigma_1^2}} + \frac{N\sigma_1^2}{2\sigma^2} + \frac{N^2 \sigma_1^2 (\mu_1 - m)^2}{2\sigma^2(\sigma^2 + N\sigma_1^2)} \approx \frac{N}{2\sigma^2}\bigl[\sigma_1^2 + (\mu_1 - m)^2\bigr], \qquad (15)$$

the approximation being valid for large N. In the special case where the prior has the same variance as the data, σ1 = σ, the formula simplifies a little and yields

$$S = K(G_1, G_2) = \frac{N}{2} - \frac{1}{2}\log(N + 1) + \frac{N^2 (\mu_1 - m)^2}{2(N + 1)\sigma^2} \approx \frac{N}{2\sigma^2}\bigl[\sigma^2 + (\mu_1 - m)^2\bigr] \qquad (16)$$

when N is large. In any case, surprise grows linearly with N, with a coefficient that is the sum of the prior variance and the squared difference between the expected mean and the empirical mean, scaled by the variance of the data. The cases with known mean and unknown variance, or unknown mean and unknown variance, are treated in Baldi (2002).
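The closed form of Eq. (11) can be checked against a brute-force evaluation of the relative entropy. The sketch below is illustrative (the function names are mine); it assumes integer hyperparameters, so the digamma differences in Eq. (11) reduce to harmonic sums:

```python
import math

def log_beta_const(a, b):
    """log C = log Gamma(a + b) - log Gamma(a) - log Gamma(b)."""
    return math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

def dirichlet_surprise(a1, b1, n, N):
    """Exact surprise S(D, M) = K(D1, D2) of Eq. (11), integer a1, b1."""
    # For integers, Psi(a1 + b1) - Psi(a1) = sum of 1/k for k = a1 .. a1+b1-1.
    psi_a = sum(1.0 / k for k in range(a1, a1 + b1))
    psi_b = sum(1.0 / k for k in range(b1, a1 + b1))
    a2, b2 = a1 + n, b1 + (N - n)          # posterior parameters, Eq. (10)
    return (log_beta_const(a1, b1) - log_beta_const(a2, b2)
            + n * psi_a + (N - n) * psi_b)

def numeric_kl(a1, b1, a2, b2, steps=100000):
    """Riemann-sum estimate of KL(D1 || D2) over (0, 1) for comparison."""
    c1, c2 = log_beta_const(a1, b1), log_beta_const(a2, b2)
    total = 0.0
    for i in range(1, steps):
        x = i / steps
        logp = c1 + (a1 - 1) * math.log(x) + (b1 - 1) * math.log(1 - x)
        logq = c2 + (a2 - 1) * math.log(x) + (b2 - 1) * math.log(1 - x)
        total += math.exp(logp) * (logp - logq)
    return total / steps

a1, b1, n, N = 3, 3, 17, 20
exact = dirichlet_surprise(a1, b1, n, N)
approx = numeric_kl(a1, b1, a1 + n, b1 + (N - n))
print(exact, approx)   # the closed form and the numerical KL should agree
```

Because the posterior depends on the data only through the sufficient statistics (n, N), so does the surprise, as noted above.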
III. HABITUATION AND SURPRISE

There is an immediate connection between surprise and learning, or rather habituation. If we imagine that data points from a training set are presented sequentially, we can consider that the posterior distribution after the Nth point becomes the prior for the next iteration (sequential bayesian learning). In this case we can expect a form of habituation whereby, on average, surprise decreases after each iteration. As a system
learns what is relevant in a data set, new data points become less and less surprising. This can be quantified precisely, at least in simple cases. Consider, for example, a sequence of 0–1 examples D = (dN). The learner starts with a Dirichlet prior D0(a0, b0). With each example dN+1, the learner updates its Dirichlet prior DN(aN, bN) into a Dirichlet posterior DN+1(aN+1, bN+1), with (aN+1, bN+1) = (aN + 1, bN) if dN+1 = 1, and (aN+1, bN+1) = (aN, bN + 1) otherwise. When dN+1 = 1, the corresponding surprise is easily computed using the theory developed above. For simplicity, and without much loss of generality, let us assume that a0 and b0 are integers, so that aN and bN are also integers for any N. Then, if dN+1 = 1, the relative surprise is

$$S(D_N, D_{N+1}) = \log \frac{a_N}{a_N + b_N} + \sum_{k=0}^{b_N - 1} \frac{1}{a_N + k}, \qquad (17)$$
and similarly in the case dN+1 = 0, by interchanging the roles of aN and bN. Thus, in this case,

$$0 \le S(D_N, D_{N+1}) \le \frac{1}{a_N} + \log\left(1 - \frac{1}{a_N + b_N}\right). \qquad (18)$$
Asymptotically we have aN ≈ a0 + pN, and therefore

$$0 \le S(D_N, D_{N+1}) \le \frac{1 - p}{pN}. \qquad (19)$$
Thus, surprise decreases in time with the number of examples as 1/N. A similar result can be obtained in the continuous case (Baldi, 2002).
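The 1/N habituation of Eqs. (17)–(19) is easy to simulate. A minimal sketch follows (illustrative code; the source probability p = 0.7, the seed, and the function name are arbitrary choices, not from the chapter):

```python
import math
import random

def step_surprise(a, b, outcome):
    """Per-example surprise of Eq. (17), for integer Dirichlet parameters.
    For outcome 0, the roles of a and b are simply interchanged."""
    if outcome == 0:
        a, b = b, a
    return math.log(a / (a + b)) + sum(1.0 / (a + k) for k in range(b))

random.seed(0)
a, b, p = 1, 1, 0.7                       # start from a flat Beta(1, 1) prior
for N in [10, 100, 1000]:
    while a + b - 2 < N:                  # sequential updating: each posterior
        d = 1 if random.random() < p else 0   # becomes the next prior
        a, b = (a + 1, b) if d == 1 else (a, b + 1)
    print(N, step_surprise(a, b, 1))      # decays roughly like 1/N
```

As the counts grow, each additional example barely moves the prior, so the per-example surprise shrinks toward zero.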
IV. DISCUSSION: A SHORTCUT FOR ATTENTION

While eminently successful for the transmission of data, Shannon's theory of information does not address semantic and subjective dimensions of data, such as relevance and surprise. We have proposed an observer-dependent computational theory of surprise in which surprise is defined by the relative entropy between the prior and the posterior distributions of an observer. As such, it is a measure of dissimilarity between the prior and posterior distributions that lies close to the axiomatic foundation of bayesian probability. Surprise is different from Shannon's information. While Shannon's definition fixes the model and varies the data, surprise fixes the data and varies the model. Surprise requires integration over the space of models, in contrast with Shannon's entropy, which requires integration over the space of data. In a number of cases, surprise can be computed analytically, in terms of both exact and asymptotic, computationally efficient formulas, in the discrete and continuous cases. During sequential bayesian learning, habituation corresponds to a 1/N decay of surprise. In general, however, the computation of surprise can be expected to require Monte Carlo methods to approximate integrals over model spaces. In this respect, the computation of surprise should benefit from progress in Markov chain and other Monte Carlo methods, as well as progress in computing power. Granted that the mathematical foundation of surprise is solid, we can speculate that surprise may be used to guide rapid attention mechanisms. With limited computing resources available relative to the volume of data, rapid identification and ranking of unusual items that require further processing to establish semantic relevance become important. Surprise provides a shortcut to relevance and a general, principled way of computing centralized saliency maps in any feature space to control the deployment of attention or other information retrieval mechanisms toward the most surprising items, that is, those carrying the largest amount of information with respect to internal expectations. Because attention is a rapid mechanism likely to be driven by bottom-up cues, it could benefit from the simplicity of the surprise calculation to detect mismatches between bottom-up inputs and expectations generated locally or in top-down fashion. The relationship of surprise to attention then boils down to two questions: (1) Is surprise being used by biological attention mechanisms? (2) Can surprise be used as a neuroengineering principle for the design of artificial attention systems? Testing the first hypothesis in detail would require showing that probability distributions, such as priors and posteriors, are part of the biological computations carried out by natural attention systems.
In general, this is beyond the current state of the art in neuroscience, and it is unlikely that clear answers can be provided in the very near future, but indirect evidence from psychophysical experiments may be possible. The second question is somewhat more promising (as is often the case in science, direct problems tend to be easier than reverse engineering problems) and hinges on simulating or implementing artificial attention algorithms and circuits. A positive feature of surprise as a general neural engineering design principle is the fact that it can be applied uniformly at multiple temporal and spatial scales, in any feature space or sensory modality, and that it can be aggregated. In principle, we can talk of the surprise of a synapse, a neuron, a neuronal circuit, an area, a system, or even an entire organism, and, for instance, in visual, auditory, or olfactory spaces. Consider, for
instance, an angle detector in the [0°, 360°] range that is tuned for angles around 90°, that is, with a circular hill-shaped response curve centered at 90°. We may as well consider that this detector is computing surprise with a prior that is hill-shaped and centered at -90°. Finally, one should not forget that Shannon's entropy, surprise, and relevance are three different facets of information that can be present in different combinations. If, while surfing the Web in search of a car, one stumbles on an old picture of Brigitte Bardot, the picture may carry a low degree of relevance, a high degree of surprise, and a small to large amount of Shannon information, depending on the pixel structure. Although the notion of relevance remains the least understood, despite several attempts (Jumarie, 1990; Tishby et al., 1999), over short time scales surprise may provide an adequate approximation to relevance. Indeed, over short time scales, the overlap between relevant and surprising inputs may be large, and it may also be safer for the organism not to ignore surprising inputs during early processing stages, even if, on second inspection, some of these surprising inputs turn out to be irrelevant. Thus, in guiding attention, surprise may be used as a computational shortcut to relevance, but not as a perfect substitute. In this view, relevance often must await confirmation by additional and slower processes that extend beyond the realm of rapid attention.
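As a toy illustration of a surprise-based saliency map (a hypothetical sketch, not an implementation from the chapter; the feature model, the five locations, and the detection counts are invented), each location can keep a local Beta model of a binary feature, and the per-location surprise of the recent input then ranks candidate targets for attention:

```python
import math

def beta_surprise(a1, b1, n, N):
    """Surprise of n feature detections in N frames under a Beta(a1, b1)
    prior with integer parameters (same closed form as Eq. (11))."""
    lc = lambda a, b: math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    psi_a = sum(1.0 / k for k in range(a1, a1 + b1))
    psi_b = sum(1.0 / k for k in range(b1, a1 + b1))
    return lc(a1, b1) - lc(a1 + n, b1 + N - n) + n * psi_a + (N - n) * psi_b

# Five locations, each expecting the feature in roughly 10% of frames.
priors = [(1, 9)] * 5
observed = [1, 0, 0, 10, 1]              # detections over the last 10 frames
saliency = [beta_surprise(a, b, n, 10) for (a, b), n in zip(priors, observed)]
target = max(range(len(saliency)), key=lambda i: saliency[i])
print(target)   # location 3, whose input most violates its expectations
```

Because the same computation applies to any feature whose statistics can be tracked by a conjugate model, maps of this kind can in principle be built per feature channel and aggregated across channels and scales.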
Acknowledgments

The work of P.B. is supported by a Laurel Wilkening Faculty Innovation Award and grants from the National Institutes of Health, National Science Foundation, and Sun Microsystems at the University of California, Irvine.
References

Baldi, P. (2002). A computational theory of surprise. In "Information, Coding, and Mathematics" (M. Blaum, P. G. Farrell, and H. C. A. van Tilborg, Eds.). Kluwer Academic, Boston.
Baldi, P., and Brunak, S. (2001). "Bioinformatics: The Machine Learning Approach," 2nd ed. MIT Press, Cambridge, MA.
Berger, J. O. (1985). "Statistical Decision Theory and Bayesian Analysis." Springer-Verlag, New York.
Box, G. E. P., and Tiao, G. C. (1992). "Bayesian Inference in Statistical Analysis." Wiley, New York.
Brown, L. D. (1986). "Fundamentals of Statistical Exponential Families." Institute of Mathematical Statistics, Hayward, CA.
Cover, T. M., and Thomas, J. A. (1991). "Elements of Information Theory." Wiley, New York.
Cox, R. T. (1946). Probability, frequency and reasonable expectation. Am. J. Phys. 14, 1–13.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995). "Bayesian Data Analysis." Chapman & Hall, London.
Jaynes, E. T. (1986). Bayesian methods: general background. In "Maximum Entropy and Bayesian Methods in Statistics" (J. H. Justice, Ed.), pp. 1–25. Cambridge Univ. Press, Cambridge.
Jumarie, G. (1990). "Relative Information." Springer-Verlag, New York.
Kullback, S. (1968). "Information Theory and Statistics." Dover, New York.
Savage, L. J. (1972). "The Foundations of Statistics." Dover, New York.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423, 623–656.
Tishby, N., Pereira, F., and Bialek, W. (1999). The information bottleneck method. In "Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing" (B. Hajek and R. S. Sreenivas, Eds.), pp. 368–377. University of Illinois.