Axiomatic approach to computational attention


Pattern Recognition 43 (2010) 1618–1630


Contents lists available at ScienceDirect

Pattern Recognition journal homepage: www.elsevier.com/locate/pr

J.A. García, Rosa Rodriguez-Sánchez, J. Fdez-Valdivia
Departamento de Ciencias de la Computación e I. A., CITIC-UGR, Universidad de Granada, 18071 Granada, Spain

Article info

Article history: Received 5 December 2008; received in revised form 24 September 2009; accepted 26 September 2009.

Keywords: Computational attention; Axiomatic models; Visual target distinctness; Military

Abstract

Here we describe, in terms of a decision problem, any situation in which a computational system is forced to allocate attention at any time to one spatial location in order to improve the reconstruction fidelity on a neighborhood of the chosen point. The result is a rational model of computational attention in which a multi-bitrate attention map provides the attention score for each spatial location at high and low quality versions of the image reconstruction. At any time a rational system should choose, even without any outside knowledge, among alternative spatial locations in such a way as to avoid certain forms of behavioral inconsistency. We compare the performance of a rational approach to computational attention with that of various models for predicting visual target distinctness, using scenes that represent military vehicles in complex rural backgrounds. © 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Surprisingly, the human visual system (HVS) appears to employ a serial computational strategy to select locations of interest when processing massive amounts of incoming visual information (around $10^8$ bits per second in the human optic nerve) with a nearly real-time capacity of reaction [1–3]. Thus the detection and analysis of visual objects seem to involve either covert shifts of attention or saccadic eye movements, and image analysis and scene understanding may be performed by biological visual systems through a temporal serialization into smaller, localized analysis tasks [4,5]. While a bottom-up, primitive mechanism biases the human observer towards selecting stimuli based on their saliency, a top-down mechanism with variable selection criteria directs the spotlight of attention under cognitive, volitional control [6,3].

In the literature, several computational models have been proposed which are close to the local processing of biological reality within the HVS [2,5,7]. Nevertheless, recent results in visual attention [8] have confirmed a global integration of feature information over the whole visual field, made possible by the impressive neuronal network [9]. Thus, following the view that attention may be due to global properties, a number of computational models were developed [9–14]. The problem is that different computational attention models were tuned for some kinds of images and often react very badly to other images [15], so it would be very difficult to use a single attention model in all applications. Several authors have developed a combined approach to the development and validation of cognitive models for human–computer systems [16,17]. When used in applications where human error in the allocation of attention is likely to occur, these systems are able to detect over- and under-allocation of attention and react in a way that helps human subjects.

This paper deals with a computational approach to the rational characteristics of visual attention. The overall objective of developing a rational approach to attention does not purport to describe the ways in which the HVS actually behaves in making choices among possible locations of interest for allocating attention. Instead we are interested in the aspects of rationality that seem to be present in the decision making of the HVS: at any time a rational system should choose among candidate spatial locations so as to avoid certain forms of inconsistency.

In this paper, a set of axioms is proposed simply to prescribe constraints that seem to us imperative to acknowledge in the problem of allocating attention (Section 2). We also show that, with several plausible assumptions, we may restrict the form of utilities for consequences achieved by the allocation of attention (Section 3). Several experiments are performed to investigate the relationship between the computational attention model and the visual target distinctness measured by human observers (Section 4). The main conclusions of the paper are summarized in Section 5.

Corresponding author. E-mail addresses: [email protected] (J.A. García), [email protected] (R. Rodriguez-Sánchez), [email protected] (J. Fdez-Valdivia).
0031-3203/$ - see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2009.09.027

2. Basic axioms for avoiding forms of behavioral inconsistency while allocating attention

Here we describe, in terms of a decision problem, any situation in which the attention model is forced to allocate attention at each time to one spatial location in order to improve the reconstruction fidelity on a neighborhood of the chosen point. At any time $t$, the structure of this decision problem is determined by three basic elements:

1. A set $\{P_i;\ i \in I\}$ of candidate locations of interest, one of which is to be selected for improving the reconstruction fidelity on a neighborhood by allocating attention to this point; with $I$ being the set of location indexes $i$.
2. For each available spatial location $P_i$, a set $\{G_{l,i};\ l \in L\}$ describing the gray-level occurrences in the neighborhood of $P_i$ whose reconstruction fidelity will be improved by directing attention to this point at time $t$; where $G_{l,i}$ denotes the uncertain event that a pixel $X$ within a digitized disk of fixed radius centered at $P_i$ takes gray-level value $l$; and $L$ denotes the gray-level set.
3. Corresponding to each set $\{G_{l,i};\ l \in L\}$, a set of consequences $\{c_{l,i};\ l \in L\}$ that results from the improvement in visual quality achieved by allocating attention to $P_i$.

The idea is as follows. Suppose the computational system chooses a spatial location $P_i$ for allocating attention at time $t$. The choice of $P_i$ produces an increment in visual quality in a neighborhood of $P_i$ and one set of gray-level occurrences $\{G_{l,i};\ l \in L\}$. Finally, the set $\{G_{l,i};\ l \in L\}$ induces, at this time, a particular set of consequences for the application at hand (for example, early detection of small, hidden military vehicles in some complex rural background).

But we also need to consider a fourth basic element in the attention problem, once it is reformulated as a decision problem: the order $\preceq$, which expresses the preferences between pairs of available spatial locations at a particular time, so that $P_i \preceq P_j$ signifies that the attention score of $P_i$ is not greater, at this time $t$, than that of $P_j$. Thus, a formal definition of computational attention as a decision problem is given by:

Definition 1 (Allocating attention as a decision problem). The decision problem of allocating attention at time $t$ is defined by four elements $\{P, G, C, \preceq\}$, where:
(i) $P = \{P_i;\ i \in I\}$ is the set of candidate spatial locations $P_i$ upon which to allocate attention;
(ii) $G = \{G_{l,i};\ l \in L\}$ is the class of any possible set of gray-level occurrences in the neighborhood of $P_i$ whose reconstruction fidelity will be improved by allocating attention to this point at time $t$; where the neighborhood of $P_i$ is a digitized disk of fixed radius $r$ centered at $P_i$;
(iii) $C = \{c_{l,i};\ l \in L\}$ is the class of any set of possible consequences for the application at hand, associated with a gray-level occurrence set $G$ that results from the improvement in visual quality achieved by allocating attention to $P_i$;
(iv) $\preceq$ is an attention score order between spatial locations, with $P_i \preceq P_j$ meaning that the attention score of $P_i$ is not greater than that of $P_j$ at this time $t$.

A different approach to computational attention can be to first state some general principles that the solution of the problem must obey, and then derive the solution that satisfies exactly these principles. Following this approach, here we impose three coherence axioms, two quantization axioms, and two additional axioms that restrict the form of utilities for consequences.

The first postulate states the essence of what is required for an orderly and systematic approach to comparing spatial locations: (a) if all consequences were equivalent, there would not be a decision problem; and (b) if the system aspires to make a rational choice between alternative locations in order to focus attention, then it must at least be willing to express preferences between them.

Postulate 1 (Essence while allocating attention).
(i) Not all the consequences in $C$ are equivalent; and
(ii) The attention model is able to compare attention scores for any pair of spatial locations at time $t$.

The second axiom is intended to impose rules of coherence on attention score orderings that exclude two types of inconsistency: first, that in order to allocate attention the system prefers one location over another identical location of interest; second, that the system is willing to suffer the certain loss of something of value, which happens if $P_i \preceq P_j$ and $P_j \preceq P_k$ and yet the attention score of $P_i$ is greater than that of $P_k$.

Postulate 2 (Basic rules of coherence on attention score orderings).
(i) The attention score of $P_i$ is not greater than that of another identical location of interest; and
(ii) If $P_i \preceq P_j$ and $P_j \preceq P_k$, then $P_i \preceq P_k$.

The order $\preceq$ may also provide a qualitative basis for comparing, by extension, consequences and gray-level occurrence events. The third axiom ensures the consistency of any kind of preferences (e.g., between consequences or gray-level occurrence events).

Postulate 3 (Consistency of any kind of preferences).
(i) Preferences between consequences at a given bitrate of reconstruction fidelity should not be affected by the gray-level occurrences at higher visual quality;
(ii) If a gray-level occurrence set $\{G_{l,i};\ l \in L\}$ is more likely to relate to better consequences (for the application at hand) than another gray-level occurrence set $\{G_{l,j};\ l \in L\}$, then the attention score of $P_i$ should be greater than that of $P_j$;
(iii) If the attention score of $P_i$ is greater than that of $P_j$ under the occurrence of event $G$, then the comparison of scores for $P_i$ and $P_j$ (which are identical if a different event occurs) depends entirely on consideration of what happens if $G$ occurs.
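As an informal illustration of the coherence requirements in Postulates 1(ii) and 2, the following minimal Python sketch checks whether a candidate preference relation over a finite set of locations is complete, reflexive, and transitive. The function name `is_coherent`, the predicate `not_greater`, and the toy locations are hypothetical and are not part of the model described here.

```python
from itertools import product


def is_coherent(locations, not_greater):
    """Check a preference relation for the coherence rules sketched above.

    `not_greater(a, b)` is a hypothetical predicate meaning "the attention
    score of a is not greater than that of b".
    """
    # Postulate 1(ii): any pair of locations can be compared.
    complete = all(not_greater(a, b) or not_greater(b, a)
                   for a, b in product(locations, repeat=2))
    # Postulate 2(i): a location is never scored above an identical location.
    reflexive = all(not_greater(a, a) for a in locations)
    # Postulate 2(ii): transitivity of the order.
    transitive = all(not_greater(a, c)
                     for a, b, c in product(locations, repeat=3)
                     if not_greater(a, b) and not_greater(b, c))
    return complete and reflexive and transitive


# An order induced by numeric scores (toy values) is coherent.
scores = {"P1": 0.2, "P2": 0.7, "P3": 0.7}
coherent = is_coherent(list(scores), lambda a, b: scores[a] <= scores[b])

# A cyclic preference (P1 below P2, P2 below P3, P3 below P1) is not.
cycle = {("P1", "P2"), ("P2", "P3"), ("P3", "P1")}
cyclic = is_coherent(["P1", "P2", "P3"],
                     lambda a, b: a == b or (a, b) in cycle)
```

The cyclic relation fails only the transitivity test, which corresponds to the "certain loss of something of value" that Postulate 2 is meant to exclude.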

Postulates 1–3 then provide a minimal set of rules to ensure that qualitative comparisons based on the preference $\preceq$ cannot have intuitively undesirable implications. But we also need to introduce some form of quantification by setting up a standard unit of measurement that enables the attention model to assign a score to any given available location. In short, precision through quantification is achieved by introducing some form of numerical standard into the system already equipped with a coherent qualitative ordering relation (Postulates 1–3).

We regard it as essential to be able to aspire to some kind of quantitative precision in the context of comparing attention scores. It is therefore necessary that we have available some form of standard locations of interest. This notion of quantization is given by means of two additional axioms (Postulates 4 and 5).

Postulate 4 (Standard unit of measurement). In the attention model, there exists some form of standard location of interest, which plays a role analogous to the standard unit of measurement.

Postulate 5 (Precision through quantification). The standard family of locations of interest provides a continuous scale against which any consequence or event can be precisely compared.

Next, Propositions 1(a) and 1(b) determine how numerical measures can be assigned to two of the elements of the attention problem, in the form of probabilities for gray-level occurrence events and utilities for consequences.

Proposition 1 (Probabilities for gray-level occurrences and utilities for consequences). A computational attention model that aspires to analyze the decision problem $\{P, G, C, \preceq\}$ at any time $t$ in accordance with Postulates 1–5 should verify that:
(a) Degrees of belief about gray-level occurrence sets $\{G_{l,i};\ l \in L\}$ are represented in the form of finite probability distributions $R_i \equiv \{p(G_{l,i} \mid P_i, t);\ l \in L\}$, with $p(G_{l,i} \mid P_i, t)$ denoting the probability of gray level $l$ in the neighborhood of location $P_i$ that would result from the improvement in reconstruction fidelity achieved by allocating attention to $P_i$ at time $t$;
(b) Numerical values attached to the consequences $\{c_{l,i};\ l \in L\}$ foreseen if there exists a particular degree of reconstruction fidelity given by the gray-level occurrence set $\{G_{l,i};\ l \in L\}$ are represented in the form of a utility function.

Proof. See Appendix A.

In the following section we complete the specification of this decision problem through the introduction of a particular form of utility function $u(R_i, G_{l,i})$ for all pairs $(R_i, G_{l,i})$ of probability distributions and actual gray-level occurrences.

3. Utility functions for attention score measurement

In the absence of a priori knowledge about locations of interest, experience with allocating attention indicates that, for any spatial location, preferences are for reconstructions of a neighborhood of this location that lead to the highest level of fidelity perceptible within this image portion, thereby emphasizing its particular features, independently of the level of fidelity at any other point. This indicates that value independence between reconstructions for different spatial locations is a plausible assumption when there is no outside knowledge about locations of interest. Even if this property does not hold exactly, it leads to simplifications that make it a useful working hypothesis.

Postulate 6 (Independence between reconstructions for different spatial locations). Let $P_i$ and $P_j$ be any pair of possible locations of interest at time $t$. Preferences for consequences at this time $t$ involving the two locations $P_i$ and $P_j$ depend only on the probability distributions for the respective gray-level occurrence sets $\{G_{l,i};\ l \in L\}$ and $\{G_{l,j};\ l \in L\}$, and not on their joint probability distribution.

We next invoke another specific preference pattern that may arise in the process of allocating attention at any processing time. Suppose that, if the attention model is willing to sacrifice a fraction of reconstruction fidelity corresponding to spatial location $P_i$ for an improvement in quality corresponding to $P_j$, then it would also be willing to sacrifice the same fraction of reconstruction fidelity corresponding to $P_i$ regardless of the actual degree of reconstruction fidelity for $R_i$. Again, this may be a plausible hypothesis, but only if there is no a priori knowledge for the attention problem at hand.

Postulate 7 (Basic rule while sacrificing a fraction of reconstruction fidelity). The portion of visual quality for a neighborhood of $P_i$ that the attention system is willing to give up at processing time $t$ to achieve an improvement in reconstruction fidelity for a neighborhood of $P_j$ does not depend on the absolute amount of visual quality for a neighborhood of $P_i$ involved.

We now demonstrate that these two assumptions restrict the form of the utility function for consequences that result from directing attention to $P_i$:

Proposition 2 (Utility function for the allocation of attention). If Postulates 6 and 7 both hold, then a well-behaved utility function $u$ for consequences must have one of the following three functional forms ("well-behaved" means local, twice differentiable, and that $\lim_{p \to 0} p\,u''(p)/u'(p)$ exists):

$$
u(R_i, G_{l,i}) =
\begin{cases}
-(p_{l,i})^{r} & \text{if } r < 0 \\
\log p_{l,i} & \text{if } r = 0 \\
(p_{l,i})^{r} & \text{if } r > 0
\end{cases}
$$

with $R_i \equiv \{p_{l,i} \mid l \in L\}$, and $p_{l,i} = p(G_{l,i} \mid P_i, t)$ being the probability of gray level $l$ in the neighborhood of location $P_i$ that would result from the improvement in visual quality achieved by allocating attention to $P_i$ at time $t$. If $r > 1$, the resultant attentional model exhibits a risk-seeking posture with respect to "gambles" on location-dependent attention, whereas $r < 1$ implies risk-averse behavior regarding gambles on location-dependent attention.

Proof. See Appendix B.

As a result of Propositions 1 and 2 we can now provide, on the basis of the expected utility, an optimal solution to the problem of allocating attention at time $t$. To this end, we first need to derive alternative forms of the expected increase in utility provided by the allocation of attention to spatial location $P_i$ at time $t$; second, we simply select for allocating attention at time $t$ the spatial location $P_i$ associated with the maximum expected increase in utility over the candidate locations of interest.

Proposition 3 (Expected increase in utility provided by the allocation of attention). Let $q_{l,i} = p(G_{l,i} \mid t-1)$ be the probability of gray level $l$ in the neighborhood of location $P_i$ using the level of reconstruction fidelity given at time $t-1$ (i.e., before time $t$). Let $p_{l,i} = p(G_{l,i} \mid P_i, t)$ be the probability of gray level $l$ in the neighborhood of location $P_i$ that would result from the improvement in visual quality achieved by allocating attention to $P_i$ at time $t$. In a rational attention model for which Postulates 6 and 7 hold, the possible functional forms of the expected increase in utility provided by the allocation of attention to spatial location $P_i$ at time $t$, when the initial probability distribution $Q \equiv \{q_{l,i};\ l \in L\}$ is strictly positive, are as follows:

$$
I[P_i; t/Q] =
\begin{cases}
\sum_{l} p_{l,i}\left[(q_{l,i})^{r} - (p_{l,i})^{r}\right] & \text{if } r < 0 \\
\sum_{l} p_{l,i} \log \dfrac{p_{l,i}}{q_{l,i}} & \text{if } r = 0 \\
\sum_{l} p_{l,i}\left[(p_{l,i})^{r} - (q_{l,i})^{r}\right] & \text{if } r > 0
\end{cases}
\tag{1}
$$

where if $r > 1$, the system exhibits a risk-seeking posture with respect to gambles on location-dependent attention, while $r < 1$ implies risk aversion. Risk neutrality is given by $r = 1$.

Proof. See Appendix C.

From Proposition 3, we have that for an image reconstruction fidelity given by bitrate $r$ the attention score of location $P_i$ will be the total sum of utilities for consequences, $I[P_i; t/Q]$, that were provided at times $t$ when attention was directed to $P_i$, up to the given bitrate $r$. Fig. 1 illustrates two different plots of $I[P_i; t/Q]$ for varying $r$.
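The expected increase in utility of Proposition 3, and the rule of attending to the location that maximizes it, can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function names `expected_increase` and `select_location`, the toy distributions, and the convention of skipping terms with $p_{l,i} = 0$ when $r \le 0$ are assumptions made for the example and are not taken from the paper.

```python
import math


def expected_increase(p, q, r):
    """Expected increase in utility I[P_i; t/Q] for risk attitude r.

    p: gray-level probabilities after attending to the location (time t)
    q: gray-level probabilities before (time t-1), strictly positive
    """
    if any(ql <= 0 for ql in q):
        raise ValueError("the initial distribution q must be strictly positive")
    if r < 0:
        # Terms with p_l = 0 are skipped (an assumed convention for this sketch).
        return sum(pl * (ql ** r - pl ** r) for pl, ql in zip(p, q) if pl > 0)
    if r == 0:
        # Logarithmic case; 0 * log(0) is taken as 0.
        return sum(pl * math.log(pl / ql) for pl, ql in zip(p, q) if pl > 0)
    return sum(pl * (pl ** r - ql ** r) for pl, ql in zip(p, q))


def select_location(candidates, r):
    """Pick the candidate with the maximum expected increase in utility.

    candidates: dict mapping a location label to a pair (p, q).
    """
    return max(candidates, key=lambda loc: expected_increase(*candidates[loc], r))


# Toy example: attending to P1 sharpens the distribution, P2 changes nothing.
q = [0.25, 0.25, 0.25, 0.25]
candidates = {"P1": ([0.7, 0.1, 0.1, 0.1], q), "P2": (q, q)}
best = select_location(candidates, r=-0.6)
unchanged = expected_increase(q, q, r=-0.6)
```

In this toy case the location whose distribution is unchanged contributes no expected increase, so the sharpened location is selected.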

Fig. 1. (Left) 3D plot of the expected increase in utility $I[P_i; t/Q]$ provided by the allocation of attention to spatial location $P_i$ at time $t$, when the initial probability distribution $Q$ is strictly positive, for risk attitudes $r$ (with respect to gambles on location-dependent attention) in the range $[-0.6, 0.6]$; (Right) 2D plots of $I[P_i; t/Q]$ provided by the allocation of attention to spatial location $P_i$ for varying time $t$, with risk attitude $r$ set to $-0.2$, $-0.1$, $0$, $0.1$, and $0.2$.

Hence we can now give a formal definition of the global attention map for a given image, following the rational approach to the measurement of attention score:

Definition 2 (Multi-bitrate attention map). The multi-bitrate attention map that measures the attention score, following Postulates 1–7, at any spatial location $P_i$ and any bitrate $r$ of image reconstruction fidelity, is

$$
\{A(P_i, r)\}_{P_i, r} \tag{2}
$$

where

$$
A(P_i, r) = \sum_{t_{P_i}} I[P_i; t/Q] \tag{3}
$$

with the sum over times $t_{P_i}$, up to the given bitrate $r$, such that attention is directed to $P_i$ at time $t_{P_i}$; and $I[P_i; t/Q]$ is given by Eq. (1).

The multi-bitrate attention map $\{A(P_i, r)\}_{P_i, r}$ provides a computational attention score for each spatial location $P_i$ at high and low quality versions of the image reconstruction. The novelty of this map is that: (i) it allows distinct attention scores for the same spatial location at different picture qualities, which may be relevant, for example, in Internet advertising applications; (ii) it avoids certain forms of behavioral inconsistency in the absence of a priori knowledge about the locations of interest, which is a characteristic of rational systems; and (iii) no particular integration of feature information (e.g., color, intensity, orientation) is used in assigning attention scores to the points, so computational attention is not tuned for only certain images.

Figs. 2 and 3 illustrate the multi-bitrate attention map for two different scenes that represent a military vehicle in a complex rural background (see Fig. 4). The first column, (A)–(C), shows the attention scores for the reconstructions given in the second column, (D)–(F). A higher intensity in (A)–(C) means a higher attention score for the respective reconstructions in (D)–(F). To obtain image reconstructions at different bitrates of visual quality, the computational attention model follows Refs. [18,19]. Here we use a very efficient implementation of the rational model of computational attention, whose interest lies not only in cost reduction but also in the real-time analysis of new images. Score maps and image reconstructions were blended into each other to form the images illustrated in the third column, (G)–(I).

Fig. 2(J)–(L) show that the most salient locations are within the target area for the three different quality reconstructions in Fig. 2(D)–(F). But the situation is more complex for the target scene illustrated in Fig. 3. As can be seen from Fig. 3(K), the target is the most salient area for the image reconstruction given in Fig. 3(E), while there are other areas within the complex rural background (see Fig. 3(J) and (L)) with greater saliency than the target for the lower and higher quality versions of the reconstruction (Fig. 3(D) and (F)).
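The accumulation in Eq. (3) can be sketched as follows, assuming that the allocation process is available as a log of steps, each recording the bitrate reached, the attended location, and the expected increase in utility obtained at that step. The log format, the function name `attention_map`, and the toy values are hypothetical and serve only to illustrate the summation.

```python
from collections import defaultdict


def attention_map(events, bitrates):
    """Accumulate attention scores A(P_i, r) for several bitrates.

    events: iterable of (step_bitrate, location, increase) tuples, where
            `increase` stands for I[P_i; t/Q] at that allocation step.
    bitrates: bitrates r at which the map is evaluated.
    Returns {bitrate: {location: score}}.
    """
    amap = {}
    for rate in bitrates:
        totals = defaultdict(float)
        for step_rate, location, increase in events:
            # Only allocation steps made up to the given bitrate contribute.
            if step_rate <= rate:
                totals[location] += increase
        amap[rate] = dict(totals)
    return amap


# Toy log: location (3, 4) is attended twice, location (7, 1) once.
events = [(0.1, (3, 4), 0.5), (0.2, (7, 1), 0.3), (0.4, (3, 4), 0.25)]
amap = attention_map(events, bitrates=[0.2, 0.5])
```

The same location can therefore receive different scores at different bitrates, which is the property highlighted in item (i) above.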

4. Experimental results

In this section, we study the relationship between computational attention and the visual target distinctness measured by human observers. Rohaly et al. [20] proved that if computational models of early human vision give good predictors of target saliency for humans performing visual search and detection tasks, they may be used to compute the visual distinctness of image subregions (target areas) from digital imagery.

To this end we compute the multi-bitrate attention map $\{A(P_i, r)\}_{P_i, r}$ for each target scene in a database that was presented in [21,22] (see Figs. 4–7). The images used in the experiment are target sections from slides made during the DISSTAF (distributed interactive simulation, search and target acquisition fidelity) field test, which was designed and organized by NVESD (Night Vision & Electro-optic Sensors Directorate, Ft. Belvoir, VA, USA) and held at Fort Hunter Liggett, California, USA [21].

Next we calculate the lowest bitrate, $r^{*}$, of reconstruction fidelity for which the attention score of some point in the target area is in the upper quartile of the attention map $\{A(P_i, r^{*})\}_{P_i}$ at bitrate $r^{*}$. A small value of $r^{*}$ means that the computational model brings attention onto the target at a low bitrate of picture quality, which corresponds to a high saliency of the target area.

Also, a psychophysical experiment is performed in which human observers estimate the visual distinctness of targets in the same database. The procedure of the psychophysical experiment is described in [21,22]. The subjective ranking induced by the psychophysical target distinctness is adopted as the reference rank order. An evaluation function may then be used to study the efficacy of the computational attention model.
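A minimal sketch of the search for the lowest bitrate at which some target point reaches the upper quartile of the attention map is given below. It assumes the map is stored as a dictionary from bitrate to per-location scores, and it uses a simple rank-based reading of "upper quartile" (fewer than a quarter of all scores are strictly higher); both the data layout and this convention are assumptions made for illustration, as are the names `in_upper_quartile` and `lowest_target_bitrate`.

```python
def in_upper_quartile(score, all_scores):
    """True if fewer than a quarter of the scores are strictly higher."""
    higher = sum(1 for s in all_scores if s > score)
    return higher < len(all_scores) / 4


def lowest_target_bitrate(amap, target_area):
    """Lowest bitrate at which some target location is in the upper quartile.

    amap: {bitrate: {location: attention score}} (hypothetical layout)
    target_area: set of locations belonging to the target
    Returns None if the target never reaches the upper quartile.
    """
    for rate in sorted(amap):
        scores = amap[rate]
        values = list(scores.values())
        if any(in_upper_quartile(scores[loc], values)
               for loc in target_area if loc in scores):
            return rate
    return None


# Toy map: the target location "t" becomes the top-scoring point at 0.3.
toy_map = {
    0.1: {"a": 0.9, "b": 0.5, "c": 0.2, "t": 0.1},
    0.3: {"a": 0.9, "b": 0.5, "c": 0.2, "t": 0.95},
}
found = lowest_target_bitrate(toy_map, {"t"})
missing = lowest_target_bitrate(toy_map, {"z"})
```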
To avoid the perils of inferring too much from correlations, the evaluation function $P_{CC}$ is defined as the fraction of correctly classified targets (using the computational model) with respect to the reference rank order (target distinctness measured by human observers):

$$
P_{CC} = \frac{\text{Number of correctly classified targets}}{\text{Number of targets}}.
$$
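One possible reading of this evaluation function can be sketched in Python. The sketch assumes that a target counts as correctly classified when its predicted rank position falls among the positions occupied by its own cluster in the reference order, so that permutations inside a cluster of comparable distinctness are not penalized. This counting rule, the function name `pcc`, and the toy clusters are assumptions for illustration and do not reproduce any data from the experiments.

```python
def pcc(reference_clusters, predicted_order):
    """Fraction of targets whose predicted rank falls in their own cluster.

    reference_clusters: list of sets of target ids, in reference rank order.
    predicted_order: list of target ids ranked by the model under test.
    """
    # Each rank position is a "slot" owned by one reference cluster.
    slots = [cluster for cluster in reference_clusters for _ in cluster]
    correct = sum(1 for target, cluster in zip(predicted_order, slots)
                  if target in cluster)
    return correct / len(predicted_order)


# Toy example with three clusters; "e" and "d" are swapped across clusters.
clusters = [{"a", "b"}, {"c", "d"}, {"e"}]
score = pcc(clusters, ["b", "a", "c", "e", "d"])
```

Here the swap of "a" and "b" inside the first cluster is not counted as an error, while the swap of "e" and "d" across clusters produces two misclassified targets.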


Fig. 2. (A)–(C) Attention scores for three different quality versions—given in (D)–(F)—of a highly visible target; (G)–(I) blending of the respective score maps and image reconstructions; (J)–(L) most salient locations.

In the following we analyze the comparative results of computational attention and various quantitative measures for predicting visual target distinctness. The quantitative measures include the signal-to-noise ratio ($\mathrm{SNR}_{\log}$), the root mean square error (RMSE), and the mean absolute error (MAE). They quantify the target distinctness by means of the difference between the signal from the target-and-background scene and the signal from the background-with-no-target scene. We also compare the performance of the rational model with that of a well-known model of computational attention, Itti's attention model [5].
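For reference, RMSE and MAE between a target-and-background image and the corresponding background-only image can be sketched with their standard pixel-wise definitions. The representation of images as nested lists of gray levels and the function names are assumptions for this example; $\mathrm{SNR}_{\log}$ is omitted because its exact definition is not given in this text.

```python
import math


def _differences(target_scene, background_scene):
    """Pixel-wise differences between two equally sized gray-level images."""
    return [t - b
            for t_row, b_row in zip(target_scene, background_scene)
            for t, b in zip(t_row, b_row)]


def mae(target_scene, background_scene):
    """Mean absolute error between the two scenes."""
    diffs = _differences(target_scene, background_scene)
    return sum(abs(d) for d in diffs) / len(diffs)


def rmse(target_scene, background_scene):
    """Root mean square error between the two scenes."""
    diffs = _differences(target_scene, background_scene)
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Toy 2x2 images (gray levels), differing in two pixels.
with_target = [[10, 20], [30, 40]]
background = [[10, 22], [30, 36]]
toy_mae = mae(with_target, background)
toy_rmse = rmse(with_target, background)
```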

4.1. Experiment 1: assessment of parameter r for the computational attention model

Following Proposition 3, this section is intended to elicit information on the risk attitudes of computational attention with respect to gambles on location-dependent attention (see Eq. (1) in Proposition 3) using the DISSTAF images (see Figs. 4–7).

The elicitation of the optimal risk attitude in Eq. (1) using a rational system of computational attention is performed as follows. The target images of the DISSTAF database are rank-ordered using computational attention with parameter $r$ in Eq. (1) taking values between $-1$ and $1$. The respective fraction of correctly classified targets $P_{CC}$ using computational attention models with parameter $r$ between $-1$ and $1$ is illustrated in Fig. 8. The optimal value of the parameter $r$ produces the model achieving the highest fraction of correctly classified targets $P_{CC}$ among the rational models of computational attention being compared (with parameter $r$ between $-1$ and $1$). From Fig. 8 we conclude that a computational attention model with parameter $r$ around $-0.6$ is best able to compute a visual target distinctness rank ordering that correlates with human observer performance. Hence, in the following experiments, the rational model of computational attention is used with parameter $r = -0.6$.
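The selection of the risk attitude can be sketched as a simple grid search. The callable `pcc_for`, which would rank the database with a given $r$ and return the resulting $P_{CC}$, is hypothetical, as are the grid step of 0.1 and the toy stand-in function used below; the stand-in is arbitrary and does not reproduce the curve in Fig. 8.

```python
def best_risk_attitude(pcc_for, grid):
    """Return the value of r on the grid with the highest P_CC.

    pcc_for: hypothetical callable mapping a risk attitude r to P_CC.
    grid: candidate values of r.
    """
    return max(grid, key=pcc_for)


# Candidate risk attitudes between -1 and 1 in steps of 0.1 (assumed step).
grid = [round(-1 + 0.1 * k, 1) for k in range(21)]

# Arbitrary toy stand-in for the evaluation, peaking at r = 0.3.
toy_best = best_risk_attitude(lambda r: -(r - 0.3) ** 2, grid)
```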

Fig. 3. (A)–(C) Attention scores for three quality versions, given respectively in (D)–(F), of a military vehicle in low visibility conditions; (G)–(I) blendings of the respective score maps and image reconstructions; (J)–(L) most salient locations.

4.2. Experiment 2

A subset (dataset #1) of twelve complex natural images containing a single target (and twelve empty images of the same rural backgrounds with no target) was used in this second experiment. Following the psychophysical experiment described in [21], the image pairs from dataset #1 were clustered into four subsets of targets with comparable visual distinctness: {1,2,7,11}, {17,20,21,22}, {24,27,32}, and {35} (see Figs. 4–7).

The comparative results of the rational model of computational attention with risk aversion ($r = -0.6$), and those of both quantitative and qualitative measures, are illustrated in Table 1. The bottom of each column shows the probability of correct classification $P_{CC}$ of the rank order in that column with respect to the reference rank order given in column 2. Significant rank-order permutations are displayed in boxes in Table 1. Note that rank-order permutations of targets within the same cluster of comparable visual distinctness are not significant, so such targets are counted as correctly classified.

Both RMSE and $\mathrm{SNR}_{\log}$ produce a rank order with six significant order reversals. The other targets have been attributed rank orders which do not differ significantly from the reference rank order. RMSE and $\mathrm{SNR}_{\log}$ yield a probability of correct classification $P_{CC} = 0.5$. MAE produces a rank order with five significant order reversals, and it yields a probability $P_{CC} = 0.58$. Table 1 also shows the rank order obtained using the rational model of computational attention. The highest value of the evaluation function, $P_{CC} = 0.83$, is obtained for the computational attention model, which produces a rank order with two significant order reversals. Fig. 9 (-EXP. 2-) shows these results as a bar chart to emphasize that the computational attention model is a better predictor than MAE, RMSE, and $\mathrm{SNR}_{\log}$.

4.3. Experiment 3

A second subset (dataset #2) of fifteen targets (grouped into four clusters {2,4,5,9,11}, {12,15,18,19}, {24,26,27,29,32}, and {36}) was used in this third experiment (see Figs. 4–7). The resulting rank orders of MAE, RMSE, and $\mathrm{SNR}_{\log}$ are listed in Table 2. Again, the bottom of each column shows the probability of correct classification of the rank order in that column with respect to the reference rank order in column 2. Among these three quantitative measures, $\mathrm{SNR}_{\log}$ yields the highest probability ($P_{CC} = 0.6$) with six significant order reversals (they are displayed in boxes). Fig. 9 (-EXP. 3-) illustrates these results with a bar chart. Table 2 also displays the rank order of the computational attention model. Again the highest value of the evaluation function, $P_{CC} = 0.8$, is obtained for the computational attention model, which produces a rank order with only three significant order reversals (see Fig. 9 -EXP. 3-).

Fig. 4. First cluster of targets with comparable visual distinctness (targets #1 to #11): target and non-target scenes.

Fig. 5. Second cluster of targets with comparable visual distinctness (targets #12 to #22): target and non-target scenes.

Fig. 6. Third cluster of targets with comparable visual distinctness (targets #23 to #34): target and non-target scenes.

Fig. 7. Fourth cluster of targets with comparable visual distinctness (targets #35 and #36): target and non-target scenes.

Fig. 8. 2D plot of $P_{CC}$ against the risk attitude $r$, with $r$ between $-1$ and $1$.

Table 1. Column 1: dataset in Experiment 2; column 2: the reference rank order; columns 3–6: the resulting rank orders of MAE, RMSE, $\mathrm{SNR}_{\log}$, and computational attention. The bottom of each column shows the probability of correct classification of the rank order in that column with respect to the reference rank order in column 2.

4.4. Experiment 4

We now compare the proposed attention model with a well-known model of computational attention [5], using the DISSTAF database. Itti and Koch [23] applied Itti's model of human visual search, which is based on the concept of a "salience map" [5], to a wide range of target detection tasks using the DISSTAF images. The saliency of objects in the visual environment is encoded in a 2D map. In Itti and Koch [23], low-level visual features are extracted in parallel from nine spatial scales, and the resulting feature maps are combined to yield three saliency maps for color, intensity, and orientation. These, in turn, are fed into a single saliency map (a 2D layer of integrate-and-fire neurons). Competition among neurons in this map yields a single winning location corresponding to the next attended target. Inhibiting this location transiently suppresses the currently attended location, causing the focus of attention to shift to the next most salient location. With respect to the search times predicted by Itti's attention model on the DISSTAF images, Itti and Koch [23] found a poor correlation between human and model search times (see Fig. 8 in [23]). This may be a consequence of the fact that Itti's attention model was originally designed not to find small, hidden targets, but rather to find the few most obviously conspicuous objects in an image.

For a dataset of 33 image pairs (target and non-target images) from the DISSTAF database, we have also calculated the probability of correct classification $P_{CC}$ using the computational attention model that follows Postulates 1–7. From the psychophysical experiment [21], the image pairs in the dataset can be clustered into four subsets of targets with comparable visual distinctness: {1,2,3,4,5,6,7,8,10,11}, {12,13,14,15,16,17,18,19,20,21,22}, {23,24,25,26,27,28,29,30,31,32,34}, and {35} (see Figs. 4–7).

In this case the rational model of computational attention once again yields a high probability of correct classification ($P_{CC} = 0.7272$). This implies a correlation between human and model predictions of visual attention. Recall that the rational model of attention does not extract visual features such as color, intensity, or orientation. Instead, it is based only on the multi-bitrate attention map $\{A(P_i, r)\}_{P_i, r}$ for each target scene, where $A(P_i, r)$ is given by Definition 2. In summary, the rational model of computational attention that follows Postulates 1–7 with a risk-aversion attitude ($r = -0.6$) shows the best overall performance in these four experiments.
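The scan path of the salience-map model described above (a winner-take-all choice followed by inhibition of the attended location) can be illustrated with a minimal sketch. This is not the implementation used in [5] or [23]: the function name `scan_saliency_map`, the toy map, and the hard, permanent suppression of a disc around the winner are assumptions made for illustration, standing in for the integrate-and-fire dynamics and the transient inhibition of the original model.

```python
def scan_saliency_map(saliency, n_fixations=3, inhibition_radius=1):
    """Return attended (row, col) locations in decreasing order of salience.

    Hypothetical simplification: the winner is the arg-max of the map, and
    inhibition of return is modeled by permanently suppressing a disc of
    radius `inhibition_radius` around each winner.
    """
    s = [[float(v) for v in row] for row in saliency]  # work on a copy
    n_rows, n_cols = len(s), len(s[0])
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    fixations = []
    for _ in range(n_fixations):
        # Winner-take-all: the most salient location that is still active.
        r0, c0 = max(cells, key=lambda rc: s[rc[0]][rc[1]])
        fixations.append((r0, c0))
        # Inhibition of return: suppress the winner and its close neighbors.
        for r, c in cells:
            if (r - r0) ** 2 + (c - c0) ** 2 <= inhibition_radius ** 2:
                s[r][c] = float("-inf")
    return fixations


# Toy 4x4 saliency map (made-up values, not DISSTAF data).
toy_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.0],
    [0.1, 0.3, 0.1, 0.7],
    [0.0, 0.0, 0.5, 0.4],
]
print(scan_saliency_map(toy_map, n_fixations=3))  # [(1, 1), (2, 3), (3, 2)]
```

In this toy map the focus moves from the global maximum to the next most salient location outside the suppressed disc, which is why a small, weakly salient target may be reached only after many shifts.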

Fig. 9. Probability of correct classification PCC using MAE, RMSE, computational attention, and SNRlog. [Bar charts, panels -EXP. 2- and -EXP. 3-; vertical axis: PCC, from 0 to 1; bars: MAE, RMSE, COMP. ATTENTION, SNRlog.]

Table 2. Column 1: dataset in Experiment 3; column 2: the reference rank order; columns 3–6: the resulting rank order of MAE, RMSE, SNRlog, and computational attention. The bottom of each column shows the probability of correct classification of the rank order in that column with respect to the reference rank order in column 2.

5. Conclusions

In this paper we have proposed a different approach to computational attention: first state some general principles that the solution of the problem must obey, and then derive the solution that satisfies exactly those principles. On the computational attention problem we have imposed three coherence axioms, two quantization axioms, and two additional axioms that restrict the form of utilities for consequences. The result is a rational model of computational attention. It is not rare that one would like to impose more axioms that are jointly compatible. It may also happen that the axiomatic computational attention resulting from the original list of axioms is found to react very badly to some significant image. One must then formalize the characteristics of the image, state an additional axiom that specifies how the computational attention should behave in this situation, and finally determine the greatest subset of axioms from the original list that is compatible with the new axiom. Of course, compatibility may hold for several distinct such subsets. In any case, the critical difference with respect to the approaches discussed in the Introduction is that we are able to predict exactly the behavior of the axiomatic solution according to its principles. Thus, the rational model of computational attention can choose at any time among alternative spatial locations in such a way as to avoid certain forms of behavioral inconsistency.

We have proved in Proposition 3 that a rational system for the allocation of attention might exhibit either a risk-seeking posture or risk aversion with respect to gambles on location-dependent attention. This allows the rational model of attention to be tuned for some kinds of images and yet to react very well to other kinds of pictures, simply by changing risk attitudes within the same framework. For the DISSTAF database, it was demonstrated that rational models of computational attention with a risk-aversion attitude with respect to gambles on location-dependent attention can be used to predict visual target distinctness. The risk-aversion attitude seems to be a consequence of the possible presence of small, hidden military vehicles in some complex rural backgrounds of the DISSTAF database. The validity and generalizability of the elicitation of risk attitudes for a rational model of computational attention such as the one presented here could be enhanced by exploring different databases. While this empirical assessment should be regarded cautiously, on the basis of these preliminary findings it appears that the multi-bitrate attention map, which measures the attention score following Postulates 1–7, gives consistent results and is suitable for further work.

Acknowledgments

The authors thank the referees for suggesting several good ways to improve the original manuscript.

Appendix A. Proof of Proposition 1

(a) Proposition 1(a) states that, to avoid certain forms of behavioral inconsistency when a computational system chooses, at any time, a location of interest upon which to focus attention, the gray-level occurrence sets $G_i \equiv \{G_{l,i},\ l \in L\}$ should be represented by probability distributions $R_i \equiv \{p(G_{l,i} \mid P_i, t),\ l \in L\}$. In such a framework, the actions available to the system are the various probability distributions $R_i$ over $G_i$, the latter constituting the gray-level occurrence set corresponding to each possible location of interest upon which to direct attention. This result follows directly from Proposition 2.11 in Bernardo and Smith [24], which establishes formally that coherent, quantitative measures of uncertainty about events must take the form of probabilities: (i) coherent, quantitative degrees of belief have the structure of a finitely additive probability measure; moreover, (ii) significant events, i.e., events which are practically possible but not certain, should be assigned probability values in the open interval $(0, 1)$.

(b) Proposition 1(b) asserts that options in the selection of a spatial location to allocate attention at time $t$ cannot be ordered without a specification of utilities (numerical values) for the consequences. Assuming a definition of utility that only involves comparison among consequences and options constructed with standard events, we would expect the utility of a consequence to be uniquely defined and to remain unchanged as new information is obtained, since the preference patterns among consequences are unaffected by additional information. This is indeed the case, as Proposition 2.21 in [24] establishes for decision problems in which extreme consequences are assumed to exist. In our problem it is attractive to have available the possibility, for conceptual and mathematical convenience, of dealing with sets of consequences not possessing extreme elements. Proposition 2.23 in [24] extends Proposition 2.21 in [24] to this more general situation, in which extreme consequences are not assumed to exist.
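To make part (a) concrete, the following sketch estimates a gray-level distribution over the neighborhood of a location as a normalized histogram. The estimator is an assumption for illustration only, since the proof does not prescribe one; the function name `gray_level_distribution`, the square neighborhood, and the additive `pseudo_count` (used so that every gray level keeps a probability strictly inside the open interval (0, 1)) are all hypothetical choices.

```python
from collections import Counter


def gray_level_distribution(image, center, radius, levels, pseudo_count=1.0):
    """Estimate p(G_l | P_i) for l = 0..levels-1 around `center`.

    Assumptions (not from the paper): a square neighborhood of half-width
    `radius`, integer gray levels in range(levels), and additive smoothing.
    """
    r0, c0 = center
    counts = Counter()
    for r in range(max(0, r0 - radius), min(len(image), r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(len(image[0]), c0 + radius + 1)):
            counts[image[r][c]] += 1
    total = sum(counts.values()) + pseudo_count * levels
    return [(counts[level] + pseudo_count) / total for level in range(levels)]


# Toy 3x3 image with four gray levels (made-up values).
toy_image = [
    [0, 0, 1],
    [0, 2, 1],
    [3, 3, 1],
]
dist = gray_level_distribution(toy_image, center=(1, 1), radius=1, levels=4)
print(dist)
```

With a positive `pseudo_count` and at least two gray levels, the returned values sum to one and none of them reaches 0 or 1, which matches condition (ii) for significant events.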

Appendix B. Proof of Proposition 2

A well-behaved utility function [25–27] is local, and thus the value of the distribution $R_i \equiv \{p(G_{l,i} \mid P_i, t),\ l \in L\}$ is to be assessed in terms of the probability it assigned to the actual outcome. This leads to simplifications that make it a useful working hypothesis:

$$u(R_i, G_{l,i}) = u(p_{l,i}),$$

with $p_{l,i} = p(G_{l,i} \mid P_i, t)$ being the probability of gray level $l$ in the neighborhood of location $P_i$ that would result from the improvement in reconstruction fidelity achieved by allocating attention to $P_i$ at time $t$. Let $u_{P_1,P_2,\ldots,P_n}(p_{l,1}, p_{l,2}, \ldots, p_{l,n})$ represent the utility function for consequences over spatial locations $P_1, P_2, \ldots, P_n$. Keeney and Raiffa [28] have demonstrated that the hypothesis of value independence given in Postulate 6 holds if and only if, for all $P_i$, there are utility functions $u_{P_i}(p_{l,i})$ and constants $a_i$ such that:

$$u_{P_1,P_2,\ldots,P_n}(p_{l,1}, p_{l,2}, \ldots, p_{l,n}) = \sum_i a_i\, u_{P_i}(p_{l,i}), \tag{4}$$

where $\sum_i a_i = 1$. That is, $u_{P_1,P_2,\ldots,P_n}(p_{l,1}, p_{l,2}, \ldots, p_{l,n})$ has an additive form. Let $p^{w}_{l,j}$ be the probability of gray level $l$ using the worst reconstruction fidelity in the neighborhood of location $P_j$, and $p^{b}_{l,j}$ be the probability of gray level $l$ using the best reconstruction fidelity in the neighborhood of $P_j$. Without loss of generality we scale the utility function $u_{P_j}(p_{l,j})$ so that $u_{P_j}(p^{w}_{l,j}) = 0$ and $u_{P_j}(p^{b}_{l,j}) = 1$. Postulate 7 states that

$$u_{P_1,\ldots,P_n}(p_{l,1}, \ldots, p_{l,i}, \ldots, p^{w}_{l,j}, \ldots, p_{l,n}) = u_{P_1,\ldots,P_n}(p_{l,1}, \ldots, q\,p_{l,i}, \ldots, p^{b}_{l,j}, \ldots, p_{l,n}), \tag{5}$$

with $0 < q < 1$, and for all $p_{l,i}$, where $1 - q$ is the proportion of $p_{l,i}$ that would be given up to achieve the improvement in reconstruction fidelity that results from changing $p^{w}_{l,j}$ to $p^{b}_{l,j}$. Then, substituting Eq. (4) in Eq. (5), we find that

$$a_i\, u_{P_i}(p_{l,i}) = a_i\, u_{P_i}(q\,p_{l,i}) + a_j, \tag{6}$$

for all $p_{l,i}$. Following Pliskin et al. [29], we first show that the three functional forms are consistent with Postulate 7. If $u(p) = \log(p)$, then $a_i = 1/(1 - \log q)$ and $a_j = -\log q/(1 - \log q)$ are consistent with Eq. (6). Similarly, if $u(p) = p^r$ or $u(p) = -p^r$, then the values of $a_i = q^r$ and $a_j = 0$ are consistent with Eq. (6). Summarizing, we have already proved that $\log(p)$, $p^r$ for $r > 0$, and $-p^r$ for $r < 0$ are consistent with Postulate 7 when two particular probabilities $p^{w}_{l,j}$ and $p^{b}_{l,j}$ are involved in the trade-offs. By a corollary given in [29] we can easily prove that this is sufficient for Postulate 7 to hold for any pair of probabilities.

We next prove that these three functions are the only ones consistent with Postulate 7 (to this aim we follow a proof suggested in another context by Keeney and Raiffa [28]). We twice differentiate Eq. (6) and divide the second derivative of each side by the first derivative, which gives (suppressing the subscripts)

$$\frac{u''(p)}{u'(p)} = q\,\frac{u''(qp)}{u'(qp)}. \tag{7}$$

Recursively substituting $qp$ for $p$ in Eq. (7), it follows that:

$$\frac{u''(p)}{u'(p)} = q^{n}\,\frac{u''(q^{n}p)}{u'(q^{n}p)}, \tag{8}$$

for all $p$ and all $n \geq 0$. Given that $u$ is a "well-behaved" function, it implies the existence of $\lim_{p \to 0} p\,u''(p)/u'(p)$. Then, for any $p_1$, we have

$$\lim_{p \to 0} p\,\frac{u''(p)}{u'(p)} = \lim_{n \to \infty} q^{n}p_1\,\frac{u''(q^{n}p_1)}{u'(q^{n}p_1)} \overset{(a)}{=} p_1\,\frac{u''(p_1)}{u'(p_1)}, \tag{9}$$

where (a) follows from Eq. (8). And, similarly, for any $p_2$ we have

$$\lim_{p \to 0} p\,\frac{u''(p)}{u'(p)} = \lim_{n \to \infty} q^{n}p_2\,\frac{u''(q^{n}p_2)}{u'(q^{n}p_2)} = p_2\,\frac{u''(p_2)}{u'(p_2)}. \tag{10}$$

From Eqs. (9) and (10), it follows that for any $p_1$ and $p_2$:

$$p_1\,\frac{u''(p_1)}{u'(p_1)} = p_2\,\frac{u''(p_2)}{u'(p_2)}, \tag{11}$$

or, equivalently,

$$p\,\frac{u''(p)}{u'(p)} = c. \tag{12}$$

By integration we obtain that the utility function $u(p)$ must have one of the three functional forms: $\log(p)$ for $r = 0$, $p^r$ for $r > 0$, and $-p^r$ for $r < 0$, where $r = 1 + c$ and $c$ is the constant in Eq. (12). Following Machina [30], we know that the shape of $u(p)$ determines risk attitudes. For $r < 1$, the utility is a concave function: (i) $p^r$ for $0 < r < 1$; (ii) $\log(p)$ for $r = 0$; or (iii) $-p^r$ for $r < 0$. Since a system with a concave utility function will in fact always prefer receiving a sure gain to the "gamble" itself, concave utility functions are termed risk averse. For $r > 1$ the utility function must have the form $p^r$, which is a convex function. It implies that the resulting system prefers bearing the risk rather than receiving the sure gain of the expected value of the "gamble" on location-dependent attention. Hence, such a utility function is termed risk loving. This proves Proposition 2.
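The link between the three functional forms and risk attitudes can be checked numerically with a small sketch. It assumes the sign convention of Eq. (12) as written above, under which the quantity $p\,u''(p)/u'(p)$ evaluates to $r - 1$ for each form. The function names, the finite-difference step, and the 50/50 gamble between the probabilities 0.2 and 0.8 are hypothetical choices for illustration.

```python
import math


def utility(p, r):
    """The three functional forms of the proof, indexed by r."""
    if r == 0:
        return math.log(p)
    return p ** r if r > 0 else -(p ** r)


def relative_curvature(p, r, h=1e-4):
    """Central finite-difference estimate of p * u''(p) / u'(p)."""
    u_plus, u_mid, u_minus = utility(p + h, r), utility(p, r), utility(p - h, r)
    first = (u_plus - u_minus) / (2.0 * h)
    second = (u_plus - 2.0 * u_mid + u_minus) / (h * h)
    return p * second / first


def prefers_sure_gain(r, p_low=0.2, p_high=0.8):
    """True if the utility of the mean exceeds the expected utility of a
    50/50 gamble between p_low and p_high (risk-averse behavior)."""
    sure = utility(0.5 * (p_low + p_high), r)
    gamble = 0.5 * utility(p_low, r) + 0.5 * utility(p_high, r)
    return sure > gamble


for r in (-0.6, 0.0, 0.5, 2.0):
    # The curvature should be (approximately) the same at both points.
    print(r, relative_curvature(0.3, r), relative_curvature(0.7, r),
          prefers_sure_gain(r))
```

For the values of $r$ below 1 the sure gain is preferred, while for $r = 2$ the gamble is preferred, in line with the concave and convex cases distinguished in the proof.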

Appendix C. Proof of Proposition 3

From Proposition 2, the utilities of reporting $p(G_{l,i} \mid t-1)$ or $p(G_{l,i} \mid P_i, t)$ might be $\log p(G_{l,i} \mid t-1)$ and $\log p(G_{l,i} \mid P_i, t)$, respectively. Thus, conditional upon the allocation of attention to spatial location $P_i$ at time $t$, the expected increase in utility that would result from the improvement in visual quality achieved by allocating attention to $P_i$ at $t$ would be given by

$$I[P_i, t/Q] = \sum_l p(G_{l,i} \mid P_i, t)\,\bigl[\log p(G_{l,i} \mid P_i, t) - \log p(G_{l,i} \mid t-1)\bigr] = \sum_l p(G_{l,i} \mid P_i, t)\,\log\frac{p(G_{l,i} \mid P_i, t)}{p(G_{l,i} \mid t-1)},$$

which, by Theorem 1 in Garcia et al. [22], is non-negative and verifies the nilpotence condition. That is, the expected increase in utility in this case is the Kullback–Leibler information gain, which has a minimal number of properties that are natural and, thus, desirable for predicting visual distinctness from 2D digital images (see [22] for further details).
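As a minimal sketch of the logarithmic case, the function below evaluates the Kullback–Leibler information gain between the distribution after allocating attention and the one at time $t-1$. The function name `information_gain` and the two toy distributions are assumptions for illustration, and nilpotence is taken here to mean that the gain vanishes when the two distributions coincide.

```python
import math


def information_gain(posterior, prior):
    """Expected increase in log utility: sum_l p_l * log(p_l / q_l).

    `posterior` plays the role of p(G_l | P_i, t) and `prior` the role of
    p(G_l | t-1); both are hypothetical toy inputs with positive entries.
    """
    return sum(p * math.log(p / q) for p, q in zip(posterior, prior) if p > 0)


prior = [0.25, 0.25, 0.25, 0.25]
posterior = [0.70, 0.10, 0.10, 0.10]

print(information_gain(posterior, prior))  # positive: attention changed the distribution
print(information_gain(prior, prior))      # zero: nothing was gained
```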


Similarly, it can be proved that if preferences are described by the utility function for $r < 0$, then the expected increase in utility provided by the improvement in visual quality achieved by allocating attention to $P_i$ at $t$, when the initial probability distribution is $Q \equiv \{p(G_{l,i} \mid t-1),\ l \in L\}$, is given by

$$I[P_i, t/Q] = \sum_l p(G_{l,i} \mid P_i, t)\,\bigl\{[p(G_{l,i} \mid t-1)]^{r} - [p(G_{l,i} \mid P_i, t)]^{r}\bigr\}, \tag{13}$$

while if preferences are described by the utility function for $r > 0$, then the expected increase in utility is given by

$$I[P_i, t/Q] = \sum_l p(G_{l,i} \mid P_i, t)\,\bigl\{[p(G_{l,i} \mid P_i, t)]^{r} - [p(G_{l,i} \mid t-1)]^{r}\bigr\}. \tag{14}$$
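Eqs. (13) and (14) differ only in the sign attached to the power utility, so both can be evaluated by one routine. The sketch below is an illustration under stated assumptions: the function name `power_utility_gain` and the toy distributions are hypothetical, all probabilities are assumed strictly positive (required when $r < 0$), and the value $r = -0.6$ is used only because it is the risk-aversion setting quoted in the experiments.

```python
def power_utility_gain(posterior, prior, r):
    """Expected increase in utility for u(p) = p**r (r > 0) or -p**r (r < 0).

    `posterior` stands for p(G_l | P_i, t) and `prior` for p(G_l | t-1).
    """
    if r == 0:
        raise ValueError("r = 0 corresponds to the logarithmic utility")
    sign = 1.0 if r > 0 else -1.0
    # sign * (p**r - q**r) equals u(p) - u(q) for both branches.
    return sum(p * sign * (p ** r - q ** r) for p, q in zip(posterior, prior))


prior = [0.25, 0.25, 0.25, 0.25]
posterior = [0.70, 0.10, 0.10, 0.10]

print(power_utility_gain(posterior, prior, r=-0.6))  # Eq. (13) form
print(power_utility_gain(posterior, prior, r=0.5))   # Eq. (14) form
print(power_utility_gain(prior, prior, r=-0.6))      # 0.0 when nothing changes
```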

This proves Proposition 3.

References

[1] J.R. Bergen, B. Julesz, Parallel versus serial processing in rapid pattern discrimination, Nature 303 (1983) 696–698.
[2] C. Koch, S. Ullman, Shifts in selective visual attention: towards the underlying neural circuitry, Hum. Neurobiol. 4 (1985) 219–227.
[3] M.I. Posner, Y. Cohen, R.D. Rafal, Neural systems control of spatial orienting, Philos. Trans. R. Soc. Lond. B Biol. Sci. 298 (1982) 187–198.
[4] W. James, The Principles of Psychology, Dover, New York, 1950.
[5] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1998) 1254–1259.
[6] A. Treisman, Features and objects: the fourteenth Bartlett memorial lecture, Q. J. Exp. Psychol. A 40 (1988) 201–237.
[7] O. Le Meur, P. Le Callet, D. Barba, A coherent computational approach to model the bottom-up visual attention, IEEE Trans. Pattern Anal. Mach. Intell. 28 (5) (2006) 802–817.
[8] G.M. Boynton, Attention and visual perception, Curr. Opin. Neurobiol. 15 (2005) 465–469.
[9] M. Mancas, Computational Attention: Towards Attentive Computers, Presses universitaires de Louvain, ISBN: 978-2-87463-099-6, Belgium, 2007, pp. 1–267.
[10] W. Osberger, A.J. Maeder, Automatic identification of perceptually important regions in an image using a model of the human visual system, in: 14th International Conference on Pattern Recognition, Brisbane, Australia, 1998.
[11] K.N. Walker, T.F. Cootes, C.J. Taylor, Locating salient object features, in: P.H. Lewis, M.S. Nixon (Eds.), Proceedings of British Machine Vision Conference, vol. 2, BMVA Press, 1998, pp. 557–566.
[12] T.N. Mudge, J.L. Turney, R.A. Voltz, Automatic generation of salient features for the recognition of partially occluded parts, Robotica 5 (1987) 117–127.
[13] A. Oliva, A. Torralba, M.S. Castelhano, J.M. Henderson, Top-down control of visual attention in object detection, in: Proceedings of IEEE International Conference on Image Processing, vol. 1, 2003, pp. 253–256.
[14] L. Itti, P. Baldi, Bayesian surprise attracts human attention, Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, 2006, pp. 1–9.
[15] C.M. Privitera, L. Stark, Algorithms for defining visual regions of interest: comparison with eye fixations, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 970–982.
[16] T. Bosse, W. Doesburg, P. Maanen, J. Treur, Augmented metacognition addressing dynamic allocation of tasks requiring visual attention, in: Proceedings of 12th International Conference on Human–Computer Interaction, Lecture Notes in Computer Science, vol. 4565, Springer, Berlin, 2007.
[17] P. Maanen, L. Koning, K. Dongen, Design and validation of HABTA: human attention-based task allocator, in: Proceedings of First International Workshop on Human Aspects in Ambient Intelligence, Darmstadt, Germany, November 10, 2007.
[18] J.A. Garcia, R. Rodriguez-Sanchez, J. Fdez-Valdivia, Justice in quantizer formation for rational progressive transmission, Opt. Eng. 43 (2004) 2105–2119.
[19] J.A. Garcia, R. Rodriguez-Sanchez, J. Fdez-Valdivia, Progressive Image Transmission: The Role of Rationality, Cooperation and Justice, PM-140, SPIE Press, Bellingham, Washington, USA, 2004, p. 230.
[20] A.M. Rohaly, A.J. Ahumada, A.B. Watson, Object detection in natural backgrounds predicted by discrimination performance and models, Vision Res. 37 (23) (1997) 3225–3235.
[21] A. Toet, F.L. Kooi, P. Bijl, J.M. Valeton, Visual conspicuity determines human target acquisition performance, Opt. Eng. 37 (7) (1998) 1969–1975.
[22] J.A. Garcia, J. Fdez-Valdivia, X.R. Fdez-Vidal, R. Rodriguez-Sanchez, Information theoretic measure for visual target distinctness, IEEE Trans. Pattern Anal. Mach. Intell. 23 (4) (2001) 362–383.
[23] L. Itti, C. Koch, Target detection using saliency-based attention, in: NATO SCI12 Workshop on Search and Target Acquisition, Utrecht, The Netherlands, June 21–23, 1999, pp. (3-1)–(3-10).
[24] J.M. Bernardo, A.F.M. Smith, Bayesian Theory, Wiley Series in Probability and Statistics, Wiley, Chichester, UK, 1994.
[25] P.C. Fishburn, The Foundations of Expected Utility, D. Reidel Pub. Co., Dordrecht, The Netherlands, 1982.
[26] G. Herden, N. Knoche, C. Seidel, W. Trockel (Eds.), Mathematical Utility Theory, Springer, Wien, 1999.
[27] B.P. Stigum, F. Wentsop, Foundations of Utility and Risk Theory with Applications, Kluwer Academic Press, Dordrecht, The Netherlands, 1983.
[28] R.L. Keeney, H. Raiffa, Decisions with Multiple Objectives: Preferences and Value Tradeoffs, Wiley, NY, 1976.
[29] J.S. Pliskin, D.S. Shepard, M.C. Weinstein, Utility functions for life years and health status, Oper. Res. 28 (1980) 206–224.
[30] M.J. Machina, Choice under uncertainty: problems solved and unsolved, Econ. Perspect. 1 (1) (1987) 121–154.

About the Author: J.A. GARCÍA was born in Almeria, Spain. He received the M.S. and Ph.D. degrees, both in Mathematics, from the University of Granada in 1987 and 1992, respectively. Since 1988 he has been with the Computer Science Department (DECSAI) at Granada University, where he is now Full Professor. Author of over 100 technical papers and three books, he has devoted the last 14 years to developing computer vision models for biomedicine, astronomy, cartography, feature extraction, clustering, image representation, image distortion, visual target distinctness, and image compression.

About the Author: ROSA RODRIGUEZ-SÁNCHEZ was born in Granada, Spain. She received the M.S. and Ph.D. degrees, both in Computer Science, from the University of Granada in 1996 and 1999, respectively. She is currently with the Computer Science Department (DECSAI) at Granada University, where she is an Associate Professor. Her current interests include computer vision, visual perception, and image coding.

About the Author: J. FDEZ-VALDIVIA was born in Granada, Spain. He received the M.S. and Ph.D. degrees, both in Mathematics, from the University of Granada in 1986 and 1991, respectively. Since 1988 he has been with the Computer Science Department (DECSAI) at Granada University, where he is now Full Professor. His current interests include computer vision, image representation, feature detection, visual target distinctness, image coding, and biomedical applications. His research work is summarized in over 100 papers in scientific journals and conference proceedings and three books in the field of Computer Vision.