Information visibility using transmission methods

Pattern Recognition Letters 31 (2010) 609–618

Contents lists available at ScienceDirect

Pattern Recognition Letters journal homepage: www.elsevier.com/locate/patrec

Information visibility using transmission methods J.A. García *, Rosa Rodriguez-Sánchez, J. Fdez-Valdivia, J. Martinez-Baena Departamento de Ciencias de la Computación e I. A., CITIC-UGR, Universidad de Granada, 18071 Granada, Spain

Article history: Received 6 May 2009; received in revised form 19 October 2009; available online 11 December 2009. Communicated by R.C. Guido.

Keywords: Computational attention; Transmission; Paradigm; Visual efficiency; Important information; Advertising

Abstract

Lossy coding, selective seeing, ignoring visual cues, and perceptual biases are different sources of error in visual communication. To counteract these tendencies we propose a change of paradigm from transmitting important information first to seeing important information first, for the same quality factor. Here we apply the new paradigm to evaluate the visual efficiency of image information when it is reconstructed at high and low fidelity using different transmission methods. To this aim, we apply a rational model of computational attention in which a multi-bitrate attention map provides the attention score for each spatial location at high and low quality versions of the image reconstruction. The rational approach to attention does not purport to describe the ways in which the Human Visual System (HVS) actually behaves in making choices among possible locations of interest for allocating attention. Instead, we are interested in the aspects of rationality that seem to be present in the decision making of the HVS. From both rate–attention and rate–distortion curves, we conclude which transmission method is the overall winner according to the new paradigm. A dataset of advertisement images is used to compare the visual efficiency of the advertisement when it is transmitted using three different coders without region-dependent quality of encoding. Experimental results show that a potential consumer may see important areas faster using SPIHT (without region-dependent quality of encoding) on a significant number of advertisement images, even though this transmission method should improve its capabilities in terms of important information visibility across bitrates.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

In visual communication, by the time a message gets from a sender to a receiver, there are several basic places where transmission errors can take place, and at each place there is a multitude of potential sources of error or interference that can enter the communication process: (i) transmission distortion from lossy coding; (ii) receiver distortion produced by either selective seeing or ignoring visual cues; (iii) perceptual biases, that is, people attend to stimuli in the environment in very different ways. It is critical to counteract these sources of error by making a conscientious effort to ensure that the receiver sees the important visual information first. But standard schemes often prioritize the code bits according to their reduction in distortion, and a major objective in this context is to select the most important information – that which yields the largest distortion reduction – to be transmitted first, where the distortion is usually a squared-error

* Corresponding author. Tel.: +34 58 24 3197; fax: +34 58 24 3317. E-mail addresses: [email protected] (J.A. García), [email protected] (R. Rodriguez-Sánchez), [email protected] (J. Fdez-Valdivia), [email protected] (J. Martinez-Baena).
0167-8655/$ - see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2009.12.001

metric. Since the quality of the reconstructions at different bit rates strongly depends on the visual distinctness of the perceived data, the information selected to be transmitted first by any prioritization scheme should achieve the largest visual distinctness over still-to-be-transmitted data. The natural question is whether a squared-error metric is capable of rank-ordering visual information with respect to the visual distinctness as measured by humans, and thus, whether the largest squared-error reduction can be used to prioritize, with reliability, the most important information according to its distinctness. So what are the actual properties of a squared-error metric? For example, does it take into account the effectiveness of the information, discriminating relevant structures from unwanted detail and noise? Does it examine whether the properties of the original image at significant points are equal to the properties of the decoded output at corresponding locations? The point is that, whereas we have no evident affirmative answer to these and other questions, the mean squared error does not appear capable of predicting visual distinctness from digital imagery as perceived by human observers (Garcia et al., 2004). Here we postulate that a transmission method should guarantee that the receiver really sees important information first, for

example, by the maximization of the mean attention within the important image regions for the same quality factor. In Section 2, a set of axioms prescribes constraints that seem to us imperative to acknowledge in the problem of allocating attention. The result is a multi-bitrate attention map which provides a computational attention score for each spatial location at high and low quality versions of the image reconstruction. It allows distinct attention scores for the same spatial location at different picture qualities, and it avoids certain forms of behavioral inconsistency in the absence of a priori knowledge about the locations of interest, which is a characteristic of rational systems. Section 2.1 demonstrates a correlation between human and model predictions of visual attention using the target dataset from a DISSTAF (Distributed Interactive Simulation, Search and Target Acquisition Fidelity) field test, which was designed and organized by NVESD (Night Vision and Electro-optic Sensors Directorate, Ft. Belvoir, VA, USA) (Toet et al., 1998). Section 3 compares the visual efficiency of an advertisement image when it is reconstructed at high and low fidelity using different transmission methods, which may be relevant, for example, in applications of advertising on the Internet. To this aim, we first compute the multi-bitrate attention map using each of the coding methods to obtain the advertisement reconstruction at different bitrates. Second, for each coding method we calculate a normalized measure of the average attention score within the areas of interest provided by the publicist, for each bitrate, based on the attention map. A high value of the mean attention within the areas of interest at a given reconstruction fidelity means that the coding method brings the attention onto the regions important to advertising using this bitrate of picture quality. The main conclusions of the paper are summarized in Section 4.

2. Multi-bitrate attention map

In (Garcia et al., 2009) we proposed that a different approach to computational attention can be to first state some general principles that the solution of the problem must obey, and then derive the solution that satisfies exactly those principles. The idea is as follows. Suppose the computational system is forced to allocate attention at any time to one spatial location P_i in order to improve the reconstruction fidelity on a neighborhood of the chosen point. Let P = {P_i : i ∈ I} be the set of candidate spatial locations P_i upon which to allocate attention. The choice of a location P_i for directing attention at time t produces an increment in visual quality in a neighborhood of P_i that induces a particular set of consequences. Let G = {G_{l,i} : l ∈ L} be the class of any possible set of gray-level occurrences in the neighborhood of P_i whose reconstruction fidelity will be improved by allocating attention to this point at time t. Also let C = {c_{l,i} : l ∈ L} be the class of any set of possible consequences for the application at hand, associated with a gray-level occurrence set G, that results from the improvement in visual quality achieved by allocating attention to P_i. For the computational attention problem, Garcia et al. (2009) proposed three coherence axioms, two quantization axioms, and two additional axioms that restrict the form of utilities for consequences. Let ≼ be an attention score order between spatial locations in P, with P_i ≼ P_j meaning that the attention score of P_i is not greater, at this time t, than that of P_j. Postulates 1–3 provide a minimal set of rules to ensure that qualitative comparisons based on the attention score order ≼ cannot have intuitively undesirable implications, as follows.

Postulate 1 (First coherence axiom).

(i) Not all the consequences in C are equivalent; and (ii) the attention model is able to compare attention scores for any pair of spatial locations at time t.

The second axiom is intended to impose rules of coherence on attention score orderings that exclude two types of inconsistencies: first, that in order to allocate attention the system prefers one location over another identical location of interest; second, that the system is willing to suffer the certain loss of something of value, which happens if P_i ≼ P_j and P_j ≼ P_k but the attention score of P_i is greater than that of P_k.

Postulate 2 (Second coherence axiom). (i) P_i ≼ P_i; and (ii) if P_i ≼ P_j and P_j ≼ P_k, then P_i ≼ P_k.

Postulate 3 (Third coherence axiom). (i) Preferences between consequences at a given bitrate of reconstruction fidelity should not be affected by the gray-level occurrences at higher visual quality; (ii) if a gray-level occurrence set {G_{l,i} : l ∈ L} is more likely to relate to better consequences (for the application at hand) than another gray-level occurrence set {G_{l,j} : l ∈ L}, then the attention score of P_i should be greater than that of P_j; (iii) if the attention score of P_i is greater than that of P_j under the occurrence of event G, then the comparison of scores for P_i and P_j (which are identical if a different event occurs) depends entirely on consideration of what happens if G occurs.

Postulates 4 and 5 introduce some form of quantification by setting up a standard unit of measurement that enables the attention model to assign a score to any given available location. In short, precision through quantification is achieved by introducing some form of numerical standard into the system.

Postulate 4 (First quantization axiom). In the attention model, there exists some form of standard locations of interest, which play a role analogous to standard units of measurement.

Postulate 5 (Second quantization axiom).
The standard family of locations of interest provides a continuous scale against which any consequence or event can be precisely compared.

Two additional assumptions (Postulates 6 and 7) restrict the form of the utility function for consequences that result from directing attention to a particular spatial location.

Postulate 6 (Independence between reconstructions for different spatial locations). Let P_i and P_j be any pair of possible locations of interest at time t. Preferences for consequences at this time t involving the two locations P_i and P_j depend only on the probability distributions for the respective gray-level occurrence sets {G_{l,i} : l ∈ L} and {G_{l,j} : l ∈ L}, and not on their joint probability distribution.

Postulate 7 (Basic rule while sacrificing a fraction of reconstruction fidelity). The portion of visual quality for a neighborhood of P_i that the attention system is willing to give up at processing time t to achieve an improvement in reconstruction fidelity for a neighborhood of P_j does not depend on the absolute amount of visual quality for a neighborhood of P_i involved.

Let q_{l,i} = p(G_{l,i} | t − 1) be the probability of gray level l in the neighborhood of location P_i using the level of reconstruction fidelity given at time t − 1 (i.e., before time t). Let p_{l,i} = p(G_{l,i} | P_i, t) be the probability of gray level l in the neighborhood of location P_i that would result from the improvement in visual quality achieved by allocating attention to P_i at time t.
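To make the pair (q_{l,i}, p_{l,i}) concrete, the following Python sketch estimates both distributions as normalized gray-level histograms over a neighborhood of P_i, taken from the reconstructions before and after attention is allocated. The square neighborhood shape, its radius, and the small additive constant that keeps the "before" distribution strictly positive are our illustrative assumptions; the paper does not fix these details in this excerpt.

```python
import numpy as np

def neighborhood_distributions(prev_recon, curr_recon, pi, radius=4, levels=256):
    """Estimate q_{l,i} (reconstruction before time t) and p_{l,i}
    (after allocating attention to P_i) as normalized gray-level
    histograms over a square neighborhood of P_i."""
    y, x = pi
    def local_hist(img):
        patch = img[max(0, y - radius):y + radius + 1,
                    max(0, x - radius):x + radius + 1]
        hist = np.bincount(patch.ravel().astype(int), minlength=levels)
        hist = hist.astype(float) + 1e-9   # keep the distribution strictly positive
        return hist / hist.sum()
    return local_hist(prev_recon), local_hist(curr_recon)

# toy usage: a coarse 8-bit reconstruction and a slightly refined one
rng = np.random.default_rng(0)
lo = rng.integers(0, 256, (32, 32))                       # time t - 1
hi = np.clip(lo + rng.integers(-2, 3, (32, 32)), 0, 255)  # time t
q, p = neighborhood_distributions(lo, hi, (16, 16))
```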


Following the three mathematical results given in Appendix A, we have that, for an image reconstruction fidelity given by bitrate q, the attention score of location P_i is the total sum of the utilities for consequences, denoted I[P_i; t/Q], that were provided at the times t when attention was directed to P_i, up to the given bitrate q:

    I[P_i; t/Q] = Σ_l p_{l,i} [ (q_{l,i})^r − (p_{l,i})^r ]   if r < 0,
    I[P_i; t/Q] = Σ_l p_{l,i} [ (p_{l,i})^r − (q_{l,i})^r ]   if r > 0.        (1)

Hence we can now give a formal definition of the global attention map for a given image, following the rational approach to the measurement of attention score:

Definition 1 (Multi-bitrate attention map). The multi-bitrate attention map that measures the attention score, following Postulates 1–7, at any spatial location P_i and any bitrate q of image reconstruction fidelity, is

    {A(P_i, q)}_{P_i, q},        (2)

where

    A(P_i, q) = Σ_{t_{P_i}} I[P_i; t_{P_i}/Q],        (3)

with the sum over times t_{P_i}, up to the given bitrate q, such that attention is directed to P_i at time t_{P_i}.
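A minimal sketch of this score, assuming a distribution pair is available for each time attention visits P_i: the first function computes the piecewise expected increase in utility (only the r < 0 and r > 0 branches given above; this excerpt shows no r = 0 branch, so none is implemented), and the second accumulates those increases into A(P_i, q). The default r = 0.6 is the risk-averse value used later in the paper's experiments; the container layout is our own choice.

```python
import numpy as np

def utility_increase(p, q, r=0.6):
    """Piecewise expected increase in utility I[P_i; t/Q] for one
    allocation of attention to P_i: p and q are the gray-level
    distributions after and before the allocation, r the risk
    parameter of Proposition 2 (r != 0 here)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if r < 0:
        return float(np.sum(p * (q ** r - p ** r)))
    return float(np.sum(p * (p ** r - q ** r)))

def attention_score(history, r=0.6):
    """Total attention score A(P_i, q): the sum of utility increases
    over the times, up to the bitrate of interest, at which attention
    was directed to P_i.  `history` is a list of (p, q) pairs, one per
    such time step."""
    return sum(utility_increase(p, q, r) for p, q in history)

# refining the neighborhood of P_i twice, each time sharpening its distribution
score = attention_score([([0.7, 0.3], [0.5, 0.5]),
                         ([0.8, 0.2], [0.7, 0.3])])
```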

Fig. 1. Advertisement images and publicist areas of interest.

Fig. 2. Advertisement images and publicist areas of interest.



2.1. Model evaluation

The multi-bitrate attention map {A(P_i, q)}_{P_i, q} provides a computational attention score for each spatial location P_i at high and low quality versions of the image reconstruction. Next we show that a rational model of computational attention can be used to predict visual distinctness in a database of target scenes that was presented in (Toet et al., 1998, 2001; Garcia et al., 2001). The images used in the experiment are slides made during the DISSTAF (Distributed Interactive Simulation, Search and Target Acquisition Fidelity) field test, which was designed and organized by NVESD (Night Vision & Electro-optic Sensors Directorate, Ft. Belvoir, VA, USA) and held in Fort Hunter Liggett, California, USA (Toet et al., 1998). These slides depict 44 different scenes. Each scene represents a military vehicle in a complex rural background. The visibility of the targets varies throughout the entire stimulus set. This is mainly due to variations in the structure of the local background, the viewing distance, the luminance distribution over the target support (shadows), the orientation of the targets, and the degree of occlusion of the targets by vegetation. Here we first compute the multi-bitrate attention map {A(P_i, q)}_{P_i, q} for each target scene in the database from the DISSTAF field test (see Figs. 2–5 in (Garcia et al., 2001)). Then we calculate the lowest bitrate, q*, of reconstruction fidelity for which the attention score of some point in the target area is in the upper quartile of the attention map {A(P_i, q*)}_{P_i} at bitrate q*. A small value of q* means that the computational model brings the attention onto the target using a low bitrate of picture quality, which corresponds to a high saliency of the target area.
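The computation of this threshold bitrate (written q* here) can be sketched as follows; the container layout (a dict from bitrate to 2-D attention map) and the function name are our own illustrative choices.

```python
import numpy as np

def minimal_salient_bitrate(attention_maps, target_mask, quantile=0.75):
    """Return the lowest bitrate whose attention map places some point
    of the target area in the upper quartile of scores, or None if the
    target never becomes that salient in the tested range."""
    for bitrate in sorted(attention_maps):
        amap = attention_maps[bitrate]
        threshold = np.quantile(amap, quantile)
        if np.any(amap[target_mask] >= threshold):
            return bitrate
    return None

# toy maps: the target pixel only stands out at 0.25 bpp
mask = np.zeros((8, 8), dtype=bool)
mask[4, 4] = True
base = np.arange(64, dtype=float).reshape(8, 8)  # target value 36 < 75th pct.
high = base.copy()
high[4, 4] = 100.0                               # target now tops the map
q_star = minimal_salient_bitrate({0.0625: base, 0.25: high}, mask)
```

A lower returned value predicts a faster detection of the target.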
Hence a lower value of q* predicts a faster detection (due to the higher saliency) of the target in the cluttered scene; therefore the value q* calculated for each image can be used to rank order the visual distinctness of the targets. Secondly, a psychophysical experiment is performed in which human observers estimate the visual distinctness of targets using the slides made during the DISSTAF field test. In the psychophysical experiment, search times and cumulative detection probabilities were measured for nine military targets in complex natural backgrounds. A total of 64 civilian observers, aged between 18 and 45 years, participated in the visual search experiment. The procedure of the search experiment is described in (Toet et al., 1998, 2001). Search performance is usually expressed as the cumulative detection probability as a function of time, and it can be approximated by (Krendel and Wodinsky, 1960; Rotman et al., 1989; Waldman et al., 1991):

    P_d(t) = 0                            if t < t_0,
    P_d(t) = 1 − exp(−(t − t_0)/q)        if t ≥ t_0,        (4)

where P_d(t) is the fraction of correct detections at time t, t_0 is the minimum time required to respond, and q is a time constant. The curves are rank-ordered according to the area beneath their graphs (Toet et al., 1998). This subjective ranking induced by the psychophysical target distinctness is adopted as the reference rank order in the comparative study. Targets in a particular dataset will have similar visual distinctness if they give rise to closely spaced cumulative detection curves that are similar according to a Kolmogorov–Smirnov test (Toet et al., 2001). In fact, the target images in the dataset are clustered into four groupings of targets with comparable visual distinctness (Garcia et al., 2001). For the target dataset given by Figs. 2–5 in (Garcia et al., 2001), we have calculated the probability of correct classification PCC for the rational attention model with risk aversion (r = 0.6) with respect to gambles on location-dependent attention. The evaluation function PCC is defined as the fraction of correctly classified targets (using computational attention) with respect to the reference rank order (target distinctness measured by human observers):

    PCC = (Number of correctly classified targets) / (Number of targets).
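Given fitted parameters (t_0, q) of Eq. (4), the detection curves and their area-based ranking can be sketched as below; the time constant is written tau to avoid clashing with the bitrate symbol, and the evaluation horizon and grid size are arbitrary choices of ours.

```python
import numpy as np

def detection_probability(t, t0, tau):
    """Cumulative detection probability of Eq. (4): zero before the
    minimum response time t0, then a saturating exponential with time
    constant tau."""
    t = np.asarray(t, dtype=float)
    return np.where(t < t0, 0.0, 1.0 - np.exp(-(t - t0) / tau))

def rank_by_curve_area(targets, t_max=60.0, n=2000):
    """Rank targets by the area beneath their detection curves, the
    criterion of (Toet et al., 1998); a larger area means a more
    distinct target.  `targets` maps a name to its fitted (t0, tau)."""
    t = np.linspace(0.0, t_max, n)
    dt = t[1] - t[0]
    areas = {name: float(detection_probability(t, t0, tau).sum() * dt)
             for name, (t0, tau) in targets.items()}
    return sorted(areas, key=areas.get, reverse=True)

# a quickly found target accumulates more area than a well-hidden one
ranking = rank_by_curve_area({'distinct': (1.0, 3.0), 'hidden': (8.0, 25.0)})
```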

Fig. 3. Advertisement images and publicist areas of interest.



The rational model of computational attention (which follows Postulates 1–7) yields a high probability of correct classification (PCC = 0.7272). This implies a correlation between human and model predictions of visual attention. Recall that the rational model of attention does not extract any visual feature such as color, intensity, or orientation. Instead, it is based only on the multi-bitrate attention map {A(P_i, q)}_{P_i, q} for each target scene.

Itti et al. (1999) applied Itti's model of human visual search, based on the concept of a "salience map" (Itti et al., 1998), to a wide range of target detection tasks using the DISSTAF images. Through a 2D map, the saliency of objects in the visual environment is encoded. In (Itti et al., 1999), low-level visual features (color, intensity, and orientation) are extracted in parallel at nine spatial scales, and the resulting feature maps are combined to yield three saliency maps. These, in turn, feed into a single saliency map (a 2D layer of integrate-and-fire neurons). Competition among neurons in this map yields a single winning location corresponding to the next attended target. Inhibiting this location transiently suppresses the currently attended location, causing the focus of attention to shift to the next most salient location. With respect to the predicted search times of Itti's attention model on the DISSTAF images, Itti et al. (1999) found a poor correlation between human and model search times (see Fig. 8 in (Itti et al., 1999)). This may be a consequence of the fact that Itti's attention model was originally designed not to find small, hidden targets, but

Fig. 4. Rate–attention and rate–SNRlog curves for advertisement images #1–#4, comparing SPIHT, KAKADU, and JASPER at compression ratios from 16:1 to 512:1. (Left column) Normalized mean attention score within the areas of interest; (Right column) SNRlog measured on the important areas.


rather to find the few most obviously conspicuous objects in an image.

3. Evaluation of important information visibility

Figs. 1–3 show, in the left column, a dataset of advertisement images, originally presented in (Mancas et al., 2007). The same figures also illustrate (in the right column) the respective important areas selected by the publicist, since they convey the main message to the potential consumer. Next we use this dataset to compare the visual efficiency of the advertisement when it is transmitted using

three different coders: the SPIHT coder (http://www.cipr.rpi.edu/research/SPIHT/spiht0.html), the KAKADU coder (http://www.kakadusoftware.com/), and JASPER (http://www.ece.uvic.ca/~mdadams/jasper/). The three transmission methods were applied without region-dependent quality of encoding. Thus, for each of the advertisement images in Figs. 1–3, we first compute the multi-bitrate attention map {A(P_i, q)}_{P_i, q}, using each of the coding methods under analysis to obtain the advertisement reconstruction at bitrate q of reconstruction fidelity. Second, for each coding method we calculate the average attention score within the areas of interest provided by the publicist, for each bitrate q, based on the attention map {A(P_i, q)}_{P_i}. The mean attention score within the areas of interest is normalized by dividing by the average attention achieved in the advertisement reconstruction at bitrate q. Appendix B provides a specification of the algorithm to compute the mean attention score.

Fig. 5. Rate–attention and rate–SNRlog curves for advertisement images #5–#8, comparing SPIHT, KAKADU, and JASPER at compression ratios from 16:1 to 512:1. (Left column) Normalized mean attention score within the areas of interest; (Right column) SNRlog measured on the important areas.
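The normalized mean attention score described above reduces to a one-line computation. A sketch, assuming the attention map at bitrate q is available as a 2-D array and the publicist's areas of interest as a boolean mask:

```python
import numpy as np

def normalized_mean_attention(attention_map, interest_mask):
    """Mean attention over the publicist's areas of interest divided by
    the mean attention over the whole reconstruction at the same
    bitrate; values above 1 mean the coder concentrates attention on
    the important areas."""
    return float(attention_map[interest_mask].mean() / attention_map.mean())

# toy example: attention concentrated inside the area of interest
amap = np.ones((16, 16))
roi = np.zeros((16, 16), dtype=bool)
roi[4:12, 4:12] = True
amap[roi] = 4.0
```

Evaluating this across bitrates for each coder yields the rate–attention curves plotted in Figs. 4–6.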

Fig. 6. Rate–attention and rate–SNRlog curves for advertisement images #9–#13, comparing SPIHT, KAKADU, and JASPER at compression ratios from 16:1 to 512:1. (Left column) Normalized mean attention score within the areas of interest; (Right column) SNRlog measured on the important areas.


A high value of the mean attention within the areas of interest at reconstruction fidelity q means that the coding method brings the attention onto the regions important to advertising using a bitrate q of picture quality, which corresponds to a high saliency of the areas of interest provided by the publicist. Hence it predicts a faster detection (due to the higher saliency) of the important areas in the advertisement scene; therefore the mean attention score across bitrates can be used to rank order the important information visibility using SPIHT, KAKADU, and JASPER. For each advertisement image in the dataset, Figs. 4–6 (left column) show the respective rate–attention curve using each of the three transmission methods, as given by the normalized mean attention score within the areas of interest across bitrates of reconstruction fidelity. Following these results, rate–attention curves predict a higher saliency of the important areas using SPIHT on advertisement images #1, #3, #4, #6, #9, #10, and #12. On the contrary, areas of interest would be detected faster using JASPER on images #2, #5, #7, #11, and #13, and using KAKADU on image #8.

We also perform an objective coder evaluation of the reconstruction fidelity by using the rate–SNRlog curves. Figs. 4–6 (right column) show the rate–SNRlog curves on the advertisement images given in Figs. 1–3. The bit rate ranges from 0.015625 bpp to 0.5 bpp. Although the SNRlog has a good physical and theoretical basis, this measure is often found to correlate poorly with subjective ratings, because the human visual system does not process the image in a point-by-point manner but rather in a selective way, according to decisions made at a cognitive level, by choosing specific data on which to make judgments and weighting those data more heavily than the rest of the image (Garcia et al., 2001). To overcome this problem, different weightings have been proposed, for example logarithmic and cube-root ones. Preprocessing the pair of images and emphasizing their edge content have been suggested as well. Here we use a different approach to improve the correlation between subjective ratings and the SNRlog: the differences between the original advertisement image and its decoded outputs are evaluated only on the important areas provided by the publicist. Figs. 4–6 (right column) show rate–SNRlog curves on the thirteen test images, using this approach to evaluate the SNRlog. From these plots we find that, at most bit rates, SPIHT decoded outputs achieve a better objective quality than KAKADU and JASPER decoded outputs for all the advertisement images. From both rate–attention and rate–SNRlog curves, we can conclude that the SPIHT coder is the overall winner, according to the new paradigm of seeing important information first, on images #1, #3, #4, #6, #9, #10, and #12. Regarding the other images (#2, #5, #7, #8, #11, and #13), SPIHT (without region-dependent quality of encoding) should improve its capabilities in terms of important information visibility across bitrates.
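A sketch of the SNRlog restricted to the important areas, as used for the right-hand columns of Figs. 4–6. Since this excerpt does not spell out the exact SNRlog formula, the conventional signal-to-error ratio in dB is assumed here; the function name is ours.

```python
import numpy as np

def snr_log_on_mask(original, decoded, mask):
    """SNR in dB evaluated only on the important areas: signal energy
    over squared-error energy, both restricted to the boolean mask."""
    x = original[mask].astype(float)
    y = decoded[mask].astype(float)
    err = float(np.sum((x - y) ** 2))
    if err == 0.0:
        return float('inf')   # perfect reconstruction on the mask
    return float(10.0 * np.log10(np.sum(x ** 2) / err))

# a decoded output closer to the original on the mask scores higher
img = np.full((8, 8), 100.0)
good = img + 1.0
bad = img + 10.0
m = np.ones((8, 8), dtype=bool)
```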

4. Conclusions

The multi-bitrate attention map provides a computational attention score for each spatial location at high and low quality versions of the image reconstruction. The novelty of this map is that it allows distinct attention scores for the same spatial location at different picture qualities, avoids certain forms of behavioral inconsistency in the absence of a priori knowledge about the locations of interest, and is not tuned for only certain images. We have evaluated the visual efficiency of each advertisement image in a dataset when it is reconstructed at high and low fidelity using three transmission methods. A high value of the normalized mean attention within the areas of interest at a given reconstruction fidelity q predicts that the coding method brings the attention onto the regions important to advertising using a bitrate q of picture quality. An objective coder evaluation of the reconstruction fidelity is achieved by using the rate–SNRlog curve on important areas provided by the publicist. From both rate–attention and rate–SNRlog curves we have demonstrated (using a dataset of advertisement images) that a potential consumer may see important areas faster using SPIHT (without region-dependent quality of encoding) on a significant number of advertisement images, even though this transmission method should improve its capabilities in terms of important information visibility across bitrates.

But what are the limitations of the proposed approach? Here we have dealt with a computational approach to the rational characteristics of visual attention. The overall objective of developing a rational approach to attention is not to describe the ways in which the Human Visual System (HVS) actually behaves in making choices among possible locations of interest for allocating attention. Instead, we are interested in the aspects of rationality that seem to be present in the decision making of the HVS: at any time, a rational system should choose among candidate spatial locations so as to avoid certain forms of inconsistency.

Regarding future work, we are going to develop a new method to rank sets of fused and input images in the order of important information visibility. The objective of image fusion is to represent relevant information from multiple individual images in a single image. Some fusion methods may represent important visual information more distinctively than others, thereby conveying it more efficiently to the human observer. We will propose to rank order images fused by different methods according to the computational attention value of their visually important details. First we need to compute for each of the fused images a multi-bitrate attention map, following a rational model of computational attention. From this attention map, we then calculate the average attention score within areas of interest (e.g., living creatures, man-made objects, and terrain features), for each bitrate. A high computed mean attention value within the areas of interest at any reconstruction fidelity corresponds to a high computational saliency of the areas of interest. The main advantages of this approach for comparative visual efficiency analysis are its simplicity and speed. We have to study whether the computational results agree with human observer performance, making the approach valuable for practical applications. Also, we are going to develop a publicly available suite of Web-based comparative visual efficiency tools designed to facilitate comparison of fused images. In addition, we will provide an interface for comparative visual efficiency analysis which, like all of the tools reported here, will be freely available to the scientific community.

Acknowledgments

The authors thank Dr. Alexander Toet (TNO Human Factors, Soesterberg, The Netherlands) for providing us with image data, search times, and cumulative detection probabilities from search experiments made during the DISSTAF field test. Thanks are due to the reviewers for their constructive suggestions.

Appendix A. Mathematical results

The following mathematical results were first presented in (Garcia et al., 2009); they are given here for the sake of completeness. The proofs are omitted but can be found in (Garcia et al., 2009).

Proposition 1. A computational attention model that aspires to analyze the decision problem {P, G, C, ≼} at any time t in accordance with Postulates 1–5 should verify that:


J.A. García et al. / Pattern Recognition Letters 31 (2010) 609–618

(a) Degrees of belief about gray-level occurrence sets $\{G_{l,i},\ l \in L\}$ are represented in the form of finite probability distributions $R_i \equiv \{p(G_{l,i} \mid P_i, t),\ l \in L\}$, with $p(G_{l,i} \mid P_i, t)$ denoting the probability of gray level $l$ in the neighborhood of location $P_i$ that would result from the improvement in reconstruction fidelity achieved by allocating attention to $P_i$ at time $t$;
(b) Numerical values attached to the consequences $\{c_{l,i},\ l \in L\}$, foreseen if there exists a particular degree of reconstruction fidelity given by the gray-level occurrence set $\{G_{l,i},\ l \in L\}$, are represented in the form of a utility function.

Proof. See (Garcia et al., 2009). □
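The degrees of belief in Proposition 1(a) can be made concrete as normalized gray-level histograms over a neighborhood of $P_i$ in the current reconstruction. A minimal sketch, where the window radius and the list-of-lists image representation are illustrative assumptions rather than the paper's implementation:

```python
from collections import Counter

def graylevel_distribution(image, center, radius=1):
    """Estimate p(G_{l,i} | P_i, t): the probability of each gray level l
    in the neighborhood of location P_i, from the current reconstruction.
    `image` is a 2-D list of gray levels; `center` is the (row, col) of P_i."""
    ci, cj = center
    rows, cols = len(image), len(image[0])
    counts = Counter()
    # Count gray-level occurrences in the (clipped) square neighborhood.
    for i in range(max(0, ci - radius), min(rows, ci + radius + 1)):
        for j in range(max(0, cj - radius), min(cols, cj + radius + 1)):
            counts[image[i][j]] += 1
    n = sum(counts.values())
    # Normalize counts into a finite probability distribution R_i.
    return {level: c / n for level, c in counts.items()}
```

Recomputing this distribution after attention improves the local reconstruction fidelity yields the "before" and "after" distributions $Q$ and $R_i$ used in Propositions 2 and 3.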

Proposition 2. If Postulates 6 and 7 both hold, then a well-behaved utility function $u$ for consequences must have one of the following three functional forms ("well-behaved" means local, twice differentiable, and such that $\lim_{p \to 0} p\, u''(p)/u'(p)$ exists):

$$
u(R_i, G_{l,i}) =
\begin{cases}
-(p_{l,i})^{r} & \text{if } r < 0,\\
\log p_{l,i} & \text{if } r = 0,\\
(p_{l,i})^{r} & \text{if } r > 0,
\end{cases}
$$

with $R_i \equiv \{p_{l,i},\ l \in L\}$, where $p_{l,i} = p(G_{l,i} \mid P_i, t)$ is the probability of gray level $l$ in the neighborhood of location $P_i$ that would result from the improvement in visual quality achieved by allocating attention to $P_i$ at time $t$. If $r > 1$, the resultant attentional model exhibits a risk-seeking posture with respect to "gambles" on location-dependent attention, whereas $r < 1$ implies risk-averse behavior regarding such gambles.

Proof. See (Garcia et al., 2009). □

Proposition 3. Let $q_{l,i} = p(G_{l,i} \mid t-1)$ be the probability of gray level $l$ in the neighborhood of location $P_i$ using the level of reconstruction fidelity given at time $t-1$ (i.e., before time $t$). Let $p_{l,i} = p(G_{l,i} \mid P_i, t)$ be the probability of gray level $l$ in the neighborhood of location $P_i$ that would result from the improvement in visual quality achieved by allocating attention to $P_i$ at time $t$. In a rational attention model for which Postulates 6 and 7 hold, the possible functional forms of the expected increase in utility provided by the allocation of attention to spatial location $P_i$ at time $t$, when the initial probability distribution $Q \equiv \{q_{l,i},\ l \in L\}$ is strictly positive, are as follows:

$$
\begin{cases}
\sum_l p_{l,i}\left[(q_{l,i})^{r} - (p_{l,i})^{r}\right] & \text{if } r < 0,\\[4pt]
\sum_l p_{l,i}\left[\log p_{l,i} - \log q_{l,i}\right] & \text{if } r = 0,\\[4pt]
\sum_l p_{l,i}\left[(p_{l,i})^{r} - (q_{l,i})^{r}\right] & \text{if } r > 0,
\end{cases}
\tag{5}
$$

where if $r > 1$ the system exhibits a risk-seeking posture with respect to gambles on location-dependent attention, while $r < 1$ implies risk aversion; risk neutrality is given by $r = 1$.

Proof. See (Garcia et al., 2009). □

Appendix B. Attention algorithm

Definitions

– I^C_{R_i}: decoded output using the C coder at compression ratio R_i
– TW(I^C_{R_i}): wavelet-transformed image for I^C_{R_i}
– Q_1, ..., Q_s: quantizers for TW(I^C_{R_i})
– ROI: regions of interest for the original image
– N: number of rows in the original image
– M: number of columns in the original image

PROCEDURE BITHIGH
– INPUT: images I_1 and I_2
– OUTPUT: the number of bits that differ between I_1 and I_2 at each location (i, j), summed over all spatial locations

1. cnt ← 0
2. For each (i, j)
   (a) aux ← |I1(i, j)| xor |I2(i, j)|
   (b) For each bit b in aux
      i. if (aux(b) is 1) cnt ← cnt + 1 end if
   (c) end for
3. end for
4. return cnt

END PROCEDURE

PROCEDURE UTILITY
– INPUT: images I_1 and I_2
– OUTPUT: the expected increase in utility between images I_1 and I_2

1. factor1 ← 0
2. factor2 ← 0
3. For each (i, j)
   (a) factor1 ← factor1 + |I1(i, j)|
   (b) factor2 ← factor2 + |I2(i, j)|
4. end for
5. f1 ← (factor1)^0.6
6. f2 ← (factor2)^0.6
7. sum ← 0
8. For each (i, j)
   (a) sum ← sum + (|I1(i, j)| / factor1) × (|I1(i, j)|^0.6 / f1 − |I2(i, j)|^0.6 / f2)
9. end for
10. return sum

END PROCEDURE

BEGIN
1. For each (i, j)
   (a) At(i, j) ← 0
2. end for
3. Rnext ← maximum compression ratio − 1
4. Rprevious ← maximum compression ratio
5. while (Rnext ≥ Robjective)
   (a) For each (i, j)
      i. For each Q_k
         A. nbits ← BITHIGH[Q_k(TW(I^C_{Rnext})), Q_k(TW(I^C_{Rprevious}))]
         B. For each coefficient c_{l,m} in Q_k(TW(I^C_{Rnext}))
            if wavelet coefficient c_{l,m} comes from spatial location (i, j)
               At(i, j) ← At(i, j) + UTILITY[Q_k(TW(I^C_{Rnext})), Q_k(TW(I^C_{Rprevious}))] / nbits
            end if
         C. end for
      ii. end for
   (b) end for
   (c) Rprevious ← Rnext
   (d) Rnext ← Rnext − 1
6. end while
7. mean ← 0; meanroi ← 0; npointsroi ← 0
8. For each (i, j)
   (a) if (i, j) ∈ ROI
      i. meanroi ← meanroi + At(i, j)
      ii. npointsroi ← npointsroi + 1
   (b) end if
   (c) mean ← mean + At(i, j)
9. end for
10. meanroi ← meanroi / npointsroi; mean ← mean / (N × M)
11. return meanroi / mean
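The UTILITY procedure is the $r = 0.6$ case of Eq. (5), with normalized coefficient magnitudes playing the roles of $p_{l,i}$ and $q_{l,i}$. A minimal sketch of BITHIGH and UTILITY on flat coefficient lists (the wavelet transform and quantizer machinery are omitted; the integer inputs are illustrative):

```python
def bithigh(i1, i2):
    """BITHIGH: count the bits that differ between corresponding
    coefficient magnitudes of two reconstructions."""
    return sum(bin(abs(a) ^ abs(b)).count("1") for a, b in zip(i1, i2))

def utility(i1, i2, r=0.6):
    """UTILITY: expected increase in utility between two reconstructions,
    i.e. the r > 0 case of Eq. (5), using normalized coefficient
    magnitudes as the probabilities p and q."""
    factor1 = sum(abs(a) for a in i1)
    factor2 = sum(abs(b) for b in i2)
    f1 = factor1 ** r
    f2 = factor2 ** r
    total = 0.0
    for a, b in zip(i1, i2):
        p = abs(a) / factor1                       # p_{l,i}
        total += p * (abs(a) ** r / f1 - abs(b) ** r / f2)  # p * (p^r - q^r)
    return total
```

Dividing the UTILITY of each refinement step by its BITHIGH bit cost, as in step 5(a)B above, accumulates a per-location attention score per bit spent, which the main procedure then averages inside and outside the ROI.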

END

References

Garcia, J.A., Fdez-Valdivia, J., Fdez-Vidal, X.R., Rodriguez-Sanchez, R., 2001. Computing Models for Predicting Visual Target Distinctness. SPIE Press, Bellingham, Washington, USA. PM-95.
Garcia, J.A., Fdez-Valdivia, J., Fdez-Vidal, X.R., Rodriguez-Sanchez, R., 2001. Information theoretic measure for visual target distinctness. IEEE Trans. Pattern Anal. Machine Intell. 23 (4), 362–383.
Garcia, J.A., Rodriguez-Sanchez, R., Fdez-Valdivia, J., 2004. Progressive Image Transmission: The Role of Rationality, Cooperation and Justice. SPIE Press, Bellingham, Washington, USA. PM-140.
Garcia, J.A., Rodriguez-Sánchez, R., Fdez-Valdivia, J., 2009. Axiomatic approach to computational attention. Pattern Recognition. doi:10.1016/j.patcog.2009.09.027.
Itti, L., Koch, C., Niebur, E., 1998. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Machine Intell. 20, 1254–1259.
Itti, L., Koch, C., 1999. Target detection using saliency-based attention. In: NATO SCI-12 Workshop on Search and Target Acquisition, The Netherlands, vol. 3 (1), pp. 3–10.
Krendel, E.S., Wodinsky, J., 1960. Visual search in an unstructured visual field. J. Opt. Soc. Amer. 50, 562–568.
Mancas, M. Computational Attention: Towards Attentive Computers. Presses universitaires de Louvain, Belgium. ISBN: 978-2-87463-099-6.
Rotman, S.R., Gordon, E.S., Kowalczyk, M.L., 1989. Modeling human search and target acquisition performance: I. First detection probability in a realistic multitarget scenario. Opt. Eng. 28, 1216–1222.
Toet, A., Kooi, F.L., Bijl, P., Valeton, J.M., 1998. Visual conspicuity determines human target acquisition performance. Opt. Eng. 37 (7), 1969–1975.
Toet, A., Bijl, P., Valeton, J.M., 2001. Image dataset for testing search and detection models. Opt. Eng. 40 (9), 1756–1759.
Waldman, G., Wootton, J., Hobson, G., 1991. Visual detection with search: An empirical model. IEEE Trans. Systems Man Cybernet. 21, 596–606.