On semantic-instructed attention: From video eye-tracking dataset to memory-guided probabilistic saliency model




Neurocomputing 168 (2015) 917–929


Yan Hua a, Meng Yang b, Zhicheng Zhao a,*, Renlai Zhou b, Anni Cai a
a Beijing University of Posts and Telecommunications, Beijing, China
b Beijing Normal University, Beijing, China

Article history: Received 17 September 2014; Received in revised form 25 April 2015; Accepted 9 May 2015; Available online 19 May 2015. Communicated by Yongdong Zhang.

Abstract

Visual attention influenced by example images and predefined targets has been widely studied in both the cognitive and computer vision fields. Semantics, known to be related to high-level human perception, also have a great influence on the top-down attention process. Understanding the impact of semantics on visual attention is beneficial for providing psychological and computational guidance for many real-world applications, e.g., semantic video retrieval. In this paper, we study the mechanisms of attention control and the computational modeling of saliency detection for dynamic scenes under semantic-instructed viewing conditions. We start our study by establishing a dataset, REMoT, which is, to the best of our knowledge, the first video eye-tracking dataset with semantic instructions. We collect the fixation locations of subjects when they are given four kinds of instructions with different levels of noise. The fixation behavior analysis on REMoT shows that the process of semantic-instructed attention can be explained by the long-term and short-term memory of the human visual system. Inspired by this finding, we propose a memory-guided probabilistic model to exploit semantic-instructed top-down attention. The experience of attention distribution over similar scenes in long-term memory is simulated by a linear mapping of global scene features. An HMM-like conditional probabilistic chain is constructed to model the dynamic fixation patterns among neighboring frames in short-term memory. Then, a generative saliency model is constructed which probabilistically combines the top-down module with a bottom-up module for semantic-instructed saliency detection. We compare our model to state-of-the-art models on REMoT and on the widely used dataset RSD. Experimental results show that our model achieves significant improvements not only in predicting visual attention under correct instructions, but also in detecting saliency under free viewing. © 2015 Elsevier B.V. All rights reserved.

Keywords: Semantic-instructed viewing; Eye-tracking; Top-down attention; Memory; Saliency model

1. Introduction

Although receiving a large amount of information, 10^8–10^9 bits every second, from the eyes [1], the human visual system (HVS) is capable of rapidly catching the prime information by selectively paying attention to regions with salient patterns. This visual attention mechanism plays a key role in visual processing tasks such as object recognition, semantic extraction and scene analysis. Nevertheless, the attention mechanism is not yet fully understood, because it remains unclear how different parts of the human brain collaborate. How to simulate such a biological cognitive process with computational models is even more challenging for practitioners in computer vision. Generally, visual attention is induced by two interrelated and indivisible information processing stages, bottom-up and top-down.

* Corresponding author. E-mail addresses: [email protected] (Y. Hua), [email protected] (M. Yang), [email protected] (Z. Zhao), [email protected] (R. Zhou), [email protected] (A. Cai).
http://dx.doi.org/10.1016/j.neucom.2015.05.033
0925-2312/© 2015 Elsevier B.V. All rights reserved.

Bottom-up attention is involuntary and is attracted only by the physical features of a stimulus. Most studies on bottom-up attention focus on identifying the physical features that may attract attention [2–6]. The achievements of bottom-up studies have been widely utilized to extract salient visual regions for computer vision tasks such as image segmentation [7], object recognition [8] and scene classification [9]. Top-down attention is a voluntary process, which requires active awareness and directs intentional search in our daily life [10]. Many cognitive functions drive top-down attention, such as knowledge, memory, expectation, reward and the current goal [11]. The mechanisms of top-down attention still need further investigation. Existing studies on top-down control of visual attention mainly focus on guidance by the visual aspects of objects and scenes [10,12–15]. For example, a previously inspected visual item shortens the search time [10]. How one's attention is affected by simple semantic instructions, e.g., a target object name or its simple description, has also been investigated [16–18]. Cognitive experiments show that targets and their semantically similar regions are recalled more often, and thus can be recognized more accurately than unrelated distractors. However, in practical applications users would prefer to provide complicated textual query inputs, e.g., a sentence or multiple keywords, for a visual search task. Unfortunately, previous work on visual examples and simple semantic instructions cannot be directly transferred to explain visual attention mechanisms under such situations. To facilitate practical applications such as semantic video retrieval, it is necessary to study the deployment as well as the computational modeling of visual selective attention given complex instructions.

In this paper, we make the first attempt to study the mechanisms of attention control and the modeling of saliency detection for dynamic scenes under complex semantic-instructed viewing conditions. Semantic-instructed viewing means that subjects are given complex semantic instructions before they watch a video clip. For example, users may be asked to judge whether a video is about “a boy wearing red coat playing basketball on the playground”. The complex instructions in our study are a set of sentences describing events that may be composed of subjects, actions, objects, scenes, attributes and other descriptions. Compared to simple semantic instructions, complex instructions deliver richer semantics which can more precisely describe the user's intention. To further investigate how attention mechanisms are affected by different levels of noise in complex semantic instructions, participants in our studies are given either no instruction (free viewing), or an instruction that may be correct, partially correct or totally incorrect before viewing a video. We provide diversified types of semantic instructions that closely simulate the conditions of real-world semantic video search.

In the HVS, selective visual attention is realized by actively controlling gaze to direct eye fixations towards informative regions in real time. Fixation maps of eye movement detected and recorded by an eye tracker in cognitive experiments are essential for analyzing visual attention patterns. Therefore, we start our study by establishing a new eye-tracking dataset, REMoT, in which subjects' fixations are recorded under given complex semantic instructions. We then analyze the differences of fixation positions between different kinds of semantic instructions and compare fixations with target locations based on this new eye-tracking dataset. Through these experiments and analyses, we find clear signals of visual attention being influenced by complex semantic instructions. More specifically, the fixation patterns obtained in our experiments can be partially explained by the declarative memory, procedural memory and short-term memory of the HVS. Declarative memory is a type of long-term memory [19]; what is stored in it are knowledge and facts about the objects of concern and their representations. It has been proved that declarative memory is involved in single-target search tasks in cognitive studies [20]. Our experimental results also show that subjects identify the target regions more accurately and more quickly with correct instructions, indicating the guidance of declarative long-term memory on attention towards targets. Procedural memory is another kind of long-term memory, used for performing particular types of actions [19].
Behavioral analysis [21] indicates that the effects of memory-guided spatial orienting concur with the intuition that, in everyday life, our predictions about where relevant or interesting events will unfold are largely shaped by our memory experience. Our analyses show that the average fixation patterns under different instructed-viewing conditions tend to be similar, implying the impact of procedural long-term memory on attention in visual search actions. Short-term memory has been demonstrated [22] by the fact that subjects were able to report whether a change had occurred when they viewed two image patterns separated by a short temporal interval.

Experiments also demonstrate that, in the cognition process [10], visual search is more efficient when the memory cue matches the visual item containing the target. Our analyses on REMoT show that fixations shift from target regions to other regions over time after the target objects appear, revealing the influence of short-term memory on attention dynamics.

Inspired by the above findings on REMoT, we propose a memory-guided probabilistic saliency model in this paper. There has been a huge body of research on the computational modeling of visual attention from bottom-up [2,23–28] and top-down [29–34] perspectives. Bottom-up models define saliency mainly based on bio-inspired measurements of low-level image features. Top-down models generally detect salient regions with different learning methods. There have been some top-down methods which aim at saliency detection, but only for particular tasks such as video game playing [30,34]. In addition, there have been a number of dynamic models [31,32,23] which integrate motion features along with other static low-level features to estimate the saliency of a video frame. Unfortunately, they all lack mechanisms for memory-guided saliency, which our analysis on REMoT shows to be important for semantic-instructed attention. A previous psychology study [35] also supports the view that a robust memory of visual content guides spatial attention. The proposed model is a three-layer generative graphical model which combines a top-down module and a bottom-up module. The former module is procedural memory-driven, while the latter is visual feature-activated for static scenes (i.e., each video key frame). We employ the gist of a scene [36] to define the global scene context, and the relationship between gist and semantic-instructed saliency is learned to simulate the content stored in procedural long-term memory. To model the short-term memory of dynamic attention, we construct an HMM-like conditional probabilistic chain on the saliency maps of a video key-frame sequence. It learns the diversified saliency patterns from true fixation data collected in the semantic-instructed psychological experiments. By incorporating both top-down and bottom-up cues, our model outperforms existing saliency models on both RSD [37], a widely used dataset with manually labeled salient objects, and our REMoT dataset with different types of instructions. The contributions of our work are summarized as follows:

• We undertake a significant effort on establishing a new eye-tracking dataset, REMoT, and making it available to the research community. In REMoT, the fixations are recorded by an eye tracker under free viewing and three kinds of semantic instructions (correct, incorrect and partially correct). To the best of our knowledge, this is the first human eye-tracking dataset with semantic instructions on real-world videos.
• Detailed analyses on the fixation locations of the recorded data are performed under the different kinds of viewing instructions. The analyses show evidence of attention being affected by the long-term and short-term memories of the HVS.
• Inspired by our analyses, we introduce long-term memory and short-term memory as top-down cues into attention modeling for semantic-instructed viewing. A three-layer generative graphical model is proposed to combine bottom-up and top-down cues. Our proposed model achieves a significant improvement in predicting saliency compared to state-of-the-art models on both REMoT and RSD.

Roadmap: Section 2 gives a brief review on related work. Section 3 describes the collection of eye-tracking dataset REMoT. Section 4 introduces the analyses of fixation patterns. Section 5 presents the memory-guided probabilistic saliency model. Section 6 demonstrates the experimental results of saliency models. Section 7 gives the conclusion and future work.


2. Related work

Semantic influence on attention: Previous studies on top-down control of visual attention focused on the visual aspects of objects and scenes, and less work has taken into account the influence of semantics on visual selective attention. In [16], participants were instructed by a textual word to detect the presence of a pre-specified target, such as “motorbike” or “table”. The authors empirically found that objects associated with the target, such as “helmet” or “chair” in the above example, were recalled more often, and were recognized more accurately than unrelated distracters. Experiments in [17] also demonstrated that fixation was driven by partial semantic correlation between a word and a visual object. For example, upon hearing the word “piano”, more looks were directed towards the semantically related object “trumpet” than towards any of the distracters. In [18], the authors studied the guidance of eye movements by semantic similarity among objects in real-world scene inspection and search. They found that attention preferentially transferred to objects that were semantically similar to the target. These studies illustrate that subjects' attention is semantically biased by text or verbal instructions. However, the instructions given in these studies only consist of a single target object (an object name or its simple description), and the visual scenes are either simple displays containing several intuitively selected objects [16,17] or static real-world images [18]. They cannot reflect the viewing patterns of daily life, where semantic instructions are complex and visual scenes are dynamic. In this paper, we try to explore the mechanisms of semantic-instructed visual attention under fairly natural conditions.

Saliency datasets: Fixations of eye movement recorded by an eye tracker in cognitive experiments are essential for analyzing visual attention patterns. In recent years, a growing number of eye-tracking datasets have become available online [48]. The majority of them used static images as the stimuli; datasets of eye-tracking on videos of daily scenes are not as numerous as those on static images. We list all the existing video eye-tracking datasets, to the best of our knowledge, in Table 1. We note that the fixations recorded in almost all of them are based on bottom-up attention. The two exceptional cases, [30] and [47], are task-oriented, aiming specifically at game playing and action recognition. In our work, the fixations are recorded with an eye tracker while subjects are instructed by sentences to watch dynamic video stimuli.

State-of-the-art saliency models in computer vision: Saliency detection has been widely studied in computer vision over the past decade [11]. We categorize state-of-the-art models into the following categories.


Cognitive models: Itti's model [2] is a classical cognitive model inspired by the behavior and neuronal architecture of receptive fields in the early primate visual system. Center-surround differences are computed between pyramidal feature maps in the color, intensity and orientation channels, and then all channels are normalized and aggregated to yield the saliency map. In Harel and Perona's model GBVS [23], a similarity graph between arbitrary pairs of map nodes is constructed, and the equilibrium distribution over map locations is treated as the saliency representation. The approach is “organic” because, biologically, individual “nodes” (neurons) exist in a connected and retinotopically organized network (the visual cortex) and communicate with each other (synaptic firing) in a way which gives rise to immediate behavior, including fast decisions on which areas of a scene require additional processing.

Signal processing models: AIM [24] and SUN [49] are based on “self-information”, which implies that moving one's eyes to the most salient points in an image can be regarded as maximizing information acquisition when attention is directed only by bottom-up saliency; that is, rare features are more likely to indicate targets. SR [25] is based on the assumption that statistical singularities in the spectrum represent the salient regions of an image; differential spectral components are employed for salient-region extraction. Achanta [26] implemented a frequency-tuned (FT) approach for salient-region detection following the principles of (1) emphasizing the largest and whole salient objects, (2) establishing the boundaries of salient objects and (3) removing high frequencies arising from texture and noise.

Contrast models: Cheng [27] proposed a contrast-based (HC) and a regional contrast-based (RC) saliency extraction algorithm. In HC, the saliency of a pixel is determined by its Lab-space color contrast to all other pixels in the image. In RC, the image is first segmented into regions and region-based color contrasts are computed to define the saliency. Goferman [28] developed a context-aware saliency (CA) model by computing the distances between pixel-centered patches. The distances and the pixel-centered patches were defined with basic principles such as local low-level considerations, global considerations, visual organization rules and high-level factors.

Learning models: Judd [29] developed a top-down method in which saliency was equated with discrimination. A linear SVM was trained on a set of low-level, mid-level and high-level image features to discriminate salient from non-salient locations. The low-level features include sub-band features, Itti saliency channels and color channels; the mid-level feature is the output of a horizon-line detector; the high-level features are the outputs of car, face and person detectors.

Table 1. A list of current video eye-tracking datasets.

Dataset | Instruction | Number of videos and description | Subjects
CRCNS [38] | Free viewing | 50 (25 min in total; TV programs, outdoor videos, video games, commercials, sporting events, etc.) | 8
LeMeur [39] | Free viewing | 7 (4.5–33.8 s each; faces, sporting events, audiences, landscape, logos, incrustations, low and high spatiotemporal) | 27
Marat [40] | Free viewing | 53 (1–3 s each; TV shows, TV news, animated movies, commercials, sport and music, indoor, outdoor, day-time, night-time) | 15
Shic [41] | Free viewing | 2 (1 min clips from a black-and-white movie) | 10
Peters [30] | Playing games | 24 (4–5 min each; game-play sessions) | 5
Pang [33] | Free viewing | 13 (30–90 s each; movies, natural scenes) | 6
IRCCyN/IVC [42] | Free viewing | 51 (around 10 s each; natural scenes) | 38
USC iLab [43] | Free viewing | 50 (around 10 s each; simple scenes and moving objects) | 14
INB [44] | Free viewing | 18 (around 20 s each; outdoor scenes) | 54
DIEM [45] | Free viewing | 85 (0.5–3.0 min each; movies, TV shows, music videos, commercials) | 250
SFU [46] | Free viewing | 12 (around 9 s each; standard videos for data compression) | 15
Hollywood-2, UCF sports [47] | Action recognition | 219 (21 h in total; 12 action classes from Hollywood movies and 9 sports action classes from broadcast TV channels) | 16


Peters and Itti [30] employed a regression classifier to capture task-dependent saliency by learning the relationship between saliency and scene gist in their top-down component for video games. Itti's bottom-up saliency [2] was then multiplied by the top-down saliency to generate the final saliency map.

Dynamic models: Borji [34] proposed a unified Bayesian approach for modeling task-driven visual attention. To predict the next attended location, the global context of a scene, previously attended locations and previous motor actions were integrated over time. Navalpakkam and Itti [20] proposed a computational model for the task-specific guidance of visual attention in real-world scenes. The model emphasized four aspects that are important in biological vision: determining the task-relevance of an entity, biasing attention towards the low-level visual features of desired targets, recognizing these targets using the same low-level features, and incrementally building a visual map of task-relevance at every scene location. Liu et al. [31] formulated salient object sequence detection as an energy minimization problem in a Conditional Random Field (CRF) framework, in which salience, spatial and temporal coherence and global topic models were integrated to identify a salient object sequence. Rahtu et al. [32] proposed to measure saliency with a CRF by integrating local feature contrasts in illumination, color and motion. Harel and Perona constructed a dynamic saliency model, GBVSD [23], by adding motion features to the nodes of the similarity graph in the static model GBVS. Pang [33] proposed a dynamic Bayesian network of visual attention that considers the eye-movement pattern, regarded as a binary cognitive state of a person (passive or active). The binary eye-movement pattern controls an eye-position density map, which estimates the probable human-attended regions.

Cognitive models, signal processing models and contrast models are based on bottom-up attention; learning and dynamic models are top-down attention models. However, some top-down methods detect saliency specifically for video game playing [30,34] or for task-relevant predefined entities [20], while our approach is designed for semantic-instructed attention on general dynamic scenes. In most of the existing dynamic models, motion features together with static low-level features are integrated with a CRF [31,32] or by other means such as a similarity graph [23] for saliency detection, while our model learns the transition law of short-term-memory-influenced attention from true fixation data with an HMM to predict the saliency states. In addition, different from modeling a binary cognitive state to control attention [33], we employ multi-state memory-guided saliency, which better simulates the top-down control of the HVS. There has been research [50] using an HMM to estimate human scanpaths, i.e., sequences of gaze shifts for visual attention over a static image; in contrast, we use an HMM to model the saliency between neighboring frames.
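To make the graph-based idea reviewed above (and used later as the bottom-up module of the proposed model) concrete, the sketch below builds a fully connected graph over the locations of one small feature map, weights the edges by feature dissimilarity times spatial proximity, and power-iterates the resulting Markov chain to an equilibrium distribution. It is a simplified, hedged illustration rather than the published GBVS implementation; the single feature channel, the edge-weighting constants `sigma_frac` and the fixed number of iterations are assumptions.

```python
import numpy as np

def graph_equilibrium_saliency(feature_map, sigma_frac=0.15, n_iter=100):
    """Toy GBVS-style saliency: equilibrium distribution of a Markov chain
    whose edge weights combine feature dissimilarity and spatial proximity.
    Intended for small, downsampled maps (the matrices are dense)."""
    H, W = feature_map.shape
    f = feature_map.ravel().astype(float)
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)

    dissim = np.abs(f[:, None] - f[None, :])                    # |M(i) - M(j)|
    dist2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)  # squared distance
    sigma = sigma_frac * max(H, W)
    weights = dissim * np.exp(-dist2 / (2 * sigma ** 2))

    # Row-normalize into a transition matrix and power-iterate to equilibrium.
    trans = weights / (weights.sum(axis=1, keepdims=True) + 1e-12)
    v = np.full(H * W, 1.0 / (H * W))
    for _ in range(n_iter):
        v = v @ trans
    return (v / (v.max() + 1e-12)).reshape(H, W)
```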

3. Dataset collection

3.1. Video clips

We establish a video eye-tracking dataset which collects eye-movement fixations during semantic-instructed video viewing. The Known Item Search (KIS) task in TRECVID [51] models a scenario of video retrieval by text query, so the video data and text queries of KIS can perfectly serve as our experimental materials. We choose two groups of videos from those specified by query text descriptions in TRECVID-KIS 2010 as the viewing videos. The corresponding query text descriptions are treated as the correct semantic instructions in our experiments. Group 1 includes 30 videos in total, 10 from each of the human, cartoon and landscape classes. Group 2 also includes 30 videos, whose descriptions have more semantic details about the target objects. The length of each video clip varies from 11 to 209 s. We name the dataset REMoT, an abbreviation for Recording Eye Movements on TRECVID videos.

The videos for semantic-instructed viewing possess diverse characteristics. First, the appearance time of the semantically related objects is uncertain, i.e., target objects may appear at the beginning of the video or at any time during the video. Second, the semantically related objects do not necessarily have bottom-up saliency. For example, a video frame for the semantic viewing task “The video of a Sega video game advertisement that shows tanks and futuristic walking weapons called Hounds” is shown in Fig. 1(a). The bottom-up salient region in this frame is the rectangular region resembling a national flag, while the semantically related target is the walking weapon. In Fig. 1(b), however, the bottom-up salient region, the “wings of an angel”, is exactly the semantically related object of the task “The video of a cartoon showing all sports characters as cherubs”.

3.2. Instructions

The instructions in our study are a set of descriptive sentences, which consist of subjects, actions, objects, scenes, attributes and so on. They deliver richer semantics and can more precisely describe users' intention than a simple object name. To explore visual attention mechanisms under semantic-instructed viewing conditions, we design four kinds of instructions: no instruction (free viewing), correct instruction, partially correct instruction and incorrect instruction. A correct instruction is the text description corresponding to the video being watched, a partially correct instruction is the corresponding text description but with errors in some details, and an incorrect instruction is the text description of another video. Samples of instructions are shown in Table 2.

Fig. 1. (a) The bottom-up salient region is not consistent with the task-related region. (b) The bottom-up salient region contains the task-related object.


Table 2. Samples of different kinds of instructions.

Correct instruction: A baby girl wearing a yellow shirt reading book on wood floor
Incorrect instruction: A woman sitting in a black and orange living room chair on a sea shore
Partially correct instruction: A baby girl wearing a pink shirt reading book on wood floor

Fig. 2. (a) Procedure of free viewing. (b) Procedure of instructed viewing.

In the experiments of Group 1, free viewing, correct and incorrect instructions are employed. The instructions in Group 1 include both abstract semantic and object-oriented descriptions. To further investigate the tolerance of fixation patterns to a certain level of noise in complex semantic instructions, correct, partially correct and incorrect instructions are used in the experiments of Group 2. The eye movements led by a partially correct instruction might reflect the meticulousness of the HVS when pursuing a specific goal. The instructions for the videos in Group 2 are mostly object-oriented with detailed descriptions.

3.3. Eye tracker

We use the highly accurate iViewX Hi-Speed eye tracker [52] to collect eye-movement data. It is head-mounted and does not allow the subject to move his/her head freely. The accepted accuracy of this eye tracker is less than 0.5° in our data collection, and we choose the 9-point calibration mode. The sampling rate is 500 Hz (binocular). Video clips are shown full screen on a 17-inch monitor with a resolution of 1024 × 768 (the original resolution of the videos is 320 × 240) and a vertical refresh rate of 60 Hz.

3.4. Eye-tracking data collection

For free viewing, as shown in Fig. 2(a), in each trial the video clip appears for its full duration after a 500 ms fixation mark (0.5° in diameter). After the clip disappears, subjects have to give a brief description of the video content, and then proceed to the next trial. For instructed viewing, as shown in Fig. 2(b), in each trial a center-located instruction appears after a 500 ms fixation mark (0.5° in diameter). Subjects dismiss the instruction after they finish reading it. Then another fixation mark appears for 500 ms, followed by a video clip. Subjects have to judge whether the semantic instruction is semantically consistent with the video clip by filling in a questionnaire (Yes/No).

Considering that long viewing sessions would degrade the quality of subjects' eye movements, we split Group 1 into two parts, each including 15 video clips, and the 15 video clips are shown to each subject in random order. The “correct instructions” of part 1 (2) are used as the “incorrect instructions” of part 2 (1). The correct and incorrect instructions in one experiment are given in an uncertain order to ensure unbiased judgment.

In this way, six experiments are conducted for Group 1. Similarly, we conduct six experiments for Group 2, consisting of three instructed-viewing trials in each of the two parts. The detailed arrangement is shown in Table 3, where F, C, I and P represent free viewing and viewing with correct, incorrect and partially correct instructions, respectively. 186 non-expert subjects (88 males and 98 females, aged between 18 and 25) take part in our eye-tracking data collection. All of them have normal or corrected-to-normal vision. Data collection is performed in a quiet room with constant lighting. Each subject sits in front of the screen at a fixed distance (60 cm). Examples of heat maps of fixations are illustrated in Fig. 3.

3.5. Obtained dataset

The dataset, which can be downloaded via http://www.bupt-mcprl.net/apply4datahtml.php, stores its information in separate folders containing (1) the 60 original video clips; (2) video clips with visualized heat maps for the four viewing conditions; (3) hand-labeled areas of interest (AOI, semantic target regions) for all video frames; (4) lists of the instructions of the two groups of experiments; and (5) the x- and y-coordinates of each fixation arranged by time (raw fixations). The dataset can enrich existing data sources for studies on visual attention for both the psychology and computer vision communities.

4. Analyses of fixation patterns

Intuitively, attention patterns will be influenced by the different kinds of semantic instructions. We employ the “fixation distance” and the ROC curve of the saliency map vs. the target region to quantitatively analyze such an influence. The former measures the position difference between two sets of fixation points, while the latter reflects how well the saliency map of eye movements coincides with the target region assigned by the correct semantic instruction. The analyses help us to gain insight into semantic-instructed visual attention, so as to provide useful hints for the computational modeling of saliency under semantic-instructed viewing.


Table 3. Arrangements of instructions and participants for each group.

                                Part 1                                   Part 2
                                Exp. 1     Exp. 2     Exp. 3            Exp. 4     Exp. 5     Exp. 6
Group 1: instruction            15F        7C 8I      8C 7I             15F        7C 8I      8C 7I
Group 1: subjects (total/male)  15/8       15/7       15/8              15/7       15/8       15/7
Group 2: instruction            5C 5P 5I   5C 5P 5I   5C 5P 5I          5C 5P 5I   5C 5P 5I   5C 5P 5I
Group 2: subjects (total/male)  16/8       16/6       16/8              16/8       16/6       16/7

Fig. 3. Examples of heat maps of fixations. Top row from left to right: original frame, heat maps under free viewing, correct and incorrect instructions. The corresponding correct instruction is “the video of Young News program depicting a man wearing green sport jacket and a woman announcer in blue sweater and white shirt and features a Nancy Drew trailer”. Bottom row from left to right: original frame, heat maps of correct, incorrect and partially correct instructions. The correct instruction is “the video showing a baby eating a meal and a man with a baseball cap assisting him”. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

4.1. Fixation distance

Evaluation method: A measure is required to reflect the distance between two sets of fixation points recorded under different viewing conditions. Kullback–Leibler divergence and the correlation coefficient are often used to evaluate the difference between two saliency distributions [11]. Normalized Scanpath Saliency can also be used to measure the mean value of the normalized saliency map at fixation locations [11]. However, these methods are not able to explicitly measure the location difference between two sets of points. EMD [53] measures the distance between two probability distributions by counting how much transformation one distribution must undergo to match the other. We use EMD to evaluate the distance between two fixation sets [54]. We extract one key frame every 2 s for each video, and the fixations of a key frame are the fixation points of all subjects recorded by the eye tracker on that frame. The fixation sets of frames p and q are denoted as P = ((w_1, p_1), \ldots, (w_i, p_i), \ldots, (w_n, p_n)) and Q = ((w_1, q_1), \ldots, (w_j, q_j), \ldots, (w_m, q_m)), where w_i (w_j) is the coordinate of a fixation point, and p_i (q_j) is the ratio of the number of fixations falling at w_i (w_j) to the total number of fixations of frame p (q). We define the cost of moving one point to another as the Euclidean distance, and then the minimum cost of moving one point set to the other is calculated by EMD. The EMD [53] between P and Q is formulated as

EMD(P, Q) = \min_{f_{i,j}} \frac{\sum_{i,j} f_{i,j} D(i,j)}{\sum_{i,j} f_{i,j}}, \quad \text{s.t.} \quad \sum_{j} f_{i,j} \le p_i, \; \sum_{i} f_{i,j} \le q_j, \; \sum_{i,j} f_{i,j} = 1    (1)

where D(i,j) is the Euclidean distance between w_i and w_j. The fixation distance of a video clip is defined as the average distance over all its key frames.

Distance matrix: We calculate the average fixation distance of all video clips under each pair of instruction types, and regard it as the fixation difference between the two semantic-instructed conditions. The results of Groups 1 and 2 for every pair of instruction types are shown in Tables 4 and 5 respectively, where F, C, I and P represent free viewing and correct, incorrect and partially correct instructions.

Observations: From Tables 4 and 5, we can find that (1) fixations under correct instruction are moderately different from those under incorrect instruction; (2) fixations under free viewing are closer to those under incorrect instruction than to those under correct instruction in Group 1; and (3) fixations under correct instruction are closer to those under partially correct instruction than to those under incorrect instruction in Group 2. The experimental results verify that visual attention is influenced by the level of noise in complex semantic instructions.
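As a concrete illustration of Eq. (1), the sketch below evaluates the EMD between two normalized fixation sets as a small linear program. It is an illustrative implementation assuming SciPy is available; the dense constraint matrices are adequate for the few dozen fixation points per key frame but are not meant to be efficient, and the `((x, y), weight)` input convention is our own.

```python
import numpy as np
from scipy.optimize import linprog

def fixation_emd(P, Q):
    """Earth Mover's Distance of Eq. (1) between two fixation sets.

    P, Q : lists of ((x, y), weight) pairs; the weights of each set sum to 1
           (the fixation-count ratios of Section 4.1).
    """
    xp = np.array([c for c, _ in P], float); wp = np.array([w for _, w in P])
    xq = np.array([c for c, _ in Q], float); wq = np.array([w for _, w in Q])
    n, m = len(P), len(Q)
    D = np.linalg.norm(xp[:, None, :] - xq[None, :, :], axis=2).ravel()  # cost D(i, j)

    # Inequality constraints: row sums of f <= p_i, column sums of f <= q_j.
    A_ub = np.zeros((n + m, n * m))
    for i in range(n):
        A_ub[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_ub[n + j, j::m] = 1.0
    b_ub = np.concatenate([wp, wq])

    # Total flow equals 1, since both distributions are normalized.
    A_eq = np.ones((1, n * m)); b_eq = [1.0]

    res = linprog(D, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun   # equals sum_ij f_ij D(i,j), and sum_ij f_ij = 1
```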


However, it should be noted that the average distances between fixation sets under different viewing conditions do not deviate significantly (< 5.5%) with respect to the frame size of 320 × 240.

4.2. Saliency map vs. target region

Target region: In this section, we explore how eye movements behave with respect to the semantically related objects under different viewing conditions. The semantically related objects are the objects specified by the correct instructions, such as a bus, a flag, a human face, a beard or a colored dress. We refer to these object regions related to the correct instruction as the target region (TR). To obtain a manually labeled TR, we divide a key frame into 8 × 8 blocks and label a block with 1 if the whole block is in the TR (see Fig. 4), and with 0 otherwise. We labeled all the videos, consisting of 2716 key frames in total sampled every 2 s. On average, 12.46% of the image area is occupied by the TR when only key frames containing target objects are considered, and 8.04% of the image area over all key frames.

Saliency map: The fixation saliency map (FSM), F_i(x, y), of frame i is generated from the fixations recorded by the eye tracker. First, an unsmoothed saliency map F'_i(x, y) of frame i is obtained by

F'_i(x, y) = \sum_j \mathrm{sgn}\big( |x - x_{i,j}| < r \;\&\&\; |y - y_{i,j}| < r \big), \qquad \mathrm{sgn}(x) = \begin{cases} 1, & x \neq 0 \\ 0, & x = 0 \end{cases}    (2)

where (x_{i,j}, y_{i,j}) is the j-th fixation location in frame i, and r is the effective salient distance of a fixation.

Table 4. Fixation distance matrix for Group 1.

EMD (in pixels)   F         C         I
F                 0         20.9206   20.7325
C                 20.9206   0         21.2352
I                 20.7325   21.2352   0

Table 5. Fixation distance matrix for Group 2.

EMD (in pixels)   C         I         P
C                 0         24.8799   23.8127
I                 24.8799   0         24.5284
P                 23.8127   24.5284   0


Then, F'_i(x, y) is convolved with a 2D Gaussian kernel N to produce the FSM, i.e.,

F_i(x, y) = F'_i(x, y) * N    (3)

Evaluations: Using the TR as the ground truth, ROC curves are plotted to evaluate how well the fixation saliency map (FSM) matches the TR. We use different thresholds to cut off the FSM and obtain thresholded saliency regions (TSM) of different sizes. The true positive rate (TPR) is plotted as a function of the false positive rate (FPR) for the saliency regions at different thresholds to produce the ROC curve. The closer the area under the curve (AUC) is to 1, the better the saliency map matches the TR. With \overline{TR} denoting the non-target region,

TPR = \frac{|TSM \cap TR|}{|TR|}    (4)

FPR = \frac{|TSM \cap \overline{TR}|}{|\overline{TR}|}    (5)
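A minimal sketch of Eqs. (2)–(5) follows: it accumulates fixations into an FSM and scores the thresholded map against the hand-labeled target region. The salient radius `r`, the Gaussian width `sigma` and the threshold grid are illustrative assumptions, not values reported in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_saliency_map(fixations, shape, r=16, sigma=8.0):
    """Eqs. (2)-(3): accumulate fixations into an unsmoothed map F' and smooth it.

    fixations : iterable of (x, y) pixel coordinates recorded on this frame.
    shape     : (height, width) of the frame.
    r, sigma  : effective salient radius and Gaussian width (assumed values).
    """
    F0 = np.zeros(shape)
    for x, y in fixations:
        x0, x1 = max(0, int(x) - r), min(shape[1], int(x) + r + 1)
        y0, y1 = max(0, int(y) - r), min(shape[0], int(y) + r + 1)
        F0[y0:y1, x0:x1] += 1.0          # sgn(|x - x_ij| < r && |y - y_ij| < r), summed over j
    return gaussian_filter(F0, sigma)     # convolution with a 2-D Gaussian kernel N

def roc_against_target(fsm, target_mask, n_thresholds=100):
    """Eqs. (4)-(5): sweep thresholds over the FSM, score the thresholded
    saliency map (TSM) against the target region TR, and return the AUC."""
    tpr, fpr = [], []
    for th in np.linspace(fsm.max(), fsm.min(), n_thresholds):
        tsm = fsm >= th
        tpr.append((tsm & target_mask).sum() / max(target_mask.sum(), 1))
        fpr.append((tsm & ~target_mask).sum() / max((~target_mask).sum(), 1))
    return np.array(fpr), np.array(tpr), np.trapz(tpr, fpr)
```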

In order to inspect how eye movements change with respect to the target regions over time, we plot ROC curves of fixation saliency maps (FSMs) for the following four cases: (1) the FSM of the first key frame in which target objects appear; (2) the FSM of the second key frame in which target objects appear; (3) the FSMs of these two key frames; and (4) the FSMs of all key frames which contain the target objects. We call the duration of case (3) a quick glance, which denotes a time period of about 4 s since the sampling interval of key frames is 2 s in our experiments. The ROC curves of the FSMs under three different viewing conditions of Groups 1 and 2 for the four cases are plotted in Fig. 5(a)–(h), respectively. The AUC within a quick glance is shown in Fig. 6.

Observations: From these figures we can see the following. (a) During the quick glance at target objects, the AUC of correctly instructed viewing is notably higher than those of free viewing (by 4% in Fig. 5(c)) and incorrectly instructed viewing (by 3.3% in Fig. 5(c) and (g)). In Fig. 5(g), the AUC of correctly instructed viewing is also slightly higher than that of partially correctly instructed viewing. This observation manifests that semantic instruction indeed influences the attention distribution of the HVS. Subjects are informed by the instruction which objects they need to search for before viewing. The knowledge about the target objects and their representations stored in their declarative memory is then activated, so that the subjects' attention is directed to the target region under correct instruction. In other words, guided by the top-down control of declarative memory, subjects catch larger portions of the target region with a correct instruction than under other viewing conditions at the first glance, even when the target region is not a bottom-up salient region, as shown in Fig. 1(a). (b) In the experiments of Group 1, the ROC difference between correctly and incorrectly instructed viewing for the first key frame

Fig. 4. Examples of the original video frame (left) and its target regions (right). Correct instruction of the video is “a woman wearing a white fur coat and orange pants sitting in a black and orange living room chair on a sea shore showing the woman digging in sand then walking from scene”. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)



Fig. 5. (a), (b), (c) and (d) are the ROC curves of cases (1), (2), (3) and (4) of Group 1, and (e), (f), (g) and (h) are the ROC curves of cases (1), (2), (3) and (4) of Group 2, respectively. The four cases are: (1) the FSM of the first key frame in which target objects appear; (2) the FSM of the second key frame in which target objects appear; (3) the FSMs of these two key frames; and (4) the FSMs of all key frames which contain the target objects.


Fig. 6. Illustrations of AUC of cases (1) and (2) under free viewing and correctly instructed viewing.

(Fig. 5(a)) is larger than that of the second key frame (Fig. 5(b)). Nevertheless, the ROC difference between correctly and incorrectly instructed viewing for the first key frame (Fig. 5(e)) is smaller than that for the second key frame (Fig. 5(f)) in Group 2. This observation shows that the viewing environment (such as the interference of semantic noise in a partially correct instruction) affects the viewing pattern. (c) The ROC curves over all key frames of Group 1 and Group 2, shown in Fig. 5(d) and (h) respectively, are quite close under the three kinds of viewing conditions. This is consistent with the experimental results in Section 4.1. It implies that the portions of the target region caught by subjects under different kinds of instructions are about the same when all key frames are considered. A similar fact was also found elsewhere for eye-movement measurements under free viewing and task-demanded viewing for action recognition [47]. This phenomenon may be explained as follows. Subjects are able to judge whether the video content matches the instruction within a short period of time (a quick glance) under correctly instructed viewing. Thereafter, they continue to watch the video to gather its semantic information, as in free viewing and incorrectly instructed viewing.

During the latter period of time (after the quick glance), the predictions about where relevant or interesting events will unfold are largely shaped by procedural memory experience. This is similar to language grammar learning, which was shown to depend on procedural memory [19]. Since a quick glance is much shorter than the whole duration of a video, the average fixation patterns over the entire video under different viewing conditions are not notably different. This observation demonstrates the significant impact of action-oriented procedural memory on semantic-instructed attention, since gathering the semantic meaning of a video is the subject's main action in both instructed viewing and free viewing (which requires subjects to describe the video content after viewing in our experiments). (d) From Fig. 6, we can see that the AUC values of the first key frame are larger than those of the second key frame under both correctly instructed viewing and free viewing. This observation indicates that the gaze has shifted from the target region to other regions in the second frame. Cognitive experiments on static images have shown that the content of short-term memory provides straightforward evidence about attention distribution [10]. In the case of semantic-instructed viewing of videos, the information acquired from previous frames is stored in short-term memory, which influences the attention distribution on the current frame. The attention transfer that happens between the target-appearing frame and the next frame shows that short-term memory guides attention under semantic-instructed viewing.


5. Proposed model

The observations in Section 4.2 have shown that memory guides visual attention under semantic-instructed viewing conditions. Inspired by this, we propose to introduce memory-guided top-down cues into a video saliency detection model. As mentioned in Section 2, memory cues have rarely been studied in the literature. In this paper we consider the procedural long-term memory that plays a significant role in semantic-instructed viewing, as mentioned in observation (c) of Section 4.2. We do not consider the other type of long-term memory, i.e., declarative memory, for the following reason. Declarative memory is related to the knowledge and facts about the objects of concern and their representations. For general natural scenes, diverse types of objects and scenes would be involved in both the semantic instructions and the viewed videos, and it is difficult and time-consuming to model declarative memory because this would require near-perfect detection and recognition for a wide variety of objects and scenes. Besides the memory-guided top-down cue, we also incorporate the stimuli of bottom-up attention, i.e., salient low-level features, into our model. Accordingly, a three-layer generative graphical model is constructed to probabilistically combine a bottom-up module with the procedural memory-guided top-down module. In the top-down module, the experience of attention distribution on similar scenes in procedural long-term memory is simulated by a linear mapping from a global visual feature of the scene to the top-down saliency for static scenes (see Fig. 7(a)). In this paper, we use the scene gist as the global feature and GBVS [23] as our bottom-up module.

Furthermore, considering the significance of short-term memory in the fixation evolution process, we incorporate the influence of short-term memory into the model for dynamic saliency. It has been shown by observation (d) in Section 4.2 that the information acquired from previous frames and stored in short-term memory affects the attention distribution on the current frame. However, the mechanism by which short-term memory guides attention is not yet completely understood. One experiment showed that spatial short-term memory is a key to the inhibition of return [55], while, on the contrary, another experiment indicated that an increase of spatial short-term memory causes a sharp decrease of inhibition [56]. To appropriately model the visual dynamics, we propose to model spatial short-term-memory-guided visual attention by learning the transitions of true saliency states between neighboring frames with an HMM-like conditional probabilistic chain (see Fig. 7(b)).


A Conditional Random Field (CRF) has been claimed to be superior to generative models such as HMMs because of (1) its handling of non-independent features of the observation sequence, which means that the observations may represent attributes at different levels of granularity or aggregate properties of the sequence; and (2) its allowance for long-distance dependencies, which means that features from past and future observations can be used to explicitly represent long-distance interactions. However, in our case, the observations of bottom-up saliency and of the global gist feature giving rise to procedural memory saliency are independent. In addition, the effect of short-term memory on attention is backward and time-limited: the attention distribution at the present moment is only influenced by that of a short preceding period. Thus we construct the dynamic model with an HMM, whose training is computationally more efficient and converges faster than a CRF. The primary structure of this model is shown in Fig. 7(b) and [57].

5.1. Long-term top-down saliency

The procedural long-term memory is simulated by learning the relationship between the scene gist and the top-down saliency. Assume that the relationship between the two is linear [30]:

L^m(p_i) = f_i(G) = W_i G    (6)

where L^m(p_i) is the procedural memory-guided saliency value at pixel p_i, i = 1, \ldots, N, and N is the number of pixels in the scene. L^m is an N-dimensional vector holding the procedural memory-guided saliency of all pixels. G \in \mathbb{R}^M is the global gist feature describing the contextual scene information. f_i is a linear function of G for pixel p_i, and W_i contains the parameters of f_i. The parameters for all pixels form a matrix W \in \mathbb{R}^{N \times M}, which simulates the knowledge stored in procedural memory for guiding visual attention. The conditional probability of G given L^m is thus defined as

P(G \mid L^m) = \delta(W G - L^m)    (7)

where \delta denotes the element-wise Dirac function. In fact, the \delta function can be replaced by other functions such as the Hamming function, the Gaussian function or other members of the exponential family; the choice reflects different prior assumptions on the residual errors of the fit. Here we assume that the learned mapping should perfectly match the procedural memory-guided saliency.
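The matrix W of Eq. (6) is learned by least squares from gist features and FSMs (steps 2–4 of the training procedure in Section 5.4). The sketch below shows one way such a fit could be implemented; the ridge regularization `reg` and the rescaling of the predicted map are additions for numerical stability, not part of the paper's formulation.

```python
import numpy as np

def fit_gist_to_saliency(G, L, reg=1e-3):
    """Fit the linear map of Eq. (6) by (ridge-regularized) least squares.

    G : (K, M) gist features of K training key frames.
    L : (K, N) target saliency maps (FSMs in Algorithm 1), flattened to N pixels.
    Returns W of shape (N, M) so that W @ g approximates the memory-guided saliency.
    """
    M = G.shape[1]
    # Normal equations: (G^T G + reg*I) W^T = G^T L
    W_T = np.linalg.solve(G.T @ G + reg * np.eye(M), G.T @ L)   # (M, N)
    return W_T.T

def memory_guided_saliency(W, g):
    """Eq. (6): predict L^m for one frame from its gist vector g."""
    s = W @ g
    return (s - s.min()) / (s.max() - s.min() + 1e-12)   # rescale to [0, 1] for downstream use
```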

Fig. 7. (a) Static saliency model. (b) Dynamic saliency model. B represents the bottom-up saliency map. G is the global scene gist feature. L^m represents the procedural memory-guided top-down saliency map and S^m represents the effect of short-term memory on the saliency distribution. S is the final static saliency map. S_t and S_{t+1} are the saliency maps for frames t and t+1, respectively, when considering the dynamic attention distribution pattern.


5.2. Static saliency

We propose a generative model that combines the procedural memory-guided top-down saliency and the bottom-up saliency to obtain the saliency of a static scene. The saliency map S, which is a latent variable, depends on two observed variables, the bottom-up saliency B and the scene gist G (see Fig. 7(a)). We omit the subscript t for video frames since there is no confusion for static scenes:

P(S \mid B, G) \propto P(S) P(B, G \mid S) \propto \sum_{L^m} P(S) P(B, L^m \mid S) P(G \mid L^m)    (8)

where L^m, an inter-layer latent variable in Fig. 7(a), is obtained by procedural memory processing the scene context. Maximizing the posterior probability P(S \mid B, G) yields the static saliency. We make the following reasonable assumptions. First, P(S) follows a uniform distribution. Second, the saliency value of each pixel (p_i)_{i=1}^{N} in the bottom-up and procedural memory-guided saliency maps is generated from that of the co-located pixel in S. Last, the saliency values of different pixels are independent of each other. Therefore,

P(B, L^m \mid S) = P(B \mid S) P(L^m \mid S) = \prod_{i=1}^{N} P_{B_i}(B_{p_i} \mid S_{p_i}) P_{L^m_i}(L^m_{p_i} \mid S_{p_i})    (9)

If the probability distributions P_{B_i} and P_{L^m_i} are assumed to be i.i.d. over locations, Eq. (9) can be simplified as

P(B, L^m \mid S) = \prod_{i=1}^{N} P_{B}(B_{p_i} \mid S_{p_i}) P_{L^m}(L^m_{p_i} \mid S_{p_i})    (10)

P_B and P_{L^m} are Gaussian mixtures in our model.
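Under the assumptions of Eqs. (8)–(10), a uniform prior on S and pixel independence allow the MAP saliency to be computed pixel by pixel. The sketch below illustrates this with a single Gaussian observation model per discrete saliency state standing in for the learned Gaussian mixtures; the state grid and the variances `var_b`, `var_l` are assumptions for illustration only.

```python
import numpy as np

def gauss_loglik(x, mean, var):
    """Log-density of a 1-D Gaussian (a single component stands in for the
    paper's Gaussian mixtures in this sketch)."""
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

def static_saliency_map(B, Lm, states, var_b=0.05, var_l=0.05):
    """Pixel-wise MAP of Eq. (8): uniform prior on S, independent pixels, and
    observation likelihoods centred on the hidden saliency state.

    B, Lm  : (N,) bottom-up and memory-guided saliency per pixel, in [0, 1].
    states : (K,) grid of discrete saliency states, e.g. np.linspace(0, 1, 64).
    """
    # scores[p, k] = log P_B(B_p | s_k) + log P_Lm(Lm_p | s_k)
    scores = (gauss_loglik(B[:, None], states[None, :], var_b)
              + gauss_loglik(Lm[:, None], states[None, :], var_l))
    return states[np.argmax(scores, axis=1)]   # MAP saliency state per pixel
```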

5.3. Dynamic saliency

5.3.1. Short-term top-down saliency

As a key factor that influences the saliency pattern of visual dynamics, inter-frame information is introduced into our model. From observation (d) in Section 4.2, we know that the shift of fixations across consecutive frames is influenced by the previous saliency stored in short-term memory. Since the mechanism of short-term memory is not comprehensively understood, the maintenance and inhibition behavior of spatial short-term memory is learned in our model by estimating transition probabilities. If the state values of the saliency maps between two consecutive frames mostly maintain or increase, the corresponding transition probabilities are large and maintenance dominates; otherwise, inhibition works. We assume that the transition probability follows the Markov property and is independent of time. Then the joint probability of the saliencies of all frames in a dynamic scene is

P(S_1, S_2, \ldots, S_T) = P(S_1) \prod_{t=2}^{T} P_{S^m}(S_t \mid S_{t-1})    (11)

where T is the time duration of the dynamic scene, expressed simply as the number of key frames. S^m simulates the effect of short-term memory on the attention distribution. The transition probability P_{S^m} is learned from the true saliency maps recorded by the eye tracker.

5.3.2. Dynamic saliency model

Incorporating the sequence of bottom-up saliencies B_t, t = 1, \ldots, T, the procedural memory-guided saliencies L^m_t, t = 1, \ldots, T, and the effect of short-term memory S^m, we formulate the dynamic saliency as

P(S_1, \ldots, S_T \mid B_1, \ldots, B_T, G_1, \ldots, G_T)
\propto P(S_1, \ldots, S_T) P(B_1, \ldots, B_T, G_1, \ldots, G_T \mid S_1, \ldots, S_T)    (12a)
\propto P(S_1) \prod_{t=2}^{T} P_{S^m}(S_t \mid S_{t-1}) \, P(B_1, \ldots, B_T, G_1, \ldots, G_T \mid S_1, \ldots, S_T)    (12b)
\propto P(S_1) \prod_{t=2}^{T} P_{S^m}(S_t \mid S_{t-1}) \prod_{t=1}^{T} P(B_t, G_t \mid S_t)    (12c)
\propto P(S_1) \prod_{t=2}^{T} P_{S^m}(S_t \mid S_{t-1}) \prod_{t=1}^{T} \sum_{L^m} P(B_t, L^m_t \mid S_t) P(G_t \mid L^m_t)    (12d)
\propto P(S_1) \prod_{t=2}^{T} P_{S^m}(S_t \mid S_{t-1}) \prod_{t=1}^{T} \sum_{L^m} \prod_{i=1}^{N} P_B(B_{t,p_i} \mid S_{t,p_i}) P_{L^m}(L^m_{t,p_i} \mid S_{t,p_i}) P(G_t \mid L^m_t)    (12e)

where G_t represents the gist feature of frame t, and S_t represents the final saliency map of frame t. Eq. (12b) can be expressed as Eq. (12c) since B_t and G_t depend only on S_t (see Fig. 7(b)). Substituting Eq. (10) into Eq. (12d), we obtain Eq. (12e), where P_B and P_{L^m} are assumed independent of time. Maximizing the above joint posterior probability yields the dynamic saliency maps. We use the Viterbi algorithm with a minor modification to perform the MAP process, since our dynamic model is HMM-like (see Fig. 7(b)).
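Because the chain of Eq. (12) is HMM-like, standard Viterbi decoding applies once per-frame emission scores and the learned transition probabilities of Eq. (11) are available. The sketch below treats each pixel's discretized saliency value as a chain over key frames; this per-pixel factorization, the add-eps smoothing and the omission of the paper's "minor modification" to Viterbi are simplifying assumptions.

```python
import numpy as np

def learn_transition(fsm_state_tracks, K, eps=1.0):
    """Estimate P_Sm of Eq. (11) by counting state transitions between
    consecutive key frames of the ground-truth FSMs (add-eps smoothing).

    fsm_state_tracks : iterable of (T,) arrays of state indices (one per pixel track).
    """
    counts = np.full((K, K), eps)
    for seq in fsm_state_tracks:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return np.log(counts / counts.sum(axis=1, keepdims=True))

def viterbi_saliency(emission_logp, trans_logp, prior_logp):
    """MAP decoding of one pixel's saliency-state chain in Eq. (12).

    emission_logp : (T, K) log P(B_t, G_t | S_t = s_k), e.g. the static scores above.
    trans_logp    : (K, K) log P_Sm(s_k at t | s_j at t-1), from learn_transition.
    prior_logp    : (K,)   log P(S_1 = s_k).
    """
    T, K = emission_logp.shape
    delta = prior_logp + emission_logp[0]         # best log-score ending in each state
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + trans_logp        # cand[j, k]: come from j, go to k
        back[t] = np.argmax(cand, axis=0)
        delta = cand[back[t], np.arange(K)] + emission_logp[t]
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(delta)
    for t in range(T - 1, 0, -1):                 # backtrace the MAP state sequence
        path[t - 1] = back[t, path[t]]
    return path
```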

5.4. Model training

In the proposed generative model, the variable L^m is latent, and it could in principle be learned with the EM algorithm. However, the high dimensionality of the linear weight matrix W and of the Gaussian mixture models P_B and P_{L^m} makes EM learning very difficult to converge. Therefore, we use an approximate algorithm instead. In this algorithm, the procedural memory-guided top-down saliency is first replaced by the FSM (introduced in Section 4.2). Then W is learned through Eq. (6) by least squares. After W is trained, L^m is recalculated from G with the trained W and is used to estimate the parameters of the Gaussian mixtures. The complete training procedure is shown in Algorithm 1.

Algorithm 1. Training steps for our three-layer generative model.
1: Compute B and G for all training examples;
2: Replace all procedural memory-guided saliency maps with FSMs, L^m = F;
3: Get W from G and L^m of the training examples with Eq. (6) by least squares;
4: Re-obtain L^m for all training examples with L^m = W G;
5: Obtain the parameters of the Gaussian mixtures P_B and P_{L^m} using all maps of procedural memory-guided saliency, bottom-up saliency and FSM with Eq. (10) by the EM algorithm;
6: If the training is for the dynamic scene saliency model, compute the transition probability P_{S^m} with Eq. (11) using all FSMs.

6. Performance comparisons

In this section, we adopt the ROC to evaluate the proposed memory-guided probabilistic generative model. We compare our model to the state-of-the-art saliency models described in Section 2, i.e., AIM [24], CA [28], FT [26], GBVS [23], GBVSD [23], HC [27], RC [27], Itti [2], Judd [29], SR [25], SUN [49] and Peters [30], on our eye-tracking dataset REMoT. Among the compared methods, GBVSD [23] computes the saliency of a video with motion information. Furthermore, we conduct experiments on the widely used public video dataset RSD [37] for a fair comparison. The RSD dataset mainly covers six kinds of content: documentary, advertisement, cartoon, news, movies and surveillance. 23 subjects are asked to manually label the salient regions with at least one rectangle per frame, and at least 10 subjects label each frame. All rectangles labeled on a frame are added together to generate a binary map, which is then convolved with a 2D Gaussian kernel just like in [25]; in this way an FSM with smoothed edges is obtained. In the experiments on REMoT, we randomly choose 15 videos for training and 5 videos for validation, under free viewing and correctly instructed viewing respectively. The remaining 10 videos under free viewing and 40 videos under correctly instructed viewing serve as the test data. On the RSD dataset, we randomly choose 15 videos for training, 30 videos for testing and 5 videos for validation.


Fig. 8. (a) ROC curves based on free viewing on REMoT. (b) ROC curves based on correctly instructed viewing on REMoT.


The optimal parameters of each method are selected by a validation process with the 5 validation videos of each dataset. In our memory-guided probabilistic model, the number of saliency states (possible values of saliency) and the number of pixels per key frame are manually set to 64 and 4800, respectively, since our model is insensitive to these two parameters.

Fig. 8(a) and (b) present the ROC curves of the various saliency models, using the FSMs of free viewing and correctly instructed viewing on REMoT as ground truth, respectively. From Fig. 8(a) and (b), we can see that our two memory-guided models (Static and Dynamic) achieve the best performance, followed by Peters and GBVS, and then Itti and Judd, for both free viewing and correctly instructed viewing. Fig. 9 presents the ROC curves of the various saliency models on the RSD dataset. From Fig. 9, we can see that our dynamic model achieves the best performance, followed by Peters, Judd and GBVS. Our dynamic saliency model performs better than our static model, which indicates the effectiveness of modeling short-term memory as a Markov process in an HMM-like fashion. In addition, the performance of most models on RSD is consistent with that on our dataset REMoT. The best-performing models are all cognitively grounded, either based on cognitive principles or learned from actual eye-movement data. The models GBVS, Itti and Judd have been shown to perform well on several other free-viewing datasets [58]. By adding memory-guided top-down cues to the bottom-up model GBVS, our model achieves significant performance improvements over GBVS for both free viewing and correctly instructed viewing. It should be noted that any good bottom-up model could be substituted for GBVS in our framework.

As discussed in Section 4, the fixation patterns of free viewing and semantic-instructed viewing show no significant difference when all key frames over the duration of a video are considered. Two points are therefore worth noting. First, if fixations under semantic-instructed viewing are difficult to obtain, the fixations of free viewing can be used for model training instead, after removing the fixations of the first few frames in which the target appears. Second, although most of the compared models are bottom-up models designed for free viewing, their performance decreases only slightly under correctly instructed viewing (see Fig. 8).
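For completeness, ROC curves of the kind reported in Figs. 8 and 9 can be produced by sweeping a threshold over each predicted saliency map and comparing against the FSMs binarized at a fixed level. The sketch below is a generic version of this standard procedure; the threshold count, binarization level and function name are our own choices and are not taken from the paper.

```python
import numpy as np


def roc_curve_points(pred_maps, fsm_maps, fsm_threshold=0.5, n_thresholds=100):
    """Compute (FPR, TPR) pairs by thresholding predicted saliency maps and
    comparing them with fixation saliency maps binarized at fsm_threshold."""
    pred = np.concatenate([m.ravel() for m in pred_maps])
    gt = np.concatenate([m.ravel() for m in fsm_maps]) >= fsm_threshold
    thresholds = np.linspace(pred.min(), pred.max(), n_thresholds)
    fpr, tpr = [], []
    for th in thresholds:
        hit = pred >= th
        tp = np.logical_and(hit, gt).sum()
        fp = np.logical_and(hit, ~gt).sum()
        tpr.append(tp / max(gt.sum(), 1))      # true positive rate
        fpr.append(fp / max((~gt).sum(), 1))   # false positive rate
    return np.array(fpr), np.array(tpr)
```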


Fig. 9. ROC curves on RSD dataset.

7. Conclusions

In this paper, we investigate the influence and computational modeling of complex semantics on HVS attention under fairly natural conditions. We first make a significant effort to establish a new eye-tracking dataset, REMoT, under free viewing and three kinds of semantic instructions, to provide essential data for our study. By analyzing the fixation locations recorded in this dataset, we find that the fixation patterns of semantic-instructed attention can be explained by the long-term memory and short-term memory of HVS. Inspired by this finding, we then propose a probabilistic generative saliency model which combines bottom-up and memory-guided top-down saliency for static and dynamic scenes. Our model achieves good performance on REMoT for both free viewing and correctly instructed viewing, and also outperforms state-of-the-art models on the widely used dataset RSD. Experimental results demonstrate the effectiveness of memory as a top-down cue in attention modeling. In the future, we will conduct further studies on saccade dynamics and the sequencing of eye-tracking data on REMoT, for a deeper understanding of the evolution of visual attention under semantic-instructed viewing.

Acknowledgements

This work is supported by the Chinese National Natural Science Foundation (61471049, 61101212 and 61372169), the National High Technology R&D Program of China (863 Program) (No. 2012AA012505), and the Fundamental Research Funds for the Central Universities. The authors thank Yajuan Bai and Wanling Liu from Beijing Normal University for advice on establishing the dataset, and Yilun Li from Beijing University of Posts and Telecommunications for building the webpage for the dataset. The authors also thank Shuhui Wang from the Institute of Computing Technology, Chinese Academy of Sciences for advice on revising the paper.

References

[1] K. Koch, J. McLean, R. Segev, M.A. Freed, M.J. Berry II, V. Balasubramanian, P. Sterling, How much the eye tells the brain, Curr. Biol. 16 (14) (2006) 1428–1434.
[2] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. 20 (11) (1998) 1254–1259.
[3] C. Zetzsche, K. Schill, H. Deubel, G. Krieger, E. Umkehrer, S. Beinlich, Investigation of a sensorimotor system for saccadic scene analysis: an integrated approach, in: Proceedings of the 5th International Conference on Simulation of Adaptive Behavior, vol. 5, 1998, pp. 120–126.
[4] P. Reinagel, A.M. Zador, Natural scene statistics at the centre of gaze, Netw.: Comput. Neural Syst. 10 (4) (1999) 341–350.
[5] R.J. Peters, A. Iyer, L. Itti, C. Koch, Components of bottom-up gaze allocation in natural images, Vis. Res. 45 (18) (2005) 2397–2416.
[6] D.J. Parkhurst, E. Niebur, Texture contrast attracts overt visual attention in natural scenes, Eur. J. Neurosci. 19 (3) (2004) 783–789.
[7] R. Achanta, F. Estrada, P. Wils, S. Süsstrunk, Salient region detection and segmentation, in: Computer Vision Systems, Springer, Santorini, Greece, 2008, pp. 66–75.
[8] D. Walther, U. Rutishauser, C. Koch, P. Perona, On the usefulness of attention for object recognition, in: Workshop on Attention and Performance in Computational Vision at ECCV, 2004, pp. 96–103.
[9] M. Xu, J. Wang, M.A. Hasan, X. He, C. Xu, H. Lu, J.S. Jin, Using context saliency for movie shot classification, in: 2011 IEEE International Conference on Image Processing (ICIP), IEEE, Brussels, Belgium, 2011, pp. 3653–3656.
[10] D. Soto, D. Heinke, G.W. Humphreys, M.J. Blanco, Early, involuntary top-down guidance of attention from working memory, J. Exp. Psychol.: Hum. Percept. Perform. 31 (2) (2005) 248.
[11] A. Borji, L. Itti, State-of-the-art in visual attention modeling, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 185–207.
[12] G.F. Woodman, E.K. Vogel, S.J. Luck, Visual search remains efficient when visual working memory is full, Psychol. Sci. 12 (3) (2001) 219–224.
[13] P.E. Downing, Interactions between visual working memory and selective attention, Psychol. Sci. 11 (6) (2000) 467–473.
[14] D. Soto, G.W. Humphreys, D. Heinke, Working memory can guide pop-out search, Vis. Res. 46 (6) (2006) 1010–1018.
[15] S.W. Han, M.S. Kim, Do the contents of working memory capture attention? Yes, but cognitive control matters, J. Exp. Psychol.: Hum. Percept. Perform. 35 (5) (2009) 1292.
[16] E. Moores, L. Laiti, L. Chelazzi, Associative knowledge controls deployment of visual selective attention, Nat. Neurosci. 6 (2) (2003) 182–189.
[17] F. Huettig, G. Altmann, Word meaning and the control of eye fixation: semantic competitor effects and the visual world paradigm, Cognition 96 (1) (2005) B23–B32.
[18] A.D. Hwang, H.C. Wang, M. Pomplun, Semantic guidance of eye movements in real-world scenes, Vis. Res. 51 (10) (2011) 1192–1205.
[19] M.T. Ullman, Contributions of memory circuits to language: the declarative/procedural model, Cognition 92 (1) (2004) 231–270.
[20] V. Navalpakkam, L. Itti, Modeling the influence of task on attention, Vis. Res. 45 (2) (2005) 205–231.
[21] J.J. Summerfield, J. Lepsien, D.R. Gitelman, M. Mesulam, A.C. Nobre, Orienting attention based on long-term memory experience, Neuron 49 (6) (2006) 905–916.
[22] W. Phillips, On the distinction between sensory storage and short-term visual memory, Percept. Psychophys. 16 (2) (1974) 283–290.
[23] J. Harel, C. Koch, P. Perona, Graph-based visual saliency, in: Advances in Neural Information Processing Systems, 2006, pp. 545–552.
[24] N. Bruce, J. Tsotsos, Saliency based on information maximization, in: Advances in Neural Information Processing Systems, 2005, pp. 155–162.
[25] X. Hou, L. Zhang, Saliency detection: a spectral residual approach, in: 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Minneapolis, Minnesota, USA, 2007, pp. 1–8.
[26] R. Achanta, S. Hemami, F. Estrada, S. Susstrunk, Frequency-tuned salient region detection, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Miami, Florida, USA, 2009, pp. 1597–1604.
[27] M.-M. Cheng, G.-X. Zhang, N.J. Mitra, X. Huang, S.-M. Hu, Global contrast based salient region detection, in: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Colorado Springs, USA, 2011, pp. 409–416.
[28] S. Goferman, L. Zelnik-Manor, A. Tal, Context-aware saliency detection, IEEE Trans. Pattern Anal. Mach. Intell. 34 (10) (2012) 1915–1926.
[29] T. Judd, K. Ehinger, F. Durand, A. Torralba, Learning to predict where humans look, in: 2009 IEEE International Conference on Computer Vision (ICCV), IEEE, Kyoto, Japan, 2009, pp. 2106–2113.
[30] R.J. Peters, L. Itti, Beyond bottom-up: incorporating task-dependent influences into a computational model of spatial attention, in: 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Minneapolis, Minnesota, USA, 2007, pp. 1–8.
[31] T. Liu, N. Zheng, W. Ding, Z. Yuan, Video attention: learning to detect a salient object sequence, in: 19th International Conference on Pattern Recognition (ICPR), 2008, pp. 1–4.
[32] E. Rahtu, J. Kannala, M. Salo, J. Heikkilä, Segmenting salient objects from images and videos, in: European Conference on Computer Vision (ECCV), 2010, pp. 366–379.
[33] D. Pang, A. Kimura, T. Takeuchi, J. Yamato, K. Kashino, A stochastic model of selective visual attention with a dynamic Bayesian network, in: 2008 IEEE International Conference on Multimedia and Expo (ICME), IEEE, Hannover, Germany, 2008, pp. 1073–1076.
[34] A. Borji, D.N. Sihite, L. Itti, Probabilistic learning of task-specific visual attention, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Providence, Rhode Island, USA, 2012, pp. 470–477.
[35] M.M. Chun, Y. Jiang, Contextual cueing: implicit learning and memory of visual context guides spatial attention, Cognit. Psychol. 36 (1) (1998) 28–71.
[36] A. Torralba, Contextual priming for object detection, Int. J. Comput. Vis. 53 (2) (2003) 169–191.
[37] J. Li, Y. Tian, T. Huang, W. Gao, A dataset and evaluation methodology for visual saliency in video, in: IEEE International Conference on Multimedia and Expo (ICME), 2009, pp. 442–445.
[38] R. Carmi, L. Itti, Visual causes versus correlates of attentional selection in dynamic scenes, Vis. Res. 46 (26) (2006) 4333–4345.
[39] O. Le Meur, P. Le Callet, D. Barba, Predicting visual fixations on video based on low-level visual features, Vis. Res. 47 (19) (2007) 2483–2498.
[40] S. Marat, M. Guironnet, D. Pellerin, et al., Video summarization using a visual attention model, in: Proceedings of the 15th European Signal Processing Conference (EUSIPCO), 2007.
[41] F. Shic, B. Scassellati, A behavioral analysis of computational models of visual attention, Int. J. Comput. Vis. 73 (2) (2007) 159–177.
[42] F. Boulos, W. Chen, B. Parrein, P. Le Callet, Region-of-interest intra prediction for H.264/AVC error resilience, in: 2009 IEEE International Conference on Image Processing (ICIP), IEEE, Cairo, Egypt, 2009, pp. 3109–3112.
[43] L. Itti, USC iLab, 〈http://ilab.usc.edu/vagba/dataset/〉, 2009.
[44] M. Dorr, T. Martinetz, K.R. Gegenfurtner, E. Barth, Variability of eye movements when viewing dynamic natural scenes, J. Vis. 10 (10) (2010) 28.
[45] J.M. Henderson, DIEM, 〈http://thediemproject.wordpress.com/〉, 2010.
[46] H. Hadizadeh, M.J. Enriquez, I.V. Bajic, Eye-tracking database for a set of standard video sequences, IEEE Trans. Image Process. 21 (2) (2012) 898–903.
[47] S. Mathe, C. Sminchisescu, Dynamic eye movement datasets and learnt saliency models for visual action recognition, in: Computer Vision – ECCV 2012, Springer, Florence, Italy, 2012, pp. 842–856.
[48] S. Winkler, Datasets, 〈http://stefan.winkler.net/resources.html〉, 2013.
[49] L. Zhang, M.H. Tong, T.K. Marks, H. Shan, G.W. Cottrell, SUN: a Bayesian framework for saliency using natural statistics, J. Vis. 8 (7) (2008) 32.
[50] H. Liu, D. Xu, Q. Huang, W. Li, M. Xu, S. Lin, Semantically-based human scanpath estimation with HMMs, in: 2013 IEEE International Conference on Computer Vision (ICCV), 2013, pp. 3232–3239.
[51] P. Over, G.M. Awad, J. Fiscus, B. Antonishek, M. Michel, A.F. Smeaton, W. Kraaij, G. Quénot, TRECVID 2010—An Overview of the Goals, Tasks, Data, Evaluation Mechanisms, and Metrics, National Institute of Standards and Technology, 2011.
[52] Eye-tracker, iView, 〈http://www.smivision.com/en/gaze-and-eye-tracking-systems/products/iview-x-hi-speed.html〉, 2013.
[53] Y. Rubner, C. Tomasi, L.J. Guibas, The earth mover's distance as a metric for image retrieval, Int. J. Comput. Vis. 40 (2) (2000) 99–121.
[54] T. Judd, F. Durand, A. Torralba, A benchmark of computational models of saliency to predict human fixations, MIT Technical Report, 2012.
[55] A.D. Castel, J. Pratt, F.I. Craik, The role of spatial working memory in inhibition of return: evidence from divided attention tasks, Percept. Psychophys. 65 (6) (2003) 970–981.
[56] C.Q. Jin Zhicheng, The effect of general attention capacity limits on inhibition of return, Acta Psychol. Sin. 35 (02) (2003) 163.
[57] Y. Hua, Z. Zhao, H. Tian, X. Guo, A. Cai, A probabilistic saliency model with memory-guided top-down cues for free-viewing, in: 2013 IEEE International Conference on Multimedia and Expo (ICME), 2013, pp. 1–6.
[58] A. Borji, D.N. Sihite, L. Itti, Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study, IEEE Trans. Image Process. 22 (1) (2013) 55–69.

Yan Hua received the B.S. degree in Communication Engineering from China Agricultural University, Beijing, China, in 2010. She is currently pursuing the Ph.D. degree at Beijing University of Posts and Telecommunications. Her research interests include saliency modeling, multimedia semantic analysis and correlation learning.

Meng Yang received the M.S. degree from the Department of Psychology of Beijing Normal University in June 2014. Her research focused on visual attention.

Zhicheng Zhao received the B.S. and M.S. degrees in Communication Engineering from Xidian University, China, in 1998 and 2003, respectively, and the Ph.D. degree in Communication and Information System from Beijing University of Posts and Telecommunications, China, in 2008. He is currently an assistant professor in the School of Information and Communication Engineering of Beijing University of Posts and Telecommunications. He has published more than 50 academic papers on video retrieval, computer vision and machine learning, which are his research interests.

Renlai Zhou is a professor with the School of Psychology and the State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University. His current work focuses on (1) emotion: the measurement of emotional responding; the diagnosis and intervention of mood disorders; emotion regulation; and the internal mechanisms of emotion perception; and (2) memory: the assessment of memory; the diagnosis and intervention of memory disorders; memory training; and the internal mechanisms of memory. Since 2004, his lab has published many works in journals such as Biological Psychology, Psychophysiology, International Journal of Psychophysiology, Brain and Cognition, Psychiatry Research, Neuroscience Letters, Journal of Psychophysiology, Acta Astronautica, Perceptual and Motor Skills: Learning and Memory, Chinese Science Bulletin and so on.

Anni Cai is a professor with Beijing University of Posts and Telecommunications (BUPT), Beijing, China. She received the Ph.D. degree from the University of California, Santa Barbara, USA, in 1988. Her research areas include multimedia communication, cognitive computing of visual information, video search technologies and pattern recognition. She has published 6 monographs, and has authored or coauthored about 70 academic papers in prestigious international journals and conferences.