Temporal resolution vs. visual saliency in videos: Analysis of gaze patterns and evaluation of saliency models

Manri Cheon (a,b), Jong-Seok Lee (a,b,*)

a School of Integrated Technology, Yonsei University, 406-840 Yeonsu-gu, Incheon, Korea
b Yonsei Institute of Convergence Technology, Yonsei University, 406-840 Yeonsu-gu, Incheon, Korea
* Corresponding author. Email addresses: [email protected] (Manri Cheon), [email protected] (Jong-Seok Lee)

Abstract

Temporal scalability of videos refers to the possibility of changing the frame rate adaptively for efficient video transmission. Changing the frame rate may alter the spatial locations that viewers pay attention to in the scene, which in turn significantly influences their quality perception. Therefore, in order to exploit temporal scalability effectively in applications, it is necessary to understand the relationship between frame rate variation and visual saliency. In this study, we answer the following three research questions: (1) Does the frame rate influence the overall gaze patterns (in an average sense over subjects)? (2) Does the frame rate influence the inter-subject variability of the gaze patterns? (3) Do the state-of-the-art saliency models predict human gaze patterns reliably for different frame rates? To answer the first two questions, we conduct an eye-tracking experiment. Under a free viewing scenario, we collect and analyze gaze-paths of human subjects watching high-definition (HD) videos having a normal or low frame rate. Our results show that both the average gaze-path and the subject-wise variability of the gaze-path are influenced by frame rate variation. Then, we apply representative state-of-the-art saliency models to the videos and evaluate their performance by using the gaze pattern data collected from the eye-tracking experiment in order to answer the third question. It is shown that there exists a trade-off relation between accuracy in predicting the gaze pattern and robustness to frame rate variation, which raises the necessity of further research in saliency modeling to simultaneously achieve both accuracy and robustness.

Keywords: Temporal scalability, eye-tracking, frame rate, perception, quality of experience, saliency model
1. Introduction

Nowadays, video content delivery services over networks have become popular, and a significant amount of content is consumed through online video services. Many TV broadcasters have switched their broadcasting systems from analog to digital and now provide TV services over the Internet, e.g., IPTV. In addition, many online video services such as YouTube, Netflix, and Vimeo provide high quality videos over networks. As the technology has advanced and become widespread, high quality videos are made, distributed, and viewed easily by public users as well as experts. In particular, high definition (HD) video content is popularly consumed in many video applications, which achieves improved visual quality at the expense of an increased volume of data. As this trend continues, it can be predicted that video traffic will continue to increase. Therefore, there exist challenges in these video services, including the limit in the network capacity and user heterogeneity in terms of the network environment and terminal capability.

Video scalability provides flexible solutions to such challenges, i.e., the data rate and quality parameters of a scalable video can be adapted without the necessity of re-encoding the video data [1]. It is supported by recent video compression standards, e.g., the scalable extension of H.264/AVC (SVC) [2] and the scalable extension of HEVC (SHVC) [3]. In addition, the recently standardized dynamic adaptive streaming over HTTP (DASH) technique [4] provides an efficient framework to implement video streaming systems based on video scalability [5]. There are several dimensions of video scalability, e.g., temporal, spatial, and quality scalability. Among these, temporal scalability refers to the possibility of changing the frame rate adaptively. It has been used singly or in combination with other scalability dimensions for video transmission in recent studies [6, 7].

In general, changing the frame rate affects human visual perception significantly.
Thus, it is important to understand the effect of frame rate variations on viewers' perception in order to exploit temporal scalability effectively. The effects of the video frame rate on human performance were investigated by Chen et al. [8]. In that study, research on the effects of frame rates on perceptual performance was reviewed, from which it was concluded that a threshold of the low frame rate for conducting psychomotor and perceptual tasks is around 15 Hz.

In particular, we focus on the influence of the frame rate change on visual attention in this study. Changing the frame rate alters perceived motion information, which consequently affects visual attention. In the study of Itti et al. [9], it was shown that the bottom-up visual attention mechanism is influenced by motion information in the given visual stimulus.

Visual attention has been considered important in many applications related to perceptual quality, video compression, computer vision, etc. Quality perception of viewers is highly dependent on where they pay attention in the scene. Ninassi et al. [10] showed that visual artifacts in a salient region attended by viewers are likely to be more annoying than those in other areas. You et al. [11] proposed a quality metric based on balancing the influence of the entire visual stimuli and the attended stimuli on the perceived video quality. Video compression can also exploit the different visual sensitivity of humans for attended and unattended regions, which is referred to as perceptual video compression [12]. Machine vision is also often based on the visual attention mechanism in order to simulate human capabilities of visual scene understanding, visual search, etc.; for instance, a model integrating top-down and bottom-up attention for fast object detection was proposed by Navalpakkam and Itti [13]. It is apparent that, if the video frame rate variation significantly affects visual attention, such effects need to be considered in designing the aforementioned applications based on visual attention.

Gulliver and Ghinea [14] performed an eye-tracking study related to the frame rate and perceptual quality. They noted that the gaze-path is not significantly affected by frame rate variations, which was based on the observation that the median gaze-path over viewers is consistent independently of the presentation frame rate varied among 5 fps, 15 fps, and 25 fps.
However, it is questionable whether such a conclusion is still valid for the up-to-date video consumption environment (such as HD or ultra high-definition (UHD)), which is largely different from that considered in their study. With the advances in video technology, the screen size, display resolution, and video content resolution have ever increased, and accordingly, the horizontal viewing angle has also increased [15]. Consequently, human perceptual patterns of video content have changed; it was shown that the perception of presence (immersion and perceptual realism) is affected by the screen size [16]. In particular, the influence of visual attention on the perception of video content has become more and more prominent. Furthermore, while only a simple examination of the median gaze-path was performed in the study of Gulliver and Ghinea [14], deeper analysis in various aspects is needed to understand the perceptual mechanism better. In fact, our preliminary study demonstrated that the frame rate variation changes the gaze-path in an average sense over subjects [17].

For applications using gaze information, it is important to imitate the human gaze pattern precisely by using a visual saliency model [18]. In the literature, many saliency models have been developed [19]. However, their applicability across different video frame rates has not been studied before. Ultimately, it is desirable for a saliency model to be able to accurately predict the human gaze pattern across different video frame rates consistently, which imposes two (possibly conflicting) criteria: accuracy and robustness (or consistency). Then, the model can be used without modification across applications involving different frame rates, or will work well in an application having the possibility of frame rate variations during its operation.

In this paper, we will answer the following three research questions: (1) Does the frame rate influence the overall gaze patterns (in an average sense over subjects)? (2) Does the frame rate influence the inter-subject variability of the gaze pattern? (3) Do the state-of-the-art saliency models predict human gaze patterns reliably for different frame rates? To answer the first two questions, we conduct an eye-tracking experiment. Under a free viewing scenario, we collect and analyze gaze-paths of human subjects watching HD videos having a normal or low frame rate. Then, we apply representative state-of-the-art saliency models to the videos and evaluate their performance by using the gaze pattern data collected from the eye-tracking experiment, which will answer the third question.
Figure 1: Test sequences used in our study, arranged in alphabetical order: (1) alex, (2) analysis, (3) basketball, (4) bunny, (5) cook, (6) frame01, (7) frame02, (8) glow01, (9) glow02, (10) glow03, (11) greenland, (12) line, (13) nemesis, (14) playground, (15) rain, (16) road, (17) swim, (18) ymsb.

The rest of the paper is organized as follows. The following section explains how the eye-tracking experiment was designed and conducted, and presents the results of the experiment. Section 3 is devoted to the evaluation of saliency models based on the collected gaze pattern data. Finally, conclusions are given in Section 4.
2. Eye-tracking Experiment

Table 1: Characteristics of the test video sequences

| # | Video | Format | # Frames | Description | Camera motion | Object motion |
| 1 | alex | 1920×1080@30fps | 300 | A baby and a mother touch an iPad together and the screen of the iPad shines brightly. | fixed | local, slow |
| 2 | analysis | 1920×1080@30fps | 300 | A man walks slowly and explains the pictures on the wall. | panning, tilting | global, medium |
| 3 | basketball | 1920×1080@30fps | 300 | Five people play basketball. | shaking | global, medium |
| 4 | bunny | 1920×1080@30fps | 240 | Bunny sees a fruit falling from the sky. Animation clip. | fixed | local, slow |
| 5 | cook | 1920×1080@25fps | 250 | A chef cooks Mongolian food using two sticks. | fixed | local, medium |
| 6 | frame01 | 1920×1080@25fps | 250 | A man juggles and rotates the picture frame on the beach. | fixed | global, fast |
| 7 | frame02 | 1920×1080@25fps | 250 | A man juggles and rotates the picture frame on the grass. | fixed | global, fast |
| 8 | glow01 | 1920×1080@25fps | 175 | A boy moves to the right side to pick a sheet of paper flowing in the wind. | fixed | local, slow |
| 9 | glow02 | 1920×1080@25fps | 225 | Girls chatter together. | shaking | global, fast |
| 10 | glow03 | 1920×1080@25fps | 176 | Girls walk together holding a picket. | fixed | global, slow |
| 11 | greenland | 1920×1080@25fps | 175 | A big whale jumps above the sea near Greenland. | fixed | local, slow |
| 12 | line | 1920×1080@30fps | 300 | Many people run and play with the rope on the playground. | fixed | global, medium |
| 13 | nemesis | 1920×1080@30fps | 300 | A man escapes from the window slowly. | fixed | local, slow |
| 14 | playground | 1920×1080@30fps | 283 | Two boys play basketball at the basketball court. Shot from a distance. | fixed | local, slow |
| 15 | rain | 1920×1080@25fps | 283 | Four people hold an umbrella in the rain. | shaking | global, medium |
| 16 | road | 1920×1080@25fps | 252 | Cars pass by on the street slowly in different directions. | fixed | global, slow |
| 17 | swim | 1920×1080@30fps | 300 | Swimmers compete in the pool. | panning | global, medium |
| 18 | ymsb | 1920×1080@30fps | 300 | A rock band plays music and the lights change frequently. | shaking | local, fast |
2.1. Test Sequences

Eighteen HD sequences with a frame size of 1920×1080 pixels were selected and used for the eye-tracking experiment. These sequences originally have a frame rate of 25 fps or 30 fps, which we refer to as the normal frame rate (NFR). Most of the sequences were obtained from content available on online video websites, i.e., YouTube [20] and Vimeo [21]. The content was downloaded and reused under a Creative Commons license [22]. The source video clips are of high quality without visible artifacts. No scene cuts were included in any sequence. The sequences have lengths of 5 to 10 seconds, which correspond to 175 to 300 frames for the NFR videos. We did not set the lengths of the test sequences to be the same in order to preserve the context of natural video content. The characteristics of the test sequences are summarized in Table 1, and Fig. 1 shows their representative frames. It can be seen that the test sequences cover a wide range of visual content characteristics. We avoided using popular content (e.g., Hollywood movies) because familiarity of subjects with the content may influence the results in an undesirable way.

From the NFR videos, low frame rate (LFR) videos were produced by reducing the frame rate by a factor of four, i.e., to 6.25 fps or 7.5 fps. Frame skipping (or frame dropping) was employed for reducing the frame rate, where one frame is kept and the next three frames are dropped. This method is commonly used for realizing temporal scalability by reducing the frame rate of scalable videos [23].
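As a concrete illustration of the frame-dropping scheme just described, the sketch below derives an LFR clip from an NFR clip. This is not the authors' code; the array shapes and function name are illustrative assumptions.

```python
import numpy as np

def drop_frames(nfr_frames, factor=4):
    """Temporal down-scaling by frame skipping: keep one frame and drop
    the next (factor - 1) frames, as described above."""
    return nfr_frames[::factor]

# Toy example: a 300-frame NFR clip at 30 fps becomes a 75-frame clip at 7.5 fps.
nfr = np.zeros((300, 90, 160), dtype=np.uint8)  # placeholder low-resolution luma frames
lfr = drop_frames(nfr)
print(lfr.shape)  # (75, 90, 160)
```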
2.2. Subjects

Twenty-four subjects (six females and eighteen males) participated in the experiment. Their ages were between 24 and 35, with a mean of 28.5. All of them reported normal or corrected-to-normal vision. They were inexperienced with experiments related to eye-tracking and visual perception. We divided the subjects into separate groups for the two conditions, i.e., half of the subjects (twelve) watched the NFR videos, while the other half watched the LFR videos. This is to avoid the memory effect on the gaze pattern that arises from multiple viewings of the same content, which was noted in previous studies [24, 25]. In order to prevent group-dependence of the results, we assigned the subjects to the two groups evenly. The average ages were similar, i.e., 29.4 and 27.6 for the NFR and LFR groups, respectively, and the genders were also divided evenly, i.e., three females and nine males in each group. The number of people with corrected-to-normal vision was also balanced, i.e., five and six people for the NFR and LFR groups, respectively.

2.3. Equipment
We used a Smart Eye Pro eye-tracker and a Samsung 24-inch LCD monitor having a native resolution of 1920×1080 pixels. The eye-tracker consists of a hardware part (three cameras and two infrared (IR) flashes) and a software part (Smart Eye Pro 5.8). It records the reflections of the IR flashes on the cornea to find the centers of the eyes and tracks the gaze. The equipment automatically tracks head movement and corrects the eye positions, which allows accurate gaze tracking without the necessity of an additional mechanism to fix the head (e.g., a chin rest). The eye-tracking data were collected at a rate of 60 Hz.

2.4. Procedure

Each subject sat in front of the monitor at a distance of three times the height of the monitor, as recommended for HD displays in Recommendation ITU-R BT.710-4 [26]. The videos were played on the monitor supporting the full HD resolution. Gaze calibration was performed before tracking in order to enable accurate tracking of the subject's gaze. This step typically took a few minutes. Then, the subject was told about the test procedure with some example video sequences that were not used in the experiment. During the test, the subject watched the eighteen test sequences continuously. The viewing order was randomized for each subject. A red cross on a gray background was shown at the center of the screen for three seconds before the start of each sequence in order to set the starting gaze point to the center. A free viewing scenario was considered; thus, no particular task was assigned to the subjects. The information about the frame rate of the video sequences was not given to the subjects in order to avoid biasing the results.
Figure 2: Gaze-path of (a) the x-coordinate and (b) the y-coordinate for sequence #2. Bold lines indicate median gaze-paths for NFR (blue) and LFR (red), and shaded areas indicate their standard deviations over subjects for NFR (gray) and LFR (yellow). Green bars indicate frames for which gaze points of NFR and LFR are significantly different (at a significance level of 5%).
2.5. Results

It is known that different types of eye movements have different importance in perception [27]. The human visual system has the highest sensitivity when the eyes move very slowly (i.e., during drift movements or fixations), while it has virtually no sensitivity during saccades. Therefore, the tracking points corresponding to saccades were not considered; only "gaze" points were considered in further processing. The types of eye movements were determined automatically by the eye-tracker software.

In order to compare the gaze patterns of the two types of videos, i.e., NFR and LFR, the coordinates of the recorded gaze points are examined. Note that, since the sampling rate of the eye-tracker was 60 Hz, the recorded data were interpolated so as to match the normal video frame rates, which allows us to map the gaze points to each frame easily. Figs. 2 and 3 show representative examples of the median gaze-paths (x-coordinates and y-coordinates) over all subjects for NFR and LFR and their variations (standard deviations). On average, it can be said that the gaze patterns for NFR and LFR are similar in terms of the overall shape of the median gaze-paths and variations, which is similar to the results of the previous study of Gulliver and Ghinea [14]. At the same time, however, locally appearing discrepancies between the two cases are also observed, e.g., around frame #240 in Fig. 2(a) and around frame #160 in Fig. 2(b).
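The 60 Hz-to-frame mapping mentioned above can be done with simple resampling. The sketch below uses linear interpolation, which is an assumption on our part, since the paper does not specify the exact interpolation method.

```python
import numpy as np

def gaze_to_frames(t_gaze, x_gaze, y_gaze, n_frames, fps):
    """Resample 60 Hz gaze samples onto the video frame timestamps by linear
    interpolation so that each frame receives one (x, y) gaze point."""
    t_frames = np.arange(n_frames) / fps
    x_f = np.interp(t_frames, t_gaze, x_gaze)
    y_f = np.interp(t_frames, t_gaze, y_gaze)
    return np.stack([x_f, y_f], axis=1)

# Toy example: 10 s of 60 Hz samples mapped onto a 300-frame, 30 fps clip.
t = np.arange(600) / 60.0
xy = gaze_to_frames(t, 960 + 50 * np.sin(t), 540 + 30 * np.cos(t), 300, 30)
print(xy.shape)  # (300, 2)
```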
Figure 3: Gaze-path of (a) the x-coordinate and (b) the y-coordinate for sequence #6. Bold lines indicate median gaze-paths for NFR (blue) and LFR (red), and shaded areas indicate their standard deviations over subjects for NFR (gray) and LFR (yellow). Cyan (or magenta) bars indicate frames for which the variance of gaze points across subjects is significantly larger (or smaller) for NFR than for LFR.
In fact, a two-way analysis of variance (ANOVA) test reveals that there exists a significant effect of the frame rate at a significance level of 5% for each of the x- and y-coordinates (degrees of freedom (df)=1; p=2.4×10⁻²¹ for the x-coordinate and p=2.4×10⁻³⁹ for the y-coordinate), as well as a significant effect of the video sequence (df=17; p=0 for both x- and y-coordinates) and a significant interaction effect between the frame rate and the video sequence (df=17; p=2.0×10⁻¹³ for the x-coordinate and p=2.5×10⁻⁵⁹ for the y-coordinate). Thus, it is worth examining the local discrepancy further.

The local discrepancy of the gaze patterns between NFR and LFR can be analyzed from two aspects, namely, the median and the variation across subjects. A discrepancy between the median gaze-paths for NFR and LFR means that, on average, there exists a difference in gaze patterns between the two cases. In terms of subject-wise variations, a discrepancy between NFR and LFR means that the levels of agreement of the gaze patterns across the subjects in each group are different for NFR and LFR.

In order to examine the discrepancy of the median gaze-paths, we carried out statistical tests (non-parametric Wilcoxon-Mann-Whitney tests) under the null hypothesis that the gaze x-coordinates (or y-coordinates) for NFR and those for LFR are samples from distributions with equal medians for each frame of the content. The tests were conducted for x-coordinates and y-coordinates separately.
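The per-frame test can be reproduced with standard statistical libraries. Below is a minimal sketch assuming the per-frame gaze coordinates of the two groups are stored in (subjects × frames) arrays; it is an illustration, not the authors' implementation.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def frames_with_median_discrepancy(gaze_nfr, gaze_lfr, alpha=0.05):
    """Per-frame two-sided Mann-Whitney U tests on one gaze coordinate.

    gaze_nfr, gaze_lfr: arrays of shape (n_subjects, n_frames) holding the
    x- (or y-) coordinates of the per-frame gaze points of each group.
    Returns a boolean mask of frames where the null hypothesis of equal
    medians is rejected at level alpha.
    """
    n_frames = gaze_nfr.shape[1]
    significant = np.zeros(n_frames, dtype=bool)
    for f in range(n_frames):
        _, p = mannwhitneyu(gaze_nfr[:, f], gaze_lfr[:, f],
                            alternative="two-sided")
        significant[f] = p < alpha
    return significant

# Toy example with 12 subjects per group and 300 frames.
rng = np.random.default_rng(0)
nfr = 960 + 40 * rng.standard_normal((12, 300))
lfr = 960 + 40 * rng.standard_normal((12, 300))
mask = frames_with_median_discrepancy(nfr, lfr)
print(mask.mean())  # fraction of frames flagged as significantly different
```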
Table 2: Percentage of the number of frames where the gaze points for NFR and LFR are significantly different (at a significance level of 5%)

| Sequence | Percentage | Sequence | Percentage |
| 1 | 5.7% | 10 | 4.0% |
| 2 | 11.0% | 11 | 5.7% |
| 3 | 1.7% | 12 | 1.0% |
| 4 | 0.8% | 13 | 0.7% |
| 5 | 6.0% | 14 | 4.6% |
| 6 | 17.2% | 15 | 13.5% |
| 7 | 2.0% | 16 | 1.2% |
| 8 | 1.7% | 17 | 4.3% |
| 9 | 7.6% | 18 | 2.3% |
|   |      | Mean | 5.1% |

Figure 4: (a) Example frame of sequence #2, and the intensity maps of the gaze points for (b) NFR and (c) LFR. A Gaussian blob was used for representing gaze points. Blue and red dots mean the median gaze points for NFR and LFR, respectively.
From the test results, it was found that the gaze point locations for NFR and LFR are significantly different for some frames (marked with green bars in Fig. 2), e.g., around the 240th frame in the x-coordinate (Fig. 2(a)) and around the 160th frame in the y-coordinate (Fig. 2(b)). Fig. 4 shows a representative example of a significant difference in terms of the median gaze-path. For the same frame, the intensity maps of the gaze points for NFR and LFR show noticeably different results. Table 2 shows the percentage of such frames (significant in either the x-coordinate or the y-coordinate) for each sequence at a significance level of 5%. On average, 5.1% of all frames show a statistically significant difference. The values differ across sequences, and the maximum is over 17% (sequence #6). Certainly, this amount is not negligible and needs to be considered in perception-driven video processing when different frame rates are involved.

Figure 5: (a) Example frame of sequence #6, and the intensity maps of the gaze points for (b) NFR and (c) LFR. A Gaussian blob was used for representing gaze points. Blue and red dots mean the median gaze points for NFR and LFR, respectively.
In order to examine the difference in the subject-wise variation of gaze-paths between NFR and LFR, we conducted statistical tests (non-parametric Ansari-Bradley tests) under the null hypothesis that the gaze x-coordinates (or y-coordinates) for NFR and those for LFR are from distributions with the same median and shape and with equal dispersions, for each frame of the content. As with the statistical tests for the median, the tests for the dispersion were conducted for x-coordinates and y-coordinates separately.
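A per-frame version of this test can be sketched as follows. Determining which group has the larger dispersion by comparing interquartile ranges is our assumption, since the paper does not state how the direction of the difference was decided.

```python
import numpy as np
from scipy.stats import ansari, iqr

def dispersion_discrepancy(gaze_nfr, gaze_lfr, alpha=0.05):
    """Per-frame Ansari-Bradley tests for equal dispersion of one gaze
    coordinate across the two subject groups.

    Returns two boolean masks: frames where the NFR dispersion is significantly
    larger than the LFR one, and frames where the LFR dispersion is larger.
    """
    n_frames = gaze_nfr.shape[1]
    nfr_larger = np.zeros(n_frames, dtype=bool)
    lfr_larger = np.zeros(n_frames, dtype=bool)
    for f in range(n_frames):
        _, p = ansari(gaze_nfr[:, f], gaze_lfr[:, f])
        if p < alpha:
            # Direction decided by comparing interquartile ranges (assumption).
            if iqr(gaze_nfr[:, f]) > iqr(gaze_lfr[:, f]):
                nfr_larger[f] = True
            else:
                lfr_larger[f] = True
    return nfr_larger, lfr_larger
```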
Figure 6: Percentage of the number of frames where variations of gaze points across subjects for NFR (or LFR) are significantly larger than for LFR (or NFR), and the sum of the two cases, at a significance level of 5%.
In this case, however, it is meaningful to distinguish which condition (NFR or LFR) has the larger dispersion values; these cases are marked with cyan or magenta bars in Fig. 3. Fig. 5 shows a representative example of a significant difference in terms of subject-wise variations. It is observed that, while the median gaze points for NFR and LFR are similar, the dispersions are significantly different. Fig. 6 shows the percentages of frames (either in the x-coordinate or the y-coordinate) for which NFR or LFR has larger dispersion values than the other for each sequence at a significance level of 5%. On average, 4.25% of all frames show statistically significantly larger dispersions for NFR, and 8.47% of all frames show statistically significantly larger dispersions for LFR. Thus, it can be concluded that the level of inter-subject gaze-path agreement is significantly influenced by the frame rate change for 12.72% of all frames. The values differ across sequences, and the maximum is larger than 20% (sequence #10). Although content-dependence is observed, the dispersion of LFR is larger than that of NFR for a larger number of frames. It is probable that the jerky motion in LFR videos, which is unnatural and unexpected, can make subjects' focus deviate significantly from the region usually focused on for NFR.

Finally, we examine whether the two types of discrepancy between NFR and LFR (i.e., median and dispersion) occur at the same time or not.
Figure 7: Proportions of the numbers of frames corresponding to different types of significant discrepancy.
Fig. 7 shows the proportions of the numbers of frames showing only median-discrepancy, only dispersion-discrepancy, and both types of discrepancy, relative to the number of frames showing either type of discrepancy. The results show that the two types of discrepancy are exclusive in most cases (about 97% on average). This implies that, for some frames of the LFR videos, either most viewers consistently focus on a particular region different from that focused on for NFR, or viewers look at different regions that are still around the region focused on for NFR.
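The breakdown shown in Fig. 7 follows directly from the per-frame outcomes of the two tests above; a minimal sketch, assuming boolean per-frame masks:

```python
import numpy as np

def discrepancy_breakdown(median_mask, dispersion_mask):
    """Split frames that show any discrepancy into median-only,
    dispersion-only, and both, as proportions of frames with either type."""
    either = median_mask | dispersion_mask
    n = max(int(either.sum()), 1)
    return {
        "median only": int((median_mask & ~dispersion_mask).sum()) / n,
        "dispersion only": int((dispersion_mask & ~median_mask).sum()) / n,
        "both": int((median_mask & dispersion_mask).sum()) / n,
    }
```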
3. Evaluation of Saliency Models

As shown in the previous section, human gaze patterns are influenced by the frame rate change of videos. In this section, we examine whether state-of-the-art visual saliency models mimic bottom-up visual attention robustly without being influenced by frame rate variation.
3.1. Saliency Models

While there exist many saliency models in the literature [28, 29, 30, 31, 32], we chose a few representative ones by considering the following issues. First, we considered models that have been proven to show good performance in previous research [19, 33]. Second, we considered both spatial saliency models and spatio-temporal saliency models. The former work on each image separately without considering the temporal dimension of videos, whereas the latter exploit temporal information. Although saliency models for videos have been developed in the literature, models for images have been researched more extensively. In addition, some spatial saliency models have shown good performance even for videos [19].

We finally selected eight saliency models, four spatial models and four spatio-temporal models: GBVS(vid), Ittikoch2(vid), HouNIPS(vid), SDSR(vid), AWS(img), GBVS(img), Ittikoch2(img), and HouNIPS(img).

• Ittikoch2¹: The Ittikoch2 model uses multiscale image features, e.g., color, intensity, and orientation, which are combined to generate a single topographical saliency map [28]. We used two variants of the model, Ittikoch2(vid) and Ittikoch2(img), which are spatio-temporal and spatial models, respectively. For the former, additional features are used for considering the temporal aspect, e.g., flicker and motion features, as well as the image features. The Matlab implementation provided in the GBVS implementation package [34] was used.

• GBVS: The graph-based visual saliency (GBVS) model uses feature maps extracted at multiple spatial scales, similarly to the Ittikoch2 model. Then, a fully connected graph model and a dissimilarity metric are applied to generate the saliency map [29]. We used both the spatial and spatio-temporal GBVS models in our study, GBVS(img) and GBVS(vid). The Matlab implementation in the GBVS implementation package [34] was used.

• HouNIPS²: Motivated by the sparse coding strategy of the brain, this model defines the features of an image patch as sparse coding basis functions and measures the activity of the features in terms of the so-called incremental coding length. Then, the saliency of a region is obtained by summing up the activity of all features at that region [30]. We used both the spatio-temporal and spatial versions of the model, HouNIPS(vid) and HouNIPS(img), respectively. The implementation in [35] was used for the latter, and the implementation for the former was obtained from the authors of [30].

• SDSR: The saliency detection by self-resemblance (SDSR) model is a spatio-temporal model. It uses local regression kernels as features computed from the given video. A non-parametric kernel density estimation technique is applied to the features, which results in a saliency map constructed from a local "self-resemblance" measure indicating the likelihood of saliency [31]. The Matlab implementation in [36] was used.

• AWS: The adaptive whitening saliency (AWS) model is a spatial saliency model. It is based on adaptive whitening of low-level scale and color features in a hierarchical manner. The saliency map is obtained by computing the vector norm of the whitened representation [37]. The implementation in [38] was used.

¹ The name of this model is inspired by the study of Judd et al. [33].
² The name of this model is inspired by the study of Borji et al. [19].

The models have a few tunable algorithmic parameters, which were determined for the best performance. For fair comparison, the same parameter values were used for both NFR and LFR videos.
saliency maps to evaluate the results of the saliency models. The fixation map where fixation locations of all subjects were set to the maximum pixel value was convolved with a Gaussian kernel for smoothing. For the size of the Gaussian kernel, a foveal coverage of the approximately 2 degrees visual angle was considered [ 39]. The resulting maps were used as the ground-truth saliency maps, which were compared with the
290
saliency maps of the saliency models. 3.2.2. Saliency maps by models For each video sequence, saliency maps are generated by inputting the video to each saliency model. While this is straightforward for NFR videos, there is no standard way 15
to consider reduced frame rates via temporal scalability for visual saliency modeling. 295
In this work, therefore, two possible options are employed. First, each frame of a LFR video sequence is repeated four times so that the total number of frames are the same for both NFR and LFR. Then, these frames are used as the input of a saliency model and the resulting saliency maps are compared with the ground-truth saliency maps for all frames. We call this method as LFR-SL, where SL stands for the “same
300
length” (to NFR). Second, the original frames of a LFR video sequence are used as the input of a saliency model. Then, the obtained saliency maps are compared with the temporally corresponding ground-truth saliency maps (i.e., every fourth frames). We call this method as LFR-OL, meaning the “original length” (of LFR). It should be noted that, in the human’s perceptual point of view, there is no difference between the
305
videos for LFR-SL and LFR-OL. However, for a saliency model, particularly a spatiotemporal model, they are different; the model sees a series of frames having relatively large magnitudes of motion in the consecutive frames for LFR-OL, whereas for LFRSL, a series of a frame having a large magnitude of motion and three identical frames (where no motion is involved) is given to the model. The performance of the model
310
would depend on how it considers motion characteristics in its algorithm. 3.2.3. Accuracy measures Accuracy measures of saliency models can be classified into three categories, i.e., distribution-based metrics, value-based metrics, and location-based metrics [ 40]. The distribution-based metrics measure similarity or dissimilarity between the statistical
315
distributions of gaze and saliency. The value-based metrics compare the saliency amplitudes with the corresponding eye fixations maps. The location-based metrics focus on the location of salient regions at gaze positions, which are based on the area under curve (AUC) for the receiver operating characteristic (ROC) curve. We selected three popular evaluation metrics, one from each of the three categories, namely, linear
320
correlation coefficient (CC), normalized scanpath saliency (NSS), and shuffled AUC. CC measures the degree of the linear correlation between two variables. When the value of CC is close to +1, there is strong linear relationship, i.e., it indicates that saliency maps and ground-truth maps are very similar. NSS is the degree of correspon-
16
NFR LFR−SL LFR−OL
0.25
Correlation Coefficient (CC)
0.2
0.15
0.1
0.05
0
d)
(vi
VS
GB
d)
id)
(vi
h2
oc
k Itti
S(v
NIP
u Ho
d)
)
g)
mg
(vi
SR
SD
S(i
AW
(im
VS
GB
)
g)
mg
2(i
ch
ko Itti
(im
IPS
N ou
H
Figure 8: CC averaged over the sequences for each saliency model. NFR (left bar), LFR-SL (middle bar), and LFR-OL (right bar) are displayed together. Standard errors of the mean for all frames are also shown.
dence between human fixation locations and model saliency maps, taking into account 325
the high inter-subject variability of eye movements [ 41]. When NSS≥1, the saliency map shows higher saliency values at human fixated locations compared to other locations. The shuffled AUC was proposed in [42] in order to enable more accurate assessment of non-trivial off-center fixation in comparison to the uniform AUC. The score equal to 1 means perfect prediction while a score of 0.5 indicates a chance level.
330
The uniform AUC and shuffled AUC use the same positive sample set composed of the fixations of the subject on the image. However, the negative set is composed of all fixation points of all subjects for all other images for shuffled AUC, while the negative set of uniform AUC consists of uniformly random points. Therefore, the shuffled AUC score tackles center bias of human attention and border effects, and is the best option
335
for model comparison according to Borji et al. [ 19]. We used the implementation of the measures available in [43]. 3.3. Results We performed three-way ANOVA tests considering the frame rate, video sequence, and saliency model as main factors at a significance level of 5% for each of CC, NSS, 17
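For reference, simplified versions of the three measures are sketched below. The evaluation in the paper used the implementation in [43], which may differ in details such as tie handling and how the shuffled negative set is sampled.

```python
import numpy as np

def cc(saliency, ground_truth):
    """Linear correlation coefficient between a model saliency map and the
    (Gaussian-smoothed) ground-truth saliency map."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-12)
    g = (ground_truth - ground_truth.mean()) / (ground_truth.std() + 1e-12)
    return float((s * g).mean())

def nss(saliency, fixations):
    """Normalized scanpath saliency: mean of the standardized saliency map
    at the human fixation locations (x, y)."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-12)
    return float(np.mean([s[int(y), int(x)] for x, y in fixations]))

def shuffled_auc(saliency, fixations, other_fixations):
    """Shuffled AUC: positives are saliency values at fixations on this frame,
    negatives are values at fixation locations taken from other frames/videos
    (which penalizes a pure center-bias prediction)."""
    pos = np.array([saliency[int(y), int(x)] for x, y in fixations])
    neg = np.array([saliency[int(y), int(x)] for x, y in other_fixations])
    # Rank-based AUC (equivalent to the Mann-Whitney U statistic, no tie handling).
    ranks = np.argsort(np.argsort(np.concatenate([pos, neg]))) + 1
    u = ranks[:len(pos)].sum() - len(pos) * (len(pos) + 1) / 2
    return float(u / (len(pos) * len(neg)))
```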
3.3. Results

We performed three-way ANOVA tests considering the frame rate, video sequence, and saliency model as main factors, at a significance level of 5%, for each of the CC, NSS, and shuffled AUC scores. We found significant main effects of all three factors, i.e., frame rate, video, and model (df=1, 7, and 17, respectively; p=7.4×10⁻⁶, 0, and 0 for CC; p=7.6×10⁻⁵, 0, and 0 for NSS; p=2.8×10⁻⁴², 0, and 3.0×10⁻²⁶⁷ for AUC), and significant two-way interaction effects of all combinations, i.e., model×video, model×frame rate, and video×frame rate, for all accuracy measures (df=119, 7, and 17, respectively; p=1.1×10⁻⁴, 1.9×10⁻¹⁴, and 0 for CC; p=6.8×10⁻⁵, 1.2×10⁻¹⁰, and 0 for AUC); the three-way interaction was significant only for the shuffled AUC (df=119; p=0.408 for CC, p=0.178 for NSS, and p=4.1×10⁻³⁷ for AUC).

Figs. 8, 9, and 10 show the average CC, NSS, and shuffled AUC values over the test sequences for each saliency model, respectively. The error bars indicate the standard errors of the mean for all frames. Table 3 summarizes the scores and rankings of the models in terms of each accuracy measure. In addition, we carried out pair-wise post-hoc comparisons (Duncan's multiple range test) between the accuracy scores of the saliency models in terms of the shuffled AUC in order to examine the significance of the performance differences between the models; the results are shown in Table 4.

Figure 9: NSS averaged over the sequences for each saliency model. NFR (left bar), LFR-SL (middle bar), and LFR-OL (right bar) are displayed together. Standard errors of the mean for all frames are also shown.

Figure 10: Shuffled AUC averaged over the sequences for each saliency model. NFR (left bar), LFR-SL (middle bar), and LFR-OL (right bar) are displayed together. Standard errors of the mean for all frames are also shown.

Table 3: Averages and standard errors of the mean of the prediction accuracy scores of each saliency model over the video sequences, and the rankings with respect to the average performance.

| Measure | Model | NFR score | NFR rank | LFR-SL score | LFR-SL rank | LFR-OL score | LFR-OL rank |
| CC | GBVS(vid) | 0.239±0.002 | 2 | 0.258±0.002 | 1 | 0.259±0.004 | 1 |
| CC | Ittikoch2(vid) | 0.197±0.002 | 4 | 0.199±0.002 | 4 | 0.201±0.004 | 4 |
| CC | HouNIPS(vid) | 0.106±0.002 | 7 | 0.105±0.003 | 8 | 0.104±0.005 | 8 |
| CC | SDSR(vid) | 0.099±0.002 | 8 | 0.108±0.002 | 7 | 0.124±0.004 | 6 |
| CC | AWS(img) | 0.167±0.002 | 5 | 0.166±0.002 | 5 | 0.166±0.005 | 5 |
| CC | GBVS(img) | 0.257±0.002 | 1 | 0.257±0.001 | 2 | 0.256±0.003 | 2 |
| CC | Ittikoch2(img) | 0.218±0.002 | 3 | 0.219±0.002 | 3 | 0.219±0.004 | 3 |
| CC | HouNIPS(img) | 0.127±0.002 | 6 | 0.125±0.002 | 6 | 0.125±0.005 | 7 |
| NSS | GBVS(vid) | 1.083±0.011 | 2 | 1.164±0.008 | 2 | 1.186±0.021 | 2 |
| NSS | Ittikoch2(vid) | 0.959±0.010 | 4 | 0.968±0.009 | 4 | 0.980±0.020 | 4 |
| NSS | HouNIPS(vid) | 0.500±0.010 | 7 | 0.492±0.013 | 8 | 0.492±0.027 | 8 |
| NSS | SDSR(vid) | 0.480±0.011 | 8 | 0.531±0.011 | 7 | 0.608±0.021 | 6 |
| NSS | AWS(img) | 0.756±0.011 | 5 | 0.748±0.011 | 5 | 0.750±0.022 | 5 |
| NSS | GBVS(img) | 1.234±0.007 | 1 | 1.237±0.007 | 1 | 1.239±0.015 | 1 |
| NSS | Ittikoch2(img) | 1.062±0.010 | 3 | 1.067±0.010 | 3 | 1.070±0.019 | 3 |
| NSS | HouNIPS(img) | 0.607±0.012 | 6 | 0.597±0.012 | 6 | 0.601±0.024 | 7 |
| AUC | GBVS(vid) | 0.545±0.002 | 4 | 0.548±0.002 | 5 | 0.561±0.004 | 3 |
| AUC | Ittikoch2(vid) | 0.550±0.002 | 2 | 0.558±0.002 | 3 | 0.564±0.004 | 2 |
| AUC | HouNIPS(vid) | 0.518±0.002 | 8 | 0.525±0.002 | 8 | 0.527±0.004 | 8 |
| AUC | SDSR(vid) | 0.531±0.001 | 6 | 0.544±0.002 | 6 | 0.555±0.003 | 6 |
| AUC | AWS(img) | 0.549±0.002 | 3 | 0.559±0.002 | 2 | 0.560±0.004 | 4 |
| AUC | GBVS(img) | 0.542±0.002 | 5 | 0.556±0.002 | 4 | 0.557±0.004 | 5 |
| AUC | Ittikoch2(img) | 0.553±0.002 | 1 | 0.567±0.002 | 1 | 0.568±0.004 | 1 |
| AUC | HouNIPS(img) | 0.526±0.002 | 7 | 0.534±0.002 | 7 | 0.535±0.003 | 7 |

Table 4: Duncan's multiple range test results for the shuffled AUC scores of the saliency models.

| Case | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Rank 6 | Rank 7 | Rank 8 |
| NFR | Ittikoch2(img) | Ittikoch2(vid) | AWS(img) | GBVS(vid) | GBVS(img) | SDSR(vid) | HouNIPS(img) | HouNIPS(vid) |
| LFR-SL | Ittikoch2(img) | AWS(img) | Ittikoch2(vid) | GBVS(img) | GBVS(vid) | SDSR(vid) | HouNIPS(img) | HouNIPS(vid) |
| LFR-OL | Ittikoch2(img) | Ittikoch2(vid) | GBVS(vid) | AWS(img) | GBVS(img) | SDSR(vid) | HouNIPS(img) | HouNIPS(vid) |
For the NFR case, GBVS(img), GBVS(vid), Ittikoch2(img), and Ittikoch2(vid) are ranked high in terms of CC and NSS. Using the shuffled AUC measure, Ittikoch2(img), Ittikoch2(vid), AWS(img), and GBVS(vid) are ranked on the top, while performance difference between them does not seem significant as shown in the post-hoc test results in Table 4. Overall, GBVS and Ittikoch2 (both spatial and spatio-temporal) show good
360
performance for NFR. When shuffled AUC is used, the relative superiority of the models is changed in comparison to CC and NSS (e.g., Ittikoch2 vs. GBVS, AWS(img) vs. GBVS, and SDSR(vid) vs. HouNIPS(img)). Such change was also observed in the previous study of Borji et al. [19]. However, the degree of changes of relative superiority is smaller in our results in comparison to the previous study. This implies
365
that the center-bias of our database is relatively mild considering the characteristics of shuffled AUC, i.e., the capability of alleviating the center-bias effect. In the analysis below, shuffled AUC is used mainly as in [19]. For the LFR case, spatial saliency models show similar scores with the NFR case in terms of CC and NSS, while spatio-temporal models tend to show improved accu-
370
racy in comparison to the NFR case. The results of statistical tests (non-parametric
20
Wilcoxon-Mann-Whitney tests) confirm that GBVS(vid) and SDSR(vid) models show significantly improved accuracy for LFR (p=6.5 × 10 −18 and 8.5 × 10−5 for CC and p=4.65 × 10−21 and 9.7 × 10−5 for NSS), as underlined in Table 3. On the whole, the GBVS and Ittikoch2 (both spatial and spatio-temporal) models have good performance 375
in terms of CC and NSS, which is similar to the NFR case; in fact, the overall rankings of the models for NFR and LFR are similar to each other in Table 3. However, there are some differences in rankings between the three cases when shuffled AUC is considered. The shuffled AUC scores are significantly different between NFR and LFR for all models. The relative superiority of the models is also changed for the LFR-
380
SL and LFR-OL cases in terms of shuffled AUC. For example, the AWS(img) model is ranked higher for LFR-SL than for NFR and lower for LFR-OL than NFR. Such changes of rankings are mainly because of the changes of the performance scores of the spatio-temporal models. In other words, although both spatial and spatio-temporal models show higher shuffled AUC scores for both LFR-SL and LFR-OL than NFR,
385
the amount of improvement is quite different for the spatio-temporal models, unlike the spatial models. For AWS(img), GBVS(img), Ittikoch2(img), and HouNIPS(img), shuffled AUC scores for LFR-SL and LFR-OL are similar in Fig. 10, because the generated saliency maps without considering temporal information are similar for the two LFR cases. However, the spatio-temporal saliency models, i.e., GBVS(vid) and
390
SDSR(vid), show significantly different shuffled AUC scores for LFR-SL and LFROL (p=0 and 1.4 × 10 −4 ), as indicated by italicized text in Table 3. This is because the spatio-temporal saliency models exploit temporal information residing in adjacent frames, which is not the same between LFR-SL and LFR-OL. The results discussed above are for the average accuracy over all sequences. In
395
Figs. 11 and 12, we perform further analysis on content-dependence in terms of shuffled AUC. The figures demonstrate that there is content-dependence of the accuracy of the saliency models. The models achieve overall high shuffled AUC scores for sequences #5, #14, and #12, while relatively poor performance is obtained for sequences #18, #2, and #11. The sequences showing good performance usually include scenes
400
having low complexity, such as fixed camera angles and small moving object(s) at the center of the image frame. 21
0.65
NFR LFR−SL LFR−OL
Shuffled Area Under Curve (AUC)
0.6
0.55
0.5
0.45
0.4
1
2
3
4
5
6
7
8
9 10 11 Sequence
12
13
14
15
16
17
18
Figure 11: Shuffled AUC averaged across the saliency models for each sequence. NFR (left bar), LFR-SL (middle bar), and LFR-OL (right bar) are displayed together. Standard errors of the means are also shown.
0.7
1
2
3 0.65 4
5
6 0.6 7
8
9 0.55 10
11
12
0.5
13
14 0.45
15
16
17 0.4
18 GBVS(vid)
Ittikoch2(vid)
HouNIPS(vid)
SDSR(vid)
AWS(img)
GBVS(img)
Ittikoch2(img)
HouNIPS(img)
Figure 12: Shuffled AUC for each sequence and for each saliency model. Each set of three horizontally adjacent blocks are for one model: middle block for NFR, left block for LFR-SL, and right block for LFROL.
22
Table 5: Results of statistical tests of average performance. A '+' (plus) (or '−' (minus)) symbol means that the score of NFR is significantly higher (or lower) than that of LFR-SL or LFR-OL over all frames for a sequence. A '•' (bullet) symbol means that the score of NFR is not significantly different from that of LFR-SL or LFR-OL over all frames. In each row, the eighteen symbols correspond to sequences #1 to #18.

| Saliency model | NFR/LFR | CC | NSS | shuffled AUC |
| GBVS(vid) | NFR/LFR-SL | ••–––••–•–•––•–•–• | ••–•–••–•–•––•–•–• | ••––+•–+•++––––••+ |
| GBVS(vid) | NFR/LFR-OL | –••••••••••–––•–•• | –••••••••••––••–•• | •••–•••••••–•–•–++ |
| Ittikoch2(vid) | NFR/LFR-SL | •••+–•••••••••••–+ | •••+–•••••••••••–+ | –•–•••–––•+–+–•–++ |
| Ittikoch2(vid) | NFR/LFR-OL | –••••••–•••–•–•–+• | •••••••••••••••••+ | •••••••••••••••••+ |
| HouNIPS(vid) | NFR/LFR-SL | +•••••••++•••••••• | +•••••••++•••••••• | –••–•–•–•••–+•–––• |
| HouNIPS(vid) | NFR/LFR-OL | •••••••••••••••••• | ••••••••+••••••••• | •••••••–•••–••––•• |
| SDSR(vid) | NFR/LFR-SL | ••–•–•–––•+–••––•• | •••••••–••+––•––•• | •••••••–••+––•––•• |
| SDSR(vid) | NFR/LFR-OL | ••••–––––••–••–••• | ••••–––––••–••––•• | ••–•–––•–••–••––•• |
| AWS(img) | NFR/LFR-SL | •••••••••+•••••••• | •••••••••+•••••••• | ••––••––•••–+–•–+• |
| AWS(img) | NFR/LFR-OL | •••••••••••••••••• | •••••••••••••••••• | •••–•••••••–•••••• |
| GBVS(img) | NFR/LFR-SL | •••••••••••••••••• | •••••••••••••••••• | –•––•––––••–•–••+• |
| GBVS(img) | NFR/LFR-OL | •••••••••••••••••• | •••••••••••••••••• | ••––••–••••–•–•••– |
| Ittikoch2(img) | NFR/LFR-SL | •••••••••••••••••• | •••••+•••••••••••• | –•––•––••••–+–•••• |
| Ittikoch2(img) | NFR/LFR-OL | •••••••••••••••••• | •••••••••••••••••• | ••––••–••••–•–•••• |
| HouNIPS(img) | NFR/LFR-SL | •••••••••+•••••••• | •••••••••+•••••••• | •••–•–––•+•–••••–• |
| HouNIPS(img) | NFR/LFR-OL | •••••••••••••••••• | •••••••••••••••••• | •••••••••••–•••••• |
Figure 13: Number of sequences showing an insignificant performance difference between NFR and LFR ((a) LFR-SL and (b) LFR-OL) vs. shuffled AUC score for NFR.
Table 5 summarizes the results of the statistical tests of significant superiority or inferiority between NFR and LFR-SL and between NFR and LFR-OL in terms of the CC, NSS, and shuffled AUC scores for each test sequence. We carried out statistical tests (non-parametric Wilcoxon-Mann-Whitney tests) under the null hypothesis that the accuracy scores for NFR and those for LFR-SL (or LFR-OL) are samples from distributions with equal medians over all frames of each sequence. A saliency model that has more bullet symbols is more desirable from the viewpoint of consistency (or robustness) across frame rates. The test results of the shuffled AUC for NFR vs. LFR-SL show the superiority of HouNIPS(img) to the other models; it shows consistency for 61.1% (11 out of 18) of the test sequences, while consistency is achieved for 33.3% to 55.6% of the test sequences by the other models. In the case of NFR vs. LFR-OL, HouNIPS(img) and AWS(img) exhibit consistency for 94.4% (17 of 18) and 88.9% (16 of 18) of the test sequences, respectively. This shows that the way the LFR videos are inputted to the saliency models significantly influences the results of predicting visual attention for the videos.

Finally, Fig. 13 shows the performance of the saliency models from the viewpoint of both accuracy and consistency for NFR and LFR. In general, it is desirable for a model to have large values on both axes (i.e., to lie in the upper right corner). However, the figures clearly show that there is a trade-off relationship between accuracy and consistency. Models that are good in terms of accuracy (e.g., Ittikoch2(img) and Ittikoch2(vid) in both figures, and AWS(img) in Fig. 13(a)) are relatively poor in terms of consistency. On the contrary, models that are good in terms of consistency (e.g., HouNIPS(img) in Fig. 13(a)) are relatively poor in terms of accuracy. This demonstrates that further research is needed to develop a saliency model showing good performance in terms of both accuracy and robustness against frame rate variation.
4. Conclusion

In this paper, we have presented an eye-tracking experiment and a benchmarking study of saliency models in relation to frame rate variation. Considering temporal scalability in video applications, we set up viewing conditions in which the videos have different frame rates.

First, we performed an eye-tracking experiment with the aim of investigating the influence of frame rate variation on viewing patterns and visual perception. Eye-tracking data were recorded from two groups of subjects for eighteen HD video contents, one group for normal frame rates and the other group for low frame rates. It was shown that the gaze pattern is significantly affected by the frame rate change for up to about 17% of all frames, depending on the content. It was also shown that the dispersion of the gaze patterns across subjects is affected by the frame rate change for about 13% of all frames on average. Additionally, we observed that the two types of discrepancy tend to occur exclusively.

Second, we performed a study to examine the applicability of saliency models imitating the human gaze pattern across different video frame rates. Eight state-of-the-art saliency models, four spatial models and four spatio-temporal models, were selected. We applied these models to all the test videos (NFR, LFR-SL, and LFR-OL) and evaluated their accuracy. In addition, we conducted statistical tests of significant differences in accuracy between NFR and LFR-SL and between NFR and LFR-OL in order to examine the consistency (i.e., robustness) of the models with respect to frame rate variation. It was shown that the accuracy of the saliency models differs between NFR and LFR. From the results of the statistical tests using the shuffled AUC, it was shown that consistency is achieved for 33.3% to 61.1% of the sequences in the case of NFR vs. LFR-SL and for 55.6% to 94.4% in the case of NFR vs. LFR-OL, depending on the model. Finally, it was shown that the considered state-of-the-art saliency models suffer from a trade-off between accuracy and consistency.

The findings of our study will be valuable for developing perception-driven video processing and saliency models that consider temporal scalability. Note that our work focused on investigating the existence of differences in visual attention and saliency model performance due to temporal scalability. Therefore, it would be desirable for future work to quantify the differences in relation to the frame rate, which will need to consider various frame rate values. In addition, further analysis would be valuable on related topics such as understanding and modeling the content-dependence and subject-dependence of visual saliency for temporal scalability.
Acknowledgment

This work was supported by the Ministry of Science, ICT and Future Planning (MSIP), Korea, under the IT Consilience Creative Program supervised by the Institute for Information and Communications Technology Promotion (IITP-2015-R0346-15-1008), and by the Basic Science Research Program through the National Research Foundation of Korea funded by the MSIP (2013R1A1A1007822).
References

[1] J.-S. Lee, F. De Simone, T. Ebrahimi, N. Ramzan, E. Izquierdo, Quality assessment of multidimensional video scalability, IEEE Communications Magazine 50 (4) (2012) 38–46.
[2] H. Schwarz, M. Wien, The scalable video coding extension of the H.264/AVC standard, IEEE Signal Processing Magazine 25 (2) (2008) 135.
[3] Y. Ye, P. Andrivon, The scalable extensions of HEVC for ultra-high-definition video delivery, IEEE Multimedia 21 (3) (2014) 58–64.
[4] I. Sodagar, The MPEG-DASH standard for multimedia streaming over the internet, IEEE Multimedia 18 (4) (2011) 62–67.
[5] Y. Sanchez, T. Schierl, C. Hellge, T. Wiegand, D. Hong, D. De Vleeschauwer, W. Van Leekwijck, Y. Le Louédec, Efficient HTTP-based streaming using scalable video coding, Signal Processing: Image Communication 27 (4) (2012) 329–342.
[6] N. Cranley, P. Perry, L. Murphy, User perception of adapting video quality, International Journal of Human-Computer Studies 64 (8) (2006) 637–647.
[7] Z. Ma, M. Xu, Y.-F. Ou, Y. Wang, Modeling of rate and perceptual quality of compressed video as functions of frame rate and quantization stepsize and its applications, IEEE Transactions on Circuits and Systems for Video Technology 22 (5) (2012) 671–682.
[8] J. Y. Chen, J. E. Thropp, Review of low frame rate effects on human performance, IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 37 (6) (2007) 1063–1076.
[9] L. Itti, N. Dhavale, F. Pighin, Realistic avatar eye and head animation using a neurobiological model of visual attention, in: Proceedings of the SPIE 48th Annual International Symposium on Optical Science and Technology, 2004, pp. 64–78.
[10] A. Ninassi, O. Le Meur, P. Le Callet, D. Barbba, Does where you gaze on an image affect your perception of quality? Applying visual attention to image quality metric, in: Proceedings of the IEEE International Conference on Image Processing (ICIP), Vol. 2, 2007, pp. 169–172.
[11] J. You, J. Korhonen, A. Perkis, T. Ebrahimi, Balancing attended and global stimuli in perceived video quality assessment, IEEE Transactions on Multimedia 13 (6) (2011) 1269–1285.
[12] J.-S. Lee, T. Ebrahimi, Perceptual video compression: A survey, IEEE Journal of Selected Topics in Signal Processing 6 (6) (2012) 684–697.
[13] V. Navalpakkam, L. Itti, An integrated model of top-down and bottom-up attention for optimizing detection speed, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, 2006, pp. 2049–2056.
[14] S. R. Gulliver, G. Ghinea, Stars in their eyes: What eye-tracking reveals about multimedia perceptual quality, IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 34 (4) (2004) 472–482.
[15] ITU-R, Recommendation BT.1845-1: Guidelines on metrics to be used when tailoring television programmes to broadcasting applications at various image quality levels, display sizes and aspect ratios, Tech. rep., ITU-R (2010).
[16] C. C. Bracken, D. J. Atkin, How screen size affects perception of television: A survey of presence-evoking technology in our living rooms, Visual Communication Quarterly 11 (1-2) (2004) 23–27.
[17] M. Cheon, J.-S. Lee, Gaze pattern analysis for video contents with different frame rates, in: Proceedings of the Visual Communications and Image Processing (VCIP), 2013, pp. 1–5.
[18] A. Borji, L. Itti, State-of-the-art in visual attention modeling, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1) (2013) 185–207.
[19] A. Borji, D. N. Sihite, L. Itti, Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study, IEEE Transactions on Image Processing 22 (1) (2013) 55–69.
[20] YouTube, YouTube website (2013). URL http://www.youtube.com/
[21] Vimeo, Vimeo website (2013). URL http://www.vimeo.com/
[22] Creative Commons license (2013). URL http://www.creativecommons.org/licenses/
[23] J.-S. Lee, F. De Simone, T. Ebrahimi, Subjective quality evaluation via paired comparison: application to scalable video coding, IEEE Transactions on Multimedia 13 (5) (2011) 882–893.
[24] J.-S. Lee, F. De Simone, T. Ebrahimi, Subjective quality evaluation of foveated video coding using audio-visual focus of attention, IEEE Journal of Selected Topics in Signal Processing 5 (7) (2011) 1322–1331.
[25] M. Nyström, K. Holmqvist, Effect of compressed offline foveated video on viewing behavior and subjective quality, ACM Transactions on Multimedia Computing, Communications and Applications 6 (1) (2010) 4.
[26] ITU-R, Recommendation BT.710-4: Subjective assessment methods for image quality in high-definition television, Tech. rep., ITU-R (1998).
[27] J. You, Video gaze prediction: Minimizing perceptual information loss, in: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 2012, pp. 438–443.
[28] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (11) (1998) 1254–1259.
[29] J. Harel, C. Koch, P. Perona, Graph-based visual saliency, in: Proceedings of the Advances in Neural Information Processing Systems, 2007, pp. 545–552.
[30] X. Hou, L. Zhang, Dynamic visual attention: Searching for coding length increments, in: Proceedings of the Advances in Neural Information Processing Systems, Vol. 21, 2009, pp. 681–688.
[31] H. J. Seo, P. Milanfar, Static and space-time visual saliency detection by self-resemblance, Journal of Vision 9 (12) (2009) 1–27.
[32] A. Garcia-Diaz, X. R. Fdez-Vidal, X. M. Pardo, R. Dosil, Saliency from hierarchical adaptation through decorrelation and variance normalization, Image and Vision Computing 30 (1) (2012) 51–64.
[33] T. Judd, F. Durand, A. Torralba, A benchmark of computational models of saliency to predict human fixations, Tech. Rep. MIT-CSAIL-TR-2012-001, MIT Computer Science and Artificial Intelligence Laboratory (2012).
[34] J. Harel, A saliency implementation in Matlab (2010). URL http://www.klab.caltech.edu/~harel/share/gbvs.php
[35] X. Hou, Matlab implementation of static image saliency (2011). URL http://www.its.caltech.edu/~xhou/projects/dva/dva.html
[36] H. J. Seo, P. Milanfar, Matlab package (2011). URL http://users.soe.ucsc.edu/~milanfar/research/rokaf/.html/SaliencyDetectio
[37] A. Garcia-Diaz, X. R. Fdez-Vidal, X. M. Pardo, R. Dosil, Saliency from hierarchical adaptation through decorrelation and variance normalization, Image and Vision Computing 30 (1) (2012) 51–64.
[38] X. R. Fdez-Vidal, Matlab code of computational model of visual attention (AWS) (2011). URL http://persoal.citius.usc.es/xose.vidal/research/aws/AWSmodel.html
[39] U. Engelke, H. Liu, J. Wang, P. Le Callet, I. Heynderickx, H.-J. Zepernick, A. Maeder, Comparative study of fixation density maps, IEEE Transactions on Image Processing 22 (3) (2013) 1121–1133.
[40] N. Riche, M. Duvinage, M. Mancas, B. Gosselin, T. Dutoit, Saliency and human fixations: State-of-the-art and study of comparison metrics, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 1153–1160.
[41] R. J. Peters, A. Iyer, L. Itti, C. Koch, Components of bottom-up gaze allocation in natural images, Vision Research 45 (18) (2005) 2397–2416.
[42] L. Zhang, M. H. Tong, T. K. Marks, H. Shan, G. W. Cottrell, SUN: A Bayesian framework for saliency using natural statistics, Journal of Vision 8 (7) (2008) 1–20.
[43] A. Borji, Matlab implementation of evaluation measures (2013). URL https://sites.google.com/site/saliencyevaluation/evaluation-measures