
Exploring Visual Attention Using Random Walks Based Eye Tracking Protocols

Xiu Chen, Zhenzhong Chen

School of Remote Sensing and Information Engineering, Wuhan University, China

Email addresses: [email protected] (Xiu Chen), [email protected] (Zhenzhong Chen)

Abstract

Identifying visual attention plays an important role in understanding human behavior and optimizing relevant multimedia applications. In this paper, we propose a visual attention identification method based on random walks. In the proposed method, fixations recorded by the eye tracker are partitioned into clusters, where each cluster represents a particular area of interest (AOI). In each cluster, we estimate the transition probabilities of the fixations based on their point-to-point adjacency in their spatial positions. We obtain the initial coefficients of the fixations according to their density, and utilize random walks to iteratively update the coefficients until convergence. Finally, the center of the AOI is calculated according to the convergent coefficients of the fixations. Experimental results demonstrate that our proposed method, which combines the fixations' spatial and temporal relations, highlights the fixations of higher densities and eliminates the errors inside the cluster. It is more robust and accurate than traditional methods.

Keywords: eye tracking, visual attention, fixation, area of interest, Random Walks

1. Introduction

Optimizing user experience plays an important role in today's research and development of multimedia systems and applications. Exploring the human visual system is key to understanding human characteristics. Visual attention has attracted more and more research activities due to its applications in saliency detection [1, 2, 3, 4], object detection [5], image and video

quality assessment [6, 7], video coding [8, 9, 10], video streaming [11], video summarization [12, 13], image search [14], image retargeting [15], webpage browsing [16], etc.

To study human visual attention characteristics, an eye tracking system is typically used in experiments to collect eye movement information according to the visual stimuli. The eye movements recorded by the eye tracker reflect visual perception behavior, and various analyses have been proposed and applied for studying the viewer's visual and cognitive processes [17]. The raw eye tracking data are primarily split into six types: fixations, saccades, smooth pursuits, optokinetic reflex, vestibulo-ocular reflex, and vergence [18]. As fixations and saccades are the most frequently studied in the prior literature, we focus on these two types of eye movements. Fixations are the eye movements that remain fixated on an area of interest (AOI), whilst saccades correspond to rapid eye movements among different AOIs [19]. Many approaches to the identification of fixations and saccades, such as velocity, dispersion, and area based algorithms, have been proposed and extensively explored [20]. This provides the foundation for further research on human cognitive processes.

It is generally recognized that people tend to fixate their attention on objects of interest more frequently [21]. Many studies utilize the number of fixations to represent the level of attention on the AOI, i.e., the more fixations the AOI has, the more attractive the AOI is [22, 23]. Once the viewer is observing an object of interest, the corresponding eye movements are expected to focus on the AOI and consequently generate a cluster of fixation points [24]. Various algorithms for clustering fixation points, such as mean shift [25] and Bayesian online clustering [26], have been developed to discover the viewer's AOIs, among which the velocity-based clustering algorithm I-VT [20] is widely used in practical eye tracking systems to translate the fixation cluster into a representation using its mean value. Therefore, fixations are closely associated with scene objects which correspond to AOIs in the stimuli, and representation positions which point towards the fixated regions, e.g., cluster centroid and fixation seed, are generated and employed to explore the visual behavior of human beings [27]. Furthermore, most eye tracking systems adopt the centroid of the fixation cluster as the representation of the AOI [20]. However, the centroid has some drawbacks: (1) each fixation in the cluster has equal importance in the final computed position, which ignores the inner spatial relationship among the fixations; (2) once the clustering algorithm includes some noise points,

the produced position is susceptible to these noises and contains a bias that reduces the accuracy of the output for further studies of human visual behaviors. This paper extends our previous related work in [28] and validates the method in more detail.

In order to locate the viewer's visual attention more accurately and overcome the disadvantages of the existing methods, researchers have proposed algorithms such as the method utilizing the fixation durations as weights [29] and the densest position based method [30]. However, the spatial correlations among the fixations, which provide important cues for visual attention analysis, are not well utilized.

In this paper, we present a method for identifying the representation of visual attention based on random walks. We accomplish this task by utilizing the fixations in the same cluster. Given the raw eye tracking data, the fixations extracted from the raw data are classified into different clusters which concentrate around the AOIs on the visual stimuli. Then, we propose a random walks based method for each cluster to identify the refined center based on the spatial distribution. Besides the spatial factor, we also incorporate the temporal characteristics, i.e., the duration of the fixation, as the bias in the evaluation of the final refined center. Experimental results show that our proposed method achieves excellent performance. Our main contributions can be summarized as follows: (1) a new method for identifying the fixation center of the AOI utilizing random walks; (2) taking into account the factors of density and distances among the fixations to judge the importance of each fixation.

This paper is arranged as follows. Section 2 describes the work-flow of the proposed method and its details. Section 3 shows the experimental results and discusses the performance of our method. The conclusions are drawn in Section 4.

2. Visual Attention Identification

The method proposed in this paper is shown in Fig. 1. Given the raw eye tracking data recorded by the eye tracker, we discover the centers of the viewer's visual attention on the corresponding stimuli. We achieve this goal by generating several clusters of fixations for the areas of the viewer's interest. Then, we propose an approach based on random walks which distinguishes the dense fixations according to their pairwise consistency.

Figure 1: The work-flow of the proposed method: from the raw eye tracking data, the clusters of fixations are generated; random walks are performed on each cluster using the transition probability and the density of the fixation; the fixations weighted by the resulting coefficients yield the center of visual attention.

When calculating the center of attention in a particular cluster, we incorporate the transition probabilities and densities of the fixations, where the spatial distribution and temporal accumulation are utilized to evaluate the importance of the fixations in the cluster. Finally, the coefficients derived from random walks are used as weights of the fixations to generate the ultimate refined center of visual attention.

2.1. Generating the Clusters of Fixations

In order to analyze where and how an image is viewed, a wealth of clustering algorithms for eye movements have been applied to estimate the location and extent of the viewer's AOI. Higher level analysis of the viewer's visual interest explains and motivates the demand for approaches that concisely quantify the AOIs of the viewer [31], and these analyses are heavily influenced by the characteristics of the AOI, such as its size, location, duration, and center [32].

To provide effective fixations for clustering, we need to perform fixation identification first. Given the eye position coordinates relative to the stimuli, we utilize these data to analyze the cognitive processing of the human being. Fixations are supposed to be continuously centralized around a small area containing a particular scene object, whereas saccades correspond to eye movements with rapid velocities when the viewers change their central focus from one area to another. These two types of eye movements provide basic information for further eye-tracking analysis. The identification algorithm used to separate fixations from saccades is regarded as an essential part of interpreting the eye movements. According to the conclusions of many previous approaches in the literature that extensively explored the identification algorithms, the clusters of fixations may vary significantly because of the choice of algorithm and the parameter settings [33]. Many algorithms [20] such as velocity-threshold identification

(I-VT), dispersion-threshold identification (I-DT), area-of-interest identification (I-AOI), HMMs, MST, and so on, are capable of identifying fixations and clustering them into certain ROIs from different perspectives. In this paper, I-VT is implemented, as its broad application has shown that it is mature, simple, effective, and of low time complexity for this task [34].

Algorithm 1 Fixation Identification and Clustering
Require: The coordinates of raw eye tracking data; velocity-threshold; dispersion-threshold; interval-threshold;
Ensure: The clusters of fixations;
1. Calculate the velocity for each point.
2. Label all points lower than the velocity-threshold as fixations.
3. Label all points higher than the velocity-threshold as saccades.
4. Collapse successive fixations into tentative groups and delete the saccades.
5. Calculate the centroid for each group.
6. Calculate the distance between the centroids of the current group and the next one.
7. Calculate the time interval from the last fixation in the current group to the first fixation in the next group.
8. Merge the two groups into a new one when the two conditions are simultaneously satisfied: a) the distance between the centroids is below the dispersion-threshold; b) the time interval defined above is less than the interval-threshold.
return clusters of fixations;

The raw data produced by the eye tracker consist of quadruples (x_i, y_i, t_i, d_i), where x_i and y_i refer to the horizontal and vertical coordinates in the stimuli at the recording time t_i. The interval at which the eye tracker samples the gaze coordinates is what we call the duration d_i. We utilize I-VT to identify fixations from saccades based on the fact that fixations have lower velocity than saccades, and I-VT is widely accepted in eye tracking protocols. As velocity is defined as v = d/t, we only consider d when the parameter t is constant. The sampling rate of the eye tracker is usually constant and known, thus only the distance from the current point to the next is needed when measuring the velocity.

If a point's velocity is lower than the predefined velocity threshold, it is identified as a fixation; otherwise it is identified as a saccade, which will be excluded. Then, successive fixations collapse into groups.

Given the coordinates of the raw eye tracking data, the clusters of fixations are generated by utilizing Algorithm 1. Firstly, steps 1-4 describe how I-VT produces the groups of fixations. Then, the centroid of each group is calculated to represent the corresponding AOI. However, the AOIs will overlap once the groups have centroids close to each other. To avoid this situation, we need to merge the groups which are spatially and temporally adjacent into a new one. Therefore, another two thresholds are introduced to modify the fixation clusters. In step 6, the distance between the centroids of the current group and the next one is compared with a dispersion threshold. In step 7, the time interval from the last fixation in the current group to the first fixation in the next group is calculated. In step 8, the two clusters are combined only when two requirements are satisfied: 1) the distance from step 6 is less than the dispersion threshold; 2) the time interval obtained in step 7 is less than the interval threshold. Here, we set the velocity threshold to 15 deg/s, the dispersion threshold to 2 deg, and the interval threshold to 100 ms.

So far, we obtain the clusters of fixations based on their spatial and temporal characteristics. A cluster refers to an area of the viewer's interest, and the fixations within this area are expected to be centralized around a specific target on the stimuli. In the following procedures, we present a novel method based on random walks to locate the position of the specific target.
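As an illustration only, the following Python sketch outlines how a procedure like Algorithm 1 could be implemented. The function name, the array layout, and the default threshold values are assumptions made for this example and are not part of the original protocol; in particular, coordinates are assumed to be already expressed in degrees of visual angle.

```python
import numpy as np

def cluster_fixations(points, v_thresh=15.0, disp_thresh=2.0, int_thresh=0.1):
    """Sketch of Algorithm 1: I-VT labeling followed by group merging.

    points: array of shape (n, 3) with columns (x, y, t); x and y are assumed
    to be in degrees of visual angle and t in seconds.
    """
    xy, t = points[:, :2], points[:, 2]
    # Steps 1-3: per-sample velocity and fixation/saccade labeling.
    dist = np.linalg.norm(np.diff(xy, axis=0), axis=1)
    vel = dist / np.diff(t)
    is_fix = np.append(vel, 0.0) < v_thresh      # last sample kept as fixation
    # Step 4: collapse successive fixations into tentative groups.
    groups, current = [], []
    for p, fix in zip(points, is_fix):
        if fix:
            current.append(p)
        elif current:
            groups.append(np.array(current)); current = []
    if current:
        groups.append(np.array(current))
    # Steps 5-8: merge spatially and temporally adjacent groups.
    merged = [groups[0]] if groups else []
    for g in groups[1:]:
        prev = merged[-1]
        centroid_dist = np.linalg.norm(prev[:, :2].mean(0) - g[:, :2].mean(0))
        gap = g[0, 2] - prev[-1, 2]
        if centroid_dist < disp_thresh and gap < int_thresh:
            merged[-1] = np.vstack([prev, g])
        else:
            merged.append(g)
    return merged
```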

2.2. Identification of the Center of Area of Interest

In order to identify whether a target on the stimuli is attracting the viewer's attention, we design the algorithms of fixation identification and clustering to find the extent of the viewer's interest, with the assumption that once the viewer is observing an object of interest, the eye movements which have been processed into fixations are expected to focus on that object. Therefore, a considerable proportion of fixations within the same cluster are expected to show a high pairwise consistency with each other. Thus we propose a random walks based method for identifying the center of the attention based on the above observations. Each fixation is assigned a coefficient according to its potential for consistency. To better understand how the random walk works, consider a person gazing at a target: his eye movements will focus on an AOI which includes that target, and the fixations in one AOI have high consistency. Ultimately, the location near the target will be visited more often, as it is more consistent with the other locations. Intuitively, random walks on the graph consisting of the fixations in a cluster will evaluate the fixations' importance according to their consistency with each other.

2.2.1. Estimating the transition probability

For a specific cluster I, the fixations within the cluster form a graph G_I = (N, E), where N and E represent the set of nodes and edges, respectively. Each node denotes one fixation in the cluster I, i.e., N = {f_1, f_2, ..., f_λ}, where λ is the total number of fixations in the cluster, and there exist edges connecting these fixations, E = {(f_i, f_j), i ≠ j}. The coordinates of the fixations are arranged as triplets, i.e., f_i = {g_i, t_i, d_i}, where g_i represents the i-th fixation's location g_i = (x_i, y_i), and t_i and d_i represent the recording time and the duration, respectively. t_i and d_i are identical to their definition in the raw eye tracking data; e.g., our eye tracker records eye movements at the constant frequency of 120 Hz, so the duration d_i of each point is the same. At the beginning we form a matrix D whose elements are the Euclidean distances between pairs of fixations, i.e., the distance from the i-th fixation to the j-th:

$$D(i, j) = \lVert g_i - g_j \rVert_2, \qquad (1)$$

where $\lVert \cdot \rVert_2$ denotes the Euclidean norm. Thus, the transition probability from fixation i to j is defined as:

$$q(i, j) = \frac{e^{-\sigma \times D(i, j)}}{\sum_{k=1}^{\lambda} e^{-\sigma \times D(i, k)}}, \qquad (2)$$

where σ works as an insensitive parameter that modifies the distribution shape and is set through trials in our later experiments. The transition probabilities from one fixation to the rest are normalized by the denominator. According to Equation (2), the closer the fixations, the more consistency they have, and the greater the transition probabilities become.
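As a rough illustration, the transition matrix of Equation (2) can be computed as in the following Python sketch; the function name and the default value of σ are assumptions for this example, not values prescribed by the paper.

```python
import numpy as np

def transition_matrix(locations, sigma=0.05):
    """Pairwise transition probabilities q(i, j) of Equation (2).

    locations: array of shape (lambda, 2) with the fixation coordinates g_i.
    sigma is a free parameter shaping the distance decay (assumed value).
    """
    # D(i, j): Euclidean distance between every pair of fixations, Equation (1).
    diff = locations[:, None, :] - locations[None, :, :]
    D = np.linalg.norm(diff, axis=-1)
    # Row-normalized exponential decay, Equation (2).
    Q = np.exp(-sigma * D)
    return Q / Q.sum(axis=1, keepdims=True)
```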

2.2.2. Incorporating the densities of fixations

Intuitively, the fixations in a specific cluster show a high inclination to concentrate on an AOI as long as the target in that area is perceived by the viewer. Our goal is to find a refined center to represent that target based on random walks instead of the centroid. The way the fixations gather together causes the differences in their densities; therefore, the density of a fixation connotes its degree of importance in the group. In the proposed method based on random walks, we initialize the coefficient of a fixation by utilizing its density: the denser the fixation, the larger the initial coefficient it gets. The density of the i-th fixation can be defined as the total duration within a radius r around it:

$$\rho_r(i) = \sum_{j=1}^{\lambda} \{\, d_j \mid D(i, j) \le r \,\}. \qquad (3)$$

For simplicity, we propose a method that requires less computational load to count the densities. With the sampling interval of the eye tracker known and fixed, the duration d_j in Equation (3) is constant, thus we can simply count the number of fixations within the buffer zone:

$$\rho_r(i) = \{\, \sharp(j) \mid D(i, j) \le r \,\}, \qquad (4)$$

where ♯(j) represents the number of fixations that meet the condition. We define the initial coefficient of the i-th fixation in cluster I as:

$$w(i) = \frac{\rho_r(i)}{\sum_{j=1}^{\lambda} \rho_r(j)}. \qquad (5)$$

The denominator in Equation (5) acts as a normalization to ensure the Markov chain requirement ∥w∥_1 = 1.
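To make the density-based initialization concrete, here is a small Python sketch of Equations (4) and (5); the function name and the default radius are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def initial_coefficients(locations, r=5.0):
    """Initial coefficients w(i) from the fixation densities, Equations (4)-(5).

    locations: array of shape (lambda, 2) with the fixation coordinates g_i.
    r is the radius of the buffer zone used to count neighbours (assumed value).
    """
    diff = locations[:, None, :] - locations[None, :, :]
    D = np.linalg.norm(diff, axis=-1)
    # Equation (4): density = number of fixations within radius r of fixation i.
    rho = (D <= r).sum(axis=1)
    # Equation (5): normalize so that the coefficients sum to one.
    return rho / rho.sum()
```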

2.2.3. Assigning fixations with convergent coefficients by random walks

With the obtained transition probabilities and densities, we perform random walks to update the coefficient of each fixation at every iteration on account of the probabilities from the other fixations to it. We define an adaptive damping factor:

$$l_{t+1}(i) = \frac{1}{\eta} \Big\{ \sum_{j=1}^{\lambda} \big(1 - (1 - \alpha)\, l_t(i)\big)\, l_t(j)\, q(j, i) + (1 - \alpha)\, l_t(i)\, w(i) \Big\}, \qquad (6)$$

where l_t(i) is the relevance coefficient of the i-th fixation at the t-th iteration. The part $(1 - (1 - \alpha)\, l_t(i))\, l_t(j)\, q(j, i)$ calculates the summation of the transition probabilities from the other fixations to the i-th one. The adaptive damping factor $(1 - \alpha)\, l_t(i)\, w(i)$ in Equation (6) is similar to the one in [35]; it enables our method to take advantage of the prior knowledge about the current status of the fixation. However, Zamir [35] wanted to offset the effect of the node's density, whereas the method proposed in this paper pays more attention to the positive effect of the fixation's density. The fixation density ρ_r is ambiguous, since its definition contains a vital parameter r which is uncertain at the beginning, and the initial coefficients we obtain are heavily influenced by it; e.g., if r is too large, the densities ρ_r of the fixations will show no difference, which is not reasonable in practice. Therefore, the input errors coming from the initialization need to be considered. The adaptive damping factor has been proven to solve this problem effectively [35], so we use a constant α in Equation (6) which is set between 0.5 and 1. The normalization denominator η is defined as:

$$\eta = \sum_{i=1}^{\lambda} \Big\{ \sum_{j=1}^{\lambda} \big(1 - (1 - \alpha)\, l_t(i)\big)\, l_t(j)\, q(j, i) + (1 - \alpha)\, l_t(i)\, w(i) \Big\}, \qquad (7)$$

where η guarantees that the summation of the coefficients is always 1: $\sum_{i=1}^{\lambda} l_t(i) = 1$. The coefficients are iteratively calculated until they converge to a stationary probability l_T. In detail, we set a predefined threshold T to control the precision of the final values: when the computed results satisfy $|l_{t+1}(i) - l_t(i)| < T$ for all i ∈ [1, λ], we return the iterative vector l_{t+1} as the final coefficients l_T.
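The iteration of Equations (6) and (7) can be sketched in Python as follows; the function name, the damping value α, and the stopping threshold are assumptions chosen for illustration, not prescriptions from the paper.

```python
import numpy as np

def random_walk_coefficients(Q, w, alpha=0.85, tol=1e-6, max_iter=1000):
    """Iterate Equation (6) until the coefficients converge (Section 2.2.3).

    Q:  transition matrix with Q[j, i] = q(j, i), shape (lambda, lambda).
    w:  initial density-based coefficients from Equation (5).
    alpha is the damping constant (assumed 0.85, within the paper's 0.5-1 range).
    """
    l = w.copy()
    for _ in range(max_iter):
        # Summation term of Equation (6): contributions from all other fixations.
        walk = (1.0 - (1.0 - alpha) * l) * (l @ Q)
        # Adaptive damping term biased by the density prior w(i).
        damp = (1.0 - alpha) * l * w
        new = walk + damp
        new /= new.sum()                      # eta of Equation (7) keeps the sum at 1
        if np.max(np.abs(new - l)) < tol:     # convergence threshold T
            return new
        l = new
    return l
```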

2.2.4. Producing the fixation center

The fixations which are isolated from the rest are supposed to have coefficients approximating zero, whereas the remaining fixations are assigned coefficients in accordance with their consistency with each other. Utilizing the coefficients to weight the fixations, we finally get the refined position of the center that represents the cluster or the AOI:

$$\hat{g} = \sum_{i=1}^{\lambda} g_i\, l_T(i), \qquad (8)$$

where ĝ is the location of the center we are seeking. With the weighted fixations, we take into consideration the densities and the consistency among these fixations, thus the subareas in the AOI which are frequently watched by the viewer are highlighted in our estimation.
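Putting the pieces together, a hedged end-to-end sketch of the proposed center estimation might look as follows, reusing the example functions defined above; their names and parameter values are assumptions of this illustration.

```python
import numpy as np

def aoi_center(locations, sigma=0.05, r=5.0, alpha=0.85):
    """Refined AOI center of Equation (8) for one cluster of fixation locations."""
    Q = transition_matrix(locations, sigma)       # Equation (2)
    w = initial_coefficients(locations, r)        # Equations (4)-(5)
    l = random_walk_coefficients(Q, w, alpha)     # Equations (6)-(7)
    return l @ locations                          # Equation (8): weighted sum

# Example usage on a synthetic cluster of fixations (pixel coordinates).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cluster = rng.normal(loc=[320, 240], scale=10, size=(50, 2))
    print(aoi_center(cluster))
```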

3. Experimental Results

3.1. Experiment Setup

We utilize the IRCCyN IVC Eyetracker 2006 05 image database [36] to show the performance of our proposed method on natural images. In addition, we collected raw eye tracking data with a Tobii X120 Eye Tracker, which provides a tracking frequency as high as 120 Hz to guarantee high accuracy. We conducted all the experiments on a PC with a 2.5 GHz CPU and 16 GB RAM. In a natural environment where the viewer faces the eye tracker and the visual stimuli, we designed three types of experiments: (1) we set different patterns on the stimuli and instruct the viewer to focus their attention on the center of these patterns; (2) a free viewing task on the MIT database [37] is recorded, and the ground-truth is then assigned by the viewer; (3) the participants are asked to spell a particular word on a virtual keyboard by fixating their attention on the buttons. Each image is viewed for about 5 seconds. The raw data produced are arranged as successive coordinates of the aforementioned quadruples. After identifying the fixations from saccades, we separate the fixations into a number of clusters. Then, we discover the center of the viewer's visual attention using the generated fixation clusters. We compare different methods to demonstrate the performance of our proposed method.

3.2. Implementation of Reference Methods

Given a collection of fixations in one cluster, {g_1, g_2, ..., g_i, ..., g_λ}, where g_i represents the location of the i-th fixation point, the goal is to find a position which can describe the viewer's interest in this period of time. Generally, researchers choose the mean value of the eye movements surrounding a certain object as the representative point to show the location of the viewer's interest. The centroid [20] means that every fixation point plays an equally important role, that is, the weight of each fixation is the same:

$$P_{\mathrm{centroid}} = \frac{\sum_{i=1}^{\lambda} g_i}{\lambda}. \qquad (9)$$

In addition, the method in [29] calculates the cluster's center by utilizing the duration of each fixation as its weight:

$$P_{\mathrm{duration}} = \frac{\sum_{i=1}^{\lambda} g_i \times d_i}{\sum_{i=1}^{\lambda} d_i}, \qquad (10)$$

where d_i denotes the duration of the fixation.

Figure 2: Experiments on patterns: With predefined patterns (black) on the stimuli, the viewer's fixations (black crosshairs) are therefore limited to the surrounding area. We compare our computed centers with the center of the crosshairs. The centers obtained by different methods have distinguishable accuracies: the centroid [20] (green square), the densest position based method [30] (blue diamond), Špakov's method [29] (magenta square), and our proposed method (red circle).

Table 1: Comparison of deviation of different methods (pixels) in Fig. 2.

pattern   Centroid [20]   Špakov [29]   Densest [30]   ours
a         5.1             3.6           6.7            3.2
b         9.1             6.7           8.0            5.3
c         8.3             6.4           6.4            5.6

In this method, consecutive fixations within a small region are supposed to collapse into one fixation whose duration is the summation of the former. However, the weight of each fixation only depends on its local relationships, and the process of collapsing fixations is very important to the results, thus a good performance of such a method requires continuous adjustment of the parameter.

The densest position based method [30] regards the densest point as the representation of the interest area. Unlike the centroid [20], the densest point acts as the only vital factor and the other points are neglected:

$$P_{\mathrm{densest}} = g_d, \qquad (11)$$

where g_d meets the requirement ρ_r(d) = max(ρ_r). However, the densest position based method only focuses on the fixation's density while the global relationships among the fixations are ignored. What is more, the parameter r used to calculate the fixation's density needs fine tuning when applying this method.
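For comparison, the three reference centers of Equations (9)-(11) can be sketched in Python as follows; the function names and the default radius are illustrative assumptions.

```python
import numpy as np

def centroid_center(locations):
    """Equation (9): unweighted mean of the fixation locations."""
    return locations.mean(axis=0)

def duration_center(locations, durations):
    """Equation (10): duration-weighted mean of the fixation locations."""
    return (locations * durations[:, None]).sum(axis=0) / durations.sum()

def densest_center(locations, r=5.0):
    """Equation (11): location of the fixation with the highest density rho_r."""
    D = np.linalg.norm(locations[:, None, :] - locations[None, :, :], axis=-1)
    rho = (D <= r).sum(axis=1)
    return locations[np.argmax(rho)]
```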

Figure 3: Experiments on MIT [37]: (a)-(c) are the original images with the ground-truths (red rectangular box) assigned by the viewers. (d)-(f) show the identified fixation centers by the centroid [20] (green square), the densest position based method [30] (blue diamond), Špakov's method [29] (yellow square), and our proposed method (red circle).

In our proposed random walks based method, the position we pursue is defined in Equation (8). According to the process described above, a coefficient is allocated to each fixation on the basis of its spatial and temporal distribution.

3.3. Experiments on patterns

The cluster of fixations is expected to be focused on a center that represents the viewer's visual attention. In order to measure the accuracy of the different methods, we design three predefined patterns on the stimuli, where the center of each pattern is regarded as the ground-truth. We use the Tobii X120 to record the gaze information of the viewer. Compared with natural images, the image semantics of these patterns are simple enough to help collect eye tracking data towards one particular AOI. The viewers are instructed to focus their attention on the center of the pattern during the data collection. Table 1 displays the deviation from the centers of the

patterns by different methods. The corresponding visual results are shown in Fig. 2. From the results, we can conclude that our proposed method is robust and outstanding in locating the center of the viewer's visual attention and in eliminating the errors coming from human factors.

3.4. Experiments on images

We utilize the MIT database [37], which contains 1003 pictures, to evaluate our proposed method. Each viewer's fixations on one image during 5 seconds are handled individually. In Fig. 3 we display the predicted centers on the original images. The participants provide the target AOI they fixated, which is then regarded as our ground-truth. The ground-truth assigned by the viewer is a larger rectangular box that surrounds the object. For identifying a small object, the center of the AOI is very efficient, e.g., Figs. 3(d) and 3(f). However, when the ground-truth assigned by the viewer consists of a bigger target (e.g., Fig. 3(e)), most of the corresponding fixations are assembled in the region of the target. Our research attempts to find the AOI by purely utilizing the fixation distribution, and this is very useful when tracking human eye movements. In this figure, the center by Špakov's method [29] is very close to the centroid, and it is occluded.

We adopted the database [36] to further extend our experiments on natural images. [36] provides the raw fixation data on 27 images along with their heat maps and saliency maps, and the brightest area in each saliency map is regarded as the ground-truth. We display the experimental results on their heat maps; some visual results are shown in Fig. 4. Because each saliency map in this database is generated with many viewers' eye tracking data, the clustering algorithm mentioned above is not suitable here. Therefore, we generate the fixation clusters with k-means, which specifies the initial number of clusters according to the saliency map. The center of the AOI has a positive connotation of the viewer's AOI, which corresponds to the salient area. It provides a new way to demonstrate the viewer's visual attention based on the fixations' density and their cumulative transition probability. The existing methods used to identify this center have some defects in diverse situations; the goal of our research is to provide a novel approach that indicates the position more robustly and correctly. The red areas in the heat map denote that they contain a majority of the densest fixations, while other areas which are lightly colored (yellow, green, and blue) surround fixations with relatively low densities.

Figure 4: (a)-(b) are the original images. (c)-(d) display the fixation clusters on their heat maps. (e)-(f) show the identified fixation centers by the centroid [20] (green square), the densest position based method [30] (blue diamond), Špakov's method [29] (yellow square), and our proposed method (black circle), where the size of r is 5. (g)-(h) display the predicted centers on their saliency maps.


Figure 5: The results for various sizes of r. (b) shows the heat map of (a) with two fixation clusters. (c)-(f) display the identified fixation centers by the centroid [20] (green square), the densest position based method [30] (blue diamond), Špakov's method [29] (yellow square), and our proposed method (black circle), where the size of r is 5, 20, 80, and 120, respectively.


However, the density map we obtain does not consider the cumulative transition probability among the global fixations; the yellow and green areas also contain fixations with relatively high consistency. Here, we display the predicted centers on the saliency maps for comparison. When assessing the quality of saliency detection, the center produced by our proposed method can be used as an evaluation criterion.

The centroid is very sensitive to sparse fixations: when the AOI is surrounded by many sparse fixations, which look like a cluster halo, the centroid will be disturbed by these unimportant fixations (e.g., the right-most cluster in Fig. 4(f)). Meanwhile, as shown in Fig. 5, the densest position based method and Špakov's method are easily influenced by the size of r in the definition of density, thus their results are not certain. The densest position based method [30] can be very close to the ground-truth only when we finely tune the parameter r, which determines its effect. On the contrary, various sizes of r have little influence on our proposed method and the centroid. We want to propose a robust method to refine the results of the fixation density map and to facilitate research on saliency detection. Comparing the locations of the different centers with the location of the target in the AOI, the proposed method eliminates the influence of the many sparse fixations.

3.5. Applications for virtual keyboard

By tracking the viewer's head, eyes, gestures, and so on, Virtual Reality applications can provide better human-computer interaction. For example, a virtual keyboard can press buttons by tracking the user's eye movements. We display the image of a virtual keyboard on the screen and ask the viewers, after training, to type a word by gazing at the particular buttons. In Fig. 6, the ground-truth pressed keys spell "visual" and "bright", respectively. When the viewer types the word, the recorded eye movements cannot be guaranteed to always be positioned at the centre of the button. Under normal conditions, the viewer's attention may be scattered around the target button. We can see that the fixations surrounding the key "v" in Fig. 6(b) shift away from its exact centre. However, the shifted eye movements are consecutive in time and space; at least in our experiments they are not regarded as saccades. The centroid is positioned in the gap and will generate an ambiguous result, and the center of Špakov's method here is very similar to the centroid. The screen used in practice is usually smaller, so we need a precise center of the AOI.


Figure 6: Applications for the virtual keyboard: (a) is the image of a virtual keyboard. (b) and (c) spell the words "visual" and "bright", respectively. On the figure, we show the identified fixation centers by the centroid [20] (green square), the densest position based method [30] (blue diamond), Špakov's method [29] (yellow square), and our proposed method (red circle).


For each 5-second recording, the average time for computing the center with our proposed method is only 0.0146 seconds.

4. Conclusion

In this paper, we propose a random walks based method for the identification of the viewer's visual attention. Given the raw eye tracking data on a stimulus, we generate fixation clusters to represent the areas of interest (AOIs). In each cluster or AOI, we calculate the consistency among the fixations and distribute weights to the fixations. We obtain a refined center which has a positive connotation about the importance degree of the fixations in the AOI. Our method eliminates the influence of noisy fixation points and is more robust than traditional methods. We compare our method with state-of-the-art methods in extensive experiments and demonstrate that our proposed method achieves better performance.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (No. 61471273), the National High-tech R&D Program of China (863 Program, 2015AA015903), and the Natural Science Foundation of Hubei Province of China (No. 2015CFA053).

References

[1] L. Itti, C. Koch, E. Niebur, et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (11) (1998) 1254–1259.

[2] C. Guo, L. Zhang, A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression, IEEE Transactions on Image Processing 19 (1) (2010) 185–198.

[3] Y. Fang, W. Lin, B. S. Lee, C. T. Lau, Z. Chen, C. W. Lin, Bottom-up saliency detection model based on human visual sensitivity and amplitude spectrum, IEEE Transactions on Multimedia 14 (1) (2012) 187–198.

[4] N. İmamoğlu, W. Lin, Y. Fang, A saliency detection model using low-level features based on wavelet transform, IEEE Transactions on Multimedia 15 (1) (2013) 96–105.

[5] H. Li, F. Meng, K. N. Ngan, Co-salient object detection from multiple images, IEEE Transactions on Multimedia 15 (8) (2013) 1896–1909.

[6] Z. Lu, W. Lin, X. Yang, E. Ong, S. Yao, Modeling visual attention's modulatory aftereffects on visual sensitivity and quality evaluation, IEEE Transactions on Image Processing 14 (11) (2005) 1928–1942.

[7] Y. Zhao, L. Yu, Z. Chen, C. Zhu, Video quality assessment based on measuring perceptual noise from spatial and temporal perspectives, IEEE Transactions on Circuits and Systems for Video Technology 21 (12) (2011) 1890–1902.

[8] C. W. Tang, C. H. Chen, Y. H. Yu, C. J. Tsai, Visual sensitivity guided bit allocation for video coding, IEEE Transactions on Multimedia 8 (1) (2006) 11–18.

[9] Z. Chen, J. Han, K. N. Ngan, Dynamic bit allocation for multiple video object coding, IEEE Transactions on Multimedia 8 (6) (2006) 1117–1124.

[10] C. W. Tang, Spatiotemporal visual considerations for video coding, IEEE Transactions on Multimedia 9 (2) (2007) 231–238.

[11] H. Hadizadeh, I. V. Bajic, G. Cheung, Video error concealment using a computation-efficient low saliency prior, IEEE Transactions on Multimedia 15 (8) (2013) 2099–2113.

[12] Y. F. Ma, X. S. Hua, L. Lu, H. J. Zhang, A generic framework of user attention model and its application in video summarization, IEEE Transactions on Multimedia 7 (5) (2005) 907–919.

[13] G. Evangelopoulos, A. Zlatintsi, A. Potamianos, P. Maragos, K. Rapantzikos, G. Skoumas, Y. Avrithis, Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention, IEEE Transactions on Multimedia 15 (7) (2013) 1553–1568.

[14] J. Huang, X. Yang, X. Fang, W. Lin, R. Zhang, Integrating visual saliency and consistency for re-ranking image search results, IEEE Transactions on Multimedia 13 (4) (2011) 653–661.


[15] Y. Fang, Z. Chen, W. Lin, C. W. Lin, Saliency detection in the compressed domain for adaptive image retargeting, IEEE Transactions on Image Processing 21 (9) (2012) 3888–3901.

[16] C. Shen, X. Huang, Q. Zhao, Predicting eye fixations on webpage with an ensemble of early features and high-level representations from deep network, IEEE Transactions on Multimedia 17 (11) (2015) 2084–2093.

[17] M. A. Just, P. A. Carpenter, Using eye fixations to study reading comprehension, New Methods in Reading Comprehension Research (1984) 151–182.

[18] R. J. Leigh, D. S. Zee, The neurology of eye movements, Oxford University Press, 2015.

[19] C. Privitera, L. Stark, Scanpath theory, attention and image processing algorithms for prediction of human eye fixations, Neurobiology of Attention (2005) 269–299.

[20] D. D. Salvucci, J. H. Goldberg, Identifying fixations and saccades in eye-tracking protocols, in: Proceedings of the 2000 Symposium on Eye Tracking Research & Applications, ACM, 2000, pp. 71–78.

[21] J. Nielsen, K. Pernice, Eyetracking web usability, New Riders, 2010.

[22] G. Buscher, E. Cutrell, M. R. Morris, What do you see when you're surfing?: using eye tracking to predict salient regions of web pages, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, 2009, pp. 21–30.

[23] N. Matsuda, H. Takeuchi, Do heavy and light users differ in the webpage viewing patterns? analysis of their eye-tracking records by heat maps and networks of transitions, International Journal of Computer Information Systems and Industrial Management Applications 4 (2012) 109–120.

[24] E. Tafaj, T. C. Kübler, G. Kasneci, W. Rosenstiel, M. Bogdan, Online classification of eye tracking data for automated analysis of traffic hazard perception, in: Artificial Neural Networks and Machine Learning – ICANN 2013, Springer, 2013, pp. 442–450.


[25] A. Santella, D. DeCarlo, Robust clustering of eye movement recordings for quantification of visual interest, in: Proceedings of the 2004 Symposium on Eye Tracking Research & Applications, ACM, 2004, pp. 27–34.

[26] E. Tafaj, G. Kasneci, W. Rosenstiel, M. Bogdan, Bayesian online clustering of eye movement data, in: Proceedings of the Symposium on Eye Tracking Research and Applications, ACM, 2012, pp. 285–288.

[27] S. Ramanathan, H. Katti, N. Sebe, M. Kankanhalli, T. S. Chua, An eye fixation database for saliency detection in images, Computer Vision – ECCV 2010 (2010) 30–43.

[28] X. Chen, Z. Chen, Visual attention identification using random walks based eye tracking protocols, in: 2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP), IEEE, 2015, pp. 6–9.

[29] O. Špakov, D. Miniotas, Application of clustering algorithms in eye gaze visualizations, Information Technology and Control 36 (2) (2007) 213–216.

[30] Y. Wang, X. Chen, Z. Chen, Towards region-of-attention analysis in eye tracking protocols, Electronic Imaging 2016 (2) (2016) 1–6.

[31] C. M. Privitera, L. W. Stark, Algorithms for defining visual regions-of-interest: Comparison with eye fixations, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (9) (2000) 970–982.

[32] J. H. Goldberg, J. C. Schryver, Eye-gaze determination of user intent at the computer interface, Studies in Visual Information Processing (1995) 491–502.

[33] F. Shic, B. Scassellati, K. Chawarska, The incomplete fixation measure, in: Proceedings of the 2008 Symposium on Eye Tracking Research & Applications, ACM, 2008, pp. 111–114.

[34] M. Nyström, K. Holmqvist, An adaptive algorithm for fixation, saccade, and glissade detection in eyetracking data, Behavior Research Methods (2010) 188–204.


[35] A. R. Zamir, S. Ardeshir, M. Shah, GPS-tag refinement using random walks with an adaptive damping factor, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2014, pp. 4280–4287.

[36] O. Le Meur, P. Le Callet, D. Barba, D. Thoreau, A coherent computational approach to model the bottom-up visual attention, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (2006) 802–817.

[37] T. Judd, K. Ehinger, F. Durand, A. Torralba, Learning to predict where humans look, in: 2009 IEEE 12th International Conference on Computer Vision, IEEE, 2009, pp. 2106–2113.
