
Exploring Visual Attention Using Random Walks Based Eye Tracking Protocols

Xiu Chen, Zhenzhong Chen

School of Remote Sensing and Information Engineering, Wuhan University, China

Email addresses: [email protected] (Xiu Chen), [email protected] (Zhenzhong Chen)

Abstract

Identifying visual attention plays an important role in understanding human behavior and optimizing relevant multimedia applications. In this paper, we propose a visual attention identification method based on random walks. In the proposed method, fixations recorded by the eye tracker are partitioned into clusters, where each cluster represents a particular area of interest (AOI). In each cluster, we estimate the transition probabilities of the fixations based on their point-to-point adjacency in their spatial positions. We obtain the initial coefficients of the fixations according to their density, and utilize random walks to iteratively update the coefficients until convergence. Finally, the center of the AOI is calculated according to the convergent coefficients of the fixations. Experimental results demonstrate that our proposed method, which combines the fixations' spatial and temporal relations, highlights the fixations of higher densities and eliminates the errors inside the cluster. It is more robust and accurate than traditional methods.

Keywords: eye tracking, visual attention, fixation, area of interest, Random Walks

1. Introduction

Optimizing user experience plays an important role in today's research and development of multimedia systems and applications. Exploring the human visual system is key to understanding human characteristics. Visual attention has attracted more and more research activities due to its applications in saliency detection [1, 2, 3, 4], object detection [5], image and video

quality assessment [6, 7], video coding [8, 9, 10], video streaming [11], video summarization [12, 13], image search [14], image retargeting [15], webpage browsing [16], etc.

To study human visual attention characteristics, an eye tracking system is typically used in experiments to collect eye movement information according to the visual stimuli. The eye movements recorded by the eye tracker reflect visual perception behavior, and various analyses have been proposed and applied for studying the viewer's visual and cognitive processes [17]. The raw eye tracking data are primarily split into six types: fixations, saccades, smooth pursuits, optokinetic reflex, vestibulo-ocular reflex, and vergence [18]. As fixations and saccades are the most frequently studied in the prior literature, we focus on these two types of eye movements. Fixations are the eye movements that remain fixated on an area of interest (AOI), whilst saccades correspond to rapid eye movements among different AOIs [19]. Many approaches to the identification of fixations and saccades, such as velocity, dispersion, and area based algorithms, have been proposed and extensively explored [20]. This provides the foundation for further research on human cognitive processes.

It is generally recognized that people tend to fixate their attention on objects of interest more frequently [21]. Many studies utilize the number of fixations to represent the level of attention on the AOI, i.e., the more fixations the AOI has, the more attractive the AOI is [22, 23]. Once the viewer is observing an object of interest, the corresponding eye movements are expected to focus on the AOI and consequently generate a cluster of fixation points [24]. Various algorithms for clustering fixation points, such as mean shift [25] and Bayesian online clustering [26], have been developed to discover the viewer's AOIs, among which the velocity-based clustering algorithm I-VT [20] is widely used in practical eye tracking systems to translate the fixation cluster into a representation using its mean value. Therefore, fixations are closely associated with scene objects which correspond to AOIs in the stimuli, and representation positions which point towards the fixated regions, e.g., cluster centroid and fixation seed, are generated and employed to explore the visual behavior of human beings [27]. Furthermore, most eye tracking systems adopt the centroid of the fixation cluster as the representation of the AOI [20]. However, the centroid has some drawbacks: (1) each fixation in the cluster has equal importance in the final computed position, which ignores the inner spatial relationship among the fixations; (2) once the clustering algorithm includes some noise points,

the produced position is susceptible to these noises and contains a bias that reduces the accuracy of the output for further studies of human visual behaviors. This paper extends our previous related work in [28] and validates the method in more detail.

In order to locate the viewer's visual attention more accurately and overcome the disadvantages of the existing methods, researchers have proposed algorithms such as the method utilizing the fixation durations as weights [29] and the densest position based method [30]. However, the spatial correlations among the fixations, which provide important cues for visual attention analysis, are not well utilized.

In this paper, we present a method for identifying the representation of visual attention based on random walks. We accomplish this task by utilizing the fixations in the same cluster. Given the raw eye tracking data, the fixations extracted from the raw data are classified into different clusters which concentrate around the AOIs on the visual stimuli. Then, we propose a random walks based method for each cluster to identify the refined center based on the spatial distribution. Besides the spatial factor, we also incorporate the temporal characteristics, i.e., the duration of the fixation, as the bias in the evaluation of the final refined center. Experimental results show that our proposed method achieves excellent performance. Our main contributions can be summarized as follows: (1) a new method for identifying the fixation center of the AOI utilizing random walks; (2) taking into account the factors of density and distances among the fixations to judge the importance of each fixation.

This paper is arranged as follows. Section 2 describes the work-flow of the proposed method and its details. Section 3 shows the experimental results and discusses the performance of our method. The conclusions are drawn in Section 4.

2. Visual Attention Identification

The method proposed in this paper is shown in Fig. 1. Given the raw eye tracking data recorded by the eye tracker, we discover the centers of the viewer's visual attention on the corresponding stimuli. We achieve this goal by generating several clusters of fixations for the areas of the viewer's interest. Then, we propose an approach based on random walks which distinguishes the dense fixations according to their pairwise consistency.

Figure 1: The work-flow of the proposed method: from the raw eye tracking data, the clusters of fixations are generated; random walks are performed on each cluster using the transition probability and the density of the fixation; the fixations weighted by the resulting coefficients yield the center of visual attention.

When calculating the center of attention in a particular cluster, we incorporate the transition probabilities and densities of the fixations, where the spatial distribution and temporal accumulation are utilized to evaluate the importance of the fixations in the cluster. Finally, the coefficients derived from random walks are used as weights of the fixations to generate the ultimate refined center of visual attention.

2.1. Generating the Clusters of Fixations

In order to analyze where and how an image is viewed, a wealth of clustering algorithms for eye movements have been applied to estimate the location and extent of the viewer's AOI. Higher level analysis of the viewer's visual interest explains and motivates the demand for approaches that concisely quantify the AOIs of the viewer [31], and these analyses are heavily influenced by the characteristics of the AOI, such as its size, location, duration, and center [32].

To provide effective fixations for clustering, we need to perform fixation identification first. Given the eye position coordinates relative to the stimuli, we utilize these data to analyze the cognitive processing of the human being. Fixations are supposed to be continuously centralized around a small area containing a particular scene object, whereas saccades correspond to eye movements with rapid velocities when the viewers change their central focus from one area to another. These two types of eye movements provide basic information for further eye-tracking analysis. The identification algorithm used to separate fixations from saccades is regarded as an essential part of interpreting the eye movements. According to the conclusions of many previous approaches in the literature that extensively explored the identification algorithms, the clusters of fixations may vary significantly because of the choice of algorithm and the parameter settings [33]. Many algorithms [20] such as velocity-threshold identification

(I-VT), dispersion-threshold identification (I-DT), area-of-interest identification (I-AOI), HMMs, MST, and so on, are capable of identifying fixations and clustering them into certain ROIs from different perspectives. In this paper, I-VT is implemented, as its broad application has shown that it is mature, simple, effective, and of low time complexity for this task [34].

Algorithm 1 Fixation Identification and Clustering
Require: The coordinates of raw eye tracking data; velocity-threshold; dispersion-threshold; interval-threshold;
Ensure: The clusters of fixations;
1. Calculate the velocity for each point.
2. Label all points lower than the velocity-threshold as fixations.
3. Label all points higher than the velocity-threshold as saccades.
4. Collapse successive fixations into tentative groups and delete the saccades.
5. Calculate the centroid for each group.
6. Calculate the distance between the centroids of the current group and the next one.
7. Calculate the time interval from the last fixation in the current group to the first fixation in the next group.
8. Merge the two groups into a new one when the two conditions are simultaneously satisfied: a) the distance between the centroids is below the dispersion-threshold; b) the time interval defined above is less than the interval-threshold.
return clusters of fixations;

The raw data produced by the eye tracker consist of quadruples (x_i, y_i, t_i, d_i), where x_i and y_i refer to the horizontal and vertical coordinates in the stimuli at the recording time t_i. The interval at which the eye tracker samples the gaze coordinates is what we call the duration d_i. We utilize I-VT to identify fixations from saccades based on the fact that fixations have lower velocity than saccades, and I-VT is widely accepted in eye tracking protocols. As velocity is defined as v = d/t, we only consider d when the parameter t is constant. The sampling rate of the eye tracker is usually constant and known, thus only the distance from the current point to the next is needed when measuring the velocity.

If a point's velocity is lower than the predefined velocity threshold, it is identified as a fixation; otherwise it is identified as a saccade, which will be excluded. Then, successive fixations collapse into groups.

Given the coordinates of the raw eye tracking data, the clusters of fixations are generated by utilizing Algorithm 1. Firstly, steps 1-4 describe how I-VT produces the groups of fixations. Then, the centroid of each group is calculated to represent the corresponding AOI. However, the AOIs will overlap once the groups have centroids close to each other. To avoid this situation, we need to merge the groups which are spatially and temporally adjacent into a new one. Therefore, another two thresholds are introduced to modify the fixation clusters. In step 6, the distance between the centroids of the current group and the next one is compared with a dispersion threshold. In step 7, the time interval from the last fixation in the current group to the first fixation in the next group is calculated. In step 8, the two clusters are combined only when two requirements are satisfied: 1) the distance from step 6 is less than the dispersion threshold; 2) the time interval obtained in step 7 is less than the interval threshold. Here, we set the velocity threshold to 15 deg/s, the dispersion threshold to 2 deg, and the interval threshold to 100 ms.

So far, we obtain the clusters of fixations based on their spatial and temporal characteristics. A cluster refers to an area of the viewer's interest, and the fixations within this area are expected to be centralized around a specific target on the stimuli. In the following procedures, we present a novel method based on random walks to locate the position of the specific target.
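As an illustration only, the following Python sketch outlines how a procedure like Algorithm 1 could be implemented. The function name, the array layout, and the default threshold values are assumptions made for this example and are not part of the original protocol; in particular, coordinates are assumed to be already expressed in degrees of visual angle.

```python
import numpy as np

def cluster_fixations(points, v_thresh=15.0, disp_thresh=2.0, int_thresh=0.1):
    """Sketch of Algorithm 1: I-VT labeling followed by group merging.

    points: array of shape (n, 3) with columns (x, y, t); x and y are assumed
    to be in degrees of visual angle and t in seconds.
    """
    xy, t = points[:, :2], points[:, 2]
    # Steps 1-3: per-sample velocity and fixation/saccade labeling.
    dist = np.linalg.norm(np.diff(xy, axis=0), axis=1)
    vel = dist / np.diff(t)
    is_fix = np.append(vel, 0.0) < v_thresh      # last sample kept as fixation
    # Step 4: collapse successive fixations into tentative groups.
    groups, current = [], []
    for p, fix in zip(points, is_fix):
        if fix:
            current.append(p)
        elif current:
            groups.append(np.array(current)); current = []
    if current:
        groups.append(np.array(current))
    # Steps 5-8: merge spatially and temporally adjacent groups.
    merged = [groups[0]] if groups else []
    for g in groups[1:]:
        prev = merged[-1]
        centroid_dist = np.linalg.norm(prev[:, :2].mean(0) - g[:, :2].mean(0))
        gap = g[0, 2] - prev[-1, 2]
        if centroid_dist < disp_thresh and gap < int_thresh:
            merged[-1] = np.vstack([prev, g])
        else:
            merged.append(g)
    return merged
```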

2.2. Identification of the Center of Area of Interest

In order to identify whether a target on the stimuli is attracting the viewer's attention, we design the algorithms of fixation identification and clustering to find the extent of the viewer's interest, with the assumption that once the viewer is observing an object of interest, the eye movements which have been processed into fixations are expected to focus on that object. Therefore, a considerable proportion of fixations within the same cluster are expected to show a high pairwise consistency with each other. Thus we propose a random walks based method for identifying the center of the attention based on the above observations. Each fixation is assigned a coefficient according to its potential for consistency. To better understand how the random walk works, consider a person gazing at a target: his eye movements will focus on an AOI which includes that target, and the fixations in one AOI have high consistency. Ultimately, the location near the target will be visited more often, as it is more consistent with the other locations. Intuitively, random walks on the graph consisting of the fixations in a cluster will evaluate the fixations' importance according to their consistency with each other.

2.2.1. Estimating the transition probability

For a specific cluster I, the fixations within the cluster form a graph G_I = (N, E), where N and E represent the set of nodes and edges, respectively. Each node denotes one fixation in the cluster I, i.e., N = {f_1, f_2, ..., f_λ}, where λ is the total number of fixations in the cluster, and there exist edges connecting these fixations, E = {(f_i, f_j), i ≠ j}. The coordinates of the fixations are arranged as triplets, i.e., f_i = {g_i, t_i, d_i}, where g_i represents the i-th fixation's location g_i = (x_i, y_i), and t_i and d_i represent the recording time and the duration, respectively. t_i and d_i are identical to their definition in the raw eye tracking data; e.g., our eye tracker records eye movements at the constant frequency of 120 Hz, so the duration d_i of each point is the same. At the beginning we form a matrix D whose elements are the Euclidean distances between pairs of fixations, i.e., the distance from the i-th fixation to the j-th:

$$D(i, j) = \lVert g_i - g_j \rVert_2, \qquad (1)$$

where $\lVert \cdot \rVert_2$ denotes the Euclidean norm. Thus, the transition probability from fixation i to j is defined as:

$$q(i, j) = \frac{e^{-\sigma \times D(i, j)}}{\sum_{k=1}^{\lambda} e^{-\sigma \times D(i, k)}}, \qquad (2)$$

where σ works as an insensitive parameter that modifies the distribution shape and is set through trials in our later experiments. The transition probabilities from one fixation to the rest are normalized by the denominator. According to Equation (2), the closer the fixations, the more consistency they have, and the greater the transition probabilities become.
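As a rough illustration, the transition matrix of Equation (2) can be computed as in the following Python sketch; the function name and the default value of σ are assumptions for this example, not values prescribed by the paper.

```python
import numpy as np

def transition_matrix(locations, sigma=0.05):
    """Pairwise transition probabilities q(i, j) of Equation (2).

    locations: array of shape (lambda, 2) with the fixation coordinates g_i.
    sigma is a free parameter shaping the distance decay (assumed value).
    """
    # D(i, j): Euclidean distance between every pair of fixations, Equation (1).
    diff = locations[:, None, :] - locations[None, :, :]
    D = np.linalg.norm(diff, axis=-1)
    # Row-normalized exponential decay, Equation (2).
    Q = np.exp(-sigma * D)
    return Q / Q.sum(axis=1, keepdims=True)
```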

2.2.2. Incorporating the densities of fixations

Intuitively, the fixations in a specific cluster show a high inclination to concentrate on an AOI as long as the target in that area is perceived by the viewer. Our goal is to find a refined center to represent that target based on random walks instead of the centroid. The way the fixations gather together causes the differences in their densities; therefore, the density of a fixation connotes its degree of importance in the group. In the proposed method based on random walks, we initialize the coefficient of a fixation by utilizing its density: the denser the fixation, the larger the initial coefficient it gets. The density of the i-th fixation can be defined as the total duration within a radius r around it:

$$\rho_r(i) = \sum_{j=1}^{\lambda} \{\, d_j \mid D(i, j) \le r \,\}. \qquad (3)$$

For simplicity, we propose a method that requires less computational load to count the densities. With the sampling interval of the eye tracker known and fixed, the duration d_j in Equation (3) is constant, thus we can simply count the number of fixations within the buffer zone:

$$\rho_r(i) = \{\, \sharp(j) \mid D(i, j) \le r \,\}, \qquad (4)$$

where ♯(j) represents the number of fixations that meet the condition. We define the initial coefficient of the i-th fixation in cluster I as:

$$w(i) = \frac{\rho_r(i)}{\sum_{j=1}^{\lambda} \rho_r(j)}. \qquad (5)$$

The denominator in Equation (5) acts as a normalization to ensure the Markov chain requirement ∥w∥_1 = 1.
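To make the density-based initialization concrete, here is a small Python sketch of Equations (4) and (5); the function name and the default radius are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def initial_coefficients(locations, r=5.0):
    """Initial coefficients w(i) from the fixation densities, Equations (4)-(5).

    locations: array of shape (lambda, 2) with the fixation coordinates g_i.
    r is the radius of the buffer zone used to count neighbours (assumed value).
    """
    diff = locations[:, None, :] - locations[None, :, :]
    D = np.linalg.norm(diff, axis=-1)
    # Equation (4): density = number of fixations within radius r of fixation i.
    rho = (D <= r).sum(axis=1)
    # Equation (5): normalize so that the coefficients sum to one.
    return rho / rho.sum()
```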

2.2.3. Assigning fixations with convergent coefficients by random walks

With the obtained transition probabilities and densities, we perform random walks to update the coefficient of each fixation at every iteration on account of the probabilities from the other fixations to it. We define an adaptive damping factor:

$$l_{t+1}(i) = \frac{1}{\eta} \Big\{ \sum_{j=1}^{\lambda} \big(1 - (1 - \alpha)\, l_t(i)\big)\, l_t(j)\, q(j, i) + (1 - \alpha)\, l_t(i)\, w(i) \Big\}, \qquad (6)$$

where l_t(i) is the relevance coefficient of the i-th fixation at the t-th iteration. The part $(1 - (1 - \alpha)\, l_t(i))\, l_t(j)\, q(j, i)$ calculates the summation of the transition probabilities from the other fixations to the i-th one. The adaptive damping factor $(1 - \alpha)\, l_t(i)\, w(i)$ in Equation (6) is similar to the one in [35]; it enables our method to take advantage of the prior knowledge about the current status of the fixation. However, Zamir [35] wanted to offset the effect of the node's density, whereas the method proposed in this paper pays more attention to the positive effect of the fixation's density. The fixation density ρ_r is ambiguous, since its definition contains a vital parameter r which is uncertain at the beginning, and the initial coefficients we obtain are heavily influenced by it; e.g., if r is too large, the densities ρ_r of the fixations will show no difference, which is not reasonable in practice. Therefore, the input errors coming from the initialization need to be considered. The adaptive damping factor has been proven to solve this problem effectively [35], so we use a constant α in Equation (6) which is set between 0.5 and 1. The normalization denominator η is defined as:

$$\eta = \sum_{i=1}^{\lambda} \Big\{ \sum_{j=1}^{\lambda} \big(1 - (1 - \alpha)\, l_t(i)\big)\, l_t(j)\, q(j, i) + (1 - \alpha)\, l_t(i)\, w(i) \Big\}, \qquad (7)$$

where η guarantees that the summation of the coefficients is always 1: $\sum_{i=1}^{\lambda} l_t(i) = 1$. The coefficients are iteratively calculated until they converge to a stationary probability l_T. In detail, we set a predefined threshold T to control the precision of the final values: when the computed results satisfy $|l_{t+1}(i) - l_t(i)| < T$ for all i ∈ [1, λ], we return the iterative vector l_{t+1} as the final coefficients l_T.
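The iteration of Equations (6) and (7) can be sketched in Python as follows; the function name, the damping value α, and the stopping threshold are assumptions chosen for illustration, not prescriptions from the paper.

```python
import numpy as np

def random_walk_coefficients(Q, w, alpha=0.85, tol=1e-6, max_iter=1000):
    """Iterate Equation (6) until the coefficients converge (Section 2.2.3).

    Q:  transition matrix with Q[j, i] = q(j, i), shape (lambda, lambda).
    w:  initial density-based coefficients from Equation (5).
    alpha is the damping constant (assumed 0.85, within the paper's 0.5-1 range).
    """
    l = w.copy()
    for _ in range(max_iter):
        # Summation term of Equation (6): contributions from all other fixations.
        walk = (1.0 - (1.0 - alpha) * l) * (l @ Q)
        # Adaptive damping term biased by the density prior w(i).
        damp = (1.0 - alpha) * l * w
        new = walk + damp
        new /= new.sum()                      # eta of Equation (7) keeps the sum at 1
        if np.max(np.abs(new - l)) < tol:     # convergence threshold T
            return new
        l = new
    return l
```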

2.2.4. Producing the fixation center

The fixations which are isolated from the rest are supposed to have coefficients approximating zero, whereas the remaining fixations are assigned coefficients in accordance with their consistency with each other. Utilizing the coefficients to weight the fixations, we finally get the refined position of the center that represents the cluster or the AOI:

$$\hat{g} = \sum_{i=1}^{\lambda} g_i\, l_T(i), \qquad (8)$$

where ĝ is the location of the center we are seeking. With the weighted fixations, we take into consideration the densities and the consistency among these fixations, thus the subareas in the AOI which are frequently watched by the viewer are highlighted in our estimation.
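Putting the pieces together, a hedged end-to-end sketch of the proposed center estimation might look as follows, reusing the example functions defined above; their names and parameter values are assumptions of this illustration.

```python
import numpy as np

def aoi_center(locations, sigma=0.05, r=5.0, alpha=0.85):
    """Refined AOI center of Equation (8) for one cluster of fixation locations."""
    Q = transition_matrix(locations, sigma)       # Equation (2)
    w = initial_coefficients(locations, r)        # Equations (4)-(5)
    l = random_walk_coefficients(Q, w, alpha)     # Equations (6)-(7)
    return l @ locations                          # Equation (8): weighted sum

# Example usage on a synthetic cluster of fixations (pixel coordinates).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cluster = rng.normal(loc=[320, 240], scale=10, size=(50, 2))
    print(aoi_center(cluster))
```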

3. Experimental Results

3.1. Experiment Setup

We utilize the IRCCyN IVC Eyetracker 2006 05 image database [36] to show the performance of our proposed method on natural images. In addition, we collected raw eye tracking data with a Tobii X120 Eye Tracker, which provides a tracking frequency as high as 120 Hz to guarantee high accuracy. We conducted all the experiments on a PC with a 2.5 GHz CPU and 16 GB RAM. In a natural environment where the viewer faces the eye tracker and the visual stimuli, we designed three types of experiments: (1) we set different patterns on the stimuli and instruct the viewer to focus their attention on the center of these patterns; (2) a free viewing task on the MIT database [37] is recorded, and the ground-truth is then assigned by the viewer; (3) the participants are asked to spell a particular word on a virtual keyboard by fixating their attention on the buttons. Each image is viewed for about 5 seconds. The raw data produced are arranged as successive coordinates of the aforementioned quadruples. After identifying the fixations from saccades, we separate the fixations into a number of clusters. Then, we discover the center of the viewer's visual attention using the generated fixation clusters. We compare different methods to demonstrate the performance of our proposed method.

3.2. Implementation of Reference Methods

Given a collection of fixations in one cluster, {g_1, g_2, ..., g_i, ..., g_λ}, where g_i represents the location of the i-th fixation point, the goal is to find a position which can describe the viewer's interest in this period of time. Generally, researchers choose the mean value of the eye movements surrounding a certain object as the representative point to show the location of the viewer's interest. The centroid [20] means that every fixation point plays an equally important role, that is, the weight of each fixation is the same:

$$P_{\mathrm{centroid}} = \frac{\sum_{i=1}^{\lambda} g_i}{\lambda}. \qquad (9)$$

In addition, the method in [29] calculates the cluster's center by utilizing the duration of each fixation as its weight:

$$P_{\mathrm{duration}} = \frac{\sum_{i=1}^{\lambda} g_i \times d_i}{\sum_{i=1}^{\lambda} d_i}, \qquad (10)$$

where d_i denotes the duration of the fixation.

Figure 2: Experiments on patterns: With predefined patterns (black) on the stimuli, the viewer's fixations (black crosshairs) are therefore limited to the surrounding area. We compare our computed centers with the center of the crosshairs. The centers obtained by different methods have distinguishable accuracies: the centroid [20] (green square), the densest position based method [30] (blue diamond), Špakov's method [29] (magenta square), and our proposed method (red circle).

Table 1: Comparison of deviation of different methods (pixels) in Fig. 2.

pattern   Centroid [20]   Špakov [29]   Densest [30]   ours
a         5.1             3.6           6.7            3.2
b         9.1             6.7           8.0            5.3
c         8.3             6.4           6.4            5.6

In this method, consecutive fixations within a small region are supposed to collapse into one fixation whose duration is the summation of the former. However, the weight of each fixation only depends on its local relationships, and the process of collapsing fixations is very important to the results, thus a good performance of such a method requires continuous adjustment of the parameter.

The densest position based method [30] regards the densest point as the representation of the interest area. Unlike the centroid [20], the densest point acts as the only vital factor and the other points are neglected:

$$P_{\mathrm{densest}} = g_d, \qquad (11)$$

where g_d meets the requirement ρ_r(d) = max(ρ_r). However, the densest position based method only focuses on the fixation's density while the global relationships among the fixations are ignored. What is more, the parameter r used to calculate the fixation's density needs fine tuning when applying this method.
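For comparison, the three reference centers of Equations (9)-(11) can be sketched in Python as follows; the function names and the default radius are illustrative assumptions.

```python
import numpy as np

def centroid_center(locations):
    """Equation (9): unweighted mean of the fixation locations."""
    return locations.mean(axis=0)

def duration_center(locations, durations):
    """Equation (10): duration-weighted mean of the fixation locations."""
    return (locations * durations[:, None]).sum(axis=0) / durations.sum()

def densest_center(locations, r=5.0):
    """Equation (11): location of the fixation with the highest density rho_r."""
    D = np.linalg.norm(locations[:, None, :] - locations[None, :, :], axis=-1)
    rho = (D <= r).sum(axis=1)
    return locations[np.argmax(rho)]
```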

Figure 3: Experiments on MIT [37]: (a)-(c) are the original images with the ground-truths (red rectangular box) assigned by the viewers. (d)-(f) show the identified fixation centers by the centroid [20] (green square), the densest position based method [30] (blue diamond), Špakov's method [29] (yellow square), and our proposed method (red circle).

In our proposed random walks based method, the position we pursue is defined in Equation (8). According to the process described above, a coefficient is allocated to each fixation on the basis of its spatial and temporal distribution.

3.3. Experiments on patterns

The cluster of fixations is expected to be focused on a center that represents the viewer's visual attention. In order to measure the accuracy of the different methods, we design three predefined patterns on the stimuli, where the center of each pattern is regarded as the ground-truth. We use the Tobii X120 to record the gaze information of the viewer. Compared with natural images, the image semantics of these patterns are simple enough to help collect eye tracking data towards one particular AOI. The viewers are instructed to focus their attention on the center of the pattern during the data collection. Table 1 displays the deviation from the centers of the

patterns by different methods. The corresponding visual results are shown in Fig. 2. From the results, we can conclude that our proposed method is robust and outstanding in locating the center of the viewer's visual attention and in eliminating the errors coming from human factors.

3.4. Experiments on images

We utilize the MIT database [37], which contains 1003 pictures, to evaluate our proposed method. Each viewer's fixations on one image during 5 seconds are handled individually. In Fig. 3 we display the predicted centers on the original images. The participants provide the target AOI they fixated, which is then regarded as our ground-truth. The ground-truth assigned by the viewer is a larger rectangular box that surrounds the object. For identifying a small object, the center of the AOI is very efficient, e.g., Figs. 3(d) and 3(f). However, when the ground-truth assigned by the viewer consists of a bigger target (e.g., Fig. 3(e)), most of the corresponding fixations are assembled in the region of the target. Our research attempts to find the AOI by purely utilizing the fixation distribution, and this is very useful when tracking human eye movements. In this figure, the center by Špakov's method [29] is very close to the centroid, and it is occluded.

We adopted the database [36] to further extend our experiments on natural images. [36] provides the raw fixation data on 27 images along with their heat maps and saliency maps, and the brightest area in each saliency map is regarded as the ground-truth. We display the experimental results on their heat maps; some visual results are shown in Fig. 4. Because each saliency map in this database is generated with many viewers' eye tracking data, the clustering algorithm mentioned above is not suitable here. Therefore, we generate the fixation clusters with k-means, which specifies the initial number of clusters according to the saliency map. The center of the AOI has a positive connotation of the viewer's AOI, which corresponds to the salient area. It provides a new way to demonstrate the viewer's visual attention based on the fixations' density and their cumulative transition probability. The existing methods used to identify this center have some defects in diverse situations; the goal of our research is to provide a novel approach that indicates the position more robustly and correctly. The red areas in the heat map denote that they contain a majority of the densest fixations, while other areas which are lightly colored (yellow, green, and blue) surround fixations with relatively low densities.

Figure 4: (a)-(b) are the original images. (c)-(d) display the fixation clusters on their heat maps. (e)-(f) show the identified fixation centers by the centroid [20] (green square), the densest position based method [30] (blue diamond), Špakov's method [29] (yellow square), and our proposed method (black circle), where the size of r is 5. (g)-(h) display the predicted centers on their saliency maps.


Figure 5: The results for various sizes of r. (b) shows the heat map of (a) with two fixation clusters. (c)-(f) display the identified fixation centers by the centroid [20] (green square), the densest position based method [30] (blue diamond), Špakov's method [29] (yellow square), and our proposed method (black circle), where the size of r is 5, 20, 80, and 120, respectively.


However, the density map we obtain does not consider the cumulative transition probability among the global fixations; the yellow and green areas also contain fixations with relatively high consistency. Here, we display the predicted centers on the saliency maps for comparison. When assessing the quality of saliency detection, the center produced by our proposed method can be used as an evaluation criterion.

The centroid is very sensitive to sparse fixations: when the AOI is surrounded by many sparse fixations, which look like a cluster halo, the centroid will be disturbed by these unimportant fixations (e.g., the right-most cluster in Fig. 4(f)). Meanwhile, as shown in Fig. 5, the densest position based method and Špakov's method are easily influenced by the size of r in the definition of density, thus their results are not certain. The densest position based method [30] can be very close to the ground-truth only when we finely tune the parameter r, which determines its effect. On the contrary, various sizes of r have little influence on our proposed method and the centroid. We want to propose a robust method to refine the results of the fixation density map and to facilitate research on saliency detection. Comparing the locations of the different centers with the location of the target in the AOI, the proposed method eliminates the influence of the many sparse fixations.

3.5. Applications for virtual keyboard

By tracking the viewer's head, eyes, gestures, and so on, Virtual Reality applications can provide better human-computer interaction. For example, a virtual keyboard can press buttons by tracking the user's eye movements. We display the image of a virtual keyboard on the screen and ask the viewers, after training, to type a word by gazing at the particular buttons. In Fig. 6, the ground-truth pressed keys spell "visual" and "bright", respectively. When the viewer types the word, the recorded eye movements cannot be guaranteed to always be positioned at the centre of the button. Under normal conditions, the viewer's attention may be scattered around the target button. We can see that the fixations surrounding the key "v" in Fig. 6(b) shift away from its exact centre. However, the shifted eye movements are consecutive in time and space; at least in our experiments they are not regarded as saccades. The centroid is positioned in the gap and will generate an ambiguous result, and the center of Špakov's method here is very similar to the centroid. The screen used in practice is usually smaller, so we need a precise center of the AOI.


Figure 6: Applications for the virtual keyboard: (a) is the image of a virtual keyboard. (b) and (c) spell the words "visual" and "bright", respectively. On the figure, we show the identified fixation centers by the centroid [20] (green square), the densest position based method [30] (blue diamond), Špakov's method [29] (yellow square), and our proposed method (red circle).


For each 5-second recording, the average time for computing the center with our proposed method is only 0.0146 seconds.

4. Conclusion

In this paper, we propose a random walks based method for the identification of the viewer's visual attention. Given the raw eye tracking data on a stimulus, we generate fixation clusters to represent the areas of interest (AOIs). In each cluster or AOI, we calculate the consistency among the fixations and distribute weights to the fixations. We obtain a refined center which has a positive connotation about the importance degree of the fixations in the AOI. Our method eliminates the influence of noisy fixation points and is more robust than traditional methods. We compare our method with state-of-the-art methods in extensive experiments and demonstrate that our proposed method achieves better performance.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (No. 61471273), the National High-tech R&D Program of China (863 Program, 2015AA015903), and the Natural Science Foundation of Hubei Province of China (No. 2015CFA053).

References

[1] L. Itti, C. Koch, E. Niebur, et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (11) (1998) 1254–1259.

[2] C. Guo, L. Zhang, A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression, IEEE Transactions on Image Processing 19 (1) (2010) 185–198.

[3] Y. Fang, W. Lin, B. S. Lee, C. T. Lau, Z. Chen, C. W. Lin, Bottom-up saliency detection model based on human visual sensitivity and amplitude spectrum, IEEE Transactions on Multimedia 14 (1) (2012) 187–198.

[4] N. İmamoğlu, W. Lin, Y. Fang, A saliency detection model using low-level features based on wavelet transform, IEEE Transactions on Multimedia 15 (1) (2013) 96–105.

[5] H. Li, F. Meng, K. N. Ngan, Co-salient object detection from multiple images, IEEE Transactions on Multimedia 15 (8) (2013) 1896–1909.

[6] Z. Lu, W. Lin, X. Yang, E. Ong, S. Yao, Modeling visual attention's modulatory aftereffects on visual sensitivity and quality evaluation, IEEE Transactions on Image Processing 14 (11) (2005) 1928–1942.

[7] Y. Zhao, L. Yu, Z. Chen, C. Zhu, Video quality assessment based on measuring perceptual noise from spatial and temporal perspectives, IEEE Transactions on Circuits and Systems for Video Technology 21 (12) (2011) 1890–1902.

[8] C. W. Tang, C. H. Chen, Y. H. Yu, C. J. Tsai, Visual sensitivity guided bit allocation for video coding, IEEE Transactions on Multimedia 8 (1) (2006) 11–18.

[9] Z. Chen, J. Han, K. N. Ngan, Dynamic bit allocation for multiple video object coding, IEEE Transactions on Multimedia 8 (6) (2006) 1117–1124.

[10] C. W. Tang, Spatiotemporal visual considerations for video coding, IEEE Transactions on Multimedia 9 (2) (2007) 231–238.

[11] H. Hadizadeh, I. V. Bajic, G. Cheung, Video error concealment using a computation-efficient low saliency prior, IEEE Transactions on Multimedia 15 (8) (2013) 2099–2113.

[12] Y. F. Ma, X. S. Hua, L. Lu, H. J. Zhang, A generic framework of user attention model and its application in video summarization, IEEE Transactions on Multimedia 7 (5) (2005) 907–919.

[13] G. Evangelopoulos, A. Zlatintsi, A. Potamianos, P. Maragos, K. Rapantzikos, G. Skoumas, Y. Avrithis, Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention, IEEE Transactions on Multimedia 15 (7) (2013) 1553–1568.

[14] J. Huang, X. Yang, X. Fang, W. Lin, R. Zhang, Integrating visual saliency and consistency for re-ranking image search results, IEEE Transactions on Multimedia 13 (4) (2011) 653–661.


[15] Y. Fang, Z. Chen, W. Lin, C. W. Lin, Saliency detection in the compressed domain for adaptive image retargeting, IEEE Transactions on Image Processing 21 (9) (2012) 3888–3901.

[16] C. Shen, X. Huang, Q. Zhao, Predicting eye fixations on webpage with an ensemble of early features and high-level representations from deep network, IEEE Transactions on Multimedia 17 (11) (2015) 2084–2093.

[17] M. A. Just, P. A. Carpenter, Using eye fixations to study reading comprehension, New Methods in Reading Comprehension Research (1984) 151–182.

[18] R. J. Leigh, D. S. Zee, The neurology of eye movements, Oxford University Press, 2015.

[19] C. Privitera, L. Stark, Scanpath theory, attention and image processing algorithms for prediction of human eye fixations, Neurobiology of Attention (2005) 269–299.

[20] D. D. Salvucci, J. H. Goldberg, Identifying fixations and saccades in eye-tracking protocols, in: Proceedings of the 2000 Symposium on Eye Tracking Research & Applications, ACM, 2000, pp. 71–78.

[21] J. Nielsen, K. Pernice, Eyetracking web usability, New Riders, 2010.

[22] G. Buscher, E. Cutrell, M. R. Morris, What do you see when you're surfing?: using eye tracking to predict salient regions of web pages, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, 2009, pp. 21–30.

[23] N. Matsuda, H. Takeuchi, Do heavy and light users differ in the webpage viewing patterns? analysis of their eye-tracking records by heat maps and networks of transitions, International Journal of Computer Information Systems and Industrial Management Applications 4 (2012) 109–120.

[24] E. Tafaj, T. C. Kübler, G. Kasneci, W. Rosenstiel, M. Bogdan, Online classification of eye tracking data for automated analysis of traffic hazard perception, in: Artificial Neural Networks and Machine Learning – ICANN 2013, Springer, 2013, pp. 442–450.


[25] A. Santella, D. DeCarlo, Robust clustering of eye movement recordings for quantification of visual interest, in: Proceedings of the 2004 Symposium on Eye Tracking Research & Applications, ACM, 2004, pp. 27–34.

[26] E. Tafaj, G. Kasneci, W. Rosenstiel, M. Bogdan, Bayesian online clustering of eye movement data, in: Proceedings of the Symposium on Eye Tracking Research and Applications, ACM, 2012, pp. 285–288.

[27] S. Ramanathan, H. Katti, N. Sebe, M. Kankanhalli, T. S. Chua, An eye fixation database for saliency detection in images, Computer Vision – ECCV 2010 (2010) 30–43.

[28] X. Chen, Z. Chen, Visual attention identification using random walks based eye tracking protocols, in: 2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP), IEEE, 2015, pp. 6–9.

[29] O. Špakov, D. Miniotas, Application of clustering algorithms in eye gaze visualizations, Information Technology and Control 36 (2) (2007) 213–216.

[30] Y. Wang, X. Chen, Z. Chen, Towards region-of-attention analysis in eye tracking protocols, Electronic Imaging 2016 (2) (2016) 1–6.

[31] C. M. Privitera, L. W. Stark, Algorithms for defining visual regions-of-interest: Comparison with eye fixations, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (9) (2000) 970–982.

[32] J. H. Goldberg, J. C. Schryver, Eye-gaze determination of user intent at the computer interface, Studies in Visual Information Processing (1995) 491–502.

[33] F. Shic, B. Scassellati, K. Chawarska, The incomplete fixation measure, in: Proceedings of the 2008 Symposium on Eye Tracking Research & Applications, ACM, 2008, pp. 111–114.

[34] M. Nyström, K. Holmqvist, An adaptive algorithm for fixation, saccade, and glissade detection in eyetracking data, Behavior Research Methods (2010) 188–204.


[35] A. R. Zamir, S. Ardeshir, M. Shah, GPS-tag refinement using random walks with an adaptive damping factor, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2014, pp. 4280–4287.

[36] O. Le Meur, P. Le Callet, D. Barba, D. Thoreau, A coherent computational approach to model the bottom-up visual attention, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (2006) 802–817.

[37] T. Judd, K. Ehinger, F. Durand, A. Torralba, Learning to predict where humans look, in: 2009 IEEE 12th International Conference on Computer Vision, IEEE, 2009, pp. 2106–2113.
