Eye movements and visual discomfort when viewing stereoscopic 3D content
Jun Zhou a,b,∗, Ling Wang c, Haibing Yin d, Alan C. Bovik e

a Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, China
b Shanghai Key Laboratory of Digital Media Processing and Transmission, China
c Shanghai Media and Entertainment Technology (Group) Co. Ltd., Shanghai, China
d Hangzhou Dianzi University, Hangzhou, China
e The University of Texas at Austin, Austin, USA

∗ Corresponding author at: Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, China. E-mail addresses: [email protected] (J. Zhou), [email protected] (L. Wang), [email protected] (H. Yin), [email protected] (A.C. Bovik).
Article history: Available online xxxx
Keywords: Stereoscopic 3D; Eye-tracking; Visual discomfort; Eye movement features

Abstract
The visual brain fuses the left and right images projected onto the two eyes from a stereoscopic 3D (S3D) display, perceives parallax, and rebuilds a sense of depth. In this process, the eyes adjust vergence and accommodation to adapt to the depths and parallax of the points they gaze at. Conflicts between accommodation and vergence when viewing S3D content can lead to visual discomfort. A variety of approaches have been taken towards understanding the perceptual bases of discomfort felt when viewing S3D, including extreme disparities or disparity gradients, negative disparities, dichoptic presentations, and so on. However, less effort has been applied towards understanding the role of eye movements as they relate to visual discomfort when viewing S3D. To study eye movements in the context of S3D viewing discomfort, a Shifted-S3D-Image-Database (SSID) was constructed using 11 original natural scene S3D images and 6 shifted versions of each. We conducted eye-tracking experiments on humans viewing the S3D images in SSID while simultaneously collecting their judgments of experienced visual discomfort. From the collected eye-tracking data, regions of interest (ROIs) were extracted by kernel density estimation using the fixation data, and an empirical formula was fitted between the disparities of salient objects marked by the ROIs and the mean opinion scores (MOS). Finally, the eye-tracking data were used to analyze the eye movement characteristics related to S3D image quality. Fifteen eye movement features were extracted, and a visual discomfort prediction model was learned using a support vector regressor (SVR). By analyzing the correlations between features and MOS, we conclude that angular disparity features have a strong correlation with human judgments of discomfort.
1. Introduction
Our eyes are horizontally separated, so each eye forms its own retinal image that differs slightly from the other as a function of binocular disparity. The brain fuses the left and right retinal images and extracts depth information from angular and positional disparities. Stereopsis is the perception of 3D depth arising from these disparities and from oculomotor feedback generated via binocular convergence. As pointed out in [1–3], accommodation–vergence conflicts, excessive binocular parallax, binocular mismatches, depth inconsistencies, depth distortion, and cognitive inconsistencies may occur when viewing stereoscopic 3D (S3D) images on a flat 3D display, often leading to feelings of visual discomfort or visual fatigue. This detracts from what might otherwise be an increased quality of experience (QoE) compared to viewing traditional 2D content [4,5]. A considerable amount of recent work has focused on creating models of experienced visual discomfort as functions of S3D picture attributes, towards building automatic objective discomfort prediction algorithms [6,7].

Recent research efforts on objective discomfort prediction have led to improved model performance by embedding features that relate to a variety of oculomotor mechanisms, neuronal depth perception models, physiological responses, and so on. Based on a model of the anomalous motor responses to accommodation–vergence mismatches (AVM) that arise when viewing S3D content on a planar 3D display, Park et al. presented a 3D-AVM predictor of discomfort arising from anomalies in the accommodation–vergence cross-link, using a model of 3D
visual bandwidth to obtain statistically superior discomfort predictions [8]. In [9], Park developed a visual discomfort assessment model using disparity features along with features predictive of the activity of disparity selective neurons in extrastriate area V5/MT. Since the accuracy of these disparity-feature-based models largely depends on the accuracy of the disparity calculation, and the quality of disparity maps depends on the complexity of the disparity computing algorithms, Chen et al. studied the relative performances of discomfort prediction models that use disparity algorithms having different levels of complexity, and also proposed a set of discomfort predictive features with good performance even when using low complexity disparity algorithms [10]. In [11], the authors developed a percentage of un-linked pixels (PUP) feature map, which is descriptive of the presence of disparity. They used a low complexity method to extract PUP based features to predict visual discomfort. In [12], Kim et al. proposed a temporal visual discomfort model (TVDM), which expressed the relationship between experienced visual discomfort and sensory depth information as an active dynamic process modeled by a second-order differential equation.

Since high visual acuity is limited to the foveal region, a viewer must move their eyes to fixate local content at high resolution to gather maximum information [13]. Several feedback motor control processes, including compensation for head motion, maintaining fixation of the image on the fovea, directing change of gaze, etc., evoke eye movements [14]. Eye movements are generally preceded by a shift in visual attention to a subsequent area of interest, and are one of the few externally measurable activities of visual perception that can be used to analyze and validate models of perceptual processes [15,16]. While eye movements when viewing 2D content on a flat display usually imply a small range of vergences, the vergence eye movements that occur when viewing S3D are often quite large. Such externally measurable activity is potentially a useful way to study the relationship between the perception of stereo depth and the experience of visual discomfort while viewing S3D content.

Eye-tracking research has often led to new ways of collecting data regarding how we see and experience the world [17]. However, little work has been done towards understanding the role of eye movements on experienced visual discomfort when viewing S3D images. In [18], Zhang et al. investigated a methodology to eliminate bias while collecting eye-tracking data, and found a relationship between image distortions and fixation deviation. In [19], Li et al. measured eye blinking rate while viewing 3D videos having different characteristics and levels of visual comfort. They found that planar motion and in-depth motion both affected eye blinks significantly and differently. In [20], Zhang et al. compared the eye movements of subjects viewing 2D and S3D videos, although they did not study the relationship between visual comfort and eye movements. In [21], Iatsun et al. compared the degree of visual discomfort experienced by subjects while viewing 2D and S3D content and being simultaneously eye-tracked. They accounted for picture content and viewing time in the design of their experiments to investigate the accumulation of visual fatigue against various visual characteristics.
By analyzing questionnaire responses regarding 16 potential visual fatigue symptoms, they found that eye-strain and general discomfort contributed the most to sensations of both 2D and S3D visual fatigue. Using an ANOVA analysis of the contributions of fixations, saccades, and blinking, they found that the eye movements were different between subjects viewing S3D and 2D content. They concluded that the accumulation of visual fatigue depends on both viewing time and picture content when viewing 3D content, but only on viewing time when viewing 2D content. They did not explore the relationship between eye movements and visual discomfort in [21]. Liu et al. used eye-tracking to study the
statistical properties of natural S3D images at visual fixations [22]. They concluded that luminance contrast and luminance gradient are generally higher at fixations, but that fixated disparity contrast and disparity gradient are generally lower than randomly selected disparity contrast and disparity gradient. They explained this as a consequence of the increased neural (metabolic) energy expenditure that occurs when fixating at 3D points of high depth/disparity gradient, which may also relate to levels of visual discomfort.

Here we investigate the role of eye movements while viewing S3D images against experienced visual discomfort. We designed subjective comfort assessment experiments in which human subjects viewed 11 S3D images and multiple disparity-shifted versions of each. We collected subjective visual discomfort scores and eye-tracking data simultaneously from the subjects, and calibrated the eye-tracking data using a novel angular disparity based calibration method that we developed. We developed a set of features descriptive of eye movements, and we created a visual discomfort prediction model by using the eye movement features to train a support vector regressor (SVR). Using the regressed model, we demonstrate that angular disparity related features can predict visual comfort well.
2. Method
When viewing an S3D image, the perceived range of depths on the 3D display is limited by the disparity range and controlled by the Zero Disparity Plane (ZDP, known as the virtual display) [23]. Visual discomfort experienced when viewing a given S3D image is related to its disparity distribution and ZDP in a normal 3D viewing environment. Visual discomfort can be reduced by limiting the disparity range to guarantee that the perceived depths of objects fall within a "comfort zone", which is an admissible disparity range defined by the extremes of horizontal disparity within which clear binocular vision can be achieved [24]. For a captured S3D image, the depth is encoded as disparity between the pair of stereo images, and the disparity range is fixed as constrained by the camera baseline, the shooting distance, the resolution, the focal length, and importantly the scene space. The geometry of such S3D video systems can be modeled to analyze the characteristics and effects of stereoscopic distortion [25]. Cheng [26] studied S3D video shooting rules toward controlling the stereoscopic distortion and the comfortable viewing zone. However, it is difficult to adjust the disparity distribution of a captured S3D image other than by adding a constant disparity offset, which moves the ZDP forward or backward by shifting the views with respect to each other [27]. Hence, we constructed a shifted S3D image database for subjective comfort assessment. During the following human experiments, we simultaneously collected eye-tracking data to study the relationship between visual discomfort, ZDP and eye movements.
2.1. Testing Shifted-S3D-Images-Database (SSID)
Eleven 1920 × 1080 "original" S3D images, acquired using a FUJIFILM REAL 3D W3 camera equipped with two parallel lenses having an interocular separation of 75 mm, were used to construct a Shifted-S3D-Image-Database (SSID) (Fig. 1). These original S3D images were calibrated using a Homography Matrix Genetic Consensus Estimation (HM-GCE) method to eliminate geometrical distortions [28]. The disparity maps were computed using an optical flow estimation algorithm [29]. The disparity ranges of the eleven original S3D images spanned 26 to 91 pixels, with the minimum disparity occurring in "32s0" (−31 pixels) and the maximum disparity in "47s0" (66 pixels) (Fig. 1(l)). A range of angular disparities of ±1° and less than 65 mm of maximum uncrossed screen disparity are normally considered a
Fig. 1. Original S3D images (anaglyph presentations, (a)–(f)). (l) Disparity range (minimum, maximum) shown as a black line and mean disparity as a red "–" mark for each original S3D image. (For interpretation of the colors in the figure(s), the reader is referred to the web version of this article.)
reasonable comfort limit, with the most comfortable regions falling slightly behind the viewing screen [30,31]. The disparity range of most of the selected original S3D images falls within the comfort zone, except "32s0", "47s0", and "69s0", which have relatively large uncrossed disparities. For these selected "original" S3D images:

• "3s0" has two objects in the foreground, and the forefront object at the center has about −4 pixels disparity;
• "26s0" has a single object in the foreground covering a large area, with about 24 pixels disparity;
• "32s0" has multiple objects scattered over the depth space with the largest disparity range (88 pixels), and the forefront object has −26 pixels disparity;
• "47s0" has multiple small objects in the foreground with a disparity range of 18 to 32 pixels;
• "55s0" and "73s0" have single large objects at the center of the picture with about 18 pixels and 22 pixels disparities, respectively;
• in "69s0" and "104s0", the central parts of the pictures, located in the background, have 55 pixels and 46 pixels disparities;
• "82s0" has a single object at the center of the picture with a 22 pixels disparity;
• "86s0" has a larger object with a 25 pixels disparity at the right part of the image and 2 small objects with about 36 pixels disparities at the other side;
• "89s0" is a foggy scene, where the trees in the foreground have about 30 pixels disparity.

Another 6 S3D images were generated from each "original" image to construct the SSID¹ by shifting it by ±0.2° (±11 pixels), ±0.4° (±22 pixels) and ±0.6° (±34 pixels) of visual angle (Fig. 2), assuming the subjective test environment of a 47-inch 3D display viewed at a distance of three times the screen height. Applying larger negative disparity shifts to an S3D image brings the fused S3D content closer to the viewer, while larger positive disparity shifts push the fused S3D content away from the viewer, beyond the 3D display. This method of shifting serves to guarantee that the disparities of the central salient objects in the "original" and 6 shifted S3D images lie within the comfort limit, except for "32s0" [24]. For the −0.6° shifted version of "32s0", the bottom side has disparity in the range [−64, 16], which causes a border conflict and a window violation. The subjective test results demonstrate that the highest MOS was given to the 0.6° shifted version.
106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131
¹ https://jbox.sjtu.edu.cn/l/g1yde8.
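To make the shifting operation concrete, the following is a minimal sketch in Python (our own illustrative code, not the authors' released tooling) that converts a shift expressed in degrees of visual angle into pixels for the stated geometry (47-inch 16:9 display, 1920 × 1080, viewing distance of three times the screen height) and applies the offset to a stereo pair. The function names and the equal split of the offset between the two views are assumptions.

```python
import numpy as np

def pixels_per_degree(diag_inches=47.0, width_px=1920, height_px=1080):
    """Approximate pixels subtended by 1 degree of visual angle for a viewer
    sitting at three times the screen height (the setup assumed in the paper)."""
    aspect = width_px / height_px
    diag_m = diag_inches * 0.0254
    height_m = diag_m / np.sqrt(1.0 + aspect ** 2)
    width_m = height_m * aspect
    viewing_dist_m = 3.0 * height_m                      # about 1.75 m for a 47-inch screen
    meters_per_deg = 2.0 * viewing_dist_m * np.tan(np.deg2rad(0.5))
    return width_px * meters_per_deg / width_m

def shift_stereo_pair(left, right, shift_deg):
    """Add a constant disparity offset by shifting the two views horizontally in
    opposite directions, which moves the ZDP forward or backward.
    np.roll wraps pixels around the border; in practice the wrapped columns
    would be cropped or filled."""
    shift_px = int(round(pixels_per_degree() * shift_deg))
    half = shift_px // 2
    return np.roll(left, -half, axis=1), np.roll(right, shift_px - half, axis=1)

if __name__ == "__main__":
    # roughly 56 px per degree, so 0.2 deg corresponds to about 11 px,
    # matching the shift steps listed above
    print(round(pixels_per_degree(), 1))
```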
Fig. 2. Disparity-shifted versions of an original S3D stereo-pair. The “original” image (3s0) is shown in anaglyph (d). (e) is the depth map and (f) the normalized disparity distribution. The shifting interval was set to 0.2◦ , and 6 shifted versions ((a)–(c), (g)–(i)) were generated under the same scale. The shifted pixels were computed assuming a subjective test environment with a 47-inch 3D display being viewed at a distance of three times the screen height.
Fig. 3. Eye tracker and display configuration. The observer was arranged to sit at a distance of about 1.75 m from a 47-inch polarized HD 3D display (LG 47LA6600). The Tobii X120 was positioned and configured at the front and center of the standalone 3D display, and the observer’s two eyes were positioned in the X120 track box.
2.2. Observers
Twenty-four observers (9 women and 15 men, with ages between 21 and 27, and with an average age of 23.6) with normal or corrected-to-normal vision participated in the study. None were previously involved in 3D research. All of the subjects were found to have good or corrected visual acuity of greater than 1.0 and normal stereo vision as tested using the vision tests (VTs) recommended in ITU-R BT.1438 Annex 1 [32].
2.3. Procedure
A 47-inch polarized HD 3D display (LG 47LA6600) was used to display the S3D images to the human subjects. All of the S3D images were input to the 3D display in Top-and-Bottom format. Each observer was comfortably arranged to view the screen at a viewing distance of about 1.75 m (about 3 times the screen height) and equipped with a pair of passive polarized S3D viewing glasses, as depicted in Fig. 3.
The subjects’ gaze direction was tracked using a Tobii X120 eye tracker, which can accurately track eye movements when the motion of the head is less than 35 cm/s and within a 44 × 22 × 30 cm track box [33]. The eye tracker was positioned and configured at the front and center of the standalone 3D display using the assistance of a laser level meter. The sampling rate of the eye tracker was fixed at 60 Hz. The stated monocular accuracy, defined as the average difference between the position of the real stimuli and the measured gaze position, was 0.5◦ at 60 Hz with a monocular precision of 0.22◦ [33]. The observers were told that their eye movements would be monitored during these experiments. Instructions were given regarding their possible experiences of subjective discomfort, including eye strain, headache, dizziness and so on, and how to record their responses on a displayed 5-element Likert scale, where “1” represented “extremely uncomfortable” and “5” meant “very comfortable” [34]. Each observer would sit in front of the eye tracker while viewing the screen and while eye movements were recorded during each subjective test (Fig. 3).
Before each experiment, each observer participated in a calibration procedure. Then each of the S3D images was displayed for 10 s in random order, while the eye tracker collected eye-movement samples. A rest period of 2 min was allowed after every 10 subjective assessments. The first 1 s of eye-tracking data was discarded to allow the subjects ample time to stereo-adapt. After 10 s of viewing, the subject was asked to record their subjective comfort score on a 5-element Likert scale.
Fig. 4. Eye tracking data calibration. (a) Disparities of eye-tracking data using Wang’s method [36]. (b) Disparities of eye-tracking data using our ocular projection disparity based parameter estimation.
2.4. Ocular projection disparity based eye-tracker calibration
An eye tracker records each gaze position, but cannot provide absolute gaze direction, because it associates the place the observer fixated with captured features of the eye [35]. A calibration process is therefore required before each eye-tracking session to create a relationship between the images of the eyes and the points of fixation. The binocular accuracy and precision of the Tobii X120 after calibration, which determines the 2D gaze position as the mean value of the measurements on the two eyes, are 0.4° and 0.16°, respectively [33]. In [36], Wang et al. presented a linear 3D calibration process to determine the mapping parameters for a fixed head position, i.e., assuming that the eyes' positions were almost fixed. Given our unfixed head environment, Wang's method cannot guarantee accurate depths (Fig. 4(a)). We therefore improve on Wang's method by estimating the mapping parameters using the ocular projection disparity. The tracking calibration was done by asking each observer to successively fixate on 25 bright discs, each randomly displayed at one of 5 spatial positions and on one of 5 depth planes, with a 1 s interval between successive calibration displays.

The Tobii eye tracker uses an Active Display Coordinate System (ADCS), and records the pupil diameter, pupil position, and gaze point for each eye as the gaze data are collected (Fig. 5) [37]. The ocular projection disparity θ can be calculated as:

$$\theta = \theta_0 - \theta_G = \arccos\frac{\overrightarrow{O_l G_c}\cdot\overrightarrow{O_r G_c}}{|\overrightarrow{O_l G_c}|\,|\overrightarrow{O_r G_c}|} - \arccos\frac{\overrightarrow{O_l G_l}\cdot\overrightarrow{O_r G_r}}{|\overrightarrow{O_l G_l}|\,|\overrightarrow{O_r G_r}|}, \tag{1}$$

where $G_c = (G_l + G_r)/2$ is the midpoint of the two gaze points on the display plane.
Fig. 5. The Active Display Coordinate System (ADCS) used by the Tobii eye tracker. Tobii records the pupil diameter, pupil position ($O_l$, $O_r$), and gaze point on screen ($G_l$, $G_r$) for each eye under ADCS.
Given a 3D point in the calibration test and the corresponding eye-tracking data, the measured ocular projection disparity $\hat{\theta}_j$ at fixation can be estimated from the recorded gaze directions of the two eyes, where the corresponding known ocular projection disparity is denoted by $\theta_j$. Applying Wang's linear transform method, the calibration transform for ocular projection disparity can be written $\theta_j = \boldsymbol{\phi}\,\mathbf{a}$ (where $\boldsymbol{\phi} = [1 \;\; \hat{\theta}_j]$). The parameters $\mathbf{a} = [a_0 \;\; a_1]^T$ can be solved using the calibration eye-tracking data by:

$$\mathbf{a} = (\Phi^T \Phi)^{-1}\Phi^T \Theta, \tag{2}$$

where $\Theta$ is the vector of known values of $\theta$, and $\Phi$ is the matrix formed from the measured values of $\hat{\theta}$. Given the estimated parameters $\mathbf{a}$, the calibrated ocular projection disparity $\theta'$ of any recorded gaze disparity $\theta$ becomes:

$$\theta' = a_0 + a_1\,\theta. \tag{3}$$

Then the calibrated gaze points $G'_l$ and $G'_r$ in Fig. 6 can be computed by:

$$|G_c G'_l| = |O_l G_c|\,\frac{\sin\!\big(\tfrac{\theta_0 - \theta'_G}{2}\big)}{\sin(\angle 2)}, \tag{4}$$

where $\theta'_G = \theta_0 - \theta'$, $\angle 2 = \angle 1 - \tfrac{\theta_G - \theta'_G}{2}$, and $\sin(\angle 1) = \frac{|O_l G_c|}{|G_c G_l|}\,\sin(\angle 2)$. As exemplified by Fig. 4(b), the gaze depth data following this method is improved as compared with Wang's method.
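The calibration in Eqs. (1)–(3) reduces to a per-observer linear least-squares fit. The following is a compact sketch under the assumption that eye positions $O_l$, $O_r$ and on-screen gaze points $G_l$, $G_r$ are available as 3D coordinates in a common ADCS-like frame; the array names and the use of NumPy's least-squares solver are our own illustrative choices, and the gaze-point correction of Eq. (4) is omitted.

```python
import numpy as np

def ocular_projection_disparity(O_l, O_r, G_l, G_r):
    """Eq. (1): vergence angle toward the on-screen midpoint G_c minus the
    vergence angle implied by the two measured gaze points."""
    G_c = (G_l + G_r) / 2.0
    def angle(u, v):
        c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.arccos(np.clip(c, -1.0, 1.0))
    theta_0 = angle(G_c - O_l, G_c - O_r)
    theta_G = angle(G_l - O_l, G_r - O_r)
    return theta_0 - theta_G

def fit_disparity_calibration(theta_measured, theta_known):
    """Eqs. (2)-(3): least-squares fit of theta' = a0 + a1 * theta from the
    measured to the known ocular projection disparities (25 calibration targets)."""
    theta_measured = np.asarray(theta_measured, dtype=float)
    Phi = np.column_stack([np.ones_like(theta_measured), theta_measured])
    a, *_ = np.linalg.lstsq(Phi, np.asarray(theta_known, dtype=float), rcond=None)
    return a  # a[0] = a0, a[1] = a1

# Usage sketch (calibration samples are hypothetical):
# theta_m = [ocular_projection_disparity(Ol, Or, Gl, Gr) for Ol, Or, Gl, Gr in samples]
# a0, a1 = fit_disparity_calibration(theta_m, theta_true)
# theta_calibrated = a0 + a1 * np.asarray(theta_m)
```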
3. Empirical model of salient object disparity and MOS
After completing the subjective tests, the subjects were screened. No outliers were detected according to the guideline described in ITU-R BT.500-13 Annex 2 [34]. The MOS of each “original” S3D image with different shift parameters is plotted in Fig. 7(a). The results obtained between MOS and 11 groups of shifted S3D images indicate that those S3D images having a larger percentage of uncrossed disparities would be more comfortable
Fig. 6. Calibration of gaze points. $O_l G_l$ and $O_r G_r$ (dashed lines) are the eye-tracked gaze directions of the left and right eyes. $G'_l$ and $G'_r$ are the calibrated gaze points.
to view than those with a preponderance of crossed disparities. We analyzed the experimental results using a one-way analysis of variance (ANOVA) to study the effects on MOS of different degrees and directions of shift (Fig. 7(b)). It is clear that the MOS was significantly influenced by the shift factor (F(6, 70) = 18.21, p = 1.3e−12). The visual comfort score generally increased as the S3D content was shifted backward in depth, meaning there were more uncrossed disparities in the corresponding S3D image, reaching the highest value at 0.4°. This is consistent with the definition of the comfort zone [24]. Fig. 7(b) gives only a coarse relationship between visual comfort and depth, owing to differences in the depth ranges of the different "original" S3D images, and also since the different salient objects in the images lie in different depth planes.

The gaze points where the subjects fixated reflect human visual attention, and were recorded as fixation points by the eye tracker. The fixation distribution over all 24 observers reflected the frequency of measured fixations at a particular spatial location. Because the number of pixels covered by the fovea varies as a function of viewing distance, the discrete fixation distribution was smoothed to obtain a heat map, using a Gaussian convolution filter whose shape and size were determined by the pixels per degree for a display at a given viewing distance. Higher density zones in the heat map indicate where the users focused their gaze with higher frequency. These high density zones are regions of interest (ROIs) that the users were interested in [38]. The fovea is the part of the retina that permits the highest visual acuity. The average angular width of the foveal perceptive fields for human vision is about 17.8 min of arc [39], which is about 17 pixels under our viewing condition. We set the Gaussian filter with σ = 20 to generate the heat map on both the left and right images of each S3D image pair (Fig. 8).

To analyze the relationship between disparity and visual comfort, salient objects were extracted from the heat map, then the disparities of the salient objects were used to obtain the empirical disparity–comfort formula. The heat map was normalized, then the salient objects were segmented where the heat density was greater than a threshold T_so = 0.7 (Fig. 9(b), (e)). Multiple salient objects in different depth planes in an image cause multiple peaks in the histogram of the salient objects' disparity (Fig. 9(f)). These salient objects contribute to visual discomfort [40]. For a single salient object in an S3D image with a small salient object disparity variance (σ_dso), the visual comfort should correlate highly with the depth of the salient object. Human stereo acuity is about 2.3 min of arc, or about 1.5 pixels under our viewing conditions [41]. Usually, the heat maps of the left and right images have similar distributions, but for "55s-11" and "82s-11", the heat map distributions of the left and right images were somewhat different (Fig. 9(a), (d)). In the end, 62 left and right images with σ^i_dso < 1.5 were selected for model fitting: 60 images were selected from 30 pairs of S3D images, and another two images were the left images of "55s-11" (Fig. 9(c)) and "82s-11".
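The heat-map and ROI extraction described above can be sketched as follows (our own illustrative code; scipy's gaussian_filter stands in for whatever smoothing implementation was actually used, and the function and variable names are assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_heat_map(fixations, shape=(1080, 1920), sigma=20.0):
    """Accumulate fixation points (x, y) from all observers into a discrete
    fixation distribution and smooth it with a Gaussian kernel (sigma in pixels)."""
    density = np.zeros(shape, dtype=float)
    for x, y in fixations:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < shape[0] and 0 <= xi < shape[1]:
            density[yi, xi] += 1.0
    heat = gaussian_filter(density, sigma=sigma)
    return heat / heat.max() if heat.max() > 0 else heat  # normalize to [0, 1]

def salient_object_mask(heat, threshold=0.7):
    """Segment salient objects as the region where normalized heat density > T_so."""
    return heat > threshold

def salient_object_disparity_stats(mask, disparity_map):
    """Mean and standard deviation of disparity over the segmented salient region,
    used to select images with small sigma_dso for the fit of Eq. (5)."""
    vals = disparity_map[mask]
    return float(vals.mean()), float(vals.std())
```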
Fig. 7. (a) MOS of the shifted versions of the eleven "original" S3D images. (b) ANOVA box plots of MOS against the shifted versions. The total degrees of freedom for 77 S3D images and 11 shifted versions was 70, and the between-groups degrees of freedom was 6. '*' marks the mean of each shifted version, '−' the median, and '+' the outliers.
A quadratic function was fitted to the mean salient object disparities ($\bar{d}^{\,i}_{so}$) and MOS ($MOS^i$) of these selected images (Fig. 10):

$$MOS(d_{so}) = -3.01\times 10^{-4}\, d_{so}^2 + 2.53\times 10^{-2}\, d_{so} + 3.56. \tag{5}$$
This fitted model is consistent with the result in Fig. 7(b). It shows that visual comfort improves when a salient object moves backwards, reaching a peak at the depth plane of 0.7° ocular projection disparity.
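For reference, a small sketch of how such a quadratic disparity–comfort model can be fitted and its peak located; the unweighted numpy.polyfit call is an assumption, since the paper does not state whether the fit was weighted by the per-image variances shown in Fig. 10.

```python
import numpy as np

def fit_disparity_comfort_model(d_so, mos):
    """Least-squares quadratic fit MOS(d_so) = c2*d_so^2 + c1*d_so + c0, as in Eq. (5)."""
    c2, c1, c0 = np.polyfit(d_so, mos, deg=2)
    peak = -c1 / (2.0 * c2)  # vertex of the parabola: most comfortable disparity
    return (c2, c1, c0), peak

# With the published coefficients the peak lies near 42 px, i.e. roughly 0.7 deg
# at the assumed viewing geometry (about 11 px per 0.2 deg):
c2, c1, c0 = -3.01e-4, 2.53e-2, 3.56
print(-c1 / (2 * c2))  # ~ 42.0
```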
4. Eye movement features and visual discomfort
4.1. Eye movement features
There are three main depth perception related oculomotor responses: accommodation, convergence, and control of the pupillary diameter [42]. The Tobii X120 records every gaze sample and outputs the pupil diameters, the pupil positions in space ($O_l$, $O_r$), and the gaze points of the two eyes ($G_l$, $G_r$), but it does not supply accommodation related diopter data [37]. Suppose the eye tracker has sampled the gaze data of $K$ observers indexed $k = 1, \ldots, K$ on $N$ S3D test images $I_i = \{I^l_i, I^r_i\}$. These data are classified into the categories of fixation, saccade, and blinking [43].
Fig. 8. Heat maps of fixations measured on all subjects’ left eyes ((a)–(c)) and right eyes ((g)–(i)) on a disparity-shifted group of images of 3s0. ROI disparity histogram of corresponding left image ((d)–(f)) and right image ((j)–(l)).
Jansen et al. found that depth information changes the nature of basic eye movements [44]. In the following, we characterize eye movements using a variety of quantitative features, and we compute the degree of correlation between these features and recorded experiences of visual discomfort when viewing S3D content. As compared to 2D viewing, S3D viewing produces additional vergence eye movements that fixate the target location correctly using both eyes [44]. The experience of visual discomfort is related in part to the complexity of achieving binocular fusion. Generally, there is binocular rivalry if the two retinal images are unmatched and therefore difficult to fuse. It has been proposed that binocular rivalry is resolved early in the visual pathway, via mutual inhibition between monocular neurons in primary visual cortex (V1) [45]. Other evidence suggests that rivalry is resolved in higher cortical areas after information from the two eyes has been integrated [46]. This suggests that visual discomfort may be related to multiple levels of visual processing.

Fig. 9. Heat maps of fixations measured on all subjects' left eyes (a) and right eyes (d) on 55s-11. Salient object(s) (yellow area) segmented with $T_{so} = 0.7$ ((b), (e)). Disparity histogram for pixels in the salient object(s) region ((c), (f)).

Fig. 10. Quadratic fitted salient object disparity and MOS model. Each disc centered at $(\bar{d}^{\,i}_{so}, MOS^i)$ corresponds to an image, with the variance of salient object disparity ($\sigma^i_{d_{so}}$) as its width and the variance of MOS ($\sigma^i_{MOS}$) as its height. The blue dashed-line discs are those images with $\sigma^i_{d_{so}} \ge 1.5$, and the yellow discs are those images with $\sigma^i_{d_{so}} < 1.5$.

We define three features descriptive of fixations during a session of viewing a given displayed S3D image $I_i$. They are the number of fixations $f^i_1$, the maximum fixation duration $f^i_2$, and the average fixation duration $f^i_3$. These are respectively defined as:

$$f_1^i = \frac{1}{K}\sum_{k=1}^{K}\eta_{i,k}, \tag{6}$$

where $\eta_{i,k}$ is the total number of fixations tracked on observer $k$ on test S3D image $I_i$,

$$f_2^i = \frac{1}{K}\sum_{k=1}^{K}\max_{1\le j\le \eta_{i,k}} t_{i,k,j}, \tag{7}$$

where $t_{i,k,j}$ is the duration of the $j$-th fixation of observer $k$ on S3D image $I_i$, and

$$f_3^i = \frac{1}{K}\sum_{k=1}^{K}\left(\frac{1}{\eta_{i,k}}\sum_{j=1}^{\eta_{i,k}} t_{i,k,j}\right). \tag{8}$$

When the vision system perceives a scene containing a range of depths, the pupil diameter dynamically changes with the depth of field (DOF). The DOF is roughly inversely proportional to the pupil diameter [47]. Given a distribution of relative depths around a fixation, a larger range of depths implies a higher probability of diplopia due to the limits of Panum's fusional area. Given eye-tracked pupil diameter data $d^l_{i,k,j}$ and $d^r_{i,k,j}$, which are respectively the left and right pupil diameters at the $j$-th fixation of observer $k$ on S3D image $I_i$, we calculate the mean pupil diameters $f^i_4$ and $f^i_5$ of the left and right eyes:

$$f_4^i = \frac{1}{K}\sum_{k=1}^{K}\bar{d}^{\,l}_{i,k}, \tag{9}$$

and

$$f_5^i = \frac{1}{K}\sum_{k=1}^{K}\bar{d}^{\,r}_{i,k}, \tag{10}$$

where $\bar{d}^{\,v}_{i,k} = \frac{1}{\eta_{i,k}}\sum_{j=1}^{\eta_{i,k}} d^v_{i,k,j}$ ($v = l, r$) is the mean pupil diameter at fixation.

In response to parallax on the displayed S3D image, the eyes change vergence to adapt to a pair of parallax points in order to form a mutual fixation point. The overall perception of depth is related to the parallax on the display, the pupil distance, the thickness of the crystalline lens, and the distance between the eyes and the display. The eye tracker records the relationship between the two eyes and the display, and the positions of the gazed points. These data can be interpreted as samples of angular disparity $\theta$ (Fig. 5). During an S3D viewing experience, objective depth perception corresponds to angular disparity. [9] presented a discomfort prediction model using features expressive of disparity and neural activity statistics. Given their success in attaining high discomfort prediction performance using disparity features, we define several statistical features that relate to angular disparity. Suppose $\theta_{i,k,t}$ is the angular disparity at time $t$ for observer $k$ watching S3D image $I_i$, and $\theta^f_{i,k,t}$ is the angular disparity at a fixation. Then the mean angular disparity $f^i_6$ and the variance of angular disparity $f^i_7$ of the viewed S3D image $I_i$ are:

$$f_6^i = \frac{1}{K}\sum_{k=1}^{K}\bar{\theta}_{i,k}, \tag{11}$$

and

$$f_7^i = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{M_{i,k}}\sum_{t=1}^{M_{i,k}}\big(\theta_{i,k,t}-\bar{\theta}_{i,k}\big)^2, \tag{12}$$

where $M_{i,k}$ is the number of valid gaze points collected on observer $k$ viewing S3D image $I_i$, and $\bar{\theta}_{i,k} = \frac{1}{M_{i,k}}\sum_{t=1}^{M_{i,k}}\theta_{i,k,t}$ is the mean angular disparity.

We also define angular disparity features under a fixation model. Define the mean angular disparity at fixation as:

$$f_8^i = \frac{1}{K}\sum_{k=1}^{K}\bar{\theta}^{\,f}_{i,k}, \tag{13}$$

where $\bar{\theta}^{\,f}_{i,k} = \frac{1}{N^f_{i,k}}\sum_{t=1}^{N^f_{i,k}}\theta^f_{i,k,t}$ is the mean angular disparity at fixation for observer $k$ viewing S3D image $I_i$, and $N^f_{i,k}$ is the number of fixated gaze points.
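A sketch of how features $f_1$–$f_8$ of Eqs. (6)–(13) might be computed from per-observer records follows; the record layout (per-observer arrays of fixation durations, pupil diameters, and angular disparity samples) is our own assumption about how the exported eye-tracking data could be organized.

```python
import numpy as np

def fixation_and_disparity_features(observers):
    """observers: list of dicts, one per observer k, each with numpy arrays
       'fix_dur'   : fixation durations t_{i,k,j}
       'pupil_l'   : left-eye pupil diameters per fixation
       'pupil_r'   : right-eye pupil diameters per fixation
       'theta'     : angular disparities of all valid gaze samples
       'theta_fix' : angular disparities of the fixated gaze samples
    Returns f1..f8 for one S3D image, averaged over observers as in Eqs. (6)-(13)."""
    f1 = np.mean([len(o['fix_dur']) for o in observers])        # number of fixations
    f2 = np.mean([o['fix_dur'].max() for o in observers])       # maximum fixation duration
    f3 = np.mean([o['fix_dur'].mean() for o in observers])      # mean fixation duration
    f4 = np.mean([o['pupil_l'].mean() for o in observers])      # mean left pupil diameter
    f5 = np.mean([o['pupil_r'].mean() for o in observers])      # mean right pupil diameter
    f6 = np.mean([o['theta'].mean() for o in observers])        # mean angular disparity
    f7 = np.mean([o['theta'].var() for o in observers])         # variance of angular disparity
    f8 = np.mean([o['theta_fix'].mean() for o in observers])    # mean angular disparity at fixation
    return np.array([f1, f2, f3, f4, f5, f6, f7, f8])
```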
1
67
2
68
3
69
4
70
5
71
6
72
7
73
8
74
9
75
10
76
11
77
12
78
13
79
14
80
15
81
16
82
17
83
18
84
19
85
20
86
21
87
22
88
23
89
24
90
25
91
26
92
27
93
28
94
29
95
30
96
31
97
32
98
33
99
34
100 101
35
Fig. 11. Angular disparity histogram (a) and FDM (b) of “69s11”; angular disparity histogram (c) and FDM (d) of “82s-11”.
Kim in [48] assumed that fixations tend to land at points of both lower disparity gradient [22] and smaller disparities. We proposed the concept of a fixation density map (FDM) annotation method [49], whereby a heat map is first segmented into equal angular disparity intervals, then the segmented heat maps are visualized in layers of depth space. It has been shown that fixations are usually biased toward the center of the screen, and in 3D it was also shown that positions in front attract attention [50]. Fig. 11 shows that most fixations are concentrated within a small range of depths where the angular disparity is close to 0. This is consistent with the results found in [50]. Smaller values of angular disparity correspond to fixations at points nearer to the screen plane. This phenomenon is consistent with the finding of [22] that the distribution of disparity tuning preferences in visual area MT is strongly skewed towards near disparities.

In order to evaluate which angular disparities at fixation correlate most highly with the experience of visual discomfort, we define the feature $f^i_p$ as the lower $p$-th percentile of the absolute angular disparity distribution:

$$f_p^i = \frac{1}{K}\sum_{k=1}^{K}\left(\frac{1}{N^f_{i,k,p}}\sum_{j=1}^{N^f_{i,k,p}}\theta^{f,\uparrow}_{i,k,j}\right), \tag{14}$$

where $\{\theta^{f,\uparrow}_{i,k,j}\}$ are the angular disparities at fixation reordered by absolute value in ascending order, $N^f_{i,k,p} = N^f_{i,k}\cdot p$ is the number of fixations lying within the lower $p\%$ of $\{\theta^{f,\uparrow}_{i,k,j}\}$, and $N^f_{i,k}$ is the total number of $\{\theta^{f,\uparrow}_{i,k,j}\}$ at fixation for observer $k$ viewing S3D image $I_i$.

Fig. 12. Plot of PLCC between $f^i_p$ and MOS against $p\%$.
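A minimal sketch of the percentile feature of Eq. (14) for a single observer follows; whether the signed or absolute disparities are averaged after the reordering is not fully pinned down by the notation, so the choice below (signed values, ordered by absolute value) is an assumption.

```python
import numpy as np

def lower_percentile_disparity(theta_fix, p):
    """Eq. (14) for a single observer: mean of the lowest p% of angular
    disparities at fixation, ordered by absolute value."""
    theta_fix = np.asarray(theta_fix, dtype=float)
    order = np.argsort(np.abs(theta_fix))          # ascending by |theta|
    n_p = max(1, int(len(theta_fix) * p / 100.0))  # number of fixations in the lower p%
    return theta_fix[order][:n_p].mean()

# Averaging this over observers and sweeping p reproduces the curve of Fig. 12;
# the paper reports a PLCC maximum near p = 45.
```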
Fig. 12 plots the PLCC between $f^i_p$ and MOS against $p\%$. The PLCC reaches its maximum around $p = 45$. We define two features $f^i_9$, $f^i_{10}$, which are the averages of the lower 45% and upper 45% of the absolute angular disparity distribution of all recorded fixations:

$$f_9^i = \frac{1}{K}\sum_{k=1}^{K}\left(\frac{1}{N^f_{i,k,45\%}}\sum_{j=1}^{N^f_{i,k,45\%}}\theta^{f,\uparrow}_{i,k,j}\right), \tag{15}$$

and

$$f_{10}^i = \frac{1}{K}\sum_{k=1}^{K}\left(\frac{1}{N^f_{i,k,45\%}}\sum_{j=N^f_{i,k,55\%}}^{N^f_{i,k}}\theta^{f,\uparrow}_{i,k,j}\right). \tag{16}$$

The angular disparity of saccades is not taken into consideration in features $f^i_9$ and $f^i_{10}$. Hence we define two other features, $f^i_{11}$ and $f^i_{12}$, that correspond to the lower and upper 45th percentiles of the angular disparity distribution from all of the eye-tracking data:

$$f_{11}^i = \frac{1}{K}\sum_{k=1}^{K}\left(\frac{1}{N_{i,k,45\%}}\sum_{j=1}^{N_{i,k,45\%}}\theta^{\uparrow}_{i,k,j}\right), \tag{17}$$

and

$$f_{12}^i = \frac{1}{K}\sum_{k=1}^{K}\left(\frac{1}{N_{i,k,45\%}}\sum_{j=N_{i,k,55\%}}^{N_{i,k}}\theta^{\uparrow}_{i,k,j}\right), \tag{18}$$

where $\{\theta^{\uparrow}_{i,k,j}\}$ are angular disparities reordered by absolute value in ascending order, and $N_{i,k}$ is the total number of $\{\theta_{i,k,j}\}$ for observer $k$ viewing S3D image $I_i$.

The stereoscopic depth acuity of the human eyes is limited [51]. The difference in disparity of one part of an S3D content relative to another must be large enough to produce a discernible variation in perceived depth [52]. Furthermore, angular disparities change with eye movements over time. Hence define the angular disparity time-difference $\nabla\theta_{i,k,t}$ as the difference of successive angular disparity values:

$$\nabla\theta_{i,k,t} = \theta_{i,k,t} - \theta_{i,k,t-1}. \tag{19}$$

Also define the mean angular disparity time-difference $f^i_{13}$:

$$f_{13}^i = \frac{1}{K}\sum_{k=1}^{K}\left(\frac{1}{N_{i,k}-1}\sum_{t=2}^{N_{i,k}}\nabla\theta_{i,k,t}\right). \tag{20}$$

As a way of more fully exploiting the limited processing capacity of the visual system [53], visual attentional mechanisms direct our gaze towards objects of interest in the visual environment [54]. Salient objects in an S3D image that are located at different depth layers are associated not only with saccadic (ballistic) changes in vergence, but also with visual discomfort [40]. Thus, given the mean angular disparity $\bar{\theta}^{\,f}_{i,k,j} = \frac{1}{N_j}\sum_{t=1}^{N_j}\theta^f_{i,k,j,t}$ at the $j$-th fixation with $N_j$ samples, we also define the maximum and mean vergence jumps:

$$f_{14}^i = \frac{1}{K}\sum_{k=1}^{K}\max_{1 < j \le \eta_{i,k}}\nabla\bar{\theta}^{\,f}_{i,k,j}, \tag{21}$$

and

$$f_{15}^i = \frac{1}{K}\sum_{k=1}^{K}\left(\frac{1}{\eta_{i,k}-1}\sum_{j=2}^{\eta_{i,k}}\nabla\bar{\theta}^{\,f}_{i,k,j}\right), \tag{22}$$

respectively, where $\nabla\bar{\theta}^{\,f}_{i,k,j} = \bar{\theta}^{\,f}_{i,k,j} - \bar{\theta}^{\,f}_{i,k,j-1}$ ($j = 2, \ldots, \eta_{i,k}$) at fixation for observer $k$ viewing S3D image $I_i$.

4.2. Visual discomfort model based on eye movement features

Given the above defined feature set $\{\mathbf{f}_i\}_{i=1,\ldots,77}$ on the 77 shifted S3D images used in our experiment, where each feature vector $\mathbf{f}_i \in \mathbb{R}^{15}$, and also given the MOS values $\{y_i\}_{i=1,\ldots,77} \in \mathbb{R}$ corresponding to the same images, we trained a support vector regressor (SVR) [55] on the features and MOS to learn visual discomfort prediction models. The choice of kernel function can significantly impact the performance of an SVR [56]. Here the LibSVM package [57] was utilized to implement the SVR, and three kernel functions were used to learn three SVR models:

Linear kernel:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j. \tag{23}$$

Polynomial kernel:
$$K(\mathbf{x}_i, \mathbf{x}_j) = (\gamma\, \mathbf{x}_i^T \mathbf{x}_j + r)^d, \quad \gamma > 0. \tag{24}$$

Radial basis function (RBF) kernel:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma\, \|\mathbf{x}_i - \mathbf{x}_j\|^2), \quad \gamma > 0. \tag{25}$$

A number of SVR parameters are required. These include the order $d$ of the polynomial kernel, and the cost $C$ and the scaling constant $\gamma$ of the RBF kernel. $d$ was set to 3 for the polynomial kernel, and the cost $C = 2000$ and $\gamma = 0.1$ were determined using cross-validation and grid search for the RBF kernel.

5. Experimental results

With the eye-tracking data and MOS collected in this visual discomfort experiment, visual discomfort models based on the eye movement features were learned using the SVR with different kernels. To evaluate the models learned using the SVR, 1000 train-test trials were performed and the average performance over the trials was used as the final performance. In each train-test trial, 80% of the dataset $\{\mathbf{f}_i, y_i\}$ was randomly selected for training, with the remaining 20% used for testing. We used Spearman's Rank-Order Correlation Coefficient (SROCC), Pearson's Linear Correlation Coefficient (PLCC) and the Root Mean Squared Error (RMSE) between the predicted and subjective scores to evaluate the performance of the trained models. The average values over the 1000 random train-test trials are reported in Table 1. The linear kernel achieves the highest PLCC and SROCC values and the lowest RMSE. The performance of the RBF kernel was very close to that of the linear kernel.

Table 1. Average results over 1000 trials with all features.

Kernel type   PLCC     SROCC    RMSE
Linear        0.7379   0.6717   0.3880
Poly          0.5564   0.5110   0.5721
RBF           0.7341   0.6659   0.3988
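The training and evaluation protocol described above can be sketched as follows, substituting scikit-learn's SVR (which wraps LIBSVM) for the authors' LibSVM setup; the feature standardization and the reuse of C = 2000, γ = 0.1 for all kernels are our own assumptions.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

def evaluate_svr(features, mos, kernel="rbf", n_trials=1000):
    """Average PLCC / SROCC / RMSE over random 80/20 train-test splits,
    as in Tables 1 and 2. features: (77, 15) array; mos: (77,) array."""
    plcc, srocc, rmse = [], [], []
    for seed in range(n_trials):
        X_tr, X_te, y_tr, y_te = train_test_split(
            features, mos, test_size=0.2, random_state=seed)
        model = make_pipeline(
            StandardScaler(),
            SVR(kernel=kernel, C=2000, gamma=0.1, degree=3))  # C, gamma from the paper's grid search
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        plcc.append(pearsonr(pred, y_te)[0])
        srocc.append(spearmanr(pred, y_te)[0])
        rmse.append(np.sqrt(np.mean((pred - y_te) ** 2)))
    return np.mean(plcc), np.mean(srocc), np.mean(rmse)

# Usage: for k in ("linear", "poly", "rbf"): print(k, evaluate_svr(F, mos, kernel=k))
```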
Fig. 13. Correlations between oculomotor features and MOS scores.
We separately tested the oculomotor features $\{f^i_j\}$ ($i = 1, \ldots, 77$; $j = 1, \ldots, 15$) by computing their PLCC and SROCC versus MOS (Fig. 13). From the correlation data between the eye movement features and MOS, we observe that:

1. Fixation duration related features have a negative correlation with MOS. Feature $f_1$, the number of fixations, is positively correlated with MOS, while features $f_2$ and $f_3$, the maximum and average fixation durations, are negatively correlated (PLCC = −0.07, −0.28) with MOS. These results are consistent: given a subjective test of fixed length, if the mean fixation duration is long, then generally there will be fewer fixations. The results on $f_2$ and $f_3$ are related to the notion that the visual brain requires more time and metabolic expenditure to fuse difficult S3D content, which relates to visual discomfort;
2. Pupil diameter related features have a small negative correlation with MOS. Features $f_4$ and $f_5$, the mean pupil diameters of the left and right eyes, have a negative correlation (PLCC = −0.16, −0.18) with MOS. This suggests that the pupil diameter is somewhat more likely to be small when viewing comfortable S3D content, where a clearer retinal image may be formed within a larger depth of focus;
3. Angular disparity related features $F_{ad} = \{f_6, f_8, f_9, f_{10}, f_{11}, f_{12}\}$ have the highest correlation with MOS, with PLCC > 0.71 and SROCC > 0.66, except for the variance of angular disparity (PLCC < 0.01). Features $f_{14}$ and $f_{15}$, the maximum and mean vergence jumps between fixations, also correlate fairly strongly (PLCC = 0.51, 0.36) with MOS. The relevance of sharp angular disparity gradients may actually be higher, given the limited disparity range (about 0.8°) of the S3D images used in our study. Feature $f_7$, the variance of angular disparity, is almost uncorrelated with MOS (PLCC = 0.0068). As shown in Fig. 11, most fixations are concentrated within a small range of depths having small angular disparities, which causes the distribution of angular disparity to be leptokurtic (average kurtosis of 11). Thus, feature $f_7$ is relatively small and nearly uncorrelated with MOS.

Using $F_{ad}$ as a standalone feature set, we trained another SVR prediction model using the linear kernel. We then trained other SVR prediction models on the feature vector $F_{ad}$ combined with every feature in $F_r = \{f_1, f_2, f_3, f_4, f_5, f_7, f_{13}, f_{14}, f_{15}\}$. The performances of these prediction models are shown in Fig. 14. Among all the features in $F_r$, features $f_1$, $f_2$, $f_3$, $f_4$, $f_5$, $f_7$ actually decreased the performance. This phenomenon may follow from the notion that fixation duration is related not only to the stereo fusion process, but also to the visual interestingness of the S3D content, so there is no tight correlation between these features and the experience of visual discomfort. The correlation coefficient between $f_2$ and MOS was also small, with PLCC = −0.072 and SROCC = −0.039. Little performance lift can be attributed to $f_{13}$ or $f_{15}$, while the maximum vergence jump $f_{14}$ boosts prediction performance the most, yielding a 2.76% increase in PLCC.

Fig. 14. Performance of prediction models for specific feature combinations. The performance of "orig" was learned using the linear kernel with the selected features $F_{ad}$. The other performances plotted in the figure were learned on every feature in $F_r$ combined with $F_{ad}$.

Based on the above analysis, we selected the angular disparity features $F_{ad}$ and the features $\{f_{13}, f_{14}, f_{15}\}$ as the most promising features to learn a prediction model using the same learning machine. The performance of this prediction model computed over 1000 train-test trials is tabulated in Table 2. The highest performance was achieved using the RBF kernel, which was slightly better than that of the models trained on all features.

Table 2. Average results over 1000 trials with selected features.

Kernel type   PLCC     SROCC    RMSE
Linear        0.7436   0.6736   0.3861
Poly          0.6864   0.6709   0.5681
RBF           0.7454   0.6738   0.3878
6. Conclusion
A subjective visual discomfort experiment was implemented using an eye tracker on subjects viewing a set of shifted S3D images. In this experiment, eleven natural scene S3D images were selected
as "original" S3D images. Considering the effects of the admissible disparity range on visual discomfort, another 6 shifted versions of each original S3D image were generated, by varying a set of shift parameters, to construct a Shifted-S3D-Image-Database (SSID). The accuracy of the eye-tracking data collected from the X120 can be greatly improved by applying a calibration procedure [58], which we improved on using a new angular disparity based calibration method. The proposed calibration method uses a least squares (LS) method to generate a linear calibration transform using the angular disparities computed from the two eyes' gaze directions and the corresponding calibration points viewed in depth. Compared to Wang's method [36], the gaze depth data following this method is improved.

The results obtained between MOS and the 11 groups of shifted S3D images indicate that S3D images having a larger percentage of uncrossed disparities are more comfortable to view than those with a preponderance of crossed disparities. Using the heat maps obtained from the eye-tracking data, an empirical best-fit formula relating salient object disparity and visual comfort was found. For a salient object located in depth, the visual comfort generally increases as it shifts backwards, peaking at about 0.7°.

As compared to the experience of viewing 2D content, an observer must adjust their vergence to fuse corresponding points in depth when viewing S3D content. We studied a number of quantitative eye movement features that relate to the experience of visual discomfort. These eye movement features can be classified into three categories: fixation, pupil diameter, and angular disparity related features. Using these features, prediction models were learned using an SVR. The learned models produce good visual discomfort prediction performance, and we showed that angular disparity features are highly related to visual comfort.
Acknowledgment The work for this paper was supported in part by NSFC under 61471234, 61527804 and MOST under 2015BAK05B03. References [1] M. Lambooij, M. Fortuin, I. Heynderickx, W. IJsselsteijn, Visual discomfort and visual fatigue in stereoscopic displays: a review, J. Imaging Sci. Technol. 53 (3) (2009) 1–14, https://doi.org/10.2352/J.ImagingSci.Technol.2009.53.3.030201. [2] W. Tam, F. Speranza, S. Yano, K. Shimono, H. Ono, Stereoscopic 3D-TV: visual comfort, IEEE Trans. Broadcast. 57 (2) (2011) 335–346, https://doi.org/10.1109/ TBC.2011.2125070. [3] D.M. Hoffman, A.R. Girshick, K. Akeley, M.S. Banks, Vergence–accommodation conflicts hinder visual performance and cause visual fatigue, J. Vis. 8 (3) (2008) 33.1–33.30, https://doi.org/10.1167/8.3.33. [4] A.K. Moorthy, A.C. Bovik, Visual quality assessment algorithms: what does the future hold?, Multimed. Tools Appl. 51 (2) (2011) 675–696, https://doi.org/10. 1007/s11042-010-0640-x. [5] L. Goldmann, T. Ebrahimi, 3D quality is more than just the sum of 2D and depth, in: IEEE International Workshop on Hot Topics in 3D, Singapore, 2010. [6] A. Mittal, A.K. Moorthy, J. Ghosh, A.C. Bovik, Algorithmic assessment of 3D quality of experience for images and videos, in: Digital Signal Processing Workshop and IEEE Signal Processing Education Workshop (DSP/SPE), 2011, pp. 338–343. [7] B. Alexandre, L.C. Patrick, C. Patrizio, C. Romain, Quality assessment of stereoscopic images, EURASIP J. Image Video Process. 2008 (1) (2009) 1–13, https:// doi.org/10.1155/2008/659024. [8] J. Park, S. Lee, A.C. Bovik, 3D visual discomfort prediction: vergence, foveation, and the physiological optics of accommodation, IEEE J. Sel. Top. Signal Process. 8 (3) (2014) 415–427, https://doi.org/10.1109/JSTSP.2014.2311885. [9] J. Park, H. Oh, S. Lee, A. Bovik, 3D discomfort predictor: analysis of disparity and neural activity statistics, IEEE Trans. Image Process. 24 (3) (2015) 1101–1114, https://doi.org/10.1109/TIP.2014.2383327. [10] J. Chen, J. Zhou, J. Sun, A.C. Bovik, 3D visual discomfort prediction using low complexity disparity algorithms, EURASIP J. Image Video Process. 2016 (1) (2016) 23, https://doi.org/10.1186/s13640-016-0127-4. [11] J. Chen, J. Zhou, J. Sun, A.C. Bovik, Visual discomfort prediction on stereoscopic 3D images without explicit disparities, Signal Process. Image Commun. 51 (2017) 50–60, https://doi.org/10.1016/j.image.2016.11.006.
[12] T. Kim, S. Lee, A. Bovik, Transfer function model of physiological mechanisms underlying temporal visual discomfort experienced when viewing stereoscopic 3D images, IEEE Trans. Image Process. 24 (11) (2015) 4335–4347, https://doi.org/10.1109/TIP.2015.2462026.
[13] M.S. Castelhano, M.L. Mack, J.M. Henderson, Viewing task influences eye movement control during active scene perception, J. Vis. 9 (3) (2009) 6.1–6.15, https://doi.org/10.1167/9.3.6.
[14] I.P. Howard, B.J. Rogers, Perceiving in Depth, Volume 2: Stereoscopic Vision, Oxford University Press, 2012.
[15] D. Noton, L.W. Stark, Eye movements and visual perception, Sci. Am. 224 (6) (1971) 35–43.
[16] A.C. Schütz, D.I. Braun, K.R. Gegenfurtner, Eye movements and perception: a selective review, J. Vis. 11 (5) (2011) 9.1–9.30, https://doi.org/10.1167/11.5.9.
[17] W.T. Benjamin, K. Clare, G.M. Ross, M.A.M. Katy, W.S. Steven, The active eye: perspectives on eye movement research, in: M. Horsley, M. Eliot, B.A. Knight, R. Reilly (Eds.), Current Trends in Eye Tracking Research, Springer, 2014, pp. 3–16.
[18] W. Zhang, H. Liu, Toward a reliable collection of eye-tracking data for image quality research: challenges, solutions, and applications, IEEE Trans. Image Process. 26 (5) (2017) 2424–2437, https://doi.org/10.1109/TIP.2017.2681424.
[19] J. Li, M. Barkowsky, P. Le Callet, Visual discomfort is not always proportional to eye blinking rate: exploring some effects of planar and in-depth motion on 3DTV QoE, in: Proceedings of VQME.
[20] L. Zhang, J. Ren, L. Xu, J. Zhang, J. Zhao, Visual comfort and fatigue measured by eye movement analysis when watching three-dimensional displays, Ophthalmol. China 23 (1) (2014) 37–42.
[21] I. Iatsun, M. Larabi, C. Fernandez, On the comparison of visual discomfort generated by S3D and 2D content based on eye-tracking features, Proc. SPIE 9011 (2014) 2978–2982, https://doi.org/10.1117/12.2042481.
[22] Y. Liu, L. Cormack, A. Bovik, Dichotomy between luminance and disparity features at binocular fixations, J. Vis. 10 (12) (2010) 23.1–23.17, https://doi.org/10.1167/10.12.23.
[23] G.R. Jones, D. Lee, N.S. Holliman, D. Ezra, Controlling perceived depth in stereoscopic images, Proc. SPIE 4297 (2001), https://doi.org/10.1117/12.430855.
[24] T. Shibata, J. Kim, D. Hoffman, M. Banks, The zone of comfort: predicting visual discomfort with stereo displays, J. Vis. 11 (8) (2011) 11.1–11.29, https://doi.org/10.1167/11.8.11.
[25] A.J. Woods, T. Docherty, R. Koch, Image distortions in stereoscopic video systems, Proc. SPIE 1915 (1993) 36–48.
[26] W. Cheng, M. Barkowsky, P. Le Callet, New stereoscopic video shooting rule based on stereoscopic distortion parameters and comfortable viewing zone, Proc. SPIE 7863 (4) (2011) 1200–1233, https://doi.org/10.1117/12.872332.
[27] G. Sun, N. Holliman, Evaluating methods for controlling depth perception in stereoscopic cinematography, in: A.J. Woods, N.S. Holliman, J.O. Merritt (Eds.), Stereoscopic Displays and Virtual Reality Systems XX, in: Proc. SPIE, vol. 7237, 2009.
[28] D. Yao, J. Zhou, Z. Xue, Homography matrix genetic consensus estimation algorithm, in: International Conference on Audio, Language and Image Processing (ICALIP), IEEE, 2010, pp. 1139–1143.
[29] D. Sun, S. Roth, M.J. Black, Secrets of optical flow estimation and their principles, in: IEEE Conf. on Computer Vision and Pattern Recognition, 2010, pp. 2432–2439.
[30] B. Mendiburu, 3D Movie Making – Stereoscopic Digital Cinema from Script to Screen, Focal Press, 2009.
[31] ITU-T P.916, Information and guidelines for assessing and minimizing visual discomfort and visual fatigue from 3D video, 2016.
[32] ITU-R BT.1438, Subjective assessment of stereoscopic television pictures, 2000.
[33] Tobii® Technology, Accuracy and Precision Test Method for Remote Eye Trackers, 2011.
[34] ITU-R BT.500-13, Methodology for the subjective assessment of the quality of television pictures, 2012.
[35] P. Kasprowski, K. Harezlak, ETCAL – a versatile and extendable library for eye tracker calibration, in: Digital Signal Processing & SoftwareX – Joint Special Issue on Reproducible Research in Signal Processing, Digit. Signal Process. 77 (2017) 222–232, https://doi.org/10.1016/j.dsp.2017.11.011.
[36] R. Wang, B. Pelfrey, A. Duchowski, D. House, Online 3D gaze localization on stereoscopic displays, ACM Trans. Appl. Percept. 11 (1) (2014), https://doi.org/10.1145/2593689.
[37] Tobii® Technology, Tobii Analytics SDK: Developer's Guide, Release 3.0, 2013.
[38] F. Xiao, L. Peng, L. Fu, X. Gao, Salient object detection based on eye tracking data, Signal Process. 144 (2018) 392–397, https://doi.org/10.1016/j.sigpro.2017.10.019.
[39] L. Spillmann, Foveal perceptive fields in the human visual system measured with simultaneous contrast in grids and bars, Pflügers Arch. 326 (4) (1971) 281–299, https://doi.org/10.1007/BF00586993.
[40] X. Zhang, J. Zhou, J. Chen, X. Guo, Y. Zhang, X. Gu, Visual comfort assessment of stereoscopic images with multiple salient objects, in: 2015 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), 2015, pp. 1–6.
[41] B.E. Coutant, G. Westheimer, Population distribution of stereoscopic ability, Ophthalmic Physiol. Opt. 13 (1) (1993) 3–7, https://doi.org/10.1111/j.1475-1313.1993.tb00419.x.
[42] S. Reichelt, R. Häussler, G. Fütterer, N. Leister, Depth cues in human visual perception and their realization in 3D displays, in: Three-Dimensional Imaging, Visualization, and Display 2010 and Display Technologies and Applications for Defense, Security, and Avionics IV, in: Proc. SPIE, vol. 7690, 2010, pp. 281–290.
[43] P. Olsson, Real-Time and Offline Filters for Eye Tracking, Master's Degree Project, XR-EE-RT 2007:011, Stockholm, Sweden, 2007.
[44] L. Jansen, S. Onat, P. König, Influence of disparity on fixation and saccades in free viewing of natural scenes, J. Vis. 9 (1) (2009) 29.1–29.19, https://doi.org/10.1167/9.1.29.
[45] R. Blake, A neural theory of binocular rivalry, Psychol. Rev. 96 (1) (1989) 145–167.
[46] I. Kovacs, T.V. Papathomas, M. Yang, A. Feher, When the brain changes its mind: interocular grouping during binocular rivalry, Proc. Natl. Acad. Sci. USA 93 (26) (1996) 15508–15511.
[47] S. Marcos, E. Moreno, R. Navarro, The depth-of-field of the human eye from objective and subjective measurements, Vis. Res. 39 (12) (1999) 2039–2049.
[48] H. Kim, S. Lee, A. Bovik, Saliency prediction on stereoscopic videos, IEEE Trans. Image Process. 23 (4) (2014) 1476–1490, https://doi.org/10.1109/TIP.2014.2303640.
[49] B. Ma, J. Zhou, X. Gu, M. Wang, Y. Zhang, X. Guo, A new approach to create 3D fixation density maps for stereoscopic images, in: 2015 3DTV-Conference: The True Vision – Capture, Transmission and Display of 3D Video (3DTV-CON), 2015.
[50] J. Wang, M. Perreira Da Silva, P. Le Callet, V. Ricordel, Study of center-bias in the viewing of stereoscopic image and a framework for extending 2D visual attention models to 3D, Electron. Imaging 8651, 865114.
[51] G. Westheimer, Cooperative neural processes involved in stereoscopic acuity, Exp. Brain Res. 36 (3) (1979) 585–597, https://doi.org/10.1007/BF00238525.
[52] H. Filippini, M. Banks, Limits of stereopsis explained by local cross-correlation, J. Vis. 9 (1) (2009) 8.1–8.18, https://doi.org/10.1167/9.1.8.
[53] S. Kastner, L.G. Ungerleider, Mechanisms of visual attention in the human cortex, Annu. Rev. Neurosci. 23 (2000) 315–341, https://doi.org/10.1146/annurev.neuro.23.1.315.
[54] S.A. McMains, S. Kastner, Visual attention, in: M.D. Binder, N. Hirokawa, U. Windhorst (Eds.), Encyclopedia of Neuroscience, Springer, Berlin, Heidelberg, 2009, pp. 4296–4302.
[55] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000.
[56] R. Herbrich, Learning Kernel Classifiers: Theory and Algorithms, The MIT Press, 2001.
[57] C. Chang, C. Lin, LIBSVM: a library for support vector machines, https://www.csie.ntu.edu.tw/~cjlin/libsvm/.
[58] D.M. Stampe, Heuristic filtering and reliable calibration methods for video-based pupil-tracking systems, Behav. Res. Methods Instrum. Comput. 25 (2) (1993) 137–142, https://doi.org/10.3758/BF03204486.
Jun Zhou received his Ph.D. in Electrical Engineering from Shanghai Jiao Tong University in 1997. He is an Associate Professor at Shanghai Jiao Tong University, where he is a faculty member in the Department of Electrical Engineering and the Institute of Image Communication and Network Engineering. From September 2015 to September 2016, he was a visiting scholar at the Laboratory for Image and Video Engineering (LIVE) at The University of Texas at Austin. His research interests include image and video processing, computational vision, image/video quality assessment, multimedia networking, and digital signal processing.
Ling Wang received the Diploma degree in Computer Science and the Master's degree in Pattern Recognition and Intelligent Control from Tongji University in 1998. He is currently a senior engineer at Shanghai Media & Entertainment Technology (Group) Co., Ltd. His research interests include video encoding/decoding, video quality, and signal processing.
Haibing Yin received the Ph.D. degree from Shanghai Jiao Tong University, Shanghai, China, in 2006. He was a postdoctoral researcher in the National Engineering Laboratory for Video Technology, Peking University, from 2008 to 2010, and a visiting scholar in the Multimedia Communication Laboratory, ECE Department, University of Waterloo, Waterloo, Canada, from 2013 to 2014. He is currently a Professor at Hangzhou Dianzi University, Hangzhou, China. He is a member of IEEE and ACM. His research interests include image and video processing, and VLSI architecture design.
Alan Bovik is the Cockrell Family Regents Endowed Chair Professor at The University of Texas at Austin. His research interests are digital video, image processing, and visual perception. For his work in these areas he has received the 2019 IEEE Fourier Award, the 2017 Edwin H. Land Medal from the Optical Society of America, a 2015 Primetime Emmy Award for Outstanding Achievement in Engineering Development from the Television Academy, and the Norbert Wiener Society Award and the Carl Friedrich Gauss Education Award from the IEEE Signal Processing Society. He has also received about 10 'best journal paper' awards, including the 2016 IEEE Signal Processing Society Sustained Impact Award. He is a Fellow of the IEEE, and his recent books include The Essential Guides to Image and Video Processing. He co-founded and was the longest-serving Editor-in-Chief of the IEEE Transactions on Image Processing, and also created and chaired the IEEE International Conference on Image Processing, which was first held in Austin, Texas, in 1994.