Bivariate analysis of 3D structure for stereoscopic image quality assessment


Yang Yao^a, Liquan Shen^b,*, Ping An^c

^a Shanghai Institute for Advanced Communication and Data Science, Shanghai University, Shanghai, 200072, China.
^b Key Laboratory of Specialty Fiber Optics and Optical Access Networks, Shanghai University, Shanghai, 200072, China.
^c Joint International Research Laboratory of Specialty Fiber Optics and Advanced Communication, Shanghai University, Shanghai, 200072, China.

Abstract. The human visual system (HVS) resolves each local unit of a stereo-pair into binocular or monocular perception depending on the distortions and on depth perception. However, these HVS properties still leave large room for exploration in the field of stereoscopic image quality assessment (SIQA). In this paper, a bivariate natural scene statistics (NSS) model is proposed to capture image quality by extracting features separately from binocular and monocular perception regions. In the implementation, the stereo-pair is first segmented into regions based on the spatial information of its disparity. The regions are then classified into the categories of binocular fusion, binocular rivalry and binocular suppression. Bivariate statistics of spatially adjacent Gabor responses are extracted from each category of regions, from which features are computed to represent image quality. In particular, the extraction strategy depends on the type of image patch. Experimental results show that the proposed model is promising at handling the task of SIQA on the LIVE 3D Image Quality Database. Keywords: no reference, stereoscopic image quality assessment, bivariate analysis, natural scene statistics, machine learning. *Liquan Shen, [email protected]

1 Introduction
3D/stereoscopic content provides additional depth information that enables viewers to have a realistic viewing experience. There has been rapid growth in the production of 3D/stereoscopic content across a wide range of consumer-oriented applications, including 3D film, mobile phones, virtual reality and gaming. However, quality degradation may be introduced into 3D/stereoscopic content during transmission or compression, so it is necessary to conduct SIQA. According to the availability of reference information, objective SIQA metrics fall into three categories: full-reference (FR), reduced-reference (RR), and no-reference (NR). FR-SIQA metrics [1]-[10] compute the quality of a test image with its original undistorted image as a reference. In RR-SIQA metrics [11]-[13], only partial information of the original image can be accessed to assist the quality evaluation.

NR-SIQA metrics [14]-[20] calculate the quality of test images without any reference information. Unlike distortions in 2D images, distortions in stereoscopic images not only cause information loss, but also significantly disturb the 3D visualization process, making it difficult to perceive the additional dimension provided by depth information. To conduct SIQA, both the monocular (2D) perceptual quality and binocular vision factors, i.e., depth/disparity perception and binocular interaction effects including binocular fusion, binocular rivalry and binocular suppression, need to be considered. The binocular visualization process involves complex cooperation among visual neurons in different areas of the HVS. In the mammalian brain, the two visual stimuli are transmitted through two parallel neural pathways (termed the dorsal and ventral streams) from the primary visual cortex (V1) to higher cortical areas. The dorsal stream predominantly handles "coarse" stereopsis [21], which is the result of binocular fusion, while the cortex in the ventral stream recognizes unmatchable regions with binocular rivalry [22] [23]. When an original natural stereo-pair is presented to the human eyes, the HVS fuses the two views into an integrated view with depth perception. This process is called binocular fusion [24]. Binocular rivalry [25] happens when the HVS fails to fuse the two views and perceives them alternately. This occurs when the two views have significantly different contents.

Fig 1 Example of the formation of the 3D structure. (a) XY-plane of the 3D structure. (b) 3D structure.


Binocular suppression [26] is a special case of binocular rivalry. When the stimulus energies of the stereo-pair differ significantly in corresponding patches between the left and right views, the HVS forms a monocular vision without depth information and only the patch with larger energy is perceived [27]. The cooperation between the dorsal and ventral streams, along with other visual cortices, forms a 3D structure in the human brain, as shown in Fig. 1. Fig. 1(b) shows the 3D structure, and Fig. 1(a) is its XY-plane. Compared with the 2D plane afforded by 2D images, the 3D structure is a virtual space existing in the human brain, generated from the stereo-pair. It displays the contents of the stereo-pair in its XY-plane and has an additional depth axis apart from the XY-axes. The stereo-pair contents displayed in the XY-plane can be regarded as a cyclopean image to a certain extent. Depth provides observers with a feeling of reality. The visualization processes of binocular fusion and rivalry form the regions of stereo vision with depth perception in the 3D structure, while the process of binocular suppression forms the regions of monocular vision with only XY-plane information. In fact, a distorted stereo-pair may exhibit all three perceptual phenomena, binocular fusion, rivalry and suppression, across image patches, depending on the locations of distortions and the image content. Each region of binocular fusion, rivalry or suppression locally extends over a spatially limited area, and they coexist within a stereo-pair [28] [29]. Several SIQA metrics [5] [17] [18] based on the cyclopean image have addressed these 3D perception properties. However, they handle them in a global manner, which is not convincing: the quality score is computed homogeneously across the complete cyclopean image. Since the perceived 3D structure of a distorted stereo-pair is not uniform within the human brain [28] [29], it is logical to assess it within spatially limited regions. In our paper, the quality is captured separately by the proposed bivariate statistics model

(explained in Section 3.3) in the different types of areas corresponding to the three perceptual phenomena. Specifically, the spatially neighboring Gabor responses are jointly summarized into a grid of size M × N, and the probability of occurrence is then calculated for each grid entry. The bivariate statistical distribution of image Gabor responses depicts the shape of spatially adjacent Gabor responses, which is orientation and frequency selective. This resembles the properties of the HVS [30]. We select one relatively high and one low spatial center frequency for the Gabor filter [31], which depict the image elaborately and coarsely, respectively. Increasing the number of frequencies does not improve the performance much, while it increases the complexity of the algorithm. Eight sinusoidal grating orientations are selected for each spatial frequency to capture image features in all orientations. The bivariate joint statistics contain crucial information about the naturalness of an image [14], from which a blind quality predictor can be built. A bivariate NSS model, based on a Bivariate Generalized Gaussian Distribution (BGGD), is also constructed in Balasubramanyam's [14] and Su's [18] work.

Fig 2 Bivariate joint statistics of reference image and its five distorted versions.


[14] and Su’s [18] work. Several groups of pictures are selected from LIVE 2D IQA database [32] to show that the bivariate statistics model is a promising model for image quality prediction. Different groups of pictures have different contents, and each group of pictures contains a reference image and its five distorted versions including JPEG, JPEG2000 (JP2K), Gaussian blur (GBLUR), White noise (WN) and fast-fading (FF). The bivariate statistic distributions are compared within each group to find out the differences between reference image and its five distorted images. Fig. 2 shows the comparison results of two groups of pictures in the selected pictures. Figs. 2(a) and (c) show two group pictures with different contents. Figs. 2(b) and (d) are the bivariate statistic distributions for each image. The neighboring normalized Gabor responses of image are summarized into a grid with size of 10 × 10. The notations of k and l are coordinate axes and φ denotes probability for each bin. In Figs. 2(b) and (d), the distribution shape is almost constant across different image contents. It also can be seen from Figs. 2(b) or (d) that the original two-dimensional distribution shape fluctuates depending on the distortion type. The distribution of blockiness, e.g., JPEG is gathered at the grid center, white noise smoothes the distribution, and the blurriness makes the center bins have almost the same magnitude. This enables us to design a NSS-based quality evaluator by quantifying the deviations of distribution between original and distorted images. Some blind IQA metrics [33]-[35] have adopted the univariate NSS theory to extract quality-aware features, however, the high-order image dependencies i.e., bivariate statistics analysis is less explored compared to the lower-order dependencies. Our contribution here is a new blind SIQA model that try to mimic binocular visual properties by figuring out areas of binocular fusion, rivalry and suppression based on their visual mechanisms, and then diverse feature combination strategies are exerted to these areas in the 3D structure. The proposed bivariate NSS model is applied for feature extraction. It demonstrates better performance to predict human judgments 5

According to the above analysis, the proposed model proceeds as follows. Given a stereo-pair, the type of each region in the internal 3D structure is decided based on the left view, right view and binocular disparity together. Then bivariate analysis-based features are extracted from these regions according to their types. Lastly, the features are combined and fed into a trained support vector regressor (SVR) model [36] to derive the final quality score. The remainder of this paper is organized as follows. Section 2 reviews previous work on SIQA. Section 3 details the proposed algorithm, and Section 4 presents the experimental results of the proposed metric on the LIVE 3D database. Lastly, Section 5 concludes this paper.

2 Review of Previous Work on IQA
Based on their implementation methods, SIQA metrics can be classified into four categories. The first kind, the simplest, is called 2D-based SIQA, in which 2D-IQA metrics are directly applied to the left and right images respectively to predict the quality of the stereo-pair [6]. Four off-the-shelf 2D-IQA metrics along with three combination strategies are tested in [37] for measuring the perceptual quality of stereo-pairs. However, the perceptual properties of 3D vision are ignored in this category of methods. The second kind of SIQA, called depth/disparity-based models [7][38][39], adopts depth/disparity information into the framework. However, these metrics only quantify the structural distortion of the disparity, instead of integrating it into binocular vision. The third kind of SIQA, called cyclopean-based models, introduces the 'cyclopean' image to model the internally generated 3D structure in the HVS by fusing the visual inputs of the two eyes. It is frequently explored in the SIQA literature due to its close simulation of the HVS.

Chen's FR method [5] generates the cyclopean image from both the test and reference stereo-pairs, and applies various 2D IQA methods to them for quality evaluation. Chen's NR method [17] applies an NSS model to the cyclopean image, disparity map and uncertainty map for NR quality evaluation. However, it produces the cyclopean image from the left image and the shifted right image, which causes bias for asymmetrically distorted stereo-pairs. Su's NR method [18] extracts univariate and bivariate statistics of wavelet coefficients of the cyclopean image to assess image quality. The FR algorithm [2] predicts the quality of a stereo-pair in two stages. In the first stage, the monocular view quality is estimated by a weighted sum of the qualities of the two views. In the second stage, the cyclopean view quality is measured by statistical-difference-based features. Lin's FR method [8] measures 2D artifacts by extracting local phase and amplitude features from the cyclopean image. The cyclopean approach is also employed to obtain the local quality score in Heeseok's deep learning scheme [20]. The distortions introduced in a stereo-pair disturb the 3D visualization process, and the last category of studies, called 3D-property-based models, focuses on modeling binocular vision properties. Shao et al. [3] and Lee et al. [40] both propose FR 3D perception-based stereo image quality pooling models to address binocular fusion and binocular rivalry separately. The NR metric [19] explores the relationship between the perceptual quality of stereoscopic images and visual information including blurriness and blockiness, introducing a model for binocular quality perception. A linear rivalry model is deployed in [10] to depict the properties of binocular rivalry. Liu's RR method [13] considers the monocular cue and binocular cue through entropy of primitives (EoP) and mutual information of primitives (MIP), respectively. Shao et al. [9] propose a sparse coding-based FR model to learn binocular receptive field (RF) properties, simulating simple and complex cells in the primary visual cortex. Wu's RR method [41] exploits an attention mechanism

based on the HVS for IQA modeling. Inspired by the fact that the HVS is sensitive to luminance changes and texture information, Fang et al. [42] propose to incorporate statistical luminance and texture features for screen content images with both local and global feature representation.

Fig 3 Framework of proposed metric.

3 Proposed Algorithm
The general architecture of the proposed NR-SIQA method is shown in Fig. 3. The base and aid views are first decided by BRISQUE [33], and then the binocular disparity and binocular consistency are estimated. Each region within the formed 3D structure is classified into one of the three perceptual phenomena based on the spatial information of the stereo-pair and its disparity, with the auxiliary information of binocular consistency and Gabor magnitude. Then, bivariate analysis is conducted separately on regions of binocular fusion, rivalry and suppression, from which quality-aware features are extracted. The features of each category have different importance and are combined according to the proportion of area they occupy. In the following subsections, the foundations of the proposed model, including binocular disparity and consistency estimation, classification of the 3D structure, and bivariate statistics analysis, are first introduced. Then the various features extracted with these techniques for quality prediction are described.

3.1 Binocular Disparity and Consistency Estimation
Binocular disparity plays an important role in the depth perception of a stereo-pair. Estimating the disparity of a point in one image amounts to finding the corresponding point in the other image. Many disparity search algorithms have been proposed using different kinds of techniques, such as the Sum of Absolute Differences (SAD)-based disparity algorithm [5] and the Structural Similarity Index (SSIM) [43]-based algorithm [17]. However, these methods are only effective for high-quality images, and the SAD-based metric is time-consuming. In this paper, an improved disparity estimation algorithm based on Gaussian-averaged SSIM values is proposed. After the SSIM [43] similarity values are calculated for each pixel between the two views, a Gaussian window of size H × H centered at a given pixel is used to average the SSIM similarity values. The Gaussian-averaged sum of these similarity values measures the extent of similarity between a pair of corresponding points. When searching the aid view for the point corresponding to a point in the base view, the proposed algorithm simultaneously takes neighboring information into consideration. Thus it performs well on low-quality stereo-pairs.
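A minimal sketch of the Gaussian-averaged SSIM disparity search described above is given below. It is not the authors' implementation: it assumes grayscale inputs, approximates the H × H block search by shifting the whole aid view for each candidate disparity, and realizes the Gaussian-weighted averaging with a Gaussian filter; H = 25 and range = 25 follow the setting suggested later for 640 × 360 images, and all function names are illustrative.

```python
# Sketch of the Gaussian-averaged SSIM disparity search (assumptions noted above).
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.metrics import structural_similarity

def disparity_gaussian_ssim(base, aid, search_range=25, H=25, sigma=4.0):
    """Estimate per-pixel disparity and a binocular consistency map."""
    h, w = base.shape
    best_score = np.full((h, w), -np.inf)   # max Gaussian-weighted SSIM score so far
    disparity = np.zeros((h, w), dtype=np.int32)
    data_range = float(base.max() - base.min())
    for d in range(-search_range, search_range + 1):
        shifted = np.roll(aid, d, axis=1)   # shift aid view by the candidate disparity
        # full=True returns the per-pixel SSIM map
        _, ssim_map = structural_similarity(base, shifted, full=True, data_range=data_range)
        # Gaussian window of roughly H x H approximated by a Gaussian filter
        score = gaussian_filter(ssim_map, sigma=sigma, truncate=(H // 2) / sigma)
        better = score > best_score
        disparity[better] = d
        best_score[better] = score[better]
    # best_score serves as the binocular consistency map described below
    return disparity, best_score
```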

Fig 4 Stereo-pair and disparities estimated by different disparity estimation algorithms. The three maps in the first column are estimated by the SAD-based algorithm, the second column by the SSIM-based algorithm, and the third column by the proposed Gaussian-averaged-SSIM-based algorithm.


Fig. 4 shows examples of disparities estimated by different stereo matching algorithms. The three maps in the first column (a) are estimated by the SAD-based algorithm, the second column (b) by the SSIM-based algorithm, and the third column (c) by the proposed Gaussian-averaged-SSIM algorithm. The disparities in the first row are from the original stereo-pair 'im20_l.bmp' and 'im20_r.bmp' from the LIVE 3D IQA database Phase I [44]. It can be seen that all algorithms work well. The disparities in the second row are from 'im20_1_l' and 'im20_1_r' with symmetric JP2K compression distortion in LIVE 3D Phase I. It can be seen that the distortion has some effect on the disparities. The disparities in the last row are from '010image_4_1' with asymmetric Gaussian blur distortion in the LIVE 3D IQA database Phase II [17]. It can be seen that traditional stereo matching algorithms cannot achieve promising performance on asymmetrically distorted stereo-pairs. The disparity map from the SSIM-based method looks like random noise and the outline of the image content can hardly be recognized from it. The SAD-based method is barely satisfactory. The proposed Gaussian-averaged-SSIM-based algorithm performs best among these algorithms. In this paper, the first step in generating the disparity is to decide the base image between the left and right images. The view with better quality is chosen as the base image, and the other one is the aid image. The 2D metric BRISQUE [33] is adopted to decide the base image. The leftmost part of the left view and the rightmost part of the right view are pruned since they do not exist in the other view. For any point in the base image with coordinate (x, y), its related point in the aid image is searched from [x − range, y] to [x + range, y]. Specifically, an H × H block in the base image centered at point (x, y) and an H × H block in the aid image centered at point (x + D, y) are delineated. Then, all the SSIM similarity values of pixels between the base and aid images in the block are merged with a Gaussian weighted sum. The block in the aid image is shifted from [x − range, y] to [x + range, y] to find the maximum Gaussian weighted sum. Once the maximum Gaussian weighted sum is found,

the corresponding D is taken as the disparity for the point (x, y). The maximum Gaussian weighted sum of SSIM values in the block is defined as the binocular consistency for the point (x, y), and all points' consistency values comprise the binocular consistency map. The search range and block size H can be changed depending on the application. For instance, for images of size 640 × 360, range and H are both set to 25. Binocular consistency indicates the level of similarity of corresponding local areas. Lower consistency indicates stronger binocular rivalry and vice versa.

3.2 Classification of 3D Structure
The 3D visualization process is a complicated biological mechanism involving different groups of visual neurons. In fact, 3D perception differs under different distortion types [40]. For frequency-decreasing distortion, e.g., blurriness, the quality of the stereo-pair mainly depends on the relatively high-quality view [45]. Conversely, for frequency-increasing distortion, e.g., blockiness, the low-quality view suppresses the high-quality view. Inspired by these binocular properties, we develop a model to classify the regions of the 3D structure into binocular fusion, rivalry and suppression regions. Specifically, since the 3D structure formed within the human brain by a stereo-pair is not accessible, it is segmented [46] into non-overlapping regions based on the spatial information of the two views and the disparity together. These regions are then labeled as binocular fusion, rivalry or suppression based on the consistency map and the relative amount of binocular energy. Furthermore, spatially neighboring regions of the same class are integrated together.
1) Segmentation: The stereo-pair and disparity are considered simultaneously in the formation of the 3D structure. The aid image is compensated by the disparity so that points in the base and aid images match. The 3D structure segmentation involves these three maps, and the following steps are conducted on all of them. The intended 3D structure has W

pixels in its XY-plane, the same as the stereo-pair and the disparity map. Initially, the W pixels are divided into K parts, so the approximate size of each patch is W/K pixels. The original patch is set to a square of width L = \sqrt{W/K}. At the onset of the segmentation, we choose K cluster centers C_k = (R_{C_k}, G_{C_k}, B_{C_k}, x_{C_k}, y_{C_k})^T at regular grid intervals L for the base image, the aid image and the disparity map. (R_{C_k}, G_{C_k}, B_{C_k}) are the RGB values of cluster center C_k, (x_{C_k}, y_{C_k}) are the spatial coordinates of C_k, and k = 1, 2, ..., K. After these centers are spaced regularly on the base image, they are moved to the lowest gradient position in a 3 × 3 neighborhood (the cluster center selection in the aid image and the disparity map is identical to that in the base image). This is done to avoid placing them on an edge and to reduce the probability of choosing a noisy pixel. Then each pixel P_i, i = 1, 2, ..., W, is associated with the nearest cluster center whose search region overlaps it. The search range of a cluster center is a block of size 2L × 2L centered at the cluster center. d^o_{RGB} denotes the Euclidean distance in RGB color space between pixel P_i and a cluster center C_k whose search area overlaps P_i, where 'o' represents the base ('b') or aid ('a') image,

d^o_{RGB} = \sqrt{(R^o_{C_k} - R^o_{P_i})^2 + (G^o_{C_k} - G^o_{P_i})^2 + (B^o_{C_k} - B^o_{P_i})^2}    (1)

For the disparity map, the cluster center is denoted as C_k = (D_{C_k}, x_{C_k}, y_{C_k})^T, where D_{C_k} is the disparity value at (x_{C_k}, y_{C_k}), and the distance is defined as

d_D = \sqrt{(D_{C_k} - D_{P_i})^2}    (2)

The synthetic distance for the base image, aid image and disparity map is defined as

d_s = (d^b_{RGB} + d^a_{RGB}) / 255 + m \cdot d_D / range    (3)

where range is the adjustable parameter of the disparity search process. The divisors 255 and range normalize d^o_{RGB} and d_D, and m balances the relative importance between them; here we set m to 0.5 to decrease the influence of the disparity's quality. When pixel P_i has the smallest synthetic distance d_s to a certain cluster center C_k among all cluster centers whose search areas overlap it, it is associated with that cluster center C_k.


Fig 5 Results of segmentation of a stereo-pair with asymmetric JP2K distortion. (a) Left image. (b) Right image. (c) Disparity map. (d) 3D structure segments based on the stereo-pair and disparity.

Step 1: according to the synthetic distance in Eq. (3), each pixel is associated with the nearest cluster center whose search area overlaps the current pixel. Step 2: after all pixels have been associated with their nearest cluster centers, each new center is computed as the center of gravity of all the pixels in the cluster. Step 3: Steps 1 and 2 are repeated iteratively until the center positions converge. Figs. 5(a)-(d) show examples of the stereo-pair, disparity, and segmentation result, where (a) and (b) are the test stereo-pair '008image_2_2.bmp' from LIVE 3D Phase II [17] with asymmetric JP2K distortion, (c) is the disparity map and (d) is the segmentation result for the XY-plane of the 3D structure. Different patches are denoted with different grayscales.
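The SLIC-like clustering of Eqs. (1)-(3) and Steps 1-3 can be sketched as follows. This is an illustrative simplification, not the paper's code: the gradient-based perturbation of the initial centers is omitted, and the center features are sampled at the centroid location rather than averaged over the cluster.

```python
# Sketch of the joint segmentation over base RGB, disparity-compensated aid RGB
# and disparity. Names, K = 36 and m = 0.5 follow the paper; the rest is assumed.
import numpy as np

def segment_3d_structure(base, aid_comp, disp, K=36, m=0.5, search_range=25, iters=10):
    """base, aid_comp: HxWx3 float RGB in [0,255]; disp: HxW disparity map."""
    h, w = disp.shape
    L = int(np.sqrt(h * w / K))                                   # patch width, L = sqrt(W/K)
    ys, xs = np.meshgrid(np.arange(L // 2, h, L), np.arange(L // 2, w, L), indexing="ij")
    centers = np.stack([ys.ravel(), xs.ravel()], axis=1)          # regular grid of centers
    labels = np.full((h, w), -1, dtype=np.int32)
    for _ in range(iters):
        best = np.full((h, w), np.inf)
        for k, (cy, cx) in enumerate(centers):
            y0, y1 = max(cy - L, 0), min(cy + L, h)               # 2L x 2L search window
            x0, x1 = max(cx - L, 0), min(cx + L, w)
            d_rgb_b = np.linalg.norm(base[y0:y1, x0:x1] - base[cy, cx], axis=-1)       # Eq. (1)
            d_rgb_a = np.linalg.norm(aid_comp[y0:y1, x0:x1] - aid_comp[cy, cx], axis=-1)
            d_d = np.abs(disp[y0:y1, x0:x1] - disp[cy, cx])                            # Eq. (2)
            d_s = (d_rgb_b + d_rgb_a) / 255.0 + m * d_d / search_range                 # Eq. (3)
            mask = d_s < best[y0:y1, x0:x1]
            best[y0:y1, x0:x1][mask] = d_s[mask]
            labels[y0:y1, x0:x1][mask] = k
        for k in range(len(centers)):                             # Step 2: move to centroids
            yy, xx = np.nonzero(labels == k)
            if yy.size:
                centers[k] = [int(yy.mean()), int(xx.mean())]
    return labels
```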


2) Classification: the segmentation result is projected onto the stereo-pair. Each unit of the image has almost uniform content and is classified into one of three categories according to the binocular consistency and the relative stimulus energy between the base and aid views. When two corresponding patches have a high binocular consistency value, the corresponding region in the 3D structure is labeled binocular fusion; when two corresponding patches have a low consistency value and the binocular energies of their Gabor responses differ little, the region is labeled binocular rivalry; otherwise, the region is labeled binocular suppression. C_bf, C_br and C_bs denote the classes of binocular fusion, rivalry and suppression, respectively. For each segment S_i in the XY-plane of the formed 3D structure, its class C_i is determined by

C_i = \begin{cases}
C_{bf}, & \text{if } consi(I^b(S_i), I^a(S_i)) \ge th_{con} \\
C^{b}_{bs}, & \text{else if } (consi(I^b(S_i), I^a(S_i)) < th_{con}) \,\&\, (w_i > 1 - th_w) \\
C^{a}_{bs}, & \text{else if } (consi(I^b(S_i), I^a(S_i)) < th_{con}) \,\&\, (w_i < th_w) \\
C_{br}, & \text{otherwise}
\end{cases}    (4)

where I^o(S_i) denotes the region in the base ('b') or aid ('a') image corresponding to segment S_i in the XY-plane of the formed 3D structure, and consi(·) computes the mean consistency value (introduced in Section 3.1) between two corresponding segments in the base and aid images. C^b_bs represents the class in which the base view has much stronger stimulus energy than the aid view, so that the base view dominates the vision; C^a_bs denotes the class in which the aid view dominates the vision. th_con is the consistency threshold for distinguishing binocular fusion from rivalry, and th_w is the weight threshold for distinguishing binocular rivalry from suppression. The choice of the thresholds th_con and th_w is discussed in Section 4.2.5.


Fig 6 Consistency map and result of classification for Figs. 5(a) and (b). (a) Consistency map. (b) Result of classification and integration.

w_i is the weight for the segment in the base image, defined as

w_i = \frac{\psi(I^b(S_i))}{\psi(I^b(S_i)) + \psi(I^a(S_i))}    (5)

where ψ(·) is the stimulus energy obtained with the Gabor filter [31],

\psi(I(S_i)) = \sum_{\theta} \sum_{(x,y) \in S_i} \frac{1}{2\pi\gamma\sigma^2} \exp\!\left(-\frac{1}{2}\left(\left(\frac{R_1}{\sigma}\right)^2 + \left(\frac{R_2}{\gamma\sigma}\right)^2\right)\right) \exp(i\omega R_1)    (6)

where R_1 = x\cos\theta + y\sin\theta and R_2 = -x\sin\theta + y\cos\theta, γ is the aspect ratio of the elliptical Gaussian envelope (γ = 1 in our algorithm), σ is the standard deviation of the elliptical Gaussian envelope, \omega = \sqrt{\zeta_x^2 + \zeta_y^2} is the radial center frequency (ζ_x and ζ_y are the spatial center frequencies of the complex sinusoidal carrier), and θ is the orientation.
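A small sketch of the classification rule of Eqs. (4)-(5) is given below. It assumes that per-pixel Gabor energy maps for the base and aid views and the consistency map of Section 3.1 have already been computed; the thresholds follow the defaults reported in Section 4.2.5 (th_con = 0.85, th_w = 0.2), and all names are illustrative.

```python
# Sketch of the segment classification of Eqs. (4)-(5).
import numpy as np

FUSION, RIVALRY, SUPP_BASE, SUPP_AID = 0, 1, 2, 3

def classify_segments(labels, consistency, energy_base, energy_aid,
                      th_con=0.85, th_w=0.2):
    """Return a dict mapping segment id -> perceptual class (Eq. 4)."""
    classes = {}
    for k in np.unique(labels):
        mask = labels == k
        consi = consistency[mask].mean()                 # mean consistency over S_i
        e_b, e_a = energy_base[mask].sum(), energy_aid[mask].sum()
        w = e_b / (e_b + e_a + 1e-12)                    # Eq. (5): base-view energy weight
        if consi >= th_con:
            classes[k] = FUSION
        elif w > 1.0 - th_w:                             # base view dominates
            classes[k] = SUPP_BASE
        elif w < th_w:                                   # aid view dominates
            classes[k] = SUPP_AID
        else:
            classes[k] = RIVALRY
    # spatially neighboring segments of the same class would then be merged
    return classes
```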

After the classification, spatially neighboring patches of the same class are integrated together. Figs. 6(a) and (b) show the consistency map and the result of classification and integration.

3.3 Bivariate Statistics Analysis
Various studies [33] [34] indicate that univariate statistics of natural images can predict image quality well. In this paper, we demonstrate the effectiveness of a bivariate NSS model based on image Gabor responses, which outperforms univariate models [14]. Evidence indicates that the visual neurons in the primary visual cortex are spatial frequency and


orientation selective and have an elliptical Gaussian envelope [30]. Since Gabor filters closely model the receptive field profiles of simple cells in the mammalian visual cortex, many metrics adopt Gabor theory as their basis [9] [47]. Thus the Gabor filter introduced in Section 3.2 is adopted to extract Gabor responses from the stereo-pair. GR(x, y; ω, θ) represents the Gabor response of the image at location (x, y) with spatial frequency ω and orientation θ. The complete set of Gabor responses across all orientations is summed into an overall response for each frequency,

GR(x, y; \omega) = \sum_{\theta} GR(x, y; \omega, \theta)    (7)
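As an illustration of Eqs. (6)-(7), the following sketch builds a complex Gabor kernel and sums the responses over eight orientations for one spatial frequency. The kernel size and σ are illustrative assumptions rather than values reported in the paper.

```python
# Sketch of the Gabor responses of Eqs. (6)-(7).
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(omega, theta, sigma=4.0, gamma=1.0, size=21):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    r1 = x * np.cos(theta) + y * np.sin(theta)
    r2 = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-0.5 * ((r1 / sigma) ** 2 + (r2 / (gamma * sigma)) ** 2))
    return envelope * np.exp(1j * omega * r1) / (2 * np.pi * gamma * sigma ** 2)

def summed_gabor_response(image, omega, n_orient=8):
    """Eq. (7): sum the complex Gabor responses over all orientations."""
    thetas = np.arange(n_orient) * np.pi / n_orient
    responses = [fftconvolve(image, gabor_kernel(omega, t), mode="same") for t in thetas]
    return np.sum(responses, axis=0)
```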

The perceptually significant pre-processing operation called mean subtracted contrast normalization (MSCN) [33][48] is performed on the summed Gabor responses at each spatial frequency, resulting in \hat{GR}(x, y; ω). The bivariate histograms are obtained by binning horizontally neighboring responses of \hat{GR}(x, y; ω) into a grid of size M × N with coordinate axes k and l. The responses of \hat{GR}(x, y; ω) at locations (x, y) and (x + 1, y) form one neighboring response pair, and the responses at locations (x + 1, y) and (x + 2, y) form another. All pairs of neighboring responses are placed in the grid according to their joint magnitude, and the occurrences in each grid entry are counted. The joint probability of adjacent responses, denoted by φ(m, n), m = 1, ..., M, n = 1, ..., N, is calculated by normalizing the occurrences in each grid entry by the overall number of pairs. The bivariate joint statistical distributions of neighboring Gabor responses at a high spatial frequency, shown in Fig. 2, depict fine details for the pristine image and its five distorted versions. Although φ(m, n) is a fine predictor of image quality, it contains information redundancy and has a large dimensionality. We collapse it into a lower dimensionality using marginal and conditional probability functions.

Fig 7 Marginal probabilities Pk (shown as the first half of the histograms) and conditional probabilities Qk (shown as the second half of the histograms) of the distorted images generated from the same reference image at different DMOS levels.

The marginal probability functions are denoted by P_k and P_l, respectively,

P_k(k = k_m) = \sum_{n=1}^{N} \phi_{m,n}, \qquad P_l(l = l_n) = \sum_{m=1}^{M} \phi_{m,n}    (8)

In order to capture the dependencies between spatially neighboring responses, the conditional probabilities, denoted by Q_k and Q_l respectively, are calculated,

Q_k(k = k_m) = \sum_{n=1}^{N} \frac{\phi_{m,n}(k = k_m, l = l_n)}{P_l(l = l_n)}, \qquad Q_l(l = l_n) = \sum_{m=1}^{M} \frac{\phi_{m,n}(k = k_m, l = l_n)}{P_k(k = k_m)}    (9)
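The following sketch summarizes the bivariate statistics pipeline of this subsection: MSCN normalization of a real-valued response map (e.g., the magnitude of the summed Gabor response), binning of horizontally adjacent values into an M × N grid, and the marginal and conditional probabilities of Eqs. (8)-(9). The MSCN window and the grid limits are illustrative assumptions.

```python
# Sketch of the bivariate NSS features of Section 3.3.
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn(response, sigma=7.0 / 6.0, eps=1.0):
    mu = gaussian_filter(response, sigma)
    var = gaussian_filter(response ** 2, sigma) - mu ** 2
    return (response - mu) / (np.sqrt(np.clip(var, 0, None)) + eps)

def bivariate_features(response, M=10, N=10, lim=3.0):
    r = mscn(response)
    pairs = np.stack([r[:, :-1].ravel(), r[:, 1:].ravel()], axis=1)   # horizontal neighbors
    joint, _, _ = np.histogram2d(pairs[:, 0], pairs[:, 1],
                                 bins=[M, N], range=[[-lim, lim], [-lim, lim]])
    phi = joint / joint.sum()                          # joint probability phi(m, n)
    P_k, P_l = phi.sum(axis=1), phi.sum(axis=0)        # Eq. (8): marginals
    Q_k = (phi / (P_l[None, :] + 1e-12)).sum(axis=1)   # Eq. (9): conditionals
    Q_l = (phi / (P_k[:, None] + 1e-12)).sum(axis=0)
    return np.concatenate([P_k, P_l, Q_k, Q_l])
```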

The marginal probabilities P_k and P_l and the conditional probabilities Q_k and Q_l of the binned responses have almost the same shapes across natural images of different contents. However, distortions of different types and levels change these shapes. Fig. 7 shows the marginal and conditional distributions of images with different distortions and distortion levels generated from the same reference image 'studentsculpture.bmp' from the LIVE 2D database [32].


The distributions of P_k and Q_k over the N bins are plotted in the histograms as an example. For each distortion, we select three images with different levels of artifacts, which can be regarded as different DMOS (Difference Mean Opinion Score) levels. In Fig. 7, each column of the same color shows the distributions of P_k and Q_k at different DMOS values for the same distortion. The left half of each column shows the distributions of the marginal probability P_k and the right half shows the distributions of the conditional probability Q_k. The bottom row shows the distributions of the reference image 'studentsculpture.bmp' with a DMOS of 0. It can be seen that the values of P_k and Q_k fluctuate with distortion type and level. Thus P_k, P_l, Q_k and Q_l can be adopted as features to predict the quality of a stereo-pair. Since the primary visual cortex is orientation sensitive, the orientation dependency contained in the Gabor responses, i.e., GR(x, y; ω, θ), is also taken into account. GR(x, y; ω, θ) is preprocessed by the MSCN operation, resulting in \hat{GR}(x, y; ω, θ). Then the correlation coefficient ρ of horizontally adjacent responses in \hat{GR}(x, y; ω, θ) is calculated. In particular, it fluctuates as the Gabor tuning orientation θ changes. It is plotted as a function of the Gabor tuning orientation at a certain spatial

Fig 8 Plots of correlation coefficient for pristine image and its five distorted versions.


frequency. The characteristics of these plots can be well captured by a Weibull distribution [49],

\rho = f(\theta \mid a, b) = b\, a^{-b}\, \theta^{\,b-1} \exp\!\left(-(\theta/a)^b\right) I_{(0,\pi)}(\theta)    (10)

where a and b are the scale and shape parameters, respectively. Fig. 8 plots the correlation coefficient ρ as a function of the Gabor tuning orientation θ at a high spatial frequency for a pristine image and its five impaired versions. The parameters a and b of the fitted distribution are adopted as quality-aware features.

3.4 Feature Extraction
The 3D structure formed in the human brain from a distorted stereo-pair may involve annoying visual mechanisms, e.g., binocular rivalry and suppression. Binocular rivalry induces the alternating appearance of the two views, while binocular suppression causes monocular vision. Each of these visual mechanisms dominates locally within a spatially restricted region, and all of them may coexist in the whole formed 3D structure [28] [29]. Thus we extract bivariate features from binocular fusion, rivalry and suppression regions separately. GR^o denotes the Gabor response of the base or aid image, where 'o' represents 'b' or 'a',

GR^o(x, y; \omega, \theta) = m_o \exp(i\varphi_o(x, y; \omega, \theta))    (11)

where m_o = \sqrt{\Re^2(G^o) + \Im^2(G^o)} is the amplitude of the Gabor response and \varphi_o = \arctan(\Im(G^o)/\Re(G^o)) is its phase. Note that the Gabor responses of the aid image are shifted by the disparity, so that the two Gabor response maps of the base and aid images are aligned according to the matched pixel pairs.
1) Binocular fusion: when the two views are matchable in corresponding places, the HVS fuses them into a single view with depth perception. BV_{bf} = {∀S_i | C_i = C_{bf}} denotes the regions of the binocular fusion class in the 3D structure. The Gabor responses BR_{bf} of BV_{bf} are calculated by the binocular energy model [50], modified as follows:

BR_{bf}(x, y; \omega, \theta) = \|GR^b(x, y; \omega, \theta) + GR^a(x, y; \omega, \theta)\|^2 = m_b^2(x, y; \omega, \theta) + m_a^2(x, y; \omega, \theta) + m_b(x, y; \omega, \theta)\, m_a(x, y; \omega, \theta) \cos(\Delta\varphi(x, y))    (12)

where the phase shift between the base and aid images is \Delta\varphi(x, y) = |\varphi_b(x, y) - \varphi_a(x, y)|. Once the set of BR_{bf}(x, y; ω, θ) across all frequencies and orientations is obtained, the feature set feat_{bf} for binocular fusion is extracted using the bivariate statistics introduced in Section 3.3.
2) Binocular rivalry: the two views are perceived alternately. BV_{br} = {∀S_i | C_i = C_{br}} denotes the regions of binocular rivalry in the 3D structure. GR^b(x, y; ω, θ) and GR^a(x, y; ω, θ) respectively represent the Gabor responses of the base and aid images in the areas corresponding to BV_{br}. The feature sets feat_b and feat_a are extracted from GR^b(x, y; ω, θ) and GR^a(x, y; ω, θ), respectively. The gain-control theory model [51] is utilized to merge them into a single feature set feat_{br},

feat_{br} = w \cdot feat_b + (1 - w) \cdot feat_a    (13)

where w = \frac{\psi(I^b(BV_{br}))}{\psi(I^b(BV_{br})) + \psi(I^a(BV_{br}))}.
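A brief sketch of the per-region responses of Eqs. (12)-(13) is shown below: the modified binocular energy for fusion regions and the gain-control weighted combination for rivalry regions. `bivariate_features` refers to the histogram-based extractor sketched in Section 3.3; all names are illustrative.

```python
# Sketch of the fusion and rivalry feature computation (Eqs. 12-13).
import numpy as np

def fusion_response(gr_base, gr_aid):
    """Eq. (12): modified binocular energy from complex Gabor responses."""
    m_b, m_a = np.abs(gr_base), np.abs(gr_aid)
    dphi = np.abs(np.angle(gr_base) - np.angle(gr_aid))
    return m_b ** 2 + m_a ** 2 + m_b * m_a * np.cos(dphi)

def rivalry_features(gr_base, gr_aid, energy_base, energy_aid, bivariate_features):
    """Eq. (13): gain-control combination of base/aid feature sets."""
    w = energy_base / (energy_base + energy_aid + 1e-12)
    feat_b = bivariate_features(np.abs(gr_base))
    feat_a = bivariate_features(np.abs(gr_aid))
    return w * feat_b + (1 - w) * feat_a
```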

3) Binocular suppression: this category contains the two classes C^b_{bs} and C^a_{bs}, whose corresponding areas in the 3D structure are denoted by BV^b_{bs} and BV^a_{bs}, respectively. Only the base view is perceived in BV^b_{bs}, and only the aid view is perceived in BV^a_{bs}. Therefore, we extract features from BV^b_{bs} in the base view and from BV^a_{bs} in the aid view, resulting in the two feature sets feat^b_{bs} and feat^a_{bs}.
4) Feature combination: it should be noted that most distorted stereo-pairs contain fewer than the four types of perceptual regions described above. If a type is absent, its features are set to 0. Since the different types of perceptual phenomena occupy different proportions of the area, the contribution of each feature set to quality prediction is different. Thus the four feature sets are combined as follows,

feat = p_{bf} \cdot feat_{bf} + p_{br} \cdot feat_{br} + p^b_{bs} \cdot feat^b_{bs} + p^a_{bs} \cdot feat^a_{bs}    (14)

where p_{bf}, p_{br}, p^b_{bs} and p^a_{bs} represent the proportion of area occupied by each type.

3.5 Quality Prediction
After all the perceptual features are extracted, an SVR is adopted to predict the final quality; it can handle high-dimensional data and suffers less from over-fitting than other regression methods [33][52]. In the training phase, the regression model learns an optimal mapping between features and DMOS. In the testing phase, this mapping projects the extracted features into the quality score space. The LIBSVM package [36] is used in our algorithm to implement the SVR with a radial basis function (RBF) kernel. The SVR parameters C and γ are set to 64 and 0.0625, respectively. Details of the SVR parameter selection and the comparison with other regression methods can be found in Section 4.2.

4 Experimental Results and Analysis
4.1 Databases and Evaluation Criteria
We test the proposed metric on the LIVE 3D IQA database and the Waterloo IVC (UW/IVC) 3D IQA database [53]. The LIVE 3D IQA database contains two phases, Phase I [44] and Phase II [17]. Phase I consists of 20 reference images and 365 distorted images covering five types of distortions: JPEG, JPEG2000 (JP2K), Gaussian blur (GB), white noise (WN) and fast fading (FF). The distortions are symmetrically applied to the left and right reference images at various levels. Phase II consists of 120 symmetrically and 240 asymmetrically distorted images generated from 8 reference images with the same distortion types as Phase I. The Waterloo IVC 3D IQA database also contains two phases. Phase I and Phase II are created from 6 and 10 pristine stereo-pairs, respectively. Three types of distortions, additive white Gaussian noise contamination, Gaussian blur and JPEG compression, are symmetrically or asymmetrically applied to each pristine stereo-pair. Each type of distortion has four distortion levels. Altogether, there are 78 single-view images and 330 stereoscopic images in Phase I, and 130 single-view images and 460 stereoscopic images in Phase II. Three commonly used performance criteria are used: the Spearman Rank Order Correlation Coefficient (SROCC), the Pearson Linear Correlation Coefficient (PLCC) and the Root Mean Squared Error (RMSE). PLCC and RMSE evaluate the prediction accuracy of SIQA metrics, and SROCC evaluates prediction monotonicity. Higher SROCC and PLCC and lower RMSE values indicate a better objective SIQA metric. For the nonlinear regression, we use the following five-parameter logistic function [54]

DMOS_p = \beta_1 \left( \frac{1}{2} - \frac{1}{1 + \exp(\beta_2 (x - \beta_3))} \right) + \beta_4 x + \beta_5    (15)

where β_1, β_2, β_3, β_4 and β_5 are determined by the subjective and objective scores.

4.2 Implementation Details
In this section, a series of experiments is conducted to demonstrate the parameter selection process and to show that almost no over-fitting exists in our implementation. We also compare the performance of the SVR, a neural network (NN), genetic programming (GP) and a random forest (RF) used to predict the quality scores.


Fig 9 The SVR parameters (C, γ) selection process on (a) LIVE 3D database Phase I and (b) LIVE 3D database Phase II. The number on the level contour denotes the SROCC value of the cross-validation.

4.2.1 SVR Parameters (C, γ) Selection

A cross-validation experiment is conducted to choose the optimal SVR parameters (C, γ). The LIVE 3D database Phase I and Phase II are each randomly partitioned into a training set and a testing set. For each database, 80% of the images are chosen for training and the remaining 20% for testing. The mapping learned from the training images is tested on the testing set. This training-testing procedure is repeated 1000 times, and the mean results are recorded for parameter selection. The pair of parameters (C, γ) delivering the best performance is chosen for the SVR implementation. Fig. 9 illustrates the parameter selection process of the proposed metric on LIVE 3D Phase I and Phase II. The SROCC values are labelled on the level contours. It can be seen that there is almost the same circular region surrounding the center where the proposed metric obtains the best SROCC values on both databases. This indicates that the regression model is robust across these two databases. The optimal parameters (C, γ) are set to (64, 0.0625) on LIVE Phase I and Phase II.
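The training/testing and evaluation protocol of Sections 3.5-4.2 can be sketched as follows: repeated random 80/20 splits, an RBF-kernel SVR with the reported C = 64 and γ = 0.0625, the five-parameter logistic mapping of Eq. (15), and the SROCC/PLCC/RMSE criteria. The paper uses the LIBSVM package; scikit-learn's SVR is substituted here for brevity, and the helper names and initial fit parameters are illustrative.

```python
# Sketch of the repeated-split evaluation protocol (assumptions noted above).
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr
from sklearn.svm import SVR

def logistic5(x, b1, b2, b3, b4, b5):                     # Eq. (15)
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

def evaluate(pred, dmos):
    p0 = [np.max(dmos), 0.1, np.mean(pred), 0.1, np.mean(dmos)]
    beta, _ = curve_fit(logistic5, pred, dmos, p0=p0, maxfev=20000)
    fitted = logistic5(pred, *beta)
    return (spearmanr(pred, dmos)[0],                     # SROCC (monotonicity)
            pearsonr(fitted, dmos)[0],                    # PLCC (accuracy)
            np.sqrt(np.mean((fitted - dmos) ** 2)))       # RMSE

def repeated_splits(features, dmos, n_runs=1000, train_frac=0.8, seed=0):
    """features: NxD numpy array, dmos: length-N numpy array."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_runs):
        idx = rng.permutation(len(dmos))
        split = int(train_frac * len(dmos))
        tr, te = idx[:split], idx[split:]
        model = SVR(kernel="rbf", C=64, gamma=0.0625)
        model.fit(features[tr], dmos[tr])
        results.append(evaluate(model.predict(features[te]), dmos[te]))
    return np.mean(results, axis=0)                       # mean SROCC, PLCC, RMSE
```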


4.2.2 Two-fold Cross-validation

Each database is randomly partitioned into two sets, A and B. For each database, 50% of the images are selected as A and the remaining 50% as B. Cross-validation is conducted in two steps: 1) the regression model is learned from A and then tested on A and B, respectively; 2) the regression model is learned from B and then tested on A and B, respectively. This procedure is repeated 1000 times. Performance comparisons on Phase I and Phase II are listed in Table 1, where "MoA&ToB" means the model is trained on A and tested on B. It can be seen that whether the regression model is tested on the training set or the testing set, only a negligible gap exists between their performances. This means that almost no over-fitting exists in the proposed metric.

Table 1 Performance on sets A and B as the regression model is learned from set A and set B, respectively.

Training/Testing   LIVE Phase I                     LIVE Phase II
                   SROCC   PLCC    RMSE             SROCC   PLCC    RMSE
MoA&ToB            0.933   0.941   5.589            0.923   0.926   4.071
MoA&ToA            0.987   0.988   2.496            0.985   0.988   1.720
MoB&ToA            0.934   0.940   5.573            0.923   0.926   4.268
MoB&ToB            0.987   0.988   2.470            0.986   0.988   1.688

4.2.3 Features with Added Noise

Phase I and Phase II are randomly partitioned into training and testing sets. For each database, 80% of the images are selected as the training set and the remaining 20% as the testing set. In addition, we add Gaussian white noise to the features in the training set. The variance of the white noise is set to 0.01 of the feature strength. The learned model is tested on both the training and testing sets. This training-testing procedure is repeated 1000 times. Performance comparisons between features with Gaussian white noise and noise-free features on LIVE 3D database Phase I and Phase II are listed in Table 2. In Table 2, "WN-Training set" means that

the model learned from the training set with Gaussian white noise added to the features is tested on the training set, and "WN/F-Testing set" denotes that the model learned from the noise-free training features is tested on the testing set. It can be seen that there is only a small difference between the performances on the training set and the testing set. Moreover, the gap between the performance with noisy features and with noise-free features is negligible. Therefore, we can conclude that almost no over-fitting exists in the proposed metric.

Table 2 Performance on the training and testing sets as the regression model is learned from training features with Gaussian white noise or without noise.

Training/Testing     LIVE Phase I                     LIVE Phase II
                     SROCC   PLCC    RMSE             SROCC   PLCC    RMSE
WN-Testing set       0.941   0.951   4.981            0.933   0.941   3.776
WN-Training set      0.982   0.984   2.903            0.979   0.982   2.078
WN/F-Testing set     0.943   0.951   5.012            0.933   0.941   3.757
WN/F-Training set    0.984   0.985   2.774            0.982   0.985   1.927

4.2.4 Performance Comparison among Several Regression Methods

We compare the performance of several regression methods, including the SVR, a neural network (NN), genetic programming (GP) and a random forest (RF). Specifically, the hidden layer size is set to 4 and each feature vector is normalized to [-1, 1] in the NN. In GP, the number of programs in each generation is set to 5000 and the maximum number of generations is set to 20. In RF, the number of trees in the forest is set to 10, and the number of features considered for the best split equals the number of input features. 80% of the images of each database are chosen for training and the rest for testing. This training-testing process is repeated 1000 times, and the average performance across the 1000 splits is reported as the final performance. The comparison results are shown in Table 3. It can be seen that the SVR achieves the best performance among these regressors.


Table 3 Performance comparison between SVR, NN, GP and RF.

Regressor   LIVE Phase I                     LIVE Phase II
            SROCC   PLCC    RMSE             SROCC   PLCC    RMSE
SVR         0.943   0.951   5.012            0.933   0.941   3.757
NN          0.902   0.917   6.400            0.884   0.891   4.974
GP          0.837   0.854   8.668            0.715   0.715   8.196
RF          0.907   0.918   6.491            0.874   0.881   5.353

Fig 10 Parameter selection across LIVE 3D IQA database Phase I and Phase II. (a). Parameter K for the numbers of cluster centers in image segmentation (Select from {24, 30, 36, 42, 48}). (b). Parameter M or N for the grid size (Select from {5, 10, 15, 20, 25}). (c). Parameter thcon for the threshold to figure out binocular fusion and rivalry (Select from {0.8, 0.85, 0.9, 0.95}). (d). Parameter thw for the threshold to figure out binocular rivalry and suppression (Select from {0.15, 0.2, 0.25, 0.3}).

4.2.5 Parameter Selection

Different values of the parameters used in the proposed method lead to different performance. To select the optimal parameters, the control-variable method is adopted for each parameter. The experimental results are shown in Figs. 10(a)-(d). In the binocular view segmentation process, the final segmentation results depend on the number of cluster centers K. A larger K


results in higher complexity, while a smaller K leads to imprecise segmentation results. We set K = {24, 30, 36, 42, 48} and compute the SROCC values of the proposed method on Phase I and Phase II. The results are plotted in Fig. 10(a). It can be seen that K = 36 leads to higher SROCC on both Phase I and Phase II. After segmentation, the type of each patch is classified based on th_con and th_w. We choose th_con = {0.8, 0.85, 0.9, 0.95} and th_w = {0.15, 0.20, 0.25, 0.30} and calculate the SROCC for each value. The results are shown in Figs. 10(c) and (d), respectively; th_con = 0.85 and th_w = 0.2 are selected. When extracting bivariate statistics between spatially neighboring Gabor responses, it is necessary to set the grid size M and N. In general, a larger grid models the image information more accurately, but this requires more samples and generates a higher feature dimension that makes learning the regression model less stable. For simplicity, we choose M = N = {5, 10, 15, 20, 25}, and the performance for each value is shown in Fig. 10(b). M and N are set to 10, where the SROCC is high and the feature dimension is not large.

4.3 Performance Comparison

Table 4 Overall performance comparisons on LIVE Phase I and Phase II.

Algorithm      LIVE Phase I                     LIVE Phase II
               SROCC   PLCC    RMSE             SROCC   PLCC    RMSE
Shao[15]       0.876   0.904   -                0.848   0.824   -
Shao[16]       0.896   0.907   -                -       -       -
Chen[17]       0.891   0.895   7.247            0.880   0.880   5.102
Su[18]         -       -       -                0.905   0.913   4.657
Heeseok[20]    0.935   0.943   -                0.871   0.863   -
Proposed       0.943   0.951   5.012            0.933   0.941   3.757

Table 5 Overall performance comparisons on UW/IVC Phase I and Phase II.

Algorithm      UW/IVC Phase I           UW/IVC Phase II
               SROCC   PLCC             SROCC   PLCC
Yang[39]       0.611   0.588            0.706   0.639
You[7]         0.597   0.587            0.713   0.682
Chen[5]        0.682   0.578            0.734   0.613
Mittal[33]     0.845   0.794            0.869   0.849
Chen[17]       0.708   0.634            0.715   0.662
Proposed       0.907   0.868            0.932   0.869

4.3.1 Overall Performance Comparison

The proposed model is compared with several state-of-the-art NR-SIQA methods on the LIVE 3D database and the UW/IVC database. 80% of the images in each database are chosen as the training set and 20% as the testing set. The training set is fed into the SVR, and the learned regression model is tested on the testing set. This training-testing procedure is repeated 1000 times and the mean of the results is used as the final performance. The comparisons are listed in Table 4. Su [18] is only evaluated on Phase II, thus its performance on Phase I is denoted by "-". Other unavailable performance values are also denoted by "-". The top performance is highlighted in boldface. The table shows that the proposed method achieves much better performance than the other blind IQA metrics. Although Phase II contains asymmetrically distorted stereo-pairs that are difficult to assess, our method still achieves good performance on both Phase I and Phase II. The proposed method is also tested on the UW/IVC 3D database Phase I and Phase II. The experimental conditions are the same as for the LIVE database. We compare the results with four SIQA methods, and the comparison results are shown in Table 5. It can be seen that our method achieves better performance than the FR and NR methods. Yang's [39] and You's [7] FR metrics take depth/disparity information into consideration but ignore binocular perception properties. Chen's FR method [5] is based on the cyclopean image, but its performance is not promising. Mittal's [33] is a 2D NR


method, which is applied to the left and right views separately; the average of the two scores is taken as the quality score of the stereo-pair. The reason Chen's NR method [17] performs worse than Mittal's is that Chen's algorithm cannot generate the disparity precisely.

4.3.2 Performance Comparison of Each Distortion Type

We also make comparisons for each distortion type between FR and NR metrics. The comparison results are shown in Table 6. Since the sources of Liu's [13] and Heeseok's [20] methods do not provide RMSE data, they are not listed in the RMSE part of the table. It is clearly seen that our method outperforms both FR and NR SIQA metrics, especially on all images. Besides, our method also outperforms the other metrics on the JP2K and JPEG distortions in LIVE Phase I. On Phase II, it performs well on all images and on the JP2K distortion. Although on some distortions the performance of the proposed metric lags behind the best, the gap is not large. Gorley's scheme [6] performs the worst since it ignores 3D visual properties. You's scheme [7] considers the disparity information, and its performance is better than Gorley's scheme. Lin's FR scheme [8] combines local amplitude and phase maps to produce cyclopean amplitude/phase maps. Its performance on Phase II is not as good as on Phase I because the combination method used to generate the cyclopean amplitude/phase maps needs improvement. Chen's FR scheme [5] uses a 2D metric to assess the quality of the cyclopean image, and both SROCC and PLCC reach 0.91 on Phase I. Chen's NR scheme [17] extracts NSS-based features from the cyclopean image and disparity map, but the binocular rivalry effect and other 3D perceptual properties are not considered adequately. Both SROCC and PLCC reach 0.89 on Phase I and 0.88 on Phase II. Su's scheme [18] also develops a model based on the cyclopean image. Its SROCC and PLCC on Phase II are 0.90 and 0.91, which are more promising than the other metrics except the proposed method. However, to what extent the 3D quality degradation is

Table 6 Performance comparisons on each distortion type on LIVE Phase I and Phase II (FR and RR metrics are italic).

SROCC
Algorithm      LIVE Phase I                                        LIVE Phase II
               JP2K   JPEG   WN     GB     FF     ALL              JP2K   JPEG   WN     GB     FF     ALL
Chen[5]        0.888  0.530  0.948  0.925  0.707  0.916            0.814  0.843  0.940  0.908  0.884  0.889
Gorley[6]      0.420  0.015  0.741  0.750  0.366  0.142            0.110  0.027  0.875  0.770  0.601  0.146
You[7]         0.860  0.439  0.940  0.882  0.588  0.880            0.894  0.795  0.909  0.813  0.891  0.786
Lin[8]         0.913  0.716  0.929  0.933  0.829  0.931            0.785  0.733  0.965  0.920  0.891  0.894
Shao[9]        0.895  0.495  0.941  0.940  0.796  0.925            0.885  0.842  0.952  0.916  0.901  0.849
Shao[10]       0.901  0.648  0.945  0.927  0.806  0.928            0.922  0.753  0.929  0.826  0.790  0.908
Liu[13]        -      -      -      -      -      -                -      -      -      -      -      0.903
Shao[15]       0.871  0.427  0.932  0.914  -      0.876            0.826  0.828  0.928  0.984  -      0.848
Shao[16]       0.871  0.430  0.922  0.915  -      0.896            -      -      -      -      -      -
Chen[17]       0.863  0.617  0.919  0.878  0.652  0.891            0.867  0.867  0.950  0.900  0.933  0.880
Su[18]         -      -      -      -      -      -                0.845  0.818  0.946  0.903  0.899  0.905
Heeseok[20]    0.885  0.765  0.921  0.930  0.944  0.935            0.853  0.822  0.833  0.889  0.878  0.871
Proposed       0.928  0.866  0.941  0.917  0.827  0.943            0.895  0.851  0.928  0.912  0.909  0.933

PLCC
Algorithm      LIVE Phase I                                        LIVE Phase II
               JP2K   JPEG   WN     GB     FF     ALL              JP2K   JPEG   WN     GB     FF     ALL
Chen[5]        0.912  0.603  0.942  0.942  0.776  0.917            0.834  0.862  0.957  0.963  0.901  0.900
Gorley[6]      0.485  0.312  0.796  0.853  0.364  0.451            0.372  0.322  0.874  0.934  0.706  0.515
You[7]         0.878  0.487  0.941  0.920  0.730  0.881            0.905  0.830  0.912  0.784  0.915  0.800
Lin[8]         0.952  0.755  0.927  0.958  0.862  0.937            0.782  0.747  0.946  0.958  0.905  0.911
Shao[9]        0.921  0.520  0.945  0.959  0.859  0.935            0.850  0.853  0.956  0.976  0.926  0.863
Shao[10]       0.939  0.686  0.937  0.950  0.825  0.926            0.888  0.887  0.892  0.957  0.937  0.912
Liu[13]        -      -      -      -      -      -                -      -      -      -      -      0.915
Shao[15]       0.903  0.459  0.907  0.950  -      0.904            0.818  0.808  0.923  0.944  -      0.824
Shao[16]       0.901  0.458  0.916  0.952  -      0.907            -      -      -      -      -      -
Chen[17]       0.907  0.695  0.917  0.917  0.735  0.895            0.867  0.867  0.950  0.900  0.933  0.880
Su[18]         -      -      -      -      -      -                0.847  0.888  0.953  0.968  0.944  0.913
Heeseok[20]    0.913  0.767  0.910  0.950  0.954  0.943            0.865  0.821  0.836  0.934  0.815  0.863
Proposed       0.966  0.879  0.964  0.961  0.875  0.951            0.921  0.920  0.954  0.979  0.944  0.941

RMSE (Chen[5], Gorley[6], You[7], Lin[8], Shao[9], Shao[10], Chen[17], Su[18], Proposed)
7.837 6.533 5.562 11.569 14.635 9.113 8.492 7.746 4.186 6.299 5.744 5.816 6.181 8.322 7.247 4.298 5.482 5.209 5.012 3.465
3.865 6.940 4.086 3.342 4.169 2.554
3.368 5.202 4.396 3.513 3.547 2.913
3.747 4.988 8.649 4.725 4.453 2.593
4.966 8.155 4.649 4.180 4.199 3.511
4.987 9.675 6.772 4.648 5.706 4.356 5.102 4.657 3.757
5.320 5.216 11.323 6.211 6.206 5.709 3.963 4.291 5.402 4.523 3.071 2.909
5.581 4.822 10.197 7.562 5.621 5.679 6.257 4.137 6.433 5.898 4.057 3.240

revealed in the cyclopean image is not fully understood. Shao's method [9] combines sparse feature similarity and luminance similarity. It performs well on Phase I but poorly on Phase II, since the binocular vision properties are less explored. Simple and complex cell stages

in the receptive field are modeled using an energy model in [10], and thus its performance is competitive with other FR and NR metrics. However, the depth features are less explored and its performance is inferior to our proposed method. Sparse-representation-based EoP and MIP features are utilized in Liu's RR method [13], but the interaction between the two views is less explored. Shao's methods [15] [16] are developed without learning from human scores; however, their performance turns out to be inferior to other opinion-aware methods. The deep convolutional neural network (CNN) scheme [20] is constructed in terms of local-to-global feature aggregation. Since the binocular perceptual rivalry and suppression mechanisms are not taken into account, the performance of this method on LIVE Phase II is not promising.

4.4 Performance Evaluation
4.4.1 Cross-Database Performance Evaluation

To test the generalization capability of the method, we conduct an experiment in which one database is used for training the regression model and another database is used for testing it. The model is trained on LIVE Phase I and tested on LIVE Phase II, and vice versa. The results are listed in Table 7 and Table 8. It can be seen that the performance of training on one database (i.e., Phase I) and testing on another (i.e., Phase II) is not promising. This can be explained by the fact that the two databases contain different types of image content. Besides, Phase II contains asymmetrically distorted stereo-pairs while Phase I includes only symmetrically distorted stereo-pairs.

Table 7 Performance of the model trained on LIVE Phase I and tested on LIVE Phase II.

Criteria   JP2K    JPEG    WN      GB      FF      ALL
SROCC      0.858   0.623   0.837   0.782   0.878   0.784
PLCC       0.864   0.691   0.855   0.889   0.874   0.795
RMSE       4.946   5.308   5.560   6.377   5.587   6.847

Table 8 Performance of the model trained on LIVE Phase II and tested on LIVE Phase I.

Criteria   JP2K    JPEG    WN      GB      FF      ALL
SROCC      0.832   0.519   0.906   0.790   0.462   0.805
PLCC       0.884   0.569   0.881   0.821   0.641   0.812
RMSE       6.054   5.377   7.880   8.273   9.547   9.581

4.4.2 Cross-Distortion Performance Evaluation

We further verify the performance of the proposed scheme across different distortion types in the LIVE database. Specifically, we train our model on stereo-pairs with one type of distortion and test it on stereo-pairs with another type of distortion. The SROCC values of this cross-distortion evaluation are shown in Table 9 and Table 10; in each table, rows correspond to the training distortion and columns to the testing distortion. It can be seen from Table 9 that the model trained on WN distortion cannot efficiently assess the other distortions, whereas models trained on the other distortions can generally assess WN distortion. A model trained on one distortion type may not perform well on every other type; nevertheless, the overall performance is stable except for the models trained on the WN and JPEG distortions.
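A compact sketch of how such a cross-distortion SROCC matrix can be assembled is shown below; feats, dmos and dist_labels are hypothetical precomputed arrays (one entry per stereo-pair), scikit-learn's SVR again stands in for LIBSVM [36], and the "ALL" column is interpreted here as all stereo-pairs not carrying the training distortion, which is an assumption.

# Sketch: assemble a cross-distortion SROCC matrix in the style of Tables 9 and 10.
# feats (N x D), dmos (N,) and dist_labels (N,) are assumed precomputed per stereo-pair.
import numpy as np
from scipy.stats import spearmanr
from sklearn.svm import SVR

DISTORTIONS = ["JP2K", "JPEG", "WN", "GB", "FF"]

def cross_distortion_srocc(feats, dmos, dist_labels, C=1024.0, gamma=0.125):
    feats, dmos = np.asarray(feats, float), np.asarray(dmos, float)
    labels = np.asarray(dist_labels)
    table = {}
    for train_d in DISTORTIONS:
        tr = labels == train_d
        model = SVR(kernel="rbf", C=C, gamma=gamma).fit(feats[tr], dmos[tr])
        for test_d in DISTORTIONS + ["ALL"]:
            if test_d == train_d:
                continue                             # diagonal left empty, as in the tables
            te = ~tr if test_d == "ALL" else labels == test_d
            table[(train_d, test_d)] = spearmanr(dmos[te], model.predict(feats[te]))[0]
    return table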

Table 9 SROCC on LIVE Phase I for cross-distortion performance evaluation.

Training \ Testing   JP2K    JPEG    WN      GB      FF      ALL
JP2K                 -       0.705   0.895   0.839   0.593   0.844
JPEG                 0.678   -       0.924   0.649   0.630   0.705
WN                   0.414   0.307   -       0.552   0.416   0.580
GB                   0.764   0.632   0.908   -       0.447   0.767
FF                   0.822   0.716   0.851   0.753   -       0.877

Table 10 SROCC on LIVE Phase II for cross-distortion performance evaluation.

Training distortions (rows): JP2K, JPEG, WN, GB, FF; testing distortions (columns): JP2K, JPEG, WN, GB, FF, ALL. [The per-cell SROCC values of this table are not unambiguously recoverable from the extracted text.]

4.4.3 Image Content-Based Performance Evaluation

We partition each phase of the LIVE 3D database into training and testing sets based on image content. For each database, 80% of the reference images and their corresponding distorted images are selected as the training set, and the remaining 20% form the testing set. Phase I consists of 20 reference stereo-pairs, from which 16 reference stereo-pairs and their corresponding distorted images are selected for training, while the remaining reference stereo-pairs and their distorted versions are used for testing. Phase II consists of 8 reference stereo-pairs, of which 6 and their corresponding distorted images are selected for training and the rest for testing. This training-testing procedure is repeated 1000 times on Phase I and 28 times on Phase II. Performance evaluations for the random-based and image content-based partitions are listed in Table 11. It can be seen from Table 11 that the content-based partition performs slightly worse than the random-based partition. The reason is that the regression model learned from the content-based training never sees the testing contents, i.e., it is a "content-unaware" model, and therefore performs worse than the "content-aware" model obtained from the random partition.
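A minimal sketch of this content-based partition follows; feats, dmos and ref_ids (the index of the pristine content each stereo-pair derives from) are hypothetical precomputed arrays, GroupShuffleSplit keeps all versions of a content on one side of the split, and the median SROCC over the repeated splits is used as the summary statistic (an assumption). The sketch uses the randomized 80/20 variant; the Phase II protocol instead enumerates the 28 possible 6/2 content splits.

# Sketch of the content-based partition: every distorted version of a reference
# stereo-pair falls on either the training or the testing side, never both.
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import GroupShuffleSplit
from sklearn.svm import SVR

def content_based_srocc(feats, dmos, ref_ids, n_runs=1000, test_size=0.2, C=1024.0, gamma=0.125):
    feats, dmos = np.asarray(feats, float), np.asarray(dmos, float)
    splitter = GroupShuffleSplit(n_splits=n_runs, test_size=test_size, random_state=0)
    sroccs = []
    for tr, te in splitter.split(feats, dmos, groups=ref_ids):
        model = SVR(kernel="rbf", C=C, gamma=gamma).fit(feats[tr], dmos[tr])
        sroccs.append(spearmanr(dmos[te], model.predict(feats[te]))[0])
    return float(np.median(sroccs))                  # median over the repeated splits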

Table 11 Performance evaluation on database random partition and image content-based partition.

                            LIVE Phase I                  LIVE Phase II
Partition method            SROCC   PLCC    RMSE          SROCC   PLCC    RMSE
Random-based partition      0.943   0.951   5.012         0.933   0.941   3.757
Content-based partition     0.906   0.924   6.154         0.887   0.893   5.001

4.4.4 Performance Evaluation of Each Feature Type

In our metric, three types of features are extracted: marginal probability (MP) and conditional probability (CP) features from the bivariate-statistics-based NSS analysis, and correlation coefficient (CC) features from the horizontally adjacent responses. In order to verify the contribution of each type of features, we analyze the gain brought by each type in Table 12. It can be seen from Table 12 that each type of features improves the performance of our metric.
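The ablation can be sketched as follows; the mapping from a feature subset name to the columns it occupies in the full feature vector (feature_slices) and the randomized 80/20 protocol with median SROCC are assumptions made for illustration, not the paper's exact implementation.

# Sketch of the feature-type ablation of Table 12: evaluate the nested subsets
# MP, MP+CP and MP+CP+CC under the same training/testing protocol.
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import ShuffleSplit
from sklearn.svm import SVR

def ablation_srocc(feats, dmos, feature_slices, n_runs=1000, C=1024.0, gamma=0.125):
    """feature_slices maps a subset name (e.g., 'MP+CP') to the column indices it uses."""
    feats, dmos = np.asarray(feats, float), np.asarray(dmos, float)
    splitter = ShuffleSplit(n_splits=n_runs, test_size=0.2, random_state=0)
    results = {}
    for name, cols in feature_slices.items():
        sub = feats[:, cols]
        sroccs = []
        for tr, te in splitter.split(sub):
            model = SVR(kernel="rbf", C=C, gamma=gamma).fit(sub[tr], dmos[tr])
            sroccs.append(spearmanr(dmos[te], model.predict(sub[te]))[0])
        results[name] = float(np.median(sroccs))
    return results

# e.g., ablation_srocc(feats, dmos, {"MP": mp_cols, "MP+CP": mp_cp_cols, "MP+CP+CC": all_cols})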

Table 12 Performance with feature type increasing on LIVE Phase I and Phase II.

                   LIVE Phase I                  LIVE Phase II
Feature set        SROCC   PLCC    RMSE          SROCC   PLCC    RMSE
MP                 0.901   0.923   6.200         0.900   0.912   4.562
MP+CP              0.937   0.945   5.244         0.927   0.938   3.815
MP+CP+CC           0.943   0.951   5.012         0.933   0.941   3.757

4.4.5 Training Samples Ratio Performance Evaluation

We evaluate how the proposed metric behaves when different ratios of the images in the LIVE 3D database are used for training in the regression phase. Specifically, three settings with 80%, 50% and 30% of the samples are respectively used for training, with the remaining samples used for testing. The training-testing process of each setting is again repeated 1000 times, and the experimental results are shown in Table 13. It can be seen that the performance drops only slightly as the number of training images decreases.
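A short sketch of this training-ratio sweep follows, again with feats and dmos assumed precomputed, placeholder hyper-parameters, and the median SROCC over the repeated random splits as the reported summary (an assumption).

# Sketch of the training-ratio experiment of Table 13: vary the fraction of
# stereo-pairs used for training and repeat the random split 1000 times.
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import ShuffleSplit
from sklearn.svm import SVR

def ratio_sweep_srocc(feats, dmos, ratios=(0.8, 0.5, 0.3), n_runs=1000, C=1024.0, gamma=0.125):
    feats, dmos = np.asarray(feats, float), np.asarray(dmos, float)
    results = {}
    for ratio in ratios:
        splitter = ShuffleSplit(n_splits=n_runs, train_size=ratio, random_state=0)
        sroccs = []
        for tr, te in splitter.split(feats):
            model = SVR(kernel="rbf", C=C, gamma=gamma).fit(feats[tr], dmos[tr])
            sroccs.append(spearmanr(dmos[te], model.predict(feats[te]))[0])
        results[ratio] = float(np.median(sroccs))
    return results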

Table 13 Performance of different ratios (80%, 50%, 30%) of samples for training on LIVE database.

                     LIVE Phase I                                      LIVE Phase II
Ratio  Criteria  JP2K   JPEG   WN     GB     FF     ALL        JP2K   JPEG   WN     GB     FF     ALL
80%    SROCC     0.928  0.866  0.941  0.917  0.827  0.943      0.895  0.851  0.928  0.912  0.909  0.933
       PLCC      0.966  0.879  0.964  0.961  0.875  0.951      0.921  0.920  0.954  0.979  0.944  0.941
       RMSE      3.071  2.909  4.057  3.240  5.209  5.012      3.465  2.554  2.913  2.593  3.511  3.757
50%    SROCC     0.918  0.826  0.941  0.894  0.822  0.933      0.879  0.844  0.925  0.909  0.896  0.922
       PLCC      0.942  0.818  0.955  0.935  0.860  0.940      0.893  0.887  0.944  0.971  0.913  0.926
       RMSE      5.591  4.280  3.652  4.900  5.325  6.158      4.320  3.331  3.464  3.257  4.545  4.243
30%    SROCC     0.900  0.779  0.935  0.878  0.784  0.921      0.842  0.821  0.918  0.895  0.884  0.902
       PLCC      0.923  0.781  0.946  0.919  0.833  0.927      0.854  0.845  0.930  0.962  0.886  0.904
       RMSE      4.905  4.043  5.371  5.580  6.818  6.163      5.001  3.876  3.875  3.721  5.282  4.838

5 Conclusion

In this paper, we propose a novel 3D-structure and bivariate-analysis-based model for predicting the quality of stereo-pairs. The prominent contribution of this work is that we classify the internal 3D structure into binocular and monocular regions and apply different feature extraction and combination strategies to the different regions. A bivariate NSS model is proposed to depict the quality degradation of images, from which features are extracted. More specifically, each region of the perceived 3D structure is classified into one of three types of perceptual phenomena, and the bivariate analysis is then conducted on each class using discriminative strategies. The resulting feature sets are combined according to the areas they occupy. SVR is used to train an optimum regression model. The proposed model is compared with other FR-SIQA and NR-SIQA metrics on LIVE 3D database Phase I and Phase II and on Waterloo IVC 3D database Phase I and Phase II. Experimental results show that the proposed model is effective in predicting the quality of stereo-pairs.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under grants No. 61422111, 61671282 and U1301257, and sponsored by the Shanghai Pujiang Program (15pjd015) and the Shanghai Shuguang Program (17SG37).

References

1 A. Benoit, P. Le Callet, P. Campisi, and R. Cousseau, “Quality assessment of stereoscopic images,” EURASIP J. Image Video Process., vol. 2008, pp. 1-13, Jan. 2009.
2 Y. Zhang and D. M. Chandler, “3D-MAD: A full reference stereoscopic image quality estimator based on binocular lightness and contrast perception,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 3810-3825, Nov. 2015.
3 F. Shao, W. Lin, S. Gu, et al., “Perceptual full-reference quality assessment of stereoscopic images by considering binocular visual characteristics,” IEEE Trans. Image Process., vol. 22, no. 5, pp. 1940-1953, May 2013.

4 X. Geng, L. Shen, K. Li, and P. An, “A stereoscopic image quality assessment model based on independent component analysis and binocular fusion property,” Signal Process. Image Commun., vol. 52, no. 1, pp. 54-63, Mar. 2017.
5 M. J. Chen, C. C. Su, D. L. Kwon, L. K. Cormack, and A. C. Bovik, “Full-Reference Quality Assessment of Stereopairs Accounting for Rivalry,” Signal Process. Image Commun., vol. 28, no. 9, pp. 1143-1155, Oct. 2013.
6 P. Gorley and N. Holliman, “Stereoscopic image quality metrics and compression,” Proc. SPIE, vol. 6803, pp. 680305-1-680305-12, Feb. 2008.
7 J. You, L. Xing, A. Perkis, and X. Wang, “Perceptual quality assessment for stereoscopic images based on 2D image quality metrics and disparity analysis,” in Proc. Int. Workshop Video Process. Quality Metrics Consum. Electron., 2010, pp. 1-6.
8 Y. Lin, J. Yang, W. Lu, Q. Meng, Z. Lv, and H. Song, “Quality Index for Stereoscopic Images by Jointly Evaluating Cyclopean Amplitude and Cyclopean Phase,” IEEE J. Sel. Topics Signal Process., vol. 11, no. 1, pp. 89-101, Feb. 2017.
9 F. Shao, K. Li, W. Lin, G. Jiang, M. Yu, and Q. Dai, “Full-reference quality assessment of stereoscopic images by learning binocular receptive field properties,” IEEE Trans. Image Process., vol. 24, no. 10, pp. 2971-2983, Oct. 2015.
10 F. Shao, W. Lin, G. Jiang, and Q. Dai, “Models of Monocular and Binocular Visual Perception in Quality Assessment of Stereoscopic Images,” IEEE Trans. Comput. Imag., vol. 2, no. 2, pp. 123-135, June 2016.
11 F. Qi, D. Zhao, and W. Gao, “Reduced Reference Stereoscopic Image Quality Assessment Based on Binocular Perceptual Information,” IEEE Trans. Multimedia, vol. 17, no. 12, pp. 2338-2344, Oct. 2015.
12 L. Ma, et al., “Reorganized DCT-based image representation for reduced reference stereoscopic image quality assessment,” Neurocomputing, vol. 215, pp. 21-31, Nov. 2016.
13 Z. Liu, C. Yang, S. Rho, S. Liu, and F. Jiang, “Structured entropy of primitive: big data-based stereoscopic image quality assessment,” IET Image Process., vol. 11, no. 10, pp. 854-860, Oct. 2017.
14 A. Balasubramanyam, S. Khan, and S. S. Channappayya, “No-reference Stereoscopic Image Quality Assessment Using Natural Scene Statistics,” Signal Process. Image Commun., vol. 43, pp. 1-14, 2016.
15 F. Shao, W. Lin, S. Wang, G. Jiang, and M. Yu, “Blind image quality assessment for stereoscopic images using binocular guided quality lookup and visual codebook,” IEEE Trans. Broadcast., vol. 61, no. 2, pp. 154-165, Jun. 2015.
16 F. Shao, W. Lin, S. Wang, G. Jiang, M. Yu, and Q. Dai, “Learning receptive fields and quality lookups for blind quality assessment of stereoscopic images,” IEEE Trans. Cybern., vol. 46, no. 3, pp. 730-743, Mar. 2016.
17 M. J. Chen, L. K. Cormack, and A. C. Bovik, “No-Reference Quality Assessment of Natural Stereopairs,” IEEE Trans. Image Process., vol. 22, no. 9, pp. 3379-3391, Sep. 2013.
18 C. C. Su, L. K. Cormack, and A. C. Bovik, “Oriented Correlation Models of Distorted Natural Images With Application to Natural Stereopair Quality Evaluation,” IEEE Trans. Image Process., vol. 24, no. 5, pp. 1685-1699, May 2015.
19 S. Ryu and K. Sohn, “No-reference quality assessment for stereoscopic images based on binocular quality perception,” IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 4, pp. 591-602, Apr. 2014.
20 H. Oh, S. Ahn, J. Kim, and S. Lee, “Blind Deep S3D Image Quality Evaluation via Local to Global Feature Aggregation,” IEEE Trans. Image Process., vol. 26, no. 10, pp. 4923-4936, Oct. 2017.
21 J. C. Read, G. P. Phillipson, I. Serrano-Pedraza, A. D. Milner, and A. J. Parker, “Stereoscopic vision in the absence of the lateral occipital cortex,” PLoS ONE, vol. 5, p. e12608, 2010.
22 F. Tong, M. Meng, and R. Blake, “Neural bases of binocular rivalry,” Trends Cogn. Sci., vol. 10, no. 11, pp. 502-511, Sep. 2006.
23 N. Logothetis, “Single units and conscious vision,” Philos. Trans. R. Soc. Lond. B, Biol. Sci., vol. 353, no. 1377, pp. 1801-1818, Nov. 1998.
24 I. P. Howard and B. J. Rogers, Seeing in Depth, vol. 2: Depth Perception. Toronto, ON, Canada: I Porteous.
25 W. J. M. Levelt, On Binocular Rivalry. Paris, France: Mouton, 1968.
26 R. Blake, D. H. Westendorf, and R. Overton, “What is suppressed during binocular rivalry?,” Perception, vol. 9, no. 2, pp. 223-231, 1980.
27 R. Blake and N. K. Logothetis, “Visual competition,” Nature Reviews Neuroscience, vol. 3, no. 1, pp. 13-21, 2002.
28 R. Blake, R. P. O’Shea, and T. J. Mueller, “Spatial zones of binocular rivalry in central and peripheral vision,” Vis. Neurosci., vol. 8, pp. 469-478, 1992.
29 S. Takase, S. Yukumatsu, and K. Bingushi, “Local binocular fusion is involved in global binocular rivalry,” Vis. Res., vol. 48, pp. 1798-1803, 2008.


30 J. G. Daugman, “Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters,” J. Opt. Soc. Amer. A, vol. 2, no. 7, pp. 1160-1169, Jul. 1985.
31 C. C. Su, L. K. Cormack, and A. C. Bovik, “Color and depth priors in natural images,” IEEE Trans. Image Process., vol. 22, no. 6, pp. 2259-2274, Jun. 2013.
32 H. R. Sheikh, Z. Wang, L. Cormack, and A. C. Bovik, “LIVE Image Quality Assessment Database Release 2,” [Online]. Available: http://live.ece.utexas.edu/research/quality.
33 A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Trans. Image Process., vol. 21, no. 12, pp. 4695-4708, Dec. 2012.
34 A. K. Moorthy and A. C. Bovik, “Blind image quality assessment: From natural scene statistics to perceptual quality,” IEEE Trans. Image Process., vol. 20, no. 12, pp. 3350-3364, Dec. 2011.
35 M. A. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the DCT domain,” IEEE Trans. Image Process., vol. 21, no. 8, pp. 3339-3352, Aug. 2012.
36 C. C. Chang and C. J. Lin, “LIBSVM: A Library for Support Vector Machines,” 2001. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
37 P. Campisi, P. Le Callet, and E. Marini, “Stereoscopic images quality assessment,” in Proc. European Signal Process. Conf., 2007, pp. 2110-2114.
38 A. Benoit, P. Le Callet, P. Campisi, and R. Cousseau, “Quality Assessment of Stereoscopic Images,” EURASIP J. Image Video Process., vol. 2008, no. 659024, pp. 1-13, Jan. 2008.


39 J. Yang, C. Hou, Y. Zhou, Z. Zhang, and J. Guo, “Objective quality assessment method of stereo images,” in Proc. 3DTV-CON, 2009.
40 K. Lee and S. Lee, “3D perception based quality pooling: Stereopsis, binocular rivalry and binocular suppression,” IEEE J. Sel. Topics Signal Process., vol. 9, no. 3, pp. 533-545, Apr. 2015.
41 J. Wu, Y. Liu, L. Li, and G. Shi, “Attended Visual Content Degradation Based Reduced Reference Image Quality Assessment,” IEEE Access, vol. PP, no. 99, pp. 1-1, doi: 10.1109/ACCESS.2018.2798573.
42 Y. Fang, J. Yan, L. Li, J. Wu, and W. Lin, “No Reference Quality Assessment for Screen Content Images With Both Local and Global Feature Representation,” IEEE Trans. Image Process., vol. 27, no. 4, pp. 1600-1610, Apr. 2018.
43 Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600-612, Apr. 2004.
44 A. K. Moorthy, C. C. Su, A. Mittal, and A. C. Bovik, “Subjective evaluation of stereoscopic image quality,” Signal Process. Image Commun., vol. 28, no. 8, pp. 870-883, 2013.
45 D. V. Meegan, L. B. Stelmach, and W. J. Tam, “Unequal weighting of monocular inputs in binocular combination: Implications for the compression of stereoscopic imagery,” J. Experim. Psychol., Appl., vol. 7, no. 2, pp. 143-153, Jan. 2001.
46 R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, pp. 2274-2282, Nov. 2012.


47 K. Seshadrinathan and A. C. Bovik, “Motion tuned spatio-temporal quality assessment of natural videos,” IEEE Trans. Image Process., vol. 19, no. 2, pp. 335-350, Feb. 2010.
48 D. L. Ruderman, “The statistics of natural images,” Netw. Comput. Neural Syst., vol. 5, no. 4, pp. 517-548, Jul. 1994.
49 J. M. Geusebroek and A. W. M. Smeulders, “A six-stimulus theory for stochastic texture,” Int. J. Comput. Vis., vol. 62, no. 1, pp. 7-16, Apr. 2005.
50 D. J. Fleet, H. Wagner, and D. J. Heeger, “Neural encoding of binocular disparity: energy models, position shifts and phase shifts,” Vis. Res., vol. 36, no. 12, pp. 1839-1857, Nov. 1996.
51 J. Ding and G. Sperling, “A gain-control theory of binocular combination,” Proc. Nat. Acad. Sci. United States Amer., vol. 103, no. 4, pp. 1141-1146, 2006.
52 C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining Knowl. Discovery, vol. 2, no. 2, pp. 121-167, 1998.
53 J. Wang, A. Rehman, K. Zeng, S. Wang, and Z. Wang, “Quality prediction of asymmetrically distorted stereoscopic 3D images,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 3400-3414, 2015.
54 P. G. Gottschalk and J. R. Dunn, “The five-parameter logistic: a characterization and comparison with the four-parameter logistic,” Anal. Biochem., vol. 343, no. 1, pp. 54-65, Aug. 2005.

List of Figures

1 Example of formulation of 3D structure. (a) XY-plane in 3D structure. (b) 3D structure.
2 Bivariate joint statistics of reference image and its five distorted versions.
3 Framework of proposed metric.
4 Stereo-pair and disparities estimated by different disparity estimation algorithms. The three maps in the first column are estimated by the SAD-based algorithm, the second column by the SSIM-based algorithm, and the third column by the proposed Gaussian averaged-SSIM-based algorithm.
5 Results of segmentation of stereo-pair with asymmetric JP2K distortion. (a) Left image. (b) Right image. (c) Disparity map. (d) 3D structure segments based on stereo-pair and disparity.
6 Consistency map and result of classification for Figs. 5(a) and (b). (a) Consistency map. (b) Result of classification and integration.
7 Marginal probabilities Pk (shown as the first half of the histograms) and conditional probabilities Qk (shown as the second half of the histograms) of the distorted images generated from the same reference image at different DMOS levels.
8 Plots of correlation coefficient for pristine image and its five distorted versions.
9 The SVR parameters (C, γ) selection process on (a) LIVE 3D database Phase I and (b) LIVE 3D database Phase II. The number on the level contour denotes the SROCC value of the cross-validation.
10 Parameter selection across LIVE 3D IQA database Phase I and Phase II. (a) Parameter K for the number of cluster centers in image segmentation (selected from {24, 30, 36, 42, 48}). (b) Parameter M or N for the grid size (selected from {5, 10, 15, 20, 25}). (c) Parameter thcon, the threshold used to distinguish binocular fusion from rivalry (selected from {0.8, 0.85, 0.9, 0.95}). (d) Parameter thw, the threshold used to distinguish binocular rivalry from suppression (selected from {0.15, 0.2, 0.25, 0.3}).

List of Tables

1 Performance on the sets A and B as the regression model is learned from set A and B, respectively.
2 Performance on training set and testing set as the regression model is learned from the features of the training set with or without Gaussian white noise.
3 Performance comparison between SVR, NN, GP and RF.
4 Overall performance comparisons on LIVE Phase I and Phase II.
5 Overall performance comparisons on UW/IVC Phase I and Phase II.
6 Performance comparisons on each distortion type on LIVE Phase I and Phase II (FR and RR metrics are italic).
7 Performance of model training on LIVE Phase I and testing on LIVE Phase II.
8 Performance of model training on LIVE Phase II and testing on LIVE Phase I.
9 SROCC on LIVE Phase I for cross-distortion performance evaluation.
10 SROCC on LIVE Phase II for cross-distortion performance evaluation.
11 Performance evaluation on database random partition and image content-based partition.
12 Performance with feature type increasing on LIVE Phase I and Phase II.
13 Performance of different ratios (80%, 50%, 30%) of samples for training on LIVE database.

Highlights (for review):
1. A bivariate natural scene statistics model is proposed to represent image quality.
2. Marginal and conditional probabilities are used to depict the bivariate distribution.
3. Features are separately extracted from monocular and binocular regions.