Video Quality Assessment Using Space-Time Slice Mappings
Lixiong Liu, Tianshu Wang, Hua Huang, and Alan Conrad Bovik, Fellow, IEEE
Abstract—We develop a full-reference (FR) video quality assessment framework that integrates analysis of space-time slices (STSs) with frame-based image quality assessment (IQA) to form a high-performance video quality predictor. The approach first arranges the reference and test video sequences into a space-time slice representation. To more comprehensively characterize space-time distortions, a collection of distortion-aware maps are computed on each reference-test video pair. These reference-distorted maps are then processed using a standard image quality model, such as peak signal-to-noise ratio (PSNR) or Structural Similarity (SSIM). A simple learned pooling strategy is used to combine the multiple IQA outputs to generate a final video quality score. This leads to an algorithm called Space-Time Slice PSNR (STS-PSNR), which we thoroughly tested on three publicly available video quality assessment databases and found it to deliver significantly elevated performance relative to state-of-the-art video quality models. Source code for STS-PSNR is freely available at: http://live.ece.utexas.edu/research/Quality/STS-PSNR_release.zip.

Index Terms—video quality assessment, image quality assessment, spatial temporal slice, space-time stability, learning based pooling.
I. INTRODUCTION
The many significant leaps in camera, computer, and network technologies over the past decade have led to an explosion of video content being delivered and shared over the Internet [1]. However, methods to measure, monitor, and control the perceptual quality of video content remain imperfect, and continue to evolve [2]. Although subjective tests provide the most accurate assessments of video quality, they are impractical for deployment in most real-world video processing systems. However, objective video quality assessment (VQA) algorithms that correlate well with
human judgments are well-suited for this purpose. Objective video quality assessment models can be roughly divided into two categories: (1) those that investigate quality of service (QoS), and (2) those that measure quality of experience (QoE) [3]. QoS
Manuscript received March 29, 2019. This work is supported by the National Natural Science Foundation of China under grant 61672095 and grant 61425013. (Corresponding author: H. Huang.)
L. Liu, T. Wang, and H. Huang are with the Beijing Laboratory of Intelligent Information Technology, School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China (e-mail: [email protected]; [email protected]; [email protected]).
A. C. Bovik is with the Laboratory for Image and Video Engineering (LIVE), Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712 USA (e-mail: [email protected]).
methods mainly deal with measurable performance factors of delivery platforms (such as telecommunication services), and are designed to find balances between system capacity and the needs of the users of the service. Some VQA models belong to this category. For example, Adas [4] designed a dynamic network bandwidth allocation strategy that sustains variable bitrate video traffic. Liu et al. [5] considered the interaction between peak signal-to-noise ratio (PSNR) and compression quantization parameters. Chong et al. [6] proposed a novel dynamic transmission bandwidth allocation strategy for real-time variable-bit-rate
video transport in Asynchronous Transfer Mode (ATM) networks. These methods typically focus on the performance of the physical system and rarely model the user directly. The main concern of QoE methods is the degree of user satisfaction with respect to video service [7], [8]. Because the factors that affect user experience may result from outside the system, such methods often
have to deal with different types of input [9]. Raake et al. [10] proposed a HyperText Transfer Protocol (HTTP) adaptive streaming quality assessment model that serves as a component of the P.1203 standard framework. Xu et al. [11] considered relationships between visual consistency theory and buffer constraints as well as the free energy principle and how it relates to visual speed perception [12], to design video coding strategies controlled by video quality. Lin et al. [13] designed a compressed video quality assessment metric that takes three major factors (quantization, motion and bit allocation) into consideration. Bampis et al. [14]
trained a variety of recurrent neural networks to predict QoE. These models considered multiple factors that may affect visual experience, such as rebuffering measurements and video quality scores. Compared with QoS, QoE methods have a wider range of application scenarios and a greater variety of solutions [15], [16].
A basic element of video QoE methods is the VQA model that analyzes spatial-temporal video content. The VQA problem is much
more complicated than image quality assessment (IQA). Spatial information, temporal information and the interactions that occur between them all affect the perception of quality, hence directly applying an IQA method to video frames often leads to poor results [17]. Space-time distortions such as ghosting, jitter, stationary area fluctuations and local flicker are difficult or impossible to measure using frame-based IQA models [18]. Approaches that go beyond frame-level processing include an early modification [19] that adapts the Structural Similarity (SSIM) index [20] to the video quality assessment problem using an efficient sampling strategy. Yang et al. [21] proposed a model to deal with distortion-compressed natural scenes, by using temporal dependencies to weight spatial distortions. Seshadrinathan et al. [22] designed the "MOVIE" index, which measures video distortions by tracking them along motion trajectories. Manasa et al. [23] modeled the statistics of temporal and spatial distortions while accounting for optical flow. Wang et al. [24] extracted spatial edge features and temporal motion characteristics from localized space-time regions, and used them to represent video structures. Despite some success, VQA model performance generally lags that of the best IQA models. Towards closing the gap between progress on the video quality assessment problem and the levels attained on the still picture problem, we develop a very general video quality assessment framework that can be used to transform
a standard frame-centric IQA algorithm into an efficient video quality predictor. Several studies have addressed the interactions that occur between spatial and temporal information in video sequences [22], [25]. Many VQA models stress the spatial computations, by reserving the integration of temporal information until an inter-frame pooling stage. Park et al. [17] designed a content adaptive spatial-temporal pooling strategy, which uses the distribution of frame quality scores to form a temporal weighting policy, while accounting for egomotion. Soundararajan et al. [26] combined perceptual principles with statistical video models to design the high-performing spatial-temporal reduced reference entropic differences (STRRED) indices. Ninassi et al. [27] devised a divided short-term/long-term temporal pooling strategy, whereby a
visual attention mechanism was utilized to evaluate temporal fluctuations, and then the resulting spatial-temporal maps were pooled over longer time periods. Seshadrinathan et al. [28] observed a hysteresis effect of subjective judgments of time-varying video quality, and developed a temporal pooling strategy to model this effect. Separating spatial and temporal information can reduce model complexity, but it may make it difficult to recover interactions between spatial and temporal aspects of distortion. Some methods deploy 3D space-time filtering to capture spatio-temporal interactions. Li et al. [29] used the 3D Shearlet transform
to process video information that was fed to a convolutional neural network (CNN) which learns to predict a final video score. Li et al. [30] analyzed the statistics of 3D space-time discrete cosine transform (DCT) blocks to characterize video distortions. Lu et al. [31] utilized 3D space-time gradients to extract spatio-temporal video features, then computed reference-distorted 3D gradient
differences to predict video quality.
To better characterize and segment video content, Ngo et al. [32] proposed an attractive way of handling spatial-temporal information by defining “spatial temporal slices” (STS), which seeks to resolve the dilemma between spatial-temporal information acquisition and model complexity. They efficiently used the STS structure to segment and analyze the contents of videos. While the STS representation avoids the necessity of modeling motion perception, it does not isolate the temporal information contained
within the slices. Changing patterns of spatial texture, contrast and lighting conditions, among other spatial factors, also can greatly affect the appearances of the STS slices [33]. Vu et al. [34] utilized STS for video quality assessment, along with an optical flow weighting strategy to control temporal relevance. Their resulting ViS3 model achieves good prediction accuracy, but may be limited by reliance on space-time sheets. Yan et al. [35] claimed that different motion structures contained within an STS may have
different influences on quality perception. They divided the STSs into two subsets prior to the pooling stage according to their motion complexity, which increased the prediction accuracy of the algorithm to a certain degree. Although the STS representation is a very information-rich way of manipulating video sequences, it is also a complex, commingled mixture of space-time information. The authors of [34] and [35] directly use STS as the sole input to their IQA models. Here we go beyond this simple approach, by maintaining the STS data structure, but also augmenting it with information-bearing, space-time differential change information in the form of spatial gradients and temporal frame differences [36], [37]. Specifically, we compute a spatial STS edge map and a temporal STS frame difference map.
The gradient is a commonly used measure of brightness and color variation in images and videos [38]. It has also been used as an effective descriptor in picture quality assessment models [39], [40]. Lin et al. [41] conducted image quality assessment by using SSIM to account for visible differences in the image gradient. Li et al. [42] conducted multi-distorted image quality assessment by combining the image gradient with local binary pattern (LBP) statistics. The gradient of the STS provides detailed information relating to the local space-time degradations of a video [43], [44]. Structural computational models of the HVS have also been proposed to
solve the problems of image post-processing and quality assessment [45]-[48]. Here, we compute a relative gradient magnitude (RM) map and a gradient orientation (GO) map on the STSs along all three dimensions. We also form baseband representations by low-pass Gaussian filtering the STSs along all three dimensions, and high-frequency ones in the form of the coefficients of the first layer of a Laplacian pyramid [49]. This is a direct way to measure the loss of fine image details, or high-frequency artifacts introduced by processes such as lossy compression [50]. We then apply a full-reference IQA algorithm on each of the reference-distorted STS map pairs to conduct video quality assessment. By applying an IQA algorithm to measure enhanced STS differences, we obtain vectors that are predictive of distortion.
Since multiple maps are produced containing values that may be distributed over different ranges, simple pooling methods (such as mean, max or min) may perform poorly. Nor is there an accepted theory supporting a weighting policy that correlates with visual perception. Thus, we instead develop a learning based pooling strategy that automatically balances the influences of the maps to generate a final video quality score. The contributions we make are as follows. (1) We propose a general framework that transforms an ordinary (frame) IQA method
into a high-performance video quality predictor. (2) The STS representations are analyzed to produce multiple quality-aware maps, including distortion-aware processed frame and frame difference maps, relative gradient magnitude and gradient orientation maps, and Gaussian-filtered and Laplacian-compensated maps. (3) A process of space-time subsampling is applied in the STS domain to
reduce computational complexity, without loss of quality prediction performance. (4) We devise a learning based pooling strategy to automatically weight and combine the feature maps to generate a final video quality score. (5) The resulting algorithm, which we call the Space-Time Slice PSNR (STS-PSNR), delivers significantly elevated performance relative to state-of-the-art video quality models.
The remainder of this paper is organized as follows. Section II describes the proposed video quality predictor. We explain the
spatial temporal slice multi-map configuration strategy, and how we deploy an IQA model on the processed STS maps to create a VQA model called STS-PSNR. Section III presents experimental results that validate the proposed VQA method. Finally, Section IV concludes with future research directions.
II. METHODOLOGY
We develop a general video quality assessment framework that is based on a multi-map, spatial temporal slice (STS) model. Figure 1 shows a block diagram of the proposed model. Two identical processing paths are defined on reference and distorted input video sequences. These are each processed into STS representations. The STS representations are further processed to produce a distortion-aware edge-enhanced map, a frame difference map, a relative gradient magnitude (RM) map, a gradient orientation (GO) map, a Gaussian-filtered map and a Laplacian-compensated map. Although these feature maps are well known in the image processing field, we collectively adapt them to more effectively represent video space-time information. These will be defined and motivated in the following. Thus, a multi-map STS representation is obtained, including the original STS map and the six maps just defined. Space-time subsampling of the STSs before additional computation can be used to improve efficiency, and we show that it does not adversely affect prediction performance. The reference and test maps are then compared using an IQA model, yielding several outputs that are pooled using a simple learning based strategy that automatically balances the influences of the maps to
generate a final video quality score.
Fig. 1. Block diagram of the proposed framework.
A. Spatio-Temporal Slices (STS)
Spatio-temporal slices (STS) [32] are an attractive way of capturing space-time video information with relatively low
computational complexity. Begin by treating a video sequence as a rectangular cuboid located in a 3D coordinate system with orthogonal time (T), height (H) and width (W) axes (see Fig. 2). By slicing the cuboid along planes of various orientations, different aspects of a video are obtained. While arbitrary slicing planes might be considered, for simplicity we use slices along the three cardinal orientations, yielding three two-dimensional STS sequences. This idea can be formally expressed as:
$I_{STS}(i, d) = \{ V_{i}^{d} \mid d \in [T, W, H],\ i \in [1, N] \}$,  (1)
where V is the input video sequence, d and i denote the slice dimension and slice index respectively, ISTS(i, d) is the ith generated spatial temporal slice, and N is the number of slices in Video V. This process is conducted on both reference and distorted channels. This simple spatio-temporal joint representation is obtained without any motion computations or associated high computational cost.
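As a concrete illustration of (1), the following sketch (our own, not the authors' released implementation) extracts the three families of spatio-temporal slices from a grayscale video stored as a NumPy array of shape (T, H, W):

```python
import numpy as np

def extract_sts(video):
    """Return the three families of spatio-temporal slices of a video.

    video : ndarray of shape (T, H, W), grayscale frames stacked along time.
    The T-slices are simply the frames; the H- and W-slices are planes cut
    along the height and width axes, each mixing one spatial axis with time.
    """
    sts_T = [video[t, :, :] for t in range(video.shape[0])]  # frames (H x W)
    sts_H = [video[:, h, :] for h in range(video.shape[1])]  # time x width planes
    sts_W = [video[:, :, w] for w in range(video.shape[2])]  # time x height planes
    return {"T": sts_T, "H": sts_H, "W": sts_W}

# Example: a random 60-frame, 144x176 clip
video = np.random.rand(60, 144, 176).astype(np.float32)
slices = extract_sts(video)
print(len(slices["T"]), slices["H"][0].shape, slices["W"][0].shape)
```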
Figure 2 illustrates several STSs of different video sequences from the LIVE Video Quality Database [1]. On a video sequence having large movements (such as Fig. 2a), it may be observed from the H and W slices that the STS is able to capture the trajectories of moving objects. On a video sequence with little movement (such as Fig. 2b), the STS preserves the spatial structures of the objects. Generally, the motion across neighboring video frames is smooth, and the values of pixels that are close to each other in space are related, meaning that their corresponding STSs are well structured. It is reasonable to believe that this space-time
structure is affected by distortion.
Fig. 2. Video sequences having (a) large scene movements and (b) little scene movements, along with their STSs along the time (T), height (H) and width (W) dimensions.
Of course, each STS is a simplification of information. Space-time events such as sudden movements or changes of movement,
sudden changes in texture, contrast, or lighting, can all affect STS appearance. In order to heighten the availability of distortion-aware information in the STS representation, we compute edge-enhanced maps and frame difference maps that separately accentuate spatial and temporal distortion-relevant information contained in the original STSs. Edge enhancement has been widely investigated in the area of image and video processing. Edges are highly distortion-sensitive
structures, since some distortions create false edges, such as blocking, while others degrade or destroy edges, such as blur. Moreover, the human visual system (HVS) is more sensitive to edges in moving scenes [51]. In our approach we deploy the 1-D, 13-point SI13 spatial information filter, which was elaborately designed to enhance prominent edge impairments [52], [53]. It is given by
SI13 = [-0.0052625, -0.0173466, -0.0427401, -0.0768961, -0.0957739, -0.0696751, 0, 0.0696751, 0.0957739, 0.0768961, 0.0427401, 0.0173466, 0.0052625].  (2)
By horizontally and vertically applying SI13 we obtain edge-enhanced STSs:

$I^{x}_{EDGE}(i, T) = I_{STS}(i, T) \otimes si_{x}$,  (3)

$I^{y}_{EDGE}(i, T) = I_{STS}(i, T) \otimes si_{y}$,  (4)
where $I_{STS}(i, T)$ denotes the slice along the T axis at index i, $si_{x}$ and $si_{y}$ are the horizontal (x) and vertical (y) directional SI13 filters, and the filtering
operation is denoted by ⊗. By using a larger template, the influences of individual pixels and small details on the edge angle and strength calculation are reduced [54]. The corresponding distortion-friendly edge-enhanced map is then:

$I_{EDGE}(i, T) = \sqrt{I^{x}_{EDGE}(i, T)^{2} + I^{y}_{EDGE}(i, T)^{2}}$.  (5)
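The computations in (2)-(5) can be prototyped as below; this is a hedged sketch using SciPy's separable 1-D convolution, with reflective boundary handling as an assumption (the paper does not specify how slice borders are treated):

```python
import numpy as np
from scipy.ndimage import convolve1d

# Antisymmetric 13-tap SI13 filter, Eq. (2)
SI13 = np.array([-0.0052625, -0.0173466, -0.0427401, -0.0768961, -0.0957739,
                 -0.0696751, 0.0, 0.0696751, 0.0957739, 0.0768961,
                 0.0427401, 0.0173466, 0.0052625])

def edge_enhanced_map(sts_T):
    """Edge-enhanced map of a T-slice (a frame), following Eqs. (3)-(5)."""
    ex = convolve1d(sts_T, SI13, axis=1, mode='reflect')  # horizontal response, Eq. (3)
    ey = convolve1d(sts_T, SI13, axis=0, mode='reflect')  # vertical response, Eq. (4)
    return np.sqrt(ex ** 2 + ey ** 2)                     # Eq. (5)
```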
Fig. 3. Differences between edge-enhanced maps generated by performing Sobel and SI13, respectively. (a) A pristine image and its distorted version, (b) edge-enhanced maps generated by performing Sobel of (a) and their difference, (c) edge-enhanced maps generated by performing SI13 of (a) and their difference.
To illustrate the difference between results obtained using SI13 and other edge filter kernels, we processed a pristine image and a distorted version of it impaired by Gaussian white noise, and computed edge-enhanced maps on them using the Sobel [55] and SI13 filters, as shown in Fig. 3. It may be observed that both the SI13- and Sobel-filtered distorted maps contain different distortion-characteristic artifacts compared to their pristine counterparts. To make these differences visually evident, we calculated the differences between the pristine and distorted edge-enhanced maps. The Sobel filter revealed finer-grained artifacts owing to its small kernel size, but these details are less likely to be perceivable considering the bandpass characteristic of the human eye [56] as expressed by the spatial and temporal contrast sensitivity functions. Conversely, SI13 reveals more significant and distinctive distortion information. We therefore deployed SI13 to generate edge-enhanced maps. Frame differencing is a fundamental aspect of motion detection and object tracking methods [36], which we use to capture the
STS temporal representation. It also has great potential for highlighting frame freezes and other temporal flaws [57]. Define the frame difference map by:
$I_{DIFF}(i, T) = \{ I_{STS}(i, T) - I_{STS}(i-1, T) \mid i \in [2, N] \}$,  (6)
which is, of course, the usual frame difference. We illustrate the process of generating frame difference maps in the presence of wireless distortions in Fig. 4. In Fig. 4a, both a current frame and its previous frame are pristine, hence the frame difference map mostly captures contours along moving objects; in Fig. 4b, the previous frame is still pristine, but the current frame is impaired by
wireless distortion, in which case the frame difference map gains significant energy in distorted areas. This phenomenon demonstrates the ability of frame differencing to capture sudden changes in video sequences.
Fig. 4. Frame difference maps on (a) pristine frames and (b) distorted frames impaired by wireless distortion.
B. Content Measurement
Each STS is a complex mixture of space-time information. The purpose of the various feature maps that we compute is to extract
content- and distortion-aware primitives that are predictive of video quality. By using the STS representation, we obtain this kind of information along three dimensional space-time planes. Towards further enhancing the STS information, we compute a relative gradient magnitude map, a gradient orientation map, a Gaussian-filtered map and a Laplacian-compensated map on the STSs. The gradient is an effective way to capture structure shifts, contrast changes and textural changes [58], which may arise from distortion. Here, we use two simple gradient based metrics [39]: the relative gradient magnitude (RM) and the gradient orientation (GO) to capture local and global variations of the STSs. The RM and GO are defined as:

$I_{RM}(i, d) = \sqrt{\left(I^{x}_{GRA}(i, d) - I^{x}_{GRA}(i, d)_{AVE}\right)^{2} + \left(I^{y}_{GRA}(i, d) - I^{y}_{GRA}(i, d)_{AVE}\right)^{2}}$,  (7)

$I_{GO}(i, d) = \arctan\left(\frac{I^{x}_{GRA}(i, d)}{I^{y}_{GRA}(i, d)}\right)$,  (8)
where $I^{x}_{GRA}(i, d)$ and $I^{y}_{GRA}(i, d)$ are stable derivatives estimated by directionally filtering each STS along two orthogonal orientations, and $I^{x}_{GRA}(i, d)_{AVE}$ and $I^{y}_{GRA}(i, d)_{AVE}$ are the local average estimates of the derivatives, computed as

$I^{x}_{GRA}(i, d)_{AVE} = I^{x}_{GRA}(i, d) \otimes \varphi_{x}$,  (9)

$I^{y}_{GRA}(i, d)_{AVE} = I^{y}_{GRA}(i, d) \otimes \varphi_{y}$,  (10)
where 𝜑𝑥 and 𝜑𝑦 are 3x3 average smoothing filters which we use to calculate the local means. Exemplar RM and GO maps computed on an STS along all three axes are shown in Fig. 5. The STSs along the H and W axes are highly irregular and rapidly varying, but efficiently capture local space-time variations.
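A minimal sketch of (7)-(10) follows; the derivative kernels and the 3x3 averaging are our own placeholder choices, since the paper only specifies that directional filtering and 3x3 local means are used:

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

def rm_go_maps(sts):
    """Relative gradient magnitude (RM) and gradient orientation (GO) maps of a
    slice, following Eqs. (7)-(10); the derivative kernels are placeholder choices."""
    dx = np.array([[-1.0, 0.0, 1.0]]) / 2.0       # horizontal derivative kernel
    dy = dx.T                                     # vertical derivative kernel
    gx = convolve(sts, dx, mode='nearest')        # I_GRA^x
    gy = convolve(sts, dy, mode='nearest')        # I_GRA^y
    gx_ave = uniform_filter(gx, size=3)           # 3x3 local mean, Eq. (9)
    gy_ave = uniform_filter(gy, size=3)           # 3x3 local mean, Eq. (10)
    rm = np.sqrt((gx - gx_ave) ** 2 + (gy - gy_ave) ** 2)   # Eq. (7)
    go = np.arctan2(gx, gy)                       # arctan(gx / gy), Eq. (8)
    return rm, go
```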
Fig. 5. Relative gradient magnitude and gradient orientation maps along the STS (a) time, (b) height and (c) width axes.
Unlike images, distortions are generally scale-variant and different distortion scales can impact the visual experience of quality in different ways [50], [59]. To access this scale-dependent information, we deploy a Laplacian pyramid, as depicted in Fig. 6. We define two additional feature maps in this way: a low-pass Gaussian-filtered map, and the first layer of the Laplacian pyramid, to separate the low and high frequency STS content. The filtered Gaussian map is:

$I_{GAU}(i, d) = I_{STS}(i, d) \otimes f_{g}$,  (11)
where 𝑓𝑔 is a 5x5 Gaussian kernel having space constant 1.667. Down-sampling the filtered Gaussian map by a factor of two generates the second layer of the Gaussian pyramid, as follows:
$I_{DOWN}(i, d) = DOWN(I_{GAU}(i, d))$.  (12)
The down-sampling operation is then followed by nearest-neighbor interpolation:

$I_{UP}(i, d) = UP(I_{DOWN}(i, d))$.  (13)
Finally, the Laplacian-compensated map (the first layer of the Laplacian pyramid) is calculated as:

$I_{LAP}(i, d) = I_{STS}(i, d) - I_{UP}(i, d)$.  (14)
The RM maps, GO maps, Gaussian-filtered maps and Laplacian-compensated maps enhance the STS content representation along all three dimensions. These, along with the original STS and its spatial-temporal complemented representation, constitute the (reference and test) map pairs that are input to the IQA engine for comparison.
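The maps in (11)-(14) can be sketched as follows, under the stated assumptions of a Gaussian with space constant 1.667 truncated to roughly a 5x5 support, factor-of-two decimation, and nearest-neighbor upsampling (our illustration, not the released code):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def gaussian_laplacian_maps(sts, sigma=1.667):
    """Gaussian-filtered map and first-layer Laplacian map of a slice, Eqs. (11)-(14)."""
    i_gau = gaussian_filter(sts, sigma=sigma, truncate=1.2)  # ~5x5 support, Eq. (11)
    i_down = i_gau[::2, ::2]                                 # factor-of-two decimation, Eq. (12)
    i_up = zoom(i_down, 2, order=0)                          # nearest-neighbor interpolation, Eq. (13)
    i_up = i_up[:sts.shape[0], :sts.shape[1]]                # crop to the original size if needed
    i_lap = sts - i_up                                       # Laplacian-compensated map, Eq. (14)
    return i_gau, i_lap
```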
Fig. 6. Process of computing the first layer of the Laplacian pyramid. G0 and G1 are the first and second layers of the Gaussian pyramid.
C. IQA Analysis of Paired Maps
During the past two decades, progress in the image quality assessment domain has been substantial, especially on the full-reference problem. Yet progress on the video quality assessment problem has proved to be more challenging. The problem is complicated by the great variety of spatial and temporal information that directly affects perceptual video quality. Here we apply
successful still picture (IQA) algorithms to the VQA problem, using the multiple STS maps just described. In this way, temporal video behavior over both short durations and longer time scales can be accounted for, without any motion computations, by simply applying an efficient IQA algorithm on each reference-distorted STS map pair. Thus, we compute a distortion index on each map:

$P_{m}(i, d) = IQA\left(I^{ref}_{m}(i, d), I^{dis}_{m}(i, d)\right)$,  (15)
where superscripts ref and dis refer to reference and distorted map sequences, IQA denotes an arbitrary full-reference IQA algorithm, and the subscript m indicates the type of map. Note that the edge-enhanced map and frame difference map are only computed on STSs along the T (temporal) axis. This collection of distortion maps $P_{m}(i, d)$ can subsequently be combined into a measure of overall video quality. However, one additional step is applied to improve efficiency.

D. Subsampled Configuration
Videos contain significant space-time redundancies [60], whether they are distorted or not. Towards even more efficient
computation, we consider sub-sampling the STSs prior to forming the feature maps.
Fig. 7. Scatter plots using all of the STSs to predict quality versus using only the odd-indexed STSs to predict quality, using the STS multi-map model and an IQA algorithm defined as (a) PSNR, and (b) SSIM.
Hence, suppose that we only compute and process the odd-numbered STSs, i.e. subsample each dimension by a factor of two. For the purpose of demonstration, we use the popular PSNR and SSIM algorithms as the IQA algorithm, respectively. Figure 7 shows scatter plots of using all of the STSs to predict quality versus using only the odd-indexed STSs to predict quality with the
STS multi-map models, along with the best-fit logistic curves, when the respective models were applied to the entire corpus of videos in the LIVE Video Quality database using the pooling strategy described in the next subsection. The horizontal axis represents the use of all of the STSs to predict quality, while the vertical axis represents using only the odd-indexed STSs to predict quality. It may be observed that using both PSNR and SSIM yielded tight scatter distributions along the diagonal, meaning that the
difference between using all of the STSs versus just the odd-indexed STSs was small. Overall, the results imply that subsampling by two does not adversely affect performance (see also [61]).

E. Learning Based Pooling Strategy
At this point, the system has delivered reference-distorted pair IQA maps that can then be pooled to provide a summary video quality score. For simplicity, we apply mean pooling on each IQA map, yielding a single score for each map:

$S_{m}(d) = \sum_{i=1}^{N} P_{m}(i, d) / N$.  (16)
The map scores over all indices m and STS dimensions d ∈ {T, H, W} can then be concatenated into a single 1-D vector S. Since the elements of S will fall into different ranges, simple pooling strategies like mean, max or min-pooling will lead to poor results. Since there is no simple way to model or understand the mapping from the elements of S to perceived video quality, we use a learning based pooling strategy that automatically determines the optimal contribution of each element in S to the final video quality prediction [62]. Let this final prediction be expressed as
$Q = \theta^{T} S$,  (17)
where $\theta$ is a learned weight vector, and Q is the final video quality prediction. We deploy a simple, fully-connected feed-forward neural network with a single hidden layer and no activation function to map each vector S to a single DMOS prediction. We used the Adadelta optimizer [63] to train the network to minimize the loss function

$LOSS = \sqrt{Q_{g}^{2} - Q_{p}^{2}}$,  (18)
where $Q_{g}$ and $Q_{p}$ are the ground truth DMOS and predicted DMOS, respectively. Due to the simplicity of this model, overfitting is unlikely to occur. In this way, S is mapped to a single video quality prediction.
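A compact way to prototype the pooling stage described above is a single linear layer trained with Adadelta; the sketch below is ours, uses randomly generated stand-in data for the pooled map scores S and the DMOS targets, and substitutes a plain mean absolute error for the loss in (18):

```python
import torch
import torch.nn as nn

# S: per-video feature vectors (one mean-pooled IQA score per map and per STS
# dimension, Eq. (16)); dmos: ground-truth DMOS values. Shapes are hypothetical.
S = torch.randn(120, 16)            # 120 training videos, 16 pooled map scores each
dmos = torch.rand(120, 1) * 100.0

model = nn.Linear(S.shape[1], 1, bias=False)    # Q = theta^T S, Eq. (17)
optimizer = torch.optim.Adadelta(model.parameters())

for epoch in range(2000):
    optimizer.zero_grad()
    pred = model(S)
    loss = (pred - dmos).abs().mean()           # stand-in for the loss in Eq. (18)
    loss.backward()
    optimizer.step()

print("learned weight vector theta:", model.weight.data.squeeze())
```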
III. EXPERIMENTAL RESULTS

We evaluated the performance of our proposed framework against several state-of-the-art video quality assessment algorithms on three publicly available VQA databases. These databases are the LIVE VQA database [1], the IVP VQA database [64] and the CSIQ VQA database [65]. All videos of the selected databases are in YUV420 format. We list basic information regarding these databases in Table I, including the number of reference videos, distorted videos, distortion types and resolutions in each.
We utilized three widely-used IQA models in the experiments to compare STS map pairs: PSNR, SSIM [20] and VIF [66]. We will refer to these as STS-PSNR, STS-SSIM and STS-VIF, respectively. Since each new "STS" VQA model uses a learning based pooling strategy that maps difference scores to quality predictions through training, we randomly divided the database into training
and testing subsets in a ratio of 4:1 but without content overlap, using the training subset to train the prediction models, then running the VQA model on the test subset to evaluate its performance. This training-test procedure was repeated 1000 times and the median performance evaluation metrics were used as the final model performance evaluation. We compared the three "STS" VQA models against PSNR, SSIM, VIF (applied on video frames and mean-pooled over time) as well as two other top-performing VQA algorithms: STRRED [26] and ViS3 [34]. STRRED and ViS3 were applied using the original settings used by their authors. The
software releases of all the existing compared algorithms are publicly available.
Several metrics are commonly used to evaluate the performance of IQA/VQA models, including the Spearman Rank Ordered Correlation Coefficient (SRCC), Pearson's Linear Correlation Coefficient (PLCC), the Kendall Rank Order Correlation Coefficient (KRCC), the Root Mean Squared Error (RMSE), the Outlier Ratio (OR) and the Perceptually Weighted Rank Correlation (PWRC) [67]. We
evaluated performance using the four most widely-used standard metrics: SRCC, PLCC, KRCC and RMSE. SRCC is a measure of (monotonic) rank correlation, while PLCC measures the linear correlation between two variables. KRCC is a statistic used to measure the ordinal association between two variables. It is a measure of rank correlation similar to SRCC, but it does not rely on any assumptions regarding the two variables’ distributions. The RMSE reflects the degree of deviation between the two variables. These four commonly used metrics comprehensively evaluate an algorithm’s performance. Higher SRCC, KRCC, PLCC and lower RMSE values indicate better consistency with human judgements. We highlighted the best performing algorithms in bold type.
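For completeness, these four criteria can be computed with standard SciPy routines; the sketch below is generic and omits the logistic fitting that is customarily applied before computing PLCC and RMSE:

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr, kendalltau

def vqa_metrics(dmos, pred):
    """SRCC, PLCC, KRCC and RMSE between ground-truth DMOS and predictions.
    PLCC and RMSE are computed directly on the raw predictions for brevity."""
    srcc = spearmanr(dmos, pred).correlation
    plcc = pearsonr(dmos, pred)[0]
    krcc = kendalltau(dmos, pred).correlation
    rmse = float(np.sqrt(np.mean((np.asarray(dmos) - np.asarray(pred)) ** 2)))
    return srcc, plcc, krcc, rmse
```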
TABLE I
PARAMETERS OF THE THREE VQA DATABASES

Database | Reference Videos | Distorted Videos | Distortion Types | Resolution
LIVE     | 10               | 150              | 4                | 432x768
IVP      | 10               | 128              | 4                | 1088x1920
CSIQ     | 12               | 216              | 6                | 480x832
A. Performance on Three Databases
We conducted the validation experiments on the LIVE, IVP and CSIQ video databases. The performances of our model and all of the selected comparison algorithms are tabulated in Table II. All four evaluation metrics were used. From Table II, it may be seen that a significant performance improvement is obtained in terms of all four comparison metrics, by STS-PSNR, STS-SSIM and STS-VIF, on all three databases. In particular, STS-PSNR obtained the best performance on both the LIVE and IVP databases by a wide margin, while STS-VIF yielded the best performance on the CSIQ database. Our proposed
framework uses an IQA method to compute reference-distorted pair differences between space-time maps, including both well-structured images and feature maps. Since these three simple IQA methods impose few prior assumptions on the data distribution, they are applicable to both images and feature maps. Related experiments in [68] also support this idea. Hence, by deploying the proper weight vector through training, the performance of an IQA method in assessing video quality can be significantly improved.
TABLE II
PERFORMANCE OF COMPARED VQA MODELS ON THREE DATABASES

Database | Metric | PSNR | SSIM | VIF | STRRED | ViS3 | STS-PSNR | STS-SSIM | STS-VIF
LIVE | SRCC | 0.6752 | 0.7126 | 0.7099 | 0.8295 | 0.8148 | 0.8936 | 0.8654 | 0.8534
LIVE | PLCC | 0.6267 | 0.8985 | 0.8859 | 0.7985 | 0.7861 | 0.9479 | 0.9336 | 0.9304
LIVE | KRCC | 0.5030 | 0.5313 | 0.5267 | 0.6754 | 0.6404 | 0.7289 | 0.6872 | 0.6665
LIVE | RMSE | 12.8588 | 7.3064 | 7.4890 | 9.9634 | 10.1640 | 5.3215 | 5.8530 | 6.1219
IVP | SRCC | 0.7401 | 0.6854 | 0.5869 | 0.7045 | 0.8666 | 0.8987 | 0.8353 | 0.8211
IVP | PLCC | 0.7305 | 0.7192 | 0.5915 | 0.7175 | 0.8815 | 0.9122 | 0.8477 | 0.8325
IVP | KRCC | 0.5802 | 0.5494 | 0.4649 | 0.5484 | 0.6959 | 0.7500 | 0.6498 | 0.6498
IVP | RMSE | 0.7183 | 0.7283 | 0.8447 | 0.7495 | 0.4999 | 0.4323 | 0.5823 | 0.6250
CSIQ | SRCC | 0.6693 | 0.6623 | 0.6905 | 0.8490 | 0.8455 | 0.8568 | 0.8223 | 0.8752
CSIQ | PLCC | 0.7340 | 0.6982 | 0.7286 | 0.8155 | 0.8267 | 0.8799 | 0.8481 | 0.8862
CSIQ | KRCC | 0.4900 | 0.5085 | 0.5295 | 0.6638 | 0.6819 | 0.6644 | 0.6297 | 0.6931
CSIQ | RMSE | 13.1714 | 14.0826 | 13.5211 | 11.5000 | 10.9913 | 9.6976 | 10.4162 | 9.4772
B. Performance on Each Distortion Category
We further performed the evaluation experiment on each distortion category from all the selected video databases (the LIVE, IVP and CSIQ databases). The LIVE video database contains four distortion categories: wireless distortions (WL), IP distortions (IP), H.264 compression (H264) and MPEG-2 compression (MPEG2).
The IVP video database contains direct wavelet
compression (DW), packet loss (IP), H264 and MPEG compression (MPEG). There are six distortion categories in the CSIQ video database: H264, packet loss in wireless network (WLPL), motion JPEG (MJPEG), snow coding (SNOW), white noise (AWGN) and HEVC compression (HEVC). The comparison results in terms of SRCC are listed in Table III. From Tables II and III, it can be observed that since the amount of data contained in each video database is rather small, the performance of the algorithms on each distortion type is less stable than on the overall video databases. The amount of data on each category may explain the varying model performance on several distortion categories. Clearly, the performance of STS-PSNR is
better than that of STS-SSIM. That may be because the proposed framework already extracts structural information, which may be partially redundant with respect to the information that SSIM extracts. Nonetheless, the results in Table III indicate that the STS framework yielded the best performance on most of the distortion types, especially STS-PSNR. Moreover, the performance improvement provided by STS-VIF on IP distortions in the IVP database was very significant.
TABLE III
MEDIAN SRCC OF COMPARED VQA MODELS ON EACH DISTORTION CATEGORY

Database | Distortion | PSNR | SSIM | VIF | STRRED | ViS3 | STS-PSNR | STS-SSIM | STS-VIF
LIVE | WL | 0.6905 | 0.7143 | 0.6667 | 0.7857 | 0.7143 | 0.8842 | 0.8619 | 0.8103
LIVE | IP | 0.6571 | 0.6000 | 0.7143 | 0.6000 | 0.6571 | 0.7290 | 0.7539 | 0.7495
LIVE | H264 | 0.7381 | 0.8905 | 0.8333 | 0.8571 | 0.8810 | 0.8961 | 0.8576 | 0.8552
LIVE | MPEG2 | 0.6190 | 0.7785 | 0.7381 | 0.8577 | 0.8571 | 0.9097 | 0.8649 | 0.8313
IVP | DW | 0.8820 | 0.8832 | 0.8710 | 0.8700 | 0.9350 | 0.9420 | 0.9008 | 0.8972
IVP | IP | 0.8741 | 0.5202 | 0.4490 | 0.7143 | 0.8333 | 0.8692 | 0.8139 | 0.8710
IVP | H264 | 0.8333 | 0.8096 | 0.8810 | 0.9048 | 0.9090 | 0.9334 | 0.8701 | 0.8999
IVP | MPEG | 0.8867 | 0.7143 | 0.8286 | 0.8306 | 0.8757 | 0.8740 | 0.8207 | 0.8402
CSIQ | H264 | 0.8857 | 0.8801 | 0.9021 | 0.9331 | 0.9400 | 0.9537 | 0.9152 | 0.9282
CSIQ | WLPL | 0.8611 | 0.9107 | 0.8711 | 0.8800 | 0.8701 | 0.8784 | 0.8870 | 0.8934
CSIQ | MJPEG | 0.8286 | 0.9304 | 0.9022 | 0.8901 | 0.9413 | 0.8619 | 0.8494 | 0.9444
CSIQ | SNOW | 0.8771 | 0.8301 | 0.9108 | 0.9203 | 0.8806 | 0.8965 | 0.8543 | 0.8915
CSIQ | AWGN | 0.9429 | 0.8022 | 0.9010 | 0.8992 | 0.8713 | 0.8982 | 0.8976 | 0.8917
CSIQ | HEVC | 0.8840 | 0.8500 | 0.8542 | 0.9137 | 0.9019 | 0.8901 | 0.9073 | 0.9022

C. Statistical Significance
We conducted statistical tests [69], [70] on the LIVE and CSIQ video quality databases to determine the statistical
significance of our model. Fig. 8 shows box plots of SRCC for all compared models across 1000 train-test trials (for methods that do not require training, we simply conducted the test procedure). It may be seen that the three "STS" VQA models' stability and accuracy greatly improve upon those of PSNR, SSIM and VIF. We also conducted the T-test [71] on the SRCC values of the models. The null hypothesis is that there is no difference between the two methods at the 95% confidence level. The indicator "1" or "-1" denotes that the row model is statistically superior, or statistically inferior to the column one, while the value "0" means they are statistically equivalent. As shown in Fig. 9, the three "STS" VQA models were superior, as indicated in Fig. 8. These
results show that the “STS” VQA models are stable, and statistically better than the compared algorithms.
Fig. 8. Box plots of SRCC across 1000 train-test runs on the (a) LIVE and (b) CSIQ databases.
Fig. 9. Results of T-test between SRCC values of all compared VQA models on the (a) LIVE and (b) CSIQ databases.
D. Advantage of the SI13 Edge-Enhanced Map
To further prove the efficacy of the SI13 edge-enhanced map, we applied the Sobel and SI13 filters on a pair of reference-distorted slices of the videos from the LIVE database, yielding different edge-enhanced maps. We further integrated
these into the STS-PSNR. The results are shown in Fig. 10, where different colors represent different distortion types. Most of the dots are distributed above the diagonal when using a Sobel edge-enhanced map. This phenomenon implies that when the Sobel edge-enhanced map is integrated into STS-PSNR, the model tends to generate a relatively lower quality score than the ground truth, which may be because the Sobel filter detects finer details than the human visual system. When SI13 is integrated into STS-PSNR, the distribution tends to be more balanced. These results again demonstrate that the SI13 filter extracts perceptually useful distortion information.
We also tested the performance of the STS framework on the LIVE database when using the Sobel filter, by adapting it into STS-PSNR instead of the SI13 filter. The comparison results in terms of SRCC are shown in Fig. 11. It can be seen that the
SI13-filtered map yields better correlations against human subjectivity than its Sobel-filtered counterpart.
Fig. 10. Scatter plots of DMOS versus predicted scores of STS-PSNR using (a) the Sobel-filter enhanced map and (b) the SI13-filter enhanced map, respectively.
Fig. 11. Performance comparison (SRCC) of STS-PSNR on the LIVE database when using the Sobel and SI13 filters, respectively (0.8612 for Sobel, 0.8936 for SI13).
E. Contributions of Gaussian-filtered and Laplacian-compensated Maps
In order to clarify the contribution of the Laplacian pyramid in compensating for the Gaussian filter's loss of high frequency details, we followed the experimental design of Subsection III-D, to separately investigate the relative contributions of the
Gaussian-filtered map, the Laplacian-compensated map and both together integrated into the STS-PSNR. The results are shown in Figs. 12 and 13.
From Fig. 12, it may be observed that the scatter plots tend to be distributed below the diagonal when using the Gaussian-filtered map or the Laplacian-compensated map. Both maps contribute to performance. These results are also included in Fig. 13, which
compares the different versions of STS-PSNR in terms of SRCC. Evidently, combining both the Gaussian-filtered map and the Laplacian-compensated map into STS-PSNR provides a performance improvement relative to using either in isolation.
Fig. 12. Scatter plots of DMOS versus predicted scores of STS-PSNR on the LIVE database using (a) the Gaussian-filtered map, (b) the Laplacian-compensated map and (c) both the Gaussian-filtered and Laplacian-compensated maps, respectively.
Fig. 13. Comparison of variations of STS-PSNR on the LIVE database when using the Gaussian-filtered map, the Laplacian-compensated map and both of them, respectively.
F. Algorithm Stability
It is also worthwhile to investigate whether a model is sensitive to outliers. We conducted a validation experiment on three frame based algorithms (PSNR, SSIM, and VIF) and their STS counterparts (STS-PSNR, STS-SSIM, and STS-VIF) by applying them on the LIVE video database. Since different databases’ ground truth perceptual scores fall into different ranges, and since different
VQA algorithms generally output different ranges of values, we apply a normalization procedure on each:

$\hat{Q} = (Q - Q_{min}) / (Q_{max} - Q_{min})$,  (19)
where $Q_{max}$ and $Q_{min}$ are the maximum and minimum values of the ground truth/predicted score vector. After the normalization stage, all the values fall within the range [0, 1].
The scatter plots of DMOS versus predicted scores of the three frame based models and their STS versions are shown in Fig. 14, where different colors represent different distortion types. It may be observed that the three frame based methods’ distributions are highly dispersed, especially the PSNR model. The three integrated algorithms (STS-PSNR, STS-SSIM, STS-VIF) yield much more tightly distributed results. These results further confirm the efficacy of our proposed model.
Fig. 14. Scatter plots of DMOS versus predicted scores of (a) PSNR, (b) SSIM, (c) VIF, (d) STS-PSNR, (e) STS-SSIM, and (f) STS-VIF.
G. Contributions of Individual Maps

Next we investigated the relative contributions of the individual constituent maps. We conducted an experiment on the LIVE VQA database, whereby we tested the performances of seven versions of STS-PSNR, each using only one of the seven feature maps. The results are listed in Table IV, where "Original STS Map" indicates the map generated using (1) and "All" means the combination of all the above maps, i.e. the STS-PSNR. We found that using each map yielded a rather high SRCC, with the Gaussian-filtered map
giving the best result. Although using the relative gradient map or the Laplacian-compensated map delivered a relatively low result, they both still moderately enhanced the STS content representation. Using all seven STS maps yielded a gain in performance of about 0.15 in SRCC relative to using the original STS map alone. Clearly, each map strongly contributes to the performance of
STS-PSNR.
TABLE IV
MEDIAN SRCC OF DIFFERENT MAPS ON THE LIVE DATABASE

Original STS Map          0.7418
Edge-enhanced Map         0.7921
Frame Difference Map      0.6535
Relative Gradient Map     0.4130
Gradient Orientation Map  0.6421
Gaussian-filtered Map     0.8110
Laplacian-compensated Map 0.4016
All                       0.8936

H. Contribution of Learning Based Pooling
The proposed “STS” VQA models use a learning based pooling strategy to transform map difference scores into a final video quality score. In order to verify the effectiveness of using this strategy to construct “STS” VQA models, we further compared STS-PSNR with the model using other pooling methods on the LIVE database. Specifically, the learning based pooling was replaced with the max, min, mean, P20% (the average of the lowest 20%) pooling, and a support vector regression (SVR) in the STS-PSNR algorithm. The results are listed in Table V. We found that using mean or min pooling performed better than using max pooling, and that P20% pooling achieved better performance than max, min or mean pooling, and that using the SVR performed better than the other compared pooling methods except the learning based model. However, overall, learning based pooling delivered a significant performance improvement.
TABLE V
MEDIAN SRCC OF THE STS-PSNR USING DIFFERENT POOLING STRATEGIES ON THE LIVE DATABASE

Max            0.4306
Min            0.6464
Mean           0.7896
P20%           0.7912
SVR            0.8389
Learning based 0.8936
The essence of learning based pooling is to learn a weight vector that can balance different maps’ contributions to the overall video quality predictions. In order to better understand what is learned in this training process, we also extracted the learned weights of the maps and show them in Fig. 15. It may be seen that the original STS and the Gaussian map were assigned a higher
weight, while the frame difference and Laplacian maps also played important roles in the mapping. Although the edge-enhanced map achieved high performance in the single-map experiment (see Table IV), its role was reduced when combined with other maps.
Fig. 15. Comparison of weights learned for different maps of STS-PSNR on the LIVE database.
I. Computational Complexity
We also analyzed the algorithm complexity of the high-performance STRRED, ViS3 and STS-PSNR algorithms on ten video
sequences of resolution 432x768 pixels, frame rate 25 fps, and ten-second durations from the LIVE VQA database, using a Dell desktop computer with a quad-core i7 CPU, 3.4 GHz and 12 GB RAM. The mean computation time of a model was used as the measure of computational complexity. The results are shown in Table VI. STS-PSNR is faster than the other two VQA algorithms, though somewhat slower than PSNR. Thus, STS-PSNR may be viewed as a very fast VQA algorithm.
TABLE VI
COMPARISON OF ALGORITHM COMPLEXITY

Algorithm   Time (s)
PSNR        3.7153
STRRED      122.9140
ViS3        225.1710
STS-PSNR    31.1000
IV. CONCLUSION
We have proposed a new VQA algorithm based on applying an IQA algorithm to multiple maps generated from spatial temporal slices of a video. The new model, which achieves its best performance in the form of a seven-map STS-PSNR, significantly outperforms prior models. The STS concept operates in a different space than where the usual principles apply, and much of its predictive power likely comes from the composite processing of spatial and temporal information. Nevertheless, there are still two issues that are worth addressing in the future. Firstly, the new STS family of VQA models does have one major drawback: it is
defined on entire videos. This makes them less useful than short-time or frame-based VQA algorithms for assessing or controlling streaming or real-time video quality. Towards remediating this, we are studying the efficiency of using “slicelets,” defined over shorter space-time intervals. Secondly, our proposed FR VQA framework cannot be directly transferred to create NR VQA models, because NR IQA methods require stronger prior hypotheses on the input data distribution, which may make them unsuitable for analyzing space-time slices. But it may be possible to design NR VQA models by considering the statistical characteristics of each map separately.
REFERENCES
[1] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, "Study of subjective and objective quality assessment of video," IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1427-1441, Jun. 2010.
[2] Y. Chen, K. Wu, and Q. Zhang, "From QoS to QoE: A tutorial on video quality assessment," IEEE Communications Surveys & Tutorials, vol. 17, no. 2, pp. 1126-1165, Jan. 2015.
[3] T. J. Liu, Y. C. Lin, W. Lin, and C. C. Kou, "Visual quality assessment: recent developments, coding applications and future trends," APSIPA Transactions on Signal and Information Processing, vol. 2, article no. e4, pp. 1-20, Jul. 2013.
[4] A. Adas, "Using adaptive linear prediction to support real-time VBR video under RCBR network service model," IEEE-ACM Transactions on Networking, vol. 6, no. 5, pp. 636-644, Oct. 1998.
[5] Y. Liu, Z. Li, and Y. Soh, "A novel rate control scheme for low delay video communication of H.264/AVC standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 1, pp. 68-78, Jan. 2007.
[6] S. Chong, S. Li, and J. Ghosh, "Predictive dynamic bandwidth allocation for efficient transport of real-time VBR video over ATM," IEEE Journal on Selected Areas in Communications, vol. 13, no. 1, pp. 12-23, Jan. 1995.
[7] F. Zhang and D. R. Bull, "A perception-based hybrid model for video quality assessment," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 6, pp. 1017-1028, Jun. 2016.
[8] D. Z. Rodríguez, R. Rosa, E. Costa, J. Abrahão, and G. Bressan, "Video quality assessment in video streaming services considering user preference for video content," IEEE Transactions on Consumer Electronics, vol. 60, no. 3, pp. 436-444, Aug. 2014.
[9] M. T. Vega, D. C. Mocanu, J. Famaey, S. Stavrou, and A. Liotta, "Deep learning for quality assessment in live video streaming," IEEE Signal Processing Letters, vol. 24, no. 6, pp. 736-740, Jun. 2017.
[10] A. Raake, M. N. Garcia, W. Robitza, P. List, S. Goring, and B. Feiten, "A bitstream-based, scalable video-quality model for HTTP adaptive streaming: ITU-T P.1203.1," in IEEE International Conference on Quality of Multimedia Experience, May 2017, pp. 1-6.
[11] L. Xu, S. Li, K. N. Ngan, and L. Ma, "Consistent visual quality control in video coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 6, pp. 975-989, Jan. 2013.
[12] L. Xu, W. Lin, L. Ma, Y. Zhang, et al., "Free-energy principle inspired video quality metric and its use in video coding," IEEE Transactions on Multimedia, vol. 18, no. 4, pp. 590-602, Feb. 2016.
[13] X. Lin, H. Ma, L. Lou, and Y. Chen, "No-reference video quality assessment in the compressed domain," IEEE Transactions on Consumer Electronics, vol. 58, no. 2, pp. 505-512, May 2012.
[14] C. Bampis, Z. Li, L. Katsavounidis, and A. C. Bovik, "Recurrent and dynamic models for predicting streaming video quality of experience," IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 2216-3331, Jul. 2018.
[15] K. Brunnstrom et al., "Qualinet white paper on definitions of quality of experience," Qualinet, 2013. [Online]. Available: http://www.qualinet.eu/images/stories/QoE_whitepaper_v1.2.pdf.
[16] A. F. Silva, M. Farias, and J. Redi, "Perceptual annoyance models for video with combinations of spatial and temporal artifacts," IEEE Transactions on Multimedia, vol. 18, no. 12, pp. 2446-2456, Dec. 2016.
[17] J. Park, K. Seshadrinathan, S. Lee, and A. C. Bovik, "Video quality pooling adaptive to perceptual distortion severity," IEEE Transactions on Image Processing, vol. 22, no. 2, pp. 610-620, Feb. 2013.
[18] D. M. Chandler, "Seven challenges in image quality assessment: past, present, and future research," ISRN Signal Processing, Feb. 2013.
[19] Z. Wang, L. Lu, and A. C. Bovik, "Video quality assessment based on structural distortion measurement," Signal Processing: Image Communication, vol. 19, no. 2, pp. 121-132, Feb. 2004.
[20] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, Apr. 2004.
[21] F. Yang, S. Wan, Y. Cheng, and H. R. Wu, "A novel objective no-reference metric for digital video quality assessment," IEEE Signal Processing Letters, vol. 12, no. 10, pp. 685-688, Oct. 2005.
[22] K. Seshadrinathan and A. C. Bovik, "Motion tuned spatio-temporal quality assessment of natural videos," IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 335-350, Feb. 2010.
[23] K. Manasa and S. Channappayya, "An optical flow-based full reference video quality assessment algorithm," IEEE Transactions on Image Processing, vol. 25, no. 6, pp. 2480-2492, Jun. 2016.
[24] Y. Wang, T. Jiang, S. W. Ma, and W. Gao, “Novel spatio-temporal structural information based video quality metric,” IEEE Transactions on Circuits and System for Video Technology, vol. 22, no. 7, pp. 989-998, Jun. 2012.
[25] Y. Fang, Z. Wang, W. Lin, and Z. Fang, "Video saliency incorporating spatiotemporal cues and uncertainty weighting," IEEE Transactions on Image Processing, vol. 23, no. 9, pp. 3910-3921, Sep. 2014.
[26] R. Soundararajan and A.C. Bovik, “Video quality assessment by reduced reference spatio-temporal entropic differencing,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 4, pp. 684–694, Apr. 2012.
[27] A. Ninassi, O. L. Meur, P. L. Callet, and D. Barba, "Considering temporal variations of spatial visual distortions in video quality assessment," IEEE Journal of Selected Topics in Signal Processing, vol. 3, no. 2, pp. 253-265, Apr. 2009.
[28] K. Seshadrinathan and A. C. Bovik, “Temporal hysteresis model of time varying subjective video quality,” In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2011, pp. 1153-1156.
[29] Y. Li, L. M. Po, C. H. Cheung, X. Xu, L. Feng, F. Yuan, and K. W. Cheung, “No-reference video quality assessment with 3D shearlet transform and convolutional neural networks,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 6, pp. 1044-1057, Jun. 2016. [30] X. Li, Q. Gou, X. Lu, “Spatiotemporal statistics for video quality assessment,” IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3329-3342, Jul. 2016.
[31] W. Lu, R. He, J. Yang, C. Jia, and X. Gao, “A spatiotemporal model of video quality assessment via 3D gradient differencing,” Information Science, vol. 478, pp. 141-151, Apr. 2019.
[32] C. W. Ngo, T. C. Pong, and H. J. Zhang, “Motion analysis and segmentation through spatio-temporal slices processing,” IEEE Transactions on Image Processing, vol. 12, no. 3, pp. 341-355, Mar. 2003.
[33] Z. Wang and Q. Li, "Video quality assessment using a statistical model of human visual speed perception," Journal of the Optical Society of America A: Optics and Image Science, and Vision, vol. 24, no. 12, pp. B61-B69, Dec. 2007.
[34] P. V. Vu and D. M. Chandler, "ViS3: an algorithm for video quality assessment via analysis of spatial and spatiotemporal slices," Journal of Electronic Imaging, vol. 23, no. 1, article no. 013016, Jan. 2014.
[35] P. Yan and X. Mou, "Video quality assessment based on motion structure partition similarity of spatiotemporal slice images," Journal of Electronic Imaging, vol. 27, no. 3, article no. 033019, May 2018.
[36] D. A. Migliore, M. Matteucci, and M. Naccari, "A revaluation of differencing frame in fast and robust motion detection," in Proceedings of the 4th ACM International Workshop on Video Surveillance and Sensor Networks, Oct. 2006, pp. 215-218.
[37] M. A. Saad, A. C. Bovik, and C. Charrier, “Blind prediction of natural video quality,” IEEE Transactions on Image Processing, vol. 23, no. 3, pp. 1352-1365, Mar. 2014.
[38] C. W. Niblack, R. Barber, W. Equitz, and M. D. Flickner, “QBIC project: querying images by content, using color, texture, and shape,” In Storage and Retrieval for Image and Video Databases, Apr. 1993, pp. 173-188.
[39] L. Liu, Y. Hua, Q. Zhao, H. Huang, and A. C. Bovik, “Blind image quality assessment by relative gradient statistics and adaboosting neural network,” Signal Processing: Image Communication, vol. 40, no. 1, pp. 1-15, Jan. 2016.
[40] W. Xue, L. Zhang, X. Mou, and A. C. Bovik, “Gradient magnitude similarity deviation: a highly efficient perceptual image quality index,” IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 684-695, Feb. 2014.
[41] A. Liu, W. Lin, and M. Narwaria, “Image quality assessment based on gradient similarity,” IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 1500-1512, Apr. 2012.
[42] Q. Li, W. Lin, and Y. Fang, “No-reference quality assessment for multiply-distorted images in gradient domain,” IEEE Signal Processing Letters, vol. 23, no. 4, pp. 541-545, Apr. 2016.
[43] C. Lee, S. Cho, J. Choe, T. Jeong, W. Ahn, and E. Lee, “Objective video quality assessment,” Optical Engineering, vol. 45, no. 1, article no. 017004, Jan. 2006.
[44] W. Xue, X. Mou, L. Zhang, A. C. Bovik, and X. Feng, “Blind image quality prediction using joint statistics of gradient magnitude and Laplacian features,” IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4850-4862, Nov. 2014.
[45] X. K. Yang, W. S. Ling, Z. K. Lu, E. P. Ong, and S. S. Yao, “Just noticeable distortion model and its applications in video coding,” Signal Processing: Image Communication, vol. 20, no. 7, pp. 662-680, Aug. 2005.
[46] G. Zhai, W. Zhang, X. Yang, W. Lin, and Y. Xu, “Efficient image deblocking based on postfiltering in shifted windows,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 1, pp. 122-126, Jan. 2008.
[47] K. Gu, G. Zhai, X. Yang, and W. Zhang, “Using free energy principle for blind image quality assessment,” IEEE Transactions on Multimedia, vol. 17, no. 1, pp. 50-63, Jan. 2015.
[48] K. Gu, G. Zhai, W. Lin, X. Yang, and W. Zhang, “No-reference image sharpness assessment in autoregressive parameter space,” IEEE Transactions on Image Processing, vol. 24, no. 10, pp. 3218-3231, Oct. 2015.
[49] P. J. Burt and E. H. Adelson, “The Laplacian pyramid as a compact image code,” IEEE Transactions on Communications, vol. 31, no. 4, pp. 532-540, Apr. 1983.
[50] K. Zhu, K. Hirakawa, V. Asari, and D. Saupe, “A no-reference video quality assessment based on Laplacian pyramids,” In IEEE International Conference on Image Processing, Sep. 2013, pp. 49-53.
[51] S. Ye, K. Su, and C. Xiao, “Video quality assessment based on edge structural similarity,” In IEEE International Congress on Image and Signal Processing, Jul. 2008, pp. 445-448.
[52] M. H. Pinson and S. Wolf, “A new standardized method for objectively measuring video quality,” IEEE Transactions on Broadcasting, vol. 50, no. 3, pp. 312-322, Sep. 2004.
[53] ITU-T J.144, Objective perceptual video quality measurement techniques for digital cable television in the presence of a full reference, Recommendation ITU-T J.144, ITU Telecom. Sector of ITU, 2004.
[54] S. Wolf and M. Pinson, (2002, June) Video Quality Measurement Techniques. [Online]. Available: https://www.its.bldrdoc.gov/resources/video-quality-research/guides-and-tutorials/spatial-information-si-filter.aspx
[55] N. Kanopoulos, N. Vasanthavada, and R. L. Baker, “Design of an image edge detection filter using the Sobel operator,” IEEE Journal of Solid-State Circuits, vol. 23, no. 2, pp. 358-367, Apr. 1988.
[56] B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, no. 6583, pp. 607-609, Jun. 1996.
[57] Y. Xue, B. Erkin, and Y. Wang, “A novel no-reference video quality metric for evaluating temporal jerkiness due to frame freezing,” IEEE Transactions on Multimedia, vol. 15, no. 1, pp. 134-139, Jan. 2015.
[58] B. Tao and B. W. Dickinson, “Texture recognition and image retrieval using gradient indexing,” Journal of Visual Communication and Image Representation, vol. 11, no. 3, pp. 372-342, Sep. 2000.
[59] M. J. Wainwright and E. P. Simoncelli, “Scale mixtures of Gaussians and the statistics of natural images,” Advances in Neural Information Processing Systems, Nov. 2000, pp. 855–861.
[60] J. Y. Lin, C. H. Wu, I. Katsavounidis, Z. Li, A. Aaron, and C. C. Kuo, “EVQA: An ensemble-learning-based video quality assessment index,” In IEEE International Conference on Multimedia and Expo Workshops, Jun. 2015, pp. 1-6.
[61] C. Bampis, T. R. Goodall, and A. C. Bovik, “Sampled efficient full-reference quality assessment models,” In Asilomar Conference on Communication, Control, and Computing, Nov. 2016, pp. 561-565.
[62] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504-507, Jul. 2006.
[63] M. Zeiler, “ADADELTA: An adaptive learning rate method,” arXiv preprint arXiv:1212.5701, Dec. 2012.
[64] Image & Video Processing Laboratory, The Chinese University of Hong Kong, “IVP subjective quality video database,” http://ivp.eecuhk.edu.hk/research/database/subjective/index.shtml (20 April 2012).
[65] Laboratory of Computational Perception & Image Quality, Oklahoma State University, “CSIQ video database,” 2013, http://vision.okstate.edu/csiq/ (15 November 2012).
[66] H. R. Sheikh, A. C. Bovik, and G. de Veciana, “An information fidelity criterion for image quality assessment using natural scene statistics,” IEEE Transactions on Image Processing, vol. 14, no. 12, pp. 2117-2128, Dec. 2005.
[67] Q. Wu, H. Li, F. Meng, and K. N. Ngan, “A perceptually weighted rank correlation indicator for objective image quality assessment,” IEEE Transactions on Image Processing, vol. 27, no. 5, pp. 2499-2513, May 2018.
[68] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “FSIM: A feature similarity index for image quality assessment,” IEEE Transactions on Image Processing, vol. 20, no. 8, pp. 2378-2386, Aug. 2011.
[69] K. Brunnstrom and M. Barkowsky, “Statistical quality of experience analysis: on planning the sample size and statistical significance testing,” Journal of Electronic Imaging, vol. 27, no. 5, article no. 053013, Sep. 2018.
[70] ITU-T P.1401, Statistical analysis, evaluation and reporting guidelines of quality measurements, Recommendation ITU-T P.1401, ITU Telecom. Sector of ITU, 2012.
[71] ITU-T J.343, Hybrid perceptual bitstream models for objective video quality measurements, Recommendation ITU-T J.343, ITU Telecom. Sector of ITU, 2014.
1) A full-reference video quality assessment framework is proposed.
2) The STS representations are analyzed to produce multiple quality-aware maps.
3) A process of space-time subsampling is applied in the STS domain.
4) A learning-based pooling strategy is devised to automatically weight and combine the feature maps to generate a final video quality score.
5) The resulting algorithm delivers significantly elevated performance relative to state-of-the-art video quality models.
Declaration of interests
☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
☐ The authors declare the following financial interests/personal relationships which may be considered as potential competing interests:
Lixiong Liu: Conceptualization, Methodology, Writing - Original draft preparation.
Tianshu Wang: Methodology, Writing - Original draft preparation, Software.
Hua Huang: Conceptualization, Methodology, Writing - Reviewing and Editing.
Alan Conrad Bovik: Visualization, Writing - Reviewing and Editing.