Signal Processing: Image Communication 35 (2015) 46–60
HDR-VQM: An objective quality measure for high dynamic range video
Manish Narwaria, Matthieu Perreira Da Silva, Patrick Le Callet
IRCCyN/IVC Group, University of Nantes, 44306, France
Article history: Received 1 December 2014; Received in revised form 22 April 2015; Accepted 22 April 2015; Available online 5 May 2015

Abstract
High dynamic range (HDR) signals fundamentally differ from the traditional low dynamic range (LDR) ones in that pixels are related (proportional) to the physical luminance in the scene (i.e. scene-referred). For that reason, the existing LDR video quality measurement methods may not be directly used for assessing quality in HDR videos. To address that, we present an objective HDR video quality measure (HDR-VQM) based on signal preprocessing, transformation, and subsequent frequency based decomposition. Video quality is then computed based on a spatio-temporal analysis that relates to human eye fixation behavior during video viewing. Consequently, the proposed method does not involve expensive computations related to explicit motion analysis in the HDR video signal, and is therefore computationally tractable. We also verified its prediction performance on a comprehensive, in-house subjective HDR video database with 90 sequences, and it was found to be better than some of the existing methods in terms of correlation with subjective scores (for both across sequence and per sequence cases). A software implementation of the proposed scheme is also made publicly available for free download and use. © 2015 Elsevier B.V. All rights reserved.
Keywords: High dynamic range (HDR) video quality; Objective quality; Spatio-temporal analysis
Correspondence to: Polytech Nantes, Computer Science Department, La Chantrerie, Rue Christian Pauc, BP 50609, 44306 Nantes Cedex 3, France. Tel.: +33 2 40 68 30 60; fax: +33 2 40 68 32 32.
E-mail addresses: [email protected] (M. Narwaria), [email protected] (M. Perreira Da Silva), [email protected] (P. Le Callet).
http://dx.doi.org/10.1016/j.image.2015.04.009
0923-5965/© 2015 Elsevier B.V. All rights reserved.

1. Introduction

The advent of better technologies in the field of visual signal capture and processing has fueled a paradigm shift in today's multimedia communication systems. As a result, the notion of network-centric quality of service (QoS) in multimedia systems is being extended by relying on the concept of quality of experience (QoE) [1]. In this quest of increasing the immersive video experience and the overall QoE of the end user, newer technologies such as 3D, ultra high definition (UHD) and, more recently, high dynamic
range (HDR) imaging have gained prominence within the multimedia signal processing community. HDR in particular has attracted attention since it in a way revisits the way we capture and display natural scenes. This is motivated by the fact that natural scenes often exhibit large ranges of illumination values. However, such high luminance values often exceed the capabilities of the traditional low dynamic range (LDR) capturing and display devices. Consequently, it is not possible to properly expose the dark and the bright areas simultaneously in one image (or video) during capture. This may lead to over-exposure (saturated pixels that are fully white) and/or underexposure (very dark or noisy pixels as sensor's response falls below its noise threshold). In both cases, visual information is either lost or altered. HDR imaging focuses on minimizing such losses and therefore aims at improving the quality of the displayed pixels by incorporating higher contrast and luminance.
As a result, HDR imaging has attracted attention from both academia and industry, and there has been interest and effort to develop tools/algorithms for HDR video processing [2]. For instance, there have been recent efforts within the Moving Picture Experts Group (MPEG) for extending High Efficiency Video Coding (HEVC) to HDR. Likewise, JPEG has announced extensions that add support for HDR image compression to the original JPEG standard. Despite some work on evaluating quality of HDR images [3,4] and video [5], there is an overall lack of such efforts to quantify and measure the impact of such tools on HDR video quality using both subjective and objective approaches. The issue assumes further significance given that most of the existing objective methods may not be directly applicable for HDR quality estimation [6–8] (note that these studies only deal with HDR images and not video). It is therefore important to develop objective methods for HDR video quality measurement and benchmark their performance against subjective ground truth.

With regards to visual quality measurement, both subjective and objective approaches can be used. The former involves the use of human subjects to judge and rate the quality of the test stimuli. With appropriate laboratory conditions and a sufficiently large subject panel, it remains the most accurate method. The latter quality assessment method employs a computational (mathematical) model to provide estimates of the subjective video quality. While such objective models may not mimic subjective opinions accurately in a general scenario, they can be reasonably effective in specific conditions/applications. Hence, they can be an important tool towards automating the testing and standardization of HDR video processing algorithms such as HDR video compression, post-processing, inverse video tone mapping, etc., especially when subjective tests may not be feasible. In light of this, we present a computationally tractable HDR video quality estimation method based on HDR signal transformation and subsequent analysis of spatio-temporal segments, and also verify its prediction performance based on a test bed of 90 subjectively rated compressed HDR video sequences. To the best of our knowledge, our study is amongst the first few efforts towards the design and verification of an objective quality measurement method for HDR video, and is therefore of interest to the video signal processing community both from subjective and objective quality viewpoints.

2. Background

Humans perceive the outside visual world through the interaction between luminance (measured in candela per square meter, cd/m2) and the eyes. Luminance first passes through the cornea, a transparent membrane. Then it enters the pupil, an aperture that is modified by the iris, a muscular diaphragm. Subsequently, light is refracted by the lens and hits the photoreceptors in the retina. There are two types of photoreceptors: cones and rods. The cones are located mostly in the fovea. They are more sensitive at luminance levels between 10^-2 cd/m2 and 10^8 cd/m2 (referred to as the photopic or daylight vision)
[9]. Further, color vision is due to three types of cones: short, middle and long wavelength cones. The rods, on the other hand, are sensitive at luminance levels between 10^-6 cd/m2 and 10 cd/m2 (scotopic or night vision). The rods are more sensitive than cones but do not provide color vision. Pertaining to the luminance levels found in the real world, direct sunlight at noon can be of the order of 10^7 cd/m2 or more, while a starlit night is in the range of 10^-1 cd/m2. This corresponds to more than 8 orders of magnitude. With regards to human eyes, their dynamic range depends on the time allowed to adjust or adapt to the given luminance levels. Due to the presence of rods and cones, human eyes have a remarkable ability to adjust to varying luminance levels, both dynamically (i.e. instantaneously) and over a period of time (i.e. adaptation time). Given sufficient adaptation time, the dynamic range of human eyes is about 13 orders of magnitude. However, without adaptation, the instantaneous human vision range is smaller: the eyes are capable of dynamically adjusting so that a person can see about 5 orders of magnitude throughout the entire range. Since the typical frequency in video signals does not allow sufficient adaptation time, the dynamic vision range (5 orders of magnitude) is more relevant in the context of this paper as well as HDR video processing in general.

However, typical digital imaging sensors (assuming the typical single exposure setting) and LDR displays are not capable of dealing with such a large dynamic range present in the real world, and most of them (both capturing sensors and displays) can handle up to 3 orders of magnitude. Due to this limitation, the scenes captured and viewed via LDR technologies will have lower contrast (visual details are either saturated or noisy) and a smaller color gamut than what the eyes can perceive. This in turn can decrease the immersive experience quotient of the end-user. HDR imaging technologies therefore aim to overcome the inadequacies of the LDR capture and display technologies via better video signal capture, representation and display, so that the dynamic range of the video can better match the instantaneous range of the eye. In particular, the major distinguishing factor of HDR imaging (in comparison to the traditional LDR one) is its focus on capturing and displaying scenes as natively (i.e. how they appear in the real world) as possible by considering the physical luminance of the scene in question.

Two important points should, however, be mentioned at the very outset. First, it may be emphasized that in HDR imaging one usually deals with proportional (and not absolute) luminance values. More specifically, unless there is a prior and accurate camera calibration, luminance values in an HDR video file represent the real world luminance up to an unknown scale.1

1 Even with calibration, the HDR values represent real physical luminance with certain error. This is because the camera spectral sensitivity functions which relate scene radiance with captured RGB triplets cannot match the luminous efficiency function of the human visual system.

This, nonetheless, is sufficient for most purposes. Secondly, the HDR displays currently available cannot display luminance beyond the specified limit, given the hardware limitations. This necessitates a pre-processing step for
both subjective and objective HDR video quality measurement, as elaborated further in the next section. Despite the two mentioned caveats, HDR imaging can improve the viewer experience significantly as compared to LDR2 imaging and is thus an active research area.

2 The terms LDR and HDR are also sometimes respectively referred to as lower or standard dynamic range (SDR) and higher dynamic range (to explicitly indicate that the range captured is only relatively higher than LDR but not the entire dynamic range present in a real scene). We, however, do away with such precise distinctions and always assume that the terms HDR and LDR are used in a relative context throughout this paper.

As already mentioned in the introduction, this paper seeks to address the issue of objective video quality measurement for HDR video. This is in light of the need to develop and validate such algorithms to objectively evaluate the perceptual impact of various HDR video processing tools on video quality. We describe the details of the method and verify its performance in the following sections.

3. The proposed objective HDR video quality measure

A block diagram outlining the major steps in the proposed HDR-VQM is shown in Fig. 1. It takes as input the source and the distorted HDR video sequences. Note that throughout the paper we use the notation src (source) and hrc (hypothetical reference circuit) to respectively denote reference and distorted video sequences. As shown in the figure, the first two steps are meant to convert the native input luminance to perceived luminance. These can therefore be seen as pre-processing steps, and the first step is optional in case the user provides emitted luminance values. Next, the impact of distortions is analyzed by comparing the different frequency and orientation subbands in src and hrc. The last step is that of error pooling which is achieved via spatio-temporal processing of the subband errors. This comprises short term temporal pooling, spatial pooling and finally, a long term pooling. A separate block diagram explaining the error pooling in HDR-VQM is shown in Fig. 3. In the following sub-sections, we elaborate on the various steps in HDR-VQM.

3.1. From native HDR values to emitted luminance: modeling the display processing

We begin with two observations with regard to HDR video signal representation. First, as emphasized in the previous section, native HDR signal values are, in general, only proportional to the actual scene luminance and not equal to it. Therefore, the exact scene luminance at each pixel location will be, generally, unknown. Second, since the maximum luminance values of real-world scenes can be vastly different, the concept of a fixed maximum (or the white point) does not exist for HDR values. In view of these two observations, HDR video signals must be interpreted based on the display. Thus, their values should be recalibrated according to the characteristics of the HDR display used to view them. This is unlike the case of LDR video where the format is more standardized (e.g. for 8-bit representation, the maximum value will be 255 which
would be mapped to the peak display luminance that does not typically exceed 500 cd/m2). With regard to HDR displays, the inherent hardware limitations will impose a limit on the maximum luminance that can be displayed. Thus, a pre-processing of the HDR video signal is always required in order that the pre-defined maximum luminance point is not exceeded. Specifically, unlike the LDR domain, HDR videos will generally be viewed on HDR displays that may have different peak luminance and/or contrast ratios. Thus, artifact visibility for the same HDR video can be different depending on the display deployed (e.g. there will be different levels of saturation according to peak luminance). The said pre-processing for the transition from native HDR values (denoted as Nsrc, Nhrc in Fig. 1) to the emitted luminance (denoted as Esrc, Ehrc in Fig. 1) is therefore an important step in the objective method design. Also, as highlighted in Fig. 1, the first (optional) step can be skipped if the HDR data has been display-adapted by the user. While one can adopt different strategies (from simple ones like linear scaling to more sophisticated ones) for the said display based pre-processing, this is not the main focus of the work. However, for the purpose of the method described in this paper, it is sufficient to highlight that in the general case, it is important that the characteristics of the HDR display are taken into account and the HDR video transformed (pre-processed) accordingly, i.e. Nsrc → Esrc and Nhrc → Ehrc, for objective HDR video quality estimation. The specific pre-processing that we employed in our experiments is discussed in Section 4.3. But as mentioned, it will depend on the user in general.

3.2. Transformation from emitted to perceived luminance

The second step in the design of HDR-VQM concerns the transformation of the emitted luminance to perceived luminance, i.e. Esrc → Psrc and Ehrc → Phrc as indicated in Fig. 1. This is required since there exists a non-linear relationship between the perceived and emitted luminance values given the response of the human visual system (HVS) to different luminance levels. An implication of such non-linearity is that the changes introduced by an HDR video processing algorithm in the emitted luminance may not have a direct correspondence to the actual modification of visual quality. This is different from the case of LDR representation in which the pixel values are typically gamma encoded. Thus, LDR video encodes information that is non-linearly (the non-linearity arising due to the gamma curve) related to the scene luminance. As a result of such non-linear representation, the changes in LDR pixel values can be approximately linearly related to the actual change perceived by the HVS. Due to this, many LDR image/video quality measurement methods directly employ the said gamma encoded pixel values as input and assume that changes in LDR pixels (or changes in features extracted from those pixels) due to distortion can quantify quality degradation (the reference video is always assumed to be of perfect quality). Therefore, to achieve a similar functionality as the LDR domain, the said non-linearity of the HVS to the emitted luminance should be taken into account for objective HDR video quality
Fig. 1. Block diagram of the proposed HDR-VQM. (The diagram shows the optional transformation into emitted luminance, the transformation into perceived luminance, Gabor filtering, subband comparison with pooling across scale and orientation, and the error pooling stage comprising short term temporal pooling, spatial pooling and long term temporal pooling.)
Fig. 2. Comparison of responses (normalized response versus luminance in cd/m2) of the two transformation functions, PU and logarithmic, in two different ranges of luminance: (a) 1–200 cd/m2 and (b) 1000–10 000 cd/m2.
evaluation. In this way, the input values to the objective HDR video quality estimator would be expected to be approximately linearly related to the changes induced due to distortions. According to the Weber law, a small increment of luminance at a low level is perceived as larger than the same increment at a higher luminance level. We therefore explored the logarithmic transform and the perceptually uniform (PU) encoding proposed in [10] which can transform luminance values in the range from 10^-5 to 10^8 cd/m2 into approximately perceptually uniform code values. In order to compare the two mentioned transformations, we plot in Fig. 2 the respective responses to an input which is in the range from 1 to 10 000 cd/m2. For a closer look, we have plotted the responses for luminance in the range 1–200 cd/m2 and above 1000 cd/m2. We notice from Fig. 2(a) that the response of PU encoding is relatively more linear at lower luminance as compared to the logarithmic one. To further quantify this, it was found that the linear correlation between the original and transformed signals was 0.9334 for PU encoding and 0.9071 for logarithmic, for the range between 1 and 200 cd/m2. On the other hand, both PU and logarithmic curves have a similar response for higher luminance values (above 1000 cd/m2) as indicated in Fig. 2(b). In this case, the linear correlations were 0.8703 and 0.8763 respectively for PU and logarithmic transformation. Thus, PU encoding can better approximate the response of the HVS which is approximately
linear at lower luminance and increasingly logarithmic at higher luminance values. Due to this, PU encoding is expected to better model the underlying non-linear relationship between the HVS's response and emitted luminance. The study on HDR video compression in [11] provides further evidence of this, where it was demonstrated that PU encoding resulted in better coding efficiency as compared to the logarithmic transformation. From the objective image quality assessment viewpoint also, PU encoding has been shown [7] to be beneficial. Lastly, as described in the original paper [10], PU encoding has been implemented as a look-up table operation, and therefore does not substantially increase the computational overhead. All these features render PU encoding a reasonably accurate and efficient alternative for transforming the emitted luminance values to the perceived ones.
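As a rough illustration of this step (a sketch, not the authors' implementation), a look-up-table based transform of this kind can be applied by interpolation; the `lut_luminance` and `lut_code` arguments stand for the tabulated PU curve published with [10] and are assumptions here, not reproduced values.

```python
import numpy as np

def pu_encode(emitted_luminance, lut_luminance, lut_code):
    """Map emitted luminance (cd/m^2) to perceptually uniform code values by
    interpolating a precomputed PU look-up table (cf. [10]); lut_luminance and
    lut_code are placeholders for the tabulated curve (sorted by luminance)."""
    L = np.clip(emitted_luminance, lut_luminance[0], lut_luminance[-1])
    # interpolate on log-luminance, where the table is more evenly sampled
    return np.interp(np.log10(L), np.log10(lut_luminance), lut_code)

def log_encode(emitted_luminance, floor=1e-5):
    """Simple logarithmic alternative discussed above."""
    return np.log10(np.maximum(emitted_luminance, floor))

# Example: checking how linear a transform is in a low-luminance range (cf. Fig. 2(a))
L = np.linspace(1.0, 200.0, 1000)
linearity_of_log = np.corrcoef(L, log_encode(L))[0, 1]
```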
3.3. Computation of subband error signal

The proposed HDR-VQM is based on spatio-temporal analysis of an error video whose frames denote the localized perceptual error between a source and distorted video. We first describe the steps to obtain the subband error signal and then present the details of the spatio-temporal processing.

We employed log-Gabor filters, introduced in [12], to calculate the perceptual error at different scales and orientations. We note that log-Gabor filters have been widely used in image analysis and are a reasonable choice of features to compare intrinsic characteristics of natural scenes. In our approach, the log-Gabor filters were used in the frequency domain and can be defined in polar coordinates by H(f, θ) = H_f · H_θ, with H_f and H_θ being the radial and angular components, respectively:

H_{s,o}(f, θ) = exp( −(log(f/f_s))² / (2·(log(σ_s/f_s))²) ) · exp( −(θ − θ_o)² / (2·σ_o²) )    (1)
where H_{s,o} is the filter denoted by spatial scale index s and orientation index o, f_s denotes the normalized center frequency of the scale, θ is the orientation, σ_s defines the radial bandwidth B in octaves with B = 2·sqrt(2/log(2))·|log(σ_s/f_s)|, θ_o represents the center orientation of the filter, and σ_o defines the angular bandwidth ΔΩ = 2·σ_o·sqrt(2·log(2)). Video frames in the perceived luminance domain (i.e. Psrc and Phrc) were decomposed into a set of subbands by computing the inverse DFT of the product of the frame's DFT with the frequency domain filter defined in (1). We denote the resulting subband values as {l_{t,s,o}^(src)} and {l_{t,s,o}^(hrc)} respectively. Here, s = 1, 2, ..., N_scale (the total number of scales), o = 1, 2, ..., N_orient (the total number of orientations) and t = 1, 2, ..., F (the total number of frames in the sequence). The next step is to compute the error in the subband values for each frame at different scales and orientation levels. At the same time, we know that channel masking effects can play a role in elevating or decreasing the visibility of errors. Thus, we compute the said error by using the following equation (a similar formulation has been used in previous works such as [13] although not for directly modeling the masking effect):

Error_{t,s,o} = (2·l_{t,s,o}^(src)·l_{t,s,o}^(hrc) + k) / ((l_{t,s,o}^(src))² + (l_{t,s,o}^(hrc))² + k)    (2)
Here k (a small constant) is added to avoid division by zero. The denominator term in the above formulation can be viewed as a simple intra-band masking model since large coefficients within the subband will mask the effect of smaller ones. Note that inter-band masking is a secondary effect and hence does not require to be modeled.3

3 In fact, inter-band masking would be limited since the definition of the band is the essence of a masking model. Masking mostly occurs within the band, following the receptor fields theory. Signals interact together (e.g. a masker on a masked signal) if they are exciting the same receptor field, which can be assimilated to a certain spatio-frequency band. Hence, inter-band masking will be limited as long as we have set the right subband.

We can then obtain the total error at each pixel in each video frame by pooling across scales and orientations. While more sophisticated methods such as those based on the contrast sensitivity function (CSF) can be adopted for this purpose, a possible bottleneck is that of computing the desired CSF accurately (especially one which may be applicable for both near-threshold and supra-threshold distortions). Therefore, in the current formulation, we limited the scope of application of the proposed HDR-VQM to supra-threshold distortions and chose to assign equal weighting as follows:

Error_t = (1/(N_scale·N_orient)) · Σ_{s=1}^{N_scale} Σ_{o=1}^{N_orient} Error_{t,s,o}    (3)
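The following is a minimal sketch of the subband error computation of Eqs. (1)–(3) for one frame pair already in the perceived luminance domain; the filter constants (f_min, sigma_ratio, sigma_theta) and the constant k are illustrative choices, not the values calibrated in the paper.

```python
import numpy as np

def log_gabor_filters(rows, cols, n_scales=5, n_orients=4,
                      f_min=1.0 / 3.0, sigma_ratio=0.55, sigma_theta=0.6):
    """Frequency-domain log-Gabor filters H_{s,o}(f, theta) as in Eq. (1).
    Parameter values here are illustrative, not those used in the paper."""
    fy = np.fft.fftfreq(rows)[:, None]     # vertical normalized frequencies
    fx = np.fft.fftfreq(cols)[None, :]     # horizontal normalized frequencies
    f = np.sqrt(fx ** 2 + fy ** 2)
    f[0, 0] = 1.0                          # avoid log(0) at the DC term
    theta = np.arctan2(fy, fx)
    filters = []
    for s in range(n_scales):
        f_s = f_min / (2.0 ** s)           # center frequency of scale s
        radial = np.exp(-np.log(f / f_s) ** 2 / (2.0 * np.log(sigma_ratio) ** 2))
        radial[0, 0] = 0.0                 # suppress the DC component
        for o in range(n_orients):
            theta_o = o * np.pi / n_orients                       # e.g. 45 degrees apart
            dtheta = np.angle(np.exp(1j * (theta - theta_o)))     # wrapped angle difference
            filters.append(radial * np.exp(-dtheta ** 2 / (2.0 * sigma_theta ** 2)))
    return filters

def subband_error_frame(p_src, p_hrc, filters, k=1e-2):
    """Per-pixel map Error_t of Eqs. (2)-(3) for one pair of perceived-luminance frames."""
    F_src, F_hrc = np.fft.fft2(p_src), np.fft.fft2(p_hrc)
    error_t = np.zeros(p_src.shape)
    for H in filters:
        l_src = np.real(np.fft.ifft2(F_src * H))   # source subband
        l_hrc = np.real(np.fft.ifft2(F_hrc * H))   # distorted subband
        error_t += (2.0 * l_src * l_hrc + k) / (l_src ** 2 + l_hrc ** 2 + k)
    return error_t / len(filters)          # equal weighting across scales and orientations
```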
Error_t, which is a 2D distortion map, represents the error in each frame t due to processing of a source sequence by an HRC, and thus the error video can be represented as Error_video = {Error_t}, t = 1, 2, ..., F. It can be noted that Error_video helps to quantify the effect of local distortions by assessing their impact across frequency and orientation. We can then exploit it further via spatio-temporal analysis in order to calculate short term quality in a spatially and temporally localized neighborhood, and subsequently obtain the overall HDR video quality score. We elaborate on this in the next sub-section.

3.4. From spatio-temporal subband errors to overall video quality: the pooling step

Video signals propagate information along both spatial and temporal dimensions. However, due to visual acuity limitations of the eye, humans fixate their attention on local regions when viewing a video because only a small area of the eye retina, generally referred to as the fovea, has a high visual acuity. As mentioned in Section 2, this is due to the higher density of photoreceptor cells (cones) present in the fovea. Consequently, human eyes have to rapidly shift their gaze (the time between such movements is the fixation duration) to bring localized regions of the visual signal into the foveal field. Thus, humans tend to judge video quality in a local context both spatially and temporally, and determine the overall video quality based on those assessments. Stated differently, the impact of distortions introduced in video frames will not be limited just to the spatial dimension but will rather manifest spatio-temporally. In the light of this, a reasonable strategy for objective video quality measurement is to analyze the video in a spatio-temporal (ST) dimension [14–16], so that the impact of distortions can be localized along both spatial and temporal axes. To incorporate this in HDR-VQM, we first obtained the error video Error_video = {Error_t}, t = 1, 2, ..., F, for the given pair of source and distorted HDR video. Then, the error video was divided into non-overlapping short-term ST tubes defined by a 3-dimensional region with x horizontal, y vertical and z temporal data points, i.e. a cuboid with dimensions x × y × z, as illustrated in Fig. 3. Note that the former two define the spatial axes while the latter determines the temporal resolution to be considered when evaluating quality at the level of ST tubes. The values of x and y together define the area of the fixated region. Therefore, these can be computed by taking into account the viewing distance, the central angle of the visual field in the fovea and the display resolution. On the other hand, a good range of z can be determined by considering the average fixation duration when viewing a video. While this can vary due to content and/or distortions, studies (e.g. [17]) related to the analysis of eye-movement during video viewing indicate that values in the range of 300–500 ms can be a reasonable choice. We may also point out that in the current version of the proposed method, we do not
Fig. 3. Error pooling in HDR-VQM. (The subband errors are divided into non-overlapping ST tubes of size x × y × z; the standard deviation of each tube yields the spatio-temporal error frames ST_{v,t_s}, which are spatially pooled into short-term quality scores and then temporally pooled into the overall quality.)
employ motion information (e.g. those from motion vectors of the encoded video) so that the blocks can track object motion. This could possibly be explored as an extension of HDR-VQM but not within the current scope.
3.4.1. Short term temporal pooling

The aim of short term temporal pooling is to pool or fuse the data in local spatio-temporal neighborhoods (defined by ST regions). Keeping in mind that the goal is to characterize the effect of spatial distortions over a short term duration (equal to the fixation time), we computed the standard deviation of each ST tube in order to obtain a spatio-temporal subband error frame ST_{v,t_s} (as visually exemplified in Fig. 3), where v represents the spatial coordinates and t_s is the index of the resulting spatio-temporal frames. By this definition, a video sequence with lower visual quality will have higher localized standard deviations while these will decrease as the signal quality improves. Thus, ST_{v,t_s} helps to quantify the signal coherence level in local neighborhoods. Also note that in our case, the error video was analyzed based on non-overlapping ST tubes. Thus we will have t_s = 1, 2, ..., F/z. The final step is to perform spatial and long term temporal pooling to obtain the global video quality score.
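As an illustration (a simplified sketch, not the reference implementation), the short term pooling can be written as follows; frames or pixels left over beyond whole multiples of the tube size are simply ignored here.

```python
import numpy as np

def short_term_pooling(error_video, x=64, y=64, z=10):
    """Standard deviation over non-overlapping x-by-y-by-z ST tubes (Section 3.4.1).
    error_video is an array of shape (F, height, width); the result contains the
    spatio-temporal error frames ST_{v, t_s}, one frame per group of z input frames."""
    F, H, W = error_video.shape
    n_t, n_y, n_x = F // z, H // y, W // x
    st_frames = np.empty((n_t, n_y, n_x))
    for ti in range(n_t):
        for yi in range(n_y):
            for xi in range(n_x):
                tube = error_video[ti * z:(ti + 1) * z,
                                   yi * y:(yi + 1) * y,
                                   xi * x:(xi + 1) * x]
                st_frames[ti, yi, xi] = tube.std()   # one value per ST tube
    return st_frames
```

The spatial and long term temporal pooling described next then operate on these ST frames.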
3.4.2. Spatial and long term temporal pooling

To obtain an overall video quality score that can quantify the level of annoyance in the video sequence, the local errors represented by the spatio-temporal subband error frames ST_{v,t_s} are pooled further in two stages: (a) spatial pooling to generate a time series of short term quality scores and (b) long term temporal pooling to fuse the short term quality scores into a single number denoting the overall annoyance level. These stages are based on the premise that humans evaluate the overall video quality based on continuous assessments of the impact of the short term errors or annoyance they come across while viewing the video. Therefore, we first perform spatial pooling on the spatio-temporal error frames ST_{v,t_s} in order to obtain the short-term quality scores, as illustrated in Fig. 3. We note that such short term quality scores can be useful in providing information about the distortions temporally. Finally, a long term pooling is applied to compute the overall video quality score. We therefore use the following
expression for HDR-VQM:

HDR-VQM = (1 / (|v ∈ L_p| · |t_s ∈ L_p|)) · Σ_{t_s ∈ L_p} Σ_{v ∈ L_p} ST_{v,t_s}    (4)
where L_p denotes the set with the lowest p% values and |·| stands for the cardinality of the set. The reader will notice that both short term spatial and long term temporal pooling have been performed over the lowest p% values. This is because the HVS does not necessarily process visual data in its entirety and makes certain choices to minimize the amount of data analyzed. It is, of course, non-trivial to realize and integrate such exact HVS mechanisms into an objective method. In our case, we therefore employ the simple approach of percentile pooling keeping in mind that we are dealing with supra-threshold distortions, and quality judgment would be based on how much the signal has been distorted (i.e. for a perfect quality video HDR-VQM will be zero). Finally, it is straightforward to use HDR-VQM to compute objective HDR image quality (obviously there will be no temporal index in this case).

4. HDR video dataset

To the best of our knowledge there are currently no publicly available subjectively annotated HDR video datasets dealing with the issue of visual quality. Therefore, for verifying the prediction performance of HDR-VQM and other objective methods, an in-house and comprehensive HDR video dataset was used. This section provides a brief description of the dataset.

4.1. Test material preparation

The dataset used 10 source HDR video sequences,4 and the tone-mapped (based on a local operator) version of the first frame for each source is shown in Fig. 4.

4 These HDR video sequences were shot and made available as part of the NEVEx Project FUI11, related to HDR video chain study.

The file format was .hdr which is based on the Radiance picture format where each pixel is stored as 4 bytes, 1 byte mantissa for each color channel R, G, B and a shared 1 byte exponent E. The frame resolution was full HD (1920 by 1080 pixels) and the frame rate was 25 frames per second (fps). The dynamic range for each source sequence is shown in Fig. 5(a). It was computed as the maximum of
Fig. 4. Tone mapped versions of first frame of the 10 reference (source) HDR video sequences used in the subjective study. (a) Geekflat (src1). (b) Lighthouse (src2). (c) Peniches (src3). (d) Scene A (src4). (e) Baseball (src5). (f) Machine (src6). (g) Outro (src7). (h) Casque (src8). (i) Tunnel1 (src9). (j) Tunnel2 (src10).
the dynamic ranges of individual frames, i.e.

Dynamic Range = max_{t = 1, 2, ..., F} { log10( L_max^(t) / L_min^(t) ) }    (5)

where L_max^(t) and L_min^(t) respectively denote the maximum and minimum relative luminance values for frame t. For robustness against outliers and to avoid division by zero, these were computed after discarding the 1% brightest and darkest values. From this figure, we can see that the dynamic range of the source content is higher than 3 (except for src2 whose dynamic range is about 2.7). Although we report the video dynamic range as the maximum dynamic range of a frame in the sequence, we found that a large percentage of frames had a dynamic range in excess of 3. Specifically, the percentage of such frames were respectively 73%, 0%, 90%, 43%, 100%, 63%, 85%, 27% and 56%, for src1 to src10. Since most LDR displays cannot
display more than 3 orders of magnitude, the video content used in our paper requires an HDR display (except probably for src2). Further, the spatial versus temporal information measures (computed on PU transformed frames) for each source sequence is shown in Fig. 5(b). The source sequences were compressed at different bit rates to obtain the test stimuli. To that end, we employed a backward-compatible (in our context backward compatibility with existing standard 8-bit displays) HDR video compression method. In general, any backward-compatible HDR compression scheme comprises of 3 main steps: (a) forward tone mapping in order to convert HDR video to LDR (8-bit precision), (b) compression and decompression of the LDR video by a standard LDR video compression method, and (c) inverse tone mapping of the decoded (decompressed) LDR bit stream to reconstruct HDR video. The specific method that we used is described in [11]. It is not discussed here but the reader is encouraged to refer to [11] for details. The LDR video
Fig. 5. Source content information. (a) Dynamic range for each src. (b) Spatial (SI) versus temporal (TI) information plot.
was encoded and decoded using H.264/AVC at different bit rates. For the subjective viewing tests, we selected 8 bit rates (by varying the quantization parameter QP) such that the resultant HDR video quality covered the entire rating scale, i.e. from excellent (rating 5) to bad (rating 1). With the inclusion of the source sequences, we obtained a total of 90 HDR video sequences (10 sources × 8 bit rates + 10) to be rated by the subjects.
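For illustration only, the backward-compatible chain summarized above can be sketched as below; `tmo`, `inverse_tmo` and `ldr_codec` are hypothetical placeholders for a tone mapping operator, its inverse and an H.264/AVC encode/decode stage (the actual method used to generate the test material is the one described in [11]).

```python
def backward_compatible_hdr_codec(hdr_frames, qp, tmo, inverse_tmo, ldr_codec):
    """Schematic of the three-step backward-compatible HDR compression chain:
    (a) forward tone mapping to 8-bit LDR, (b) LDR compression/decompression,
    (c) inverse tone mapping of the decoded LDR video back to HDR."""
    ldr_frames = [tmo(frame) for frame in hdr_frames]        # (a) HDR -> LDR
    decoded_ldr = ldr_codec(ldr_frames, qp)                  # (b) encode + decode
    return [inverse_tmo(frame) for frame in decoded_ldr]     # (c) LDR -> HDR
```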
4.2. Rating methodology

Our study involved 25 paid observers who were not expert in image or video processing. They were seated in a standardized room conforming to the International Telecommunication Union Recommendation (ITU-R) BT.500-11 recommendations [18]. Prior to the test, observers were screened for visual acuity by using a Monoyer optometric table and for normal color vision by using Ishihara's tables. All of them had normal or corrected to normal visual acuity and normal color perception. For rating the test stimuli, we adopted the absolute category rating with hidden reference (ACR-HR), which is one of the rating methods recommended by the ITU in Rec. ITU-T P.910 [19]. The ACR-HR is a category judgment method where the test stimuli are presented one at a time and rated independently on a category scale. The rating method also includes the source sequences (i.e. undistorted) to be shown as any other test stimulus without informing the observers. This is therefore termed a hidden reference condition, and the advantage is that it implicitly encourages the observers to rate video quality and not fidelity since there is no reference to compare with. To quantify the video quality, a five-level scale is used: 5 (Excellent), 4 (Good), 3 (Fair), 2 (Poor) and 1 (Bad). We chose a discrete five-level scale because it is more suitable for naive (non-experts in image processing) observers, and it is easier for them to quantify the quality based on an adjective (‘Excellent’, ‘Good’, ‘Fair’, ‘Poor’ and ‘Bad’). We also employed post-experiment screening of the subjects in order to reject any outliers in accordance with the Video Quality Experts Group (VQEG)
multimedia test plan [20], and in our case no observer was rejected.

4.3. Display

For displaying the HDR video sequences, the SIM2 Solar47 HDR display was used which has a maximum displayable luminance of 4000 cd/m2. Since the HDR display luminance is relatively much higher as compared to conventional display devices, we need much higher room illumination in order to reduce the visual discomfort of the subjects. Therefore, for setting the ambient luminance, we initially adopted the strategy in the BT.500-11 recommendations [18]. It states that the room illumination should be around 15% of the maximum displayable luminance (this would be about 600 cd/m2 as the SIM2 display has a maximum luminance of about 4000 cd/m2). However, such high ambient luminance values would reduce the scene contrast drastically especially in darker areas (so this would also eventually depend on the content). Thus, there has to be a trade-off between minimizing the glare effect and losing out details in darker areas. Therefore, the ambient luminance was selected based on a pre-test where we adjusted the luminance values so that the viewers were comfortable in viewing the HDR video. In our study we set the ambient luminance to 200 cd/m2 as it provided comfortable viewing conditions5 for the observers.

5 Our subsequent studies (e.g. [21]) with the SIM2 display indicated that a lower value in the range of 130–140 cd/m2 is more suitable. Nevertheless, the ambient light setting used for the study reported in this paper was still reasonable for practical purposes though on a slightly higher side.

The viewing distance was set to three times the height of the screen (active part), i.e. approximately 178 cm. It may be recalled from Section 3.1 that we first need to convert the scene-referred data to output-referred, i.e. Nsrc → Esrc and Nhrc → Ehrc. With regard to the said pre-processing step, the most straightforward way is the use of maximum normalization, i.e. scaling the values in each video frame independently by the maximum luminance
value in that frame. We however observed that this approach suffers from at least two drawbacks. First, because only a single luminance value (the maximum) is used, it is susceptible to outliers. Second, we also observed that it tends to reduce the overall luminance coherence of the HDR video due to the local (i.e. at frame level) normalization. To ameliorate these two issues, we opted for a temporally more coherent strategy and the normalization factor was determined as the maximum of the mean of the top 5% luminance values of all the frames in an HDR video sequence. Specifically, we first computed a vector MT_5 whose elements were the mean of the top 5% luminance values in each HDR video frame, i.e.

MT_5 = { (1/|v ∈ T_5|) · Σ_{v ∈ T_5} N_{v,t} }, t = 1, 2, ..., F    (6)

where N_{v,t} denotes the native HDR values at spatial location v for frame t (F is the total number of frames), and T_5 denotes the set with the highest 5% luminance values in the frame. Then, the native HDR values N were converted to emitted luminance values E as

E = N · 179 / max(MT_5)    (7)

where the multiplication factor of 179 is the luminous efficacy of equal energy white light that is defined and used by the radiance file format (RGBE) for the conversion to actual luminance values. Finally, a clipping function was applied to limit the E values to the range defined by the black point (lowest displayable luminance) and the maximum displayable luminance (obviously both will depend on the display characteristics). It is worth highlighting that the display based pre-processing described above is not unique and will depend on the user. Hence, it is optional in our method and this is true for any other objective method.

5. Experimental results

Before we compute the experimental results, it is necessary to determine the parameter values in HDR-VQM. First, to compute the values of x and y (it may be recalled from Fig. 3 that these respectively denote the width and height of the considered block), we assume the central angle of the visual field in the fovea to be 2° [22]. Then, we can define a quantity W which denotes the length of the fixated window in terms of number of pixels and can be computed as

W = tan(2°) · V · sqrt(R/DA)    (8)

where V is the viewing distance in cm, R is the display resolution and DA is the display area. In our case, V = 178 cm, R = 1080 × 1920 pixels and DA ≈ 6100 cm2. Plugging these values into (8) gives W ≈ 115. To reduce the computational effort, HDR-VQM was run on down sampled (by a factor of 2) video frames, and hence the approximate length of the fixated window will be W/2 ≈ 58. Thus, we set x = y = 64 pixels in order to be nearest to a more standard block size. To determine z, we assumed a fixation duration of 400 ms and with a frame rate of 25 fps, we will have z = 10 frames. The number of scales and orientations used were 5 and 4, respectively, i.e. N_scale = 5 and N_orient = 4. The orientations were equally spaced by 45°. We found that using more orientations was not beneficial in terms of gains in prediction accuracy. Similarly, using 5 scales was enough to account for loss of details at different resolutions. Thus, in the corresponding viewing conditions, such settings mimic well the selectivity (orientation and radial) of the receptor fields. The pooling factor p was set to 5% for the results reported in the next sections. However, it may be mentioned that varying p between 5% and 50% did not significantly change the prediction accuracy on our HDR video dataset. But values higher than 60% lead to a sharp decrease in prediction accuracies.
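The viewing-setup dependent parameters can be derived directly from Eq. (8) and the assumed fixation duration; the following small sketch reproduces this computation with the values reported above.

```python
import math

def fixation_window_pixels(view_dist_cm, resolution_pixels, display_area_cm2,
                           fovea_angle_deg=2.0):
    """Length W of the fixated window in pixels, following Eq. (8)."""
    pixels_per_cm = math.sqrt(resolution_pixels / display_area_cm2)
    return math.tan(math.radians(fovea_angle_deg)) * view_dist_cm * pixels_per_cm

def tube_length_frames(fixation_ms=400.0, frame_rate_fps=25.0):
    """Temporal extent z of an ST tube for a given fixation duration."""
    return int(round(fixation_ms / 1000.0 * frame_rate_fps))

# Setup used in the paper: V = 178 cm, 1920 x 1080 pixels, DA of about 6100 cm^2
W = fixation_window_pixels(178.0, 1920 * 1080, 6100.0)   # about 115 pixels
z = tube_length_frames()                                  # 10 frames at 25 fps
```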
5.1. Correlation based comparisons

The first set of experimental results is reported in terms of two criteria: the Pearson linear correlation coefficient Cp (for prediction accuracy) and the Spearman rank order correlation coefficient Cs (for monotonicity), between the subjective score and the objective prediction. The former depends on the non-linear mapping between objective and subjective scores (we therefore employed a four-parameter monotonic logistic mapping between the objective outputs and the subjective quality ratings [20]) while the latter is independent of any such non-linearity. We compared the performance of HDR-VQM with a few popular LDR methods including NTIA-VQM [14], PSNR and multi-scale SSIM [13]. The input to all these methods were the perceived luminance values Psrc and Phrc and hence we refer to them as P-NTIA-VQM, P-PSNR and P-SSIM. In addition, we considered two other methods which take as input the emitted luminance values Esrc and Ehrc. The first one is the HDR-VDP-2 originally proposed in [23], and we employed its re-calibrated version HDR-VDP-2.2 [24]. The second method is the relative PSNR (RPSNR) which employs a variant of the traditional mean squared error, MSE, in that it normalizes the error by the magnitude of luminance at each point in order to account for reduced sensitivity at high luminance. It can be defined as

RPSNR = 10·log10( (1/N_pixels) · Σ_{i,j} (E_src,i,j − E_hrc,i,j)² / (E²_src,i,j + E²_hrc,i,j) )    (9)

where i, j and N_pixels respectively denote the pixel index and the total number of pixels in the source and distorted frames. Finally, since none of the image quality measurement methods consider temporal information explicitly, we applied the same spatio-temporal pooling as HDR-VQM (shown in Fig. 3) to P-SSIM and RPSNR so that the impact of the proposed pooling can be observed. These are denoted as R-PSNRST and P-SSIMST. Since we are dealing with full-reference video quality estimation, we computed the correlations using only the 80 distorted sequences (i.e. 80 test conditions). These results for across sequences (i.e. with all the 80 sequences) are presented in Table 1. One can observe that HDR-VQM achieves
Table 1
Experimental results for the 80 test conditions.

                    R-PSNR   P-PSNR   P-SSIM   HDR-VDP-2.2   R-PSNRST   P-SSIMST   P-NTIA-VQM   HDR-VQM
Cp                  0.3034   0.3385   0.6988   0.6456        0.4120     0.7216     0.6917       0.8224
Cs                  0.2095   0.2485   0.6447   0.5989        0.3567     0.6790     0.7101       0.7604
RMSE                1.1723   1.0708   0.7769   0.8154        1.011      0.7219     0.7903       0.5202
Outlier ratio (%)   62       68       52       53            61         45         46           28
Fig. 6. F-test results. (a) 80 test conditions in video dataset. (b) 350 test conditions in image dataset. (Bars show the F values of the compared methods with respect to HDR-VQM; horizontal lines mark the F_critical values of 1.44 and 1.19.)
Table 2
Cp values per source (src) sequence. These values are computed without the logistic fitting function.

         R-PSNR   P-PSNR   P-SSIM   HDR-VDP-2.2   R-PSNRST   P-SSIMST   P-NTIA-VQM   HDR-VQM
src1     0.9709   0.9809   0.9281   0.9700        0.9009     0.8484     0.9560       0.9168
src2     0.7992   0.9423   0.9423   0.9677        0.5474     0.9050     0.9367       0.9401
src3     0.9230   0.8839   0.9102   0.9320        0.7251     0.9317     0.9689       0.9504
src4     0.2887   0.6402   0.9532   0.9602        0.6794     0.9865     0.9905       0.9744
src5     0.9297   0.8873   0.9584   0.9600        0.5692     0.9617     0.9764       0.9680
src6     0.9687   0.9721   0.9657   0.9600        0.9882     0.9279     0.9718       0.9739
src7     0.9756   0.9607   0.9680   0.8923        0.8865     0.9656     0.9815       0.9898
src8     0.9483   0.9847   0.8960   0.9685        0.9066     0.8953     0.9091       0.9665
src9     0.9319   0.9374   0.9081   0.9124        0.8958     0.8562     0.9434       0.9533
src10    0.6952   0.8879   0.9343   0.9559        0.8862     0.9265     0.9642       0.9691
relatively higher prediction accuracy in comparison to the other methods. We also carried out an analysis based on an F-test to investigate the significance of the obtained results from a statistical viewpoint, and those results are represented graphically in Fig. 6(a). In this, the F values were computed for the different methods with respect to HDR-VQM (the residuals were verified to be approximately Gaussian which is an assumption for the F-test), and are indicated by the lengths of the respective bars. The cases for which F > F_critical (i.e. the bars that are above the F_critical line), at 99% confidence, indicate that HDR-VQM is significantly better than the corresponding method. Of course, a larger test bed could help to further bring out such statistical differences between objective methods. The results, i.e. Cp values (without any non-linear fitting since the number of data points to be fitted is small), for
per source sequence are indicated in Table 2 from which we note that simpler methods like P-PSNR and MRSE tend to perform better (although the performance is still not consistent for each source) as compared to across sequence. But, in general, HDR-VQM is competitive or better in majority of cases. To further explain why simpler methods like RPSNR and P-PSNR achieve relatively lower correlations in case of across sequence in comparison to the case of per sequence, the scatter plots for different methods are shown in Fig. 7. In this, the x-axis denotes the objective scores while the yaxis indicates the MOS. We can see for instance in Fig. 7(a) that while the RPSNR values per source are more linearly related to the MOS, they exhibit a large scatter due to the curves being shifted for each source. As a result, for similar MOS values across sequences, the corresponding RPSNR
Fig. 7. Scatter plots of MOS versus objective scores for different objective methods, per source sequence (src1–src10). (a) RPSNR. (b) P-PSNR. (c) P-SSIM. (d) HDR-VDP-2. (e) P-NTIA-VQM. (f) HDR-VQM. The vertical bars indicate 95% confidence intervals for the corresponding MOS.
values can be quite different. This decreases the overall correlations. On the other hand, the HDR-VQM, while not perfect, exhibits less scatter across sequences indicating its ability to better distinguish between quality levels of video sequences from different source content.
Recall that the scope of HDR-VQM has been limited mainly to supra-threshold distortion levels. Therefore, it will be interesting to see its performance in high and low quality regimes. Of course, it must be noted that in our video dataset the distortions were mainly supra-threshold.
Table 3
Cp values for ‘High’ and ‘Low’ quality levels.

       R-PSNR   P-PSNR   P-SSIM   HDR-VDP-2.2   R-PSNRST   P-SSIMST   P-NTIA-VQM   HDR-VQM
High   0.6543   0.6799   0.4785   0.6338        0.5050     0.5267     0.6331       0.6087
Low    0.3456   0.2886   0.6876   0.6095        0.4908     0.7562     0.6675       0.8530
Nevertheless, we separated it into two sub-sets: videos encoded with the lowest QP (highest bit rate and least distortion visibility) and those encoded with higher QP values (clearly noticeable artifacts). The results are shown in Table 3 from which it appears that HDR-VQM (and SSIM) perform better in the latter case. In contrast, HDR-VDP-2.2 is more consistent in the two cases. Due to the limited data for the near-threshold condition, we would, however, avoid making generic statements on metric performances with regard to the two conditions.

5.2. Outlier ratio analysis

Outlier ratio analysis is another approach to evaluate objective methods for their prediction accuracy. It is different from the usual correlation based comparisons in that it does not penalize if the prediction error is within the limit of uncertainty in the subjective rating process. The main advantage of outlier analysis is that it helps to evaluate metric accuracy by taking into account the variability or uncertainty in subjective opinions, which are ignored in correlation based comparisons. Particularly, it can be very useful in applications such as video compression where one is generally interested in the rate distortion (RD) behavior of objective methods, i.e. how the objective visual quality varies with bit rates for different source sequences and to what extent that compares with the subjective video quality. To carry out the outlier ratio analysis, we computed the number of outliers from different objective methods for the 80 conditions used in our HDR video dataset. To define an outlier, we use the fact that the subjective rating process is always associated with uncertainty due to inter-observer differences between the perceived quality of the same stimuli. Consequently, the actual subjective video quality can vary between two limits which can be, for instance, determined based on confidence intervals. Therefore, we first computed the absolute prediction error between the subjective MOS and the logistically transformed objective scores for each test condition (let N_conditions denote the number of test conditions). For each objective method, if the said error was greater than the confidence interval associated with the MOS, then that objective method was deemed an outlier for the given condition. Formally, the objective prediction from a given objective method OM will be deemed an outlier for condition c (c = 1, 2, ..., N_conditions) if

|MOS_c − OM_c^(logistic)| > z·std(MOS_c) / sqrt(N_observers)    (10)

Here MOS_c and OM_c^(logistic) respectively denote the subjective score and the logistically mapped objective value from the
Table 4
Experimental results for JPEG distortion (140 images) [6].

                    R-PSNR   P-PSNR   P-SSIM   HDR-VDP-2   HDR-VQM
Cp                  0.6694   0.5849   0.8670   0.8968      0.9049
Cs                  0.6415   0.5356   0.8513   0.8910      0.8988
RMSE                0.7469   0.8154   0.5010   0.4447      0.4279
Outlier ratio (%)   62.1     63.5     47.8     42.8        33.5

Table 5
Experimental results for JPEG 2000 distortion (210 images) [25].

                    R-PSNR   P-PSNR   P-SSIM   HDR-VDP-2   HDR-VQM
Cp                  0.5596   0.4732   0.8308   0.8364      0.9350
Cs                  0.5492   0.5002   0.8457   0.8422      0.9288
RMSE                0.9005   0.9572   0.6047   0.5956      0.3854
Outlier ratio (%)   68.5     70.1     55.7     53          31.9
method OM, z depends on the considered distribution (in our case we assumed Student's t-distribution) and the confidence level (99%), std(MOS_c) is the standard deviation of the raw (i.e. per observer) scores, and N_observers denotes the number of observers. The resulting outlier ratio (expressed as a percentage) for each objective method is reported in Table 1 from which we can see that HDR-VQM has the least number of outliers (28%). Overall we find that HDR-VQM is reasonably accurate and more consistent for both across and per sequence cases as indicated by different tests (including correlation coefficients, statistical tests and outlier analysis).

5.3. Performance evaluation for images

In this section, we compare the performance of HDR-VQM and other objective methods in predicting the quality of HDR images. To that end, we used 2 existing HDR image datasets. The first one consists of 350 distorted HDR images in which the artifacts are due to JPEG and JPEG 2000 compression (the details of the respective subjective studies can be found in [6,25]). The second dataset is relatively small with 44 HDR images comprising JPEG XT distortions among others, and the details are available in [7]. The individual results for JPEG and JPEG 2000 distortions are presented in Tables 4 and 5 respectively while the results for the combined data (i.e. all distorted images together) are reported in Table 6. The corresponding F-test results for all the 350 images together are shown in Fig. 6(b). We find that HDR-VQM overall performs well, and is
statistically better than the methods being compared. Note that in these tables we do not present the results for HDR-VDP-2.2 [24] because it was trained on these images. Further, Table 7 presents the experimental results for the second dataset with 44 images from which one can see that the method performances are quite similar. The most probable reason for this is the relatively low number of test images. For the same reason, we do not report the F-test for this case. We also note that HDR-VDP-2.2 performs better than HDR-VDP-2. Finally, we also carried out statistical significance tests based on Cp and RMSE which indicated the same trend as the respective F-tests (shown in Fig. 6(a) and (b)). However, statistical significance testing with outlier ratios indicated that the existing objective methods were statistically the same as HDR-VQM, both for the image and the video dataset.

Table 6
Experimental results for all images together (350 images).

                    R-PSNR   P-PSNR   P-SSIM   HDR-VDP-2   HDR-VQM
Cp                  0.4646   0.4848   0.8307   0.8583      0.9120
Cs                  0.4480   0.4875   0.8335   0.8581      0.9071
RMSE                0.9426   0.9310   0.5926   0.5463      0.4367
Outlier ratio (%)   72.2     69.7     52.8     47.1        31.7

6. Discussion

The previous sections proposed and verified the performance of an objective HDR video quality estimator, HDR-VQM. An important point worth re-iterating is that in light of the limitations imposed by HDR display technologies, the HDR video signal must be pre-processed. This, in turn, implies that HDR video quality can be affected by the type of pre-processing used as well as the maximum displayable luminance since these can affect distortion visibility. Thus, the role of a display model is more prominent in HDR, and objective quality assessment should consider this aspect. As opposed to this, quality measurement for traditional LDR video signals does not usually require any specific display based pre-processing. By this, we do not imply that the display does not affect quality judgment in the LDR domain, merely that its effect is perhaps less prominent. While the proposed HDR-VQM is reasonably accurate, it is certainly not without its limitations as is the case with any objective method. The reader will notice that HDR-VQM does not explicitly account for spatio-temporal masking that could play a role in quality judgment. Therefore, in the current form, it is assumed that the employed measure of
similarity is directly related to the change in video quality. We note that such an assumption is more reasonable for supra-threshold distortions (i.e. distortions clearly visible to the human eye) but may not always hold in the case of near-threshold distortion levels where the impact of spatio-temporal masking may be more prominent. Also recall that in (3) we did not employ a more sophisticated weighting such as one based on the CSF. Thus, in the current formulation, HDR-VQM may be more effective in the case of supra-threshold distortions and we believe that it is important to make this distinction in order to define the scope of application. In the current form, HDR-VQM operates only on the luminance component. In our dataset, there were no specific color related distortions, and the visible artifacts included mainly blockiness and loss of edges even for an expert viewer. However, color plays an important role in HDR video and quantifying the quality loss due to color artifacts could be a possible extension of HDR-VQM.

With regard to the strengths of the proposed method, one of the most prominent is its relatively low computational complexity due to the fact that it does not explicitly extract motion information from the video. Rather, the focus is on exploiting basic video signal characteristics by resorting to spatio-temporal analysis which, in turn, is based on the idea that humans tend to judge video quality in a local context, both from spatial and temporal viewpoints. As a consequence, HDR-VQM also provides certain local spatio-temporal quality information that can, for instance, be useful as feedback for video codec optimization. While a formal computational complexity analysis of HDR-VQM is not the focus of this work, recall that in HDR-VQM we performed the filtering in the frequency domain. This allows us to exploit the efficiency of the Fast Fourier Transform (FFT) and inverse FFT whose complexity is O(M log M) where M is the frame size. For the temporal dimension, the complexity would depend on the block size and video duration. Further, Fig. 8 provides the percentage of outliers and the relative computational complexity (expressed as the execution time relative to RPSNR). Obviously, lower values for an objective method along both axes imply that it is better. We find that the relative execution time for HDR-VQM is reasonable considering the improvements (i.e. smallest % of outliers) in performance over the other methods. More specifically, for a 10 s sequence (with 250 frames), our method takes about 432 s, i.e. 7.2 min (Linux cluster, 32 GB RAM). Since the error video is computed frame-by-frame, this process can be easily parallelized. With the same hardware set up, HDR-VDP-2.2 took about 24 min to process the same video. Obviously, processing time would also vary depending on the hardware. The reader will also appreciate the fact that
Table 7
Experimental results for the second dataset (44 images) [7].

               Cp        Cs        RMSE       Outlier ratio (%)
R-PSNR         0.8950    0.8748    12.7622    50
P-PSNR         0.8332    0.7927    15.8242    61
P-SSIM         0.9473    0.9211     9.1676    31.8
HDR-VDP-2      0.9286    0.8896    10.6217    38.6
HDR-VDP-2.2    0.9533    0.9428     8.6402    29.5
HDR-VQM        0.9565    0.9495     8.3444    18.1
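For readers who wish to reproduce such tables on their own data, the following sketch computes the four reported indices (Cp, Cs, RMSE, outlier ratio) from objective predictions and subjective scores, together with an F statistic on the prediction residuals of two methods. The outlier criterion and the omission of any prior regression onto the subjective scale are simplifying assumptions, not necessarily the exact protocol followed in the paper.

```python
# Hedged sketch of the performance indices in Tables 6 and 7. The outlier
# criterion (absolute error larger than twice a per-stimulus 95% confidence
# interval, a common VQEG-style choice) is an assumption.
import numpy as np
from scipy import stats

def performance_indices(objective, subjective, ci95=None):
    objective = np.asarray(objective, dtype=float)
    subjective = np.asarray(subjective, dtype=float)
    cp, _ = stats.pearsonr(objective, subjective)      # Cp: prediction linearity
    cs, _ = stats.spearmanr(objective, subjective)     # Cs: prediction monotonicity
    error = objective - subjective
    rmse = float(np.sqrt(np.mean(error ** 2)))
    if ci95 is None:                                   # fall back to an RMSE-sized band
        ci95 = np.full_like(subjective, rmse)
    outlier_ratio = 100.0 * float(np.mean(np.abs(error) > 2.0 * np.asarray(ci95)))
    return cp, cs, rmse, outlier_ratio

def residual_f_statistic(errors_a, errors_b):
    """Ratio of residual variances of two methods (larger over smaller)."""
    va = np.var(errors_a, ddof=1)
    vb = np.var(errors_b, ddof=1)
    return max(va, vb) / min(va, vb)

if __name__ == "__main__":
    mos = np.array([30.0, 55.0, 62.0, 48.0, 75.0, 20.0])   # hypothetical MOS values
    pred = np.array([28.0, 60.0, 58.0, 50.0, 70.0, 25.0])  # hypothetical predictions
    print(performance_indices(pred, mos))
```

The F statistic can then be compared against the critical value of an F distribution with the corresponding degrees of freedom to decide whether one method's residuals are significantly smaller than another's.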
Fig. 8. Percentage of outliers and relative computational complexity (expressed as execution time relative to R-PSNR) for the different methods; lower values are better along both axes.

The reader will also appreciate the fact that video quality judgment, in general, can depend on several extraneous factors (such as display type, viewing distance, ambient lighting conditions, etc.) apart from the distortions themselves. In that regard, a notable feature of HDR-VQM is that it takes some of these factors into account, namely the viewing distance, the fixation duration and the display resolution, when computing its parameters. This allows HDR-VQM to adapt to some of the physical factors that may affect subjective quality judgment. Thus, the necessary parameters of our method depend only on the test setup and are easily derived from it (refer to Section 5); no training is required.
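As an illustration of how such setup-dependent parameters can be obtained, the short sketch below derives a pixels-per-degree value from the viewing distance and display geometry, and a temporal block length from an assumed fixation duration; the formulas and numbers are illustrative assumptions, not the exact expressions used by HDR-VQM.

```python
# Minimal sketch, under assumed geometry, of deriving viewing-setup
# parameters: pixels per degree of visual angle (spatial decomposition)
# and frames per fixation (temporal pooling). Values are hypothetical.
import math

def pixels_per_degree(viewing_distance_m, display_width_m, horizontal_resolution_px):
    """Pixels subtended by one degree of visual angle at the given distance."""
    pixel_size_m = display_width_m / horizontal_resolution_px
    degrees_per_pixel = math.degrees(2 * math.atan(pixel_size_m / (2 * viewing_distance_m)))
    return 1.0 / degrees_per_pixel

def frames_per_fixation(fixation_duration_s, frame_rate_hz):
    """Number of frames covered by one eye fixation (temporal block length)."""
    return max(1, round(fixation_duration_s * frame_rate_hz))

if __name__ == "__main__":
    # Hypothetical setup: 1920-pixel-wide display, 1.05 m wide, viewed from 1.78 m.
    print(pixels_per_degree(1.78, 1.05, 1920))   # about 57 pixels per degree
    print(frames_per_fixation(0.4, 25))          # 10 frames at 25 fps
```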
7. Concluding thoughts

HDR imaging is becoming increasingly popular in the multimedia signal processing community, primarily as a tool for enhancing the immersive video experience of the user. However, very few works address the issue of assessing the impact of HDR video processing algorithms on perceptual video quality, from either a subjective or an objective angle. The study in this paper seeks to outline the first steps towards the design and verification of an objective HDR video quality estimator, HDR-VQM. While objective video quality measurements cannot always replace subjective opinion, they can still be useful in specific HDR video processing applications where subjective studies may be infeasible (e.g. real-time video streaming, video encoding, etc.). To that extent, and within the scope of its application, HDR-VQM is a reasonable objective tool for HDR video quality measurement. To enable others to use the proposed method as well as to validate it independently, a software implementation is made available online for free download and use at https://sites.google.com/site/narwariam/hdr-vqm.

The immediate future work will involve further refinement of the presented method in view of some of the mentioned limitations, as well as further validation on larger HDR video datasets. Since tone mapping (apart from codec-related artifacts) can have a significant impact on visual quality [25], it will also be interesting to evaluate different objective methods, including the proposed HDR-VQM, on another dataset where more than one tone mapping function has been employed. This will help to assess the consistency of objective methods across different types of distortions.
Acknowledgments

The authors wish to thank Romuald Pepion for his help in generating the subjective test results used in this paper. This work has been supported by the NEVEx project FUI11, financed by the French Government.

References

[1] P. Le Callet, S. Moller, A. Perkis, Qualinet White Paper on Definitions of Quality of Experience (2012), White Paper, European Network on Quality of Experience in Multimedia Systems and Services (COST Action IC 1003), 2013.
[2] F. Banterle, A. Artusi, K. Debattista, A. Chalmers, Advanced High Dynamic Range Imaging: Theory and Practice, AK Peters (CRC Press), Natick, MA, USA, 2011.
[3] C. Mantel, S. Ferchiu, S. Forchhammer, Comparing subjective and objective quality assessment of HDR images compressed with JPEG XT, in: 16th IEEE International Workshop on Multimedia Signal Processing (MMSP), September 2014, pp. 1–6.
[4] P. Hanhart, M. Bernardo, P. Korshunov, M. Pereira, A. Pinheiro, T. Ebrahimi, HDR image compression: a new challenge for objective quality metrics, in: 6th International Workshop on Quality of Multimedia Experience (QoMEX), September 2014, pp. 159–164.
[5] M. Rerabek, P. Hanhart, P. Korshunov, T. Ebrahimi, Subjective and objective evaluation of HDR video compression, in: Proceedings of Video Processing and Quality Metrics (VPQM), January 2015, pp. 1–6.
[6] M. Narwaria, M. Perreira Da Silva, P. Le Callet, R. Pepion, Tone mapping-based high-dynamic-range image compression: study of optimization criterion and perceptual quality, Opt. Eng. 52 (10) (2013) 102008.
[7] G. Valenzise, F. De Simone, P. Lauga, F. Dufaux, Performance evaluation of objective quality metrics for HDR image compression, in: Proceedings of the SPIE, vol. 9217, 2014, pp. 92170C–92170C-10.
[8] M. Narwaria, M. Perreira Da Silva, P. Le Callet, R. Pepion, On improving the pooling in HDR-VDP-2 towards better HDR perceptual quality assessment, in: Proceedings of the SPIE, vol. 9014, 2014, pp. 90140N–90140N-9.
[9] G. Mather, Foundations of Perception, Psychology Press, Hove, East Sussex, 2006.
[10] T. Aydin, R. Mantiuk, H. Seidel, Extending quality metrics to full luminance range images, in: Proceedings of the SPIE, vol. 6806, 2008, pp. 68060B–68060B-10.
[11] A. Koz, F. Dufaux, Optimized tone mapping with perceptually uniform luminance values for backward-compatible high dynamic range video compression, in: Proceedings of the IEEE Visual Communications and Image Processing (VCIP), November 2012, pp. 1–6.
[12] D. Field, Relations between the statistics of natural images and the response properties of cortical cells, J. Opt. Soc. Am. A 4 (12) (1987) 2379–2394.
[13] Z. Wang, E. Simoncelli, A. Bovik, Multiscale structural similarity for image quality assessment, in: Proceedings of the 37th Asilomar Conference on Signals, Systems and Computers, vol. 2, November 2003, pp. 1398–1402.
[14] M. Pinson, S. Wolf, A new standardized method for objectively measuring video quality, IEEE Trans. Broadcast. 50 (3) (2004) 312–322.
[15] M. Narwaria, W. Lin, A. Liu, Low-complexity video quality assessment using temporal quality variations, IEEE Trans. Multimed. 14 (3) (2012) 525–535.
[16] A. Ninassi, O. Le Meur, P. Le Callet, D. Barba, Considering temporal variations of spatial visual distortions in video quality assessment, IEEE J. Sel. Top. Signal Process. 3 (2) (2009) 253–265.
[17] O. Le Meur, A. Ninassi, P. Le Callet, D. Barba, Overt visual attention for free-viewing and quality assessment tasks: impact of the regions of interest on a video quality metric, Signal Process.: Image Commun. 25 (7) (2010) 547–558.
[18] Methodology for the Subjective Assessment of the Quality of Television Pictures, Recommendation ITU-R BT.500-13, January 2012. Available online: http://www.itu.int/rec/R-REC-BT.500-13-201201-I.
[19] Subjective Video Quality Assessment Methods for Multimedia Applications, ITU-T Recommendation P.910, April 2008. Available online: http://www.itu.int/rec/T-REC-P.910-200804-I.
[20] Final Report from the Video Quality Experts Group on the Validation of Objective Quality Metrics for Video Quality Assessment, Video Quality Experts Group (VQEG), March 2003.
[21] M. Narwaria, M. Perreira Da Silva, P. Le Callet, R. Pepion, Tone mapping based HDR compression: does it affect visual experience? Signal Process.: Image Commun. 29 (2) (2014) 257–273.
[22] D. Green, Regional variations in the visual acuity for interference fringes on the retina, J. Physiol. 207 (2) (1970) 351–356.
[23] R. Mantiuk, K. Kim, A. Rempel, W. Heidrich, HDR-VDP-2: a calibrated visual metric for visibility and quality predictions in all luminance conditions, ACM Trans. Graph. 30 (4) (2011) 40:1–40:14.
[24] M. Narwaria, R. Mantiuk, M. Perreira Da Silva, P. Le Callet, HDR-VDP-2.2: a calibrated method for objective quality prediction of high-dynamic range and standard images, J. Electron. Imaging 24 (1) (2015) 010501.
[25] M. Narwaria, M. Perreira Da Silva, P. Le Callet, R. Pepion, Impact of tone mapping in high dynamic range image compression, in: Proceedings of Video Processing and Quality Metrics (VPQM), January 2014, pp. 1–6.