Signal Processing: Image Communication 34 (2015) 22–31
A regression-based framework for estimating the objective quality of HEVC coding units and video frames
Tamer Shanableh
Department of Computer Science and Engineering, American University of Sharjah, United Arab Emirates
Article info
Article history: Received 21 August 2014. Received in revised form 12 December 2014. Accepted 26 February 2015. Available online 17 March 2015.
Keywords: PSNR estimation; SSIM estimation; Machine learning; Regression analysis; Video codecs; Video compression

Abstract
A no-reference objective quality estimation framework is proposed. The framework is suitable for any block-based video codec. In the proposed solution, features are extracted from coding units and summarized to form features at the frame level. Stepwise regression is used to select the important feature variables and reduce the dimensionality of the feature vectors. Thereafter, a polynomial regression-based approach is used to model the nonlinear relationship between the feature vectors and the true objective quality values. Such values are estimated for coding units and video frames. The proposed framework is implemented using MPEG-2 and HEVC. The objective quality estimation results are compared against an existing state-of-the-art solution and quantified using the Pearson correlation factor and the root mean square error measure. © 2015 Elsevier B.V. All rights reserved.
Fax: +971 6 515 2979.
http://dx.doi.org/10.1016/j.image.2015.02.008
0923-5965/© 2015 Elsevier B.V. All rights reserved.

1. Introduction

In video broadcasting and IPTV, it is desirable to monitor the quality of delivered services. Such applications need to automatically monitor and estimate the quality of compressed video, which is degraded by lossy coding, transmission errors and potential intermediate video transcoding. Objective quality estimation of compressed video falls into two main categories: Reduced Reference (RR) and No Reference (NR) estimation. In the RR category, special information is extracted from the original frames and subsequently made available for PSNR estimation. In the NR category, no such information is available; NR estimation is therefore less accurate and more challenging. An example of RR estimation is the use of distributed source coding techniques in which the encoder transmits the Slepian–Wolf syndrome of a feature vector representing the original video frame using an LDPC encoder. The receiver reconstructs the side information of the received frame from the Slepian–Wolf bitstream. The original feature vector itself is therefore not transmitted, and the overall bit rate is reduced [1].

No-reference objective quality estimation, on the other hand, can be applied to a video frame or to the whole video sequence. In [2], features are extracted from the whole sequence and compared against a dataset of features belonging to sequences of different spatio-temporal activities. Some solutions predict the PSNR of both video sequences and frames [3]. It was also reported that the PSNR of a video sequence can be estimated from the average bitrate and the mean quantization parameter of the I-frames only [4].

Estimating objective quality at a frame level can make use of the distribution statistics of Discrete Cosine Transform (DCT) coefficients. In [5,6] it was proposed to estimate the quantization error, and hence the PSNR, from the statistics of the DCT coefficients. DCT coefficients are noted to follow a Laplacian probability distribution, and the Laplacian distribution parameter λ is estimated for each DCT frequency band separately. More recently, a method for estimating the PSNR of H.264/AVC video frames that considers both the deblocking filtering effect and the quantization error was proposed in [7].
T. Shanableh / Signal Processing: Image Communication 34 (2015) 22–31
Degradations due to transmission are also taken into account when estimating video quality. For instance, bit stream information, quantization distortions, packet losses and temporal effects of the human visual system are all used for estimating video quality [8,9].

In addition to broadcasting and streaming applications, quality estimation is useful in other scenarios. For instance, the quality of a surveillance video is assessed prior to admitting it as legal evidence in a court of law. In this case, it is preferable to estimate the quality of the video at a subframe or macroblock (MB) level, as some regions of a compressed frame might be of higher interest to a jury. MB-level PSNR estimation is reported in [10,11] and MB-level SSIM estimation is reported in [12].

More recently, a no-reference PSNR estimation method for High Efficiency Video Coding (HEVC) was proposed in [13]. The method estimates the PSNR at a frame level based on a Laplacian mixture distribution. The solution computes the distribution parameters of residual DCT coefficients at different quadtree depths and for different types of coding units (CUs). Since some DCT bands might be all zeros, an exponential regression solution that takes into account the CU coding depths is used. While the prediction results are very accurate, one limitation of this solution is that it assumes a fixed value of the quantization parameter (QP). Hence it is not suitable for constant bitrate coding.

In this work, we approach objective quality estimation from a regression-based perspective. We propose a generic framework which is suitable for any block-based video codec. The proposed solution is applied to MPEG-2 and HEVC. The objective quality is estimated at both the CU/MB level and the frame level. The objective quality metrics used in this work are the PSNR and the Structural Similarity Index (SSIM) [14].
The advantages of the proposed solution are its generic framework and its suitability for estimating the PSNR of videos coded at constant bitrates. To the best of the author's knowledge, this solution is the first to estimate the PSNR and SSIM at the coding unit (CU) level in HEVC.

This paper is organized as follows. Section 2 briefly introduces HEVC and its new partitioning feature, known as coding units (CUs). Section 3 introduces the proposed objective quality estimation framework, which is used to predict PSNR and SSIM values for CUs and frames. Section 4 reviews the tools used in the proposed system, namely stepwise regression and polynomial regression. The experimental results are given in Section 5, and Section 6 concludes the paper.

2. HEVC coding units

The Joint Collaborative Team on Video Coding (JCT-VC) proposed and developed the High Efficiency Video Coding (HEVC) standard [15], the objective of which is to offer a substantially higher compression capability than existing standardized codecs. HEVC also targets new applications, such as beyond-high-definition spatio-temporal resolutions and various sampling formats and color spaces.

One of the main features of HEVC is its frame partitioning, which results in higher prediction accuracy. A frame is divided into square blocks known as coding units (CUs). The maximum allowed size is 64 × 64 for the luma component and the minimum size is 8 × 8. The syntax of each CU indicates the type of prediction, the transform unit (TU) sizes and the types of the prediction units (PUs) used. The syntax also defines whether a CU is coded in split mode. The largest CU is said to have a depth of 0; if it is split further, the four resultant CUs have a depth of 1, and so forth. The partitioning used for motion estimation and compensation is determined by the size of the PUs. The following PU sizes are allowed: 2N × 2N, 2N × N, N × 2N, N × N, 2N × nU, 2N × nD, nL × 2N and nR × 2N. Further details about HEVC can be found in [16]. In this work, we are particularly interested in CUs and their PUs, as features are collected from these coding and partitioning units.

3. Proposed system

As mentioned previously, in this work we use a polynomial regression-based approach to predict the objective quality of both video coding units (CUs) and full frames. The use of machine learning techniques for the prediction of objective video quality is reported in the literature for a number of video codecs. For instance, the work in [17] employed artificial neural networks to predict the quality of MPEG-2 videos using features extracted directly from the video streams. Likewise, a number of machine learning techniques, including SVM and Bayes classifiers, were used to classify the quality of coded MPEG-2 macroblocks into various PSNR classes [18]. In addition, linear regression was used to predict the objective quality of videos coded using the H.264/AVC codec, where the features are extracted from the video bitstreams and consist of motion and coding parameters [19,20]. More recently, machine learning was used to predict several objective quality metrics with reasonable accuracy for the H.264/AVC codec [21]. That work was further enhanced using artificial neural networks with reduced complexity, as reported in [22].
The proposed system is divided into two stages, training and testing. In the training phase, video bit streams and their PSNR/SSIM values are used to build a regression model. The training system is illustrated in Fig. 1. CU features are extracted from a coded video stream. These features are further summarized on a frame basis to compute frame-level features. That is, the features of all CUs belonging to one frame are summarized by computing their mean and standard deviation, so that the numerical summarization captures both central tendency and dispersion. Each CU and each frame is represented by one feature vector.

The dimensionality of the CU and frame feature vectors is reduced using a statistical procedure known as stepwise regression [23], which is explained in the next section. Stepwise regression retains the features that affect the response variable the most, where the response variable in this case is the true PSNR/SSIM value. Hence, the inputs to this procedure are both the feature vectors and their corresponding PSNR/SSIM values. Clearly, the PSNR/SSIM values are not available during the testing phase; hence, in the training phase, the indices of the feature variables selected by the stepwise procedure are stored and reused in the testing phase. Once reduced in dimensionality, the feature vectors and their corresponding PSNR/SSIM values are used to compute a regression model using polynomial regression [24], which is reviewed in Section 4. The model weights are stored and used in the testing phase as well.

Fig. 1. Proposed training framework for objective quality estimation.

In the testing phase, CU features are extracted from a coded video bit stream and summarized into frame-level features, exactly as in the training phase. The proposed testing system is illustrated in Fig. 2. The dimensionality of the feature vectors is reduced simply by retaining the variables corresponding to the indices stored during the training stage. A CU and/or frame PSNR/SSIM is then predicted using the stored polynomial regression model. Again, the algorithm used for modeling and prediction is explained in Section 4.

It is worth mentioning that the proposed system is not restricted to the objective quality prediction of HEVC videos. It is generic enough to work with any block-based video codec. For instance, in this work we also consider the prediction of frame-level PSNR values for MPEG-2 videos. In an arrangement similar to the one discussed above, the MB-level features are numerically summarized into frame features and used to predict PSNR values. This approach builds on the work previously reported by the author for the prediction of MB-level PSNR values for MPEG-2 video [10].

The HEVC CU features that are extracted from a video bit stream are listed in Table 1. Clearly, all of these features can be extracted in both the training and the testing phases, as access to the original raw video is not required. The extraction is carried out at the decoder; hence, only the quantized coefficients and syntax elements are accessible. Since each CU can be partitioned into many PUs, further features are extracted for each PU. The values of these PU variables are then merged by computing their means and variances. In this work, the operations of the encoder are not modified.
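As an illustration of the summarization step, the following minimal Python sketch collapses the per-CU feature vectors of one frame into a single frame-level vector of means followed by standard deviations. The function name `summarize_frame`, the example values and the use of the population standard deviation are assumptions made for illustration; the text does not specify which estimator is used.

```python
from statistics import fmean, pstdev


def summarize_frame(cu_features):
    """Collapse per-CU feature vectors into one frame-level vector.

    cu_features: list of equal-length lists, one list per CU in the frame.
    Returns the per-feature means followed by the per-feature standard
    deviations, so the output is twice as long as one CU vector.
    """
    if not cu_features:
        raise ValueError("a frame needs at least one CU")
    columns = list(zip(*cu_features))  # one tuple per feature variable
    means = [fmean(col) for col in columns]
    # Assumption: population standard deviation (divide by the CU count).
    stds = [pstdev(col) for col in columns]
    return means + stds


# Three hypothetical CUs with two features each (e.g. bits, quantization step).
frame = summarize_frame([[120.0, 22.0], [80.0, 22.0], [100.0, 28.0]])
print(frame)
```

The same helper could be reused for the PU-to-CU merge by swapping the dispersion measure for a variance, which is what the text describes at that level.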
The HEVC encoder selects the best partitioning for a CU based on rate-distortion optimization. The features are extracted from the CU syntax elements and the individual PUs. Then, the features of the PUs are statistically summarized. The CU features and the summarized PU features are concatenated into one feature vector. Thus, the features at indices 19 to 45 in Table 1 appear twice, once as mean values and once as variances. This brings the total number of features to 72 (18 CU features plus 2 × 27 summarized PU features). For intra-coded slices or CUs, the motion information is set to 0 and the coding type is signaled in the feature vector.

As mentioned previously, the features of all CUs belonging to one frame are summarized by computing their mean and standard deviation. This results in one feature vector, which is used for frame-level quality prediction. The next section presents a solution in which the dimensionality of the feature vectors is reduced.

As for the MB-level MPEG-2 feature variables, we use the ones reported in [10]. They are similar in concept to the variables listed in Table 1. More specifically, they are based on syntax elements, motion information, coding types and texture statistical measures.

Fig. 2. Proposed testing framework for objective quality estimation.

Table 1
HEVC CU feature variables extracted from video bit streams.

ID   Feature description
CU features
1    Total number of bits in a CU
2    X coordinate of a CU in pixels
3    Y coordinate of a CU in pixels
4    Quantization step size
5    Quantization step size of previous CU
6    Variance of DCT coefficients
7    Mean of DCT coefficients
8    Skewness of DCT coefficients
9    Variance of residual pixels
10   Mean of residual pixels
11   Skewness of DCT residual pixels
12   Variance of reconstructed pixels
13   Mean of reconstructed pixels
14   Skewness of reconstructed pixels
15   Variance of prediction source
16   Mean of prediction source
17   Skewness of prediction source
18   Total number of partitions in a CU
PU features
19   Coding depth
20   Partition type (2N × 2N, 2N × N, …)
21   Partition width
22   Partition height
23   Coding mode
24   Transformation index
25   Merge flag
26   Merge index
27   Inter prediction direction
28   Coded block flag
29   xMV of list 0
30   yMV of list 0
31   Difference xMV of list 0
32   Difference yMV of list 0
33   xMV of list 1
34   yMV of list 1
35   Difference xMV of list 1
36   Difference yMV of list 1
37   Variance of residual pixels
38   Mean of residual pixels
39   Skewness of DCT residual pixels
40   Variance of reconstructed pixels
41   Mean of reconstructed pixels
42   Skewness of reconstructed pixels
43   Variance of prediction source
44   Mean of prediction source
45   Skewness of prediction source

4. System modeling

This section reviews the tools used in the proposed solution and formulates the problem using polynomial regression.

4.1. Selecting the important features

Important features can be objectively selected using the stepwise regression procedure. In this work, we use stepwise regression to select important features and to reduce the dimensionality of the feature vectors. As mentioned previously, the stepwise regression procedure is applied during the training phase only. The indices of the selected feature variables are then stored and used in the testing phase.

We treat the feature variables $f_1, f_2, \dots, f_m$ as predictors, where the subscript refers to the feature ID as outlined in Table 1 and m is the total number of features in each feature vector. All of the feature vectors belonging to the training set are organized into a feature matrix. Likewise, the actual PSNR/SSIM values are treated as a response variable, p.

The stepwise regression procedure described in [23] works as follows. In the first step, the procedure tests all possible one-predictor regression models in an attempt to find the predictor that has the highest correlation with the response variable. The model is of the following form:

\[ \hat{p} = \beta_0 + \beta_1 f_i \tag{1} \]

A hypothesis test is conducted for each model, where $H_0\colon \beta_1 = 0$ and $H_1\colon \beta_1 \neq 0$. The test is conducted using the well-known T test at a specific level of significance, for example $\alpha = 0.1$. The predictor that generates the largest absolute T value is selected. Refer to this predictor as $f_1$.

In the second step, the remaining m − 1 predictors are scanned for the best two-predictor regression model of the following form:

\[ \hat{p} = \beta_0 + \beta_1 f_1 + \beta_2 f_i \tag{2} \]

This is achieved by testing all two-predictor models containing $f_1$, which was selected in the first step. The T value of each of the m − 1 models is computed for $H_0\colon \beta_2 = 0$. The predictor that generates the highest absolute T value is retained; refer to this predictor as $f_2$. Now that $\beta_2 f_2$ is added to the model, the procedure goes back and re-examines the suitability of including $\beta_1 f_1$ in the model. If the corresponding T value has become insignificant (i.e. the null hypothesis $H_0\colon \beta_1 = 0$ can no longer be rejected), $f_1$ is removed and the predictors are searched for a variable that generates the highest T value in the presence of $\beta_2 f_2$.

In the third step, the remaining m − 2 predictors are scanned for the best three-predictor regression model of the form:

\[ \hat{p} = \beta_0 + \beta_1 f_1 + \beta_2 f_2 + \beta_3 f_i \tag{3} \]

At this point, the T values for $f_1$ and $f_2$ are computed, and if either of them has become insignificant after adding $f_3$, the corresponding variable is removed from the model. The procedure repeats the above steps until no further predictors are added to or removed from the model.

Again, this procedure is applied to the feature vectors prior to system modeling, which is explained next. Hence, only the important features are retained and used for modeling. The number of retained features is fixed for all feature vectors representing a CU or a video frame. In the experimental results section, we show the results of applying the stepwise regression procedure to the feature vectors of the CUs and the video frames.

4.2. Polynomial regression

In this work, PSNR/SSIM prediction is formulated using polynomial regression [13]. In the formulation, the extracted feature vectors are referred to as F and the actual PSNR/SSIM values are referred to as p. As mentioned in the previous section, the dimensionality of the feature vectors is reduced using stepwise regression. According to Fig. 1, if the PSNR/SSIM is predicted at a CU level, then these feature vectors belong to individual CUs. Otherwise, if the PSNR/SSIM is predicted at a frame level, the feature vectors belong to individual frames. The feature vectors, or predictors, of n CUs/frames are arranged into one feature matrix after applying the stepwise regression procedure. This matrix is denoted by F, as shown in the following equation:

\[ F = \begin{bmatrix} f_{1,1} & f_{2,1} & \dots & f_{m,1} \\ \vdots & \vdots & & \vdots \\ f_{1,n} & f_{2,n} & \dots & f_{m,n} \end{bmatrix} \tag{4} \]

The subscripts of the matrix elements $f_{j,i}$ ($j = 1, \dots, m$; $i = 1, \dots, n$) indicate the index of the feature variable and the corresponding CU/frame index, respectively.
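The stepwise selection of Section 4.1, which determines the columns retained in F, can be sketched in a few lines of Python. This is only an illustrative sketch and not the procedure of [23] verbatim: the function names are hypothetical, a fixed threshold on |T| (1.645, the large-sample two-sided critical value for α = 0.1) stands in for an exact T test, and the data are synthetic.

```python
import math


def _invert(a):
    """Gauss-Jordan inverse of a small square matrix with partial pivoting."""
    n = len(a)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for r in range(n):
            if r != c:
                fct = aug[r][c]
                aug[r] = [v - fct * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def t_values(F, p, cols):
    """Absolute T statistics of the slopes in the model p ~ 1 + F[:, cols]."""
    X = [[1.0] + [row[j] for j in cols] for row in F]
    k, n = len(cols) + 1, len(X)
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    xtp = [sum(r[a] * y for r, y in zip(X, p)) for a in range(k)]
    inv = _invert(xtx)
    beta = [sum(inv[a][b] * xtp[b] for b in range(k)) for a in range(k)]
    rss = sum((y - sum(b * x for b, x in zip(beta, r))) ** 2 for r, y in zip(X, p))
    sigma2 = max(rss / (n - k), 1e-30)  # guard against a perfect fit
    return [abs(beta[a]) / math.sqrt(sigma2 * inv[a][a]) for a in range(1, k)]


def stepwise_select(F, p, t_crit=1.645, max_rounds=100):
    """Return the sorted indices of the predictors kept by stepwise selection."""
    m, selected = len(F[0]), []
    for _ in range(max_rounds):
        changed = False
        # Forward step: add the candidate with the largest significant |T|.
        best, best_t = None, t_crit
        for j in range(m):
            if j not in selected:
                t = t_values(F, p, selected + [j])[-1]
                if t > best_t:
                    best, best_t = j, t
        if best is not None:
            selected.append(best)
            changed = True
        # Backward step: drop the weakest predictor if it became insignificant.
        if len(selected) > 1:
            ts = t_values(F, p, selected)
            worst = min(range(len(ts)), key=ts.__getitem__)
            if ts[worst] < t_crit:
                selected.pop(worst)
                changed = True
        if not changed:
            break
    return sorted(selected)


# Synthetic data: the response depends on features 0 and 2 only.
F = [[a, b, c] for a in (-1.0, 1.0) for b in (-1.0, 1.0) for c in (-1.0, 1.0)]
p = [2.0 + 3.0 * a - 1.5 * c + 0.1 * a * b * c for a, b, c in F]
print(stepwise_select(F, p))
```

In the training phase the returned indices would be stored; in the testing phase the same columns are simply picked out of each new feature vector without rerunning the selection.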
In polynomial regression, a nonlinear mapping is performed between the feature matrix F and the response variable p. To this end, each row of F is expanded into an r-th order polynomial using a reduced model polynomial expansion [24]. The expanded feature matrix is referred to as $X \in \mathbb{R}^{n \times k}$, where k is the dimensionality of the expanded feature vectors. According to [24], $k = 1 + r + m(2r - 1)$, where m denotes the number of feature variables. In this work we use a second order expansion (r = 2), which gives k = 3m + 3. A second order expanded feature vector consists of the following terms:

\[ \mathrm{expand}(f) = \Big[\, 1,\; f_j,\; f_j^2 \,\big|_{1 \le j \le m},\; (f_1 + f_2 + \dots + f_m),\; (f_1 + f_2 + \dots + f_m)^2,\; f_j (f_1 + f_2 + \dots + f_m) \,\big|_{1 \le j \le m} \Big] \tag{5} \]

For example, if a feature vector contains two variables $f_1$ and $f_2$, then the expanded feature vector contains the following k = 9 terms: $[\,1,\; f_1,\; f_2,\; f_1^2,\; f_2^2,\; (f_1 + f_2),\; (f_1 + f_2)^2,\; f_1 (f_1 + f_2),\; f_2 (f_1 + f_2)\,]$. A least-squares error criterion is used to perform the mapping between X and p as follows:

\[ \alpha_{opt} = \arg\min_{\alpha} \lVert X\alpha - p \rVert_2 \tag{6} \]

where $\lVert X\alpha - p \rVert_2$ denotes the $l_2$ norm. To predict a PSNR value, the feature vectors are extracted from the video bit stream. The feature vectors are then arranged into a feature matrix and expanded to the second order, which results in the X matrix. This expanded matrix is multiplied by the model weights $\alpha_{opt}$ to compute the predicted PSNR values $\hat{p}$ as follows:

\[ \hat{p} = X \alpha_{opt} \tag{7} \]
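The second-order expansion, the least-squares fit of Eq. (6) and the prediction of Eq. (7) can be illustrated as follows. This is a hedged sketch: the function names and the synthetic data are hypothetical, and a tiny ridge term is added to the normal equations because the expanded terms are linearly dependent (for example, the term (f_1 + f_2) is the sum of the terms f_1 and f_2). That regularization is an assumption of the sketch, not something stated in the text.

```python
def expand(f):
    """Second-order reduced polynomial expansion of one feature vector."""
    s = sum(f)
    return ([1.0] + list(f) + [v * v for v in f]
            + [s, s * s] + [v * s for v in f])


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for r in range(n):
            if r != c:
                fct = aug[r][c]
                aug[r] = [v - fct * w for v, w in zip(aug[r], aug[c])]
    return [row[n] for row in aug]


def fit(F, p, ridge=1e-8):
    """Model weights from the (slightly regularized) normal equations."""
    X = [expand(row) for row in F]
    k = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) + (ridge if a == b else 0.0)
            for b in range(k)] for a in range(k)]
    xtp = [sum(r[a] * y for r, y in zip(X, p)) for a in range(k)]
    return solve(xtx, xtp)


def predict(F, alpha):
    """Predicted responses: expanded feature rows times the stored weights."""
    return [sum(w * x for w, x in zip(alpha, expand(row))) for row in F]


# Synthetic training set: two features on a grid, quadratic response.
grid = (-1.0, -0.5, 0.0, 0.5, 1.0)
F = [[a, b] for a in grid for b in grid]
p = [30.0 + 2.0 * a - b + 0.5 * a * a for a, b in F]
alpha = fit(F, p)
print(len(expand(F[0])), predict([[0.25, -0.75]], alpha))
```

For m = 2 the expansion has 9 terms, in agreement with k = 1 + r + m(2r − 1) for r = 2. Only `alpha` needs to be stored after training; prediction requires no access to the true PSNR/SSIM values.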
5. Experimental results

The proposed system is implemented using the HEVC reference software HM13 [25] and MPEG-2 [26]. The video sequences used, together with their resolutions and bitrates, are listed in Table 2.

Table 2
Test video sequences and their properties.

ID   Name              Resolution            Bitrate-1 (Mb/s)   Bitrate-2 (Mb/s)
1    BasketBallPass    416 × 240 (60 Hz)     1                  3
2    BQSquare          416 × 240 (60 Hz)     1                  3
3    BasketBallDrill   832 × 480 (30 Hz)     1.5                3.5
4    Horses            832 × 480 (30 Hz)     1.5                3.5
5    Mall              832 × 480 (30 Hz)     1.5                3.5
6    Party             832 × 480 (30 Hz)     1.5                3.5
7    City              1280 × 720 (60 Hz)    2.0                6
8    BasketBallDrive   1920 × 1080 (50 Hz)   3.0                6
9    Cactus            1920 × 1080 (50 Hz)   3.0                6
10   Crew              352 × 288 (25 Hz)     0.5                1.5
11   Coastguard        352 × 288 (25 Hz)     0.5                1.5
12   Foreman           352 × 288 (25 Hz)     0.5                1.5
13   Football          352 × 288 (25 Hz)     0.5                1.5
14   Walk              352 × 288 (25 Hz)     0.5                1.5

Table 3
Frame-level feature selection using stepwise regression.
[Table body not recoverable from the extracted text. Rows: feature IDs 1–72, grouped as CU features, mean of PU features and variance of PU features; columns: sequence IDs 1–14; an "x" marks a retained feature; the last row gives the total number of retained features per sequence.]
In the experiments to follow, we report the results using the first set of bitrates, bitrate-1. We then repeat the experiments using bitrate-2 to confirm that the proposed solution works for both sets of bitrates. In HEVC, the coding structure used is IPPPPP… with 4 reference frames. The maximum CU size is set to 64 × 64. The asymmetric motion partitions tool and the adaptive loop filter tool are both enabled. The default fast motion estimation (a modified EPZS) and fast mode decisions are used. Constant bitrate coding at the CU level is enabled. The coding configuration for MPEG-2 is similar; however, only one reference frame is used and the MB size is 16 × 16. The same constant bitrate values are used in both coders.

We start by showing the results of applying the stepwise regression procedure for the prediction of frame-level PSNR values. Table 3 lists the feature variables that are retained for the video sequences listed in Table 2. The videos are compressed with HEVC using the above-mentioned coder configuration. As shown in the table, the number of features retained by the stepwise regression procedure ranges from 6 to 15. The retained features are used for building the regression model as explained in Section 4. When this experiment is repeated at the CU level, as opposed to the frame level, the number of retained features is higher, as shown in Table 4. This increase is expected: at the frame level, all of the CU data is summarized, so the resultant features are rich and representative, whereas the features available at the CU level are not as representative.

The proposed system is assessed and compared against existing work using two performance attributes: prediction accuracy and prediction consistency. These attributes are proposed by the Video Quality Experts Group (VQEG) [27].
The Pearson linear correlation coefficient is used to assess the prediction accuracy and the Root Mean Square Error (RMSE) measure is used to assess the prediction consistency. Clearly, the objective is to minimize the RMSE and to maximize the correlation.

The frame-level PSNR prediction results using the proposed system are reported in Table 5. The feature vectors of each video sequence were divided into 25% for training and 75% for testing. Table 5 shows the prediction results using MPEG-2 and HEVC coded sequences. The prediction results indicate that the proposed system is capable of predicting the frame-level PSNR values with high accuracy. The results also indicate that the proposed system is generic enough to accommodate different block-based video formats, such as MPEG-2 and HEVC, with various spatial/temporal resolutions.

Although this is the first work to report a polynomial regression-based approach for the prediction of HEVC PSNR values, we compare our solution against that reported in [13]. Again, the reviewed work is not based on machine learning and does not apply if the QP varies; hence it is not suitable for constant bitrate coding. Sequences 1–9 listed in Table 2 are used in [13]; hence, Table 6 presents the average correlation factor and average RMSE over these 9 sequences in comparison to the results reported in [13].
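The two measures named above, the Pearson linear correlation coefficient for accuracy and the RMSE for consistency, can be computed as in the following minimal sketch. The function names and the four PSNR values are hypothetical and serve only to make the definitions concrete.

```python
import math


def pearson(actual, predicted):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(actual)
    ma, mp = sum(actual) / n, sum(predicted) / n
    cov = sum((a - ma) * (q - mp) for a, q in zip(actual, predicted))
    va = sum((a - ma) ** 2 for a in actual)
    vp = sum((q - mp) ** 2 for q in predicted)
    return cov / math.sqrt(va * vp)


def rmse(actual, predicted):
    """Root mean square error, in the same unit as the inputs (dB for PSNR)."""
    return math.sqrt(sum((a - q) ** 2 for a, q in zip(actual, predicted)) / len(actual))


actual = [34.0, 35.0, 36.0, 37.0]       # hypothetical true frame PSNRs in dB
predicted = [34.5, 34.5, 36.5, 36.5]    # hypothetical predicted PSNRs in dB
print(rmse(actual, predicted), pearson(actual, predicted))
```

A high correlation with a large RMSE would indicate predictions that track the trend but are offset, which is why both measures are reported together.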
Table 4
CU-level feature selection using stepwise regression.

Sequence ID   Number of features
1             24
2             31
3             35
4             33
5             33
6             35
7             30
8             30
9             39
10            23
11            24
12            23
13            13
14            23

Table 5
Frame-level PSNR prediction results using the proposed system.

Seq. ID   MPEG-2 RMSE (dB)   MPEG-2 Corr.   HEVC RMSE (dB)   HEVC Corr.
1         0.7                0.99           0.6              0.98
2         0.1                0.98           0.2              0.99
3         0.2                0.93           0.28             0.94
4         0.2                0.99           1.1              0.92
5         0.1                0.98           0.3              0.97
6         0.2                0.98           0.6              0.96
7         0.12               0.96           0.5              0.9
8         0.1                0.99           0.5              0.95
9         0.1                0.97           0.4              0.96
10        0.24               0.98           0.5              0.969
11        0.12               0.981          0.93             0.954
12        0.23               0.938          0.46             0.956
13        0.5                0.979          0.86             0.98
14        0.35               0.975          0.6              0.96
Avg.      0.23               0.97           0.56             0.96

Table 6
HEVC frame-level PSNR prediction, comparison with existing work.

            Average RMSE (dB)   Average Corr.
Bitrate-1   0.5                 0.95
Bitrate-2   0.48                0.96
Reviewed    0.6                 0.98

The results indicate that the proposed solution has a slight advantage in terms of RMSE and a slight disadvantage in terms of average correlation. Although the proposed and reviewed solutions are of different natures, the comparison indicates that the proposed solution is on par with state-of-the-art PSNR estimation algorithms. An extra advantage offered by the proposed solution is that it can be applied to videos coded at constant bitrates. Another advantage of the proposed solution is that it can be applied at a CU level; hence the PSNR can be estimated for each and every CU in a given frame. This is important in cases where region-based PSNR estimates are vital. One such scenario is the quality assessment of video material used as legal evidence in a court of law, as mentioned in the introduction [10,11].

The HEVC CU-level PSNR prediction results are listed in Table 7. The results show that the proposed solution can also be used to predict the PSNR at a CU level. The prediction results are not as accurate as those at the frame level. This is because the features available at the frame level are numerical summaries of the CU-level features; hence the frame-level features are much richer and better represent a video frame. For most sequences, the RMSE values range from 0.7 to 1.5 dB in PSNR. While such results are acceptable, there is still room for improvement in predicting the PSNR at a CU level in HEVC videos.

Table 7
HEVC CU-level PSNR prediction results.

Seq. ID   RMSE (dB)   Corr.
1         1           0.94
2         0.7         0.9
3         0.8         0.85
4         1.6         0.9
5         1           0.9
6         1.1         0.86
7         1.2         0.95
8         1.2         0.88
9         1.5         0.85
10        1.0         0.91
11        0.87        0.90
12        1.16        0.83
13        0.07        0.97
14        1.3         0.93
Avg.      1.0         0.89

The HEVC results in Tables 5 and 7 for the prediction of PSNR at frame and CU levels are repeated using bitrate-2. The summaries of both predictions are presented in Table 8. As seen in the table, the results are very close. Yet, increasing the bitrate results in less variation in the actual PSNR values and in slightly higher prediction accuracy.

Table 8
Summary of HEVC PSNR prediction using two sets of bitrates.

            Frame-based RMSE   Frame-based Corr.   CU-based RMSE   CU-based Corr.
Bitrate-1   0.56               0.96                1.0             0.89
Bitrate-2   0.468              0.966               0.842           0.918

In Fig. 3, we show example histograms of the difference between the actual and predicted PSNRs at both frame and CU levels. In the figure, the x-axis represents the PSNR difference in dB and the y-axis is the count. The top histograms are generated for frame-level prediction, whilst the lower ones are generated for CU-level prediction. The histograms on the left show examples of good predictions, and the histograms on the right show examples where the prediction was not as accurate; hence their tails are longer and their standard deviation is higher.

In addition, the CU and frame level quality predictions of HEVC are repeated using SSIM instead of PSNR. In this case the model training is based on CU and frame level SSIM values. The results are shown in Table 9. For the frame-based quality prediction, the average correlation factor is 0.92 and the average RMSE is 0.008; recall that SSIM values range from 0 to 1. The SSIM prediction results at the CU level are less accurate, with an average correlation factor of 0.88 and an average RMSE of 0.02. This difference in accuracy is consistent with the PSNR results reported in Tables 5 and 7 above.

Again, the HEVC results in Table 9 for the prediction of SSIM at frame and CU levels are repeated using bitrate-2. The summaries of both predictions are presented in Table 10. As seen in the table, the results are very close. Yet, increasing the bitrate results in less variation in the actual SSIM values and in slightly higher prediction accuracy. This is the same conclusion as for the PSNR comparisons of Table 8 above.

Recently, work has been reported on HEVC no-reference SSIM quality assessment in the presence of transmission distortions and losses [28]. Six video sequences are used and the average correlation factor reported was 76.5%. Although a direct comparison is not possible in this case, this gives an indication that the prediction accuracy of the proposed solution is acceptable.

Lastly, since the proposed feature extraction encompasses statistical operations, it is worth reporting its required processing time.
The feature extraction time per video frame is reported in Table 11 for all of the spatial resolutions used. The feature extraction code was written using C þ þ and Matlab (R2013a) running on Microsoft Windows 7, 64 bits. The processer is Intels Core™ i7 CPU @ 2.7 GHz with 16 GB of RAM. Although the code is written for research and not for production purposes, nonetheless, the results indicate that the processing times per frame are fairly reasonable. 6. Conclusion An objective quality estimation framework is proposed in this paper. The framework is implemented using MPEG-2 and HEVC. Features are extracted from coding units and summarized to form features at frame levels. The proposed solution used stepwise regression to retain the important features and to reduce the dimensionality of the feature vectors. Thereafter, polynomial regression is used for system modeling. It was shown that the proposed system can predict the PSNR/SSIM of coding units and frames. Fourteen video sequences with various resolutions are used in the experimental results. The average PSNR
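Fig. 3. PSNR difference histograms for frame and CU level predictions.

Table 9 HEVC CU and frame level SSIM prediction results.

Seq. ID | Frame-based RMSE | Frame-based Corr. | CU-based RMSE | CU-based Corr.
--------|------------------|-------------------|---------------|---------------
1       | 0.005            | 0.96              | 0.004         | 0.9
2       | 0.002            | 0.97              | 0.001         | 0.88
3       | 0.009            | 0.94              | 0.009         | 0.85
4       | 0.004            | 0.95              | 0.005         | 0.93
5       | 0.017            | 0.91              | 0.2           | 0.83
6       | 0.002            | 0.9               | 0.004         | 0.96
7       | 0.019            | 0.9               | 0.005         | 0.83
8       | 0.005            | 0.94              | 0.02          | 0.9
9       | 0.001            | 0.9               | 0.002         | 0.81
10      | 0.007            | 0.94              | 0.005         | 0.87
11      | 0.01             | 0.9               | 0.007         | 0.95
12      | 0.002            | 0.91              | 0.002         | 0.8
13      | 0.017            | 0.92              | 0.01          | 0.9
14      | 0.007            | 0.88              | 0.003         | 0.89
Avg.    | 0.008            | 0.92              | 0.02          | 0.88

Table 10 Summary of HEVC SSIM prediction using two sets of bitrates.

Bitrate   | Frame-based RMSE | Frame-based Corr. | CU-based RMSE | CU-based Corr.
----------|------------------|-------------------|---------------|---------------
Bitrate-1 | 0.008            | 0.92              | 0.02          | 0.88
Bitrate-2 | 0.006            | 0.934             | 0.012         | 0.902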
prediction correlation factor at the frame level is above 95% and the average RMSE is less than or equal to 0.6 dB, whereas the average SSIM prediction correlation factor at the frame level is above 92% and the average RMSE is 0.008. Therefore, the no-reference objective quality of HEVC coded video frames can be predicted with reasonable accuracy. At the CU level, the prediction results are less accurate, and further research in this direction is warranted.
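The two figures of merit quoted above, the Pearson correlation factor and the RMSE, can be computed from paired actual and predicted quality values. The following is a minimal sketch only: the function names and the sample PSNR values are illustrative assumptions and are not taken from the implementation used in this paper.

```python
import math


def pearson_correlation(actual, predicted):
    """Pearson correlation factor between two equal-length sequences."""
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    # Sums of centred cross-products and centred squares.
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_p)


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    n = len(actual)
    if n != len(predicted) or n == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared_error / n)


# Hypothetical frame-level PSNR values in dB, made up for illustration only.
actual_psnr = [34.2, 35.1, 33.8, 36.0, 34.9]
predicted_psnr = [34.0, 35.4, 33.5, 35.6, 35.2]

print(f"Corr. = {pearson_correlation(actual_psnr, predicted_psnr):.3f}")
print(f"RMSE  = {rmse(actual_psnr, predicted_psnr):.3f} dB")
```

The same two functions apply unchanged to SSIM values; only the unit of the RMSE differs, since SSIM is dimensionless and ranges from 0 to 1.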
Table 11 Feature extraction time per frame using various spatial resolutions.

Spatial resolution | Processing time [s]
-------------------|--------------------
416 × 240          | 0.0151
832 × 480          | 0.0602
1280 × 720         | 0.1390
1920 × 1080        | 0.3126
352 × 288          | 0.0153
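To put the per-frame times of Table 11 in perspective, they can be converted into frame rates and pixel throughput. The sketch below uses only the values listed in Table 11; the helper names are hypothetical and the calculation is a back-of-the-envelope illustration, not part of the reported experiments.

```python
# Per-frame feature extraction times from Table 11: (width, height) -> seconds.
TABLE_11 = {
    (416, 240): 0.0151,
    (832, 480): 0.0602,
    (1280, 720): 0.1390,
    (1920, 1080): 0.3126,
    (352, 288): 0.0153,
}


def frames_per_second(seconds_per_frame):
    """Number of frames whose features can be extracted in one second."""
    return 1.0 / seconds_per_frame


def pixels_per_second(width, height, seconds_per_frame):
    """Pixel throughput of the feature extraction stage."""
    return (width * height) / seconds_per_frame


for (width, height), seconds in TABLE_11.items():
    fps = frames_per_second(seconds)
    mpx = pixels_per_second(width, height, seconds) / 1e6
    print(f"{width}x{height}: {fps:5.1f} frames/s, {mpx:.2f} Mpixel/s")
```

With the Table 11 figures, the throughput comes out at roughly 6.6 megapixels per second for every resolution, that is, about 3 frames per second at 1920 × 1080 and about 66 frames per second at 416 × 240. This suggests that the extraction cost grows approximately linearly with the number of pixels per frame.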
References

[1] K. Chono, Y.-C. Lin, D. Varodayan, Y. Miyamoto, B. Girod, Reduced-reference image quality assessment using distributed source coding, in: Proceedings of the IEEE ICME, Hannover, Germany, June, 2008.
[2] L. Yu-xin, K. Ragip, B. Udit, Video classification for video quality prediction, J. Zhejiang Univ. Sci. A 7 (5) (2006) 919–926.
[3] G. Valenzise, S. Magni, M. Tagliasacchi, S. Tubaro, No-reference pixel video quality monitoring of channel-induced distortion, IEEE Trans. Circuits Syst. Video Technol. 22 (4) (2012) 605–618.
[4] D. Schroeder, A. El Essaili, E. Steinbach, D. Staehle, M. Shehada, Low-complexity no-reference PSNR estimation for H.264/AVC encoded video, in: Proceedings of the 20th International Packet Video Workshop (PV), pp. 1–6, 12–13 December, 2013.
[5] A. Ichigaya, M. Kurozumi, N. Hara, Y. Nishida, E. Nakasu, A method of estimating coding PSNR using quantized DCT coefficients, IEEE Trans. Circuits Syst. Video Technol. 16 (2) (2006) 251–259.
[6] T. Brandao, M.P. Queluz, Blind PSNR estimation of video sequences using quantized DCT coefficient data, in: Proceedings of the Picture Coding Symposium, Lisbon, Portugal, November, 2007.
[7] T. Na, M. Kim, A novel no-reference PSNR estimation method with regard to deblocking filtering effect in H.264/AVC bitstreams, IEEE Trans. Circuits Syst. Video Technol. 24 (2) (2014) 320–330.
[8] F. Yang, S. Wan, Bitstream-based quality assessment for networked video: a review, IEEE Commun. Mag. 50 (11) (2012) 203–209.
[9] F. Yang, S. Wan, Q. Xie, H. Wu, No-reference quality assessment for networked video via primary analysis of bit stream, IEEE Trans. Circuits Syst. Video Technol. 20 (11) (2010) 1544–1554.
[10] T. Shanableh, No-reference PSNR identification of MPEG video using spectral regression and reduced model polynomial networks, IEEE Signal Process. Lett. 17 (8) (2010) 735–738.
[11] T. Shanableh, F. Ishtiaq, Macroblock level quality assessment using video-independent classification, in: Proceedings of the 9th International Symposium on Mechatronics and its Applications (ISMA), April, 2013.
[12] T. Shanableh, Prediction of structural similarity index of compressed video at a macroblock level, IEEE Signal Process. Lett. 18 (5) (2011) 335–338.
[13] B. Lee, M. Kim, No-reference PSNR estimation for HEVC encoded video, IEEE Trans. Broadcast. 59 (1) (2013) 20–27.
[14] Z. Wang, A. Bovik, H. Sheikh, E. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612.
[15] ISO/IEC 23008-2:2013, Information technology – High efficiency coding and media delivery in heterogeneous environments – Part 2: High efficiency video coding, 2013.
[16] G. Sullivan, J.-R. Ohm, W.-J. Han, T. Wiegand, Overview of the high efficiency video coding (HEVC) standard, IEEE Trans. Circuits Syst. Video Technol. 22 (12) (2012) 1649–1668.
[17] C. Wang, X. Jiang, F. Meng, Y. Wang, Quality assessment for MPEG-2 video streams using a neural network model, in: Proceedings of the IEEE 13th International Conference on Communication Technology (ICCT), September, 2011.
[18] T. Shanableh, F. Ishtiaq, Pattern classification for assessing the quality of MPEG surveillance video, in: Proceedings of the IEEE International Conference on Computer Systems and Industrial Informatics (ICCSII’12), Sharjah, UAE, December, 2012.
[19] A. Rossholm, B. Lövström, A new low complex reference free video quality predictor, in: Proceedings of the IEEE 10th Workshop on Multimedia Signal Processing, pp. 765–768, October, 2008.
[20] A. Rossholm, B. Lövström, A new video quality predictor based on decoder parameter extraction, in: Proceedings of SIGMAP, pp. 285–290, 2008.
[21] M. Shahid, A. Rossholm, B. Lövström, A no-reference machine learning based video quality predictor, in: Proceedings of the Fifth International Workshop on Quality of Multimedia Experience (QoMEX), pp. 176–181, July, 2013.
[22] M. Shahid, A. Rossholm, B. Lövström, A reduced complexity no-reference artificial neural network based video quality predictor, in: Proceedings of the 4th International Congress on Image and Signal Processing (CISP), pp. 517–521, October, 2011.
[23] W. Mendenhall, T. Sincich, Statistics for Engineering and Sciences, 5th ed., Pearson, 2007.
[24] K.-A. Toh, Q.-L. Tran, D. Srinivasan, Benchmarking a reduced multivariate polynomial pattern classifier, IEEE Trans. Pattern Anal. Mach. Intell. 26 (6) (2004) 740–755.
[25] I.-K. Kim, K.D. McCann, K. Sugimoto, B. Bross, W.-J. Han, G.J. Sullivan, High Efficiency Video Coding (HEVC) Test Model 13 (HM13) Encoder Description, Document: JCTVC-O1002, Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, 15th Meeting: Geneva, CH, 23 Oct. – 1 November, 2013.
[26] MPEG Software Simulation Group, implementation of ISO/IEC DIS 13818-2 TM5, available online 〈http://www.mpeg.org/MSSG/〉.
[27] VQEG, Final report from the video quality experts group on the validation of objective models of video quality assessment, phase II, 〈www.vqeg.org〉, Tech. Rep., August, 2003.
[28] M. Abed, G. AlRegib, No-reference quality assessment of HEVC videos in loss-prone networks, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4–9 May, 2014.