Signal Processing: Image Communication 29 (2014) 303–315
Signal Processing: Image Communication journal homepage: www.elsevier.com/locate/image
SSIM-based error-resilient rate-distortion optimization of H.264/AVC video coding for wireless streaming
Pinghua Zhao a,b, Yanwei Liu a,*, Jinxia Liu c, Song Ci a,d,**, Ruixiao Yao a,b
a Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
b University of Chinese Academy of Sciences, Beijing, China
c Zhejiang Wanli University, Ningbo, China
d University of Nebraska-Lincoln, Omaha, USA
Article info
Article history: Received 23 May 2013; received in revised form 25 December 2013; accepted 29 December 2013; available online 8 January 2014.
Keywords: SSIM; Error resilience; H.264/AVC; Rate-distortion optimization; Video streaming

Abstract
The SSIM-based rate-distortion optimization (RDO) has been verified to be an effective tool for improving the perceptual video coding performance of H.264/AVC. However, the current SSIM-based RDO is not efficient at improving the perceptual quality of video streaming over error-prone networks, because it does not consider the transmission-induced distortion in the encoding process. In this paper, an SSIM-based error-resilient RDO scheme for H.264/AVC is proposed to improve wireless video streaming performance. Firstly, with the help of the SSE-based RDO, we present a low-complexity Lagrange multiplier decision method for SSIM-based RDO video coding in the error-free environment. Then, the SSIM-based decoding distortion at the user end is estimated at the encoder and introduced into the RDO, so that the transmission-induced distortion is accounted for in the encoding process. Further, the Lagrange multiplier is theoretically derived to optimize the encoding mode selection in the error-resilient RDO process. Experimental results show that the proposed SSIM-based error-resilient RDO obtains better perceptual video quality (more structural information) than the traditional SSE-based error-resilient RDO for wireless video streaming at the same bit rate. © 2014 Elsevier B.V. All rights reserved.
* Corresponding author.
** Corresponding author at: Institute of Acoustics, Chinese Academy of Sciences, Beijing, China.
E-mail address: [email protected] (Y. Liu).
0923-5965/$ - see front matter © 2014 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.image.2013.12.004

1. Introduction
With the increasing data transmission capabilities of wireless networks and the advances in video coding technologies, wireless video applications such as video conferencing, video surveillance and video sharing are becoming more and more popular. However, it is still very challenging to maintain a good visual experience for end users because of the time-varying characteristics of wireless networks. Firstly, video data transmission over the wireless channel is unreliable because of signal interference and fading in the open environment. Even though most current video applications are transmitted using TCP/IP, unlimited retransmission of lost video packets is not available for some applications, especially real-time video applications. Completely reliable video packet transmission therefore cannot be maintained even for wireless video streaming applications using TCP/IP. Secondly, unlike plain data, the compressed video bit stream has high decoding dependency because of the spatial and temporal prediction techniques used in the encoding process. Therefore, the video quality may suffer considerably from packet loss and from the errors propagated by referenced video packets that are incorrectly decoded [1]. To restrain the effects of packet loss on the video quality perceived by end users, researchers have proposed
several solutions to improve the error robustness of video streaming. One strategy is to enhance the reliability of the communication link to prevent the loss of transmitted video packets, for example by forward error correction (FEC) and automatic repeat request (ARQ). Another is to suppress the quality degradation caused by video packet loss, for example by error concealment and error-resilient video coding [2]. Usually, FEC and ARQ operate at the transmission system level, and error concealment is used at the decoder. To improve the robustness of video streaming, the only part that the video encoder can easily control is the error-resilient video coding process. Adaptive insertion of intra-coded macroblocks (MBs) in the encoding process has been verified to be an efficient tool for preventing error propagation, because decoding an intra-coded MB does not need to refer to previous frames, which may not have been decoded correctly. For an inter-coded MB, in contrast, even if the current video packet is received correctly, the quality of the decoded MB may still suffer from errors propagated from previous frames [3]. However, increasing the number of intra-coded MBs leads to a larger amount of coding bits and correspondingly decreases the coding efficiency.

The intra/inter encoding mode is determined by the rate-distortion optimization (RDO) process [4] in video coding. As discussed in the previous paragraph, intra-coded MBs can restrain error propagation at the cost of decreased coding efficiency. Therefore, how to achieve a good trade-off between coding efficiency and error resilience is an important issue to be addressed [5]. The early algorithms, such as randomly refreshing intra-coded MBs or inserting intra-coded frames at a certain frequency, are easy to implement and can achieve some perceived quality improvement.
However, the rate-distortion performance of the transmitted video data is not considered in these algorithms. Further, the refreshing frequency and the locations of the inserted intra-coded MBs usually do not adapt well to dynamic channel conditions. Alternatively, several error-resilient RDO schemes have been proposed for video transmission in the error-prone environment [2,3,5–9]. In [6], the authors optimized the video coding mode selection and the placement of synchronization markers in the compressed bit stream using end-to-end RDO, which considered the channel condition and the error concealment method used by the decoder. In [7], the recursive optimal per-pixel estimate (ROPE) method was proposed to estimate the end-to-end decoding distortion, and the distortion estimate was further introduced into a rate-distortion based framework for selecting the intra/inter encoding mode. In [8], the authors proposed a method that estimates the decoding distortion by calculating the expectation of the distortions of K possible reconstructed MBs, and then introduced the estimated distortion into the RDO process to improve the error resilience of the coded video data. In addition, the authors discussed how to choose the Lagrange multiplier for error-resilient video coding. In [9], the decoding distortion was estimated as the sum of several separable distortion terms, including the source distortion, the error-propagated distortion from the reference frame and the error concealment induced distortion. This kind of estimation could suppress the approximation errors caused
by the pixel averaging operations. Correspondingly, a new Lagrange multiplier was derived for the end-to-end RDO process.

Although the previously proposed error-resilient schemes can improve the rate-distortion performance to some degree, some issues remain for rate-distortion optimized video coding in the error-prone environment. Firstly, the distortion metric adopted by the error-resilient RDO in the previous schemes is the sum of squared errors (SSE) or the sum of absolute differences (SAD). These metrics are simple to calculate and have clear physical meanings. However, the video quality they measure is usually not well matched with the Human Visual System (HVS) [10,11]. Recently, many new objective metrics, such as the Structural Similarity (SSIM) index [12], the Visual Information Fidelity (VIF) criterion [13] and the Visual Signal-to-Noise Ratio (VSNR) [14], have been proposed to measure video quality. Among them, the SSIM index has been shown to be more accurate and simpler to calculate [11]. To improve the perceptual video coding performance, researchers have introduced the SSIM metric into the video encoder to measure the encoding distortion [15–25]. These works improve the SSIM-based rate-distortion performance of video streaming in the error-free environment, but they do not work efficiently in the error-prone environment because they do not consider the effects of the transmission condition and the error concealment on the perceived video quality. In [26], the authors introduced the SSIM metric into the error-resilient RDO, and the proposed scheme improved the SSIM-based rate-distortion performance. However, the SSIM-based error-resilient Lagrange multiplier used in that scheme was obtained with the help of a priori statistical experiments, which are not practical for real-time applications.
Further, that method only considered the relationship between the SSIM-based distortion and the bit rate at the sequence level, and rarely considered the relationship at the frame level. Thus, the accuracy of the SSIM-based Lagrange multiplier might not be acceptable for some coding units. Secondly, the previous error-resilient RDO schemes rarely consider the effect of the packet bit length on the packet loss probability. The packet loss probability of each video packet is always assumed to be fixed and known to the encoder before the current video slice is encoded. In the wireless transmission environment, this assumption is usually unreasonable.

In this paper, we first propose a low-complexity SSIM-based Lagrange multiplier decision method for SSIM-based RDO video coding in the error-free environment. Then, for video streaming over error-prone wireless networks, we propose an efficient estimation method for the SSIM-based decoding distortion, which includes the quantization induced distortion, the propagation error and the error concealment induced distortion. Lastly, the estimated SSIM-based decoding distortion is introduced into the video coding process to select the error-resilient coding modes for video streaming, and the Lagrange multiplier in the error-resilient RDO process is theoretically analyzed correspondingly.

The rest of this paper is organized as follows. In Section 2, the low-complexity SSIM-based Lagrange multiplier
decision method for video coding without packet loss is presented. In Section 3, the proposed SSIM-based decoding distortion estimation method is explained, and the Lagrange multiplier adjustment strategy in the error-prone environment is analyzed in detail. Section 4 shows the experimental results. Finally, Section 5 concludes this paper.

2. SSIM-based RDO for H.264/AVC video coding in the error-free environment
2.1. Traditional SSE-based RDO
In the H.264/AVC coding standard, a variety of candidate encoding modes, such as INTRA16×16, INTRA4×4, INTER16×16, INTER16×8, INTER8×16, INTER8×8 and SKIP, are used to improve the video coding performance. Different encoding modes achieve different degrees of video fidelity with correspondingly different amounts of encoding bits. The optimal encoding mode is the one that reaches the best trade-off between the coding bits amount and the obtained video quality. This problem can be modeled as

$$\min_{\{m\}} \{D\} \quad \text{subject to } R \le R_c, \tag{1}$$

which indicates that the video encoder should minimize the perceived distortion $D$, with the encoding bits amount $R$ subject to a constraint $R_c$, by selecting the appropriate encoding mode $m$ [4]. In this paper, the luminance component is used to calculate the distortion $D$. To solve the constrained RDO problem in a practical video coding process, the Lagrange optimization method is used to make the problem unconstrained as

$$\min_{\{m\}} \{J\} \quad \text{with } J = D + \lambda \cdot R, \tag{2}$$

where $J$ is the Lagrange cost and $\lambda$ is the Lagrange multiplier for the RDO process. In the traditional Lagrange optimization, distortion metrics such as SSE and SAD are used to measure the video quality, and the Lagrange multiplier is used to balance the SSE or SAD based distortion and the coding bits amount.

2.2. SSIM-based distortion metric
Unlike the SSE-based distortion metrics, which describe signal errors, SSIM considers image degradation as a perceived change in structural information. It evaluates the perceptual quality based on the assumption that the HVS is very sensitive to the structural information of the viewing field. The SSIM index is defined to objectively measure the similarities of local luminance, contrast and structure between an original image and a distorted image, and it can be calculated in windows of different sizes (block unit or image unit) for two images. Given two image signal windows $x$ and $y$, the local SSIM index of the two windows is defined as

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}, \tag{3}$$

where $\mu_x$, $\sigma_x$ and $\sigma_{xy}$ are the mean, the standard deviation and the cross correlation between the two image windows, respectively. $C_1$ and $C_2$ are used to maintain stability when the means and variances are close to zero [10]. The value of the SSIM index is limited to the range 0–1, and it equals 1 for two image windows with the same content. As the SSIM index indicates the degree of similarity of two image windows, the SSIM-based distortion (DSSIM) for two image windows $x$ and $y$ can be defined as

$$\mathrm{DSSIM}(x, y) = 1 - \mathrm{SSIM}(x, y). \tag{4}$$
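As an illustration of (3) and (4), the following minimal Python sketch computes the local SSIM and DSSIM of two windows given as flat lists of luminance samples. The function names, the constants C1 = (0.01·255)² and C2 = (0.03·255)² (values commonly used for 8-bit images) and the use of population (biased) variance and covariance are assumptions made for this sketch; the text above does not specify them.

```python
from statistics import fmean

def ssim_window(x, y, c1=6.5025, c2=58.5225):
    """Local SSIM of two equally sized windows (flat lists of luminance values).

    c1 and c2 are assumed defaults for 8-bit samples, not values from the paper.
    """
    if len(x) != len(y) or not x:
        raise ValueError("windows must be non-empty and equally sized")
    n = len(x)
    mu_x, mu_y = fmean(x), fmean(y)
    # Population variance and covariance (an assumption of this sketch).
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

def dssim_window(x, y):
    """SSIM-based distortion of Eq. (4): DSSIM = 1 - SSIM."""
    return 1.0 - ssim_window(x, y)

# Hypothetical 2x2 windows flattened to lists.
original = [10, 20, 30, 40]
distorted = [12, 18, 33, 39]
print(dssim_window(original, original))   # approximately 0 for identical windows
print(dssim_window(original, distorted))  # positive for differing windows
```

In an encoder, such a function would be evaluated per block on the luminance component and averaged to obtain a frame-level DSSIM.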
2.3. SSIM-based RDO in the error-free environment
When the video coding distortion is characterized by the SSIM-based distortion, the Lagrange optimization scheme in (2) can be further modeled as

$$\min_{\{m\}} \{J\} \quad \text{with } J = \mathrm{DSSIM} + \lambda_{ssim} \cdot R, \tag{5}$$
where DSSIM denotes the SSIM-based distortion and $\lambda_{ssim}$ is the Lagrange multiplier for the SSIM-based RDO. Because the video distortion is measured by the SSIM metric, $\lambda_{ssim}$ should be selected appropriately to reach the optimal trade-off between the coding bits amount and the SSIM-based distortion. Thus, the core problem of SSIM-based RDO is to determine the SSIM-based Lagrange multiplier $\lambda_{ssim}$. Since the SSE distortion has a theoretical relation to the quantization parameter (QP), the traditional SSE-based multiplier $\lambda_{sse}$ can be derived in closed form from the derivative of distortion with respect to rate over the quantization interval, from the perspective of rate-distortion theory [27]. $\lambda_{sse}$ is further modified empirically so that it relates accurately to the quantization parameter in the H.264/AVC coding standard. The SSIM-based distortion, however, reflects the perceptual quality loss, which is very sensitive to the video content. Although quantization is known to be the main contributor to the video compression loss (and hence to the perceptual distortion), it is currently not clear, from a mathematical perspective, how quantization affects the perceived SSIM-based distortion. Knowing that the SSE-based RDO can find the best trade-off between the coding rate and the SSE-based distortion, we propose to use the SSE-based RDO as a bridge to solve the SSIM-based RDO problem with negligible computation overhead. Assume that the SSIM-based distortion can be obtained by scaling the SSE-based distortion by a fixed factor $f$. As is well known, the optimal Lagrange multiplier $\lambda$ can be obtained by differentiating the distortion $D$ with respect to the coding bits amount $R$.
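Then, the SSIM-based Lagrange multiplier $\lambda_{ssim}$ can be obtained as

$$\lambda_{ssim} = -\frac{\partial D_{ssim}}{\partial R} = -\frac{\partial (D_{sse}/f)}{\partial R} = -\frac{1}{f}\frac{\partial D_{sse}}{\partial R} = \frac{\lambda_{sse}}{f}. \tag{6}$$

Thus, the SSIM-based Lagrange optimization process can be modeled by simply scaling the existing SSE-based Lagrange optimization formulation by $f$ as

$$\min_{\{m\}} \left\{\frac{J_{sse}}{f}\right\} \quad \text{with } \frac{J_{sse}}{f} = \frac{D_{sse}}{f} + \frac{\lambda_{sse}}{f} \cdot R. \tag{7}$$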
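The effect of the scaling in (6) and (7) on mode selection can be checked numerically. The sketch below is illustrative only: the function name `best_mode`, the per-mode (distortion, rate) pairs, the multiplier and the factor f are all hypothetical values chosen for the example, not measurements from an encoder.

```python
def best_mode(candidates, lam):
    """Return the mode that minimises the Lagrange cost J = D + lam * R.

    candidates maps a mode name to a (distortion, rate_in_bits) pair.
    """
    return min(candidates, key=lambda m: candidates[m][0] + lam * candidates[m][1])

# Hypothetical (SSE distortion, coding bits) per candidate mode.
modes = {
    "INTRA16x16": (120.0, 900),
    "INTER16x16": (150.0, 400),
    "SKIP": (400.0, 10),
}
lam_sse = 0.2   # hypothetical SSE-based Lagrange multiplier
f = 50.0        # hypothetical fixed scaling factor between the two distortions

# Scale every distortion by 1/f and the multiplier by 1/f, as in Eq. (7).
scaled = {m: (d / f, r) for m, (d, r) in modes.items()}

print(best_mode(modes, lam_sse))       # INTER16x16
print(best_mode(scaled, lam_sse / f))  # INTER16x16
```

Dividing both the distortion and the multiplier by the same positive factor divides every cost by that factor, so the minimising mode cannot change; the choice only differs once the ratio between the two distortions varies from mode to mode or from frame to frame.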
Under this assumption, the SSIM-based Lagrange optimization process selects exactly the same encoding modes as the existing SSE-based method. Although the assumption is unrealistic and the corresponding conclusion seems meaningless, the analysis suggests that the Lagrange multiplier should be scaled by the same factor to balance the encoding distortion and the encoding bits amount whenever the distortion is scaled to another magnitude. It is difficult to find a mathematical relationship between the SSE-based distortion and the SSIM-based distortion, because the SSIM-based distortion incorporates structural information. However, if the ratio $s$ of the frame-level SSE-based encoding distortion to the SSIM-based encoding distortion is known (or can be predicted), then, based on the above analysis, the frame-level SSIM-based Lagrange multiplier $\lambda_{ssim}$ can be obtained by scaling the existing SSE-based Lagrange multiplier $\lambda_{sse}$ by the distortion scale ratio $s$ as

$$\lambda_{ssim} = \frac{\lambda_{sse}}{s}. \tag{8}$$

Taking advantage of $\lambda_{sse}$, the obtained $\lambda_{ssim}$ achieves a good balance between the coding bits amount and the SSIM-based encoding distortion, even though it may not be the optimal SSIM-based multiplier. One might still expect the proposed SSIM-based scheme to select the same encoding modes as the traditional SSE-based method. This is not the case, because that reasoning overlooks the nonlinear relationship between the encoding distortions measured by the two metrics across the candidate encoding modes. In Fig. 1, the frame-level SSE-based distortion and DSSIM are shown for the sequences Foreman (CIF) and Ice (CIF), with 100 frames encoded by JM16.1 [28]. It can be seen that the relation between the distortions measured by the two metrics changes from frame to frame, and these changing relations inherently reflect the structural characteristics of the video content.

Fig. 1. Relations of the SSE distortion and the DSSIM for the sequences of (a) Foreman and (b) Ice. [Panels: Foreman (QP=37) and Ice (QP=37); x-axis: frame number (0–100); left y-axis: SSE; right y-axis: DSSIM.]

To obtain the scale ratio $s$ before the current frame is encoded, we propose to use the distortion scale ratios $s_{n-1}$ and $s_{n-2}$ of the two previous frames to predict the distortion scale ratio $s_n$ of the nth (n = 2, 3, 4, …) frame as

$$s'_n = \begin{cases} s_1, & n = 2, \\ \dfrac{s_{n-1} + s_{n-2}}{2}, & n > 2, \end{cases} \tag{9}$$

where $s'_n$ denotes the predicted value of $s_n$. For the first frame (n = 1), the traditional SSE-based RDO is adopted in the encoding process. To verify the accuracy of the prediction, we summarize the prediction accuracy ratings (PARs) for six sequences in Table 1. The PAR of each sequence is averaged over the videos encoded with different encoding parameters (QP = 28, 31, 34, 37), and the PAR $r_n$ for the nth frame is calculated as

$$r_n = \left(1 - \frac{|s_n - s'_n|}{s_n}\right) \times 100\%. \tag{10}$$

Table 1
PARs for six sequences (QCIF and CIF resolutions).
Sequence    Foreman   Ice     Soccer   Crew    Football   Mobile
PAR (%)     96.57     90.44   82.08    86.33   82.12      96.53

The average PAR for the six sequences is 89.01%, which indicates that our method is valid for predicting the relationship between the SSE-based distortion and the SSIM-based distortion. Therefore, we can estimate $\lambda_{ssim}$ using the content-adaptive predicted scale ratio as in (8). Based on the above, the SSIM-based RDO video coding flow is designed as follows. Firstly, the first frame is encoded with the traditional H.264/AVC RDO, and the scale ratio $s_1$ of the frame-level SSE-based encoding distortion $D_{sse,1}$ to the SSIM-based distortion $D_{ssim,1}$ is calculated. Secondly, for the following nth (n = 2, 3, 4, …) frame, the predicted distortion scale ratio $s'_n$ is obtained by (9), and the SSIM-based Lagrange multiplier $\lambda_{ssim,n}$ is estimated as in (8). Thirdly, the nth frame is encoded using the proposed SSIM-based RDO, and the scale ratio of the encoding distortion $D_{sse,n}$ to $D_{ssim,n}$ is calculated to predict
$s'_{n+1}$. The second and third steps are repeated until the last frame is encoded. To verify the effectiveness of the proposed SSIM-based coding method, we implemented it in the reference software JM16.1. The rate-SSIM curves of four sequences are shown in Fig. 2. It can be seen that the proposed method achieves a significant improvement in rate-SSIM performance over the traditional SSE-based RDO method. With the predicted content-adaptive distortion scale ratio, the proposed method adapts efficiently to sequences with different structural characteristics by taking advantage of the existing SSE-based RDO. Moreover, compared with the previous methods [16,19], the computation overhead is very small, since we only introduce the SSIM-based distortion computation into the proposed SSIM-based RDO method. This has been verified by the extensive results in our previous work [29]. In [29], the "IPPP" coding structure and the baseline profile are adopted in the experiments. The quantization parameters QP1 = (16, 20, 24, 28) and QP2 = (24, 28, 32, 36) are used to calculate the performance improvements. The video streams coded with QP1 and QP2 represent the high bit rate condition and the low bit rate condition, respectively. Compared with the traditional SSE-based coding, at the same SSIM-based coding distortion, our method achieves a rate reduction of 12.83%, versus 9.79% for Huang
et al.'s method [16] and 12.39% for Wang et al.'s method [20], at high bit rates, and a rate reduction of 13.50%, versus 11.58% for Huang et al.'s method [16] and 16.28% for Wang et al.'s method [20], at low bit rates. Further, whereas Huang et al.'s method [16] and Wang et al.'s method [20] increase the coding complexity by 5% and 6.3%, respectively, our method increases it by only 0.32%. In [29], besides the frame-level Lagrange multiplier adjustment, an MB-level Lagrange multiplier adjustment is further performed, based on the information-theoretic observation that MBs containing large amounts of information usually need more bits to be encoded. In error-resilient video coding, however, a large amount of coding bits may result from intra-mode coding used to suppress error propagation rather than from the MB itself containing plenty of information. Therefore, the MB-level Lagrange multiplier adjustment is not suitable for the error-resilient RDO process. In this paper, we only adopt the frame-level Lagrange multiplier decision method.

Fig. 2. Rate-SSIM curves of the proposed SSIM-based RDO and traditional SSE-based RDO for the sequences (a) Bus, (b) Foreman, (c) Ice and (d) Soccer in error-free environment. [Panels: Foreman@QCIF, Bus@QCIF, Soccer@CIF, Ice@CIF; x-axis: bitrate (kbps); y-axis: SSIM; curves: proposed SSIM-based RDO, traditional SSE-based RDO.]

3. SSIM-based error-resilient video coding

3.1. Video packet transmission over wireless network
In order to provide "network friendliness", the video coding layer (VCL) and the network abstraction layer (NAL) are designed in the H.264/AVC video coding standard. The VCL contains the compression tools, such as the prediction module, the DCT module and the entropy coding module, and the video compression process is performed in the VCL. The NAL, in turn, provides the tools to adapt the bit strings generated by the VCL at the slice level to the transmission network. In the wireless communication environment, the transmission channel is time-varying and error-prone. As the slice header provides a unique synchronization point in the video stream, the complete NAL unit is discarded when a bit error occurs within a slice. Additionally, we assume that the channel state remains unchanged during the transmission time of one NAL unit and that the bit errors inside the video packet are uncorrelated. The independent channel model is used in the proposed scheme. Given the bit error rate (BER) $ber$ of the transmission channel, the packet loss probability $\rho$ (which also denotes the packet loss rate in the statistical sense) for one NAL unit containing $L$ bits can be related to the BER as [6]

$$\rho = 1 - (1 - ber)^{L}. \tag{11}$$
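Equation (11) is straightforward to evaluate. The following sketch, with a hypothetical function name and hypothetical example values for the BER and the NAL unit length, shows how the loss probability grows with the packet length under the independent bit-error assumption stated above.

```python
def packet_loss_probability(ber, length_bits):
    """Packet loss probability of Eq. (11): rho = 1 - (1 - ber)^L.

    Assumes independent bit errors and that any bit error discards the NAL unit.
    """
    if not 0.0 <= ber <= 1.0:
        raise ValueError("ber must lie in [0, 1]")
    if length_bits < 0:
        raise ValueError("length_bits must be non-negative")
    return 1.0 - (1.0 - ber) ** length_bits

# Hypothetical channel and packet sizes.
ber = 1e-5
for length in (4000, 8000, 16000):
    print(length, packet_loss_probability(ber, length))
```

For a fixed BER, doubling the packet length roughly doubles the loss probability while the product of BER and length is small, which is why the packet bit length cannot be ignored when the loss probability is fed into the RDO.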
3.2. SSIM-based decoding distortion estimation
As discussed earlier, the NAL unit may be discarded during the wireless transmission. At the user end, error concealment is adopted to decrease the perceived distortion when packet loss occurs. Although error concealment can decrease the decoding distortion, the concealed video slice still differs from the correctly reconstructed one. For a subsequently encoded frame that refers to the concealed video slice, error propagation occurs because the decoding reference slice differs from the encoding reference slice. Thus, the decoding distortion at the user end includes not only the encoding distortion (quantization induced distortion) but also the distortion caused by transmission errors (error concealment and error propagation induced distortions). To perform the end-to-end SSIM-based RDO, the decoding distortion must be estimated in the encoder during the encoding process. Since the decoding distortion of the current frame is affected not only by the irreversible quantization process but also by the incorrectly decoded reference frames, we propose a recursive method that estimates the decoding distortion with the help of the channel acknowledgement information and the channel state estimation. We assume that the temporal replacement error concealment strategy is adopted by the decoder and that it is known to the encoder. In the encoding process, one frame is partitioned into one or several slices. Denote by $S_{n,m}$ the mth slice in the nth frame, by $ber_{n,m}$ the estimated channel BER during the transmission of $S_{n,m}$, and by $\rho_{n,m}$ the packet loss rate for the slice $S_{n,m}$. For the kth MB in $S_{n,m}$, the expected SSIM-based decoding distortion $E\{\mathrm{DSSIM}_{n,m,k}\}$ can be calculated as

$$E\{\mathrm{DSSIM}_{n,m,k}\} = 1 - \rho_{n,m} \cdot \mathrm{SSIM}(b_{n,m,k}, b^{e\_c}_{n,m,k}) - (1 - \rho_{n,m}) \cdot \mathrm{SSIM}(b_{n,m,k}, b^{n\_l}_{n,m,k}), \tag{12}$$

where $b_{n,m,k}$, $b^{e\_c}_{n,m,k}$ and $b^{n\_l}_{n,m,k}$ indicate the original MB, the error-concealed MB with packet loss and the decoded MB without packet loss, respectively. For wireless communication, the channel BER $ber_{n,m}$ can be approximately estimated from the physical layer channel Signal to Noise Ratio (SNR) [30]. To calculate the packet loss probability $\rho_{n,m}$ by (11), the coding bits amount of the current slice should be known. Since the coding bits amount of the current slice cannot be obtained before the slice is encoded, the length of the coding bits can only be estimated. Because the contents of two successive frames change little, the amounts of coding bits for two collocated slices in successive frames are very similar. Currently, we approximate the coding bits length of the current slice by that of the collocated slice in the previous frame. From $ber_{n,m}$ and the estimated coding bits length, the packet loss rate $\rho_{n,m}$ for the current coding slice can be obtained by (11).

Since the pixel values of the MB $b_{n,m,k}$ are known in the encoder, we focus on explaining how to obtain the pixel values of $b^{e\_c}_{n,m,k}$ and $b^{n\_l}_{n,m,k}$ in the next several paragraphs. Denote by $f_{n,i}$, $\hat{f}_{n,i}$ and $\tilde{f}_{n,i}$ the values of the ith pixel in the nth original frame, the reconstructed frame in the encoder and the reconstructed frame in the decoder, respectively. Additionally, we assume that the ith pixel belongs to the kth MB in $S_{n,m}$. Then the expected value of $\tilde{f}_{n,i}$ can be obtained as

$$E\{\tilde{f}_{n,i}\} = (1 - \rho_{n,m}) \cdot \tilde{f}^{n\_l}_{n,i} + \rho_{n,m} \cdot \tilde{f}^{e\_c}_{n,i}, \tag{13}$$

where $\tilde{f}^{n\_l}_{n,i}$ and $\tilde{f}^{e\_c}_{n,i}$ belong to the MBs $b^{n\_l}_{n,m,k}$ and $b^{e\_c}_{n,m,k}$, respectively. We consider two cases, depending on whether the current pixel belongs to an intra-coded MB (I MB) or an inter-coded MB (P MB). When the current MB is lost during the transmission, the decoded pixel $\tilde{f}^{e\_c}_{n,i}(I, P)$ for both the I MB and the P MB is obtained by error concealment as

$$\tilde{f}^{e\_c}_{n,i}(I, P) = \tilde{f}_{n-1,j}, \tag{14}$$

where $\tilde{f}_{n-1,j}$ indicates the decoded value of the jth pixel in the (n − 1)th frame. The jth pixel is found with the help of the median motion vector of the neighboring MBs [7]. When the current MB is received correctly, the decoded pixel $\tilde{f}^{n\_l}_{n,i}(I)$ for the I MB is equal to the reconstructed pixel value $\hat{f}_{n,i}$ in the encoder:

$$\tilde{f}^{n\_l}_{n,i}(I) = \hat{f}_{n,i}. \tag{15}$$

For the P MB, we assume that the ith pixel in the nth frame is predicted from the kth pixel (or subpixel) in the reference frame. Denote by $\hat{f}_{ref,k}$ the predicted pixel value of $\hat{f}_{n,i}$ and by $\hat{r}_{n,i}$ the reconstructed residue in the encoder, i.e. $\hat{f}_{n,i} = \hat{f}_{ref,k} + \hat{r}_{n,i}$. Then, the decoded pixel $\tilde{f}^{n\_l}_{n,i}(P)$ can be represented as

$$\tilde{f}^{n\_l}_{n,i}(P) = \tilde{f}_{ref,k} + \hat{r}_{n,i}, \tag{16}$$
where $\tilde{f}_{ref,k}$ indicates the referred kth pixel in the decoded reference frame. Since the distortion estimation is conducted at the encoder, we add a module to the encoder that simulates the decoding process with the help of the acknowledgement message, which informs the encoder whether the transmitted packet has been received by the receiver. The acknowledgement message can be transmitted by an existing communication protocol, for example the Real-time Transport Control Protocol (RTCP). Assume that the acknowledgement message of the (n − r)th frame is received by the encoder while the nth frame is being encoded. With the stored encoding information and the estimated channel BER, the added decoding module decodes the (n − r)th frame and obtains the expected decoded frames from n − r + 1 to n − 1 using (13). Given the decoded reference frames, or the expectations of the decoded reference frames, the pixel values for $b^{e\_c}_{n,m,k}$ and $b^{n\_l}_{n,m,k}$ can be obtained. Further, the expected decoding distortion for the current MB in the nth frame can be estimated by (12). In this work, one acknowledgement packet contains the acknowledgement messages for all the slices in one frame, and the round trip time of the acknowledgement packet is set to 33 ms.

3.3. SSIM-based error-resilient RDO
The decision of the SSIM-based Lagrange multiplier for RDO video coding in error-free transmission has been described in Section 2. For error-resilient video coding, the estimated decoding distortion is introduced into the RDO process. In other words, the distortion DSSIM in (5) becomes the expected decoding distortion $E\{\mathrm{DSSIM}_{n,m,k}\}$ while the kth MB in $S_{n,m}$ is being encoded. To achieve a good trade-off between the expected decoding distortion and the coding bits amount, the Lagrange multiplier should be adjusted correspondingly. During the transmission, a larger packet bit length (smaller QP) makes the packet more likely to be discarded. Correspondingly, the expected decoding distortion may increase with a larger coding bit rate (smaller QP), as the packet loss induced distortion contributes a major part of the decoding distortion. In such a case, the error-resilient Lagrange multiplier $\lambda'_{ssim}$ may be negative, since the Lagrange multiplier equals the negative slope of the distortion-rate function DSSIM(R) [27]. A negative Lagrange multiplier is unreasonable in the practical Lagrange RDO process, and the Lagrange cost function $J(R) = \mathrm{DSSIM}(R) + \lambda'_{ssim} \cdot R$ in the error-prone environment is then no longer strictly convex. However, the Lagrange cost function $J(R) = \mathrm{DSSIM}(R) + \lambda'_{ssim} \cdot R$ is still strictly convex when the packet loss rate is smaller than a certain value (20% tested in the experiments). Under this condition, in terms of the Lagrange parameter derivation [27], the new SSIM-based error-resilient Lagrange multiplier $\lambda'_{ssim}$ can be theoretically derived as

$$\lambda'_{ssim} = -\frac{\partial\, \mathrm{DSSIM}(R)}{\partial R} = -\frac{\partial \left(1 - \rho_{n,m} \cdot \mathrm{SSIM}(b_{n,m,k}, b^{e\_c}_{n,m,k}) - (1 - \rho_{n,m}) \cdot \mathrm{SSIM}(b_{n,m,k}, b^{n\_l}_{n,m,k})\right)}{\partial R_{n,m,k}} = \frac{\partial \left(\rho_{n,m} \cdot \mathrm{SSIM}(b_{n,m,k}, b^{e\_c}_{n,m,k})\right)}{\partial R_{n,m,k}} + \frac{\partial \left((1 - \rho_{n,m}) \cdot \mathrm{SSIM}(b_{n,m,k}, b^{n\_l}_{n,m,k})\right)}{\partial R_{n,m,k}}, \tag{17}$$

where $R_{n,m,k}$ indicates the coding bits amount for the kth MB in slice $S_{n,m}$. Assuming that the slice coding bits are equally distributed among MBs [6], the packet loss rate $\rho_{n,m}$ can be related to the MB coding bits $R_{n,m,k}$ by (11) as $\rho_{n,m} = 1 - (1 - ber_{n,m})^{N_{mb} \cdot R_{n,m,k}}$, where $N_{mb}$ indicates the number of MBs in the slice. Since the error concealment induced distortion is not related to the coding bits, the derivative of $\mathrm{SSIM}(b_{n,m,k}, b^{e\_c}_{n,m,k})$ with respect to the coding bits amount $R_{n,m,k}$ is zero. Likewise, although $b^{n\_l}_{n,m,k}$ may contain propagation error from the previous frames, the propagation error is not related to the coding bits. Thus, the derivative of $\mathrm{SSIM}(b_{n,m,k}, b^{n\_l}_{n,m,k})$ with respect to the coding bits $R_{n,m,k}$ is affected only by the quantization parameter, and it can be approximately represented as

$$\frac{\partial\, \mathrm{SSIM}(b_{n,m,k}, b^{n\_l}_{n,m,k})}{\partial R_{n,m,k}} = -\frac{\partial \left(1 - \mathrm{SSIM}(b_{n,m,k}, b^{n\_l}_{n,m,k})\right)}{\partial R_{n,m,k}} = \lambda_{ssim}, \tag{18}$$

where $\lambda_{ssim}$ indicates the Lagrange multiplier for the RDO in the error-free environment. Thus, (17) can be further derived as $\lambda'_{ssim} =$
∂ð1 ð1 bern;m ÞNmb Rn;m;k Þ e_c SSIM bn;m;k ; bn;m;k ∂Rn;m;k ∂ðð1 ber n;m ÞNmb Rn;m;k Þ n_l SSIM bn;m;k ; bn;m;k þ ∂Rn;m;k n_l
þð1 bern;m ÞNmb Rn;m;k
∂ðSSIMðbn;m;k ; bn;m;k ÞÞ ∂Rn;m;k
N mb lnð1 bern;m Þ ð1 ber n;m ÞNmb Rn;m;k n_l
e_c
ðSSIMðbn;m;k ; bn;m;k Þ SSIMðbn;m;k ; bn;m;k ÞÞ þð1 bern;m Þ
N mb Rn;m;k
λssim
¼ ð1 bern;m ÞNmb Rn;m;k ðλssim þN mb lnð1 bern;m Þ n_l
e_c
ðSSIMðbn;m;k ; bn;m;k Þ SSIMðbn;m;k ; bn;m;k ÞÞÞ ¼ ð1 ρn;m Þ ðλssim þ Nmb lnð1 ber n;m Þ n_l
e_c
ðSSIMðbn;m;k ; bn;m;k Þ SSIMðbn;m;k ; bn;m;k ÞÞÞ:
ð19Þ
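The closed form in (19) can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a toy model in which the SSIM of a received MB grows linearly with the MB bits at slope $\lambda_{ssim}$ (in the spirit of (18)), and all numeric values and function names are hypothetical, chosen only for illustration. It compares the negative slope of the expected distortion, obtained by a central finite difference, against the last line of (19).

```python
import math


def packet_loss_rate(ber: float, n_mb: int, mb_bits: float) -> float:
    """rho = 1 - (1 - ber)^(N_mb * R): loss rate of a slice of n_mb equally sized MBs."""
    return 1.0 - (1.0 - ber) ** (n_mb * mb_bits)


def expected_dssim(rho: float, ssim_ec: float, ssim_nl: float) -> float:
    """Expected distortion 1 - rho*SSIM_ec - (1 - rho)*SSIM_nl, the quantity differentiated in (17)."""
    return 1.0 - rho * ssim_ec - (1.0 - rho) * ssim_nl


def lambda_error_resilient(lam_ssim, ber, n_mb, mb_bits, ssim_ec, ssim_nl):
    """Last line of (19): (1 - rho) * (lam + N_mb * ln(1 - ber) * (SSIM_nl - SSIM_ec))."""
    rho = packet_loss_rate(ber, n_mb, mb_bits)
    correction = n_mb * math.log(1.0 - ber) * (ssim_nl - ssim_ec)
    return (1.0 - rho) * (lam_ssim + correction)


# Hypothetical operating point (illustrative values, not taken from the experiments).
BER, N_MB, R0 = 1e-5, 11, 200.0
LAM_SSIM, SSIM_EC, SSIM_NL0 = 1e-4, 0.80, 0.95


def toy_ssim_nl(mb_bits: float) -> float:
    """Assumed toy model: SSIM of the received MB is linear in R with slope LAM_SSIM."""
    return SSIM_NL0 + LAM_SSIM * (mb_bits - R0)


def toy_expected_dssim(mb_bits: float) -> float:
    rho = packet_loss_rate(BER, N_MB, mb_bits)
    return expected_dssim(rho, SSIM_EC, toy_ssim_nl(mb_bits))


# Negative slope of the expected distortion by central finite difference.
h = 1e-2
numeric = -(toy_expected_dssim(R0 + h) - toy_expected_dssim(R0 - h)) / (2 * h)
closed = lambda_error_resilient(LAM_SSIM, BER, N_MB, R0, SSIM_EC, SSIM_NL0)
print(numeric, closed)
```

Under these assumptions the two values agree closely, and the error-resilient multiplier is smaller than the error-free one because $\ln(1-\mathrm{ber}_{n,m})$ is negative and the received MB has higher SSIM than the concealed one.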
In practical wireless transmission, $N_{mb}\ln(1-\mathrm{ber}_{n,m})\,\bigl(\mathrm{SSIM}(b_{n,m,k}, \tilde{b}^{n\_l}_{n,m,k}) - \mathrm{SSIM}(b_{n,m,k}, \tilde{b}^{e\_c}_{n,m,k})\bigr)$ is generally much smaller in magnitude than $\lambda_{ssim}$ when the packet loss rate is smaller than 20%. Thus, $\lambda'_{ssim}$ in (19) can be further approximated as

$$
\lambda'_{ssim} \approx (1-\rho_{n,m})\,\lambda_{ssim}.
\tag{20}
$$

When the Lagrange multiplier $\lambda'_{ssim}$ is adjusted to be smaller than $\lambda_{ssim}$, the error-resilient RDO will select more intra-coded MBs to restrain the error propagation. Equation (20) indicates that $\lambda'_{ssim}$ is adaptively adjusted to be smaller than $\lambda_{ssim}$ according to the packet loss rate, which promotes the error robustness of the video stream. When the packet loss rate is larger than 20%, the Lagrange multiplier cannot be theoretically derived from (17), since $J(R) = \mathrm{DSSIM}(R) + \lambda'_{ssim} R$ is not strictly convex. However, for large packet loss rates (larger than 20%), (20) can still be used to increase the number of intra-coded MBs by making $\lambda'_{ssim}$ smaller than $\lambda_{ssim}$, which restrains the error propagation.
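The effect of (20) on mode selection can be illustrated with a small sketch. It assumes two hypothetical candidate modes whose expected distortions and bit counts are invented for illustration, and it holds those distortions fixed while only the multiplier changes, which is a simplification of the actual encoder. The function names are not from the paper or from the JM software.

```python
def approx_lambda_error_resilient(lam_ssim: float, rho: float) -> float:
    """Eq. (20): lambda'_ssim ~= (1 - rho) * lambda_ssim."""
    return (1.0 - rho) * lam_ssim


def best_mode(candidates: dict, lam: float) -> str:
    """Return the mode that minimizes the Lagrange cost J = D + lam * R."""
    return min(candidates, key=lambda m: candidates[m][0] + lam * candidates[m][1])


# Hypothetical candidates: mode -> (expected DSSIM, coding bits).
# Intra coding costs more bits but stops error propagation, hence the lower distortion.
CANDIDATES = {"inter": (0.0675, 300.0), "intra": (0.0400, 600.0)}
LAM_SSIM = 1e-4  # assumed error-free multiplier

error_free_choice = best_mode(CANDIDATES, LAM_SSIM)
lossy_choice = best_mode(CANDIDATES, approx_lambda_error_resilient(LAM_SSIM, rho=0.15))
print(error_free_choice, lossy_choice)
```

With the error-free multiplier the cheaper inter mode wins; scaling the multiplier by $(1-\rho_{n,m})$ with a 15% loss rate reduces the penalty on bits enough that the intra mode is selected, which is the behavior described for (20).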
Table 2
Experimental conditions.

Profile              Baseline profile
Number of frames     100
Entropy coding       CAVLC
QP range             28–37
RDO                  Open
Coding structure     IPPP
Frame rate           30 F/s
Channel BER          $5\times10^{-6}$ to $10^{-4}$
Error concealment    Temporal replacement
3.4. Complexity analyses

To perform the proposed SSIM-based error-resilient RDO process, the encoder must be able to track the propagated errors. This is achieved at the cost of a modest complexity increase. The estimated decoding distortion is calculated using (12). The pixel values of $\tilde{b}^{e\_c}_{n,m,k}$ are obtained using the temporal replacement error concealment method, which introduces negligible complexity overhead. To get the pixel values of $\tilde{b}^{n\_l}_{n,m,k}$, the referred frames must be reconstructed at the encoder. As the needed residuals and motion vectors are stored in the encoder, only several addition operations are added for each pixel. Furthermore, the reconstruction of the referred frames is to some degree independent of the coding process, so it can be implemented in parallel, which can significantly improve the efficiency of the proposed method. In addition, the complexity of the proposed Lagrange multiplier selection method in the error-free environment is only increased by 0.32% compared to the traditional SSE-based RDO method [29]. For the error-resilient Lagrange multiplier adjustment, only several addition/multiplication operations (the operations of (11) and (20)) are added for each video slice, and the corresponding complexity increase is almost negligible. Because the syntax of the H.264/AVC standard is not changed, there is no complexity increase at the decoder. We also note that the additional storage required for the per-pixel encoding information should not pose large difficulties in most applications.

4. Experimental results

To verify the effectiveness of the proposed SSIM-based error-resilient RDO video coding scheme, we implement it in the H.264/AVC reference software JM16.1, and the wireless video transmission is simulated in NS2 [31]. In this section, we first verify the accuracy of the proposed SSIM-based decoding distortion estimation. Then, the performance improvement of the proposed error-resilient RDO scheme is evaluated. The detailed experimental conditions are shown in Table 2. The channel BERs used in the experiments are the values after channel coding.

4.1. Decoding distortion estimation accuracy

For error-resilient video coding, the end-to-end decoding distortion needs to be estimated during the mode decision process. The comparisons between the real decoding distortion and the estimated decoding distortion for Foreman and Soccer at the frame level and the MB level are shown in Figs. 3 and 4, respectively. In Figs. 3 and 4, the sequences are encoded with QP = 37, and the channel BER is set to $1\times10^{-5}$. From Fig. 3, we can see that the measured decoding distortion is sometimes larger than the estimated decoding distortion. This situation appears when the transmitted video packet is actually lost during transmission. Although the distortion prediction is not very accurate in this situation, the distortion estimation can still appropriately reflect the decoding distortion differences between the candidate encoding modes. To suppress the error propagation caused by erroneously decoded reference frames, the encoder should perform the optimal error-resilient encoding mode selection, which requires that the encoder can distinguish the decoding distortion differences between the candidate encoding modes, especially the propagated errors. In fact, for all the candidate encoding modes, the decoding distortion when the current video packet is lost is the same (the same error-concealment-induced decoding distortion). However, when the current video packet is not lost, different candidate encoding modes result in different decoding distortions, especially different propagated errors. For example, the intra-coded mode does not refer to the previous frames (it is not affected by the propagated errors), but the inter-coded mode does (it is affected by the propagated errors). Our distortion estimation method can accurately predict the propagated errors and properly consider the different error-resilient abilities of the candidate encoding modes.
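The observation that only the not-lost case separates the candidate modes can be written out with the expected-distortion expression that underlies (17). As a sketch, let A and B label two candidate modes (labels introduced here only for illustration), and assume for simplicity that both see the same packet loss rate $\rho$ and the same concealment result $\mathrm{SSIM}^{e\_c}$; in practice $\rho$ also varies slightly with the bits each mode produces.

```latex
\begin{aligned}
E\{\mathrm{DSSIM}^{A}\} - E\{\mathrm{DSSIM}^{B}\}
&= \bigl(1 - \rho\,\mathrm{SSIM}^{e\_c} - (1-\rho)\,\mathrm{SSIM}^{n\_l}_{A}\bigr)
 - \bigl(1 - \rho\,\mathrm{SSIM}^{e\_c} - (1-\rho)\,\mathrm{SSIM}^{n\_l}_{B}\bigr) \\
&= (1-\rho)\,\bigl(\mathrm{SSIM}^{n\_l}_{B} - \mathrm{SSIM}^{n\_l}_{A}\bigr).
\end{aligned}
```

Under this assumption the concealment term cancels, so an estimation error in the lost-packet case shifts all candidates equally and leaves their ranking unchanged.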
Based on this, the error-resilient encoding mode selection during the encoding process can still work efficiently, and this is verified by the results in Section 4.2. In Table 3, the frame-level (F) and MB-level (M) estimation accuracy rates (EARs) for eight sequences at the channel conditions of BER = $5\times10^{-6}$ and BER = $5\times10^{-5}$ are shown, and the summarized results are the EAR values averaged over the QPs (28, 31, 34, 37). The EAR is calculated in the form of (10) by substituting $s_n$ with the real decoding distortion and $s'_n$ with the estimated distortion. It can be seen from Table 3 that the proposed method can efficiently estimate the SSIM-based decoding distortion and that the EAR decreases as the channel BER increases. This is because the situation discussed above (the transmitted packet is actually lost) occurs more frequently when the channel BER is larger (larger packet loss rate), and it has been explained that this situation has little influence on selecting the appropriate error-resilient coding mode.

[Figure 3 omitted. Panels: Foreman@QCIF and Soccer@CIF; horizontal axis: Frame number; vertical axis: DSSIM; curves: real, estimated.]
Fig. 3. Comparison between the measured and estimated decoding distortions for the sequences (a) Foreman and (b) Soccer at the frame level.

[Figure 4 omitted. Panels: the 50th frame of Foreman and the 75th frame of Soccer; horizontal axis: MB number; vertical axis: DSSIM; curves: real, estimated.]
Fig. 4. Comparison between the measured and estimated decoding distortions for the sequences (a) Foreman and (b) Soccer at the MB level.

Table 3
EARs for eight sequences.

                   BER = $5\times10^{-6}$      BER = $5\times10^{-5}$
Sequences          F EAR (%)   M EAR (%)   F EAR (%)   M EAR (%)
Bus (QCIF)         87.10       87.95       73.45       74.28
Foreman (QCIF)     95.02       94.78       82.71       82.85
Football (QCIF)    91.26       92.58       77.46       78.13
Harbour (QCIF)     96.37       96.51       85.65       85.37
Crew (CIF)         94.46       94.12       84.69       83.77
Hall (CIF)         99.08       99.02       98.65       98.31
Ice (CIF)          95.43       96.14       78.86       79.97
Soccer (CIF)       96.44       96.23       84.28       85.17
Average            94.40       94.67       83.22       83.48

4.2. Performance evaluation for SSIM-based error-resilient RDO
In this subsection, the SSE-based error-resilient RDO scheme is used as the comparison scheme. The SSE-based decoding distortion is estimated by the ROPE method in [7], which is well studied and regarded as an advanced SSE-based distortion estimation method. The error-resilient SSE-based Lagrange multiplier is adjusted by multiplying the existing Lagrange multiplier $\lambda_{sse}$ by $(1-\rho_{n,m})$ as in (20), which has been verified to work efficiently in the error-prone environment [9]. In [26], the SSIM-based error-resilient Lagrange multiplier was obtained for a certain fixed target bit rate. For the fixed-QP situation, however, the decision method of the SSIM-based error-resilient Lagrange multiplier in [26] is not suitable. Thus, we do not compare the performance of our proposed method with that of Zhang et al. [26] in this paper. The rate-SSIM curves of the sequences Foreman (QCIF) and Crew (CIF) at different BER conditions (BER = $5\times10^{-6}$, BER = $5\times10^{-5}$ and BER = $1\times10^{-4}$) are shown in Fig. 5. In Fig. 5, the SSIM on the vertical axis refers to the SSIM index between the original sequence and the decoded sequence at the user end. It can be seen from Fig. 5 that the proposed scheme achieves a significant improvement of the rate-SSIM performance over the SSE-based scheme at different channel conditions, which indicates that the proposed SSIM-based scheme achieves a better trade-off between the number of encoding bits and the SSIM-based decoding distortion. We can also find that the video quality perceived by the end user decreases as the channel BER increases. At the condition of BER = $1\times10^{-4}$ ((c) and (f)), the SSIM index value decreases with increasing bit rate once the transmitting bit rate exceeds a certain level. This is because the packet loss rate is extremely high (larger than 20%) due to the large video packet size (lower quantization parameter for the slice with a fixed number of MBs), so the packet-loss-induced distortion becomes the major contribution to the decoding distortion instead of the quantization-induced distortion. Even in such a case, the proposed SSIM-based RDO scheme still provides better SSIM-based rate-distortion performance than the SSE-based error-resilient RDO.

The intra-refresh rates at different channel conditions for the sequence Crew with different QPs (28, 31, 34, 37) are summarized in Fig. 6. From Fig. 6, it can be seen that the intra-coded MB refresh rate changes adaptively with the transmission channel state. When the channel quality is good (BER is small), the intra-coded MB refresh rate is correspondingly small. Otherwise, the intra-coded MB refresh rate is increased to restrain the error propagation caused by the packet loss (a larger BER leads to more lost packets). Further, we can find that the intra-coded MB refresh rate increases as the QP decreases at the same BER. This is because a smaller QP leads to more coding bits (larger video packet length), which makes the video packet more likely to be discarded during transmission. In summary, the intra-coded MB refresh rate is adaptively tuned by the SSIM-based error-resilient RDO according to the packet loss rate.

To quantify the performance improvement of the proposed scheme over the SSE-based scheme, the average bit rate reduction $\Delta R$ at the same SSIM-based decoding distortion is calculated using the method in [32]. In the experiments, the QPs (28, 31, 34, 37) are used to calculate the performance improvement, and the experimental results are summarized in Table 4. At the conditions of BER = $5\times10^{-6}$, BER = $1\times10^{-5}$ and BER = $5\times10^{-5}$, the proposed scheme achieves 13.51%, 15.44% and 18.89% rate reduction on average, respectively. It can be seen that the proposed scheme can greatly reduce the coding bit rate at the same SSIM index value and that the performance improvement is more significant at the higher channel BER.

[Figure 5 omitted. Panels: Foreman@QCIF at BER = 0.000005, 0.00005 and 0.0001; Crew@CIF at BER = 0.000005, 0.00005 and 0.0001; horizontal axis: Bitrate (kbps); vertical axis: SSIM; curves: SSIM-based scheme, SSE-based scheme.]
Fig. 5. The performance comparisons of the proposed scheme to the SSE-based scheme at the conditions of BER = $5\times10^{-6}$ ((a) and (d)), BER = $5\times10^{-5}$ ((b) and (e)), and BER = $1\times10^{-4}$ ((c) and (f)).

[Figure 6 omitted. Panel: Crew@CIF; horizontal axis: BER; vertical axis: intra refresh rate (%); curves: QP = 28, QP = 31, QP = 34, QP = 37.]
Fig. 6. Intra-coded MB refresh rates for sequence Crew (CIF) with different QPs.

Table 4
Performance comparison of two schemes.

                   Rate reduction $\Delta R$ (%)
Sequences          BER = $5\times10^{-6}$   BER = $1\times10^{-5}$   BER = $5\times10^{-5}$
Bus (QCIF)         9.79     12.35    15.24
Crew (QCIF)        8.41     12.76    18.34
News (QCIF)        9.33     10.52    9.43
Football (QCIF)    14.67    15.98    20.44
Foreman (QCIF)     9.30     9.31     10.44
Harbour (QCIF)     4.44     3.69     8.17
Ice (QCIF)         14.50    16.23    22.35
Crew (CIF)         18.07    18.98    25.70
Football (CIF)     13.47    12.64    30.45
Hall (CIF)         17.25    19.91    19.44
Mobile (CIF)       16.65    28.62    22.60
Ice (CIF)          15.04    17.71    20.11
Soccer (CIF)       16.50    15.91    21.24
Mobacal (720P)     14.23    15.71    19.63
Parkrun (720P)     16.45    16.90    18.03
Shield (720P)      18.06    19.87    20.62
Average            13.51    15.44    18.89

The SSIM-based error-resilient RDO is proposed to promote the end-to-end perceptual quality. Therefore, the subjective quality of the proposed scheme is evaluated. We have performed the comparison of two RDO methods using
the two-alternative forced-choice (2AFC) method, which is widely used in psychophysical studies [25]. In this method, a viewer is shown a pair of video sequences and is asked to select the one that he/she thinks has better quality. In the experiments, the sequences compressed by the two RDO schemes at similar bit rates are tested at different channel conditions. The tested sequences are shown in Table 5. In Table 5, the video streams of Bus, Football and Mobacal are obtained at the condition of BER = $5\times10^{-6}$; the video streams of Foreman, Soccer and Parkrun are obtained at the condition of BER = $5\times10^{-5}$; the video streams of Ice, Mobile and Shield are obtained at the condition of BER = $1\times10^{-4}$. Ten viewers participated in the experiments, and each sequence pair is watched five times in the test. For each pair, the number of obtained 2AFC results is 50. The results of the test are shown in Fig. 7. In Fig. 7(a) and (b), the percentages by which the viewers favor the proposed scheme over the anchor scheme are shown. The error bars (the standard deviation between the measurements) are also plotted. From Fig. 7, it can be seen that the viewers tend to select the proposed scheme for better visual quality (the preference percentage is larger than 50%).

Table 5
Tested sequence pairs.

                   SSE-based scheme         Proposed scheme
Sequences          SSIM      Rate (kbps)    SSIM      Rate (kbps)
Bus (QCIF)         0.8705    86.67          0.8808    82.77
Football (CIF)     0.8478    314.11         0.8659    310.33
Mobacal (720P)     0.8504    1508.29        0.8572    1498.44
Foreman (QCIF)     0.9172    50.50          0.9236    49.82
Soccer (CIF)       0.7650    141.79         0.7796    130.71
Parkrun (720P)     0.8227    4471.13        0.8321    4431.60
Ice (QCIF)         0.9294    58.07          0.9361    56.75
Mobile (CIF)       0.8104    772.07         0.8223    750.95
Shield (720P)      0.8712    951.21         0.8824    920.37
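The preference percentages and error bars of a 2AFC test can be computed along the following lines. This is only a sketch under stated assumptions: the vote matrix is invented for illustration (it is not the experimental data), and taking the standard deviation across per-viewer percentages is one plausible reading of "the standard deviation between the measurements", not a confirmed description of how Fig. 7 was produced.

```python
from statistics import mean, pstdev


def preference_stats(votes):
    """votes[v][t] is 1 if viewer v preferred the proposed scheme on trial t, else 0.

    Returns the mean preference percentage over viewers and the (population)
    standard deviation of the per-viewer percentages.
    """
    per_viewer = [100.0 * sum(trials) / len(trials) for trials in votes]
    return mean(per_viewer), pstdev(per_viewer)


# Hypothetical votes for one sequence pair: 10 viewers x 5 repetitions = 50 results.
votes = [
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 1, 1],
]

percentage, spread = preference_stats(votes)
print(percentage, spread)
```

A mean percentage above 50% indicates that the proposed scheme is chosen more often than the anchor for that pair.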
[Figure 7 omitted. Two panels; vertical axis: Percentage (%) in favor of the proposed scheme; horizontal axes: (a) Sequence number, (b) Viewer number.]
Fig. 7. (a) Mean and standard deviation of preference for individual sequence (1–9: sequence number, 10: average). (b) Mean and standard deviation of preference for individual viewer (1–10: viewer number, 11: average).
Fig. 8. The 88th frame for the (a) original sequence; (b) decoded sequence of the proposed SSIM-based scheme, frame bits = 1298, sequence bit rate = 314 kbps, MSE = 233.90, DSSIM = 0.2552; (c) decoded sequence of the traditional SSE-based scheme, frame bits = 1502, sequence bit rate = 340 kbps, MSE = 211.31, DSSIM = 0.2813; (d) decoded sequence of the random 4% intra-coded MB refresh scheme, frame bits = 1452, sequence bit rate = 324 kbps, MSE = 254.77, DSSIM = 0.2934.
Further, four images of sequence Soccer with CIF resolution are presented in Fig. 8. The specific frame coding bits amount, sequence bit rate, the MSE value and the DSSIM value are provided in the caption of Fig. 8. It can be seen that the SSE-based scheme can achieve the smallest MSE distortion, while the SSIM-based scheme can achieve the smallest SSIM-based distortion. Subjectively, we can observe that our proposed scheme can obtain better perceptual video quality (especially the red rectangular areas) than both the
SSE-based scheme and the random intra-coded MB refresh scheme. Specifically, the proposed scheme can efficiently restrain the error propagation and preserve more structural information of image details. Compared to the other two schemes, the visual performance is improved because the proposed scheme can appropriately choose the optimal encoding mode which can achieve the best trade-off between the coding bits amount and the SSIM-based decoding distortion (structure distortion).
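The fact that MSE and DSSIM can rank distortions differently, as in Fig. 8, can be reproduced on a toy block. The sketch below is illustrative only: it computes SSIM from global statistics of a single block with the usual constants $K_1 = 0.01$ and $K_2 = 0.03$ from [10], whereas practical SSIM implementations average over local windows, and the pixel values are invented. The function names are hypothetical.

```python
def mse(x, y):
    """Mean squared error between two equally sized pixel lists."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def dssim_global(x, y, dynamic_range=255.0):
    """DSSIM = 1 - SSIM, with SSIM computed from global block statistics (a simplification)."""
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
    return 1.0 - ssim


# Hypothetical 64-pixel block (e.g. 8x8) with a strong edge pattern.
original = [50, 50, 200, 200] * 16
# Distortion 1: uniform brightness shift, which keeps the structure intact.
shifted = [p + 10 for p in original]
# Distortion 2: alternating error of the same energy, which perturbs the structure.
noise = [10, -10, 10, -10] * 16
noisy = [p + e for p, e in zip(original, noise)]

print(mse(original, shifted), mse(original, noisy))
print(dssim_global(original, shifted), dssim_global(original, noisy))
```

Both distortions have the same MSE, yet the structure-perturbing one has the larger DSSIM, which is why an SSE-driven and an SSIM-driven mode decision can prefer different encodings.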
5. Conclusion

This paper presents an SSIM-based error-resilient RDO coding scheme to promote the perceptual distortion-rate performance for wireless video streaming. By exploiting the advantage of the SSIM metric in perceptual quality evaluation, the proposed scheme introduces the estimated SSIM-based decoding distortion into the error-resilient RDO process. Furthermore, the SSIM-based error-resilient Lagrange multiplier is derived to adapt to the time-varying wireless channel states. Compared to the SSE-based error-resilient RDO scheme, the proposed SSIM-based error-resilient RDO scheme better accounts for preserving the structural information of the videos received by the end users during the encoding mode selection process. Objective and subjective evaluation results indicate that the proposed SSIM-based error-resilient RDO provides better perceptual performance (more structural information preserved) than the SSE-based error-resilient RDO scheme.
Acknowledgments

This work was supported in part by Important National Science and Technology Specific Project under Contracts 2012ZX03003006-004, NSFC under Grant nos. 11161140319 and 61102077, Ningbo Natural Science Foundation under Contract 2012A610044, Zhejiang Provincial Natural Science Foundation of China under Contract LY13F010012 and NSF under Grant nos. 1145596 and 0830493.

References

[1] N. Färber, K. Stuhlmüller, B. Girod, Analysis of error propagation in hybrid video coding with application to error resilience, in: Proceedings of ICIP, 1999, pp. 550–554.
[2] T. Stockhammer, T. Wiegand, S. Wenger, Optimized transmission of H.26L/JVT coded video over packet-lossy networks, in: Proceedings of ICIP, 2002, pp. 173–176.
[3] Z. He, J. Cai, C.W. Chen, Joint source channel rate-distortion analysis for adaptive mode selection and rate control in wireless video coding, IEEE Trans. Circuits Syst. Video Technol. 12 (6) (2002) 511–523.
[4] G.J. Sullivan, T. Wiegand, Rate-distortion optimization for video compression, IEEE Signal Process. Mag. 15 (6) (1998) 74–90.
[5] Y. Zhang, W. Gao, H.F. Sun, Q.M. Huang, Y. Lu, Error resilience video coding in H.264 encoder with potential distortion tracking, in: Proceedings of ICIP, 2004, pp. 163–166.
[6] G. Cote, S. Shirani, F. Kossentini, Optimal mode selection and synchronization for robust video communications over error-prone networks, IEEE J. Sel. Areas Commun. 18 (6) (2000) 952–965.
[7] R. Zhang, S.L. Regunathan, K. Rose, Video coding with optimal inter/intra-mode switching for packet loss resilience, IEEE J. Sel. Areas Commun. 18 (6) (2000) 966–976.
[8] T. Stockhammer, D. Kontopodis, T. Wiegand, Rate-distortion optimization for JVT/H.26L coding in packet loss environment, in: Proceedings of Packet Video Workshop, 2002.
[9] Y. Zhang, W. Gao, Y. Lu, Q. Huang, D. Zhao, Joint source-channel rate-distortion optimization for H.264 video coding over error-prone networks, IEEE Trans. Multimed. 9 (3) (2007) 445–454.
[10] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612.
[11] Z. Wang, A.C. Bovik, Mean squared error: love it or leave it? A new look at signal fidelity measures, IEEE Signal Process. Mag. 26 (1) (2009) 98–117.
[12] Z. Wang, L. Lu, A.C. Bovik, Video quality assessment based on structural distortion measurement, Signal Process.: Image Commun. 19 (2) (2004) 121–132.
[13] H. Sheikh, A.C. Bovik, Image information and visual quality, IEEE Trans. Image Process. 15 (2) (2006) 430–444.
[14] D. Chandler, S. Hemami, VSNR: a wavelet-based visual signal-to-noise ratio for natural images, IEEE Trans. Image Process. 16 (9) (2007) 2284–2298.
[15] C. Yang, R. Leung, L. Po, Z. Mai, An SSIM-optimal H.264/AVC inter frame encoder, in: IEEE International Conference on Intelligent Computing and Intelligent Systems, 2009, pp. 291–295.
[16] Y.H. Huang, T.S. Ou, P.Y. Su, H. Chen, Perceptual rate-distortion optimization using structural similarity index as quality metric, IEEE Trans. Circuits Syst. Video Technol. 20 (11) (2010) 1614–1624.
[17] H. Chen, Y.H. Huang, P. Su, T. Ou, Improving video coding quality by perceptual rate-distortion optimization, in: Proceedings of IEEE International Conference on Multimedia Expo, 2010, pp. 1287–1292.
[18] X. Wang, L. Su, Q. Huang, C. Liu, Visual perception based Lagrangian rate distortion optimization for video coding, in: IEEE International Conference on Image Processing, 2011, pp. 1653–1656.
[19] S. Wang, A. Rehman, Z. Wang, S. Ma, W. Gao, SSIM-motivated rate distortion optimization for video coding, IEEE Trans. Circuits Syst. Video Technol. 22 (4) (2012) 516–529.
[20] S. Wang, S. Ma, W. Gao, SSIM based perceptual distortion rate optimization coding, in: Proceedings of the SPIE: Visual Communications Image Processing, 2010, pp. 1–10.
[21] C. Yeo, H.L. Tan, Y.H. Tan, On rate distortion optimization using SSIM, IEEE Trans. Circuits Syst. Video Technol. 23 (7) (2013) 1170–1181.
[22] Z. Mai, C. Yang, L. Po, S. Xie, A new rate-distortion optimization using structural information in H.264 I-frame encoder, in: Advanced Concepts for Intelligent Vision Systems, Lecture Notes in Computer Science, 2005, pp. 435–441.
[23] T. Ou, Y.H. Huang, H.H. Chen, SSIM-based perceptual rate control for video coding, IEEE Trans. Circuits Syst. Video Technol. 21 (5) (2011) 682–691.
[24] S. Wang, A. Rehman, Z. Wang, S. Ma, W. Gao, Rate-SSIM optimization for video coding, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2011, pp. 833–836.
[25] S. Wang, A. Rehman, Z. Wang, S. Ma, W. Gao, Perceptual video coding based on SSIM-inspired divisive normalization, IEEE Trans. Image Process. 22 (4) (2013) 1418–1429.
[26] L. Zhang, Q. Peng, X. Wu, Q. Wang, SSIM-based error resilient video coding over packet-switched networks, J. Signal Process. Syst. Signal Image Video Technol., published online: April 2013.
[27] T. Wiegand, B. Girod, Lagrangian multiplier selection in hybrid video coder control, in: Proceedings of International Conference of Image Processing, 2001, pp. 542–545.
[28] H.264/MPEG-4 AVC Reference Software [online]. Available: http://iphome.hhi.de/suehring/tml/download/old_jm/jm16.1.zip.
[29] P. Zhao, Y. Liu, J. Liu, R. Yao, S. Ci, H. Tang, Low-complexity content-adaptive Lagrange multiplier decision for SSIM-based RD-optimized video coding, in: The IEEE International Symposium on Circuits and Systems, 2013, pp. 485–488.
[30] S.T. Chung, A.J. Goldsmith, Degrees of freedom in adaptive modulation: a unified view, IEEE Trans. Commun. 49 (9) (2001) 1561–1571.
[31] The network simulator-ns-2 [online]. Available: http://www.isi.edu/nsnam/ns.
[32] G. Bjontegaard, Calculation of average PSNR differences between rd-curves, in: ITU-T Q.6/SG16 VCEG 13th Meeting, 2001.