Rate-distortion-smoothness optimized rate allocation schemes for Spectral Fine Granular Scalable video coding technique

J. Vis. Commun. Image R. 17 (2006) 799–829 www.elsevier.com/locate/jvci

Wen-Nung Lie a,*, Cheng-Hsiung Tseng b, Tom C.I. Lin a, Ming-Yang Tseng a, I-Cheng Ting c

a Department of Electrical Engineering, National Chung Cheng University, Chia-Yi 621, Taiwan, ROC
b Materials and Electro-Optics Research Division, Chung-Shan Institute of Science and Technology, Lung-Tan, Tao-Yuan 325, Taiwan, ROC
c Network & Multimedia Institute, Institute for Information Industry (III), Taiwan, ROC

Received 2 August 2004; accepted 20 April 2005 Available online 7 July 2005

Abstract

Spectral Fine Granular Scalability (SFGS), a variation of MPEG-4 FGS, is proposed in this paper as a scalable coding technique for video streaming. SFGS rearranges enhancement-layer bit-plane data according to spectral frequency ordering before they are processed with the traditional FGS bit-plane coding technique. Based on this data reordering within each bit-plane, spectral bands of lower frequency or better rate-distortion properties are transmitted with priority. Despite this modification, SFGS retains properties similar to those of FGS, such as coding efficiency, error resilience, and adaptation to channel bandwidth variation. Moreover, SFGS promises more even image quality and smoother video perception at the receiver side when the channel bandwidth is limited. Traditional FGS rate control techniques lack a systematic approach to making tradeoffs between image quality, motion smoothness (i.e., the frame rate), and video smoothness (PSNR difference between consecutive frames), according to limitations on system resources and users' preferences. In view of this drawback, we propose here a unified rate allocation scheme, based on a rate-distortion-smoothness optimization criterion and a multi-stage dynamic programming (DP) technique, to solve the above problems. Experiments show that our algorithm is capable of achieving smoother video quality while simultaneously guaranteeing no buffer overflow or underflow at the encoder/decoder under a target bit-rate constraint. Though our proposed algorithm was designed for SFGS, it can also be applied to other FGS variations.

© 2005 Elsevier Inc. All rights reserved.

Keywords: Video streaming; MPEG-4 FGS; Fine Granular Scalability; Scalable video coding; Rate control; Dynamic programming

Parts of this paper were presented in the IEEE International Symposium on Circuits and Systems, May 2003, Bangkok, Thailand, and the IEEE International Conference on Multimedia and Expo (ICME), June 27–30, 2004, Taipei, Taiwan.
* Corresponding author. Fax: +886 5 2720862. E-mail address: [email protected] (W.-N. Lie).
1047-3203/$ - see front matter © 2005 Elsevier Inc. All rights reserved. doi:10.1016/j.jvcir.2005.04.003

1. Introduction

Traditional video compression techniques concentrated on how to code a video sequence to achieve the best quality at a given bit-rate. Since the available network bandwidth varies with place and time, non-scalable compression techniques are no longer sufficiently practical. New video compression techniques should be able to adapt to bandwidth variation. Two kinds of methods have been proposed for this purpose, namely, transcoding and layered scalable coding. By transcoding, we can partially decode and re-encode the originally compressed video data to change the bit-rate and adapt to the channel bandwidth. However, the complexity of the transcoder framework is often high.

In MPEG-2/4, several layered scalability techniques, namely, SNR scalability, temporal scalability, and spatial scalability, have been proposed. With these techniques, a video sequence is coded into a base layer (BL) and possibly several enhancement layers (ELs). The base-layer bit-stream, which is transmitted over a reliable channel, provides the basic video quality. The enhancement-layer bit-stream, which is transmitted over an unreliable channel, enhances the video quality of the base layer. The video quality at the decoder is enhanced only if the enhancement-layer bit-stream is received completely; partial decoding of the enhancement-layer data makes image quality spatially uneven.

Fine Granular Scalability (FGS) [6,11] was proposed in the MPEG-4 standard to overcome this discontinuous rate-distortion property. The residues between the original DCT coefficients and the de-quantized DCT coefficients of the base layer form the input of the enhancement layer and are encoded with the bit-plane coding technique. With a simple server, the enhancement-layer bit-stream can be truncated at any length, so that quality varies continuously with the transmission bit-rate.
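The bit-plane idea behind FGS can be illustrated with a small sketch. It is not the MPEG-4 FGS codec: entropy coding (run-length/VLC) is omitted, signs are handled separately, and the function names `to_bit_planes` and `from_bit_planes` are hypothetical. It only shows that receiving more bit-planes, MSB first, refines the reconstructed residues.

```python
def to_bit_planes(residues, n_planes):
    """Split absolute residue values into bit-planes, MSB plane first."""
    return [[(abs(r) >> p) & 1 for r in residues]
            for p in range(n_planes - 1, -1, -1)]


def from_bit_planes(planes, n_planes, signs):
    """Rebuild residues from the first len(planes) bit-planes (MSB first)."""
    values = [0] * len(signs)
    for idx, plane in enumerate(planes):
        weight = 1 << (n_planes - 1 - idx)      # value of a 1-bit in this plane
        values = [v + b * weight for v, b in zip(values, plane)]
    return [s * v for s, v in zip(signs, values)]


residues = [5, -3, 0, 6]                         # toy enhancement-layer residues
signs = [1 if r >= 0 else -1 for r in residues]
planes = to_bit_planes(residues, n_planes=3)
coarse = from_bit_planes(planes[:1], 3, signs)   # only the MSB plane received
full = from_bit_planes(planes, 3, signs)         # all three planes received
```

With only the MSB plane, the residues 5 and 6 are both approximated by 4 and the smaller ones by 0; with all planes the residues are recovered exactly.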
Following the introduction of FGS, several variations, such as Progressive FGS (PFGS) [15], Motion-compensation FGS (MC-FGS) [12], and Robust FGS (RFGS) [3], were proposed to achieve a better compromise between coding efficiency and error resilience. The original FGS framework provides only SNR scalability. Later, FGST (Fine Granularity Scalable Video Coding with Temporal–SNR Scalabilities) [11] was proposed to support both SNR and temporal scalability through a single
enhancement layer. It gives users the capability of trading motion smoothness (by increasing the frame rate) against individual image quality.

In the original FGS coding scheme, each bit-plane of the enhancement layer is coded by scanning macroblocks (MBs) in top-to-bottom, left-to-right order. When the available bandwidth is not enough to transmit a whole bit-plane, the bit-stream is truncated. Cutting bits within a bit-plane is equivalent to enhancing the quality in only part of the frame: the video quality of the upper-left sub-image will be better than that of the lower-right part. This intra-frame quality variation is especially visible when the bit-stream is truncated within a more significant bit-plane (i.e., at a lower bit-rate). To overcome this, Zhou et al. [19] proposed to re-select bits in the enhancement layer for encoding, in view of rate-distortion optimization. Their scheme provides an encoded bit-stream that conforms to the restricted bit budget and offers optimized quality. However, their work did not discuss inter-frame smoothness over the whole video sequence.

In this paper, we propose a new coding scheme that scans bit-plane data in spectral frequency order (low frequencies first, in zigzag order), as shown in Fig. 1. For each spectral frequency, bits are grouped according to the Y, Cb, and Cr components, and within each component MBs are scanned in raster order. The main concept is that perceptually important bits (often, the lower spectral frequencies) are encoded and transmitted first. The spectrally grouped bit-plane data (as shown in Fig. 1) are then treated much like those in the traditional FGS coding scheme; hence the name Spectral FGS (SFGS). SFGS changes only the bit-scanning order in each bit-plane, without modifying the coding structure of FGS. Hence, SFGS is also capable of adapting to bandwidth variation, just like FGS. In a similar manner, we apply the same bit-scanning scheme to FGST (introduced later) to form the Spectral FGST (SFGST).

Fig. 1. Spectral bit-scanning scheme for each bit-plane of the enhancement layer for SFGS.

For rate control, FGS adopts TM5 (Test Model 5) for the base layer, while the enhancement layer uses the simplest and most common method of allocating bits evenly among frames. The same strategy was also adopted for rate allocation in the FGS+ [10] and FGST [11] coding schemes. With this simple approach, the video quality fluctuates, and degradation is noticeable at frames with scene changes or large motion. New rate control schemes for FGS were proposed [17,1,18,14] to smooth out video quality variation. In the work of Zhang and Cheng et al., a set of rate-distortion points was extracted during the encoding process to establish an R-D model for the enhancement-layer signal of each frame. Constant quality among frames was then realized by solving the set of equations defined by the frames in a sliding window for the rates allocated to the enhancement layers. Their work, though it improved the average mean square error (MSE) distortion over the uniform allocation method, neither guaranteed optimality nor offered flexibility in trading against quality variation (i.e., video smoothness). In the work of Wang et al., an R-D function was established for the enhancement layers of a multi-frame group in PFGS coding. By utilizing the characteristics of PFGS, a fast rate allocation implementation was developed that keeps the performance gap to a complex iterative procedure minimal. Theoretically, a larger group size leads to smoother video quality.
However, only experiments for groups of two or three frames (to meet the PFGS coding characteristic) were reported in their work. Zhao et al. considered both the non-scalable base layer and the scalable enhancement layer for smoother video quality. For the enhancement layer, they appended a small number of R-D samples to the bit-stream to construct a piecewise linear R-D model. Based on that model, optimal rate allocation can be reached by minimizing the distortion variation among frames.

The hybrid temporal–SNR scalability (i.e., FGS/FGST) [11] is another form of rate allocation in FGS. It provides a simple, heuristic rate allocation mechanism for making tradeoffs between image quality and motion smoothness in real time. Indices based on base-layer information, such as the number of bits used to encode the motion vectors and the complexities of intra-coded frames, were used to decide when to switch between FGS and FGST. However, no systematic analysis of how to choose thresholds for the proposed indices was given. In all cases of SNR, temporal, and hybrid scalability, that work allocated enhancement-layer bits evenly among all frames. Though the above-mentioned FGS/FGST algorithm makes good tradeoffs between motion smoothness and image quality, it ignores the smoothness of video quality and the status of resource usage, such as the buffer fullness at the encoder/decoder sides. Besides, users have no way to express their own preferences.

In this paper, we propose a unified rate allocation algorithm to solve the above optimization problems jointly, by extending the authors' preliminary results on basic SFGS/SFGST rate control [7,9]. Our rate allocation algorithm is established
on rate-distortion-smoothness optimization and a constrained multi-stage dynamic programming (DP) technique. With DP, it becomes easy to tune coding parameters (e.g., SFGS/SFGST mode selection and the allocated bit-rates) or users' preferences (image quality or video smoothness) by adjusting weighting parameters in the cost function to be optimized. Other work on SFGS has been reported elsewhere. For example, to increase the robustness of SFGS-coded video data, synchronization code-words were inserted in the enhancement layer [8] so that less enhancement-layer data would be discarded in the presence of channel errors.

The paper is organized as follows. In Section 2, we describe the SFGS and SFGST video coding techniques in detail. In Section 3.1, an RD-based rate control algorithm is first proposed to smooth out video quality within a window of frames. Since it does not consider the status of buffer fullness, buffer overflow or underflow may happen during network transmission. In Section 3.2, we treat rate allocation as a constrained optimization problem and apply the DP technique to solve it. Despite its applicability, the DP-based algorithm suffers from unexpected step changes in quality between two consecutive windows. In Section 3.3, the DP-based rate control algorithm is combined with a sliding window process to eliminate this step change. In Section 4, the DP-based rate allocation architecture is modified to make tradeoffs between temporal and SNR scalability (i.e., SFGS/SFGST mode selection). In Section 5, a rate-distortion model is established for each frame to reduce the complexity of online processing. In Section 6, we show the experimental results of the proposed algorithms. Finally, Section 7 concludes the paper.

2. Spectral FGS video coding technique

2.1. Spectral FGS

As illustrated in Fig. 1, the proposed SFGS technique regroups bit-plane data into units of bands, according to spectral frequency ordering (in zigzag order), before they are encoded. A spectral band $B_{i,j}$ is the collection of bits (including 0s and 1s) of the same spectral frequency index in a bit-plane, where $i$ is the frame index and $j$ is the band index, numbered from 1 to $64 \cdot N_{BP}$ and from MSB to LSB ($N_{BP}$ is the number of bit-planes in the considered frame). The encoding unit thus changes from an MB in FGS to a spectral band in SFGS. Low-frequency bands are essentially transmitted with high priority when bit-stream truncation is required due to insufficient bandwidth. However, low-frequency bands do not necessarily provide higher coding efficiency than high-frequency bands. We therefore propose to calculate an R-D index for each spectral band in all bit-planes. Based on these R-D indices, bands in the same bit-plane are assigned transmission priorities (more significant bit-planes are always transmitted with higher priority). Note that reordering the transmission priority of bands in a bit-plane requires an overhead of 6 bits per band to specify the band number for proper reconstruction at the decoder. Higher bit-planes that are not truncated do not incur this band reordering and need no overhead at all.
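The regrouping of one bit-plane into spectral bands can be sketched as follows. This is a simplified illustration under stated assumptions: a single colour component is handled (the Y/Cb/Cr grouping of Fig. 1 is omitted), blocks are 8×8 and given in raster order, band indices are 0-based, and the names `zigzag_order` and `spectral_bands` are hypothetical.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zigzag scan order."""
    def key(rc):
        r, c = rc
        d = r + c                      # index of the anti-diagonal
        # odd diagonals are walked with the row increasing,
        # even diagonals with the column increasing
        return (d, r if d % 2 else c)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def spectral_bands(bit_plane_blocks):
    """Regroup one bit-plane (raster-ordered list of 8x8 bit blocks) into
    64 spectral bands; band 0 collects the DC-position bit of every block."""
    return [[blk[r][c] for blk in bit_plane_blocks]
            for (r, c) in zigzag_order(8)]


def block_with_one(row, col):
    """Toy 8x8 block whose only 1-bit sits at (row, col)."""
    return [[1 if (r, c) == (row, col) else 0 for c in range(8)]
            for r in range(8)]


bands = spectral_bands([block_with_one(0, 1), block_with_one(1, 0)])
```

In this toy case the first block contributes its 1-bit to band 1 and the second block to band 2, so truncating the stream after band 1 still delivers the lowest frequencies of every block rather than all frequencies of only the first blocks.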

The proposed SFGS coding algorithm is summarized below:

1. Encode the base-layer bit-stream as in MPEG-4 and record the PSNR (denoted as $PSNR_i^{Base}$) of the reconstructed frame $i$.
2. Compute the enhancement-layer signal (DCT quantization errors) as in MPEG-4 FGS.
3. Rearrange bit-plane data to form spectral bands as in Fig. 1.
4. Encode bit-plane data band-by-band with a method similar to that used in MPEG-4 FGS (which encodes bit-plane data MB-by-MB), except that different VLC tables are used for different groups of bands (bands in zigzag frequency order are sequentially partitioned into three groups) [13].
5. Calculate the R-D index $\lambda_{i,j}$ for each band $B_{i,j}$ as

$$\delta R_{i,j} = R_{i,j} - R_{i,j-1}, \quad \delta PSNR_{i,j} = PSNR_{i,j} - PSNR_{i,j-1}, \quad \lambda_{i,j} = \delta PSNR_{i,j} / \delta R_{i,j}, \quad 1 \le j \le 64 \cdot N_{BP}, \tag{1}$$

where $R_{i,j}$ represents the accumulated number of bits required to encode bands 1 to $j$ for the enhancement layer of the $i$th frame (excluding the header part of the enhancement layer), and $PSNR_{i,j}$ is the measured image quality when the base-layer data and enhancement-layer bands 1 to $j$ of the $i$th frame are successfully received. By these definitions, $\delta R_{i,j}$ is the number of bits spent in coding $B_{i,j}$ and $\delta PSNR_{i,j}$ is the PSNR contributed by $B_{i,j}$. Hence, $\lambda_{i,j}$ indicates the coding efficiency of $B_{i,j}$ (dB/bit). Bands with a larger $\delta PSNR$ and a smaller $\delta R$ have a higher $\lambda$.
6. Record the $R_{i,j}$ and $PSNR_{i,j}$ values in an auxiliary file.

On video transmission, the server performs the following procedure:

1. Determine the number of bits (and hence the corresponding priority bands) allocated to the enhancement layer of the current frame by the rate control algorithm described later.
2. Transmit the bit-stream of priority bands from MSB to LSB. For a fully transmitted bit-plane (i.e., not the last transmitted one), the 64 spectral bands are sequentially ordered (hence no overhead of band indices is required). For a partially transmitted bit-plane, bands of higher priority (i.e., those with higher $\lambda_{i,j}$) are transmitted first. For the last transmitted bit-plane (often truncated), the positions and lengths of the transmitted $B_{i,j}$ are obtained by referring to the $R_{i,j}$ values recorded in the auxiliary file.
3. When the bit-rate budget is not sufficient to support the transmission of a full band, truncate the bit-stream anywhere within it.

2.2. Spectral FGST

The derivation of SFGST from SFGS is similar to the way FGST is derived from FGS. Two kinds of FGST architectures have been proposed [11]. The first one arranges FGST-VOPs (Video Object Planes) and FGS-VOPs in different enhancement layers so that the server can easily choose between FGST-VOP and FGS-VOP according to the bit-rate budget. The second one places FGST-VOPs and FGS-VOPs in the same enhancement layer, thus allowing motion smoothness and video quality to be satisfied simultaneously, but requiring additional header bits to differentiate between FGST-VOP and FGS-VOP at the receiver side. In our research, we adopted the second architecture to form SFGST.

Fig. 2 illustrates the structure of P and B-frames in the enhancement layer. In general, more B-frames can be inserted to further increase the frame rate. B-frames (SFGST-VOP) differ from I/P-frames (SFGS-VOP) in two aspects: (1) the absence of base-layer data and (2) motion-compensated residues vs. DCT quantization errors. Hence, each B-frame records two motion vectors (MVs) to accomplish motion compensation from its two adjacent P-frames. Despite the different natures of the enhancement-layer signals, both P and B-frames use the same bit-plane coding technique to form the enhancement-layer bit-stream. On video transmission, the server can choose to allocate bits to B-frames to increase the frame rate or to I/P-frames to increase the image quality. That is, FGST offers a way to make tradeoffs between motion smoothness and image quality. However, dynamic FGS/FGST (similarly, SFGS/SFGST) mode selection remains an open issue.

Fig. 2. The structure of P and B-frames in the enhancement layer of SFGST.
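To make Eq. (1) of Section 2.1 and the priority ordering used by the server concrete, a minimal sketch is given here. It assumes that the accumulated values before the first band are $R_{i,0} = 0$ and $PSNR_{i,0} = PSNR_i^{Base}$, and that a band costing zero bits gets an infinite index; both are assumptions of this sketch, as are the function names. Band indices are 0-based, and the toy example uses 2 bands per bit-plane instead of 64.

```python
def rd_indices(rates, psnrs, psnr_base):
    """R-D index per band from accumulated bits and PSNR, as in Eq. (1).

    rates[j] and psnrs[j] are the accumulated values after band j (0-based).
    Assumption: before any band, rate is 0 and PSNR is the base-layer PSNR.
    """
    lams = []
    prev_r, prev_p = 0, psnr_base
    for r, p in zip(rates, psnrs):
        d_r, d_p = r - prev_r, p - prev_p
        # Assumption: a band that costs no bits gets top priority.
        lams.append(d_p / d_r if d_r > 0 else float("inf"))
        prev_r, prev_p = r, p
    return lams


def priority_order(lams, bands_per_plane=64):
    """Band indices ordered plane by plane (MSB first) and, inside each
    bit-plane, by decreasing R-D index (ties keep the zigzag order)."""
    order = []
    for start in range(0, len(lams), bands_per_plane):
        plane = range(start, min(start + bands_per_plane, len(lams)))
        order.extend(sorted(plane, key=lambda j: -lams[j]))
    return order


# Toy frame: 2 bit-planes with 2 bands each.
rates = [100, 150, 400, 420]           # accumulated bits
psnrs = [30.5, 31.0, 31.5, 31.6]       # accumulated PSNR in dB
lams = rd_indices(rates, psnrs, psnr_base=30.0)
order = priority_order(lams, bands_per_plane=2)
```

Here the second band of each plane is cheaper per dB than the first, so it moves to the front of its plane, while bands never move across bit-planes.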

3. Rate control schemes for SFGS

3.1. RD-based rate control scheme

In the MPEG-4 standard, FGS allocates bits evenly among frames for the enhancement layer. With this method, video quality is less smooth along the sequence. Since frames with complex content or large motion consume many more bits to maintain comparable quality, even bit
allocation inevitably lowers their picture quality. To overcome this, we propose a novel rate control algorithm for SFGS that smooths out video quality within a group (window) of pictures. This algorithm is based on the principle of rate-distortion optimization. The R-D index of each spectral band is computed and recorded in the encoding process (as described in Section 2.1) and is then used for rate allocation when the encoded video data are transmitted over networks.

Denote the bit-rate budget for a considered group (window) of pictures as $S$ (which depends on the available channel bandwidth) and the maximal PSNR difference allowed between any two consecutive frames as $\kappa$ dB. The RD-based rate allocation scheme is summarized as follows [13].

1. Initialize the considered window of $N$ frames with empty enhancement layers (i.e., with the transmitted spectral bands undetermined). Retrieve the corresponding $PSNR_i^{Base}$ values from the auxiliary file and assign them to the accumulated PSNR values: $PSNR_i := PSNR_i^{Base}$ ($i$ is the frame index).
2. Sort the spectral bands of each frame (by bit-plane position in decreasing significance and, within the same bit-plane, by R-D index $\lambda_{i,j}$ in decreasing value) and place them in a queue, so that priority bands are located at the front of the queue.
3. Calculate the PSNR difference between any two consecutive frames $i$ and $j$ and denote it as $DPD_{ij}$ (i.e., $DPD_{ij} = |PSNR_i - PSNR_j|$). Choose the largest $DPD_{ij}$ (denoted as $DPD_{i^*j^*}$) among those greater than $\kappa$ and go to step 4. If no $DPD_{ij}$ is larger than $\kappa$, go to step 7.
4. Set $k^* = i^*$ (if $PSNR_{i^*} < PSNR_{j^*}$) or $k^* = j^*$ (if $PSNR_{i^*} \ge PSNR_{j^*}$), i.e., choose whichever of frames $i^*$ and $j^*$ has the lower accumulated PSNR. Allocate bits for the spectral band (denoted as $h$) at the head of queue-$k^*$. This selection reduces the PSNR difference between frames $i^*$ and $j^*$.
5. Remove the band selected in step 4 from queue-$k^*$ and update the accumulated PSNR value by $PSNR_{k^*} := PSNR_{k^*} + \delta PSNR_{k^*,h}$.
6. Check the number of bits allocated so far. If it exceeds $S$, go to step 9; otherwise go back to step 3 for the next allocation.
7. Allocate bits to the frame $k$ ($1 \le k \le N$) whose unallocated spectral band at the head of its queue has the largest $\lambda$ value among all frames.
8. Remove the band selected in step 7 from the queue and go to step 6.
9. Discard the last allocated spectral band so that the bit-rate budget constraint is met. This completes the rate allocation process for the current group of pictures. Go back to step 1 to process the next group of pictures.
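Steps 1 to 9 above can be sketched for a single window as follows. This is an illustrative reading of the procedure, not the authors' implementation: the data layout (a dict per frame with pre-sorted `bands` tuples), the name `rd_allocate`, and the handling of an exhausted queue in step 4 (the pair is skipped) are assumptions. The budget check is done before a band is committed, which is equivalent to allocating it and then discarding it in step 9.

```python
from collections import deque


def rd_allocate(frames, budget, kappa):
    """Greedy RD-based allocation for one window.

    frames: list of dicts with 'psnr_base' and 'bands', a priority-sorted
            list of (bits, d_psnr, lam) tuples (step 2 is assumed done).
    Returns (bands sent per frame, bits used, accumulated PSNR per frame).
    """
    psnr = [f["psnr_base"] for f in frames]             # step 1
    queues = [deque(f["bands"]) for f in frames]
    sent = [0] * len(frames)
    used = 0
    while True:
        k = None
        # Step 3: largest PSNR gap between consecutive frames above kappa.
        gaps = sorted(((abs(psnr[i] - psnr[i + 1]), i)
                       for i in range(len(frames) - 1)), reverse=True)
        for gap, i in gaps:
            if gap <= kappa:
                break
            low = i if psnr[i] < psnr[i + 1] else i + 1  # step 4
            if queues[low]:      # assumption: skip a pair with no bands left
                k = low
                break
        if k is None:
            # Step 7: frame whose head-of-queue band has the largest index.
            heads = [(q[0][2], i) for i, q in enumerate(queues) if q]
            if not heads:
                break
            k = max(heads)[1]
        bits, d_psnr, _ = queues[k][0]
        if used + bits > budget:  # steps 6 and 9: drop the overshooting band
            break
        queues[k].popleft()       # steps 5 and 8
        used += bits
        psnr[k] += d_psnr
        sent[k] += 1
    return sent, used, psnr


window = [
    {"psnr_base": 30.0, "bands": [(100, 1.0, 0.01), (100, 0.5, 0.005)]},
    {"psnr_base": 33.0, "bands": [(100, 1.0, 0.01), (100, 0.5, 0.005)]},
]
sent, used, psnr = rd_allocate(window, budget=250, kappa=1.0)
```

In this toy window the first frame starts 3 dB below the second, so both of its bands are sent first; the next candidate band would exceed the 250-bit budget and is not transmitted.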

3.2. DP-based rate control scheme

The RD-based rate control algorithm aims to smooth out video quality by scheduling the transmission of spectral bands according to their rate-distortion indices and the PSNR differences between consecutive frames. However, it ignores the status of coder/decoder buffer fullness (which may cause buffer underflow or overflow during network transmission). Nor can it take into account users' preferences between video smoothness and image quality. To overcome these problems, we use the dynamic programming (DP) approach to perform a constrained optimization of rate-distortion-smoothness performance; the result is called the DP-based scheme. As in the RD-based scheme, a set of frames is grouped in a window for rate allocation. First, a multi-stage topology for DP processing is constructed, as shown in Fig. 3, where $N$ is the window size and $M$ is the total number of spectral bands in each frame, i.e., $M = 64 \cdot N_{BP}$.

Fig. 3. The multi-stage topology of DP-based rate allocation scheme.

Before describing the algorithm, we define some terminology:

a. stage: one frame in the window.
b. node: one sorted spectral band in a frame. Denote by $V(i,j)$ the $j$th node at the $i$th stage. Each node $V(i,j)$ is associated with the values $R_{i,j}$ and $PSNR_{i,j}$ (see the definitions of $i$ and $j$ in Section 2.1) retrieved from the auxiliary file.
c. node cost: $c\_node_{i,j}$, the cost of selecting node $V(i,j)$, calculated as in Eq. (2):

$$c\_node_{i,j} = \left| PSNR_{i,j} - PSNR_{i,M} \right|. \tag{2}$$

Since a path passes through only one node per stage in DP processing, selecting $V(i,j)$ stands for transmitting the $j$ highest-priority spectral bands of the $i$th frame. Hence, the cost is evaluated as the quality difference between $V(i,j)$ and $V(i,M)$. Nodes closer to $V(i,M)$ have lower $c\_node_{i,j}$ values and hence higher image quality.
d. edge cost: $c\_edge_{i,j,k}$, the cost of traversing an edge linking two nodes of successive stages, $V(i,j)$ and $V(i-1,k)$, as given in Eq. (3):

$$c\_edge_{i,j,k} = \left| PSNR_{i,j} - PSNR_{i-1,k} \right|. \tag{3}$$

The edge cost $c\_edge_{i,j,k}$ is evaluated as the quality difference between the $i$th and $(i-1)$th frames when up to $j$ and $k$ priority bands are transmitted, respectively. In general, the edge cost reflects the smoothness of the video.
e. path cost: $COST(i,j)$, the sum of node and edge costs along a path from the virtual node at stage 0 to $V(i,j)$. Eq. (4) gives a recursive formulation for computing $COST(i,j)$, where $c(j,k)$ represents the weighted sum of costs (Eq. (5)) incurred when the path steps from a preceding node $V(i-1,k)$ to $V(i,j)$. Adjusting $\alpha$ (0.0–1.0) is equivalent to making a tradeoff between image quality ($\alpha = 0$, node cost has priority) and video smoothness ($\alpha = 1$, edge cost has priority) in the considered window of frames.

$$COST(i,j) = \min_k \{ c(j,k) + COST(i-1,k) \}, \tag{4}$$

$$c(j,k) = (1-\alpha) \cdot c\_node_{i-1,k} + \alpha \cdot c\_edge_{i,j,k}. \tag{5}$$

f. forward_linkage: $f\_link(i,j)$, the node in the $(i-1)$th stage that is linked to $V(i,j)$.
g. buffer occupancy: $O(i,j)$, the accumulated bit count in the buffer after $j$ priority bands of frame $i$ are put into the buffer for network transmission (i.e., $O(i,j)$ reflects the buffer status at the $i$th stage for a path passing through $V(i,j)$). To avoid buffer overflow or underflow, we consider the buffer status as a constraint (see Eq. (8) below) in the optimization process. $O(i,j)$ is initialized by using Eq. (6):

$$O_{initial}(i,j) = \begin{cases} Bit_{pre}, & i = 0, \\ C_{BL}(i) + C\_header_{EL}(i) - R_{ch}, & i \ne 0, \end{cases} \tag{6}$$

where $Bit_{pre}$ is the number of bits preloaded in the buffer, $C_{BL}(i)$ and $C\_header_{EL}(i)$ are the numbers of bits in the BL and in the header of the EL for frame $i$, and $R_{ch}$ is the output rate to the channel. Notice that after the bit-stream of frame $i$ with $j$ priority bands is put into the buffer, the buffer fullness can be calculated as

$$O(i,j) = O(i-1, f\_link(i,j)) + C_{BL}(i) + C\_header_{EL}(i) + R_{i,j} - R_{ch}. \tag{7}$$

The first term is inherited from the status at the previous frame and the 2nd–4th terms are the transmitted parts of frame $i$. Since several terms (the 2nd, 3rd, and 5th) are constants, regardless of which node is selected, they can be initialized, as in Eq. (6), before the optimization process.

The DP-based rate allocation scheme is described below:

1. Initialize the multi-stage topology (e.g., the node and edge costs defined in Eqs. (2) and (3)) with the $R_{i,j}$ and $PSNR_{i,j}$ values retrieved from the file. The parameters $f\_link(i,j)$ are initialized to -1 and the $O(i,j)$ are set to $O_{initial}(i,j)$, as given in Eq. (6).

2. Compute the minimum cost for the path from stage 0 to any node $V(i,j)$, $i = 1, \ldots, N+1$, $j = 1, \ldots, M$. When determining the forward-linking node of $V(i,j)$ (i.e., determining $k$ in Eq. (4)), the resulting buffer occupancy ($O_{initial}(i,j) + O(i-1,k) + R_{i,j}$) is calculated and checked against Eq. (8) to avoid buffer overflow or underflow. Under the constraint of Eq. (8), a node $k^*$ from the previous (i.e., $(i-1)$th) stage is chosen and the buffer occupancy $O(i,j)$ is updated by using Eq. (9).

$$0 < \left( O(i,j) + O(i-1,k) + R_{i,j} \right) \le 0.9 \cdot buffer\_size, \tag{8}$$

$$O(i,j) := O(i,j) + O(i-1,k^*) + R_{i,j}. \tag{9}$$

3. After all nodes are evaluated, the path resulting in $COST(N+1,1)$ (i.e., the minimum-cost path from $V(0,1)$ to $V(N+1,1)$, obtained under the buffer fullness constraint in Eq. (8)) gives the final solution. Details of the DP approach are omitted here; interested readers can easily find suitable references (e.g., [2]).
4. Allocate bits for each frame in this group of pictures based on the optimal path found above. For example, a passed node $V(3,100)$ indicates that the full MSB plane and the $\lambda_{i,j}$-sorted bands 1–36 of the (MSB-1) plane are to be transmitted for the 3rd frame.
5. Repeat the allocation for each group of $N$ pictures in the sequence.

It should be noted that the RD-based scheme is capable of keeping a maximal PSNR difference of $\kappa$ dB between any two adjacent frames within the processed window. The DP-based scheme sacrifices this $\kappa$-guarantee, but tends to find the minimum-cost solution (defined as in Eqs. (4) and (5), which consider a tradeoff between image quality and video smoothness) that meets the buffer fullness constraint.

3.3. DP-based rate control scheme with sliding window

Both of the above schemes process a video sequence window by window, providing ways to guarantee or minimize PSNR variations between adjacent frames in a window. However, they do not consider PSNR smoothness around window boundaries. In fact, both schemes cause a significant step change in PSNR at window boundaries. To cope with this problem, a sliding window procedure is integrated into the DP-based rate allocation scheme to eliminate these gross changes of PSNR in the transmitted video. Fig. 4 illustrates the proposed sliding window process, where the window size is denoted as WS (previously $N$) and the sliding step size as SWS. Our strategy is that rate allocation analysis within a window is performed as before, but the allocation is actually committed only for the first SWS frames.
The remaining (WS − SWS) frames are re-processed after the window slides (with step size SWS) to the next position. The rate allocation of a frame is committed only when it occupies one of the first SWS positions in a window. With this modification, the step change in PSNR between every SWS frames can be reduced. This smoothness results from referencing the contents of the following (WS − SWS) frames when allocating bits for the first SWS frames in a window. Similar techniques have been used in many other fields of research, e.g., the MDCT with overlapping frames in audio coding.

Fig. 4. The DP-based rate allocation scheme with sliding window.

The above sliding window process is in some sense equivalent to shortening the window size from WS to SWS. Later, in the experimental section, we compare the performance of a smaller non-sliding window (of size $WS_{non\text{-}slide}$) with that of a larger sliding window (with step size $SWS_{slide}$) having the same effective window size (i.e., $WS_{non\text{-}slide} = SWS_{slide}$).
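The DP recursion of Section 3.2 and the sliding window of Section 3.3 can be sketched together. Several simplifications here are assumptions rather than the authors' method: every frame has the same number of nodes, the node cost of Eq. (2) is charged when a node is entered (Eq. (5) charges the preceding node's cost, which gives the same path total if the virtual start and end nodes cost nothing), per-frame bits `C[i]` stand for $C_{BL}(i) + C\_header_{EL}(i)$, each node keeps only its cheapest feasible predecessor, and the buffer state is carried from one window position to the next. The names `dp_allocate` and `sliding_allocate` are hypothetical, and indices are 0-based.

```python
import math


def dp_allocate(R, P, C, r_ch, buf_size, preload, alpha):
    """Return (node index per frame, buffer fill per frame), or None."""
    n, m = len(R), len(R[0])
    cost = [[math.inf] * m for _ in range(n)]
    occ = [[0.0] * m for _ in range(n)]
    link = [[-1] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            node = abs(P[i][j] - P[i][m - 1])                    # Eq. (2)
            if i == 0:   # virtual start node: zero cost, preloaded buffer
                prevs = [(0.0, preload, -1)]
            else:        # accumulated cost plus weighted edge cost, Eq. (3)
                prevs = [(cost[i - 1][k] + alpha * abs(P[i][j] - P[i - 1][k]),
                          occ[i - 1][k], k)
                         for k in range(m) if cost[i - 1][k] < math.inf]
            for c_prev, o_prev, k in prevs:
                o = o_prev + C[i] + R[i][j] - r_ch               # Eq. (7)
                c = c_prev + (1 - alpha) * node
                if 0 < o <= 0.9 * buf_size and c < cost[i][j]:   # Eq. (8)
                    cost[i][j], occ[i][j], link[i][j] = c, o, k
    j = min(range(m), key=lambda x: cost[n - 1][x])
    if math.isinf(cost[n - 1][j]):
        return None                      # no path satisfies the buffer limits
    path, fill = [], []
    for i in range(n - 1, -1, -1):       # backtrack along forward links
        path.append(j)
        fill.append(occ[i][j])
        j = link[i][j]
    return path[::-1], fill[::-1]


def sliding_allocate(R, P, C, ws, sws, r_ch, buf_size, preload, alpha):
    """Analyse windows of ws frames but commit only the first sws decisions."""
    chosen, start = [], 0
    while start < len(R):
        end = start + ws
        res = dp_allocate(R[start:end], P[start:end], C[start:end],
                          r_ch, buf_size, preload, alpha)
        if res is None:
            raise ValueError("no feasible allocation at frame %d" % start)
        path, fill = res
        keep = min(sws, len(path))
        chosen.extend(path[:keep])
        preload = fill[keep - 1]   # assumption: carry the buffer state forward
        start += keep
    return chosen


R = [[100, 200, 300]] * 3                       # accumulated EL bits per node
P = [[30, 32, 34], [28, 31, 34], [30, 32, 34]]  # accumulated PSNR per node
C = [50, 50, 50]
one_window = dp_allocate(R, P, C, r_ch=250, buf_size=500, preload=200, alpha=0.5)
sliding = sliding_allocate(R, P, C, ws=2, sws=1, r_ch=250, buf_size=500,
                           preload=200, alpha=0.5)
```

In this toy case the buffer limit of 0.9 × 500 bits prevents all three frames from sending every band, so the last frame is cut back by one node. This sketch does not add an edge cost between the last committed frame and the first frame of the next window position, which a fuller treatment might include.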

4. SFGS/SFGST mode selection and rate allocation

As addressed in Section 2.2, FGST-VOPs (i.e., B-frames) are inserted to increase the frame rate, i.e., to improve the motion smoothness of the video. However, the insertion of FGST-VOPs should depend on the video content and the available channel bandwidth. For example, a video clip with large motion is better encoded with FGST, while FGS is sufficient for a static scene. Motion activity in a video may change over time, which makes it necessary to change the FGS/FGST modes dynamically.

In the work of Rajendran et al. [10], a subjective quality assessment was conducted to evaluate the performance of various FGS SNR–temporal tradeoffs at different bit-rates and frame rates. This subjective study makes it possible to determine the optimal division of bits between SNR and temporal layers at different bit-rates (or to indicate the level to which SNR quality needs to be enhanced before motion smoothness should be improved). Hence, the SNR and temporal qualities were considered jointly rather than independently. However, since this joint consideration of SNR and temporal qualities was based on an advanced subjective study, the human effort required is large and impractical.

On the other hand, Huang and Huang [4] proposed a method for content-based FGS coding mode selection. Their technique targets applications in real-time
video encoding and transmission over wireless networks. Spatial and temporal features are extracted from the video sequence and combined with the available transmission bandwidth to form a feature vector, which is then classified to make coding-mode decisions (FGS, FGST, FGS-SE (selective enhancement), and FGS-BC (background composition)). In short, they convert the cost-minimization problem into a standard maximum likelihood, or classification, problem. However, the classifier depends on the transmission bit-rate, and the success rate of making the best decision is only 63.5%. From an application viewpoint, an on-line FGS video coder is of little value for unicast network transmission, since a non-scalable video coder can provide higher coding efficiency in that case. Besides, their work did not address rate control within a coding unit once the coding mode was determined.

Compared to these works [10,4], our algorithm has two main focuses: (1) providing a truly content-based and rate-distortion-smoothness optimized scheme for SFGS/SFGST mode selection, in contrast to feature-based classification [4] and subjective study [10]; (2) providing an on-line rate control scheme for the server to transmit pre-encoded SFGS video, in contrast to the live FGS encoder [4]. With these two concerns, a similar DP-based rate control scheme, combined with some spatial–temporal distortion metrics, is proposed to achieve dynamic SFGS/SFGST mode selection and rate allocation simultaneously. As before, our goal is to provide a way of trading between video smoothness, motion smoothness, and image quality, according to users' preferences, the bandwidth budget, and the buffer fullness constraint.

The multi-stage topology for combined mode selection and rate allocation is illustrated in Fig. 5, where B-frames are inserted for SFGST implementation. For stages representing B-frames, $M$ virtual nodes, $V^v(i,j)$, are added.
Here, the selection of a virtual node $V^{v}(i, j)$ in DP linking represents the situation in which the corresponding B-frame is not transmitted (i.e., the SFGS mode is selected) and is recovered at the decoder side via interpolation from its two adjacent received frames. The Motion-Compensated Interpolation (MCI) [5] technique is adopted for reconstructing an un-transmitted frame at the receiver side. Conversely, selection of a real node V (i, j) means transmission of the B-frame (i.e., the SFGST mode is selected). Similar to the prior DP-based scheme, the optimal path from V (0, 1) to V (N + 1, 1) determines the transmission modes (i.e., the selection of a real or a virtual node) and the allocated bit-rates for all frames in the processed window. Other differences between Figs. 3 and 5 include the definitions of $R^{v}_{i,j}$, $PSNR^{v}_{i,j}$, $f\_link^{v}(i, j)$, $O^{v}(i, j)$, $c\_node^{v}_{i,j}$, and $c\_edge^{v}_{i,j,k}$ for each virtual node $V^{v}(i, j)$. They are explained below.

(1) $R^{v}_{i,j}$: always equals zero, since a virtual node represents an un-transmitted frame that is reconstructed by interpolation at the decoder side and hence consumes no bit-rate.


Fig. 5. DP topology for SFGS/SFGST mode selection and rate control.

(2) $PSNR^{v}_{i,j}$: represents the quality of the interpolated ith frame when j-priority bands are received for the (i − 1)th frame. This value also depends on the reconstructed quality of the (i + 1)th frame, i.e., on which node V (i + 1, l), l ∈ [1, M], is linked to $V^{v}(i, j)$. Hence, $PSNR^{v}_{i,j}$ cannot be determined in advance, but only at DP path-evaluation time. Note that we restrict frame interpolation to cases based on complete spectral bands, though an FGS/SFGS server allows bit-stream truncation at any point within a bit-plane. Values of $PSNR^{v}_{i,j}$ are used in calculating $c\_node^{v}_{i,j}$ and $c\_edge^{v}_{i,j,k}$. How $PSNR^{v}_{i,j}$ is obtained is explained later.
(3) $f\_link^{v}(i, j)$: is fixed to j, i.e., $V^{v}(i, j)$ is always forward-linked to V (i − 1, j). This simplifies the computation of $PSNR^{v}_{i,j}$, as discussed later.
(4) $O^{v}(i, j)$: is formulated in the same way as O (i, j) in Eqs. (6) and (7).
(5) $c\_node^{v}_{i,j}$, $c\_edge^{v}_{i,j,k}$: are formulated as in Eqs. (2) and (3), except that each $PSNR_{i,j}$ is replaced by $PSNR^{v}_{i,j}$.
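The attributes listed in items (1)–(5) can be pictured as a small record per node. The sketch below is only an illustration under stated assumptions: the class `DPNode` and the helpers `make_virtual_node` and `resolve_virtual_psnr` are hypothetical names that do not come from the paper, and the buffer and cost attributes (Eqs. (2), (3), (6), and (7)) are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DPNode:
    """One node of the multi-stage DP graph (hypothetical layout)."""
    stage: int                      # frame index i
    index: int                      # band index j, 1..M
    virtual: bool                   # True: B-frame skipped (SFGS mode)
    rate: float                     # R_{i,j}; zero for a virtual node
    psnr: Optional[float] = None    # unknown for a virtual node until linked
    f_link: Optional[int] = None    # forward link into stage i - 1


def make_virtual_node(i: int, j: int) -> DPNode:
    # Item (1): a virtual node consumes no bit-rate.
    # Item (3): it is always forward-linked to V(i - 1, j).
    return DPNode(stage=i, index=j, virtual=True, rate=0.0, psnr=None, f_link=j)


def resolve_virtual_psnr(node: DPNode, l: int,
                         interpolated_quality: Callable[[int, int, int], float]) -> float:
    """Item (2): the PSNR is fixed only once the node V(i + 1, l) is known.

    `interpolated_quality(i, j, l)` stands in for the lookup of the
    interpolated frame quality; it is an assumed callback.
    """
    if not node.virtual:
        raise ValueError("only virtual nodes have a deferred PSNR")
    node.psnr = interpolated_quality(node.stage, node.index, l)
    return node.psnr
```

Keeping `psnr` empty until a successor is chosen mirrors the statement that the value is available only at path-evaluation time.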


As mentioned, $PSNR^{v}_{i,j}$ of a virtual node $V^{v}(i, j)$ cannot be obtained before transcoding, since it depends on which node in the (i + 1)th stage is linked. There are thus M possible values for each $PSNR^{v}_{i,j}$, one for each triple V (i − 1, j)–$V^{v}(i, j)$–V (i + 1, l), l = 1, . . . , M. With a total of M nodes in the (i − 1)th stage, M × M cases need to be evaluated (as shown in Fig. 6). Denote the picture qualities interpolated from all possible pairings as a matrix $INT_i[j][l]$, 1 ≤ j, l ≤ M. Theoretically, the matrix $INT_i$ can be pre-computed and loaded at transcoding time for the calculation of the $PSNR^{v}_{i,j}$ values. However, this would consume a large amount of storage (if M = 256, each matrix contains 65,536 elements). As a tradeoff, off-line frame interpolations are conducted for only a few V (i − 1, j)–V (i + 1, l) pairings (i.e., only a small part of $INT_i[j][l]$ is pre-computed), and the others are derived via bi-linear interpolation during the on-line DP evaluation process. For each ith frame, we pre-compute $INT_i[j][l]$ for j, l = b × 64 + 1, b × 64 + 16, b × 64 + 32, b × 64 + 64, where b = 0, 1, 2, . . . , $N_{BP}$ − 1 (MSB → LSB) is the bit-plane index. That is, only $4N_{BP} \times 4N_{BP}$ combinations are pre-analyzed. This reduces the pre-computing load and on-line memory by a factor of 16 × 16 = 256. Basically, this scheme is equivalent to approximating the Rate–PSNR relationship piecewise linearly for each bit-plane (see Section 5).

As an example, we compute the PSNR for the case in which $V^{v}(i, 150)$ and V (i + 1, 100) are to be linked, denoted as $PSNR^{v}_{i,150}|_{l=100}$. First, we retrieve the pre-stored $PSNR^{v}_{i,144}|_{l=96}$, $PSNR^{v}_{i,144}|_{l=128}$, $PSNR^{v}_{i,160}|_{l=96}$, and $PSNR^{v}_{i,160}|_{l=128}$ (see Fig. 7A), whose indices are the closest pre-computed ones around (j, l) = (150, 100) (note that (144 = 64 × 2 + 16) < (150 = 64 × 2 + 22) < (160 = 64 × 2 + 32) and (96 = 64 × 1 + 32) < (100 = 64 × 1 + 36) < (128 = 64 × 1 + 64)).
Then perform bi-linear interpolation from them to yield

Fig. 6. Possible V (i − 1, j)–V (i + 1, l) pairings for the interpolation of frame i.


Fig. 7. Bi-linear interpolation to calculate the interpolated quality. (A) An example of INTi[150][100] and (B) the concept illustration of (A) for bi-linear interpolation from four adjacent data.

$PSNR^{v}_{i,150}|_{l=100}$ (see Fig. 7B for a conceptual illustration). Since the DP algorithm only needs to calculate $PSNR^{v}_{i,j}|_{l}$ by bi-linear interpolation from the available $INT_i[j][l]$ values, the on-line/off-line computing loads and memory requirements are low.

The detailed SFGS/SFGST mode selection and rate control scheme is described below.

1. Initialize the multi-stage DP graph in Fig. 5, where the upper and lower parts are for the SFGST and SFGS modes, respectively. Node attributes for P-frames are initialized as in Section 3.2, while those for B-frames are initialized as follows.
Real nodes: $R_{i,j}$, $PSNR_{i,j}$ (retrieved), f_link (i, j) = 1, and O (i, j) = $O_{initial}(i, j)$ ($C_{BL}(i) = 0$ in Eq. (6)).
Virtual nodes: $R^{v}_{i,j} = 0$, $PSNR^{v}_{i,j} = 0$, $f\_link^{v}(i, j) = j$, and $O^{v}(i, j) = O_{initial}(i, j)$ ($C_{BL}(i) = C_{headerEL}(i) = 0$ in Eq. (6)).

2. Execute the DP algorithm as before. For the forward linkage of a real node V (i, j), choose one node in the previous stage that satisfies the buffer fullness constraint in Eq. (8). For virtual nodes, always link each $V^{v}(i, j)$ to V (i − 1, j) if the buffer constraint in Eq. (10) is satisfied. Otherwise, $V^{v}(i, j)$ is not linked to any node in the previous stage, which means that paths passing through $V^{v}(i, j)$ are not evaluated any further. Update the buffer occupancies of virtual nodes according to Eq. (11), instead of Eq. (9). Notice that when a node V (i, j) is linked to a virtual node $V^{v}(i − 1, j')$, f_link (i, j) is recorded as j' + M. Hence, f_link (i, j) > M indicates linkage to a virtual node.

$$0 \le \left( O^{v}(i, j) + O(i - 1, j) \right) \le 0.9 \times buffer\_size, \tag{10}$$

$$O^{v}(i, j) := O^{v}(i, j) + O(i - 1, j). \tag{11}$$

3. Find the path with the minimal cost, i.e., COST (N + 1, 1).

4. Allocate bits for each frame in this group of pictures based on the optimal path found above.

5. Allocate bits for the next group of N pictures in the sequence (with a sliding window or not).
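The bi-linear lookup of the interpolated quality used in step 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the dictionary layout of the pre-computed table are assumptions, and the grid simply follows the indices b × 64 + 1, b × 64 + 16, b × 64 + 32, and b × 64 + 64 given in the text.

```python
from bisect import bisect_left

OFFSETS = (1, 16, 32, 64)  # pre-computed band offsets inside each bit-plane


def grid_points(n_bp):
    """Band indices for which INT_i[j][l] is pre-computed (4 per bit-plane)."""
    return [b * 64 + o for b in range(n_bp) for o in OFFSETS]


def bracket(x, grid):
    """Return the pre-computed indices (lo, hi) with lo <= x <= hi."""
    if not grid[0] <= x <= grid[-1]:
        raise ValueError("index outside the pre-computed range")
    k = bisect_left(grid, x)
    if grid[k] == x:
        return x, x
    return grid[k - 1], grid[k]


def interpolated_quality(table, j, l, grid):
    """Bi-linear estimate of INT_i[j][l] from its four stored neighbours.

    `table` maps (j, l) pairs of grid indices to stored PSNR values
    (an assumed storage layout).
    """
    j0, j1 = bracket(j, grid)
    l0, l1 = bracket(l, grid)
    tj = 0.0 if j1 == j0 else (j - j0) / (j1 - j0)
    tl = 0.0 if l1 == l0 else (l - l0) / (l1 - l0)
    at_j0 = (1 - tl) * table[(j0, l0)] + tl * table[(j0, l1)]
    at_j1 = (1 - tl) * table[(j1, l0)] + tl * table[(j1, l1)]
    return (1 - tj) * at_j0 + tj * at_j1
```

With four bit-planes, `bracket(150, grid_points(4))` gives (144, 160) and `bracket(100, grid_points(4))` gives (96, 128), which are the four neighbours used in the worked example.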

5. Estimation of bit-plane Rate–PSNR model

All the algorithms described above (RD-based, DP-based, DP-based with sliding window, and SFGS/SFGST mode selection) rely on computing $R_{i,j}$ and $PSNR_{i,j}$ in advance for each spectral band $B_{i,j}$. However, the computation of $PSNR_{i,j}$ consumes much computing time, even if it is executed off-line. To save time and storage with little sacrifice in accuracy, we establish a bit-plane Rate–PSNR model for the enhancement-layer signal of each frame. The server then uses this model to predict all $PSNR_{i,j}$ values for rate allocation during network transmission. Before establishing the model, we collected data and observed the Rate–PSNR behavior for the sequences "Foreman," "Mobile," and "Flower." Fig. 8 shows a typical example of a $PSNR_{i,j}$ vs. $R_{i,j}$ plot. The curve appears suitable for linearization. More examples reveal similar behaviors, regardless of the bit-plane position and sequence content. In this paper, piecewise linear approximation is adopted to model the relationship between $PSNR_{i,j}$ and $R_{i,j}$ for each bit-plane. It was observed from Fig. 8 that the sample points are more densely distributed for larger j, i.e., when higher-frequency bands are included. Hence, four sample points corresponding to $R_{i,b \times 64+1}$, $R_{i,b \times 64+16}$, $R_{i,b \times 64+32}$, and $R_{i,b \times 64+64}$ (b is the bit-plane index) are selected as the corner points to describe the curve piecewise linearly. Note the consistency of this approximation with the pre-computed $INT_i[j][l]$ described in Section 4, where j, l = b × 64 + 1, b × 64 + 16, b × 64 + 32, b × 64 + 64, and b = 0, 1, 2, . . . , $N_{BP}$ − 1. This four-point linear approximation is, on the one hand, more accurate than a two-point strategy (only one line segment for approximation) and, on the other hand, more efficient than recording the full set of 64 sample points.
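As a rough illustration of the four-point model, the sketch below estimates the PSNR at an intermediate rate from the corner points of one bit-plane. The function name `predict_psnr` and the corner values in the usage note are hypothetical; only the piecewise-linear rule itself is taken from the text.

```python
def predict_psnr(rate, corners):
    """Piecewise-linear PSNR estimate for one bit-plane.

    `corners` is a list of (rate, psnr) pairs sorted by rate, e.g. the four
    corner points taken at bands b*64+1, b*64+16, b*64+32 and b*64+64.
    """
    if not corners[0][0] <= rate <= corners[-1][0]:
        raise ValueError("rate outside the modelled bit-plane")
    for (r0, p0), (r1, p1) in zip(corners, corners[1:]):
        if r0 <= rate <= r1:
            if r1 == r0:
                return p0
            # Linear interpolation on the segment that contains `rate`.
            return p0 + (p1 - p0) * (rate - r0) / (r1 - r0)
    raise ValueError("corner points must be sorted by rate")
```

For made-up corner points `[(1000, 30.0), (9000, 30.8), (20000, 31.4), (48000, 32.0)]`, a rate of 5000 lies halfway along the first segment, so the estimate is 30.4 dB.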

Fig. 8. A four-point strategy to piecewise-linearly approximate the Rate–PSNR curve (PSNRi,j vs. Ri,j) of each bit-plane.


Recalling the SFGS encoding procedure stated in Section 2.1, the data overheads for each frame now include:

1. $PSNR^{Base}_{i}$: one datum for each frame,
2. $R_{i,j}$: one datum for each band (real node),
3. $PSNR_{i,j}$: one datum for each band (real node),
4. $INT_i[j][l]$: $4N_{BP} \times 4N_{BP}$ data for each B-frame (for virtual nodes or SFGS mode).

Here, two-byte integers were used to represent low-precision floating-point numbers by pre-scaling, so as to reduce storage consumption. Empirically assuming a maximum of four bit-planes (i.e., $N_{BP}$ = 4), the total overhead for a 20 fps SFGST video amounts to 2 × 20 (fps) + 2 × 64 × 4 × 20 (fps) + 2 × 64 × 4 × 20 (fps) + 2 × 16 × 16 × 10 (fps) = 25.0 kbps. This overhead is equivalent to only 1.22% for a standard SFGS video encoded at 2048 kbps (BL + EL).
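The overhead sum can be re-traced term by term with the short sketch below. All constant names are illustrative. One assumption deserves attention: with two-byte values the terms count bytes per second, and the sum of 25,640 matches the quoted 25.0 only after division by 1024, so the sketch reproduces the arithmetic as printed without converting bytes to bits.

```python
BYTES_PER_VALUE = 2      # two-byte pre-scaled integers
N_BP = 4                 # assumed maximum number of bit-planes
BANDS_PER_PLANE = 64
FRAME_RATE = 20          # all frames of the SFGST video, per second
B_FRAME_RATE = 10        # B-frames per second
CORNERS = 4 * N_BP       # pre-computed indices per axis of INT_i

overhead_terms = {
    "PSNR_base": BYTES_PER_VALUE * FRAME_RATE,
    "R_ij": BYTES_PER_VALUE * BANDS_PER_PLANE * N_BP * FRAME_RATE,
    "PSNR_ij": BYTES_PER_VALUE * BANDS_PER_PLANE * N_BP * FRAME_RATE,
    "INT_i": BYTES_PER_VALUE * CORNERS * CORNERS * B_FRAME_RATE,
}

total_per_second = sum(overhead_terms.values())   # 40 + 10240 + 10240 + 5120
total_in_k = total_per_second / 1024              # the quoted "25.0" figure
share_of_2048 = 100 * total_in_k / 2048           # the quoted 1.22 %
```

The two per-band terms dominate: together they account for 20,480 of the 25,640 units per second.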

6. Experimental results

All the experiments were based on three standard CIF sequences: "Flower," "Mobile," and "Table tennis" (each has 150 frames). The encoding frame rate is 20 fps for SFGST and 10 fps for SFGS. For simplicity, only the first frame is coded as an I-frame and the others are coded as P-frames. The base-layer signal is coded at 256 kbps with the TM5 rate control scheme. In the following experiments, we truncate the enhancement-layer bit-stream to 256, 512, . . . , or 1024 kbps, in steps of 256 kbps. Here, the buffer size is set to 0.5 s of the channel rate ($R_{ch}$) and half of the buffer is pre-loaded with bit-stream data. The buffer fullness is constrained to 0.9 times the buffer size. To compare video smoothness, four performance indices are defined below:

PSNR:
$$PSNR = \frac{4 \times PSNR_{Y} + PSNR_{Cr} + PSNR_{Cb}}{6}, \tag{12}$$

Maximal PSNR variation:
$$\Delta PSNR_{max} = \max_{i} \left( \left| PSNR_{i} - PSNR_{i+1} \right| \right), \tag{13}$$

Average PSNR variation:
$$\Delta PSNR_{avg} = \frac{1}{X - 1} \left( \sum_{i=1}^{X-1} \left| PSNR_{i} - PSNR_{i+1} \right| \right), \tag{14}$$

Rate-control error rate:
$$RER = \frac{O_{final} - O_{pre\text{-}loading}}{O_{pre\text{-}loading}} \times 100\%, \tag{15}$$


where $PSNR_{i}$ represents the quality of frame i (with both base and partial enhancement layers received) at the decoder side, $\Delta PSNR_{max}$ and $\Delta PSNR_{avg}$ are the maximal and average PSNR variations in the sequence, X is the number of frames in the sequence, and $O_{final}$ and $O_{pre\text{-}loading}$ are the final (after sending the bit-stream of the last frame to the buffer) and initial buffer occupancies. If rate control is accurate, $O_{final}$ will be close to $O_{pre\text{-}loading}$. Hence, the sign of RER reflects either over-use (>0) or under-use (<0) of the bit-rate budget.

6.1. Single-mode DP-based rate allocation

In this experiment, video data are first SFGST-encoded and all frames, whether I-, P-, or B-frames, are transmitted. First, the edge weighting α in the DP cost function (Eq. (5)) is varied to show the results of different preferences between image quality and video smoothness. To give more emphasis to video smoothness, N is given a larger value (e.g., 50) and α is set between 0.6 and 1.0. Fig. 9 depicts the video smoothness and buffer status of the uniform, RD-based, and proposed DP-based rate allocation schemes for the video sequence "Flower" coded at 512 kbps (256 kbps for BL and 256 kbps for EL). Also, Table 1 shows detailed statistics on PSNR, ΔPSNRavg, ΔPSNRmax, and RER for experiments with varying EL bit-rates. Several key points are observed below.

(1) The RD-based scheme has a better performance in average PSNR. However, this may be due to excess use (i.e., overflow) of buffers between Frames 9–40.
(2) The DP-based scheme with α = 1 has the best performance in terms of video smoothness (i.e., ΔPSNRavg).
(3) The DP-based and RD-based schemes do not necessarily perform better in ΔPSNRmax than the traditional "uniform" method. This is due to the step change of PSNR at window transitions. When α is reduced to 0.8, the step change in PSNR (i.e., ΔPSNRmax) becomes significant, especially at higher bit-rates. This phenomenon occurs in the first few frames (#1–#3), where the buffer is quickly occupied, resulting in high PSNRs.
(4) The DP-based scheme has a poorer performance in RER, since only the buffer fullness constraint (0.9 × buffer_size) is checked and no accumulated bit count is calculated in the DP optimization process. This results from the fact that the bit-stream of each frame is truncated in units of spectral bands, not arbitrarily. Specifically, the positive RERs for 0.6 ≤ α ≤ 0.9 are due to the use of buffers near the limit line (see Fig. 9C).

Another set of results, for the sequence "Foreman" (of larger motion), is given in Fig. 10 and Table 2. Different behaviors can be found here.

(5) For a high-motion video, all the tested schemes have nearly indistinguishable PSNR performance, but a recognizable difference in video smoothness.


Fig. 9. Experiments for the "Flower" sequence with the EL bit-rate set at 256 kbps. (A) PSNR performances of DP-based schemes with different α, (B) a comparison of PSNR between the uniform, RD-based, and DP-based (α = 1) schemes, and (C) status of buffer occupancy for the above schemes.

(6) There is no buffer overflow for the RD-based scheme in this case. In other cases not shown here, the RD-based scheme conversely uses less buffer than the DP-based scheme. This is because the DP-based scheme needs more bit-rate (and thus more buffer) to raise the quality of low-quality pictures for the sake of video smoothness.
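The four performance indices of Eqs. (12)–(15) used in these comparisons are straightforward to compute from per-frame PSNR values and buffer occupancies. The sketch below is one possible transcription; the function names are illustrative and not from the paper.

```python
def combined_psnr(psnr_y, psnr_cr, psnr_cb):
    """Eq. (12): weighted average of the luminance and chrominance PSNRs."""
    return (4 * psnr_y + psnr_cr + psnr_cb) / 6


def psnr_variation(frame_psnr):
    """Eqs. (13) and (14): maximal and average PSNR change between frames."""
    if len(frame_psnr) < 2:
        raise ValueError("at least two frames are needed")
    diffs = [abs(a - b) for a, b in zip(frame_psnr, frame_psnr[1:])]
    return max(diffs), sum(diffs) / len(diffs)


def rate_control_error(o_final, o_preloading):
    """Eq. (15): RER in percent; > 0 means over-use of the bit budget."""
    return (o_final - o_preloading) / o_preloading * 100
```

For a sequence of X frames there are X − 1 adjacent pairs, which is why the average in `psnr_variation` divides by `len(diffs)`.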


Table 1
Performances of the proposed schemes for the "Flower" sequence at different EL bit-rates: (a) the average image quality, PSNR; (b) average video smoothness, ΔPSNRavg; (c) maximum PSNR deviation, ΔPSNRmax; (d) rate-control error rate, RER

                  α = 0.6   α = 0.7   α = 0.8   α = 0.9   α = 1   Uniform   RD-based
PSNR (dB)
  256 kbps        29.9      29.9      29.9      29.9      29.8    30.1      30.4
  512 kbps        30.8      30.8      30.8      30.8      30.6    30.9      31.2
  1024 kbps       32.6      32.6      32.6      32.5      32.5    32.5      33.0
ΔPSNRavg (dB)
  256 kbps        0.071     0.058     0.040     0.022     0.019   0.246     0.026
  512 kbps        0.088     0.075     0.057     0.035     0.026   0.221     0.031
  1024 kbps       0.122     0.100     0.064     0.049     0.030   0.316     0.039
ΔPSNRmax (dB)
  256 kbps        2.74      0.93      1.24      0.42      0.96    0.77      0.99
  512 kbps        3.96      2.86      1.93      0.80      1.07    0.84      0.77
  1024 kbps       6.05      4.75      3.07      1.27      0.97    0.98      1.10
RER (%)
  256 kbps        80        80        80        80        98      28        28
  512 kbps        80        78        80        78        60      18        18
  1024 kbps       80        80        80        78        48      10        10

From the above, the proposed DP-based scheme is capable of making tradeoffs among factors such as video smoothness, image quality, and buffer fullness. It also successfully models this tradeoff as a constrained optimization problem. In general, the gain in image quality is limited when α is reduced below 1.0. Hence, α = 1 is recommended when a single-mode SFGS or SFGST video bit-stream is transmitted.

6.2. Single-mode DP-based rate allocation with sliding window

From Fig. 9B, it was found that both the DP-based and RD-based schemes present a larger step change in video quality between windows (e.g., Frames 1–50) for the sequence "Flower." In applying the sliding window, SWS is chosen to be 1, 10, 25, or 50, with WS = 50. We also experimented with the non-sliding window counterpart (i.e., WS = SWS) for comparison. Fig. 11 shows the result of the sliding window for the same sequence as in Fig. 9. It is clear that lowering SWS (with WS unchanged) reduces the step change in PSNR at window boundaries. Fig. 12 compares the performances of the sliding and non-sliding window techniques, where (WS, SWS) = (50, 10) and (10, 10), respectively. Though these two techniques allocate bits for the same number of frames at a time (i.e., 10 frames), the resulting video smoothness is quite different. Table 3 shows detailed statistics on PSNR, ΔPSNRavg, and ΔPSNRmax when the sliding and non-sliding window techniques are applied to the sequence "Flower" coded at different EL bit-rates. When SWS becomes smaller, ΔPSNRmax


Fig. 10. Experiments for the "Foreman" sequence with the EL bit-rate set at 512 kbps. (A) PSNR performances of DP-based schemes with different α, (B) a comparison of PSNR between the uniform, RD-based, and DP-based (α = 1) schemes, and (C) status of buffer occupancy for the above schemes.

is clearly reduced, though PSNR and ΔPSNRavg remain comparable. However, a small SWS (especially 1) leads to a significant increase in the CPU time consumed for rate allocation (see Table 4). Considering the above aspects, setting SWS to half of WS or a smaller value is a better choice. To sum up, the


Table 2
Performances for the "Foreman" sequence at different EL bit-rates: (a) the average image quality, PSNR, (b) average video smoothness, ΔPSNRavg, (c) maximum PSNR deviation, ΔPSNRmax, and (d) rate-control error rate, RER

                  α = 0.6   α = 0.7   α = 0.8   α = 0.9   α = 1   Uniform   RD-based
PSNR (dB)
  256 kbps        35.6      35.6      35.6      35.6      35.6    35.5      35.5
  512 kbps        36.5      36.5      36.5      36.5      36.5    36.5      36.4
  1024 kbps       38.0      38.0      38.0      38.0      37.9    38.0      37.9
ΔPSNRavg (dB)
  256 kbps        0.087     0.081     0.074     0.067     0.022   0.661     0.026
  512 kbps        0.084     0.081     0.074     0.056     0.008   0.533     0.017
  1024 kbps       0.147     0.128     0.114     0.102     0.022   0.478     0.029
ΔPSNRmax (dB)
  256 kbps        1.02      1.16      1.01      1.05      0.27    1.84      0.09
  512 kbps        1.97      1.60      1.28      1.20      0.08    1.79      0.08
  1024 kbps       2.45      2.20      2.17      1.96      0.12    2.10      0.18
RER (%)
  256 kbps        72        72        70        56        2       12        14
  512 kbps        72        72        72        68        12      8         8
  1024 kbps       84        84        82        76        20      4         6

Fig. 11. Rate control with sliding window (WS = 50) for the same sequence in Fig. 9.

sliding window method is capable of reducing step changes in PSNR compared with its non-sliding window counterpart.

6.3. SFGS/SFGST mode selection and rate control

In this SFGS/SFGST mode selection experiment, the parameter α is tested over the full range of values (0.0–1.0). To demonstrate the benefit of automatic SFGS/SFGST mode switching with respect to single-mode SFGS or SFGST, the following experiments were conducted and compared.

SFGS: 10 fps SFGS on encoding and transmission; α = 1.0 (fixed); Motion-Compensated Interpolation (MCI) at the decoder side to form a 20 fps video.


Fig. 12. A comparison of sliding (WS = 50 and SWS = 10) and non-sliding window (WS = 10) methods.

Table 3
Results of DP-based rate allocation schemes with sliding and non-sliding window for the sequence "Flower"

                  Sliding window (WS = 50)                    Non-sliding window
                  SWS = 1   SWS = 10   SWS = 25   SWS = 50    WS = 10   WS = 25
PSNR (dB)
  EL: 256 kbps    29.8      29.8       29.8       29.8        29.8      29.8
  EL: 512 kbps    30.7      30.7       30.7       30.6        30.6      30.6
  EL: 1024 kbps   32.4      32.4       32.3       32.5        32.5      32.4
ΔPSNRavg (dB)
  EL: 256 kbps    0.024     0.019      0.019      0.019       0.038     0.028
  EL: 512 kbps    0.033     0.023      0.023      0.026       0.074     0.027
  EL: 1024 kbps   0.060     0.035      0.025      0.030       0.402     0.077
ΔPSNRmax (dB)
  EL: 256 kbps    0.30      0.38       0.44       0.96        1.05      0.68
  EL: 512 kbps    0.33      0.33       0.66       1.07        1.63      0.63
  EL: 1024 kbps   0.71      0.72       0.51       0.97        5.72      2.43
RER (%)
  EL: 256 kbps    38        100        38         98          96        34
  EL: 512 kbps    48        46         8          60          92        98
  EL: 1024 kbps   30        34         54         48          98        82

SFGST: 20 fps SFGST on encoding and transmission; α = 1.0 (fixed); 20 fps at the decoder side.

SFGS/SFGST: 20 fps SFGST on encoding; varying frame rates on transmission; α = 0.4, 0.8, 1.0; MCI at the decoder side to form a 20 fps video.

Fig. 13 shows the result for the sequence "Flower," coded with the EL bit-rate at 1024 kbps. When α is set to 1.0, the SFGST mode is frequently selected (i.e., B-frames are transmitted). When α is set to 0.4, B-frames are usually skipped and the bit-rate is assigned to P-frames to enhance the image quality. Clearly, the proposed


Table 4
The consumed CPU time (Pentium IV, 1.7 GHz) for different settings of SWS at different EL bit-rates (WS = 50 is assumed)

Time (s)          SWS = 1   SWS = 10   SWS = 25   SWS = 50
  EL: 256 kbps    12.818    1.772      1.061      0.821
  EL: 512 kbps    16.483    2.213      1.231      0.921
  EL: 1024 kbps   17.765    2.383      1.311      0.961

algorithm is capable of determining the SFGS or SFGST mode according to both user preference (i.e., α) and image content characteristics. For example, when α is set to 0.8, the segments of Frames 1–5 and Frames 29–75 are transmitted in SFGS mode (mainly due to their low-motion nature), and the requested video smoothness is achieved at the same time. As for the advantage of automatic dual-mode switching over single-mode operation, Table 5 compares the PSNR and ΔPSNRavg statistics for the "Foreman" sequence coded at different EL bit-rates. It reveals that SFGST and SFGS/SFGST outperform SFGS in both image quality and video smoothness, due to the high-motion nature of the sequence (notice that the PSNR and ΔPSNR are calculated based

Fig. 13. Comparisons between (A) single-mode (SFGS or SFGST) and (B) dual-mode (SFGS/SFGST) video coding with different α, for the sequence "Flower." The EL bit-rate is set to 1024 kbps.


Table 5
The PSNR and ΔPSNRavg statistics for the SFGS, SFGST, and SFGS/SFGST modes with different α settings at different EL bit-rates for the "Foreman" sequence

                  SFGS/SFGST                                  SFGST     SFGS
                  α = 0     α = 0.4   α = 0.8   α = 1.0      α = 1.0   α = 1.0
PSNR (dB)
  EL: 256 kbps    35.60     35.59     35.59     35.59        35.59     34.83
  EL: 512 kbps    36.54     36.50     36.50     36.50        36.50     35.84
  EL: 1024 kbps   38.11     38.05     38.05     37.99        37.99     37.20
ΔPSNRavg (dB)
  EL: 256 kbps    0.305     0.095     0.057     0.017        0.017     3.797
  EL: 512 kbps    0.628     0.079     0.05      0.013        0.013     4.696
  EL: 1024 kbps   0.869     0.089     0.059     0.021        0.021     6.820

on the original and interpolated frames). On the other hand, SFGST and SFGS/SFGST present a comparable tradeoff between the PSNR and ΔPSNRavg performances. Also, when α is set to 0.8, SFGS/SFGST presents a smoother video than SFGS (with α = 1.0) and a higher quality than SFGST (with α = 1.0). When strong video smoothness is required (α = 1.0) or a high-motion video is encountered, our algorithm selects the SFGST mode for almost the whole video.

Table 6 lists the number of non-transmitted B-frames in SFGS/SFGST mode switching. The total number of frames is 150 (including the I-, P-, and B-frames) in the whole sequence. For high-motion videos such as the "Foreman" sequence, no matter how α is set, the algorithm almost always chooses the SFGST mode (i.e., a low number of skipped B-frames). The opposite holds for the "Flower" sequence. For the sequence "Mobile," the selection between the SFGS and SFGST modes is neutral, depending on the image content. To sum up, the proposed SFGS/SFGST mode selection algorithm has been shown to be sufficiently adaptive, even to sequences of unknown or varying characteristics. In addition, a larger α (e.g., >0.4) is suggested for a better compromise between image quality and video smoothness. This differs from the conclusion for single-mode SFGS or SFGST, where α = 1.0 was suggested (see the end of Section 6.1).

6.4. Accuracy of the bit-plane Rate–PSNR model

As mentioned in Section 5, a piecewise linear Rate–PSNR model is introduced for each bit-plane of the enhancement-layer signal. To measure the accuracy of this model in SFGS rate control, an index $e_{avg}$ is defined as

$$e_{avg} = \frac{1}{N \times N_{BP} \times 64} \sum_{i=0}^{N-1} \sum_{j=0}^{N_{BP} \times 64 - 1} \left| PSNR^{real}_{i,j} - PSNR^{model}_{i,j} \right|, \tag{16}$$

where $PSNR^{real}_{i,j}$ and $PSNR^{model}_{i,j}$ represent the real and estimated (from the model) values of $PSNR_{i,j}$ (as defined in Eq. (1)), respectively, and N is the number of frames in the test sequence. Here, we have N = 150 and $N_{BP}$ = 4.
Clearly, $e_{avg}$ equals 0 if the model fully matches the real data.
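Eq. (16) is a mean absolute error over all frames and bands, which can be sketched as below. The function name `model_error` and the nested-list layout (one list of band PSNRs per frame) are assumptions made for the illustration.

```python
def model_error(psnr_real, psnr_model):
    """Eq. (16): mean absolute difference between real and modelled PSNR.

    Both arguments are lists of per-frame lists; each inner list holds the
    PSNR of every band of that frame (N_BP * 64 entries in the paper).
    """
    if len(psnr_real) != len(psnr_model):
        raise ValueError("frame counts differ")
    total = 0.0
    count = 0
    for real_frame, model_frame in zip(psnr_real, psnr_model):
        if len(real_frame) != len(model_frame):
            raise ValueError("band counts differ")
        total += sum(abs(r - m) for r, m in zip(real_frame, model_frame))
        count += len(real_frame)
    return total / count
```

Because every band enters with the same weight, errors at the four corner points (which are zero by construction of the model) pull the average down somewhat.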

Table 6
Numbers of non-transmitted B-frames in SFGS/SFGST mode switching

Number of skipped B-frames
                  Flower                            Foreman                           Mobile
                  α = 0  α = 0.4  α = 0.8  α = 1.0  α = 0  α = 0.4  α = 0.8  α = 1.0  α = 0  α = 0.4  α = 0.8  α = 1.0
  EL: 256 kbps    74     73       43       4        2      0        0        0        50     5        2        2
  EL: 512 kbps    74     74       34       5        3      0        0        0        59     7        1        2
  EL: 1024 kbps   74     74       26       2        5      0        0        0        69     10       0        0

The total number of frames is 150 (including the I, P, and B-frames) in the whole sequence.


Table 7
Differences between the real data and those estimated by the proposed Rate–PSNR model, denoted as $e_{avg}$

e_avg (dB)        Flower    Foreman   Mobile
  BL: 64 kbps     0.0161    0.0124    0.0142
  BL: 128 kbps    0.0206    0.0104    0.0141

As seen from Table 7, values of $e_{avg}$ for the sequences "Flower," "Mobile," and "Foreman" are only about 0.01–0.02 dB. These figures indicate that our piecewise linear models are accurate enough. We then apply the estimated Rate–PSNR models in the rate allocation processes for further evaluation. Experiments show that the (PSNR, ΔPSNRavg) results, as defined in Eqs. (12) and (14), obtained with the real and the estimated $PSNR_{i,j}$ values are also very close (within 0.01 dB of error).

6.5. Overall suggestions

According to the above experimental results, the SFGS/SFGST mode selection scheme is the best at making tradeoffs between image quality, motion smoothness, and video smoothness under the buffer fullness constraint at all chosen bit-rates, and is therefore suggested here. However, the value of the parameter α and the use of the sliding window should be chosen (as shown in Table 8) according to conditions such as delay time and video content. For videos with simple content or low motion, such as "Flower," the difference between two adjacent frames is quite small. Therefore, α could be set in the range of 0.4–0.8 to increase the image quality. On the other hand, α should be set larger than 0.8 to increase the video smoothness for videos with complex content or high motion. For low-delay applications, the sliding window scheme has to be discarded, or a small window size SWS chosen, to reduce the time required for video pre-loading and delay.

Table 8
Suggestions of rate control schemes for different transmission conditions

Video content                    Delay time: Short                         Delay time: Long
  Simple content/low motion      0.4 < α < 0.8 without sliding window      0.4 < α < 0.8 with sliding window
  Complex content/high motion    α > 0.8 without sliding window            α > 0.8 with sliding window

7. Complexity analysis for DP processing

It is clear that the DP optimization process forms the main computing load of the proposed rate control algorithms. Take the multi-stage topology in Fig. 3 as an example for analysis. The computations mainly include:

(1) initializing the buffer occupancy O (i, j): $N \cdot M \cdot t_{buffer}$ ($t_{buffer}$ is the time to execute Eq. (6)),
(2) computing the node cost $c\_node_{i,j}$: $N \cdot M \cdot t_{node}$ ($t_{node}$ is the time to execute Eq. (2)),
(3) computing the edge cost $c\_edge_{i,j,k}$: $M \cdot M \cdot (N - 1) \cdot t_{edge}$ ($t_{edge}$ is the time to execute Eq. (3)),
(4) computing the path cost COST (i, j): $M \cdot (N - 1) \cdot M \cdot t_{path}$ ($t_{path}$ is the time to execute Eq. (5)),
(5) checking the buffer status: $N \cdot M \cdot t_{buffer\_ck}$ ($t_{buffer\_ck}$ is the time to execute Eq. (8)),
(6) updating the buffer occupancy: $N \cdot M \cdot t_{buffer\_upd}$ ($t_{buffer\_upd}$ is the time to execute Eq. (9)).

Since all the above time units ($t_{buffer}$, $t_{node}$, $t_{edge}$, $t_{path}$, $t_{buffer\_ck}$, and $t_{buffer\_upd}$) are of approximately the same order, the time complexity can be expressed as $O(M^{2}N)$. That is, the computing time is proportional to the square of M. The normal case of M = 256 does not impose a heavy computing load on the system. Actually, the bottleneck would be Eqs. (2) and (3) if the $PSNR_{i,j}$ values could not be obtained in advance in the encoding phase. Recalling Table 4, when WS = 50 and SWS = 25, the consumed CPU time for processing 150 frames is about 1.3 s at most. The measurement includes reading the auxiliary file, but not the input and output of the video enhancement-layer data. This indicates the feasibility of our proposed algorithms in practical applications.
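To make the $O(M^{2}N)$ estimate concrete, the sketch below tallies how often each of the six operations runs, treating every time unit as one step. The function names are illustrative, and the equal-cost assumption is the one stated in the text, not a measurement.

```python
def dp_operation_counts(n, m):
    """Number of executions of each DP operation for N stages and M nodes."""
    return {
        "buffer_init": n * m,          # Eq. (6)
        "node_cost": n * m,            # Eq. (2)
        "edge_cost": m * m * (n - 1),  # Eq. (3)
        "path_cost": m * m * (n - 1),  # Eq. (5)
        "buffer_check": n * m,         # Eq. (8)
        "buffer_update": n * m,        # Eq. (9)
    }


def total_operations(n, m):
    """Total step count when every operation is assumed to cost one unit."""
    return sum(dp_operation_counts(n, m).values())
```

For N = 50 and M = 256, the edge and path terms contribute 3,211,264 steps each, against 12,800 for each of the four linear terms, so doubling M roughly quadruples the total.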

8. Remarks and conclusions

In this paper, several rate allocation schemes based on "rate-distortion-smoothness" optimization are proposed. All these algorithms rely on the proposed SFGS coding technique, which groups, encodes, and transmits FGS bit-plane data according to spectral frequency index. In this way, intra-frame quality can be made more even when bit truncation occurs in adapting to channel bandwidth variation. We propose rate allocation schemes for SFGS in a gradual manner: first the RD-based scheme (without considering the buffer fullness status), then the DP-based schemes with and without sliding window (considering the buffer fullness status), and finally SFGS/SFGST mode selection and rate allocation. The operations of these three algorithms are all based on the rate-distortion information of each spectral band in the enhancement layer. The latter two schemes, moreover, are both formulated as multi-stage dynamic programming problems, so that single-mode and dual-mode rate allocation are solved in a unified manner. Our proposed rate allocation schemes are capable of: (1) keeping buffers from overflowing or underflowing and maintaining the rate control accuracy for the whole sequence to within half of the buffer size, (2) making tradeoffs between image quality and video smoothness according to user preference, and (3) adapting to image contents and user preferences by automatic mode selection. Our algorithm considers only global characteristics of the enhancement-layer residual image. To consider local characteristics, MB-scanning orders other than the traditional raster scan can be considered:


(1) Use a fixed box-out scanning pattern (from the image center) for MBs. This is useful for video conference sequences, where the subjects are often centrally located.
(2) Use an object-based MB-scanning order, for example, a scanning order formed according to the result of foreground object segmentation. This, however, requires overhead to dynamically record the scanning pattern for each frame.

In these ways, perceptually important parts of the enhancement layer are transmitted first. This helps subjective image quality when the channel bandwidth is limited. Of course, the selective enhancement technique proposed in FGS could also be used to consider the local characteristics.

It would be of interest to extend the proposed rate allocation schemes (e.g., the RD-based and DP-based ones) to other FGS coding techniques, e.g., the original FGS, PFGS [16], MC-FGS [12], and RFGS [3]. As an example, we discuss the possibility of extension to FGS. As presented in Section 2, the difference between FGS and SFGS mainly comes from the partition of each bit-plane's data into MBs or into spectral bands. By grouping several consecutive MBs into a unit, the counterparts of $R_{i,j}$ and $PSNR_{i,j}$ (i is the frame index, and j is the unit index) can be similarly defined and calculated for each unit. Based on these $R_{i,j}$ and $PSNR_{i,j}$ values, the proposed RD- and DP-based rate allocation schemes are also applicable. Comparing these two implementations, it is reasonable for SFGS to have more even intra-frame quality than FGS at the same transmission bit-rate. This is mostly attributable to spectrum grouping and prioritization.

References

[1] Hui Cheng, Xi Min Zhang, Yun Q. Shi, Anthony Vetro, Huifang Sun, Rate allocation for FGS coded video using composite R-D analysis, in: Proceedings, IEEE Int. Conf. on Multimedia and Expo (ICME), Baltimore, MD, USA, 2003, pp. 41–44.
[2] Ellis Horowitz, Sartaj Sahni, Fundamentals of Computer Algorithms, Computer Science Press Inc., 1978.
[3] Hsiang-Chun Huang, Chung-Neng Wang, Tihao Chiang, A robust fine granularity scalability using trellis based predictive leak, IEEE Trans. Circuits Syst. Video Technol. 12 (6) (2002) 372–385.
[4] Bin-Feng Hung, C.L. Huang, Content-based FGS coding mode determination for video streaming over wireless networks, IEEE J. Selected Areas Commun. 21 (10) (2003) 1595–1603.
[5] Sung-He Lee, Yoon-Cheol Shin, Seungjoon Yang, Heon-Hee Moon, Rae-Hong Park, Adaptive motion-compensated interpolation for frame rate up-conversion, IEEE Trans. Consumer Electron. 48 (3) (2002) 444–450.
[6] Weiping Li, Overview of fine granularity scalability in MPEG-4 video standard, IEEE Trans. Circuits Syst. Video Technol. 11 (3) (2001) 301–317.
[7] Wen-Nung Lie, Ming-Yang Tseng, I.-Cheng Ting, Constant-quality rate allocation for spectral fine granular scalable (SFGS) video coding, in: Proceedings, IEEE International Symposium on Circuits and Systems (ISCAS), Bangkok, Thailand, 2003, pp. II-880–II-883.
[8] Wen-Nung Lie, Cheng-Hsiung Tseng, Ping-Chang Jui, Error-resilient spectral fine granular scalable (SFGS) video coding for network streaming applications, in: Proceedings, IEEE Int. Conf. on Multimedia and Expo (ICME), Taipei, Taiwan, 2004, pp. 1747–1750.
[9] Wen-Nung Lie, Cheng-Hsiung Tseng, Tom C.-I. Lin, Constant-quality rate allocation for spectral fine granular scalable (SFGS) video coding by using dynamic programming approach, in: Proceedings, IEEE Int. Conf. on Multimedia and Expo (ICME), Taipei, Taiwan, 2004, pp. 655–658.

[10] R.K. Rajendran, Mihaela van der Schaar, Shih-Fu Chang, FGS+: optimizing the joint SNR–temporal video quality in MPEG-4 fine grained scalable coding, in: Proceedings, IEEE Int. Symposium on Circuits and Systems, Phoenix, AZ, USA, 2002, pp. 445–448.
[11] Mihaela van der Schaar, Hayder Radha, A hybrid temporal–SNR fine-granular scalability for internet video, IEEE Trans. Circuits Syst. Video Technol. 11 (3) (2001) 318–331.
[12] Mihaela van der Schaar, H. Radha, Adaptive motion-compensation fine-granular-scalability (AMC-FGS) for wireless video, IEEE Trans. Circuits Syst. Video Technol. 12 (6) (2002) 360–371.
[13] Ming-Yang Tseng, Improving encoding efficiency and rate control for spectral fine granular scalable video coding (SFGS), Master Thesis of National Chung Cheng University, Taiwan, ROC, 2002.
[14] Qi Wang, Zixiang Xiong, Feng Wu, Shipeng Li, Optimal rate allocation for progressive fine granularity scalable video coding, IEEE Signal Process. Lett. 9 (2) (2002) 33–39.
[15] Feng Wu, Shipeng Li, Ya-Qin Zhang, A framework for efficient progressive fine granularity scalable video coding, IEEE Trans. Circuits Syst. Video Technol. 11 (3) (2001) 332–344.
[16] Feng Wu, Shipeng Li, Ya-Qin Zhang, Progressive fine granular scalable (PFGS) video using advance-predicted bitplane coding (APBIC), in: Proceedings, IEEE Int. Symposium on Circuits and Systems, Sydney, Australia, 2001, pp. 97–100.
[17] Xi Min Zhang, Anthony Vetro, Yun Q. Shi, Huifang Sun, Constant quality constrained rate allocation for FGS-coded video, IEEE Trans. Circuits Syst. Video Technol. 13 (2) (2003) 120–130.
[18] Lifeng Zhao, JongWon Kim, C.-C. Jay Kuo, Constant quality rate control for streaming MPEG-4 FGS video, in: Proceedings, IEEE Int. Symp. on Circuits and Systems, Phoenix, AZ, USA, 2002, pp. 544–547.
[19] Jian Zhou, Huair Shao, Chia Shen, Ming-Ting Sun, FGS enhancement layer truncation with minimized intra-frame quality variation, in: Proceedings, IEEE Int. Conf. on Multimedia and Expo (ICME), Baltimore, MD, USA, 2003, pp. 361–364.