Signal Processing: Image Communication 3 (1991) 129-141
Elsevier
A two layers video coding scheme for ATM networks
Stefano Tubaro
CSTS-CNR Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milano, Italy
Abstract. In the transmission of image sequences over an ATM network, variable bit-rate coding techniques can be used to optimize performance and to improve channel sharing efficiency. On the other hand, the packet switching technique used in ATM networks introduces problems related to packet delay, cell loss and packet delay jitter. One possible solution to these problems is to transmit the video information over the network with different levels of protection against packet loss. The information needed to guarantee a video connection with minimum acceptable quality is transmitted in a highly protected way. The remaining information, needed to raise the quality of the coded images to the desired level, is not intrinsically protected against data loss. In this paper two different realizations of this type of video coder are presented and their performances are compared.
Keywords. Multi-layer video coding, hybrid coding, ATM networks.
1. Introduction
Asynchronous transfer mode (ATM) networks are expected to find wide use in the future. Many communication services can be provided over these networks effectively and economically, because a unified communication protocol is applied to each service and users can therefore transmit and receive any type of information [1]. The transmission of image sequences is an important service for point-to-point and broadcast communications. The packet switching technique used in ATM networks forces the video encoder to assemble its coded information into discrete cells that are transported independently by the network. The video decoder must then extract the information from these cells and reconstruct the desired pictures and timings. This operation is vulnerable to packet delay, cell loss and packet delay jitter [7]. On the other hand, video coding can benefit from the ATM technique by using variable bit-rate coding to optimize performance and to improve channel sharing efficiency [4].
0923-5965/91/$03.50 © 1991 - Elsevier Science Publishers B.V.
In video coder architectures, motion compensation and conditional replenishment have emerged as the preferred methods for interframe redundancy reduction [5]. The major drawback of conditional-replenishment techniques is their vulnerability to transmission errors. In the case of packet transmission, especially when entropy coding is also used, the loss of one packet may have such serious effects that the decoder is unable to decode the incoming data and reconstruct the transmitted images. One possible solution to these problems is to transmit the video information over the network with different levels of protection against packet loss [2]. It is possible, for example, to consider two different types of data packets: `guaranteed packets' and `enhancement packets'. The former have a very low, even negligible, loss probability (they are also called `high priority packets'). This is achieved by avoiding, whenever possible, the discarding of these packets in the internal queues of the network nodes. That imposes an accurate and
limited use of this type of data packet. In contrast, in case of network congestion the `enhancement packets' are discarded immediately, but a large amount of information can be sent with them. The idea of the two layer video coder is to transmit, by means of the `guaranteed packets', the most important information needed to reconstruct the coded images, such as the synchronization signals and the data needed to maintain a video connection with minimum quality. The `enhancement packets' are used to improve the quality of the images reconstructed by the first layer. In case of network congestion these packets can be lost. To prevent deadlocks in the network nodes, no feedback strategy is implemented to control these losses. For this reason the information derived from these packets cannot be used in the decoding of the subsequent `high priority packets'. A general scheme of the two layer video coder is presented in Fig. 1. In the following sections two different realizations of this coder are compared. Both coders use the same algorithm for the first coding layer (derived from CCITT Recommendation H.261). The second layer, in contrast, is very simple in one case and is based on a DPCM transmission of the errors between the original images and those reconstructed from the
first layer. In the other case it is more sophisticated and uses a DCT (discrete cosine transform) to concentrate, as far as possible, the energy of the previously defined errors into a small number of coefficients. The intention is to compare the performances of the two coders with respect to their complexity and generated bit-rates. Some strategies for the concealment of errors and packet losses are also presented.
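The different treatment of the two packet classes inside a network node can be illustrated with a small sketch. It is not taken from the paper: the class name `NodeQueue`, the finite capacity and the eviction rule are assumptions chosen only to show that enhancement cells are sacrificed first, while guaranteed cells are lost only when no enhancement cell is left to discard.

```python
class NodeQueue:
    """Finite output queue of a hypothetical network node (illustration only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cells = []      # queued cells as (priority, payload)
        self.dropped = []    # cells discarded because of congestion

    def push(self, priority, payload):
        """priority is 'guaranteed' or 'enhancement'; returns True if queued."""
        if len(self.cells) < self.capacity:
            self.cells.append((priority, payload))
            return True
        if priority == "enhancement":
            # Congestion: an arriving enhancement cell is discarded at once.
            self.dropped.append((priority, payload))
            return False
        # A guaranteed cell meets a full queue: evict the most recently
        # queued enhancement cell, if there is one, to make room.
        for i in range(len(self.cells) - 1, -1, -1):
            if self.cells[i][0] == "enhancement":
                self.dropped.append(self.cells.pop(i))
                self.cells.append((priority, payload))
                return True
        # Only guaranteed cells are queued: this loss cannot be avoided.
        self.dropped.append((priority, payload))
        return False
```

The last branch shows why the use of guaranteed packets has to remain limited: once the queue holds only guaranteed cells, the node can no longer protect them.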
2. First layer of the video coder
As mentioned above, the first layer of the video coder generates `high priority packets' with the aim of guaranteeing a video connection with minimal quality. To prevent uncontrollable network congestion, the information generated by this layer must be only a small part of the maximum bit-rate that can be generated by the entire coder. The H.261 coding scheme proposed by CCITT for progressive image sequence coding at p × 64 kbit/s (p = 1, ..., 30) is based on a DCT of the motion compensated luminance difference [8]. The frames are organized in groups of blocks (GOBs), macroblocks (MBs) and 8 × 8 pixel blocks to which the transformation is applied. Each macroblock consists of 16 × 16 luminance pixels and the corresponding two 8 × 8 blocks of chrominance
Fig. 1. Block diagram of the two layer image sequence coder.
pixels (i.e., 4 luminance and 2 chrominance blocks). A uniform quantization is applied to the DCT coefficients, with no difference in treatment between luminance and chrominance. All blocks of the same MB are quantized with the same stepsize. The recommendation refers to a fixed bit-rate (FBR) coder; for this reason the quantization is controlled by the output buffer fullness, and a change in the quantization stepsize is transmitted to the receiver in the macroblock header. A variable length code is used to transmit the DCT coefficients. More precisely, the event that is coded is the combination of a magnitude (a non-zero quantization index) and a run (the number of zero indexes preceding the current non-zero index). A special variable threshold is used to increase the length of the runs. The motion vectors of each macroblock are also transmitted to the receiver; a differential technique is used to reduce the bit-rate needed to send this information. When the energy of a macroblock is too high (i.e., the motion compensated luminance differences are very large), an intraframe coding scheme can be activated to transmit, in intraframe mode, the luminance information of this image region. A scheme of the H.261 coder is presented in Fig. 2. The coding scheme used for the first layer of the considered video coder is very similar to the one
previously described; the most important difference concerns the control of the output buffer, since in this case a variable bit-rate (VBR) is generated. Our intention is to maintain a fixed image quality until a prefixed maximum bit-rate is reached; at that point the bit-rate constraint prevails and the quality of the coded image is no longer guaranteed. On the other hand, it is not necessary to control buffer underflow, because in that case simply no packets are sent to the network. This buffer control strategy has been introduced in order to use the minimum number of high priority ATM cells to transmit the first layer information with prefixed quality. The bandwidth available for the data generated by the second layer of the coder is therefore increased as much as possible. Moreover, this strategy should permit a more efficient use of the network resources when several video sources are active. The check on the instantaneous bit-rate and the computation of the quantization stepsize are made every 11 MBs, as indicated in the H.261 recommendation. A tree search procedure is used to detect the largest stepsize that guarantees the chosen image quality, subject to the constraint on the maximum bit-rate (see Fig. 3). To reduce the energy of the motion compensated luminance differences, the displacements are estimated with a precision of 0.5 pixel [6]. In all the simulations, for computational reasons, only the luminance signal has been considered.
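The stepsize control described above can be sketched as follows. The details of the tree search are not given in the text, so a plain binary search stands in for it here, under the assumption that the SNR does not increase and the bit count does not increase as the stepsize grows. The callback `measure` is hypothetical: it is assumed to code the current group of 11 macroblocks with a given stepsize and return the resulting SNR and number of bits.

```python
def choose_step(steps, measure, min_snr, max_bits):
    """Pick a quantizer stepsize for one group of macroblocks.

    steps    : candidate stepsizes in increasing order
    measure  : hypothetical callback, measure(step) -> (snr_db, bits)
    min_snr  : quality that has to be guaranteed when the budget allows it
    max_bits : bit budget derived from the maximum bit-rate
    """
    lo, hi = 0, len(steps) - 1
    best = 0                              # fall back to the finest stepsize
    while lo <= hi:                       # largest step with snr >= min_snr
        mid = (lo + hi) // 2
        snr, _ = measure(steps[mid])
        if snr >= min_snr:
            best = mid
            lo = mid + 1
        else:
            hi = mid - 1
    # The bit-rate constraint prevails: coarsen the step until the budget
    # is met, even though the quality is then no longer guaranteed.
    while best < len(steps) - 1 and measure(steps[best])[1] > max_bits:
        best += 1
    return steps[best]


def toy_measure(step):
    """Toy rate/quality model used only to exercise choose_step."""
    return 66 - step, 6000 // step
```

With the toy model and the even stepsizes 2, 4, ..., 62, a target of 36 dB selects stepsize 30 when the budget is generous, and a coarser stepsize when the budget is tight.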
Fig. 2. Block diagram of the H.261 image coder (T = DCT, Q = quantizer, P = motion compensated predictor, F = low pass filter applied to the motion compensated luminance differences (optional)).
Fig. 3. Scheme of the quantizer stepsize control process. The switch after the coding buffer represents the packetizer, that is, the interface with the network.
3. Second layer of the video coder
The aim of the second layer of the video coder is to improve the picture quality, that is, to reduce the artifacts caused by the data compression of the first layer. This part of the coder can also be used to increase the frame rate and the spatial resolution of the images, if the first layer works on a subsampled (in time and/or space) version of the original image sequence. In this work the second layer is used only to improve the picture quality. The considered images are in CIF format (30 frames/sec, progressive scan) with a significant pixel area (SPA) of 352 × 288 pels. As previously indicated, the second layer is used to code the differences between the original images and their rough version obtained from the first layer (these differences are named OFLDs). Two different versions of this second layer have been considered: one works in the pixel domain and the other in the transform domain.
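As a minimal illustration of the OFLD definition, the sketch below computes the difference image and shows how a receiver could add it back to the first layer picture. The function names are illustrative, frames are plain lists of rows of luminance values, and the OFLD is left unquantized for simplicity; none of this is the coder's actual implementation.

```python
def compute_ofld(original, first_layer):
    """Pixel-wise difference between the original frame and the first
    layer reconstruction."""
    return [[o - b for o, b in zip(row_o, row_b)]
            for row_o, row_b in zip(original, first_layer)]


def reconstruct(first_layer, ofld=None):
    """Add the enhancement to the first layer picture when it is available;
    otherwise return the first layer picture unchanged."""
    if ofld is None:                      # enhancement lost or too late
        return [row[:] for row in first_layer]
    return [[min(255, max(0, b + e)) for b, e in zip(row_b, row_e)]
            for row_b, row_e in zip(first_layer, ofld)]
```

Because the enhancement is only added to the output picture and never fed back into the first layer prediction, a lost OFLD affects a single frame.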
3.1. General architecture of the second layer coder
The second layer of the coder generates `low priority packets' that can be lost in the network nodes, or can arrive at the receiver too late and therefore have to be discarded. For these reasons, to prevent unrecoverable mismatch between the transmitter and the receiver, the coding scheme employed is not a predictive closed loop coder but an additive coder. This means that the enhancement data, even if received, are not used for the coding of the next frames. Furthermore, particular attention must be paid to the order in which the OFLDs are scanned during the coding process, in order to reduce the visibility of packet losses. For these reasons the OFLD images are subdivided into 8 × 8 non-overlapping pixel blocks, called second layer blocks (SLBs). These blocks are grouped into 12 groups of blocks, called second layer groups of blocks (SLGOBs), as described in Fig. 4. The SLGOB structure has been introduced with the intention of reducing the visible effects of
Fig. 4. Block subdivision of the OFLD image (SLGOBs in the OFLD image; SLB in an SLGOB).
packet losses by placing in consecutive data packets information coming from parts of the OFLD image that are far away from each other, as is usually done when data protection strategies are used [3]. The packets are assembled by taking information from the SLBs with an interleaved scanning procedure, as indicated in Fig. 5. To improve the error concealment capabilities, the data packets are initially assembled using only half of the information of each SLB. At the end of this first step the OFLD image is rescanned to send the remaining part of the information. In this way the information of each SLB is split among packets transmitted at very different times, so there is a high probability that at least half of the information of each SLB is received and decoded, even in case of bursty packet loss. The other half, if not received, can be recovered by means of interpolation.
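The two-pass interleaved scan can be sketched as follows. The exact pattern of Fig. 5 cannot be recovered from the text, so the order below is an assumption: consecutive entries hop from one SLGOB to the next, and the second half of every SLB is sent only after the first half of all SLBs. A CIF frame of 352 × 288 pels contains 44 × 36 = 1584 SLBs, that is, 132 per SLGOB if the 12 SLGOBs have equal size (also an assumption).

```python
def interleaved_order(num_slgobs=12, slbs_per_slgob=132):
    """Return the assumed transmission order as (half, slgob, slb) tuples.

    half  : 0 for the first semiblock of an SLB, 1 for the second
    slgob : index of the SLGOB
    slb   : index of the SLB inside its SLGOB
    """
    order = []
    for half in (0, 1):                        # two passes over the image
        for slb in range(slbs_per_slgob):      # position inside the SLGOB
            for slgob in range(num_slgobs):    # hop between distant regions
                order.append((half, slgob, slb))
    return order
```

With this order the two halves of any SLB are 1584 positions apart, so a burst of a few lost cells is unlikely to remove both.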
Fig. 5. Interleaved scanning procedure of the OFLD image.
3.2. OFLDs transmission in pixel domain
The coding scheme used to implement the second layer of the coder in the pixel domain is based on a DPCM coding of the values of the OFLD image. As mentioned above, each SLB is scanned at two different times, that is, each SLB is subdivided into two semiblocks, named SLSBs. The odd lines are considered first (see Fig. 6), and then the even ones. For each point of the SLSB, the difference with respect to the previously scanned point is transmitted. The first point of each SLSB is coded in PCM mode to prevent error propagation due to packet loss. The tests performed on the CIF sequences `TREVOR' and `MISS AMERICA' have shown that the DPCM values transmitted at the second layer of the coder have a very low mean value and variance. For this reason a single quantization stepsize is calculated for each OFLD image, so as to reconstruct the image with a prefixed quality. The quantizer is, as in the first layer, a uniform quantizer with dead zone.
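A minimal sketch of this pixel domain scheme is given below. Several details are assumptions, since the text does not specify them: the semiblock is scanned in raster order (the actual scan is the one of Fig. 6), the first sample is sent unquantized, the prediction uses the previously reconstructed value, and the dead zone quantizer maps every difference smaller in magnitude than the stepsize to zero.

```python
def quantize(d, step):
    """Uniform quantizer with dead zone: |d| < step gives index 0."""
    q = abs(d) // step
    return q if d >= 0 else -q


def dequantize(q, step):
    """Reconstruct at the centre of the decision interval."""
    if q == 0:
        return 0
    mag = abs(q) * step + step // 2
    return mag if q > 0 else -mag


def split_semiblocks(slb):
    """Split an SLB into odd and even lines, counting lines from 1."""
    return slb[0::2], slb[1::2]


def dpcm_encode(semiblock, step):
    """Return the PCM coded first sample and the quantized differences."""
    samples = [v for row in semiblock for v in row]   # assumed raster scan
    first = samples[0]
    indices, prev = [], first
    for v in samples[1:]:
        q = quantize(v - prev, step)
        indices.append(q)
        prev += dequantize(q, step)       # follow the decoder's state
    return first, indices


def dpcm_decode(first, indices, step):
    """Rebuild the scanned samples from the PCM value and the indices."""
    out, prev = [first], first
    for q in indices:
        prev += dequantize(q, step)
        out.append(prev)
    return out
```

Since every semiblock starts from its own PCM value, a lost packet cannot corrupt the decoding of the semiblocks carried by other packets.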
A variable length coder (VLC) is used to transmit the DPCM values. In this case too, the coded event is the combination of a non-zero value and a run (the number of null values preceding the current one). An End Of SLSB symbol (EOSLSB) marks the end of the data of each SLSB. The codewords have a variable length for the most probable events and a fixed length for the others. ATM data packets have a size of 53 bytes, but only 48 can be used to transmit information because the first 5 bytes are reserved for the network header. To simplify packetization and depacketization, all the data of each SLSB must be accommodated in the same packet. To limit the effect of packet loss, a packet subheader of 2 bytes is reserved to transmit the spatial start position of the included information and to retransmit, several times per frame, the quantization step. In this way, at the receiver side, it is possible to find the SLSBs to which the information of the incoming cell refers, without considering the
Fig. 6. Scanning procedure of the first subblock in an SLB (pixel domain).
previous cells, which might have been lost. Moreover, the multiple transmission of the quantization step in each frame ensures that this parameter is available to the receiver even when a large number of packets are lost. If the information to be transmitted on the second layer has a very large variance, it is possible to use different values of the quantization step in the same frame; some care must be taken to ensure that these values are available to the receiver even with packet losses. After depacketization the receiver is able to detect the SLBs for which, due to packet losses, only the data of one SLSB are available; in this case a linear interpolation of the missing values is made to reconstruct these parts of the OFLD image. No error concealment technique is applied when the information of both SLSBs of an SLB is missing. The interleaved transmission of the second layer data improves the error concealment capabilities of the coder, but introduces a delay of about 1 frame in the transmission. In fact, only when all the data relative to
the current frame have been received can the first lines of the reconstructed image be displayed.
3.3. OFLDs transmission in transform domain
Another way of transmitting the enhancement information is to apply a DCT to each SLB and send the resulting coefficients to the receiver. An interleaved scan is used for the transmission of the DCT coefficients of each SLB. The coefficients are divided into two semiblocks, again named SLSBs. A zig-zag scan is used to obtain the two SLSBs (see Fig. 7). The first coefficient of each SLSB is coded in PCM mode, while the others are coded in DPCM mode. A VLC scheme is used to transmit the coded values. The packetization procedure is exactly the same as the one used for the direct transmission of the values of the OFLD image. When only one of the two SLSBs of an SLB is received, only half of the DCT coefficients are available; the others can be obtained with a bidimensional linear interpolation.
Fig. 7. Scanning procedure for the first subblock in an SLB (transform domain).
The zig-zag scan of the coefficients has been chosen because the bidimensional interpolation works better than the monodimensional one that would have to be used with a line by line scan of the coefficients.
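The transform domain split and the concealment step can be sketched as follows. The precise assignment of coefficients to the two SLSBs is the one of Fig. 7 and is not recoverable from the text; the sketch assumes that alternate positions along the zig-zag path go to the two SLSBs, and it approximates the bidimensional interpolation by the mean of the available horizontal and vertical neighbours. Both choices are illustrative assumptions.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zig-zag order."""
    order = []
    for s in range(2 * n - 1):            # s = row + col labels a diagonal
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        order.extend(diag if s % 2 else diag[::-1])
    return order


def split_coefficients(n=8):
    """Assumed split: alternate zig-zag positions form the two SLSBs."""
    order = zigzag_order(n)
    return order[0::2], order[1::2]


def conceal(coeffs, missing):
    """Estimate each missing coefficient from its available 4-neighbours."""
    n = len(coeffs)
    missing = set(missing)
    out = [row[:] for row in coeffs]
    for r, c in missing:
        neigh = [coeffs[rr][cc]
                 for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                 if 0 <= rr < n and 0 <= cc < n and (rr, cc) not in missing]
        out[r][c] = sum(neigh) / len(neigh) if neigh else 0.0
    return out
```

A line by line split would leave only vertical neighbours for a missing line, which is the monodimensional case mentioned above.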
4. Experimental results
The simulations have been carried out on the two CIF sequences `TREVOR' and `MISS AMERICA', which have 30 frames/sec. The sequences have been subsampled by a factor of 2 in the time dimension to reduce the bit-rate generated by the first layer of the coder. At present the missing frames are not considered in the simulation. The intention is to use, in the future, a motion compensated interpolator (the motion vectors are known to both the transmitter and the receiver) to generate a rough version of these images, and then to use the second layer of the coder to improve their quality. The problems related to the packetization of the data obtained from the first layer of the coder have not been analyzed, because the effect of errors and packet loss on this layer has been assumed to be negligible. The peak-to-peak SNR is used to describe the performances of the coder. It is defined as

$$\mathrm{SNR} = -10 \log_{10} \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{(x_i - \hat{x}_i)^2}{255^2} \right] \ \mathrm{dB}, \qquad (1)$$

where $N$ is the total number of points in the image and $x_i$ and $\hat{x}_i$ indicate the luminance of a pel in the original and coded images, respectively. The two considered sequences have very different motion activity; therefore different bit-rate constraints for the output of the first layer have been used in the simulation. As mentioned above, requirements are imposed on the bit-rate generated by the first layer and on the coded image quality. More precisely, a minimum quality of the first layer coded images must be respected until the maximum bit-rate is reached, and a constant image quality (within some tolerance) must be obtained at the output of the second layer. The imposed requirements are summarized in Table 1. The number of ATM cells/frame is obtained considering a payload of 48 bytes for each packet. In all the tests the first frame is processed using an original image as the previous coded frame. This approximately simulates the state of the coder after a long sequence of still frames. The mean bit-rate generated by the first layer of the coder is 63.2 kbit/sec (89.9 cells/frame) for the MISS AMERICA sequence and 190.5 kbit/sec (270.9 cells/frame) for the TREVOR sequence. Figure 8 shows two coded frames, one with the second layer of the video coder working in the pixel domain and the other in the transform domain.
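The SNR of Eq. (1) can be evaluated with a few lines of code. The sketch below is only an illustration: the function name and the representation of a frame as a list of rows are assumptions, and identical images are mapped to infinity because the logarithm is undefined for zero error.

```python
import math


def peak_snr_db(original, coded, peak=255):
    """Peak-to-peak SNR of Eq. (1) for two frames of equal size."""
    n = 0
    squared_error = 0.0
    for row_o, row_c in zip(original, coded):
        for x, x_hat in zip(row_o, row_c):
            squared_error += (x - x_hat) ** 2
            n += 1
    if squared_error == 0:
        return math.inf                   # identical images
    return -10.0 * math.log10(squared_error / (n * peak * peak))
```

For example, a uniform error of 25.5 grey levels (one tenth of the peak value) gives 20 dB, while the 36 dB and 40 dB targets used in the simulations correspond to much smaller errors.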
Table 1. Imposed requirements used in the simulations

                First layer                                               Second layer
Sequence        Maximum bit-rate   Maximum bit-rate     min. SNR          SNR
                (kbit/sec)         (ATM cells/frame)    (dB)              (dB)
TREVOR          192                273                  36                40 ± 0.5
MISS AMERICA    128                182                  36                40 ± 0.5
Fig. 8. Two coded frames obtained from the sequence TREVOR. In one case (right) the second layer of the video coder operates in the pixel domain, in the other (left) in the transform domain.
Fig. 9. SNR of the coded images (first layer, second layer in pixel and transform domain), TREVOR sequence. (Plot against frame number; curves: first layer; second layer, pixel domain; second layer, transform domain.)
Figures 9 and 10 show the SNR obtained, for each frame, with the first and second layers of the coder. If a maximum bit-rate of 128 kbit/sec is imposed also for the TREVOR sequence, for some frames the requirements on the image quality cannot be respected, because otherwise the maximum bit-rate would be exceeded. In Figs. 11 and 12 the number of ATM cells used to transmit in each frame the information
Fig. 10. SNR of the coded images (first layer, second layer in pixel and transform domain), MISS AMERICA sequence. (Plot against frame number; curves: first layer; second layer, pixel domain; second layer, transform domain.)
Fig. 11. Number of ATM cells used to transmit the information generated by the second layer of the coder, TREVOR sequence. (Plot against frame number.)
generated by the second layer of the coder is plotted. An important test for evaluating the performances of the coders is their robustness to packet loss. For this reason, a cell loss rate of $10^{-1}$ has been considered in the simulation of the second layer of the video coder. Both uniformly distributed and bursty losses have been simulated. In the second case a mean burst length of 4-5 cells is used. Figure 13 shows, for each frame of the TREVOR sequence, the SNR obtained with a cell loss rate of $10^{-1}$ (uniform and bursty losses) when the error concealment (interpolation of the missing
Fig. 12. Number of ATM cells used to transmit the information generated by the second layer of the coder, MISS AMERICA sequence. (Plot against frame number; curves: second layer, pixel domain; second layer, transform domain.)
Fig. 13. SNR of the coded images with a cell loss rate of $10^{-1}$, TREVOR sequence. (Plot against frame number; curves: coder A, uniform losses; coder A, bursty losses; coder B, uniform losses; coder B, bursty losses.)
SLSB) is used. In Fig. 14 the same results are presented for the sequence MISS AMERICA. Coder A denotes the scheme whose second layer operates in the pixel domain, coder B the one that uses the DCT. Figure 15 shows two coded frames, one for each of the two considered coders, obtained when bursty packet losses are considered.
The results of these simulations are summarized in Table 2. For the two considered types of losses (uniform and bursty), the mean and peak decrements of the SNR of the coded images, with respect to the case where all the cells are available to the receiver, are presented. Both the case in which the error concealment procedure is activated and the case in which it is not are considered.
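The two loss patterns used in these tests can be mimicked with the sketch below. The paper does not state how the losses were generated, so the models are assumptions: independent losses for the uniform case, and a two-state (Gilbert type) model for the bursty case, with transition probabilities chosen so that the long-run loss rate and the mean burst length match the requested values.

```python
import random


def uniform_losses(n_cells, loss_rate, rng):
    """Independent cell losses; True marks a lost cell."""
    return [rng.random() < loss_rate for _ in range(n_cells)]


def bursty_losses(n_cells, loss_rate, mean_burst, rng):
    """Two-state model: every cell sent in the 'bad' state is lost."""
    p_exit = 1.0 / mean_burst                          # bad -> good
    p_enter = loss_rate * p_exit / (1.0 - loss_rate)   # good -> bad
    lost, bad = [], False
    for _ in range(n_cells):
        if bad:
            bad = rng.random() >= p_exit
        else:
            bad = rng.random() < p_enter
        lost.append(bad)
    return lost


def mean_burst_length(lost):
    """Average length of the runs of consecutive lost cells."""
    bursts, run = [], 0
    for flag in lost:
        if flag:
            run += 1
        elif run:
            bursts.append(run)
            run = 0
    if run:
        bursts.append(run)
    return sum(bursts) / len(bursts) if bursts else 0.0
```

A call such as `bursty_losses(200000, 0.1, 4.5, random.Random(1))` produces a loss pattern with roughly one cell in ten lost and bursts of about 4.5 cells on average, which corresponds to the conditions reported above.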
Fig. 14. SNR of the coded images with a cell loss rate of $10^{-1}$, MISS AMERICA sequence. (Plot against frame number; curves: coder A, uniform losses; coder A, bursty losses; coder B, uniform losses; coder B, bursty losses.)
Fig. 15. Two coded frames obtained from the sequence TREVOR when bursty packet losses are considered. In one case (right) the second layer of the video coder operates in the pixel domain, in the other (left) in the transform domain.
5. Conclusions
Two different types of two layer video coders for ATM networks have been presented. The first layer generates `guaranteed packets' and is used to maintain a video connection with minimum quality. Its architecture is derived from the CCITT Reference Model H.261. Some variations have been introduced to this model to take into account that the coder now operates in a variable bit-rate environment. The second layer of the coder is used to transmit `enhancement information', but some of these data can be lost due to network congestion.
Table 2. Results of the coder simulation with packet losses (SNR decrements in dB)

                                                                  Without error concealment         With error concealment
Sequence, coder          Loss type  Loss rate  Mean cells/frame   Mean SNR decr.  Max. SNR decr.    Mean SNR decr.  Max. SNR decr.
TREVOR, Coder A          uniform    10%        134                0.46            0.77              0.37            0.66
TREVOR, Coder A          bursty     10%        134                0.45            1.63              0.37            1.24
TREVOR, Coder B          uniform    10%        52                 0.42            0.93              0.13            0.39
TREVOR, Coder B          bursty     10%        52                 0.36            1.21              0.10            0.46
MISS AMERICA, Coder A    uniform    10%        156                0.49            0.79              0.31            0.49
MISS AMERICA, Coder A    bursty     10%        156                0.48            1.10              0.30            0.87
MISS AMERICA, Coder B    uniform    10%        54                 0.43            0.88              0.11            0.51
MISS AMERICA, Coder B    bursty     10%        54                 0.38            1.99              0.11            1.35
The two analyzed coders differ in the realization of this second layer: in one case it operates in the pixel domain, in the other in the transform domain. In the simulations the same requirement on the quality of the coded images has been imposed on both, and the number of generated data packets has been measured. The data that are coded and quantized in the second layer of the video coder are the differences between the original images and the ones coded by the first layer. When visualized, these differences
appear poorly structured and very similar to white noise. Nevertheless, the DCT is very useful for reducing the information that must be transmitted to guarantee a certain SNR. In fact, when the DCT is used the bit-rate generated by the second layer of the coder is about the same as the one generated by the first layer; in the other case it is four times higher. Packet loss is a very important problem for this type of coder. The second layer uses an interleaved scan of the data to be transmitted, and an error concealment strategy is also used to reduce the effect of cell loss. The performances of the coders have been tested with a high cell loss rate ($10^{-1}$), with both uniformly distributed and bursty losses. In this case the performances of the two coders are very similar, which indicates the importance of the interleaved scan of the information. Furthermore, better results have been obtained when a bidimensional interpolation of the missing information is used instead of a monodimensional one. In conclusion, the use of a DCT is also advisable in the second layer of the coder, even though it adds some complexity with respect to the case in which the second layer operates in the pixel domain. Further research is being carried out in order to use the second layer of the coder to increase the temporal and spatial resolution of the coded images. In addition, the possibility of using vector quantization techniques in the first layer of the coder will be studied.
6. Acknowledgments
I would like to thank Prof. F. Rocca of the Politecnico di Milano for his continual help and the research center CEFRIEL, where the computer simulations were carried out. Moreover, I want to acknowledge the efforts of Dr. Ing. Nicola Grassi,
who has provided important ideas and has contributed to the computer simulations. The work was carried out with the financial support of the National Research Council (CNR) within the framework of the Telecommunication Project.
References
[1] J.P. Coudreuse, "ATM: State of definition and major issues", in: Proc. Internat. Workshop on Packet Video, Torino, September 1988.
[2] M. Ghanbari, "Two-layer coding of video signals for VBR networks", IEEE J. Selected Areas Commun., Vol. 7, No. 5, June 1989, pp. 771-781.
[3] H. Hessenmuller, "Video signal transmission in an ATM-based broadband network, treatment of cell losses", Proc. Third Internat. Workshop on Packet Video, Morristown, NJ, USA, March 1990.
[4] F. Kishino, K. Manabe, Y. Hayashi and H. Yasuda, "Variable bit-rate coding of video signals for ATM networks", IEEE J. Selected Areas Commun., Vol. 7, No. 5, June 1989, pp. 801-806.
[5] P. Migliorati, L. Ponte and S. Tubaro, "Application of vector quantization to a motion compensated interframe image coder with and without DCT", Proc. Picture Coding Symposium, Boston, March 1990, Section 11.5.
[6] A. Puri, H.M. Hang and D.L. Schilling, "An efficient block matching algorithm for motion-compensating coding", Internat. Conf. Acoust. Speech Signal Process. 87.
[7] M. Wada, "Selective recovery of video packet loss using error concealment", IEEE J. Selected Areas Commun., Vol. 5, June 1989, pp. 809-814.
[8] "CCITT SG WP XV/1 Specialist Group on Coding for Visual Telephony", Draft Revision of Recommendation H.261: Video Codec for Audiovisual Services at p × 64 kbit/s, Document 584, 10 November 1989.