Multiview video compression with 1-D transforms

Signal Processing: Image Communication 33 (2015) 14–28

Burcu Karasoy a,b, Fatih Kamisli b,*

a ASELSAN Inc., Turkey
b Department of Electrical and Electronics Engineering, Middle East Technical University, Turkey

Article history: Received 2 December 2013; received in revised form 29 January 2015; accepted 29 January 2015; available online 21 February 2015.

Abstract

Many alternative transforms have been developed recently for improved compression of images, intra prediction residuals or motion-compensated prediction residuals. In this paper, we propose alternative transforms for multiview video coding. We analyze the spatial characteristics of disparity-compensated prediction residuals, and the analysis results show that many regions have 1-D signal characteristics, similar to previous findings for motion-compensated prediction residuals. Signals with such characteristics can be transformed more efficiently with transforms adapted to these characteristics, and we propose to use 1-D transforms in the compression of disparity-compensated prediction residuals in multiview video coding. To show the compression gains achievable with these transforms, we modify the reference software (JMVC) of the multiview video coding amendment to H.264/AVC so that each residual block can be transformed either with a 1-D transform or with the conventional 2-D Discrete Cosine Transform. Experimental results show that coding gains ranging from about 1% to 15% in Bjontegaard-Delta bitrate savings can be achieved.

Keywords: Discrete cosine transforms; Disparity compensation; Multiview video coding

1. Introduction

Block-based transform coding is a widely used approach in image and video compression [1–3]. In image compression, a block of image pixels is transformed, and the transform coefficients are quantized and entropy coded. In video compression, a block of pixels is first predicted from previously coded pixels; the prediction residual block is then transformed, and the transform coefficients are quantized and entropy coded. Different prediction methods are used in video compression, such as motion-compensated prediction (MCP), intra-frame prediction (IP), inter-layer prediction in scalable video coding, and disparity-compensated prediction (DCP) in multiview video coding. The most widely used transform in image and video compression is the 2-D Discrete Cosine Transform (DCT).

☆ This work was supported by Grant BAP-08-11-2013-060 of ODTU.
* Corresponding author. E-mail addresses: [email protected] (B. Karasoy), [email protected] (F. Kamisli).
http://dx.doi.org/10.1016/j.image.2015.01.011

The 2-D DCT is the statistically optimal transform for a separable 2-D first-order Markov process as its correlation coefficient tends to its maximum value of 1. A separable 2-D first-order Markov process has been found to be a good global signal model for images and some prediction residuals. In particular, it has been determined that such a random process is a good global model for images with a correlation coefficient of 0.95 [4] and a good global model for motion-compensated prediction residuals with a smaller correlation coefficient [5–7]. Although the 2-D DCT is a close approximation to the optimal transform for the global signal model of images and some prediction residuals (and has been used in many image and video coding standards), the statistical characteristics of many local regions in images and prediction residuals can deviate significantly from the global signal model. In images, statistical characteristics of local regions can change significantly from one part of the image to another. For instance, some regions may have smoothly varying content, while others may have texture or edges. Similarly, the statistical characteristics of prediction residuals can vary significantly depending on


the spatial characteristics of the local region to be predicted, and the particular prediction method used. Hence, alternative transforms which can adapt to these varying characteristics can be used to improve the compression efficiency of image and video coding systems. The literature contains a significant amount of research on alternative transforms developed for image and/or video compression. Typically, the developed transforms are targeted at either images or a specific type of prediction residual, such as the intra prediction residual or the motion-compensated prediction residual. While images, intra prediction residuals and motion-compensated prediction residuals are the major signals for which alternative transforms have been proposed in the literature, alternative transform-coding of other prediction residuals, such as inter-layer prediction residuals in scalable video coding or disparity-compensated prediction residuals in multiview video coding, has not been investigated. In this paper, we focus on the disparity-compensated prediction residuals coded in multiview video coding. To the best of our knowledge, there has been no previous work reporting results of using alternative transforms in multiview video coding. We analyze the statistical characteristics of disparity-compensated prediction residuals and, based on the analysis results, we use the 1-D transforms in [8] as alternative transforms. Our results show that these transforms can increase the coding efficiency in multiview video coding. The remainder of the paper is organized as follows. In Section 2, a review of related previous research is presented. Section 3 presents a statistical analysis of disparity-compensated prediction residuals and suggests the use of 1-D transforms in multiview video coding. Section 4 discusses a system implementation with these transforms and Section 5 presents experimental results with that system. Finally, Section 6 concludes the paper.

2. Previous research

Alternative transforms in the literature are typically targeted at either images or a specific type of prediction residual, such as the intra prediction residual or the motion-compensated prediction residual. Below, we briefly review the literature on alternative transforms for each of these signals. Alternative transforms developed for images typically exploit the anisotropic characteristics of local regions in images. Characteristics of local regions can be anisotropic (i.e. directional) and can vary significantly from one local region to another. For example, there can be local regions with edges along various angles or texture with varying structure. The 2-D DCT is a separable transform obtained by cascading 1-D DCTs along the horizontal and vertical dimensions. This separable construction favors horizontal and/or vertical structures over others and does not adapt to the local anisotropic characteristics. The developed alternative transforms can adapt to local anisotropic characteristics by adapting the filtering direction or coefficients to the dominant features or directions of each local region. This is achieved through various means, such as resampling image intensities along dominant directions [9–11], by performing filtering and subsampling of Discrete


Wavelet Transforms (DWT) along oriented sub-lattices of the 2-D sampling grid [12], by directional lifting implementations of the DWT [13,14], or by cascading 1-D DCTs along arbitrary directions [15,16]. There are also many transforms developed for specific types of prediction residuals. Most of the work is focused on the development of alternative transforms for intra prediction residuals and motion-compensated prediction residuals. Other prediction residuals have received little attention. Development of alternative transforms for intra prediction residuals was mainly inspired by the work of Ye and Karczewicz [17]. In intra prediction, a block of pixels is predicted from the previously reconstructed neighbor pixels residing in the left and upper blocks [18,2,3] by copying these neighbor pixels inside the block along the direction of the dominant feature in the block. Ye et al. claim that the statistical characteristics of the prediction residual blocks depend on the direction used for copying, and develop different transforms for each available copying direction by an offline training method. Inspired by Ye's work, many research papers followed, among which we summarize some of the major ones. In [19], the work of Ye et al. is improved by using a more robust training algorithm that handles outliers in the training data better. Instead of developing completely new transforms, sparse secondary transforms that are applied after the conventional 2-D DCT are developed in [20]. In [21], alternative transforms are not obtained for each intra prediction mode separately, but collectively, so that multiple transforms are available for each intra prediction mode. Instead of using offline training data, the authors of [22,23] use the first-order Markov process model of images to capture the statistics of image blocks.
Using the model and the intra prediction algorithm, a statistical description of the prediction residuals is obtained for each prediction direction, and a transform, the type-7 Discrete Sine Transform (DST), is derived from these descriptions to be cascaded with the 1-D DCT. Other approaches for developing transforms for intra prediction residuals can be found in [24–26]. The second major prediction residual for which alternative transforms have been developed is the motion-compensated prediction residual. In motion-compensated prediction, a block of pixels is predicted from a previously coded frame by compensating for the (supposedly translational) motion that the block has undergone between the frames. Motion-compensated prediction typically works well for many pixels inside a block, and the pixels with large prediction errors often concentrate in a region of the block. This concentration is the major motivation of the work in [27], where only an adaptively positioned rectangular area, capturing most of the large magnitude prediction residual pixels, is transformed instead of the entire prediction residual block. In [28], similar motivation is used to adaptively change between coding each prediction residual block in either the DCT or the pixel domain. In [29], a two-layer transform approach is used, where the first layer is a 2-D Haar transform and the second layer consists of directional transforms on the sub-bands of the Haar transform. Another two-step approach is used in [30], where the first step consists of a residual block pattern from a trained codebook, and the second step consists of coding the


remaining residual block with the 2-D DCT. It is observed in [31] that typical frames contain many repeating structures, and a block of prediction residuals can be transformed with a transform that is derived from previously coded regions containing structure similar to the structure in the current block. Other related approaches with alternative transforms for motion-compensated prediction residuals can be found in [32–37]. It is observed in [8,38] that motion-compensated prediction works well in many regions, and large prediction errors concentrate in regions which are difficult to predict, such as moving object boundaries, edges, or highly detailed texture regions. In these regions, mismatches (due to non-translational motion) across edges or object boundaries produce large prediction errors. Hence, signals with most of their energy positioned along these boundaries or edges arise in motion-compensated prediction residuals. Edges or boundaries are 1-D structures and thus a significant fraction of local regions in motion-compensated prediction residuals can have 1-D signal characteristics. Based on these observations, 1-D transforms were proposed in [8] to transform motion-compensated prediction residuals. In [8], a sample set of 1-D transforms (8 directional 1-D transforms for 4×4 blocks and 16 directional 1-D transforms for 8×8 blocks) is used in addition to the 2-D DCT in the H.264/AVC codec and achieves an average 5% coding gain. Zhang and Lim [34] reduce the complexity of this system by discarding all directional 1-D transforms except the two most frequently used 1-D transforms along the horizontal and vertical directions. It is reported that this simplified system achieves most of the coding gain of the original system in [8]. A similar approach is taken in [35], where none, one or both of the 1-D DCTs (along the horizontal and vertical directions) forming the 2-D DCT are skipped, resulting in essentially the same approach as [34]. Experimental results of [35] are reported using the HEVC codec and also show coding gains. Algorithms developed to determine the effectiveness of transforms in multiple-transform video coding [36] show that the 1-D transforms in [8] are amongst the most effective 1-D transforms for compressing motion-compensated prediction residuals. Cohen et al. [33] modify the 1-D transforms of [8] by applying a second transform along the DC coefficients of the 1-D transforms, forming block transforms with basis functions that have both 1-D and 2-D support. Deng et al. [37] use the 1-D transforms in [8] along with other new coding tools, such as larger block sizes, and extend the H.264 FRExt profile to show significant coding gains for high resolution sequences. In summary, the literature review shows that many alternative transforms have been proposed for the compression of images, intra prediction residuals or motion-compensated prediction residuals, while other prediction residuals, such as disparity-compensated prediction residuals, have received little attention. In this paper, we focus on the disparity-compensated prediction residuals and show that 1-D transforms can also be used to efficiently transform these residuals and improve compression in multiview video coding.

3. Analysis of disparity-compensated prediction residuals and use of 1-D transforms

This section presents empirical and statistical analyses of disparity-compensated prediction (DCP) residuals, motion-compensated prediction (MCP) residuals and images. The analysis methods used are similar to those used for analyzing MCP residuals and images in [8]. Results of our analyses indicate that DCP residuals have characteristics similar to MCP residuals, and we suggest the use of the 1-D transforms proposed for compressing MCP residuals in [8] as alternative transforms for compressing DCP residuals in multiview video coding.

3.1. Empirical analysis

Fig. 1 contains an image (frame 41 of view 1 of the Race1 sequence) and its MCP and DCP residual frames. To obtain the MCP residual frame in the figure, a MATLAB routine is utilized which uses the temporal neighbor frame (frame 40 of view 1 of the Race1 sequence) of the image as the reference frame and applies an 8×8-pixel block-matching algorithm where the motion vectors are determined based on minimum squared error. To obtain the DCP residual frame in the figure, the same MATLAB routine is utilized with the neighbor view frame (frame 41 of view 0 of the Race1 sequence) of the image as the reference frame and, in a similar manner, an 8×8-pixel block-matching algorithm is applied where the disparity vectors are again determined based on minimum squared error. It can be seen from the figure that the MCP and DCP residual frames look quite similar. In fact, MCP and DCP work in the same way but the reference frames used are different. In MCP, a previously coded temporal neighbor frame, and in DCP, a previously coded (and temporally aligned) frame from a neighbor view is used as a reference frame. The temporal reference frame is different from the current frame due to object and/or camera motion, and the neighboring view reference frame is different from the current frame due to the differing location of the camera. Hence, the change along the temporal direction and amongst the views results from similar motion or disparity, and therefore it is expected that both MCP and DCP residuals have similar characteristics. A closer visual inspection of Fig. 1 confirms that in many local regions the statistical characteristics of MCP and DCP residuals are similar but can differ from those of the image. Both MCP and DCP are based on the assumption that the motion or disparity to be compensated for is translational. This assumption holds in some regions (e.g. temporally stationary regions or regions that are far away from the cameras) but not in others. In the regions where it does not hold, prediction errors with differing spatial characteristics can be formed, depending on the spatial characteristics of the region to be predicted. In smooth regions of the image, even if the motion is not translational, the high spatial correlation typically enables successful prediction and the prediction error is close to zero and smooth in both MCP and DCP residuals (see the bottom part of the frames in Fig. 1). Such regions are


Fig. 1. Frame 41 of Race1 view 1 sequence, its motion-compensated and disparity-compensated prediction residuals [39]: (a) image; (b) MCP residual; (c) DCP residual.

either coded at low bit-rates or efficiently transformed with the smooth basis functions of the 2-D DCT. In texture regions and regions with edges or object boundaries of the image, MCP and DCP do not work well when motion or disparity is not exactly translational. Consider the regions with trees, moving cars and the illumination poles in Fig. 1. If the translational motion assumption does not hold well in such regions, the low spatial correlation in such regions causes large prediction errors. In particular, many such regions contain edges, object boundaries or structures with high directionality, and the failure of the translational motion model causes mismatches with large errors along these edges, object boundaries or directional structures. Concentration of the large prediction errors along such edges or object boundaries causes the MCP or DCP residuals in such regions to be closer to 1-D signals than 2-D signals. It can be seen in the MCP and DCP residual frames of Fig. 1 that the prediction errors near the illumination poles and moving cars show 1-D signal characteristics along various directions. Such signals can be transformed more efficiently with 1-D transforms following these directions than with the conventional 2-D DCT.
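The block-matching procedure used to generate such residual frames can be sketched as follows; this is a minimal full-search illustration with our own function names, not the MATLAB routine used for Fig. 1:

```python
import numpy as np

def block_matching_residual(cur, ref, bs=8, search=16):
    """Full-search block matching: for each bs x bs block of `cur`,
    find the minimum-SSE match in `ref` within +/- `search` pixels
    and return the prediction residual frame."""
    h, w = cur.shape
    residual = np.zeros_like(cur, dtype=np.int32)
    for y in range(0, h - bs + 1, bs):
        for x in range(0, w - bs + 1, bs):
            block = cur[y:y+bs, x:x+bs].astype(np.int32)
            best_sse, best_pred = None, None
            # Search window clipped to the frame boundaries.
            for dy in range(max(0, y - search), min(h - bs, y + search) + 1):
                for dx in range(max(0, x - search), min(w - bs, x + search) + 1):
                    cand = ref[dy:dy+bs, dx:dx+bs].astype(np.int32)
                    sse = int(((block - cand) ** 2).sum())
                    if best_sse is None or sse < best_sse:
                        best_sse, best_pred = sse, cand
            residual[y:y+bs, x:x+bs] = block - best_pred
    return residual
```

The same routine produces an MCP residual when `ref` is the temporal neighbor frame and a DCP residual when `ref` is the temporally aligned frame from the neighbor view.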

3.2. Auto-covariance analysis

To quantify the observations made in Section 3.1, we compare auto-covariances of 8×8-pixel blocks of the MCP and DCP residual frames and the image in Fig. 1. To simplify comparisons, we use parametric representations of the auto-covariances using a separable first-order Markov process auto-covariance model (Eq. (1)) and its generalized version (Eq. (2)). In these equations, I and J represent the horizontal and vertical distances between two pixels, respectively, for which the covariance is obtained. In the generalized auto-covariance model, the additional parameter θ allows rotation of the horizontal and vertical axes of the separable model, and this capability enables better representation of the statistical properties of local regions which have highly directional characteristics, such as regions in MCP and DCP residuals with 1-D signal characteristics:

C_s(I, J) = ρ1^|I| ρ2^|J|    (1)

C_g(θ, I, J) = ρ1^|I cos(θ) + J sin(θ)| ρ2^|−I sin(θ) + J cos(θ)|    (2)
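A fit of the generalized model of Eq. (2) to a block's empirical auto-covariance can be sketched with a coarse grid search minimizing the sum of squared errors; the grid resolutions and helper names below are illustrative assumptions, not the paper's exact estimation procedure:

```python
import numpy as np

def generalized_autocov(rho1, rho2, theta, I, J):
    """Generalized first-order Markov auto-covariance model, Eq. (2)."""
    u = np.abs(I * np.cos(theta) + J * np.sin(theta))
    v = np.abs(-I * np.sin(theta) + J * np.cos(theta))
    return rho1 ** u * rho2 ** v

def fit_generalized(emp_cov):
    """Grid-search the (rho1, rho2, theta) triple whose model best fits
    (in sum of squared errors) an empirical auto-covariance sampled at
    lags I, J = 0..N-1.  The larger coefficient is kept as rho1."""
    lags = np.arange(emp_cov.shape[0])
    I, J = np.meshgrid(lags, lags, indexing='ij')
    best_params, best_sse = None, np.inf
    for theta in np.linspace(0.0, np.pi, 16, endpoint=False):
        for r1 in np.arange(0.0, 1.0, 0.05):
            for r2 in np.arange(0.0, r1 + 1e-9, 0.05):  # enforce rho1 >= rho2
                model = generalized_autocov(r1, r2, theta, I, J)
                sse = float(((model - emp_cov) ** 2).sum())
                if sse < best_sse:
                    best_sse, best_params = sse, (r1, r2, theta)
    return best_params
```

Dropping θ from the search (fixing it to 0) gives the fit for the separable model of Eq. (1).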


Each 8×8-pixel block of the residual frames and the image in Fig. 1 is modeled with both the separable and the generalized auto-covariance by estimating parameters ρ1 and ρ2 (and θ for the generalized model only) so that the parametric auto-covariance models provide the best fit (in terms of sum of squared errors) to the actual auto-covariance of each 8×8-pixel block. The estimated (ρ1, ρ2) parameter pairs from all 8×8-pixel blocks are plotted as scatter plots in

Fig. 2. Each point in the plots represents the estimated (ρ1, ρ2) pair from one 8×8-pixel block (the larger correlation coefficient is always chosen as ρ1 for convenience) and each plot shows the estimated parameters using either the separable or the generalized auto-covariance model from either the image, the MCP or the DCP residual. Below, we discuss the scatter plots and the conclusions we draw from the scatter plots about the statistical

Fig. 2. Scatter plots of estimated (ρ1, ρ2) parameter pairs from all 8×8-pixel blocks of either the image, the MCP or the DCP residual in Fig. 1 using either the separable or the generalized auto-covariance model. Each point in the plots represents the estimated (ρ1, ρ2) pair from one 8×8-pixel block: (a) separable model, image; (b) generalized model, image; (c) separable model, MCP residual; (d) generalized model, MCP residual; (e) separable model, DCP residual; (f) generalized model, DCP residual.


characteristics of residuals both qualitatively and quantitatively using the Kullback–Leibler divergence. Scatter plots in Fig. 2c and d show the results for the MCP residual frame in Fig. 1 and scatter plots in Fig. 2e and f show the results for the DCP residual frame in Fig. 1. These scatter plots of the MCP and DCP residuals are very similar, confirming that MCP and DCP residuals have similar statistical characteristics. But these scatter plots are different from those of the image (Fig. 2a and b), indicating that the statistical characteristics of images can differ from those of MCP and DCP residuals. MCP and DCP residuals have less spatial correlation than images, and therefore the estimated (ρ1, ρ2) pairs in Fig. 2c and e are generally smaller than those in Fig. 2a. However, there can still be significant correlation or structure in MCP and DCP residuals, but the separable auto-covariance model cannot capture it well and produces a small ρ1. If the generalized auto-covariance model is used, larger ρ1 parameters are obtained (see Fig. 2d and f) because the additional parameter θ allows rotation of the axes to better adapt to the directionality of the residual block. Larger correlation parameters typically enable better compression. To quantify the discussed similarities or differences of scatter plots for many frames of several sequences, we use the Kullback–Leibler (KL) divergence [40]. The KL divergence is a non-symmetric measure of the difference between two probability distributions, which in our case are represented by the scatter plots.1 For discrete probability distributions P and Q, the KL divergence of Q from P is defined as

D_KL(P ∥ Q) = Σ_i ln(P(i) / Q(i)) P(i)    (3)

We compute the KL divergence between various scatter plots obtained from the first 180 frames of several sequences and provide the results in Table 1. As seen in the table, for the Race1 sequence, the KL divergence of the distribution of (ρ1, ρ2) parameters of DCP residuals from those of MCP residuals is 0.01, whether the separable model (SM) or the generalized model (GM) is used. Noticing that two identical probability distributions have a KL divergence of 0, we can say that the statistical characteristics of MCP and DCP residuals are very similar for this sequence. The KL divergences of the distributions of (ρ1, ρ2) parameters obtained with the separable model from those obtained with the generalized model are 1.19 for MCP residuals and 1.14 for DCP residuals. This indicates that the captured correlation from both residuals can be different when the separable model or the generalized model is used. Table 1 also shows KL divergence values obtained from the Exit and Uli sequences. These values are similar to those of the Race1 sequence discussed above. Again, the KL divergence values are small if MCP or DCP residuals are modeled with

1 We approximate joint discrete probability distributions of ρ1 and ρ2 for each scatter plot by counting the number of points in each 0.1 × 0.1 cell.
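The KL divergence of Eq. (3) between two scatter plots, with points binned into 0.1 × 0.1 cells as described in the footnote, can be sketched as below; the small floor on empty Q cells is a smoothing choice of ours, not specified in the paper:

```python
import numpy as np

def scatter_kl(points_p, points_q, cell=0.1, eps=1e-9):
    """KL divergence D(P||Q), Eq. (3), between two (rho1, rho2) scatter
    plots.  Each plot is binned into `cell` x `cell` histogram cells
    over [0, 1] x [0, 1] and normalized to a probability distribution.
    Empty Q cells are floored at `eps` to avoid division by zero."""
    bins = np.arange(0.0, 1.0 + cell, cell)
    hp, _, _ = np.histogram2d(*np.asarray(points_p).T, bins=(bins, bins))
    hq, _, _ = np.histogram2d(*np.asarray(points_q).T, bins=(bins, bins))
    P = hp / hp.sum()
    Q = np.maximum(hq / hq.sum(), eps)
    mask = P > 0  # cells with P(i) = 0 contribute nothing to the sum
    return float((P[mask] * np.log(P[mask] / Q[mask])).sum())
```

Two identical scatter plots give a divergence of 0, matching the interpretation used for Table 1.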


Table 1. KL divergence between scatter plots of (ρ1, ρ2) parameters with separable (SM) and generalized (GM) models for several sequences.

KL divergence                  Race1   Exit   Uli
D_KL(Img SM ∥ Img GM)          0.60    0.53   0.94
D_KL(MCP SM ∥ MCP GM)          1.19    1.62   2.45
D_KL(DCP SM ∥ DCP GM)          1.14    1.09   1.22
D_KL(MCP SM ∥ DCP SM)          0.01    0.25   0.22
D_KL(MCP GM ∥ DCP GM)          0.01    0.12   0.24

the same model, but large when modeled with different models. In summary, the quantitative analysis in this subsection indicates that if the auto-covariance of MCP and DCP residuals is modeled with the separable auto-covariance, the estimated ρ1 (and ρ2) are typically small. If the generalized auto-covariance is used, then typically ρ1 is large and ρ2 is small. A large ρ1 indicates that there exists a large correlation along the direction of θ in the MCP or DCP residual blocks. The combination of large ρ1 and small ρ2 indicates that the structure is highly directional (along the θ direction), which is consistent with the anisotropic and 1-D signal observation obtained from the empirical analysis.

3.3. 1-D transforms

Results of the empirical and auto-covariance analyses are consistent and indicate that the statistical characteristics of MCP and DCP residuals are similar and can be different from those of an image. In particular, a significant fraction of local regions in both MCP and DCP residuals contains 1-D signal characteristics along various directions. As a result, we propose the 1-D transforms, proposed for MCP residuals in [8], as alternative transforms for more efficient compression of DCP residuals in multiview coding. Hence, in our proposal both MCP and DCP residuals in multiview video compression are transform-coded with 1-D transforms. Each residual block in multiview video compression, either an MCP residual block or a DCP residual block, is transformed either with a 1-D transform or with the conventional 2-D DCT. In residual blocks which have 1-D signal characteristics, the 1-D transforms are likely to give better compression results, and in other blocks, the conventional 2-D DCT is likely to give better compression results.

4. System implementation with 1-D transforms

To test the compression performance of 1-D transforms in multiview video coding, we modify the JMVC software 8.5, which is the reference software of the Multiview Video Coding amendment to H.264/AVC developed jointly by MPEG and VCEG. The modification requires careful design and implementation of a number of related aspects, such as design and implementation of 1-D transforms, entropy coding of the 1-D transform coefficients, selection of the best transform for each block and communicating the selection. In our implementation, we also use a global luminance compensation algorithm to equalize global luminance between views,


which can further improve coding gains of systems with 1-D transforms.

4.1. Design and implementation of 1-D transforms

In a codec with 1-D transforms, the directions of the 1-D transforms and the number of directions must be carefully chosen. We use the same 1-D transforms (eight 4×4-pixel and sixteen 8×8-pixel 1-D block transforms) with the same directions (patterns) as in [8]. The eight 1-D block transforms that we use for 4×4-pixel blocks are shown in Fig. 3. Each arrow in the figure indicates a group of pixels on which a 1-D DCT is applied. The sixteen 1-D block transforms that we use for 8×8-pixel blocks are similar and can be seen in [8]. While H.264/AVC uses integer transforms [41,18], we implement the 1-D transforms with floating point arithmetic for simplicity. This increases the encoding/decoding times of our implementation but does not change the coding gain results; if integer arithmetic 1-D transforms were used, the incurred coding gain loss would be insignificant [41].

4.2. Entropy coding of 1-D transform coefficients

In H.264/AVC, the transform coefficients are quantized and scanned into a 1-D array using a zigzag scanning pattern. The scanned coefficients are then entropy coded using context-adaptive variable-length coding (CAVLC) or context-adaptive binary arithmetic coding (CABAC) [2,42,43]. Both CAVLC and CABAC are designed to provide more efficient entropy coding when large amplitude coefficients (typically low-frequency DCT coefficients) are at the front of the scan and small amplitude coefficients (typically high-frequency DCT coefficients) are at the end of the scan. To perform entropy coding of the 1-D transform coefficients in this paper, the present CAVLC and CABAC methods are used but the scanning pattern is changed.
For each of the 1-D block transforms, a specific scanning pattern

designed in [8] is used. These scanning patterns are designed so that potentially large amplitude coefficients are at the front and potentially small amplitude coefficients are at the end of the scan for each 1-D block transform.

4.3. Selection of the best transform

With multiple transforms available to transform each residual block, it is important that the encoder selects the best transform for each block so that the compression efficiency of the codec is increased. This is a common problem in video coding where multiple coding options are present and the encoder needs to select the best option [44,45]. The classical solution to this problem is to select the coding option which gives the smallest rate-distortion cost, which can be calculated as

J = D + λR    (4)

where D represents the distortion and R the number of bits of a coding option. The parameter λ is the Lagrangian parameter representing the trade-off between distortion and rate [46,47]. In our implementation, rate-distortion optimized selection is used for the selection of the best transform. The residual block is coded with all possible transforms, the rate-distortion cost for each transform is computed (where the rate includes the bits to communicate the selected transform), and the transform with the smallest rate-distortion cost is selected.

Table 2. Codewords for communicating selected transforms.

(a) Transforms on 4×4-pixel blocks
    2-D DCT:        1
    1-D transforms: 0XXX

(b) Transforms on 8×8-pixel blocks
    2-D DCT:        1
    1-D transforms: 0XXXX
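The selection rule of Eq. (4), with the signaling bits of Table 2 counted in the rate, can be sketched as follows (the candidate dictionary and names are illustrative assumptions):

```python
def select_transform(candidates, lam):
    """Pick the candidate transform with minimum J = D + lambda * R, Eq. (4).
    `candidates` maps a transform name to (distortion, residual_bits).
    Signaling bits follow Table 2 for 8x8 blocks: a 1-bit codeword for the
    2-D DCT, and 1 + 4 bits ('0' plus a 4-bit index) for a 1-D transform."""
    best_name, best_cost = None, float('inf')
    for name, (dist, res_bits) in candidates.items():
        side_bits = 1 if name == 'dct2d' else 5
        cost = dist + lam * (res_bits + side_bits)
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name, best_cost
```

Because the 2-D DCT pays only 1 signaling bit, the selection is biased toward it, as discussed in Section 4.4; a 1-D transform wins only when its distortion/rate advantage outweighs the longer codeword.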

Fig. 3. The eight 1-D block transforms used in [8] for compressing MCP residuals in 4×4-pixel blocks.
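Two of the simplest 1-D block transforms in Fig. 3 apply a 1-D DCT along every row, or along every column, of a 4×4 block. A minimal sketch with our own helper names (the remaining directional patterns group pixels along oblique paths in the same way):

```python
import numpy as np

def dct1d(x):
    """Orthonormal 1-D DCT-II of a vector."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    C = np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    scale = np.sqrt(np.where(k == 0, 1.0 / N, 2.0 / N))
    return (scale * C) @ x

def transform_1d_block(block, direction='horizontal'):
    """Apply a 1-D DCT independently to each row (horizontal pattern)
    or each column (vertical pattern) of a square residual block."""
    if direction == 'vertical':
        block = block.T
    coeffs = np.array([dct1d(row) for row in block])
    return coeffs.T if direction == 'vertical' else coeffs
```

Because each 1-D DCT is orthonormal, the block transform preserves the residual energy, and a constant pixel group maps to a single nonzero (DC) coefficient.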


Fig. 4. Frame 1 (a) from view 0 of Uli sequence, and frame 1 from view 1 of the same sequence (b) before and (c) after luminance compensation. (a) View 0 (luminance mean: 121) (b) View 1 before compensation (luminance mean: 112) (c) View 1 after compensation (luminance mean: 121).

4.4. Communicating the selected transform

The encoder needs to inform the decoder about the selected transform for each block. This information is transmitted by the encoder to the decoder as side information within the bitstream. The binary codewords used to communicate the chosen transforms are shown in Table 2 and are the same as those in [8]. The chosen transform is communicated for each 8×8-pixel residual block that has nonzero transform coefficients. For each 8×8-pixel residual block, one bit is used to communicate whether the 2-D DCT or a 1-D transform is used. If a 1-D transform is used, additional bits are communicated to indicate which one of the 1-D transforms is used. If 8×8-pixel block transforms are used, then four additional bits are transmitted to indicate which one of the sixteen 1-D transforms will be used. If 4×4-pixel block transforms are used, then three additional bits are transmitted to indicate which one of the eight 1-D transforms will be used to transform all 4×4-pixel blocks in that 8×8-pixel residual block. Because of the codewords used, the encoder is biased toward selecting the 2-D DCT over the 1-D transforms, since the 2-D DCT is communicated with a 1-bit codeword while the 1-D transforms are communicated with longer codewords. For a 1-D transform to be selected, the compression efficiency of the 1-D transform must be good enough to compensate for the bias in the codewords. Biased codewords are used because they can increase the overall rate-distortion efficiency of the codec [48].

4.5. Global luminance compensation between views

In capturing multiview video, variations between the global luminance and/or chrominance values of frames in neighboring views can arise due to several reasons, such as calibration or view angle differences [49]. Compensating for these variations may provide increased compression efficiency and a more pleasant multiview video experience for the viewer. A block-based compensation method based on predictive coding of the DC transform coefficient is proposed in [50]. An alternative compensation method is proposed in [51], where a histogram modification algorithm is applied at the encoder prior to encoding. The histogram modification approach does not require any
A block-based compensation method based on predictive coding of the DC transform coefficient is proposed in [50]. An alternative compensation method is proposed in [51], where a histogram modification algorithm is applied at the encoder prior to encoding. The histogram modification approach does not require any

modification to the encoder or decoder and may be preferable in many applications. Fig. 4 shows frame 1 from view 0 of the Uli sequence, and frame 1 from view 1 of the same sequence before and after luminance compensation. The frame from view 1 becomes lighter after compensation and has the same global luminance value as the frame from view 0.

Predicting a block from a reference frame that has a different global luminance value can create an artificial DC value in the residual block. A residual block which could have had 1-D signal characteristics can lose these characteristics due to the artificial DC value. Thus, compensating for global luminance variations in multiview coding is also expected to produce DCP residuals with more blocks that have 1-D signal characteristics. As a result, we propose to use global luminance compensation together with the proposed 1-D transforms in multiview video coding. In the experimental results presented in Section 5.4, we use the histogram modification algorithm of [51] and show that global luminance compensation can further increase the coding gains achievable with 1-D transforms.
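For illustration, a global compensation in the spirit of Section 4.5 can be sketched as follows (a simplified mean-matching stand-in for the histogram modification algorithm of [51]; `compensate_mean` is a hypothetical helper, not from the paper, but it reproduces the mean shift of Fig. 4, where the view 1 luminance mean moves from 112 to 121):

```python
import numpy as np

def compensate_mean(view_frame, reference_frame):
    """Shift the luminance of view_frame so that its global mean matches
    that of reference_frame (8-bit luminance assumed, hence the clipping)."""
    offset = reference_frame.mean() - view_frame.mean()
    shifted = view_frame.astype(np.float64) + offset
    return np.clip(shifted, 0, 255).astype(np.uint8)
```

A histogram modification algorithm additionally matches the shape of the luminance distribution, not only its mean, but the effect on the DC value of DCP residuals is the same in spirit.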

5. Experimental results and discussion

This section presents experimental results to demonstrate the compression gains achievable with 1-D transforms in multiview coding. The compression performance of conventional multiview coding systems, which use only the 2-D DCT, is compared with that of multiview video coding systems that have access to both the 2-D DCT and the 1-D transforms discussed in Section 4. The JMVC software is used as the conventional multiview coding system, and its modification according to the discussions in Section 4 is used as the modified multiview coding system. To also explore the effect of the available transform blocksizes on the achievable gains, multiple systems are derived from the conventional and modified multiview coding systems. In summary, the following multiview video coding systems are compared in this section:

• 4×4-DCT
• 4×4-1D
• 8×8-DCT
• 8×8-1D
• 4×4-8×8-DCT
• 4×4-8×8-1D

Each of these multiview coding systems has access to different types and blocksizes of transforms, summarized in Table 3. For example, the 4×4-DCT system can use only the 4×4-pixel 2-D DCT, whereas the 4×4-1D system can choose, for each residual block, either the 4×4-pixel 2-D DCT or one of the eight 1-D transforms defined on 4×4-pixel blocks. In the systems which have access to 1-D transforms, the 1-D transforms are applied only to the luminance pictures; chrominance pictures are always transformed with the 2-D DCT for simplicity.

Table 3
Transforms available in each multiview coding system.

Systems        4×4-block   Eight 4×4-block   8×8-block   Sixteen 8×8-block
               2-D DCT     1-D transforms    2-D DCT     1-D transforms
4×4-DCT        ✓
4×4-1D         ✓           ✓
8×8-DCT                                      ✓
8×8-1D                                       ✓           ✓
4×4-8×8-DCT    ✓                             ✓
4×4-8×8-1D     ✓           ✓                 ✓           ✓

Some important codec configuration parameters are as follows. All systems use CAVLC for entropy coding and perform MCP and DCP with quarter-pixel accurate motion vectors. In addition, all block partition sizes available in the H.264/AVC standard are enabled for MCP and DCP. Multiview video compression is performed with only two views (view 0 and view 1). Test videos of 640×480 resolution (Exit, Ballroom, Race1 and Vassar sequences) and 1024×768 resolution (Uli sequence) are used in the experiments [39,52]. In each experiment, the first 180 frames of the test sequences are coded.

The compression efficiencies of two systems on a particular video sequence are compared as follows. The video sequence is compressed with both systems four times, using quantization parameters (QP) of 24, 28, 32 and 36, which provide compression results over a range of picture qualities and bitrates. The bitrate (in kbit/s) of the compressed video stream and the Peak Signal-to-Noise Ratio (PSNR, measured in dB) between the compressed and original video sequences are recorded for each QP for both systems. The PSNR is obtained from only the luminance components of the compressed and original video sequences, whereas the recorded bitrate covers both luminance and chrominance components. Using the recorded bitrate and PSNR values for both systems, rate-distortion curves (bitrate in kbit/s versus PSNR in dB) are interpolated and plotted. These plots are used to compare the compression efficiency of the systems over a range of picture qualities as the percentage bitrate savings of one system with respect to the other at the same picture quality (i.e. PSNR). To summarize the bitrate savings over a range of picture qualities, the Bjontegaard-Delta (BD) bitrate metric [53] is used, which computes one average bitrate-savings number obtained from the range of picture qualities given by the four QP values.

An overview of the performed experiments is as follows. Section 5.1 presents BD bitrate results comparing the compression efficiencies of the above listed multiview coding systems in single view compression. Section 5.2 presents rate-distortion plots and BD bitrate results comparing the same systems in multiview compression; it also provides other useful information from the experiments, such as the selection probability of different transforms and the bitrate used for coding the side information. Section 5.3 examines the effect of frame type (P and B frames) on the results, and Section 5.4 examines the effect of global luminance compensation between the views. Finally, Section 5.5 discusses the increase in computational complexity of the encoder and decoder due to 1-D transforms.

Fig. 5. Prediction structure of the single view compression experiments with one reference frame, which corresponds to a low-delay configuration.

5.1. Single view compression experiment

The prediction structure used for compressing a single view (view 0) is shown in Fig. 5. This prediction structure corresponds to a low-delay configuration with one reference frame. In single view compression, only MCP residuals are compressed. Although similar results for MCP residual experiments are provided in [8], we provide single view compression results here for completeness. We will also use these results to compare the efficiency increase obtainable with 1-D directional transforms in single view and multiview compression. The following comparisons are performed:

• 4×4-DCT vs. 4×4-1D
• 8×8-DCT vs. 8×8-1D
• 4×4-8×8-DCT vs. 4×4-8×8-1D

Table 4 shows the Bjontegaard-Delta (BD) bitrate savings of 4×4-1D with respect to 4×4-DCT, of 8×8-1D with respect to 8×8-DCT, and of 4×4-8×8-1D with respect to 4×4-8×8-DCT. It can be seen that systems with access to 1-D transforms achieve average bitrate savings ranging from a little above zero to 9%. These results are consistent with those in [8] and indicate that 1-D transforms can increase the coding efficiency of MCP residuals.

Table 4
Bjontegaard-Delta (BD) bitrate savings (%) of single view coding (view 0).

Comparisons                     Exit   Ball   Race1   Vassar   Uli   Average
4×4-1D vs. 4×4-DCT              1.8    0.7    0.2     5.9      1.4   2.0
8×8-1D vs. 8×8-DCT              6.2    4.1    2.5     9.2      4.3   5.3
4×4-8×8-1D vs. 4×4-8×8-DCT      3.7    1.8    0.5     7.2      2.3   3.1

5.2. Multiview compression experiment

Fig. 6. Prediction structure of the multiview compression experiments.

The prediction structure used for compressing two views is shown in Fig. 6. Only I and P frames are used. View 0 is the base view; it is compressed independently of view 1 and thus includes only MCP residuals. View 1 is compressed using reference frames from both view 0 and view 1 and therefore includes both MCP and DCP residuals. The experimental results presented in this section are from the coding of view 1 and therefore reflect the compression efficiency of 1-D transforms for both MCP and DCP residuals. The following comparisons are performed with this prediction structure:

• 4×4-DCT vs. 4×4-1D
• 8×8-DCT vs. 8×8-1D
• 4×4-8×8-DCT vs. 4×4-8×8-1D

Note that in our multiview compression system (JMVC), each block in view 1 can be predicted either from a temporal reference frame or from a neighboring view frame, i.e. each block can have an MCP or a DCP residual. Hence, in the above listed systems with access to 1-D transforms, both MCP and DCP residuals are coded with either a 1-D transform or the 2-D DCT.

5.2.1. Rate-distortion plots

Fig. 7 shows rate-distortion plots comparing the compression efficiencies of 4×4-1D and 4×4-DCT, of 8×8-1D and 8×8-DCT, and of 4×4-8×8-1D and 4×4-8×8-DCT. These plots are obtained from multiview coding of the Uli view 1 sequence; the rate-distortion plots of the other test sequences are similar.

Fig. 7. Rate-distortion plots obtained from multiview coding of the Uli view 1 sequence: (a) 4×4-1D vs. 4×4-DCT; (b) 8×8-1D vs. 8×8-DCT; (c) 4×4-8×8-1D vs. 4×4-8×8-DCT.

The plots in Fig. 7 show that multiview coding systems with access to 1-D transforms can achieve lower bitrates at the same PSNR levels. The bitrate savings are higher at high picture qualities (i.e. PSNR levels) than at low picture qualities, for two reasons. First, at high picture qualities (or bitrates), the fraction of the total bitrate used for coding the transform coefficients of residuals is higher than at low picture qualities. For example, at high picture qualities up to 90% of the total bitrate may be used for coding the transform coefficients (the remaining 10% is used for motion vectors and other information), while at lower picture qualities only 30% may be used. Since access to 1-D transforms can reduce only the bitrate of the transform coefficients, it causes a bigger reduction in the total bitrate at higher picture qualities. Second, at low picture qualities the side information becomes a higher fraction of the total bitrate and thus a larger burden.
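The BD bitrate numbers used throughout these comparisons can be reproduced approximately as follows (a sketch of Bjontegaard's method [53], assuming a cubic polynomial fit of log-rate against PSNR; this is a re-implementation for illustration, not the authors' script):

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Approximate Bjontegaard-Delta bitrate (%) between two RD curves,
    each given as four (bitrate, PSNR) points. Negative values mean the
    test codec saves bitrate at equal PSNR."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    # Fit cubic polynomials of log-rate as a function of PSNR.
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    # Integrate both fits over the overlapping PSNR interval.
    lo = max(np.min(psnr_anchor), np.min(psnr_test))
    hi = min(np.max(psnr_anchor), np.max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    # Average log-rate difference, mapped back to a percentage.
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0
```

For instance, a test curve whose bitrates are uniformly 10% below the anchor at every PSNR yields a BD bitrate of −10%.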


The bitrate savings are highest when 8×8-1D and 8×8-DCT are compared. This is in part because the distinction between 2-D and 1-D transforms becomes more important as the blocksize increases; in the extreme case of a 1-point block, the 1-D DCT and the 2-D DCT would give the same results.

5.2.2. Bjontegaard-Delta bitrate results

Table 5 shows the Bjontegaard-Delta (BD) bitrate savings of 4×4-1D with respect to 4×4-DCT, of 8×8-1D with respect to 8×8-DCT, and of 4×4-8×8-1D with respect to 4×4-8×8-DCT for multiview coding of view 1 of different sequences. It can be seen that systems with access to 1-D transforms achieve average bitrate savings ranging from about 2% to 15%. Notice that these bitrate savings in multiview compression are higher than those in single view compression (Table 4 of Section 5.1). The major reason for the increase is that each block can now be coded by transforming either an MCP or a DCP residual with 1-D transforms, increasing the chance to code more residual blocks with 1-D transforms. In other words, view 1 now has fewer intra-coded blocks, which are always coded with the 2-D DCT.

Table 5
Bjontegaard-Delta (BD) bitrate savings (%) of multiview coding of view 1.

Comparisons                     Exit   Ball   Race1   Vassar   Uli   Avg.
4×4-1D vs. 4×4-DCT              3.8    2.7    2.1     7.4      3.0   3.8
8×8-1D vs. 8×8-DCT              8.8    6.9    4.1     15.5     6.0   8.3
4×4-8×8-1D vs. 4×4-8×8-DCT      5.1    3.4    2.1     10.0     3.6   4.8

The results in Table 5 confirm that the bitrate savings are higher for systems that use larger transform blocksizes; in particular, the savings are highest when 8×8-1D is compared with 8×8-DCT. Another factor that affects the bitrate savings is the characteristics of the encoded video sequences. For example, the Vassar sequence gives much higher bitrate savings than the Race1 sequence. Typically, video sequences with sharp scenes and many edges, object boundaries and details have MCP or DCP residuals with many local regions that have 1-D signal characteristics, while sequences with smoother content have residuals with fewer such regions. The more local regions with 1-D signal characteristics a prediction residual contains, the larger the bitrate savings that systems with access to 1-D transforms are likely to achieve.

5.2.3. Bitrate for coding side information

In systems with access to 1-D transforms, the chosen transform for each residual block (one of several 1-D transforms or the 2-D DCT) is transmitted by the encoder to the decoder as side information. In this section, the bitrate of this side information is examined. Fig. 8a shows the average percentage of the total bitrate used for coding the side information for view 1 in the 4×4-8×8-1D system. On average, this system uses about 4% of the total bitrate for coding the side information. The percentage can, however, change depending on the picture quality or coding bitrate. Fig. 8b shows the percentage of the total bitrate used for the Exit view 1 sequence obtained from encoding results at different QP values. It can be seen that the percentage is smaller at low picture qualities (i.e. high QP values) and larger at high picture qualities. At high picture qualities the additional bits for communicating the use of a 1-D transform become less of a burden (recall that the 2-D DCT is communicated with a 1-bit codeword and the 1-D transforms with 4- or 5-bit codewords), and thus the encoder is more likely to choose 1-D transforms at high picture qualities or bitrates.

Fig. 8. Percentage of total bitrate for view 1 used for coding side information in 4×4-8×8-1D: (a) average percentages for all sequences (averaged over encodings at different QP values); (b) percentages for Exit view 1 at different QP values.

5.2.4. Probabilities for selection of transforms

Fig. 9 shows the probability of selection (averaged over all sequences) for each available transform in the 4×4-8×8-1D system at high and low picture qualities. The 2-D DCTs are chosen more frequently than the other transforms. The 2-D DCT is a general transform that works well in many regions of MCP and DCP residuals. Each 1-D transform, however, is chosen only if it works better than the 2-D DCT and the other 1-D transforms, which happens only if the coded residual block has 1-D signal characteristics with the appropriate direction. Thus each 1-D transform works well only in very specific residual blocks. In addition, while the selection of the 2-D DCT is communicated with a 1-bit codeword, a 1-D transform is communicated with a 4- or 5-bit codeword, creating a bias towards selecting the 2-D DCT. The selection probability of 1-D transforms increases at high picture qualities because the bias coming from the codeword length becomes less significant. The total probability of selecting the 2-D DCT is about 70% at high picture qualities and 90% at low picture qualities, whereas the total probability of selecting a 1-D transform is about 30% at high picture qualities and 10% at low picture qualities. Despite being selected less frequently than the 2-D DCT, the 1-D transforms can improve coding efficiency, as shown in Table 5. These probabilities also show that the 1-D transforms are not meant to replace the 2-D DCT but to complement it.

Fig. 9. Average probability of selection (averaged over all sequences) for each available transform in 4×4-8×8-1D at high and low picture qualities or bitrates: (a) high picture qualities (QP = 24); (b) low picture qualities (QP = 36).

Table 6 shows the probabilities of the selected prediction and transform types for each block. On average, at high bitrates (QP = 24), motion-compensated temporal prediction (MCP) is chosen with about 90.3% probability (out of which the 2-D DCT is used 63.5% of the time and a 1-D transform 26.8% of the time) and disparity-compensated inter-view prediction (DCP) is chosen with about 9.7% probability (out of which the 2-D DCT is used 7.4% of the time and a 1-D transform 2.3% of the time). Similarly, at low bitrates (QP = 36), MCP is chosen with about 80.9% probability (72.7% with the 2-D DCT and 8.2% with a 1-D transform) and DCP with about 19.1% probability (16.7% with the 2-D DCT and 2.4% with a 1-D transform). These results indicate that in multiview coding MCP is chosen more often than DCP, and the 2-D DCT more often than the 1-D transforms. Yet, as discussed before, despite being chosen less often, the 1-D transforms work well in regions with 1-D characteristics and complement the 2-D DCT to improve coding efficiency in multiview coding.

Table 6
Probabilities (%) of selection of MCP or DCP and 2-D DCT or a 1-D transform for all test sequences.

Sequence    QP    MCP + 2-D DCT   MCP + 1-D Trs   DCP + 2-D DCT   DCP + 1-D Trs
Exit        24    65.7            21.5            9.3             2.8
            28    76.8            12.8            8.4             1.9
            32    82.1            8.9             8.0             1.1
            36    84.8            6.8             7.6             1.0
Ballroom    24    67.5            20.2            10.6            1.7
            28    68.8            10.9            17.8            2.6
            32    62.6            7.7             26.6            3.0
            36    60.0            6.0             30.0            3.0
Race1       24    60.0            24.4            10.0            4.5
            28    62.5            20.3            12.7            4.4
            32    66.8            14.9            14.0            3.6
            36    71.0            8.7             17.3            3.0
Uli         24    57.0            37.0            4.3             1.0
            28    72.0            20.8            6.1             1.0
            32    76.2            15.1            7.5             1.1
            36    78.2            11.5            9.1             1.1
Vassar      24    66.3            30.6            2.1             1.0
            28    72.5            22.3            3.8             1.4
            32    76.0            12.1            8.4             2.6
            36    69.3            7.5             19.4            3.8
Average     24    63.5            26.8            7.4             2.3
            28    70.5            17.4            9.8             2.3
            32    72.8            11.8            13.0            2.4
            36    72.7            8.2             16.7            2.4

5.3. Effect of frame type

To examine the effect of frame type on the results, the prediction structure is changed to the one in Fig. 10, which also uses B (bi-predictive) frames. Fig. 11 shows the BD bitrate savings for view 1 of 4×4-8×8-1D with respect to 4×4-8×8-DCT using both the new prediction structure and the prediction structure of Section 5.2 in Fig. 6.

Fig. 10. Prediction structure with B frames.

Fig. 11. Bjontegaard-Delta (BD) bitrate savings of 4×4-8×8-1D with respect to 4×4-8×8-DCT with the prediction structures in Figs. 6 and 10.

The results in Fig. 11 show that the bitrate savings decrease with the new prediction structure (the average BD bitrate savings drop from 4.8% to 2.9%), mainly because residuals in the new prediction structure have less 1-D signal characteristics and therefore 1-D transforms provide smaller coding gains. In detail, prediction in B frames is formed by averaging predictions from two reference frames. The averaging operation smoothes the prediction residuals and, in general, causes them to have less 1-D signal characteristics than in P frames. In addition, B frames are typically coded at lower bitrates because they are more accurately predicted than P frames. At lower bitrates the side information becomes a larger burden, so the encoder is more biased towards the 2-D DCT than the 1-D transforms in B frames, which decreases the coding gains. For instance, the average probability of using 1-D transforms in the Vassar sequence is 26% in B frames, while it is 32% in the P frames of the prediction structure of Fig. 6. A further reason for the decrease is that the P frames in the new prediction structure have temporal reference frames that are temporally further away, which also reduces the 1-D signal characteristics of their residuals: the average probability of using 1-D transforms in the Vassar sequence is 28% in the P frames of the new prediction structure, compared to 32% in the P frames of the prediction structure of Fig. 6. In summary, residuals in the new prediction structure have less 1-D signal characteristics, causing a decrease in the compression efficiency improvement of systems with access to 1-D transforms.

5.4. Effect of global luminance compensation between views

Global luminance compensation between views can increase compression efficiency in both the conventional and our modified multiview compression systems. As discussed in Section 4.5, however, it is especially useful for systems with 1-D transforms and can therefore further increase the coding gains of our modified multiview coding systems over the conventional ones. To explore its effect on the compression efficiency results presented so far, we apply the histogram modification algorithm of [51] prior to encoding all sequences, using the prediction structure of Section 5.2 with the 4×4-8×8-1D and 4×4-8×8-DCT systems. Fig. 12 presents the results for view 1 without and with compensation. Global luminance compensation increases the bitrate savings for the Uli sequence from 3.5% to 5.8%; the probability of selecting 1-D transforms for this sequence increases from 38% to 52% (at high picture qualities) after compensation. For the other sequences the bitrate savings do not change much, since they do not have significant variations of global luminance between the views. Hence, we conclude that global luminance compensation can be an essential tool to be used with 1-D transforms in multiview coding.

Fig. 12. Average bitrate savings without and with global luminance compensation of 4×4-8×8-1D with respect to 4×4-8×8-DCT.

5.5. Complexity increase of codec

Having multiple transforms at the encoder and decoder increases the computational complexity of the codec. One way to measure the complexity increase of codecs implemented in software, such as JMVC, is to measure the increase in encoding and decoding times. In our implementation, the encoder applies all transforms and selects the best one based on the rate-distortion cost of each available transform. In other words, the forward transform, quantization of transform coefficients, entropy coding of quantized coefficients, inverse quantization and inverse transform are repeated for each available transform. Hence the encoding time the original encoder spends on these processes is multiplied by the number of available transforms.
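In outline, this exhaustive selection can be sketched as follows (a toy numpy illustration, not the JMVC implementation: only horizontal and vertical 1-D DCTs stand in for the eight directional transforms, and the rate term is a crude proxy using the sum of absolute quantized coefficients plus the signaling bits, rather than actual entropy-coded bit counts):

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis matrix (rows are basis vectors).
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    d[0, :] = np.sqrt(1.0 / n)
    return d

D = dct_matrix(4)

# A toy subset of candidate transforms for a 4x4 residual block.
TRANSFORMS = {
    "2-D DCT": lambda x: D @ x @ D.T,
    "1-D vertical": lambda x: D @ x,       # transform each column
    "1-D horizontal": lambda x: x @ D.T,   # transform each row
}

# Signaling cost per block: 1-bit codeword for 2-D DCT, longer for 1-D.
SIDE_BITS = {"2-D DCT": 1, "1-D vertical": 5, "1-D horizontal": 5}

def select_transform(block, qstep=1.0, lam=0.5):
    """Pick the transform minimizing the cost J = D + lambda * R."""
    best_name, best_cost = None, np.inf
    for name, fwd in TRANSFORMS.items():
        coeffs = fwd(block)
        q = np.round(coeffs / qstep)
        rate = np.abs(q).sum() + SIDE_BITS[name]   # crude rate proxy
        # Orthonormal transforms preserve squared error, so distortion
        # can be measured in the transform domain.
        dist = ((coeffs - q * qstep) ** 2).sum()
        cost = dist + lam * rate
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name
```

With this sketch, a residual block whose energy lies along a single smooth row is assigned the horizontal 1-D transform despite its longer codeword, while a smooth 2-D block falls back to the 2-D DCT, mirroring the selection behavior described in Section 5.2.4.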
However, the time the original encoder spends on these processes is only a fraction of the total encoding time (most of which is spent on motion estimation), and therefore the increase in the overall encoding time is expected to be smaller than the number of available transforms would suggest. The average encoding times (at QP = 24) of the 4×4-8×8-DCT and 4×4-8×8-1D systems are given in Tables 7 and 8; the modified multiview coding system requires 533% and 173% of the encoding time of the conventional system for views 0 and 1, respectively. However, note that, as discussed

Table 7
Encoding and decoding time comparison for view 0.

Systems          Encoder (%)   Decoder (%)
4×4-8×8-DCT      100           100
4×4-8×8-1D       533           320

Table 8
Encoding and decoding time comparison for view 1.

Systems          Encoder (%)   Decoder (%)
4×4-8×8-DCT      100           100
4×4-8×8-1D       173           181

in Section 4.1, the conventional system uses integer transforms while the modified system uses a floating-point implementation for the 1-D transforms. Floating-point arithmetic requires longer computation times, and if the modified system's 1-D transforms were implemented with integer arithmetic, the increases would be smaller. In addition, to reduce the encoding time further, algorithms can be developed that determine the best 1-D transform without computing the time-consuming rate-distortion cost. One such algorithm could decide on the best 1-D transform based only on the sum of absolute values of the transform coefficients, skipping the entropy coding, inverse quantization and inverse transform processes. Such an approach is likely to work well, since the 1-D transforms work well only when aligned with the 1-D structure in the residual block and work very badly otherwise. The decoding time of the modified system is not expected to increase with multiple 1-D transforms if the 1-D transforms are implemented with integer arithmetic, because the decoder applies a single transform to each block, signaled by the encoder. However, due to the floating-point implementation of the 1-D transforms, the decoding time of the modified coding system in our experiments is 320% and 181% of that of the conventional decoding system for views 0 and 1, respectively (see Tables 7 and 8).

6. Conclusion

Many alternative transforms have been proposed for the compression of images, intra prediction residuals or motion-compensated prediction residuals, while other prediction residuals have received little attention. In this paper, we focused on the disparity-compensated prediction residuals in multiview video coding and showed that 1-D transforms can be used to improve the coding efficiency of these residuals.
First, we analyzed the statistical characteristics of disparity-compensated prediction residuals, which revealed that many of their local regions can have anisotropic and 1-D signal characteristics, similar to previous findings for motion-compensated prediction residuals. Based on these results, we proposed to use 1-D transforms in multiview video coding. We also discussed that global luminance compensation between views


can further improve the compression efficiency of multiview coding systems with 1-D transforms. Experimental results with a modified multiview video coding system that can transform each residual block with either a 1-D transform or the 2-D DCT showed that compression gains are achievable, and that they depend on a number of aspects, including the frame type (P or B frames), the blocksize of the used transforms, the characteristics of the encoded video sequences, and the global luminance difference between the views. Future work might include reducing the complexity of the particular implementation used for the reported experiments, for example by using integer arithmetic for the transforms and a low-complexity transform decision algorithm at the encoder, as well as exploring the characteristics of resolution-enhancement residuals in spatially scalable video coding.

References

[1] J. Lim, Two-dimensional Signal and Image Processing, Prentice Hall, New Jersey, USA, 1990.
[2] T. Wiegand, G. Sullivan, G. Bjontegaard, A. Luthra, Overview of the H.264/AVC video coding standard, IEEE Trans. Circuits Syst. Video Technol. 13 (2003) 560–576.
[3] G. Sullivan, J. Ohm, W.-J. Han, T. Wiegand, Overview of the high efficiency video coding (HEVC) standard, IEEE Trans. Circuits Syst. Video Technol. 22 (2012) 1649–1668.
[4] M. Flickner, N. Ahmed, A derivation for the discrete cosine transform, Proc. IEEE 70 (September) (1982) 1132–1134.
[5] C.-F. Chen, K. Pang, The optimal transform of motion-compensated frame difference images in a hybrid coder, IEEE Trans. Circuits Syst. II: Analog Digit. Signal Process. 40 (June) (1993) 393–397.
[6] W. Niehsen, M. Brunig, Covariance analysis of motion-compensated frame differences, IEEE Trans. Circuits Syst. Video Technol. 9 (June) (1999) 536–539.
[7] K.-C. Hui, W.-C. Siu, Extended analysis of motion-compensated frame difference for block-based motion prediction error, IEEE Trans. Image Process. 16 (May) (2007) 1232–1245.
[8] F. Kamisli, J. Lim, 1-D transforms for the motion compensation residual, IEEE Trans. Image Process. 20 (2011) 1036–1046.
[9] E. Le Pennec, S. Mallat, Sparse geometric image representations with bandelets, IEEE Trans. Image Process. 14 (April) (2005) 423–438.
[10] G. Peyre, S. Mallat, Discrete bandelets with geometric orthogonal filters, in: IEEE International Conference on Image Processing, ICIP 2005, vol. 1, 11–14 September 2005, pp. I-65–8.
[11] D. Taubman, A. Zakhor, Orientation adaptive subband coding of images, IEEE Trans. Image Process. 3 (July) (1994) 421–437.
[12] V. Velisavljevic, B. Beferull-Lozano, M. Vetterli, P. Dragotti, Directionlets: anisotropic multidirectional representation with separable filtering, IEEE Trans. Image Process. 15 (July) (2006) 1916–1933.
[13] C.-L. Chang, B. Girod, Direction-adaptive discrete wavelet transform for image compression, IEEE Trans. Image Process. 16 (May) (2007) 1289–1302.
[14] W. Ding, F. Wu, S. Li, Lifting-based wavelet transform with directionally spatial prediction, in: Picture Coding Symposium, vol. 62, January 2004, pp. 291–294.
[15] B. Zeng, J. Fu, Directional discrete cosine transforms for image coding, in: IEEE International Conference on Multimedia and Expo, 9–12 July 2006, pp. 721–724.
[16] C.-L. Chang, M. Makar, S. Tsai, B. Girod, Direction-adaptive partitioned block transform for color image coding, IEEE Trans. Image Process. 19 (2010) 1740–1755.
[17] Y. Ye, M. Karczewicz, Improved H.264 intra coding based on bidirectional intra prediction, directional transform, and adaptive coefficient scanning, in: 15th IEEE International Conference on Image Processing, ICIP 2008, 2008, pp. 2116–2119.
[18] I. Richardson, The H.264 Advanced Video Compression Standard, Wiley, New Jersey, USA, 2010.
[19] O. Sezer, R. Cohen, A. Vetro, Robust learning of 2-D separable transforms for next-generation video coding, in: Data Compression Conference (DCC), 2011, pp. 63–72, http://dx.doi.org/10.1109/DCC.2011.14.


[20] E. Alshina, A. Alshin, F. Fernandes, Rotational transform for image and video compression, in: 18th IEEE International Conference on Image Processing (ICIP), 2011, pp. 3689–3692, http://dx.doi.org/10.1109/ICIP.2011.6116520.
[21] F. Zou, O.C. Au, C. Pang, J. Dai, X. Zhang, L. Fang, Rate-distortion optimized transforms based on the Lloyd-type algorithm for intra block coding, IEEE J. Sel. Top. Signal Process. 7 (6) (2013) 1072–1083, http://dx.doi.org/10.1109/JSTSP.2013.2274173.
[22] C. Yeo, Y.H. Tan, Z. Li, S. Rahardja, Mode-dependent transforms for coding directional intra prediction residuals, IEEE Trans. Circuits Syst. Video Technol. 22 (2012) 545–554.
[23] J. Han, A. Saxena, V. Melkote, K. Rose, Jointly optimized spatial prediction and block transform for video and image coding, IEEE Trans. Image Process. 21 (2012) 1874–1884.
[24] J. Yamaguchi, T. Shiodera, S. Asaka, A. Tanizawa, T. Yamakage, One-dimensional directional unified transform for intra coding, in: 18th IEEE International Conference on Image Processing (ICIP), 2011, pp. 3681–3684, http://dx.doi.org/10.1109/ICIP.2011.6116518.
[25] X. Zhao, L. Zhang, S. Ma, W. Gao, Rate-distortion optimized transform for intra-frame coding, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010, pp. 1414–1417, http://dx.doi.org/10.1109/ICASSP.2010.5495468.
[26] A. Saxena, F. Fernandes, On secondary transforms for intra prediction residual, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 1201–1204, http://dx.doi.org/10.1109/ICASSP.2012.6288103.
[27] C. Zhang, K. Ugur, J. Lainema, A. Hallapuro, M. Gabbouj, Video coding using spatially varying transform, IEEE Trans. Circuits Syst. Video Technol. 21 (2) (2011) 127–140, http://dx.doi.org/10.1109/TCSVT.2011.2105595.
[28] M. Narroschke, Extending H.264/AVC by an adaptive coding of the prediction error, in: Proceedings of the Picture Coding Symposium, 2006.
[29] J. Dong, K.N. Ngan, Two-layer directional transform for high performance video coding, IEEE Trans. Circuits Syst. Video Technol. 22 (2012) 619–625.
[30] J.-W. Kang, M. Gabbouj, C.-C. Kuo, Sparse/DCT (S/DCT) two-layered representation of prediction residuals for video coding, IEEE Trans. Image Process. 22 (2013) 2711–2722.
[31] C. Lan, J. Xu, G. Shi, F. Wu, Exploiting non-local correlation via signal-dependent transform (SDT), IEEE J. Sel. Top. Signal Process. 5 (2011) 1298–1308.
[32] Z. Gu, W. Lin, B.-S. Lee, C.T. Lau, Rotated orthogonal transform (ROT) for motion-compensation residual coding, IEEE Trans. Image Process. 21 (2012) 4770–4781.
[33] R. Cohen, S. Klomp, A. Vetro, H. Sun, Direction-adaptive transforms for coding prediction residuals, in: 17th IEEE International Conference on Image Processing (ICIP), 2010, pp. 185–188, http://dx.doi.org/10.1109/ICIP.2010.5651058.
[34] H. Zhang, J. Lim, Analysis of one-dimensional transforms in coding motion compensation prediction residuals for video applications, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 1229–1232, http://dx.doi.org/10.1109/ICASSP.2012.6288110.
[35] A. Gabriellini, M. Naccari, M. Mrak, D. Flynn, G.V. Wallendael, Adaptive transform skipping for improved coding of motion compensated residuals, Signal Process.: Image Commun. 28 (2013) 197–208.
[36] X. Cai, J. Lim, Algorithms for transform selection in multiple-transform video compression, in: 19th IEEE International Conference on Image Processing (ICIP), 2012, pp. 2481–2484, http://dx.doi.org/10.1109/ICIP.2012.6467401.
[37] C. Deng, W. Lin, B.-S. Lee, C.T. Lau, M.-T. Sun, Performance analysis, parameter selection and extensions to H.264/AVC FRExt for high resolution video coding, J. Vis. Commun. Image Represent. 22 (2011) 749–759 (Emerging Techniques for High Performance Video Coding).
[38] F. Kamisli, J. Lim, Transforms for the motion compensation residual, in: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2009, 2009, pp. 789–792, http://dx.doi.org/10.1109/ICASSP.2009.4959702.
[39] Y. Su, A. Vetro, A. Smolic, Common test conditions for multiview video coding, in: JVT-T207, the 20th Meeting of Video Coding Experts Group (VCEG), Klagenfurt, Austria, 2006.
[40] S. Kullback, R.A. Leibler, On information and sufficiency, Ann. Math. Stat. 22 (1951) 79–86.
[41] H. Malvar, A. Hallapuro, M. Karczewicz, L. Kerofsky, Low-complexity transform and quantization in H.264/AVC, IEEE Trans. Circuits Syst. Video Technol. 13 (2003) 598–603.
[42] G. Bjontegaard, K. Lillevold, Context-adaptive VLC coding of coefficients, in: JVT-C028, the 3rd Meeting of Video Coding Experts Group (VCEG), Fairfax, Virginia, USA, 2002.
[43] D. Marpe, H. Schwarz, T. Wiegand, Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard, IEEE Trans. Circuits Syst. Video Technol. 13 (2003) 620–636.
[44] G.J. Sullivan, T. Wiegand, Rate-distortion optimization for video compression, IEEE Signal Process. Mag. 15 (6) (1998) 74–90, http://dx.doi.org/10.1109/79.733497.
[45] T. Wiegand, B. Girod, Lagrange multiplier selection in hybrid video coder control, in: Proceedings of the International Conference on Image Processing, vol. 3, 2001, pp. 542–545, http://dx.doi.org/10.1109/ICIP.2001.958171.
[46] D.
Bertsekas, Nonlinear Programming, Athena Scientific, New Hampshire, USA, 1995. [47] A. Ortega, K. Ramchandran, Rate-distortion methods for image and video compression, IEEE Signal Process. Mag. 15 (1998) 23–50. [48] F. Kamisli, Transforms for Prediction Residuals in Video Coding (Ph. D. thesis), Massachusetts Institute of Technology, 2010. [49] J. Lòpez, J.H. Kim, A. Ortega, Block-based illumination compensation and search techniques for multiview video coding, in: Picture Coding Symposium, 2004. [50] Y.-L. Lee, J.-H. Hur, Y.-K. Lee, K.-H. Han, S. Cho, N. Hur, J. Kim, J.-H. Kim, P.-L. Lai, A. Ortega, Y. Su, P. Yin, C. Gomila, Block-based Illumination Compensation and Search Techniques for Multiview Video Coding, CE11:Illumination Compensation, in Joint Video Team (JVT) of ISO/IECMPEG ITU-T VCEG, Document JVT-U052r2, 2006. [51] U. Fecker, M. Barkowsky, A. Kaup, Histogram-based prefiltering for luminance and chrominance compensation of multiview video, IEEE Trans. Circuits Syst. Video Technol. 18 (2008) 1258–1267. [52] A. Vetro, M. McGuire, W. Matusik, A. Behrens, J. Lee, H. Pfister, Multiview Video TestSequences from MERL for the MPEG Multiview Working Group. ISO/IEC JTC1/SC29/WG11 Document m12077, April 2005. [53] G. Bjontegaard, Calculation of Average psnr Differences Between rdCurves, VCEG Contribution VCEG-M33, April 2001.