Video intra prediction using convolutional encoder decoder network


Zhipeng Jin a,∗, Ping An b,∗, Liquan Shen b

a Jiaxing Vocational and Technical College, School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China
b Shanghai Institute for Advanced Communication and Data Science, School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China

∗ Corresponding authors. E-mail addresses: [email protected] (Z. Jin), [email protected] (P. An), [email protected] (L. Shen).

This work was supported in part by the National Natural Science Foundation of China under Grants 61571285 and 61801006, and by the Shanghai Science and Technology Commission under Grants 17DZ2292400 and 18XD1423900.

Article history: Received 11 April 2018; Revised 8 December 2018; Accepted 1 February 2019.

Keywords: Video coding; Intra prediction; Image inpainting; Convolutional encoder-decoder network (CED); High Efficiency Video Coding (HEVC)

Abstract

Intra prediction is an effective method for video coding to remove the spatial redundancy of content. Classical intra prediction methods usually create a prediction block by extrapolating the encoded pixels surrounding the target block. However, existing methods cannot guarantee prediction efficiency for rich textural structure, especially when the spatial correlation between the target block and the reference pixels is weak. To remedy this issue, this paper proposes a novel intra prediction method based on a convolutional encoder-decoder network, which we term IPCED. IPCED learns and extracts an internal representation of the reference blocks, and progressively generates a prediction block from this representation. IPCED is a data-driven method, an improvement over hand-crafted methods, and is capable of improving the accuracy of intra prediction. Extensive experimental results demonstrate that IPCED generates higher-quality intra prediction results, achieving 3.41%, 3.07%, and 3.44% bitrate savings for the Y, Cb, and Cr channels respectively compared with the HEVC baseline, significantly beyond existing methods.

1. Introduction

Intra prediction methods play an important role in current state-of-the-art video coding standards [1], as they provide an efficient solution to reduce signal energy by predicting from spatially neighboring encoded pixels. In order to capture the finer edge directions present in natural images, High Efficiency Video Coding (HEVC) employs 35 intra prediction modes: the planar mode, the DC mode, and 33 angular prediction modes [2]. Furthermore, in the developing Joint Exploration Model (JEM) [3], the number of angular prediction modes has been extended to 65. These fine-grained modes provide more accurate prediction than the intra prediction of H.264/AVC, which has only 9 modes [4].

Video intra prediction is a well-studied yet challenging task, and its classical solution is to create a prediction block by extrapolating the reference pixels surrounding the target block, as shown in Fig. 1. For angular prediction, each pixel in the current block is projected to the nearest reference line along the angular direction, and the projected pixel is used as the prediction. A linear interpolation filter with 1/32-pixel accuracy is used to generate the reference line, and the filter coefficients are inversely proportional to the two distances between the projected fractional position and its two adjacent integer positions. In essence, the angular prediction in HEVC is a copying-based process built on the assumption that image content follows a pure direction of propagation. Besides, for the DC mode, the prediction is the average of all the reference pixels, and for the planar mode, a bi-linear interpolation is used to create the prediction block. However, all these modes together are still too simple to fully characterize the complex non-linear relationship between the reference pixels and the target block.
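To make the copy-and-interpolate mechanism concrete, here is a minimal numpy sketch of a vertical-ish angular mode with 1/32-pel interpolation. The function names and the simplified reference handling are ours, not part of the standard; the HEVC specification additionally projects to the left reference column for negative angles and applies reference smoothing:

```python
import numpy as np

def angular_predict(ref_top, n, angle):
    """Toy HEVC-style angular prediction for a vertical-ish mode.

    ref_top: 1-D int array of reconstructed pixels in the row above the
             block, long enough to cover n plus the maximum displacement.
    angle:   horizontal displacement per row in 1/32-pel units
             (HEVC angular modes use values such as 2, 5, 9, ..., 32).
    """
    ref_top = np.asarray(ref_top, dtype=np.int64)
    pred = np.empty((n, n), dtype=np.int64)
    for y in range(n):
        delta = (y + 1) * angle              # total displacement for this row
        idx, frac = delta >> 5, delta & 31   # integer and 1/32-fractional parts
        for x in range(n):
            a, b = ref_top[x + idx], ref_top[x + idx + 1]
            # Weights are inversely proportional to the distances from the
            # projected fractional position to its two integer neighbours.
            pred[y, x] = ((32 - frac) * a + frac * b + 16) >> 5
    return pred

def dc_predict(ref_pixels, n):
    """DC mode, for comparison: the prediction is the reference average."""
    return np.full((n, n), int(np.mean(ref_pixels) + 0.5))
```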

There are many works that further improve intra prediction efficiency. Kamisli [5,6] models the correlation between adjacent pixels as a first-order 2D Markov process, where each pixel is predicted by linearly weighting several adjacent pixels. Lai et al. [7] propose an error-diffused intra prediction algorithm for HEVC. Chen et al. [8] incorporate an ordered dither technique into intra prediction instead of error diffusion, to reduce computational complexity. Chen et al. [9] propose a copying-based improved intra prediction method. Lucas et al. [10] propose an intra prediction framework based on adaptive linear filters with sparsity constraints. Dias et al. [11] propose an improved combined intra prediction (CIP) method, which uses both the reference pixels and the prediction pixels generated by the intra prediction modes. Li et al. [12] propose a piece-wise linear projection method based on canonical correlation analysis (CCA) to better exploit local spatial correlations. However, all these works follow a single-line-based scheme, where only the nearest neighboring reference line is used for predicting a block.


This causes inaccurate prediction, especially when weak spatial correlation exists between the target block and the reference line. The utilization of further reference lines has been investigated in [13], where the non-adjacent lines as well as the adjacent lines are used for generating a prediction. Chang et al. [14] propose an arbitrary reference tier coding (ARTC) scheme, which allows the intra prediction modes to exploit the farther four reference lines. In [15], an intra block copy (IBC) method is proposed that searches for similar reconstructed blocks, and is commonly used in screen content coding. Recently, Li et al. [16] propose an efficient multiple line-based intra prediction (MLIP) scheme to improve coding efficiency; at the same time, residue compensation is introduced to calibrate the prediction of boundary regions in a block when farther reference lines are used. Experimental results show that MLIP achieves 2.4% Bjontegaard-Delta bitrate (BD-rate) savings on average. However, MLIP is limited by the effectiveness of hand-designed features: in general, it can only predict simple textures, and it is not effective at predicting complicated structures.

In recent years, deep learning methods have been successfully applied to various computer vision tasks, such as image denoising [17,18], video frame interpolation [19,20], and new view prediction [21]. Inspired by these successes, Cui et al. [22] apply a convolutional neural network to intra prediction (IPCNN) and achieve a preliminary success. IPCNN takes a 16 × 16 block as input, which contains the best 8 × 8 intra prediction block and its three nearest 8 × 8 reference blocks. IPCNN then outputs residue blocks, which are used to refine the current prediction block and its three reference blocks. We regard IPCNN as an intra prediction refinement method, which is much closer in spirit to image denoising. Unlike [22], Li et al. [23,24] propose a fully connected network for intra prediction, IPFCN, whose inputs are multiple reference lines of the current block and whose output is the intra prediction for the block. In IPFCN, there is no need to generate the HEVC intra prediction to feed the network, and the encoder uses rate-distortion optimization (RDO) to choose the best one from IPFCN and HEVC intra prediction.

In this paper, we propose a novel intra prediction method using a convolutional encoder-decoder network for HEVC. The main contributions of this paper can be summarized as follows:

(1) We propose a novel intra prediction mode, called IPCED, using a convolutional encoder-decoder network. A GAN-based framework is integrated into the training pipeline and is jointly optimized with the IPCED network.
(2) The IPCED encoder network builds a novel multi-scale skip architecture that combines deep global information with shallow local information. The IPCED decoder network employs multi-level branches to generate the prediction at different levels, and synthesizes the intra prediction result in a coarse-to-fine fashion.
(3) Experiments demonstrate that the proposed IPCED method generates higher-quality intra prediction results than existing state-of-the-art methods [7–12,16,22,24], in terms of both objective and subjective visual quality.

Fig. 1. HEVC intra prediction. (Left) The 35 intra prediction modes. (Right) An illustration of HEVC intra angular prediction.

The rest of the paper is organized as follows: a brief review of related work is given in Section 2. Section 3 presents the proposed IPCED architecture, with details of the implementation. In Section 4, extensive experiments are conducted to evaluate the IPCED method. Finally, we conclude this work with some future directions in Section 5.

2. Related work

Our work is related to several topics, such as intra prediction and image inpainting, that rest on very different motivations and arguments. In this section, we briefly review the challenges of the video intra prediction task, and image inpainting methods using generative adversarial networks (GANs) [25,26].

2.1. Challenges of intra prediction

Since pixels at a shorter distance generally have a stronger correlation in an image, HEVC uses only the nearest reference line to predict the target block. However, the nearest reference line does not always work well, for at least two reasons. One is the incoherence caused by signal noise or by the texture of other objects, which makes the prediction texture deviate from the inherent texture of the target block [14,16]. The other is that pixels at different positions in a block have different reconstruction quality in block-based video coding frameworks, e.g., H.264, HEVC, and JEM. In general, the boundary of a block has a larger quantization error (i.e., worse quality in the pixel domain), especially at the corners [27]. So the nearest reference line used by traditional intra prediction methods often has worse quality. To overcome this issue, the utilization of further reference lines has been investigated in [13,14,16,23], allowing the intra prediction modes to exploit farther reference lines. Experimental results show that using the further reference lines is helpful because more predictions are provided to choose from. In fact, some farther local regions can also be used for intra prediction. The template matching method [28] searches for similar non-local content by template matching in the reconstructed region. Besides, non-local self-similarity is also widely exploited in the image inpainting task [29]. For this reason, we aim to utilize the farther reference region, learning and extracting the inherent texture or non-local structure information of the target block, to seek a potentially better intra prediction.

2.2. Inpainting using deep learning methods

The image inpainting technique was originally developed to restore missing information, to remove unwanted objects, and to extend the size of an image. Traditional inpainting techniques have been investigated for intra prediction, e.g., using a partial differential equations (PDEs) model [30] or a total variation (TV) model [31]. The goal of GANs is to estimate the underlying distribution of real data samples and to generate new samples from that distribution [26]. The basic idea of GANs is to simultaneously train a discriminator and a generator: the discriminator aims to distinguish between real samples and generated samples, while the generator tries to generate fake samples that look as real as possible.


Recently, GAN-based convolutional encoder-decoder (CED) networks have been shown to be a simple and effective approach for the inpainting task. The Context Encoder [32] trains a deep neural network for inpainting holes, where high-level recognition and low-level pixel synthesis are formulated into a CED network, jointly trained with an adversarial loss to encourage coherency between generated and existing pixels. More recently, Iizuka et al. [33] improve on this by introducing both global and local discriminators: the global discriminator assesses whether the completed image is coherent as a whole, while the local discriminator focuses on the generated region to enforce local consistency. In addition, Yang et al. [34] propose a multi-scale synthesis approach based on joint optimization of content and texture constraints, which treats the encoder-decoder prediction as the global content constraint, and the perceptual feature similarity between the hole and the known region as the texture constraint. These works have been shown to generate semantically plausible new content in highly structured images, such as faces and objects.

In general, the typical inpainting task involves hallucination of pixels, so many plausible solutions exist for any given context. For the video intra prediction task in particular, however, the key challenge is how to predict more accurately with lower residue power. In addition, the image inpainting task can make use of 8 neighboring regions to fill the center hole, whereas for intra prediction usually only 3 reconstructed regions (Left, Above-Left, and Above) are available, and the pixel values of the fourth quadrant must be predicted [35], as shown in Fig. 2. Therefore, intra prediction is more challenging than the typical image inpainting task. To alleviate this problem, we improve IPCED intra prediction performance by innovating on the structure of the convolutional encoder-decoder network.

Fig. 2. Variations from the intra prediction task to the image inpainting task.

3. Framework of IPCED

In this section, we first provide an overview of the IPCED architecture, then present the loss functions devised to optimize the network, and finally give details of the training procedure.

We propose IPCED as a novel intra prediction mode added for HEVC 32 × 32, 16 × 16, 8 × 8, and 4 × 4 intra prediction units (PUs). Fig. 3 illustrates the overall framework of our proposed method for the various intra prediction units in HEVC. Our framework consists of a convolutional encoder-decoder network (IPCED, which can also be considered the generator G of the GAN framework) and an auxiliary context discriminator network (D). The discriminator D is used exclusively for training the decoder network and is not used during testing. IPCED applies the idea of image inpainting to the intra prediction problem, using the three reference blocks to predict the pixel values of the fourth quadrant (the current PU). IPCED takes the current PU (P) and the three nearest reference blocks as input (x), where P is initialized with the intra prediction result of HEVC. IPCED then outputs the refined prediction block (p) for the current PU, which can be used by the subsequent coding processes of HEVC. Since reference pixels in the Bottom-Left column and Above-Right row are often not available, our scheme employs only the Left, Above, and Above-Left reconstructed regions as reference pixels.

3.1. Intra prediction convolutional encoder-decoder network

Multi-scale skip encoder. The encoder takes an input image x, then learns, extracts, and compresses inherent texture and context representation features. In order to expose global information of the input, the input image is sub-sampled by a stack of strided convolutions with ReLU nonlinear activation functions. At each sub-sampling step, we double the number of feature channels. Dropout of 0.5 is added at the end of the encoder's fully connected layer to prevent over-fitting.

This architecture is inspired by the Context Encoder proposed in [32], where deep hierarchical features are jointly encoded in a local-to-global pyramid. The main difference is that we build a novel skip architecture that combines deep, global, semantic information with shallow, local, appearance information. Specifically, the skip architecture concatenates fine layers (convolutions with stride 2) with coarse layers (convolutions with stride 4), which lets the encoder extract inherent features that conform to the context structure.
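As an illustration of the input assembly and the encoder side, here is a schematic PyTorch sketch. The paper's implementation is in Caffe, and the exact layer counts and channel widths are not specified, so the `assemble_input` helper, the layer sizes, and the placement of dropout at the bottleneck are our assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

def assemble_input(recon, pred_hevc, x0, y0, n):
    """Build the 2n-by-2n network input x for an n-by-n PU at (x0, y0):
    Above-Left, Above, and Left come from the reconstructed frame (luma
    only), and the bottom-right quadrant is initialized with the HEVC
    intra prediction."""
    x = recon[y0 - n:y0 + n, x0 - n:x0 + n].clone()
    x[n:, n:] = pred_hevc                  # current PU quadrant
    return x.unsqueeze(0).unsqueeze(0)     # NCHW layout

class MultiScaleSkipEncoder(nn.Module):
    """Encoder sketch: strided convolutions (channels doubled at each
    sub-sampling step) plus a skip branch in which fine stride-2 features
    are concatenated with coarse stride-4 features."""
    def __init__(self, ch=64):
        super().__init__()
        self.fine = nn.Sequential(         # stride-2 (fine) path
            nn.Conv2d(1, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        self.fuse = nn.Sequential(         # brings fine features to 1/4 res
            nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        self.coarse = nn.Sequential(       # stride-4 (coarse) path
            nn.Conv2d(1, ch, 4, stride=4, padding=0), nn.ReLU(inplace=True))
        self.drop = nn.Dropout(0.5)        # the paper puts dropout after the
                                           # encoder FC layer; simplified here

    def forward(self, x):
        f = self.fuse(self.fine(x))        # deep, semantic features
        c = self.coarse(x)                 # shallow, appearance features
        return self.drop(torch.cat([f, c], dim=1))   # multi-scale skip concat
```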

Fig. 3. Overview of our intra prediction convolutional encoder-decoder (IPCED) network architecture for HEVC. IPCED is a novel intra prediction method that applies the idea of image inpainting: it takes the current PU (P) and the three nearest reference blocks as input (x), and outputs the refined prediction block (p) for the current PU. Details are omitted here to simplify the illustration.


Coarse-to-fine decoder. Due to the asymmetric nature of the intra prediction problem (large reference block input, small prediction block output), we break the symmetric structure of encoder-decoder networks and introduce a coarse-to-fine decoder. We employ a deconvolutional layer cascaded with a convolutional layer as the basic decoder unit. The decoder progressively generates pixels of the target block in a coarse-to-fine fashion, which is a novel intra prediction function and can also be seen as an inpainting function. First, deconvolutional layers with 4 × 4 kernels and ReLU activations are integrated into the model to play the role of up-sampling. Second, multi-level branches generate the prediction at different levels and synthesize the intra prediction result in a coarse-to-fine fashion. Specifically, the refined prediction result $p_2$ is added back to the rough branch result $p_1$, and the intra prediction block is finally synthesized, as defined in (1) and (2):

$p_1 = G_1(x)$    (1)

$p_2 = p_1 + G_2(x)$    (2)

where the function $G_i(x)$ represents the residue prediction function at each refinement level $i$. This is similar to the original GAN, which learns a generator network G that generates samples by transforming a noise variable z into a sample G(z). The stepwise refinement strategy explicitly defines the specific role of each branch of the decoder network, i.e., $G_i(x)$ predicts the sub-band residues at different levels, and improves the final prediction performance.

Discriminator. The architecture of the discriminator network D, shown in the upper right corner of Fig. 3, is inspired by the popular DCGAN architecture [36] and consists of Conv-BatchNorm-LeakyReLU layers. A fully connected layer with a sigmoid function is stacked at the end of the discriminator to map the output to a probability. Following the vanilla GAN [25] framework, G and D are trained in a two-player minimax game with the objective function U(G, D), as defined in (3). The discriminator D is trained to distinguish the predicted block p from the original block y. Conversely, the encoder-decoder network G learns the distribution of the training samples and tries to confuse the discriminator by generating more and more realistic predictions. We optimize the parameters of G and D in an alternating manner to solve the min-max problem:

$\min_G \max_D U(G, D) = \mathbb{E}_y[\log D(y)] + \mathbb{E}_p[\log(1 - D(p))]$    (3)
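A minimal sketch of the two-branch coarse-to-fine decoder implementing Eqs. (1)-(2), under the same assumptions as the encoder sketch above (channel sizes are hypothetical; `in_ch=192` matches the concatenated encoder output of that sketch):

```python
class CoarseToFineDecoder(nn.Module):
    """Two refinement branches following Eqs. (1)-(2): the rough branch
    produces p1 = G1(x), and the refinement branch adds a residue to give
    p2 = p1 + G2(x). Each branch is a deconvolution (4x4 kernel, ReLU)
    cascaded with a convolution, the basic decoder unit described in the
    text."""
    def __init__(self, in_ch=192):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.ConvTranspose2d(in_ch, 64, 4, stride=2, padding=1),  # up-sample
                nn.ReLU(inplace=True),
                nn.Conv2d(64, 1, 3, padding=1))
        self.g1, self.g2 = branch(), branch()

    def forward(self, z):
        p1 = self.g1(z)          # coarse prediction, Eq. (1)
        p2 = p1 + self.g2(z)     # residue-refined prediction, Eq. (2)
        return p1, p2
```

Note that the decoder output is the size of the PU only (the fourth quadrant), not of the full 2n × 2n input, which is what the text calls the asymmetric nature of the problem.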

3.2. IPCED training objective

Encoder-decoder synthesis loss function. For the intra prediction task, we train the encoder-decoder network by regressing to the ground-truth content of the target block, to minimize the prediction residue power. We use a multi-level L2 distance $L_{MSE}$ as the encoder-decoder synthesis loss function:

$L_{MSE}^{i} = \frac{1}{WH} \sum_{w=1}^{W} \sum_{h=1}^{H} \left( p_{wh}^{i} - y_{wh} \right)^{2}$    (4)

$L_{MSE} = \frac{1}{N} \sum_{i} L_{MSE}^{i}$    (5)

where $L_{MSE}^{i}$ measures the discrepancy between the ground-truth block y and the decoder-synthesized block $p^i$ at each refinement level $i$. The loss at each level is normalized by the width W and the height H (i.e., the total number of elements), and N is the total number of multi-level intra prediction branches in the decoder. The encoder-decoder synthesis loss $L_{MSE}$ is responsible for capturing the overall structure of the predicted block and its coherence with regard to its context. In practice, $L_{MSE}$ often fails to capture high-frequency details. This stems from the fact that the L2 loss minimizes the mean pixel-wise error, which results in a blurry, averaged image [32–34]. We alleviate this problem by adding an adversarial loss $L_{adv}$.

Joint training objective function. Following the vanilla GAN framework, our IPCED network is trained in a competing minimax fashion as defined in Eq. (3). In order to generate a prediction with more accurate appearance and realistic looks, we train the generator G by combining the L2 distance loss in pixel space and the adversarial loss in feature space. Similar efforts have demonstrated effectiveness for generative models [32,33]. Specifically, the generator loss function $L_G$ is formally defined as:

$L_G = \lambda L_{MSE} + (1 - \lambda) L_{adv}$    (6)

$L_{adv} = -\log(D(p))$    (7)
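The loss terms of Eqs. (4)-(7) translate directly into code; a PyTorch sketch (helper names are ours):

```python
import torch
import torch.nn.functional as F

def synthesis_loss(preds, y):
    """Multi-level L2 loss, Eqs. (4)-(5): average the per-pixel squared
    error of every decoder branch output against the ground-truth block y."""
    return sum(F.mse_loss(p, y) for p in preds) / len(preds)

def generator_loss(preds, y, d_out, lam=0.999):
    """Joint objective, Eqs. (6)-(7): L_G = lam * L_MSE + (1 - lam) * L_adv,
    where d_out = D(p) is the discriminator probability for the final
    prediction p = preds[-1]."""
    l_mse = synthesis_loss(preds, y)
    l_adv = -torch.log(d_out + 1e-8).mean()   # Eq. (7), stabilized
    return lam * l_mse + (1.0 - lam) * l_adv
```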

where λ is set to 0.999 in our implementation, as suggested by Pathak et al. [32] and by cross-validation on the BSD500 validation set. $L_{MSE}$ is a straightforward and widely used loss in image generation and is critical for training stability [34]. $L_{adv}$ pushes IPCED to learn the natural distribution of original images, improving the realism of the prediction results. Note that $L_{adv}$ is not back-propagated to the encoder, since the decoder is responsible for synthesizing images while the encoder is used to extract and compress features [37].

Generally, GAN-based inpainting tasks involve hallucination of pixels, so many plausible solutions may exist for any given context. In contrast, for the video intra prediction task, predicting the pixels of the fourth quadrant with lower residue power is the key challenge. For that reason, our IPCED network employs a convolutional encoder-decoder structure and trains the generator G by combining the MSE loss and the adversarial loss: the MSE loss encourages G to generate a high-quality prediction with lower residue power in pixel space, while the adversarial loss pushes G to learn the natural distribution of original images and improves the realism of the prediction results.

3.3. Implementation and training details

In order to make a fair comparison with other intra prediction methods, we train IPCED on a collection of natural images and test it on the HEVC standard test sequences. The training and testing images have no overlap, so as to demonstrate the generalizability of the trained network. Specifically, our training and validation sets are selected from the BSD500 database [38]: its disjoint training set (200 images) and testing set (200 images) are both used for training, and its validation set (100 images) is used for validation. All training samples are encoded with the HEVC all intra (AI) configuration, using HM 16.0 [39]. Note that only the luminance channel is considered for training. Our proposed IPCED and loss functions are built on the publicly available Caffe code [40], and implemented on a Windows PC with an Intel(R) Core(TM) i7-8700K CPU @ 3.7 GHz and an NVIDIA GeForce 1080Ti GPU. For each quantization parameter (QP) and each PU size, a separate set of network weights is trained.

Motivated by Refs. [33,41,42], we train our networks in two phases. Specifically, we first pre-train the encoder-decoder network IPCED by directly optimizing the multi-level L2 distance $L_{MSE}$ defined in (5). IPCED is pre-trained for a total of 50 epochs with an early stopping strategy, using Adam [43] with a learning rate of $10^{-3}$ and a batch size of 64. After pre-training, we re-formulate the optimization of IPCED (G) and D in the GAN framework. Specifically, we employ the pre-trained network as initialization for the encoder-decoder network IPCED, and jointly train IPCED and D through the adversarial training objective defined in (3) and (6). The joint training runs for a total of 20 epochs, starting from a learning rate of $10^{-5}$.
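A schematic of this two-phase schedule, using the loss helpers sketched in Section 3.2. Epoch counts and learning rates follow the text; the data pipeline, the early stopping, and the paper's trick of not back-propagating $L_{adv}$ into the encoder are simplified away. `G(x)` is assumed to return the tuple of branch outputs (p1, p2):

```python
import torch

def train_ipced(G, D, loader, device="cuda"):
    """Phase 1: L_MSE pre-training; phase 2: joint GAN training, Eq. (3)."""
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    for epoch in range(50):                    # phase 1
        for x, y in loader:                    # x: context input, y: ground-truth PU
            x, y = x.to(device), y.to(device)
            loss = synthesis_loss(G(x), y)
            opt_g.zero_grad(); loss.backward(); opt_g.step()

    opt_g = torch.optim.Adam(G.parameters(), lr=1e-5)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-5)
    bce = torch.nn.BCELoss()
    for epoch in range(20):                    # phase 2
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            # D step: distinguish original blocks y from predictions p.
            p = G(x)[-1].detach()
            dy, dp = D(y), D(p)
            d_loss = bce(dy, torch.ones_like(dy)) + bce(dp, torch.zeros_like(dp))
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()
            # G step: MSE plus adversarial term, Eqs. (6)-(7).
            preds = G(x)
            g_loss = generator_loss(preds, y, D(preds[-1]))
            opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return G
```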


4. Experimental results

In this section, experimental results are presented to validate the effectiveness of our IPCED approach for the video intra prediction task and to compare it with the latest state-of-the-art approaches.

4.1. Quantitative comparisons of network architecture

Given the particularities and challenges of the video intra prediction task compared with the typical inpainting task (see the analysis in Section 2), we introduce a novel multi-scale skip encoder and coarse-to-fine decoder architecture for the CED network. We first compare this architecture quantitatively with the classical symmetric CED architecture (represented by the Context Encoder [32]). The ablation results in Table 1 show that our method achieves the highest quantitative performance on the BSD validation dataset (a higher PSNR value is better). We attribute this to the nature of the architecture: the multi-scale skip encoder combines deep, global, semantic information with shallow, local, appearance information, which lets it learn better feature representations than the Context Encoder, while the coarse-to-fine decoder synthesizes better image structure and details through its stepwise progressive refinement strategy.

Table 1. Performance ablation comparison under QP 22 (Y component), ΔPSNR (dB).

Encoder                 Decoder                   32PU    16PU    8PU     4PU
Context Encoder [32]    Context Decoder [32]      0.2729  0.4078  0.4951  0.2512
Multi-scale skip        Context Decoder [32]      0.3733  0.4513  0.5011  0.2820
Context Encoder [32]    Coarse-to-fine            0.3975  0.4491  0.5122  0.2904
Multi-scale skip        Coarse-to-fine            0.5049  0.4767  0.5221  0.2912
Multi-scale skip        Coarse-to-fine + GAN      0.5263  0.5472  0.5647  0.2932

4.2. Coding performance of IPCED for HEVC

In this subsection, we integrate the proposed IPCED into HM 16.0 and evaluate its intra prediction performance against the HEVC baseline. The test is performed according to the HEVC common test conditions (CTC) [44] under the all intra (AI) main configuration, with QPs set to 22, 27, 32, and 37. The test sequences cover the whole range of HEVC standard test sequences in the CTC, arranged into five classes: Class A (2560 × 1600), Class B (1080p), Class C (WVGA), Class D (WQVGA), and Class E (720p). To evaluate the coding performance, we use the BD-rate measure [45] on the Luma and Chroma components independently, where a negative number indicates bitrate saving.

The detailed results are summarized in Table 2. IPCED achieves significant BD-rate reductions on all test sequences compared with the HEVC baseline. For Luma (Y), IPCED achieves an average BD-rate reduction of 3.41% over all test classes, which is significant for the current HEVC coding standard, and the maximum BD-rate reduction exceeds 4.5% when coding the PeopleOnStreet, FourPeople, and Vidyo1 sequences. For Chroma (Cb and Cr), IPCED achieves average BD-rate savings of 3.07% and 3.44% over the HEVC baseline, respectively. These results indicate that IPCED handles different resolutions and different content well, and is significantly superior to HEVC.

Table 2. Coding performance of our IPCED compared with the HEVC baseline in all intra configuration, BD-rate (%).

Class / Sequence          Y        Cb       Cr
Class A (2560 × 1600)
  Traffic                 −3.978   −3.063   −2.650
  PeopleOnStreet          −4.566   −5.751   −5.854
Class B (1080p)
  ParkScene               −3.214   −1.668   −3.983
  Cactus                  −3.333   −2.113   −1.689
  BasketballDrive         −3.150   −2.449   −2.905
  BQTerrace               −2.324   −1.653   −3.487
Class C (WVGA)
  BQMall                  −3.196   −2.975   −4.462
  BasketballDrill         −2.282   −3.564   −4.621
  PartyScene              −2.150   −2.342   −2.081
  RaceHorses              −3.246   −3.096   −2.696
Class D (WQVGA)
  BasketballPass          −2.044   −1.475   −3.292
  BlowingBubbles          −2.927   −3.124   −3.490
  RaceHorses              −3.146   −3.694   −2.268
  BQSquare                −1.823   −2.680   −2.200
Class E (720p)
  Johnny                  −4.104   −2.438   −3.736
  KristenAndSara          −4.184   −5.370   −3.104
  FourPeople              −4.649   −2.389   −4.750
  vidyo1                  −4.658   −3.440   −4.553
Average of all classes    −3.41    −3.07    −3.44

Fig. 4 illustrates the Luma rate-distortion curves of IPCED and the HEVC baseline for a typical sequence in each class, i.e., PeopleOnStreet, Cactus, BQMall, RaceHorses, and Vidyo1, respectively. Our IPCED method shows stable performance over a wide bit range (from a very low 0.16 Kbps to an ultra-high 10.4 Kbps), and the coding performance is significantly improved.

Fig. 4. The Luma rate-distortion performance of IPCED compared with the HEVC baseline over a typical bit range.

Since better intra-coded frames can improve the coding efficiency of inter slices, and inter slices also use a small number of intra prediction modes, we test the proposed IPCED method in the Random Access (RA), Low Delay with B (LB), and Low Delay with P (LP) coding configurations in addition to the AI configuration. The BD-rate results are summarized in Table 3. Under the RA configuration, the proposed IPCED method achieves average BD-rate savings of 2.40%, 2.32%, and 2.78% for the Y/Cb/Cr components, respectively; under the LB configuration, 2.23%, 2.08%, and 3.04%; and under the LP configuration, 2.82%, 2.33%, and 2.87%. These results make clear that IPCED improves the coding gain even in the RA, LB, and LP configurations, where the great majority of the video frames are inter-coded.
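For reference, the BD-rate numbers reported here follow the standard Bjontegaard calculation [45]; a compact numpy sketch of that metric (our own illustration, not the authors' tool):

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard-Delta bitrate: fit third-order polynomials of log-rate
    as a function of PSNR for both RD curves, integrate them over the
    overlapping PSNR interval, and report the average rate difference in
    percent. Negative values mean the test codec saves bitrate."""
    pa = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    pt = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    ia, it = np.polyint(pa), np.polyint(pt)
    avg_a = (np.polyval(ia, hi) - np.polyval(ia, lo)) / (hi - lo)
    avg_t = (np.polyval(it, hi) - np.polyval(it, lo)) / (hi - lo)
    return (np.exp(avg_t - avg_a) - 1.0) * 100.0

# Usage: one (rate, PSNR) pair per QP point, e.g. QP in {22, 27, 32, 37}:
# bd_rate([r22, r27, r32, r37], [p22, p27, p32, p37],
#         [r22t, r27t, r32t, r37t], [p22t, p27t, p32t, p37t])
```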




Table 3. BD-rate performance of our IPCED compared with the HEVC baseline in RA, LB, and LP configurations (%).

                            Random Access (RA)         Low Delay B (LB)           Low Delay P (LP)
Class / Sequence            Y       Cb      Cr         Y       Cb      Cr         Y       Cb      Cr
Class A (2560 × 1600)
  Traffic                   −2.709  −2.459  −2.087     −2.731  −2.031  −2.097     −2.324  −4.165  −3.895
  PeopleOnStreet            −2.621  −3.643  −3.244     −2.343  −3.862  −3.798     −2.781  −2.293  −2.204
Class B (1080p)
  ParkScene                 −2.049  −1.638  −2.306     −2.105  −1.657  −3.015     −2.781  −2.293  −2.204
  Cactus                    −2.283  −1.509  −1.328     −2.052  −2.460  −2.523     −2.286  −2.640  −2.736
  BasketballDrive           −2.169  −0.187  −1.301     −2.167  −0.584  −1.893     −2.280  −1.061  −2.316
  BQTerrace                 −1.306  −1.971  −2.761     −1.383  −1.456  −2.812     −1.263  −1.251  −2.586
Class C (WVGA)
  BQMall                    −2.445  −2.232  −3.617     −2.350  −3.016  −4.765     −2.210  −2.797  −4.533
  BasketballDrill           −2.133  −2.353  −1.721     −1.810  −1.681  −2.260     −2.352  −2.551  −2.824
  PartyScene                −1.542  −1.948  −1.503     −1.510  −2.531  −1.871     −1.612  −2.369  −1.857
  RaceHorses                −2.295  −2.474  −1.836     −1.559  −1.797  −1.620     −1.675  −2.408  −1.886
Class D (WQVGA)
  BasketballPass            −2.134  −0.899  −2.957     −2.134  −0.899  −2.957     −2.134  −0.899  −2.957
  BlowingBubbles            −1.499  −3.774  −1.891     −1.627  −0.917  −3.427     −1.252  −2.404  −1.704
  RaceHorses                −1.878  −3.411  −2.647     −1.794  −2.115  −1.542     −1.848  −2.905  −1.452
  BQSquare                  −1.208  −0.832  −2.442     −0.782  −0.108  −3.130     −1.163  −0.271  −3.445
Class E (720p)
  Johnny                    −3.586  −5.273  −4.123     −3.043  −4.649  −4.322     −3.108  −5.813  −5.356
  KristenAndSara            −3.618  −2.621  −3.948     −3.107  −2.715  −4.010     −3.345  −2.800  −3.315
  FourPeople                −3.896  −1.585  −4.508     −3.899  −2.662  −5.925     −4.006  −2.510  −5.959
  vidyo1                    −3.422  −1.618  −5.495     −3.643  −0.203  −4.926     −3.788  −0.313  −5.031
Average of all classes      −2.40   −2.32   −2.78      −2.23   −2.08   −3.04      −2.82   −2.33   −2.87

4.3. Comparison with state-of-the-art methods

To verify the superior coding performance of the proposed IPCED method, we compare IPCED with nine of the latest state-of-the-art intra prediction methods: the error diffusion method [7], the ordered dither method [8], the copying-based method [9], the sparse linear method [10], the improved CIP method [11], the CCA-based piece-wise projection method [12], the multiple line method [16], IPCNN [22], and IPFCN [24]. The coding results are cited from their respective papers.

The experimental results are summarized in Table 4. Our IPCED method achieves new state-of-the-art results, substantially outperforming the strongest competitors (e.g., MLIP [16] and IPFCN [24]). Specifically, IPCED achieves a 3.41% BD-rate reduction on average over all test classes, a 45.73% relative improvement over MLIP [16], which obtains only a 2.34% BD-rate reduction. Compared with the strongest competitor, IPFCN [24], our IPCED method is more robust across test classes with different resolutions (e.g., in Class D, 2.49% vs. 1.80%). We can also observe that IPCED achieves better BD-rate savings at larger resolutions, e.g., the BD-rate gain is higher in Class A than in Class D. This is a desirable property, since one of the trends in video communication is toward higher resolutions.

Besides, Cui et al. report that IPCNN achieves a 0.7% BD-rate reduction on average [22]. IPCNN is essentially an intra prediction refinement method, much closer in spirit to image denoising. Unlike [22], our method adds IPCED to the encoder as a novel intra prediction mode, and the encoder uses RDO to choose the better of IPCED and HEVC intra prediction, so as to achieve better rate-distortion performance. Experimental results show that IPCED surpasses IPCNN by a large margin (3.41% vs. 0.7%). This comparison demonstrates the effectiveness of the proposed IPCED method.
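The mode decision itself is ordinary rate-distortion optimization; schematically, IPCED is simply one more candidate alongside the HEVC modes (a toy sketch with hypothetical numbers, not HM's actual implementation):

```python
def choose_intra_mode(candidates, lam):
    """RDO mode decision sketch: keep the candidate with the smallest RD
    cost J = D + lambda * R. `candidates` is a list of
    (mode_name, distortion, rate_bits) tuples; how D and R are measured
    (e.g., SSE after transform coding) is up to the encoder."""
    best = min(candidates, key=lambda m: m[1] + lam * m[2])
    return best[0]

# e.g. choose_intra_mode([("planar", 5100, 42), ("ipced", 3200, 51)], 30.0)
```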


Table 4. Coding performance of various intra prediction methods compared to HEVC in terms of Luma BD-rate (%).

Test dataset   Error diff. [7]  Ordered dither [8]  Copying [9]  Sparse linear [10]  Improved CIP [11]  CCA proj. [12]  Multiple line [16]  IPFCN [24]  IPCED (ours)
Class A        −0.35            −0.87               −1.80        −0.85               −1.40              −0.95           −2.18               −           −4.27
Class B        −0.28            −0.58               −1.62        −1.09               −0.90              −1.40           −2.88               −3.10       −3.01
Class C        −0.20            −0.45               −0.76        −1.15               −0.90              −1.20           −2.48               −2.10       −2.86
Class D        −0.20            −0.46               −0.59        −0.53               −0.90              −1.18           −2.03               −1.80       −2.49
Class E        −0.27            −0.62               −1.23        −1.50               −1.20              −1.13           −2.17               −4.50       −4.40
Overall        −0.26            −0.60               −1.20        −0.85               −1.06              −1.17           −2.34               −2.88       −3.41

4.4. Effectiveness of IPCED as a novel intra prediction method

To demonstrate the effectiveness of the proposed IPCED method, the hit ratio of the new intra prediction mode is presented in Table 5. It shows that the IPCED mode is effective for all kinds of sequences and all PU sizes. IPCED being selected as the best intra prediction mode (i.e., yielding less distortion) is a high-probability event: on average, IPCED is selected as the best mode for 49.3% of 32 × 32 PUs. That is, from the perspective of RDO, the prediction quality of the IPCED mode is much higher than that of the other intra prediction modes in HEVC.

Table 5. Percentage of the proposed IPCED mode being selected as the best intra prediction mode for various PU sizes (%).

Class / Sequence                32     16     8      4
Class A (2560 × 1600)
  Traffic                       45.7   43.8   43.8   46.6
  PeopleOnStreet                53.3   45.9   42.5   42.2
Class B (1080p)
  ParkScene                     40.4   40.6   41.3   47.9
  Cactus                        43.0   40.5   39.0   42.5
  BasketballDrive               30.0   36.1   42.6   41.9
  BQTerrace                     62.2   48.2   32.9   32.9
Class C (WVGA)
  BQMall                        58.5   44.3   40.7   40.1
  BasketballDrill               40.3   35.9   39.8   39.1
  PartyScene                    71.0   54.2   41.5   38.1
  RaceHorses                    66.2   48.9   43.7   43.7
Class D (WQVGA)
  BasketballPass                46.2   42.8   40.8   41.0
  BlowingBubbles                72.5   50.5   40.6   40.6
  RaceHorses                    69.2   54.1   44.2   44.6
  BQSquare                      62.2   48.2   32.9   32.9
Class E (720p)
  Johnny                        25.7   34.5   34.3   32.4
  KristenAndSara                27.5   35.8   36.7   36.7
  FourPeople                    40.1   40.2   37.7   42.4
  vidyo1                        34.2   47.2   39.6   42.4
Average of all test sequences   49.3   44.0   39.7   40.4

Fig. 5. Frequency with which the various intra prediction modes are selected when coding the first frame of typical sequences under QP 37. When the IPCED mode is used, the hit ratio of the various intra prediction modes in HEVC is reduced, especially for the Planar, DC, horizontal, and vertical modes.

Correspondingly, we can observe from Fig. 5 that when the IPCED mode is used, the hit ratio of the various intra prediction modes in HEVC is reduced, especially for the Planar, DC, horizontal, and vertical modes. This is due to an intrinsic limitation of the HEVC intra prediction modes, which predict the target block only as a linear combination of the previously encoded pixels in its neighborhood. When the inherent texture of the target block deviates from that of the neighboring reference pixels, the effectiveness of HEVC intra prediction decreases. This can be visually verified from Fig. 6, where we compare the intra prediction effects (i.e., inpainting effects) of various methods.

Fig. 6 shows why the IPCED intra prediction mode has a better chance of being selected. We can observe that the MLIP [16] method can calibrate the boundary regions of the predicted block by combining multi-line prediction and residual compensation, and it improves prediction accuracy compared with HEVC

and the copying-based method [9]. However, compared with our IPCED method, MLIP is limited by the effectiveness of hand-crafted features, which are still not sufficient to predict complicated structures. Intuitively, our proposed IPCED method generates visually credible image structures and textures that are consistent with the ground truth.


Fig. 6. Comparison of predicted blocks under QP 22. The bottom-right corner of each image is the predicted block, and the other regions are pixels reconstructed by HEVC. (a) Original block. (b) Predicted by the best intra prediction mode of HEVC. (c) Predicted by CBP [9]. (d) Predicted by MLIP [16]. (e) Predicted by our IPCED. The MSE of each predicted block (seven examples, left to right) is:

(b) HEVC:          827.98  478.99  830.78  419.62  356.25  3004.10  473.46
(c) CBP [9]:       689.94  290.04  625.20  523.03  374.46  3425.35  414.89
(d) MLIP [16]:     390.84  499.30  509.35  349.16  269.40  3016.22  377.94
(e) IPCED (ours):  135.21  132.84  239.47  103.85   79.46   970.37  175.12

Fig. 7. Subjective comparison between the original picture and the pictures decoded by HM 16.0 and by our proposed IPCED method. Smaller bits-per-pixel (BPP) values represent smaller bit consumption, i.e., a higher compression rate; higher SSIM and PSNR values indicate higher quality of the decoded image. The first frame of each typical sequence is encoded at QP 37 in the all intra configuration, and in-loop filtering is disabled during coding for a pure comparison of intra prediction quality.

In particular, IPCED results are more coherent and blend more seamlessly with the boundary in many cases. From the perspective of quantitative analysis, the MSE of the blocks predicted by IPCED is much smaller than that of the best HEVC intra prediction mode; that is, IPCED generates more accurate predictions. These experimental results demonstrate the advantage of our encoder-decoder-based IPCED method, especially in the presence of complex textures, where the HEVC intra prediction modes often fail. Due to the non-stationary statistics of natural images, deep-learning-based adaptive approaches that generalize across content tend to be more effective for video intra prediction.

Finally, a subjective comparison is made between the original picture, the picture decoded by HM 16.0, and the picture decoded by the proposed IPCED method. Two cropped parts (at a resolution of 200 × 200 pixels) of some typical sequences are compared in Fig. 7. From the details of the texture, we can observe that the IPCED method preserves the smoothness of contours and the sharpness of many edges, giving them a natural appearance. The IPCED method preserves more details, and its decoded image is more faithful to the original picture. In general, the IPCED method exhibits fewer coding artifacts and better visual quality. This also reflects that IPCED generates a more accurate prediction with lower prediction residual, and hence improves the coding performance.

4.5. Computational complexity analysis

The computational complexity (relative to the HEVC baseline), bitrate saving, and model parameters (for the 8 × 8 PU network) of IPFCN and IPCED are shown in Table 6.


Table 6. Comparison of computational complexity, bitrate saving, and model parameters.

                     IPFCN method [24]   IPCED (ours)
Encoding time        9147%               11227%
Bitrate saving       −2.88%              −3.41%
Model parameters     6.69 mil.           5.78 mil.

Our IPCED uses a convolutional network, which has fewer parameters and better rate-distortion performance; IPFCN [24] uses a fully connected network, which is faster to deploy. Like IPFCN [24], our proposed IPCED method is much more complex than the HEVC baseline, since each PU needs a feed-forward pass through the network. Besides, our current implementation is straightforward and not optimized, and the parameters of the IPCED network are in floating-point precision, which is not computationally friendly for video coding.

Considering the relatively high computational complexity, we will study how to accelerate the proposed method. For example, in smooth texture regions, HEVC can already provide efficient and accurate intra prediction, so an additional IPCED model is not required there. Besides, parallel programming techniques are another solution that could be investigated in the future to speed up the encoding process. In addition, efforts are being made both in industry and academia to make the implementation of deep neural networks more efficient. Furthermore, compared with inter coding in HEVC, intra coding has lower complexity; this means that the increase in encoding complexity associated with IPCED has minimal impact in the inter coding case. Based on the above analysis, we believe that computational complexity will not be an obstacle to the wide application of deep-learning-based intra prediction methods.

5. Conclusions

In this paper, we combine intra prediction with texture synthesis techniques and propose a novel intra prediction method using a convolutional encoder-decoder network (IPCED). We innovate on the structure of the convolutional encoder-decoder network, which boosts both subjective and objective prediction quality. Experimental results demonstrate that the proposed IPCED method can perform more complicated intra prediction and generate higher-quality intra prediction results than existing state-of-the-art methods, achieving 3.41%, 3.07%, and 3.44% BD-rate savings for the Y/Cb/Cr components over the HEVC baseline. Our IPCED method is a first step in employing deep encoder-decoder neural networks for video intra prediction, and it provides an exciting new tool. More work will be done to investigate how to accelerate the proposed IPCED method, e.g., training a smaller model, pruning, or low-bit network representations.

Declaration of interest

None.

Acknowledgment

The authors would like to thank the anonymous reviewers for their feedback and helpful suggestions.

References

[1] G.J. Sullivan, J. Ohm, W.J. Han, T. Wiegand, Overview of the high efficiency video coding (HEVC) standard, IEEE Trans. Circuits Syst. Video Technol. 22 (12) (2012) 1649–1668.
[2] J. Lainema, F. Bossen, W.J. Han, J. Min, K. Ugur, Intra coding of the HEVC standard, IEEE Trans. Circuits Syst. Video Technol. 22 (12) (2012) 1792–1801.


[3] J. Chen, E. Alshina, G.J. Sullivan, J.R. Ohm, J. Boyce, Algorithm description of joint exploration test model 5, JVET-E1001, 2017.
[4] T.K. Tan, R. Weerakkody, M. Mrak, N. Ramzan, V. Baroncini, J.R. Ohm, G.J. Sullivan, Video quality evaluation methodology and verification testing of HEVC compression performance, IEEE Trans. Circuits Syst. Video Technol. 26 (1) (2016) 76–90.
[5] F. Kamisli, Intra prediction based on Markov process modeling of images, IEEE Trans. Image Process. 22 (10) (2013) 3916–3925.
[6] F. Kamisli, Block-based spatial prediction and transforms based on 2D Markov processes for image and video compression, IEEE Trans. Image Process. 24 (4) (2015) 1247–1260.
[7] Y.H. Lai, Y. Lin, Error diffused intra prediction for HEVC, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 1424–1427.
[8] C.H. Chen, Y. Lin, Enhanced HEVC intra prediction with ordered dither technique, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 1447–1450.
[9] H. Chen, T. Zhang, M.T. Sun, A. Saxena, M. Budagavi, Improving intra prediction in high-efficiency video coding, IEEE Trans. Image Process. 25 (8) (2016) 3671–3682.
[10] L.F.R. Lucas, N.M.M. Rodrigues, E.A.B.D. Silva, C.L. Pagliari, S.M.M.D. Faria, Image coding using generalized predictors based on sparsity and geometric transformations, IEEE Trans. Image Process. 25 (9) (2016) 4046–4060.
[11] A.S. Dias, S.G. Blasi, M. Mrak, E. Izquierdo, Improved combined intra prediction for higher video compression efficiency, in: Proceedings of the Picture Coding Symposium (PCS), 2017, pp. 1–5.
[12] Y. Li, L. Li, Z. Li, H. Li, Hierarchical piece-wise linear projections for efficient intra-prediction coding, in: Proceedings of the IEEE Visual Communications and Image Processing (VCIP), 2017.
[13] S. Matsuo, S. Takamura, Y. Yashima, Intra prediction with spatial gradients and multiple reference lines, in: Proceedings of the Picture Coding Symposium (PCS), 2009, pp. 161–164.
[14] Y.J. Chang, C.L. Lin, P.H. Lin, C.C. Lin, J.S. Tu, Improved intra prediction method based on arbitrary reference tier coding schemes, in: Proceedings of the Picture Coding Symposium (PCS), 2017, pp. 1–5.
[15] C. Pang, Y. Wang, V. Seregin, K. Rapaka, M. Karczewicz, X. Xu, Non-CE2: intra block copy and inter signaling unification, JCTVC-T0227, 2015.
[16] J. Li, B. Li, J. Xu, R. Xiong, Efficient multiple line-based intra prediction for HEVC, IEEE Trans. Circuits Syst. Video Technol. 28 (4) (2018) 947–957.
[17] K. Zhang, W. Zuo, Y. Chen, D. Meng, L. Zhang, Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising, IEEE Trans. Image Process. 26 (7) (2017) 3142–3155.
[18] S. Nah, T.H. Kim, K.M. Lee, Deep multi-scale convolutional neural network for dynamic scene deblurring, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 257–265.
[19] W. Lotter, G. Kreiman, D. Cox, Deep predictive coding networks for video prediction and unsupervised learning, in: Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[20] S. Niklaus, M. Long, F. Liu, Video frame interpolation via adaptive separable convolution, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 261–270.
[21] J. Flynn, I. Neulander, J. Philbin, N. Snavely, DeepStereo: learning to predict new views from the world's imagery, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5515–5524.
[22] W. Cui, T. Zhang, S. Zhang, F. Jiang, W. Zuo, Z. Wan, D. Zhao, Convolutional neural networks based intra prediction for HEVC, in: Proceedings of the Data Compression Conference (DCC), 2017, pp. 436–436.
[23] J. Li, B. Li, J. Xu, R. Xiong, Intra prediction using fully connected network for video coding, in: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2017, pp. 1–5.
[24] J. Li, B. Li, J. Xu, R. Xiong, W. Gao, Fully connected network-based intra prediction for image coding, IEEE Trans. Image Process. 27 (7) (2018) 3236–3247.
[25] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, et al., Generative adversarial networks, arXiv preprint arXiv:1406.2661 (2014).
[26] K. Wang, C. Gou, Y. Duan, Y. Lin, X. Zheng, F.Y. Wang, Generative adversarial networks: introduction and outlook, IEEE/CAA J. Autom. Sin. 4 (4) (2017) 588–598.
[27] M.A. Robertson, R.L. Stevenson, DCT quantization noise in compressed images, IEEE Trans. Circuits Syst. Video Technol. 15 (1) (2005) 27–38.
[28] Y. Guo, Y.K. Wang, H. Li, Priority-based template matching intra prediction, in: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 2008, pp. 1117–1120.
[29] H. Liu, R. Xiong, X. Zhang, Y. Zhang, S. Ma, W. Gao, Nonlocal gradient sparsity regularization for image restoration, IEEE Trans. Circuits Syst. Video Technol. 27 (9) (2017) 1909–1921.
[30] Y. Zhang, Y. Lin, Improving HEVC intra prediction with PDE-based inpainting, in: Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014, pp. 1–5.
[31] X. Qi, T. Zhang, F. Ye, A. Men, B. Yang, Intra prediction with enhanced inpainting method and vector predictor for HEVC, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 1217–1220.
[32] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, A.A. Efros, Context encoders: feature learning by inpainting, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2536–2544.
[33] S. Iizuka, E. Simo-Serra, H. Ishikawa, Globally and locally consistent image completion, ACM Trans. Graph. 36 (4) (2017) 1–14.


[34] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, H. Li, High-resolution image inpainting using multi-scale neural patch synthesis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4076–4084.
[35] M.H. Baig, V. Koltun, L. Torresani, Learning to inpaint for image compression, in: Proceedings of the Neural Information Processing Systems (NIPS), 2017.
[36] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434 (2015).
[37] A.B.L. Larsen, S.K. Sønderby, H. Larochelle, O. Winther, Autoencoding beyond pixels using a learned similarity metric, in: Proceedings of the International Conference on Machine Learning (ICML), 2016, pp. 1558–1566.
[38] P. Arbeláez, M. Maire, C. Fowlkes, J. Malik, Contour detection and hierarchical image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 33 (5) (2011) 898–916.
[39] HHI, HM software 16.0, 2014, https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-16.0.
[40] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, Caffe: convolutional architecture for fast feature embedding, in: Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 675–678.
[41] C. Ledig, Z. Wang, W. Shi, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 105–114.
[42] L. Galteri, L. Seidenari, M. Bertini, A.D. Bimbo, Deep generative adversarial compression artifact removal, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4836–4845.
[43] D. Kingma, J. Ba, Adam: a method for stochastic optimization, in: Proceedings of the International Conference on Learning Representations (ICLR), 2015, pp. 1–13.
[44] F. Bossen, Common test conditions and software reference configurations, Joint Collaborative Team on Video Coding (JCT-VC), JCTVC-F900, 2011.
[45] S. Pateux, J. Jung, An Excel add-in for computing Bjontegaard metric and its evolution, ITU-T SG16/Q6 Document VCEG-AE07, 2007.

Zhipeng Jin received the B.S. degree in electrical engineering from the University of Science and Technology of China (USTC), Hefei, China, in 2004, and the M.S. degree in electrical engineering from Ningbo University, Ningbo, China, in 2007. He is currently pursuing the Ph.D. degree at Shanghai University. His research interests include image/video coding, video codec optimization, and deep learning.

Ping An is a professor in the video processing group at the School of Communication and Information Engineering, Shanghai University, China. She received the B.S. and M.S. degrees from Hefei University of Technology in 1990 and 1993, and the Ph.D. from Shanghai University in 2002. In 1993, she joined Shanghai University. Between 2011 and 2012, she was a visiting professor with the Communication Systems Group at Technische Universität Berlin, Germany. Her research interest is image and video processing, with a recent focus on 3D video processing. She has completed more than 10 projects supported by the National Natural Science Foundation of China, the National Science and Technology Ministry, and the Science & Technology Commission of Shanghai Municipality, among others. She was awarded the Second Prize of the Shanghai Municipal Science & Technology Progress Award in 2011 and the Second Prize in Natural Sciences of the Ministry of Education in 2016.

Liquan Shen received the B.S. degree in automation control from Henan Polytechnic University, Jiaozuo, China, and the M.S. and Ph.D. degrees in communication and information systems from Shanghai University, Shanghai, China, in 2001, 2005, and 2008, respectively. He was with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA, as a visiting professor from 2013 to 2014. He has been with the faculty of the School of Communication and Information Engineering, Shanghai University, since 2008, where he is currently a professor. He has authored or co-authored more than 80 refereed technical papers in international journals and conferences in the fields of video coding and image processing, and holds 10 patents in the areas of image/video coding and communications. His research interests include High Efficiency Video Coding, perceptual coding, video codec optimization, 3DTV, and 3D image/video quality assessment.
