Face completion with Hybrid Dilated Convolution

Yuchun Fang, Yifan Li, Xiaokang Tu, Taifeng Tan, Xin Wang
School of Computer Engineering and Science, Shanghai University, Shanghai, China

ARTICLE INFO

Keywords: Image completion, Image inpainting, Deep learning, GAN

ABSTRACT

Image completion is a challenging task which aims to fill the missing or masked regions in images with plausibly synthesized contents. In this paper, we focus on face image inpainting tasks, aiming at reconstructing missing or damaged regions of an incomplete face image given the context information. We specially design the U-Net architecture to tackle the problem. The proposed U-Net based method combines Hybrid Dilated Convolution (HDC) and spectral normalization to fill in missing regions of any shape with sharp structures and fine-detailed textures. We perform both qualitative and quantitative evaluation on two challenging face datasets. Experimental results demonstrate that our method outperforms previous learning-based inpainting methods. The proposed method can generate realistic and semantically plausible images.

1. Introduction

Image inpainting can be viewed as reconstructing or synthesizing texture based on the surrounding image content, either filling in missing regions or replacing unwanted parts of an image. It is a fundamental problem in low-level vision, as it can be used for filling occluded image regions or repairing damaged images. Hence, image inpainting has attracted widespread interest in the computer vision community over the past few decades.

Early image inpainting methods mostly use low-level features to fill in the hole. A typical approach is the patch-based method [1–4], in which the best-matching texture patches are sampled from a source image and then pasted into a target image to reconstruct the missing area. Wilczkowiak et al. [5] optimized the search areas to find the most appropriate patches. PatchMatch [6] achieved real-time image editing with a randomized patch-search algorithm. These methods make use of the low-level features of the given context and propagate the local image appearance around the target holes to fill them in. Hence, they can provide realistic texture by their nature. However, the low-level features of the given context alone cannot predict the high-level content of the missing hole. Moreover, filling the target holes by propagating the local image appearance around them fails to capture the global structure of the image. Hence, it is necessary to develop semantically consistent inpainting that produces the results humans expect.

Recently, deep neural networks have been introduced for image completion [7–13] to solve the problems mentioned above.

Typically, some works [10–12] demonstrate that architectures using dilated convolutions are promising for image inpainting because they enlarge receptive fields without any extra computational cost. As mentioned in [14], these methods use several dilated convolutions with exponentially growing dilation rates (e.g., 2, 4, 8). One problem in this dilated convolution framework is the gridding effect, as shown in Fig. 1. Stacking several dilated convolution operations with a fixed dilation rate may result in discontinuous convolution kernels. In such cases, not all pixels are involved in later calculations, which may lead to inconsistency of local information and is disadvantageous for pixel-level tasks.

To address the gridding problem, we employ Hybrid Dilated Convolution (HDC) in this paper. Instead of using the same or exponentially growing dilation rate for all layers after downsampling occurs, we use a different dilation rate for each layer. Additionally, we apply spectral normalization in both the generator and the discriminator networks to improve training stability. Our network can complete masked faces with high quality in a single forward pass without any post-processing. We evaluate the proposed method both qualitatively and quantitatively, and show its advantages compared to state-of-the-art methods.

The contribution of our work can be summarized as follows:

• We propose a face completion network based on the U-Net architecture combined with HDC blocks. The adopted linearly growing dilation rates address the gridding effect for the face inpainting task.


Fig. 1. Illustration of the gridding problem. (a) All convolutional layers have the same dilation rate 𝑟 = 2. From left to right: green-marked features contribute to the calculation of the central red-marked feature through three convolution layers with 3 × 3 kernels. (b) Subsequent convolutional layers have dilation rates of 𝑟 = 1, 2, 3, respectively. This keeps the receptive field unchanged while avoiding the gridding problem.
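The gridding comparison in Fig. 1 can also be checked numerically. The short Python sketch below is our own illustration (not code from the paper): it tracks, in 1-D, which input offsets reach one output position after three kernel-size-3 dilated convolutions, showing that a fixed rate of 2 leaves gaps while rates 1, 2, 3 cover every pixel with the same receptive field.

```python
# Tiny 1-D sketch (illustrative only) of which input positions contribute to
# one output position after stacking three kernel-size-3 dilated convolutions.

def contributing_offsets(dilation_rates):
    """Return the input offsets (relative to the output position) touched by
    a stack of kernel-size-3 dilated convolutions with the given rates."""
    offsets = {0}
    for r in dilation_rates:
        offsets = {o + k * r for o in offsets for k in (-1, 0, 1)}
    return sorted(offsets)

# Fixed rate r = 2 everywhere: only even offsets are used -> gridding effect.
print(contributing_offsets([2, 2, 2]))   # [-6, -4, -2, 0, 2, 4, 6]

# Hybrid rates 1, 2, 3: same receptive field, but every pixel contributes.
print(contributing_offsets([1, 2, 3]))   # [-6, -5, ..., 5, 6]
```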

Fig. 2. Overview of our improved generative inpainting framework. On the left is our U-Net-based generator, which has three main parts: the upsampling block, the downsampling block and the HDC block. On the right is a PatchGAN discriminator.

• We introduce spectral normalization in the proposed network to improve training stability. By introducing spectral normalization, a modified PatchGAN discriminator is designed for irregular-hole image inpainting.
• We evaluate the proposed method on several public datasets. The experiments demonstrate that our model can produce photo-realistic, high-quality images and obtain state-of-the-art performance in image inpainting.

The remainder of this paper is organized as follows. Related work is reviewed in Section 2. The full architecture and the approach are detailed in Section 3. The experimental results and evaluations are presented in Section 4. Finally, conclusions are drawn in Section 5.

2. Related work

A large body of literature exists on image inpainting. In this section, we review the work most closely related to ours.

Traditional Image Inpainting. Early image inpainting approaches make use of low-level features to propagate texture from the surrounding context into the missing hole [15,16]. These methods can only tackle small holes and may produce apparent artifacts and noise patterns for large holes. Later works use patch-based methods [6], which iteratively search for the best-fitting patches to fill in the holes and produce smooth results. These methods can generate plausible textures inside the holes. However, they lack global structural information, which leads to undesirable outputs.

Inpainting with GANs. Recently, deep neural networks, including the generative adversarial network (GAN), have exhibited great performance in image inpainting. Pathak et al. [7] presented Context Encoders (CE), a Convolutional Neural Network (CNN) trained to directly reconstruct the missing region by combining an 𝐿2 reconstruction loss and an adversarial loss [17]. CE encodes the entire image into a bottleneck representation to describe the context of the missing region. Their inpainting network takes center-cropped images and regresses the missing part. Although CE can synthesize promising inpainting results, its outputs are sometimes visually unrealistic. Iizuka et al. [10] improved CE by introducing both global and local discriminators as adversarial losses. The local discriminator focuses on a small region centered around the inpainted region, and the global discriminator is trained on the entire image. Poisson blending is applied as a post-process to combine global and local results. Yu et al. [11] replaced the post-processing with a refinement network powered by contextual attention layers with global and local WGANs [18]. More recently, Chen et al. [19] presented a fully end-to-end progressive GAN for high-resolution face completion with multiple controllable attributes. Liu et al. [13] proposed the use of partial convolutions, where the convolution is masked and renormalized to be conditioned only on valid pixels, and achieved state-of-the-art performance for irregular hole inpainting.

3. Proposed method

Given a raw image 𝑥, we sample a binary image mask 𝑚 at random locations. The input image 𝑧 is corrupted from the raw image as 𝑧 = 𝑥 ⊙ 𝑚, where ⊙ is the element-wise product. The inpainting network 𝐺 takes 𝑧 and 𝑚 as input and outputs a predicted image 𝐺(𝑧, 𝑚) with the same size as the input. Pasting the generated content into the masked region of the input image, we obtain the inpainting output 𝑧 + 𝐺(𝑧, 𝑚) ⊙ (1 − 𝑚), sketched in the snippet below. In the following, we first introduce the network architecture and then describe the model objective and learning algorithm for image completion.
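The corruption and composition steps above can be written in a few lines. The following is a minimal sketch under our own conventions (a mask value of 1 marks known pixels; `generator` is a placeholder for 𝐺), not the authors' implementation.

```python
# Minimal sketch of input corruption and output composition, assuming tensors
# shaped (N, 3, H, W) and a mask m that is 1 on known pixels, 0 in the hole.
import torch

def composite(x, m, generator):
    z = x * m                                    # corrupted input: z = x ⊙ m
    pred = generator(z, m)                       # G(z, m), same size as x
    return z + pred * (1.0 - m)                  # keep known pixels, fill the hole

# Example with a dummy "generator" that just returns zeros:
x = torch.rand(1, 3, 128, 128)
m = (torch.rand(1, 1, 128, 128) > 0.3).float()   # roughly 30% of pixels masked out
out = composite(x, m, lambda z, m: torch.zeros_like(z))
```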

3.1. Architecture

The architecture of the proposed method is illustrated in Fig. 2. Following the basic idea of GANs, the proposed network consists of a generator and a discriminator. The generator is a compound U-Net with multiple blocks, shown on the left of Fig. 2. The discriminator is a general CNN that distinguishes generated results from the ground truth, shown on the right of Fig. 2.

Structure of the U-Net. We introduce a new architecture based on the U-Net [20] that integrates information across all scales to generate higher-quality images. The model has three down-sampling blocks, a bottleneck block, and three up-sampling blocks. The details of the basic blocks are shown in Fig. 3. The down-sampling block plays the role of an encoder, which reduces each spatial dimension to 1/8 of the input size, as shown in Fig. 3(a). The bottleneck consists of four stacked HDC blocks that expand receptive fields without increasing the number of weights in the filters. The HDC block is illustrated in Fig. 3(c). As shown in Fig. 3(b), the up-sampling block plays the role of a decoder, which transforms the feature maps back to an RGB image with the same resolution as the input. ReLU and batch normalization layers [21] are used between convolutional layers, except for the last layer, which produces the final result. For each layer in the decoder of the U-Net, features from the corresponding encoder layer are passed through skip connections. The last layer uses a tanh activation function to produce an image with pixel values in [−1, 1], which is then rescaled to integer RGB values. A simplified sketch of this wiring is given below.
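A minimal PyTorch-style sketch of the U-Net wiring follows. It is our simplification: plain convolution blocks stand in for the blocks of Fig. 3, which additionally use spectral normalization and leaky ReLU, and the single bottleneck layer stands in for the four stacked HDC blocks.

```python
# Simplified U-Net generator: three downsampling stages (1/8 overall), a
# bottleneck, three upsampling stages with skip connections, tanh output.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class UNetGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(4, 64)            # input = RGB image + binary mask
        self.enc2 = conv_block(64, 128)
        self.enc3 = conv_block(128, 256)
        self.pool = nn.AvgPool2d(2)              # each stage halves H and W
        self.bottleneck = conv_block(256, 256)   # stand-in for the four HDC blocks
        self.dec3 = conv_block(256 + 256, 128)   # decoder blocks take skip concatenations
        self.dec2 = conv_block(128 + 128, 64)
        self.dec1 = conv_block(64 + 64, 64)
        self.to_rgb = nn.Sequential(nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, z, m):
        s1 = self.enc1(torch.cat([z, m], dim=1))   # 128 x 128
        s2 = self.enc2(self.pool(s1))              # 64 x 64
        s3 = self.enc3(self.pool(s2))              # 32 x 32
        h = self.bottleneck(self.pool(s3))         # 16 x 16 bottleneck (1/8)
        h = self.dec3(torch.cat([self.up(h), s3], dim=1))
        h = self.dec2(torch.cat([self.up(h), s2], dim=1))
        h = self.dec1(torch.cat([self.up(h), s1], dim=1))
        return self.to_rgb(h)                      # values in [-1, 1]

out = UNetGenerator()(torch.rand(1, 3, 128, 128), torch.ones(1, 1, 128, 128))
```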


Fig. 3. Illustration of the basic blocks used in our work. (a) The downsampling block consists of two convolution layers with kernel size 3, two spectral normalization layers, an average pooling layer and a leaky ReLU activation function. (b) The upsampling block consists of two convolution layers with kernel size 3, two spectral normalization layers, a nearest-neighbor upsampling layer and a leaky ReLU activation function. (c) The HDC block consists of three convolution layers with kernel size 3 and dilation rates of 𝑟 = 1, 2, 3, respectively, three spectral normalization layers and three leaky ReLU activation functions.

Hybrid Dilated Convolution. Recent research has demonstrated that architectures using dilated convolutions are especially promising for image analysis tasks that require a detailed understanding of the scene [8,10,22,23]. By enlarging the receptive fields of the network without downsampling or increasing the number of model parameters, dilated convolutions are also valuable for image inpainting. Iizuka et al. [10] first used serialized layers with increasing dilation rates to expand the receptive fields of output neurons. One problem of this dilated convolution framework is the gridding effect. Moreover, large dilation rates may only benefit large regions while being disadvantageous for small areas. Borrowing the idea in [14], we adopt an HDC block consisting of three subsequent convolutional layers with dilation rates 1, 2 and 3, respectively. Fig. 3(c) depicts the detail of the HDC block in the proposed model.

Spectral Normalization. The spectral normalization GAN [24] is one of the most recently proposed GAN models. Spectral normalization controls the Lipschitz constant of the discriminator by dividing the weight matrix of each layer by its spectral norm, which is equal to its largest singular value. Spectral normalization has been shown to make GANs robust to the choice of hyper-parameters without incurring any significant overhead. To improve training stability, we also exploit spectral normalization in the generator and discriminator of the proposed model.
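A possible PyTorch rendering of the HDC block of Fig. 3(c), combining the three dilated convolutions with spectral normalization and leaky ReLU, is sketched below. This is our reading of the description, not the authors' released code; the channel count and leaky-ReLU slope are assumptions.

```python
# HDC block sketch: three 3x3 convolutions with dilation rates 1, 2, 3, each
# wrapped in spectral normalization and followed by a leaky ReLU.
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class HDCBlock(nn.Module):
    def __init__(self, channels=256, negative_slope=0.2):
        super().__init__()
        layers = []
        for rate in (1, 2, 3):                    # linearly growing dilation rates
            layers += [
                spectral_norm(nn.Conv2d(channels, channels, kernel_size=3,
                                        padding=rate, dilation=rate)),
                nn.LeakyReLU(negative_slope, inplace=True),
            ]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        # padding == dilation keeps the spatial size unchanged for 3x3 kernels
        return self.body(x)

# The bottleneck of the generator stacks four such blocks:
bottleneck = nn.Sequential(*[HDCBlock(256) for _ in range(4)])
y = bottleneck(torch.rand(1, 256, 16, 16))   # shape preserved: (1, 256, 16, 16)
```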

3.2. Loss functions

Our model is trained in a self-supervised manner. During training, we rely on several loss functions to optimize the network.

𝐿1 Loss. Pixel-wise 𝐿1 loss is a straightforward and widely used loss function in image generation. It measures the pixel-wise differences between the synthesized content and its corresponding ground truth and encourages the low-frequency correctness of the generated image. The 𝐿1 distance is preferred over the 𝐿2 distance as it produces images with less blurring [25]. 𝐿1 is also essential for preserving both pose and texture information and for accelerating optimization. Some previous works [7,10] measured the difference only between the missing regions of the generated images and of the ground truth; to produce the whole image, they then require certain post-processing. Differently, we calculate the mean absolute error between the whole predicted image and the corresponding ground truth as in Eq. (1).

$\mathcal{L}_{pixel} = \mathbb{E}_{z,m,x}\left[\,\lVert G(z,m) - x \rVert_1\,\right]$   (1)

In this way, it is expected that the reconstruction of the whole image and the completion of the missing parts can promote each other. The information of the whole image obtained through reconstruction can be explicitly used to rebuild the missing part with the trained networks. Hence, our method can reconstruct high-quality face images without post-processing.

Perceptual Loss. Recent research suggests that perceptual distance based on a pre-trained network is closer to human perception of similarity than traditional distance measures [26,27]. The perceptual network computes the activations of the hole patches and sums the 𝐿1 distances between the output and the ground truth across all feature layers. Formally, the perceptual loss is defined in Eq. (2).

$\mathcal{L}_{perceptual} = \mathbb{E}_{z,m,x}\left[\sum_j \frac{1}{C_j H_j W_j}\,\lVert F_j(G(z,m)) - F_j(x) \rVert_1\right]$   (2)

where 𝐹 is a pre-trained feature network for extracting a generic global feature from the input, 𝑗 indexes the feature layers, and 𝑊𝑗, 𝐻𝑗, 𝐶𝑗 are the width, height, and depth of the corresponding activation map. In the proposed model, we use the outputs of the first three pooling layers of a VGG-16 pre-trained on ImageNet [28] to measure the perceptual loss.

Adversarial Loss. Image generation has witnessed rapid progress in recent years due to GANs. GANs can synthesize realistic content and exhibit superiority in restoring high-frequency details and photo-realistic textures. A GAN is implemented as a system of two neural networks contesting with each other in a zero-sum game framework. The generator 𝐺 strives to generate data as realistic as possible to fool the discriminator 𝐷, while the discriminator 𝐷 tries its best to distinguish the generated data from the training data. Ideally, the outputs of the generator 𝐺 become indistinguishable from the training set for the discriminator 𝐷.

Different from previous generative inpainting networks [7,10,11,29], which rely on DCGAN [30] for adversarial supervision, we propose to use a conditioned version of PatchGAN [31]. The PatchGAN maps 𝑁 × 𝑁 image patches into a matrix whose elements signify whether the corresponding image patch is 'real' or 'fake', whereas a regular GAN discriminator maps an image to a single scalar that denotes 'real' or 'fake' for the whole image. The PatchGAN can effectively capture the high-frequency features of images and compensate for the low-frequency representation obtained with the 𝐿1 term. Hence, the introduction of PatchGAN enhances the localized reconstruction effect and serves to realize globally and locally consistent image completion. In the proposed model, the adversarial loss is defined in Eq. (3).

$\mathcal{L}_{adv} = \min_G \max_D\; \mathbb{E}_{z,m,x}\left[\log D(x, z) + \log\left(1 - D(G(z, m), z)\right)\right]$   (3)

The parameters of 𝐷 and 𝐺 are updated alternately.

The overall loss used for training is thus given in Eq. (4),

$\mathcal{L}_{total} = \mathcal{L}_{pixel} + \lambda_{perceptual}\,\mathcal{L}_{perceptual} + \lambda_{adv}\,\mathcal{L}_{adv}$   (4)

where we set the tradeoff parameters $\lambda_{adv} = 0.001$ and $\lambda_{perceptual} = 0.05$ in our experiments.
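A sketch of how these terms could be assembled is shown below. The VGG-16 slicing (outputs of the first three pooling layers), the whole-image perceptual comparison and the binary cross-entropy form of the generator's adversarial term are our interpretation of Eqs. (1)–(4), not released code; `discriminator` is assumed to return PatchGAN logits.

```python
import torch
import torch.nn.functional as F
import torchvision

# Frozen VGG-16 feature extractor up to the third pooling layer
# (ImageNet mean/std normalization of the inputs is omitted for brevity).
_vgg = torchvision.models.vgg16(pretrained=True).features[:17].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)
_POOL_IDS = {4, 9, 16}   # positions of pool1, pool2, pool3 in vgg16.features

def perceptual_loss(pred, target):
    loss, a, b = 0.0, pred, target
    for i, layer in enumerate(_vgg):
        a, b = layer(a), layer(b)
        if i in _POOL_IDS:
            # mean absolute difference ~ the 1/(C_j H_j W_j)-normalized L1 of Eq. (2)
            loss = loss + F.l1_loss(a, b)
    return loss

def generator_loss(pred, x, z, discriminator,
                   lambda_perc=0.05, lambda_adv=0.001):   # weights from Section 3.2
    pixel = F.l1_loss(pred, x)                            # Eq. (1), whole image
    perc = perceptual_loss(pred, x)                       # Eq. (2)
    score = discriminator(pred, z)                        # assumed PatchGAN logits
    adv = F.binary_cross_entropy_with_logits(score, torch.ones_like(score))
    return pixel + lambda_perc * perc + lambda_adv * adv  # Eq. (4)
```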

4. Experiments and results

We carry out extensive experiments to demonstrate the ability of our model to synthesize the missing content of face images. In the following, we first describe the datasets and the experimental settings used for training. Then, we compare our approach with other state-of-the-art methods. Finally, we demonstrate the effectiveness of our model for high-quality face image inpainting both visually and quantitatively.

4.1. Datasets and masks

We use the CelebFaces Attributes Dataset (CelebA) [32] to learn and evaluate our model. CelebA [32] is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has vast diversity, large quantity, and rich annotations. We randomly select 3000 images for testing and use the remaining images for training. We do not use any attributes or labels for either training or testing. To avoid the influence of shoulder and background regions, we crop each image to 160 × 160 with the face at the center of the picture. The cropped face images are then resized to 128 × 128 to fit the design of the network structure.

In the training phase, each training sample contains one training image and one binary mask. To obtain more training samples, we randomly select multiple binary masks for each training image. With large training sets, it is possible to obtain more robust models. In the testing phase, the image and its corresponding mask are fixed to provide consistent validation information.

Instead of filling small holes in the image, we are interested in the more difficult task of irregular hole inpainting, as shown in Fig. 4. There are two different types of masks in the tests: (1) random irregular masks with approximately 25%–40% of the image missing; (2) homocentric square pattern masks with around 7 to 16 pixels missing. Usually, specific masks are created by users or artists through image editing; common modifications include deleting some object or person in the picture, and models are often trained to adapt to these particular masks. In this paper, we choose random masks for two reasons. First, random masks better match natural damage to face photos, for example, the deterioration of traditional photographs over time, or the loss of pixels during digital image processing, transmission and compression, which causes random and irregular image defects. Second, models trained on random masks are expected to generalize to a wider range of applications.

Fig. 4. Examples of irregular masks used in training. They cover about 25%–40% of the whole image.

4.2. Training

In this section, the details of the hyperparameter settings and training procedures are provided. We use a straightforward training procedure that updates both the generator and the discriminator in every batch with the joint loss function stated in Eq. (4). We use the Adam solver [35] to update the model weights during training. We empirically set the optimizer parameters to 𝛽1 = 0.5 and 𝛽2 = 0.9, and the learning rate is initialized to 0.0002. We set the number of training epochs to 50, and an early-stop mechanism is used during training. When a model is trained for the full 50 epochs, it takes 69 h to converge on a single NVIDIA TITAN X GPU. The training convergence curve of the best-performing model is shown in Fig. 6.

Fig. 6. Training curve of our improved generative inpainting framework. The 𝑌-axis is the PSNR score, and the 𝑋-axis is the number of training hours. The model tends to converge after 32 h of training, after which the PSNR score increases slowly. The model converges at 69 h.
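The alternating update described above can be summarized in the following sketch, which assumes the `generator`, `discriminator` and `generator_loss` pieces sketched in Section 3; it is a simplified outline (no early stopping, logging or validation), not the authors' training script.

```python
# Alternating GAN training: one discriminator step and one generator step per
# batch, Adam with beta1 = 0.5, beta2 = 0.9, learning rate 2e-4, 50 epochs.
import torch
import torch.nn.functional as F

def train(generator, discriminator, loader, generator_loss, epochs=50, device="cuda"):
    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.9))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.9))
    for _ in range(epochs):
        for x, m in loader:                        # image and its binary mask
            x, m = x.to(device), m.to(device)
            z = x * m                              # corrupted input
            pred = generator(z, m)

            # Discriminator step: real pair (x, z) vs. fake pair (pred, z).
            d_real = discriminator(x, z)
            d_fake = discriminator(pred.detach(), z)
            loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
                      + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()

            # Generator step: joint loss of Eq. (4).
            loss_g = generator_loss(pred, x, z, discriminator)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```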

4.3. Results

One of the major difficulties in image inpainting is measuring output quality. As mentioned in [11], image inpainting lacks good quantitative evaluation metrics. For reference, we report our evaluation in terms of Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) on the validation sets in Table 1. Note that no post-processing step is performed for our model. In Table 1, we make quantitative comparisons between our method and several baseline methods: CE [7], GL [10], GntIpt [11] and PConv [13], which are all benchmarks in the area of image inpainting. For a fair comparison, we use the same training and test splits and masks as the benchmark methods. From the results, our method achieves the lowest MSE and performs best among all compared methods.

Table 1
Results of PSNR, SSIM and MSE scores on the validation sets of CelebA, Multi-PIE and Places2 for reference.

Dataset          Metric   CE [7]   GL [10]   GntIpt [11]   PConv [13]   Ours
CelebA [32]      PSNR     24.90    28.64     30.23         25.72        31.57
                 SSIM     0.80     0.90      0.93          0.84         0.95
                 MSE      220.80   93.66     72.24         187.39       51.89
Multi-PIE [33]   PSNR     25.02    28.64     30.19         27.75        31.37
                 SSIM     0.80     0.91      0.93          0.85         0.94
                 MSE      214.84   93.54     67.79         121.07       53.74
Places2 [34]     PSNR     22.82    28.34     27.25         24.68        28.50
                 SSIM     0.75     0.91      0.90          0.84         0.92
                 MSE      391.74   120.90    153.66        245.89       120.72

In Fig. 5, we show several irregular-region inpainting examples produced by different methods on the CelebA dataset without any post-processing. Qualitatively, all models struggle to recover content that is not available in the input. We can observe that CE has more visible artifacts in the reconstructed region, and the content generated by GL tends to be blurry and smooth. Judging from Fig. 5, our model creates content that is less blurry and more consistent with the context, and our results look artifact-free and realistic. Hence, our model is favorable for producing fine-detailed, semantically plausible and realistic images.

Fig. 5. Comparison between our model and state-of-the-art methods on irregular hole inpainting on CelebA. (a) Corrupted images, (b) CE [7], (c) GL [10], (d) GntIpt [11], (e) PConv [13], (f) Our method, (g) Ground-truth images.

Additionally, in Fig. 7, we show some examples obtained through image extrapolation, which generates new content beyond the image boundaries. Based only on the provided input, our model can create new content for the 16-pixel band surrounding the input. There are two observed deficiencies of the baselines. First, CE fails to synthesize the context of the missing boundaries. Second, both GL and GntIpt can generate context for the absent boundaries, but with the by-product of blurry results. Our method synthesizes higher-quality and more plausible images than these baseline methods. Although the output does not precisely match the original image, the generated output appears more plausible with the proposed approach.

Fig. 7. Comparison with state-of-the-art methods on extrapolating 96 × 96 images beyond their current boundaries. (a) Corrupted images, (b) CE [7], (c) GL [10], (d) GntIpt [11], (e) PConv [13], (f) Our method, (g) Ground-truth images.

To demonstrate the generalization ability of the proposed method to faces in the wild, we use images from the Multi-PIE [33] dataset to test our model trained solely on CelebA. As shown in Fig. 8, CE fails to synthesize the correct context on Multi-PIE. The colors generated by GL are inconsistent with the original images. GntIpt [11] produces blurry results. In contrast, our model synthesizes images with both finer details and better global shapes for the faces in the Multi-PIE dataset. This experiment demonstrates the strong generalization ability of our model.

Fig. 8. Synthesis results on the Multi-PIE dataset. Note that all methods are trained on CelebA. (a) Corrupted images, (b) CE [7], (c) GL [10], (d) GntIpt [11], (e) PConv [13], (f) Our method, (g) Ground-truth images.

Besides facial datasets, we also validate our method on Places2 [34], a non-facial dataset, to show that our approach generalizes to various types of images. As shown in Fig. 9, the images generated by CE and GL are blurred. Compared with the results of our model, the results of GntIpt are slightly lacking in detail. Our approach also performs best in terms of the PSNR, SSIM and MSE scores.

Fig. 9. Synthesis results on the Places2 dataset. Note that all methods are trained on CelebA. (a) Corrupted images, (b) CE [7], (c) GL [10], (d) GntIpt [11], (e) PConv [13], (f) Our method, (g) Ground-truth images.
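For reproducibility, the reported metrics can be computed with standard tooling. The snippet below is our evaluation sketch based on scikit-image; the exact color space, data range and averaging used by the authors are not specified in the paper and are assumptions here.

```python
# Evaluation sketch: MSE, PSNR and SSIM between a completed image and its
# ground truth, using scikit-image (channel_axis requires scikit-image >= 0.19).
import numpy as np
from skimage.metrics import mean_squared_error, peak_signal_noise_ratio, structural_similarity

def evaluate(pred, gt):
    """pred, gt: uint8 RGB arrays of shape (H, W, 3) with values in [0, 255]."""
    mse = mean_squared_error(gt, pred)
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    return mse, psnr, ssim

# Example on random data (real use: completed image vs. ground-truth image).
gt = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
pred = np.clip(gt + np.random.randint(-5, 6, gt.shape), 0, 255).astype(np.uint8)
print(evaluate(pred, gt))
```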

4.4. Ablation study

To further analyze the contributions of different model components, namely the PatchGAN discriminator and HDC, we conduct an ablation study by comparing three settings of the model, as listed below:

• ours: The proposed model integrating the PatchGAN and HDC, as shown in Fig. 2.
• ours w/o PatchGAN: The variation of our model that replaces the PatchGAN discriminator with the discriminator of DCGAN.
• ours w/o PatchGAN w/o HDC: The variation of our model that replaces the PatchGAN discriminator with the DCGAN discriminator and replaces HDC with standard convolution layers.

For each model, we use PSNR scores to evaluate the performance, as shown in Fig. 11. We then present the quantitative results of the models in Table 2, which summarizes the PSNR, SSIM and MSE scores. As shown in Table 2, both PatchGAN and HDC play significant roles in our model. From Fig. 11, we observe that HDC helps stabilize and accelerate convergence. The evaluation scores in Table 2 show that the PatchGAN noticeably improves the performance of the model.

Table 2
Ablation study of the proposed method on the validation set of CelebA. The PSNR, SSIM and MSE scores show that both PatchGAN and HDC are essential for our model.

Ablation                 PSNR      SSIM     MSE
w/o PatchGAN w/o HDC     31.0679   0.9342   52.48
w/o PatchGAN             30.9636   0.9415   52.17
Ours                     31.5665   0.9470   51.89

Fig. 11. Ablation curves. The 𝑌-axis is the PSNR score, and the 𝑋-axis is the number of training epochs. The model w/o PatchGAN converges faster than the model w/o PatchGAN w/o HDC. The PSNR score of our full model is the best at final convergence.

4.5. User study

Given the multi-modal nature of the inpainting task, an evaluation based solely on comparison to the ground-truth image (which yields SSIM and PSNR) is insufficient. Therefore, we also compare our approach with the most advanced CNN-based facial completion methods in a pilot user study. We randomly select 200 images for each technique and ask subjects to rank them by naturalness from a personal point of view. Ten subjects volunteered to participate in our visual evaluation, and each subject provides feedback for 100–150 tests. In total, we collect 1200 pieces of feedback, summarized in Fig. 10. Fig. 10 shows that viewers favor the images generated by our method significantly more often. It also demonstrates that our method can synthesize photo-realistic, high-quality images in most cases.

Fig. 10. The result of the user study comparing the naturalness of completion among our method, GL, CE, GntIpt and the ground truth. The five colors in each bar from bottom to top represent the percentage of each method being ranked at each position.

5. Conclusions

In this paper, we present an end-to-end deep generative network with HDC and spectral normalization for face image inpainting. The main idea behind our method is to use HDC to deal with the gridding problem; the integration of HDC effectively enlarges the receptive fields of the network. The proposed model also makes use of spectral normalization to improve training stability. We validate our method on the face datasets CelebA and Multi-PIE and perform tests with different forms of missing regions for the challenging face completion task without additional post-processing. Experimental results demonstrate the superior performance of the proposed method on challenging image inpainting examples. For future work, we plan to extend the method to very high-resolution inpainting applications and to test our method on further datasets, such as ImageNet [28].

Acknowledgments

The work is supported by the National Natural Science Foundation of China under Grant No. 61976132 and the Natural Science Foundation of Shanghai under Grant No. 19ZR1419200.

References

[1] A.A. Efros, T.K. Leung, Texture synthesis by non-parametric sampling, in: Proc. IEEE Int. Conf. on Computer Vision, Vol. 2, 1999, pp. 1033–1038.
[2] A.A. Efros, W.T. Freeman, Image quilting for texture synthesis and transfer, in: Proc. SIGGRAPH, 2001, pp. 341–346.
[3] X. Zhu, Y. Qian, X. Zhao, B. Sun, Y. Sun, A deep learning approach to patch-based image inpainting forensics, Signal Process., Image Commun. 67 (2018) 90–99.
[4] M. Wang, B. Yan, K.N. Ngan, An efficient framework for image/video inpainting, Signal Process., Image Commun. 28 (7) (2013) 753–762.
[5] M. Wilczkowiak, G.J. Brostow, B. Tordoff, R. Cipolla, Hole filling through photomontage, in: Proceedings of the British Machine Vision Conference, 2005.
[6] C. Barnes, E. Shechtman, A. Finkelstein, D.B. Goldman, PatchMatch: A randomized correspondence algorithm for structural image editing, ACM Trans. Graph. 28 (3) (2009) 1–11.
[7] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, A.A. Efros, Context encoders: Feature learning by inpainting, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.
[8] R.A. Yeh, C. Chen, T. Yian Lim, A.G. Schwing, M. Hasegawa-Johnson, M.N. Do, Semantic image inpainting with deep generative models, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5485–5493.
[9] Y. Li, S. Liu, J. Yang, M.-H. Yang, Generative face completion, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3911–3919.
[10] S. Iizuka, E. Simo-Serra, H. Ishikawa, Globally and locally consistent image completion, ACM Trans. Graph. (2017) 1–14.
[11] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, T.S. Huang, Generative image inpainting with contextual attention, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5505–5514.
[12] N. Van Noord, E. Postma, Light-weight pixel context encoders for image inpainting, arXiv preprint arXiv:1801.05585, 2018.
[13] G. Liu, F.A. Reda, K.J. Shih, T. Wang, A.J. Tao, B. Catanzaro, Image inpainting for irregular holes using partial convolutions, in: European Conference on Computer Vision, 2018, pp. 89–105.
[14] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, G. Cottrell, Understanding convolution for semantic segmentation, 2017, pp. 1451–1460.
[15] C. Ballester, M. Bertalmio, V. Caselles, G. Sapiro, J. Verdera, Filling-in by joint interpolation of vector fields and gray levels, IEEE Trans. Image Process. 10 (8) (2001) 1200–1211.
[16] M. Bertalmío, G. Sapiro, V. Caselles, C. Ballester, Image inpainting, in: Conference on Computer Graphics and Interactive Techniques, 2000, pp. 417–424.
[17] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: International Conference on Neural Information Processing Systems, 2014, pp. 2672–2680.
[18] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein GAN, arXiv preprint arXiv:1701.07875, 2017.
[19] Z. Chen, S. Nie, T. Wu, C.G. Healey, High resolution face completion with multiple controllable attributes via fully end-to-end progressive generative adversarial networks, arXiv preprint arXiv:1801.07632, 2018.
[20] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[21] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, 2015, pp. 448–456.
[22] F. Yu, V. Koltun, Multi-scale context aggregation by dilated convolutions, in: Proceedings of International Conference on Learning Representations, 2016.
[23] F. Yu, V. Koltun, T.A. Funkhouser, Dilated residual networks, Comput. Vis. Pattern Recognit. (2017) 636–644.
[24] T. Miyato, T. Kataoka, M. Koyama, Y. Yoshida, Spectral normalization for generative adversarial networks, in: Proceedings of International Conference on Learning Representations, 2018.
[25] P. Isola, J. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, Comput. Vis. Pattern Recognit. (2017) 5967–5976.
[26] J. Johnson, A. Alahi, F.F. Li, Perceptual losses for real-time style transfer and super-resolution, in: European Conference on Computer Vision, 2016, pp. 694–711.
[27] T. Wang, M. Liu, J. Zhu, A.J. Tao, J. Kautz, B. Catanzaro, High-resolution image synthesis and semantic manipulation with conditional GANs, Comput. Vis. Pattern Recognit. (2018) 8798–8807.
[28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (3) (2015) 211–252.
[29] H. Li, G. Li, L. Lin, Y. Yu, Context-aware semantic inpainting, IEEE Trans. Syst. Man Cybern. (2018) 1–14.
[30] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, in: Proceedings of International Conference on Learning Representations, 2016.
[31] J.Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, 2017, pp. 2242–2251.
[32] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proceedings of International Conference on Computer Vision, 2015.
[33] R. Gross, I.A. Matthews, J.F. Cohn, T. Kanade, S. Baker, Multi-PIE, Image Vis. Comput. 28 (5) (2010) 807–813.
[34] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, A. Torralba, Places: A 10 million image database for scene recognition, IEEE Trans. Pattern Anal. Mach. Intell. 40 (6) (2018) 1452–1464.
[35] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Proceedings of International Conference on Learning Representations, 2015.
