Information Processing and Management 57 (2020) 102192

Incremental focal loss GANs

Fei Gao a,b,⁎, Jingjie Zhu a, Hanliang Jiang c,⁎⁎, Zhenxing Niu d, Weidong Han e, Jun Yu a

a Key Laboratory of Complex Systems Modelling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China
b State Key Laboratory of Integrated Services Networks, School of Electronic Engineering, Xidian University, Xi'an 710071, China
c Pulmonary and Critical Care Medicine, Sir Run Run Shaw Hospital, School of Medicine, Zhejiang University, Hangzhou 310020, China
d Alibaba Group, Hangzhou 310018, China
e Department of Medical Oncology, Sir Run Run Shaw Hospital, College of Medicine, Zhejiang University, Hangzhou, Zhejiang 310016, China

ARTICLE INFO

Keywords: Generative adversarial networks; Image generation; Image-to-image translation; Face photo-sketch synthesis; Image super-resolution reconstruction

ABSTRACT

Generative Adversarial Networks (GANs) have achieved inspiring performance in both unsupervised image generation and conditional cross-modal image translation. However, how to generate quality images at an affordable cost is still challenging. We argue that it is the vast number of easy examples that disturbs training of GANs, and propose to address this problem by down-weighting the losses assigned to easy examples. Our novel Incremental Focal Loss (IFL) progressively focuses training on hard examples and prevents easy examples from overwhelming the generator and discriminator during training. In addition, we propose an enhanced self-attention (ESA) mechanism to boost the representational capacity of the generator. We apply IFL and ESA to a number of unsupervised and conditional GANs, and conduct experiments on various tasks, including face photo-sketch synthesis, map↔aerial-photo translation, single image super-resolution reconstruction, and image generation on CelebA, LSUN, and CIFAR-10. Results show that IFL boosts learning of GANs over existing loss functions. Besides, both IFL and ESA make GANs produce quality images with realistic details in all these tasks, even when no task adaptation is involved.

1. Introduction

Multi-modality images, such as photos, sketches, and paintings, are widely used in daily life, owing to the rapid development of social media networks and mobile devices. However, most people lack the professional knowledge or technical skill to produce high-quality images. There is thus great demand for algorithms that allow computers to automatically produce realistic images, so as to improve user experience. To this end, image generation and cross-modal image translation (i.e. image-to-image translation) are promising solutions. The former generates images in a target domain, and the latter translates an image from a source domain to a target domain. Recently, generative adversarial networks (GANs) have received a great deal of attention due to their success in image generation and image-to-image translation (Bousmalis, Silberman, Dohan, Erhan, & Krishnan, 2017; Isola, Zhu, Zhou, & Efros, 2017; Kurach, Lucic, Zhai, Michalski, & Gelly, 2018; Lucic, Kurach, Michalski, Gelly, & Bousquet, 2018; Radford, Metz, & Chintala, 2016; Zhang, Ji, Hu, Gao, & Chia-Wen, 2018).

⁎ Corresponding author at: Pulmonary and Critical Care Medicine, Sir Run Run Shaw Hospital, School of Medicine, Zhejiang University, Hangzhou 310020, China. ⁎⁎ Corresponding author. E-mail addresses: [email protected] (F. Gao), [email protected] (H. Jiang).

https://doi.org/10.1016/j.ipm.2019.102192 Received 26 July 2019; Received in revised form 3 December 2019; Accepted 26 December 2019 0306-4573/ © 2020 Elsevier Ltd. All rights reserved.


GANs typically include a generator G and a discriminator D, where G aims at generating a sample from an input random noise (i.e. unsupervised GANs) or conditioned on a source image x (i.e. conditional GANs), while D aims at distinguishing the synthesised sample ŷ = G(x) from a real sample y in the target domain. Training GANs involves solving a minimax problem over the parameters of G and D, which is hard in practice. To address this challenge, a lot of efforts have been made, including novel loss functions (Arjovsky, Chintala, & Bottou, 2017; Goodfellow et al., 2014; Gulrajani, Ahmed, Arjovsky, Dumoulin, & Courville, 2017; Mao et al., 2017), regularization and normalization schemes (Miyato, Kataoka, Koyama, & Yoshida, 2018), architectural modifications (Karras, Aila, Laine, & Lehtinen, 2018a; Karras, Laine, & Aila, 2018b; Zhang, Goodfellow, Metaxas, & Odena, 2019; Zhang et al., 2017), and scalable training of large-scale models (Brock, Donahue, & Simonyan, 2019a; Lucic, Ritter, Zhai, Bachem, & Gelly, 2019). Among existing work, Self-Attention GAN (SAGAN) (Zhang et al., 2019) has shown inspiring performance and leads to impressive results when coupled with scalable training (Brock et al., 2019a; Lucic et al., 2019). However, scalable training requires an excessive computation budget. It remains challenging to boost training of GANs at an affordable cost.

In this paper, we argue that it is the vast number of easy examples that disturbs training of GANs. It would be beneficial to make D progressively emphasize examples that are difficult to distinguish, while G progressively emphasizes examples that are easily distinguished by D. As a result, D progressively increases the difficulty of learning, while G is forced to refine examples that are not yet realistic. To this end, we propose an Incremental Focal Loss (IFL), inspired by focal loss (Lin, Goyal, Girshick, He, & Dollar, 2017). Specifically, we make the scaling factors of easy examples progressively decay to zero. As a result, IFL gradually down-weights the contribution of easy examples and incrementally focuses G and D on hard-to-generate and hard-to-discriminate examples, respectively. To evaluate the effectiveness of IFL, we apply it to various base GANs, and design new generative models that we call FocalGANs. In FocalGANs, we introduce an Enhanced Self-Attention (ESA) mechanism to improve the representational capacity of deep features in the generator, inspired by SAGAN (Zhang et al., 2019). We experiment on a number of image generation and image-to-image translation tasks. Results show that IFL boosts training of GANs over alternative loss functions. Besides, both IFL and ESA make GANs produce quality images with realistic details in all these tasks, even when no task adaptation is involved. In summary, we make the following contributions:

• We propose a novel alternative loss function termed Incremental Focal Loss (IFL), which is demonstrated to markedly boost training of GANs.
• We propose an enhanced self-attention (ESA) mechanism to improve the representational capacity of deep features in the generator.
• Both IFL and ESA can be applied to various unsupervised or conditional GANs, and are promising for improving the quality of generated images.

The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 details the incremental focal loss, and Section 4 presents FocalGANs. Experiments are reported in Section 5. Section 6 concludes this paper.

2. Related work

This work is related to image-to-image translation (Isola et al., 2017), image generation (Zhang et al., 2019), style transfer (Chen, Xu, Yang, Song, & Tao, 2019; Choi et al., 2018; Huang & Belongie, 2017; Zhu, Park, Isola, & Efros, 2017a), and GANs (Goodfellow et al., 2014). In this section, we briefly introduce several classic loss functions and the use of attention in GANs. Please refer to Jing et al. (2017), Lucic et al. (2018) and Kurach et al. (2018) for comprehensive surveys of style transfer and GANs, respectively.

Table 1
Typical loss functions in purely unsupervised GANs. In conditional GANs, change D(y) to D(x, y) and D(ŷ) to D(x, ŷ) accordingly.

CE loss (Goodfellow et al., 2014):
  \mathcal{L}_D = -\mathbb{E}_{y \sim P}[\log(D(y))] - \mathbb{E}_{x \sim Q}[\log(1 - D(\hat{y}))]
  \mathcal{L}_G = -\mathbb{E}_{x \sim Q}[\log(D(\hat{y}))]

LS loss (Mao et al., 2017):
  \mathcal{L}_D = \mathbb{E}_{y \sim P}[(1 - D(y))^2] + \mathbb{E}_{x \sim Q}[(D(\hat{y}))^2]
  \mathcal{L}_G = \mathbb{E}_{x \sim Q}[(1 - D(\hat{y}))^2]

Wasserstein loss (Arjovsky et al., 2017; Gulrajani et al., 2017):
  \mathcal{L}_D = -\mathbb{E}_{y \sim P}[D(y)] + \mathbb{E}_{x \sim Q}[D(\hat{y})]
  \mathcal{L}_G = -\mathbb{E}_{x \sim Q}[D(\hat{y})]

Hinge loss (Zhang et al., 2019):
  \mathcal{L}_D = -\mathbb{E}_{y \sim P}[\min(0, -1 + D(y))] - \mathbb{E}_{x \sim Q}[\min(0, -1 - D(\hat{y}))]
  \mathcal{L}_G = -\mathbb{E}_{x \sim Q}[D(\hat{y})]

IFL (ours):
  \mathcal{L}_D = -\mathbb{E}_{y \sim P}[(1 - D(y))^{\gamma(t)} \log(D(y))] - \mathbb{E}_{x \sim Q}[(D(\hat{y}))^{\gamma(t)} \log(1 - D(\hat{y}))]
  \mathcal{L}_G = -\mathbb{E}_{x \sim Q}[(1 - D(\hat{y}))^{\gamma(t)} \log(D(\hat{y}))],  with  \gamma(t) = \min(\lambda t + \gamma_0, \gamma_{\max})
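For concreteness, the sketch below shows how two of the losses in Table 1 (the non-saturating CE loss and the Hinge loss) can be computed from discriminator logits. The paper does not specify an implementation framework; this is a PyTorch-style illustration with our own function names, not the authors' code.

```python
# Illustrative sketch of two adversarial losses from Table 1 (assumed PyTorch).
import torch
import torch.nn.functional as F

def ce_d_loss(d_real_logits, d_fake_logits):
    """Discriminator CE loss: -E[log D(y)] - E[log(1 - D(y_hat))]."""
    real = F.binary_cross_entropy_with_logits(d_real_logits, torch.ones_like(d_real_logits))
    fake = F.binary_cross_entropy_with_logits(d_fake_logits, torch.zeros_like(d_fake_logits))
    return real + fake

def ce_g_loss(d_fake_logits):
    """Non-saturating generator CE loss: -E[log D(y_hat)]."""
    return F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))

def hinge_d_loss(d_real_logits, d_fake_logits):
    """Hinge discriminator loss as used in SAGAN-style models."""
    return F.relu(1.0 - d_real_logits).mean() + F.relu(1.0 + d_fake_logits).mean()

def hinge_g_loss(d_fake_logits):
    """Hinge generator loss: -E[D(y_hat)]."""
    return -d_fake_logits.mean()
```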


2.1. Loss functions in GANs

In existing GANs, the most widely used loss functions are listed in Table 1. Here, x is a sample from the source data distribution Q, and y is a sample from the target data distribution P. The vanilla GANs (Goodfellow et al., 2014) suggest two loss functions: the minimax GAN and the non-saturating GAN. In the former, the discriminator minimizes the cross entropy (CE) loss for the binary classification task. In the latter, the generator maximizes the probability of generated samples being real. In the minimax GAN, the CE loss of the generator is:

\mathcal{L}_G = -\mathbb{E}_{x \sim Q}[\log(1 - D(\hat{y}))],    (1)

and

G^* = \arg\max_G \mathcal{L}_G.    (2)

In the non-saturating GAN, the CE loss of the generator becomes:

\mathcal{L}_G = -\mathbb{E}_{x \sim Q}[\log(D(\hat{y}))],    (3)

and

G^* = \arg\min_G \mathcal{L}_G.    (4)

In this way, \mathcal{L}_G = 0 when all the generated samples are judged as real by the discriminator. In this work, we consider the non-saturating loss, as it is sufficiently stable and outperforms the minimax loss across data sets and settings (Kurach et al., 2018). Later, Arjovsky et al. (2017) propose to consider the Wasserstein divergence instead of the original Jensen–Shannon (JS) divergence between P and Q. The resulting framework is well known as Wasserstein GAN (WGAN). Recently, Mao et al. (2017) propose the least-squares (LS) loss, which corresponds to minimizing the Pearson χ² divergence between P and Q. Finally, in SAGAN (Zhang et al., 2019), BigGAN (Brock et al., 2019a), and S3GAN (Lucic et al., 2019), the Hinge loss is used and might partially contribute to the high realism of the generated images. However, a large-scale study found no evidence that any of these losses consistently outperforms the others (Lucic et al., 2018). Training GANs is still challenging.

In this paper, we argue that it is the loss incurred by numerous easy examples that hinders training of GANs. For a discriminator, easy examples are those correctly classified by D, while for a generator, the generated examples that fool the discriminator are the easy ones. Each easy example incurs a loss of non-trivial magnitude. When summed over a large number of easy examples, these small loss values can overwhelm the hard examples, and prevent performance improvement or even damage the performance. Previously introduced losses, however, treat all examples equally. Inspired by Focal Loss (FL), we propose to down-weight the losses incurred by easy examples, and thus in turn increase the importance of hard examples in the learning procedure. Details of IFL are introduced in Sections 3 and 4.

2.2. Attention in GANs

Recently, attention mechanisms have been introduced into GANs, e.g. for image generation (Zhang et al., 2019), image-to-image translation (Chen, Xu, Yang, & Tao, 2018b), and text-to-image generation (Xu et al., 2018). Learning attention encourages the generation of more realistic images compared to classic vanilla GANs. Such attention mechanisms fall mainly into two classes: spatial attention and self-attention. Our work concerns self-attention, so we briefly review related works here.

Self-attention (SA) is derived from the Transformer (Vaswani et al., 2017) and is used to model global, long-range dependencies within internal representations of images. The Transformer was originally proposed for natural language processing problems, and leads to extraordinary performance across various tasks (Vaswani et al., 2017). Recently, Zhang et al. (2019) introduced self-attention into unconditional GANs for efficiently modelling global, long-range dependencies within internal representations of images. Similarly, PixelSNAIL (Chen, Mishra, Rohaninejad, & Abbeel, 2018a) combines causal convolutions with self-attention. The resulting models achieve distinct improvements in generated image quality (Zhang et al., 2019) and lead to remarkable works such as BigGAN (Brock, Donahue, & Simonyan, 2019b) and StyleGAN (Karras, Laine, & Aila, 2019). In the image-to-image translation community, Parmar et al. (2018) propose a local self-attention mechanism. Most recently, Child et al. propose a sparse transformer (Child, Gray, Radford, & Sutskever, 2019) to reduce the computational complexity of the Transformer when applied to long sequences or images. Both of these works treat an image as sequential data and use a restricted area to the upper and left of the target location as the neighbourhood.

Inspired by the successes of both self-attention and the Transformer in GANs, we introduce SA into our FocalGANs, and enhance it by encouraging the layer preceding SA to produce a primal output. We expect such enhancement to further improve the representational capacity of deep features in GANs and boost the quality of generated images.

3. Incremental focal loss

We introduce incremental focal loss (IFL) starting from cross entropy (CE) loss and focal loss (FL).


Fig. 1. Illustration of incremental focal loss (IFL). Here, γ(t) = λt + γ0, with γ0 = 0, λ = 0.03, and t = 0, 1, …, 200. The focusing factor γ(t) changes from 0 to 6.

3.1. CE loss

Cross entropy (CE) loss is the most used objective for binary classification. CE loss can be expressed as:

CE(p, y) = -\log(p) if y = 1;  -\log(1 - p) otherwise,    (5)

where y ∈ ±1 specifies the ground-truth class and p ∈ [0, 1] is the model's estimated probability for the class with label y = 1. Following Lin et al. (2017), we define q:

q = p if y = 1;  1 - p otherwise,    (6)

and rewrite CE(p, y) = CE(q) = -\log(q). CE loss can be seen as the left mesh in Fig. 1. During training, the CE loss related to each q is fixed. In other words, all examples share the same importance in learning.

In the initial unsupervised GANs (Goodfellow et al., 2014), CE loss is used as the adversarial loss function. As shown in Table 1, the CE loss for the discriminator is

\mathcal{L}_D = -\mathbb{E}_{y \sim P}[\log(D(y))] - \mathbb{E}_{x \sim Q}[\log(1 - D(\hat{y}))],    (7)

and the CE loss for the generator is

\mathcal{L}_G = -\mathbb{E}_{x \sim Q}[\log(D(\hat{y}))].    (8)

In the training process, the generator and discriminator are learned alternately by:

G^* = \arg\min_G \mathcal{L}_G,  \quad  D^* = \arg\min_D \mathcal{L}_D.    (9)

3.2. Focal loss

In focal loss (FL) (Lin et al., 2017), a modulating factor (1 - q)^γ is multiplied with the CE loss. Focal loss is expressed as:

FL(q) = -(1 - q)^{\gamma} \log(q),    (10)

where γ ≥ 0 is a tunable focusing parameter. In focal loss, the example corresponding to q is consistently down-weighted by (1 - q)^γ. In the training process of GANs, D does not work at all in the beginning, so it would be too assertive to drastically re-weight examples based on the predictions of D (especially for the optimization of G). As training goes on, D becomes more and more powerful, and we can thereby increase the impact of the modulating factors on learning. We therefore propose to incrementally change the value of γ during training. The resulting variant of focal loss is introduced below.

3.3. Incremental focal loss definition

Inspired by focal loss, we propose the Incremental Focal Loss (IFL). Differently from FL, the focusing factor γ(t) in our IFL increases during the training procedure. IFL is expressed as:

IFL(q) = -(1 - q)^{\gamma(t)} \log(q),    (11)

where t denotes the index of iterations or epochs, and γ(t) increases as t becomes larger. For simplicity, we can set γ(t) as a linear function:

\gamma(t) = \lambda t + \gamma_0,    (12)

where γ0 ≥ 0 is the initial focusing factor and λ ≥ 0 the incremental rate. In addition, γ(t) should not be over-large, because a large γ(t) makes the losses incurred by examples with q ≈ 0.5 very small, while q ≈ 0.5 indicates that these examples confuse the model and are thus important for performance improvement. We further modify the computation of γ(t) as:

\gamma(t) = \min(\lambda t + \gamma_0, \gamma_{\max}).    (13)

Here, γmax (≥ γ0) is the upper limit of the focusing factor. In practice, we use a staged variant of IFL, i.e.

\gamma(t) = \min(\lambda \lfloor t / T_f \rfloor + \gamma_0, \gamma_{\max}).    (14)

Here, Tf > 0 is a constant. This means that the focusing parameter is increased by λ after every Tf iterations (or epochs). We adopt this form in our experiments as it yields slightly improved accuracy over the non-staged form. Besides, only a small batch of examples is typically used in each training iteration of GANs, so the model performance may not improve consistently; it is therefore preferable to change γ(t) only after a period of training. Fig. 1 shows the modulating factors, (1 - q)^{γ(t)}, and IFL for different q and γ(t). In the beginning, i.e. when t = 0, we have γ(0) = 0. The model is randomly initialized and thus does not work at all; all examples have the same significance and contribute equally to updating the model. As t increases, γ(t) increases and the losses related to easy examples decrease accordingly. This implies that easy examples contribute less and less to learning; in other words, hard examples gradually dominate the learning process. As shown in Fig. 1, examples with low q are consistently assigned high modulating factors and produce high losses. In contrast, both the modulating factors and the losses of easy examples gradually decrease during training. IFL therefore progressively makes the learner focus on hard examples. The major properties of IFL are analysed below:

• If γ0 = 0 and λ = 0, γ(t) is fixed to 0. The modulating factor (1 - q)^{γ(t)} is consistently 1. In this case, IFL is equal to the CE loss, i.e. -log q;
• If γ0 > 0 and λ = 0, γ(t) is consistently γ0, while the modulating factor is fixed to (1 - q)^{γ0}. In this case, IFL is equal to FL, i.e. -(1 - q)^{γ0} log q;
• If λ > 0, γ(t) increases with t, while the modulating factor decreases inversely. Besides, the modulating factor related to a larger q decreases more dramatically than that related to a lower q. Assume that γ0 = 0 and λ = 0.1. When t = 20, γ(t) increases from 0 to 2; the modulating factor related to q = 0.9 decreases from 1 to 0.01, whereas the modulating factor related to q = 0.1 only decreases from 1 to 0.81. In other words, examples with lower q progressively dominate the learning process; and
• If γmax is very large or there is no upper limit on γ(t), the losses incurred by examples with a probability around 0.5 become very small. For instance, if γ(t) = 6, the modulating factor related to q = 0.5 becomes 0.016. However, q = 0.5 means that the model is confused by the corresponding example, so the related modulating factor should not be too small.
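To make the definition in Eqs. (11)–(14) concrete, the following is a minimal sketch of the staged focusing-factor schedule and the per-example IFL value. The PyTorch framework, the clamping for numerical stability, and all names are our own assumptions, not the authors' implementation.

```python
# Minimal sketch of IFL (Eq. (11)) with the staged schedule of Eq. (14).
import torch

def gamma_t(t, lam=0.1, gamma0=0.0, gamma_max=4.0, T_f=100):
    """gamma(t) = min(lam * floor(t / T_f) + gamma0, gamma_max)."""
    return min(lam * (t // T_f) + gamma0, gamma_max)

def incremental_focal_loss(q, gamma):
    """IFL(q) = -(1 - q)^gamma * log(q), with q the probability of the
    ground-truth class as in Eq. (6); q is clamped away from 0 and 1."""
    q = q.clamp(1e-7, 1.0 - 1e-7)
    return -((1.0 - q) ** gamma) * torch.log(q)

# Example: easy examples (q close to 1) are progressively down-weighted.
q = torch.tensor([0.1, 0.5, 0.9])
for t in [0, 100, 400]:
    print(t, incremental_focal_loss(q, gamma_t(t)).tolist())
```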


Fig. 2. Illustration of proposed FocalGANs.

4. FocalGANs

In this work, we apply IFL to both unsupervised GANs and conditional GANs, including SAGAN (Zhang et al., 2019), SRGAN (Ledig et al., 2017) and Pix2Pix (Isola et al., 2017). Besides, we utilize an enhanced self-attention (ESA) mechanism to improve the representational capacity of deep features in the generator. Pipelines of unsupervised and conditional FocalGANs are illustrated in Fig. 2. Details are presented below.

4.1. Using IFL as the adversarial loss

In unsupervised or conditional GANs, D(·) ∈ [0, 1] is the estimated probability that an example is REAL. Following Eq. (6), we have

q = D(·) for REAL examples;  1 - D(·) for FAKE examples.    (15)

4.1.1. In unsupervised GANs

We first introduce the use of IFL in unsupervised GANs. The adversarial loss for the discriminator becomes:

\mathcal{L}^{(u)}_{IFL,D}(G, D) = -\mathbb{E}_{y \sim P}[(1 - D(y))^{\gamma(t)} \log(D(y))] - \mathbb{E}_{x \sim Q}[(D(\hat{y}))^{\gamma(t)} \log(1 - D(\hat{y}))].    (16)

Following the non-saturating formulation, the adversarial loss for the generator is:

\mathcal{L}^{(u)}_{IFL,G}(G, D) = -\mathbb{E}_{x \sim Q}[(1 - D(\hat{y}))^{\gamma(t)} \log(D(\hat{y}))].    (17)
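A hedged sketch of Eqs. (16) and (17) follows, assuming a discriminator that outputs probabilities in [0, 1]; the function names and the PyTorch framework are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of the unsupervised IFL adversarial losses (Eqs. (16)-(17)).
import torch

def ifl_d_loss(d_real, d_fake, gamma):
    """L_D = -E[(1 - D(y))^g log D(y)] - E[D(y_hat)^g log(1 - D(y_hat))]."""
    d_real = d_real.clamp(1e-7, 1 - 1e-7)
    d_fake = d_fake.clamp(1e-7, 1 - 1e-7)
    loss_real = -((1 - d_real) ** gamma) * torch.log(d_real)
    loss_fake = -(d_fake ** gamma) * torch.log(1 - d_fake)
    return (loss_real + loss_fake).mean()

def ifl_g_loss(d_fake, gamma):
    """L_G = -E[(1 - D(y_hat))^g log D(y_hat)] (non-saturating form)."""
    d_fake = d_fake.clamp(1e-7, 1 - 1e-7)
    return (-((1 - d_fake) ** gamma) * torch.log(d_fake)).mean()
```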

4.1.2. In conditional GANs

In conditional GANs, the adversarial loss for the discriminator becomes:

\mathcal{L}^{(c)}_{IFL,D}(G, D) = -\mathbb{E}_{y \sim P}[(1 - D(x, y))^{\gamma(t)} \log(D(x, y))] - \mathbb{E}_{x \sim Q}[(D(x, \hat{y}))^{\gamma(t)} \log(1 - D(x, \hat{y}))].    (18)

The adversarial loss for the generator is:

\mathcal{L}^{(c)}_{IFL,G}(G, D) = -\mathbb{E}_{x \sim Q}[(1 - D(x, \hat{y}))^{\gamma(t)} \log(D(x, \hat{y}))].    (19)
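The conditional losses in Eqs. (18) and (19) reuse the same per-example IFL terms; the only change is that the discriminator scores the pair (x, y) rather than y alone. The snippet below illustrates one common way of doing this, channel-wise concatenation as in Pix2Pix-style discriminators; whether the authors use exactly this mechanism is an assumption on our part.

```python
# Conditional scoring by concatenating the condition x with the image (sketch).
import torch

def conditional_scores(D, x, y, y_hat):
    d_real = D(torch.cat([x, y], dim=1))      # D(x, y)
    d_fake = D(torch.cat([x, y_hat], dim=1))  # D(x, y_hat)
    return d_real, d_fake
```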

4.1.3. Analysis

We use the same focusing factor γ(t) for both G and D, because neither player can improve its cost unilaterally in a Nash equilibrium. The modulating factors down-weight the losses incurred by easy examples and thus in turn increase the importance of hard examples in the learning procedure. Specifically, during the optimization of D, we decrease the importance of correctly classified examples.


Fig. 3. Pipeline of the self-attention mechanism (following (Zhang et al., 2019)).

In this way, D makes the learning task progressively more difficult for G. In contrast, during the optimization of G, we increase the loss contribution from the generated examples that are predicted as fake by D. IFL is capable of boosting training of GANs for two reasons. On one hand, easy examples incur trivial modulated losses and have little impact on GAN performance. On the other hand, IFL incrementally enforces G to generate more realistic examples.

4.2. Enhanced self-attention

In this work we introduce the self-attention (SA) mechanism into the generator network, and enhance it by encouraging the layer preceding SA to produce a primal output. The pipeline of SA is shown in Fig. 3.

Self-attention. Assume that the image features from the previous layer are x ∈ R^{C×N}. Here, N = W × H; W and H are the width and height of the feature maps, and C is the number of channels. x is first transformed into three feature spaces f, g, h by

f_i = W_f x_i,  g_i = W_g x_i,  h_i = W_h x_i,    (20)

where W_f ∈ R^{\bar{C}×C}, W_g ∈ R^{\bar{C}×C}, and W_h ∈ R^{C×C} are learned weight matrices, implemented as 1 × 1 convolutions. Following Zhang et al. (2019), we use \bar{C} = C/8 in all our experiments. The SA mechanism calculates the response at a position as a weighted sum of the features at all positions. The output is calculated by

o_j = \sum_{i=1}^{N} \beta_{j,i} h(x_i),  where  \beta_{j,i} = \frac{\exp(f_i^{T} g_j)}{\sum_{k=1}^{N} \exp(f_k^{T} g_j)}.    (21)

Here, β_{j,i} indicates the extent to which the i-th location contributes to inferring the j-th location. Following Zhang et al. (2019), the final output of the SA mechanism is given by y_i = λ o_i + x_i, where λ is initialized as 0 and automatically updated during the training process. In existing work, y = {y_i}_{i=1}^{N} ∈ R^{C×N} is reshaped to C × W × H and fed into subsequent layers to produce an image ŷ = φ(y). Here φ(·) denotes the subsequent layers and typically comprises a Transposed Convolutional (TransConv) layer followed by a Tanh function.

Enhancement. In existing work, there is no direct supervision over x. Intuitively, we could further enhance the representational capacity of y by making x capable of producing an output. We thus add an additional TransConv layer after x, as shown in Fig. 2. The corresponding outputs related to y and x are denoted as ŷ0 and ŷ1, respectively, and ŷ1 is our final result. For each output, we adopt a corresponding discriminator. In this way, x is supervised to represent the global structures of an image, followed by implicit refinement in the SA mechanism to generate more details.
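The sketch below illustrates the ESA idea under stated assumptions: a SAGAN-style self-attention block plus an additional TransConv–Tanh head on the pre-attention features x, so that both the primal features and the attention-refined features produce an image, each supervised by its own discriminator. Layer sizes, the output resolution, and all names are our own illustrative choices, not the authors' exact architecture.

```python
# PyTorch-style sketch of an enhanced self-attention (ESA) block.
import torch
import torch.nn as nn

class EnhancedSelfAttention(nn.Module):
    def __init__(self, channels, out_channels=3):
        super().__init__()
        c_bar = max(channels // 8, 1)                           # C_bar = C / 8
        self.f = nn.Conv2d(channels, c_bar, kernel_size=1)      # W_f
        self.g = nn.Conv2d(channels, c_bar, kernel_size=1)      # W_g
        self.h = nn.Conv2d(channels, channels, kernel_size=1)   # W_h
        self.lam = nn.Parameter(torch.zeros(1))                 # lambda, initialised to 0
        # two output heads: one on the pre-attention features, one after attention
        self.to_img_pre = nn.Sequential(
            nn.ConvTranspose2d(channels, out_channels, 4, stride=2, padding=1), nn.Tanh())
        self.to_img_post = nn.Sequential(
            nn.ConvTranspose2d(channels, out_channels, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, x):
        b, c, w, h = x.shape
        n = w * h
        f = self.f(x).view(b, -1, n)                            # B x C_bar x N
        g = self.g(x).view(b, -1, n)                            # B x C_bar x N
        hh = self.h(x).view(b, c, n)                            # B x C x N
        beta = torch.softmax(torch.bmm(f.transpose(1, 2), g), dim=1)  # beta[i, j] over i
        o = torch.bmm(hh, beta).view(b, c, w, h)                # weighted sum of all positions
        y = self.lam * o + x                                    # residual attention output
        # outputs from x and from y; in the paper's notation these are
        # the primal and the attention-refined images, respectively
        return self.to_img_pre(x), self.to_img_post(y)
```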

4.3. Joint objective

Both IFL and ESA can be applied to unsupervised and conditional GANs. We refer to the resulting models as FocalGANs. In unsupervised FocalGANs, the generator and the discriminators are trained in an alternating fashion by minimizing IFL, i.e.

D_i^* = \arg\min_{D_i} \mathcal{L}^{(u)}_{IFL,D}(G, D_i),  with  i = 0, 1,
G^* = \arg\min_{G} \sum_{i=0}^{1} \mathcal{L}^{(u)}_{IFL,G}(G, D_i).    (22)
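A minimal sketch of the alternating optimization in Eq. (22) is given below, reusing the ifl_d_loss and ifl_g_loss helpers sketched earlier. The generator is assumed to return both outputs (ŷ0, ŷ1); the optimizers and data handling are placeholders, not the authors' training script.

```python
# One alternating training step for an unsupervised FocalGAN (sketch).
def train_step(G, Ds, opt_G, opt_Ds, z, y_real, gamma):
    # one gradient step on each discriminator D_i
    y_fakes = G(z)                                  # (y_hat_0, y_hat_1)
    for D, opt_D, y_fake in zip(Ds, opt_Ds, y_fakes):
        opt_D.zero_grad()
        d_loss = ifl_d_loss(D(y_real), D(y_fake.detach()), gamma)
        d_loss.backward()
        opt_D.step()
    # one gradient step on the generator
    opt_G.zero_grad()
    y_fakes = G(z)
    g_loss = sum(ifl_g_loss(D(y_fake), gamma) for D, y_fake in zip(Ds, y_fakes))
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()             # last D's loss and G's loss
```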

Information Processing and Management 57 (2020) 102192

F. Gao, et al.

Table 2
Experimental settings.

Dataset   | # Training | # Testing | Image size | Epochs | Batch size | β1  | β2    | lrG    | lrD    | λ in IFL
CUFS      | 268        | 338       | 256        | 1000   | 1          | 0.5 | 0.9   | 0.0002 | 0.0002 | 0.1
MAP       | 1096       | 1098      | 512        | 600    | 1          | 0.5 | 0.9   | 0.0002 | 0.0002 | 0.03
VOC2012   | 16.7K      | –         | 88         | 200    | 64         | 0.9 | 0.999 | 0.0004 | 0.0004 | 0.15
CelebA    | 200K       | 50K       | 64         | 100    | 64         | 0   | 0.9   | 0.0001 | 0.0004 | 0.2
LSUN      | 126K       | 30K       | 64         | 100    | 64         | 0   | 0.9   | 0.0001 | 0.0004 | 0.4
CIFAR-10  | 50K        | 10K       | 32         | 100    | 64         | 0   | 0.9   | 0.0001 | 0.0004 | 0.2
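As an example of the settings in Table 2, the CelebA configuration corresponds to the following optimizer setup (Adam with β1 = 0, β2 = 0.9, and a larger learning rate for the discriminator than for the generator). G and D are assumed to be existing modules; this is an illustrative sketch, not the authors' code.

```python
# Optimizer setup implied by the CelebA column of Table 2 (sketch).
import torch

opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.9))
opt_D = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.0, 0.9))
```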

In conditional FocalGANs, we additionally adopt a pixel-wise reconstruction loss for training our model. The reconstruction loss is denoted by

\mathcal{L}_{rec} = \mathbb{E}_{(x,y) \sim (X,Y)}\big[ \, \|\hat{y}_0 - y\|_1 + \|\hat{y}_1 - y\|_1 \, \big].    (23)

Similarly, the generator and the discriminators are trained alternately by

D_i^* = \arg\min_{D_i} \mathcal{L}^{(c)}_{IFL,D}(G, D_i),  with  i = 0, 1,
G^* = \arg\min_{G} \sum_{i=0}^{1} \mathcal{L}^{(c)}_{IFL,G}(G, D_i) + \alpha \mathcal{L}_{rec},    (24)

where α is a weighting factor.

4.4. Implementation

In this work, we use Pix2Pix (Isola et al., 2017), SRGAN (Ledig et al., 2017) and SAGAN (Zhang et al., 2019) as the baseline networks for different tasks. As illustrated in Fig. 2, we use IFL as the adversarial loss and employ the ESA mechanism before the last TC-IN-Tanh layer of the generator. We refer to a GAN equipped with both IFL and ESA as a FocalGAN, and denote the corresponding model variants as FocalGANPix2Pix, FocalGANSRGAN, and FocalGANSAGAN accordingly. Please refer to the original works for detailed network architectures. To optimize our networks, we alternate between one gradient descent step on D and one step on G. We use the Adam optimizer and instance normalization, and train our models on a single GTX 1080Ti GPU. Table 2 shows the detailed experimental settings, including the number of training images (# Training), number of testing images (# Testing), image size, epochs, batch size, β1 and β2 in the Adam optimizer, the learning rate of the generator lrG, the learning rate of the discriminators lrD, and the default λ in IFL. We set Tf = 100 (epochs) for both FocalGANPix2Pix and FocalGANSRGAN, and Tf = 10 (epochs) for FocalGANSAGAN, with γ0 = 0 and γmax = 4, unless otherwise specified.

5. Experiments

We demonstrate the effectiveness of our proposed method on three conditional image-to-image translation tasks and three unsupervised image generation tasks.

Table 3
Analysis of IFL. All the results are for face sketch synthesis on the CUFS dataset.

(a) Varying γ0, with λ = 0 and γmax = ∞:
γ0:   0.5    1      2      3      4
γ(t): 0.5    1      2      3      4
FID:  32.50  32.15  32.95  33.75  34.35

(b) Varying λ, with γ0 = 0, γmax = ∞, and Tf = 100:
λ:    0.05   0.1    0.2    0.3    0.4
γ(t): 0→0.5  0→1    0→2    0→3    0→4
FID:  32.30  31.60  32.91  32.10  33.08

(c) Varying γmax, with γ0 = 0, λ = 0.4, and Tf = 100:
γmax: 0.8    2      3.2    ∞
γ(t): 0→0.8  0→2    0→3.2  0→4
FID:  33.43  32.22  32.83  33.08

(d) Varying (λ, Tf), with γ0 = 0, γmax = ∞, and γ(t): 0→1:
λ:    0.001  0.01   0.1    0.2
Tf:   1      10     100    200
FID:  32.43  32.07  31.60  32.58


Fig. 4. FID curves on the training dataset while using different loss functions, on the CUFS dataset. Here, γ0 = 0, γmax = ∞, and Tf = 100.

Table 4
Ablation study: performance of different model variants.

Model variant                              | FID   | PSNR  | FSIM
Pix2Pix w/ CE                              | 33.50 | 30.80 | 73.16
Pix2Pix w/ IFL                             | 31.60 | 31.06 | 73.03
Pix2Pix w/ CE & SA                         | 31.06 | 31.11 | 73.12
Pix2Pix w/ CE & ESA                        | 30.73 | 30.86 | 73.34
Pix2Pix w/ IFL & ESA (FocalGANPix2Pix)     | 29.28 | 30.99 | 73.35

Fig. 5. Illustration of the effect of ESA.

Detailed experiments are presented below.

5.1. Experimental settings

Detailed experimental settings are summarized in Table 2. We briefly introduce the datasets and criteria below.


Fig. 6. Examples of synthesized face sketches on the CUFS dataset. (a) Photo, (b) Pix2Pix w/ CE, (c) Pix2Pix w/ IFL, (d) FocalGANPix2Pix, and (e) ground-truth (i.e. sketch drawn by an artist).


Fig. 7. Examples of synthesized face photos on the CUFS dataset. (a) Sketch, (b) Pix2Pix w/ CE, (c) Pix2Pix w/ IFL, (d) FocalGANPix2Pix, and (e) ground-truth photo.


Table 5
Performance on face photo-sketch synthesis on the CUFS dataset.

                     | photo↦sketch            | sketch↦photo
                     | FID    PSNR   FSIM      | FID    PSNR   FSIM
Pix2Pix w/ CE        | 33.50  30.80  73.16     | 46.89  30.13  76.89
Pix2Pix w/ LS        | 33.60  30.83  73.26     | 45.95  29.81  76.80
Pix2Pix w/ Hinge     | 31.13  30.30  73.24     | 43.51  30.12  77.06
Pix2Pix w/ IFL       | 31.60  31.06  73.03     | 39.66  30.09  77.42
FocalGANPix2Pix      | 29.28  30.99  73.35     | 36.86  30.34  77.34

Table 6
Performance on face sketch synthesis on the CUFS dataset, using BicycleGAN as the baseline.

                      | FID    PSNR   FSIM
BicycleGAN w/ CE      | 28.50  29.35  60.71
BicycleGAN w/ LS      | 29.29  29.67  60.43
BicycleGAN w/ Hinge   | 26.99  29.19  60.81
BicycleGAN w/ IFL     | 27.04  29.42  60.35
FocalGANBicycleGAN    | 25.48  29.73  61.03

Table 7
Performance on map↔aerial photo translation on the MAP dataset.

                     | map↦aerial photo        | aerial photo↦map
                     | FID    SSIM   FSIM      | FID    SSIM   FSIM
Pix2Pix w/ CE        | 46.24  22.37  67.52     | 35.48  74.91  78.79
Pix2Pix w/ LS        | 48.79  22.97  67.75     | 32.28  76.59  80.15
Pix2Pix w/ Hinge     | 60.86  21.64  67.69     | 50.20  73.18  78.70
Pix2Pix w/ IFL       | 43.32  22.31  67.80     | 31.58  77.24  80.37
FocalGANPix2Pix      | 46.16  23.67  68.36     | 31.09  77.41  80.51

5.1.1. Datasets

To verify the effectiveness of conditional FocalGANs, we test our method on the following three tasks:

• Face Photo-Sketch Synthesis. Face photo-sketch synthesis aims at generating a face sketch/photo conditioned on a given face photo/sketch (Wang, Tao, Gao, Li, & Li, 2014; Zhang, Wang, Li, & Gao, 2019a; 2019b). We conduct experiments on the CUFS dataset, which consists of 606 face photo-sketch pairs. Following Wang, Gao, and Li (2018), we select 268 face photo-sketch pairs for training, and all the remaining 338 pairs for testing. All these face images are geometrically aligned using three points: the two eye centers and the mouth center. The aligned images are cropped to the size of 250 × 200.
• Aerial photo ↔ Map. This task aims at translating between aerial photos and map images scraped from Google Maps. Here we use the MAP dataset (Radim Tyleček, 2013), with 1096 training pairs and 1098 testing pairs.
• Single Image Super-Resolution Reconstruction (SISR). SISR aims at producing a realistic high-resolution image based on a single low-resolution image. Following Ledig et al. (2017), we train our method on the VOC2012 dataset, which consists of 16,700 training images (Everingham et al., 2015), and test on three standard benchmark datasets, including BSD100 (Martin, Fowlkes, Tal, & Malik, 2001), Set5 (Bruna, Sprechmann, & Lecun, 2016), and Set14 (Zeyde, Elad, & Protter, 2010), which have 100, 5 and 14 images, respectively. All experiments are performed with a scale factor of 4× between the low- and high-resolution images.

To explore the efficiency of unsupervised FocalGANs, we conduct image generation experiments on the following three datasets, respectively.

• CelebA. We train our model on the CelebA dataset, which consists of more than 200K face images (Liu, Luo, Wang, & Tang, 2015).
• LSUN. We train our model on the LSUN church-outdoor dataset, which contains about 126K labeled images of churches (Yu, Zhang, Song, Seff, & Xiao, 2015).
• CIFAR-10. We also train our model on the CIFAR-10 dataset, which includes 6000 images for each of 10 classes (Krizhevsky, 2009); 50K of its images are used for training.

5.1.2. Criteria

In this work, we adopt the following indices as the performance criteria:


Fig. 8. Image-to-image translation examples of aerial photo↦map on the MAP dataset with annotated patch. (a) Aerial photo, (b) Pix2Pix w/ CE, (c) Pix2Pix w/ IFL, (d) FocalGANPix2Pix, and (e) ground-truth map.

• Fréchet Inception Distance (FID): We choose FID as the main criterion for evaluating the realism and variation of generated images (Heusel, Ramsauer, Unterthiner, Nessler, & Hochreiter, 2017; Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2016). Lower FID values mean a closer distance between the synthetic data distribution and the real data distribution; and
• Image quality assessment metrics: We additionally adopt the Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Measure (SSIM)


Fig. 9. Image-to-image translation examples of map↦aerial photo on the MAP dataset with annotated patch. (a) Map, (b) Pix2Pix w/ CE, (c) Pix2Pix w/ IFL, (d) FocalGANPix2Pix, and (e) ground-truth aerial photo.

(Wang, Bovik, Sheikh, & Simoncelli, 2004), and the Feature Similarity Index Metric (FSIM) (Zhang, Zhang, Mou, & Zhang, 2011) between a conditionally generated image and the corresponding ground-truth, for evaluating fidelity. Greater PSNR, SSIM, and FSIM values indicate better fidelity.
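For reference, PSNR, the simplest of the fidelity criteria above, can be computed as follows for images scaled to [0, 1]; SSIM and FSIM require their respective reference implementations. This sketch and its names are ours, not part of the paper.

```python
# Simple PSNR computation between two equally sized images (sketch).
import numpy as np

def psnr(generated, reference, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB."""
    mse = np.mean((generated.astype(np.float64) - reference.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)
```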


Table 8
Performance on single-image super-resolution reconstruction (4× upscaling) on the BSD100, Set5, and Set14 datasets. Overall indicates the image-wise average quality indices across the three datasets. Here, SRGAN (Ledig et al., 2017) is adopted as the baseline.

                   | BSD100                  | Set5                    | Set14                   | Overall
                   | PSNR   SSIM   FSIM      | PSNR   SSIM   FSIM      | PSNR   SSIM   FSIM      | PSNR   SSIM   FSIM
SRGAN w/ CE        | 25.72  69.78  83.04     | 28.84  83.08  91.47     | 25.85  72.42  91.15     | 25.76  70.66  84.32
SRGAN w/ LS        | 25.73  70.00  82.92     | 28.91  83.81  91.61     | 25.87  72.71  91.19     | 23.68  70.87  84.25
SRGAN w/ Hinge     | 25.79  69.87  82.60     | 29.00  83.56  91.63     | 25.93  72.99  91.08     | 23.77  70.74  83.97
SRGAN w/ IFL       | 25.89  70.53  83.33     | 29.04  84.50  91.89     | 26.02  73.01  91.36     | 25.86  71.38  84.61
FocalGANSRGAN      | 25.94  70.82  83.49     | 29.27  84.68  91.99     | 26.09  73.25  91.61     | 25.98  71.69  84.80

Note that Blau and Michaeli (2018) prove mathematically that distortion and perceptual quality are at odds with each other. In other words, as the PSNR and FSIM values increase, the perceptual quality tends to become worse (i.e. the FID increases). They also show that GANs approach the perception–distortion bound. It is therefore challenging for a method to show superiority in all three criteria.

5.1.3. Comparison

In experiments, we mainly use Pix2Pix (Isola et al., 2017), SRGAN (Ledig et al., 2017) and SAGAN (Zhang et al., 2019) as the baseline networks for different tasks. Specifically, we train each base GAN using CE loss, LS loss, Hinge loss, and IFL, respectively. Note that we do not compare with the Wasserstein loss (Arjovsky et al., 2017; Gulrajani et al., 2017) here, because when we train Pix2Pix using the Wasserstein loss, we only obtain a FID of 65 for the face sketch synthesis task on the CUFS dataset. This result is much worse than those of the other loss functions. Besides, the Wasserstein loss is not widely used in recent work.

5.2. Analysis of settings

In this section, we empirically analyse the settings of IFL and present a series of ablation studies. All the experiments here concern face sketch synthesis and are conducted on the CUFS dataset.

5.2.1. Analysis of IFL

First, we analyse the impact of the different parameters in IFL (Eq. (14)). Specifically, we train Pix2Pix using different variants of IFL by changing one parameter while fixing all the others. The corresponding results are shown in the sub-tables of Table 3. As shown in Table 3a and Table 3b, using a large γ0 or an over-small or over-large λ increases the FID values. Besides, γmax and Tf should not be too small or too large. These observations are consistent with our previous analysis in Section 3.3 and Section 4.1.3.

In addition, we make the model produce testing sketches after every 5 epochs during training, and compute the corresponding FID values. Fig. 4 shows the evolution of FID for CE loss and for IFL with different λ. Apparently, all the IFL variants lead to lower FIDs than CE loss at the end of training. As shown in Fig. 4, the curves are basically equal around the 100th epoch, because the focusing factor is γ(t) = γ0 = 0 during the first 100 epochs; in other words, the IFL loss is then identical to the CE loss. After the 100th epoch, the focusing factor becomes greater than 0. In this case, IFL makes the generator focus on hard examples, and leads to lower FIDs than the CE loss. Finally, using a large λ might cause over-fitting at the end of the training process. In general, it is recommended to set γ0 = 0 with a moderate γmax, a small λ, and a relatively large Tf. In this way, the focusing factor changes slowly, and gradually focuses training on hard examples. The decreased FIDs indicate that IFL increases the realism of the synthesized sketches.

5.2.2. Ablation study

We further conduct ablation studies by adding different modules to the baseline network, i.e. Pix2Pix w/ CE loss. The results are shown in Table 4. Compared to the baseline, both IFL and ESA significantly decrease FID without considerably decreasing PSNR or FSIM. Besides, the model with ESA achieves a lower FID than that with SA. Finally, FocalGANPix2Pix achieves the best results, demonstrating that the effects of IFL and ESA are additive.

Fig. 5 illustrates the effect of ESA. Clearly, ESA typically produces clearer structures and textures in the synthesized sketches. For instance, in the left-top example, the structures of the right eye and the textures of the hair region in the third column are much clearer than those in the second column. Similar phenomena can be observed in the other examples.


Fig. 10. Reconstructed high-resolution images with zoomed-in patches. (a) Bicubic interpolation, (b) SRGAN w/ CE, (c) SRGAN w/ IFL, (d) FocalGANSRGAN, and (e) ground-truth.


Table 9
Performance (FID) on unsupervised image generation tasks.

                   | CelebA | LSUN   | CIFAR-10 | Average
SAGAN w/ CE        | 10.60  | 26.02  | 23.49    | 17.17
SAGAN w/ LS        | 9.44   | 35.82  | 21.42    | 19.56
SAGAN w/ Hinge     | 9.25   | 28.38  | 23.71    | 17.24
SAGAN w/ IFL       | 8.17   | 24.75  | 22.27    | 15.26
FocalGANSAGAN      | 6.72   | 24.70  | 19.01    | 14.08

Fig. 11. Images generated by SAGAN on CelebA. (a) SAGAN w/ Hinge, (b) SAGAN w/ IFL, (c) FocalGANSAGAN, and (d) ground-truth.

5.3. Conditional image-to-image translation

5.3.1. Face photo-sketch synthesis

We first evaluate our method for face photo-sketch synthesis on the CUFS dataset (Wang & Tang, 2009). We adopt Pix2Pix as our baseline and apply the different loss functions, respectively. The corresponding results are shown in Table 5, where the best indices are denoted in boldface and the second best ones are underlined. Apparently, our method achieves the lowest FID values. Besides, Pix2Pix w/ IFL and our final FocalGANPix2Pix show superiority over the other models. We report both PSNR and FSIM here because they have been widely used in existing works; however, neither PSNR nor FSIM is well suited for evaluating the visual quality of sketches (Wang, Gao, Li, Song, & Li, 2016). As shown in Table 5, FocalGANPix2Pix shows significantly lower FID values than the other methods, and achieves 4 best results among all 6 performance indices, while Pix2Pix w/ IFL achieves 2 best results. Such comparisons demonstrate that our final model obtains the best overall performance and produces the best face photos/sketches in general. Figs. 6 and 7 show examples of synthesized face sketches and photos. Obviously, our method produces quality sketches with legible structures and pencil-stroke-like textures, as well as quality photos with legible structures and visually appealing appearance. In contrast, the sketches/photos produced by Pix2Pix w/ CE are heavily blurred and involve serious


Fig. 12. Images generated by SAGAN on LSUN. (a) SAGAN w/ Hinge, (b) SAGAN w/ IFL, (c) FocalGANSAGAN, and (d) ground-truth.

Fig. 13. Images generated by SAGAN on CIFAR-10. (a) SAGAN w/ Hinge, (b) SAGAN w/ IFL, (c) FocalGANSAGAN, and (d) ground-truth.


deformations. Similar comparison results are observed across the whole dataset.

To further verify the effect of our IFL, we apply BicycleGAN (Zhu et al., 2017b) with different loss functions. Besides, we evaluate the corresponding full model, BicycleGAN with IFL and ESA, denoted FocalGANBicycleGAN. Experiments are conducted on the face sketch synthesis task on the CUFS dataset, and the corresponding results are shown in Table 6. Obviously, IFL outperforms both CE loss and LS loss, and performs on par with Hinge loss. Besides, FocalGANBicycleGAN achieves the best performance in terms of all three indices. These comparisons demonstrate that our IFL is superior to the other loss functions in the face photo-sketch synthesis task, and that FocalGAN is capable of producing perceptually realistic face sketches and photos.

5.3.2. Map ↔ aerial photo translation

We further perform image-to-image translation experiments on the MAP dataset (Radim Tyleček, 2013). Here, we also use Pix2Pix as the baseline, and evaluate our method on map↦aerial photo and aerial photo↦map, respectively. We use FID, SSIM, and FSIM to evaluate the quality of the synthesised aerial photos and maps. As shown in Table 7, IFL gains significant improvements in terms of almost all the performance indices. Besides, FocalGANPix2Pix further boosts the performance. Here, FocalGANPix2Pix leads to a larger FID than Pix2Pix w/ IFL for map↦aerial photo. Aerial photos contain a great deal of fine detail in tiny textures and structures, whereas the SA or ESA mechanism harmonizes features at different positions and produces slightly blurred patterns (as shown in Fig. 9), which generally does not harm fidelity but does increase the FID score. In contrast, images produced by Pix2Pix w/ IFL present more fluctuations, similar to real aerial photos. Fortunately, the strong effect of IFL still makes FocalGANPix2Pix outperform the previous methods, and FocalGANPix2Pix obtains the best performance according to all the other indices. In addition, as illustrated in Figs. 8 and 9, our results contain significantly more coherent structures and show higher fidelity to the ground-truths than those produced by Pix2Pix w/ CE. To conclude, both the quantitative and the qualitative results demonstrate the effectiveness of IFL and ESA. Besides, this success indicates the broader applicability of our model to various image-to-image translation tasks.

5.3.3. Single image super-resolution reconstruction (SISR)

In this section, we verify whether our IFL and ESA can be applied to other conditional GANs. For this purpose, we use SRGAN (Ledig et al., 2017) as the baseline network, and experiment on the 4× upscaling SISR task. Note that we apply IFL and ESA to SRGAN as previously shown in Fig. 2, without changing the other settings. Here, we train our model on the VOC2012 dataset, and test the learned model on the BSD100, Set5, and Set14 datasets. Table 8 shows the PSNR, SSIM, and FSIM values of the reconstructed images on each dataset, as well as the average values across the three datasets. Encouragingly, using IFL instead of the other loss functions gains average improvements of about 0.1, 0.5, and 0.3 in PSNR, SSIM, and FSIM, respectively. Besides, FocalGANSRGAN further boosts the quality of the reconstructed high-resolution images. Accordingly, as shown in Fig. 10, using IFL leads to better details, with coherent structures and realistic textures, in the reconstructed images. These observations demonstrate that IFL is capable of encouraging the generator to produce photo-realistic details. Besides, both IFL and ESA can be applied to various conditional GANs and are promising to boost the performance on a variety of cross-modal image translation tasks.

5.4. Unsupervised image generation

Finally, we evaluate the proposed techniques on unsupervised image generation tasks. We use SAGAN (Zhang et al., 2019) as the baseline due to its strong capacity, and experiment on CelebA, LSUN, and CIFAR-10, respectively. In the original SAGAN, the Hinge loss is used; in this work, we also train SAGAN using CE loss, LS loss, and IFL, respectively. Besides, we further replace the SA mechanism before the last layer of the original SAGAN with our ESA, and train it using IFL. The resulting model is referred to as FocalGANSAGAN. For each model, we randomly generate 50K, 30K, and 10K images on CelebA, LSUN, and CIFAR-10, respectively; on each dataset, the number of generated testing images is approximately 1/4 of the number of training images. Afterwards, we calculate the FID between the generated images and the same number of real images on each dataset. The corresponding FIDs and their averages are shown in Table 9. Obviously, although IFL is slightly inferior to LS loss on the CIFAR-10 dataset, it decreases the average FID by about 2 relative to CE loss and Hinge loss, and by about 4 relative to LS loss. Similarly, FocalGANSAGAN achieves the best FID across all three datasets, and further decreases the average FID by about 1. Recall that SAGAN itself includes SA mechanisms; the superiority of FocalGANSAGAN over SAGAN w/ IFL is therefore attributable to ESA. In addition, as shown in Figs. 11–13, images produced by FocalGANSAGAN generally present fewer deformations and more coherent structures than those produced by the original SAGAN. These results demonstrate that IFL is superior, or at least highly comparable, to the other three loss functions, and that both IFL and ESA boost the quality of generated images.

6. Conclusion

In this work, we propose a novel loss function, IFL, to make GANs focus learning on hard examples. Besides, we employ the ESA mechanism to improve the capacity of the generator. We apply them to various unsupervised and conditional GANs, and experiment on a variety of tasks. Results show that IFL boosts training of GANs over other alternative loss functions. Besides, both IFL


and ESA are capable of improving the quality of generated images. In the future, we will verify the effect of IFL and ESA on various image generation tasks, such as cartoons, paintings, medical images, etc. Besides, it is promising to explore powerful baseline networks to further boost the quality of generated images. To this end, exploring hierarchical deep features (Jun, Min, Hongyuan, Dacheng, & Yong, 2019; Yu, Zhu) and using multi-task learning (Jun, Chaoqun, Yong, & Dacheng, 2018; Jun, Zhenzhong, Baopeng, Dan, & Jianping, 2017b) are promising solutions. Finally, better loss functions might be explored based on the idea of metric learning (Jun, Xiaokang, Fei, & Dacheng, 2017a). CRediT authorship contribution statement Fei Gao: Conceptualization, Funding acquisition, Methodology, Writing - original draft, Writing - review & editing. Jingjie Zhu: Data curation, Investigation, Methodology, Resources. Hanliang Jiang: Formal analysis, Funding acquisition, Methodology, Resources. Zhenxing Niu: Conceptualization, Methodology, Writing - review & editing. Weidong Han: Project administration, Supervision, Writing - review & editing. Jun Yu: Project administration, Supervision, Writing - review & editing. Acknowlgedgments This work was supported in part by the National Natural Science Foundation of China under Grants 61601158, 61971172, 61702145, 61836002, 61971339, 81972745, 81572361, and 61802100, in part by the Ten thousand plan youth talent support program of Zhejiang Province under Grant ZJWR0108009, in part by the Education of Zhejiang Province under Grants Y201534623, Y201840785, and Y201942162, in part by the Zhejiang Provincial Science Foundation under Grants Y18H160029, and in part by the China Post-Doctoral Science Foundation under Grant 2019M653563. References Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. Proceedings of the 34th International Conference on Machine Learning, ICML 2017, 214–223. Blau, Y., & Michaeli, T. (2018). The perception-distortion tradeoff. The ieee conference on computer vision and pattern recognition (cvpr). Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., & Krishnan, D. (2017). Unsupervised pixel-level domain adaptation with generative adversarial networks. Ieee conference on computer vision and pattern recognition3722–3731. Brock, A., Donahue, J., & Simonyan, K. (Donahue, Simonyan, 2019a). Large scale gan training for high fidelity natural image synthesis. International Conference on Learning Representations, 2019. Brock, A., Donahue, J., & Simonyan, K. (Donahue, Simonyan, 2019b). Large scale gan training for high fidelity natural image synthesis. International Conference on Learning Representations. Bruna, J., Sprechmann, P., & Lecun, Y. (2016). Super-resolution with deep convolutional sufficient statistics. International conference on representation learning, iclr. Chen, X., Mishra, N., Rohaninejad, M., & Abbeel, P. (Mishra, Rohaninejad, Abbeel, 2018a). PixelSNAIL: An improved autoregressive generative model. Proceedings of the 35th international conference on machine learning864–872. Chen, X., Xu, C., Yang, X., Song, L., & Tao, D. (2019). Gated-gan: Adversarial gated networks for multi-collection style transfer. IEEE Transactions on Image Processing, 28(2), 546–560. Chen, X., Xu, C., Yang, X., & Tao, D. (Xu, Yang, Tao, 2018b). Attention-GAN for object transfiguration in wild images. Computer Vision - ECCV, 167–184. Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). 
Generating long sequences with sparse transformers. CoRR, abs/1904.10509. Choi, Y., Choi, M., Kim, M., Ha, J. W., Kim, S., & Choo, J. (2018). StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. Ieee international conference on computer vision8789–8797. Everingham, M., Eslami, S. M. A., Gool, L. V., Williams, C. K. I., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136. Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... Bengio, Y. (2014). Generative adversarial nets. International conference on neural information processing systems2672–2680. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. C. (2017). Improved training of wasserstein gans. Advances in neural information processing systems5767–5777. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems6626–6637. Huang, X., & Belongie, S. (2017). Arbitrary style transfer in real-time with adaptive instance normalization. Ieee international conference on computer vision1510–1519. Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. Ieee conference on computer vision and pattern recognition1125–1134. Jing, Y., Yang, Y., Feng, Z., Ye, J., Song, M., Jing, Y., ... Song, M. (2017). Neural style transfer: A review. arXiv:1705.04058. Jun, Y., Chaoqun, H., Yong, R., & Dacheng, T. (2018). Multitask autoencoder model for recovering human poses. IEEE Transactions on Industrial Electronics, 65(6), 5060–5068. Jun, Y., Min, T., Hongyuan, Z., Dacheng, T., & Yong, R. (2019). Hierarchical deep click feature prediction for fine-grained image recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2019.2932058. Jun, Y., Xiaokang, Y., Fei, G., & Dacheng, T. (Xiaokang, Fei, Dacheng, 2017a). Deep multimodal distance metric learning using click constraints for image ranking. IEEE Transactions on Cybernetics, 47(12), 4014â 4024. Jun, Y., Zhenzhong, K., Baopeng, Z., Dan, L., & Jianping, F. (Zhenzhong, Baopeng, Dan, Jianping, 2017b). Image privacy protection by identifying sensitive objects via deep multi-task learning. IEEE Transactions on Information Forensics and Security, 12(5), 1005–1016. Karras, T., Aila, T., Laine, S., & Lehtinen, J. (Aila, Laine, Lehtinen, 2018a). Progressive growing of gans for improved quality, stability, and variation. International conference on learning representations, iclr. Karras, T., Laine, S., & Aila, T. (Laine, Aila, 2018b). A style-based generator architecture for generative adversarial networks abs/1812.04948 Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical Report, University of Toronto, 2009. Kurach, K., Lucic, M., Zhai, X., Michalski, M., & Gelly, S. (2018). The gan landscape: Losses, architectures, regularization, and normalization. CoRR, abs/1807.04720. Ledig, C., Theis, L., HuszÃ!‘r, F., Caballero, J., Cunningham, A., Acosta, A., ... Wang, Z. (2017). Photo-realistic single image super-resolution using a generative adversarial network. Ieee conference on computer vision and pattern recognition105–114. 
Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2017). Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99), 2999–3007.


Liu, Z., Luo, P., Wang, X., & Tang, X. (2015). Deep learning face attributes in the wild. Proceedings of international conference on computer vision3730–3738. Lucic, M., Kurach, K., Michalski, M., Gelly, S., & Bousquet, O. (2018). Are gans created equal? a large-scale study. Advances in neural information processing systems 31: Annual conference on neural information processing systems 2018, neurips 2018698–707. Lucic, M., Tschannen, M., Ritter, M., Zhai, X., Bachem, O., & Gelly, S. (2019). High-fidelity image generation with fewer labels. Proceedings of the 36th international conference on machine learning, icml4183–4192. Mao, X., Li, Q., Xie, H., Lau, R. Y. K., Wang, Z., & Smolley, S. P. (2017). Least squares generative adversarial networks. international conference on computer vision2813–2821. Martin, D., Fowlkes, C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings eighth ieee international conference on computer vision. iccv 20012. Proceedings eighth ieee international conference on computer vision. iccv 2001 416–423vol.2. Miyato, T., Kataoka, T., Koyama, M., & Yoshida, Y. (2018). Spectral normalization for generative adversarial networks. International Conference on Learning Representations, 2018. Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., & Tran, D. (2018). Image transformer. In J. Dy, & A. Krause (Vol. Eds.), Proceedings of Machine Learning Research: . 80. Proceedings of the 35th international conference on machine learning (pp. 4055–4064). StockholmsmÃn, Stockholm Sweden: PMLR. Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised representation learning with deep convolutional generative adversarial networks. 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings. Radim Tyleček, R.Š. (2013). Spatial pattern templates for recognition of objects with regular structure. Pattern recognition364–374. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. Ieee conference on computer vision and pattern recognition2818–2826. Vaswani, A., Shazeer, N., Parmar, N., Jones, L., Uszkoreit, J., Gomez, A. N., & Kaiser, L. (2017). Attention is all you need. Advances in neural information processing systems 30: Annual conference on neural information processing systems 20175998–6008. Wang, N., Gao, X., & Li, J. (2018). Random sampling for fast face sketch synthesis. Pattern Recognition, 76, 215–227. Wang, N., Gao, X., Li, J., Song, B., & Li, Z. (2016). Evaluation on synthesized face sketches. Neurocomputing, 214(C), 991–1000. Wang, N., Tao, D., Gao, X., Li, X., & Li, J. (2014). A comprehensive survey to face hallucination. International Journal of Computer Vision, 106(1), 9–30. Wang, X., & Tang, X. (2009). Face photo-sketch synthesis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11), 1955–1967. Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Process, 13(4), 600–612. Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., & He, X. (2018). AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. 2018 IEEE conference on computer vision and pattern recognition, cvpr1316–1324. 
Yu, F., Zhang, Y., Song, S., Seff, A., & Xiao, J. (2015). LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. CoRR, abs/ 1506.03365. Yu, J., Zhu, C., Zhang, J., Huang, Q., & Tao, D.. Spatial pyramid-enhanced netvlad with and weighted triplet loss for place recognition. IEEE Transactions on Neural Networks and Learning Systems, 10.1109/TNNLS.2019.2908982. Zeyde, R., Elad, M., & Protter, M. (2010). On single image scale-up using sparse-representations. International conference on curves and surfaces711–730. Zhang, H., Goodfellow, I., Metaxas, D., & Odena, A. (2019). Self-attention generative adversarial networks. Proceedings of the 36th international conference on machine learning, icml7354–7363. Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., & Metaxas, D. (2017). StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99). Zhang, L., Zhang, L., Mou, X., & Zhang, D. (2011). FSIM: a feature similarity index for image quality assessment. IEEE Transactions on Image Processing, 20(8), 2378–2386. Zhang, M., Wang, N., Li, Y., & Gao, X. (Wang, Li, Gao, 2019a). Deep latent low-rank representation for face sketch synthesis. arXiv preprint arXiv:10.1109/ TNNLS.2018.2890017. Zhang, M., Wang, N., Li, Y., & Gao, X. (Wang, Li, Gao, 2019b). Neural probabilistic graphical model for face sketch synthesis. arXiv preprint arXiv:10.1109/ TNNLS.2018.2890017. Zhang, S., Ji, R., Hu, J., Gao, Y., & Chia-Wen, L. (2018). Robust face sketch synthesis via generative adversarial fusion of priors and parametric sigmoid. Proceedings of the twenty-seventh international joint conference on artificial intelligence (ijcai-18)1163–1169. Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (Park, Isola, Efros, 2017a). Unpaired image-to-image translation using cycle-consistent adversarial networks. Ieee international conference on computer vision2242–2251. Zhu, J.-Y., Zhang, R., Pathak, D., Darrell, T., Efros, A. A., Wang, O., & Shechtman, E. (Zhang, Pathak, Darrell, Efros, Wang, Shechtman, 2017b). Toward multimodal image-to-image translation. Advances in neural information processing systems 30465–476.
