Text-to-image via mask anchor points

Samah S. Baraheem, Tam V. Nguyen

PII: S0167-8655(20)30052-0
DOI: https://doi.org/10.1016/j.patrec.2020.02.013
Reference: PATREC 7791
To appear in: Pattern Recognition Letters

Received date: 4 October 2019
Revised date: 8 February 2020
Accepted date: 11 February 2020

Please cite this article as: Samah S. Baraheem, Tam V. Nguyen, Text-to-image via mask anchor points, Pattern Recognition Letters (2020), doi: https://doi.org/10.1016/j.patrec.2020.02.013


Highlights

• Synthesizing the image from input text that consists of multiple objects.
• Using a graphical user interface to input the text from end-users.
• Constructing a mask dataset for semantic classes.
• Preserving the spatial relations among objects using anchor points.
• Evaluating the synthesis results against state-of-the-art methods.


Text-to-image via mask anchor points

Samah S. Baraheem a,b and Tam V. Nguyen a

a Department of Computer Science, University of Dayton, 300 College Park, Dayton, OH 45469, USA
b Department of Computer Science, Umm Al-Qura University, Prince Sultan Bin Abdulaziz Road, Mecca, Makkah 21421, Saudi Arabia

ABSTRACT

Text-to-image is the process of generating an image from input text. It has a variety of applications in art generation, computer-aided design, and data synthesis. In this paper, we propose a new framework which leverages mask anchor points and incorporates two major steps in the image synthesis. In the first step, the mask image is generated from the input text and the mask dataset. In the second step, the mask image is fed into a state-of-the-art mask-to-image generator. Note that the mask image captures the semantic information and the location relationships via the anchor points. We also developed a user-friendly interface which helps parse the input text into meaningful semantic objects. As a result, our framework is able to produce clear, reasonable, and more realistic images. Experiments on the challenging COCO-stuff dataset illustrate the superiority of our proposed approach over the previous state of the art.

Keywords: text-to-image, mask dataset, image synthesis, anchor points

© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

Text-to-image synthesis is an important problem in computer vision and artificial intelligence which concentrates on synthesizing realistic, high-resolution images from natural language descriptions. Numerous applications such as art generation, photo editing, data synthesis, and computer-aided design depend on text-to-image generators. Many techniques based on deep convolutional and recurrent networks have been proposed to synthesize reasonable images from text [13-15]. Recently, Generative Adversarial Networks (GANs) have played a principal role in generating photo-realistic images [1,2,4,5,6,9,16,18,21,23,24], and many efforts with different methods aim to enhance the quality of the generated images. Indeed, various GAN approaches [3,4,5,6] achieve promising results on simple datasets such as birds and flowers, and there are efforts [2,7,8,9] to synthesize complex, real-world scenes composed of multiple objects with different relationships. However, the quality of the generated images still suffers from several problems: the results look unnatural, the semantic information is not preserved, and the generated images contain distortion. In this paper, we aim to generate images from text with accurate shapes and textures that represent the objects clearly using mask dataset anchor points. As can be seen in Fig. 1, the text "A man rides on a horse" yields appropriate masks for the man and the horse with the described relation (rides on) using the mask dataset anchor points.

Fig. 1. The illustration of the general framework of generating multiple objects from natural language descriptions. The figure shows the generated image of an example text: "A man rides on a horse". The red dashed rectangle highlights our contribution in this paper.

Instead of mapping from text to image directly, we first construct the mask map from the input text, which is taken from the user via an interface, together with the mask dataset, after detecting the anchor points of all objects in the text. We use an interface to construct the input text because it is user-friendly, easy to parse for verbs and nouns, appealing, and effort-saving. To synthesize a proper semantic segmentation map, we concentrate on several aspects such as spatial location, size, depth, and the number of entities in the first stage, text-to-mask (T2M). The resulting mask map is then fed into the state-of-the-art mask-to-image generator in the second stage (M2I) to generate the corresponding image.
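As a concrete illustration of this two-stage flow, the short Python sketch below shows one way the interface selections could be represented before the T2M and M2I stages; the ParsedText structure and the RELATION_MAP dictionary are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of how the input taken from the interface
# can be represented before the two stages: text-to-mask (T2M) and mask-to-image (M2I).
from dataclasses import dataclass

# Hypothetical mapping from interface verbs/prepositions to the spatial relations
# used later for anchor-point selection (upper/lower use the CAP; the rest use the BAP).
RELATION_MAP = {
    "rides on": "upper", "sits on": "upper", "stands on": "upper",
    "next to": "close", "behind": "behind", "in front of": "front",
    "inside": "inside", "part of": "part in",
}

@dataclass
class ParsedText:
    subject: str        # e.g., "man"
    n_subjects: int     # e.g., 1
    relation: str       # one of the spatial relations above
    obj: str            # e.g., "horse"
    n_objects: int      # e.g., 1

# "A man rides on a horse" built from the interface selections:
parsed = ParsedText(subject="man", n_subjects=1,
                    relation=RELATION_MAP["rides on"], obj="horse", n_objects=1)
print(parsed)
```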


Fig. 2. The flowchart of our proposed framework for synthesizing the mask map from text.

2. Related Work

2.1. Earlier Work

Synthesizing reasonable images from natural language input is an important task for understanding the visual world, and many studies have been conducted in this field of computer vision using different methods. Since convolutional neural networks [29,30] perform well in recognition tasks, Dosovitskiy et al. [13] proposed a deep convolutional decoder network to generate chairs, tables, and cars from a given set of 3D models at different viewpoints and the respective parameters such as lighting, brightness, color, position, and zoom. Later, the recurrent neural network DRAW [14] was introduced to mimic the way humans draw, using variational auto-encoders and a spatial attention mechanism to generate an image iteratively by patches. It produced reasonable results on the MNIST dataset of handwritten digits, but it lacks the ability to generate a complex image with a clear object. Additionally, an earlier work generated an image from text by extracting visual attributes from the sentence along with unknown or latent factors via a trained recurrent convolutional encoder-decoder network [15].

2.2. GAN-based Methods

Recently, the research community shifted to Generative Adversarial Networks (GANs) because GANs have shown more realistic results than other approaches [1,2,4,5,6,9,16,18,21,23,24]. Basically, GANs have two main components: a generator and a discriminator. The generator attempts to fool the discriminator by producing high-quality, more realistic images, while the discriminator attempts to distinguish between real images and the fake images produced by the generator. Most GAN-based methods rely on a global sentence vector to generate the image from text [3,4,5]. However, since the words of the sentence are encoded into one global sentence vector, this representation lacks fine-grained information at the word level and thus fails to represent vivid object details. Therefore, the Attentional Generative Adversarial Network (AttnGAN) [6] was introduced to generate an image with all necessary details. In particular, AttnGAN enables attention-driven, multi-stage refinement for fine-grained text-to-image synthesis [6]. It works not only with the global sentence vector but also with word vectors, where each word in the sentence is encoded into a word-level vector, and it therefore synthesizes the details of different sub-regions of the image based on the related words. Although AttnGAN produces promising results on simple sentences with a single object, e.g., a bird or a flower, it fails to generate complex scenes with several entities and multiple relationships. Reed et al. [7] provided a method based on a global constraint, the input text, and a location constraint that specifies the whole object via a bounding box and the locations of its main components. Although it produces good results on simple datasets [7], it struggles to generate complicated scenes. Consequently, many works have used GANs to address the issue of generating complex scenes with multiple objects. In recent years, the scene graph was introduced to generate the image [10]. It has the ability to synthesize multiple entities and their relationships by passing the input graph through a graph convolutional neural network to predict bounding boxes and segmentation masks for all entities and then compute the scene layout. The scene layout is then used to generate realistic images via a cascaded refinement network. Tripathi et al. [11] introduced an improvement on generating an image from the scene graph which maps the scene graph into embedding vectors from a Graph Convolutional Neural Network (GCNN) to produce the scene layout. Moreover, there are many techniques to synthesize images from semantic label maps [1,16-26] that can be helpful in tasks such as image-to-image translation.

2.3. Mask-to-image-based Methods

This approach differs from the text-to-image process in that it takes the segmentation map as input and produces the colored label map along with the synthesized photorealistic image. Pix2pix [16] was the first framework to synthesize an image from a label map using conditional GANs, followed by an improvement, pix2pixHD [18], which produces realistic high-resolution images. At the same time, Chen et al. [17] developed a method without conditional GANs and without adversarial training, using a convolutional network with a regression loss. Some methods produce multiple generated images simultaneously to ensure diversity in the translated images [17,27,28]. To move beyond generating a discrete number of images, Zhu et al. [22] proposed BicycleGAN to generate continuous multimodal images. Furthermore, a new trend was launched towards translating an image from a semantic label map without supervision [19-21,23-26]. Park et al. [1] provided one of the most effective approaches to synthesize photorealistic images based on GANs along with a normalization approach called spatially-adaptive normalization (SPADE). Each semantic class/label in the segmentation map is represented by a particular value in order to distinguish each object class and generate a proper synthesized image. The spatially-adaptive normalization approach feeds the semantic label map to all intermediate layers, not only to the input layer as in previous models, where the information in the segmentation map is washed away in the deeper layers. For that reason, the SPADE model produces the best performance with high-quality images [1]; thus, it serves as the mask-to-image component in our proposed framework.

Fig. 4. The illustration of parsing the input text “A man rides on a horse”


Fig. 3. The examples of object mask with different directions.

3. Proposed Method

In this section, we introduce our proposed framework, which has two main components, namely the text-to-mask and mask-to-image components. Fig. 1 shows the overall framework. As shown in Fig. 2, our proposed text-to-mask component consists of six main steps, described in detail as follows.

3.1. Mask Dataset

The first major step is to collect a mask dataset for all semantic classes. Our dataset is composed of 182 semantic classes, similar to the COCO-stuff dataset, for example, person, horse, house, and tree. Our mask dataset was collected in three phases:

• First, we use a crawler tool¹ to search for and download masks for each semantic class (a minimal crawler sketch is given after this list). In addition, we browse several websites [32-35] to download masks for each class using the keywords class label + "silhouette" or class label + "transparent".
• Second, we verify the downloaded masks and discard unusable low-quality masks, i.e., those containing noise or distortion.
• Finally, we crop the non-object extra space in the transparent background of each image. We also label the mask direction, left or right, so that we can guide the subject(s) to face the same direction as the object(s), as shown in Fig. 3.
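A minimal sketch of the first phase is shown below, using the GoogleImageCrawler from the icrawler package referenced in the footnote; the class subset, directory layout, and download counts are illustrative, and the downloaded files still require the manual verification and cropping of the later phases.

```python
# Sketch of phase one of the mask collection using the icrawler package (see footnote 1).
from icrawler.builtin import GoogleImageCrawler

CLASSES = ["person", "horse", "dog", "bicycle"]   # illustrative subset of the 182 semantic classes

for label in CLASSES:
    for suffix in ("silhouette", "transparent"):
        # One crawler per class/keyword pair; images are stored under masks/<class label>/
        crawler = GoogleImageCrawler(storage={"root_dir": f"masks/{label}"})
        crawler.crawl(keyword=f"{label} {suffix}", max_num=15)
```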

In total, our mask dataset consists of 4,756 different masks, with at least 25 masks for each semantic class.

3.2. Text Parser via Interface

In this work, we develop a user interface for the text input. Our interface is, first, more user-friendly than a file or the terminal/command prompt. Second, it saves the effort of constructing or writing a proper sentence. Moreover, it is fast and helps avoid spelling mistakes. All sentence components (i.e., nouns, verbs, locations, and numbers) are provided for the user's selection.

¹ GoogleImageCrawler: https://pypi.org/project/icrawler/

Table 1. Notations and their corresponding meaning.

Notation      Meaning
CAP           Center anchor point
i             Pixel location on the x coordinate
j             Pixel location on the y coordinate
mask(i, j)    Pixel intensity, where mask(i, j) ∈ {0, 1}: 1 for foreground, 0 for background
m             Margin
n_o           Number of objects in the image
s_cv          Size of the canvas, where s_cv = [w_cv, h_cv]
loc_sub       Upper-left location of the subject
BAP           Base anchor point
s             Mask size of the subject, where s = [w_s, h_s]
o_class       Class offset
sp            Spatial relation term, used when the base anchor point is utilized

Then, the semantic inference parses the text to determine the value of each component (subject, object, spatial relation, and the number of both subjects and objects), as shown in Fig. 4, in order to facilitate the use of the text input in the following stages. Note that the list of subjects and objects comes from the COCO-stuff dataset, and we maintain subject-verb agreement, where the subject and verb must agree in terms of singular or plural.

3.3. Mask Retrieval

Following the semantic inference, masks are randomly selected for the subject(s) and object(s). Note that the selected masks must be associated with the class labels of the subject(s) and object(s). We also consider the spatial directions, i.e., left or right, to enforce subject-object direction harmony. For our input text example, "A man rides on a horse", we retrieve the mask of the object "horse" first; then, based on its direction, we retrieve the mask of the subject "man". Each retrieved mask is placed onto the mask image (canvas) with its associated class label.

3.4. Mask Anchor Point Detection

The mask anchor points are the principal points of the mask object that affect the mask synthesis based on the spatial relations in the input text. In particular, we detect two main anchor points for each entity, as follows:

• Center anchor point (CAP): We compute the center of each mask in order to determine the location among entities; i.e., upper and lower connections are based on the center anchor point, which is the geometric center of the mask. The coordinates of the CAP are computed in Eqs. (1) and (2) below.

Fig. 5. Visual comparisons among StackGAN + Object Pathways [2], AttnGAN [6], AttnGAN + Object Pathways [2], and our proposed method on the COCO-stuff dataset, alongside the ground truth (GT) and our generated mask maps. The example captions are: "A group of animals stands in a large grassy field", "A man rides on a horse", "A man sits next to his bicycle on the beach", "A dog sits next to a teddy bear", "An open laptop on a wooden desk and two note pads on the desk", and "A group of people stand on top of a snow-covered slope".

CAP_x = \frac{\sum_{i,j} mask(i,j) \cdot i}{\sum_{i,j} mask(i,j)}     (1)

CAP_y = \frac{\sum_{i,j} mask(i,j) \cdot j}{\sum_{i,j} mask(i,j)}     (2)

• Base anchor point (BAP): We detect the bottom-most point of the object, which is an important feature for spatial relations such as close, front, behind, inside, and part in (i.e., being a part of another, larger entity). This anchor point allows all entities to be placed at the same height level within the resulting semantic segmentation map. The BAP is computed as the lowest foreground pixel in the object mask.
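A small sketch of both anchor points is given below, assuming the mask is a binary NumPy array (1 for foreground, 0 for background); the CAP follows Eqs. (1) and (2), and the BAP is taken as the lowest foreground pixel.

```python
# Center and base anchor points of a binary mask, per Eqs. (1)-(2) and the BAP definition.
import numpy as np

def center_anchor_point(mask: np.ndarray) -> tuple[float, float]:
    """Geometric center of the foreground, i.e., Eqs. (1) and (2)."""
    ys, xs = np.nonzero(mask)               # foreground pixel coordinates
    return xs.mean(), ys.mean()             # (CAP_x, CAP_y)

def base_anchor_point(mask: np.ndarray) -> tuple[int, int]:
    """Lowest foreground pixel of the mask (largest row index)."""
    ys, xs = np.nonzero(mask)
    i = np.argmax(ys)                       # bottom-most foreground pixel
    return int(xs[i]), int(ys[i])           # (BAP_x, BAP_y)

mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:6, 3:5] = 1                          # a toy rectangular "object"
print(center_anchor_point(mask), base_anchor_point(mask))
```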

Table 1 details the notations used in this paper. As illustrated in Fig. 2 (stage 4), the red rectangle represents the base anchor point

whereas the yellow rectangle depicts the center anchor point, for both the subject and the object of the text input "A man rides on a horse". For this text input, the center anchor points are used, since the spatial relation between the man and his horse is "on"; the center anchor points therefore help place the center of the man on top of the center of his horse.

3.5. Mask Synthesis

After retrieving the appropriate masks for all entities, detecting the anchor points, and specifying the spatial connection between the entities via semantic parsing, we synthesize the proper semantic segmentation map, focusing on three aspects: the spatial relations, the size, and the depth of each mask. We first create a canvas, which is the base image that contains all synthesized masks and the appropriate scene background. Note that the scene background is constructed depending on both the subject(s) and the object(s). The scene background might be indoors or outdoors under many scenarios; e.g., an outdoor scene can be "on snow", "on grass", "on desert", "on beach", "on sidewalk", or "on road", while an indoor scene might be "on floor", "on carpet", "on rug", or even "on table". For instance, "horse" is usually outdoors while "sofa" is normally indoors. As to the first aspect, which has a significant effect on synthesizing the segmentation map appropriately, we consider most location conditions, i.e., upper, lower, behind, front, close, inside, and part in, for many circumstances. Each circumstance has different parameters in our mask-synthesis rules, incorporating human commonsense. For the upper and lower connections, the center anchor point is used to snap one entity to one or more other entities, depending on the input text. For all other spatial conditions, i.e., close, front, behind, inside, and part in, the base anchor point is utilized to place entities at the same height level. Thus, after calculating the CAP and the BAP, we compute the margin m, which depends on the number of objects in the image, as described in Eq. (3):

m = \frac{s_{cv}}{n_o}     (3)

To synthesize the mask map properly, we specify the locations of the subjects and objects. For objects, we determine the location based on the computed margin m. Meanwhile, the location loc_sub(x_sub, y_sub) of the subject is computed as

loc_{sub} = f_1(CAP) + f_2(BAP)     (4)

where the functions f_1 and f_2 are defined as follows:

f_1(CAP) = \alpha \cdot CAP^{obj} - \gamma \cdot s^{sub} - o_{class}     (5)

f_2(BAP) = \beta \cdot BAP^{obj} + m + sp     (6)

Here, α, β ∈ {0, 1} depending on the spatial relation: if the connection between entities is upper or lower, the CAP is used and there is no need for the BAP, so α = 1 and β = 0. In contrast, if the connection between entities is front, behind, close, inside, or part in, the BAP is utilized, so α = 0 and β = 1. γ ∈ {0, 0.5, 1}: for each relation (upper, lower, close, front, behind, inside, part in), a specific value from this set is chosen and multiplied element-wise with the subject size s. The offset value o_class is computed only for some rideable classes, such as horse, elephant, bicycle, and motorcycle, and only if the relation is classified as an upper relation, because the center anchor point of these classes does not lie exactly on the back of the horse or elephant or on the seat of the bicycle or motorcycle; it is computed based on the object width and the subject height. The spatial relation value sp is calculated depending on the kind of connection between the entities and only when the base anchor point is used; it is computed based on the object width, where sp = ±w^{obj}.
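The following hedged sketch puts Eqs. (4)-(6) together for a single subject; the gamma value, the class offset, the sign of sp, and the broadcasting of the scalar terms m and sp over both coordinates are assumptions, since the paper states only their ranges and the general rules.

```python
# Hedged sketch of Eqs. (4)-(6): upper-left location of the subject relative to the object.
import numpy as np

UPPER_LOWER = {"upper", "lower"}

def subject_location(relation, cap_obj, bap_obj, s_sub, m, w_obj, gamma=0.5, o_class=(0, 0)):
    alpha = 1 if relation in UPPER_LOWER else 0          # CAP is used for upper/lower relations
    beta = 1 - alpha                                      # BAP is used for close/front/behind/inside/part in
    sp = 0 if alpha else w_obj                            # sp = +/- w_obj only when the BAP is used
    f1 = alpha * np.asarray(cap_obj, float) - gamma * np.asarray(s_sub, float) - np.asarray(o_class, float)
    f2 = beta * np.asarray(bap_obj, float) + m + sp       # Eq. (6); m and sp broadcast to both coordinates
    return f1 + f2                                        # Eq. (4): loc_sub = f1(CAP) + f2(BAP)

# "A man rides on a horse": an "upper" relation, so the CAP of the horse drives the placement.
print(subject_location("upper", cap_obj=(320, 240), bap_obj=(300, 400),
                       s_sub=(120, 260), m=0, w_obj=280, o_class=(10, 40)))
```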

For our input text example, "A man rides on a horse", we determine the location of the object "horse" based on the margin m, which is computed from the canvas size s_cv and the number of entities n_o. Then, we specify the location of the subject "man" based on the location of the object "horse", i.e., using the margin m, the detected center anchor point CAP, and the offset o_class, so that the "man" is placed exactly on top of the "horse".

With regard to the size of each mask, we group and sort all n_c semantic classes into seven dictionaries (tiny, little, small, medium, big, huge, and extremely huge) depending on the comparison of each entity's actual size with the other entities. Then, we scale each group by multiplying or dividing the mask size by a group-specific factor that accounts for the actual dimensions of the mask, since some masks are square while others are wider than they are tall, or vice versa. To handle the depth layer of each inserted object mask, we place the largest entity onto the segmentation map first in order to preserve the appearance of smaller entities, since a larger entity may occlude smaller ones. After specifying all three factors, the location, the size, and the depth of each subject and object, we snap the entities onto the canvas. In the final stage, we crop the constructed mask map by determining the bounding box (ROI) that encloses all entities together in order to highlight the main content. We specify the bounding box of each entity based on four points (left, upper, right, lower). For objects, the left and upper points are determined by the margin m, whereas the right and lower points are computed from the object mask size. For subjects, the left and upper points are specified by the upper-left location of the subject loc_sub, while the right and lower points are calculated from the subject mask size.

3.6. Image Synthesis

The mask map is fed into the state-of-the-art mask-to-image generator (M2I) to generate a realistic, high-resolution image of the same size. This step is important in our proposed framework for two reasons: we want to reuse the state-of-the-art mask-to-image component, and we want to ensure a high-quality input mask image. Mask-to-image synthesis aims to generate a realistic image from a semantic mask map and has a variety of applications such as photo editing, content generation, and image-to-image translation [1,16-28]. Generating an image from natural language descriptions is more generic, however, since it only requires text input instead of a mask map. We feed our semantic segmentation map to the state-of-the-art mask-to-image generator [1], since it produces more photorealistic images than previous methods and outperforms the previous state of the art on the COCO-stuff dataset.

4. Experiments

4.1. Dataset and Baselines

Dataset: In this paper, we evaluate the proposed method on the COCO-stuff dataset, which contains n_c = 182 classes, where the first 91 classes belong to things and the last 91 classes belong to stuff.

Baselines: We compare our proposed framework with state-of-the-art methods [2,6]. The first baseline is Object Pathways (OP) [2], which attempts to synthesize complicated images of multiple objects and preserve the spatial relations among objects in the generated images, based on both the StackGAN [4] and AttnGAN [6] modules. StackGAN [4] generates 256×256 reasonable images conditioned on text through two stages.

Table 2. Perceptual similarity (LPIPS) and Fréchet Inception Distance (FID) metrics.

Method                Mean LPIPS (↑)    FID (↓)
StackGAN + OP [2]     0.675             290.372
AttnGAN [6]           0.687             251.191
AttnGAN + OP [2]      0.683             266.118
Ours                  0.720             244.311

Table 3. User choice and Average Human Rank (HR).

Method                First rank         Second rank        Third rank
StackGAN + OP [2]     17/960 (1.77%)     402/960 (41.88%)   541/960 (56.35%)
AttnGAN [6]           12/960 (1.25%)     539/960 (56.15%)   409/960 (42.60%)
Ours                  931/960 (96.98%)   19/960 (1.98%)     10/960 (1.04%)

Fig. 7. User assessments for 30 captions comparing the images generated by different methods: StackGAN + OP [2], AttnGAN [6], and our proposed method.

Fig. 6. An excerpt of the interface used in our evaluation study. The benchmark methods are shuffled to hide their identities. The users are asked to rank the synthesis results.

In the first stage, StackGAN generates the object with its main features, such as shape and color; in the second stage, more details are added to refine the object and increase the resolution of the image. The second baseline is AttnGAN [6], which utilizes a global sentence vector together with word vectors to produce fine-grained images [6]. In addition, we validate our method by comparing it with AttnGAN + Object Pathways (OP) [2].

4.2. Evaluation on COCO-stuff

To validate our proposed framework, we conduct multiple experiments on the COCO-stuff dataset, including a visual comparison with previous text-to-image approaches alongside the caption and the corresponding ground truth. In addition, we conduct a user study to evaluate our framework against other text-to-image methods, since human judgment plays a significant role in such experiments. We use 30 images of resolution 256×256 to compare our proposed method with previous text-to-image approaches, in particular, StackGAN + Object Pathways (OP) [2], AttnGAN [6], and AttnGAN + Object Pathways (OP) [2]. We select images either from the released results of the pretrained models or from the published papers.

4.2.1. Visual Comparison

We compare our proposed method with the existing text-to-image works. In particular, we compare our resulting images with

StackGAN + OP [2], AttnGAN [6], and AttnGAN + OP [2], alongside the ground truth. As shown in Fig. 5, our proposed method works better and produces realistic, high-quality images which maintain the shape and texture of each entity, and thus it results in natural images, whereas all baselines generate images that look artificial and contain distortion. Because human judgment is subjective, Zhang et al. [31] proposed perceptual similarity metrics which mimic the process of human assessment of image similarity by computing distances between the extracted features of a generated image and those of the ground truth. Therefore, we adopt the Learned Perceptual Image Patch Similarity (LPIPS) metric [31] to measure the differences between each resulting image and the corresponding ground truth; in this comparison, the highest value corresponds to the image most similar to the query image and the lowest value to the least similar one. We compare our proposed method with StackGAN + Object Pathways [2], AttnGAN [6], and AttnGAN + Object Pathways [2] based on 30 resulting images and calculate the mean LPIPS for each method, as reported in Table 2. Our proposed approach surpasses all other methods in the experiments, with 72.0%, compared to 67.5%, 68.7%, and 68.3% for StackGAN + OP [2], AttnGAN [6], and AttnGAN + OP [2], respectively.
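A minimal sketch of how the mean LPIPS over the 30 image pairs can be computed with the lpips package accompanying [31] is shown below; the image loading and pairing are assumed to be done elsewhere, with RGB tensors scaled to [-1, 1].

```python
# Mean LPIPS over paired (generated, ground-truth) images using the lpips package [31].
import torch
import lpips

loss_fn = lpips.LPIPS(net="alex")          # AlexNet-based perceptual distance

def mean_lpips(generated: list[torch.Tensor], ground_truth: list[torch.Tensor]) -> float:
    """Average LPIPS over image pairs of shape (1, 3, H, W) with values in [-1, 1]."""
    with torch.no_grad():
        scores = [loss_fn(g, gt).item() for g, gt in zip(generated, ground_truth)]
    return sum(scores) / len(scores)
```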

Furthermore, we evaluate our proposed method against the baselines quantitatively using the Fréchet Inception Distance (FID) [36]. FID measures the distance between feature vectors computed for generated images and for real images: a lower FID score denotes that the two sets (generated and real images) are more similar, while a higher FID score indicates that they are less similar.

As can be seen in Table 2, our method achieves 244.311, compared to StackGAN + Object Pathways [2], AttnGAN [6], and AttnGAN + Object Pathways [2] with 290.372, 251.191, and 266.118, respectively, based on the resulting sets of 30 generated images. This experiment demonstrates the superiority of our proposed approach over the baselines, since a lower FID score indicates better performance in terms of the diversity and quality of the generated images.
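A sketch of one way to compute FID between folders of ground-truth and generated images is given below, here using the pytorch-fid package as an assumption (the paper does not state which implementation was used); the folder names are hypothetical.

```python
# FID between two image folders using the pytorch-fid package (an assumed implementation).
import torch
from pytorch_fid.fid_score import calculate_fid_given_paths

device = "cuda" if torch.cuda.is_available() else "cpu"
fid = calculate_fid_given_paths(
    ["images/ground_truth", "images/generated"],  # hypothetical folders of the 30 image pairs
    batch_size=10, device=device, dims=2048)      # 2048-d InceptionV3 pool features
print(f"FID: {fid:.3f}")
```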

4.2.2. User Study

We further conduct a user study in order to obtain subjective assessments of the images generated by our proposed method and the baselines. In particular, 32 students and staff members of a university, aged between 21 and 45 years old (16 females and 16 males), participated in the study. They were asked to rank the synthesized images in terms of image quality and similarity to the corresponding image captions. The results of the different methods are randomly placed under Method 1, 2, ..., n, where n is the number of compared methods. We compare our work with existing works, namely StackGAN + OP [2] and AttnGAN [6]. Fig. 6 depicts an excerpt of our questionnaire. Based on the participants' ranking choices, we compute the Average Human Rank (HR) to analyze the results. We find that our proposed method outperforms the previous state-of-the-art text-to-image methods, in particular StackGAN + OP [2] and AttnGAN [6], by a significant margin: 96.98% first-rank votes compared to 1.77% and 1.25%, respectively, as Table 3 and Fig. 7 demonstrate.

4.3. Experimental Analysis

Based on the results of the visual comparisons and the user study, we conclude that our proposed method produces clear, high-quality generated images without noise and distortion, whereas all baselines' results suffer from unnaturalness, ambiguity, and distortion. Furthermore, it is evident from our experiments that our proposed framework yields not only recognizable objects compared to the baselines but also different entities of the same object type in terms of shape, orientation, texture, and sometimes direction. Table 2 shows the quantitative results in terms of the FID and LPIPS scores: the method with the higher LPIPS score tends to have the lower FID score. Indeed, as shown in Table 2, our results are the most similar to the real images, since our method achieves the lowest FID score and the highest LPIPS score. Finally, our method produces images not only from simple sentences with just one subject and one object; it also generates images from complicated sentences and preserves the spatial relation constraints among all entities, as described below.

4.4. Complicated Scenes and Failure Cases

When the constructed input text is complicated, with many spatial relations, e.g., "A man rides on a horse behind two sheep close to a cow", we construct the mask map in a successive manner. We synthesize the first subject and object using the method described in Section 3.5. Then, each additional object is snapped onto the previous mask map based on its spatial relation (upper, lower, behind, front, close, inside, or part in) and the location of the corresponding subject. For our input example, "A man rides on a horse behind two sheep close to a cow", we first construct a mask map for the "upper" constraint, which consists of a "man" riding a "horse". Then, we snap the object(s) "sheep" onto the previously constructed mask map based on the "behind" spatial relation and the location of the subject "a man who rides a horse"; note that two different sheep masks are snapped onto the mask map. Subsequently, we snap the object "cow" onto the prior mask map based on the relation "close" and the location of the subject(s) "two sheep".

Fig. 8. Generated mask maps and the corresponding generated images for some complicated sentences using our proposed method.

Fig. 9. Some failure cases of our proposed method.
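As a rough, hypothetical illustration of this successive construction (entities are represented only by name here; the actual snapping applies the anchor-point rules of Section 3.5):

```python
# Sketch of successive mask-map construction: each clause is snapped relative to the
# previously placed group of entities. Helper and naming are illustrative only.
def build_complex_scene(first_pair, extra_clauses):
    """first_pair: (subject, relation, object); extra_clauses: list of (relation, object, count)."""
    subject, relation, obj = first_pair
    scene = [f"{subject} --{relation}--> {obj}"]          # e.g., man --upper--> horse
    anchor = f"{subject} on {obj}"
    for relation, obj, count in extra_clauses:
        for k in range(count):
            scene.append(f"{obj}#{k + 1} --{relation}--> {anchor}")
        anchor = f"{count} {obj}"                         # the next clause anchors to this group
    return scene

# "A man rides on a horse behind two sheep close to a cow"
for step in build_complex_scene(("man", "upper", "horse"),
                                [("behind", "sheep", 2), ("close", "cow", 1)]):
    print(step)
```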

Fig. 8 demonstrates the results of the T2M and M2I components of our proposed method on complicated sentences, while Fig. 9 illustrates some failure cases of our proposed method.

5. Conclusion and Future Work

In this paper, we propose a new approach to synthesize realistic images from input text based on detecting the anchor points, namely the center and base anchor points, of the retrieved masks. The anchor points are utilized to snap entities together and form a proper mask map in the first stage (T2M). Second, we feed our synthesized mask map into the state-of-the-art mask-to-image generator to produce the photorealistic image in the second stage (M2I). This paper focuses on the text-to-mask stage, which complements the availability of state-of-the-art mask-to-image generators; we are exploring an end-to-end solution as future work. Our experiments show that our proposed framework produces better results, with higher-quality images, than previous state-of-the-art text-to-image generators for complicated scenes. Many modifications and experiments are left for future work: we plan to adapt our proposed method to different mask-to-image approaches, and we aim to further enhance our results and generate more realistic images.

Acknowledgement

The first author would like to thank Umm Al-Qura University in Saudi Arabia for its continuous support. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPU used for this research.

References

1. T. Park, M. Liu, T. Wang, and J. Zhu, "Semantic image synthesis with spatially-adaptive normalization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
2. T. Hinz, S. Heinrich, and S. Wermter, "Generating multiple objects at spatially distinct locations," in International Conference on Learning Representations, 2019.
3. S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text to image synthesis," in Proceedings of the International Conference on Machine Learning, 2016.
4. H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in Proceedings of the International Conference on Computer Vision, 2017.
5. H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, "StackGAN++: Realistic image synthesis with stacked generative adversarial networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
6. T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, "AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
7. S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, "Learning what and where to draw," in Conference on Neural Information Processing Systems, 2016.
8. S. Hong, D. Yang, J. Choi, and H. Lee, "Inferring semantic layout for hierarchical text-to-image synthesis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
9. W. Li, P. Zhang, L. Zhang, Q. Huang, X. He, S. Lyu, and J. Gao, "Object-driven text-to-image synthesis via adversarial training," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
10. J. Johnson, A. Gupta, and L. Fei-Fei, "Image generation from scene graphs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
11. S. Tripathi, A. Bhiwandiwalla, A. Bastidas, and H. Tang, "Using scene graph context to improve image generation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
12. B. Zhao, L. Meng, W. Yin, and L. Sigal, "Image generation from layout," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
13. A. Dosovitskiy, J. T. Springenberg, M. Tatarchenko, and T. Brox, "Learning to generate chairs with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
14. K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, "DRAW: A recurrent neural network for image generation," in Proceedings of the International Conference on Machine Learning, 2015.
15. X. Yan, J. Yang, K. Sohn, and H. Lee, "Attribute2Image: Conditional image generation from visual attributes," in Proceedings of the European Conference on Computer Vision, 2016.
16. P. Isola, J. Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
17. Q. Chen and V. Koltun, "Photographic image synthesis with cascaded refinement networks," in Proceedings of the International Conference on Computer Vision, 2017.
18. T. C. Wang, M. Y. Liu, J. Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, "High-resolution image synthesis and semantic manipulation with conditional GANs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
19. X. Huang, M. Y. Liu, S. Belongie, and J. Kautz, "Multimodal unsupervised image-to-image translation," in Proceedings of the European Conference on Computer Vision, 2018.
20. M. Y. Liu, T. Breuel, and J. Kautz, "Unsupervised image-to-image translation networks," in NeurIPS, 2017.
21. M. Li, H. Huang, L. Ma, W. Liu, T. Zhang, and Y. Jiang, "Unsupervised image-to-image translation with stacked cycle-consistent adversarial networks," in Proceedings of the European Conference on Computer Vision, 2018.
22. J. Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman, "Toward multimodal image-to-image translation," in NeurIPS, 2017.
23. A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, "Learning from simulated and unsupervised images through adversarial training," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
24. K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, "Unsupervised pixel-level domain adaptation with generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
25. Y. Taigman, A. Polyak, and L. Wolf, "Unsupervised cross-domain image generation," in International Conference on Learning Representations, 2017.
26. Z. Yi, H. Zhang, P. Tan, and M. Gong, "DualGAN: Unsupervised dual learning for image-to-image translation," in Proceedings of the International Conference on Computer Vision, 2017.
27. A. Ghosh, V. Kulharia, V. Namboodiri, P. H. Torr, and P. K. Dokania, "Multi-agent diverse generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
28. A. Bansal, Y. Sheikh, and D. Ramanan, "PixelNN: Example-based image synthesis," in International Conference on Learning Representations, 2018.
29. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 1998.
30. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.
31. R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
32. Cleanpng, https://www.cleanpng.com/
33. SVG SiLH, https://svgsilh.com/
34. pngTree, https://pngtree.com/
35. OnlyGFX, https://www.onlygfx.com/
36. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in NIPS, 2017.

Conflict of interest: none.