An image augmentation approach using two-stage generative adversarial network for nuclei image segmentation


Biomedical Signal Processing and Control 57 (2020) 101782


Siddharth Pandey, Pranshu Ranjan Singh, Jing Tian ∗

Institute of Systems Science, National University of Singapore, Singapore 119615, Singapore

Article history: Received 12 February 2019; Received in revised form 23 October 2019; Accepted 16 November 2019

Keywords: Image augmentation; Image segmentation; Generative adversarial network; Nuclei detection in image

Abstract

The major challenge in applying deep neural network techniques in the medical imaging domain is how to cope with small datasets and the limited amount of annotated samples. Data augmentation procedures, which include conventional geometrical-transformation-based augmentation techniques and the recent image synthesis techniques using generative adversarial networks (GANs), can be employed to artificially increase the number of training images. This paper is focused on data augmentation for the image segmentation task, which poses an inherent challenge compared to the conventional image classification task, due to its requirement to produce a corresponding mask for each generated image. To tackle the challenge of image-mask pair augmentation for image segmentation, this paper proposes a novel two-stage generative adversarial network. The proposed approach first employs a GAN to generate a synthesized binary mask, then incorporates this synthesized mask into a second GAN to perform conditional generation of the synthesized image. Thus, these two GANs collaborate to generate the synthesized image-mask pairs, which are used to improve the performance of conventional image segmentation approaches. The proposed approach is evaluated using the cell nuclei image segmentation task and demonstrates superior performance, outperforming both the traditional augmentation methods and the existing GAN-based augmentation methods in extensive experiments conducted on the benchmark Kaggle cell nuclei image segmentation dataset. © 2019 Elsevier Ltd. All rights reserved.

1. Introduction

Machine learning methods, particularly deep neural networks, have been considered to be among the most effective methods for image segmentation [1,2]. For instance, in the context of cell nuclei detection, deep convolutional neural networks are used to segment cell nuclei in various kinds of medical images [3–6]. Computational pathology and microscopy images play a significant role in decision making for disease diagnosis, since they can provide extensive information that enables quantitative analysis of digital images at a high throughput processing rate. The accurate identification of cell nuclei, if performed manually, is both time and labor intensive, requiring many hours of work by highly skilled medical professionals. Automatic nuclei segmentation from cell images is a key technology in the computational pathology pipeline [7,8]. Various approaches have been developed in the literature for automatic nuclei image segmentation to quantitatively model nuclei features and information in the image [9,10].

∗ Corresponding author. E-mail address: [email protected] (J. Tian). https://doi.org/10.1016/j.bspc.2019.101782 1746-8094/© 2019 Elsevier Ltd. All rights reserved.

The major challenge in applying deep neural network technology in the medical imaging domain is dealing with small datasets and the limited amount of annotated samples, while large-scale labeled training datasets are becoming increasingly important with the rise in the capabilities of deep learning methods [11]. To be more specific, in biomedical image processing, due to the need for specialized knowledge and the intensive effort required in labeling medical images, obtaining sufficient training images (e.g., thousands of images) is difficult in many problem domains. To address this lack of data for the effective training of deep neural networks, data augmentation techniques are potential methods to increase the size of the training dataset. Traditional data augmentation techniques, which mostly rely on performing geometric transformations such as rotation, zoom, and scaling, have been commonly used for training deep learning models for medical image segmentation [12,13]. The generative adversarial network (GAN) provides an alternative approach for generating synthetic images [14]. When used for the image segmentation task, this technique poses an inherent challenge of generating masks for the synthetically generated images [15]. In this paper, a novel data augmentation architecture exclusively for the image segmentation task is proposed to generate


image-mask pairs. Compared to the conventional image classification task, there is an intrinsic challenge due to the requirement to produce a corresponding mask for each generated image. More specifically, it is not straightforward to apply the hidden transformation function used in GAN-based techniques to generate such image-mask pairs, whereas the traditional augmentation techniques can simply apply identical transformations to both images and masks. The contributions of the proposed approach are as follows:

• A novel image augmentation architecture is proposed to generate the image-mask pair, which is a critical task for image segmentation. The proposed approach exploits two GANs that collaborate to generate a synthetic mask and a synthetic image, respectively, to form an image-mask pair.

• The synthesized image-mask pairs generated by our approach are verified to be useful in improving image segmentation performance, specifically for nuclei cell image segmentation, and outperform both conventional data augmentation techniques and the existing GAN-based augmentation techniques, as verified in extensive experiments conducted in Section 5.

The rest of the paper is organized as follows. In Section 2, a brief review of related work is provided, with a focus on data augmentation for medical images. Section 3 presents the proposed approach in detail. The experimental set-up is described in Section 4. The results are provided in Section 5 to compare the proposed approach with other data augmentation approaches. An insightful discussion is presented in Section 6. Finally, Section 7 concludes this paper.

2. Related work

Training deep neural network models with a large number of parameters on an insufficiently sized training dataset leads to considerable model over-fitting, causing an undesirable impact on model accuracy [16]. This challenge can be addressed by artificially enlarging the dataset using label-preserving transformations [17], such as translation, reflection, rotation and other geometric transformations. The shearing operation that deforms an image is particularly useful in biomedical image segmentation, since the most common form of variation in the tissue of organisms is deformation [11,17]. Generative adversarial networks (GANs) have been used to generate synthetic images to improve variability and enrich the training image dataset, further improving the training process [14]. The GAN is composed of two sub-networks: a generator network for the generation of synthesized images and a discriminator network for the discrimination of synthetic images from real images. A deep convolutional generative adversarial network (DCGAN) method is developed in [18] by using a convolutional neural network-based discriminative model as a robust feature extractor for more stable model training. A further improvement on DCGAN is proposed in the Wasserstein GAN (WGAN) through the use of the Wasserstein distance, also known as the earth mover's distance (EMD), as the GAN's loss function [19]. This is combined with a gradient penalty enhancement to improve the convergence of GAN training in the Wasserstein GAN with Gradient Penalty (WGAN-GP) [23]. A few GAN-based methods have been proposed for image data augmentation, such as CycleGAN for handwritten digits [16] and a conditional variational auto-encoder GAN (CVAE-GAN) for facial recognition [20], but their applications are limited to object recognition tasks. The use of GANs for augmentation in the supervised image segmentation task requires the generation of both images and their corresponding masks, which is not extensively

explored and remains an understudied problem. A straightforward approach is proposed in [15] to concatenate the binary mask along the channel axis to create the image-mask pairs and proceed with the conventional GAN training procedure; thus, it generates image-mask pairs instead of just images. Compared to our proposed architecture, which relies on conditional generation, we find that it is much more difficult to tune the hyperparameters of such GANs to produce synthetic image-mask pairs of adequate quality. By adding the mask to the channel axis, the complexity of the task of generating the synthetic output is increased, resulting in longer training time and frequent mode collapses. Concatenating the mask to the channel axis of the image is one possible solution; another feasible approach uses image-to-image translation, as shown in pix2pix [21]. Image-to-image translation enables reconstructing objects from edge or label maps by learning conditional generative models that produce an output for a given image. A similar image-to-image translation for producing a new image-mask pair has been proposed in [22], which takes the binary mask and noise as input and outputs a phantom of the original image. The noise component accounts for the visual diversity of the generated image. A style transfer component is also incorporated to improve the visual quality of the produced image. Their work also shows an improvement over the results of the baseline segmentation model when trained with additional synthesized images. The architecture described in [22] relies on a noise and a style component to generate a new synthetic image, but it does not generate any new masks; both the real image and the synthetic image share the same mask as the ground truth. Our proposed architecture follows a different approach, where we synthesize new masks and generate the corresponding images. By doing so, we can augment the dataset with greater visual diversity.

3. Proposed two-stage generative adversarial network for image augmentation

The goal of the proposed approach is to generate new image-mask pairs using the existing image-mask pairs (from the given dataset) and improve the segmentation results. Specifically, the nuclei cell image dataset for segmentation is considered in this paper. The segmentation task is defined as identifying the pixels corresponding to the nuclei in the cell image, which corresponds to segregating nuclei (foreground) from the rest of the image (background). Consider the input domain of real image-mask pairs, R ⊂ {(I, M) : I ∈ R^{H×W×3}, M ∈ R^{H×W×1}}, and the output domain of synthetic image-mask pairs, S ⊂ {(I′, M′) : I′ ∈ R^{H×W×3}, M′ ∈ R^{H×W×1}}, where (I, M) in R denotes the ordered pair of a real image and a real mask, and (I′, M′) in S denotes the ordered pair of a synthetic image and a synthetic mask. The dimensions of the real and the synthetic image are H × W × 3, where H and W denote the height and width of the image, respectively, for a 3-channel RGB-colored image. Since the task is to separate the nuclei pixels from the rest of the given image, the obtained mask is binary, where white pixels are the cell nuclei and black pixels are the background; both the real and synthetic masks are binary images with a dimension of H × W × 1. The generated image-mask pairs will be used as additional augmentation samples to improve the performance of the segmentation task.

GANs are generative models that learn a mapping function from a random noise vector z to an output image y, generated via G : z → y [14]. The generator and discriminator are trained in an adversarial fashion to learn the data distribution of y. Conditional GANs learn a mapping function from an observed image x and the random noise vector z to an output image y via G : x, z → y [21]. The conditional GAN setup allows for an additional image variable x along with the random noise z, and acts as a transition from image


Fig. 1. The overview of the proposed two-stage generative adversarial network.

Table 1
The architecture of the first stage generator (G1) and the first stage discriminator (D1) of the proposed approach.

Generator (G1):
  Dense: 256 × 16 × 16 nodes
  Activation: ReLU
  Reshape: (16, 16, 256) output shape
  Up sampling 2D: (2, 2) size
  Convolution 2D: 128 filters, 4 × 4 kernel
  Activation: ReLU
  Up sampling 2D: (2, 2) size
  Convolution 2D: 64 filters, 4 × 4 kernel
  Activation: ReLU
  Up sampling 2D: (2, 2) size
  Convolution 2D: 32 filters, 4 × 4 kernel
  Activation: ReLU
  Convolution 2D: 1 filter, 4 × 4 kernel
  Activation: Tanh

Discriminator (D1):
  Convolution 2D: 16 filters, 3 × 3 kernel, (2, 2) strides
  Activation: Leaky ReLU, Alpha 0.2
  Dropout: 0.25 rate
  Convolution 2D: 32 filters, 3 × 3 kernel, (2, 2) strides
  Zero Padding 2D: ((0, 1), (0, 1)) padding
  Activation: Leaky ReLU, Alpha 0.2
  Dropout: 0.25 rate
  Convolution 2D: 64 filters, 3 × 3 kernel, (2, 2) strides
  Activation: Leaky ReLU, Alpha 0.2
  Dropout: 0.25 rate
  Convolution 2D: 128 filters, 3 × 3 kernel, (1, 1) strides
  Activation: Leaky ReLU, Alpha 0.2
  Dropout: 0.25 rate
  Flatten and Dense: 1 node
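As a concrete reading of the layer listing for G1 in Table 1, the following is a minimal Keras sketch. The layer types and hyperparameters follow the table; the 'same' padding and the noise dimension of 100 (stated later in Section 3.3) are assumptions where the table is silent, so this should be read as an illustration rather than the authors' exact implementation.

```python
# Minimal Keras sketch of the first stage generator (G1) following Table 1.
# padding='same' is an assumption; the table does not specify the padding mode.
from tensorflow.keras import layers, models

def build_g1(noise_dim=100):
    model = models.Sequential([
        layers.Dense(256 * 16 * 16, input_dim=noise_dim),  # 256 x 16 x 16 nodes
        layers.Activation('relu'),
        layers.Reshape((16, 16, 256)),
        layers.UpSampling2D((2, 2)),                        # 16 -> 32
        layers.Conv2D(128, (4, 4), padding='same'),
        layers.Activation('relu'),
        layers.UpSampling2D((2, 2)),                        # 32 -> 64
        layers.Conv2D(64, (4, 4), padding='same'),
        layers.Activation('relu'),
        layers.UpSampling2D((2, 2)),                        # 64 -> 128
        layers.Conv2D(32, (4, 4), padding='same'),
        layers.Activation('relu'),
        layers.Conv2D(1, (4, 4), padding='same'),           # single-channel mask
        layers.Activation('tanh'),
    ])
    return model  # output shape: (128, 128, 1)
```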

x to y. The random vector z for both GANs and conditional GANs is drawn from a prior distribution, usually a Gaussian or uniform distribution [19]. The proposed two-stage GAN comprises two GAN architectures, both of which collaborate to synthesize new image-mask pairs. The following two subsections present the architecture overview of the proposed two-stage GAN approach, and the network configurations and training methodology for the generators and discriminators.

3.1. Architecture overview

The proposed approach consists of two stages, which are aligned sequentially as shown in Fig. 1. The aim of the first stage GAN is to learn the data distribution of the masks so that it can generate new synthetic masks. The first stage GAN learns a mapping from a random noise vector z to an output mask m, G1 : z → m. The generator G1 is trained to produce synthetic masks that cannot be distinguished from real masks by an adversarially trained discriminator D1, which in turn is trained to do as well as possible at detecting the generator's synthetic masks. The second stage GAN learns a mapping from a random noise vector z and a conditional mask m to an output image i, G2 : z, m → i. The generator G2 is trained to produce synthetic images that cannot be distinguished from real images, using the real masks as the condition. The adversarial discriminator D2 is trained to minimize the combined loss of the image translation from the real mask to the real image and the image translation from the real mask to the synthetic image.
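To make the composition of the two mappings G1 : z → m and G2 : z, m → i explicit, the following is a minimal sketch of how synthetic image-mask pairs could be sampled once both stages are trained. The mask is copied onto three channels before conditioning G2, as described in the next paragraph; the function names (g1, g2), NumPy usage, and the assumption that G2 is a two-input Keras model are illustrative rather than taken from the authors' code.

```python
# Illustrative sampling of synthetic image-mask pairs with the two trained stages.
import numpy as np

def sample_pairs(g1, g2, n_pairs, noise_dim=100):
    """g1: noise -> (H, W, 1) mask; g2: (noise, 3-channel mask) -> (H, W, 3) image."""
    z_mask = np.random.normal(size=(n_pairs, noise_dim))
    masks = g1.predict(z_mask)                  # first stage: synthetic masks M'
    masks_3ch = np.repeat(masks, 3, axis=-1)    # copy the mask onto three channels
    z_img = np.random.normal(size=(n_pairs, noise_dim))
    images = g2.predict([z_img, masks_3ch])     # second stage: conditional images I'
    return images, masks                        # one-to-one image-mask pairs (I', M')
```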

Both GANs of the proposed approach are trained independently. The generation of synthetic image-mask pairs is performed sequentially: first the synthetic mask and then the corresponding synthetic image is generated. The first stage GAN is trained to generate synthetic masks, S_M ⊂ {M′ : M′ ∈ R^{H×W×1}}. The generated masks are expanded from the single channel to three channels (by copying the mask onto the other two channels), so that the expanded synthetic masks lie in R^{H×W×3}. The second stage GAN takes the random noise and the above generated synthetic masks as inputs and generates the corresponding synthetic images, S_I ⊂ {I′ : I′ ∈ R^{H×W×3}}. Each synthetic mask generated in the first stage has a one-to-one mapping with the synthetic image generated in the second stage. Thus, the required image-mask pair set, S ⊂ {(I′, M′) : I′ ∈ R^{H×W×3}, M′ ∈ R^{H×W×1}}, is obtained by combining the corresponding synthesized mask and synthesized image obtained from the two stages.

3.2. Network configuration

The proposed two-stage conditional generative adversarial network consists of two GAN networks, both of which are described in detail in this section.

• The first stage generator (denoted as G1) takes a random noise vector as the input and outputs a generated single-channel image. The input noise passes through a sequential deep convolution network as represented in Table 1. The layers of the network


Fig. 2. The detailed architecture of the generator G2 in the proposed approach.

Table 2
The summary of the noise component used in the second stage generator (G2) in the proposed approach.

  Dense: 256 × 4 × 4 nodes
  Batch normalization: Momentum 0.9
  Activation: Leaky ReLU, Alpha 0.2
  Reshape: (4, 4, 256) output shape
  Up sampling 2D: (2, 2) size
  Convolution 2D: 128 filters, 3 × 3 kernel
  Batch normalization: Momentum 0.9
  Activation: Leaky ReLU, Alpha 0.2

Table 3
The summary of the variational component used in the second stage generator (G2) of the proposed approach.

  Mean, μ: Dense, 256 nodes
  Variance, σ²: Dense, 256 nodes
  Latent representation: Concatenate, [μ, σ²]
  Variational output: Lambda, Gaussian distribution (μ, σ²)
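The variational component in Table 3 can be read as a sampling head that maps encoder features to a latent code drawn from a Gaussian parameterized by a learned mean and variance. Below is a minimal Keras sketch of that reading; it assumes a log-variance parameterization and reparameterized sampling inside the Lambda layer, and it passes μ and σ² to the Lambda as a list rather than through the Concatenate listed in the table, which is a functionally equivalent simplification rather than the authors' exact layout.

```python
# Hedged sketch of the variational component of G2 (Table 3).
# Assumes log-variance parameterization and reparameterized Gaussian sampling.
import tensorflow as tf
from tensorflow.keras import layers

def variational_component(features):
    """features: flattened encoder features; returns a sampled 256-d latent vector."""
    mu = layers.Dense(256)(features)        # Mean, 256 nodes
    log_var = layers.Dense(256)(features)   # (Log-)variance, 256 nodes

    def sample(args):
        mu, log_var = args
        eps = tf.random.normal(tf.shape(mu))        # standard Gaussian noise
        return mu + tf.exp(0.5 * log_var) * eps     # sample from N(mu, sigma^2)

    return layers.Lambda(sample)([mu, log_var])     # variational output
```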

G1 comprise three instances of the combination of Up-sampling, Convolution, and ReLU Activation layers. The output of the final convolutional layer has a depth of 1, and the Tanh function is used as the activation function. The depth-1 channel output corresponds to the generated mask.

• The first stage discriminator (denoted as D1) takes an interpolated image as the input, which is obtained by performing a random weighted average of the real mask and the generated mask obtained from the generator G1. This image then passes through a sequential deep convolutional network as represented in Table 1. The layers of the network D1 have four instances of the combination of Convolution, Leaky ReLU Activation, and Dropout

layers. The output of the final dropout layer is flattened and passed through a single-node dense layer to obtain the validity score (i.e., real/synthetic). The loss function used for the discriminator is a combination of the Wasserstein loss and the gradient penalty loss [23]. The Wasserstein loss computes the distance between the generated images and the real images, which provides a score for how realistic the generated images are. The gradient penalty is a gradient regularizer that penalizes the Wasserstein loss in the cases where the gradient norm of the discriminator exceeds 1.

• The second stage generator (denoted as G2), as shown in Fig. 2, has both the real mask and the random noise as inputs. The generator follows a U-Net-like architecture [11], which relies on skip connections between the down-sampling and up-sampling layers. Two modifications are proposed to the conventional U-Net

Fig. 3. The detailed architecture of the discriminator D2 in the proposed approach.
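As a reference for the D1 objective described in Section 3.2 (Wasserstein loss plus gradient penalty [23]), the sketch below shows one common way to compute the penalty on randomly interpolated samples, mirroring the random weighted average of real and generated masks fed to D1. The penalty weight lambda_gp = 10 and the TensorFlow GradientTape formulation are conventional assumptions, not values stated in the paper.

```python
# Hedged sketch of a WGAN-GP style discriminator objective for D1.
import tensorflow as tf

def d1_loss(discriminator, real_masks, fake_masks, lambda_gp=10.0):
    # Random weighted average (interpolation) between real and generated masks.
    eps = tf.random.uniform([tf.shape(real_masks)[0], 1, 1, 1])
    interpolated = eps * real_masks + (1.0 - eps) * fake_masks
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        d_interp = discriminator(interpolated)
    grads = tape.gradient(d_interp, interpolated)
    grad_norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    gradient_penalty = tf.reduce_mean(tf.square(grad_norm - 1.0))
    # Wasserstein critic loss: real samples should score higher than fakes.
    wasserstein = (tf.reduce_mean(discriminator(fake_masks))
                   - tf.reduce_mean(discriminator(real_masks)))
    return wasserstein + lambda_gp * gradient_penalty
```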


Table 4
The performance evaluation of various augmentation methods, which are used to generate image samples for the 10-layer U-Net model [11] for the image segmentation task. The bold values in the original article indicate the highest (best) performance.

Augmentation method | Image samples used | Dice | IoU | Precision | Recall | Specificity
No augmentation | Real image only | 0.7870 | 0.7147 | 0.7489 | 0.8417 | 0.9689
Traditional augmentation | Both real + synthetic image | 0.8154 | 0.7476 | 0.7924 | 0.8537 | 0.9306
WGAN-GP [15] | Synthetic image only | 0.7942 | 0.6999 | 0.7911 | 0.8403 | 0.9063
WGAN-GP [15] | Both real + synthetic image | 0.8168 | 0.7410 | 0.7984 | 0.8523 | 0.9391
Proposed approach | Synthetic image only | 0.7367 | 0.6220 | 0.7859 | 0.7465 | 0.8853
Proposed approach | Both real + synthetic image | 0.8251 | 0.7521 | 0.8238 | 0.8594 | 0.8858

Table 5
The evaluation of the proposed approach, which is used to generate various numbers of synthesized image samples, for the 10-layer U-Net model [11] for the image segmentation task.

Image samples used | Number of images | Dice | IoU | Precision | Recall | Specificity
Synthetic image only | 1000 | 0.6902 | 0.5668 | 0.7569 | 0.6849 | 0.8874
Synthetic image only | 5000 | 0.7401 | 0.6247 | 0.7705 | 0.7602 | 0.8890
Synthetic image only | 10,000 | 0.7367 | 0.6220 | 0.7859 | 0.7465 | 0.8853
Synthetic image only | 20,000 | 0.7203 | 0.5979 | 0.7815 | 0.7267 | 0.8926
Both real + synthetic image | 1000 | 0.8158 | 0.7464 | 0.8028 | 0.8558 | 0.8763
Both real + synthetic image | 5000 | 0.8198 | 0.7477 | 0.8027 | 0.8670 | 0.8822
Both real + synthetic image | 10,000 | 0.8251 | 0.7521 | 0.8238 | 0.8594 | 0.8858
Both real + synthetic image | 20,000 | 0.8148 | 0.7393 | 0.8138 | 0.8486 | 0.8875

to make it function as a generator for a GAN: (i) a noise component and (ii) a variational component. The noise component takes a random noise vector as the input and passes it through the convolution network depicted in Table 2. The conditional mask input is passed through a network with layers comprising multiple Convolution, Batch Norm, Leaky ReLU components [18]. The conditional mask passes through four such instances, along with a down-sampling of the mask. The down-sampled mask then passes through another instance of Convolution, Batch Norm, Leaky ReLU and a dropout layer. The output of this layer then goes through the variational component, which outputs latent features of the same shape, as depicted in Fig. 2 and Table 3. The variational output is then concatenated with the output of the noise component and passes through two instances of Convolution, Batch Norm, Leaky ReLU layers. This output is then concatenated with the up-sampled output of the corresponding encoder unit and passes through two instances of Convolution, Batch Norm, Leaky ReLU layers. Finally, this process is repeated twice, and the result passes through a convolution layer with Tanh activation to generate a synthetic image.

• The second stage discriminator (denoted as D2), as illustrated in Fig. 3, takes two inputs, the mask and the image, and outputs a single-channel convolution map. For this, the input mask and the image are concatenated along the channel axis and then passed through a network of multiple Convolution and Leaky ReLU layers. The loss function used for the combined model is a weighted combination of the mean squared error and the mean absolute error.

3.3. Network training

The training of the first stage GAN follows the training procedure described in WGAN [19]. For each generator weight update iteration, the discriminator's weights are updated n-critic times. Typically, the n-critic value ranges from 2 to 4. Root mean square propagation (RMSProp) [27] is used as the optimizer, with a batch size of 32 and a learning rate of 0.00005. The random noise vector is set to have a dimension of 100 × 1.

The training details of the second stage GAN are as follows. Mini-batch stochastic gradient descent with a batch size of 32 is employed, using the Adam optimizer with a learning rate of 0.0002 and a beta decay of 0.5. The random noise vector is set to have a dimension of 100 × 1. The discriminator loss is calculated

for two mask-image pairs: the real mask paired with the real image, and the real mask paired with the synthetic image. Both loss values are averaged over the convolution map to obtain the final discriminator loss.

4. Experimental setup

4.1. Dataset and preparation

Experiments are conducted to evaluate our proposed method against various image augmentation techniques and their performance improvement for the segmentation task. The dataset used for the experiments is the benchmark nuclei cell image set released in the first stage of the Kaggle Data Science Bowl competition [24]. The dataset contains 670 nuclei cell images along with their corresponding masks. The nuclei cell images and corresponding masks are 3-channel (RGB) and 1-channel binary, respectively. Furthermore, the dataset has been divided into a train and a test set, with the last 10% of images forming the test set. The image resolution has been reduced to 128 × 128, since large-scale GAN image generation and training stability are challenging.

4.2. Performance measures

Five performance metrics, including (i) intersection over union (IoU), (ii) Dice coefficient, (iii) Precision, (iv) Recall, and (v) Specificity, are used to evaluate the performance of the segmentation model on the test set. The pixels identified by the U-Net model that match the ground truth are the True Positives (TP). The pixels erroneously segmented by the model are the False Positives (FP), the pixels which were missed by the model but should have been identified are the False Negatives (FN), and the background pixels correctly left unsegmented are the True Negatives (TN). The IoU coefficient is defined as

IoU = TP / (TP + FP + FN).   (1)

The Dice coefficient is defined as

DSC = 2 × TP / ((TP + FP) + (TP + FN)).   (2)

Following the same convention, Precision, Recall and Specificity are defined as

Precision = TP / (TP + FP),   (3)


Fig. 4. The image samples synthesized by the conventional WGAN-GP approach [15].

Recall = TP / (TP + FN),   (4)

Specificity = TN / (TN + FP).   (5)
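For completeness, the five metrics in Eqs. (1)-(5) can be computed from a predicted binary mask and its ground truth as sketched below; this NumPy implementation is illustrative and is not taken from the original evaluation code.

```python
# Illustrative computation of the evaluation metrics in Eqs. (1)-(5)
# from binary prediction and ground-truth masks.
import numpy as np

def segmentation_metrics(pred, gt):
    """pred, gt: binary arrays of the same shape (1 = nucleus, 0 = background)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    return {
        "IoU": tp / (tp + fp + fn),
        "Dice": 2 * tp / ((tp + fp) + (tp + fn)),
        "Precision": tp / (tp + fp),
        "Recall": tp / (tp + fn),
        "Specificity": tn / (tn + fp),
    }
```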

4.3. Experiment procedure

Images generated by the traditional image augmentation techniques are produced through a pipeline of different transformations, such as rotation, width-height shift, shear, zoom and horizontal flip. The WGAN-GP approach [15] used in the experi-


Fig. 5. The image samples synthesized by the first stage GAN of the proposed approach.

ments is trained on four-channel images (RGB image + mask concatenated along the channel axis). The generator is trained to output the image-mask pair for a given input noise, while the discriminator is trained to differentiate between the real and the fake image-mask pairs. Images generated by the WGAN-GP approach (after training over 60,000 epochs) are used for the augmentation purpose. The proposed approach is trained for 30,000 epochs for generating synthetic masks, and for 285 epochs for generating synthetic images from the corresponding synthetic masks. The Wasserstein loss with the gradient penalty is minimized for training the first stage GAN. For the second stage GAN, a combination of the mean squared error and the mean absolute error is used as the loss function. To perform the comparative analysis of the various augmentation techniques, the image segmentation model, a 10-layer deep U-Net model [11], is kept identical across all data augmentation approaches in the experiments. The IoU loss function is used to train the model with the Adam [28] optimizer and a learning rate of 10⁻⁴. The configuration remains fixed for the span of the experiment, with the input to the model coming from the different GAN-based or traditional augmentation techniques.

5. Experimental results and comparisons

5.1. Objective performance comparison

Tables 4 and 5 report the different performance metrics, including Dice, IoU, Precision, Recall and Specificity. To improve the reliability of the results, each experiment has been repeated 3 times, and the resulting mean value is reported. Table 4 compares the

results obtained after performing segmentation with 4 different initial setups. Table 5 describes the performance of the image segmentation with different numbers of images generated by the proposed approach. As seen from Tables 4 and 5, when the segmentation task is performed with the mix of the real and 10K synthetic images, the two GAN-based data augmentation methods outperform the no-augmentation and traditional augmentation methods, and the proposed approach outperforms the conventional WGAN-GP approach [15] with a higher IoU (0.7521 as compared to 0.7410) and a higher Precision (0.8238 as compared to 0.7984) score. This verifies that generating synthetic images helps the image segmentation model to perform better on the segmentation task than using no augmentation or traditional augmentation. As seen in Table 5, having a larger number of synthetic images can help the image segmentation task; however, if the number of such images is too large, then the image segmentation performance can drop.

5.2. Qualitative performance comparison

This section presents the qualitative performance comparison among image samples synthesized by the various image augmentation approaches. First, as shown in Fig. 4, some images generated by the WGAN-GP approach [15] are not colorized properly. Second, for the first stage GAN output of the proposed approach, as shown in Fig. 5, the generated masks are similar to those generated by the WGAN-GP approach [15]. The shapes of most nuclei masks are accurate and similar to those provided in the training set. The generated images obtained by image translation from the conditional masks are similar to the original images. Some generated images


Fig. 6. The image samples synthesized by the second stage GAN of the proposed approach.

have a greyish patch which is not present in the original images, as shown in Fig. 6. Lastly, the generated image-mask pairs obtained by the proposed approach, as shown in Fig. 7, look better than those generated by the WGAN-GP approach [15], since there is no colored patch present in the generated images. However, the generated images are slightly blurred compared to the images generated by the WGAN-GP approach [15]. The shapes of the nuclei in the images and the corresponding masks are mostly accurate for both approaches. The Intel® Movidius™ Neural Compute Stick enables quicker generation of image feature embeddings than running on a CPU.

5.3. Training time

All the experiments are performed on a standard PC with an Intel Core i7 CPU and an Nvidia GeForce GTX 1080 Ti graphics card with 12 GB memory. Table 6 summarizes the training time for the proposed method and the WGAN-GP method [15]. The combined training time of our method for both stages is around 3.5 h, which is considerably less than the training time of the WGAN-GP method, which is 7.5 h. The method employed in [15] takes a monolithic approach by generating a 4-channel image from noise, ignoring the relationship between image and mask that is inherently present in a typical segmentation task. On the contrary, we divide this task into two simpler sub-tasks. The first sub-task aims to generate binary single-channel masks. The other sub-task, generating an RGB image, is simplified by the use of masks as a precondition. The generator in the second stage can take advantage

of this extra knowledge and more easily produce a corresponding image. Thus, our training time for the second stage is significantly reduced.

6. Discussion

The main objective of this study is to evaluate whether GAN-based augmentation for the image segmentation task provides an advantage over traditional augmentation strategies. Furthermore, this paper also provides a comparative analysis of the training time and performance metrics for WGAN-GP-based augmentation and our proposed approach. Contrary to the results in [15], where the authors conclude that the advantage either does not exist or is insignificant, the results in Table 4 show that GAN-based augmentation does outperform traditional augmentation methods. A similar improvement over the baseline segmentation is seen in [22]. During our experiments, we observed that the performance of the proposed method drops when a very large number of synthetic images (≥20K) is used for training the segmentation model. However, in combination with real images, the performance metrics still outperform the existing methods. This observation might be attributed to a few low-quality generated images, such as highly overlapping nuclei, low-intensity images, and false colorization. Image processing techniques such as thresholding, histogram equalization, and morphological operations can be implemented as an additional post-processing step to filter low-quality images, as sketched below.
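As an illustration of the post-processing suggested above, the following is one possible quality filter for generated image-mask pairs. The criteria (foreground area fraction and a crude intensity-range check) and all thresholds are placeholder assumptions for demonstration, not values or code from the paper.

```python
# Hypothetical quality filter for generated image-mask pairs, illustrating the
# post-processing step suggested in the discussion. Thresholds are placeholders.
import numpy as np

def keep_pair(image, mask, min_fg=0.01, max_fg=0.6, min_contrast=0.1):
    """image in [0, 1] with shape (H, W, 3); mask in [0, 1] with shape (H, W, 1)."""
    binary = mask > 0.5                      # simple thresholding of the mask
    fg_fraction = binary.mean()              # nuclei (foreground) area fraction
    contrast = image.max() - image.min()     # crude low-intensity / flat-image check
    return (min_fg <= fg_fraction <= max_fg) and (contrast >= min_contrast)
```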


Fig. 7. The image-mask pair generated by the proposed approach.

Table 6
Training time comparison between the WGAN-GP approach in [15] and the proposed approach.

Approach | Epochs | Training time
WGAN-GP [15] | 60,000 | 7.5 h
Proposed approach | 30,000 (Stage 1) + 300 (Stage 2) | 3.5 (3 + 0.5) h

7. Conclusion and future work

This paper proposes a novel two-stage image augmentation architecture for the image segmentation task to generate image-mask pairs, where the synthetic mask is generated by the first stage and the synthetic image is generated by the second stage. These synthesized image-mask pairs have been used to increase the size of the training dataset in the nuclei cell image segmentation task and, consequently, have improved the performance of the conventional image segmentation approach. Extensive experimental results verify that the use of GAN-based augmentation approaches outperforms traditional augmentation methods. We also show that our proposed approach, which divides the generation task into two simpler sub-tasks using the mask as a precondition, outperforms the single-stage approach on several performance metrics and in total training time.

For future work, we plan to utilize GAN techniques capable of generating high-resolution images [25,26]. As generating high-resolution images using GANs is a challenging task, we reduced the original resolution (256 × 256) of the nuclei images to 128 × 128 in the experiments presented in this paper. Finally, we plan to evaluate our proposed approach on other similar biomedical datasets for the segmentation task. The Intel® Movidius™ Neural Compute Stick has helped reduce the inference time for the pre-trained model.

Acknowledgement

We acknowledge the support from Intel in providing the Intel® Movidius™ Neural Compute Stick for our research work.

Conflict of interest: None declared.

References [1] G. Litjens, T. Kooi, B.E. Bejnordi, A.A.A. Setio, F. Ciompi, M. Ghafoorian, J.A. Van Der Laak, B. Van Ginneken, C.I. Sánchez, A survey on deep learning in medical image analysis, Med. Image Anal. 42 (2017) 60–88. [2] P. Meyer, V. Noblet, C. Mazzara, A. Lallement, Survey on deep learning for radiotherapy, Comput. Biol. Med. 98 (1) (2018) 126–146. [3] E. Nasr-Esfahani, N. Karimi, M.H. Jafari, S.M.R. Soroushmehr, S. Samavi, B.K. Nallamothu, K. Najarian, Segmentation of vessels in angiograms using convolutional neural networks, Biomed. Signal Process. Control 40 (2018) 240–251. [4] P. Wang, L. Wang, Y. Li, Q. Song, S. Lv, X. Hu, Automatic cell nuclei segmentation and classification of cervical pap smear images, Biomed. Signal Process. Control 48 (2019) 93–103. [5] Y. Song, L. Zhang, S. Chen, D. Ni, B. Lei, T. Wang, Accurate segmentation of cervical cytoplasm and nuclei based on multiscale convolutional network and graph partitioning, IEEE Trans. Biomed. Eng. 62 (10) (2015) 2421–2433. [6] Y. Xie, F. Xing, X. Kong, H. Su, L. Yang, Beyond classification: structured regression for robust cell detection using convolutional neural network, International Conference on Medical Image Computing and Computer-Assisted Intervention (2015) 358–365. [7] H. Irshad, A. Veillard, L. Roux, D. Racoceanu, Methods for nuclei detection, segmentation, and classification in digital histopathology: a review-current status and future potential, IEEE Rev. Biomed. Eng. 7 (2014) 97–114. [8] F. Xing, L. Yang, Robust nucleus/cell detection and segmentation in digital pathology and microscopy images: a comprehensive review, IEEE Rev. Biomed. Eng. 9 (2016) 234–263. [9] N. Malpica, C.O. de Solorzano, J.J. Vaquero, A. Santos, I. Vallcorba, J.M. García-Sagredo, F. Del Pozo, Applying watershed algorithms to the segmentation of clustered nuclei, Cytometry 28 (4) (1997) 289–297. [10] X. Yang, H. Li, X. Zhou, Nuclei segmentation using marker-controlled watershed, tracking using mean-shift, and Kalman filter in time-lapse microscopy, IEEE Trans. Circuits Syst. I: Regular Papers 53 (11) (2006) 2405–2414.


[11] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, International Conference on Medical Image Computing and Computer-Assisted Intervention (2015) 234–241. [12] Y. Yuan, M. Chao, Y.-C. Lo, Automatic skin lesion segmentation using deep fully convolutional networks with Jaccard distance, IEEE Trans. Med. Imaging 36 (9) (2017) 1876–1886. [13] J.-G. Lee, S. Jun, Y.-W. Cho, H. Lee, G.B. Kim, J.B. Seo, N. Kim, Deep learning in medical imaging: general overview, Kor. J. Radiol. 18 (4) (2017) 570–584. [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, Montreal, Canada, 2014, pp. 2672–2680. [15] T. Neff, C. Payer, D. Štern, M. Urschler, Generative adversarial networks to synthetically augment data for deep learning based image segmentation, in: M. Welk, P. Roth, M. Urschler (Eds.), Proceedings of the OAGM Workshop on Medical Image Analysis, 2018, pp. 22–29. [16] L. Perez, J. Wang, The Effectiveness of Data Augmentation in Image Classification Using Deep Learning, 2017, arXiv:1712.04621. [17] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems (2012) 1097–1105. [18] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, in: International Conference on Learning Representations, San Juan, Puerto Rico, 2016. [19] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial networks, in: Proceedings of the International Conference on Machine Learning, Sydney, Australia, 2017, pp. 214–223.

[20] K. Sohn, H. Lee, X. Yan, Learning structured output representation using deep conditional generative models, Advances in Neural Information Processing Systems (2015) 3483–3491. [21] P. Isola, J.-Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in: IEEE International Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, 2017, pp. 5967–5976. [22] H. Zhao, H. Li, S. Maurer-Stroh, L. Cheng, Synthesizing retinal and neuronal images with generative adversarial nets, Med. Image Anal. 49 (2018) 14–26. [23] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A.C. Courville, Improved training of Wasserstein GANs, in: Advances in Neural Information Processing Systems, California, United States, 2017, pp. 5767–5777. [24] V. Ljosa, K.L. Sokolnicki, A.E. Carpenter, Annotated high-throughput microscopy image sets for validation, Nat. Methods 9 (2012) 637. [25] T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive growing of GANs for improved quality, stability, and variation, in: International Conference on Learning Representations, Vancouver, Canada, 2018. [26] A. Brock, J. Donahue, K. Simonyan, Large scale GAN training for high fidelity natural image synthesis, in: International Conference on Learning Representations, New Orleans, United States, 2019. [27] T. Tieleman, G. Hinton, Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude, Neural Netw. Mach. Learn. 4 (2) (2012) 26–31. [28] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, in: International Conference on Learning Representations, San Diego, United States, 2015.