GC-Net: Global context network for medical image segmentation


Jiajia Ni a,b, Jianhuang Wu a,∗, Jing Tong b, Zhengming Chen b, Junping Zhao c

a Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
b College of Internet of Things Engineering, HoHai University, Changzhou, China
c Institute of Medical Informatics, Chinese PLA General Hospital, China

∗ Corresponding author. E-mail address: [email protected] (J. Wu).

Article history: Received 24 June 2019; Revised 23 September 2019; Accepted 4 October 2019

Keywords: Medical image segmentation; Global context; Convolutional neural network; Spatial and excitation pyramid pooling

Abstract

Background and Objective: Medical image segmentation plays an important role in many clinical applications such as disease diagnosis, surgery planning, and computer-assisted therapy. However, it is a very challenging task due to varying image quality, complex shapes of objects, and the existence of outliers. Recently, researchers have presented deep learning methods to segment medical images. However, these methods often use the high-level features of the convolutional neural network directly, or the high-level features combined with the shallow features, thus ignoring the role of global context features for the segmentation task. Consequently, they have limited capability across diverse medical segmentation tasks. The purpose of this work is to devise a neural network with global context feature information that can accomplish medical image segmentation for different tasks.

Methods: The proposed global context network (GC-Net) consists of two components: feature encoding and decoding modules. We use multiple convolution and batch normalization layers in the encoding module. The decoding module is formed by a proposed global context attention (GCA) block and a squeeze-and-excitation pyramid pooling (SEPP) block. The GCA module connects low-level and high-level features to produce more representative features, while the SEPP module increases the size of the receptive field and the ability of multi-scale feature fusion. Moreover, a weighted cross-entropy loss is designed to better balance the segmented and non-segmented regions.

Results: The proposed GC-Net is validated on three publicly available datasets and one local dataset. The tested medical segmentation tasks include segmentation of intracranial blood vessels, retinal vessels, cell contours, and lungs. Experiments demonstrate that our network outperforms state-of-the-art methods with respect to several commonly used evaluation metrics.

Conclusion: Medical segmentation of different tasks can be accurately and effectively achieved by devising a deep convolutional neural network with a global context attention mechanism.

1. Introduction

Image segmentation is a core task in many image processing applications and can be viewed, to some extent, as a dense classification problem. With the development of state-of-the-art deep learning approaches, many convolutional neural network (CNN) methods have demonstrated astonishing results in semantic segmentation [1–5]. These methods stack multi-layer CNN structures to extract features and then use deconvolution or interpolation to restore the segmented image to its original resolution. They use only the features obtained from the deep layers to perform segmentation, ignoring shallow features. It is well known that high-level features are strong at category classification but weak at reconstructing the original resolution for binary prediction, whereas shallow features are much richer in location information than high-level features. With the emergence of U-Net [6], the two kinds of features are combined via skip connections between the convolution and deconvolution layers. Many researchers have since proposed methods to combine the two kinds of features for semantic segmentation, for instance RefineNet [7] and Jégou et al. [8]. The former introduced a multi-resolution fusion block to fuse different features, while the latter adopted densely connected blocks to extract complex and abstract features and improve semantic segmentation accuracy. Peng et al. [9] proposed a sophisticated decoder module that uses low-level information to help high-level features recover image details. These methods achieved excellent results on several semantic segmentation benchmarks. However, they are less effective on small-target segmentation tasks, especially the segmentation of medical images, because they do not capture global context feature information.


Fig. 1. Overview of the proposed GC-Net. Convolution and batch normalization layers (Conv + BN) are applied multiple times to extract dense features. Then, SEPP and GCA are performed to extract precise pixel predictions and localization details. The green and red lines represent the pooling and upsampling operators, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Although this shortcoming can be alleviated by increasing the number of features, the memory of computers and the computational power of graphics processing units (GPUs) are still limited.

With the development of semantic segmentation, it can be noted that rich context information plays an essential part in image segmentation. To capture contextual information at multiple scales, DeepLab [10,11] employed several parallel atrous convolutions with different rates to regard spatially regularly sampled pixels as the context of the center pixel, while PSPNet [12] performed pooling operations at different grid scales to generate multiple regions and aggregated different region-based contexts as global context information. On the other hand, many researchers have introduced attention mechanisms to better obtain global context information. Hu et al. [13] proposed a squeeze-and-excitation network structure to learn the context. Liu et al. [14] introduced global context information that uses the average feature of a layer to augment the features at each location in fully convolutional networks. Li et al. [15] adopted a feature pyramid attention module to learn a global context and a global attention upsample module as a guide for low-level features to select category information.

Motivated by the above discussions and by the merits of spatial pyramid structures, which enable the attention mechanism to capture features at multiple scales, we propose a new network structure called the global context network (GC-Net), as shown in Fig. 1. Our GC-Net contains two new blocks: the global context attention (GCA) block and the squeeze-and-excitation pyramid pooling (SEPP) block. The GCA block combines low-level features and high-level features to produce more representative new features, while the SEPP block increases the receptive field and model robustness. The main contributions of this work are summarized as follows:

1) We propose a GCA block to combine high-level features and shallow features to produce more expressive global feature information.
2) We propose a SEPP block to preserve more spatial information.
3) We integrate the proposed GCA block and SEPP block with an encoder-decoder structure for medical image segmentation.

The remainder of this paper is organized as follows. Related work on medical image segmentation is reviewed in Section 2. Details of the proposed method are described in Section 3. The experimental results and discussion are presented in Sections 4 and 5, respectively. Finally, conclusions and future work are summarized in Section 6.

2. Related work

In this section, we review recent developments in medical image segmentation. Previous approaches to medical image segmentation were often based on hand-crafted features [16–21]. The 2D Gabor wavelet and a Bayesian classifier were used for retinal vessel segmentation [16]. 2D matched filters and B-COSFIRE filters were applied to blood vessel detection [17,21]. The difference in contrast between the vascular and non-vascular regions was often employed to train models for specific tasks [18–20]. Tian et al. [22] adopted a superpixel-based graph cut method to segment 3D prostate MRI images. Many excellent methods have also been proposed based on machine learning. Roychowdhury et al. [23] presented a three-stage blood vessel segmentation algorithm using high-pass filtering and a Gaussian mixture model. Hassouna et al. [24] presented a stochastic approach for extracting 3D blood vessels, in which the expectation maximization algorithm was used to estimate the parameters. Goceri et al. [25] used k-means clustering and refined it iteratively with a linear contrast stretching algorithm. Oliveira et al. [26] proposed a segmentation method using a deformable model and a region growing process. Orlando et al. [27] presented an approach based on a discriminatively trained fully connected conditional random field model to segment retinal vessels in color fundus photographs; the fully connected framework leads to a more robust segmentation. An automated vessel segmentation method was presented based on a 41-D feature vector and an AdaBoost classifier designed to classify vessel and non-vessel pixels [28]. However, these methods have limited capability beyond some specific benchmark data and are not easy to extend or apply to other medical segmentation contexts.

With the development of CNNs, and especially the emergence of AlexNet [29], many CNN models have been applied to medical image analysis [30–33]. Fu et al. [34] designed a CNN with a side-output layer and combined the CNN and conditional random field layers to form an integrated deep network for retinal vessel segmentation. Kamnitsas et al. [35] employed a multi-scale 3D CNN architecture for brain lesion segmentation and used a fully connected conditional random field for post-processing. Encoder-decoder architectures [36,37] have also demonstrated significant improvements on several segmentation benchmarks [38,39]. Ciresan et al. [5] proposed to segment neuronal membranes using a deep artificial neural network as a pixel classifier. Due to the low precision of the fully convolutional network (FCN), various improved models have emerged.


Among them, the most famous network structure is U-Net [6]. Norman et al. [40] used U-Net to segment cartilage and meniscus from knee MRI data. With the development of deep learning, researchers began to study the influence of contextual information on segmentation performance. Inspired by ParseNet [14], global contextual information was explored by introducing a context encoding module to capture the semantic context [41]. Direction-aware spatial context [42] introduced an encoding layer with an IRNN-like [43] module to capture global contextual information. A pyramid attention network (PAN) was proposed to exploit the impact of global contextual information on semantic segmentation [15]. A criss-cross network (CCNet) was proposed to obtain such important information more effectively and efficiently [44].

In this paper, we aim to present a new network structure with global contextual information to further boost the segmentation performance on medical images. Our proposed GC-Net differs from the aforementioned studies. The above methods only use an attention mechanism to generate an attention map, which records the relationship between each pixel pair in the feature map. In the proposed GC-Net, the contextual information is aggregated by two attention operations in the GCA module. Besides, GC-Net can also obtain dense and multi-scale contextual information through the SEPP module, which is more effective and efficient.

3. Methods

In this section, we describe the details of the proposed GC-Net for medical image segmentation. We first introduce the general framework of GC-Net and then present the feature encoder and feature decoder modules. Finally, the loss function used in our deep CNN is given.

3.1. Framework of the network architecture

GC-Net: In the feature encoder module, we use convolutions and batch normalization (Conv + BN) instead of ResNet50 [45], and the feature decoder module contains two modules, the GCA and SEPP modules, as illustrated in Fig. 1. After feature encoding, the size of the output feature maps becomes 1/16 that of the input image. Then, we use SEPP to concatenate feature maps generated by squeeze-and-excitation networks, so that the output feature maps contain multiple receptive field sizes, which encode multiscale information. In detail, before the squeeze-and-excitation network, we use atrous convolutions with different dilation rates to increase the receptive field sizes.

Backbone: A modified U-shape network with the GCA module and the feature encoder forms the backbone of the proposed method.

3.2. Feature encoding module

The feature encoding phase in a traditional semantic segmentation model often uses ResNet as the pre-trained model, because the data used to pre-train the model are similar to those used for semantic segmentation. Due to the uniqueness of medical images, in this work we do not simply use a pre-trained model for feature encoding; instead, we propose a new feature extraction module containing four feature extracting blocks. Each block of the encoding module contains one convolution layer, one batch normalization layer, and one max pooling layer.
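To make the encoder concrete, the following is a minimal Keras/TensorFlow sketch of such a four-block encoder, assuming 3 × 3 convolutions and an illustrative set of filter counts; the exact kernel sizes and channel widths are not specified in the text and are our assumptions, not the authors' released code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def encoder(inputs, filters=(64, 128, 256, 512)):
    """Four encoding blocks, each Conv -> BN -> ReLU -> MaxPool (filter counts assumed)."""
    x = inputs
    skips = []
    for f in filters:
        x = layers.Conv2D(f, 3, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        skips.append(x)                          # shallow features kept for the GCA skip paths
        x = layers.MaxPooling2D(pool_size=2)(x)  # resolution becomes 1/16 after four blocks
    return x, skips
```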

3.3. Feature decoding module

The feature decoding module is the most important part of the entire network, consisting of the GCA block and the SEPP block. This module extracts global context semantic information and generates more representative feature maps.

1) GCA: In image segmentation, it is necessary to know the category and spatial information of each pixel. Various methods have been proposed to utilize these two kinds of features, e.g., U-Net [6], DenseASPP [46], and DeepLab v2 [11]. These methods use skip connections to take advantage of both kinds of features. Later, to make better use of these two characteristics, attention mechanisms were applied to the network, for example in ParseNet [14], PAN [15], and CCNet [44]. Motivated by the skip connection and the attention mechanism, we propose the GCA module (Fig. 2) to capture global context information, which acts as guidance for low-level features to select category localization details. It is known that feature sets extracted from multiple data sources can be fused to create a new high-dimensional feature vector that represents the complementary information of the input for further classification/pattern recognition [47].

Fig. 2. Proposed GCA module for aggregating high-level and low-level features. Spatial element-wise multiplication and element-wise summation are used to fuse the features, and the red line represents the upsampling operator. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Fig. 3. Illustration of the SEPP module. Here, atrous CNN denotes atrous convolution; the atrous rates of the three routes are 6, 12, and 24. SE is the squeeze-and-excitation network. The overall structure follows the SPP network design.

Therefore, in our case, GCA includes three branches. The high-level features undergo global average pooling to obtain the global context features, which go through an L2-norm to generate global features; these are restored to the same size as the high-level features by bilinear interpolation. Then, the global features are used as weights to multiply the original high-level features and produce new features. Next, the new features go through a global average pooling to generate a weight that multiplies the shallow feature maps. Finally, the new features are combined with the original features to produce more representative feature maps.

2) SEPP: The size of the receptive field has a great influence on the results of image segmentation. Atrous spatial pyramid pooling (ASPP) can be employed in different ways, as in DenseASPP [46], RefineNet [7], and DeepLab v3 [10], to increase the receptive field. In this paper, we use a three-way parallel dilated convolution operation with dilation rates of 6, 12, and 24, and the squeeze-and-excitation network structure is added to the traditional spatial pyramid pooling (SPP) structure, as shown in Fig. 3. In each branch, we apply one 1 × 1 convolution with rectified linear activation after every atrous convolution and SE network. This structure not only increases the receptive field, but also improves the network's ability to extract features. SEPP retains the original SPP structure, which increases the receptive field of the entire network, while the squeeze-and-excitation network added to the structure strengthens feature extraction.
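The following is a minimal Keras/TensorFlow sketch of the GCA and SEPP blocks as we read the description above. The 1 × 1 convolutions used to match channel counts, the sigmoid on the shallow-feature weight, the upsampling factor of 2, and the output channel widths are our assumptions; broadcasting is used in place of bilinearly restoring the 1 × 1 context map, which is equivalent.

```python
import tensorflow as tf
from tensorflow.keras import layers

def gca_block(low_feat, high_feat, out_channels):
    # Global context: GAP over the high-level features followed by L2 normalisation.
    ctx = tf.reduce_mean(high_feat, axis=[1, 2], keepdims=True)      # (B, 1, 1, C_high)
    ctx = tf.math.l2_normalize(ctx, axis=-1)
    # Bilinear restoration of a 1x1 map is equivalent to broadcasting, so the
    # normalised context directly re-weights the original high-level features.
    new_high = high_feat * ctx
    # A second global pooling produces a weight for the shallow features; a 1x1
    # convolution (assumption) matches the channel count of the low-level features.
    w = tf.reduce_mean(new_high, axis=[1, 2], keepdims=True)
    w = layers.Conv2D(low_feat.shape[-1], 1, activation="sigmoid")(w)
    low_weighted = low_feat * w
    # Upsample the attended high-level features and fuse with the weighted shallow
    # features by element-wise sum to obtain the more representative feature maps.
    up = layers.UpSampling2D(size=2, interpolation="bilinear")(new_high)
    up = layers.Conv2D(out_channels, 1, padding="same", activation="relu")(up)
    low_weighted = layers.Conv2D(out_channels, 1, padding="same", activation="relu")(low_weighted)
    return up + low_weighted

def se_block(x, ratio=16):
    # Squeeze-and-excitation channel attention in the style of Hu et al. [13].
    c = x.shape[-1]
    s = tf.reduce_mean(x, axis=[1, 2], keepdims=True)
    s = layers.Conv2D(max(c // ratio, 1), 1, activation="relu")(s)
    s = layers.Conv2D(c, 1, activation="sigmoid")(s)
    return x * s

def sepp_block(x, channels=256):
    # Three parallel atrous convolutions (rates 6, 12, 24), each followed by an
    # SE block and a 1x1 convolution with ReLU, concatenated in SPP fashion.
    branches = []
    for rate in (6, 12, 24):
        b = layers.Conv2D(channels, 3, padding="same", dilation_rate=rate)(x)
        b = se_block(b)
        b = layers.Conv2D(channels, 1, padding="same", activation="relu")(b)
        branches.append(b)
    return layers.Concatenate()(branches)
```

In a full decoder, sepp_block would be applied to the encoder output and gca_block at each upsampling stage; this assembly is our reading of Fig. 1 rather than a published reference implementation.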

3.4. Loss function

Our proposed framework is an end-to-end deep learning method, as illustrated in Fig. 1. The cross-entropy loss function is frequently used in traditional deep learning. However, objects in medical images are often small, so the cross-entropy loss is not optimal for semantic segmentation. In this paper, we replace the cross-entropy loss function with a new loss function that combines the Dice coefficient loss [48] and a weighted cross-entropy loss. The Dice coefficient is widely used to assess segmentation performance, and its loss function is calculated using:

$L_{dice} = 1 - \frac{2\sum_{i}^{N} p(k,i)\,q(k,i)}{\sum_{i}^{N} p^{2}(k,i) + \sum_{i}^{N} q^{2}(k,i)}$,    (1)

where N is the number of pixels/voxels, and p(k, i) ∈ [0, 1] and q(k, i) ∈ [0, 1] are, respectively, the predicted probability and the ground truth label for class k. In some segmentation tasks, e.g. in CT images, the target regions usually occupy smaller areas than other regions. Since the Dice coefficient loss function only pays attention to the accuracy rate during training, we use a weighted cross-entropy loss to optimize the whole network in the training process. Given that the ground truth value of a pixel is y (where y = 1 if it belongs to a vessel and y = 0 otherwise) and the prediction for the pixel is p (where p ∈ [0, 1]), the weighted cross-entropy loss Lr is:

$L_{r} = -\left(1 - \frac{TP}{N_{p}}\right) y \log(p) - \left(1 - \frac{TN}{N_{n}}\right) (1 - y) \log(1 - p)$,    (2)

where TP and TN are, respectively, the numbers of true positives and true negatives, and Np and Nn are the numbers of segmentation and non-segmentation pixels. We combine the two loss functions in Eqs. (1) and (2) to form our final loss function. In addition, we include a hyperparameter λ to control the effect of the weighted cross-entropy loss. Finally, our proposed loss function is defined as follows:

$L_{all} = L_{dice} + \lambda L_{r}$.    (3)

In this study, it is experimentally found that λ = 0.5 obtains the best performance.
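A minimal TensorFlow/Keras-style sketch of this combined loss is shown below. The function names, the clipping constant, and averaging the per-pixel weighted cross-entropy over the batch are our assumptions; the weighting follows Eqs. (1)–(3).

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, eps=1e-7):
    # Eq. (1): soft Dice loss computed over all pixels.
    inter = tf.reduce_sum(y_true * y_pred)
    denom = tf.reduce_sum(tf.square(y_true)) + tf.reduce_sum(tf.square(y_pred))
    return 1.0 - 2.0 * inter / (denom + eps)

def weighted_ce(y_true, y_pred, eps=1e-7):
    # Eq. (2): cross-entropy weighted by (1 - TP/Np) and (1 - TN/Nn).
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    hard = tf.round(y_pred)
    tp = tf.reduce_sum(y_true * hard)
    tn = tf.reduce_sum((1.0 - y_true) * (1.0 - hard))
    n_p = tf.reduce_sum(y_true) + eps          # number of segmentation pixels
    n_n = tf.reduce_sum(1.0 - y_true) + eps    # number of non-segmentation pixels
    loss = -(1.0 - tp / n_p) * y_true * tf.math.log(y_pred) \
           - (1.0 - tn / n_n) * (1.0 - y_true) * tf.math.log(1.0 - y_pred)
    return tf.reduce_mean(loss)                # averaged over pixels (assumption)

def total_loss(y_true, y_pred, lam=0.5):
    # Eq. (3): L_all = L_dice + lambda * L_r, with lambda = 0.5 as reported.
    return dice_loss(y_true, y_pred) + lam * weighted_ce(y_true, y_pred)
```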

4. Results

4.1. Experimental setup

Our approach is evaluated on one local dataset and three publicly available datasets: intracranial blood vessel data, the Digital Retinal Images for Vessel Extraction (DRIVE) dataset [49], cell contour data [53], and Lung Nodule Analysis (LUNA) data [54]. In the training stage, we use mini-batch stochastic gradient descent with batch size = 16, momentum = 0.9, and weight decay = 0.0001.


In addition, we apply the "poly" learning rate strategy, in which the initial rate is multiplied by $\left(1 - \frac{epoch - 1}{totalepoch}\right)^{power}$, where power is set to 0.9 and the initial rate is lr = 1e-3. The maximum number of epochs (totalepoch) is 200. The implementation is based on the public Keras library [55] with TensorFlow [56] as the backend. Training and testing were performed on an Ubuntu 16.04 system with an NVIDIA GeForce Titan Xp GPU with 12 GB of memory.
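Under these settings, the training configuration might look like the following sketch; `model` is a placeholder for the compiled GC-Net, and applying the weight decay via L2 kernel regularizers (rather than an optimizer option) is our assumption.

```python
from tensorflow.keras import callbacks, optimizers

BASE_LR, TOTAL_EPOCHS, POWER = 1e-3, 200, 0.9

def poly_lr(epoch, lr=None):
    # "Poly" schedule: base_lr * (1 - epoch/total_epochs) ** power. Keras epoch
    # indices start at 0, which matches (epoch - 1) in the text; the current lr
    # passed by Keras is ignored and the rate is recomputed from the epoch index.
    return BASE_LR * (1.0 - epoch / float(TOTAL_EPOCHS)) ** POWER

sgd = optimizers.SGD(learning_rate=BASE_LR, momentum=0.9)
# A weight decay of 1e-4 can be realised with kernel_regularizer=l2(1e-4) on the Conv2D layers.
# model.compile(optimizer=sgd, loss=total_loss)
# model.fit(x_train, y_train, batch_size=16, epochs=TOTAL_EPOCHS,
#           callbacks=[callbacks.LearningRateScheduler(poly_lr)])
```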

4.2. Evaluation metrics

To evaluate the performance, we employ the Dice coefficient (Dic) for overall segmentation accuracy and the mean intersection over union (Mean IoU) for the accuracy of semantic segmentation, which are defined, respectively, as follows:



$Dic = \frac{2\,|A \cap B|}{|A| + |B|}$,    (4)

where |A| and |B| represent the numbers of pixels in the predicted and ground truth images, respectively, and |A ∩ B| represents the number of pixels common to both images.

$\text{Mean IoU} = \frac{1}{k}\sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$,    (5)

where k represents the total number of categories, $p_{ij}$ is the number of pixels whose true class is i but which are predicted as class j, and $p_{ii}$ is the number of pixels of class i that are correctly predicted as class i. In our study, we also compute two other evaluation metrics, the sensitivity (Sen) and the accuracy (Acc) [50]:

$Sen = \frac{TP}{TP + FN}$,    (6)

$Acc = \frac{TP + TN}{TP + TN + FP + FN}$,    (7)

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. We also employ the area under the receiver operating characteristic curve (AUC) to measure segmentation performance.
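For the binary segmentation tasks considered here, these metrics reduce to simple counts of TP, TN, FP, and FN; the sketch below (with an illustrative smoothing constant) shows how Dic, Mean IoU, Sen, and Acc of Eqs. (4)–(7) can be computed for a predicted mask. AUC is typically obtained from the soft probability map, e.g. with scikit-learn's roc_auc_score.

```python
import numpy as np

def binary_metrics(pred, gt, eps=1e-7):
    """Dic (Eq. 4), Mean IoU (Eq. 5 for two classes), Sen (Eq. 6) and Acc (Eq. 7)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    dic = 2.0 * tp / (pred.sum() + gt.sum() + eps)
    iou_fg = tp / (tp + fp + fn + eps)        # foreground IoU
    iou_bg = tn / (tn + fp + fn + eps)        # background IoU
    mean_iou = (iou_fg + iou_bg) / 2.0        # averaged over the two classes
    sen = tp / (tp + fn + eps)
    acc = (tp + tn) / (tp + tn + fp + fn + eps)
    return {"Dic": dic, "MeanIoU": mean_iou, "Sen": sen, "Acc": acc}
```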

Fig. 4. 3D visualization of vascular segmentation. (a) 3D surface reconstruction from the segmentation result and (b) vasculature after performing surface denoising.

Table 1
Performance comparison of the proposed GC-Net with state-of-the-art methods on the intracranial blood vessel data.

Method    Dic (%)    Mean IoU (%)
U-Net     87.32      86.48
SegNet    88.40      81.63
FCN8s     84.23      67.72
FCN16s    76.14      66.53
GC-Net    96.35      91.89

Table 2
Performance comparison of the proposed GC-Net with different context aggregation approaches on the intracranial blood vessel data.

Method                   Dic (%)    Mean IoU (%)
ResNet50 + GCA + SEPP    93.12      89.79
ResNet50 + GCA           91.75      88.08
ResNet50 + GCA + ASPP    91.00      88.04
Backbone                 95.46      91.19
Backbone + ASPP          96.08      91.76
GC-Net                   96.35      91.89

4.3. Intracranial blood vessel data

The intracranial blood vessel dataset in this study is from The Sixth People's Hospital, Shenzhen, China. The imaging modality of the dataset is computed tomography angiography (CTA), which is commonly used for the assessment of many types of vascular diseases in clinical scenarios. After quality checks and basic image preprocessing, we collect 4326 CTA images of intracranial blood vessels with dimensions 512 × 512 in the original dataset. To further increase the number of training samples, the dataset is augmented to reduce the risk of overfitting [51]. Specifically, we perform data augmentation in three ways, including affine transformation, rotation, and vertical flip operations. As a consequence, each image in the dataset is augmented into three images. In our experiments, we use 80% of the dataset for training while the remaining 20% is used for testing.

The network devised in this paper is based on 2D CT slice images. However, in clinical practice, 3D visualization of the vasculature can greatly help the diagnosis and surgical planning of vessel diseases. Therefore, a 3D surface is reconstructed by applying the marching cubes technique. As shown in Fig. 4(a), there is some noise on the surface in the form of isolated objects, arising from misclassifications. Thus, regions that are unconnected and far from the large vessels are eliminated to reduce the misclassifications.
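A sketch of this post-processing is given below, assuming the 2D predictions have been stacked into a probability volume; the use of scikit-image's marching_cubes (recent versions) and a SciPy connected-component filter, as well as the threshold and minimum component size, are our assumptions rather than the authors' exact implementation.

```python
import numpy as np
from scipy import ndimage
from skimage import measure

def reconstruct_vessel_surface(prob_volume, threshold=0.5, min_voxels=500):
    """Stacked 2D predictions -> cleaned binary volume -> triangle mesh."""
    binary = prob_volume > threshold
    # Connected-component analysis: drop small isolated components, i.e. the
    # misclassified regions that are unconnected and far from the large vessels.
    labels, num = ndimage.label(binary)
    sizes = ndimage.sum(binary, labels, index=range(1, num + 1))
    keep_labels = np.nonzero(np.asarray(sizes) >= min_voxels)[0] + 1
    cleaned = np.isin(labels, keep_labels)
    # Marching cubes extracts a surface mesh from the cleaned binary volume.
    verts, faces, normals, values = measure.marching_cubes(cleaned.astype(np.float32), level=0.5)
    return verts, faces
```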

We compare our method with state-of-the-art algorithms, namely U-Net, FCN-8s, FCN-16s, and SegNet. All baseline models adopt the same data format and preprocessing. Table 1 shows the performance comparison of all competing methods using the Dic and Mean IoU metrics. The proposed GC-Net clearly outperforms the state-of-the-art segmentation methods on the intracranial blood vessel dataset. In particular, our method achieves a Dic score of 96.35%, which is 7.95% higher than SegNet. Regarding the Mean IoU metric, GC-Net achieves 91.89%, which is 5.41% higher than U-Net. ResNet50 is also used as a backbone in our ablation studies, and its performance is lower than that of GC-Net in terms of Dic and Mean IoU, as shown in Table 2. In Table 2, it can be noticed that both the Dic value and the Mean IoU value with ResNet50 as the backbone are lower than those of the other combinations in the experiments. When comparing GC-Net with the Backbone and Backbone + ASPP, it is found that the Dic value increases from 95.46% to 96.35% and the Mean IoU score increases from 91.19% to 91.89%, which further demonstrates that the proposed GCA and SEPP blocks are beneficial for vessel segmentation. We also show the 2D and 3D results in Figs. 5 and 6, respectively, to visually compare the results of the proposed method with the competing ones. In Fig. 6, it can be observed that the GC-Net segmentation result has more detailed features than the ground truth.


Fig. 5. 2D visualization of segmentation results. The top row shows the original annotated CTA image slices, the middle row shows the ground truth, and the bottom row shows the results of our method.

Fig. 6. Visualization of 3D results achieved on test dataset 1 and test dataset 2. Compared to the ground truth, all four state-of-the-art methods (U-Net, FCN16s, SegNet, and FCN8s) miss fine features (e.g. small vessels in the ovals), whereas the proposed method preserves fine vessels well (areas indicated by ovals).

This is because small blood vessels are often difficult for the human eye to distinguish, so the manually annotated ground truth only marks the main blood vessels. With the proposed GC-Net, however, the intracranial blood vessel tree can be preserved, accurately extracting very small vessels and correctly yielding topological consistency. In contrast, the other competing methods fail to generate an accurate vessel tree that preserves both geometrical features and topological structures, as shown by the regions indicated by ovals in Fig. 6. Taking SegNet as an example, it not only produces disconnected vessels but also loses small vessels. This may lead the radiologist to make an incorrect interpretation of the images, e.g. considering the patient as having suffered from cerebral arterial stenosis.

4.4. Retinal vessel segmentation

The photographs for the DRIVE database were obtained from a diabetic retinopathy screening program in the Netherlands [49]. The screening population consists of 400 diabetic subjects between 25 and 90 years of age, and the image dimension is 512 × 512. In this experiment, 40 images are randomly selected and further divided into 20 images for training and 20 for testing.


Fig. 7. Visualization results on the DRIVE database. (a) Test image, (b) Ground truth, (c) U-Net and (d) GC-Net.

Table 3
Performance comparison of the GC-Net with other competing methods on retinal vessel data using different performance metrics.

Method                      Sen (%)    Acc (%)    AUC (%)
Azzopardi et al. [21]       76.55      94.42      96.14
Roychowdhury et al. [23]    72.50      95.20      96.72
Zhao et al. [52]            74.20      95.40      86.20
U-Net [6]                   73.44      95.23      97.44
DeepVessel [34]             76.03      95.23      97.52
Qiaoliang et al. [57]       75.69      95.27      97.38
Melinscak et al. [58]       –          94.66      97.49
GC-Net                      78.44      95.51      97.77

We also perform data augmentation in four ways, including gray-scale conversion, standardization, contrast-limited adaptive histogram equalization, and gamma adjustment. In this experiment, we train on sub-images (patches) of size 96 × 96, each obtained by randomly selecting its center inside the full image. We compare the proposed GC-Net with state-of-the-art algorithms: U-Net [6], DeepVessel [34], Azzopardi et al. [21], Roychowdhury et al. [23], Zhao et al. [52], Qiaoliang et al. [57], and Melinscak et al. [58]. The results clearly demonstrate the improvement over previous works. All baseline results are taken directly from the values provided by the respective authors. Table 3 summarizes the performance comparison of all methods. GC-Net achieves 97.77% in AUC, 78.44% in Sen, and 95.51% in Acc, which are better than the other methods. It can also be noticed that the AUC increases from 86.20% to 97.77% (13.4% improvement), the Sen score increases from 72.50% to 78.44% (5.94% improvement), and the Acc score increases from 94.42% to 95.51% (1.09% improvement). Some examples for visual comparison are shown in Fig. 7.

Table 4
Performance comparison of the GC-Net with other competing methods on the cell contour dataset.

Method      Dic (%)    Acc (%)
U-Net       90.94      93.58
Backbone    97.64      97.48
GC-Net      97.90      97.76

4.5. Cell contour segmentation

The cell contour dataset comes from the EM segmentation challenge that started at ISBI 2012 [53]. The training set contains 30 images (512 × 512 pixels) and the testing set consists of 30 images; the data are available at http://brainiac2.mit.edu/. Since the training set is relatively small, we also perform data augmentation via rotation, vertical flipping, and contrast enhancement. We also calculate the Dic and Acc scores as evaluation metrics.

We compare the proposed GC-Net with the original U-Net and the Backbone. The final segmentation results are summarized in Table 4. GC-Net outperforms U-Net and the Backbone, indicating that GC-Net is effective for cell contour segmentation tasks. Our GC-Net achieves 97.90% in Dic and 97.76% in Acc, which are better than the other methods. It can be seen that the Dic score increases from 90.94% to 97.90%, and the Acc score from 93.58% to 97.76%. Examples of segmentation results are shown for visual comparison in Fig. 8.


Fig. 8. Visualization results of cell contour segmentation. From left to right: original images, U-Net, Backbone, and GC-Net. (The ground truth for cell images is not given).

Fig. 9. Visualization results on lung segmentation set. From left to right: original images, ground truth, U-Net, Backbone, and GC-Net.

4.6. Lung segmentation

The lung segmentation task is to segment the lung structures in 2D CT images from the LUNA competition [54], which is available at https://www.kaggle.com/kmader/finding-lungs-in-ct-data/data/. The goal of the LUNA competition is to develop algorithms that accurately segment the lungs and measure important clinical parameters. We employ this dataset to further evaluate the performance of the proposed GC-Net. The dataset has 267 2D lung CT images and their corresponding label images. The size of each image is 512 × 512 pixels. In the experiment, we use 70% of the data for training and 30% for testing. Three different metrics, including Dic, Acc, and Sen, are computed for performance evaluation.

We compare GC-Net with the U-Net and Backbone methods on lung segmentation in this section. The performance comparisons are given in Table 5.

Table 5
Performance comparison on the lung segmentation dataset.

Method      Dic (%)    Acc (%)    Sen (%)
U-Net       95.88      98.33      96.88
Backbone    98.50      98.78      98.22
GC-Net      98.97      98.95      98.67

It can be seen that our method achieves better performance than U-Net and the Backbone method. More specifically, GC-Net achieves 98.97% in Dic, 98.95% in Acc, and 98.67% in Sen. Compared to U-Net, the Dic score increases from 95.88% to 98.97% and the Sen score increases from 96.88% to 98.67%, although the performance of the compared methods is fairly close to that of the proposed GC-Net. Segmentation examples for visual comparison are shown in Fig. 9.

5. Discussion

To capture contextual information and obtain better medical segmentation performance, we propose a new network called GC-Net, consisting of the GCA and SEPP modules. The main contributions of the proposed algorithm are as follows. (1) We use an attention mechanism to analyze global contextual feature information for segmenting medical images. (2) We propose the GCA and SEPP modules to, respectively, learn the high-level features and increase the size of the receptive field of the network while learning more features. (3) We introduce a new objective function that combines both the Dice and weighted cross-entropy losses for efficient training of our network. (4) We test the proposed method on four different medical image datasets, one local and three publicly available, to demonstrate the effectiveness and robustness of the proposed method.

To further evaluate the effectiveness of GC-Net, we conduct several ablation studies using the intracranial vascular data as an example (see Table 2). Compared to U-Net, our GC-Net improves the modified U-shape network and does not use the residual block


to replace the original encoder block. In this paper, "Backbone" is defined as the modified U-shape network combined with the original encoder block and the GCA module. Recent work [59] points out that results obtained with standard models trained from random initialization are not worse than those of their ImageNet pre-trained counterparts. We have also conducted experiments to compare the results with ResNet50 pre-training to those without. Table 2 shows how the Dic and Mean IoU change in these two scenarios. When pretrained ResNet50 blocks are adopted, the performance is weaker than when only the encoder block is used. For intracranial vessel segmentation, the Dic score increases by 3.4%, from 93.12% to 96.35%, and the Mean IoU score increases from 89.79% to 91.89%. This indicates that pretrained ResNet50 blocks are not suitable for the proposed network structure.

The proposed GC-Net employs the SEPP block assembled in the network structure. We design two different combinations of network models to validate the efficacy of the SEPP block: one takes ResNet50 as the pre-trained block, referred to as ResNet50 + GCA + SEPP, and the other uses the original encoder block, referred to as GC-Net. Experiments show that the performance without SEPP, or with ASPP instead, is lower than that obtained with SEPP. As given in Table 2, the Dic score increases from 91.00% and 95.46% to 93.12% and 96.35%, and the Mean IoU score increases from 88.04% and 91.19% to 89.79% and 91.89%, respectively. This reveals that the proposed SEPP block is capable of increasing the network's receptive field and improving the accuracy of the network. Furthermore, by combining the observations in Tables 1 and 2, we can conclude that the GCA block can combine high-level features with shallow features to produce more representative feature maps, which are beneficial for segmentation tasks. Comparing U-Net (which can be seen here as the original network structure without the GCA module) with the Backbone, the Dic and Mean IoU scores increase from 87.32% and 86.48% to 95.46% and 91.19%, respectively, for the intracranial vascular segmentation task.

Experiments also demonstrate that the proposed method is robust in medical segmentation across different imaging modalities. On the intracranial vascular and lung CT datasets, GC-Net obtains the best results compared to classical algorithms. Excellent results have also been obtained on the publicly available fundus vessel and cell wall datasets. In the DRIVE database, the AUC score increases from 86.20% to 97.77% and the Acc score reaches 95.51%. In the LUNA database, the Dic and Acc scores increase from 95.88% and 98.33% to 98.97% and 98.95%, respectively. These performances indicate that GC-Net may provide a novel approach to medical image segmentation tasks that goes beyond a single imaging modality.

Although the proposed model has achieved good results on different datasets, it still has a few limitations. (1) The entire network takes more time to train than traditional methods. (2) Our method is tested on a retinal vessel dataset that is relatively small (only 40 images). (3) Our method is 2D-based, whereas 3D segmentation is highly desirable for many tasks. In our future work, we aim to extend the proposed method to realistic 3D medical image segmentation, collect more clinical datasets, and reduce the computational complexity.
In addition, we will evaluate the robustness and effectiveness of our network on public datasets of natural images such as PASCAL VOC 2012 [60].

6. Conclusions

This study presents a novel network for medical image segmentation. Our key idea is to analyze the global context by formulating a global context attention mechanism that automatically learns the attention weights and composes the spatial context. To obtain global contextual information, we introduce the GCA


attention module, which combines low-resolution feature maps that have a large receptive field with shallow features and therefore produces efficient full-resolution feature maps. In addition, the SEPP module is devised to increase the size of the receptive field of the network while learning more features, and a weighted cross-entropy loss function is designed to make the training process effective. These operations are beneficial for improving the accuracy and robustness of medical segmentation. Our method can be applied to different tasks by fine-tuning our model using new training data. We also test our method on four benchmark datasets and compare it with various state-of-the-art methods using different evaluation metrics (e.g. Dic, Sen, Mean IoU, and AUC). Moreover, compared with U-Net and SegNet, GC-Net greatly improves the retention of details and the segmentation of small objects. Extensive experiments demonstrate that the richer the information, the more beneficial it is for the segmentation of the image.

Declaration of Competing Interest

We confirm that all authors of this manuscript have no conflicts of interest to declare.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (no. 61672510) and the Shenzhen Basic Research Program (no. JCYJ20180507182441903, no. JCYJ20170412174037594). The authors are grateful to Dr. Ahmed Elazab for his kind help and valuable suggestions.

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.cmpb.2019.105121.

References

[1] V. Badrinarayanan, A. Kendall, R. Cipolla, SegNet: a deep convolutional encoder-decoder architecture for image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 39 (12) (2017) 2481–2495.
[2] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[3] H. Noh, S. Hong, B. Han, Learning deconvolution network for semantic segmentation, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1520–1528.
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, Semantic image segmentation with deep convolutional nets and fully connected CRFs, Computing 1 (1) (2014) 12–17.
[5] D. Ciresan, A. Giusti, L.M. Gambardella, J. Schmidhuber, Deep neural networks segment neuronal membranes in electron microscopy images, in: Advances in Neural Information Processing Systems, 2012, pp. 2843–2851.
[6] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[7] G. Lin, A. Milan, C. Shen, I.D. Reid, RefineNet: multi-path refinement networks for high-resolution semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1, 2017, p. 5.
[8] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, Y. Bengio, The one hundred layers tiramisu: fully convolutional DenseNets for semantic segmentation, in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 11–19.
[9] C. Peng, X. Zhang, G. Yu, G. Luo, J. Sun, Large kernel matters—improve semantic segmentation by global convolutional network, in: Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4353–4361.
[10] L.-C. Chen, G. Papandreou, F. Schroff, H. Adam, Rethinking atrous convolution for semantic image segmentation, arXiv:1706.05587, 2017.
[11] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell. 40 (4) (2018) 834–848.
[12] H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2881–2890.


[13] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[14] W. Liu, A. Rabinovich, A.C. Berg, ParseNet: looking wider to see better, arXiv:1506.04579, 2015.
[15] H. Li, P. Xiong, J. An, L. Wang, Pyramid attention network for semantic segmentation, arXiv:1805.10180, 2018.
[16] J.V.B. Soares, J.J.G. Leandro, R.M. Cesar, H.F. Jelinek, M.J. Cree, Retinal vessel segmentation using the 2-D Gabor wavelet and supervised classification, IEEE Trans. Med. Imaging 25 (9) (2006) 1214–1222.
[17] N. Katz, M. Nelson, M. Goldbaum, S. Chaudhuri, S. Chatterjee, Detection of blood vessels in retinal images using two-dimensional matched filters, IEEE Trans. Med. Imaging 8 (3) (1989) 263–269.
[18] Q. Li, B. Feng, L. Xie, P. Liang, H. Zhang, T. Wang, A cross-modality learning approach for vessel segmentation in retinal images, IEEE Trans. Med. Imaging 35 (1) (2016) 109–118.
[19] A.F. Frangi, W.J. Niessen, K.L. Vincken, M.A. Viergever, Multiscale vessel enhancement filtering, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 1998, pp. 130–137.
[20] K. Krissian, G. Malandain, N. Ayache, Directional anisotropic diffusion applied to segmentation of vessels in 3D images, in: International Conference on Scale-Space Theories in Computer Vision, Springer, 1997, pp. 345–348.
[21] G. Azzopardi, N. Strisciuglio, M. Vento, N. Petkov, Trainable COSFIRE filters for vessel delineation with application to retinal images, Med. Image Anal. 19 (1) (2015) 46–57.
[22] Z. Tian, L. Liu, Z. Zhang, B. Fei, Superpixel-based segmentation for 3D prostate MR images, IEEE Trans. Med. Imaging 35 (3) (2016) 791–801.
[23] S. Roychowdhury, D.D. Koozekanani, K.K. Parhi, Blood vessel segmentation of fundus images by major vessel extraction and subimage classification, IEEE J. Biomed. Health Inform. 19 (3) (2015) 1118–1128.
[24] M.S. Hassouna, A.A. Farag, S. Hushek, T. Moriarty, Cerebrovascular segmentation from TOF using stochastic models, Med. Image Anal. 10 (1) (2006) 2–18.
[25] E. Goceri, Z.K. Shah, M.N. Gurcan, Vessel segmentation from abdominal magnetic resonance images: adaptive and reconstructive approach, Int. J. Numer. Method Biomed. Eng. 33 (3) (2017) e2811.
[26] D.A. Oliveira, R.Q. Feitosa, M.M. Correia, Segmentation of liver, its vessels and lesions from CT images for surgical planning, Biomed. Eng. Online 10 (1) (2011) 30.
[27] J.I. Orlando, E. Prokofyeva, M.B. Blaschko, A discriminatively trained fully connected conditional random field model for blood vessel segmentation in fundus images, IEEE Trans. Biomed. Eng. 64 (1) (2017) 16–27.
[28] C.A. Lupascu, D. Tegolo, E. Trucco, FABC: retinal vessel segmentation using AdaBoost, IEEE Trans. Inf. Technol. Biomed. 14 (5) (2010) 1267–1274.
[29] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[30] K. Zhou, Z. Gu, W. Liu, W. Luo, J. Cheng, S. Gao, J. Liu, Multi-cell multi-task convolutional neural networks for diabetic retinopathy grading, in: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), IEEE, 2018, pp. 2724–2727.
[31] Z. Wang, Y. Yin, J. Shi, W. Fang, H. Li, X. Wang, Zoom-in-Net: deep mining lesions for diabetic retinopathy detection, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2017, pp. 267–275.
[32] A. Gumaei, M.M. Hassan, M.R. Hassan, A. Alelaiwi, G. Fortino, A hybrid feature extraction method with regularized extreme learning machine for brain tumor classification, IEEE Access (2019) 36266–36273.
[33] D. Marín, A. Aquino, M.E. Gegúndez-Arias, J.M. Bravo, A new supervised method for blood vessel segmentation in retinal images by using gray-level and moment invariants-based features, IEEE Trans. Med. Imaging 30 (1) (2011) 146–158.
[34] H. Fu, Y. Xu, S. Lin, D.W.K. Wong, J. Liu, DeepVessel: retinal vessel segmentation via deep learning and conditional random field, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2016, pp. 132–139.
[35] K. Kamnitsas, C. Ledig, V.F. Newcombe, J.P. Simpson, A.D. Kane, D.K. Menon, D. Rueckert, B. Glocker, Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation, Med. Image Anal. 36 (2017) 61–78.

[36] K. Cho, B. Van Merriënboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: encoder-decoder approaches, arXiv:1409.1259, 2014.
[37] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
[38] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The Cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[39] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, A. Torralba, Scene parsing through ADE20K dataset, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 633–641.
[40] B. Norman, V. Pedoia, S. Majumdar, Use of 2D U-Net convolutional neural networks for automated cartilage and meniscus segmentation of knee MR imaging data to determine relaxometry and morphometry, Radiology 288 (1) (2018) 177–185.
[41] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, A. Agrawal, Context encoding for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7151–7160.
[42] X. Hu, L. Zhu, C.-W. Fu, J. Qin, P.-A. Heng, Direction-aware spatial context features for shadow detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7454–7462.
[43] S. Bell, C. Lawrence Zitnick, K. Bala, R. Girshick, Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2874–2883.
[44] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, W. Liu, CCNet: criss-cross attention for semantic segmentation, arXiv:1811.11721, 2018.
[45] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[46] M. Yang, K. Yu, C. Zhang, Z. Li, K. Yang, DenseASPP for semantic segmentation in street scenes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3684–3692.
[47] R. Gravina, P. Alinia, H. Ghasemzadeh, G. Fortino, Multi-sensor fusion in body sensor networks: state-of-the-art and research challenges, Inf. Fusion 35 (2017) 68–80.
[48] W.R. Crum, O. Camara, D.L. Hill, Generalized overlap measures for evaluation and validation in medical image analysis, IEEE Trans. Med. Imaging 23 (4) (2004) 501–509.
[49] J. Staal, M.D. Abràmoff, M. Niemeijer, M.A. Viergever, B. Van Ginneken, Ridge-based vessel segmentation in color images of the retina, IEEE Trans. Med. Imaging 23 (4) (2004) 501–509.
[50] P. Liskowski, K. Krawiec, Segmenting retinal blood vessels with deep neural networks, IEEE Trans. Med. Imaging 35 (11) (2016) 2369–2380.
[51] T.S. Lee, Image representation using 2D Gabor wavelets, IEEE Trans. Pattern Anal. Mach. Intell. 18 (10) (1996) 959–971.
[52] Y. Zhao, L. Rada, K. Chen, S.P. Harding, Y. Zheng, Automated vessel segmentation using infinite perimeter active contour model with hybrid region information with application to retinal images, IEEE Trans. Med. Imaging 34 (9) (2015) 1797–1807.
[53] Cell contour data, http://brainiac2.mit.edu/.
[54] Lung nodule analysis (LUNA) data, https://www.kaggle.com/kmader/finding-lungs-in-ct-data/data/.
[55] F. Chollet, et al., Keras, 2015, https://keras.io.
[56] M. Abadi, et al., TensorFlow: large-scale machine learning on heterogeneous systems, 2015, https://www.tensorflow.org/. Software available from tensorflow.org.
[57] Q. Li, et al., A cross-modality learning approach for vessel segmentation in retinal images, IEEE Trans. Med. Imaging 35 (1) (2016) 109–118.
[58] M. Melinščak, P. Prentašić, S. Lončarić, et al., Retinal vessel segmentation using deep neural networks, in: Proceedings of the 10th International Conference on Computer Vision Theory and Applications (VISIGRAPP 2015), 2015, pp. 577–582.
[59] K. He, R. Girshick, P. Dollár, Rethinking ImageNet pre-training, arXiv:1811.08883, 2018.
[60] M. Everingham, L. Van Gool, C.K. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338.
