Two stage residual CNN for texture denoising and structure enhancement on low dose CT image

Two stage residual CNN for texture denoising and structure enhancement on low dose CT image

Computer Methods and Programs in Biomedicine 184 (2020) 105115 Contents lists available at ScienceDirect Computer Methods and Programs in Biomedicin...

3MB Sizes 0 Downloads 57 Views

Computer Methods and Programs in Biomedicine 184 (2020) 105115

Contents lists available at ScienceDirect

Computer Methods and Programs in Biomedicine journal homepage: www.elsevier.com/locate/cmpb

Two stage residual CNN for texture denoising and structure enhancement on low dose CT image Liangliang Huang a, Huiyan Jiang a,∗, Shaojie Li b, Zhiqi Bai a, Jitong Zhang b a b

Software College, Northeastern University, Shenyang 110819, China Sino-Dutch Biomedical and Information Engineering College, Northeastern University, Shenyang 110819, China

a r t i c l e

i n f o

Article history: Received 23 January 2019 Revised 27 September 2019 Accepted 2 October 2019

Keywords: Image denoising Low dose CT Two stage residual CNN Normal dose CT model

a b s t r a c t Background and objective: X-ray computed tomography (CT) plays an important role in modern medical science. Human health problems caused by CT radiation have attracted the attention of the academic community widely. Reducing radiation dose results in a deterioration in image quality and further affects doctor’s diagnosis. Therefore, this paper introduces a new denoise method for low dose CT (LDCT) images, called two stage residual convolutional neural network (TS-RCNN). Methods: There are two important parts with respect to our network. 1) The first stage RCNN is proposed for texture denoising via the stationary wavelet transform (SWT) and the perceptual loss. Specifically, SWT is performed on each normal dose CT (NDCT) image and generated four wavelet images are considered as the labels. 2) The second stage RCNN is established for structure enhancement via the average NDCT model on the basis of the first network’s result. Finally, the denoised CT image is obtained via inverse SWT. Results: Our proposed TS-RCNN is trained on three groups of simulated LDCT images in 1123 images per group and evaluated on 129 simulated LDCT images for each group. Besides, to demonstrate the clinical application of TS-RCNN, we also test our method on the 2016 Low Dose CT Grand Challenge dataset. Quantitative results show that TS-RCNN almost achieves the best results in terms of MSE, SSIM and PSNR compared to other methods. Conclusions: The experimental results and comparisons demonstrate that TS-RCNN not only preserves more texture information, but also enhances structural information on LDCT images. © 2019 Elsevier B.V. All rights reserved.

1. Introduction X-ray computed tomography (CT) is one of the most important imaging modality in modern hospitals and electromagnetic dose used in CT scan is positively correlated with the incidence of cancer [1,2]. Therefore, it is of great significance to reduce the X-ray dose. However, image quality will decrease and more artifacts will appear in CT images as the electromagnetic dose decreases. Artifacts in CT images refer to unrelated scanned tissue structures during imaging due to machine failure or human factors which can degrade image quality and further affect the diagnosis of lesions. Low quality CT image will disturb the doctor’s diagnosis. Thus, researchers have extensively explored methods to minimize X-ray dose while maintaining image quality. Early researchers propose model-based iterative reconstruction (IR) methods [3–5] and dic-



Corresponding author. E-mail address: [email protected] (H. Jiang).

https://doi.org/10.1016/j.cmpb.2019.105115 0169-2607/© 2019 Elsevier B.V. All rights reserved.

tionary learning algorithms [6,7]. But, these methods are difficult to promote in practical scenarios due to the high computational cost. Many effective traditional post-processing methods are proposed to improve the quality of low dose CT (LDCT) images. For example, the Non-Local Mean (NLM) methods [8,9] are suitable for CT image denoising. Based on the theory of compressed sensing, an improved K-SVD method [10] is proposed to reduce artifacts in CT images. Block Matching 3D (BM3D) [11–13] algorithms achieve excellent results in the work of image restoration. The quality of LDCT images is significantly improved using these methods. However, excessive smoothing or residual errors are often observed in the denoised image. Many effective IR methods [14–17] are proposed to address this issure. But the issue of time consuming for IR methods is still unsolved. New sinogram restoration methods (Sinogram Discriminative Feature Representation) [18,19] are proposed to improve projection data inconsistency. Zhang et al. [20] explored the applicability of Gamma regularization for the

2

L. Huang, H. Jiang and S. Li et al. / Computer Methods and Programs in Biomedicine 184 (2020) 105115

sparse-view CT reconstruction. These methods provide new perspectives for traditional methods. In recent years, convolutional neural network (CNN) provides new perspectives [21] for image processing which shows great potential in medical image provessing. CNN provide many effectual methods for the quality improvement of LDCT images [22,23]. For datasets with a large number of images, many CNN-based methods are proposed to perform low-dose CT denoising and the purpose of these methods is to find the relationship between LDCT and corresponding normal dose CT (NDCT) images in pixel-wise level [24]. For example, Chen et al. [25] proposed a novel convolutional neural network combined autoencoder and shortcut for LDCT image denoising. Kang et al. [26] adopted a similar approach in wavelet domain and achieved considerable success. These methods show better performance in the task of LDCT image denoising. The main contributions of this paper are as follows. (1) As far as we know, a novel uniform framework composed of two residual CNN is first applied for LDCT image denoising. (2) Different from WGAN-VGG [27], a perceptual loss term is added to the first stage RCNN and only applied on the high frequency wavelet images which can preserve more texture information. (3) An average NDCT model is constructed using all NDCT images in the training set and assumed to be the NDCT image of each LDCT image, which can effectively train the second stage RCNN for enhancing structural information on the first stage’s result. The rest of this paper organized as follows. Chapter 2 will describe the implementation process of proposed TS-RCNN. Chapter 3 will show the experimental process, experimental results, and analyze the experimental results. Chapter 4 will discuss related issues and draw conclusions. 2. Method 2.1. Two-stage denoising model Let x denote a LDCT image and y denote the corresponding NDCT image. The target of first stage RCNN is to find a function ft , which simulates the mapping between x and y.

ft (x ) → y

(1)

However, ft cannot perfectly simulate the mapping relationship between x and y, which can only outputs an image ft (x) close to y like most end-to-end deep learning methods. To address this issue, the second stage RCNN is further proposed on the basis of ft . Its target is to find a function fs that simulate the mapping relationship between ft (x) and y. The residual of y and x is used as the input of the second stage RCNN.

f s (y − x ) → y − ft (x )

(2)

While y − x can not be obtained under clinical cases. Here we propose an average NDCT model to address this problem. The average NDCT model is constitute of NDCT image training set and outputs an average image y¯ . Hence, Eq. (2) should be expressed as Eq. (3).

fs (y¯ − x ) → y − ft (x )

(3)

Since the average model is used, the x in fs (y¯ − x ) should be replaced with ft (x). ft (x) contains less noise than x, which means that the second stage RCNN can focus on reducing the minor errors between y and ft (x) without considering the noise in x. Therefore, Eq. (4) should be expressed as.

fs (y¯ − ft (x )) → y − ft (x )

(4)

Finally, the denoised LDCT image is acquired by summing up the results of the two stage. Generally, the process of TS-RCNN is shown in Fig. 1 (a). Following the above theory, the stationary wavelet transform (SWT) [28– 30] is applied on LDCT image and wavelet LDCT images SWT(x) are input into the first stage RCNN to for texture denoising. Meanwhile, SWT is applied on NDCT image y and then SWT(y) are subtract to ft (x). SWT(y) – ft (x) are thus considered as the label of the second stage RCNN. Noted that, SWT(y) are the label of the first stage RCNN. From the perspective of the algorithm, the input data of the second stage RCNN should be SWT(y) – ft (x). But in the actual application scenario, the patient does not need to regain NDCT images if LDCT images have already been obtained. Therefore, an average model is constructed to address this problem. The average image y¯ , constructed by all NDCT images in training set, is used instead of y. The input of the second stage RCNN is thus SWT(y¯ ) – ft (x). Finally, the results of the above two stage networks are summed up and then the inverse SWT is applied on these fused images to obtain the denoised image. In the first stage RCNN, wavelet LDCT images are used to simulate wavelet NDCT images for performing texture denoising. The second stage RCNN considers SWT(y) – ft (x) as the label and SWT(y¯ ) – ft (x) as the input, whcih reduces the slight error between ft (x) and SWT(y) to achieve the goal of structure enhancement. The denoised image ydenoised is thus obtained using Eq. (5).

ydenoised = SWT−1 ( ft (x ) + fs (y¯ − ft (x )))

(5)

SWT−1

where represents the inverse stationary wavelet transform. ft (x) is the result of texture denoising CNN and fs is the result of structure enhancement CNN. As we know, the high frequency wavelet images are composed of detail and noise. It should be emphasized that TS-RCNN will not enhance noise in the high frequency wavelet images. It is considered that the label images SWT(y) contains no noise. Therefore, this network will suppress the noise in high frequency wavelet images of SWT(x), thereby achieving the effect of retaining details and removing noise. In the experiment section, the relevant experimental data are presented to prove this theory. Inspired by ResNet [31]. The residual CNN structure of TS-RCNN is shown in Fig. 1 (b), which includes 9 convolution layers. The kernel number of all convolution layers are set as 128 except the last convolution layers. The first eight convlotion laters have corresponding batch normalization layers to prevent over-fitting. Two shortcuts are added for gradient conduction so that the network can train a deeper structure. With the deeper structure, deeper features can be extracted for noise reduction, which can help preserve more texture information. It is worth mentioning that an add layer is used so that features in LDCT image can be preserved and reflected in the denoised images. The network uses ReLU as activation function. The size of input and output data of the two stage networks are both 512 × 512 × 4. Noted that, there is data difference between the first stage RCNN and the second stage RCNN, the structure of these two stages are slightly different. The add layer in red bounding appears only in the first stage while there is no add structure in the second stage. 2.2. Texture denoising residual CNN 2.2.1. Texture denoising on wavelet domain Most wavelet-based noise reduction methods have achieved excellent results in the traditional noise reduction domain. SWT has the characteristic of preserving image size, which can provide more details and features of image for network training. Therefore, wavelet LDCT images are used as the input data and corresponding wavelet NDCT images are used as labels. In the texture denoising RCNN, wavelet NDCT images are simulated to achieve the target

L. Huang, H. Jiang and S. Li et al. / Computer Methods and Programs in Biomedicine 184 (2020) 105115

3

Fig. 1. The left figure (a) is the denoise process of the TS-RCNN. The red arrow represents the label of the pointed RCNN. The right figure (b) is the network structure of RCNN. This figure shows the structure and parameters of the RCNN. The size convolution kernel is marked near the corresponding layer. The stride of all convolution layers are set to 1. The input shape and output shape are both 512 × 512 × 4 in both RCNN.

of texture noise reduction. The training process is summarized in Fig. 2. 2.2.2. Loss function For the purpose of texture denoising, texture preservation is performed on low frequency wavelet images and noise suppression is performed on high frequency wavelet images. Inspired by wgan-vgg [27], the perceptual loss term is combined with Mean Square Error (MSE) for texture suppression and preservation, respectively. As shown in Fig. 2, the 4-channel images R11 , R12 , R13 , R14 (512 × del pixels) of the network output are correspond to the 4channel wavelet domain images LND , HND , HND , HND , respectively. 1 2 3 4 The loss function is expressed as follows, 1 Loss1T S−RCNN = Loss1MSE(R1 ,LND ) + ωLossVGG 1

(6)

where Loss1MSE represents the mean squared error between R11 and LND . ω represents the weight of LossVGG . LossVGG measures the mi1 nor differences in features between high frequency NDCT images and high frequency denoised images as introduced in [24]. Motivated by [24], LossVGG in work of TS-RCNN is defined below, 1 LossVGG = ||F eatureNDCT − F eatureLDCT ||2

(7)

where || · || represents the Euclidean norm. Feature represents feature maps extracted via VGG-19 [32], which are pre-trained on ImageNet [33]. Note that, Yang [24] extracted features of spatial domain, while TS-RCNN extracts features of high frequency images in wavelet domain. Features extracted in the fourth pooling layer of VGG-19 are used.

As shown in Fig. 2, there are only minor nuance differences exist between high frequency parts of NDCT (R12 , R13 , R14 ) and deND ND noised LDCT (HND 2 , H3 , H4 ). Since the high frequency wavelet images are dominated by details and noise, VGG can make these information have multiple angles of presentation. In this paper, 512 slices of 32 × 32 image patches are obtained by VGG extraction. Compared to high frequency images with size of 512 × 512, 512 slices of 32 × 32 image patches can reflect these differences from more angles. The loss function takes the differences between NDCT and 512 denoised LDCT image patches as objective. In this way, it is possible to provide more angular details for high frequency images in order to distinguish detail and noise. It is worth mentioning that the perceptual loss term is not used in STAGE2. This is because that the differences between the training data and label in STAGE2 are too small. In this case, more training data is needed. 2.3. Structure enhancement residual CNN 2.3.1. Structure enhancement on wavelet domain The structure enhancement network works with the help of the first stage RCNN’s results. SWT(y¯ )-SWT(ft (x)) is used as the input and SWT(y)-SWT(ft (x)) is used as the label. MSE is used as loss function. The network structure is consistent with the texture denoising network. As shown in Fig. 3, R11 , R12 , R13 , R14 represent the results of texture denoising network and R21 , R22 , R23 , R24 represent the results of structure enhancement network. E21 , E22 , E23 , E24 represent SWT(y)-SWT(ft (x)). The superscripts LD, avg and ND represent the

4

L. Huang, H. Jiang and S. Li et al. / Computer Methods and Programs in Biomedicine 184 (2020) 105115

Fig. 2. The workflow of STAGE 1 texture denoising residual CNN. LOSS1TS−RCNN presents the objective function of the STAGE1.

Fig. 3. The workflow of STAGE 2 structure enhancement residual CNN. E21 , E22 , E23 , E24 are the training targets of the STAGE2.

result images obtained by processing the corresponding LDCT, average and NDCT images, respectively. 2.3.2. NDCT average model The NDCT average model is designed for solving the defect of the second stage RCNN. The purpose of the second stage is to obtain the difference fs (y¯ − ft (x )) between y and results of texture denoising network ft (x). In other words, the input of the second CNN should be SWT(y)-SWT(ft (x)), which represents the difference between NDCT images and result images of the first stage RCNN. But y cannot be obtained in practice. Therefore, the NDCT average model is constructed to obtain an average image y¯ that be acted as y. 3. Experiment 3.1. Dataset Both simulated data and clinical data are used to verify TSRCNN. Experiments on simulated data are focused on improving the effectiveness of TS-RCNN. Different noise intensities are used to evaluate the robustness of TS-RCNN. Meanwhile, the experiments

with clinical data confirm that TS-RCNN has the practical application value. 3.1.1. Simulated data 1375 NDCT images (10 patients) are downloaded from 3Dircadb.1 The visual effect produced by adding noise directly on NDCT image is inconsistent with real LDCT. Hence this paper adds noise on the sinogram of LDCT. In this way, the simulated image is more similar to real LDCT, visually. The specific simulation steps are as follows. First, radon transform is performed on a NDCT image to get the corresponding sinogram. Next, Poisson noise and Gaussian noise are added to the sinogram sequentially. Finally, inverse radon transform is applied on the noised sinogram to get simulated LDCT image. The process can be mathematically expressed as,

 

LDCTsimulated = iradon ln

I0 I0 × exp(−radon(y )) + P (λ )



+ G(μ, σ 2 )



(8) 1

https://www.ircad.fr/research/3dircadb/.

L. Huang, H. Jiang and S. Li et al. / Computer Methods and Programs in Biomedicine 184 (2020) 105115

5

Table 1 Parameters about network training. Items

Stage 1

Stage 2

Optimizer Initial learning rate Batch size EarlyStopping patience Learning rate decay patience

Adam (β1 = 0.5, β2 = 0.999) 2 × 10−3 4 8 5

Adam (β1 = 0.5, β2 = 0.999) 5 × 10−4 16 8 3

where I0 is the photon number emitting from the source, y is the attenuation coefficients. radon represents radon transform and iradon represents inverse radon transform. G(μ, σ 2 ) represents Gaussian distribution. Larger the σ is, the more noises are in the images. P(λ) represents Poisson distribution. In the experiment, the intensity of the noise is controlled by adjusting the σ of the Gaussian function. The greater the intensity is, the harder it is to denoise. The complexity of the noise is controlled by adding Poisson noise. The higher the complexity of the noise, the more difficult it is to denoise. Totally, 3 set of simulated images with different noise levels (i.e., different μ and whether to add Poisson noise) are used for comparison. The parameters of Poisson noise and Gaussian noise are given in the Table 2. 1123 images (8 patients) are used as training set. 123 images (1 patient) are used as validation set and 129 images (1 patient) are used as test set.

3.3. Parameters selection The network is trained on GTX1080Ti in Windows 10. CUDA 9.0 is used to accelerate network training. Python, Tensorflow [34], Keras [35] are used to complete the study. The Adam [36] optimizer is used to optimize the training process of network. According to the experimental situation, the regular term of LossVGG is set as ω = 1 × 10−8 in texture denoising network. Detailed parameters about network training are shown in the following Table 1. The denoised results are evaluated with common image quality evaluation indexes, including MSE, Structural Similarity Index (SSIM) and Peak Signal to Noise Ratio (PSNR). Their calculation formulas are as follows,

MSE =

m−1 n−1 1  [y(i, j ) − x(i, j )]2 mn

(9)

i=0 j=0

3.1.2. Clinical data Since the denoising results of clinical low dose CT data cannot be quantified, a well-recognized simulated low dose CT dataset is used to verify the performance of TS-RCNN. The dataset is authorized by Mayo Clinics for ‘the 2016 NIH-AAPM-Mayo Clinic Low Dose CT Grand Challenge’.2 2378 images (10 patients) reconstructed by medium kernel with 3 mm thickness are used in this paper. Note that, we have received license to publish paper. 1944 images (8 patients) are used as training set. 210 images (1 patient) are used as validation set. 224 images (1 patient) are used as test set.

SSIM =

(2μy μx + c1 )(2σyx + c2 ) (μ2y + μ2x + c1 )(σy2 + σx2 + c2 )   2

P SNR = 10 · log10

MAXy MSE

(10) (11)

where y and x represent NDCT image and denoised image, respectively. m and n represent the width and height of the image. μ is the average value of the image. σ 2 is the variance value of the image and σ yx is the covariance of the compared two images. MAXy is the max pixel values of the NDCT image. These indexes are calculated on each images in the test set with all compared methods. 3.4. Model determination

3.2. Network convergence In order to facilitate the network to a convergence, the early stop mechanism is used in this experiment, which uses the loss function of the validation set (val-loss) as a reference. The network will stop training earlier than it should be when the value of valloss is higher than the last time more than patience times. Early stop can not only effectively prevent over-fitting, but also prevent under-fitting. In the experiment, a large initial epoch number is set. The mechanism makes the network stop training earlier than the initial epoch and ensures that the training time is not too long to improve the efficiency of the TS-RCNN. Meanwhile, the optimizer learning rate decay mechanism is used to ensure that the network can achieve better results. Similar to the early stop mechanism, val-loss is also used as reference. Both mechanisms work to ensure that the network can achieve a more ideal convergence state and prevent over-fitting. In the experiment, the texture denoising network reaches the early stop condition around the 55th epoch and the structure denoising network reaches the early stop condition around the 30th epoch. For the comparison parameter settings and training times are consistent with corresponding original papers in order to restore the effect mentioned in them.

2

https://www.aapm.org/GrandChallenge/LowDoseCT/.

Four groups of comparative experiments are performed to verify the theory described in Section 2, namely TS-RCNN-1, Only-LF, NoAvg and TS-RCNN-2. TS-RCNN-1 represents only the first stage is used. Only-LF represents only wavelet low frequency images are trained for image enhancement, which uses the processed low frequency image and the unprocessed high frequency images as the input of SWT−1 to obtain the result. Since the comparison results clearly distinguish which is better, there is no second stage experiment performed on Only-LF. NoAvg represents that the average model is not used in the process. It uses SWT(STAGE1output ) - SWT(LDCT) as the inputs of the STAGE2. TS-RCNN-2 represents the complete process of TS-RCNN. Three groups of different noise intensity are simulated by back projection algorithm in order to show the merit of the architecture of TS-RCNN-2. Note that, the group with only Gaussian noise has the most similar intensity with the realistic CT data according to numerical results. The numerical results are shown in Table 2. Obviously, the result of TS-RCNN-1 is better than Only-LF’s. This means that TS-RCNN will retain details and remove noise on high frequency images. Compared with TS-RCNN-1, TS-RCNN-2 has a comprehensive improvement which reflects the necessity of the second stage RCNN. It can be observed that all methods work well when only Gaussian noise is added. However, it can be clearly seen that TS-RCNN-2 can be stabilized at an acceptable level after

6

L. Huang, H. Jiang and S. Li et al. / Computer Methods and Programs in Biomedicine 184 (2020) 105115 Table 2 The comparative experiments inside TS-RCNN. Noise

Index

LDCT

TS-RCNN-1

Only-LF

NoAvg

TS-RCNN-2

Gaussian μ = 0, σ = 4 × 10−5

PSNR SSIM MSE PSNR SSIM MSE PSNR SSIM MSE

28.9065 0.6135 86.1057 29.3963 0.6475 76.2016 28.1232 0.5771 102.2986

37.6655 0.9508 12.7192 31.3399 0.7722 50.0142 30.823 0.7152 56.2625

33.3208 0.7578 30.944 31.4191 0.8229 47.406 28.9916 0.6286 83.3285

37.3196 0.9575 14.0708 32.8271 0.8277 36.3167 30.6887 0.7496 58.8262

38.4755 0.9493 10.4006 34.8537 0.9185 23.4231 33.6012 0.8843 31.2167

Poisson & Gaussian ( λ = 1 ) & ( μ = 0, σ = 4 × 10−5 ) Poisson & Gaussian ( λ = 1 ) & ( μ = 0, σ = 6 × 10−5 )

Table 3 The comparative experiments with other methods. Index

LDCT

BM3D

DnCNN

RED-CNN

TS-RCNN-2

PSNR SSIM MSE

26.5259 0.8188 261.6591

26.3179 0.8178 152.4959

30.8122 0.8803 55.0427

31.0024 0.8773 53.2739

31.0606 0.8812 54.1549

PSNR and SSIM and is slightly inferior than RED-CNN in MSE. To further show the merits of TS-RCNN, the absolute difference images relative to the original image are in Fig. 5. Here, we introduce consistency factor C to describe the difference in gray value between the denoised image and NDCT per pixel. The formula for calculating C is as follows,

C =1− adding Poisson noise and the effects of other methods are significantly reduced. The visual denoised images are shown in Fig. 4. we can see that, (1) there are still remaining noises need to be removed for Only-LF, (s) NoAvg has unclear boundaries as shown in the yellow rectangle (zoomed images), (3) TS-RCNN-1 and TS-RCNN-2 have similar texture details, but TS-RCNN-2 is more similar to NDCT in brightness and contrast compared to that of TS-RCNN-1. 3.5. Comparison with others BM3D [11], DnCNN [37], RED-CNN [25]. The training and testing processes are carried out on the dataset of ‘the 2016 NIH-AAPMMayo Clinic Low Dose CT Grand Challenge’. The test set is consists of 224 LDCT images from one patient. 3.5.1. Comparison of objective indicators PSNR, SSIM, and MSE are selected as objective evaluation indicators for image quality. The average numerical results are shown in Table 3. The numerical data of LDCT images are calculated from HU range: [−160, 240]. TS-RCNN-2 performs the best in terms of

Di, j , i, j = 1, 2, . . . , 512 255

(12)

where Di,j presents the pixel value of the (i, j)th pixel of difference image. Fig. 5 shows the denoised image, processed difference image with C=1, processed difference image with C ≥ 0.99. For visual display, this paper binarizes the pixel that satisfies the condition and the pixel that does not satisfy the condition. The pixel value of pixels that satisfies the condition is set to 255, and the opposite is 0. In other word, the more white areas, the better the effect. Obviously, TS-RCNN works best in the second column images. In the third column, TS-RCNN has the clearest edge of rib. In the fourth column of images, the effect of TS-RCNN is still satisfactory, with more white pixels and the edges of the ribs are still clear. It should be noted that the edge of the LDCT is also clear, but it is obvious that there are large black pixels in the part of the organ texture. Meanwhile, the 10-fold cross T test is used to analyze the statistical difference between TS-RCNN and other methods, and their P-value are calculated and presented in Table 4. It is clear that the P values obtained by LDCT or BM3D are much less than 0.01, which proves that the statistical effect of TS-RCNN is much better than these two methods. Compared with DnCNN and

Fig. 4. Denoised results of mentioned methods. Group with only gaussian noise with σ = 4 × 10−5 is added to the displayed LDCT image. The red rectangle in (a) NDCT represents the zoomed area. Even column images are zoomed images of area inside red rectangle. The areas inside green circles in even column images are lesions. The area inside the yellow rectangle is the edge of the liver. Images are all shown in hounsfield unit (HU) range: [-200, 200].

L. Huang, H. Jiang and S. Li et al. / Computer Methods and Programs in Biomedicine 184 (2020) 105115

7

Fig. 5. The first column is the denoised image, the second column is the absolute difference image with C=1, and the third column is the zoomed view of the red box (upper left) in the second column. The fourth column is the absolute difference image with C ≥ 0.99, and the fifth column is the zoomed view of the red box (bottom right) in the fourth column.

RED-CNN, totally 6 P value of three indicators between 2 methods are more than 0.01, which means that in the statistical sense, TSRCNN is not significantly different from RED-CNN and DnCNN.

Table 4 The 10-fold cross T-test is performed on the data of the PSNR, SSIM, MSE data between TS-RCNN and other methods. P value of each set of data is given in the table. P-Value-of PSNR SSIM MSE

LDCT

BM3D −145

2.9 × 10 1.5 × 10−52 7.2 × 10−119

−211

6.2 × 10 7.5 × 10−73 5.9 × 10−255

DnCNN

RED-CNN

0.1205 0.7515 0.6616

0.6254 0.1912 0.6594

3.5.2. Comparison of structure enhancement In order to show the enhancement of structure information, the canny operator is used for edge detection, which can extract the edge information from the image. This paper uses this characteristic to demonstrate the structural gain of TS-RCNN. The canny operator in the python opencv package is used for feature extraction and the effect of edge detection is as shown in Fig. 6. From odd column images in Fig. 6, it can be clearly observed that the edge detection image of LDCT has an unclear edge and BM3D has a problem that the texture density is too low. The corresponding indicated area (blue (bottom) and green (upper) arrows) shows the effect of TS-RCNN on structural enhancement, which does not have any isolated points. Table 5 further quantifies the

8

L. Huang, H. Jiang and S. Li et al. / Computer Methods and Programs in Biomedicine 184 (2020) 105115

Fig. 6. Odd columns are edge detection images obtained the canny operator. Even column images are zoomed areas of red boxes (upper right) in odd columns.

Table 5 The SSIM comparison of the edge detection image between noise reduction images and NDCT images. SSIM

LDCT

BM3D

DnCNN

RED-CNN

TS-RCNN-2

Mean Standard deviation

0.594 0.061

0.522 0.047

0.618 0.060

0.615 0.059

0.622 0.060

effect of structural gain, where SSIM of the edge detection image between denoised images and NDCT images. Note that, the edge detection is performed on all images in the test dataset (1 patient, 224 images) and the structural similarity is expressed by calculating the SSIM of these images. It can be seen that TS-RCNN achieves the highest SSIM among all methods. Besides, the standard deviation of BM3D is the smallest, which proves that BM3D is the most stable and mature method in recent years. To sum up, Table 4 and 5 both demonstrate that TS-RCNN has certain advantages in preserving texture information and structure enhancement.

co-occurrence matrix for the given offset. The (i, j)th value of the co-occurrence matrix gives the number of times in the image that the ith and jth pixel values occur in the relation given by the offset. GLCM characterizes texture features using grayscale spatial distribution. Here, the contrast, dissimilarity, and homogeneity of GLCM are calculated to show improvements in texture. The contrast reflects the sharpness of the image and the depth of the texture. The dissimilarity shows the sharpness of image and homogeneity measures how much the image texture changes locally. The calculation of the above indicators is implemented using python’s scikit-image library. In our experiments, the distance of offset is set as 1 or 2, the angles of offset are set as 0, 0.25π , 0.5π , 0.75π , π , 1.25π , 1.5π , 1.75π . The detailed formulas are as follow,

Contrast =

l e vel −1

(13)

i, j=0

Dissimilarity = 3.5.3. Comparison of texture denoising The performance of texture denoising on images are analyzed as follows. Fig. 7 shows the denoised images with compared methods. An abdominal CT image is used for display. Through discussions with radiologists, four different local regions are selected to demonstrate the denoise effect of the TS-RCNN. The area of the red box (the second column, the upper box of the first column) and the blue box (the fourth column,the left box of the first column) are artifacts. The green box (the third column, the mid box of the first column) is inferior vena and the yellow box (the fifth column, the right box of the first column) is left renal. For artifacts in the red box, no method suppresses the artifacts. There are striped artifact in the center of the green box, TS-RCNN performs the best for the suppression of this artifact. Through visually observing the image of the LDCT, it can be seen that the artifact is very similar to the real body tissue. All methods treat those areas as body tissues and artifacts are not removed. The yellow box has a highlighted tissue in the upper left corner of the LDCT, and the highlighting in the NDCT is not obvious, hence all methods enhance this area. In order to show the improvement of texture information, the Gray-Level Co-Occurrence Matrix (GLCM) is used to measure the reduction of texture noise. GLCM is a histogram of co-occuring greyscale values at a given offset over an image. The offset (δ x, δ y) is a position operator that can be applied to any pixel in the image. An image with c different pixel values will produce a c × c

Pi, j · (i − j )2

l e vel −1 i, j=0

Homogeneity =

l e vel −1

Pi, j 1 + ( i − j )2

(14)

Pi, j · |i − j|

(15)

i, j=0

where P is the GLCM calculated from the corresponding image, level is the value number of pixels (255). Since the texture information of CT images affects the doctor’s judgment on the patient’s condition, it is considered that the texture value is close to the NDCT, indicating that the effect is better. Meanwhile, the test dataset covers all CT images of a patient, including organs in different parts of the body. Hence it is not advisable to select the same region of interest (ROI) for texture information evaluation. Here, the whole image is used as ROI. Table 6 presents the contrast, dissimilarity, and homogeneity of GLCM between denoised images and NDCT images, where the mean value and standard deviation of the average of these three properties are given. The smaller the difference, the closer the texture representing the denoised image is to NDCT. It can be seen that TS-RCNN are ranked second in indexes, which is slightly less than DnCNN. 4. Discussions and conclusions This paper considers the shortcomings of some current LDCT image denoising methods using CNN structure. CNN-based im-

L. Huang, H. Jiang and S. Li et al. / Computer Methods and Programs in Biomedicine 184 (2020) 105115

9

Fig. 7. Denoised results of mentioned methods. Dataset of mayo is used. Images are shown in hounsfield unit (HU) range: [−160, 240]. The first column are original images. There are four areas marked by different color boxes inside them. The four columns on the right are zoomed areas. Red box (the second column) markes the artifact area. Green box (the third column) markes the inferior vena area. Blue box (the fourth column) markes the artifact area. Yellow box (The fifth column) marks the left renal area.

10

L. Huang, H. Jiang and S. Li et al. / Computer Methods and Programs in Biomedicine 184 (2020) 105115 Table 6 Comparison of contrast, dissimilarity and homogeneity. The data in the table are the average and standard deviation of these three GLCM attributes for denoised images and the NDCT images in the testset (224 images). Property

Index

BM3D

RED-CNN

DnCNN

TS-RCNN

NDCT

Contrast

STD Mean STD Mean STD Mean

1.05 × 108 0.84 × 109 1.17 × 105 1.57 × 106 1.35 × 106 1.41 × 107

1.25 × 108 1.15 × 109 1.26 × 105 1.51 × 106 1.50 × 106 1.81 × 107

1.23 × 108 1.20 × 109 1.35 × 105 1.45 × 106 1.65 × 106 1.99 × 107

1.24 × 108 1.16 × 109 1.29 × 105 1.50 × 106 1.62 × 106 1.83 × 107

1.24 × 108 1.29 × 109 1.43 × 105 1.45 × 106 1.94 × 106 2.24 × 107

Homogeneity Dissimilarity

age denoising methods can surpass traditional methods in many cases. These methods achieve acceptable denoising effects. However, good-performing CNN methods tend to have more complex network structure, which are not very friendly for training and implementation. RED-CNN combines the autoencoder, deconvolution network, and shortcut connections to construct a LDCT denoising model. Due to the use of deconvolution, RED-CNN will inevitably have defects in the preservation of CT image texture information [38]. DnCNN tries to construct a model that separating noise from noisy observation. DnCNN has excellent theoretical foundation and good experimental results. However, according to the theory of this paper, DnCNN belongs to the one stage network model, which means the results of DnCNN still have room for improvement. Therefore, this paper proposes a two stage residual CNN (TSRCNN) on low dose CT image denoising. TS-RCNN uses relatively simple network structure. TS-RCNN not only improves the quality of LDCT images, but also reduces computational cost. As can be seen from training Epochs of TS-RCNN, total training epochs are around 80 times (STAGE 1 + STAGE 2), with an average of 180 seconds each epoch. A total of 4 h of training is not a high computational cost. Note that, time spent on the second stage RCNN’s training for the first stage will not output better result. It can be proved by the early stop mechanism. The network reaches convergence when pre-set early stop condition is reached, which means more time will be exchanged for less progress. TS-RCNN does not require high cost post-processing. Its running time is significantly less than RED-CNN. RED-CNN is a patch-based denoise method, which means it takes a lot of time to stitch image patches after network training. Meanwhile, the second stage RCNN also rises a problem. The network training label under ideal conditions cannot be obtained. For this reason, average NDCT model is proposed to address this problem. This model combines all NDCT images in training set to obtain the average image. The model effectively solves the defect of ideal label cannot be obtained. As shown in Table 1, LDCT images contains more noise when the sort of noise is increased, that means a stronger fitting ability is need. Most notably, the nearly same two-stage structure, NoAvg is unsatisfactory under complex noise conditions, while TS-RCNN-2 is maintained at an acceptable level. These prove that average model can enhance the noise reduction capability of TS-RCNN-2. TS-RCNN-2 is capable of performing complex noise denoising tasks. In addition, we use perceptual loss to improve visual effects. Different from the original work, the perceptual loss term is applied on the high frequency wavelet images obtained by stationary wavelet transform. It is devoted to preserving more texture features, so that the denoised LDCT image is closer to the corresponding NDCT image. Finally, we design a set of experiments to verify the effectiveness of TS-RCNN for texture denoising and structural enhancement. Following Table 6, TS-RCNN is slightly inferior than DnCNN in terms of texture indexes. The essence of DnCNN is to estimate the noise distribution from noise image and remove it. Ac-

cording to Table 5, TS-RCNN is superior than other methods in terms of structural enhancement. In discussions with radiologists, CT images can provide structural information of human tissues, which has important guiding significance for surgical navigation and biopsy guidance. Therefore, the more obvious the structural information of the ROI is, the more meaningful it is for therapeutic diagnosis, which also proves that TS-RCNN is clinically meaningful. It must admit that TS-RCNN still has many defects. First, the improvement of TS-RCNN on the image texture is weaker than DnCNN. On the one hand, the NDCT average image model is introduced to enhance the structural information. As shown in Table 2, the term of TS-RCNN-2 has a significant improvement compared to NoAvg on the PSNR, SSIM, and MSE. The higher the noise and the more complicated the noise, the worse the performance of NoAvg. However, the cost of NDCT average image model is that it suppresses part of the texture information, which causes visually smoothness to some extent. On the other hand, the loss function of first stage network combines the MSE for low frequency information of texture and the perceptual loss for high frequency information of structure, which means that the partial fitting ability of the network is placed on the processing of high frequency information. While the DnCNN trains on the spatial domain, which means DnCNN does not need to deliberately divide some computing power into the processing of high-frequency information. Although TS-RCNN does not perform as well as DnCNN on texture, the comprehensive performance of TS-RCNN has certain advantages. Second, the use of perceptual loss also poses some problems. As shown in Fig. 7, TS-RCNN does not clearly restore the fine tissue in the liver compared to the MSE-based method. This issue will also be our future work direction. Third, TS-RCNN is a two stage network and the average model assists to the establishment of the second stage, which will inevitably introduce some bias information into final image. Thus, the performance of TS-RCNN on textures is not as well as DnCNN. In summary, a new denoising network called, two stage residual CNN (TS-RCNN), for LDCT images is proposed. To solve the defect of the second stage RCNN, the NDCT model is proposed. Meanwhile, the use of perceptual loss in the wavelet domain helps the image to retain more visual features. By the approach of TS-RCNN, the denoised image not only has improvements in evaluation standard, but also has better visual effect. In the future, we plan to solve the problem of the bias information in the average image in order to get better results. Declaration of Competing Interest The authors declare that they have no competing interests. Acknowledgments This work was supported by the National Natural Science Foundation of China (No. 61872075). We gratefully acknowledge the

L. Huang, H. Jiang and S. Li et al. / Computer Methods and Programs in Biomedicine 184 (2020) 105115

support of General Hospital of Shenyang Military Area Command, especially discussions with Dr. Wang (Zhiguo Wang) and his group (Youchao Wang, Jia Guo and Guoxiu Lu) on image quality evaluation. References [1] L. Tanoue, Computed tomography — an increasing source of radiation exposure, Yearb. Pulm. Dis. 2009 (2009) 154–155, doi:10.1016/s8756-3452(08) 79173-4. [2] A.B. de González, S. Darby, Risk of cancer from diagnostic x-rays: estimates for the UK and 14 other countries, Lancet 363 (9406) (2004) 345–351, doi:10.1016/ s0140- 6736(04)15433- 0. [3] M. Beister, D. Kolditz, W.A. Kalender, Iterative reconstruction methods in x-ray CT, Phys. Med. 28 (2) (2012) 94–108, doi:10.1016/j.ejmp.2012.01.003. [4] M.G. McGaffin, S. Ramani, J.A. Fessler, Reduced memory augmented Lagrangian algorithm for 3d iterative x-ray CT image reconstruction, in: N.J. Pelc, R.M. Nishikawa, B.R. Whiting (Eds.), Medical Imaging 2012: Physics of Medical Imaging, SPIE, 2012, doi:10.1117/12.910941. [5] A.K. Hara, R.G. Paden, A.C. Silva, J.L. Kujak, H.J. Lawder, W. Pavlicek, Iterative reconstruction technique for reducing body radiation dose at CT: feasibility study, Am. J. Roentgenol. 193 (3) (2009) 764–771, doi:10.2214/ajr.09.2397. [6] Q. Xu, H. Yu, X. Mou, L. Zhang, J. Hsieh, G. Wang, Low-dose x-ray CT reconstruction via dictionary learning, IEEE Trans. Med. Imaging 31 (9) (2012) 1682– 1697, doi:10.1109/tmi.2012.2195669. [7] Y. Chen, X. Yin, L. Shi, H. Shu, L. Luo, J.-L. Coatrieux, C. Toumoulin, Improving abdomen tumor low-dose CT images using a fast dictionary learning based processing, Phys. Med. Biol. 58 (16) (2013) 5803–5820, doi:10.1088/0031-9155/ 58/16/5803. [8] J. Ma, J. Huang, Q. Feng, H. Zhang, H. Lu, Z. Liang, W. Chen, Low-dose computed tomography image restoration using previous normal-dose scan, Med. Phys. 38 (10) (2011) 5713–5731, doi:10.1118/1.3638125. [9] E.Y. Sidky, X. Pan, Image reconstruction in circular cone-beam computed tomography by constrained, total-variation minimization, Phys. Med. Biol. 53 (17) (2008) 4777–4807, doi:10.1088/0031-9155/53/17/021. [10] M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Trans. Signal Process. 54 (11) (2006) 4311–4322, doi:10.1109/tsp.2006.881199. [11] K. Dabov, A. Foi, V. Katkovnik, K. Egiazarian, Image denoising with block-matching and 3d filtering, in: Proc. SPIE-IS&T Electronic Imaging, 2006, 606414, 2006, pp. 354–365. [12] P.F. Feruglio, C. Vinegoni, J. Gros, A. Sbarbati, R. Weissleder, Block matching 3d random noise filtering for absorption optical projection tomography, Phys. Med. Biol. 55 (18) (2010) 5401–5415, doi:10.1088/0031-9155/55/18/009. [13] D. Kang, P. Slomka, R. Nakazato, J. Woo, D.S. Berman, C.-C.J. Kuo, D. Dey, Image denoising of low-radiation dose coronary CT angiography by an adaptive blockmatching 3d algorithm, in: S. Ourselin, D.R. Haynor (Eds.), Medical Imaging 2013: Image Processing, SPIE, 2013, doi:10.1117/12.2006907. [14] Y. Chen, L. Shi, Q. Feng, J. Yang, H. Shu, L. Luo, J.-L. Coatrieux, W. Chen, Artifact suppressed dictionary learning for low-dose CT image processing, IEEE Trans. Med. Imaging 33 (12) (2014) 2271–2292, doi:10.1109/tmi.2014.2336860. [15] Y. Chen, J. Ma, Q. Feng, L. Luo, P. Shi, W. Chen, Nonlocal prior Bayesian tomographic reconstruction, J. Math. Imaging Vis. 30 (2) (2007) 133–146, doi:10. 1007/s10851- 007- 0042- 5. [16] J. Liu, Y. Hu, J. Yang, Y. Chen, H. Shu, L. Luo, Q. Feng, Z. Gui, G. Coatrieux, 3D feature constrained reconstruction for low-dose CT imaging, IEEE Trans. Circt. Syst. Video Technol. 28 (5) (2018) 1232–1247, doi:10.1109/tcsvt.2016.2643009. [17] J. Liu, Y. Zhang, Q. Zhao, T. Lv, W. Wu, N. Cai, G. Quan, W. Yang, Y. Chen, L. Luo, H. Shu, J.L. Coatrieux, Deep iterative reconstruction estimation (DIRE): approximate iterative reconstruction estimation for low dose CT imaging, Phys. Med. Biol. (2019), doi:10.1088/1361-6560/ab18db.

11

[18] Y. Chen, Y. Zhang, H. Shu, J. Yang, L. Luo, J.-L. Coatrieux, Q. Feng, Structureadaptive fuzzy estimation for random-valued impulse noise suppression, IEEE Trans. Circt. Syst.Video Technol. 28 (2) (2018) 414–427, doi:10.1109/tcsvt.2016. 2615444. [19] J. Liu, J. Ma, Y. Zhang, Y. Chen, J. Yang, H. Shu, L. Luo, G. Coatrieux, W. Yang, Q. Feng, W. Chen, Discriminative feature representation to improve projection data inconsistency for low dose CT imaging, IEEE Trans. Med. Imaging 36 (12) (2017) 2499–2509, doi:10.1109/tmi.2017.2739841. [20] J. Zhang, Y. Hu, J. Yang, Y. Chen, J.-L. Coatrieux, L. Luo, Sparse-view x-ray CT reconstruction with gamma regularization, Neurocomputing 230 (2017) 251– 269, doi:10.1016/j.neucom.2016.12.019. [21] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, Commun. ACM 60 (6) (2017) 84–90, doi:10.1145/ 3065386. [22] W. Yang, H. Zhang, J. Yang, J. Wu, X. Yin, Y. Chen, H. Shu, L. Luo, G. Coatrieux, Z. Gui, Q. Feng, Improving low-dose CT image using residual convolutional network, IEEE Access 5 (2017) 24698–24705, doi:10.1109/access.2017.2766438. [23] X. Yin, Q. Zhao, J. Liu, W. Yang, J. Yang, G. Quan, Y. Chen, H. Shu, L. Luo, J.L. Coatrieux, Domain progressive 3d residual convolution network to improve low dose CT imaging, IEEE Trans. Med. Imaging (2019) 1, doi:10.1109/tmi.2019. 2917258. [24] J.M. Wolterink, T. Leiner, M.A. Viergever, I. Isgum, Generative adversarial networks for noise reduction in low-dose CT, IEEE Trans. Med. Imaging 36 (12) (2017) 2536–2545, doi:10.1109/tmi.2017.2708987. [25] H. Chen, Y. Zhang, M.K. Kalra, F. Lin, Y. Chen, P. Liao, J. Zhou, G. Wang, Lowdose CT with a residual encoder-decoder convolutional neural network, IEEE Trans. Med. Imaging 36 (12) (2017) 2524–2535, doi:10.1109/tmi.2017.2715284. [26] E. Kang, J. Min, J.C. Ye, A deep convolutional neural network using directional wavelets for low-dose x-ray CT reconstruction, Med. Phys. 44 (10) (2017) e360–e375, doi:10.1002/mp.12344. [27] Q. Yang, P. Yan, Y. Zhang, H. Yu, Y. Shi, X. Mou, M.K. Kalra, Y. Zhang, L. Sun, G. Wang, Low-dose CT image denoising using a generative adversarial network with wasserstein distance and perceptual loss, IEEE Trans. Med. Imaging 37 (6) (2018) 1348–1357, doi:10.1109/tmi.2018.2827462. [28] G.P. Nason, B.W. Silverman, The stationary wavelet transform and some statistical applications, in: Wavelets and Statistics, Springer, New York, 1995, pp. 281–299, doi:10.1007/978- 1- 4612- 2544- 7_17. [29] R.R. Coifman, D.L. Donoho, Translation-invariant de-noising, in: Wavelets and Statistics, Springer, New York, 1995, pp. 125–150, doi:10.1007/ 978- 1- 4612- 2544- 7_9. [30] J.-C. Pesquet, H. Krim, H. Carfantan, Time-invariant orthonormal wavelet representations, IEEE Trans. Signal Process. 44 (8) (1996) 1964–1970, doi:10.1109/ 78.533717. [31] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2016, doi:10.1109/cvpr.2016.90. [32] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, Comput. Sci. (2014). [33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, doi:10.1109/cvpr.2009.5206848. [34] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, X. Zheng, Tensorflow: large-scale machine learning on heterogeneous distributed systems (2016). [35] F. Chollet, keras, 2015, (https://github.com/fchollet/keras). [36] D. Kingma, J. Ba, Adam: a method for stochastic optimization, Comput. Sci. (2014). [37] K. Zhang, W. Zuo, Y. Chen, D. Meng, L. Zhang, Beyond a gaussian denoiser: residual learning of deep CNN for image denoising, IEEE Trans. Image Process. 26 (7) (2017) 3142–3155, doi:10.1109/tip.2017.2662206. [38] A. Odena, V. Dumoulin, C. Olah, Deconvolution and checkerboard artifacts, Distill 1 (10) (2016), doi:10.23915/distill.0 0 0 03.