Mu-net: Multi-scale U-net for Two-Photon Microscopy Image Denoising and Restoration

Sehyung Lee 1,5, Makiko Negishi 2, Hidetoshi Urakubo 1, Haruo Kasai 2,3, Shin Ishii 1,3,4

1 Integrated Systems Biology Laboratory, Department of Systems Science, Graduate School of Informatics, Kyoto University. 2 Laboratory of Structural Physiology, Center for Disease Biology and Integrative Medicine, Faculty of Medicine, The University of Tokyo. 3 International Research Center for Neurointelligence (WPI-IRCN), The University of Tokyo Institutes for Advanced Study, The University of Tokyo. 4 Advanced Telecommunications Research Institute International (ATR). 5 Corresponding author. E-mail: [email protected]

To appear in Neural Networks (2020). https://doi.org/10.1016/j.neunet.2020.01.026
Abstract
Advances in two-photon microscopy (2PM) have made three-dimensional (3D) neural imaging of deep cortical regions possible. However, 2PM often suffers from poor image quality because of various noise factors, including blur, white noise, and photobleaching. In addition, the effectiveness of existing image processing methods is limited by the special characteristics of 2PM images: deep tissue penetration comes at the cost of high image noise owing to rapid laser scanning. To address the denoising problem in 3D 2PM images, we present a new algorithm based on deep convolutional neural networks (CNNs). The proposed model consists of multiple U-nets, in which each individual U-net removes noise at a different scale, yielding a performance improvement through a coarse-to-fine strategy. Moreover, the constituent CNNs employ fully 3D convolution operations. This architecture enables the proposed model to perform end-to-end learning without any pre/post-processing. In experiments on 2PM image denoising, our new algorithm demonstrated substantial performance improvements over other baseline methods.

Keywords: image denoising, two-photon microscopy image, deep learning, U-net, GAN.
1. Introduction

Two-photon fluorescence microscopy (2PM) is a technique for capturing neural images in three dimensions, or in four dimensions when time is included. Because of its high resolution, particularly in the X-Y plane, and its ability to observe deep cortical regions, 2PM is an important tool for revealing neuroscientific evidence through image processing. However, when 2PM is used for three-dimensional (3D) neural imaging, several difficulties arise in image processing, such as blur, noise, photobleaching, and 3D non-isotropic characteristics caused by the image scanning process. These imperfections may deteriorate the image quality, with further degradation depending on the imaging depth (in the Z-direction). To address such issues, many algorithms have been proposed. Among them, point spread function (PSF)-based de-blurring techniques such as Richardson-Lucy (RL) deconvolution [1, 2] and the Wiener filter [3] have been commonly used. If the lens PSF is known, images can be recovered through deconvolution with the known PSF. These PSF-based deconvolution methods usually assume that the PSF follows a specific distribution, such as a Gaussian or a Bessel function of the first kind, and that the PSF is spatially invariant. However, these assumptions are not necessarily satisfied in real situations, as shown in [4], resulting in a failure to restore clean images. In particular, undesired artefacts can be generated in the processed images because of imperfect PSF estimation.
Recent machine learning techniques have demonstrated significant improvements. In particular, deep learning produces impressive results across a wide range of applications. In the field of image processing, deep learning is used for determining image features in a data-driven fashion, which can be applied to classification and regression problems. Deep learning is also employed in applications related to microscopy image processing [5]. Convolutional neural networks (CNNs) are the most popular deep learning architecture, and they have been successfully applied to various biological image processing tasks [6, 7, 8, 9], including cell segmentation, tissue segmentation, and image denoising. Most applications are formulated as supervised learning: given many pairs of input images and desired output (target) images, a CNN model learns hierarchical features to generate an output image that is close to its target image. This approach greatly reduces the burden of feature engineering and produces very promising results.

Inspired by the successes of deep learning, we developed a new method for restoring 3D 2PM images. Several CNN variants have been developed to improve learning performance, and a famous model among these variants is U-net [7]. U-net is basically a fully convolutional network [10] composed of encoder and decoder networks, and it is distinguished from primitive CNNs by skip connections between corresponding layers of the encoding and decoding blocks. U-net demonstrates superior semantic segmentation performance over usual encoder-decoder-style CNN models because of its multi-scale architecture. In this study, with a particular focus on image denoising scenarios, we extended U-net for use in coarse-to-fine processing. The proposed model, multi-scale U-net (abbreviated as Mu-net), consists of multiple U-nets, where each individual U-net considers a different scale. The coarsest U-net first learns to reproduce a rough shape of the target image using pairs of down-sampled input and target images. The middle-level U-net reconstructs the target output at a middle-level resolution by combining cues from its own input and from the coarsest U-net. The finest U-net then generates the final output image at the same resolution as the original image, in which image details are restored so as to reproduce the target images in the training dataset. In addition to multi-scale learning, a generative adversarial network (GAN) [11] is employed in the training phase to generate an output image that is more similar to the target image. By adopting a discriminator, Mu-net attempts to generate clean images that are sufficient to fool the discriminator, which classifies whether a given image is real (i.e., stored in the training dataset) or artificial (i.e., output by the Mu-net). Owing to this regularization by the GAN, high-quality images can be obtained. Because our 2PM images are in 3D form, all basic operations in the CNNs, such as convolution, max-pooling, and up-convolution, are performed on 3D volumes. Therefore, our Mu-net is fully 3D.

Figure 1: Our objective is to restore clean 3D images from noisy 2PM images. Images with a colored bounding box present XY-plane images at the same location. As shown in the results, the proposed method successfully removes various noises while preserving sharpness.

The main contributions of this work include:

• The proposed Mu-net demonstrates improvements in 3D image denoising over recent deep-learning-based methods. In particular, the proposed method preserves image details well, such as small and thin structures and weak signals, even in very noisy conditions. It is validated using extensive experimental evaluations.

• It is also trained in a fully 3D and end-to-end manner; thus, it is easy to use in practice. We release the
source code, including trained models, which will be beneficial for 2PM image processing studies (the test code can be found in the supplementary material).

The remainder of this paper is organized as follows: Section 2 discusses related work; Section 3 explains the details of the proposed image restoration method; Section 4 demonstrates the performance of the proposed method using various experiments; and Section 5 concludes this study.

2. Related work

Several denoising methods for microscopy images have been proposed. When the PSF is known, classical deconvolution approaches, RL deconvolution [1, 2] and the Wiener filter [3], are useful for removing image blur. However, when the PSF is imperfect or its kernel estimation is poor, these methods can introduce unintended artefacts in output images, such as ghost and ring artefacts. Moreover, image noise may also be amplified by deconvolution. A possible solution is to introduce appropriate prior information into the deconvolution process. For example, Chan and Wong [12] and Dey et al. [13] introduced total variation (TV) regularization to obtain spatially smooth images. Some recent methods employed more sophisticated regularization terms [14, 15]. Meiniel et al. [16] summarized the state-of-the-art methods using regularization techniques, with a particular interest in microscopy image denoising. Although many advances have been made in these deconvolution approaches, a fundamental trade-off exists between data fidelity and the prior information represented by the regularization term. Depending on the level of noise in the images, the strength of regularization must be adjusted to obtain a good balance between over-adaptation to the given image and the prior regularization term (e.g., over-smoothing). Furthermore, these methods are based on iterative algorithms and are, therefore, computationally quite expensive.

Recent deep-learning-based methods have emerged as powerful tools for determining image features in a data-driven fashion and are widely used in many real-world applications. There are numerous computer vision and image processing problems in which substantial performance improvements have been achieved through deep learning, e.g., object recognition [17] and semantic segmentation [10, 18]. In the image restoration scenario, several attempts have been made to address the inverse problem involved in image denoising, inpainting, and super-resolution; the objective of image restoration is to reconstruct an image x ∈ R^m from an observation y ∈ R^n of the form y = Ax + n, where A ∈ R^{n×m} and n ∈ R^n. Kingma and Ba [19] demonstrated that CNNs can obtain improvements in image denoising over classical approaches based on Markov random fields (MRFs) [20]. Although parameter optimization can be rather challenging in deep learning because of the large parameter space, several techniques have been widely adopted in recent CNN architectures, including batch normalization [21] and residual learning [22]. Zhang et al. [23] demonstrated that these are effective not only for fast and stable learning but also for obtaining well-denoised images. Because deep learning is considered a feature extraction technique, the learned features can also be used, as prior terms, for other forms of image processing [24].

Another class of regularization techniques is data augmentation. Because the parameter space of deep learning models is usually large, one straightforward idea is to enlarge the number of training images. Moreover, the recently proposed generative adversarial networks (GANs) [11] demonstrate excellent performance in several image-to-image translation problems. A GAN employs a pair of deep neural networks: a generator network and a discriminator network. The discriminator network learns to distinguish whether its input image is real (registered in the training dataset) or fake (produced by the generator model), whereas the generator network learns to generate realistic images to cheat the discriminator. One application of the GAN, Pix2Pix, is effective in image translation but still requires pairs of pre- and post-translation images in the training dataset [25]. Cycle-consistent adversarial networks (CycleGANs) [26] employ a cycle-consistency term in the adversarial loss function and enable image translation even in the absence of direct correspondence in the training dataset. Because the adversarial loss acts as a regularizer that reduces negative perceptual effects, GAN-based CNN models can produce more realistic images [27, 28].

According to recent survey studies [5, 29], there have
been several recent attempts to apply deep learning methods to microscope image processing. Wang et al. [30] applied deep learning techniques to achieve super-resolution in fluorescence microscopy. In biomedical image segmentation, U-net successfully produces fine segmentation results and preserves image details, owing to the skip connections in its encoder-decoder architecture, which enable decoding from features encoded at different scales [5]. In cancer cell detection, CNN-based algorithms have been proposed for robust and automatic nucleus segmentation with shape preservation [31, 32, 33]. In magnetic resonance (MR) image processing, deep learning has been widely adopted to solve various inverse problems, e.g., MR image super-resolution [34], MR image denoising [35], and MR image translation from 3T to 7T [36]. Several publications address the effectiveness of denoising techniques in fluorescence imaging [37, 38, 39, 40]. These algorithms generally focus on the development of an adaptive weighting method for patch-based filtering. Recently, Weigert et al. [40] proposed the content-aware image restoration (CARE) method for fluorescence microscopy, where a U-net is trained with low- and high-SNR (signal-to-noise ratio) images. CARE is the work most closely related to the proposed method, and the current study therefore includes a performance comparison between CARE and our method. Technically, the proposed method differs from the classical methods in the following manner:

• The proposed method is data-driven. In other words, the denoising process is identified based on a relatively small training dataset rather than on hand-crafted features.

• The novel multi-scale CNN architecture that employs multiple U-nets enables coarse-to-fine reconstruction, leading to superior performance.

• Because the CNN fully operates in 3D rather than performing two-dimensional (2D) image-wise computation, end-to-end learning is possible without any pre- and post-processing steps.

We describe the proposed method in detail below.

3. Algorithm overview

Our objective is to construct a denoiser φ that maps a noisy image I to a clean image φ(I). It is, however, not easy to define the denoiser because of the various noise factors involved, not only in the target biological systems but also in the measurement optics. Under such uncertain conditions, estimation of the PSF and deconvolution kernels is not fully reliable, and incorrectly estimated kernels may introduce artefacts into the output image. In 2PM, changes in the diffraction rate along the depth direction can be a serious concern because of various biological distractors such as vessels and cellular myelination, which make the design of spatially varying kernels in three dimensions infeasible. In this study, instead of designing such models, we chose a data-driven approach in which the denoiser φ is built such that its features are determined from a supervised training dataset.

For learning features, we employed a CNN variant called U-net as the basic building block, which includes encoding and decoding networks with organized skip connections. The skip connections allow the features extracted in the encoding blocks to be passed on to the corresponding layers of the decoding blocks, where these features are merged together, leading to spatially higher-level features (Figure 2). While low-level features extracted by shallow layers are useful for removing local image noise and for representing small details, high-level features extracted by deeper layers are helpful for semantic-level processing. Therefore, these skip connections play a critical role in balancing spatially local and global information during the training process. In our experiments, this was very useful for preserving image details; however, the results tended to be sensitive to image noise, because the features extracted in the early layers were insufficient to distinguish signals from noise.

Coarse-to-fine approach To address this problem, we propose a new deep-learning architecture consisting of multiple stacked U-nets, called Mu-net. The fundamental difference between U-net and Mu-net is that while U-net directly learns φ(I) → Ī, Mu-net first learns the relationship φ_c(I_c) → Ī_c to build a clean image Ī_c from a coarse input image I_c, which is then used as prior information through φ_f(I_f; φ_c(I_c)) → Ī_f to establish the finer clean image Ī_f. Our modeling can be viewed as a
Bayesian network,

P(φ) = P(φ_1, ..., φ_K) = P(φ_1|φ_2) P(φ_2|φ_3) ··· P(φ_K),   (1)

where Mu-net tries to identify a hierarchy of denoisers at different scales, {φ_1, ..., φ_K}, instead of a sole φ. By decomposing the learning problem into simpler subproblems, from easier to more difficult tasks, Mu-net achieves performance improvements in the preservation of details and in robustness to noise.

The input and output of the networks take the form of image pyramids. Mu-net was trained to output the target 3D image pyramid, which was expected to represent a clean 3D image. In particular, the individual U-nets were trained with images of different scales so as to take different roles in restoring the clean image. The lowest U-net tried to identify the relationship between the input and output at the lowest (coarsest) level of the image pyramids and thereby capture a rough shape of the image, whereas the higher levels tried to capture the input-output relationship at the finer levels of the image pyramids. Therefore, the entire Mu-net is expected to accomplish image denoising in a coarse-to-fine fashion. To enable end-to-end learning, our method employs no additional pre- or post-processing steps, so that the noisy and clean images are simply the inputs and outputs of the Mu-net, respectively.

4. Multi-scale U-net

4.1. Network architecture

U-net Our Mu-net is composed of multiple U-nets whose architecture and parameters are presented in Figure 2 and Table 1, respectively. An individual U-net includes seven blocks consisting of two convolutional layer units and a rectified linear unit (ReLU) [41]. We also used instance normalization [42] for training. The U-net is characterized by its organized skip connections: the features extracted in the first to third encoding blocks were directly passed on to the corresponding decoding blocks, where the two feature maps were merged and further convolved to serve as the input to the decoding blocks. After convolution in the first, second, and third blocks, the feature map size was reduced by convolution with stride 2. In the last layer in the decoding blocks, 2 × 2 × 2 up-convolution was performed to produce an output image whose size was the same as that of the input image. Every convolution layer maintained a consistent resolution with zero padding, and all computation was performed in three dimensions.

Figure 2: U-net is composed of encoder and decoder networks with skip connections between corresponding layers to bypass deeper convolutional layers.

Table 1: Parameter summary of an individual U-net, where WHDCN denotes the convolution filter's width (W), height (H), depth (D), number of channels (C), and number of filters (N). 2(↓) and 2(↑) denote downsampling and upsampling with stride 2. Note that the channel size in the first layer of Block #1 is 1 or 2; the doubled channel units serve to fuse the features extracted by the lower-level U-nets.

Block   Operation layers           Weight dimension (WHDCN)   St
Blk #1  Conv, ReLU                 3×3×3×1(2)×16               1
        Conv, ReLU, Inst-norm      3×3×3×16×16                 2(↓)
Blk #2  Conv, ReLU                 3×3×3×16×32                 1
        Conv, ReLU, Inst-norm      3×3×3×32×32                 2(↓)
Blk #3  Conv, ReLU                 3×3×3×32×64                 1
        Conv, ReLU, Inst-norm      3×3×3×64×64                 2(↓)
Blk #4  Conv, ReLU                 3×3×3×64×128                1
        Up-conv, ReLU, Inst-norm   3×3×3×128×128               2(↑)
Blk #5  Concat, Conv, ReLU         3×3×3×128×64                1
        Up-conv, ReLU, Inst-norm   3×3×3×64×64                 2(↑)
Blk #6  Concat, Conv, ReLU         3×3×3×64×32                 1
        Up-conv, ReLU, Inst-norm   3×3×3×32×32                 2(↑)
Blk #7  Concat, Conv, ReLU         3×3×3×32×16                 1
        Conv, ReLU                 3×3×3×16×16                 1
        Conv (Output)              3×3×3×16×1                  1
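To make the block structure of Table 1 concrete, the following is a minimal sketch of one constituent 3D U-net in tf.keras. The function names, the LayerNormalization stand-in for instance normalization, and the channel bookkeeping at the skip concatenations are illustrative assumptions, not the released implementation.

```python
# Sketch of a single 3D U-net following the block layout of Table 1.
import tensorflow as tf
from tensorflow.keras import layers

def conv3d(x, filters, strides=1, norm=False):
    """3x3x3 convolution + ReLU; optional normalization (instance-norm
    stand-in: LayerNormalization over the three spatial axes)."""
    x = layers.Conv3D(filters, 3, strides=strides, padding="same")(x)
    x = layers.ReLU()(x)
    if norm:
        x = layers.LayerNormalization(axis=(1, 2, 3))(x)
    return x

def build_unet(in_ch=1):  # in_ch = 2 for the non-coarsest U-nets (Table 1)
    inp = layers.Input(shape=(None, None, None, in_ch))
    # Encoder (Blocks #1-#3): each ends with a stride-2 downsampling conv.
    e1 = conv3d(inp, 16); d1 = conv3d(e1, 16, strides=2, norm=True)
    e2 = conv3d(d1, 32);  d2 = conv3d(e2, 32, strides=2, norm=True)
    e3 = conv3d(d2, 64);  d3 = conv3d(e3, 64, strides=2, norm=True)
    # Block #4: bottleneck with a stride-2 up-convolution.
    b = conv3d(d3, 128)
    u3 = layers.Conv3DTranspose(128, 3, strides=2, padding="same",
                                activation="relu")(b)
    # Decoder (Blocks #5-#7): concatenate skip features from the encoder.
    c3 = conv3d(layers.Concatenate()([u3, e3]), 64)
    u2 = layers.Conv3DTranspose(64, 3, strides=2, padding="same",
                                activation="relu")(c3)
    c2 = conv3d(layers.Concatenate()([u2, e2]), 32)
    u1 = layers.Conv3DTranspose(32, 3, strides=2, padding="same",
                                activation="relu")(c2)
    c1 = conv3d(layers.Concatenate()([u1, e1]), 16)
    out = layers.Conv3D(1, 3, padding="same")(conv3d(c1, 16))  # Block #7 output
    return tf.keras.Model(inp, out)
```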
Mu-net Figure 3 presents the architecture of our Mu-net, which consists of K (here, K = 4) U-nets operating on 3D images at four different scales: 128 × 128 × 128, 64 × 64 × 64, 32 × 32 × 32, and 16 × 16 × 16 pixels. We
first used a 3D image patch and down-sampled it. By aligning the original and down-sampled image patches hierarchically, the original 3D image was transformed into an image pyramid in which small details were represented in the lower (finer) layers and larger structures in the higher (coarser) layers. This image pyramid is defined as I_P = {I_1, ..., I_K}, where I_1 is the original image and I_k is the k-th resized image obtained by down-sampling.

Figure 3: Mu-net architecture for coarse-to-fine image denoising and restoration. Our particular implementation of Mu-net consists of K (e.g., four) U-nets with different spatial scales in the image pyramids. Indeed, the input and target output images of the first to (K − 1)-th U-nets are down-sampled from the original image. The output image reconstructed by the lower-level U-net (at the coarser level) is used along with the input image of the present-level U-net by jointly convolving the input images; accordingly, in the training phase, all residuals between the output and the target of the U-nets are considered. This architecture is effective in maintaining the structure embedded in the input image pyramid.
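A minimal sketch of this image pyramid construction, assuming 2x down-sampling per level with scipy's zoom (the paper does not state its interpolation scheme):

```python
import numpy as np
from scipy.ndimage import zoom

def build_pyramid(volume, K=4):
    """Return [I_1, ..., I_K]: I_1 is the original 3D patch; each further
    level halves the resolution (128^3 -> 64^3 -> 32^3 -> 16^3)."""
    pyramid = [volume]
    for _ in range(K - 1):
        pyramid.append(zoom(pyramid[-1], 0.5, order=1))  # linear interpolation
    return pyramid

patch = np.random.rand(128, 128, 128).astype(np.float32)  # stand-in 3D patch
print([p.shape for p in build_pyramid(patch)])
# [(128, 128, 128), (64, 64, 64), (32, 32, 32), (16, 16, 16)]
```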
The coarsest-level U-net was positioned at the front of the Mu-net. Because the spatial resolution (width, height, and depth) of the input features (the subsampled image) was compressed to 2 × 2 × 2 after three applications of the convolution-pooling steps (i.e., Blocks #1, #2, and #3 in Table 1), the coarsest-level U-net had a sufficiently large receptive field to cover the entire 3D image. The output of this first U-net was fed to the second U-net. As shown in Table 1 and Figure 3, there were extra skip connections from a lower-level U-net to the upper-level one. In the first and second U-nets, the outputs (with the corresponding resolutions in the image pyramid) were followed by an upsampling process so as to match the input size of the subsequent, finer-scale U-net. After receiving such a 3D input image, the middle-level U-nets behave in a similar manner. This process plays an important role in ensuring that the upper-level U-nets focus on information with higher spatial frequency. Because the small- and larger-scale images share spatially low-frequency information, delivering the features extracted by the lower-level U-nets helps the upper-level U-nets avoid detailed, direct computation of the low-frequency features, while still exploiting important information for determining which pixel belongs to the target object at the corresponding spatial resolution.

Furthermore, these connections between lower- and higher-level U-nets allow direct gradient propagation to the early convolution layers in the training step, which helps effective end-to-end training. By utilizing both small- and middle-level visual features, a finer output image was expected to be created. Although all U-nets had the same internal architecture, their parameters were not tied, allowing for a more flexible adaptation to
individual image scales. As the last step, the finest U-net repeated this stage-wise refinement by mixing the middle- and high-level image features. The clean image at the original scale was built as the final output of this process.

Figure 4: Discriminator model; the parameter sizes and operations are presented in Table 2. The generator is trained with a feature matching loss, which is calculated from the difference between feature responses obtained in the intermediate layers.

Figure 5: Training a GAN to map noisy images to clean images. While the discriminator, ψ, is trained to classify the real images and the fake images produced by the generator φ, the generator learns to fool the discriminator by producing an image that is closely similar to the real clean image.

Table 2: Summary of the discriminator's model configuration, where St denotes the stride of the convolution operation.

Layer     Operation layers         Weight dimension (WHDCN)   St
Layer #1  Conv, ReLU               3×3×3×1×16                  1
Layer #2  Conv, ReLU, Inst-norm    3×3×3×16×16                 2
Layer #3  Conv, ReLU               3×3×3×16×32                 1
Layer #4  Conv, ReLU, Inst-norm    3×3×3×32×32                 2
Layer #5  Conv, ReLU               3×3×3×32×64                 1
Layer #6  Conv, ReLU, Inst-norm    3×3×3×64×64                 2
Layer #7  Conv                     1×1×1×64×1                  1

4.2. Training

Let Θ = {Θ_1, ..., Θ_K} denote the weight parameters of Mu-net, where K is the number of constituent U-nets (in our particular implementation, K = 4), such that an output I′_P = φ(I_P; Θ) is produced for a given input image pyramid I_P. Our Mu-net with its coarse-to-fine approach requires that each intermediate output approaches the target image at the corresponding scale. Therefore, we trained the networks so that the intermediate outputs I′_P = {I′_1, ..., I′_K} approach the target (hence denoised) images in the image pyramids.

Loss function The loss function for training our Mu-net is as follows:

L_I = (1/K) Σ_{k=1}^{K} ‖Ī_k − I′_k‖_1.   (2)

This loss function is the Manhattan distance between the Mu-net's outputs and the desired outputs, where K is the number of scales in the image pyramid and also the number of U-nets, and Ī_k denotes the target output for the k-th resized input image I_k. In our experiments, the Mu-net trained with the Manhattan distance produced better results than one trained with the Euclidean distance.
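A minimal sketch of the multi-scale loss of Eq. (2); averaging over voxels within each scale (rather than summing) is an assumption made here so that coarse levels are not dominated by fine ones.

```python
import numpy as np

def multiscale_l1(outputs, targets):
    """L_I of Eq. (2): outputs/targets are lists over the K pyramid levels;
    each term is the per-voxel mean absolute (L1) error at that scale."""
    K = len(outputs)
    return sum(np.abs(t - o).mean() for o, t in zip(outputs, targets)) / K
```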
Adversarial loss In addition to the difference between the network output and the desired output, the loss function includes an adversarial loss that encourages our network to fool a discriminator network and to output images that reside on the manifold of natural images. The discriminator, ψ, is a straightforward CNN consisting of seven convolutional layers, as presented in Table 2. It tries to discriminate whether the input is a real image registered in the training dataset or a fake image produced by the generator φ; the generator is our Mu-net. The final score of the discriminator is the average of all the responses of the seventh-layer units. Following recent reports [25, 26], we replaced the negative log likelihood used in the original GAN [11] with the least-squares loss, which is known to be more stable during training and to generate higher-quality results. The loss
function for the discriminator is therefore defined as

L_ψ = ‖ψ(Ī_1) − 1‖² + ‖ψ(I′_1)‖².   (3)

According to this loss function, the discriminator CNN is trained to identify whether the input image is real (registered as a target image in the training dataset) or fake (generated by the generator), by predicting a label of 1 or 0, respectively.

Figure 6: Each CNN model was trained by alternately updating the corresponding model parameters.

Feature matching loss Incorporating the adversarial loss, the generator Mu-net is trained to fool the discriminator CNN by outputting denoised images that are as close as possible to those in the training dataset. Figure 5 depicts a graphical illustration of this training process. The loss function for training the generator, φ, is therefore designed as

L_φ = L_I + λ L_F,   (4)

where the constant coefficient is experimentally set to λ = 0.2 and L_F is the feature matching loss, defined as

L_F = Σ_i ‖ψ_i(Ī_1) − ψ_i(I′_1)‖_1,   (5)
where ψ_i is the feature vector extracted at the i-th convolution layer of the discriminator ψ. As illustrated in Figure 4, the feature matching loss is calculated from the difference between features extracted in the intermediate layers of the discriminator; it captures the feature-level differences between real and fake images. Our full objective is L = L_ψ + L_φ, and the parameters of the generator and discriminator are updated alternately during training, as shown in Figure 6. Note that the feature matching loss is defined for the output image at the original scale, whereas the residual term in Equation (2) also considers the discrepancies in the intermediate outputs of Mu-net, i.e., the coarsest- and middle-level images.
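A minimal sketch of the least-squares adversarial and feature-matching terms of Eqs. (3)-(5); the interface (lists of per-layer discriminator features) is an assumption for illustration.

```python
import numpy as np

def discriminator_loss(score_real, score_fake):
    """Eq. (3): push the averaged final-layer scores of real images to 1
    and of generated (fake) images to 0."""
    return np.mean((score_real - 1.0) ** 2) + np.mean(score_fake ** 2)

def generator_loss(l1_term, feats_real, feats_fake, lam=0.2):
    """Eq. (4): L_phi = L_I + lambda * L_F, with L_F (Eq. (5)) the L1
    distance between intermediate discriminator features of the target
    and generated images (feature matching)."""
    l_f = sum(np.abs(fr - ff).mean() for fr, ff in zip(feats_real, feats_fake))
    return l1_term + lam * l_f
```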
Training details For training the networks, we followed standard approaches [25, 11, 26]. We alternated two training sub-processes, namely gradient descent steps on the generator φ and on the discriminator ψ, as graphically illustrated in Figure 6. Two optimizers based on ADAM [43] were created such that each had an independent momentum history; we used the default ADAM settings, β_1 = 0.9 and β_2 = 0.999. Using these trainers, the parameters of φ and ψ were alternately updated with a mini-batch of four images of 128 × 128 × 128 pixels. As presented in Figure 7, the learning rate was halved from its initial value of 0.0001 at 10K, 15K, 20K, 25K, 40K, 45K, 50K, and 55K iterations and reset to 0.0001 at 30K iterations, where one iteration means one parameter update with a mini-batch.

Figure 7: Learning rate schedule. The learning rate was reduced step-wise and increased once, according to the number of iterations.
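A direct transcription of this schedule as a function of the iteration count (Figure 7):

```python
def learning_rate(iteration):
    """Start at 1e-4, halve at the listed iterations, reset once at 30K."""
    lr = 1e-4
    for step in (10_000, 15_000, 20_000, 25_000, 30_000,
                 40_000, 45_000, 50_000, 55_000):
        if iteration >= step:
            lr = 1e-4 if step == 30_000 else lr * 0.5
    return lr
```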
To reduce model oscillation [44], a buffer pool of fake images produced by the generator was employed, in which 10 images were temporarily stored. Whenever the generator output a new image, one of the old images in the buffer pool was randomly replaced with it; when training the discriminator, one image was randomly sampled from the pool. The learner model was implemented in TensorFlow [45]. The entire optimization took roughly 30 hours on an Intel Xeon E5-1650 CPU and an Nvidia GTX 1080 Ti GPU, which was much longer than for 2D-based CNNs because of the cost of 3D convolution.
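A minimal sketch of this fake-image buffer; the class name and the push-then-sample interface are assumptions.

```python
import random

class FakeImagePool:
    """Stores up to `capacity` generator outputs; new images randomly
    displace stored ones, and the discriminator trains on a random sample."""
    def __init__(self, capacity=10):
        self.capacity = capacity
        self.images = []

    def push_and_sample(self, image):
        if len(self.images) < self.capacity:
            self.images.append(image)
        else:
            self.images[random.randrange(self.capacity)] = image
        return random.choice(self.images)
```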
4.3. Data acquisition and image annotation

Sparsely labeled neuronal fluorescence images were obtained as follows. AAV viruses (AAV1.hSyn.DIO.mRuby2.WPRE and AAV1.CaMKII(0.4).Cre.WPRE) were injected into the posterior parietal cortex (PPC) area of adult male C57BL/6 mice. More than 14 days after AAV virus injection, an open-skull cranial window operation was performed based on a previously published method [46]. The transparent brains were prepared according to one of the optical clearing methods, ScaleS [47]. Two-photon imaging was performed with an FLUOVIEW FVMPE-RS laser scanning microscope system (Olympus Corp., Tokyo, Japan) equipped with water-immersion objective lenses (XLPLN25XWMP, 20×, 1.05 N.A. for in vivo imaging; XLSLPLN25XSVMP2, 25×, 0.95 N.A. for transparent brain imaging; Olympus Corp., Tokyo, Japan). A femtosecond pulse laser, InSight-DS (Spectra-Physics, Mountain View, CA), was used at 970 nm for capturing mRuby2 images. For in vivo imaging, mice were anesthetized with 1.0% isoflurane. All images were captured at a resolution of 0.249 µm per pixel (4× digital zoom, 512 × 512 pixels) with a z-axis step size of 0.9 µm. All animal-based procedures in this study followed the guidelines of the Animal Experimental Committee of the Faculty of Medicine at the University of Tokyo.

We used 15 3D images that contain neuronal objects, i.e., spines, axons and dendrites (collectively called neurites), and cell bodies. Many spines on the dendrites were visible because our target images had reasonably high spatial resolution (~0.4 µm; N.A. 1.05). The target images include independent bright spots of similar size to dendritic spines, presumably a side effect of the viral expression of fluorescent molecules, which we considered noise. The target images are also contaminated by shot noise arising from the physical process in the photo-detector. To remove such noises, we manually annotated the foreground areas, including cell bodies, neurites, and spines, using seed-growing segmentation [48]. The clean image for each noisy image was constructed by removing all signals outside the target object regions. A total of 3,200 pairs of clean and noisy images were then obtained by slicing the images into patches of 128 × 128 × 128 pixels; these image pairs were used as the training dataset. The intensity of the training images was normalized to the range -1 to 1 as I/2,500 − 1. Note that the denoised image was reconstructed by applying (Ī + 1) × 2,500 to the output of the networks.

Considering the complicated process and the considerable effort needed to obtain real 2PM images, we increased the number of training images using data augmentation. Data augmentation is the easiest and most popular way of reducing over-fitting to the training data: the amount of training data is artificially enlarged using label-preserving transformations. We employed several distinct forms of data augmentation, each of which produces transformed images from the original images with relatively light computation. First, the images were blurred by applying Gaussian smoothing with four different standard deviations: 0.5, 0.75, 1.0, and 1.25. To simulate the Z-direction axial blur, the standard deviations in the X- and Y-directions were set to a small value, 0.1. Then, pixel-wise white noise generated with four different standard deviations, 0.1, 0.2, 0.3, and 0.4, was added to the blurred images. The generated images were then rotated around each of the X-, Y-, and Z-axes. By applying these augmentation methods, 3,200 × 4 × 4 × 6 noisy images were created and registered in the training dataset, paired with the corresponding target images (associated with their original images). In training, each mini-batch of two image patches of 128 × 128 × 128 resolution was sampled from this training dataset.
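A minimal sketch of this data preparation, assuming raw intensities in [0, 5000]; the exact rotation set used for augmentation is not specified in the text, so the rotation step here is illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def normalize(volume):      # I / 2500 - 1, mapping [0, 5000] to [-1, 1]
    return volume / 2500.0 - 1.0

def denormalize(output):    # invert: (I + 1) * 2500
    return (output + 1.0) * 2500.0

def augment(volume, sigma_z, noise_sd, k):
    """One augmented copy: axial (Z) Gaussian blur with small X/Y SDs,
    additive white noise, and a 90-degree rotation in the X-Y plane."""
    blurred = gaussian_filter(volume, sigma=(sigma_z, 0.1, 0.1))  # (Z, Y, X)
    noisy = blurred + np.random.normal(0.0, noise_sd, volume.shape)
    return np.rot90(noisy, k=k, axes=(1, 2))

# 4 blur levels x 4 noise levels (x rotations) per clean patch:
blurs, noises = (0.5, 0.75, 1.0, 1.25), (0.1, 0.2, 0.3, 0.4)
```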
5. Experiments
We applied our method to denoising problems with real 2PM image datasets. In our experiments, we compared the performance of the proposed method with a recent deep-learning-based method, CARE [40], and a
deconvolution-based method. CARE was developed for image restoration in the context of fluorescence microscopy image processing using deep learning. The original CARE was trained with pairs of low- and high-exposure images, but its target domain differs from ours. For a fairer comparison, therefore, the CNN parameters of CARE were re-trained using the same training dataset and the original implementation (available at http://csbdeep.bioimagecomputing.com/). As shown in Figure 8, the image obtained by CARE improved after re-training on our training dataset; we refer to this re-trained model as CARE-ft.

Figure 8: CARE [40] was fine-tuned (CARE-ft) to adapt it to our 2PM data domain, using the same training dataset as that used for Mu-net. CARE-ft shows an improved denoising result over CARE.

Figure 9: Denoising results on N2-level images, where Poisson and Gaussian noises were added to the tested images. From left to right in each row: input noisy image, GT, and the results of CARE-ft, vanilla Mu-net (K = 1), and Mu-net with GAN. Mu-net with GAN shows better denoising results, as highlighted with yellow circles. The numbers in the upper left corners correspond to the PSNR of the denoised images.

The image deconvolved with total variation regularization is given by

I* = arg min_I ‖I ∗ G − I_test‖ + λ_TV ‖∆I‖,
where I_test is the input noisy image and ∗ denotes convolution with a 3D Gaussian filter G with a fixed standard deviation (SD) of 0.5. The remaining parameters of this method were set experimentally as follows. The smoothness term ∆I is calculated by averaging the differences between the center and surrounding pixels. The weight of the regularizer was set to λ_TV = 0.08. We obtained the deconvolved image I* by minimizing the above equation using gradient descent.
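A minimal sketch of this baseline under stated assumptions: the data term is taken as a squared L2 norm (so its gradient is G applied to the residual, the Gaussian being self-adjoint), and the smoothness term is approximated by a discrete Laplacian; the paper's exact discretization and step size are not given.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, laplace

def deconvolve_tv(i_test, sd=0.5, lam_tv=0.08, lr=0.5, iters=100):
    """Gradient-descent minimization of ||I * G - I_test||^2 + lam * R(I)."""
    i_est = i_test.astype(np.float64)
    for _ in range(iters):
        # Data-term gradient: G * (G * I - I_test), G is symmetric.
        data_grad = gaussian_filter(gaussian_filter(i_est, sd) - i_test, sd)
        # Smoothness gradient: -Laplacian (center-vs-neighbor differences).
        smooth_grad = -laplace(i_est)
        i_est -= lr * (data_grad + lam_tv * smooth_grad)
    return i_est
```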
In the second experiment, we examined the robustness against noise that depends on the imaging depth. In addition, we investigated the effect of introducing the GAN into the Mu-net, and 3D sliding-window-based filtering was tested on larger images. As a measure of denoising performance, we employed the peak signal-to-noise ratio (PSNR), defined as

PSNR(I′, Ī) = 10 log_10 ( I²_max / MSE(I′, Ī) ),   (6)

where I_max is the maximum intensity, set to 5,000, and MSE denotes the mean squared error between the ground-truth (GT) and estimated images. We describe the experimental details and present the results of our comprehensive analysis below.
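A direct implementation of the PSNR metric of Eq. (6), with I_max = 5,000 as used in the experiments:

```python
import numpy as np

def psnr(estimate, ground_truth, i_max=5000.0):
    """Peak signal-to-noise ratio in dB between two 3D volumes."""
    mse = np.mean((estimate - ground_truth) ** 2)
    return 10.0 * np.log10(i_max ** 2 / mse)
```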
5.1. Performance in image denoising and restoration

To examine the denoising performance of the proposed method, we prepared GT images and their noisy counterparts in a simulated manner. Although the spatial resolution of 2PM is high, approximately a few hundred nanometers, it is larger than the scale of the biology of our interest, which is approximately ten nanometers; this limited resolution makes it difficult to obtain preferable images solely based on the experiments. Moreover, taking 2PM fluorescence images with uniform quality is practically infeasible because the image quality depends on various factors, such as sample preparation, device settings, and the efficiency of the injected virus. We obtained four 3D 2PM images by scanning microscopy, with a single channel, in which pixels are represented in 16 bits; the image size was approximately 512 × 512 × 1,400, and each pixel exhibited a maximum gray-scale intensity of 5,000. Based on these real images, we produced GT images by manually annotating noise in the real images, in the same manner as described in the previous section. A total of 500 noisy and clean image pairs with 128 × 128 × 128 resolution were chosen from the image pool, where images containing only background were excluded. We then generated test images based on these manually annotated GT images.

Poisson and Gaussian noises In the first experiment, we assumed Poisson and Gaussian noises. A pixel-wise independent Poisson process was applied to the original images to simulate the photon emission process, and then Gaussian noise N(0, σ) was added, with zero mean and four different SDs (σ): 0.1, 0.2, 0.3, and 0.4. We thus prepared a total of 2,000 test 3D images from 500 different 3D images disturbed with four different noise settings as

I_test = f_poisson(I_origin) + N(0, σ),   (7)

where f_poisson denotes the Poisson process and N(0, σ) is a Gaussian random variable with zero mean and standard deviation σ.
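A minimal sketch of the test-image synthesis of Eq. (7). The intensity scaling around the Poisson draw is an assumption; the paper does not state the photon-count scale.

```python
import numpy as np

def corrupt(clean, sigma, scale=2500.0):
    """clean: volume in [-1, 1]; sigma: SD of the additive Gaussian noise."""
    # Pixel-wise Poisson process simulating photon emission.
    counts = np.random.poisson(np.clip((clean + 1.0) * scale, 0, None))
    noisy = counts / scale - 1.0                    # back to the [-1, 1] range
    return noisy + np.random.normal(0.0, sigma, clean.shape)
```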
Figure 10: Example of test images. Based on the GT and original images, test image pairs were generated by applying Poisson noise and adding Gaussian noise to the original image. Note that 2D XY-plane images are displayed for clear comparison, but the actual input data were 3D images.

Table 3: Image denoising performance of different methods when the level of Poisson and Gaussian noises artificially applied to the input image was changed. The table shows the average performance in terms of PSNR (unit: dB).

Noise level                     N1     N2     N3     N4
Input (PSNR)                    15.01  12.53  10.44   8.86
Deconvolution+TV                20.40  18.34  15.26  12.94
CARE-ft                         26.28  24.28  23.12  22.52
Mu-net (K = 1)                  25.56  24.71  23.95  23.33
Mu-net (K = 2)                  26.04  24.99  24.12  23.41
Mu-net (K = 3)                  26.08  24.99  24.12  23.45
Mu-net (K = 4)                  26.21  25.06  24.16  23.46
Mu-net+GAN (K = 4, λ = 0.01)    26.17  25.02  24.11  23.42
Mu-net+GAN (K = 4, λ = 0.05)    26.46  25.13  24.16  23.45
Mu-net+GAN (K = 4, λ = 0.1)     26.47  25.11  24.14  23.42
Mu-net+GAN (K = 4, λ = 0.2)     26.50  25.16  24.19  23.48
Mu-net+GAN (K = 4, λ = 0.3)     26.27  25.07  24.16  23.45

Figure 11: Restoration results on N1-B3 level images, where Z-direction blur and Gaussian noises were applied to the tested images. Here, X-Z plane images with small Gaussian noises are displayed to show the Z-direction blur more clearly.

Figure 10 presents examples of the test images. From top to bottom and left to right, the GT and the noisy images corrupted by the different noise levels are displayed. The image quality is gradually degraded as the noise level increases; as shown in the bottom-right image, the highest
611 612 613 614 615 616 617 618 619
level, N4 image is very noisy. In the first experiment, the proposed method was applied to the denoising problem with these test image set. The performance of the algorithm was measured by verifying the similarity between each denoised image and its corresponding GT image in terms of the PSNR metrics. Note that these test images were not used for training our Mu-net or other models but only used for testing.
with the largest hierarchy (K = 4) presents better performance than those by the others. In its network architecture, the spatial resolution of input image is reduced to 2 × 2 × 2 by convolution-pooling layers at the coarsest level U-net, of which feature maps are fully covered by convolutional filters.
urn a
604
We tested different methods including CARE-ft, vanilla Mu-net with different number of hierarchy, and Mu-net with adversarial training (Mu-net+GAN). Table 3 summarizes the denoising performance averaged over the test images. The input images are fairly noisy that can be seen as the overall PSNR of input images, 15.01, 12.53, 10.44, and 8.86 dB from N1 to N4 datasets. It also includes the performance change associated with the different level of hierarchy K, and weight λ; indeed, Mu-net
Jo
603
12
In addition, the experimental results with changing the λ values present that the denoising performance was compromised at λ = 0.2. As the noise level increased, the advantage of the proposed method was obvious; in such a situation, the proposed method outperformed CARE-ft with a large margin. Figure 9 presents X-Y image slices when N2 level noisy images were input to each denoising method. From left to right in each row, we show the input noisy image, GT, and denoised images by CARE-ft, Munet (K = 1), and Mu-net (K = 4)+GAN. As shown in the highlighted yellow circles, the proposed method produced
620 621 622 623 624 625
626 627 628 629 630 631 632 633 634 635 636
better denoising results than the others in the restoration of detailed structures.

Table 4: Image denoising performance of different methods when the levels of Gaussian blur and noise artificially applied to the input image were changed. The table shows the average performance in terms of PSNR (unit: dB).

Noise level                     N1-B1  N1-B2  N1-B3  N2-B1  N2-B2  N2-B3  N3-B1  N3-B2  N3-B3
Input (PSNR)                    15.10  15.08  14.98  12.57  12.55  12.49  10.46  10.44  10.40
Deconvolution+TV                20.36  20.13  19.89  18.35  18.25  18.09  15.28  15.25  15.17
CARE-ft                         26.15  25.55  24.77  24.23  23.95  23.52  23.08  22.88  22.59
Mu-net (K = 1)                  25.51  25.20  24.63  24.62  24.27  23.78  23.86  23.54  23.12
Mu-net (K = 2)                  25.98  25.57  24.86  24.88  24.47  23.92  24.01  23.65  23.20
Mu-net (K = 3)                  26.03  25.65  24.93  24.89  24.49  23.93  24.02  23.66  23.21
Mu-net (K = 4)                  26.14  25.72  24.97  24.95  24.53  23.95  24.05  23.69  23.23
Mu-net+GAN (K = 4, λ = 0.01)    26.12  25.72  25.00  24.92  24.51  23.94  24.01  23.64  23.18
Mu-net+GAN (K = 4, λ = 0.05)    26.37  25.84  25.01  25.00  24.53  23.92  24.04  23.65  23.18
Mu-net+GAN (K = 4, λ = 0.1)     26.37  25.83  25.00  24.99  24.51  23.92  24.03  23.65  23.19
Mu-net+GAN (K = 4, λ = 0.2)     26.41  25.89  25.05  25.04  24.58  23.98  24.08  23.70  23.24
Mu-net+GAN (K = 4, λ = 0.3)     26.20  25.77  25.02  24.96  24.54  23.97  24.05  23.69  23.23

Table 5: Image denoising performance of different methods in the third experiment, in which Z-direction blur, Poisson noise, and depth-dependent Gaussian noise were applied to the input images. The table shows the average performance in terms of PSNR (unit: dB).

Noise level                     N1-B1  N1-B2  N2-B1  N2-B2
Input (PSNR)                    15.87  15.63  14.18  14.02
Deconvolution+TV                20.30  19.80  19.37  18.95
CARE-ft                         25.91  23.79  25.09  23.57
Mu-net (K = 1)                  24.63  23.42  24.12  23.04
Mu-net (K = 2)                  25.12  23.65  24.39  23.15
Mu-net (K = 3)                  25.26  23.72  24.47  23.21
Mu-net (K = 4)                  25.40  23.81  24.54  23.23
Mu-net+GAN (K = 4, λ = 0.01)    26.19  24.41  25.17  23.71
Mu-net+GAN (K = 4, λ = 0.05)    26.36  24.44  25.17  23.68
Mu-net+GAN (K = 4, λ = 0.1)     26.38  24.46  25.17  23.71
Mu-net+GAN (K = 4, λ = 0.2)     26.43  24.47  25.21  23.74
Mu-net+GAN (K = 4, λ = 0.3)     26.22  24.48  25.12  23.71

Axial blur and Gaussian noise In the second experiment, the test images were generated by applying Gaussian blur and noise. To simulate the forward optics in 2PM, we assumed a Gaussian PSF (blur) G_blur and pixel-wise Gaussian shot noise: the original images were blurred along the Z-direction by convolving a Gaussian PSF G_blur whose Z-direction SD was set to 0.5, 1.0, or 1.5, while the SDs in the X- and Y-directions were fixed at 0.1. Then, Gaussian noise N(0, σ) generated with σ = 0.2, 0.3, or 0.4 was added to the blurred images:

I_test = I_origin ∗ G_blur + N(0, σ).   (8)

According to this procedure, we prepared 4,500 test 3D images from 500 different 3D images disturbed by 9 (= 3 × 3) different Gaussian blur and noise settings. Using these images, we compared the denoising algorithms; the experimental results are summarized in Table 4, where the levels of noise and blur are denoted N1-B1 to N3-B3. Similar to the results of testing with Poisson and Gaussian noises, the proposed method showed the best performance in the setting of K = 4 and λ = 0.2.
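A minimal sketch of the corruption of Eq. (8); the depth-scaled noise SD used in Eq. (9) of the third experiment is included as an option (its Poisson step can be composed with the earlier `corrupt` sketch).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_and_noise(clean, sigma_z, noise_sd, depth_scaled=False):
    """Z-direction Gaussian blur (small X/Y SDs) + additive Gaussian noise."""
    blurred = gaussian_filter(clean, sigma=(sigma_z, 0.1, 0.1))  # (Z, Y, X)
    if depth_scaled:  # Eq. (9): noise SD grows linearly with depth
        depth = np.arange(clean.shape[0]) / (clean.shape[0] - 1)
        noise = np.random.normal(0.0, 1.0, clean.shape) \
                * (noise_sd * depth)[:, None, None]
    else:             # Eq. (8): spatially uniform noise SD
        noise = np.random.normal(0.0, noise_sd, clean.shape)
    return blurred + noise
```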
Some representative results are shown in Figure 11, where X-Z plane images are displayed to visualize the results along the Z-axis. As shown in these images, the proposed method presents clearer results compared to the other baseline methods. In particular, large structures in the output images of CARE-ft were much blurrier. The result of CARE-ft was comparable to that of the proposed method when the input image was not severely degraded by noise, as in the case of N1-B1; the performance gap between CARE-ft and ours gradually increased as the noise and blur levels increased. Mu-net with the K = 1 setting produced blurry results in which small image details were removed.

Figure 12: One of the example images used in the third experiment, in which images were corrupted by combining Z-direction blur, Poisson noise, and depth-dependent noise. X-Z views of the image at different depths (different z coordinates) are presented at the bottom.

Depth-dependent noise simulation The 2PM technology is known to be effective for the visualization of deep re-
gions of biological tissue, owing to the penetration power of its low-energy photons. On the other hand, 2PM scanning from shallow to deep regions might make the diffraction ratio non-uniform along the depth direction, which would produce depth-dependent noise. In the third experiment, we examined how the denoising methods are affected by such depth-dependent noise. Similar to the previous experiments, the noise model was characterized as

I_test = f_poisson(I_origin ∗ G_blur) + N(0, σ_depth),   (9)

where G_blur is a Gaussian PSF whose Z-direction SD was set to 1.0 or 2.0, with the SDs in the X- and Y-directions fixed at 0.1. For the Gaussian white noise, the SD σ_depth was linearly scaled along the depth: σ_depth = (0.1, 0.2) × depth/depth_max. By combining Poisson noise, two levels of Gaussian blur, and depth-dependent noise, four levels of test images were generated. The experimental results and some resulting images are presented in Table 5 and Figure 12, respectively. Similar to the second experiment, Figure 12 displays X-Z plane images to make the changes along the Z-direction visible; it can be clearly seen that the noise level of the input image increases with depth. As highlighted with the yellow circles in the result images, our method produced better denoising results than the other methods.

5.2. Performance improvement in a neural tracking application

In this subsection, we evaluate the usefulness of the proposed method as pre-processing for post-analyses of 2PM neural images. As a typical example of such post-analyses, neural tracking and structure reconstruc-
Journal Pre-proof
Threshold
(a) image 1 (111,939) image 2
(f)
709 710 711 712 713 714 715 716 717
urn a
708
tion problems were chosen, in order to verify the benefits of denoising. Identification of such morphology and anatomy circuit significantly increases our understanding of how functional brain states emerge from the structural substrate [49], and provides new mechanistic insights into how brain function is affected if this structural substrate is disrupted [50, 51]. The proposed denoising method can strongly contribute to the automation of neuronal structure identification, which is now called connectomics.
Jo
707
31,284 (43.22%) 47,574 (65.73%) 54,575 (75.41%) 42,789 (59.12%) 56,068 (77.47%) 63,407 (87.61%)
plied to the original and denoised images. For the quantitative performance evaluation, GT tracking results were necessary, but it would be difficult to prepare them without manual supervision. In order to build GT as accurate as possible, we actively utilized the interactive neural tracking method provided by Neutube. More concretely, we manually inputted seed points so that tracking paths were generated based on these points by progressively growing the paths. GT was made by repeating the process and selecting only the tracking results that were considered as reasonable based on our visual inspection, from all the estimated paths. As the test, we applied a fully automatic tracking method, i.e., Neutube with the default setting, on the original and denoised images obtained by CARE-ft and ours. The tracking results were evaluated by comparing to the GT where tracking points were classified as true if the distance to the closest neighbor (from GT to the estimated points) was smaller than a certain threshold θ. The neural tracking performance on the first and second images are presented in Table 6. Among total 111,939 and 72,375 tracking points obtained based on the interactive neural tracking, 43.6% and 43.22% points were found in the first and second original images at θ = 3. When the original images were replaced with the denoised images, the results were improved to 63.82% and 65.73% by CAREft, and 74.17% and 75.41% by ours, respectively. Similar to this result, the tracking performance of these methods with the threshold (θ) setting to 5, were 74.16% and 85.70% (CARE-fit), and 77.47% and 87.61% (Mu-net), for the first and second images, respectively. Figure 13 presents a qualitative tracking result comparison when two denoising methods were applied. While weak signals and small structures were removed in the result of CARE-
lP
(e)
Figure 13: (a) Original image and interactive neural tracking result. (b) Neutube’s automatic tracking result on the denoised image obtained by CARE-ft. (c) Neutube’s automatic tracking result on the denoised image by the proposed method. (d), (e), and (f) are the zoom-in images of (a), (b), and (c), respectively, at the yellow rectangles. Green, yellow, and red points denote branch, terminal, and bridge nodes, respectively.
706
θ=3 θ=5
re-
(c)
705
Ours
Table 6: The neural tracking results using Neutube on the denoised images by CARE-ft and the proposed method. The numbers present the number of estimated tracking points and ratio whose distance to the nearest neighbors to GT points were smaller than a certain threshold (θ).
5.3. Discussion
So far, we have examined the image denoising and restoration performance of the proposed method, as well as its comparison with other baseline methods. All experimental results showed that the performance of the conventional U-net can be significantly improved by the proposed method, which combines a multiple stacked architecture with adversarial training. We performed the experiments under different noise and blur settings, such as Gaussian blur and noise, Poisson noise, and depth-dependent noise. In these tests, the proposed method outperformed the other baseline methods, suggesting that our method is robust to various types of image distortion. In addition, the benefits of image denoising were presented in the neural tracking application, where the performance improvement of the neural tracker was examined by comparing the tracking results obtained from the original and denoised images. These experiments verified the usefulness of the proposed method. Regarding processing speed, denoising took approximately 0.2 seconds for a single image of 128 × 128 × 128 pixels. Our implementation, whose source code and pre-trained model will be made publicly available after publication, is based on Python and TensorFlow [45] version 1.14.
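For concreteness, degradations of these kinds could be synthesized for a 3D volume as in the sketch below; the function and all parameter values are illustrative assumptions, not the exact settings used in our experiments.

import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(volume, blur_sigma=1.0, poisson_scale=50.0,
            gauss_std=0.05, depth_coeff=0.001):
    # `volume` is a (Z, Y, X) array with intensities in [0, 1].
    v = gaussian_filter(volume, sigma=blur_sigma)              # Gaussian blur
    v = np.random.poisson(v * poisson_scale) / poisson_scale   # Poisson (shot) noise
    depth = np.arange(v.shape[0], dtype=float)[:, None, None]
    std = gauss_std * (1.0 + depth_coeff * depth)              # noise grows with imaging depth
    v = v + np.random.normal(size=v.shape) * std               # depth-dependent Gaussian noise
    return np.clip(v, 0.0, 1.0)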
6. Conclusion

In this work, we proposed a deep learning-based 3D image denoising algorithm for 2PM images. To improve denoising performance, a novel CNN architecture called Mu-net, composed of multiple U-nets, was designed, in which the individual U-nets consider images at different scales. This multi-scale learning approach enables a coarse-to-fine image reconstruction that incrementally builds the target output from low to high frequencies. Consequently, Mu-net effectively removed image noise while preserving image details. For quantitative performance evaluation, extensive experiments were performed under various noise conditions, in which considerable performance improvements were demonstrated over other methods, including the state-of-the-art method, CARE. Furthermore, we demonstrated that this image denoising algorithm is effective for the post-analysis restoration of the 3D structures of neurons and neural networks.
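As a rough illustration, the sketch below composes per-scale denoisers from coarsest to finest, feeding each level the downsampled input plus the upsampled coarser estimate. The residual combination and the function names are assumptions made for illustration; the sketch conveys the coarse-to-fine idea rather than the exact Mu-net architecture.

import numpy as np
from scipy.ndimage import zoom

def coarse_to_fine_denoise(noisy, denoisers):
    # `denoisers` is a list of per-scale denoising functions, coarsest
    # first; each maps a 3D array to an array of the same shape (these
    # stand in for the trained sub-U-nets).
    n_scales = len(denoisers)
    estimate = None
    for level, denoise in enumerate(denoisers):
        factor = 0.5 ** (n_scales - 1 - level)   # e.g. 1/4, 1/2, then 1
        x = zoom(noisy, factor, order=1)         # the input at this scale
        if estimate is not None:
            # Upsample the coarser estimate to this scale and refine it.
            factors = [t / s for t, s in zip(x.shape, estimate.shape)]
            x = x + zoom(estimate, factors, order=1)
        estimate = denoise(x)
    return estimate

# Placeholder usage with identity denoisers standing in for networks:
# out = coarse_to_fine_denoise(np.random.rand(64, 64, 64), [lambda v: v] * 3)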
Acknowledgment

This work was supported partly by CREST (JPMJCR1652 to SL, SI, and HK) from JST; by Grants-in-Aid for Scientific Research (Grant Numbers 26221001 to HK, 17H06310 to SI, and 17K00404 to HU) from JSPS; by the Strategic Research Program for Brain Sciences (SRPBS, 17dm0107120h0002) from AMED (to HK); by the World Premier International Research Center Initiative (WPI) from MEXT (to HK and SI); and by Brain Mapping by Integrated Neurotechnologies for Disease Studies (Brain/MINDS) from AMED (to SI).
References

[1] L. B. Lucy, An iterative technique for the rectification of observed distributions, The Astronomical Journal 79 (1974) 745.

[2] W. H. Richardson, Bayesian-based iterative method of image restoration, JOSA 62 (1) (1972) 55–59.
[3] N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Applications, MIT Press, 1949.

[4] Y. Shih, B. Guenter, N. Joshi, Image enhancement using calibrated lens simulations, in: European Conference on Computer Vision, Springer, 2012, pp. 42–56.

[5] F. Xing, Y. Xie, H. Su, F. Liu, L. Yang, Deep learning in microscopy image analysis: A survey, IEEE Transactions on Neural Networks and Learning Systems.
[6] K.-L. Tseng, Y.-L. Lin, W. Hsu, C.-Y. Huang, Joint sequence learning and cross-modality convolution for 3d biomedical segmentation, in: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE, 2017, pp. 3739–3746.

[7] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.

[8] C. Fu, S. Lee, D. J. Ho, S. Han, P. Salama, K. W. Dunn, E. J. Delp, Three dimensional fluorescence microscopy image synthesis and segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 2221–2229.

[9] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, O. Ronneberger, 3d u-net: learning dense volumetric segmentation from sparse annotation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2016, pp. 424–432.

[10] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.

[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.

[12] T. F. Chan, C.-K. Wong, Total variation blind deconvolution, IEEE Transactions on Image Processing 7 (3) (1998) 370–375.

[13] N. Dey, L. Blanc-Féraud, J. Zerubia, C. Zimmer, J.-C. Olivo-Marin, Z. Kam, A deconvolution method for confocal microscopy with total variation regularization, in: ISBI, Vol. 2004, 2004, pp. 1223–1226.

[14] D. Krishnan, T. Tay, R. Fergus, Blind deconvolution using a normalized sparsity measure, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE, 2011, pp. 233–240.

[15] Y.-W. Tai, S. Lin, Motion-aware noise filtering for deblurring of noisy and blurry images, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 17–24.

[16] W. Meiniel, J.-C. Olivo-Marin, E. D. Angelini, Denoising of microscopy images: A review of the state-of-the-art, and a new sparsity-based method, IEEE Transactions on Image Processing 27 (8) (2018) 3842–3856.

[17] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[18] H. Noh, S. Hong, B. Han, Learning deconvolution network for semantic segmentation, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1520–1528.

[19] V. Jain, S. Seung, Natural image denoising with convolutional networks, in: Advances in Neural Information Processing Systems, 2009, pp. 769–776.

[20] S. Roth, M. J. Black, Fields of experts: A framework for learning image priors, in: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, Vol. 2, IEEE, 2005, pp. 860–867.

[21] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167.

[22] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[23] K. Zhang, W. Zuo, Y. Chen, D. Meng, L. Zhang, Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising, IEEE Transactions on Image Processing 26 (7) (2017) 3142–3155.
[24] K. Zhang, W. Zuo, S. Gu, L. Zhang, Learning deep cnn denoiser prior for image restoration, in: IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, 2017.

[25] P. Isola, J.-Y. Zhu, T. Zhou, A. A. Efros, Image-to-image translation with conditional adversarial networks, arXiv preprint.

[26] J.-Y. Zhu, T. Park, P. Isola, A. A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, arXiv preprint arXiv:1703.10593.

[27] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., Photo-realistic single image super-resolution using a generative adversarial network, arXiv preprint.

[28] N. Divakar, R. V. Babu, Image denoising via cnns: An adversarial approach, in: New Trends in Image Restoration and Enhancement, CVPR Workshop, 2017.

[29] E. Moen, D. Bannon, T. Kudo, W. Graf, M. Covert, D. Van Valen, Deep learning for cellular image analysis, Nature Methods (2019) 1–14.

[30] H. Wang, Y. Rivenson, Y. Jin, Z. Wei, R. Gao, H. Gunaydin, L. Bentolila, A. Ozcan, Deep learning achieves super-resolution in fluorescence microscopy, bioRxiv (2018) 309641.

[31] M. Veta, P. J. Van Diest, J. P. Pluim, Cutting out the middleman: measuring nuclear area in histopathology slides without segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2016, pp. 632–639.

[32] Z. Xu, J. Huang, Detecting 10,000 cells in one second, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2016, pp. 676–684.

[33] F. Xing, Y. Xie, L. Yang, An automatic learning-based framework for robust nucleus segmentation, IEEE Transactions on Medical Imaging 35 (2) (2016) 550–566.

[34] O. Oktay, W. Bai, M. Lee, R. Guerrero, K. Kamnitsas, J. Caballero, A. de Marvao, S. Cook, D. O'Regan, D. Rueckert, Multi-input cardiac image super-resolution using convolutional neural networks, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2016, pp. 246–254.

[35] A. Benou, R. Veksler, A. Friedman, T. R. Raviv, Denoising of contrast-enhanced mri sequences by an ensemble of expert deep neural networks, in: Deep Learning and Data Labeling for Medical Applications, Springer, 2016, pp. 95–110.

[36] K. Bahrami, F. Shi, I. Rekik, D. Shen, Convolutional neural network for reconstruction of 7t-like images from 3t mri using appearance and anatomical features, in: Deep Learning and Data Labeling for Medical Applications, Springer, 2016, pp. 39–47.

[37] P. Coupé, M. Munz, J. V. Manjón, E. S. Ruthazer, D. L. Collins, A candle for a deeper in vivo insight, Medical Image Analysis 16 (4) (2012) 849–864.

[38] A. Danielyan, Y.-W. Wu, P.-Y. Shih, Y. Dembitskaya, A. Semyanov, Denoising of two-photon fluorescence images with block-matching 3d filtering, Methods 68 (2) (2014) 308–316.

[39] J. Boulanger, C. Kervrann, P. Bouthemy, P. Elbau, J.-B. Sibarita, J. Salamero, Patch-based nonlocal functional for denoising fluorescence microscopy image sequences, IEEE Transactions on Medical Imaging 29 (2) (2010) 442–454.

[40] M. Weigert, U. Schmidt, T. Boothe, A. Müller, A. Dibrov, A. Jain, B. Wilhelm, D. Schmidt, C. Broaddus, S. Culley, et al., Content-aware image restoration: pushing the limits of fluorescence microscopy, Nature Methods 15 (12) (2018) 1090.

[41] V. Nair, G. E. Hinton, Rectified linear units improve restricted boltzmann machines, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.

[42] D. Ulyanov, A. Vedaldi, V. Lempitsky, Instance normalization: The missing ingredient for fast stylization, arXiv preprint arXiv:1607.08022.
[43] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.

[44] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, R. Webb, Learning from simulated and unsupervised images through adversarial training, in: CVPR, Vol. 2, 2017, p. 5.

[45] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: a system for large-scale machine learning, in: OSDI, Vol. 16, 2016, pp. 265–283.

[46] A. Hayashi-Takagi, S. Yagishita, M. Nakamura, F. Shirai, Y. I. Wu, A. L. Loshbaugh, B. Kuhlman, K. M. Hahn, H. Kasai, Labelling and optical erasure of synaptic memory traces in the motor cortex, Nature 525 (7569) (2015) 333.

[47] H. Hama, H. Hioki, K. Namiki, T. Hoshida, H. Kurokawa, F. Ishidate, T. Kaneko, T. Akagi, T. Saito, T. Saido, et al., Scales: an optical clearing palette for biological imaging, Nature Neuroscience 18 (10) (2015) 1518.

[48] J. A. Sethian, Level Set Methods and Fast Marching Methods: Evolving Interfaces in Computational Geometry, Fluid Mechanics, Computer Vision, and Materials Science, Vol. 3, Cambridge University Press, 1999.

[49] O. Sporns, G. Tononi, R. Kötter, The human connectome: a structural description of the human brain, PLoS Computational Biology 1 (4) (2005) e42.

[50] L. Feng, T. Zhao, J. Kim, neutube 1.0: a new design for efficient neuron reconstruction software based on the swc format, eNeuro 2 (1) (2015) ENEURO–0049.

[51] R. M. Villalba, A. Mathai, Y. Smith, Morphological changes of glutamatergic synapses in animal models of parkinson's disease, Frontiers in Neuroanatomy 9 (2015) 117.
Conflict of Interest and Authorship Conformation Form

Manuscript Title: Mu-net: Multi-scale U-net for Two-Photon Microscopy Image Denoising and Restoration

Please check the following as appropriate:

All authors have participated in (a) conception and design, or analysis and interpretation of the data; (b) drafting the article or revising it critically for important intellectual content; and (c) approval of the final version.

This manuscript has not been submitted to, nor is under review at, another journal or other publishing venue.

The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript.

Author's name        Affiliation
Sehyung Lee          Kyoto University
Makiko Negishi       The University of Tokyo
Hidetoshi Urakubo    Kyoto University
Haruo Kasai          The University of Tokyo, International Research Center for Neurointelligence (WPI-IRCN)
Shin Ishii           Kyoto University, Advanced Telecommunications Research Institute International (ATR)