Knowledge-Based Systems xxx (xxxx) xxx
DeepLN: A framework for automatic lung nodule detection using multi-resolution CT screening images
Xiuyuan Xu a, Chengdi Wang b, Jixiang Guo a, Lan Yang b, Hongli Bai c, Weimin Li b,∗, Zhang Yi a,∗

a Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu 610065, PR China
b Department of Respiratory and Critical Care Medicine, West China Hospital, Sichuan University, Chengdu, Sichuan Province 610041, PR China
c Department of Radiology, West China Hospital, Sichuan University, Chengdu, Sichuan Province 610041, PR China
Article info

Article history:
Received 26 January 2019
Received in revised form 11 October 2019
Accepted 12 October 2019
Available online xxxx

Keywords:
Lung nodule detection
Multi-model ensemble
Multi-resolution CT screening images
Abstract

Computed tomography (CT) is an important and valuable tool for detecting and diagnosing lung cancer at an early stage. Commonly, CT screenings with lower dose and resolution are used for preliminary screening. In particular, many hospitals in smaller towns only provide CT screenings at low resolution. However, when patients are diagnosed with suspected cancer, they are transferred or referred to larger hospitals for more sophisticated examinations with high-resolution CT scans. Therefore, multi-resolution CT images deserve attention and are critical in clinical practice. Currently, the available open-source datasets only contain high-resolution CT screening images. To address this problem, a multi-resolution CT screening image dataset called the DeepLNDataset is constructed. A three-level labeling criterion and a semi-automatic annotation system are presented to guarantee the correctness and efficiency of lung nodule annotation. Moreover, a novel framework called DeepLN is proposed to detect lung nodules in both low-resolution and high-resolution CT screening images. Multi-level features are extracted by a neural-network-based detector to locate the lung nodules. Hard negative mining and a modified focal loss function are employed to address the common category imbalance problem. A novel non-maximum-suppression-based ensemble strategy is proposed to synthesize the results from multiple neural network models trained on CT image datasets of different resolutions. To the best of our knowledge, this is the first work that considers the influence of multiple resolutions on lung nodule detection. The experimental results demonstrate that the proposed method addresses this issue well.
© 2019 Published by Elsevier B.V.
1. Introduction

Lung cancer is the deadliest cancer in the world [1]. A computed tomography (CT) screening is a quick and painless procedure that produces clear images of the inside of the lungs. CT screening is widely used to help diagnose and monitor treatment for a variety of pulmonary diseases, such as lung cancer, which manifest in images as lung nodules. To find the lung nodules in CT screening images for further diagnosis, an experienced radiologist must carefully read the images slice by slice, a process that requires substantial time and effort.

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.105128.
∗ Corresponding authors.
E-mail addresses: [email protected] (W. Li), [email protected] (Z. Yi).
Furthermore, there are not enough experienced experts to provide high-quality medical services for many patients. Therefore, the automatic detection of lung nodules is an important research topic in computer-assisted diagnosis. Because of the high potential value of automatic lung nodule detection, much effort has been devoted to this research in recent years. However, lower-dose CT scans with lower resolution are used for preliminary screening; in particular, many hospitals in smaller towns may only provide low-dose CT scans. In contrast, the available open-source datasets only contain high-resolution CT screening images. To address this problem, we constructed a multi-resolution CT screening image dataset called the DeepLNDataset. The construction of such a dataset requires a large number of radiologist annotations. The annotation process, as in clinical practice, is often based on the radiologists' experience; however, radiologists have differing opinions about lung nodule annotation. Employing an effective annotation method is therefore the key to guaranteeing the objectivity and accuracy of labeling [2–8]. For example, in [8],
https://doi.org/10.1016/j.knosys.2019.105128 0950-7051/© 2019 Published by Elsevier B.V.
Please cite this article as: X. Xu, C. Wang, J. Guo et al., DeepLN: A framework for automatic lung nodule detection using multi-resolution CT screening images, KnowledgeBased Systems (2019) 105128, https://doi.org/10.1016/j.knosys.2019.105128.
X. Xu, C. Wang, J. Guo et al. / Knowledge-Based Systems xxx (xxxx) xxx
each case was blindly marked by four radiologists. The LUNA2016 challenge used only the annotations agreed upon by all the radiologists; this simple approach causes some real nodules to be omitted. In [7], the first round of annotation was accomplished by a CAD system. Then, in the second round, the initial labels were reviewed by two medical students to identify nodules. Because they lacked clinical experience, the accuracy of those results cannot be guaranteed. The CAD system used in [7] was threshold-based, which could not simultaneously guarantee sensitivity and reduce false positives (FPs). In contrast, in the dataset presented in this study, a three-level annotation method is employed to produce the first annotations on a part of the dataset. Then, an initial detector is trained on these data to boost the semi-automatic annotation process. Because the annotations rely on the radiologists' clinical experience, this method ensures the accuracy and efficiency of the annotations. In addition, many methods have been employed to construct automatic lung nodule detection systems. Lung nodules appear only inside the lung regions, so effective lung region segmentation can avoid the detection of lesions outside the lungs, reducing FPs. Threshold-based methods are the most common for segmenting the lung regions [9] and work well. The subsequent lung nodule detection stage consists of two models: volume-of-interest (VOI) detection models that guarantee the maximum sensitivity of the subsequent stages, and classifier models for reducing FPs. Advances in this research can be divided into three periods. In the first and earliest period, neither the detectors nor the classifiers proposed were based on neural networks. All detection models were threshold-based methods [2,4,10–13], such as lung segmentation methods. Threshold-based methods to select VOIs are more complex than lung region segmentation methods because lung nodules have more diverse shapes and edges.
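The threshold-based lung segmentation mentioned above can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' implementation; the −600 HU cutoff is the value this paper reports using for its own segmentation step.

```python
import numpy as np

def lung_mask(hu_slice, threshold=-600):
    """Threshold-based binarization: air-filled lung tissue has low HU
    values, so voxels below the cutoff become lung-region candidates.
    Returns the binary mask and the lung-area fraction of the slice."""
    mask = hu_slice < threshold
    return mask, float(mask.mean())

# Toy 2x2 "slice": two air-like voxels and two soft-tissue voxels.
slice_hu = np.array([[-1000.0, -700.0],
                     [0.0, 40.0]])
mask, frac = lung_mask(slice_hu)
```

The area fraction is the kind of quantity the paper later uses to discard non-lung regions; real pipelines additionally apply border cleaning and morphological operations.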
Then, simple linear or non-linear classifiers were trained to determine whether the selected VOIs are lung nodules. Their inputs were complicated hand-designed features, and not all of the essential features could be extracted to maximize the classifiers' performance. In the second period of research, convolutional neural networks (CNNs) [14–17] were employed to reduce the number of FPs. The detector methods employed during this period were still threshold-based, but they were more complex. In the third and most recent period, both detectors and classifiers are neural network-based models [9,16,18]. These studies proposed excellent methods for constructing effective models and obtained state-of-the-art results on open-source datasets. Nevertheless, the available open-source datasets only contain high-resolution CT screening images whose thicknesses range from 1.25 to 3 mm [4], and the studies mentioned above did not consider the problems caused by multi-resolution CT screening images. However, to reduce the radiation injury caused by CT screening in clinical practice, lower-resolution CT screening images are acquired for physical examinations. The dataset collected in this study contains CT screening images at two resolutions: a 1 mm thickness for thin-section images and a 5 mm thickness for thick-section images. A method to address the multi-resolution problem is proposed in this study. First, the DeepLNDataset's thin-section data and thick-section data are separated. Two separate detectors are trained, each on one of the two subsets, which reduces the influence of multiple resolutions. To extract features from CT screening images effectively, a residual neural network is employed as the backbone. Multi-level feature fusion improves the accuracy of nodule detection. Next, an ensemble method is proposed to combine the results of the two models. The contributions of this work can be summarized as follows:

1. A framework for automatic lung nodule detection from multi-resolution CT screening images is proposed; it has obtained promising results in clinical practice.
2. A three-level annotation criterion and a semi-automatic annotation system are proposed to construct a multi-resolution CT screening image dataset called the DeepLNDataset.
3. The influence of CT screenings with different resolutions is analyzed in depth, and an ensemble strategy is proposed to tackle this issue.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the construction of the DeepLNDataset. Section 4 presents the DeepLN method. Evaluations of our proposed methods and some analysis are given in Section 5. We conclude and discuss future work in Section 6.

2. Related work

2.1. Object detection in natural images

With the development of deep learning, significant advances have been made in the study of object detection in natural images. Region-based convolutional neural networks (RCNNs) were proposed in [19]. This method devised the first CNN to successfully detect objects: it employs a fine-tuned model to extract a region's features and determine whether or not the proposed region is an object. RCNNs represented a huge leap forward in object detection research. However, because the region proposals are extracted by selective search, RCNN has serious speed bottlenecks. To alleviate this problem, a spatial pyramid pooling network (SPPNet) [20] can be used to extract the features of the original image only once, after which multi-scale pooling obtains a uniform output. Building on this approach, Fast RCNN [21] employs a special layer called the region-of-interest (ROI) pooling layer, which can be regarded as a single-level SPPNet, to map different regional inputs to a uniform output. SPPNet and Fast RCNN dramatically improved RCNNs' speed. However, in these methods, the region proposals are still generated by selective search, which remains slow. In Faster RCNN [22], a layer called a region proposal network (RPN) is trained to obtain the region proposals.
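The ROI pooling idea described above — mapping a variable-sized region to a fixed-size output — can be sketched as a single-level max pool over a grid of bins. This is an illustrative numpy version (assuming the ROI is at least as large as the output grid), not Fast RCNN's actual implementation:

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=2):
    """Divide the ROI (y0, x0, y1, x1) of a 2D feature map into an
    out_size x out_size grid of bins and take the max of each bin,
    so every ROI yields the same fixed-size output."""
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    ys = np.linspace(0, h, out_size + 1).astype(int)  # bin boundaries
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((out_size, out_size), dtype=region.dtype)
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fm = np.arange(16).reshape(4, 4)
pooled = roi_max_pool(fm, (0, 0, 4, 4))  # whole map pooled to 2x2
```

A multi-level SPPNet would simply run this at several `out_size` values and concatenate the results.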
In an RPN, different anchors are constructed to obtain much better regression models for the bounding boxes of multi-scale objects. A method to fuse multi-level features to detect multi-scale objects was proposed in [23]; the fusion adopts a 1 × 1 convolution followed by element-wise addition. These methods consist of two stages. In contrast, the recently proposed RetinaNet [24] uses a focal loss to construct a one-stage end-to-end model and obtained results similar to those of the two-stage methods mentioned above. Beyond object detection in natural images, the RPN, focal loss, and other ideas in [19–22,24] inspired the method for lung nodule detection proposed in this study. An RPN is used to generate the lung nodule candidates, and focal loss and hard negative mining are used to balance the negatives and positives during training.

2.2. Dataset preparation from clinical data

Dataset preparation is the first step in the construction of a lung nodule detection system. Dataset annotation is based on a radiologist's knowledge and experience and requires a large amount of time and effort. Good labeling methods should guarantee both effectiveness and accuracy. A three-round annotation process was used in [2,3]. A scan was first read by one reader. Then, the scan was given to a second reader, who had been trained for 3 weeks and was unaware of the conclusion of the first reading. Finally, a third reader made
the final decision. Other three-round annotation methods were used in [4,5]. Each radiologist first performed a blinded review and identified the locations of all abnormalities in each scan. After the blinded review, 23 scans separately annotated by two radiologists were reviewed by the same two radiologists. Finally, the 23 scans were also reviewed by two experienced radiologists. Two other studies [6,7] employed a two-round annotation process. All CT screening images were first read by a workstation with integrated automatic nodule detection CAD tools. Two medical students, trained by a radiology research department to detect pulmonary nodules, either accepted or rejected the CAD labels. Recently, a large dataset called LIDC [8] was released. Each scan in this dataset was annotated by four radiologists, and the annotations matched across all four radiologists were selected for the LUNA2016 challenge. In the present study, a new labeling method is proposed: a three-level annotation process whose aim is to make the annotation as accurate as possible. To help radiologists improve the efficiency of annotation, a semi-automatic annotation system was constructed to collect the first annotations of the data. Currently available datasets contain CT screening images with slice thicknesses ranging from 1.25 to 3 mm [4]. This is a range of only 1.75 mm, which is relatively small, so CT screening images of some section thicknesses cannot be analyzed. In contrast, the dataset we collected contains 1 mm and 5 mm thick CT images, a larger range than those of previous datasets.

2.3. Lung nodule detection

Lung nodule detection consists of two steps: (i) detecting candidates in a way that guarantees the sensitivity of detection and (ii) FP reduction. According to whether a neural network is used, lung nodule detection schemes can be divided into three periods based on when they were proposed.
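The consensus selection used for LUNA2016 — keeping only annotations matched across all four radiologists — can be sketched with a simple center-distance matching rule. The 5 mm matching radius here is an assumption for illustration, not the challenge's exact protocol:

```python
import numpy as np

def consensus_nodules(reader_lists, max_dist=5.0):
    """Keep only nodules found by every reader: a nodule from the first
    reader is retained if each remaining reader has annotated a nodule
    whose center lies within `max_dist` mm of it. Illustrative only."""
    kept = []
    for n in reader_lists[0]:
        if all(
            any(np.linalg.norm(np.subtract(n, m)) <= max_dist for m in other)
            for other in reader_lists[1:]
        ):
            kept.append(n)
    return kept

# Three readers; only the nodule near the origin is marked by all of them.
readers = [[(0.0, 0.0, 0.0), (50.0, 0.0, 0.0)],
           [(1.0, 0.0, 0.0)],
           [(0.0, 1.0, 0.0)]]
agreed = consensus_nodules(readers)
```

As the paper notes, this kind of intersection discards real nodules that any single reader missed, which motivates the review-based scheme proposed here instead.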
In schemes designed during the first period, the detection and FP reduction steps do not use neural networks. Different tissues exposed to X-rays have different HU values, and lung nodules can be regarded as abnormal tissue within a specific HU range. Thus, earlier methods are threshold-based. Lung nodules are not always isolated because they can form connections with vessels or pleurae, and a single threshold cannot segment the nodules well. To tackle this problem, a multi-threshold method was proposed in [4]. Different thresholds varying from a minimum to a maximum value over a wide range were adopted, and the corresponding VOIs were selected. Analyzing the evolution of a VOI matched across neighboring thresholds can create a tree-like structure. Another early method [2] was shown to detect solid nodules well. In that method, the shape index and curvature of each lung nodule are computed, and a threshold is applied to the two results to define seed points. An automatic segmentation method is executed at the seed points to obtain clusters of interest, or VOIs. For VOI selection of subsolid nodules, a method was proposed in [10] in which a double-threshold density mask (−750, −350 HU) is first employed to obtain a mask of the VOIs. Morphological opening is applied to remove connected clusters, followed by three-dimensional connected component analysis. Clusters whose centers of mass lie within 5 mm of each other are merged. These methods [2,4,10,25] extract features manually, and the hand-designed features are used as input for linear or non-linear classifiers to determine whether the VOIs are nodules. However, these hand-designed methods require hundreds of features to be extracted. Many of them are not useful for judging whether or not a given region is a nodule, and some key features are lost. Creating such features is both tedious and often ineffective.
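The double-threshold density mask and the 5 mm center-merging step from [10] can be sketched as follows. This is an illustrative numpy version; the greedy merge strategy is an assumption, not the original implementation:

```python
import numpy as np

def density_mask(hu, lo=-750, hi=-350):
    """Double-threshold density mask for subsolid nodules: keep voxels
    whose HU value falls strictly inside (lo, hi)."""
    return (hu > lo) & (hu < hi)

def merge_close_centers(centers_mm, max_dist=5.0):
    """Greedily merge cluster centers-of-mass that lie within `max_dist`
    mm of an existing group's first center; return each group's mean."""
    merged = []
    for c in centers_mm:
        for group in merged:
            if np.linalg.norm(np.subtract(c, group[0])) <= max_dist:
                group.append(c)
                break
        else:
            merged.append([c])
    return [tuple(np.mean(g, axis=0)) for g in merged]

hu_line = np.array([-800.0, -500.0, -100.0])   # one subsolid-range voxel
mask = density_mask(hu_line)
centers = merge_close_centers([(0.0, 0.0, 0.0), (0.0, 0.0, 3.0),
                               (20.0, 0.0, 0.0)])
```

In the full method of [10], morphological opening and 3D connected-component analysis run between these two steps.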
Inspired by the power of CNNs, some works have achieved promising results in CAD systems [26–28]. In the second period, several studies [14–17] employed CNNs to extract the features of lung nodules to reduce false positives. Deep CNNs (DCNNs) were first used to extract lung nodule features in [14]. In that work, a trained AlexNet [29] was employed to automatically extract features from three orthogonal planes. Then, in [15], a multi-view CNN was proposed to extract candidate features from eight different planes. Both methods account for the fact that the appearance of lung nodules varies across viewpoints. However, the backbones of these models were based on 2D CNNs, which ignore the 3D structure of the lung [30]. In [16], a 3D DCNN was constructed to reduce FPs. Around the same time, [17] analyzed the influence of a classifier model's receptive field size, and a multilevel contextual 3D CNN was proposed to reduce FPs. Most recently, [9,16,18] focused on how to detect lung nodules using neural network-based methods; methods proposed in this last period are regarded as third-period approaches. In [16], an improved Faster RCNN was employed to detect lung nodule candidates. In that study, each axial slice in the CT image was concatenated with its two neighboring slices and used as the input of the Faster RCNN model. This method combined an RPN and ROI pooling in 2D to select the ROIs in each slice. In [18], a 3D FCN was constructed with online sample filtering for candidate screening. This method can not only leverage rich volumetric spatial information to extract high-level features for accurate candidate retrieval, but also rapidly produce prediction probabilities in a volume-to-volume manner. In [9,31,32], a U-Net-like 3D DCNN combined with an RPN was constructed to select lung nodule candidates. The backbones of these works are 18-layer residual networks. An input size of 128 × 128 × 128 was employed in [9,31], and 96 × 96 × 96 in [32].
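Feeding fixed-size cubes such as 96³ or 128³ to a 3D DCNN requires cropping around a candidate location, with padding where the cube extends past the volume border. A minimal numpy sketch of such cropping (illustrative only, not the code of the cited works):

```python
import numpy as np

def crop_cube(vol, center, size):
    """Crop a size^3 cube centered at `center` (z, y, x) from a 3D volume,
    zero-padding wherever the cube extends past the volume border."""
    out = np.zeros((size,) * 3, dtype=vol.dtype)
    starts = [c - size // 2 for c in center]
    src, dst = [], []
    for ax, s in enumerate(starts):
        lo, hi = max(s, 0), min(s + size, vol.shape[ax])  # clip to volume
        src.append(slice(lo, hi))          # where to read in the volume
        dst.append(slice(lo - s, hi - s))  # where to write in the cube
    out[tuple(dst)] = vol[tuple(src)]
    return out

vol = np.arange(27).reshape(3, 3, 3)
full = crop_cube(vol, (1, 1, 1), 3)    # centered crop reproduces the volume
corner = crop_cube(vol, (0, 0, 0), 3)  # corner crop is zero-padded
```

Real pipelines typically pad with a tissue-like HU constant rather than zero, but the indexing logic is the same.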
The studies mentioned above have pushed research on lung nodule detection forward, but they did not analyze the influence of CT image resolution on the results. In this work, we constructed our methods using the proposed dataset to create a model that better adapts to the multi-resolution CT screening images found in clinical practice.

3. Dataset

The data in the DeepLNDataset were provided by the West China Hospital, Sichuan University, China. All of the CT screening images were collected from patients when they were admitted to the hospital or at follow-up. To guarantee the accuracy of the dataset, not all available CT screening images were included. The goal of this study is to assist radiologists with the detection of lung nodules before treatment; hence, postoperative cases were removed because operations can cause changes to lung structures. Some patients have more than one available CT screening because of periodic examinations or follow-ups, so each patient's latest screening before operation is the one included in this study. Sensitive information was removed from the CT screening images before annotation to protect patient privacy. The nodules in this dataset were annotated by professional physicians from West China Hospital.

3.1. Inclusion criteria for annotations

A high-quality dataset is vital for training a model. To guarantee the accuracy and objectivity of the annotation, a three-level annotation system is proposed in this paper. To increase the uniformity of the annotations, the radiologists referred to the following rules when labeling:

1. All observed nodules are labeled regardless of size.
Table 1
The annotations obtained from three-level and semi-automatic annotation.

Round            −     +     Total
First-level      0     798   798
Second-level     56    140   880
Third-level      61    0     819
Semi-automatic   275   471   1015
Fig. 1. The process of lung nodule labeling.
2. Pleural nodules are not labeled.
3. Inflammatory nodules are annotated, whereas inflammatory plaques and streak shadows are not.
4. Cases with too many nodules that are difficult to annotate completely are removed from the dataset, whereas cases with no nodules are retained as negative samples.

Because this study focuses on constructing an automatic lung nodule detection system, only lung nodules are included in the labeling. Pleural nodules are defined strictly in clinical practice: not all nodules attached to the pleura can be regarded as pleural nodules. If the location of a nodule's maximum diameter is inside the lungs, the nodule is regarded as a lung nodule; if its maximum diameter is on the pleura, it is a pleural nodule. Pleural nodules were included in previous datasets [4,8], but they are not in fact lung nodules, so they are excluded from this dataset and study.

3.2. Annotation methods

This study's dataset was labeled by doctors from the pneumology and radiology departments using our constructed annotation system [33]. Moreover, they are thoracic surgery clinical physicians with years of experience. The entire annotation process is illustrated in Fig. 1. The left of this figure shows the three-level annotation process conducted by the doctors. In the first round of labeling, each scan was assigned to one resident physician for initial annotation. Then, in the second round, the annotations obtained from the first round were sent to attending physicians, who have many years of clinical experience; they evaluated the annotations as correct or incorrect and labeled any nodules missed in the first round. Finally, the head physician performed a second review and made the final decision. This process took a long time to complete because there are hundreds of slices in one scan. To improve the accuracy and efficiency of labeling, we propose a semi-automatic process.
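Such a semi-automatic process can be sketched as a closed loop in which a detector proposes nodule locations and radiologists correct them; all names below are hypothetical, and the sketch omits the retraining step itself:

```python
def semi_automatic_annotation(detector, unlabeled_scans, review):
    """Closed annotation loop: the detector proposes nodule locations,
    radiologists review them (accept / reject / add missed nodules),
    and the confirmed labels grow the training pool for the next
    round of detector training. All names are illustrative."""
    confirmed = []
    for scan in unlabeled_scans:
        proposals = detector(scan)        # initial nodule locations
        labels = review(scan, proposals)  # radiologists correct them
        confirmed.append((scan, labels))
    return confirmed  # would be used to retrain `detector`

# Toy stand-ins: the "detector" doubles the scan id, the "review"
# accepts every proposal unchanged.
detector = lambda scan: [scan * 2]
review = lambda scan, proposals: proposals
annotated = semi_automatic_annotation(detector, [1, 2], review)
```

The key property is that annotation quality and detector quality improve together as the loop iterates.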
First, a portion of the dataset was separated for manual labeling. Then, an initial detector was trained using these annotations. The architecture of this detector is the same as that of the proposed model, and it was trained on both of the multi-resolution subsets. The detector obtained a sensitivity of 0.85 and an FROC score of 0.59 on the test data; the modest score is attributable to the inconsistent labeling standards at this stage. This basic detector provides the initial locations of the lung nodules. After this, the rest of the annotation steps are the same as those of the three-level annotation. This process forms a closed loop: the annotations are
Table 2
Descriptive statistics of the current dataset.

Resolution       Scans  Patients  Nodules
Thin-section     367    202       1088
Thick-section    212    104       502
Total            579    260       1590
continuously provided, and the performance of lung nodule annotation continuously improves. The detector provides the initial annotations, and the reviewed annotations are then used to train and improve the model. The numbers of labeled nodules obtained in each round are listed in Table 1. This part of the dataset contains 410 scans from 208 patients, consisting of CT images of both section thicknesses. In Table 1, "+" and "−" denote the newly added and removed lung nodules in each round, respectively. In the first-level round, 798 lung nodules were added. In the second-level round, 140 lung nodules were added and 56 were removed, resulting in a total of 880 nodules. In the third-level round, only 61 lung nodules were removed, yielding a total of 819. After the semi-automatic annotation process was complete, 275 lung nodules had been removed and 471 new lung nodules had been added. Table 1 shows that the numbers of newly added and removed lung nodules after semi-automatic annotation are much larger than those obtained in all three steps of the three-level annotation. On the one hand, semi-automatic annotation requires the three-level annotation to be performed again; on the other hand, the detector can reduce an annotator's workload. Some removed and newly added lung nodules are shown in Figs. 2 and 3. Each cropped patch in this work is 45 × 45 mm². Because there were significant changes after the semi-automatic annotation, samples from this process were selected. Fig. 2(a) and (b) illustrate lung nodules too shallow to comprise a distinct entity, which therefore cannot be defined as lung nodules. Fig. 2(c) and (d) show pleural nodules whose maximum diameters are located on the pleura. The last two subfigures show patchy shadows that were mistaken for lung nodules. Fig. 3(a), (b), and (c) show that missed lung nodules usually occurred on the edges of the lungs. Fig. 3(d) and (e) show lung nodules that are small but have a distinct spherical shape. Fig. 3(f) shows a lung nodule that is attached to a blood vessel. Based on this part of the dataset, the remainder was labeled by the semi-automatic annotation. Details of the final labels in the DeepLNDataset are listed in Table 2. The results show that the DeepLNDataset contains 579 CT scans from 260 patients, with a total of 1590 lung nodules. Most of the included patients were male, and most were between 40 and 80 years of age, as shown in Fig. 5. The DeepLNDataset is divided into two subsets: one containing 5 mm CT scans (called the ThickSet in this study) and the other containing 1 mm CT scans (called the ThinSet). In the final dataset, the ThickSet contains 212 thick-section CT scans and the ThinSet contains 367 thin-section CT scans. All the CT images used in this study are available to the research community.1 The distribution of lung

1 https://github.com/xxy19404/DeepLNDataset.
Fig. 2. The removed lung nodules after semi-automatic annotation.
Fig. 3. The newly-added lung nodules after semi-automatic annotations.
Details of the whole architecture are shown in Table 3. The feature maps are doubled in size before combination, while the number of feature maps is not changed by the element-wise addition. To address the multi-resolution problem, a novel ensemble method is proposed to implement bagging and boosting strategies using non-maximum suppression (NMS) [36] and weighted means (WM), as shown at the bottom of Fig. 6.

4.1. Lung segmentation and data preprocessing
Fig. 4. The labeled lung nodules’ size distribution.
nodule sizes was analyzed, and the results are shown in Fig. 4. Most of the labeled lung nodules are smaller than 10 mm in diameter. The following experiments are all based on this dataset.

4. DeepLN

Automatic lung nodule detection can be regarded as an object detection task in which the input is a CT screening image I and the output is the set of lung nodule locations, each consisting of four numbers [x, y, z, d]. Here, x, y, and z indicate the coordinates of the bounding box center in the 3D volume, and d indicates the diameter of the lung nodule. In this work, we aimed to construct a mapping F from I to [x, y, z, d]. To reach this goal, a lung nodule detector called DeepLN is proposed, as shown in Fig. 6. A single lung nodule detector is shown at the top of Fig. 6, above the dashed line. The aim of this step is to detect lung nodules using an end-to-end model. Lung nodules occur only inside the lungs; hence, lung segmentation (the first step) is necessary to remove unrelated regions. After extracting the regions encompassing the whole lung, a detection network is constructed based on the residual networks and RPNs in [9,31]. These models won the Kaggle DSB2017 contest [34] ([9]) and obtained the best result in the LUNA2016 challenge [35] ([31]), respectively. The proposed network backbone has two main pathways: a top-down pathway to extract stronger semantic features and a bottom-up pathway to retain positional features. Feature maps of the same spatial size from both pathways are combined to detect lung nodules. The first pathway contains one ordinary block and four residual blocks. The ordinary block contains two convolutional layers with a 3 × 3 × 3 kernel size. Each residual block contains two convolutional layers, each followed by a batch-normalization layer and a ReLU activation layer. The convolutional features in each residual block are enhanced by identity shortcut connections, which increase the effectiveness and trainable depth of the network.
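Both candidate filtering and the ensemble step rely on non-maximum suppression over candidates of the form [p, x, y, z, d]. A minimal sketch, using axis-aligned cubes of side d to approximate the overlap of spherical nodules (an illustrative version, not the authors' implementation):

```python
def iou_3d(a, b):
    """IoU of two axis-aligned cubes given as (x, y, z, d): intersect the
    intervals on each axis, then divide by the union of the volumes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i] - a[3] / 2, b[i] - b[3] / 2)
        hi = min(a[i] + a[3] / 2, b[i] + b[3] / 2)
        if hi <= lo:
            return 0.0  # no overlap on this axis
        inter *= hi - lo
    union = a[3] ** 3 + b[3] ** 3 - inter
    return inter / union

def nms(candidates, iou_thr=0.1):
    """Keep the highest-probability candidate [p, x, y, z, d], suppress
    lower-probability candidates that overlap it, and repeat."""
    kept = []
    for c in sorted(candidates, key=lambda c: -c[0]):
        if all(iou_3d(c[1:], k[1:]) < iou_thr for k in kept):
            kept.append(c)
    return kept

cands = [[0.9, 0.0, 0.0, 0.0, 10.0],   # strong candidate
         [0.5, 1.0, 0.0, 0.0, 10.0],   # overlaps the first -> suppressed
         [0.8, 50.0, 50.0, 50.0, 10.0]]
kept = nms(cands)
```

The ensemble described above applies the same suppression across the outputs of the thin-section and thick-section detectors rather than within a single model.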
A max pooling operation is adopted to halve the feature maps' spatial size at the end of each residual block of the bottom-up pathway. The second pathway mainly consists of two residual blocks and an RPN. The RPN outputs a 5-dimensional vector [p_i, x_i, y_i, z_i, d_i], whose values denote the probability that the region is a candidate, together with its center coordinates and diameter. In this pathway, after each residual block, a deconvolution operation is used to double the spatial size of the feature maps. After the final combination of the two pathways' feature maps, an RPN is constructed to detect the lung nodules, in which three anchors are employed to detect lung nodules of multiple sizes. The end-to-end detector comprises 22 layers, including convolutional layers, pooling layers, and a region proposal layer.
Lung segmentation is the first step in lung nodule detection, and it can remove many unrelated lesions in CT screening images. In general, a lung region segmentation method contains the following main steps: (a) thresholding-based binarization, (b) border cleaning, (c) labeling, and (d) erosion and dilation [9]. Images obtained by these steps are shown in Fig. 7. As mentioned above, the HU values of different lesions in CT screening images vary, so a threshold can be used to distinguish lung regions from non-lung regions. The threshold used to segment lung regions in this work was −600 HU. We then used erosion and dilation to fill in the incursions into the lung region represented by radio-opaque tissue, followed by a region selection based on bounding-box areas. Squares of size 10 × 10 and 4 × 4 are used as the structuring elements of dilation and erosion in this work [34]. Finally, the proportion of lung region area to the whole image area was calculated to label the non-lung regions. The segmented lung regions were used as the input of the detection model. During training, inputting entire lung regions into the model would require high computational capacity and CUDA memory. To meet the hardware constraints, the whole lung region inputs are divided into smaller cubes. The work in [17] emphasized the importance of receptive field size when training a DCNN as a classifier to reduce the number of FPs. It argued that if the receptive field is small, the amount of surrounding contextual information is too limited to train a model well; in contrast, if the receptive field is too large, redundant information and even noise may be added to the training. We believe that this is also the case for the lung nodule detection task, where each point, along with its surrounding pixels, can be regarded as a candidate sample for the detector. The size of the receptive fields and the batch size determine the number of candidates in a batch.
A training batch with a proper batch size can guarantee that enough hard negatives are included while avoiding redundant information and noise. Thus, the input size and batch size were analyzed with respect to the corresponding results.

4.2. Hard sample mining methods

The numbers of samples depicting normal lung tissue and lung nodules are imbalanced in a series of CT screening images. Methods to address the imbalance problem can be divided into two categories: data-level methods and classifier-level methods [37,38]. Data-level methods consist of minority-sample oversampling and majority-sample undersampling, which solve the imbalance problem by changing the number of samples in each category. However, in the lung nodule detection task, the imbalance between positive and negative instances occurs within a single sample. Simple oversampling and undersampling methods are not effective because there are many normal tissue samples, from among which the typical samples are hard to select. Some methods applied to object detection tasks focus the training on samples that are difficult to distinguish; this approach is called hard sample mining [39]. When training a neural network-based detector, the detector provides a confidence probability for each lung tissue region. This probability represents how likely it is that
Please cite this article as: X. Xu, C. Wang, J. Guo et al., DeepLN: A framework for automatic lung nodule detection using multi-resolution CT screening images, KnowledgeBased Systems (2019) 105128, https://doi.org/10.1016/j.knosys.2019.105128.
X. Xu, C. Wang, J. Guo et al. / Knowledge-Based Systems xxx (xxxx) xxx
Table 3
The details of this model's architecture.

Operation                        | Kernel size/stride (padding)                              | Output
---------------------------------|-----------------------------------------------------------|--------------------------------------------------
Conv3D                           | 3 × 3 × 3 / 1 (1)                                         | 1 × 24 × 128 × 128 × 128
Conv3D                           | 3 × 3 × 3 / 1 (1)                                         | 1 × 24 × 128 × 128 × 128
MaxPool3D                        | 2 × 2 × 2 / 2 (0)                                         | 1 × 24 × 64 × 64 × 64
Residual block                   | 3 × 3 × 3 / 1 (1), 1 × 1 × 1 / 1 (0), 3 × 3 × 3 / 1 (1)   | 1 × 32 × 64 × 64 × 64, 1 × 32 × 64 × 64 × 64
MaxPool3D                        | 2 × 2 × 2 / 2 (0)                                         | 1 × 32 × 32 × 32 × 32
Residual block                   | 3 × 3 × 3 / 1 (1), 1 × 1 × 1 / 1 (0), 3 × 3 × 3 / 1 (1)   | 1 × 64 × 32 × 32 × 32, 1 × 64 × 32 × 32 × 32
MaxPool3D                        | 2 × 2 × 2 / 2 (0)                                         | 1 × 64 × 16 × 16 × 16
Residual block                   | 3 × 3 × 3 / 1 (1), 1 × 1 × 1 / 1 (0), 3 × 3 × 3 / 1 (1)   | 1 × 64 × 16 × 16 × 16, 1 × 64 × 16 × 16 × 16
MaxPool3D                        | 2 × 2 × 2 / 2 (0)                                         | 1 × 64 × 8 × 8 × 8
Residual block                   | 3 × 3 × 3 / 1 (1), 1 × 1 × 1 / 1 (0), 3 × 3 × 3 / 1 (1)   | 1 × 64 × 8 × 8 × 8, 1 × 64 × 8 × 8 × 8
ConvTranspose3D                  | 2 × 2 × 2 / 2 (0)                                         | 1 × 64 × 16 × 16 × 16
Multi-level features combination |                                                           | 1 × 128(64) × 16 × 16 × 16
Residual block                   | 3 × 3 × 3 / 1 (1), 1 × 1 × 1 / 1 (0), 3 × 3 × 3 / 1 (1)   | 1 × 64 × 16 × 16 × 16, 1 × 64 × 16 × 16 × 16
ConvTranspose3D                  | 2 × 2 × 2 / 2 (0)                                         | 1 × 128 × 32 × 32 × 32
Multi-level features combination |                                                           | 1 × 128(64) × 32 × 32 × 32
Residual block                   | 3 × 3 × 3 / 1 (1), 1 × 1 × 1 / 1 (0), 3 × 3 × 3 / 1 (1)   | 1 × 64 × 32 × 32 × 32, 1 × 64 × 32 × 32 × 32
Region proposal network          | 1 × 1 × 1 / 1 (0), 1 × 1 × 1 / 1 (0)                      | 1 × 64 × 32 × 32 × 32, 1 × 15 × 32 × 32 × 32
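The residual blocks and the RPN head of Table 3 could be sketched in PyTorch as follows. This is a sketch: the exact wiring of the three listed kernels inside a residual block, and the use of batch normalization, are assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """3-D residual block following the kernel pattern listed in Table 3:
    two 3x3x3 convolutions on the main path and a 1x1x1 shortcut projection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
        )
        self.shortcut = nn.Conv3d(in_ch, out_ch, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

class RPNHead3D(nn.Module):
    """Region proposal head: two 1x1x1 convolutions producing, for each of
    the three anchors, the 5-dimensional vector [p, x, y, z, d]
    (3 anchors x 5 values = the 15 output channels of Table 3)."""
    def __init__(self, in_ch=64, mid_ch=64, num_anchors=3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv3d(in_ch, mid_ch, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, num_anchors * 5, kernel_size=1),
        )

    def forward(self, x):
        return self.head(x)
```

Each spatial position of the RPN output grid thus proposes three candidate boxes, one per anchor size.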
Fig. 5. Patients’ age and sex distribution.
this region belongs to a lung nodule. The regions whose confidence probability values differ greatly from their labels are selected to train the model and are called hard samples. In the lung nodule detection task, there are substantially fewer positive samples than negative ones. We hence select the hard samples only from the negative samples; the selected samples are called hard negatives [9]. Hard negative mining can be regarded as a data-level downsampling method, and using it may cause the negatives to be sampled insufficiently. It would be better to find a method that can both guarantee sufficient sampling and focus the training on hard samples. Focal loss [24] was proposed to achieve this goal. This method belongs to the category of classifier-level methods and focuses training on hard samples by minimizing the following equation:

FL(pt) = −α(1 − pt)^γ log(pt),   (1)

where pt is defined as follows:

pt = { p        if y = 1,
       1 − p    otherwise.   (2)
As shown in the above equations, p denotes the probability of a candidate. When a sample is difficult to distinguish, the term −log(pt) is large; in contrast, this term decreases when the sample is easily distinguishable. When pt is small, 1 − pt is approximately 1 and the loss is unaffected. This equation focuses the whole training process on the hard samples and neglects the simple samples to some extent. The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted [24]. The weight α is applied to the different terms in this equation to control the ratios between the different categories of errors; when 1 − pt is smaller, the impact of α is smaller. This can solve the balancing problem between hard and easy examples. In this study, the three-level annotation criterion guarantees that the labeled lung tissues are lung nodules, but there are still some missed lung nodules of no clinical significance. Because the aim of focal loss is to focus the training on hard samples, these missed lung nodules are regarded as hard negatives by the detector. To address this problem, a small modification is employed to improve focal loss, which we call the false negative focal loss
Fig. 6. The process for lung nodule detection.
Fig. 7. The eight steps for lung segmentation.
(FNFL):

FNFL(pt) = ∑_{y∈{0,1}} [−y(1 − p)^γ log(p) − β · p^γ (1 − y) log(1 − p)].   (3)

In this work, the hard negatives may in fact be missed lung nodules, whereas the quantity of negatives is much larger than that of the positives and not all lung nodules can be annotated. We set β ∈ (0, 1) both to focus training less on false negatives, which are regarded as hard negatives during training, and to reduce the influence of the imbalance between negatives and positives. In this work, β was set to 0.5. In Eq. (1), α is employed to magnify or shrink the errors produced by hard examples, regardless of whether they are negative or positive samples. In contrast, β in Eq. (3) is used to magnify or shrink the errors produced only by negative samples.
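Eq. (3) can be transcribed directly. Below is a NumPy sketch; β = 0.5 follows the text, while the focusing parameter γ = 2 is an assumed value, since γ is not restated here.

```python
import numpy as np

def false_negative_focal_loss(p, y, gamma=2.0, beta=0.5, eps=1e-7):
    """Eq. (3): a focal loss whose negative term is scaled by beta, so that
    unannotated nodules mistaken for hard negatives are penalized less."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)  # numerical safety
    y = np.asarray(y, dtype=float)
    pos_term = -y * (1.0 - p) ** gamma * np.log(p)
    neg_term = -beta * (1.0 - y) * p ** gamma * np.log(1.0 - p)
    return float(np.sum(pos_term + neg_term))
```

With β = 1 the expression reduces to an unweighted focal loss; β < 1 shrinks only the penalty on confident negative predictions.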
4.3. Methods for training detectors on multi-resolution CT screening images

In this study, the DeepLNDataset is divided into the ThinSet and ThickSet subsets. When training a general 3D DCNN-based detector, the CT screening images are resized to a 1 × 1 × 1 mm spatial resolution. There is no change along the direction of the cross section in the ThinSet, but a large amount of noise is introduced by the resizing algorithm in the ThickSet. Even though the same method is used to resample data from both subsets, the essential features of the resampled data are very different. If both subsets are used to train a detector together, the weight updates come from CT screening images of different resolutions and influence each other, which can degrade the performance of the detectors on the DeepLNDataset. Training the models on the two subsets separately eliminates the influence of this problem on the ThinSet. Because noise occurs only in the ThickSet, the subset used to train the detector contains only one feature distribution. This reduces training conflicts, which improves the performance on the ThickSet. Furthermore, an ensemble strategy is used to enable the two models trained on the two subsets to cooperate and improve detection performance. Ensemble learning is a powerful rule-based method that enables a group of models to make a collective decision, with the aim of obtaining better and more robust performance [40–42]. Ensemble methods can be categorized into two classes: boosting and bagging [40]. In bagging, the class with the maximum number of votes decides the final prediction, while in boosting, weights are assigned to each prediction and the weighted predictions are integrated to vote for a class. Most previous methods [43] were proposed to improve the performance of models used in classification tasks; few studies have improved detection performance using ensemble methods.
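The 1 × 1 × 1 mm resampling step discussed above can be sketched with `scipy.ndimage.zoom`; the interpolation order is an assumption, as the paper does not state it.

```python
import numpy as np
from scipy import ndimage

def resample_to_isotropic(volume, spacing_mm, new_spacing=(1.0, 1.0, 1.0)):
    """Resample a CT volume to 1 x 1 x 1 mm voxels. `spacing_mm` is the
    (z, y, x) voxel spacing read from the scan header."""
    zoom = np.asarray(spacing_mm, dtype=float) / np.asarray(new_spacing, dtype=float)
    # order=1 (trilinear) interpolation is an assumed choice
    return ndimage.zoom(volume, zoom, order=1)
```

For a thick-section scan with, say, 5 mm slices, the z-axis is interpolated fivefold, which is the source of the resampling noise discussed above.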
In this work, we aggregate the results of different detector models to make the final predictions using the NMS method [36]. NMS can be regarded as a kind of bagging that retains the most probable bounding boxes and removes the others. Each model outputs bounding boxes for the candidates, and these candidate bounding boxes are pooled. To reduce overlap, the pooled bounding boxes are filtered using NMS, and the filtered bounding boxes are used as the final lung nodule predictions. The method can be expressed as

det = NMS(det0, . . . , deti, . . . , detn).   (4)
Here, det denotes the final detection result, which is voted on by a list of detectors det0, . . . , detn, and deti is the 5-dimensional vector [pi, xi, yi, zi, di] output by an RPN. As shown in the above equation, all detectors contribute equally to the final predictions. However, models trained on different subsets focus on different features, and these models should not contribute equally to the final decision for a specific set of CT screening images. Hence, another ensemble method was used in this study, in which the predictions of the different models are assigned weights to balance the importance of the models trained on CT screening images of different resolutions:

det = NMS(w0 × det0, . . . , wi × deti, . . . , wn × detn).   (5)
As shown in the equation above, the probability pi of each candidate given by detector deti is assigned a weight wi. The final result is then selected from among the candidates of all the detectors via NMS. Eq. (4) is a particular case of Eq. (5) in which all the weights wi = 1. In this work, we trained two models on the two levels of resolution that we used.
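Eqs. (4) and (5) can be sketched as follows. The cube-shaped IoU test and the overlap threshold are our assumptions, since the exact NMS overlap criterion is not restated here.

```python
import numpy as np

def iou_3d(a, b):
    # Overlap of two detections [p, x, y, z, d], each treated as an
    # axis-aligned cube of side d centered at (x, y, z).
    lo = np.maximum(a[1:4] - a[4] / 2, b[1:4] - b[4] / 2)
    hi = np.minimum(a[1:4] + a[4] / 2, b[1:4] + b[4] / 2)
    inter = np.prod(np.maximum(0.0, hi - lo))
    return inter / (a[4] ** 3 + b[4] ** 3 - inter)

def ensemble_nms(detections_per_model, weights, iou_thresh=0.1):
    """Pool each model's detections with its confidence scaled by the model
    weight (Eq. (5)), then keep the most probable non-overlapping boxes;
    with all weights equal to 1 this reduces to Eq. (4)."""
    pooled = []
    for dets, w in zip(detections_per_model, weights):
        for det in dets:
            det = np.asarray(det, dtype=float).copy()
            det[0] *= w                      # weighted vote
            pooled.append(det)
    pooled.sort(key=lambda d: -d[0])         # highest confidence first
    kept = []
    for det in pooled:
        if all(iou_3d(det, k) < iou_thresh for k in kept):
            kept.append(det)
    return kept
```

When two models propose nearly the same nodule, only the higher weighted-confidence box survives, so the weights directly control which model's vote wins on overlapping candidates.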
5. Experiments

In this section, we present the results of experiments conducted on the DeepLNDataset and evaluate the effectiveness of the proposed method. First, several methods were compared and their performance was analyzed. Different input sizes and different combinations of multi-level features were employed to train each detector separately, and the corresponding results were analyzed. The proposed ensemble method is also evaluated. Some detection results are presented as figures, and qualitative analytical results are also given.

5.1. Model training and testing

In this study, we trained our models for 100 epochs and optimized them using stochastic gradient descent (SGD) with an initial learning rate of 0.01 and a momentum of 0.9. The learning rate was halved every 25 epochs after the 50th epoch. To address the imbalance problem, hard negative mining was employed, in which the top four hard negatives chosen from each batch were used to train the detector. FNFL was also employed, with β set to 0.5, to reduce the influence of missed lung nodules. Because detector performance varies with lung nodule size, different oversampling rates were used for lung nodules of different sizes: nodules smaller than 5 mm in diameter were sampled twice, those between 5 and 8 mm were sampled four times, and the remaining nodules were sampled twice each. The models were trained on an Ubuntu server with four TESLA K40 graphics processing units (GPUs), each with 12 GB of graphics memory. The models' parameters were saved at every epoch, each intermediate result was evaluated, and the best results were recorded. The whole dataset was split into two parts: 80% for training and 20% for testing. Furthermore, scans from a single patient appeared only in the training set or only in the testing set. After the training was terminated at the 100th epoch, the last saved checkpoint was used to evaluate the model's performance.
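The top-four hard negative selection described above can be sketched as follows; `mine_hard_negatives` and its index-based interface are our naming, not the paper's.

```python
import numpy as np

def mine_hard_negatives(probs, labels, top_k=4):
    """Return the indices of the `top_k` negatives (label 0) to which the
    detector assigns the highest nodule probability, i.e. the candidates
    it is most confidently wrong about."""
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    neg_idx = np.flatnonzero(labels == 0)
    hardest = neg_idx[np.argsort(-probs[neg_idx])]  # most confident first
    return hardest[:top_k].tolist()
```

These indices select which negative candidates from the batch contribute to the loss, leaving easy negatives out of the update.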
The saved checkpoints were used for evaluation on both the training set and the test set; when the training loss value is much smaller than that of the test set, the model is overfitting. The model contained a total of 5,389,199 parameters, and the whole training process took 78 hours on our devices.

5.2. Metrics

To measure the models' performance in the medical field, we employed sensitivity (TPR) to measure the true positive (TP) detection performance. It is defined as

TPR = TP / (TP + FN).   (6)
If the center of a candidate was located within a nodule annotation, the candidate was regarded as a TP; otherwise, it was counted as a false positive (FP). Sensitivity and specificity are important indices for evaluating a model's performance. The aim of this study is not only to guarantee the model's sensitivity but also to reduce the number of FPs as much as possible. Hence, another index, the free-response receiver operating characteristic (FROC), was employed to evaluate this aspect of performance. For this index, seven sensitivities are calculated at seven operating points, where the values of FP/scan are 0.125, 0.25, 0.5, 1, 2, 4, and 8; the FROC score is the average of the seven sensitivities. A higher FROC value indicates that the FP rate decreases sharply at a specific boundary, so the FROC focuses more on measuring the model's ability to distinguish FPs.
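The FROC computation described above can be sketched as follows. This sketch scores every candidate independently; the handling of multiple candidates hitting the same nodule is simplified.

```python
import numpy as np

FP_PER_SCAN = (0.125, 0.25, 0.5, 1, 2, 4, 8)

def froc(probs, is_tp, num_nodules, num_scans):
    """Average sensitivity over the seven FP/scan operating points.
    `probs` holds every candidate's confidence; `is_tp` records whether
    its center fell inside a nodule annotation."""
    order = np.argsort(-np.asarray(probs, dtype=float))
    hits = np.asarray(is_tp, dtype=bool)[order]
    tp_cum = np.cumsum(hits)
    fp_cum = np.cumsum(~hits)
    sens = []
    for rate in FP_PER_SCAN:
        within = fp_cum <= rate * num_scans  # thresholds meeting this FP budget
        sens.append(tp_cum[within][-1] / num_nodules if within.any() else 0.0)
    return float(np.mean(sens)), sens
```

Sweeping the confidence threshold implicitly (via the sorted cumulative counts) gives the sensitivity achievable at each allowed FP/scan rate.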
Table 4
The results obtained by different methods upon the same backbone.

Different aspects     | Methods                                 | Sensitivity | FROC
----------------------|-----------------------------------------|-------------|-------
Different size input  | 96 × 96 × 96                            | 0.9609      | 0.6424
                      | 96 × 96 × 96 (38)                       | 0.9500      | 0.6932
                      | 128 × 128 × 128                         | 0.9652      | 0.7132
                      | 144 × 144 × 144                         | 0.9522      | 0.6928
Different combination | Element-wise addition                   | 0.9174      | 0.6852
                      | Conv1 × 1 × 1 and element-wise addition | 0.9652      | 0.6932
                      | Concatenation                           | 0.9652      | 0.7132
                      | Conv1 × 1 × 1 and concatenation         | 0.9609      | 0.6844
5.3. Methods for detecting lung nodules

The most effective lung nodule detectors are neural network-based models, for which the different connections in the network and the fusion of multi-level feature maps are vital. In this subsection, using the neural-network backbone described in [9] as the detector, different methods are analyzed experimentally. First, to evaluate the influence of the receptive field and data sampling, the input size and batch size were varied individually. We also conducted experiments to compare the results of different feature-combination methods. The experiments in this section were conducted only on the ThinSet; the results are shown in Table 4. To obtain the experimental results shown at the top of Table 4, direct concatenation was used to combine the multi-level feature maps. The preprocessed CT images are cropped into smaller cubes; three sizes of cropped cubes were employed: 96 × 96 × 96, 128 × 128 × 128 and 144 × 144 × 144 pixels. In the table, 96 × 96 × 96 (38) denotes a batch size of 38 (the batch size in all other experiments was 16). The results for 96 × 96 × 96 are not as good as those for 128 × 128 × 128 on both indices: the input size is small, and hence there is not enough information from the negatives to train the model well. The results for 128 × 128 × 128 are in turn better than those for 144 × 144 × 144: with a larger input size, more redundant information and even noise is added during training, making the model harder to train. Furthermore, the negative samples were selected from the regions in each batch, and the domain of selection significantly influences the model's performance. The range from which hard negatives are selected increases with the input size, which comprises the cropped cube's size and the batch size. To evaluate the influence of batch size, a 96 × 96 × 96 model with batch size 38 was trained and evaluated on the same subset.
Its FROC result at the top of Table 4 indicates that the detector's discrimination performance was improved. Nevertheless, its results were lower than those of a 128 × 128 × 128 model trained with a batch size of 16 on both indices, even though 128 × 128 × 128 × 16 ≈ 96 × 96 × 96 × 38. The size of the cropped cube might therefore affect performance more substantially than the training batch size does. Moreover, a larger batch size increases the hardware cost, so the 144 × 144 × 144 model with batch size 16 is the upper limit on our hardware. At the bottom of Table 4, different combinations of multi-level feature maps are analyzed. The input size of the detectors in these experiments was 128 × 128 × 128 and the batch size was 16. Overall, the variation in the FROC results across the combinations is not significant. Element-wise addition was employed first, and its sensitivity results are much lower than those of the other experiments. The features in a feature map should retain the corresponding spatial information; however, in this scheme, the representations of the feature points at different levels of the feature maps did not correspond. The direct element-wise addition of feature maps from different levels can therefore reduce the efficiency of training. Of course, we can apply a transformation to make different-level feature
Table 5
The results of ensemble methods.

Methods           | Resolution | Sensitivity | FROC
------------------|------------|-------------|-------
The method in [9] | ThinSet    | 0.9505      | 0.6908
                  | ThickSet   | 0.8627      | 0.6676
DeepLN(Res)       | ThinSet    | 0.9652      | 0.7132
                  | ThickSet   | 0.8824      | 0.6589
DeepLN(Ens)       | ThinSet    | 0.9695      | 0.7203
                  | ThickSet   | 0.9117      | 0.6762
maps correspond. A 1 × 1 × 1 convolution was used before element-wise addition to align the different levels of feature maps. The results of this experiment are better than those of direct element-wise addition, especially in terms of sensitivity. When the direct concatenation of different-level feature maps was employed, the trained model obtained the best results, because this method eliminates the conflict caused by different-level feature maps in a simple way. Further, when a 1 × 1 × 1 convolution layer was used before concatenation, the model's performance decreased (especially in terms of the FROC value). The newly added convolution layer only modified the feature information of the low-level feature maps, and the concatenation doubled the number of corresponding layers in the feature maps. The newly added convolutional layer and the concatenation required additional parameters, which increased the difficulty of training.

5.4. Detection methods in multi-resolution CT images

In this subsection, we focus on the different resolutions of CT screening images. Given the results and analysis of the previous subsection, the input size was set to 128 × 128 × 128 and direct concatenation was employed as the combination method. We analyzed our methods on the DeepLNDataset and compared them with other state-of-the-art methods; the results are shown in Table 5. The method used in [9] recently won the Data Science Bowl 2017 organized by Kaggle. In [9], CT screening images of different resolutions were resampled to 1 × 1 × 1 mm voxels when used to train the detection model. This method does not consider the influence of multi-resolution CT screening images; its results on multi-resolution CT scans are listed at the top of Table 5. First, CT screening images of different resolutions contain different features.
Second, using the resampling method to preprocess CT screening image data at different resolutions causes the noise level to vary, so using two resolutions of CT screening images to train one detector leads to training conflicts. The results for the proposed method, which uses two models for the two image resolutions, are given in the middle of Table 5 and labeled ''DeepLN (Res)''. Both performance indices obtained by the proposed method on the thin-section CT screening images are better than those of the base method. The sensitivity for the thick-section CT screening images is improved; however, the FROC slightly decreased. Table 2 shows
Table 6
The results for ensemble methods. (wlr, whr) denotes the values of the ensemble weights.

Weights (wlr, whr) | ThinSet | ThickSet
-------------------|---------|---------
(0.0, 1.0)         | 0.7132  | 0.4709
(0.1, 0.9)         | 0.7182  | 0.4719
(0.2, 0.8)         | 0.7203  | 0.6182
(0.3, 0.7)         | 0.7132  | 0.5099
(0.4, 0.6)         | 0.7131  | 0.5986
(0.5, 0.5)         | 0.7193  | 0.6659
(0.6, 0.4)         | 0.7131  | 0.6721
(0.7, 0.3)         | 0.7192  | 0.6731
(0.8, 0.2)         | 0.6927  | 0.6762
(0.9, 0.1)         | 0.7191  | 0.6748
(1.0, 0.0)         | 0.3176  | 0.6589
Table 7
The method's 5-fold validation results on the DeepLNDataset. FROC/sensitivity values are shown in this table.

Subset   | 5-fold validation | The method in [9] | DeepLN(Res)   | DeepLN(Ens)
---------|-------------------|-------------------|---------------|---------------
ThickSet | Valid-1           | 0.6281/0.7288     | 0.6162/0.7456 | 0.6312/0.7655
         | Valid-2           | 0.6304/0.8706     | 0.6203/0.8812 | 0.6426/0.8923
         | Valid-3           | 0.6169/0.9189     | 0.6356/0.9223 | 0.6401/0.9339
         | Valid-4           | 0.6611/0.8382     | 0.6655/0.8378 | 0.6812/0.8689
         | Valid-5           | 0.6676/0.8627     | 0.6589/0.8824 | 0.6762/0.9117
         | Mean              | 0.6408/0.8436     | 0.6393/0.8538 | 0.6543/0.8745
ThinSet  | Valid-1           | 0.7030/0.9324     | 0.7134/0.9807 | 0.7214/0.9816
         | Valid-2           | 0.6106/0.9406     | 0.6556/0.9772 | 0.6625/0.9862
         | Valid-3           | 0.6411/0.8899     | 0.6609/0.9174 | 0.6815/0.9357
         | Valid-4           | 0.6214/0.8899     | 0.6364/0.8916 | 0.6407/0.9259
         | Valid-5           | 0.6908/0.9505     | 0.7132/0.9652 | 0.7203/0.9695
         | Mean              | 0.6534/0.9207     | 0.6756/0.9464 | 0.6852/0.9598
that the thick-section data comprised one-third of all the data points. In contrast to the base method, the detector trained on thin-section images used two-thirds of the dataset and achieved better results. Moreover, the detector trained on thick-section images used one-third of all the data and obtained similar results. Hence, we conclude that the proposed method can reduce the influence of multi-resolution images on the detection results. The proposed ensemble method was also evaluated; its results are given at the bottom of Table 5 and labeled ''DeepLN (Ens)''. The proposed ensemble method obtains the best results on all indices for both high- and low-resolution CT screening images. These experiments show that the resolution of CT screening images can affect the training of lung nodule detectors, and the proposed methods perform better than the base method.

5.5. Comparison of different ensemble strategy weights

To analyze the influence of multi-resolution CT images, we adopted different ensemble weights for the two models to change the ratio of their influence. In this work, two models were employed to obtain the ensemble results, and the final results are calculated with the following equation:

det = NMS(wlr × detlr, whr × dethr).
(7)
In this equation, detlr denotes the results given by the model trained on the ThickSet, while dethr denotes those of the model trained on the ThinSet; wlr and whr denote the weights assigned to the two models. The results obtained by the proposed method using different weights are shown in Table 6 and Fig. 8, where the FROC metric is used to evaluate model performance. Table 6 and Fig. 8 show that the best results on the two subsets were all obtained by ensemble models. When (wlr, whr) was set to (0.2, 0.8), the best result on the ThinSet was obtained; when (wlr, whr) was set to
(0.8, 0.2), the best result on the ThickSet was obtained. When we used only the model trained on the ThickSet, the results on the ThinSet were the worst; correspondingly, when we used only the model trained on the ThinSet, the results on the ThickSet were the worst. The table demonstrates that there is a significant performance gap between the multi-resolution CT screening image datasets.

5.6. Analysis and discussion

5.6.1. Detected and undetected nodules

To demonstrate our models' results intuitively, some detected nodules are labeled in the original images in Fig. 9. The results are cropped from the original CT images around the centers of the detected lung nodules. The crop size is 64 × 64 pixels (about 45 × 45 mm), and four consecutive slices were chosen for each plot. The results show that nodules with different morphological features can be detected in multi-resolution CT scans. Fig. 9(a)–(e) shows the results detected in thin-section CT screening images. These detected nodules tend to have varied morphological features and small diameters. The nodules in Fig. 9(c) and (e) are ground-glass opacity nodules, which are easily missed because of their fuzziness and small scale. The nodules in Fig. 9(a) and (d) are solid, with high density and brightness in the CT screening images. Fig. 10(a)–(e) shows examples detected in thick-section CT screening images. Fig. 10(a) shows a very tiny lung nodule; this lesion is observed only in the second slice. Fig. 10(c) shows that lung nodules with cavities can be detected. The lung nodule shown in Fig. 10(e) is attached to the pleura; because its maximum diameter is located in the lung and there is an obvious boundary between it and the pleura, this kind of nodule forms inside the lung and grows close to the pleura. Some false negative examples are plotted in Figs. 9(f) and 10(f). These lesions are nodules that were undetected by all of the ensemble models.
The two undetected nodules are not only extremely small and fuzzy but also close to the pleura, and they could easily be regarded as parts of the vessels. These nodules are not likely to be detected by radiologists either. Because of their tiny size, nodules whose diameters are smaller than 5 mm have less clinical significance and do not need the follow-up required for subsolid lung nodules [44,45].

5.6.2. Quantitative analysis of DeepLN performance

In this work, we analyze the impact of multi-resolution CT images. From the results in Tables 5 and 6, we conclude that multi-resolution CT screening images have a significant influence on the training of a lung nodule detector. In addition, a 5-fold validation experiment was organized, and the results are shown in Table 7. In these experiments, the optimal weights (wlr, whr) were (0.8, 0.2) on the ThickSet and (0.2, 0.8) on the ThinSet. The results in this table confirm our conclusion that there are differences between thin-section and thick-section CT screening images when training a lung nodule detector. First, the proposed method improves the detector's performance significantly, especially on the ThinSet. Second, when the two detectors are trained on their respective subsets, the sensitivity on the ThickSet is significantly greater. The table shows that although the results on the different validation sets varied, the performance of the proposed method is better than that of the baseline. On the ThinSet, the mean FROC and sensitivity rose by 2.22% and 2.57%, respectively, when using our proposed method without the ensemble strategy; when the ensemble strategy was employed, these values rose by an additional 0.96% and 1.34%, respectively. On the ThickSet, although the thick-section data comprised one-third of all the data, the FROC results
Fig. 8. Ensemble results with different ensemble weights on DeepLNDataset. The horizontal axis denotes the ensemble weights (wlr , whr ).
Fig. 9. Visualization of some results’ samples upon ThinSet. (a)–(e) show detected nodules, and the last subfigure denotes undetected nodules.
achieved almost the same level as those of the base method, and the sensitivity rose by 1%. When we used the ensemble strategy, the sensitivity rose by 2%.

5.6.3. Qualitative analysis of diffuse lung nodule detection

Diffuse lung nodules are a common clinical symptom of pulmonary sarcoidosis. This disease is characterized by many such lung nodules dispersed throughout the inside of the lungs. As mentioned above, CT scans with diffuse lung nodules were not included in the dataset because such nodules are too numerous to annotate completely.
However, during testing, some CT screening images containing dispersed lung nodules were used to test our model's performance. The detector results are shown in Fig. 11, which presents a thin-section CT scan and a thick-section CT scan. The window widths and window levels of the thin-section scan are 1800 and −500, respectively, and those of the thick-section scan are 1200 and −600, respectively. The lesions in the red bounding boxes were predicted by our models, and those in the green bounding boxes were not found by our models. These examples show that our detector can detect some diffuse lung nodules to a certain extent. The ability to detect almost all
Fig. 10. Visualization of some results’ samples upon ThickSet. (a)–(e) show detected nodules, and the last subfigure denotes undetected nodules.
Fig. 11. Detector performance on some examples of diffuse lung nodules. The window width and window level of subfigure (a) are 1800 and −500, respectively. Those in subfigure (b) are 1200 and −600, respectively. These values are general settings for clinical CT interpretation.
lung nodules in this type of CT scan is very limited because many lung nodules are omitted from the results. In one slice of a lung scan, multiple groups of lung nodules have similar surrounding backgrounds and similar morphological characteristics, but the model yielded completely different results for them. We believe that these lung nodule samples can be regarded as a kind of ''adversarial example'': tiny differences are likely to cause completely different results [46]. In future work, we plan to improve the robustness of the detector as well as its performance on this kind of scan.
5.6.4. Qualitative analysis of noise caused by the resampling algorithm

When the resampling algorithm is used to resize the CT images to 1 × 1 × 1 mm3 voxels, much noise is created when resampling the thick-section CT images. In this work, we tried to demonstrate the generated noise, as shown in the examples in Figs. 12 and 13. First, in Fig. 12, we show two lung nodules from the same patient, the images of which were obtained at the same time. Above the black dashed line in this figure, the two rows of images show the same lung nodule in CT images of different resolutions. The first row shows the cropped patches from thin-section CT images (higher resolution), and the second row shows those
Fig. 12. Patches cropped from two cases acquired from one patient at the same time. One of the two cases is thin-section and the other is thick-section. In this figure, we give two lung nodule examples.
Fig. 13. The images in the first row are original CT images, and those in the second row are resampled images. The third row visualizes the difference between the two rows of images above.
from thick-section ones (lower resolution). The image patches in red dashed bounding boxes were generated by the resampling algorithm. Compared with the thin-section CT, the lung nodules in the generated images appear shallower; however, their outlines are unchanged. This indicates that the lung nodules generated by the resampling algorithm are more like cylinders, whereas real lung nodules are spherical. Second, we visualize the noise generated by the resampling algorithm when resizing thick-section CT images in Fig. 13. We cropped three consecutive images of a lung nodule from a CT scan, as shown in the top row. Then, the middle image of the three was removed, and a new image was generated in its place by the resampling algorithm; these three images are shown in the middle row. The images in the bottom row visualize the difference between the images in the top and middle rows; this difference is regarded as the generated noise in this study. The figure intuitively demonstrates that much noise is generated by the resampling algorithm,
which leads to a training conflict when both resolutions of CT images are used to train one model together.

6. Conclusions and future work

In this study, we constructed a multi-resolution CT image dataset and presented an automatic lung nodule detection framework named DeepLN to assist clinical physicians. First, to prepare a large-scale dataset, a three-level annotation criterion was proposed to guarantee the accuracy of labeling, and a semi-automatic annotation system was constructed to improve the efficiency of labeling. Second, to detect lung nodules in a clinical dataset, a method was proposed to train a lung nodule detector. Higher- and lower-level features extracted by DCNNs were combined to make accurate predictions. Compared with normal tissue samples, lung nodule samples constitute only a small minority of all samples, which creates a category imbalance problem. To address this issue, hard negative mining and a modified
focal loss were employed to increase the effectiveness of the training process. Third, because low-resolution CT scans are used in physical examinations to reduce the damage caused by X-ray radiation, the annotated dataset contains images at two different resolutions. Training with CT scans of different resolutions together can degrade performance. To address this problem, an ensemble method was proposed to combine the results of separately trained models and yield more accurate predictions.

In this study, we used multi-resolution CT screening images to train our lung nodule detector and proposed an ensemble strategy that, to some extent, addresses the problems caused by using images of multiple resolutions. However, the proposed annotation method requires substantial effort from many physicians, and training the proposed model requires considerable computational capacity. Future efforts will focus on these aspects. In addition, when evaluating our model's performance on scans of different lung nodules, we found that not all nodules can be detected. Tiny image differences around the nodules can cause perturbations that lead the model to output completely different results. In the future, more attention should be paid to the model's robustness to such disturbances, so that the lung nodule detector remains reliable across various circumstances, such as CT screening images with different lung nodules. Finally, this work aimed only to detect lung nodules; the next task is to estimate the nodules' malignancy grades.

Acknowledgments

This work was supported by the National Major Science and Technology Projects of China under Grant 2018AAA0100201 and by the Science and Technology Project of Chengdu, PR China under Grant 2017-CY02-00030-GX.

References

[1] F. Bray, J. Ferlay, I. Soerjomataram, R.L. Siegel, L.A. Torre, A.
Jemal, Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries, CA Cancer J. Clin. 68 (6) (2018) 394–424, http://dx.doi.org/10.3322/caac.21492. [2] K. Murphy, B. van Ginneken, A.M.R. Schilham, B.J. de Hoop, H.A. Gietema, M. Prokop, A large-scale evaluation of automatic pulmonary nodule detection in chest CT using local image features and k-nearest-neighbour classification, Med. Image Anal. 13 (5) (2009) 757–770, http://dx.doi.org/10.1016/j.media.2009.07.001. [3] D.M. Xu, H. Gietema, H. de Koning, R. Vernhout, K. Nackaerts, M. Prokop, C. Weenink, J.W. Lammers, H. Groen, M. Oudkerk, R. van Klaveren, Nodule management protocol of the NELSON randomised lung cancer screening trial, Lung Cancer 54 (2) (2006) 177–184, http://dx.doi.org/10.1016/j.lungcan.2006.08.006. [4] B. Golosio, G.L. Masala, A. Piccioli, P. Oliva, M. Carpinelli, R. Cataldo, P. Cerello, F. De Carlo, F. Falaschi, M.E. Fantacci, G. Gargano, P. Kasae, M. Torsello, A novel multithreshold method for nodule detection in lung CT, Med. Phys. 36 (8) (2009) 3607–3618, http://dx.doi.org/10.1118/1.3160107. [5] G. Picozzi, E. Paci, P.A. Lopez, M. Bartolucci, G. Roselli, F.A. De, S. Gabrielli, A. Masi, N. Villari, M. Mascalchi, Screening of lung cancer with low-dose spiral CT: results of a three-year pilot study and design of the randomised controlled trial ITALUNG-CT, Radiol. Med. 109 (1–2) (2005) 17–26. [6] U. Pastorino, M. Rossi, V. Rosato, A. Marchian, N. Sverzellati, C. Morosi, A. Fabbri, C. Galeone, E. Negri, G. Sozzi, Annual or biennial CT screening versus observation in heavy smokers: 5-year results of the MILD trial, Eur. J. Cancer Prev. 21 (3) (2012) 308–315. [7] F. Ciompi, K. Chung, S.J. van Riel, A.A.A. Setio, P.K. Gerke, C. Jacobs, E.T. Scholten, C. Schaefer-Prokop, M.M.W. Wille, A. Marchian, Towards automatic pulmonary nodule management in lung cancer screening with deep learning, Sci. Rep. 7 (2017) 46479. [8] S.G. Armato III, G. McLennan, L.
Bidaut, M.F. McNitt-Gray, C.R. Meyer, A.P. Reeves, B. Zhao, D.R. Aberle, C.I. Henschke, E.A. Hoffman, The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans, Med. Phys. 38 (2) (2011) 915–931.
[9] F. Liao, M. Liang, Z. Li, X. Hu, S. Song, Evaluate the malignancy of pulmonary nodules using the 3D deep leaky noisy-OR network, 2017, pp. 1–12, arXiv:1711.08324. [10] C. Jacobs, E.M. van Rikxoort, T. Twellmann, E.T. Scholten, P.A. de Jong, J.-M. Kuhnigk, M. Oudkerk, H.J. de Koning, M. Prokop, C. Schaefer-Prokop, B. van Ginneken, Automatic detection of subsolid pulmonary nodules in thoracic computed tomography images, Med. Image Anal. 18 (2) (2014) 374–384, http://dx.doi.org/10.1016/j.media.2013.12.001. [11] Fleischner Society, A.A. Bankier, C.J. Herold, J.H. Austin, W.D. Travis, Recommendations for the management of subsolid pulmonary nodules detected at CT, Radiology 266 (1). [12] Y. Zhao, G.H. de Bock, R. Vliegenthart, R.J. van Klaveren, Y. Wang, L. Bogoni, P.A. de Jong, W.P. Mali, P.M.A. van Ooijen, M. Oudkerk, Performance of computer-aided detection of pulmonary nodules in low-dose CT: comparison with double reading by nodule volume, Eur. Radiol. 22 (10) (2012) 2076–2084, http://dx.doi.org/10.1007/s00330-012-2437-y. [13] A.A.A. Setio, C. Jacobs, J. Gelderblom, B. van Ginneken, Automatic detection of large pulmonary solid nodules in thoracic CT images, Med. Phys. 42 (10) (2015) 5642–5653, http://dx.doi.org/10.1118/1.4929562. [14] B. van Ginneken, A.A.A. Setio, C. Jacobs, F. Ciompi, Off-the-shelf convolutional neural network features for pulmonary nodule detection in computed tomography scans, in: 2015 IEEE 12th International Symposium on Biomedical Imaging (ISBI), 2015, pp. 286–289, http://dx.doi.org/10.1109/ISBI.2015.7163869. [15] A.A.A. Setio, F. Ciompi, G. Litjens, P. Gerke, C. Jacobs, S.J. van Riel, M.M. Winkler Wille, M. Naqibullah, C.I. Sánchez, B. van Ginneken, Pulmonary nodule detection in CT images: False positive reduction using multi-view convolutional networks, IEEE Trans. Med. Imaging 35 (5) (2016) 1160–1169, http://dx.doi.org/10.1109/TMI.2016.2536809. [16] J. Cai, L. Lu, Y. Xie, F. Xing, L.
Yang, Improving deep pancreas segmentation in CT and MRI images via recurrent neural contextual learning and direct loss function, 2017, pp. 559–567, http://dx.doi.org/10.1007/978-3-319-66179-7. [17] Q. Dou, H. Chen, L. Yu, J. Qin, P. Heng, Multilevel contextual 3-D CNNs for false positive reduction in pulmonary nodule detection, IEEE Trans. Biomed. Eng. 64 (7) (2017) 1558–1567, http://dx.doi.org/10.1109/TBME.2016.2613502. [18] Q. Dou, H. Chen, Y. Jin, H. Lin, J. Qin, P.A. Heng, Automated pulmonary nodule detection via 3D convnets with online sample filtering and hybrid-loss residual learning, in: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), in: LNCS, vol. 10435, 2017, pp. 630–638. [19] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, 2014, http://dx.doi.org/10.1109/CVPR.2014.81. [20] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1904–1916. [21] R. Girshick, Fast R-CNN, in: 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448, http://dx.doi.org/10.1109/ICCV.2015.169. [22] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (6) (2017) 1137–1149. [23] T.-Y. Lin, P. Dollár, R.B. Girshick, K. He, B. Hariharan, S.J. Belongie, Feature pyramid networks for object detection, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [24] T.Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2999–3007, http://dx.doi.org/10.1109/ICCV.2017.324. [25] T. Messay, R.C. Hardie, S.K. Rogers, A new computationally efficient CAD system for pulmonary nodule detection in CT imagery, Med. Image Anal.
14 (3) (2010) 390–406, http://dx.doi.org/10.1016/j.media.2010.02.004. [26] H. Fujita, D. Cimr, Computer aided detection for fibrillations and flutters using deep convolutional neural network, Inform. Sci. 486 (2019) 231–239. [27] Y. Hagiwara, H. Fujita, L.O. Shu, H. Jen, R. SanTan, E.J. Ciaccio, U.R. Acharya, Computer-aided diagnosis of atrial fibrillation based on ECG signals: A review, Inform. Sci. 467 (2018) 99–114. [28] H. Fujita, D. Cimr, Decision support system for arrhythmia prediction using convolutional neural network structure without preprocessing, Appl. Intell. [29] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, vol. 25, Curran Associates, Inc., 2012, pp. 1097–1105. [30] S. Ji, W. Xu, M. Yang, K. Yu, 3D convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 221–231, http://dx.doi.org/10.1109/TPAMI.2012.59. [31] P. An, T. Shenzhen, 3DCNN for lung nodule detection and false positive reduction, Tech. rep., LUNA16 Challenge, 2018.
[32] W. Zhu, C. Liu, W. Fan, X. Xie, DeepLung: Deep 3D dual path nets for automated pulmonary nodule detection and classification, in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 673–681, http://dx.doi.org/10.1109/WACV.2018.00079. [33] S. Chen, J. Guo, C. Wang, X. Xu, Z. Yi, W. Li, DeepLNAnno: a web-based lung nodules annotating system for CT images, J. Med. Syst. 43 (7) (2019) 197. [34] Data Science Bowl 2017, 2017, Website, https://www.kaggle.com/c/data-science-bowl-2017. [35] Challenge LUNA 2016, 2016, Website, https://luna16.grand-challenge.org. [36] A. Neubeck, L.J.V. Gool, Efficient non-maximum suppression, in: International Conference on Pattern Recognition, 2006, pp. 850–855. [37] M. Buda, A. Maki, M.A. Mazurowski, A systematic study of the class imbalance problem in convolutional neural networks, CoRR abs/1710.05381, arXiv:1710.05381. [38] X. Xu, Q. Guo, J. Guo, Z. Yi, DeepCXray: Automatically diagnosing diseases on chest X-rays using deep neural networks, IEEE Access 6 (2018) 66972–66983, http://dx.doi.org/10.1109/ACCESS.2018.2875406. [39] A. Shrivastava, A. Gupta, R. Girshick, Training region-based object detectors with online hard example mining, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 761–769, http://dx.doi.org/10.1109/CVPR.2016.89.
[40] S.B. Meshram, S.M. Shinde, A survey on ensemble methods for high dimensional data classification in biomedicine field, Int. J. Comput. Appl. 111 (11) (2015) 5–7. [41] H.M. Gomes, J.P. Barddal, F. Enembreck, A. Bifet, A survey on ensemble learning for data stream classification, ACM Comput. Surv. 50 (2) (2017) 23. [42] T.G. Dietterich, Ensemble methods in machine learning, Proc. Int. Workshop Multiple Classif. Syst. 1857 (1) (2000) 1–15. [43] V. Thakar, W. Ahmed, M.M. Soltani, J.Y. Yu, Ensemble-based adaptive single-shot multi-box detector. [44] D.P. Naidich, A.A. Bankier, H. MacMahon, C.M. Schaefer-Prokop, M. Pistolesi, J.M. Goo, P. Macchiarini, J.D. Crapo, C.J. Herold, J.H. Austin, W.D. Travis, Recommendations for the management of subsolid pulmonary nodules detected at CT: A statement from the Fleischner Society, Radiology 266 (1) (2013) 304–317, http://dx.doi.org/10.1148/radiol.12120628. [45] H. MacMahon, D.P. Naidich, J.M. Goo, K.S. Lee, A.N.C. Leung, J.R. Mayo, A.C. Mehta, Y. Ohno, C.A. Powell, M. Prokop, G.D. Rubin, C.M. Schaefer-Prokop, W.D. Travis, P.E. Van Schil, A.A. Bankier, Guidelines for management of incidental pulmonary nodules detected on CT images: From the Fleischner Society 2017, Radiology 284 (1) (2017) 228–243, http://dx.doi.org/10.1148/radiol.2017161659. [46] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, 2014, arXiv:1312.6199.