Biomedical Signal Processing and Control 49 (2019) 192–201
Look-behind fully convolutional neural network for computer-aided endoscopy
Dimitrios E. Diamantis a, Dimitris K. Iakovidis a,*, Anastasios Koulaouzidis b
a Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
b Endoscopy Unit, The Royal Infirmary of Edinburgh, Edinburgh, UK
Article info
Article history: Received 14 August 2018; Received in revised form 27 October 2018; Accepted 6 December 2018.
Keywords: Computer-aided diagnosis; Endoscopy; Convolutional neural networks; Abnormality detection
Abstract
In this paper, we propose a novel Fully Convolutional Neural Network (FCN) architecture aiming to aid the detection of abnormalities, such as polyps, ulcers and blood, in gastrointestinal (GI) endoscopy images. The proposed architecture, named Look-Behind FCN (LB-FCN), is capable of extracting multi-scale image features by using blocks of parallel convolutional layers with different filter sizes. These blocks are connected by Look-Behind (LB) connections, so that the features they produce are combined with features extracted from the layers behind them, thus preserving the respective information. Furthermore, it has a smaller number of free parameters than conventional Convolutional Neural Network (CNN) architectures, which makes it suitable for training with smaller datasets. This is particularly useful in medical image analysis, since data availability is usually limited due to ethicolegal constraints. The performance of LB-FCN is evaluated on both flexible and wireless capsule endoscopy datasets, reaching 99.72% and 93.50%, respectively, in terms of Area Under the Receiver Operating Characteristic curve (AUC).
1. Introduction

Gastrointestinal (GI) diseases are becoming more and more common [1]. Each year in the United States 130,000 patients are diagnosed with colon cancer, making it the second most common form of cancer in the country. Recent studies show that the modern way of life, especially in developed countries, has increased the number of cases with GI lesions. In this paper we aim to provide a computer-aided approach for abnormality detection addressing a variety of conditions, such as polyps, vascular bleeding and inflammatory lesions. The GI tract can be broken down into four sections, namely the esophagus, the stomach, the small intestine and the colon. A typical examination method of the GI tract is Flexible Endoscopy (FE) [2] and its variations [3–5]. Wireless Capsule Endoscopy (WCE) [6] is becoming increasingly popular as a method of capturing images from the entire GI tract due to its non-invasiveness. This method uses a swallowable camera to capture low-resolution images throughout the entire GI tract, which are afterwards examined by a clinician. A considerable amount of manual effort is required, typically translating into 45–90 minutes of work that demands undisrupted concentration. Thus, the review of an entire WCE video is prone to human errors, since the video reviewers can become tired
* Corresponding author. E-mail address: [email protected] (D.K. Iakovidis).
https://doi.org/10.1016/j.bspc.2018.12.005
over time. This raises the need for a computer-aided diagnosis methodology that could increase the overall diagnostic accuracy and reduce the required examination time.

Computer-aided abnormality detection in endoscopic images of the GI tract has been an active research subject over the last 15 years [7–9]. Abnormality detection refers to the ability to discriminate abnormal tissues from normal image contents. Normal image contents include non-pathologic tissues and intestinal content, such as debris and bubbles. The first approaches aimed at the detection of abnormalities in FE [7,8]. In that context, abnormality detection systems contribute to the early detection of life-threatening conditions such as cancer. Their use can contribute to speeding up FE procedures, which are generally uncomfortable for the patients. An added benefit is that cost reduction can be achieved by the use of such systems, as they could enable less experienced personnel to perform the examination. The abnormality detection methodologies that have been proposed in the context of GI FE [8] and WCE [9] can be grouped into two main categories, according to the type of features used to describe the images. The methodologies of the first category are based on hand-crafted features for the representation of image properties, including color, texture and shape [8–19,59]. However, such features are usually selected based on considerations about the modality used to acquire the images or about the abnormalities to be detected. In the second category, which comprises more recent methodologies, the feature extraction process is automatic. This is usually implemented through adaptation to the annotated
dataset used for training the overall system. State-of-the-art approaches of this kind are based on Convolutional Neural Networks (CNNs) [20–24]. CNNs are Artificial Neural Networks (ANNs) [25,26] that consist of multiple convolutional layers, with a neuron arrangement that resembles the biological visual cortex, forming deep feed-forward architectures. Recently, supervised methodologies based on weakly annotated images have shown promising results for the classification of endoscopy images. The so-called weak labels are essentially keywords that only semantically describe image content. Therefore, weak labeling constitutes a time-efficient approach to obtaining image annotations from the experts [27,28]. In this context we proposed a Multiple Instance Learning (MIL)-based approach following the Bag of visual Words (BoW) model for the classification of GI endoscopy images [19]. More recently, we proposed a methodology for weakly supervised detection and localization of abnormalities in GI endoscopy images [29]. It includes Weakly supervised CNN-based (WCNN) classification of the endoscopic images, followed by the detection of salient points, which are subsequently filtered by a clustering process to enable within-frame localization of the abnormalities. Specifically for bleeding detection and segmentation, a two-stage approach has been proposed by Jia and Meng [30]. Initially, the images obtained from WCE are classified into active and inactive subgroups based on hand-crafted, statistically derived color probability features. Segmentation is then performed using a deep FCN architecture [31], i.e., an architecture composed only of convolutional layers. Other CNN architectures for the classification of weakly labeled images, proposed in the context of GI endoscopy, include a CNN that receives RGB images along with their Hessian and Laplacian transformations as input [32]; a cascaded CNN architecture for the recognition of the different organs of the GI tract and normal intestinal content [33]; and a CNN architecture for blood detection, using an SVM instead of the fully-connected layer of the conventional CNNs [34]. A recent, generic CNN-based approach to abnormality detection in GI endoscopy has been proposed in [35]. It utilizes a pre-trained CNN architecture, specifically CaffeNet [36], as a feature extractor. The features are extracted from the intermediate layers of the network, and the extracted feature maps are then used to train an SVM classifier. A remarkable aspect of that approach is that the CNN was trained solely on ImageNet [37], a large dataset of natural images that does not include any endoscopic or other relevant images. Recent CNN architectures that, to the best of our knowledge, have not yet been applied in the context of abnormality detection in endoscopy include ResNet [38], ResNeXt [39] and Inception-v4 [40]. All three are state-of-the-art networks tested on natural image classification. The key contribution of ResNet is the introduction of residual learning through residual blocks (convolutional layers with a shortcut connection). ResNeXt follows the same principle of residual learning, yet instead of increasing the depth of the network it introduces the concept of cardinality, which defines the number of paths in a ResNeXt block. Through multiple experiments, the cardinality meta-parameter proved to be more effective in enhancing the classification performance than going deeper or wider in a network.
Furthermore, bottleneck residual blocks are used to reduce the number of feature maps of the network. By lowering the width of each convolutional layer and increasing the cardinality, ResNeXt achieves higher classification performance than ResNet while maintaining a similar number of free parameters. Inception-v4 provides a uniform design for three Inception modules and a new stem block, which defines the initial set of operations performed before the Inception modules. Furthermore, Inception-v4 introduced "Reduction Blocks", which are used to adjust the width and height of the input and output volumes of the Inception modules. Although this network
outperformed the ResNet and ResNeXt architectures in natural image classification, it has a larger number of free parameters. While the usage of CNNs, including FCNs, has provided superior results compared to other conventional approaches, they generally require large training datasets. A drawback of current CNN approaches is that the use of smaller training datasets limits their generalization capacity. This derives from the fact that, as the number of free parameters of the network increases, the number of training examples needed to avoid overfitting also increases [41]. However, the availability of such large training datasets in the medical domain is usually limited; thus, CNN training can become challenging. The challenge is to develop an architecture that generalizes well even with smaller datasets. Furthermore, an increase in free parameters increases the need for computational resources, with a consequent deterioration of the time-performance of the system for both training and testing [42,43]. Considering that access to high-end Graphical Processing Units (GPUs) can be costly, the development of a less resource-demanding architecture is a challenge that needs to be addressed. To address these challenges in the context of abnormality detection in GI endoscopy, we propose a novel CNN architecture. The proposed architecture is a Fully Convolutional Network (FCN) [31] designed for supervised binary classification of endoscopic images using weakly labeled images, i.e., it is a CNN without fully-connected layers that can be trained using entire images as input, labeled as either abnormal or normal. An image is labeled as abnormal if it includes tissues clinically characterized as abnormal; otherwise, it is labeled as normal. Once trained, the network is capable of classifying unlabeled input images as abnormal or normal. Its novelty lies in the use of consecutive blocks of parallel convolutional layers with different filter sizes, enabling multi-scale feature extraction. These blocks are connected to each other by Look-Behind (LB) connections, so that their features are combined with features from the layers behind them. This way, the respective information from the previous layers is preserved, propagated and enriched with information from the next layers of the deep architecture. The proposed LB-FCN architecture involves a smaller number of free (weight) parameters than conventional CNN architectures [43,44], and it can be trained so that it generalizes well on smaller datasets. This makes it particularly attractive for medical image analysis, where the availability of large datasets is usually limited due to ethicolegal constraints. Due to the multi-scale feature extraction procedure followed, the proposed architecture is also capable of detecting abnormalities of different sizes without the need for an extensively deep architecture. This is essential for endoscopy, considering the diversity of the abnormalities and the fact that abnormalities may be visualized from different distances (further from the endoscope head they look smaller, while closer they look larger). In the context of WCE this aspect is even more important, since the frame rate of capsule endoscopes is usually very low. Thus, given a WCE frame sequence acquired from a location of interest, an abnormality may be present in only a few frames, and it is not rare to have only a single frame with that abnormality.
The information extracted at different depths of the proposed architecture undergoes consecutive multi-scale transformations as it propagates forward along the network, while part of it is preserved through the LB connections and aggregated with the rest. This way, a richer representation of the endoscopic images is achieved. The proposed architecture can be considered an evolution of the ResNet [38], ResNeXt [39] and Inception-v4 [40] architectures, combining their advantages to deliver enhanced binary classification accuracy in the context of abnormality detection in endoscopy. The rest of the paper consists of five sections. Section 2 presents the proposed CNN architecture. Section 3 presents the datasets used and the experimental evaluation methodology, and Section
4 presents the results of the experiments performed on publicly available datasets, in comparison with the most relevant state-of-the-art approaches to GI abnormality detection. Section 5 discusses the results obtained, and the last section summarizes the conclusions that can be derived from the study.
2. LB-FCN architecture

The design of the proposed architecture follows the FCN approach [31], where the fully-connected layers typically used in conventional CNN architectures [42,43] are replaced by fully convolutional layers. It is based on two fundamental elements, which differentiate it from other FCN architectures: a) the Multi-scale Convolutional Block (MCB), which enables multi-scale feature extraction from its input, and b) the LB connection, which aims to preserve the input volume along with the features extracted per MCB. Together, the MCB and the LB connection form the basic, complete structural component (module) of the LB-FCN, illustrated in Fig. 1. The input volume, in the case of the first LB-MCB module of the LB-FCN, is an RGB endoscopic image; in the case of subsequent layers, it is the output of the previous LB-MCB modules. Overall, LB connections contribute to enhanced classification performance; however, in cases where the performance is not significantly affected, they can be pruned to reduce computational complexity.

The MCB is a small CNN composed of five convolutional layers (Fig. 1). The first convolutional layer performs a convolution with a 1 × 1 filter on the input volume. The output of this layer is used as input to a parallel arrangement of three convolutional layers with the same number of filters, performing 8 × 8, 4 × 4 and 2 × 2 convolutions, respectively. This way, large, medium and small features of the input space can be captured. The choice of these filters was driven by preliminary experimentation using various numbers of parallel filters (1–5) with different sizes (from 2 × 2 to 12 × 12) on the available datasets. We observed that with fewer than three parallel layers the classification performance deteriorated, whereas with more than three it did not improve. The output feature maps of these parallel layers are concatenated and subsequently fed to the fifth convolutional layer of the MCB, which has a 1 × 1 filter size. Each MCB has a respective LB connection in parallel, forward-passing its input through a 1 × 1 convolutional layer. An addition operator aggregates the outputs of the MCB and the LB connection, and the resulting feature maps pass through a convolutional layer with a 1 × 1 filter size. Multiple MCBs, with or without LB connections, can be sequentially arranged and connected to each other. After the aggregation of the output of the MCB with the output volume of the LB connection, a pooling operation is performed by a convolutional layer with a 2 × 2 filter and stride 2. A convolutional layer was employed instead of a conventional max-pooling layer to introduce another level of non-linearity to the network architecture, and to unify and logically simplify the overall design; a max-pooling layer can be replaced by a convolutional layer of appropriate size and stride without affecting the overall classification performance of the model [31]. The 1 × 1 convolutions used across the MCB serve two purposes: first, they help keep the overall number of free parameters of the network manageable; second, they allow the addition of the LB connection to the output of the MCB to be valid.
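The module just described can be sketched as follows. This is a minimal sketch assuming the Keras functional API (the library on which the implementation is based, Section 4); the function name, its arguments and the optional pool flag are illustrative, while the filter counts follow the description above (64 filters per parallel branch, 192 output feature maps):

from tensorflow.keras import layers

def lb_mcb_module(x, branch_filters=64, out_maps=192, pool=True):
    """One complete LB-FCN module: MCB, LB connection and convolutional pooling."""
    def conv_bn(t, filters, size, strides=1):
        # every convolutional layer uses a PReLU activation followed by batch normalization
        t = layers.Conv2D(filters, size, strides=strides, padding='same')(t)
        t = layers.PReLU(shared_axes=[1, 2])(t)
        return layers.BatchNormalization()(t)

    # MCB: a 1x1 convolution, then three parallel multi-scale convolutions
    reduced = conv_bn(x, branch_filters, 1)
    branches = [conv_bn(reduced, branch_filters, size) for size in (8, 4, 2)]
    merged = layers.Concatenate()(branches)        # 3 x 64 = 192 feature maps
    mcb_out = conv_bn(merged, out_maps, 1)         # fifth (1x1) layer of the MCB

    # LB connection: a 1x1 convolution so that depths match and addition is valid
    lb = conv_bn(x, out_maps, 1)
    agg = conv_bn(layers.Add()([mcb_out, lb]), out_maps, 1)

    # convolutional pooling: 2x2 filter with stride 2 instead of max pooling
    return conv_bn(agg, out_maps, 2, strides=2) if pool else agg

Stacking five such modules (the incomplete one obtained with an LB connection pruned) reproduces the overall arrangement of Fig. 2.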
Fig. 2 illustrates the LB-FCN architecture proposed in this study. It is composed of five modules, four of which are complete, including both MCB and LB connections as in Fig. 1. One of them is incomplete, in the sense that it includes an MCB module without an LB connection (which was pruned, as it did not contribute to an increase in classification performance). Each MCB receives 192 feature maps as input, while each of the large, medium and small convolutional branches extracts 64 feature maps. Experimentally, we observed that with fewer structural components the classification performance deteriorated, whereas the use of more than five modules did not result in any increase in classification performance either. The last layer of the network is composed of two neurons with softmax activations, which are used as the output of the model. All the convolutional layers have PReLU activations followed by batch normalization. Normalization is performed so that the values are centered on a zero mean with unit standard deviation. It was empirically confirmed that this choice can contribute to faster convergence through higher learning rates, and also to limiting overfitting without using a dropout layer [45]. PReLU was chosen over the conventional non-parametric ReLU activation function because its use has proven beneficial in overcoming saturation problems that have been observed with the latter during training [46].

3. Experiments and evaluation

3.1. Datasets

Extensive experimentation was performed to investigate the classification performance of LB-FCN on two representative datasets of different GI endoscopy modalities that include a variety of abnormalities. The first dataset (D1) was made publicly available by the EndoVis challenge held at MICCAI 2015 [47]. We used the data from the sub-challenge referring to the detection of abnormalities in gastroscopic images [48]. The selection of this dataset over the others in that challenge was driven by the diversity of the abnormalities it included, and by the fact that it also included normal images. The gastroscopy challenge dataset was derived from a total of 10,000 images obtained from 544 healthy volunteers and 519 volunteers with various abnormalities, such as cancer, bleeding and gastritis. The images originally had a resolution of 768 × 576 pixels and were cropped by the data providers down to 489 × 409 pixels in order to be anonymized [49]. For the purposes of the challenge, a subset of 698 images from 137 volunteers was released (Fig. 3). The dataset was then split into two approximately balanced subsets: one for training, with 465 images (205 normal and 260 abnormal), and one for testing, with 233 images (104 normal and 129 abnormal) [48]. This split will be referred to as D1B.

The second dataset (D2) originates from our database called KID [14]. This is a public, open-access database of both semantically and graphically annotated WCE images and videos (Fig. 4). D2 consists of a total of 2352 images with a resolution of 360 × 360 pixels. It contains 1778 normal images from the whole GI tract, including 282 esophagus images, 599 stomach images, 728 small bowel images, and 169 images from the colon. It also contains a total of 574 images of various abnormalities found along the entire GI tract, including vascular (303 images), polypoid (44 images) and inflammatory (227 images) conditions. It should be noted that, in order to keep the dataset as realistic as possible, images with artifacts naturally occurring during a WCE procedure were not excluded. These include blurry frames, bubbles, intestinal juices, stool and other debris.
3.2. Evaluation methodology
Fig. 1. A comparison between the core components of ResNet [38], ResNeXt [39], Inception-v4 [40] and the proposed LB-FCN architectures. The term "volume" represents either a set of feature maps (in the case of hidden network components) or a single image (in the case of the network's input).
To evaluate the classification performance of the LB-FCN architecture in comparison to state-of-the-art abnormality detection systems for GI endoscopy, a 10-fold cross-validation (CV) procedure was followed to limit bias; i.e., the dataset was randomly partitioned into 10 equally sized disjoint subsets, a single subset was retained as the validation data for testing the model, and the remaining nine subsets were used as training data. This was repeated until all subsets had been used for testing. Therefore, per CV fold, in the
Fig. 2. Schematic representation of the LB-FCN architecture proposed in this study. It is composed of five MCBs with four LB connections.
Fig. 3. Sample images obtained from the MICCAI 2015 Gastroscopy Challenge dataset. The first row contains normal images, whereas the second row contains images with abnormalities.
Fig. 4. Sample images from KID Dataset. The first row contains normal images, whereas the second row contains images with abnormalities.
case of D1, a total of 628 images were used for training and 70 images for testing, and in the case of D2, a total of 2117 images were used for training and 235 images for testing. The distribution of the normal and abnormal images, as well as the distribution of the abnormal frame categories, were held approximately constant in the training and testing sets per fold, i.e., 76% normal to 24% abnormal, out of which 53% were vascular, 8% polypoid and 40% inflammatory conditions. The metrics used to assess the classification performance include accuracy (ACC), specificity (SPC) and sensitivity (TPR), as estimated from Eqs. (1)–(3); the false positive rate (FPR) is given by Eq. (4):

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$

$$SPC = \frac{TN}{TN + FP} \quad (2)$$

$$TPR = \frac{TP}{TP + FN} \quad (3)$$

$$FPR = 1 - SPC \quad (4)$$

where the numbers of true negatives, true positives, false positives and false negatives are denoted as TN, TP, FP and FN, respectively.
The Receiver Operating Characteristic (ROC) curves were considered to visualize the trade-off between TPR and FPR at different decision thresholds. The Area Under the ROC curve (AUC) is used as a more reliable and intuitive classification performance measure that is insensitive to class imbalance [50], which characterizes most medical datasets, such as the ones used in this study.
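This evaluation protocol can be sketched as follows, assuming scikit-learn; build_model is a hypothetical constructor returning a classifier with a scikit-learn-like fit/predict_proba interface, and labels encode abnormal images as 1 and normal images as 0:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, roc_auc_score

def cross_validate(images, labels, build_model):
    """10-fold CV with the metrics of Eqs. (1)-(4)."""
    folds = StratifiedKFold(n_splits=10, shuffle=True)  # keeps class ratios per fold
    results = []
    for train_idx, test_idx in folds.split(images, labels):
        model = build_model()                                # fresh model per fold
        model.fit(images[train_idx], labels[train_idx])
        prob = model.predict_proba(images[test_idx])[:, 1]   # P(abnormal)
        pred = (prob >= 0.5).astype(int)
        tn, fp, fn, tp = confusion_matrix(labels[test_idx], pred).ravel()
        results.append({
            'ACC': (tp + tn) / (tp + tn + fp + fn),  # Eq. (1)
            'SPC': tn / (tn + fp),                   # Eq. (2)
            'TPR': tp / (tp + fn),                   # Eq. (3)
            'AUC': roc_auc_score(labels[test_idx], prob),
        })
    return {k: np.mean([r[k] for r in results]) for k in results[0]}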
4. Results

The proposed LB-FCN architecture was evaluated using all images of datasets D1 and D2. Since the proposed architecture uses weakly labeled images, the available graphical annotations were not used; only semantic annotations of the images, indicating whether they contain an abnormality or not, were used as ground truth. The semantics 'abnormal' and 'normal' are represented as vectors (1,0) and (0,1), respectively, at the output of the network. The network was trained using the Root Mean Square Propagation (RMSProp) [51] optimizer with an initial learning rate η = 0.01 and fuzz factor ε = 1e−8. The network was implemented using the Python Keras [52] library on top of the TensorFlow [53] graph framework, and was trained for 2000 epochs with a mini-batch size of 32 samples.
Fig. 5. Mean ROC obtained by 10-fold CV on dataset D1, using LB-FCN. The grey area around the curve represents the respective confidence band.
Training was performed on an NVIDIA GTX-960 GPU with 1024 CUDA [54] cores, 2 GB of RAM and a clock speed of 1127 MHz; a minimal sketch of this training configuration is given below.
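This sketch assumes the tf.keras API; model, x_train and y_train are hypothetical, and the categorical cross-entropy loss is an assumption, since the loss function is not stated in the text:

from tensorflow.keras.optimizers import RMSprop

optimizer = RMSprop(learning_rate=0.01, epsilon=1e-8)  # eta = 0.01, fuzz factor epsilon
model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',  # two softmax outputs: (1,0)/(0,1)
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=2000, batch_size=32)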
In the following, LB-FCN is compared with state-of-the-art architectures, both in terms of effectiveness and efficiency.

4.1. Effectiveness assessment

The evaluation of the network using CV yielded a mean AUC of 99.72% (Fig. 5). The entire training process of each fold took 2 h, as the network had only 9 million parameters to be trained (while conventional CNN architectures used in this context, such as CaffeNet [35] and VGG-16 [43], have between 60 and 130 million parameters). For completeness, and to further promote reproducibility of the results, we evaluated the classification performance of the proposed architecture on dataset D1B. This resulted in an AUC of 99.82%, which is comparable to that obtained with CV (Fig. 6). For generality, the experiments on dataset D2 were performed using the same LB-FCN configuration as in the experiments on datasets D1 and D1B. The average ROC curve obtained by the evaluation of the model using CV had an AUC of 93.5% (Fig. 7). To compare the performance of LB-FCN for abnormality detection in endoscopic images, we implemented the most relevant state-of-the-art CNN architectures, including Transfer Learning [35], FCN [31], ResNet [38], ResNeXt [39] and Inception-v4 [40]. The results obtained on the same datasets, using 10-fold cross-validation, are summarized in Table 1. The same table provides the results of BoW [19] and WCNN [29], as they were recently reported on the same datasets using the same evaluation procedure. To validate the significance of the results obtained by the proposed architecture in comparison to those obtained with the state-of-the-art architectures and methods, we performed two statistical significance tests: a non-parametric Friedman test and a two-sided Wilcoxon rank sum test [55]; a sketch of both tests is given below.
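This is a hedged sketch, assuming SciPy; auc_per_method is a hypothetical dictionary mapping each method name to its array of per-fold AUC scores:

from scipy.stats import friedmanchisquare, ranksums

# Friedman test across all compared methods (needs three or more samples)
stat_f, p_friedman = friedmanchisquare(*auc_per_method.values())
# two-sided Wilcoxon rank sum test between two specific methods
stat_w, p_wilcoxon = ranksums(auc_per_method['LB-FCN'],
                              auc_per_method['ResNet [38]'])
print(p_friedman < 0.05, p_wilcoxon < 0.05)  # True -> reject at the 5% level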
In both tests the null hypothesis (i.e., that the samples are drawn from identical continuous distributions with equal means and are independent) was rejected (p-value < 0.05), which indicates that there are differences between the examined methods at the 5% significance level.

4.2. Efficiency assessment

The total time-complexity of all convolution layers in a CNN is given by [56]:

$$O\left( \sum_{l=1}^{d} n_{l-1} \cdot s_l^2 \cdot n_l \cdot m_l^2 \right) \quad (5)$$
where l is the index of a convolution layer and d is the total number of convolution layers; n_l is the width (number of filters) of the l-th layer and n_{l−1} is the number of its input channels; s_l and m_l are the spatial sizes of the filter and of the output feature map, respectively. Therefore, for a given dataset of a specific size, the number of computations is proportional to the sum of products described in (5). The same time-complexity applies to both training and testing, although on a different scale, as training involves one forward and two backward passes because of the error back-propagation training algorithm; as a result, the training time per image is roughly three times the testing time per image. The top-ranked architectures of Table 1 are compared in terms of computational complexity in Table 2, which includes the number of Floating Point Operations (FLOPs), the number of convolution layers and the number of trainable free parameters.
Table 2
Comparison of the computational complexity of the top-ranked state-of-the-art architectures of Table 1.

Architecture          FLOPs          Convolution layers    Trainable free parameters
LB-FCN                1.36 × 10^7    32                    8.26 × 10^6
ResNet [38]           4.74 × 10^7    53                    2.35 × 10^7
ResNeXt [39]          2.83 × 10^7    94                    5.63 × 10^6
Inception-v4 [40]     2.05 × 10^8    149                   4.11 × 10^7
Fig. 6. ROC obtained on dataset D1B, using LB-FCN.
Fig. 7. Mean ROC obtained by 10-fold CV on dataset D2, using LB-FCN. The grey area around the curve represents the respective confidence band.
It can be noticed that LB-FCN has a number of trainable parameters comparable to the rest of the architectures (and consequently similar memory requirements). However, the number of FLOPs of LB-FCN is smaller; more specifically, it is approximately 3.5 times smaller than that of ResNet, 2 times smaller than that of ResNeXt, and an order of magnitude smaller than that of Inception-v4. It is worth noting that the LB-FCN architecture also tends to converge faster. For example, the average number of epochs for the convergence of the LB-FCN architecture on dataset D2 was approximately 2000, compared to 2400, 2900 and 3100 epochs for ResNeXt, ResNet and Inception-v4, respectively. Also, for the same dataset, the average training time of the proposed architecture is significantly
smaller compared to the rest of the tested networks, averaging 2 h; ResNeXt required on average 3.4 h, ResNet 4.2 h, and Inception-v4 6 h.
Table 1
Comparative abnormality detection results using 10-fold cross-validation on datasets D1, D2 and D1B, as obtained for an optimum selection of the meta-parameters of the compared networks and methods.

                   LB-FCN                     BoW [19]                   WCNN [29]                  FCN [31]
                   D1      D2      D1B        D1      D2      D1B        D1      D2      D1B        D1      D2      D1B
Accuracy (%)       97.84   88.29   97.42      89.20   76.80   90.56      89.90   77.50   90.90      89.90   80.01   90.98
Sensitivity (%)    98.05   92.11   97.11      91.10   45.40   90.70      90.70   36.20   93.00      90.70   86.22   91.40
Specificity (%)    97.67   76.49   97.67      87.20   88.60   90.38      88.20   91.30   88.50      88.20   60.78   90.38
AUC (%)            99.72   93.50   99.82      94.60   80.20   95.44      96.30   81.40   96.84      96.30   81.62   96.34

                   Transfer Learning [35]     ResNet [38]                ResNeXt [39]               Inception-v4 [40]
                   D1      D2      D1B        D1      D2      D1B        D1      D2      D1B        D1      D2      D1B
Accuracy (%)       97.13   87.60   96.13      96.27   87.22   95.99      95.98   86.93   96.56      95.98   87.92   96.99
Sensitivity (%)    96.66   77.87   95.34      97.94   83.26   95.12      95.63   80.13   96.89      94.60   82.17   96.89
Specificity (%)    97.53   85.97   97.11      96.04   90.42   98.07      96.44   89.13   96.15      97.53   90.71   97.11
AUC (%)            97.19   81.92   96.23      97.19   86.84   97.10      96.03   84.63   96.52      96.16   86.44   97.03

5. Discussion

The results obtained from the application of the LB-FCN architecture on datasets D1, D1B and D2 show that it outperforms the state-of-the-art architectures and methods in terms of AUC. The differences in AUC between LB-FCN and ResNet, the second best performing methodology, are 2.53% on D1 and 2.72% on D1B. On the larger and even more diverse dataset D2, the classification performance of LB-FCN is significantly higher, reaching an AUC difference of 6.66% from the same methodology. We believe that this is due to the multi-scale feature extraction capabilities of the architecture. It is important to note that the methodologies following the hand-crafted feature classification approach [19] perform significantly worse than the CNN-based approaches.

The multi-scale design of LB-FCN is inspired by GoogLeNet [57] and its Inception-v4 variant [40], where features of different abstraction levels are combined to produce a richer representation of the input volume. The LB connections are inspired by the work of Gers and Schmidhuber [58], where the so-called peephole connections were introduced in Long Short-Term Memory (LSTM) networks to provide their gates with the ability to maintain information from previous states of the network. The multi-scale feature extraction approach used in the MCB is similar, but not identical, to that of the Inception module of the GoogLeNet architecture. While both modules aim at the same goal (multi-scale feature extraction), the MCB does not use pooling to downsample the input volume. This enables the LB connection to be aggregated as-is with the output volume of the MCB. Furthermore, the MCB extracts the same number of feature maps, across the entire network, on three different scales (convolution with 2 × 2, 4 × 4 and 8 × 8 filters and stride 1) instead of the two (convolution with 3 × 3 and 5 × 5 filters and stride 2) extracted by the Inception module. This aims to provide a more diverse feature representation of the input volume. In GoogLeNet, the number of feature maps extracted by each Inception module is doubled whenever the input volume is downscaled by two. In the case of the LB-FCN architecture, doubling the number of feature maps of the LB-MCB module after every convolutional pooling did not yield any better results, yet it increased the overall number of free parameters of the network. It is also important to note that, unlike the proposed LB-FCN architecture, GoogLeNet has neither peephole-like connections nor any addition operator (Fig. 1). Also unlike GoogLeNet, LB-FCN does not utilize any fully-connected or dropout layers. Other differences between LB-FCN and GoogLeNet include the use of PReLU activations and batch normalization.

Peephole-like connections, known as 'identity shortcut connections', have been adopted in the Residual Blocks (RBs) of ResNet [38]. The RBs do not have MCB-like parallel layers; instead, each is composed of sequentially arranged convolutional layers. In each RB, the shortcut connection transfers its input volume unaltered for addition with the output of its last convolutional layer. The role of the shortcut connection is similar to that of the LB connection. A variant of the ResNet architecture, called ResNeXt [39], utilizes parallel, but not multi-scale, convolutional layers, as LB-FCN does. In the case of ResNeXt
the outputs of the parallel layers are first summed together, and the resulting feature maps are subsequently aggregated with the output of the shortcut connection using an addition operator. In LB-FCN the outputs of the parallel layers of the MCB are not summed; instead, they are concatenated to form a richer output volume. Similarly to ResNeXt, the output of the MCB is aggregated by addition with the output of the LB connection. This allows the next MCB module to have an aggregated view of the initial input volume together with the MCB output volume, which helps the overall classification performance of the network. A notable difference between ResNeXt and LB-FCN is not only that the MCB extracts feature maps with different filter sizes, but also that after each convolution operation the output is passed to a 1 × 1 convolution operation with batch normalization of the output. We found that the addition of this step increased the classification performance of the network by approximately 1.3% and reduced the training time by 5.6%.

Similarly to our recent WCNN architecture [29], and unlike most relevant abnormality detection methodologies [9], LB-FCN aims at the detection (not the identification) of various types of abnormalities, rather than just a single pathologic condition. From the results we observe a significant advantage of LB-FCN over WCNN; the differences in AUC are 12.1%, 3.42% and 2.98% on the D2, D1 and D1B datasets, respectively. We believe that this advantage is due to the introduction of multi-scale feature extraction combined with look-behind connections and the overall deeper architecture. The results of the ResNet, ResNeXt and Inception-v4 architectures confirm that the depth of the network plays an essential role in the overall classification performance. This can be observed on the more complex and diverse datasets, such as D2, on which all three architectures outperform the shallower WCNN architecture.

The comparative study performed in Section 4 shows that the hand-crafted feature extraction approach used in [19] results in a significantly lower classification performance compared to the LB-FCN architecture; the AUC differences observed are 13.3% and 5.12% on the D2 and D1 datasets, respectively. It also shows that LB-FCN outperforms WCNN [29], and consequently other state-of-the-art CNN-based methodologies, such as [24]. It is also worth noting that the D1B dataset was introduced in [49], in which the DSSVM approach was presented, achieving an AUC of 89.83%; the AUC difference between DSSVM and LB-FCN on D1B is 9.99%.

6. Conclusions

We presented LB-FCN, a novel CNN architecture addressing the problem of computer-aided GI tract image classification. To the best of our knowledge, none of the existing deep neural network architectures combine, in the same way, multi-scale feature extraction
along with LB connections. Overall, the conclusions that can be derived about LB-FCN can be summarized as follows:
• It has a simplified design following the FCN [31] architectural approach;
• Its relatively low number of free parameters, along with its multi-scale feature extraction capability, enables efficient training with either smaller or larger datasets;
• It outperforms state-of-the-art architectures and methods in the detection of different types of abnormalities in images obtained from different endoscopic modalities, including gastroscopy and WCE.
Topics such as coping with the large number of free parameters and overfitting are still open in deep learning applications. To address these issues, we plan to perform systematic experiments on architectural variations of LB-FCN, while applying it to larger and even more diverse datasets of human GI tract images. Furthermore, our future research directions include extending the proposed architecture to enable localization of abnormalities through supervised learning using weakly labeled images, as well as the identification of the different types and subtypes of abnormalities.
References

[1] The Cleveland Clinic Foundation, Digestive Disorders & Gastrointestinal Diseases | Cleveland Clinic [WWW Document], 2017 (accessed 11.6.17), https://my.clevelandclinic.org/health/articles/gastrointestinal-disorders.
[2] A.D. Muller, A. Sonnenberg, Prevention of colorectal cancer by flexible endoscopy and polypectomy: a case-control study of 32 702 veterans, Ann. Intern. Med. 123 (1995) 904–910.
[3] K. Yao, Zoom Gastroscopy: Magnifying Endoscopy in the Stomach, Springer Science & Business Media, 2013.
[4] K. Gono, T. Obi, M. Yamaguchi, N. Ohyama, H. Machida, Y. Sano, S. Yoshida, Y. Hamamoto, T. Endo, Appearance of enhanced tissue features in narrow-band endoscopic imaging, J. Biomed. Opt. 9 (2004) 568–577.
[5] R. Kiesslich, M. Goetz, M. Vieth, P.R. Galle, M.F. Neurath, Confocal laser endomicroscopy, Gastrointest. Endosc. Clin. N. Am. 15 (2005) 715–731.
[6] P. Swain, The future of wireless capsule endoscopy, World J. Gastroenterol. 14 (2008) 4142.
[7] S.A. Karkanis, D.K. Iakovidis, D.E. Maroulis, D.A. Karras, M. Tzivras, Computer-aided tumor detection in endoscopic video using color wavelet features, IEEE Trans. Inf. Technol. Biomed. 7 (2003) 141–152.
[8] M. Liedlgruber, A. Uhl, Computer-aided decision support systems for endoscopy in the gastrointestinal tract: a review, IEEE Rev. Biomed. Eng. 4 (2011) 73–88.
[9] D.K. Iakovidis, A. Koulaouzidis, Software for enhanced video capsule endoscopy: challenges for essential progress, Nat. Rev. Gastroenterol. Hepatol. 12 (2015) 172–186.
[10] Y. Cong, S. Wang, J. Liu, J. Cao, Y. Yang, J. Luo, Deep sparse feature selection for computer aided endoscopy diagnosis, Pattern Recognit. 48 (2015) 907–917, http://dx.doi.org/10.1016/j.patcog.2014.09.010.
[11] M. Häfner, T. Tamaki, S. Tanaka, A. Uhl, G. Wimmer, S. Yoshida, Local fractal dimension based approaches for colonic polyp classification, Med. Image Anal. 26 (2015) 92–107, http://dx.doi.org/10.1016/j.media.2015.08.007.
[12] S. Hegenbart, A. Uhl, A. Vécsei, G. Wimmer, Scale invariant texture descriptors for classifying celiac disease, Med. Image Anal. 17 (2013) 458–474, http://dx.doi.org/10.1016/j.media.2013.02.001.
[13] D.K. Iakovidis, D. Chatzis, P. Chrysanthopoulos, A. Koulaouzidis, Blood detection in wireless capsule endoscope images based on salient superpixels, in: Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2015, pp. 731–734, http://dx.doi.org/10.1109/EMBC.2015.7318466.
[14] A. Koulaouzidis, D.K. Iakovidis, D.E. Yung, E. Rondonotti, U. Kopylov, J.N. Plevris, E. Toth, A. Eliakim, G.W. Johansson, W. Marlicz, et al., KID Project: an internet-based digital video atlas of capsule endoscopy for research purposes, Endosc. Int. Open 5 (2017) E477–E483.
[15] D.Y. Liu, T. Gan, N.N. Rao, Y.W. Xing, J. Zheng, S. Li, C.S. Luo, Z.J. Zhou, Y.L. Wan, Identification of lesion images from gastrointestinal endoscope based on feature extraction of combinational methods with and without learning process, Med. Image Anal. 32 (2016) 281–294, http://dx.doi.org/10.1016/j.media.2016.04.007.
[16] A.V. Mamonov, I.N. Figueiredo, P.N. Figueiredo, Y.-H.R. Tsai, Automated polyp detection in colon capsule endoscopy, IEEE Trans. Med. Imaging 33 (2014) 1488–1502, http://dx.doi.org/10.1109/TMI.2014.2314959.
[17] F. van der Sommen, S. Zinger, W.L. Curvers, R. Bisschops, O. Pech, B.L.A.M. Weusten, J.J.G.H.M. Bergman, P.H.N. de With, E.J. Schoon, Computer-aided detection of early neoplastic lesions in Barrett's esophagus, Endoscopy 48 (2016) 617–624, http://dx.doi.org/10.1055/s-0042-105284.
[18] Y. Cong, S. Wang, B. Fan, Y. Yang, H. Yu, UDSFS: unsupervised deep sparse feature selection, Neurocomputing 196 (2016) 150–158.
[19] M. Vasilakakis, D.K. Iakovidis, E. Spyrou, A. Koulaouzidis, Weakly-supervised lesion detection in video capsule endoscopy based on a bag-of-colour features model, in: Lecture Notes in Computer Science, 2017, pp. 96–103, http://dx.doi.org/10.1007/978-3-319-54057-3_9.
[20] N. Tajbakhsh, S.R. Gurudu, J. Liang, Automatic polyp detection in colonoscopy videos using an ensemble of convolutional neural networks, in: 2015 IEEE 12th International Symposium on Biomedical Imaging (ISBI), 2015, pp. 79–83, http://dx.doi.org/10.1109/ISBI.2015.7163821.
[21] E. Ribeiro, A. Uhl, M. Häfner, Colonic polyp classification with convolutional neural networks, in: Proceedings of the IEEE Symposium on Computer-Based Medical Systems, 2016, pp. 253–258, http://dx.doi.org/10.1109/CBMS.2016.39.
[22] J.-Y. He, X. Wu, Y.-G. Jiang, Q. Peng, R. Jain, Hookworm detection in wireless capsule endoscopy images with deep learning, IEEE Trans. Image Process. 27 (2018) 2379–2392.
[23] G. Wimmer, A. Vécsei, A. Uhl, Convolutional neural network architectures for the automated diagnosis of celiac disease, in: Proceedings of the 3rd International Workshop on Computer-Assisted and Robotic Endoscopy (CARE), Springer LNCS, 2016.
[24] A.K. Sekuboyina, S.T. Devarakonda, C.S. Seelamantula, A convolutional neural network approach for abnormality detection in wireless capsule endoscopy, in: 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI), 2017, pp. 1057–1060, http://dx.doi.org/10.1109/ISBI.2017.7950698.
[25] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (1998) 2278–2323, http://dx.doi.org/10.1109/5.726791.
[26] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, T. Chen, Recent advances in convolutional neural networks, Pattern Recognit. (2017), http://dx.doi.org/10.1016/j.patcog.2017.10.013.
[27] S. Wang, Y. Cong, H. Fan, Y. Yang, Y. Tang, H. Zhao, Computer aided endoscope diagnosis via weakly labeled data mining, in: 2015 IEEE International Conference on Image Processing (ICIP), 2015, pp. 3072–3076.
[28] S. Wang, Y. Cong, H. Fan, L. Liu, X. Li, Y. Yang, Y. Tang, H. Zhao, H. Yu, Computer-aided endoscopic diagnosis without human-specific labeling, IEEE Trans. Biomed. Eng. 63 (2016) 2347–2357, http://dx.doi.org/10.1109/TBME.2016.2530141.
[29] D.K. Iakovidis, S.V. Georgakopoulos, M. Vasilakakis, A. Koulaouzidis, V.P. Plagianakos, Detecting and locating gastrointestinal anomalies using deep learning and iterative cluster unification, IEEE Trans. Med. Imaging (2018).
[30] X. Jia, M.Q.-H. Meng, A study on automated segmentation of blood regions in wireless capsule endoscopy images using fully convolutional networks, in: 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI), 2017, pp. 179–182.
[31] J.T. Springenberg, A. Dosovitskiy, T. Brox, M. Riedmiller, Striving for simplicity: the all convolutional net, arXiv preprint arXiv:1412.6806, 2014.
[32] S. Seguí, M. Drozdzal, G. Pascual, P. Radeva, C. Malagelada, F. Azpiroz, J. Vitrià, Generic feature learning for wireless capsule endoscopy analysis, Comput. Biol. Med. 79 (2016) 163–172, http://dx.doi.org/10.1016/j.compbiomed.2016.10.011.
[33] H. Chen, X. Wu, G. Tao, Q. Peng, Automatic content understanding with cascaded spatial–temporal deep framework for capsule endoscopy videos, Neurocomputing 229 (2017) 77–87, http://dx.doi.org/10.1016/j.neucom.2016.06.077.
[34] X. Jia, M.Q.-H. Meng, A deep convolutional neural network for bleeding detection in wireless capsule endoscopy images, in: 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2016, pp. 639–642, http://dx.doi.org/10.1109/EMBC.2016.7590783.
[35] R. Zhang, Y. Zheng, T.W.C. Mak, R. Yu, S.H. Wong, J.Y.W. Lau, C.C.Y. Poon, Automatic detection and classification of colorectal polyps by transferring low-level CNN features from nonmedical domain, IEEE J. Biomed. Health Inf. 21 (2017) 41–47, http://dx.doi.org/10.1109/JBHI.2016.2635662.
[36] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: convolutional architecture for fast feature embedding, 2014, http://dx.doi.org/10.1145/2647868.2654889.
[37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, L. Fei-Fei, ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (2015) 211–252, http://dx.doi.org/10.1007/s11263-015-0816-y.
[38] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[39] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep neural networks, in: Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5987–5995, http://dx.doi.org/10.1109/CVPR.2017.634.
[40] C. Szegedy, S. Ioffe, V. Vanhoucke, A.A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in: AAAI, 2017.
[41] K.L. Du, M.N.S. Swamy, Neural Networks in a Softcomputing Framework, Springer-Verlag, London, 2006, http://dx.doi.org/10.1007/1-84628-303-5.
[42] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[43] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015, pp. 1–14.
[44] A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images, Technical Report, Univ. Toronto, 2009.
[45] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (2014) 1929–1958.
[46] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034, http://dx.doi.org/10.1109/ICCV.2015.123.
[47] N. Navab, J. Hornegger, W.M. Wells, A.F. Frangi (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2015, https://doi.org/10.1007/978-3-319-24574-4.
[48] EndoVisSub – Abnormal [WWW Document], 2015 (accessed 10.1.17), https://endovissub-abnormal.grand-challenge.org/.
[49] Y. Cong, S. Wang, J. Liu, J. Cao, Y. Yang, J. Luo, Deep sparse feature selection for computer aided endoscopy diagnosis, Pattern Recognit. 48 (2015) 907–917, http://dx.doi.org/10.1016/j.patcog.2014.09.010.
[50] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett. 27 (2006) 861–874.
[51] G.E. Hinton, N. Srivastava, K. Swersky, Lecture 6a: overview of mini-batch gradient descent, COURSERA: Neural Networks for Machine Learning, 2012.
[52] F. Chollet, Keras [WWW Document], 2015, https://github.com/fchollet/keras.
[53] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: large-scale machine learning on heterogeneous distributed systems, 2016.
[54] J. Nickolls, I. Buck, M. Garland, K. Skadron, Scalable parallel programming with CUDA, ACM Queue 6 (2008) 40–53, http://dx.doi.org/10.1145/1365490.1365500.
[55] F. Wilcoxon, Individual comparisons by ranking methods, Biometrics Bull. 1 (1945) 80–83.
[56] K. He, J. Sun, Convolutional neural networks at constrained time cost, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5353–5360, http://dx.doi.org/10.1109/CVPR.2015.7299173.
[57] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, 2014, pp. 1–9, http://dx.doi.org/10.1109/CVPR.2015.7298594.
[58] F.A. Gers, J. Schmidhuber, Recurrent nets that time and count, in: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), vol. 3, 2000, pp. 189–194, http://dx.doi.org/10.1109/IJCNN.2000.861302.
[59] D.K. Iakovidis, A. Koulaouzidis, Automatic lesion detection in wireless capsule endoscopy—a simple solution for a complex problem, in: 2014 IEEE International Conference on Image Processing (ICIP), 2014, pp. 2236–2240.