Unsupervised Neural Network for Homography Estimation in Capsule Endoscopy Frames

Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 164 (2019) 602–609

www.elsevier.com/locate/procedia

CENTERIS - International Conference on ENTERprise Information Systems / ProjMAN - International Conference on Project MANagement / HCist - International Conference on Health and Social Care Information Systems and Technologies

Unsupervised Neural Network for Homography Estimation in Capsule Endoscopy Frames

Sara Gomes 1,2,*, Maria Teresa Valério 1,2, Marta Salgado 5, Hélder P. Oliveira 1,4, António Cunha 1,3

1 INESC TEC, Porto, Portugal
2 FEUP – Faculdade de Engenharia da Universidade do Porto, Portugal
3 UTAD - Universidade de Trás-os-Montes e Alto Douro, Vila Real, Portugal
4 FCUP – Faculdade de Ciências da Universidade do Porto, Portugal
5 Centro Hospitalar do Porto, Portugal

Abstract

Capsule endoscopy is becoming the major medical technique for the examination of the gastrointestinal tract and the detection of small bowel lesions. With the growth of endoscopic capsules and the lack of an appropriate tracking system to allow the localisation of lesions, the need to develop software-based techniques for the localisation of the capsule at any given frame is also increasing. With this in mind, and knowing the lack of availability of labelled endoscopic datasets, this work aims to develop an unsupervised method for homography estimation in video capsule endoscopy frames, to later be applied in capsule localisation systems. The pipeline is based on an unsupervised convolutional neural network, with a VGG Net architecture, that estimates the homography between two images. The overall error, using a synthetic dataset, was evaluated through the mean average corner error, which was 34 pixels, showing great promise for the real-life application of this technique, although there is still room for the improvement of its performance.

© 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the CENTERIS - International Conference on ENTERprise Information Systems / ProjMAN - International Conference on Project MANagement / HCist - International Conference on Health and Social Care Information Systems and Technologies.

* Corresponding author. Tel.: +0-000-000-0000; fax: +0-000-000-0000. E-mail address: [email protected]

1877-0509
doi: 10.1016/j.procs.2019.12.226


Keywords: endoscopic capsule, homography estimation, unsupervised learning

1. Introduction

Capsule endoscopy is a medical technique, introduced in 2000 [8], used to examine the gastrointestinal tract (GIT) for the detection of pathologies and lesions in the small bowel. These capsules are vitamin-sized, and capture the interior of the intestinal tract through a camera and small light installed within [16]. The capsules provide an 8 to 10 hour video, which is transmitted through a recorder (radio-frequency transmitter) and can later be analysed by a physician. This analysis can be tedious and prone to mistakes, due to the high amount of data and its poor quality [6].

When performing capsule endoscopy, one key aspect is the ability to know where the capsule was when it recorded lesions or abnormalities, in order to understand the location of the lesions found and ensure the efficacy of further interventions [16]. The physician may perform this localisation visually, but due to the nature of the organ this task becomes very difficult, since few landmarks can be used for guidance. Additionally, the images are often blurred and partially, or even completely, obstructed by gastric content. The capsule does not have a constant velocity or direction of movement, further hindering its localisation. Although there are some existing solutions to track the position of the capsule, most of them require additional hardware, both within the capsule itself (reducing battery time and increasing the size of the capsule) and externally attached to the patient, making them impractical and uncomfortable [6,16].

The flaws of the existing methods encouraged the development of software-based methods, focused on image analysis, to perform capsule localisation. Some of these methods perform feature detection to identify the section of the GIT to which the frame belongs [6].
Others, after the detection, perform feature matching between consecutive frames, allowing the computation of the displacement and rotation of the capsule between the two [7]. However, the quality of the images makes it very difficult to detect distinct features, and to match common features between images. These techniques present advantages when compared to the physical methods, but they are still unable to reach a satisfying degree of accuracy [6,7]. Additionally, some works present deep learning techniques for the visual odometry of the capsule, dispensing with feature-based methods, as seen, for example, in Turan et al. [17]. The use of these techniques seems to improve upon the earlier works, showing great potential for future exploration. Deep learning has also been used for other applications that require the computation of the homography between two images [10], providing techniques that could be useful for this particular application. However, these systems require a large amount of labelled endoscopic images for training and testing, which do not exist, forcing the use of synthetic data.

In this paper, we propose an unsupervised deep learning technique to estimate the homography between consecutive video capsule endoscopy (VCE) frames in the small intestine. This work should pave the way for the creation of novel capsule localisation systems, without the need for labelled datasets. The present document is divided into the analysis of related work on homography estimation and capsule localisation (Section 2), followed by the exploration of the developed pipeline (Section 3) and the results obtained (Section 4).

2. Related Work

2.1. Deep Learning in Homography Estimation

More recently, deep learning has been applied successfully to motion estimation, with several approaches being used to perform homography, depth and ego-motion estimation [3,20].
These less classical methods do not need to follow the traditional pipeline, having no need to establish frame-to-frame feature correspondences. However, there are other aspects of deep learning systems that need to be considered.

2.1.1. 4-point Homography Parameterisation

The traditional homography matrix has some problems that make it unsuitable for use in deep learning approaches. It mixes rotational and translational terms in a single matrix, which hinders the balance between the terms.


Thus, a 4-point parameterisation was proposed in DeTone et al. [3], based solely on corner locations. Let Δu_n = u'_n − u_n and Δv_n = v'_n − v_n for each point correspondence n, where (u_n, v_n) and (u'_n, v'_n) are the point coordinates in each image. The 4-point parameterisation is then:

    H_4point = ( Δu_1  Δv_1
                 Δu_2  Δv_2
                 Δu_3  Δv_3
                 Δu_4  Δv_4 )                                  (1)
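As a concrete sketch of the relation between the two representations (function and variable names are illustrative, not from the paper), the 4-point offsets can be converted back to a 3×3 homography by solving the DLT system for the four corner correspondences, here with a plain NumPy SVD:

```python
import numpy as np

def four_point_to_matrix(corners_a, corners_b):
    """Recover the 3x3 homography mapping corners_a onto corners_b
    via the Direct Linear Transform (DLT): solve A h = 0 by SVD."""
    rows = []
    for (u, v), (up, vp) in zip(corners_a, corners_b):
        rows.append([-u, -v, -1, 0, 0, 0, u * up, v * up, up])
        rows.append([0, 0, 0, -u, -v, -1, u * vp, v * vp, vp])
    # The null-space vector of A is the last right-singular vector.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]          # normalise so that H[2, 2] = 1

# Four corners of a 128x128 patch and their (Δu_n, Δv_n) perturbations.
corners_a = np.array([[0, 0], [128, 0], [128, 128], [0, 128]], dtype=float)
h4pt = np.array([[3, -2], [1, 4], [-5, 2], [2, -3]], dtype=float)
corners_b = corners_a + h4pt
H = four_point_to_matrix(corners_a, corners_b)
```

Applying `H` to any of the four original corners (in homogeneous coordinates) reproduces the perturbed corner, which is exactly the mapping Eq. (1) encodes.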

The 4-point homography can be converted to the traditional homography using a Direct Linear Transform (DLT) [2,4].

2.1.2. Unsupervised Learning in Homography Estimation

Some work has been done regarding the use of unsupervised deep learning techniques for the estimation of the homography between images. These systems have the advantage of not needing labelled data to perform the estimation. In the particular case of capsule endoscopy, labelled data is generally not available, since manufacturers of the capsules do not provide that information, making it a suitable application for these techniques. One major work applied unsupervised learning for homography estimation in aerial images [10], by adapting HomographyNet [3] into an unsupervised network. The authors consider it an unsupervised approach, since it does not require labelled images for training, instead using a pixel-wise photometric loss: given an image pair, A and B, the photometric loss compares the original image B with image A warped according to the estimated homography. However, this loss function can only be applied if the operations remain differentiable, in order to allow back-propagation during the training of the network. With this in mind, two layers are added to the original architecture: a Tensor Direct Linear Transform (TDLT), which computes a mapping from the 4-point homography to the traditional homography matrix, and a spatial transformation layer, which warps the original image A according to the estimated 3×3 homography [10].

2.2. Deep Learning in Capsule Localisation

An explored application of MVS methods has been the localisation of endoscopic capsules within the small bowel. Some classic techniques have been used for this end [7,14], but the use of deep learning techniques is growing in this field [17,18]. The estimation of homography was implemented as part of a capsule localisation system in Pinheiro et al. [12], with the use of a MobileNet [5] inspired convolutional neural network (CNN).
The network takes as input two grayscale images, and provides a vector of size 8, corresponding to the 4-point homography between the two images. Since a supervised approach is taken, the images used to train the network are synthetic, and are obtained through a data generation technique previously described in DeTone et al. [3]. This system is able to obtain a Mean Average Corner Error of 2 pixels in 320×320 synthetic VCE images. The tests on real images also prove successful, although the validation is purely visual, given the absence of ground truth labels. The evaluation is done by warping the first image according to the estimated homography, and comparing the result to the second image provided to the network.

CNNs have also been applied for the estimation of a 6 degree-of-freedom localisation of endoscopic capsules within the GIT [17]. The framework developed takes as input a single RGB endoscopic frame and uses a CNN architecture inspired by GoogLeNet [15] to regress the position of the capsule, which includes a translation vector and a rotation vector. The work included the use of transfer learning [11], to allow training with small quantities of endoscopic data, although all data still needed to be properly labelled. Thus, the network was initialised with the weights computed for ImageNet [19], and fine-tuned with an artificial endoscopy dataset. The labelling of the endoscopic data was made through the use of motion tracking hardware. The best results obtained consist of an average 3.44% error on the rotation estimation, and 7.1% on the translation vector estimation.

Additionally, some unsupervised work has been done in this area, namely the regression of depth and pose estimation using jointly trained CNNs [18]. The networks use view synthesis as supervision: a target view can be
synthesised given the per-pixel depth of the image and the pose in a nearby view. Thus, there is no need for labelled data to train the network. The system is composed of two CNNs that are trained jointly. The first network is used for depth estimation, using an encoder-decoder strategy. This network is based on a DispNet architecture [9], and takes as input a single image. The second network estimates the relative pose of the capsule in a given frame, and a reliability mask, based on a target view and a set of source views (based on which the view synthesis is performed). Although the networks are trained jointly, they can be tested and evaluated independently. Once again, transfer learning is applied, this time using the weights available in Zhou et al. [20], obtained using the KITTI dataset. The results of the pose estimation are comparable to the ones seen using supervised approaches [17], although they remain slightly worse.

2.3. Summary

The analysis of the available literature allows us to conclude that the use of unsupervised approaches for both homography estimation and capsule localisation systems has the potential to achieve results on the same level as their supervised counterparts. Additionally, these methods forego the need to obtain large amounts of labelled data for training. With this in mind, an unsupervised approach for homography estimation in endoscopic capsule frames was developed.

3. Methods

The network used is based on the HomographyNet architecture [3], modified to allow homography estimation with an unlabelled dataset, as described in Nguyen et al. [10]. The base network (similar to a VGG Net [13]) is composed of a series of 8 convolutional layers, with batch normalisation and max pooling layers (2×2, stride 2) after every two convolutions. The first four convolutional layers have 64 filters per layer, while the last four have 128. After the convolutional stage, two fully connected layers are applied, with 1024 and 8 units. Finally, two dropout layers (p=0.5) are included after the convolutional stage and after the first fully connected layer, to prevent overfitting. The input of the network is a pair of grayscale images (A and B), and the final fully connected layer provides 8 values, corresponding to the 4-point homography between the two images. To allow the use of unlabelled data for training, and inspired by Nguyen et al. [10], two final layers are added. The first, a Tensor Direct Linear Transform (TDLT), converts the 4-point homography into a classical homography matrix representation. The second applies a spatial transformation to image A, according to the estimated homography, in order to obtain an approximation of image B. The pipeline is represented in Fig. 1.
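The base regressor described above can be sketched roughly as follows. This is a PyTorch approximation with illustrative names; the paper does not publish code, so details such as activation placement and the 128×128 patch size are assumptions:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Two 3x3 convolutions with batch normalisation, then 2x2 max pooling (stride 2).
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
        nn.MaxPool2d(2, 2),
    )

class HomographyRegressor(nn.Module):
    """VGG-style regressor: two stacked grayscale frames in, 4-point homography out."""

    def __init__(self, patch=128):
        super().__init__()
        # 8 conv layers: first four with 64 filters, last four with 128.
        self.features = nn.Sequential(
            conv_block(2, 64), conv_block(64, 64),
            conv_block(64, 128), conv_block(128, 128),
        )
        side = patch // 16  # four 2x2 poolings halve the spatial side each time
        self.head = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(128 * side * side, 1024), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(1024, 8),  # 8 values = (Δu_n, Δv_n) for the four corners
        )

    def forward(self, a, b):
        # Images A and B are stacked as a 2-channel input.
        x = self.features(torch.cat([a, b], dim=1))
        return self.head(x.flatten(1))
```

In the full pipeline the 8-value output would then pass through the TDLT and spatial-transformation layers before the loss is computed.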

Fig. 1: Diagram representing the pipeline applied.


Using the warped image A, estimated by the network, a pixel-wise photometric loss can be computed [10]. This loss compares the warped image to the original image B, according to the following equation:

    L_PW = (1 / |{x_i}|) Σ_i | I^A(H(x_i)) − I^B(x_i) |        (2)
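An illustrative NumPy sketch of this pixel-wise loss follows (names are assumed; nearest-neighbour sampling is used for brevity, whereas the actual network needs a differentiable bilinear sampler):

```python
import numpy as np

def photometric_loss(img_a, img_b, H):
    """Mean absolute difference between image B and image A sampled
    at the homography-mapped coordinates H(x_i)."""
    h, w = img_b.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Homogeneous coordinates x_i of every pixel in image B.
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    m = H @ pts
    mx = np.rint(m[0] / m[2]).astype(int)   # nearest-neighbour sample positions
    my = np.rint(m[1] / m[2]).astype(int)
    inside = (mx >= 0) & (mx < w) & (my >= 0) & (my < h)
    warped = np.zeros(h * w)
    warped[inside] = img_a[my[inside], mx[inside]]  # out-of-bounds left at zero
    return np.abs(warped - img_b.ravel()).mean()

rng = np.random.default_rng(0)
frame = rng.random((32, 32))
loss_identity = photometric_loss(frame, frame, np.eye(3))  # perfect alignment
```

With an identity homography and identical images the loss is zero; any misalignment of a textured image raises it, which is what drives the training.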

In Eq. (2), I^A(H(x_i)) is image A warped according to the estimated homography H, I^B(x_i) is the original image B, and x_i are the homogeneous coordinates of each pixel in the images. The network can then be trained with any dataset, independently of any ground truth information concerning the homography between images A and B.

4. Results and Discussion

4.1. Dataset

The dataset used included 30 VCE videos, all recorded with the PillCam SB3. The videos were converted into series of frames, and cut to include only the sections corresponding to the small bowel. The separation was based on medical annotations. This resulted in 30 processed videos, for a total of 340,325 frames. Examples of the small bowel frames can be seen in Fig. 2.

Fig. 2: Private dataset image examples, both of good (left) and bad (right) quality.

The PillCam SB3 has a frame rate of 2 to 6 fps, varying according to the speed of the camera. It has a length of 26.2 mm and a diameter of 11.4 mm, weighing 3 g. It also has 4 white light-emitting diodes on each side, to illuminate the GI tract as it goes through it. The operating time is 8 hours or more [1].

4.2. Data Generation

One key aspect in deep learning approaches is the amount of data available. To allow the generation of an unlimited number of training examples, along with ground truth labels for each one, a data generation methodology, described in DeTone et al. [3], was applied to the grayscale images of the original dataset. The method begins by cropping a random patch, at position p, from the image at hand (Patch A). Each corner of Patch A is then perturbed, within a predefined range [-ρ,ρ], originating a 4-point homography. The maximum distortion applied was ρ=32. The inverse of this homography can then be applied to the original image, generating the warped image, from which a patch is again cropped at position p (Patch B). Patches A and B can then be fed into the network, along with the computed homography, which serves as ground truth. Any number of patches and transformations can be applied to a single original image, creating an artificial dataset of unlimited size. In this case, each image was used only once, giving a single pair of training/validation examples per available video frame. Examples of the resulting images can be found in Fig. 3.
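A rough NumPy sketch of this generation procedure is shown below. Names are illustrative, a smaller patch and ρ than the paper's are used, and the warp uses nearest-neighbour sampling for brevity:

```python
import numpy as np

def _dlt(src, dst):
    # 3x3 homography mapping src corners to dst corners (Direct Linear Transform).
    rows = []
    for (u, v), (up, vp) in zip(src, dst):
        rows.append([-u, -v, -1, 0, 0, 0, u * up, v * up, up])
        rows.append([0, 0, 0, -u, -v, -1, u * vp, v * vp, vp])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def generate_pair(image, rng, patch=64, rho=16):
    """DeTone-style pair: crop Patch A at a random position p, perturb its
    corners by up to ±rho (the ground-truth 4-point homography), warp the
    image by the inverse homography, and crop Patch B at the same p."""
    h, w = image.shape
    x = int(rng.integers(rho, w - patch - rho))
    y = int(rng.integers(rho, h - patch - rho))
    corners = np.array([[x, y], [x + patch, y],
                        [x + patch, y + patch], [x, y + patch]], dtype=float)
    h4pt = rng.integers(-rho, rho + 1, size=(4, 2)).astype(float)
    H = _dlt(corners, corners + h4pt)
    patch_a = image[y:y + patch, x:x + patch]
    # Warping the image by H^{-1} means its value at pixel q is image(H(q)),
    # so Patch B can be sampled directly through H.
    ys, xs = np.mgrid[y:y + patch, x:x + patch]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(patch * patch)])
    m = H @ pts
    mx = np.clip(np.rint(m[0] / m[2]).astype(int), 0, w - 1)
    my = np.clip(np.rint(m[1] / m[2]).astype(int), 0, h - 1)
    patch_b = image[my, mx].reshape(patch, patch)
    return patch_a, patch_b, h4pt

rng = np.random.default_rng(0)
frame = rng.random((160, 160))
pa, pb, gt = generate_pair(frame, rng)
```

Each call yields one training pair plus its ground-truth 4-point homography, so an arbitrary number of labelled examples can be produced from unlabelled frames.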


4.3. Homography Estimation on Synthetic Dataset

The synthetic dataset was divided into train and test sets, with approximately two thirds of the set (224,319 image pairs) being used for training and one third (116,006 image pairs) for testing. The train and test sets do not share frames from the same VCE, in order to prevent overfitting to specific patients. The training process was performed for 3,000 epochs, with a batch size of 64 and a learning rate of 0.001. It used an Adam optimiser, with β1=0.9, β2=0.999, and ε=10⁻⁸. The network provided a 4-point homography estimation for each image pair provided. To evaluate the results obtained, the Mean Average Corner Error (MACE) was used. This metric computes the difference between each corner of the real 4-point homography and the estimated values. The average across all four corners provides the MACE for each image pair, and the mean across the entire dataset is the final considered value.
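The metric can be sketched as follows (function name assumed):

```python
import numpy as np

def mace(h4pt_true, h4pt_pred):
    """Mean Average Corner Error: Euclidean distance between the true and
    predicted corner displacements, averaged over the four corners."""
    per_corner = np.linalg.norm(h4pt_true - h4pt_pred, axis=-1)
    return per_corner.mean(axis=-1)

true = np.array([[3.0, -2.0], [1.0, 4.0], [-5.0, 2.0], [2.0, -3.0]])
pred = np.array([[0.0, -2.0], [1.0, 0.0], [-5.0, 2.0], [2.0, -3.0]])
err = mace(true, pred)  # corner errors 3, 4, 0, 0 -> MACE = 1.75
```

Averaging `mace` over every test pair yields the dataset-level value reported in the paper.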

Fig. 3: Results of data generation and homography estimation for two example frames. On top, the best result with MACE=4: (left) original image with selected patch, (right) warped image with selected patch, in green, and predicted patch, in red. On the bottom, the worst result with MACE=485: (left) original image with selected patch, (right) warped image with selected patch, in green, the predicted patch does not appear.

The results can be seen in Fig. 4, along with the performance of other approaches. Pinheiro et al. [12] (HomographyNet Endo) uses the same kind of images (capsule endoscopy) as the current implementation, but applies a supervised approach. On the other hand, Nguyen et al. [10] (HomographyNet Aerial) uses a similar implementation, applied to a dataset of aerial images. As expected, the supervised approach performs better, as does the unsupervised approach on a different kind of images. When further analysing the results, it is clear that the worst results correspond to the largest distortions (see examples in Fig. 3), indicating that although the network copes well with small distortions, it still has trouble with highly warped images. The quality discrepancy between the results for endoscopic images and those seen for aerial images [10] is due to the characteristics of the images themselves. Endoscopic images have fewer texture and colour variations than aerial images, meaning that the photometric loss is easily optimised even when the homography is not being correctly estimated. In summary, the network has more difficulty learning the features present in the endoscopic images, due to their homogeneity. This issue may be mitigated by applying image pre-processing methods that enhance the distinctive features of the images, such as red lesions, the texture of the tissue, and the smaller colour variations that may be present in the walls of the GIT. Although there is still room for improvement, the results show that homography estimation in endoscopic capsule frames can be performed with unsupervised techniques, which opens the door to a new set of possible techniques for capsule localisation systems. To further understand this possibility, the validation of the network using real images is of paramount importance.
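The paper suggests pre-processing to enhance distinctive features but does not specify a method; one plausible, minimal option is global histogram equalisation, sketched here in NumPy with illustrative names:

```python
import numpy as np

def hist_equalize(img):
    """Global histogram equalisation: spread the intensity distribution of a
    uint8 image across the full [0, 255] range to enhance contrast."""
    hist, _ = np.histogram(img.ravel(), bins=256, range=(0, 256))
    cdf = hist.cumsum().astype(float)
    cdf = (cdf - cdf.min()) / max(cdf.max() - cdf.min(), 1.0) * 255.0
    lut = cdf.astype(np.uint8)   # lookup table mapping old to new intensities
    return lut[img]

# Low-contrast dummy frame: intensities squeezed into [100, 120].
low = (np.arange(64).reshape(8, 8) % 21 + 100).astype(np.uint8)
enhanced = hist_equalize(low)
```

A local, tile-based variant (e.g. CLAHE) would likely suit endoscopic frames better, since the relevant texture variations are local; this global version only illustrates the idea.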


4.4. Homography Estimation on Real Data

The final testing of the network was performed on pairs of consecutive images from VCE. Here, instead of providing image patches to the network, the full grayscale consecutive images were used as input: images A and B are fed into the network, which estimates the homography and warps image A accordingly. Since no ground truth labels are available in this case, the evaluation is done by visually comparing the original images to the warped images, where the goal is for the warped version of image A to be as close as possible to image B. In Fig. 5, some examples of the estimation are shown. Once again, it is very clear that the network tends to underestimate the homography between images. Although the transformations are performed in the right direction, generally they should be more pronounced (see example 1, Fig. 5). It is clear that the network recognises that from one image to the next there should be an approximation (zoom in), since the capsule moved forward.

Fig. 4: MACE of the 4-point homography obtained through the current implementation (Unsupervised Endo), compared to a supervised method, HomographyNet on an endoscopy dataset (HomographyNet Endo) [12], and an unsupervised method on a different kind of image (HomographyNet Aerial) [10].

The most obvious problem when performing tests with real data is the handling of image pairs that do not overlap, as seen in example 2 of Fig. 5, where the network cannot distinguish common features in the two images, and thus cannot perform the correct warping. Another important factor to take into account in this kind of test is that the homography model is only an approximation of the transformation that occurs between consecutive frames, and is not sufficient to fully describe it. The application of a different model could be very beneficial for the final results.

Fig. 5: Result of homography estimation and warping for real VCE images, with example 1 (first line) and example 2 (second line). Each example contains the original image A (first column), the warped image A (second column), and the original image B (third column).


5. Conclusions

Although the results of this implementation did not reach the standards of supervised approaches, this work shows that there is room for the improvement, and eventual utilisation, of unsupervised approaches in endoscopic applications. To further improve the results seen in this work, pre-processing techniques should be implemented to enhance the texture and colour variations of the images, and thus ease the learning of distinctive features. Another possible strategy, to decrease the learning time, could be to use transfer learning techniques. Finally, the network should be tested on a larger pool of real data, in order to validate it and prevent overfitting to this particular dataset. Only through this validation can we ensure that the network is fit to be used in real-world applications, such as capsule localisation problems.

Acknowledgements

This work is financed by National Funds through the Portuguese funding agency, FCT – Fundação para a Ciência e a Tecnologia, within project UID/EEA/50014/2019.

References

[1] PillCam™ SB3 System — Medtronic, https://www.medtronic.com/covidien/en-us/products/capsule-endoscopy/pillcam-sb-3-system.html
[2] Agarwal, A., Jawahar, C.V., Narayanan, P.J.: A Survey of Planar Homography Estimation Techniques. Tech. rep. (2005)
[3] DeTone, D., Malisiewicz, T., Rabinovich, A.: Deep Image Homography Estimation (2016)
[4] Dubrofsky, E.: Homography Estimation (March) (2009)
[5] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (2017)
[6] Iakovidis, D.K., Koulaouzidis, A.: Software for enhanced video capsule endoscopy: Challenges for essential progress (2015)
[7] Koulaouzidis, A., Iakovidis, D., Yung, D., Mazomenos, E., Bianchi, F., Karagyris, A., Dimas, G., Stoyanov, D., Thorlacius, H., Toth, E., Ciuti, G.: Novel experimental and software methods for image reconstruction and localization in capsule endoscopy (2018)
[8] Liao, Z., Gao, R., Xu, C., Li, Z.S.: Indications and detection, completion, and retention rates of small-bowel capsule endoscopy: a systematic review. Gastrointestinal Endoscopy (2010)
[9] Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation (2015)
[10] Nguyen, T., Chen, S.W., Shivakumar, S.S., Taylor, C.J., Kumar, V.: Unsupervised Deep Homography: A Fast and Robust Homography Estimation Model (2017)
[11] Pan, S.J., Yang, Q.: A survey on transfer learning (2010)
[12] Pinheiro, G., Coelho, P., Salgado, M., Oliveira, H.P., Cunha, A.: Deep Homography Based Localization on Videos of Endoscopic Capsules. In: Proceedings - 2018 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2018, pp. 724–727. Institute of Electrical and Electronics Engineers Inc. (2019)
[13] Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)
[14] Spyrou, E., Iakovidis, D.K.: Homography-based orientation estimation for capsule endoscope tracking. In: IST 2012 - 2012 IEEE International Conference on Imaging Systems and Techniques, Proceedings (2012)
[15] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1–9. IEEE Computer Society (2015)
[16] Than, T.D., Alici, G., Zhou, H., Li, W.: A review of localization systems for robotic endoscopic capsules. IEEE Transactions on Biomedical Engineering (2012)
[17] Turan, M., Almalioglu, Y., Konukoglu, E., Sitti, M.: A Deep Learning Based 6 Degree-of-Freedom Localization Method for Endoscopic Capsule Robots (2017)
[18] Turan, M., Ornek, E.P., Ibrahimli, N., Giracoglu, C., Almalioglu, Y., Yanik, M.F., Sitti, M.: Unsupervised Odometry and Depth Learning for Endoscopic Capsule Robots (2018)
[19] Wu, Z., Zhang, Y., Yu, F., Xiao, J.: A GPU implementation of GoogLeNet (2014)
[20] Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised Learning of Depth and Ego-Motion from Video (2017)