Deep learning algorithms for automated detection of Crohn’s disease ulcers by video capsule endoscopy


Journal Pre-proof

Eyal Klang, M.D., Yiftach Barash, M.D., Reuma Yehuda Margalit, M.D., Shelly Soffer, M.D., Orit Shimon, M.D., Ahmad Albshesh, M.D., Shomron Ben-Horin, M.D., Marianne Michal Amitai, M.D., Rami Eliakim, M.D., Uri Kopylov, M.D.

PII: S0016-5107(19)32428-9

DOI: https://doi.org/10.1016/j.gie.2019.11.012

Reference: YMGE 11830

To appear in: Gastrointestinal Endoscopy

Received Date: 3 May 2019

Accepted Date: 3 November 2019

Please cite this article as: Klang E, Barash Y, Yehuda Margalit R, Soffer S, Shimon O, Albshesh A, Ben-Horin S, Michal Amitai M, Eliakim R, Kopylov U, Deep learning algorithms for automated detection of Crohn’s disease ulcers by video capsule endoscopy, Gastrointestinal Endoscopy (2019), doi: https://doi.org/10.1016/j.gie.2019.11.012.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Copyright © 2019 by the American Society for Gastrointestinal Endoscopy

Eyal Klang (1,2), M.D.; Yiftach Barash (2), M.D.; Reuma Yehuda Margalit (3), M.D.; Shelly Soffer (1,2), M.D.; Orit Shimon (1), M.D.; Ahmad Albshesh (3), M.D.; Shomron Ben-Horin (3), M.D.; Marianne Michal Amitai (1), M.D.; Rami Eliakim (3), M.D.; Uri Kopylov (3), M.D.

1- Department of Diagnostic Imaging, Sheba Medical Center, Tel Hashomer, Israel, and Sackler Medical School, Tel Aviv University, Tel Aviv, Israel
2- DeepVision Lab, Sheba Medical Center, Tel Hashomer, Israel
3- Department of Gastroenterology, Sheba Medical Center, Tel Hashomer, Israel, and Sackler Medical School, Tel Aviv University, Tel Aviv, Israel

Corresponding author: Shelly Soffer
Institutional address: Sheba Medical Center, Tel Hashomer, 5265601, Israel
Phone: +972-545258396, Fax: +972-3-5357315, Email: [email protected]

Author Contributions
EK - study planning, coding of models, drafting of the manuscript
YB - coding of models, reviewed the manuscript for important scientific content
RYM, AA - data collection
UK - study planning, data collection, drafting of the manuscript
SBH, RE, MA, SS, OS - reviewed the manuscript for important scientific content

Conflicts of interest
EK, YB, SS, RYM, OS, MA, AA - none
UK - speaker and advisory fees: Takeda, Janssen, AbbVie, MSD, Medtronic
RE - speaker for Takeda, Janssen and Medtronic
SBH - received consulting and advisory board fees and/or research support from AbbVie, MSD, Janssen, Takeda, Pfizer, GSK and Celltrion
Research support - Takeda, Janssen, Medtronic

Funding The study was partially supported by the Leona M. and Harry B. Helmsley Charitable Trust.

Abstract:

Background and aim: The aim of our study was to develop and evaluate a deep learning algorithm for the automated detection of small-bowel ulcers in Crohn’s disease (CD) on capsule endoscopy (CE) images of individual patients.

Methods: We retrospectively collected CE images of known CD patients and controls. Each image was labeled by an expert gastroenterologist as either normal mucosa or containing mucosal ulcers. A convolutional neural network (CNN) was trained to classify images into either normal mucosa or mucosal ulcers. First, we trained the network on 5-fold randomly split images (each fold with 80% training images and 20% testing images). Then we conducted 10 experiments in which images from n-1 patients were used to train a network and images from a different individual patient were used to test the network. Results of the networks were compared for randomly split images and for individual patients. Areas under the curve (AUCs) and accuracies were computed for each individual network.

Results: Overall, our dataset included 17,640 CE images from 49 patients: 7,391 images with mucosal ulcers and 10,249 images of normal mucosa. For randomly split images, results were excellent, with AUCs of 0.99 and accuracies ranging from 95.4% to 96.7%. For individual patient-level experiments, the AUCs were also excellent (0.94 to 0.99).

Conclusions: Deep learning technology provides accurate and fast automated detection of mucosal ulcers on CE images. Individual patient-level analysis provided high and consistent diagnostic accuracy with shortened reading time; in the future, deep learning algorithms may augment and facilitate CE reading.

Keywords: Crohn's Disease; Ulcer; Capsule Endoscopy; Neural Networks; AI (Artificial Intelligence)

Introduction

Capsule endoscopy (CE) is a well-established modality for the diagnosis and monitoring of Crohn’s disease (CD) (1-8). Small-bowel mucosal inflammation is frequently detected in CD patients even in clinical or biological remission (9). The diagnostic yield of CE is at least similar to that of cross-sectional imaging for detection of active endoscopic inflammation in established CD (10, 11). The main scoring systems for quantification of mucosal inflammation by CE are the Lewis score (LS) (12) and the Capsule Endoscopy Crohn's Disease Activity Index (CECDAI) (13, 14). Both indices perform comparably in established CD (15); nevertheless, their performance is hampered by interobserver variability (16). Although both scores are validated in established CD (14, 16), it is unclear how well either reflects the true inflammatory burden in the small bowel, owing to the lack of a comparator modality for the proximal small bowel. Despite the well-described merits of CE, the clinical performance of this modality may be further augmented by shortening reading time, reducing interobserver variability, and implementing precise scoring algorithms.

Automated image analysis is termed computer vision, an interdisciplinary field that focuses on how computers gain understanding of digital images (17). In the past few years, artificial intelligence (AI) deep learning algorithms, termed convolutional neural networks (CNN), have revolutionized the computer vision field, offering remarkable near-human accuracy in different image analysis tasks, including medical image analysis (18). To date, several workgroups have implemented AI techniques for automated detection of various pathologies by CE, including angioectasia, celiac disease, polyps, and hookworm infection (19-26), with impressive specificity and sensitivity. However, a review of the relevant literature reveals that AI has not been applied on an individual patient level and it is unclear whether this

technology may provide reliable diagnostic and monitoring data that can guide treatment decisions for a single patient. The main potential obstacles for patient-level implementation are the marked variability of images between examinations, with marked dissimilarities in image characteristics such as color hue, brightness and contrast, differences in ulcer shape and size, and quality of preparation. The aim of our project was to evaluate the accuracy of CNN for detection of ulcers in CD on CE for image sets from individual patients. If proven accurate and feasible, implementation of AI will allow for fast and automated detection and quantification of mucosal inflammation in CD.

Methods:

Study design: We randomly selected CE videos from patients diagnosed with CD as well as healthy subjects from our database and downloaded de-identified images of both ulcerated and normal mucosa. All patients were diagnosed and followed by the department of gastroenterology at Sheba Medical Center. The patients underwent CE for suspected or established Crohn's disease. The diagnosis of Crohn’s disease was established by either ileocolonoscopy with compatible endoscopic and histological findings or, when the disease location was inaccessible to ileocolonoscopy, by CE or a combination of CE with cross-sectional imaging findings. The images were obtained by PillCam SBIII (Medtronic Ltd, Dublin, Ireland) and reviewed with Rapid 9 (Medtronic Ltd, Dublin, Ireland) capsule reading software. The extracted images were labeled by gastroenterology fellows (R.M. and A.A.) supervised by a capsule expert (U.K.). Both ulcers and erosions were considered as “ulcerated mucosa.” For patients diagnosed with CD, we aimed to extract a comparable number of pathological and normal images. An institutional review board granted approval for this retrospective study.

Software and hardware: The models were written in Python (ver. 3.6.5, 64 bits) utilizing the open-source Keras (ver. 2.1.5) library and the open-source TensorFlow (ver. 1.5.0) library as backend for CNN algorithms, and the open-source Scikit-Learn library (ver. 0.20.2) for t-SNE and PCA algorithms. Models ran on an Intel i7 CPU and two GeForce GTX 1080 Ti graphics cards.

Dimensionality reduction: We wanted to explore the relationship between capsule images of different patients. Images represent very high dimensional information, as each pixel in the image is a single variable. After cropping of borders, the size of each capsule image is 516 x 516. Because the images are stored in 3 color channels (RGB), each image has 516 x 516 x 3 = 798,768 dimensions. Because of this high dimensionality, it is difficult to interpret the relationships between images; thus, we used dimensionality reduction techniques. Dimensionality reduction was done for exploration of the relationships between images and not for network training. It enables us to present the complexity of the problem, showing the relationships between different patients and between normal and ulcer images in 2D. For dimensionality reduction we first used principal component analysis (PCA) to reduce each image to 500 dimensions, then we applied the t-distributed stochastic neighbor embedding algorithm (t-SNE) to the 500-dimensional vectors. PCA is a technique for reducing the dimensionality of large datasets, increasing interpretability while minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance (27). t-SNE is an algorithm for dimensionality reduction that is well suited for the visualization of high-dimensional datasets (28).

CNN models: A state-of-the-art Xception CNN (29) was trained to classify capsule images into either images of normal mucosa or images with mucosal ulcers. The network’s weights were initialized using weights pre-trained on the 1.2 million everyday color images of ImageNet (30), which consists of 1000 categories.

Preprocessing of capsule images included cropping of the images’ borders and legends. Images were then resized into a 299 × 299 matrix and pixel values were normalized to the range 0 to 1 by dividing by 255. The following parameters were used for training the network: 5 epochs; batch size 16; Adam optimization with a learning rate of 0.001. Softmax was used as the output activation function.

Experimental design:

Experiment 1: We used a 5-fold cross-validation experiment in which the entire dataset was randomly split into 5 equal-sized subsets; in each round, 80% of the data (4 folds) were used for training and 20% (a single fold) for validation. This was repeated 5 times (once per fold). In this experiment, images from the same patients appeared in both the training and validation datasets.

Experiment 2: We evaluated the performance of CNN on the individual patient level. For this purpose, we designed the following data split: training was done on images from N-1 patients and testing was conducted on the images of one unseen patient. This experiment was repeated 10 times, separately conducted on 10 randomly picked different patients. The experiments were conducted with data augmentation. Figure 1 illustrates the 2 different experiments. We calculated the average prediction time of full movies by averaging the duration of predictions of 5 movies.

Metrics: The network’s final neuron is a sigmoid activation function. Similarly to logistic regression, for each image this neuron outputs a probability between 0 and 1. To convert that to a discrete value of either 0 (normal) or 1 (ulcer), a default threshold value of 0.5 is applied to the output (31).
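To make the difference between the two experiments concrete, the following is a minimal Python sketch, not the authors' code. It contrasts an image-level random k-fold split (Experiment 1 style) with a leave-one-patient-out split (Experiment 2 style), and then applies the 0.5 threshold to probabilities to obtain the kind of summary metrics reported in this study. All function names, the toy patient identifiers, and the example probabilities are hypothetical; treating a probability exactly equal to 0.5 as "ulcer" is also an assumption, since the text does not specify tie handling.

```python
import random


def leave_one_patient_out(patient_ids, held_out):
    """Split image indices so the held-out patient's images are used only for testing."""
    train_idx = [i for i, pid in enumerate(patient_ids) if pid != held_out]
    test_idx = [i for i, pid in enumerate(patient_ids) if pid == held_out]
    return train_idx, test_idx


def random_k_fold(n_images, k=5, seed=0):
    """Image-level split: shuffle all image indices, then deal them into k folds.

    Images of one patient can land in several folds, as in Experiment 1.
    """
    order = list(range(n_images))
    random.Random(seed).shuffle(order)
    return [order[fold::k] for fold in range(k)]


def binary_metrics(y_true, y_prob, threshold=0.5):
    """Threshold probabilities (1 = ulcer, 0 = normal) and compute summary metrics."""
    # Assumption: a probability exactly equal to the threshold counts as ulcer.
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return NaN instead of raising when a class is absent from the test set.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Toy data (invented for illustration): 8 images from 3 hypothetical patients.
patient_ids = ["A", "A", "A", "B", "B", "C", "C", "C"]
train_idx, test_idx = leave_one_patient_out(patient_ids, held_out="B")
print(train_idx, test_idx)

folds = random_k_fold(len(patient_ids), k=5)
print([len(f) for f in folds])

# Hypothetical network outputs for five test images.
metrics = binary_metrics(y_true=[1, 1, 0, 0, 1], y_prob=[0.9, 0.4, 0.2, 0.7, 0.5])
print(metrics)
```

The design point is that the patient-level split is keyed on the patient identifier rather than on the image index, so no image of the test patient can influence training.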

Receiver operating characteristic (ROC) curves were plotted for each experiment by varying the operating threshold. The model’s metrics included area under the curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) for detecting mucosal ulcers. Accuracy is the proportion of correctly classified images and is defined as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP = true positive, TN = true negative, FP = false positive, FN = false

negative.

Results:

Study population: We included data from 49 patients: 36 with CD and ulcerated mucosa on CE, 2 patients with CD and normal mucosa on CE, and 11 patients without CD and with normal mucosa on CE. Table 1 presents clinical and demographic characteristics of CD patients included in the cohort. The characteristics of patients without CD are as follows: males, 3 (27.3%); age, 23 (21-40) years; the indications for CE were abdominal pain, 5 (45.4%); diarrhea, 4 (36.3%); and anemia, 2 (18.1%). All patients underwent CE for suspected or established Crohn's disease. Fifteen (41.7%) of the patients had ileocolonoscopic findings and mucosal biopsies consistent with Crohn's disease. In the rest of the patients, the diagnosis was established by capsule endoscopy coupled with compatible findings on cross-sectional imaging and elevated biomarkers (C-reactive protein/fecal calprotectin) (9/36, 25%), or by capsule endoscopy and elevated biomarkers alone following normal or inconclusive ileocolonoscopy.

Overall, our dataset included 17,640 CE images from 49 patients: 7,391 images with mucosal ulcers and 10,249 images of normal mucosa. Of the 10,249 normal images, 3,577 originated from patients with normal CE and 6,672 from patients with CD.

To explore the complexity of the problem we used dimensionality reduction techniques. Figure 2 shows the distribution of images from 3 individual patients in 2 dimensions obtained by using PCA and then t-SNE. One can observe that the problem is complex; the distribution of normal images and ulcer images of different patients is heterogeneous. However, it is still possible to identify discrete clusters of normal mucosa images and ulcer images of individual patients.
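The PCA-then-t-SNE exploration described in the Methods can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline: it uses scikit-learn's `PCA` and `TSNE` classes on small synthetic arrays standing in for capsule images, the function name `embed_images_2d` is hypothetical, and the t-SNE settings (including the perplexity cap for tiny datasets) are assumptions because the text does not report them. The paper's 500 PCA components are kept as the default but are capped automatically when the toy dataset is smaller.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def embed_images_2d(images, n_pca_components=500, perplexity=30.0, seed=0):
    """Flatten images, reduce them with PCA, then embed the result in 2D with t-SNE."""
    flat = np.asarray(images, dtype=np.float32).reshape(len(images), -1)
    # PCA cannot return more components than there are samples or features.
    n_components = min(n_pca_components, flat.shape[0], flat.shape[1])
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(flat)
    # t-SNE requires perplexity < number of samples; this cap is an assumption
    # made only so that the toy example below runs on very small inputs.
    perplexity = min(perplexity, (flat.shape[0] - 1) / 3.0)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(reduced)


# Synthetic stand-in for capsule frames: 40 random 16 x 16 RGB arrays.
rng = np.random.default_rng(0)
toy_images = rng.random((40, 16, 16, 3))
embedding = embed_images_2d(toy_images)
print(embedding.shape)
```

With real frames, each row of the returned array could be plotted as a point and colored by patient and by label (normal or ulcer) to produce a view comparable in spirit to Figure 2.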

Experiment 1: When the CNN was applied to randomly selected images from the entire dataset, the network showed high accuracy. Table 2 presents the results of 5 experiments performed on randomly split images (80% training, 20% testing). The AUCs in these experiments were 0.99 (Supplementary Figure 1) and the accuracies ranged from 95.4% to 96.7%. Supplementary Figure 2 presents examples of normal mucosa and mucosa with ulcers detected correctly and incorrectly by the network.

Experiment 2: Experiments conducted on images from individual patients are presented in Table 3. For all individual patient-level experiments, the AUCs were high, ranging from 0.94 to 0.99, although variability is seen in the accuracies for different patients, ranging from 73.7% to 98.2%. Figure 3 presents the ROC plots of 10 individual patients. The distributions of true positive normal and ulcer images for individual patients are presented in Table 4. The median number of frames extracted from 5 complete small-bowel films containing ulcerated mucosa (first duodenal to first cecal frame) from the capsule reader software was 6,176 (ranging from 3,301 to 12,176) images. The average duration for analyzing a complete film was 204.7±93.9 seconds (ranging from 141.7 to 368.1 seconds).

Discussion

Our results demonstrate that CNN has potential in classification of CE images of mucosal ulceration in individual CD patients. The algorithm was able to detect ulcerations in established CD patients with an AUC of 0.94 and above. In the last 2 years, several publications addressed the potential use of AI techniques for automated detection of pathologies on CE. Leenhardt et al used CNN for detection of angioectasias and reported sensitivity and specificity of 100% and 96%, respectively (32). Iakovidis et al used weakly supervised CNN for detection of a variety of gastrointestinal

pathologies on CE videos and achieved an AUC of 80% (33). An alternative algorithm, a stacked sparse autoencoder with image manifold constraint (SSAEIM), provided an overall accuracy of 98% for detection of polyps (23).

Diagnosis and monitoring of CD may particularly benefit from the implementation of AI detection. For the purpose of diagnosis, there is little emphasis on the particular number of ulcerations detected, and detection of even a few ulcers or erosions will support the diagnosis of CD in the absence of an alternative diagnosis (1). Moreover, the size and magnitude of the ulcerations have no known implications for the diagnostic accuracy; although additional differential diagnoses exist, such as nonsteroidal anti-inflammatory drug-related enteropathy, Behçet disease, and autoimmune enteropathy, CD is by far the most plausible diagnosis in patients with an appropriate clinical scenario and high pretest probability (2, 34, 35). Such detection may be achieved by a CNN algorithm.

On the other hand, accurate quantification of mucosal involvement by CE may be of great interest in patients with established CD. The value of mucosal healing in CD is well established; patients achieving mucosal healing are more likely to maintain clinical remission and will require fewer surgeries and hospitalizations (36). However, the standard tool for the assessment of mucosal healing in clinical trials is ileocolonoscopy; thus, proximal small-bowel inflammation is completely disregarded in the setting of CD clinical trials. Proximal small-bowel disease can be detected in over 50% of CD patients (37). Such proximal involvement is associated with worse long-term outcomes and higher risk of surgery (38). The degree of mucosal healing of inflammation in one sector of the digestive tract may not necessarily reflect that in other segments (39). Moreover, active mucosal inflammation (Lewis score >135) is detected in almost 85% of

CD patients in clinical remission (9). In 2/3 of these patients, persistent low-grade inflammation was evident (LS <790). Until recently, the clinical significance of those supposedly insignificant findings was unclear; however, a recent publication by Ben-Horin et al demonstrated a 10-fold risk of relapse in CD patients with active small-bowel inflammation (LS>350) as compared with patients with LS<350. No other diagnostic modality (such as inflammatory biomarkers or MR enterography) had a similar predictive accuracy (40). Notwithstanding the established accuracy of CE for monitoring of CD, accurate quantification of mucosal involvement in CD poses a significant challenge. Primarily, there is no comparator

modality for validation, as cross-sectional imaging is significantly less sensitive for detection of proximal inflammation (41). Fecal calprotectin is correlated with endoscopic inflammation in the small bowel (42); however, the correlation with both the Lewis score and CECDAI is weak to moderate (15, 43). Such lack of a strong correlation may be at least partially explained by the limitations of CECDAI and the Lewis score. Both scores contain multiple “soft” operators that lack exact measurement scales and are potentially biased by subjective interpretation (number of ulcers, longitudinal and circumferential extent of involvement, etc) (12, 14). Such operators are common to other endoscopic indices of inflammation such as SES-CD (Simple Endoscopic Score for Crohn’s Disease) and CDEIS (Crohn's Disease Endoscopic Index of Severity) (44). Thus, it is plausible that a scoring system based on automated quantification of inflammatory burden, by approximation of involved surface or number of frames that contain pathological findings, may be more accurate for evaluation of disease severity and prognosis. With the introduction of the novel PillCam Crohn’s capsule (45), which allows for a “one stop shop” evaluation of the entire digestive tract, such an algorithm may be of even higher importance.

Recently, Aoki et al (46) reported their experience with CNN for the detection of mucosal ulcerations on CE images. There are several important differences between their project and our work. In our study the entire image was labeled either as normal or as containing mucosal ulceration. We did not use a bounding box or any other region of interest (ROI) method to mark specific lesions. Thus, our results support the robustness of CNN for video CE images. Moreover, Aoki et al collected and pooled data from 180 different patients.
We limited our study to 49 patients, and in each individual patient-level experiment 48 different patients were used for training and one patient for testing. This is a real-life experiment, since after the initial training the algorithm was capable of detection in a given unique individual patient. Detection in a single patient poses distinctive challenges, as there is significant variability between different patients in the individual characteristics of the images (as demonstrated in Figure 1) that may overlap and obscure the selected feature. This point is critical for application in real-world, individual-patient situations: training and testing images from the same patient cannot be mixed, because images of the same patient carry very similar features, namely lighting, color hue, similar mucosal pattern, and similar ulcers. Thus, the ability of our algorithm to distinguish between normal and inflamed images of a given patient

after training on the rest of the dataset is reassuring. Moreover, CNN was able to provide significantly faster reading times in comparison with a human reader: after training, the algorithm required a median of less than 3.5 minutes to analyze a complete small-bowel film; although there is no standard definition of reading time by a human expert, the current literature reports reading times of 40 to 50 minutes (47). We believe that at this stage, neural network-based systems cannot replace human gastroenterologists, but can rather serve either to shorten reading time, using statistical methods that reduce the number of images requiring validation by a human reader, or as second-reader systems for quality assurance. Shortening reading time may also provide a significant economic benefit by driving down reading costs.

There are several limitations to our study. Primarily, the analysis was devoted to ulcerations and aphthae only; any other given pathology will require similar training. We did not try to distinguish between ulcers and aphthae, as CE does not provide accurate measuring tools, and even if this could be achieved the practical value of such a distinction is unknown. Moreover, the ability to distinguish between normal and pathological images should next be translated to complete endoscopy films. To compute a novel index for CD progression, neural networks would have to be specifically trained and would have to perform statistical averaging of multiple image-based data. An additional limitation stems from the lack of pathological confirmation of the ulcers. However, this is an inherent limitation of capsule endoscopy; diagnosis of CD by CE is seldom amenable to histological confirmation because of inaccessibility by standard ileocolonoscopy; in these patients, performance of deep enteroscopy with histological confirmation is not required by diagnostic guidelines (48).
Importantly, the high diagnostic yield of CE for diagnosis of suspected CD has been confirmed in multiple studies (49). In patients with established CD, previously undiagnosed proximal small-bowel lesions can be detected in >60% of the patients in clinical remission (9). Nonsteroidal anti-inflammatory drug (NSAID)-associated enteropathy is the most common differential diagnosis for CD-like lesions in the small bowel on CE. These findings are frequently histologically indistinguishable from CD lesions (50). In our center, patients are instructed to abstain from NSAIDs for at least 4 weeks before CE as a standard procedure.

In summary, our study demonstrates the practical applicability of a CNN model for automated detection of mucosal ulcers in CD patients. Such an algorithm may augment real-time CE reading software and can lead to improved accuracy and reproducibility along with shortened reading time.

Acknowledgements

The study was partially supported by the Leona M. and Harry B. Helmsley Charitable Trust.

References: 1. Pennazio M, Spada C, Eliakim R, Keuchel M, May A, Mulder CJ, et al. Small-bowel capsule endoscopy and device-assisted enteroscopy for diagnosis and treatment of small-bowel disorders: European Society of Gastrointestinal Endoscopy (ESGE) Clinical Guideline. Endoscopy. 2015;47:352-86. 2. Kopylov U, Koulaouzidis A, Klang E, Carter D, Ben-Horin S, Eliakim R. Monitoring of small bowel Crohn's disease. Expert review of gastroenterology & hepatology. 2017;11:1047-58. 3. Eliakim R. Video capsule endoscopy of the small bowel. Current opinion in gastroenterology. 2010;26:129-33. 4. Waterman M, Eliakim R. Capsule enteroscopy of the small intestine. Abdominal imaging. 2009;34:452-8. 5. Eliakim R. Video capsule endoscopy of the small bowel. Curr Opin Gastroenterol. 2008;24:15963. 6. Melmed GY, Dubinsky MC, Rubin DT, Fleisher M, Pasha SF, Sakuraba A, et al. Utility of video capsule endoscopy for longitudinal monitoring of Crohn’s disease activity in the small bowel: a prospective study. gastrointestinal endoscopy. 2018;88:947-55. e2. 7. Sturm A, Maaser C, Calabrese E, Annese V, Fiorino G, Kucharzik T, et al. ECCO-ESGAR Guideline for Diagnostic Assessment in IBD Part 2: IBD scores and general principles and technical aspects. Journal of Crohn's and Colitis. 2018;13:273-84. 8. Maaser C, Sturm A, Vavricka SR, Kucharzik T, Fiorino G, Annese V, et al. ECCO-ESGAR Guideline for Diagnostic Assessment in IBD Part 1: Initial diagnosis, monitoring of known IBD, detection of complications. Journal of Crohn's and Colitis. 2018;13:144-64K. 9. Kopylov U, Yablecovitch D, Lahat A, Neuman S, Levhar N, Greener T, et al. Detection of small bowel mucosal healing and deep remission in patients with known small bowel Crohn’s disease using biomarkers, capsule endoscopy, and imaging. The American journal of gastroenterology. 2015;110:1316. 10. Yung DE, Har-Noy O, Tham YS, Ben-Horin S, Eliakim R, Koulaouzidis A, et al. 
Capsule endoscopy, magnetic resonance enterography, and small bowel ultrasound for evaluation of postoperative recurrence in crohn’s disease: systematic review and meta-analysis. Inflammatory bowel diseases. 2017;24:93-100. 11. Kopylov U, Yung DE, Engel T, Vijayan S, Har-Noy O, Katz L, et al. Diagnostic yield of capsule endoscopy versus magnetic resonance enterography and small bowel contrast ultrasound in the evaluation of small bowel Crohn’s disease: Systematic review and meta-analysis. Digestive and Liver Disease. 2017;49:854-63. 12. Gralnek I, Defranchis R, Seidman E, Leighton JA, Legnani P, Lewis B. Development of a capsule endoscopy scoring index for small bowel mucosal inflammatory change. Alimentary pharmacology & therapeutics. 2008;27:146-54. 13. De Vos M, Cuvelier C, Mielants H, Veys E, Barbier F, Elewaut A. Ileocolonoscopy in seronegative spondylarthropathy. Gastroenterology. 1989;96:339-44.

14. Niv Y, Ilani S, Levi Z, Hershkowitz M, Niv E, Fireman Z, et al. Validation of the Capsule Endoscopy Crohn’s Disease Activity Index (CECDAI or Niv score): a multicenter prospective study. Endoscopy. 2012;44:21-6. 15. Yablecovitch D, Lahat A, Neuman S, Levhar N, Avidan B, Ben-Horin S, et al. The Lewis score or the capsule endoscopy Crohn’s disease activity index: which one is better for the assessment of small bowel inflammation in established Crohn’s disease? Therapeutic advances in gastroenterology. 2018;11:1756283X17747780. 16. Cotter J, de Castro FD, Magalhães J, Moreira MJ, Rosa B. Validation of the Lewis score for the evaluation of small-bowel Crohn’s disease activity. Endoscopy. 2015;47:330-5. 17. Klang E. Deep learning and medical imaging. Journal of thoracic disease. 2018;10:1325-8. 18. Soffer S, Ben-Cohen A, Shimon O, Amitai MM, Greenspan H, Klang E. Convolutional neural networks for radiologic images: a radiologist’s guide. Radiology. 2019;290:590-606. 19. Vieira PM, Silva CP, Costa D, Vaz IF, Rolanda C, Lima CS. Automatic Segmentation and Detection of Small Bowel Angioectasias in WCE Images. Annals of biomedical engineering. 2019;47:1446-62. 20. Blanes-Vidal V, Baatrup G, Nadimi ES. Addressing priority challenges in the detection and assessment of colorectal polyps from capsule endoscopy and colonoscopy in colorectal cancer screening using machine learning. Acta Oncologica. 2019;58(sup1):S29-S36. 21. Hwang Y, Park J, Lim YJ, Chun HJ. Application of Artificial Intelligence in Capsule Endoscopy: Where Are We Now? Clinical endoscopy. 2018;51:547. 22. Zhou T, Han G, Li BN, Lin Z, Ciaccio EJ, Green PH, et al. Quantitative analysis of patients with celiac disease by video capsule endoscopy: A deep learning method. Computers in biology and medicine. 2017;85:1-6. 23. Yuan Y, Meng MQH. Deep learning for polyp recognition in wireless capsule endoscopy images. Medical physics. 2017;44:1379-89. 24. Jia X, Meng MQ-H, editors. 
Gastrointestinal bleeding detection in wireless capsule endoscopy images using handcrafted and CNN features. 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC); 2017: IEEE. 25. Min JK, Kwak MS, Cha JM. Overview of deep learning in gastrointestinal endoscopy. Gut and liver. 2019;13:388. 26. Hosoe N, Takabayashi K, Ogata H, Kanai T. Capsule endoscopy for small-intestinal disorders: Current status. Digestive endoscopy : official journal of the Japan Gastroenterological Endoscopy Society. 2019;31:498-507. 27. Jolliffe IT, Cadima J. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2016;374:20150202. 28. Arora S, Hu W, Kothari PK. An Analysis of the t-SNE Algorithm for Data Visualization. arXiv eprints [Internet]. 2018 March 01, 2018. Available from: https://ui.adsabs.harvard.edu/abs/2018arXiv180301768A. 29. Chollet F. Xception: Deep Learning with Depthwise Separable Convolutions. arXiv e-prints [Internet]. 2016 October 01, 2016. Available from: https://ui.adsabs.harvard.edu/abs/2016arXiv161002357C. 30. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. Imagenet large scale visual recognition challenge. International journal of computer vision. 2015;115:211-52. 31. Cao C, Liu F, Tan H, Song D, Shu W, Li W, et al. Deep Learning and Its Applications in Biomedicine. Genomics, proteomics & bioinformatics. 2018;16:17-32. 32. Leenhardt R, Vasseur P, Li C, Saurin JC, Rahmi G, Cholet F, et al. A neural network algorithm for detection of GI angiectasia during small-bowel capsule endoscopy. Gastrointestinal endoscopy. 2019;89:189-94. 33. Iakovidis DK, Georgakopoulos SV, Vasilakakis M, Koulaouzidis A, Plagianakos VP. Detecting and locating gastrointestinal anomalies using deep learning and iterative cluster unification. IEEE transactions on medical imaging. 2018;37:2196-210.

34. Kopylov U, Starr M, Watts C, Dionne S, Girardin M, Seidman EG. Detection of Crohn disease in patients with spondyloarthropathy: the SpACE capsule study. The Journal of rheumatology. 2018;45:498-505.
35. Kopylov U, Ben-Horin S, Seidman EG, Eliakim R. Video capsule endoscopy of the small bowel for monitoring of Crohn's disease. Inflammatory bowel diseases. 2015;21:2726-35.
36. Shah SC, Colombel JF, Sands BE, Narula N. Systematic review with meta-analysis: mucosal healing is associated with improved long-term outcomes in Crohn's disease. Aliment Pharmacol Ther. 2016;43:317-33.
37. Greener T, Klang E, Yablecovitch D, Lahat A, Neuman S, Levhar N, et al. The impact of magnetic resonance enterography and capsule endoscopy on the re-classification of disease in patients with known Crohn’s disease: a prospective Israeli IBD Research Nucleus (IIRN) Study. Journal of Crohn's and Colitis. 2016;10:525-31.
38. Flamant M, Trang C, Maillard O, Sacher-Huvelin S, Le Rhun M, Galmiche J-P, et al. The prevalence and outcome of jejunal lesions visualized by small bowel capsule endoscopy in Crohn’s disease. Inflammatory Bowel Diseases. 2013;19:1390-6.
39. Carvalho PB, Rosa B, Cotter J. Mucosal healing in Crohn’s disease - Are we reaching as far as possible with capsule endoscopy? Journal of Crohn's and Colitis. 2014;8:1566-7.
40. Ben-Horin S, Lahat A, Amitai MM, Klang E, Yablecovitch D, Neuman S, et al. Assessment of small bowel mucosal healing by video capsule endoscopy for the prediction of short-term and long-term risk of Crohn's disease flare: a prospective cohort study. The Lancet Gastroenterology & Hepatology. 2019;4:519-28.
41. Kopylov U, Klang E, Yablecovitch D, Lahat A, Avidan B, Neuman S, et al. Magnetic resonance enterography versus capsule endoscopy activity indices for quantification of small bowel inflammation in Crohn’s disease. Therapeutic advances in gastroenterology. 2016;9:655-63.
42. Tham YS, Yung DE, Fay S, Yamamoto T, Ben-Horin S, Eliakim R, et al. Fecal calprotectin for detection of postoperative endoscopic recurrence in Crohn’s disease: systematic review and meta-analysis. Therapeutic advances in gastroenterology. 2018;11:1756284818785571.
43. Koulaouzidis A, Sipponen T, Nemeth A, Makins R, Kopylov U, Nadler M, et al. Association between fecal calprotectin levels and small-bowel inflammation score in capsule endoscopy: a multicenter retrospective study. Digestive diseases and sciences. 2016;61:2033-40.
44. Khanna R, Nelson SA, Feagan BG, D'Haens G, Sandborn WJ, Zou GY, et al. Endoscopic scoring indices for evaluation of disease activity in Crohn's disease. The Cochrane database of systematic reviews. 2016:Cd010642.
45. Eliakim R, Spada C, Lapidus A, Eyal I, Pecere S, Fernández-Urién I, et al. Evaluation of a new panenteric video capsule endoscopy system in patients with suspected or established inflammatory bowel disease–feasibility study. Endoscopy international open. 2018;6:E1235-E46.
46. Aoki T, Yamada A, Aoyama K, Saito H, Tsuboi A, Nakada A, et al. Automatic detection of erosions and ulcerations in wireless capsule endoscopy images based on a deep convolutional neural network. Gastrointestinal endoscopy. 2019;89:357-63.e2.
47. Sidhu R, Sanders D, Morris A, McAlindon M. Guidelines on small bowel enteroscopy and capsule endoscopy in adults. Gut. 2008;57:125-36.
48. Maaser C, Sturm A, Vavricka SR, Kucharzik T, Fiorino G, Annese V, et al. ECCO-ESGAR guideline for diagnostic assessment in inflammatory bowel disease. Journal of Crohn's and Colitis. 2018.
49. Dionisio PM, Gurudu SR, Leighton JA, Leontiadis GI, Fleischer DE, Hara AK, et al. Capsule endoscopy has a significantly higher diagnostic yield in patients with suspected and established small-bowel Crohn's disease: a meta-analysis. Am J Gastroenterol. 2010;105:1240-8; quiz 9.
50. Price AB. Pathology of drug-associated gastrointestinal disease. British journal of clinical pharmacology. 2003;56:477-82.

Figure legends:

Figure 1: Research design describing the 2 different experiments.

Figure 2: 2D representation of the distribution of images from 3 patients. Dimensionality reduction was achieved using principal component analysis (PCA) followed by the t-distributed stochastic neighbor embedding (t-SNE) algorithm. The figure shows the complex distribution of the images in 2D space, although it is still possible to identify clusters of images from individual patients.

Figure 3: Receiver operating characteristic (ROC) curves for the individual patient-level experiment.
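The two-stage reduction named in the Figure 2 legend (PCA, then t-SNE) can be sketched with scikit-learn as below. This is an illustration only, not the authors' code: the function name `embed_2d`, the synthetic feature vectors, the 50 retained principal components, and the perplexity of 30 are all assumptions, and the legend does not state which image features were embedded.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def embed_2d(features, n_pca=50, perplexity=30.0, seed=0):
    """Reduce feature vectors to 2D: PCA first, then t-SNE (hypothetical helper)."""
    # PCA cannot keep more components than samples or features.
    n_pca = min(n_pca, features.shape[0], features.shape[1])
    reduced = PCA(n_components=n_pca, random_state=seed).fit_transform(features)
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(reduced)


# Synthetic stand-in for image features from 3 patients (assumed data, 40 images each).
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(loc=c, size=(40, 128)) for c in (0.0, 2.0, 4.0)])
patient_ids = np.repeat(["P1", "P2", "P3"], 40)

embedding = embed_2d(features)
print(embedding.shape)  # one 2D point per image
```

Coloring the rows of `embedding` by `patient_ids` in a scatter plot would give a view comparable in kind to Figure 2; running PCA first is a common way to reduce noise and cost before t-SNE.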

Supplementary Figures

Supplementary Figure 1: Receiver operating characteristic (ROC) curves for randomly split data (80% training, 20% testing). Areas under the curve (AUC) of 0.989 or higher are observed for all folds.

Supplementary Figure 2: Capsule endoscopy images of normal mucosa and mucosa with ulcers classified correctly (A and B) and incorrectly (C and D) by the convolutional neural network (CNN). A and B: CE images with ulcers classified as ulcers by the network. C: False-negative case in which the network classified an ulcer as a normal image. D: False-positive case in which the network classified a normal image as containing an ulcer.

Table 1 – Clinical and demographic characteristics of the included CD patients

| Characteristic | Category | Value |
|---|---|---|
| Gender | Male | 19 (50%) |
| | Female | 19 (50%) |
| Age (median, interquartile range) | | 28 (21-41) |
| Disease location | Ileal | 30 (78.9%) |
| | Ileocolonic | 8 (21.1%) |
| Disease phenotype | Nonstricturing nonpenetrating | 36 (94.7%) |
| | Stricturing | 1 (2.6%) |
| | Penetrating | 1 (2.6%) |
| Previous abdominal surgery | | 4 (10.5%) |
| Current treatment | Corticosteroids | 2 (5.3%) |
| | 5-ASA | 4 (10.5%) |
| | Thiopurine | 2 (5.3%) |
| | Anti-TNF | 4 (10.5%) |
| Lewis score (median, interquartile range) | | 900 (450-1350) |

Table 2: Performance of the network in the 5 experiments of randomly split images. Each fold represents one split of the data into 80% training and 20% testing.

| | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 |
|---|---|---|---|---|---|
| AUC | 0.990 | 0.989 | 0.994 | 0.989 | 0.993 |
| Accuracy | 95.7% | 95.4% | 96.4% | 96.4% | 96.7% |
| Sensitivity | 92.5% | 93.8% | 97.1% | 94.7% | 96.8% |
| Specificity | 98.1% | 96.7% | 96.0% | 97.5% | 96.6% |
| PPV | 97.2% | 95.4% | 94.4% | 96.5% | 95.5% |
| NPV | 94.8% | 95.5% | 97.9% | 96.3% | 97.6% |
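One way to obtain five 80%/20% splits like those summarized in Table 2 is 5-fold cross-validation over shuffled image indices, sketched below. This is an assumption about the procedure: the caption does not say whether the five test sets were disjoint, and the function name `five_fold_splits`, the seed, and the image count of 1000 are hypothetical.

```python
import random


def five_fold_splits(n_images, n_folds=5, seed=0):
    """Return (train_indices, test_indices) pairs, one per fold (hypothetical helper)."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)  # random split at the image level
    # Deal the shuffled indices into n_folds nearly equal, disjoint groups.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, test))
    return splits


# Illustrative image count only; not taken from the study.
for fold_number, (train, test) in enumerate(five_fold_splits(1000), start=1):
    print(f"Fold {fold_number}: {len(train)} training images, {len(test)} test images")
```

Because the split is made per image, frames from the same patient can fall on both sides of a split; the individual patient-level experiment addresses that by evaluating on each patient separately.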

Table 3: Performance of CNN on the individual patient level.

| | Patient A | Patient B | Patient C | Patient D | Patient E | Patient F | Patient G | Patient H | Patient I | Patient J |
|---|---|---|---|---|---|---|---|---|---|---|
| Images (N) | 362 | 420 | 438 | 387 | 246 | 672 | 325 | 376 | 502 | 372 |
| AUC | 0.995 | 0.966 | 0.940 | 0.949 | 0.966 | 0.995 | 0.996 | 0.956 | 0.979 | 0.999 |
| Accuracy | 96.1% | 85.2% | 73.7% | 93.4% | 91.9% | 93.9% | 98.2% | 87.0% | 91.0% | 96.2% |
| Sensitivity | 95.0% | 69.5% | 97.7% | 81.8% | 79.0% | 100.0% | 98.4% | 76.7% | 95.0% | 95.1% |
| Specificity | 84.9% | 100.0% | 56.8% | 98.6% | 98.2% | 89.0% | 98.0% | 97.3% | 84.9% | 100.0% |
| PPV | 90.6% | 100.0% | 61.3% | 95.7% | 95.5% | 87.9% | 96.8% | 96.7% | 90.6% | 100.0% |
| NPV | 91.8% | 77.7% | 97.4% | 93.2% | 90.5% | 100.0% | 99.0% | 80.5% | 91.8% | 86.4% |
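A patient-level evaluation can be sketched as a leave-one-patient-out split, in which every image of the tested patient is kept out of training. This design is an assumption made for illustration; the table itself only reports per-patient results. The record layout and the name `leave_one_patient_out` are hypothetical, and only the per-patient image counts come from Table 3.

```python
# Image counts per patient, taken from the "Images (N)" row of Table 3.
images_per_patient = {
    "A": 362, "B": 420, "C": 438, "D": 387, "E": 246,
    "F": 672, "G": 325, "H": 376, "I": 502, "J": 372,
}

# Hypothetical records: one (patient_id, image_index) pair per image.
records = [(pid, i) for pid, n in images_per_patient.items() for i in range(n)]


def leave_one_patient_out(records):
    """Yield (held_out_patient, train_records, test_records) for each patient."""
    for held_out in sorted({pid for pid, _ in records}):
        train = [r for r in records if r[0] != held_out]
        test = [r for r in records if r[0] == held_out]
        yield held_out, train, test


for held_out, train, test in leave_one_patient_out(records):
    print(f"Patient {held_out}: train on {len(train)} images, test on {len(test)}")
```

Splitting by patient rather than by image keeps near-duplicate frames from one examination out of both the training and test sets.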

Table 4: Distribution of correctly classified normal and ulcer images for individual patients (correctly classified / total).

| | Patient A | Patient B | Patient C | Patient D | Patient E | Patient F | Patient G | Patient H | Patient I | Patient J |
|---|---|---|---|---|---|---|---|---|---|---|
| Normal | 124 / 128 | 217 / 217 | 147 / 258 | 273 / 277 | 162 / 165 | 333 / 374 | 198 / 202 | 182 / 187 | 169 / 199 | 89 / 89 |
| Ulcer | 224 / 234 | 141 / 203 | 176 / 180 | 90 / 110 | 64 / 81 | 298 / 298 | 121 / 123 | 145 / 189 | 288 / 303 | 269 / 283 |
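The counts in Table 4 are enough to derive the per-patient metrics reported in Table 3. The sketch below assumes that each "x / y" pair means correctly classified images out of the total for that class and that ulcer is the positive class; the function name `metrics_from_counts` is hypothetical.

```python
def metrics_from_counts(normal_correct, normal_total, ulcer_correct, ulcer_total):
    """Derive binary-classification metrics, treating ulcer as the positive class."""
    tp = ulcer_correct                    # ulcers classified as ulcers
    fn = ulcer_total - ulcer_correct      # ulcers classified as normal
    tn = normal_correct                   # normal images classified as normal
    fp = normal_total - normal_correct    # normal images classified as ulcers

    def ratio(numerator, denominator):
        # Guard against empty denominators (e.g. no positive predictions).
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }


# Patient J in Table 4: normal 89 / 89, ulcer 269 / 283.
patient_j = metrics_from_counts(89, 89, 269, 283)
for name, value in patient_j.items():
    print(f"{name}: {100 * value:.1f}%")
# sensitivity 95.1%, specificity 100.0%, ppv 100.0%, npv 86.4%, accuracy 96.2%
```

For Patient J these values agree with the corresponding column of Table 3.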

Acronyms and abbreviations
Crohn’s disease (CD)
Convolutional neural network (CNN)
Area under the curve (AUC)
Artificial intelligence (AI)
Capsule endoscopy (CE)
Lewis score (LS)
Principal component analysis (PCA)
t-distributed stochastic neighbor embedding (t-SNE)
Receiver operating characteristic (ROC) curves
Region of interest (ROI)
Negative predictive value (NPV)
Positive predictive value (PPV)