How to Approach and Interpret Studies on AI in Gastroenterology


Neil M. Carleton1,2, Shyam Thakkar1,3

1 Division of Gastroenterology, Allegheny Health Network, Pittsburgh, PA, USA
2 Medical Scientist Training Program, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
3 Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, PA, USA

Corresponding Author: Shyam Thakkar, MD, FASGE, 1307 Federal Street, Pittsburgh, PA 15212, [email protected], 412-359-8900

Author Email Addresses: Neil M. Carleton [email protected]; Shyam Thakkar [email protected]

Funding: Disruptive Health Technologies Institute (DHTI) Grant from Carnegie Mellon University & Highmark Health to ST, Innovation Works Grant from Carnegie Mellon University to ST, and the National Institutes of Health T32GM008208 to NC.

Writing Assistance: No writing assistance was used for this manuscript.

Conflicts of Interest: The authors report no conflicts of interest.

Main Text & References Word Count: 2389


Introduction

There has been a proliferation of artificial intelligence (AI)-based research in recent years. Experts predict that AI utilization in health care is nearly inevitable as we progress to a time of unprecedented computing power that can analyze sizable sources of data.1 AI offers a number of opportunities to enhance human expertise: consistency in decision-making, lack of fatigue, improved efficiency in analyzing data, and detection of patterns unrecognizable to humans. In turn, experts hope that improvements in disease detection and diagnosis will allow physicians to spend more time with patients and less time analyzing the vast data from a patient's medical history, labs, and imaging results.

Clinicians across all fields of medicine who read AI articles are united by the most fundamental of questions: Why should I care about this AI system? How might I integrate this system into the current clinical setting? How does this ultimately help me help a patient with the disease under study? Alarmingly, many studies neither answer these critical questions nor include understandable explanations of the AI methodology, which can significantly help clinicians put results in proper context5. A highly accurate model does not often equate to clinical efficacy, especially when there is a disconnect between what a model is measuring and what might be a relevant clinical metric3,4. Further, understandability and explainability of AI and ML algorithms remain chief concerns. This issue of explainability of AI systems in medicine can be addressed in multiple ways. First, the burden should be on the authors and reviewers of AI studies to measure clinically relevant outcomes, include considerations for clinical impact, and put their results in the context of how such a decision support system would complement current clinical practice. Second, readers of AI studies must be better equipped to understand the various facets of AI pipelines: model development, input and output data structures, the specific AI method, and critical evaluation of performance metrics.

In this article, we aim to demystify AI- and ML-based research and to provide a framework for how to approach these types of articles in the gastroenterology literature. In a recent issue of JAMA, a group of Google Health scientists introduced a step-by-step guide to reviewing ML articles in the general medical literature, a resource for readers across medical specialties to utilize5. Here, we apply this framework to AI articles in gastroenterology and hepatology (Figure 1) using illustrative examples for each of the four critical aspects of AI studies: identifying the prediction task, the actual data, the AI or ML method, and evaluation of performance. These resources should serve as a guide to the author, reviewer, and reader alike when preparing, reviewing, or interpreting an AI article, along with critical considerations for implementing an AI system into clinical practice (summarized in Figures 2A and 2B). For further background on AI and ML, see the Supplementary Material.

A Framework for Interpreting AI Articles

Often overlooked in much of the literature on AI in medicine is the human intelligence it will be synthesized with: AI can be a real asset to physicians, pairing the scale, speed, and power of the machine with the ingenuity, creativity, and empathy of the clinician.
This synergism requires clinicians to remain well informed on AI for optimal use.6 Issues of understandability in AI articles are often multifactorial. Chief among them is the lack of a standardized, one-size-fits-all approach to developing AI systems4: variations in the size, composition, and split of training and testing sets; variations in the reporting of performance metrics; and variations in validation and performance evaluation techniques are all contributory factors2,7. However, these issues should not preclude evaluation of AI systems by clinicians.

Prediction Task

Identifying the prediction task, synonymous with what many describe as the "output" of the model, is often the first step in the pipeline: what exactly is the model measuring, and what is the clinical benefit of having such a prediction?
Would this model improve some aspect of patient care, such as better screening, improved efficiency, faster or more accurate diagnosis, or accurate prediction of prognosis? These can all be reasonable objectives. Critically aligned with identifying the prediction task is discerning who the intended end users might be, including both the potential physicians and the specific patient population.

In the first illustrative example, a recent study by de Groof et al8, the authors developed an AI system to detect early neoplasia in patients with Barrett esophagus during high-definition white light endoscopy. The authors are explicit about the prediction task and hypothesized that an AI system would be able to correctly detect early neoplasia at a higher rate than endoscopists. As such, it becomes clear who the potential end users would be and the clinical benefit they would gain from such a system: endoscopists often struggle to identify early neoplasia because progression of Barrett esophagus is rare. Patients with Barrett esophagus would benefit in that they could continue to have these procedures in local outpatient settings with highly predictable and reproducible results.

Data

Many considerations exist when evaluating the data used in a study, including the type of data, the input-output relationship, the amount of data in the training and testing sets, and how the data were annotated. First, the fundamental unit of many ML systems is the perceptron, an algorithm for supervised learning of binary classifiers that maps a set of weighted inputs to an output. The input data are used to train the system and can include any number of data types: patient clinical characteristics (age, presence of disease-specific risk factors, lab values, diagnostic test results, etc.), images (from imaging studies or from endoscopic modalities), and videos are all common input data types, with images being the most common. The output, or what the AI system is predicting based on the input data, is typically a binary outcome (e.g., tumor vs. non-tumor, polyp vs. no polyp, Barrett esophagus vs. adenocarcinoma). Critical considerations of the input and output include how the input data were gathered: were they drawn from common clinical practices, or did the authors have to process the data before using them to train the AI system?

Additionally, one should consider how the data were annotated prior to model training. If an AI system uses a supervised learning scheme, the authors often must annotate the data to establish a ground truth from which the algorithm learns to differentiate disease cases. In most AI systems, a panel of experts is used to establish this ground truth: each expert annotates or scores the input data, and if their scoring differs, a consensus choice may be used. When reading an AI article, a reader should consider: (1) Are the experts annotating input data using a clinically validated scale? If not, why? (2) Are the experts' annotations sufficient to establish ground truth? In some fields, even expert scores can differ significantly, creating concern about the ground truth used to train the AI system. High-quality labeling is essential for a high-performing system.
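To make the consensus-labeling step concrete, the following is a minimal sketch in Python of deriving a majority-vote ground truth from a hypothetical panel of three annotators; the labels, the simple vote, and the unanimity check are illustrative stand-ins, as real studies may instead rely on adjudication sessions or clinically validated scales.

```python
# Hypothetical consensus labeling: three experts label each image
# (1 = neoplasia, 0 = normal); a majority vote establishes ground truth.
from collections import Counter

expert_labels = [
    [1, 1, 0],  # experts disagree; the majority vote resolves to 1
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]

def consensus(case):
    """Return the majority-vote label for one case."""
    return Counter(case).most_common(1)[0][0]

ground_truth = [consensus(case) for case in expert_labels]
print(ground_truth)  # [1, 0, 1, 0]

# Flag non-unanimous cases, where the "ground truth" is least certain
uncertain = [i for i, case in enumerate(expert_labels) if len(set(case)) > 1]
print(uncertain)  # [0, 3]
```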
The last consideration for the data is how they were split between the training and testing sets. There are many valid ways to split data, including by treatment, by date, by patient, or randomly. For example, if a study uses images derived from a set of patients to train and evaluate an AI system, a random split of these images into the training or test set could place images derived from one patient in both the training and the test set. Splitting randomly by patient instead of by image would still allow for randomization but would ensure that all images from a single patient remain in just one of the data sets.
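As an illustration of such a patient-level split, the sketch below uses scikit-learn's GroupShuffleSplit on an entirely synthetic set of images and patient identifiers; the array names and sizes are hypothetical.

```python
# Patient-level train/test split: images from one patient never appear
# in both sets, avoiding the intra-patient leakage described above.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
images = rng.random((100, 64, 64))           # 100 synthetic 64x64 images
labels = rng.integers(0, 2, size=100)        # polyp vs. no polyp
patient_ids = rng.integers(0, 20, size=100)  # 20 patients, several images each

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(images, labels, groups=patient_ids))

# Verify that no patient contributes images to both sets
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
```

GroupKFold offers the same guarantee when cross-validation rather than a single split is desired.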

In this illustrative example, a recent study by Urban et al9 developed a system to localize colorectal polyps during screening colonoscopy. The authors utilized multiple data sets, with both images and videos, for the training and testing process. Notably, the authors were conscious of possible intra-patient polyp similarity bias and addressed it by including images from over 2,000 patients. This strong approach of large input data sets and varied input data types resulted in a highly accurate detection system, with the output of the system being the box it placed around an on-screen polyp.

Machine Learning Method

The next consideration is the machine learning method itself. Typically, the method chosen is dictated by the data type, how much data will be used for model training, and whether the model uses supervised or unsupervised learning. For example, classification tasks that include clinical characteristics are well suited for regression, support vector machine, decision tree and random forest, and nearest neighbor methods. However, more complex classification problems, such as image processing from histopathology slides, imaging modalities, or endoscopy-derived images, require more sophisticated methods such as convolutional or deep neural networks. Progressively complex neural networks with additional layering are considered DL.10 Depending on the method used, there is often a tradeoff between performance and model explainability: the best performing models, such as DL models, require vast amounts of data to train accurately because of the additional layers and networks they employ to enhance accuracy, and these models are often the least explainable. On the other hand, lower performing models, traditionally those that rely on regression, often require the least data but are quite explainable. An important consideration for readers is identifying whether the ML method is a standard method in the field or a customized method designed specifically for the presented task.

In this illustrative example, a recent study by Wu et al10 used four different ML methods to determine if they could predict fatty liver disease using demographic and clinical characteristics from the EMR. The authors state they used four classification models (random forest, Naïve Bayes, artificial neural network, and logistic regression) and compared accuracy between them. Further, the authors specify which parameters they pulled from the EMR to use as predictor variables, including age and gender, systolic and diastolic blood pressure, abdominal girth, and HDL, AST, and ALT lab values. In the methods, the authors describe their rationale for choosing each of the classification models, helping the reader identify pertinent positives and negatives for each.
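To show what such a head-to-head comparison can look like in practice, the minimal sketch below trains the same four model families on synthetic tabular data; the data, feature count, and hyperparameters are stand-ins, not those of Wu et al.

```python
# Compare four classifier families with cross-validated AUC on
# synthetic data standing in for EMR-derived clinical predictors.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# 8 features standing in for age, gender, blood pressures, girth, labs
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "naive Bayes": GaussianNB(),
    "neural network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                    random_state=0),
}

for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```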
Evaluation of Model Performance

The last consideration is evaluation of model performance: what performance metrics do the authors report? Does the study compare AI performance to a clinical standard or to humans? Does this comparison make sense in the context of what is being predicted? Studies will often report accuracy, the proportion of times the AI system makes a correct prediction. This metric should be accompanied by sensitivity, specificity, false positives, and false negatives, which are all critical for a full assessment of the model beyond the number of times it made a correct determination. Area-under-curve (AUC) metrics can also be used to assess predictive capabilities.
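The sketch below computes each of these metrics from a hypothetical set of predictions; the labels and model scores are invented for illustration.

```python
# Accuracy, sensitivity, specificity, and AUC from model outputs.
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_true = [0, 0, 0, 1, 1, 1, 0, 1]                   # ground-truth labels
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.6, 0.9]  # predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]     # threshold at 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate

print(f"accuracy    = {accuracy_score(y_true, y_pred):.2f}")
print(f"sensitivity = {sensitivity:.2f}")
print(f"specificity = {specificity:.2f}")
print(f"AUC         = {roc_auc_score(y_true, scores):.2f}")
```

Note that accuracy depends on the chosen decision threshold, whereas AUC summarizes discrimination across all thresholds.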

It must be reiterated that a highly accurate model does not equate to clinical efficacy, which requires additional comparisons. Conversely, an accuracy lower than anticipated does not mean the system does not work. Does the accuracy make sense for the intended deployment of the system? Toleration of lower accuracy may still carry a clear clinical benefit, for example in an AI-based screening test for low-resource settings: a lower accuracy may be tolerable if the goal of the system is only to alert patients that they are at risk for a particular disease rather than to provide a diagnostic prediction. Such a system would simply tell the patient that further evaluation by a physician is warranted.

Equally important is the type of validation the study performed. Readers should be wary of claims of "external validation": did the external data come from within the authors' own institution, or did it originate from an outside institution? Readers should clarify this aspect when inspecting how the system was validated, as it carries significant implications for the generalizability of the study and for potential next steps. Lastly, was the system benchmarked against human experts? This is also an important consideration for model performance: just because a model is accurate does not mean it outperforms human experts. We must also remember that if the "gold standard" used to train or benchmark the system was a physician's annotation or diagnosis, concluding that the AI "outperforms" physicians is misleading.

In this illustrative example, Shung et al11 utilized a ML model to identify patients with upper GI bleeds at risk for severe complications. Their system's performance was measured using AUC analysis and additionally compared with predictions made by three other validated clinical risk scoring systems. Importantly, they show both an internal and an external validation, with performance metrics reported against the clinical risk scores. This shows that their model is not only highly accurate in its predictive capacity but also connects its performance to the current clinical standard.
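To illustrate how the deployment context shapes what accuracy is "good enough," the sketch below picks a high-sensitivity operating point on a ROC curve, as a hypothetical screening tool might; all values are synthetic.

```python
# Choose the first operating point that reaches 90% sensitivity,
# accepting extra false positives in exchange, since flagged patients
# are referred for physician evaluation rather than given a diagnosis.
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.2, 0.3, 0.6, 0.1, 0.9, 0.4, 0.5, 0.8, 0.35, 0.7])

fpr, tpr, thresholds = roc_curve(y_true, scores)
idx = int(np.argmax(tpr >= 0.90))  # first threshold meeting the target

print(f"threshold = {thresholds[idx]:.2f}, sensitivity = {tpr[idx]:.2f}, "
      f"false positive rate = {fpr[idx]:.2f}")
```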

Conclusions

Ultimately, the goal should be to facilitate clinician understanding in order to promote rigorously trained and validated models that yield reproducible, clinically relevant predictions. Clinicians should be acutely aware of the potential advantages and weaknesses of AI models in the literature: it can be easy to show that AI systems find more disease, but the real question is whether, if more disease is found and treated, patients will be better off. Not all diagnoses are treatable; not all diagnoses even require treatment; and not all diseases benefit equally from treatment. Deploying AI systems with this in mind is critical. Trust in these systems will be vital moving forward: validating systems against reference data and in prospective trials will generate more faith that they can be truly beneficial in clinical medicine. Involving more clinicians in both the development of AI models and the design of AI clinical studies will promote the eventual translation of these technologies into clinical practice.

REFERENCES

1. Topol EJ. Nature Medicine. 2019 Jan;25(1):44-56.
2. Le Berre C, et al. Gastroenterology. 2020 Jan;158:76-92.
3. Kelly CJ, et al. BMC Medicine. 2019 Dec 1;17(1):195.
4. Keane PA, Topol EJ. NPJ Digit Med. 2018;1:40.
5. Liu Y, et al. JAMA. 2019 Nov 12;322(18):1806-16.

Full list of references can be found in the Supplementary Material.


FIGURES

Figure 1: An overview of the main applications of AI to current challenges in gastroenterology & hepatology, broken down by organ.


Figure 2: (A) Key questions for assessing each aspect of an AI article in the medical literature. (B) Summary of the stages, considerations, and challenges that accompany translation of potential AI technologies into clinical practice.


SUPPLEMENTARY MATERIAL

AI & ML: Definitions and Basics

Many clinicians and researchers are familiar with traditional methods of regression. While regression relies on a series of pre-specified parameters or features to discriminate a binary diagnostic or prognostic outcome, the vast data and enhanced computing power now available enable sophisticated AI models to: (1) extract a larger number of features from data sets, either pre-specified by the creator or left to the algorithm to determine their importance; (2) use greater numbers of mathematical functions, often in a series of iterative steps or layers, as opposed to the single-layer operation of traditional regression; and (3) better define complex clinical relationships that may not be well captured by the binary outcomes of traditional regression. Thus, many describe AI and its subfields of ML and deep learning (DL) as the natural progression from regression modeling, allowing for quicker and more accurate predictive capacities.

The term "artificial intelligence" itself is nebulous: it is the overarching term for any computational structure that employs "human-like" functions to solve a predictive, analytical, or identification challenge. More specifically, ML includes any process that can take a set of data (input), "learn" or extract relationships and patterns, and make a prediction (output)12. Within ML, methods include supervised and unsupervised learning. Supervised learning, which is more commonly used among currently published ML studies in medicine, requires labeled input and output data for the machine to learn from (often described as the machine having "a priori" knowledge of the input-output relationships), while unsupervised learning has the machine identify patterns and relationships in the training data without "a priori" knowledge of those relationships. Deep learning, an even more specific subset of ML, uses additional layering techniques to construct higher-order relationships between inputs and outputs, often requiring substantially more training data to be effective.

Additional AI methods that are becoming more popular include computer vision and natural language processing (NLP). Computer vision techniques rely on motion, using how objects in the field of view move across consecutive frames of a video to create flow fields; these methods have become a popular choice for real-time analysis of video feeds. NLP utilizes theories from linguistics and computer science to analyze large amounts of speech or text and make predictions; these methods have become a popular choice for speech recognition as well as for analysis of free text in the electronic medical record (EMR).
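As a minimal illustration of the supervised/unsupervised distinction, the sketch below fits a labeled classifier and an unlabeled clustering method to the same synthetic data; none of the values correspond to real clinical data.

```python
# Supervised vs. unsupervised learning on the same toy data set.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Supervised: learns from labeled input-output pairs (X, y)
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))

# Unsupervised: sees only X and discovers structure on its own; the
# cluster indices it assigns carry no a priori clinical meaning
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])
```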
Important Considerations for Clinical Implementation

An important assessment for readers of AI articles to make is identifying the future steps required to move a system into the clinical space. Currently published AI articles span the spectrum of clinical readiness, from proof-of-concept studies to studies ready for clinical trials. Only a few such prospective trials of gastroenterology AI applications have been published to date, with many more ongoing: systems for automated polyp and adenoma detection13,14 and an automated system for detecting blind spots for quality control during EGD15. These remain some of the few clinical trials of AI systems across all of medicine.

While prospective trials remain the gold standard for showing system efficacy, logistical, technical, and regulatory hurdles persist after their completion. Beyond clinical efficacy, these systems must also demonstrate cost-effectiveness and efficiency16. Developers of AI systems must consider the user interface, the training of front-line clinicians who will use the system, and the hospital infrastructure that must be in place to support the vast amounts of data the system will undoubtedly require. Further, an AI system requires continued monitoring and the ability to use new data to continually re-train and fine-tune itself while in use. If a system was designed and developed in one hospital system, it might require site-specific retraining to allow for accurate predictions, as endoscopic videos, EMRs, and lab procedures may all differ slightly between institutions.

FULL REFERENCE LIST

1. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine. 2019 Jan;25(1):44-56.
2. Le Berre C, Sandborn WJ, Aridhi S, Devignes MD, Fournier L, Smaïl-Tabbone M, Danese S, Peyrin-Biroulet L. Application of Artificial Intelligence to Gastroenterology and Hepatology. Gastroenterology. 2020 Jan;158:76-92.
3. Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Medicine. 2019 Dec 1;17(1):195.
4. Keane PA, Topol EJ. With an eye to AI and autonomous diagnosis. NPJ Digit Med. 2018;1:40. https://doi.org/10.1038/s41746-018-0048-y.
5. Liu Y, Chen PH, Krause J, Peng L. How to read articles that use machine learning: Users' Guides to the Medical Literature. JAMA. 2019 Nov 12;322(18):1806-16.
6. Shah NH, Milstein A, Bagley SC. Making machine learning models clinically useful. JAMA. 2019 Oct 8;322(14):1351-2.
7. Sendak M, Elish M, Gao M, Futoma J, Ratliff W, Nichols M, Bedoya A, Balu S, O'Brien C. "The Human Body is a Black Box": Supporting Clinical Decision-Making with Deep Learning. arXiv:1911.08089. 2019 Nov 19.
8. de Groof AJ, Struyvenberg MR, van der Putten J, van der Sommen F, Fockens KN, Curvers WL, Zinger S, Pouw RE, Coron E, Baldaque-Silva F, Pech O. Deep-Learning System Detects Neoplasia in Patients With Barrett's Esophagus With Higher Accuracy Than Endoscopists in a Multi-Step Training and Validation Study With Benchmarking. Gastroenterology. 2019 Nov 21.
9. Urban G, Tripathi P, Alkayali T, Mittal M, Jalali F, Karnes W, Baldi P. Deep learning localizes and identifies polyps in real time with 96% accuracy in screening colonoscopy. Gastroenterology. 2018 Oct 1;155(4):1069-78.
10. Wu CC, Yeh WC, Hsu WD, Islam MM, Nguyen PA, Poly TN, Wang YC, Yang HC, Li YC. Prediction of fatty liver disease using machine learning algorithms. Computer Methods and Programs in Biomedicine. 2019 Mar 1;170:23-9.
11. Shung DL, Au B, Taylor RA, Tay JK, Laursen SB, Stanley AJ, Dalton HR, Ngu J, Schultz M, Laine L. Validation of a machine learning model that outperforms clinical risk scoring systems for upper gastrointestinal bleeding. Gastroenterology. 2019 Sep 25.
12. Shalev-Shwartz S, Ben-David S. Understanding Machine Learning: From Theory to Algorithms. New York: Cambridge University Press, 2014.

References from Supplementary Material:

13. Wang P, Berzin TM, Brown JR, Bharadwaj S, Becq A, Xiao X, Liu P, Li L, Song Y, Zhang D, Li Y. Real-time automatic detection system increases colonoscopic polyp and adenoma detection rates: a prospective randomised controlled study. Gut. 2019 Feb 27.
14. Mori Y, Kudo SE, Misawa M, Saito Y, Ikematsu H, Hotta K, Ohtsuka K, Urushibara F, Kataoka S, Ogawa Y, Maeda Y. Real-time use of artificial intelligence in identification of diminutive polyps during colonoscopy: a prospective study. Annals of Internal Medicine. 2018 Sep 18;169(6):357-66.
15. Wu L, Zhang J, Zhou W, An P, Shen L, Liu J, Jiang X, Huang X, Mu G, Wan X, Lv X. Randomised controlled trial of WISENSE, a real-time quality improving system for monitoring blind spots during esophagogastroduodenoscopy. Gut. 2019 Mar 11.
16. Wiens J, Saria S, Sendak M, Ghassemi M, Liu VX, Doshi-Velez F, Jung K, Heller K, Kale D, Saeed M, Ossorio PN. Do no harm: a roadmap for responsible machine learning for health care. Nature Medicine. 2019 Sep;25(9):1337-40.
