Clinical Neurology and Neurosurgery 192 (2020) 105732
Artificial intelligence outperforms human students in conducting neurosurgical audits
Maksymilian A. Brzezicki a,*, Nicholas E. Bridger b, Matthew D. Kobetić b, Maciej Ostrowski c, Waldemar Grabowski d, Simran S. Gill e, Sandra Neumann f

a Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, UK
b Faculty of Health Sciences, University of Bristol, UK
c Medical University of Lodz, Lodz, Poland
d Institute of Physics, University of Zielona Gora, Zielona Gora, Poland
e St. George's, University of London Medical School, London, UK
f Department of Physiology and Pharmacology, Clinical Research and Imaging Centre, University of Bristol, UK
ARTICLE INFO

Keywords:
Neurosurgery audits
Artificial intelligence
Computational neuroscience
Medical audits

ABSTRACT
Objectives: Neurosurgical audits are an important part of improving the safety, efficiency and quality of care, but they require considerable resources, time, and funding. The advent of Artificial Intelligence-based algorithms offers a novel, more economically viable solution. The aim of this study was to evaluate whether such an algorithm can indeed outperform humans in this task.

Patients & methods: Forty-six human students were invited to inspect the clinical notes of 45 medical outliers on a neurosurgical ward. The task was to produce a report containing a quantitative analysis of the scale of the problem (e.g. time to discharge) and a qualitative list of suggestions on how to improve patient flow, quality of care, and healthcare costs. The Artificial Intelligence-based Frideswide algorithm (FwA) was used to analyse the same dataset.

Results: The FwA produced 44 recommendations, whilst human students reported an average of 3.89. The mean time to deliver the final report was 5.80 s for the FwA and 10.21 days for humans. The mean relative error of the human reports was 14.75 % for total waiting times and 81.06 % for times between investigations. The report produced by the FwA was entirely factually correct. Thirteen of 46 students submitted an unfinished audit, and 3 of 46 made an overdue submission. Thematic analysis revealed numerous internal contradictions in the recommendations given by human students.

Conclusion: The AI-based algorithm can produce significantly more recommendations in a shorter time. The audits conducted by the AI are more factually accurate (0 % error rate) and logically consistent (no thematic contradictions). This study shows that the algorithm can produce reliable neurosurgical audits for a fraction of the resources required by human means.
1. Introduction

Healthcare audits provide both quantitative information on the current state of affairs and recommendations on how to improve clinical outcomes. They are an integral part of every neurosurgical department, and auditing skills are therefore among the key learning outcomes of a competent graduate and a well-rounded clinical doctor [1]. They are also central to the Evidence-Based Medicine (EBM) element of the undergraduate curriculum and are encouraged as early as the pre-clinical stage of the course [2,3]. Recently, a new Artificial Intelligence (AI)-based algorithm was developed with the aim of delivering neurosurgical audits efficiently, in a very short period of time [4]. However, concerns have been raised as to whether humans, whilst slower in their work, could provide reports of higher quality.
* Corresponding author at: Nuffield Department of Clinical Neurosciences, Level 6, John Radcliffe Hospital, Headley Way, University of Oxford, Oxford, UK.
E-mail addresses: [email protected] (M.A. Brzezicki), [email protected] (N.E. Bridger), [email protected] (M.D. Kobetić), [email protected] (M. Ostrowski), [email protected] (W. Grabowski), [email protected] (S.S. Gill), [email protected] (S. Neumann).
@maxbrzezicki (M.A. Brzezicki)
https://doi.org/10.1016/j.clineuro.2020.105732
Received 12 January 2020; Received in revised form 2 February 2020; Accepted 7 February 2020
The purpose of this study was to assess whether human students could outperform the AI in that task.

1.1. Problem choice

Audits can be commissioned routinely or where there is an important shortcoming in the delivery of a service, such as an increase in infection rates [5] or patient-flow concerns [6]. Lack of bed availability can be an important cause of surgical cancellations [7] and is one of the preventable, administrative errors in patient flow [8]. This can often happen because of delays in the diagnosis and treatment of neurological outliers on a neurosurgical ward. Medical outliers are also associated with delays in discharge, which can result in harm events and an increase in healthcare costs [9]. The students were therefore asked to investigate the scale of the problem of undiagnosed neurological medical outliers in a neurosurgical unit, and to propose concrete recommendations to improve patient flow and delivery of service for both the medical outliers and the neurosurgical patients waiting for their procedures.

2. Patients & methods

2.1. Participant characteristics

Forty-six students participated in the experiment. None had prior experience of performing audits (mean = 0 audits). All students but one had, in addition to their medical training, a background in mathematics, physics or statistical sciences. All participants were at the pre-registration stage, i.e. they had not yet received full recognition by their professional clinical body (the General Medical Council), and were at the level of pre-clinical teaching, where theoretical material dominates over practical material.

To achieve a fair comparison, all participants were given written material on how to perform audits, produced by the University Hospitals Bristol Clinical Audit Team [10]. A thirty-minute face-to-face refresher was also offered on the day of the study.

2.2. Experimental set-up

The participants were given forty-five sets of notes of outliers, i.e. medical neurological patients on a virtual neurosurgical ward. To focus on analytical skills, and to flatten the medical-knowledge learning curve, only patients with one disease, frontotemporal dementia (FTD), were included. To aid the inspection of the material, all clinical data were extracted from letters, discharge summaries, investigation requests and reports, imported from electronic patient records (EPR), and presented in a structured manner. The main page contained a list of all notes; the sub-pages contained a chronological list of all ward admissions, with presenting symptoms, working diagnosis and actions undertaken throughout the admission, all taken from word-processed documents available on the EPR.

The aim of the simulated audit was to produce (1) a report on the scale of the problem using quantitative measures and (2) a list of suggested service improvements based on qualitative analysis (pro forma in Appendix 1). The participants were given a free choice of analytical methods but were encouraged to treat time in hospital, invasiveness, and cost to the NHS as priorities in their decision-making process.

All participants were given three hours to complete the initial stage of their audit. During that time, they were allowed to use the Internet, ask questions, cooperate, and access any scientific materials they wished to consult. After returning the pro forma, they were allowed a further 30 days to inspect the data at their own pace and submit further findings.

2.3. Artificial intelligence algorithm

The human-generated auditing work was compared to that of the Artificial Intelligence (AI)-powered Frideswide Algorithm (FwA) [4]. This algorithm has previously analysed the same set of notes and generated a comparable report [11]. The FwA was run once for each of the human participants, on a standard PC (Intel Core i7-4790 CPU @ 3.60 GHz, 16 GB RAM, Windows 10 64-bit, x64-based processor).
2.4. Comparison methods

The error rates were calculated as absolute error, i.e. the mean absolute difference between reported and actual values, and as relative error, i.e. the ratio of the absolute error to the absolute true value, given in percentages. Microsoft Excel 365 Pro Plus and IBM SPSS Statistics 24 were used to compute the actual values.
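As an illustration of these error metrics, the sketch below computes the mean absolute and mean relative errors in Python. This is a minimal, hypothetical example rather than the authors' actual pipeline; the function name and the sample waiting times are invented for demonstration.

```python
import numpy as np

def error_rates(reported, actual):
    """Mean absolute error, and mean relative error in %, between
    participant-reported values and the actual (reference) values."""
    reported = np.asarray(reported, dtype=float)
    actual = np.asarray(actual, dtype=float)
    abs_err = np.abs(reported - actual)          # absolute error per data point
    rel_err = abs_err / np.abs(actual) * 100.0   # relative error, in %
    return abs_err.mean(), rel_err.mean()

# Hypothetical waiting times (months): values reported by a student vs. actual
mae, mre = error_rates([10.0, 25.0, 31.0], [12.0, 28.0, 30.0])
print(f"mean absolute error = {mae:.2f} months; mean relative error = {mre:.2f} %")
```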
2.5. Ethical considerations

The study was reviewed by the Ethics Committee (EC) of the Faculty of Health Sciences, University of Bristol (reference: 66221). All students were provided with written information about the study before participation, as approved by the EC. Informed consent was obtained by GCP-trained investigators who held an appropriate delegation at the time.

2.6. Statistical analysis

The main outcome was the difference in the number of recommendations generated by the human participants vs the FwA. The data were not normally distributed (Shapiro-Wilk p = 0.016; Kolmogorov-Smirnov p = 0.001; Q-Q plot negatively skewed), and thus a non-parametric Mann-Whitney U test was performed to test for differences. Secondary outcomes included time of delivery and the reliability of the students (number of final reports submitted); both were not normally distributed (Shapiro-Wilk p < 0.001; Kolmogorov-Smirnov p < 0.001; Q-Q plot skewed bilaterally) and were thus also analysed by Mann-Whitney U test.

Thematic analysis [12] of the recommendations was performed by initial familiarisation, data reduction, data compilation, searching for themes and a thematic review. MAB, MO and NEB performed independent analyses that were then reviewed by MDK and SSG. Statistical significance was set at p < 0.05.
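The testing procedure described above (normality checks, then a non-parametric comparison) can be sketched as follows in Python with SciPy. The data here are synthetic stand-ins, since the study data are not reproduced in this paper; the Poisson draw is an arbitrary assumption used only to produce plausible recommendation counts.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stand-in data: hypothetical recommendation counts per human participant,
# and the FwA's 44 recommendations for each of its 46 runs
human_recs = rng.poisson(lam=4, size=46).astype(float)
fwa_recs = np.full(46, 44.0)

# Normality checks on the human counts, as in Section 2.6
sw_stat, sw_p = stats.shapiro(human_recs)
ks_stat, ks_p = stats.kstest(stats.zscore(human_recs), "norm")

# Non-parametric comparison, used because normality is rejected
u_stat, u_p = stats.mannwhitneyu(human_recs, fwa_recs, alternative="two-sided")

print(f"Shapiro-Wilk p = {sw_p:.3f}; Kolmogorov-Smirnov p = {ks_p:.3f}")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {u_p:.3g}")
```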
3. Results

3.1. Primary outcomes

Out of 46 participants, all (100 %) returned the initial part of the audit. After a month, 33 students had submitted their final report (71.74 %), with 3 of them overdue by up to 11 days. The average number of recommendations made by humans was 3.89 (range 0–16; SE = 0.47), whereas the number of recommendations given by the AI was 44 (p < 0.001) (Fig. 1A). It took humans an average of 10.21 days to deliver the final report (range 0–41 days; SE = 1.91), versus 5.80 s (range 3.8–24.5 s; SE = 0.60) for the FwA (p < 0.001) (Fig. 1B).

3.2. Accuracy of quantitative analysis

All but four participants reported a quantitative analysis that was factually inaccurate. The mean absolute error for total waiting times (to receive a final diagnosis) was 4.19 months (relative error = 14.75 %). The reported waiting times between hospital appointments differed from the actual values by an average of 3.04 months (relative error = 81.06 %).
Fig. 1. (A) Mean number of recommendations and (B) time to final report submission, as produced by the human students and the AI algorithm. The ranges represent 95 % confidence intervals; ** p < 0.01.
The most common confounding diagnosis was correctly identified in 39.40 % of reports, and 63.63 % of reports included the most common misdiagnosis among their top three choices. The AI-based analysis was entirely factually correct (0 % error rate), i.e. all answers were congruent with the results given by Microsoft Excel 365 Pro Plus and IBM SPSS Statistics 24 (all three agreed).

3.3. Thematic analysis

The main themes of the suggestions and recommendations are presented in Figs. 2 and 3. The largest group of recommendations centred on the choice of investigations for the patients (85 entries in total). These, however, were very often contradictory, i.e. some students recommended the use of certain radiological modalities whilst others advised against them. A summary of these contradictions is visualised in Fig. 4.

4. Discussion

The results clearly indicate that the AI is able to produce significantly more clinical recommendations than humans, and it does so in a considerably shorter period of time. The results given by the AI are also more factually accurate and more logically consistent. For example, the FwA produces consistent recommendations for the use of investigations where humans give contradictory answers. The FwA's suggestions are also more detailed and are given with concrete evidence on how to improve the relevant clinical outcome ([11]; Fig. 4). In comparison, human suggestions such as "Improving general diagnostic skill", "Reducing waiting times (not stated how)" or "Courses for doctors to be more professional" are very vague and unlikely to result in a tangible clinical improvement.

Humans also produced recommendations that are unlikely to be financially feasible. Administering a brain biopsy or a DNA test to every patient on a neurosurgical ward, whilst diagnostically advantageous, would place a heavy burden on already stretched NHS resources.

The FwA also pays more attention to the priorities set out in the task brief. Human students suggested keeping patients on the wards for longer, until a precise diagnosis could be reached and appropriate treatment undertaken. This, in fact, would prolong the stay of the medical outliers on the neurosurgical ward and thus defeat the purpose of conducting the audit. The patients in question can become medically fit for discharge, so it is more desirable to investigate and treat them on an outpatient basis.

4.1. Limitations

One obvious limitation of the study was the choice of the student population. It could be argued that students are inexperienced and lack both the clinical and the academic skills to adequately judge patients' notes and arrive at a feasible list of recommendations. This, however, is a true reflection of current medical training. Whilst most students say they appreciate the importance of EBM, only 49 % of them will take part in a research or audit project during their undergraduate career [2]. Formal teaching on how to set up and run a research project is also very poor. As a result, this is, in fact, the population who will perform audits as part of their training, and the results of these analyses may be acted upon in the clinical setting. Asking consultants and experienced neurosurgical trainees to perform every audit would inevitably increase its quality but is less feasible given the lack of time and resources available to these clinicians.

The limitations in knowledge and academic skill could have been compensated for by remedial teaching offered by an experienced clinician. Indeed, the participants in our study were offered additional training on the pathophysiology of the disease and on auditing techniques (which they all attended), as well as prior written information about how to perform an audit. We accept, however, that more could have been done to prepare our participants for the task. Indeed, students can gain many practical skills and expand their specialist knowledge whilst doing an audit; this learning opportunity must not be undervalued.

Another argument could be made that the results of the human population approximate the true value when all the audits are considered together. However, whilst this could be a valid point in research terms, real-life audits are often performed individually or by a small group of people; it would be a considerable waste of resources to ask forty-six students to perform one audit when an accurate AI-based solution is available instead. An alternative study could be devised in which students have access to AI-powered tools; perhaps this would result in better and more practically oriented quality-improvement solutions.

Artificial Intelligence also has several limitations. Its results are based on the set of priorities and preferred decision-making outcomes chosen by the designers, which can have ethical and practical consequences. The FwA may not be able to fully recognise the whole of patient care, and priorities can differ across clinical settings. All audits rely on accurate data recording and reproducible clinical examination standards, which can vary across neurological centres. AI requires an error-free raw data input and has no ability to spot obvious transcription errors or disease mislabelling; a human student could potentially be better equipped to recognise these errors. Forty-six changes can also be difficult to implement in full; improvements are thus likely to be grouped into larger thematic clusters. This would effectively reduce the real-life number of recommendations, which was the primary outcome of this study. Furthermore, we could have incentivised the students to produce more recommendations, e.g. by asking them to return a minimum number or by offering more examples in the instructions. However, this could have limited their creativity and applied undue pressure that is not paralleled in real life.

Finally, the results of the study do not, on their own, translate into a tangible change in neurosurgical practice. The true merit of the experiment is that it provides the first proof that this method of auditing could be used safely and efficiently, introducing a standard superior to the one currently utilised in clinical practice. The benefits to patients are, of course, heavily dependent on the implementation of this method in future neurosurgical audits. Thus, the results of an AI audit will never stand alone or be applied in daily clinical work without proper review by auditors, management, and medical staff. AI should be viewed as a friendly advisor, a colleague, rather than a threat or a bully to human intelligence.
Fig. 2. Thematic analysis of the recommendations produced by the human students. The figures in brackets represent the total number of suggestions of a given kind, produced across the entire cohort.
5. Conclusion

Constant audit cycles are key to improving the quality of care delivered to neurosurgical patients. They can improve the safety and efficiency of patient flow and ensure that no procedures are cancelled because of a lack of beds on the ward. However, conducting an audit requires considerable expertise, appraisal skills, and academic integrity, all of which consume time and resources that could be spent on delivering frontline services within the NHS. Clinical audits performed by students can be an attractive alternative to taking clinicians away from ward or theatre work; however, the quality of such reports can be questionable in terms of the accuracy of the quantitative analysis and the logical integrity of the qualitative recommendations. The FwA offers an AI-based alternative that can deliver more robust reports, with more detailed recommendations, within a timeframe that would not be feasible by human means.
Fig. 3. Thematic analysis of the recommendations produced by the Artificial Intelligence algorithm. The figures in brackets represent the total number of suggestions of a given kind.
CRediT authorship contribution statement

Maksymilian A. Brzezicki: Conceptualization, Methodology, Software, Investigation, Project administration, Writing - original draft. Nicholas E. Bridger: Investigation, Writing - original draft. Matthew D. Kobetić: Investigation, Writing - review & editing. Maciej Ostrowski: Conceptualization, Methodology, Writing - original draft. Waldemar Grabowski: Investigation, Writing - review & editing. Simran S. Gill: Investigation, Writing - review & editing. Sandra Neumann: Conceptualization, Methodology, Supervision, Project administration, Writing - review & editing.
Fig. 4. Summary of the numbers of recommendations for and against a given procedure, as reported by the entire cohort of students. DAT = DaTSCAN, a nuclear medicine scan utilising Ioflupane (123I); CT = Computed Tomography; SPECT = Single-Photon Emission Computed Tomography; MRI = Magnetic Resonance Imaging.
Declaration of Competing Interest

The authors report no conflicts of interest.

Acknowledgements
The authors would like to thank Mr Kuba Kowalczyk for his immense help with study logistics and with popularising the idea of research and evidence-based medicine among students.

Appendix A. Supplementary data

Supplementary material related to this article can be found, in the online version, at doi:https://doi.org/10.1016/j.clineuro.2020.105732.

References

[1] J.G. Simpson, J. Furnace, J. Crosby, A.D. Cumming, P.A. Evans, M.F.B. David, R.M. Harden, D. Lloyd, H. McKenzie, J.C. McLachlan, G.F. McPhate, I.W. Percy-Robb, S.G. MacPherson, The Scottish doctor–learning outcomes for the medical undergraduate in Scotland: a foundation for competent and reflective practitioners, Med. Teach. 24 (2002) 136–143, https://doi.org/10.1080/01421590220120713.
[2] M.F. Griffin, S. Hindocha, Publication practices of medical students at British medical schools: experience, attitudes and barriers to publish, Med. Teach. 33 (2011) e1–e8, https://doi.org/10.3109/0142159X.2011.530320.
[3] D.M. Rhodes, R. Ashcroft, R.A. Atun, G.K. Freeman, K. Jamrozik, Teaching evidence-based medicine to undergraduate medical students: a course integrating ethics, audit, management and clinical epidemiology, Med. Teach. 28 (2006) 313–317, https://doi.org/10.1080/01421590600624604.
[4] M.A. Brzezicki, M.D. Kobetić, S. Neumann, Frideswide—an artificial intelligence deep learning algorithm for audits and quality improvement in the neurosurgical practice, Int. J. Surg. Lond. Engl. 43 (2017) 56–57, https://doi.org/10.1016/j.ijsu.2017.05.038.
[5] A. Nagar, P. Yew, D. Fairley, M. Hanrahan, S. Cooke, I. Thompson, W. Elbaz, Report of an outbreak of Clostridium difficile infection caused by ribotype 053 in a neurosurgery unit, J. Infect. Prev. 16 (2015) 126–130, https://doi.org/10.1177/1757177414560250.
[6] A.S. Kamat, A. Parker, Effect of perioperative inefficiency on neurosurgical theatre efficacy: a 15-year analysis, Br. J. Neurosurg. 29 (2015) 565–568, https://doi.org/10.3109/02688697.2015.1019423.
[7] R. Kaddoum, R. Fadlallah, E. Hitti, F. El-Jardali, G. El Eid, Causes of cancellations on the day of surgery at a tertiary teaching hospital, BMC Health Serv. Res. 16 (2016), https://doi.org/10.1186/s12913-016-1475-6.
[8] A. González-Arévalo, J.I. Gómez-Arnau, F.J. DelaCruz, J.M. Marzal, S. Ramírez, E.M. Corral, S. García-del-Valle, Causes for cancellation of elective surgical procedures in a Spanish general hospital, Anaesthesia 64 (2009) 487–493, https://doi.org/10.1111/j.1365-2044.2008.05852.x.
[9] N. Stylianou, R. Fackrell, C. Vasilakis, Are medical outliers associated with worse patient outcomes? A retrospective study within a regional NHS hospital using routine data, BMJ Open 7 (2017) e015676, https://doi.org/10.1136/bmjopen-2016-015676.
[10] U.H. Bristol, How To: Set an Audit Aim, Objectives & Standards [WWW Document], 2009. URL http://www.uhbristol.nhs.uk/files/nhs-ubht/4%20How%20to%20Aim%20Objectives%20and%20Standards%20v3.pdf (accessed 7.30.18).
[11] M.A. Brzezicki, M.D. Kobetić, S. Neumann, C. Pennington, Diagnostic accuracy of frontotemporal dementia. An artificial intelligence-powered study of symptoms, imaging and clinical judgement, Adv. Med. Sci. 64 (2019) 292–302, https://doi.org/10.1016/j.advms.2019.03.002.
[12] B. Brown, Qualitative Methodologies: Ethnography, Phenomenology, Grounded Theory and More, 2015.