Machine learning analyses of automated performance metrics during granular sub-stitch phases predict surgeon experience


Andrew B. Chen, MD(a); Siqi Liang, BE(b); Jessica H. Nguyen, BS(a); Yan Liu, PhD(b); Andrew J. Hung, MD(a),*

(a) Center for Robotic Simulation & Education, Catherine & Joseph Aresty Department of Urology, University of Southern California Institute of Urology, Los Angeles, CA
(b) Computer Science Department, Viterbi School of Engineering, University of Southern California, Los Angeles, CA

Article history: Accepted 21 September 2020; available online xxx.

Abstract

Automated performance metrics objectively measure surgeon performance during a robot-assisted radical prostatectomy. Machine learning has demonstrated that automated performance metrics, especially during the vesico-urethral anastomosis of the robot-assisted radical prostatectomy, are predictive of long-term outcomes such as continence recovery time. This study focuses on automated performance metrics during the vesico-urethral anastomosis, specifically at the stitch versus sub-stitch level, to distinguish surgeon experience. During the vesico-urethral anastomosis, automated performance metrics, recorded by a systems data recorder (Intuitive Surgical, Sunnyvale, CA, USA), were reported for each overall stitch (Ctotal) and its individual components: needle handling/targeting (C1), needle driving (C2), and suture cinching (C3) (Fig 1, A). These metrics were organized into three datasets (GlobalSet [whole stitch], RowSet [independent sub-stitches], and ColumnSet [associated sub-stitches]) (Fig 1, B) and applied to three machine learning models (AdaBoost, gradient boosting, and random forest) to solve two classification tasks: experts (≥100 cases) versus novices (<100 cases) and ordinary experts (≥100 and <2,000 cases) versus super experts (≥2,000 cases). Classification accuracies were compared using analysis of variance. Input features were evaluated with the Jaccard index. From 68 vesico-urethral anastomoses, we analyzed 1,570 stitches broken down into 4,708 sub-stitches. For both classification tasks, ColumnSet best distinguished experts (n = 8) versus novices (n = 9) and ordinary experts (n = 5) versus super experts (n = 3), at accuracies of 0.774 and 0.844, respectively. Feature ranking highlighted EndoWrist articulation and needle handling/targeting as most important in classification. Surgeon performance measured by automated performance metrics on a granular sub-stitch level more accurately distinguishes expertise when compared with summary automated performance metrics over whole stitches.

© 2020 Elsevier Inc. All rights reserved.

Highlights
Topic: Application of machine learning algorithms to predict surgeon experience.
Purpose: To differentiate experts (≥100 cases) and novices (<100 cases), as well as super experts (≥2,000 cases) and ordinary experts (≥100 and <2,000 cases).
State of the art: Automated performance metrics (robotic kinematic and system events data) reported at the stitch and sub-stitch level.
Knowledge gaps: Explore the value of detailed automated performance metrics during suturing sub-stitch maneuvers, in contrast with previous automated performance metrics reported over whole steps of a procedure.
Technology gaps: Compare the performance of various machine learning models when presented with datasets of increasing granularity.
Future directions: This is foundational work toward providing meaningful feedback to surgeons and learners in training.

* Reprint requests: Andrew J. Hung, MD, University of Southern California Institute of Urology, 1441 Eastlake Avenue, Suite 7416, Los Angeles, CA 90089, USA. E-mail address: [email protected] (A.J. Hung).
https://doi.org/10.1016/j.surg.2020.09.020
0039-6060/© 2020 Elsevier Inc. All rights reserved.


Introduction

Surgical skill and technique have been demonstrated to correlate with postoperative clinical outcomes.1-4 Effective evaluation and instruction of surgical trainees is critical to achieving excellent health care outcomes. At the same time, adequate evaluation of surgical skill and expertise requires extensive supervision and remains a challenge in surgical education.5,6

For robotic surgery, various technical assessment tools can be used to measure surgical performance. Manual assessments, which include general and procedure-specific evaluations, are limited by difficulties in scaling, time demands, and limited interrater reliability.7-9 Automated assessment, a developing area of research, can measure and quantify surgical performance directly. Automated performance metrics (APMs) are one such tool, derived directly from kinematic data and robotic systems data.10 In our earlier work, APMs summarized across a whole procedure or whole steps were demonstrated to differentiate surgeon expertise and case volume. During the vesico-urethral anastomosis (VUA) of a robot-assisted radical prostatectomy (RARP), for instance, experts (caseload ≥100 cases) and novices (<100 cases) demonstrated statistically different APM profiles.11 In fact, super experts (≥2,000 cases) demonstrated significantly different APMs than ordinary experts (≥100 and ≤750 cases), along with statistically significantly improved perioperative outcomes.12 Further application of machine learning (ML) has demonstrated that APMs during the VUA are top features in deep learning models predicting clinical outcomes such as time to continence after RARP.13

As ML becomes more prevalent in medical applications, optimizing datasets for improved analysis is paramount. More data in ML have been associated with improved accuracy.14,15 In addition, improved label granularity in supervised learning, that is, a more detailed label (eg, Persian cat versus cat), can improve classification accuracy.16 We apply this principle of improved data granularity to our present analysis of surgical experience in RARP. Previous studies have focused on APMs summated over an entire step (eg, the VUA) or an entire procedure. We now break down a step of the RARP to the sub-stitch level. Herein, we evaluate with ML algorithms whether more granular APMs (at the sub-stitch level) improve classification of surgeon experience. We report APMs during the VUA for each overall stitch (Ctotal) and its individual sub-stitch components: needle handling/targeting (C1), needle driving (C2), and suture cinching (C3) (Fig 1, A).

Materials and Methods

Study design

Under an institutional review board-approved protocol, synchronized surgical video and systems events data during the VUA step of the RARP were recorded directly from da Vinci Si and Xi systems (Intuitive Surgical, Sunnyvale, CA, USA) consecutively from 2016 to 2017, using a custom video and data recorder. RARP cases performed without the da Vinci system data recorder were excluded.

Participants

Participants in the present study were 17 faculty surgeons, fellows, and residents. Surgeons classified a priori as experts (n = 8; ≥100 cases of experience) were compared with novices (n = 9; <100 cases of experience). Experts were further subdivided into super experts (≥2,000 cases; n = 3) and ordinary experts (≥100 but ≤750 cases; n = 5).
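For intuition about what the kinematic APMs described below measure, here is a toy derivation of a few such metrics from sampled instrument tip positions. The 50 Hz sampling rate, field layout, and units are assumptions for illustration; the validated APM definitions come from the cited prior work, not from this sketch.

```python
import numpy as np

def kinematic_apms(positions, dt=0.02):
    """Toy kinematic APMs from instrument tip positions (x, y, z in cm)
    sampled at 1/dt Hz: path length, mean velocity, and travel time.
    Assumed schema for illustration only."""
    steps = np.diff(positions, axis=0)           # displacement per sample
    seg_lengths = np.linalg.norm(steps, axis=1)  # cm moved per sample
    path_length = seg_lengths.sum()
    travel_time = dt * len(seg_lengths)          # seconds
    return {"path_length": path_length,
            "mean_velocity": path_length / travel_time,
            "travel_time": travel_time}

# Example: a short synthetic trajectory standing in for one sub-stitch.
rng = np.random.default_rng(0)
traj = np.cumsum(rng.normal(scale=0.05, size=(150, 3)), axis=0)
print(kinematic_apms(traj))
```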

Fig 1. (A) Sub-stitch components. Suturing can generally be broken into three components, including “needle handling” with needle driver instruments, “needle driving” through tissue, and “suture cinching.” (B) Dataset organization. Automated performance metrics during the VUA are organized into three datasets that differ in granularity. GlobalSet contains the least amount of data as whole stitches are analyzed. RowSet follows with inclusion of sub-stitch components. ColumnSet contains the greatest information per data point as sub-stitch components are evaluated in context of the same stitch.
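To make the dataset organization of Fig 1, B concrete, the sketch below builds the three layouts from a per-sub-stitch table. It is a minimal illustration assuming a tidy tabular layout; the column names (surgeon, stitch_id, phase, path_len, velocity) are hypothetical stand-ins, not the recorder's actual schema.

```python
import pandas as pd

# Hypothetical per-phase APM table: one row per sub-stitch phase
# (C1 = needle handling, C2 = needle driving, C3 = suture cinching),
# with two illustrative APM columns.
apms = pd.DataFrame({
    "surgeon":   ["A", "A", "A", "B", "B", "B"],
    "stitch_id": [1, 1, 1, 1, 1, 1],
    "phase":     ["C1", "C2", "C3", "C1", "C2", "C3"],
    "path_len":  [10.2, 8.1, 4.3, 14.7, 9.9, 6.0],
    "velocity":  [2.1, 1.7, 1.2, 2.8, 2.0, 1.5],
})

# RowSet: each sub-stitch is an independent sample
# (most rows, least context per row).
row_set = apms.copy()

# ColumnSet: sub-stitches of the same stitch stay together as one
# sample, so C1/C2/C3 metrics become side-by-side columns.
column_set = apms.pivot(index=["surgeon", "stitch_id"],
                        columns="phase",
                        values=["path_len", "velocity"])
column_set.columns = [f"{m}_{p}" for m, p in column_set.columns]

# GlobalSet: sub-stitch structure is discarded; metrics are summarized
# over the whole stitch (here, summed path length and mean velocity).
global_set = apms.groupby(["surgeon", "stitch_id"]).agg(
    path_len=("path_len", "sum"),
    velocity=("velocity", "mean"),
)

print(column_set.head())
```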



Data collection

Previously developed and validated APMs were derived from kinematic data (eg, instrument travel time, path length, velocity, EndoWrist [Intuitive Surgical] movements) and system events data (eg, camera movements, third-arm usage). Video review of each case synchronized the timestamp of each individual stitch and sub-stitch with the corresponding systems data to derive the APMs. These metrics were organized into three datasets that differed in granularity: GlobalSet (whole stitch with no sub-stitch designation), RowSet (sub-stitches reported as independent events), and ColumnSet (sub-stitches associated with a whole stitch) (Fig 1, B). Each dataset was applied to three ML models (AdaBoost, gradient boosting, and random forest) to solve two classification tasks: comparison 1, experts versus novices, and comparison 2, ordinary experts versus super experts.

We used three ML models based on ensemble learning, a method that combines predictions from multiple simple classifiers to obtain a final prediction with better classification performance. Random forest uses a bagging strategy, generating individual independent decision trees that contribute to a final majority vote. AdaBoost is a boosting model that updates the weights of data points in sequence, weighting each weak classifier based on its corresponding error, with the final prediction being a weighted majority. Gradient boosting is a boosting model that trains the current weak classifier to learn the residual error from the previous step, with the final prediction being the sum of all predictions.

We randomly selected 80% of each dataset as the training set, with the remainder used as the test set, maintaining the partition with a fixed random seed. Mean classification accuracies from the various model/dataset combinations for the two classification tasks were compared using analysis of variance. The stability of the feature importance ranking from every ML model on each classification task was evaluated using the Jaccard index and weighted to output the final feature importance rank.
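As a rough illustration of this setup, the sketch below trains the three ensemble classifiers on a fixed 80/20 split and compares accuracy distributions across datasets with a one-way analysis of variance. It is a minimal sketch using scikit-learn defaults, not the authors' actual configuration; the random feature matrices and labels are placeholders standing in for the APM datasets and surgeon-experience labels.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.ensemble import (AdaBoostClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Placeholder feature matrices standing in for two APM datasets of
# different granularity (eg, RowSet vs ColumnSet); binary labels
# stand in for expert vs novice.
datasets = {"RowSet": rng.normal(size=(500, 30)),
            "ColumnSet": rng.normal(size=(500, 30))}
y = rng.integers(0, 2, size=500)

models = {"AdaBoost": AdaBoostClassifier,
          "Gradient boosting": GradientBoostingClassifier,
          "Random forest": RandomForestClassifier}

def accuracy_runs(model_cls, X, y, n_runs=10):
    # Fixed split seed keeps the 80/20 partition identical everywhere;
    # only the model's internal randomness varies across runs.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=42)
    return [model_cls(random_state=s).fit(X_tr, y_tr).score(X_te, y_te)
            for s in range(n_runs)]

for name, cls in models.items():
    accs = {d: accuracy_runs(cls, X, y) for d, X in datasets.items()}
    # One-way ANOVA comparing mean accuracy between the two datasets.
    _, p = f_oneway(accs["RowSet"], accs["ColumnSet"])
    print(f"{name}: RowSet {np.mean(accs['RowSet']):.3f} vs "
          f"ColumnSet {np.mean(accs['ColumnSet']):.3f} (P = {p:.3f})")
```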

Results

The median number of cases performed by a novice was 20 (interquartile range: 5-40). The median number of cases performed by an expert was 275 (interquartile range: 150-2,000). The subcategories of ordinary experts and super experts had median caseloads of 150 and 2,000, respectively. We analyzed 68 VUAs, which consisted of 1,570 stitches further divided into 4,708 sub-stitches. Of the sub-stitches, 1,571 were needle handling/targeting (C1), 1,568 were needle driving (C2), and 1,569 were suture cinching (C3). Thirty APMs were analyzed per sub-stitch to classify experts versus novices and ordinary experts versus super experts.

Comparison 1

When attempting to differentiate between experts and novices, we observed a distinct hierarchy of accuracy between datasets that was consistent in each ML model. ColumnSet, providing sub-stitch details along with association to specific stitches, performed with the highest accuracy within each ML model (Table I). RowSet followed in accuracy, and GlobalSet had the lowest accuracy. The one exception was GlobalSet outperforming RowSet with the random forest model. Overall, the best performing combination of dataset and model was ColumnSet analyzed by random forest, with a prediction accuracy of 0.733 ± 0.005. The worst performing combination of dataset and model was GlobalSet analyzed by gradient boosting, with an accuracy of 0.672 ± 0.001.

Table I. Performance accuracy when distinguishing experts versus novices (comparison 1)

Model              Datasets compared      Accuracy (mean ± SD)                      P value
AdaBoost           ColumnSet vs RowSet    0.72747 ± 0.01641 vs 0.71218 ± 0.00863    .003
AdaBoost           RowSet vs GlobalSet    0.71218 ± 0.00863 vs 0.69879 ± 0.01829    <.001
Gradient boosting  ColumnSet vs RowSet    0.72675 ± 0.01033 vs 0.72094 ± 0.00619    .13
Gradient boosting  RowSet vs GlobalSet    0.72094 ± 0.00619 vs 0.67241 ± 0.00111    <.001
Random forest      ColumnSet vs RowSet    0.73274 ± 0.00528 vs 0.71592 ± 0.00300    <.001
Random forest      RowSet vs GlobalSet    0.71592 ± 0.00300 vs 0.72751 ± 0.00908    <.001

Comparison 2

In the second comparison, ColumnSet, as analyzed by AdaBoost, labeled ordinary experts and super experts the most accurately, with a prediction accuracy of 0.801 ± 0.014 (Table II). The worst performing combination was GlobalSet, as analyzed by gradient boosting, with an accuracy of 0.769 ± 0.009. Furthermore, we observed that the ML algorithms were more accurate in differentiating between super experts and ordinary experts than in distinguishing experts versus novices (P < .001).

Finally, feature ranking was performed on the ColumnSet data to identify which APMs contributed most to prediction accuracy. We noted two specific trends. Of the top 10 features differentiating experts and novices, 7 involved EndoWrist (Intuitive Surgical) articulation as opposed to other kinematic metrics (Fig 2). In comparing super experts and ordinary experts, we observed that 7 of the top 10 features were APMs during the needle handling/targeting phase of suturing (Fig 3).

Discussion

The present study demonstrated that ML can accurately classify surgeon experience based on individual stitches and sub-stitches in the VUA of a RARP. We observed a difference not only between experts and novices but also between super experts and ordinary experts. It is interesting that the ML models were able to classify super experts versus ordinary experts more accurately than experts versus novices. This held true for every ML model constructed on each dataset (Table II). Without reading into this outcome excessively, we believe the results confirm that the evolution of surgeon performance continues beyond the ordinary expert level.

We observed that the best performing models with the greatest accuracy in comparisons 1 and 2 used the ColumnSet dataset (individual sub-stitch data with further association to whole stitches). The worst performing models in comparisons 1 and 2 used the GlobalSet dataset (no sub-stitch data). These results support our hypothesis that more granularity in data improves classification accuracy. GlobalSet provided the least information of the datasets. Compared with RowSet, ColumnSet provided additional context by grouping associated sub-stitches.

The feature rank derived from comparison 1 demonstrated that EndoWrist articulation metrics throughout C1, C2, and C3 were a major contributory factor in differentiating experts and novices.


Table II. Performance accuracy when distinguishing super experts versus ordinary experts (comparison 2)

Model              Datasets compared      Accuracy (mean ± SD)              P value
AdaBoost           ColumnSet vs RowSet    0.801 ± 0.014 vs 0.772 ± 0.009    <.001
AdaBoost           RowSet vs GlobalSet    0.772 ± 0.009 vs 0.774 ± 0.009    .14
Gradient boosting  ColumnSet vs RowSet    0.770 ± 0.006 vs 0.784 ± 0.006    <.001
Gradient boosting  RowSet vs GlobalSet    0.784 ± 0.006 vs 0.759 ± 0.002    <.001
Random forest      ColumnSet vs RowSet    0.761 ± 0.007 vs 0.761 ± 0.004    .959
Random forest      RowSet vs GlobalSet    0.761 ± 0.004 vs 0.769 ± 0.009    <.001

Fig 2. Top-ranked features distinguishing expert versus novice (comparison 1). Wrist articulation metrics rank highly in differentiating novices and experts.

Fig 3. Top-ranked features distinguishing super expert versus ordinary expert (comparison 2). Needle handling (C1) metrics rank highly in differentiating ordinary experts and super experts.
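As an illustration of the stability measure named in the Methods, the sketch below scores agreement between top-10 feature sets from repeated model fits using the Jaccard index (intersection over union). The model, placeholder data, and k = 10 are assumptions for demonstration, not the authors' exact weighting procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def top_k_features(model, k=10):
    # Indices of the k most important features for a fitted model.
    return set(np.argsort(model.feature_importances_)[-k:])

def jaccard(a, b):
    # Intersection over union of two feature sets.
    return len(a & b) / len(a | b)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))    # placeholder APM feature matrix
y = rng.integers(0, 2, size=500)  # placeholder experience labels

# Fit the same model with different seeds and measure how stable the
# top-10 feature set is across runs.
tops = []
for seed in range(5):
    model = RandomForestClassifier(random_state=seed).fit(X, y)
    tops.append(top_k_features(model))

scores = [jaccard(tops[i], tops[j])
          for i in range(len(tops)) for j in range(i + 1, len(tops))]
print(f"mean pairwise Jaccard: {np.mean(scores):.2f}")
```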

On the other hand, feature ranking demonstrated that the top APMs differentiating super experts and ordinary experts involve metrics during the needle handling phase of suturing. Interpretation of the top-ranking APMs may help guide education and learning. One possible takeaway is that novices should focus on the use of instrument articulation throughout all aspects of suturing. As skills mature, attention should also be paid to refining needle handling at the start of every stitch.

The following study limitations should be acknowledged. Our study is based on the experience of a single center with surgeons who may share a similar surgical style; external validation with an outside dataset is required. In addition, although the large datasets were analyzed by ML algorithms, the data were derived from manual segmentation of the sub-stitch phases, and scaling this task to analyze every surgeon's VUA remains time-consuming at present. Future directions include automatic segmentation of stitches into sub-stitches with the assistance of ML and advances in computer vision. We also aim to correlate sub-stitch metrics with clinical outcomes (eg, anastomotic leak).

In summary, our study demonstrates that surgeon performance measured by APMs, when reported on a detailed and granular sub-stitch level, more accurately distinguishes surgeon experience when compared with summary APMs over whole stitches. Further investigation is warranted to translate these findings into formative feedback for surgeons and surgical trainees.

Conflict of interest/Disclosure

Andrew J. Hung has financial disclosures with Quantgene, Inc (consultant), Mimic Technologies, Inc (consultant), and Johnson & Johnson (consultant).

Funding/Support

This study is supported in part by the National Institute of Biomedical Imaging and Bioengineering of the National Institutes of Health under Award Number K23EB026493 and an Intuitive Surgical Clinical Research Grant.


Acknowledgments

We would like to acknowledge Anthony Jarc of Intuitive Surgical Inc Clinical Research (Norcross, GA, USA) for the processing of the automated performance metrics.

References

1. Birkmeyer JD, Finks JF, O'Reilly A, et al. Surgical skill and complication rates after bariatric surgery. N Engl J Med. 2013;369:1434-1442.
2. Goldenberg MG, Goldenberg L, Grantcharov TP. Surgeon performance predicts early continence after robot-assisted radical prostatectomy. J Endourol. 2017;31:858-863.
3. Hogg ME, Zenati M, Novak S, et al. Grading of surgeon technical performance predicts postoperative pancreatic fistula for pancreaticoduodenectomy independent of patient-related variables. Ann Surg. 2016;264:482-491.
4. Fecso AB, Szasz P, Kerezov G, et al. The effect of technical performance on patient outcomes in surgery. Ann Surg. 2017;265:492-501.
5. Scott DJ, Rege RV, Bergen PC, et al. Measuring operative performance after laparoscopic skills training: edited videotape versus direct observation. J Laparoendosc Adv Surg Tech A. 2000;10:183-190.
6. Deal SB, Lendvay TS, Haque MI, et al. Crowd-sourced assessment of technical skills: an opportunity for improvement in the assessment of laparoscopic surgical skills. Am J Surg. 2016;211:398-404.
7. Raza SJ, Field E, Jay C, et al. Surgical competency for urethrovesical anastomosis during robot-assisted radical prostatectomy: development and validation of the robotic anastomosis competency evaluation. Urology. 2015;85:27-32.
8. Goh AC, Goldfarb DW, Sander JC, et al. Global evaluative assessment of robotic skills: validation of a clinical assessment tool to measure robotic surgical skills. J Urol. 2012;187:247-252.
9. Prebay ZJ, Peabody JO, Miller DC, et al. Video review for measuring and improving skill in urological surgery. Nat Rev Urol. 2019;16:261-267.
10. Chen J, Cheng N, Cacciamani G, et al. Objective assessment of robotic surgical technical skill: a systematic review. J Urol. 2019;201:461-469.
11. Chen J, Oh PJ, Cheng N, et al. Use of automated performance metrics to measure surgeon performance during robotic vesicourethral anastomosis and methodical development of a training tutorial. J Urol. 2018;200:895-902.
12. Hung AJ, Oh PJ, Chen J, et al. Experts vs super-experts: differences in automated performance metrics and clinical outcomes for robot-assisted radical prostatectomy. BJU Int. 2019;123:861-868.
13. Hung AJ, Chen J, Ghodoussipour S, et al. A deep-learning model using automated performance metrics and clinical features to predict urinary continence recovery after robot-assisted radical prostatectomy. BJU Int. 2019;124:487-495.
14. Banko M, Brill E. Scaling to very large corpora for natural language disambiguation. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics; 2001:26-33.
15. Halevy A, Norvig P, Pereira F. The unreasonable effectiveness of data. IEEE Intell Syst. 2009;24:8-12.
16. Chen Z, Ding R, Chin TW, et al. Understanding the impact of label granularity on CNN-based image classification. In: 2018 IEEE International Conference on Data Mining Workshops (ICDMW). Piscataway, NJ: IEEE; 2018:895-904.