Machine learning-based augmented reality for improved surgical scene understanding


Computerized Medical Imaging and Graphics 41 (2015) 55–60


Olivier Pauly a,c,*, Benoit Diotte a, Pascal Fallavollita a, Simon Weidert b, Ekkehard Euler b, Nassir Navab a

a Computer Aided Medical Procedures, Technische Universität München, Germany
b Chirurgische Klinik und Poliklinik Innenstadt, München, Germany
c Institute of Biomathematics and Biometry, Helmholtz Zentrum München, Germany

Article history: Received 9 January 2014; Received in revised form 8 April 2014; Accepted 9 June 2014

Abstract

In orthopedic and trauma surgery, AR technology can support surgeons in the challenging task of understanding the spatial relationships between the anatomy, the implants and their tools. In this context, we propose a novel augmented visualization of the surgical scene that intelligently mixes the different sources of information provided by a mobile C-arm combined with a Kinect RGB-Depth sensor. To this end, we introduce a learning-based paradigm that aims at (1) identifying the relevant objects or anatomy in both Kinect and X-ray data, and (2) creating an object-specific pixel-wise alpha map that permits relevance-based fusion of the video and the X-ray images within one single view. In 12 simulated surgeries, we show very promising results, aiming at providing surgeons with a better surgical scene understanding as well as an improved depth perception.

1. Introduction

In orthopedic and trauma surgery, the introduction of AR technology such as the camera augmented mobile C-arm promises to support surgeons in their understanding of the spatial relationships between anatomy, implants and their surgical tools [1,2]. By using an additional color camera mounted so that its optical center coincides with the X-ray source, the CamC system provides an augmented view created through the superimposition of X-ray and video images using alpha blending. In other words, the resulting image is a linear combination of the optical and the X-ray image using the same mixing coefficient (alpha) over the whole image domain. While this embodies a simple and intuitive solution, the superimposition of additional X-ray information can harm the surgeon's understanding of the scene when the field of view becomes highly cluttered (e.g. by surgical tools). It becomes more and more difficult to quickly recognize and differentiate structures in the overlaid image. Moreover, the depth perception of the surgeon is altered, as the X-ray anatomy appears on top of the scene in the optical image.

In both the X-ray and the optical image, not all pixels in the image domain have the same relevance for a good perception and understanding of the scene. Indeed, in the X-ray, while all pixels that belong to the patient's bone and soft tissues have a high relevance for surgery, pixels belonging to the background do not provide any information. Concerning the optical images, it is crucial to recognize the different objects interacting in the surgical scene, e.g. background, surgical tools or the surgeon's hands. First, this permits improving perception by preserving the natural occlusion cues when the surgeon's hands or instruments occlude the augmented scene in the classical CamC view. Second, as a by-product, precious semantic information can be extracted for characterizing the activity performed by the surgeon or tracking the position of the different objects present in the scene.

In this paper, we introduce a novel learning-based AR fusion approach aiming at improving surgical scene understanding and depth perception. To this end, we propose to combine a mobile C-arm with a Kinect sensor, adding not only X-ray but also depth information into the augmented scene. Exploiting the fact that structured light still works through a mirror, the Kinect sensor is integrated with a mirror system on a mobile C-arm, so that both color and depth cameras as well as the X-ray source have the same viewpoint. In this context of learning-based image fusion, a few attempts have been made in [3,4] based on color and X-ray information only. In these early works, a Naïve Bayes classification approach based on color and radiodensity is applied to recognize the different objects in the color and X-ray images, respectively, from the CamC system. Depending on the pair of objects it belongs to, each pixel is associated to a mixing value to create a relevance-based fused image. While this approach provided promising first results, recognizing each object based on its color distribution only is very challenging and not robust to changes in illumination. In the present work, we propose to take advantage of additional depth information to provide an improved AR visualization: (i) we define a learning-based strategy based on color and depth information for identifying objects of interest in Kinect data, (ii) we use state-of-the-art random forests for identifying foreground objects in X-ray images and (iii) we use an object-specific mixing look-up table for creating a pixel-wise alpha map. In 12 simulated surgeries, we show that our fusion approach provides surgeons with a better surgical scene understanding as well as an improved depth perception.

2. Methods

2.1. System setup: Kinect augmented mobile C-arm

In this work, we propose to extend a common intraoperative mobile C-arm by mounting a Kinect sensor, which consists of a depth sensor coupled to a video camera. The video camera optical center of this RGB-D sensor is mounted so that it coincides with the X-ray projection center. The depth sensor is based on so-called structured light, where infrared light patterns are projected into the scene. Using an infrared camera, the depth is inferred from the deformations of those patterns induced by the 3D structure of the scene. To register the depth images into the video camera coordinates, the sensor provides a built-in calibration. In the proposed setup, illustrated by Fig. 2 on the left, the surgical scene is seen through a mirror system. Note that depth inference is still possible, as the mirror perfectly reflects structured light without inducing deformations on the infrared patterns. Fig. 2 on the right shows the proof-of-concept setup we will use in our experiments. This system consists of an aluminium frame mimicking a C-arm with realistic dimensions, a Kinect sensor and a mirror system. Here we use one mirror to simulate the fact that the video optical center of the camera augmented mobile C-arm system has to virtually coincide with the X-ray source. While the effective range of the depth information is between 50 cm and 3 m, the distance between the X-ray source and the detector is about 1 m. The depth sensor is mounted at about 30 cm from the mirror, which is about 10 cm below the X-ray source. This effectively allows a depth range of about 70 cm from the detector. The fastest synchronized resolution of the Kinect is 640×480 at 30 fps. Since the field of view of the Kinect is larger than the mirror, the images are cropped to 320×240 to fit the mirror view. In our experiments, real X-ray shots acquired from different orthopedic surgeries will be manually aligned into the view of our scene before starting our surgery simulations. In the next section, we describe our novel learning-based AR visualization that intelligently combines these different sources of information.

2.2. Learning-based AR visualization

In the present work, we consider the 3 different sources of information provided by a mobile C-arm combined with an RGB-Depth Kinect sensor. The resulting color, depth and X-ray images are represented by their respective intensity functions I : Ω → R3, D : Ω → R and J : Ω → R. We assume all images are registered through calibration, so that those functions are defined on the same image domain Ω ⊂ R2.
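As a small illustration of this setup, the sketch below (assuming NumPy and OpenCV are available; the crop offsets are purely hypothetical placeholders, not the system's calibration) builds the three registered images I, D and J as arrays on a common 320×240 domain. It is not the actual acquisition code of the system.

```python
# A minimal sketch, assuming OpenCV/NumPy and that the three streams have
# already been registered to the same 320x240 mirror view.
import cv2
import numpy as np

def prepare_frames(kinect_bgr, kinect_depth_mm, xray, crop=(120, 160, 240, 320)):
    """Return (I, D, J) defined on a common image domain Omega.

    I : HxWx3 CIELab color image
    D : HxW   depth in millimetres
    J : HxW   X-ray radiodensity (manually aligned to the camera view)
    """
    y0, x0, h, w = crop                                # hypothetical crop of the mirror view
    bgr = kinect_bgr[y0:y0 + h, x0:x0 + w]
    I = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)           # color expressed in CIELab
    D = kinect_depth_mm[y0:y0 + h, x0:x0 + w].astype(np.float32)
    J = cv2.resize(xray, (w, h)).astype(np.float32)    # alignment assumed done beforehand
    return I, D, J
```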
Each pixel x ∈ Ω is associated to a three-dimensional value in the CIElab color space, a depth value in mm, and a radiodensity value. Our goal is to create an augmented image F : Ω → R3 as the fusion of I and J, taking advantage of the additional depth information contained in D. Using a simple method called alpha blending, we could construct F as a convex combination of both I and J, ignoring D. The same "mixing value" α ∈ [0, 1] would then be applied to the whole image domain, without taking into account the content of those images. In the present work, we propose to create a pixel-wise alpha mapping based on the semantic content of both images. Ideally, all relevant information needs to be retained and emphasized in the fused image. Based on color, depth and radiodensity information, our novel mixing paradigm can be defined as follows:

F(x) = α_{I,D,J}(x) I(x) + (1 − α_{I,D,J}(x)) J̃(x),    (1)

where J̃ is the function in R3 that associates a pixel x to the vector [J(x), J(x), J(x)]^T. α_{I,D,J}(x) is a pixel-wise alpha map that is constructed by taking into account the semantic content, i.e. the relevant objects present in the images to fuse.

2.3. Identifying objects of interest in RGB-D and X-ray images

2.3.1. Related work: object recognition/segmentation in RGB-Depth images

Since the introduction of RGB-Depth sensors such as the Kinect, much research has been conducted to tackle the problems of object detection and scene labelling by taking advantage of the combined color and depth information. In the field of pedestrian detection, several works [5–8] propose to combine image intensity, depth and motion cues. In the context of object classification, detection and pose estimation, Sun et al. [9] proposed to detect objects from depth and image intensities with a modified Hough transform. More recently, Hinterstoisser et al. introduced in [10] a very fast template matching approach based on so-called multi-modal features extracted from RGB and depth images: they propose to combine color gradient information with surface normals to best describe the templates of the objects of interest. In [11], Silberman et al. also propose to use different types of hybrid features such as RGB-D SIFT within a CRF model in order to segment indoor scenes. In [12], the authors tackle the problem of object recognition based on RGB-D images, demonstrating that combining color and depth information substantially increases the recognition results. In the following, we describe in detail how we combine color and depth information to identify our objects of interest.

2.3.2. Our approach

Let us first consider both the color image I and the depth image D provided by the Kinect sensor, both registered through built-in calibration. Our goal is to identify relevant objects within the surgical scene, i.e. objects that belong to the foreground and to specific classes of interest such as the hands of the surgeon or surgical tools. In this context, we propose to split the task of identifying relevant objects into two subtasks: (1) find candidate foreground objects using the content of the depth image and (2) identify relevant objects using the content of the color image. Formally, each pixel x needs to be associated with a label r ∈ {0, 1}, being equal to 1 for relevant objects and 0 otherwise. In our multi-sensor setup, this label r can be seen as the realization of 2 random variables (f, c), where f ∈ {0, 1} represents the observation of a foreground object in the depth image and c ∈ C = {background, surgeon, tool} the observation of classes of interest in the color image. In a probabilistic framework, we aim at modeling the joint distribution P_{I,D}(f, c|x) of a pixel x belonging to the foreground and to an object class given a depth image D and a color image I. By decorrelating the observations in depth and color images, we can model this distribution as:

P_{I,D}(f, c|x) = P_D(f|x) P_I(c|x)    (2)

As modeling the foreground in depth images is ill-posed, we propose instead to learn a background model P_D(f̄|x) and to use the relation P_D(f|x) = 1 − P_D(f̄|x). Concerning the second term P_I(c|x), we use a discriminative model based on random forests.
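A minimal sketch of Eq. (2) is given below, assuming the two per-pixel maps have already been estimated as described in the following subsections (the depth background posterior and the color forest posterior). The way the background class is paired with the depth background term is our own plausible reading of the combination, not a rule stated explicitly in the paper.

```python
# Sketch of Eq. (2): combine the depth background posterior with the color
# forest posterior and label each pixel. The pairing of the "background"
# class with the depth background is an assumption made for illustration.
import numpy as np

BACKGROUND, SURGEON, TOOL = 0, 1, 2

def label_pixels(p_bg_depth, p_class):
    """p_bg_depth: (H, W) estimate of P_D(f_bar | x) from the depth model.
    p_class: (H, W, 3) estimate of P_I(c | x), classes ordered as above.
    Returns an (H, W) label map over {BACKGROUND, SURGEON, TOOL}."""
    p_fg = 1.0 - p_bg_depth                             # P_D(f|x) = 1 - P_D(f_bar|x)
    joint = np.empty_like(p_class)
    joint[..., BACKGROUND] = p_class[..., BACKGROUND] * p_bg_depth
    joint[..., SURGEON] = p_class[..., SURGEON] * p_fg
    joint[..., TOOL] = p_class[..., TOOL] * p_fg
    return np.argmax(joint, axis=-1)
```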


Fig. 1. Comparison between classical alpha blending (top) and our fusion approach (bottom): surgeons agreed that scene understanding as well as depth perception are improved by using our method.

Fig. 2. On the left, the proposed setup: a classical mobile C-arm augmented with an RGB-D Kinect sensor through a mirror system. On the right, our proof-of-concept setup used for experiments.

Fig. 3. Objects identification in Kinect data, from left to right: (a) the probabilistic output of the forest combined with the depth foreground, where red and green respectively represent the surgeon and tool object classes; (b) the segmentation result; and (c) the overlay of the segmentation on the original image. (For interpretation of the references to color in this legend, the reader is referred to the web version of the article.)

In the following, we describe how to learn both probability distributions P_D(f̄|x) and P_I(c|x).

2.3.3. Background modeling using depth images

Background modeling has been widely studied for performing background subtraction in color images for tracking applications. In a fixed-camera setup, the key idea is to learn a color distribution for each pixel from a set of background images. As reported in [13], several approaches have been proposed over the last decade for adaptive real-time background subtraction, based on a running Gaussian average, mixture models [14], kernel density estimation [15] or the so-called Eigenbackground [16]. In the present work, we propose to learn a fixed model for our depth background based on a set of frames acquired at the beginning of the surgery. Indeed, adaptive models are constantly updated under the assumption that foreground objects move fast. This assumption does not hold in the case of surgery, where objects of interest such as surgical tools can sometimes remain immobile for a few minutes. Formally, let us consider a set of N depth frames D = {D_n}_{n=1}^{N} accumulated at the beginning of a surgical sequence, where no objects of interest have entered the field of view yet. Since our camera is static, we can reasonably model the background depth distribution for each pixel x ∈ Ω as a multivariate normal distribution:

P_D(f̄|x) = N(μ_x, Σ_x | x),    (3)

where the parameters (μ_x, Σ_x) are respectively the sample mean and covariance matrix estimated from the set of frames D at pixel x. For a new incoming frame, the depth value at pixel x is then compared to its corresponding background model to infer the probability of belonging to the background.
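The per-pixel background model can be sketched as follows. The sketch treats each pixel's depth as a univariate Gaussian (the scalar case of Eq. (3)); the variance floor and the handling of missing depth readings are our own assumptions, not choices reported in the paper.

```python
# A minimal per-pixel Gaussian background model on depth, as a sketch of
# Eq. (3). It assumes a stack of N registered depth frames captured before
# any object of interest enters the field of view.
import numpy as np

class DepthBackgroundModel:
    def __init__(self, background_frames, min_std=5.0):
        # background_frames: (N, H, W) depth in mm, accumulated before surgery starts
        stack = np.asarray(background_frames, dtype=np.float32)
        self.mu = stack.mean(axis=0)                          # per-pixel sample mean
        self.sigma = np.maximum(stack.std(axis=0), min_std)   # variance floor (assumption)

    def background_probability(self, depth):
        """Rescaled Gaussian likelihood of the observed depth under the model."""
        z = (depth - self.mu) / self.sigma
        p_bg = np.exp(-0.5 * z ** 2)
        p_bg[depth <= 0] = 0.5        # missing depth readings treated as uninformative
        return p_bg
```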

2.3.4. Object identification using color images

As reported in [17], random forests have found a wide variety of applications in medical image analysis, such as anatomy localization [18,19], segmentation [20] or lesion detection [21]. As ensembles of decision trees, they provide piecewise approximations of any distribution in high-dimensional spaces. In our case, we model the probability P_I(c|x) of a pixel x ∈ Ω belonging to an object class c ∈ C = {background, surgeon, tool}. The visual content of a pixel x is defined by a feature vector X ∈ X = R^d. X encodes the mean intensity values computed in d rectangular regions of different sizes in the neighborhood of x in the color channels of the CIElab color space. Following a "divide and conquer" strategy, each tree t, t ∈ {1, ..., T}, first partitions the feature space in a hierarchical fashion and then estimates the posterior in each "cell" of this space. Given a training set of pixels from different color images and their corresponding labels, a tree t aims at subdividing these data by using axis-aligned splits in X so that consistent subsets are created in its leaves in terms of their visual context and class information c. Each leaf of a tree t models "locally" the posterior P_I^{(t)}(c|x), encoded as a class histogram computed from the set of observations reaching the leaf. At test time, the outputs of the trees can be combined by using posterior averaging:

P_I(c|x) = (1/T) Σ_{t=1}^{T} P_I^{(t)}(c|x).    (4)
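A compact sketch of this pixel-wise classification with scikit-learn is shown below. The centered box-mean features are only an approximation of the offset rectangular context features described above, and the neighborhood sizes and feature count are illustrative rather than the 50 features used in the paper.

```python
# Sketch of the pixel-wise random forest, assuming scikit-learn and SciPy.
import numpy as np
from scipy.ndimage import uniform_filter
from sklearn.ensemble import RandomForestClassifier

BOX_SIZES = (3, 9, 21, 41)   # hypothetical neighborhood sizes in pixels

def context_features(lab_image):
    """Stack per-channel box means into an (H, W, F) feature volume."""
    chans = [lab_image[..., c].astype(np.float32) for c in range(3)]
    feats = [uniform_filter(ch, size=s) for ch in chans for s in BOX_SIZES]
    return np.stack(feats, axis=-1)

def train_pixel_forest(lab_images, label_maps, n_trees=20, depth=15):
    X = np.concatenate([context_features(im).reshape(-1, len(BOX_SIZES) * 3)
                        for im in lab_images])
    y = np.concatenate([lm.reshape(-1) for lm in label_maps])
    forest = RandomForestClassifier(n_estimators=n_trees, max_depth=depth, n_jobs=-1)
    return forest.fit(X, y)

def class_posteriors(forest, lab_image):
    feats = context_features(lab_image)
    # predict_proba averages the per-tree class histograms, matching Eq. (4)
    proba = forest.predict_proba(feats.reshape(-1, feats.shape[-1]))
    return proba.reshape(feats.shape[0], feats.shape[1], -1)   # (H, W, n_classes)
```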

2.3.5. Identifying anatomy of interest in X-ray images

As in the previous section, our goal is to extract the relevant information contained in an X-ray image J, i.e. anatomy, bones and implants. This task is defined as a classification task where each pixel x ∈ Ω is associated to a label r ∈ {0, 1}, where r is equal to 1 if x belongs to a relevant structure and 0 otherwise. In a probabilistic framework, we model the posterior distribution P_J(r|x) by using a random forest. Similarly, the visual context of each pixel x is described by a feature vector X′ ∈ R^{d′} encoding mean radiodensity values computed in d′ rectangular regions in its neighborhood. Once the forest has been trained by using a set of annotated images, a new incoming X-ray can be labelled by using r̂ = argmax_{r ∈ {0,1}} P_J(r|x).

2.4. Relevance-based image fusion

As an output of the relevant object identification steps in both color/depth and X-ray images, we are given respectively two label maps, denoted L_{I,D} : x → {background, surgeon, tool} and L_J : x → {background, foreground}. To create a pixel-wise alpha map α_{I,D,J}(x) from L_{I,D} and L_J, we propose to use a mixing look-up table that associates a specific alpha value to each label pair. Based on surgeon feedback, we created a mixing LUT giving higher values to surgical tools and the surgeon's hands over the X-ray foreground. As we will show in Section 3, it enables a better depth perception than classical alpha blending. Moreover, hands can be made more transparent to permit the surgeon to see the anatomy while he is holding the patient. Note that this mixing LUT can be adapted to the preference of the surgeon, to the type of surgery, and also dynamically depending on the current phase during a surgery.
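The relevance-based fusion of Eq. (1) driven by such a mixing LUT can be sketched as below. The alpha values are illustrative placeholders, not the LUT tuned from surgeon feedback.

```python
# Sketch of the relevance-based fusion of Eq. (1) with an object-specific
# mixing LUT; alpha weights the optical image, (1 - alpha) the X-ray.
import numpy as np

BACKGROUND, SURGEON, TOOL = 0, 1, 2          # labels L_{I,D} from the RGB-D step
XRAY_BACKGROUND, XRAY_FOREGROUND = 0, 1      # labels L_J from the X-ray step

MIXING_LUT = {                               # placeholder values, not the paper's LUT
    (BACKGROUND, XRAY_BACKGROUND): 0.7,
    (BACKGROUND, XRAY_FOREGROUND): 0.2,      # let the X-ray anatomy show through
    (SURGEON,    XRAY_BACKGROUND): 0.9,
    (SURGEON,    XRAY_FOREGROUND): 0.6,      # semi-transparent hands over anatomy
    (TOOL,       XRAY_BACKGROUND): 1.0,
    (TOOL,       XRAY_FOREGROUND): 0.9,      # tools stay on top for occlusion cues
}

def fuse(I_color, J_xray, labels_rgbd, labels_xray):
    """Pixel-wise alpha blending of Eq. (1), driven by the two label maps."""
    alpha = np.full(labels_rgbd.shape, 0.5, dtype=np.float32)
    for (c_rgbd, c_xray), a in MIXING_LUT.items():
        alpha[(labels_rgbd == c_rgbd) & (labels_xray == c_xray)] = a
    J_tilde = np.repeat(J_xray[..., None].astype(np.float32), 3, axis=-1)
    return alpha[..., None] * I_color.astype(np.float32) \
        + (1.0 - alpha[..., None]) * J_tilde
```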

3. Experiments and results

In this paper, we demonstrate the potential of our approach by using our proof-of-concept system illustrated in Fig. 2 (on the right). We perform 12 different orthopedic surgery simulations using a surgical phantom and real X-ray shots acquired from different orthopedic surgeries. Note that the X-ray images are manually aligned into the view of our surgical scene before starting our acquisitions. In each sequence, different types of activities involving different surgical tools, e.g. scalpel, drill, hammer, are performed. While in 10 sequences the simulations are performed in comparable light conditions, we acquire two sequences with changing light conditions.

3.1. Evaluation of the objects identification

For our experiments, we annotated in each of the 12 sequences 4 video and depth frames, as well as 20 X-ray shots. For evaluating our approaches in both RGB-D and X-ray data, we performed two-fold cross-validation. Note that our RGB-D identification classifier has not been trained using the two sequences with changing light conditions. To describe the visual context of each pixel in the color image, 50 context features are extracted per CIElab channel. To tackle the task of object identification, we train a random forest classifier consisting of 20 trees of depth 15. In each sequence, the depth background model is built from the first 30 frames, prior to the beginning of the surgery. For identifying objects in X-ray images, 50 context features are extracted to describe the pixel context, and the classifier consists of a random forest of 20 trees with depth 15.

As described in Section 2, the class posteriors corresponding to the tool, surgeon and background classes are combined with the probability of belonging to the foreground in the depth image. Pixels are then labelled as surgeon, tool or background by computing the argmax. Examples of the different outputs of the system are shown in Fig. 3. The complexity of the identification step depends linearly on the number of pixels, the number of trees, as well as their depth. The fusion itself is very fast, as it depends only linearly on the number of pixels in the image. For our proof-of-concept prototype, the overall identification-fusion runs at a speed of 2.7 frames per second.

Figs. 4 and 5 give a quantitative evaluation of our object identification approach in both Kinect and X-ray data. F-measure (Dice score), precision and recall were estimated for the task of classifying the objects of interest against the background. Tables 1 and 2 show the overall confusion matrices for the different objects of interest vs. the background. These measures are computed using the annotated frames over all the sequences (48 images), the 10 standard sequences having similar light conditions (40 images) and the last 2 sequences (8 images), which contain perturbations, e.g. dynamic changes in lighting conditions. Results show that while the segmentation of the surgeon performs well, the segmentation of tools is more challenging. Indeed, during our experiments, different types of tools were used that present different shape, scale, color and material characteristics, such as drills, clamps or scalpels. As they do not reflect infrared light well, the segmentation of metallic tools such as scalpels is more challenging and relies only on RGB information. Moreover, their thin shape and small size make them difficult to catch when they are far from the camera. In contrast, tools with a plain color and a larger size, such as the drill, get segmented reliably. While our approach performs well on the 10 comparable sequences, it is interesting to note that it gives fair results in the case of changing light conditions (not seen during training). One challenge comes from the fact that light can flood the infrared patterns and thus disturb the depth estimation from the Kinect. Concerning the objects identification in X-ray data, results computed on the 10 X-ray shots suggest that our approach performs well. Fig. 1 shows a few frames augmented by using our fusion approach with different lighting conditions.

Fig. 4. Quantitative results of the objects identification in Kinect RGB-D data: at the top row, overall results are displayed for standard sequences and sequences with light perturbations; at the bottom, detailed results are shown for sequences containing large tools such as the drill and sequences containing thin metallic tools such as clamps or scalpels.

Fig. 5. Quantitative results of the objects identification in X-ray data (F-measure, precision and recall for the background and foreground classes).

Table 1. Confusion matrix of the objects identification in Kinect RGB-D data in all sequences.

                       Actual background (%)   Actual surgeon (%)   Actual tool (%)
Predicted background   98.81                   5.58                 18.12
Predicted surgeon      0.41                    94.27                0.18
Predicted tool         0.78                    0.16                 81.70

Table 2. Confusion matrix of the objects identification in X-ray data.

                       Actual background (%)   Actual foreground (%)
Predicted background   94.16                   2.65
Predicted foreground   5.83                    97.35
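For completeness, the reported metrics can be derived from a raw confusion matrix as in the sketch below. Note that Tables 1 and 2 report column-normalized percentages, whereas the sketch assumes unnormalized pixel counts; for the binary object-vs-background setting, the F-measure coincides with the Dice score.

```python
# Sketch of the evaluation metrics: per-class precision, recall and F-measure
# (equal to the Dice score) from a raw confusion matrix with rows = predicted
# classes and columns = actual classes (pixel counts, not percentages).
import numpy as np

def precision_recall_fmeasure(conf, cls):
    conf = np.asarray(conf, dtype=np.float64)
    tp = conf[cls, cls]
    precision = tp / conf[cls, :].sum()   # among pixels predicted as cls
    recall = tp / conf[:, cls].sum()      # among pixels that truly are cls
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```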

3.2. Evaluation of the fusion results

The 12 different sequences acquired are processed and presented to 5 clinical participants: three expert surgeons and two 4th year medical students.

Their feedback is collected within a questionnaire and assessed using a 5-point Likert scale: 1 – strongly disagree, 2 – disagree, 3 – neutral, 4 – agree, and 5 – strongly agree. Participants strongly agreed (4.6 ± 0.5) that the depth ordering is resolved using our approach. Concerning the visibility of the instrument tip and of the implants in the X-ray, the feedback is respectively neutral (3.0 ± 1.4) and slightly positive (3.4 ± 1.1). They all agreed (4.0 ± 1.4) that the overall perception of the visualization is improved for potential integration in the surgical workflow. Finally, all participants strongly agreed (4.6 ± 0.9) that they would prefer our new visualization over classical alpha blending.

4. Conclusion

In this paper, we proposed novel strategies and learning approaches for AR visualization to improve surgical scene understanding and depth perception. Our main contributions were to propose the concept of a C-arm combined with a Kinect sensor to obtain color as well as depth information, to define learning-based strategies for identifying objects of interest in Kinect and X-ray data, and to create an object-specific pixel-wise alpha map for improved image fusion. In 12 simulated surgeries, we show promising results for better surgical scene understanding as well as improved depth perception. Moreover, our novel Kinect augmented C-arm system opens the door to much exciting future work such as tool tracking, pose estimation, navigation and workflow analysis.

References

[1] Nicolau S, Lee P, Wu H, Huang M-H, Lukang R, Soler L, Marescaux J. Fusion of C-arm X-ray image on video view to reduce radiation exposure and improve orthopedic surgery planning: first in-vivo evaluation. In: 15th annual conference of the international society for computer aided surgery. 2011.
[2] Navab N, Heining S, Traub J. Camera-augmented mobile C-arm (CamC): calibration, accuracy study, and clinical applications, vol. 29. Springer; 2010. p. 1412–23.
[3] Pauly O, Katouzian A, Eslami A, Fallavollita P, Navab N. Supervised classification for customized intraoperative augmented reality visualization. In: IEEE international symposium on mixed and augmented reality (ISMAR). 2012. p. 311–2.
[4] Erat O, Pauly O, Weidert S, Thaller P, Euler E, Mutschler W, et al. How a surgeon becomes superman by visualization of intelligently fused multi-modalities. In: SPIE medical imaging. 2013, 86710L.
[5] Enzweiler M, Eigenstetter A, Schiele B, Gavrila DM. Multi-cue pedestrian classification with partial occlusion handling; 2010.
[6] Ess A, Leibe B, Gool LJV. Depth and appearance for mobile scene analysis; 2007.
[7] Gavrila DM, Munder S. Multi-cue pedestrian detection and tracking from a moving vehicle; 2007.
[8] Wojek C, Walk S, Schiele B. Multi-cue onboard pedestrian detection; 2009.
[9] Sun M, Bradski GR, Xu B-X, Savarese S. Depth-encoded hough voting for joint object detection and shape recovery; 2010.
[10] Hinterstoisser S, Cagniart C, Ilic S, Sturm P, Navab N, Fua P, et al. Gradient response maps for real-time detection of texture-less objects; 2012.
[11] Silberman N, Fergus R. Indoor scene segmentation using a structured light sensor. In: IEEE international conference on computer vision workshops (ICCV workshops), 2011. IEEE; 2011. p. 601–8.
[12] Lai K, Bo L, Ren X, Fox D. A large-scale hierarchical multi-view RGB-D object dataset. In: IEEE international conference on robotics and automation (ICRA), 2011. IEEE; 2011. p. 1817–24.
[13] Piccardi M. Background subtraction techniques: a review. In: IEEE international conference on systems, man and cybernetics, vol. 4. IEEE; 2004. p. 3099–104.
[14] Stauffer C, Grimson WEL. Adaptive background mixture models for real-time tracking. In: IEEE Computer Society conference on computer vision and pattern recognition, vol. 2. IEEE; 1999.
[15] Elgammal A, Harwood D, Davis L. Non-parametric model for background subtraction. Springer; 2000. p. 751–67.
[16] Oliver NM, Rosario B, Pentland AP. A Bayesian computer vision system for modeling human interactions, vol. 22. IEEE; 2000. p. 831–43.
[17] Criminisi A, Shotton J. Decision forests for computer vision and medical image analysis. Springer; 2013.
[18] Criminisi A, Robertson D, Konukoglu E, Shotton J, Pathak S, White S, et al. Regression forests for efficient anatomy detection and localization in computed tomography scans. Med Image Anal 2013;17(8):1293–303.
[19] Pauly O, Glocker B, Criminisi A, Mateus D, Martinez-Moeller A, Nekolla S, Navab N. Fast multiple organs detection and localization in whole-body MR Dixon sequences. In: International conference on medical image computing and computer assisted intervention (MICCAI). 2011.
[20] Glocker B, Pauly O, Konukoglu E, Criminisi A. Joint classification-regression forests for spatially structured multi-object segmentation. In: 12th European conference on computer vision (ECCV). 2012.
[21] Pauly O, Ahmadi S-A, Plate A, Boetzel K, Navab N. Detection of substantia nigra echogenicities in 3D transcranial ultrasound for early diagnosis of Parkinson disease. In: International conference on medical image computing and computer assisted intervention (MICCAI). 2012.