CHAPTER 33

Human–machine interfaces for medical imaging and clinical interventions

Roy Eagleson^a, Sandrine de Ribaupierre^b
^a Software Engineering, University of Western Ontario, London, ON, Canada
^b Clinical Neurological Sciences, University of Western Ontario, London, ON, Canada

Contents
33.1. HCI for medical imaging vs clinical interventions
33.1.1 HCI for diagnostic queries (using medical imaging)
33.1.2 HCI for planning, guiding, and executing imperative actions (computer-assisted interventions)
33.2. Human–computer interfaces: design and evaluation
33.3. What is an interface?
33.4. Human outputs are computer inputs
33.5. Position inputs (free-space pointing and navigation interactions)
33.6. Direct manipulation vs proxy-based interactions (cursors)
33.7. Control of viewpoint
33.8. Selection (object-based interactions)
33.9. Quantification (object-based position setting)
33.10. User interactions: selection vs position, object-based vs free-space
33.11. Text inputs (strings encoded/parsed as formal and informal language)
33.12. Language-based control (text commands or spoken language)
33.13. Image-based and workspace-based interactions: movement and selection events
33.14. Task representations for image-based and intervention-based interfaces
33.15. Design and evaluation guidelines for human–computer interfaces: human inputs are computer outputs – the system design must respect perceptual capacities and constraints
33.16. Objective evaluation of performance on a task mediated by an interface
References


33.1. HCI for medical imaging vs clinical interventions

A broad characterization of the distinction between "Medical Imaging" and "Computer-Assisted Interventions" is that Medical Imaging displays are used for diagnosis, whereas Computer-Assisted Interventions use displays as the information on which actions are based. The former set of tasks answers queries, while the latter are used for navigation and manipulation: they are actions which are imperative in nature. While systems can support a mix of these two broad use cases, we use this distinction to outline some typical



Figure 33.1 (left) MRI console during a fetal MRI acquisition where the parameters can be adjusted; (right) image of a fetal head.

examples from clinical settings. Some refer to this distinction as the "MIC" and the "CAI" of MICCAI (e.g., [9]), and so these task categories are the primary drivers that distinguish two HCI design streams.

33.1.1 HCI for diagnostic queries (using medical imaging)

While "medical image understanding" might seem to be almost synonymous with "Visual Perception", we can identify a fundamental difference between the two: Visual Perception involves the human visual system, which has evolved to "understand" (i.e., derive semantics from) a natural scene. "Medical Image Understanding", on the other hand, makes use of physics-based sensors to produce an array which is perceived as an image. While viewing an X-Ray image may feel like looking at a natural pattern, it is far from natural scene understanding. This is the trap through which many medical imaging innovations fall. The human perceptual system is specialized for natural scenes. When the pipeline involves inputs from physics-based sensors (e.g., coils in an MRI scanner) and a processing flow which, after considering several arbitrary parameters, maps these onto a visual image, misperceptions of the information are almost inevitable. Recognizing this shortcoming, a common theme in Medical Imaging has been to optimize the mapping to the display so that the speed and accuracy of perceiving "what" or "where" are improved. "Is there a lesion in this sample?" "Is there flow in the vessel?" "Is that anatomical structure normal or is it pathological?" (See Fig. 33.1.) To facilitate such tasks using visual information, the processing pipeline embodies an algorithm which can have several degrees of freedom.


These are parameters associated with the displayed image and, accordingly, they can be adjusted interactively at the user interface (e.g., color lookup tables, contrast adjustments, histogramming, ROI, rotation of the image, translation, zoom, etc.). Quasi-automated tools and algorithms can help with shape estimation (such as segmentation, 3D rendering of reconstructed volumes, detection, classification, localization, and estimation of volumes or shape parameters). Later we will discuss how this process is not just data-driven, but involves "top-down" processing in addition to the "bottom-up" flow, which is predominant in Medical Imaging (except when humans are involved, using HCI as part of the processing hierarchy). In any case, each of these, in Medical Imaging ("MIC"), can be used as an objective metric for "Diagnosis". We consider the "CAI" of MICCAI in the following section.
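Before turning to the "CAI" of MICCAI, the adjustable display parameters mentioned above can be made concrete with a short sketch: a window/level (contrast) mapping applied to a scalar image before display. This is a minimal illustration, assuming a NumPy array of raw intensities; the function name, the synthetic slice, and the particular window values are ours, not drawn from any specific console.

import numpy as np

def apply_window(image, center, width):
    """Map raw scalar intensities to 8-bit display values using a
    window/level transfer function (an interactively adjustable parameter
    pair on many medical imaging consoles)."""
    lo = center - width / 2.0
    hi = center + width / 2.0
    out = (image.astype(np.float64) - lo) / (hi - lo)  # normalize to [0, 1]
    out = np.clip(out, 0.0, 1.0)                       # saturate outside the window
    return (out * 255).astype(np.uint8)                # 8-bit grayscale for display

# Example: a soft-tissue-like window applied to a synthetic slice.
slice_raw = np.random.randint(-1000, 2000, size=(256, 256))
display = apply_window(slice_raw, center=40, width=400)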

33.1.2 HCI for planning, guiding, and executing imperative actions (computer-assisted interventions)

Not only is the human perceptual/cognitive system able to "understand" a natural scene but, critically, it is then able to act within it. This notion of the sensorimotor system completes the concept of the "Perception–Action" loop that connects humans with the world. Certainly, we have evolved to deal with a natural world. Our perceptual–motor system is so adaptive that it is at least feasible that humans (making use of HCI systems) might be able to accomplish tasks which were formerly accomplished using natural vision and hand-held tools. Accordingly, they may be competent to perform similar tasks using medical imaging (within unnatural scenes), with the addition of artificial user-input devices that allow computers to drive effectors to perform medical interventions. In retrospect, it is a pretty bold stance. Yet, there has been relatively recent progress over the past couple of decades in this regard. Some of the first examples of Human–Machine Interfaces for Computer-Assisted Interventions have been confined to the domain of Endoscopy/Laparoscopy. Other examples include Microscopy, Ultrasound Guidance, keyhole robotic surgery (e.g., Computer Motion's Zeus, Intuitive Surgical's da Vinci, Renishaw's Neuromate), and a well-established set of platforms for Neuronavigation (Medtronic, BrainLab, Synaptive, etc.). (See Fig. 33.2.) These classes of clinical procedures take place within a workspace that is reasonably constrained, making "computer-assistance" more feasible (as contrasted with "open surgery"). Our main caveat is that this enterprise will be about "System Design" as much as it is about "System Evaluation", and the latter will not be "unit testing". Instead, it will be the empiricism needed to measure the facilitation of the task for which the interface was designed. Later, we will develop an argument for objective evaluations of this "facilitation" in terms of the only performance metric that is relevant to a sensorimotor task – the product of speed and accuracy.


Figure 33.2 Setup during an epilepsy case using a robot for intracranial electrode insertion: (left) monitor with different views of the brain on which the planning is performed; (right) robotic arm ready for the procedure.

33.2. Human–computer interfaces: design and evaluation

One of the characteristics of the domain of Human–Computer Interfaces that is both exhilarating and challenging is the vast breadth of the subject. There is no shortage of wonder and fascination for students studying this domain. And yet, this seemingly unbridled nature belies a stark simplicity: a number of unifying principles can be identified, and these principles provide a structure for the study of "HCI". The domain is fundamentally about connecting humans to computers/machines (Human–Computer Interfaces are simply restricted instances of Human–Machine Interfaces). Once identified, these fundamental principles can be used to establish "Guidelines for Design" and "Methodologies for Evaluation". Parenthetically, the processes of "Design and Evaluation" are merely pragmatic forms of a more fundamental pair of endeavors known broadly as "Engineering and Science". Roughly speaking, Engineering is the enterprise where a conceptual representation of a system is transformed into, or "designed as", a physical system implementation. Science is the complementary process, whereby an existing physical system is examined through observation, guided by the "Scientific Method", in an attempt to build conceptual representations, or "theories", of the function or structure of the system. Accordingly, we treat Design and Evaluation as complementary trajectories. They should not be decoupled since, from an epistemological point of view, the two streams provide the complementary efforts needed for the "iterative process of design" (cf. [2,3]). There must be representations for concepts of systems, and there must be representations for implementations of systems. The two must be functionally and structurally equivalent; they must be "semantically equivalent". And so, accordingly, "Design and Evaluation" need to remain entwined through iterations of the overall enterprise.


Figure 33.3 Schematic of the information-processing categories in HCI.

33.3. What is an interface?

The "interface" between any two regimes is quite simply, and essentially, the boundary between the two, whether that be a membrane between two volumes, a line that segments two areas, or a set of input and output connections between two agents. The study of the interface itself is restricted to the exchanges that take place between the two. Furthermore, the interface is characterized entirely by capturing the nature of that exchange – whether it be mediated by energy, matter, or abstract information. Let's explore this notion of an "interface", a central theme in this chapter on "Human–Computer Interfaces" for MICCAI. If you were to ask researchers in Surface Science "what is an interface?", they might talk about a membrane that separates two material phases: perhaps solids, liquids, or gases. This is extremely instructive, since we learn that the study of the interface would be cast in a language relating to the materials and energies exchanged across the interface. The same would be true in Biology, Chemistry, or Physics. Will this be true if the interface passes something more like "view", or "command and control"? If you were to ask Software Engineers "what is an interface?", they might talk about two Software Objects; essentially, two abstractions which encapsulate information, communicating by passing messages (or invoking methods by passing parameters and returned structures; cf. Alan Kay's development of Object-Oriented programming in 1968). A software API (or "Application Programming Interface") is simply a list of the names of a set of functions, along with the data types passed and returned. An API deliberately hides the implementations of the functions that receive the input parameters: in other words, the API does not contain the source code for these functions, and therefore it is NOT a behavioral representation. Furthermore, it has nothing much to say about the programmer's implementation of applications which will make use of the API calls. It exclusively and exhaustively lists only the data flow direction and the data types. The same principle is true throughout Software Engineering; throughout Computer Engineering, for that matter; and, by extension, throughout any other discipline of Engineering. Accordingly, so we argue here, the same is true for HCI. (See Fig. 33.3.)


If two entities do not interact across a boundary, there is not much to say about the pair together. A boundary which does not permit an exchange is not particularly interesting: such an interface would block all interaction, so there would be zero flow and not much to discuss. At the other extreme would be an interface so permeable that there is unrestricted flow – to the extent that the two regimes would not be separated at all; they would simply be the same regime. So, indeed, what makes an interface interesting is that it permits a precise characterization of the exchange, cast in terms of a set of constraints that clearly and extensively describe the interface: What is exchanged? In which direction does it flow? And what quantitative model can be used to describe the flow, as a function of the processes on either side that produce and consume it? So, to characterize a Human–Computer Interface for Medical Imaging and Computer-Assisted Interventions, we are essentially asking: what are the types of exchanges that can occur across the implicit interface? The answer to this question is surprisingly constrained, and so we explore it here.
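To make the point about APIs concrete, the sketch below declares a hypothetical interface in exactly the sense described above: it lists only function names, the direction of data flow, and the data types passed and returned, and deliberately contains no implementation. The names are invented for illustration and are not taken from any particular imaging toolkit.

from typing import Protocol, Tuple
import numpy as np

class ImageViewerInterface(Protocol):
    """An interface in the API sense: names, argument types, and return
    types only -- no behavior, no implementation."""

    def set_window(self, center: float, width: float) -> None: ...
    def get_cursor_position(self) -> Tuple[int, int]: ...
    def render_slice(self, volume: np.ndarray, index: int) -> np.ndarray: ...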

33.4. Human outputs are computer inputs

The outputs from a human (which become inputs for the computer or the machine) can only be produced using muscles. Humans, like all other organisms, interact with the physical world exclusively through actions that are caused by muscles producing forces in conjunction with states of position (configuration, kinematics, and dynamics). There are no exceptions; muscles are the only "output" system for interaction and control within a physical world [6,7]. "Language" is a very special form of output, but it still makes use of muscle control: whether by text, by voice, or by gestures (cf. [4,5]). This special capacity underpins the human ability to share "concepts" (declarations), "goals" (action plans to accomplish a task), and "beliefs" (the concepts that arise in the cognitive system of the user, based on their perception of the medical imagery or visual data, cf. [10,11]). This is a guiding constraint for the analysis of Medical Imaging displays. But first we can consider a lower-bandwidth channel across the Human–Computer Interface: namely, the computer "inputs" (cf. [18,19]). All computer "input" devices (receiving human outputs) are sensitive to the relationship between position and force that the human "controls" in order to perform a task. Since this is a very low-bandwidth exchange, compared to the bandwidth of the flow of information from the computer to the human, we will first explore "computer input devices" by examining a cursory but representative set of illustrative examples.

33.5. Position inputs (free-space pointing and navigation interactions)

Let us begin by considering the following list of 2D pointing devices and ask the question – are they all providing the same information?


• Mouse
• Trackball
• Light Pen on Screen
• Finger on Touchpad

To be sure, each physical device is used to indicate a position – in these instances, a position in a 2D space. The devices themselves may present this information in either absolute or relative coordinates. A touchscreen, for example, is an intrinsic part of a display, and so the position interaction is reported in the absolute coordinate frame of the display (x, y). A mouse, on the other hand, does not report absolute position; a mouse does not know where it is located. Interacting with a mouse does not produce an absolute position but, rather, a stream of "changes in position" (Δx, Δy). The operating system integrates these changes of position over time, thereby updating the position of the cursor on the screen (cf. [13,14]). Gyroscope-based position devices are the same – they can only report changes in attitude – and the system must then integrate those changes over time in order to estimate a position relative to some initial frame. The feedback provided to the user is a "proxy" in the scene (or "cursor" in 2D).

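The difference between absolute and relative position reports can be sketched in a few lines: a hypothetical routine that integrates a stream of relative (dx, dy) displacements, as a mouse produces, into an absolute cursor position clamped to the display. The function name and display dimensions are illustrative assumptions.

def integrate_deltas(deltas, width=1920, height=1080, start=(960, 540)):
    """Integrate a stream of relative (dx, dy) reports (e.g., from a mouse)
    into an absolute cursor position, clamped to the display bounds.
    An absolute device (touchscreen, light pen) would report (x, y) directly."""
    x, y = start
    positions = []
    for dx, dy in deltas:
        x = min(max(x + dx, 0), width - 1)
        y = min(max(y + dy, 0), height - 1)
        positions.append((x, y))
    return positions

# Example: three relative reports move the cursor away from the screen center.
print(integrate_deltas([(5, 0), (0, -3), (-10, 2)]))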

33.6. Direct manipulation vs proxy-based interactions (cursors)

When you move a position-based input device, you observe the motion of the cursor or proxy. It certainly feels, while you are moving your hand, as if it were connected mechanically to the proxy – but, of course, it is all mediated by integration, typically performed by the operating system or VR API. The user's experience is that they are directly controlling the proxy (cf. [29,22]), when in fact they are moving their hand and controlling a device which is actually out of view. Only under very special circumstances is the hand or finger in actual registration with the perceived movement in the scene; indeed, this is very difficult to achieve in Augmented Reality displays (cf. [1,24]). A "touchscreen", on the other hand, is a 2D example which works very well – because the device itself is both the sensor for the direct input and the display. In augmented reality this is not the case – and that is precisely what leads to frequent failures of the "direct manipulation" user experience in AR. The same distinction can be made for position-based computer input devices with higher degrees of freedom. They may report absolute 3D or 6 dof coordinates; if, on the other hand, they report "changes" in these 6 parameters, then the system will need to integrate them and update a cursor in the 3D environment, which will serve as a proxy for the absolute position in the workspace. The data type of any position-based device is a tuple: an ordered pair in 2D, a triple for 3D, a 6-tuple for 6 dof, and so on.


Figure 33.4 In addition to position-based interactions with objects, the user can also control the viewpoint.

However, one major difference is that while 2D input devices are used exclusively while in contact with a surface (such as a mousepad, or the sphere of a trackball), this is not the case for a 3D device (since the surface contact would remove that third degree of freedom). Counterintuitively, the lack of contact with a constraining surface raises problems for 3D input devices; as we will discuss shortly, this arises from an impaired ability to select virtual objects by interacting with them.

33.7. Control of viewpoint

One very special kind of position-based interaction is one in which the interaction is bound to the viewpoint, or "camera", rather than to an object in the view. Position-based interactions can also be used to change the position of the viewpoint (such as a virtual camera in 3D, or the view onto a 2D plane, image, or pages of text in a document). This distinction corresponds to the dichotomy of space-based motions: Navigation vs Manipulation; object-based vs viewer-based. (See Fig. 33.4.)
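The object-based versus viewer-based dichotomy can be illustrated with a minimal sketch: the same positional displacement is routed either to an object's pose (manipulation) or to the virtual camera (navigation). The function names are hypothetical.

def apply_displacement(position, delta):
    """Translate a position 3-tuple by a displacement 3-tuple."""
    return tuple(p + d for p, d in zip(position, delta))

def manipulate_object(obj_position, delta):
    """Object-based interaction: the displacement moves the object."""
    return apply_displacement(obj_position, delta)

def navigate_viewpoint(camera_position, delta):
    """Viewer-based interaction: the same displacement moves the camera,
    so to the user the whole scene appears to shift instead."""
    return apply_displacement(camera_position, delta)

# The same drag gesture, routed either to an object or to the camera.
print(manipulate_object((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)))
print(navigate_viewpoint((0.0, 0.0, 5.0), (1.0, 0.0, 0.0)))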

33.8. Selection (object-based interactions)

The other form of interaction is object-based, whether that is a mouse button, a switch on a panel, a drop-down menu item, or an interaction signaled by a superposition of a cursor and a target in the workspace. "Selections" can also be mapped onto objects in the display using a focus-traversal mechanism (a discrete and coarse method for hopping between existing objects in the display, often by pressing the TAB key), or, less frequently in GUI systems, the traversal could be made directly by somehow naming those objects. This raises an important distinction: position-based interactions are posed within free space (they do not necessarily need to map to objects), whereas selection interactions do.


Figure 33.5 User Interfaces for Medical Imaging make good use of the four types of computer inputs: Selection (buttons), Quantification (sliders and knobs), Position (trackball and ultrasound probe), and Text (optional keyboard not visible in this photograph).

And aside from text-based interactions, after these three kinds of interaction, there are NO other kinds of interaction that can be transmitted from humans to computers (or to machines in general, for that matter). (See Fig. 33.5 for examples of selection, quantification, and position inputs.)
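Selection signaled by "a superposition of a cursor and a target" reduces, in implementation terms, to a hit test: the pointer position is compared against the extents of the objects in the workspace, and the object containing it receives the selection event. A minimal 2D sketch with invented rectangular targets:

def hit_test(cursor, targets):
    """Return the first target whose bounding box contains the cursor.
    `targets` is a list of (name, (x, y, w, h)) tuples; returns None if the
    event occurs in free space rather than on an object."""
    cx, cy = cursor
    for name, (x, y, w, h) in targets:
        if x <= cx <= x + w and y <= cy <= y + h:
            return name
    return None

targets = [("zoom_button", (10, 10, 80, 30)), ("slice_slider", (10, 60, 200, 20))]
print(hit_test((25, 70), targets))    # -> 'slice_slider'
print(hit_test((500, 500), targets))  # -> None (free-space position event)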

33.9. Quantification (object-based position setting)

When we consider objects with which the user can interact by setting their position and/or orientation (such as a slider, a lever, a knob, etc.), then we have combined selection and position into a single interaction which we can call "quantification". The data type of the quantifier is, generally, a scalar value along a continuum – which is, in practical terms, sampled from a set of discrete values. But in any case, once the interaction occurs, the device will retain that "state" – it holds the quantity as a value. Any object in the user interface which can respond to a pointing interaction in a way that changes its state and returns that value as a quantity will be a kind of quantifier. It's not trite to point out that quantifiers are simply objects which can support "one degree-of-freedom position". A slider is a 1 dof position input device; the same holds for a knob, except that the interaction is rotational rather than translational.


Figure 33.6 Targeting tasks, which involve navigating through a space and then interacting with the perceived target, can be accomplished using quantifiers (left image), direct pointing (middle image), or changing the position of a proxy or tool.

So we should note that interactions that return ordered pairs corresponding to the position of the interaction are "Position" interactions, which can include 3D position, 6 dof (position and orientation in 3D), or, in general, n-tuples corresponding to the configuration of the input device. (See Fig. 33.6.) However, in the case of quantifiers, the returned n-tuple is associated with the state of an object (i.e., the quantifier), rather than with a position in free space. Accordingly, "quantifiers" are generally used to set parameters for the goal process being controlled, rather than to specify positions in the workspace. Positions within the workspace are set using "proxies", or cursors. Objects which, upon interaction with the pointing mechanism, return an event of being selected or deselected are stateful "switches", and so these produce "button events" – interactions which lead to a change in the state of the way the goal process is being controlled by the user.
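A quantifier, then, is just a stateful object that converts a 1 dof position interaction into a retained scalar value. The sketch below is a minimal illustration: a hypothetical slider that maps an offset along its track onto a parameter range (here an assumed window-width parameter).

class Slider:
    """A 1 dof quantifier: a pointing interaction sets its position along a
    track, and the object retains that state as a scalar parameter value."""

    def __init__(self, lo, hi, track_length=200):
        self.lo, self.hi = lo, hi
        self.track_length = track_length
        self.value = lo  # retained state

    def drag_to(self, pixel_offset):
        """Position interaction constrained to the slider's 1 dof track."""
        t = min(max(pixel_offset / self.track_length, 0.0), 1.0)
        self.value = self.lo + t * (self.hi - self.lo)
        return self.value

window_width = Slider(lo=1.0, hi=2000.0)
print(window_width.drag_to(150))  # ~1500: the quantity now held by the object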

33.10. User interactions: selection vs position, object-based vs free-space

So, we are left with the following restricted space of input types: there are either position events (posed in free space, or mediated by an object in the workspace), or else there are object-based events which are selection events (action events, or one-of-many item events). This sounds a bit audacious. How can there be only two types of interaction, position and selection? You can either specify a position in n degrees of freedom, or you can select an object by interacting with it. Well, this surprising restriction is actually a function of the space of human "outputs". When you consider the human as a system and ask what the set of possible "output devices" is, it turns out the only way for us to "output" is using muscle. There are no other output channels or interfaces.


Figure 33.7 Even when interactions are mediated by a Brain–Computer Interface, these are still object-based selections or changes in the position or orientation of the perceived entities.

That's all we've got! And if you harbor a pious hope that brain–computer interfaces will allow us to download our whole "user experience", think again. When we are cogitating, it's all about "entities and relations", formulated in sequences (i.e., "sequential", not "parallel") within the conscious mind. (See Fig. 33.7.) Humans have goals and beliefs, and in accord with those, to perform tasks, we formulate action plans which are ultimately implemented as movements, which are effected using muscles. And in turn, when our muscles act as effectors that drive the dynamics, kinematics, and configuration of our limbs, these dynamics will evolve in free space, or in interaction with objects in the environment. The resulting behaviors will manifest as an evolution of Forces in relation to Position states (configuration, kinematics, and dynamics). To take a simple example, if we interact while holding a spring that is connected to a stationary frame, then within the "linear range" of the spring there will be a linear relationship between force and position. If we move our fingertip through free space, the characteristic relationship is to observe changes in position with relatively small changes in force. The technical term for this is "isotonic" (the "tone" of the muscle stays constant). At the other extreme is a force exerted on a stationary frame, which resists change in position (i.e., a hard contact with an object). The technical term for this is "isometric" (the position metric stays constant).


On a graph of force versus position (F vs X), behaviors which evolve roughly along horizontal trajectories are "isotonic" and those that evolve along vertical trajectories are "isometric". The former are like "movement through space to new positions", and the latter are "interaction with a stationary object". When you go to push a button on a console, or to interact with a checkbox on a touchscreen, you move through space until you interact with the object, at which point motion stops but force increases. So the sequence of movement-and-selection begins as a roughly isotonic interaction, followed by the selection phase, which is isometric. If you then move to a slider, the release of the button puts you back in a mode of free-space movement, until you interact with the slider lever, at which point you stop moving perpendicular to the plane of the touchscreen but are free to move parallel to the surface, which is a constrained isotonic motion. All interactions with computers, machines, or even surgical tools will be about movement, selection, movement, selection, etc. Those are the only behaviors which can be measured as outputs from the human who is performing a task.
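This force-versus-position characterization suggests a simple way to label phases of a recorded interaction. The sketch below classifies successive (position, force) samples as roughly isotonic (position changing, force nearly constant) or isometric (force changing, position nearly constant); the thresholds and sample values are arbitrary illustrations, not calibrated standards.

def classify_phases(samples, dx_thresh=0.5, df_thresh=0.5):
    """Label each successive pair of (position, force) samples as 'isotonic'
    (movement through space), 'isometric' (interaction against an object),
    or 'mixed'. Thresholds are illustrative, not empirically calibrated."""
    labels = []
    for (x0, f0), (x1, f1) in zip(samples, samples[1:]):
        dx, df = abs(x1 - x0), abs(f1 - f0)
        if dx > dx_thresh and df <= df_thresh:
            labels.append("isotonic")
        elif df > df_thresh and dx <= dx_thresh:
            labels.append("isometric")
        else:
            labels.append("mixed")
    return labels

# Free-space movement toward a button, followed by a press against it.
trajectory = [(0.0, 0.1), (2.0, 0.1), (4.0, 0.2), (4.1, 2.0), (4.1, 4.0)]
print(classify_phases(trajectory))  # ['isotonic', 'isotonic', 'isometric', 'isometric']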

33.11. Text inputs (strings encoded/parsed as formal and informal language)

The following itemized list allows us to begin an analysis of "computer input" devices, starting with the earliest and most flexible form (text input):
• Keyboard (physical keys, membrane key panels, keyboards displayed on touchscreens)
• Voice Recognition
• Handwriting Recognition
• QR-code and barcode scanners
In each of these cases, the user provides a string of characters to the computer (even voice recognition modules ultimately pass strings of text to the computer).

33.12. Language-based control (text commands or spoken language)

There is a very special way that humans can communicate – we seem almost singularly capable of this – through the use of "language". This is not just through the use of human languages (informal, spoken, or written), but also through the production and understanding of very formal languages that can be used to control machines or computers. These can range in complexity from single-word spoken commands (which are really, once detected by the computer, simply forms of "selection" – they are equivalent to pushing a button which selects a mode), to short phrases (typically noun/verb pairs), all the way to spoken command phrases or queries. Spoken language – or, more commonly, a keyboard-typed "command line" or database query – is used less frequently. However, we do not want to omit a discussion of it here.


Information can be passed to a computer, or passed back from a computer, using character-based strings (sentences), or equivalently, using voice recognition and voice synthesis (and, less frequently, using handwriting recognition). But blending these distinctions into "text-based interactions", we must note something important: these interactions are patently not movement-and-selection based on the position of our input devices relative to objects in the workspace – however, semantically they are "about" the entities (objects) in the task-based workspace, and the relationships between these entities. Put more frankly, "language-based interactions" will make use of symbols that represent "objects and entities" [28], and these symbols will take the form of "nouns and verbs". The nouns will either be labels which can be associated with entities, or else they will be demonstratives (this and that, here and there, "pointers and indexes") which refer to entities in the domain. The verbs will describe relationships between entities, such as interactions, manipulations, navigations, or command-like expressions (corresponding to the "imperative" sentence types in human languages).

As an example of these kinds of imperative-sentence text-based interactions, consider the development of IF-based parsers that followed Weizenbaum's (1966) work [30], through Winograd's [32] (in 1972), and on to interactive text-based interfaces for navigation and manipulation within spaces of 3D objects [8], paving the way for more sophisticated natural language understanding systems that can be directed through imperative sentences to perform actions when prefaced by "Siri!" or "Hey, Google!". In a very limited sense, these systems can respond to queries, such as "What time is it?", although often by submitting the queries verbatim to web-based search engines. (See Fig. 33.8.)

For other queries (corresponding to the "where", "what", "when", or, less frequently, "who"), and for ontology-based knowledge-level interactions (such as with medical ontologies), the special verbs are the "existential and possessive" verbs (is-a and has-a) of sentences which have a declarative type. Queries posed as "how" interrogatives seek returned expressions which are in the form of imperatives; they are sequences of actions which specify how to perform a task. Queries posed as "why" interrogatives seek returned expressions which are in the form of declaratives; they are lists of existential and possessive relations between entities in the knowledge domain (i.e., the ontological representation of the clinical workspace). A good example of this is provided in the work of Jannin's team [20] using the OntoSPM knowledge base, Protégé's ontology viewer, and B<>COM's "Surgery Workflow Toolbox" software. (See Fig. 33.9.)

These representations, and their interactive software tools, form "interactions" at a higher level of abstraction, corresponding to the knowledge level.


Figure 33.8 Example of a natural-language interface for medical data analysis (adapted from a concept by Fast et al. [15]).

We do foresee a time when human–machine interaction with Medical Imaging systems, or with Computer-Assisted Interventions, will rise more generally to this semantic level – but for now we will restrict our discussion to lower abstractions corresponding to objects in a medical image or in a clinical workspace, posed in terms of their structural and functional relationships. Accordingly, we return now to movement-and-selection interactions, based on entities and their relationships in the image or in the scene-based display. (See Fig. 33.10.) Gesture-based interactions are not a new class of interaction; they are either position-based or selection interactions (or alternating combinations of just these two).
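As a toy illustration of the verb/noun structure of such imperative commands, the sketch below maps single-phrase utterances onto the two interaction types discussed in this chapter: selection of a named entity, or a position/navigation action. The vocabulary is invented and far simpler than the ontology-based systems cited above.

SELECTION_VERBS = {"select", "show", "hide"}
POSITION_VERBS = {"zoom", "rotate", "pan"}

def parse_command(utterance):
    """Parse a short imperative phrase ('verb noun') into an interaction type
    and its target entity. The vocabulary is illustrative only."""
    tokens = utterance.lower().split()
    if len(tokens) < 2:
        return ("unknown", utterance)
    verb, noun = tokens[0], " ".join(tokens[1:])
    if verb in SELECTION_VERBS:
        return ("selection", noun)   # object-based interaction
    if verb in POSITION_VERBS:
        return ("position", noun)    # navigation/manipulation interaction
    return ("unknown", utterance)

print(parse_command("select left ventricle"))  # ('selection', 'left ventricle')
print(parse_command("rotate volume"))          # ('position', 'volume')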

33.13. Image-based and workspace-based interactions: movement and selection events

The consequence of this analysis is that it leads to a very systematic prescription for a general methodology for estimating the performance of a user – an agent who is performing tasks. Objective estimates of task performance will involve estimating the speed and accuracy of each of these sequences of movement-and-selection... movement-and-selection-and-movement-and-selection, and so on.


Figure 33.9 Goals and Beliefs of the clinician can be represented as ontologies of domain knowledge.

Figure 33.10 Example of an anatomical table showing gesture-based interactions.

Of course, the other side of the loop-of-control is the perception of the display, the analysis and recognition of the state of the workspace and the task evolution, and the subsequent planning and execution of the next movement-and-selection interaction. The initial arc of this loop is known as the "evaluation" aspect of HCI, and the descending arc of this loop of control is called the "execution" aspect of HCI. This is what led Don Norman [26] to propose an overarching principle in HCI: reduce the gulf of evaluation, and the gulf of execution.


Figure 33.11 Example of the Synaptive Software for planning surgical approaches in neurosurgery.

In other words, optimize the display so as to facilitate the user's situational awareness of the progress of their task, and optimize the set of interactions provided to the user so that they can map the goals of their task onto interactions that will realize the task. (See Fig. 33.11.)

In Software Engineering, the programming idiom for human–computer interfaces is such that, when a system starts to run, the view on the display is first initialized to some visual structure (as specified by the programmer, using declarations of the objects that will be contained in the display, along with their relative configuration), and then the computer rests in an "event-driven" state: it waits for some interaction from the user. The software contains a set of event callback functions or, more recently, "event listeners", which encapsulate the particular behaviors (represented as coded modules invoked for any particular "event") that are triggered by these events on the user interface. Since these event-driven behaviors can also depend on the internal system state, this model generalizes to exactly the quintessential set that characterizes general "computation": a finite set of inputs/events, a finite set of states that the system can take, a finite set of output symbols or visual displays, and a mapping between them (called the transformation function) – which really is the "computer program", whether it is represented as a state table, a state diagram, or in a programming language.


Accordingly, the "program" that is written and embodied by the behavior of the interface is derived from an information-processing model of the user's task.
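A minimal sketch of this event-driven idiom, with the transformation function written out explicitly as a state table; the states, events, and actions are invented examples for an imaging console rather than the API of any particular toolkit:

# (state, event) -> (next_state, output/action)
STATE_TABLE = {
    ("idle",      "press_select"):  ("selecting", "highlight_target"),
    ("selecting", "press_confirm"): ("idle",      "apply_selection"),
    ("idle",      "drag"):          ("panning",   "update_viewpoint"),
    ("panning",   "release"):       ("idle",      "freeze_viewpoint"),
}

def dispatch(state, event):
    """The 'transformation function': given the current state and a user
    event, return the next state and the action (output) to execute."""
    return STATE_TABLE.get((state, event), (state, "ignore_event"))

state = "idle"
for event in ["drag", "release", "press_select", "press_confirm"]:
    state, action = dispatch(state, event)
    print(event, "->", state, action)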

33.14. Task representations for image-based and intervention-based interfaces

What remains to be examined is the role that the "task" itself plays when considering the human–computer interface (as shared with Navab's lab in [16] and [27]). Consequently, what gives rise to the behavioral (functional) aspects of the human–computer interface are: (1) the representation of the task embodied by the computer (in terms of the input–output and state-based behaviors, i.e., the "program" that is embodied by the human–computer interface), and (2) the representation of the task embodied by the user, in terms of the perceived inputs and planned motor responses – it is the "user's task", the goal, and the mapping to planned actions that drive the user's behaviors, as part of their task goals and their cognitive beliefs about the progression of the task as the interactions evolve over time.

For Medical Imaging, or for Clinical or Surgical Interactions, the tasks that the user is performing can be characterized in the same way that spoken sentences are characterized. There are only three types of expression that can be posed as the basis for representing these tasks: (1) Declarations about what exists in the displayed data or the experienced workspace, which are expressions at the knowledge level; (2) Imperatives, posed as part of a planning process, that describe the action-based interactions in a workspace, or the steps for processing the imaging data; (3) Queries which, in general, are restricted to questions about "where" (localization/segmentation), "what" (detection/classification), and, less frequently, "when". These three forms of expression, whether represented at the computer algorithm level or in terms of the knowledge-level "goals and beliefs" of the interventionist or the diagnostician, are combined into sequences – in other words, these expressions are sequenced by coordinating conjunctions ("and", "then", "subsequently", "followed by", etc.) or by subordinating conjunctions ("if", "else", "whenever", "otherwise", "in case", etc.). The response to a query allows either the human or the computer to change their knowledge of the state of the task; the result of an imperative is an action which will change the state of the workspace or image data; and declaratives are expressions about the knowledge and beliefs concerning the problem domain. Complex structured tasks can be represented using hierarchical nesting of the workflow representations. (See Fig. 33.12.)
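These three expression types, and the conjunctions that sequence them, can be captured in a small data structure. The sketch below assumes a workflow is simply a hierarchy of typed expressions; the field names and the example plan are illustrative, not those of OntoSPM or any published workflow schema.

from dataclasses import dataclass, field
from enum import Enum
from typing import List

class ExprType(Enum):
    DECLARATIVE = "declarative"   # what exists in the data or workspace
    IMPERATIVE = "imperative"     # an action to perform
    QUERY = "query"               # where / what / when

@dataclass
class Step:
    kind: ExprType
    text: str
    conjunction: str = "then"     # coordinating or subordinating link to the previous step
    substeps: List["Step"] = field(default_factory=list)  # hierarchical nesting

plan = Step(ExprType.IMPERATIVE, "insert electrode along planned trajectory",
            substeps=[
                Step(ExprType.QUERY, "where is the target structure?"),
                Step(ExprType.IMPERATIVE, "advance probe 2 mm",
                     conjunction="if trajectory clear"),
            ])
print(plan)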


Figure 33.12 Operating room setting for an endoscopic transsphenoidal case: (left) anesthesia monitors to follow the vitals of the patient; (right) endoscopic camera showing the surgical cavity, and neuromonitoring using tracked instruments.

33.15. Design and evaluation guidelines for human–computer interfaces: human inputs are computer outputs – the system design must respect perceptual capacities and constraints

By adopting the perspective that HCI design is constrained by the special capacities and constraints of human Perception, Cognition, and Action, we can derive a set of HCI Guidelines for the system's Design and Evaluation. Consider the following diagram, whose arrows represent the flows within an information-processing model of the human. We note the key stages: the perceptual system processes sensory inputs and extracts descriptions of the entities, and their relations, in the user's domain. These descriptions allow the cognitive system to adopt beliefs about the current state of the user's world, which in the case of HCI is the view of a task that is being conducted by the user. The task itself is embodied by the user, and at the cognitive level it is represented as goals and beliefs about the task being conducted. To carry out the task, the user must form action plans which are then executed through the action system which, at its lowest level, operates through control of effectors, with feedback from the sensory system – this completes the loop of perception/cognition/action. (See Fig. 33.13.) In the boxes within this diagram, we identify Design Principles that are a reflection of the special capacities and constraints of the human (the user of the HCI system).


Figure 33.13 User Interface Design Principles result from the special capacities and constraints of the human.

In a very general sense, the design guidelines revolve around a common theme of facilitating the user's task by regarding the human–computer interface as a tool for the user who is conducting a task. The user's goals and beliefs will include a set of expectations about the world being perceived (the domain entities, and their relations) and about the objects being controlled (again, the entities within the domain, and the changes that can be effected to put them into new relations). Accordingly, the display should be designed to allow the user to effortlessly evaluate the state of the task, and there should be controllable objects (which are selection and quantification entities) which provide functionality that can be used to execute a task in step-by-step fashion (sequences of movement and interaction). The mapping between these interactions (which launch functionalities subsequently executed by the system) should be designed according to the principle that there should be minimal and natural mappings from the goals of the user to their execution within the world of the problem domain. If done properly, the system design is able to minimize this "gulf of execution". In addition to these information-processing capacities of the user, there are also constraints that must be recognized simultaneously. The human action system, constrained to two modes of output (movement and interaction), stands in contrast to the perceptual system, which by comparison has a very high throughput of information. A great deal of sensory information can be made available, yet only a small part of that information is task-relevant.


Accordingly, the user must either make use of their attention-based top-down processing to discover the task-relevant information, or else the HCI display can be designed so that this "gulf of evaluation" is minimized. Furthermore, since perceptual–motor control operates incrementally and iteratively as a loop-of-control, the stimulus-response behaviors of the system implementation must be consistent with the low-level expectations of the user. The cognitive system, with its enormous capacity for holding domain knowledge, can face an information bottleneck in terms of its "working memory" capacity for executing short-term tasks. Accordingly, the display itself can be used to hold task-relevant information in order to minimize the reliance on working memory. In the face of these constraints on the perceptual, cognitive, and action systems, a good design will recognize that errors will be made – and so the system should, if possible, be designed in a way which can either prevent errors (by constraining the space of possible inputs as a function of the state of the user's task), or else be designed, again if possible, so that errors can be corrected – so that steps taken can be "undone". The system may also provide information about the kinds of steps that can be taken ("tooltips", for example, or online help or task-relevant examples provided when requested). These Design Guidelines are summarized here (and expanded in [12]):
- Design the system display entities, and interaction behaviors, to match existing user expectations
- The display should be designed so that the user's evaluation of the state of the task is effortless
- The UI controls should afford sequences towards the goal execution with minimal steps/effort
- Displayed task-relevant information should not be ambiguous or incomplete (or else the interface should support queries from the user to resolve it)
- User interaction errors should be prevented, or "undoable" if possible
- The interactions that lead to visual feedback should not violate natural stimulus-response expectations
Now, within the domain of Medical Imaging and Computer-Assisted Interventions, there are two broad categories of task: Medical Imaging Displays for Diagnostic Tasks, and Computer-Assisted Interventional Interfaces for surgical or clinical interactions.

33.16. Objective evaluation of performance on a task mediated by an interface


Before the advent of human–computer interfaces, Experimental Psychologists had known that the objective evaluation of performance on a task can be formulated by considering both the time (or speed) and the error rate (or accuracy) of the sequence of actions which make up the overarching task [25,33,31]. They had also known that these are the only objective measures that can be made. And more recently, owing to the prevalent theories and empirical support of the speed–accuracy trade-off, it has become recognized that these two measures cannot be disentangled. Put succinctly and emphatically, "performance of a task is the product of speed and accuracy, relative to that task". (Within the HMI literature, this fundamental principle goes back as far as Fitts [17], Hyman [23], Hick [21], or perhaps Woodworth [34].) What we would like to emphasize here is that there are no other objective metrics of performance (any other would either be a noncausal correlate measure or, worse, be "subjective", and consequently prone to all of the cognitive and methodological biases that make subjective evaluations a very weak hand to play).

There is an easy Gedankenexperiment that can be conducted in this regard. Pretend, for the sake of argument, that there might be, say, three objective metrics of performance: "Speed, Accuracy, and Path Length". Well, then, what happens if your data show that you have improved your speed and decreased your path length, but your accuracy is lower? Presumably that calls your performance into question. And what if you have improved your accuracy and decreased your path length, but you have been much slower? Have you improved your performance by being much slower while taking a shorter path? Really? But now: what if you have improved your speed and your accuracy, but not your path length? From the perspective of the task, the increase in path length is irrelevant. As long as your speed and accuracy are both better then, by definition, your performance has improved. "Path length" would only arise as a metric if it were explicitly made part of the goals of the task. In that case it would be a dual task, such as when you are asked to "point to a target while at the same time minimizing your path length." You then have two tasks – and you can consider the speed and accuracy of both of those tasks, and then consider whether or not there exists a trade-off between the accuracy constraints of the two tasks – in which case you will need to consider some weighting function if you need to extract a single performance measure from this dual task. Alternatively, you can extract performance metrics on the two aspects of the dual task separately.

This principle of objective performance metrics based on speed and accuracy can be applied hierarchically to any particular subtask of an overarching task, whether it comprises goal-directed movements or discrete choice-based tasks (where, in the discrete case, one considers the reciprocals of the task time and the error rate). While we have tried to provide an encompassing review of a broad area of research, we do hope that this chapter has been an argument that champions this particular emphatic point about the quantitative evaluation of human–machine interfaces for Medical Imaging and Computer-Assisted Interventions.
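The chapter's central metric can be written down directly: for a discrete task, speed is the reciprocal of the task time, accuracy is one minus the error rate, and performance is their product. The sketch below computes that product; any weighting across the subtasks of a dual task would be an additional, explicitly stated modeling choice.

def task_performance(task_time_s, error_rate):
    """Objective performance as the product of speed and accuracy:
    speed = 1 / task time, accuracy = 1 - error rate."""
    if task_time_s <= 0 or not (0.0 <= error_rate <= 1.0):
        raise ValueError("time must be positive and error rate in [0, 1]")
    return (1.0 / task_time_s) * (1.0 - error_rate)

# A faster but sloppier trial vs. a slower, more accurate one.
print(task_performance(task_time_s=2.0, error_rate=0.20))  # 0.40
print(task_performance(task_time_s=3.0, error_rate=0.05))  # ~0.317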


References
[1] K. Abhari, J. Baxter, E. Chen, A. Khan, C. Wedlake, T. Peters, S. de Ribaupierre, R. Eagleson, The role of augmented reality in training the planning of brain tumor resection, in: Augmented Reality Environments for Medical Imaging and Computer-Assisted Interventions (AECAI), 2013.
[2] Fred Brooks, The Mythical Man-Month: Essays on Software Engineering, Addison-Wesley, Reading, MA, 1975.
[3] Fred Brooks, The Design of Design: Essays from a Computer Scientist, Addison-Wesley, Reading, MA, 2010.
[4] W. Buxton, Lexical and pragmatic considerations of input structures, Computer Graphics 17 (1) (1983) 31–37.
[5] W. Buxton, R. Hill, P. Rowley, Issues and techniques in touch-sensitive tablet input, in: Proceedings of SIGGRAPH '85, Computer Graphics 19 (3) (1985) 215–224.
[6] S. Card, W. English, B. Burr, Evaluation of mouse, rate-controlled isometric joystick, step keys and text keys for text selection on a CRT, Ergonomics 21 (8) (1978) 601–613.
[7] S. Card, J.D. Mackinlay, G.G. Robertson, The design space of input devices, in: Proceedings of CHI '90, ACM Conference on Human Factors in Software, 1990.
[8] W. Crowther, D. Woods, WOOD0350 aka. Adventure, computer program; source code; executable online: https://quuxplusone.github.io/Advent/play.html, 1976.
[9] M. Descoteaux, L. Maier-Hein, A. Franz, P. Jannin, L. Collins, S. Duchesne, Introduction: MICCAI 2017, Preface to Proceedings of the 20th International Conference on Medical Imaging and Computer-Assisted Intervention, September 11, 2017, Quebec City.
[10] R. Eagleson, T. Peters, Perceptual capacities and constraints in augmented reality biomedical displays, Australasian Physical & Engineering Sciences in Medicine 31 (4) (2008) 371.
[11] R. Eagleson, S. de Ribaupierre, Visual perception and human–computer interaction in surgical augmented and virtual reality environments, in: Mixed and Augmented Reality in Medicine, 2018, pp. 83–98.
[12] R. Eagleson, G. Hattab, Tutorial on design guidelines for HCI in medical imaging and computer-assisted interventions, in: Conference on Computer-Assisted Radiology and Surgery (CARS 2019), Rennes, June 18, 2019.
[13] D. Engelbart, Augmenting Human Intellect: A Conceptual Framework, Summary Report, Contract AF 49(638) 1024, SRI Project 3578, October, Stanford Research Institute, Menlo Park, Calif., 1962.
[14] W. English, D. Engelbart, M. Berman, Display-selection techniques for text manipulation, IEEE Transactions on Human Factors in Electronics HFE-8 (1) (March 1967) 5–15.
[15] E. Fast, B. Chen, J. Mendelsohn, J. Bassen, M. Bernstein, IRIS: a conversational agent for complex tasks, in: ACM Conference on Human Factors in Computing Systems (CHI 2018), 2018.
[16] M. Feuerstein, T. Sielhorst, J. Traub, C. Bichlmeier, N. Navab, Action- and workflow-driven augmented reality for computer-aided medical procedures, IEEE Computer Graphics and Applications 2 (September/October 2007) 10–14.
[17] P. Fitts, The information capacity of the human motor system in controlling the amplitude of movement, Journal of Experimental Psychology 47 (1954) 381–391.
[18] J. Foley, A. Van Dam, Fundamentals of Interactive Computer Graphics, Addison-Wesley, Reading, MA, 1982.
[19] J.D. Foley, V.L. Wallace, P. Chan, The human factors of computer graphics interaction techniques, IEEE Computer Graphics and Applications 4 (11) (1984) 13–48.
[20] B. Gibaud, G. Forestier, C. Feldmann, G. Ferrigno, P. Gonçalves, T. Haidegger, C. Julliard, D. Katić, H. Kenngott, L. Maier-Hein, K. März, E. de Momi, D.Á. Nagy, H. Nakawala, J. Neumann, T. Neumuth, J. Rojas Balderrama, S. Speidel, M. Wagner, P. Jannin, Toward a standard ontology of surgical process models, International Journal of Computer Assisted Radiology and Surgery 13 (2018) 1397–1408.


[21] W. Hick, On the rate of gain of information, Quarterly Journal of Experimental Psychology 4 (1952) 11–26.
[22] E. Hutchins, J. Hollan, D. Norman, Direct manipulation interfaces, Human–Computer Interaction 1 (1985) 311–338.
[23] R. Hyman, Stimulus information as a determinant of reaction time, Journal of Experimental Psychology 45 (1953) 188–196.
[24] M. Kramers, R. Armstrong, S. Bakhshmand, A. Fenster, S. de Ribaupierre, R. Eagleson, Evaluation of a mobile augmented reality application for image guidance of neurosurgical interventions, in: MMVR, Medicine Meets Virtual Reality, 2014, pp. 204–208.
[25] L. McLeod, The interrelations of speed, accuracy, and difficulty, Journal of Experimental Psychology 12 (5) (1929) 431–443.
[26] D. Norman, User Centered System Design: New Perspectives on Human–Computer Interaction, CRC Press, ISBN 978-0-89859-872-8, 1986.
[27] B. Preim, HCI in medical visualization, in: Hans Hagen (Ed.), Scientific Visualization: Interactions, Features, Metaphors, Dagstuhl Publishing, 2011, pp. 292–310.
[28] Willard Van Orman Quine, Word and Object, MIT Press, 1960.
[29] B. Shneiderman, The future of interactive systems and the emergence of direct manipulation, Behaviour & Information Technology 1 (3) (1982) 237–256.
[30] J. Weizenbaum, ELIZA – a computer program for the study of natural language communication between man and machine, Communications of the ACM 9 (1966) 36–45.
[31] A. Wickelgren, Speed-accuracy tradeoff and information processing dynamics, Acta Psychologica 41 (1) (February 1977) 67–85.
[32] T. Winograd, Understanding Natural Language, Academic Press, New York, 1972.
[33] C. Wood, R. Jennings, Speed-accuracy tradeoff functions in choice reaction time: experimental design and computational procedures, Perception and Psychophysics 19 (1) (1976) 92–102.
[34] R. Woodworth, Accuracy of Voluntary Movement, PhD Thesis (Advisor: Cattell), Columbia University, 1899.
