The effect of known decision support reliability on outcome quality and visual information foraging in joint decision making


Applied Ergonomics 86 (2020) 103102


Sandra Dorothee Starke, MSc, PhD a,b,*, Chris Baber, PhD a

a School of Engineering, University of Birmingham, B15 2TT, UK
b Aston Business School, Aston University, Birmingham, B4 7ET, UK

* Corresponding author. E-mail addresses: [email protected] (S.D. Starke), [email protected] (C. Baber).
https://doi.org/10.1016/j.apergo.2020.103102
Received 9 September 2018; Received in revised form 15 February 2020; Accepted 22 March 2020; Available online 6 April 2020
0003-6870/© 2020 Elsevier Ltd. All rights reserved.

Abstract

Decision support systems (DSSs) are being woven into human workflows from aviation to medicine. In this study, we examine decision quality and visual information foraging for DSSs with different known reliability levels. Thirty-six participants completed a financial fraud detection task, first unsupported and then supported by a DSS which highlighted important information sources. Participants were randomly allocated to four cohorts and informed that the system's reliability was 100%, 90% or 80%, or given no reliability information. Results showed that only a DSS known to be 100% reliable led participants to follow its suggestions systematically, increasing the percentage of correct classifications to a median of 100% while halving both decision time and the number of visually attended information sources. In all other conditions, the DSS had no effect on most visual sampling metrics, while the decision quality of the human-DSS team was below the reliability level of the DSS. Knowledge of an even slightly unreliable system hence had a profound impact on joint decision making, with participants trusting their significantly worse performance more than the DSS's suggestions.

1. Introduction

Advances in artificial intelligence (AI) have increased the cross-disciplinary drive towards computational decision support systems (DSSs) supporting human decision making. Domains include aviation (Metzger and Parasuraman, 2005; Rovira and Parasuraman, 2010), medicine (Goddard et al., 2012, 2014; Guerlain et al., 1999; Miller, 1985) and finance (Dal Pozzolo et al., 2014; Rose and Rose, 2003). However, it remains underexplored how and to what extent humans integrate DSS suggestions into their actual decision making and work activity. While some studies have found evidence for improved decision quality in a DSS-supported task (Corcoran et al., 1972; Dalal and Kasper, 1994; Goddard et al., 2014; Madhavan and Wiegmann, 2007; Thackray and Touchstone, 1989), others reported no performance gain or even performance degradation compared to an unsupported task (Dalal and Kasper, 1994; Metzger and Parasuraman, 2005; Skitka et al., 1999). This is especially the case when the DSS is imperfect (Metzger and Parasuraman, 2005; Skitka et al., 1999). Studies into the effect of DSS reliability on user performance tend to explore the reliability experienced by the user; evidence for the effect of known system reliability on user behaviour remains sparse (Goddard et al., 2012). The aim of this pilot study was to explore the effect of known system reliability on decision quality and visual information foraging in a task expected to be completed at near chance level without DSS support.

The practical significance concerns whether communicating DSS reliability would increase compliance with DSS suggestions in a complex task, and how reliable a DSS needs to be to ensure uptake.

1.1. Reliability in dyadic decision making

The reliability of human-machine teams, or 'Joint Cognitive Systems', has been of interest for decades (de Vries et al., 2003; Dzindolet et al., 2002; Lee and Moray, 1992; Muir, 1987; Sorkin and Woods, 1985). Importantly, joint task completion introduces human behaviours and associated errors not present during sole decision making (Bahner et al., 2008; Goddard et al., 2012, 2014; Lee and See, 2004; Parasuraman and Manzey, 2010; Parasuraman and Riley, 1997). For example, 'automation bias' (Mosier and Skitka, 1999) or 'complacency' result in uncritical over-reliance on machine suggestions in a human-machine team (Goddard et al., 2012; Parasuraman and Manzey, 2010). Here, 'compliance' describes an operator consistently acting on a DSS alert, whereas 'reliance' describes an operator not acting in the absence of a DSS alert (Meyer, 2001). Errors of 'commission' describe accepting an erroneous system suggestion, while errors of 'omission' describe missing events not flagged by the system (Skitka et al., 1999), leading to lower joint net improvement than would otherwise be expected (Goddard et al., 2014; Skitka et al., 1999). A major factor influencing the adoption of a DSS is the reliability that



users believe it exhibits (de Vries et al., 2003; Dzindolet et al., 2003; Lowe et al., 2002; Madhavan and Wiegmann, 2007; Metzger and Parasuraman, 2005; Muir, 1987; Parasuraman and Riley, 1997). Other factors such as the operator's personality traits, cognitive load, display design or accountability make a further contribution (Goddard et al., 2012). Trust in a system and the users' confidence in their own decision can be modelled as two interacting factors predicting the use of a DSS (Lee and Moray, 1992, 1994). If users over-estimate their own ability when working with a reasonably reliable system, the miscalibration between trust and reliability (Seong and Bisantz, 2008) results in poor joint performance.
Experience of an imperfect system reduces DSS trust and consequently its use: error rates such as 30% (Madhavan and Wiegmann, 2007) or even as low as 6% (Skitka et al., 1999) generally result in reduced use of system support. Morar and Baber (2017) showed that an automation error rate of 80% resulted in participants taking responsibility for all decisions, while they relied fully on system recommendations for a system with a 20% error rate. Experiencing errors may turn strong a priori trust in a system into strong mistrust (Dzindolet et al., 2003). However, even when knowing their own reliability to be inferior to the DSS, users may still follow their own judgements, a behaviour also termed 'self-reliance' (Dzindolet et al., 2003).

1.2. The role of cueing in the context of eye movement research

Providing cues to human operators to draw their attention to information identified as relevant by a system can be beneficial for the resulting decision quality (Bliss, 2003; Botzer et al., 2013; Maltz and Meyer, 2001; Mosier and Skitka, 1996; Onnasch et al., 2014). Such cues can, for example, indicate a specific region on the display to inspect (Yeh et al., 2003; Maltz and Shinar, 2003; Botzer et al., 2015). Cues can be provided to draw attention to items of information that could be beneficial to a decision, particularly when the display contains multiple pieces of information. Studies exploring this show that participants tend to miss targets in uncued regions. This could be the result of 'cognitive tunnelling' (Yeh et al., 2003), satisfaction of search (Berbaum et al., 1998), or matching the probability of target occurrence (Bliss et al., 1995).
Eye tracking offers a valuable complement in the study of visually guided decision making. Due to the physiology of the visual system, humans only see items which they directly look at ('fixate') in high resolution (Findlay and Gilchrist, 2003). This necessitates eye movements ('saccades') to build up scene perception, directing foveal vision to parts of the scene that are likely to hold information relevant to a given task. Analysing the resulting eye movements offers an insight into the underlying cognitive processes (Henderson, 2003; Hayhoe and Ballard, 2005): we look at what we are interested in (Yeh et al., 2003; Findlay and Gilchrist, 2003). Accordingly, eye tracking has previously been used, for example, in the study of attention allocation during cueing (Botzer et al., 2015) and under different alarm reliabilities (Onnasch et al., 2014).

1.3. Objectives and hypothesis

The objective of this study was to quantify whether working with a DSS of a given reliability a) triggers more accurate decisions compared to an unsupported task, b) reduces the time to make decisions and c) leads to changes in users' visual information sampling behaviour. We hypothesised that a DSS with ≥90% reliability would trigger compliance/reliance and hence attention to highlighted areas only, with user reliability corresponding to system reliability. We further anticipated that participants would look at the region highlighted as most important first. For a DSS known to be 80% reliable, we expected attention to more areas of interest to verify the system choice, and various levels of self-reliance across participants.

2. Material and methods

2.1. Participants

Thirty-six participants (mean (SD) age: 21 (5) years, 24 male) were recruited for this study from the student body and affiliates of the University of Birmingham, Birmingham, UK. This study was approved by the University of Birmingham Ethics Panel (Reference Number ERN_13–0997). Participants provided written informed consent.

2.2. Task

Participants took on the role of a 'bank fraud analyst'. A version of this task has been reported previously in Starke and Baber (2018); here we modified it to include cueing to participants. The task was to screen two sets of 15 credit card transactions each for credit card fraud and provide a binary classification as 'normal' or 'fraudulent'. The first set of transactions was unsupported, the second set was supported by a DSS (Fig. 1). Participants were randomly assigned to one of four cohorts with N = 9 each. All cohorts completed the unsupported task first. Subsequently, they completed the DSS supported task, being informed that the DSS was 100% reliable, 90% reliable or 80% reliable, or receiving no information about its reliability (where reliability was undisclosed, unknown to the participant, the system was actually 100% reliable). In the '90% reliable' condition, the system was precisely 93% reliable due to the number of transactions presented.
A transaction had to be classed as normal or fraudulent (Fig. 2) based on nine transaction attributes. These were presented in distinct areas of interest (AOIs) in a 3 × 3 grid layout within a user interface. The interface and experiment were created in Matlab (The MathWorks). Transaction attributes were selected as potentially relevant to the detection of credit card fraud based on reports of fraud detection processes; they are listed in Table 1 together with the classification thresholds presented to the participant in written form above the monitor. For any given transaction, there was disagreement between attributes as to the true state of the transaction (for example, three attributes may suggest fraud and six may suggest a normal transaction), and the participant was instructed to make a decision based on those attributes he/she considered most important. Once the participant had classified the transaction, feedback in the form of the DSS's classification was provided as a written statement on screen. For the unreliable DSSs (80%, 90%), this message contradicted the solution arising from the highlighted information on a number of transactions determined by the stated reliability level.
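As an illustration of how such a condition can be implemented, the sketch below derives a per-trial cue-correctness schedule from the disclosed reliability. It is written in Python rather than the Matlab used for the original interface, and the function name, seeding and printout are our own illustrative choices; rounding over 15 transactions is also why the nominal '90% reliable' condition was in fact 93% reliable (14 of 15 trials).

import random

def cue_schedule(n_trials: int, disclosed_reliability: float, seed: int = 1) -> list[bool]:
    """Return one boolean per trial: True if the highlighted sources point to
    the correct classification. The number of correct cues is the closest
    integer to disclosed_reliability * n_trials, which is why a nominal 90%
    condition over 15 transactions works out at 14/15 (about 93%)."""
    n_correct = round(disclosed_reliability * n_trials)
    schedule = [True] * n_correct + [False] * (n_trials - n_correct)
    random.Random(seed).shuffle(schedule)  # spread the incorrect cues across the block
    return schedule

# The four cohorts of the study (the undisclosed condition was in fact 100% reliable).
for label, r in [("100%", 1.0), ("90%", 0.9), ("80%", 0.8), ("undisclosed", 1.0)]:
    s = cue_schedule(15, r)
    print(label, f"{sum(s)}/15 correct cues")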

2.3. Patterns

Three sets of 'fraud patterns' (three matching sources indicating fraud, the remaining six sources populated at random), which participants were not expected to learn given the low number of cases, were embedded in the data. These patterns were 1) location, transaction time and expiry date; 2) card present, time and expiry date; 3) bank, CVV and history. A further set of three transactions indicated fraud based on random patterns, and three transactions were normal. This resulted in 15 transactions of which 12 were fraudulent and 3 were normal. Transaction content was randomised between the unsupported and supported condition, and the location of AOIs differed between the unsupported and supported task. Where attributes took numerical values, these were varied by sampling from uniform distributions separated by the threshold visible to the participant. Stimuli were presented in Tobii Studio (Tobii, Sweden) to facilitate eye tracking during task completion.
The first set of 15 transactions was completed without DSS support. Prior to commencing the second set of 15 transactions with DSS support, participants were informed that they would receive computer support based on a (fictional) machine learning algorithm, for which they were given the system's reliability information in writing on screen.


Fig. 1. Schematic of the user interface in the unsupported task (left) and the DSS supported task (right). The 'most important' information source was highlighted in blue; two 'also important' sources were highlighted in grey. Information agreed between all three sources. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

Fig. 2. Task sequence for a given transaction (left) and example of gaze data for three participants (right).

The computer highlighted what it considered the most important information source with a bold blue border, and two additional information sources also considered important with a bold grey border (Fig. 1). The form of highlighting (bold blue or grey borders) was selected to be sufficient to attract the participants' attention without causing distraction (Roads et al., 2016). We chose two levels of highlighting to check whether participants would attend to the most important AOI faster than to the other AOIs. The information provided by the three highlighted sources always agreed regarding the fraud classification; that is, attending to the highlighted sources ought to lead the user to interpret a specific fraud pattern correctly. The percentage of trials for which the highlighted sources indicated the correct fraud pattern was set according to the known reliability of the DSS. For the cohort which was not informed as to the reliability, unknown to the participants, 100% of trials were predicted correctly by the highlighted information sources.

Table 1
Transaction attributes. The nine transaction attributes associated with a transaction and the range of values indicating a normal or fraudulent transaction.

Attribute                 Normal            Fraud
Card expiry               ≥5 days           ≤4 days
Card issued by bank       Hanford           NorthWest
Card present              Yes               No
CVV entered               Yes               No
Local transaction time    6:00 to 20:00     20:00 to 6:00
Purchase location         Europe            USA
Transaction amount        ≤500              ≥510
Transaction history       N/A               3 small amounts
Type of goods             Travel agent      Electrical goods
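The Table 1 thresholds and the uniform sampling described in Section 2.3 can be summarised in a short sketch. The snippet below is illustrative only: the attribute keys, the example record and the sampling bounds either side of the 500/510 amount threshold are our assumptions, not values taken from the original stimuli.

import random

# Predicates returning True when an attribute value falls in the 'fraud' range of Table 1.
FRAUD_INDICATORS = {
    "card_expiry_days":  lambda v: v <= 4,                  # normal: >= 5 days
    "issuing_bank":      lambda v: v == "NorthWest",        # normal: Hanford
    "card_present":      lambda v: v == "No",
    "cvv_entered":       lambda v: v == "No",
    "local_time_hour":   lambda v: v >= 20 or v < 6,        # normal: 06:00 to 20:00
    "purchase_location": lambda v: v == "USA",              # normal: Europe
    "amount":            lambda v: v >= 510,                # normal: <= 500
    "history":           lambda v: v == "3 small amounts",
    "goods":             lambda v: v == "Electrical goods", # normal: travel agent
}

def fraud_votes(transaction: dict) -> int:
    """Number of the nine attributes whose value falls in the fraud range."""
    return sum(FRAUD_INDICATORS[k](v) for k, v in transaction.items())

def sample_amount(fraudulent: bool, rng: random.Random) -> float:
    """Numeric attributes were drawn from uniform ranges separated by the visible
    threshold; the bounds 10/500 and 510/1000 used here are illustrative."""
    return rng.uniform(510, 1000) if fraudulent else rng.uniform(10, 500)

rng = random.Random(0)
example = {
    "card_expiry_days": 3, "issuing_bank": "Hanford", "card_present": "Yes",
    "cvv_entered": "Yes", "local_time_hour": 23, "purchase_location": "Europe",
    "amount": sample_amount(True, rng), "history": "N/A", "goods": "Travel agent",
}
print(fraud_votes(example), "of 9 attributes indicate fraud")  # attributes deliberately disagree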

2.4. Variables

The independent variable of this study was system reliability (80%, 90%, 100% or undisclosed). The outcome variables were percentage correct and decision time (decision quality) as well as the number of attended AOIs and the time to first hit per AOI (visual information foraging).

2.5. Data collection

A screen-mounted eye tracker (X2-60, Tobii, Sweden) was used to record gaze data at 60 Hz. A standard nine-point calibration was performed for each participant prior to commencing each of the two sessions. Each transaction started with the display of a fixation cross for 3 s. The fixation cross was positioned to the right of the screen, outside the panel showing the transaction attributes (Fig. 2). Transaction details were then displayed for as long as the participant chose to examine the available information. Once the participant felt confident to make a decision, he/she logged the decision through a questionnaire in Tobii Studio.
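For readers who want to reproduce the gaze metrics outside Tobii Studio, a minimal sketch of the AOI bookkeeping is given below. The panel geometry, field names and example fixations are assumptions for illustration; the original analysis used Tobii Studio's fixation filter and Matlab.

from dataclasses import dataclass

@dataclass
class Fixation:
    t_start: float  # seconds from transaction onset
    x: float        # screen coordinates in pixels
    y: float

def aoi_index(x: float, y: float, panel=(100, 100, 1000, 700), grid=(3, 3)):
    """Map a fixation to one of the 3x3 AOIs, or None if outside the attribute panel.
    The panel rectangle (left, top, width, height) is illustrative."""
    left, top, width, height = panel
    col = int((x - left) // (width / grid[0]))
    row = int((y - top) // (height / grid[1]))
    if 0 <= col < grid[0] and 0 <= row < grid[1]:
        return row * grid[0] + col
    return None

def time_to_first_visit(fixations: list[Fixation]) -> dict[int, float]:
    """Earliest fixation onset per attended AOI; unattended AOIs are simply absent."""
    first: dict[int, float] = {}
    for f in fixations:
        aoi = aoi_index(f.x, f.y)
        if aoi is not None and aoi not in first:
            first[aoi] = f.t_start
    return first

fix = [Fixation(0.4, 520, 380), Fixation(0.9, 180, 150), Fixation(1.6, 530, 390)]
print(time_to_first_visit(fix))                    # e.g. {4: 0.4, 0: 0.9}
print(len(time_to_first_visit(fix)), "AOIs attended")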


2.6. Follow-up interview

After completing both study sessions, a short interview was conducted in which participants were asked 1) whether they preferred the unsupported or the DSS supported task and 2) how they dealt with the given (or unknown) reliability of the system.

2.7. Data analysis and statistics

Fixations on each of the nine AOIs (Fig. 2) were calculated in Tobii Studio using the Tobii fixation filter. Gaze data were exported for further processing in Matlab (The MathWorks). From the gaze data, the number of attended AOIs and the time to first visit for each AOI were calculated. The recorded decision logs were used to calculate decision time and the percentage of correctly classified transactions. From the interviews, the preference for the unsupported or supported scenario was noted and the participants' reasoning summarised.
Statistical analysis was conducted in IBM SPSS 24 with α = 0.05. A Shapiro-Wilk test showed that not all datasets could be assumed to follow a normal distribution, hence non-parametric statistics were used throughout and median/interquartile range (IQR) values are reported. A Kruskal-Wallis test compared outcomes between the four cohorts for the unsupported condition to examine whether cohorts performed similarly when unsupported. Wilcoxon signed-rank matched-pairs tests were used to compare the unsupported and supported task within each cohort to examine whether the DSS had a significant effect on the outcome measures.
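The same analysis pipeline can be expressed compactly outside SPSS. The sketch below uses SciPy with made-up scores purely to show the sequence of tests (normality screen, between-cohort Kruskal-Wallis on the unsupported task, within-cohort Wilcoxon signed-rank, and median/IQR reporting); it is not the original analysis code.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative per-participant "percentage correct" scores (four cohorts, N = 9 each).
cohorts_unsupported = [rng.uniform(50, 75, 9) for _ in range(4)]
supported_100 = np.clip(cohorts_unsupported[0] + rng.uniform(20, 45, 9), 0, 100)

# Normality screen with alpha = 0.05: if any cohort deviates, use non-parametric tests.
normal = all(stats.shapiro(c).pvalue > 0.05 for c in cohorts_unsupported)
print("all cohorts plausibly normal:", normal)

# Between-cohort check on the unsupported task (were cohorts comparable at baseline?).
h, p_kw = stats.kruskal(*cohorts_unsupported)
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p_kw:.3f}")

# Within-cohort unsupported vs supported comparison (paired, non-parametric).
w, p_wx = stats.wilcoxon(cohorts_unsupported[0], supported_100)
print(f"Wilcoxon signed-rank: W = {w:.1f}, p = {p_wx:.3f}")

# Medians and interquartile ranges, as reported in the Results.
med = np.median(supported_100)
iqr = np.subtract(*np.percentile(supported_100, [75, 25]))
print(f"median (IQR): {med:.0f}% ({iqr:.0f}%)")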

3. Results

3.1. Unsupported task

There was no significant difference between the four cohorts for the unsupported task with regard to decision quality (percentage correct (p = 0.912), decision time (p = 0.911)) or visual information foraging (number of information sources attended (p = 0.245), time to first visit (p ≥ 0.694)). From this, we assume that participants had similar levels of performance.

3.2. Supported task

With regard to decision quality, DSS support had a significant effect on the percentage of correctly classified transactions compared to the unsupported task (Fig. 3, top left) for the 'known 100% reliable' condition (p = 0.012, mean difference 40%, increasing from a median (IQR) of 60% (17%) to 100% (30%)) and the 'known 80% reliable' condition (p = 0.042, mean difference 6%, increasing from a median (IQR) of 67% (10%) to 73% (20%)). DSS support had a significant effect on decision time compared to the unsupported task (Fig. 3, top right) for the 'known 100% reliable' condition only (p = 0.011, mean difference 8.2 s, decreasing from a median (IQR) of 14.0 s (11.2 s) to 5.8 s (8.9 s)).
With regard to visual information foraging, DSS support had a significant effect on the number of attended AOIs compared to the unsupported task (Fig. 3, bottom left) for the 'known 100% reliable' condition only (p = 0.026, mean difference 5 AOIs, decreasing from a median (IQR) of 9 (0) to 4 (5) AOIs). DSS support had a significant effect on the time to first visit compared to the unsupported task for regions highlighted blue (Fig. 3, bottom right; p ≤ 0.011, average mean difference 1.7 s, decreasing from an average median (IQR) of 2.4 s (2.8 s) to 0.7 s (0.9 s)) and for regions highlighted grey (p ≤ 0.021, average mean difference 2.1 s, decreasing from an average median (IQR) of 4.3 s (2.7 s) to 2.2 s (1.2 s)) in all four conditions, with the exception of blue highlighting in the 'undeclared reliability' condition, for which there was no effect. DSS support had no significant effect on the time to first visit for the non-highlighted regions (p ≥ 0.214). In the 'known 100% reliable' condition, six participants did not look at most non-highlighted regions, hence this statistical comparison was omitted.

Fig. 3. Key outcome metrics for the unsupported (white) and subsequently DSS supported (grey) task across the four participant cohorts. Shown are the median and interquartile range. Only the cohort which worked with a known 100% reliable DSS showed a marked change compared to the unsupported condition. Attention to the AOI highlighted as most important increased substantially across all four cohorts.


3.3. Preferences

Across the four conditions, the majority of participants (75%) preferred the DSS supported task, with a strong preference in all but the 'undeclared reliability' condition (Table 2). The most common reasons for preferring the DSS task were: a faster working speed, the need to look at fewer information sources, guidance of attention to important attributes, help in understanding the system's reasoning and easier recognition of patterns. The most common reasons for preferring the unsupported task were: the system suggestion did not match the participant's own reasoning, the highlighting was distracting and/or misleading, the DSS seemed to introduce bias, working with the DSS required less thinking, there was no sense of ownership of the decision, and the support took away the 'thinking' and the enthusiasm of trying to find the correct answer.
In all but the 100% reliable condition, most participants mentioned checking the highlighted information sources against the remaining sources when asked how they handled the reliability of the DSS. This was done to examine how the machine reasoning fitted with the trend of the remaining sources. One participant mentioned that there had to be a "good reason" not to follow the DSS advice. The discrepancy between the information contained in the highlighted and non-highlighted sources was also a point of frustration for participants who chose to follow their own judgement.

Table 2
Comparison between change in performance and preference. Task preference in the context of performance in the DSS supported task across the four participant cohorts. Presented is the measured change in percentage of correct answers when using the DSS supported task as well as the classified change in performance (improved, no change, worsened) in relation to preference for the unsupported or supported task.

                                       100%          90%           80%           Undisclosed   Total
Preference for DSS supported task
  Mean (SD) change                     34.1 (19.1)   11.1 (15.1)   16.1 (12.1)   5.1 (17.1)    –
  Improved                             8             5             7             3             23
  No change                            0             0             0             2             2
  Worsened                             0             2             0             0             2
  Sum                                  8             7             7             5             27
Preference for unsupported task
  Mean (SD) change                     –             10.1 (5.1)    3.1 (5.1)     10.1 (4.1)    –
  Improved                             0             0             0             4             4
  No change                            1             0             1             0             2
  Worsened                             0             2             1             0             3
  Sum                                  1             2             2             4             9
N                                      9             9             9             9             36

4. Discussion

4.1. Summary

This study showed that only a known 100% reliable DSS had a substantial impact on participants' task completion, leading to a median of 100% correct classifications, changes to visual information foraging and significant time savings. Across all other cohorts, there was no effect of the DSS on the number of information sources examined, a small improvement in decision accuracy only in the '80% reliable' condition, and no effect on decision time. However, the time to first visit for the most important AOI (highlighted blue) systematically dropped across cohorts by two-thirds, from on average 2.4 s down to 0.7 s. This demonstrates that participants did indeed attend to these regions faster, although they might not have based their decision on them.

4.2. Effect of cueing on visual information foraging

The '100% reliable' cohort looked almost exclusively at the three highlighted AOIs in the DSS supported task and omitted the remaining information sources. Similarly, in a previous study into the effect of cues on visual search, Botzer et al. (2015) found that cues resulted in participants attending mainly to the cued content. However, the present study illustrates that for a DSS with known imperfect reliability (90% or 80% correct) or undisclosed reliability, participants examined almost all available AOIs despite the highlighting. Hence, a DSS with a level of uncertainty may not benefit from highlighting strongly guiding visual information foraging, as operators may aim to consolidate a decision by checking against non-highlighted AOIs. Similarly, we previously showed that operators tend to look at all information sources if these are readily available rather than developing shortcuts (Starke and Baber, 2018). In the present study, highlighting of AOIs resulted in faster attention to the cued location compared to the control condition, especially for the AOI highlighted in blue as most important. However, this did not guarantee that the DSS suggestion was then actually followed by the participant. Similarly, previous work found that when using information highlighting, participants looked more often and for longer at highlighted locations, but did not reduce their overall gaze distribution (de Koning et al., 2010). This means that while highlighting ensures that operators attend to an information source quickly, they likely still proceed to examine other information relevant to the task based on their own judgement if there is any uncertainty regarding the DSS's reliability. This may be beneficial, since cross-checking could reduce the risk of compliance and automation bias. Highlighting therefore does not guarantee that the highlighted information is actually used exclusively to make or guide a decision.

4.3. Self-reliance

Across conditions, we observed a tendency towards self-reliance when making decisions. This matches the strong bias towards one's own judgement and capabilities over those of a DSS shown in previous work (de Vries et al., 2003; Dzindolet et al., 2003; Lee and Moray, 1994). It may be that participants assume their own reliability to be better when informed that a DSS is less than 100% reliable and therefore rely on their own judgement, as evident from previous work (Bahrami et al., 2010; Bang et al., 2014; Morar and Baber, 2017). Furthermore, previous work showed that errors made in one's own judgement influence the trust in one's decisions less than errors observed in an automated system affect the trust in that system (de Vries et al., 2003). This could explain why participants, while being aware of the mismatch between their decision and the system classification, continued to rely on themselves. While performing at near chance level, participants in the present study still seemed to remain confident that their own opinion should be weighted highly. Surprisingly, even in the 100% reliable condition, three out of nine individuals chose to follow their own opinion, resulting in poor joint performance substantially below the DSS accuracy level. Participants justified this with the DSS highlighting what they felt was irrelevant information. This disregarding of a DSS constitutes 'disuse', where the DSS is more reliable than the user but not integrated sufficiently (Dzindolet et al., 2003). Disuse arises from knowledge of a (possibly) imperfect DSS and can cause users to override correct


suggestions and accept incorrect suggestions as observed in applied domains such as medicine (Goddard et al., 2012, 2014) or aviation (Skitka et al., 1999). In the future, tackling the issue of inappropriate self-reliance without causing automation bias will therefore be one of the important challenges to solve.

4.4. Combined performance of the human-machine team

Across the three cohorts that worked with a DSS that involved a level of uncertainty regarding the system's suggestion (known 90% reliable, known 80% reliable, undisclosed), joint human-machine team performance was often similar to sole performance in the unsupported task and in most cases below the reliability level of the DSS alone. This finding is similar to previous work reporting no performance gain or even performance degradation in a DSS supported task (Dalal and Kasper, 1994; Metzger and Parasuraman, 2005; Skitka et al., 1999). In particular, in the two conditions with known but imperfect reliability (90% and 80%), this study demonstrated that an imperfect system paired with an imperfect user can result in performance substantially below the reliability level of the DSS alone. Errors introduced by a human decision maker incorrectly overriding DSS suggestions may be addressable by introducing approaches such as critiquing (Guerlain et al., 1999; Miller, 1985), where the DSS monitors human judgement and provides input only after the human has finalised a decision. In a medical task, critiquing led to a significant improvement in performance (Guerlain et al., 1999). Another factor impacting on performance is the actual scope for the human to contribute to a decision. While DSS tasks may assume joint effort, the role of operators in technical systems is often supervisory rather than active (Bahner et al., 2008). A key challenge then lies in encouraging the human operator to work together with the automation, rather than blindly following the system or ignoring it altogether. However, it is questionable whether a human operator with reliability less than that of a DSS can improve joint performance beyond the DSS's reliability. Similarly, studies of pairs of human decision makers performing simple perceptual decision tasks (Bahrami et al., 2010) show that when two decision makers have similar levels of reliability (or sensitivity) in a detection task, their combined performance can be superior to that of either individual as long as they can communicate freely to discuss their judgements. However, performance can be worse if one person's reliability is much lower than the other's (Bahrami et al., 2010).
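A simple mixture model makes the arithmetic behind this point concrete. The sketch below is illustrative only: the compliance parameter and the independence assumption are ours, and the 60% human accuracy merely mirrors the unsupported medians reported above. Under these assumptions, whenever the operator sometimes substitutes their own, less reliable judgement without adding information, the expected team accuracy falls below that of the DSS alone.

def expected_team_accuracy(p_dss: float, p_human: float, compliance: float) -> float:
    """Expected proportion correct when the operator accepts the DSS suggestion
    with probability `compliance` and otherwise decides alone, assuming the two
    judgements are independent of which trials each happens to get right."""
    return compliance * p_dss + (1 - compliance) * p_human

# Roughly the study's regime: a ~60% accurate decision maker paired with an 80-100% reliable DSS.
for p_dss in (1.0, 0.9, 0.8):
    for compliance in (1.0, 0.5, 0.0):
        acc = expected_team_accuracy(p_dss, p_human=0.6, compliance=compliance)
        print(f"DSS {p_dss:.0%}, compliance {compliance:.0%} -> team {acc:.0%}")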

4.5. Preferences for DSS or unsupported task

Despite the observed trend towards self-reliance in this study, the majority of participants (27 out of 36) preferred the supported (DSS) task and often found it "easier". This finding matches that of Botzer et al. (2015). The lowest preference for the DSS supported task was found in the 'undeclared reliability' cohort (five out of nine participants preferred the DSS task). Here, participants did not consistently follow the DSS, highlighting a potential bias towards mistrusting an unknown system, despite it being 100% correct in its suggestions. The literature provides different accounts regarding the baseline attitude towards a DSS: while some work shows that people generally at first trust an unfamiliar system (Dzindolet et al., 2003), other work suggested a predisposition to distrust it (Lee and Moray, 1994). This may in part result from a discrepancy between self-reported behaviour and actions (Dzindolet et al., 2002): in self-reports, people may describe a bias towards an automated system, while in practice actually relying on their own judgement to a substantial extent (Dzindolet et al., 2002). In the present study, participants may have had a bias towards mistrusting an unknown system. In a recent paper that used reinforcement learning to model human decision making, it was shown that "strategies for information gathering and decision making are an emergent consequence of the reliability of the information sources, relative to that of the guidance, and the cost of accessing these sources" (Acharya et al., 2018). In this model, when the reliability of a specific information source fell below 98%, it was rational to consult other sources as well. This is the effect that we observed in this study.

4.6. Limitations

This pilot study was based on a small sample size per cohort (N = 9), and future work should aim to increase the sample size in order to improve the precision of population estimates. The sample size was sufficient to detect large effects; larger samples may, however, detect more subtle effects. All participants completed the unsupported condition prior to the subsequent trials with DSS support, which could be subject to learning effects. However, given the small number of cases and the complexity of the patterns, it should have been impossible to learn the task. This could either mean that participant responses were random or that they used their own assumptions to guide decision making, which could have been informed by checking the full range of available information.
This study was based on a credit card fraud detection task. Given the specific nature of the task, caution should of course be taken in extrapolating findings to other real-world scenarios. Specifically, in the real world the reliability of DSSs is hard to quantify and, despite the field striving for perfectly reliable systems, this is often not the case. Yet what this study warns of should be considered by any practitioner when designing a DSS: a) a user may attend to highlighted information as intended but still make a contradicting choice, b) communicating even low levels of unreliability may shift user behaviour towards self-reliance and poor-quality joint performance, c) even for a known 100% reliable system the user may still exhibit self-reliance and d) a reliable DSS paired with a poorly performing human may result in joint performance below the achievable performance level of the DSS alone.
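To illustrate what 'sufficient to detect large effects' can mean for N = 9 with the paired non-parametric test used here, the simulation sketch below approximates the power of a Wilcoxon signed-rank test under assumed normally distributed paired differences; the effect sizes and distributional assumptions are ours, not derived from the study data.

import numpy as np
from scipy import stats

def wilcoxon_power(n=9, shift=1.0, sd=1.0, alpha=0.05, n_sim=5000, seed=0):
    """Approximate power of a paired Wilcoxon signed-rank test by simulation,
    assuming normally distributed paired differences with the given shift."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        diff = rng.normal(shift, sd, n)
        if stats.wilcoxon(diff).pvalue < alpha:
            hits += 1
    return hits / n_sim

# A 'large' standardised effect (shift of about 1 SD) versus a more subtle one (0.5 SD).
print("power, d = 1.0:", wilcoxon_power(shift=1.0))
print("power, d = 0.5:", wilcoxon_power(shift=0.5))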

4.7. Outlook

The benefit of human-machine teams is envisaged as a time and quality gain arising from operators quickly accepting valid machine suggestions while spotting and challenging erroneous ones (Muir, 1987). A time gain appears unlikely, given that observers have to examine information for disagreement if they are to notice system errors. In a related study, we showed that when people use 'dashboards' (data summaries) to support fraud analysis tasks, they adapt their decision strategy to the reliability of the DSS but still seek additional information, even when this is not relevant to the decision; they do so with no obvious time cost, i.e., glancing at the additional information rather than reading it in detail (Morar et al., 2018). Human-machine teams are further envisaged to achieve reliability levels higher than each of the two parts (Dzindolet et al., 2003). While some studies reported net improvements of such teams (Goddard et al., 2014; Madhavan and Wiegmann, 2007), others showed no improvement in team performance (Dzindolet et al., 2003; Sorkin and Woods, 1985). In fact, joint performance may be below the reliability level of the DSS: it appears probabilistically implausible to achieve joint performance better than the more reliable part of the team in a basic joint scenario without communication. When pairing an imperfect human observer with a near-perfect DSS, the question arises whether a decision based on the DSS alone (and hence solely on AI) would not result in a more precise and hence preferable overall outcome. Equally, pairing a near-perfect observer with an imperfect DSS may make the DSS redundant or counterproductive.
This study illustrated the potential reluctance of people to follow the suggestions of DSSs known to be imperfect or of unknown reliability. At the same time, automation bias is the inherent risk of any DSS integration. The literature describes manifold attempts to overcome these issues: first, through explaining possible errors of the DSS, although this led to users overlooking false suggestions (Dzindolet et al.,


2003). Second, by making participants accountable for their decision quality and performance, thus reducing the prevalence of automation bias (Skitka et al., 2000). Third, by presenting the user with specifics of the machine reasoning ('cognitive feedback'), resulting in improved handling of system errors while also improving the calibration of operator trust to system reliability (Seong and Bisantz, 2008). Fourth, through 'critiquing', where the DSS monitors human judgement and provides input only after the human has finalised a decision, which has proven beneficial in the medical domain (Guerlain et al., 1999; Miller, 1985). In the future, more work is needed in order to optimise joint human-machine performance.

5. Research ethics

This study was approved by the University of Birmingham Ethics Panel (Reference Number ERN_13–0997) as part of the European Project SPEEDD, and written informed consent was provided by all participants.

Acknowledgements

We would like to thank all of our participants for their time, Joanne Claire Kitchen for her help with mobilising participants, and two anonymous reviewers for their constructive feedback on the original manuscript draft. This work was funded through the European Union FP7 project SPEEDD (grant number 619435).

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.apergo.2020.103102.

References

Acharya, A., Howes, A., Baber, C., Marshall, T., 2018. Automation reliability and decision strategy: a sequential decision model for automation integration. In: Proceedings of the Human Factors and Ergonomics Society 2018 Annual Meeting. HFES, Santa Monica, CA, pp. 144–148.
Bahner, J.E., Hüper, A.-D., Manzey, D., 2008. Misuse of automated decision aids: complacency, automation bias and the impact of training experience. Int. J. Hum. Comput. Stud. 66, 688–699.
Bahrami, B., Olsen, K., Latham, P.E., Roepstorff, A., Rees, G., Frith, C.D., 2010. Optimally interacting minds. Science 329, 1081–1085.
Bang, D., Fusaroli, R., Tylén, K., Olsen, K., Latham, P.E., Lau, J.Y.F., Roepstorff, A., Rees, G., Frith, C.D., Bahrami, B., 2014. Does interaction matter? Testing whether a confidence heuristic can replace interaction in collective decision-making. Conscious. Cognit. 26, 13–23.
Berbaum, K.S., Franken, E.A., Dorfman, D.D., Miller, E.M., Caldwell, R.T., Kuehn, D.M., Berbaum, M.L., 1998. Role of faulty visual search in the satisfaction of search effect in chest radiography. Acad. Radiol. 5, 9–19.
Bliss, J.P., 2003. Investigation of alarm-related accidents and incidents in aviation. Int. J. Aviat. Psychol. 13, 249–268.
Bliss, J.P., Gilson, R.D., Deaton, J.E., 1995. Human probability matching behavior in response to alarms of varying reliability. Ergonomics 38, 2300–2312.
Botzer, A., Meyer, J., Bak, P., Parmet, Y., 2013. Mental effort in binary categorization aided by binary cues. J. Exp. Psychol. Appl. 19, 39–54.
Botzer, A., Meyer, J., Borowsky, A., Gdalyahu, I., Shalom, Y.B., 2015. Effects of cues on target search behavior. J. Exp. Psychol. Appl. 21, 73–88.
Corcoran, D.W.J., Dennett, J.L., Carpenter, A., 1972. Cooperation of listener and computer in a recognition task. II. Effects of computer reliability and "dependent" versus "independent" conditions. J. Acoust. Soc. Am. 52, 1613–1619.
Dalal, N.P., Kasper, G.M., 1994. The design of joint cognitive systems: the effect of cognitive coupling on performance. Int. J. Hum. Comput. Stud. 40, 677–702.
Dal Pozzolo, A., Caelen, O., Le Borgne, Y.-A., Waterschoot, S., Bontempi, G., 2014. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Syst. Appl. 41, 4915–4928.
de Koning, B.B., Tabbers, H.K., Rikers, R.M.J.P., Paas, F., 2010. Attention guidance in learning from a complex animation: seeing is understanding? Learn. Instruct. 20, 111–122.
de Vries, P., Midden, C., Bouwhuis, D., 2003. The effects of errors on system trust, self-confidence, and the allocation of control in route planning. Int. J. Hum. Comput. Stud. 58, 719–735.
Dzindolet, M.T., Peterson, S.A., Pomranky, R.A., Pierce, L.G., Beck, H.P., 2003. The role of trust in automation reliance. Int. J. Hum. Comput. Stud. 58, 697–718.
Dzindolet, M.T., Pierce, L.G., Beck, H.P., Dawe, L.A., 2002. The perceived utility of human and automated aids in a visual detection task. Hum. Factors 44, 79–94.
Findlay, J.M., Gilchrist, I.D., 2003. Active Vision: The Psychology of Looking and Seeing. Oxford University Press, Oxford.
Goddard, K., Roudsari, A., Wyatt, J.C., 2012. Automation bias: a systematic review of frequency, effect mediators, and mitigators. J. Am. Med. Inf. Assoc. 19, 121–127.
Goddard, K., Roudsari, A., Wyatt, J.C., 2014. Automation bias: empirical results assessing influencing factors. Int. J. Med. Inf. 83, 368–375.
Guerlain, S.A., Smith, P.J., Obradovich, J.H., Rudmann, S., Strohm, P., Smith, J.W., Svirbely, J., Sachs, L., 1999. Interactive critiquing as a form of decision support: an empirical evaluation. Hum. Factors 41, 72–89.
Hayhoe, M., Ballard, D., 2005. Eye movements in natural behavior. Trends Cogn. Sci. 9, 188–194.
Henderson, J.M., 2003. Human gaze control during real-world scene perception. Trends Cogn. Sci. 7, 498–504.
Lee, J., Moray, N., 1992. Trust, control strategies and allocation of function in human-machine systems. Ergonomics 35, 1243–1270.
Lee, J.D., Moray, N., 1994. Trust, self-confidence, and operators' adaptation to automation. Int. J. Hum. Comput. Stud. 40, 153–184.
Lee, J.D., See, K.A., 2004. Trust in automation: designing for appropriate reliance. Hum. Factors 46, 50–80.
Lowe, D.J., Reckers, P.M.J., Whitecotton, S.M., 2002. The effects of decision-aid use and reliability on jurors' evaluations of auditor liability. Account. Rev. 77, 185.
Madhavan, P., Wiegmann, D.A., 2007. Effects of information source, pedigree, and reliability on operator interaction with decision support systems. Hum. Factors 49, 773–785.
Maltz, M., Meyer, J., 2001. Use of warnings in an attentionally demanding detection task. Hum. Factors 43, 563–572.
Maltz, M., Shinar, D., 2003. New alternative methods of analyzing human behavior in cued target acquisition. Hum. Factors 45, 281–295.
Metzger, U., Parasuraman, R., 2005. Automation in future air traffic management: effects of decision aid reliability on controller performance and mental workload. Hum. Factors 47, 35–49.
Meyer, J., 2001. Effects of warning validity and proximity on responses to warnings. Hum. Factors 43, 563–572.
Miller, P., 1985. Goal-directed critiquing by computer: ventilator management. Comput. Biomed. Res. 18, 422–438.
Morar, N., Baber, C., 2017. Joint human-automation decision making in road traffic management. In: Proceedings of the Human Factors and Ergonomics Society 61st Annual Meeting. HFES, Santa Monica, CA, pp. 385–389.
Morar, N., Baber, C., McCabe, F., Starke, S.D., Skarbovsky, I., Artikis, A., Correia, I., 2018. Drilling into dashboards: responding to computer recommendation in fraud analysis. IEEE Trans. Hum.-Mach. Syst. 49, 633–641.
Mosier, K., Skitka, L.J., 1996. Human decision makers and automated decision aids: made for each other? In: Parasuraman, R., Mouloua, M. (Eds.), Automation and Human Performance: Theory and Application. Erlbaum, Mahwah, NJ, pp. 201–220.
Mosier, K.L., Skitka, L.J., 1999. Automation use and automation bias. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 43 (3), 344–348.
Muir, B.M., 1987. Trust between humans and machines, and the design of decision aids. Int. J. Man Mach. Stud. 27, 527–539.
Onnasch, L., Ruff, S., Manzey, D., 2014. Operators' adaptation to imperfect automation - impact of miss-prone alarm systems on attention allocation and performance. Int. J. Hum. Comput. Stud. 72, 772–782.
Parasuraman, R., Manzey, D.H., 2010. Complacency and bias in human use of automation: an attentional integration. Hum. Factors 52, 381–410.
Parasuraman, R., Riley, V., 1997. Humans and automation: use, misuse, disuse, abuse. Hum. Factors 39, 230–253.
Roads, B., Mozer, M.C., Busey, T.A., 2016. Using highlighting to train attentional expertise. PLoS One 11, e0146266.
Rose, A.M., Rose, J.M., 2003. The effects of fraud risk assessments and a risk analysis decision aid on auditors' evaluation of evidence and judgment. Account. Forum 27, 312–338.
Rovira, E., Parasuraman, R., 2010. Transitioning to future air traffic management: effects of imperfect automation on controller attention and performance. Hum. Factors 52, 411–425.
Seong, Y., Bisantz, A.M., 2008. The impact of cognitive feedback on judgment performance and trust with decision aids. Int. J. Ind. Ergon. 38, 608–625.
Skitka, L.J., Mosier, K., Burdick, M.D., 2000. Accountability and automation bias. Int. J. Hum. Comput. Stud. 52, 701–717.
Skitka, L.J., Mosier, K.L., Burdick, M., 1999. Does automation bias decision-making? Int. J. Hum. Comput. Stud. 51, 991–1006.
Sorkin, R.D., Woods, D.D., 1985. Systems with human monitors: a signal detection analysis. Hum. Comput. Interact. 1, 49.
Starke, S.D., Baber, C., 2018. The effect of four user interface concepts on visual scan pattern similarity and information foraging in a complex decision making task. Appl. Ergon. 70, 6–17.
Thackray, R.I., Touchstone, R.M., 1989. Detection efficiency on an air traffic control monitoring task with and without computer aiding. Aviat. Space Environ. Med. 60, 744–748.
Yeh, M., Merlo, J.L., Wickens, C.D., Brandenburg, D.L., 2003. Head up versus head down: the costs of imprecision, unreliability, and visual clutter on cue effectiveness for display signaling. Hum. Factors 45, 390–407.
