Accepted Manuscript
The description-experience gap in the effect of warning reliability on user trust and performance in a phishing detection context

Jing Chen, Scott Mishler, Bin Hu, Ninghui Li, Robert W. Proctor

PII: S1071-5819(18)30276-3
DOI: 10.1016/j.ijhcs.2018.05.010
Reference: YIJHC 2215

To appear in: International Journal of Human-Computer Studies

Received date: 7 November 2017
Revised date: 21 May 2018
Accepted date: 25 May 2018
Please cite this article as: Jing Chen, Scott Mishler, Bin Hu, Ninghui Li, Robert W. Proctor, The description-experience gap in the effect of warning reliability on user trust and performance in a phishing detection context, International Journal of Human-Computer Studies (2018), doi: 10.1016/j.ijhcs.2018.05.010
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights

- Reliability influenced user performance and trust in a phishing detection system.
- This influence depended on task difficulty.
- Feedback (experience) increased objective and subjective trust calibration.
- Providing description of system reliability increased mainly subjective trust.
The description-experience gap in the effect of warning reliability on user trust and performance in a phishing detection context
Jing Chen, Scott Mishler, Bin Hu, Ninghui Li, Robert W. Proctor

Author Note

Jing Chen and Scott Mishler, Department of Psychology, Old Dominion University, New Mexico State University; Bin Hu, Department of Engineering Technology, Old Dominion University; Ninghui Li, Department of Computer Sciences, Purdue University; Robert W. Proctor, Department of Psychological Sciences, Purdue University.

Correspondence concerning this article should be addressed to Jing Chen ([email protected]), 250 Mills Godwin Life Sciences Building, Old Dominion University, Norfolk, VA 23529.

This research was funded by the National Science Foundation under Grants 1566173, 1760347, and 1314688.
Abstract

How the human user trusts and interacts with an automation system is influenced by how well the system capabilities are conveyed to the user. When interacting with the automation system, the user can obtain the system reliability information through an explicit description or through experiencing the system over time. The term description-experience gap illustrates the difference between humans' decisions from description and decisions from experience. In the current study, we investigated how this description-experience gap applies to human-automation interaction with a phishing-detection task in the cyber domain. In two experiments, participants' performance in detecting phishing emails and their trust in the phishing detection system were measured when system reliability, description, and experience (i.e., feedback) were varied systematically in easy and difficult phishing detection tasks. The results suggest that system reliability has a profound influence on human performance in the system, but the benefits of having a more reliable system may depend on task difficulty. Also, providing feedback increased trust calibration in terms of both objective and subjective trust measures, yet providing description of system reliability increased only subjective trust. This result pattern not only shows the gap in effects of feedback and description, but it also extends the description-experience gap concept from rare events to common events.

Keywords: Description-experience gap; trust; human-automation interaction; phishing detection; feedback
In a common storybook tale, a colt tried to cross a river (Peng, 1955). A cow said that the river was very shallow, but a squirrel shouted that the river was very deep. The colt's mother encouraged him to try for himself, and the little colt finally learned that the river was neither too shallow nor too deep. Children are often told, "Real knowledge comes from practice."
In the real world, people make decisions in their lives and jobs all the time. A decision can be made based on the decision maker's own experience (e.g., the colt testing the river for himself) or on the description given by another (e.g., the cow and the squirrel telling the colt about the river). These two types of information sources have been shown to give rise to different decisions. The present article focuses on the decisions that humans make when interacting with an automated aid, more specifically, in a phishing context. Phishing is "a kind of social-engineering attack in which criminals use spoofed email messages to trick people into sharing sensitive information or installing malware on their computers" (Hong, 2012, p. 74). There are many existing anti-phishing tools designed to help users identify phishing emails or websites, but, like any other type of automation aid, they are not always reliable (Egelman et al., 2008; Wu et al., 2006; Yang et al., 2017). The reliability of an anti-phishing aid can be learned by being told about it explicitly (i.e., description) and/or by using the aid and experiencing its successes and failures over time (i.e., experience). These two ways of conveying information (description and experience) have been studied widely by human decision-making researchers. Our goal is to examine how the reliability of this type of automated aid and the ways of conveying this information affect users' decision making and trust in the aid. To this end, we draw upon the decision-making literature and the human-automation interaction literature.
Description-Experience Gap

The description-experience gap is the finding that rare events are underweighted in decisions from experience compared with decisions from description. This underweighting is possibly due to the rare events being less likely for the decision maker to have encountered recently or at all, or to the different formats of statistical information triggering different cognitive algorithms (Hertwig et al., 2004; Hertwig & Erev, 2009). "Rare events" have been arbitrarily defined as events with a probability of .20 or less (Hertwig et al., 2004). The majority of studies on the description-experience gap have focused on choices between risky alternatives (Wulff et al., 2018). However, as stated by Wulff et al., the distinction between description and experience has an impact far beyond the domain of risky choice. This robust phenomenon has been observed with human subjects in gambling (Hertwig et al., 2004), medical decision making (Armstrong & Spaniol, 2017; Lejarraga et al., 2016), and online product ratings (Wulff et al., 2015), as well as with rhesus monkeys in gambling (Heilbronner & Hayden, 2016). In addition, this phenomenon goes beyond the choice tasks that are used most often in this line of research, because "cognitive functions and cognitive phenomena including reasoning, judgment, and choice may change" (Wulff et al., 2018, p. 163) as a function of the learning and representation formats of description and experience. Thus, the underweighting or overweighting of small probabilities conveyed by experience or description is expected to occur not only in choice tasks but also in reasoning and judgment tasks.
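To give a sense of why sampling alone can hide a rare event (the first explanation noted above), consider a simple calculation of ours, added for illustration with a hypothetical sample size: if a rare outcome has probability .025 and a decision maker draws 20 independent samples, the probability of never observing that outcome is

\[ (1 - .025)^{20} \approx .60, \]

so a majority of experience-based decision makers would never encounter the rare event at all.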
In Hertwig et al.'s (2004) seminal study, decision problems were constructed so that each of them had one option with a rare event (e.g., a .025 chance of winning 32 points) and another option with a common event (e.g., a .25 chance of winning 3 points). Participants in the description group read these problems described on a computer screen, whereas those in the experience group were allowed to click on the options and sample possible outcomes over time. Both groups then indicated their preferred option among the two options for each problem. The term description-experience gap came from the result that the experience group, compared to the description group, underweighted the rare events in their decision making (e.g., the experience group preferred the option with a .025 chance of winning 32 points much less than did the description group).

As another example, Wulff et al. (2015) examined the description-experience gap in the
context of online product reviews. Participants were to make choices between two products that differed only in the user ratings. In the description condition, participants viewed a full descriptive summary plot of 100 user ratings for each product, which indicated the number of users who rated the product at each level of 1 through 10 stars. In the experience condition,
participants were shown an individual rating (e.g., 7 stars) from the user-rating pool and were instructed to sample as many individual ratings as they liked. The underlying distributions of the ratings were the same in both conditions. The participants underestimated the rare ratings (e.g., a 2-star rating made by only 5 out of 100 users) in the experience condition compared to the
description condition, and thus a description-experience gap was evident.
The above-mentioned studies compared the description and experience conditions when only description or only experience was provided. In many situations, experience makes a difference even when a full description of the problem has already been provided. Jessup et al. (2008) had two groups of participants perform a repeated gambling task. Both groups received a full description of the gambles; one group also received feedback about the outcomes of their choices, but the other did not. Note that experience was obtained through feedback in this scenario. There were two conditions: a high-probability condition (with trials consisting of one sure option: win 3¢ for sure; and one high-probability option: win 4¢ with .8 probability or 0¢ otherwise) and a low-probability condition (with trials consisting of one sure option: win 3¢ for sure; and one low-probability option: win 64¢ with .05 probability or 0¢ otherwise). In the high-probability condition, the high-probability option was chosen by the feedback group more often than by the no-feedback group; whereas in the low-probability condition, the low-probability option was chosen by the feedback group less often than by the no-feedback group. These results suggested that, compared to the no-feedback/no-experience group, the feedback/experience group underweighted the small probabilities. Jessup et al.'s results indicate the unique role of feedback as a critical component in experiential choice behavior, because the options in both conditions had the same expected values and both groups were given the full gamble information.
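As a quick check using the payoffs and probabilities reported above (the arithmetic is ours, added for illustration), the expected values of the two risky options are indeed equal:

\[ .8 \times 4 + .2 \times 0 = 3.2 \text{ cents}, \qquad .05 \times 64 + .95 \times 0 = 3.2 \text{ cents}, \]

so both risky options offer slightly more than the 3¢ sure option in expectation, and any difference between the feedback and no-feedback groups cannot be attributed to expected value.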
The studies of the description-experience gap have mostly focused on problems with low probabilities or have explained the phenomenon through low-probability events. Indeed, the weighting of high-probability events has been assumed to be the mirror image of the weighting of low-probability events (Hadar & Fox, 2009; Hertwig et al., 2004). Jessup et al. (2008) analyzed their data by comparing across the low- and high-probability conditions and reached the conclusion that the feedback group underweighted small probabilities. One can reach the same conclusion by examining the data in the low- and high-probability conditions separately (see Figures 1 and 2 in Jessup et al., 2008). Moreover, the data in the high-probability condition suggested that the feedback group overweighted the high probabilities relative to the no-feedback group (choosing the high-probability option more often). Thus, studying the description-experience gap is not only beneficial for understanding how people will react to events with low probabilities, such as rare hazardous events (e.g., hurricanes, earthquakes). It is also informative for capturing people's behaviors towards more common events.

Trust in Automation
Understanding probabilities is useful for risky decision making among lotteries and also for other real-world problems. For example, when choosing a site for building a new house, the probability of a severe flood affects the choice of the site; when using an automated aid, the probability with which the aid works accurately determines how much the user relies on the aid. A human-automation system involves human operators and automated machines that work together
to achieve the mission of the whole system. Human-automation systems are pervasive, from modern manufacturing systems, to aviation systems, to surface transportation, to smart homes, and so on. The collaboration between human and automation is the key for a human-automation system to achieve its mission (Sheridan & Parasuraman, 2006).
Among the factors that affect human-automation collaboration, human trust in automation is one of the most significant (Dzindolet et al., 2003; Lee & Moray, 1992; Muir, 1994; Parasuraman & Riley, 1997). Trust is a psychological concept that involves the willingness to accept uncertainty and vulnerability with respect to another agent, with the expectation that the agent will help achieve positive outcomes in future interactions (Lee & See, 2004; Verberne et al., 2015). This agent can be another person or an automation system (Lewandowsky et al., 2000). Human trust in a certain type of automation can affect how often and how widely that automation will be accepted and implemented by users. As stated by Hoffman et al. (2009), "Trust in and through technology will likely mediate the effectiveness of software and hardware in maintaining security" (p. 6).
Neither under-trust nor over-trust is desirable for an efficient human-automation system (Hoffman et al., 2009; Parasuraman & Riley, 1997). It is not always appropriate for the human user to trust the system; rather, the user should trust a system that is reliable but distrust a system that is unreliable. That is, trust should match the true system capabilities, which is referred to as trust calibration (Lee & Moray, 1994; Lee & See, 2004; McGuirl & Sarter, 2006; Muir, 1987).
As an important component of system performance, system reliability has been consistently reported to affect trust positively (Desai et al., 2013; Schaefer et al., 2016). System reliability is the percentage of occasions on which the system does what it is designed to do (e.g., a fire alarm goes off when there is a fire and remains silent otherwise). Typical research findings include that higher reliability levels lead to higher trust levels, and that false alarms and misses influence compliance and reliance, respectively (Chancey et al., 2017; Dixon & Wickens, 2006).

Current Study

"Appropriate trust and reliance depend on how well the capabilities of the automation are conveyed to the user" (Lee & See, 2004, p. 74). In much of the existing research in the field of
human-automation trust, researchers have paid little attention to these two ways of conveying the system reliability: description and experience/feedback. Some researchers inform participants about the actual reliability (e.g., Barg-Walkow & Rogers, 2016; Rice, 2009), but some do not (e.g., Madhavan et al., 2006; Ross et al., 2008), and some do not even clearly report whether they informed participants about the system reliability (e.g., Sanchez et al., 2004). Many studies provided feedback to the participants but did not specify the rationale for doing so. A relevant implication from the description-experience gap is that these different ways of conveying the system reliability, which have been largely overlooked, could influence how the human interacts with the automation. The goal of the current study was to investigate the description-experience gap in the context of human trust in automation. Namely, we examined how the reliability information conveyed through description and/or experience influenced human trust in automation in the cyber domain.
Similar to other domains, human and machine are the two key aspects of the cyber domain (Stokes et al., 2010). In cyberspace, there are many different forms of automation (e.g., anti-virus warnings, email boxes flagging potential spam emails) that users face every day. In a review paper on trust in automation, Hoff and Bashir (2015) summarized the different types of automated systems used in 127 prior studies, including combat identification aid, general decision aid, fault management/task monitoring aid, automated weapons detector (luggage), target identification aid (noncombat), collision warning system, route-planning system, and an "other" category. Note that the cyber domain was not on the list. Moreover, although it is well recognized that trust is essential to the cyber domain (Hoffman et al., 2009; Kim & Moon, 1998; Singh, 2011; van Rooy & Bus, 2010), only sparse empirical research has been conducted on human trust in automation in the cyber domain.
In the current study, a Phishing Detection System, which provided recommendations to users as to whether an email was legitimate or phishing, was used as the automation system. The reliability of this system was defined as the percentage of times the system was correct (60%, 70%, 80%, 90%). To simplify the design, there were equal numbers of false alarms and misses.
Description was manipulated by whether participants were informed explicitly about the system reliability percentage. Experience was manipulated by whether feedback on decision accuracy was provided after each experimental trial.
In most existing research on the description-experience gap, the description-only and experience-only conditions have been compared directly (e.g., Hertwig et al., 2004; Ungemach, Chater, & Stewart, 2009; Wulff et al., 2015), although Jessup et al. (2008) compared a description-only condition with a description + feedback group. It was not feasible for the prior studies to include a condition without description or feedback, because the information presented through description or feedback (e.g., gambles, user reviews) constituted the target stimuli to which participants responded. Thus, both types of information could not be removed at the same time (e.g., there was no way to present a gamble if no description was provided and no sampling of the gamble was allowed). In the current study, participants responded to emails with assistance from the Phishing Detection System, and this setting allowed us to construct conditions with one, both, or neither of the two types of information (i.e., description and feedback). Therefore, we varied description and experience systematically in a full factorial
design and examined how participants' trust and performance were affected by these manipulations in an easy phishing detection task in Experiment 1 and a difficult task in Experiment 2.

Based on prior studies, our Hypothesis 1 was that human trust should increase and performance improve when system reliability increased (Chancey et al., 2017; Dixon & Wickens, 2006). Given that the weighting of high-probability events has been assumed to mirror the weighting of low-probability events (Hadar & Fox, 2009; Hertwig et al., 2004), participants in the with-experience/feedback conditions should overweight the system reliability and thus trust the system more than those in the without-feedback conditions (Hypothesis 2); those in the with-description conditions should underweight the system reliability and thus trust the system less than those in the without-description conditions (Hypothesis 3). Given how rare events have commonly been defined in the description-experience gap literature (i.e., events with probabilities of .20 or less), the high-probability events that should be most affected in a similar way are those with probabilities of .80 or more. As a result, we predicted that the overweighting of system reliability in the experience/feedback conditions would be more profound for the 80% and 90% reliability levels (Hypothesis 4), and that the underweighting of system reliability in the description conditions would also be more profound for the 80% and 90% reliability levels (Hypothesis 5).
Experiment 1: Easy Task
In this experiment, participants were shown images of legitimate and phishing emails,
and were required to classify each email as legitimate or phishing. Before they saw each image, a warning was provided by the Phishing Detection System, which had various reliability levels (60%-90%). The reliability level information was conveyed through description or experience. In
the with-description conditions, participants were explicitly informed about the reliability of the system (e.g., 90%); whereas in the without-description conditions, participants were told that the system had a reasonable reliability, without the specific percentage. In this scenario, providing feedback enables learning from experience: Participants in the with-feedback conditions acquired
knowledge of the system reliability that was not available in the without-feedback conditions.

Method
Participants. A total of 484 (286 female; three declined to say) participants were
recruited and completed the experimental task through Amazon Mechanical Turk (MTurk).
There were 112 in the age range of 15~24 years, 209 in the range of 25~34, 95 in the range of 45~54, 44 in 55~64, five who were age 65 or older, and one who declined to report the age. The task took about 10 minutes, and each participant was paid $1 for their participation. Each MTurk
worker was allowed to participate only once in the experiment. This and the following experiment were approved by the Institutional Review Board at New Mexico State University.
Apparatus and stimuli. The stimuli were screenshots of emails received by the first author, with personal information in the emails replaced with fake (but consistent) information.
One legitimate email and one phishing email were used in the practice task, and 10 other
legitimate emails and 10 other phishing emails were used in the experimental task. Participants used their own equipment, but were required to use a desktop or laptop computer in order to
properly view the email images. Participants who reported using a tablet or smart phone were filtered out of the study.

Design and procedure. There were three independent variables: reliability level of the phishing detection system (i.e., 60%, 70%, 80%, or 90%, with equal false alarm and miss rates within each condition), whether the participant received an explicit description of the reliability (i.e., with vs. without description), and whether immediate feedback was provided after the participant made a decision (i.e., with vs. without feedback). All independent variables were between-subjects. As a result, there were 16 (4 Reliability × 2 Description × 2 Feedback) distinct groups to which participants were randomly assigned. The numbers of participants across the groups were roughly the same.
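To make the reliability manipulation concrete, the following sketch shows one way such a recommendation schedule could be generated for the 20 experimental emails. This is our illustration, not the authors' code; the function and variable names are hypothetical.

    import random

    def build_recommendation_schedule(reliability, n_legit=10, n_phish=10, seed=None):
        """Assign aid recommendations so that overall accuracy equals `reliability`,
        with equal numbers of false alarms and misses (illustrative sketch)."""
        rng = random.Random(seed)
        n_trials = n_legit + n_phish
        n_errors = round((1 - reliability) * n_trials)
        assert n_errors % 2 == 0, "equal false alarms and misses need an even error count"
        n_false_alarms = n_misses = n_errors // 2  # errors split evenly by type

        # Legitimate emails: a false alarm flags a legitimate email as phishing.
        legit = ([("legitimate", "phishing")] * n_false_alarms
                 + [("legitimate", "legitimate")] * (n_legit - n_false_alarms))
        # Phishing emails: a miss flags a phishing email as legitimate.
        phish = ([("phishing", "legitimate")] * n_misses
                 + [("phishing", "phishing")] * (n_phish - n_misses))

        schedule = legit + phish
        rng.shuffle(schedule)  # random trial order for each participant
        return schedule

    # Example: 90% reliability yields 18 correct recommendations, 1 false alarm, 1 miss.
    schedule = build_recommendation_schedule(0.9, seed=1)
    print(sum(email == rec for email, rec in schedule), "of", len(schedule), "correct")
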
At the beginning of the experiment, the participant was told that the task was to make a judgment of whether each email was legitimate or phishing. After answering a question of “how much do you know about phishing” (nothing, a bit, a lot), participants were shown the definition
of phishing with some examples. Then they were told that computer scientists had developed a Phishing Detection System to help users. On that same page, participants were told about the accuracy rate of this system corresponding to the group to which they were assigned (i.e., “60%”,
“70%”, “80%”, “90%” for the with-description groups, and “reasonable” for the withoutdescription groups). Participants in the with-description groups were then asked about the
accuracy rate of the phishing detection system. If they did not answer correctly, the participants were warned to pay more attention and tested again (6 participants answered this question
incorrectly initially but were correct the second time).
Each trial began with a recommendation made by the Phishing Detection System (see Figure 1), and then an email image was displayed for which the participants were to make their
own judgments. A practice task with one legitimate and one phishing email was used to familiarize participants with the experiment procedure. Each participant then performed the experimental task on 20 new emails, with the order of the emails being randomly assigned for each participant. The decision made by the participant was recorded for each email, as was decision time (recorded from onset of the email image until one of the two buttons, phishing or
legitimate, was clicked).

For participants in the with-feedback groups, a feedback page was presented after each judgment. The feedback included a statement, "Your response is correct/incorrect! That was a legitimate/phishing email," and a happy or an unhappy face for correct and incorrect responses, respectively. For those in the without-feedback groups, the feedback page was not presented, and they were directed to the preparation page of the next trial. After classifying all 20 emails, participants indicated their agreement with a few statements, including their trust in the Phishing Detection System ("I trusted the recommendations made by the Phishing Detection System"), on a 7-point Likert scale. They also indicated their judgment of the real accuracy rate of the system (10%–100%, in 10% increments) and provided demographic information. Note that there was no deception in the experiment (i.e., the system was indeed set with a 60%, 70%, 80%, or 90% reliability, as the with-description groups were told).
Measurements. We measured both the participants’ performance and their trust in the
phishing detection system. The performance measures included participants’ judgment accuracy of the emails and corresponding judgment time; the trust measures included agreement rate with
the aid, self-reported trust, and perceived system reliability. Moreover, agreement rate is an
objective measure of trust, and self-reported trust and perceived reliability are subjective measures of trust. Various questionnaires have been developed to measure trust in automation
through subjective self-report (e.g., Jian et al., 2000). These questionnaires typically include multiple questions (e.g., Jian et al.'s questionnaire includes 12 questions), which may not be suitable for online studies due to potential boredom of participants. Thus, we used the single question mentioned above to measure subjective trust. Self-reported trust, though, like any other self-reported data, may not always reflect behavior. As a result, we included the objective
trust measure (i.e., agreement rate) in the current study. It is assumed that, with other things being equal, participants agree with the automation aid more often when they trust the aid more.

Results

To exclude potential outlier data, we transformed each participant's overall decision time
using a natural log transformation and obtained the mean and standard deviation of the log decision-time distribution. A total of 18 participants (3.7%) with log decision times more than two standard deviations below the mean (6 participants) or above the mean (12 participants) were excluded from further analyses. The dependent variables included accuracy of the participant's judgments of the email type (i.e., judgment accuracy), decision time, agreement rate with the phishing detection system, self-reported trust, and perceived reliability of the system. Analyses of variance (ANOVAs) were conducted with Reliability, Description, and Feedback as between-subjects factors on the above dependent variables. The natural log transformation of decision time and an arcsine transformation of accuracy (taking the arcsine of the square root of the accuracy) were used in the ANOVAs to improve the normality of the data distributions. In this and the following experiment, we report only significant effects and those approximating the .05 level; no other effects were significant, ps > .05.
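A minimal sketch of this preprocessing step, assuming each participant's mean decision time (in seconds) and proportion of correct judgments are available as arrays (our illustration; names are hypothetical):

    import numpy as np

    def preprocess(decision_times, accuracies):
        """Exclude outliers on log decision time and apply the transformations
        described above (illustrative sketch, not the authors' code)."""
        decision_times = np.asarray(decision_times, dtype=float)
        accuracies = np.asarray(accuracies, dtype=float)
        log_dt = np.log(decision_times)              # natural log transformation
        mean, sd = log_dt.mean(), log_dt.std(ddof=1)
        keep = np.abs(log_dt - mean) <= 2 * sd       # drop participants beyond +/- 2 SD
        arcsine_acc = np.arcsin(np.sqrt(accuracies[keep]))  # arcsine-square-root transform
        return log_dt[keep], arcsine_acc, keep
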
Performance: Decision time. The ANOVA results are based on the natural log transformation of the decision-time data, but for ease of understanding, the mean values reported below are from the raw decision-time data. As the reliability level increased from 60% to 70%, 80%, and 90%, decision time tended to decrease (Ms = 15.4 s, 15.0 s, 13.5 s, and 13.6 s), although the p value did not quite attain the .05 significance level, F(3, 450) = 2.59, p = .052, ηp2 = .02. Decision time also decreased when feedback was provided (Ms = 15.3 s vs. 13.4 s for without- and with-feedback conditions, respectively), F(1, 450) = 7.04, p = .008, ηp2 = .02 (see
Figure 2, top panel).

Performance: Judgment accuracy. Similarly, ANOVA results are based on the arcsine transformation of the accuracy data but mean values are reported based on the raw data. The effect of reliability was not significant (Ms = 82.3%, 82.7%, 83.2%, and 84.6% for Reliability of
60%, 70%, 80%, and 90%, respectively), F < 1. The Reliability × Description interaction was significant, F(3, 450) = 3.59, p = .014, ηp2 = .02. For the without-description groups, judgment accuracy did not differ significantly across reliability levels (Ms = 84.3%, 83.7%, 80.6%, and 83.1%); whereas for the with-description groups, the two lower reliability levels tended to have
lower accuracy (Ms = 80.3% and 81.8% for 60% and 70% reliability, respectively) than the two higher levels (Ms = 85.8% and 86.0% for 80% and 90% reliability, respectively; see Figure 2, bottom panel).
Trust: Agreement rate. As the reliability level increased from 60% to 70%, 80%, and 90%, participants agreed more often with the recommendations provided by the phishing detection system (agreement rate Ms = 62.5%, 68.2%, 73.5%, 81.1%)1, F(3, 450) = 71.07, p < .001, ηp2 = .32. The Reliability × Feedback interaction was marginally significant, F(3, 450) = 2.61, p = .051, ηp2 = .02. Participants agreed with the system less often with feedback than without feedback, but only at the lowest reliability level (with-feedback vs. without-feedback: Ms = 59.4% vs. 65.6%, 67.9% vs. 68.5%, 73.6% vs. 73.4%, and 81.1% vs. 81.0%, for reliability levels of 60%, 70%, 80%, and 90%, respectively; see Figure 3, top panel).

1 Across the 60%, 70%, 80%, and 90% system reliabilities, participants' average accuracy was 83.2%. If participants had classified the emails independently of the phishing detection system, the expected agreement rates with the system would be 56.6%, 63.2%, 69.8%, and 76.4%, which would still show increasing agreement but at lower values than the actual empirical ones.
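These expected rates follow from combining a participant's independent accuracy p with the system reliability r (the derivation is ours, added for illustration): agreement occurs when both the participant and the system are correct or both are wrong,

\[ P(\mathrm{agree}) = p \, r + (1 - p)(1 - r). \]

For example, with p ≈ .83 and r = .60, this gives .83 × .60 + .17 × .40 = .566, matching the 56.6% noted in the footnote.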
Trust: Self-reported trust. On a 7-point scale, participants reported how much they trusted the phishing detection system, with 4 being neutral. Self-reported trust increased with increased reliability (Ms = 3.7, 4.0, 4.3, 4.8 for reliability of 60%, 70%, 80%, and 90%, respectively), F(3, 450) = 14.15, p < .001, ηp2 = .09. Description affected self-reported trust (Ms = 4.0 vs. 4.4 for without- vs. with-description conditions, respectively), F(1, 450) = 12.98, p < .001, ηp2 = .03. That is, participants reported that they trusted the phishing detection system more when they were told the exact reliability of the system, and this pattern occurred across all reliability levels (see Figure 3, bottom panel).
Perceived system reliability. Participants perceived the system to be more reliable when the system reliability was higher (Ms = 56.7%, 60.8%, 67.4%, 74.6%; see Figure 4, top panel),
F(3, 450) = 26.46, p < .001, ηp2 = .15. Providing feedback also increased perceived system
reliability (Ms = 67.3% vs. 62.4% for with- vs. without-feedback conditions, respectively), F(1, 450) = 10.30, p = .001, ηp2 = .02.
Discussion
Consistent with Hypothesis 1, with increased system reliability, participants’
performance improved. These results are in agreement with those of prior research that indicate the importance of system capabilities in human-automation interaction (Chancey et al., 2017;
Dixon & Wickens, 2006; Sanchez et al., 2004). This improvement in performance was mostly in
terms of reduced decision time and a nonsignificant trend of increased judgment accuracy. A possible reason is that the phishing emails were easily distinguishable from the legitimate emails in this experiment, so participants were able to discriminate them accurately (accuracy > 80% across all conditions). In that case, participants were also able to infer the reliability of the phishing detection system by comparing their own judgments with the recommendations of the automation system. Thus, when they realized that the system was highly reliable, they may have simply obeyed the recommendations and made responses more quickly than when they determined that the system was not as reliable.

In addition, both the subjective measure of trust (i.e., self-reported trust) and the objective measure of trust (agreement rate) were positively affected by system reliability, also supporting Hypothesis 1. Consistent with previous studies (Sanchez et al., 2004), perceived system reliability correlated positively with the actual system reliability, although the former was typically lower than the latter.
Providing feedback increased perceived system reliability, which then became closer to the actual system reliability compared to when feedback was not provided. This finding agrees
with Hypothesis 2 and the trend in mean values is consistent with Hypothesis 4, although when it came to agreement rate and self-reported trust, neither Hypothesis 2 nor 4 was supported. As a result, Hypotheses 2 and 4 were only partially supported. The agreement rate consistently
increased with reliability regardless of whether feedback was provided, again suggesting that participants were capable of distinguishing the phishing emails and inferring the system
reliability even without feedback in the easy task.

Explicit description of the reliability affected only the subjective measure of trust, self-reported trust. Across all reliability levels, the with-description groups reported trusting the system more than did the without-description groups. This result violates Hypotheses 3 and 5, in which description was predicted to lead to underweighting of the reliability levels. The missing effects of description on the other measures were also not predicted by the description-experience gap phenomenon. One possibility is that the description manipulation in the current study was not effective; however, the effect of description on self-reported trust contradicts this possibility.

Overall, the comparison between the two types of decisions, decisions from description and decisions from experience, lies in the effects of description and feedback in the current
context. On the one hand, feedback increased perceived system reliability and affected the objective agreement rate by reducing agreement with unreliable systems. On the other hand, description did not affect objective trust but increased self-reported trust at all reliability levels. These effects did not completely conform to our hypotheses, but the differences in how feedback and description affected the measures were, in and of themselves, interesting. It seems that providing feedback has a more profound influence on users' behavior, whereas providing an explicit description of the system reliability does not affect users' behavior but only yields an inflated subjective report of their trust. When only subjective report is used as a trust measure in research and design, this inflation should be noted.
Experiment 2: Difficult Task
The main goal of this research was to compare the effects of description and feedback,
which have not been given much attention in the context of trust in automation. Experiment 1 showed that feedback mainly affected objective trust calibration and description affected self-
reported trust. In terms of performance, there were trends that feedback interacted with reliability on decision time and description interacted with reliability on accuracy, but these interactions did
not attain the .05 significance level. As a result, one purpose of Experiment 2 was to further test
our hypotheses and to examine whether these different effects of description and feedback in Experiment 1 can be further validated and generalized to a new set of stimuli.
One would think that, with a more reliable automated aid, participants' performance would be improved. However, in Experiment 1, judgment accuracy was not affected by any of the factors associated with the Phishing Detection System, although there was an effect of system reliability on decision time. A possible reason for this minimal effect on performance was that the emails used in Experiment 1 were relatively easy for participants to distinguish (so
that help from the system was not needed). Thus, another purpose of Experiment 2 was to test this possibility, with a new set of emails that were difficult for participants to classify as legitimate or phishing. To make the task more difficult, we modified the sender's address of each legitimate email to obtain a new set of phishing emails, because the sender's address is the best and most reliable signifier of a phishing email (Kumaraguru et al., 2010).

Method
Participants. A total of 481 MTurk workers (261 females, 3 declined to say) were
recruited, with 50 being in the age range of 15-24 years, 222 in the range of 25-34, 105 in the
range of 35-44, 67 in the range of 45-54, 30 in the range of 55-64, and 7 being 65 or older. Each participant was compensated $1.00 for the study, which took on average about 10 minutes. MTurk workers in Experiment 1 were excluded, and each worker was allowed to participate only
once in this experiment.
Apparatus, stimuli, design, and procedure. The experiment was similar to Experiment
1 except in the following ways. The original 10 legitimate emails used in Experiment 1 were kept, in addition to which 10 new legitimate emails were obtained from the second author’s
email box. These 20 legitimate emails were then modified into 20 phishing emails. The
modifications were only in the sender’s address (e.g., @bankofamerica.com was modified into @bankofamercia.com by switching the letters “i” and “c”) to make the emails difficult to
distinguish. To control for email content, half of the participants were tested on the 10 original legitimate emails and 10 phishing emails modified from the new legitimate emails; the other half saw the 10 new legitimate emails and 10 phishing emails modified from the original legitimate ones.

To ensure that the task was difficult, we tested a separate group of 30 participants on the
emails without the Phishing Detection System, and the judgment accuracy for this control group was 56%.

Results

The same method of excluding outlier data as in Experiment 1 was used. A total of 21
participants were excluded from further analyses, among whom 11 had log decision times more than two standard deviations below the log mean and 10 had log decision times more than two standard deviations above the log mean. Comparable ANOVAs were conducted on judgment accuracy, log decision time, agreement rate, self-reported trust, and perceived system reliability, with
Reliability, Description, and Feedback as between-subjects factors.
Performance: Decision time. As in Experiment 1, the ANOVA results are from the analyses on the log decision time, and mean values are reported for the raw decision time for
ease of understanding. The effect of Reliability on decision time was not statistically significant (Ms = 12.2 s, 13.4 s, 11.7 s, 11.7 s for Reliability of 60%, 70%, 80%, and 90%, respectively),
F(3, 444) = 1.52, p = .208, ηp2 = .01. Similar to Experiment 1, decision time decreased with feedback (Ms = 13.2 s vs. 11.3 s for without- vs. with-feedback groups, respectively; see Figure
5, top panel), F(1, 444) = 10.33, p = .001, ηp2 = .02.
Performance: Judgment accuracy. As in Experiment 1, the ANOVA results are based on the arcsine transformation of the accuracy data, and mean values are reported based on raw
data. The trend that participants were more accurate with increased reliability in Experiment 1 was statistically significant in Experiment 2 (Ms = 54.1%, 57.2%, 59.7%, 67.2% for reliability levels of 60%, 70%, 80%, 90%, respectively), F(3, 444) = 22.81, p < .001, ηp2 = .13. A new emerging trend was the Reliability × Feedback interaction, although the p value did not quite attain the .05 level, F(3, 444) = 2.57, p = .054, ηp2 = .02. Providing feedback seemed more
effective at higher reliability levels than at lower reliability levels (see Figure 5, bottom panel).

Trust: Agreement rate. Similar to Experiment 1, participants agreed with the system more often with higher reliability levels (Ms = 58.1%, 59.4%, 63.1%, 70.7%)2, F(3, 444) = 20.15, p < .001, ηp2 = .12. Participants agreed with the system more often with feedback than without
feedback (Ms = 65.0% vs. 60.7%), F(1, 444) = 11.16, p = .001, ηp2 = .03. The Reliability × Feedback interaction on agreement rate was also evident, F(3, 444) = 2.65, p = .049, ηp2 = .02. Providing feedback increased agreement rate mostly at the two higher levels of reliability (see Figure 6, top panel), indicating that feedback enhanced trust calibration.

2 Using participants' accuracy in the control condition (56%) and assuming that participants performed the task independently of the system recommendations, the expected agreement rates would be 51.2%, 52.4%, 53.6%, and 54.8%.
Trust: Self-reported trust. Consistent with Experiment 1, participants reported that they trusted the system more with increased system reliability (Ms = 3.6, 3.9, 4.1, 4.6 for Reliability of 60%, 70%, 80%, and 90%, respectively), F(3, 444) = 9.04, p < .001, ηp2 = .06. The effect of
description on self-reported trust was no longer statistically significant although the mean values showed the same trend as in Experiment 1 (Ms = 3.9 vs. 4.1 without- vs. with-description), F(1,
444) = 2.63, p = .106, ηp2 = .01. A new finding was that higher trust was reported with feedback than without feedback (Ms = 4.2 vs. 3.8), F(1, 444) = 7.35, p = .007, ηp2 = .02. The Reliability ×
Feedback interaction was significant, F(3, 444) = 3.93, p = .009, ηp2 = .03. Providing feedback
increased participants’ reported trust for the two higher reliability levels, but not for the lower reliability levels (see Figure 6, bottom panel), again demonstrating the effect of feedback on trust
calibration.
Perceived system reliability. As in Experiment 1, participants perceived the more reliable systems to be more reliable (Ms = 52%, 57%, 61%, 69%), F(3, 444) = 15.29, p < .001, ηp2 = .09. Providing feedback also increased perceived reliability (Ms = 56% vs. 63% for without-feedback vs. with-feedback groups, respectively), F(1, 444) = 22.50, p < .001, ηp2 = .05. Moreover, the two factors interacted with each other, F(3, 444) = 2.88, p = .035, ηp2 = .02. Providing feedback increased perceived system reliability only at the higher reliability levels (see Figure 4, bottom panel).

Discussion
This experiment included a more difficult task, and thus it was expected that participants would be affected more by the characteristics of the automated aid. Consistent with Hypothesis 1
and Experiment 1, when the reliability of the phishing detection system increased, participants’ performance was improved, their trust in the system increased, and they perceived the system reliability to be higher. It is worth noting that unlike in Experiment 1, system reliability affected
accuracy and not decision time in this experiment, and participants had faster decision times and lower accuracy compared to those in Experiment 1. A possible reason is that in the difficult task,
the phishing emails were very similar to the legitimate ones, as indicated by the low accuracy (56%) of the control group and the fact that a reviewer did not spot the difference between
@bankofamerica.com and @bankofamercia.com in an earlier version of this paper. In this case,
participants might have just relied on the phishing detection system without trying as hard on their own as those in Experiment 1. As a result, decision accuracy was lower but improved with
increased reliability, and decision times were faster across the reliability levels. Providing feedback also affected both performance and trust. Feedback decreased
decision time and, in terms of judgment accuracy, benefited the conditions with higher system reliability. Agreement rate and self-reported trust were increased by providing feedback (Hypothesis 2), especially at the two higher levels of reliability, supporting Hypothesis 4.
Feedback increased participants’ agreement with the more reliable system, which is in line with the pattern in Experiment 1 that providing feedback decreased participants’ agreement with the less reliable system. Hypothesis 4 was fully supported in this experiment. Note that, without feedback, the agreement rate was mostly consistent across the reliability levels (except a higher
agreement rate at the 90% reliability level); with feedback, the agreement rate increased with the reliability level (except that similar agreement rates were evident at the 60% and 70% reliability levels). This pattern was also consistent with our earlier speculation that participants were not quite able to distinguish the phishing emails and relied heavily on the phishing detection system,
except when they could utilize the feedback.
As in Experiment 1, the effect of description was limited; in fact, it was even more limited in this experiment: the only description effect found in Experiment 1, that on self-reported trust, was no longer significant. Our hypotheses regarding the effect of description (Hypotheses 3 and 5) were not supported by either of the experiments. However, this does not necessarily contradict prior work stating that rare events are overestimated (and common events are underestimated) in description-based choices, because most prior work contrasts description- and experience-based choices directly (see Wulff et al., 2018, for a review). The same difference between the effects of feedback and description was even more evident in this experiment, in that feedback affected both subjective and objective trust measures, whereas description did not have such an effect. The results provide evidence that providing feedback is a more effective vehicle for conveying (especially high) reliability information, and thus feedback should be provided to users whenever possible.

General Discussion
How the human user trusts and interacts with an automation system is influenced by how well the system capabilities are conveyed to the user (Lee & See, 2004). When interacting with
the automation system, the user can obtain the reliability information, among other information about the system, through an explicit description or through experiencing the system over time. The term description-experience gap in the human decision-making literature illustrates the difference between decisions from description and those from experience (Hertwig et al., 2004; Hertwig & Erev, 2009). In the current study, we investigated how this description-experience gap applies to human-automation interaction with a phishing-detection task in the cyber domain. In two experiments, participants' performance in detecting phishing emails and their trust in the phishing detection system were measured when description and experience (i.e., feedback) were varied systematically at different system reliabilities.

Effect of Feedback (Experience)
In the current setting, experience was manipulated through providing feedback. Feedback
had the most influence on trust, including both subjective and objective trust measures, as well as perceived system reliability. For the easy task in Experiment 1, the agreement rate was lower
with feedback than without feedback at the 60% level. This result indicates that, rather than increasing or decreasing objective trust in all systems, providing feedback reduces participants’
trust only in the unreliable system for this easy task. Similarly, with the difficult task in
Experiment 2, feedback increased agreement rate at the higher reliability levels (80% and 90%), but not so much at the lower levels (60% and 70%). Our results show that providing feedback led
the users to agree with the phishing detection system less when it was less reliable and agree with it more when it was more reliable. A similar pattern was also evident for self-reported trust and perceived system reliability in the difficult task, with higher trust reported at the 80% and 90% reliability levels when feedback was provided than when it was not. That is, we found that feedback increased trust calibration (Lee & See, 2004) in terms of the
objective agreement rate. This trust calibration effect of feedback was more prominent in the difficult task of Experiment 2 than in the easy task of Experiment 1. As mentioned previously, participants were able to distinguish the phishing and legitimate emails in the easy task, and more or less infer the
reliability of the system even in the absence of feedback. Thus, providing feedback would not further calibrate how participants trust the system. In contrast, when the task became difficult, participants were not able to identify the emails on their own, and the presence of feedback
helped participants to distinguish the emails and learn the reliability of the system. Providing
feedback led participants to rely more on the systems with higher reliabilities and less on those with lower reliabilities.
An implication of these findings is that the feedback that allows users to learn system
capabilities is essential to proper levels of human trust in the system, which should correspond to the system’s actual capabilities. Moreover, this type of feedback is even more critical when the
task is difficult than when the task is easy. In a real-world setting, the user uses a phishing detection system and makes a decision regarding an email he or she receives. Suppose, for example, that the user judges a phishing email to be legitimate and clicks on it. Feedback could take the form of a warning from the system when the system is certain that the email is phishing, or of the user being taken to an obviously illegitimate webpage. A caveat is that immediate feedback on the user's decision may not always be possible: The system may not be certain about the email, or the consequence may be delayed.

Effect of Description

The effect of description was limited, being evident only on self-reported trust in Experiment 1. In that experiment, participants reported trusting the phishing detection system
more in the with-description conditions than in the without-description conditions. This result indicates that when people were told the exact system reliability, their self-reported trust in the system increased. This effect of description was not significant in Experiment 2, though, which indicates that it may not be a robust effect. The system reliability description can be
treated as part of system transparency. System transparency has been found to improve trust calibration (Oduor & Wiebe, 2008; Ososky et al., 2014) and overall subjective trust even when system reliability is below 70% (Mercado et al., 2016). In the current study, we found that the explicit description of system reliability had only limited effects. A possible reason is that, although it provides more information regarding the system, the reliability description does not explain how the system works.
Note that the description of system reliability only affected self-reported trust, not the
more objective measure of trust, agreement rate. Agreement rate was unaffected by description, indicating that description of the system reliability, if anything, may change how people report
their trust in the automation system but not how they actually behave when interacting with the system. An implication from this limited effect of description is that simply telling users about
the system capabilities may not be an effective approach to conveying the information. Further, if one tries to evaluate the effectiveness of this approach through subjective report alone, misleading evaluation results may be obtained.
The Description-Experience Gap

In the current study, the effects of description and feedback (experience) were apparently quite distinct. Description affected only the subjective trust measure (in Experiment 1), whereas feedback affected trust calibration for both subjective and objective trust measures. Note that the terms description and experience as used in the current study refer specifically to ways of conveying the system reliability information, which is a type of probability. This usage is different from that of prior human-automation studies that have used these terms (Hergeth et al., 2017; Koustanai et al., 2010), which compared written descriptions of critical situations (e.g., automation failure, takeover request) with the driver's experience of, or familiarization with, the critical situation.
In the prior literature on risky decision making, the description-experience gap has mainly been examined for "rare events" with a probability of .20 or less (Hertwig et al., 2004). However, it has been assumed and shown that high-probability events are weighted in a mirror-image fashion to low-probability events (Hertwig et al., 2004; Jessup et al., 2008). To complete the statement over the whole probability spectrum (from 0 to 1): Compared with decisions from description, rare events are underweighted in decisions from experience, and common (i.e., not rare) events are overweighted in decisions from experience. The system reliability levels in the current study (60%-90%) fall under the latter part of this statement; our results showed that, compared to the conditions with description, the system reliability was underweighted less with feedback/experience, especially at the higher reliability levels. Note that the perceived system reliability was lower than the actual system reliability across all conditions in the current study, which is consistent with prior research (Sanchez et al., 2004). Our finding that feedback/experience had more effect on trust than description fits with this concept of a description-experience gap for common events.
Moreover, as mentioned in the Introduction, most studies on the description-experience gap have used choice tasks with risky alternatives. Wulff et al. (2018) argued that this gap should exist in other cognitive functions and phenomena such as reasoning and judgment. The current study used an indirect "judgment" task, in which the judgment of the reliability of the automated aid was reflected in how often the participants agreed with and trusted the automated system. Our study provides empirical evidence that the description-experience gap is not limited to choice tasks.

Limitations of the Current Study

In both experiments, there were equal numbers of phishing and legitimate emails. We designed the experiments this way so that we could obtain enough data
CR IP T
points on both types of emails from each participant. In reality, although 76.5% of the surveyed information security professionals reported being a victim of a phishing attack in 2016 (Wombat Security Technologies, 2017), there are far fewer phishing emails than legitimate emails that arrive at an individual’s email box. This low frequency is because the emails must survive the
AN US
automatic spam filters of one’s email box or the information technology teams of organizations and institutions if they are to reach the individual user. To increase the ecological validity of this research, future studies can adjust the proportion of the phishing emails to closer to the real-
M
world situation and increase sample size of participants.
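As one minimal sketch of how a future study might implement such an adjustment, the function below samples a stimulus set with a configurable phishing base rate; the pool sizes, the 10% rate, and the email identifiers are placeholders chosen for illustration, not values taken from the current study or from real-world statistics.

```python
import random

def build_stimulus_set(phishing_pool, legitimate_pool, n_trials,
                       phishing_rate=0.10, seed=0):
    """Sample an email set with a specified phishing base rate (illustrative),
    instead of the 50/50 split used in the current experiments."""
    rng = random.Random(seed)
    n_phish = round(n_trials * phishing_rate)
    stimuli = (rng.sample(phishing_pool, n_phish)
               + rng.sample(legitimate_pool, n_trials - n_phish))
    rng.shuffle(stimuli)
    return stimuli

# Hypothetical usage with placeholder email identifiers.
phishing_pool = [f"phish_{i}" for i in range(50)]
legitimate_pool = [f"legit_{i}" for i in range(300)]
emails = build_stimulus_set(phishing_pool, legitimate_pool, n_trials=60)
print(sum(e.startswith("phish") for e in emails), "phishing emails out of", len(emails))  # 6 of 60
```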
In the current experiments, a single question was used to measure subjective trust. This choice reflected the constraints of running online experiments: we wanted to avoid participants becoming bored with too many questions and answering randomly. However, several questionnaires are available, such as the Scale of Trust in Automated Systems developed by Jian et al. (2000) and the Human–Computer Trust Questionnaire (Madsen & Gregor, 2000). These existing scales may measure subjective trust more precisely and capture different aspects of trust. Future studies should examine whether these subjective trust measures show a pattern similar to our current measure.

For the without-description groups, the system was described as having "reasonable reliability". We included this description, rather than not mentioning reliability at all, to make the conditions as equivalent as possible to the with-description groups, who were told the exact reliability percentage. However, people may have different mental references for "reasonable reliability", which would be a problem if more people assumed that reasonable reliability was high in one condition than in another. The large sample size and random assignment of participants in the current study should have minimized this possibility, but the wording for the without-description group needs to be considered carefully in future studies (e.g., describing the automation as "imperfect").
In addition, the current study was designed to include a wide range of system reliabilities to examine the effect of reliability on performance and trust. However, for the highest reliability of 90%, participants' perceived system reliability was 75% and 69% in Experiments 1 and 2, respectively. These low perceived reliabilities are consistent with prior research (Sanchez et al., 2004), but one may question whether the current results apply to systems with higher reliability. Further studies can test the current findings by extending the range of system reliability beyond 90% (e.g., including 99% or 100%).
Conclusions
The results from the current study suggest that system reliability has a profound influence on human performance with the system, but that the benefits of a more reliable system may depend on task difficulty. For the easy task, higher reliability led to reduced decision time; in the difficult task, decision accuracy was also enhanced at higher reliability levels. This pattern indicates that human users may rely on the system for different purposes. For the easy task, once they reckoned that the system was reliable, they agreed with the system and saved time; for the difficult task, when they were unsure about their choices, they relied on the system to make the correct choice. Thus, human users benefit from a reliable automated assistant even on an easy task at which they are already proficient.
Our findings also showed a description-experience gap: Providing feedback increased trust calibration in terms of both the objective and subjective trust measures, whereas providing a description of system reliability increased only self-reported trust. This pattern not only shows the gap between the effects of feedback and description, but also extends the description-experience gap concept from rare events to common events, such as a correct recommendation from a highly reliable automated system. For systems with lower reliability, providing feedback will benefit people's performance but, as a consequence, decrease their trust in the system; for more reliable systems, providing feedback will increase both performance and trust. Thus, it is reasonable to recommend that feedback be provided whenever possible, regardless of the actual system reliability level. Unlike feedback, describing the system reliability tends to "inflate" subjective trust. We therefore recommend that human factors practitioners and researchers keep this inflation in mind when only subjective trust measures are used in their studies. Additionally, designers of automated systems should be aware that providing such reliability information may inflate users' subjective trust, leading to improper trust calibration and potential overreliance.
References

Armstrong, B., & Spaniol, J. (2017). Experienced probabilities increase understanding of diagnostic test results in younger and older adults. Medical Decision Making, 37, 670-679.
Barg-Walkow, L. H., & Rogers, W. A. (2016). The effect of incorrect reliability information on expectations, perceptions, and use of automation. Human Factors, 58, 242-260.
Chancey, E. T., Bliss, J. P., Yamani, Y., & Handley, H. A. (2017). Trust and the compliance–reliance paradigm: The effects of risk, error bias, and reliability on trust and dependence. Human Factors, 59, 333-345.
Desai, M., Kaniarasu, P., Medvedev, M., Steinfeld, A., & Yanco, H. (2013, March). Impact of robot failures and feedback on real-time trust. In Proceedings of the 8th ACM/IEEE International Conference on Human-Robot Interaction (pp. 251-258). New York: Institute of Electrical and Electronics Engineers.
Dixon, S. R., & Wickens, C. D. (2006). Automation reliability in unmanned aerial vehicle control: A reliance-compliance model of automation dependence in high workload. Human Factors, 48, 474-486.
Dzindolet, M. T., Peterson, S. A., Pomranky, R. A., Pierce, L. G., & Beck, H. P. (2003). The role of trust in automation reliance. International Journal of Human-Computer Studies, 58, 697-718.
Egelman, S., Cranor, L. F., & Hong, J. (2008, April). You've been warned: An empirical study of the effectiveness of web browser phishing warnings. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 1065-1074). New York: Association for Computing Machinery.
Hadar, L., & Fox, C. R. (2009). Information asymmetry in decision from description versus decision from experience. Judgment and Decision Making, 4, 317-325.
Hertwig, R., Barron, G., Weber, E. U., & Erev, I. (2004). Decisions from experience and the effect of rare events in risky choice. Psychological Science, 15, 534-539.
Hertwig, R., & Erev, I. (2009). The description–experience gap in risky choice. Trends in Cognitive Sciences, 13, 517-523.
Heilbronner, S. R., & Hayden, B. Y. (2016). The description-experience gap in risky choice in nonhuman primates. Psychonomic Bulletin & Review, 23, 593-600.
Hergeth, S., Lorenz, L., & Krems, J. F. (2017). Prior familiarization with takeover requests affects drivers' takeover performance and automation trust. Human Factors, 59, 457-470.
Hoff, K. A., & Bashir, M. (2015). Trust in automation: Integrating empirical evidence on factors that influence trust. Human Factors, 57, 407-434.
Hoffman, R. R., Lee, J. D., Woods, D. D., Shadbolt, N., Miller, J., & Bradshaw, J. M. (2009). The dynamics of trust in cyber domains. IEEE Intelligent Systems, 24, 5-11.
Hong, J. (2012). The state of phishing attacks. Communications of the ACM, 55, 74-81.
Jessup, R. K., Bishara, A. J., & Busemeyer, J. R. (2008). Feedback produces divergence from prospect theory in descriptive choice. Psychological Science, 19, 1015-1022.
Jian, J. Y., Bisantz, A. M., & Drury, C. G. (2000). Foundations for an empirically determined scale of trust in automated systems. International Journal of Cognitive Ergonomics, 4, 53-71.
Kim, J., & Moon, J. Y. (1998). Designing towards emotional usability in customer interfaces—trustworthiness of cyber-banking system interfaces. Interacting with Computers, 10, 1-29.
Koustanai, A., Mas, A., Cavallo, V., & Delhomme, P. (2010, September). Familiarization with a forward collision warning on driving simulator: Cost and benefit on driver-system interactions and trust. In Driving Simulation Conference 2010 Europe (pp. 169-179). Paris, France.
Kumaraguru, P., Sheng, S., Acquisti, A., Cranor, L. F., & Hong, J. (2010). Teaching Johnny not to fall for phish. ACM Transactions on Internet Technology, 10, 7:1-7:31.
Lee, J. D., & Moray, N. (1992). Trust, control strategies and allocation of function in human-machine systems. Ergonomics, 35, 1243-1270.
Lee, J. D., & Moray, N. (1994). Trust, self-confidence, and operators' adaptation to automation. International Journal of Human-Computer Studies, 40, 153-184.
Lee, J. D., & See, K. A. (2004). Trust in automation: Designing for appropriate reliance. Human Factors, 46, 50-80.
Lejarraga, T., Pachur, T., Frey, R., & Hertwig, R. (2016). Decisions from experience: From monetary to medical gambles. Journal of Behavioral Decision Making, 29, 67-77.
Lewandowsky, S., Mundy, M., & Tan, G. (2000). The dynamics of trust: Comparing humans to automation. Journal of Experimental Psychology: Applied, 6, 104-123.
Madhavan, P., Wiegmann, D. A., & Lacson, F. C. (2006). Automation failures on tasks easily performed by operators undermine trust in automated aids. Human Factors, 48, 241-256.
Madsen, M., & Gregor, S. (2000). Measuring human-computer trust. In G. Gable & M. Vitale (Eds.), Proceedings of the 11th Australasian Conference on Information Systems (paper 53). Brisbane, Australia: Information Systems Management Research Centre.
McGuirl, J. M., & Sarter, N. B. (2006). Supporting trust calibration and the effective use of decision aids by presenting dynamic system confidence information. Human Factors, 48, 656-665.
Mercado, J. E., Rupp, M. A., Chen, J. Y., Barnes, M. J., Barber, D., & Procci, K. (2016). Intelligent agent transparency in human–agent teaming for Multi-UxV management. Human Factors, 58, 401-415.
Muir, B. M. (1987). Trust between humans and machines, and the design of decision aids. International Journal of Man-Machine Studies, 27, 527-539.
Muir, B. M. (1994). Trust in automation: Part I. Theoretical issues in the study of trust and human intervention in automated systems. Ergonomics, 37, 1905-1922.
Oduor, K. F., & Wiebe, E. N. (2008, September). The effects of automated decision algorithm modality and transparency on reported trust and task performance. In Proceedings of the Human Factors and Ergonomics Society 52nd Annual Meeting (pp. 302-306). Santa Monica, CA: Human Factors and Ergonomics Society.
Ososky, S., Sanders, T., Jentsch, F., Hancock, P., & Chen, J. Y. (2014, June). Determinants of system transparency and its influence on trust in and reliance on unmanned robotic systems. In SPIE Defense + Security (p. 90840E). Bellingham, WA: International Society for Optics and Photonics.
Parasuraman, R., & Riley, V. (1997). Humans and automation: Use, misuse, disuse, abuse. Human Factors, 39, 230-253.
Peng, W. (1955, November). Xiao ma guo he (How a colt crossed the river). Xin Shao Nian Bao. Beijing, China: China Children's Press & Publication Group.
Rice, S. (2009). Examining single- and multiple-process theories of trust in automation. The Journal of General Psychology, 136, 303-322.
Ross, J. M., Szalma, J. L., Hancock, P. A., Barnett, J. S., & Taylor, G. (2008, September). The effect of automation reliability on user automation trust and reliance in a search-and-rescue scenario. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting (Vol. 52, No. 19, pp. 1340-1344). Los Angeles, CA: Sage Publications.
Sanchez, J., Fisk, A. D., & Rogers, W. A. (2004, September). Reliability and age-related effects on trust and reliance of a decision support aid. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting (Vol. 48, No. 3, pp. 586-589). Los Angeles, CA: Sage Publications.
Schaefer, K. E., Chen, J. Y., Szalma, J. L., & Hancock, P. A. (2016). A meta-analysis of factors influencing the development of trust in automation: Implications for understanding autonomy in future systems. Human Factors, 58, 377-400.
Sheridan, T. B., & Parasuraman, R. (2006). Human-automation interaction. Reviews of Human Factors and Ergonomics, 1, 89-129.
Singh, M. P. (2011, May). Trust as dependence: A logical approach. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 2 (pp. 863-870). International Foundation for Autonomous Agents and Multiagent Systems.
Stokes, C. K., Lyons, J. B., Littlejohn, K., Natarian, J., Case, E., & Speranza, N. (2010, May). Accounting for the human in cyberspace: Effects of mood on trust in automation. In 2010 International Symposium on Collaborative Technologies and Systems (CTS) (pp. 180-187). New York: Institute of Electrical and Electronics Engineers.
Ungemach, C., Chater, N., & Stewart, N. (2009). Are probabilities overweighted or underweighted when rare outcomes are experienced (rarely)? Psychological Science, 20, 473-479.
van Rooy, D., & Bus, J. (2010). Trust and privacy in the future internet—a research perspective. Identity in the Information Society, 3, 397-404.
Verberne, F. M., Ham, J., & Midden, C. J. (2015). Trusting a virtual driver that looks, acts, and thinks like you. Human Factors, 57, 895-909.
Wombat Security Technologies. (2017). The state of the phish. Retrieved from https://info.wombatsecurity.com/state-of-the-phish
Wu, M., Miller, R. C., & Garfinkel, S. L. (2006, April). Do security toolbars actually prevent phishing attacks? In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 601-610). New York: Association for Computing Machinery.
Wulff, D. U., Hills, T. T., & Hertwig, R. (2015). Online product reviews and the description–experience gap. Journal of Behavioral Decision Making, 28, 214-223.
Wulff, D. U., Mergenthaler-Canseco, M., & Hertwig, R. (2018). A meta-analytic review of two modes of learning and the description-experience gap. Psychological Bulletin, 144, 140-176.
Yang, W., Xiong, A., Chen, J., Proctor, R. W., & Li, N. (2017, April). Use of phishing training to improve security warning compliance: Evidence from a field experiment. In Proceedings of the Hot Topics in Science of Security: Symposium and Bootcamp (pp. 52-61). New York: Association for Computing Machinery.
Figure 1. Recommendations (left, legitimate; right, phishing) made by the phishing detection system.
Figure 2. Performance measures in Experiment 1: Decision time as a function of Feedback (Fdbk) and Reliability level (top panel), and judgment accuracy as a function of Description (Desc) and Reliability level (bottom panel). Error bars represent the 95% confidence interval.
Figure 3. Trust measures in Experiment 1: Agreement rate as a function of Feedback (Fdbk) and Reliability level (top panel), and self-reported trust as a function of Description (Desc) and Reliability level (bottom panel). Error bars represent the 95% confidence interval.
Figure 4. Perceived system reliability as a function of Reliability level and Feedback (Fdbk) in Experiment 1 (top panel) and Experiment 2 (bottom panel). Error bars represent the 95% confidence interval.
Figure 5. Performance measures in Experiment 2: Decision time as a function of Feedback (Fdbk) and Reliability level (top panel), and judgment accuracy as a function of Feedback and Reliability level (bottom panel). Error bars represent the 95% confidence interval.
Figure 6. Trust measures in Experiment 2: Agreement rate as a function of Feedback (Fdbk) and Reliability level (top panel), and self-reported trust as a function of Feedback and Reliability level (bottom panel). Error bars represent the 95% confidence interval.