Physiological responses to different WEB page designs

Int. J. Human-Computer Studies 59 (2003) 199–212

R.D. Ward*, P.H. Marsden

Department of Multimedia and Information Systems, School of Computing and Engineering, University of Huddersfield, Huddersfield, West Yorkshire HD1 3DH, UK

Received 28 September 2001; received in revised form 19 November 2002; accepted 30 November 2002

Abstract

Physiological indicators of arousal have long been known to be sensitive to mental events such as positive and negative emotion, changes in attention and changes in workload. It has therefore been suggested that human physiology might be of use in the evaluation of software usability. To this end, there are two main approaches or paradigms: (i) comparisons of physiological readings across periods of time to indicate different arousal levels under different circumstances, and (ii) the detection of short-term (occurring in seconds) physiological changes in response to specific events. Both approaches involve methodological, analytical and interpretational difficulties. Also, the tight experimental controls usually adopted in psychophysiological experimentation can be at odds with the needs of applied usability testing. This paper reports initial investigations of these approaches and difficulties in the evaluation of software interfaces. From exploratory data, a preliminary model is proposed which combines the two paradigms for identifying significant HCI events. Explorations of the model within the context of a web-related task are then discussed. These explorations suggest techniques and procedures for applied usability testing, and the results point to ways in which physiological data may be informative about software usability. However, further investigations involving variations in task and procedure are required.

© 2003 Elsevier Science Ltd. All rights reserved.

*Corresponding author. E-mail addresses: [email protected] (R.D. Ward), [email protected] (P.H. Marsden).
1071-5819/03/$ - see front matter © 2003 Elsevier Science Ltd. All rights reserved. doi:10.1016/S1071-5819(03)00019-3

1. Introduction

Evidence that human physiology responds to a wide variety of mental events has been available since the late 19th century. Andreassi (2000), in an extensive and wide-ranging review of the field, reports that Skin Conductance (SC) (a measure of the activity of the eccrine sweat glands), cardiovascular activity, respiration, electrical activity in the brain, muscles and the peripheral nervous system, pupillary size and other physiological phenomena have all been observed to vary along with factors such as task difficulty, levels of attention, activities involving decision-making and problem solving, experiences of frustration, surprise and insult, and the affective meanings of stimuli and mental imagery. These responses are involuntary and surprisingly sensitive. They reflect changes in levels of arousal and may also provide clues about emotional valence (positive or negative emotion).

From time to time it has been proposed that physiological responses might contribute to the design and evaluation of software interfaces by helping to identify factors and events that cause changes in levels of arousal, and are therefore likely to be of significance to users (Wastell and Newman, 1996). As HCI events are essentially no different from other stimuli, they should evoke similar physiological responses. The idea also has considerable intuitive appeal: most computer users can readily list personal experiences of strong emotional reactions to the difficulties, frustrations, delights and satisfactions inherent in much of today's software. Changes in levels of arousal might be expected as a result of negative emotions brought about by software-induced frustration, positive emotions brought about by successful completion of a task, and shifts in attention in response to particular content and moments of high workload.

A number of studies have provided support for the application of this general idea in computer-related situations. These essentially fall into two main approaches or paradigms.
First, there have been studies that make comparisons between repeated psychophysiological measurements averaged across periods of time, with the aim of finding different levels of arousal in different situations. For example, Wastell and Newman (1996) found cardiovascular measures indicated reductions in the stress levels of ambulance control system operators at times of high workload when a computer-based system replaced a manual, paper-based one. Similarly, Wilson and Sasse (2000) showed SC and cardiovascular measures to indicate increased stress levels in viewers of video following a change from a high to a low frame rate, even though many participants were unaware that there had been a decrease in video quality.

Secondly, there have been studies that attempt to find short-term (occurring within seconds) physiological changes in response to specific events, either by investigating the effects of known events, or by identifying candidate events that precede observed physiological changes. Short-term changes that occur in proximity to a novel stimulus are usually referred to as orienting responses. Picard (1996) describes rapid changes in the muscle electrical activity of computer game players, especially in situations where software fails to react correctly to its controls, and Scheirer et al. (2002) utilized short-term changes in skin conductivity and blood volume pulse (BVP) in applying pattern-based techniques to automatically detect the responses brought about by software-induced frustration.

Both paradigms involve difficulties of methodology, signal analysis and interpretation. Firstly, physiological readings are inconsistent. Different metrics can give different indications, with little correlation between them. Even with single measures there can be considerable differences between individuals, and considerable differences within individuals on different occasions. For example, in the case of SC, inconsistencies are caused by factors such as differences in room temperature and humidity, participants' activities during the period prior to experimentation, participants' skin structure, and distance between electrodes and electrode type and size; these are well documented in the literature (Idzikowski and Baddeley, 1983).

A second area of difficulty lies in the recognition of significant features within the physiological signals. Physiological measurements are highly changeable. There are problems in deciding how differences and changes are to be identified, for example in setting significance thresholds for the latency, duration and magnitude of responses. Much of the published literature discusses these kinds of analytical issues, with solutions proposed at various levels of sophistication. Fernandez (1997) suggests the use of dynamic patterns derived by means of techniques such as detrending of the SC signal by subtracting a 10-s time-varying sample mean, and calculation of the difference between the upper and lower envelopes of the BVP signal. Kramer (1991) discusses the advantages and disadvantages of different physiological measures.

A third area of difficulty lies in the interpretation of any significant features that have been identified. Different mental events can produce near identical physical responses. Thus, in the absence of tight experimental control, it may not be possible to conclude whether a particular physiological response is due to the effects of workload, surprise, frustration, or any other mental experience. This touches upon an ongoing debate in the emotion theory literature. Whilst some researchers believe that specific emotional states may have characteristic physiological features (e.g.
Ekman et al., 1983), there remain issues concerning the definition of emotion and emotional states. Other researchers therefore avoid labels, preferring to describe emotion by means of dimensions such as arousal and valence (e.g. Lang et al., 1993).

Practically all of the findings summarized by Andreassi (2000) were observed in stringently controlled experimental situations using pure distinct stimuli, with other possible confounding sources of variability held constant. Human–computer interaction does not normally occur under such tightly controlled conditions. Software tends to be complex, with many potentially confounding variables. Therefore, if psychophysiological measurements are to be of practical help in HCI design and evaluation, it would appear necessary to be able to employ them under less tightly controlled conditions. Ideally, physiological data should be available without recourse to lengthy baseline periods, temperature- and humidity-controlled environments, special electrodes and conductivity gels, skin abrasion, large numbers of participants, and other techniques and procedures often adopted in psychophysiological laboratories. It is not clear to what extent this is possible. It does however seem that, under less tightly controlled conditions, great care is needed in the design of testing procedures and in comparing and combining measurements across different occasions, situations and participants. In particular, it would appear that data should be regarded as relative rather than absolute, and that participants' baseline control data should be collected within the same session as their experimental data.
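The detrending step attributed above to Fernandez (1997), subtracting a 10-s time-varying sample mean from the SC signal, can be sketched as follows. This is an illustrative reconstruction rather than code from the paper: the 1 Hz sample rate and the trailing (rather than centred) form of the moving-mean window are our assumptions.

```python
def detrend_moving_mean(signal, sample_rate_hz=1.0, window_s=10.0):
    """Subtract a time-varying sample mean (trailing window) from a signal,
    in the spirit of the SC detrending technique described by Fernandez (1997)."""
    w = max(1, int(window_s * sample_rate_hz))
    out = []
    for i, x in enumerate(signal):
        window = signal[max(0, i - w + 1):i + 1]  # up to the last w samples
        out.append(x - sum(window) / len(window))
    return out

# A constant signal detrends to zero; slow drifts are largely removed,
# leaving the short-term fluctuations of interest.
print(detrend_moving_mean([3.0, 3.0, 3.0, 3.0]))  # [0.0, 0.0, 0.0, 0.0]
```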

Thus, although there is some empirical support for the idea of employing psychophysiological measurement to identify significant HCI events, there are also methodological, analytical and interpretational difficulties that call into question its viability. This requires further investigation. This paper reports initial exploration of the idea and its associated difficulties. First, physiological data was collected with the aim of obtaining prototypical baseline and scaling data relating to various HCI situations. From this, a preliminary model is proposed which combines short- and long-term changes in identifying significant HCI events. The model was then explored through measures of SC, blood volume and pulse rate in a specific computer-based task situation. This also served to develop ideas about experimental procedures through which physiological responses to HCI events might reliably be identified.

2. Prototypical data

SC, blood volume and heart rate (HR) of participants were monitored in various loosely controlled computer-based situations, with the aim of obtaining prototypical data to indicate the range and magnitude of the psychophysiological changes that occur in response to HCI events. Data was collected using DataLab 2000, a computerized physiological recording and data acquisition system manufactured by Lafayette. Collection and subsequent analysis of the signal data was carried out using National Instruments BioBench software, which forms part of the DataLab system. More detailed analysis was carried out by exporting data to Microsoft Excel.

SC and Blood Volume Pulse (BVP) were measured through electrodes and sensors attached to the fingers of participants' non-dominant hand (leaving the dominant hand free to carry out experimental tasks). For SC, standard stainless-steel dry electrodes as supplied with the DataLab system were attached to the first and third fingers. The BVP sensor fitted over the end of the second finger. Both sensors could be attached in seconds, without specialist procedures. The BVP sensor also detects pulse, and therefore provides a measure of HR. One Windows PC was used for data collection, a second to present experimental tasks. Data collection sessions took place in a quiet room, with steps taken to prevent distraction or interruption.

Fig. 1 plots typical changes in SC. Changes occurred over both short (i.e. in seconds) and longer periods of time. There were large variations in range and magnitude, both between and within individuals. It was possible for a single individual to have an overall conductance range of, say, 1.25–2.50 μS on one occasion, and 5.0–10.0 μS under similar conditions on a different occasion. Short-term changes also were observed to vary from less than 1% to over 60% over a period of 10 s.
These inconsistencies suggest that data needs to be normalized to permit comparisons between individuals, and that it is necessary to make assumptions about the latency, duration, magnitude and frequency parameters within which short-term changes are considered significant. We have adopted here the simple approach of using percentage changes in SC, and a time period of 10 s for the purpose of identifying short-term changes. These parameters may require later refinement. For accounts of more sophisticated approaches refer to Kramer (1991) and Fernandez (1997).

Fig. 1. Prototypical SC readings showing SC in μS of participants (a) at rest, (b) during non-contentious computing, (c) presented with a surprise HCI event, (d) using a web site with a known design problem.
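The simple percentage-change rule described above can be made concrete. The sketch below flags any sample that exceeds the sample 10 s earlier by a given percentage; the 1 Hz sample rate and the 3% threshold are placeholder assumptions, since the text leaves these parameters open to later refinement.

```python
def flag_short_term_rises(sc, sample_rate_hz=1.0, window_s=10.0, threshold_pct=3.0):
    """Return indices of SC samples that are at least `threshold_pct` percent
    higher than the sample `window_s` seconds earlier."""
    lag = int(window_s * sample_rate_hz)
    flagged = []
    for i in range(lag, len(sc)):
        earlier = sc[i - lag]
        if earlier > 0 and (sc[i] - earlier) / earlier * 100.0 >= threshold_pct:
            flagged.append(i)
    return flagged

# Toy 1 Hz trace: flat at 5.0 uS for 20 s, then a sudden rise to 5.5 uS (+10%).
trace = [5.0] * 20 + [5.5] * 10
print(flag_short_term_rises(trace))
```

Each flagged index marks a sample, not a distinct event, so a single sustained rise produces a run of consecutive flags.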

From the prototypical data we can make the following general observations.

(a) At rest, HR slows, there is a steady decrease in SC indicating diminishing activity of the eccrine sweat glands, and an increase in finger blood volume indicating dilation of the peripheral blood vessels, all suggesting lowered levels of arousal. Fig. 1(a) shows SC over a 10 min period of inactivity during which, following an initial ‘‘settling down’’ period of 2–3 min, there is a gradual and smooth decrease in conductance. After the first 3 min there are no sudden changes at all, with no data points being 3% or more higher than the data point occurring exactly 10 s beforehand.

(b) During non-contentious computer-based activities such as browsing the web, HR, SC and finger blood volume tend to show considerable fluctuation but remain around the same general level, suggesting maintained levels of arousal. Fig. 1(b) illustrates this by showing that conductance continues around the same level as the initial period, with several sudden fluctuations that suggest maintenance of arousal levels through responses to matters requiring attention. After the first 3 min, 3.1% of data points increased 3% or more over 10 s, the largest increase being 4.5% over 7 s.

(c) Following an unexpected HCI event, participants tend to exhibit increases in HR and SC together with lowered peripheral blood volume, suggesting a sudden increase in arousal typical of an orienting response. This can be seen in Fig. 1(c), where a participant engaged in a computer-based task is unexpectedly presented with an alert box accompanied by a standard Windows alert sound at medium volume. This produced, after a 1 s latency, a rapid 63% increase in conductance over the following 9 s. The example shown here is an extreme one, from a participant who reported feeling startled, but most participants show an increase of between 5% and 10% in response to the appearance of the alert box. Furthermore, there is considerably more fluctuation in this graph, with 15.1% of data points increasing by 3% or more over 10 s, excluding the first 3 min and the period covering the large orienting response.

(d) When using software in more realistic situations, physiological readings are similar to those occurring in non-contentious activities, except that there appear to be more fluctuations. Fig. 1(d) represents a participant using a web site with a known usability problem (a difficult-to-find link). When this was encountered there was a rapid 9.7% increase in conductance (indicated by the shaded area). Elsewhere, excluding the first 3 min, the number of data points that increased by 3% or more in 10 s was only 2.9%, which is a similar level of fluctuation to that of Fig. 1(b).

These observations appear to indicate a possible relationship between SC and different kinds of HCI events and situations. If, as a measure of SC fluctuation, we calculate the percentage of data points showing an increase of 3% or more over 10 s, and as a measure of orienting responses we count the number of increases of 7% or more over 10 s, then the observations suggest the preliminary model represented by Table 1. Again, these operational definitions are arbitrary, may require later refinement, and may need to be reconsidered in terms of size rather than percentage change.
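The two measures just defined, fluctuation (percentage of data points rising 3% or more over 10 s) and orienting responses (rises of 7% or more), can be sketched as below. The grouping of contiguous flagged samples into a single orienting response is our reading of the text, not a procedure the paper spells out, and the 1 Hz sample rate is assumed.

```python
def sc_metrics(sc, sample_rate_hz=1.0, window_s=10.0,
               fluct_pct=3.0, orient_pct=7.0):
    """Return (fluctuation, orienting_count) for an SC trace:
    fluctuation = % of samples rising >= fluct_pct over window_s;
    orienting_count = number of contiguous runs of rises >= orient_pct."""
    lag = int(window_s * sample_rate_hz)
    rises = [(sc[i] - sc[i - lag]) / sc[i - lag] * 100.0
             for i in range(lag, len(sc))]
    fluct = 100.0 * sum(r >= fluct_pct for r in rises) / len(rises)
    flags = [r >= orient_pct for r in rises]
    # count each contiguous run of flagged samples as one orienting response
    orienting = sum(1 for i, f in enumerate(flags)
                    if f and (i == 0 or not flags[i - 1]))
    return fluct, orienting

# Toy 1 Hz trace with one sustained 10% rise halfway through.
print(sc_metrics([10.0] * 15 + [11.0] * 15))
```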

Table 1
Skin Conductance (SC) characteristics in different situations

Fig.  Amount of stress  Number of events  SC trend    SC fluctuation  Orienting responses
1a    Low               0                 Decreasing  0%              0
1b    Medium            0                 Level       7.2%            0
1d    Medium            1                 Level       7.0%            1
1c    High              >1                Increasing  15.1%           4
Thus, situations of low stress with no significant events would appear to produce decreasing conductance with little or no variability and no orienting responses. As stress increases then conductance tends to remain around the same level, with increased variability. When additionally there are significant HCI events, both conductance and the degree of fluctuation tend to increase, and there may be orienting responses. It therefore seems possible to categorize HCI situations according to the kinds of stress stimuli they present, and this would appear to be reflected in the prototypical SC traces they produce. It seems feasible that the model could be extended to incorporate prototypical traces for other physiological metrics. It should also be possible to test the model experimentally by a combination of measurement over periods of time with local measures of variability.
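The categorization just described can be expressed as a rough decision rule over the Table 1 characteristics. This is a hypothetical rendering of the preliminary model: the function name, the use of trend and orienting count as the deciding features, and the string labels are all our own illustrative choices.

```python
def classify_situation(sc_trend, orienting_count):
    """Map SC characteristics onto the four prototypical situations of Table 1.
    `sc_trend` is one of 'decreasing', 'level', 'increasing' (an assumed encoding)."""
    if sc_trend == "decreasing" and orienting_count == 0:
        return "low stress, no events (Fig. 1a)"
    if orienting_count == 0:
        return "medium stress, no events (Fig. 1b)"
    if orienting_count == 1:
        return "medium stress, one event (Fig. 1d)"
    return "high stress, multiple events (Fig. 1c)"

print(classify_situation("decreasing", 0))
print(classify_situation("increasing", 4))
```

A fuller version might also test the fluctuation percentage against the Table 1 values, and could be extended with prototypical traces for other physiological metrics, as the text suggests.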

3. Procedure

A computer-based task was devised with the aim of exploring the model and beginning to develop ideas about experimental procedures through which physiological responses to HCI events might reliably be identified. The task was based upon an unpublished digitized directory of organizations and residents of a small Yorkshire town and its surrounding villages for the year 1939. This directory was HTML-based and delivered by Microsoft Internet Explorer. It consisted of a front index page providing links to scanned images of the 340 pages of the original print version of the directory. For the purpose of these experiments two versions of the directory were created, both containing exactly the same information. One version (referred to as the ‘‘well-designed’’ version) was designed as far as possible to follow principles of good web and information design (e.g. Hartley, 1994; Nielsen, 1995–2001). The other (the ‘‘ill-designed’’ version) was designed to break these principles where possible. Both versions recorded a timed trace of pages visited. Features of the ill-designed version included:

- An index page that made excessive use of pull-down lists. This obscured the overall structure of the information in the directory and made links more difficult to find and use than in the well-designed version.
- Impoverished navigational cues and functions, making navigation involve additional scrolling and mousing. For example, it was necessary to scroll back to the top of each scanned page in order to find the link back to the index page.

- Gratuitous animation and the periodic appearance of advertisements, which either caused the display to move suddenly or appeared in pop-up windows that had to be moved or closed in order to proceed.

Screen shots of the two interfaces appear in Fig. 2. Figs. 2(a) and (b) show the opening page and one of the digitized directory pages in the well-designed version. Fig. 2(c) shows the opening page of the ill-designed version, with its pull-down lists, gratuitous and inconsistent colors and typefaces, an animated ‘‘under construction’’ icon and one of the advertisements that cause the other content of the page suddenly to jump on the screen. Fig. 2(d) shows one of the digitized pages in the ill-designed version, with one of the pop-up advertisement windows that had to be moved or closed in order to proceed. The pop-up windows appeared pseudo-randomly in approximately one in four of the digitized pages of the ill-designed version.

For the pilot investigation, 20 participants aged 18–48, drawn from the general University population, were assigned to use either the well-designed or ill-designed version of the directory. It was hypothesized that the ill-designed version would be more difficult to use, and that this would be reflected in the physiological data in accordance with the preliminary model. Participants were asked to work through a sequence of questions presented verbally and on cue cards, which required them to find information in the directory (e.g. ‘‘How many people named Young lived at Rawcliffe?’’, ‘‘Who was the chief clerk of Airmyn Parish Council?’’ and ‘‘Why did Bennett Line steamers call at Hull?’’). The same sequence of questions was presented to both groups. Answers were made verbally. SC, finger blood volume and HR were measured over a 15 min period consisting of an initial 5 min familiarization and ‘‘settling in’’ period (quietly reading paper pages copied from the original directory), followed by 10 min on the question-answering task itself.
The procedure was then concluded, regardless of the number of questions completed, and participants were debriefed. Formal psychophysiology experiments almost always adopt much longer settling-in periods than here, typically 15 min. This could be impractical in applied HCI usability testing, and from the prototypical data a period of 5 min appeared sufficient to minimize the effects of the fluctuations that occur at the beginning of a session. Psychophysiology experiments also tend to use larger numbers of participants. Although this would have been preferable for statistical significance testing purposes, an N of 20 would be a likely upper limit in applied usability testing situations.

4. Results

Fig. 2. Screen shots of the two interfaces: (a,b) well-designed version; (c,d) ill-designed version.

For participants, the beginning of the question-answering section of the procedure followed the quiet ‘‘settling-in’’ period, after which the experimenter re-appeared and began asking questions. At this point, all participants showed large increases in SC, indicative of substantial increases in arousal at the start of the task. The first minute of the question-answering task was then used as a baseline reading for each participant, against which readings over the subsequent 9 min were compared. In effect this plots recovery from the high levels of arousal at the start of the task, when both groups of participants are attempting to gain understanding of the software and the task requirements. This meets the methodological requirement, discussed in the introduction, that baseline data should be obtained within the same session as experimental data.

Fig. 3 shows that all three main measures produced group differences in the direction predicted by the hypothesis. These three graphs plot mean group changes against the first-minute baseline over the subsequent 9 min of the question-answering task. The SC data, in Fig. 3(a), was first normalized ((signal - baseline)/range) to minimize the influence of individuals with the largest changes in conductance, and then expressed as a percentage change. The SC of participants using the well-designed directory began to decrease after the first minute, indicating ‘‘relaxation
into the task’’.

Fig. 3. Group percentage changes against baseline during each minute of the question-answering task, taking the first minute as baseline: (a) SC, (b) pulse rate, (c) finger blood volume.

In comparison, the SC of participants using the ill-designed version continued to rise for several minutes. Fig. 3(b), HR, gives similar indications. Whereas the mean HR of students using the ill-designed interface remained around the initial level for the duration of the task, the mean HR of the well-designed group had fallen by around 3 beats per minute after minute 3. Fig. 3(c) shows the mean group finger blood volume data. This plots changes in the pinch of the upper and lower envelopes of the BVP, showing greater dilation of peripheral blood vessels in users of the well-designed directory, again indicating higher levels of arousal in users of the ill-designed version.

However, this group data summarizes large individual differences between participants, with both groups containing large variances and individuals whose readings were completely at odds with the group trend. A GLM repeated measures analysis confirms that the group differences were not statistically significant, with F = 1.235, p = 0.282 for SC, F = 3.324, p = 0.086 for HR, and F = 0.893, p = 0.358 for finger blood volume. This is discussed further in the conclusions. There were, however, clear differences between the two groups in the numbers of questions completed: users of the well-designed directory completed a mean of 22 questions, compared with 12 in the ill-designed group.

These data relate to the first of the two paradigms introduced earlier: comparisons between readings across longer periods of time in response to different situations. The second paradigm is concerned with short-term changes in response to specific events. One of the most clearly bounded events in the task situation was the appearance of pop-up advertisements in the ill-designed version of the directory. These were one aspect of the deliberate ill design. Pop-up windows appeared 93 times at pseudo-random intervals across the participants using the ill-designed version.
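The group-level analysis used above, normalizing SC as (signal - baseline)/range and plotting mean change per minute against a first-minute baseline, can be sketched as follows. Taking the first minute's mean as the baseline value, and a 1 Hz sample rate, are our assumptions; the paper does not specify either.

```python
def minute_changes(sc, sample_rate_hz=1.0):
    """Normalize an SC trace as (signal - baseline)/range, with the first
    minute's mean as baseline, and return the mean percentage change for
    each subsequent minute."""
    per_min = int(60 * sample_rate_hz)
    baseline = sum(sc[:per_min]) / per_min
    rng = (max(sc) - min(sc)) or 1.0  # guard against a perfectly flat trace
    changes = []
    for m in range(1, len(sc) // per_min):
        chunk = sc[m * per_min:(m + 1) * per_min]
        changes.append(100.0 * (sum(chunk) / len(chunk) - baseline) / rng)
    return changes

# Toy 1 Hz trace: one minute at 2.0 uS, then one minute at 3.0 uS.
print(minute_changes([2.0] * 60 + [3.0] * 60))
```

Because each participant's trace is scaled by its own range, individuals with unusually large conductance swings no longer dominate the group means.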
There was a significant difference in SC between the two 10 s periods immediately before and immediately after each appearance. Prior to appearance, participants' SC decreased by a mean of 0.0308 μS. After appearance there was a mean increase of 0.0266 μS. Expressed as a percentage of the prevailing conductance levels, this represents an increase of 0.88%. In most cases this reflects modest changes in the slope of otherwise upwards or downwards trends rather than obvious orienting responses, but the difference was consistent across all participants, and across the 93 occurrences was statistically significant (t = 2.255, p < 0.05).

Fluctuations in SC produced counter-intuitive results: the number of data points that increased by 3% or more in 10 s was 6.0% for participants using the ill-designed version and 11.9% for participants using the well-designed version.
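The event-locked comparison above, SC change over the 10 s before versus the 10 s after each pop-up appearance, tested with a paired t statistic, can be sketched as follows. The 1 Hz sample rate and the use of simple window differences are assumptions; the paper does not give its exact computation, and in practice scipy.stats.ttest_rel would replace the hand-rolled t statistic and also supply the p-value.

```python
import math

def event_locked_changes(sc, events, sample_rate_hz=1.0, window_s=10.0):
    """SC change over the window before and after each event index,
    skipping events too close to either end of the trace."""
    w = int(window_s * sample_rate_hz)
    pre, post = [], []
    for e in events:
        if e - w >= 0 and e + w < len(sc):
            pre.append(sc[e] - sc[e - w])    # change over the 10 s before
            post.append(sc[e + w] - sc[e])   # change over the 10 s after
    return pre, post

def paired_t(pre, post):
    """Paired t statistic on the post-minus-pre differences (minimal sketch)."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Toy example: three events where SC was falling before and rising after.
print(paired_t([-0.03, -0.02, -0.04], [0.02, 0.03, 0.025]))
```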

5. Conclusions and discussion This short paper has reported ongoing investigations into the use of psychophysiological metrics in the evaluation of usability issues. A preliminary model combining short- and long-term measurements was proposed, and findings from first explorations of the model were presented.

One early conclusion is that if physiological data is to be of any practical help in evaluating HCI issues, then it needs to tolerate collection under relatively loosely controlled situations, rather than under the tightly controlled designs and conditions normally adopted in psychophysiological experimentation. The explorations described here did take place in loosely controlled situations, and this allowed the possibility of a number of uncontrolled sources of variability. These included factors inherent in the use of software of normal complexity rather than pure distinct stimuli; to control or remove these factors would change the essential look, feel and functionality of the software. Differential task and workload effects were also present, in that users of the well-designed directory were able to answer approximately twice as many questions as users of the ill-designed directory, and questions were not necessarily of equal difficulty. Also, the experimental task was performed verbally, and some participants tended to give longer verbal answers than others, or to make comments about particular questions and answers (to which the experimenter did not respond).

These sources of variability would be unacceptable in experimental psychophysiological settings, but it is unlikely they could be excluded entirely in applied usability testing settings. Uncontrolled sources of variability would appear to account for at least part of the lack of statistical significance in the differences between the users of the well-designed and ill-designed versions of the directory. This suggests that conditions were insufficiently controlled to generate significance, and that further observations under more tightly controlled situations are needed. Psychophysiological testing is perhaps not as robust as HCI usability testers might like it to be.
There were also considerable differences between individual participants, which would also have contributed to the lack of statistical significance. Some participants simply showed greater reactivity than others. There appeared to be differences in ability to handle the usability problems posed by the ill-designed version: some participants, especially experienced web users, tended to proceed patiently and methodically, apparently without becoming at all irritated by the difficulties. Some participants may not have been as motivated and involved in the task as others. These factors appear to be reflected in the physiological data, producing large variances in group data. This problem might be reduced by using greater numbers of participants (possibly the experimental psychophysiologist's preferred solution), but again, in applied usability testing settings, it is unlikely that this would be feasible. Even so, given the considerable differences between the two versions of the web site in design and usability, and given that the psychophysiological literature is replete with evidence that the kinds of factors present in the ill-designed version should produce physiological responses, it is surprising that significance was not demonstrated more easily. More stringent experimental control and greater numbers of participants might well have produced statistically significant differences. But the main issue is not whether it is possible to demonstrate significantly different physiological responses to different software designs, but whether physiological responses can be of help in evaluating real HCI issues. The challenge is to balance established experimental methods and the meaningful evidence they produce with the subtle, complex situations and practical constraints of naturalistic settings. This is a long-standing and sometimes contentious issue in the behavioral sciences.

The results of these first explorations are promising. Group means did at least show consistent trends in the predicted direction, and it is expected that replication under more tightly controlled conditions, or with greater numbers of participants, would lead to statistical significance. The group summary data suggest that measures of SC, blood volume and pulse rate, averaged across periods of time, should be able to distinguish differences in arousal levels in different computer-based situations, and can therefore provide a good indication of software usability. Responses to the appearance of a pop-up advertisement window were statistically significant, showing that reactions to some discrete events can be detected in otherwise complex computing situations. In general, the procedures adopted appear to be a viable approach to psychophysiological usability testing, and should be applicable to different software in different HCI situations. An important aspect of these procedures was that participants' control data were obtained within the same session as their experimental data.

Drawing on Hartley (1994) and Nielsen (1995–2001), there are a variety of factors in the well- and ill-designed versions of the directory that might have contributed to the different numbers of questions completed and to any difference in physiological response. The ill-designed index page obscured the logical structure of the directory, which should have caused considerable difficulties. The unexpected movement of text on the ill-designed index page appeared difficult to handle, especially when users were on the point of selecting a menu item. Inconsistent use of colors, italics, underlining and upper-case letters, obtrusive backgrounds, and the presence of the animated "under construction" sign on every page should have caused visual distraction and other difficulties.
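A detection rule for the second paradigm, reacting to discrete events such as the pop-up advertisement windows, might be sketched as follows. The sampling rate, response-latency window and amplitude threshold below are all assumptions made for illustration, not values taken from the study.

```python
def event_locked_response(sc, event_index, rate_hz=10,
                          latency_s=(1.0, 5.0), threshold=0.05):
    """Return True if skin conductance rises by more than `threshold`
    (microsiemens) above its value at event onset, within a typical
    1-5 s response-latency window after a logged event such as the
    appearance of a pop-up window. All parameters are illustrative."""
    onset = sc[event_index]
    start = event_index + int(latency_s[0] * rate_hz)
    end = event_index + int(latency_s[1] * rate_hz)
    window = sc[start:end + 1]
    return bool(window) and max(window) - onset > threshold

# Hypothetical 10 Hz recording: a flat signal with a small rise about
# two seconds after the event logged at sample 5.
signal = [2.00] * 5 + [2.00] * 20 + [2.12] * 10 + [2.01] * 20
print(event_locked_response(signal, event_index=5))  # True
```

In applied testing, the event timestamps would come from software logs (window openings, clicks, errors), so that candidate reactions can be tied back to the specific events that preceded them.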
However, any effects of these factors appear to have been cumulative and difficult to distinguish. In this light it is encouraging that the pop-up advertisement windows do appear to have had a distinguishable effect. This suggests that in genuine, rather than contrived, testing situations, where (one would hope) fewer ill-designed features are present at the same time, it may be possible to associate physiological reactions with discrete events or design elements. Further work in more realistic situations is now needed.

Explanation at the level of mental events is also problematic. The physiological data might simply have indicated that the ill-designed directory demanded greater workload than the well-designed version. Alternatively, users may have been genuinely stressed or frustrated by aspects of the ill design, and the physiological data then reflected their ensuing emotions. It is not beyond the bounds of possibility that users actually liked the appearance of the ill-designed version, or the content of its advertisements, and that any effects therefore reflected users' positive emotions. Interpretation of physiological data clearly requires knowledge of the context in which it was obtained, together with other information such as psychological testing and interview data. However, in usability testing it may not be necessary to explain the whole chain of events leading to a distinguishable physiological reaction. The reaction alone could indicate the presence or occurrence of something unknown that requires further investigation, and this in itself could be helpful. If psychophysiology proves able to help identify important HCI elements of which users are not conscious, or which they forget before providing verbal feedback, it could be a useful and informative tool.
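The within-session control recordings described above also suggest one possible way of handling the large individual differences in reactivity noted earlier: expressing each participant's signal relative to their own baseline before group data are pooled. The sketch below is hypothetical; the function name, values and units are invented.

```python
from statistics import mean, stdev

def normalize_to_baseline(samples, baseline):
    """Express a participant's skin-conductance samples as z-scores
    relative to that participant's own resting baseline, so that a
    highly reactive and a weakly reactive participant become
    comparable before group data are pooled."""
    mu, sd = mean(baseline), stdev(baseline)
    return [(s - mu) / sd for s in samples]

# Two hypothetical participants with very different raw reactivity
# produce the same normalized profile once each is scaled to their
# own within-session baseline.
reactive = normalize_to_baseline([8.0, 9.0, 10.0], baseline=[5.0, 6.0, 7.0])
placid = normalize_to_baseline([2.2, 2.3, 2.4], baseline=[1.9, 2.0, 2.1])
print([round(z, 1) for z in reactive])  # [2.0, 3.0, 4.0]
print([round(z, 1) for z in placid])   # [2.0, 3.0, 4.0]
```

Such normalization would not remove differences in patience, motivation or web experience, but it could reduce the purely physiological component of between-subject variance.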

Acknowledgements

This work is supported by EPSRC project grant GR/N00586 and the University of Huddersfield. Bernadette Cahill and the late Clive Johnson helped in discussions and data collection. We also thank the anonymous reviewers for their helpful and constructive advice in the area of psychophysiology, and would recommend that HCI researchers seek specialist help when planning to use psychophysiological techniques.

References

Andreassi, J.L., 2000. Psychophysiology: Human Behavior and Physiological Response, 4th Edition. Lawrence Erlbaum Associates, Mahwah, NJ.
Ekman, P., Levenson, R.W., Friesen, W.V., 1983. Autonomic nervous system activity distinguishes among emotions. Science 221, 1208–1209.
Fernandez, R., 1997. Stochastic modelling of physiological signals with hidden Markov models: a step towards frustration detection in human–computer interfaces. MS Thesis, MIT Media Laboratory. MIT Media Laboratory Technical Report 446.
Hartley, J., 1994. Designing Instructional Text, 3rd Edition. Kogan Page, London.
Idzikowski, C., Baddeley, A.D., 1983. Fear and dangerous environments. In: Hockey, R. (Ed.), Stress and Fatigue in Human Performance. Wiley, Chichester (Chapter 5).
Kramer, A.F., 1991. Physiological metrics of mental workload: a review of recent progress. In: Damos, D.L. (Ed.), Multiple Task Performance. Taylor & Francis, London, pp. 329–360.
Lang, P.J., Greenwald, M.K., Bradley, M.M., Hamm, A.O., 1993. Looking at pictures: affective, facial, visceral and behavioral reactions. Psychophysiology 30, 261–273.
Nielsen, J., 1995–2001. Alertbox columns. http://www.useit.com/alertbox/ (accessed 27th June 2001).
Picard, R.W., 1997. Affective Computing. The MIT Press, Cambridge, MA, p. 164.
Scheirer, J., Fernandez, R., Klein, J., Picard, R., 2002. Frustrating the user on purpose: a step toward building an affective computer. Interacting with Computers 14, 93–118.
Wastell, D.G., Newman, M., 1996. Stress, control and computer system design: a psychophysiological field study. Behaviour and Information Technology 15 (3), 183–192.
Wilson, G., Sasse, M.A., 2000. Do users always know what's good for them? Utilising physiological responses to assess media quality. In: McDonald, S., Waern, Y., Cockton, G. (Eds.), People and Computers XIV - Usability or Else! Proceedings of HCI 2000. Springer, Berlin, pp. 327–339.