Interacting with Computers 20 (2008) 505–514
doi:10.1016/j.intcom.2008.08.005
Comparison of techniques for matching of usability problem descriptions

Kasper Hornbæk *, Erik Frøkjær

Department of Computer Science, University of Copenhagen, Universitetsparken 1, DK-2100 Copenhagen, Denmark

* Corresponding author. Tel.: +45 35321425; fax: +45 35321401. E-mail address: [email protected] (K. Hornbæk).

Article history: Received 28 November 2007; received in revised form 16 August 2008; accepted 18 August 2008; available online 30 August 2008.

Keywords: Usability evaluation; Usability problems; Problem matching; Evaluator effect; Similarity

Abstract

Matching of usability problem descriptions consists of determining which problem descriptions are similar and which are not. In most comparisons of evaluation methods matching helps determine the overlap among methods and among evaluators. However, matching has received scant attention in usability research and may be fundamentally unreliable. We compare how 52 novice evaluators match the same set of problem descriptions from three think aloud studies. For matching the problem descriptions the evaluators use either (a) the similarity of solutions to the problems, (b) a prioritization effort for the owner of the application tested, (c) a model proposed by Lavery and colleagues [Lavery, D., Cockton, G., Atkinson, M.P., 1997. Comparison of evaluation methods using structured usability problem reports. Behaviour and Information Technology, 16 (4/5), 246–266], or (d) the User Action Framework [Andre, T.S., Hartson, H.R., Belz, S.M., McCreary, F.A., 2001. The user action framework: a reliable foundation for usability engineering support tools. International Journal of Human–Computer Studies, 54 (1), 107–136]. The resulting matches are different, both with respect to the number of problems grouped or identified as unique, and with respect to the content of the problem descriptions that were matched. Evaluators report different concerns and foci of attention when using the techniques. We illustrate how these differences among techniques might adversely influence the reliability of findings in usability research, and discuss some remedies.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Methods for usability evaluation are among the most widely used results of research in human–computer interaction. A huge literature describes drawbacks and benefits of usability evaluation methods based on empirical studies comparing evaluators' performance (see Hartson et al., 2001; Cockton et al., 2007). This paper concerns a seemingly minor activity in these comparisons of usability evaluation methods, matching. Matching helps determine whether descriptions of usability problems are similar or not, that is, whether the descriptions are about the same underlying design flaw. Most comparisons of usability evaluation methods base their conclusions on matching, for instance, by comparing problem descriptions produced with the help of different methods or by contrasting a set of problem descriptions with known problems with the interface.

While a few papers have discussed matching (e.g., Lavery et al., 1997), in general little attention has been given to how matching is performed or how it may impact usability research. This is surprising for at least two reasons. First, the bulk of comparisons of usability evaluation methods rarely explain how matching was done, leaving readers with the impression that it was straightforward.

Nielsen (1992), for example, does not explain the procedure he followed for matching descriptions of usability problems or the criteria for treating two descriptions as similar. Yet, similarity is definitely not a straightforward notion, which is recognized both within usability research (Lavery et al., 1997) and more generally in psychology (Tversky, 1977).

Second, central notions in usability research such as the evaluator effect and false positives are derived from matching. For instance, for a problem to be deemed a false positive it needs to be matched to a set of problems establishing the actual problems with the application (often found in a series of think aloud tests). However, if matching may be done in varying degrees and with varying overlap, these notions may depend to a large extent on the specific approach taken to matching. As one example, the SUPEX framework (Cockton and Lavery, 1999) allows varying generalizations of the similarity of problems, essentially opening the possibility of varying the overlap between problem sets depending on how general a level matching is performed at.

Thus, matching seems a difficult activity, which some studies seem to treat lightly. Further, matching has the potential to strongly influence findings in usability research. To study matching we designed an experimental comparison of four techniques for matching problems. The aim is to understand better the reliability of matching techniques, and, concomitantly, how it influences conclusions in usability research. Our results are mainly important in starting a discussion about the reliability of matching and, more generally, of usability research based upon matching of problems.


Further, we believe that usability practitioners may also benefit from the ensuing discussion of similarity and classification of usability problems.

2. Related work

A scan of the literature on usability research suggests that matching has not been a topic in most comparisons of usability evaluation methods. Molich, Ede, Kaasgaard and Karyukin (2004), for example, only noted that the first author “went through all of the nine test reports in order to determine the overlap in the usability problems reported by the teams.” (p. 69). Though the matching was validated by some of the evaluators, the paper contains no information on the criteria used in the matches of the first author. Similarly, Mankoff et al. (2003) did not explain how problems identified with their new heuristics for usability inspection were matched to a set of known usability issues; and Kessner, Wood, Dillon and West (2001) wrote that “the evaluators independently grouped the problems into categories of problems that were essentially the same”, but left “essentially the same” without further explanation. While elaborate procedures for matching and definitions of what constitutes a match may have been used, they are not described in most papers comparing usability evaluation methods. Thus one gets the impression that matching of descriptions of usability problems is not seen as an important issue, that it is self-evident when usability problems are similar, and that matching is straightforward.

A number of arguments may be given why this is not so, and why matching is a relevant and important research topic. First, the lack of rigor in matching procedures suggests that different interpretations of what constitutes a match may be used among matchers in one paper and among matchers across papers. Lavery et al. (1997) illustrated the absence of clear descriptions of matching rules in a selection of papers. As mentioned above, studies that appeared after the paper by Lavery et al. are brief about their matching procedures. Lavery et al. also described some of the shortcomings of tying matching to a specific evaluation method, for example when using the questions from cognitive walkthrough to match problems. They also listed problems associated with papers, such as Mack and Montaniz (1994), that do contain descriptions of how problems were matched.

Second, matching is difficult because of the brief, context-free descriptions of usability problems in most usability reports. Keenan et al. (1999), for example, listed three problems consisting of an average of 23 words. Hartson et al. (2001) suggested that “because usability problems are usually written in an ad hoc manner, expressed in whatever terms seemed salient to the evaluator at the time the problem is observed, it is not unusual for two observers to write substantially different descriptions of the same problem” (p. 387).

Third, a few results suggest that different approaches to matching may lead to different conclusions about evaluation techniques. As already mentioned, the SUPEX framework (Cockton and Lavery, 1999) allows matching at differing levels of generality, leading to different values of overlap between techniques. Connell and Hammond (1999) showed that counting by problem type rather than by problem token reduces the number of evaluators needed to find three quarters of the usability problems in an interface. But if counting at a more general level changes usability research results, what may matching at different levels not do?
Fourth, the role of matching in practical usability work is not well understood. We suspect that matching most often proceeds implicitly and possibly inconsistently. Moreover, matching appears to be related to evaluators’ decisions about which problems to report separately and which to report as one. However, we know of no studies that investigate practical matching.

A couple of attempts have been made to circumvent the above problems. One is to match descriptions of usability problems at varying degrees of similarity. John and Mashyna (1997) distinguished a precise and a vague hit between usability problems; Connell et al. (2004) similarly distinguished a hit from a possible hit. While matching in degrees reduces part of the uncertainty of matching, it is not a general solution because it does not remove most of the difficulties discussed above.

Another suggestion has been to use structured reporting of usability problems to facilitate their matching. The idea is that the format for description of usability problems is crucial to the successful matching of problems. Lavery et al. (1997) proposed a reporting format for usability reports that separates causes, breakdowns, outcomes and solutions. They showed examples of usability problems in the proposed format, but did not document whether it improved matching of problems, or whether evaluators found the format too laborious. Building upon the work of Lavery et al., Cockton et al. (2003) showed how an extended reporting format made evaluators using heuristic evaluation predict fewer false positives, compared to a previous study using a simpler reporting format. The extended reporting format asked evaluators to report problem descriptions, the likely/actual difficulties, the context of a problem, and assumed causes. In addition, evaluators reported how they found problems, for example if they scanned the system for problems and which heuristic they used to find a particular problem. The study is indicative only, because the use of the extended reporting format is compared only to a previous study, with many differences to the study in which the extended reporting format was being used.

Another suggestion for how to improve matching has been to use some kind of classification for describing usability problems (Keenan et al., 1999; Sutcliffe, Ryan, Doubleday, and Springett, 2000; Andre et al., 2001). The aim is to assist matching by either supporting the description of usability problems or the matching process. In the words of Hartson et al. (2001) “the User Action Framework allows usability engineers to normalize usability problem descriptions based on identifying the underlying usability problem types within a structured knowledge base of usability concepts and issues.” (p. 405).

In sum, it seems that matching is a concern for usability research because it may impact our findings. Understanding the relative benefits of using different techniques would be beneficial, as would more concrete evidence about the likely influence that different matching techniques have on the reliability of usability research. This is the focus of the experiment described next.

3. Experiment

The aim of the experiment was to explore whether different matching techniques make a difference to which descriptions of usability problems are seen as alike. In particular, we investigate the influence of following different matching techniques on the reliability of findings in usability research. Regarding terminology, the core notion in the following is a matching. Participants make matchings by transforming a common set of descriptions of usability problems into a set of groups of problems, where shared membership in a group suggests that problems are similar. A problem may also be alone in a group of size one; such problems we call single problems.
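To fix ideas, a matching can be represented as a collection of groups of problem identifiers, with single problems appearing as groups of size one. The following minimal Python sketch is our own illustration (it was not part of the experimental materials); the identifiers are hypothetical.

```python
# A matching: each inner set is a group of usability problem identifiers
# judged to describe the same underlying problem. Groups of size one are
# the "single problems". Identifiers stand for problem descriptions in the
# common problem set (illustrative values only).
matching = [
    {3, 17, 42},   # three descriptions judged to describe the same problem
    {8, 23},
    {5},           # a single problem, not grouped with any other
]

single_problems = [group for group in matching if len(group) == 1]
# number of groups and number of single problems, the two counts reported later
print(len(matching), len(single_problems))
```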
3.1. Participants

Participants were 55 computer science students, typically third-year students, who participated in an introductory course on human–computer interaction. All were familiar with think aloud testing and had performed at least one such test.


The argument for using students to match is twofold. First, it was practically impossible for us to gather a similarly large number of usability professionals for a matching exercise, which would be a quite time-demanding activity of about 5–10 h of work per participant. Even if it were possible, we doubt that professionals would have had previous experience with explicit and systematic matching of usability problems (see Nørgaard and Hornbæk, 2006). Second, we hypothesized that students would be less biased when learning and applying a matching technique because they would have few established work habits that could interfere with the instructions. The ramifications of this choice for the external validity of our study are discussed in detail in Section 5.

3.2. Application

The evaluators worked with identifying and matching problems from one of the largest job portals in Denmark, http://www.jobindex.dk. Jobindex has more than 150,000 unique visitors each week, placing it among the top 30 most visited Danish web sites.

3.3. Procedure

Fig. 1 illustrates the procedure of the experiment. First, the application was subjected to a series of think aloud tests. Then participants identified problems on the videos of the tests, and a problem set was constructed, comprising one problem from each participant. That set formed the basis for all of the participants' problem matching. The following sections describe each of these phases.

3.4. Think aloud tests

The application was evaluated with a series of test sessions, each following the guidelines of Dumas and Redish (1999). Developers of the application helped formulate the focus of the test and approved the tasks. For this experiment we selected three sessions, each documented by digital video files lasting 20, 37 and 43 min. These files were produced with TechSmith's Morae tool (www.techsmith.com/morae.asp), and showed a screen capture with audio and a picture-in-picture of the test subject's face. The video files included introductions, background questions, activity during task solution, and debriefing. We chose to include all of these phases, rather than just task-solution activity, because real evaluation work would typically entail participation in all of these phases and because we wanted to give participants some context to the task-solution activity. The task-solution activities on the videos lasted about 66 min.

3.5. Problem identification

Each participant was asked to identify and describe the usability problems that the videos contained. In addition to the videos, they received a description of the think aloud tests and of the tasks used.

In identifying and describing usability problems, participants were helped by two definitions of what might constitute a usability problem. One of these was part of the paper by Jacobsen et al. (1998), which lists nine predefined problem criteria: (1) the user articulates a goal and cannot succeed in attaining it within three minutes, (2) the user explicitly gives up, (3) the user articulates a goal and has to try three or more actions to find a solution, (4) the user produces a result different from the task given, (5) the user expresses surprise, (6) the user expresses some negative affect or says something is a problem, (7) the user makes a design suggestion, (8) a system crash, and (9) the evaluator generalizes a group of previously detected problems into a new problem.

Participants also received Table 1 from Skov and Stage (2005), which aims to support usability problem identification. This table relates the seriousness of a problem to how that problem might be detected (e.g., user was slowed down or expressed frustration) by giving examples of how observations might be interpreted. For instance, a serious problem related to reduced work speed may be that the user was delayed for several seconds. By giving participants these descriptions, we wanted them to have a common understanding of what a usability problem is, both during the problem identification phase and when performing the matching. It was not the intent to find all problems with the application, merely to identify a subset of problems that participants could subsequently match.

Participants were asked to document the problems by describing the following five aspects:

• A headline that summarizes the problem.
• An explanation that details the problem. Participants were asked to give as many details as possible and to ensure that the description was understandable without knowledge of the test sessions or the videos.
• A description of why the problem is serious to some or all users of the application. A problem may be seen as serious for example if users get confused, express that they are insecure, or cannot finish their tasks.
• A description of the context in which the problem arose, for example in a particular task or part of the user interface.
• A description of how the problem could be solved. This description should be understandable by the developers of the website, also without access to the other aspects of the problem description. Participants were asked to give as many details as possible and, if necessary, repeat parts of the description if that may be expected to facilitate understanding the solution.

The aim of this documentation of problems was to allow a multi-faceted description of problems without going into the detail of the problem reporting formats of Lavery et al. (1997) or Cockton et al. (2003).
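As an illustration only (the study collected these aspects as free text; the record type and field names below are ours), the five-aspect documentation could be represented as a simple structured record:

```python
from dataclasses import dataclass

@dataclass
class ProblemDescription:
    """One usability problem documented along the five aspects listed above.
    Field names are illustrative; participants wrote free-text answers."""
    headline: str      # summarizes the problem
    explanation: str   # details the problem, understandable without the videos
    seriousness: str   # why the problem is serious to some or all users
    context: str       # task or part of the user interface where it arose
    solution: str      # proposed fix, understandable by the developers

# Abbreviated example based on Table 1
example = ProblemDescription(
    headline="The menu on the right hand side is not noticed",
    explanation="The menu is very compact and hard to overview; it contains 8-10 items squeezed together.",
    seriousness="The menu is needed for further navigation on the site.",
    context="Task: find a job the test user would want to apply for.",
    solution="Develop a more easily accessible menu, e.g., by removing less used items.",
)
print(example.headline)
```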

Fig. 1. Illustration of the procedure of the experiment: think aloud tests (three video-taped sessions) → problem identification (746 usability problems) → problem set construction (55 usability problems) → problem matching (52 matchings, i.e., sets of grouped usability problems; three matchings were not received) → analysis.


Table 1
Two examples of usability problems reported by the evaluators

Example 1
Problem headline: Categories of jobs under “check your salary”
Description: On several occasions the users did not know which job category to choose. They experience difficulties navigating the very long drop-down menu
Seriousness: Because the different categories of jobs receive very different salaries, this is a critical defect in the usefulness of the salary check
Context: The problem arose in connection with task 2, salary check
Solution: Develop extra job categories that better fit the users' understanding of their own education. Make it possible to navigate the drop-down menu by typing the first part of the requested job on the keyboard

Example 2
Problem headline: The menu on the right hand side is not noticed
Description: The menu on the right hand side is not noticed by the user. It is very compact and hard to overview because it contains 8–10 items that are squeezed together
Seriousness: The problem is serious because the menu is needed for further navigation on the site
Context: The test user was asked to find a job that he would want to apply for. He did not notice the menu on the right hand side and proceeded to create a CV, which is unnecessary if you just want to see which jobs are available
Solution: I suggest that a more easily accessible menu is developed. This might happen through removing some of the superfluous or less used menu items

3.6. Problem set construction

To arrive at a reasonably sized collection of descriptions of usability problems that participants should match, we picked a problem description at random from the list of each participant, for a total of 55 problem descriptions. We did not want to allow participants to match only their own problems. While this would probably make understanding the problems easier, it would also make comparing the matchings made by participants almost impossible. We also decided against having fewer evaluators create the problem set. Though this would imitate usability evaluation in practice, it would entail that not all persons performing the matching would have evaluated the system, or that we would have available only a small group of participants who could match the problem descriptions. Finally, we wanted the participants to see that they had all contributed to the problem set, which we expected would strengthen their interest and motivation in doing a thorough matching.

3.7. Problem matching

Participants used a randomly determined technique for problem matching, chosen among the four techniques described below. Participants thus used only one of the techniques. In all cases participants were given a description of affinity diagramming (http://www.infodesign.com.au/usabilityresources/general/affinitydiagramming.asp), which we suggested they might find useful in matching the problems. They were also trained in using affinity diagramming on a set of usability problems. Besides the written material described below, they received no training in the matching techniques.

Participants documented their matching similarly across techniques (we refer to this as the common format). They did so by describing (a) a list of groups of matched usability problems, (b) the reasons behind matching each group, and (c) an expression of doubts regarding any of the problems in a particular group. Participants could place a problem in several groups, for example if it in their view described multiple problems. They also listed any single problems they might have identified and described the reasons why that problem was unique.

3.8. Matching techniques

The four techniques used for matching were as follows:

3.8.1. Similar Changes

Similar Changes refers to a matching technique that instructs participants to match problems on whether they would lead to similar changes in the application.

The underlying rationale is that designers and developers who must fix usability problems are interested in treating as one those problems that will have similar fixes. Participants were given an instruction based on the method used by Molich and Dumas (2008); the instruction comprised the key points in the method section of the paper by Molich and Dumas, including the following summary: Identical comments: Two problem comments are identical if fixing one of the comments would most likely also fix the other and vice versa. The fix does not have to be in accordance with the solution suggested in either of the comments. If one comment is a generalization of another, they are not identical.

The rationale behind including Similar Changes is that it grew out of comprehensive work on how to match problems and is well described; Similar Changes also seems to have been used implicitly in several studies.

3.8.2. Practical Prioritization

With this technique, participants were instructed to imagine that they should prioritize problems for the developers and designers of the application. We provided them with the following scenario: Your customer, Jobindex, needs to take a managerial decision on which parts of their web site to improve. They plan an iterative cycle of improvements where they wish to address as many of the most serious usability findings as possible. So they need you to prioritize for them how to address the usability problems that have been found with the web site.

On this background, participants were asked to create a prioritized list of usability findings. That list should summarize the findings, describe their impact on Jobindex's website and its users/customers, recommend what to do about the problem, and reference the usability problems that a finding was based on. Because the participants had to report which usability problems each usability finding was based on, they essentially did an implicit matching of problems. That is, they identified which problems would in practice be combined as one finding about the usability of a system. The common format was used to report the prioritization.

The rationale behind Practical Prioritization is twofold. First, to our knowledge the literature contains no attempts at modeling matching techniques over the prioritization that forms part of almost all practical usability work. Second, we hypothesized that prioritization might make participants think differently about matching because it gives more direction to their matching activity. Thus, matching is only indirectly approached with this technique, modeling what we think is the way matching is approached in much practical usability work.


3.8.3. The model of Lavery et al. (1997)

The third matching technique builds on a model by Lavery et al. (1997) of what goes into describing a usability problem. This model formed part of an early paper to discuss matching. It suggests that descriptions of usability problems have four components: cause, breakdown, outcome, and design change. Those components may be used for matching, as suggested in our discussion of structured problem reporting. Lavery et al., however, did not describe a particular procedure for matching problems, but only noted that their model would facilitate it. Thus, we explained to the participants that they were first to read the key sections of the paper by Lavery et al., in particular the text explaining its Fig. 2. This figure differentiates four components of a usability problem: a cause (e.g., a design fault), a breakdown (e.g., the user misinterprets feedback), a behavioral outcome (e.g., the user's task failed), and a design change (e.g., modification of a feature). The text accompanying the figure gives several examples of how to use these concepts to recognize components in problem descriptions. We then instructed participants: You are going to use the components in Fig. 2 as the basis for determining which problems match. The basic idea is to match problems in degrees depending on the number of components on which they match: for every group of problems this degree will go from 0 (no overlap in any component) to 4 (overlap with respect to the problems' cause, breakdown, outcome, and design changes).

Thus, the analysis of the descriptions of usability problems helps participants determine the extent to which such descriptions are in agreement, and it helps separate four aspects of a problem description. The notion of similarity to be used within these aspects is, however, left for the matcher to determine. We thought this a straightforward application of the model by Lavery et al. to matching. When reporting matching in the common format, participants were instructed to develop a working document in which they documented how they established the degree of overlap (in terms of cause, breakdown, outcome, and design change) between problems. The rationale behind using the model of Lavery et al. is that it takes advantage of structured reporting of problems, thereby making explicit how problems are different. Further, we wanted to see whether a conceptually driven approach to matching would increase the reliability and similarity of matches.

3.8.4. User Action Framework (Andre et al., 2001)

The fourth matching technique is based on the User Action Framework (UAF), developed by Rex Hartson and colleagues (Andre et al., 2001). The UAF is a knowledge base of usability concepts and issues organized in a hierarchical structure. That structure is related to the seven-stage model of Norman (1986). The UAF separates in its top categories issues of planning, translation, physical actions, outcome and system functionality, assessment, and problems independent of the interaction cycle. For each of these issues, usability problems may be classified into subcategories that aim at describing more closely the nature of the problem. The version of the UAF used consisted of a total of 382 categories organized in two to six levels.
In relation to matching, “the User Action Framework allows usability engineers to normalize usability problem descriptions based on identifying the underlying usability problem types within a structured knowledge base of usability concepts and issues.” (Hartson et al., 2001, p. 405). The participants received the paper by Andre et al. and were instructed to read Section 2.2.1 and Sections 3 and 3.1–3.2. These sections explain how the UAF was derived from the work of Norman, in particular how it builds on and expands the notion of the interaction cycle.


In addition, participants received access to an online version of the UAF and a link to an introductory video on the UAF. We summarized the use of the UAF for matching as follows: The basic idea is to first classify problems using the UAF and then to judge whether or not problems placed in the same node of the UAF are the same problem; problems placed in different nodes of the UAF are by definition dissimilar.

The main support for matching by the UAF is thus to separate types of problems. Within each type the UAF provides no way of determining whether problems are similar; this is left for the participants' judgment. Participants were requested to report a full classification of problems together with problems in the common format.

3.9. Data collected

A range of dependent measures was used to compare the four matching techniques. At the level of individual matchings we used (a) the number of problems grouped, and (b) the number of single problems, that is, problems not grouped with any other problem. Given a participant's matching, we could also calculate (d) a measure of agreement among the evaluators that had identified the problems being matched (any-two agreement, Hertzum and Jacobsen, 2001), and (e) two measures of similarity of the matching to other participants using the same matching technique and to participants using other matching techniques. The calculation of (d) and (e) will be explained in detail in the next section. After participants had finished the identification and matching activities, we asked them to (f) fill out questionnaires about the time used in the various activities of matching the problems and (g) write a brief essay of one to two pages about their confidence in the matchings produced and about the matching technique they used.

4. Results

4.1. Problem identification

Each participant identified a mean number of 13.57 problems (SD = 5.98) from the videos. From among these problems we picked 55 problems at random, one from the list of each participant. These problem descriptions had a mean length of 86 words. Refer to Table 1 for an impression of the problems participants matched on.

4.2. Problem matching

Two evaluators failed to hand in their matching and another did not understand the assignment. This and the following sections are therefore based on data from 52 participants. Table 2 summarizes the results of participants' matching activity. Using multivariate analysis of variance on the dependent variables (number of groups, number of single problems, agreement, and four measures of similarity), we find an overall significant difference between techniques, Wilk's lambda = .001, F(24, 108) = 203.02, p < .001. Below we thus proceed to analyze differences within each of the dependent measures. Following Rosenthal and Rosnow (1991), measures given in percentages (that is, agreement and the measures of similarity) were arc sine transformed prior to the analysis. Further, the number of single problems was not normally distributed (as assessed with Shapiro–Wilk's test for normality) and therefore ranked prior to analysis. For clarity, values reported in the tables and in the text are not transformed.

We first characterize the participants' matching activity by the groups and single problems they produced. In the next section we look at the agreement among the participants' matchings.
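As a minimal sketch of the variance-stabilizing step mentioned above, and assuming the standard arcsine square-root transform (the paper does not spell out the exact formula used), the transform of proportion-valued measures could be computed as follows:

```python
import numpy as np

def arcsine_transform(proportions):
    """Arcsine square-root transform, commonly used to stabilize the variance
    of proportion data before ANOVA (cf. Rosenthal and Rosnow, 1991).
    Input values are expected to lie in [0, 1]."""
    p = np.asarray(proportions, dtype=float)
    return np.arcsin(np.sqrt(p))

# Example: any-two agreement values of 4.0%, 7.9%, 8.3% and 10.5% (cf. Table 2)
print(arcsine_transform([0.040, 0.079, 0.083, 0.105]))
```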


Table 2
Number of groups of problems, single problems, and similarity of matching for each of the four matching techniques

Technique                   N     Groups, M (SD)   Single problems, M (SD)   Agreement (%), M (SD)
Similar Changes             13    11.3 (2.3)       15.1 (5.1)                 4.0 (2.7)
Practical Prioritization    12    11.7 (2.7)        6.0 (4.7)                 7.9 (3.2)
Lavery et al.'s model       12    11.5 (2.3)        7.4 (5.6)                 8.3 (3.3)
User Action Framework       15     9.5 (2.2)        7.3 (6.1)                10.5 (10.7)
Average                     52    10.9 (2.5)        9.0 (6.4)                 7.77 (6.6)

Each evaluator matched a total of 55 problems using one of the four matching techniques listed in the leftmost column. Groups refer to the number of groups of problems created, single problems to the number of problems not grouped with another problem, and agreement to the mean any-two agreement (Hertzum and Jacobsen, 2001) within the group of evaluators using a particular technique.

We find a significant difference in the number of groups of usability problems produced by each technique, F(3, 48) = 2.88, p < .05. Linear contrasts suggest that the User Action Framework leads participants to produce fewer groups. About two groups fewer are produced with the UAF compared to the other three techniques (16%, 17% and 19% fewer than Similar Changes, Practical Prioritization, and the model of Lavery et al., respectively). We also find a significant difference in the number of single problems produced, F(3, 48) = 5.20, p < .01. Linear contrasts show that matching by Similar Changes produced significantly more single problems than the other techniques, about twice as many. In most comparisons of evaluation methods, more single problems would entail less overlap between techniques and evaluators, merely by virtue of the matching technique.

4.3. Agreement among participants

The agreement among evaluators is quantified using a measure called any-two agreement. Hertzum and Jacobsen (2001) argued that this measure is superior to other measures of the evaluator effect, because it does not require that all problems in an interface are known, nor is any-two agreement affected by the number of evaluators. Any-two agreement is defined for a set of n evaluators i, j, …, with problem sets Pi, Pj, …, as the mean of

|Pi ∩ Pj| / |Pi ∪ Pj|

over all unique pairs of evaluators i and j, where i ≠ j. The number of such pairs is ½ n(n − 1). In our case, the matching of each participant may be used to determine the agreement among evaluators because the problems in the matching were generated by different evaluators. Using this measure, Table 2 summarizes the mean agreement among evaluators using a particular matching technique. A significant difference among techniques is found, F(3, 48) = 4.2, p < .05. Using linear contrasts we find that Similar Changes produces significantly lower any-two agreement compared to the other techniques. We find no differences among any of the other techniques. Note that in this study the calculation of any-two agreement is simplified, because each evaluator is only represented in the problem set with one problem. Any-two agreement is thus related to the number of single problems, as also shown by Table 2.

4.4. Similarity of matchings

Another way to characterize the similarity of how participants match is to look at the contents of their matchings. We address this question by using Rand's measure of similarity (Rand, 1971), which in this paper is used to quantify whether pairs of usability problems are grouped together or not. This measure is calculated for pairs of matchings.

For all pairs of problems in the problem set, it is considered whether they are treated similarly in the two matchings under comparison, that is, whether they – in both matchings – are either placed in the same grouping or in different groupings. The Rand measure of similarity is then simply the number of pairs treated similarly relative to the total number of pairs. We also calculate the specific agreement, that is, the number of pairs of problems actually reported together in both matchings, relative to the total number of pairs. Though this calculation is done for pairs, it works for any size of groups in the matching because each pair of problems is considered one at a time. One way to think of these similarities is in terms of two square matrices (A and B), one representing each matching, where rows and columns correspond to usability problems being matched. If problem i and problem j are treated similarly in A and B (i.e., Aij = Bij) then we increase the count used to form the Rand measure; if problem i and problem j are grouped together in both A and B (i.e., Aij = Bij = 1) then we increase the count used to form the specific agreement.

Table 3 summarizes the similarity of matchings using Rand's measure and specific agreement. These measures may be calculated in reference to two groups: the other participants who used a particular matching technique (within technique) and all other participants (to other techniques). The former measure indicates whether a particular matching technique makes participants' matchings more similar, the latter indicates whether a particular matching technique is more or less removed from how participants generally match.

First, let us discuss the within-technique differences. We find a significant difference between techniques with respect to the Rand similarity of matchings produced by a particular technique, F(3, 48) = 19.30, p < .001. Linear contrasts show that techniques have falling levels of similarity going from Similar Changes to the model of Lavery et al. to Practical Prioritization to the User Action Framework; only the differences between the model of Lavery et al. and Similar Changes are not significant. Generally, the levels of similarity are high, ranging from 85% to 96%. The results on specific agreement make clear, however, that this is not impressive. In general evaluators have about three to four percent overlap in which problems are placed together. In a problem set of 55 problems, that comes to an average of about two pairs of problems that are grouped together in the matchings of two participants. We find a significant difference between techniques in their specific agreement, F(3, 48) = 9.59, p < .05. Linear contrasts show that specific agreement is highest with Practical Prioritization, significantly lower with the model of Lavery et al. and the User Action Framework, and significantly lower than those with Similar Changes.

Note that though the differences in Table 3 may seem low, the effect sizes of the differences are substantial. The effect sizes expressed as squared eta values are .657 (Rand) and .345 (specific agreement). Squared eta is an expression of the proportion of the variance explained, and usually effect sizes of this order are considered medium to large effects (Cohen, 1969).
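To make the three measures concrete, here is a minimal Python sketch of any-two agreement, Rand's measure, and specific agreement, assuming a matching is represented as a list of groups of problem identifiers (the representation and function names are ours, not the paper's):

```python
from itertools import combinations

def any_two_agreement(problem_sets):
    """Mean of |Pi intersect Pj| / |Pi union Pj| over all unique pairs of
    evaluators (Hertzum and Jacobsen, 2001). problem_sets holds one set of
    problem identifiers per evaluator."""
    pairs = list(combinations(problem_sets, 2))
    return sum(len(pi & pj) / len(pi | pj) for pi, pj in pairs) / len(pairs)

def pairs_grouped_together(matching):
    """All unordered pairs of problems that share a group in a matching
    (a matching is an iterable of sets of problem identifiers)."""
    together = set()
    for group in matching:
        together.update(frozenset(p) for p in combinations(sorted(group), 2))
    return together

def rand_and_specific_agreement(matching_a, matching_b, problems):
    """Rand's (1971) measure: proportion of problem pairs treated alike in both
    matchings (grouped together in both, or separated in both). Specific
    agreement: proportion of pairs grouped together in both matchings."""
    all_pairs = {frozenset(p) for p in combinations(sorted(problems), 2)}
    together_a = pairs_grouped_together(matching_a)
    together_b = pairs_grouped_together(matching_b)
    alike = sum(1 for p in all_pairs if (p in together_a) == (p in together_b))
    both = len(together_a & together_b)
    return alike / len(all_pairs), both / len(all_pairs)

# Toy example with five problems and two matchings
problems = {1, 2, 3, 4, 5}
matching_a = [{1, 2}, {3, 4}, {5}]
matching_b = [{1, 2, 3}, {4}, {5}]
print(any_two_agreement([{1, 2}, {2, 3}, {1, 3}]))
print(rand_and_specific_agreement(matching_a, matching_b, problems))
```

Under this reading, the within-technique and to-other-techniques values in Table 3 would be averages of such pairwise values over the relevant pairs of participants.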


Table 3
Average similarity of the matchings produced, quantified with Rand's (1971) measure and using specific agreement

                                        Rand's measure                                    Specific agreement
Technique                   N     Within technique (%)   To other techniques (%)    Within technique (%)   To other techniques (%)
Similar Changes             13    95.8 (1.50)            94.5 (1.21)                3.52 (0.49)            3.24 (0.72)
Practical Prioritization    12    89.1 (2.90)            92.3 (3.33)                4.12 (0.60)            3.73 (0.58)
Lavery et al.'s model       12    93.8 (1.69)            94.0 (1.74)                2.96 (0.38)            3.50 (0.45)
User Action Framework       15    85.1 (6.74)            91.2 (8.39)                3.49 (0.63)            3.44 (0.58)
Average                     52    90.7 (5.84)            –                          3.51 (0.66)            –

Standard deviations are given in parentheses.

Table 4
Time usage with the four matching techniques

Technique                   N        Total, M (h)   SD
Similar Changes             12       7.9            2.2
Practical Prioritization    11       9.0            4.2
Lavery et al.'s model       11       11.1           3.6
User Action Framework       15       10.6           2.8
Overall                     49 (a)   9.8            3.4

The table summarizes the mean duration spent in each of five work phases as calculated from participants' self reports. All durations are in hours. (a) Three participants did not hand in questionnaires about time usage.

4.5. Time usage

Table 4 summarizes the time participants report having used on matching the 55 problems. We find no difference in overall time usage between techniques, F(1, 48) = 2.18, p > .05. Participants used from 3.5 to 21 h to perform the matching, with a mean time used of about 10 h.

4.6. Essays on matching

We also asked the participants to write a short essay, commenting upon their satisfaction with the results of matching and with the matching technique used. Table 5 summarizes the most significant and clear comments in the participants' essays. This table was created by systematically analyzing the essays in a bottom-up manner. We read through all of the essays several times and extracted all statements about matching. Only when we had a clear understanding of what the evaluator meant did we extract a statement. These statements were then grouped with similar statements from other participants using the affinity-diagramming technique (Beyer and Holtzblatt, 1997).

First of all, many evaluators clearly indicate that they are satisfied with their matching.

This is often done while they simultaneously express doubts about their matching result, and about their learning and use of the matching technique. Perhaps this is understandable in the light of the comment of one participant: It is of course hard to say that what one has used the last 7–8 h to do is not made properly. That is why I have difficulty saying how the matching could have been done differently.

However, fewer participants using Similar Changes directly express their satisfaction with matching, and more participants are dissatisfied with the result. Table 5 suggests why. The matching technique Similar Changes seems to lead to surprisingly many single problems. One participant noted that “I have for some groups taken lightly the principle of ‘similar changes’ because I would otherwise have had almost 55 single problems.” And another participant suggests: The strict requirements [for matching] are both an advantage and a restriction. An advantage is that the evaluator necessarily must be very precise and detailed in his/her classification. This makes for a nuanced picture of the web site's errors. It is a restriction because it is very easy to treat the problems as unique. [. . .] That can be quite unfortunate. The matching is not an end in itself, but desirable because it helps everyone gain an overview, both the evaluator and the developer.

Particularly the rules of Similar Changes for handling general versus specific usability problems are problematic. One participant noted about the treatment of general problems: The problem is that the instruction for the technique requires an assessment of whether solving one problem will solve all the other problems in the same group. [. . .] However, in my view such a focus is unrealistic and problematic. Rather the focus should be that the solution of the most general problem within a group would lead to the solution of all sub-problems. [. . .] All the crosschecking that is necessary to follow the instruction slows down the work with the problems and hinders the analyst in getting an elegant and clear overview of all the problems as a whole.

Table 5. Eight groups of expressions about matching experiences found in the evaluators' essays, with the number of evaluators per technique (Similar Changes, Practical Prioritization, Lavery et al.'s model, User Action Framework) who clearly and unambiguously made a similar expression: (1) satisfied with grouping; (2) less satisfied with grouping; (3) difficulties interpreting the technique; (4) the many unique usability problems found surprising or unsatisfactory; (5) general vs. specific usability problem descriptions cause doubts; (6) usability problem description unsuited for technique; (7) difficulties with interpreting usability problems in relation to technique; (8) critique of specific aspects of the technique. The numbers 1–8 are used in the text.


Evaluators had more difficulty understanding or interpreting the matching techniques Practical Prioritization and User Action Framework (see Table 5, points 3, 7 and 8). Practical Prioritization seems to have been too open in its definitions. One evaluator for example noted in his essay that additional information would have been helpful. Another participant said that “the instruction about what to do is vague”. In contrast, a third participant accepts and seems comfortable with this room for interpretation: ...when you have grouped all of the problems you automatically get a better sense of the whole of the problem set. And if one at the same time makes a prioritization of the worst problems then one suddenly has a clear picture of where to begin working.

Practical Prioritization is so open for interpretation that it apparently forces participants to actively involve their experiences from the usability evaluation. This was expressed by five participants (for the other techniques zero or one participant had similar comments): Therefore I had to either assume or relate to my knowledge of the problems with regards to giving the usability finding its right severity or number of users [experiencing the usability finding, our remark].

The background for the difficulties with interpreting and understanding the UAF seems to be somewhat opposite to the reasons behind the difficulties related to Practical Prioritization. Many participants find the UAF very detailed and extensive in its definitions and requirements for describing the matchings. One participant noted: I find the technique difficult to learn. The first problems take a long time to classify. The framework is very hard to overview for a novice, and for many problems I doubt if they are really in the right category. One finds out how important it is that a problem is clearly expressed.

Three evaluators find that the web-page presenting the UAF is very useful because it “makes it possible to investigate how the model was structured and what the individual nodes and paths in the classification represented”. A couple of participants point out a specific problem using the technique: “The UAF does not distinguish which parts of a given system a node is related to. Two different problems may be mapped to the same node in the UAF even though they are found in two quite different parts of the system. Here, I do not feel it makes sense to group the problems”. After learning the UAF, participants seem to find this technique quite effective and clear.

The description of the model by Lavery et al. appears to have been easy to understand (see Table 5, point 3). However, it seems that the actual use of the model as a matching technique raises quite a number of difficulties in relating the descriptions of usability problems to the terms of the model (Table 5, points 6 and 7). An important reason might be that the problem descriptions in our experiment were not clear and comprehensive enough to allow description of a problem's cause, breakdown, outcome and solution, as required in the model by Lavery et al. One participant notes: It seemed as if the paper [Lavery et al.] describes a reporting format that evaluators can use to describe problems and not so much a method for grouping problems. It might have been easier to group problems had all evaluators used this format. [. . .] I almost ended up by grouping after context, but the four criteria indicate something about how similar problems are.

5. Discussion

We have compared how four techniques are used to match usability problems. Our results indicate a significant difference between techniques in how they group problems and in the number of single problems they create. We also show that the matchings made by participants differ among matching techniques. Comments from the participants suggest that some matching techniques were restrictive in what problems could be grouped, and that the form a usability problem was described in sometimes worked against effective matching.

The main finding is that different techniques for matching produce quite different results when applied by novice evaluators. If this finding generalizes to expert evaluators, it has a number of implications for usability research. As an example, consider the comparison of two evaluation methods with the purpose of establishing the overlap between the problems found with the methods, that is, the size of the common set of problems. We know of no other approach to computing the overlap than to use some kind of matching. However, in this study the number of single problems is doubled when Similar Changes matching is used compared to any of the other techniques. This will impact the overlap; exactly how much depends on the distribution of problems. With the present problem set, the overlap might decrease by as much as 9 problems (cf. Table 2: 15.1 vs. 6.0 problems, the numbers indicating the single problems found with Similar Changes and Practical Prioritization), corresponding to a 16% decrease (9/55, where 55 is the number of problems).

Another example of the implications of our results concerns the establishing of the evaluator effect (Hertzum and Jacobsen, 2001), the observation that evaluators evaluating the same interface will find markedly different problems. We show a significant and large difference with respect to the evaluator effect indicated by any-two agreement and the matching similarities (cf. the squared eta effect-size measures), solely based on using different matching techniques. The evaluation is thus shaped not only by individual differences among evaluators, but also by the matching technique used to establish it.

The conclusion is that our results threaten the reliability and validity of comparisons of usability evaluation methods and of other central notions in usability research that depend upon matching. The reliability is threatened because matching procedures are usually implicit: given the large variations we see between techniques, it seems likely that matching problems with little explicit concern for the techniques used to match is a very variable process. Further, validity is threatened because specific matching techniques may introduce systematic bias in usability research. Matching by Similar Changes, for example, is likely to systematically underestimate the overlap among problems and overestimate the size of the evaluator effect, because each evaluator reports many more unique problems.

In relation to practical usability work our study raises a number of questions. It is not clear which matching technique practitioners follow, even implicitly, or if they describe and match usability problems using just one notion of similarity. Our results suggest, however, that just as the reliability of research studies may be impacted by the variability in matching documented, so may practical usability work be affected by changes in how problems are matched.
The nature and extent of how this influences usability work are open issues for future studies.

The matching techniques used showed individual strengths and weaknesses. Matching by Similar Changes produces a different number of single problems than the other techniques. Interestingly, it shares with Practical Prioritization the aim of making matching similar to what might go on in actual usability work.


Participants, however, seem to be bothered by the treatment of general problems in Similar Changes (see Table 5). Practical Prioritization reaches the highest level of agreement among the novice evaluators. At the same time, many participants note that they to some degree bring their personal experiences into the matching process. Our aim with the Practical Prioritization technique was to investigate whether matching techniques may be modeled over how usability problems are worked with in practice: the answer seems affirmative. Future studies could further investigate this relation. The model of Lavery et al. showed relatively low specific agreement among participants. One reason seems to be that interpreting problems in terms of causes, breakdowns, outcomes, and redesigns is hard. Though our problem reporting format separates some of these aspects, a more explicit structuring using these terms might facilitate matching. The User Action Framework worked well, and did not, as we had expected, produce a lot of single problems. Compared to the other techniques, participants report more difficulties performing the matching, but this could be an effect of learning a new and relatively complex framework. However, the UAF helps participants achieve a quite thorough understanding of the problems.

An important question is what to do about our somewhat disturbing general result about the reliability and validity of matching. First of all, we think merely being aware of this state of affairs is a step forward. If usability research studies take more care in describing and conducting matchings, more correct and more reliable results may be found. It is conceivable that matching in groups would be less variable (five participants suggested this as a good idea). Possibly a combination of matching techniques may be useful, as different techniques will illuminate different, important aspects of usability problems and their similarity.

A few comments on the experiment and its limitations are necessary. First, as mentioned, neither the User Action Framework nor the model of Lavery et al. in itself provided sufficient instruction for matching. As described earlier, we created the instructions for how to match using those models. However, it is not clear if the problems seen are due to our attempt to make a straightforward matching technique out of the two models, or due to an inherent difficulty in the models. Second, a possible objection to our experiment is that novice evaluators may not be thorough enough or that they should not be expected to learn the technique well just by reading. We have no data to suggest that our students match differently from usability professionals. In particular, the matching process as we investigated it here most often takes place in research studies, and rarely explicitly in practical usability work (Nørgaard and Hornbæk, 2006). Thus our findings need validation with users more representative of the people who typically undertake matching. Independently of such verification, our study has immediate implications for those studies in the research literature that use novice evaluators. Third, we believe that we have controlled some possible validity concerns by closely controlling the participants' experience with the evaluation and the problem set they matched on. Fourth, we used one problem from each participant to form the set of problems being matched on.
Although this ensured that every participant had one of his or her problems being matched on, the variability of the problem descriptions is higher than if we had used problems from a smaller number of participants. Further work should show whether and how this affects our results; one hypothesis could be that more similar problems might diminish the difference between matching techniques. Finally, we did not include a baseline condition where matching was done without instructions, but such a condition could form part of a follow-up experiment that might further help characterize the influence of matching techniques.


6. Conclusion

Matching of descriptions of usability problems is a seldom discussed issue in the literature on usability research. Nevertheless, matching is highly important because it underlies most conclusions about the relative effectiveness of evaluation methods, and about important phenomena in evaluation such as the evaluator effect. This study shows how different approaches to matching make novice usability evaluators create matchings of usability problems that differ substantially. One of the few matching techniques that is explicitly mentioned and used in the literature, namely matching by similarity of the changes necessary to alleviate the problem, produces by far the lowest agreement among evaluators. Our study indicates that research findings based on matching of problems can be strongly affected by the matching technique applied. As this is typically not accounted for, the reliability and the validity of many such findings are challenged.

Acknowledgements

We wish to thank Jon Howarth and Rex Hartson for giving us access to the User Action Framework, Rolf Molich for sharing his description of matching, Gilbert Cockton for supporting this work in many ways, and Mie Nørgaard for running the think aloud tests.

References

Andre, T.S., Hartson, H.R., Belz, S.M., McCreary, F.A., 2001. The user action framework: a reliable foundation for usability engineering support tools. International Journal of Human–Computer Studies 54 (1), 107–136.
Beyer, H., Holtzblatt, K. (Eds.), 1997. Contextual Design: A Customer-Centered Approach to Systems Designs. Morgan Kaufmann.
Cockton, G., Lavery, D., Woolrych, A., 2007. Inspection-based evaluations. In: Sears, A., Jacko, J.A. (Eds.), The Human–Computer Interaction Handbook, Second Edition. Lawrence Erlbaum Associates, Mahwah, NJ, pp. 171–1190.
Cockton, G., Woolrych, A., Hall, L., Hindmarch, M., 2003. Changing analysts' tunes: the surprising impact of a new instrument for usability inspection method assessment. In: Proceedings of People and Computers XVII: Designing for Society. Springer, pp. 145–162.
Cohen, J., 1969. Statistical Power Analysis for the Behavioral Sciences. Academic Press, New York.
Connell, I., Blandford, A., Green, T., 2004. CASSM and cognitive walkthrough: usability issues with ticket vending machines. Behaviour and Information Technology 23 (5), 307–320.
Connell, I.W., Hammond, N.V., 1999. Comparing usability evaluation principles with heuristics: problem instances vs. problem types. In: Proceedings of IFIP TC.13 International Conference on Human–Computer Interaction, pp. 621–629.
Dumas, J., Redish, J., 1999. A Practical Guide to Usability Testing. Intellect.
Hartson, H.R., Andre, T.S., Williges, R.C., 2001. Criteria for evaluating usability evaluation methods. International Journal of Human–Computer Interaction 13 (4), 373–410.
Hertzum, M., Jacobsen, N.E., 2001. The evaluator effect: a chilling fact about usability evaluation methods. International Journal of Human–Computer Interaction 13, 421–443.
Jacobsen, N.E., Hertzum, M., John, B.E., 1998. The evaluator effect in usability tests. In: Proceedings of ACM CHI'98 Conference Summary. ACM Press, New York, NY, pp. 255–256.
John, B.E., Mashyna, M.M., 1997. Evaluating a multimedia authoring tool. Journal of the American Society of Information Science 48 (9), 1004–1022.
Keenan, S.L., Hartson, H.R., Kafura, D.G., Schulman, R.S., 1999. The usability problem taxonomy: a framework for classification and analysis. Empirical Software Engineering 4 (1), 71–104.
Kessner, M., Wood, J., Dillon, R.F., West, R.L., 2001. On the reliability of usability testing. In: Extended Abstracts of the ACM Conference on Human Factors in Computing Systems. ACM Press, New York, NY, pp. 97–98.
Lavery, D., Cockton, G., Atkinson, M.P., 1997. Comparison of evaluation methods using structured usability problem reports. Behaviour and Information Technology 16 (4/5), 246–266.
Mack, R.L., Montaniz, F., 1994. Observing, predicting, and analyzing usability problems. In: Nielsen, J., Mack, R.L. (Eds.), Usability Inspection Methods. John Wiley and Sons, pp. 295–339.
Mankoff, J., Dey, A.K., Hsieh, G., Kientz, J., Ames, M., Lederer, S., 2003. Heuristic evaluation of ambient displays. CHI Letters, CHI 2003, ACM Conference on Human Factors in Computing Systems 5 (1), 169–176.
Molich, R., Dumas, J., 2008. Comparative usability evaluation (CUE-4). Behaviour and Information Technology 27 (3), 263–281.
Molich, R., Ede, M.R., Kaasgaard, K., Karyukin, B., 2004. Comparative usability evaluation. Behaviour and Information Technology 23 (1), 65–74.


Nielsen, J., 1992. Finding usability problems through heuristic evaluation. In: Bauersfeld, P., Bennett, J., Lynch, G. (Eds.), Proceedings of ACM CHI'92 Conference on Human Factors in Computing Systems. ACM Press, New York, NY, pp. 373–380.
Norman, D.A., 1986. Cognitive engineering. In: User Centered System Design. Erlbaum, Hillsdale, NJ, pp. 31–61.
Nørgaard, M., Hornbæk, K., 2006. What do usability evaluators do in practice? An explorative study of think-aloud testing. In: ACM Symposium on Designing Interactive Systems (DIS 2006). ACM Press, New York, pp. 209–218.

Rand, W.M., 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850.
Rosenthal, R., Rosnow, R.L., 1991. Essentials of Behavioral Research: Methods and Data Analysis. McGraw-Hill, Boston, MA.
Skov, M., Stage, J., 2005. Supporting problem identification in usability evaluations. In: Proceedings of the Australian Computer–Human Interaction Conference 2005.
Sutcliffe, A., Ryan, M., Doubleday, A., Springett, M., 2000. Model mismatch analysis: towards a deeper explanation of users' usability problems. Behaviour and Information Technology 19 (1), 43–55.
Tversky, A., 1977. Features of similarity. Psychological Review 84 (4), 327–352.